Language Agents: From Reasoning to Acting — with Shunyu Yao of OpenAI, Harrison Chase of LangGraph
Chapters
0:00 Introductions
3:16 The ReAct paper
9:09 Early applications of ReAct in LangChain
14:15 Discussion of the Reflection paper
19:35 Tree of Thoughts paper and search algorithms in language models
24:21 SWE-Agent and SWE-Bench for coding benchmarks
36:21 CoALA: Cognitive Architectures for Language Agents
42:24 Agent-Computer Interfaces (ACI) and tool design for agents
46:24 Designing frameworks for agents vs humans
50:52 UX design for AI applications and agents
58:53 Data and model improvements for agent capabilities
76:10 TauBench
80:09 Promising areas for AI
This is Alessio, partner and CTO in residence 00:00:09.040 |
And I'm joined by my co-host Swyx, founder of Smol AI. 00:00:12.000 |
Hey, and today we have a super special episode. 00:00:14.440 |
I actually always wanted to take a selfie and go like, 00:00:19.440 |
the world of agents, because we have two of the most awesome 00:00:25.000 |
So first, we're going to welcome back Harrison Chase. 00:00:29.800 |
What's new with you recently in sort of like the 10, 00:00:34.120 |
LangChain, LangSmith, LangGraph, pushing on all of them. 00:00:37.440 |
Lots of cool stuff related to a lot of the stuff 00:00:40.160 |
that we're going to talk about today, probably. 00:00:53.400 |
Patriots aren't looking good, though, so that's-- 00:00:56.040 |
And then Shunyu, you've also been on the pod, 00:00:58.160 |
but only in like a sort of oral paper presentation capacity. 00:01:01.320 |
But welcome officially to the Latent Space pod. 00:01:08.920 |
You're one of like-- you're maybe the first PhD thesis defense 00:01:16.400 |
because most people just publish single papers. 00:01:22.680 |
Yeah, maybe we'll just kick it off with, you know, 00:01:25.960 |
what was your journey into using language models for agents? 00:01:28.520 |
I like that your thesis advisor, I didn't catch his name, 00:01:33.040 |
Yeah, it's like this guy just wanted to use language models, 00:01:35.800 |
and it was such a controversial pick at the time. 00:01:46.640 |
you're just composing all the GAN or 3D perception 00:01:49.920 |
or whatever together, and it's not exciting anymore. 00:01:53.040 |
And one day, I just see this transformer paper, 00:02:00.480 |
only when I entered my PhD and met my advisor Karthik. 00:02:04.640 |
So he was actually the second author of GPT-1 00:02:07.280 |
when he was like a visiting scientist at OpenAI. 00:02:15.680 |
and Ilya just said, Karthik, you should stay, 00:02:20.440 |
But apparently, Karthik is not fully convinced. 00:02:22.880 |
So he went to Princeton, started his professorship, 00:02:29.440 |
even though I have no prior knowledge in NLP. 00:02:32.120 |
And, you know, we just met for the first time, 00:02:34.240 |
and he's like, you know, what do you want to do? 00:02:40.960 |
I wonder if we can just redo them with language models. 00:02:52.040 |
And then I guess the first work of yours that I came across was ReAct. 00:02:59.000 |
But also, Harrison, when you came on the podcast last year, 00:03:01.240 |
you said that was one of the first papers that you saw 00:03:03.720 |
when you were getting inspired for Langchain. 00:03:05.160 |
So maybe give a recap of why you thought it was cool, 00:03:08.120 |
because you were already working in AI and machine learning. 00:03:10.840 |
And then, yeah, you can kind of like enter the paper formally. 00:03:14.360 |
But what was that interesting to you specifically? 00:03:16.360 |
Yeah, I mean, I think the interesting part was using these language models to 00:03:20.360 |
interact with the outside world in some form. 00:03:22.840 |
And I think in the paper, you mostly deal with Wikipedia, and I think there's some other 00:03:27.320 |
data sets as well, but the outside world is the outside world. 00:03:30.360 |
And so interacting with things that weren't present in the LLM and APIs and calling into 00:03:34.680 |
them and thinking about, and yeah, the ReAct reasoning and acting and kind of like combining 00:03:42.360 |
I had been playing around with LLMs, been talking with people who were playing around with LLMs. 00:03:46.200 |
People were trying to get LLMs to call into APIs, do things. 00:03:48.760 |
And it was always, how can they do it more reliably and better? 00:03:51.640 |
And so this paper was basically a step in that direction. 00:03:54.520 |
And I think really interesting and also really general as well. 00:03:58.200 |
Like, I think that's part of the appeal is just how general and simple in a good way, 00:04:04.920 |
So that it was really appealing for all those reasons. 00:04:10.920 |
Because I have one favorite part from your PhD defense, 00:04:13.560 |
which I didn't understand when I read the paper. 00:04:16.040 |
But you said something along the lines, ReAct doesn't change the outside or the environment, 00:04:20.840 |
but it does change the inside through the context, putting more things in the context. 00:04:24.360 |
You're not actually changing any of the tools around you to work for you, 00:04:30.840 |
And I think that was like a very profound thing when I, 00:04:33.320 |
now that I've been using these tools for like 18 months, I'm like, 00:04:36.840 |
But like to say that at the time you did the PhD defense was not trivial. 00:04:41.080 |
Another way to put it is like thinking can be an extra tool that's useful. 00:04:49.880 |
I think it's also more controversial within his world because everyone was trying to use RL for 00:04:56.840 |
And this is like the first kind of zero gradient type approach. 00:05:01.080 |
I think the bigger kind of historical context is that we have these two big branches of AI, right? 00:05:07.640 |
So if you think about RL, right, that's pretty much the equivalent of agents at the time. 00:05:13.080 |
And it's like agent is equivalent to reinforcement learning and reinforcement learning is equivalent to 00:05:17.640 |
whatever game environment they're using, right? 00:05:22.440 |
So you have like a pretty much, you know, you have a biased kind of like set of methodologies in terms 00:05:28.600 |
of reinforcement learning and represents agents. 00:05:31.400 |
On the other hand, I think NLP is like a historical kind of subject. 00:05:40.200 |
It's more about solving those concrete tasks. 00:05:43.160 |
And if you look at ACL, right, like each task has its own track, right? 00:05:49.320 |
So I think really it's about rethinking agents in terms of what could be the new environments 00:05:57.880 |
It's not just Atari games or whatever, video games, but also those text games or language games. 00:06:02.840 |
And also thinking about, could there be like a more general kind of methodology beyond just 00:06:07.960 |
designing specific pipelines for each NLP task? 00:06:11.240 |
That's like the bigger kind of context, I would say. 00:06:13.800 |
Is there an inspiration spark moment that you remember? 00:06:19.080 |
We had Tri Dao on the podcast and you mentioned he was really inspired working with like systems 00:06:26.600 |
Yeah, so actually before React, I spent the first two years of my PhD 00:06:31.160 |
focusing on text-based games or in other words, text adventure games. 00:06:36.200 |
It's a very kind of small kind of research area and quite ad hoc, I would say. 00:06:41.320 |
And there are like, I don't know, like 10 people working on that at the time. 00:06:45.560 |
And have you guys heard of Zork 1, for example? 00:06:49.880 |
So basically the idea is you have this game and they have text observations. 00:06:59.720 |
And you have actions like kill the grue with a sword or whatever, right? 00:07:04.120 |
And that's like a very typical setup of a text game. 00:07:07.080 |
So I think one day after, you know, I've seen all the GPT-3 stuff. 00:07:11.400 |
I just think, think about, you know, how can I solve the game? 00:07:15.320 |
Like why those AI, you know, machine learning methods are pretty stupid, but we are pretty 00:07:22.360 |
So for the context, the predominant method to solve this text game is obviously reinforcement 00:07:28.280 |
And the idea is you just do trial and error in those games for like millions of steps and you 00:07:35.400 |
But there's no language understanding at all. 00:07:37.320 |
And I'm like, why can I solve the game better? 00:07:40.360 |
And it's kind of like, because we think about the game, right? 00:07:44.920 |
Like when we see this very complex text observation, like you see a grue and you might see a sword, 00:07:51.720 |
you know, in the right of the room and you have to go through the wooden door to go to that room. 00:07:56.840 |
You will think, you know, oh, I have to kill the monster and to kill that monster, I have to get 00:08:00.280 |
the sword, and to get the sword, I have to go, right? 00:08:02.760 |
And this kind of thinking actually helps us kind of zero-shot the game. 00:08:06.520 |
And it's like, why don't we also enable the text agents to think? 00:08:13.000 |
And I think that's actually very interesting because the prototype, I think, was around 00:08:20.680 |
So that's even before like chain of thought or whatever came up. 00:08:23.480 |
So we did a bunch of experiments in the text game, but it was not really working that well. 00:08:31.640 |
Like if you use GPT-4 to solve it, it's still very hard. 00:08:34.280 |
So the change came when I started the internship at Google, and apparently Google cares less about 00:08:41.320 |
text game, they care more about what's more practical. 00:08:44.360 |
So pretty much I just reapplied the idea, but to more practical kind of environments like Wikipedia 00:08:50.120 |
or like simpler text games like ALFWorld, and it just worked. 00:08:55.720 |
It's kind of like you first have the idea and then you try to find the domains and the problems 00:09:01.000 |
to demonstrate the idea, which is, I would say, different from most of the AI research. 00:09:06.280 |
But it kind of worked out for me in that case. 00:09:09.320 |
For Harrison, when you were implementing ReAct, what were people applying ReAct to in the early days? 00:09:13.880 |
I think the first demo we did probably had like a calculator tool and a search tool. 00:09:17.800 |
So like general things, we tried to make it pretty easy to write your own tools and plug in your own 00:09:22.840 |
things. And so this is one of the things that we've seen in LangChain is people who build their own 00:09:27.400 |
applications generally write their own tools. Like there are a few common ones. I'd say like the 00:09:31.720 |
three common ones might be like a browser, a search tool and a code interpreter. But then other than that- 00:09:37.880 |
Yep. Yeah, exactly. It matches up very nice with that. 00:09:40.680 |
And we just, we actually just redid like our integrations docs page. And if you go to the tools 00:09:45.080 |
section, we like highlight those three. And then there's a bunch of like other ones. 00:09:48.280 |
And there's such a long tail of other ones, but in practice, like when people go to production, 00:09:52.200 |
they generally have their own tools or maybe one of those three, maybe some other ones, but like 00:09:56.200 |
very, very few other ones. So yeah, I think the first demo was, was a search and the calculator one. 00:10:04.920 |
Yeah. Oh, so there's that one. And then there's like the celebrity one 00:10:15.640 |
There's, I'm forgetting the name of the author, 00:10:17.320 |
I was like, we're going to over optimize for Olivia Wilde's boyfriend and it's going to 00:10:21.640 |
There's a few data sets kind of like in that vein that require multi-step kind of like reasoning 00:10:26.120 |
and thinking. So one of the questions I actually had for you in this vein, like the ReAct paper, 00:10:31.000 |
there's a thing, I think there's a few things in there, or at least when I think of that, 00:10:33.320 |
there's a few things that I think of. There's kind of like the specific prompting strategy. 00:10:37.000 |
Then there's like this general idea of kind of like thinking and then taking an action. And then 00:10:41.800 |
there's just even more general idea of just like taking actions in a loop. Today, like obviously language models have 00:10:47.080 |
changed a lot. We have tool calling, the specific prompting strategy probably isn't used super 00:10:51.400 |
heavily anymore. Would you say that like the concept of ReAct is still used though? Or like, do you think 00:10:58.200 |
that tool calling and running tool calling in a loop, is that ReAct in your mind? 00:11:02.920 |
I would say like it's like more implicitly used than explicitly used. To be fair, I think the contribution 00:11:10.600 |
of ReAct is actually twofold. So first is this idea of, you know, we should be able to use tool calls in a very 00:11:17.480 |
general way. Like there should be a single kind of general method to handle interaction with various 00:11:24.520 |
environments. I think ReAct is the first paper to demonstrate the idea. But then I think later, there's 00:11:30.600 |
Toolformer or whatever, and this became like a trivial idea. But I think at the time, that's like a pretty 00:11:36.200 |
non-trivial thing. And I think the second contribution is this idea of what people call 00:11:42.040 |
like inner monologue or thinking or reasoning or whatever to be paired with tool use. I think that's 00:11:48.280 |
still not trivial because if you look at the default function calling or whatever, like there's no inner 00:11:53.240 |
monologue. And in practice, that actually is important, especially if the tool that you use is pretty different 00:12:00.360 |
from the training distribution of the language model. So I think that's like, those are the two 00:12:09.800 |
Yeah. On that note, I think OpenAI even recommended like when you're doing tool calling, 00:12:14.200 |
it's sometimes helpful to put like a thought field in the tool along with all the actual required 00:12:18.600 |
arguments and then have that one first. So it fills out that first and then, and that's, they've shown 00:12:22.680 |
that that's yielded to kind of like better results. The reason I ask is just like the same concept is still 00:12:27.480 |
alive and I don't know whether to call it like a ReAct agent or not. Like I don't know what to call 00:12:31.640 |
it. Like I think of it as ReAct, like it's the same ideas that were in the paper, but it's obviously a 00:12:35.480 |
very different implementation at this point in time. And so I just don't know what to call it. 00:12:39.240 |
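A minimal sketch of the pattern being discussed here, assuming a generic chat-completions-style setup: tool calling run in a loop, with an explicit thought field placed first in each tool's arguments so the model reasons before it fills in the real parameters. The `call_model` stub, the tool names, and the schemas are hypothetical stand-ins, not any specific provider's API.

```python
import json

# Hypothetical tools; real applications would plug in a browser, search,
# code interpreter, or their own domain-specific tools.
TOOLS = {
    "search": lambda query: f"(search results for {query!r})",
    "calculator": lambda expression: str(eval(expression)),  # demo only; never eval untrusted input
}

# Each schema puts "thought" before the real arguments, so the model writes its
# inner monologue first (the ordering trick mentioned above).
TOOL_SCHEMAS = [
    {"name": "search", "parameters": {"thought": "string", "query": "string"}},
    {"name": "calculator", "parameters": {"thought": "string", "expression": "string"}},
]

def react_loop(task: str, call_model, max_steps: int = 10) -> str:
    """ReAct-style loop: think, act, observe, repeat until a final answer."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        # call_model is assumed to return {"final": "..."} when done, or
        # {"tool": name, "args": {...}} when it wants to act.
        step = call_model(messages, TOOL_SCHEMAS)
        if "final" in step:
            return step["final"]
        args = {k: v for k, v in step["args"].items() if k != "thought"}
        observation = TOOLS[step["tool"]](**args)
        # Both the model's step (including its thought) and the observation go
        # back into context: the environment is unchanged, the context grows.
        messages.append({"role": "assistant", "content": json.dumps(step)})
        messages.append({"role": "tool", "content": observation})
    return "Stopped after max_steps without a final answer."
```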
I feel like people will sometimes think more in terms of different tools, right? Because if you think 00:12:46.360 |
about a web agent versus, you know, like a function calling agent and calling a Python API, you would 00:12:51.880 |
think of them as very different. But in some sense, the methodology is the same. It depends on how you 00:12:57.240 |
view them, right? And I think people will tend to think more in terms of the environment and the tools 00:13:03.000 |
rather than the methodology. So, or in other words, I think the methodology is kind of trivial and simple. 00:13:08.280 |
So people will try to focus more on the different tools, but I think it's good to have like a single 00:13:13.960 |
underlying principle for all of those things. Yeah. 00:13:17.320 |
How do you see the surface of ReAct getting molded into the model? So function calling is a good 00:13:21.880 |
example of like, now the model does it. What about the thinking? Now, most models that you use kind of do 00:13:28.040 |
chain of thought on their own. They kind of produce steps. Do you think that more and more of this logic will 00:13:32.440 |
be in the model? Or do you think like the context window will still be the main driver of like reasoning 00:13:38.440 |
and thinking? I think it's already default, right? Like you do some chain of thought and you do some 00:13:44.440 |
tool call, like the cost of adding the chain of thought is kind of relatively low compared to other 00:13:50.040 |
things. So it's not hurting to do that. And I think it's already kind of common practice, I would say. 00:13:56.200 |
Is this a good place to bring in either Tree of Thoughts or Reflexion? Your pick. 00:14:00.920 |
Maybe Reflexion, like, to respect the time order, I would say. 00:14:04.760 |
Yeah. Any backstory as well, like, you know, the people involved, with Noah and like the Princeton 00:14:09.000 |
group, I think, you know, we talked about this offline, but people don't understand how these 00:14:12.920 |
research pieces come together and this ideation. Yeah. I think Reflexion is mostly Noah's work. 00:14:18.200 |
Like I'm more in like an advising kind of role. So the story is, I don't remember the time, but like one day we 00:14:24.200 |
just see this preprint that's like Reflexion, an autonomous agent with memory or whatever. And it's 00:14:31.320 |
kind of like an extension to ReAct, which uses this self-reflection. I'm like, oh, somehow it became very 00:14:38.040 |
And Noah reached out to me. It's like, do you want to collaborate on this and make this from like an 00:14:43.720 |
arXiv preprint to something more solid, you know, like a conference submission? I'm like, sure. 00:14:48.040 |
We started collaborating and we remain good friends today. And, uh, I think another interesting 00:14:54.360 |
backstory is like Noah was, I think, contacted by OpenAI at the time. It's like, this is pretty 00:14:59.320 |
cool. Do you want to just work at OpenAI? And I think Sierra also reached out at the same time. 00:15:04.200 |
It's like, this is pretty cool. Do you want to work at Sierra? And, and I think Noah chose, uh, Sierra, 00:15:09.880 |
but it's pretty cool because he was like, still like a second year undergrad and he's a very smart kid. 00:15:16.360 |
Based on one paper? Based on one paper. Yeah. 00:15:18.600 |
Oh my God. He's done some other research based on like programming language or chemistry or whatever, 00:15:23.800 |
but I think that's the paper that got the attention of OpenAI and Sierra, right? 00:15:27.880 |
Okay. For those who haven't gone too deep on it, the way that you presented the insight of ReAct, 00:15:32.120 |
like, can you do that also for Reflexion? Yeah. I think one way to think of Reflexion is that 00:15:37.880 |
the traditional idea of reinforcement learning is you have a scalar reward and then you, you somehow 00:15:43.240 |
back-propagate the signal of the scalar reward to the rest of your neural network through whatever 00:15:48.040 |
algorithm, like policy gradient or A2C or whatever. And if you think about the real life, you know, most 00:15:54.200 |
of the reward signal is not scalar. It's like your boss told you, you know, you should have done a better 00:15:59.960 |
job in this, but a good job on that or whatever. Right. It's not like a scalar reward, like 29 or 00:16:05.320 |
something. I think in general, humans do more, deal more with, you know, non-scalar reward, or you can 00:16:10.760 |
say language feedback, right? And the way that they deal with language feedback also have this kind of 00:16:15.720 |
back-propagation kind of process, right? Because you start from this, you did a good job on job B, 00:16:21.320 |
and then you reflect, you know, what you could have done differently to change it, to make it better. 00:16:26.520 |
And you kind of change your prompt, right? Basically, you change your prompt, how to do job A, 00:16:30.920 |
and how to do job B, and then you do the whole thing again. So it's really like a pipeline of 00:16:36.040 |
language where, in self-gradient descent, you have something like text reasoning to replace 00:16:41.480 |
those gradient descent algorithms. I think that's one way to think of reflection, yeah. 00:16:47.160 |
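A hedged sketch of the "gradient descent in language" idea Shunyu describes: run an attempt, get language feedback from an evaluator, turn it into a verbal reflection, and prepend that reflection to the next attempt. `call_model` and `evaluate` are hypothetical stand-ins (for coding, `evaluate` might run unit tests and return their output).

```python
def reflexion_loop(task: str, call_model, evaluate, max_trials: int = 3) -> str:
    """Retry loop where language feedback, not a scalar reward, drives the update."""
    reflections: list[str] = []
    attempt = ""
    for _ in range(max_trials):
        prompt = task
        if reflections:
            prompt += "\n\nLessons from earlier attempts:\n" + "\n".join(reflections)
        attempt = call_model(prompt)
        ok, feedback = evaluate(attempt)  # e.g. (False, "test_parse fails on empty input")
        if ok:
            return attempt
        # The "verbal gradient": ask the model what to change, and carry that text
        # forward instead of back-propagating a number.
        reflections.append(call_model(
            f"Task: {task}\nAttempt: {attempt}\nFeedback: {feedback}\n"
            "In one or two sentences, what should be done differently next time?"
        ))
    return attempt  # best effort after max_trials
```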
One question I have about reflection is, how general do you think the algorithm 00:16:51.640 |
there is? And so for context, I think at Langchain and at other places as well, we found it, like we and 00:16:56.920 |
others found it pretty easy to implement, kind of like ReAct in a standard way. You plug in any tools, 00:17:00.840 |
and it kind of works off the shelf, you know, can get it up and running. I don't think we have like 00:17:05.720 |
an off-the-shelf kind of like implementation of reflection in kind of like the general sense. 00:17:09.800 |
I think the concepts like absolutely we see used in different kind of like specific cognitive 00:17:14.600 |
architectures, but I don't think we have one that comes off the shelf. I don't think any of the other 00:17:18.920 |
frameworks have one that comes off the shelf. And I'm curious whether that's because it's not general 00:17:24.280 |
enough, or it's complex as well, because it also requires running it more times. Maybe that's not feasible. 00:17:29.800 |
Like, I'm curious how you think about the generality and complexity. Why? Yeah, should we have 00:17:35.880 |
I think the algorithm is general in the sense that it's just as general as like other algorithms, 00:17:41.720 |
if you think about like policy gradient or whatever, but it's not applicable to all tasks, 00:17:45.800 |
just like other algorithms, right? So you can argue PPO is also general, but it works better for one set 00:17:51.560 |
of tasks and not on another set of tasks. I think it's the same situation for reflection. And I think a key 00:17:57.160 |
bottleneck is the evaluator, right? Basically, you need to have a good sense of the signal. 00:18:02.040 |
So for example, like if you're trying to do a very hard reasoning task, say mathematics, 00:18:07.320 |
for example, and you don't have any tools, right? It's operating in this chain of thought setup. Then 00:18:12.280 |
reflection will be pretty hard because in order to reflect upon your thoughts, you have to have a very 00:18:17.720 |
good evaluator to judge whether your thought is good or not. But that might be as hard as solving the 00:18:23.720 |
problem itself or even harder. The principle of self-reflection is probably more applicable if you have a good 00:18:28.680 |
evaluator, for example, in the case of coding, right? Like if you have those errors, then you can 00:18:33.400 |
just reflect on that and how to solve the bug and stuff. So I think another criteria is that it 00:18:40.840 |
depends on the application, right? Like if you have this latency or whatever need for like an actual 00:18:46.360 |
application with an end user, right? And the user wouldn't let you, you know, do like two hours of tree 00:18:51.240 |
of thought or reflection, right? You need something as soon as possible. So in that case, maybe this is 00:18:56.520 |
better to be used as like a training time technique, right? You do those reflection or tree of thought 00:19:01.800 |
or whatever, you get a lot of data and then you try to use the data to train your model better. And then 00:19:06.600 |
at test time, you still use something as simple as ReAct, but that's already improved. 00:19:10.600 |
And if you think of the Voyager paper as like a way to store skills and then reuse them, like how would you 00:19:16.440 |
compare like this like reflective memory and like at what point it's just like doing RAG on the memory 00:19:22.520 |
versus like you want to start to fine tune some of them or like what's the next step once you get a very 00:19:26.840 |
long kind of like a reflective corpus? Yeah. So I think there are two questions here. The first question is 00:19:33.080 |
what type of information or memory are you considering, right? Is it like semantic memory that stores, you know, 00:19:40.360 |
knowledge about the world or is it the episodic memory that stores, you know, trajectories or behaviors or is it 00:19:46.680 |
like more of a procedural memory? Like in Voyager's case, like skills or code snippets that you can use 00:19:52.680 |
to do actions, right? That's that's one dimension. And the second dimension is obviously how you use the 00:19:58.120 |
memory, either retrieving from it, using it in the context or or fine tuning it. I think the cognitive 00:20:05.080 |
architecture for language agents paper have a good kind of categorization of all the different combinations. 00:20:10.600 |
And of course, what which way you use depends on the concrete application and the concrete need and the concrete 00:20:17.240 |
task. But I think in general, it's good to think of those like systematic dimensions and all the possible like 00:20:25.240 |
Harrison also has in LangMem. I think you did a presentation in my meetup and I think you've done 00:20:30.680 |
it at a couple other venues as well. User state, semantic memory and append only state. I think kind 00:20:36.040 |
of maps to what you just said. What is LangMem? Can I give it like a quick... 00:20:39.720 |
One of the modules of LangChain for a long time has been something around memory. And I think like, 00:20:43.720 |
you know, we're still obviously figuring out what that means as is everyone kind of in the space. But 00:20:48.440 |
one of the experiments that we did and one of the proof of concepts that we did was, 00:20:51.880 |
technically what it was is you would basically create threads. You'd push messages to those 00:20:56.840 |
threads in the background. We process the data in a few ways. One, we like put it into some semantic 00:21:01.880 |
store. That's the semantic memory. And then two, we do some like extraction and reasoning over the 00:21:07.240 |
memories to extract. And we let the user define this, but like extract key facts or anything that's 00:21:12.600 |
of interest to the user. Those aren't exactly trajectories. They're maybe more closer to the, 00:21:17.640 |
to the procedural memory. Is that how you'd think about it or classify it? Or is it like about like 00:21:23.080 |
knowledge about the world or is it more like how to do something? It's reflections basically. So in 00:21:29.480 |
generative worlds, generative agents, generative smallville. Yeah. The smallville one. So the way 00:21:33.800 |
that they had their memory there was they had the sequence of events and that's kind of like the raw 00:21:37.880 |
events that happened. But then every N events, it did like run some synthesis over those events 00:21:43.080 |
to, for the, for the LLM to insert its own memory basically. And it's, it's that type of memory. I 00:21:48.760 |
don't know how that would be classified. I think of that more of the semantic memory, but to be fair, 00:21:53.080 |
I think it's just one way to think of that. But whether it's semantic memory or procedural memory or whatever 00:21:58.360 |
memory, that's like an abstraction layer. But in terms of implementation, you can choose whatever 00:22:03.880 |
implementation for whatever memory. So they're totally kind of orthogonal. I think it's more of a good way to 00:22:09.160 |
think of the things because like from the history of cognitive science and you know, cognitive 00:22:14.120 |
architecture and how people study even neuroscience, right? That's the way people think of how human brain 00:22:19.720 |
organizes memory. And I think it's more useful as a way to think of things, but it's not like for semantic 00:22:25.480 |
memory, you have to do this kind of like way to retrieve or fine-tune. And for procedural memory, you have to do that. 00:22:30.920 |
Like, I think those are totally orthogonal kind of dimensions. 00:22:34.520 |
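An illustrative sketch of the point above (not LangMem's or CoALA's actual API): the memory kind (semantic, episodic, procedural) is just a label on the record, while how you use it (retrieve into context, or export for fine-tuning) is an orthogonal choice. The keyword matching stands in for whatever retrieval a real store would use.

```python
from dataclasses import dataclass

@dataclass
class MemoryRecord:
    kind: str      # "semantic" (facts), "episodic" (trajectories), "procedural" (skills/code)
    content: str

class MemoryStore:
    def __init__(self) -> None:
        self.records: list[MemoryRecord] = []

    def add(self, kind: str, content: str) -> None:
        self.records.append(MemoryRecord(kind, content))

    def retrieve(self, query: str, kind: str | None = None, k: int = 5) -> list[str]:
        # Toy keyword match; a real store would use embeddings. The retrieval
        # mechanism is the same regardless of which kind of memory it is.
        hits = [r.content for r in self.records
                if (kind is None or r.kind == kind) and query.lower() in r.content.lower()]
        return hits[:k]

    def export_for_finetuning(self, kind: str) -> list[str]:
        # The other usage mode: collect memories as training data instead of context.
        return [r.content for r in self.records if r.kind == kind]

store = MemoryStore()
store.add("semantic", "The user prefers answers as bullet points.")
store.add("episodic", "Trajectory: searched docs, edited config.py, tests passed.")
store.add("procedural", "skill open_door: issue the action 'open wooden door'")
print(store.retrieve("door", kind="procedural"))
```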
How much background do you have in kind of like cognitive sciences and how much do you model some 00:22:39.480 |
That's a great question actually. And, uh, I think one of the undergrad kind of influence for my like 00:22:47.400 |
follow-up research is I was doing like an internship at MIT's computational cognitive science lab with like, 00:22:53.960 |
you know, Josh Tenenbaum, and he's like a very famous cognitive scientist. And I think a lot of, 00:22:59.880 |
a lot of his ideas still influence me today. Like, uh, thinking of cognition in like computational terms and 00:23:06.200 |
getting interested in language and a lot of stuff, you know, or even like developmental psychology kind of stuff, 00:23:12.920 |
As a developer that tried out LangMem, the way I view it is just, it's a materialized view of a stream 00:23:19.400 |
of logs. And if anything, that's just useful for context compression. I don't have to use the full 00:23:23.560 |
context to run it over everything, but also it's kind of debuggable. Like if it's wrong, I can show it 00:23:27.880 |
to the user, user can manually fix it and I can carry on. 00:23:30.120 |
That's a really good analogy. Yeah, I like that. I'm going to steal that. 00:23:33.080 |
Please, please. You know I'm bullish on memory databases. Um, I guess, Tree of Thoughts? 00:23:37.720 |
Um, yeah, Tree of Thoughts. I mean, you had a... 00:23:40.200 |
I feel like I'm reliving the defense again in like a podcast format. 00:23:44.680 |
Yeah, no, I mean, it was a, you had a banger. Well, this is the one where you're already successful 00:23:49.000 |
and would just like, you know, highlight the glory. It was really good. You mentioned that since thinking 00:23:54.440 |
is kind of like taking an action, you can use like action searching algorithms to think of thinking. 00:23:59.320 |
So just like you will use Tree Search to like find the next thing. And the idea behind Tree of Thoughts 00:24:04.280 |
is like you generate all these possible outcomes and then find the best tree to get to the end. 00:24:09.240 |
Maybe back to the latency question. You can't really do that if you have to respond in real time. So 00:24:13.720 |
what are maybe some of the most helpful use cases for things like this? Where have you seen people 00:24:17.640 |
adopt it where the high latency is actually worth the wait? 00:24:20.840 |
For things that you don't care about latency, obviously, for example, if you're trying to do 00:24:25.480 |
math, right? If you're just trying to come up with a proof. But I feel like one type of task is 00:24:29.880 |
more about searching for a solution, right? You can try a hundred times, but if you find one solution, 00:24:35.480 |
that's good. Like for example, if you're finding a math proof or if you're finding a good code to solve 00:24:39.880 |
a problem or whatever. And I think another type of task is more like reacting, right? For example, 00:24:44.760 |
if you're doing customer service, you're like a web agent booking a ticket for like a end user, right? 00:24:49.960 |
Those are more like kind of reactive kind of tasks, right? You have to, or more real-time tasks, 00:24:54.040 |
right? You have to do things fast. They might be easy, but you have to do it reliably. And you care more 00:24:59.080 |
about like, can you solve 99% of the time out of a hundred, but for the type of search type of tasks, 00:25:05.640 |
then you care more about, you know, can I find one solution out of a hundred? So it's kind of symmetric and different. 00:25:11.080 |
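A minimal sketch of the Tree of Thoughts idea from earlier in this exchange: treat "emit a thought" as an action, propose several candidate thoughts per step, score partial chains with an evaluator, and keep only the best ones (a simple beam search over thoughts). `propose` and `score` are hypothetical model-backed callables, which is also where the extra latency and compute come from.

```python
def tree_of_thoughts(problem: str, propose, score,
                     steps: int = 3, branch: int = 4, beam: int = 2) -> str:
    """Beam search over chains of thoughts instead of greedy left-to-right decoding."""
    frontier: list[list[str]] = [[]]  # each entry is a partial chain of thoughts
    for _ in range(steps):
        candidates: list[tuple[float, list[str]]] = []
        for chain in frontier:
            for thought in propose(problem, chain, n=branch):  # n candidate next thoughts
                new_chain = chain + [thought]
                candidates.append((score(problem, new_chain), new_chain))
        # Keep the highest-scoring partial chains; prune the rest of the tree.
        candidates.sort(key=lambda c: c[0], reverse=True)
        frontier = [chain for _, chain in candidates[:beam]]
    return "\n".join(frontier[0]) if frontier else ""
```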
Do you have any data or like intuition from your user base? Like what's the split of 00:25:16.360 |
these type of use cases? Like how many people are doing more reactive things and how many people are 00:25:20.280 |
experimenting with like kind of deep, long search? 00:25:22.680 |
I would say like ReAct's probably like the most popular. I think there's aspects of reflection 00:25:27.800 |
that get used. Tree of thought probably like the least. So there's a great tweet from Jason Wei, 00:25:34.120 |
I think your now-colleague, and he was talking about like prompting strategies and how he thinks 00:25:38.520 |
about them. And I think like the four things that he kind of had was like, one, how easy is it to 00:25:43.960 |
implement? How much compute does it take? How many tasks does it kind of like solve? And how much does 00:25:49.240 |
it improve on those tasks? And, and I'd add a fifth, which is like, how likely is it to be relevant 00:25:53.720 |
when the next generation of models kind of come out? And I think if you look at kind of like those axes, 00:25:58.040 |
and then you look at like, uh, you know, ReAct, reflection, tree of thought, it tracks that the 00:26:04.040 |
ones that score better are used more. Like ReAct is pretty easy to implement; tree of thoughts, pretty 00:26:08.680 |
hard to implement. Um, right. Like the amount of compute. Yeah. A lot more for tree of thought, 00:26:14.360 |
the tasks and how much it improves. I don't have amazing visibility there, but I think like, 00:26:19.000 |
if we're comparing ReAct versus tree of thought, ReAct just dominates the first two axes so much 00:26:24.040 |
that my question around that was going to be like, how do you think about like these prompting 00:26:28.120 |
strategies, cognitive architectures, whatever you want to call them when you're, when you're thinking 00:26:31.400 |
of them, what are the axes that you're judging them on in your head when you're thinking whether 00:26:36.200 |
it's a good one or, or a less good one or. Right. Right. I think there is a difference between 00:26:41.800 |
like a prompting method versus like a research in the sense that like for research, you don't really 00:26:48.360 |
even care about does it actually work on practical tasks or does it help whatever. I think it's more 00:26:55.560 |
about like the idea or the, the principle, right? Like what is the direction that you're like unblocking 00:27:02.040 |
and whatever. And I think for the, for like an actual prompting method to solve like a concrete 00:27:07.480 |
problem, I would say like simplicity is very important because the simpler it is, the less 00:27:13.720 |
decision you have to make about it. And it's easier to design, it's easier to propagate and it's easier 00:27:18.120 |
to, to do stuff. So always try to be as simple as possible. And I think latency obviously is important. 00:27:25.000 |
Like if you can do things fast and you don't want to do things slow. And I think in terms of the actual 00:27:30.200 |
prompting method to use for a particular problem, I'm a, I think we should all be in the minimalist 00:27:36.200 |
kind of camp, right? You should try the minimum thing and see if it works and if it doesn't work and 00:27:41.720 |
there's absolute reason to add something, then you add something, right? Like there's an absolute reason 00:27:46.280 |
that you need some tool, then you should add the tool thing. If there's absolute reason to add 00:27:51.640 |
reflection or whatever, you should add that. Otherwise, if chain of thought can already solve 00:27:54.760 |
something, then you don't even need to use any of that. Yeah. Or if just better prompting can solve 00:27:58.600 |
it. Like, you know, you could add a reflection step or you could make your instructions a little bit 00:28:02.200 |
clearer and it's a lot easier to do that. I think another interesting thing is like, I personally have 00:28:07.160 |
never done those kind of like weird tricks. I think all the prompts that I write are kind of like just 00:28:12.120 |
talking to a human, right? It's like, I don't know, like, like I never say something like, 00:28:15.800 |
your grandma is dying and you have to solve it too. I mean, those are cool, but I feel like 00:28:20.920 |
we should all try to solve this in like a very intuitive way. Like, just like talking to your 00:28:25.400 |
co-worker and that, that should work 99% of the time. That's my personal take. Yeah. 00:28:29.720 |
The problem with how language models worked, at least in the sort of GPT-3 era, was that they're, 00:28:35.640 |
they over-optimized to some sets of tokens in sequence. So like when reading the Kojima et al. paper, 00:28:42.200 |
that was "let's think step by step", like he tried a bunch of them and they had wildly different results. 00:28:47.480 |
It should not be the case, but it is the case. And hopefully we're getting better there. 00:28:50.760 |
Yeah. I think it's also like a timing thing in the sense that if you think about this whole line of 00:28:55.960 |
language model, right? Like at the time it was just like a text generator. We don't have any idea how it's 00:29:01.240 |
going to be used. Right. And obviously at the time you will find all kinds of weird issues because 00:29:06.920 |
it's not trained to do any of that. Right. But then I think we have this loop where once we realize 00:29:12.520 |
chain of thought is important or agent is important or tool using is important. What we see is today's 00:29:17.240 |
language models are heavily optimized towards those things. So I think in some sense they become more 00:29:22.600 |
reliable and robust over those use cases. And, uh, you don't need to do as much prompt engineering 00:29:28.680 |
tricks anymore to, to solve those things. I feel like in some sense, I feel like prompt engineering 00:29:33.560 |
even is like a slightly negative word at the time because it refers to all those kind of weird tricks 00:29:38.440 |
that you have to apply. But I think we don't have to do that anymore. Like given today's progress, 00:29:42.840 |
you should just be able to talk to like a coworker and if you're clear and concrete and you know, 00:29:48.200 |
being reasonable, then it should do reasonable things for you. Yeah. Yeah. The way I put this is, 00:29:52.040 |
uh, you should not be a prompt engineer because it is the goal of the big labs to put you out of a job. 00:29:56.600 |
You should just be a good communicator. Like if you're a good communicator to human, 00:30:00.680 |
you should be a good communicator to them. And I think that's the key though, because oftentimes 00:30:04.600 |
people aren't good communicators to these language models and that is a very important skill and that's 00:30:08.920 |
still messing around with the prompt. And so it depends what you're talking about when you're 00:30:12.760 |
saying prompt engineer. But do you think it's like very correlated with like, are they like a good 00:30:17.640 |
communicator to humans? You know, it's like it may be, but I also think I would say on average, 00:30:22.120 |
people are probably worse at communicating with language models than to humans. That's for sure. 00:30:25.640 |
Right now, at least, because I think we're still figuring out how to do it. You kind of expect it 00:30:28.920 |
to be magical and there's probably some correlation, but I'd say there's also just like people are worse 00:30:34.200 |
at it right now than talking to you. We should, we should, uh, make it like a, you know, like an elementary 00:30:39.240 |
school class or whatever, like how to talk to. Uh, yeah. I'm very proud of that. Yeah. 00:30:44.040 |
Before we leave the topic of, uh, trees and searching, not specific about Q*, but there's 00:30:49.160 |
a lot of questions about MCTS and this combination of tree search and language models. And I just had to 00:30:56.280 |
get in a question there about how seriously should people take this? 00:30:58.920 |
Again, I think it depends on the tasks, right? So MCTS was magical for Go, but it's probably not as 00:31:06.520 |
magical for robotics, right? So I think right now, the problem is not even that we don't have good 00:31:12.520 |
methodologies. It's more about we don't have good tasks. It's also very interesting, right? Because 00:31:17.080 |
if you look at my citations, like obviously the most cited are ReAct, Reflexion, and Tree of Thoughts, 00:31:21.640 |
all those are methodologies. But I think like equally important, if not more important, 00:31:26.520 |
line of my work is like benchmarks and environments, right? Like WebShop or SWE-bench or whatever. 00:31:31.400 |
And I think in general, what people do in academia that I think is not good is they choose a very 00:31:38.760 |
simple task, like ALFRED, and then they apply overly complex methods to show they improved 2%. 00:31:45.000 |
I think like you should probably match, you know, the level of complexity of your task and your method. 00:31:53.000 |
Right. I feel like right now tasks are kind of far behind the methods in some sense, right? Because 00:31:59.000 |
we have some good test-time approaches, like whatever, ReAct or Reflexion and Tree of Thoughts, 00:32:03.640 |
and there are many, many more complicated test-time methods afterwards. But on the 00:32:09.800 |
benchmark side, we have made a lot of good progress this year, last year. But I think we still need more 00:32:15.720 |
progress towards that, like better coding benchmark, better web agent benchmark, better agent benchmark, 00:32:22.040 |
not even for web or code. I think in general, we need to catch up with, with tasks. 00:32:27.560 |
What are the biggest reasons in your mind why, why it lags behind? 00:32:31.000 |
I think incentive is one big reason. Like if you see, you know, all the method papers are cited like 00:32:38.360 |
a hundred times more than the task papers. And also making a good benchmark is actually quite hard. And 00:32:45.560 |
it's almost like a different set of skills in some sense, right? I feel like if you want to build a good 00:32:51.480 |
benchmark, you need to be like a good kind of product manager kind of mindset, right? You need to think about 00:32:56.680 |
why people should use your benchmark, why it's challenging, why it's useful. If you think about 00:33:01.160 |
like a PhD going into like a school, right? The prior skill that they're expected to have is more about, you know, 00:33:08.760 |
can they code this method and can they just run experiments and can solve that? I think building 00:33:13.560 |
a benchmark is not the typical prior skill that we have, but I think things are getting better. I think 00:33:19.560 |
more and more people are starting to build benchmarks and people are saying that it's like a way to get more 00:33:26.040 |
impact in some sense, right? Because like, if you have a really good benchmark, a lot of people are going to use it. 00:33:30.520 |
But if you have a super complicated test time method, like it's very hard for people to use it. 00:33:35.480 |
Are evaluation metrics also part of the reason, like for some of these tasks that we might want to ask these 00:33:41.000 |
agents or language models to do, is it hard to evaluate them, and so it's hard to get an automated benchmark? 00:33:46.600 |
Obviously with SWE-bench, you can, and with coding, it's, it's easier, but. 00:33:50.200 |
I think that's part of the, like the skill set thing that I mentioned, because I feel like it's like, 00:33:55.400 |
it's like a product manager because there are many dimensions and you need to strike a balance and 00:34:00.680 |
it's really hard, right? If you want to make something very easy to auto-grade, like automatically 00:34:06.760 |
gradable, like easy to grade or easy to evaluate, then you might lose some of the realness or 00:34:11.880 |
practicality. Or like it might be practical, but it might not be as scalable, right? For example, 00:34:18.440 |
if you think about text game, humans have pre-annotated all the rewards and all the language are 00:34:23.720 |
real. So it's pretty good on auto-gradable dimension and the practical dimension. If you 00:34:29.800 |
think about, you know, practical, like actual English being practical, but it's not scalable, 00:34:34.280 |
right? Like it takes like a year for like experts to, to, to build that game. So it's not really that 00:34:39.240 |
scalable. And I think part of the reason that SWE-bench is so popular now is it kind of hits the 00:34:44.200 |
balance between the three dimensions, right? Easy to evaluate and being actually practical and being 00:34:49.160 |
scalable. Like if I were to criticize some of my prior work, I think WebShop, like it's my initial 00:34:55.400 |
attempt to get into benchmark work. And I'm trying to do a good job striking the balance, but obviously 00:35:01.960 |
we make it auto-gradable and it's really scalable. But then I think the practicality is not as high as 00:35:08.520 |
actually just using GitHub issues, right? Because you're just creating those like synthetic tasks. 00:35:13.480 |
Are there other areas besides coding that jumped to mind as being really good for being auto-gradable? 00:35:22.760 |
Do you have thoughts on AlphaProof, the, the new DeepMind paper? 00:35:30.520 |
I think it's more of a, you know, it's more of like a confidence boost or like a, sometimes, you know, 00:35:38.280 |
the work is not even about, you know, the technical details or the methodology that it chooses or the, 00:35:44.840 |
the concrete results. I think it's more about a signal, right? 00:35:47.480 |
Yeah. Existence proof, like, yeah, yeah. It's like, it can be done. 00:35:50.520 |
This direction is exciting. It kind of encourages people to work more towards that direction. I think 00:35:55.400 |
it's more like a boost of confidence, I would say. Yeah. 00:35:59.320 |
So we're going to focus more on agents now. And, you know, we were a special, 00:36:03.960 |
all of us have a special interest in coding agents. I would consider Devin to be the sort of 00:36:08.920 |
biggest launch of the year as far as AI startups go. And you guys in the Princeton group worked on 00:36:16.040 |
SWE-agent alongside of SWE-bench. Tell us the story about SWE-agent. 00:36:19.400 |
Sure. So I think it's kind of like a trilogy. It's actually a series of three works now. So 00:36:25.960 |
actually the first work is called InterCode, but it's not as, it's not as famous, I know. And the 00:36:32.600 |
second work is called SWE-bench. And the third work is called SWE-agent. And I'm just really confused why 00:36:38.280 |
nobody's working on coding. You know, it's like a year ago. I mean, now everybody's working on coding, 00:36:44.520 |
obviously, but a year ago, like literally nobody was working on coding. I was really confused. And 00:36:50.120 |
the people that were working on coding are, you know, trying to solve HumanEval in like a seq2seq 00:36:55.800 |
way. There's no agent, there's no chain of thought, there's no anything. They're just, you 00:37:00.360 |
know, fine tuning the model and improve some points and whatever. Like I was really confused because 00:37:06.040 |
obviously coding is the best application for agents because it's auto-gradable. It's super important. 00:37:12.440 |
You can make everything like API or code action, right? So I was confused and I collaborated with 00:37:19.000 |
some of the students in Princeton and we have this work called InterCode. And the idea is, 00:37:23.400 |
first, if you care about coding, then you should solve coding in an interactive way, meaning more 00:37:28.520 |
like a Jupyter notebook kind of way than just writing a program and seeing if it fails or succeeds and stopping, 00:37:35.320 |
right? You should solve it in an interactive way. That's because that's exactly how humans solve it, 00:37:39.880 |
right? If I tell you to, you know, write a program like next token, next token, next token and stop and 00:37:47.000 |
never do any edits and you cannot really use any terminal or whatever tool, it doesn't make sense, 00:37:52.360 |
right? And that's the way people are solving coding at the time. Basically like sampling a program from a 00:37:58.440 |
language model without chain of thought, without tool call, without reflection, without anything. 00:38:02.280 |
So first point is we should solve coding in a very interactive way. And that's a very general 00:38:07.400 |
principle that applies for various coding benchmarks. But also I think you can make a lot of the agent 00:38:14.760 |
tasks kind of like interactive coding. If you have Python and you can call any package, then you can 00:38:20.920 |
literally also browse internet or do whatever you want, like control a robot or whatever. So that seems 00:38:26.920 |
to be a very general paradigm. But obviously I think a bottleneck is at the time we're still doing, you know, 00:38:32.840 |
very simple tasks like HumanEval or whatever coding benchmark people proposed. Like they were super hard in 00:38:37.640 |
2021, like 20%, but they're like 95% already in 2023. So obviously the next step is we need better benchmark. And 00:38:44.920 |
Carlos and John, who are the first authors of SWE-bench. I think they came up with this great 00:38:50.840 |
idea that we should just scrape GitHub and solve whatever human engineers are solving. And I think 00:38:56.520 |
it's actually pretty easy to come up with this idea. And I think in the first week, they already made a lot of 00:39:01.720 |
progress, like they scraped GitHub and they made all of it, but then there's a lot of pain for 00:39:07.240 |
infra work and whatever, you know, I think the idea is super easy, but the engineering is super hard. And I feel like 00:39:12.680 |
that's a very typical signal of a good work in the AI era now. 00:39:16.840 |
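A rough sketch of the interactive-coding setup described earlier in this answer, in the spirit of InterCode rather than its actual implementation: the agent alternates between proposing code, executing it, and reading the observation, instead of emitting one program and stopping. `call_model` is a hypothetical stand-in.

```python
import os
import subprocess
import tempfile

def run_python(code: str, timeout: int = 10) -> str:
    """Execute a snippet in a subprocess and return stdout+stderr as the observation."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(["python", path], capture_output=True, text=True, timeout=timeout)
        return (proc.stdout + proc.stderr).strip() or "(no output)"
    except subprocess.TimeoutExpired:
        return "(timed out)"
    finally:
        os.remove(path)

def interactive_coding(task: str, call_model, max_turns: int = 8) -> str:
    """Propose-execute-observe loop, rather than one-shot program sampling."""
    history = [f"Task: {task}"]
    for _ in range(max_turns):
        # call_model returns {"code": "..."} to run another snippet,
        # or {"submit": "..."} when it believes the task is solved.
        step = call_model("\n".join(history))
        if "submit" in step:
            return step["submit"]
        observation = run_python(step["code"])
        history += [f"Code:\n{step['code']}", f"Observation:\n{observation}"]
    return history[-1]
```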
I think also, I think the filtering was challenging because if you look at open source PRs, like a lot 00:39:23.560 |
of them are just like, you know, fixing typos. 00:39:25.800 |
I think it's challenging. And to be honest, we didn't do a perfect job at the time. So if you 00:39:30.120 |
look at the recent blog posts with OpenAI, like we improved the filtering so that, you know, it's more 00:39:36.520 |
so I think OpenAI was just like, look, this is a thing now we have to fix this. 00:39:40.120 |
Like these students just like, you know, rushed it. 00:39:47.240 |
Yeah. Was that tied to you joining OpenAI or like, was that just unrelated? 00:39:52.680 |
It's a coincidence for me, but it's a good coincidence. 00:39:55.800 |
There is a history of anytime a big lab adopts a benchmark, they fix it because, you know, 00:40:02.040 |
Yeah. So naturally, once we proposed SWE-bench, the next step is to solve it, right? 00:40:07.400 |
But I think the typical way you solve something now is you collect some training samples or you 00:40:12.520 |
design some complicated agent method, and then you try to solve it, right? 00:40:17.000 |
Either a super complicated prompt or you build a better model with more trained data. But I think 00:40:22.040 |
at the time we realized that even before those things, there's a fundamental problem with the 00:40:27.080 |
interface or the tool that you're supposed to use, right? Because that's like a ignored problem in 00:40:33.480 |
some sense, right? Like what your tool is or how that matters for your task. So what we found 00:40:40.200 |
concretely is that if you just use the text terminal off the shelf as a tool for those agents, 00:40:45.800 |
there's a lot of problems, right? For example, if you edit something, there's no feedback. 00:40:50.200 |
So you don't know whether your edit is good or not. And that makes the agent very confused 00:40:54.520 |
and makes a lot of mistakes. And there are a lot of like small problems, you would say. And 00:40:59.640 |
well, you can try to do prompt engineering and improve that, but it turns out to be actually very 00:41:05.480 |
hard. And we realized that the interface design is actually a very omitted kind of part of agent design. 00:41:12.520 |
So we did this SWE-agent work. And the key idea is just even before you talk about, you know, 00:41:17.240 |
what the agent is, you should talk about what the environment is and you should make sure that the 00:41:21.400 |
environment is actually friendly to whatever agent you're trying to apply, right? And that's the same 00:41:26.280 |
idea for humans, right? Like if I give you, like, a text terminal, it's good for some tasks like git pull or 00:41:33.240 |
whatever, right? But it's not good if you want to look at, you know, browser and whatever, right? 00:41:39.080 |
So also like, you know, browser is a good tool for some tasks, but it's not a good tool for other tasks. 00:41:44.680 |
We need to talk about how to design an interface in some sense where we should treat agents as our 00:41:49.880 |
customers, right? It's like when we treat humans as a customer, we design human computer interfaces, right? 00:41:56.360 |
We design those beautiful desktops or browser or whatever, so that it's very intuitive and easy for 00:42:02.600 |
humans to use. And this whole great subject of HCI is all about that. I think now the research idea 00:42:09.880 |
of SWE-agent is just we should treat agents as our customers and we should do like, you know, ACI. 00:42:16.280 |
So what are the tools that a SWE-agent should have or a coding agent in general? 00:42:24.360 |
For SWE-agent, it's like a modified text terminal, which kind of adapts to a lot of the patterns of 00:42:30.440 |
language models to make it easier for language model to use. For example, now for edit, instead of having 00:42:36.200 |
no feedback, it will actually have a feedback of, you know, actually here you introduced like a syntax 00:42:41.080 |
error and you should probably want to fix that, and there's an indentation error there. And that makes it super 00:42:45.880 |
easy for the model to actually do that. And there's other small things like how exactly you write 00:42:50.760 |
arguments, right? Like, do you want to write like a multi-line edit or do you want to write a single 00:42:55.720 |
line edit? I think it's more interesting to think about the way of the development process of ACI rather 00:43:02.280 |
than the actual ACI for like a concrete application, because I think the general paradigm is very similar 00:43:07.640 |
to HCI and psychology, right? Basically for how people develop HCI is they do behavior experiments on 00:43:14.600 |
humans, right? I do A/B test, right? Like which interface is actually better? And I do those 00:43:20.200 |
behavior experiments, kind of like psychology experiments to humans, and I change things. 00:43:24.440 |
And I think what's really interesting for me for this SWE-agent paper is we can probably do the 00:43:29.480 |
same thing for agents, right? We can do A/B test for those agents and do behavior tests. 00:43:33.720 |
And through the process, we not only invent better interfaces for those agents, that's the practical 00:43:39.000 |
value, but we also better understand agents. Just like when we do those A/B tests, we do those 00:43:44.280 |
HCI, we better understand humans. During those ACI experiments, we actually better understand agents. 00:43:50.680 |
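A hedged sketch of the interface idea above, not SWE-agent's actual edit command: an edit tool that, instead of silently succeeding, checks the result and feeds any syntax error straight back to the agent as its observation.

```python
import ast

def edit_file(path: str, start: int, end: int, replacement: str) -> str:
    """Replace lines start..end (1-indexed, inclusive) and lint before committing."""
    with open(path) as f:
        lines = f.readlines()
    new_lines = lines[:start - 1] + [replacement.rstrip("\n") + "\n"] + lines[end:]
    new_source = "".join(new_lines)
    try:
        ast.parse(new_source)  # cheap syntax check; a real tool might run a full linter
    except SyntaxError as e:
        # The edit is rejected and the agent is told exactly what went wrong.
        return (f"Edit NOT applied: it would introduce a syntax error at line {e.lineno}: "
                f"{e.msg}. Fix the replacement text and try again.")
    with open(path, "w") as f:
        f.write(new_source)
    snippet = "".join(new_lines[max(0, start - 3):start + 2])
    return f"Edited {path} lines {start}-{end}. The file still parses. Context:\n{snippet}"
```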
Besides kind of like that A/B testing, what are other kind of like processes that people 00:43:57.640 |
That's a great question. And I think SWE-agent is like an initial work. And what we do is 00:44:02.520 |
kind of the naive approach, right? You just try some interface and you see what's going 00:44:07.640 |
wrong and then you try to fix that. You do this kind of iterative fixing. But I think what's really 00:44:12.680 |
interesting is there will be a lot of future directions that's very promising if we can apply 00:44:17.880 |
some of the HCI principles more systematically into the interface design. I think that would be a very 00:44:25.680 |
You talked a lot about kind of like agent computer interfaces and interactions. What about like human 00:44:32.440 |
to agent kind of like UX patterns? I'm like, yeah, curious for any thoughts there that you might have. 00:44:38.440 |
That's a great question. And in some sense, I feel like prompt engineering is about 00:44:43.400 |
human agent interface. But I think there can be a lot of interesting research done about it. So prompting is 00:44:51.240 |
about how humans can better communicate with the agent. But I think there could be interesting 00:44:55.960 |
research on how agents can better communicate with humans, right? When to ask questions, 00:45:00.920 |
how to ask questions, like what's the frequency of asking questions. And I think those kind of stuff 00:45:07.240 |
Yeah. I think some of the most interesting stuff that I saw here was also related to coding with 00:45:11.240 |
Devin from Cognition. And they had the three or four different panels where you had like the chat, 00:45:16.200 |
the browser, the terminal, and I guess the code editor as well. 00:45:21.560 |
Yeah. I think they also did a good job on ACI. 00:45:24.680 |
I think that's the main learning I have from Devin. They cracked that. They actually, 00:45:28.280 |
there was no foundational planning breakthrough. The planner is like actually pretty simple, 00:45:34.920 |
I think making the tool good and reliable is probably like 90% of the whole agent. Once the tool is actually 00:45:41.720 |
good, then the agent design can be much, much simpler. On the other hand, if the tool is bad, 00:45:47.240 |
then no matter how much you put into the agent design planning or search or whatever, 00:45:52.840 |
Yeah. I'd argue the same, same with like context and instructions. Like, yeah, go hand in hand. 00:45:59.320 |
On the tool, how do you think about the tension of like, for both of you, I mean, you're building 00:46:03.880 |
a library. So even more for you, the tension between making now a language or a library that 00:46:09.080 |
is like easy for the agent to grasp and write versus one that is easy for like the human to grasp 00:46:15.160 |
and write, because you know, the trend is like more and more code gets written by the agent. So 00:46:18.840 |
why wouldn't you optimize the framework to be as easy as possible for the model versus for the person? 00:46:24.200 |
I think it's possible to design an interface that's both friendly to humans and agents. But what do you think? 00:46:29.160 |
We haven't thought about it from that perspective. Like, we're not trying to design 00:46:32.360 |
LangChain or LangGraph to be friendly for agents to write. 00:46:40.760 |
But I mean, I think we see this with like, I saw some paper that used TypeScript notation instead of 00:46:46.760 |
JSON notation for tool calling and it got a lot better performance. So it's definitely a thing. I haven't 00:46:51.960 |
really heard of anyone designing like a syntax or a language explicitly for agents, but there's probably something there. 00:46:58.600 |
I think function calling is a good example where it's like a good interface for both 00:47:03.160 |
human programmers and for agents, right? Like for developers, it's actually a very friendly 00:47:08.680 |
interface because it's very concrete and you don't have to do prompt engineering anymore. You can 00:47:12.840 |
be very systematic. And for models, it's also pretty good, right? Like it can use all the existing coding 00:47:18.600 |
content. So I think we need more of those kinds of designs. 00:47:21.320 |
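To make that concrete, here is a rough sketch (not from the episode) of the same hypothetical get_weather tool expressed two ways: as the JSON-schema style definition that function-calling APIs generally expect, and as the more compact TypeScript-style notation the paper mentioned above renders into the prompt. All names are illustrative.

```python
# A rough sketch of two ways a tool can be exposed to a model. The JSON-schema
# shape mirrors what common function-calling APIs expect; the TypeScript-style
# string is the alternative notation rendered directly into the prompt.

get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'Tokyo'"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}

# The same signature written in TypeScript-like notation as plain prompt text.
typescript_style = (
    "// Get the current weather for a city.\n"
    "function get_weather(city: string, unit?: 'celsius' | 'fahrenheit'): Weather;"
)
```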
I will mostly agree and then I'll slightly disagree in terms of this, which is like whether 00:47:26.440 |
designing for humans also overlaps with designing for AI. So Malte Ubl, who's the CTO of Vercel, 00:47:31.960 |
who is creating basically JavaScript's, you know, competitor to LangChain, they're observing that 00:47:37.160 |
basically like if the API is easy to understand for humans, it's actually much easier to understand for 00:47:41.800 |
LMs. For example, because the functions aren't overloaded. They don't behave differently under different 00:47:46.120 |
contexts. They do one thing and they always work the same way. It's easy for humans. It's easy for 00:47:51.000 |
LMs. And like that makes a lot of sense. And obviously adding types is another one. Like type 00:47:55.640 |
annotations only help give extra context, which is really great. So that's the agreement. And then a 00:48:00.120 |
disagreement is that I've actually, when I use structured output to do my chain of thought, I have 00:48:05.720 |
found that I change my field names to hint to the LLM of what the field is supposed to do. So instead of 00:48:12.840 |
saying topics, I'll say candidate topics. And that gives me a better result because the LLM was like, 00:48:17.960 |
"Ah, this is just a draft thing I can use for chain of thought." And instead of like summaries, 00:48:22.600 |
I'll say topic summaries to link the previous field to the current field. So like little stuff like that, 00:48:27.320 |
I find myself optimizing for the LLM where I as a human would never do that. 00:48:31.720 |
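As an illustration of that field-naming trick, a minimal sketch using Pydantic-style structured output; the schemas and field names are hypothetical, and only the renaming pattern is the point.

```python
from pydantic import BaseModel, Field

# Two equivalent schemas for a chain-of-thought style structured output.
# The second renames fields to hint at their role, the kind of LLM-only
# optimization described above; a human reader wouldn't need it.

class Summary(BaseModel):
    topics: list[str]
    summaries: list[str]

class SummaryForLLM(BaseModel):
    candidate_topics: list[str] = Field(
        description="Draft topics to consider; treat as scratch work."
    )
    topic_summaries: list[str] = Field(
        description="One summary per candidate topic above."
    )
```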
Interesting. It's kind of like the way you optimize the prompt, it might be different for humans and for 00:48:37.720 |
machines. You can have a common ground that's both clear for humans and agents, but to improve the human 00:48:43.880 |
performance versus improving the agent performance, they might move to different directions. 00:48:48.280 |
Yeah, move in different directions. There's a lot more use of metadata as well, like descriptions, 00:48:51.720 |
comments, code comments, annotations and stuff like that. Yeah. 00:48:56.040 |
I would argue that's just you communicating to the agent what it should do. And maybe you need 00:49:01.480 |
to communicate a little bit more than to humans because models aren't quite good enough yet. But 00:49:06.200 |
like, I don't think that's crazy. I don't think that's crazy. 00:49:09.000 |
I will bring this in because it just happened to me yesterday. I was at the cursor office. 00:49:12.600 |
They held their first user meetup and I was telling them about the LLMOS concept and why 00:49:19.560 |
basically every interface, every tool was being redesigned for AIs to use rather than humans. And 00:49:24.200 |
they're like, "Why? Can't we just use Bing and Google for LLM search? Why must I use EXA?" Or 00:49:30.440 |
what's the other one that you guys work with? Tavily. 00:49:33.000 |
Tavily. A web search API dedicated for LLMs. What's the difference to the Bing API? 00:49:38.120 |
Exactly. There weren't great APIs for search. Like the best one, like the one that we used 00:49:42.600 |
initially in LangChain was SerpAPI, which is like maybe illegal. I'm not sure. And like, you know, 00:49:51.160 |
and now they're like venture-backed companies. Shout out to DuckDuckGo, which is free. 00:49:55.320 |
Yes. Yes. Yeah. I do think there are some differences though. I think you want, 00:50:00.360 |
like, I think generally these APIs try to return small amounts of text information, clear legible 00:50:05.960 |
field. It's not a massive JSON blob. And I think that matters. I think like when you talk about 00:50:10.520 |
designing tools, it's not only the, it's the interface in the entirety, not only the inputs, 00:50:14.520 |
but also the outputs that really matter. And so I think they try to make the outputs. 00:50:18.120 |
They're doing ACI. Yeah, absolutely. Like there's a whole set of industries that 00:50:23.000 |
are just being redone for ACI. It's weird. And so my simple answer to them was like the error 00:50:29.960 |
messages. When you give error messages, they should be basically prompts for the LLM to take and then 00:50:35.560 |
self-correct. Then your error messages get more verbose actually than you normally would with a human. 00:50:39.960 |
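A hedged sketch of that error-messages-as-prompts idea; run_search and search_tool are hypothetical stand-ins, not a real API.

```python
# Instead of surfacing a terse exception, the tool returns a verbose,
# prompt-like message that tells the model what went wrong and how to retry.

def run_search(query: str) -> str:
    if len(query) > 200:
        raise ValueError("query too long")
    return f"results for {query!r}"

def search_tool(query: str) -> str:
    try:
        return run_search(query)
    except ValueError as e:
        # The "error message" is really a self-correction prompt for the agent.
        return (
            f"ERROR: {e}. The search API only accepts queries under 200 "
            "characters. Rewrite your query as a short keyword phrase that "
            "captures the main entities, then call search again."
        )
```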
stuff like that. Like a little, honestly, it's not that big. Again, like, is this worth a venture 00:50:44.840 |
backed industry? Unless you can tell us, but like, I think code interpreter, I think is a new thing. 00:50:51.720 |
I hope so. We invested in E2B. So I think that's a very interesting point. You're trying to optimize 00:50:56.440 |
to the extreme. Then obviously they're going to be different. For example, take the error very 00:51:00.520 |
seriously, right? The error for a language model, the longer the better. But for humans, 00:51:05.320 |
that will make them very nervous and very tired, right? But I guess the point is more like, 00:51:10.520 |
maybe we should try to find a co optimized common ground as much as possible. And then if we have 00:51:16.040 |
divergence, then we should try to diverge. But it's more philosophical now. But I think like, 00:51:20.600 |
part of it is like how you use it. So Google invented the page rank because ideally you only click 00:51:25.640 |
on one link, you know, like the top three should have the answer. But like with models, it's like, 00:51:29.240 |
well, you can get 20. So those searches are more like semantic grouping in a way. It's like, 00:51:34.600 |
for this query, I'll return you like 20, 30 things that are kind of good, you know? So it's less about 00:51:40.040 |
ranking and it's more about grouping. Another fundamental thing about ACI is the difference 00:51:45.800 |
between human and machine's kind of memory limit. Right. So I think what's really interesting about this 00:51:51.560 |
concept of ACI versus HCI is that through interfaces optimized for them, you can kind of understand some of the 00:51:57.320 |
fundamental characteristic differences of humans and machines, right? Why, you know, 00:52:02.600 |
if you look at find or whatever terminal command, you know, you can only look at one thing at a time, 00:52:08.120 |
or that's because we have a very small working memory. You can only deal with one thing at a time. 00:52:13.720 |
You can only look at one paragraph of text at the same time. So the interface for us is by design, 00:52:19.320 |
you know, a small piece of information, but more temporal steps. But for machines, that's, that should be the 00:52:25.000 |
opposite, right? You should just give them a hundred different results and they should just 00:52:28.600 |
decide in context what the most relevant stuff is, and trade off context for temporal steps. That's 00:52:34.200 |
actually also better for language models because like the cost is smaller or whatever. So it's 00:52:39.320 |
interesting to connect those interfaces to the fundamental kind of differences of those. 00:52:43.480 |
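To illustrate that trade of context for temporal steps, a toy sketch of an agent-facing search wrapper that returns many compact, legible records in one call instead of paginating; raw_search is a hypothetical backend.

```python
# Trading context for temporal steps: rather than paginating like a
# human-facing UI, the agent-facing search returns many small, legible
# records at once and lets the model pick what is relevant.

def raw_search(query: str) -> list[dict]:
    # Placeholder backend returning dummy hits.
    return [{"title": f"doc {i}", "url": f"https://example.com/{i}",
             "snippet": "..."} for i in range(100)]

def agent_search(query: str, k: int = 30) -> str:
    hits = raw_search(query)[:k]
    # Compact, field-per-line output instead of one giant JSON blob.
    return "\n".join(
        f"[{i}] {h['title']} | {h['url']} | {h['snippet']}"
        for i, h in enumerate(hits)
    )
```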
When you said earlier, you know, we should try to design these to maybe be similar as possible and 00:52:48.200 |
diverge if we need to. I actually don't have a problem with them diverging now and seeing venture 00:52:53.160 |
backed startups emerging now, because we are different from machines and AI, and it's just so early on, 00:53:00.200 |
like they may still look kind of similar and they may still be small differences, but it's still just so 00:53:05.000 |
early. And I think we'll only discover more ways that they differ. And so I'm totally fine with them kind of like 00:53:09.880 |
diverging early and optimizing for the... I agree. I think, I think it's more like, 00:53:13.880 |
you know, we should obviously try to optimize human interface just for humans. We're already doing 00:53:18.680 |
that for 50 years. We should optimize agent interface just for agents, but we might also 00:53:24.680 |
try to co-optimize both and see how far we can get. There's enough people to try all three directions. 00:53:30.600 |
Yeah. There's a thesis I sometimes push, which is the sour lesson as opposed to the bitter lesson, 00:53:35.240 |
which is, we're always inspired by human development, but actually AI develops along its own path. 00:53:40.280 |
Right. We need to understand better, you know, what are the fundamental differences between those 00:53:44.520 |
creatures. It's funny when really early on this pod, you were like, how much grounding do you have in 00:53:49.480 |
cognitive development and human brain stuff? And I'm like, maybe that doesn't matter. And actually, 00:53:54.840 |
so like I, in my original agent's blog post, I had a picture of the human brain and now it looks a lot 00:54:01.800 |
more like a CPU. The canonical picture of the LLMOS is kind of like a CPU with all the input and output 00:54:07.160 |
going into it. And I think that that's probably the more scalable system. 00:54:10.520 |
I think the problem with like a lot of like cognitive scientists, like is that... 00:54:15.800 |
They think, you know, the only way to solve intelligence is through the human way. And therefore, 00:54:20.840 |
they like have a lot of critics for whatever things that are not cognitive or human. 00:54:26.120 |
But I think a more useful way to use those knowledge is to think of that as just a reference 00:54:31.240 |
point. I don't think we should copy exactly what's going on with humans all the way, but I think it's 00:54:35.640 |
good to have a reference point, because this is a working example of how intelligence works. 00:54:40.360 |
And if you know all that knowledge and you compare them, I think that actually establishes more 00:54:45.960 |
interesting insights, as opposed to just copying it, or not copying it, or opposing it. 00:54:53.080 |
I feel like this is an unanswerable question, but I'll just put it out there anyway. 00:54:56.280 |
If we can answer this, I think it'd be worth a lot, which is, can we separate intelligence from knowledge? 00:55:03.240 |
And to have a little history background, I think that's really the key thesis at the beginning of AI. 00:55:10.360 |
If you think about Newell and Simon and all those symbolic AI people, right? 00:55:14.760 |
Basically, they're trying to create intelligence by writing down all the knowledge. 00:55:21.000 |
For example, they write like a checkers program. 00:55:24.920 |
Basically, how you would solve checkers, you write down all the knowledge and then implement that. 00:55:29.240 |
And I think the whole thesis of symbolic AI is we should just be able to write down all the knowledge and that would give us intelligence. 00:55:35.640 |
But that kind of fails. And I think really, I think a great like quote from Hinton is, 00:55:41.240 |
I think there are two approaches to intelligence. 00:55:44.200 |
One approach is let's deal with reasoning or thinking or knowledge, whatever you call that. 00:55:51.560 |
The other approach is let's deal with learning first. 00:55:54.360 |
And then let's worry about, you know, whatever knowledge or reasoning or thinking later. 00:55:58.360 |
And it turns out right now, at least like the second approach works and the first approach doesn't work. 00:56:04.520 |
And I think there might be something deep about it. 00:56:07.880 |
Partially, I think Apple Intelligence might change that. 00:56:12.520 |
If this year is the year of multi-modal models, next year is like on-device year. 00:56:16.440 |
And Apple Intelligence basically has hot-swappable capabilities, right? 00:56:20.120 |
Like they have like 50 LoRAs that they swap onto a base model that does different tasks. 00:56:25.720 |
And that's the first instance that we have of the separation of intelligence and knowledge. 00:56:31.640 |
And I think that's that's a really interesting approach. 00:56:38.760 |
It's like you can have the same model deployed to 10 million phones with 10 million contacts and see if... 00:56:44.200 |
For on-device deployment, I think it's super important. 00:56:46.120 |
Like if you can factor that out, like I actually have most of my problems with AI News when 00:56:51.880 |
the model thinks it knows more than it knows, because it combines knowledge and intelligence. 00:56:57.400 |
And I want it to only have the ability to parse the things I tell it. 00:57:02.200 |
I feel like it's more like memorization versus kind of just generalization in some sense. 00:57:08.120 |
You don't want it to know like facts like, you know, who is the president of the United States. 00:57:12.840 |
They should be able to just call internet and use a tool to solve it. 00:57:15.320 |
Yes, because otherwise it's not going to call the tool if it thinks it knows. 00:57:23.320 |
So if that's the case, I guess my point is I don't think it's possible to fully separate them 00:57:28.680 |
because like those kind of intelligence kind of emerges. 00:57:32.840 |
Like even for humans, you can't just operate in an intelligent mode without knowledge, right? 00:57:38.520 |
Throughout the years, you learn how to do things and what things are. 00:57:42.360 |
And it's very hard to separate those things, I would say. 00:57:48.520 |
As a meta strategy, I'm trying to keep as a stack ranked list of like, 00:57:52.680 |
what are the 10 most valuable questions in here? 00:57:55.160 |
You can think of knowledge as a cache of intelligence in some sense. 00:57:59.480 |
If you have like wikihow.com saying you should tie a shoelace using the following steps, 00:58:08.920 |
you can think of that piece of text as like a cache to intelligence, right? 00:58:13.240 |
I guess that's kind of like reflection anyway, right? 00:58:15.960 |
It's like you're storing these things as memory, 00:58:19.080 |
So without the knowledge, you wouldn't have the intelligence to do it better. 00:58:24.280 |
So we had Thomas Scialom from Meta to talk about Llama 3.1. 00:58:32.920 |
And he said it's going to be like really focused on agents. 00:58:35.320 |
I know you talked before about, you know, is next token prediction enough to get to like problem solving. 00:58:41.560 |
If you say you got the perfect environment, they got the terminal, they got everything. 00:58:46.360 |
And if you were to now move down to the model level and say, I need to make a model that is better 00:58:50.440 |
for like agentic workflow, where would you start? 00:58:54.520 |
I think it's data, because like changing architecture now is too hard and we don't have a good alternative yet. 00:59:00.680 |
I think it's mostly about data, and agent data is obviously hard because people just write down the end result. 00:59:08.040 |
They don't write down how they like step by step, how they do the thing on the internet. 00:59:11.800 |
So naturally it's easier for models to learn chain of thought than tool calls or whatever agent self-reflection stuff. 00:59:20.920 |
Like even if you do a search, you won't write down all the search processes on the internet. 00:59:26.840 |
And, uh, I think it's a great thing that Llama 4 is going to be more towards agents. 00:59:32.360 |
That means, I mean, that should mean a lot for a lot of people. 00:59:35.320 |
In terms of data, you think the right data looks like trajectories, basically, of the agent doing the task? 00:59:49.640 |
That's one of the not famous papers, I guess. 00:59:57.800 |
It's not famous. It's been rejected a couple of times. 01:00:07.480 |
Like you can try a lot of different agent methods, right? 01:00:09.800 |
React, chain of thought, reflection or whatever. 01:00:14.280 |
You just have very diverse data, like tasks, and you try very diverse agent methods and you filter all 01:00:20.280 |
the correct solutions and you train a model on all of that. 01:00:22.760 |
And then the benefit is that you should somehow learn, you know, how to use simpler methods for 01:00:28.040 |
simpler tasks and harder methods for harder tasks. 01:00:30.360 |
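A rough sketch of that data recipe, with every function a hypothetical stand-in: run several agent methods over diverse tasks, keep only trajectories that pass a correctness check, and train on what survives.

```python
import random

METHODS = ["direct", "chain_of_thought", "react", "reflexion"]

def run_agent(method: str, task: str) -> tuple[str, str]:
    """Return (trajectory, answer) for a task; placeholder implementation."""
    return f"<{method} trace for {task}>", random.choice(["42", "wrong"])

def is_correct(task: str, answer: str) -> bool:
    return answer == "42"  # stand-in for a real correctness checker

def collect_training_data(tasks: list[str]) -> list[dict]:
    # Try diverse methods on diverse tasks and keep only correct trajectories.
    data = []
    for task in tasks:
        for method in METHODS:
            trajectory, answer = run_agent(method, task)
            if is_correct(task, answer):
                data.append({"task": task, "method": method,
                             "trajectory": trajectory})
    return data
```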
I guess the problem is we don't have diverse, high quality tasks. 01:00:39.080 |
In school, that kind of pissed me off a little bit. 01:00:41.320 |
When you're doing like homework, like exercises for calculus, they give you the problem and the final answer, 01:00:47.560 |
But there's no way without the professor or the TA to get like the steps. 01:00:52.760 |
And so I feel like because of how schools are structured, we never wrote this thing down. 01:00:57.080 |
But I feel like if you went to every university and it's like write down step by step the solution 01:01:01.320 |
to every single problem in the set and make it available online, that's a start to make this dataset better. 01:01:06.600 |
I think it's also because, you know, it's, it might be hard for you to write down your chain of thought, 01:01:11.960 |
even when you're solving the same problem, because part of that is conscious, in language, but maybe part of it is not conscious. 01:01:21.400 |
So when I was working on the ReAct thing, I would tell my Google manager, right? 01:01:25.800 |
Like, you know, what we should do, we should just hire, you know, as many people as possible and 01:01:30.920 |
let them use Google and write down exactly what they think, what they search on the internet. 01:01:36.600 |
But I think it's not, not trivial to, to write down your thoughts. 01:01:39.960 |
Like if you're not trained to do that, if I tell you like, okay, write down what you're thinking 01:01:44.920 |
right now, it's actually not as trivial a task as you might imagine. 01:01:48.600 |
It might be more of a diffusion process than the autoregressive process. 01:01:53.080 |
But I think the problem is starting with the experts, you know, because there's so much like muscle 01:01:57.240 |
memory and what you do once you've done it for so long, that's why we need to like get everybody 01:02:01.960 |
to do it. And then you can see it like separate knowledge and intelligence. 01:02:05.720 |
The simplest way to achieve AGI is literally just, just record the reactions of every human 01:02:12.040 |
being and just put them together. You know, like, what have you thought about? 01:02:16.200 |
What have you done? Let's say on the computer, right? Imagine a thought experiment. Like you, 01:02:21.640 |
you write down literally everything you think about and everything you do on the computer and 01:02:25.640 |
you record them and you train on all the successful trajectories by some metric of success. I think that might just work. 01:02:32.200 |
My, my first work of fiction in like 10 years was exploring that idea of what if you recorded 01:02:38.200 |
everything and uploaded yourself? I'm pretty science-based like, you know, but probably the most like 01:02:42.600 |
spiritual woo-woo thing about me is I don't think that would lead to consciousness or AGI just because 01:02:47.240 |
like there's something in there, like there's a soul, you know, that is the unspeakable quality. 01:02:58.120 |
What do you think about the role of few-shot prompting for some of these like agent trajectories? 01:03:03.000 |
That was a big part of the original react paper, I think. And as we talk about showing your, your work 01:03:09.160 |
and how you think like. I feel like it's becoming less used than zero-shot prompting. What's your 01:03:14.760 |
observation? I'm pretty bullish on it, to be honest, for a few reasons. Like one, I think it can maybe 01:03:20.280 |
help for more complex things, but then also two, like it's a form of prompting and prompting is just 01:03:24.440 |
communicating with the model what you want it to do. And sometimes it's easier to just show the model 01:03:28.200 |
what you want it to do than write out detailed kind of like instructions. I think the practical reason 01:03:33.160 |
it has become less used is because the agent kind of scaffold become more complex or the tasks you're 01:03:38.920 |
trying to solve is becoming more complex. It's harder to annotate a few-shot examples, right? 01:03:43.880 |
Like in the chain-of-thought era, you just write down three lines of things. It's very easy to write 01:03:48.120 |
down a few-shot or whatever, but I feel like annotation difficulty has become harder. 01:03:53.720 |
I think also one of the reasons that I'm bullish on it is because I think it's a really good way to 01:03:57.240 |
achieve kind of like personalization. Like if you can collect this through feedback automatically, 01:04:01.160 |
you can then use that in the system at a user level or something like that. Again, 01:04:04.680 |
the issue with that is for more complex things it doesn't really work. 01:04:08.200 |
Probably more useful is like an automatic, you know, prompt, right? If you have some way to 01:04:13.160 |
retrieve examples and put them in, like an automatic pipeline to prompt. But I feel like if you're 01:04:17.640 |
a human, you're manually writing now, I feel like more people will try to use zero-shot. 01:04:22.760 |
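A minimal sketch of that automatic few-shot idea, with a trivial word-overlap similarity standing in for a real embedding search; the example bank and helper names are hypothetical.

```python
EXAMPLE_BANK = [
    {"input": "refund for order 123",
     "trajectory": "look up order -> check policy -> issue refund"},
    {"input": "change flight date",
     "trajectory": "find booking -> check fare rules -> rebook"},
]

def similarity(a: str, b: str) -> int:
    # Toy stand-in for embedding similarity.
    return len(set(a.lower().split()) & set(b.lower().split()))

def build_prompt(user_input: str, k: int = 1) -> str:
    # Retrieve the most similar past successful examples and splice them
    # into the prompt instead of hand-writing few-shot examples.
    examples = sorted(EXAMPLE_BANK,
                      key=lambda e: similarity(e["input"], user_input),
                      reverse=True)[:k]
    shots = "\n\n".join(
        f"Input: {e['input']}\nTrajectory: {e['trajectory']}" for e in examples
    )
    return f"{shots}\n\nInput: {user_input}\nTrajectory:"
```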
Yeah, but if you're doing a consumer product, you're probably not going to ask 01:04:26.280 |
user-facing people to write a prompt or something like that. But I think the thing that you brought 01:04:31.160 |
up is also really relevant here where you can collect feedback from a user, but it's usually at the top 01:04:36.120 |
level. And so then if you have three or four or five or however many LLM calls down below, how do you 01:04:42.120 |
disperse that feedback to those? And I don't have an answer for that. 01:04:45.320 |
There's another super popular paper that you authored called CoALA, Cognitive 01:04:50.120 |
Architectures for Language Agents. I'm not sure if it's super popular. 01:04:52.600 |
People speak highly of it here within my circles. So shout out to Charles Frye, who 01:04:58.600 |
told me about it. I think that was our most popular webinar. 01:05:00.920 |
I think Harrison promoted the paper a lot. Thanks to him. 01:05:06.520 |
I'll read what you wrote in here and then you can just kind of go take it wherever. 01:05:10.200 |
CoALA organizes agents along three key dimensions: their information storage, divided into working and 01:05:15.320 |
long-term memories, their action space divided into internal and external actions, and their 01:05:20.440 |
decision-making procedure, which is structured as an interactive loop with planning and execution. 01:05:24.360 |
By the way, I think your communication is very clear, so kudos on how you do these things. 01:05:28.360 |
take us through the sort of three components. And you also have this development diagram, 01:05:31.960 |
which I think is really cool. I think it's figure one on your paper for people reading along. 01:05:35.880 |
Normally people have input, LLM, output. Then they develop into language agents that takes an action 01:05:41.880 |
into an environment and has observations. And then they go into the CoALA architecture. 01:05:46.600 |
Shout out to my co-first author, Ted, who made figure one. He's like, you know, 01:05:52.440 |
if the figure is really good, you don't even need color. You just... 01:05:56.760 |
One of the motivations of CoALA is we're seeing those agents become really complicated. 01:06:01.400 |
I think my personal philosophy is to try to make things as simple as possible, but obviously this 01:06:06.120 |
field has become more complex as a whole, and it's very hard to understand what's going on. 01:06:10.360 |
And I think CoALA provides a very good way to understand things in terms of those three dimensions. 01:06:17.480 |
And I think they are pretty first principles, because I think this idea of memory is pretty first 01:06:23.560 |
principle if you think about where memory, where information is stored. And you can even think of 01:06:27.560 |
the weights of the neural network as some kind of long-term memory, because that's also part of the information 01:06:32.120 |
that is stored. I think a very first principle way of thinking of agents is pretty much just a neural 01:06:38.520 |
network plus the code to call and use the neural network. Obviously also maybe plus some vector store 01:06:45.960 |
or whatever other memory modules, right? And thinking through that, then you immediately realize 01:06:50.680 |
that the kind of long-term memory, or the persistent information, is first the neural network, 01:06:56.760 |
and second, the code associated with the agent that calls the neural network, and maybe also some other 01:07:02.680 |
vector stores. But then there's obviously another kind of storage of information that's shorter horizon, 01:07:09.400 |
right? Which is the context window or whatever episode that people are using. Like you're trying 01:07:15.240 |
to solve this task and information happens there. But once this task is solved, the information is gone, 01:07:19.880 |
right? So I think it's very systematic and first principle to think about where information is and 01:07:24.680 |
thinking, organizing them through categories and time horizon, right? So once you have those information 01:07:31.080 |
stores, then obviously for agent, the next thing is, what kind of action can you do? And that leads to the concept 01:07:37.400 |
of action space, right? And I think one of the fundamental difference between language agents and 01:07:41.960 |
the previous agents is that for traditional agents, if you think about Atari or video game, 01:07:47.000 |
they only have like a predefined action space by the environment. They only have external actions, 01:07:51.320 |
right? Because they don't have complicated memory or information and kind of devices to do internal 01:07:55.880 |
thinking. I think the contribution of React is just to point out that we can also have internal actions 01:08:00.760 |
called thinking. And obviously if you have long-term memory, then you also have retrieval or writing or 01:08:05.880 |
whatever. And then third, once you have those actions, which action should you do? That's the 01:08:10.600 |
problem of decision-making. And, uh, the three parts should just fully describe an agent. 01:08:10.600 |
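As a toy illustration of those three dimensions (not code from the paper), a skeleton agent with working and long-term memory, internal and external actions, and a plan-then-execute decision loop:

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    working: list[str] = field(default_factory=list)     # current episode only
    long_term: list[str] = field(default_factory=list)   # persists across tasks

@dataclass
class Agent:
    memory: Memory = field(default_factory=Memory)

    # Internal actions: reason, retrieve, write to memory.
    def think(self, thought: str) -> None:
        self.memory.working.append(f"thought: {thought}")

    def retrieve(self, query: str) -> list[str]:
        return [m for m in self.memory.long_term if query in m]

    # External action: act on the environment (stubbed here).
    def act(self, command: str) -> str:
        return f"observation for {command}"

    # Decision-making: an interactive loop of planning and execution.
    def run(self, task: str, max_steps: int = 3) -> None:
        for _ in range(max_steps):
            self.think(f"plan next step for {task}")
            observation = self.act("do-something")
            self.memory.working.append(observation)
        self.memory.long_term.append(f"solved: {task}")
```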
Does anything that you normally say about agents not fit in that framework? Because you also get 01:08:26.760 |
Um, I think it's very aligned. Um, if we think about a lot of the stuff we do, I'm just thinking out loud now, 01:08:33.640 |
but a lot of the stuff we do on agents now is through LangGraph. LangGraph we would view as 01:08:38.120 |
kind of like the code part of what defines some of these things. 01:08:41.800 |
Also defines part of the decision-making procedure. That's what I was thinking actually. Yeah. 01:08:45.640 |
Yeah. And actually one analogy that I like there is like some of the code and part of LangGraph, 01:08:51.960 |
and I'm actually curious what you think about this, but like, sometimes I say that like the LLMs aren't 01:08:56.520 |
great at planning yet. So we can help them plan by telling them how to plan and code, because that's very 01:09:00.600 |
explicit and that's a good way of communicating how they should plan and stuff like that. 01:09:05.080 |
What do you mean by that? Like giving them like a DFS algorithm or? 01:09:08.280 |
No, something much simpler. Like you could tell the agent in a prompt like, 01:09:12.200 |
"Hey, every time you do this, you need to also do this and make sure to check this." Or you could just 01:09:16.280 |
put those as explicit checks in kind of like the decision-making procedure or something like that. 01:09:21.320 |
And the more complex it gets, I think the more we see people encoding that in code. And another 01:09:26.440 |
way that I say this is like, all of life really is communication, right? And so you can do that 01:09:31.400 |
through prompts or you can do that through code. And code is great at communicating things. It really is. 01:09:41.960 |
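A small sketch of that contrast, prompt rule versus explicit check in the decision loop; llm and run_tests are hypothetical stand-ins.

```python
# The same constraint communicated two ways: stated in the prompt, and
# enforced as an explicit check in the decision-making procedure.

PROMPT_RULE = "Every time you edit a file, run the tests before replying."

def llm(prompt: str) -> dict:
    # Stand-in LLM: always proposes a file edit.
    return {"action": "edit_file", "args": {"path": "app.py"}}

def run_tests() -> bool:
    return True  # stand-in test runner

def step(state: dict) -> dict:
    decision = llm(PROMPT_RULE + "\n" + state["context"])
    if decision["action"] == "edit_file":
        # The rule enforced in code rather than trusted to the prompt.
        state["edited"] = True
        state["tests_passed"] = run_tests()
        if not state["tests_passed"]:
            state["context"] += "\nTests failed; fix before finishing."
    return state
```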
I think the biggest thing that we're thinking a lot about is just the memory component. And we touched on it a 01:09:46.760 |
little bit earlier in the, in the episode, but I think it's still like very unsolved. I think like 01:09:51.320 |
clearly semantic memory, episodic memory, or types of memory, I think, but like where the boundaries 01:09:56.920 |
are, is there, are there other types? How to think about that? I think that to me is maybe one of the 01:10:01.960 |
bigger unsolved things in terms of agents is just memory. Like what, what does memory even mean? 01:10:06.440 |
That's another top high value question. Is it a knowledge graph? 01:10:14.680 |
Yeah. If you're using a knowledge graph as a hammer to hit a nail, it's, it's, it's not that. 01:10:19.160 |
But I think, I think practically what we see is it's still so application specific, what relevant 01:10:25.560 |
memory is. And that also makes it really tough to answer generically, like what is memory? So like, 01:10:30.360 |
it could be a knowledge graph. It could also be like, I don't know, a list of instructions that you keep updating. 01:10:36.520 |
A meta point is I feel sometimes we underestimate some aspects where humans and agents are actually 01:10:42.600 |
similar, and sometimes we overestimate the differences. You know, I feel like, I mean, 01:10:47.000 |
one point that I think is shared by agents and humans is like, we all have very different types of 01:10:52.280 |
memories, right? Some people use Google doc. Some people use notion and some people use paper and pen. 01:10:57.080 |
Like you can argue those are different types of long term memories for people, right? And each person 01:11:02.840 |
develops their own way to maintain their long-term memory and diary or whatever. It's a very kind of 01:11:09.000 |
individual kind of thing. And I feel like for agents, probably there's no like single best solution. But 01:11:14.760 |
what we can do is we can create as many good tools as possible, like Google Doc or Notion equivalents 01:11:20.680 |
of agent memory. And we should just give the choice to the agent. Like, what do you want to use? And through 01:11:26.440 |
learning, they should be able to come up with their own way to use the long-term memory. 01:11:29.400 |
You know, or give the choice to the developer who's building the agents, because I think it also that 01:11:34.520 |
it might, it depends on the task. I think we want to control that one right now. 01:11:38.920 |
I would agree with that for sure, because I think you need that level of control. I use Linear for 01:11:43.080 |
planning for code. I don't use that for my grocery list, right? Like depending on what I'm trying to do, 01:11:47.560 |
I have different types of long form memory. Maybe if you tried, you would have a gorgeous kitchen. 01:11:52.200 |
Do you think our like tool making kind of progress is good or not good enough in terms of, you know, 01:11:58.360 |
we have all sorts of different memory stores or retrieval methods or whatever? On the memory front 01:12:04.280 |
in particular, I don't think it's very good. I think there's a lot to still be done. What do you think 01:12:08.120 |
are lacking? Yeah, you have a memory service. What's missing? The memory service we launched, I don't 01:12:13.080 |
think really found product market fit. I think like, I mean, I think there's a bunch of different 01:12:17.080 |
types of memory. I'll probably write a blog. I mean, I have a blog that I published at some point on 01:12:22.120 |
this, but I think like right off the bat, there's like procedural memory, which is like how you do 01:12:26.360 |
things. I think this is basically episodic memory, like trajectories of correct things. But there's also, 01:12:31.480 |
then I think a very different type is like personalization. Like I like Italian food. 01:12:40.680 |
It could be a, it depends if it's semantic over like raw events or over reflections over events. 01:12:45.720 |
Right. Again, semantic procedure, whatever, it's just like a categorization. What really 01:12:49.560 |
matters is the implementation, right? And so one of the things that 01:12:52.120 |
we'll probably have released by the time this podcast comes out is, right now in LangGraph, 01:12:56.200 |
LangGraph is very stateful. You define a state for your graph and basically a run of an agent operates 01:13:01.480 |
on a thread. It's very similar to threads in OpenAI's Assistants API, but you can define the state 01:13:06.360 |
however you want. You can define whatever keys, whatever values you want. Right now, 01:13:09.880 |
they're all persistent for a single thread. We're going to add the ability to persist that between 01:13:14.200 |
threads. So then if you basically want to scope a memory to a user ID or to an assistant or to an 01:13:20.040 |
organization, then you can do that. And practically what that means is you can write to that channel 01:13:25.080 |
whatever you want, and then that can be read in other threads. We're not making any kind of like 01:13:29.240 |
claims around what the shape of memory is, right? You can write kind of like what you want there. 01:13:33.400 |
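A deliberately generic sketch of that thread-versus-user scoping idea; this is not the actual LangGraph API, just two dictionaries standing in for per-thread state and a cross-thread memory channel keyed by user ID.

```python
from collections import defaultdict

thread_state: dict[str, dict] = defaultdict(dict)   # scoped to one conversation
user_memory: dict[str, dict] = defaultdict(dict)    # persists across threads

def write_memory(user_id: str, thread_id: str, key: str, value: str) -> None:
    thread_state[thread_id][key] = value
    user_memory[user_id][key] = value  # the cross-thread channel

def read_memory(user_id: str) -> dict:
    return dict(user_memory[user_id])

write_memory("user-1", "thread-a", "food_preference", "likes Italian food")
# A brand-new thread for the same user can still read it:
assert read_memory("user-1")["food_preference"] == "likes Italian food"
```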
I still think it's like so early on and we see people needing a lot of control over that. And so 01:13:38.760 |
I think this is our current best thought. This is what we're doing around memory at the moment. It's 01:13:43.160 |
basically extending the state to beyond a thread level. I feel like there's a trade-off between 01:13:47.800 |
you know, complexity and control, right? For example, like Notion is more complex than Google Docs, but 01:13:52.680 |
if you use it well, then it gives you more capability, right? And it's like different tool 01:13:57.560 |
might suit different applications or scenarios or whatever. Yeah. We should make more good tools, 01:14:03.320 |
I guess. My quick take is when I started writing about the AI engineer, this was kind of vaguely in 01:14:08.760 |
my head, but like this is basically the job. Everything outside the LLM is the AI engineer 01:14:13.800 |
that the researcher is not going to do. This basically maps to LLMOS. I would add in the 01:14:20.040 |
code interpreter, the browser and the other stuff, but yeah, this is mostly it. Yeah, those are the 01:14:25.800 |
I mean, those are the tools, yeah. Those are the external environment, which is a small box at the bottom. 01:14:30.040 |
So then having this like reasonable level of confidence, like I know what things are, 01:14:34.600 |
then I want to break it. I want to be like, okay, like what's coming up that's going to blindside me 01:14:38.520 |
completely. And it really is maybe like omni-model where like everything in, everything out. And like, 01:14:45.080 |
does that change anything? Like if you scale up models like a hundred times more, does that change 01:14:49.160 |
anything? That's actually a great, great question. I think that's actually the last paragraph of the 01:14:54.520 |
paper that's talking about this. I also got asked this question when I was interviewing with OpenAI. 01:14:59.880 |
Please tell us how to pass OpenAI interviews. 01:15:05.240 |
Is any of this still true if, you know, if you 100x everything, if we make the model much better? 01:15:11.160 |
My longer answer to this, you should just refer to the last paragraph of the paper, which is like a more 01:15:16.600 |
prepared, longer answer. I think the short answer is understanding is always better. It's like a way of 01:15:22.280 |
understanding things. Like the thought experiment that I write at the end of the paper is, imagine you have 01:15:27.800 |
GPT-10, which is really good. Like it doesn't even need a chain of thought, right? Just input, 01:15:32.520 |
output, just like that, right? It doesn't even need to do browsing or whatever, or maybe it still 01:15:37.560 |
needs some tools, but let's say like, it's really powerful. Like then I think even in that point, I think 01:15:43.400 |
something like CoALA is still useful if we want to do some neuroscience on GPT-10. It's like kind of doing 01:15:48.440 |
human neuroscience, right? Would the model actually be inspectable? 01:15:52.840 |
Yeah. Like you want to inspect: what is episodic memory? What is the decision-making module? What 01:15:56.360 |
is the, it's kind of like dissecting the human brain, right? And you need some kind of prior 01:16:00.600 |
kind of framework to help you do this kind of discovery. 01:16:05.000 |
Cool. Just one thing I want to highlight from your work, we don't have to go into it, 01:16:08.520 |
it's TauBench. Oh yeah, we should definitely cover this. 01:16:12.520 |
Yeah. I'm a big fan of simulative AI. We had a summer of simulative AI. 01:16:16.600 |
Another term we're trying to coin hasn't stuck, but I'm going to keep at it. 01:16:19.960 |
I'm really glad you covered my zero citation work. I'm really happy. 01:16:23.160 |
No, zero citation work. Now it's one, now it's one. First citation. 01:16:26.520 |
It's me, it's me right now. We just cited it here, so that counts. 01:16:31.160 |
It's like one citation. Does it show on Google? 01:16:32.840 |
We'll write a paper about this episode. One citation, one citation. 01:16:36.520 |
Let's go. Last time I checked, it's still zero. 01:16:41.960 |
This one was funny because you have agents interact with like an LLM-simulated person. So it's like actually 01:16:47.960 |
just another agent, right? So it's like agent simulating with other agents. This has always 01:16:52.920 |
been my thing with startups doing agents. I'm like, one day there's going to be training grounds for 01:16:58.760 |
companies to train agents that they hire. Actually, Singapore is the first country to build the cyber 01:17:03.880 |
range for cyber attack training. And I think you'll see more of that. So what was the inspiration there? 01:17:09.160 |
Most of these models are bad at it, which is great. You know, we have some room for improvement. I think the best 01:17:14.280 |
model is 4o at like 48% average. So there's a lot of room to go. Yeah. Any fun stories from there? 01:17:22.600 |
Yeah. First, I think shout out to Sierra, which is a very good startup, which was founded by 01:17:30.200 |
Bret Taylor and Clay Bavor. And Sierra is a startup doing conversational AI. So what they do is 01:17:37.240 |
they build agents for businesses. Like suppose you have a business and you have a customer service, 01:17:42.920 |
we want to automate that part. And then it becomes very interesting because it's very different from coding or 01:17:49.000 |
web agent or whatever people are doing, because it's more about how can you do simple things reliably. 01:17:54.680 |
It's not about, you know, can you sample a hundred times and you find one good math proof or killer 01:17:59.240 |
solution. It's more about you chat with a hundred different users on very simple things. Can you be 01:18:04.040 |
robust to solve like 99% of the time? Right. And then we find there's no really good benchmark around this. 01:18:11.640 |
So that's one thing. I guess another thing is obviously this kind of customer service kind 01:18:15.960 |
of domain. Previously, there are some benchmarks, but they all have their limitations. And I think 01:18:21.480 |
you want the task to be kind of hard and you want user simulation to be real. We don't have that until 01:18:28.120 |
LLMs. So datasets from 10 years ago either just have trajectories of conversations with humans, 01:18:34.440 |
or they have very fake kind of simulators. I think right now is a good opportunity to, if you really just 01:18:39.400 |
care about this task of customer service, then it's a good opportunity because now you have LLMs to 01:18:44.200 |
simulate humans. But I think a more general motivation is we don't have enough agent benchmarks 01:18:48.760 |
that target this kind of robustness, reliability kind of standpoint. It's more about, you know, 01:18:53.560 |
code or web. So this is a very good addition to the landscape. If you have a model that can simulate 01:19:00.200 |
the persona, like the user, the right way, shouldn't the model also be able to accomplish the task, 01:19:05.640 |
right? If it has the knowledge of like what the person will want, then it means... 01:19:08.840 |
This is a great question. I think it really stems from like asymmetry of information, right? Because 01:19:14.040 |
if you think about the customer service agent, it has information that you cannot, you cannot access, 01:19:18.440 |
right? Like the APIs it could call or, you know, the policies of internal company policy, whatever. 01:19:24.280 |
And what I think is very interesting for TauBench is like, it's kind of okay for the user to be kind of 01:19:29.720 |
stupid. So you can imagine like there are failure cases, right? But I think in our case, as long as the 01:19:35.880 |
user specified the need very clearly, then it's up to the agent to figure out, for example, what is the 01:19:42.040 |
second cheapest flight from this to that under that constraint, very complicated reasoning involved. Like 01:19:47.080 |
we shouldn't require users to be able to solve those things. They should just be able to clearly express 01:19:53.000 |
their need. But then if the task failed, then it's up to the agent. That makes the evaluation much easier. 01:20:00.440 |
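A minimal sketch of that setup: a simulated user states a clear need, the agent does the reasoning against data the user cannot see, and the episode is scored on the final state rather than on the conversation. The user_llm function and flight data are hypothetical stand-ins.

```python
USER_GOAL = "book the second cheapest flight from SFO to JFK"

FLIGHTS = [{"id": "UA-101", "price": 250},
           {"id": "UA-202", "price": 310},
           {"id": "UA-303", "price": 480}]

def user_llm(history: list[str]) -> str:
    # The simulated customer states the need clearly but does no reasoning.
    return f"Hi, I want to {USER_GOAL}."

def agent_turn(message: str, db: dict) -> str:
    # The agent does the reasoning: sort by price, pick the second cheapest,
    # and write the booking into the environment's database.
    choice = sorted(FLIGHTS, key=lambda f: f["price"])[1]
    db["booked"] = choice["id"]
    return f"Done, I booked {choice['id']} for ${choice['price']}."

def run_episode() -> bool:
    db = {"booked": None}
    history: list[str] = [user_llm([])]
    history.append(agent_turn(history[-1], db))
    # Evaluation looks at the resulting state, not the chat transcript.
    return db["booked"] == "UA-202"

print(run_episode())  # True if the agent satisfied the stated goal
```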
I have one last question for Harrison, actually. 01:20:04.920 |
I mean, there are a lot of questions around AI right now, but I feel like perhaps the biggest question is 01:20:12.360 |
application. Because if we have great application, we have super app, whatever, that keeps the whole thing 01:20:16.840 |
going, right? Obviously, we have problems with infra, with chip, with transformer, with whatever, 01:20:22.600 |
S4, a lot of stuff. But I do think the biggest question is application. And I'm curious, like, 01:20:28.200 |
from your perspective, like, is there any things that are actually already kind of working, but people 01:20:33.160 |
don't know enough? Or like, is there any like promising application that you're seeing so far? 01:20:37.880 |
Okay, so I think one big area where there's clearly been success is in customer support, 01:20:43.160 |
both companies doing that as a service, but also larger enterprises doing that and building that 01:20:48.040 |
functionality in inside, right? There's a bunch of people doing coding stuff. We've already talked 01:20:53.560 |
about that. I think that's a little bit, I wouldn't say that's a success yet, but there's a lot of 01:20:58.760 |
excitement and stuff there. One thing that I've seen more of recently, I guess the general category would 01:21:04.760 |
be like research style agents, specific things recently would be like, I've seen a few like AISDR 01:21:11.080 |
companies pop up, where they basically do some data enrichment, they get a company name, they go out, 01:21:16.440 |
find like, you know, funding. What is SDR? Sales Development Rep. It's an entry level job title in B2B SaaS. 01:21:22.760 |
Yeah. So, I don't know. The PhD mind cannot comprehend. 01:21:29.400 |
And so I'd classify that under the general area of kind of like researchy style agents, I think like 01:21:36.760 |
legal falls in this as well. I think legal is, yeah, they're a pretty good domain for this. 01:21:43.240 |
I wonder how good Harvey is doing. There was some debate, but they raised a lot of money. So who 01:21:49.480 |
knows? I'd say those are, those are a few of the categories that jump to mind. Like entry type kind 01:21:54.600 |
of research. On the topic of applications though, the thing that I think is most interesting in this 01:21:58.840 |
space right now is probably all the UXs around these apps and the different things besides chat that might 01:22:04.200 |
come out. I think two that I'm really interested in, one for the idea of this AISDR. I've seen a bunch of 01:22:10.600 |
them do it in kind of like a spreadsheet style view where you have like, you know, 10 different companies 01:22:16.280 |
or hundreds of different companies and five different attributes you want to run up and then each 01:22:20.040 |
cell is an agent. And I guess the good, the good thing about this is like, you can already use the 01:22:23.720 |
first couple of rows of the spreadsheet as a few-shot example or whatever. There's so many good things about it. Yeah. You can test it out on a few. It's a great way for humans to run things in batch; it's a great interface for that. 01:22:24.200 |
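A toy sketch of that spreadsheet pattern, where each (company, attribute) cell becomes one small agent run; research_agent is a hypothetical enrichment agent.

```python
COMPANIES = ["Acme Corp", "Globex", "Initech"]
ATTRIBUTES = ["last funding round", "employee count", "head of sales"]

def research_agent(company: str, attribute: str) -> str:
    # Placeholder for an agent that searches the web and extracts the answer.
    return f"<looked up {attribute} for {company}>"

def fill_sheet(companies: list[str], attributes: list[str]) -> dict:
    sheet = {}
    for company in companies:
        for attribute in attributes:
            # One agent run per cell; the first filled rows can double as
            # few-shot examples, and results are easy to review in batch.
            sheet[(company, attribute)] = research_agent(company, attribute)
    return sheet

print(fill_sheet(COMPANIES, ATTRIBUTES)[("Acme Corp", "employee count")])
```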
It's still kind of elusive to do this kind of like PhD-style research, but I think those kinds of entry-level research tasks, where it's more repetitive, should be automated. And then the other UX I'm really, really interested in is kind of like when you have agents running in the background, how can they, like ambient-style agents, how can they reach out to you? So I think like as an example of this, I have an email assistant, 01:22:54.180 |
um, that runs in the background, it triages all my emails and it tries to respond to them. And then when it needs my input, like, hey, do you want to do this podcast? It reaches out to me. It sends me a message. 01:23:03.180 |
Oh, you actually, Oh, you, you have it. It is live. 01:23:05.180 |
Yeah, yeah. I use it for all my emails. 01:23:17.180 |
So at this point, LangGraph for the orchestration, LangChain for the integrations with the different models. 01:23:23.180 |
I'm curious how the low code kind of direction is going right now. Are people, we talked about this. Oh, yeah. It's not low code. 01:23:32.180 |
No, no, no, no, no. People, people will tune in just for this. 01:23:35.180 |
Well, it actually, it actually has to do with UXs as well. So it probably comes back to this idea of, I think like what it means to build with AI is changing. Like I still really, really strongly believe that developers will be a core kind of like part of this, largely because we see like, you need a lot of control over these agents to get them to work reliably. But there's also very clearly components that you don't need to be a developer for, and prompting is kind of like the most obvious one. 01:23:59.180 |
With LangGraph, one of the things that we added recently was LangGraph Studio. So we called it kind of like an IDE for agents. You point it to your code file where you have your graph defined in code. It spins up a representation of the graph. You can interact with it there. You can test it out. We hooked it up to kind of like a persistence layer. So you can do time-travel stuff, which I think is another really cool UX that I first saw in Devin and was, yeah. 01:24:23.180 |
The UX for Devin in general, I think you said it, but that was the novel part. That was the best part. But to the low code, no code part, the way that I think about it is you probably want to have your cognitive architecture defined in code. 01:24:37.660 |
Yes. But then there's parts within that that are prompts or maybe configuration options, like something to do with RAG or something like that. We've seen that be a popular configuration option. 01:24:48.340 |
So is it useful for programmers more or is it for like people who cannot program? I guess if you cannot program, it's still very complicated for them. 01:24:55.340 |
It's useful for both. I think like we see it being useful for developers right now, but then we also see like there's often teams building this, right? It's not one person. And so I think there's this handoff where the engineer might define the cognitive architecture. They might do some initial prompt engineering. 01:25:08.340 |
It's easier to communicate to the product manager. 01:25:10.340 |
It's easier to show them what's going on and it's easier to let them control it and maybe they're doing the prompting. And so, yeah, I think what the TLDR is like what it means to build is changing. 01:25:19.340 |
And also like UX is UX in general is interesting, whether it's for how to build these agents or for how to use them as end consumers. And there might also be overlap as well. And it's so early on and no one knows anything. 01:25:30.340 |
But I think UX is one of the most exciting spaces to be innovating in right now. 01:25:37.340 |
That's another theme that we cover on the pod. We had the first AI UX meetup and we're trying to get that going. It's not a job. It's just people just tinkering. 01:25:46.340 |
Well, thank you guys so much for indulging us. 01:25:51.340 |
Harrison, you're amazing as a co-host. We'd love to have you back. Like, that was awesome. 01:25:53.340 |
I just try to listen to you guys for inspiration and stuff. 01:25:57.340 |
It's actually really scary to have you as a listener because I don't want to misrepresent. Like, I talk about 100 companies, right? And God forbid I get one of them wrong and, you know. 01:26:06.340 |
I'm sure all of them listen as well. Not to add pressure. 01:26:09.340 |
Yeah, thank you so much. It's a pleasure to have you on. And you had one of the most impactful PhDs in this sort of AI wave. So I don't know how you do it, but I'm excited to see what you do at OpenAI.