Keynote: Why people think "agent" is a buzzword but it isn't

I started an AI infrastructure startup a few years ago, and after selling it last year, I have been happily unemployed. Before that, I worked at NVIDIA and Snorkel AI, and I also started a couple of courses at Stanford. Today, I want to talk about the challenges in building agents, or why people think "agent" is a buzzword but it isn't.
I had been wanting to come to the AI Engineering Summit for a while. Recently, I shared a section on agents; it's about 8,000 words, and people seem to like it. I actually prepared another talk for the summit, but after watching a lot of talks yesterday, I realized that people had already covered a lot of ground, so I created a new talk hoping to cover newer, more exciting topics. This is a new talk, created especially for this conference.
I heard that if you're giving an agent talk today, you have to address the hype. I know a lot of people think that much of the talk about agents is just hype, but I think there are a lot of exciting use cases for agents.
When I was working on my book, I decided to look at a lot of AI books from the '80s and '90s, trying to understand how people defined agents back then. A definition that really resonates with me comes from the book by Stuart Russell and Peter Norvig: they define an agent as anything that can perceive its environment and then act on that environment.
So let's say you have an agent that plays chess: its environment is the chess board, and its actions are the moves the game allows.
One of the most popular use cases of agents nowadays is coding agents. Here is a figure from the SWE-agent paper. As you can see, the environment for SWE-agent is a computer with a terminal and a file system, and the list of actions it can perform includes navigating the repo, searching files, viewing files, and editing lines.
The environment determines the kinds of actions the model can perform: if an agent is in a game, it can only perform the actions the game allows. At the same time, giving the model more actions can also expand its environment: if you give the model the ability to browse the web, the internet becomes part of its environment.
There are many reasons why we would want to give a model access to actions. First, actions can help address a model's limitations. All models have a knowledge cutoff date, and that makes it pretty hard for them to answer questions about recent events. By giving a model access to APIs such as news or weather services, or to a web browser, it can get relevant, recent information to answer questions. Another very common limitation people discovered early on is that AI is pretty bad at math. Instead of trying to train a model to be really, really good with numbers, you can simply give it access to a calculator.
Another very exciting use case is that you can turn a text-only or image-only model into a multimodal model by giving it access to tools or actions. A language model can only process text and output text. If you want the model to also be able to process images, you can give it access to, say, an image captioning model: given an image, it can use this tool to generate a caption, and then use the caption to generate a response. Now the model can process both text and images.
I think one reason agents are so exciting is that actions allow you to embed models into your workflow. For example, you can give the model access to your inbox, your Slack, your calendar, or your code editor, so that you can use the model inside your daily workflow instead of having to open a web browser to use AI.
When we talk about agents, people always ask me: if agents are so cool, why isn't everyone using them? Give me one good use case for agents. It's because building agents is really, really hard. For the rest of the talk, I will cover a few reasons why.
Let me start with the curse of complexity. We know that task failure rate increases as task complexity increases. This is true not just for AI but for humans as well: given more complex tasks, we are more likely to fail. Let's say you're building an application for your company and you're okay with a failure rate of, say, 1% or 2%, and the model makes mistakes 2% of the time on a single step. Over 10 steps, the model will make a mistake about 18% of the time. If you increase the number of steps to 100, the model becomes almost worthless: it will make mistakes most of the time.
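The arithmetic behind these numbers is just compounding: if each step succeeds independently with probability p, a k-step task succeeds with probability p to the power k. A minimal check:

```python
def failure_rate(per_step_failure: float, num_steps: int) -> float:
    """Chance that at least one of num_steps independent steps fails."""
    return 1 - (1 - per_step_failure) ** num_steps

print(f"{failure_rate(0.02, 10):.0%}")   # ~18% over 10 steps
print(f"{failure_rate(0.02, 100):.0%}")  # ~87% over 100 steps
```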
A lot of agent use cases are pretty complex and might require multiple steps to solve. It's not that you don't want to use agents for simple tasks; simple tasks just usually don't need agents. Simple tasks also tend to have lower economic value, so they are less exciting for people to solve.
Let's go through a very simple example. Say you ask the agent: how many people bought products from company X last week? This is a very simple query, but the agent might need to break it down into several steps. It might first get the product list for company X; then, for each product in the list, get the number of orders from last week; then sum up all the order counts; and finally, given that number, generate a response to the user. So even with this very simple task, there were four steps, as in the sketch below.
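Here is a minimal sketch of the plan the agent would have to come up with; get_product_list and get_order_count are hypothetical stand-ins for whatever APIs the company actually exposes, stubbed with placeholder data so the sketch runs:

```python
# Hypothetical stand-ins for the company's real APIs.
def get_product_list(company: str) -> list[str]:
    return ["widget", "gadget"]  # placeholder data

def get_order_count(product: str, period: str) -> int:
    return 5  # placeholder data

def weekly_purchases(company: str) -> str:
    products = get_product_list(company)                # step 1: product list
    counts = [get_order_count(p, period="last_week")    # step 2: orders per product
              for p in products]
    total = sum(counts)                                 # step 3: sum the counts
    return f"{total} orders from {company} last week."  # step 4: respond to the user

print(weekly_purchases("Company X"))
```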
The more complex the query, the higher the number of steps, and the more likely the agent is to fail. In the vast majority of agent use cases I'm seeing right now, it's very rare to see agents consistently able to solve tasks that involve more than five steps. And I do believe that enabling agents to handle more complexity will unlock many, many new use cases.
So how do you know what level of complexity your agent can handle? You want to give the agent tasks at the right level of complexity, so that it doesn't fail and cause a catastrophic business failure. Different tasks and use cases have different definitions of complexity. A very common way to define complexity is by the number of steps needed to solve the task.
This is a synthetic planning benchmark that I'm working on and hoping to publish very soon. I use a synthetic benchmark because it allows me to control the level of complexity to study model behavior: I can ask for tasks that require, say, five steps to solve. On my benchmark, most models don't perform that well; most models can only solve tasks that require at most five steps.
This is consistent with another study I have seen. The actual pass rates for models must have increased a lot by now, but I think the insight is still very relevant. In this paper, they construct different docstrings and then ask the model to generate code based on the docstring, and they measure the complexity of a task by how many steps are needed in the docstring. For example, a task might first ask the model to write code to convert a string to lowercase, and then ask it to write code to remove half of the characters in the string; these are considered two building blocks, or two steps, in the docstring. And they found the same result as I did: the pass rate decreases rapidly as the number of steps increases.
The good news is that newer models are actually getting a lot better at planning. In the same results, you can see three very nice curves; they come from DeepSeek-R1, Gemini 2.0 Flash Thinking, and o1-preview. I didn't test o1 and o3 because I didn't have access to those models when I ran this test. You can see the curves being pushed upward: the newer models are able to solve tasks with more complexity. I do believe this will keep improving over time, allowing us to use agents for more practical, complex, real-world tasks.
This chart shows the number of tasks each model was able to solve. Overall, you can see there's a pretty big difference between newer reasoning models, such as o1-preview, DeepSeek-R1, and Gemini 2.0 Flash Thinking, and non-reasoning models, such as Claude 3.5 Sonnet, Gemini 2.0 Pro, or GPT-4o.
Different use cases might define complexity differently. Here is the ZebraLogic paper, which came out just last month. It's a logic task, and they define each problem's complexity by its number of Z3 conflicts. They got the same result: model success rates decrease rapidly as the number of Z3 conflicts increases.
There are several tips for getting an agent to handle more complexity. First, you might want to break tasks into subtasks that the agent can solve; you don't want to give an agent a task bigger than it can handle. So let's say your task consistently requires more steps than that: if your agent can do at most, say, three steps, then you might want to break the task into two subtasks.
Another way to help the model deal with more complexity is test-time compute scaling. People have been talking a lot about this in the last year or two; it's one of the newer, very exciting concepts that gave rise to reasoning models. The idea is that you can give the model more compute during inference, so that it can either generate more thinking tokens, or use the compute budget to generate more outputs. For example, given a math problem, you might generate 10 different solutions and then pick the answer the model outputs most of the time.
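A minimal sketch of that pick-the-most-common-answer idea (often called self-consistency); here, generate stands in for any sampled model call at temperature above zero, not a specific provider API:

```python
import collections
import random

def self_consistency(generate, prompt: str, n: int = 10) -> str:
    """Sample n candidate answers and return the most common one."""
    answers = [generate(prompt) for _ in range(n)]
    return collections.Counter(answers).most_common(1)[0][0]

# Toy stand-in for a sampled model call:
def noisy_model(prompt: str) -> str:
    return random.choice(["42", "42", "42", "41"])

print(self_consistency(noisy_model, "What is 6 x 7?"))  # usually "42"
```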
Using stronger models could also be called train-time compute scaling, because now you need to invest more compute into training bigger models.
So that was the first challenge, the curse of complexity. The next challenge I want to talk about is tool use. Tool use is basically natural-language-to-API translation.
A lot of the time, we have humans using agents, and the human gives the agent instructions in natural language. For example, a human might give the agent a task like: "Given this customer email, create an order." The agent needs to translate that into functions that can perform the task. It might first call a function to extract the customer ID from the email address, then call another function to extract the order ID from the content of the email, and then, given the customer ID and the order ID, actually create the order. So it needs to translate natural language into a set of API calls.
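As a sketch, here is what that translation might look like once spelled out as code; all three functions and their names are hypothetical placeholders, not a real order API:

```python
# Hypothetical placeholder functions; real systems expose their own.
def extract_customer_id(email_address: str) -> str:
    return "cust_001"  # placeholder lookup

def extract_order_id(email_body: str) -> str:
    return "order_042"  # placeholder extraction

def create_order(customer_id: str, order_id: str) -> dict:
    return {"customer": customer_id, "order": order_id, "status": "created"}

def handle_email(sender: str, body: str) -> dict:
    customer_id = extract_customer_id(sender)   # step 1: from the email address
    order_id = extract_order_id(body)           # step 2: from the email content
    return create_order(customer_id, order_id)  # step 3: actually create the order
```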
The challenge comes from both sides of the translation. Natural language can be extremely ambiguous, and on the API side, you can have very bad APIs and very bad documentation. Let's start with an example of ambiguous natural language.
Consider an agent with access to two very simple functions: fetch_top_products and fetch_product_info, where fetch_product_info can return, say, the product price. Say fetch_top_products takes in three arguments: start date, end date, and number of products. Now a user asks for the best-selling products. The agent knows it needs to call fetch_top_products, but how many products should it fetch, and what start date and end date should it use? Does the user want the best-selling products from yesterday, from last week, or from all time?
Now let's talk about very bad APIs and bad documentation. In my coding career, I have been pretty fortunate, or unfortunate, to have seen some really, really bad code comments. As an engineer myself, I know that people don't usually like writing documentation. But if you can't explain a function to the agent, it's going to be really, really hard for the agent to use it correctly.
So I do think that when you give an agent access to a tool, you need to provide the necessary documentation. At the least, you need to explain what the function does, what parameters it takes, and what the type of each parameter is. You also need to document the function's error codes and its expected return values. With an error code, you don't just want "this function returned status 99"; you want to explain to the agent what this error is usually caused by, and if it encounters this error, how it should address it. One company told me that one of the biggest improvements they got for their agents came after they added to the documentation how to interpret the return values of their functions. Say a function returns the value 1: what does that mean? If you help the model interpret results, the agent can call functions a lot better and then plan a lot better.
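Here is one way such documentation might be packaged, as a sketch: a tool spec covering the description, typed parameters, error codes with causes and fixes, and how to interpret return values. fetch_top_products and its error code are hypothetical:

```python
# Hypothetical tool spec; the point is what a complete entry covers.
FETCH_TOP_PRODUCTS_SPEC = {
    "name": "fetch_top_products",
    "description": "Return the best-selling products in a date range.",
    "parameters": {
        "start_date": {"type": "string", "description": "ISO date, inclusive."},
        "end_date": {"type": "string", "description": "ISO date, inclusive."},
        "num_products": {"type": "integer", "description": "How many products to return."},
    },
    "errors": {
        "DATE_RANGE_EMPTY": "start_date is after end_date; swap them and retry.",
    },
    "returns": "A list of product IDs, best-selling first. An empty list "
               "means no sales in the range; it is not an error.",
}
```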
Another very important thing to keep in mind is that tool use for agents can be counterintuitive to us, because humans and AIs operate in fundamentally different ways. For one, humans and AIs have different preferences. Humans might prefer working with visual things like GUIs, whereas AIs tend to do better with APIs. If you ask a human to use Salesforce, they might go to the Salesforce website; but if you assign the task to an AI, it will perform much better by not having to deal with a lot of visual cues and just calling the straightforward API instead. Humans and AIs also operate differently. At least for me, I find it impossible to perform multiple tasks at once, so I perform steps sequentially, whereas AI can perform tasks in parallel.
For example, if you need to browse a hundred websites, that would be very, very boring for a human; I have browsed thousands of websites, and it was not fun at all. For AI, browsing a hundred or a thousand websites is extremely easy: it can just query a thousand websites in parallel.
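As a sketch of that fan-out, here is concurrent fetching using only the Python standard library; a real agent would go through its own browsing tool rather than raw HTTP:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url: str) -> str:
    # One blocking request; real use would add error handling and retries.
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def fetch_all(urls: list[str]) -> list[str]:
    # Dispatch all requests concurrently: tedious for a human clicking
    # through pages, trivial for an agent harness.
    with ThreadPoolExecutor(max_workers=32) as pool:
        return list(pool.map(fetch, urls))
```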
That is actually a challenge for creating training examples for models, because given a task, what a human annotator does might not be optimal for AI. That's one reason reinforcement learning is so exciting: with supervised fine-tuning, you are teaching the AI to clone human behaviors, whereas with reinforcement learning, you let the model figure it out. Through trial and error, it might find ways of doing things that are optimal for AI.
There are several tips for making agents better at tool use. First, create very, very good documentation, with everything we just talked about: not just function descriptions and parameters, but also error codes and return values. You should also give agents narrow, well-defined functions. I just caught up with a friend working at a very big company; I won't say the name, but if I say "the search engine," you probably know which one it is. He was saying that for their use cases, they give their agents only three or four tools; for their tasks, their agents just did not work at all with more tools than that.
Also, because of the ambiguity of natural language, you can help the model understand tasks and user queries better with techniques like query rewriting, or by using an intent classifier to help classify user intent. And you should definitely instruct your agent to ask for clarification when it's unsure what the user wants. For example, if a user asks for the five best-selling products under $10, the agent can make a random guess and fetch products from yesterday or from last year, or it can ask the user: do you want the best-selling products from yesterday or from last week?
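Here is a sketch of the kind of system-prompt instruction that nudges an agent to clarify instead of guessing; the exact wording is an assumption, not a tested recipe:

```python
# Hypothetical system-prompt fragment appended to the agent's instructions.
CLARIFY_INSTRUCTION = """\
Before calling a tool, check that you can fill in every required
parameter from the user's request. If any parameter is ambiguous
(e.g., a date range the user never specified), do not guess:
ask the user one short clarifying question instead.
"""
```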
One pretty exciting direction I'm seeing is that a lot of companies are building specialized action models for specific types of queries and APIs. We already have models trained specially for coding, say for VS Code; so why not have specialized action models for different environments as well? For example, rather than naming Salesforce specifically, let me say it generally: different companies with very complex ecosystems might want to train action models for their own environments.
So the curse of complexity is one issue, and tool use with natural-language-to-API translation is another; the third challenge is context. It's really funny, because we have been talking about context for a long time: first for RAG and now for agents, it's all about context.
AI has always required a lot of information. Even before agents, a model already had to handle system instructions, which can be pretty long if you really want the application to perform well and be secure: you might instruct the model on what kinds of queries it should respond to, what kinds it should not respond to, and what kinds of tools it should call, plus user instructions and examples.
With agents, there is a lot more information. First, you might need to pass documentation about your tools to the agent, and the more tools there are, the more documentation is needed. Then, after you call a tool, there are tool outputs that the model needs to keep in its context, and these grow with more execution steps. After getting back a tool output, the model may need to reason about what to do next; after generating a plan, the agent may also want to reason about whether the plan makes sense. All these reasoning tokens take up a lot of tokens. So the information an agent has to work with can grow very, very quickly.
And I haven't even mentioned other kinds of information, such as table schemas for tasks like text-to-SQL. Say you're doing a text-to-SQL task, and you have not just one table but many. To translate a query into SQL, the model might need to figure out which tables to use, and for it to pick the right table, you might need to pass in the table schemas. If you have a thousand table schemas, that is a lot of information for the model to process.
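One common mitigation, sketched minimally here, is to put only the schemas that look relevant to the query into context; naive keyword overlap stands in for whatever retriever (embeddings, BM25) you would actually use:

```python
import re

def tokens(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))

def relevant_schemas(query: str, schemas: dict[str, str], k: int = 5) -> list[str]:
    """Rank table schemas by keyword overlap with the query, keep the top k."""
    q = tokens(query)
    ranked = sorted(
        schemas.items(),
        key=lambda kv: len(q & tokens(kv[0] + " " + kv[1])),
        reverse=True,
    )
    return [schema for _, schema in ranked[:k]]

schemas = {
    "orders": "orders(order_id INT, customer_id INT, created_at DATE)",
    "pages": "pages(page_id INT, url TEXT, visits INT)",
}
print(relevant_schemas("how many orders came in last week?", schemas, k=1))
```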
One thing I experienced when working with agents, and that I would love to see more research on, is how to make a model that is good at both planning and long context. Because in my experience, the models that are good at planning are not necessarily the models that are good with long context. The reason is that planning is like reasoning: it requires generating a lot of thinking or reasoning tokens, so this kind of task is output-heavy, whereas long context is input-heavy. And in my personal benchmarks, I've seen that models that perform well on my long-context benchmark don't perform as well on the planning benchmark, and vice versa.
So we've talked about how an agent has to deal with a lot of information, and that information might not fit inside a model's effective context. I want to highlight the word "effective" here, because a model might have a very long context window but not use that context effectively. A model might be able to fit in a million tokens, but if you give it anything more than, say, 30,000 tokens, it might get really, really funky and hallucinate all the time. At least in my personal experience, I have run a lot of benchmark evaluations to see at what point in my documentation a model starts to hallucinate and make things up.
If you can't fit all your information into the model's effective context, you might need to rely on other forms of information storage. You can think of the context as short-term memory: it should be used to store information relevant to the task at hand. You can then supplement it with long-term memory, for example external databases or storage. This is very common with use cases like RAG: if you connect a model to your external databases, you're connecting it to a much larger pool of information. The same goes for agents: you can store less immediately relevant information in external files. Say your task requires 10 steps, and the outputs from those 10 steps don't quite fit into the context: you might store the outputs of the first few steps in external files and only retrieve them when necessary.
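A minimal sketch of that spill-to-disk idea, assuming JSON-serializable step outputs and a local directory as the external store:

```python
import json
from pathlib import Path

MEMORY_DIR = Path("agent_memory")
MEMORY_DIR.mkdir(exist_ok=True)

def save_step(step: int, output) -> None:
    # Spill an intermediate result out of the context window onto disk.
    (MEMORY_DIR / f"step_{step}.json").write_text(json.dumps(output))

def load_step(step: int):
    # Pull a result back in only when a later step actually needs it.
    return json.loads((MEMORY_DIR / f"step_{step}.json").read_text())

save_step(1, {"products": ["widget", "gadget"]})
print(load_step(1)["products"])
```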
So we have short-term memory and long-term memory, and another level of the memory system is internal knowledge: the knowledge the model already has. If you have information that is essential for your application, you might want to include it in the training data and fine-tune the model on it, so that the model can use it as part of its internal knowledge instead of having to waste context tokens on it.
To recap, we talked about what an agent is and about different challenges in building agents. First, getting the model to handle tasks of the right complexity, with tips on how to make the model handle more complexity. Then the tool use challenge: how to translate between natural language and API calls. And finally, how to get models to handle longer context with memory systems. If you have any questions, or if you want to talk about the agent planning benchmark, please come find me.