How to build Enterprise Aware Agents - Chau Tran, Glean

That was a very impressive LLM-generated summary of me. So today I'm going to talk to you about building enterprise-aware agents.

So let's jump straight to the hottest question: should I build workflows, or should I build agents?

Workflows are systems where LLMs and tools are orchestrated through predefined code paths, and there are two main ways to represent them.
The first way is through an imperative code base. These are workflows where you write a program that calls LLMs, reads the responses, and then calls tools, all in a traditional programming flow. Here you have direct control of the execution.
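A minimal sketch of the imperative style, assuming a hypothetical `call_llm` stub in place of a real model client: plain code owns the control flow, and the LLM is just a function call inside it.

```python
def call_llm(prompt: str) -> str:
    # Stand-in for a real model call; here it routes based on a keyword.
    return "search" if "find" in prompt.lower() else "answer"

def search_tool(query: str) -> str:
    # Hypothetical enterprise search tool.
    return f"results for '{query}'"

def run_workflow(task: str) -> str:
    # Step 1: ask the LLM how to route the task.
    decision = call_llm(f"Route this task: {task}")
    # Step 2: the code, not the model, decides what happens next --
    # this is the direct control of execution mentioned above.
    if decision == "search":
        return f"answer grounded in {search_tool(task)}"
    return "answer from model knowledge"
```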
The second way to represent a workflow is through declarative graphs. Here you represent your workflow as a graph where the nodes are steps. So you define the structure, but not the execution.
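A minimal sketch of the declarative style: the workflow is data (a graph of named steps), and a generic runner supplies the execution. The step names and context keys are illustrative.

```python
# The workflow as data: each node names a step function and its successor.
GRAPH = {
    "fetch": {
        "run": lambda ctx: {**ctx, "doc": "raw competitor notes"},
        "next": "summarize",
    },
    "summarize": {
        "run": lambda ctx: {**ctx, "summary": f"summary of {ctx['doc']}"},
        "next": None,  # terminal step
    },
}

def execute(graph: dict, start: str) -> dict:
    # Walk the graph from the start node; the structure, not this runner,
    # determines which steps run and in what order.
    ctx, node = {}, start
    while node is not None:
        step = graph[node]
        ctx = step["run"](ctx)
        node = step["next"]
    return ctx
```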
I'm not going to go into the details of the pros and cons of each. But the main point is that workflows are deterministic: if you run a workflow today, it will mostly do the same thing tomorrow.
Then there are agents, which are systems where LLMs dynamically direct their own processes: they decide how to achieve a task, what tools to call, and what steps to take. An agent receives a task or a goal from a human, and then it enters an iterative loop where it plans what to do, executes the action, reads the results from the environment, and iterates until it gets the result it wants.
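The loop just described can be sketched as follows, with a deterministic `plan` stub standing in for the LLM: the model decides the next action, the runtime executes it, and the loop repeats until the model decides it is finished.

```python
def plan(task: str, observations: list) -> tuple:
    # Hypothetical planner; a real agent would call an LLM here.
    if not observations:
        return ("search", task)
    return ("finish", f"report on '{task}' using {len(observations)} result(s)")

def act(action: str, arg: str) -> str:
    # Execute a tool call and return what the environment says.
    if action == "search":
        return f"hits for '{arg}'"
    raise ValueError(f"unknown action: {action}")

def run_agent(task: str, max_steps: int = 5) -> str:
    observations = []
    for _ in range(max_steps):  # plan -> execute -> observe, repeated
        action, arg = plan(task, observations)
        if action == "finish":
            return arg
        observations.append(act(action, arg))
    return "gave up"  # guardrail so the loop cannot run forever
```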
So what are the trade-offs between workflows and agents?

Workflows are sort of the Toyota of AI systems. They're good when you want to automate repetitive tasks or encode existing best practices and know-how. They're cheap to run, because you don't have to spend time on all these LLM calls. They're also easier to debug, because you have this code or this graph where you can manually pinpoint the step at which things went wrong. And in building workflows, humans are in control: you can control your destiny, even given imperfect LLMs.
On the other hand, agents are sort of the Tesla of AI systems. They're good for researching unsolved problems, where you need the LLM to figure out what to do. The upside is there's less logic for you to write yourself, and sometimes you get these hints of brilliance. The problem is, like your Tesla, it works very well most of the time, but sometimes you still take the wrong exit. And that's when you kind of miss your Toyota.
And the decision between workflows and agents is a pretty tricky one, because it depends highly on the capabilities of the underlying models. Some things that don't work in the agentic loop now may start working as models get better.
But recently, one thought that really changed how I think about it is: what if you don't predefine the steps, and instead let the agent figure out what needs to be done to achieve the task? You give it a task, it figures out one step at a time, and at the end, when the agent finishes the execution, you look at the trace of what happened.
So if I represent this in a programming kind of way, an agent takes a task and generates a workflow as it executes. And if we think of it this way, that an agent takes a task and generates a workflow, then you can see there are really good synergies between workflows and agents.

The first synergy is that you can use workflows as evaluation for your agents. Over time you can collect a huge amount of golden workflows, a whole handbook of how tasks should be done in your company. And then you can evaluate your agents against them: did the agent actually figure out the right steps? This is a little bit different from end-to-end evaluation: you are not judging the agent by its end response, but by whether it actually took the right steps to get there.
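A sketch of this trace-based evaluation: judge the agent's trace against a golden workflow by checking whether the golden steps appear in order, rather than judging only the final response. The step names are illustrative.

```python
def step_match_score(golden_steps: list, trace_steps: list) -> float:
    """Fraction of golden steps the trace hit, in order (extra steps allowed)."""
    matched = 0
    for step in trace_steps:
        if matched < len(golden_steps) and step == golden_steps[matched]:
            matched += 1
    return matched / len(golden_steps)
```

Under this metric, a trace that took a detour but still covered every golden step scores 1.0, while a trace that skipped a required step scores lower.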
The second, and even better, way for workflows to help agents is to use workflows to train your agents. Here you truly get the best of both worlds: the agent can follow the golden workflows you have in your library for known tasks, but it can also rely on its own internal reasoning capabilities to compose different workflows together to achieve new tasks, and even use its own reasoning to extend what you teach it and make it better.
And then agents can help workflows as well. One way to do that is, for workflow-building platforms, you can use an agent to generate the workflows. This is sort of how Glean agents work under the hood: the user gives a natural-language description of the task, the agent generates a draft workflow, and then the user can make edits or changes to it.
And lastly, and I think this is the most powerful synergy, you can use agents as a workflow discovery engine. Users try to accomplish new tasks with your agent, and when they find that the agent did a good job, they can save that trace: OK, this is how you do this task in my company. And then over time, you can use this as training data.
A question I often get is: do we still need this kind of stuff in a world where we have AGI? So, AGI is going to be a super-intelligent employee, right? But if AGI doesn't know how your company works, it's like a really good employee who just joined: they don't know all the business practices, they still need onboarding, they need to know who to talk to to get unblocked, and they need all the very nuanced ways of doing things in the enterprise. So enterprise-aware AGI is fully onboarded and very intelligent.
And one key insight, I think, is that there are many acceptable ways to achieve a task, but there's a gap between an acceptable output and a great one. One example is competitor analysis: sure, the agent can do some basic Google searches and read some notes from outside to produce a competitor analysis, but does it actually follow the protocols and processes that your company defines? And does it actually address all the key metrics that your executives really care about?
So given all this data, the tasks and goals and workflows, how do you actually train your agents using it? There are two main ways we have experimented with. The first is fine-tuning, and there are two main flavors of fine-tuning here. One is supervised fine-tuning, where you give an input and an expected output, and you train your model to just mimic that behavior.
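An illustrative shape for one such supervised fine-tuning example: the task is the input, and the golden workflow, serialized as an ordered list of steps, is the output the model learns to mimic. The step names here are hypothetical.

```python
# One (input, target) training pair: task in, serialized golden workflow out.
sft_example = {
    "input": "Run a competitor analysis for the last quarter",
    "target": " -> ".join([
        "find_recent_customer_calls",
        "extract_competitor_mentions",
        "run_analysis_per_competitor",
        "format_report_with_key_metrics",
    ]),
}
```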
The second flavor is RLHF, where you don't have a golden label, but you have a rating or a reward: for this task, is this workflow a good one or a bad one? Then you can run your favorite optimization algorithm to fine-tune the model. The pro of this method is that it can learn really well when you have a lot of data: if you have a huge amount of tasks and workflows, the model can really learn and generalize across different tasks.
The problem here is, one, you kind of have to create a fork of the base model. You do some fine-tuning, and by the time the fine-tuning finishes, maybe a new and better model has already come out, and you have to redo this whole process again.
The second problem is that any change to your training data means retraining. If you add a new tool, maybe some of the existing workflows become outdated. If you change some business priorities or business processes, you have to redo the training again. And it's also not super flexible for personalization: given the same task, different teams or different employees might have different optimal workflows, and fine-tuning is not well suited for those use cases.
Then comes the second option, which is dynamic prompting through search. Given the same labeled data mapping tasks to golden workflows, you build a really good search engine for tasks, so that you can find similar tasks given a new task. At inference time, you find the most similar tasks in the training data, and then you feed the representations of those workflows to the LLM as examples.
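A sketch of that retrieve-then-prompt flow, using token-overlap (Jaccard) similarity as a toy stand-in for a real task search engine: retrieve the most similar past tasks, then splice their golden workflows into the prompt as examples.

```python
def similarity(a: str, b: str) -> float:
    # Toy Jaccard similarity over tokens; a real system would use
    # hybrid lexical + embedding search instead.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def retrieve(task: str, library: list, k: int = 2) -> list:
    # Rank the (task, workflow) library by similarity to the new task.
    ranked = sorted(library, key=lambda ex: similarity(task, ex["task"]), reverse=True)
    return ranked[:k]

def build_prompt(task: str, library: list) -> str:
    # Feed the retrieved workflows to the LLM as few-shot examples.
    shots = "\n".join(
        f"Task: {ex['task']}\nWorkflow: {ex['workflow']}"
        for ex in retrieve(task, library)
    )
    return f"{shots}\nTask: {task}\nWorkflow:"
```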
So here you have a spectrum of determinism and creativity. When there's no workflow that matches your input task, the LLM can use its creativity to generate a new workflow. But when there's a high-confidence match to something you have done before, the LLM will give you a workflow that's very similar to the golden one.
To come back to the competitor analysis example from before: you've collected this huge list of task-to-workflow pairs. When a new task comes in, say "what competitors have we been running into recently?", the system will retrieve the workflow for how to analyze each competitor, and it will also find a workflow for how to find your recent customer calls. The LLM then takes those examples and generates a composed workflow that reads customer calls, reads internal messages, extracts competitors, and runs an analysis for each of them.
OK, so comparison time. Fine-tuning with RLHF is very strong when you have a lot of data that you want to generalize from. Dynamic prompting through search is more flexible, and it also gives you better interpretability: you can look at the exact examples that affected your outputs. Fine-tuning is good for learning generalized behaviors, where the ground-truth labels don't change over time or across different users. Dynamic prompting through search is better for learning customized behaviors, or closing the last-mile quality gap, where requirements are changing quickly.
One analogy I think about for fine-tuning versus dynamic prompting: fine-tuning is very similar to building custom hardware. When you have a task that you really want to optimize for, and the requirements don't change over time, you can build custom hardware that does it very well, but it's costly when your requirements change. Dynamic prompting, by comparison, is more like writing software: not as optimized, but you can change it very quickly.
Last point: how do we actually build this workflow search, where you give it a task and it finds similar tasks? I would say it's very similar to building document search, and there are two key components. The first one is what everyone usually thinks of when they think of search, which is textual similarity: given this task, what are some similar-sounding tasks in the training data? Here the golden recipe is hybrid search between lexical matching and vector embeddings.
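The lexical and embedding rankings then have to be merged into one. The talk doesn't name a fusion method, so as an assumption, here is a toy sketch of reciprocal rank fusion (RRF), a common choice for hybrid search:

```python
def rrf(rankings: list, k: int = 60) -> list:
    """Merge several rankings (lists of doc ids, best first) into one."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            # A document scores higher the nearer the top it appears
            # in each ranking; k damps the influence of any one list.
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```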
But what I found is that in enterprise settings, pure textual similarity is not enough. When you give users the ability to create workflows and write documents, then when you want to search for something, there will be hundreds or thousands of similar-looking documents or workflows, and the problem becomes how you choose the right one. This is what I call authoritativeness. To solve this problem, you kind of have to go into the knowledge graph: if this workflow was created by someone I work closely with, it has a high success rate, and people post about it on Slack, then it's more likely to be the right one. These authoritativeness signals are very hard to encode directly into an LLM, which is why we have a separate system that does the search for workflows.
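A sketch of an authoritativeness score layered on top of textual relevance, combining the signals just mentioned (author proximity, success rate, Slack mentions). The weights and the cap on mentions are illustrative assumptions, not Glean's actual model.

```python
def authority_score(workflow: dict) -> float:
    # Illustrative weights; a production system would learn these.
    weights = {"success_rate": 0.5, "author_proximity": 0.3, "mentions": 0.2}
    # Saturate the mention signal so one viral Slack post can't dominate.
    mention_signal = min(workflow.get("mentions", 0) / 10.0, 1.0)
    return (
        weights["success_rate"] * workflow.get("success_rate", 0.0)
        + weights["author_proximity"] * workflow.get("author_proximity", 0.0)
        + weights["mentions"] * mention_signal
    )
```

In practice a score like this would break ties among the hundreds of similar-looking workflows that textual similarity alone cannot distinguish.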
So, key takeaways. Workflows are good for determinism, and humans are in control. The synergies between agents and workflows: workflows can be used for agent evaluation, workflows can be used for agent training, and agents can be used for workflow discovery. Fine-tuning is good for generalized behaviors; dynamic prompting through search is good for personalized behaviors.
All right, I still have one minute and 30 seconds, maybe time for one question.

I'm kind of curious about the fine-tuning; I think you mentioned that it needs a lot of data, right?

So the question was, let me try to reinterpret it, let me know if it's wrong: how much data do we need to do fine-tuning given the new-- That's a very difficult question to answer, because it really depends on how out-of-distribution your task is compared to the internal knowledge of the LLM. But I'll catch you after, and we can talk more.