How to build Enterprise Aware Agents - Chau Tran, Glean

That was a very impressive LLM-generated summary of me. So today I'm going to talk to you about building enterprise-aware agents.

So let's jump straight to the hottest question: should I build workflows, or should I build agents?

Workflows are systems where LLMs and tools are orchestrated through predefined code paths, and there are two main ways to represent them.
The first way is through an imperative code base. These are workflows where you write a program that calls LLMs, reads the responses, and then calls tools, all in a traditional programming flow. Here you have direct control of the execution.
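A minimal sketch of the imperative style, assuming a hypothetical `call_llm` stub in place of a real model client: plain code owns the control flow, and the LLM is just a function call inside it.

```python
def call_llm(prompt: str) -> str:
    # Stand-in for a real model call; here it routes based on a keyword.
    return "search" if "find" in prompt.lower() else "answer"

def search_tool(query: str) -> str:
    # Hypothetical enterprise search tool.
    return f"results for '{query}'"

def run_workflow(task: str) -> str:
    # Step 1: ask the LLM how to route the task.
    decision = call_llm(f"Route this task: {task}")
    # Step 2: the code, not the model, decides what happens next --
    # this is the direct control of execution mentioned above.
    if decision == "search":
        return f"answer grounded in {search_tool(task)}"
    return "answer from model knowledge"
```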
The second way to represent a workflow is through declarative graphs. Here you represent your workflow as a graph where the nodes are steps. So you define the structure, but not the execution.
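A minimal sketch of the declarative style: the workflow is data (a graph of named steps), and a generic runner supplies the execution. The step names and context keys are illustrative.

```python
# The workflow as data: each node names a step function and its successor.
GRAPH = {
    "fetch": {
        "run": lambda ctx: {**ctx, "doc": "raw competitor notes"},
        "next": "summarize",
    },
    "summarize": {
        "run": lambda ctx: {**ctx, "summary": f"summary of {ctx['doc']}"},
        "next": None,  # terminal step
    },
}

def execute(graph: dict, start: str) -> dict:
    # Walk the graph from the start node; the structure, not this runner,
    # determines which steps run and in what order.
    ctx, node = {}, start
    while node is not None:
        step = graph[node]
        ctx = step["run"](ctx)
        node = step["next"]
    return ctx
```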
I'm not going to go into the details of the pros and cons of each. But the main point is that workflows are deterministic: if you run a workflow today, it will mostly do the same thing tomorrow.
Then there are agents, which are systems where LLMs dynamically direct their own processes: they decide how to achieve a task, what tools to call, and what steps to take. An agent receives a task or a goal from a human, and then it enters an iterative loop where it plans what to do, executes the action, reads the results from the environment, and iterates until it gets the result it wants.
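The loop just described can be sketched as follows, with a deterministic `plan` stub standing in for the LLM: the model decides the next action, the runtime executes it, and the loop repeats until the model decides it is finished.

```python
def plan(task: str, observations: list) -> tuple:
    # Hypothetical planner; a real agent would call an LLM here.
    if not observations:
        return ("search", task)
    return ("finish", f"report on '{task}' using {len(observations)} result(s)")

def act(action: str, arg: str) -> str:
    # Execute a tool call and return what the environment says.
    if action == "search":
        return f"hits for '{arg}'"
    raise ValueError(f"unknown action: {action}")

def run_agent(task: str, max_steps: int = 5) -> str:
    observations = []
    for _ in range(max_steps):  # plan -> execute -> observe, repeated
        action, arg = plan(task, observations)
        if action == "finish":
            return arg
        observations.append(act(action, arg))
    return "gave up"  # guardrail so the loop cannot run forever
```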
So what are the trade-offs between workflows and agents?

Workflows are sort of the Toyota of AI systems. They're good when you want to automate repetitive tasks or encode existing best practices and know-how. They're cheap to run, because you don't have to spend time on all these LLM calls. They're also easier to debug, because you have this code or this graph where you can manually pinpoint the step at which things went wrong. And in building workflows, humans are in control: you can control your destiny, even given imperfect LLMs.
On the other hand, agents are sort of the Tesla of AI systems. They're good for researching unsolved problems, where you need the LLM to figure out what to do. The upside is there's less logic for you to write yourself, and sometimes you get these hints of brilliance. The problem is, like your Tesla, it works very well most of the time, but sometimes you still take the wrong exit. And that's when you kind of miss your Toyota.
And the decision between workflows and agents is a pretty tricky one, because it depends highly on the capabilities of the underlying models. Some things that don't work in the agentic loop now may start working as models get better.
But recently, one thought that really changed how I think about it is: what if you don't predefine the steps, and instead let the agent figure out what needs to be done to achieve the task? You give it a task, it figures out one step at a time, and at the end, when the agent finishes the execution, you look at the trace of what happened.
So if I represent this in a programming kind of way, an agent takes a task and generates a workflow as it executes. And if we think of it this way, that an agent takes a task and generates a workflow, then you can see there are really good synergies between workflows and agents.

The first synergy is that you can use workflows as evaluation for your agents. Over time you can collect a huge amount of golden workflows, a whole handbook of how tasks should be done in your company. And then you can evaluate your agents against them: did the agent actually figure out the right steps? This is a little bit different from end-to-end evaluation: you are not judging the agent by its end response, but by whether it actually took the right steps to get there.
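A sketch of this trace-based evaluation: judge the agent's trace against a golden workflow by checking whether the golden steps appear in order, rather than judging only the final response. The step names are illustrative.

```python
def step_match_score(golden_steps: list, trace_steps: list) -> float:
    """Fraction of golden steps the trace hit, in order (extra steps allowed)."""
    matched = 0
    for step in trace_steps:
        if matched < len(golden_steps) and step == golden_steps[matched]:
            matched += 1
    return matched / len(golden_steps)
```

Under this metric, a trace that took a detour but still covered every golden step scores 1.0, while a trace that skipped a required step scores lower.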
The second, and even better, way for workflows to help agents is to use workflows to train your agents. Here you truly get the best of both worlds: the agent can follow the golden workflows you have in your library for known tasks, but it can also rely on its own internal reasoning capabilities to compose different workflows together to achieve new tasks, and even use its own reasoning to extend what you teach it and make it better.
And then agents can help workflows as well. One way to do that is, for workflow-building platforms, you can use an agent to generate the workflows. This is sort of how Glean agents work under the hood: the user gives a natural-language description of the task, the agent generates a draft workflow, and then the user can make edits or changes to it.
And lastly, and I think this is the most powerful synergy, you can use agents as a workflow discovery engine. Users try to accomplish new tasks with your agent, and when they find that the agent did a good job, they can save that trace: OK, this is how you do this task in my company. And then over time, you can use this as training data.
A question I often get is: do we still need this kind of stuff in a world where we have AGI? So, AGI is going to be a super-intelligent employee, right? But if AGI doesn't know how your company works, it's like a really good employee who just joined: they don't know all the business practices, they still need onboarding, they need to know who to talk to to get unblocked, and they need all the very nuanced ways of doing things in the enterprise. So enterprise-aware AGI is fully onboarded and very intelligent.
And one key insight, I think, is that there are many acceptable ways to achieve a task, but there's a gap between an acceptable output and a great one. One example is competitor analysis: sure, the agent can do some basic Google searches and read some notes from outside to produce a competitor analysis, but does it actually follow the protocols and processes that your company defines? And does it actually address all the key metrics that your executives really care about?
So given all this data, the tasks and goals and workflows, how do you actually train your agents using it? There are two main ways we have experimented with. The first is fine-tuning, and there are two main flavors of fine-tuning here. One is supervised fine-tuning, where you give an input and an expected output, and you train your model to just mimic that behavior.
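An illustrative shape for one such supervised fine-tuning example: the task is the input, and the golden workflow, serialized as an ordered list of steps, is the output the model learns to mimic. The step names here are hypothetical.

```python
# One (input, target) training pair: task in, serialized golden workflow out.
sft_example = {
    "input": "Run a competitor analysis for the last quarter",
    "target": " -> ".join([
        "find_recent_customer_calls",
        "extract_competitor_mentions",
        "run_analysis_per_competitor",
        "format_report_with_key_metrics",
    ]),
}
```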
The second flavor is RLHF, where you don't have a golden label, but you have a rating or a reward: for this task, is this workflow a good one or a bad one? Then you can run your favorite optimization algorithm to fine-tune the model. The pro of this method is that it can learn really well when you have a lot of data: if you have a huge amount of tasks and workflows, the model can really learn and generalize across different tasks.
The problem here is, one, you kind of have to create a fork of the base model. You do some fine-tuning, and by the time the fine-tuning finishes, maybe a new and better model has already come out, and you have to redo this whole process again.
The second problem is that any change to your training data means retraining. If you add a new tool, maybe some of the existing workflows become outdated. If you change some business priorities or business processes, you have to redo the training again. And it's also not super flexible for personalization: given the same task, different teams or different employees might have different optimal workflows, and fine-tuning is not well suited for those use cases.
Then comes the second option, which is dynamic prompting through search. Given the same labeled data mapping tasks to golden workflows, you build a really good search engine for tasks, so that you can find similar tasks given a new task. At inference time, you find the most similar tasks in the training data, and then you feed the representations of those workflows to the LLM as examples.
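A sketch of that retrieve-then-prompt flow, using token-overlap (Jaccard) similarity as a toy stand-in for a real task search engine: retrieve the most similar past tasks, then splice their golden workflows into the prompt as examples.

```python
def similarity(a: str, b: str) -> float:
    # Toy Jaccard similarity over tokens; a real system would use
    # hybrid lexical + embedding search instead.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def retrieve(task: str, library: list, k: int = 2) -> list:
    # Rank the (task, workflow) library by similarity to the new task.
    ranked = sorted(library, key=lambda ex: similarity(task, ex["task"]), reverse=True)
    return ranked[:k]

def build_prompt(task: str, library: list) -> str:
    # Feed the retrieved workflows to the LLM as few-shot examples.
    shots = "\n".join(
        f"Task: {ex['task']}\nWorkflow: {ex['workflow']}"
        for ex in retrieve(task, library)
    )
    return f"{shots}\nTask: {task}\nWorkflow:"
```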
So here you have a spectrum of determinism and creativity. When there's no workflow that matches your input task, the LLM can use its creativity to generate a new workflow. But when there's a high-confidence match to something you have done before, the LLM will give you a workflow that's very similar to the golden one.
To come back to the competitor analysis example from before: you've collected this huge list of task-to-workflow pairs. When a new task comes in, say "what competitors have we been running into recently?", the system will retrieve the workflow for how to analyze each competitor, and it will also find a workflow for how to find your recent customer calls. The LLM then takes those examples and generates a composed workflow that reads customer calls, reads internal messages, extracts competitors, and runs an analysis for each of them.
OK, so comparison time. Fine-tuning with RLHF is very strong when you have a lot of data that you want to generalize from. Dynamic prompting through search is more flexible, and it also gives you better interpretability: you can look at the exact examples that affected your outputs. Fine-tuning is good for learning generalized behaviors, where the ground-truth labels don't change over time or across different users. Dynamic prompting through search is better for learning customized behaviors, or closing the last-mile quality gap, where requirements are changing quickly.
One analogy I think about for fine-tuning versus dynamic prompting: fine-tuning is very similar to building custom hardware. When you have a task that you really want to optimize for, and the requirements don't change over time, you can build custom hardware that does it very well, but it's costly when your requirements change. Dynamic prompting, by comparison, is more like writing software: not as optimized, but you can change it very quickly.
Last point: how do we actually build this workflow search, where you give it a task and it finds similar tasks? I would say it's very similar to building document search, and there are two key components. The first one is what everyone usually thinks of when they think of search, which is textual similarity: given this task, what are some similar-sounding tasks in the training data? Here the golden recipe is hybrid search between lexical matching and vector embeddings.
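The lexical and embedding rankings then have to be merged into one. The talk doesn't name a fusion method, so as an assumption, here is a toy sketch of reciprocal rank fusion (RRF), a common choice for hybrid search:

```python
def rrf(rankings: list, k: int = 60) -> list:
    """Merge several rankings (lists of doc ids, best first) into one."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            # A document scores higher the nearer the top it appears
            # in each ranking; k damps the influence of any one list.
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```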
But what I found is that in enterprise settings, pure textual similarity is not enough. When you give users the ability to create workflows and write documents, then when you want to search for something, there will be hundreds or thousands of similar-looking documents or workflows, and the problem becomes how you choose the right one. This is what I call authoritativeness. To solve this problem, you kind of have to go into the knowledge graph: if this workflow was created by someone I work closely with, it has a high success rate, and people post about it on Slack, then it's more likely to be the right one. These authoritativeness signals are very hard to encode directly into an LLM, which is why we have a separate system that does the search for workflows.
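A sketch of an authoritativeness score layered on top of textual relevance, combining the signals just mentioned (author proximity, success rate, Slack mentions). The weights and the cap on mentions are illustrative assumptions, not Glean's actual model.

```python
def authority_score(workflow: dict) -> float:
    # Illustrative weights; a production system would learn these.
    weights = {"success_rate": 0.5, "author_proximity": 0.3, "mentions": 0.2}
    # Saturate the mention signal so one viral Slack post can't dominate.
    mention_signal = min(workflow.get("mentions", 0) / 10.0, 1.0)
    return (
        weights["success_rate"] * workflow.get("success_rate", 0.0)
        + weights["author_proximity"] * workflow.get("author_proximity", 0.0)
        + weights["mentions"] * mention_signal
    )
```

In practice a score like this would break ties among the hundreds of similar-looking workflows that textual similarity alone cannot distinguish.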
So, key takeaways. Workflows are good for determinism, and humans are in control. The synergies between agents and workflows: workflows can be used for agent evaluation, workflows can be used for agent training, and agents can be used for workflow discovery. Fine-tuning is good for generalized behaviors; dynamic prompting through search is good for personalized behaviors.
All right, I still have one minute and 30 seconds, maybe time for one question.

I'm kind of curious about the fine-tuning; I think you mentioned that it needs a lot of data, right?

So the question was, let me try to reinterpret it, let me know if it's wrong: how much data do we need to do fine-tuning given the new-- That's a very difficult question to answer, because it really depends on how out-of-distribution your task is compared to the internal knowledge of the LLM. But I'll catch you after, and we can talk more.