Thanks, Alex, for the introduction. That was a very impressive LLM-generated summary of me; I'd never heard it before, but nice. So today I'm going to talk to you about something that has been keeping me up at night, and probably some of you too: how to build enterprise-aware agents, how to bring the brilliance of AI into the messy, complex realities of how your business operates.
So let's jump straight to the hottest question of the month for AI builders: should I build workflows, or should I build agents? So what are workflows? Workflows are systems where LLMs and tools are orchestrated through predefined code paths. There are two main ways you can represent workflows.
The first way is through imperative code. These are workflows where you write a program that calls LLMs, reads the responses, and then calls tools, in a traditional programming flow. Here you have direct control over the execution of every step.
The second way to represent workflows is through declarative graphs. Here you represent your workflow as a graph where the nodes are steps in which you call tools or call LLMs, and there are edges between the nodes. So you define the structure, but not the execution.
The execution is usually handled by some workflow framework. I'm not going to go into the details of the pros and cons of these two approaches. The main point is that with workflows, you get structure and predictability: if you run a workflow today, it will mostly behave the same way if you run it tomorrow.
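To make the two representations concrete, here is a minimal sketch; `call_llm`, `fetch_ticket`, and the toy graph runner are all hypothetical stand-ins I made up for illustration, not any particular framework's API.

```python
# Sketch of the two workflow representations. `call_llm` and
# `fetch_ticket` are hypothetical stubs standing in for real
# LLM and tool calls.

def call_llm(prompt: str) -> str:
    return f"summary of: {prompt}"          # stub for a real LLM call

def fetch_ticket(ticket_id: str) -> str:
    return f"ticket {ticket_id} contents"   # stub for a real tool call

# 1) Imperative workflow: plain code, direct control of every step.
def triage_ticket(ticket_id: str) -> str:
    ticket = fetch_ticket(ticket_id)
    return call_llm(ticket)

# 2) Declarative graph: nodes are steps, edges define structure;
#    a (toy) engine handles the actual execution.
graph = {
    "nodes": {"fetch": fetch_ticket, "summarize": call_llm},
    "edges": [("fetch", "summarize")],
}

def run_graph(graph: dict, start: str, arg):
    result = graph["nodes"][start](arg)
    for src, dst in graph["edges"]:
        if src == start:                    # follow the edge to the next step
            result = run_graph(graph, dst, result)
    return result
```

Both styles compute the same thing here; the difference is who owns execution: your code in the imperative case, the engine in the declarative case.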
On the other hand, we have agents, which are systems where LLMs dynamically direct their own processes: they decide how to achieve a task, what tools to call, and what steps to take, depending on the task itself. The core agent loop is pretty simple. The agent receives a task, or a goal, from a human.
Then it enters an iterative loop where it plans what to do, executes the action, reads the results from the environment, and iterates until it has the results it wants, and then responds to the user. So what are the trade-offs between workflows and agents?
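The loop just described can be sketched in a few lines; the planner and executor here are hypothetical stubs standing in for a real LLM and real tools.

```python
# Minimal sketch of the core agent loop: plan, act, observe, repeat.
# `plan_next_action` and `execute` are invented stubs, not a real API.

def plan_next_action(goal: str, history: list) -> dict:
    # Stub planner: a real agent would ask an LLM; this one stops
    # after two tool calls and responds.
    if len(history) >= 2:
        return {"type": "respond", "content": f"done: {goal}"}
    return {"type": "tool", "name": "search", "args": goal}

def execute(action: dict) -> str:
    return f"result of {action['name']}({action['args']})"  # stub tool call

def run_agent(goal: str) -> str:
    history = []
    while True:
        action = plan_next_action(goal, history)   # plan
        if action["type"] == "respond":            # done: answer the user
            return action["content"]
        observation = execute(action)              # act
        history.append((action, observation))      # observe, then iterate
```

Everything interesting lives in the planner; the loop itself stays this simple no matter how capable the model gets.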
Workflows are the Toyota of AI systems. They're very predictable. They're good when you want to automate repetitive tasks or encode existing best practices or know-how in your business. They're usually lower cost and lower latency, because you don't spend time on LLM calls to decide what to do.
They're also easier to debug, because you have the code or the graph, so you can manually pinpoint which step went wrong in the execution. And in building workflows, humans are in control: given imperfect LLMs, you can control your destiny, doing tweaks and engineering so that your tasks work right now.
On the other hand, agents are the Tesla of AI systems. They're more open-ended, which is good for researching unsolved problems. They're also usually good at taking advantage of better and better LLM capabilities, because here the AI is in control. Generally, they're higher cost and latency, because you need the LLM to figure out what to do.
But the upside is there's less logic to maintain; the core loop is very simple. And sometimes you get these glimpses of brilliance that make it feel like everything is going to be automated in a few months. The problem is, like your Tesla, it works very well most of the time, but sometimes it still takes the wrong exit on the highway.
And that's when you kind of miss your Toyota. The decision to build workflows or agents is a pretty tricky one, because it depends highly on the state of the LLM. Some tasks that don't work in the agentic loop now might start to work in a few months when new models come out.
So it's a real dilemma. But recently, one thought really changed how I think about it: what if you don't have to choose? If you think about what an agent does, when you give it a task, it figures out the steps that need to be done to achieve that task, right?
You give it a task, it figures out one step, takes the action, figures out the next step. And at the end, when the agent finishes the execution and you look at the trace of what happened, that series of steps is a workflow. So if I represent this in a programming kind of way, an agent takes a task and generates a workflow to achieve that task.
If we think of it this way, that an agent takes a task and generates a workflow, then you can see there are really good synergies between workflows and agents. The first is that you can use workflows as evaluations for your agents. Let's say in your company you collect a huge set of golden workflows:
given a task, these are the steps that need to be done to solve it. You have a huge list of those, a sort of handbook on how to do things in your company. Then you can evaluate your agents by giving them a task, seeing what they did, and comparing it to the golden workflow.
Did the agent actually figure out the right steps? This is a little different from evaluating end to end: you're not judging agents by the end response, but by whether they took the right steps to get to that end response. The second, and even better, way for workflows to help agents is that, given that same golden workflow library, you can also use it to train your agents.
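One simple way to score an agent trace against a golden workflow, sketched here purely as an illustration, is the longest-common-subsequence overlap between the steps taken and the golden steps; the step names are invented for the example.

```python
# Step-level evaluation: score an agent's trace against a golden
# workflow by the fraction of golden steps it executed in order
# (longest common subsequence). Step names are illustrative.

def step_match_score(agent_steps: list, golden_steps: list) -> float:
    """LCS length of the two step sequences / number of golden steps."""
    m, n = len(agent_steps), len(golden_steps)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for a in range(m):
        for g in range(n):
            if agent_steps[a] == golden_steps[g]:
                dp[a + 1][g + 1] = dp[a][g] + 1
            else:
                dp[a + 1][g + 1] = max(dp[a][g + 1], dp[a + 1][g])
    return dp[m][n] / n

golden = ["read_crm_notes", "search_calls", "extract_competitors", "write_report"]
trace = ["read_crm_notes", "search_calls", "write_report"]

score = step_match_score(trace, golden)  # 3 of 4 golden steps, in order
```

A real evaluator would also need fuzzy matching of steps (the agent may phrase a tool call differently), but the idea of grading the path rather than the final answer is the same.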
Here you truly get the best of both worlds: with that data fed in, your agents will execute the exact workflows you have in your library for known tasks, but they can also rely on their own internal reasoning capabilities to compose different workflows together to achieve new tasks, and even use their own reasoning to extend what you taught them and make it better.
Agents can help workflows as well. One way is for workflow-building platforms to use an agent to generate the workflows. This is how Glean agents work under the hood: the user gives the workflow builder a natural-language description of the task they're trying to achieve.
Then we run an agent to figure out the steps needed to achieve that workflow, and the user can edit or change the workflow the agent proposed. And lastly, what I think is the most powerful synergy: you can use agents as a workflow discovery engine.
You ship an agent, and users try to accomplish new tasks with it. When they find that the agent did a good job, you save that workflow: OK, this is how you do this task in my company. And over time, you can use this as training data to help agents get better.
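The discovery loop can be as simple as promoting user-approved traces into the golden library; all names and fields here are illustrative assumptions, not a real system's schema.

```python
# Agents as a workflow discovery engine: when a user approves an
# agent run, promote its trace into the golden workflow library,
# where it can later feed evaluation or training.

golden_library: dict = {}

def record_run(task: str, trace: list, user_approved: bool) -> None:
    if user_approved:
        golden_library[task] = trace  # saved for future eval / training

record_run(
    "weekly competitor report",
    ["search_calls", "extract_competitors", "write_report"],
    user_approved=True,
)
record_run("one-off experiment", ["try_something"], user_approved=False)
```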
So that was the main point of my talk. I guess some of you are thinking: do we still need this kind of stuff in a world where we have AGI? Here's my thought experiment, and why I think this may still be needed after AGI. AGI is going to be a super-intelligent employee, right?
But if AGI doesn't know how your company works, it's like a really good employee who just joined and doesn't know all the business practices: it still needs onboarding, needs to know who to talk to to get unblocked, and needs all the very nuanced ways of doing things in the enterprise.
So what is enterprise-aware AGI? Enterprise-aware AGI is fully onboarded, very intelligent, and knows the way your company does things. One key insight, I think, is that there are many acceptable ways to achieve a task, but there's a gap between an acceptable output and a great output.
One example is competitor analysis. Sure, it can do some basic Google searches and read some external notes to do a competitor analysis, but does it actually follow the protocols or processes that your company has defined? And does it address all the key metrics that your executives really care about?
So given all this data, the tasks and goals and workflows, how do you actually train your agents with it? This is the second part of my talk. There are two main approaches we have experimented with. The first one is fine-tuning, and there are two main flavors of fine-tuning here.
One is supervised fine-tuning, where you give an input and an expected output, and you train your model to mimic that behavior. The other is RLHF, where you don't have a golden label, but you have a rating, a reward: for this task, is this workflow a good one or a bad one?
Then you can run your favorite optimization algorithm to fine-tune the LLM. The pro of this method is that it can learn really well when you have a lot of data. If you have a huge number of tasks and workflows, the model can really generalize across different tasks and combine workflows.
The problem is, first, you have to create a fork from the frontier LLM. You start with some LLM, you do the fine-tuning, and by the time the fine-tuning finishes, maybe a new and better model has already come out, and you have to redo the whole process again.
Second, any change to your training data means retraining. If you add a new tool, maybe some existing workflows are outdated, and you have to retrain. If you change some business priorities or business processes, you have to redo the training again.
It's also not very flexible for personalization. Given the same task, different teams or different employees might have different optimal workflows, and fine-tuning is not well suited to those use cases. Then comes the second option: dynamic prompting through search.
Given the same labeled data mapping tasks to golden workflows, you build a really good search engine over tasks, so that given a new task you can find similar tasks. Then at runtime, to accomplish a new task, you find the most similar tasks in the training data and feed the representations of those workflows to the LLM as examples.
Here you have a spectrum from determinism to creativity. When no workflow matches your input task, the LLM is in control: it can use its creativity to generate a new workflow. But when there's a high-confidence match to something you've done before, the LLM will give you a workflow that's very similar to what was in the training data.
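A toy sketch of dynamic prompting through search: a bag-of-words cosine similarity stands in for a real embedding model and index, and the little workflow library is invented for the example.

```python
# Dynamic prompting through search: retrieve the most similar golden
# workflows for a new task and splice them into the prompt as examples.
# Bag-of-words cosine is a stand-in for real hybrid search.
import math

def embed(text: str) -> dict:
    words = text.lower().split()
    return {w: words.count(w) for w in words}   # toy term-frequency vector

def cosine(a: dict, b: dict) -> float:
    dot = sum(a[w] * b.get(w, 0) for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) \
         * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

golden_workflows = {
    "analyze competitor pricing": ["search web", "read notes", "summarize"],
    "summarize customer calls": ["fetch calls", "extract topics", "summarize"],
}

def build_prompt(task: str, k: int = 1) -> str:
    ranked = sorted(golden_workflows,
                    key=lambda t: cosine(embed(task), embed(t)),
                    reverse=True)
    examples = "\n".join(f"Task: {t}\nSteps: {golden_workflows[t]}"
                         for t in ranked[:k])
    return f"{examples}\n\nNew task: {task}\nSteps:"
```

The LLM itself never changes; only the examples in the prompt do, which is what makes this approach easy to update when workflows or priorities change.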
One very concrete example, coming back to the competitor analysis from before: you've collected this huge list of task-to-workflow pairs. Then when a new task comes in, say "what competitors have we been running into recently," it will retrieve a workflow on how to analyze each competitor, and a workflow on how to find your recent customer calls.
The LLM will take those examples and generate a composed workflow, where it reads customer calls, reads internal messages, extracts competitors, and then runs an analysis for each of them. OK, comparison time. Fine-tuning and RLHF are very strong when you have a lot of data you want to generalize from.
Dynamic prompting through search is more flexible, and it also gives you better interpretability: you can look at the exact examples that affected your outputs. Fine-tuning is good for learning generalized behaviors, where the ground-truth labels don't change over time or across different users. Dynamic prompting through search is better for learning customized behaviors, or for the last-mile quality gap where requirements change quickly.
One analogy I think about for fine-tuning versus dynamic prompting: fine-tuning is like building custom hardware. When you have a task you really want to optimize for, and the requirements don't change over time, you can build custom hardware that does it very well.
But it's costly when your requirements change. Dynamic prompting, by comparison, is more like writing software: not as optimized, but you can change it very quickly. Last point: how do we actually build this workflow search? Given a task, how do you find similar tasks?
I would say it's very similar to building document search, and there are two main components. The first is what everyone usually thinks of when they think of search, which is textual similarity: given this task, what are the similar-sounding tasks in the training data?
Here the golden recipe is hybrid search: lexical matching, vector embeddings, re-ranking, late interaction, all that. But what I've found is that in enterprise settings, pure textual similarity is not enough. When you give users the freedom to create workflows and write documents, any search will surface hundreds or thousands of similar-looking documents or workflows.
The problem becomes how to choose the right one, which is what I call authoritativeness. To solve this, you have to go into the knowledge graph: if this workflow was created by someone I work closely with, it has a high success rate, and people post about it on Slack, then it's more likely to be the right one.
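As an illustration of blending textual similarity with authoritativeness, here is a toy ranking function; the specific signals, weights, and field names are invented assumptions, not Glean's actual formula.

```python
# Toy ranking that blends textual similarity with authoritativeness
# signals from the knowledge graph. Signals and weights are invented.

def authority_score(wf: dict) -> float:
    score = 0.0
    if wf.get("author_in_team"):                  # created by a close collaborator
        score += 0.3
    score += 0.5 * wf.get("success_rate", 0.0)    # historical success rate
    score += 0.2 * min(wf.get("mentions", 0) / 10, 1.0)  # e.g. Slack mentions
    return score

def rank(candidates: list, text_sim: dict) -> list:
    """Sort candidate workflows by blended similarity + authority."""
    return sorted(
        candidates,
        key=lambda wf: 0.6 * text_sim[wf["id"]] + 0.4 * authority_score(wf),
        reverse=True,
    )

cands = [
    {"id": "a", "author_in_team": False, "success_rate": 0.2, "mentions": 0},
    {"id": "b", "author_in_team": True, "success_rate": 0.9, "mentions": 12},
]
sims = {"a": 0.82, "b": 0.80}  # nearly identical textual similarity
```

When the textual scores are nearly tied, as here, the graph signals break the tie, which is exactly the situation enterprise search keeps running into.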
All the tricks from the recommender systems world also apply here to workflow search. And these authoritativeness signals are very hard to encode directly into an LLM, which is why we have a separate system that does the search for workflows. Cool. So, key takeaways: workflows are good for determinism, and humans are in control.
Agents are more open-ended, and the AI is in control. The synergies between agents and workflows: workflows can be used for agent evaluation, workflows can be used for agent training, and agents can be used for workflow discovery. Fine-tuning is good for generalized behaviors; dynamic prompting through search is good for personalized behaviors. All right, I still have one minute and thirty seconds, maybe time for one question.
Is that actually one question? Yeah. I'm kind of curious about the fine-tuning. I think you mentioned that it needs a lot of data, right? How about with the new, like, RL... [inaudible]. So the question was... I'll try to reinterpret it, let me know if it's wrong.
How much data do we need to do fine-tuning, given the new RLVR? That's a very difficult question to answer, because it really depends on how out-of-distribution your task is compared to the internal knowledge of the LLM. But I'll catch you after, and we can talk more. Yes.
Thank you. It's too difficult of a question to answer.