Hi, everybody. You probably met me this morning when I greeted you all to the conference. My name is Laurie. I'm VP of Developer Relations at Llama Index. Today I'm going to be talking very briefly about Llama Index and what it is, because I've only got 15 minutes. Then we're going to talk about agents and how they are built, plus a very, very brief refresher on RAG and why it's necessary for agents.
Then we're going to look at some high-level design patterns that improve the performance of your agent. And we won't have time to dive into how to build an agent as well as a multi-agent system because we've only got 15 minutes. So let's go fast. So what is Llama Index?
We are a bunch of things, so let's start with the most obvious. We are a framework in Python and TypeScript for building generative AI applications, and we are particularly good at building agents. We also have a service called Llama Parse. Llama Parse will parse complicated document formats for you: PDFs, Word documents, PowerPoints, those sorts of things.
Being able to parse your unstructured data effectively is crucial to building an effective agent, and Llama Parse will demonstrably improve the quality of the agents you build by making the data easier for an LLM to understand than if you just try to feed these things in through an open-source parser or something like that.
We also have an enterprise service called Llama Cloud. If what you want to do is stuff documents into one end and get a retrieval endpoint out of the other, then that is the service for you. Unlike the rest of Llama Index, it costs money. It is available as a SaaS at cloud.llamaindex.ai or you can get it deployed onto your own private cloud.
We also have a website called Llama Hub, which is a huge registry of open-source software that plugs into the framework and integrates with everything. So if you need to get your data out of Notion, or out of Slack, or out of any database in the world, that is where you find the adapters.
If you want to store your data in any vector database that exists, the adapters exist for that. And we also integrate with every LLM that exists. So, 400 different models over 80 different LLM providers, including local ones like Llama 3. Oh, and it also has pre-built agent tools. So if you are building an agent, you can just plug in an existing agent tool without having to build one yourself.
So why should you use Llama Index? Because we will help you go faster. That is the base promise of a framework generally. You have actual business and technology problems to solve and you have limited time. A framework is going to help you get past those by skipping the boilerplate, getting best practices for free, and getting to production faster in general.
So, what can you build in Llama Index? Well, anything, obviously, but there are two things that we are particularly good at. One is retrieval-augmented generation, and the other is agents, both of which you are probably familiar with already, because we won't shut up about agents at this conference.
So, what is an agent? "Agent" is a dramatically overused term in the industry; just about everything is an agent in 2025. But what I mean when I say agent is a piece of semi-autonomous software that can use tools to achieve a goal, without you having to explicitly specify what steps it's going to take to achieve that goal.
And the tools can do anything, which is great: they can retrieve information or they can take action. Agents are a dramatic departure from traditional programming, because the LLM is given decision-making power to decide which tools to use. That makes them extremely flexible and powerful. So, the time to build an agent is when that flexibility, that ability to deal with the unexpected or the unknown, is going to come in useful.
And when that's really useful is when you have a bunch of unstructured data. LLMs are extremely good at handling messy inputs, and the world is full of messy inputs. So, there's a whole lot of applications for LLMs. In general, I regard a good agent use case as any situation where an LLM is required to turn a large body of text into a smaller body of text.
I think that is a key principle of agent design and LLM use in general: they are not good at taking a small prompt and turning it into a big body of text, but they are very good at summarizing stuff down. So, interpreting a contract, processing an invoice, applying regulations, summarizing documents: anything where you need to turn text into less text.
So, a calendar event, a decision, a report, an answer to a question, those are good applications for LLMs and good applications for agents. The most obvious application of LLMs is a chat interface where you give it questions and it answers. I encourage you to think beyond the chat bot.
Chat bots are a very 2023 way of using an LLM. We believe a much greater addressable surface is if you integrate them into existing software. So, you use the LLM's capability to handle messy inputs, to handle unstructured data, and turn it into structured data that you can then make decisions about and feed into your regular software, as in the sketch below.
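To make that concrete, here's a minimal sketch of that pattern: an LLM pulls a structured calendar event out of a messy email. The Pydantic model, the model name, and the structured_predict call are my own illustration of Llama Index's structured-output helpers, so check them against the version you have installed.

```python
from pydantic import BaseModel

from llama_index.core.prompts import PromptTemplate
from llama_index.llms.openai import OpenAI


class CalendarEvent(BaseModel):
    """The structured shape we want out of a messy email."""
    title: str
    date: str
    attendees: list[str]


llm = OpenAI(model="gpt-4o-mini")  # placeholder model name

event = llm.structured_predict(
    CalendarEvent,
    PromptTemplate("Extract the calendar event from this email:\n{email}"),
    email="Hey team, let's sync on the Q3 roadmap next Tuesday at 3pm. -- Priya, Sam",
)
print(event.title, event.date, event.attendees)  # plain structured data your software can act on
```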
That is a really productive and powerful set of use cases, and we think it's much bigger than the market for chat bots. I've talked a lot about unstructured data. The reason I talk about it is because that is where LLMs become really useful. Unless you are building something extremely generic, you are not going to be able to get anything useful out of an LLM by just asking it questions.
You're going to have to give it contextual data relevant to your company, to your domain, to whatever problem set you are working on. So, you have to feed the LLM your data, and the problem is that you have tons of data. This is the use case for RAG.
You take all of your data and embed it, which means turning it into vectors that you can search over in a vector database. You then embed your queries, your questions about your data, in the same way, and they will end up mathematically nearby in vector space to the context you fed in that is relevant.
So, instead of having to take all of your data and feed it to the LLM every single time, which would be tremendously slow and tremendously expensive, you can just feed in the most relevant context out of your data corpus and answer questions about that. This is why RAG will never die.
People keep talking about larger and larger contexts and more and more powerful models, but it's always going to be cheaper and faster to send less data so the LLM has less to think about, and you're always going to get better answers if the context you give your LLM is more specific.
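In Llama Index, the basic version of this looks something like the sketch below; the "data" directory and the question are just placeholders.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load your unstructured data and embed it into a vector index.
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# At query time, only the top 3 most relevant chunks are sent to the LLM.
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("What does our refund policy say about late returns?")
print(response)
```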
So, agents can use RAG as one of their tools; agents need RAG, but RAG also needs agents. RAG by itself, naive top-k RAG, where you just throw in a query, retrieve the most relevant context, and feed that to the LLM, is not going to work very well in a variety of situations.
But what we've found through lots of production use cases is that layering an agent on top of your RAG will produce significantly higher-quality results, and it is capable of doing things that RAG just can't do. RAG is a simple question-and-answer robot, whereas an agent can do things like introspection: could this complicated question be answered more easily if it were a series of simpler questions?
Do I need to try extracting that data again because the data that I got out is nonsense? Did I just give a sensible answer to your question, or should I try again? That is something that an agent can do. It can look at its own responses and improve itself, which RAG by itself obviously doesn't do.
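One common way to layer an agent on top of RAG is to wrap the query engine as a tool and hand it to an agent, which then decides when to call it, whether to split the question up, and whether to retry. The sketch below assumes the FunctionAgent class and QueryEngineTool helper; the import paths, tool name, and model name are my own choices, so check them against your installed version.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.core.tools import QueryEngineTool
from llama_index.llms.openai import OpenAI

# Plain RAG, as before.
index = VectorStoreIndex.from_documents(SimpleDirectoryReader("data").load_data())

# Wrap the query engine as a tool the agent can choose to call (and re-call).
rag_tool = QueryEngineTool.from_defaults(
    query_engine=index.as_query_engine(similarity_top_k=3),
    name="company_docs",
    description="Answers questions about our internal documents.",
)

agent = FunctionAgent(
    tools=[rag_tool],
    llm=OpenAI(model="gpt-4o-mini"),
    system_prompt=(
        "Break complicated questions into simpler ones, use the company_docs "
        "tool to answer them, and check whether your answer makes sense."
    ),
)

# response = await agent.run("Summarize our refund policy and flag anything unusual.")
```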
So, agents improve the performance of RAG, both in terms of speed and, crucially, accuracy. In December of last year, Anthropic published an excellent post about how to build agents, in which they codified some design patterns for building effective agents that we immediately recognized from our own work building agents.
So, I'm going to go through them very quickly, and given the amount of time I have, that's probably all I'm going to be able to cover. They are chaining, routing, parallelization, orchestrator-workers, and the evaluator-optimizer. The first and most obvious is the chain: you use an LLM to do some work, you pass the output of that to another LLM, and you pass the output of that to another LLM.
It is trivial to build, especially in Llama Index. This is what a chain looks like in Llama Index. We use an abstraction called workflows, where you define regular Python functions that do whatever you need them to do, and you use type annotations to define how events pass from step to step within a workflow.
It is a very simple and flexible pattern that our users like a lot. Llama Index workflows have a built-in visualizer, the output of which you can see here, and this is obviously a chain. There is much, much more to LLM applications than a chain, though, despite what the names of some other frameworks might indicate.
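In code, a two-step chain looks roughly like the sketch below. The event names are made up and the LLM calls are stubbed out, but the type annotations are what wire the steps together.

```python
from llama_index.core.workflow import Event, StartEvent, StopEvent, Workflow, step


class DraftEvent(Event):
    text: str


class ChainWorkflow(Workflow):
    @step
    async def draft(self, ev: StartEvent) -> DraftEvent:
        # First LLM call would go here: turn the topic into a rough draft.
        return DraftEvent(text=f"a rough draft about {ev.topic}")

    @step
    async def polish(self, ev: DraftEvent) -> StopEvent:
        # Second LLM call would go here: clean up the previous step's output.
        return StopEvent(result=ev.text.upper())


# result = await ChainWorkflow(timeout=60).run(topic="agents")
```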
The next pattern Anthropic called out is routing. In this one, you create several LLM-based tools to solve a problem in different ways, or to solve different types of problems, and you give the LLM the decision-making power to say: which of these tools should I call? Which of these different LLM paths should I follow?
Again, not that complicated a concept, and simple to build in Llama Index. I'm going to spare you the code this time. You can do it using branches: you can just decide that you're going to split off into another chain and do a different series of work based on the original decision made by the LLM.
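Still, very roughly, a branch looks like this in a workflow: the step's return type is a union, and whichever event it returns decides which branch runs next. The event names and the keyword check standing in for the LLM's decision are my own.

```python
from llama_index.core.workflow import Event, StartEvent, StopEvent, Workflow, step


class MathQuery(Event):
    query: str


class GeneralQuery(Event):
    query: str


class RouterWorkflow(Workflow):
    @step
    async def route(self, ev: StartEvent) -> MathQuery | GeneralQuery:
        # In a real router, an LLM would make this decision.
        if "calculate" in ev.query.lower():
            return MathQuery(query=ev.query)
        return GeneralQuery(query=ev.query)

    @step
    async def handle_math(self, ev: MathQuery) -> StopEvent:
        return StopEvent(result=f"math branch handled: {ev.query}")

    @step
    async def handle_general(self, ev: GeneralQuery) -> StopEvent:
        return StopEvent(result=f"general branch handled: {ev.query}")
```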
The next pattern is parallelization, which is where things begin to get interesting. Anthropic defines this as running several LLMs in parallel and then aggregating their results, and they define parallelization as having two flavors. The first is sectioning. This is where you take the same input and you act on it in completely different ways.
The canonical use case of this is guardrails. The user has a query or a piece of input that they want processed; you use one of your tracks to actually process the data or answer the query, and you use the second track to ask: is this an illegal request? Is this against my rules? Those two questions are related, but they can be answered in parallel, and you can use your guardrails to cut off the answer from your processing track if it turns out to be illegal, or otherwise undesirable. The other flavor of parallelization is voting.
This is where you take exactly the same query and give it to three different tracks. The tracks can be literally the same LLM, because LLMs are non-deterministic and might not give the same answer every time, or you can give it to multiple different LLMs, which have different capabilities and different specialties. Then you take the answers and see whether the tracks came to the same conclusion.
You can take a majority vote or require a unanimous vote, and this is a great way of reducing hallucination: if three different LLMs come to the same conclusion, then it's probably not the LLM just making stuff up, because LLMs hallucinate, but they hallucinate in different ways.
So they seldom hallucinate their way to the same answer. Whichever flavor you use, it's implemented the same way, using concurrency. Llama Index workflows allow you to emit multiple events simultaneously and then collect those events at the other end, so you can do work concurrently. And I think the visualization here is particularly pretty.
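A minimal sketch of the voting flavor is below; the event names, the stubbed-out answers, and the majority-vote logic are my own illustration of the send_event / collect_events primitives, so treat it as a sketch rather than the exact API.

```python
from collections import Counter

from llama_index.core.workflow import (
    Context, Event, StartEvent, StopEvent, Workflow, step,
)


class VoteRequest(Event):
    query: str


class VoteResult(Event):
    answer: str


class VotingWorkflow(Workflow):
    @step
    async def fan_out(self, ctx: Context, ev: StartEvent) -> VoteRequest:
        # Emit the same query three times; each copy is processed concurrently.
        for _ in range(3):
            ctx.send_event(VoteRequest(query=ev.query))

    @step(num_workers=3)
    async def answer(self, ctx: Context, ev: VoteRequest) -> VoteResult:
        # A real track would call an LLM here (the same one, or three different ones).
        return VoteResult(answer=f"an answer to: {ev.query}")

    @step
    async def tally(self, ctx: Context, ev: VoteResult) -> StopEvent:
        # Wait until all three results have arrived, then take a majority vote.
        results = ctx.collect_events(ev, [VoteResult] * 3)
        if results is None:
            return None
        winner, _ = Counter(r.answer for r in results).most_common(1)[0]
        return StopEvent(result=winner)
```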
The next pattern is orchestrator-workers. You can use an LLM to look at a complex task, like a multi-part question, split it into several simpler questions, and ask each of those questions in parallel. This is basically how deep research works: it takes a very complicated question and says, I'm going to look at all of the possible questions that could come out of this deep question.
I'm going to answer them all at the same time, and then I'm going to aggregate all of the answers I've got and turn them into one single coherent answer. This is a very powerful pattern that is doing a lot of good in the world right now, and it is also implemented using parallelization.
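A rough sketch of that fan-out-and-aggregate shape is below. The planner stub stands in for an orchestrator LLM, and the ctx.set / ctx.get state calls, the event names, and the worker counts are my own assumptions, so check them against the workflow docs for your version.

```python
from llama_index.core.workflow import (
    Context, Event, StartEvent, StopEvent, Workflow, step,
)


class SubQuestion(Event):
    question: str


class SubAnswer(Event):
    answer: str


class ResearchWorkflow(Workflow):
    @step
    async def plan(self, ctx: Context, ev: StartEvent) -> SubQuestion:
        # An orchestrator LLM would break the big question into these.
        sub_questions = [f"{ev.question} (part {i + 1})" for i in range(3)]
        await ctx.set("num_questions", len(sub_questions))
        for q in sub_questions:
            ctx.send_event(SubQuestion(question=q))

    @step(num_workers=4)
    async def research(self, ctx: Context, ev: SubQuestion) -> SubAnswer:
        # Each worker would answer its sub-question, for example with RAG.
        return SubAnswer(answer=f"an answer to {ev.question}")

    @step
    async def aggregate(self, ctx: Context, ev: SubAnswer) -> StopEvent:
        # Collect every answer, then a final LLM call would synthesize them.
        num = await ctx.get("num_questions")
        answers = ctx.collect_events(ev, [SubAnswer] * num)
        if answers is None:
            return None
        return StopEvent(result=" | ".join(a.answer for a in answers))
```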
The final pattern Anthropic called out is the evaluator-optimizer, which is also called self-reflection. In this pattern, you use an LLM to decide whether or not the LLM has done a good job. So you take your output, you feed it to an LLM, and you say: here was the original question, or here was the original input and the goal that I had.
Have you actually reached the goal that I had? And if not, you can get the LLM to generate feedback and send it back to the original first step and say: okay, you almost got the answer, but you hallucinated something, or you missed a part of the question, or something like that.
This is, again, easy to do in Llama Index workflows: you just create a loop and send yourself back to step one.
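Here's a minimal sketch of that loop. The judge is stubbed out, and the event names and the trivial "good enough" check are mine; the loop happens simply because the evaluator can return the event that feeds the first step.

```python
from llama_index.core.workflow import Event, StartEvent, StopEvent, Workflow, step


class RetryEvent(Event):
    question: str
    feedback: str


class DraftAnswer(Event):
    question: str
    answer: str


class ReflectionWorkflow(Workflow):
    @step
    async def generate(self, ev: StartEvent | RetryEvent) -> DraftAnswer:
        # An LLM would draft an answer here, using any feedback from the judge.
        question = getattr(ev, "question", "")
        return DraftAnswer(question=question, answer=f"an attempt at: {question}")

    @step
    async def evaluate(self, ev: DraftAnswer) -> StopEvent | RetryEvent:
        # An LLM-as-judge would check the answer against the original goal.
        good_enough = True  # stubbed-out judgment
        if good_enough:
            return StopEvent(result=ev.answer)
        # Otherwise, loop back to step one with feedback.
        return RetryEvent(question=ev.question, feedback="you missed part of the question")
```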
And the real power here is obviously combining all of these patterns: you can create arbitrarily complex workflows to handle any combination of circumstances. Right near the beginning, when I was defining an agent, I said that agents are defined by their ability to use tools. I have 60 seconds left, so I'm just going to go through the syntax in Llama Index super quick. This is what a tool definition looks like: it is just a Python function that you have wrapped as a tool.
And the way that you use your tool function is you just give it to an agent, and the agent will figure out that it is a tool and start using it. This also allows you to create workflows that are multi-agent systems. I do not have time to explain how multi-agent systems work, but this is how you create a multi-agent system in Llama Index.
You create a FunctionAgent, which takes a system prompt, an LLM, and a set of tools, and you can feed an array of agents into a multi-agent system, which then just sort of figures it out by itself, passing control from one agent to another.
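Something like the sketch below, where AgentWorkflow wires the agents together. The agent names, the toy tools, the model name, and parameters like can_handoff_to are my own assumptions, so check them against the docs for the version you have installed.

```python
from llama_index.core.agent.workflow import AgentWorkflow, FunctionAgent
from llama_index.llms.openai import OpenAI


def search_notes(query: str) -> str:
    """A stand-in tool: look something up in your own data."""
    return f"notes about {query}"


def format_report(text: str) -> str:
    """A stand-in tool: tidy up the final answer."""
    return text.strip()


llm = OpenAI(model="gpt-4o-mini")

researcher = FunctionAgent(
    name="researcher",
    description="Gathers information using its tools.",
    system_prompt="You research the question and hand your notes to the writer.",
    llm=llm,
    tools=[search_notes],
    can_handoff_to=["writer"],
)

writer = FunctionAgent(
    name="writer",
    description="Turns research notes into a clear final answer.",
    system_prompt="You write the final answer from the researcher's notes.",
    llm=llm,
    tools=[format_report],
)

# The multi-agent system itself really is one line.
workflow = AgentWorkflow(agents=[researcher, writer], root_agent="researcher")

# response = await workflow.run(user_msg="What did we decide about the Q3 roadmap?")
```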
Creating the multi-agent system itself is technically one line of code, and we're pretty proud of it. And that is about it. If you want a full agent and workflows tutorial, this was the simplest possible one; it is available at the link shown here, and that notebook will teach you how to build a deep research agent of your own.
And with that, I am pretty much out of time. If you have any questions, I will be outside in the hallway. Thank you very much.