Hey, everybody. I'm Jerry, co-founder and CEO of LlamaIndex, and I'm excited to be here today to talk about the future of knowledge assistants. So let's get started. First, everybody's building stuff with LLMs these days. Some of the most common use cases we're seeing throughout the enterprise include the following.
They include document processing, tagging, and extraction. They include knowledge search and question answering. If you've followed our Twitter for the past year or so, we've talked about RAG probably 75% of the time. And then you can generalize that question-answering interface into an overall conversational agent that can not only do one-shot query and search, but actually store your conversation history over time.
And of course, this year, a lot of people are excited about building agentic workflows that can not only synthesize information, but actually perform actions and interact with a lot of services to basically get you back the thing that you need. So let's talk about specifically this idea of building a knowledge assistant, which, you know, we've been very interested in since the very beginning of the company.
The goal is to basically build an interface that can take in any task as input and get back some sort of output. So the input forms could be, you know, a simple question. It could be a complex question. It could be a vague research task. And the output forms could be a short answer.
It could be a research report, or it could be a structured output. RAG was just the beginning. Last year, I said that RAG was basically just a hack. And there's a lot of things that you can do on top of RAG to basically make it more advanced and sophisticated.
If you build a knowledge assistant with a very basic RAG pipeline, you run into the following issues. The first is a naive data processing pipeline: you put your documents through some basic parser, do some sentence splitting and chunking, and do top-k retrieval. And then you realize, even if it only took you 10 minutes to set up, that it's not suitable for production.
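To make that concrete, here's roughly what that ten-minute naive setup looks like in LlamaIndex; the data folder and the question are just placeholders.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load raw files (PDFs go through a basic pypdf-based reader by default), split them
# into chunks, embed them, and do top-k retrieval plus a single LLM synthesis call.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=2)

print(query_engine.query("What does the report say about Q1 revenue?"))
```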
A pipeline like that also can't really understand more complex, broader queries, so there's no query understanding and planning. There's no sophisticated way of interacting with other services. And it's stateless, so there's no memory. So in this setting, we've said that RAG is kind of boring if it's just the simple RAG pipeline.
It's really just a glorified search system on top of some retrieval methods that have been around for decades. And there's a lot of questions and tasks that naive RAG can't give an answer to. And so, one thread that we've been pulling a lot on is basically figuring out how to go from simple search and naive RAG to building a general context-augmented research assistant.
So we'll talk about these three steps with some cool feature releases, you know, in the mix. But the first step is basically advanced data and retrieval modules. Even if you don't, you know, care about the fancy agentic stuff, you need good core data quality modules to basically help you go to production.
The second is advanced single-agent query flows, building some agentic RAG layer on top of existing data services as tools to basically enhance the level of query understanding that your QA interface provides. And then the third, and this is quite interesting, is this whole idea of a general multi-agent task solver, where you extend beyond even the capabilities of a single agent towards multi-agent orchestration.
So let's talk about advanced data and retrieval as a first step. The first thing is that any LLM app these days is only as good as your data, right? Garbage in, garbage out. If you're an ML engineer, you've heard that kind of statement many times. And so this shouldn't be net new, but it applies in the case of LLM app development as well.
Good data quality is a necessary component of any production-grade LLM application, and you need a data processing layer to translate raw unstructured and semi-structured data into a form that's good for your LLM app. The main components of data processing, of course, are parsing, chunking, and indexing, and I'll start with parsing right after the quick sketch below.
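As a rough sketch (one reasonable configuration, not the only one), that processing layer in LlamaIndex terms is an ingestion pipeline of transformations; the chunk size and embedding model here are just example choices.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding

# Parsing happens at load time; chunking and embedding are explicit transformations.
documents = SimpleDirectoryReader("./data").load_data()
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=64),
        OpenAIEmbedding(),
    ]
)
nodes = pipeline.run(documents=documents)

# Index the processed nodes for retrieval.
index = VectorStoreIndex(nodes=nodes)
```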
So some of you might have seen these slides already, but the first thing everybody needs to build a proper RAG pipeline is a good PDF parser, okay? Or a PowerPoint parser, or some parser that can actually extract those complex documents into a well-structured representation instead of just shoving them through PyPDF.
If you have a table in a financial report and you run it through PyPDF, it's going to destroy and collapse the information, blend the numbers and the text together, and what ends up happening is you get hallucinations. And so one of the key things about parsing is that even good parsing itself can improve performance, right?
Even without advanced indexing and retrieval, good parsing helps to reduce hallucinations. A simple example: we took the Caltrain weekend schedule and parsed it through LlamaParse, one of our offerings, into a well-structured document representation, because LLMs can actually understand well spatially laid out text. When you ask questions over it (I know the text is a little faint, that's totally fine, I'll share these slides later on), you're able to get back the correct train times for a given column. Whereas if you shove it into PyPDF, you get a whole bunch of hallucinations when you ask questions over this type of data.
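As a sketch of what that looks like in code (the Caltrain file name, model choice, and question are just placeholders), you parse with LlamaParse into markdown and then index the result so embedded tables get modeled explicitly.

```python
from llama_parse import LlamaParse  # pip install llama-parse; needs a LLAMA_CLOUD_API_KEY
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import MarkdownElementNodeParser
from llama_index.llms.openai import OpenAI

# Parse the PDF into well-structured markdown that preserves table layout.
documents = LlamaParse(result_type="markdown").load_data("./caltrain_weekend_schedule.pdf")

# Split the markdown into text nodes plus table/object nodes, then index both.
node_parser = MarkdownElementNodeParser(llm=OpenAI(model="gpt-4o"), num_workers=4)
nodes = node_parser.get_nodes_from_documents(documents)
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)
index = VectorStoreIndex(nodes=base_nodes + objects)

print(index.as_query_engine(similarity_top_k=5).query(
    "What time does the first weekend train leave San Francisco?"
))
```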
So that's step one: you want good parsing, and you can combine it, of course, with advanced indexing modules to model heterogeneous data within a document. One announcement we're making today: we opened up LlamaParse a few months ago, and it now has tens of thousands of users and tens of millions of pages processed; it's gotten very popular. In general, if you're an enterprise developer who has a bucket of PDFs and wants to shove them in without having to worry about some of these decisions, come sign up. This is basically what we're building on the Llama Cloud side.
The next step is advanced single-agent flows. So now we have good data and retrieval modules, but in the end we're still using a single LLM prompt call. How do we go a little bit beyond that into something more interesting and sophisticated?
We did this entire course with, you know, Andrew Ng at deeplearning.ai, and we've also written extensively about this in the past few months, but basically, you can layer on different components of agents on top of just a basic RAG system to build something that is a lot more sophisticated in query understanding, planning, and tool use.
And so the way I like to break this down, right, because they all have trade-offs, is on the left side, you have some simple components that come with lower costs and lower latency, and then on the right, you can build full-blown agent systems that can, you know, operate and even work together with other agents.
Some of the core agent ingredients we see as pretty fundamental to building QA systems these days include function calling and tool use; being able to actually do query planning, whether sequential or in the style of a DAG; and maintaining conversation memory over time, so it's a stateful service as opposed to a stateless one.
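Here's a minimal sketch of the tool-use and memory ingredients with a single ReAct-style agent; the multiply tool and the model choice are just for illustration, and a query planning layer (for example a sub-question planner) isn't shown.

```python
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI

def multiply(a: float, b: float) -> float:
    """Multiply two numbers and return the result."""
    return a * b

# A single agent with one tool; chat history is kept across calls, so it's stateful.
agent = ReActAgent.from_tools(
    [FunctionTool.from_defaults(fn=multiply)],
    llm=OpenAI(model="gpt-4o"),
    verbose=True,
)

print(agent.chat("What is 21 * 2?"))
print(agent.chat("Now multiply that result by 10."))  # relies on conversation memory
```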
We've pioneered this idea of agentic RAG, where RAG isn't just a single LLM prompt call whose whole responsibility is to synthesize the information; instead, you use the LLM extensively during the query understanding and processing phase. Rather than directly feeding the query to a vector database, everything becomes an LLM interacting with a set of data services as tools.
And so this is a pretty important framework to understand, because at the end of the day, you're going to have, in any piece of LLM software, LLMs interacting with other services, whether it's a database or even other agents, as tools, and you're going to need to do some sort of query planning to basically figure out how to use these tools to solve the tasks that you're given.
We've also talked about agent reasoning loops, right? Probably the most stable one we've seen so far is some sort of while loop over function calling, or ReAct. But we've also seen fancier agent papers arise that deal with DAG-based planning, where you plan out an entire DAG of decisions, or tree-based planning, where you plan out an entire set of possible outcomes and try to optimize over them.
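As a rough, framework-agnostic sketch of that "while loop over function calling" pattern (the `call_llm` callable and its return shape are hypothetical placeholders, not a real API):

```python
from typing import Callable, Dict

def agent_loop(
    task: str,
    tools: Dict[str, Callable[..., str]],
    call_llm: Callable,  # assumed to return (final_text, tool_name, tool_args)
    max_steps: int = 10,
) -> str:
    """Ask the model, execute any requested tool, feed the result back, repeat."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        final_text, tool_name, tool_args = call_llm(history, tool_specs=list(tools))
        if tool_name is None:
            return final_text  # the model answered directly, so we're done
        result = tools[tool_name](**tool_args)  # execute the requested tool call
        history.append({"role": "tool", "name": tool_name, "content": str(result)})
    return "Stopped after max_steps without a final answer."
```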
The end result is that if you're able to do this, you're able to build personalized QA systems that are capable of handling more complex questions, for instance comparison questions across multiple documents; maintaining user state over time, so you can revisit the thing they were looking for; and looking up information from not only unstructured data but also structured data, by treating everything as a data service or a tool.
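For example, a minimal agentic RAG setup for comparison questions might wrap each document's query engine as a tool and let the agent plan across them; the 10-K file names and the question here are hypothetical.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool
from llama_index.llms.openai import OpenAI

# One query engine, and therefore one tool, per document.
tools = []
for filename in ["uber_2023_10k.pdf", "lyft_2023_10k.pdf"]:
    docs = SimpleDirectoryReader(input_files=[f"./data/{filename}"]).load_data()
    engine = VectorStoreIndex.from_documents(docs).as_query_engine(similarity_top_k=5)
    tools.append(
        QueryEngineTool.from_defaults(
            query_engine=engine,
            name=filename.replace(".pdf", ""),
            description=f"Answers questions about {filename}.",
        )
    )

# The agent decides which tool(s) to call, so comparison questions span both filings.
agent = ReActAgent.from_tools(tools, llm=OpenAI(model="gpt-4o"), verbose=True)
print(agent.chat("Compare Uber's and Lyft's revenue growth in 2023."))
```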
But, you know, there are some remaining gaps here. First of all, you know, we've kind of had some interesting discussions with other people in the community about this, but a single agent generally cannot solve an infinite set of tasks. If anyone's tried to give, like, a thousand tools to an agent, the agent is going to struggle and generally fail, at least with current model capabilities.
And so one principle is that specialist agents tend to do better: the agent is a little bit more focused on a given task, given some input. The second gap is that the services agents are interfacing with may increasingly be other agents themselves. And so we might want to think about a multi-agent future.
So let's talk about multi-agents and what that means for this idea of knowledge assistance. Multi-agent task solvers. First of all, why multi-agents? Well, we've mentioned this a little bit, but they offer a few benefits beyond just a single agent flow. First, they offer this idea of being able to actually specialize and operate over a, you know, focused set of tasks more reliably so that you can actually stitch together different agents that potentially can work together to solve a bigger task.
Another benefit, or set of benefits, is on the systems side. By having multiple copies of even the same LLM agent, you're able to parallelize a bunch of tasks and do things a lot faster. The third thing is that with a multi-agent framework, instead of having a single agent access, say, a thousand tools, you could potentially have each agent operate over five to ten tools and therefore use a weaker and faster model.
And so there are actually potential cost and latency savings. There are, of course, some fantastic multi-agent frameworks that have come out in the past few months, and many of you might be either using those or building your own. In general, some of the challenges in building this reliably in production include, one, deciding whether to let the agents operate amongst themselves in some sort of unconstrained flow, or to actually inject constraints between the agents.
So you're basically explicitly forcing an agent to operate in a certain way given a certain input. The second is that when you think about having these agents operate in production, currently the bulk of agents are implemented as functions in a Jupyter notebook, and we might want to think about defining the proper service architecture for agents in production and what that looks like.
So today I'm excited to launch a preview of a new repo we've been working on called Llama Agents. It's an alpha feature, but basically it represents agents as microservices. In addition to some of the fantastic work that a lot of these multi-agent frameworks have done, the core goal of Llama Agents is to treat every agent as a separate service and figure out how these different services can operate together, communicate with each other through a central communication API, and work together to solve a given task in a way that's scalable, can handle multiple requests at once, and is easy to deploy to different types of infrastructure. Each agent can encapsulate a set of logic but still communicate with other agents and be reused across different tasks.
So it's really about how you take these agents out of a notebook and into production, and it's an idea we've had for a while now. We see this as a key ingredient in helping you build something that's production-grade, a production-grade knowledge assistant, especially as the world gets more agentic over time.
The core architecture here is that every agent is represented as a separate service. You can write the agents however you want, with LlamaIndex or with another framework, and we have interfaces for building a custom agent. Then you're able to deploy each one as a service, the agents interact with each other via some sort of message queue, and orchestration happens between the agents via a general control plane.
We took some inspiration from existing resource allocators, for instance Kubernetes, and other open-source, systems-level projects. The orchestration can be either explicit, where you explicitly define the flows between services, or implicit, where you have some sort of LLM orchestrator figure out which tasks to delegate given the current state of things.
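Here's a minimal sketch based on the alpha repo at launch (exact class names and signatures may shift, so treat this as illustrative rather than definitive): two agent services share a message queue, and a control plane with an LLM orchestrator decides where each task goes.

```python
from llama_agents import (
    AgentService,
    AgentOrchestrator,
    ControlPlaneServer,
    LocalLauncher,
    SimpleMessageQueue,
)
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI

def get_the_secret_fact() -> str:
    """Returns the secret fact."""
    return "The secret fact is: a baby llama is called a 'cria'."

secret_agent = ReActAgent.from_tools(
    [FunctionTool.from_defaults(fn=get_the_secret_fact)], llm=OpenAI()
)
general_agent = ReActAgent.from_tools([], llm=OpenAI())

# Shared message queue plus a control plane; the LLM orchestrator is the "implicit" mode.
message_queue = SimpleMessageQueue()
control_plane = ControlPlaneServer(
    message_queue=message_queue,
    orchestrator=AgentOrchestrator(llm=OpenAI()),
)

# Each agent is wrapped as its own service that communicates over the queue.
secret_service = AgentService(
    agent=secret_agent,
    message_queue=message_queue,
    description="Useful for getting the secret fact.",
    service_name="secret_fact_agent",
)
general_service = AgentService(
    agent=general_agent,
    message_queue=message_queue,
    description="Useful for general questions.",
    service_name="general_agent",
)

# Run everything in-process for testing; each piece can also be deployed as its own server.
launcher = LocalLauncher([secret_service, general_service], control_plane, message_queue)
print(launcher.launch_single("What is the secret fact?"))
```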
And so one thing I want to show you is how this relates to the idea of knowledge assistance, because we think multi-agents are going to be a core component of it. This is a demo we whipped up showing how to run Llama Agents on top of a basic RAG pipeline.
This is a pretty trivial RAG pipeline. There's a query rewriting service, and then some sort of default agent that basically just does RAG, i.e., search and retrieval. You can also add in other components and services, like reflection, and you could have other tools as well, or even a general tool service.
And the core demo here is really showing that, given some input, the services are communicating with each other through some sort of API protocol. This allows you to, for instance, launch a bunch of different client requests at once, handle task requests from different directions, and have these agents operate as encapsulated microservices.
So the query rewrite agent takes in some query, processes it, and rewrites it into a new query. Then the second agent takes in that query, does search and retrieval, and outputs a final response. If you've built a RAG pipeline, the actual logic here should be relatively trivial, but the goal is to show how you can turn something, even something trivial, into a set of services that you can deploy.
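A sketch of that demo under the same alpha API might look like the following; the explicit pipeline pieces (PipelineOrchestrator, ServiceComponent), the document folder, and the query are assumptions based on the repo at launch, so check the repo for the current interface.

```python
from llama_agents import (
    AgentService,
    ControlPlaneServer,
    LocalLauncher,
    PipelineOrchestrator,
    ServiceComponent,
    SimpleMessageQueue,
)
from llama_index.agent.openai import OpenAIAgent
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.query_pipeline import QueryPipeline
from llama_index.core.tools import QueryEngineTool
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o")

# Agent 1: rewrites the incoming query into a cleaner search query.
rewrite_agent = OpenAIAgent.from_tools(
    [], llm=llm, system_prompt="Rewrite the user's question into a concise search query."
)

# Agent 2: plain RAG (search and retrieval) over a placeholder document folder.
docs = SimpleDirectoryReader("./data").load_data()
rag_tool = QueryEngineTool.from_defaults(
    query_engine=VectorStoreIndex.from_documents(docs).as_query_engine(),
    name="rag",
    description="Search and retrieval over the document set.",
)
rag_agent = OpenAIAgent.from_tools([rag_tool], llm=llm)

# Wrap each agent as a service on a shared message queue.
message_queue = SimpleMessageQueue()
rewrite_service = AgentService(
    agent=rewrite_agent,
    message_queue=message_queue,
    description="Rewrites user queries.",
    service_name="query_rewrite_agent",
)
rag_service = AgentService(
    agent=rag_agent,
    message_queue=message_queue,
    description="Does search and retrieval over the documents.",
    service_name="rag_agent",
)

# Explicit orchestration: a fixed rewrite -> RAG pipeline instead of an LLM orchestrator.
pipeline = QueryPipeline(
    chain=[
        ServiceComponent.from_service_definition(rewrite_service),
        ServiceComponent.from_service_definition(rag_service),
    ]
)
control_plane = ControlPlaneServer(
    message_queue=message_queue,
    orchestrator=PipelineOrchestrator(pipeline),
)

launcher = LocalLauncher([rewrite_service, rag_service], control_plane, message_queue)
print(launcher.launch_single("wht did the report say about q1 revenue??"))
```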
And this is just another example, basically a backup slide, that again highlights the fact that you can have multiple agents that all operate and work together to achieve a given task. So the QR code is linked. First of all, this is in alpha mode, and we're really excited to share it with the community.
We're very public about the roadmap, actually, so check out the discussions tab for what's in there and what's not. We're launching with about a dozen initial tutorials to show you how to build a set of microservices that help you build that production-grade, agentic knowledge assistant workflow.
And there's also a repo linked that I think should be public now. In general, we're pretty excited to get feedback from the community about what a general communication protocol should look like, and about how we integrate with some of the other awesome work the community has done, to help achieve this core mission of building something production-grade: a multi-agent knowledge assistant.
And this is just the last component, which I already mentioned: if you're interested in the data quality side of things, say you didn't care about agents at all and just care about data quality, we're opening up a waitlist for Llama Cloud more generally. It lets you deal with all those decisions I mentioned, the parsing, chunking, and indexing, and makes sure that your bucket of PDFs with embedded charts, tables, and images is processed and parsed the right way.
And if you're an enterprise developer with that use case, come talk to us. So that's basically it. Thanks for your time, and I hope you enjoyed it.