The Future of Knowledge Assistants: Jerry Liu

Hey, everybody. I'm Jerry, co-founder and CEO of LlamaIndex, and I'm excited to be here today to talk about the future of knowledge assistants. So let's get started.

First, everybody's building stuff with LLMs these days. Some of the most common use cases we're seeing throughout the enterprise include document processing, tagging, and extraction, as well as knowledge search and question answering. If you've followed our Twitter for the past year or so, we've talked about RAG probably 75% of the time. You can also generalize that question-answering interface into an overall conversational agent that can not only do one-shot query and search, but actually store your conversation history over time. And of course, this year a lot of people are excited about building agentic workflows that can not only synthesize information, but actually perform actions and interact with a lot of services to get you back the thing that you need.

So let's talk specifically about this idea of building a knowledge assistant, which we've been very interested in since the very beginning of the company. The goal is to build an interface that can take in any task as input and get back some sort of output. The input could be a simple question, a complex question, or a vague research task. The output could be a short answer, a research report, or a structured output.
RAG was just the beginning. Last year, I said that RAG was basically just a hack, and there are a lot of things you can do on top of RAG to make it more advanced and sophisticated. If you build a knowledge assistant with a very basic RAG pipeline, you run into the following issues. First is a naive data processing pipeline: you put your documents through some basic parser, do some sentence splitting and chunking, and do top-k retrieval, and even if it only took you ten minutes to set up, the results often aren't good enough. Second, it doesn't really have a way to understand more complex, broader queries, so there's no query understanding or planning. There's also no sophisticated way of interacting with other services. And it's stateless, so there's no memory.
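To make those failure modes concrete, here's a deliberately naive RAG pipeline sketched in plain Python rather than any particular framework. The chunking and scoring logic are toy stand-ins for a real sentence splitter and vector store: split into fixed-size chunks, retrieve top-k by word overlap, no query understanding, no tools, no memory.

```python
# A deliberately naive RAG pipeline: sentence splitting, fixed-size chunks,
# and top-k retrieval by bag-of-words overlap. No query planning, no tool
# use, no memory -- exactly the setup that breaks on complex queries.

def chunk(text: str, sentences_per_chunk: int = 2) -> list[str]:
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [
        ". ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]

def top_k(query: str, chunks: list[str], k: int = 1) -> list[str]:
    # Word-overlap scoring stands in for embedding similarity.
    q = set(query.lower().split())
    scored = sorted(chunks, key=lambda c: len(q & set(c.lower().split())), reverse=True)
    return scored[:k]

doc = ("LlamaParse extracts tables from PDFs. PyPDF collapses table layout. "
       "Good parsing reduces hallucinations. Chunking splits documents into pieces.")
chunks = chunk(doc)
print(top_k("what reduces hallucinations", chunks))
```

This works fine for simple lookups, but there is nowhere in the pipeline for the system to rewrite a vague query, compare across documents, or remember the previous question, which is the gap the rest of the talk addresses.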
In this setting, we've said that RAG is kind of boring if it's just a simple RAG pipeline. It's really just a glorified search system on top of retrieval methods that have been around for decades, and there are a lot of questions and tasks that naive RAG can't answer.
So one thread we've been pulling on a lot is figuring out how to go from simple search and naive RAG to building a general context-augmented research assistant. We'll talk about this in three steps, with some cool feature releases in the mix. The first step is advanced data and retrieval modules: even if you don't care about the fancy agentic stuff, you need good core data quality modules to get to production. The second is advanced single-agent query flows: building an agentic RAG layer on top of existing data services as tools, to enhance the level of query understanding that your QA interface provides. And the third, and this is quite interesting, is the whole idea of a general multi-agent task solver, where you extend beyond even the capabilities of a single agent toward multi-agent orchestration.
So let's talk about advanced data and retrieval as a first step. The first thing is that any LLM app these days is only as good as your data. If you're an ML engineer, you've heard that statement many times, so this shouldn't be net new, but it applies to LLM app development as well. Good data quality is a necessary component of any production-grade LLM application, and you need a data processing layer that translates raw unstructured and semi-structured data into a form that's good for your LLM app. The main components of data processing are parsing, chunking, and indexing.
Some of you might have seen these slides already, but the first thing everybody needs to build a proper RAG pipeline is a good PDF parser, or a PowerPoint parser, or some parser that can extract complex documents into a well-structured representation instead of just shoving them through PyPDF. If you have a table in a financial report and you run it through PyPDF, it's going to destroy and collapse the information, blending the numbers and the text together, and what ends up happening is you get hallucinations. One of the key things about parsing is that good parsing by itself can improve performance: even without advanced indexing and retrieval, good parsing helps to reduce hallucinations.

A simple example: we took the Caltrain weekend schedule and parsed it through LlamaParse, one of our offerings, into a well-structured document format. Because LLMs can actually understand well-spatially-laid-out text, when you ask questions over it you get back the correct train times for a given column (I know the text on the slide is a little faint; I'll share these slides later on). If you instead shove it into PyPDF, you get a whole bunch of hallucinations when you ask questions over this type of data. So you want good parsing, and you can combine it with advanced indexing modules to model heterogeneous data within a document.
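To illustrate why structure matters, here's a toy sketch in plain Python (this is not LlamaParse itself, just the shape of the problem): the same schedule-style table handled via a structured row/column parse versus a flattened text blob. The flattened version loses the association between a train and its times, which is exactly the kind of collapse that leads to hallucinated answers.

```python
# Toy illustration of structured vs. collapsed table parsing. A real parser
# (e.g. LlamaParse) would produce something like the structured form; a
# layout-blind text extractor produces the flat form.

TABLE = """\
Train | Departs | Arrives
101   | 08:05   | 09:10
103   | 09:35   | 10:40
"""

def parse_table(text: str) -> list[dict[str, str]]:
    lines = [ln for ln in text.splitlines() if ln.strip()]
    headers = [h.strip() for h in lines[0].split("|")]
    return [
        dict(zip(headers, (cell.strip() for cell in ln.split("|"))))
        for ln in lines[1:]
    ]

def departure(rows: list[dict[str, str]], train: str) -> str:
    # Column lookup stays correct because each value is tied to its row.
    return next(r["Departs"] for r in rows if r["Train"] == train)

rows = parse_table(TABLE)
print(departure(rows, "103"))       # the structured parse keeps cell-to-row links

flattened = " ".join(TABLE.split()) # what a layout-blind extractor might yield
print(flattened)                    # numbers and headers run together; links lost
```

In the flattened string, an LLM has to guess which time belongs to which train; in the structured form, the answer is unambiguous.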
One announcement we're making today: we opened up LlamaParse a few months ago, and it now has tens of thousands of users and tens of millions of pages processed; it's gotten very popular. In general, if you're an enterprise developer who has a bucket of PDFs and wants to shove them in without worrying about some of these decisions, come sign up. This is what we're building on the LlamaCloud side.
The next step is advanced single-agent flows. So we have good data and retrieval modules, but right now we're still using a single LLM prompt call. How do we go beyond that into something more interesting and sophisticated? We did an entire course on this with Andrew Ng at deeplearning.ai, and we've also written extensively about it in the past few months, but basically you can layer different agent components on top of a basic RAG system to build something that is a lot more sophisticated in query understanding, planning, and tool use. The way I like to break this down, because these components all have trade-offs, is that on the left side you have simple components that come with lower cost and lower latency, and on the right you can build full-blown agent systems that can operate and even work together with other agents. Some of the core agent ingredients that we see as pretty fundamental to building QA systems these days include function calling and tool use; query planning, whether sequential or in some style of a DAG; and maintaining conversation memory over time.
So it's a stateful service as opposed to a stateless one. This is the idea behind agentic RAG: instead of RAG as a single LLM prompt call whose whole responsibility is just to synthesize information, you use LLMs extensively during the query understanding and processing phase. Rather than directly feeding the query to a vector database, in the end everything is just an LLM interacting with a set of data services as tools. This is a pretty important framework to understand, because at the end of the day, in any piece of LLM software, you're going to have LLMs interacting with other services as tools, whether that's a database or even other agents, and you're going to need to do some sort of query planning to figure out how to use those tools to solve the tasks that you're given.
We've also talked about agent reasoning loops. Probably the most stable one we've seen so far is some sort of while loop over function calling, or ReAct.
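That "while loop over function calling" pattern can be sketched in a few lines. This is a toy, framework-free version: the `decide` function stands in for the LLM's tool-choice step (a real system would pass the full message history to a model), and the loop runs until the policy signals that it's done.

```python
# Toy agent reasoning loop: a while loop over "function calling".
# `decide` stands in for an LLM choosing the next tool call.

def search_tool(query: str) -> str:
    # Stand-in for a retrieval service over a tiny corpus.
    corpus = {"llamaparse": "LlamaParse is a document parser."}
    return corpus.get(query.lower(), "no results")

def calculator_tool(expr: str) -> str:
    a, op, b = expr.split()
    return str(int(a) + int(b)) if op == "+" else "unsupported"

TOOLS = {"search": search_tool, "calc": calculator_tool}

def decide(task: str, observations: list[str]) -> tuple[str, str]:
    # Hard-coded policy standing in for the LLM: search first,
    # then finish with whatever was observed.
    if not observations:
        return ("search", task)
    return ("finish", observations[-1])

def run_agent(task: str) -> str:
    observations: list[str] = []
    while True:                       # the agent loop itself
        action, arg = decide(task, observations)
        if action == "finish":
            return arg
        observations.append(TOOLS[action](arg))

print(run_agent("llamaparse"))
```

The loop structure is the stable part; swapping the hard-coded `decide` for a function-calling LLM is what turns this sketch into a real agent.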
But we've also seen fancier agent papers arise that deal with things like DAG-based planning, where you plan out an entire set of possible outcomes. The end result is that if you're able to do this, you get QA systems that are capable of handling more complex questions, for instance comparison questions across multiple documents; maintaining user state over time, so you can revisit the thing the user was looking for; and looking up information from not only unstructured data but also structured data, by treating everything as a data service or a tool.
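The "everything is a data service or a tool" idea can be sketched as a simple router: one tool fronts structured (tabular) data, another fronts unstructured text, and a planner picks which tool answers the query. All names here are invented for illustration, and a keyword heuristic stands in for an LLM's tool choice.

```python
# Toy router over heterogeneous data services. A real system would let an
# LLM choose the tool via function calling; a keyword heuristic stands in.

STRUCTURED = {("q1", "revenue"): "$10M", ("q2", "revenue"): "$12M"}  # toy "SQL table"
UNSTRUCTURED = "The company was founded in 2019 and is based in San Francisco."

def structured_tool(query: str) -> str:
    # Stand-in for a text-to-SQL service over the table.
    for (period, metric), value in STRUCTURED.items():
        if period in query.lower() and metric in query.lower():
            return value
    return "not found"

def unstructured_tool(query: str) -> str:
    # Stand-in for vector-store retrieval over documents.
    return UNSTRUCTURED

def route(query: str) -> str:
    # Stand-in planner: metric queries go to the table, the rest to text search.
    tool = structured_tool if "revenue" in query.lower() else unstructured_tool
    return tool(query)

print(route("What was Q2 revenue?"))
print(route("Where is the company based?"))
```

Both data sources look identical to the planner: each is just a callable tool, which is what makes it easy to add more services later.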
But there are some remaining gaps here. First, and we've had some interesting discussions with other people in the community about this, a single agent generally cannot solve an infinite set of tasks. If anyone's tried to give a thousand tools to an agent, the agent is going to struggle and generally fail. So one principle is that specialist agents tend to do better: an agent is more reliable when it's a bit more focused on a given task. The second gap is that agents are increasingly interfacing with services that may actually be other agents themselves. And so we might want to think about a multi-agent future, and what that means for this idea of knowledge assistants.
Multi-agent systems offer a few benefits beyond a single-agent flow. First, agents can specialize and operate over a focused set of tasks more reliably, so you can stitch together different agents that work together to solve a bigger task. Another set of benefits is on the system side: by having multiple copies of even the same LLM agent, you're able to parallelize a bunch of tasks. The third is that with a multi-agent framework, instead of having a single agent with access to a huge set of tools, you could have each agent operate over its own smaller set of tools, so there are potential cost and latency savings as well.
There are, of course, some fantastic multi-agent frameworks out there already. The challenges in building this reliably in production include choosing between two modes of orchestration: you can either let the agents operate amongst themselves in some sort of unconstrained flow, or you can define constrained flows, where you explicitly force an agent to operate in a certain way given a certain input.
Another challenge is figuring out the proper service architecture for agents in production. So today, I'm excited to launch a preview feature called Llama Agents. Basically, it represents agents as microservices. In addition to some of the fantastic work that a lot of these multi-agent frameworks have done, the core goal of Llama Agents is to think about every agent as a separate service, and to figure out how these different services can operate together, communicate with each other through a central API communication interface, and work together to solve a given task. Each agent can encapsulate a set of logic and be reused across different tasks.
This addresses the question of how you take these agents out of a notebook and into production; it's an idea we've had for a while now, aimed at helping you build something that's production-grade. The core architecture is that every agent is represented as a separate service. You can write the agents however you want, and then deploy them as services. The agents can interact with each other, and the orchestration between them can draw on ideas from existing resource allocators. That orchestration can be either explicit, where you explicitly define the flows between services, or implicit, where an orchestrator decides how to route tasks between the agents.
One thing I want to show you is how this relates to this idea of knowledge assistants. This is a demo we whipped up: there's a query rewriting service, and a RAG service that just does search and retrieval. The core demo here is really showing that you can launch a bunch of different client requests at once, and each agent can handle task requests from different directions as an encapsulated microservice. The query rewrite agent takes in some query, processes it, and rewrites it into a new query, which then gets passed along to the RAG agent. The actual logic of each agent can be relatively trivial; the point is that you can decompose the pipeline into a set of services that you can deploy.
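Here's a toy, framework-free sketch of that demo's shape (this is not the Llama Agents API, just an illustration of the architecture): two agent "services" communicating through a central message queue, with an explicit flow from the query-rewrite service to the RAG service. Each service only sees messages addressed to it, which is the microservice-style encapsulation the demo illustrates.

```python
# Toy multi-agent setup: two agent "services" communicating through a
# central message queue, with an explicit flow rewrite -> rag -> done.
from collections import deque

def rewrite_agent(msg: dict) -> dict:
    # Stand-in for an LLM rewriting a vague query into a retrievable one.
    query = msg["query"].replace("that parsing thing", "LlamaParse")
    return {"to": "rag", "query": query}

def rag_agent(msg: dict) -> dict:
    # Stand-in for search and retrieval over a tiny corpus.
    corpus = {"LlamaParse": "LlamaParse parses complex documents into structured text."}
    answer = next((v for k, v in corpus.items() if k in msg["query"]), "no match")
    return {"to": "done", "answer": answer}

SERVICES = {"rewrite": rewrite_agent, "rag": rag_agent}

def run(query: str) -> str:
    queue = deque([{"to": "rewrite", "query": query}])  # central message queue
    while True:
        msg = queue.popleft()
        if msg["to"] == "done":
            return msg["answer"]
        queue.append(SERVICES[msg["to"]](msg))          # dispatch to addressed service

print(run("tell me about that parsing thing"))
```

Because the services only exchange messages, each one can in principle be deployed, scaled, and reused independently, which is the point of treating agents as microservices.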
We're really excited to share this. We're very public about the roadmap, actually, about what's in there and what's not. Again, the goal is building something that's production-grade, even if, let's say, you didn't care about agents at all.