
The Future of Knowledge Assistants: Jerry Liu



00:00:00.040 | Hey, everybody. I'm Jerry, co-founder and CEO of LlamaIndex, and I'm excited to be here today to
00:00:19.380 | talk about the future of knowledge assistants. So let's get started. First, everybody's
00:00:26.880 | building with LLMs these days. Some of the most common use cases we're seeing throughout
00:00:31.480 | the enterprise include the following: document processing, tagging, and extraction;
00:00:36.920 | knowledge search and question answering. If you've followed our Twitter for the past year or so,
00:00:42.540 | we've talked about RAG probably 75% of the time. And you can generalize
00:00:48.520 | that question-answering interface into an overall conversational agent that can not only
00:00:54.860 | do one-shot query and search, but actually store your conversation history over time.
00:00:59.100 | And of course, this year, a lot of people are excited about building agentic workflows that can not only
00:01:05.640 | synthesize information, but actually perform actions and interact with a lot of services to basically get
00:01:11.280 | you back the thing that you need. So let's talk specifically about this idea of building a knowledge
00:01:18.360 | assistant, which we've been very interested in since the very beginning of the company.
00:01:22.460 | The goal is to basically build an interface that can take in any task as input and get back
00:01:27.340 | some sort of output. So the input forms could be, you know, a simple question. It could be a complex
00:01:32.640 | question. It could be a vague research task. And the output forms could be a short answer. It could be a
00:01:37.320 | research report, or it could be a structured output.
00:01:39.200 | RAG was just the beginning. Last year, I said that RAG was basically just a hack. And there are a lot of things
00:01:46.740 | that you can do on top of RAG to make it more advanced and sophisticated.
00:01:51.620 | If you build a knowledge assistant with a very basic RAG pipeline, you run into the following issues.
00:01:56.920 | First is a naive data processing pipeline: you put your data through some basic parser,
00:02:02.580 | do some sentence splitting and chunking, and do top-k retrieval. And then you realize, even if it took you 10 minutes to set up,
00:02:09.460 | that it's not suitable for production.
00:02:11.760 | It also doesn't really have a way to understand more complex, broader queries; there's no query understanding and planning.
00:02:19.460 | There's also no sophisticated way of interacting with other services. And it's stateless, so there's no memory.
00:02:27.880 | So in this setting, we've said that RAG is kind of boring if it's just the simple RAG pipeline.
00:02:36.220 | It's really just a glorified search system on top of some retrieval methods that have been around for decades.
00:02:41.400 | And there are a lot of questions and tasks that naive RAG can't give an answer to.
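
For reference, here is a minimal sketch of the kind of naive pipeline being described, using LlamaIndex core APIs from around the time of this talk (the data folder and question are illustrative, and package layouts may have shifted since):

```python
# A deliberately naive RAG pipeline: basic parsing, default chunking, top-k retrieval.
# Assumes `pip install llama-index` and an OPENAI_API_KEY in the environment.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./data").load_data()   # basic per-file parsing
index = VectorStoreIndex.from_documents(documents)        # default sentence splitting + chunking
query_engine = index.as_query_engine(similarity_top_k=3)  # plain top-k retrieval + synthesis

print(query_engine.query("What were the key findings in the Q3 report?"))
```
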
00:02:45.140 | And so, one thread that we've been pulling a lot on is basically figuring out how to go from simple search and naive RAG
00:02:53.060 | to building a general context-augmented research assistant.
00:02:56.200 | So we'll talk about these three steps with some cool feature releases, you know, in the mix.
00:03:02.180 | But the first step is basically advanced data and retrieval modules.
00:03:05.900 | Even if you don't, you know, care about the fancy agentic stuff, you need good core data quality modules to basically help you go to production.
00:03:13.760 | The second is advanced single-agent query flows, building some agentic RAG layer on top of existing data services as tools
00:03:21.180 | to basically enhance the level of query understanding that your QA interface provides.
00:03:25.260 | And then the third, and this is quite interesting, is this whole idea of a general multi-agent task solver,
00:03:31.060 | where you extend beyond even the capabilities of a single agent towards multi-agent orchestration.
00:03:35.560 | So let's talk about advanced data and retrieval as a first step.
00:03:42.140 | The first thing is that any LLM app these days is only as good as your data, right?
00:03:48.920 | Garbage in, garbage out.
00:03:50.060 | If you're an ML engineer, you've heard that kind of statement many times.
00:03:53.540 | And so this shouldn't be net new, but it applies in the case of LLM app development as well.
00:03:58.660 | Good data quality is a necessary component of any production-grade LLM application,
00:04:03.980 | and you need that data processing layer to translate raw, unstructured, semi-structured data into some form that's good for your LLM app.
00:04:10.760 | The main components of data processing, of course, are parsing, chunking, and indexing.
00:04:17.540 | And let's start with parsing.
00:04:19.320 | So some of you might have seen these slides already, but basically the first thing that everybody needs to build some sort of proper RAG pipeline is a good PDF parser, okay?
00:04:28.960 | Or a PowerPoint parser or some parser that can actually extract out those complex documents into a well-structured representation instead of just shoving it through PyPDF.
00:04:38.320 | If you have a table in a financial report and you run it through PyPDF, it's going to destroy and collapse the information, blend the numbers and the text together, and what ends up happening is you get hallucinations.
00:04:49.220 | And so one of the key things about parsing is that even good parsing itself can improve performance, right?
00:04:55.340 | Even without advanced indexing retrieval, good parsing helps to reduce hallucinations.
00:05:00.400 | A simple example here: we took the Caltrain weekend schedule and parsed it through LlamaParse, one of our offerings,
00:05:08.140 | into a well-structured document format. Because LLMs can actually understand well-spatially-laid-out text,
00:05:15.740 | when you ask questions over it (I know the text is a little faint, that's totally fine, I'll share these slides later on),
00:05:20.440 | you're able to get back the correct train times for a given column. Whereas if you shove it into PyPDF,
00:05:27.440 | you get a whole bunch of hallucinations when you ask questions over this type of data.
00:05:31.080 | So that's step one.
00:05:33.500 | You want good parsing, and you can combine this, of course, with advanced indexing modules to basically model heterogeneous data within a document.
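
As a hedged sketch, this is roughly what that parsing step looks like with LlamaParse: parse into spatially-aware markdown, then index as usual (the file path is illustrative, and the options reflect the API around the time of the talk):

```python
# Parse a complex PDF into well-structured markdown with LlamaParse, then index it.
# Assumes `pip install llama-parse llama-index` and a LLAMA_CLOUD_API_KEY in the environment.
from llama_parse import LlamaParse
from llama_index.core import VectorStoreIndex

parser = LlamaParse(result_type="markdown")  # spatially-aware, table-preserving output
documents = parser.load_data("./caltrain_weekend_schedule.pdf")

index = VectorStoreIndex.from_documents(documents)
print(index.as_query_engine().query("When is the last weekend train from San Jose?"))
```
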
00:05:41.580 | One announcement we're making today: we opened up LlamaParse a few months ago,
00:05:47.080 | and it now has tens of thousands of users and tens of millions of pages processed; it's gotten very popular.
00:05:51.420 | In general, if you're an enterprise developer who has a bucket of PDFs and wants to shove them in
00:05:56.500 | without having to worry about some of these decisions, come sign up.
00:05:59.380 | This is basically what we're building on the Llama Cloud side.
00:06:01.520 | The next step is advanced single-agent flows.
00:06:07.620 | So now we have good data and retrieval modules,
00:06:13.380 | but in the end, right now, we're still using a single LLM prompt call.
00:06:16.620 | So how do we go a little bit beyond that into something more interesting and sophisticated?
00:06:22.340 | We did an entire course with Andrew Ng at deeplearning.ai,
00:06:26.200 | and we've also written extensively about this in the past few months,
00:06:29.880 | but basically, you can layer on different components of agents on top of just a basic RAG system
00:06:36.700 | to build something that is a lot more sophisticated in query understanding, planning, and tool use.
00:06:42.980 | And so the way I like to break this down, right, because they all have trade-offs,
00:06:45.380 | is on the left side, you have some simple components that come with lower costs and lower latency,
00:06:50.120 | and then on the right, you can build full-blown agent systems that can, you know, operate and even work together with other agents.
00:06:56.420 | Some of the core agent ingredients that we see that are pretty fundamental towards building QA systems these days
00:07:02.540 | include function calling and tool use, being able to actually do query planning,
00:07:07.820 | whether it's sequential or in some style of a DAG,
00:07:10.660 | and also maintain conversation memory over time.
00:07:14.320 | So it's a stateful service as opposed to stateless.
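
On the memory point, here is a minimal sketch of making the QA interface stateful with LlamaIndex's chat memory buffer (a hedged example; the data folder, token limit, and chat mode are illustrative choices):

```python
# A stateful chat interface: conversation memory persists across turns.
# Assumes `pip install llama-index`; the ./data folder is illustrative.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.memory import ChatMemoryBuffer

index = VectorStoreIndex.from_documents(SimpleDirectoryReader("./data").load_data())
memory = ChatMemoryBuffer.from_defaults(token_limit=4000)
chat_engine = index.as_chat_engine(chat_mode="condense_plus_context", memory=memory)

print(chat_engine.chat("What does the report say about revenue?"))
print(chat_engine.chat("And how does that compare to last year?"))  # relies on the prior turn
```
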
00:07:16.720 | We've pioneered this idea of agentic RAG,
00:07:21.820 | where RAG isn't just a single LLM prompt call
00:07:26.180 | whose whole responsibility is to synthesize the information,
00:07:29.240 | but where you actually use LLMs extensively during the query understanding and processing phase.
00:07:33.720 | Instead of just directly feeding the query to a vector database,
00:07:37.940 | in the end, everything is just an LLM interacting with a set of data services as tools, right?
00:07:44.020 | And so this is a pretty important framework to understand,
00:07:46.600 | because at the end of the day, you're going to have, in any piece of LLM software,
00:07:50.300 | LLMs interacting with other services, whether it's a database or even other agents, as tools,
00:07:56.520 | and you're going to need to do some sort of query planning
00:07:59.120 | to basically figure out how to use these tools to solve the tasks that you're given.
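
To make "data services as tools" concrete, here is a hedged sketch in LlamaIndex that wraps two query engines as tools behind a ReAct-style agent (folder paths, tool names, and descriptions are illustrative):

```python
# Agentic RAG: the LLM plans over data services exposed as tools,
# instead of a single retrieve-then-synthesize prompt call.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool, ToolMetadata

def data_service(path: str, name: str, description: str) -> QueryEngineTool:
    """Wrap a folder of documents as a queryable tool."""
    index = VectorStoreIndex.from_documents(SimpleDirectoryReader(path).load_data())
    return QueryEngineTool(
        query_engine=index.as_query_engine(),
        metadata=ToolMetadata(name=name, description=description),
    )

agent = ReActAgent.from_tools(
    [
        data_service("./reports", "financial_reports", "Quarterly financial reports."),
        data_service("./wiki", "internal_wiki", "The internal company wiki."),
    ],
    verbose=True,
)
print(agent.chat("Compare what the wiki and the Q3 report say about headcount."))
```
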
00:08:02.760 | We've also talked about agent reasoning loops, right?
00:08:06.320 | Probably the most stable one that we've seen so far
00:08:08.700 | is some sort of while loop over function calling, or ReAct.
00:08:11.700 | But we've also seen fancier agent papers arise
00:08:15.040 | that deal with DAG-based planning,
00:08:18.340 | planning out an entire DAG of decisions,
00:08:20.140 | or tree-based planning, where
00:08:21.820 | you plan out an entire set of possible outcomes
00:08:23.960 | and try to optimize there.
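
And to make the "while loop over function calling" concrete, here is a minimal framework-free sketch against the OpenAI chat completions API (the model name and the single search_docs tool are illustrative stand-ins):

```python
# The simplest stable agent loop: call the LLM, execute any requested tools,
# feed results back, and repeat until the model stops asking for tools.
import json
from openai import OpenAI

client = OpenAI()

def search_docs(query: str) -> str:
    """Illustrative stand-in for a real retrieval service."""
    return f"Top result for: {query}"

tools = [{
    "type": "function",
    "function": {
        "name": "search_docs",
        "description": "Search the document store.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user", "content": "What do our docs say about refunds?"}]
while True:
    msg = client.chat.completions.create(
        model="gpt-4o", messages=messages, tools=tools
    ).choices[0].message
    messages.append(msg)
    if not msg.tool_calls:  # no more tool requests: we have the final answer
        print(msg.content)
        break
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = search_docs(**args)  # dispatch; only one tool in this sketch
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
```
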
00:08:25.040 | The end result is that if you're able to do this,
00:08:28.400 | you're able to build personalized QA systems
00:08:31.460 | that are capable of handling more complex questions,
00:08:35.040 | for instance, comparison questions across multiple documents,
00:08:37.880 | being able to maintain the user state over time
00:08:40.860 | so you can revisit the thing that they were looking for,
00:08:43.340 | being able to, for instance, look up information
00:08:46.000 | from not only unstructured data, but also structured data,
00:08:48.760 | by treating everything as a data service or a tool.
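
One hedged sketch of how those comparison questions can work in LlamaIndex is the sub-question query engine, which decomposes a comparison into per-document sub-queries (the 10-K folders and tool names are illustrative):

```python
# Decompose "compare X across documents" into sub-questions, one per data service.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata

def tool_for(path: str, name: str, desc: str) -> QueryEngineTool:
    index = VectorStoreIndex.from_documents(SimpleDirectoryReader(path).load_data())
    return QueryEngineTool(
        query_engine=index.as_query_engine(),
        metadata=ToolMetadata(name=name, description=desc),
    )

engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=[
        tool_for("./10k/uber", "uber_10k", "Uber's 2021 10-K filing."),
        tool_for("./10k/lyft", "lyft_10k", "Lyft's 2021 10-K filing."),
    ]
)
print(engine.query("Compare Uber's and Lyft's revenue growth in 2021."))
```
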
00:08:51.440 | But, you know, there are some remaining gaps here.
00:08:57.540 | First of all, you know, we've kind of had some interesting discussions
00:09:01.700 | with other people in the community about this,
00:09:03.380 | but a single agent generally cannot solve an infinite set of tasks.
00:09:06.780 | If anyone's tried to give, like, a thousand tools to an agent,
00:09:10.100 | the agent is going to struggle and generally fail,
00:09:12.260 | at least with current model capabilities.
00:09:14.340 | And so one principle is that specialist agents tend to do better
00:09:17.320 | if the agent is a little bit more focused on a given task,
00:09:20.400 | given some input.
00:09:21.360 | And then the second gap is that agents are increasingly interfacing
00:09:25.680 | with services that may, in fact, be other agents.
00:09:28.780 | And so we might want to think about a multi-agent future.
00:09:32.060 | So let's talk about multi-agents
00:09:36.500 | and what that means for this idea of knowledge assistants:
00:09:38.820 | multi-agent task solvers.
00:09:42.040 | First of all, why multi-agents?
00:09:44.400 | Well, we've mentioned this a little bit,
00:09:46.680 | but they offer a few benefits beyond just a single agent flow.
00:09:50.260 | First, they offer this idea of being able to actually specialize
00:09:54.320 | and operate over a focused set of tasks more reliably,
00:09:58.160 | so that you can actually stitch together different agents
00:10:00.900 | that potentially can work together to solve a bigger task.
00:10:03.720 | Another benefit or set of benefits is on the system side.
00:10:08.260 | By being able to have multiple copies of even
00:10:11.320 | the same LLM agent, you're able to parallelize a bunch of tasks
00:10:14.920 | and do things a lot faster.
00:10:18.080 | The third thing is that actually with a multi-agent framework,
00:10:22.260 | instead of having a single agent access
00:10:24.960 | a thousand tools,
00:10:25.680 | you could potentially have each agent operate over, say,
00:10:28.680 | five to ten tools
00:10:29.580 | and therefore use a weaker and faster model.
00:10:32.500 | And so there are actually potential cost and latency savings.
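
As a small illustration of that systems-side benefit, here is a hedged sketch of fanning sub-tasks out across copies of the same agent with asyncio (the make_agent factory and the tasks are illustrative; real copies would each get a focused toolset):

```python
# Fan sub-tasks out across copies of the same agent and gather results concurrently.
import asyncio
from llama_index.core.agent import ReActAgent

def make_agent() -> ReActAgent:
    """Illustrative factory: in practice each copy gets its own focused tools."""
    return ReActAgent.from_tools([])

async def run_all(tasks: list[str]) -> list[str]:
    agents = [make_agent() for _ in tasks]
    responses = await asyncio.gather(
        *(agent.achat(task) for agent, task in zip(agents, tasks))
    )
    return [str(r) for r in responses]

print(asyncio.run(run_all([
    "Summarize the Q3 report.",
    "List open action items from the planning doc.",
])))
```
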
00:10:35.260 | There are, of course, some fantastic multi-agent frameworks
00:10:39.360 | that have come out in the past few months,
00:10:40.840 | and many of you might be either using those
00:10:42.920 | or kind of building your own.
00:10:44.040 | And in general, some of the challenges
00:10:45.960 | in building this reliably in production include,
00:10:48.300 | one, being able to
00:10:50.400 | either let the agents operate amongst themselves
00:10:53.700 | in some sort of unconstrained flow,
00:10:56.140 | or actually being able to inject
00:10:58.540 | some sort of constraints between the agents.
00:11:00.660 | So you're basically explicitly forcing an agent
00:11:02.820 | to operate in a certain way, given a certain input.
00:11:05.640 | The second is when you actually think about
00:11:08.680 | having these agents operate in production,
00:11:10.640 | currently the bulk of agents are implemented
00:11:12.920 | as functions in a Jupyter notebook,
00:11:14.440 | and we might want to think about defining
00:11:16.220 | the proper service architecture for agents in production
00:11:19.380 | and what that looks like.
00:11:20.380 | So today, you know, I'm excited to launch a preview feature
00:11:24.680 | of a new repo that we've been working on
00:11:26.480 | called Llama Agents,
00:11:28.580 | and it's an alpha feature,
00:11:30.800 | but basically it represents agents as microservices, right?
00:11:35.700 | So, in addition to some of the fantastic work
00:11:38.800 | that a lot of these multi-agent frameworks have done,
00:11:41.260 | the core goal of Llama Agents really is to think
00:11:43.740 | about every agent as a separate service,
00:11:46.260 | and to figure out how these different services
00:11:48.600 | can operate together, communicate with each other
00:11:51.220 | through a central API communication interface,
00:11:54.540 | and work together to solve a given task
00:11:57.940 | in a way that is scalable,
00:12:00.140 | can handle multiple requests at once,
00:12:01.940 | and is easy to deploy to
00:12:04.420 | different types of services.
00:12:05.500 | Each agent can encapsulate a set of logic
00:12:09.080 | but still communicate with the others
00:12:10.840 | and be reused across different tasks.
00:12:13.700 | So it's really thinking about
00:12:15.780 | how you take these agents out of a notebook
00:12:18.020 | and into production,
00:12:19.320 | and it's an idea that we've had for a while now,
00:12:22.340 | but we see this as a key ingredient
00:12:24.000 | in helping you build something that's production-grade,
00:12:26.340 | a production-grade knowledge assistant,
00:12:28.260 | especially, you know,
00:12:29.860 | as the world gets more agentic over time.
00:12:33.600 | So the core architecture here is that, you know,
00:12:36.280 | every agent is just represented as a separate service.
00:12:39.380 | You can write the agents however you want, basically,
00:12:42.340 | with LlamaIndex,
00:12:44.060 | or with another framework as well,
00:12:45.340 | and we have some of the interfaces
00:12:47.000 | to basically build a custom agent,
00:12:48.760 | and then you're able to deploy it as a service,
00:12:51.080 | and basically the agents can interact with each other
00:12:53.520 | via some sort of message queue,
00:12:54.920 | and then the orchestration can happen between the agents
00:12:57.820 | via, like, a general control plane, right?
00:13:00.040 | We took some of the inspiration
00:13:01.580 | from existing resource allocators,
00:13:03.960 | for instance Kubernetes,
00:13:05.100 | or other kinds of
00:13:06.800 | open-source, systems-level projects,
00:13:09.440 | and the orchestration can be either explicit,
00:13:12.160 | so you explicitly define these flows between services,
00:13:14.820 | or it could be implicit, right?
00:13:16.880 | You can have some sort of LLM orchestrator
00:13:19.040 | just figure out what tasks to delegate to
00:13:21.240 | given the current state of things.
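
Here is a hedged sketch of that architecture using the llama-agents alpha API as it looked at launch (class names and signatures may have changed since; the weather agent and its tool are illustrative stand-ins):

```python
# llama-agents alpha sketch: each agent is a service; a message queue carries
# traffic; a control plane with an LLM orchestrator routes tasks implicitly.
from llama_agents import (
    AgentService,
    AgentOrchestrator,
    ControlPlaneServer,
    SimpleMessageQueue,
)
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI

def get_weather(city: str) -> str:
    """Illustrative stand-in tool."""
    return f"Sunny in {city}."

agent = ReActAgent.from_tools([FunctionTool.from_defaults(fn=get_weather)])

message_queue = SimpleMessageQueue()
control_plane = ControlPlaneServer(
    message_queue=message_queue,
    orchestrator=AgentOrchestrator(llm=OpenAI()),  # implicit, LLM-driven routing
)
weather_service = AgentService(
    agent=agent,
    message_queue=message_queue,
    description="Answers questions about the weather.",
    service_name="weather_agent",
)
```
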
00:13:23.620 | And so one thing that I want to show you, basically,
00:13:28.540 | is just
00:13:31.240 | how this relates to this idea of knowledge assistants, right?
00:13:34.420 | Because we think that multi-agents
00:13:36.280 | are going to be a core component of this,
00:13:38.460 | and this is basically a demo that we whipped up
00:13:41.400 | showing you how to run LLM agents
00:13:43.980 | on a basic RAG pipeline.
00:13:46.840 | This is a pretty trivial RAG pipeline.
00:13:48.440 | There's a query rewriting service, right,
00:13:51.960 | and then also some sort of default agent
00:13:54.520 | that basically just does RAG, like search and retrieval,
00:13:57.640 | and you can also add in other components
00:13:59.520 | and services, like reflection.
00:14:01.040 | You could have other tools as well
00:14:03.060 | or even a general tool service.
00:14:04.500 | And the core demo here is really showing that, you know,
00:14:07.900 | given some sort of input,
00:14:09.060 | they're communicating with each other
00:14:11.640 | through some sort of, like, API protocol,
00:14:13.580 | and so this allows you to, for instance,
00:14:15.980 | launch a bunch of different client requests at once,
00:14:18.240 | handle, you know, task requests from different directions,
00:14:21.420 | and basically have these agents operate
00:14:23.760 | as, like, an encapsulated microservice, right?
00:14:27.320 | And so the query rewrite agent takes in some sort of query,
00:14:30.140 | processes it, rewrites it into some new query,
00:14:33.420 | and then, you know, the second agent
00:14:35.360 | will basically take in this query,
00:14:36.640 | do some search and retrieval,
00:14:37.840 | and basically output a final response.
00:14:40.160 | If you've built a RAG pipeline before, all this stuff,
00:14:42.380 | the actual logic, should be relatively trivial,
00:14:44.320 | but the goal is to basically show you
00:14:46.160 | how you can turn something,
00:14:47.480 | even something that's trivial,
00:14:49.860 | into a set of services that you can basically deploy, right?
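
For local experimentation, the alpha repo also shipped a launcher that runs the control plane and services in a single process; a hedged usage sketch, reusing the names from the architecture sketch above:

```python
# Run everything locally in one process and send a single task through the system.
# Reuses `weather_service`, `control_plane`, and `message_queue` from the sketch above.
from llama_agents import LocalLauncher

launcher = LocalLauncher([weather_service], control_plane, message_queue)
print(launcher.launch_single("What's the weather in San Francisco?"))
```
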
00:14:52.280 | And this is just another example,
00:14:56.480 | basically a backup slide,
00:14:57.780 | that again highlights the fact
00:14:59.680 | that you can have multiple agents, right,
00:15:02.040 | and they all operate and work together
00:15:03.700 | to achieve a given task.
00:15:06.080 | So, you know, the QR code is linked.
00:15:10.480 | First of all, this is in alpha mode, right,
00:15:13.520 | and so we're really excited to basically share this
00:15:16.200 | with the community.
00:15:16.880 | We're very public about the roadmap, actually,
00:15:19.420 | so check out the discussions tab
00:15:20.900 | about what's actually in there and what's not.
00:15:22.860 | We're launching with
00:15:25.280 | a dozen initial tutorials
00:15:27.420 | to show you how to build
00:15:29.000 | a set of microservices
00:15:30.660 | that help you
00:15:32.340 | build that production-grade agentic
00:15:34.600 | knowledge assistant workflow.
00:15:35.680 | And there's also a repo linked
00:15:38.400 | that I think should be public now.
00:15:39.840 | You know, in general, we're pretty excited
00:15:42.680 | to get feedback from the community
00:15:44.060 | about what a general communication protocol
00:15:46.220 | should look like,
00:15:47.260 | how we integrate with some
00:15:48.740 | of the other awesome work
00:15:50.220 | that the community has done,
00:15:51.200 | and basically help achieve this core mission
00:15:54.200 | of, again, building a production-grade,
00:15:56.440 | multi-agent knowledge assistant.
00:15:57.720 | And this is just the last component,
00:16:01.900 | which I already mentioned,
00:16:03.900 | but basically, if you're interested
00:16:05.420 | in the data quality side of things,
00:16:07.320 | let's say you don't care about agents at all
00:16:09.480 | and just care about data quality,
00:16:11.280 | we're opening up a waitlist
00:16:12.520 | for Llama Cloud more generally
00:16:14.140 | so that you're able to, you know,
00:16:15.960 | deal with all those decisions
00:16:17.640 | that I mentioned,
00:16:18.380 | the parsing, chunking, indexing,
00:16:20.400 | and ensure that, you know,
00:16:21.720 | your bucket of PDFs
00:16:22.820 | with embedded charts, tables, images
00:16:24.760 | is processed and parsed the right way.
00:16:26.860 | And if you're an enterprise developer
00:16:29.320 | with that use case, come talk to us.
00:16:31.260 | So that's basically it.
00:16:32.700 | Thanks for your time
00:16:33.440 | and hope you enjoyed it.