back to indexBuilding AI Agents that actually automate Knowledge Work - Jerry Liu, LlamaIndex

00:00:00.040 |
Okay. Hey, everyone. I'm Gerry, co-founder and CEO of Llama Index. It's great to be here, 00:00:19.320 |
and today my topic, the talk title, is Building AI Agents that Actually Automate Knowledge Work. 00:00:25.560 |
So basically, a big promise of AI agencies is making knowledge workers more efficient. I'm sure you've 00:00:34.320 |
heard the high-level business speak of this, and I copy and pasted a bunch of B2B SaaS vendors on 00:00:39.420 |
the right in terms of screenshots. Increased operational efficiency, better decision-making 00:00:44.820 |
through more data, but what does this actually mean? Does knowledge work automation actually just 00:00:50.460 |
mean building RAG chatbots? And if not, what is the stack and what are the use cases that AI agents 00:00:56.280 |
can actually do in terms of automating knowledge work? So for us, a lot of our use cases and a lot 00:01:02.940 |
of our core focus areas is basically automating knowledge work over unstructured data. 90% of 00:01:08.580 |
enterprise data lives within the form of documents, whether it is PDFs, PowerPoints, Word, and, you know, 00:01:16.080 |
as you'll soon see, Excel. But humans have historically needed to basically read and write these types 00:01:21.900 |
of docs, right? You have, you know, an investment banker or someone, you know, kind of on the customer 00:01:27.180 |
support side reviewing a lot of just unstructured data and using that documentation to basically make 00:01:33.120 |
decisions and take actions. For the first time, AI agents can actually reason and act over massive amounts 00:01:39.840 |
of unstructured context tokens and, you know, do analysis, do research, synthesize these insights, 00:01:46.380 |
and actually take actions end to end. And so for us, when we think about the use cases and the types of 00:01:53.640 |
agents for automating knowledge work, they really fall into two main categories. There's what we call 00:01:58.140 |
assistive agents. So those that are kind of more like a standard chat interface. They help humans get more 00:02:04.140 |
information faster. And then there's automation type agents, agents that automate routine tasks, 00:02:10.500 |
can run in the background, maybe require a little bit less human in the loop, and can take actions 00:02:16.020 |
that automate the routine operational stuff. When we think about the stack that's required to actually 00:02:22.260 |
build either the assistive or automation type agents, there's two main components. There's really, 00:02:28.020 |
really nice tools, and then there's a really nice agent architecture. With MCP 808 these days, a lot of people 00:02:35.280 |
are thinking about how do I build really nice tools that allow agents to interface with the external world to 00:02:40.860 |
basically surface relevant context and let the agent take external actions. And a lot of the agent architecture, 00:02:47.540 |
you know, there's very general reasoning loops as well as more constrained loops. It's basically how do I encode the 00:02:53.640 |
the business logic through an agentic workflow to help achieve the task. So for the purposes of this talk, 00:03:00.780 |
we'll talk about three main things. A lot of stuff to cover, so I'll probably pick up my clock speed a 00:03:05.760 |
little bit. But basically, there is building a document toolbox, which is how do I build really nice tools to 00:03:10.680 |
allow, you know, AI agents interact with massive amounts of unstructured documents. Two is agent design patterns. 00:03:17.700 |
So thinking about just at a high level, the two categories of agents from assistance automation. And 00:03:23.100 |
three is bringing it together in terms of document agent use cases. So first step is on building a document 00:03:29.380 |
toolbox. Basically, if you think about agents interacting with tools, and as LLMs get better, you're going to 00:03:37.260 |
have these very general front end interfaces like Claude or ChatGPT. Agents need access to the right tools to 00:03:43.860 |
basically interface with the external world. And for the purposes of, you know, massive amounts of 00:03:48.480 |
unstructured enterprise data, they basically need the right toolbox to interact with this data. It's 00:03:54.060 |
basically a generalization beyond naive RAG, right? RAG is just retrieval. I know this is a RAG workshop, 00:03:59.880 |
but RAG is just like retrieval and then one-shot synthesis. A lot of what agents can do over your 00:04:07.360 |
documents includes retrieval, but also includes other operations like file-based search, manipulation, and more. 00:04:13.680 |
And one of the points I'm trying to make is that to basically create these tool interfaces in the 00:04:18.300 |
first place, you need a really nice preprocessing layer. So you need, you know, actual data connectors 00:04:23.700 |
to your data sources that basically sync data from your data source into a format that your agents can 00:04:29.520 |
access. You know, it could be SharePoint, Google Drive, S3, Confluence. It needs to sync permissions to 00:04:35.480 |
and the right metadata. You need the right document parsing and extraction piece. More on this in just a bit, 00:04:41.200 |
but you basically need actual, actually good understanding over your documents, over tables, charts, and more. And of 00:04:48.080 |
course, you know, if you have a large collection of docs, you need to index it in some way. It could be vector 00:04:53.340 |
indexing into, you know, vector search. It could also be indexing into a SQL table. It could be graph DBs. It could be 00:05:00.780 |
anything. So basically, to ensure the data is high quality, you need this layer to actually process and structure your documents 00:05:07.200 |
and expose the right tool interfaces. In terms of the right tool interfaces, this is what I want to kind of 00:05:14.080 |
define a term. It's basically called like a document MCP server. Again, it's like a generalization of this idea of rag, 00:05:20.540 |
right? If rag is just one shot vector retrieval, you kind of need like a set of tools to basically equip an AI agent with, 00:05:29.120 |
uh, to basically, uh, understand and manipulate different types of documents. It could be, you know, 00:05:34.500 |
doing semantic search to fuzzy find the relevant source of data. It could be file lookup to basically look up the 00:05:40.000 |
right file metadata. Um, it could be manipulation to actually do operations on top of the files, and it could be 00:05:45.840 |
structured querying, right? Quering, uh, uh, more structured database to get aggregate insights over the types of data, um, 00:05:52.000 |
that, that you've extracted out. One, you know, top consideration, uh, when actually building this type of 00:06:00.000 |
toolbox is, uh, complex documents. Uh, for those of you who follow our socials, we talk a lot about this type of issue 00:06:06.000 |
where a lot of human knowledge in the form of like really complicated PDFs and other formats too. Embedded tables, 00:06:12.880 |
charts, images, irregular layouts, headers, footers. This is typically stuff that's designed for human consumption and not machine 00:06:19.120 |
consumption. And so, you know, if the documents are not processed correctly, no matter how good your LLM is, it will fail. 00:06:25.120 |
So we were probably one of the first people to actually realize that LLMs and LVMs could be used for document 00:06:32.880 |
understanding. Um, if, uh, in contrast to more traditional techniques where you use kind of like hand-tuned and 00:06:39.120 |
task-specific ML models to achieve, uh, kind of like document parsing over a specific class of documents, LLMs actually 00:06:46.480 |
have a much general layer of accuracy, um, that you can use to your advantage and just like understanding 00:06:52.000 |
and inhaling any type of document with comply, uh, any type of complexity. Um, obviously the baseline 00:06:58.000 |
these days is you can just screenshot a PDF, feed it into chat to your cloud. Um, it doesn't actually give 00:07:03.920 |
you amazing accuracy, but it's a good start. And so one of the kind of secret sauce, like, uh, magic tricks we 00:07:09.600 |
found was figuring out how to interleave LLMs and LVMs with more traditional parsing techniques and adding 00:07:16.400 |
kind of test time tokens in terms of agentic validation and reasoning to really get a higher level of accuracy. 00:07:22.160 |
Um, and so, you know, we have a cloud service that does document parsing and is a core step of this document 00:07:27.760 |
toolbox. Uh, we basically benchmarked, uh, our modes where we adapt, uh, you know, Sana 3.5, 4.0, uh, Gemini 2.5 Pro, 00:07:36.400 |
4.1 from open AI. And it basically outperforms all existing parsing benchmarks, um, and, and tools out 00:07:42.800 |
there in terms of open source to proprietary. Um, yeah. So some of you might know us as a RAG framework. 00:07:51.760 |
That's basically how we started. Um, you know, for those of you who don't know, we have this, uh, managed 00:07:56.640 |
platform that is basically this giant AI native document toolbox, um, contains a lot of operations 00:08:01.680 |
that you need to do on top of your docs. It could be document parsing, document extraction, 00:08:06.320 |
uh, uses some of those, you know, kind of capabilities I just mentioned and allows you 00:08:10.000 |
to parse, extract index data for all the set of tools I just mentioned. 00:08:13.440 |
One of the special releases I actually want to highlight today. Um, and we just announced this 00:08:19.840 |
in a blog post a few hours ago is Excel capabilities to help compliment this document toolbox. 00:08:25.280 |
A lot of knowledge work happens in Microsoft Excel and also Google sheets and, you know, 00:08:29.680 |
numbers and basically it's spreadsheets, right? But it's been unsolved by LLMs. Um, if you look at the 00:08:35.920 |
document to the right, uh, neither RAG nor Texas CSV techniques will actually work over this because 00:08:42.720 |
it's not really a structured 2D table. There's a bunch of gaps in the rows and gaps in the columns. 00:08:47.520 |
So we basically built an Excel agent, um, that's capable of taking un-normalized Excel spreadsheets 00:08:56.480 |
and transforming them, um, into a normalized 2D format and also allows you to do agentic QA, um, over, uh, both the un-normalized 00:09:05.840 |
versions of the Excel spreadsheet. Um, it's a pretty cool capability. I'll describe, uh, how it kind of 00:09:12.160 |
works in just a bit. Um, but it's going to complement our toolbox, right? In terms of, uh, more traditional 00:09:18.320 |
document parsing, extraction, indexing, and it's available in, uh, early preview. So if you just, uh, 00:09:24.480 |
take a look at the video, it's also on our blog posts. We basically uploaded that example synthetic 00:09:28.960 |
data set, transformed it into a 2D table, and you can also ask questions over it to basically get 00:09:34.560 |
insights. And it's really doing the heavy lifting of deeply understanding the semantic structure of 00:09:39.280 |
the Excel spreadsheet, um, and then using that and plugging that in as specialized tools to an AI agent. 00:09:44.720 |
Um, the best baseline is not really RAG or Texas CSV. Um, those both suck. Um, it's really just an LLM 00:09:56.080 |
being able to write code. Um, so, uh, LLM with the code interpreter tool is a reasonable baseline, 00:10:00.720 |
gets you to 70, 75% accuracy. Um, over like a private dataset of synthetic Excel sheets, uh, we basically 00:10:07.120 |
were able to get this up to 95%. Um, it actually surpasses human baselines of 90% of a human trying to go and do the data 00:10:13.840 |
transformation by hand. Um, a brief note on how it works. Uh, it's a little bit technical, um, but you 00:10:22.400 |
know, more details are in the blog post. Um, first we do some sort of structure understanding of the Excel 00:10:28.080 |
spreadsheet. So we do a little bit of RL reinforcement learning. Um, you know, uh, we actually kind of adapt 00:10:35.760 |
dynamically to the specific format of the document, um, and learn a semantic map of the sheet. By learning a 00:10:42.400 |
semantic map, uh, we can then translate this into, um, kind of a set of specialized tools that you provide 00:10:49.120 |
to an agent. And so from an abstract perspective, you can kind of think about it as an agent could just 00:10:54.240 |
write code from scratch. Um, as LLMs get better, that will certainly become, um, an ease like a kind of 00:10:59.840 |
higher performing baseline. But in the meantime, we're helping it out by really providing, uh, a set of 00:11:05.200 |
specialized tools over the semantic map. So you can reason over an Excel spreadsheet. Great. Um, 00:11:11.840 |
the next piece here is, so we talked about a document toolbox. Uh, we talked about a lot of 00:11:16.160 |
operations basically make this, uh, document toolbox really good and comprehensive. So now that you plugged 00:11:21.760 |
it into an agent, what are the different agent architectures and what are the use cases are implied 00:11:26.080 |
by them? Um, as many of you probably know from building agents yourselves, agent orchestration ranges from more constrained 00:11:33.200 |
architectures to unconstrained architectures. Um, constrained is basically you kind of more explicitly 00:11:38.080 |
define the control flow. Unconstrained is like a react loop, function calling, codex, uh, whatever. 00:11:42.320 |
You basically give it a set of tools and let it run. Um, deep research is kind of the same thing. 00:11:46.240 |
Um, for us, we basically noticed there's two main categories of UXs. Um, there's more assistant-based UXs 00:11:54.720 |
that can basically surface information and, um, help a human surface information or produce some unit of 00:12:00.880 |
knowledge work through usually a chat-based interface. It's usually chat-oriented, the inputs, natural 00:12:06.160 |
language. Um, the architecture is a little bit more unconstrained. You know, it's basically a react loop 00:12:11.680 |
over some set of tools. Um, and it's inherently both unconstrained but also with a higher degree of human 00:12:17.840 |
in the loop. So the goal is-- or the expectation is that the human is supposed to kind of guide and coax 00:12:23.440 |
the agent, uh, along the steps of the process to basically achieve the task at hand. 00:12:28.080 |
There's a-- I mean, there's-- I'm sure many of you have built these types of use cases, and so this is just 00:12:34.720 |
a very small subset. Um, but it's basically just, you know, your, uh, generalization of a rag chatbot. 00:12:40.240 |
There's a second category of use cases that I think is interesting, and I think a lot of folks are actually 00:12:45.840 |
starting to build more into this space, which is, um, this automation interface. So being able to 00:12:52.160 |
actually, instead of, uh, providing some assistant or co-pilot to help a human get more information, 00:12:57.120 |
um, processing routine tasks in a multi-step, end-to-end manner. And usually the architecture 00:13:03.280 |
is a little bit different. Um, it takes in some batch of inputs. Uh, it can run in the background, 00:13:09.040 |
or it could be triggered ad hoc by the human. Um, the architecture is a little bit more constrained, 00:13:13.840 |
which kind of makes sense, right? If you want this thing to run more end-to-end, um, you need it to 00:13:18.400 |
not just go off the rails. Um, and there's usually a little bit less human in the loop at every step 00:13:24.000 |
of the process, and usually some sort of, like, batch review in the end. And the output is, like, 00:13:28.880 |
structured results, integration with APIs, uh, decision-making. After approval, it'll just go 00:13:33.840 |
route to the downstream systems. Some of the use cases here include, you know, financial data 00:13:39.040 |
data normalization, data sheet extraction, invoice reconciliation, contract view, and more. 00:13:43.840 |
Um, I'll skip this video, but, you know, there's some fun example of some community-based 00:13:52.000 |
open-source repos we built in this area, like the invoice reconciler by Lori Voss. 00:14:01.120 |
Uh, kind of general idea that we've emerged, that has emerged and we've noticed as a pattern is, 00:14:07.680 |
you know, oftentimes the automation agents can serve as a back-end, because it runs in the background, 00:14:12.720 |
you know, can do the data ETL transformation. They're still human in the loop, but it's kind 00:14:17.680 |
of the doing the thing where it needs to process and structure a lot of data, um, and do decisions 00:14:22.720 |
in the background. And then assistant agents are kind of more front-end facing, right? And so automation 00:14:27.840 |
agents can structure, process your data, and provide the right tool interfaces, um, for assistant agents. 00:14:33.600 |
Not every tool depends on agentic reasoning, but for a lot of these use cases, like for a very 00:14:39.680 |
generalized data pipeline, um, where you're processing a lot of unstructured context, you might have 00:14:45.040 |
automation agents go in and process your data, provide the right tools for some sort of more, uh, 00:14:50.320 |
research user facing interface. So we talked about building a document toolbox. We talked about, 00:14:59.120 |
you know, the, uh, the, the different categories of agentic architectures and putting it together. 00:15:03.440 |
Um, here are some real world use cases of document agents. And these are basically examples of agents 00:15:09.520 |
that actually help automate different types of knowledge work. So one of our favorite examples 00:15:14.880 |
is a combination of both automation and assistant UXs for financial due diligence. Um, Carlyle is one of 00:15:21.680 |
our, uh, favorite customers and partners. Um, you know, they basically used, uh, some of the core 00:15:27.200 |
capabilities that we have to build an end-to-end leverage bio agent. Um, you know, it requires an 00:15:33.200 |
automation interface to inhale massive amounts of unstructured public and private financial data. 00:15:38.160 |
Um, Excel sheets, PDFs, PowerPoints, go through some bespoke extraction algorithms with human in the loop 00:15:45.760 |
review. And then once that data is actually structured in the right format, providing a copilot 00:15:51.920 |
interface, uh, for the analyst teams to actually both get insights and generate reports over that data. 00:15:57.680 |
If you look at any enterprise search use case, that typically falls within the assistant UX. Um, 00:16:04.320 |
SemEx is one of our favorite, uh, customers in this space where, you know, just being able to define 00:16:09.200 |
a lot of different collections to different sources of data and providing more task-specific specialized 00:16:14.800 |
agentic rag chatbots over your data, right? Um, you know, it's basically a rag, but you add like an agentic 00:16:21.200 |
reasoning layer on top so that you can basically break down user queries, do research, and answer the 00:16:26.240 |
question at hand. And on the pure automation UX aside, uh, we notice a lot of kind of use cases 00:16:35.600 |
popping up around automate automation and efficiency. And so one example is actually technical data sheet 00:16:41.760 |
ingestion. Um, you know, we're working with a global electronics company. They have a lot of data sheets 00:16:47.600 |
that need to be automatically processed and reviewed. And historically, it's taken a lot of human effort to 00:16:53.120 |
actually do this. Um, so by creating the right end-to-end automation agent, you can basically encode the 00:16:59.680 |
business-specific logic for parsing these types of documents, extracting out the right pieces of 00:17:04.960 |
information, matching it against specific rules, and outputting the structured data into SQL. There's 00:17:11.360 |
human-in-the-loop review, um, but if we're actually able to do this end-to-end, it transforms weeks of 00:17:17.840 |
just like, you know, technical writer work, um, into an automated extraction interface. 00:17:22.640 |
So that's basically it. Um, you know, for those of you who are less familiar, 00:17:28.320 |
Lama Index is, uh, the most accurate customizable platform for automating your document workflows with 00:17:33.200 |
agentic AI. Um, our mission statements evolved a little bit since the past few years. We're, uh, 00:17:38.080 |
for a very broad horizontal, uh, framework oftentimes focused on RAG. Um, but if you're interested in some of the 00:17:43.520 |
capabilities, uh, come talk to us, and then please come check us out at Booth G11. Thank you.