Building AI Agents that actually automate Knowledge Work - Jerry Liu, LlamaIndex


Transcript

Okay. Hey, everyone. I'm Jerry, co-founder and CEO of LlamaIndex. It's great to be here, and today my talk is titled Building AI Agents that Actually Automate Knowledge Work. A big promise of AI agents is making knowledge workers more efficient. I'm sure you've heard the high-level business speak of this; I've copy-pasted screenshots from a bunch of B2B SaaS vendors on the right.

Increased operational efficiency, better decision-making through more data... but what does this actually mean? Does knowledge work automation just mean building RAG chatbots? And if not, what is the stack, and what are the use cases where AI agents can actually automate knowledge work? For us, a lot of our use cases and core focus areas are about automating knowledge work over unstructured data.

90% of enterprise data lives in the form of documents, whether PDFs, PowerPoints, Word docs, and, as you'll soon see, Excel. Humans have historically needed to read and write these types of docs: you have an investment banker, or someone on the customer support side, reviewing a lot of unstructured data and using that documentation to make decisions and take actions.

For the first time, AI agents can actually reason and act over massive amounts of unstructured context tokens: do analysis, do research, synthesize insights, and take actions end to end. And so when we think about the use cases and the types of agents for automating knowledge work, they really fall into two main categories.

There's what we call assistive agents: those with a more standard chat interface that help humans get information faster. And then there are automation agents: agents that automate routine tasks, can run in the background, may require a bit less human in the loop, and can take actions that automate the routine operational stuff.

When we think about the stack required to build either the assistive or automation type of agent, there are two main components: really nice tools, and a really nice agent architecture. With MCP and A2A these days, a lot of people are thinking about how to build really nice tools that let agents interface with the external world, both to surface relevant context and to take external actions.

And on the agent architecture side, there are very general reasoning loops as well as more constrained ones; it's basically a question of how to encode the business logic in an agentic workflow to achieve the task. For the purposes of this talk, we'll cover three main things.

A lot of stuff to cover, so I'll probably pick up my clock speed a little bit. One is building a document toolbox: how do I build really nice tools that let AI agents interact with massive amounts of unstructured documents. Two is agent design patterns.

So thinking, at a high level, about the two categories of agents, from assistive to automation. And three is bringing it together with document agent use cases. The first step is building a document toolbox. If you think about agents interacting with tools, as LLMs get better you're going to have these very general front-end interfaces like Claude or ChatGPT.

Agents need access to the right tools to interface with the external world. And for massive amounts of unstructured enterprise data, they need the right toolbox to interact with that data. It's basically a generalization beyond naive RAG. I know this is a RAG workshop, but RAG is just retrieval and then one-shot synthesis.

A lot of what agents can do over your documents includes retrieval, but also other operations like file-based search, manipulation, and more. And one of the points I'm trying to make is that to create these tool interfaces in the first place, you need a really nice preprocessing layer.

So you need actual data connectors that sync data from your source, whether SharePoint, Google Drive, S3, or Confluence, into a format your agents can access. It needs to sync permissions, too, along with the right metadata. You need the right document parsing and extraction piece.

More on this in just a bit, but you need genuinely good understanding of your documents: tables, charts, and more. And of course, if you have a large collection of docs, you need to index it in some way. It could be vector indexing for vector search.

It could also be indexing into a SQL table, or graph DBs; it could be anything. So basically, to ensure the data is high quality, you need this layer to process and structure your documents and expose the right tool interfaces.
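
To make that concrete, here's a minimal sketch of the preprocessing layer using off-the-shelf LlamaIndex primitives. It assumes a local folder of files and default OpenAI credentials, standing in for the production connectors and permission syncing described above.

```python
# Minimal sketch of the preprocessing layer: load -> index -> expose a tool.
# A local folder stands in for SharePoint/Drive/S3 connectors; permission and
# metadata syncing are omitted for brevity.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# 1. Data connector: sync documents into a format the agent can access.
documents = SimpleDirectoryReader("./docs").load_data()

# 2. Indexing: vector indexing is one option; SQL tables or graph DBs are others.
index = VectorStoreIndex.from_documents(documents)

# 3. Tool interface: the query engine becomes one tool in the document toolbox.
query_engine = index.as_query_engine()
print(query_engine.query("Summarize the key risks in these documents."))
```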

In terms of those tool interfaces, this is where I want to define a term: a document MCP server. Again, it's a generalization of the idea of RAG. If RAG is just one-shot vector retrieval, you need a whole set of tools to equip an AI agent to understand and manipulate different types of documents.

It could be semantic search to fuzzy-find the relevant source of data. It could be file lookup to fetch the right file metadata. It could be manipulation to actually do operations on top of the files, and it could be structured querying: querying a more structured database to get aggregate insights over the data you've extracted.
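
To picture what such a server's tool surface might look like, here's a hypothetical sketch. The four tool names and their stub bodies are illustrative stand-ins, not an actual LlamaIndex API.

```python
# Hypothetical sketch of a "document MCP server" tool set. The tool names and
# stub bodies are illustrative; real implementations would be backed by a
# vector store, a file metadata store, and a SQL database.
from llama_index.core.tools import FunctionTool

def semantic_search(query: str) -> str:
    """Fuzzy-find relevant passages across the document corpus."""
    return "stub: top-k passages matching the query"

def file_lookup(filename: str) -> dict:
    """Look up metadata (path, owner, last modified) for a specific file."""
    return {"filename": filename, "owner": "stub", "modified": "stub"}

def manipulate_file(filename: str, operation: str) -> str:
    """Apply an operation (split, convert, redact, ...) to a file."""
    return f"stub: applied {operation} to {filename}"

def structured_query(sql: str) -> list:
    """Run an aggregate query over fields extracted into a SQL table."""
    return ["stub rows"]

document_tools = [
    FunctionTool.from_defaults(fn=f)
    for f in (semantic_search, file_lookup, manipulate_file, structured_query)
]
```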

One top consideration when building this type of toolbox is complex documents. For those of you who follow our socials, we talk a lot about this issue: a huge amount of human knowledge lives in really complicated PDFs and other formats too.

Embedded tables, charts, images, irregular layouts, headers, footers: this is stuff typically designed for human consumption, not machine consumption. So if the documents are not processed correctly, no matter how good your LLM is, it will fail. We were probably among the first to realize that LLMs and LVMs could be used for document understanding.

In contrast to more traditional techniques, where you use hand-tuned, task-specific ML models to do document parsing over a specific class of documents, LLMs have a much more general level of accuracy that you can use to your advantage in understanding and inhaling any type of document with any level of complexity.

Obviously, the baseline these days is that you can just screenshot a PDF and feed it into ChatGPT or Claude. It doesn't actually give you amazing accuracy, but it's a good start. One of the secret-sauce magic tricks we found was figuring out how to interleave LLMs and LVMs with more traditional parsing techniques, and to add test-time tokens for agentic validation and reasoning, to get a much higher level of accuracy.
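
The exact recipe is the secret sauce, but the shape of the idea can be sketched roughly as below. Every helper here (run_layout_parser, lvm_transcribe, llm_validate, lvm_revise) is a trivial hypothetical stand-in, not LlamaParse's actual implementation.

```python
# Rough, hypothetical sketch of interleaving an LVM with traditional parsing
# plus agentic validation at test time. All helpers are trivial stand-ins for
# real components (an OCR/layout model, a vision model, an LLM checker).
from dataclasses import dataclass, field

@dataclass
class Region:
    kind: str = "text"            # "text" | "table" | "chart" | "figure"
    bbox: tuple = (0, 0, 1, 1)
    ocr_text: str = "stub text"

@dataclass
class Layout:
    regions: list = field(default_factory=lambda: [Region()])

def run_layout_parser(page_image) -> Layout:
    return Layout()               # stand-in for a traditional OCR/layout model

def lvm_transcribe(page_image, bbox) -> str:
    return "stub table markdown"  # stand-in for a vision-model transcription

def llm_validate(page_image, draft):
    return None                   # stand-in checker: None means "looks right"

def lvm_revise(page_image, draft, critique) -> str:
    return draft                  # stand-in revision step

def parse_page(page_image, max_retries: int = 2) -> str:
    layout = run_layout_parser(page_image)
    # Hard regions (tables, charts, figures) go to the vision model; plain
    # text regions keep the traditional OCR output.
    blocks = [
        lvm_transcribe(page_image, r.bbox)
        if r.kind in ("table", "chart", "figure") else r.ocr_text
        for r in layout.regions
    ]
    draft = "\n\n".join(blocks)
    # Agentic validation: spend extra test-time tokens checking the draft
    # against the page image, revising until the checker is satisfied.
    for _ in range(max_retries):
        critique = llm_validate(page_image, draft)
        if critique is None:
            break
        draft = lvm_revise(page_image, draft, critique)
    return draft

print(parse_page(page_image=None))
```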

And so we have a cloud service that does document parsing and is a core step of this document toolbox. We benchmarked our modes, where we adapt models like Claude Sonnet 3.5 and 4.0, Gemini 2.5 Pro, and GPT-4.1 from OpenAI, and it outperforms existing parsing tools out there, open source and proprietary, on parsing benchmarks.

Some of you might know us as a RAG framework; that's basically how we started. For those of you who don't know, we also have a managed platform that is basically a giant AI-native document toolbox, containing a lot of the operations you need on top of your docs.

Document parsing and document extraction use some of the capabilities I just mentioned, and let you parse, extract, and index data for the whole set of tools I described. One special release I want to highlight today, which we announced in a blog post a few hours ago, is Excel capabilities to complement this document toolbox.

A lot of knowledge work happens in Microsoft Excel, and also Google Sheets and Numbers: basically, spreadsheets. But they've been unsolved by LLMs. If you look at the document on the right, neither RAG nor text-as-CSV techniques will actually work over it, because it's not really a structured 2D table.

There are gaps in the rows and gaps in the columns. So we built an Excel agent that's capable of taking un-normalized Excel spreadsheets and transforming them into a normalized 2D format, and it also lets you do agentic Q&A over both the normalized and un-normalized versions of the spreadsheet.

It's a pretty cool capability; I'll describe how it works in just a bit. It complements our toolbox of more traditional document parsing, extraction, and indexing, and it's available in early preview. Take a look at the video; it's also in our blog post.

We uploaded that example synthetic dataset, transformed it into a 2D table, and you can also ask questions over it to get insights. It's really doing the heavy lifting of deeply understanding the semantic structure of the Excel spreadsheet, then plugging that in as specialized tools for an AI agent.

The best baseline is not really RAG or text-as-CSV; those both suck. It's really just an LLM that can write code, so an LLM with a code interpreter tool is a reasonable baseline and gets you to 70-75% accuracy. Over a private dataset of synthetic Excel sheets, we were able to get this up to 95%.
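
For reference, the code-interpreter baseline amounts to something like the sketch below: let the model write pandas code against the raw grid and execute it. The prompt and the bare exec loop are simplified assumptions, not the actual benchmark harness.

```python
# Simplified sketch of the "LLM + code interpreter" baseline over a messy sheet.
# Assumes llama-index's OpenAI wrapper and an API key; the prompt and the bare
# exec() are toy choices, not the benchmark harness (sandbox exec in practice).
import pandas as pd
from llama_index.llms.openai import OpenAI

df = pd.read_excel("messy_sheet.xlsx", header=None)   # un-normalized grid

llm = OpenAI(model="gpt-4.1")
prompt = (
    "You are given a pandas DataFrame `df` containing a messy spreadsheet:\n"
    f"{df.head(20).to_string()}\n\n"
    "Write Python (no code fences) that builds a normalized 2D table "
    "named `result`."
)
code = llm.complete(prompt).text
namespace = {"df": df, "pd": pd}
exec(code, namespace)                                  # sandbox this in production
print(namespace["result"])
```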

That actually surpasses the human baseline of 90%: a human trying to do the data transformation by hand. A brief note on how it works; it's a little bit technical, but more details are in the blog post. First, we do some structure understanding of the Excel spreadsheet.

We do a little bit of reinforcement learning: we adapt dynamically to the specific format of the document and learn a semantic map of the sheet. By learning a semantic map, we can then translate it into a set of specialized tools that we provide to an agent.

From an abstract perspective, an agent could just write code from scratch, and as LLMs get better, that will certainly become a higher-performing baseline. But in the meantime, we're helping it out by providing a set of specialized tools over the semantic map.
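
As an illustration of "specialized tools over a semantic map", here's a hedged sketch. The SheetRegion structure, the example values, and the tool functions are assumptions for clarity, not the product's internal representation.

```python
# Hypothetical sketch: a learned semantic map of a spreadsheet, exposed to an
# agent as specialized tools. The structure and values are illustrative only.
from dataclasses import dataclass
from llama_index.core.tools import FunctionTool

@dataclass
class SheetRegion:
    name: str          # logical table name, e.g. "revenue_by_region"
    cell_range: str    # where it lives in the grid, e.g. "B4:F18"
    headers: list      # column headers recovered from the irregular layout

semantic_map = [
    SheetRegion("revenue_by_region", "B4:F18", ["region", "Q1", "Q2", "Q3", "Q4"]),
    SheetRegion("headcount", "H4:I12", ["team", "count"]),
]

def list_regions() -> list:
    """Enumerate the logical tables discovered in the sheet."""
    return [(r.name, r.cell_range, r.headers) for r in semantic_map]

def read_region(name: str) -> list:
    """Return a region's cells as normalized rows (stubbed here)."""
    return []  # would slice the workbook by the region's cell_range

excel_tools = [FunctionTool.from_defaults(fn=list_regions),
               FunctionTool.from_defaults(fn=read_region)]
```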

So you can reason over an Excel spreadsheet. Great. The next piece: we've talked about a document toolbox and the operations that make it really good and comprehensive. Now that you've plugged it into an agent, what are the different agent architectures, and what use cases do they imply?

As many of you probably know from building agents yourselves, agent orchestration ranges from more constrained architectures to unconstrained ones. Constrained means you more explicitly define the control flow. Unconstrained is a ReAct loop, function calling, Codex, whatever: you give it a set of tools and let it run.

Deep research is kind of the same thing. For us, we've noticed two main categories of UXs. There are assistant-based UXs that help a human surface information or produce some unit of knowledge work, usually through a chat-based interface. It's chat-oriented, and the inputs are natural language.

The architecture is a little more unconstrained: it's basically a ReAct loop over some set of tools. And it's inherently unconstrained but with a higher degree of human in the loop. The expectation is that the human guides and coaxes the agent along the steps of the process to achieve the task at hand.
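
A minimal version of that assistive pattern, assuming LlamaIndex's classic ReActAgent API and the document_tools list sketched earlier:

```python
# Minimal sketch of the assistive pattern: an unconstrained ReAct loop over
# document tools, with the human steering turn by turn. Assumes llama-index's
# classic ReActAgent API and the `document_tools` list sketched earlier.
from llama_index.core.agent import ReActAgent
from llama_index.llms.openai import OpenAI

agent = ReActAgent.from_tools(
    tools=document_tools,            # semantic_search, file_lookup, ...
    llm=OpenAI(model="gpt-4.1"),
    verbose=True,                    # surface the reason/act trace to the user
)

# Chat-oriented and human-in-the-loop: each turn, the user nudges the next step.
print(agent.chat("Which vendor contracts renew this quarter?"))
print(agent.chat("Now summarize the termination clauses in the top three."))
```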

I'm sure many of you have built these types of use cases, so this is just a very small subset, but it's basically your generalization of a RAG chatbot. There's a second category of use cases that I think is interesting, and that a lot of folks are starting to build toward, which is the automation interface.

Instead of providing an assistant or copilot to help a human get more information, these agents process routine tasks in a multi-step, end-to-end manner. And usually the architecture is a little bit different: it takes in some batch of inputs, and it can run in the background or be triggered ad hoc by a human.

The architecture is a little more constrained, which makes sense: if you want this thing to run end-to-end, you need it to not go off the rails. There's usually a bit less human in the loop at every step of the process, with some sort of batch review at the end.

And the output is structured results, integration with APIs, decision-making: after approval, it routes to the downstream systems. Some of the use cases here include financial data normalization, data sheet extraction, invoice reconciliation, contract review, and more. I'll skip this video, but there are some fun examples of community-built open-source repos in this area, like the invoice reconciler by Laurie Voss.
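
One way to picture the constrained architecture is as an explicit workflow: a batch of inputs, fixed steps, structured output, and a review queue at the end. Here's a rough sketch using LlamaIndex's Workflow abstraction, where the extraction and review helpers are placeholders rather than a real invoice pipeline.

```python
# Rough sketch of a constrained automation agent as an explicit workflow:
# batch in, fixed steps, batch human review at the end. The extract and review
# helpers are placeholders, not a real invoice pipeline.
from llama_index.core.workflow import StartEvent, StopEvent, Workflow, step

def extract_invoice(path: str) -> dict:
    """Placeholder: parse the file and extract structured fields."""
    return {"path": path, "total": 100.0, "line_item_sum": 100.0}

def write_to_review_queue(records: list) -> None:
    """Placeholder: persist flagged records for batch human review."""
    print(f"queued {len(records)} records for review")

class InvoicePipeline(Workflow):
    @step
    async def process_batch(self, ev: StartEvent) -> StopEvent:
        results = []
        for path in ev.file_paths:                    # batch of inputs
            record = extract_invoice(path)
            # Encoded business logic: flag mismatches instead of guessing.
            record["flagged"] = record["total"] != record["line_item_sum"]
            results.append(record)
        write_to_review_queue(results)                # human reviews in batch
        return StopEvent(result=results)

# Can run in the background on a schedule, or be triggered ad hoc:
# results = await InvoicePipeline().run(file_paths=["inv_001.pdf", "inv_002.pdf"])
```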

A general pattern that has emerged, and that we've noticed, is that automation agents can often serve as a back end: they run in the background and can do the data ETL and transformation. There's still a human in the loop, but they do the work of processing and structuring a lot of data and making decisions in the background.

Assistant agents, in contrast, are more front-end facing. So automation agents can structure and process your data and provide the right tool interfaces for assistant agents. Not every tool depends on agentic reasoning, but for a lot of these use cases, like a very generalized data pipeline where you're processing a lot of unstructured context, you might have automation agents go in, process your data, and provide the right tools for a more research-oriented, user-facing interface.

So we talked about building a document toolbox, and we talked about the different categories of agentic architectures. Putting it together, here are some real-world use cases of document agents: examples of agents that actually help automate different types of knowledge work.

One of our favorite examples is a combination of both automation and assistant UXs for financial due diligence. Carlyle is one of our favorite customers and partners; they used some of our core capabilities to build an end-to-end leveraged buyout agent.

It requires an automation interface to inhale massive amounts of unstructured public and private financial data: Excel sheets, PDFs, PowerPoints, which go through bespoke extraction algorithms with human-in-the-loop review. Then, once that data is structured in the right format, it provides a copilot interface for the analyst teams to both get insights and generate reports over that data.

Any enterprise search use case typically falls within the assistant UX. Cemex is one of our favorite customers in this space: defining a lot of different collections over different sources of data and providing task-specific, specialized agentic RAG chatbots over your data.

It's basically RAG, but you add an agentic reasoning layer on top so you can break down user queries, do research, and answer the question at hand. And on the pure automation UX side, we've noticed a lot of use cases popping up around automation and efficiency.

One example is technical data sheet ingestion. We're working with a global electronics company that has a lot of data sheets that need to be automatically processed and reviewed, and historically it's taken a lot of human effort to do this. By creating the right end-to-end automation agent, you can encode the business-specific logic for parsing these types of documents, extracting the right pieces of information, matching them against specific rules, and outputting the structured data into SQL.
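
In outline, that pipeline is schema-driven extraction, a rule check, then a SQL write. Here's a simplified sketch with an assumed schema and rule (not the customer's actual logic), using llama-index's structured-output helper.

```python
# Simplified sketch of the data-sheet pipeline: structured extraction against a
# schema, a rule check, then a SQL write. The schema and rule are assumptions
# for illustration; flagged rows go to human review in a real deployment.
import sqlite3
from pydantic import BaseModel
from llama_index.llms.openai import OpenAI

class DataSheet(BaseModel):
    part_number: str
    max_voltage_v: float
    max_operating_temp_c: float

llm = OpenAI(model="gpt-4.1")
sllm = llm.as_structured_llm(DataSheet)        # schema-constrained extraction

raw_text = open("datasheet_page.txt").read()   # output of the parsing step
sheet = sllm.complete(raw_text).raw            # -> DataSheet instance

# Business rule: anything outside expected bounds is flagged for review.
needs_review = not (0 < sheet.max_voltage_v < 1000)

conn = sqlite3.connect("datasheets.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS parts (pn TEXT, volts REAL, temp_c REAL, review INTEGER)"
)
conn.execute(
    "INSERT INTO parts VALUES (?, ?, ?, ?)",
    (sheet.part_number, sheet.max_voltage_v,
     sheet.max_operating_temp_c, int(needs_review)),
)
conn.commit()
```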

There's human-in-the-loop review, but being able to do this end-to-end transforms weeks of technical-writer work into an automated extraction interface. So that's basically it. For those of you who are less familiar, LlamaIndex is the most accurate, customizable platform for automating your document workflows with agentic AI.

Our mission statement has evolved a little over the past few years; we were a very broad, horizontal framework, often focused on RAG. If you're interested in any of these capabilities, come talk to us, and please check us out at Booth G11.

Thank you.