
Building AI Agents that actually automate Knowledge Work - Jerry Liu, LlamaIndex


Whisper Transcript

00:00:00.040 | Okay. Hey, everyone. I'm Jerry, co-founder and CEO of LlamaIndex. It's great to be here,
00:00:19.320 | and today my topic, the talk title, is Building AI Agents that Actually Automate Knowledge Work.
00:00:25.560 | So basically, a big promise of AI agents is making knowledge workers more efficient. I'm sure you've
00:00:34.320 | heard the high-level business speak of this, and I copy and pasted a bunch of B2B SaaS vendors on
00:00:39.420 | the right in terms of screenshots. Increased operational efficiency, better decision-making
00:00:44.820 | through more data, but what does this actually mean? Does knowledge work automation actually just
00:00:50.460 | mean building RAG chatbots? And if not, what is the stack and what are the use cases that AI agents
00:00:56.280 | can actually do in terms of automating knowledge work? So for us, a lot of our use cases and a lot
00:01:02.940 | of our core focus areas are basically automating knowledge work over unstructured data. 90% of
00:01:08.580 | enterprise data lives in the form of documents, whether it's PDFs, PowerPoints, Word, and, you know,
00:01:16.080 | as you'll soon see, Excel. But humans have historically needed to basically read and write these types
00:01:21.900 | of docs, right? You have, you know, an investment banker or someone, you know, kind of on the customer
00:01:27.180 | support side reviewing a lot of just unstructured data and using that documentation to basically make
00:01:33.120 | decisions and take actions. For the first time, AI agents can actually reason and act over massive amounts
00:01:39.840 | of unstructured context tokens and, you know, do analysis, do research, synthesize these insights,
00:01:46.380 | and actually take actions end to end. And so for us, when we think about the use cases and the types of
00:01:53.640 | agents for automating knowledge work, they really fall into two main categories. There's what we call
00:01:58.140 | assistive agents. So those that are kind of more like a standard chat interface. They help humans get more
00:02:04.140 | information faster. And then there's automation type agents, agents that automate routine tasks,
00:02:10.500 | can run in the background, maybe require a little bit less human in the loop, and can take actions
00:02:16.020 | that automate the routine operational stuff. When we think about the stack that's required to actually
00:02:22.260 | build either the assistive or automation type agents, there's two main components. There's really,
00:02:28.020 | really nice tools, and then there's a really nice agent architecture. With MCP and A2A these days, a lot of people
00:02:35.280 | are thinking about how do I build really nice tools that allow agents to interface with the external world to
00:02:40.860 | basically surface relevant context and let the agent take external actions. And a lot of the agent architecture,
00:02:47.540 | you know, there are very general reasoning loops as well as more constrained loops. It's basically how do I encode
00:02:53.640 | the business logic through an agentic workflow to help achieve the task. So for the purposes of this talk,
00:03:00.780 | we'll talk about three main things. A lot of stuff to cover, so I'll probably pick up my clock speed a
00:03:05.760 | little bit. But basically, there is building a document toolbox, which is how do I build really nice tools to
00:03:10.680 | allow, you know, AI agents to interact with massive amounts of unstructured documents. Two is agent design patterns.
00:03:17.700 | So thinking about, just at a high level, the two categories of agents, from assistive to automation. And
00:03:23.100 | three is bringing it together in terms of document agent use cases. So first step is on building a document
00:03:29.380 | toolbox. Basically, if you think about agents interacting with tools, and as LLMs get better, you're going to
00:03:37.260 | have these very general front end interfaces like Claude or ChatGPT. Agents need access to the right tools to
00:03:43.860 | basically interface with the external world. And for the purposes of, you know, massive amounts of
00:03:48.480 | unstructured enterprise data, they basically need the right toolbox to interact with this data. It's
00:03:54.060 | basically a generalization beyond naive RAG, right? RAG is just retrieval. I know this is a RAG workshop,
00:03:59.880 | but RAG is just like retrieval and then one-shot synthesis. A lot of what agents can do over your
00:04:07.360 | documents includes retrieval, but also includes other operations like file-based search, manipulation, and more.
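A document toolbox of this kind can be sketched as a small set of tool functions. Everything below is illustrative, not a LlamaIndex API: the names are invented, and a toy keyword overlap stands in for real embedding-based retrieval.

```python
# Hypothetical sketch of a "document toolbox": a few tool functions an agent
# could call, generalizing beyond one-shot RAG retrieval. Names are invented.

DOCS = {
    "q3_report.pdf": "Revenue grew 12% in Q3 driven by enterprise contracts.",
    "invoice_0042.pdf": "Invoice 0042: total due $1,500, net 30 terms.",
}

def semantic_search(query: str) -> list[str]:
    """Fuzzy-find relevant documents (toy keyword overlap, not embeddings)."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(text.lower().split())), name)
              for name, text in DOCS.items()]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]

def file_lookup(name: str) -> dict:
    """Look up a file and some basic metadata by exact name."""
    text = DOCS[name]
    return {"name": name, "num_words": len(text.split()), "text": text}

def manipulate(name: str, old: str, new: str) -> str:
    """Apply an edit operation to a file's contents."""
    DOCS[name] = DOCS[name].replace(old, new)
    return DOCS[name]

# An agent would receive these as its tool set; a direct call looks like:
hits = semantic_search("Q3 revenue growth")  # hits[0] == "q3_report.pdf"
```

The point of the sketch is the shape of the interface: retrieval is just one tool among several, next to lookup and manipulation.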
00:04:13.680 | And one of the points I'm trying to make is that to basically create these tool interfaces in the
00:04:18.300 | first place, you need a really nice preprocessing layer. So you need, you know, actual data connectors
00:04:23.700 | to your data sources that basically sync data from your data source into a format that your agents can
00:04:29.520 | access. You know, it could be SharePoint, Google Drive, S3, Confluence. It needs to sync permissions
00:04:35.480 | and the right metadata. You need the right document parsing and extraction piece. More on this in just a bit,
00:04:41.200 | but you basically need actually good understanding of your documents, over tables, charts, and more. And of
00:04:48.080 | course, you know, if you have a large collection of docs, you need to index it in some way. It could be vector
00:04:53.340 | indexing into, you know, vector search. It could also be indexing into a SQL table. It could be graph DBs. It could be
00:05:00.780 | anything. So basically, to ensure the data is high quality, you need this layer to actually process and structure your documents
00:05:07.200 | and expose the right tool interfaces. In terms of the right tool interfaces, this is what I want to kind of
00:05:14.080 | define a term: it's basically called a document MCP server. Again, it's a generalization of this idea of RAG,
00:05:20.540 | right? If RAG is just one-shot vector retrieval, you kind of need a set of tools to equip an AI agent with,
00:05:29.120 | to understand and manipulate different types of documents. It could be, you know,
00:05:34.500 | doing semantic search to fuzzy find the relevant source of data. It could be file lookup to basically look up the
00:05:40.000 | right file metadata. Um, it could be manipulation to actually do operations on top of the files, and it could be
00:05:45.840 | structured querying, right? Querying a more structured database to get aggregate insights over the types of data, um,
00:05:52.000 | that you've extracted out. One top consideration, uh, when actually building this type of
00:06:00.000 | toolbox is, uh, complex documents. Uh, for those of you who follow our socials, we talk a lot about this type of issue,
00:06:06.000 | where a lot of human knowledge lives in the form of really complicated PDFs and other formats too. Embedded tables,
00:06:12.880 | charts, images, irregular layouts, headers, footers. This is typically stuff that's designed for human consumption and not machine
00:06:19.120 | consumption. And so, you know, if the documents are not processed correctly, no matter how good your LLM is, it will fail.
00:06:25.120 | So we were probably one of the first people to actually realize that LLMs and LVMs could be used for document
00:06:32.880 | understanding. Um, in contrast to more traditional techniques where you use kind of like hand-tuned and
00:06:39.120 | task-specific ML models to achieve, uh, document parsing over a specific class of documents, LLMs actually
00:06:46.480 | have a much more general level of accuracy, um, that you can use to your advantage in just understanding
00:06:52.000 | and inhaling any type of document with any level of complexity. Um, obviously the baseline
00:06:58.000 | these days is you can just screenshot a PDF and feed it into ChatGPT or Claude. Um, it doesn't actually give
00:07:03.920 | you amazing accuracy, but it's a good start. And so one of the secret-sauce magic tricks we
00:07:09.600 | found was figuring out how to interleave LLMs and LVMs with more traditional parsing techniques and adding
00:07:16.400 | kind of test time tokens in terms of agentic validation and reasoning to really get a higher level of accuracy.
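That interleaving pattern can be sketched roughly as follows. The parser, the confidence scores, and the LLM repair step are all stand-ins for illustration, not the actual production pipeline: parse every region conventionally, then spend extra test-time tokens only on the low-confidence regions.

```python
# Toy sketch of interleaving a traditional parser with an LLM repair pass.
# All three functions are stand-ins; only the control flow is the point.

def traditional_parse(region: str) -> tuple[str, float]:
    """Pretend OCR/layout parser: returns text plus a confidence score."""
    confidence = 0.3 if "table" in region else 0.95
    return f"parsed:{region}", confidence

def llm_reparse(region: str) -> str:
    """Stand-in for sending the hard region to an LLM/LVM for repair."""
    return f"llm_parsed:{region}"

def parse_document(regions: list[str], threshold: float = 0.8) -> list[str]:
    out = []
    for region in regions:
        text, conf = traditional_parse(region)
        if conf < threshold:            # agentic validation step: re-do
            text = llm_reparse(region)  # only the low-confidence regions
        out.append(text)
    return out

result = parse_document(["header", "table_1", "footer"])
```

The design point is cost: the LLM pass is expensive, so it only runs where the cheap parser is unsure.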
00:07:22.160 | Um, and so, you know, we have a cloud service that does document parsing and is a core step of this document
00:07:27.760 | toolbox. Uh, we basically benchmarked, uh, our modes, where we adapt, you know, Sonnet 3.5 and 4.0, Gemini 2.5 Pro,
00:07:36.400 | GPT-4.1 from OpenAI. And it basically outperforms all existing parsing benchmarks, um, and tools out
00:07:42.800 | there, from open source to proprietary. Um, yeah. So some of you might know us as a RAG framework.
00:07:51.760 | That's basically how we started. Um, you know, for those of you who don't know, we have this, uh, managed
00:07:56.640 | platform that is basically this giant AI-native document toolbox. Um, it contains a lot of operations
00:08:01.680 | that you need to do on top of your docs. It could be document parsing, document extraction,
00:08:06.320 | uh, uses some of those, you know, kind of capabilities I just mentioned and allows you
00:08:10.000 | to parse, extract, and index data for the set of tools I just mentioned.
00:08:13.440 | One of the special releases I actually want to highlight today, um, and we just announced this
00:08:19.840 | in a blog post a few hours ago, is Excel capabilities to help complement this document toolbox.
00:08:25.280 | A lot of knowledge work happens in Microsoft Excel, and also Google Sheets and, you know,
00:08:29.680 | Numbers, and basically it's spreadsheets, right? But it's been unsolved by LLMs. Um, if you look at the
00:08:35.920 | document to the right, uh, neither RAG nor text-to-CSV techniques will actually work over this because
00:08:42.720 | it's not really a structured 2D table. There's a bunch of gaps in the rows and gaps in the columns.
00:08:47.520 | So we basically built an Excel agent, um, that's capable of taking un-normalized Excel spreadsheets
00:08:56.480 | and transforming them, um, into a normalized 2D format, and also allows you to do agentic QA, um, over both the un-normalized and normalized
00:09:05.840 | versions of the Excel spreadsheet. Um, it's a pretty cool capability. I'll describe, uh, how it kind of
00:09:12.160 | works in just a bit. Um, but it's going to complement our toolbox, right? In terms of, uh, more traditional
00:09:18.320 | document parsing, extraction, indexing, and it's available in, uh, early preview. So if you just, uh,
00:09:24.480 | take a look at the video, it's also on our blog post. We basically uploaded that example synthetic
00:09:28.960 | data set, transformed it into a 2D table, and you can also ask questions over it to basically get
00:09:34.560 | insights. And it's really doing the heavy lifting of deeply understanding the semantic structure of
00:09:39.280 | the Excel spreadsheet, um, and then using that and plugging that in as specialized tools to an AI agent.
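A drastically simplified sketch of that normalization step follows. A real system learns a semantic map of the sheet first; this toy version just uses blank-row/column heuristics and forward-fills merged-cell gaps, which is enough to show the un-normalized-to-2D transformation.

```python
# Toy sketch of normalizing an un-normalized spreadsheet: drop blank spacer
# rows/columns and forward-fill merged-cell gaps so the result is a clean
# 2D table. Heuristic only; not the learned semantic-map approach.

def normalize(sheet: list[list[str]]) -> list[list[str]]:
    # 1. Drop fully blank rows (spacer rows between sections).
    rows = [r for r in sheet if any(cell.strip() for cell in r)]
    # 2. Drop fully blank columns.
    keep = [c for c in range(len(rows[0]))
            if any(r[c].strip() for r in rows)]
    rows = [[r[c] for c in keep] for r in rows]
    # 3. Forward-fill gaps left by merged cells in the first column.
    last = ""
    for r in rows:
        if r[0].strip():
            last = r[0]
        else:
            r[0] = last
    return rows

raw = [
    ["Region", "", "Q1", "Q2"],
    ["", "", "", ""],
    ["North", "", "10", "12"],
    ["", "", "11", "13"],
]
table = normalize(raw)  # a regular 3x3 2D table
```

Once the sheet is in this regular form, ordinary structured querying (or a code-writing agent) can operate on it reliably.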
00:09:44.720 | Um, the best baseline is not really RAG or text-to-CSV. Um, those both suck. Um, it's really just an LLM
00:09:56.080 | being able to write code. Um, so, uh, LLM with the code interpreter tool is a reasonable baseline,
00:10:00.720 | gets you to 70, 75% accuracy. Um, over like a private dataset of synthetic Excel sheets, uh, we basically
00:10:07.120 | were able to get this up to 95%. Um, it actually surpasses the human baseline of 90%, a human trying to go and do the data
00:10:13.840 | transformation by hand. Um, a brief note on how it works. Uh, it's a little bit technical, um, but you
00:10:22.400 | know, more details are in the blog post. Um, first we do some sort of structure understanding of the Excel
00:10:28.080 | spreadsheet. So we do a little bit of RL reinforcement learning. Um, you know, uh, we actually kind of adapt
00:10:35.760 | dynamically to the specific format of the document, um, and learn a semantic map of the sheet. By learning a
00:10:42.400 | semantic map, uh, we can then translate this into, um, kind of a set of specialized tools that you provide
00:10:49.120 | to an agent. And so from an abstract perspective, you can kind of think about it as an agent could just
00:10:54.240 | write code from scratch. Um, as LLMs get better, that will certainly become, um, a kind of
00:10:59.840 | higher-performing baseline. But in the meantime, we're helping it out by really providing, uh, a set of
00:11:05.200 | specialized tools over the semantic map. So you can reason over an Excel spreadsheet. Great. Um,
00:11:11.840 | the next piece here is, so we talked about a document toolbox. Uh, we talked about a lot of
00:11:16.160 | operations that basically make this, uh, document toolbox really good and comprehensive. So now that you've plugged
00:11:21.760 | it into an agent, what are the different agent architectures, and what use cases are implied
00:11:26.080 | by them? Um, as many of you probably know from building agents yourselves, agent orchestration ranges from more constrained
00:11:33.200 | architectures to unconstrained architectures. Um, constrained is basically you kind of more explicitly
00:11:38.080 | define the control flow. Unconstrained is like a ReAct loop, function calling, Codex, uh, whatever.
00:11:42.320 | You basically give it a set of tools and let it run. Um, deep research is kind of the same thing.
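The unconstrained shape is essentially a loop: the model picks a tool, we run it, feed the observation back, and repeat until it answers. A skeletal version is below, with a hard-coded `fake_llm` standing in for a real model's function-calling output; the tool and its result are invented for illustration.

```python
# Skeletal ReAct-style loop: give the model a set of tools and let it run
# until it emits a final answer. `fake_llm` is a stand-in for a real model.

TOOLS = {
    "lookup_total": lambda arg: "$1,500",   # toy tool with a canned result
}

def fake_llm(history: list[str]) -> dict:
    """Stand-in policy: call the tool once, then answer with its result."""
    if not any(msg.startswith("observation:") for msg in history):
        return {"action": "lookup_total", "input": "invoice_0042"}
    obs = [m for m in history if m.startswith("observation:")][-1]
    return {"answer": f"The total due is {obs.split(':', 1)[1].strip()}"}

def react_loop(question: str, max_steps: int = 5) -> str:
    history = [f"question: {question}"]
    for _ in range(max_steps):
        step = fake_llm(history)
        if "answer" in step:                           # model chose to finish
            return step["answer"]
        result = TOOLS[step["action"]](step["input"])  # run the chosen tool
        history.append(f"observation: {result}")       # feed the result back
    return "gave up"

answer = react_loop("What is the total due on invoice 0042?")
```

The only constraint here is the step budget; everything else is left to the model, which is exactly why this shape suits assistive, human-guided use.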
00:11:46.240 | Um, for us, we basically noticed there's two main categories of UXs. Um, there's more assistant-based UXs
00:11:54.720 | that basically help a human surface information or produce some unit of
00:12:00.880 | knowledge work through usually a chat-based interface. It's usually chat-oriented, the inputs, natural
00:12:06.160 | language. Um, the architecture is a little bit more unconstrained. You know, it's basically a react loop
00:12:11.680 | over some set of tools. Um, and it's inherently both unconstrained but also with a higher degree of human
00:12:17.840 | in the loop. So the goal is-- or the expectation is that the human is supposed to kind of guide and coax
00:12:23.440 | the agent, uh, along the steps of the process to basically achieve the task at hand.
00:12:28.080 | I'm sure many of you have built these types of use cases, and so this is just
00:12:34.720 | a very small subset. Um, but it's basically just, you know, a generalization of a RAG chatbot.
00:12:40.240 | There's a second category of use cases that I think is interesting, and I think a lot of folks are actually
00:12:45.840 | starting to build more into this space, which is, um, this automation interface. So being able to
00:12:52.160 | actually, instead of, uh, providing some assistant or co-pilot to help a human get more information,
00:12:57.120 | um, processing routine tasks in a multi-step, end-to-end manner. And usually the architecture
00:13:03.280 | is a little bit different. Um, it takes in some batch of inputs. Uh, it can run in the background,
00:13:09.040 | or it could be triggered ad hoc by the human. Um, the architecture is a little bit more constrained,
00:13:13.840 | which kind of makes sense, right? If you want this thing to run more end-to-end, um, you need it to
00:13:18.400 | not just go off the rails. Um, and there's usually a little bit less human in the loop at every step
00:13:24.000 | of the process, and usually some sort of, like, batch review in the end. And the output is, like,
00:13:28.880 | structured results, integration with APIs, uh, decision-making. After approval, it'll just go
00:13:33.840 | route to the downstream systems. Some of the use cases here include, you know, financial
00:13:39.040 | data normalization, data sheet extraction, invoice reconciliation, contract review, and more.
00:13:43.840 | Um, I'll skip this video, but, you know, there are some fun examples of community-based
00:13:52.000 | open-source repos we built in this area, like the invoice reconciler by Laurie Voss.
00:14:01.120 | Uh, a kind of general idea that has emerged, and that we've noticed as a pattern, is,
00:14:07.680 | you know, oftentimes the automation agents can serve as a back-end: because they run in the background,
00:14:12.720 | you know, they can do the data ETL transformation. There's still a human in the loop, but it's kind
00:14:17.680 | of doing the thing where it needs to process and structure a lot of data, um, and make decisions
00:14:22.720 | in the background. And then assistant agents are kind of more front-end facing, right? And so automation
00:14:27.840 | agents can structure, process your data, and provide the right tool interfaces, um, for assistant agents.
00:14:33.600 | Not every tool depends on agentic reasoning, but for a lot of these use cases, like for a very
00:14:39.680 | generalized data pipeline, um, where you're processing a lot of unstructured context, you might have
00:14:45.040 | automation agents go in and process your data, and provide the right tools for some sort of more, uh,
00:14:50.320 | user-facing research interface. So we talked about building a document toolbox. We talked about,
00:14:59.120 | you know, the different categories of agentic architectures. And putting it together,
00:15:03.440 | um, here are some real-world use cases of document agents. And these are basically examples of agents
00:15:09.520 | that actually help automate different types of knowledge work. So one of our favorite examples
00:15:14.880 | is a combination of both automation and assistant UXs for financial due diligence. Um, Carlyle is one of
00:15:21.680 | our, uh, favorite customers and partners. Um, you know, they basically used, uh, some of the core
00:15:27.200 | capabilities that we have to build an end-to-end leveraged buyout agent. Um, you know, it requires an
00:15:33.200 | automation interface to inhale massive amounts of unstructured public and private financial data.
00:15:38.160 | Um, Excel sheets, PDFs, PowerPoints, go through some bespoke extraction algorithms with human in the loop
00:15:45.760 | review. And then once that data is actually structured in the right format, providing a copilot
00:15:51.920 | interface, uh, for the analyst teams to actually both get insights and generate reports over that data.
00:15:57.680 | If you look at any enterprise search use case, that typically falls within the assistant UX. Um,
00:16:04.320 | Cemex is one of our favorite, uh, customers in this space where, you know, just being able to define
00:16:09.200 | a lot of different collections to different sources of data and providing more task-specific specialized
00:16:14.800 | agentic RAG chatbots over your data, right? Um, you know, it's basically RAG, but you add an agentic
00:16:21.200 | reasoning layer on top so that you can basically break down user queries, do research, and answer the
00:16:26.240 | question at hand. And on the pure automation UX side, uh, we notice a lot of use cases
00:16:35.600 | popping up around automation and efficiency. And so one example is actually technical data sheet
00:16:41.760 | ingestion. Um, you know, we're working with a global electronics company. They have a lot of data sheets
00:16:47.600 | that need to be automatically processed and reviewed. And historically, it's taken a lot of human effort to
00:16:53.120 | actually do this. Um, so by creating the right end-to-end automation agent, you can basically encode the
00:16:59.680 | business-specific logic for parsing these types of documents, extracting out the right pieces of
00:17:04.960 | information, matching it against specific rules, and outputting the structured data into SQL. There's
00:17:11.360 | human-in-the-loop review, um, but if we're actually able to do this end-to-end, it transforms weeks of
00:17:17.840 | just like, you know, technical writer work, um, into an automated extraction interface.
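A minimal sketch of that last step, writing extracted fields into SQL, is below. The schema and field names are invented for illustration, using Python's built-in sqlite3 to stand in for whatever database the downstream systems use.

```python
# Minimal sketch of the final pipeline step: extracted datasheet fields
# written into a SQL table for downstream review. Schema is invented.

import sqlite3

def load_extractions(rows: list[dict]) -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE datasheets (part_number TEXT, voltage_max REAL)"
    )
    # Named placeholders map straight onto the extracted field dicts.
    conn.executemany(
        "INSERT INTO datasheets VALUES (:part_number, :voltage_max)", rows
    )
    conn.commit()
    return conn

# Rows as they might come out of an extraction agent (hypothetical values):
conn = load_extractions([
    {"part_number": "XR-100", "voltage_max": 5.5},
    {"part_number": "XR-200", "voltage_max": 12.0},
])
count = conn.execute("SELECT COUNT(*) FROM datasheets").fetchone()[0]
```

Once the data is in SQL, the human-in-the-loop review becomes a query over structured rows instead of weeks of manual document reading.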
00:17:22.640 | So that's basically it. Um, you know, for those of you who are less familiar,
00:17:28.320 | LlamaIndex is, uh, the most accurate, customizable platform for automating your document workflows with
00:17:33.200 | agentic AI. Um, our mission statement has evolved a little bit over the past few years. We were, uh,
00:17:38.080 | a very broad horizontal, uh, framework oftentimes focused on RAG. Um, but if you're interested in some of the
00:17:43.520 | capabilities, uh, come talk to us, and then please come check us out at Booth G11. Thank you.