Hi everyone, I'm Jerry, co-founder and CEO of LlamaIndex. I'll spend the first ten minutes giving a brief overview of RAG and of LlamaIndex, how we see the enterprise developer space progressing, and an overview of our product offerings. Then for the next fifteen minutes I'm happy to field questions and have a discussion about what's top of mind for enterprises today.

Let's get started. Throughout the enterprise, and this might resonate with some of you, we're seeing a lot of different use cases pop up, and a lot of them are around RAG. I'm pretty sure we all know what RAG is: you point an LLM at some directory of files, get it to somehow understand those files, and then generate answers from them. Other use cases we see include document processing and extraction, and maintaining conversations over time. And this year a lot of people are building agents. We haven't seen many fully autonomous agents in production; they're typically a bit more constrained. But I'm curious to get your takes as well, so happy to discuss that.

RAG has obviously become a very popular set of techniques for helping you build a question-answering interface over your data; that's really the end goal. What are the main components of RAG? I won't go into the deep technical details, but you need an LLM to do the final synthesis, you need an embedding model, and you need some kind of database: a document store, a graph store, a SQL database, or a vector database. And here's the interesting part: you basically need a new data processing stack to handle the parsing and ingestion side. This is different from traditional ETL, which is primarily built for analytics workloads and has a whole ecosystem of technologies around it. Here, at a very basic level, you're taking in a PDF, slicing it up into a bunch of chunks, figuring out how to do that well, and then indexing and representing it in different storage formats so that LLMs have easy access to it. A lot of what LlamaIndex is trying to solve is that data processing piece.

A big pain point we see for companies building LLM applications is going from prototype to production. Unlike traditional ML, it's actually really easy to build a prototype with the tools LlamaIndex offers: it takes about ten minutes to build a RAG pipeline that kind of works over your data. But going from "kind of works" to production quality is a lot harder. As you scale up the number of documents, as the documents get more complex, and as you add more data sources, you have a higher quality bar to meet. The general pain points include accuracy issues, knowing how to tune all the different knobs, and scaling to many data sources. Often this either takes a lot of developer time or teams simply don't know how to do it, and what ends up happening is that the POC you built for the higher-ups just doesn't really work, and the value of the overall project is diminished.
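To make the ten-minutes-to-a-prototype point concrete, here is a minimal sketch of that starter pipeline with the open-source library. This assumes recent llama-index packages, an OpenAI key in the environment, and a placeholder ./data directory; it's an illustration, not the only way to wire things up.

```python
# pip install llama-index   (assumes OPENAI_API_KEY is set in the environment)
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Point the loader at a directory of files (PDFs, text, etc.).
documents = SimpleDirectoryReader("./data").load_data()

# Default pipeline: chunk the documents, embed the chunks, and store them
# in an in-memory vector index.
index = VectorStoreIndex.from_documents(documents)

# Ask a question: retrieve relevant chunks and synthesize an answer.
query_engine = index.as_query_engine()
print(query_engine.query("What were the key findings in the report?"))
```

This is the prototype that "kind of works"; the parsing quality, chunking strategy, indexing choices, and evals discussed below are what separate it from a production system.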
The other problem we see is that most of the larger companies we talk to have a lot of data, and there's a general issue of data silos: you have unstructured data, structured data, semi-structured data, APIs. A similar problem occurs during LLM application development, where you want to bring all of this data into some central place so that your LLMs can understand it. If you had a magic tool that made that happen and made it work well, you'd have the holy grail of RAG: being able to synthesize answers over any of your knowledge, anywhere in the enterprise.

Something we talk about a lot, both during the keynote yesterday and more generally, is the importance of data processing and data quality. We've all probably heard the phrase "garbage in, garbage out." It's true in machine learning, and it's also true in LLM application development. If you don't have good data quality (and I can go into an example of what that means), your information won't be well represented, so even if your LLM is very good, bad data quality leads to hallucinations in your application.

We also believe in developers. If you're leading AI at an enterprise, you do want to make a bet on developers, and I say this pretty often: you should generally bet on building a bit more rather than just buying pure out-of-the-box solutions. There are a few reasons. The AI space is moving really quickly and the underlying technology is shifting, and developers are best positioned to translate that technology into enterprise value that is custom to your use case. If you go through the procurement process and purchase an out-of-the-box tool, it may solve your current pain point, but it will probably be a lot slower to adapt as new techniques pop up and new workflows become possible. So we care a lot about developers, and we want to provide the tooling and infrastructure that enables them to build LLM applications over their data. That helps you get applications with high response quality that are actually ready for production. Importantly, it's also easier for developers to set up and maintain, so you don't have to keep throwing developers at the problem and having them bang their heads against the wall to make it generate good responses, and you can scale to more data sources.

Great. I'm not going to go through all the different features of LlamaIndex, but I'll quickly run through the main components. Our main goal as a company is to help any developer build context-augmented LLM apps, from prototype to production. We have an open-source toolkit: a very popular framework that helps developers build production LLM apps over their data. A lot of the use cases we've seen in the past year have been around productionizing RAG.
In the next six months we anticipate a lot more agentic use cases arising as well. The framework is primarily focused on orchestration: retrieval, prompting, agentic reasoning, and tool use.

The other piece we have is LlamaCloud, a centralized knowledge interface for your production LLM application. It unifies your data sources, starting with unstructured data; it processes and enhances that data so you actually get good-quality data out of very complex PDFs, PowerPoints, and spreadsheets; and it helps you build managed pipelines, so that as a developer you don't have to worry as much about that layer and can focus on the actually interesting work: orchestrating that data with LLMs. The idea is to manage a lot of the data infrastructure so that developers spend less time wrangling data and more time building the core prompting and agentic retrieval logic that makes up the custom use case they want to build.

I'm not going to run through every feature (some of them are still upcoming), but one specific component that has gotten a decent amount of interest from users is LlamaParse, part of LlamaCloud. It's our advanced document parser, and it helps solve the data quality problem. If you want to build LLM applications over a complex financial report, or a PowerPoint with lots of messy text layouts, tables, images, diagrams, and so on, we provide a really nice toolkit to parse that data specifically so that LLMs can understand it and don't hallucinate over it. We released it a few months ago, and the usage metrics so far have been impressive: roughly half a million monthly downloads of the client SDK, tens of millions of pages processed, and a lot of important customers using it throughout the enterprise.

In terms of discussion topics, I'm happy to talk about any of these components. I'm very interested in the enterprise data stack and how it translates into LLM applications. I'm also interested in the use case side: the advancement from simple QA interfaces into more agentic workflows that can actually take actions and automate more decision-making for different teams, internally or externally. And a quick shout-out: we have a general waitlist for LlamaCloud that has already gotten pretty popular, with a decent number of signups. The goal is to enable more users to process and index their unstructured data, so we can help manage that layer while they build the important use cases as enterprise developers.
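To show where LlamaParse sits in practice, here is a minimal sketch using the llama-parse client SDK. It assumes a LlamaCloud API key in the environment; the file name is a placeholder, and exact parameters may differ across versions.

```python
# pip install llama-parse   (assumes LLAMA_CLOUD_API_KEY is set)
from llama_parse import LlamaParse

# Parse a visually complex PDF into LLM-friendly markdown, preserving
# table structure instead of collapsing it into a soup of text.
parser = LlamaParse(result_type="markdown")
documents = parser.load_data("q1_financial_report.pdf")  # placeholder file

# Each returned document carries the parsed text, ready for chunking
# and indexing downstream.
print(documents[0].text[:500])
```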
Jerry: Cool, go ahead. So the question, repeated for the microphone, was about the enterprise product, LlamaCloud: the understanding is that you upload documents to our cloud, so how do we deal with data privacy? That's a great question, and there are two answers. The first is that we have both a cloud service and a VPC deployment option; I'm happy to chat about that if you sign up on the contact form. We deploy in AWS and Azure, with GCP coming soon. The second is that we're a data orchestration layer, so we intentionally don't store your data; we integrate with your existing storage systems.

Audience: You made a comment on the differences between traditional ETL and the new skills and tools that are required. Can you expand on that a bit? If at my company I get asked, "hey, let's have this ETL person who's done a lot of other ETL work do it," what kind of instruction would I give them about the other skill sets or tools that might be necessary? And if there are any gotchas around that, could you highlight those?

Jerry: Totally. On a very technical level, the steps you actually take are just different. Instead of writing SQL or using dbt, here is how you set up a RAG pipeline. You have a PDF. First you need to parse it, using LlamaParse or another document parser; if you don't get that parsing step right, it leads to a lot of downstream failure modes in your LLM application. After you parse the document into some representation, whether that's text or, increasingly, multimodal representations such as images of the document, you need to chunk it. The very naive approach is to set a chunk size of, say, 1,024 tokens and split every 1,024 tokens. That introduces a bunch of complexities: if you split tables down the middle, or split a section that spans multiple pages, you somehow need to semantically join the pieces back together so that the information is preserved within a chunk, and you need to attach the right metadata to each chunk. Then you need to figure out a good way to index it, and this is where a vector database, graph store, or document store comes in; there are a lot of different ways to index. So fundamentally it's just a different set of steps. And the real difference from traditional ETL is that all these steps are fuzzy without end-to-end performance measurement. With traditional ETL, you do a step and you know exactly what you want. Here, it's really hard to tell what chunk size to set without an eval dataset and a rigorous end-to-end testing and eval flow.
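A minimal sketch of that parse, chunk, and index flow, assuming recent llama-index module paths; the 1,024-token setting is just the naive default discussed above, and ./reports is a placeholder directory.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# Step 1: parse. SimpleDirectoryReader uses basic file parsers; swap in
# LlamaParse or another document parser for complex layouts.
documents = SimpleDirectoryReader("./reports").load_data()

# Step 2: chunk. The naive approach: fixed-size splits with some overlap.
# This is exactly where tables and cross-page sections can get cut in half.
splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=64)
nodes = splitter.get_nodes_from_documents(documents)

# Step 3: index. Here the chunks go into a vector index; a graph store or
# a document store would be alternative representations.
index = VectorStoreIndex(nodes)
```

The fuzzy part is that none of these parameters can be validated step by step the way a SQL transform can; you need an end-to-end eval set to know whether 1,024 was the right choice.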
Jerry: Sorry, I want to make sure I saw the hand over there. Go ahead. So the question was basically how you integrate audio sources into your RAG pipeline using LlamaIndex or other frameworks. We have a few audio loaders, and the simplest approach is probably to just transcribe the audio directly into text and then ingest it. In the future, as models become more natively multimodal, you might be able to represent audio as a specific entity, almost as a chunk, and feed it directly into a model, but I don't think we're there yet.

Audience: [Question about benchmarking LlamaParse against other parsers.]

Jerry: I think benchmarking is important. It's also challenging; we're actually working on that right now, trying to find a general benchmark. What typically happens is that an enterprise does a bake-off on its own data and compares the results, and we show them a notebook on how to build a RAG pipeline with LlamaParse and how to do it with other parsers.

Audience: My question is: what options do you have for versioning, or promotion across environments, so you can do staging and production? That's one part. The other is: what regions are you available in? That's maybe the easier one.

Jerry: The versioning piece is definitely important. At a high level, we're building out features to help you better version your pipelines; we don't have that yet, but it's upcoming and has been requested by some enterprise customers. On regions: the SaaS service is hosted in North America, but we also do on-prem deployments, which are part of the enterprise plan we offer.

Audience: Hi. I'm building a RAG system for a big fintech, basically a bank. The struggle I'm having is that I'm working with the servicing team, which has other channels: I'm working on an in-app chatbot and a WhatsApp chatbot, while the servicing team also has a help center, an IVR, and a bunch of other channels. It's been very tough for me to convince them that the CMS they're using to feed those other sources may not be the best way to feed a RAG system. I'm curious whether you've seen other customers with a similar issue, where internally they want a single source of truth that feeds all of these channels, but the RAG system's nature is obviously extremely different from a help center or an FAQ.

Jerry: Wait, why is that CMS not the right tool? I'm curious whether you think it could be; can you get a little more into the details?

Audience: We have Q&A pairs; that's how the CMS works right now. That could work for RAG, but we're missing all the metadata, the different clusterings of documents for different use cases, different credit cards. It's a little tough to explain in a quick question, but have you seen a single system work as a single source of truth, and how have you seen that work?

Jerry: The full details are probably a lot to dive into, but generally speaking, here's what we see. For homogeneous data sources of the same data type (say, all financial reports), you can generally use the same set of parameters to parse them, because there's an expectation that they're in roughly the same format. For very diverse data sources, where all of a sudden you're bringing in not just PDF documents but also semi-structured data, like JSON from Jira or Salesforce, you typically need to set up a separate pipeline for each. Then what we offer, on both the open-source and the enterprise side, is the ability to combine all these different data sources: you combine the results and re-rank them, with some re-ranking layer at the top.
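A minimal sketch of that separate-pipelines-then-re-rank pattern, assuming llama-index's QueryFusionRetriever (which implements reciprocal-rank fusion across retrievers); the directory paths are placeholders standing in for per-source ingestion pipelines tuned to each format.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.retrievers import QueryFusionRetriever

# Each source goes through its own pipeline, tuned to its format
# (e.g., parsed PDFs vs. JSON records exported from Jira or Salesforce).
pdf_index = VectorStoreIndex.from_documents(
    SimpleDirectoryReader("./parsed_pdfs").load_data()
)
jira_index = VectorStoreIndex.from_documents(
    SimpleDirectoryReader("./jira_exports").load_data()
)

# Combine the per-source retrievers and re-rank the merged results.
retriever = QueryFusionRetriever(
    [pdf_index.as_retriever(), jira_index.as_retriever()],
    similarity_top_k=5,
    num_queries=1,                # no query rewriting, just fusion
    mode="reciprocal_rerank",     # reciprocal-rank fusion across sources
)
nodes = retriever.retrieve("Which credit cards waive foreign transaction fees?")
```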
Audience: All right, thank you.

Audience: I've been using LlamaParse for a little bit, and first of all I love it; it works really well, so thank you for producing it. However, two weeks ago I was working on a project for a client and all of a sudden I was getting all these failures. I contacted support via the chat, and there was a gentleman helping me out. He asked me to pass him the job IDs, then went MIA for a while before replying back. So the question is: what are the support options, so that if I get stuck over the weekend I can actually get somebody to help?

Jerry: Totally. First of all, I'm sorry you ran into those issues; I know we had a cluster of failures that specific weekend, and it was a good lesson for us. Keep in mind we're about 15 people at the company, so when you talk to support it's probably one of the founding engineers jumping in. I promise we're making that process more streamlined. On the enterprise side, especially for the enterprise plans we offer (happy to chat about this offline), we offer dedicated SLAs. So there's some support on the casual, self-serve APIs, but we offer dedicated SLAs on the enterprise plans.

Audience: We're building hallucination detection and other evaluation systems for our customers, who have very large collections of documents. You could have thousands of PDFs, and those PDFs typically contain a lot of tables, and there's the question of how to combine OCR and other PDF processing on top of that. So what is your general recommendation? Does LlamaParse take care of all of this, or do you recommend building some kind of custom system directly on top of LlamaIndex? How would you recommend handling that?

Jerry: I didn't actually show LlamaParse's capabilities in these slides, but maybe if I dig around a little I can find the specific slides that showcase it. Basically, what you want when you parse these documents is a generally good parser that lays out the text in a spatially aligned way. It doesn't matter whether you have all the bells and whistles like bounding boxes; at bare minimum you want the text to be faithfully represented, and that's exactly what LlamaParse does, especially for tables. We have a few examples where you have tables within a document, you lay them out in a spatially aligned way, and then when you feed that to an LLM it holds up: LLMs are generally trained to respond well to well-formatted pieces of text, so they can understand what's going on in that text chunk. Whereas if you use a very naive parser, a baseline PDF parser, it's going to collapse the text and numbers, and that generates a lot of hallucinations.

Audience: With the increase in the size of the context windows available to us, and the improvements we're seeing on needle-in-a-haystack kinds of problems, what is your perspective on where we're headed with RAG?

Jerry: I think there are two general trends: longer context windows and multi-modality. A few things will probably go away, and a few things will stay. Good parsing is still important; in the end, if your parser is bad, you're just going to feed bad data into the LLM and it's going to hallucinate information. What will probably go away is fine-grained chunking: as context windows get bigger, chunk sizes can also get bigger, so you're probably not going to need to worry about intra-page splitting, that is, splitting a single page into a bunch of smaller chunks. In the future we could see entire documents used as chunks, with everything indexed at the document level. I think that actually makes a lot of sense, because documents are typically self-contained entities, and it will make things a lot easier for developers. However, for a multi-document system (and if you're in a company, you probably have an enormous number of documents, many gigabytes of them), you're not going to feed every document into the context window on every inference call. Even with context caching, which I think Gemini has, it probably doesn't make sense from a cost perspective, because context caching is currently super expensive. It's also a black box: you're basically storing the transformer weights, for those of you who are familiar with that, so you don't get accountability or full transparency into what data is actually being fed into the language model at each step. So for a variety of reasons, the overall idea of retrieval from an external storage system, whether a vector database or a graph database, still matters; it's the minute chunking decisions that will probably go away.

The second thing, which you didn't ask about but which I'll talk about anyway, is multimodal. As multimodal models get better, I think it actually makes sense to start having diverse representations of the same thing. For instance, with a PowerPoint presentation, you can represent each page as an image in addition to the parsed text. By storing native image chunks you preserve all the information in that data; any parsing step is inherently lossy, because you're extracting content into a textual format rather than preserving the full picture. And by having different ways of representing the same data, you can trade off between cost, performance, and latency.
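A minimal sketch of what that document-level indexing could look like: each document kept as a single retrievable node instead of being split into intra-page chunks. This assumes recent llama-index APIs and placeholder paths; note the caveat in the comments about embedding very long texts.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.schema import TextNode

documents = SimpleDirectoryReader("./docs").load_data()

# Keep each document as one retrievable unit. This becomes viable once the
# model's context window comfortably fits whole documents. (Caveat: most
# embedding models truncate long inputs, so in practice you might embed a
# summary of each document while still returning the full text.)
nodes = [TextNode(text=doc.text, metadata=doc.metadata) for doc in documents]
index = VectorStoreIndex(nodes)

# Retrieval now selects whole documents; the long-context LLM reads them
# end to end instead of stitching together small chunks.
query_engine = index.as_query_engine(similarity_top_k=2)
```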
Audience: Let me check in on something. I see you've done a lot of work improving accuracy and reducing hallucination. I wonder if you're working on anything to make the conversation flow better. In my experience it's so hard to get the conversation to feel natural; sometimes these systems overemphasize the context data when I just want to give an FYI and continue talking like a normal human.

Jerry: So you're talking about how to create more natural conversation flows. That's very interesting. The overall answer is that the default way most people build these conversation flows is to have some RAG pipeline as a tool, and then an agent as an outer layer that reasons over the conversation history and can synthesize the right answer at any given point in time. So the knobs you want to tune are the agent's reasoning prompt and the memory. And I think the memory is actually pretty important, because right now most memory modules are very primitive; there's not much out there beyond just dumping the conversation history into the prompt. Happy to chat more about that as well, but I think there's a lot of stuff there you could try.
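A minimal sketch of that "RAG pipeline as a tool inside an agent, plus memory" pattern, assuming the v0.10-era llama-index agent APIs (a ReActAgent here; the model name, paths, and tool description are placeholders).

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.agent import ReActAgent
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.core.tools import QueryEngineTool
from llama_index.llms.openai import OpenAI

index = VectorStoreIndex.from_documents(SimpleDirectoryReader("./docs").load_data())

# Wrap the RAG pipeline as a tool the agent can choose to call, or skip
# entirely when the user is just making conversation.
rag_tool = QueryEngineTool.from_defaults(
    query_engine=index.as_query_engine(),
    name="company_docs",
    description="Answers questions grounded in the company's documents.",
)

# Memory: the primitive default is a rolling buffer of the chat history.
memory = ChatMemoryBuffer.from_defaults(token_limit=4096)

agent = ReActAgent.from_tools([rag_tool], llm=OpenAI(model="gpt-4o"), memory=memory)

print(agent.chat("Just FYI, I renamed the project to Atlas."))   # no retrieval needed
print(agent.chat("What does our security policy say about data retention?"))
```

The two knobs mentioned above map directly onto this sketch: the agent's reasoning prompt controls when it leans on the tool, and the memory module controls what conversational context it carries forward.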
Jerry: I just want to double-check the time. Okay. Yes, go ahead.

Audience: How are you using llama-agents internally? What's the most complex task?

Jerry: That's a great question. For those of you who weren't at the keynote, we launched something called llama-agents, an open-source multi-agent framework for helping you deploy agents as microservices. Right now agents primarily live in notebooks, and the idea is to spin them up as API services. At the moment we're mostly using it to build more constrained, simple RAG pipelines, and it's actually still in an alpha state, so I encourage all of you to try it out. There are a lot of things I already know it can't do; for instance, there are more general communication protocols and interfaces we want to build in, and a more interesting message-queue system. But if you have an enterprise use case that's going agentic and you want to deploy it as microservices, so you can reuse and encapsulate it, please try it out and come talk to us.

Cool, thank you. And sorry for going over.

Host: No, that's all fantastic.