
RAG for VPs of AI: Jerry Liu



00:00:00.000 | Hi everyone, I'm Jerry, co-founder and CEO of LlamaIndex. I'll spend the first ten minutes giving a brief overview of RAG and of LlamaIndex: how we see the enterprise developer space progressing, plus an overview of our product offerings. Then for the next fifteen minutes I'm happy to field questions and have a discussion about what's top of mind in enterprises today. So let's get started.
00:00:45.800 | Throughout the enterprise, and this might resonate with some of you, we're seeing a lot of different use cases pop up, and a lot of them are around RAG. I'm pretty sure we all know what RAG is: you point an LLM at a directory of files, get it to somehow understand those files, and then generate answers from them. Other use cases we see include document processing and extraction, and maintaining conversations over time. This year a lot of people are building agents, though we haven't seen many fully autonomous agents in production; they're typically a bit more constrained. I'm curious to get your takes on that as well, so happy to discuss it.
00:01:20.740 | Obviously, RAG has become a very popular set of techniques for building a question-answering interface over your data; that's really the end goal. What are the main components of RAG? I won't go into the deep technical details, but you need an LLM to do the final synthesis, you need an embedding model, and you need some database, whether that's a vector database, a document store, a graph store, or a SQL database. Then, and here's the interesting part, you basically need a new data processing stack to handle the data parsing and ingestion side. This is different from traditional ETL, which primarily serves analytics workloads and has a lot of established technology around it. Here, at a very basic level, you're taking in a PDF, slicing it up into a bunch of chunks, figuring out how to do that well, and indexing and representing it in a bunch of different storage forms so that LLMs have easy access to it. A lot of what LlamaIndex is trying to solve is that data processing piece; a minimal sketch of the stack is below.
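
For illustration, a minimal sketch of that parse-chunk-embed-index flow using LlamaIndex's ingestion pipeline; the file name is hypothetical, and it assumes `pip install llama-index` plus an OpenAI API key in the environment:

```python
# Minimal ingestion sketch: parse -> chunk -> embed, ready to index.
# Assumes OPENAI_API_KEY is set; "report.pdf" is a hypothetical input file.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding

docs = SimpleDirectoryReader(input_files=["report.pdf"]).load_data()
pipeline = IngestionPipeline(transformations=[
    SentenceSplitter(chunk_size=1024, chunk_overlap=64),  # chunking step
    OpenAIEmbedding(),                                    # embedding step
])
nodes = pipeline.run(documents=docs)
index = VectorStoreIndex(nodes)  # represent the chunks in a vector store
```
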
00:02:29.900 | A big pain point we see for a lot of companies building LLM applications is going from prototype to production. Unlike traditional ML, it's actually really easy to build a prototype: with some of the tools LlamaIndex offers, it takes about ten minutes to build a RAG pipeline that kind of works over your data (see the sketch below). But going from "kind of works" to production quality is a lot harder.
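
As a rough illustration of that ten-minute prototype, a sketch following the LlamaIndex quickstart pattern; the `./data` folder is hypothetical and an OpenAI key is assumed:

```python
# The "ten-minute" RAG prototype: load, index, query.
# Assumes `pip install llama-index`, OPENAI_API_KEY set, and a ./data folder.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./data").load_data()  # parse files
index = VectorStoreIndex.from_documents(documents)       # chunk, embed, index
query_engine = index.as_query_engine()                   # retrieval + synthesis
print(query_engine.query("What are the key findings?"))
```
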
00:03:04.020 | As you scale up the number of documents, as the documents get more complex, and as you try to add more data sources, there's a higher quality bar you need to meet. The general pain points we see include accuracy issues, knowing how to tune a bunch of different knobs, and scaling to many data sources. Oftentimes this either takes a lot of developer time or teams just don't know how to do it, so the POC you're building for the higher-ups ends up not really working, and the value of the overall project is diminished.
00:03:36.140 | The other problem we see is that, generally speaking, most of the larger companies we talk to have a lot of data, and there's this general issue of data silos: you have unstructured data, structured data, semi-structured data, and APIs. A similar problem occurs during LLM application development, where you want to somehow bring all this data into some central place so your LLMs can understand it. If you had a magic tool that made that happen, and made it work well, you'd have the holy grail of RAG: being able to synthesize answers over any of your knowledge, anywhere in the enterprise.
00:04:24.700 | One thing we talk about a lot, both during the keynote yesterday and more generally, is the importance of data processing and data quality. We've probably all heard the phrase "garbage in, garbage out." That's true in machine learning, and it's also true in LLM application development. If you don't have good data quality, and I can go into an example of what that means, you're not going to get back well-represented information, so even if your LLM is very good, bad data quality leads to hallucinations in your application.
00:05:03.600 | And so we believe in developers. If you're leading AI at one of these enterprises, you do want to make a bet on developers. I say this pretty often: you should generally bet on building a bit more, rather than just buying pure out-of-the-box solutions, and there are a few reasons why. First, the AI space is moving really quickly and the underlying technology is shifting; developers are the best positioned to translate that technology into enterprise value that's custom to your use case. If you go through the procurement process and purchase out-of-the-box tools, they will maybe solve the current pain point you have and provide a solution for it, but they will probably be a lot slower to adapt as new techniques pop up and new workflows become possible. So we care a lot about developers, and we want to provide the tooling and infrastructure that enables them to build LLM applications over their data. This helps you get applications with high-quality data and responses that are actually ready for production. Importantly, it's easier for developers to set up and maintain, so you don't have to keep throwing developers at it, banging their heads against the wall to figure out how to make the thing generate good responses, and you can scale to more data sources.
00:06:29.280 | Great. I'm not going to go through all the different features of LlamaIndex, but I'll quickly run through the main components. Our main goal as a company is to help any developer build context-augmented LLM apps from prototype to production. We have an open-source framework, a very popular one, that helps developers build production LLM apps over their data. A lot of the use cases we've seen in the past year have been around productionizing RAG; in the next six months we anticipate a lot more agentic use cases arising as well. It's primarily focused on orchestration around retrieval, prompting, agentic reasoning, and tool use. The other piece we have is LlamaCloud, a centralized knowledge interface for your production LLM application. It unifies your data sources, starting with unstructured data, and processes and enhances that data so that you actually get good-quality data out of your very complex PDFs, PowerPoints, and spreadsheets. It helps you build managed pipelines so that, as a developer, you don't have to worry as much about that and can focus on the actually interesting part: the orchestration of that data with LLMs. The idea is to manage a lot of the data infrastructure so developers spend less time wrangling with data and more time building the core prompting and agentic retrieval logic that makes up the custom use case they want to build.
00:08:15.760 | I'm not going to run through every feature, and some of these things are upcoming, but one specific thing that has gotten a decent amount of interest from users is LlamaParse, a specific component of LlamaCloud. It's our advanced document parser that helps solve this data quality problem. If you want to build LLM applications over a complex financial report or a PowerPoint with a lot of messy text layouts, tables, images, diagrams, and so on, we provide a really nice toolkit to parse that data specifically so that LLMs can understand it and don't hallucinate over it.
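
As a small illustration, a sketch of the documented LlamaParse client usage; the file name is hypothetical, and it assumes `pip install llama-parse` plus a LLAMA_CLOUD_API_KEY in the environment:

```python
# Hedged sketch: parse a messy PDF into LLM-friendly markdown with LlamaParse.
from llama_parse import LlamaParse

parser = LlamaParse(result_type="markdown")  # markdown keeps tables structured
docs = parser.load_data("complex_financial_report.pdf")  # hypothetical file
print(docs[0].text[:500])  # inspect the parsed output
```
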
00:09:09.200 | We released it a few months ago, and there have been some impressive usage metrics so far: roughly half a million monthly downloads of the client SDK, tens of millions of pages processed, and a lot of important customers using it throughout the enterprise.
00:09:30.880 | Generally speaking, in terms of discussion topics, I'm happy to talk about any of these components. I'm very interested in the enterprise data stack and how it translates into LLM applications. I'm also interested in the use-case side: the advancement from simple QA interfaces into more agentic workflows that can actually take actions and automate more decision-making for different teams, either internally or externally. And a quick shout-out: we have a general waitlist for LlamaCloud that's already gotten pretty popular, with a decent number of signups. The goal is to enable more users to process and index their unstructured data, so we can help manage that while they build the important use cases as enterprise developers. Cool.
00:10:22.640 | Let's go here.
00:10:46.960 | Yeah, that's a great question. Can you just repeat part of the question? Yeah, with the microphone. So the question was about the enterprise product, LlamaCloud, where the understanding is that you upload documents to our cloud: how do we deal with data privacy? There are two answers. The first is that we have both a cloud service and a VPC deployment option; I'm happy to chat about that if you sign up on the contact form. We deploy in AWS and Azure, with GCP coming soon. The second is that we're a data orchestration layer, so we intentionally don't store your data; we try to integrate with your existing storage systems.
00:11:20.320 | You made a comment on the differences between traditional ETL and the new skills and tools that are required. Can you expand on that a bit? In my company, I might get asked, "Hey, let's have this ETL person who's done a lot of other ETL work do it." What kind of instruction would I give them about the other skill sets or tools that might be necessary, and are there any gotchas you could highlight?
00:11:53.280 | Totally. Just at a technical level, the steps you actually take are different. Instead of writing SQL or using dbt, here's how you set up a RAG pipeline. You have a PDF; first you need to parse that PDF, using LlamaParse or another document parser. If you don't get that parsing step right, it leads to a lot of downstream failure modes in your LLM application. After you parse the document into some representation, whether that's text or, increasingly, multimodal representations such as images of the document, you then need to chunk it. The very naive approach is to set a chunk size of, say, 1,024 tokens and split every 1,024 tokens. That specifically introduces a bunch of complexities: if you split tables down the middle, or split across pages when a section spans multiple pages, you somehow need to semantically join the pieces back together so that most of the information is preserved within a chunk, and you need to add the right metadata to each chunk. Then you need to figure out a good way to index it, and this is where a vector database, graph store, or document store comes in; there are a lot of different ways to index it. So, fundamentally, it's just a different set of steps. And the real difference from traditional ETL is that all these steps are fuzzy without end-to-end performance measurement. With traditional ETL you do some step and you know exactly what you want; here, it's really hard to tell what chunk size to set without having an eval dataset and a rigorous end-to-end testing and eval flow.
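
To make that concrete, a hedged sketch of such an end-to-end sweep using LlamaIndex's built-in faithfulness evaluator; the questions, data folder, and model choice are all illustrative:

```python
# Hedged sketch: sweep chunk sizes and measure end-to-end, since the "right"
# chunk size can't be known a priori. Assumes OPENAI_API_KEY and ./data exist.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.evaluation import FaithfulnessEvaluator
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI

docs = SimpleDirectoryReader("./data").load_data()
evaluator = FaithfulnessEvaluator(llm=OpenAI(model="gpt-4o"))
questions = ["What was Q4 revenue?", "Who are the named executives?"]  # tiny eval set

for chunk_size in (256, 512, 1024):
    index = VectorStoreIndex.from_documents(
        docs, transformations=[SentenceSplitter(chunk_size=chunk_size)]
    )
    engine = index.as_query_engine()
    passing = [
        evaluator.evaluate_response(response=engine.query(q)).passing
        for q in questions
    ]
    print(f"chunk_size={chunk_size}: {sum(passing)}/{len(passing)} faithful")
```
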
00:13:43.440 | Oh, sorry, I want to make sure; I think I saw a hand over there. Yeah, go ahead.
00:14:09.840 | So the question was basically: how do you integrate audio sources into your RAG pipeline, using LlamaIndex or other frameworks? We have a few audio loaders. The simplest approach is probably to just parse the audio directly into text and then ingest it. In the future, as models become more natively multimodal, you might be able to represent audio as a specific entity, almost as a chunk, and feed it directly into a model, but I don't think we're there yet.
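
As an illustration of that transcribe-then-ingest route, a hedged sketch using the open-source Whisper package; the audio file name is hypothetical:

```python
# Hedged sketch: speech-to-text first, then treat the transcript as plain text.
# Assumes `pip install openai-whisper llama-index`; "call.mp3" is hypothetical.
import whisper
from llama_index.core import Document, VectorStoreIndex

model = whisper.load_model("base")
transcript = model.transcribe("call.mp3")["text"]  # audio -> text
index = VectorStoreIndex.from_documents([Document(text=transcript)])
print(index.as_query_engine().query("What did the customer ask about?"))
```
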
00:14:44.000 | Okay, I'm going to go over here. For sure, I think benchmarking is important. It's also challenging; we're actually working right now on finding a general benchmark. What typically happens within an enterprise is they do a bake-off on their own data and compare, and we show them a notebook: here's how you build a RAG pipeline with LlamaParse, and here's how you can do it with other parsers.
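
Such a bake-off notebook might look roughly like this hedged sketch, comparing LlamaParse against a naive baseline on the same question; the file and question are illustrative:

```python
# Hedged bake-off sketch: same RAG pipeline, two parsers, compare answers.
# Assumes llama-index and llama-parse installed with API keys configured.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_parse import LlamaParse

question = "What is total operating expense in the table on page 12?"
candidates = {
    "llamaparse": LlamaParse(result_type="markdown").load_data("report.pdf"),
    "baseline": SimpleDirectoryReader(input_files=["report.pdf"]).load_data(),
}
for name, docs in candidates.items():
    engine = VectorStoreIndex.from_documents(docs).as_query_engine()
    print(name, "->", engine.query(question))
```
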
00:15:14.320 | Just want to make sure I cover everyone. Yeah? My question is: what options do you have for versioning, or promotion across environments, to do staging and production? That's one part, and the other is: what regions are you available in? That one's maybe a little easier. I think the versioning piece is definitely important. At a high level, we're building out features to help you better version your pipelines; we don't have that yet, but it's upcoming and has been requested by some enterprise customers. On regions: the SaaS service is hosted in North America, but we do on-prem deployments as well, and that's part of the enterprise plan we offer.
00:16:05.840 | Hi, I'm building a RAG system for a big fintech, basically a bank. The struggle I'm having is that I'm working with the servicing team, which has other channels: I'm working on an in-app chatbot and a WhatsApp chatbot, while the servicing team also has a help center, an IVR, and a bunch of other channels. It's been very tough for me to convince them that the CMS they're using to feed those other sources may not be the best way to feed a RAG system. I'm curious whether you've seen other customers with a similar issue, where internally they want a single source of truth that feeds into all of these channels, even though the nature of a RAG system is obviously very different from a help center or an FAQ.

00:16:50.640 | I see. Wait, so why is that CMS not the right tool?

00:16:57.680 | I'm curious to know if you think it could be the right tool. Getting a little more into the details: we have Q&A pairs, that's how the CMS works right now, which could work for RAG, but we're missing all the metadata, the different clusterings of documents for different use cases or different credit cards. It's a bit tough to explain in a quick question, but have you seen a single system work as a single source of truth, and how have you seen that work?
00:17:23.200 | Yeah, really good question. The full details are probably a lot to dive into, but generally speaking, what we see is that for homogeneous data sources of the same data type, say all financial reports, you can generally use the same set of parameters to parse them, because there's an expectation that they're in roughly the same format. For very diverse data sources, if all of a sudden you're bringing in not just PDF documents but also semi-structured data, like JSONs from Jira or Salesforce for instance, you typically need to set up a separate pipeline for each. Then, and we offer this on both the open-source and the enterprise side, you combine all these different data sources and re-rank the combined results, with some re-ranking layer at the top.
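
As a rough sketch of that fuse-and-re-rank shape in open-source LlamaIndex, with one index per silo; the directory names and settings are illustrative:

```python
# Hedged sketch: separate pipeline per data silo, fused at query time.
# Assumes llama-index installed and OPENAI_API_KEY set; folders are hypothetical.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import QueryFusionRetriever

pdf_index = VectorStoreIndex.from_documents(
    SimpleDirectoryReader("./pdf_docs").load_data())
cms_index = VectorStoreIndex.from_documents(
    SimpleDirectoryReader("./cms_export").load_data())

retriever = QueryFusionRetriever(
    [pdf_index.as_retriever(), cms_index.as_retriever()],
    similarity_top_k=5,  # final re-ranked pool across both sources
    num_queries=1,       # no query rewriting; just fuse and re-rank
)
engine = RetrieverQueryEngine.from_args(retriever)
print(engine.query("Which credit cards waive foreign transaction fees?"))
```
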
00:18:16.800 | All right, thank you. Yeah, so I've been using LlamaParse for a little bit, and first of all, I love it; it works really well, so thank you for producing it. However, two weeks ago I was working on a project for a client and all of a sudden I was getting all these failures. I contacted support via the chat, and there was a gentleman helping me out. He asked me to pass him the job IDs, then all of a sudden went MIA and only replied back later. So the question is: what are the support options, in case I get stuck over the weekend and need to actually get somebody to help?
00:18:55.360 | Totally. First of all, I'm sorry you ran into those issues; I know we had a cluster of failures that specific weekend, and it was a good lesson for us. Keep in mind we're about fifteen people at the company, so when you talk to support it's probably one of the founding engineers jumping in. I promise we're making that process more streamlined. On the enterprise side, especially for the enterprise plans we offer, and I'm happy to chat about this offline, we offer dedicated SLAs. So there's some support option for the casual, self-serve APIs, but we offer dedicated SLAs on the enterprise plans.
00:19:32.960 | Hey, so we're building hallucination detection and other evaluation systems for our customers who have very large collections of documents, say thousands of PDFs, and those PDFs typically contain a lot of tables and so on. Then there's the question of how to combine OCR and other PDF processing on top of that. So the question is: what is your general recommendation? Does LlamaParse take care of all of this, or do you recommend building some kind of custom system directly on top of LlamaIndex? How would you recommend handling that?
00:20:16.720 | Yeah. I guess I didn't actually show the capabilities of LlamaParse in these slides, but maybe if I dig around a little I can find the specific slides that showcase it. Basically, what you want when you parse these documents is a generally good parser that lays out the text in a spatially aligned way. It doesn't matter whether you have all the bells and whistles like bounding boxes; at a bare minimum you just want the text to be faithfully represented, and that's exactly what LlamaParse does, especially for tables. We have a few examples where you have tables within a document, you lay them out in a spatially aligned way, and when you feed that to an LLM, it can understand what's going on in that text chunk, because LLMs are generally trained to respond pretty well to well-formatted pieces of text. Whereas if you use a very naive parser, like a baseline PDF parser, it's going to collapse the text and numbers, and that's going to generate a lot of hallucinations.
00:21:24.000 | Yeah. With the increase in the size of the context windows available to us, and also the improvements we're seeing in dealing with needle-in-a-haystack kinds of problems, what is your perspective on where we're headed with RAG?

00:21:38.000 | Yeah, I think there are two general trends: one is longer context windows, the other is multimodality. I do think there are a few things that will probably go away and a few things that will stay. One thing that stays: good parsing is still important, because in the end, if your parser is bad, you're just going to feed bad data into the LLM and it's going to hallucinate information. What I think will probably go away: as context windows get bigger, chunk sizes can also get bigger, so you're probably not going to need to worry about intra-page splitting, splitting a single page into a bunch of smaller chunks. In the future we could see you putting entire documents in as chunks and indexing things at the document level. I think that actually makes a lot of sense, because documents are typically self-contained entities, and I think it will make things a lot easier for developers.
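
A hedged sketch of that document-level indexing idea, one retrievable node per document rather than many small chunks; it assumes each document fits within the embedding model's input limit:

```python
# Hedged sketch: index whole documents as single retrieval units.
# Assumes llama-index installed, OPENAI_API_KEY set, and a ./data folder;
# also assumes each document fits the embedding model's token limit.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.schema import TextNode

docs = SimpleDirectoryReader("./data").load_data()
nodes = [TextNode(text=d.text, metadata=d.metadata) for d in docs]  # one node per doc
index = VectorStoreIndex(nodes)
retriever = index.as_retriever(similarity_top_k=2)  # retrieves whole documents
```
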
00:22:31.280 | However, for a multi-document system, and if you're at a company you probably have billions of documents, many gigabytes worth, you're probably not going to feed all billion documents into the context window on every inference call. Even with context caching, which I think Gemini has, it probably doesn't make sense from a cost perspective, because context caching is currently very expensive. It's also a black box: you basically store the transformer weights, for those of you familiar with that, so you don't get accountability over the data or full transparency into what data is actually being fed into the language model at each step. So for a variety of reasons, I think the overall idea of retrieval from an external storage system, whether that's a vector database or a graph database, still matters; but the minute chunking decisions will probably go away.
00:23:20.080 | The second thing, which you didn't ask about but which I'll talk about anyway, is multimodality. As multimodal models get better, I think it actually makes sense to start having diverse representations of the same thing. For instance, with a PowerPoint presentation, you're able to represent each page as an image in addition to the parsed text, and by storing native image chunks you preserve all the information within that data. Any time you do parsing, it's inherently lossy, because you're trying to extract things into a textual format rather than preserving the full picture. By having different ways of representing the same data, you can trade off between cost, performance, and latency.
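
A hedged sketch of keeping such dual representations per page; the parsed text and image path are hypothetical inputs, for example from a parser plus a page-rasterizing step:

```python
# Hedged sketch: store two representations of the same slide, parsed text plus
# the raw page image, so later stages can pick fidelity vs. cost vs. latency.
from llama_index.core.schema import ImageNode, TextNode

page_text = "Q3 revenue grew 12% quarter over quarter ..."  # e.g. from LlamaParse
text_node = TextNode(text=page_text, metadata={"page": 1})
image_node = ImageNode(
    image_path="page_images/slide_01.png",  # hypothetical page screenshot
    metadata={"page": 1},                   # lossless view of the same page
)
```
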
00:24:00.640 | Let me check the time, though.
00:24:05.600 | Hi. I see you've done a lot of work improving accuracy and reducing hallucination. I wonder if you're working on anything to make the conversation flow better. In my experience it's so hard to get the conversation to feel natural; sometimes these systems overemphasize the context data, while I just want to give it an FYI and continue talking like a normal human.

00:24:30.480 | So you're talking about how to create more natural conversation flows?
00:24:35.920 | Yeah, that's very interesting. The overall answer is that the default way most people build these conversation flows is to have, say, a RAG pipeline as a tool, and then an agent as an outer layer that reasons over the conversation history and synthesizes the right answer at the given point in time. So the knobs you want to tune are the agent's reasoning prompt as well as the memory. And I think the memory is actually pretty important, because right now most memory modules are very primitive; there's not a lot of good work beyond just dumping the conversation history into the prompt. Happy to chat more about that, but I think there's a lot of stuff there you could try. Just want to double-check the time. Okay.
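
For illustration, a hedged sketch of that shape in LlamaIndex, a RAG pipeline wrapped as a tool under an agent, with the memory and reasoning-prompt knobs exposed; the names, settings, and prompt wording are illustrative:

```python
# Hedged sketch: RAG pipeline as a tool, agent as the outer conversational layer.
# Assumes llama-index installed, OPENAI_API_KEY set, and a ./docs folder.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.agent import ReActAgent
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.core.tools import QueryEngineTool

index = VectorStoreIndex.from_documents(SimpleDirectoryReader("./docs").load_data())
rag_tool = QueryEngineTool.from_defaults(
    query_engine=index.as_query_engine(),
    name="knowledge_base",
    description="Answers questions about the internal knowledge base.",
)
agent = ReActAgent.from_tools(
    [rag_tool],
    memory=ChatMemoryBuffer.from_defaults(token_limit=4000),  # the memory knob
    context=(  # the reasoning-prompt knob: avoid over-using the tool
        "Only consult the knowledge base when the user asks a question; "
        "acknowledge FYIs and small talk naturally without retrieving."
    ),
)
print(agent.chat("FYI, we renamed the project to Atlas."))
```
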
00:25:30.000 | Yeah: how are you using llama-agents internally? What's the most complex task? That's a great question. For those of you who weren't at the keynote, we launched something called llama-agents, an open-source multi-agent framework for helping you deploy agents as microservices. Right now agents primarily live in notebooks, and the idea is to spin them up as API services. At the moment we're mostly using it to build more constrained, simple RAG pipelines, and it's actually still in an alpha state, so I encourage all of you to try it out. There are a lot of things I already know it can't do, for instance more general communication protocols and interfaces that we want to build in, and a more interesting message-queue system. But if you have an enterprise use case that's going agentic and you want to structure it as microservices you can reuse and encapsulate, please try it out and come talk to us. Cool, thank you. Sorry for going over.

00:26:28.400 | No, that's all fantastic.