
RAG for VPs of AI: Jerry Liu



00:00:00.000 | Hi everyone, I'm Jerry, co-founder and CEO of LlamaIndex. I'll spend the first ten minutes giving a brief overview of RAG and of LlamaIndex: how we see the enterprise developer space progressing, plus an overview of our product offerings. Then for the next fifteen minutes I'm happy to field questions and have a discussion about what's top of mind in enterprises today. So let's get started.
00:00:45.800 | Throughout the enterprise, and this might resonate with some of you, we're seeing a lot of different use cases pop up, and a lot of them are around RAG. I'm pretty sure we all know what RAG is: you point an LLM at a directory of files, get it to somehow understand those files, and then generate answers from them. Other use cases we see include document processing and extraction, and maintaining conversations over time. This year a lot of people are building agents, though we haven't seen many fully autonomous agents in production; they're typically a bit more constrained. I'm curious to get your takes on that as well, so happy to discuss it.
00:01:20.740 | Obviously, RAG has become a very popular set of techniques for building a question-answering interface over your data; that's really the end goal. What are the main components of RAG? I won't go into the deep technical details, but you need an LLM to do the final synthesis, you need an embedding model, and you need some database, whether that's a vector database, a document store, a graph store, or a SQL database. Then, and here's the interesting part, you basically need a new data processing stack to handle the data parsing and ingestion side. This is different from traditional ETL, which primarily serves analytics workloads and has a lot of established technology around it. Here, at a very basic level, you're taking in a PDF, slicing it up into a bunch of chunks, figuring out how to do that well, and indexing and representing it in a bunch of different storage forms so that LLMs have easy access to it. A lot of what LlamaIndex is trying to solve is that data processing piece; a minimal sketch of the stack is below.
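
For illustration, a minimal sketch of that parse-chunk-embed-index flow using LlamaIndex's ingestion pipeline; the file name is hypothetical, and it assumes `pip install llama-index` plus an OpenAI API key in the environment:

```python
# Minimal ingestion sketch: parse -> chunk -> embed, ready to index.
# Assumes OPENAI_API_KEY is set; "report.pdf" is a hypothetical input file.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding

docs = SimpleDirectoryReader(input_files=["report.pdf"]).load_data()
pipeline = IngestionPipeline(transformations=[
    SentenceSplitter(chunk_size=1024, chunk_overlap=64),  # chunking step
    OpenAIEmbedding(),                                    # embedding step
])
nodes = pipeline.run(documents=docs)
index = VectorStoreIndex(nodes)  # represent the chunks in a vector store
```
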
00:02:29.900 | A big pain point we see for a lot of companies building LLM applications is going from prototype to production. Unlike traditional ML, it's actually really easy to build a prototype: with some of the tools LlamaIndex offers, it takes about ten minutes to build a RAG pipeline that kind of works over your data (see the sketch below). But going from "kind of works" to production quality is a lot harder.
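
As a rough illustration of that ten-minute prototype, a sketch following the LlamaIndex quickstart pattern; the `./data` folder is hypothetical and an OpenAI key is assumed:

```python
# The "ten-minute" RAG prototype: load, index, query.
# Assumes `pip install llama-index`, OPENAI_API_KEY set, and a ./data folder.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("./data").load_data()  # parse files
index = VectorStoreIndex.from_documents(documents)       # chunk, embed, index
query_engine = index.as_query_engine()                   # retrieval + synthesis
print(query_engine.query("What are the key findings?"))
```
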
00:03:04.020 | As you scale up the number of documents, as the documents get more complex, and as you try to add more data sources, there's a higher quality bar you need to meet. The general pain points we see include accuracy issues, knowing how to tune a bunch of different knobs, and scaling to many data sources. Oftentimes this either takes a lot of developer time or teams just don't know how to do it, so the POC you're building for the higher-ups ends up not really working, and the value of the overall project is diminished.
00:03:36.140 | The other problem we see is that, generally speaking, most of the larger companies we talk to have a lot of data, and there's this general issue of data silos: you have unstructured data, structured data, semi-structured data, and APIs. A similar problem occurs during LLM application development, where you want to somehow bring all this data into some central place so your LLMs can understand it. If you had a magic tool that made that happen, and made it work well, you'd have the holy grail of RAG: being able to synthesize answers over any of your knowledge, anywhere in the enterprise.
00:04:24.700 | One thing we talk about a lot, both during the keynote yesterday and more generally, is the importance of data processing and data quality. We've probably all heard the phrase "garbage in, garbage out." That's true in machine learning, and it's also true in LLM application development. If you don't have good data quality, and I can go into an example of what that means, you're not going to get back well-represented information, so even if your LLM is very good, bad data quality leads to hallucinations in your application.
00:05:03.600 | And so we believe in developers. If you're leading AI at one of these enterprises, you do want to make a bet on developers. I say this pretty often: you should generally bet on building a bit more, rather than just buying pure out-of-the-box solutions, and there are a few reasons why. First, the AI space is moving really quickly and the underlying technology is shifting; developers are the best positioned to translate that technology into enterprise value that's custom to your use case. If you go through the procurement process and purchase out-of-the-box tools, they will maybe solve the current pain point you have and provide a solution for it, but they will probably be a lot slower to adapt as new techniques pop up and new workflows become possible. So we care a lot about developers, and we want to provide the tooling and infrastructure that enables them to build LLM applications over their data. This helps you get applications with high-quality data and responses that are actually ready for production. Importantly, it's easier for developers to set up and maintain, so you don't have to keep throwing developers at it, banging their heads against the wall to figure out how to make the thing generate good responses, and you can scale to more data sources.
00:06:29.280 | Great. I'm not going to go through all the different features of LlamaIndex, but I'll quickly run through the main components. Our main goal as a company is to help any developer build context-augmented LLM apps from prototype to production. We have an open-source framework, a very popular one, that helps developers build production LLM apps over their data. A lot of the use cases we've seen in the past year have been around productionizing RAG; in the next six months we anticipate a lot more agentic use cases arising as well. It's primarily focused on orchestration around retrieval, prompting, agentic reasoning, and tool use. The other piece we have is LlamaCloud, a centralized knowledge interface for your production LLM application. It unifies your data sources, starting with unstructured data, and processes and enhances that data so that you actually get good-quality data out of your very complex PDFs, PowerPoints, and spreadsheets. It helps you build managed pipelines so that, as a developer, you don't have to worry as much about that and can focus on the actually interesting part: the orchestration of that data with LLMs. The idea is to manage a lot of the data infrastructure so developers spend less time wrangling with data and more time building the core prompting and agentic retrieval logic that makes up the custom use case they want to build.
00:08:15.760 | I'm not going to run through every feature, and some of these things are upcoming, but one specific thing that has gotten a decent amount of interest from users is LlamaParse, a specific component of LlamaCloud. It's our advanced document parser that helps solve this data quality problem. If you want to build LLM applications over a complex financial report or a PowerPoint with a lot of messy text layouts, tables, images, diagrams, and so on, we provide a really nice toolkit to parse that data specifically so that LLMs can understand it and don't hallucinate over it.
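
As a small illustration, a sketch of the documented LlamaParse client usage; the file name is hypothetical, and it assumes `pip install llama-parse` plus a LLAMA_CLOUD_API_KEY in the environment:

```python
# Hedged sketch: parse a messy PDF into LLM-friendly markdown with LlamaParse.
from llama_parse import LlamaParse

parser = LlamaParse(result_type="markdown")  # markdown keeps tables structured
docs = parser.load_data("complex_financial_report.pdf")  # hypothetical file
print(docs[0].text[:500])  # inspect the parsed output
```
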
00:09:09.200 | We released it a few months ago, and there have been some impressive usage metrics so far: roughly half a million monthly downloads of the client SDK, tens of millions of pages processed, and a lot of important customers using it throughout the enterprise.
00:09:30.880 | Generally speaking, in terms of discussion topics, I'm happy to talk about any of these components. I'm very interested in the enterprise data stack and how it translates into LLM applications. I'm also interested in the use-case side: the advancement from simple QA interfaces into more agentic workflows that can actually take actions and automate more decision-making for different teams, either internally or externally. And a quick shout-out: we have a general waitlist for LlamaCloud that's already gotten pretty popular, with a decent number of signups. The goal is to enable more users to process and index their unstructured data, so we can help manage that while they build the important use cases as enterprise developers. Cool.
00:10:22.640 | Let's go here.
00:10:46.960 | Yeah, that's a great question. Can you just repeat part of the question? Yeah, with the microphone. So the question was about the enterprise product, LlamaCloud, where the understanding is that you upload documents to our cloud: how do we deal with data privacy? There are two answers. The first is that we have both a cloud service and a VPC deployment option; I'm happy to chat about that if you sign up on the contact form. We deploy in AWS and Azure, with GCP coming soon. The second is that we're a data orchestration layer, so we intentionally don't store your data; we try to integrate with your existing storage systems.
00:11:20.320 | You made a comment on the differences between traditional ETL and the new skills and tools that are required. Can you expand on that a bit? In my company, I might get asked, "Hey, let's have this ETL person who's done a lot of other ETL work do it." What kind of instruction would I give them about the other skill sets or tools that might be necessary, and are there any gotchas you could highlight?
00:11:53.280 | Totally. Just at a technical level, the steps you actually take are different. Instead of writing SQL or using dbt, here's how you set up a RAG pipeline. You have a PDF; first you need to parse that PDF, using LlamaParse or another document parser. If you don't get that parsing step right, it leads to a lot of downstream failure modes in your LLM application. After you parse the document into some representation, whether that's text or, increasingly, multimodal representations such as images of the document, you then need to chunk it. The very naive approach is to set a chunk size of, say, 1,024 tokens and split every 1,024 tokens. That specifically introduces a bunch of complexities: if you split tables down the middle, or split across pages when a section spans multiple pages, you somehow need to semantically join the pieces back together so that most of the information is preserved within a chunk, and you need to add the right metadata to each chunk. Then you need to figure out a good way to index it, and this is where a vector database, graph store, or document store comes in; there are a lot of different ways to index it. So, fundamentally, it's just a different set of steps. And the real difference from traditional ETL is that all these steps are fuzzy without end-to-end performance measurement. With traditional ETL you do some step and you know exactly what you want; here, it's really hard to tell what chunk size to set without having an eval dataset and a rigorous end-to-end testing and eval flow.
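
To make that concrete, a hedged sketch of such an end-to-end sweep using LlamaIndex's built-in faithfulness evaluator; the questions, data folder, and model choice are all illustrative:

```python
# Hedged sketch: sweep chunk sizes and measure end-to-end, since the "right"
# chunk size can't be known a priori. Assumes OPENAI_API_KEY and ./data exist.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.evaluation import FaithfulnessEvaluator
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI

docs = SimpleDirectoryReader("./data").load_data()
evaluator = FaithfulnessEvaluator(llm=OpenAI(model="gpt-4o"))
questions = ["What was Q4 revenue?", "Who are the named executives?"]  # tiny eval set

for chunk_size in (256, 512, 1024):
    index = VectorStoreIndex.from_documents(
        docs, transformations=[SentenceSplitter(chunk_size=chunk_size)]
    )
    engine = index.as_query_engine()
    passing = [
        evaluator.evaluate_response(response=engine.query(q)).passing
        for q in questions
    ]
    print(f"chunk_size={chunk_size}: {sum(passing)}/{len(passing)} faithful")
```
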
00:13:43.440 | Oh, sorry, I want to make sure; I think I saw a hand over there. Yeah, go ahead.
00:14:09.840 | So the question was basically: how do you integrate audio sources into your RAG pipeline, using LlamaIndex or other frameworks? We have a few audio loaders. The simplest approach is probably to just parse the audio directly into text and then ingest it. In the future, as models become more natively multimodal, you might be able to represent audio as a specific entity, almost as a chunk, and feed it directly into a model, but I don't think we're there yet.
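
As an illustration of that transcribe-then-ingest route, a hedged sketch using the open-source Whisper package; the audio file name is hypothetical:

```python
# Hedged sketch: speech-to-text first, then treat the transcript as plain text.
# Assumes `pip install openai-whisper llama-index`; "call.mp3" is hypothetical.
import whisper
from llama_index.core import Document, VectorStoreIndex

model = whisper.load_model("base")
transcript = model.transcribe("call.mp3")["text"]  # audio -> text
index = VectorStoreIndex.from_documents([Document(text=transcript)])
print(index.as_query_engine().query("What did the customer ask about?"))
```
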
00:14:44.000 | Okay, I'm going to go over here. For sure, I think benchmarking is important. It's also challenging; we're actually working right now on finding a general benchmark. What typically happens within an enterprise is they do a bake-off on their own data and compare, and we show them a notebook: here's how you build a RAG pipeline with LlamaParse, and here's how you can do it with other parsers.
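
Such a bake-off notebook might look roughly like this hedged sketch, comparing LlamaParse against a naive baseline on the same question; the file and question are illustrative:

```python
# Hedged bake-off sketch: same RAG pipeline, two parsers, compare answers.
# Assumes llama-index and llama-parse installed with API keys configured.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_parse import LlamaParse

question = "What is total operating expense in the table on page 12?"
candidates = {
    "llamaparse": LlamaParse(result_type="markdown").load_data("report.pdf"),
    "baseline": SimpleDirectoryReader(input_files=["report.pdf"]).load_data(),
}
for name, docs in candidates.items():
    engine = VectorStoreIndex.from_documents(docs).as_query_engine()
    print(name, "->", engine.query(question))
```
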
00:15:14.320 | Just want to make sure I cover everyone. Yeah? My question is: what options do you have for versioning, or promotion across environments, to do staging and production? That's one part, and the other is: what regions are you available in? That one's maybe a little easier. I think the versioning piece is definitely important. At a high level, we're building out features to help you better version your pipelines; we don't have that yet, but it's upcoming and has been requested by some enterprise customers. On regions: the SaaS service is hosted in North America, but we do on-prem deployments as well, and that's part of the enterprise plan we offer.
00:16:05.840 | Hi, I'm building a RAG system for a big fintech, basically a bank. The struggle I'm having is that I'm working with the servicing team, which has other channels: I'm working on an in-app chatbot and a WhatsApp chatbot, while the servicing team also has a help center, an IVR, and a bunch of other channels. It's been very tough for me to convince them that the CMS they're using to feed those other sources may not be the best way to feed a RAG system. I'm curious whether you've seen other customers with a similar issue, where internally they want a single source of truth that feeds into all of these channels, even though the nature of a RAG system is obviously very different from a help center or an FAQ.

00:16:50.640 | I see. Wait, so why is that CMS not the right tool?

00:16:57.680 | I'm curious to know if you think it could be the right tool. Getting a little more into the details: we have Q&A pairs, that's how the CMS works right now, which could work for RAG, but we're missing all the metadata, the different clusterings of documents for different use cases or different credit cards. It's a bit tough to explain in a quick question, but have you seen a single system work as a single source of truth, and how have you seen that work?
00:17:23.200 | Yeah, really good question. The full details are probably a lot to dive into, but generally speaking, what we see is that for homogeneous data sources of the same data type, say all financial reports, you can generally use the same set of parameters to parse them, because there's an expectation that they're in roughly the same format. For very diverse data sources, if all of a sudden you're bringing in not just PDF documents but also semi-structured data, like JSONs from Jira or Salesforce for instance, you typically need to set up a separate pipeline for each. Then, and we offer this on both the open-source and the enterprise side, you combine all these different data sources and re-rank the combined results, with some re-ranking layer at the top.
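
As a rough sketch of that fuse-and-re-rank shape in open-source LlamaIndex, with one index per silo; the directory names and settings are illustrative:

```python
# Hedged sketch: separate pipeline per data silo, fused at query time.
# Assumes llama-index installed and OPENAI_API_KEY set; folders are hypothetical.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import QueryFusionRetriever

pdf_index = VectorStoreIndex.from_documents(
    SimpleDirectoryReader("./pdf_docs").load_data())
cms_index = VectorStoreIndex.from_documents(
    SimpleDirectoryReader("./cms_export").load_data())

retriever = QueryFusionRetriever(
    [pdf_index.as_retriever(), cms_index.as_retriever()],
    similarity_top_k=5,  # final re-ranked pool across both sources
    num_queries=1,       # no query rewriting; just fuse and re-rank
)
engine = RetrieverQueryEngine.from_args(retriever)
print(engine.query("Which credit cards waive foreign transaction fees?"))
```
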
00:18:16.800 | All right, thank you. Yeah, so I've been using LlamaParse for a little bit, and first of all, I love it; it works really well, so thank you for producing it. However, two weeks ago I was working on a project for a client and all of a sudden I was getting all these failures. I contacted support via the chat, and there was a gentleman helping me out. He asked me to pass him the job IDs, then all of a sudden went MIA and only replied back later. So the question is: what are the support options, in case I get stuck over the weekend and need to actually get somebody to help?
00:18:55.360 | Totally. First of all, I'm sorry you ran into those issues; I know we had a cluster of failures that specific weekend, and it was a good lesson for us. Keep in mind we're about fifteen people at the company, so when you talk to support it's probably one of the founding engineers jumping in. I promise we're making that process more streamlined. On the enterprise side, especially for the enterprise plans we offer, and I'm happy to chat about this offline, we offer dedicated SLAs. So there's some support option for the casual, self-serve APIs, but we offer dedicated SLAs on the enterprise plans.
00:19:32.960 | Hey, so we're building hallucination detection and other evaluation systems for our customers who have very large collections of documents, say thousands of PDFs, and those PDFs typically contain a lot of tables and so on. Then there's the question of how to combine OCR and other PDF processing on top of that. So the question is: what is your general recommendation? Does LlamaParse take care of all of this, or do you recommend building some kind of custom system directly on top of LlamaIndex? How would you recommend handling that?
00:20:16.720 | Yeah. I guess I didn't actually show the capabilities of LlamaParse in these slides, but maybe if I dig around a little I can find the specific slides that showcase it. Basically, what you want when you parse these documents is a generally good parser that lays out the text in a spatially aligned way. It doesn't matter whether you have all the bells and whistles like bounding boxes; at a bare minimum you just want the text to be faithfully represented, and that's exactly what LlamaParse does, especially for tables. We have a few examples where you have tables within a document, you lay them out in a spatially aligned way, and when you feed that to an LLM, it can understand what's going on in that text chunk, because LLMs are generally trained to respond pretty well to well-formatted pieces of text. Whereas if you use a very naive parser, like a baseline PDF parser, it's going to collapse the text and numbers, and that's going to generate a lot of hallucinations.
00:21:24.000 | Yeah. With the increase in the size of the context windows available to us, and also the improvements we're seeing in dealing with needle-in-a-haystack kinds of problems, what is your perspective on where we're headed with RAG?

00:21:38.000 | Yeah, I think there are two general trends: one is longer context windows, the other is multimodality. I do think there are a few things that will probably go away and a few things that will stay. One thing that stays: good parsing is still important, because in the end, if your parser is bad, you're just going to feed bad data into the LLM and it's going to hallucinate information. What I think will probably go away: as context windows get bigger, chunk sizes can also get bigger, so you're probably not going to need to worry about intra-page splitting, splitting a single page into a bunch of smaller chunks. In the future we could see you putting entire documents in as chunks and indexing things at the document level. I think that actually makes a lot of sense, because documents are typically self-contained entities, and I think it will make things a lot easier for developers.
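
A hedged sketch of that document-level indexing idea, one retrievable node per document rather than many small chunks; it assumes each document fits within the embedding model's input limit:

```python
# Hedged sketch: index whole documents as single retrieval units.
# Assumes llama-index installed, OPENAI_API_KEY set, and a ./data folder;
# also assumes each document fits the embedding model's token limit.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.schema import TextNode

docs = SimpleDirectoryReader("./data").load_data()
nodes = [TextNode(text=d.text, metadata=d.metadata) for d in docs]  # one node per doc
index = VectorStoreIndex(nodes)
retriever = index.as_retriever(similarity_top_k=2)  # retrieves whole documents
```
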
00:22:31.280 | However, for a multi-document system, and if you're at a company you probably have billions of documents, many gigabytes worth, you're probably not going to feed all billion documents into the context window on every inference call. Even with context caching, which I think Gemini has, it probably doesn't make sense from a cost perspective, because context caching is currently very expensive. It's also a black box: you basically store the transformer weights, for those of you familiar with that, so you don't get accountability over the data or full transparency into what data is actually being fed into the language model at each step. So for a variety of reasons, I think the overall idea of retrieval from an external storage system, whether that's a vector database or a graph database, still matters; but the minute chunking decisions will probably go away.
00:23:20.080 | The second thing, which you didn't ask about but which I'll talk about anyway, is multimodality. As multimodal models get better, I think it actually makes sense to start having diverse representations of the same thing. For instance, with a PowerPoint presentation, you're able to represent each page as an image in addition to the parsed text, and by storing native image chunks you preserve all the information within that data. Any time you do parsing, it's inherently lossy, because you're trying to extract things into a textual format rather than preserving the full picture. By having different ways of representing the same data, you can trade off between cost, performance, and latency.
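
A hedged sketch of keeping such dual representations per page; the parsed text and image path are hypothetical inputs, for example from a parser plus a page-rasterizing step:

```python
# Hedged sketch: store two representations of the same slide, parsed text plus
# the raw page image, so later stages can pick fidelity vs. cost vs. latency.
from llama_index.core.schema import ImageNode, TextNode

page_text = "Q3 revenue grew 12% quarter over quarter ..."  # e.g. from LlamaParse
text_node = TextNode(text=page_text, metadata={"page": 1})
image_node = ImageNode(
    image_path="page_images/slide_01.png",  # hypothetical page screenshot
    metadata={"page": 1},                   # lossless view of the same page
)
```
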
00:24:00.640 | Let me check the time, though.
00:24:05.600 | Hi. I see you've done a lot of work improving accuracy and reducing hallucination. I wonder if you're working on anything to make the conversation flow better. In my experience it's so hard to get the conversation to feel natural; sometimes these systems overemphasize the context data, while I just want to give it an FYI and continue talking like a normal human.

00:24:30.480 | So you're talking about how to create more natural conversation flows?
00:24:35.920 | Yeah, that's very interesting. The overall answer is that the default way most people build these conversation flows is to have, say, a RAG pipeline as a tool, and then an agent as an outer layer that reasons over the conversation history and synthesizes the right answer at the given point in time. So the knobs you want to tune are the agent's reasoning prompt as well as the memory. And I think the memory is actually pretty important, because right now most memory modules are very primitive; there's not a lot of good work beyond just dumping the conversation history into the prompt. Happy to chat more about that, but I think there's a lot of stuff there you could try. Just want to double-check the time. Okay.
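
For illustration, a hedged sketch of that shape in LlamaIndex, a RAG pipeline wrapped as a tool under an agent, with the memory and reasoning-prompt knobs exposed; the names, settings, and prompt wording are illustrative:

```python
# Hedged sketch: RAG pipeline as a tool, agent as the outer conversational layer.
# Assumes llama-index installed, OPENAI_API_KEY set, and a ./docs folder.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.agent import ReActAgent
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.core.tools import QueryEngineTool

index = VectorStoreIndex.from_documents(SimpleDirectoryReader("./docs").load_data())
rag_tool = QueryEngineTool.from_defaults(
    query_engine=index.as_query_engine(),
    name="knowledge_base",
    description="Answers questions about the internal knowledge base.",
)
agent = ReActAgent.from_tools(
    [rag_tool],
    memory=ChatMemoryBuffer.from_defaults(token_limit=4000),  # the memory knob
    context=(  # the reasoning-prompt knob: avoid over-using the tool
        "Only consult the knowledge base when the user asks a question; "
        "acknowledge FYIs and small talk naturally without retrieving."
    ),
)
print(agent.chat("FYI, we renamed the project to Atlas."))
```
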
00:25:30.000 | Yeah: how are you using llama-agents internally? What's the most complex task? That's a great question. For those of you who weren't at the keynote, we launched something called llama-agents, an open-source multi-agent framework for helping you deploy agents as microservices. Right now agents primarily live in notebooks, and the idea is to spin them up as API services. At the moment we're mostly using it to build more constrained, simple RAG pipelines, and it's actually still in an alpha state, so I encourage all of you to try it out. There are a lot of things I already know it can't do, for instance more general communication protocols and interfaces that we want to build in, and a more interesting message-queue system. But if you have an enterprise use case that's going agentic and you want to structure it as microservices you can reuse and encapsulate, please try it out and come talk to us. Cool, thank you. Sorry for going over.

00:26:28.400 | No, that's all fantastic.