
RAG at scale: production-ready GenAI apps with Azure AI Search


Transcript

All right, let's get started. Hi everyone, thanks for coming. I'm Pablo, I work on the Azure AI team at Microsoft, and in this session we'll talk about RAG at scale; in particular I'll focus on the retrieval portion of the RAG pattern. The plan for the session is: we'll do a quick recap, just to make sure we all use the same terms for the same things, and then I'll go through the different dimensions of scale and comment on how we tackle them in the context of AI Search, what we've learned from doing this, and what we're doing to make it easier to scale these applications.

By way of a quick recap: when it comes to bringing your own domain knowledge to work together with language models, you effectively have three options. The first is prompt engineering, and while it's easy to dismiss, you can go a long way with prompt engineering alone, especially these days when models have longer and longer context windows.

If that's not enough, sometimes the challenge is more along the lines of "I want to teach the model particular patterns" or "I want the model to learn the jargon of a particular vertical domain," and for that, fine-tuning is often a good option. However, in many of these cases what I really want is to have the model work over a set of data that it didn't see during training.

This could be my application data, my company's data, information about my users, anything like that, and for that the prevailing pattern right now is retrieval-augmented generation. Effectively, if you want the model to know facts, you separate the reasoning piece of the picture from the knowledge piece of the picture: you lean on the language model for its reasoning capabilities, and you use an external knowledge base to model what you know about a particular domain, the data that you have. The way you glue them together is, in principle, mechanically very simple (of course it then gets complicated, because life is never that easy). You have some orchestration component that takes the task at hand. Say you have a chat application and the user takes a turn and asks the next question: the orchestration component hits some knowledge base seeking pieces of information that could be used to produce an answer to that question, grabs a bunch of candidates, then goes to the language model with a set of instructions plus those candidates and asks the model to produce an answer to the user's question. That's the essence of the pattern. We all know that in practice it usually takes multiple goes and there's a lot of tuning in the middle, but the fundamentals boil down to that. Just by way of context, how many of you are creating and working on RAG applications today? Everyone? Okay, excellent.

With that backdrop, what I wanted to do is talk about the pressure points when you scale these applications. One thing that has been fascinating to see (we had the opportunity to be involved in this space from very early on; Azure OpenAI has been there from the early days of scaled language models) is that for all of last year, 2023, everybody built a prototype of something, to learn and figure out what could be done with this technology. The interesting shift this year, in 2024, is that these applications are going to production. When you go to production, you go from "oh, this demo is really cool" to all these users using it at the same time, and they want more: they want more data, they want faster answers.

So where before we could focus only on figuring out the interaction model and the applicability of this technology, now elements of scale also play a role, and scale takes multiple flavors. These things tend to scale in volume, because when your application works well, users come back, or the leadership of the organization comes back and says "let's put all the data in there," and then you have to deal with it. The rate of change of the data increases, and the query load increases as well, because more people are using the thing. Workflows also tend to get more complicated. At first you think, "how complicated can it be? I'll take the question, search the thing, and send it to the model." It turns out that sometimes that works, but often it doesn't, so you end up with multi-step workflows that hit the retrieval system and the language model multiple times, and that taxes all the systems, and they all have to scale.
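To make that recap concrete, here is a minimal sketch of the single-turn loop described above. The index name, field name, prompt wording, and model name are hypothetical placeholders, not the actual orchestration from the talk; it's just the shape of the pattern, assuming the azure-search-documents and openai Python packages.

```python
# One RAG turn: retrieve candidates from the knowledge base, then ask the model.
from azure.identity import DefaultAzureCredential
from azure.search.documents import SearchClient
from openai import OpenAI

search = SearchClient(
    endpoint="https://<your-service>.search.windows.net",
    index_name="docs",                        # hypothetical index
    credential=DefaultAzureCredential(),
)
llm = OpenAI()                                # assumes OPENAI_API_KEY is set

def answer(question: str, k: int = 5) -> str:
    # 1) Hit the knowledge base and grab a handful of candidate passages.
    hits = search.search(search_text=question, top=k)
    context = "\n\n".join(doc["content"] for doc in hits)   # 'content' is a hypothetical field

    # 2) Hand the model instructions plus the candidates and ask for a grounded answer.
    prompt = (
        "Answer the question using only the sources below. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    response = llm.chat.completions.create(
        model="gpt-4o-mini",                  # any chat model works for the sketch
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```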
Also, in the spirit of "now let's put all the data there," you have to deal with more data types, different kinds of data, and different data sources. So let's cover each of these dimensions of scale in more detail. I'll cover them in the context of Azure AI Search, since that's where I mostly work, and because the way we think about this in Azure is that we want to produce a system that encompasses the entire retrieval problem. Of course it has vector database capabilities; vector-based retrieval emerged as a very useful solution in many contexts, so we have full support for that. But we also brought into it years and years of Microsoft experience in retrieval systems in the more general sense, and we integrate them together so you don't have to connect a bunch of parts to have a proper, high-quality retrieval system: it all comes integrated. And of course we integrate with the rest of Azure, so it's easy to pull data in and connect to other data sources. Azure is a place that is used to build some of the largest applications out there, so all the enterprise readiness comes pre-built, from security to compliance, all the things you don't want to deal with directly; you want to build on a platform that has already taken care of them.

While there are multiple moving parts to retrieval systems, what we've seen over the last 18 months or so, since the emergence of RAG patterns at scale, is that vector retrieval is an important part of the solution, and you can see why: the interesting thing about vector retrieval is that it's incredibly effective at capturing soft, conceptual similarity and putting it to work right away. So in Azure AI Search we built a system with a complete feature set when it comes to vector search, including fast approximate nearest neighbor search as well as exhaustive search; sometimes you want exact search, for example to create baselines and see how your recall is looking. Applications also often need to combine vector search with the rest of the query: you need to filter, slice and dice, select the columns you want to project, and effectively treat it like a database that also does retrieval. We also see scenarios where documents have multiple vectors, maybe for different parts of the content or for different data types that use different embeddings, and sometimes queries need multiple vectors too. We try to make it so that as these specific needs come up while you're building an application, you'll find answers to all of them directly in Azure AI Search.

Let me show you this in action, starting with a very simple example. I have a small Jupyter notebook here. I set up a default credential, point at my Azure AI Search service, and create an index from scratch; this is all it takes to create an index once you have a service provisioned. In this case I create a few fields: a categorical field, which serves as metadata you sometimes want attached to your documents; a text field, because sometimes you want to mix text and vectors (I'll talk a little more about this later); and a vector field. This is a toy example, so the dimension is three; that's not very useful in practice, where it's usually in the hundreds or maybe thousands of dimensions. I also say what strategy I want the system to use for vector search; in this case I'm using HNSW, which is a well-known graph-based algorithm for indexing vectors. When I run this (the network was a little wonky, but there we go), I now have an index.
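For reference, the index definition in that notebook cell looks roughly like this with the Python SDK. This is a minimal sketch assuming azure-search-documents 11.4 or later (class names have shifted across SDK versions); the index name, field names, and profile names are just toy values standing in for the demo's.

```python
# Toy index: a key, a categorical field, a text field, and a 3-dimensional vector field.
from azure.identity import DefaultAzureCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    HnswAlgorithmConfiguration, SearchField, SearchFieldDataType, SearchIndex,
    SearchableField, SimpleField, VectorSearch, VectorSearchProfile,
)

index = SearchIndex(
    name="demo-index",
    fields=[
        SimpleField(name="id", type=SearchFieldDataType.String, key=True),
        SimpleField(name="category", type=SearchFieldDataType.String,
                    filterable=True),                                  # categorical metadata
        SearchableField(name="content", type=SearchFieldDataType.String),  # full text
        SearchField(name="embedding",
                    type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
                    searchable=True,
                    vector_search_dimensions=3,   # toy size; real embeddings are hundreds to thousands of dims
                    vector_search_profile_name="vec-profile"),
    ],
    vector_search=VectorSearch(
        algorithms=[HnswAlgorithmConfiguration(name="hnsw-config")],   # graph-based ANN
        profiles=[VectorSearchProfile(name="vec-profile",
                                      algorithm_configuration_name="hnsw-config")],
    ),
)

client = SearchIndexClient(endpoint="https://<your-service>.search.windows.net",
                           credential=DefaultAzureCredential())
client.create_index(index)
```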
I'm going to get a client to it, and then I'm going to index data. You can see this is a very simple case: these are my vectors, this is my full text, and these are my categorical data bits. You can let us do all the ingestion for you, or you can push data into the index; in this case I'm pushing data explicitly into the index. Once you have data (you can see we have some vectors and some categories), you can search, and because in this case I indexed vectors, you search with a vector. I can search and get the two closest documents to the reference vector I gave. Then I can do a few things incrementally. For example, if I want to combine text and keyword search, I can add search text right here, say "hello", and now I'm searching over both the text and the vectors, and we fuse the results and rank them appropriately. Often in applications you also need to filter things: you can see I have A's and B's in the category field, so I can write filters, for example "category equals A", and then I only get the A's. This is a trivial example, but of course you can do full filter expressions, ANDs and ORs and ranges and whatnot, and the filters are fast, so even if you have hundreds of millions of documents they are not a problem.

Audience: In that one, if you set k-nearest-neighbors to two, how come there are three results?

Great question. What I told the system here was: from the vector side, retrieve two candidates; but I also told it to go to the keyword side and retrieve a bunch of candidates from there, and then fuse them. So only two of the candidates came from vectors, but the fusion of the two sets selected three. I can make this a larger number, which is sometimes useful to get more candidates, and still bring in the keyword ones too and rank the top N. Basically, you separate how many candidates you want from how many you want to return.
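As a sketch against the same hypothetical toy index (again assuming azure-search-documents 11.4+), this is the shape of those queries: upload, pure vector search, hybrid text plus vector, and hybrid with a filter. The documents and vectors are made-up values; the point is the k_nearest_neighbors versus top distinction the question was about.

```python
from azure.identity import DefaultAzureCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

search = SearchClient(endpoint="https://<your-service>.search.windows.net",
                      index_name="demo-index",
                      credential=DefaultAzureCredential())

# Push a few documents explicitly (you can also let the service handle ingestion).
search.upload_documents([
    {"id": "1", "category": "A", "content": "hello world",   "embedding": [0.1, 0.2, 0.3]},
    {"id": "2", "category": "B", "content": "goodbye world", "embedding": [0.9, 0.1, 0.0]},
    {"id": "3", "category": "A", "content": "hello again",   "embedding": [0.2, 0.2, 0.2]},
])

# Pure vector search: the two nearest neighbors of a reference vector.
vq = VectorizedQuery(vector=[0.1, 0.2, 0.3], k_nearest_neighbors=2, fields="embedding")
results = search.search(vector_queries=[vq])

# Hybrid: keyword and vector candidates are fused and ranked together, which is why
# the result count can exceed k_nearest_neighbors; `top` caps what is returned.
results = search.search(search_text="hello", vector_queries=[vq], top=3)

# Hybrid plus a filter; filters stay fast even on very large indexes.
results = search.search(search_text="hello", vector_queries=[vq],
                        filter="category eq 'A'", top=3)
for doc in results:
    print(doc["id"], doc["category"], doc["@search.score"])
```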
So those are the basics of a vector engine, but there are a few key elements to consider when you're actually building an application in production. Probably the most salient one is quality. In the end, your application works well for your users when they ask a question and get the answer they're looking for, and that is highly influenced by whether your retrieval system actually found the bits of information needed to produce that answer. In AI Search the way we do this is what most sophisticated search engines do: a two-stage retrieval system. The first stage is recall-oriented and uses vectors and keywords and all the recall-oriented tricks to produce as many candidates as we can find, and then the second stage re-ranks those candidates using a larger model. Say you have 100 million vectors in your database: you want to use something fast to go from 100 million to a small set, but once you have a small set, you can afford to run a larger, more sophisticated ranking model to produce better-quality ranking. That's what we do in the L2 stage, and when you turn it on you effectively get better results. There's a link at the bottom of the slide (happy to share it later) with more details on the numbers, but you can see, from the left: the first result is what you get when you just do keyword search using BM25, which is a well-known scoring approach; the second is vectors only, using OpenAI Ada embeddings; the third is fusion, combining vectors and keywords; and the fourth is fusion plus the re-ranking step. Across the board we see better results, just out of the box, when re-ranking is enabled.

Audience: The semantic re-ranking, is that another cosine-similarity-style thing, or is it something else?

Great question: is it a cosine similarity thing? No. Bi-encoders are useful because you can encode all your documents as vectors, and the query as a vector, and then evaluate them fast, because you're only comparing similarity; but that means at no point does the model see the document and the query at the same time. The re-ranker is what's often called a cross-encoder: a transformer model that you feed both the document and the query, and you ask it to predict a label for how well the document corresponds to the query. Cross-encoders are much better positioned to produce a high-quality result, but at query time you're running an inference step, so you can only do it over a smaller set; you couldn't do it over the entire dataset, it wouldn't be practical, at least not for interactive applications. Speed-wise, this is about 100 milliseconds, give or take, for a model like this; it depends on the cross-encoder you use. Our trade-off between making it fast enough and keeping it very high quality landed in the 100-millisecond ballpark, and we found that works well for interactive performance, because the majority of the latency ends up hidden in the LLM call, and you still get very high-quality results.

I won't drill into this, but the other dimension of getting quality out of the system is to narrow the dataset. If you know discrete metadata elements that can help you narrow the dataset, that's the most effective approach; then do all the ranking tricks on top of the resulting set. Scoping first is a very effective way to get quality up.

Audience: I have a question about the keyword part of the search.

If you look at this example: you give us keywords, you give us text, and then you give us a vector as well, or you just give us the text and we'll turn it into a vector too. We need the original text; the vector alone is not enough, because we can't go back from a vector to keywords.

Audience: No, I mean, if you have a RAG application, let's say a chat, how can you know ahead of time what keywords to add to the query?

Usually in a RAG app the question is which part of the conversation so far you want to use for search; is that the question?

Audience: No, I understand, but my question is... [partially inaudible]

Maybe we should talk after the session, because I'm not sure I understand the question. Usually, in the index you have the text and the vectors, and from the user's turn you extract a candidate search query; this happens before you send anything to the LLM. But let's chat right after the talk about the details.
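To make the bi-encoder versus cross-encoder distinction above concrete, here is a small sketch of second-stage re-ranking using an open-source cross-encoder from the sentence-transformers library. This is purely an illustration of the idea, not the ranking model AI Search actually runs; the model name, query, and candidate passages are example values.

```python
# The recall-oriented first stage would hand us a small candidate set; the cross-encoder
# then scores each (query, document) pair jointly, which a bi-encoder never does.
from sentence_transformers import CrossEncoder

query = "how do I rotate storage account keys?"
candidates = [
    "To rotate storage account keys, open the portal, select the account, and ...",
    "Key Vault can store secrets and rotate them on a schedule ...",
    "Blob storage access tiers control how data is billed ...",
]

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example public model
scores = reranker.predict([(query, doc) for doc in candidates])

# Keep the best-scoring candidates; only this small set pays the inference cost,
# which is why re-ranking runs after retrieval rather than over the whole corpus.
for score, doc in sorted(zip(scores, candidates), reverse=True)[:2]:
    print(f"{score:.3f}  {doc[:60]}")
```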
So, let me go back here. The other dimension of scaling is simply how many vectors and how much content you can fit in one of your indexes. One thing we learned last year and at the beginning of this year is that everybody went from having tiny indexes to putting all their data in these systems, so we accommodated that by significantly increasing the limits. For most search services you will get somewhere between 10 and 12 times the vector density on the same SKUs, and we didn't change prices or anything, just so you can build larger applications on the same setup. With these new limits you can build multi-billion-vector apps just by provisioning a service and uploading the data, which is surprisingly straightforward. A year ago a billion-vector dataset was a curiosity used for benchmarking, and now you can just create an index and upload it, which is very impressive to see. I'm not going to walk through this slide, I put it here for reference; these are the new limits, significantly higher than the ones we had before.

One of the things that has been exciting for us to watch grow is that among our customers is OpenAI themselves, who have a lot of RAG workloads: when you use files in ChatGPT, or when you use the Assistants API, you can create vector stores inside their system, and all of those are backed by AI Search. When we increased these limits, one of the things they did was increase the limits they give to their users by 500 times. It's been impressive to see how fast they grow, and it's been fun to see up close how a system this big can scale and still be fast and responsive.

Finally, sometimes higher limits are enough, and sometimes you want to push even more data in. The other thing we've been working on is quantization, where you use narrower types: instead of full floats you can use ints, like int8, which simply use less space at the trade-off of a little quality. Interestingly, you can even do single-bit quantization. I confess that when people said "hey, we're going to run the metrics for single bit," I felt they were wasting their time, but it actually works, and it works surprisingly well: our evaluations show that, for some models, you still get low-to-mid-90s percent of the original performance. Other companies have seen the same thing; for example, this is an evaluation from Cohere, a separate company, and they also see that about 95 percent of the original precision is preserved when using bit encoding. And you can get the remaining precision back by re-ranking when you're done.
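To show why single-bit quantization plus re-ranking holds up, here is a small NumPy sketch of the idea: binarize the vectors, search with Hamming distance, oversample, then rescore the survivors against the full-precision vectors. The corpus size, dimensions, and oversampling factor are made up for illustration; this is the concept, not what the service does internally.

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(10_000, 768)).astype(np.float32)   # full-precision corpus
query = rng.normal(size=768).astype(np.float32)

# Single-bit quantization: keep only the sign of each dimension, packed into bytes.
doc_bits = np.packbits(docs > 0, axis=1)                    # 768 floats -> 96 bytes per doc
query_bits = np.packbits(query > 0)

# Hamming distance is a popcount over XOR of the packed bits: far cheaper than
# cosine similarity over float32 vectors.
hamming = np.unpackbits(doc_bits ^ query_bits, axis=1).sum(axis=1)

# Oversample at the compressed tier...
k, oversample = 10, 4
candidates = np.argsort(hamming)[: k * oversample]

# ...then rescore only those candidates against the full-precision vectors.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

top_k = sorted(candidates, key=lambda i: -cosine(docs[i], query))[:k]
print(top_k)
```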
It's surprising that it works, but going from float32 to a single bit is 32 times the vector density, and it's faster, because you're computing a Hamming distance over a small number of bits instead of cosine similarity over a much wider type. Because of this, we now support all these narrower types in AI Search as well. I'm going to skip this slide in the interest of time. You can do quantization yourself, or you can just enable it and we will do the quantization for you. If we do it for you, we also store the original full-precision data, which means we can do oversampling: we query the quantized, compressed version of the vectors, but we have the full precision stashed on the side, so later we can re-rank at full precision. So you can effectively choose between a highly compressed index that gives up a little quality, a compressed index that recovers quality through oversampling and re-ranking at the cost of a little speed, or an uncompressed index where quality stays at its best. It's up to you what you want to prioritize; we give you control over all three dimensions.

The last thing I wanted to touch on is the other challenge: you keep adding data sources to these RAG systems, and each of them you connect to differently and enumerate changes in differently, and that's just not where you want to spend your time. So we have an ingestion system, including integrated vectorization, as part of AI Search: if the data is in Azure, whether it's Blob Storage, OneLake, or Cosmos DB, we'll connect to it and deal with all the security, and we'll automatically track changes. It's not a one-shot thing: as the data changes, we pick up only the changes and only process the changes, so the cost is also incremental; you don't pay for the entire set every time you update something. We deal with all the file formats (PDFs, Office documents, images), unpack nested formats, do chunking, do vectorization, and land it on an index, all in one go. It's an industrial-strength pipeline that you set up once and that keeps running after that, reflecting changes as your data changes. So you can focus on the RAG stack, the workflow, how you query the system, and you don't have to think about how the data makes it there: if the data is anywhere in Azure, we'll index it and create an index that follows the original data automatically.

All right, and with that, I know I raced through this content in these 20 minutes. I'll be hanging out outside if anybody wants to chat or has questions, and I would encourage you to go try AI Search today; here's a link to the starting point. Azure subscriptions include a free instance of AI Search, so you can give it a shot in a minute without having to pay for anything. With that, thank you.