welcome everybody thanks for showing up bright and early this morning we're gonna make your life easier for RAG so just hang in there if you want to get started we have the link right there contextual.ai/aie25 that's gonna get you a notebook it's gonna get you the getting started link which will get you an API key that you'll be able to use for the notebook so feel free to jump ahead and start on those pieces like that all right so let me first introduce the team that's here so I'm Rajiv Shah I'm the chief evangelist of contextual AI so I do a lot of these talks workshops I endlessly talk about AI which is kind of why I'm in this position for background I've been in the AI space for a while before this I was over at hugging face for a number of years as well so besides me talking we also have Nina you want to introduce yourself this is my fourth week at contextual so very exciting for me thank you all for joining us this morning I've been working in NLP and language modeling for the past seven or eight years really since BERT was considered a large language model and I'm excited to show you how you can simplify your RAG pipeline with contextual today thanks and besides Nina we also have two other members of our team that are in the back that if you raise your hand and have questions they will come to you so one of them is Matthew who's a platform engineer so if you have questions about how we scale up or what's going on in the back end he's the person for you we also have John on our team who's one of the solution architects so if you're like hey how would I integrate this into my environment he's the guy who's gonna be able to handle that as well so let's kick this off what I want to do is for you all to get one of the big pieces out of this which is we can treat RAG like a managed service we don't go out anymore and train our own large language models if we're working with embeddings we don't build our own vector databases same thing with RAG we can treat RAG just like any other managed service and so what we're gonna do today is I'm gonna do this quick introduction here but then we're gonna get you started right away with building this RAG agent with ingesting some files and Nina and I will go back and forth where I'll go and give you a bit more of an overview of contextual you'll build an agent run a number of queries against it get a feel for it and then I'll do a deep dive on some of the advanced settings that developers like around extraction re-ranking retrieval like that and we'll end with a little bit on evaluation as well as how you can use MCP to for example connect your RAG agent to Claude desktop as well now to start with we all know the value of AI right we're all here because of the importance of AI but many of you also know the struggles of AI how it's easy to build that demo with 10 documents when you're doing RAG but when you go to scale that out all of a sudden it's hard to extract across thousands of diverse documents right your accuracy isn't there your users are complaining that they aren't happy with it because they don't know how to query properly now we at contextual have focused on this so our founders here Douwe and Aman come from working on rag for a long time Douwe was an author and was leading the team on the initial rag paper Aman was with him at meta i knew both of them from my time at hugging face where i was helping people do rag years ago on question answer systems and i would be able to pull their researchers in to help
work with customers as i was cobbling together kind of open source components to build these pipelines well they saw the need and they wanted to change how the world works and so contextual has come around focusing on the enterprise and we still have this dedication to ai research and you'll see that as we go through our product as well now again just the basics just in case there's somebody in here not familiar with it right when we're talking about rag retrieval augmented generation the value here is when you have lots of that unstructured enterprise data we want to be able to understand it so rag allows us to take that information and a very simple rag pipeline uses a vector database to keep all of that uses cosine similarity to find similar pieces and passes it to an llm that's the simplest we're going to get much more complicated today like that and what i want to do is show off our platform to you the platform is built for a couple of different levels so one is there's a no code level so if you're a business user you just have a bunch of docs you want to be able to ask questions of them we've made it very simple with opinionated defaults on that now if you're a developer on the other hand you know what you want in your rag pipeline you're spending time evaluating it you're spending time tweaking it well we've built a platform where you can then orchestrate and change how do i want to do querying how do i want to do generation along the way and finally some of you might already have rag pipelines in production where there's just one component that isn't working that well maybe your extraction isn't working that well or you want a better re-ranker well we've made our system modular so if you just want to use one piece of it you can and all we focus on is rag other companies focus on other pieces we just think about building rag and again the reason we're here is as you go and try to do this in production you'll see like once you get to lots of documents it gets messy all of a sudden then you're orchestrating a big environment full of lots of different models now i need a bm25 i need a re-ranker putting all that stuff together ends up sucking up your time it's fun to do the first time but after that it ends up sucking up your time and ends up with slow kind of costly rag pieces so that's enough to get started i'm going to turn it over to nina who's going to get you started on building that first agent all right we're on the clock 15 minutes to get our agent up and running just kidding we'll take a little longer than that to explain everything but how you can get started is you can find the notebook and the getting started page at contextual.ai slash aie25 and then the gui for contextual is at app.contextual.ai that's what that getting started button will point you to and so we're going to start by loading a few documents we're going to load some financial statements from nvidia and some fun spurious correlations to see how our rag system treats data that's not fitting with conventional wisdom and then we'll try out a few queries including quantitative reasoning over tables and some data interpretation and so i'm going to have these two screens here side by side the notebook on the left and the platform on the right so you can get started by signing up in the platform if you haven't already that's at app.contextual.ai that's where you're going to get your api key so everyone take maybe a minute or so you'll need to log into that and you'll need
to set up your workspace i already have my workspace set up and i'm actually not using my work account just so that you'll see exactly what a general access user would see on their screen and just a tip for setting up those workspace names it's going to have to be a unique name so you can't call it aie-demo i just throw my name into the names just to make it a little bit more unique so everyone just take a minute to do that and to help me get a sense of where folks are just briefly raise your hand and put it down once you have your workspace set up and you see this welcome screen okay folks are making fast progress yes matthew so yeah so we can start in the notebook by doing our pip installs oh and start by making a copy of the notebook so you can just go to file save a copy in drive and that will create a copy of the notebook for you so that you can save whatever changes you're making to it if you want to make any changes it is a bit slow making a copy but yeah after the pip installs and the imports to get your api key you go into your contextual app and click on api keys here on the right so in our contextual platform on the right you can go to api keys click on create api key so nina key 2 is my name and then we'll copy that and then we can go to our notebook secrets on the left here add a new secret paste the key there and you'll just name it contextual underscore api underscore key yeah it's refreshing but if yours is working then you can keep going i'll just do it in the notebook without making a copy so that's what we're going to do and if you want to do rag on your own it's often an api scavenger hunt just to get started and this is literally the only api key we're going to use for the whole rest of the workshop so we have just finished the hardest part of setting this up so once we've saved those credentials then we're going to set up our client here so we're just going to run the client in our notebook so here i'm on the left in the notebook and then we're going to create our data store so you can change the name on this data store name variable and then it's going to see if there's an existing data store or load that one and then after that which seems to still not be running okay yeah after the pip installs and the imports that data store is going to be where you're going to be loading all of your documents and then we have some code here in step two that will download those files that i mentioned earlier to your notebook so that's going to be a few quarterly reports and spurious correlation reports and so here in step two you'll see that it's fetching the data and then uploading it to the data store and then you can go to your app.contextual.ai and this is the aie demo data store that i made just here so if you click on your documents you'll see that they're processing and once they're done processing you can click on these three dots and then once the files are loaded you'll be able to inspect them and see how those files have been parsed and ingested so i'll just take a quick look at the run i did yesterday of all these same exact files
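for anyone reconstructing that setup outside the workshop notebook, a minimal sketch of the client, data store, and ingestion steps nina just described might look like the following — the method names mirror contextual's python sdk as best as they can be reconstructed here, so treat things like datastores.documents.ingest as assumptions and check docs.contextual.ai if anything differs, and the file name is just a placeholder:

```python
# minimal sketch of the setup steps above, assuming the contextual python sdk
# (pip install contextual-client); method names follow the workshop notebook as
# best as they can be reconstructed here -- check docs.contextual.ai if they differ
from contextual import ContextualAI

client = ContextualAI(api_key="YOUR_CONTEXTUAL_API_KEY")  # key created at app.contextual.ai

# create a datastore to hold the documents for this agent
datastore = client.datastores.create(name="aie-demo-datastore")

# upload a local file; parsing and ingestion run asynchronously on the platform side
with open("nvidia_quarterly_report.pdf", "rb") as f:       # placeholder file name
    ingestion_job = client.datastores.documents.ingest(datastore.id, file=f)

# check ingestion status before querying (response field names are an assumption)
documents = client.datastores.documents.list(datastore.id)
print(documents)
```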
and so if you click on inspect once your files are loaded for example if you have a table you're going to see the raw text and then also the rendered preview so you can see this table it has both the structure and the contents of the table in this image below and you're welcome to take a look at these the numbers match up perfectly with the structure so that's kind of step one of getting your rag system set up is having really accurate parsing and you can look at some of the image files we chose some spurious correlations which are a fun way to test how much a rag system will actually answer based on your own documents rather than its conventional wisdom and so you can see there also there will be raw text and rendered previews with like key data points from figures and summaries of the figures in addition to the text and other information extracted so while your files are ingesting raj will come back and expand on the platform all right it seems to me i need to update this slide with wi-fi issues as well but any of you who've built rag systems before know that there's several components of it there's a lot of tasks you have to do right you have to think about building out an extraction pipeline besides that when you're thinking about accuracy what am i going to do for chunking what am i going to do for re-ranking and then finally you have to think about scaling it there's a lot of these issues that come up and again it's interesting the first time you do it but maintaining this stuff after a while gets tiresome and that's where contextual comes in where it's really an end-to-end platform for managing rag and our platform runs in our sas it can run in your vpc we have a ui for those business users but we also have an api rest api endpoints as well and within the platform we manage all those pieces for you so you don't have to worry about maintaining a vector store for your embeddings and the way it works is you'll come with your structured and unstructured data we bring that in we have a document understanding pipeline and that's the extraction part that we're going to walk through here where we need to cleanly be able to pull all that information out of your documents your tables your images once we have that we chunk that up and then we go through the best practices around retrieving so we have a mixture of retrievers so bm25 as well as an embedding model we have a state-of-the-art re-ranker that we've trained ourselves that's in our pipeline and then we pass it all to a grounded language model now we don't use a model from openai or gemini we trained our own grounded language model because for rag use cases we want the model to be grounded we want it not to give its own advice just because it can pass the law exam or know everything about medicine we don't want it to use its knowledge when it's answering questions we want it to instead stay grounded and respect the context that we're giving like that and finally there's lots of different ways we can output this and i'll show you this along the way as well now for each of these parts we have that academic pedigree as i mentioned earlier so one thing we always do is look at the academic benchmarks nowadays we have lots of customer data points that we use as well but whether we think about the end-to-end accuracy of our platform or the specific parts around document understanding retrieval grounded generation each one of those we focused on to make them state-of-the-art for what they do like that so once we've done that when you go to use them there's
a lot of different uses that you can have for this i think often when we first think of rag we think about i'm going to build a question and answer chatbot right like i'm going to ask a question get back an answer well yes that's one use of what you can do with this but we have customers like qualcomm who use contextual on their website so if you go live to their website right now and you ask a question there it's powered by contextual on the back end and you can see similarly the questions the answers right our feedback modules sources are all available there besides that we have other customers for example some folks in the financial sector and they're thinking about automated workflows how can i take all the unstructured data i have and structure it to make it useful for folks well we have apis you can hook up those apis and be able to run them against here is it good ah okay no worries but we can take all that unstructured data and use our rag agents in the context of something like a spreadsheet or some other workflow to be able to use that finally we can connect it with lots of other tools i'm going to put this in there even though i know it's not going to work but just for all of you we can show you and i'm going to show you at the end here today how you can for example integrate the rag agents with something like claude with mcp so that way you can take advantage of the enterprise knowledge that you have in there all right so with that i'm going to hand it back over to nina and let's get you started on your agents looks like the wi-fi might be down it looks like the speaker wi-fi and my hot spot are not working so we'll just continue with the notebook and we'll talk through it and then i guess is anyone else's wi-fi working raise your hand if you have wi-fi okay well you guys can follow along you have the videos too right mm-hmm i do have oh oh it's back up okay cool so yeah where we were in the notebook on the left here is hopefully all of your data stores have loaded now yes so assuming you had wi-fi to upload them these are the ones i just loaded in the workshop earlier and then in the notebook on the left you can see that you can access these files through the api just the same as you can through the gui and then you can have the document metadata like when it was created the name the status of the ingestion and now we're going to create our agent so over here we have our system prompt this is kind of just our default system prompt that the agents are loaded with and then you can run this next block of code here on the left to set up your agent you can change the name if you'd like i called it demo-aie and then on the right in our gui if you click on agents there it is there's my new agent right here that i just created and then you can put in your question based on these documents so for example one question that we can ask is what was nvidia's annual revenue by fiscal year 2022 to 2025 and so what you'll notice here is we have our responses what we were provided was quarterly data so what the agent is doing is adding that up and then listing it here on the right and then noting that this is based on quarterly data and then you'll see these little numbers in a circle the two and the one so if you click on the two here you'll actually see the image that this data came from and similarly if you click on one you'll see the other image so interestingly this came from two separate files that were unrelated in any way
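a rough sketch of those two steps — creating the agent over that data store and asking the revenue question through the api — might look like this; the argument and field names are reconstructed from the workshop notebook rather than quoted from it, so verify them against docs.contextual.ai:

```python
# sketch of creating the agent over the datastore and asking the revenue question
# via the api; argument and field names below are assumptions based on the
# workshop notebook, so verify them against docs.contextual.ai
agent = client.agents.create(
    name="demo-aie",
    description="Answers questions over NVIDIA financials and spurious-correlation reports",
    datastore_ids=[datastore.id],
)

query_result = client.agents.query.create(
    agent_id=agent.id,
    messages=[{
        "role": "user",
        "content": "What was NVIDIA's annual revenue by fiscal year, 2022 to 2025?",
    }],
)

print(query_result.message.content)        # the grounded answer shown in the gui
# the response also carries the retrieved chunks / attributions behind the numbered
# citations; the exact attribute name (e.g. retrieval_contents) is assumed here
print(query_result.retrieval_contents)
```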
they actually don't even have like a standard naming convention for these i pulled these from the nvidia financial reports website and it knew to reference those documents to pull that information across documents and then you can do the same exact thing that we just did in the gui programmatically so here in step four we're running our query and we're getting the exact same response in the api so this can make it easier to integrate it into your application and so we can also here visualize which files were used as references in the api and so now we're printing the documents here the base64 encoding is getting returned with another api query so just because it's a more fun and visual demo we'll ask the next few questions via the gui here so nvidia used to be a gaming compute company so i'm gonna ask the chatbot when did nvidia's data center revenue overtake gaming revenue and so the crossover point was in q1 fiscal year 23 with data center revenue at 3.7 billion and gaming revenue at 3.6 billion and then here i can click on the one here so that's q1 fiscal year 23 and we can look at our slide here and here it is that's actually like the exact crossover point where data center revenue overtook gaming revenue now we're going to look at the spurious correlation files that we've also loaded so for those of you unfamiliar this is a website that basically data crawls and finds correlations between things that have no correlation so we're searching for what's the correlation between the distance between neptune and the sun and burglary rates in the us that would be some extreme astrology there if that actually determined burglary rates and so what you'll find is that our rag system which is really focused on avoiding hallucinations and really following the data that you've loaded will give you that correlation coefficient which is very high but then it will also share that context so elsewhere in the document they talk about how they're doing data dredging and that this is not really a valid statistical correlation and so you can then also reference where in that document these claims are made both the statistical correlation and the caveats which are not loading i think from wi-fi issues but you can test this out in your own instance as well we have another fun one that was what's the correlation between global revenue generated by unilever group and google searches for lost my wallet that's another spurious correlation that we loaded and it will do the same thing it will give you the actual answer and then note the caveats and i'll note that if you run these same queries in chatgpt here on the right if you just ask the question without any files not doing rag it will just start by telling you there's no meaningful correlation we know that's true but think about if your own documents don't follow conventional wisdom or have some information that's not out there it's going to argue with the information in your documents rather than presenting it even though it may not actually be true in this case and then if you do a document upload and use the long context in chatgpt you will get that response of the actual correlation but then it will just say it's a spurious correlation and it won't really go into why so a simple system like this is not really going to hew to the facts that you've presented it with and for a fun question we just searched about
global revenue from unilever group and google searches for lost my wallet does this imply that unilever group's revenue is derived from lost wallets and it will answer no that's not true which we know and then if we ask another query that relates to different documents that don't have any information shared between them now we're looking at the correlation between the distance between neptune and the sun and global revenue generated by unilever group and it will not answer that question it will just say it does not have that information and so you know these are all pretty straightforward questions and answers that we set up with our default settings but if you go to your agent and you click on edit in the admin panel there's actually a lot that you can change if you know for example that last question i asked if there was data that supported it and it wasn't found or something you can check which data stores are linked so you can link multiple data stores you can adjust the system prompt and that is also something you can do via api that we'll go over briefly later there are some settings on query understanding so we disabled multi-turn in our setup just to have everyone have the same consistent responses and to make the workshop easier to follow you can set up query expansion or decomposition you can set some retrieval settings settings for the re-ranker settings for the filter including the prompt generation settings and you can even set up a user experience where you have suggested queries for the user and so going back to our notebook here what we have next in the notebook is going to be examples of code for some of the components so these are the individual components that make up the full rag system we won't run through this in the workshop i'm just going to share what we have here before raj does a deeper dive into what these components can do so for each component we've shared a link to a more complete notebook that lets you use more features of that component we have our benchmarks that compare it to other solutions and then we have a little bit of example code that will run in the notebook to try out these components alone and you don't need to set up anything new it's the same api key that we set up before so everything will just run i will just say you want to let your file process before you display it but in this example we're just printing the first page of the attention is all you need paper we have our re-ranker which is the world's first instruction following re-ranker and that's also a standalone sub-component that you can use independently of the full platform we have our generate model that is according to the facts grounding benchmark the most grounded language model in the world that's available as a standalone component and here's some sample code for that and then after raj's deep dive on the platform and components we'll go over the lm unit the natural language unit testing model available as an endpoint that we will use for our evaluation so yeah back to raj for the next portion of the workshop
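for reference, a minimal sketch of that first standalone component — submitting a pdf to the parse endpoint, letting it process, and reading back the first page — is below; the submit-then-poll shape and names like parse.create, job_status, and job_results are reconstructed from the component notebook and docs rather than quoted from them, so double-check the parse api reference before relying on them:

```python
# standalone parse component: submit a pdf, wait for it to process, read the results;
# the submit-then-poll shape and the parameter/attribute names below are assumptions
import time

with open("attention_is_all_you_need.pdf", "rb") as f:
    parse_job = client.parse.create(raw_file=f, parse_mode="standard")

# let the file process before displaying anything, as nina notes above
while client.parse.job_status(parse_job.job_id).status not in ("completed", "failed"):
    time.sleep(5)

results = client.parse.job_results(parse_job.job_id, output_types=["markdown-per-page"])
print(results.pages[0].markdown)   # first page of the paper (attribute names assumed)
```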
all right yep all right i just need this one all right if anybody knows me i'd love to kind of talk and get into the technical pieces and what's going on a little bit deeper so this is useful for learning about contextual but even if you're building your own rag pipelines all these components come into play so to start with how many people have had issues with hallucinations in their rag pipelines it's just been a few have people not had issues okay yeah i think everybody finds this now when we start thinking about hallucinations there's a couple of ways we can think about how to prevent them the first is making sure we retrieve clean information if we don't do a great job at extraction if we don't do a good job at retrieval well then we're not going to give good information to that generator model and the generator model will then try to answer it itself leading to hallucinations so that's one big piece the next is that language model itself and we'll talk about how we can ground that language model so it refers back to the context that it was given rather than substituting its own judgment and then i want to talk about having some checks in place because even if you do that you still want your end users to be able to trust the system so i'll talk about groundedness checks as well as you've already seen it inside the platform the bounding boxes for attribution as well so to start with when you first build your first rag project right you need to extract from some pdfs maybe you've used like an open source pdf extraction tool have folks used like open source extraction tools like pdfplumber things like that not that many have people been happy with them in terms of how they've worked for tables and charts yeah the document on the left is the easy one right that's what you show in the demos but your folks are going to show up with complex tables multimodal and that's when it gets much trickier because when you get to these types of documents if you have a parsing error in that table and things get shifted over right that's going to cause problems that you can't fix later on if your vision language model hallucinates and adds some other information in there when it's reading that multimodal chart right you're stuck with that right like that's going to lead to downstream problems so making sure you have very good extraction is like top of the list when you get to complex documents and so this is how i like to think about kind of what we're doing at contextual right now it's an evolving system we update like every couple weeks the engineers probably don't exactly like how i've diagrammed it like this but at least for me the mental model works where say you have a pdf document the first thing we're going to do is add some metadata around it right like when was it created what's the file name once we have that then you want to do a layout analysis is this a document that's image only that we need to do an ocr for does it have images maybe technical descriptions multimodal charts graphs that we need to add image captioning where we're going to get a text description of those features if you have tables we have a special table extraction mode that focuses on getting high quality results out of tables now beyond that the structure of the document also carries meaningful information right what are the section headers what are the different subsections in there using that information can make your extraction go a lot better now once we have that we create a nice markdown kind of json version of that i'll show you that in the platform here but then for rag use cases we're going to take those documents and we're going to chunk them down into smaller pieces to work with right because llms as much as the long context is growing you can't take some of these large documents and chunk them in and especially
there's the overall compute cost of trying to use those long contexts fully so we create chunks we use some of that metadata that we've created and inject that in like knowing the hierarchical structure of that along the way we set bounding boxes this is one reason when we do image captioning for example we don't want to take the entire whole page and do that we like to know where the images are so we can do bounding boxes so the user knows when we say sales went up by you know 10 percent they know where on the page exactly it was so within the platform and i know you can't get to it right now we have the components you can see on the left side let me see if i can kind of right there all the individual parts of our platform you can work with individually and we have a playground for that so this is the parse piece here where you can just take a pdf document decide how you want the extraction i just want text or hey i want the full works where i want image captioning as well or i have long tables i want the table extraction mode and you can do that right inside the ui and see what the results look like now of course for developers we also have an api python sdks javascript sdks as well so you can do this programmatically but that's the first step the next piece here is query reformulation and so here depending on if you're doing multi-turn we need to take into account the previous conversation query expansion maybe the types of users are doing lots of abbreviations inside your company so we need to take that query and make it a little richer and fuller before we pass that in to get good results or on the other hand maybe your folks are putting in long complex queries where what we want to do is take that query break it up into smaller sub queries and answer each of those so this is where you have the flexibility you can turn these knobs on and off as you want as nina was showing you in the edit menu now once you've gone through that process that long query that you have might break into three sub queries each of those reformulated queries then passes through your semantic search and lexical search so hybrid search kind of classic best practices there and then we give you the option of using what we call a data store filter because maybe your query is asking what's the revenue of apple computer and your data store has apple microsoft tesla all those documents well if you already know they're interested in the apple company why not just filter out all the other documents so you don't have to worry about that that's where that data store filter comes into play now typically for the retriever our defaults are something like a hundred chunks that will come back retrieval's fast you can do that but then to get better accuracy we like to use a re-ranker and again this is best practices around rag we've trained our own instruction following re-ranker and that helps you go from that 80 or 100 chunks down to the default is 15 but again you can change it for your application and give you let's say 15 high quality results that you can then pass all the way through to the end we have another filter stage if you want to for example drop out some of those as well so all of these pieces are orchestrated with lots of different machine learning models like this so this is where it's like a pain to build this for yourself but we've built this like this
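if you only want that re-ranking step on its own, a sketch of the standalone rerank call — including the kind of natural-language instruction raj describes next — could look like the following; the model string and argument names here are assumptions based on contextual's public docs, and the two document snippets are made-up stand-ins for chunks your own retriever returned:

```python
# the instruction-following re-ranker as a standalone call; the model string and
# argument names are assumptions from the public docs, and the two documents are
# illustrative stand-ins for chunks your retriever returned
rerank = client.rerank.create(
    model="ctxl-rerank-en-v1-instruct",                    # assumed model name
    query="What is Apple's total revenue?",
    instruction="Prioritize the most recently filed documents",
    documents=[
        "Apple 10-K FY2021: total net sales were $365.8 billion ...",
        "Apple 10-K FY2024: total net sales were $391.0 billion ...",
    ],
)

# results come back as (index, relevance score) pairs over the input documents
for r in rerank.results:
    print(r.index, r.relevance_score)
```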
now if i had the time i'd go into all those pieces in depth but i want to highlight a couple of things i like so our re-ranker i think is interesting where you have the ability to provide it instructions so a common one might be for example if you're asking questions about apple and you want to get the most recent documents so you can give it a prompt and then when it re-ranks those it's going to look at the contents of the chunks and use that prompt to help re-rank it so a lot of possibilities kind of with this and we're kind of excited to have this piece there now the last step is generation so we've retrieved everything back for the query next we're taking it to our grounded large language model again the notion here is we want a large language model that's been specifically fine-tuned to respect the knowledge that it's given not give its own knowledge like that besides that we have the two pieces that i've talked about earlier the groundedness and the attributions you've already seen the attributions and i'll talk a little bit more kind of about groundedness now the grounded language model since we trained it ourselves we got to kind of do some different things with it so one thing we've done is the language model will tell you the difference between facts and commentary so when you have some text it will look at that answer and see hey these are the important facts in here the rest is just other commentary so this is where when you're working with this if you want a super grounded model like i just want the facts in my rag system i don't want any of the superfluous stuff you can turn the setting on to eliminate the commentary and just focus on the facts for example so this is where kind of training our own gives us flexibility to do this kind of thing the second thing is the groundedness check we have so the groundedness check works directly in our ui you can also use this through the api where for every response that comes back what we do is we look sentence by sentence essentially and see is the claim that's made found in the documents that were given so here if you look you'll see the bottom sentence is in yellow and it's in yellow because the claim it's making is a totally bs claim that it just made up right it's not actually in the source documents i literally prompted it to make up something and so here we'll automatically highlight that in yellow in the ui so if you have users that you're worried about like they keep falling for these hallucinations this is kind of a helper feature inside there and the way it works is when we retrieve all the context and again you have access to all the chunks all the things that are retrieved that answer is going to be generated but then we're going to decompose that answer into specific claims so we have a model that just does this decomposition and looks and sees is this claim grounded back in that context or not so we can run this live at query time and be able to get those groundedness scores and show you that with the response as well
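as a rough sketch of what that grounded generation step looks like when you call it as a standalone component, something like the following — the model name, the knowledge list, and the avoid_commentary flag (the facts-versus-commentary setting just described) are all reconstructed from the public docs and may differ slightly, and the two knowledge strings are just illustrative chunks pulled from the numbers discussed earlier:

```python
# the grounded language model as a standalone generate call; `knowledge` carries the
# retrieved chunks and `avoid_commentary` maps to the facts-vs-commentary setting
# described above -- the model name and both parameter names are assumptions
glm_response = client.generate.create(
    model="v1",                                            # assumed GLM model name
    messages=[{
        "role": "user",
        "content": "When did NVIDIA's data center revenue overtake gaming revenue?",
    }],
    knowledge=[
        "Q1 FY23: data center revenue $3.75B, gaming revenue $3.62B",
        "Q4 FY22: data center revenue $3.26B, gaming revenue $3.42B",
    ],
    avoid_commentary=True,     # just the facts, no extra commentary
)
print(glm_response.response)
```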
all right nina has got more on evaluation like that so for our last step of setting up our rag agent we will get to the evaluation step so we're going to use our lm unit model for natural language unit tests which may be a little bit different from how you've been doing evaluation of your rag system before so we have a fine-tuned model as a judge and that has state-of-the-art performance on flask and biggen bench and the way that it works as we'll get into in the notebook shortly is that you'll put in the prompt and get the model response as your rag system would generate and then we're going to create unit tests for the prompt that will ask specific questions that you want tested about those responses then we're going to use the lm unit model to evaluate these unit tests and then we can look at those scores so we'll be fully in the notebook now and i've just given up on the wi-fi now i'm just going to go through what i ran earlier so we can just look at the results if you had internet you could run this all on your own so in this section we have commented out the code that we used to generate the data set so that has our set of questions about the data set six questions and then this is the section that will generate those results from those queries so to save some trees we just ran this ahead of time and saved it in our github so you could load it directly and that's at this file here we have this eval input csv and that just has prompts and responses for six queries from that same agent that we set up earlier and the unit tests we're going to ask i thought these would be interesting ones for this document set and the queries does the response accurately extract specific numerical data from the documents does the agent properly distinguish between correlation and causation that could be useful if you have some sort of statistical analysis in your documents are multi-document comparisons performed correctly with accurate calculations are potential limitations or uncertainties in the data clearly acknowledged are quantitative claims properly supported with specific evidence from the source documents and does the response avoid unnecessary information we know that llms like to blab on and on in their responses so this is a good unit test i think overall some of the other ones you can set based on your own documents if you want to try this out later and so we just set up our unit tests here and then the lm unit model will score on a one to five scale and it's been fine-tuned for this specific task and so then the next block of code that you would run would be this response generation so we put in our query what was nvidia's data center revenue in q4 fiscal year 25 and then you get your response and that does have the correct amount and the citation but then it has a lot of other information and so the unit test we used was does the response avoid unnecessary information and the score was 2.2 out of five so for example if this was an important criterion for us then we could later update the system prompt or change some of those other settings to specifically target the results from this unit test of course we don't want to operate from an n of one so we're now testing this batch of six which of course again is a small batch this was meant to run in real time if we had internet and this would just run through all of those six queries and all of those six unit tests to get the results then this line here run unit tests with progress will give you the results from those unit tests and so here we can look at those results we can see the prompt the response and then we can see the scores here for each of our six unit tests and those are going to be from one to five and then we can also map those long sentences to just a one word category so accuracy causation synthesis limitations evidence and relevance you can set these as you'd like
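a compact sketch of that scoring loop with the lmunit endpoint is below — the create(query, response, unit_test) call shape follows the lm unit notebook linked earlier as best as it can be reconstructed, the response variable reuses the agent answer from the earlier sketch, and the agents.update call in the comment is likewise an assumed method name for the "update the system prompt" step nina mentions next:

```python
# scoring a response against natural-language unit tests with lmunit; the
# create(query, response, unit_test) shape follows the lm unit notebook, and the
# agents.update call at the end is an assumed method name for the follow-up step
unit_tests = [
    "Does the response avoid unnecessary information?",
    "Does the response properly distinguish between correlation and causation?",
]

query = "What was NVIDIA's data center revenue in Q4 fiscal year 25?"
response = query_result.message.content        # reuse the agent answer from earlier

for test in unit_tests:
    result = client.lmunit.create(query=query, response=response, unit_test=test)
    print(f"{result.score:.1f}  {test}")       # scores are on a 1-5 scale

# if a test scores low, one lever is tightening the agent's system prompt:
# client.agents.update(agent_id=agent.id, system_prompt="... keep answers brief ...")
```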
and we're going to create polar plots that show us for example for our second question what is the correlation between neptune's distance from the sun and u.s. burglary rates we can see in the polar plot that the causation was very well addressed by our model the accuracy and limitations also scored high on the unit test but the synthesis was not as highly scored and these other factors were not as highly scored so we can look at all six and we can kind of see some of these queries scored really well across all our unit tests others may point to areas to improve we linked earlier to the full lm unit notebook where you can look at these categories and cluster them so for a larger scale evaluation data set you could actually look for meaningful insights from these unit tests and so your first homework if you've made it this far is to improve the agent based on the unit test results and so one way that you could do that in the api would be here to update the system prompt so i have our original one and you could you know change something here like hey keep the responses really brief only answer the direct question and then you can run this line of code here to update the agent and yeah that's our end-to-end pipeline and now for a bonus raj is going to show you how you can connect your agent to claude desktop via mcp and for that we have links to our github here and the youtube how-to if you want to try this out yourself okay yeah i'm dealing with the challenges of the wi-fi and working around that so let's talk about this and see if we can get this running so lucky for all of you i spend way too much time making videos so i have a copy of the video integration that i want to show so we built a rag agent you can ask questions right you want to be able to use it one of the most common ways of being able to use it right that was introduced like six months ago here at ai engineer was the mcp server so one of the things i like to do is kind of show people how i can connect the rag agents that we built in contextual and you can use them inside of other mcp clients and so here's an example of using it inside of claude desktop where now when it has this particular topic it's going out to the contextual rag agent to be able to get the answer to do that so you can do this working inside of clients like claude you can also be able to do this for example if you're working with code and you want to be able to do this in something like cursor that works as well so to do this if you're like hey i want to do this i want to be able to use this rag agent everywhere else to do that we have a repo that's out there so the contextual mcp server the link is in the notebook as well so you can grab that it's a neat very basic mcp server you could probably build your own let us know like that but the workflow is you clone that repo and then inside that repo there's a server.py file and we're just using our contextual apis and just pointing to that so the important thing of course is that doc string where you kind of describe what your rag agent is about because that's what your client whether it's claude or cursor is going to be able to use so to make this a little bit more concrete here's a copy for example inside of mine where you can see in my local environment i have that mcp server locally hosted this is what my server.py file looks like where you can see i've got two different ones set up one for technical queries one for financial queries depending on who i demo to and it can have the client then pull into that piece there so that's the server that's sitting in the middle
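the real server.py lives in the contextual mcp server repo linked in the notebook, but a pared-down sketch of the same idea — one tool wrapping an agent query, with the doc string doing the describing — might look like this using the official mcp python sdk's FastMCP helper; the agent id is a placeholder and the assumption that the client picks up CONTEXTUAL_API_KEY from the environment should be verified against the sdk docs:

```python
# pared-down sketch of a server.py in the spirit of the contextual mcp server repo,
# using the official mcp python sdk's FastMCP helper; the agent id is a placeholder
# and reading CONTEXTUAL_API_KEY from the environment is an assumption
from mcp.server.fastmcp import FastMCP
from contextual import ContextualAI

mcp = FastMCP("contextual-rag")
client = ContextualAI()                  # assumes CONTEXTUAL_API_KEY is set in the env
AGENT_ID = "your-agent-id-here"          # placeholder for the agent you created

@mcp.tool()
def query_financial_docs(question: str) -> str:
    """Answer questions about NVIDIA financial filings using the Contextual RAG agent."""
    result = client.agents.query.create(
        agent_id=AGENT_ID,
        messages=[{"role": "user", "content": question}],
    )
    return result.message.content

if __name__ == "__main__":
    mcp.run(transport="stdio")           # claude desktop / cursor launch this over stdio
```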
the next piece is configuring it with your client every client's a little bit different like that but you just want to be able to have that client point to where that server file is so if we look for example at the case of claude desktop which is not running now but luckily we can still do this claude desktop has a developer piece where you can edit the config for where it should look for mcp servers and if i click here edit config i can go see the file that's there we can open that up we're not going to open it up in cursor but you get a sense that's the config file that's going to point towards where my server location is so again if you're not sure about this i have all the directions on the github there's an accompanying video as well but it just allows you to use that rag agent now in lots of other ways as well for example if you're building your own deep research tool and you want to be able to do that you can do that all right we're going to open it up for questions here in a couple of minutes and again if you're having any issues raise your hand we've got a couple of people in the back that can help out so one question often is like what is this going to cost me like you give me the sales pitch what is it going to cost me well if you use our individual components the parse the re-rank the generate lm unit we've made that all consumption based you pay by the token for those so the pricing is up on the website it's pretty straightforward like that when you've signed up today you started off with a $25 credit so you should be able to go start playing around with it as well so those are the individual components we're also making our rag platform consumption-based pricing as well so the pricing will be based on the number of documents you ingest and how many queries you do so based on that you'll be able to kind of calculate out and know what your workload is like that now i work with some enterprise customers in some sensitive places where they're like hey we really need a guarantee on latency or the amount of queries per second like that this is where we can work with that team and do provisioned throughput so that way you have dedicated kind of hardware for your particular use case like that but i think one of the things that's going to really open it up to developers is the consumption-based pricing because just like you're doing today you're going to be able to sign up pass documents through that and just pay for what you use like that all right so some final takeaways i'm hoping that you got out of this how you can treat rag just as any other managed service you don't have to go through and build out all those pipelines yourself like that you've got the code you can get started now building this pipeline individually or if you just need parts of it just the components try them out as well your re-ranker you're not happy with give it a shot so go try out the app to do that all of this stuff is documented over in docs we have kind of full api docs over there we also have a bunch of example notebooks showing for example integrations with things like ragas as well if you want to use that for evaluation so those example notebooks will walk you through hey how do i improve an agent how would i use all of these settings if you need other example notebooks let us know nina and i will work on that as well so finally fill out our survey kind of share your feedback as well we've got some nice kind of merch we can hand out to people who have good questions as well i think we got a little
bit of time for that how does that sound i've thrown this rag platform at you are you all ready to sign up okay yeah go ahead there's a q and a mic or i'll try to repeat back the question to the wider audience so you'll get my interpretation of the question if i massage it is that on yes yeah um so there's a really interesting feedback loop you're defining here around you can define evals via prompt you can see how they're doing you can let people change the system prompt do you find that your users of like the platform or the enterprise customers like who is it that's doing that is it like technical people or is it the people that's using it day to day so one of the things by putting it in the ui what you have is some of the business users will try to mess around and play with it and some pieces like a system prompt like non-technical people have a sense of from using chatgpt and doing that but when you get to what's my top k for my retrieval or using query reformulation those business users are just going to mess up and their agents are kind of not going to work well like that so a lot of the settings there are for when you get to the advanced hill climbing because a lot of that stuff comes in like i've built a good rag agent it's hitting 80 percent but the business wants us to get to 90 now i have to build the evaluation now i have to see where are my errors are they on the retrieval side are they on the generator side and this is where having that developer perspective of understanding and being able to do that error analysis is important to figure out which of those settings to change because we give you a lot of settings but you still have to have some sense of like which settings are appropriate to change for what outcomes like that um the company i work for is relatively new to ai we're loving what we're discovering but it's also pretty overwhelming the information you presented hints at some of the answers to our major questions which are we have thousands of pdf documents we have thousands of you know excel documents we have an mrp system with the nescue ball back end we need to be writing custom queries for this environment to consume i guess does your environment have a checklist of the approach for different data types and how to package this information for your environment yeah i can repeat back the question like that so the question is and i'll paraphrase here getting into it we have lots of documents we have excel documents pdf documents structured data like how am i going to make sense of that and figure out what the best way to kind of use your platform is is that fair yes and so this is where as a startup we have a couple of different levels we have the developer go figure it out yourself we've given you the buttons the api commands to do that but this is where we've grown and we have a dedicated team that we call customer machine learning engineers that all they do is work with customers and they do that process of hey let's walk you through building your evaluation data sets let's help you kind of hill climb and improve that so depending on what you need we have a team that's just focused on helping customers through that process as well and in the meantime nina and i are trying to document it and make it more available for the rest of the users but there is that gap of like there's a lot of knowledge needed to effectively use these systems to be able to do that is that fair and it's a consulting fee or a startup package yeah talk to
us we can we'll find a way to take your money a little bit go ahead hey rajiv um we've built a few agents and i'm very interested in exploring the rag only part of like what you described is there a way to integrate these into our agents that we're building in typescript javascript absolutely so okay yes i don't have to repeat the question so each of the pieces here we have kind of apis we have a javascript sdk like that so if you just want to use the components of this this is where when the company first started about a year ago if you'd asked to kind of do a demo we would have sold you the whole thing like you have to buy the entire platform you have to get the end to end but we've realized like if we want to appeal to developers that lots of times you've built stuff out and you just want some components of it so yes we've kind of modularized that out so if you just want to use parts of it you can and integrate with others like that and in fact you know most of our customers don't use like the ui that we've showed most of them have other uis that they're integrating their applications with like that so well in addition to the costing or pricing how far can the 25 dollars go like because we weren't able to test it right now like how many documents can we you know experiment with so the question is like how far will your 25 dollars go i don't know exactly i get unlimited use of it um but try it out and if you run into issues just let us know we can talk to sales we can figure out something like that depending on what you're doing like that so don't let the 25 dollars be an issue we want you to use it we want to hear your feedback if you need help let us know we can talk to you and figure out something to do like that so yes do people like the idea of a managed rag service does this feel like something okay okay okay yes well to that point i guess i'm wondering how managed is it so at my company we build rag applications for government and other clients who have really strict data sovereignty data residency questions or requirements so are you in either of the big clouds where i can decide to be in like a gov cloud or some other sovereign data service like how much control do i have over where stuff lives yeah so again we're an early stage startup we're starting out right now we have our own cloud which doesn't help you like that we have partnered with snowflake so we are on snowflake if that can work for you we can also install in a vpc but right now we're limiting it to vpc we're not doing kind of custom on-premises deployments just because that takes a lot of upfront work to be able to do like that does that help yeah can you do vpc in say like aws gov cloud or azure's government solution we have not yet done like aws gov cloud like that like if you have a strong demand for that let us know we can work and try to figure out something like that but we haven't taken that on yet okay i'll come find you thanks yes thanks hello hi hey yeah go ahead i don't know hi i was curious you know like how does the performance of your rag platform vary as the number of documents varies for example like do you have like recommendations of best practices when you're working with millions of documents as compared to hundreds of documents what kind of you know configurations and knobs work better would be curious to learn yeah so i'm going to give the hand wavy answer of we have customers like qualcomm for example that have tens of thousands of
documents their documents have hundreds of pages in them and we handle that fine all i know is that i have engineers like matthew in the back who works on the platform and makes sure it kind of scales up like that so i would say go grab him if you have questions about scalability right now we've been able to kind of scale with our customers along those lines like that so let me take this one and then i'll come back over here hey this is specifically about the lm unit tool how deterministic and repeatable is the scoring from that i'm looking to see i mean it's still a large language model so i think there might be a little bit but i think it's fairly good is that right it's pretty repeatable there's actually a paper released there's a lot more detail in there too but like there was a lot of analysis done about the correlation with other trusted metrics and running the lm unit tests repeatedly we use kind of a random seed to make sure it's repeatable so like all of our testing has suggested that it is repeatable of course if you're altering the prompt you're altering kind of the natural language unit test and that can have you know unforeseen impacts on the results but i think keeping the prompt consistent it should be pretty consistent with different types of queries awesome thank you yeah check out the paper as matthew said it goes into a lot of those pieces like that so thanks question here so is this already integrated with microsoft copilot and azure and the second question is are there any apis that you guys expose meaning if we are developing our own custom code but we want to you know use your rag what's the approach so we have the api so if you're developing your own custom solutions you can easily integrate kind of what we have to do that in terms of installing inside of kind of azure like that that's one of those custom vpcs that we can support like that in terms of integrations with copilot i don't know if we've really done anything usually like people are like i'm sick of copilot and that's why they come to our rag solution like that so again we have the apis though so i don't know what the integrations would require for something like copilot like that thank you hi first of all a great presentation because i have used your product and the other thing is like regarding rag i have a question like you as a company what are the challenges that you're facing in rag and where do you see that would go in the future so what are the challenges we're facing in rag apart from hallucination internet search or things like that apart from that what are the different challenges that you're facing matthew i might have you chime in on this too i think one of the big challenges right now is around the extraction where we've spent a lot of time and energy and there's still i think room for improvement when we're talking about working with complex tables and charts i think there's one there i think scalability is always a piece like that because everybody wants to ingest more documents at a faster speed like that i think if you talk about working with structured data i think that's a challenge for everybody in the industry when you're trying to do those queries of like text to sql type queries like that so those are some of the top ones on my list like that do you have yeah i agree i think document understanding like
making sure that you are correctly understanding long complex documents with hierarchy and interrelations between sections and really large tables and gnarly figures like i think that has been a consistent thing that we've been working on and seeing improvements on but there's definitely still room to grow there um i think another kind of broader change in the rag space that our research team is very focused on and it hasn't directly translated into our product yet but it's kind of coming soon is moving away from the more like static rag pipelines that raj like beautifully described in his presentation toward kind of a fully dynamic flow with you know model routers at certain kind of inflection points deciding which tools to use how many times to retrieve you know how to alter the retrieval config in order to correctly answer the query so i think those sorts of like dynamic workflows will greatly increase what you can answer with any rag platform almost like deep research for rag is one way to think about it and i think that's something that our research team is working very hard on and will be like coming into our platform in the near-ish future great thanks come back in six months do you have plans for a public facing like a web search feature particularly one that could be configured on a company's domain so are you looking for kind of a general i need to be able to search the internet because like there's tavily firecrawl places like that or is it more that companies that have public facing websites where it makes sense to instead of running rag on your like cms database you actually just want to run it on that live and so john i might pull you in here because i think this is one of the things john's worked with quite a bit because a lot of what that requires is the integrations to be able to pull from those web sources as well and so i know john you've done some of those things for customers with firecrawl and pieces like that yeah sorry what was the question for me again about ingesting customer websites and being able to use that in rag oh yeah so um there's a lot of ways that you can scrape websites or like directly kind of set up like an etl pipeline with like pulling the records from the apis or just pulling the unstructured data from like a data cloud or blob storage and then kind of just setting up like an ingestion queue which is what we do on our side for larger customers or basically any customer that needs like a kickstart for lots of data coming in some customers are like super happy to write their own scripts or like their own kind of functions or daily cron jobs to update documents but we're also working on like a managed solution on our side so that you can just give us your credentials on our front end and then we'll just start pulling and syncing all the data on our back end but that's like something coming on the roadmap in like two to three months and so please if there's pieces like that you see that you want let us know because we like to kind of push the product team to kind of make that stuff happen so all right yeah hi hi how do you deal with frequently updated content and document level permissions if not everybody can see specific documents yes so those are both tricky things one of the things we'll do for frequently updated documents is we have a continuous ingestion pipeline john who you can talk to in the back can talk to you more about that piece as well so that's
All right, yeah, hi. How do you deal with frequently updated content, and with document-level permissions when not everybody can see specific documents?

Those are both tricky things. For frequently updated documents, we have a continuous ingestion pipeline, and John in the back can talk to you more about that piece. That's one part, the frequently updated content. The other piece, entitlements, is the harder one: how do you deal with all the permissions? You're indexing HR material next to customer support material, and you have to make sure someone can't look up other people's salaries. This is where we're adding an entitlements layer to the platform, because as we've talked to lots of customers, they've told us this is nice over customer support content, but inside a real enterprise there's governance and there are permissions, and the search has to respect them. So we're adding an entitlements layer on top to handle that; it's an important piece.

All right, hi. My question is, in the last six months, what have been some of the major breakthroughs in the RAG area?

What are the major breakthroughs in RAG in the last six months? I think re-rankers continue to get better; that's an easy one where we've seen it. Part of it is that every piece of the pipeline has been getting steadily better. One of the biggest changes over the last six months to a year is the rise of vision language models and how strong they've become at handling images. That's off the top of my head; I see Matthew is busy talking to somebody else, otherwise I'd have him weigh in as well. Does that help? Yes.

Can the Contextual parser module replace parsers like Docling or the GCP invoice parsers, and does it have the ability to parse QR codes and barcodes and extract that data as metadata?

There are a lot of different document types out there and a lot of different solutions, so we're not going to handle every type perfectly. But this is why we've exposed those APIs, so you can try our parser on your documents. We haven't specifically focused on that type of document; we do have image captioning, so it might pick those codes up or it might not. The idea is to play in that same space as the other companies doing parsing and make sure we have a module you can drop in, replace, and use alongside the rest of your stack.

Yes? How do you deal with domain-specific language?

That's a good question, and it can get a little tricky. We've seen it, for example, with technical companies that use very specific vocabulary. On one hand you can adjust the system prompts a bit on the retriever side, but sometimes what we've done with those customers is fine-tune the grounded language model so it's closer to their vocabulary and how they speak. At one point we had that available to end developers in the platform, but fine-tuning models for RAG can get complex, so we've put that away for now; it's more for our service-oriented customers who work with our machine learning engineers. But if that knowledge is out there and you need it in your grounded language model, one technique we've used is a fine-tuning process.

Go ahead. Oh, I don't know where the mic is; we're making them run and get their steps in today.

Hey, how about HIPAA regulations and PHI data? How do you deal with that? Should we hash it before it goes to the platform, or can the LLM understand and deal with it?

There are two aspects to that. One is that we just got HIPAA certified; people are going to think I'm paying the audience members. The second is what you do at extraction time: maybe I have PHI data that I want automatically masked or filtered. That's something we've been discussing with the product team, whether to include the capability to automatically flag and identify it during parsing. Right now we don't have that in the product, but again, if that's something you need or want, let us know; it's not that hard to build. You can also do it as a supplement to the parsing and run a second job that looks for PII; there's a small sketch of that idea a little further down. Does that help? Good.

I know garbage in, garbage out, but if we have a document with, say, two different numbers for a particular statistic or fact, how does your RAG deal with that, and how would it answer a question?

What happens in that case is that retrieval will often bring both pieces of information over to the grounded language model, and it's up to the language model to use its reasoning ability to work out the difference. If one of them is obviously wrong, say the question is the size of a mosquito and one answer is 100 feet while the other is three millimeters, the language model will know one of them is junk and ignore it. But if they're both close and reasonable, yes, that can get you in trouble. I assume this is a pain you're having now, and you're trying to see how to organize around it. Yes, a lot of my job right now is telling people information that should be in a document but isn't updated. One thing that can help is taking advantage of metadata and attaching as much rich information about those documents as you can, so that at retrieval time the model can figure out why it should prioritize one answer over another: maybe this one is the most recent, or it was written by the authoritative person. That's one thing we recommend for those use cases; there's a small sketch of that below. Thank you.
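For the PHI question above, here is a minimal sketch of that supplemental masking job: scrub obvious identifiers from text before it is ever uploaded. The regex patterns are illustrative only and nowhere near exhaustive; a real deployment would lean on a dedicated PII/PHI detection library rather than hand-written patterns.

```python
# Minimal sketch of a pre-ingestion masking pass: replace obvious identifiers
# with redaction tokens before a document reaches the platform.
import re

PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def mask_pii(text: str) -> str:
    # Apply each pattern in turn; order does not matter for these examples.
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label.upper()}]", text)
    return text

raw = "Reach Jane at jane.doe@example.com or 555-867-5309. SSN 123-45-6789."
print(mask_pii(raw))
# Reach Jane at [REDACTED_EMAIL] or [REDACTED_PHONE]. SSN [REDACTED_SSN].
```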
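And for the conflicting-numbers question, a minimal sketch of the metadata idea: stamp each document with its source, owner, and last-updated date before ingestion so the grounded model has something to reason with when two sources disagree. The header format and field names are arbitrary choices for illustration, not a prescribed schema.

```python
# Minimal sketch: prepend a metadata header to a document before ingestion so
# retrieval surfaces recency and authority alongside the content itself.
from datetime import date

def with_metadata_header(text: str, source: str, owner: str, updated: date) -> str:
    header = (
        f"[SOURCE: {source}]\n"
        f"[OWNER: {owner}]\n"
        f"[LAST UPDATED: {updated.isoformat()}]\n\n"
    )
    return header + text

doc = "The standard mosquito size figure we cite is 3 mm."
print(with_metadata_header(doc, "entomology-handbook-v7", "Research Ops", date(2025, 3, 14)))
```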
All right. OK, well, thank you all for staying; I've loved the questions. Please use the app to give us feedback on how you found this, good, bad, or negative; we're happy to take it all. Thank you again for spending your morning with us. Anything else, Nina, or are we good? Oh, you've got the mic off. Oh, I think I still have my mic. Yeah, I've been going around distributing swag to folks who asked questions; sorry if I missed some of you who were further away. I have one more pair of socks if anyone wants them. The guy in the back wants the socks. All right. Thank you all; we'll be hanging around for a little while if you have anything else.