
Scaling Enterprise-Grade RAG: Lessons from Legal Frontier - Calvin Qi (Harvey), Chang She (Lance)


Transcript

All right. Thank you, everyone. We're excited to be here, and thank you for coming to our talk. My name is Chang. I'm the CEO and co-founder of LanceDB. I've been making data tools for machine learning and data science for about 20 years. I was one of the co-authors of the pandas library, and I'm working on LanceDB today for all of that data that doesn't fit neatly into those pandas data frames.

I'm Calvin. I lead one of the teams at Harvey working on RAG — tough RAG problems across massive data sets of complex legal docs and complex use cases. So yeah, our talk is about — oh, one sec, maybe we should have used the other clicker. Yeah, yeah. All right. That's okay.

We'll use the laptop. So we're going to talk about some of the tough RAG problems on the legal frontier: the challenges, some solutions, and learnings from our experiences working together on them. We'll start with roughly how Harvey tackles retrieval and the types of problems there are, then the challenges that come up with that — retrieval quality, scaling, security, all that good stuff — and then how we end up creating a system with good infrastructure to support it.

So first of all, a quick intro to what Harvey is. We're a legal AI assistant. We sell our AI product to a bunch of law firms to help them do all kinds of legal tasks: drafting, analyzing documents, going through legal workflows. And a big part of that is processing data.

So we handle data of all different volumes and forms. At the smallest scale, we have an assistant product with on-demand uploads, the same way you might upload on demand to any AI assistant tool. That's the smaller range, roughly 1 to 50 documents.

We have these vaults, which are larger-scale project contexts. So if there's a big deal going on that the law firm is working on, or a data room where they need all their contracts, all their litigation documents and emails in one place, that's a vault.

And then the third is the largest scale, which is data corpuses — knowledge bases from around the world. So legislation, the case law of a particular country, all the laws, taxes, and regulations that go into it. So yeah, some big challenges come with that.

One is scale: just very large amounts of data, and some of these documents are super long, dense, and packed with content. Sparse versus dense retrieval is a challenge I'm sure all of you deal with — how to represent the data, how to index it and retrieve over it.

Query complexity is a big one. We get very difficult expert queries, and I'll show an example in the next slide. The data is very domain-specific and complex; there are a lot of nitty-gritty legal details that go into it. So we have to work with domain experts and lawyers to understand it and translate that into how we represent the data and how we index, query, and pre-process over it.

Data security and privacy is a big one. A lot of this data is sensitive — confidential deals, IPOs, financial filings, things like that — so we have to respect all of that for our clients. And then, of course, evaluation: how to make sure the systems are actually good.

So I'll show a quick demonstration of a retrieval quality challenge, just on the query side. This is maybe the average complexity of a query someone might issue in our product — there are much more complex ones and simpler ones, but this is right in the middle.

And you can see there are a lot of different components that go into this. To read it out: what is the applicable regime for covered bonds issued before 9 July 2022 under Directive EU 2019-2062 and Article 129 of the CRR?

So, you know, that's a handful. What goes into it: there's a semantic aspect. There's implicit filtering going on — we want applicability before a certain date. There's a specialized data set being referenced, which is EU laws and directives. And there are keyword matches, like the specific regulation and directive IDs.

It's multi-part, in that it's asking how this applies under two different regulations — one directive, one article. And there's domain jargon here: CRR is an abbreviation — the Capital Requirements Regulation; I had to look it up this morning. But yeah, it's very complex.
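To make that concrete, here's a minimal sketch of how a query like that might be decomposed into separate retrieval signals before it hits any index. This is an illustrative structure, not Harvey's actual pipeline; all the field names and values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DecomposedQuery:
    """Hypothetical breakdown of a complex legal query into retrieval signals."""
    semantic_text: str            # fed to dense / embedding retrieval
    keyword_terms: list[str]      # exact / lexical matches (e.g. regulation IDs)
    filters: dict[str, str]       # implicit structured constraints
    sub_questions: list[str]      # multi-part aspects answered separately

query = DecomposedQuery(
    semantic_text="applicable regime for covered bonds",
    keyword_terms=["Directive EU 2019-2062", "Article 129", "CRR"],
    filters={"corpus": "eu_legislation", "issued_before": "2022-07-09"},
    sub_questions=[
        "What does the directive say about bonds issued before 9 July 2022?",
        "What does Article 129 of the CRR require?",
    ],
)
```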

And we need a system that can tackle all this complexity — break down the query and use the appropriate technologies for its different parts. So yeah, one common question we get in response to this complexity is: how do you evaluate your systems?

How do you make sure they're good? That's actually where we spend a ton of time — not as much on the algorithms and the fancy agentic techniques, but on how to validate them. And I'd say investing in eval-driven development is a huge key to building these systems and making sure they're good, especially when it's a tough domain that you don't inherently know much about as an engineer or researcher.

So I'd say there's no silver-bullet eval, but we have a whole range of them across different task depths and complexities. Along one dimension, the higher the fidelity, the more costly the eval; in the other direction are more automated evals that are faster to iterate on.

As an example, the high-fidelity end would be expert reviews: having experts directly review outputs, analyze them, and write reports. That's super expensive but super high quality. Somewhere in between is an expert-labeled set of criteria that you can evaluate synthetically or in some automated way.

It's still expensive to curate and maybe a little expensive to run, but more tractable. And the third tier is the fastest to iterate on: automated quantitative metrics — retrieval precision and recall, and more deterministic success criteria like: am I pulling documents from the right folder? Is it the right section? Do they have the right keywords in them? Things like that.
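As a minimal sketch of that fastest tier, here's what precision and recall at k look like against an expert-labeled set of relevant document IDs. The example data is made up.

```python
def precision_recall_at_k(retrieved_ids: list[str],
                          relevant_ids: set[str],
                          k: int) -> tuple[float, float]:
    """Compute precision@k and recall@k for a single query."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

# Deterministic success criteria, e.g. "did we pull from the right folder?"
retrieved = ["contracts/nda_17", "emails/msg_03", "contracts/nda_02"]
relevant = {"contracts/nda_17", "contracts/nda_02"}
p, r = precision_recall_at_k(retrieved, relevant, k=3)
print(f"precision@3={p:.2f} recall@3={r:.2f}")  # precision@3=0.67 recall@3=1.00
```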

And to give you a quick sense of the scale and complexity on the data side, not only the query side: the data sets we integrate with are pretty massive. As you can see, we support data sets across all different kinds of countries, and for each one there's complex filtering, organization, and categorization that goes into it. We work with domain experts for all of this, but also try to apply automation whenever possible — using their guidance to come up with heuristics or LLM processing techniques to categorize all of it.

And I'd say the performance implications are pretty significant as well. We need very good performance both online and offline — online being querying over this, where you want good latency, and offline being ingestion, re-ingestion, and running ML experiments for different variations and such. Generally, one of these corpuses can be tens of millions of docs.

So yeah, pretty large scale, and each document is often quite large. So I'll talk quickly about the infrastructure needed to support this. At this scale, of course, we want infrastructure that's reliable and available for all our users at all times — I'm sure that's something all products need.

We also want smooth onboarding and scaling. We definitely want our ML and data teams to be able to focus on the business logic, the quality, and spinning up new applications and products for customers, and not too much on the nitty-gritty details of the database, tuning it, or manually scaling it.

Of course, there's always some in-between where you want to have awareness of it — it can't be fully 1000% automated. Likewise, we need flexibility and capabilities around data privacy and data retention. Like I mentioned, some storage needs to be segregated depending on the customer and the use case, and there are retention policies on some docs that we might only be allowed to store for certain amounts of time for legal reasons.
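As a sketch of how a retention policy like that can be enforced, here's a predicate delete against a document table. The table name, column, and URI are hypothetical; the connect/open_table/delete calls follow LanceDB's documented Python API, though exact behavior may vary by version.

```python
import datetime
import lancedb

db = lancedb.connect("s3://example-bucket/lancedb")   # placeholder URI
tbl = db.open_table("client_documents")               # hypothetical table

# Assumes an `ingested_at` column storing ISO-8601 strings, so lexical
# comparison matches chronological order.
cutoff = (datetime.datetime.now(datetime.timezone.utc)
          - datetime.timedelta(days=90)).strftime("%Y-%m-%dT%H:%M:%S")
tbl.delete(f"ingested_at < '{cutoff}'")  # enforce a 90-day retention window
```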

We want good telemetry and usage visibility around the database. And then, for any vector or keyword or filtering database we need, we want good performance, query flexibility, and scale — especially for all the different query patterns I mentioned before, where you need exact matches, semantic matches, and filters, and you might want to navigate the data agentically or in some dynamic way.

So yeah, all that flexibility is important to us at scale. And that's where LanceDB comes in. Cool, thank you. Awesome. Okay, I'm going to try to hold this here so there's no echo. Okay. So, as I was saying — you know, I work at LanceDB.

And what we're delivering for AI goes beyond what I'd call just a vector database: what we call an AI-native multimodal lakehouse. If you think back to Jerry's talk — in addition to search, you also need a good foundation, a good platform, to do all of the other tasks you need to do with your AI data.

That can be feature extraction, generating summaries, generating text descriptions from images, and managing all that data — and you want to be able to do it all together. So what you really need is a lakehouse architecture, where all the data can be stored in one place on object store.

You can run search and retrieval workloads. You can run analytical workloads. You can train off of that data. And of course, you can pre-process that data to iterate on new features to experiment with for your applications and models. Lakehouse architectures are generally good for those large batch offline use cases, but not necessarily for online serving.

And this is where LanceDB's distributed architecture comes in. It's actually good for both offline and online contexts, so we can serve at massive scale from cloud object store. We deliver separation of compute, memory, and storage. And we give you a simple API for sophisticated retrieval, whether you want to combine multiple vector columns, combine vector and full-text search, and then do re-ranking on top of that.

Those are all available through an API in Python or TypeScript that, folks have told me, feels kind of like pandas or Polars — very familiar to data workers who are used to data-frame-style APIs. And of course, for large tables we support GPU indexing; I think our record is something like three or four billion vectors in a single table, indexed in under two or three hours.
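To make that retrieval API concrete, here's a hedged sketch of hybrid vector-plus-keyword search with reranking using LanceDB's Python API. It assumes an embedding function is configured on the table so a text query can drive the vector half; the table and column names are made up, and signatures can differ across versions.

```python
import lancedb
from lancedb.rerankers import RRFReranker  # reciprocal rank fusion

db = lancedb.connect("s3://example-bucket/lancedb")  # placeholder URI
tbl = db.open_table("eu_legislation")                # hypothetical table

# A full-text index on the text column powers the keyword half of hybrid search.
tbl.create_fts_index("text", replace=True)

results = (
    tbl.search("applicable regime for covered bonds", query_type="hybrid")
    .where("jurisdiction = 'EU' AND issued_date < '2022-07-09'")  # metadata filter
    .rerank(RRFReranker())   # fuse the vector and keyword rankings
    .limit(10)
    .to_pandas()
)
```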

All of that is to say, LanceDB excels at massive scale, and at a fraction of the cost, because of the compute-storage separation and because we take advantage of object store. And I talked about having one place to put all of your AI data.

This is the only database where you can put images, videos, and audio tracks next to your embeddings, next to your text data, your tabular data, your time-series data — all in a single table. And then you can use that as the single source of truth for all the different workloads you want to run on that data, from search to analytics to training and, of course, pre-processing or feature engineering.
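As an illustrative sketch (not a prescribed layout), one such mixed table could be declared with a PyArrow schema like this — blobs, text, a fixed-size embedding vector, and scalar metadata side by side:

```python
import pyarrow as pa
import lancedb

schema = pa.schema([
    pa.field("doc_id", pa.string()),
    pa.field("page_image", pa.binary()),                  # large blob column
    pa.field("text", pa.string()),
    pa.field("embedding", pa.list_(pa.float32(), 1024)),  # fixed-size vector
    pa.field("jurisdiction", pa.string()),
    pa.field("ingested_at", pa.timestamp("us")),
])

db = lancedb.connect("/tmp/lancedb")                 # placeholder URI
tbl = db.create_table("documents", schema=schema)    # empty table, ready to ingest
```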

A lot of that is possible because of the open source Lance format that we built from the ground up. If you're working with multimodal data — whether it's documents like PDFs, scans, and slides, or even large-scale video — and you're doing that in, say, WebDataset or Iceberg and Parquet, you're missing out: lack of random access, the inability to support large blob data, and inefficient schema evolution.

Lance format, by giving you all of those, makes it so you can store all of your data in one place rather than split across multiple systems. And this is, I would say, the foundational innovation in LanceDB. Without it, what we see a lot of AI teams doing is keeping different copies of different parts of their data in different places, and spending a lot of their time and effort just keeping those pieces glued together and in sync with each other.

All right. Basically, you can think of Lance format as Parquet plus Iceberg plus secondary indices, but for AI data. That gives you fast random access, which is good for search and shuffle. It still gives you fast scans, which are good for analytics, data loading, and training.

And it's the only one of this set that is uniquely good at storing blob data — or, more importantly, a mix of large blob data and small scalar data. And by using Apache Arrow as the main interface, Lance format is already compatible with your current data lake and lakehouse tools.
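Here's a small sketch of that Arrow interface using the open source `lance` package (pylance): write an Arrow table out as a Lance dataset, then do random access by row index — the operation scan-oriented formats are slow at. The path and data are placeholders.

```python
import lance
import pyarrow as pa

table = pa.table({
    "doc_id": ["a", "b", "c"],
    "text": ["first doc", "second doc", "third doc"],
})

# Write Arrow data out as a versioned Lance dataset.
lance.write_dataset(table, "/tmp/docs.lance")

# Fast random access by row index (good for search and shuffling).
rows = lance.dataset("/tmp/docs.lance").take([0, 2])
print(rows.to_pydict())
```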

So you can use Spark and Ray to write very large amounts of Lance data in a distributed fashion very quickly. You can use PyTorch to load that data for training or fine-tuning. And you can certainly query it using tools like pandas and Polars. All right — Calvin, I'll hand it back to you.

Yeah, so we're back? Okay. I just wanted to share some general take-home messages about building RAG for these large-scale, domain-specific use cases. The first is that domain-specific challenges require very creative solutions around understanding the data and choosing the modeling and infrastructure around it.

Like I mentioned: try to understand the structure of your data, what the use cases are, and what the explicit and implicit query patterns are. Definitely spend time on that — work with domain experts and immerse yourself in the domain as much as possible. The second is to make sure you're building for iteration speed and flexibility.

This is a very new technology and a very new industry, and a lot of things are changing: new tools are coming out, new paradigms, new model context windows, everything. So you want to set yourself up for flexibility and iteration speed. And you can ground that in evaluation: if you have good evaluation sets, or good procedures and automation around them, you can iterate much faster and get good signal on whether your systems are accurate.

So definitely invest time in evaluation to enable that iteration speed. And the third, which Chang covered, is that new data infrastructure has to recognize the new world we're entering: multimodal data, much heavier use of vectors and embeddings, very diverse workloads, and scale that's just going to keep getting larger as we try to ingest and query over all the data that exists, public and private.

Yeah, thanks for listening to our talk. Thank you.