Scaling Enterprise-Grade RAG: Lessons from Legal Frontier - Calvin Qi (Harvey), Chang She (Lance)

00:00:00.040 |
All right. Thank you, everyone. We're excited to be here. And thank you for coming to our talk. 00:00:20.580 |
My name is Chang. I'm the CEO and co-founder of LanceDB. I've been making data tools for machine 00:00:25.500 |
learning and data science for about 20 years. I was one of the co-authors of the Pandas library, 00:00:29.120 |
and I'm working on LanceDB today for all of that data that doesn't fit neatly into those 00:00:34.160 |
Pandas data frames. I'm Calvin. I lead one of the teams at Harvey AI, working on RAG. Tough RAG 00:00:40.960 |
problems across massive data sets of complex legal docs and complex use cases. So yeah, our talk is 00:00:48.240 |
about, oh, one sec. Maybe we should have used the other clicker. Yeah, yeah. All right. That's okay. We'll use the laptop. So we're going to talk 00:00:58.240 |
about some of the tough RAG problems in the legal frontier, sort of challenges, some solutions 00:01:03.280 |
and learnings from our experiences working together on it. So we'll start roughly with like sort of how 00:01:07.440 |
Harvey tackles retrieval, the types of problems there are, and then the challenges that come up with that, 00:01:11.520 |
all with like retrieval quality, scaling, security, all that good stuff, and then how we end up sort 00:01:16.960 |
of creating a system with good infrastructure to support that. 00:01:19.680 |
So first of all, a quick intro to what Harvey is. We're a legal AI assistant. So we sell our sort of AI 00:01:29.600 |
product to a bunch of law firms to help them do all kinds of legal tasks like draft, analyze documents, 00:01:36.080 |
sort of go through legal workflows. And a big part of that is processing data. So we handle data of all 00:01:41.760 |
different sort of volumes and forms. The sort of different scales of that are, we have an assistant 00:01:46.480 |
product that's like on demand uploads, same way you might like on demand upload to any AI assistant tool. 00:01:51.120 |
So that's like a smaller one to 50 range. We have these vaults, which are sort of larger scale project 00:01:57.840 |
contexts. So if there's like a big deal going on that the law firm's working on, or like a data room, 00:02:02.160 |
where they need sort of all their contracts, all their, you know, litigation documents and emails in 00:02:05.920 |
one place, that's a vault. And then the third is the largest scale, which is data corpuses, 00:02:10.720 |
which are like knowledge bases around the world. So like legislation, case laws of a particular country, 00:02:15.040 |
all the sort of laws, taxes, regulations that go into it. 00:02:18.880 |
So yeah, some big challenges that come up, come with that. One is scale, just very large amounts of data. 00:02:29.280 |
Some of these documents are like super long and dense and packed with content. 00:02:33.680 |
Sparse versus dense is, I'm sure, a sort of retrieval challenge that all of you deal with, 00:02:38.640 |
of how to represent the data, how to retrieve over and index it. Query complexity is a big one. We got 00:02:44.640 |
very sort of difficult expert queries. And I'll show an example of that in the next slide. 00:02:48.640 |
The data is very domain specific and complex. There's sort of a lot of nitty gritty legal details 00:02:56.240 |
that go into it. So we have to like work with domain experts and lawyers to understand it and try 00:03:00.160 |
to like translate that into how we represent the data, how we, you know, index, query, pre-process over 00:03:05.600 |
it. Data security and privacy is a big one. A lot of this data is sensitive for like confidential deals 00:03:11.760 |
or confidential, I don't know, IPOs, financial filings, stuff like that. So we have to respect all 00:03:16.160 |
that for our clients. And then of course, evaluation of how to make sure systems are actually good. 00:03:22.560 |
So yeah, I'll show a quick demonstration of a retrieval quality challenge. So this is just 00:03:26.720 |
on the query side of like, this is maybe the average complexity of a query someone might issue in our 00:03:31.280 |
product. There are much more complex ones and maybe simpler ones, but this is right in the middle. And you can 00:03:35.680 |
see that like, there's a lot of different components that go into this. There's sort of a semantic, well, 00:03:40.960 |
to read it out, it's like, what is the applicable regime to covered bonds issued before 00:03:45.200 |
9 July 2022 under Directive (EU) 2019/2162 and Article 129 of the CRR? So, you know, that's a 00:03:52.320 |
handful. What goes into it is like, there's a semantic aspect. There's sort of implicit filtering going on 00:03:58.240 |
of like, you know, we want applicability before a certain date. There's a specialized data set being 00:04:03.040 |
referenced, which is EU laws and directives. There's kind of keyword matches of like the specific, you know, 00:04:09.200 |
regulation directive ID. It is multi-part in that it's sort of asking how this applies to two different 00:04:14.880 |
regulations, like one directive, one article. And there's like domain jargon here, or this is like 00:04:19.760 |
an abbreviation. I forget what it was, Capital Requirements something. I looked it up this morning. 00:04:24.560 |
But yeah, it's very complex. And we sort of need a system that can tackle all this complexity and sort of 00:04:32.160 |
break down this query and use all the appropriate technologies for the different parts of it. 00:04:36.960 |
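To make that breakdown concrete, here is a toy sketch of splitting a query like this into semantic, keyword, and filter components. This is a hypothetical illustration, not Harvey's actual pipeline; a production system would typically use an LLM or a trained parser rather than regexes, and every name here is made up.

```python
import re

def decompose_query(query: str) -> dict:
    """Hypothetical sketch: split a legal query into retrieval components."""
    # Exact-match candidates: EU directive/regulation identifiers like "2019/2162".
    keyword_ids = re.findall(r"\b\d{4}/\d{3,4}\b", query)
    # Article references ("Article 129") are also good keyword anchors.
    keyword_ids += re.findall(r"[Aa]rticle\s+\d+", query)
    # Implicit date filter: phrases like "before 9 July 2022".
    date_match = re.search(r"\b(before|after)\s+(\d{1,2}\s+\w+\s+\d{4})", query)
    date_filter = (
        {"op": date_match.group(1), "date": date_match.group(2)}
        if date_match else None
    )
    # The full query text still carries the semantic intent for dense retrieval.
    return {"semantic": query, "keywords": keyword_ids, "date_filter": date_filter}

q = ("What is the applicable regime to covered bonds issued before "
     "9 July 2022 under Directive (EU) 2019/2162 and Article 129 of the CRR?")
parts = decompose_query(q)
```

The point is only that one user query fans out into several retrieval strategies: dense search on the semantic text, exact matching on the identifiers, and a structured date filter.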
And yeah, so one common question we get sort of in response to this complexity is, how do you 00:04:44.960 |
evaluate your systems? How do you make sure they're good? And that's actually where we spend a ton of 00:04:49.200 |
time. It's, you know, not as much on the algorithms and the fancy agentic techniques, but more like how to 00:04:54.240 |
validate them. And I'd say like investing in eval driven development is a huge, huge key to building 00:05:01.120 |
these systems and making sure they're good, especially when it's a tough domain that like you don't 00:05:05.200 |
inherently know much about as maybe an engineer or researcher. So I say there's no silver bullet eval, 00:05:10.640 |
but we have like a whole range of them of like different task depths and complexities. So in sort 00:05:15.360 |
of one dimension, the higher the fidelity, the more costly. And then in the other 00:05:20.320 |
direction, it's like more automated evals that are faster to iterate on. So as an example, like the 00:05:26.160 |
sort of high fidelity would be like expert reviews of just having them directly review outputs and analyze 00:05:32.640 |
them and write reports. So that's like super expensive, but super high quality. And then 00:05:37.520 |
maybe something between is like an expert labeled like set of criteria that you can maybe evaluate 00:05:42.960 |
synthetically or evaluate in some automated way. So it's still expensive to curate, maybe a little 00:05:46.960 |
expensive to run, but more, more tractable. And then the third is sort of the fastest iteration, 00:05:52.960 |
which is sort of more automated quantitative metrics like just, you know, retrieval precision recall, 00:05:58.560 |
sort of different, more deterministic success criteria of like, am I pulling documents from 00:06:03.360 |
the right folder? Is it the right section? Do they have the right keywords in them? Things like that. 00:06:07.200 |
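That fastest tier of automated metrics can be sketched in a few lines. This is a minimal, generic precision/recall-at-k computation with made-up example data, not Harvey's eval harness:

```python
def precision_recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> tuple[float, float]:
    """Compute precision@k and recall@k for one query.

    retrieved: ranked doc IDs returned by the retrieval system.
    relevant:  gold-labeled relevant doc IDs for the query.
    """
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical labeled example: four retrieved docs, two of them relevant.
retrieved = ["doc_3", "doc_7", "doc_1", "doc_9"]
relevant = {"doc_3", "doc_1", "doc_5"}
p, r = precision_recall_at_k(retrieved, relevant, k=4)
# p = 2/4, r = 2/3
```

The same pattern extends to the other deterministic checks mentioned (right folder, right section, right keywords): each is just a boolean predicate averaged over a labeled query set.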
And yeah, give you a quick sense also of sort of the scale and complexity on the data side, 00:06:16.160 |
not only on the query side. So the data sets we integrate with are pretty massive. As you can see, 00:06:22.560 |
we support like, you know, data sets across all different kinds of countries. And for each one, 00:06:27.040 |
there's sort of complex filtering and organization and categorization that goes into it. 00:06:31.040 |
So we sort of work with domain experts for all of this, but also try to apply automation whenever 00:06:37.200 |
possible, like use their guidance to maybe come up with heuristics or LLM processing techniques 00:06:42.720 |
to be able to categorize all of this. And I'd say that the performance implications are pretty significant 00:06:49.600 |
as well. We need very good performance both online and offline. Online being like querying over this, 00:06:56.320 |
you want good latency, and then offline being like ingestion, re-ingestion, running ML experiments for 00:07:01.040 |
different variations and such. And I'd say generally one of these corpuses can be like, yeah, tens of millions 00:07:06.800 |
of docs. So yeah, pretty large scale. And each document is often quite large. 00:07:11.760 |
So I can talk quickly about kind of infrastructure needs to support this. So at this scale, of course, 00:07:22.400 |
we, you know, we want infrastructure to be reliable, available for all our users at all times. I'm sure 00:07:29.360 |
that's something that, you know, all products need. We also want smooth sort of onboarding and scaling, 00:07:35.920 |
where, you know, we definitely want our ML and data teams to be able to focus more on the sort of 00:07:41.440 |
the business logic and the quality and spinning up new applications and products for customers and, 00:07:46.960 |
you know, not too much about like the nitty-gritty details of the database or tuning that or manually 00:07:52.000 |
scaling. And of course, there's always some in between where you want to have awareness of it. 00:07:55.840 |
It can't be like fully 1000% automated. Likewise, we need sort of flexibility and capabilities around 00:08:02.720 |
data privacy and data retention. Like I mentioned, with some storage needing to be like segregated 00:08:10.400 |
depending on the customer, depending on the use case, sort of retention policies on some docs that 00:08:15.600 |
we might only be allowed to store for certain amounts of time for legal reasons. We want good sort of 00:08:20.880 |
telemetry and usage around the database. And then, of course, any sort of vector or keyword or any 00:08:28.320 |
filtering database we need, we want to support good performance, query flexibility, scale, especially 00:08:34.000 |
for all the different kinds of query patterns I mentioned before, where it's like, you need exact 00:08:38.480 |
matches, you want semantic matches, you want filters, and you might want to navigate it, maybe, 00:08:44.880 |
yeah, agentically or like in some dynamic way. So yeah, all that flexibility is important to us at scale. 00:08:52.240 |
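The retention-policy requirement mentioned above can be illustrated with a small sketch: a sweep that flags documents whose allowed storage window has lapsed. This is a hypothetical illustration with invented field names, not Harvey's or LanceDB's actual mechanism:

```python
from datetime import date, timedelta

def expired_docs(docs: list[dict], today: date) -> list[str]:
    """Return IDs of documents whose retention window has lapsed.

    Each doc records when it was ingested and how many days its
    matter's policy allows it to be kept (hypothetical schema).
    """
    out = []
    for doc in docs:
        deadline = doc["ingested_on"] + timedelta(days=doc["retention_days"])
        if today >= deadline:
            out.append(doc["id"])
    return out

docs = [
    {"id": "deal_123/nda.pdf", "ingested_on": date(2024, 1, 1), "retention_days": 90},
    {"id": "deal_456/filing.pdf", "ingested_on": date(2024, 5, 1), "retention_days": 365},
]
to_purge = expired_docs(docs, today=date(2024, 6, 1))
```

In practice such a sweep would also have to delete the derived artifacts (chunks, embeddings, index entries) for each purged document, which is part of why retention support needs to live close to the database.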
And that's where LanceDB comes in. Cool. Thank you. Awesome. So 00:09:05.920 |
Okay. I'm going to try to hold this here, maybe. So there's no echo. Okay. Yeah. So as I was saying, 00:09:15.760 |
and so, you know, I work at LanceDB. And what we are delivering for AI goes beyond what I'd call just a vector 00:09:26.720 |
database; it's what we call an AI-native multimodal lake house. And so if you think back to maybe 00:09:33.280 |
Jerry's talk, right, in addition to search, you also need a good foundation, a good platform for you to do 00:09:41.040 |
all of the other tasks that you need to do with your AI data. So this can be feature extraction, 00:09:47.440 |
generating summaries, generating text descriptions from images, managing all that data. And you want 00:09:55.040 |
to be able to do that all together. So what you really need is sort of this lake house architecture, 00:10:00.000 |
where all the data can be stored in one place on object store. You can run search and retrieval 00:10:08.160 |
workloads. You can run analytical workloads. You can train off of that data. And of course, 00:10:13.120 |
you can pre-process that data to iterate on new features that you can experiment for your applications 00:10:18.640 |
and models. Specifically, too, in addition to these large batch offline use cases, 00:10:28.400 |
you know, lake house architectures generally are good for that, but not necessarily for online serving. 00:10:34.720 |
And this is where LanceDB's distributed architecture comes in. It's actually good for both offline and 00:10:43.440 |
online contexts, so that we can serve at massive scale from cloud object store. We can deliver 00:10:50.000 |
compute, memory, and storage separation. And we give you a simple API for sophisticated retrieval, whether 00:10:58.000 |
you want to combine multiple vector columns, vector and full text search, and then do re-ranking on top of 00:11:06.400 |
that. Those are all available with an API in Python or TypeScript that, folks have told me, feels 00:11:16.400 |
kind of like Pandas or Polars, very familiar to data workers who are used to data frame style APIs. 00:11:24.480 |
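The fusion step behind combining vector and full-text results can be sketched with reciprocal rank fusion (RRF), a common technique for merging ranked lists. This is a pure-Python illustration of the general idea, not LanceDB's actual implementation, and the result lists are made up:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: merge several ranked lists of doc IDs.

    Each doc scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the constant conventionally used for RRF.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results: one list from vector search, one from full-text search.
vector_hits = ["doc_a", "doc_b", "doc_c"]
fts_hits = ["doc_b", "doc_d", "doc_a"]
fused = rrf_fuse([vector_hits, fts_hits])
```

Docs that rank well in both lists (here `doc_b` and `doc_a`) float to the top, which is exactly what you want when a query has both a semantic and a keyword component; a learned re-ranker can then be applied to the fused list.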
And of course, for large tables, we support GPU indexing. So I think our record has been around something like 00:11:32.800 |
three or four billion vectors in a single table that can be indexed in under two or three hours. 00:11:40.080 |
So all of that is to say, like, LanceDB excels at massive scale. And this is happening at a fraction of 00:11:50.240 |
the cost because of the compute storage separation and because we take advantage of object store. And 00:11:59.280 |
so of course, I talked about sort of having one place to put all of your AI data. This is the only 00:12:06.400 |
database where you can put, you know, images and videos and audio track next to your embeddings, 00:12:13.920 |
next to text data, next to your tabular data, time series data. You can put all of that in a single table. 00:12:23.120 |
And then you can, of course, use that as the single source of truth for all the different workloads 00:12:27.600 |
that you want to do on that data, from search to analytics to training and, of course, 00:12:32.480 |
pre-processing or feature engineering. A lot of that is possible because of the open source 00:12:40.480 |
Lance format that we built from the ground up. So, you know, if you're working with multimodal data, 00:12:46.880 |
whether it's documents, you know, PDFs, scans, slides, or even large-scale videos, 00:12:52.880 |
if you're doing that in, let's say, WebDataset or Iceberg/Parquet, you're missing out on a lot of 00:13:00.240 |
features: things like lack of random access, the inability to support large blob data, 00:13:08.080 |
or inefficiency around schema evolution. So Lance format, by giving you all of those, 00:13:17.760 |
right, it makes it so that you can store all of your data in one place rather than split up across 00:13:24.080 |
multiple parts. And so this is, I would say, the foundational innovation in LanceDB, where 00:13:33.120 |
without it, what we see a lot of AI teams doing is they have to have different copies of different 00:13:38.560 |
parts of their data in different places, and they're spending a lot of their time and effort 00:13:42.320 |
just sort of keeping those pieces glued together and in sync with each other. 00:13:47.280 |
All right. So, basically, you can think about Lance format as sort of Parquet plus Iceberg plus 00:13:59.120 |
secondary indices, but for AI data. And that gives you fast random access, which is good for search and 00:14:05.600 |
shuffle. It still gives you fast scans, which is so good for analytics and, you know, data loading and 00:14:11.280 |
training. And it's the only one of this set that is good for storing blob data or, more 00:14:19.680 |
importantly, a mix of large blob data and small scalar data. 00:14:25.120 |
And by using Apache Arrow as the main interface, Lance format is already compatible with your current 00:14:34.080 |
data lake and lake house tools. So you can use Spark and Ray to write very large amounts of Lance data in a 00:14:42.240 |
distributed fashion very quickly. You can use PyTorch to load that data for training or fine tuning. 00:14:49.920 |
You can certainly query it using tools like, you know, Pandas and Polars. All right. So I'll hand it back. 00:14:58.880 |
Yeah, so we're back? Okay. So I just wanted to share some general take home messages about building RAG for 00:15:09.040 |
these sort of large scale domain specific use cases. So the first is that these domain specific challenges 00:15:14.640 |
require very creative solutions around understanding the data and also choosing sort of modeling and 00:15:19.440 |
infrastructure around that. Like I mentioned, try to understand the structure of your data, what the 00:15:23.760 |
use cases are, what the explicit and implicit query patterns are. So definitely spend time with that. Work with 00:15:29.760 |
domain experts and try to sort of immerse yourself as much as possible in that. The second is to make sure you're 00:15:36.240 |
building for iteration speed and flexibility. I think this is a very new technology, very new industry, and a lot of 00:15:41.760 |
things are changing. New tools are coming out, new paradigms, new model context windows and everything. So you kind of want to 00:15:47.120 |
set yourself up for flexibility and iteration speed. And you can kind of ground that in evaluation, where if you have 00:15:54.080 |
good evaluation sets, procedures, or automation around that, then you can iterate much faster and just get good 00:15:59.920 |
signal on whether your systems are good or accurate. So definitely invest time in the evaluation to enable that iteration speed. 00:16:06.960 |
And then, yeah, the third, which Chang covered, is that new data infrastructure has to recognize that there's sort of this new world we're entering 00:16:14.080 |
with multimodal data, a lot heavier on vectors and embeddings. Workloads are very diverse and the scale is just going to keep 00:16:21.760 |
getting larger and larger as we try to sort of ingest and query over all the data that exists, public and private.