
Scaling Enterprise-Grade RAG: Lessons from Legal Frontier - Calvin Qi (Harvey), Chang She (Lance)


Whisper Transcript

00:00:00.040 | All right. Thank you, everyone. We're excited to be here. And thank you for coming to our talk.
00:00:20.580 | My name is Chang. I'm the CEO and co-founder of LanceDB. I've been making data tools for machine
00:00:25.500 | learning and data science for about 20 years. I was one of the co-authors of the Pandas library,
00:00:29.120 | and I'm working on LanceDB today for all of that data that doesn't fit neatly into those
00:00:34.160 | Pandas data frames. I'm Calvin. I lead one of the teams at Harvey AI, working on RAG. Tough RAG
00:00:40.960 | problems across massive data sets of complex legal docs and complex use cases. So yeah, our talk is
00:00:48.240 | about, oh, one sec. Maybe we should have used the other clicker. Yeah, yeah. All right. That's okay. We'll use the laptop. So we're going to talk
00:00:58.240 | about some of the tough RAG problems in the legal frontier, sort of challenges, some solutions
00:01:03.280 | and learnings from our experiences working together on it. So we'll start roughly with like sort of how
00:01:07.440 | Harvey tackles retrieval, the types of problems there are, and then the challenges that come up with that,
00:01:11.520 | all with like retrieval quality, scaling, security, all that good stuff, and then how we end up sort
00:01:16.960 | of creating a system with good infrastructure to support that.
00:01:19.680 | So first of all, a quick intro to what Harvey is. We're a legal AI assistant. So we sell our sort of AI
00:01:29.600 | product to a bunch of law firms to help them do all kinds of legal tasks like draft, analyze documents,
00:01:36.080 | sort of go through legal workflows. And a big part of that is processing data. So we handle data of all
00:01:41.760 | different sort of volumes and forms. The sort of different scales of that are, we have an assistant
00:01:46.480 | product that's like on demand uploads, same way you might like on demand upload to any AI assistant tool.
00:01:51.120 | So that's like a smaller one to 50 range. We have these vaults, which are sort of larger scale project
00:01:57.840 | contexts. So if there's like a big deal going on that the law firm's working on, or like a data room,
00:02:02.160 | where they need sort of all their contracts, all their, you know, litigation documents and emails in
00:02:05.920 | one place, that's a vault. And then the third is the largest scale, which is data corpuses,
00:02:10.720 | which are like knowledge bases around the world. So like legislation, case laws of a particular country,
00:02:15.040 | all the sort of laws, taxes, regulations that go into it.
00:02:18.880 | So yeah, some big challenges that come up, come with that. One is scale, just very large amounts of data.
00:02:29.280 | Some of these documents are like super long and dense and packed with content.
00:02:33.680 | Sparse versus dense, I'm sure it's like a sort of retrieval challenge that all of you deal with,
00:02:38.640 | of how to represent the data, how to retrieve over and index it. Query complexity is a big one. We got
00:02:44.640 | very sort of difficult expert queries. And I'll show an example of that in the next slide.
00:02:48.640 | The data is very domain specific and complex. There's sort of a lot of nitty gritty legal details
00:02:56.240 | that go into it. So we have to like work with domain experts and lawyers to understand it and try
00:03:00.160 | to like translate that into how we represent the data, how we, you know, index, query, pre-process over
00:03:05.600 | it. Data security and privacy is a big one. A lot of this data is sensitive for like confidential deals
00:03:11.760 | or confidential, I don't know, IPOs, financial filing, stuff like that. So we have to respect all
00:03:16.160 | that for our clients. And then of course, evaluation of how to make sure systems are actually good.
00:03:22.560 | So yeah, I'll show a quick demonstration of a retrieval quality challenge. So this is just
00:03:26.720 | on the query side of like, this is maybe the average complexity of a query someone might issue in our
00:03:31.280 | product. There are much more complex and also simpler ones, but this is right in the middle. And you can
00:03:35.680 | see that like, there's a lot of different components that go into this. There's sort of a semantic, well,
00:03:40.960 | to read it out, it's like, what is the applicable regime to covered bonds issued before
00:03:45.200 | 9 July 2022 under Directive (EU) 2019/2162 and Article 129 of the CRR. So, you know, that's a
00:03:52.320 | handful. What goes into it is like, there's a semantic aspect. There's sort of implicit filtering going on
00:03:58.240 | of like, you know, we want applicability before a certain date. There's a specialized data set being
00:04:03.040 | referenced, which is EU laws and directives. There's kind of keyword matches of like the specific, you know,
00:04:09.200 | regulation directive ID. It is multi-part in that it's sort of asking how this applies to two different
00:04:14.880 | regulations, like one directive, one article. And there's like domain jargon here, or this is like
00:04:19.760 | an abbreviation. I forget what it was, capital regulations something. I looked it up this morning.
00:04:24.560 | But yeah, it's very complex. And we sort of need a system that can tackle all this complexity and sort of
00:04:32.160 | break down this query and use all the appropriate technologies for the different parts of it.
00:04:36.960 | And yeah, so one common question we get sort of in response to this complexity is, how do you
00:04:44.960 | evaluate your systems? How do you make sure they're good? And that's actually where we spend a ton of
00:04:49.200 | time. It's, you know, not as much on the algorithms and the fancy agentic techniques, but more like how to
00:04:54.240 | validate them. And I'd say like investing in eval driven development is a huge, huge key to building
00:05:01.120 | these systems and making sure they're good, especially when it's a tough domain that like you don't
00:05:05.200 | inherently know much about as maybe an engineer or researcher. So I say there's no silver bullet eval,
00:05:10.640 | but we have like a whole range of them of like different task depths and complexities. So in sort
00:05:15.360 | of one dimension, you have it being sort of higher fidelity, the more costly. And then the other
00:05:20.320 | direction, it's like more automated evals that are faster to iterate on. So as an example, like the
00:05:26.160 | sort of high fidelity would be like expert reviews of just having them directly review outputs and analyze
00:05:32.640 | them and write reports. So that's like super expensive, but super high quality. And then
00:05:37.520 | maybe something between is like an expert labeled like set of criteria that you can maybe evaluate
00:05:42.960 | synthetically or evaluate in some automated way. So it's still expensive to curate, maybe a little
00:05:46.960 | expensive to run, but more, more tractable. And then the third is sort of the fastest iteration,
00:05:52.960 | which is sort of more automated quantitative metrics like just, you know, retrieval precision recall,
00:05:58.560 | sort of different, more deterministic success criteria of like, am I pulling documents from
00:06:03.360 | the right folder? Is it the right section? Do they have the right keywords in them? Things like that.
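[Editor's note: as an illustration of that fastest, most automated tier, precision and recall at k can be computed deterministically against labeled relevant documents. This is a generic sketch with made-up doc IDs, not Harvey's eval harness.]

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Deterministic retrieval metrics: what fraction of the top-k
    results are relevant (precision@k), and what fraction of all
    relevant docs appear in the top-k (recall@k)."""
    top_k = retrieved[:k]
    hits = len(set(top_k) & set(relevant))
    return hits / k, hits / len(relevant)

retrieved = ["d3", "d7", "d1", "d9", "d4"]  # system output, best first
relevant = {"d1", "d3", "d8"}               # labeled ground truth
p, r = precision_recall_at_k(retrieved, relevant, k=5)
print(p, r)  # 2 of 5 retrieved are relevant; 2 of 3 relevant docs found
```

Because these metrics need only labeled IDs, they can run on every experiment, which is what makes this tier cheap enough for fast iteration.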
00:06:07.200 | And yeah, give you a quick sense also of sort of the scale and complexity on the data side,
00:06:16.160 | not only on the query side. So the data sets we integrate with are pretty massive. As you can see,
00:06:22.560 | we support like, you know, data sets across all different kinds of countries. And for each one,
00:06:27.040 | there's sort of complex filtering and organization and categorization that goes into it.
00:06:31.040 | So we sort of work with domain experts for all of this, but also try to apply automation whenever
00:06:37.200 | possible, like use their guidance to maybe come up with heuristics or LLM processing techniques
00:06:42.720 | to be able to categorize all of this. And I'd say that the performance implications are pretty significant
00:06:49.600 | as well. We need very good performance both online and offline. Online being like querying over this,
00:06:56.320 | you want good latency, and then offline being like ingestion, re-ingestion, running ML experiments for
00:07:01.040 | different variations and such. And I'd say generally one of these corpuses can be like, yeah, tens of millions
00:07:06.800 | of docs. So yeah, pretty large scale. And each document is often quite large.
00:07:11.760 | So I can talk quickly about kind of infrastructure needs to support this. So at this scale, of course,
00:07:22.400 | we, you know, we want infrastructure to be reliable, available for all our users at all times. I'm sure
00:07:29.360 | that's something that, you know, all products need. We also want smooth sort of onboarding and scaling,
00:07:35.920 | where, you know, we definitely want our ML and data teams to be able to focus more on the sort of
00:07:41.440 | the business logic and the quality and spinning up new applications and products for customers and,
00:07:46.960 | you know, not too much about like the nitty-gritty details of the database or tuning that or manually
00:07:52.000 | scaling. And of course, there's always some in between where you you want to have awareness of it.
00:07:55.840 | It can't be like fully 1000% automated. Likewise, we need sort of flexibility and capabilities around
00:08:02.720 | data privacy and data retention. Like I mentioned, with some storage needing to be like segregated
00:08:10.400 | depending on the customer, depending on the use case, sort of retention policies on some docs that
00:08:15.600 | we might only be allowed to store for certain amounts of time for legal reasons. We want good sort of
00:08:20.880 | telemetry and usage around the database. And then, of course, any sort of vector or or keyword or any
00:08:28.320 | filtering database we need, we want to support good performance, query flexibility, scale, especially
00:08:34.000 | for all the different kinds of query patterns I mentioned before, where it's like, you need exact
00:08:38.480 | matches, you want semantic matches, you want filters, you might want sort of to sort of navigate it, maybe,
00:08:44.880 | yeah, agentically or like in some dynamic way. So yeah, all that flexibility is important to us at scale.
00:08:52.240 | And that's where LanceDB comes in. Cool. Thank you. Awesome. So
00:08:59.600 | sorry.
00:09:05.920 | Okay. I'm going to try to hold this here, maybe. So there's no echo. Okay. Yeah. So as I was saying,
00:09:15.760 | and so, you know, I work at LanceDB. And what we are delivering for AI is beyond what I call just a vector
00:09:26.720 | database, but what we call an AI native multimodal lake house. And so if you think about back to maybe
00:09:33.280 | Jerry's talk, right, in addition to search, you also need a good foundation, a good platform for you to do
00:09:41.040 | all of the other tasks that you need to do with your AI data. So this can be feature extraction,
00:09:47.440 | generating summaries, generating text descriptions from images, managing all that data. And you want
00:09:55.040 | to be able to do that all together. So what you really need is sort of this lake house architecture,
00:10:00.000 | where all the data can be stored in one place on object store. You can run search and retrieval
00:10:08.160 | workloads. You can run analytical workloads. You can train off of that data. And of course,
00:10:13.120 | you can pre-process that data to iterate on new features that you can experiment for your applications
00:10:18.640 | and models. Specifically, too, in addition to these large batch offline use cases,
00:10:28.400 | you know, lake house architectures generally are good for that, but not necessarily for online serving.
00:10:34.720 | And this is where LanceDB's distributed architecture comes in. And it's actually good for both offline and
00:10:43.440 | online context so that we can serve at massive scale from cloud object store. We can deliver
00:10:50.000 | compute memory and storage separation. And we give you a simple API for sophisticated retrieval, whether
00:10:58.000 | you want to combine multiple vector columns, vector and full text search, and then do re-ranking on top of
00:11:06.400 | that. Those are all available with an API in Python or TypeScript that feels what folks have told me feels
00:11:16.400 | kind of like pandas or Polars, like very familiar to data workers who are used to data frame type of APIs.
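[Editor's note: one standard way to combine a vector ranking with a full-text ranking before a final re-rank is reciprocal rank fusion. This plain-Python sketch shows the general idea only; it is not LanceDB's actual implementation.]

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs: each list contributes
    1 / (k + rank) to a doc's score, so documents that rank well in
    multiple lists rise to the top. k=60 is a conventional default."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["a", "b", "c", "d"]  # ranked by embedding similarity
fts_hits = ["c", "a", "e"]          # ranked by full-text relevance
fused = reciprocal_rank_fusion([vector_hits, fts_hits])
print(fused[0])  # "a" ranks high in both lists, so it wins
```

A dedicated re-ranking model can then be applied to just the top of the fused list, which is the combine-then-rerank pattern described above.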
00:11:24.480 | And of course, for large tables, we support GPU indexing. So I think our record has been around something like
00:11:32.800 | three or four billion vectors in a single table that can be indexed in under two or three hours.
00:11:40.080 | So all of that is to say, like, LanceDB excels at massive scale. And this is happening at a fraction of
00:11:50.240 | the cost because of the compute storage separation and because we take advantage of object store. And
00:11:59.280 | so of course, and I talked about sort of having one place to put all of your AI data. So this is the only
00:12:06.400 | database where you can put, you know, images and videos and audio track next to your embeddings,
00:12:13.920 | next to text data, next to your tabular data, time series data. You can put all of that in a single table.
00:12:23.120 | And then you can, of course, use that as the single source of truth for all the different workloads
00:12:27.600 | that you want to do on that data, from search to analytics to training and, of course,
00:12:32.480 | pre-processing or feature engineering. A lot of that is possible because of the open source
00:12:40.480 | Lance format that we built from the ground up. So, you know, if you're working with multimodal data,
00:12:46.880 | whether it's documents, you know, PDF scan slides or just large, even large-scale videos,
00:12:52.880 | if you're doing that in, let's say, WebDataset or Iceberg with Parquet, you're missing out on a lot of
00:13:00.240 | features: things like random access, support for large blob data,
00:13:08.080 | or efficient schema evolution. So Lance format, by giving you all of those,
00:13:17.760 | right, it makes it so that you can store all of your data in one place rather than split up across
00:13:24.080 | multiple parts. And so this is the, I would say, like the foundational innovation in LanceDB, where
00:13:33.120 | without it, what we see a lot of AI teams doing is they have to have different copies of different
00:13:38.560 | parts of their data in different places, and they're spending a lot of their time and effort
00:13:42.320 | just sort of keeping those pieces glued together and in sync with each other.
00:13:47.280 | All right. So, kind of, basically, you can think about Lance format as sort of Parquet plus Iceberg plus
00:13:59.120 | secondary indices, but for AI data. And that gives you fast random access, which is good for search and
00:14:05.600 | shuffle. It still gives you fast scans, which is so good for analytics and, you know, data loading and
00:14:11.280 | training. And it's the only one out of this set that is uniquely good for storing blob data, or more
00:14:19.680 | importantly, a mix of large blob data and small, like scalar data.
00:14:25.120 | And by using Apache Arrow as the main interface, Lance format is already compatible with your current
00:14:34.080 | data lake and lake house tools. So you can use Spark and Ray to write very large amounts of Lance data in a
00:14:42.240 | distributed fashion very quickly. You can use PyTorch to load that data for training or fine tuning.
00:14:49.920 | You can certainly query it using tools like, you know, pandas and Polars. All right. So I'll hand it back.
00:14:58.880 | Yeah, so we're back? Okay. So I just wanted to share some general take home messages about building RAG for
00:15:09.040 | these sort of large scale domain specific use cases. So the first is that these domain specific challenges
00:15:14.640 | require very creative solutions around understanding the data and also choosing sort of modeling and
00:15:19.440 | infrastructure around that. Like I mentioned about like trying to understand the structure of your data, what the
00:15:23.760 | use cases are, what the explicit and implicit query patterns are. So definitely spend time with that. Work with
00:15:29.760 | domain experts and try to sort of immerse yourself as much as possible in that. The second is to make sure you're
00:15:36.240 | building for iteration speed and flexibility. I think this is a very new technology, very new industry, and a lot of
00:15:41.760 | things are changing. New tools are coming out, new paradigms, new model context windows and everything. So you kind of want to
00:15:47.120 | set yourself up for flexibility and iteration speed. And you can kind of ground that in evaluation, where if you have
00:15:54.080 | good evaluation sets or either procedures or automation around that, then you can iterate much faster and just get good
00:15:59.920 | signal on whether your systems are good or accurate. So definitely invest time in the evaluation to enable that iteration speed.
00:16:06.960 | And then, yeah, the third, which Chang covered, is that new data infrastructure has to recognize that there's sort of this new world we're entering
00:16:14.080 | with multimodal data, a lot heavier on vectors and embeddings. Workloads are very diverse and the scale is just going to keep
00:16:21.760 | getting larger and larger as we try to sort of ingest and query over all the data that exists, public and private.
00:16:31.840 | Yeah, thanks for listening to our talk.
00:16:33.840 | Thank you for listening to our talk.