[Paper Club] DocETL: Agentic Query Rewriting + Eval for Complex Document Processing w Shreya Shankar

00:00:00.000 |
Okay, thank you everyone for joining us for PaperClub today. Today, we will be talking about 00:00:04.080 |
doc ETL, documentation, or rather document ETL. This is a paper by Shreya, big fan of her work, 00:00:10.560 |
very practical work. And of course, looks like there's another solid Eugene on this paper, 00:00:16.720 |
so should be a great paper. So the entire, I'm going to try to go through it in a few, 00:00:23.280 |
there's a lot of concepts to this paper, there's a lot of examples and a lot of results. 00:00:27.120 |
I'm going to try to run through it very quickly and try to finish at the 12:45 mark, and then we can 00:00:33.120 |
discuss what we learned and what we thought was useful. So essentially, the premise of the paper 00:00:37.920 |
is that we can use LLMs to process documents for us in a fairly hands-off way. But the problem 00:00:47.120 |
is that for fairly complex tasks and data, LLM outputs for what we want to do are fairly 00:00:53.920 |
inaccurate. So a lot of times, for what we want to do, we just write a prompt, right? We just say, 00:00:57.600 |
"Hey, LLM, given this archive, here are these 10 archive papers, find me everything that's 00:01:04.320 |
relevant to hallucinations." We sort of do it in a very crude ad hoc way, every archive paper, 00:01:11.440 |
what is the snippet relevant to hallucinations, we pull it out, and after we try to combine it, 00:01:16.800 |
and then we try to extract some teams from it, we may include citations, et cetera, et cetera. 00:01:20.720 |
So what this paper is trying to do is it's trying to provide formal vocabulary to try to do this, 00:01:25.440 |
and we'll see what the formal vocabulary is. So in a nutshell, they have an agent-based framework. 00:01:31.840 |
Let's put the agent-based framework aside for now, we won't focus on that. 00:01:35.040 |
So what they try to do is that they will define operators, and they will define directives, 00:01:43.280 |
and I'll define what operators and directives are, and then they will try to optimize this. 00:01:48.320 |
So the optimization is really called logical rewriting and agent-guided plan evaluation. 00:01:53.600 |
And of course, optimization algorithm. I mean, I'm oversimplifying it, but all of this is 00:01:58.000 |
essentially trying to optimize the overall pipeline to try to get what you want. 00:02:00.960 |
And they introduce directives and operators as well, which we'll get into right now. 00:02:07.440 |
So there's a lot of intro over here, and there's this very big diagram over here. 00:02:11.120 |
I won't actually go into this right now, I feel like we need some vocabulary before we can go 00:02:15.680 |
into this. We need to understand what the pinks mean, what the greens mean, 00:02:19.440 |
and what the different maps, all these different bolded words mean. 00:02:22.880 |
So I'm just going to jump right into it. I'm going to skip the programming model. 00:02:28.000 |
Essentially what this is trying to do is it's trying to take database concepts and pandas 00:02:33.120 |
dataframe concepts and try to apply them to shapeless documents. Is Shreya on the call yet? 00:02:40.800 |
Not yet. Oh, Shreya is. Hey, Shreya, if I say anything wrong, just stop me and just jump in. 00:02:50.880 |
I'm just starting again to the operators right now. 00:02:54.000 |
So essentially, the operators are fairly straightforward. 00:02:57.680 |
And I have a note over here, it's really trying to think about how can we apply data pipeline 00:03:02.480 |
and database operations to LLM tasks. And the LLM task over here is really documents. 00:03:09.440 |
And so at the very simplest level, we have the map operator. Essentially the map operator, and 00:03:17.600 |
there's this language here that's a little bit not so straightforward, applies an LLM-powered 00:03:22.720 |
projection, known as a semantic projection, to each document in the dataset. What this means, 00:03:27.520 |
and my interpretation of it is that given a document, we create new features out of it. 00:03:32.800 |
So you can look at this example here. Given a document, what are all the instances of 00:03:39.200 |
police misconduct and what's the name of the officer involved and a brief description? Essentially, 00:03:44.080 |
given a PDF document, I want new columns for officer involved and new columns for 00:03:49.600 |
brief description of the misconduct violation. So you can think of map. Map is infinitely flexible. 00:03:55.840 |
Given a document, extract the relevant text. Given a document, generate a summary. 00:04:00.320 |
Given a document, translate it. Given a document, add a classification, et cetera, et cetera. 00:04:04.960 |
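To make the map operator concrete, here is a minimal sketch of an LLM-powered map, assuming a caller-supplied `call_llm(prompt) -> str` helper; the function names and config shape here are illustrative only, not the actual DocETL API.

```python
import json
from typing import Callable, Dict, List

def semantic_map(
    docs: List[Dict],
    prompt_template: str,
    output_keys: List[str],
    call_llm: Callable[[str], str],  # hypothetical: any prompt -> completion function
) -> List[Dict]:
    """Apply an LLM-powered projection to each document: the LLM reads the
    document and fills in new fields (new "columns") for it."""
    results = []
    for doc in docs:
        prompt = (
            prompt_template.format(document=doc["text"])
            + "\nReturn a JSON object with keys: " + ", ".join(output_keys) + "."
        )
        fields = json.loads(call_llm(prompt))  # assumes the model returns valid JSON
        results.append({**doc, **{k: fields.get(k) for k in output_keys}})
    return results

# Usage sketch for the police example:
# rows = semantic_map(
#     docs=[{"text": "..."}],
#     prompt_template="Identify instances of police misconduct in:\n{document}",
#     output_keys=["officer_name", "misconduct_summary"],
#     call_llm=my_llm,
# )
```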
So map is the workhorse of the operators here. And then they also share about parallel maps, 00:04:13.120 |
where you can run maps in parallel. So you can see one prompt could be extracting misconduct, 00:04:18.560 |
while another prompt could be summarizing policies. Essentially, you just do a map 00:04:22.080 |
on every single thing. Then after you do a map, you have to do a reduce. Essentially, okay, 00:04:27.360 |
so let's say I've extracted all the different officer names, all the different incident dates. 00:04:31.840 |
I want to return it as a report. Then we have to do a reduce. So map will spin up a lot of new 00:04:39.440 |
columns. Reduce will take all those columns and try to generate something that's coherent and easy 00:04:44.000 |
for a human to read. And then they also talk about batch folding. Essentially, they talk about, 00:04:49.920 |
you know, sometimes a lot of this is bigger than your context length, and that's why we use the 00:04:54.320 |
map reduce paradigm, where you chunk it up and then you do reduce. But when they reduce, sometimes 00:04:59.440 |
it's larger than the context length of the LLM, so they do folding. They talk about folding and 00:05:04.640 |
hierarchical aggregation, but they apply batch folding. Essentially, imagine here we have these 00:05:13.520 |
six boxes, and then we need to aggregate them. Maybe we aggregate them two at a time, 00:05:17.760 |
and then we do something like that. I think that's all very straightforward. It makes a lot of sense. 00:05:22.160 |
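Here is a minimal sketch of that batch folding idea, again assuming a hypothetical `call_llm` helper: rather than one giant reduce prompt, the mapped outputs are folded into a running aggregate a few at a time so each call stays within the context window.

```python
from typing import Callable, List

def fold_reduce(items: List[str], call_llm: Callable[[str], str], batch_size: int = 2) -> str:
    """Fold mapped outputs into a running aggregate, batch_size items per call,
    so no single prompt needs to hold every extracted item at once."""
    aggregate = ""
    for i in range(0, len(items), batch_size):
        batch = "\n\n".join(items[i:i + batch_size])
        prompt = (
            "You are building a report of police misconduct incidents.\n"
            "Report so far:\n" + (aggregate or "(empty)") + "\n\n"
            "Fold in these additional extracted incidents:\n" + batch + "\n\n"
            "Return the updated report."
        )
        aggregate = call_llm(prompt)
    return aggregate
```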
Now, there's the third operator, which is, in my opinion, a very important operator, 00:05:28.160 |
but a very difficult operator to get right, which is the resolve operator. In a nutshell, 00:05:33.680 |
this is deduplication. You know, swyx has the name of Sean Wang, has the name of swyx, 00:05:38.960 |
has the name of YX Wang. How do we know that they are all swyx? This is extremely, 00:05:44.480 |
extremely difficult. So compare the following two officer records from the police documents, 00:05:49.280 |
right? And then how do we know that they're actually the same name? 00:05:55.440 |
And then, you know, resolution has many different aspects of it. It could be sci-fi, science fiction, 00:06:00.560 |
sci-fi fantasy. Sci-fi fantasy is not really sci-fi, but depending on your context, 00:06:05.920 |
you could consider sci-fi and sci-fi fantasy as the same thing. So there's a bit of domain 00:06:11.600 |
nuance to this as well. I think this is really, really difficult to do. I actually have a... 00:06:17.120 |
So one, several things I've seen in production is that when you do resolving and deduplication, 00:06:25.120 |
if your deduplication is too loose, everything deduplicates to a single huge megacluster. 00:06:30.160 |
And that's when you get... That's when you have serious issues. And so this is actually very, 00:06:37.120 |
very tricky to get right. And in the paper, they actually talk about different ways to do this. 00:06:43.120 |
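One of those ways, the cheap embedding-based approach mentioned just below, might look like this minimal sketch: embedding similarity settles the obviously-same and obviously-different pairs, and an LLM adjudicates only the ambiguous middle. The `embed` and `call_llm` callables and the thresholds are assumptions for illustration, not DocETL's implementation.

```python
import math
from typing import Callable, List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def resolve_names(
    names: List[str],
    embed: Callable[[str], List[float]],   # hypothetical text -> embedding function
    call_llm: Callable[[str], str],        # hypothetical prompt -> completion function
    hi: float = 0.9,                       # above this: assume same entity
    lo: float = 0.3,                       # below this: assume different, skip the LLM
) -> List[Tuple[str, str]]:
    """Entity-resolution sketch: thresholds handle the easy cases, the LLM only
    sees the uncertain middle band, keeping cost down."""
    vecs = {n: embed(n) for n in names}
    matches = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            sim = cosine(vecs[a], vecs[b])
            if sim >= hi:
                matches.append((a, b))
            elif sim > lo:
                verdict = call_llm(
                    f"Do '{a}' and '{b}' refer to the same officer? Answer yes or no."
                )
                if verdict.strip().lower().startswith("yes"):
                    matches.append((a, b))
    return matches
```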
And they actually talk about a cheap way to do this, which we have also implemented as well, 00:06:46.720 |
which is using semantic embeddings and code-based matching, which we will talk about later. But long story short, 00:06:51.920 |
this is very difficult. And then they have other operators, very standard, filter. Essentially, 00:06:58.320 |
given a document, maybe we perform some kind of classification, do we want to filter the document, 00:07:03.280 |
or maybe you do some kind of map to extract the relevant sections and filter everything out. 00:07:07.840 |
That makes sense. Equi-join is, again, I think it's really difficult. In a sense, it's sort of 00:07:17.360 |
like resolve, can you join these two concepts together? And then all of this all relies on 00:07:23.200 |
schema. And then they have a few auxiliary operators. Essentially, auxiliary operators 00:07:27.680 |
means that it doesn't even rely on LLMs. Unnest is one of them. So for example, given a document, 00:07:32.880 |
maybe you have created a lot of... Maybe given a blog post, what are all the papers 00:07:38.320 |
referenced in this blog post? You'll return an array, a list. Unnest basically splits that list 00:07:44.480 |
into individual elements. And then split is essentially, you can think of it as chunking. 00:07:50.560 |
In the most naive way, it could be chunking in terms of number of tokens. In a smarter way, 00:07:56.400 |
it could be chunking based on paragraphs or chapters or sections, or it could be even LLM 00:08:01.920 |
based splitting. And then gather is essentially gathering all the individual chunks. So for 00:08:08.480 |
example, imagine you are summarizing an archive paper. Imagine you're summarizing section by 00:08:13.600 |
section. The previous section was the operator section, and now we're in the rewrite section. 00:08:17.920 |
So it could be that when you're summarizing the rewrite section, you actually do need context 00:08:23.600 |
about the operator section. It mentions certain abbreviations that the LLM would never know, 00:08:30.000 |
but you have to provide that kind of context. So what gather does is provide this peripheral 00:08:34.720 |
information. So it could be as you're summarizing a document section by section, you collect some 00:08:41.600 |
kind of map of all the various abbreviations that have shown up, so that you provide that 00:08:46.960 |
context to the LLM, so the LLM doesn't try to hallucinate what those abbreviations mean. 00:08:51.200 |
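A minimal sketch of split plus gather, under the same hypothetical `call_llm` assumption: document-level metadata (abbreviations, authors) is extracted once, and each chunk's map prompt gets that shared context plus a short summary of the previous chunk.

```python
from typing import Callable, List

def split(text: str, chunk_chars: int = 4000) -> List[str]:
    """Naive fixed-size splitter; a smarter split would use sections or paragraphs."""
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]

def map_with_gather(text: str, task: str, call_llm: Callable[[str], str]) -> List[str]:
    # Gathered document-level metadata, shared with every chunk.
    metadata = call_llm(
        "List the abbreviations (with expansions) and the authors mentioned here:\n"
        + text[:4000]  # simplification: look at the start of the document only
    )
    outputs, prev_summary = [], ""
    for chunk in split(text):
        prompt = (
            "Document metadata:\n" + metadata + "\n\n"
            "Summary of the previous chunk:\n" + (prev_summary or "(start of document)") + "\n\n"
            + task + "\n\nChunk:\n" + chunk
        )
        outputs.append(call_llm(prompt))
        prev_summary = call_llm("Summarize this in two sentences:\n" + chunk)
    return outputs
```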
So those are the basic operators of DocETL. I'll pause here. Any questions about the operators? 00:08:59.360 |
I think it's essential to understand map, reduce, and resolve. Everything else builds on this. 00:09:07.360 |
So this is a classic, not so much a question, more of a comment. My real question was, 00:09:13.760 |
was there a rank operator? And the answer is no. You could maybe compose the rank operator with 00:09:19.600 |
sometimes split, gather, thingamajig, but it's not a formal ranking system. And even before AI, 00:09:30.240 |
I had always had this strong view that filtering, ranking, sorting, all this stuff is kind of like 00:09:38.080 |
the same layer in the API stack, and should be kind of done together. This is my number one 00:09:46.240 |
problem with AI News right now, which is that all filtering is effectively a recsys, and I have to 00:09:56.240 |
rank. Yes. So you filter just based on metadata, like hard numbers, because you don't have any 00:10:05.520 |
other better stuff. If you could create a ranking formula, you would use the ranking formula rather 00:10:09.680 |
than any other filter basis. There is a filter op, right? So in section 2.2, under other ops, 00:10:21.280 |
the filter operation independently retains documents from the input dataset based on 00:10:25.440 |
conditions specified through an LLM prompt. In some sense, it's under ops, if you scroll down 00:10:32.240 |
a little. This filter one seems to, if you can specify conditions through an LLM and retain 00:10:40.960 |
docs, that's a form of ranking, no? That's how I interpreted this one, as opposed to like, yeah. 00:10:49.280 |
Yeah. I mean, I think there's a question of sort of filter, then rank, or rank, then filter. 00:10:53.840 |
I think in my strong view, filter. Yeah. You can think of filtering as in dropping 00:10:59.440 |
items from an array, whereas ranking is me pushing things down or pushing things up. 00:11:03.840 |
So I think it's slightly different. And you know, there's a question for you, 00:11:07.040 |
Shreya. Is there a reason why they're split into operators and auxiliary operators? 00:11:13.120 |
Oh, very simply, we just wanted to put the primary LLM-powered operators in one section, 00:11:20.400 |
the one that people will probably spend a lot of time tinkering on, and the auxiliary ones 00:11:24.480 |
are there to stitch them. So a lot of people want to do, for example, reduce after map, 00:11:29.600 |
well, they want to unnest after map, and then do reduce there. 00:11:37.520 |
A comment about ranking. We're working on a sort operator. We haven't found, 00:11:43.440 |
maybe swyx, we should chat about this. We haven't found a really compelling use case. 00:11:48.720 |
Well, a lot of people's use cases can simply be solved by, you know, applying like embedding 00:11:55.280 |
based similarity to some query vector and then ranking by that similarity. I want to know, 00:12:00.560 |
kind of, I think Swix's case is interesting. Like, in what cases does it make sense to have 00:12:05.760 |
this pairwise comparison interface be the primary way to steer that operator? So, 00:12:12.640 |
maybe like ranking articles in terms of what is most interesting to a human could be such a case. 00:12:23.440 |
I mean, like, yes on the what's, you know, what's most interesting to a human. I mean, 00:12:29.520 |
this is standard Rexis. I'm not convinced that it has to be pairwise. I feel like pairwise 00:12:35.520 |
maybe it's easy to do data entry. So, maybe that's why. 00:12:39.680 |
I'm actually strongly convinced that it should be pairwise and we can debate that and see how 00:12:50.960 |
I think it's just more reliable and stable that way. I think you don't have to go through all, 00:12:56.160 |
I mean, if we do everything pairwise, it's quadratic, right? But we don't have to go 00:13:00.640 |
through all of that. We can sort of do it smartly. I think that, and you know, in Shreya's paper, 00:13:06.320 |
they actually cover how they use embeddings and code-based resolve, right? You can think of it as, 00:13:12.480 |
also, you can apply that if you tweak that a bit, that can be applied to ranking. 00:13:16.560 |
But you can also tweak it a bit. You can say that ranking, we actually have 00:13:20.160 |
confidence levels. If the similarity score is strong enough or weak enough, like if it's really 00:13:25.120 |
good, like 0.9, we know it's strongly related. If it's really bad, like 0.1, we know it's poorly 00:13:29.680 |
related. But then there's a lot of stuff in the middle where we can then use LLM-powered 00:13:34.800 |
ranking. We can go into that a bit, but this paper is huge and I want to try to go through 00:13:39.360 |
as much of it as I can. So now that we've covered operators, 00:13:44.400 |
they propose rewrites. So actually, I have a question for you, Shreya. When you say rewrite, 00:13:51.520 |
do you mean rewriting the pipeline or do you mean rewriting the document? 00:13:57.760 |
What does rewrite in rewrite directive stand for? 00:14:00.240 |
Rewrite, this is a good question. No one's ever asked this. Rewriting the pipeline. So say you 00:14:04.800 |
have a pipeline that is just a map operator. Many people have this pipeline. They have a document 00:14:10.160 |
set and they want to extract a bunch of metadata from each document. Like I have a bunch of legal 00:14:15.040 |
contracts and I want to extract as many clauses of interest as possible from them. That would be, 00:14:20.160 |
you can program that as a map operation, a single map operation to go through every document and 00:14:26.000 |
extract all your fields that you want. When we say rewrite, we acknowledge that this operation 00:14:32.640 |
might not work when you execute it. What if the document is too long? What if you're extracting 00:14:38.000 |
too many fields? It's too hard. So you want to rewrite it into a more complex sequence of operators. 00:14:43.760 |
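A minimal sketch of what "rewriting the pipeline" could look like if a pipeline is just a list of operator configs: a document-chunking rewrite replaces a single long-document map with a split, per-chunk map, and reduce. The config shape here is made up for illustration and is not DocETL's actual format.

```python
from typing import Dict, List

Pipeline = List[Dict]  # each dict is one operator config

original: Pipeline = [
    {"op": "map", "prompt": "Extract every clause of interest from this contract."},
]

def rewrite_with_chunking(pipeline: Pipeline, chunk_chars: int = 4000) -> Pipeline:
    """Document-chunking rewrite: the task stays the same, but one map over a
    long document becomes split -> per-chunk map -> reduce."""
    rewritten: Pipeline = []
    for op in pipeline:
        if op["op"] == "map":
            rewritten += [
                {"op": "split", "chunk_chars": chunk_chars},
                {"op": "map", "prompt": op["prompt"] + " (this is one chunk of a longer document)"},
                {"op": "reduce", "prompt": "Merge the per-chunk extractions into one list."},
            ]
        else:
            rewritten.append(op)
    return rewritten
```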
So is my mental model right that the rewrite directives are somewhat in the realm of 00:14:51.600 |
Right. So now we're going to talk about rewrite directives. So let's again take our example of 00:14:56.720 |
this arXiv PDF. We want to extract what the key themes are. A noob like me would just upload 00:15:01.760 |
it to Claude and ask Claude, what are the key themes? Now we can rewrite that to be smarter. We can 00:15:07.120 |
rewrite it into a map, basically map the entire document into all text. And then we can do split, 00:15:13.840 |
we saw the split operator. And then on the splits, we can do a map extracting the key themes. 00:15:18.720 |
And then we can do a reduce, right? To try to reduce all these key themes. Again, the point 00:15:23.520 |
here is that if we give the LLM smaller chunks, it is better able to pay attention and therefore 00:15:28.000 |
come up with a more rich, more comprehensive and more factual summary. But you can imagine the 00:15:34.320 |
search space of creating such a pipeline is infinite, right? The search space of chaining 00:15:39.040 |
together all the different Lego blocks, of putting together all the different AWS services, is 00:15:42.480 |
infinite. So what they propose is a methodology to help us try to do that semi-automatically. 00:15:51.120 |
So I will focus on all the different things that they do in rewrites, but I'm going to call out 00:15:55.120 |
what I think is absolutely essential. So rewrite directives. The first one is data decomposition. 00:16:01.920 |
So essentially when you have very large documents and there are many documents, and then you just 00:16:07.200 |
have to decompose it. The very standard pattern they share is document chunking, which everyone 00:16:12.720 |
has been doing. This is standard. That said, I actually do want to share this example whereby 00:16:17.680 |
I was asked to help with a pipeline, and I was able to improve downstream metrics significantly 00:16:26.160 |
by 20 to 50% by removing chunking. So every now and then it's good to rethink, hey, do we actually 00:16:31.840 |
need chunking? In such cases, I think Shreya's example here is that it's way beyond the context 00:16:36.480 |
window size, and so you definitely do need it. So you can see they do chunking. So you can map, 00:16:41.680 |
split, gather, map, reduce. And you can see the optimization. There's many, many different ways 00:16:47.120 |
to optimize. I'm just going to pick B here. So you can imagine we could split it and then we could 00:16:53.120 |
gather all of it. After we split it, we gather all of it. But what they're proposing is that we split 00:16:58.000 |
it, we create a summary, and then we create... I can't remember what H stands for. Hierarchical 00:17:06.000 |
information. We create a summary and create hierarchical information. So we enrich the 00:17:11.040 |
split chunks and then we gather it. So you can imagine essentially all pipelines now are LLM 00:17:18.720 |
pipelines. LLM is enriching the data. LLM is cleaning the data. LLM is filtering the data. 00:17:23.600 |
It's going to be a bit lossy. It's going to be a bit stochastic, but I think we will figure it out 00:17:28.080 |
in the next one to two years to get it to a more reliable state. So they actually share some very 00:17:34.160 |
insider ghost knowledge. When splitting a document, there are some kinds of context 00:17:39.360 |
that are very useful. I fully agree with all of this. Document level metadata. What are all the 00:17:42.960 |
different abbreviations? Who's the author, et cetera? Hierarchical information. Summaries of 00:17:47.200 |
neighboring chunks. Summaries of previous chunks are actually very valuable. So the LLM has enough 00:17:51.680 |
context and doesn't make shit up. And then they also have different patterns here, document level 00:17:57.760 |
metadata. Maybe I'll just go through one of them. So example, in this case, before you split it, 00:18:03.120 |
you extract metadata that's relevant to all chunks. So by extracting this abbreviation metadata, 00:18:07.680 |
author metadata, you make sure that all chunks, when you're doing map on all of these chunks, 00:18:12.080 |
you actually have all this context. The LLM has all these contexts and is able to do a more 00:18:15.360 |
reliable job. And then they have many, many different other patterns that you can consider. 00:18:21.440 |
Now, the next thing is multi-level aggregation. So you need to aggregate it. Essentially, 00:18:27.760 |
let's imagine you have hundreds of chunks. You are summarizing a movie and you have all the 00:18:32.720 |
different movie cut scenes. And imagine we don't have Gemini to do this for us and just upload the 00:18:36.560 |
entire movie. Imagine you have to do it. We have to chunk it up. So what they propose here is to 00:18:42.000 |
aggregate, is to chunk it up, aggregating the data at a finer granularity. So essentially, it's like, 00:18:47.280 |
okay, we have seen one, or let's put it another way. Imagine you're trying to summarize a series 00:18:53.920 |
of, oh, wow, this is a bit close. No, imagine you're trying to summarize the Harry Potter movies. 00:18:59.280 |
So you could first summarize the first Harry Potter movie, first summarize every scene in the 00:19:04.240 |
first Harry Potter movie, roll it up to a summary of the first Harry Potter movie, and then roll it 00:19:08.720 |
up to a summary of the entire Harry Potter movies. So that's hierarchical aggregation. 00:19:15.120 |
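A minimal sketch of that hierarchical roll-up, with the same hypothetical `call_llm` helper: summarize the leaves, then repeatedly summarize groups of summaries until a single summary remains.

```python
from typing import Callable, List

def hierarchical_summary(chunks: List[str], call_llm: Callable[[str], str], group_size: int = 5) -> str:
    """Roll summaries up level by level: scenes -> one movie -> the whole series."""
    level = [call_llm("Summarize this scene:\n" + c) for c in chunks]
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), group_size):
            group = "\n\n".join(level[i:i + group_size])
            next_level.append(call_llm("Combine these summaries into one coherent summary:\n" + group))
        level = next_level
    return level[0]
```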
And then they have several things here, like LLM centric improvements. There's this term here, 00:19:19.920 |
gleaning. So it's prompted with the previous inputs and outputs and asked to improve the 00:19:24.480 |
outputs. I mean, to simplify, oh, sorry, Shreya has her hand raised. Oh, I didn't want to interrupt 00:19:30.080 |
you. Sorry. Your point about chunking, I thought was very interesting in that sometimes it's 00:19:37.920 |
beneficial to chunk and sometimes you should not chunk. And we have observed this in a number of 00:19:43.280 |
workloads. And the insight that we've gained is that we will never know when; it's all task-specific 00:19:48.160 |
and data-specific. And we are further convinced that you kind of need some optimizer to explore 00:19:54.880 |
these different choices for you and come up with something reasonable. But anyways, just a meta 00:20:00.080 |
comment on, it's not the rewrite rule that if you apply it, it always works. It's that if you apply 00:20:06.000 |
it, it sometimes works. It sometimes works a lot better. And you need some way to kind of try a lot 00:20:11.600 |
of these things automatically for you. - I agree. Vips, you have a hand. 00:20:16.080 |
- A little follow-up to that point. Somewhere, I think on like the first page where it talks about 00:20:23.200 |
chunking and performance, there was just like a handful of dumps of citations of different papers. 00:20:29.520 |
So like, there was a section, I think on the first page, recent work has shown that LLM 00:20:34.800 |
performance degrades considerably as length increases. Citing a paper, they can be distracted. 00:20:39.600 |
Citing a paper, pay attention more to certain parts, citing a paper, or fail to gain holistic 00:20:45.360 |
understanding. And there's like four citations. So pretty cool if anyone wants to dig into that, 00:20:49.920 |
there's like eight citations on the first page that talk about this. And then another little 00:20:56.480 |
comment actually. So you skipped over a few of the chunking strategies. So there's stuff like 00:21:02.720 |
document level extraction header. I thought that the third one was kind of interesting where you 00:21:06.880 |
have chunk filtering. I haven't seen this approach too often. So just if people haven't read it, 00:21:11.760 |
it's an interesting little one that popped up. So as you chunk, whether it's semantically 00:21:16.320 |
or whatever it is, you can filter out sections. Not many other pre-processing handlers seem to do 00:21:21.840 |
this, but the example here is pretty clear, right? So as you break up chunks, you can filter out 00:21:27.440 |
stuff. So for example, in an arXiv research paper, you can filter out citations and references 00:21:33.280 |
that aren't really relevant to the core information. It's just one of those chunking 00:21:38.000 |
strategies that you don't see too often. So, you know, worth calling it out. 00:21:41.200 |
Yeah. A hundred percent agree. All of these bolded mini headers here, they're all very valuable. I 00:21:47.840 |
just don't have the time to go through them, but you should read through them. I think these are 00:21:51.520 |
practices that they have gained from working on very, very difficult real world problems 00:21:56.080 |
to just get the level of performance up. And you can take a lot of inspiration from this. 00:22:00.800 |
Shreya, you have another hand raised. Oh, a small comment again for the 00:22:04.640 |
police misconduct data. Often we have like thousand page records documents where it's 00:22:10.960 |
just page on page of like random image of like a street sign or something. And if you just drop 00:22:17.360 |
those, your processing, your accuracy just stays the same and you save so much on costs. So that's 00:22:23.200 |
where the filtering came from. Yep. Very valuable. Now they have LLM-centric improvements. This one 00:22:30.080 |
here is gleaning, where the LLM is prompted with the previous inputs and outputs and asked to improve 00:22:34.480 |
it. You can think of it another way: you can imagine you have 00:22:39.040 |
a validator in the loop. Or essentially, like coding: you just copy, "Hey, Claude, here's my error 00:22:45.120 |
message. How do I fix it?" This is very similar: given the previous input and output and an error 00:22:50.480 |
message or some kind of feedback, the LLM gets to improve on it. So you can iteratively 00:22:56.080 |
improve on this. And I think, uh, later we have an experiment result where they try to do this 00:23:01.360 |
four times and you can see how it tapers off. But in some instances, it just keeps 00:23:05.760 |
improving over as many iterations. So how is this done? 00:23:12.640 |
Oh, actually they do have a validator within the loop. Okay. So how this is done. Okay. 00:23:16.960 |
You preprocess, you process it, you get the original map operation, like summary of the 00:23:23.680 |
paper, and then you have a validator, try to evaluate it and provide some feedback. Now you 00:23:29.360 |
provide the previous summary and the feedback and maybe the context and then try to refine on it. 00:23:33.520 |
And then you just improve this many times. So you can imagine now all your SQL pipelines, 00:23:38.480 |
all your data pipelines now have intelligence sprinkled in, where it can map 00:23:43.520 |
on all this. We have a validator in the loop. So users can have a more capable 00:23:48.880 |
model to be the validator. Yeah. Yeah. I think I missed the part about being a validator in the 00:23:55.200 |
loop. Because I thought it was just asked to improve the outputs. Yeah. But again, fully 00:24:00.320 |
agree. I think this cannot be done without an evaluator or validator. It's absolutely essential. 00:24:06.400 |
Then they have this thing, duplicate key resolution. This is the problem, right? LLM outputs are not 00:24:14.400 |
canonicalized. So for the same person, swyx, they may have very many different 00:24:20.480 |
names out there. So you have to canonicalize it. So they take semantic equivalent values of the 00:24:27.040 |
key. And I think they try to reduce the search space in the paper. I can't remember where it's 00:24:33.360 |
mentioned, but they try, maybe they try to reduce the search space with embeddings and then try to 00:24:36.720 |
resolve this. Long and short of it, this is a very, very hard problem. And you have to do this 00:24:43.440 |
with fairly high precision. If not, you could be saying that Eugene Cheah and Eugene Yan are actually 00:24:48.560 |
the same person. And Eugene Yan maybe just raised $10 million when it's really Eugene Cheah. I actually 00:24:54.080 |
don't know how much Eugene Cheah raised. I'm just making an example where precision is quite 00:24:58.000 |
important. Vips, you have a hand raised. Yeah. So when I read this section about 00:25:03.760 |
gleaning, I saw the same thing. So I understood basically on the page you're at that there's a 00:25:10.000 |
validation model. So there's a validator, but then it also says that that first paragraph, 00:25:15.760 |
they employ a separate validator and data generation. So go on below this next one. Yeah. 00:25:21.040 |
I didn't really see where this data generation LLM is, but I understand that you do some 00:25:26.000 |
pre-processing, you have a validator to check how it worked. Then you have a number of steps to do 00:25:32.240 |
this, do your whatever processing, validate, process, validate. But I don't get where this 00:25:38.320 |
data generation LLM is. And then it seems like data generation is just, that's the only reference 00:25:43.840 |
of it. So I didn't know if this is like, I just didn't see it too much in this. 00:25:48.320 |
No, that's ambiguity on our part. I just meant the data processing LLM, the normal LLM that you use. 00:25:54.240 |
So how does a traditional operation work? Look at figure four, you just have the output that comes 00:26:00.080 |
from the data processing LLM, nothing more. What gleaning does is it adds that validation agent, 00:26:05.520 |
and it creates a loop of some finite number of iterations or whether the validation agent said 00:26:11.680 |
it was good. So you can just bound the number of times it refines, and then the final output, 00:26:16.320 |
that's all. So is it using that original step again? So it says it's a separate 00:26:22.080 |
validator and data generation model. You can specify two different models. You could also 00:26:29.600 |
specify the same model architecture, but the point is the validator agent has a different prompt 00:26:35.120 |
that has a specific instruction of, does this contain all the officers or instances? 00:26:40.720 |
And the validator agent's job is to say yes or no with optional feedback. The data processor LLM's 00:26:47.280 |
job is to give the answer. That's why we have a distinction between the two. 00:26:55.520 |
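A minimal sketch of that gleaning loop, assuming hypothetical `process_llm` and `validator_llm` callables (they can be the same model behind different prompts): the validator answers yes or no with optional feedback, and refinement is bounded to a fixed number of rounds.

```python
from typing import Callable

def gleaning(
    document: str,
    task_prompt: str,
    process_llm: Callable[[str], str],     # data-processing LLM
    validator_llm: Callable[[str], str],   # validation agent with its own prompt
    max_rounds: int = 4,
) -> str:
    """Generate, have a validator critique, then regenerate using the feedback."""
    output = process_llm(task_prompt + "\n\nDocument:\n" + document)
    for _ in range(max_rounds):
        verdict = validator_llm(
            "Does this output contain all the officers and instances of misconduct "
            "in the document? Answer YES, or NO followed by what is missing.\n\n"
            "Document:\n" + document + "\n\nOutput:\n" + output
        )
        if verdict.strip().upper().startswith("YES"):
            break
        output = process_llm(
            task_prompt + "\n\nDocument:\n" + document
            + "\n\nPrevious output:\n" + output
            + "\n\nValidator feedback:\n" + verdict
            + "\n\nImprove the output using the feedback."
        )
    return output
```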
And Shreya also has an interesting point that it's much easier to verify the output than generate it. 00:27:04.080 |
It depends on the task. For classification tasks, yes, it's easy for factuality or 00:27:09.440 |
comprehensiveness or relevance, whatever. It's a lot harder. 00:27:13.120 |
A lot of the synthetic data gen papers like Orca 3, WizardLM, they show that 00:27:18.800 |
models are better at verifying output than generating output. So you have a weaker... 00:27:24.800 |
Yeah. So it's shown in other work too, but interesting note. It does depend on the task. 00:27:31.920 |
So next we have projection synthesis. Well, this is tricky. I think what it means is that it's 00:27:40.080 |
hyper-pipeline optimization. So essentially we don't really know what is going to be good. 00:27:46.560 |
So you can imagine that we have very many different ways on how to do this. 00:27:52.160 |
And we can find that agents are... It's really hard for... If you have a pipeline generating agent, 00:28:00.000 |
it's really hard for an agent to try to figure out which pipeline is going to work in the first 00:28:04.880 |
place. And there are infinite ways that you could build this pipeline. So they propose several ways 00:28:14.640 |
to do this. And of course, the first thing is chaining. It's like iteratively chain LLM calls 00:28:20.560 |
to simplify the task. Essentially, this is what people say. Break up your task so that every 00:28:24.800 |
prompt only does a single thing. So it could be extracting entities, summarizing key points, 00:28:28.720 |
and then generating recommendations. Another one could be isolating. So for example, maybe 00:28:34.000 |
a question would be what sections of this paper talk about hallucination? 00:28:38.560 |
And then so you could reduce the scope of work to be done on the doc, whereby you could just 00:28:44.000 |
classify the documents, maybe just classify sections, classify paragraphs as related 00:28:48.240 |
documents, hallucination or not, and then just flag it up. So there's a lot there. 00:28:54.880 |
And I won't go through all of this. But anyway, that is the end of the rewrite section. Any 00:29:01.840 |
questions there on rewriting? Because now we're going to go into the optimization step. 00:29:09.040 |
I'm happy to explain projection synthesis if that's helpful. 00:29:13.600 |
Yeah. So this is very similar to the chain of thought mindset that you have in LLM prompt 00:29:22.400 |
engineering, where if you have a very complex task, often what can help is if you break that 00:29:28.400 |
down into two LLM calls. Now, you can generalize this to say all this means is you have a map 00:29:37.200 |
operation that does partial work and then your original operation that does the task to maybe 00:29:46.080 |
make it easier. And you could do this for any operation. You can have a map before another 00:29:51.760 |
map. You can have a map operation before another reduce, before another filter. In this police 00:29:57.200 |
misconduct case, we've seen this very helpful around whittling down, like summarizing the 00:30:03.520 |
document into the relevant points needed to then answer the hard question of misconduct. 00:30:09.200 |
So you'll take a very long document and the first map operation that's synthesized will just 00:30:14.960 |
extract information that is relevant to police officers at all, making it a smaller document. 00:30:22.560 |
And then summarizing the instances of misconduct is much simpler to do on that very focused 00:30:29.360 |
document. So that's an example of projection synthesis. And projection means map in the 00:30:36.640 |
data world. So you can apply that to any operation. And you can have the LLM write that. 00:30:42.720 |
You can have an LLM take a task and try to break it down into multiple steps and synthesize that. 00:30:52.320 |
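A minimal sketch of projection synthesis on the police example, with the same hypothetical `call_llm` helper: a synthesized first map whittles the long record down to officer-relevant passages, and the original operation then runs on that much smaller projection.

```python
from typing import Callable

def chained_misconduct_summary(record: str, call_llm: Callable[[str], str]) -> str:
    # Synthesized projection: keep only the passages relevant to police officers.
    relevant = call_llm(
        "From this record, copy verbatim only the passages that mention police "
        "officers or their actions:\n\n" + record
    )
    # The original, now easier, operation runs on the smaller focused text.
    return call_llm(
        "Summarize every instance of police misconduct, with officer name, date, "
        "location, and what happened:\n\n" + relevant
    )
```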
Any questions on rewrite directives? Okay. So now we're going to go into the optimization aspect. 00:31:04.080 |
So the key aspect here in the optimization aspect is that, okay, you sort of declare 00:31:09.840 |
an initial pipeline. And then DocETL will try to run that pipeline, get initial evaluation scores, 00:31:16.880 |
and then it will try to optimize that pipeline for you. So the two key agents here, that's the 00:31:21.440 |
generation agent. So what a generation agent does is that the generation will actually generate 00:31:26.160 |
rewrites. So when we talk about generating here, the generation agent is going to generate the 00:31:30.800 |
rewrites and the alternative pipelines, the pipeline options. And then you have validation 00:31:35.520 |
agents. And again, this is another example, right, of how evals and validation are just 00:31:41.920 |
fundamental to this. Because without the validation agent, you would not be able to 00:31:45.600 |
assess and measure the different pipelines, how the different pipelines do, and then how 00:31:51.680 |
to compare between them. So there's two things. You generate pipelines. Second, you evaluate each 00:31:56.720 |
step of the -- maybe you evaluate each step of the pipeline and you evaluate the overall pipeline. 00:32:01.120 |
And then we try to see which one performs well. So there's a lot of algorithms here. But I find 00:32:09.360 |
that the paragraphs here actually explain it. I found it far easier for me to understand. 00:32:14.480 |
So essentially, you can imagine that there's a pipeline, right? You have a pipeline. Again, 00:32:19.200 |
let's just stick to the police document, police example pipeline, all the different 00:32:23.040 |
examples of misconduct. So what you first do is you use a validation agent to create a custom 00:32:30.320 |
validation prompt, right? And I don't know if we do need inputs and labeled outputs here. 00:32:36.560 |
Do we actually need inputs and labeled outputs here to get the validation prompt? 00:32:41.840 |
Nope. I mean, you could -- we don't support that because nobody has ever come to us with 00:32:46.880 |
labeled examples. The other thing that's hard is sometimes there are newly synthesized operators 00:32:52.240 |
as part of rewrite rules. And those are certainly not going to have any labeled examples. 00:32:58.240 |
But one of the things we are exploring is we have an interface for doc ETL and we're exploring 00:33:02.400 |
kind of interactive optimization of pipelines. So if we synthesize a new operator, we might ask 00:33:06.880 |
a human to label some examples and then use that for the validation prompt. But that's, you know, 00:33:13.280 |
in the works. I think that's a good idea. And that's the entire crux of AlignEval, 00:33:18.000 |
which I shared with you. I'm of a slightly different take. I feel like we do need some 00:33:23.120 |
seed set of human-labeled data. So first thing, we use the validation agent. We have a custom 00:33:29.120 |
validation prompt. And then what a validation agent will do is sample some outputs from the 00:33:36.160 |
initial pipeline or from chunks of the initial pipeline to see if there's room for improvement. 00:33:40.000 |
If there is room for improvement, what we do is we rewrite -- they have rewrite rules where 00:33:46.880 |
they apply to the subpipeline or the individual operation or the entire pipeline itself. 00:33:51.280 |
So then what that means is that -- and then they have this thing we should call recursive 00:33:55.360 |
improvement. In the sense that when you create new optimizations for a rewrite, 00:34:00.240 |
you immediately optimize them before continuing the rest of the optimization. So what it means 00:34:06.000 |
is that from a pipeline, imagine a pipeline goes from left to right, you start upstream, 00:34:10.480 |
iterate on that, and then you go downstream. And this can take as much time as it needs. Compute is free. You 00:34:14.960 |
can just run it overnight. And of course, it just costs a few hundred dollars. But that's fine 00:34:19.760 |
compared to how much human time it would take. So now you have multiple candidate plans. Let's 00:34:25.840 |
just say you have 10 candidate plans. And then you take all these 10 candidate plans. You first 00:34:30.880 |
execute each plan on a sample of the data. You don't want to execute it on all your thousands 00:34:34.960 |
of documents. You just execute on maybe 50. And then use the validation agent to rate each output. 00:34:39.680 |
So now after you've rated this sample, these different sample plans, 00:34:45.440 |
then you take the top K. And then you do, I think, full evaluation. Oh no, it does pairwise 00:34:51.440 |
comparisons across these top plans. And the plan that wins most is the winner. And of course, 00:34:56.160 |
you select the optimized plan to do that. I think let's go through an example. I think an example 00:35:02.480 |
here is helpful. So here is the police misconduct dataset. So the current pipeline is this, right? 00:35:18.400 |
The team has a domain-specific clustering algorithm and then human annotation to de-duplicate officer 00:35:24.240 |
names. And this is the baseline pipeline that the team has. So what they did was that doc ETL 00:35:33.360 |
synthesized various pipelines. The first one was to extract misconduct summaries. 00:35:38.240 |
Extract the name and the summary of the misconduct before trying to de-duplicate 00:35:46.080 |
on the officer name, before trying to resolve the officer name. And then it summarizes it. 00:35:52.480 |
Then you can see here's a new... Was this synthesized by DocETL, Shreya? DocETL-T? 00:36:01.760 |
Good question. So we are redoing this entire eval, by the way. So we're submitting this to VLDB 00:36:09.280 |
December 1st. You'll see an entirely new eval for the next version. But for this case study, 00:36:16.560 |
the DocETL-S, -T, -O are all candidate plans from DocETL's optimizer. And when we were exploring 00:36:27.680 |
them, we were like, "Oh man, I don't even know which one is going to be better." Truly. You 00:36:31.920 |
read them, you have no idea. So we compared them all. DocETL-O was the one that was actually 00:36:37.520 |
selected. So you can see that they generated several plans. I don't know how many hundreds 00:36:42.800 |
of plans they generated. The author evaluated 50. I recall they had certain numbers here. 00:36:55.120 |
So okay, yes. Okay, shoot. I don't know. Okay. So they had 220 documents and the baseline was $2, 00:37:07.520 |
while DocETL-S, -T, -O each cost a little bit more money. Running the optimizer, 00:37:12.160 |
they used the optimizer. I don't know how many plans the optimizer generated, but it cost 00:37:16.720 |
approximately $100 and less than half an hour. Just go get lunch and you come back and you get 00:37:21.120 |
your optimized pipeline. And the optimization cost was just $100. How many pipelines? Oh, 00:37:27.360 |
perfect. Here it is. It generated 200 pipeline variants and all of this was 00:37:32.000 |
just evaluated with the validator prompt. Yeah. So I think that's an example of how 00:37:38.720 |
the optimization step runs. During a bunch of pipelines, validate it and optimize the 00:37:45.040 |
subcomponents within the pipeline and the pipeline itself. So now that we've gone through this, 00:37:51.040 |
we can now discuss figure one. We had to go through a lot of it, but I think now we can 00:37:57.680 |
discuss figure one. Figure one is you can imagine all of these are police documents and we want to 00:38:02.640 |
extract the names of officers and the misconduct. So you can see this is the user-defined map, 00:38:09.440 |
right? Okay. And maybe if I'm a user, my initial prompt is very basic. Given a PDF, 00:38:16.080 |
just extract any cases of misconduct. So now we apply the rewrite directives. 00:38:21.280 |
We apply rewrite and the rewrite, of course, we have the baseline, which is no change. 00:38:25.280 |
We can do projection synthesis, which by we extract name, extract summary, 00:38:30.000 |
and then after that do another one. Or we could do decomposition where we split the document up 00:38:34.160 |
together and map. And then all of these, all the pink boxes are where the agent will automatically 00:38:41.440 |
rewrite it. And then the green boxes are when the plan is selected and NC here stands for NSYNC. No, 00:38:48.320 |
I mean, it stands for no change. Essentially, if it's no change, we just accept it. 00:38:52.160 |
And you can see, we just go through from left to right. And that is just for extracting any 00:38:58.960 |
cases of misconduct. After you've extracted these cases of misconduct, you can see Officer A, 00:39:04.160 |
Sergeant B, and Officer C, then now we need to reduce it. And again, we apply the same thing 00:39:10.160 |
down the stream. So we start from left to right. Every time we generate something new, 00:39:15.600 |
we validate it. And if it passes, if it's good enough, we move on to the next one. 00:39:20.320 |
And so it just iteratively goes down the entire stream. So now that is the section on operators, 00:39:27.040 |
rewrite directives, and optimization. And that's all summarized in this document. Any questions? 00:39:34.400 |
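A minimal sketch of the plan-selection step just described: run every candidate plan on a small sample, have the validation agent rate the outputs, keep the top few, and let pairwise comparisons pick the winner. The `run_plan`, `rate_output`, and `compare_outputs` callables are hypothetical stand-ins for executing a pipeline and for the validation agent's prompts.

```python
from typing import Callable, Dict, List

def select_plan(
    plans: List[Dict],
    sample_docs: List[Dict],
    run_plan: Callable[[Dict, List[Dict]], str],   # execute a candidate plan on the sample
    rate_output: Callable[[str], float],           # validation agent: score one output
    compare_outputs: Callable[[str, str], int],    # validation agent: 0 if the first wins, 1 otherwise
    top_k: int = 3,
) -> Dict:
    """Sample, rate, keep the top-k plans, then pick the one that wins the most
    pairwise comparisons."""
    outputs = {i: run_plan(p, sample_docs) for i, p in enumerate(plans)}
    scores = {i: rate_output(out) for i, out in outputs.items()}
    finalists = sorted(scores, key=scores.get, reverse=True)[:top_k]

    wins = {i: 0 for i in finalists}
    for a in finalists:
        for b in finalists:
            if a < b:
                winner = a if compare_outputs(outputs[a], outputs[b]) == 0 else b
                wins[winner] += 1
    return plans[max(wins, key=wins.get)]
```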
I can answer RJ's question. It's easier than typing it out. 00:39:42.480 |
Yeah, so RJ asked a question. I get how you can generate good evaluation prompts for the 00:39:46.240 |
individual operators, but why do we trust an end-to-end auto-generated trust? Oh, sorry, 00:39:51.200 |
end-to-end auto-generated judge. I too was very skeptical, but then I saw the kinds of tasks that 00:39:57.120 |
the journalism team here at UC Berkeley was running, which was extract all cases of misconduct, 00:40:03.280 |
where they have defined misconduct in the prompt. Very precision, recall-oriented tasks here, 00:40:09.440 |
which you can have an LLM judge pretty unambiguously determine whether one output 00:40:15.200 |
is better than another. Did one output extract more names than another? LLM is just super good 00:40:20.880 |
here. So I think what really sold me was having an optimizer like this that off the bat improved 00:40:30.000 |
performance for the actual people who are working on these tasks, who are going to be authors in 00:40:34.800 |
the eventual submission, just because they've contributed a lot to helping out with the 00:40:39.200 |
evaluation here, kind of guiding what they find useful. But once you kind of see it work, 00:40:46.480 |
it's kind of hard to go back. I know Eugene Yan also has experience with LLM as judge here. 00:40:52.480 |
I strongly believe it could work. I mean, last year, this time last year, or maybe a few months, 00:40:59.440 |
like in June last year, I didn't think it could work, but I've been debating it a lot. I've been 00:41:02.960 |
trying a lot. I think when we simplify it to binary classification metrics, I think it can work. 00:41:06.960 |
And I think a lot of things can be simplified to binary, like Shreya mentioned precision, 00:41:10.400 |
I think a lot of things can be simplified to binary classification metrics. And I've seen 00:41:13.520 |
evidence of it working. Well, you may have to apply chain of thought, you may have to include 00:41:16.640 |
a few few-shot examples, you may have to do a bit of self-consistency or ensembling. But I think 00:41:20.480 |
the technology is getting there. Eugene Cheah, you have a hand raised. Yeah, I think the other 00:41:25.520 |
important aspect to consider in this scenario is that what is the failure condition? So in this 00:41:30.480 |
case, right, if let's say this pipeline fails, right, it still reaches the journalist and 00:41:36.080 |
becomes a human in the loop kind of situation. Like the failed documents can be ignored safely, 00:41:41.360 |
because they achieve their goal with the successful documents. In the case of using this 00:41:47.840 |
pipeline in a more Q&A, agent style for helpdesk, right, a failed answer impact is a lot more 00:41:55.680 |
drastic. So it depends on the use case. Agree. Vibhu. 00:42:02.880 |
And a quick comment there. So you mentioned you've seen examples of LLM as a judge working 00:42:09.600 |
if you break it down to a binary problem. And then you might need to include a few-shot 00:42:13.760 |
examples and stuff. And you said ensembling. Can you explain a bit more about what ensembling 00:42:20.880 |
helps for LLM as a judge? Yeah, when I talk about ensembling, 00:42:26.720 |
the main ensembling I'm thinking about is, oh, gosh. 00:42:32.880 |
I wish I could find a paper as easily as… No, just at a high level, you know. 00:42:44.160 |
Yeah, the PoLL paper essentially talks about just combining multiple, using multiple weaker LLMs. 00:42:52.400 |
You can get as good as an LLM judge as GPT-4. This is the paper from… 00:43:01.280 |
Okay, so is the benefit there trying to save on performance of a large model or, like, 00:43:11.920 |
optimization that you can't run a large one? Like, mixture of agents kind of shows how, 00:43:16.240 |
from together, shows how you can use mixture of agents of small models to match performance of 00:43:21.920 |
a larger model, which we've deployed because it has to be on-prem. So ensembling models 00:43:26.800 |
leads to smaller models being able to do efficient judging than larger models. Is that the takeaway? 00:43:35.840 |
Yes, that's correct. So in this paper here by Cohere, what they did was they have a reference 00:43:39.920 |
model. And this reference model is GPT-4. And then essentially what they did was they 00:43:45.440 |
ensemble Command R, Haiku, and GPT-3.5. And I can't remember what the… I think the ensemble 00:43:52.320 |
was just majority voting. And they were able to have higher Kappa score, higher correlation 00:43:58.640 |
to humans compared to the strong reference, the large language model. 00:44:06.080 |
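A minimal sketch of that kind of ensembling: several small judges each give a binary verdict and the panel's majority vote is the final judgment. The judge callables are hypothetical, and the actual PoLL paper's models and aggregation details may differ.

```python
from typing import Callable, List

def panel_judge(question: str, answer: str, judges: List[Callable[[str], str]]) -> bool:
    """Majority vote across several weaker LLM judges instead of one large judge."""
    prompt = (
        "Question: " + question + "\nAnswer: " + answer + "\n"
        "Is the answer correct? Reply with exactly YES or NO."
    )
    votes = [j(prompt).strip().upper().startswith("YES") for j in judges]
    return sum(votes) > len(votes) / 2
```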
Yeah. And I think it makes sense that, you know, all these smaller calls can be parallelized, 00:44:10.480 |
and then you can… the ensembling, yeah, so… And it's fast and cheaper. 00:44:14.160 |
So sometimes they can. Sometimes they can't be parallelized. For us, we see a lot of 00:44:17.680 |
parallel… not parallel, sequential chaining adds significant performance. 00:44:24.320 |
It just adds performance bottleneck in terms of time and latency end-to-end. 00:44:29.200 |
But like, it's a necessary trade-off for what we've… 00:44:32.240 |
Yeah. Okay. Someone asked, can you link to this paper? I don't know if I can link to this paper. 00:44:38.960 |
You can… if you've been in our paper clubs, Eugene Yan just does not multitask well. As you 00:44:47.360 |
saw while I was struggling to find his paper. But I've pasted you the link. If you search it up on 00:44:50.640 |
Google, I'm sure you'll link to the archive. I've linked it. 00:44:53.600 |
Okay. I did have a quick high-level question, since we're approaching the end. More so for 00:45:01.360 |
Shreya. So if you can give a sort of high-level, how we go from datasets to extracted whatever. So 00:45:09.280 |
currently you can break up documents by chunking, extracting stuff. There's some basic filtering. 00:45:15.120 |
Just high-level, what happens here? And then in that optimization step, it didn't really click 00:45:20.000 |
for me where you have the cost go from a couple dollars to $100 and you're generating a bunch of 00:45:25.840 |
pipelines to optimize this stuff. But at a high-level, one, what is the doc ETL? And then 00:45:32.080 |
two, what are these optimizers doing? I'll answer the second question. The 00:45:39.760 |
optimizer's goal is to rewrite the user's pipeline into something that is more accurate. 00:45:45.360 |
And in this process, the optimizer explores a lot of plans using LLMs to create those plans, 00:45:52.720 |
to verify, to evaluate those plans. So that's where the cost goes up. The reason it was $100, 00:45:58.960 |
if I ran the optimizer with GPT-4o mini as the LLMs, it would be $10. But we used GPT-4o just 00:46:07.920 |
because I think we did this at a time where mini hadn't come out yet. Anyways, but LLMs are 00:46:15.760 |
trending to be cheaper and cheaper. This project is a very high budget project, but I think that 00:46:20.320 |
in the future as LLMs are much cheaper, as open source LLMs get much better, the cost should not 00:46:27.920 |
be an issue for such kind of approach. And then your first question was, what is the high-level, 00:46:35.200 |
what's the vision for the project? - No, I mean, just what's the input/output? 00:46:40.400 |
I, from the title, when I first started reading it and went through all this section one, two, 00:46:45.600 |
three, I was like, I thought it was more so single document to better extraction, 00:46:52.240 |
like pre-processing a document for a RAG system. And then once I get to the end, so my whole 00:46:57.440 |
background was, let's say I've got 10,000 documents that I want referenced in a RAG system. 00:47:04.080 |
Am I doing this on all of them? And then it turned into, you can throw this at a data set of a couple 00:47:09.520 |
of hundred docs. So like, there was a bit of a jump there. - Yeah. So it's, this is very different 00:47:14.960 |
from traditional RAG or Q and A or document processing for a chatbot. Like the kinds of 00:47:20.560 |
queries that people want to use DocETL for can be expressed as ETL; it's a 00:47:27.520 |
sweep-and-harvest kind of thing: I want to look at my entire data set. I would hire an analyst to look 00:47:35.200 |
at my entire data set and then just tell me things, generate me reports that I can then learn 00:47:41.920 |
from. So for example, this officer in misconduct thing is much more digestible than say looking at 00:47:47.040 |
all these millions of PDFs. So in that sense, I mean, you can certainly apply chunking or like 00:47:53.840 |
maybe a strategy used for processing one of the documents could be used for RAG systems, but the 00:47:59.840 |
focus is not RAG. It's like building a kind of semantic layer on top of your unstructured data 00:48:05.760 |
is the way that I like to think about it. - Got it. Got it. Thank you. - Okay, cool. And I know 00:48:14.320 |
we don't have time to go through the examples at the end, but definitely do go through the examples 00:48:18.000 |
at the end. And you can see, okay, let's just, for the police example one, you can see they had three 00:48:22.720 |
binary criteria. Did this binary criteria come from the team that was doing the work? - Yes. 00:48:28.560 |
- Okay. Like whether each officer name referred to a real person, if the summary included dates, 00:48:32.960 |
whether each identified misconduct was- - This is a very weak evaluation. So part of the reason we 00:48:39.440 |
want to rerun this evaluation is just this task here doesn't have any ground truths. So when you 00:48:44.880 |
ask the people, like, what do they care about? It's they look at some of the outputs, they look 00:48:49.440 |
at their data, you know, and they're like, oh, the officer names are fake. So that becomes a 00:48:54.000 |
criteria for them. 'Cause sometimes it'll be officer A, officer B, officer C, that's not real, 00:48:59.600 |
that's not what they want extracted. They'll read some of the summaries and say, you know, 00:49:03.920 |
some of them aren't exhaustive. I want them to include date, location, names, summary of what 00:49:09.120 |
happened. So I think like the criteria emerges from looking at the data and it's not necessarily 00:49:16.240 |
a reflection of like how well the LLM did this task. - Gosh, I think Shreya must have seen this 00:49:24.160 |
red post-it note and preempted me. - But that's the question, right? From an academic standpoint, 00:49:31.600 |
this is an underwhelming evaluation because there's no ground truths, right? So we're redoing 00:49:36.880 |
our evaluation to be on datasets where we actually have ground truths from human annotators, but 00:49:42.640 |
that's just not a practical setting. Like nobody's coming to DocETL with the ground truth. - That's 00:49:48.000 |
why we do AlignEval, to get them to provide ground truth and give me five. And you can imagine, 00:49:52.880 |
right, let's just say police misconduct. You can imagine there's two officers, Eugene Cheah and 00:49:56.400 |
Eugene Yan, and then we resolve them to Eugene. And then now it becomes, the risk is higher, 00:50:04.160 |
right? Eugene Cheah or Eugene Yan conducted some offense. So that's the problem where resolution 00:50:09.840 |
is very tricky. I think somewhere you mentioned resolution, you have a metric of recall. I 00:50:16.640 |
actually think precision is more important, though it depends on your use case. I actually think 00:50:21.200 |
precision is actually something that we want to be careful about. - Yeah, we do measure precision in 00:50:29.520 |
the updated eval that is not on arXiv. - So yeah, they also have two other examples here. 00:50:38.160 |
Again, very valuable. This is really people doing the work, sharing the work they did, 00:50:42.640 |
what were the pain points and how they improve on it. Just definitely do read this end-to-end. 00:50:50.240 |
You'll definitely learn a lot. And you can see this is the number of, again, this is the number 00:50:53.840 |
of iterations. You can see it just keeps getting better even on the fourth iteration or the third 00:51:00.320 |
iteration, though it may be expensive, whereas on some things like distinct games, maybe after one 00:51:05.600 |
or two iterations is good enough. Again, it depends on your cost and how much latency you want to 00:51:09.920 |
incur. So yeah, I think that's all key takeaways. We can think about the operations we do as a few 00:51:19.680 |
key operators, map, reduce, resolve, filter, rank, maybe. We can think about how we want to rewrite 00:51:25.840 |
these operators together in a different way. And we also have a methodology on how to optimize 00:51:31.280 |
these rewrites by generating pipelines. And of course, we need a validator to validate every 00:51:38.560 |
step of the way. Shreya, anything I missed? Any key takeaways that I missed? 00:51:46.560 |
And thank you for joining us. - Thank you all for spending so 00:51:49.600 |
much time on this work. This work has been two years in progress, by the way. 00:51:53.280 |
- Really? Wow, I didn't know it was that long. - Oh yeah, and even longer. 00:51:59.920 |
- I'm in your bucket now. Yeah. I'm super excited to see the full paper. 00:52:04.800 |
- The EvalGen and all my other papers actually were part of this broader project. I've been 00:52:13.440 |
working on this for so long that my advisor was like, "What are you doing? Let's come up with 00:52:20.160 |
something." So I was like, "What's our biggest problem right now?" About a year ago, a year and 00:52:25.040 |
a half ago, I was like, "Validation. We need to come up with these validation agents. I don't 00:52:29.760 |
know how to do it." So that's how EvalGen came about. Yeah, I love the EvalGen paper. I mean, 00:52:34.960 |
I built an entire app inspired by it. But Eugene, do you have a hand raised? 00:52:39.040 |
- Yeah, so I'm just going to do the shout out for the three plus one, I guess, for newer folks who 00:52:45.200 |
are not aware and also for the recording. The amount of meta in this room and the active 00:52:49.680 |
participants is ridiculously high on this topic of RAG / LLM-as-judge / evals. Shreya is the paper author and 00:52:56.720 |
has been doing multiple banger papers on this, top tier papers on this topic regarding RAG and evals 00:53:03.840 |
and everything in between. If you want a list of her writing, you can probably just ask Eugene 00:53:08.800 |
Yan and he'll produce a list for you. Eugene Yan himself undersells himself, but he writes 00:53:15.280 |
extensively on LLM-as-judge and helps teams build these solutions in the world's largest bookstore. So 00:53:22.320 |
he's probably seen some of the biggest production use cases for this kind of deployment. Vibhu 00:53:28.640 |
works at MindsDB, which is all about embedding search and enterprise RAG solutions built on top 00:53:33.120 |
of it and has seen a lot of real world productions for all MindsDB clients. So he provides more 00:53:39.360 |
breadth across multiple use cases respectively. And Sean, as you may know, works on smol AI as 00:53:47.040 |
well, which does summarization and search on this kind of thing. So lots of high-level meta. Hence, 00:53:52.320 |
you can see why they were the most active in their conversations as well. 00:53:55.200 |
Yeah, and I've dropped a few links about this. I think that you should definitely read Shreya's 00:54:00.800 |
previous paper. This paper is really, really, really good. And what I really like about this, 00:54:06.400 |
it actually has UX research. You actually have participants who... So you can hear what other 00:54:14.320 |
people are saying. That's very useful. I'll also drop another link to a compilation I did on 00:54:20.720 |
different LLM-as-judge techniques. So I think that's... It's helpful. It's a lot, I realize, 00:54:26.880 |
but it actually references a lot of other papers. So this can be helpful for you. 00:54:30.880 |
Okay. And next week, we have Eric. Thank you for volunteering, Eric. We look forward to that. And 00:54:37.760 |
drop the paper that you'll be taking us through in the Discord channel so that we have time to 00:54:43.920 |
pre-read. Yeah, we'll do. Thank you, everyone. Thank you, Eric. Thanks. Thank you, Shreya, 00:54:50.080 |
for joining us. Thank you for being here. Yep. Bye. Okay, Swix, do you need to stop the recording 00:54:58.000 |
or we just leave? I would say you'll just leave. He made me the host. He made you host. Can you 00:55:08.160 |
stop the recording? Yeah, but then the thing is my session kind of bugged out and I need a refresh. 00:55:14.880 |
So I think the host will stand out eventually. This session is not the host. Oh, shoot. Okay, 00:55:23.280 |
I'm going to have to leave then. I have a NAH meeting. All right. See you, man. Bye. 00:55:26.320 |
I might... I have a backup recording, I think. So if we lost it, then I can provide that. We'll see. 00:55:33.360 |
Yep, yep, yep. No worries. Cool. Hey, was that yikes? Yeah, yeah. I want the backup record. 00:55:37.680 |
Okay. Yeah, yeah. Cool. All right, everyone, feel free to leave. I'm just waiting for the 00:55:45.120 |
system to time out. Oh, no. Sean left his computer on on this.