Okay, thank you everyone for joining us for PaperClub today. Today we will be talking about DocETL, that is, document ETL. This is a paper by Shreya, big fan of her work, very practical work. And of course, it looks like there's another solid Eugene on this paper, so it should be a great paper.
So the entire, I'm going to try to go through it in a few, there's a lot of concepts to this paper, there's a lot of examples and a lot of results. I'm going to try to run through it very quickly and try to finish at 12.45 mark, and then we can discuss what we learned and what we thought was useful.
So essentially, the premise of the paper is that we can use LLMs to process documents for us in a fairly hands-off way. But the problem is that for fairly complex tasks and data, LLM outputs are fairly inaccurate for what we want to do. So a lot of the time, we just write a prompt, right?
We just say, "Hey, LLM, here are these 10 arXiv papers, find me everything that's relevant to hallucinations." We do it in a very crude, ad hoc way: for every arXiv paper, what is the snippet relevant to hallucinations, we pull it out, then we try to combine them, and then we try to extract some themes from it, we may include citations, et cetera, et cetera.
So what this paper is trying to do is it's trying to provide formal vocabulary to try to do this, and we'll see what the formal vocabulary is. So in a nutshell, they have an agent-based framework. Let's put the agent-based framework aside for now, we won't focus on that. So what they try to do is that they will define operators, and they will define directives, and I'll define what operators and directives are, and then they will try to optimize this.
So the optimization is really called logical rewriting and agent-guided plan evaluation. And of course, optimization algorithm. I mean, I'm oversimplifying it, but all of this is essentially trying to optimize the overall pipeline to try to get what you want. And they introduce directives and operators as well, which we'll get into right now.
So there's a lot of intro over here, and there's this very big diagram over here. I won't actually go into this right now, I feel like we need some vocabulary before we can go into this. We need to understand what the pinks mean, what the greens mean, and what the different maps, all these different bolded words mean.
So I'm just going to jump right into it. I'm going to skip the programming model. Essentially what this is trying to do is it's trying to take database concepts and pandas dataframe concepts and try to apply them to shapeless documents. Is Shreya on the call yet? Not yet. Oh, Shreya is.
Hey, Shreya, if I say anything wrong, just stop me and jump in. I'm just getting to the operators right now. So essentially, the operators are fairly straightforward. And I have a note over here: it's really about thinking how we can apply data pipeline and database operations to LLM tasks.
And the LLM task over here is really documents. And so at the very simplest level, we have the map operator. The language here is a little bit not so straightforward: it applies an LLM-powered projection, known as a semantic projection, to each document in the dataset.
What this means, and my interpretation of it, is that given a document, we create new features out of it. So you can look at this example here. Given a document, what are all the instances of police misconduct, what's the name of the officer involved, and a brief description?
Essentially, given a PDF document, I want new columns for officer involved and new columns for brief description of the misconduct violation. So you can think of map. Map is infinitely flexible. Given a document, extract the relevant text. Given a document, extract your summarization. Given a document, translate it. Given a document, add a classification, et cetera, et cetera.
So map is the workhorse of the operators here. And then they also share about parallel maps, where you can run maps in parallel. So you can see one prompt could be extracting misconduct, while another prompt could be summarizing policies. Essentially, you just do a map on every single thing.
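To make the map (and parallel map) idea concrete, here's a minimal sketch in plain Python. This is not DocETL's actual API (DocETL pipelines are declared in config, not written like this); `call_llm` is a placeholder for whatever LLM client you use.

```python
# A minimal sketch of a map operator: apply an LLM-powered projection to each
# document and attach the result as a new field. Placeholder code, not DocETL's API.

def call_llm(prompt: str) -> str:
    """Placeholder: swap in your own LLM client here."""
    raise NotImplementedError

def map_op(docs: list[dict], prompt_template: str, output_key: str) -> list[dict]:
    """Run one LLM call per document; add the response as a new column."""
    return [
        {**doc, output_key: call_llm(prompt_template.format(document=doc["text"]))}
        for doc in docs
    ]

# A "parallel map" is just several independent maps over the same documents,
# e.g. one prompt extracting misconduct instances, another summarizing policies:
# map_op(docs, "List instances of misconduct in: {document}", "misconduct")
# map_op(docs, "Summarize the policies described in: {document}", "policy_summary")
```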
Then after you do a map, you have to do a reduce. Essentially, okay, so let's say I've extracted all the different officer names, all the different incident dates. I want to return it as a report. Then we have to do a reduce. So map will spin up a lot of new columns.
Reduce will take all those columns and try to generate something that's coherent and easy for a human to read. And then they also talk about batch folding. Essentially, they talk about, you know, sometimes a lot of this is bigger than your context length, and that's why we use the map reduce paradigm, where you chunk it up and then you do reduce.
But when they reduce, sometimes it's larger than the context length of the LLM, so they do folding. They talk about folding and hierarchical aggregation, but they apply batch folding. Essentially, imagine here we have these six boxes, and then we need to aggregate them. Maybe we aggregate them two at a time, and then we do something like that.
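A minimal sketch of what reduce with batch folding might look like, assuming a hypothetical `call_llm` helper: when the extracted items don't all fit in one context window, you fold them into a running aggregate a few at a time.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for your LLM client

def reduce_with_folding(items: list[str], batch_size: int = 2) -> str:
    """Aggregate many extracted items into one report, folding in a few per call."""
    report = ""
    for i in range(0, len(items), batch_size):
        batch = "\n".join(items[i:i + batch_size])
        report = call_llm(
            f"Current report:\n{report}\n\n"
            f"Fold these new items into the report and return the updated report:\n{batch}"
        )
    return report
```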
I think that's all very straightforward. It makes a lot of sense. Now, there's the third operator, which is, in my opinion, a very important operator, but a very difficult operator to get right, which is the resolve operator. In a nutshell, this is deduplication. You know, Swyx has the name Sean Wang, has the name Swyx, has the name yx wang.
How do we know that they are all Swyx? This is extremely, extremely difficult. So compare the following two officer records from the police documents, right? And then how do we know that they're actually the same person? And then, you know, resolution has many different aspects to it. It could be sci-fi, science fiction, sci-fi fantasy.
Sci-fi fantasy is not really sci-fi, but depending on your context, you could consider sci-fi and sci-fi fantasy as the same thing. So there's a bit of domain nuance to this as well. I think this is really, really difficult to do. I actually have a... So one, several things I've seen in production is that when you do resolving and deduplication, if your deduplication is too loose, everything deduplicates to a single huge megacluster.
And that's when you have severe issues. So this is actually very, very tricky to get right. And in the paper, they actually talk about different ways to do this. And they talk about a cheap way to do this, which we have also implemented as well, which is using embeddings and code-based comparisons, which we will talk about later.
But long story short, this is very difficult. And then they have other operators, very standard, filter. Essentially, given a document, maybe we perform some kind of classification, do we want to filter the document, or maybe you do some kind of map to extract the relevant sections and filter everything out.
That makes sense. Equi-join is, again, I think it's really difficult. In a sense, it's sort of like resolve, can you join these two concepts together? And then all of this all relies on schema. And then they have a few auxiliary operators. Essentially, auxiliary operators means that it doesn't even rely on LLMs.
Unnest is one of them. So for example, given a document, maybe you have created a lot of... Maybe given a blog post, what are all the papers referenced in this blog post? You'll return an array, a list. Unnest basically splits that list into individual elements. And then split is essentially, you can think of it as chunking.
In the most naive way, it could be chunking in terms of number of tokens. In a smarter way, it could be chunking based on paragraphs or chapters or sections, or it could even be LLM-based splitting. And then gather is essentially gathering all the individual chunks. So for example, imagine you are summarizing an arXiv paper.
Imagine you're summarizing section by section. The previous section was the operator section, and now we're in the rewrite section. So it could be that when you're summarizing the rewrite section, you actually do need context about the operator section. It mentions certain abbreviations that the LLM would never know, but you have to provide a kind of context.
So what gather does is provide this peripheral information. It could be that as you're summarizing a document section by section, you collect some kind of map of all the various abbreviations that have shown up, so that you provide that context to the LLM, so the LLM doesn't try to hallucinate what those abbreviations mean.
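A rough sketch of that split-then-gather idea, under the assumption that the peripheral context is an accumulating abbreviation glossary plus the previous chunk's summary (what DocETL actually gathers is configurable; this is just illustrative):

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for your LLM client

def summarize_with_gather(chunks: list[str]) -> list[str]:
    """Summarize chunk by chunk while carrying peripheral context forward."""
    glossary = ""        # abbreviations seen so far
    prev_summary = ""    # summary of the previous chunk
    summaries = []
    for chunk in chunks:
        summary = call_llm(
            f"Known abbreviations so far:\n{glossary}\n\n"
            f"Summary of previous section:\n{prev_summary}\n\n"
            f"Summarize this section:\n{chunk}"
        )
        glossary = call_llm(
            f"Update this abbreviation glossary with any new terms:\n{glossary}\n\nText:\n{chunk}"
        )
        prev_summary = summary
        summaries.append(summary)
    return summaries
```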
So those are the basic operators of DocETL. I'll pause here. Any questions about the operators? I think it's essential to understand map, reduce, and resolve. Everything else builds on this. So this is a classic, not so much a question, more of a comment. My real question was, was there a rank operator?
And the answer is no. You could maybe compose the rank operator with sometimes split, gather, thingamajig, but it's not a formal ranking system. And even before AI, I had always had this strong view that filtering, ranking, sorting, all this stuff is kind of like the same layer in the API stack, and should be kind of done together.
This is my number one problem with AI News right now, which is that all filtering is effectively a recsys, and I have to rank. Yes. So you filter just based on metadata, like hard numbers, because you don't have anything better. If you could create a ranking formula, you would use the ranking formula rather than any other filter basis.
There is a filter op, right? So in section 2.2, under other ops, the filter operation independently retains documents from the input dataset based on conditions specified through an LLM prompt. In some sense, it's under other ops, if you scroll down a little. This filter one seems to, if you can specify conditions through an LLM and retain docs, that's a form of ranking, no?
That's how I interpreted this one, as opposed to like, yeah. Yeah. I mean, I think there's a question of sort of filter, then rank, or rank, then filter. I think in my strong view, filter. Yeah. You can think of filtering as in dropping items from an array, whereas ranking is me pushing things down or pushing things up.
So I think it's slightly different. And you know, there's a question for you, Shreya. Is there a reason why they're split into operators and auxiliary operators? Oh, very simply, we just wanted to put the primary LLM-powered operators in one section, the one that people will probably spend a lot of time tinkering on, and the auxiliary ones are there to stitch them.
So a lot of people want to do, for example, reduce after map, well, they want to unnest after map, and then do reduce there. Yep. Okay. A comment about ranking. We're working on a sort operator. We haven't found, maybe Swix, we should chat about this. We haven't found a really compelling use case from any of the users yet.
Really? Well, a lot of people's use cases can simply be solved by, you know, applying like embedding based similarity to some query vector and then ranking by that similarity. I want to know, kind of, I think Swix's case is interesting. Like, in what cases does it make sense to have this pairwise comparison interface be the primary way to steer that operator?
So, maybe like ranking articles in terms of what is most interesting to a human could be such a case. Yeah, so- I think there's another case. Sorry. Yeah, go ahead. Go for it. Go for it. I mean, like, yes on the what's, you know, what's most interesting to a human.
I mean, this is standard recsys. I'm not convinced that it has to be pairwise. I feel like with pairwise maybe it's easier to do data entry. So, maybe that's why. I'm actually strongly convinced that it should be pairwise and we can debate that and see how it works. And I also think that- Why do you think it should be pairwise?
I think it's just more reliable and stable that way. I mean, if we do everything pairwise, it's quadratic, right? But we don't have to go through all of that. We can do it smartly. And you know, in Shreya's paper, they actually cover how they use embeddings and code-based resolve, right?
If you tweak that a bit, it can be applied to ranking. You can say that for ranking we actually have confidence levels. If the similarity score is strong enough or weak enough, like if it's really good, like 0.9, we know it's strongly related.
If it's really bad, like 0.1, we know it's poorly related. But then there's a lot of stuff that's in the middle that then we can use an LLM power ranking. We can go into that a bit, but this paper is huge and I want to try to go through as much of it as I can.
So now that we've covered operators, they propose rewrites. So actually, I have a question for you, Shreya. When you say rewrite, do you mean rewriting the pipeline or do you mean rewriting the document? What does rewrite in rewrite directive stand for? Rewrite, this is a good question. No one's ever asked this.
Rewriting the pipeline. So say you have a pipeline that is just a map operator. Many people have this pipeline. They have a document set and they want to extract a bunch of metadata from each document. Like I have a bunch of legal contracts and I want to extract as many clauses of interest as possible from them.
That would be, you can program that as a map operation, a single map operation, to go through every document and extract all the fields that you want. When we say rewrite, we acknowledge that this operation might not work when you execute it. What if the document is too long?
What if you're extracting too many fields? It's too hard. So you want to rewrite it into a more complex sequence of operators. So is my mental model right that the rewrite directives are somewhat in the realm of optimization already? Yes. Exactly. Right. So now we're going to talk about rewrite directives.
So let's again take our example of this arXiv PDF. We want to extract what the key themes are. A noob like me would just upload it to Claude and ask Claude what the key themes are. Now we can rewrite that to be smarter. We can rewrite it into a map, basically map the entire document into all text.
And then we can do split, we saw the split operator. And then on the splits, we can do a map to extract the key themes. And then we can do a reduce, right? To try to reduce all these key themes. Again, the point here is that if we give the LLM smaller chunks, it is better able to pay attention and therefore come up with a richer, more comprehensive, and more factual summary.
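As a sketch, the rewrite being described turns one giant "what are the key themes" call into map → split → map → reduce. Everything here (`call_llm`, the naive character-based `split`) is a placeholder, not the paper's implementation:

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for your LLM client

def split(text: str, chunk_chars: int = 8000) -> list[str]:
    # naive splitting; a smarter split would respect sections or paragraphs
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]

def extract_themes(document_text: str) -> str:
    """Map theme extraction over each chunk, then reduce into one answer."""
    chunk_themes = [
        call_llm(f"List the key themes in this excerpt:\n{chunk}")
        for chunk in split(document_text)
    ]
    return call_llm(
        "Combine these per-chunk theme lists into one deduplicated list:\n"
        + "\n---\n".join(chunk_themes)
    )
```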
But you can imagine the search space of creating such a pipeline is infinite, right? The search space of chaining together all these different Lego blocks, of putting together all the different AWS services, is infinite. So what they propose is a methodology to help us do that semi-automatically.
So I will focus on all the different things that they do in rewrites, but I'm going to call out what I think is absolutely essential. So rewrite directives. The first one is data decomposition. So essentially when you have very large documents and there are many documents, and then you just have to decompose it.
The very standard pattern they share is document chunking, which everyone has been doing. This is standard. That said, I actually do want to share this example whereby I was asked to help with a pipeline, and I was able to improve downstream metrics significantly by 20 to 50% by removing chunking.
So every now and then it's good to rethink, hey, do we actually need chunking? In such cases, I think Shreya's example here is that it's way beyond the context window size, and so you definitely do need it. So you can see they do chunking. So you can map, split, gather, map, reduce.
And you can see the optimization. There's many, many different ways to optimize. I'm just going to pick B here. So you can imagine we could split it and then we could gather all of it. After we split it, we gather all of it. But what they're proposing is that we split it, we create a summary, and then we create...
I can't remember what H stands for. Hierarchical information. We create a summary and create hierarchical information. So we enrich the splitted chunks and then we gather it. So you can imagine essentially all pipelines now are LLM pipelines. LLM is enriching the data. LLM is cleaning the data. LLM is filtering the data.
It's going to be a bit lossy. It's going to be a bit stochastic, but I think we will figure it out in the next one to two years to get it to a more reliable state. So they actually share some very insider ghost knowledge. When splitting a document, there are some kinds of context that are very useful.
I fully agree with all of this. Document level metadata. What are all the different abbreviations? Who's the author, et cetera? Hierarchical information. Summaries of neighboring chunks. Summaries of previous chunks are actually very valuable. So the LLM has enough context and doesn't make shit up. And then they also have different patterns here, document level metadata.
Maybe I'll just go through one of them. So example, in this case, before you split it, you extract metadata that's relevant to all chunks. So by extracting this abbreviation metadata, author metadata, you make sure that all chunks, when you're doing map on all of these chunks, you actually have all this context.
The LLM has all these contexts and is able to do a more reliable job. And then they have many, many different other patterns that you can consider. Now, the next thing is multi-level aggregation. So you need to aggregate it. Essentially, let's imagine you have hundreds of chunks. You are summarizing a movie and you have all the different movie cut scenes.
And imagine we don't have Gemini to do this for us by just uploading the entire movie. Imagine you have to do it. We have to chunk it up. So what they propose here is to chunk it up and aggregate the data at a finer granularity first. So essentially, it's like, okay, we have scene one... or let's put it another way.
Imagine you're trying to summarize a series of, oh, wow, this is a bit close. No, imagine you're trying to summarize the Harry Potter movies. So you could first summarize the first Harry Potter movie, first summarize every scene in the first Harry Potter movie, roll it up to a summary of the first Harry Potter movie, and then roll it up to a summary of the entire Harry Potter movies.
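A minimal sketch of that hierarchical aggregation, using the movie analogy (scene summaries roll up into movie summaries, which roll up into a series summary); the helpers are placeholders:

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for your LLM client

def rollup(summaries: list[str], level: str) -> str:
    """Combine a list of lower-level summaries into one higher-level summary."""
    return call_llm(f"Combine these {level}-level summaries into one:\n" + "\n".join(summaries))

def summarize_series(movies: list[list[str]]) -> str:
    """movies: a list of movies, each a list of scene texts."""
    movie_summaries = []
    for scenes in movies:
        scene_summaries = [call_llm(f"Summarize this scene:\n{s}") for s in scenes]
        movie_summaries.append(rollup(scene_summaries, "scene"))
    return rollup(movie_summaries, "movie")
```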
So that's hierarchical aggregation. And then they have several things here, like LLM-centric improvements. There's this term here, gleaning: the LLM is prompted with the previous inputs and outputs and asked to improve the outputs. I mean, to simplify... oh, sorry, Shreya has a hand raised. Oh, I didn't want to interrupt you.
Sorry. Your point about chunking, I thought, was very interesting, in that sometimes it's beneficial to chunk and sometimes you should not chunk. And we have observed this in a number of workloads. And the insight that we've gained is that we will never know up front; it's all task-specific and data-specific.
And we are so further convinced that you kind of need some optimizer to explore these different choices for you and come up with something reasonable. But anyways, just a meta comment on, it's not the rewrite rule that if you apply it, it always works. It's that if you apply it, it sometimes works.
It sometimes works a lot better. And you need some way to try a lot of these things automatically. - I agree. Vibhu, you have a hand. - A little follow-up to that point. Somewhere, I think on the first page where it talks about chunking and performance, there was just a handful of citations of different papers.
So there was a section, I think on the first page: recent work has shown that LLM performance degrades considerably as length increases, citing a paper; they can be distracted, citing a paper; pay more attention to certain parts, citing a paper; or fail to gain holistic understanding. And there's like four citations.
So pretty cool if anyone wants to dig into that, there's like eight citations on the first page that talk about this. And then another little comment actually. So you skipped over a few of the chunking strategies. So there's stuff like document-level extraction and headers. I thought that the third one was kind of interesting, where you have chunk filtering.
I haven't seen this approach too often. So just if people haven't read it, it's an interesting little one that popped up. So as you chunk, whether it's semantically or whatever it is, you can filter out sections. Not many other pre-processing handlers seem to do this, but the example here is pretty clear, right?
So as you break up chunks, you can filter out stuff. So for example, in an arXiv research paper, you can filter out citations and references that aren't really relevant to the core information. It's just one of those chunking strategies that you don't see too often, so, you know, I figured I'd call that out.
Yeah. A hundred percent agree. All of these bolded mini-headers here, they're all very valuable. I just don't have the time to go through them, but you should read through them. I think these are practices that they have gained from working on very, very difficult real-world problems to just get the level of performance up.
And you can take a lot of inspiration from this. Shreya, you have another hand raised. Oh, a small comment again for the police misconduct data. Often we have like thousand page records documents where it's just page on page of like random image of like a street sign or something.
And if you just drop those, your processing, your accuracy just stays the same and you save so much on costs. So that's where the filtering came from. Yep. Very valuable. Now they have LM centric improvements. This one here is gleaning, which is prompted with the previous inputs and outputs and ask it to improve it.
You can think of it another way if you have a valid. So that's one way you can imagine have a validator in the loop or essentially like code. You just copy. Hey, Claude, here's my error message. How do I fix it? This is very similar that given the previous input and output and error message or some kind of feedback LLM, get it to improve on it.
So this can, you can iteratively improve on this. And I think, uh, later we have an experiment result where they try to do this four times and you can see how it tapers off. But for some, in some instances, he just keeps improving, uh, as, as many times. So in this initial, so how this is, how this is done.
Oh, actually they do have any better within the loop. Okay. So how this is done. Okay. You preprocess, you process it, you get the original map operation, like summary of the paper, and then you have a validator, try to evaluate it and provide some feedback. Now you provide the previous summary and the feedback and maybe the context and then try to refine on it.
And then you just improve this many times. So you can imagine now all your SQL pipelines, all your data pipelines now have intelligence sprinkled in where it can, it can, it can map where it can map on all this. We have a validator in the loop. So users can have a more capable model to be the validator.
Yeah. Yeah. I think I missed the part about being a validator in the loop. Because I thought it was just asked to improve the outputs. Yeah. But again, fully agree. I think this cannot be done without an evaluator or validator. It's absolutely essential. Then they have this thing, duplicate key resolved.
This is the problem, right? LLM outputs are not canonicalized. So for the same person, say Swyx, there may be very many different names out there. So you have to canonicalize it. So they take semantically equivalent values of the key. And I think they try to reduce the search space.
I can't remember where it's mentioned, but maybe they try to reduce the search space with embeddings and then try to resolve this. Long and short of it, this is a very, very hard problem. And you have to do this with fairly high precision. If not, you could be saying that Eugene Cheah and Eugene Yan are actually the same person.
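Here's a sketch of the kind of cheap resolve being described: embeddings block the obvious matches and non-matches, and the LLM only gets asked about the ambiguous middle band. The thresholds and helper functions are made up for illustration; the paper's actual resolve optimization differs in detail.

```python
import itertools

def embed(text: str) -> list[float]:
    raise NotImplementedError  # plug in your embedding model

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # plug in your LLM client

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def resolve_pairs(names: list[str], low: float = 0.3, high: float = 0.9) -> list[tuple[str, str]]:
    """Return pairs of names judged to refer to the same entity."""
    vecs = {n: embed(n) for n in names}
    matches = []
    for a, b in itertools.combinations(names, 2):
        sim = cosine(vecs[a], vecs[b])
        if sim >= high:                      # confident match, skip the LLM
            matches.append((a, b))
        elif sim > low:                      # ambiguous band: ask the LLM
            verdict = call_llm(f"Do '{a}' and '{b}' refer to the same officer? Answer yes or no.")
            if verdict.strip().lower().startswith("yes"):
                matches.append((a, b))
        # sim <= low: confident non-match, skip
    return matches
```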
And Eugene Yan maybe just raised $10 million when it's really Eugene Cheah. I actually don't know how much Eugene Cheah raised. I'm just making up an example where precision is quite important. Vibhu, you have a hand raised. Yeah. So when I read this section about gleaning, I saw the same thing.
So I understood, basically, on the page you're at, that there's a validation model. So there's a validator, but then that first paragraph also says they employ a separate validator and data generation model. So go down below, this next one. Yeah. I didn't really see where this data generation LLM is, but I understand that you do some pre-processing, you have a validator to check how it worked.
Then you have a number of steps to do this, do your whatever processing, validate, process, validate. But I don't get where this data generation LLM is. And then it seems like data generation is just, that's the only reference of it. So I didn't know if this is like, I just didn't see it too much in this.
No, that's ambiguous on our part. I just meant the data processing LLM, the normal LLM that you use. So how does a traditional operation work? Look at figure four: you just have the output that comes from the data processing LLM, nothing more. What gleaning does is it adds that validation agent, and it creates a loop for some finite number of iterations, or until the validation agent says it's good.
So you can just bound the number of times it refines, and then the final output, that's all. So is it using that original step again? So it says it's a separate validator and data generation model. You can specify two different models. You could also specify the same model architecture, but the point is the validator agent has a different prompt that has a specific instruction of, does this contain all the officers or instances?
And the validator agent's job is to say yes or no with optional feedback. The data processing LLM's job is to give the answer. That's why we have a distinction between the two. Got it. Got it. Okay. Good question. And Shreya also has an interesting point that it's much easier to verify the output than to generate it.
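Putting that together, a minimal sketch of the gleaning loop as just described: a data-processing LLM produces the output, a validator prompt says yes or no with optional feedback, and refinement is bounded to a few rounds. Prompts and `call_llm` are illustrative placeholders.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for your LLM client(s)

def glean(document: str, task_prompt: str, max_rounds: int = 4) -> str:
    """Generate, validate, and refine an output for a bounded number of rounds."""
    output = call_llm(f"{task_prompt}\n\nDocument:\n{document}")
    for _ in range(max_rounds):
        verdict = call_llm(
            "Does this output contain ALL instances asked for? "
            f"Answer YES or NO, then give feedback.\n\nOutput:\n{output}\n\nDocument:\n{document}"
        )
        if verdict.strip().upper().startswith("YES"):
            break
        output = call_llm(
            f"{task_prompt}\n\nDocument:\n{document}\n\n"
            f"Previous output:\n{output}\n\nValidator feedback:\n{verdict}\n\nImprove the output."
        )
    return output
```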
I actually observed the opposite. It depends on the task. For classification tasks, yes, it's easy; for factuality or comprehensiveness or relevance, whatever, it's a lot harder. A lot of the synthetic data generation papers, like Orca 3 and WizardLM, show that models are better at verifying outputs than generating outputs.
So you have a weaker... Yeah. So it's shown in other work too, but interesting note. It does depend on the task. So next we have projection synthesis. Well, this is tricky. I think what it means is that it's hyper-pipeline optimization. So essentially we don't really know what is going to be good.
So you can imagine that there are very many different ways to do this. And if you have a pipeline-generating agent, it's really hard for that agent to figure out which pipeline is going to work in the first place.
And there are infinite ways that you could build this pipeline. So they propose several ways to do this. And of course, the first thing is chaining. It's like iteratively chain LLM calls to simplify the task. Essentially, this is what people say. Break up your task so that every prompt only does a single thing.
So it could be extracting entities, summarizing key points, and then generating recommendations. Another one could be isolating. So for example, maybe a question would be: what sections of this paper talk about hallucination? And then you could reduce the scope of work to be done on the doc, whereby you just classify the documents, or maybe just classify sections or paragraphs, as related to hallucination or not, and then just flag them.
So there's a lot there. And I won't go through all of this. But anyway, that is the end of the rewrite section. Any questions there on rewriting? Because now we're going to go into the optimization step. I'm happy to explain projection synthesis if that's helpful. Yeah, go for it.
Yeah. So this is very similar to the chain of thought mindset that you have in LLM prompt engineering, where if you have a very complex task, often what can help is if you break that down into two LLM calls. Now, you can generalize this to say all this means is you have a map operation that does partial work and then your original operation that does the task to maybe make it easier.
And you could do this for any operation. You can have a map before another map. You can have a map operation before another reduce, before another filter. In this police misconduct case, we've seen this very helpful around whittling down, like summarizing the document into the relevant points needed to then answer the hard question of misconduct.
So you'll take a very long document and the first map operation that's synthesized will just extract information that is relevant to police officers at all, making it a smaller document. And then summarizing the instances of misconduct is much simpler to do on that very focused document. So that's an example of projection synthesis.
And projection means map in the data world. So you can apply that to any operation. And you can have the LLM write that. You can have an LLM take a task and try to break it down into multiple steps and synthesize that map projection. >> Thank you, Shreya. So any questions on rewriting directives?
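A tiny sketch of projection synthesis as Shreya describes it: a synthesized first map whittles the record down, then the original, harder operation runs on the smaller document. Function names and prompts are placeholders.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for your LLM client

def summarize_misconduct(record: str) -> str:
    # synthesized projection: shrink the long record to what matters
    relevant = call_llm(
        "Extract only the passages that mention police officers:\n" + record
    )
    # original operation, now applied to a much more focused document
    return call_llm(
        "Summarize every instance of misconduct described here:\n" + relevant
    )
```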
Okay. So now we're going to go into the optimization aspect. The key idea here is that you declare an initial pipeline, and then DocETL will try to run that pipeline, get initial evaluation scores, and then it will try to optimize that pipeline for you.
So there are two key agents here. The first is the generation agent. What the generation agent does is generate the rewrites and the alternative pipelines, the pipeline options. And then you have validation agents.
And again, this is another example, right, of how evals and validation are just fundamental to this. Because without the validation agent, you would not be able to assess and measure the different pipelines, how the different pipelines do, and then how to compare between them. So there's two things. You generate pipelines.
Second, you evaluate each step of the -- maybe you evaluate each step of the pipeline and you evaluate the overall pipeline. And then we try to see which one performs well. So there's a lot of algorithms here. But I find that the paragraphs here actually explain it. I found it far easier for me to understand.
So essentially, you can imagine that there's a pipeline, right? You have a pipeline. Again, let's just stick to the police example pipeline, all the different examples of misconduct. So what you first do is you use a validation agent to create a custom validation prompt, right? And I don't know if we do need inputs and labeled outputs here. Do we actually need inputs and labeled outputs here to get the validation prompt?
Do we actually need inputs and label outputs here to get the validation prompt? Nope. I mean, you could -- we don't support that because nobody has ever come to us with labeled examples. The other thing that's hard is sometimes there are newly synthesized operators as part of rewrite rules.
And those are certainly not going to have any label examples. But one of the things we are exploring is we have an interface for doc ETL and we're exploring kind of interactive optimization of pipelines. So if we synthesize a new operator, we might ask a human to label some examples and then use that for the validation prompt.
But that's, you know, in the works. I think that's a good idea. And that's the entire crux of AlignEval, which I shared with you. I'm of a slightly different take. I feel like we do need some seed set of human-labeled data. So, first thing, we use the validation agent.
We have a custom validation prompt. And then what the validation agent will do is sample some outputs from the initial pipeline, or from chunks of the initial pipeline, to see if there's room for improvement. If there is room for improvement, they have rewrite rules that they apply to the sub-pipeline, the individual operation, or the entire pipeline itself.
And then they have this thing you could call recursive improvement, in the sense that when you create new optimizations for a rewrite, you immediately optimize them before continuing the rest of the optimization. So what it means is that, imagine a pipeline goes from left to right: you start upstream, iterate on that, and then you go downstream.
And this can take as much time as it needs. Compute is cheap; you can just run it overnight. Of course, it costs a few hundred dollars, but that's fine compared to how much human time it would take. So now you have multiple candidate plans. Let's just say you have 10 candidate plans.
And then you take all these 10 candidate plans. You first execute each plan on a sample of the data. You don't want to execute it on all your thousands of documents. You just execute on maybe 50. And then you use the validation agent to rate each output. So after you've rated these sample outputs, you take the top K.
And then I thought you do a full evaluation; no, it does pairwise comparisons across these top plans, and the plan that wins the most is the winner. And of course, you select that as the optimized plan. I think let's go through an example. I think an example here is helpful.
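Before the example, here's a rough sketch of that selection loop (execute candidates on a sample, rate with the validation prompt, keep the top K, then pairwise-compare); `run_plan` and the judge prompts are hypothetical placeholders, not DocETL's code.

```python
import itertools

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for your LLM client

def run_plan(plan, sample_docs: list[str]) -> str:
    raise NotImplementedError  # execute a candidate pipeline on the sample

def select_plan(plans: list, sample_docs: list[str], validation_prompt: str, k: int = 3):
    outputs = {i: run_plan(p, sample_docs) for i, p in enumerate(plans)}
    # 1) rate each plan's sample output with the validation prompt
    scores = {
        i: float(call_llm(f"{validation_prompt}\nRate this output 1-10:\n{out}"))
        for i, out in outputs.items()
    }
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    # 2) pairwise comparisons among the top K; most wins takes it
    wins = {i: 0 for i in top_k}
    for a, b in itertools.combinations(top_k, 2):
        better = call_llm(
            f"{validation_prompt}\nWhich output is better, A or B?\nA:\n{outputs[a]}\nB:\n{outputs[b]}"
        )
        wins[a if better.strip().upper().startswith("A") else b] += 1
    return plans[max(wins, key=wins.get)]
```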
So here is the police misconduct dataset. So the current pipeline is this, right? The team has a domain-specific clustering algorithm and then human annotation to de-duplicate officer names. And this is the baseline pipeline that the team has. So what they did was DocETL synthesized various pipelines. The first one was to extract misconduct summaries.
Extract the name and the summary of the misconduct before trying to de-duplicate on the officer name, before trying to resolve the officer name. And then it summarizes it. Then you can see here's a new... Was this synthesized by DocETL, Shreya? DocETL-T? Good question. So we are redoing this entire eval, by the way.
So we're submitting this to VLDB December 1st. You'll see an entirely new eval for the next version. But for this case study, DocETL-S, -T, and -O are all candidate plans from DocETL's optimizer. And when we were exploring them, we were like, "Oh man, I don't even know which one is going to be better." Truly.
You read them, you have no idea. So we compared them all. DocETL-O was the one that was actually selected. So you can see that they generated several plans. I don't know how many hundreds of plans they generated. The author evaluated 50, I recall; they had certain numbers here.
So okay, yes. Okay, shoot. I don't know. Okay. So they had 220 documents and the baseline was $2, while DocETL-S, -T, and -O each cost a little bit more. They used the optimizer; I don't know how many plans the optimizer generated, but it cost approximately $100 and took less than half an hour.
Just go get lunch and you come back and you get your optimized pipeline. And the optimization cost was just $100. How many pipelines? Oh, perfect. Here it is. It generated 200 pipeline variants and all of this was just evaluated with the validator prompt. Yeah. So I think that's an example of how the optimization step runs.
Generate a bunch of pipelines, validate them, and optimize the subcomponents within the pipeline and the pipeline itself. So now that we've gone through this, we can discuss figure one. We had to go through a lot of it, but I think now we can discuss figure one. Figure one is, you can imagine all of these are police documents and we want to extract the names of officers and the misconduct.
So you can see this is the user-defined map, right? Okay. And maybe if I'm a user, my initial prompt is very basic: given a PDF, just extract any cases of misconduct. So now we apply the rewrite directives. And for the rewrites, of course, we have the baseline, which is no change.
We can do projection synthesis, whereby we extract the name, extract the summary, and then after that do another one. Or we could do decomposition, where we split the document up, gather, and map. And then all the pink boxes are where the agent will automatically rewrite it. And then the green boxes are when the plan is selected, and NC here stands for NSYNC.
No, I mean, it stands for no change. Essentially, if it's no change, we just accept it. And you can see, we just go through from left to right. And that is just for extracting any cases of misconduct. After you've extracted these cases of misconduct, you can see Officer A, Sergeant B, and Officer C, then now we need to reduce it.
And again, we apply the same thing down the stream. So we start from left to right. Every time we generate something new, we validate it. And if it passes, if it's good enough, we move on to the next one. And so it just iteratively goes down the entire stream.
So now that is the section on operators, rewrite directives, and optimization. And that's all summarized in this document. Any questions? I can answer RJ's question. It's easier than typing it out. Yeah, go for it. Yeah, so RJ asked a question. I get how you can generate good evaluation prompts for the individual operators, but why do we trust an end-to-end auto-generated trust?
Oh, sorry, end-to-end auto-generated judge. I too was very skeptical, but then I saw the kinds of tasks that the journalism team here at UC Berkeley was running, which was: extract all cases of misconduct, where they have defined misconduct in the prompt. Very precision- and recall-oriented tasks here, for which an LLM judge can pretty unambiguously determine whether one output is better than another.
Did one output extract more names than another? LLM is just super good here. So I think what really sold me was having an optimizer like this that off the bat improved performance for the actual people who are working on these tasks, who are going to be authors in the eventual submission, just because they've contributed a lot to helping out with the evaluation here, kind of guiding what they find useful.
But once you kind of see it work, it's kind of hard to go back. I know Eugene Yan also has experience with LLM as judge here. I strongly believe it could work. I mean, last year, this time last year, or maybe a few months, like in June last year, I didn't think it could work, but I've been debating it a lot.
I've been trying a lot. I think when we simplify it to binary classification metrics, it can work. And I think a lot of things can be simplified to binary classification metrics, like the precision Shreya mentioned. And I've seen evidence of it working.
Well, you may have to apply chain of thought, you may have to include a few few-shot examples, you may have to do a bit of self-consistency or ensembling. But I think the technology is getting there. Eugene Cheah, you have a hand raised. Yeah, I think the other important aspect to consider in this scenario is: what is the failure condition?
So in this case, right, if let's say this pipeline fails, right, it still reaches the journalist and becomes a human in the loop kind of situation. Like the failed documents can be ignored safely, because they achieve their goal with the successful documents. In the case of using this pipeline in a more Q&A, agent style for helpdesk, right, a failed answer impact is a lot more drastic.
So it depends on the use case. Agree. Vibhu. And a quick comment there. So you mentioned you've seen examples of LLM-as-judge working if you break it down to a binary problem. And then you might need to include a few few-shot examples and stuff. And you said ensembling.
Can you explain a bit more about what ensembling helps with for LLM-as-judge? Yeah, when I talk about ensembling, the main ensembling I'm thinking about is, oh, gosh. I wish I could find a paper as easily as... No, just at a high level, you know. Yeah, the PoLL paper essentially talks about combining multiple weaker LLMs.
You can get an LLM judge as good as GPT-4. This is the paper from... Okay, so is the benefit there trying to save on performance of a large model or, like, an optimization because you can't run a large one? Like, mixture-of-agents from Together kind of shows how you can use a mixture of agents of small models to match the performance of a larger model, which we've deployed because it has to be on-prem.
So ensembling models leads to smaller models being able to do better judging than larger models. Is that the takeaway? Yes, that's correct. So in this paper here by Cohere, what they did was they have a reference model, and this reference model is GPT-4. And then essentially what they did was ensemble Command R, Haiku, and GPT-3.5.
And I can't remember what the... I think the ensemble was just majority voting. And they were able to get a higher kappa score, a higher correlation with humans, compared to the strong reference large language model. Interesting. Makes sense. Yeah. And I think it makes sense that, you know, all these smaller calls can be parallelized, and the ensembling... yeah. And it's faster and cheaper.
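A minimal sketch of that panel-of-judges idea (in the spirit of the PoLL paper): several smaller judge models each vote on a binary criterion and the majority wins. The `judges` callables are placeholders for whatever models you'd actually use.

```python
from collections import Counter
from typing import Callable

def majority_judge(output: str, criterion: str, judges: list[Callable[[str], str]]) -> bool:
    """Each judge votes PASS/FAIL on one binary criterion; return the majority verdict."""
    votes = []
    for judge in judges:
        verdict = judge(f"Criterion: {criterion}\nOutput:\n{output}\nAnswer PASS or FAIL.")
        votes.append(verdict.strip().upper().startswith("PASS"))
    return Counter(votes).most_common(1)[0][0]
```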
So sometimes they can. Sometimes they can't be parallelized. For us, we see a lot of... not parallel, sequential chaining adds significant performance. It just adds a bottleneck in terms of time and latency end-to-end. But it's a necessary trade-off for what we've... Yeah. Okay. Someone asked, can you link to this paper?
I don't know if I can link to this paper. If you've been in our paper clubs, you know Eugene Yan just does not multitask well, as you saw while I was struggling to find this paper. But I've pasted you the link. If you search it up on Google, I'm sure you'll find it on arXiv.
I've linked it. Okay. I did have a quick high-level question, since we're approaching the end. More so for Shreya. So if you can give a sort of high-level, how we go from datasets to extracted whatever. So currently you can break up documents by chunking, extracting stuff. There's some basic filtering.
Just high-level, what happens here? And then in that optimization step, it didn't really click for me where you have the cost go from a couple dollars to $100 and you're generating a bunch of pipelines to optimize this stuff. But at a high level, one, what is DocETL? And then two, what are these optimizers doing?
I'll answer the second question. The optimizer's goal is to rewrite the user's pipeline into something that is more accurate. And in this process, the optimizer explores a lot of plans, using LLMs to create those plans and to verify and evaluate those plans. So that's where the cost goes up. The reason it was $100: if I ran the optimizer with GPT-4o mini as the LLMs, it would be $10.
But we used GPT-4o just because I think we did this at a time when mini hadn't come out yet. Anyways, LLMs are trending cheaper and cheaper. This project is a very high-budget project, but I think that in the future, as LLMs get much cheaper and open-source LLMs get much better, the cost should not be an issue for this kind of approach.
And then your first question was, what is the high-level, what's the vision for the project? - No, I mean, just what's the input/output? I, from the title, when I first started reading it and went through all this section one, two, three, I was like, I thought it was more so single document to better extraction, like pre-processing a document for a RAG system.
And then once I get to the end, so my whole background was, let's say I've got 10,000 documents that I want referenced in a RAG system. Am I doing this on all of them? And then it turned into, you can throw this at a data set of a couple of hundred docs.
So like, there was a bit of a jump there. - Yeah. So this is very different from traditional RAG or Q&A or document processing for a chatbot. The kinds of queries that people want to use DocETL for can be expressed as ETL, as a sweep-and-harvest kind of thing: I want to look at my entire data set.
I would hire an analyst to look at my entire data set and then just tell me things, generate me reports that I can then learn from. So for example, this officer in misconduct thing is much more digestible than say looking at all these millions of PDFs. So in that sense, I mean, you can certainly apply chunking or like maybe a strategy used for processing one of the documents could be used for RAG systems, but the focus is not RAG.
It's like building a kind of semantic layer on top of your unstructured data is the way that I like to think about it. - Got it. Got it. Thank you. - Okay, cool. And I know we don't have time to go through the examples at the end, but definitely do go through the examples at the end.
And you can see, okay, let's just, for the police example one, you can see they had three binary criteria. Did this binary criteria come from the team that was doing the work? - Yes. - Okay. Like whether each officer name referred to a real person, if the summary included dates, whether each identified misconduct was- - This is a very weak evaluation.
So part of the reason we want to rerun this evaluation is that this task here doesn't have any ground truths. So when you ask the people what they care about, they look at some of the outputs, they look at their data, and they're like, oh, the officer names are fake.
So that becomes a criteria for them. 'Cause sometimes it'll be officer A, officer B, officer C, that's not real, that's not what they want extracted. They'll read some of the summaries and say, you know, some of them aren't exhaustive. I want them to include date, location, names, summary of what happened.
So I think like the criteria emerges from looking at the data and it's not necessarily a reflection of like how well the LLM did this task. - Gosh, I think Shreya must have seen this red post-it note and preempted me. - But that's the question, right? From an academic standpoint, this is an underwhelming evaluation because there's no ground truths, right?
So we're redoing our evaluation to be on datasets where we actually have ground truths from human annotators, but that's just not a practical setting. Like nobody's coming to DocETL with ground truth. - That's why we do AlignEval, to get them to provide ground truth and give me five.
And you can imagine, right, let's just say police misconduct. You can imagine there are two officers, Eugene Cheah and Eugene Yan, and then we resolve them to Eugene. And then now the risk is higher, right? Eugene Cheah or Eugene Yan conducted some offense. So that's the problem where resolution is very tricky.
I think somewhere you mentioned that for resolution you have a metric of recall. It depends on your use case, but I actually think precision is something that we want to be careful about. - Yeah, we do measure precision in the updated eval that is not on arXiv yet.
- So yeah, they also have two other examples here. Again, very valuable. This is really people doing the work, sharing the work they did, what were the pain points and how they improve on it. Just definitely do read this end-to-end. You'll definitely learn a lot. And you can see this is the number of, again, this is the number of iterations.
You can see it just keeps getting better even on the fourth iteration or the third iteration, though it may be expensive, whereas on some things like distinct games, maybe after one or two iterations is good enough. Again, it depends on your cost and how much latency you want to incur.
So yeah, I think that's all key takeaways. We can think about the operations we do as a few key operators, map, reduce, resolve, filter, rank, maybe. We can think about how we want to rewrite these operators together in a different way. And we also have a methodology on how to optimize these rewrites by generating pipelines.
And of course, we need a validator to validate every step of the way. Shreya, anything I missed? Any key takeaways that I missed? - Crushed it. Wow. - Thank you, Shreya, and thank you for joining us. - Thank you for all spending so much time on this work. This work has been two years in progress, by the way.
- Really? Wow, I didn't know it was that long. - Oh yeah, and even longer. - Really? Shoot. - That makes me very happy. - I'm in your bucket now. Yeah. I'm super excited to see the full paper. - EvalGen and all my other papers were actually part of this broader project.
I've been working on this for so long that my advisor was like, "What are you doing? Let's come up with something." So I was like, "What's our biggest problem right now?" About a year ago, a year and a half ago, I was like, "Validation. We need to come up with these validation agents.
I don't know how to do it." So that's how EvalGen came about. Yeah, I love the EvalGen paper. I mean, I built an entire app inspired by it. But Eugene, do you have a hand raised? - Yeah, so I'm just going to do the shout-out for the three plus one, I guess, for newer folks who are not aware and also for the recording.
The amount of meta in this room among the active participants is ridiculously high on this topic of RAG, LLM-as-judge, and evals. Shreya is the paper author and has been doing multiple banger papers, top-tier papers, on this topic of RAG, evals, and everything in between. If you want a list of her writing, you can probably just ask Eugene Yan and he'll produce a list for you.
Eugene Yan undersells himself, but he writes extensively on LLM-as-judge and helps teams build these solutions at the world's largest bookstore. So he's probably seen some of the biggest production use cases for this kind of deployment. Vibhu works at MindsDB, which is all about embedding search and enterprise RAG solutions built on top of it, and has seen a lot of real-world production use cases across MindsDB clients.
So he provides more breadth across multiple use cases. And Sean, as you may know, works on Smol AI as well, which does summarization and search on this kind of thing. So lots of high-level meta here. Hence, you can see why they were the most active in the conversation as well.
Yeah, and I've dropped a few links about this. I think that you should definitely read Shreya's previous paper. This paper is really, really, really good. And what I really like about this, it actually has UX research. You actually have participants who... So you can hear what other people are saying.
That's very useful. I'll also drop another link to a compilation I did on different LLM-as-judge techniques. So I think that's... It's helpful. It's a lot, I realize, but it references a lot of other papers. So this can be helpful for you. Okay. And next week, we have Eric.
Thank you for volunteering, Eric. We look forward to that. And drop the paper that you'll be taking us through in the Discord channel so that we have time to pre-read. Yeah, we'll do. Thank you, everyone. Thank you, Eric. Thanks. Thank you, Shreya, for joining us. Thank you for being here.
Yep. Bye. Okay, Swix, do you need to stop the recording or we just leave? I would say you'll just leave. He made me the host. He made you host. Can you stop the recording? Yeah, but then the thing is my session kind of bugged out and I need a refresh.
So I think the host will time out eventually. This session is not the host. Oh, shoot. Okay, I'm going to have to leave then. I have a NAH meeting. All right. See you, man. Bye. I might... I have a backup recording, I think. So if we lost it, then I can provide that.
We'll see. Yep, yep, yep. No worries. Cool. Hey, was that yikes? Yeah, yeah. I want the backup record. Okay. Yeah, yeah. Cool. All right, everyone, feel free to leave. I'm just waiting for the system to time out. Oh, no. Sean left his computer on on this.