
[Paper Club] DocETL: Agentic Query Rewriting + Eval for Complex Document Processing w Shreya Shankar


Whisper Transcript | Transcript Only Page

00:00:00.000 | Okay, thank you everyone for joining us for PaperClub today. Today, we will be talking about
00:00:04.080 | doc ETL, documentation, or rather document ETL. This is a paper by Shreya, big fan of her work,
00:00:10.560 | very practical work. And of course, looks like there's another solid Eugene on this paper,
00:00:16.720 | so should be a great paper. So the entire, I'm going to try to go through it in a few,
00:00:23.280 | there's a lot of concepts to this paper, there's a lot of examples and a lot of results.
00:00:27.120 | I'm going to try to run through it very quickly and try to finish at the 12:45 mark, and then we can
00:00:33.120 | discuss what we learned and what we thought was useful. So essentially, the premise of the paper
00:00:37.920 | is that we can use LLMs to process documents for us in a fairly hands-off way. But the problem
00:00:47.120 | is that for fairly complex tasks and data, LLM outputs for what we want them to do are fairly
00:00:53.920 | inaccurate. So a lot of times, what we want to do is just write a prompt, right? We just say,
00:00:57.600 | "Hey, LLM, given this arXiv, here are these 10 arXiv papers, find me everything that's
00:01:04.320 | relevant to hallucinations." We sort of do it in a very crude, ad hoc way: for every arXiv paper,
00:01:11.440 | what is the snippet relevant to hallucinations, we pull it out, and after we try to combine it,
00:01:16.800 | and then we try to extract some themes from it, we may include citations, et cetera, et cetera.
00:01:20.720 | So what this paper is trying to do is it's trying to provide formal vocabulary to try to do this,
00:01:25.440 | and we'll see what the formal vocabulary is. So in a nutshell, they have an agent-based framework.
00:01:31.840 | Let's put the agent-based framework aside for now, we won't focus on that.
00:01:35.040 | So what they try to do is that they will define operators, and they will define directives,
00:01:43.280 | and I'll define what operators and directives are, and then they will try to optimize this.
00:01:48.320 | So the optimization is really called logical rewriting and agent-guided plan evaluation.
00:01:53.600 | And of course, optimization algorithm. I mean, I'm oversimplifying it, but all of this is
00:01:58.000 | essentially trying to optimize the overall pipeline to try to get what you want.
00:02:00.960 | And they introduce directives and operators as well, which we'll get into right now.
00:02:07.440 | So there's a lot of intro over here, and there's this very big diagram over here.
00:02:11.120 | I won't actually go into this right now, I feel like we need some vocabulary before we can go
00:02:15.680 | into this. We need to understand what the pinks mean, what the greens mean,
00:02:19.440 | and what the different maps, all these different bolded words mean.
00:02:22.880 | So I'm just going to jump right into it. I'm going to skip the programming model.
00:02:28.000 | Essentially what this is trying to do is it's trying to take database concepts and pandas
00:02:33.120 | dataframe concepts and try to apply them to shapeless documents. Is Shreya on the call yet?
00:02:40.800 | Not yet. Oh, Shreya is. Hey, Shreya, if I say anything wrong, just stop me and just jump in.
00:02:50.880 | I'm just going straight to the operators right now.
00:02:54.000 | So essentially, the operators are fairly straightforward.
00:02:57.680 | And I have a note over here, it's really trying to think about how can we apply data pipeline
00:03:02.480 | and database operations to LLM tasks. And the LLM task over here is really documents.
00:03:09.440 | And so at the very simplest level, we have the map operator. Essentially the map operator, and
00:03:17.600 | there's this language here that's a little bit not so straightforward, applies an LLM powered
00:03:22.720 | projection, known as a semantic projection, to each document in the dataset. What this means,
00:03:27.520 | and my interpretation of it is that given a document, we create new features out of it.
00:03:32.800 | So you can look at this example here. Given a document, what are all the instances of
00:03:39.200 | police misconduct and what's the name of the officer involved and a new description? Essentially,
00:03:44.080 | given a PDF document, I want new columns for officer involved and new columns for
00:03:49.600 | brief description of the misconduct violation. So you can think of map. Map is infinitely flexible.
00:03:55.840 | Given a document, extract the relevant text. Given a document, generate a summarization.
00:04:00.320 | Given a document, translate it. Given a document, add a classification, et cetera, et cetera.
00:04:04.960 | So map is the workhorse of the operators here. And then they also share about parallel maps,
00:04:13.120 | where you can run maps in parallel. So you can see one prompt could be extracting misconduct,
00:04:18.560 | while another prompt could be summarizing policies. Essentially, you just do a map
00:04:22.080 | on every single thing. Then after you do a map, you have to do a reduce. Essentially, okay,
00:04:27.360 | so let's say I've extracted all the different officer names, all the different incident dates.
00:04:31.840 | I want to return it as a report. Then we have to do a reduce. So map will spin up a lot of new
00:04:39.440 | columns. Reduce will take all those columns and try to generate something that's coherent and easy
00:04:44.000 | for a human to read. And then they also talk about batch folding. Essentially, they talk about,
00:04:49.920 | you know, sometimes a lot of this is bigger than your context length, and that's why we use the
00:04:54.320 | map reduce paradigm, where you chunk it up and then you do reduce. But when they reduce, sometimes
00:04:59.440 | it's larger than the context length of the LLM, so they do folding. They talk about folding and
00:05:04.640 | hierarchical aggregation, but they apply batch folding. Essentially, imagine here we have these
00:05:13.520 | six boxes, and then we need to aggregate them. Maybe we aggregate them two at a time,
00:05:17.760 | and then we do something like that. I think that's all very straightforward. It makes a lot of sense.
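To make these operators concrete, here is a minimal sketch of a map, a parallel map, and a reduce with batch folding, assuming a generic llm() helper that takes a prompt and returns text. The function names and prompt templates are illustrative, not DocETL's actual API.

```python
from typing import Dict, List

def llm(prompt: str) -> str:
    """Illustrative stand-in for a call to any LLM API; returns the model's text output."""
    raise NotImplementedError

def map_op(docs: List[Dict], prompt_template: str, output_key: str) -> List[Dict]:
    # Map: one LLM "semantic projection" per document, adding a new column.
    return [{**d, output_key: llm(prompt_template.format(doc=d["text"]))} for d in docs]

def parallel_map(docs: List[Dict], prompts: Dict[str, str]) -> List[Dict]:
    # Parallel map: several independent prompts over the same documents,
    # each contributing its own output column (e.g. misconduct vs. policy summary).
    for key, template in prompts.items():
        docs = map_op(docs, template, key)
    return docs

def fold_reduce(values: List[str], fold_prompt: str, batch_size: int = 2) -> str:
    # Reduce with batch folding: combine a few values at a time into a running
    # accumulator so no single call has to fit everything in its context window.
    accumulator = ""
    for i in range(0, len(values), batch_size):
        batch = "\n".join(values[i:i + batch_size])
        accumulator = llm(fold_prompt.format(accumulator=accumulator, batch=batch))
    return accumulator
```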
00:05:22.160 | Now, there's the third operator, which is, in my opinion, a very important operator,
00:05:28.160 | but a very difficult operator to get right, which is the resolve operator. In a nutshell,
00:05:33.680 | this is deduplication. You know, swyx has the name of Sean Wang, has the name of swyx,
00:05:38.960 | has the name of YX Wang. How do we know that they are all swyx? This is extremely,
00:05:44.480 | extremely difficult. So compare the following two officer records from the police documents,
00:05:49.280 | right? And then how do we know that they're actually the same name?
00:05:55.440 | And then, you know, resolution has many different aspects of it. It could be sci-fi, science fiction,
00:06:00.560 | sci-fi fantasy. Sci-fi fantasy is not really sci-fi, but depending on your context,
00:06:05.920 | you could consider sci-fi and sci-fi fantasy as the same thing. So there's a bit of domain
00:06:11.600 | nuance to this as well. I think this is really, really difficult to do. I actually have a...
00:06:17.120 | So one, several things I've seen in production is that when you do resolving and deduplication,
00:06:25.120 | if your deduplication is too loose, everything deduplicates to a single huge megacluster.
00:06:30.160 | And that's when you get... That's when you have severe issues. And so this is actually very,
00:06:37.120 | very tricky to get right. And in the paper, they actually talk about different ways to do this.
00:06:43.120 | And they actually talk about a cheap way to do this, which we have also implemented as well,
00:06:46.720 | which is using semantic similarity and code-based comparisons, which we will talk about later. But long story short,
00:06:51.920 | this is very difficult. And then they have other operators, very standard, filter. Essentially,
00:06:58.320 | given a document, maybe we perform some kind of classification, do we want to filter the document,
00:07:03.280 | or maybe you do some kind of map to extract the relevant sections and filter everything out.
00:07:07.840 | That makes sense. Equi-join is, again, I think it's really difficult. In a sense, it's sort of
00:07:17.360 | like resolve: can you join these two concepts together? And then all of this relies on
00:07:23.200 | schema. And then they have a few auxiliary operators. Essentially, auxiliary operators
00:07:27.680 | means that it doesn't even rely on LLMs. Unnest is one of them. So for example, given a document,
00:07:32.880 | maybe you have created a lot of... Maybe given a blog post, what are all the papers
00:07:38.320 | referenced in this blog post? You'll return an array, a list. Unnest basically splits that list
00:07:44.480 | into individual elements. And then split is essentially, you can think of it as chunking.
00:07:50.560 | In the most naive way, it could be chunking in terms of number of tokens. In a smarter way,
00:07:56.400 | it could be chunking based on paragraphs or chapters or sections, or it could be even LLM
00:08:01.920 | based splitting. And then gather is essentially gathering all the individual chunks. So for
00:08:08.480 | example, imagine you are summarizing an arXiv paper. Imagine you're summarizing section by
00:08:13.600 | section. The previous section was the operator section, and now we're in the rewrite section.
00:08:17.920 | So it could be that when you're summarizing the rewrite section, you actually do need context
00:08:23.600 | about the operator section. It mentions certain abbreviations that the LLM would never know,
00:08:30.000 | but you have to provide a kind of context. So what gather does is by providing this peripheral
00:08:34.720 | information. So it could be as you're summarizing a document section by section, you collect some
00:08:41.600 | kind of map of all the various abbreviations that have shown up. So that you provide that
00:08:46.960 | context to the LLM, so the LLM doesn't try to hallucinate what those abbreviations mean.
00:08:51.200 | So those are the basic operators of DocETL. I'll pause here. Any questions about the operators?
00:08:59.360 | I think it's essential to understand map reduce and resolve. Everything else builds on this.
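To make resolve concrete before moving on, here is a rough sketch of entity resolution with embedding-based blocking: clear matches and clear non-matches are decided by similarity alone, and only the ambiguous middle band goes to the LLM for a pairwise comparison. The embed() helper and the thresholds are illustrative, llm() is the same placeholder as in the earlier sketch, and none of this is DocETL's actual API.

```python
import itertools
import numpy as np

def embed(text: str) -> np.ndarray:
    """Illustrative stand-in for an embedding API; assume unit-normalized vectors."""
    raise NotImplementedError

def resolve(names, low: float = 0.3, high: float = 0.9):
    # Resolve: decide which surface forms refer to the same entity, using
    # embedding similarity to avoid an LLM call on every single pair.
    vectors = {n: embed(n) for n in names}
    matches = []
    for a, b in itertools.combinations(names, 2):
        similarity = float(vectors[a] @ vectors[b])
        if similarity >= high:          # clearly the same; no LLM needed
            matches.append((a, b))
        elif similarity <= low:         # clearly different; no LLM needed
            continue
        else:                           # ambiguous band: ask the LLM to compare the pair
            answer = llm(f"Do '{a}' and '{b}' refer to the same person? Answer yes or no.")
            if answer.strip().lower().startswith("yes"):
                matches.append((a, b))
    return matches  # equivalent pairs; a union-find pass over these would give clusters
```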
00:09:07.360 | So this is a classic, not so much a question, more of a comment. My real question was,
00:09:13.760 | was there a rank operator? And the answer is no. You could maybe compose the rank operator with
00:09:19.600 | sometimes split, gather, thingamajig, but it's not a formal ranking system. And even before AI,
00:09:30.240 | I had always had this strong view that filtering, ranking, sorting, all this stuff is kind of like
00:09:38.080 | the same layer in the API stack, and should be kind of done together. This is my number one
00:09:46.240 | problem with AI News right now, which is that all filtering is effectively a recsys, and I have to
00:09:56.240 | rank. Yes. So you filter just based on metadata, like hard numbers, because you don't have any
00:10:05.520 | other better stuff. If you could create a ranking formula, you would use the ranking formula rather
00:10:09.680 | than any other filter basis. There is a filter op, right? So in section 2.2, under other ops,
00:10:21.280 | the filter operation independently retains documents from the input dataset based on
00:10:25.440 | conditions specified through an LLM prompt. In some sense, it's under ops, if you scroll down
00:10:32.240 | a little. This filter one seems to, if you can specify conditions through an LLM and retain
00:10:40.960 | docs, that's a form of ranking, no? That's how I interpreted this one, as opposed to like, yeah.
00:10:49.280 | Yeah. I mean, I think there's a question of sort of filter, then rank, or rank, then filter.
00:10:53.840 | I think in my strong view, filter. Yeah. You can think of filtering as in dropping
00:10:59.440 | items from an array, whereas ranking is me pushing things down or pushing things up.
00:11:03.840 | So I think it's slightly different. And you know, there's a question for you,
00:11:07.040 | Shreya. Is there a reason why they're split into operators and auxiliary operators?
00:11:13.120 | Oh, very simply, we just wanted to put the primary LLM-powered operators in one section,
00:11:20.400 | the one that people will probably spend a lot of time tinkering on, and the auxiliary ones
00:11:24.480 | are there to stitch them. So a lot of people want to do, for example, reduce after map,
00:11:29.600 | well, they want to unnest after map, and then do reduce there.
00:11:34.240 | Yep. Okay.
00:11:37.520 | A comment about ranking. We're working on a sort operator. We haven't found,
00:11:43.440 | maybe swyx, we should chat about this. We haven't found a really compelling use case
00:11:46.720 | from any of the users yet. Really?
00:11:48.720 | Well, a lot of people's use cases can simply be solved by, you know, applying like embedding
00:11:55.280 | based similarity to some query vector and then ranking by that similarity. I want to know,
00:12:00.560 | kind of, I think swyx's case is interesting. Like, in what cases does it make sense to have
00:12:05.760 | this pairwise comparison interface be the primary way to steer that operator? So,
00:12:12.640 | maybe like ranking articles in terms of what is most interesting to a human could be such a case.
00:12:18.640 | Yeah, so-
00:12:20.320 | I think there's another case.
00:12:21.760 | Sorry. Yeah, go ahead.
00:12:22.720 | Go for it. Go for it.
00:12:23.440 | I mean, like, yes on the what's, you know, what's most interesting to a human. I mean,
00:12:29.520 | this is standard Rexis. I'm not convinced that it has to be pairwise. I feel like pairwise
00:12:35.520 | maybe it's easy to do data entry. So, maybe that's why.
00:12:39.680 | I'm actually strongly convinced that it should be pairwise and we can debate that and see how
00:12:45.600 | it works. And I also think that-
00:12:47.280 | Why do you think it should be pairwise?
00:12:50.960 | I think it's just more reliable and stable that way. I think you don't have to go through all,
00:12:56.160 | I mean, if we do everything pairwise, it's quadratic, right? But we don't have to go
00:13:00.640 | through all of that. We can sort of do it smartly. I think that, and you know, in Shreya's paper,
00:13:06.320 | they actually cover how they use embeddings and code-based resolve, right? You can think of it as,
00:13:12.480 | also, you can apply that if you tweak that a bit, that can be applied to ranking.
00:13:16.560 | But you can also tweak it a bit. You can say that ranking, we actually have
00:13:20.160 | confidence levels. If the similarity score is strong enough or weak enough, like if it's really
00:13:25.120 | good, like 0.9, we know it's strongly related. If it's really bad, like 0.1, we know it's poorly
00:13:29.680 | related. But then there's a lot of stuff that's in the middle that then we can use an LLM power
00:13:34.800 | ranking. We can go into that a bit, but this paper is huge and I want to try to go through
00:13:39.360 | as much of it as I can. So now that we've covered operators,
00:13:44.400 | they propose rewrites. So actually, I have a question for you, Shreya. When you say rewrite,
00:13:51.520 | do you mean rewriting the pipeline or do you mean rewriting the document?
00:13:57.760 | What does rewrite in rewrite directive stand for?
00:14:00.240 | Rewrite, this is a good question. No one's ever asked this. Rewriting the pipeline. So say you
00:14:04.800 | have a pipeline that is just a map operator. Many people have this pipeline. They have a document
00:14:10.160 | set and they want to extract a bunch of metadata from each document. Like I have a bunch of legal
00:14:15.040 | contracts and I want to extract as many clauses of interest as possible from them. That would be,
00:14:20.160 | you can program that as a map operation, a single map operation to go through every document and
00:14:26.000 | extract all your fields that you want. When we say rewrite, we acknowledge that this operation
00:14:32.640 | might not work when you execute it. What if the document is too long? What if you're extracting
00:14:38.000 | too many fields? It's too hard. So you want to rewrite it into a more complex sequence of operators.
00:14:43.760 | So is my mental model right that the rewrite directives are somewhat in the realm of
00:14:48.400 | optimization already? Yes. Exactly.
00:14:51.600 | Right. So now we're going to talk about rewrite directives. So let's again take our example of
00:14:56.720 | this arXiv PDF. We want to extract what the key themes are. A noob like me would just upload
00:15:01.760 | it to Claude and ask Claude, what are the key themes? Now we can rewrite that to be smarter. We can
00:15:07.120 | rewrite it into a map, basically map the entire document into all text. And then we can do split,
00:15:13.840 | we saw the split operator. And then on the splits, we can do a map on extracting the key themes.
00:15:18.720 | And then we can do a reduce, right? To try to reduce all these key themes. Again, the point
00:15:23.520 | here is that if we give the LLM smaller chunks, it is better able to pay attention and therefore
00:15:28.000 | come up with a more rich, more comprehensive and more factual summary. But you can imagine the
00:15:34.320 | search space of creating such a pipeline is infinite, right? The search space of chaining
00:15:39.040 | together all the different Lego blocks, of putting together all the different AWS services, is
00:15:42.480 | infinite. So what they propose is a methodology to help us try to do that semi-automatically.
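As a rough sketch of what that rewrite might look like declaratively, here are the two pipelines side by side. The operator names mirror the paper, but the keys and prompts are illustrative, not DocETL's exact YAML spec.

```python
# Naive pipeline: one map over the whole paper.
naive_pipeline = [
    {"op": "map", "prompt": "List the key themes of this paper: {doc}"},
]

# Rewritten pipeline: split into chunks, gather neighboring context,
# map per chunk, then reduce the per-chunk themes into one answer.
rewritten_pipeline = [
    {"op": "split",  "method": "tokens", "chunk_size": 2000},
    {"op": "gather", "context": ["doc_metadata", "previous_chunk_summary"]},
    {"op": "map",    "prompt": "Given this chunk (plus context), list the key themes: {chunk}"},
    {"op": "reduce", "key": "doc_id",
     "prompt": "Merge these per-chunk theme lists into one deduplicated list: {values}"},
]
```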
00:15:51.120 | So I will focus on all the different things that they do in rewrites, but I'm going to call out
00:15:55.120 | what I think is absolutely essential. So rewrite directives. The first one is data decomposition.
00:16:01.920 | So essentially when you have very large documents and there are many documents, and then you just
00:16:07.200 | have to decompose it. The very standard pattern they share is document chunking, which everyone
00:16:12.720 | has been doing. This is standard. That said, I actually do want to share this example whereby
00:16:17.680 | I was asked to help with a pipeline, and I was able to improve downstream metrics significantly
00:16:26.160 | by 20 to 50% by removing chunking. So every now and then it's good to rethink, hey, do we actually
00:16:31.840 | need chunking? In such cases, I think Shreya's example here is that it's way beyond the context
00:16:36.480 | window size, and so you definitely do need it. So you can see they do chunking. So you can map,
00:16:41.680 | split, gather, map, reduce. And you can see the optimization. There's many, many different ways
00:16:47.120 | to optimize. I'm just going to pick B here. So you can imagine we could split it and then we could
00:16:53.120 | gather all of it. After we split it, we gather all of it. But what they're proposing is that we split
00:16:58.000 | it, we create a summary, and then we create... I can't remember what H stands for. Hierarchical
00:17:06.000 | information. We create a summary and create hierarchical information. So we enrich the
00:17:11.040 | split chunks and then we gather them. So you can imagine essentially all pipelines now are LLM
00:17:18.720 | pipelines. LLM is enriching the data. LLM is cleaning the data. LLM is filtering the data.
00:17:23.600 | It's going to be a bit lossy. It's going to be a bit stochastic, but I think we will figure it out
00:17:28.080 | in the next one to two years to get it to a more reliable state. So they actually share some very
00:17:34.160 | insider ghost knowledge. When splitting a document, there are some kinds of context
00:17:39.360 | that are very useful. I fully agree with all of this. Document level metadata. What are all the
00:17:42.960 | different abbreviations? Who's the author, et cetera? Hierarchical information. Summaries of
00:17:47.200 | neighboring chunks. Summaries of previous chunks are actually very valuable. So the LLM has enough
00:17:51.680 | context and doesn't make shit up. And then they also have different patterns here, document level
00:17:57.760 | metadata. Maybe I'll just go through one of them. So example, in this case, before you split it,
00:18:03.120 | you extract metadata that's relevant to all chunks. So by extracting this abbreviation metadata,
00:18:07.680 | author metadata, you make sure that all chunks, when you're doing map on all of these chunks,
00:18:12.080 | you actually have all this context. The LLM has all these contexts and is able to do a more
00:18:15.360 | reliable job. And then they have many, many different other patterns that you can consider.
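A small sketch of that document-level metadata pattern, reusing the illustrative llm() helper from earlier: extract the shared context once, then prepend it to every per-chunk prompt. The prompts and the character cutoff are made up for illustration.

```python
from typing import List

def process_with_metadata(doc_text: str, chunks: List[str]) -> List[str]:
    # Extract document-level context once: abbreviations, authors, and so on.
    metadata = llm(
        "List the abbreviations (with their expansions) and the authors "
        "mentioned in this document:\n" + doc_text[:4000]
    )
    outputs = []
    for chunk in chunks:
        # Prepend that shared context to every per-chunk prompt, so the LLM
        # never has to guess what an abbreviation means.
        prompt = (
            "Document-level context:\n" + metadata + "\n\n"
            "Chunk:\n" + chunk + "\n\n"
            "Extract any instances of police misconduct described in this chunk."
        )
        outputs.append(llm(prompt))
    return outputs
```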
00:18:21.440 | Now, the next thing is multi-level aggregation. So you need to aggregate it. Essentially,
00:18:27.760 | let's imagine you have hundreds of chunks. You are summarizing a movie and you have all the
00:18:32.720 | different movie cut scenes. And imagine we don't have Gemini to do this for us and just upload the
00:18:36.560 | entire movie. Imagine you have to do it. We have to chunk it up. So what they propose here is to
00:18:42.000 | aggregate, is to chunk it up, aggregating the data at a finer granularity. So essentially, it's like,
00:18:47.280 | okay, we have scene one, or let's put it another way. Imagine you're trying to summarize a series
00:18:53.920 | of, oh, wow, this is a bit close. No, imagine you're trying to summarize the Harry Potter movies.
00:18:59.280 | So you could first summarize the first Harry Potter movie, first summarize every scene in the
00:19:04.240 | first Harry Potter movie, roll it up to a summary of the first Harry Potter movie, and then roll it
00:19:08.720 | up to a summary of the entire Harry Potter movies. So that's hierarchical aggregation.
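A sketch of that multi-level aggregation, again with the illustrative llm() helper: reduce at the finer key (one movie) before reducing at the coarser key (the whole series).

```python
from typing import Dict, List

def summarize(texts: List[str], instruction: str) -> str:
    return llm(instruction + "\n\n" + "\n---\n".join(texts))

def hierarchical_summary(series: Dict[str, List[str]]) -> str:
    # series maps movie title -> per-scene summaries for that movie.
    movie_summaries = [
        summarize(scenes, f"Summarize the movie '{title}' from these scene summaries:")
        for title, scenes in series.items()
    ]
    # Roll the per-movie summaries up into one summary of the whole series.
    return summarize(movie_summaries, "Summarize the whole series from these per-movie summaries:")
```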
00:19:15.120 | And then they have several things here, like LLM centric improvements. There's this term here,
00:19:19.920 | gleaning. So it's prompted with the previous inputs and outputs and asked to improve the
00:19:24.480 | outputs. I mean, to simplify, oh, sorry, Shreya has a hand raised. Oh, I didn't want to interrupt
00:19:30.080 | you. Sorry. Your point about chunking, I thought was very interesting in that sometimes it's
00:19:37.920 | beneficial to chunk and sometimes you should not chunk. And we have observed this in a number of
00:19:43.280 | workloads. And the insight that we've gained is that we will never know when it's all task-specific
00:19:48.160 | and data-specific. And we are so further convinced that you kind of need some optimizer to explore
00:19:54.880 | these different choices for you and come up with something reasonable. But anyways, just a meta
00:20:00.080 | comment on, it's not the rewrite rule that if you apply it, it always works. It's that if you apply
00:20:06.000 | it, it sometimes works. It sometimes works a lot better. And you need some way to kind of try a lot
00:20:11.600 | of these things automatically for you. - I agree. Vips, you have a hand.
00:20:16.080 | - A little follow-up to that point. Somewhere, I think on like the first page where it talks about
00:20:23.200 | chunking and performance, there was just like a handful of dumps of citations of different papers.
00:20:29.520 | So like, there was a section, I think on the first page, recent work has shown that LLM
00:20:34.800 | performance degrades considerably as length increases. Citing a paper, they can be distracted.
00:20:39.600 | Citing a paper, pay attention more to certain parts, citing a paper, or fail to gain holistic
00:20:45.360 | understanding. And there's like four citations. So pretty cool if anyone wants to dig into that,
00:20:49.920 | there's like eight citations on the first page that talk about this. And then another little
00:20:56.480 | comment actually. So you skipped over a few of the chunking strategies. So there's stuff like
00:21:02.720 | document-level extraction, headers. I thought that the third one was kind of interesting where you
00:21:06.880 | have chunk filtering. I haven't seen this approach too often. So just if people haven't read it,
00:21:11.760 | it's an interesting little one that popped up. So as you chunk, whether it's semantically
00:21:16.320 | or whatever it is, you can filter out sections. Not many other pre-processing handlers seem to do
00:21:21.840 | this, but the example here is pretty clear, right? So as you break up chunks, you can filter out
00:21:27.440 | stuff. So for example, in an archive research paper, you can filter out citations and references
00:21:33.280 | that aren't really relevant to the core information. It's just one of those chunking
00:21:38.000 | strategies that you don't see too often. So, you know, figured I'd call that out.
00:21:41.200 | Yeah. A hundred percent agree. All of these bolded mini headers here, they're all very valuable. I
00:21:47.840 | just don't have the time to go through them, but you should read through them. I think these are
00:21:51.520 | practices that they have gained from working on very, very difficult real world problems
00:21:56.080 | to just get the level of performance up. And you can take a lot of inspiration from this.
00:22:00.800 | Shreya, you have another hand raised. Oh, a small comment again for the
00:22:04.640 | police misconduct data. Often we have like thousand page records documents where it's
00:22:10.960 | just page on page of like random image of like a street sign or something. And if you just drop
00:22:17.360 | those, your processing, your accuracy just stays the same and you save so much on costs. So that's
00:22:23.200 | where the filtering came from. Yep. Very valuable. Now they have LLM-centric improvements. This one
00:22:30.080 | here is gleaning, where the LLM is prompted with the previous inputs and outputs and asked to improve
00:22:34.480 | them. You can think of it another way: you have a validator
00:22:39.040 | in the loop, or essentially, like with code, you just copy: hey, Claude, here's my error
00:22:45.120 | message. How do I fix it? This is very similar: given the previous input and output and an error
00:22:50.480 | message or some kind of feedback, you get the LLM to improve on it. So you can iteratively
00:22:56.080 | improve on this. And I think, uh, later we have an experiment result where they try to do this
00:23:01.360 | four times and you can see how it tapers off. But in some instances, it just keeps
00:23:05.760 | improving with more iterations. So how is this done?
00:23:12.640 | Oh, actually they do have a validator within the loop. Okay. So how is this done? Okay.
00:23:16.960 | You preprocess, you process it, you get the original map operation, like summary of the
00:23:23.680 | paper, and then you have a validator, try to evaluate it and provide some feedback. Now you
00:23:29.360 | provide the previous summary and the feedback and maybe the context and then try to refine on it.
00:23:33.520 | And then you just improve this many times. So you can imagine now all your SQL pipelines,
00:23:38.480 | all your data pipelines, now have intelligence sprinkled in, where it can map
00:23:43.520 | on all this. We have a validator in the loop. So users can have a more capable
00:23:48.880 | model to be the validator. Yeah. Yeah. I think I missed the part about being a validator in the
00:23:55.200 | loop. Because I thought it was just asked to improve the outputs. Yeah. But again, fully
00:24:00.320 | agree. I think this cannot be done without an evaluator or validator. It's absolutely essential.
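A rough sketch of the gleaning loop as described: one data-processing call, a validator prompt that answers yes or no with feedback, and a bounded number of refinement rounds. The two roles can be different models; the prompts here are illustrative, not DocETL's actual ones, and llm() is the same placeholder as before.

```python
def glean(context: str, task_prompt: str, max_rounds: int = 4) -> str:
    # One data-processing call, then a bounded refine loop driven by a validator.
    output = llm(task_prompt + "\n\n" + context)
    for _ in range(max_rounds):
        verdict = llm(
            "You are a validator. Does the output below fully satisfy the task "
            "(for example, does it include every officer and every incident)?\n\n"
            f"Task: {task_prompt}\nOutput: {output}\n\n"
            "Answer 'yes' or 'no', then give brief feedback."
        )
        if verdict.strip().lower().startswith("yes"):
            break  # the validator is satisfied, so stop refining
        output = llm(
            f"Task: {task_prompt}\n\nDocument: {context}\n\n"
            f"Previous output: {output}\nValidator feedback: {verdict}\n\n"
            "Produce an improved output."
        )
    return output
```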
00:24:06.400 | Then they have this thing, duplicate key resolved. This is the problem, right? LLM outputs are not
00:24:14.400 | canonicalized. So for the same person, swyx, they may have very many different
00:24:20.480 | names out there. So you have to canonicalize it. So they take semantic equivalent values of the
00:24:27.040 | key. And I think they try to reduce the search space in the paper. I can't remember where it's
00:24:33.360 | mentioned, but they try, maybe they try to reduce the search space with embeddings and then try to
00:24:36.720 | resolve this. Long and short of it, this is a very, very hard problem. And you have to do this
00:24:43.440 | with fairly high precision. If not, you could be saying that Eugene Cheah and Eugene Yan are actually
00:24:48.560 | the same person. And Eugene Yan maybe just raised $10 million when it's really Eugene Cheah. I actually
00:24:54.080 | don't know how much Eugene Cheah raised. I'm just making an example where precision is quite
00:24:58.000 | important. Vips, you have a hand raised. Yeah. So when I read this section about
00:25:03.760 | gleaning, I saw the same thing. So I understood basically on the page you're at that there's a
00:25:10.000 | validation model. So there's a validator, but then it also says that that first paragraph,
00:25:15.760 | they employ a separate validator and data generation. So go on below this next one. Yeah.
00:25:21.040 | I didn't really see where this data generation LLM is, but I understand that you do some
00:25:26.000 | pre-processing, you have a validator to check how it worked. Then you have a number of steps to do
00:25:32.240 | this, do your whatever processing, validate, process, validate. But I don't get where this
00:25:38.320 | data generation LLM is. And then it seems like data generation is just, that's the only reference
00:25:43.840 | of it. So I didn't know if this is like, I just didn't see it too much in this.
00:25:48.320 | No, that's ambiguous on our part. I just meant the data processing LLM, the normal LLM that you use.
00:25:54.240 | So how does a traditional operation work? Look at figure four, you just have the output that comes
00:26:00.080 | from the data processing LLM, nothing more. What gleaning does is it adds that validation agent,
00:26:05.520 | and it creates a loop of some finite number of iterations or whether the validation agent said
00:26:11.680 | it was good. So you can just bound the number of times it refines, and then the final output,
00:26:16.320 | that's all. So is it using that original step again? So it says it's a separate
00:26:22.080 | validator and data generation model. You can specify two different models. You could also
00:26:29.600 | specify the same model architecture, but the point is the validator agent has a different prompt
00:26:35.120 | that has a specific instruction of, does this contain all the officers or instances?
00:26:40.720 | And the validator agent's job is to say yes or no with optional feedback. The data processor LLM's
00:26:47.280 | job is to give the answer. That's why we have a distinction between the two.
00:26:51.840 | Got it. Got it. Okay.
00:26:54.960 | Good question.
00:26:55.520 | And Shreya also has an interesting point that it's much easier to verify the output than generate it.
00:27:00.720 | I actually observed the opposite.
00:27:04.080 | It depends on the task. For classification tasks, yes, it's easy for factuality or
00:27:09.440 | comprehensiveness or relevance, whatever. It's a lot harder.
00:27:13.120 | A lot of the synthetic data gen papers like Orca 3, WizardLM, they show that
00:27:18.800 | models are better at verifying output than generating output. So you have a weaker...
00:27:24.800 | Yeah. So it's shown in other work too, but interesting note. It does depend on the task.
00:27:31.920 | So next we have projection synthesis. Well, this is tricky. I think what it means is that it's
00:27:40.080 | hyper-pipeline optimization. So essentially we don't really know what is going to be good.
00:27:46.560 | So you can imagine that we have very many different ways on how to do this.
00:27:52.160 | And we can find that agents are... It's really hard for... If you have a pipeline generating agent,
00:28:00.000 | it's really hard for an agent to try to figure out which pipeline is going to work in the first
00:28:04.880 | place. And there are infinite ways that you could build this pipeline. So they propose several ways
00:28:14.640 | to do this. And of course, the first thing is chaining. It's like iteratively chain LLM calls
00:28:20.560 | to simplify the task. Essentially, this is what people say. Break up your task so that every
00:28:24.800 | prompt only does a single thing. So it could be extracting entities, summarizing key points,
00:28:28.720 | and then generating recommendations. Another one could be isolating. So for example, maybe
00:28:34.000 | a question would be what sections of this paper talk about hallucination?
00:28:38.560 | And then so you could reduce the scope of work to be done on the doc, whereby you could just
00:28:44.000 | classify the documents, maybe just classify sections, classify paragraphs as related
00:28:48.240 | to hallucination or not, and then just flag it up. So there's a lot there.
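A sketch of that kind of chaining on the police example: a first map whittles the long document down to the relevant passages, and the original, harder prompt then runs on the smaller text. Prompts are illustrative, and llm() is the placeholder from earlier.

```python
def chained_extraction(doc: str) -> str:
    # Step 1 (a synthesized projection): whittle the long document down to the
    # passages that mention police officers at all.
    relevant = llm(
        "Copy out only the passages of this document that mention police officers:\n\n" + doc
    )
    # Step 2 (the original, harder task) now runs on a much smaller, focused text.
    return llm(
        "Summarize every instance of police misconduct, with the officer's name and the date:\n\n"
        + relevant
    )
```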
00:28:54.880 | And I won't go through all of this. But anyway, that is the end of the rewrite section. Any
00:29:01.840 | questions there on rewriting? Because now we're going to go into the optimization step.
00:29:09.040 | I'm happy to explain projection synthesis if that's helpful.
00:29:12.320 | Yeah, go for it.
00:29:13.600 | Yeah. So this is very similar to the chain of thought mindset that you have in LLM prompt
00:29:22.400 | engineering, where if you have a very complex task, often what can help is if you break that
00:29:28.400 | down into two LLM calls. Now, you can generalize this to say all this means is you have a map
00:29:37.200 | operation that does partial work and then your original operation that does the task to maybe
00:29:46.080 | make it easier. And you could do this for any operation. You can have a map before another
00:29:51.760 | map. You can have a map operation before another reduce, before another filter. In this police
00:29:57.200 | misconduct case, we've seen this very helpful around whittling down, like summarizing the
00:30:03.520 | document into the relevant points needed to then answer the hard question of misconduct.
00:30:09.200 | So you'll take a very long document and the first map operation that's synthesized will just
00:30:14.960 | extract information that is relevant to police officers at all, making it a smaller document.
00:30:22.560 | And then summarizing the instances of misconduct is much simpler to do on that very focused
00:30:29.360 | document. So that's an example of projection synthesis. And projection means map in the
00:30:36.640 | data world. So you can apply that to any operation. And you can have the LLM write that.
00:30:42.720 | You can have an LLM take a task and try to break it down into multiple steps and synthesize that
00:30:48.080 | map projection. >> Thank you, Shreya. So any
00:30:52.320 | questions on rewriting directives? Okay. So now we're going to go into the optimization aspect.
00:31:04.080 | So the key aspect here in the optimization aspect is that, okay, you sort of declare
00:31:09.840 | an initial pipeline. And then DocETL will try to run that pipeline, get initial evaluation scores,
00:31:16.880 | and then it will try to optimize that pipeline for you. So the two key agents here, that's the
00:31:21.440 | generation agent. So what a generation agent does is that the generation will actually generate
00:31:26.160 | rewrites. So when we talk about generating here, the generation agent is going to generate the
00:31:30.800 | rewrites and the alternative pipelines, the pipeline options. And then you have validation
00:31:35.520 | agents. And again, this is another example, right, of how evals and validation are just
00:31:41.920 | fundamental to this. Because without the validation agent, you would not be able to
00:31:45.600 | assess and measure the different pipelines, how the different pipelines do, and then how
00:31:51.680 | to compare between them. So there's two things. You generate pipelines. Second, you evaluate each
00:31:56.720 | step of the -- maybe you evaluate each step of the pipeline and you evaluate the overall pipeline.
00:32:01.120 | And then we try to see which one performs well. So there's a lot of algorithms here. But I find
00:32:09.360 | that the paragraphs here actually explain it. I found it far easier for me to understand.
00:32:14.480 | So essentially, you can imagine that there's a pipeline, right? You have a pipeline. Again,
00:32:19.200 | let's just stick to the police document, police example pipeline, all the different
00:32:23.040 | examples of misconduct. So what you first do is you use a validation agent to create a custom
00:32:30.320 | validation prompt, right? And I don't know if we do need inputs and labeled outputs here.
00:32:36.560 | Do we actually need inputs and labeled outputs here to get the validation prompt?
00:32:41.840 | Nope. I mean, you could -- we don't support that because nobody has ever come to us with
00:32:46.880 | labeled examples. The other thing that's hard is sometimes there are newly synthesized operators
00:32:52.240 | as part of rewrite rules. And those are certainly not going to have any labeled examples.
00:32:58.240 | But one of the things we are exploring is we have an interface for doc ETL and we're exploring
00:33:02.400 | kind of interactive optimization of pipelines. So if we synthesize a new operator, we might ask
00:33:06.880 | a human to label some examples and then use that for the validation prompt. But that's, you know,
00:33:13.280 | in the works. I think that's a good idea. And that's the entire crux of Align-Eval,
00:33:18.000 | which I shared with you. I'm of a slightly different take. I feel like we do need some
00:33:23.120 | set of C human labeled data. So first thing, we use the validation data. We have a custom
00:33:29.120 | validation prompt. And then what a validation agent will do is sample some outputs from the
00:33:36.160 | initial pipeline or from chunks of the initial pipeline to see if there's room for improvement.
00:33:40.000 | If there is room for improvement, what we do is we rewrite -- they have rewrite rules where
00:33:46.880 | they apply to the subpipeline or the individual operation or the entire pipeline itself.
00:33:51.280 | So then what that means is that -- and then they have this thing we should call recursive
00:33:55.360 | improvement. In the sense that when you create new optimizations for a rewrite,
00:34:00.240 | you immediately optimize them before continuing the rest of the optimization. So what it means
00:34:06.000 | is that from a pipeline, imagine a pipeline goes from left to right, you start upstream,
00:34:10.480 | iterate on that, and then you go downstream. And this can take as much time. Compute is free. You
00:34:14.960 | can just run it overnight. And of course, it just costs a few hundred dollars. But that's fine
00:34:19.760 | compared to how much human time it would take. So now you have multiple candidate plans. Let's
00:34:25.840 | just say you have 10 candidate plans. And then you take all these 10 candidate plans. You first
00:34:30.880 | execute each plan on a sample of the data. You don't want to execute it on all your thousands
00:34:34.960 | of datasets. You just execute on maybe 50. And then use the validation agent to rate each output.
00:34:39.680 | So now after you've rated this sample, these different sample plans,
00:34:45.440 | then you take the top K. And then you do, I think, full evaluation. Oh no, it does pairwise
00:34:51.440 | comparisons across these top plans. And the plan that wins most is the winner. And of course,
00:34:56.160 | you select the optimized plan.
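Putting the optimization loop together as a sketch: a generation agent proposes candidate plans, each plan runs on a small sample, a validation agent rates the outputs, and the top few plans are compared pairwise. All the callables passed in are placeholders for what the agents would actually do; this is the shape of the loop, not DocETL's implementation.

```python
import itertools
import random
from typing import Callable, List

def optimize(pipeline, docs: List[dict],
             generate_rewrites: Callable, run_pipeline: Callable,
             rate_output: Callable, pick_better: Callable,
             sample_size: int = 50, top_k: int = 3):
    # 1. The generation agent proposes candidate rewrites of the user's pipeline.
    candidates = generate_rewrites(pipeline)
    # 2. Each candidate runs on a small sample; the validation agent rates each output.
    sample = random.sample(docs, min(sample_size, len(docs)))
    scored = []
    for plan in candidates:
        outputs = run_pipeline(plan, sample)
        scored.append((sum(rate_output(o) for o in outputs), plan))
    finalists = [plan for _, plan in sorted(scored, key=lambda s: s[0], reverse=True)[:top_k]]
    # 3. Pairwise comparisons among the finalists; the plan that wins most often is selected.
    wins = {id(plan): 0 for plan in finalists}
    for a, b in itertools.combinations(finalists, 2):
        winner = pick_better(a, b, sample)
        wins[id(winner)] += 1
    return max(finalists, key=lambda plan: wins[id(plan)])
```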
00:35:02.480 | I think an example here is helpful. So here is the police misconduct dataset. The current pipeline is this, right?
00:35:18.400 | The team has a domain-specific clustering algorithm and then human annotation to de-duplicate officer
00:35:24.240 | names. And this is the baseline pipeline that the team has. So what they did was that doc ETL
00:35:33.360 | synthesized various pipelines. The first one was to extract misconduct summaries.
00:35:38.240 | Extract the name and the summary of the misconduct before trying to de-duplicate
00:35:46.080 | on the officer name, before trying to resolve the officer name. And then it summarizes it.
00:35:52.480 | Then you can see here's a new... Was this synthesized by doc ETL, Shreya? Doc ETL-T?
00:36:01.760 | Good question. So we are redoing this entire eval, by the way. So we're submitting this to VLDB
00:36:09.280 | December 1st. You'll see an entirely new eval for the next version. But for this case study,
00:36:16.560 | the doc ETL-S-T-O are all candidate plants from doc ETL's optimizer. And when we were exploring
00:36:27.680 | them, we were like, "Oh man, I don't even know which one is going to be better." Truly. You
00:36:31.920 | read them, you have no idea. So we compared them all. Doc ETL-O was the one that was actually
00:36:37.520 | selected. So you can see that they generated several plants. I don't know how many hundreds
00:36:42.800 | of plants they generated. The author evaluated 50. I recall they had certain numbers here.
00:36:55.120 | So okay, yes. Okay, shoot. I don't know. Okay. So they had 220 documents and the baseline was $2,
00:37:07.520 | while doc ETL-S-T-O each cost a little bit more money. Running the optimizer,
00:37:12.160 | they used the optimizer. I don't know how many plants the optimizer generated, but it cost
00:37:16.720 | approximately $100 and less than half an hour. Just go get lunch and you come back and you get
00:37:21.120 | your optimized pipeline. And the optimization cost was just $100. How many pipelines? Oh,
00:37:27.360 | perfect. Here it is. It generated 200 pipeline variants and all of this was
00:37:32.000 | just evaluated with the validator prompt. Yeah. So I think that's an example of how
00:37:38.720 | the optimization step runs. You run a bunch of pipelines, validate them, and optimize the
00:37:45.040 | subcomponents within the pipeline and the pipeline itself. So now that we've gone through this,
00:37:51.040 | we can now discuss figure one. We had to go through a lot of it, but I think now we can
00:37:57.680 | discuss figure one. Figure one is you can imagine all of these are police documents and we want to
00:38:02.640 | extract the names of officers and the misconduct. So you can see this is the user-defined map,
00:38:09.440 | right? Okay. And maybe if I'm a user, my initial prompt is very basic. Given a PDF,
00:38:16.080 | just extract any cases of misconduct. So now we apply the rewrite directives.
00:38:21.280 | We apply rewrite and the rewrite, of course, we have the baseline, which is no change.
00:38:25.280 | We can do projection synthesis, which by we extract name, extract summary,
00:38:30.000 | and then after that do another one. Or we could do decomposition where we split the document up
00:38:34.160 | together and map. And then all of these, all the pink boxes are where the agent will automatically
00:38:41.440 | rewrite it. And then the green boxes are when the plan is selected and NC here stands for NSYNC. No,
00:38:48.320 | I mean, it stands for no change. Essentially, if it's no change, we just accept it.
00:38:52.160 | And you can see, we just go through from left to right. And that is just for extracting any
00:38:58.960 | cases of misconduct. After you've extracted these cases of misconduct, you can see Officer A,
00:39:04.160 | Sergeant B, and Officer C, then now we need to reduce it. And again, we apply the same thing
00:39:10.160 | down the stream. So we start from left to right. Every time we generate something new,
00:39:15.600 | we validate it. And if it passes, if it's good enough, we move on to the next one.
00:39:20.320 | And so it just iteratively goes down the entire stream. So now that is the section on operators,
00:39:27.040 | rewrite directives, and optimization. And that's all summarized in this document. Any questions?
00:39:34.400 | I can answer RJ's question. It's easier than typing it out.
00:39:41.280 | Yeah, go for it.
00:39:42.480 | Yeah, so RJ asked a question. I get how you can generate good evaluation prompts for the
00:39:46.240 | individual operators, but why do we trust an end-to-end auto-generated trust? Oh, sorry,
00:39:51.200 | end-to-end auto-generated judge. I too was very skeptical, but then I saw the kinds of tasks that
00:39:57.120 | the journalism team here at UC Berkeley was running, which was extract all cases of misconduct,
00:40:03.280 | where they have defined misconduct in the prompt. Very precision, recall-oriented tasks here,
00:40:09.440 | which you can have an LLM judge pretty ambiguously determine whether one output
00:40:15.200 | is better than another. Did one output extract more names than another? LLM is just super good
00:40:20.880 | here. So I think what really sold me was having an optimizer like this that off the bat improved
00:40:30.000 | performance for the actual people who are working on these tasks, who are going to be authors in
00:40:34.800 | the eventual submission, just because they've contributed a lot to helping out with the
00:40:39.200 | evaluation here, kind of guiding what they find useful. But once you kind of see it work,
00:40:46.480 | it's kind of hard to go back. I know Eugene Yan also has experience with LLM as judge here.
00:40:52.480 | I strongly believe it could work. I mean, last year, this time last year, or maybe a few months,
00:40:59.440 | like in June last year, I didn't think it could work, but I've been debating it a lot. I've been
00:41:02.960 | trying a lot. I think when we simplify it to binary classification metrics, I think it can work.
00:41:06.960 | And I think a lot of things can be simplified to binary, like Shreya mentioned precision,
00:41:10.400 | I think a lot of things can be simplified to binary classification metrics. And I've seen
00:41:13.520 | evidence of it working. Well, you may have to apply chain of thought, you may have to include
00:41:16.640 | a few few-shot examples, you may have to do a bit of self-consistency or ensembling. But I think
00:41:20.480 | the technology is getting there. Eugene Cheah, you have a hand raised. Yeah, I think the other
00:41:25.520 | important aspect to consider in this scenario is that what is the failure condition? So in this
00:41:30.480 | case, right, if let's say this pipeline fails, right, it still reaches the journalist and
00:41:36.080 | becomes a human in the loop kind of situation. Like the failed documents can be ignored safely,
00:41:41.360 | because they achieve their goal with the successful documents. In the case of using this
00:41:47.840 | pipeline in a more Q&A, agent style for helpdesk, right, a failed answer impact is a lot more
00:41:55.680 | drastic. So it depends on the use case. Agree. Vibhu.
00:42:02.880 | And a quick comment there. So you mentioned you've seen examples of LLM as a judge working
00:42:09.600 | if you break it down to a binary problem. And then you might need to include a few few-shot
00:42:13.760 | examples and stuff. And you said ensembling. Can you explain a bit more about what ensembling
00:42:20.880 | helps for LLM as a judge? Yeah, when I talk about ensembling,
00:42:26.720 | the main ensembling I'm thinking about is, oh, gosh.
00:42:32.880 | I wish I could find a paper as easily as… No, just at a high level, you know.
00:42:44.160 | Yeah, the poll paper essentially talks about just combining multiple, using multiple weaker LLMs.
00:42:52.400 | You can get as good as an LLM judge as GPT-4. This is the paper from…
00:43:01.280 | Okay, so is the benefit there trying to save on performance of a large model or, like,
00:43:11.920 | optimization that you can't run a large one? Like, mixture of agents kind of shows how,
00:43:16.240 | from Together, shows how you can use a mixture of agents of small models to match performance of
00:43:21.920 | a larger model, which we've deployed because it has to be on-prem. So ensembling models
00:43:26.800 | leads to smaller models being able to do more efficient judging than larger models. Is that the takeaway?
00:43:35.840 | Yes, that's correct. So in this paper here by Cohere, what they did was they have a reference
00:43:39.920 | model. And this reference model is GPT-4. And then essentially what they did was they
00:43:45.440 | ensemble Command R, Haiku, and GPT-3.5. And I can't remember what the… I think the ensemble
00:43:52.320 | was just majority voting. And they were able to have higher Kappa score, higher correlation
00:43:58.640 | to humans compared to the strong reference, the large language model.
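A sketch of that kind of ensembling: ask several weaker judge models the same binary question and take the majority vote. The judge() helper and the model names are placeholders, loosely following the setup described above.

```python
from collections import Counter

def judge(model: str, question: str, answer: str) -> bool:
    """Illustrative stand-in: ask one small model a yes/no question about the answer."""
    raise NotImplementedError

def panel_verdict(question: str, answer: str,
                  judges=("command-r", "claude-3-haiku", "gpt-3.5-turbo")) -> bool:
    # Majority vote across several weaker judge models, instead of one big judge.
    votes = Counter(judge(model, question, answer) for model in judges)
    return votes[True] > votes[False]
```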
00:44:02.800 | Interesting. Makes sense.
00:44:06.080 | Yeah. And I think it makes sense that, you know, all these smaller calls can be parallelized,
00:44:10.480 | and then you can… the ensembling, yeah, so… And it's fast and cheaper.
00:44:14.160 | So sometimes they can. Sometimes they can't be parallelized. For us, we see a lot of
00:44:17.680 | parallel… not parallel, sequential chaining adds significant performance.
00:44:24.320 | It just adds a performance bottleneck in terms of time and latency end-to-end.
00:44:29.200 | But like, it's a necessary trade-off for what we've…
00:44:32.240 | Yeah. Okay. Someone asked, can you link to this paper? I don't know if I can link to this paper.
00:44:38.960 | You can… if you've been in our paper clubs, Eugene Yan just does not multitask well. As you
00:44:47.360 | saw while I was struggling to find this paper. But I've pasted you the link. If you search it up on
00:44:50.640 | Google, I'm sure you'll find it on arXiv. I've linked it.
00:44:53.600 | Okay. I did have a quick high-level question, since we're approaching the end. More so for
00:45:01.360 | Shreya. So if you can give a sort of high-level, how we go from datasets to extracted whatever. So
00:45:09.280 | currently you can break up documents by chunking, extracting stuff. There's some basic filtering.
00:45:15.120 | Just high-level, what happens here? And then in that optimization step, it didn't really click
00:45:20.000 | for me where you have the cost go from a couple dollars to $100 and you're generating a bunch of
00:45:25.840 | pipelines to optimize this stuff. But at a high-level, one, what is the doc ETL? And then
00:45:32.080 | two, what are these optimizers doing? I'll answer the second question. The
00:45:39.760 | optimizer's goal is to rewrite the user's pipeline into something that is more accurate.
00:45:45.360 | And in this process, the optimizer explores a lot of plans using LLMs to create those plans,
00:45:52.720 | to verify, to evaluate those plans. So that's where the cost goes up. The reason it was $100,
00:45:58.960 | if I ran the optimizer with GPT 4.0 mini as the LLMs, it would be $10. But we use GPT 4.0 just
00:46:07.920 | because I think we did this at a time where mini hadn't come out yet. Anyways, but LLMs are
00:46:15.760 | trending to be cheaper and cheaper. This project is a very high budget project, but I think that
00:46:20.320 | in the future as LLMs are much cheaper, as open source LLMs get much better, the cost should not
00:46:27.920 | be an issue for such kind of approach. And then your first question was, what is the high-level,
00:46:35.200 | what's the vision for the project? - No, I mean, just what's the input/output?
00:46:40.400 | I, from the title, when I first started reading it and went through all this section one, two,
00:46:45.600 | three, I was like, I thought it was more so single document to better extraction,
00:46:52.240 | like pre-processing a document for a RAG system. And then once I get to the end, so my whole
00:46:57.440 | background was, let's say I've got 10,000 documents that I want referenced in a RAG system.
00:47:04.080 | Am I doing this on all of them? And then it turned into, you can throw this at a data set of a couple
00:47:09.520 | of hundred docs. So like, there was a bit of a jump there. - Yeah. So it's, this is very different
00:47:14.960 | from traditional RAG or Q and A or document processing for a chatbot. Like the kinds of
00:47:20.560 | queries that people are, people want to use doc ETL for can be expressed as ETL, SCDIL,
00:47:27.520 | sweep and harvest kind of, I want to look at my entire data set. I would hire an analyst to look
00:47:35.200 | at my entire data set and then just tell me things, generate me reports that I can then learn
00:47:41.920 | from. So for example, this officer in misconduct thing is much more digestible than say looking at
00:47:47.040 | all these millions of PDFs. So in that sense, I mean, you can certainly apply chunking or like
00:47:53.840 | maybe a strategy used for processing one of the documents could be used for RAG systems, but the
00:47:59.840 | focus is not RAG. It's like building a kind of semantic layer on top of your unstructured data
00:48:05.760 | is the way that I like to think about it. - Got it. Got it. Thank you. - Okay, cool. And I know
00:48:14.320 | we don't have time to go through the examples at the end, but definitely do go through the examples
00:48:18.000 | at the end. And you can see, okay, let's just, for the police example one, you can see they had three
00:48:22.720 | binary criteria. Did this binary criteria come from the team that was doing the work? - Yes.
00:48:28.560 | - Okay. Like whether each officer name referred to a real person, if the summary included dates,
00:48:32.960 | whether each identified misconduct was- - This is a very weak evaluation. So part of the reason we
00:48:39.440 | want to rerun this evaluation is just this task here doesn't have any ground truths. So when you
00:48:44.880 | ask the people, like, what do they care about? It's they look at some of the outputs, they look
00:48:49.440 | at their data, you know, and they're like, oh, the officer names are fake. So that becomes a
00:48:54.000 | criteria for them. 'Cause sometimes it'll be officer A, officer B, officer C, that's not real,
00:48:59.600 | that's not what they want extracted. They'll read some of the summaries and say, you know,
00:49:03.920 | some of them aren't exhaustive. I want them to include date, location, names, summary of what
00:49:09.120 | happened. So I think like the criteria emerges from looking at the data and it's not necessarily
00:49:16.240 | a reflection of like how well the LLM did this task. - Gosh, I think Shreya must have seen this
00:49:24.160 | red post-it note and preempted me. - But that's the question, right? From an academic standpoint,
00:49:31.600 | this is an underwhelming evaluation because there's no ground truths, right? So we're redoing
00:49:36.880 | our evaluation to be on datasets where we actually have ground truths from human annotators, but
00:49:42.640 | that's just not a practical setting. Like nobody's coming to Doc ETL with the ground truth. - That's
00:49:48.000 | why we do AlignEval, to get them to provide ground truth and give me five. And you can imagine,
00:49:52.880 | right, let's just say police misconduct. You can imagine there's two officers, Eugene Cheah and
00:49:56.400 | Eugene Yan, and then we resolve them to Eugene. And then now it becomes, the risk is higher,
00:50:04.160 | right? Eugene Cheah or Eugene Yan conducted some offense. So that's the problem where resolution
00:50:09.840 | is very tricky. I think somewhere you mentioned resolution, you have a metric of recall. I
00:50:16.640 | actually think precision is actually more important; it depends on your use case. I actually think
00:50:21.200 | precision is actually something that we want to be careful about. - Yeah, we do measure precision in
00:50:29.520 | the updated eval that is not on arXiv. - So yeah, they also have two other examples here.
00:50:38.160 | Again, very valuable. This is really people doing the work, sharing the work they did,
00:50:42.640 | what were the pain points and how they improve on it. Just definitely do read this end-to-end.
00:50:50.240 | You'll definitely learn a lot. And you can see this is the number of, again, this is the number
00:50:53.840 | of iterations. You can see it just keeps getting better even on the fourth iteration or the third
00:51:00.320 | iteration, though it may be expensive, whereas on some things like distinct games, maybe after one
00:51:05.600 | or two iterations is good enough. Again, it depends on your cost and how much latency you want to
00:51:09.920 | incur. So yeah, I think that's all. Key takeaways: we can think about the operations we do as a few
00:51:19.680 | key operators, map, reduce, resolve, filter, rank, maybe. We can think about how we want to rewrite
00:51:25.840 | these operators together in a different way. And we also have a methodology on how to optimize
00:51:31.280 | these rewrites by generating pipelines. And of course, we need a validator to validate every
00:51:38.560 | step of the way. Shreya, anything I missed? Any key takeaways that I missed?
00:51:43.120 | - Crushed it. Wow. - Thank you, Shreya,
00:51:46.560 | and thank you for joining us. - Thank you for all spending so
00:51:49.600 | much time on this work. This work has been two years in progress, by the way.
00:51:53.280 | - Really? Wow, I didn't know it was that long. - Oh yeah, and even longer.
00:51:56.160 | - Really? Shoot. - That makes me very happy.
00:52:04.800 | - I'm in your bucket now. Yeah. I'm super excited to see the full paper.
00:52:13.440 | - The EvalGen and all my other papers actually were part of this broader project. I've been
00:52:13.440 | working on this for so long that my advisor was like, "What are you doing? Let's come up with
00:52:20.160 | something." So I was like, "What's our biggest problem right now?" About a year ago, a year and
00:52:25.040 | a half ago, I was like, "Validation. We need to come up with these validation agents. I don't
00:52:29.760 | know how to do it." So that's how EvalGen came about. Yeah, I love the EvalGen paper. I mean,
00:52:34.960 | I built an entire app inspired by it. But Eugene, do you have a hand raised?
00:52:39.040 | - Yeah, so I'm just going to do the shout out for the three plus one, I guess, for newer folks who
00:52:45.200 | are not aware and also for the recording. The amount of meta in this room and the active
00:52:49.680 | participants is ridiculously high on this topic of RAG / LLM-as-judge / eval. Shreya is the paper author and
00:52:56.720 | has been doing multiple banger papers on this, top tier papers on this topic regarding RAG and eval
00:53:03.840 | and everything in between. If you want a list of her writing, you can probably just ask Eugene
00:53:08.800 | Yan and he'll produce a list for you. Eugene Yan himself undersells himself, but he writes
00:53:15.280 | extensively on LLM-as-judge and helps teams build these solutions in the world's largest bookstore. So
00:53:22.320 | he's probably seen some of the biggest production use cases for this kind of deployment. Vibhu
00:53:28.640 | works at MindsDB, which is all about embedding search and enterprise RAG solutions built on top
00:53:33.120 | of it and has seen a lot of real world productions for all MindsDB clients. So he provides more
00:53:39.360 | breadth across multiple use cases respectively. And Sean, as you may know, works on smol.ai as
00:53:47.040 | well, which does summarization and search on this kind of thing. So lots of high-level meta. Hence,
00:53:52.320 | you can see why they were the most active in their conversations as well.
00:53:55.200 | Yeah, and I've dropped a few links about this. I think that you should definitely read Shreya's
00:54:00.800 | previous paper. This paper is really, really, really good. And what I really like about this,
00:54:06.400 | it actually has UX research. You actually have participants who... So you can hear what other
00:54:14.320 | people are saying. That's very useful. I'll also drop another link to a compilation I did on
00:54:20.720 | different LLM-as-judge techniques. So I think that's… It's helpful. It's a lot, I realize,
00:54:26.880 | but it actually references a lot of other papers. So this can be helpful for you.
00:54:30.880 | Okay. And next week, we have Eric. Thank you for volunteering, Eric. We look forward to that. And
00:54:37.760 | drop the paper that you'll be taking us through in the Discord channel so that we have time to
00:54:43.920 | pre-read. Yeah, we'll do. Thank you, everyone. Thank you, Eric. Thanks. Thank you, Shreya,
00:54:50.080 | for joining us. Thank you for being here. Yep. Bye. Okay, Swix, do you need to stop the recording
00:54:58.000 | or we just leave? I would say you'll just leave. He made me the host. He made you host. Can you
00:55:08.160 | stop the recording? Yeah, but then the thing is my session kind of bugged out and I need a refresh.
00:55:14.880 | So I think the host will stand out eventually. This session is not the host. Oh, shoot. Okay,
00:55:23.280 | I'm going to have to leave then. I have a NAH meeting. All right. See you, man. Bye.
00:55:26.320 | I might... I have a backup recording, I think. So if we lost it, then I can provide that. We'll see.
00:55:33.360 | Yep, yep, yep. No worries. Cool. Hey, was that yikes? Yeah, yeah. I want the backup record.
00:55:37.680 | Okay. Yeah, yeah. Cool. All right, everyone, feel free to leave. I'm just waiting for the
00:55:45.120 | system to time out. Oh, no. Sean left his computer on on this.