Building Reliable Agents: Lessons in Building an IDE for Data Processing Agents

Hey everyone. My name is Shreya. I am finishing up my PhD at UC Berkeley, so that's quite exciting for me. And I'm here to give you a different kind of talk. This is about research, what we're learning through research, and how to help people build reliable LLM pipelines.

Just to give a picture of the kind of research that we do at Berkeley, it is around data processing agents. What do I mean by data processing? Organizations have lots of unstructured data, documents that they want to analyze, extract insights from, and make sense of. So for example, maybe in customer service reviews, they want to extract themes, summarize them, and figure out actionable next steps. Maybe, for a sales agent, they want to look through their emails to figure out which clients could have been closed, why they weren't, and how to move forward from that. And all sorts of domains have these kinds of tasks. For example, in traffic safety or aviation safety: what are the causes of accidents, and how can we mitigate them?

When people write pipelines to use LLMs to solve these problems, their number one complaint is that, you know, this is really hard. It doesn't work. So I want to put you in that mindset to figure out why. Imagine you are a real estate agent trying to find a place that meets your client's needs. And your client has a pet; they're a dog owner. So you might want to know, okay, what neighborhoods in, say, SF have the most restrictive pet policies? I want to tell that to my client. So you might write this pipeline as a sequence of LLM operations on a bunch of real estate rental contracts. You might start out with a map operation, which for every document gives you some extracted output. Then more map operations, for example, to categorize or classify clauses. And then aggregate these clauses together, maybe by neighborhood or by city, and come up with a summary or report for each.
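To make that concrete, here is a minimal sketch of what such a map, map, and aggregate pipeline could look like in plain Python. It is illustrative only: call_llm is a hypothetical helper for whatever model you use, and the prompts are stand-ins, not DocETL's actual operators.

```python
# Minimal sketch of the map -> map -> aggregate pipeline described above.
# `call_llm` is a hypothetical helper; wire it to your own LLM provider.
from collections import defaultdict


def call_llm(prompt: str) -> str:
    raise NotImplementedError("send `prompt` to your model and return its text response")


def extract_pet_clauses(contract_text: str) -> str:
    # Map 1: pull pet policy clauses out of a single rental contract.
    return call_llm(f"Extract all pet policy clauses from this lease:\n{contract_text}")


def classify_clauses(clauses: str) -> str:
    # Map 2: categorize the extracted clauses (breed limits, weight limits, ...).
    return call_llm(f"Classify each clause by the type of pet restriction:\n{clauses}")


def summarize_neighborhood(classified_docs: list[str], neighborhood: str) -> str:
    # Aggregate: fold all classified clauses for one neighborhood into a report.
    joined = "\n".join(classified_docs)
    return call_llm(f"Summarize how restrictive pet policies are in {neighborhood}:\n{joined}")


def run_pipeline(contracts: list[dict]) -> dict[str, str]:
    # Each contract is a dict like {"neighborhood": ..., "text": ...}.
    grouped: dict[str, list[str]] = defaultdict(list)
    for doc in contracts:
        grouped[doc["neighborhood"]].append(classify_clauses(extract_pet_clauses(doc["text"])))
    return {n: summarize_neighborhood(docs, n) for n, docs in grouped.items()}
```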
People write these pipelines, and the number one thing that they tell us is, my prompts don't work. And then the number one thing that they're told as a solution is, oh, just iterate on your prompts. So in today's talk, I really want to dive into what this kind of iteration entails, right? Why is this problem hard? How can you feel like you're not just hacking away at nothing to make progress?

So at UC Berkeley, we put our research hats on, our HCI hats on, and studied how people write these kinds of data processing pipelines. The very first thing we observed is that people did not even know what the right question was. And many of you might resonate with this a little bit. So in our real estate agent example, someone might think they want to extract all pet policy clauses, and then realize only after looking at the documents and the outputs that they only wanted, you know, dog and cat pet policy clauses. Then, once they feel like they have the right question they want to ask, they need to figure out how to specify that question. We all know when working with LLMs that we need very well specified, clear, unambiguous prompts. And things that we as humans think are unambiguous are actually pretty ambiguous. For example, just saying dog and cat policy clauses doesn't tell the LLM much. Maybe you need to spell out weight limits or restrictions, breed restrictions, quantity limits, and so forth to improve the LLM's performance.
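To illustrate what that extra specification looks like, here is a hypothetical before-and-after pair of prompts; the exact wording is invented for this example, not taken from any particular pipeline.

```python
# Illustrative only: the "before" prompt leaves the LLM to guess what counts as a
# pet policy clause; the "after" prompt spells out the sub-cases mentioned above.
AMBIGUOUS_PROMPT = "Extract the pet policy clauses from this lease."

SPECIFIED_PROMPT = (
    "Extract every clause about dogs or cats from this lease, including "
    "breed restrictions, weight limits or restrictions, limits on the number of pets, "
    "and service animal exemptions. Return each clause verbatim; "
    "if there are none, return an empty list."
)
```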
So zooming out a bit, what do these challenges mean, right? Iteration reveals a lot of these insights if you do it correctly. But when we help people build data processing pipelines, what we really want to do is close the gaps between the user or developer, the data they're trying to query and make sense of, and the pipeline they're writing. And as researchers, we figured out that, oh my gosh, there's so much tooling on the bottom half, on LLM accuracy: when you have a very well specified pipeline, how do we make sure it generalizes to all of our documents and our needs? But there's virtually no tooling for the data understanding and intent specification gaps. So I want to spend the rest of today's talk telling you how we are thinking about closing these gaps, along with insights that you might apply when you are trying to iterate on your own prompts.
First, I'll talk about the data understanding gap. Going back to our real estate rental contract example, the core challenge here is: what are the types of documents in the data, and what are the unique failure modes that happen for each type of document? So for example, all of these types of pet clauses might exist: breed restriction clauses, clauses on the number of pets, service animal exemptions. And many people don't even know this until they look at the data. So when we're building tools, we might want to automatically extract these types for our end users so they can look at examples of failure modes for each one. And then we see that there's a really, really long tail of failure modes. This is not unique to real estate settings; we observe it for pretty much any application here. It's like ML in general: there are so many different types of failure modes that are difficult to make sense of. So for example, clauses might be phrased unusually and the LLM might miss extracting them. LLMs might overfit to certain keywords and extract things that are unrelated because, you know, a keyword is only superficially related, and so forth. It's not uncommon to see people flag hundreds of issues in a thousand-document collection.

So putting this all together, zooming out, what does it mean to close this data understanding gap, right? I mentioned that we want tooling to help people find anomalies and failure modes in their data, but also to let them design evals on the fly for each of these different failure modes. And some of the solutions that we're prototyping in our stack, in our research stack, are for having people look at automatically generated clusters of outputs, annotate them, and turn those annotations into evals.
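As a sketch of what that cluster-then-annotate support could look like, assuming you can embed each output (the embed helper below is hypothetical, and this is not the actual research tooling):

```python
# Group similar pipeline outputs so a user can skim one cluster at a time,
# label the failure mode it represents, and reuse that label as an eval later.
import numpy as np
from sklearn.cluster import KMeans


def embed(texts: list[str]) -> np.ndarray:
    raise NotImplementedError("use your embedding model of choice")


def cluster_outputs(outputs: list[str], k: int = 10) -> dict[int, list[str]]:
    vectors = embed(outputs)
    labels = KMeans(n_clusters=k, n_init="auto").fit_predict(vectors)
    clusters: dict[int, list[str]] = {}
    for label, output in zip(labels, outputs):
        clusters.setdefault(int(label), []).append(output)
    return clusters
```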
So to give you a concrete example of what a real estate agent, or a developer building for one, might do: for each failure mode, whether it's one we surface automatically or one they label themselves, we can identify the examples they've labeled the same way and come up with, okay, here's a dataset you can design evals on. And maybe there are some potential strategies, for example, generating alternative phrasings with an LLM or doing keyword checks in hybrid with LLMs. And this is where it gets a little bit fuzzy and interesting, right? How do we build these for our users? I think a lot of different domains have very different needs here.
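As one sketch of that keyword-check-in-hybrid-with-an-LLM strategy, again with a hypothetical call_llm helper and illustrative keywords:

```python
# Cheap keyword matching catches the obvious cases; an LLM judge only runs on
# the unusually phrased clauses that slip past it.
PET_KEYWORDS = ("dog", "cat", "breed", "weight limit", "service animal")


def call_llm(prompt: str) -> str:  # hypothetical helper, as in the earlier sketch
    raise NotImplementedError


def clause_is_pet_related(clause: str) -> bool:
    lowered = clause.lower()
    if any(keyword in lowered for keyword in PET_KEYWORDS):
        return True  # obvious keyword hit, no LLM call needed
    verdict = call_llm(
        "Answer yes or no: does this lease clause restrict dogs or cats?\n" + clause
    )
    return verdict.strip().lower().startswith("yes")
```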
So now I want to move over to the intent gap, which is: when we know that there are lots of failure modes in our data, how do we even go about improving the pipeline? Much of this revolves around reducing query ambiguity or prompt ambiguity. Maybe I want to change pet-related clauses to dog- and cat-related clauses. This is a very simple example, but you can imagine that with hundreds of failure modes, figuring out how to translate them into actual pipeline improvements is very difficult. Do we prompt engineer? Do we add new operations? Do we do task decomposition? Do we try to look at subsections of the document and unify the results? People often get very lost in that. So one of the solutions that we're prototyping, and that's available in our DocETL project, is the ability to take user-provided notes and automatically translate them into prompt improvements, in an interface where people can interactively give feedback on it and maintain their revision history. So it's fully steerable.
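To give a feel for the shape of that loop, here is a rough, purely illustrative sketch; it is not DocETL's actual API, and call_llm is again a hypothetical helper.

```python
# Keep every prompt revision so the user can inspect, edit, or roll back the
# suggestions generated from their notes.
from dataclasses import dataclass, field


def call_llm(prompt: str) -> str:  # hypothetical helper
    raise NotImplementedError


@dataclass
class PromptHistory:
    revisions: list[str] = field(default_factory=list)

    def current(self) -> str:
        return self.revisions[-1]

    def propose_revision(self, user_notes: str) -> str:
        # Ask an LLM to rewrite the current prompt to address the user's notes;
        # the user can accept the suggestion, edit it, or ignore it.
        suggestion = call_llm(
            "Rewrite this prompt so it addresses the feedback, keeping it unambiguous.\n"
            f"Prompt: {self.current()}\nFeedback: {user_notes}"
        )
        self.revisions.append(suggestion)
        return suggestion
```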
All right. Now, for my last slide, you might be wondering, okay, why does this matter? I don't really care. I might not be building agents for data processing. What can I take away from this? Great question. So here are my takeaways for you. First, we always find, in every single domain where people are processing data, that evals are very, very fuzzy. And people are never done with evaluation up front. They are always collecting new failure modes as they run pipelines, always creating new subsets of documents or example traces that will represent evals to run in the future. And failure modes really hide in the long tail, right? We see people having tens or twenties of different failure modes that they're constantly checking for.
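One lightweight, purely illustrative way to act on that takeaway is to keep each failure mode as a named subset of documents plus a check to run later, so evals accumulate as you iterate instead of being designed once up front:

```python
# A tiny registry: each failure mode remembers where it was observed and how to
# check for it, and reports a pass rate over its own subset of documents.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FailureMode:
    name: str                      # e.g. "misses unusually phrased clauses"
    example_doc_ids: list[str]     # documents where the failure was observed
    check: Callable[[str], bool]   # True if the pipeline output passes


@dataclass
class EvalRegistry:
    modes: list[FailureMode] = field(default_factory=list)

    def run(self, outputs_by_doc: dict[str, str]) -> dict[str, float]:
        results: dict[str, float] = {}
        for mode in self.modes:
            checks = [
                mode.check(outputs_by_doc[doc_id])
                for doc_id in mode.example_doc_ids
                if doc_id in outputs_by_doc
            ]
            results[mode.name] = sum(checks) / len(checks) if checks else float("nan")
        return results
```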
Then the next thing that we've observed to be very helpful is when our users unpack the cycle of iteration into distinct stages, right? I mentioned that people try strategies like query decomposition or prompt optimization to turn a well-specified pipeline into a generalizable pipeline. However, we find that people first need to figure out how to specify their pipeline in the first place. So first, understand your data. Do this as its own stage. Don't worry about having good accuracy; just know what's going on in your failure modes. Second, figure out how to get your prompts as well specified as possible. Make sure there's no ambiguity; if you were to send them to a human, for example, they would not misinterpret them. And only then do people get really good gains from applying well-known accuracy optimization strategies. With that, thanks so much. Feel free to email me at shreyashankar@berkeley.edu.