Building Reliable Agents: Lessons in Building an IDE for Data Processing Agents

Hey everyone. My name is Shreya. I am finishing up my PhD at UC Berkeley, so that's quite exciting for me. And I'm here to give you a different kind of talk. This is about research, what we're learning through research, and how to help people build reliable LLM pipelines.

Just to give a picture of the kind of research that we do at Berkeley, it is around data processing agents. What do I mean by data processing? Organizations have lots of unstructured data, documents that they want to analyze, extract insights from, and make sense of. So for example, maybe in customer service reviews, they want to extract themes, summarize them, and figure out actionable next steps. Maybe, for a sales agent, they want to look through their emails to figure out which clients could have been closed, why they weren't, and how to move forward from that. And all sorts of domains have these kinds of tasks. For example, in traffic safety or aviation safety: what are the causes of accidents, and how can we mitigate them?

When people write pipelines to use LLMs to solve these problems, their number one complaint is that, you know, this is really hard. It doesn't work. So I want to put you in that mindset to figure out why. Imagine you are a real estate agent trying to find a place that meets your client's needs. And your client has a pet; they're a dog owner. So you might want to know, okay, what neighborhoods in, say, SF have the most restrictive pet policies? I want to tell that to my client. So you might write this pipeline as a sequence of LLM operations on a bunch of real estate rental contracts. You might start out with a map operation, which for every document gives you some extracted output. Then more map operations, for example, to categorize or classify clauses. And then aggregate these clauses together, maybe by neighborhood or by city, and come up with a summary or report for each.
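To make that concrete, here is a minimal sketch of what such a map, map, and aggregate pipeline could look like in plain Python. It is illustrative only: call_llm is a hypothetical helper for whatever model you use, and the prompts are stand-ins, not DocETL's actual operators.

```python
# Minimal sketch of the map -> map -> aggregate pipeline described above.
# `call_llm` is a hypothetical helper; wire it to your own LLM provider.
from collections import defaultdict


def call_llm(prompt: str) -> str:
    raise NotImplementedError("send `prompt` to your model and return its text response")


def extract_pet_clauses(contract_text: str) -> str:
    # Map 1: pull pet policy clauses out of a single rental contract.
    return call_llm(f"Extract all pet policy clauses from this lease:\n{contract_text}")


def classify_clauses(clauses: str) -> str:
    # Map 2: categorize the extracted clauses (breed limits, weight limits, ...).
    return call_llm(f"Classify each clause by the type of pet restriction:\n{clauses}")


def summarize_neighborhood(classified_docs: list[str], neighborhood: str) -> str:
    # Aggregate: fold all classified clauses for one neighborhood into a report.
    joined = "\n".join(classified_docs)
    return call_llm(f"Summarize how restrictive pet policies are in {neighborhood}:\n{joined}")


def run_pipeline(contracts: list[dict]) -> dict[str, str]:
    # Each contract is a dict like {"neighborhood": ..., "text": ...}.
    grouped: dict[str, list[str]] = defaultdict(list)
    for doc in contracts:
        grouped[doc["neighborhood"]].append(classify_clauses(extract_pet_clauses(doc["text"])))
    return {n: summarize_neighborhood(docs, n) for n, docs in grouped.items()}
```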
People write these pipelines, and the number one thing that they tell us is, my prompts don't work. And then the number one thing that they're told as a solution is, oh, just iterate on your prompts. So in today's talk, I really want to dive into what this kind of iteration entails, right? Why is this problem hard? How can you feel like you're not just hacking away at nothing to make progress?

So at UC Berkeley, we put our research hats on, our HCI hats on, and studied how people write these kinds of data processing pipelines. The very first thing we observed is that people did not even know what the right question was. And many of you might resonate with this a little bit. So in our real estate agent example, someone might think they want to extract all pet policy clauses, and then realize only after looking at the documents and the outputs that they only wanted, you know, dog and cat pet policy clauses. Then, once they feel like they have the right question they want to ask, they need to figure out how to specify that question. We all know when working with LLMs that we need very well specified, clear, unambiguous prompts. And things that we as humans think are unambiguous are actually pretty ambiguous. For example, just saying dog and cat policy clauses doesn't tell the LLM much. Maybe you need to spell out weight limits or restrictions, breed restrictions, quantity limits, and so forth to improve the LLM's performance.
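To illustrate what that extra specification looks like, here is a hypothetical before-and-after pair of prompts; the exact wording is invented for this example, not taken from any particular pipeline.

```python
# Illustrative only: the "before" prompt leaves the LLM to guess what counts as a
# pet policy clause; the "after" prompt spells out the sub-cases mentioned above.
AMBIGUOUS_PROMPT = "Extract the pet policy clauses from this lease."

SPECIFIED_PROMPT = (
    "Extract every clause about dogs or cats from this lease, including "
    "breed restrictions, weight limits or restrictions, limits on the number of pets, "
    "and service animal exemptions. Return each clause verbatim; "
    "if there are none, return an empty list."
)
```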
So zooming out a bit, what do these challenges mean, right? Iteration reveals a lot of these insights if you do it correctly. But when we help people build data processing pipelines, what we really want to do is close the gaps between the user or developer, the data they're trying to query and make sense of, and the pipeline they're writing. And as researchers, we figured out that, oh my gosh, there's so much tooling on the bottom half, on LLM accuracy: when you have a very well specified pipeline, how do we make sure it generalizes to all of our documents and our needs? But there's virtually no tooling for the data understanding and intent specification gaps. So I want to spend the rest of today's talk telling you how we are thinking about closing these gaps, along with insights that you might apply when you are trying to iterate on your own prompts.
First, I'll talk about the data understanding gap. Going back to our real estate rental contract example, the core challenge here is: what are the types of documents in the data, and what are the unique failure modes that happen for each type of document? So for example, all of these types of pet clauses might exist: breed restriction clauses, clauses on the number of pets, service animal exemptions. And many people don't even know this until they look at the data. So when we're building tools, we might want to automatically extract these types for our end users so they can look at examples of failure modes for each one. And then we see that there's a really, really long tail of failure modes. This is not unique to real estate settings; we observe it for pretty much any application here. It's like ML in general: there are so many different types of failure modes that are difficult to make sense of. So for example, clauses might be phrased unusually and the LLM might miss extracting them. LLMs might overfit to certain keywords and extract things that are unrelated because, you know, a keyword is only superficially related, and so forth. It's not uncommon to see people flag hundreds of issues in a thousand-document collection.

So putting this all together, zooming out, what does it mean to close this data understanding gap, right? I mentioned that we want tooling to help people find anomalies and failure modes in their data, but also to let them design evals on the fly for each of these different failure modes. And some of the solutions that we're prototyping in our stack, in our research stack, are for having people look at automatically generated clusters of outputs, annotate them, and turn those annotations into evals.
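As a sketch of what that cluster-then-annotate support could look like, assuming you can embed each output (the embed helper below is hypothetical, and this is not the actual research tooling):

```python
# Group similar pipeline outputs so a user can skim one cluster at a time,
# label the failure mode it represents, and reuse that label as an eval later.
import numpy as np
from sklearn.cluster import KMeans


def embed(texts: list[str]) -> np.ndarray:
    raise NotImplementedError("use your embedding model of choice")


def cluster_outputs(outputs: list[str], k: int = 10) -> dict[int, list[str]]:
    vectors = embed(outputs)
    labels = KMeans(n_clusters=k, n_init="auto").fit_predict(vectors)
    clusters: dict[int, list[str]] = {}
    for label, output in zip(labels, outputs):
        clusters.setdefault(int(label), []).append(output)
    return clusters
```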
So to give you a concrete example of what a real estate agent, or a developer building for one, might do: for each failure mode, whether it's one we surface automatically or one they label themselves, we can identify the examples they've labeled the same way and come up with, okay, here's a dataset you can design evals on. And maybe there are some potential strategies, for example, generating alternative phrasings with an LLM or doing keyword checks in hybrid with LLMs. And this is where it gets a little bit fuzzy and interesting, right? How do we build these for our users? I think a lot of different domains have very different needs here.
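As one sketch of that keyword-check-in-hybrid-with-an-LLM strategy, again with a hypothetical call_llm helper and illustrative keywords:

```python
# Cheap keyword matching catches the obvious cases; an LLM judge only runs on
# the unusually phrased clauses that slip past it.
PET_KEYWORDS = ("dog", "cat", "breed", "weight limit", "service animal")


def call_llm(prompt: str) -> str:  # hypothetical helper, as in the earlier sketch
    raise NotImplementedError


def clause_is_pet_related(clause: str) -> bool:
    lowered = clause.lower()
    if any(keyword in lowered for keyword in PET_KEYWORDS):
        return True  # obvious keyword hit, no LLM call needed
    verdict = call_llm(
        "Answer yes or no: does this lease clause restrict dogs or cats?\n" + clause
    )
    return verdict.strip().lower().startswith("yes")
```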
So now I want to move over to the intent gap, which is: when we know that there are lots of failure modes in our data, how do we even go about improving the pipeline? Much of this revolves around reducing query ambiguity or prompt ambiguity. Maybe I want to change pet-related clauses to dog- and cat-related clauses. This is a very simple example, but you can imagine that with hundreds of failure modes, figuring out how to translate them into actual pipeline improvements is very difficult. Do we prompt engineer? Do we add new operations? Do we do task decomposition? Do we try to look at subsections of the document and unify the results? People often get very lost in that. So one of the solutions that we're prototyping, and that's available in our DocETL project, is the ability to take user-provided notes and automatically translate them into prompt improvements, in an interface where people can interactively give feedback on it and maintain their revision history. So it's fully steerable.
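To give a feel for the shape of that loop, here is a rough, purely illustrative sketch; it is not DocETL's actual API, and call_llm is again a hypothetical helper.

```python
# Keep every prompt revision so the user can inspect, edit, or roll back the
# suggestions generated from their notes.
from dataclasses import dataclass, field


def call_llm(prompt: str) -> str:  # hypothetical helper
    raise NotImplementedError


@dataclass
class PromptHistory:
    revisions: list[str] = field(default_factory=list)

    def current(self) -> str:
        return self.revisions[-1]

    def propose_revision(self, user_notes: str) -> str:
        # Ask an LLM to rewrite the current prompt to address the user's notes;
        # the user can accept the suggestion, edit it, or ignore it.
        suggestion = call_llm(
            "Rewrite this prompt so it addresses the feedback, keeping it unambiguous.\n"
            f"Prompt: {self.current()}\nFeedback: {user_notes}"
        )
        self.revisions.append(suggestion)
        return suggestion
```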
All right. Now, for my last slide, you might be wondering, okay, why does this matter? I don't really care. I might not be building agents for data processing. What can I take away from this? Great question. So here are my takeaways for you. First, we always find, in every single domain where people are processing data, that evals are very, very fuzzy. And people are never done with evaluation up front. They are always collecting new failure modes as they run pipelines, always creating new subsets of documents or example traces that will represent evals to run in the future. And failure modes really hide in the long tail, right? We see people having tens or twenties of different failure modes that they're constantly checking for.
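One lightweight, purely illustrative way to act on that takeaway is to keep each failure mode as a named subset of documents plus a check to run later, so evals accumulate as you iterate instead of being designed once up front:

```python
# A tiny registry: each failure mode remembers where it was observed and how to
# check for it, and reports a pass rate over its own subset of documents.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FailureMode:
    name: str                      # e.g. "misses unusually phrased clauses"
    example_doc_ids: list[str]     # documents where the failure was observed
    check: Callable[[str], bool]   # True if the pipeline output passes


@dataclass
class EvalRegistry:
    modes: list[FailureMode] = field(default_factory=list)

    def run(self, outputs_by_doc: dict[str, str]) -> dict[str, float]:
        results: dict[str, float] = {}
        for mode in self.modes:
            checks = [
                mode.check(outputs_by_doc[doc_id])
                for doc_id in mode.example_doc_ids
                if doc_id in outputs_by_doc
            ]
            results[mode.name] = sum(checks) / len(checks) if checks else float("nan")
        return results
```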
Then the next thing that we've observed to be very helpful is when our users unpack the cycle of iteration into distinct stages, right? I mentioned that people try strategies like query decomposition or prompt optimization to turn a well-specified pipeline into a generalizable pipeline. However, we find that people first need to figure out how to specify their pipeline in the first place. So first, understand your data. Do this as its own stage. Don't worry about having good accuracy; just know what's going on in your failure modes. Second, figure out how to get your prompts as well specified as possible. Make sure there's no ambiguity; if you were to send them to a human, for example, they would not misinterpret them. And only then do people get really good gains from applying well-known accuracy optimization strategies. With that, thanks so much. Feel free to email me at shreyashankar@berkeley.edu.