
Building Reliable Agents: Lessons in Building an IDE for Data Processing Agents


Transcript

Hey everyone. My name is Shreya. I am finishing up my PhD at UC Berkeley, so that's quite exciting for me. And I'm here to give you a different kind of talk: it's about research, what we're learning through research, and how to help people build reliable LLM pipelines. To give you a picture of the kind of research we do at Berkeley, it's around data processing agents.

What do I mean by data processing? Organizations have lots of unstructured data, documents that they want to analyze, extract insights from, and make sense of. For example, in customer service reviews, maybe they want to extract themes, summarize them, and figure out actionable next steps. Or maybe, for a sales agent, they want to look through their emails to figure out:

Which clients could have gotten closed? Why didn't they? How do we move forward from that? All sorts of domains have these kinds of tasks. In traffic safety or aviation safety, for example: what are the causes of accidents, and how can we mitigate them? And when people write pipelines that use LLMs to solve these problems, their number one complaint is that, you know, this is really hard.

It doesn't work. And so I want to put you in that mindset, to figure out why. Imagine you are a real estate agent trying to find a place that meets your client's needs. Your client has a pet; they're a dog owner. So you might want to know: okay, what neighborhoods in, say, SF have the most restrictive pet policies?

I want to tell that to my client. So you might write this as a pipeline, a sequence of LLM operations over a bunch of real estate rental contracts. You might start out with a map operation, which for every document gives you some extracted output, then more map operations, for example to categorize or classify clauses.

And then aggregate these clauses together, maybe by neighborhood, by city, and come up with a summary or report for each. People write these pipelines, and the number one thing that they tell us is, my prompts don't work. And then the number one thing that they're told as a solution is, oh, just iterate on your prompts.
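To make the shape of that kind of pipeline concrete, here is a minimal, framework-agnostic Python sketch of the map, classify, and aggregate steps. It is only a sketch: `call_llm` is a hypothetical stand-in for whatever LLM client you use, and the contract fields are assumed.

```python
from collections import defaultdict

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: wire this to whatever LLM client you use.
    raise NotImplementedError

def extract_pet_clauses(contract_text: str) -> str:
    # Map step: pull pet policy clauses out of a single contract.
    return call_llm(
        "Extract every clause about pet policies from this rental contract. "
        "Return one clause per line.\n\n" + contract_text
    )

def classify_clause(clause: str) -> str:
    # Map step: categorize each extracted clause.
    return call_llm(
        "Classify this pet policy clause as one of: breed restriction, "
        "weight limit, pet count limit, deposit or fee, other.\n\n" + clause
    )

def summarize_neighborhood(neighborhood: str, clauses: list[str]) -> str:
    # Aggregate step: roll clauses up into a per-neighborhood report.
    return call_llm(
        f"Summarize how restrictive pet policies are in {neighborhood}, "
        "based on these clauses:\n" + "\n".join(clauses)
    )

def run_pipeline(contracts: list[dict]) -> dict[str, str]:
    # Each contract is assumed to look like {"neighborhood": ..., "text": ...}.
    clauses_by_hood: dict[str, list[str]] = defaultdict(list)
    for contract in contracts:
        for clause in extract_pet_clauses(contract["text"]).splitlines():
            if clause.strip():
                label = classify_clause(clause)
                clauses_by_hood[contract["neighborhood"]].append(f"[{label}] {clause}")
    return {hood: summarize_neighborhood(hood, cs) for hood, cs in clauses_by_hood.items()}
```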

So in today's talk, I really want to dive into what this kind of iteration entails, right? Why is this problem hard? How can you feel like you're not just hacking away at nothing to make progress? At UC Berkeley, we put our research hats on, our HCI hats on, and studied how people write these kinds of data processing pipelines.

The very first thing we observed is that people did not even know what the right question to ask was. And many of you might resonate with this a little bit. So in our real estate agent example, someone might think they want to extract all pet policy clauses, and then realize only after looking at the documents and the outputs that they only wanted, you know, dog and cat pet policy clauses.

Then, once they feel like they have the right question to ask, they need to figure out how to specify that question. We all know, when working with LLMs, that we need very well specified, clear, unambiguous prompts. And things that we as humans think are unambiguous are actually pretty ambiguous.

For example, just saying dog and cat policy clauses doesn't tell the LLM much. Maybe you need to spell out weight limits or restrictions, breed restrictions, quantity limits, and so forth to improve the LLM's performance.
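As a concrete illustration of that kind of sharpening, here is roughly what an under-specified prompt versus a more fully specified one might look like. Both strings are made up for illustration, not taken from a real pipeline.

```python
# Illustrative only: a vague prompt versus a more fully specified one.
VAGUE_PROMPT = "Extract the pet policy clauses from this rental contract."

SPECIFIED_PROMPT = (
    "Extract every clause about dog or cat policies from this rental contract, "
    "including weight limits or restrictions, breed restrictions, limits on the "
    "number of pets, pet deposits or fees, and service animal exemptions. "
    "Return one clause per line, or 'NONE' if there are none."
)
```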

So zooming out a bit, what do these challenges mean, right? Iteration reveals a lot of these insights if you do it correctly. But when we help people build data processing pipelines, what we really want to do is close these gaps between the user or developer, the data they're trying to query and make sense of, and the pipeline they're writing. And as researchers, we figured out that, oh my gosh, there's so much tooling in this bottom half, in LLM accuracy.

When you have a very well specified pipeline, how do we make sure it generalizes to all of our documents and our needs? But there's virtually no tooling for these data understanding and intent specification gaps. So I want to spend the rest of the time telling you how we are thinking about closing these gaps, and insights that you might apply when you are trying to iterate on your own prompts.

First, I'll talk about the data understanding gap. Going back to our real estate rental contract example, the core challenge here is: what are the types of documents in the data, and what are the unique failure modes that happen for each type of document? For example, all of these types of pet clauses might exist.

Breed restriction clauses, clauses on the number of pets, service animal exemptions. And many people don't even know this until they look at the data. So when we're building tools, we might want to extract these automatically for our end users, so they can look at examples of failure modes for each type.

And then we see that there's a really, really long tail of failure modes. This is not unique to real estate settings; we observe it for pretty much any application. It's like ML in general: there are so many different types of failure modes, and they're difficult to make sense of.

For example, clauses might be phrased unusually and the LLM might miss extracting them. LLMs might overfit to certain keywords, extracting things that are otherwise unrelated because a single keyword happens to match, and so forth. It's not uncommon to see people flag hundreds of issues in a thousand-document collection.

So putting this all together, zooming out, what does it mean to close this data understanding gap, right? I mentioned that we want tooling to help people find anomalies and failure modes in their data, but also to be able to design evals on the fly for each of these different failure modes.

And some of the solutions that we're prototyping in our research stack involve having people look at automatically generated clusters of outputs and annotate them in situ, so that we can help organize them. To give you a concrete example of what a real estate agent, or a developer building for one, might do: for each failure mode, either we organize the examples, or once they label a few, we are able to identify the other outputs they would have labeled the same way.

From that, we can come up with, okay, here's a dataset you can design evals on. And maybe there are some potential strategies, for example generating alternative phrasings with an LLM, or doing keyword checks in hybrid with LLMs.
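As a hypothetical sketch of what those two strategies might look like in code (the keyword list is made up, and `call_llm` is again a stand-in for whatever LLM client you use):

```python
# Hypothetical sketch of two eval-building strategies for a flagged failure mode.
PET_KEYWORDS = ("pet", "dog", "cat", "animal", "breed", "leash")

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: wire this to whatever LLM client you use.
    raise NotImplementedError

def clause_mentions_pets(clause: str) -> bool:
    # Hybrid check: cheap keyword screen first, LLM judge only when keywords miss,
    # which catches unusual phrasings like "four-legged companions".
    if any(kw in clause.lower() for kw in PET_KEYWORDS):
        return True
    verdict = call_llm(
        "Does this lease clause concern pets in any way? Answer yes or no.\n\n" + clause
    )
    return verdict.strip().lower().startswith("yes")

def generate_phrasing_variants(clause: str, n: int = 3) -> list[str]:
    # Alternative-phrasing generation: expand a flagged clause into paraphrases
    # so the eval set covers wordings the pipeline hasn't seen yet.
    response = call_llm(
        f"Rewrite this lease clause {n} different ways, one per line, "
        "preserving its meaning:\n\n" + clause
    )
    return [line.strip() for line in response.splitlines() if line.strip()]
```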

And this is where it gets a little bit fuzzy and interesting, right? How do we build these for our users? I think a lot of different domains have very exciting challenges there. So now I want to move over to the intent gap: when we know that there are lots of failure modes in our data, how do we even go about improving the pipeline?

Much of this revolves around reducing query ambiguity, or prompt ambiguity. Maybe I want to change "pet related clauses" to "dog and cat related clauses." This is a very simple example, but you can imagine that with hundreds of failure modes, figuring out how to translate them into actual pipeline improvements is very difficult.

Do we prompt engineer? Do we add new operations? Do we do task decomposition? Do we try to look at subsections of the document and unify the results? People often get very lost in that. So one of the solutions we're prototyping, and that's available in our DocETL project, is the ability to take user-provided notes and automatically translate them into prompt improvements, in an interface where people can interactively give feedback and maintain their revision history, so it's fully steerable.
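As a rough sketch of that idea, not the DocETL implementation itself, here is what a note-to-prompt revision loop with a revision history might look like; `call_llm` is a hypothetical stand-in for an LLM client.

```python
# Rough illustration of the idea, not the DocETL implementation: fold a user's
# free-form notes into the prompt via an LLM, keeping every revision around.
from dataclasses import dataclass, field

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: wire this to whatever LLM client you use.
    raise NotImplementedError

@dataclass
class PromptHistory:
    revisions: list[str] = field(default_factory=list)

    def current(self) -> str:
        return self.revisions[-1]

    def propose_revision(self, user_notes: str) -> str:
        # Ask an LLM to rewrite the current prompt so it addresses the notes;
        # the draft is shown to the user for interactive feedback before saving.
        return call_llm(
            "Rewrite this prompt so it addresses the notes below, without "
            "changing the overall task.\n\nPrompt:\n" + self.current()
            + "\n\nNotes:\n" + user_notes
        )

    def accept(self, revised_prompt: str) -> None:
        self.revisions.append(revised_prompt)  # full history stays inspectable

    def rollback(self) -> None:
        # Let the user undo a revision they no longer want.
        if len(self.revisions) > 1:
            self.revisions.pop()

# Usage sketch:
# history = PromptHistory(revisions=["Extract all pet policy clauses."])
# draft = history.propose_revision("Also catch weight limits and breed restrictions.")
# history.accept(draft)
```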

All right. Now, for my last slide, you might be wondering: okay, why does this matter? I don't really care; I might not be building agents for data processing. What can I take away from this? Great question. So here are my takeaways for you. First, we always find, in every single domain where people are processing data, that evals are very, very fuzzy.

And people are never done with evaluation up front. They are always collecting new failure modes as they run pipelines, always creating new subsets of documents or example traces that will serve as evals to run in the future. And failure modes really hide in this long tail, right? We see people having tens, even twenties, of different failure modes that they're constantly checking for.

The next thing we've observed to be very helpful is when users unpack the cycle of iteration into distinct stages, right? I mentioned that people try strategies like query decomposition or prompt optimization to turn a well-specified pipeline into a generalizable pipeline. However, we find that people first need to figure out how to specify their pipeline in the first place.

So first, understand your data. Do this as its own stage. Don't worry about having good accuracy yet; just know what's going on in your failure modes. Second, figure out how to get your prompts as well specified as possible. Make sure there's no ambiguity: if you were to send them to a human, for example, they would not misinterpret them.

Only then do people get really good gains from applying well-known accuracy optimization strategies. With that, thanks so much. Feel free to email me at shreyashankar@berkeley.edu, and I'm happy to chat with anyone afterwards.