
LangChain Interrupt 2025: Lessons in Building an IDE for Data Processing Agents – Shreya Shankar


Transcript

Hey everyone, my name is Shreya. I'm finishing up my PhD at UC Berkeley, so that's quite exciting for me. And I'm here to give you a different kind of talk. This is about research, what we're learning through research, and how to help people build reliable LLM pipelines. Just to give a picture of the kind of research that we do at Berkeley, this is around data processing agents.

What do I mean by data processing? Organizations have lots of unstructured data: documents that they want to analyze, extract insights from, and make sense of. So for example, maybe in customer service reviews they want to extract themes, summarize them, and figure out actionable next steps. Maybe, for a sales agent, they want to look through their emails to figure out: which clients could have been closed?

Why weren't they? How do we move forward from that? And all sorts of domains have these kinds of tasks. For example, in traffic safety or aviation safety: what are the causes of accidents? How can we mitigate them? And when people write pipelines to use LLMs to solve these problems, their number one complaint is that, you know, this is really hard.

It doesn't work. And so I want to put you in that mindset to figure out why. Imagine you are a real estate agent trying to find a place that meets your client's needs. And your client has a pet; they're a dog owner. You might want to know, okay, which neighborhoods in, say, SF have the most restrictive pet policies.

I want to tell that to my client. So you might write this pipeline as a sequence of LLM operations on a bunch of real estate rental contracts. You might start out with a map operation, which for every document gives you some extracted output, then more map operations, for example, to categorize or classify clauses.

And then you aggregate these clauses together, maybe by neighborhood or by city, and come up with a summary or report for each. People write these pipelines and the number one thing that they tell us is, "My prompts don't work." And then the number one thing that they're told as a solution is, "Oh, just iterate on your prompts." So in today's talk, I really want to dive into what this kind of iteration entails, right?
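To make that pipeline shape concrete, here is a minimal sketch in Python of a map → map → reduce pipeline over rental contracts. Everything here is illustrative: `call_llm` is a hypothetical stand-in for whatever LLM client you use, and the prompts and data shapes are assumptions, not DocETL's actual API.

```python
from collections import defaultdict

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; swap in your provider's client here."""
    raise NotImplementedError

def extract_pet_clauses(contract_text: str) -> str:
    # Map 1: pull candidate pet policy clauses out of one rental contract.
    return call_llm(f"Extract all pet policy clauses:\n\n{contract_text}")

def categorize_clauses(clauses: str) -> str:
    # Map 2: categorize / classify the extracted clauses.
    return call_llm(f"Categorize these pet policy clauses:\n\n{clauses}")

def summarize_by_neighborhood(contracts: list[dict]) -> dict[str, str]:
    # Reduce: group categorized clauses by neighborhood and summarize each group.
    grouped: dict[str, list[str]] = defaultdict(list)
    for contract in contracts:
        clauses = extract_pet_clauses(contract["text"])
        grouped[contract["neighborhood"]].append(categorize_clauses(clauses))
    return {
        neighborhood: call_llm(
            "Summarize how restrictive these pet policies are:\n\n" + "\n".join(items)
        )
        for neighborhood, items in grouped.items()
    }
```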

Why is this problem hard? How can you feel like you're not just hacking away at nothing to make progress? So at UC Berkeley, we put our research hats on, our HCI hats on, and studied how people write these kinds of data processing pipelines. The very first thing we observed is that people often don't even know what the right question to ask is.

And many of you might resonate with this a little bit. So in our real estate agent example, someone might think they want to extract all pet policy clauses and then realize only after looking at the documents and looking at the outputs that they only wanted, you know, dog and cat pet policy clauses.

Then, once they feel like they have the right question they want to ask, they need to figure out how to specify that question. We all know, working with LLMs, that we need very well-specified, clear, unambiguous prompts. And things that we as humans think are unambiguous are actually pretty ambiguous.

For example, just saying "dog and cat policy clauses" doesn't tell the LLM much. Maybe you need to spell out weight limits or restrictions, breed restrictions, quantity limits, and so forth to improve the LLM's performance. So zooming out a bit, what do these challenges mean, right? Iteration reveals a lot of these insights if you do it correctly.
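As a concrete illustration of that kind of refinement, here is an assumed before/after pair of prompts; the exact wording is made up, but the added detail mirrors the clarifications described above.

```python
# Illustrative only: an underspecified prompt vs. a more explicit one.
VAGUE_PROMPT = "Extract the pet policy clauses from this rental contract."

SPECIFIED_PROMPT = """Extract clauses about dog and cat policies from this rental contract.
Include clauses that mention:
- weight limits (e.g. "dogs under 25 lbs")
- breed restrictions
- limits on the number of pets
- pet deposits, fees, or pet rent
Ignore clauses about other animals unless they also cover dogs or cats."""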

But when we help people build data processing pipelines, what we really want to do is close these gaps between the user or the developer, the data they're trying to query and make sense of, and the pipeline that they're writing. And as researchers, we figured out that, oh my gosh, there's so much tooling in this bottom half, the LLM accuracy gap.

When you have a very well-specified pipeline, how do we make sure it generalizes to all of our documents and our needs? But there's virtually no tooling for the data understanding and intent specification gaps. So in today's talk, I want to spend the rest of the time telling you how we're thinking about closing these gaps, and insights that you might apply when you're trying to iterate on your own pipelines.

First, I'll talk about the data understanding gap. So going back to our real estate rental contract example, the core challenge here is: what are the types of documents in the data, and what are the unique failure modes that happen for each type of document? So for example, all of these types of pet clauses might exist: breed restriction clauses, clauses on the number of pets, service animal exemptions.

And many people don't even know this until they look at the data. So when we're building tools, we might want to automatically extract these types for our end users so they can look at examples of failure modes for each one. And then we see that there's a really, really long tail of failure modes.
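One simple way to surface those document and failure-mode "types" is to cluster the pipeline's outputs and show a representative example per cluster. The sketch below uses TF-IDF plus k-means as a crude stand-in for the richer tooling described in the talk; the function and parameters are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_outputs(extracted_clauses: list[str], n_clusters: int = 5) -> dict[int, list[str]]:
    """Group extracted clauses into rough clusters so a user can skim one example per cluster."""
    vectors = TfidfVectorizer(stop_words="english").fit_transform(extracted_clauses)
    labels = KMeans(n_clusters=n_clusters, n_init="auto", random_state=0).fit_predict(vectors)
    clusters: dict[int, list[str]] = {}
    for clause, label in zip(extracted_clauses, labels):
        clusters.setdefault(int(label), []).append(clause)
    return clusters

# Usage sketch: print the first clause from each cluster as a representative example.
# for label, clauses in cluster_outputs(all_clauses).items():
#     print(label, clauses[0])
```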

And this is not unique to real estate settings. We observe this for pretty much any application; it's like ML in general. There are so many different types of failure modes that are difficult to make sense of. So for example, clauses might be phrased unusually and the LLM might miss extracting them.

It might extract things that are unrelated because, you know, a keyword is only superficially related, and so forth. It's not uncommon to see people flag hundreds of issues in a thousand-document collection. So putting this all together, zooming out, what does it mean to close this data understanding gap?

I mentioned that we want tooling to help people find anomalies and failure modes in their data, but also to be able to design evals on the fly for each of these different failure modes. And some of the solutions that we're prototyping in our research stack are for having people look at clusters of outputs automatically and annotate them in situ so that we can help organize them.

So to give you a concrete example of what a real estate agent, or a developer working for one, might do: for each failure mode, whether we organized it automatically or they labeled it themselves, once we can identify that a set of outputs has all been labeled the same way, they end up with, okay, here's a dataset you can design evals on.
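A minimal sketch of what "a dataset you can design evals on" might look like in code: group labeled examples by failure mode and report how often a candidate pipeline handles each one. The record fields and the `check` callback are assumptions for illustration.

```python
from collections import defaultdict
from typing import Callable

def eval_by_failure_mode(
    labeled_examples: list[dict],
    check: Callable[[dict], bool],
) -> dict[str, float]:
    """Return the fraction of passing examples for each labeled failure mode."""
    grouped: dict[str, list[dict]] = defaultdict(list)
    for example in labeled_examples:
        # e.g. {"doc_id": ..., "output": ..., "failure_mode": "missed unusual phrasing"}
        grouped[example["failure_mode"]].append(example)
    return {
        mode: sum(check(example) for example in examples) / len(examples)
        for mode, examples in grouped.items()
    }
```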

And maybe there are some potential strategies for addressing them. For example, generating alternative phrasings with an LLM, or doing keyword checks in combination with LLMs. And this is where it gets a little bit fuzzy and interesting, right? How do we build these for our users? And I think a lot of different domains have very exciting challenges here.
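Here is a rough sketch of the second strategy: a cheap keyword pre-filter combined with an LLM confirmation for ambiguous cases. Again, `call_llm`, the keyword list, and the prompt are illustrative assumptions rather than anything from the talk's actual system.

```python
PET_KEYWORDS = ("dog", "cat", "pet", "breed", "animal")

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; swap in your provider's client here."""
    raise NotImplementedError

def is_pet_clause(clause: str) -> bool:
    text = clause.lower()
    if not any(keyword in text for keyword in PET_KEYWORDS):
        return False  # the keyword filter cheaply rejects clearly unrelated clauses
    # A keyword hit alone isn't enough (e.g. "animal" in an unrelated context),
    # so ask the LLM to confirm the ambiguous cases.
    answer = call_llm(
        f"Does this clause set a policy about dogs or cats? Answer yes or no.\n\n{clause}"
    )
    return answer.strip().lower().startswith("yes")
```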

So now I want to move over to the intent specification gap, which is: when we know there are lots of failure modes in our data, how do we even go about improving the pipeline? Much of this revolves around reducing query ambiguity or prompt ambiguity. Maybe I want to change "pet-related clauses" to "dog and cat related clauses."

This is a very simple example. But you can imagine, with hundreds of failure modes, figuring out how to translate them into actual pipeline improvements is very difficult. Do we prompt engineer? Do we add new operations? Do we do task decomposition? Do we try to look at subsections of the document and then unify the results?

People often get very lost in that. So one of the solutions that we're prototyping, and that's available in our DocETL project, is the ability to take user-provided notes and automatically translate them into prompt improvements, in an interface where people can interactively give feedback, edit, and maintain their revision history so it's fully steerable.
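To show the shape of that notes-to-prompt loop, here is a minimal sketch with a simple revision history so edits stay steerable and reversible. This is not the DocETL interface; `call_llm`, the class, and the revision prompt are assumptions for illustration.

```python
from dataclasses import dataclass

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; swap in your provider's client here."""
    raise NotImplementedError

@dataclass
class PromptHistory:
    revisions: list[str]  # start with the current prompt as the first entry

    @property
    def current(self) -> str:
        return self.revisions[-1]

    def revise_from_notes(self, notes: list[str]) -> str:
        # Ask the LLM to fold the user's notes into a revised prompt.
        suggestion = call_llm(
            "Rewrite this prompt so it addresses the user's notes.\n\n"
            f"Prompt:\n{self.current}\n\nNotes:\n" + "\n".join(f"- {note}" for note in notes)
        )
        self.revisions.append(suggestion)  # keep every version so users can roll back
        return suggestion

    def rollback(self) -> str:
        if len(self.revisions) > 1:
            self.revisions.pop()
        return self.current

# Usage sketch:
# history = PromptHistory(revisions=["Extract all pet policy clauses."])
# history.revise_from_notes(["Only dog and cat clauses", "Include breed restrictions"])
```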

Alright. Now, for my last slide, you might be wondering, okay, why does this matter? I don't really care; I might not be building agents for data processing. What can I take away from this? Great question. So here are my takeaways for you. First, we find in every single domain where people are processing data that evals are iterative.

People are never done with evaluation up front. They're always collecting new failure modes as they run pipelines, always creating new subsets of documents or example traces that will become evals in the future. And failure modes really hide in the long tail, right? We see people having tens or twenties of different failure modes that they're constantly checking.

Then the next thing we've observed to be very helpful is when our users unpack the cycle of iteration into distinct stages. Right? I mentioned that people try strategies like query decomposition or prompt optimization to turn a well-specified pipeline into a generalizable pipeline. However, we find that people first need to figure out how to specify the pipeline in the first place.

So first, understand your data. Do this as a stage in itself. Don't worry about having good accuracy; just know what's going on in your failure modes. Second, figure out how to get your prompts as well-specified as possible. Make sure there's no ambiguity: if you were to send them to a human, they would not misinterpret them, for example.

And only then do people get really good gains from applying well-known accuracy optimization strategies. So, with that, thanks so much. Feel free to email me at shreyashankar@berkeley.edu. I'm happy to chat with anyone afterwards.

Thank you, Shreya. Next up is Tan.
