to help build reliable agents. Let's welcome him up. Hello, everyone. My name is Tan, and it's very hard to follow a speaker like that, right? With his work, figuring out some of the hardest problems in the world. But I'm excited to talk about Nubank here.
My name is Tan, and as I said, I have been at Nubank for about four years. We are building the AI private banker for all of our customers. And the idea is that people are continuously bad at making financial decisions. Even which subscription to cancel is a difficult decision for many people by itself.
So imagine how they think about loans, investments, and so on, especially in a situation like Nubank, where we are the third largest bank in Brazil and the fastest growing bank in Mexico and Colombia. We have given first credit card access to 21 million people in Brazil in the last five years alone.
And just to be clear, these numbers are actually outdated, because we revealed our new numbers yesterday in our earnings call, and they have gone up since then. But the interesting part is that right at the very beginning, when the world of ChatGPT arrived, we got into it.
And we have been working very closely with LangChain, LangGraph, LangSmith, and all the teams here. I'm excited to talk about a few things we have built, how we have built them, and also how we evaluate them effectively. So just to be clear, I'm going to talk about two different applications here.
The first one is the chatbot. No surprise. We have almost 120 million users, and we get almost 8.5 million contacts every month. Out of that, chat is our main channel, and in that channel, right now, 16% of contacts are handled first by LLMs. The results are improving, and we are building agents for different situations, which we'll talk about.
And the next one is actually more interesting. We'll talk about these two applications because, in the world of finance (I just heard from Harvey that the complexity of legal matters is important), every single dollar, every single thing matters. And that is important because it creates trust, it makes sure the users are happy with our service, and so forth.
So just for a second, let's look at this application. You can see that we have built an agentic system that can do money transfers over voice, image, and chat at very low inaccuracy; I'll show you the numbers really soon. Here you can see that the user is connecting their account with WhatsApp.
We ask them multiple times for their password, etc., to make sure they have the right credentials. And once they do all of that, they can give a very simple instruction, like, "Hey, make a transfer of, let's say, 400 reais." We confirm this is the transfer the user wants, and once they confirm, it makes the transfer.
It used to take around 70 seconds and 9 different screens to make this transfer. It's taking less than 30 seconds now. And you can see the CSAT is more than 90%, with less than 0.5% inaccuracy, so on and so forth. And we're doing that at scale. I will talk about where evaluations matter and what kinds of things matter.
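To make the flow concrete, here is a minimal sketch of how such a confirm-then-execute transfer agent could be wired up in LangGraph. The node names, state fields, and stubbed functions are my own illustrative assumptions, not Nubank's actual implementation.

```python
# Minimal LangGraph sketch of a confirm-then-execute transfer flow.
# Node names, state fields, and the stubbed logic are hypothetical.
from typing import Optional, TypedDict
from langgraph.graph import StateGraph, START, END

class TransferState(TypedDict):
    request: str                 # raw user instruction, e.g. "send 400 reais to Ana"
    parsed: Optional[dict]       # recipient, amount, source account
    confirmed: bool              # did the user approve the summary?
    result: Optional[str]

def parse_request(state: TransferState) -> dict:
    # In practice, an LLM with structured output extracts these fields.
    return {"parsed": {"recipient": "Ana", "amount": 400, "source": "checking"}}

def confirm_with_user(state: TransferState) -> dict:
    # Show the parsed transfer back to the user and capture approval.
    return {"confirmed": True}  # stubbed: assume the user said yes

def execute_transfer(state: TransferState) -> dict:
    p = state["parsed"]
    return {"result": f"sent {p['amount']} to {p['recipient']}"}

def route_after_confirm(state: TransferState) -> str:
    # Only execute if the user explicitly confirmed; otherwise stop.
    return "execute" if state["confirmed"] else END

builder = StateGraph(TransferState)
builder.add_node("parse", parse_request)
builder.add_node("confirm", confirm_with_user)
builder.add_node("execute", execute_transfer)
builder.add_edge(START, "parse")
builder.add_edge("parse", "confirm")
builder.add_conditional_edges("confirm", route_after_confirm)
builder.add_edge("execute", END)
graph = builder.compile()
```

The explicit confirmation node is the point: the transfer only fires after the user approves the parsed summary, which is what keeps the inaccuracy rate low.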
So just to be on the same page: the experience of building a chatbot and building an application like this is very different. You need to iterate, but at the same time you need to think about how to not build one-off solutions. Because if you're building it for one application of this kind, imagine the finance world, where you're doing hundreds of these operations, money movements, and micro-decisions.
Then you're building hundreds of separate systems and agents, which is just not scalable. So, with that in mind, what is the Nubank LLM ecosystem? It has four different layers: core engine, testing and evals, tools, and developer experience. Unfortunately, I don't have time to go over all of them.
But you can see that in three of them we are working very closely with LangChain. Testing and evals is something we'll talk about: LLM as a judge, and also online and offline evaluation. And on the developer experience side, we have been using LangGraph and LangSmith from the very beginning (LangSmith first, then LangGraph), and we are using all of them.
Now, how do we use it, and why does it matter? Let's see. The first thing is that without LangGraph, we cannot iterate faster, and we cannot standardize a canonical approach for building agentic systems or any kind of LLM systems.
The learning there is that complex LLM flows can be hard to analyze. Centralized LLM logs and repositories and a graphical interface help people make faster decisions, because we don't want only our developers making decisions. We want our business users to contribute as well, and the way to do that is by democratizing the data: giving our business analysts, product managers, product operations, and tech operations access so they can make faster decisions in terms of prompts, in terms of adding inputs, in terms of adding parameters of different kinds.
And last but not least, we can decrease the cognitive effort needed to represent flows. This is something Treyya mentioned: human interaction is difficult for a machine to discover, and a graph basically makes that process easier for us. Now, to be true to my presentation, I will go to the evaluation part.
On the evaluation side, I will first talk about a few different challenges we have overall. The first is that, as we heard from Harvey, who are everywhere except Antarctica, we are only in three countries, so it's a much smaller problem set. But still, you can imagine that when you're dealing with Portuguese and Spanish, the languages, the dialects, the way people talk, and so on change across each country.
And 58% of the Brazilian population are customers now, so we have to understand what users are talking about very extensively. The second thing is that Nubank's brand presence is huge. We are more popular than some well-known brands like McDonald's or even Nike in Brazil.
So we cannot get things wrong; especially when it comes to jailbreaks or guardrails, it's very important for us to keep a very high bar. And last but not least, we have to be accurate in our messaging, because at the end of the day we are dealing with people's money.
And money is something that people care about deeply, and losing trust over a money transfer is very easy. So taking a step back (actually, moving a little forward): on the customer service side and the money transfer side, we have very different needs, from a business perspective and from a technical perspective, for the kinds of evaluations we run.
In the case of customer service, in addition to accuracy, what matters a lot is how we approach a customer. A customer might ask us, "Hey, where is my card?" or "Hey, I see this charge that I don't recognize." If we give a very robotic experience, we lose the customer's trust, and that matters.
It's very easy for a human to make this connection; it's very hard for a machine, and we all know that. Also, very heavy flattery doesn't work. I think a lot of you saw what happened with the GPT-4o model last week: the sycophancy issue that got rolled back after they launched it.
So, in order to judge this well, we need to think about: do we understand customer intents well? How are we retrieving content and context from the different internal sources that we have? What is our deep link accuracy? Our app has some 3,000 pages, and across those 3,000 pages we have hundreds of deep links. Landing a user at the top of the app and asking them to navigate through different clicks to reach the page where they can self-serve is very tedious and not very effective.
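As a concrete example of one such check, here is a minimal sketch of a deep-link accuracy evaluator, written in the shape of a LangSmith custom evaluator (a function over a run and a labeled example). The field names are my assumptions for illustration.

```python
# Hypothetical deep-link accuracy check in the LangSmith custom-evaluator
# style: compare the link the agent chose against a labeled correct link.
def deep_link_accuracy(run, example) -> dict:
    predicted = (run.outputs or {}).get("deep_link")     # link the agent chose
    expected = (example.outputs or {}).get("deep_link")  # labeled correct link
    return {"key": "deep_link_accuracy", "score": int(predicted == expected)}
```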
And last but not least, we need to make sure we are not hallucinating. For a money transfer, tone and sentiment are okay, but we need to be accurate. And the accuracy is not only about the amount being transferred, but also who we are transferring to, and which source account we are using for the transfer.
Does the person have enough money in their account? Are they a fraud suspect? Do they have a pending collection? All of these things are intricately connected, because our customers are using not just one product, but products from lending, to financing, to investments, to banking, to credit cards, and so on.
And oftentimes they have dependent accounts and multiple cards, so you have to look at all of them together. What's important there is: can you do named entity recognition properly? Because a user can say, "Hey, send $100 to my brother." Now, if you have only one brother, you have saved him as a contact, and you have sent money before, it's easy.
But imagine the situation where you have not done that and you have multiple brothers: my favorite brother, my next favorite brother. We need to identify which brother the user is talking about before sending the money, because they don't want the money going to the next favorite brother.
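A minimal sketch of how this kind of recipient disambiguation could work with structured output, assuming langchain-openai. The schema, model choice, and candidate list are hypothetical, not Nubank's actual setup.

```python
# Hypothetical recipient-resolution sketch: the LLM either resolves the
# contact or flags the request as ambiguous and asks a follow-up question.
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI

class ResolvedRecipient(BaseModel):
    contact_id: str | None = Field(None, description="Matched contact, if unambiguous")
    needs_clarification: bool = Field(..., description="True if several contacts could match")
    clarifying_question: str | None = None

llm = ChatOpenAI(model="gpt-4o-mini").with_structured_output(ResolvedRecipient)

candidates = "joao (brother, 3 prior transfers), pedro (brother, 0 prior transfers)"
resolved = llm.invoke(
    f"User said: 'send $100 to my brother'. Saved contacts: {candidates}. "
    "Resolve the recipient, or ask a clarifying question if ambiguous."
)
```

With two brothers on file, a well-prompted model should set needs_clarification rather than guess, which is exactly the behavior you then evaluate for.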
The next one is making sure we correctly interpret the user's input. If the user says, "Hey, I want to send $100, but do it tomorrow," that's a different instruction than doing it right now. If we do it right now, we got it wrong, and users will learn not to trust it over time.
And last but not least, identifying the correct action. The user might be saying, "I don't want to send it. That last transfer you saw, I want to cancel it." We need to understand that as well. So all of these things matter, and evaluation for all of these things matters.
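These interpretation checks (amount, timing, action) can be captured in one structured intent that the agent extracts before doing anything. A minimal sketch with assumed field names:

```python
# Hypothetical intent schema: the extracted action and timing determine
# whether the agent sends now, schedules, or cancels.
from enum import Enum
from datetime import date
from pydantic import BaseModel

class Action(str, Enum):
    SEND_NOW = "send_now"
    SCHEDULE = "schedule"
    CANCEL = "cancel"

class TransferIntent(BaseModel):
    action: Action
    amount: float | None = None
    execute_on: date | None = None  # "do it tomorrow" resolves to a date

# "send $100, but do it tomorrow" should parse to:
#   TransferIntent(action=Action.SCHEDULE, amount=100.0, execute_on=<tomorrow>)
# "cancel that last transfer" should parse to:
#   TransferIntent(action=Action.CANCEL)
```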
Without them, we cannot launch a product. So in the absence of evals, in the absence of a tool like LangSmith, what happens is that we have a linear path of development. We are running A/B tests, because we make all our decisions with A/B tests. I'm not sure if I covered it before, but we have 1,800 services and we do a deployment every 2 minutes.
So we make every decision with A/B tests, and that's the linear path. But if we have a system that can connect the traces well and give us observability, logging, and alerting on top of it, and so on, then we have a full cycle: from observability, to filtering, to defining datasets, to running experiments, and so on.
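A minimal sketch of that cycle with the LangSmith SDK: trace production calls, filter the interesting runs, and turn them into a dataset for experiments. The project and dataset names are placeholders, and the exact filter you'd use in production would be richer than this.

```python
# Hypothetical observability-to-dataset loop with LangSmith.
from langsmith import Client, traceable

@traceable(name="transfer_agent")  # every call becomes a searchable trace
def transfer_agent(request: str) -> dict:
    ...  # the agent under observation

client = Client()

# Filter traces from production (placeholder project name) ...
runs = list(client.list_runs(project_name="transfer-agent-prod", error=False, limit=100))

# ... and turn them into a dataset for offline experiments.
dataset = client.create_dataset(dataset_name="transfer-hard-cases")
client.create_examples(
    inputs=[r.inputs for r in runs],
    outputs=[r.outputs for r in runs],
    dataset_id=dataset.id,
)
```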
And this is the flywheel we are building with LangSmith for our generative AI applications. These two themes, offline evaluation and online evaluation, I've heard a few times in the last couple of talks, so I will not go into very deep detail on them.
But as you can see, in the case of offline evaluation, after the experiments run we take the results and do both individual evaluation and pairwise evaluation. For both of them (I will talk about this later) we primarily use human labelers in that process.
We are also currently using algorithms and other custom heuristics. Based on all of that, we run statistical tests, and the winning variant is what we launch. Things actually get more interesting not at this stage, but at the online stage. Because in offline evaluation, you can run things in your sandboxes, in your more controlled environments.
Online, you have a more continuous view of improvement and development. If we only do offline evaluation, our decision-making speed, especially for developers and analysts, is much slower. But if we can do good online evaluation, with tracing, logging, alerting, and so on, then our development speed goes up significantly.
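For the offline half, here is a minimal sketch of an individual-evaluation experiment using LangSmith's evaluate(). The target function, dataset name, and evaluator are placeholder assumptions. Pairwise comparison of two experiments can be layered on top (the SDK has evaluate_comparative for that), followed by a statistical test to pick the winner.

```python
# Hypothetical offline experiment: run a candidate variant over a dataset
# and score each output with an individual evaluator.
from langsmith.evaluation import evaluate

def exactness(run, example) -> dict:
    # Individual evaluation: did the variant produce the labeled answer?
    score = int((run.outputs or {}).get("answer") == (example.outputs or {}).get("answer"))
    return {"key": "exactness", "score": score}

def candidate_agent(inputs: dict) -> dict:
    # The variant under test; stubbed here.
    return {"answer": "..."}

results = evaluate(
    candidate_agent,
    data="transfer-hard-cases",   # dataset built from production traces
    evaluators=[exactness],
    experiment_prefix="prompt-v2",
)
```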
So we are doing both of them. Now, last but not least, I will talk about LLM as a judge, which is something we have heard about a few times. The question basically comes back to scale. Imagine the situation I was describing with the money transfer: you need to understand who we are sending to, what the request is, how much money, from where, all of that.
And we are currently doing, say, a few hundred thousand to a few million such transactions every day. With that amount of data, that amount of labeling, even if we do sampling, is not enough to maintain the quality of the product. So we need to do more labeling, and doing it only with humans is not scalable.
Because of training, how people understand the task, the mistakes people make, and so forth. So our bet was: let's build an LLM as a judge, and try to keep the quality of the judge at the same level as a human. We started with the first test: a simple prompt with the GPT-4o mini model, because it's cheap.
We didn't do any fine-tuning, and we saw how it worked. We found that humans were making roughly 80% accurate decisions and 20% mistakes. The exact numbers aren't shown here, but you can think of it roughly as an accuracy metric. We were at 51%. In test 2, we moved to a fine-tuned model, and we increased the F1 score from 51 to 59.
In the next test, we changed the prompt and got a big jump of 11 percentage points, to 70. Then we made another iteration, from 4o-mini to 4o, a better and bigger model. We changed the prompt again in test 5, changed the fine-tuning again in test 6, and this is where we landed: an F1 score of 79%, compared to 80% for humans, which is quite comparable.
Now, you might ask: why did we move from test 5 to test 6 if test 5 was at 80% and test 6 at 79%? Because at 79%, we identify inaccurate information better; we catch more of the inaccuracies there, and that's why we landed here.
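A minimal sketch of what one judge iteration could look like: a small model grades each transfer record as correct or incorrect, and the judge's labels are scored against a human-labeled sample with F1. The prompt, records, and labels here are invented for illustration, not the actual production judge.

```python
# Hypothetical LLM-as-a-judge loop: grade transfers, compare to humans.
from langchain_openai import ChatOpenAI
from sklearn.metrics import f1_score

judge = ChatOpenAI(model="gpt-4o-mini", temperature=0)

JUDGE_PROMPT = (
    "You are auditing a money-transfer agent. Given the user request and the "
    "executed transfer (recipient, amount, source account, timing), answer "
    "only 'correct' or 'incorrect'.\n\nRequest: {request}\nExecuted: {executed}"
)

def judge_one(request: str, executed: str) -> int:
    verdict = judge.invoke(JUDGE_PROMPT.format(request=request, executed=executed))
    return int("incorrect" not in verdict.content.lower())

# Each iteration (prompt change, fine-tune, bigger model) is scored the
# same way against the human-labeled sample:
human_labels = [1, 0, 1, 1]                    # from human reviewers
judge_labels = [judge_one(r, e) for r, e in [  # invented example records
    ("send 400 reais to Ana", "400 -> Ana, checking, now"),
    ("send 100 tomorrow",     "100 -> Joao, checking, now"),
    ("pay my card bill",      "card bill paid, checking"),
    ("cancel last transfer",  "last transfer canceled"),
]]
print("judge F1 vs humans:", f1_score(human_labels, judge_labels))
```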
And, no surprise (actually, quite surprising), this whole development took us maybe around 2 weeks, with a couple of developers, to go through these 6 iterations. We could only do it because we had the online tracing and evaluation system in place; otherwise it wouldn't have been possible. So, wrapping it all up: there is no magic in building any agent or any LLM system. It's hard work, evals are hard work, and if you don't evaluate, you don't know what you are building. And if you don't know what you are building, then you cannot ship it to the world.
So do more evals, spend more time, understand what your users are saying, and think not only about hallucinations, accuracy, and so on, but also about nuanced things like empathy and tone of voice. Those matter. We are in a very exciting time, and thank you all for listening.
We are now going to a 20-minute break before our next session. I want to mention the boba bar located on the rooftop, the espresso cart, and the seating area for attendees. Thank you.