
Building Reliable Agents: Evaluation Challenges


Transcript

TAN BANG: Hello, everyone. My name is Tan, and it's very hard to follow a full-blown researcher from UC Berkeley who is figuring out how to write the best prompts in the world. But I'm excited to talk about Nubank here. As I said, I have been at Nubank for the last four years.

We are building the AI private banker for all of our customers. The idea is that people are notoriously bad at making financial decisions. Even deciding which subscription to cancel is difficult for many people. So imagine how people think about loans and investments, especially in a situation like Nubank's, where we are the third largest bank in Brazil and the fastest growing bank in Mexico and Colombia.

We have given first-time credit card access to 21 million people in Brazil in the last five years alone. And just to be clear, these numbers are actually outdated, because we released our numbers yesterday in our earnings call, and all of these numbers have gone up since then. But the interesting part is that from the very beginning, we dove into the world of ChatGPT.

We got into it early. And we have been working very closely with LangChain, LangSmith, and all the teams here. I'm excited to talk about a few things we have built, how we have built them, and also how you can evaluate them effectively. So just to be clear, I'm going to talk about two different applications here.

The first one is the chatbot. No surprise. We have almost 120 million users, right? And we get almost 8.5 million contacts every month. Out of those, chat is our main channel, and in that channel right now, 60% of contacts are first handled by LLMs. The results are improving.

And we are building agents for different situations, and I'll talk about them. The next one is actually more interesting. I'll talk about these two applications because, in the world of finance, as you heard from Harvey, the complexity of legal matters is important. For finance, every single dollar, every single penny matters.

And that is important because it creates trust. It makes sure the users are happy with our service, and so on and so forth. So just for a second, let's look at this application. In this application, you can see that we have built an agentic system that can do money transfers over voice, image, and chat with very low inaccuracy.

I will show you the numbers really soon. Here, you can see that the user is connecting their account with WhatsApp. We ask them multiple times about their password, et cetera, to make sure it is the right contact. And once they have done all of that, they can give a very simple instruction, like, "hey, make a transfer to Jose for 100 reais."

We confirm this is the transfer the user wants, and once the user confirms, it makes the transfer. Earlier, it used to take around 70 seconds to make this transfer, across nine different screens. It takes less than 30 seconds now. And you can see the CSAT is more than 90%, with less than 0.5% inaccuracy, and so on and so forth.
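
To make that confirm-then-execute pattern concrete, here is a minimal sketch of the flow the demo describes. It is not Nubank's actual code: `parse_transfer`, `execute_transfer`, and `Session` are hypothetical placeholders for the real entity-extraction and money-movement steps.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TransferRequest:
    recipient: str      # resolved contact, e.g. "Jose"
    amount_cents: int   # 100 reais -> 10_000 centavos

@dataclass
class Session:
    pending: Optional[TransferRequest] = None

def parse_transfer(message: str) -> TransferRequest:
    # Placeholder for the LLM/entity-extraction step: who, how much, from where.
    return TransferRequest(recipient="Jose", amount_cents=10_000)

def execute_transfer(request: TransferRequest) -> None:
    # Placeholder for the actual money-movement call.
    print(f"sent {request.amount_cents / 100:.2f} to {request.recipient}")

def handle_message(message: str, session: Session) -> str:
    """Confirm-then-execute: nothing moves until the user explicitly confirms."""
    if session.pending is None:
        session.pending = parse_transfer(message)
        return (f"Confirm: send {session.pending.amount_cents / 100:.2f} reais "
                f"to {session.pending.recipient}?")
    if message.strip().lower() in {"yes", "sim", "confirm"}:
        request, session.pending = session.pending, None
        execute_transfer(request)
        return "Transfer completed."
    session.pending = None  # anything other than a confirmation cancels the request
    return "Transfer cancelled."
```

The key design choice is that the confirmation is a hard gate in code, not something left to the model's judgment.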

And we are doing that at scale. I will talk about where evaluations matter and what kinds of things matter. So, just to be on the same page, the experience of building the chatbot and building an application like this is very different. You need to iterate, but at the same time, you need to think about how not to build a one-off solution.

Because if you are building it for one application of this kind, then imagine: in the finance world, you are doing hundreds of these operations to make changes, to move money, or to make micro decisions. You would be building hundreds of separate systems and agents, which is just not scalable.

So taking a step back, what does the Nubank LLM ecosystem look like? It has four different layers: core engine, testing and evals, tools, and developer experience. I don't have time, unfortunately, to go over each one of them. But you can see that in three of them right now, we are working very closely with LangChain and LangSmith.

And testing and evals is something we'll talk about: LLM as a judge, and also online quality evaluation. On the developer experience side, we started with LangChain from the very beginning and have now moved to LangGraph. We are using all of them. Now, how do we use it and why does it matter?

Let's see. OK, so the first thing is that without LangGraph, we could not iterate faster, and we could not standardize a canonical approach for building agentic systems, or even any kind of RAG system. So the learning there is that complex LLM flows can be hard to analyze.

A centralized LLM log repository and a graphical interface help people make faster decisions. We don't want only our developers to make decisions; we want our business users to contribute as well. And the way to do it is by democratizing data and giving access to our business analysts, our product managers, our product operations, our whole operations team, so that they can make faster decisions in terms of prompts, in terms of adding inputs, in terms of adding parameters of different kinds, right?

And last but not least, graphs can decrease the cognitive effort needed to represent flows. This is something Shreya was mentioning: human instructions are difficult for a machine to understand, and a graph basically makes that process easier for us. Now, to be true to my presentation, I will move on to the evaluation part.
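
As a rough illustration of what "flows as graphs" looks like in practice, here is a minimal LangGraph-style sketch. The node names, state fields, and fake intent logic are made up for this example, and the exact API can vary between LangGraph versions.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class ChatState(TypedDict):
    message: str
    intent: str
    reply: str

def classify_intent(state: ChatState) -> dict:
    # In the real system this would be an LLM call; here we fake it with a keyword check.
    intent = "card_tracking" if "card" in state["message"].lower() else "general"
    return {"intent": intent}

def answer(state: ChatState) -> dict:
    return {"reply": f"Handling intent '{state['intent']}' for: {state['message']}"}

builder = StateGraph(ChatState)
builder.add_node("classify", classify_intent)
builder.add_node("answer", answer)
builder.set_entry_point("classify")
builder.add_edge("classify", "answer")
builder.add_edge("answer", END)

graph = builder.compile()
print(graph.invoke({"message": "where is my card?", "intent": "", "reply": ""}))
```

Because the flow is an explicit graph, a product manager or analyst can read it node by node instead of reverse-engineering a chain of prompts.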

On the evaluation side, I will first talk about a few different challenges we have overall. The first is that, as we heard from Harvey, they are on every continent except Antarctica. We are only in three countries, so it's a much smaller problem set. But still, you can imagine that when you're dealing with Portuguese and Spanish, the languages, the dialects, the way people talk, et cetera, change across the country.

And 58% of the Brazilian population is our customer now, so we have to understand what users are talking about very extensively. The second thing is that Nubank's brand presence is huge. We are more popular in Brazil than some well-known brands like McDonald's or even Nike.

So we cannot afford to get anything wrong, especially when it comes to jailbreaks or guardrails. It's very important for us to keep a very high bar. And last but not least, we have to be accurate in our messaging, because at the end of the day, we are dealing with people's money.

And money is something people care about getting exactly right, and losing trust over a money transfer is very easy. So taking a step back again, or actually moving a little forward: on the customer service side and for the money transfer use case, we have very different needs, from a business perspective and from a technical perspective, in terms of what kinds of evaluations we need.

So in the case of customer service, in addition to accuracy, what matters a lot is how we approach a customer. If a customer is contacting us with "hey, where is my card?" or "hey, I see this charge that I don't recognize," and we give a very robotic experience, we lose the customer's trust. Empathy matters.

It's very easy for a human to have this connection. It's very hard for a machine to have this connection, and we all know that. Also, heavy flattery doesn't work. I think all of you saw what happened with the ChatGPT 4.1 model last week, which they rolled back and relaunched.

So, in order to do these two jobs well, we need to think about: do we understand the customer's intent well? Do we understand how we are retrieving content and context from the different sources we have internally? What is our deep link accuracy? Because, imagine, our app has 3,000 pages.

And across those 3,000 pages we have hundreds of deep links. Landing a user at the very top of the app and then asking them to traverse through different clicks to reach the page where they can self-serve is very tedious and not very effective. And last but not least, we need to make sure that we are not hallucinating.
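
Those checks (intent, retrieval, deep link accuracy, groundedness) can be expressed as per-turn evaluators over a labeled dataset. Here is a minimal sketch with an illustrative schema rather than Nubank's internal one, and a deliberately crude groundedness proxy where a real system would use an LLM judge or an entailment model.

```python
from dataclasses import dataclass

@dataclass
class ServiceExample:
    question: str
    expected_intent: str
    expected_deep_link: str

def evaluate_turn(example: ServiceExample, predicted_intent: str,
                  produced_deep_link: str, answer: str, context: list[str]) -> dict:
    # Crude groundedness proxy: every sentence of the answer must appear in some retrieved passage.
    grounded = all(
        any(sentence.strip().lower() in passage.lower() for passage in context)
        for sentence in answer.split(".") if sentence.strip()
    )
    return {
        "intent_correct": predicted_intent == example.expected_intent,
        "deep_link_correct": produced_deep_link == example.expected_deep_link,
        "grounded": grounded,
    }
```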

Whereas for a money transfer, tone and sentiment are okay, but we need to be accurate. And the accuracy is not only about transferring the money, but also who we are transferring it to, which source we are using for the transfer, does the person have enough money in their account, are they a fraud suspect, do they have a pending collection?

All of these things are intricately connected, because our customers are using not just one product but a whole suite of products, from lending to financing to investments to banking accounts to credit cards, et cetera. And oftentimes they have dependent accounts and multiple cards, so we have to look at all of them together.

So what is important there is: can we do named entity recognition properly? Because you can say, "hey, send $100 to my brother." Now, if you only have one brother, and you have saved that brother as a contact, and you have sent money to him before, it's easy. But imagine a situation where you have not done that, and you have multiple brothers, like my favorite brother and my less favorite brother. Then you have to identify which brother I'm talking about to send the money, right?

Because I definitely don't want to send money to my less favorite brother. The next one is about making sure we correctly interpret the user's input. If the user says, "hey, I want to send $100, but do it tomorrow," that's a different instruction than doing it right now.

Because if I do it right now, maybe you will end up in overdraft. And last but not least, we also need to identify the correct action, because the user might be saying, "I don't want to send it, that last transfer you saw, I want to cancel it."
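
Put together, these money-transfer checks amount to extracting a structured intent (recipient, amount, timing, action) and refusing to act when any part of it is ambiguous. A minimal sketch with hypothetical names; in practice the structured output would come from an LLM with a constrained schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Literal, Optional

@dataclass
class TransferIntent:
    action: Literal["create", "cancel"]   # "send $100" vs "cancel that last transfer"
    recipient_query: str                  # raw mention, e.g. "my brother"
    amount_cents: Optional[int]
    execute_on: Optional[date]            # None = now; a date = "do it tomorrow"

def resolve_recipient(query: str, contacts: list[str]) -> Optional[str]:
    """Entity resolution: only return a contact when the mention is unambiguous;
    otherwise the agent should ask a clarifying question ("which brother?")."""
    matches = [c for c in contacts if query.lower() in c.lower()]
    return matches[0] if len(matches) == 1 else None

print(resolve_recipient("brother", ["Brother Joao", "Brother Pedro"]))  # None -> ask which one
print(resolve_recipient("joao", ["Brother Joao", "Brother Pedro"]))     # "Brother Joao"
```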

So we need to understand that as well. All of these things matter, and evaluations for all of these things matter. Without them, we cannot launch a product. So in the absence of evals, or in the absence of a tool like LangSmith, what happens is that we have a linear path of development.

We run A/B tests, because we make all decisions with A/B tests. I'm not sure if I covered this before, but we have 1,800 services and we do a deployment every two minutes, so we make every decision with A/B tests. That would be the linear path. But if we have a system that connects the traces well and gives observability, logging, and then alerts on top of it, and so on, then we have a full cycle from observability to filtering, to defining datasets, to running experiments, and onward.

And this is the flywheel we have in other situations, and we are building it with LangSmith for our generative AI applications. I think these two things have come up a few times in the last couple of talks: offline evaluation and online evaluation. So I will not go into very deep detail on them.
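
The flywheel depends on every production interaction being traced and attached to a quality signal. A minimal sketch of what that could look like with the LangSmith SDK; the decorator and feedback call come from the public SDK, but the agent logic and feedback key are placeholders, and exact signatures may differ by version.

```python
from langsmith import Client, traceable

client = Client()  # assumes the LangSmith API key and endpoint are configured in the environment

@traceable(name="money_transfer_agent")
def run_agent(message: str) -> str:
    # Real agent logic goes here; everything under this decorator is logged as a trace.
    return "Confirm: send 100.00 reais to Jose?"

reply = run_agent("make a transfer to Jose for 100 reais")

# Later, an online evaluator (CSAT, an LLM judge, a heuristic) attaches feedback to the run,
# so traces can be filtered into datasets and experiments; run_id comes from the traced run:
# client.create_feedback(run_id, key="transfer_accuracy", score=1.0)
```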

But as you can see, in the case of offline evaluation, imagine an experiment result: after the experiment, we take the results to the LLM apps, and we do both individual evaluation and pairwise evaluation. I will talk more about both of them later.

We primarily use human labelers in that process, and we are currently also using LLMs and other custom heuristics. Based on all of that, we run statistical tests, and the winning variant is what we launch. Things get more interesting, actually, not at this stage but at the online stage.

Because with offline evaluation, you can only run things in your sandboxes, in your own more controlled environments, whereas online you have a more continuous loop of improvement and development. If we only do offline evaluation, then our decision-making speed, especially for developers and analysts, is much slower.

But if we can do good online evaluation, with tracing and logging and alerting, et cetera, then our development speed goes up significantly. So we are doing both. Now, last but not least, I will talk about LLM as a judge, which is something we have touched on a few times.
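
On the offline side, the pairwise judgments plus a statistical test can be as simple as a sign test over labeler preferences. A sketch with illustrative counts, not real Nubank numbers.

```python
from math import comb

def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Two-sided sign test on pairwise preferences (ties dropped): how likely is a
    split at least this lopsided if neither variant were actually better?"""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

wins_a, wins_b = 132, 96  # illustrative pairwise wins for variant A vs variant B
p = sign_test_p_value(wins_a, wins_b)
print(f"A preferred {wins_a}-{wins_b}, p = {p:.3f}")  # ship A only if the difference holds up
```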

And the question basically goes back to why we need it. Imagine the situation I was describing with the money transfer. In that situation, you need to understand who we are sending to, what the request is, how much money, from where, all of that. And we are currently doing, say, a few hundred thousand to a few million such transactions every day.

With that amount of data, that amount of labeling, even if we do sampling, is not enough to maintain the quality of the product. That's why we need to do more labeling, and doing it only with humans is not scalable, because of the training involved, the differences in how people understand things, the mistakes people make, and so on and so forth.

So our bar was: let's build an LLM as a judge, and try to keep the quality of the judge at the same level as humans. We started with the first test, which was a simple prompt. We used the 4o mini model, because it's cheap. We didn't do any fine-tuning, and we just looked at how it works.
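
That first test is essentially a zero-shot judge. A minimal sketch of what it might have looked like, using the OpenAI SDK; the rubric, wording, and labels here are illustrative, not the actual Nubank judge prompt.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

JUDGE_PROMPT = (
    "You are reviewing a money-transfer interaction. Given the user's request and the "
    "agent's final action, reply with exactly one word: CORRECT if the action matches the "
    "request (right recipient, amount, source, and timing), otherwise INCORRECT."
)

def judge(request: str, action: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # the cheap starting point; later tests moved to 4o plus fine-tuning
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"Request: {request}\nAction: {action}"},
        ],
    )
    return response.choices[0].message.content.strip()
```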

And we found that humans were making roughly 80% accurate decisions and 20% mistakes, something like that. The F1 score doesn't show exactly that, but you can think of it a little bit like an accuracy metric. We were at 51%. In test 2, we moved to a fine-tuned model and increased the F1 score from 51 to 59.

In the next test, we changed the prompt and got to V2, a big jump of 11 percentage points, to 70. In test 4, we made another iteration on the model, from 4o mini to 4o, a bigger and better model. We changed the prompt again in test 5, and changed the fine-tuning again in test 6.

And this is where we landed: an F1 score of 79%, compared to 80% for humans, which is quite comparable. Now, you might ask why we moved from test 5 to test 6 when the F1 scores are 80 and 79. Because with the 79 version, we are better at identifying the inaccurate information we actually want to catch, and that's why we are here.
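
The comparison against human labelers boils down to computing the judge's F1 on the class you care about, flagged-as-inaccurate, with human labels as the reference. A minimal sketch; which class you treat as positive is exactly why a 79 can beat an 80 here.

```python
def f1(judge_flags: list[bool], human_flags: list[bool]) -> float:
    """F1 for flagging inaccurate interactions (True = flagged), with humans as ground truth."""
    tp = sum(j and h for j, h in zip(judge_flags, human_flags))
    fp = sum(j and not h for j, h in zip(judge_flags, human_flags))
    fn = sum(not j and h for j, h in zip(judge_flags, human_flags))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# Toy example: the judge catches all the inaccuracies the humans flagged, plus one false alarm.
print(f1([True, False, True, True], [True, False, False, True]))  # 0.8
```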

And, no surprise (actually, it might be surprising), this whole development took us maybe around two weeks, with a couple of developers, to go through these six iterations. And we could only do it because we had the online tracing system in place; otherwise, it would not have been possible.

So, wrapping it all up: there is no magic in building any agent or any LLM application. It's hard work. Evals are hard work. If you don't evaluate, you don't know what you're building, and if you don't know what you're building, then you cannot ship it to the world. So do more evals, spend more time, understand what your users are saying, and think not only about hallucination and red teaming, et cetera, but also about nuanced situations of empathy and tone of voice. Those things matter.

We are at a very exciting time, and thank you all for listening to me. And if you have any questions, please let me know. Thank you.