Building Reliable Agents: Evaluation Challenges

00:00:10.840 |
My name is Tan, and it's very hard to follow a full-blown 00:00:16.800 |
figuring out how to do the best prompt in the world. 00:00:23.640 |
As I said, I have been at Nubank for the last 40 years. 00:00:26.680 |
We are building the AI private banker for all of our customers. 00:00:38.180 |
is a difficult decision for many people itself. 00:00:40.580 |
So imagine how you think about loan investment, 00:00:47.400 |
where we are the third largest bank in Brazil, the largest-- 00:00:51.860 |
the fastest growing bank in Mexico and Colombia. 00:00:54.360 |
We have given first credit card access to 21 million people 00:01:00.120 |
And just to be sure, these numbers are actually 00:01:02.080 |
outdated because we just released our numbers yesterday 00:01:06.920 |
And all of these numbers have gone up since then. 00:01:09.020 |
But the interesting part is that from the very beginning, 00:01:17.540 |
And we have been working very closely with LangChain, LangSmith, 00:01:21.580 |
And I'm excited to talk about a few things we have built 00:01:25.580 |
And also how you can evaluate them effectively. 00:01:40.340 |
And we get almost 8 and 1/2 million contacts every month. 00:01:55.140 |
And we are building agents for different situations. 00:01:59.260 |
And the next one is actually more interesting. 00:02:02.940 |
Because in the world of finance, as you heard from Harvey, 00:02:05.980 |
that the complexity of legal matters is important. 00:02:09.980 |
For finance, every single dollar, every single penny matters. 00:02:13.440 |
And that is important because it creates trust. 00:02:16.000 |
It makes sure the users are happy with our service, 00:02:19.920 |
So just for a second, let's look at this application. 00:02:27.200 |
that can do money transfer over voice, image, and chat 00:02:35.960 |
Here, you can see that the user is connecting their account 00:02:40.220 |
We are asking them multiple times about their password, 00:02:46.760 |
can give a very simple instruction in a bit that, hey, 00:02:56.960 |
And then once it confirms, it makes the transfer. 00:03:02.860 |
to make this transfer to nine different screens. 00:03:07.220 |
And you can see the CSAT is more than 90%, less than 0.5% 00:03:16.660 |
And I will talk about that where evaluations matter 00:03:21.640 |
So just to be on the same page, the experience 00:03:25.380 |
of building the chatbot and building an application 00:03:36.240 |
Because if you are building it for one application of this kind, 00:03:39.520 |
imagine in a finance world, you are doing hundreds of these operations 00:03:42.720 |
to make these changes, or to make money movement, 00:03:46.800 |
Then you are building hundreds of separate systems and agents, which 00:03:52.440 |
So taking a step back, what does Nubank's LLM ecosystem look like? 00:03:57.140 |
Nubank's LLM ecosystem has four different layers. 00:04:02.080 |
The first one is core engine, testing and evals, tools, 00:04:05.940 |
I don't have time, unfortunately, to go over each one of them. 00:04:08.720 |
But you can see that in three of them right now, 00:04:11.060 |
we are working very closely with LangChain and LangSmith. 00:04:13.880 |
And testing and evals is something we'll talk about. 00:04:15.760 |
LLM as a judge, and also online quality evaluation. 00:04:20.900 |
we are using LangGraph and from the very beginning from LangChain, 00:04:32.480 |
OK, so the first thing that happens is that without LangGraph, 00:04:45.320 |
a canonical approach we can take to build agentic systems 00:04:51.500 |
So the learning there is that complex LLM flows 00:04:56.240 |
Centralized LLM logs and repository and graphical interface 00:05:01.420 |
Because we don't want only our developers to make decisions. 00:05:04.360 |
We want our business users to also contribute to it. 00:05:08.640 |
And the way to do it is by democratizing data 00:05:11.260 |
and giving them access to how a business analyst, 00:05:15.000 |
our product managers, our product operations, 00:05:19.720 |
they can make faster decisions in terms of prompt, 00:05:22.160 |
in terms of adding inputs, in terms of adding parameters 00:05:30.360 |
graphs can decrease the cognitive effort to represent flows. 00:05:33.780 |
And this is something Shreya was mentioning, 00:05:36.500 |
that the human instruction is difficult for a machine 00:05:40.620 |
So graph basically makes that process easier for us. 00:05:49.580 |
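To make the point about graphs concrete, here is a minimal plain-Python sketch of a transfer flow written as nodes and edges. It does not use the LangGraph API; the node names, the state keys, and the `run_flow` helper are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List

State = Dict[str, object]

# Each node mutates the shared state and returns the name of the next node.
# All node names and state keys here are hypothetical.
def parse_request(state: State) -> str:
    state["amount"] = 100
    state["recipient"] = "brother"
    return "confirm"

def confirm(state: State) -> str:
    # Branch on whether the user approved the parsed transfer.
    return "execute" if state.get("user_confirmed") else "cancel"

def execute(state: State) -> str:
    state["status"] = "transferred"
    return "end"

def cancel(state: State) -> str:
    state["status"] = "cancelled"
    return "end"

NODES: Dict[str, Callable[[State], str]] = {
    "parse_request": parse_request,
    "confirm": confirm,
    "execute": execute,
    "cancel": cancel,
}

def run_flow(state: State, start: str = "parse_request", max_steps: int = 10) -> State:
    """Walk the graph from `start` until the 'end' marker or the step limit."""
    node = start
    path: List[str] = []
    for _ in range(max_steps):
        if node == "end":
            break
        path.append(node)
        node = NODES[node](state)
    state["path"] = path
    return state
```

Reading the `NODES` table gives the whole flow at a glance, which is the kind of reduction in cognitive effort described above.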
And on the evaluation side, I'll first talk about a few different challenges 00:05:55.100 |
The first thing is that, as we heard from Harvey that they're not only in Antarctica. 00:06:00.340 |
We are only in three countries, so it's a much smaller problem set. 00:06:03.340 |
But still, you can imagine that when you're dealing with Portuguese, Spanish, the languages, 00:06:07.800 |
the dialects, the kind of way people talk, et cetera, that changes across the country. 00:06:13.680 |
And 58% of the Brazilian population is our customer now, so we have to understand what users are talking about, et cetera, very extensively. 00:06:19.820 |
The second thing is that Nubank's brand presence is huge. 00:06:23.820 |
We are more popular than some of the well-known brands like McDonald's or Nike, even, in Brazil. 00:06:29.960 |
So we cannot do anything, especially when it comes to jailbreak or guardrails. 00:06:32.960 |
It's very important for us to keep a very high bar. 00:06:35.960 |
And last but not the least, we have to be accurate in our messaging. 00:06:39.260 |
Because at the end of the day, we are dealing with people's money. 00:06:42.000 |
And money is something that people care about, about accuracy. 00:06:45.960 |
And losing trust over money transfer is very easy. 00:06:49.820 |
So taking a step back again, actually moving a little forward, on the customer service side 00:06:57.620 |
and the money transfer use case, we have very different needs from a business side and from 00:07:02.060 |
a technical side as to what kind of evaluations we need. 00:07:06.000 |
So in the case of customer service, in addition to accuracy, what matters a lot, that how are 00:07:13.340 |
If a customer is calling us, hey, where is my card? 00:07:15.820 |
Or hey, I see this charge that I don't recognize. 00:07:18.880 |
If we give a very robotic experience, we lose the customer's trust and empathy and it matters. 00:07:25.060 |
It's very easy for a human to have this connection. 00:07:27.740 |
It's very hard for a machine to have this connection, and we all know that. 00:07:32.840 |
I think all of you have seen what happened with the ChatGPT 4.1 model last week and they recalled 00:07:40.240 |
So in addition, in order to do these two jobs well, we need to think about do we understand 00:07:47.200 |
Do we understand how we are retrieving content and context from different sources 00:07:58.140 |
And in 3,000 pages we have hundreds of deep links and basically landing a user to the very 00:08:04.500 |
app of the node and then asking them to traverse through different clicks and go to the page 00:08:08.360 |
where they can self-service is very tedious and not very effective. 00:08:12.460 |
And last but not the least, we need to make sure that we are not hallucinating. 00:08:17.260 |
Whereas for money transfer, tone and sentiment are okay. 00:08:23.400 |
But the accuracy is not only about the money transferred but also who we are transferring to, which 00:08:29.200 |
source we are using for transfer, does the person have enough money in their account, are 00:08:34.680 |
they fraud suspects, do they have a pending collection? 00:08:37.860 |
All of these things are very intricately connected because our customers are using not only one 00:08:41.900 |
product but a whole suite of products, from lending to financing to investment to banking account 00:08:49.840 |
And also, oftentimes, they have dependent accounts, they have multiple cards, so we have to look 00:08:56.200 |
So what is important there is: can we do the named entity recognition properly? 00:09:01.440 |
Because you can say, hey, send $100 to my brother. 00:09:05.920 |
Now say you only have one brother and you have saved that brother and you have sent money before, 00:09:11.140 |
But imagine a situation where you have not done that and you have multiple brothers and it's 00:09:14.660 |
like my favorite brother, my less favorite brother, then you have to identify which 00:09:19.680 |
brother I'm talking about to send the money, right? 00:09:21.820 |
Because definitely I don't want to send money to my less favorite brother. 00:09:26.100 |
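The "send $100 to my brother" case can be sketched as a small resolution check. This is a minimal illustration, assuming a hypothetical `Contact` record and a hypothetical `resolve_recipient` helper; it is not Nubank's implementation. The idea is that the agent proceeds only when exactly one saved contact matches, and otherwise asks a clarifying question.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Contact:
    # Hypothetical fields for a saved contact.
    name: str
    relation: str

def resolve_recipient(relation: str, contacts: List[Contact]) -> Optional[Contact]:
    """Return the matching contact only when the match is unambiguous.

    Zero matches or several matches both return None, which signals
    that the agent should ask the user which person they mean.
    """
    wanted = relation.strip().lower()
    matches = [c for c in contacts if c.relation.lower() == wanted]
    if len(matches) == 1:
        return matches[0]
    return None
```

An evaluation set for this step would include both the single-brother case and the multiple-brother case, and would count a silent guess in the second case as a failure.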
The next one is about making sure we interpret the user input correctly. 00:09:30.320 |
Because if the user is saying that, hey, I want to send $100 but do it tomorrow, that's 00:09:35.960 |
a different instruction than doing it right now. 00:09:38.720 |
Because if I do it right now, maybe you will end up in overdraft. 00:09:41.820 |
Last but not least, we also have to identify what the correct action is. 00:09:46.180 |
Because the user might be saying that I don't want to send it, the last tape you saw, I want 00:09:55.180 |
And evaluations for all of these things matter. 00:09:58.060 |
And without them, we cannot launch a product. 00:10:02.100 |
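One way to evaluate the pieces listed above (who, how much, when, which action) is to score each field separately against labeled examples. The sketch below assumes a hypothetical schema with the fields `recipient`, `amount`, `when`, and `action`; the field names and the `field_accuracy` helper are illustrative only.

```python
from typing import Dict, List

# Hypothetical fields extracted from a transfer request.
FIELDS = ("recipient", "amount", "when", "action")

def field_accuracy(
    predictions: List[Dict[str, object]],
    labels: List[Dict[str, object]],
) -> Dict[str, float]:
    """Fraction of examples where each field matches the label exactly."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return {field: 0.0 for field in FIELDS}
    correct = {field: 0 for field in FIELDS}
    for pred, gold in zip(predictions, labels):
        for field in FIELDS:
            if pred.get(field) == gold.get(field):
                correct[field] += 1
    return {field: correct[field] / len(labels) for field in FIELDS}
```

Reporting per-field numbers instead of a single score makes it visible when, for example, amounts are always right but "tomorrow" is being read as "now".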
So in the absence of evals or in the absence of a tool like LangSmith, what happens is that we have 00:10:10.520 |
We are running A/B tests because we make all decisions with A/B tests. 00:10:13.860 |
I'm not sure if I covered this before, but we have 1,800 services and we do deployment every 00:10:25.540 |
But if we have a system that can very well connect the traces and give observability, give logging 00:10:34.500 |
and then alerts on top of it, so on and so forth, then we have a full cycle of observability 00:10:41.080 |
to filtering, to define data sets, to run experiments and go on. 00:10:45.160 |
And this is the flywheel we have in other situations. 00:10:49.860 |
And we are building with LangSmith for our generative AI applications. 00:10:56.920 |
I think these two things I have heard a few times in this last couple of talks about offline 00:11:05.680 |
But as you can see, that in the case of offline evaluation, this is an -- after -- imagine an 00:11:12.120 |
After the experiment result, we take them to the LLM apps and we have, like, individual 00:11:20.220 |
For both of them, I will mostly -- I will talk about it later. 00:11:24.160 |
But we primarily use human labelers in that process. 00:11:30.360 |
And then we are currently also using LLM and other custom heuristics. 00:11:34.840 |
Based on all of that, we run statistical tests and the winner variant is something that we launch. 00:11:40.640 |
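The talk does not say which statistical test is used to pick the winning variant. As one common choice, here is a minimal sketch of a two-sided two-proportion z-test comparing the success rate of two variants, using only the standard library. The function name and the example counts are assumptions for illustration.

```python
import math
from typing import Tuple

def two_proportion_z_test(
    success_a: int, n_a: int, success_b: int, n_b: int
) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in success rates B - A."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("sample sizes must be positive")
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled rate under the null hypothesis that both variants are equal.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; rates are all 0 or all 1")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For example, `two_proportion_z_test(400, 1000, 460, 1000)` compares a 40% rate against a 46% rate on 1,000 labeled samples each; a small p-value suggests the difference is unlikely to be noise.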
Things get more interesting, actually not at this stage, but at the online stage. 00:11:46.000 |
Because in online evaluation, you can run things in your sandboxes, in your own, like, more 00:11:54.400 |
And in that situation, you have a more continuous loop of improvements and development. 00:12:01.180 |
If we only do the online evaluation -- why not going back? 00:12:07.420 |
If we only do offline evaluation, then our decision-making speed, especially for developers and analysts, 00:12:12.780 |
But if we can do good online evaluation and tracing and logging and alerting, et cetera, 00:12:19.020 |
then our development speed goes up significantly. 00:12:24.140 |
Now, last but not the least, I will talk about LLM as a judge, which is something we have talked 00:12:30.140 |
And the question basically goes back to why we need it. 00:12:34.140 |
Imagine the situation I was describing about the money transfer. 00:12:36.500 |
In that situation, you need to understand who we are sending to, what the request is, how much money, 00:12:43.460 |
And doing all of that, sending -- like, we are currently doing, say, a few hundred thousand 00:12:48.560 |
or a few million such transactions every day. 00:12:52.280 |
That amount of data and that amount of labeling, even if we do, like, sampling, it's not enough 00:12:59.940 |
And that's why we need to do more labeling, and doing it only with humans is not scalable, because 00:13:05.540 |
training how people understand, the mistakes people make, so on and so forth. 00:13:09.800 |
So our bar was that, let's build LLM as a judge, and try to keep the quality of the judge at 00:13:20.420 |
And so we started with the first test, it was a simple prompt. 00:13:23.780 |
We used the 4o mini model, because it's cheap. 00:13:26.620 |
We didn't do any fine-tuning, just to see how it works. 00:13:29.900 |
And we got that, like, humans were making 80% accurate decisions and 20% mistakes, and something 00:13:38.200 |
The F1 score doesn't show exactly that, but you can think of it that way, a little bit as an accuracy 00:13:44.440 |
We are at 51%, and in the test 2, we moved to a fine-tuned model, and we increased the accurate 00:13:56.240 |
Next test, we changed the prompt, and we got to V2. 00:14:06.760 |
From 4o mini to 4o, it's a better and bigger model. 00:14:16.260 |
And this is where we landed, where we are at F1 score of 79%, compared to 80% of human, 00:14:22.700 |
Now, you might ask at this point, why did you move from test 5 to test 6? 00:14:28.640 |
The F1 score is 80, and 79, because in 79, we are identifying the inaccurate information 00:14:34.700 |
we are catching there, that's why we are here. 00:14:41.260 |
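For reference, the F1 score used to compare the judge against human labels can be computed as below. This is a minimal sketch: treating "inaccurate" as the positive class and using the label strings `"accurate"` / `"inaccurate"` are assumptions made for illustration, as is the `f1_score` helper name.

```python
from typing import Sequence

def f1_score(
    reference: Sequence[str],
    predicted: Sequence[str],
    positive: str = "inaccurate",
) -> float:
    """F1 of `predicted` against `reference` for the given positive class."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == positive and p == positive)
    fp = sum(1 for r, p in pairs if r != positive and p == positive)
    fn = sum(1 for r, p in pairs if r == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)
```

Because F1 ignores true negatives, two judges with nearly the same F1 can still differ in how many inaccurate cases they catch, which is why recall on the inaccurate class is worth tracking alongside it.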
And just to no surprise, actually, it might be surprising, this whole development took us 00:14:46.000 |
maybe around two weeks with a couple of developers to go through these six iterations, and we could 00:14:50.980 |
only do it because we had the online tracing and system in place, otherwise, it would not 00:14:57.080 |
So wrapping it all up, there is no magic in building any agent, any LLM. 00:15:08.900 |
And if you don't evaluate, you don't know what you're building. 00:15:12.500 |
And if you don't know what you're building, then you cannot ship it to the world. 00:15:16.540 |
So do more evals, spend more time, understand what your users are saying, think about not 00:15:21.700 |
only hallucination and red teaming, et cetera, but also, nuanced situations of empathy, tone 00:15:31.260 |
We are at a very exciting time, and thank you all for listening to me. 00:15:35.160 |
And if you have any questions, please let me know.