Building Reliable Agents: Evaluation Challenges

00:00:00.800 | TAN BANG: Hello, everyone.

00:00:10.840 | My name is Tan, and it's very hard to follow a full-blown

00:00:15.200 | researcher from UC Berkeley, like,

00:00:16.800 | figuring out how to do the best prompt in the world.

00:00:20.360 | But I'm excited to talk about NewBank here.

00:00:22.560 | My name is Tan.

00:00:23.640 | As I said, I have been at NewBank for the last 40 years.

00:00:26.680 | We are building the AI private banker for all of our customers.

00:00:30.340 | And the idea is that people are notoriously

00:00:34.180 | bad at making financial decisions.

00:00:36.300 | Like, which subscription to cancel

00:00:38.180 | is a difficult decision for many people itself.

00:00:40.580 | So imagine how you think about loan investment,

00:00:44.960 | especially the situation like NewBank,

00:00:47.400 | where we are the third largest bank in Brazil, the largest--

00:00:51.860 | the fastest growing bank in Mexico and Colombia.

00:00:54.360 | We have given first credit card access to 21 million people

00:00:57.660 | in Brazil in the last five years alone.

00:01:00.120 | And just to be sure, these numbers are actually

00:01:02.080 | outdated because we just released our numbers yesterday

00:01:04.860 | in our earnings call.

00:01:06.920 | And these all numbers have gone up since then.

00:01:09.020 | But the interesting part is that from the very beginning,

00:01:12.400 | we ran into the world of ChatGPT.

00:01:16.060 | We got into it.

00:01:17.540 | And we have been working very closely with Langchain, Langsmith,

00:01:20.400 | and all the teams here.

00:01:21.580 | And I'm excited to talk about a few things we have built

00:01:24.400 | and how we have built it.

00:01:25.580 | And also how you can evaluate them effectively.

00:01:31.240 | So just to be clear, I'm going to talk

00:01:33.420 | about two different applications here.

00:01:34.880 | First one is the chatbot.

00:01:36.060 | No surprise.

00:01:37.540 | We have 120 million almost users, right?

00:01:40.340 | And we get almost 8 and 1/2 million contacts every month.

00:01:44.800 | And out of that, Chat is our main channel.

00:01:48.640 | And in that channel right now, 60% of them

00:01:50.980 | are first dealt with LLMs.

00:01:53.280 | The results are improving.

00:01:55.140 | And we are building agents for different situations.

00:01:57.580 | And I'll talk about them.

00:01:59.260 | And the next one is actually more interesting.

00:02:01.140 | And I'll talk about these two applications.

00:02:02.940 | Because in the world of finance, as you heard from Hervey,

00:02:05.980 | that the complexity of the legal matters are important.

00:02:09.980 | For finance, every single dollar, every single penny matters.

00:02:13.440 | And that is important because it creates trust.

00:02:16.000 | It makes sure the users are happy with our service,

00:02:18.860 | one and so forth.

00:02:19.920 | So just for a second, let's look at this application.

00:02:22.740 | So in this application, you can see

00:02:23.900 | that we have built an agentic system

00:02:27.200 | that can do money transfer over voice, image, and chat

00:02:32.360 | at very low inaccuracy.

00:02:34.360 | That will show you the numbers really soon.

00:02:35.960 | Here, you can see that the user is connecting their account

00:02:38.460 | with WhatsApp.

00:02:40.220 | We are asking them multiple times about their password,

00:02:42.640 | et cetera, to make sure the right contact.

00:02:45.020 | And once they do all of them, they

00:02:46.760 | can give a very simple instruction in a bit that, hey,

00:02:50.480 | make a transfer to Jose for 100 reais.

00:02:54.060 | We confirm this is the one that users want.

00:02:56.960 | And then once it confirms, it makes the transfer.

00:03:01.140 | Early it is to take around 70 seconds

00:03:02.860 | to make this transfer to nine different screens.

00:03:05.140 | It's taking less than 30 seconds now.

00:03:07.220 | And you can see the CSAT is more than 90%, less than 0.5%

00:03:12.640 | inaccuracy, so on and so forth.

00:03:14.740 | And we are doing that at scale.

00:03:16.660 | And I will talk about that where evaluations matter

00:03:19.820 | and what kind of things matter.

00:03:21.640 | So just to be on the same page, the experience

00:03:25.380 | of building the chatbot and building an application

00:03:27.500 | like this is very different.

00:03:29.020 | Because you need to iterate.

00:03:30.840 | But at the same time, you need to think

00:03:32.760 | about how to not build one-off solution.

00:03:36.240 | Because if you are building it for one application of this kind,

00:03:39.520 | imagine in a finance world, you are doing hundreds of these operations

00:03:42.720 | to make these changes, or to make money movement,

00:03:45.000 | or make micro decisions.

00:03:46.800 | Then you are building hundreds of separate systems and agents, which

00:03:50.340 | is just not scalable.

00:03:52.440 | So taking a step back, what does Newbank LLM ecosystem look like?

00:03:57.140 | Newbank LLM ecosystem has four different layers.

00:04:02.080 | The first one is core engine, testing and evals, tools,

00:04:04.540 | and developer experience.

00:04:05.940 | I don't have time, unfortunately, to go over each one of them.

00:04:08.720 | But you can see that in three of them right now,

00:04:11.060 | we are working very closely with LangChain and LangSmith.

00:04:13.880 | And testing and evals is something we'll talk about.

00:04:15.760 | LLM as a judge, and also online quality evaluation.

00:04:19.000 | And also in the developer experience side,

00:04:20.900 | we are using LangGraph and from the very beginning from LangChain,

00:04:25.000 | now to LangGraph.

00:04:26.020 | We are using all of them.

00:04:28.040 | Now, how do we use it and why it matters?

00:04:31.940 | Let's see.

00:04:32.480 | OK, so the first thing that happens is that without LangGraph,

00:04:37.640 | we cannot do more faster iterations

00:04:40.400 | and cannot make it very standard that what's

00:04:45.320 | a canonical approach we can take to build agentic systems

00:04:49.080 | or any kind of lag systems even.

00:04:51.500 | So the learnings there is that complex LLM flows

00:04:54.240 | can be hard to analyze.

00:04:56.240 | Centralized LLM logs and repository and graphical interface

00:04:59.360 | helps people to make faster decisions.

00:05:01.420 | Because we don't want only our developers to make decisions.

00:05:04.360 | We want our business users to also contribute to it.

00:05:08.640 | And the way to do it is by democratizing data

00:05:11.260 | and giving them access to how a business analyst,

00:05:15.000 | our product managers, our product operations,

00:05:17.220 | our total whole operations, that how

00:05:19.720 | they can make faster decisions in terms of prompt,

00:05:22.160 | in terms of adding inputs, in terms of adding parameters

00:05:27.060 | of different kinds, right?

00:05:28.920 | And last but not the least, graphs

00:05:30.360 | graphs can decrease the cognitive effort to represent flows.

00:05:33.780 | And this is something that what Shreya was mentioning about,

00:05:36.500 | that the human instruction is difficult for a machine

00:05:39.620 | to understand.

00:05:40.620 | So graph basically makes that process easier for us.

00:05:44.960 | Now, to be true to my presentation,

00:05:47.840 | I will go to the evaluation part.

00:05:49.580 | And on the evaluation side, I first talk about a few different challenges

00:05:53.840 | overall we have.

00:05:55.100 | The first thing is that, as we heard from Harvey that they're not only in Antarctica.

00:06:00.340 | We are only in three countries, so it's a much smaller problem set.

00:06:03.340 | But still, you can imagine that when you're dealing with Portuguese, Spanish, the languages,

00:06:07.800 | the dialects, the kind of way people talk, et cetera, that changes across the country.

00:06:13.680 | And we have 58% of Brazilian population is our customer now, so we have to understand what users are talking about, et cetera, very extensively.

00:06:19.820 | The second thing is that Newbank's brand presence is huge.

00:06:23.820 | We are more popular than some of the well-known brands like McDonald's or Nike, even, in Brazil.

00:06:29.960 | So we cannot do anything, especially when it comes to jailbreak or guardrails.

00:06:32.960 | It's very important for us to keep a very high bar.

00:06:35.960 | And last but not the least, we have to be accurate in our messaging.

00:06:39.260 | Because at the end of the day, we are dealing with people's money.

00:06:42.000 | And money is something that people care about, about accuracy.

00:06:45.960 | And losing trust over money transfer is very easy.

00:06:49.820 | So taking a step back again, actually moving a little forward, on the customer service side

00:06:57.620 | and the money transfer use case, we have very different needs from a business side and from

00:07:02.060 | a technical side that what kind of evaluations we need.

00:07:06.000 | So in the case of customer service, in addition to accuracy, what matters a lot, that how are

00:07:11.640 | we approaching a customer?

00:07:13.340 | If a customer is calling us, hey, where is my card?

00:07:15.820 | Or hey, I see this chart that I don't recognize.

00:07:18.880 | If we give a very robotic experience, we lose the customer's trust and empathy and it matters.

00:07:25.060 | It's very easy for human to have this connection.

00:07:27.740 | It's very hard for machine to have this connection and we all know that.

00:07:31.060 | Also very high flattery doesn't work.

00:07:32.840 | I think all of you have seen that what happened with ChatGPT 4.1 model last week and they recalled

00:07:38.120 | and they relaunched.

00:07:40.240 | So in addition, in order to do these two jobs well, we need to think about do we understand

00:07:45.180 | customers' intent well?

00:07:47.200 | Do we understand that how are we retrieving content and context from different sources

00:07:51.680 | that we have internally?

00:07:53.660 | What is the deep link accuracy that we have?

00:07:55.540 | Because imagine our app is 3,000 pages.

00:07:58.140 | And in 3,000 pages we have hundreds of deep links and basically landing a user to the very

00:08:04.500 | app of the node and then asking them to traverse through different clicks and go to the page

00:08:08.360 | where they can self-service is very tedious and not very effective.

00:08:12.460 | And last but not the least, we need to make sure that we are not hallucinating.

00:08:17.260 | While as a money transfer, tone and state sentiment is okay.

00:08:21.960 | But we need to be accurate.

00:08:23.400 | But the accuracy is not only about the transfer money but also who we are transferring, which

00:08:29.200 | source we are using for transfer, does the person have enough money in their account, are

00:08:34.680 | they fraud suspects, do they have a pending collection?

00:08:37.860 | All of these things are very intricately connected because our customers are using not only one

00:08:41.900 | product but a whole suite of product from lending to financing to investment to banking account

00:08:48.240 | to credit card, et cetera.

00:08:49.840 | And also, oftentimes, they have dependent account, they have multiple cards, so we have to look

00:08:54.360 | at all of them together.

00:08:56.200 | So what is important there is that can we identify the name identity recognition properly?

00:09:01.440 | Because you can say, hey, send $100 to my brother.

00:09:05.920 | Now say you only have one brother and you have saved that brother and you have sent money before,

00:09:10.140 | it's easy.

00:09:11.140 | But imagine a situation where you have not done that and you have multiple brothers and it's

00:09:14.660 | like my favorite brother, my less favorite brother, then you have to identify that which

00:09:19.680 | brother I'm talking about to send the money, right?

00:09:21.820 | Because definitely I don't want to send money to my less favorite brother.

00:09:26.100 | The next one is about making sure the correct interpretation of the user input.

00:09:30.320 | Because if the user is saying that, hey, I want to send $100 but do it tomorrow, that's

00:09:35.960 | a different instruction than doing it right now.

00:09:38.720 | Because if I do it right now, maybe you will land up in overdraft.

00:09:41.820 | The last but not the least, also identify that what is the correct action.

00:09:46.180 | Because the user might be saying that I don't want to send it, the last tape you saw, I want

00:09:51.220 | to cancel it.

00:09:52.320 | So we need to understand that as well.

00:09:53.660 | So all of these things matter.

00:09:55.180 | And evaluations for all of these things matter.

00:09:58.060 | And without them, we cannot launch a product.

00:10:02.100 | So in absence of eval or in absence of a tool like Langsmith, what happens is that we have

00:10:08.320 | a linear path of development.

00:10:10.520 | We are running A/B tests because we make all decisions with A/B tests.

00:10:13.860 | I'm not sure if I could cover before, but we have 1,800 services and we do deployment every

00:10:18.860 | two minutes.

00:10:20.140 | So we do every decision by A/B tests.

00:10:23.160 | And that will be the linear path.

00:10:25.540 | But if we have a system that can very well connect the traces and give observability, give logging

00:10:34.500 | and then alerts on top of it, so on and so forth, then we have a full cycle of observability

00:10:41.080 | to filtering, to define data sets, to run experiments and go on.

00:10:45.160 | And this is the flywheel we have in other situations.

00:10:49.860 | And we are building with Langsmith for our generative AI applications.

00:10:56.920 | I think these two things I have heard a few times in this last couple of talks about offline

00:11:00.900 | evaluation and online evaluation.

00:11:02.840 | So I will not go in very deep details of it.

00:11:05.680 | But as you can see, that in the case of offline evaluation, this is an -- after -- imagine an

00:11:11.120 | experiment result.

00:11:12.120 | After the experiment result, we take them to the LLM apps and we have, like, individual

00:11:17.060 | evaluation and we have pairwise evaluation.

00:11:20.220 | For both of them, I will mostly -- I will talk about it later.

00:11:24.160 | But we primarily use human labelers in that process.

00:11:30.360 | And then we are currently also using LLM and other customer heuristics.

00:11:34.840 | Based on all of that, we run statistical tests and the winner variant is something that we launch.

00:11:40.640 | Things get more interesting, actually not at this stage, but at the online stage.

00:11:46.000 | Because in online evaluation, you can run things in your sandboxes, in your own, like, more

00:11:52.800 | controlled environments.

00:11:54.400 | And in that situation, you have a more continuous loop of improvements and development.

00:12:01.180 | If we only do the online evaluation -- why not going back?

00:12:06.420 | Okay.

00:12:07.420 | If we only do offline evaluation, then our decision-making speed, especially for developers and analysts,

00:12:11.780 | is much slower.

00:12:12.780 | But if we can do good online evaluation and tracing and logging and alerting, et cetera,

00:12:19.020 | then our development speed goes up significantly.

00:12:22.360 | So we are doing both of them.

00:12:24.140 | Now, last but not the least, I will talk about LLM as a judge, which is something we have talked

00:12:29.420 | about a few times.

00:12:30.140 | And the question basically goes back to why we need it.

00:12:34.140 | Imagine the situation I was describing about the money transfer.

00:12:36.500 | In that situation, you need to understand who we are sending, what the request, how much money,

00:12:41.500 | from where, all of that.

00:12:43.460 | And doing all of that, sending -- like, we are currently doing, say, a few hundred thousand

00:12:48.560 | or a few million such transactions every day.

00:12:52.280 | That amount of data and that amount of labeling, even if we do, like, sampling, it's not enough

00:12:57.380 | to maintain the quality of the product.

00:12:59.940 | And that's why we need to do more labeling, and doing it only by human is not scalable, because

00:13:05.540 | training how people understand, the mistakes people make, so on and so forth.

00:13:09.800 | So our bar was that, let's build LLM as a judge, and try to keep the quality of the judge at

00:13:18.240 | the same level of human.

00:13:20.420 | And so we started with the first test, it was a simple prompt.

00:13:23.780 | We used the photo mini model, because it's cheap.

00:13:26.620 | We didn't do any fine-tuning and see that, okay, how it works.

00:13:29.900 | And we got that, like, human were making 80% accurate decisions and 20% mistakes, and something

00:13:37.200 | like that.

00:13:38.200 | F1 score exactly doesn't show that, but you can imagine that way, a little bit as an accuracy

00:13:43.440 | metric.

00:13:44.440 | We are at 51%, and in the test 2, we moved to a fine-tuned model, and we increased the accurate

00:13:53.440 | F1 score from 51 to 59.

00:13:56.240 | Next test, we changed the prompt, and we got to V2.

00:13:59.460 | We got to a big jump of 11% point of 70.

00:14:03.620 | We made another iteration at 4.0.

00:14:06.760 | From 4.0 mini to 4.0, it's a better and bigger model.

00:14:10.440 | Changed the prompt again in test 5.

00:14:12.460 | Changed the fine-tuning again in test 6.

00:14:16.260 | And this is where we landed, where we are at F1 score of 79%, compared to 80% of human,

00:14:21.700 | which is quite comparable.

00:14:22.700 | Now, you might ask that time, why did you move from test 5 to test 6?

00:14:28.640 | The F1 score is 80, and 79, because in 79, we are identifying the inaccurate information

00:14:34.700 | we are catching there, that's why we are here.

00:14:41.260 | And just to no surprise, actually, it might be surprising, this whole development took us

00:14:46.000 | maybe around two weeks with a couple of developers to go through these six iterations, and we could

00:14:50.980 | only do it because we had the online tracing and system in place, otherwise, it would not

00:14:55.640 | be possible.

00:14:57.080 | So wrapping it up all, there is no magic in building any agent, any LLM.

00:15:05.760 | It's hard work.

00:15:06.760 | Evals are hard work.

00:15:08.900 | And if you don't evaluate, you don't know what you're building.

00:15:12.500 | And if you don't know what you're building, then you cannot ship it to the world.

00:15:16.540 | So do more evals, spend more time, understand what your users are saying, think about not

00:15:21.700 | only hallucination and red teaming, et cetera, but also, nuanced situations of empathy, tone

00:15:28.880 | of voice, those things matter.

00:15:31.260 | We are at a very exciting time, and thank you all for listening to me.

00:15:35.160 | And if you have any questions, please let me know.

00:15:36.380 | Thank you.

00:15:37.000 | Thank you.

00:15:37.380 | Thank you.

00:15:37.880 | Thank you.

00:15:38.880 | Thank you.

00:15:39.880 | Thank you.