Building Reliable Agents: Evaluation Challenges



00:00:00.800 | TAN BANG: Hello, everyone.
00:00:10.840 | My name is Tan, and it's very hard to follow a full-blown
00:00:15.200 | researcher from UC Berkeley, like,
00:00:16.800 | figuring out how to do the best prompt in the world.
00:00:20.360 | But I'm excited to talk about Nubank here.
00:00:22.560 | My name is Tan.
00:00:23.640 | As I said, I have been at Nubank for the last four years.
00:00:26.680 | We are building the AI private banker for all of our customers.
00:00:30.340 | And the idea is that people are notoriously
00:00:34.180 | bad at making financial decisions.
00:00:36.300 | Like, which subscription to cancel
00:00:38.180 | is a difficult decision for many people itself.
00:00:40.580 | So imagine how you think about loans and investments,
00:00:44.960 | especially in a situation like Nubank's,
00:00:47.400 | where we are the third-largest bank in Brazil and
00:00:51.860 | the fastest-growing bank in Mexico and Colombia.
00:00:54.360 | We have given first credit card access to 21 million people
00:00:57.660 | in Brazil in the last five years alone.
00:01:00.120 | And just to be sure, these numbers are actually
00:01:02.080 | outdated because we just released our numbers yesterday
00:01:04.860 | in our earnings call.
00:01:06.920 | And all these numbers have gone up since then.
00:01:09.020 | But the interesting part is that from the very beginning,
00:01:12.400 | we ran into the world of ChatGPT.
00:01:16.060 | We got into it.
00:01:17.540 | And we have been working very closely with LangChain, LangSmith,
00:01:20.400 | and all the teams here.
00:01:21.580 | And I'm excited to talk about a few things we have built
00:01:24.400 | and how we have built it.
00:01:25.580 | And also how you can evaluate them effectively.
00:01:31.240 | So just to be clear, I'm going to talk
00:01:33.420 | about two different applications here.
00:01:34.880 | First one is the chatbot.
00:01:36.060 | No surprise.
00:01:37.540 | We have almost 120 million users, right?
00:01:40.340 | And we get almost 8.5 million contacts every month.
00:01:44.800 | And out of that, chat is our main channel.
00:01:48.640 | And in that channel right now, 60% of them
00:01:50.980 | are first handled by LLMs.
00:01:53.280 | The results are improving.
00:01:55.140 | And we are building agents for different situations.
00:01:57.580 | And I'll talk about them.
00:01:59.260 | And the next one is actually more interesting.
00:02:01.140 | And I'll talk about these two applications.
00:02:02.940 | Because, as you heard from Harvey,
00:02:05.980 | the complexity of legal matters is important.
00:02:09.980 | For finance, every single dollar, every single penny matters.
00:02:13.440 | And that is important because it creates trust.
00:02:16.000 | It makes sure the users are happy with our service,
00:02:18.860 | so on and so forth.
00:02:19.920 | So just for a second, let's look at this application.
00:02:22.740 | So in this application, you can see
00:02:23.900 | that we have built an agentic system
00:02:27.200 | that can do money transfer over voice, image, and chat
00:02:32.360 | with very low inaccuracy.
00:02:34.360 | I will show you the numbers really soon.
00:02:35.960 | Here, you can see that the user is connecting their account
00:02:38.460 | with WhatsApp.
00:02:40.220 | We are asking them multiple times about their password,
00:02:42.640 | et cetera, to make sure it is the right contact.
00:02:45.020 | And once they do all of them, they
00:02:46.760 | can give a very simple instruction, such as, hey,
00:02:50.480 | make a transfer to Jose for 100 reais.
00:02:54.060 | We confirm this is what the user wants.
00:02:56.960 | And then once they confirm, it makes the transfer.
00:03:01.140 | Earlier it used to take around 70 seconds
00:03:02.860 | to make this transfer through nine different screens.
00:03:05.140 | It's taking less than 30 seconds now.
00:03:07.220 | And you can see the CSAT is more than 90%, with less than 0.5%
00:03:12.640 | inaccuracy, and so on and so forth.
00:03:14.740 | And we are doing that at scale.
00:03:16.660 | And I will talk about where evaluations matter
00:03:19.820 | and what kinds of things matter.
00:03:21.640 | So just to be on the same page, the experience
00:03:25.380 | of building the chatbot and building an application
00:03:27.500 | like this is very different.
00:03:29.020 | Because you need to iterate.
00:03:30.840 | But at the same time, you need to think
00:03:32.760 | about how not to build a one-off solution.
00:03:36.240 | Because if you are building it for one application of this kind,
00:03:39.520 | imagine in a finance world, you are doing hundreds of these operations
00:03:42.720 | to make these changes, or to make money movement,
00:03:45.000 | or make micro decisions.
00:03:46.800 | Then you are building hundreds of separate systems and agents, which
00:03:50.340 | is just not scalable.
00:03:52.440 | So taking a step back, what does the Nubank LLM ecosystem look like?
00:03:57.140 | The Nubank LLM ecosystem has four different layers:
00:04:02.080 | core engine, testing and evals, tools,
00:04:04.540 | and developer experience.
00:04:05.940 | I don't have time, unfortunately, to go over each one of them.
00:04:08.720 | But you can see that in three of them right now,
00:04:11.060 | we are working very closely with LangChain and LangSmith.
00:04:13.880 | And testing and evals is something we'll talk about.
00:04:15.760 | LLM as a judge, and also online quality evaluation.
00:04:19.000 | And also on the developer experience side,
00:04:20.900 | we are using LangGraph. From the very beginning with LangChain,
00:04:25.000 | and now with LangGraph,
00:04:26.020 | we are using all of them.
00:04:28.040 | Now, how do we use it, and why does it matter?
00:04:31.940 | Let's see.
00:04:32.480 | OK, so the first thing is that without LangGraph,
00:04:37.640 | we cannot do faster iterations
00:04:40.400 | and cannot standardize
00:04:45.320 | a canonical approach we can take to build agentic systems
00:04:49.080 | or even any kind of RAG system.
00:04:51.500 | So the learning there is that complex LLM flows
00:04:54.240 | can be hard to analyze.
00:04:56.240 | Centralized LLM logs, a repository, and a graphical interface
00:04:59.360 | help people make faster decisions.
00:05:01.420 | Because we don't want only our developers to make decisions.
00:05:04.360 | We want our business users to also contribute to it.
00:05:08.640 | And the way to do it is by democratizing data
00:05:11.260 | and giving access to our business analysts,
00:05:15.000 | our product managers, our product operations,
00:05:17.220 | our whole operations team, so that
00:05:19.720 | they can make faster decisions in terms of prompts,
00:05:22.160 | in terms of adding inputs, in terms of adding parameters
00:05:27.060 | of different kinds, right?
00:05:28.920 | And last but not least,
00:05:30.360 | graphs can decrease the cognitive effort to represent flows.
00:05:33.780 | And this is what Shreya was mentioning,
00:05:36.500 | that human instructions are difficult for a machine
00:05:39.620 | to understand.
00:05:40.620 | So a graph basically makes that process easier for us.
00:05:44.960 | Now, to be true to my presentation,
00:05:47.840 | I will go to the evaluation part.
00:05:49.580 | And on the evaluation side, I will first talk about a few different challenges
00:05:53.840 | we have overall.
00:05:55.100 | The first thing is that, as we heard from Harvey that they're not only in Antarctica.
00:06:00.340 | We are only in three countries, so it's a much smaller problem set.
00:06:03.340 | But still, you can imagine that when you're dealing with Portuguese, Spanish, the languages,
00:06:07.800 | the dialects, the kind of way people talk, et cetera, that changes across the country.
00:06:13.680 | And 58% of the Brazilian population is our customer now, so we have to understand what users are talking about, et cetera, very extensively.
00:06:19.820 | The second thing is that Nubank's brand presence is huge.
00:06:23.820 | We are more popular than some of the well-known brands like McDonald's or Nike, even, in Brazil.
00:06:29.960 | So we cannot do just anything, especially when it comes to jailbreaks or guardrails.
00:06:32.960 | It's very important for us to keep a very high bar.
00:06:35.960 | And last but not the least, we have to be accurate in our messaging.
00:06:39.260 | Because at the end of the day, we are dealing with people's money.
00:06:42.000 | And with money, people care about accuracy.
00:06:45.960 | And losing trust over a money transfer is very easy.
00:06:49.820 | So taking a step back again, or actually moving a little forward, on the customer service side
00:06:57.620 | and the money transfer use case, we have very different needs from a business side and from
00:07:02.060 | a technical side in terms of what kinds of evaluations we need.
00:07:06.000 | So in the case of customer service, in addition to accuracy, what matters a lot is how
00:07:11.640 | we are approaching a customer.
00:07:13.340 | If a customer is calling us, hey, where is my card?
00:07:15.820 | Or hey, I see this charge that I don't recognize.
00:07:18.880 | If we give a very robotic experience, we lose the customer's trust, and empathy matters.
00:07:25.060 | It's very easy for a human to have this connection.
00:07:27.740 | It's very hard for a machine to have this connection, and we all know that.
00:07:31.060 | Also very high flattery doesn't work.
00:07:32.840 | I think all of you have seen what happened with the ChatGPT 4.1 model last week: they recalled it
00:07:38.120 | and they relaunched.
00:07:40.240 | So in addition, in order to do these two jobs well, we need to think about whether we understand
00:07:45.180 | the customer's intent well.
00:07:47.200 | Do we understand how we are retrieving content and context from the different sources
00:07:51.680 | that we have internally?
00:07:53.660 | What is the deep link accuracy that we have?
00:07:55.540 | Because imagine our app is 3,000 pages.
00:07:58.140 | And in 3,000 pages we have hundreds of deep links, and basically landing a user at the very
00:08:04.500 | top of the node and then asking them to traverse through different clicks and go to the page
00:08:08.360 | where they can self-service is very tedious and not very effective.
00:08:12.460 | And last but not the least, we need to make sure that we are not hallucinating.
00:08:17.260 | Whereas for a money transfer, tone and sentiment are okay.
00:08:21.960 | But we need to be accurate.
00:08:23.400 | And the accuracy is not only about the money transferred but also whom we are transferring to, which
00:08:29.200 | source we are using for the transfer, whether the person has enough money in their account, whether
00:08:34.680 | they are a fraud suspect, whether they have a pending collection.
00:08:37.860 | All of these things are very intricately connected because our customers are using not only one
00:08:41.900 | product but a whole suite of products, from lending to financing to investment to bank accounts
00:08:48.240 | to credit cards, et cetera.
00:08:49.840 | And also, oftentimes, they have dependent accounts, they have multiple cards, so we have to look
00:08:54.360 | at all of them together.
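
The checks listed above (recipient, available funds, fraud status, pending collection) could be evaluated together before a transfer is executed. The following is a minimal sketch in Python, not Nubank's implementation: the `Account` and `TransferRequest` types, their field names, and the blocker labels are all hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List, Optional


@dataclass
class Account:
    # Hypothetical snapshot of the sender's state; field names are assumptions.
    balance: Decimal
    fraud_suspect: bool = False
    pending_collection: bool = False


@dataclass
class TransferRequest:
    # recipient_id is None when the recipient could not be resolved from the message.
    recipient_id: Optional[str]
    amount: Decimal


def transfer_blockers(account: Account, request: TransferRequest) -> List[str]:
    """Return every reason the transfer should not proceed (empty list means OK)."""
    blockers = []
    if request.recipient_id is None:
        blockers.append("recipient_unresolved")
    if request.amount <= 0:
        blockers.append("invalid_amount")
    elif request.amount > account.balance:
        blockers.append("insufficient_funds")
    if account.fraud_suspect:
        blockers.append("fraud_review")
    if account.pending_collection:
        blockers.append("pending_collection")
    return blockers


if __name__ == "__main__":
    account = Account(balance=Decimal("50.00"))
    request = TransferRequest(recipient_id="jose", amount=Decimal("100.00"))
    print(transfer_blockers(account, request))
```

Returning all blockers at once, rather than stopping at the first one, makes it easier to build evaluation cases that assert on each condition separately.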
00:08:56.200 | So what is important there is, can we do named entity recognition properly?
00:09:01.440 | Because you can say, hey, send $100 to my brother.
00:09:05.920 | Now say you only have one brother and you have saved that brother and you have sent money before,
00:09:10.140 | it's easy.
00:09:11.140 | But imagine a situation where you have not done that and you have multiple brothers, and it's
00:09:14.660 | like my favorite brother, my less favorite brother. Then you have to identify which
00:09:19.680 | brother I'm talking about to send the money, right?
00:09:21.820 | Because definitely I don't want to send money to my less favorite brother.
00:09:26.100 | The next one is about making sure we interpret the user input correctly.
00:09:30.320 | Because if the user is saying that, hey, I want to send $100 but do it tomorrow, that's
00:09:35.960 | a different instruction than doing it right now.
00:09:38.720 | Because if I do it right now, maybe you will end up in overdraft.
00:09:41.820 | Last but not least, we also need to identify what the correct action is.
00:09:46.180 | Because the user might be saying, I don't want to send it, the last transfer you saw, I want
00:09:51.220 | to cancel it.
00:09:52.320 | So we need to understand that as well.
00:09:53.660 | So all of these things matter.
00:09:55.180 | And evaluations for all of these things matter.
00:09:58.060 | And without them, we cannot launch a product.
00:10:02.100 | So in the absence of evals, or in the absence of a tool like LangSmith, what happens is that we have
00:10:08.320 | a linear path of development.
00:10:10.520 | We are running A/B tests because we make all decisions with A/B tests.
00:10:13.860 | I'm not sure if I covered this before, but we have 1,800 services and we deploy every
00:10:18.860 | two minutes.
00:10:20.140 | So we make every decision by A/B test.
00:10:23.160 | And that will be the linear path.
00:10:25.540 | But if we have a system that can very well connect the traces and give observability, give logging
00:10:34.500 | and then alerts on top of it, so on and so forth, then we have a full cycle, from observability
00:10:41.080 | to filtering, to defining datasets, to running experiments, and so on.
00:10:45.160 | And this is the flywheel we have in other situations.
00:10:49.860 | And we are building it with LangSmith for our generative AI applications.
00:10:56.920 | I think I have heard these two things a few times in the last couple of talks: offline
00:11:00.900 | evaluation and online evaluation.
00:11:02.840 | So I will not go into very deep detail on it.
00:11:05.680 | But as you can see, in the case of offline evaluation, imagine an
00:11:11.120 | experiment result.
00:11:12.120 | After the experiment result, we take them to the LLM apps and we have, like, individual
00:11:17.060 | evaluation and we have pairwise evaluation.
00:11:20.220 | For both of them, I will talk about it later.
00:11:24.160 | But we primarily use human labelers in that process.
00:11:30.360 | And then we are currently also using LLMs and other custom heuristics.
00:11:34.840 | Based on all of that, we run statistical tests, and the winning variant is what we launch.
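
The talk does not say which statistical test is used to pick the winning variant. As one common choice for comparing two variants on a pass/fail label (for example, "labeler marked the answer acceptable"), a two-proportion z-test can be sketched as below. The function name and the example counts are made up for illustration.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    Returns (z, p_value). Positive z means variant B has the higher rate.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled rate under the null hypothesis that both variants perform equally.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both variants are all-pass or all-fail: no evidence of a difference.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Made-up counts: 800/1000 acceptable for variant A, 850/1000 for variant B.
    z, p = two_proportion_z_test(800, 1000, 850, 1000)
    print(f"z = {z:.3f}, p = {p:.4f}")
```

A variant would be promoted only if the p-value falls below a threshold chosen in advance; the threshold and the minimum sample size are decisions the sketch leaves open.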
00:11:40.640 | Things get more interesting, actually not at this stage, but at the online stage.
00:11:46.000 | Because in online evaluation, you can run things in your sandboxes, in your own, like, more
00:11:52.800 | controlled environments.
00:11:54.400 | And in that situation, you have a more continuous loop of improvements and development.
00:12:01.180 | If we only do the online evaluation -- why not going back?
00:12:06.420 | Okay.
00:12:07.420 | If we only do offline evaluation, then our decision-making speed, especially for developers and analysts,
00:12:11.780 | is much slower.
00:12:12.780 | But if we can do good online evaluation and tracing and logging and alerting, et cetera,
00:12:19.020 | then our development speed goes up significantly.
00:12:22.360 | So we are doing both of them.
00:12:24.140 | Now, last but not the least, I will talk about LLM as a judge, which is something we have talked
00:12:29.420 | about a few times.
00:12:30.140 | And the question basically goes back to why we need it.
00:12:34.140 | Imagine the situation I was describing about the money transfer.
00:12:36.500 | In that situation, you need to understand whom we are sending to, what the request is, how much money,
00:12:41.500 | from where, all of that.
00:12:43.460 | And doing all of that, like, we are currently doing, say, a few hundred thousand
00:12:48.560 | or a few million such transactions every day.
00:12:52.280 | With that amount of data, that amount of labeling, even if we do, like, sampling, is not enough
00:12:57.380 | to maintain the quality of the product.
00:12:59.940 | And that's why we need to do more labeling, and doing it only with humans is not scalable, because of
00:13:05.540 | training, how people understand, the mistakes people make, so on and so forth.
00:13:09.800 | So our bar was, let's build an LLM as a judge, and try to keep the quality of the judge at
00:13:18.240 | the same level as a human.
00:13:20.420 | And so we started with the first test. It was a simple prompt.
00:13:23.780 | We used the 4o mini model, because it's cheap.
00:13:26.620 | We didn't do any fine-tuning, to see, okay, how it works.
00:13:29.900 | And we got that, like, humans were making 80% accurate decisions and 20% mistakes, something
00:13:37.200 | like that.
00:13:38.200 | The F1 score doesn't show exactly that, but you can think of it that way, a little bit like an accuracy
00:13:43.440 | metric.
00:13:44.440 | We were at 51%, and in test 2, we moved to a fine-tuned model, and we increased the
00:13:53.440 | F1 score from 51 to 59.
00:13:56.240 | Next test, we changed the prompt, and we got to V2.
00:13:59.460 | We got a big jump of 11 percentage points, to 70.
00:14:03.620 | We made another iteration with 4o.
00:14:06.760 | From 4o mini to 4o, it's a better and bigger model.
00:14:10.440 | Changed the prompt again in test 5.
00:14:12.460 | Changed the fine-tuning again in test 6.
00:14:16.260 | And this is where we landed, where we are at an F1 score of 79%, compared to 80% for humans,
00:14:21.700 | which is quite comparable.
00:14:22.700 | Now, you might ask, why did you move from test 5 to test 6?
00:14:28.640 | The F1 score is 80 versus 79. Because at 79, we are identifying the inaccurate information;
00:14:34.700 | we are catching it there. That's why we are here.
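
The F1 scores quoted above compare a judge's labels against reference labels. A minimal sketch of that computation follows, treating "inaccurate" as the positive class so that recall measures how many inaccurate responses are caught. The label strings, the function name, and the sample data are assumptions for illustration; the talk does not describe how the score was actually computed.

```python
def precision_recall_f1(reference, predicted, positive="inaccurate"):
    """Precision, recall, and F1 of `predicted` against `reference` for one class."""
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == positive and p == positive)
    fp = sum(1 for r, p in pairs if r != positive and p == positive)
    fn = sum(1 for r, p in pairs if r == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up labels: reference labels versus LLM-judge labels on the same items.
    reference = ["inaccurate", "accurate", "inaccurate", "accurate"]
    judge = ["inaccurate", "inaccurate", "accurate", "accurate"]
    print(precision_recall_f1(reference, judge))
```

Reporting precision and recall next to F1 is one way to see why a configuration with a slightly lower F1 might still be preferred: two judges can have similar F1 while one of them catches more of the inaccurate cases.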
00:14:41.260 | And, no surprise, or actually it might be surprising, this whole development took us
00:14:46.000 | maybe around two weeks with a couple of developers to go through these six iterations, and we could
00:14:50.980 | only do it because we had the online tracing system in place. Otherwise, it would not
00:14:55.640 | be possible.
00:14:57.080 | So wrapping it all up, there is no magic in building any agent, any LLM.
00:15:05.760 | It's hard work.
00:15:06.760 | Evals are hard work.
00:15:08.900 | And if you don't evaluate, you don't know what you're building.
00:15:12.500 | And if you don't know what you're building, then you cannot ship it to the world.
00:15:16.540 | So do more evals, spend more time, understand what your users are saying, think about not
00:15:21.700 | only hallucination and red teaming, et cetera, but also, nuanced situations of empathy, tone
00:15:28.880 | of voice, those things matter.
00:15:31.260 | We are at a very exciting time, and thank you all for listening to me.
00:15:35.160 | And if you have any questions, please let me know.
00:15:36.380 | Thank you.