
[Evals Workshop] Mastering AI Evaluation: From Playground to Production



00:00:00.000 | Hey everyone. Thanks for joining us for the eval session today. This is the first workshop that
00:00:21.640 | we'll be leading. There's another one at 3:30, so you get to be the first people to go through it.
00:00:26.600 | Very exciting stuff. If you've gotten a chance to sign up for Braintrust, please do that now.
00:00:33.020 | If not, we also have some workshop materials and a Slack channel for you to follow along.
00:00:39.620 | In the Slack channel, we also sent out a poll. If you'd like to respond with a little emoji
00:00:45.420 | underneath the message, that'd be great. In the Slack channel also, there is the workshop
00:00:51.360 | guide, so in case you're not able to get the QR code for whatever reason, go into the Slack
00:00:56.480 | channel and you'll be able to pull up that document. Before we jump in, obviously, maybe
00:01:02.900 | just a quick intro of Carlos and I. My name is Doug. I am a Solutions Engineer at Braintrust.
00:01:08.900 | I have a background in data and finance. Actually, my third week here at Braintrust. But I'm looking
00:01:15.860 | forward to leading you all through the platform and giving you a sense for how you can master
00:01:19.820 | evals with Braintrust.
00:01:22.240 | Carlos Esteban: Yeah, my name's Carlos Esteban. I'm also a Solutions Engineer, helping out
00:01:26.240 | with some of our great customers at Braintrust. I'm a little bit more tenured, been here six
00:01:31.240 | weeks, and before that I was in the infra world working at HashiCorp, doing some stuff with Terraform
00:01:37.240 | and Vault. But yeah, super exciting to be here today at the AI World Fair. We have a lot of
00:01:42.160 | exciting things to go over with you. So just to go over the high-level agenda, we're going
00:01:49.160 | to be alternating between lectures with slides and hands-on activities. So we're going to
00:01:54.460 | start with understanding, you know, why even eval? What is an eval? Go over the different ingredients
00:01:59.500 | or components, if you will, there. Then we'll jump into the actual Braintrust UI. You'll go
00:02:04.320 | through some activity tasks there. And then move back into the lecture, talk about the
00:02:11.320 | SDK, how you can do the same thing via the SDK. It can be a bit more powerful in certain situations
00:02:16.260 | as well. Then go into production, like logging. So day two stuff, how are you observing your
00:02:23.800 | users interacting with your production app or production feature. And then finally, we're
00:02:28.480 | going to be incorporating some human in the loop. So trying to establish some ground truth
00:02:33.320 | for the ideal responses, improve your data sets, and overall improve the performance of your
00:02:39.320 | app. If you've got a chance to check out the poll in the Slack, feel free to submit a response.
00:02:50.320 | I'm really curious to see how everybody is currently evaluating your AI systems. Just as a question,
00:03:00.320 | out of curiosity, could I ask for a show of hands? How many people have seen Braintrust
00:03:06.320 | before, gone to Braintrust.dev and interacted with it? Cool. That's great. So we have some
00:03:14.320 | people that have already gone and explored it a little bit. That's exciting. And a lot of people
00:03:19.320 | are brand new. So starting off with just an introduction. What are evals? How do you get
00:03:27.320 | started? First, I wanted to just show off some mentions of evals in the public space. You may
00:03:36.320 | recognize some of these names. They see the importance of evals, which may or may not point you
00:03:42.320 | to the fact that this is something important that we should be thinking about when pushing changes into production,
00:03:48.320 | when we're developing these AI features. So why even do evals? Well, they help you answer questions.
00:03:59.320 | That's ultimately what they're for. What type of model should I use? What's the best cost for my use case?
00:04:07.320 | What's going to perform best in all of the edge cases that my users will be interacting with?
00:04:14.320 | Is it going to be consistent with my brand? Is it going to talk to the end customer, to the end user in the same voice that I would want a human?
00:04:26.320 | Am I improving the system over time? Am I able to catch bugs? Am I able to troubleshoot effectively?
00:04:33.320 | So all of this can be answered with the help of evals, which is what we'll be discussing today.
00:04:43.320 | The best LLMs don't always guarantee consistent performance. So this is why you need to have a testing framework in place.
00:04:52.320 | We have hallucinations occurring at a pretty high rate. Performance is also degrading when you make changes.
00:05:01.320 | It's difficult to guarantee that the change that you're putting through isn't going to regress the application.
00:05:08.320 | And changing a prompt, even if it may seem like it's improving things, may actually regress it.
00:05:15.320 | So you need to have some scientific empirical way of testing these changes and making sure that your AI feature is performing at the level that your users expect.
00:05:26.320 | So how do evals help your business? Well, they cut dev time. You'll be able to push changes into production a lot faster.
00:05:35.320 | Evals will live at the center of your development lifecycle. They will reduce costs.
00:05:41.320 | Due to the automated nature of evals, you'll replace manual review. It will then lead to faster iteration, faster releases.
00:05:50.320 | You'll also be able to optimize the model that you're using, make sure that it's the best bang for buck.
00:05:54.320 | Your quality will go up and you'll be able to scale your teams. It will enable non-technical users and technical users to also have a say in the prompt choice, in the model choice, and just the overall management of the performance in the production traffic.
00:06:16.320 | These are some of Braintrust's customer outcomes. So we've been able to help some of these great companies move a lot faster, increase their team productivity, and increase their AI product quality.
00:06:31.320 | So now moving into some of the core concepts of Braintrust. So we're really targeting three things. Prompt engineering.
00:06:37.320 | So we're thinking about how we're writing the prompts. What's the best way to provide context on our specific use case to the prompt so that we are optimizing its response.
00:06:49.320 | The middle piece evals. Are we measuring improvements? Are we measuring regressions? Is this being done in a statistical way that's easy to review, easy to understand?
00:07:01.320 | And then finally, AI observability. Are we capturing what's happening in production? Do we know if our users are happy with the outputs? Unhappy?
00:07:09.320 | Are we able to prioritize certain responses so that we can keep iterating, keep improving?
00:07:16.320 | Great. So now moving to the eval section. So what is an eval? So the definition we've come up with is that it's a structured test that checks how well your AI systems perform.
00:07:29.320 | It helps you measure quality, reliability, and correctness across various scenarios. Ideally, you're capturing every scenario that a user will live through when interacting with your AI feature.
00:07:43.320 | When it comes to Braintrust and writing evals, there's really three ingredients that you need to understand to be able to work effectively.
00:07:51.320 | The first is a task. So this is the thing that you're testing. This is the code or prompt that you want to evaluate.
00:07:57.320 | It can be a single prompt or a full agentic workflow. The complexity is really up to you. The one requirement is that it has an input and an output.
00:08:08.320 | Then we have our data set. So this is the set of real-world examples, our test cases that we want to push through the task to see how it performs.
00:08:17.320 | And then the score, that's the logic behind the evals. So how are we grading the output of our prompt on our data set?
00:08:27.320 | So these can be LLM-as-a-judge scores, or they can be full code functions. And the caveat is that they need to output a score from 0 to 1, which will then be converted into a percentage.
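A minimal sketch of what a code-based scorer can look like in TypeScript. The argument shape and the returned name/score object follow the common Braintrust/autoevals scorer convention, but treat the exact field names as assumptions:

```ts
// Deterministic code-based scorer: returns a score between 0 and 1.
// The { output, expected } argument shape and the { name, score } return value are
// assumptions based on the usual scorer convention, not copied from the workshop materials.
function exactMatch({ output, expected }: { output: string; expected?: string }) {
  return {
    name: "ExactMatch",
    score: output.trim() === (expected ?? "").trim() ? 1 : 0,
  };
}
```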
00:08:38.320 | Another question.
00:08:40.320 | Yeah.
00:08:41.320 | What are the test cases? Are they also LLM-generated?
00:08:45.320 | It can be at first. The question was, is the data set synthetic? Can it be synthetic? And the answer is, having an AI generate those initial use cases is a great way to get started quickly.
00:08:59.320 | But as you progress, as you mature, it's great to ground those in logs so you're capturing the real user traffic, the real interactions that users are having, and integrating those into your data sets.
00:09:18.320 | Great. So now I wanted to talk about offline evals and online evals. So there's two mental models to think through. Offline evals are what you're doing in development.
00:09:27.320 | All right. So this is the structured testing of the AI model, of the prompt that you are going to then eventually show off to customers in production.
00:09:37.320 | So this is for proactive identification of issues.
00:09:43.320 | This is what we'll be doing today in the playground, in Braintrust, and also via the SDK.
00:09:48.320 | But then on the other side is online evals. So this is in production, real traffic is being captured and being measured. It's being graded just like your offline evals are being graded.
00:10:00.320 | And this is going to allow you to diagnose problems, monitor the overall performance, and capture user feedback in real time so that you can understand, oh, this edge case isn't included in my data.
00:10:12.320 | This is a weak point in my current AI product. I need to spend some time attacking it and improving it.
00:10:18.320 | A big question that we get asked is, what should I improve? Right? I have my prompt. I have my evals. You know, how do I know what's wrong?
00:10:31.320 | And I think this matrix really helps simplify this question. So if you have a good output from, you know, your own judgment looking at what the LLM is giving you and it's a high score, then great, right?
00:10:48.320 | You've verified yourself that the output is high quality and also the scores, the evals, have also come to the same conclusion.
00:10:55.320 | If you think it's a good output, but it's a low score, then that's a signal that you need to improve your evals. Maybe the score isn't actually representing what a human would think, right?
00:11:07.320 | If it's a bad output but a high score, same thing, right? It doesn't match what a human would think looking at the output. So you need to improve your evals.
00:11:14.320 | And then finally, if it's a bad output and a low score, your evals are working correctly. That's good. And now you need to focus on improving your AI app.
00:11:23.320 | So I hope this helps explain how you should be thinking through your scores and what to tackle at which moment.
00:11:30.320 | So now we're going to zoom into each of those ingredients or components starting off with the task. So as I mentioned, right, a task is really just an input and an output.
00:11:43.320 | It can be a single LLM call or a whole agentic workflow. It's really up to you what you want to test.
00:11:50.320 | So in this pattern, we're just going to be creating a simple prompt. This is what the activity today is going to encompass.
00:11:56.320 | And you can use dynamic templating with mustache. So you can provide your data set rows as part of the prompt and that will be tested and you'll get to see that in action soon.
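A quick illustrative sketch of that mustache templating (the field names repo_url and commits are made up for illustration, not taken from the workshop prompt):

```ts
// Hypothetical prompt messages using mustache placeholders; at run time each
// {{...}} variable is filled in from the matching field of the dataset row's input.
const messages = [
  { role: "system", content: "You write concise release notes for developers." },
  {
    role: "user",
    content:
      "Summarize the changes in {{input.repo_url}} given these commits:\n{{input.commits}}",
  },
];
```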
00:12:07.320 | What if you have more than just a prompt? What if you have a multi turn chat, a whole conversation that you want to evaluate?
00:12:17.320 | You can do that in Braintrust today. You can provide the whole conversation as extra messages.
00:12:21.320 | So providing that whole chain of messages back and forth with the user and the assistant.
00:12:26.320 | You can include tool calls as well to simulate those tool calls and evaluate that big chunk, that context, that whole conversation at once.
00:12:38.320 | Tools are also something supported in Braintrust. So that's oftentimes something that your applications will leverage.
00:12:44.320 | Talking to external services or grabbing information from somewhere else.
00:12:48.320 | So you can add tools to Braintrust and have the tool available for your prompt to use.
00:13:00.320 | And just to mention, that's great for RAG use cases. So I know that's a hot word right now.
00:13:04.320 | So if you have that in mind, Braintrust can handle it.
00:13:08.320 | We support tools. We support RAG.
00:13:10.320 | Agents, another hot word right now.
00:13:13.320 | So we also allow you to chain your prompts, right?
00:13:17.320 | So you can have three prompts chained together.
00:13:19.320 | The output of the first prompt will become the input of the next and so on.
00:13:24.320 | So you can start testing end to end all these prompts back and forth, right?
00:13:30.320 | And do the same thing that you would with a single prompt.
00:13:33.320 | Great. So now moving into data sets.
00:13:38.320 | So this is the test cases, right?
00:13:40.320 | You're going to keep iterating over time, but initially maybe you're using something synthetic.
00:13:45.320 | There are three fields that you need to understand for a data set.
00:13:49.320 | Only one of them is required though, and that's the input.
00:13:52.320 | So that is the user-provided use case.
00:13:57.320 | The prompt that would be provided by the user would be the input.
00:14:01.320 | You could think of it that way.
00:14:03.320 | And then you have the expected column, which is optional, which is the anticipated output,
00:14:07.320 | or the ideal response of that prompt.
00:14:10.320 | And then finally you have your metadata, which can allow you to capture any additional information
00:14:15.320 | that you may want to associate to that specific row in the data set.
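To make those three fields concrete, here is a sketch of a single dataset row. Only input is required; the inner field names are illustrative, not the workshop's actual schema:

```ts
// One dataset row: `input` is required; `expected` and `metadata` are optional.
const row = {
  input: { question: "What changed since the last release of my repo?" },
  expected: "A short, accurate changelog grouped by features and fixes.",
  metadata: { source: "synthetic", priority: "high" },
};
```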
00:14:19.320 | Some tips for data sets is to start small and iterate.
00:14:25.320 | It doesn't need to be the largest data set of all time.
00:14:29.320 | It doesn't need to include all of your use cases, right?
00:14:32.320 | Just get started.
00:14:33.320 | Use synthetic data at first.
00:14:36.320 | And the important piece is to keep improving, right? Keep iterating.
00:14:39.320 | So if you start logging your real user interactions, you know, even if it's just in staging or internally in your organization,
00:14:45.320 | you can start to increase the scope of the data set, and it will start to become closer to the overall domain of use cases that users will interact with.
00:15:01.320 | And then finally, you want to start implementing human review.
00:15:03.320 | This will allow you to establish ground truth, improve your data set, improve the expected column, which will be great for your evals.
00:15:11.320 | And zooming into scores, so this is -- you have two options here in the type of score that you want to use.
00:15:21.320 | LLM as a judge.
00:15:22.320 | This is great for more subjective or contextual feedback.
00:15:27.320 | What would a human need to understand when looking at the output?
00:15:31.320 | What criteria would they consider?
00:15:33.320 | This is more of a qualitative question that you want an answer to, right, using that LLM as a judge.
00:15:39.320 | On the code-based score, this is deterministic, right?
00:15:42.320 | So you would want exact or binary conditions.
00:15:46.320 | This is more of an objective question. And the important piece is to try to use both.
00:15:52.320 | So you want some LLM as a judge scores, but you also would like some code-based scores, and they'll help you meet in the middle and understand the quality.
00:16:01.320 | So some tips here, you know, if you're using an LLM as a judge, maybe use a higher-quality model, a more expensive model to grade the cheaper model.
00:16:12.320 | Make sure that the LLM as a judge has a focus, so don't give it, you know, four or five criteria to consider.
00:16:19.320 | Zoom into one specific piece and expand on it, and explain the steps it should think through to come to its conclusion.
00:16:28.320 | If you're writing LLM as a judge, maybe you should eval the judge and make sure that the prompt that you're using is matching what a human would think.
00:16:35.320 | So that's another great way of improving your scores.
00:16:38.320 | And, you know, just make sure that it's confined and you're not overloading it with all the context in the world, right?
00:16:44.320 | You want it to be focused on the relevant input and output for consistency.
00:16:52.320 | Great, almost at the end here.
00:16:53.320 | So there's two things to understand about the Braintrust UI specifically.
00:16:57.320 | So there's the playgrounds, and this is for quick iteration of your prompts, agents, scores, datasets, right?
00:17:03.320 | It's really effective for comparing.
00:17:06.320 | You can do A/B testing with prompts.
00:17:08.320 | You can do A/B testing with models.
00:17:10.320 | And then you can save a snapshot of the playground to your experiments view.
00:17:15.320 | And the experiments is for comparison over time.
00:17:18.320 | So you'll be able to track how your scores change over weeks, months, and everything that your team is doing across the UI, across the SDK, will also aggregate in the experiments view.
00:17:31.320 | So you can analyze everything and understand, okay, this new model came out today.
00:17:35.320 | How is it performing compared to the prompt from two weeks ago?
00:17:38.320 | Great, so now we've reached the first activity.
00:17:42.320 | So if you could please go to the activity document, and it will take you through the journey of running your first eval in the Braintrust UI.
00:17:52.320 | Please raise your hand if you have any questions or run into any issues.
00:17:58.320 | We'll be walking around and just making sure there's no blockers.
00:18:05.320 | Yeah, if you check Slack, we'll also go back to the QR code.
00:18:15.320 | So did everybody have a chance to get these QR codes?
00:18:19.320 | The middle one is going to be the most important.
00:18:22.320 | This is where you're going to access the materials for the workshop.
00:18:25.320 | We'll be uploading the slides there.
00:18:27.320 | Yeah, I'll repeat the question.
00:18:35.320 | The question was around extra messages in the prompt.
00:18:38.320 | And if you are overseeing agents and, you know, multiple types of users, multiple different roles are all talking back and forth,
00:18:48.320 | and you want to distinguish their roles, right?
00:18:51.320 | And all being, all within the playground UI.
00:18:53.320 | So right now, there is no additional delineation beyond the assistant, the user, and the tool call, and I believe that's it.
00:19:07.320 | So having the user be branched and play different roles is something that you would need to rely on the SDK for that additional flexibility.
00:19:18.320 | That was supposed to be my other question.
00:19:21.320 | You have the API, right?
00:19:22.320 | Right.
00:19:23.320 | Yeah, and we'll cover the SDK in the next section.
00:19:31.320 | I think maybe the biggest takeaway is that there is no limit, really, on the complexity that you feed in as that task, right?
00:19:39.320 | The only requirement is that input and that output.
00:19:41.320 | Like, maybe some things are a little bit more tailored to the Braintrust playground in the UI, whereas some things are actually a little bit more tailored to that SDK.
00:19:50.320 | So that's -- we can jump into that in that next section.
00:19:54.320 | Maybe it makes sense to, like, as we're going through, like, the workshop, I'll kind of walk through this as well,
00:20:00.320 | just so you can all kind of see me go through it.
00:20:02.320 | But feel free to raise your hand.
00:20:04.320 | We can walk around and answer questions.
00:20:14.320 | For Slack?
00:20:15.320 | For Slack?
00:20:16.320 | Are you able to access the document via the --
00:20:19.320 | Is the -- the slide deck's not public?
00:20:21.320 | Is that --
00:20:22.320 | It's not public.
00:20:23.320 | Oh, for the slide deck?
00:20:24.320 | Is this --
00:20:25.320 | Yeah.
00:20:26.320 | We've just -- yeah, the question was, this is all in the UI, right?
00:20:29.320 | Is this -- that's the only place we've been thus far, right?
00:20:33.320 | Just talking a little bit about the different components of the Braintrust platform.
00:20:37.320 | Let me walk through that, right, and give you a sense for what we just kind of showed to you in slides, right?
00:20:44.320 | So you can't access the slide deck?
00:20:47.320 | Yeah.
00:20:48.320 | We can -- we can update that.
00:20:49.320 | Yeah.
00:20:50.320 | Yeah.
00:20:51.320 | Just -- let's kind of walk through this so we can get a sense for what we're building here,
00:20:57.320 | some of the things that Carlos just walked through.
00:21:00.320 | I have a lot of this stuff already installed.
00:21:02.320 | I hope that you kind of walked through this, right?
00:21:04.320 | We need certain things on our system to actually go and run this.
00:21:07.320 | So we have Node.
00:21:08.320 | We have Git.
00:21:09.320 | We're going to sign up for a Braintrust account, creating a Braintrust org.
00:21:14.320 | Already done this, so I'm not going to kind of bore you through that step.
00:21:17.320 | Right there.
00:21:18.320 | This project, unreleased AI.
00:21:21.320 | If you don't do that, you'll see two projects in your account, but just -- this is where we're
00:21:26.320 | going to actually create our prompts and our scores and our data set from the repo that we're
00:21:31.320 | going to clone into our local machine.
00:21:37.320 | Part of this demo requires an OpenAI API key.
00:21:40.320 | That's just what we're using under the hood.
00:21:42.320 | It's certainly not a limitation of Braintrust.
00:21:45.320 | You can use -- and maybe just to kind of highlight something here -- you can use really any AI
00:21:52.320 | provider out there.
00:21:53.320 | So if you've gone into your Braintrust account, you've probably seen this.
00:21:56.320 | You've entered your OpenAI API key here.
00:21:58.320 | This is what is going to allow you to run those prompts in the playground.
00:22:01.320 | You can see you have access to many other providers.
00:22:04.320 | You have access to cloud providers like Bedrock and so on.
00:22:08.320 | And then you can even use your own custom provider.
00:22:10.320 | But for this workshop right now, we are using OpenAI.
00:22:13.320 | Sorry?
00:22:14.320 | Are you able to run evals or Braintrust locally with local models?
00:22:24.320 | Yeah.
00:22:25.320 | The question was, can you run Braintrust locally using local models?
00:22:28.320 | Yeah.
00:22:29.320 | Yeah.
00:22:30.320 | We have -- if you look a little bit further out, there's a section for what we call remote evals.
00:22:35.320 | You might not have time to get to it in this particular section.
00:22:37.320 | But know that you can go to that and play with that feature as well.
00:22:44.320 | Sorry.
00:22:45.320 | Coming back down here.
00:22:48.320 | So we're going to clone this repo.
00:22:52.320 | This is the application that we're creating.
00:22:55.320 | The idea is to give it a GitHub URL and look for most recent commits since the last release
00:23:03.320 | and then summarize those for us as developers.
00:23:06.320 | So that's the application that we're going to use.
00:23:08.320 | We're going to create some different API keys locally.
00:23:11.320 | So if you've cloned your repo, you'll have a .env.local file.
00:23:16.320 | I'll show you my example.
00:23:18.320 | You're going to also input your Braintrust API key here and then your OpenAI API key.
00:23:24.320 | This is optional down here.
00:23:26.320 | It's just if you don't want to get rate limited by GitHub.
00:23:29.320 | Probably not going to create a lot of requests right now.
00:23:31.320 | So you probably don't need this.
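For reference, the filled-in .env.local ends up looking roughly like this. The variable names below are assumptions, so check the repo's .env.local.example for the exact keys:

```
# Sketch of .env.local (copied from .env.local.example); variable names are assumptions.
BRAINTRUST_API_KEY=...your Braintrust API key...
OPENAI_API_KEY=...your OpenAI API key...
# Optional: only needed to avoid GitHub rate limiting
GITHUB_TOKEN=...optional personal access token...
```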
00:23:33.320 | Really important step here.
00:23:39.320 | So I'm going to come back into Braintrust.
00:23:42.320 | So as part of our install, we're actually going to go create some of these resources within our Braintrust project that we just created.
00:23:55.320 | So I'm going to run pnpm install.
00:23:58.320 | This will actually go push some of these resources, and you'll find these in the braintrust folder, in the resources file.
00:24:06.320 | And we'll jump into that.
00:24:07.320 | But just wanted to highlight that.
00:24:08.320 | So now if I look back into my project, I should see that unreleased AI and the different things that we've created.
00:24:14.320 | We have two different prompts now.
00:24:16.320 | These are the prompts that we use to generate the change log as well as the test cases, the data set that we'll use as part of this.
00:24:23.320 | Yeah.
00:24:24.320 | Yeah.
00:24:25.320 | Yeah.
00:24:26.320 | Yeah.
00:24:27.320 | Of course.
00:24:28.320 | Code or here within the doc.
00:24:29.320 | Both.
00:24:30.320 | Yeah.
00:24:31.320 | All right.
00:24:32.320 | Let me stop there.
00:24:33.320 | Anybody having issues just kind of going through that initial setup phase?
00:24:46.320 | Where are the slides?
00:24:47.320 | Excuse me?
00:24:48.320 | Where are the slides?
00:24:50.320 | We haven't made those public yet.
00:24:51.320 | Yeah.
00:24:52.320 | I'm trying to do that now.
00:24:53.320 | Yeah.
00:24:54.320 | Do you have Slack?
00:24:55.320 | Were you able to join the workshop evals channel?
00:25:10.320 | Yeah.
00:25:11.320 | Yeah.
00:25:12.320 | The Wi-Fi is not working.
00:25:16.320 | Okay.
00:25:17.320 | Well, I'll walk you through.
00:25:21.320 | Yeah.
00:25:22.320 | How are we connecting the cloned repo to the project in the UI?
00:25:27.320 | Yeah.
00:25:28.320 | So when we ran pnpm install, we ran a script in the background that just calls braintrust push.
00:25:34.320 | And if I look at that file here, there's different things that we've configured, right?
00:25:39.320 | We've configured my change log.
00:25:41.320 | So this is actually the Braintrust SDK under the hood.
00:25:44.320 | This is where we're creating that prompt in that project unreleased AI.
00:25:49.320 | And so there's a couple things that we can do here from an SDK perspective.
00:25:53.320 | This is like you think about, you know, version controlling all of these different things and
00:25:57.320 | actually pushing them into the brain trust UI.
00:26:00.320 | So there's a lot of different ways to work with brain trust.
00:26:03.320 | I think we mentioned earlier either via just like the UI or actually via the SDK.
00:26:08.320 | But that's how a lot of this stuff got created.
00:26:13.320 | Cool.
00:26:14.320 | Let's kind of walk through this first activity.
00:26:18.320 | We're going to access the unreleased AI project.
00:26:22.320 | So if we go to that prompts, so this is what we just created.
00:26:26.320 | We created two different prompts, right?
00:26:28.320 | This is essentially what we can start to play around with, right?
00:26:30.320 | We have this one.
00:26:31.320 | And Carlos mentioned earlier, there's this mustache syntax.
00:26:35.320 | We can actually input variables here into our prompts.
00:26:39.320 | And this is going to actually map to the different data sets that we can actually use as part of this project.
00:26:45.320 | So here's our first prompt.
00:26:47.320 | That was impossible.
00:26:49.320 | Okay.
00:26:50.320 | Yeah.
00:26:51.320 | Well, no, no, that's fine.
00:26:52.320 | I appreciate that.
00:26:53.320 | Thank you.
00:26:54.320 | And the lighting is hard.
00:26:55.320 | Yeah.
00:26:56.320 | Oh, maybe we can...
00:26:57.320 | Change the appearance to light mode.
00:26:58.320 | I don't know.
00:26:59.320 | Yeah.
00:27:00.320 | Where's that?
00:27:01.320 | Should I...
00:27:02.320 | How's that?
00:27:05.320 | Yeah.
00:27:06.320 | Thank you.
00:27:07.320 | No, I appreciate that.
00:27:08.320 | How's this?
00:27:09.320 | Looks good.
00:27:10.320 | Okay.
00:27:11.320 | Cool.
00:27:12.320 | So really just reviewing the stuff that we created.
00:27:15.320 | We created these two prompts.
00:27:16.320 | Here's our data set that we're going to use when we run our evals and our experiments.
00:27:25.320 | You can get a sense for this.
00:27:27.320 | Here's my input.
00:27:28.320 | I have a series of commits.
00:27:29.320 | And then I have a repository URL.
00:27:31.320 | And I have when the last release was.
00:27:36.320 | That's that "since" field.
00:27:38.320 | So this is, again, the thing that we use inside of that playground to create evals and to use
00:27:45.320 | to iterate from.
00:27:46.320 | And then the last thing that I'll call out here is the scorers.
00:27:53.320 | So we created a few different scorers that we'll want to use to actually score these prompts.
00:27:59.320 | So we have an accuracy, formatting, and completeness score.
00:28:02.320 | And again, this is just in that repo and that resources.ts file.
00:28:06.320 | We have, maybe just to point out, linking a little bit of what Carlos was talking about
00:28:10.320 | to the actual code that you're seeing.
00:28:12.320 | We have LLM as judge scores, as you can see here.
00:28:16.320 | Really, again, trying to pinpoint here the accuracy, right?
00:28:20.320 | We're not overloading a single LLM as a judge score with accuracy, completeness, and formatting.
00:28:25.320 | It's going to be very sort of detailed or scoped down to that particular thing.
00:28:31.320 | And then last one, we have a code-based score, right?
00:28:34.320 | So this is a little bit more binary, right?
00:28:36.320 | Is the formatting of this change log that the LLM generated, does it map to what we expect?
00:28:42.320 | And so we can use some code to do that.
00:28:45.320 | So that's what we created via that script.
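As a hedged illustration of what a code-based formatting check like that might look like (not the workshop's actual scorer):

```ts
// Illustrative deterministic formatting check for a generated changelog:
// does the output contain markdown headings and bullet points as expected?
function formattingScore({ output }: { output: string }) {
  const hasHeading = /^#{1,3}\s+\S/m.test(output); // at least one markdown heading
  const hasBullets = /^[-*]\s+\S/m.test(output); // at least one bullet item
  return { name: "Formatting", score: hasHeading && hasBullets ? 1 : 0 };
}
```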
00:28:47.320 | Question?
00:28:48.320 | Yeah?
00:28:49.320 | How do we get that sandbox project into our Braintrust project?
00:28:57.320 | Yeah.
00:28:58.320 | So if you go back to the lab setup, when you run that install -- so I'm using
00:29:03.320 | pnpm.
00:29:04.320 | I don't know if you use that, if you're using that locally, you can also use npm.
00:29:09.320 | So what is the key for OpenAI API key?
00:29:13.320 | What is the key?
00:29:14.320 | Do you not have an OpenAI API key?
00:29:17.320 | And the instruction says reach out to us if you do not have the key.
00:29:22.320 | In case you don't have an API key, please let us know.
00:29:28.320 | I don't know if we have one to distribute at the moment.
00:29:31.320 | Yeah.
00:29:32.320 | If you don't have--
00:29:34.320 | Well, if you go, like the part of the setup here, right?
00:29:42.320 | When we go to, if you come in here to a playground as an example.
00:29:46.320 | All right.
00:29:47.320 | And we're going to pull in one of those prompts, or we can pull in both of these prompts to do,
00:29:53.320 | again, like what Carlos was talking about, that sort of like A/B testing.
00:29:58.320 | It's going to ask for some OpenAI models, right?
00:30:01.320 | If you don't configure an OpenAI API key inside of your Braintrust account, you don't have
00:30:07.320 | a provider to actually run this task against.
00:30:14.320 | But this is the playground, right?
00:30:15.320 | This is what Carlos was talking a little bit about earlier, about being able to do some A/B testing, right?
00:30:20.320 | I have my two prompts that I've loaded in.
00:30:24.320 | The idea here is to load in those different ingredients, right?
00:30:27.320 | Our tasks, our data set, and then our scores.
00:30:30.320 | So you'll look down here.
00:30:32.320 | We're going to select that data set.
00:30:34.320 | And then we're going to select the different scores that we want to score this task against.
00:30:38.320 | And I'll load in my accuracy, formatting, and completeness.
00:30:42.320 | I can do a couple things here.
00:30:44.320 | I can click Run.
00:30:45.320 | This will actually, in parallel, go through that data set and use each task that we've defined,
00:30:51.320 | and then it will score those, right?
00:30:53.320 | So the idea here is to, again, like this provides that sort of rapid iterative feedback loop that we
00:30:59.320 | oftentimes need to build these types of products.
00:31:02.320 | So here are my, you know, like my example rows.
00:31:05.320 | Again, these could be synthetic.
00:31:06.320 | These could be a small subset of rows that are coming back from my application.
00:31:10.320 | But now I can get a sense for prompt A, prompt B.
00:31:13.320 | How are these performing with my scores, you know, relative to the things that I have here?
00:31:19.320 | But then I'm able to do a lot of different things here.
00:31:22.320 | I can look at maybe a summary layout, get a sense for the scores.
00:31:27.320 | So at the top, this is sort of like my baseline.
00:31:30.320 | Up here is my base task, and this is my comparison task.
00:31:33.320 | So you can get a very high-level look at these different scores and how they fared with the
00:31:38.320 | different prompts that we've loaded in here.
00:31:40.320 | And the other thing that we can do that Carlos mentioned is experiments.
00:31:43.320 | So oftentimes we'll want to capture these scores over time.
00:31:48.320 | So when we do make changes, we understand how those scores fared a week ago, a month ago,
00:31:53.320 | or whatever it is.
00:31:54.320 | So I can click this experiments button, and you'll see the different things that we've loaded up into
00:31:59.320 | this playground are already here within this modal.
00:32:02.320 | We'll click create.
00:32:04.320 | And this will actually create the experiment that you can go to here.
00:32:08.320 | And again, this will, if I click maybe one out, this will allow us to track this over time.
00:32:14.320 | This is what we can also layer into our CI kind of workflow.
00:32:18.320 | So we go make a change to that prompt, make a change to that model.
00:32:21.320 | What are the impacts to the scores relative to what we had over history?
00:32:28.320 | Sorry, can you?
00:32:32.320 | What is the completeness score?
00:32:34.320 | What is the completeness score?
00:32:35.320 | Yeah, we can dig into that a little bit.
00:32:37.320 | This is an LLM as a judge score.
00:32:40.320 | So the idea, right, we're just going to give it instructions.
00:32:44.320 | The LLM is going to score the output based on what we've provided in this prompt.
00:32:49.320 | You'll also note, I'm just really pulling in the structure of my data set.
00:32:55.320 | Right?
00:32:56.320 | And so you obviously can write.
00:32:58.320 | And another thing that Carlos mentioned is like scoring the score.
00:33:01.320 | Evaluing the score.
00:33:02.320 | So how well is this thing actually doing based on the output that we are seeing within our application?
00:33:09.320 | Okay, that's really activity one, right?
00:33:21.320 | It's reviewing some of that stuff.
00:33:23.320 | And then it's creating that playground and showing you all like the sort of way that we can iterate here within Braintrust to create better AI or Gen AI products.
00:33:34.320 | Right?
00:33:35.320 | So this allows me to now, well, okay, so maybe this isn't the right model.
00:33:38.320 | Maybe if I do, maybe I want to see this new GPT model.
00:33:42.320 | I can run this.
00:33:43.320 | And now I can see how the model changed for that particular score.
00:33:48.320 | How the scores change when I change the underlying model.
00:33:52.320 | Right?
00:33:53.320 | But now I have these, you have like all of these different inputs that could happen to these applications.
00:33:57.320 | It's a way for us to track and understand when I do go and tweak this thing, there's actual data behind it.
00:34:04.320 | Right?
00:34:05.320 | This isn't like vibe check.
00:34:06.320 | This isn't, yep, I think that looks good.
00:34:08.320 | I looked at this row.
00:34:09.320 | It seems like it's better output.
00:34:11.320 | This is data behind it.
00:34:12.320 | And we can actually understand as we tweak that prompt, tweak that model.
00:34:16.320 | How does that impact our scoring?
00:34:18.320 | And then again, you can like overlay this within CI and so on.
00:34:23.320 | Yeah.
00:34:26.320 | So I pulled the project down and I have the Braintrust key now, the OpenAI key and everything.
00:34:37.320 | I just don't know where should I be right now.
00:34:41.320 | How do I get the GitHub project that is on my machine to the--
00:34:48.320 | To the--
00:34:49.320 | Yeah.
00:34:50.320 | Okay.
00:34:51.320 | So you cloned the repo.
00:34:52.320 | Yeah.
00:34:53.320 | Can we get us into where people are?
00:34:57.320 | Yeah.
00:34:58.320 | Please.
00:34:59.320 | We can back up considerably.
00:35:01.320 | I can just, you know, with you, we can fix this.
00:35:05.320 | You cloned the repo, correct?
00:35:07.320 | Yeah.
00:35:08.320 | Okay.
00:35:09.320 | In your .env.local, do you have a .env.local file?
00:35:14.320 | I should have, yeah.
00:35:16.320 | There is a .env.local.example file.
00:35:20.320 | So you can copy that into the .env.local.
00:35:23.320 | Those are the keys that we want to fill in.
00:35:24.320 | Have you filled those in?
00:35:25.320 | No, I did not.
00:35:26.320 | Okay.
00:35:27.320 | So if you haven't within your brain trust org created an API key,
00:35:34.320 | I'm guessing no.
00:35:39.320 | Yeah, I don't think so.
00:35:41.320 | I think the internet connection is probably the biggest thing we're fighting here.
00:35:46.320 | Yeah.
00:35:47.320 | Luckily, all the instructions will be available after the workshop.
00:35:51.320 | Same with the slides.
00:35:52.320 | I'm about to share the slide link.
00:35:54.320 | So tough to update with the connection.
00:35:56.320 | But at least seeing Doug go through the same process will give you an understanding of,
00:36:03.320 | you know, what we were hoping you'd have that hands-on experience doing.
00:36:06.320 | We don't have too much time to wait on this specific activity.
00:36:11.320 | So I think in a few minutes we'll keep going and hopefully you'll be able to set up your keys so that you can catch up when you have some time.
00:36:21.320 | Yeah, just in the interest of time, we'll probably move forward.
00:36:24.320 | But just to complete that, if you go into your settings within your project, or excuse me, within your Braintrust org, you'll see API keys.
00:36:33.320 | This is where you're going to create this API key.
00:36:36.320 | I did that.
00:36:37.320 | You did that.
00:36:38.320 | Okay, perfect.
00:36:39.320 | Create that.
00:36:40.320 | You put that in that .env.local file.
00:36:42.320 | And then you run pnpm install.
00:36:44.320 | So that is in my home directory, right?
00:36:49.320 | It's wherever you cloned that repo.
00:36:52.320 | Yeah.
00:36:53.320 | So at the root of that, you put the Braintrust API key in your .env.local file.
00:36:59.320 | And you should have a spot for it already if you're using sort of the template from the example.
00:37:04.320 | And then when you run pnpm install, again, just to highlight this, it's actually running this command, braintrust push.
00:37:16.320 | So we're taking the resources that we've configured in our project and pushing them via our API key to the Braintrust org.
00:37:25.320 | Cool.
00:37:26.320 | Maybe going forward a little bit here to talk about the flip side of this, right?
00:37:52.320 | We've been kind of in the UI for the most part in our playground, doing that iteration, changing prompts, changing models, and so on.
00:38:00.320 | I think it's important to understand that we can do a lot of this via the SDK as well.
00:38:05.320 | We also have a Python SDK if that's the kind of flavor you're most used to using or the language.
00:38:13.320 | But that top portion here is essentially what we did, right?
00:38:18.320 | In that install, that post-install script.
00:38:20.320 | We defined our assets in code.
00:38:22.320 | We defined scores.
00:38:23.320 | We defined prompts.
00:38:24.320 | We defined a data set even.
00:38:26.320 | And then we pushed that into our Braintrust org.
00:38:29.320 | The benefit here is that, again, we can use our repo, right?
00:38:32.320 | We can leverage version control to ensure that the things that we want to change are actually version controlled alongside of everything else that we're building within that application.
00:38:41.320 | So there are really two modes to actually work with the Braintrust platform.
00:38:44.320 | Either its UI or its SDK.
00:38:46.320 | Again, there's really no limits that we place on the user of the platform.
00:38:50.320 | It's going to cater to maybe a different persona, a different use case.
00:38:55.320 | But you can use kind of both of these different things.
00:38:58.320 | The other thing that we haven't done yet, that we did within the playground when we were in that experiment, we ran those evals, right?
00:39:04.320 | We actually evaluated the task against the data set with our scores.
00:39:09.320 | You can also define evals in code, right?
00:39:12.320 | You can define the evals within your repo.
00:39:14.320 | And in a very similar command, braintrust eval, now we can push that up to the Braintrust platform and essentially run that same thing.
00:39:21.320 | Track it in that experiment.
00:39:23.320 | I have now an understanding over time how my evals are performing as I go and change things, all those different things.
00:39:30.320 | A little bit more insight here you probably saw with some of that code.
00:39:35.320 | This is the command we're going to run, braintrust eval.
00:39:39.320 | And you essentially give it either the name of a file that has evals in it, or you can give it the name of a folder as long as your files have .eval.ts as their naming convention.
00:39:51.320 | And we're just going to go run those evals in that parallel fashion that you saw within the UI.
00:39:57.320 | A couple things maybe I mentioned earlier, but just important to highlight.
00:40:00.320 | You should do this when you want source controlled prompt versioning.
00:40:03.320 | You want consistent usage across your different environments.
00:40:06.320 | And you also want to leverage online scoring.
00:40:13.320 | Mention this, obviously, the .eval.ts.
00:40:16.320 | That's essentially what the SDK is looking for when we go and run those evals.
00:40:21.320 | It makes it really easy to run a larger subset of those without specifying each single file that you want to go run those evals for.
00:40:28.320 | But you can see, and let me jump into the actual activity.
00:40:32.320 | You can see the eval that we've created.
00:40:34.320 | It's what we've been talking about, right?
00:40:36.320 | The three ingredients that we need.
00:40:38.320 | We need that task.
00:40:39.320 | We need that data set.
00:40:40.320 | And then we need at least one score there as well.
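Here is a minimal sketch of what one of those .eval.ts files can look like, using the Eval() entry point from the TypeScript SDK. The project name, data rows, task body, and scorer choice are stand-ins, not the workshop's actual eval file:

```ts
// A .eval.ts file that `braintrust eval` can pick up: the three ingredients
// (data, task, scores) defined in code. Project name, rows, and task are illustrative.
import { Eval } from "braintrust";
import { Factuality } from "autoevals";

Eval("unreleased-ai", {
  data: () => [
    {
      input: "Summarize these commits: fix login bug; add dark mode",
      expected: "Fixed a login bug and added dark mode.",
    },
  ],
  task: async (input) => {
    // In the real app this would call the prompt / model chain; here it's a stub.
    return `Changelog: ${input}`;
  },
  scores: [Factuality],
});
```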
00:40:43.320 | Yeah, so the question was like, how do you bootstrap a data set?
00:40:57.320 | I think it's a good question.
00:40:59.320 | I think you could certainly start with synthetic data.
00:41:01.320 | Or you could even start with, you know, you release a feature, right?
00:41:06.320 | You're going to start logging that feature.
00:41:08.320 | This is another thing that we haven't yet covered.
00:41:10.320 | But you can actually use the logs from that to add to the data set.
00:41:14.320 | So now you have actual real life data.
00:41:16.320 | The thing to not do is wait until you have 100, 200, like you have this golden data set.
00:41:23.320 | If you think back to that matrix that Carlos was showing,
00:41:25.320 | there are the different ways in which we could start to think about improving the application based on what we observe.
00:41:31.320 | So start with something small.
00:41:32.320 | And again, it could be synthetic.
00:41:33.320 | But then you can, once you start to evaluate it, then you have a different, you have different inputs on what you need to do to go and improve a score or improve the application.
00:41:44.320 | But it's really, I think, maybe up to you.
00:41:47.320 | The best practice or the thing that I would think about is like, don't stop yourself because you only have a small subset of data.
00:41:54.320 | Okay.
00:41:55.320 | Yeah?
00:41:56.320 | So for the tests you're running, if you're using an LLM as a judge, so like for the percent completeness score, using GPT-4.1 as a judge, that's subjectively scoring the test that you're running, right?
00:42:22.320 | So like for the same, for two runs that are, that happen one after the other, you could end up with different scores, right?
00:42:31.320 | If you're using like LLM as a judge to run those evaluations.
00:42:37.320 | So the question was like, would I get different scores because I'm using an LLM to do this, right?
00:42:44.320 | And it's not really deterministic.
00:42:47.320 | I think that's the reason why you would use a better model so you don't see something like that.
00:42:53.320 | I don't know.
00:42:54.320 | Carlos, you have any of the thoughts there?
00:42:56.320 | Yeah.
00:42:57.320 | I mean, I think you're speaking to the nature of an LLM being non-deterministic.
00:43:01.320 | So yes, there may be some variability.
00:43:04.320 | What we see our customers do, especially with the SDK, is do trial evals.
00:43:09.320 | So you will run it maybe five times and then take the average of those.
00:43:12.320 | So there are things that you can do to try to beat that.
00:43:15.320 | But it is the nature of the beast and you have to learn to work with it.
00:43:18.320 | And then the other thing is, how are you scoring like a percent completeness if the task of
00:43:27.320 | the LLM is like to judge, like put it in categories, like excellent, good.
00:43:32.320 | Are you mapping those to scores or?
00:43:36.320 | Yeah.
00:43:37.320 | So I think the question is, and if I come to the score and you look at the completeness.
00:43:43.320 | So the LLM here, yeah.
00:43:45.320 | Got it.
00:43:46.320 | It has to decide, based again on like the criteria that we give it, if it comes up with excellent,
00:43:52.320 | that maps to one and so on.
00:43:54.320 | But again, the score that it gives has to be between zero and one.
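As a sketch of how that label-to-number mapping can be expressed in code. LLMClassifierFromTemplate and its choiceScores option follow the autoevals library, but double-check the exact signature; the prompt and labels here are illustrative:

```ts
// LLM-as-a-judge scorer that maps grade labels to numbers in [0, 1].
// The helper and option names follow autoevals; the template and labels are illustrative.
import { LLMClassifierFromTemplate } from "autoevals";

const completeness = LLMClassifierFromTemplate({
  name: "Completeness",
  promptTemplate:
    "Does the changelog below cover every commit in the input?\n\n" +
    "Input:\n{{input}}\n\nChangelog:\n{{output}}\n\n" +
    "Answer with exactly one of: Excellent, Good, Poor.",
  choiceScores: { Excellent: 1, Good: 0.5, Poor: 0 },
  useCoT: true, // ask the judge to reason step by step before picking a label
});
```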
00:43:59.320 | Yeah.
00:44:00.320 | It's really helpful when you're using an LLM as a judge to go through the Braintrust logs
00:44:07.320 | and read the rationale.
00:44:08.320 | So it will explain why it shows, you know, 100% or 75%.
00:44:13.320 | And you can use that to tune the LLM as a judge and improve it.
00:44:20.320 | It likely will not, you know, you don't want it to say 100% for everything, right?
00:44:25.320 | If that's the case, you need to improve your evals.
00:44:28.320 | And even if it's saying, you know, 30% for everything, it doesn't necessarily mean that
00:44:34.320 | it's performing horribly.
00:44:36.320 | What really matters is the baseline that you're comparing it to.
00:44:39.320 | You know, how did it perform yesterday on these scores on this data set?
00:44:43.320 | You shouldn't be comparing just, you know, it needs to be 80%.
00:44:47.320 | Not necessarily.
00:44:48.320 | It matters what happened yesterday, what you're comparing to previously.
00:44:52.320 | Yeah.
00:44:53.320 | And this is where it becomes really beneficial to be able to actually drill into
00:44:57.320 | what happened within that task.
00:45:00.320 | So being able to not only understand the calls that the LLM makes, the tools that it invokes,
00:45:05.320 | but actually drilling into those scores that we're using.
00:45:07.320 | And then like Carlos mentioned, what is the rationale that it gave to give it a good score here?
00:45:13.320 | Does that make sense?
00:45:14.320 | This becomes, again, another way of like, this is the human review portion or part of building a Gen AI app.
00:45:20.320 | Yeah.
00:45:21.320 | Does Braintrust offer any kind of like optimization or prompt optimization features?
00:45:28.320 | So assuming you have your eval down, you've got tons of different data points to test from.
00:45:34.320 | Can you kind of use Braintrust to optimize the prompt itself?
00:45:38.320 | Yeah, that's a good question.
00:45:39.320 | And it's something that we're thinking a lot about is how can we add LLMs to optimize this process for you?
00:45:46.320 | So not just for prompts, like you mentioned, but also for data sets and for scores.
00:45:51.320 | We're one to two weeks away from releasing our first version of this.
00:45:56.320 | We're going to call it Loop and it will do exactly what you're saying.
00:45:59.320 | It will help you optimize the prompts and improve your evals.
00:46:03.320 | Just to build on that, imagine like this feature Loop, it has access to previous results as it's doing it.
00:46:11.320 | So it understands when it makes that change, is it better than it was previously?
00:46:15.320 | So it's sort of like that agentic type workflow where it has access to tools, but it's able to iterate on that prompt and run those experiments.
00:46:25.320 | With you, of course, like in the middle to prompt it to do different things.
00:46:29.320 | But it has access to those previous experiments and the results of that.
00:46:33.320 | So it knows like the general direction it needs to go to get improvements.
00:46:37.320 | Yeah.
00:46:38.320 | We use Braintrust.
00:46:42.320 | Yeah, exactly.
00:46:43.320 | It's a way of dog fooding.
00:46:45.320 | So the question was, how do we eval our AI feature?
00:46:48.320 | And it's, you know, of course, we have to use Braintrust.
00:46:50.320 | And it's honestly really cool to look at the project and see all the logs coming in and looking at the scores that we've chosen to go with.
00:46:58.320 | Yeah, Braintrust is really helping.
00:47:00.320 | And it was actually something that Ankur, our CEO, has talked about.
00:47:04.320 | The process of actually getting to a point where we were excited to release something like this.
00:47:10.320 | Previously, the models were not performing to the level that we were needing them to perform to.
00:47:16.320 | So every few months he would run a new benchmark on this specific use case.
00:47:20.320 | And it wasn't until, you know, the last month that a model finally reached that expectation.
00:47:26.320 | Yeah.
00:47:30.320 | They gave me a mic, so hopefully you can hear me.
00:47:32.320 | Yeah.
00:47:33.320 | Cascading off the gentleman in the front's question around the subjectivity of using LLMs as a judge in these types of cases.
00:47:41.320 | Do you offer any way to gain access to that rationale programmatically such that you could evaluate the thought process of the LLM as it's doing it?
00:47:51.320 | It's kind of meta, like one-step review.
00:47:53.320 | Yeah.
00:47:54.320 | But adding in that second layer where you could identify weak spots, perhaps, if there's something that's hyper-workflow oriented or has a very strict process you're looking for the LLM to follow.
00:48:03.320 | Yeah.
00:48:04.320 | I mean, I think you probably saw what I highlighted here.
00:48:08.320 | Correct me.
00:48:09.320 | Or thumbs up if you saw this.
00:48:11.320 | Yeah, okay.
00:48:12.320 | This is all accessible via API.
00:48:14.320 | Coming back down here.
00:48:17.320 | So like the rationale that the LLM gave, you could certainly build something on top of these rationales and then generate, you know, again, like eval the eval type of workflow.
00:48:30.320 | Cool.
00:48:38.320 | Yeah.
00:48:39.320 | Yeah.
00:48:40.320 | Yeah.
00:48:41.320 | Question.
00:48:42.320 | So I'm just curious.
00:48:43.320 | There's a mic if you'd like it.
00:48:44.320 | Yeah.
00:48:45.320 | Thank you.
00:48:46.320 | Yeah.
00:48:47.320 | So, yeah.
00:48:48.320 | I'm just curious, like, how should we build our confidence around, like, you know, the result of LLM as a judge?
00:48:56.320 | I mean, you know, how do we trust the evaluation?
00:48:59.320 | Because it's, after all, the model, like, evaluate the data set.
00:49:04.320 | And it's, like, we could get, like, you know, maybe a good result, but actually it's, like, maybe a large model is just overconfident or something like that.
00:49:13.320 | It's like we need to evaluate our evaluation results using, like, humans at the beginning, so that we can build the confidence.
00:49:21.320 | Like, yeah, just curious, any experience, like, you have here?
00:49:25.320 | Yeah, definitely.
00:49:26.320 | That's a great question.
00:49:27.320 | So I guess everybody heard.
00:49:29.320 | Don't need to repeat.
00:49:30.320 | I think there's two things that you can do.
00:49:33.320 | One that you mentioned, which is involving a human, reviewing that LLM as a judge, and confirming that it's thinking in the right way, it's outputting the correct score.
00:49:45.320 | Another approach that you could do as well as the human in the loop is using deterministic scores.
00:49:52.320 | So full coding functions that are trying to grade the same type of criteria using regular expressions or some other logic.
00:50:00.320 | And you can approximate, right?
00:50:02.320 | So if the LLM as a judge is giving a zero, but the, you know, the deterministic code score is giving a way higher score, then you know that there's something that needs attention, needs to be fixed.
00:50:12.320 | The matrix, as well, that we showed at the beginning, that pointed to, you know, should you improve your evals or improve your AI app, that's also a great resource.
00:50:23.320 | How are you guys thinking about the role, if you think there is a role, for traditional machine learning models in evals?
00:50:33.320 | I mean, you know, on one hand, you have totally deterministic code, and then on the other, you have LLM as a judge.
00:50:38.320 | Do you think there's kind of a middle ground for things like intent classification models, entity recognition, sentiment classification, and clustering, and kind of more traditional machine learning approaches that kind of sit somewhere in between, you know, the totally deterministic versus totally non-deterministic spectrum of code versus LLMs?
00:50:58.320 | Like, do you think that there's a role for those type of models, and what do you think that looks like?
00:51:04.320 | Yeah, that's a good question.
00:51:07.320 | I think it's still to be determined how this all shakes out.
00:51:11.320 | There are some customers, companies that we talk to, that are going full deterministic.
00:51:16.320 | They don't use LLM as a judge.
00:51:18.320 | And then there are others that are very much going in the LLM as a judge route.
00:51:21.320 | And I think the reason that there's a split is because they both work.
00:51:25.320 | So, you know, I don't know, I don't know how this will eventually shake out, if we'll reach in the middle, or if, you know, determinism will win.
00:51:34.320 | What I can say, though, is that it's highly dependent on the use case, the problem that you're solving, and experiment with both.
00:51:41.320 | And then you can determine which one is working best.
00:51:44.320 | I guess those are, like, also largely code-based, right, the things that you're talking about.
00:51:49.320 | And so maybe they lean a little bit more towards that.
00:51:52.320 | Yeah, I mean, I'd say there's still kind of neural approaches.
00:51:56.320 | Mm-hm.
00:51:57.320 | Or still not entirely deterministic.
00:51:58.320 | Thank you.
00:52:01.320 | Yeah, no, I got you.
00:52:02.320 | It's sort of like that middle ground.
00:52:05.320 | Yeah, I don't have a great answer for you.
00:52:08.320 | I do think using Braintrust, you do have the ability to at least configure both of these, the LLM as a judge and then the code-based scorer.
00:52:18.320 | And then you can, again, using the human review process, find the ones that actually map to the right output the best.
00:52:27.320 | And then that's how you start to build your application.
00:52:29.320 | But it's a really good question.
00:52:31.320 | Thank you.
00:52:36.320 | How's the activity going?
00:52:37.320 | I know we are getting into, you know, the last 25 minutes of the session.
00:52:43.320 | We still have two more little slide chunks to go through.
00:52:48.320 | So maybe in, you know, two minutes or now we could move to the slides and then keep it going.
00:52:54.320 | Again, this will all be available after.
00:52:56.320 | So feel free to keep working on it.
00:52:59.320 | Yeah, maybe just really quick we could run the eval.
00:53:02.320 | Just from here you can see we're actually going to take the eval defined here.
00:53:11.320 | So we have our task.
00:53:13.320 | We have our scores.
00:53:14.320 | Then we have our data set, essentially what we just did within the UI.
00:53:18.320 | Pushing that into the Braintrust platform.
00:53:20.320 | And then you can even see, like, this is where it's running right now.
00:53:27.320 | So, again, being able to do these in a couple different ways, either via the SDK or from the playground itself.
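For reference, a minimal sketch of what an eval file like this can look like with the Braintrust TypeScript SDK; the project name, dataset rows, and placeholder task below are illustrative, not the workshop repo's exact code:

```typescript
// changelog.eval.ts — runnable with the Braintrust CLI (e.g. `npx braintrust eval changelog.eval.ts`).
import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";

Eval("changelog-generator", {
  // Data: the same input/expected rows we built in the UI.
  data: () => [
    { input: "feat: add dark mode toggle", expected: "Added a dark mode toggle." },
    { input: "fix: null pointer on login", expected: "Fixed a crash on login." },
  ],
  // Task: whatever your application does to turn an input into an output.
  task: async (input: string) => {
    return `Summarized: ${input}`; // placeholder for the real LLM call
  },
  // Scores: code-based and/or LLM-as-a-judge scorers.
  scores: [Levenshtein],
});
```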
00:53:34.320 | I think you explained this already.
00:53:42.320 | Maybe I was distracted by the Wi-Fi, but how do I think about the difference between the playground and the experiment?
00:53:47.320 | Yeah, that's a great question.
00:53:49.320 | Let's see if we can just quickly go back to the slide.
00:53:52.320 | The Playground you can think of as quick iteration.
00:54:02.320 | So a playground is ephemeral, right?
00:54:07.320 | Experiments are long-lived, for historical analysis.
00:54:10.320 | If that helps answer your question.
00:54:13.320 | They're becoming more and more the same.
00:54:15.320 | You know, historically the experiments had a bit more bells and whistles.
00:54:19.320 | So, you know, typically teams would gravitate towards the experiments.
00:54:22.320 | But we found that they really liked the quick iteration.
00:54:26.320 | They really liked using the UI.
00:54:28.320 | And so we started beefing it up.
00:54:30.320 | And now they've become pretty much the same.
00:54:32.320 | So, yeah.
00:54:33.320 | Playground, more ephemeral, quick iteration.
00:54:35.320 | You want to save the work that you've done to an experiment so that you can later review it and see the scores update and change over time.
00:54:45.320 | So, when I do an eval from the SDK, it always is an experiment.
00:54:50.320 | Like, what if I just want to iterate in my text editor?
00:54:53.320 | I should use the UI, the Playground UI for that?
00:54:56.320 | Yes, and there are also remote evals, which allow you to define the eval via the SDK but then expose it in the Playground.
00:55:06.320 | So, it's like the bonus activity in the document at the very end.
00:55:10.320 | So, maybe you should check that out.
00:55:12.320 | And we won't have time in this session.
00:55:14.320 | But if you come to the one at 3:30, we will.
00:55:16.320 | Any other questions?
00:55:20.320 | Okay, cool.
00:55:24.320 | So, moving into lecture three.
00:55:26.320 | So, this is, you know, once you've finished development and it's reaching customers.
00:55:31.320 | You're in production, right?
00:55:32.320 | Now, what do you do?
00:55:33.320 | Well, the important thing is logging.
00:55:35.320 | You want some observability.
00:55:36.320 | You want to understand what's going on.
00:55:39.320 | How are they using it?
00:55:40.320 | Are there any gaps?
00:55:41.320 | Are they unhappy?
00:55:42.320 | It can help you debug a lot faster.
00:55:47.320 | It can allow you to measure quality instantly on that live traffic.
00:55:53.320 | You can turn those production traces, what you're logging.
00:55:57.320 | You can turn it into a data set and bring it back into the Playground, keep improving the prompt.
00:56:03.320 | And, you know, it allows a lot of non-technical people to understand what the end user is thinking.
00:56:10.320 | So, you can close this feedback loop.
00:56:12.320 | We have a lot of PMs, SMEs using Braintrust and going through the logs and looking at that user feedback to understand what gaps and improvements may exist.
00:56:23.320 | So, how do you log into Braintrust?
00:56:29.320 | Well, this is done via the SDK, right?
00:56:31.320 | It needs to plug into your production code.
00:56:33.320 | So, these are some of the steps here, the tools that you can use.
00:56:37.320 | So, you know, you need to initialize a logger that will authenticate into Braintrust.
00:56:42.320 | It will connect it to a project.
00:56:44.320 | So, now your logs will go to a specific project in Braintrust.
00:56:47.320 | Some great ways to get started with really one line of code: wrap your LLM client.
00:56:54.320 | So, you can use wrapOpenAI around that LLM client and now any communication will get logged with your prompt, response, and also metrics.
00:57:08.320 | So, how many tokens were sent back and forth, the latency, all errors, everything, just by adding that wrapOpenAI.
00:57:15.320 | You can do the same with Vercel AI SDK, or you could use OTEL.
00:57:20.320 | So, we also integrate with OTEL.
00:57:22.320 | So, if you want to go that route, it's also available.
00:57:25.320 | If you want to log and trace arbitrary functions, we also support that.
00:57:31.320 | You can just use a trace decorator around the function.
00:57:36.320 | Really helpful for keeping track of any functions that are helpful to understand and keep track of.
00:57:44.320 | And then, if you need to add additional information like metadata, you can use span.log.
00:57:49.320 | So, it's very capable, very flexible, but there's still these, you know, one line code ways to get started.
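A minimal sketch of those pieces in TypeScript; the project name and the fetchCommits helper are placeholders for illustration, not the workshop's exact code:

```typescript
import { initLogger, wrapOpenAI, wrapTraced, currentSpan } from "braintrust";
import OpenAI from "openai";

// 1. Initialize a logger: authenticates and routes logs to a specific project.
initLogger({ projectName: "changelog-generator" });

// 2. Wrap the LLM client: prompts, responses, token counts, latency, and errors
//    get logged on every call with no other code changes.
const openaiClient = wrapOpenAI(new OpenAI());

// 3. Trace an arbitrary function so it shows up as its own span.
const fetchCommits = wrapTraced(async function fetchCommits(repoUrl: string) {
  // 4. Attach extra context (metadata) to the current span.
  currentSpan().log({ metadata: { repoUrl } });
  return []; // placeholder for the real GitHub call
});
```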
00:58:01.320 | So, now that you're pushing all of your logs into Braintrust, you're capturing, you're observing real user traffic.
00:58:08.320 | Now, we're going to get into that, you know, online scoring piece as opposed to offline.
00:58:13.320 | So, online is measuring the quality of live traffic.
00:58:16.320 | So, you can decide how many logs that are coming in will get evaluated and scored.
00:58:23.320 | It could be 100%.
00:58:25.320 | It could be 1%.
00:58:26.320 | It's up to you.
00:58:27.320 | This allows you to set early regression alerts.
00:58:30.320 | So, if it starts dropping below, you know, 70%, 60%.
00:58:34.320 | Ultimately, it depends on the score that you're using and what you've established as the baseline.
00:58:38.320 | But if it starts dropping below a critical amount, you can set up alerts and notify the correct team.
00:58:44.320 | You can also A/B test different prompts.
00:58:46.320 | You can set up tagging and understand, oh, this trace coming from this user is from prompt A versus this one from prompt B.
00:58:54.320 | And you can compare the grades, right, the score results coming back in.
00:59:01.320 | So, this is great for just improving feedback, moving quickly, and understanding if there's been a drop in quality.
00:59:11.320 | How do you create an online scoring rule?
00:59:13.320 | Well, everything is done via the UI.
00:59:15.320 | You go to your project configurations, click on online scoring, and then you can add your rule.
00:59:20.320 | This is where you'll define what scores you want to be used on that live traffic.
00:59:26.320 | And then, crucially, that sampling rate.
00:59:29.320 | So, maybe at the beginning, you start with a lower sampling rate, and then you can increase it once you trust the metrics coming in.
00:59:36.320 | You can also choose what span you want this online score to run on.
00:59:41.320 | So, it defaults to the root span, but you can get more granular and specify, you know, I want this nested child span to be scored.
00:59:55.320 | So, once you start collecting these logs, collecting these online scores, oftentimes teams want to view them in interesting ways and customize the lenses on those logs.
01:00:06.320 | So, that's where custom views come in.
01:00:08.320 | You can apply filters, sorts.
01:00:10.320 | You can customize columns on the logs with whatever information you'd like.
01:00:16.320 | And now you can start saving those views and making them available for the rest of your team to just come to the logs and select,
01:00:24.320 | Oh, I want to go to, you know, the logs under 50% view.
01:00:29.320 | Or, you know, their own custom view that they've made that's specific to what they care about.
01:00:34.320 | So, it's a great way of collaborating and speeding up the process of viewing the important things to you.
01:00:41.320 | Great.
01:00:46.320 | So, you know, we went through the slides.
01:00:48.320 | Now we would jump back in to the activity document.
01:00:53.320 | There we can look at the actual code, see where the logging is being captured in our files, spin up the application.
01:01:03.320 | So, you can actually view the application in your dev environment, interact with it, and you'll see those prompts and outputs being logged in Braintrust.
01:01:12.320 | So, if you've gone that far and you have the dependencies installed, then I would recommend running pnpm dev.
01:01:19.320 | And now you'll have your application up and running, interact with it a few times, and you'll see that populate in your project logs.
01:01:27.320 | Once you do that, then you can go to your online scoring settings, set up a rule, and you can keep interacting with the app, and now you'll see it populate with that online score that you just enabled.
01:01:40.320 | Maybe a quick example for those that are still having sort of Wi-Fi issues.
01:01:53.320 | So, I'll come back down here.
01:01:55.320 | So, I'm going to come and spin this up.
01:02:01.320 | So, as Carlos mentioned, pnpm dev.
01:02:03.320 | This is going to spin up that server on localhost:3000.
01:02:07.320 | You should see something that looks like this.
01:02:12.320 | There's a few things that you can just click these.
01:02:16.320 | This is sort of like the easy button to get going.
01:02:18.320 | This is the GitHub URL.
01:02:20.320 | Again, it's looking for the commits that have been made since the last release.
01:02:24.320 | And it's going to summarize, again, using the prompts that we've configured here, and then start to categorize them.
01:02:30.320 | But now the interesting part of this, if you come back into the Braintrust platform, is if you look at the logs.
01:02:38.320 | So, this is what just happened on the Braintrust side.
01:02:42.320 | So, we sort of have this top-level trace, the generate change log request.
01:02:46.320 | And then, essentially, you can think of these as the tool calls, right?
01:02:50.320 | We're getting the commits.
01:02:51.320 | We're getting the latest release.
01:02:52.320 | We're fetching those commits.
01:02:54.320 | We're then loading that prompt.
01:02:55.320 | So, we're actually loading the prompt from Braintrust.
01:02:58.320 | And then you can start to, you know, you can click through a lot of these.
01:03:01.320 | And as Carlos mentioned, when you use those wrappers, you get all of this sort of goodness out of the box, right?
01:03:05.320 | So, what are the number of prompt tokens?
01:03:07.320 | What is the estimated cost?
01:03:09.320 | This becomes really helpful as you start to monitor over time, right?
01:03:14.320 | Probably not set up yet because we don't have too much going on.
01:03:17.320 | But, like, actually understanding what does that token amount look like over time?
01:03:21.320 | What does the cost look like over time as we change models?
01:03:24.320 | And so on.
01:03:25.320 | But that's how really easy this is.
01:03:27.320 | And maybe just to complete that loop, if you come over to
01:03:35.320 | the app/generate/route.ts file, you'll start to see some of this stuff.
01:03:38.320 | So, I'll just highlight a couple things.
01:03:40.320 | We're wrapping this AI SDK model.
01:03:43.320 | So this, again, is how we're really getting all of those metrics.
01:03:47.320 | And it's really allowing for us to log a lot of that information with very little lift from ourselves as developers.
01:03:54.320 | But then you also have the ability to configure things in a different way.
01:03:58.320 | So maybe we have different inputs or different outputs that we want to actually log in a particular span.
01:04:03.320 | Or actually, we want to log metadata, right?
01:04:05.320 | This becomes really powerful when we want to actually go into those views, right?
01:04:09.320 | And we can actually start to filter these things down.
01:04:11.320 | We can even filter by that metadata.
01:04:13.320 | So this is where, again, you can hit the easy button.
01:04:16.320 | We're going to wrap our LLM client with those SDKs.
01:04:19.320 | Or we can actually get a little bit more detailed and start to log the particular input and output information that we want, metadata.
01:04:26.320 | And now we can sort of set these different things.
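Roughly what that wrapping can look like in a Next.js route handler, with the model choice and prompt as stand-ins rather than the repo's exact code:

```typescript
// app/generate/route.ts (simplified sketch)
import { wrapAISDKModel } from "braintrust";
import { openai } from "@ai-sdk/openai";
import { generateText } from "ai";

// Wrapping the Vercel AI SDK model is what captures tokens, cost, and latency.
const model = wrapAISDKModel(openai("gpt-4o-mini"));

export async function POST(req: Request) {
  const { commits } = await req.json();
  const { text } = await generateText({
    model,
    prompt: `Summarize these commits into a changelog:\n${commits}`,
  });
  return Response.json({ changelog: text });
}
```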
01:04:28.320 | So if I come out here, I can even create -- actually, when we add in the scores here, we can create filters based on those scores.
01:04:37.320 | So I want to create a view that says, hey, any time my completeness score goes below 50%, I'm going to create a view for this.
01:04:43.320 | This is going to enable my human reviewers to go in and actually understand that.
01:04:47.320 | And then if you look up here in the top right, we can actually add this span to a data set really easily.
01:04:51.320 | So we find this thing, right, that will actually add a lot of value in that offline eval type of process.
01:04:56.320 | Click this.
01:04:57.320 | Now we have a net new row in that data set that now adds a lot of value, right?
01:05:01.320 | This is sort of like that feedback loop, right?
01:05:03.320 | We've done that offline eval type of work.
01:05:06.320 | We have found the right prompt, the right model, all these different things.
01:05:09.320 | It's in production.
01:05:10.320 | We are logging it.
01:05:11.320 | Now we're sort of understanding that, and maybe in a human review type of way.
01:05:15.320 | Add that span to the data set.
01:05:16.320 | This adds, again, to the offline type of portion.
01:05:19.320 | Again, you just see this, like, this sort of flywheel effect of creating really powerful GenAI apps.
01:05:31.320 | Is there an eval score generated for this online log as well, like as we ran it?
01:05:37.320 | Is there an eval score created for it?
01:05:39.320 | And do we add it to the data set based on if the eval score is good because we don't want, like, bad examples in the data set?
01:05:46.320 | That's one way of thinking about it.
01:05:47.320 | You don't need to run an eval as the user's interacting with it.
01:05:51.320 | That's what the online scoring does.
01:05:53.320 | So once you set up that online scoring rule, it'll output a score based on the judge that you've chosen as the online scoring rule.
01:06:04.320 | And exactly what you said, right?
01:06:06.320 | You could either filter it and select the good responses, add those to a data set, or vice versa, select the bad responses and understand why are they bad?
01:06:16.320 | How do I improve them, make them better?
01:06:19.320 | Just to complete that and kind of how do we configure this, right?
01:06:25.320 | Carlos walked through, like, what this would look like.
01:06:28.320 | This is, you know, my rule, right?
01:06:30.320 | Obviously, you would call that something a little bit better, but I actually want to, you know, add in my scores for those online logs, right?
01:06:38.320 | And then you would, you probably wouldn't do 100%, but we're just going to do that for this instance.
01:06:43.320 | And now when I come back to my application here, right, maybe I want to do, just do a quick refresh.
01:06:51.320 | So now when these logs happen within the brain trust side, we're actually going to run those scores against that output.
01:06:58.320 | So we'll understand, based on what happened here, how did it score on formatting?
01:07:02.320 | How did it score on correctness?
01:07:05.320 | And then this also can now layer into, so you can see here, right, we have a 25% on the accuracy, 100% on the completeness.
01:07:18.320 | So maybe we have a little bit of work to do.
01:07:20.320 | But now if I click into this, this is where you can start to create different things within, like, the view portion here. So this is a filter.
01:07:30.320 | So maybe I want to change this to anything that is less than, let's say, 50%.
01:07:37.320 | Now I can save this as a view, and my human reviewers are able to now come in here, open up this view, and look at all of the logs where my accuracy score is less than 50%.
01:07:49.320 | And now we can, again, create that sort of iterative feedback loop.
01:07:53.320 | Any questions on this section?
01:08:08.320 | Yeah, maybe a good segue to the human in the loop.
01:08:13.320 | This becomes really -- oh, sorry.
01:08:20.320 | I had a question about using Braintrust and implementing it on existing projects.
01:08:27.320 | Is it something that's easy to do? With something like LangSmith, you can just add, like, a couple lines and it'll trace everything for you.
01:08:36.320 | Is it the same in Braintrust, is it?
01:08:38.320 | Or do you have to refactor all your prompts to use, like, the Braintrust prompts?
01:08:42.320 | It's essentially the same thing.
01:08:45.320 | So, like, LangSmith has code to wrap an LLM client, right?
01:08:49.320 | Or it has decorators to put on functions.
01:08:52.320 | Mm-hm.
01:08:53.320 | It's the exact same thing on the Braintrust side.
01:08:55.320 | Got it.
01:08:56.320 | Yeah.
01:08:57.320 | So then if you do that, is it easy to then use, like, create data sets from those logs?
01:09:03.320 | Yeah, absolutely.
01:09:04.320 | As long as the logs that you're producing map to the structure of the data set that you've created,
01:09:13.320 | then absolutely. You just click that button, we're going to add that span to the data set,
01:09:18.320 | and it becomes really easy to connect those two things.
01:09:21.320 | Cool.
01:09:22.320 | Thanks.
01:09:23.320 | Yeah, of course.
01:09:24.320 | Yeah, good question.
01:09:25.320 | Okay, yeah, so let's talk maybe a little bit, really quickly, about the--
01:09:30.320 | Oh, there's a question over here.
01:09:32.320 | Why not?
01:09:33.320 | Super, super quick.
01:09:34.320 | With the sampling rate, is there a way to just override that, where if certain inputs are received,
01:09:39.320 | you can force that to be included in your sample set?
01:09:42.320 | Like if you have some manual user input, you know, you just push that back into the system on the fly,
01:09:48.320 | and it's not in your sample rate?
01:09:51.320 | The way that you would go about that is changing the span that is targeted.
01:09:56.320 | So instead of it applying to the root span, you would specify a span that only happens if a certain criteria is met, right?
01:10:03.320 | Okay.
01:10:04.320 | So then it could be 100% or 50% of just when that span appears.
01:10:08.320 | Okay, so yeah, this is where we could bring sort of like the human in the loop type of workflow.
01:10:20.320 | This is where we want to actually maybe bring in subject matter experts.
01:10:23.320 | And Carlos mentioned like product managers, maybe SMEs.
01:10:26.320 | We also have doctors coming into the platform and actually evaluating some of this stuff, right?
01:10:31.320 | These are the people that actually understand whether or not that output that's created by that large language model is valid, is good, right?
01:10:37.320 | And this is a really powerful thing to have as part of the process to building really powerful AI applications.
01:10:44.320 | We can catch hallucinations.
01:10:46.320 | And for establishing that solid foundation, that ground truth, having that human-in-the-loop type of person becomes really beneficial here.
01:10:59.320 | So why does this matter, excuse me?
01:11:02.320 | It's really critical for quality and reliability.
01:11:06.320 | Like we were just talking about like with LLMs and being able to trust whether or not they can do the same thing over and over again.
01:11:11.320 | It's non-deterministic.
01:11:12.320 | Automations can miss that nuance, right?
01:11:14.320 | We want to be able to sort of apply that human type of review to the things that we're doing on the AI side with LLM as a judge type score.
01:11:24.320 | We also want to help you make sure the final product meets the actual expectations of the user, right?
01:11:40.320 | So the user is going to have a much better understanding of what the final output should be.
01:11:40.320 | So having that person in the loop to look at those outputs becomes really powerful to ensuring that you build really, really strong Gen AI applications.
01:11:51.320 | Two types of human in the loop.
01:11:53.320 | There is the human review.
01:11:56.320 | So this is where these people are actually going to go into the Braintrust platform and actually manually label, score, and audit those interactions that the user had with the AI application.
01:12:07.320 | And then there's actual feedback from the user real time.
01:12:11.320 | So this is like, you know, a thumbs up, thumbs down button within the application saying, hey, you did a really good job.
01:12:16.320 | You did a really bad job.
01:12:17.320 | But now we can -- you can sort of use these together as well.
01:12:20.320 | I can now create a view within Braintrust that filters down to any of my user feedbacks of zero.
01:12:26.320 | So thumbs down.
01:12:27.320 | And I actually want to review whether or not those things are bad.
01:12:31.320 | And if they are bad, I can add those to the data set, again, creating that sort of flywheel effect.
01:12:35.320 | Just really quick here.
01:12:39.320 | If you go -- if you look at that application, you're able to actually click one of these, you know, thumbs up or thumbs down.
01:12:45.320 | Create a comment.
01:12:46.320 | Really good.
01:12:47.320 | And then this is now logged back to the Braintrust application.
01:12:51.320 | So if you look back at our logs, I'll remove our filter.
01:12:57.320 | So we should have a user feedback score now here of 100%.
01:13:01.320 | And then we should have a comment over here as well.
01:13:03.320 | Really good.
01:13:04.320 | But then, again, these are the things that we can now -- if I open this back up, I can create a filter on my user feedback score.
01:13:10.320 | Now I want to understand all of my logs where user feedback is one or zero.
01:13:15.320 | And then I can do something from there.
01:13:17.320 | But this is done very easily via the logFeedback function within Braintrust.
01:13:23.320 | You provide it the span that was already created within that log.
01:13:28.320 | And then you just provide that user feedback to it.
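A small sketch of what that call can look like on the server side, assuming you kept the span id from the original request; the function name and score key are illustrative:

```typescript
import { initLogger } from "braintrust";

const logger = initLogger({ projectName: "changelog-generator" });

// Called when the user clicks thumbs up / thumbs down in the app.
export function recordUserFeedback(spanId: string, thumbsUp: boolean, comment?: string) {
  logger.logFeedback({
    id: spanId,                                  // ties the feedback to the existing log/span
    scores: { user_feedback: thumbsUp ? 1 : 0 }, // shows up as a score on that log
    comment,                                     // optional free-text note
  });
}
```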
01:13:32.320 | You can also enter -- really quickly here in the platform -- you can enter human review mode.
01:13:48.320 | So this is a way in which -- it's sort of hiding away some of the different -- you know, some fields that may not be really relevant for those people that are coming in and doing human review.
01:14:00.320 | Oh, I haven't actually configured any scores yet.
01:14:03.320 | So you can actually see you would come out here and you would create different scores for that human to go in and do that review.
01:14:10.320 | Whether it's option-based, free-form input, or a slider.
01:14:15.320 | So is this, you know, maybe thumbs up, thumbs down, sort of like yes or no.
01:14:19.320 | Or maybe you could do something a little bit more verbose, A, B, C, D, whatever it is.
01:14:24.320 | But this is where you create these scores.
01:14:26.320 | Now they exist as part of those logs.
01:14:28.320 | Those humans can go in and now look at the input and the output, give their review of it.
01:14:33.320 | Again, adding to that, again, like that flywheel effect that we need to create.
01:14:38.320 | Yeah, I wanted to add there as well.
01:14:41.320 | This is really helpful for evaling your LLM as a judge.
01:14:45.320 | Oftentimes we see customers use this process to provide some ground truth for their data sets and also for the LLM as a judge, right?
01:14:54.320 | So you can imagine having a team of SMEs come in, they review, they do thumbs up, thumbs down on maybe five different criteria qualities that they're measuring.
01:15:06.320 | And then they provide that data to a playground where the prompt is the LLM as a judge and they go through the playground and they test to make sure that the LLM as a judge prompt matches what the humans thought.
01:15:19.320 | So just something there to think about.
01:15:21.320 | But as Doug is saying, it's a great feedback loop.
01:15:24.320 | It's a great flywheel effect that can be created when you add this human to verify and confirm.
01:15:33.320 | Cool.
01:15:35.320 | That is it for the workshop.
01:15:38.320 | We do have a few minutes left.
01:15:40.320 | We can certainly answer a couple more questions.
01:15:42.320 | Yeah.
01:15:43.320 | So for people who are successful with this, how much time are they spending going back and forth,
01:15:51.320 | putting people around it and validating and testing themselves before they get into more live testing?
01:15:56.320 | And then how often are they going back?
01:15:58.320 | What's the kind of balance of time on task?
01:16:02.320 | Offline versus online evals?
01:16:04.320 | Yeah.
01:16:05.320 | And how much you have to do up front to really get the best results?
01:16:08.320 | Or can you really just put something down, figure it out later and optimize on the fly?
01:16:13.320 | You don't want to get stuck in analysis mode.
01:16:17.320 | Right.
01:16:18.320 | Yeah.
01:16:19.320 | The question, just to repeat it, is how much time do you have to invest up front to get value?
01:16:24.320 | Should you keep going over it to try to optimize?
01:16:28.320 | Or better to just start quickly with minimal scores, minimal data set, and then keep improving?
01:16:34.320 | And I would say the latter, right?
01:16:35.320 | You don't want to be fixated on creating a golden data set or 20 scores.
01:16:40.320 | Like if you have one or two scores and you have 10 rows in a data set, it's going to be tremendously helpful.
01:16:46.320 | And then from there, it's all about iteration.
01:16:48.320 | So just going back and improving, adding some more rows, adding another score, tweaking the scores.
01:16:54.320 | But you really just want to get started quickly.
01:16:57.320 | Yeah.
01:16:58.320 | So you've mentioned some elements of this scoring.
01:17:06.320 | There's the function that you want to test; you have to define the test steps, if you will.
01:17:13.320 | One of the challenges that we are finding is our actual application does change.
01:17:20.320 | And it could change biweekly.
01:17:23.320 | It could change monthly.
01:17:25.320 | Is there a way to look at trying to automate changing the actual function that you now need
01:17:32.320 | to change to match the way that your application logic has just changed this week from two weeks ago?
01:17:40.320 | Would you say -- oh, go for it.
01:17:43.320 | Yeah.
01:17:44.320 | I guess I was just going to, again, clarify to make sure I understood.
01:17:47.320 | So you're saying that the scorer, the actual scoring function is going to stop being useful.
01:17:57.320 | It's going to become obsolete.
01:17:58.320 | It's going to become too old to actually gauge the quality.
01:18:01.320 | Not just the scoring function but the actual steps that you want to test.
01:18:06.320 | So, you know, this week there might be only two turns.
01:18:09.320 | Just giving a very simple example.
01:18:11.320 | And in two weeks, in the next sprint, there are now five turns in your app because the logic has changed.
01:18:19.320 | And now you have to update, of course, I think the function element.
01:18:25.320 | There's probably no way around it.
01:18:27.320 | I'm just curious about whether you guys have thoughts about how that could be improved or made easier.
01:18:37.320 | Well, I think your task will always change, right?
01:18:40.320 | Right?
01:18:41.320 | The thing that we're trying to build, that's where Braintrust helps, because when we do go make that change, we actually understand whether or not that change improved our application or regressed it.
01:18:52.320 | So, like, we're not going to say stop making changes to the underlying.
01:18:58.320 | No, I understand that.
01:18:59.320 | So, it is inevitable that the application is going to be changing.
01:19:04.320 | Yeah.
01:19:05.320 | And you're going to have to constantly update the test, the function that you're actually wanting to mimic in your test.
01:19:15.320 | It's very similar to traditional software testing.
01:19:18.320 | Yeah.
01:19:19.320 | You don't want to write a test that lasts for a day or, you know, a week, right?
01:19:23.320 | You want to think of robust tests that will live on for months or years and will actually measure the underlying quality of the application that will be long-lived.
01:19:37.320 | So, I think it's more of how do you optimize the scores to measure qualities that will still be around even if you add some additional steps in the task.
01:19:47.320 | No, it's worse than that because unless you have those additional steps in your function, you're not mimicking your application's logic.
01:19:57.320 | You're still using the logic from, you know, last sprint.
01:20:01.320 | So, no matter how good your scoring could be, it's no longer reflecting what your application is doing this week or this sprint.
01:20:09.320 | I think, like, regardless of, like, how many steps you have, like, there's still an input and there's still an output that we want to score against.
01:20:17.320 | Correct?
01:20:18.320 | Yes, but I think one of the things you need to do is to first define how you're going to arrive at the score.
01:20:26.320 | The input comes in and now maybe you have three turns and then because you're mimicking your app and then you get your output from these three turns.
01:20:38.320 | Your app just got upgraded.
01:20:40.320 | There are now seven or five turns or whatever.
01:20:43.320 | Yeah, so when you're writing the evals, you can dynamically call the task.
01:20:48.320 | So, as you're working on your application and it's changing, you're still pointing to the changing app.
01:20:57.320 | So, the idea is that when you are wanting to merge into main, you open a PR and then your evals will run on those new changes.
01:21:07.320 | You don't need to go in and update the .eval.ts files.
01:21:11.320 | They will now reference the updated task application that you're trying to understand the underlying quality for.
01:21:19.320 | If that makes sense.
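In other words, the .eval.ts file can import and call the application's own task function, so the eval always exercises the current logic. A rough sketch, where the import path and the generateChangelog export are hypothetical:

```typescript
// changelog.eval.ts — points at the live application code instead of a frozen copy.
import { Eval } from "braintrust";
import { Factuality } from "autoevals";
import { generateChangelog } from "../app/generate/changelog"; // hypothetical export of the real task

Eval("changelog-generator", {
  data: () => [
    { input: "feat: add dark mode toggle", expected: "Added a dark mode toggle." },
  ],
  // The eval calls the same function the app uses in production, so when the app
  // gains extra steps or turns, the eval exercises the new logic automatically.
  task: async (input: string) => generateChangelog(input),
  scores: [Factuality],
});
```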
01:21:20.320 | So, I think the question again is, are the scores, is the underlying logic something that you can trust and that will live on?
01:21:29.320 | Again, it's not easy and it's something that is changing, but that's what we're hearing from customers is investing in that.
01:21:37.320 | At Braintrust, when you send evals to humans, SMEs, what's the name of that role and how are you managing that?
01:21:48.320 | Like, I'm guessing to some extent it was originally the team, right, but that can't scale.
01:21:54.320 | So, how are you managing that?
01:21:57.320 | I think it's like organization specific.
01:21:59.320 | I don't know if there's a specific.
01:22:01.320 | I'm saying your organization.
01:22:02.320 | Using your own tool, how are you managing the SMEs yourself?
01:22:08.320 | I don't think we're using any SMEs at the moment.
01:22:11.320 | We don't, we're not a healthcare company or a legal tech company where we heavily rely on specialized knowledge in that degree, you know.
01:22:22.320 | But you're not doing human evals of your own product?
01:22:26.320 | We just now started, we just now branched into having an AI component to our application.
01:22:32.320 | So, we haven't needed to go there just yet.
01:22:36.320 | But we, you know, we talked to a lot of customers that are working in those specific industries with those use cases and they will sometimes hire external services that will do the annotations for them.
01:22:51.320 | Or they'll bring them into Braintrust and, you know, they'll be using the platform just to review.
01:22:57.320 | So, they have a specific role within Braintrust.
01:22:59.320 | And there's a specific view that they would operate in that's just for annotation.
01:23:04.320 | Yeah.
01:23:05.320 | Great.
01:23:06.320 | Yeah.
01:23:07.320 | Another question over here.
01:23:09.320 | I was just curious, because we're using out-of-the-box AI models here and are not really fine-tuning the models as the application progresses.
01:23:18.320 | Are we, do we have a way to, like, do some few-shot example prompting from the dataset and the eval scores that we are already using?
01:23:28.320 | So, is there some feature like that where I can use the datasets or the online logs that are added to the datasets?
01:23:35.320 | If the eval score is good, use it as an example for future prompts to just make the prompt better because the models are out of the box.
01:23:43.320 | Yeah.
01:23:44.320 | So, question around few-shot prompting, providing examples to the prompt of the ideal response.
01:23:49.320 | That's something that you can do today within the dataset.
01:23:52.320 | In the metadata column is where you can provide the few-shot examples that you want for each row.
01:23:58.320 | And then when you're running that eval or messing around in the playground, it'll reference the few shots in the metadata.
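For example, a dataset row's metadata column can carry the few-shot examples alongside the input and expected output. A sketch with made-up project, dataset, and field names:

```typescript
import { initDataset } from "braintrust";

const dataset = initDataset("changelog-generator", { dataset: "golden-set" });

dataset.insert({
  input: "fix: null pointer on login",
  expected: "Fixed a crash that occurred on login.",
  metadata: {
    // Few-shot examples a prompt or playground run can reference for this row.
    fewShotExamples: [
      { input: "feat: add dark mode toggle", output: "Added a dark mode toggle." },
      { input: "chore: bump dependencies", output: "Updated dependencies." },
    ],
  },
});

await dataset.flush(); // make sure queued rows are actually sent
```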
01:24:05.320 | Got it.
01:24:06.320 | But what about, like, the online testing stuff, right?
01:24:09.320 | Or the online logs, whatever you call it?
01:24:11.320 | Like, when users are actually using the application and it's hitting the prompt, then can the prompt real-time use those examples from the datasets as well?
01:24:21.320 | Right now, it's not something that Braintrust facilitates.
01:24:26.320 | Within the SDK and building your own logic, like, you could come up with a workflow like this.
01:24:31.320 | But natively in the platform, we're not facilitating, like, live traffic into few-shot examples.
01:24:37.320 | Got it.
01:24:38.320 | Makes sense.
01:24:39.320 | Sounds good.
01:24:40.320 | Great.
01:24:41.320 | Well, thanks, everyone.
01:24:43.320 | I know we're over time.
01:24:44.320 | Really great to have you all here for our first workshop of the day.
01:24:47.320 | I hope you can walk away with some ideas of how you can improve your eval workflow.
01:24:53.320 | And, you know, our team is here.
01:24:55.320 | We have a booth just outside of this.
01:24:58.320 | So, feel free to stop by.
01:24:59.320 | We can answer more questions, have a conversation.
01:25:01.320 | Yeah.
01:25:02.320 | Thanks, everyone.
01:25:03.320 | Thank you all.
01:25:04.320 | We'll see you next time.