
How to build world-class AI products — Sarah Sachs (AI lead @ Notion) & Carlos Esteban (Braintrust)



00:00:00.000 | Carlos Esteban:
00:00:13.000 | Wow, look at this turnout.
00:00:16.280 | It's kind of crazy how many people are excited to listen to Sarah and hear what Notion AI
00:00:20.840 | has been building.
00:00:22.560 | So just going to do some quick introductions before handing it over to her.
00:00:26.740 | My name is Carlos Esteban.
00:00:27.880 | I'm a Solution Engineer here at BrainTrust.
00:00:30.680 | So Doug and I are going to share a bit about BrainTrust after Sarah presents.
00:00:36.940 | Just wanted to say hi.
00:00:38.640 | You'll see me talk a bit more today.
00:00:41.580 | I've been at BrainTrust for six weeks.
00:00:43.640 | Previously, I was at an infant company.
00:00:46.260 | Yeah, the six weeks may be funny, but also I'm the veteran on the team.
00:00:51.100 | So Doug here.
00:00:52.100 | It's his third week.
00:00:53.100 | So we're up here in front of you and going to teach you about BrainTrust.
00:00:57.760 | We have some demos.
00:00:58.660 | Yeah, if you want to go ahead.
00:00:59.560 | Yeah, sure.
00:01:00.560 | Solutions Engineer also alongside Carlos.
00:01:03.160 | Like you mentioned, been here a full three weeks at this point.
00:01:05.900 | Actually, not even full.
00:01:06.820 | It's like my, I don't know, 11th or 9th day.
00:01:10.780 | But yeah, this is incredibly exciting to be here.
00:01:13.520 | Obviously, a lot of interest in Sarah's talk.
00:01:15.840 | And hopefully, we can teach you a little bit about BrainTrust in the process.
00:01:18.820 | So yeah, I just want to present Sarah.
00:01:21.320 | She's the lead of Notion AI.
00:01:23.220 | They've been customers pretty much since the beginning of BrainTrust, definitely pioneered
00:01:27.620 | the platform.
00:01:28.620 | And yeah, super exciting to hear what she has to say about their journey.
00:01:32.380 | Awesome.
00:01:33.380 | Thanks.
00:01:34.380 | I've been working with Braintrust for 10 months.
00:01:35.280 | So I might be the most experienced BrainTrust user on the panel.
00:01:38.220 | Just kidding.
00:01:41.220 | I mean, I think I want to save time for questions, too, before we dig into the workshop.
00:01:46.080 | I think at a higher level, what I tell the team and what I tell people who are also building
00:01:51.840 | is like all of the rigor and excellence that comes from building great AI products comes from
00:01:56.560 | observability and good evals.
00:01:58.200 | And that's how you scale an engineering team.
00:02:00.840 | That's how you build good product.
00:02:02.420 | And ultimately, we spend maybe 10% of our time prompting and 90% of our time looking at evals
00:02:09.440 | and iterating on our evals and looking at our usage in BrainTrust.
00:02:12.620 | And that, I believe, is the right balance of work in order to know that you're not just
00:02:17.600 | shipping something that worked well in a demo with your VP, or that you got working on the
00:02:23.880 | Caltrain ride to work and that finally did what you wanted that one time, but something that actually works consistently
00:02:28.960 | and for the users you were curious about.
00:02:31.380 | So I'll speak a little high level, save some time for questions, and then I hope everyone
00:02:35.140 | sees value in the workshop as well because it's really a tool that is accessible to all different
00:02:41.500 | levels, like I have tech leads on our team that themselves are true experts in BrainTrust,
00:02:46.480 | someone like me that might not be necessarily executing experiments, but I'm in the platform
00:02:50.360 | every day, and then a whole host of data labelers and specialists that work out of the platform
00:02:54.340 | as well, which we'll talk about.
00:02:56.560 | So, who here has used Notion?
00:03:01.540 | Okay.
00:03:02.540 | Love it.
00:03:03.540 | Who here has used Notion AI?
00:03:05.540 | Okay.
00:03:06.540 | Some sales opportunities.
00:03:07.960 | So, what is Notion AI?
00:03:11.080 | Oh, that's me.
00:03:12.020 | I've been at Notion for about 10 months, as I said.
00:03:15.460 | Notion, this idea, for those of you that didn't raise your hand, is a connected workspace.
00:03:20.700 | What are we connecting?
00:03:22.140 | We're connecting everything from workplace management, asynchronous work, documents, but we also connect to third-party
00:03:28.600 | tools like Slack and Jira and Google Drive.
00:03:31.100 | We have 100 million users, over 100 million users now.
00:03:35.720 | And something that's really important to note about Notion AI is we offer a free trial of all
00:03:40.420 | of our AI products, almost all of them.
00:03:42.300 | And what does that mean?
00:03:44.500 | It means that whatever we build has to support that scale.
00:03:47.840 | So, if you looked at the balance of who raised their hands, right, maybe not every user is an active
00:03:53.420 | Notion AI user, but we want to offer Notion AI at that scale.
00:03:57.220 | So, a new feature, for instance, might concurrently have far more users than paid enterprise
00:04:02.560 | plan users or people on our business plan.
00:04:04.480 | And I like to think we're known for our exceptional customer experience.
00:04:08.200 | We're certainly a design-driven company, which is to say that we care deeply about our product,
00:04:13.560 | and there's a lot of polish that's associated with the brand.
00:04:17.080 | And polish is not often associated with Gen AI experiences.
00:04:20.300 | So, how do you add that level of polish and care into what you're building while still building
00:04:27.500 | at the speed and rate of acceleration that exists in the industry, right?
00:04:31.280 | For instance, we're very proud that we partner with a lot of foundation model providers.
00:04:36.040 | We also fine-tune our own models.
00:04:37.580 | Anytime a new model is released, within usually less than a day, we're able to give that to all
00:04:43.240 | of our users in production.
00:04:45.080 | In order to move at that pacing, but still have a polished product, you need to have evals,
00:04:50.220 | and you need to have integration with a product like BrainTrust.
00:04:53.500 | So, here's our latest launch.
00:04:56.440 | This was two weeks ago, a suite of projects that we just launched with AI.
00:05:03.420 | No audio, but I'll give you a little voiceover.
00:05:05.640 | So, AI meeting notes is kind of our first.
00:05:10.100 | So, we now have text-to-speech, or speech-to-text, excuse me, along with transcription AI generated summaries,
00:05:22.640 | and fairly soon you'll be able to see those also interact with your task database, your whole workspace,
00:05:27.900 | workspace awareness, when talking about action items.
00:05:31.060 | We have an enterprise search product that, when you're searching, can search across everything.
00:05:36.740 | I use this constantly, particularly in stand-ups, when they're talking about Slack threads that are like 50 messages deep that I stopped following at 10 p.m.
00:05:45.160 | I can use Notion AI, and for much deeper searches, we have a deep research tool.
00:05:49.420 | That deep research executes parallel searches, serves a lot of our fine-tuned agentic capabilities,
00:05:55.380 | and is kind of our first transition out of workflows into agents, where rather than having like a set list of tasks or flows that your AI program is running through,
00:06:07.940 | now we're giving it the reasoning capability to decide on different tools and spend longer on the work that it's doing.
00:06:14.480 | So that's Notion AI for work.
00:06:15.940 | That's the latest suite that we built, and I'll talk a little bit about what we've learned as we've built this latest suite,
00:06:22.560 | and as well, older generation Notion AI products.
00:06:25.520 | Let's see.
00:06:27.800 | No, we already watched that.
00:06:30.920 | Okay.
00:06:34.780 | So, believe it or not, Notion AI actually came out before ChatGPT.
00:06:39.680 | That's kind of its own story that none of you are here to hear about,
00:06:43.660 | but we had early access to the generative models and always believed that content generation was core to how Notion worked.
00:06:49.360 | So our first product launched right around the same time.
00:06:52.020 | It was called the AI Writer.
00:06:53.580 | It allowed you to generate just in line, you know, write a sentence about XYZ, right?
00:06:59.060 | From there, we started building Autofill.
00:07:01.500 | Autofill is our AI kind of agent that lives in every database property.
00:07:04.960 | Notion allows you to have databases.
00:07:06.280 | Now, all of a sudden, the AI is not just acting on the page, but is acting across the database and is triggered frequently and is doing things like translating things so that every column is in a different language.
00:07:17.100 | This is where we started seeing a lot of usage that was more unpredicted from our users, and it's around the same time that we also started building a core AI team.
00:07:27.220 | Then we built kind of a natural RAG solution.
00:07:29.140 | This was, I should say, like, the first time that we had a full data-platform-and-AI collaboration, and this is where we started offering, for instance, Q&A to, like, free users.
00:07:43.480 | So we have to have embeddings for everyone.
00:07:45.020 | We have to think about multilingual workspaces, things like that.
00:07:47.820 | And then now we get to the era where we started working with Braintrust.
00:07:52.240 | So you saw you could search across all apps.
00:07:54.160 | We launched the idea to have attachments.
00:07:56.580 | A lot of people use Notion as file storage, or they'll upload things to Notion.
00:08:00.620 | We can search over those attachments.
00:08:02.060 | And finally, the things that we just talked about.
00:08:04.520 | So one thing I want to note is that we didn't start with what I just showed you, and I think that would have been quite naive.
00:08:10.940 | Obviously, with the technology we have today, that would have been quite easy.
00:08:13.860 | But we also knew that we didn't have the technology to build those things yet, and we worked really with where the models were most capable.
00:08:19.280 | What makes it hard to evaluate?
00:08:23.480 | So number one, these are exceptionally large data sets.
00:08:27.060 | Notion's really lucky because we use Notion constantly when we work on Notion.
00:08:31.380 | And so we can generate a lot of training data or evaluation data just from our own dogfooding.
00:08:37.020 | I think that's actually one of the unique advantages that allows us to be one of the more fast-paced enterprise AI solutions.
00:08:42.420 | Similarly, our human evaluators were just exceptionally overwhelmed.
00:08:46.480 | Like, how do they look at it – they were working in Google Sheets.
00:08:50.200 | I think we hired them even before we onboarded Braintrust.
00:08:52.620 | We have data specialists, and they were looking at this, like, dump in Google Sheets
00:08:56.420 | and trying to figure out how to parse the prompt with, like, a Google Sheets formula, right, and figure out what did the user actually say, how do I play with the few-shot examples.
00:09:04.540 | It's very involved.
00:09:06.460 | And I think anyone that utilizes human labelers well – and it's also been proven in research – is that, particularly when it comes to fine-tuning but also iteration, quality is much more important than quantity in terms of the insights and things that you extract.
00:09:20.280 | And so we definitely needed a scalable, efficient solution to support how we were looking at our data and also keeping track of user feedback, all of the thumbs-downs that we were getting in the development usage.
00:09:31.500 | So this is kind of what our iteration cycle looks like.
00:09:35.060 | So let's say that we want to decide on an improvement.
00:09:37.940 | So, for instance, let's say that we wanted to launch a JIRA connector in our universal search product, right?
00:09:44.120 | What does it mean to query on JIRA?
00:09:46.180 | How do I have my AI even figure out what's a task, what's a sprint, how to query from the JIRA workspace?
00:09:52.480 | We then curate targeted data sets from that workspace.
00:09:57.200 | You'll hear me keep mentioning these data specialists. For much smaller enterprises,
00:10:01.760 | that might be your PM.
00:10:02.940 | I would highly encourage it to also be your engineer.
00:10:05.800 | It should be the people that are closest to the data.
00:10:08.120 | We're at a scale now where we have a specialty that's kind of like an LLM trainer that's a mix of a PM and a data analyst and a data annotator.
00:10:16.220 | Those are our data specialists.
00:10:17.300 | But they are creating, just from logs and using a product and prototype, handcrafted data sets.
00:10:24.740 | Those can be 10 things, right?
00:10:26.720 | But I would actually, let's pause, make it 10 things first and make sure that it's formatted the way that you want.
00:10:32.200 | We have definitely done the bad thing of creating lots and lots and lots of dummy data, and it's not structured the way we want, and that's a real pain.
00:10:39.600 | So first is like making sure everything is structured and your data flywheel is set up successfully to get insights that you might want.
00:10:46.800 | Then we tie them to scoring functions.
00:10:48.820 | Scoring functions should come after you've looked at the data for a long time.
00:10:53.100 | There are a lot of out-of-the-box scoring functions.
00:10:55.300 | We don't use them frequently.
00:10:56.500 | We tend to use things that are specific to the product.
00:10:59.640 | I'll talk a bit about the LLM as a judge process that we use.
00:11:03.300 | That's probably what we rely on the most.
00:11:05.200 | But there are also a lot of things that are heuristic-based.
00:11:07.700 | So, for instance, if I want to make sure that my reasoning model triggers querying from Jira appropriately, then for everything in that data set, the tool call's query target must be Jira, right?
00:11:20.760 | And so we can also make deterministic functions.
00:11:23.640 | Similarly, we have a data set of places where there are multiple languages happening, and we know what the output language should be.
00:11:31.420 | So, for instance, Toyota's a large customer of ours.
00:11:34.300 | Toyota might be working in Japanese and English.
00:11:36.560 | Someone might be asking questions in Japanese, but the output might be like, write me a paragraph about XYZ.
00:11:41.800 | They're asking in Japanese, and, like, the lucky job of the product team is to figure out what language they actually want the response to be, right?
00:11:48.620 | It's a hard problem to have.
00:11:50.140 | We'll have a data set of those as a curated Braintrust data set so we can, before we ship anything, run that eval and make sure that we don't break the multilingual language context switching experience, right?
00:12:01.860 | So for certain experiences, we actually run the eval as kind of like an ad hoc CI experience.
00:12:07.800 | I think there are some companies that actually integrate in CI.
00:12:10.200 | We don't do that currently, but you could.
00:12:12.840 | That's more just to do with how long it takes to run the evals and how CI is built at Notion.
00:12:16.820 | Everyone submitting code would run evals, and that would be a lot.
00:12:19.720 | We inspect the results, and then we keep working.
00:12:19.720 | The other thing I'll say is it's not just engineers that are on Braintrust.
00:12:23.260 | Because of that, we often will have, like, our PMs in Braintrust working very closely.
00:12:33.040 | So, for instance, for that research mode product, we found that people were trying to draft reports much more than they were just trying to do research.
00:12:39.080 | And we found that by looking through all of that thumbs-down data that was propagated into Braintrust from internal dev usage.
00:12:46.240 | That's something that we could escalate to our PM, but we also want the product thinkers and the designers to be thinking about that as, like, a first-class citizen.
00:12:54.840 | You can think of this as your version of UXR, right, is to actually see what the model is doing well and how people end up wanting the model to do different things for them.
00:13:02.600 | And then in terms of that feedback by development, I want to talk a little bit about the LLM-as-a-judge system.
00:13:09.980 | There are kind of two approaches that I've seen pretty prominently in industry.
00:13:13.580 | One is LLM-as-a-judge in which you have one prompt, which judges everything in your data set.
00:13:19.900 | So, for instance, is this information concise?
00:13:23.360 | Is this information faithful?
00:13:24.820 | Those are very common types of LLM-as-a-judge prompts.
00:13:27.840 | There's another version of it, which is a little bit more laborious from a creation standpoint, but I find to be far more insightful,
00:13:34.280 | which is for every single element in your data set or trace or whatever it is that you're evaluating, you have a particular prompt.
00:13:42.040 | So, for instance, I might write a prompt that says, this answer should be in Japanese.
00:13:46.540 | The bullets should be formatted this way, and it should answer XYZ and point to page A, right?
00:13:52.840 | This is because, obviously, Levenshtein distance isn't going to do a very good job of capturing the expected value to what we want, but we actually know exactly what we want the output to look like, and we don't want to have too conservative of a prompt that will always break.
00:14:05.760 | For instance, just having a golden piece of data saying this is what the output should look like.
00:14:09.900 | We actually just say, like, what are the rules for what we want?
00:14:12.080 | This works really well with search evals, actually, because our index is always changing.
00:14:16.760 | When we want to reevaluate how search retrieval goes, we say the first result should be the most recent element in the list about the Q1 offsite, right?
00:14:24.820 | And maybe since we redid that, maybe since we created that element in our data set, there's a new document about the Q1 offsite that wasn't what was in our golden set.
00:14:32.160 | This allows your golden set to be much more up-to-date and current.
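To make that concrete, here is a hedged sketch of a per-row rubric judge in TypeScript; the rubric field, the PASS/FAIL convention, and the judge model are all assumptions for illustration, not Notion's implementation:

```ts
import OpenAI from "openai";

const client = new OpenAI();

// Per-row judge: the rubric lives alongside the dataset row (here in metadata),
// so the "golden" answer is a set of rules rather than a frozen output string.
async function rubricJudge(args: {
  output: string;
  metadata: { rubric: string }; // e.g. "Answer must be in Japanese, use bullets, cite page A."
}): Promise<{ name: string; score: number }> {
  const response = await client.chat.completions.create({
    model: "gpt-4o", // judge model is an assumption
    messages: [
      {
        role: "system",
        content:
          "You are grading an AI answer against a rubric. Reply with PASS or FAIL only.",
      },
      {
        role: "user",
        content: `Rubric:\n${args.metadata.rubric}\n\nAnswer:\n${args.output}`,
      },
    ],
  });
  const verdict = response.choices[0].message.content ?? "";
  return { name: "per_row_rubric", score: verdict.includes("PASS") ? 1 : 0 };
}
```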
00:14:37.180 | Yeah, and so I highly suggest that, and that's actually what our data specialists run, and what's really nice about it, as I mentioned earlier, is our ability to switch to new models.
00:14:46.060 | This can output a score that's far more reliable, and from that score, we can quickly understand if a new model has any serious regressions, and this allows us to, given our modular infrastructure of different prompts for different tasks and different models for different prompts,
00:15:02.180 | we can really quickly say, "Okay, Nano came out. Nano was a fairly effective model that's very fast, maybe lower reasoning, and very cheap.
00:15:11.660 | What are kind of the high-frequency use cases where we don't need advanced reasoning, but we do want it to be able to run quickly?
00:15:17.860 | Let's gather those, and with one button, run all of the evals on them, see which prompts it does or doesn't do well on, and then quickly just change to those prompts, you know?"
00:15:26.980 | And that cycle can become very fast, and that allows us to stay at the top of the frontier in a way that I think has really been advantageous for us as a customer and, more importantly, for our users.
00:15:38.860 | And the outcomes have been crazy. I don't think that we could exist without Braintrust or a similar type of software today.
00:15:44.940 | It is critical to our iteration flow, and it actually is our IP.
00:15:48.740 | So everything that we build, the IP comes from how we evaluate it and how we build it, and that comes from where we build on top of Braintrust.
00:15:55.940 | Obviously, we have prompts that are like a series of strings, and obviously we have code that navigates through them, but how we decide if those work well is just as critical,
00:16:04.780 | and how we decide on model selection, and all of that lives inside of Braintrust.
00:16:08.980 | Similarly, our AI product quality has skyrocketed because now we have observability.
00:16:13.660 | 60% of Notion Enterprise users aren't speaking English.
00:16:17.060 | I will tell you, 100% of Notion AI engineers are speaking English.
00:16:21.020 | So how do we build something that works for a majority non-English speakers?
00:16:25.260 | We have to use rigorous evaluation metrics that understand that multi-lingual experience.
00:16:31.260 | I think that was the last slide I prepared.
00:16:33.460 | I want to save some time to answer any questions that you guys might have.
00:16:37.260 | Yeah, I think there's a mic if you want to, or you can just scream it, and I'll repeat it.
00:16:41.180 | That might be faster.
00:16:42.980 | You had a question?
00:16:44.540 | So for the LLM judge, so what you're saying is actually you guys have multiple judges, and each responsible for a tiny thing to check?
00:16:53.140 | Yeah, the question was for LLMs as a judge, she wanted to clarify that we have multiple judges, each responsible for a smaller thing.
00:17:02.140 | The answer is yes.
00:17:03.140 | Let's think about it in terms of scope of what they're evaluating and magnitude of number of samples that they're evaluating.
00:17:11.020 | Usually we have a variety.
00:17:13.740 | Sometimes we have like a one-to-one mapping where for a single thumbs-down output, we just have a prompt that isn't shared with any other example.
00:17:20.740 | And that's what we engineer.
00:17:22.460 | We have some layers on top of Braintrust that make that easier for our data specialists to write that work with the Braintrust SDK.
00:17:29.580 | But we have that, and then we also have LLMs as a judge that will operate on top of everything that's in that data set.
00:17:37.020 | You'll learn more in the workshop, the distinctions with like data sets themselves and logs, but data sets are hand-curated, and it can be just one aspect of your trace.
00:17:45.020 | So in an AI interaction, you can have like five LLM calls that happen between the user and like the Notion AI response.
00:17:52.200 | We can extract just one of those and put it in a data set.
00:17:54.900 | Yeah.
00:17:55.940 | Yeah.
00:17:57.160 | Have you tried doing any like automated prompt optimization, or is it generally like you'll have to go manually through a person, go look at the results, and then update them?
00:18:05.600 | The question was about automated prompt optimization.
00:18:07.940 | The answer is yes, we have played with it.
00:18:10.140 | I'm not sure that a majority of our problems are as solved by that as they could be in other workplace contexts.
00:18:15.960 | I've been to like dinners and events where I've heard massive success from it.
00:18:19.400 | Maybe we haven't cracked it yet.
00:18:21.100 | A lot of people have, but we've played with it.
00:18:24.400 | One quick question.
00:18:25.400 | Yeah.
00:18:25.400 | What's your process of aligning the thumbs up, thumbs down, with the scoring function?
00:18:31.840 | Yeah, the question was about aligning thumbs up, thumbs down with scoring functions.
00:18:35.720 | They don't.
00:18:38.920 | I mean, so a majority of things in our data set were things that are either thumbs up or thumbs down.
00:18:43.460 | Sometimes it's just legally how it works, because we have a process with some alpha users where that's the only way they give us permission to look at their data for evaluation purposes.
00:18:52.480 | We don't really rely on thumbs up for anything, except that maybe internal Notion folks thumbs-upping things is, like, good golden data for fine tuning, but we don't really rely on it.
00:19:00.860 | There's like no consistency in what makes someone thumbs up something, so we don't look at that that closely.
00:19:05.900 | And for thumbs down data, it's more just that this is a functionality that we know we didn't do our best work on, but that thumbs down could have been given in September of 2023.
00:19:15.480 | And so we don't perform how we did in September 2023, so it doesn't necessarily align with the LLM as a judge, because that's judging a particular experiment, not what the production user experience was at that time.
00:19:28.240 | And that's what makes this really powerful, because we don't just need to look at what our output was in September 2023, right?
00:19:34.260 | Our data can be far more robust and last much longer, because really what we're getting from the thumbs down is the natural language request from the user.
00:19:41.620 | Everything else we can re-modify. We know the state of their workspace and what their request was.
00:19:45.920 | Everything else, you know, changes as our server code changes, which changes the output of the LLM and the engine.
00:19:52.240 | Any other questions? Yeah?
00:19:56.840 | For the LLM as a judge, is the scoring more like binary or is it on a scale?
00:20:01.380 | Yeah, the question was whether LLM-as-a-judge scoring is binary or on a scale.
00:20:04.760 | I'll tell you like the honest answer in practice.
00:20:07.520 | I believe that we do it as a scale, and I don't think that scale is very well calibrated.
00:20:12.500 | And I believe that we just grab everything that's less than a certain score and look at them as all equal.
00:20:16.560 | That's just like how we do it.
00:20:19.160 | I'm sure that there are technical ways that I've read about people that run it multiple times and calibrate their LLMs as a judge.
00:20:25.240 | For us, we still have humans look through those outputs.
00:20:27.500 | We actually will take the failures and pass them into another LLM and ask them to summarize what the failures are so that we can write a quick report for the engineer working on it to see because we have thousands of samples.
00:20:38.080 | So then we actually look at those deltas and pass it to another LLM to tell us what the biggest themes and difference were.
00:20:44.980 | Yeah, that's how we do it.
00:20:46.120 | I actually think there's been a lot of academic research on that calibration, but it has not been necessary for us from an investment perspective, so we haven't worked on it much.
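A hedged sketch of that workflow in TypeScript, filtering everything under a threshold and asking another model to summarize the themes; the threshold, the model, and the row shape are assumptions:

```ts
import OpenAI from "openai";

const client = new OpenAI();

interface ScoredRow {
  input: string;
  output: string;
  score: number; // judge score, nominally 0-1
}

// Collect everything under a threshold and ask another model to summarize the
// recurring themes, producing a short failure report for the engineer.
async function summarizeFailures(rows: ScoredRow[], threshold = 0.5): Promise<string> {
  const failures = rows.filter((row) => row.score < threshold);
  const response = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content: "Summarize the biggest recurring themes in these failing eval cases.",
      },
      {
        role: "user",
        content: failures
          .map((f) => `Input: ${f.input}\nOutput: ${f.output}\nScore: ${f.score}`)
          .join("\n---\n"),
      },
    ],
  });
  return response.choices[0].message.content ?? "";
}
```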
00:20:54.520 | Yeah, the question was about pairwise combinations versus just saying yes or no.
00:21:20.840 | It depends on the experiment that we're doing.
00:21:23.960 | Oftentimes, if we're experimenting between two methodologies, yes, or if the control is particularly important, like this is what's in prod, and I don't want to break prod, then we will use that kind of A-B setup.
00:21:37.360 | If it's the development cycle that I talked about earlier, where we're still in dev or we're still just with alpha customers, and we don't know what the golden experience is, and we're far more comfortable breaking things, we're much less likely to use that type of setup.
00:21:51.100 | But Braintrust actually has a great UI for letting you do both.
00:21:53.640 | Yeah.
00:21:54.620 | I'm curious if you have a process around figuring out what criteria to judge for you as a feature.
00:22:02.260 | Is there any best practices, or is it thinking?
00:22:06.280 | The question was about criteria to judge as a feature.
00:22:08.360 | I mean, I'm sure there are best practices.
00:22:11.640 | I think for us, the reminder is that, you know, there are thousands of these, and they live in this ether, and they're maintained by the people that wrote them,
00:22:21.140 | and then they take on a lot of power, right?
00:22:24.500 | And very few people actually investigate the losses, and that's really risky for building a robust enterprise application.
00:22:31.320 | So things that we've learned, lessons that we've learned, because that, and I see some people nodding, you can imagine how that's a little bad sometimes.
00:22:39.180 | For instance, we used to use them just to catch specific regressions.
00:22:43.140 | It's like, we have a special markdown format that works with Notion.
00:22:45.940 | We just used them for formatting at first, and we'd be like, it has to, you know, be consistent with our markdown and use bullet points, and has to talk about this topic.
00:22:55.460 | All of a sudden, we weren't catching that it would, like, switch languages, because that's true in that prompt, right?
00:23:01.140 | And so then, when you start relying on it to catch everything, or to be like a yes for something, then you have a problem where you're over-reliant on it, and you're not catching regressions.
00:23:10.980 | So there's kind of two approaches that I would suggest that make that successful.
00:23:14.620 | One, actually, goes back to your very intuitive question, which is having it for a particular task.
00:23:21.600 | So we have a particular task of markdown formatting.
00:23:23.960 | We have a particular task of language following.
00:23:26.280 | We don't assume that they can do everything.
00:23:29.260 | Or you have a very small set, and commit yourself to looking at the losses.
00:23:33.360 | The problem is if you don't commit to looking at the losses, and you know that they're lossy.
00:23:37.360 | I think that's, like, a trap that we certainly fell into when we were first developing this concept about nine months ago, before they were called LLMs as a judge.
00:23:45.840 | We just called them LLM graders, right?
00:23:48.340 | I would say that's, like, kind of the two biggest pitfalls.
00:23:50.820 | Yeah.
00:23:51.920 | Do we have time for more, or it's up to you?
00:23:54.580 | Yeah, we can do a couple more.
00:23:56.000 | Okay.
00:23:56.260 | Yeah.
00:23:57.620 | How do you isolate RAG versus generation?
00:24:02.560 | How do we isolate RAG versus generation is a very good question.
00:24:07.180 | Are you asking about, well, do you mean, like, in terms of how we evaluate it, how we evaluate changes?
00:24:13.480 | Yeah.
00:24:15.060 | So this has more to do with, like, what you're experimenting against.
00:24:19.500 | So, for instance, there are times where the index itself changes when you're doing an evaluation, and things like recent docs, particularly for retrieval, freshness of the index or the index at time of evaluation are exceptionally important.
00:24:33.180 | And can definitely change the downstream results, which can be unobservable and confusing to the rater or to anyone looking at the evaluation.
00:24:41.820 | So when is our next offsite?
00:24:43.860 | You know, maybe I did that before our Q3 offsite, and it's Q1 right now, right?
00:24:48.720 | So, like, the answer would change.
00:24:50.200 | One approach is to create a technology that freezes that index. That's actually really expensive, because you don't know what elements of that index you want to freeze.
00:24:57.420 | You don't know what the query is.
00:24:58.460 | It's a lot of storage.
00:24:59.860 | What we have found is that we freeze the retrieval and take the retrieved results and operate on, like, did this retrieval actually retrieve the right thing?
00:25:07.280 | We operate on that as its own evaluation that has its own evaluation framework that has much more to do with, like, our vector databases and Elasticsearch setups.
00:25:15.660 | And then we have an eval that says, like, presume that this is what was retrieved, execute everything else, but, like, what was actually returned is frozen.
00:25:23.440 | And then at the point where, like, the actual answer wasn't in that index and the eval can't be done, we just get rid of that sample.
00:25:32.820 | So, we either assume that retrieval worked or just do the re-retrieval.
00:25:37.260 | We don't really merge them very much for that exact reason.
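A hedged sketch of what that separation can look like in TypeScript; the row shape and the generate callback are hypothetical, just to show retrieval being frozen in the dataset while only generation is evaluated:

```ts
// Hypothetical dataset row for a generation-only eval: retrieval results are frozen
// at dataset-creation time so index drift doesn't change what generation is graded on.
interface FrozenRetrievalRow {
  input: string;           // the user's question
  retrievedDocs: string[]; // snapshot of what retrieval returned
  rubric: string;          // per-row judging rules, as described earlier
}

// The task only exercises the generation step, treating the frozen docs as context.
// `generate` stands in for whatever calls the LLM.
async function generationOnlyTask(
  row: FrozenRetrievalRow,
  generate: (prompt: string) => Promise<string>
): Promise<string> {
  const context = row.retrievedDocs.join("\n\n");
  return generate(`Context:\n${context}\n\nQuestion: ${row.input}`);
}
```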
00:25:40.800 | Yeah.
00:25:41.420 | Just one follow-up.
00:25:42.940 | Yeah.
00:25:43.180 | So, do you isolate the retrieval at the time of building the dataset?
00:25:50.480 | Yeah.
00:25:51.260 | Do we isolate retrieval at the time of creating?
00:25:53.040 | It kind of depends on what we're building and what we have, like, access to.
00:25:56.580 | The answer is, the majority of the time, we try really hard not to reconstruct what the index looked like at that period of time.
00:26:02.540 | That's very technically difficult.
00:26:04.100 | It brings up privacy questions.
00:26:05.640 | For instance, the permissions of that object could have changed.
00:26:09.160 | We try not to do that.
00:26:11.420 | Yeah.
00:26:11.840 | But we also get enough feedback where it doesn't become as frequent of a problem, again, because we're lucky in that we use Notion so much that we get a lot of natural, particularly for retrieval and natural use cases.
00:26:22.780 | Let's do maybe one more, two more?
00:26:25.340 | Yeah.
00:26:25.540 | One more.
00:26:26.300 | Let's do someone from the back.
00:26:30.320 | How many prompts do we have?
00:26:36.980 | I don't think this is private.
00:26:39.700 | Let me think about if it would be.
00:26:42.180 | We have over 100, like hundreds.
00:26:46.500 | How do we manage dependencies?
00:26:48.080 | I mean, some of them, well, we have our evals for them, and that helps a lot.
00:26:53.640 | A lot of them are different variations of things.
00:26:56.140 | For instance, like there's some model-specific variations on our prompts.
00:27:01.620 | We have a large team.
00:27:02.860 | It hasn't really come up.
00:27:04.320 | I think the only time that managing a large number of prompts has been a problem is when a particular model provider has an outage, and we need to switch traffic over to other model providers.
00:27:14.320 | It's not like all 4o mini traffic can just go to Sonnet 3.7 for a variety of reasons.
00:27:21.920 | Obviously, that works for a lot of it.
00:27:23.260 | It's like, okay, I'm willing to pay more for a small amount of time until OpenAI comes back up.
00:27:27.700 | But for some of them, like our database autofill, it's 12 times more expensive.
00:27:31.520 | We'd have like millions and millions of dollars of debt for that decision in that period of time, and we can't make that.
00:27:38.600 | And so I think it comes more to like ownership over prompts and making sure that a fallback mechanism is up to date so that whoever's on call, they can just say like, you know, 4o is down.
00:27:50.260 | And then automatically, for the owner of that product or that prompt, they've already determined what to switch to instead in, like, a code-driven or config-based mechanism.
00:27:59.580 | I would say that's been the most laborious aspect is like getting everyone together and creating that system.
00:28:04.880 | I wish we did that from day one.
00:28:06.140 | We did it more recently.
00:28:08.520 | And that was kind of like a tribunal of elders.
00:28:11.160 | Everyone that's ever written a prompt, please come and fill out this config.
00:28:14.100 | But otherwise, it hasn't been a problem yet.
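A minimal sketch of what such a config-driven fallback could look like in TypeScript; the prompt slugs, model names, and shape here are hypothetical, not Notion's actual config:

```ts
// Hypothetical per-prompt fallback config: each prompt owner declares where traffic
// should shift during a provider outage, so on-call just flips a switch.
const modelFallbacks: Record<string, { primary: string; fallback: string | null }> = {
  "changelog-summarizer": { primary: "gpt-4o-mini", fallback: "claude-3-7-sonnet" },
  // Some prompts are too expensive to reroute; no fallback means degrade gracefully.
  "database-autofill": { primary: "gpt-4o-mini", fallback: null },
};

function resolveModel(promptSlug: string, providerDown: boolean): string | null {
  const entry = modelFallbacks[promptSlug];
  if (!entry) return null;
  return providerDown ? entry.fallback : entry.primary;
}
```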
00:28:16.600 | Yeah.
00:28:18.700 | Great.
00:28:19.100 | Thank you.
00:28:19.960 | Thank you, everyone.
00:28:20.500 | Thank you, Sarah Sachs.
00:28:21.200 | Yeah, Notion AI is my lifeline.
00:28:30.060 | It's great to search on there.
00:28:31.780 | It gives me all the answers I need.
00:28:33.500 | She did not endorse this.
00:28:36.280 | Okay, great.
00:28:39.000 | So we are the BrainTrust speakers.
00:28:41.040 | The two left here on the podium.
00:28:43.620 | And we'll be going over some lectures, then jumping into the platform.
00:28:49.700 | You'll do an activity, then we'll come back, cover a new topic, and so on.
00:28:55.820 | So we're going to start with just covering why evals and what an eval is.
00:29:01.020 | And then we'll talk about running the same thing via the code, via the SDK. Then we'll move into production, so day two: how are you logging, how are you looking at the data points that you're collecting in your production application.
00:29:17.480 | And then finally covering human in the loop, so how do you incorporate user feedback or actual human annotators to improve the quality of your prompts, improve the quality of your data set.
00:29:31.060 | And if we have time at the end, we'll do some remote evals, stuff which extends what the UI is capable of in the playground specifically.
00:29:41.440 | So I think we have a poll in the Slack that's currently pinned, if you're part of the Slack channel, feel free to add your response there.
00:29:51.820 | So jumping into evals and getting started.
00:29:54.640 | These are some curated tweets where we've seen notable people talk about evals online, so just something to think about: why evals may be so useful for them.
00:30:08.820 | Well, it helps you answer questions.
00:30:11.820 | So, again, you know, is my change regressing the performance in production, am I using the best model for my use case, the cheapest model for my use case, does it have brand consistency in its responses, am I learning from the data that I'm capturing, from the logs, and am I able to debug and troubleshoot some of the responses that are underperforming?
00:30:37.460 | This is a trend across the whole industry.
00:30:42.460 | Everybody's dealing with hallucinations and performance degradation, and so we're here to try to set up a system using statistical analysis to catch these mistakes proactively and also with online evals reactively on that online traffic.
00:30:59.760 | This can help your business by allowing you to move faster, helping you reduce costs, and scale teams.
00:31:07.440 | It's really helpful to have non-technical people collaborate with technical people to build the best AI apps possible by using the playground for non-technical people and then having that SDK compatibility and everything centrally managed in one place.
00:31:25.560 | Moving into the core concepts of Brain Trust, so you can think of it in these three slivers, starting off with prompt engineering, right, you want to keep track of the versions of the prompts that you're iterating on, you want to do this rapidly, so having a playground or a place where you can quickly go through these changes and identify improvements or regressions is really crucial.
00:31:47.680 | Then you want to have evals that are automated ideally, right, and this can be kicked off via the SDK. You'll have a score from zero to one, shown as a percentage, that will give you a signal of: is this going up, is this going down, what needs attention, what doesn't.
00:32:05.600 | And then finally, observability, what's happening in production, are users thumbs upping or thumbs down, are they providing ideal output that I can then use in my data set to keep improving the evals or keep improving my AI feature.
00:32:22.400 | So jumping into, you know, what even is an eval? Well, the definition that we came up with is that it's a structured test that checks how well your AI system is performing. It helps you measure quality, reliability, and correctness across scenarios. Again, the criteria are really up to you; as we heard Sarah say, the LLM as a judge can be across whatever dimension you want to test.
00:32:47.060 | There are three components that go into an eval. So you have your task, which is the thing that you want to test; this is the code or prompt that you're evaluating. It can be a simple, you know, single call to an LLM or a full agentic workflow; the complexity is really up to you. We'll see some simpler versions in the UI, and then in the SDK you can go crazy there.
00:33:13.200 | The next piece is the data set. So this is the set of real-world examples, the test cases that you're going to be throwing at the prompt and seeing how it's performing based on the score output. Right, so the score is the logic behind the eval; this outputs a score from zero to one, shown as a percentage. You can use LLM as a judge or fully heuristic functions, so it's great to use a combination of the two and try to meet in the middle.
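Putting those three pieces together, a minimal eval with the Braintrust TypeScript SDK looks roughly like the sketch below; the project name, rows, and placeholder task are made up for illustration:

```ts
import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";

// A minimal eval: data, task, and scores. In a real eval the task would call
// your prompt, RAG pipeline, or agent instead of this placeholder.
Eval("Unreleased AI", {
  data: () => [
    { input: "Add dark mode toggle", expected: "New feature: dark mode toggle" },
  ],
  task: async (input) => {
    return "New feature: " + input; // placeholder task
  },
  scores: [Levenshtein],
});
```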
00:33:40.420 | There are two mental models to think through, offline evals, which are evals in development, as you're iterating on the prompt, figuring out what works best or what model to go with, you're writing these structured tests and running them through these predefined data sets, right, you're not using live traffic in this scenario,
00:33:59.420 | but on the online eval side, so this is real time tracing, you're monitoring that live production application, and you're also getting scores from those real outputs, and that will allow you to diagnose problems, monitor performance, and capture that user feedback so you can keep improving and close that feedback loop.
00:34:19.420 | So we'll show how you can use both of these types of evals offline and online.
00:34:26.420 | This matrix here is really helpful to understand what should I even be improving, you know, and you have to look at two things, right, the score that's being output by your evals, and using your own eyes, looking at the output of the LLM, do I think this is a good or bad output?
00:34:45.420 | If both match, meaning the output looks good to your eyes and the score is high, great, that's where you want to be. But if not, you have to decide: okay, do I want to improve my evals, the actual scores, or do I want to improve my AI app? And this is not trivial; it does require some thinking about what requires my focus, improving the eval or improving the actual app.
00:35:13.420 | So now zooming into each of these components. So the task, this is the thing that you want to test. So starting off with a prompt, right, this is a screenshot from the UI.
00:35:23.420 | It's the input that you're going to give to the LLM, right? You can set up a system prompt in Braintrust, and then pass user prompts as mustache templating, and that's something that we'll see in the activity shortly.
00:35:38.420 | So, you know, this is great for getting started. You may have some multi-turn use cases where you're having a whole conversation and you want to evaluate that with a tool like Braintrust.
00:35:50.420 | So that's where extra messages comes in, and you can provide the system prompt, followed by the user message, the assistant response, the user response, the assistant response, maybe a tool call thrown in there, right?
00:36:02.420 | All of that prepackaged and given at once to your eval system that will then output a score.
00:36:12.420 | Tools are also used with Braintrust, and they should be for, you know, RAG applications or for agentic applications.
00:36:20.420 | So this can live in Braintrust. You can push your tools to the Braintrust library, and they'll become accessible to anybody playing in the playground or running experiments.
00:36:33.420 | And the last piece here in the task are agents in the UI. So now you can have prompts chained together where the output of the first prompt becomes the input of the next.
00:36:45.420 | So this is going to keep increasing in scope within Braintrust. We'll have branching and more capabilities. This is currently a beta feature, but really exciting to be able to do end-to-end evals on prompts being chained, as many as you'd like.
00:37:03.420 | So now moving into datasets. So there are three fields, right, three columns in a dataset, one of which is required, the input field. So this is what's going to be passed into the prompt.
00:37:14.420 | This is the user input into your AI feature application, right? The other two columns are optional. You can have your expected column, which would be that anticipated output or that ideal response.
00:37:28.420 | This is difficult to create initially, and we see a lot of customers leave this blank at first and then start to fill it in with human annotation or user feedback.
00:37:40.420 | If you're doing a RAG bot, it's common to have assertions. So what do you want to be included in the output? Providing that in the expected column will help make sure that it's using that as the ground truth.
00:37:54.420 | And then metadata is for any additional information that you want to track at that specific dataset row, so with the specific user prompt.
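As a hedged example of those three columns, a couple of dataset rows for this kind of changelog use case might look like the following TypeScript; the URLs, assertions, and metadata keys are hypothetical:

```ts
// Hypothetical dataset rows: only `input` is required; `expected` and `metadata`
// are optional and often filled in later from human review or user feedback.
const datasetRows = [
  {
    input: "https://github.com/example-org/example-repo", // hypothetical repo URL
    expected:
      "Summary should mention the new CLI flag and the breaking config schema change.",
    metadata: { source: "internal-testing", language: "en" },
  },
  {
    input: "https://github.com/example-org/another-repo", // hypothetical repo URL
    // expected left blank on purpose; to be added via human annotation later
    metadata: { source: "production-log" },
  },
];
```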
00:38:01.420 | So some tips here. We recommend for you to just get started. You don't need to create 200 rows in a dataset to run your first eval. You know, five, ten rows is great.
00:38:15.420 | Just start small and get some feedback. And, you know, the next thing is don't stop iterating. Keep adding rows or tweaking rows, using logs as that source of truth, right?
00:38:29.420 | How are your users interacting with the feature that you're developing? Of course, at first, you may not be capturing logs, but there's still that process, whether it's via synthetic data or with internal testing, that you can help yourself improve the dataset.
00:38:47.420 | So, human reviews is another way of establishing ground truth. Very much needed in certain industries. If you're dealing in the medical space, you need doctors to look at the output. Or you need lawyers or people with highly specialized skills.
00:39:03.420 | Cool. So now moving into the score types. So this is the last piece. So we covered tasks, we covered datasets, and now scores. So there's two types, right? We covered the LLM as a judge a little bit.
00:39:14.420 | So this is subjective, non-deterministic, more of a qualitative assessment that will be done on the output. And then on the other side, there's a full code base heuristic score. This is very exact, deterministic, objective.
00:39:29.420 | So you want to try to use a combination of the two. And like I said, meet in the middle, right? There are some organizations that will choose to go one way or the other, but we found that most will typically incorporate both as much as they can.
00:39:43.420 | I know Sarah mentioned that LLM as a judge has been crucial for her.
00:39:48.420 | So some tips here. We recommend using a larger model as the LLM-as-a-judge to evaluate the smaller ones. Just something that we've noticed could be useful.
00:40:03.420 | You want to make sure that the judge is scoped to a specific criterion. We want it to be focused and not have to make a broad decision across five or six different criteria.
00:40:20.420 | You want to make sure that you're evaluating the score. This is probably the most crucial piece. But you want to put the score prompt, the LLM as a judge prompt, through the same rigor that your other prompts are receiving.
00:40:31.420 | Right? So you want to make sure that you're testing it on human annotated feedback. So what would a human think looking at the same outputs? Is it matching the LLM as a judge? If not, probably needs some improvement.
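One simple way to apply that rigor is to measure how often the judge agrees with human labels on the same outputs; the sketch below is an illustrative TypeScript helper, with field names and the 0.5 threshold as assumptions:

```ts
// Evaluating the evaluator: how often does the LLM judge agree with human labels
// on the same outputs? Field names and the 0.5 threshold are assumptions.
interface LabeledExample {
  judgeScore: number; // 0-1 from the LLM-as-a-judge
  humanScore: number; // 0-1 from a human annotator
}

function judgeAgreement(examples: LabeledExample[], threshold = 0.5): number {
  if (examples.length === 0) return 0;
  const agreements = examples.filter(
    (e) => (e.judgeScore >= threshold) === (e.humanScore >= threshold)
  ).length;
  return agreements / examples.length;
}
```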
00:40:43.420 | This is a little tidbit about the Braintrust UI. So we have two views, playgrounds and experiments. They look very similar. This wasn't always the case historically, but they've now grown to be very similar because of customer requests.
00:41:02.420 | The playground, a way to think of it is a quick iteration. It's more ephemeral, right? So what you do in the playground won't necessarily stick around and be part of the historical view.
00:41:13.420 | But you can always save a snapshot of it to the experiments view and sort of bring it over. And that will then kick off the more traditional experiment, which is the same thing that happens when you run an experiment via the SDK, right?
00:41:26.420 | So you run an eval via the SDK or from your CI pipeline. That will end up in the experiments view. And the benefit of that is that you have everything in one place that you can review and compare across weeks and months.
00:41:39.420 | So you can compare your performance with the new model that dropped today with a prompt from three weeks ago in the experiments view and see the scores change over time.
00:41:52.420 | Cool. So that was probably a lot. But now you understand the components and ingredients behind running an eval.
00:42:00.420 | So now we're going to jump into the activity. There should be a document pinned in the Slack channel. If not, we can also pull up the QR code quickly.
00:42:10.420 | But if you can pull that up and try to follow along. I know the Wi-Fi has been a little spotty. If you can't follow along, no worries at all.
00:42:21.420 | Doug here will be leading through those steps.
00:42:26.420 | All right. Let's have a little fun here actually getting our hands dirty with the Braintrust platform.
00:42:33.420 | As Carlos mentioned, we have -- sorry, wrong button -- the document guide out here that will walk you through the setup of the workshop.
00:42:46.420 | Again, I'll walk through it. It'll sort of enumerate the different requirements that you'll have to have, at least for how it's set up today.
00:42:53.420 | Just as like a call out here, we're using OpenAI here under the hood. That certainly doesn't mean that Braintrust only works with OpenAI.
00:43:00.420 | You can use any sort of AI provider, custom AI provider. You can sort of bring your own here as well. But just wanted to call out we're going to use OpenAI here to do this.
00:43:10.420 | Another thing to call out -- somebody ran into this in our last workshop -- is the particular version of Node that you're using.
00:43:17.420 | Just don't be on version 23.
00:43:19.420 | 22 and 20 are generally pretty good ones to use.
00:43:23.420 | And then let's get going. So you can start here at the top by installing Node, Git, and TypeScript.
00:43:31.420 | I've already done this here on my machine, but you can get a sense for what this looks like.
00:43:37.420 | Obviously, we need a Braintrust organization and a project here to actually go and play with that playground, create some evals, run some experiments.
00:43:48.420 | So if you go to www.braintrust.dev, this is where you can create your organization and your project.
00:43:58.420 | We're calling our project "Unreleased AI" today. You can certainly call it something different.
00:44:02.420 | You would just see multiple projects be created inside of your account, and I'll show you why in just a second.
00:44:07.420 | But this is where all of this activity is going to be happening underneath.
00:44:12.420 | You can almost think of it as a project as like a particular feature, right?
00:44:15.420 | So you have feature A, B, C, and they may each have their own unique projects.
00:44:20.420 | To speak very high-level about the use case here, Unreleased AI:
00:44:24.420 | What we're trying to do is build sort of an application that looks at a GitHub URL and looks at the commits that have happened since the most recent release.
00:44:32.420 | And the idea is to inform us as developers what's coming in maybe a subsequent release.
00:44:37.420 | Once we've created our account, I'll come out to the Braintrust platform here.
00:44:46.420 | So I'm going to kind of follow along with all y'all.
00:44:51.420 | I'll create a project, right, unreleased AI.
00:44:56.420 | Then you're going to need to configure your AI provider, right?
00:45:00.420 | And so this will be the OpenAI API key that you'll configure.
00:45:03.420 | Again, here are the different AI providers that you can configure within the platform.
00:45:08.420 | You also can use default cloud providers, and then, like I mentioned, bring in your own custom providers to the platform as well.
00:45:15.420 | Configure that, and then we'll come back to the guide.
00:45:19.420 | This is where the repo exists, right?
00:45:26.420 | It's a public repo in our Braintrust org.
00:45:29.420 | You just clone this locally.
00:45:32.420 | Let's zoom in a little bit.
00:45:39.420 | Obviously, I have it here locally.
00:45:42.420 | The next thing that we'll do is we're going to copy this .env.local.example file to a .env.local.
00:45:52.420 | You're going to replace the Braintrust API key and OpenAI API key with your specific keys.
00:45:59.420 | Optionally, you can include this GitHub token.
00:46:02.420 | This will allow you to not hit sort of like rate limits on the GitHub side.
00:46:05.420 | Not super important for this.
00:46:06.420 | We're probably not going to hit that here with just running a couple examples.
00:46:10.420 | But that's essentially what you do, right?
00:46:12.420 | We're just going to run that cp command, and you have that there.
00:46:16.420 | Fill in your API keys.
00:46:18.420 | Okay.
00:46:23.420 | Let's do this now.
00:46:25.420 | So the first thing that I'll do is run pnpm install.
00:46:28.420 | All right, this will install all of the dependencies for this application.
00:46:32.420 | The other thing to call out here is it's going to run this command after installation.
00:46:37.420 | So this is going to actually create some of those resources that Carlos was just talking about.
00:46:42.420 | It's going to create a couple different prompts.
00:46:44.420 | It's going to create some scores and actually create a data set within that project that we just created.
00:46:49.420 | So this is, I think, just to highlight here, one of the really unique things about Braintrust
00:46:54.420 | is being able to connect the things that we're doing within our code to what's happening within the platform itself.
00:47:00.420 | A couple different ways to actually utilize or use the Braintrust platform,
00:47:05.420 | whether you are kind of maybe strictly code-based using our SDKs or you are using a lot more of the platform itself,
00:47:11.420 | you can do a lot of these things and share code and so on.
00:47:15.420 | But just to kind of scroll through here, this is how I'm creating my prompt.
00:47:19.420 | If you scroll down a little bit further, here's another prompt.
00:47:21.420 | But when I run this, this is going to actually go and push all of these resources into that project that we just created.
00:47:28.420 | OK, cool.
00:47:42.420 | So we have our prompts.
00:47:44.420 | We have our data sets.
00:47:46.420 | We even have some scores here down below.
00:47:49.420 | So let's just kind of investigate some of these different things that we created.
00:47:52.420 | So here's my prompt.
00:47:54.420 | As Carlos mentioned, we have that sort of mustache syntax here
00:47:58.420 | that allows us to inject certain things based on whatever our data set looks like.
00:48:05.420 | Maybe to back up, thinking a little bit about what Sarah said in her talk,
00:48:09.420 | thinking more about that data structure that you want.
00:48:13.420 | And then this is how we will then create that underlying data set.
00:48:16.420 | But this is really going to map to the data set that we've created within that repo.
00:48:21.420 | Right?
00:48:22.420 | So again, what we're trying to do is we're trying to get that list of commits, summarize them,
00:48:25.420 | and give sort of that summary to that developer of what's coming in that next release.
00:48:32.420 | The other thing to highlight here are our scores.
00:48:36.420 | And luckily, I was able to follow most of Sarah's best practices, being very targeted in what we're doing with these individual scores.
00:48:45.420 | To highlight a couple, I'll just open up my completeness score.
00:48:50.420 | So this is really just being able to understand, did the LLM do a good enough job?
00:48:56.420 | When it looked at the commits, is the summary complete?
00:48:59.420 | Did it summarize all of the breaking changes?
00:49:02.420 | Did it summarize all of the net new features?
00:49:04.420 | All right, so this is where we can actually start to create that criteria that meets the expectations that we have of what is excellent, what is good, and so on.
00:49:15.420 | And then this then maps to individual scores between zero and one.
00:49:20.420 | The other one here I'll highlight here is this formatting score.
00:49:23.420 | All right, this is actually just more of that heuristic, just being able to understand, is it following a particular format that we've laid out?
00:49:32.420 | And this is a little bit more binary, just a zero or one, but just to give you a sense of the different scores that we can add to the project.
00:49:41.420 | Last one I'll highlight is just the dataset.
00:49:44.420 | If you look through here, and maybe I'll zoom in a little bit further, you should start to see some of these things actually map what we have within that prompt.
00:49:54.420 | And so this is how we're going to be able to pull those things in and actually run our evals.
00:50:03.420 | All right, come back here, make sure I haven't skipped any steps.
00:50:07.420 | So now that we have a sense for what we just put within the platform, let's actually start playing around with it.
00:50:16.420 | So we're going to go into our eval playground.
00:50:19.420 | So this is where we can actually create evals on the fly.
00:50:24.420 | I'll create an initial playground here.
00:50:28.420 | Feel free to name it something better than that.
00:50:31.420 | And here's what we can do.
00:50:32.420 | So let's load in our two prompts that we've created.
00:50:35.420 | So we have our change log one and our change log two.
00:50:40.420 | Thinking back to what Carlos was talking about, like the different sort of components of an eval, right?
00:50:45.420 | We need our task, right?
00:50:47.420 | In this case, it's our prompts.
00:50:49.420 | We need a dataset.
00:50:51.420 | And then we need an ability to score it.
00:50:53.420 | And so we can actually include all of these different scores here.
00:50:56.420 | And now we have all of the different components to run an eval here within the platform.
00:51:03.420 | So I'll click Run.
00:51:05.420 | And then you'll actually start to see all of these different rows within this dataset actually run in parallel.
00:51:10.420 | And then we're going to generate those scores against the output that the LLM created for those individual prompts.
00:51:18.420 | And we should start to see some of these scores come back pretty soon.
00:51:22.420 | Lots of different ways to actually go in and, you know, you can highlight each one of these different things.
00:51:27.420 | You're able to actually, once this starts to come back, look at the rationale that the LLM gave to providing the score that it gave.
00:51:36.420 | But this allows you to, like, you know, you can even look at a diff.
00:51:39.420 | Again, once some of this stuff comes back.
00:51:42.420 | Or we can maybe get more of a summary type layout.
00:51:46.420 | And I can look at, like, okay, so my accuracy score for my base prompt up here or my base task, how does it compare relative to my comparison task?
00:51:54.420 | So very, very quickly, I got some scores here that give me a sense for how well these particular prompts fare against the scores that I've defined against them.
00:52:08.420 | Maybe another really quick thing to highlight here is up here at the top, right?
00:52:12.420 | We were talking a little bit about, like, maybe understanding over time how are these things faring as I start to change these prompts.
00:52:19.420 | That's where experiments come into play.
00:52:22.420 | So I can create an experiment, again, very quickly.
00:52:24.420 | It's going to use what I've configured within the playground.
00:52:27.420 | This will actually create, again, back up a little bit here to experiments.
00:52:32.420 | But now I can start to track over time these scores.
00:52:35.420 | I can change the model of these.
00:52:37.420 | There's a lot of different ways to start to view this data.
00:52:40.420 | Maybe I want to understand what does cost look like for these particular tasks over time.
00:52:46.420 | I'm able to look at that.
00:52:47.420 | If I change the underlying model, I'm able to group it by that model and see how those models affect that score.
00:52:53.420 | But this becomes a very trivial thing now, right?
00:52:57.420 | Going back to that playground, maybe I want to test here with GPT-4.1.
00:53:03.420 | How does this impact my scores?
00:53:06.420 | And then how does it impact my scores relative to the cost maybe that I'm incurring to do that?
00:53:11.420 | So run that experiment.
00:53:13.420 | Again, now I can track these things over time.
00:53:16.420 | Maybe really quickly as sort of a bonus to this, going a little bit off-grid here, just because I've actually heard this numerous times at this conference.
00:53:28.420 | And I just had a question over here.
00:53:30.420 | How can we automatically optimize prompts?
00:53:32.420 | This is something that we are thinking about internally.
00:53:35.420 | And hopefully within a couple weeks or so, we can release this feature to our customers.
00:53:39.420 | But this loop here allows you to either optimize prompts or generate net new dataset rows.
00:53:46.420 | And then the kind of unique thing here is that it has all of the context of this playground.
00:53:51.420 | What are the evaluation results the last time it ran?
00:53:55.420 | How can we change this prompt to ensure we beat those results?
00:53:59.420 | And so you can see -- I'm guessing most of you all are familiar with something like Cursor.
00:54:04.420 | So sort of a similar type of interface here.
00:54:06.420 | But it will actually, again, use those results, start to change the prompt.
00:54:10.420 | Pretty soon it will give me like a diff here.
00:54:13.420 | Here is maybe a better prompt that we can accept.
00:54:16.420 | And then it will, again, it will run those scores for that new task.
00:54:21.420 | And it will give us a sense for whether or not we are better or worse because of that.
00:54:27.420 | But now this becomes a little bit easier to do that iteration when we have sort of AI layered here in the mix.
00:54:33.420 | And another question that we got in the previous workshop is: are we using Braintrust to actually evaluate this?
00:54:40.420 | And we certainly are.
00:54:41.420 | Yeah, this is really cool.
00:54:47.420 | This can also improve the data set, right?
00:54:50.420 | You could ask it to enhance the data set, add another row, maybe even highlight some of the data set rows and be like can you improve the prompt based on these specific scores?
00:55:01.420 | Yep, absolutely.
00:55:02.420 | Yeah.
00:55:03.420 | Pretty cool.
00:55:04.420 | Any questions that you guys may have?
00:55:05.420 | Yeah, we can open it up.
00:55:06.420 | Yeah.
00:55:07.420 | How do we add, let's say if we have a home grown model, how do we add it in the list?
00:55:10.420 | Yeah, that's a good question.
00:55:19.420 | There are some custom AI models being used in Braintrust.
00:55:26.420 | In the settings here, in the AI providers, you can add your own custom model.
00:55:32.420 | So if it speaks OpenAI, you should be able to add it in here or choose whichever one you want,
00:55:39.420 | if you have it hosted for inference.
00:55:41.420 | Yeah?
00:55:42.420 | What if your score does not have an upper bound that you can convert into percentages?
00:55:55.420 | So how do you calculate the score then?
00:55:59.420 | So what kind of score would you be thinking about here?
00:56:02.420 | So for example, I'm building a finance ticker, right?
00:56:07.420 | Offering to, like, buy some kind of stocks.
00:56:10.420 | And I use the score as how much money I made.
00:56:13.420 | So that could range, you know, from $10 to like $10,000, right?
00:56:17.420 | How can I convert this to, like, a percentage score?
00:56:21.420 | Is that value relative to something that's expected though?
00:56:25.420 | Like you're talking about like an unbounded range.
00:56:28.420 | That's not necessarily what I think these scores are designed to do.
00:56:32.420 | Right, but I can engineer my prompt and try to maximize some kind of score that I can get, right?
00:56:39.420 | But that does not mean that I have a range in which I try to operate.
00:56:43.420 | I'm just trying to optimize a function, some generic function that I, you know, define as my score.
00:56:50.420 | Yeah, it's definitely something we've heard before.
00:56:54.420 | I think that's part of the struggle of writing scores is trying to normalize them
00:56:59.420 | and bring them into that range of zero to one.
00:57:01.420 | So it's up to you what you decide the floor and ceiling are.
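One common workaround for an unbounded metric like dollars earned is to pick a floor and ceiling that are meaningful for your use case and squash the value into the 0 to 1 range; a minimal sketch, using the $10 to $10,000 range the questioner mentioned only as an example:

```ts
// Normalize an unbounded metric (e.g. profit in dollars) into the 0 to 1 range that
// Braintrust scores expect. The floor and ceiling are domain choices you make yourself.
function normalizedProfitScore(profitUsd: number, floor = 0, ceiling = 10_000): number {
  const clamped = Math.min(Math.max(profitUsd, floor), ceiling);
  return (clamped - floor) / (ceiling - floor); // 0 at the floor, 1 at the ceiling
}

// normalizedProfitScore(10)     -> 0.001
// normalizedProfitScore(10_000) -> 1
```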
00:57:05.420 | Yeah.
00:57:06.420 | Do you have anything for evaluating multi-turn conversations?
00:57:11.420 | And as part of that, would you be able to review the agent's features?
00:57:19.420 | I think you kind of went through that earlier, right?
00:57:21.420 | Yeah, in the playground we can just-- so here at the bottom you can add more messages.
00:57:31.420 | So the idea is that you could provide a whole back and forth in context
00:57:36.420 | and then evaluate that multi-turn conversation at once.
00:57:39.420 | You could do it as well in the SDK.
00:57:41.420 | And then for the agents here, the idea is that you can chain multiple prompts together.
00:57:52.420 | So here we could just grab the two.
00:57:54.420 | But the idea is that the output of the initial prompt will become the input of the next and so on.
00:57:59.420 | And then you can evaluate them as a unit, right?
00:58:02.420 | And as opposed to the multi-turn, right, you're providing all the context at once.
00:58:07.420 | Whereas here it's, you know, making multiple LLM calls.
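For the multi-turn case described above, a dataset row might carry the whole back-and-forth as part of its input so the conversation can be evaluated in one shot; the field names below are assumptions you would match to your own prompt template:

```ts
// A hypothetical dataset row for a multi-turn eval: the prior turns travel in the
// input, and the scorer judges the assistant's reply to the final user message.
const multiTurnRow = {
  input: {
    chat_history: [
      { role: "user", content: "What changed in the last release?" },
      { role: "assistant", content: "Mostly bug fixes and a new export flow." },
      { role: "user", content: "Any breaking changes I should know about?" },
    ],
  },
  expected: "Calls out the removed v1 export endpoint as a breaking change.",
};
```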
00:58:11.420 | And just piggybacking on this question, for a multi-turn scenario, do we score each turn versus the entire conversation?
00:58:19.420 | Also, how about if the conversation has, let's say, 100 messages in between the agent and the user?
00:58:25.420 | What about the memory?
00:58:27.420 | Do we retain the context for message one versus, say, the 90th message?
00:58:33.420 | How does this all play out?
00:58:34.420 | Do we score the entire conversation versus each turn?
00:58:38.420 | With the multi-turn extra messages approach, you're providing everything at once to the LLM.
00:58:44.420 | So it's one LLM call.
00:58:45.420 | So yeah, you would need to comply with the context window of the model that you're working with.
00:58:49.420 | The agent feature, though, each prompt is its own LLM call.
00:58:54.420 | I think maybe just speaking also more generally, if I come back over here.
00:58:59.420 | I just wanted to highlight something because I think it's relevant to the question.
00:59:03.420 | Do you evaluate sort of like the individual turns?
00:59:09.420 | To me, like, it's always like somewhat dependent.
00:59:12.420 | But if I look at these logs here, this is sort of like this application where there's multiple steps that have to be taken for the user to get the output that it needs.
00:59:24.420 | So you can see here that the first step is actually rephrasing a question, right?
00:59:29.420 | So there's actually this chat history that is needed and then the most recent input.
00:59:33.420 | And we need to be able to rephrase that question into something that the user actually means.
00:59:37.420 | And if the rephrase here, the task that we have for this, falls down, it means the rest of the application is going to fall down.
00:59:46.420 | So I think one of the things that becomes really powerful here is the ability to actually score these individual calls as they're happening.
00:59:53.420 | So I have a rephrased question.
00:59:55.420 | From there, I'm going to determine the intent.
00:59:57.420 | But if I scroll down a little bit further, I can actually create scores for those individual spans.
01:00:02.420 | So I can understand the, like, the application has all of these different steps.
01:00:07.420 | I want to make sure that, like, if something does not work, if the output doesn't match what I would expect, I need to be able to go back through there and figure out where it fell down.
01:00:16.420 | I don't know if that helps at all, but--
01:00:18.420 | Almost like a session trace there, right?
01:00:19.420 | You're going back to the session and seeing what happened and whether that fell through the tracks.
01:00:23.420 | Yeah, yeah.
01:00:24.420 | And then you're able to configure scores for these individual spans that are created.
01:00:28.420 | Does Braintrust support evaluating speech-to-speech models for real-time or Gemini multimodal evaluations of voice, et cetera?
01:00:45.420 | Yeah, that's a good question.
01:00:46.420 | We have a cookbook about evaluating a voice agent that you should check out.
01:00:51.420 | It would be through the SDK.
01:00:52.420 | Yeah.
01:00:53.420 | Yeah.
01:00:54.420 | Any questions?
01:00:55.420 | Cool.
01:00:56.420 | Are you guys able to follow along in the activity?
01:00:57.420 | Like, the internet's working, everything's okay?
01:00:58.420 | Sort of, kinda.
01:00:59.420 | Sort of.
01:01:00.420 | Yeah.
01:01:12.420 | Do you guys offer Braintrust for on-prem installation?
01:01:17.420 | We have a hybrid deployment.
01:01:19.420 | So we would manage the control plane.
01:01:21.420 | You would manage the data plane.
01:01:23.420 | Everything meets in the browser.
01:01:25.420 | Similar to the Databricks model.
01:01:26.420 | So, yes.
01:01:28.420 | My response is, you would have the data fully in your VPC.
01:01:45.420 | Question over here.
01:01:46.420 | There's a lot of similarities with LangSmith.
01:01:56.420 | Yeah.
01:01:57.420 | That's a good question.
01:01:58.420 | So the question is, how do we compare to LangSmith since we're both in the eval space?
01:02:04.420 | And, you know, I think you're right in that we do very similar things.
01:02:09.420 | And maybe if you're looking from a distance, it looks like it's the same.
01:02:12.420 | But up close, what we hear is that our UI/UX is more intuitive, cleaner, easier to work with.
01:02:19.420 | And crucially, the performance and scale of Braintrust is unmatched due to our underlying infrastructure.
01:02:26.420 | Brainstore was actually developed in-house.
01:02:30.420 | And it was only possible because of the technical team, especially Ankur and the founding engineers, who have deep database knowledge from working at SingleStore (MemSQL) and having built many databases before.
01:02:42.420 | So they were able to improve the full-text search capabilities and, yeah.
01:02:54.420 | Cool.
01:02:55.420 | So, finished with activity one.
01:02:56.420 | Again, this will be available for you guys to do at your own pace at home whenever you want.
01:03:00.420 | There's also a free tier available for Braintrust.
01:03:04.420 | So you still have a bunch of credits to use.
01:03:07.420 | So, moving into the second lecture.
01:03:10.420 | So, now doing the same thing that we did via the SDK.
01:03:14.420 | So, we got to see a little bit of the code, right?
01:03:17.420 | We cloned the repo.
01:03:18.420 | We pushed the prompts and the scores into Braintrust so we could use them in the playground.
01:03:26.420 | But here, we'll actually do an experiment via code.
01:03:29.420 | So, just wanted to quickly go over high level how this works.
01:03:34.420 | So, you know, the top row we went through.
01:03:38.420 | You didn't really get to see the assets defined in code.
01:03:41.420 | But we talked through the braintrust push command and then how they ended up in the Braintrust library.
01:03:46.420 | So, the second row is what we'll go over now where you're defining the evals in the code.
01:03:51.420 | You're running a braintrust eval command.
01:03:54.420 | This can occur via the CLI or in a CI pipeline.
01:03:57.420 | That's very common, right?
01:03:58.420 | So, when you're trying to open a PR and merge it into main, you would trigger this eval process.
01:04:05.420 | And then all the experiments would appear in Braintrust.
01:04:09.420 | And, you know, Sarah was mentioning that they have their own process that looks at those experiment scores.
01:04:15.420 | Summarizes them with an LLM and then writes a report.
01:04:18.420 | So, you can get creative with this pipeline.
01:04:21.420 | So, pushing to Braintrust, very similar, right?
01:04:26.420 | The key here is defining the assets as code.
01:04:29.420 | So, you would just follow the pattern to the right here.
01:04:32.420 | It would be project.prompts.create, or it would be project.scores.create.
01:04:37.420 | And then you would just define what would appear in Braintrust.
01:04:41.420 | And this could be used in production.
01:04:43.420 | You know, a lot of people will either push their prompts into Braintrust or pull them.
01:04:49.420 | Really, the prompt management style is up to you.
01:04:52.420 | But it is really helpful to have that source controlled prompt versioning in place.
01:04:58.420 | Depends if you want to go push or pull.
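A minimal sketch of the push pattern being described, assuming a TypeScript resources file. The project name, prompt content, and slug are placeholders, and the exact create() signatures may differ slightly from this sketch, so check the SDK docs before copying it:

```ts
// resources.ts -- define Braintrust assets in code, then push them with the CLI:
//   npx braintrust push ./braintrust/resources.ts
import * as braintrust from "braintrust";

const project = braintrust.projects.create({ name: "Changelog generator" });

// Prompts defined here show up in the project's library after the push.
export const changelogPrompt = project.prompts.create({
  name: "Changelog v1",
  slug: "changelog-v1",
  model: "gpt-4o",
  messages: [
    { role: "system", content: "You summarize git commits into a changelog." },
    { role: "user", content: "Summarize these commits:\n\n{{commits}}" },
  ],
});

// Scores follow the same pattern (the slide mentions project.scores.create), so you
// can define them here as well and teammates can use them from the playground.
```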
01:05:01.420 | So, running an eval via SDK is very similar to what we saw in the UI.
01:05:10.420 | You need the three things again, right?
01:05:11.420 | So, you need a task, a data set, and one or more scores.
01:05:15.420 | So, we'll show you what that looks like in the actual code.
01:05:19.420 | And then we'll run the command and we'll see what that looks like in the UI.
01:05:24.420 | And then we'll come back and do some of the logging day two, moving into production stuff.
01:05:34.420 | Cool.
01:05:35.420 | Back to the guide here.
01:05:37.420 | If you're looking on activity two.
01:05:39.420 | We'll start again with just kind of understanding what we have within the code base.
01:05:42.420 | And then we'll actually run some evals.
01:05:44.420 | You'll again, like Carlos mentioned, you'll see some similarities with what we did within the platform.
01:05:49.420 | But now we're just kind of strictly within the code.
01:05:51.420 | Again, just trying to like hammer home here.
01:05:53.420 | There's like multiple ways to actually interact with the Braintrust platform.
01:05:57.420 | Trying to meet you all where you want to be.
01:06:00.420 | But if I come over to the code base.
01:06:03.420 | Again, this is where the resources.ts.
01:06:06.420 | This is where all of this will be defined.
01:06:08.420 | All of our prompts, scores, and such.
01:06:11.420 | I have this eval folder.
01:06:14.420 | I'm going to open up my changelog.eval.ts.
01:06:17.420 | One thing to kind of highlight here.
01:06:20.420 | If you name your files with that .eval.ts, and then you pass just a folder.
01:06:26.420 | We can pick up all of those evals and run those within the platform.
01:06:30.420 | But let me open this up a little bit.
01:06:34.420 | So this is really all we're doing.
01:06:36.420 | Right?
01:06:37.420 | So we have this eval, which was just coming from the Braintrust SDK.
01:06:40.420 | Here's our data set that we've created.
01:06:43.420 | The task that, you know, that lives over here from Braintrust.
01:06:50.420 | And this is really like, again, all we're doing from the repo side to run evals within the platform.
01:06:56.420 | So I'll clear this and then run pnpm eval.
01:07:02.420 | This is, again, just another command that is unique to this repo.
01:07:06.420 | All we're doing under the hood is running this Braintrust eval and then giving it the name of this file.
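For reference, a stripped-down .eval.ts file roughly matching what was just described might look like the sketch below. The project name and the inline stand-in task are placeholders (in the workshop repo the task is where the real LLM call happens), and Levenshtein is just a convenient prebuilt scorer from the autoevals package:

```ts
// changelog.eval.ts -- the three ingredients of an eval: data, task, and scores.
// Files ending in .eval.ts are picked up when you run, for example:
//   npx braintrust eval ./evals
import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";

Eval("Changelog generator", {
  // Inline data for illustration; in practice this often references a pushed dataset.
  data: () => [
    {
      input: { commits: "feat: add dark mode\nfix: crash on login" },
      expected: "- feat: add dark mode\n- fix: crash on login",
    },
  ],
  // Stand-in task: replace with the function that actually calls your model.
  task: async (input: { commits: string }) =>
    input.commits
      .split("\n")
      .map((c) => `- ${c}`)
      .join("\n"),
  scores: [Levenshtein],
});
```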
01:07:11.420 | But you can actually see the experiments running right now within the platform.
01:07:16.420 | Here is the link that you can go out to.
01:07:19.420 | And this should look very similar to what you just saw when we ran the experiments within the UI specifically.
01:07:25.420 | But now, again, this becomes part of the history of our experiments for this particular feature and can track this over time.
01:07:35.420 | But now we are kind of strictly within the code while we are running this.
01:07:47.420 | Maybe just another thing to kind of highlight, I think I did last time, but I didn't really pull anything in here, is different ways in which we can start to compare this.
01:07:59.420 | So maybe we're really worried about sort of like the duration or we are worried about sort of the number of completion tokens.
01:08:08.420 | There are different ways in which we can start to view the data here.
01:08:12.420 | Again, I didn't change -- actually, I did change the model, right?
01:08:15.420 | This is another way in which we can start to understand the scores for the particular model.
01:08:21.420 | But again, like, this all just happened via code.
01:08:24.420 | Some simple commands that we just wrote on our command line and pushed that into the Braintrust platform.
01:08:30.420 | Cool.
01:08:31.420 | Any questions about running evals with the SDK and that eval class that we showed and the Braintrust eval command?
01:08:46.420 | Yeah?
01:08:47.420 | I just wanted to ask, like, there are image models, like, vision models and stuff.
01:08:54.420 | Does that work with the UI?
01:08:56.420 | Or is that only SDK?
01:08:58.420 | Yeah, images should work with the UI, yeah.
01:09:01.420 | Can I do non-LLM, like, models?
01:09:04.420 | Like, if I had, like, a -- like, a grounding model as part of my pipeline?
01:09:10.420 | Yeah, that's a good question.
01:09:13.420 | I'm not sure if it will let you add it as a custom model.
01:09:18.420 | Yeah, I can't -- I would say go try it and see.
01:09:23.420 | And if there's any issues, we have a Discord channel and, you know, we can add it to see
01:09:28.420 | if more people want that feature.
01:09:30.420 | Yeah.
01:09:31.420 | Thank you.
01:09:37.420 | Python-based SDK?
01:09:38.420 | Yeah, we have Python and TypeScript-based SDK.
01:09:41.420 | And then we have some additional ones in other languages, like Java or Kotlin, that -- Ruby, that are created with Stainless.
01:09:51.420 | So they're just wrappers of the API.
01:09:53.420 | We're creating a Go SDK as well.
01:10:03.420 | Great.
01:10:04.420 | Any other questions about Activity 2?
01:10:05.420 | Ready to move into production and we can talk about logging?
01:10:10.420 | All right, let's do it.
01:10:15.420 | So why should you even set up logging?
01:10:18.420 | Well, you know, there's a few reasons here.
01:10:20.420 | The main reason is to measure the quality of your live traffic.
01:10:23.420 | So this will allow you to set up alerts, if you'd like, of when it starts to dip, right?
01:10:29.420 | If you're putting out less than your best work and users are noticing, maybe they're providing feedback, or the scores that you have on those specific logs are just dipping, you can notify the right teams.
01:10:46.420 | You can debug, troubleshoot a lot faster by having that visibility into all of the functions that are running, all of the LLM calls going back and forth.
01:10:55.420 | And crucially, you can close the feedback loop.
01:10:58.420 | So you can capture the feedback.
01:11:00.420 | And, you know, we hear a lot of customers that are doing that, but they're not implementing it into their iteration process.
01:11:07.420 | They're not changing the prompts or adding it to data sets and closing that loop.
01:11:12.420 | So this can really help you do that.
01:11:14.420 | So what does logging into Braintrust look like?
01:11:20.420 | We have a very easy path, which is wrapping your LLM client.
01:11:24.420 | So if you're using OpenAI, you can just use the wrap OpenAI function around it.
01:11:31.420 | If you're using the Vercel AI SDK, we have a very similar thing.
01:11:34.420 | So this is essentially like one line of code, and then now everything is getting tracked, all your metrics, right, the amount of tokens, your latency, all the cost, all of that is getting populated in Braintrust.
01:11:47.420 | You can also trace arbitrary functions.
01:11:50.420 | We have the same sort of wrapping approach where you use a trace decorator or a wrap trace, and it will track everything occurring within that function.
01:11:59.420 | If you want to be more specific or more granular or have more flexibility, you can also do that.
01:12:04.420 | We integrate with OTel, and we have span.log support, which allows you to add any additional information or metadata that you want to track and add it to that interaction.
01:12:16.420 | And initializing the logger, right, that's important because it tracks which project you want this to be synced to, and it authenticates you into Braintrust.
01:12:23.420 | So that's part of the process, and in the code repo that you clone, you'll see that if you go to the generate/route.ts, you'll be able to see all of these steps in the code and understand what's needed to get it into Braintrust.
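Putting those pieces together, the instrumentation being described might look roughly like this in your app code. The project name and prompt text are placeholders, and this assumes the OpenAI client; the Vercel AI SDK path uses wrapAISDKModel in the same spirit:

```ts
// A minimal logging setup: initLogger decides which Braintrust project the logs go
// to, and wrapOpenAI captures tokens, latency, and cost for every wrapped call.
import { initLogger, wrapOpenAI } from "braintrust";
import OpenAI from "openai";

const logger = initLogger({
  projectName: "Changelog generator",          // placeholder project name
  apiKey: process.env.BRAINTRUST_API_KEY,
});

const client = wrapOpenAI(new OpenAI());

export async function generateChangelog(commits: string) {
  const completion = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: `Summarize these commits:\n\n${commits}` }],
  });
  return completion.choices[0].message.content;
}
```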
01:12:42.420 | So now that you have the logging and real user traffic entering Braintrust, you want to have that assurance of quality, or you want to have that visibility into quality of responses.
01:12:55.420 | And a great way of doing that is through online scoring.
01:12:58.420 | So this is going to use the scores from your playground or from your experiments, right, from your offline evals, and bringing them into that production environment, into that live traffic, and giving you scores that you can then use to filter down the logs,
01:13:14.420 | grab the, you know, underperforming use cases, and add them to a dataset, keep iterating, keep improving.
01:13:21.420 | You can also do some A/B testing, right, you tag certain logs in different ways, and you can compare the online scores for each one of those.
01:13:34.420 | So we'll show you in the activity how you can set up online scoring rules, and which spans you can enforce them to be on, and the sample size.
01:13:45.420 | So you could choose to online score 100% of the logs, or 1%, it's up to you.
01:13:52.420 | We do recommend for you to start with a low sampling rate, and then work your way up as you trust the metrics more and more.
01:13:59.420 | So this is a bit of the process, and then Doug here will show you in the platform how you can go about creating the rule.
01:14:07.420 | So once you have online scoring set up, it's really helpful to have one-click lenses of what matters to you or to your team.
01:14:20.420 | So you can think, you know, if you've used Notion, you've probably realized that they have views, right?
01:14:26.420 | You can customize the filter, save it to a view.
01:14:29.420 | Same idea in Braintrust.
01:14:30.420 | You can filter, sort, do whatever you want, save it to the view.
01:14:34.420 | It'll retain those settings, and then your team can just use that.
01:14:39.420 | So it's a great way of moving quickly, defining the filters that matter to you.
01:14:44.420 | Great, so now back to the platform, back to the repo, and go over logging and online scoring.
01:14:53.420 | Awesome.
01:14:54.420 | Let's configure some online scoring.
01:14:57.420 | All right.
01:15:01.420 | Before we do that, obviously we need to, like, get some logs into our Braintrust project.
01:15:06.420 | To do that, we'll come back to our application, and we're going to spin this up.
01:15:10.420 | So I'm going to run pnpm dev, and this is going to run this localhost 3000.
01:15:15.420 | So again, the use case that we're building for is that change log generator.
01:15:20.420 | The idea here is you provided that GitHub URL.
01:15:23.420 | We're going to give you, you know, the LLM will give you the summary of those commits, and then maybe categorize them into different categories.
01:15:30.420 | So I'll click that.
01:15:32.420 | This should go through the process.
01:15:34.420 | Actually, while it does that, let me come back here just to kind of highlight where this is all happening.
01:15:39.420 | So just kind of connecting what Carlos was talking about, like how we can start to instrument our application.
01:15:46.420 | There's a few different things here, right?
01:15:48.420 | We have our Braintrust SDK for TypeScript, obviously, and we're going to wrap the AI SDK model.
01:15:55.420 | So this is really going to give us a lot of goodness out of the box, being able to understand, like, the metrics, right?
01:16:02.420 | What are the completion tokens?
01:16:03.420 | What is the cost?
01:16:04.420 | All of that stuff happens simply by just wrapping the LLM client with that.
01:16:10.420 | That's, again, like really the easy way to do it.
01:16:13.420 | Then there's being able to, like, you know, instrument your application with a little bit more granularity.
01:16:19.420 | And so maybe you want to use this trace function from the logger to actually be able to specify what that input is, what that output should be, and then add some metadata.
01:16:28.420 | The reason you might want to do something like this is now this maps to the data structure, right?
01:16:33.420 | That becomes really powerful to set up those scores against, right?
01:16:37.420 | So we start with something that's very basic where we get something very easy out of the box, and then we can go down to the actual, like, you know, the individual tool calls, if you will, and then actually create the inputs and outputs for that so that we can then create scores on top of those.
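A rough sketch of that more granular instrumentation, wrapping one step of the pipeline in its own span so it gets its own input, output, and metadata. The step name and the stand-in GitHub call are hypothetical, and this assumes the logger from the earlier snippet has already been initialized:

```ts
// Wrap an individual step in a named span so it can be inspected and scored on its own.
import { traced } from "braintrust";

async function fetchCommits(repoUrl: string): Promise<string[]> {
  return traced(
    async (span) => {
      // Stand-in for the real call that fetches commits from GitHub.
      const commits = ["feat: add dark mode", "fix: crash on login"];
      span.log({
        input: { repoUrl },
        output: commits,
        metadata: { step: "fetch-commits" }, // handy later for filtering and views
      });
      return commits;
    },
    { name: "fetch-commits" },
  );
}
```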
01:16:54.420 | Back to the application here, right?
01:16:56.420 | Here's the, you know, essentially the response from the large language model.
01:17:00.420 | Coming back into Braintrust, we should now see a log.
01:17:06.420 | And then this is essentially what just happened, right?
01:17:09.420 | Because of what we configured with that WRAP-AI SDK, as well as the different logs that we had set up, right?
01:17:15.420 | So there's this top level, think of this as a trace, and all of these sort of under the hood or underneath it are the individual spans that we actually want to understand.
01:17:24.420 | This allows us the visibility into the multiple steps that the application will take, and allows us, like I showed you in that previous example, to create scores for these individual spans as well.
01:17:37.420 | But you can see here, right, the top level generate request.
01:17:40.420 | Here is my input, right?
01:17:42.420 | What are my commits for this repo?
01:17:43.420 | Here's the URL, and then here is the output.
01:17:46.420 | But then being able to drill down into these individual things, right?
01:17:49.420 | So what are my commits?
01:17:50.420 | What are my, what is my latest, excuse me, release?
01:17:53.420 | And then again, being able to drill down into these becomes really beneficial as we start to need to understand how our application is performing.
01:18:03.420 | So from there, right, just going to come back here, we now want to maybe set up some online scoring.
01:18:08.420 | So we walked through both within the platform running experiments, walked through the SDK running experiments as well.
01:18:16.420 | This is pre-production, right?
01:18:18.420 | This is, now when we have our logs running in production, we want to understand how our application is performing relative to the scores that we've already configured.
01:18:29.420 | So I'm going to come back into Braintrust, click on the settings, and then the, where am I?
01:18:50.420 | Excuse me, just configuration over here on the left, and then online scoring.
01:18:55.420 | So this is where we can now create the rule.
01:18:58.420 | Let's say, this is, come on.
01:19:03.420 | This will allow you to select the different scores that you've configured within your project.
01:19:07.420 | Now, I could select all of these, or maybe I have, like in that example that I showed to you previously, I really just want to apply this to a particular span.
01:19:18.420 | All right, and this allows you to get very granular.
01:19:20.420 | Right now, we'll just configure all of these for the overall trace, and we'll do it for 100% of the samples.
01:19:29.420 | Okay, so now when we come back to our application, and when we generate this, we'll have the same sort of workflow, right?
01:19:47.420 | We're going to generate this, the logs will look the exact same as they did previously, but now we are going to score this.
01:19:54.420 | We're going to understand the things that are happening within production.
01:19:58.420 | How does this output fare to the scores that we've configured for this particular task?
01:20:05.420 | This allows us to understand, we could even create automation rules here, so maybe a score drops below 50%, whatever it is.
01:20:12.420 | We can now create sort of an automation that allows us to understand when something like that happens.
01:20:17.420 | So you could see here, we have our scores ran.
01:20:23.420 | Again, you can drill down into these things, and you can see, you know, what is the rationale for that particular score in that instance.
01:20:36.420 | Maybe the other thing here, since we had some good outputs, is that this is where we would want to think about
01:20:44.420 | how we give users the ability to traverse through these logs in a really scoped-down way.
01:20:51.420 | This is where what Carlos was talking about with views.
01:20:54.420 | And so maybe it's being able to understand, if I open this up, and I create a filter on my completeness score,
01:21:01.420 | and I'm just going to modify this slightly, I'll say less than 50%, this now can become a view that I save to this project.
01:21:13.420 | And now it becomes really easy for somebody to come through and look through these logs, and they can start to understand the inputs and the outputs,
01:21:22.420 | and if it makes sense, we can add these spans to the data set.
01:21:26.420 | That's another thing that I think we think about, like, the sort of feedback loop that we need here.
01:21:33.420 | We were kind of like in pre-prod, right?
01:21:34.420 | We were developing evals in the UI or in our code base.
01:21:39.420 | We pushed it into production.
01:21:40.420 | How do we create that flywheel effect from those two different places?
01:21:44.420 | This is part of it, right?
01:21:45.420 | Being able to enable sort of like the humans looking at these things, and then being able to add those really unique or relevant spans to the existing data sets, or even creating net new data sets from them.
01:21:56.420 | Cool.
01:21:57.420 | That's activity three.
01:21:59.420 | Questions?
01:22:01.420 | Can you aggregate those scores?
01:22:02.420 | If you go to the settings, aggregate scores here, so you can do that.
01:22:21.420 | So like, yeah, you could create a true north, right?
01:22:26.420 | And just based on the aggregate score, decide if it's improving or regressing.
01:22:30.420 | So the SDK is meant to be pushed to production then to get all those data amounts?
01:22:36.420 | If you want to do online scoring, you'll need to push scores into Braintrust, right?
01:22:41.420 | So that they can then be referenced, because for online scoring, if you're using the SaaS model, right, we're running that compute.
01:22:48.420 | We're, you know, we're grading the logs.
01:22:53.420 | So yeah, I think one case where you would want to push to Braintrust is for online scoring.
01:23:01.420 | If you have teammates that are using the playground, and they want access to the prompts or the tools that you've made in your SDK, that's when it makes sense.
01:23:10.420 | But you don't need to, right?
01:23:11.420 | Like you could still do everything with code and get value from Braintrust.
01:23:18.420 | Maybe just to highlight really quick, we also have just sort of scores out of the box that you can configure.
01:23:26.420 | So if I go to the playground, like if you don't want to push up certain things.
01:23:31.420 | Now, there's not as much flexibility in this as there are with your custom scores.
01:23:36.420 | But there is an auto evals package that is developed internally by us that you can use to add to these tasks as well.
01:23:44.420 | And then you can use those in online scoring also.
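For example, a prebuilt LLM-as-a-judge scorer from autoevals can be dropped in without pushing anything custom; a minimal sketch (Factuality calls a model under the hood, so it needs a model provider configured):

```ts
// Using a prebuilt scorer from the autoevals package instead of a custom one.
import { Factuality } from "autoevals";

async function main() {
  const result = await Factuality({
    input: "Summarize the commits for the next release.",
    output: "This release adds dark mode and fixes a login crash.",
    expected: "Dark mode support; login crash fix.",
  });
  console.log(result.score); // a value between 0 and 1
}

main();
```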
01:23:53.420 | Any other questions about logging or online scoring?
01:23:57.420 | Okay, cool.
01:24:01.420 | Now we're going to move into the human section, human in the loop.
01:24:04.420 | My favorite part.
01:24:07.420 | I'm needed.
01:24:11.420 | All right, let's get, let's get into it.
01:24:12.420 | Let's get into it.
01:24:13.420 | So yeah, it's honestly really helpful for certain companies to implement human in the loop, right?
01:24:19.420 | Not all AI mistakes are created equal.
01:24:22.420 | Depending on the industry that you're in, like healthcare, finance, or legal tech, a single failure can have, you know, huge, serious consequences.
01:24:31.420 | So that's where you may want to have and invest more resources and putting humans in front of those AI responses and making sure that they're up to par.
01:24:40.420 | This can be really helpful for catching hallucinations, especially in specific fields that require specific knowledge, establishing ground truth.
01:24:49.420 | So if you're going to start filling out that expected column in the data set of ideal outputs or, you know, ideal responses, it may be helpful to have a human help create that.
01:25:00.420 | And, you know, you also want to capture user feedback and make sure that that's being incorporated in your feedback loop
01:25:09.420 | and being used in the prompt iteration.
01:25:14.420 | So this is great for quality, reliability, providing that ground truth, and aligning with what humans need in specific areas where the LLM just doesn't have as much visibility.
01:25:30.420 | There are two types of human in the loop that we're going to cover.
01:25:35.420 | Human review.
01:25:36.420 | So this is an annotator and a subject matter expert that's going to come into brain trust.
01:25:41.420 | They're going to manually label a specific data set or a specific log, right?
01:25:46.420 | They'll score it or they'll audit outputs.
01:25:48.420 | And this is really useful for ground truth to the data set.
01:25:53.420 | They can audit certain edge cases and they can also help eval the LLM as a judge.
01:26:00.420 | So based on what that human annotator determined was a good response, you could train the LLM as a judge to also label it the same.
01:26:08.420 | And then for user feedback, that's, you know, closing the feedback loop.
01:26:14.420 | You're incorporating what your real users are thinking.
01:26:17.420 | If they keep thumbs downing a certain edge case, you can incorporate that feedback and add it into a specific data set, you know?
01:26:24.420 | And that is then used to help improve that problem that you're dealing with.
01:26:33.420 | So that's it.
01:26:34.420 | You know, now we're going to jump back in and show you how you can enable user feedback and set up human review.
01:26:41.420 | And at the end, we'll do some questions.
01:26:44.420 | We also have a bonus lecture and activity about remote evals that if you stick around, we'll probably have time to get to.
01:26:52.420 | Cool.
01:26:55.420 | Let's jump into activity four.
01:26:57.420 | Obviously, we're going to look at user feedback and human review.
01:27:00.420 | There's sort of like two different routes here with human in the loop.
01:27:04.420 | Maybe to give a little background here or just to show you where some of this is happening.
01:27:11.420 | If you look at API feedback and then route.ts, the one thing I'll highlight is this log feedback.
01:27:17.420 | So this is really all like what we need on the brain trust side.
01:27:21.420 | Again, using the brain trust SDK.
01:27:24.420 | We know of the span that we want to log this to from the current span that we are in.
01:27:31.420 | This is the user feedback score.
01:27:33.420 | Right?
01:27:34.420 | So this is like essentially what we're calling it.
01:27:35.420 | User underscore feedback.
01:27:36.420 | And then the score.
01:27:37.420 | In our case, it's really just going to be a thumbs up or thumbs down.
01:27:40.420 | So a zero or a one.
01:27:41.420 | We also optionally have the ability to add a comment.
01:27:45.420 | So if the user, like your application has the ability for the user to define free text, you
01:27:51.420 | can also put that in here as well.
01:27:53.420 | And then also, of course, using metadata.
01:27:56.420 | Metadata becomes really powerful from the filtering aspect.
01:27:59.420 | So like we can use this within our custom views to filter on these different metadata keys.
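A minimal sketch of what that feedback endpoint might look like, assuming the generate route exported its span id (for example via span.export()) and sent it to the client so the thumbs up or down can be attached to the right trace; the project name, score name, and metadata are illustrative:

```ts
// Attach user feedback to an existing span as a score, with an optional comment.
import { initLogger } from "braintrust";

const logger = initLogger({ projectName: "Changelog generator" });

export async function logUserFeedback(spanId: string, thumbsUp: boolean, comment?: string) {
  logger.logFeedback({
    id: spanId,                                  // the exported span to attach this to
    scores: { user_feedback: thumbsUp ? 1 : 0 }, // shows up as a score column in the logs
    comment,                                     // optional free text from the user
    metadata: { source: "web-app" },             // useful later for filtered views
  });
}
```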
01:28:09.420 | coming back to the app.
01:28:10.420 | And you may have seen this at the bottom when we ran this last time.
01:28:14.420 | But we have at the bottom the thumbs up and the thumbs down.
01:28:19.420 | This will allow the application here to log this particular user's feedback.
01:28:25.420 | So based on what they think, we can obviously do a thumbs up.
01:28:29.420 | We can submit with comment.
01:28:34.420 | And now if we go back into the Braintrust application, we have that net new log.
01:28:38.420 | We have over here on the right, we have our comment from the user.
01:28:43.420 | And then we should see a net new column here within our logs for that user feedback.
01:28:49.420 | But again, as Carlos pointed out, we can now use this in a lot of different ways to help augment
01:28:54.420 | the workflows that we've already created.
01:28:56.420 | This can help create custom views on top of this.
01:29:00.420 | This can now feed into a workflow for our human annotators to understand what that user feedback is
01:29:06.420 | and whether or not we should incorporate it into other datasets.
01:29:09.420 | Again, being able to click here, add a span to a dataset, and so on.
01:29:13.420 | That's the sort of user feedback route.
01:29:18.420 | The other one is the human review.
01:29:21.420 | So back in our configuration pane over here, we have the ability to create different human review scores.
01:29:31.420 | So again, this will be a little bit more specific to your use case and what you want your humans to review
01:29:37.420 | or the scores that they should create.
01:29:39.420 | It could be an option, it could be a slider, or it could be sort of free form text.
01:29:48.420 | So we could say something like, what's a better score, and then give it sort of an option.
01:29:55.420 | You know, thumbs up and thumbs down.
01:30:01.420 | So sort of like the human now can sort of look at what the output was and give their own sort of thumbs up, thumb down of this.
01:30:09.420 | And now if you come back into the logs UI, users that are doing human review can have a different interface into those logs
01:30:19.420 | where maybe something like this is a little bit too much for what they're trying to do.
01:30:25.420 | Let's give them a little bit more pared down look at the inputs and outputs of this particular span.
01:30:34.420 | And then they have the different scores that we've configured over here.
01:30:37.420 | And so maybe this is a thumbs up.
01:30:39.420 | They'll go through and they can do that review.
01:30:41.420 | Again, this now can add to that sort of flywheel effect that we need to create to build this Gen AI app.
01:30:48.420 | Of course, yeah, I think I mentioned this a few different times, but being able to now add some of these spans to the data set,
01:31:01.420 | we can do that here as well.
01:31:03.420 | So now we should have, you know, if you look at the sample data, we had seven rows.
01:31:12.420 | Through this process, we've added two net new rows to this data set.
01:31:15.420 | This now can help our offline evals and really create a rigorous sort of workflow to building a robust AI app.
01:31:23.420 | Cool.
01:31:28.420 | Any questions?
01:31:30.420 | So those human evaluators, they have to have their own account.
01:31:35.420 | But how will they see those evals that they need to evaluate?
01:31:41.420 | Yeah.
01:31:42.420 | So you can add them to the specific project that you would want them to annotate.
01:31:47.420 | You could assign them.
01:31:49.420 | So we have the ability to add comments under specific, under data sets, under experiments, or a specific trace.
01:31:58.420 | You would tag them in something that you would want them to see.
01:32:01.420 | We have sometimes, you know, you'll create an experiment and you'll assign it to a specific annotator.
01:32:07.420 | And they'll go in and fill out everything in that experiment or in a data set.
01:32:11.420 | Yeah.
01:32:12.420 | Sorry.
01:32:13.420 | The first one is developed by the client, right?
01:32:13.420 | Because it's shown in that web page where you have thumbs up and thumbs down.
01:32:13.420 | Yeah.
01:32:14.420 | So the actual, like, the text box that you want them to fill out or, you know, the categories
01:32:31.420 | that you want them to thumbs up or thumbs down, you would define initially.
01:32:34.420 | And then you would let them go in and, and they would say, okay, based on completeness,
01:32:40.420 | I would give this a thumbs down.
01:32:41.420 | Or, you know, what, what is the ideal output?
01:32:44.420 | Let me use this text box and, and type it all out.
01:32:47.420 | So, so yeah, you would essentially create the form that they would then go and fill out.
01:32:53.420 | Okay.
01:32:54.420 | Any other questions?
01:32:56.420 | Yeah.
01:32:57.420 | This isn't so much about the evals, but I saw in your documentation that you support, like,
01:33:06.420 | writing tools and uploading those as well and kind of, like, injecting those in your, like,
01:33:14.420 | the evaluation prompt, right?
01:33:15.420 | Is there a way to, like, version the tools as well?
01:33:18.420 | Because, like, tools will also, like, be changing over time.
01:33:21.420 | And so, like, how can we keep track of that as well in the system?
01:33:25.420 | Yeah.
01:33:26.420 | That's a good question.
01:33:27.420 | Yeah.
01:33:28.420 | I believe it supports versioning.
01:33:29.420 | I guess this project doesn't have any tools.
01:33:36.420 | I'm going to give you a soft, like, yes, it should be versioned.
01:33:44.420 | Same with the prompts and, and scores.
01:33:46.420 | Like, they all, at the very bottom, if you create a tool and add a new version of it, at
01:33:51.420 | the very bottom, well, we can show you with the prompts.
01:33:54.420 | Potentially, if, yeah, so at the very bottom, we see this one version that was initially created.
01:34:01.420 | But if we were to go and, and make a change, save version, now we'd have that new version.
01:34:12.420 | So I, I believe the same thing would happen with the tools.
01:34:14.420 | So, like, you would, like, add a tool to this and that would just be, like, a new version of it, you know?
01:34:19.420 | Yeah, yeah, exactly.
01:34:20.420 | Yeah.
01:34:21.420 | If you were to add a tool or change the prompt at all, it would be a new version of the prompt.
01:34:25.420 | You'd save it.
01:34:26.420 | It would now appear at the bottom.
01:34:27.420 | And your question was specifically, like, in tools, right?
01:34:30.420 | Like, if you change a tool, would it be versioned?
01:34:31.420 | Yeah, I guess because you have, like, different-- oh, thanks.
01:34:36.420 | I guess you have, like, different prompt versions and then you have different tool versions.
01:34:41.420 | And then you're going to, like, evaluate those, like, mix and match those together, right?
01:34:45.420 | To, like, see which performs the best.
01:34:48.420 | Yeah.
01:34:49.420 | So there would be a scenario where the tool doesn't change, just the version that you're using in the prompt changes.
01:34:56.420 | And how would you then version the prompt?
01:34:58.420 | Yeah.
01:34:59.420 | Right.
01:35:00.420 | Yeah.
01:35:01.420 | That's an interesting scenario.
01:35:02.420 | I think we versioned it correctly today.
01:35:03.420 | But if there's any issues, you know, the SE team is here to help.
01:35:07.420 | So feel free to reach out.
01:35:09.420 | Great.
01:35:10.420 | Thank you.
01:35:11.420 | Yeah.
01:35:12.420 | Any other questions?
01:35:17.420 | Human in the loop?
01:35:19.420 | Okay, cool.
01:35:24.420 | All right, so now we're into the bonus session, I believe.
01:35:30.420 | Oh, not this one, I guess.
01:35:32.420 | Let's see.
01:35:33.420 | No bonus for you guys.
01:35:37.420 | This one has it.
01:35:42.420 | Okay, we can just do it this way.
01:35:59.420 | All right, so the new bonus session.
01:36:01.420 | Look at this.
01:36:02.420 | Extending playgrounds with remote evals.
01:36:05.420 | So as you've seen, right, the playground can handle one LLM call being made.
01:36:13.420 | It can handle multiple LLM calls chained together, right?
01:36:16.420 | The agent feature.
01:36:17.420 | But if you wanted to have intermediate code steps in between the prompts, not supported,
01:36:24.420 | right?
01:36:25.420 | You'd have to go via the SDK.
01:36:26.420 | Or if you wanted it to talk to these external systems that are part of your VPC, that also
01:36:34.420 | wouldn't work.
01:36:35.420 | Essentially, if your task gets too complex, it doesn't fit into the mold of the playground
01:36:42.420 | at the moment.
01:36:43.420 | So that's where remote evals comes in.
01:36:45.420 | It allows you to expose that local environment that you've set up in your laptop that has all
01:36:53.420 | the bells and whistles, has everything that you want set up.
01:36:56.420 | You can now make that available in the playground.
01:37:01.420 | So if you have custom internal tooling, intermediate code steps, you know, a dynamic R&D environment
01:37:06.420 | that's constantly changing and you don't want to keep pushing tools or pushing scores, everything,
01:37:11.420 | to brain trust, then you can just set up a remote eval and it will become available to anybody
01:37:16.420 | in the playground.
01:37:17.420 | So this is a great way of bridging the gap between very complex technical teams building
01:37:25.420 | crazy tasks in the SDK and non-technical people, maybe PMs, SMEs that still want to be hands-on,
01:37:32.420 | still want to iterate on the prompt, change certain parameters and see how that impacts the scores.
01:37:38.420 | So the remote evals can really help there.
01:37:44.420 | So how this works, right, brain trust will send an eval request to your remote server.
01:37:49.420 | This can be local host or you can set it up somewhere else.
01:37:53.420 | Your server will then run the task and the score logic all locally.
01:37:57.420 | So all the intermediate steps, crazy business that you have going on, it'll all work because it's occurring on that server.
01:38:03.420 | Same with the scores.
01:38:04.420 | You could have crazy complex scores.
01:38:06.420 | That would all work as well.
01:38:08.420 | And then it will return the outputs, the scores, metadata to brain trust and it'll populate the playground.
01:38:14.420 | To start this, it's very simple.
01:38:19.420 | You do the same brain trust eval and then point to a specific eval.ts file.
01:38:25.420 | The difference is that you use the --dev flag.
01:38:29.420 | So this will start that remote eval server.
01:38:34.420 | It will default to local host, but you can change it.
01:38:38.420 | And it will allow it to bind to all of the network interfaces, so you could access it remotely -- yeah, access it from servers out in the world.
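Structurally, the eval file itself is the same as a normal one; what changes is how you start it. A minimal sketch, with the dev-mode command in the comment. The project name and the stand-in task are placeholders; in practice the task is where your intermediate code steps and internal service calls live, and they keep running on your machine:

```ts
// remote.eval.ts -- serve this eval to the playground instead of running it once:
//   npx braintrust eval --dev ./evals/remote.eval.ts
import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";

Eval("Changelog generator (remote)", {
  data: () => [{ input: "fix: crash on login", expected: "- fix: crash on login" }],
  // Stand-in task; replace with the complex, multi-step logic you don't want to push.
  task: async (input: string) => `- ${input}`,
  scores: [Levenshtein],
});
```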
01:38:50.420 | Cool.
01:38:51.420 | So now we can get into the activity.
01:38:54.420 | We have something to -- you know, if you follow along with the activity document, you could set this up yourself.
01:39:00.420 | But if not, we can just show you quickly.
01:39:02.420 | All right.
01:39:05.420 | Let's first look at the code.
01:39:07.420 | It's going to look very similar to what I showed the first time.
01:39:11.420 | Slightly different prompt, just trying to layer in a little bit more complexity here within this task.
01:39:16.420 | We now have all these different parameters that we want the user to be able to define and don't yet necessarily have the ability to do that within -- or in a great way within -- Braintrust.
01:39:26.420 | Again, this can -- this can be very, very complex.
01:39:30.420 | Whatever sort of like, you know, the intermediate steps that you want to create are certainly allowed here.
01:39:37.420 | I'm going to take down this server and I'm going to run this command, pnpm remote eval, which, under the hood,
01:39:45.420 | is just running that braintrust eval command and giving it the location of the file that we're running the remote eval on.
01:39:52.420 | Back in the Braintrust playground, this now is exposed as something that we can load in here.
01:40:01.420 | So let me remove this.
01:40:04.420 | I'll remove this here as well.
01:40:06.420 | And now if you look down here, we have this -- this other option for remote eval.
01:40:13.420 | And this should link out to my change log generator eval.
01:40:16.420 | Here are the different -- excuse me -- parameters that I'm able to now configure that I've sort of exposed in that prompt.
01:40:23.420 | But again, now think of like the more complex type of workflows that you may have within your code base.
01:40:30.420 | And you don't necessarily have the ability to push those into Braintrust.
01:40:34.420 | We can still leverage this platform with that existing code base via remote evals.
01:40:40.420 | And this sort of like interaction works very similar to what I showed earlier with the evals locally via the SDK as well as the playground.
01:40:49.420 | We should -- oh, well, probably have to -- oh, we did configure that.
01:40:55.420 | My scores.
01:40:56.420 | I'll switch this to the grid layout so we can start to see what the output is.
01:41:02.420 | But this is how we can really meet a maybe more technical team with a more complex task.
01:41:09.420 | And they're not necessarily wanting to push all of these things into the Braintrust platform, but still leverage it in some way.
01:41:15.420 | This is how we can sort of bridge that gap.
01:41:18.420 | And grounding this in the notion use case, they use a lot of services, right?
01:41:28.420 | They have a lot of tools that will retrieve different user information.
01:41:31.420 | And they run all their evals via the SDK.
01:41:36.420 | And if they notice that there's an issue, right, it's underperforming in a certain area, a certain edge case isn't great.
01:41:43.420 | Maybe they isolate a certain row that they want to keep rerunning.
01:41:47.420 | Currently, they have to go to the SDK and run everything over and over again.
01:41:52.420 | So if they were to use remote evals, they could bring that custom task into the playground, and they could isolate just one row and run it on one specific thing that they're working on.
01:42:04.420 | They'd save money, and they'd move faster.
01:42:09.420 | So that's something that we're going to discuss shortly with them and a lot of our other customers.
01:42:13.420 | This is a brand new feature, and it's something that we've been hearing a lot of excitement around.
01:42:18.420 | So I hope you see the vision of where this could be effective.
01:42:23.420 | And, yeah, feel free to, you know, if you have any questions about remote evals or about anything else in the presentation, we've pretty much reached the end.
01:42:31.420 | And thank you for sticking around.
01:42:32.420 | I know it's been a long day, to say the least.
01:42:43.420 | For the remote, do you need your IP address publicly available so Braintrust can hit it, or...?
01:43:01.420 | Yes, or you can use localhost.
01:43:05.420 | So, I mean, I guess, yeah.
01:43:08.420 | I can't imagine, like, if it's private, then Braintrust wouldn't be able to connect to it.
01:43:14.420 | Okay.
01:43:15.420 | All right.
01:43:17.420 | Any other questions?
01:43:27.420 | Great.
01:43:32.420 | Well, thanks, everyone.
01:43:33.420 | Thanks, all.
01:43:34.420 | Hope you have a good night, and we'll see you tomorrow.
01:43:36.420 | Day two.