Engineering Better Evals: Scalable LLM Evaluation Pipelines That Work — Dat Ngo, Aman Khan, Arize

00:00:00.000 |
All right. Well, let's get started. We don't have much time, but I hope your conference is going 00:00:18.780 |
well. Welcome to the AI Engineer World's Fair. My name is Dat Ngo. Today's talk is about LLM eval 00:00:25.960 |
pipelines. So I never know what I want to talk about until I get into the room, so I 00:00:30.580 |
don't prep too hard, but by show of hands, who here has built an agent? Just raise your 00:00:36.260 |
hand. Okay. Who here has run an eval? Right? Who here has productionized an AI product? 00:00:45.160 |
Nice. Okay. Some technical builders. Let's get technical then. So my name is Dat. I'm an AI 00:00:54.280 |
architect at Arize AI. This is Mochi and Latte. They're dogs of my friends. I figured, let's keep 00:01:02.520 |
it spicy and interesting. But I've been building observability and evals since day zero, since 00:01:08.620 |
the very first days. I don't know if you guys know what Arize AI is, but we are the largest AI evals player in the 00:01:15.460 |
space. So observability, evals, and kind of beyond. We work really heavily with real use cases. So folks like 00:01:22.160 |
Reddit, folks like Duolingo. So we work across the best AI teams and we have a really unique business. 00:01:28.400 |
Being on the observability side, we get to see what everyone is building, how they're tackling those 00:01:35.300 |
problems, what are their biggest pains, and what are kind of the tips and tricks that they use to really 00:01:40.820 |
productionize these things. And just to give you a hint, Duolingo has massive eval scale. They tend to run 00:01:47.160 |
about 20 evals per trace. So they end up spending quite a fair amount doing evals, understanding their 00:01:55.400 |
evals, optimizing them. And the last thing about me, I have a huge passion for the AI community. When I was in SF 00:02:01.000 |
the last five years, I really loved to go to pretty much every single event that I could. I'm not a developer 00:02:06.520 |
advocate. I'm an engineer by trade, but I just love the community. So yeah, this is a little bit about 00:02:13.240 |
Arize, but I don't want to keep it too salesy. I just want to keep it pretty technical. 00:02:16.440 |
So really three concepts that I think everybody should be familiar with and where evals really sit 00:02:21.800 |
in the space is really as simple as this. This is what I teach all my customers. Really, the first thing 00:02:27.320 |
is observability. I think you guys have kind of seen this before. Observability really just answers the 00:02:32.120 |
question of what is the thing that I built actually doing, right? To some people, it may be traces, 00:02:37.240 |
traces and spans. I'll show you a little bit in the platform, but stay, you know, platform agnostic. Just think about the concepts. 00:02:45.160 |
But traces might be one area for people. So traces represent, hey, what's happening? Can I look at 00:02:51.080 |
things? To an AI engineer, makes a lot of sense. Let's say you're an AI PM, maybe not super technical, 00:02:57.000 |
or maybe you want to think about things differently. Maybe you want to look at, 00:03:01.080 |
hey, what are the conversations that are happening? Turns out you can run evals at these levels. 00:03:05.000 |
We'll get into that in depth kind of later. You know, signal and observability come in 00:03:11.240 |
different kind of flavors and forms. Maybe it's analytics. What we're starting to really realize is 00:03:16.200 |
that LLM teams are getting split into two specialized niches. There's platform teams, right? And they own 00:03:23.320 |
things like the infrastructure. So who here has heard of a model gateway or router? It's like an interface 00:03:29.240 |
pattern that sits in front of all the models, right? Well, it turns out the central LLM platform team tends to 00:03:33.960 |
own that. They care about costs, latency, things like that. And then you have the other LLM teams. 00:03:39.240 |
These LLM teams sit on the outer side of the business, like the spokes in a hub and spoke. 00:03:45.240 |
They work for the business side. So these are like the people building the applications to help the 00:03:49.880 |
business. So if anyone here comes from like the ML or data science space, it's actually not far from 00:03:55.320 |
that. And so different teams care about different metrics. So maybe if you're an AI PM sitting on the 00:04:01.480 |
business side, you care about evals. If you care about, you know, the platform, maybe you care about 00:04:06.040 |
costs, latency, things like that. But TL;DR, observability represents what's happening. 00:04:15.400 |
And now evals are really important in this space because the reality is, if you've ever 00:04:19.560 |
seen a trace or something like that, you're not going to inspect every single trace manually, right? 00:04:25.880 |
It is not scalable for you, the AI engineer, or you, the AI PM, to look through these things. So what are evals used 00:04:33.080 |
for? It's actually just a really clever word for signal. You're just trying to understand what's going well 00:04:37.880 |
and what's not going well. So I'm not here to sell you on like evals. I think everybody knows how important they are. 00:04:43.000 |
But if you think evals, evals are LLM as a judge only, there's actually a lot of other tools that 00:04:49.400 |
you're missing. So LLM as a judge, raise your hand if you used LLM as a judge. Okay, about half the room. 00:04:55.000 |
It's super great. You use an LLM to give you feedback on any process, including an LLM process. 00:05:02.520 |
So if you're doing RAG, this is a really good way to think about RAG in terms of evals. RAG would be like, 00:05:08.440 |
hey, user has a question. We retrieve some context to be able to possibly answer that question. 00:05:14.840 |
And then we generate an answer. It turns out every arrow on this is actually an eval you can run. So, 00:05:21.320 |
hey, I retrieve some context and I want to compare that to the query being asked. Well, 00:05:25.720 |
that's RAG relevance. It's like, is the thing that I returned even helpful in answering the question? 00:05:30.840 |
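To make that concrete, here's a minimal sketch of a RAG relevance eval in Python; the prompt wording, the labels, and the call_llm helper are placeholders for whatever judge prompt and model client you actually use, not any platform's built-in template.

```python
# Minimal sketch of an LLM-as-a-judge "RAG relevance" eval.
# `call_llm` is a stand-in for your model client (OpenAI, Bedrock, etc.).

RELEVANCE_PROMPT = """You are grading retrieved context for a RAG system.
Question: {question}
Retrieved context: {context}
Is the context relevant to answering the question?
Answer with exactly one word: relevant or irrelevant."""


def call_llm(prompt: str) -> str:
    """Placeholder for your model client; returns the raw completion text."""
    raise NotImplementedError


def rag_relevance_eval(question: str, context: str) -> str:
    label = call_llm(RELEVANCE_PROMPT.format(question=question, context=context)).strip().lower()
    # Check "irrelevant" first, since it contains "relevant" as a substring.
    return "irrelevant" if "irrelevant" in label else "relevant"
```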
And so LLM as a judge is great. It's super helpful. You know, I think most people understand why it 00:05:36.760 |
works, but there's a whole research area on why they're really good indicators. The original task 00:05:43.000 |
is not the eval task, right? So if I asked you, a human, hey, generate me a summary on something long 00:05:49.080 |
and complex, like the book War and Peace, that's a very different task than say, hey, I wrote this summary 00:05:55.000 |
for you. Is it a good one or is it a bad one? But LLM as a judge is a small part. It doesn't always 00:06:02.360 |
have to be a large language model or autoregressive model. Things like encoder-only BERT-type 00:06:08.120 |
architectures are super helpful. They're about 10 times cheaper, about one or two orders of magnitude 00:06:14.520 |
faster to run that eval. But, you know, you don't just have LLMs at your disposal. You, a human, 00:06:20.760 |
also are a really good way to discern signal. So it turns out evals can also come in the form of, 00:06:25.320 |
is your user having a good or bad experience? So for those people who have productionized some sort 00:06:29.800 |
of LLM application, do you guys have user feedback? Raise your hand if you've implemented user feedback. 00:06:34.360 |
And, okay, about 30%. It's actually incredible signal. So that comes from a human. Obviously, 00:06:41.640 |
you yourself can also generate labels on stuff. So has anyone here heard of a golden data set? 00:06:46.760 |
Raise your hand. Okay. Most of the room. And the way I encourage folks to think about this, 00:06:52.040 |
and this is a pro tip, actually: the first column represents scale. So LLM as a judge is valuable 00:06:57.240 |
because I don't have to grade it myself, right? But let's say I don't necessarily trust it off the bat. 00:07:02.120 |
Use the third column to help you out. So a golden data set represents quality. So you yourself graded it. 00:07:09.000 |
You know that it's what you expected, you know. Well, it turns out you can run your LLM as a judge 00:07:14.600 |
on a golden data set. What you're trying to do is say, hey, can the LLM approximate the thing that I 00:07:20.920 |
trust? Right? And what that allows you to do is to actually quantify and tune your LLM as a judge. 00:07:26.360 |
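As a sketch of what that quantification can look like, assuming a golden set with hypothetical input/output/human_label fields and a judge that returns "hallucinated" or "factual" labels:

```python
# Sketch: quantify how well an LLM judge approximates your golden (human-labeled) set.
# `golden` rows use hypothetical keys: "input", "output", "human_label";
# `judge_fn` is your LLM-as-a-judge eval returning "hallucinated" or "factual".

def judge_agreement(golden: list[dict], judge_fn) -> dict:
    tp = fp = fn = correct = 0
    for row in golden:
        pred = judge_fn(row["input"], row["output"])
        truth = row["human_label"]
        correct += int(pred == truth)
        tp += int(pred == "hallucinated" and truth == "hallucinated")
        fp += int(pred == "hallucinated" and truth == "factual")
        fn += int(pred == "factual" and truth == "hallucinated")
    return {
        "accuracy": correct / len(golden),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```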
So we'll go over that in a second. But strong pro tip. Most really strong LLM teams in the world 00:07:31.560 |
kind of do this today. And it turns out we don't always have to use an LLM or a human. You can use 00:07:37.800 |
what are called like heuristics or code-based logic. So I'm going to take you into the platform a little 00:07:42.040 |
bit to talk through it. But in our platform, you have a way to run evals. Great. What code evals 00:07:47.720 |
actually are is just much cheaper checks. So I'll just run a little test here. But, you know, let's say 00:07:55.320 |
you want to say, hey, does this output contain any keywords? I don't need to use an LLM or a human for that. 00:08:00.280 |
I can just use code. It's infinitely cheaper, faster to run. Does this match this regex pattern, 00:08:06.520 |
XYZ? Is this parsable JSON? So the reality of it is you have this kind of large toolbox 00:08:12.360 |
in your eval set. So when you say evals, don't just think of LLM as a judge. There's a whole 00:08:18.120 |
other set of smarter things that you can use, and they're actually more cost effective. 00:08:24.280 |
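For instance, keyword, regex, and JSON-parse checks along those lines might look like this; the function names are illustrative, not a specific library's API.

```python
import json
import re

# Code-based evals: no LLM and no human, just cheap deterministic checks.

def contains_keywords(output: str, keywords: list[str]) -> bool:
    """Does the output mention any of the required keywords?"""
    text = output.lower()
    return any(kw.lower() in text for kw in keywords)


def matches_pattern(output: str, pattern: str) -> bool:
    """Does the output match a regex pattern (e.g. an expected ID format)?"""
    return re.search(pattern, output) is not None


def is_parsable_json(output: str) -> bool:
    """Is the output valid JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False
```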
And so this is a really good way to emphasize the value of evals and observability to the AI engineer. 00:08:29.720 |
Most people understand this left-hand circle, this purple one. It actually 00:08:35.480 |
represents what we all want to do. And it's like, hey, build a better AI system, right? So what you do 00:08:39.560 |
is you collect data, observability, traces, things like that. Then you run some evals to say, hey, 00:08:45.560 |
did this process go well or did this process not go well? So you're discerning signal from that, 00:08:50.360 |
you know, mass of data. You'll actually collect where areas of things went right or wrong, 00:08:55.480 |
right? And you'll say, hey, turns out we hallucinated on this. It's because our RAG strategy 00:09:00.200 |
is off or the agent is off, for example. You'll also annotate data sets as well, just to double check 00:09:05.960 |
that those evals are correct. And then, of course, you always come back into your platform and you, 00:09:11.240 |
you know, you update the prompt template, right? You change the model because it wasn't good enough, 00:09:16.280 |
or you update the agent orchestration. So everybody understands that left-hand circle. 00:09:20.520 |
Now, a lot of people actually forget about the right-hand circle. And so it turns out, 00:09:25.480 |
the first time you run evals, what you'll quickly realize is that they're not perfect, right? You 00:09:33.160 |
actually have to tune those evals over time. So the way you collect signal actually adjusts as your 00:09:38.120 |
application also, you know, gets better. And so what I mean by that is that process of running evals, 00:09:43.960 |
what you might notice if you annotate some of them is that the eval said something hallucinated or 00:09:48.760 |
wasn't correct, and it actually was, or vice versa. So what you actually need to do is collect a set of 00:09:54.840 |
those failures, right? Say, hey, this is where the eval was wrong. And you'll know that by annotating some 00:10:00.440 |
data every now and again. And then you'll want to improve the eval prompt template, right? Because the 00:10:06.120 |
way you collect signal at first, you'll quickly realize it's either too obscure, too vague, not specific 00:10:11.720 |
enough. So these are the kind of two virtuous cycles that you really want to get through very quickly. 00:10:17.640 |
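A minimal sketch of that second cycle, assuming rows annotated with hypothetical judge_label and human_label fields:

```python
# Sketch: build a "judge failure" set from a sample of annotated traces.
# Each row is assumed to carry hypothetical keys "judge_label" and "human_label".

def judge_failures(annotated_rows: list[dict]) -> list[dict]:
    """Rows where the eval (LLM judge) disagreed with the human annotation."""
    return [r for r in annotated_rows if r["judge_label"] != r["human_label"]]

# These disagreements become regression cases: after editing the eval prompt template,
# re-run the judge on them (and on the golden set) to confirm agreement improved.
```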
And the way I describe it to AI engineering teams is, if you want to build like a quality AI product, 00:10:23.080 |
think about velocity. So the faster you iterate through stuff, if you can get through four iterations 00:10:29.320 |
in a month rather than two, you're going to have an exponentially better AI product as you 00:10:33.240 |
build. And so when we talk about architectures and things like that, when the industry first started, 00:10:39.400 |
this was state of the art: routers, right? Am I right? So routers are made up of components. 00:10:46.120 |
This is a really dumb example of booking.com's trip planner. So booking, you know, they're one of our 00:10:51.640 |
largest customers. Trip planner is basically a travel agent in LLM form. It drives revenue for that 00:10:57.480 |
company. It helps you book, you know, it'll book your flights, your hotels. It'll give you an itinerary. 00:11:02.920 |
And so, you know, when we think about evals, evals can be as complex as the application itself. 00:11:10.280 |
So in kind of older architectures, where there's things like routing, you can eval individual 00:11:17.160 |
components. I think most people get this when you're looking inside of a trace, for example. 00:11:21.160 |
Maybe I want to eval a specific component or trace. So this one LLM call, right? 00:11:29.400 |
But remember that you can zoom out too. So it doesn't 00:11:34.840 |
have to just be this one specific component. Let's say this one component is part of an agent or a 00:11:39.400 |
workflow. Maybe I just want to evaluate the input output of that larger workflow. So that larger workflow 00:11:45.560 |
is made up of LLM calls, API calls, right? I have to find actual flights, actual hotels that have vacancy, 00:11:51.640 |
maybe some heuristics. And then you can zoom out a little bit more. Maybe you want to eval things like 00:11:58.440 |
the way control flow happens. It's a really important component. If you have components 00:12:02.920 |
in your AI agents that have control flow in them, it actually makes way more sense to eval your control 00:12:08.440 |
flow first. And you can have conditional evals, meaning if you didn't get the control flow right, why eval 00:12:14.840 |
anything down the line? Because it's probably wrong, right? So save yourself some cost. 00:12:21.000 |
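A sketch of that conditional pattern, with hypothetical eval functions and trace fields, might look like this:

```python
# Sketch of conditional evals: grade the control-flow (router) decision first and only
# pay for downstream evals when it was right. `router_eval`, `rag_relevance_eval`, and
# `answer_eval` are hypothetical eval functions; the trace fields are illustrative.

def evaluate_trace(trace: dict, router_eval, rag_relevance_eval, answer_eval) -> dict:
    results = {"router_correct": router_eval(trace["query"], trace["route_taken"])}
    if not results["router_correct"]:
        # Wrong control flow: everything downstream is suspect, so skip it and save the cost.
        results["skipped_downstream"] = True
        return results
    results["rag_relevance"] = rag_relevance_eval(trace["query"], trace["context"])
    results["answer_quality"] = answer_eval(trace["query"], trace["answer"])
    return results
```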
So you can think about conditional evals as well. And then of course, we have things like people want to run 00:12:26.680 |
evals at the highest level. So imagine you have a back and forth. So we call this a session in our 00:12:32.520 |
platform. But the whole idea is, you know, a session is made up of a series of traces. So you can imagine 00:12:38.440 |
there's a back and forth between you and your agent. I just want to understand, hey, at any point was the 00:12:43.560 |
customer frustrated? Was the customer XYZ? So when you start to think about evals, there's no one-stop shop. 00:12:50.520 |
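One rough way to sketch a session-level eval: concatenate the session's turns and ask a judge a session-level question, such as whether the user ever seemed frustrated. The prompt, labels, and helper below are illustrative.

```python
# Sketch: a session-level eval over a multi-turn conversation.

FRUSTRATION_PROMPT = """Here is a conversation between a user and an assistant:
{transcript}
At any point did the user appear frustrated? Answer with exactly one word:
frustrated or not_frustrated."""


def call_llm(prompt: str) -> str:
    """Placeholder for your model client."""
    raise NotImplementedError


def session_frustration_eval(turns: list[dict]) -> str:
    # `turns` is assumed to look like [{"role": "user", "content": "..."}, ...]
    transcript = "\n".join(f'{t["role"]}: {t["content"]}' for t in turns)
    label = call_llm(FRUSTRATION_PROMPT.format(transcript=transcript)).strip().lower()
    return "not_frustrated" if "not_frustrated" in label else "frustrated"
```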
If anybody says this is how you should do evals, and they never asked you about how your application 00:12:55.560 |
works, you probably shouldn't trust them. Also, I have a hot take, and my hot take is: don't use 00:13:00.760 |
out-of-the-box evals. If you use out-of-the-box evals, you'll get out-of-the-box 00:13:04.760 |
results. So really customize them very heavily. It's something that we've learned from some of the best 00:13:10.360 |
teams in the world. Okay, let me come here. Then you have complexity. This is our own architecture 00:13:17.240 |
for our AI co-pilot. We built an AI whose one purpose is to troubleshoot, observe, build evals 00:13:24.280 |
for your AI system. It obviously takes advantage of our platform. But, you know, the reason why we go 00:13:28.760 |
this route is take us forward five, ten years from now. Do you guys really think that you, a human, 00:13:34.760 |
are going to be the ones who are evaluating all these AI systems, like, mainly? What do you think would 00:13:39.960 |
actually take your place? It's probably going to be an AI that evaluates future AI. So this is our 00:13:45.320 |
first iteration on this stuff. We're super excited about it. We, you know, it's been out for a year. 00:13:50.520 |
It's getting better and better. But maybe I'll show you a little bit of the workflows that we have 00:13:54.280 |
in our platform really quickly. Who here is working with agents? Okay, who here is interested in, like, 00:14:01.800 |
agent evaluation? Okay, let's cover that then. Let's see. I'll show you. We'll show you how the industry is 00:14:06.680 |
doing agent evals. So with agent evals, things get, like, way more complex, right? The calls are longer. 00:14:13.160 |
When you look at your traces, they're much longer. I'll actually show you our agent traces. So this is 00:14:20.680 |
one that kind of failed. But our agent trace kind of works like this. So Copilot works like this: 00:14:25.720 |
basically, based off what you say and where you're at in the platform, there are agents that can do 00:14:31.560 |
things. And it has tools. Each agent has access to a set of tools that it's particularly good at. 00:14:37.240 |
So the whole idea is that, yes, we can see what each individual trace is doing, right? We can say, 00:14:45.000 |
hey, what's happening in this particular area? We can look at the traces. But the reality of what people 00:14:50.920 |
are actually asking in the space is not, is my AI agent good or bad? What they're actually asking is, 00:14:57.160 |
what are the failure modes in which my agent fails, right? And so what I mean by that is, 00:15:02.600 |
you can look at one individual trace in the graph view of it. But the reality is you want to understand 00:15:09.560 |
and discern the signal across the entirety of your AI agent. So what does the pathing 00:15:15.960 |
look like across all of that particular AI agent's calls? So for instance, if it had access to 10 tools, 00:15:24.040 |
maybe you want to answer questions like, how often did it call a specific tool, right? What were the evals 00:15:29.800 |
in a specific path? So in our agent graph, for example, it's framework agnostic. So whether you use 00:15:36.440 |
LangGraph, whether you use CrewAI, whether you use your hand-rolled code, this is an agnostic way 00:15:43.640 |
to look at how an agent's pathing performs across the aggregate traces. And so this helps you understand, 00:15:51.560 |
okay, if my agent hits component one, then two, then three, my evals look great. But for some reason, 00:15:58.360 |
when we hit component four, then two, then three, our evals are dropping. And why? Well, 00:16:04.680 |
it turns out component four had a dependency, right, on component three, and it needs that dependency in order 00:16:10.520 |
to perform. And so when you think about the complexity of agent evals, you need kind of 00:16:16.360 |
the ability to see across not one instantiation, but all of them. You need to understand the distribution 00:16:21.640 |
of what's happening. And so when we think about evals across agents, that's one way you can think about it. 00:16:27.880 |
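A rough sketch of that aggregate view: group traces by the path they took and compare eval pass rates per path. The trace fields here are hypothetical.

```python
from collections import defaultdict

# Sketch: compare eval pass rates across agent paths, aggregated over many traces.
# Each trace is assumed to carry a "path" (ordered component names) and an "eval_passed" flag.

def pass_rate_by_path(traces: list[dict]) -> dict:
    grouped = defaultdict(list)
    for t in traces:
        grouped[tuple(t["path"])].append(bool(t["eval_passed"]))
    return {path: sum(flags) / len(flags) for path, flags in grouped.items()}

# e.g. {("component_1", "component_2", "component_3"): 0.95,
#       ("component_4", "component_2", "component_3"): 0.60}
# The second path is the one worth digging into.
```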
And then maybe an easier way to kind of think about it too is trajectory. We're thinking about trajectory 00:16:34.680 |
evals. So imagine for a second, you have this specific input, right? And the input is like, you know, 00:16:41.320 |
hey, find me these hotels at TripPlanner. And you know you should hit this component, then that component, 00:16:47.240 |
this other component. So in this case, it's like start agent, tool agent. You might have a golden data 00:16:52.360 |
set, like very similar to how we have golden data sets for LLM as a judge, but this is for trajectories. 00:16:57.000 |
So I expect us to be able to hit at least three or four of these components, for example. So the reference 00:17:01.960 |
trajectory is kind of mentioned, like I need to hit these components. Then you get to do 00:17:07.240 |
one of two things. Either one, you can pass in like, here's what we did, here's what we expected, 00:17:12.440 |
into an LLM. And then an LLM can actually grade the trajectory. 00:17:17.080 |
You can also just say, hey, did we explicitly hit these exact like trajectory strings? Great. 00:17:21.160 |
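Both flavors of trajectory eval look roughly like this as a sketch; the judge prompt and the call_llm helper are stand-ins, not a specific platform's API.

```python
# Sketch: two ways to grade an agent trajectory against a reference path.

TRAJECTORY_PROMPT = """Reference trajectory: {reference}
Actual trajectory: {actual}
Did the agent follow a path that reasonably covers the reference steps?
Answer pass or fail, then briefly explain."""


def call_llm(prompt: str) -> str:
    """Placeholder for your model client."""
    raise NotImplementedError


def exact_trajectory_match(actual: list[str], reference: list[str]) -> bool:
    """Strict check: did we hit exactly these component names, in order?"""
    return actual == reference


def llm_trajectory_eval(actual: list[str], reference: list[str]) -> str:
    """Softer check: let an LLM judge whether the actual path covers the reference."""
    return call_llm(TRAJECTORY_PROMPT.format(reference=reference, actual=actual))
```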
But you don't always need a ground truth for that. You can start to get creative here. You can say, hey, 00:17:25.560 |
you know, here's this process that I expected to hit. Do these nodes and their descriptions 00:17:32.120 |
match the correct trajectory, for instance? And then maybe you could do things like, we're kind of 00:17:38.200 |
playing around with this, but maybe here's the trajectory that I hit. Here's the possible paths that 00:17:43.480 |
are possible. Did I do well in these specific areas, right? And you can pass in the pathing as a 00:17:49.240 |
series of like, nested key value pairs, for example. LLMs are pretty good at that. But we start to think 00:17:53.960 |
about, you know, agent evals. You know, the eval space is already complex enough. And what we're seeing 00:18:00.200 |
is even more complexity. But hopefully that makes sense. I'll pause here, 00:18:06.920 |
but usually I like to make time for questions at the end to keep this pretty interactive. Hope that's okay, 00:18:11.800 |
team. But any questions? Does this make sense? 00:18:19.560 |
Most of the evals you're talking about are kind of like after the fact, right? Is there a way that you 00:18:23.960 |
can use evals sort of inline, in the flow, as a pattern? Like, that's clearly a hallucination. 00:18:29.080 |
Yeah. Incredible question. So there's evals that can be, 00:18:36.600 |
you know, what some people call offline or online. I like to ask: is it in the path? Is it in 00:18:41.560 |
orchestration or out of orchestration? So for some people, there's a cost to in orchestration evals, 00:18:48.120 |
and the cost is things like latency, right? Some people might call those a guardrail too. Like, 00:18:52.040 |
hey, can I continue or not continue? And so there's pros and cons to everything. I think 00:18:57.080 |
when it comes to guardrails, this is something I kind of coach my customers. The way to think about 00:19:02.440 |
guardrails in general is you have system one. System one is your orchestration system. It's what you built. 00:19:08.760 |
It's your prompts. It's everything else. System two is your guardrail system, right? Guardrails are 00:19:14.200 |
really nice because they mitigate risk, right? But there is a cost and the cost is maybe it's latency 00:19:20.440 |
in your user's experience. You can get around that by doing smart things like maybe embeddings guardrails. 00:19:25.640 |
They're, you know, two orders of magnitude faster. But a lot of people don't think about the other two 00:19:31.160 |
cons here. The other con is complexity. Two systems is complex, especially when one system checks in with 00:19:36.760 |
the first. The third thing is that a lot of people mistake guardrails as, like, the thing that needs 00:19:43.640 |
to be adjusted. A lot of people will go to their guardrails first, like, oh, I need to adjust my 00:19:47.320 |
guardrails. The reality is you need to adjust system one. That's the root cause, right? Your guardrails 00:19:52.760 |
are really there to protect you. And then maybe the last thing I'll say too is guardrails are not infallible. 00:19:58.600 |
They kind of act like unit tests. They're for known knowns, right? Whereas observability plus evals catch the unknowns, 00:20:05.640 |
because the reality is you don't know the distribution of what you're going to see 00:20:09.000 |
until you get there, right? Ask anybody who's built in the LLM space. Their users are just crazy. 00:20:14.520 |
And so that's the difference. And I really caution people because people are like, oh, I need to fix my 00:20:20.520 |
guardrail. No, no, go fix the prompt first and then worry about your guardrails. But yeah, inline, 00:20:26.840 |
we call those inline evals. Some people call them guardrails. But really, do you do the evals in the 00:20:31.320 |
orchestration or outside of it? And so, there's pros and cons. So there's no right or wrong 00:20:36.760 |
answer there. But good question. Yeah, so when we have a complex system that is typically taking a long 00:20:44.760 |
time to run and we have timeouts in this. And I know you'll have something called a span that limits 00:20:51.000 |
what kind of view we're taking to a complex agent. So if we have like a complex system that's really 00:20:59.400 |
going to take time and then there's an asynchronous way, is there support to manage something like that 00:21:04.920 |
and eval across the whole system that we have? Oh, yeah. Amazing question. So who here has ever 00:21:11.720 |
heard of OTEL? OpenTelemetry. Okay, even fewer, okay. One of the most important things to our 00:21:17.880 |
enterprise customers is being on OpenTelemetry. So you know how I said LLM teams are being split into 00:21:23.640 |
two? Well, it turns out LLM applications are also being split across services. And so the idea is, like, 00:21:29.240 |
people want to understand that. So maybe there's an asynchronous process in one service or one Docker container, 00:21:34.760 |
or, you know, one Kubernetes pod. OTEL propagation is a great way to get around that, 00:21:41.480 |
meaning you can have a process like application A sends data to my model router, right? And then that 00:21:48.440 |
comes back to application A. Then application A hits application B for some reason. And then it comes back 00:21:53.560 |
to A. When you're actually creating those traces, you want to be able to see all that work, right? You don't 00:21:58.120 |
want to just instrument one particular thing. You want to see the work across all of it. And so OTEL is 00:22:03.960 |
an incredible pattern for that. It's a solved problem. So that's why we at Arize, two and a half years ago, 00:22:09.240 |
when this crazy time started for all of us, made a bet to be OTEL-first. And it's really paid off. 00:22:17.080 |
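To make that concrete, here's a minimal sketch of context propagation with the OpenTelemetry Python API, assuming a tracer provider and exporter are already configured and using hypothetical service and endpoint names.

```python
import requests
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

# Sketch of OTEL context propagation between two services.
# Assumes a TracerProvider and exporter are already configured elsewhere.

tracer = trace.get_tracer("app-a")


def call_application_b(payload: dict) -> dict:
    # Application A: start a span and inject its context into the outbound HTTP headers.
    with tracer.start_as_current_span("call_app_b"):
        headers: dict = {}
        inject(headers)  # adds W3C traceparent/tracestate headers for the current span
        resp = requests.post("http://app-b.internal/process", json=payload, headers=headers)
        return resp.json()


def handle_request(headers: dict, payload: dict) -> dict:
    # Application B: extract the incoming context so its spans attach to the same trace.
    ctx = extract(headers)
    with tracer.start_as_current_span("app_b_process", context=ctx):
        return {"status": "ok"}
```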
Okay, so confidence scores on evals, right? Yeah, I think it depends where you're getting your eval. If 00:22:36.920 |
eval, if it's from an auto-regressive model, companies like OpenAI have actually exposed the log 00:22:41.800 |
prob. So the log probability is a pseudo-confidence, and since you're returning 00:22:47.560 |
only one token, and that token is the eval label, log prob is a really good way for those 00:22:52.920 |
auto-regressive models. If you're using things like small language models and encoder-only models, 00:22:58.440 |
they come with a probability of the classification. But really, yeah, it's tough. But you have a bunch 00:23:05.480 |
of tools in your toolbox, and you generally use them together to discern where things go well or 00:23:09.400 |
not well. But log prob, if you're using a model provider that exposes the log prob, is a really 00:23:14.920 |
good way to start for auto-regressive models. Okay, last question, and then we're out of time. Yeah? 00:23:20.520 |
Hey, do you have anything in your plans, like going forward, like how to shorten the loop between 00:23:26.360 |
customer feedback and automatically improving the prompts, and also having, like, you know, less development team 00:23:31.560 |
effort on it? Oh, good question. Yeah, we want to automate in that area, definitely. So, 00:23:36.200 |
who here has heard of DSPy? All right, okay. If you didn't raise your hand on any of this, 00:23:41.480 |
I hope you learned a bunch. DSPy obviously has something like MIPRO. MIPRO is an optimizer: 00:23:46.440 |
you give it like 30 inputs, 30 outputs, and then it basically creates less fragile prompts that span 00:23:51.880 |
across different models. So it works for OpenAI, and then it works for Gemini, et cetera. 00:23:56.920 |
In terms of like auto-optimization, yeah, I think we have the ability to, or we're releasing the ability 00:24:03.000 |
to run prompt optimization. Some people call it that; we call it meta-prompting. But basically, 00:24:08.920 |
we feed it a dataset, we say, here's the input-output pairs, here's the evals on those things, 00:24:13.640 |
and then where things failed and didn't fail. Look at the original prompts, look at this dataset. 00:24:19.880 |
Can you give me a new prompt that fixes this dataset? Yeah, so we call that meta-prompting, 00:24:25.000 |
but it's basically using an LLM to automate that, so you don't have to. Yeah, but good question. 00:24:31.000 |
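A hand-wavy sketch of that meta-prompting loop, under the assumption of a placeholder model client and illustrative field names (this is not Arize's actual implementation):

```python
# Sketch of meta-prompting: give an LLM the original prompt plus graded examples
# and ask it for a revised prompt. `call_llm` is a placeholder model client.

META_PROMPT = """You are improving a prompt for an LLM application.

Original prompt:
{original_prompt}

Input/output pairs with their eval results:
{graded_examples}

Write a revised prompt that keeps the passing behavior and fixes the failures.
Return only the new prompt."""


def call_llm(prompt: str) -> str:
    """Placeholder for your model client."""
    raise NotImplementedError


def propose_new_prompt(original_prompt: str, graded_examples: list[dict]) -> str:
    examples = "\n".join(
        f'- input: {e["input"]}\n  output: {e["output"]}\n  eval: {e["eval_label"]}'
        for e in graded_examples
    )
    return call_llm(META_PROMPT.format(original_prompt=original_prompt, graded_examples=examples))
```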
But really appreciate the time. We're over at the booth. Feel free to come grab me if you want to talk 00:24:35.160 |
architecture or anything, but really nice to see you all.