Case Study + Deep Dive: Telemedicine Support Agents with LangGraph/MCP - Dan Mason

00:00:00.040 |
Okay. Hey everybody, thank you so much for coming. Really appreciate you being here. This is a great 00:00:20.340 |
show. I love this show. I was here last year as an attendee. Spoke in New York at the New York 00:00:25.860 |
Summit in February, and I'm really thrilled to be back. So this is very much a show and 00:00:31.420 |
tell. I said this in the Slack channel, so if anybody's not in the Slack channel, feel 00:00:34.380 |
free to join it. There's a couple of links in there that might be helpful to you. It 00:00:37.680 |
is Workshop LangGraph MCP Agents, if anybody needs that. But fundamentally, I'm just here 00:00:45.200 |
to walk through some really interesting work that my team's been doing around building agent 00:00:49.920 |
workflows for a healthcare use case. And this is very much like the way we did it. I'll 00:00:57.060 |
get into some details about that. It's not the only way to do it. And I'm hopeful that 00:01:01.420 |
somebody in this audience might look at this and be like, that's dumb. You should do that 00:01:04.480 |
better. And please raise your hand and tell me. But it's been really fun to build. I'm really, 00:01:09.540 |
really happy with the results and really excited to show you guys what it's all about. Okay, 00:01:14.860 |
so I will go into presentation mode. All right, here we go. Okay, first, just a very quick 00:01:25.180 |
couple things about Stride. That's me if anybody needs my LinkedIn. But there's a couple other 00:01:29.080 |
places you can find that. We are a custom software consultancy. So what that means in practice 00:01:35.780 |
is whatever you need, we'll build it. We have been doing a whole lot of AI stuff. This kind 00:01:41.540 |
of falls into a few specific buckets. We use a lot of AI for code generation. We have a couple 00:01:47.480 |
of both products and services we've built to do things like unit test creation and maintenance. 00:01:52.580 |
We've done a bunch of stuff around modernization of super old dumb code bases. You know, things like, 00:01:58.460 |
you know, early-2000s-era .NET is one of the things we specialize in. But what I'm going to show 00:02:04.280 |
you today is really what we do around agent workflows. So the idea with, you know, this agent workflow 00:02:09.140 |
stuff is really just that, you know, it's something that could be done with traditional software. And 00:02:13.820 |
this thing I'm going to show you was done with traditional software in its first run. But we have 00:02:19.040 |
rebuilt it with an LLM at the core to make it more flexible, more capable, and, you know, ultimately, 00:02:24.800 |
just a lot cooler. So really excited to show you guys more about that. So I'm going to start with a 00:02:31.580 |
little bit of grounding. I'm going to do a case study, which is very brief, but let's give you a sense of 00:02:35.880 |
kind of the problem we were trying to solve and how well I think we solved it. And then we'll go as 00:02:40.560 |
deep as we all want to go in terms of how it works. Let me ask this up front. If you have questions, 00:02:46.380 |
please raise your hand. I will try to notice and I'll try to get to you. There are some mics we could 00:02:50.280 |
pass around, but it probably is better for you to just shout it out. And then I'll repeat it into 00:02:54.560 |
my mic. And fundamentally, like, I have no idea if this is two hours worth of material. 00:02:59.240 |
It probably is. But, you know, please keep me honest. I'll talk about anything that is relevant 00:03:03.920 |
to this that you guys want to talk about. So the client here is Avila. So Avila Science is a women's 00:03:13.600 |
health sort of institution, which is trying to help with the treatment of early pregnancy loss, 00:03:18.880 |
otherwise usually known as miscarriage. What this is, though, is specifically a treatment where what 00:03:25.040 |
happens is that, you know, you experience the event, you end up at the hospital or at a clinic. They send you 00:03:29.280 |
home with medicine, right? The medicine is something that then you have to administer yourself at both a 00:03:34.880 |
very traumatic time for you and your family, at a time when, you know, you need to keep track of when 00:03:38.720 |
you are supposed to do things. It can be really challenging. There's other use cases beyond this in 00:03:43.040 |
terms of chemotherapy, when people have trouble remembering what day it is, you know, let alone what 00:03:46.800 |
they're supposed to be doing, right? You know, there's a variety of treatments that this is relevant for. 00:03:50.080 |
But Avila, in particular, has a system that they use to help people essentially administer these 00:03:56.480 |
telemedicine regimens at home, right? And that system is text message-based. So everything here I'm going 00:04:01.840 |
to show you is essentially a text messaging-based engine with, you know, some core business logic 00:04:07.520 |
that helps people stay on track. It answers their questions. It checks in on them, you know, 00:04:12.240 |
to make sure that the treatment went well. They still have a doctor relationship. This isn't replacing 00:04:16.000 |
the doctor. It is simply helping to get people through this treatment without a doctor's direct 00:04:20.480 |
support, at least a lot of the time. So a few disclaimers up front. First of all, 00:04:25.360 |
I'm going to show you a whole bunch of stuff here that is the client's actual code. Thank you so much 00:04:30.320 |
to my client. Thanks to Avila for being so open with this. It's really awesome. I'm really happy to be 00:04:34.000 |
able to show you as much as I'm going to show you. I have redacted a few things. I think what's left is 00:04:38.560 |
still, you know, very much going to give you the character of the whole thing and an idea of how it works. 00:04:42.000 |
Stride, we are custom software people, so we built custom software. I don't want to hide that part, 00:04:48.480 |
right? It's possible to do a lot of this stuff with off-the-shelf tools, but there were some specific 00:04:53.120 |
requirements this client had that made it better, frankly, to build a lot of it custom. So we did, 00:04:58.880 |
but we did the best we could to use, you know, big swaths of off-the-shelf, right? So you'll see a lot of 00:05:03.280 |
LangGraph, LangChain, LangSmith, you know, a bunch of other things like that, you know, very much in here, 00:05:07.520 |
because we do believe that that adds value and that it fundamentally makes the system a lot more 00:05:12.160 |
explainable. Yeah, and, you know, there are also some constraints in terms of how it's hosted. This 00:05:16.800 |
is, you know, at least partially intersecting with patient data and various things like HIPAA and other 00:05:21.280 |
privacy requirements. The other thing, as I started with earlier, there's no right way to do this, 00:05:27.040 |
but this one does work for us, and I think you'll see as we walk through it some of the choices we made. 00:05:31.440 |
You know, there are definitely other ways we could have plugged the tools together. I think there's 00:05:35.440 |
definitely other ways we could have done this workflow. But we like how this came out. It 00:05:39.520 |
preserved some of the things that we really knew were important to our client and that kind of preserves, 00:05:43.440 |
you know, a lot of human judgment, you know, as opposed to sort of taking the LLMs entirely at 00:05:48.240 |
their word. And this is very much a hybrid system with humans very much in the loop. And again, as I 00:05:54.560 |
mentioned before, I would really love it if you guys looked at what we're doing here and said, "That's dumb." 00:05:59.840 |
Or have you thought about this, right? Because, A, you know, this is a project which we've only been 00:06:05.200 |
working on for a few months, but, you know, things have already evolved. That's the way it is in AI. 00:06:09.440 |
So I'm certain, and I know of a handful of things where, you know, we could replace 00:06:13.360 |
some of the choices we made with newer, more modern choices. And at the same time, you know, 00:06:17.440 |
there may be cases where, you know, I'm genuinely not using LangGraph right. I would love if someone 00:06:22.320 |
raises their hand and tells me that. So please do have that in the back of your brains. 00:06:25.520 |
Okay, cool. Really briefly on the stack that we used and on the team that we built. So the first 00:06:32.160 |
thing, again, there's a lot of LangChain in here. That is not because other frameworks can't do this. 00:06:37.440 |
It's not because we couldn't build our own. The number one reason we went with this is because of 00:06:41.280 |
how easy it is to explain the system to other people, right? You know, if you look at, and I'll show you 00:06:45.680 |
the LangGraph stuff in particular, it was straightforward to go into our client, you know, on a very early day 00:06:51.200 |
in the project and say, "Hey, this is how this thing works." You can see it goes from here to here. 00:06:54.960 |
There's loops here. Like this is where we're doing our, you know, our evaluation of the process. 00:06:59.360 |
And here's where humans come in. It was very straightforward to do that. And I think it would 00:07:03.920 |
have been a lot harder with something that was less visual and frankly, just less well orchestrated. 00:07:08.000 |
So we're happy with this, right? There are some trade-offs to the LangChain tools, but they're 00:07:12.160 |
mostly things we can live with. We are using Claude in the examples that I'm going to show you here. 00:07:18.880 |
But the core code that we wrote works with Gemini, works with OpenAI. There are a few reasons that we 00:07:23.760 |
think Claude is better for this. I'll get into that as we go. But, you know, there's, there's no model 00:07:28.320 |
specific stuff really happening here. This is almost all just tool calling and MCP and, you know, other 00:07:32.720 |
things that are pretty portable across most of the models. The stack overall is not just the LLM piece, 00:07:40.400 |
right? So the LLM piece is Python and a LangGraph container. And then the other piece, right, 00:07:45.200 |
the piece that is a text message gateway and a database and a dashboard, which I'm going to show you 00:07:49.440 |
pretty extensively, is Node and React and MongoDB and Twilio. And the whole thing is hosted in AWS. 00:07:55.760 |
None of that has to be that way. That's just what we picked. You know, the main reason we picked AWS 00:08:00.640 |
was for, you know, this has to support multiple different regions. We had to be able to deploy 00:08:04.560 |
stuff, you know, entirely in Europe in a couple of cases, right? And so we needed to make sure that we 00:08:08.720 |
had, you know, a decent set of, you know, cloud connections that we could work with. 00:08:12.400 |
Evals. So I will show you the eval system that we built. We were not able to use, or at least I 00:08:20.960 |
shouldn't say not able. We chose not to use the stuff entirely off the shelf from LangSmith. This 00:08:25.200 |
is partly because I didn't really want to be fully locked into them. I wanted the data to live there. 00:08:28.960 |
I wanted to be able to see, you know, the current system in LangSmith. But I wanted to have something 00:08:32.960 |
separate. And it turns out that some of what we had to do to make the evals, you know, fundamentally 00:08:38.240 |
functional required a lot of pre-processing. So we built an external harness that essentially pulls 00:08:43.280 |
data out of LangSmith, processes it, and then runs things through Promptfoo. And one of the reasons we 00:08:48.480 |
picked Promptfoo, if anyone's ever worked with it, is that they have a very flexible thing they call an LLM rubric. 00:08:53.920 |
And so this is an LLM as a judge. You basically describe how you want the eval to work. You feed the 00:08:59.200 |
data in and, you know, then it gives you a separate sort of visualization for that. So we ended up very 00:09:03.280 |
happy with it. It's not the only way to do it at all. It was definitely, you know, the thing that fit best for us. 00:09:08.080 |
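A minimal sketch of what such a harness could look like, assuming the langsmith Python client and the promptfoo CLI; the project name, rubric wording, and file names below are illustrative, not Avila's actual configuration:

```python
# Hedged sketch: pull recent runs out of LangSmith, flatten them into promptfoo
# test cases, and score them with an llm-rubric (LLM-as-a-judge) assertion.
import json
import subprocess

import yaml
from langsmith import Client

client = Client()  # reads LANGSMITH_API_KEY from the environment

tests = []
for run in client.list_runs(project_name="virtual-oa-prod", run_type="llm", limit=200):
    # Pre-processing step: keep only what the judge needs, the incoming
    # patient message/state and the proposed outbound response.
    tests.append({
        "vars": {
            "conversation": json.dumps(run.inputs, default=str),
            "proposed_response": json.dumps(run.outputs, default=str),
        },
        "assert": [{
            "type": "llm-rubric",
            "value": (
                "The response answers the patient's question, stays within the "
                "approved blueprint language, and does not invent medical advice."
            ),
        }],
    })

with open("promptfoo_tests.yaml", "w") as f:
    yaml.safe_dump(tests, f, sort_keys=False)

# Then run the promptfoo CLI against a config that wires these vars and
# assertions into its prompts/providers.
subprocess.run(["npx", "promptfoo", "eval", "-c", "promptfooconfig.yaml"], check=True)
```

The llm-rubric assertion is the LLM-as-a-judge piece: you describe in plain language how the eval should be graded, feed the data in, and promptfoo gives you a separate visualization of the results.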
The team. So there were, and still are, two software engineers, one designer, and me. And I'm 00:09:17.600 |
just, I would not call myself a software engineer. That's why I didn't include myself in that pool. 00:09:21.440 |
You can imagine there being two software engineers kind of maintaining the core 00:09:25.040 |
system that has the gateway and the dashboard and the text message stuff, right? And the database. 00:09:30.960 |
I maintained and built basically everything on the LangGraph side, right? So imagine this as being 00:09:37.920 |
two separate systems that talk to each other through a well-defined contract. And those two 00:09:42.720 |
software engineers understand roughly how my code works, but they really weren't maintaining it. You 00:09:46.320 |
know, it was almost entirely me with AI friends. And on that note, so everything I'm going to show 00:09:53.600 |
you is the code that I wrote. And I want to be very clear. I haven't been a real software engineer in 00:09:58.080 |
a long time. I do have an engineering background. I spent seven years out of college, you know, hacking on 00:10:01.440 |
mobile apps. I took 15 years off and went to be a product person. And for about two years now, I've 00:10:08.080 |
been back. But what that really means is just that, you know, essentially the stuff you're seeing, right, 00:10:12.800 |
or the stuff I'm going to show you is mostly, you know, code that I wrote with Cline. That's my personal 00:10:17.120 |
favorite. And so there's a bunch of options here. I like Cline best of all these options. You can use 00:10:23.200 |
anything you want. The code isn't actually that complicated. Like, I would estimate, and I haven't 00:10:27.200 |
actually counted, but there's probably a few thousand lines of Python, and there's a few 00:10:30.800 |
thousand lines of prompt. It's about equal, right? So I vibe-coded the Python, and I mostly hand-coded 00:10:37.600 |
the prompt. Not 100%, right? But that's the way to think about the division of labor here. 00:10:42.320 |
And for that matter, I mean, any of these tools can be great. The main reason I picked Cline was just 00:10:47.040 |
because, you know, we did not need, you know, sort of a hyper-optimized, you know, like $20 a month flow. 00:10:52.960 |
Like, I've spent a lot more than $20 a month on tokens. That's just the way it is. 00:10:56.960 |
You know, it was worth spending the money to just have sort of the best available 00:10:59.760 |
context of the model at any given point. Cline is a very good way to do that. 00:11:02.960 |
Okay, and there's a little bit of sample code. So I did mention, and this is in the Slack channel as well, 00:11:08.400 |
if you wanted to follow along with any of this, you could sort of do it by standing up your own 00:11:12.560 |
little LangGraph container with MCP. You're more than welcome to do that. Everything I'm going to show 00:11:17.040 |
you, though, is proprietary client code, so I obviously can't send you those links. So if you'd like to, 00:11:22.000 |
feel free to fire it up. We have two hours, which is a really long period of time. If you're 00:11:26.720 |
interested in spending a little time at the end of this actually working with some of 00:11:29.440 |
this real code, I'm thrilled to do that. So feel free to get yourself ready in the meantime. 00:11:33.600 |
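If you do want to follow along, a rough sketch of standing up a small LangGraph agent with MCP tools might look like this, assuming the langchain-mcp-adapters, langgraph, and langchain-anthropic packages; the server commands, backend URL, and model name are placeholders rather than anything from the workshop repo:

```python
# Hedged sketch of a small LangGraph agent wired to MCP tools.
import asyncio

from langchain_anthropic import ChatAnthropic
from langchain_mcp_adapters.client import MultiServerMCPClient
from langgraph.prebuilt import create_react_agent


async def main():
    # Two flavors of MCP, as described later in the talk: one server reading
    # local blueprint/knowledge-base files, one going across the wire.
    mcp = MultiServerMCPClient({
        "blueprints": {"command": "python", "args": ["blueprint_server.py"], "transport": "stdio"},
        "backend": {"url": "https://example.internal/mcp", "transport": "streamable_http"},
    })
    tools = await mcp.get_tools()

    model = ChatAnthropic(model="claude-3-5-sonnet-latest")
    agent = create_react_agent(model, tools)

    result = await agent.ainvoke(
        {"messages": [("user", "Patient says: 'I have the medicine, what do I do first?'")]}
    )
    print(result["messages"][-1].content)


asyncio.run(main())
```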
Okay, just a couple things up front, just to make sure we're level set in terms of sort of the terms 00:11:40.640 |
and kind of the way that we're talking about this stuff. So I do like the LangChain definition 00:11:45.520 |
here of agent, basically just because, you know, you'll see what we're doing here is using an LLM to 00:11:51.200 |
control the flow, right, of this application. That is literally what this is. And I like this, 00:11:59.280 |
and I don't know if Chris is here, he was at the last event in New York. I like this as a way of sort of 00:12:04.000 |
justifying the way that we tried to architect this system and why, right? So the idea of the agents in 00:12:11.120 |
production, right? You have to know what they're doing, you have to know, you know, that they can do it, 00:12:16.640 |
and you have to be able to steer, right? If you only have a couple of these things, you end up with 00:12:21.040 |
bad outcomes, right? And so I just like this framing of if you're capable, but you can't tell what it's 00:12:25.280 |
doing, it's dangerous. If you know exactly what it's doing, but you can't control it, it does weird 00:12:29.280 |
stuff and you can't help. Please go ahead. Oh, let me find that one more. It's right here, actually. 00:12:35.920 |
Workshop LangGraph MCP Agents. Got it? Okay. No problem. Okay. And so transparency with no control is 00:12:44.960 |
frustrating and control with no capability is useless. I just love this framing. I think this 00:12:49.200 |
is exactly the thing that we were trying to solve for. We needed something that was able to do the job, 00:12:53.840 |
clear about what it was doing, and that was steerable by humans in a really obvious way. 00:12:57.440 |
So with that said, I'm going to start with a case study, right? And this is going to be a little weird 00:13:02.960 |
out of context, but hopefully this will give you a sense of what we were trying to solve for. So the idea 00:13:08.560 |
here was that there's an existing product, right? So there was a product out there that was essentially 00:13:13.360 |
having humans manually push buttons on a console that would enable a text message to go out, right? 00:13:20.400 |
So you would read what the patient had said. They could say, I took my medicine at 3 p.m. They could 00:13:25.040 |
say, I'm bleeding and I don't know what's going on. Like, am I okay? They could ask other sorts of 00:13:29.920 |
questions about the treatment. And a human would have to go into a piece of software and click, you know, 00:13:34.800 |
a button that accurately reflected sort of where in the workflow somebody was, right? Because, you know, 00:13:39.680 |
you can model a lot of this out. Imagine there being fantastically complicated flow charts of all the 00:13:44.560 |
things that can happen during a medical treatment. So the Avila team had built this, right? They realized, 00:13:50.080 |
though, that essentially to scale the human team to be able to serve a lot more patients was prohibitive, 00:13:54.400 |
right? They needed too many people clicking too many buttons. They also realized they couldn't really 00:13:59.040 |
scale the system to new treatments, right? Which was something they wanted to do, that this isn't the only 00:14:02.480 |
regimen that you needed to support. They had other ones. And so the idea is that either they were 00:14:07.920 |
going to rebuild the legacy software to be more flexible or they were going to essentially rebuild 00:14:12.400 |
it to use a different kind of decisioning at the core. And when they were looking at doing this, 00:14:17.200 |
you know, LLMs had started, I think, become capable enough to handle this kind of work. 00:14:21.120 |
So what we did, what we did is we built for them a workflow and essentially a piece of software that 00:14:27.520 |
connects to it that enabled them to do new treatments flexibly, right? So this idea of 00:14:32.320 |
essentially defining a blueprint and a knowledge base is the way that we thought about this. 00:14:35.760 |
And essentially medically approved language, right? So one of the reasons that you had humans pressing 00:14:40.560 |
buttons instead of typing text messages is because this is medical advice, right? You know, you are not, 00:14:46.080 |
you should not at least be giving medical advice that differs substantially from this approved language, 00:14:51.120 |
right? There's reasons that this stuff is said the way that it's said, you know, and doctors have, 00:14:55.280 |
you know, similar limitations. We also built a self-evaluation function, which I'll go into 00:15:01.200 |
tremendously in a second. We wanted to make sure that we caught essentially situations that were 00:15:06.320 |
complicated and surfaced them for humans, right? Because we wanted to have a human in the loop. 00:15:10.960 |
But we were trying to raise up the existing folks who were really just operating the system and clicking 00:15:15.440 |
all those buttons to be supervisors of agents that were doing that instead, right? That really was the 00:15:20.800 |
model that we were working with at its core. So I have a question over here. Yeah? 00:15:23.840 |
You may have said it. Were these operators medically trained? 00:15:26.560 |
So the question is, are these operators medically trained? 00:15:29.840 |
There is a physician's assistant who essentially leads the operations team. So the way that you can 00:15:34.400 |
think about it is that she would be escalated to whenever something came up that was outside of the 00:15:39.280 |
blueprint, right? So if you had a situation where they're just like, "I'm really not sure what to do 00:15:42.400 |
here." A slack goes out to that channel with a physician's assistant in it who would then give 00:15:46.400 |
medical advice. So again, this is one of the reasons it was hard to scale, right? Because you only had one 00:15:51.200 |
of those people on this particular team. Sure. And so to sort of jump a little bit ahead, but hopefully 00:15:58.640 |
you'll see why this is in a minute, this roughly, and again, we're still doing the measurement, right? 00:16:03.280 |
We're still trying to figure out exactly what capacity has gone up to. We think it's something like 00:16:08.320 |
10x. We think that they can serve roughly 10x more people with this new approach. 00:16:12.720 |
Now, it's not free, right? We have to build the software. We have to pay for the tokens. Tokens can 00:16:17.600 |
get expensive. But if you think about just the scale issues involved in scaling up a team of people 00:16:22.320 |
and, again, in building the software to be more flexible for more treatments, we think this capacity 00:16:26.560 |
increase is very much warranted and very much the thing that solves the problem. And you can do new 00:16:33.440 |
treatments and new workflows without writing more code, right? That was the single biggest thing about 00:16:37.200 |
this. And you'll see what we're doing here is largely Google Docs, right? And, you know, 00:16:40.720 |
we have some more advanced techniques to manage those things and version them over time. But we're 00:16:45.120 |
talking about being able to support whole new treatments and whole new workflows without going 00:16:48.240 |
back to the code, right? That's hugely valuable to these guys. Question? 00:16:52.880 |
You mentioned velocity increasing like 10x. Is there any measuring about quality of care? 00:16:58.320 |
So the question was velocity increases. Is there a quality of care measure? Short answer is it's early, 00:17:04.720 |
right? I mean, this is still a system that's, you know, in progress. It is being used with real people, 00:17:08.240 |
but it's still very, very much early on that. The way that I think we're looking at it is that 00:17:12.240 |
there would be some combination of the operators being the ultimate arbiter, right? They're going to be 00:17:16.720 |
able to see these conversations and determine as they approve them, you know, as they review them, like, 00:17:20.480 |
hey, is this mostly getting it right? And then there are sort of existing kind of CSAT, you know, level of measures that you can 00:17:25.840 |
apply to the people who are on the other end of the treatment. 00:17:28.800 |
So 10x sounds a little low. Is that because the operators are still approving everything that comes 00:17:32.640 |
out right now? So they're not approving everything that comes out. And I agree, the 10x is kind of, 00:17:36.560 |
it's an order of magnitude, not a precise measure, right? But I think in this case, 00:17:40.640 |
you'll see a couple of cases that require approval, right? And sort of why. But the approval also is 00:17:45.360 |
very quick, right? So the argument is that you probably only see one of every 10 exchanges. And when you see it, 00:17:50.960 |
it takes you roughly as long as it took the last time to just push the button, right? Which was the thing they were 00:17:55.120 |
already doing. So that's kind of why we've benchmarked it there. All right. So let's get into it a little 00:18:02.240 |
bit. So this is just a snapshot of what this looks like in LangGraph. I'll show you the real thing in 00:18:07.040 |
just a minute. And it's actually evolved a tiny bit since I took this picture. But really, what we're 00:18:11.440 |
talking about here is the people who operate the system today, we call them operations associates. 00:18:16.240 |
So what this is really doing is introducing a virtual operations associate. That operations associate is going to 00:18:22.160 |
assess the state of essentially a conversation, interaction with a patient, determine what the 00:18:28.240 |
best response is, both in terms of the text message you might send, the questions you might ask, the 00:18:33.840 |
actions you might take. Because some of this is about maintaining essentially a state for that patient, 00:18:38.240 |
right? You know, you are at any given point trying to figure out, when is this person taking their 00:18:43.200 |
medicine? When did they take their medicine? You know, what medicine do they have? What time is it for them, 00:18:48.720 |
which is actually more important than you may think? All of this has to be maintained, right, 00:18:52.640 |
by the system. And so the virtual operations associate is doing all that work. And then it's passing essentially its 00:18:57.760 |
proposal, right? It basically comes up with, I think this is what we should do. And it passes it to an 00:19:02.800 |
evaluator agent. So there's a live LLM as a judge process, separate from the evals, which we'll get to. 00:19:08.480 |
But the live LLM as a judge is essentially saying, okay, given this thing that just happened, 00:19:13.280 |
here is our assessment of A, you know, how right the LLM thinks it is. That's frankly very challenging. 00:19:19.440 |
LLMs are very hard to convince that they're wrong about anything. But it also is looking at the 00:19:24.080 |
complexity, right? So even if the LLM believes it's made all the right decisions, you can have it 00:19:27.600 |
impartially say, well, I changed this and I changed that and I'm scheduling a bunch of messages. That's 00:19:31.840 |
complicated. Maybe a human should look at this, right? So that's actually a lot easier to implement. 00:19:37.840 |
And both of these things are calling tools. The tools are a mix of MCP. And so there's sort of 00:19:44.400 |
two versions of MCP here. I'm going to show you one, which is basically just looking at local files, 00:19:48.560 |
just so I can show you all the stuff in my environment. But there's also MCP going across 00:19:52.480 |
the wire to the larger software system and keeping all this stuff in the database, right? So there's a 00:19:56.720 |
mix of those two things. And the rest of the tools are about maintaining state. Because as a conversation 00:20:03.440 |
is happening, the LLM needs to know, you know, essentially, well, I made this update and that 00:20:08.000 |
update. And here's the current state that I'm working with. And it has to be able to sort of 00:20:11.440 |
manipulate these things in real time. That is not MCP. That's not going to a database anywhere. Like, 00:20:15.520 |
this is happening entirely in sort of the live thread. And then once it finishes, then it gets pushed out 00:20:20.080 |
and essentially saved away. Okay? Again, we'll get into a lot more of that. I did want to spend a minute on 00:20:27.440 |
this on the system architecture, right? And so I realized it was a little small. Go ahead. 00:20:30.720 |
When you mentioned the system state, I heard before that LangGraph has a context, 00:20:36.720 |
or some state object built into LangGraph itself. You mentioned that you use tools. Are you 00:20:41.520 |
talking about separate things or the same thing? The question was about how the state is managed 00:20:45.680 |
in LangGraph. So short answer is, this may be one of the things where I'm not doing it optimally, 00:20:50.800 |
by the way. But with LangGraph, there is a state object that we load, essentially, when the request comes in, 00:20:56.960 |
from a JSON blob, right? We keep it alive inside the graph run. It is not directly accessible to the 00:21:03.840 |
model, right? At least not the way that we're doing it, right? So you'll see, actually, as we get into 00:21:08.400 |
this, that you can see all the state coming in in LangSmith, right? I can see, like, hey, this is the 00:21:12.160 |
whole thing that was loaded. I still have to repeat that in my first message to Claude, right? It doesn't 00:21:17.360 |
actually show up, you know, in the same place. And then I call the functions. That state will evolve in terms of 00:21:22.560 |
what's inside the graph run. And then when it outputs, it's the Python code, not the model, 00:21:27.040 |
which essentially takes all that state and then serializes it and sends it out. 00:21:31.200 |
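In code, that load/restate/serialize pattern might look something like the sketch below; the field names and the run_graph placeholder are assumptions for illustration, not the actual contract between the two systems:

```python
# Hedged sketch: the incoming request's JSON blob becomes the graph state,
# gets restated to the model in the first message, and is re-serialized by
# Python (not the model) when the run finishes.
import json


def run_graph(state: dict) -> dict:
    # Placeholder for invoking the compiled LangGraph app (see the graph sketch above).
    return state


def handle_incoming(request_json: str) -> str:
    state = json.loads(request_json)  # the schedule-document snapshot for this patient

    # LangGraph state isn't directly visible to the model, so it gets restated
    # to Claude in the first message of the run.
    state["first_message"] = (
        "Current patient state:\n"
        + json.dumps({k: v for k, v in state.items() if k != "first_message"}, indent=2)
        + f"\n\nNew patient message: {state.get('incoming_text', '')}"
    )

    final_state = run_graph(state)

    # Python, not the model, serializes the evolved state and sends it back
    # across the wire to be saved.
    return json.dumps(final_state, default=str)
```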
So you'll see how it works. But like, that's generally one of the things that I'm not sure I'm 00:21:35.280 |
doing right. Anything else? Yep? Yeah, somewhat related. You have one node for that virtual associate 00:21:41.360 |
and it's going back and forth to tools, right? And updating the state. Yeah. I'm assuming the reason 00:21:45.840 |
you haven't hard-coded that business logic out into separate nodes is precisely because 00:21:51.520 |
you'd lose the treatment-agnostic workflow for the next time. Is that sort of the notion there? 00:21:56.960 |
Yeah. So the question is why, essentially, the virtual OA is one agent and not, you know, 00:22:01.520 |
a sort of a pre-coded sort of version of here's how I administer the specific treatment. 00:22:05.600 |
Yes, the reason I think we kept it simple is because we did not want to be super 00:22:09.920 |
treatment specific and how the architecture worked. But you could imagine doing, you know, 00:22:14.000 |
a set of slightly smaller, you know, better tuned agents that were, you know, kind of taking care 00:22:18.640 |
of elements of the task that was still pretty generic. The main reason I think it's not optimal to do that 00:22:24.240 |
is caching. And this is another question where, you know, I think I'm doing this right, but there 00:22:28.960 |
are a lot of variations here. Caching the entire message stream is easier with either one agent or with 00:22:35.440 |
sort of one agent doing most of the work. We're using Claude. Claude has very explicit 00:22:39.840 |
caching mechanisms. And every time I switch the system prompt, I think the cache blows up. 00:22:45.040 |
And so fundamentally changing the agent identity does that. So that was one, that's one reason we 00:22:50.000 |
chose that. It's certainly not, you know, a hard and fast forever choice. 00:22:53.360 |
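For reference, Anthropic's prompt caching works by marking a stable prefix, typically the long system prompt, as cacheable; switching to a different system prompt means a different prefix, which is why a single agent identity keeps the cache warm. A hedged sketch, with a placeholder prompt and model name:

```python
# Hedged sketch of Anthropic prompt caching with one stable agent identity.
import anthropic

client = anthropic.Anthropic()

VIRTUAL_OA_SYSTEM_PROMPT = "You are Ava, a virtual operations associate... (several thousand tokens)"


def call_virtual_oa(conversation_messages):
    return client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": VIRTUAL_OA_SYSTEM_PROMPT,
                # Cache breakpoint: everything up to here is reused across turns,
                # as long as this prefix stays byte-for-byte identical.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=conversation_messages,
    )
```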
What's the duration of the, like, care? Are we talking like months? Or like how many messages? 00:23:04.240 |
Yeah. So this use case, the early pregnancy loss, it tends to be a treatment which takes, 00:23:10.080 |
I think, three days end-to-end to administer most of the time. And then there's a check-in after that, 00:23:14.160 |
right? So imagine that probably within a week, the entire interaction with that patient is done, 00:23:18.400 |
unless they come back and just have questions later on, right? You know, there are some variants of this, 00:23:22.240 |
where you take a pregnancy test after six weeks, right? And so that's all fine. 00:23:26.000 |
The message history is preserved, but the computation that happens to generate each message is not, 00:23:32.880 |
or at least not in sort of the state that we save. So like, you know, the most complicated conversation 00:23:38.320 |
I've seen was something like 150 texts. It's a lot in terms of, you know, a human keeping it in their 00:23:43.120 |
brain. It's not that bad for an LLM, right? So, but it's that level. 00:23:48.960 |
All right. So again, just to point out where the lines are here, right? So I kind of got off on a 00:23:54.560 |
tangent. The top box is what we're going to be looking at here today, right? It's really a Python 00:23:58.640 |
container with access locally to these blueprints, this knowledge base, right? We are also then 00:24:04.320 |
maintaining some stuff over across the wire in this blue container. That's really where the dashboard I'm 00:24:08.400 |
going to show you is. It's where the text message gateway is. And it is where we're going to be moving, 00:24:12.160 |
I think, a lot of that context, right? Although the blueprints, like all that stuff really should live 00:24:15.920 |
kind of in the more durable software container. Right now it lives, you know, close to the Python. 00:24:19.520 |
Okay. So let's get into it. So the first thing I'll do here is just to show you kind of at a high 00:24:27.760 |
level what the software looks like. So this again is the console, the dashboard, right? The thing that 00:24:34.960 |
the operations associates, the humans are going to be looking at. And I'll show a couple things here just 00:24:39.360 |
to give you the sort of baseline, right? So the first thing here is this needs attention. So the current 00:24:43.600 |
system basically has this needs attention flashing all the time. Every time a text message comes in 00:24:48.800 |
from any patient, this thing is going off, right? You know, so there's, you know, hundreds of patients, 00:24:52.960 |
thousands of pages in the system at any time. So, you know, this needs attention used to be something 00:24:57.840 |
that multiple people were having to stare at constantly, right? Just to make sure that they 00:25:01.200 |
caught everything so that they could get out messages in a reasonable time. Now, needs attention is 00:25:06.080 |
really, you know, just sort of one thing at a time, right? And if I look here at the conversations, 00:25:09.920 |
there we go, you can see that the top one here actually needs a response. I'll get to that in a 00:25:13.520 |
minute. But at any given point, right, this is my test environment, you know, I've got a handful of 00:25:17.120 |
these conversations kind of already, already queued up. What I can see here, if I click into these things, 00:25:22.480 |
is essentially, I'll just go back to the beginning here for the whole message history, and I'm going to 00:25:26.080 |
toggle this rationale on. What you're seeing is the entire conversation, is that readable? Let me see if I 00:25:32.960 |
blow it up a little bit. Is that a little better? Okay. So the idea here is the agent is named Ava, right? 00:25:40.320 |
That's the personality that people are interacting with. This language is all coming out of these 00:25:46.080 |
blueprints, right, that I'll show you. And so this first message is just an initial message sent by the 00:25:50.240 |
system, essentially, just to kick things off. So imagine someone is, they have a package of medicine in 00:25:54.560 |
their hand, they scan a QR code. They put in their phone number, they get this text message, right? And then they 00:25:59.760 |
start talking. So you can see here, the kinds of things a patient is going to say are, you know, 00:26:04.400 |
freeform text, right? You know, this, I mean, they could say yes in any number of ways. The old system 00:26:09.840 |
used to have literally different buttons for yes, like, yes, I have the medicine. Yes, I heard you. I 00:26:15.840 |
mean, it's like, there's all sorts of variants, right? And because you did have to respond differently 00:26:19.280 |
depending on what those things were. What we're able to do here is really just take, you know, these 00:26:24.160 |
freeform answers, interpret them, and then essentially provide a rationale for why you would 00:26:29.680 |
say a given thing at a given time, right? So this is equivalent to if you were doing this with a 00:26:33.520 |
human, and you asked the human, well, why did you say this? The LLM can provide this kind of context. 00:26:37.920 |
So this is Claude looking at the history here, and I'll show you what this looks like in LangSmith, 00:26:41.760 |
which will make it a lot more obvious, and then saying, okay, here's the next thing that I should 00:26:46.400 |
say, and my confidence that I should say it is 100%, right? It's usually very confident, right? But the 00:26:52.000 |
point is, this whole process is largely going to go along in an automated fashion, right? You don't usually need 00:26:57.520 |
humans involved because this is a very straightforward thing. They have their medicine. 00:27:00.880 |
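One way to get a rationale and confidence like this out of the model is structured output; the sketch below uses LangChain's with_structured_output with an illustrative schema, which is an assumption about the shape of the proposal, not the client's actual contract:

```python
# Hedged sketch of the structured proposal behind the rationale/confidence
# shown in the dashboard.
from langchain_anthropic import ChatAnthropic
from pydantic import BaseModel, Field


class Proposal(BaseModel):
    reply_text: str = Field(description="Outbound text message, using approved blueprint language")
    rationale: str = Field(description="Why this response was chosen, shown to the human operator")
    confidence: float = Field(ge=0.0, le=1.0, description="Model's own confidence in the choice")


model = ChatAnthropic(model="claude-3-5-sonnet-latest").with_structured_output(Proposal)

proposal = model.invoke(
    "Patient: 'Yes, I have the medicine kit here.' "
    "Blueprint step: confirm receipt, then ask for the patient's local time."
)
print(proposal.rationale, proposal.confidence)
```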
The next thing I need to know, and this is a very interesting part of this treatment, 00:27:04.000 |
I need to know what time it is. These are text messages. We don't know anything about these people. 00:27:08.320 |
For a variety of reasons, it's kind of good that we don't know much about them, right? We don't want 00:27:11.280 |
to have to deal with all of the stuff around provider confidentiality and patient data, right? So one of the 00:27:16.160 |
things that we need if we're going to go through this longitudinal treatment is to figure out what time it is for 00:27:20.000 |
them, and then essentially pull out that data and figure out what their local time is, right? So in this case, 00:27:24.960 |
I was in Eastern time when I answered these questions. This is all me doing this, you know, from my laptop. 00:27:29.280 |
I tell it what time it is. It calculates an offset from UTC and says, well, I guess you're in Eastern time, right? 00:27:34.800 |
And then it sets this over here and it says, all right, from now on, I know that my patient is in Eastern time 00:27:38.800 |
unless they tell me otherwise. And they could come back and tell you otherwise, right? That's something the old system 00:27:43.680 |
really didn't have a good way to do. But if the patient comes back and says, I'm on a plane, 00:27:47.440 |
it's actually seven for me, we just update the time zone and move on, right? This is a very flexible system that way. 00:27:52.880 |
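The offset math itself is plain software, not the LLM; a rough sketch, with simplified rounding and a hypothetical helper name, might look like this:

```python
# Hedged sketch of the time-zone step: the patient reports their local clock
# time, we compute the offset from UTC (e.g. UTC-5, roughly Eastern), and
# store it until the patient tells us otherwise.
from datetime import datetime, timedelta, timezone


def infer_utc_offset(patient_local_hhmm: str, now_utc: datetime) -> timezone:
    """Infer a fixed UTC offset from a patient-reported local time like '17:45'."""
    hh, mm = map(int, patient_local_hhmm.split(":"))
    local_guess = now_utc.replace(hour=hh, minute=mm)
    # Round the difference to the nearest half hour to absorb texting delay.
    offset_minutes = round((local_guess - now_utc).total_seconds() / 60 / 30) * 30
    # Keep the offset in a sane range when the report straddles midnight UTC.
    if offset_minutes > 12 * 60:
        offset_minutes -= 24 * 60
    elif offset_minutes < -12 * 60:
        offset_minutes += 24 * 60
    return timezone(timedelta(minutes=offset_minutes))


# Example: patient says it's 5:45 pm while it's 10:45 pm UTC -> UTC-05:00.
tz = infer_utc_offset("17:45", datetime(2025, 3, 1, 22, 45, tzinfo=timezone.utc))
print(tz)
```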
Then we get into this over here and we say, okay, now that I know what time it is, 00:27:57.200 |
I'm going to ask them if they've started their treatment, right? And, you know, there is a blueprint, 00:28:01.920 |
right, which we'll get to, you know, that essentially just has, you know, the medicine that they're going to 00:28:05.600 |
take in a very specific way to take it, right? The protocol. The patient says, well, no, I want to take 00:28:11.040 |
it soon. You know, the Ava says, cool, I'll text you when we're ready. And then it gives, you know, a regimen, 00:28:15.600 |
which in this case, this is an SVG that we are stapling times and dates on top of, right? So, 00:28:21.040 |
you know, fairly straightforward. We're doing this in software. The LLM is not doing it. The LLM is 00:28:24.560 |
actually just passing along the instructions. You know, it says, send the step one image and provide, 00:28:30.240 |
you know, like this date and this time, and we substitute the rest of it in. And this goes out as 00:28:33.840 |
an MMS, right? So this is a text message. And so we provide this. The patient, you know, 00:28:40.000 |
says, you know, in this case, we're talking, you know, again, this is all the LLM reasoning through 00:28:43.680 |
this, right? You know, I am sending this immediately because it's actually within sort of the 35 minute 00:28:49.120 |
window that you've told me that I have to send these things. This is all business logic that the LLM is 00:28:53.760 |
interpreting pretty much on the fly. And then I have these reminders, right? I didn't get back to it. 00:28:59.040 |
So this is an important part. It sent me this thing and it thought that I was going to take it at 5:45. 00:29:03.440 |
I didn't text it back, right? This is partly because I was maintaining the system myself and 00:29:07.040 |
I forgot. So I had to come back in the next day and catch up. So it sent me an automated reminder 00:29:11.360 |
because it scheduled one when it sent the first message. So part of this is the LLM only gets 00:29:15.680 |
called when the patient says anything. So if they don't, you know, you have to make sure that you 00:29:20.080 |
stay engaged, right? You don't do this overly. Like we don't try to bother people beyond one or two 00:29:23.920 |
reminders. It's their treatment. But this bump sort of functionality was really important to the 00:29:28.560 |
client, right? So we built it in. So you can see here, I came back the next day and I said, 00:29:33.040 |
yep, sorry, I did take it. You know, Ava confirms that I completed step one. And what it does is 00:29:37.680 |
it sets this thing called an anchor, right? And it says, okay, you know, the patient was going to take 00:29:42.320 |
it at 5:45. They confirmed that they did. And so now, you know, I can refer back to this. I know that 00:29:46.640 |
this happened, right? And if the patient had then said, oh, no, I screwed up. I actually haven't taken 00:29:50.640 |
it. I'll take it today. We just changed the anchor. We update everything, right? So this is a system 00:29:54.640 |
that humans used to have to do. If a patient came back and said, I didn't take my medicine, 00:29:59.280 |
you know, a human has to go in and manually update all the times and all the scheduled messages. And 00:30:03.600 |
it was, it was a big pain in the butt. Um, yeah, please. 00:30:16.240 |
Not exactly. Um, the way that we do state and I'll, I'll spend a lot of time on this, 00:30:21.200 |
but the way that we do state is really just that with any given message from the patient, 00:30:25.120 |
right? This entire system only kicks off when the patient sends a message. Um, what we do is we say, 00:30:31.040 |
all right, given this state, what is the best response? And that response could be, 00:30:35.440 |
I changed some of these anchors. I update their treatment phase. I scheduled a bunch of messages. 00:30:40.080 |
All that state is preserved so that the next time they write in, then, you know, we have that state 00:30:44.960 |
to go on. Um, but again, we're not checking, right? There's no polling going on in the system 00:30:49.280 |
where we're saying after three hours, did the patient text me back? We don't do that. We depend 00:30:53.440 |
on the scheduled messages essentially just to nudge the patient. Um, if they choose to not say anything 00:30:58.560 |
for three days and they come back after three days, we just pick up where we left off. Um, again, 00:31:02.640 |
this is a choice. This is the way the client wants it. It's, it's intended to be low enough touch that 00:31:07.040 |
it doesn't bother people, but high enough touch that it doesn't lose track. Sure. Um, I'll pause here, 00:31:12.800 |
actually. Any other questions so far? Um, I, I realize I'm going through a lot. Yes. 00:31:15.920 |
Do your anchors have to be sequential or can your user come in at any point in the treatment plan? 00:31:22.080 |
They can. Great question. So the question was, do the anchors have to be sequential? Um, or like, 00:31:26.160 |
do you have to go through these one step at a time? So one of the great things, one of the best things 00:31:29.520 |
about this system is that I could have, and I'm happy to try this when we go a bit later, I could 00:31:34.640 |
have basically said, oh yeah, I already took the first pill and I'm like in the middle of taking the 00:31:37.760 |
second pill, you know, as like the first thing I say to, to Ava and she would be like, okay, 00:31:41.920 |
cool. There's an anchor. Here's the next thing. It skips ahead and it doesn't force you to go through 00:31:45.840 |
this prescriptive part of the blueprint. Whereas the old system, you know, at least nominally did, 00:31:49.520 |
right? Like you could, you could kind of skip ahead, but this automatically does it. You know, 00:31:52.800 |
part of the instructions are don't ask the patient a question they've already answered, 00:31:56.240 |
like period. Right? But that's annoying. Don't do that. Um, so, so yes, that's, that's very much in there. 00:32:03.600 |
Does the LLM itself have a concept of the internal state machine that is kind of 00:32:08.560 |
determining all of this? Or is that kind of outsourced to the actual software stack around the LLM? 00:32:16.400 |
does the LLM have an internal representation of the state? Um, kind of, sort of. So you'll, 00:32:20.320 |
you'll see, um, when we get into the, the, the actual back and forth with Claude, um, in LangSmith, 00:32:26.000 |
you as a human can see kind of where it starts, right? So every thread is going to show like, 00:32:30.080 |
all right, here's the incoming state. We repeat it essentially to Claude. Again, 00:32:33.680 |
that's just one dot I've never managed to connect with LangGraph, right? So we basically 00:32:36.880 |
have to serialize the state and say, this is your, this is your starting point. But then the LLM has 00:32:40.960 |
that in its window. And then, you know, it's going to cause changes to the state. It'll call functions 00:32:46.000 |
that update the state. It can always ask again. It can say, well, what's the current state? You know, 00:32:49.600 |
it can go back and retrieve it. Um, but in the context of that one, from when the patient responded, 00:32:54.960 |
you know, to when I actually come up with my response to them, that whole thing is going to be in its memory at one moment. 00:32:59.680 |
So, the question was, um, do we ever run into the state being too big? The short answer is no. 00:33:10.640 |
Uh, generally speaking, because of the way that we're kind of compressing and serializing at the 00:33:15.440 |
end of the conversations, it doesn't ever get so big that it can't finish its job of responding to 00:33:20.400 |
one situation, right? You know, like patient said this, now I'm going to do this. We have considered 00:33:26.320 |
having longer running threads where you kind of pick up in the middle and you've already, 00:33:29.040 |
you can reload sort of the entire previous conversation. That does get weird, right? 00:33:32.800 |
Especially with older Claudes, you would get it forgetting to sort of call tools the right way and 00:33:37.200 |
have all sorts of JSON errors, right? We have a bunch of retry logic in there to kind of compensate for 00:33:40.880 |
that. Um, so that's one reason we kept it short. We make it so that we basically throw everything out 00:33:45.920 |
and restart when the patient gets back to us in part because blueprints could change, right? You know, 00:33:50.560 |
a bunch of things could change in the meantime that might end up with weird states. 00:33:54.000 |
So on the management of states and taking decisions what to do next and so on, 00:34:00.000 |
is this 100% LLM-driven, or is there some software-like logic around it as well? 00:34:07.040 |
It's 100% LLM driven. Uh, sorry, the question was, uh, is the, is the steering done by software, 00:34:12.160 |
right? Any, any of that steering? The answer is really no, it's not, um, except for when it surfaces to a 00:34:17.200 |
human, right? And so when it goes to a human for approval, the human can use English and basically 00:34:21.840 |
say, yeah, change that word to that and that message shouldn't go out and you know, whatever. 00:34:26.400 |
So like we, we actually, as part of the flexibility part, we are not building any software that manages 00:34:32.240 |
the state. We just want you to talk to the LLM to do it, right? We think that's a better practice, 00:34:36.400 |
right? It means like, you know, you as a human just have to talk to it and you don't have to figure 00:34:40.000 |
out how to flip all the bits on this new console. 00:34:59.920 |
Oh, oh, got it. Sorry. So, um, question was, is there a RAG? Uh, no, there's not. And, and it's actually 00:35:06.240 |
just because what we really did is we just came up with a structure for the documents that was 00:35:10.480 |
self-referential. So you read a very small document which says, here's the treatment, 00:35:14.560 |
right? If you need to read for this phase, go to this file, right? If you need to read for this phase, 00:35:17.760 |
go to this file. If you have a question that doesn't fall underneath any of those things, 00:35:20.960 |
here's a CSV with a bunch of questions and answers. We didn't do it as RAG in part because we didn't 00:35:27.040 |
believe that either we could do a really good job of getting all the right information into the window, 00:35:31.200 |
like we didn't think we'd be reliable enough about that. We just want to give the entire document. 00:35:34.720 |
They're not that big. Um, and because these, this is Claude, right? It's, it's got a big enough 00:35:39.040 |
window that we could just put the entire thing in there, you know, for, for most treatments. 00:35:42.160 |
Um, so we chose to do that. What was your second question though? 00:35:44.000 |
Uh, is the patient going off a typical journey, uh, how do you detect and intercept? 00:35:50.000 |
Right. So the question is if the patient goes off track. So we, we have this idea of a blueprint, 00:35:54.640 |
but then there are plenty of cases where the blueprint may, um, you know, not fully answer whatever the 00:35:59.760 |
patient is, is, is bringing up. Um, like one example is the blueprint is very much about asking questions, 00:36:05.200 |
right? So you will say, have you taken your medicine yet? When do you plan to take your 00:36:09.120 |
medicine? The patient will say, my stomach hurts. Okay. So yes, your stomach hurts. You didn't answer 00:36:15.520 |
the question. What we do is the patient typically will get an answer to their question. So one of the 00:36:19.920 |
principles is always answer the patient's question, right? We don't ever want to leave them hanging, 00:36:24.000 |
but then ask yours again. So the idea is that at any given point, we can answer anything that they need 00:36:29.280 |
and as gently as we can, we'll try to pull them back onto the blueprint so that we understand where they are in 00:36:33.040 |
the treatment. Um, it's an inexact science, but, uh, is there a way to detect if, uh, someone 00:36:40.000 |
actually was off track? Well, the LLM does that effectively by knowing that it's supposed to keep 00:36:45.840 |
people on the blueprint, but having an escape hatch for the knowledge base, essentially what we call 00:36:49.440 |
it, right? Triage or knowledge base, you know, whatever you want to call it. Um, so, you know, we, 00:36:53.520 |
we don't have an explicit bit sort of flipped in the system that will say this patient is off track. 00:36:59.120 |
We just kind of know roughly where they are in the treatment. And if they want to answer, 00:37:02.480 |
if they want to ask a bunch of questions, we'll, we'll just answer them until they, they are satisfied. 00:37:08.720 |
So if you were building it again today, or actually the question is twofold. 00:37:14.640 |
What drove you to actually use LangChain? And second, if you were building it today, would you still? 00:37:21.440 |
So the question is why did we choose LangChain and would we still, um, I will be very candid 00:37:27.280 |
that the main reason that I chose LangChain is that I had personally gotten pretty comfortable 00:37:30.960 |
with LangGraph as, as a demonstration of these concepts, right? It's not that crew, I mean, 00:37:35.840 |
we did a lot of AutoGen work back in the earlier days, right? You know, I've, I've done a little bit 00:37:39.040 |
with CrewAI. All of those frameworks can functionally do very similar things. LangGraph was the absolute 00:37:44.720 |
best at explaining to people who were not neck deep in this stuff, how it worked. Um, and because there 00:37:50.000 |
was a path to production from there, I didn't feel a need to, to re-platform and change all of it. 00:37:54.000 |
We certainly thought about it, right? We considered, well, what if we didn't do this in LangGraph, 00:37:57.040 |
what would we gain? But the answer is you still have to implement observability in certain 00:38:01.360 |
ways. You know, you don't necessarily get, you know, the support that you might get from LangChain 00:38:05.280 |
if you end up in a place, remember that we're also doing this for clients. We're not going to be there forever. 00:38:08.800 |
Um, leaving them with something that they can call, you know, somebody to, to support is also 00:38:13.040 |
a helpful aspect. So I think, I, I don't think I'd do it differently. I think it's really just that, 00:38:18.960 |
you know, ultimately, you know, we're getting pushed, all of us, in the direction of using the native 00:38:24.400 |
model tools for this, right? You know, OpenAI has the responses API, which lets you define tools. 00:38:29.200 |
Claude has its new stuff, right? Like, I don't really want to be locked in. Um, I, I am to some 00:38:34.560 |
degree locked into LangChain now, but I, I prefer that honestly to being locked into the models. 00:38:38.560 |
Um, these are, these are not performance intensive things we're doing in terms of the software, 00:38:42.320 |
right? Like, you know, I don't care that LangChain is sometimes a little slow. Um, I would rather 00:38:46.640 |
have the optionality. So, you said you are not using RAG in such documents. So, as the documents scale, 00:38:54.640 |
how, how do you, like, fetch those in a deterministic way if it's not RAG? 00:38:59.520 |
Uh, it, it, it is just that they have, uh, sorry, the question was about, um, if it's not RAG, 00:39:03.600 |
how do we fetch documents? The documents refer to each other. So you, you'll see that we have an 00:39:08.000 |
overview.md, right? This is all in markdown. Um, there's an overview.md that tells you what 00:39:13.600 |
other documents are involved in the treatment, right? There's some of the prompting which says 00:39:17.520 |
you can always request a triage overview, right? To, to try to handle problems, um, and it'll be there, 00:39:23.120 |
right? Regardless of what the treatment is. So it, it is very much just a document management thing. 00:39:27.280 |
Um, RAG, the main issue is just that I, I don't think, and, and, you know, this will probably be more 00:39:34.400 |
obvious as we get into it, right? I don't think that you could really design a RAG which would pull 00:39:38.080 |
back snippets of everything in sort of perfectly relevant, relevant ways. You really do kind of 00:39:42.560 |
need to understand the shape of the whole treatment, right? To, to make a good decision, right? Otherwise, 00:39:46.560 |
you're just going to parrot whatever particular snippet the RAG happened to bring back, and then the 00:39:50.080 |
logic all has to be in the RAG. It makes more sense, and it's more transparent, I think, to do it this way. 00:39:54.080 |
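A sketch of that self-referential document layout as simple file-reading tools; the directory and file names are illustrative, and in the real system these reads happen through MCP rather than direct file access:

```python
# Hedged sketch of the "documents refer to each other" structure used instead
# of RAG: a small overview.md names the phase files, and a triage CSV holds
# freeform Q&A.
import csv
from pathlib import Path

BLUEPRINT_DIR = Path("blueprints/early_pregnancy_loss")


def read_overview() -> str:
    # Always read first: tells the model which phase document to request next.
    return (BLUEPRINT_DIR / "overview.md").read_text()


def read_phase(phase_name: str) -> str:
    # Whole documents go into the context window; they're small enough for Claude.
    return (BLUEPRINT_DIR / f"{phase_name}.md").read_text()


def read_triage_kb() -> list[dict]:
    # Escape hatch for questions that fall outside the blueprint phases.
    with open(BLUEPRINT_DIR / "triage_knowledge_base.csv", newline="") as f:
        return list(csv.DictReader(f))
```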
If you'll get into this in the state management later on, are anchors predefined in the blueprint, 00:39:59.600 |
or are they derived by the LLM? They, they are mostly predefined by the blueprint, in that we say, 00:40:04.480 |
as part of the overview, you know, the concept of an anchor is that it is a thing that happened, 00:40:08.160 |
or a thing that will happen, and here are the examples for this treatment, right? This is the 00:40:11.760 |
thing that will happen, or did happen in this treatment. Sorry, that was the question about the anchors. Yeah? 00:40:16.080 |
So when you, you mentioned that you, like, compressed the conversation, and you keep track of all that, 00:40:22.000 |
is that anchors, or is that another part of that? No, so the, the state is essentially, 00:40:26.240 |
you know, we, we call it, for reasons that only an engineer could love, we call it a schedule document, 00:40:31.200 |
right? The idea is that for any given patient, there is a schedule that they're on, and the document 00:40:36.080 |
snapshots their current state at any given point, right? And it's a version database, so we could go back 00:40:40.400 |
in time, and we could see what their document was three days ago. But it has, at any given point, 00:40:44.880 |
the messages that have been exchanged, any unsent messages that are scheduled, 00:40:48.880 |
and enough state about their treatment to fill out this view. Yeah? Yeah. 00:40:53.600 |
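A hedged sketch of what one of those schedule documents might contain; every field name here is illustrative, and the real document is a versioned record in MongoDB rather than a Python dataclass:

```python
# Hedged sketch of a "schedule document" snapshot: exchanged messages,
# scheduled-but-unsent nudges, anchors (things that happened or will happen),
# and the patient's inferred time zone.
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Anchor:
    name: str                        # e.g. "step_1_medication_taken"
    scheduled_for: datetime | None   # when it is expected to happen
    confirmed_at: datetime | None    # when the patient confirmed it happened


@dataclass
class ScheduledMessage:
    send_at: datetime
    text: str
    sent: bool = False               # reminders ("bumps") go out without an LLM call


@dataclass
class ScheduleDocument:
    patient_first_name: str          # no phone numbers or other identifiers reach the LLM box
    timezone_offset_minutes: int
    treatment_phase: str
    anchors: list[Anchor] = field(default_factory=list)
    scheduled_messages: list[ScheduledMessage] = field(default_factory=list)
    message_history: list[dict] = field(default_factory=list)
    version: int = 1                 # versioned, so you can see the document as of three days ago
```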
So in this case, all of this stuff is locked away, right? So I mean, just to go back to this diagram 00:41:04.640 |
for a second, this entire thing is all behind, you know, AWS's VPC, right? So like, there is no external 00:41:11.760 |
access to the LLM period, the only things it can talk to are essentially its own documents, 00:41:15.760 |
you know, in, in local files, and to the, the blue box. So, you know, there, there certainly are vectors, 00:41:21.120 |
but the vectors would be through the text messages, right? Not really through anything else. 00:41:26.320 |
Yeah. Yeah. Oh, I'm sorry, a bunch of people. You first. 00:41:28.960 |
Yeah. Just a question regarding, I guess, it's twofold. One is like, how are you assessing the confidence rate from the model's response, and the second is how are you safeguarding against prompt injection for malicious behavior? 00:41:40.320 |
Yeah. Well, so, the question was about prompt injection and, and generally sort of steering. 00:41:44.320 |
I mean, the basic answer is just that you could definitely try to trick the model by sending weird texts, right? And we do that as part of our, you know, sort of internal red teaming. Like we have the entire team of operations associates who have been spending, you know, weeks and months trying to trick this thing. 00:42:00.320 |
Um, and granted, they're not trying to trick it into revealing proprietary personal medical data, you know, I mean, there's things like that. We also obscure a lot of that medical data. 00:42:08.320 |
So the things that get, get to the yellow box do not include phone numbers. They do not include anything other than the patient's identified first name. Um, so there's a lot of, there's a lot of that data that's kept only in the blue, which is a lot easier to, to defend against. 00:42:20.320 |
Um, so yeah, we very much do obscure the patient. We don't obscure the treatment, right? The treatment is fully visible to the LLM. 00:42:30.320 |
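A toy sketch of that separation, assuming a hypothetical patient record; the point is that only decision-relevant fields ever cross over to the LLM-facing side.

```python
# Toy example of the redaction idea; the field names are hypothetical.
def to_llm_view(patient_record: dict) -> dict:
    """Keep only what the agent needs for decisioning. Phone number, last
    name, address, etc. never leave the core (blue-box) system."""
    return {
        "first_name": patient_record["first_name"],
        "treatment": patient_record["treatment"],
        "current_phase": patient_record["current_phase"],
        "timezone_offset_minutes": patient_record["timezone_offset_minutes"],
        "anchors": patient_record["anchors"],
        "messages": patient_record["messages"],
    }
```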
You mentioned human in the loop. How did you do the evaluation of asking the correct answer in the loop? 00:42:37.320 |
Yep. Uh, hold that thought. I will get to that very, very shortly. Um, we're back there. 00:42:41.320 |
Uh, sorry. It's just a question about unclear instructions. Um, so, when the situation is ambiguous, 00:42:57.320 |
the LLM is told to look at the blueprint and pick the best possible answer. Now, we tell the LLM: if you don't believe the answer is perfect, um, you should say so. 00:43:07.320 |
Right. In the rationale. So if I go back over here, this idea of the rationale: if there is uncertainty from the model's, you know, point of view, it can say, well, I picked this blueprint response, but I'm not sure that it's right. 00:43:16.320 |
In practice it's not great at doing that. Right. But that is the idea. And then the evaluator is also going to look at this and say, well, did you actually pick either the exact blueprint response word for word? 00:43:26.320 |
Did you adapt it? You know, does this seem right to you? Like we're trying to at least give a little bit of a layer before we get to humans. 00:43:32.320 |
And then hopefully we can trap situations like that and say, well, this is a complicated situation. A human should, should take a look. Um, it is not an exact science though. 00:43:39.320 |
Like that's generally just true with this stuff. Sorry. You in the back. 00:43:42.320 |
I'm curious about the scale. Like, uh, like, what's the load of how many batches of the . 00:43:53.320 |
Yeah. Um, so the question was just about load and scale. So, uh, look, the really short answer is that this, this system exists, right? 00:43:58.320 |
There's an existing version of it that is humans pushing buttons. Um, that scale is, you know, again, let's say thousands, not millions of patients. 00:44:05.320 |
Um, this opens up the possibility of doing more treatments, right? That's how we would get sort of additional patient scale. 00:44:11.320 |
You can also sell this to new hospitals, new clinics, things like that. Um, so part of this is to get the scale to be larger. 00:44:17.320 |
Um, we have not run into scale issues with, you know, just the, the conversations with Claude, you know, the software that we're building would scale much, much larger than thousands of users, right? 00:44:25.320 |
You know, the, the text message gateway might actually be the, the biggest bottleneck. So it's, it's honestly, it's a problem we want to have. Um, go ahead. 00:44:32.320 |
So, um, you keep saying Claude, did you guys select the LLMs because it was what the client had access to? 00:44:38.320 |
Or was there like a specific reason why you're going with 3.5 or whatever you're using? 00:44:42.320 |
Yeah. Uh, so the question was about model selection. Um, when we started this, right? 00:44:46.320 |
And I think, you know, let, let's assume that we kicked this project off, you know, late last year, early this year, right? 00:44:51.320 |
Um, we had to make a choice and our main criteria were it had to be a steerable model that we felt pretty good about, you know, transparency wise. Um, you know, one example, just, just to give you a specific one. 00:45:00.320 |
o4-mini is pretty good at this workflow, but it won't show its reasoning. Um, like, I mean, that's just one example. 00:45:07.320 |
And like, it's not a deal breaker. Like we can still see the rationales. Like there's some pieces of it, but I like being able to go into Langsmith and seeing the whole conversation, right? 00:45:13.320 |
That, that really helps me out. Um, we needed, you know, again, flexible hosting, but I mean, all the clouds kind of do that. 00:45:18.320 |
Frankly, we didn't want to deal with Microsoft and we kind of preferred AWS to Google. That was kind of how we got there, but you know, you can do this anywhere. 00:45:27.320 |
It really was just, we had to pick a horse and we largely have not regretted it. And in part, because we built enough flexibility where if I want to switch, I still can. 00:45:43.320 |
Uh, so the question is about sensitivity, uh, of data through the text carriers and also about, uh, using the data to learn. Um, I'll do the learning first. Um, we don't. 00:46:05.320 |
We, we do not take any of the responses and do anything to the models other than when we see situations that we as humans have evaluated and found wanting, um, we can tweak the prompting and the guidelines, right? 00:46:17.320 |
But we are not putting this in any sort of durable form. Like ultimately, you know, we believe the right model here is the provider interaction. If there's a provider involved, that sticks around, right? 00:46:26.320 |
The provider knows that you interact with the system. They can have, you know, whatever records they need. Um, otherwise, you know, we forget about you when your treatment is done. 00:46:32.320 |
We think it's better that way. Um, on the, on the, the sensitivity question, yes, there is sensitivity involved. And at the same time, again, there's prior art with these products, right? 00:46:42.320 |
There are existing systems which essentially take, you know, text messages in and provide medical advice. Um, we're just trying to stay within the guidelines of that. 00:46:50.320 |
And again, that's one reason why we don't want the LLM actually to have any data that is not explicitly required just to do decisioning, right? 00:46:57.320 |
It doesn't need anything beyond that to, to make a good decision. Okay. 00:47:07.320 |
Is it determining that that's a hundred percent? 00:47:09.320 |
And then what are the situations where it's not ? 00:47:12.320 |
Yep. Um, sorry. Hold that thought too. Cause I will get to that in just a second. Um, let me move on. 00:47:16.320 |
Uh, please like bring these questions back up. I just want to get a little bit further so we can see some other, some other cool things about this. Um, I'm going to move on from this flow just because you can imagine that this is going over a period of days, right? 00:47:25.320 |
There's another step here, step two, where there's, you know, more medicine being dispensed. Um, and then, you know, ultimately we're going to get to the end, right? 00:47:32.320 |
And, you know, essentially, did you complete this? And then, okay, great. You know, this is what's going to happen to you. You know, you're going to see some bleeding. Um, and then we have this check in, right? 00:47:40.320 |
So imagine that this now is, you know, a full, let's say three or four days later, right? After the, the treatment has begun. Um, you know, we check in, you know, the patient gets back to them or not, right? 00:47:47.320 |
Remember, some of these patients will just be like, I'm done. I don't really need to talk to this thing anymore. But if they do, right, we continue with the treatment. We don't bother them. We just let them sort of resume where they left off. 00:47:58.320 |
Again, we have these rationales, you know, we have these questions. And then what I want to do here is just to show you briefly, um, sorry, I got to zoom back out so I get the full phone number, um, what it would look like to interact. 00:48:08.320 |
So if I go here into my sandbox, um, imagine that normally this would be a text message. Um, so, you know, I would be doing this on my phone. Um, but here, you know, I can answer this question. 00:48:18.320 |
If I had any pregnancy symptoms before, have they decreased? It's like, yes, uh, they have decreased. 00:48:27.320 |
Okay, so I post this message. Now what's going to happen from here is thinking. So none of this is instant. And so now what I want to show you is what this looks like in Langsmith. 00:48:36.320 |
So, um, you can see here a couple of things. Um, one is that this, this is now spinning. Um, so this thing that I just asked it is now in active processing. 00:48:44.320 |
I'll show you what it looks like when we're done. Um, but I will give you just a brief look at, um, I think this is probably a useful one here. 00:48:51.320 |
Um, what this actually looks like in terms of processing the state. Um, so I'll blow this up a little bit and make it a bit bigger. 00:48:59.320 |
So, um, what you can imagine, this is using Sonnet 4, um, is that every time a message comes in from 00:49:06.320 |
a patient, this is what I get. Okay? I get this description of, you know, everything that's going on here. 00:49:12.320 |
I can see this is an Avela patient. I can see the thread that we're currently executing, right? 00:49:17.320 |
Because you may need to resume these threads if you need to give feedback. Um, I have this idea of I'm in the three day check-in phase. 00:49:22.320 |
So that's the blueprint that I'm going to read. Um, and then I have a couple of things. I have these anchors, right? 00:49:27.320 |
Which, you know, you could see, I think this is exactly what you saw before, um, you know, in that same patient. 00:49:31.320 |
Um, these are all defined as, you know, actually a mix of UTC and Eastern timestamps. Um, that's one of the problems that's hard to eradicate. 00:49:39.320 |
Um, getting LLMs to deal well with time is really tough. Um, but then I have this entire message queue, right? 00:49:43.320 |
And this is the compressed state of the conversation to date, right? This does not include every message that Claude sent itself while it was thinking, right? 00:49:52.320 |
That part is contained in these individual Langsmith threads. I could go back and I could look at this if I needed it. 00:49:56.320 |
Um, but what I'm doing is I'm compressing and basically saying all I really care about is the actual messages that went back and forth. 00:50:01.320 |
I want these rationales because I want to be able to review them, right? That helps me understand the decision that's going on here. 00:50:06.320 |
Um, you know, I want these confidence scores so I can go back and look, you know, what did it think at any given point? 00:50:11.320 |
And again, I'll show you one where the confidence was low, but these things can go on a little ways, right? 00:50:15.320 |
This is probably, I don't know, 20, 25 messages, right? 00:50:18.320 |
All of this goes in as initial context in the window, right? 00:50:21.320 |
So if you had 150 messages, all 150 of them are going to potentially go in. 00:50:25.320 |
Now we do have a function where you can optionally set it to compress and say, well, just show me the last 50, right? 00:50:31.320 |
If I need to request more, I can do that. There's a way to do it. 00:50:33.320 |
Um, but I don't need to have the entire thing in the window. 00:50:36.320 |
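Something like the following could implement that optional compression, assuming a LangChain-style tool; the name list_messages and its exact signature are assumptions based on the behavior described here.

```python
# Hypothetical tool sketch: give the model only the most recent messages up
# front, plus a way to page backwards through older history on demand.
from typing import Optional
from langchain_core.tools import tool

FULL_HISTORY: list[dict] = []  # populated from the schedule document

@tool
def list_messages(limit: int = 50, before_index: Optional[int] = None) -> list[dict]:
    """Return up to `limit` messages, optionally ending before a given index,
    so older history is fetched only when it is actually needed."""
    end = before_index if before_index is not None else len(FULL_HISTORY)
    start = max(0, end - limit)
    return FULL_HISTORY[start:end]
```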
Um, so I get down here. This is the last message from the patient, right? 00:50:39.320 |
So the question was, did you notice blood clots? I said, yes, a few, right? 00:50:43.320 |
You know, that was, that was what I, as a patient said, Claude is now going to start processing this thing, right? 00:50:48.320 |
So imagine, you know, this all being basically pasted into, you know, a Claude window and then having it go through this process and call tools. 00:50:55.320 |
So it starts by looking at directories that it's allowed to view. Again, this is a version where it's got the blueprints kind of all local and it's, it's talking to them this way. 00:51:03.320 |
We have another version where it talks via MCP over to the blue box, right? The larger system. 00:51:07.320 |
Um, so it figures out what directory it has. It reads these basic ones because these need to be read in all cases. 00:51:13.320 |
So these guidelines, right? The idea of how do you do your job, right? The idea of what the confidence framework looks like, the overview of the treatment, right? 00:51:20.320 |
You know, those sorts of things. We read those up front. None of these is very large, right? 00:51:23.320 |
And so you read all this stuff, you know, it, it comes into the, the window. Um, and then, you know, essentially it reads those descriptions and it says, well, I was told as part of this that I have to read the current blueprint for this current phase, right? 00:51:34.320 |
So I read that file individually. So a bunch of these early calls are just about setting up the context. This is not the only way to do it, right? 00:51:40.320 |
I mean, this, this is the way that we've chosen to do it again. We chose not to do rag for a couple of, you know, reasons around. 00:51:45.320 |
We just did not think we could get good enough results. And because this is honestly easier to interpret, right? You can sort of tell what it's doing. 00:51:51.320 |
Um, I get to the blueprint, the blueprint, and you will, we'll see more of these examples in a second, but the blueprint is basically this kind of structured bulleted list, right? 00:51:59.320 |
Here's all the stuff that you might need to say to somebody, right? And you know, here's what you do when, you know, the user says a certain thing. 00:52:06.320 |
This isn't actually that prescriptive. It's just structured, right? This isn't an if then statement, right? It's kind of like that, but it's not an actual if then statement. 00:52:15.320 |
So like this format, you know, is one that we iterated on and got to a point where we actually get really good results. Um, but you know, it wasn't a hundred percent obvious. 00:52:23.320 |
This is the way to do it up front. Um, you know, we started with charts. Um, and so now you get to this point where now you can see, okay, 00:52:30.320 |
now I got to look at these, you know, uh, conversations. I got to figure out what's been going on here. 00:52:34.320 |
And so you can see here, even though I passed in the state, it has a function to list messages. 00:52:38.320 |
And so it basically says, all right, well, now that I sort of know what's going on, let me see the last five messages, right? 00:52:42.320 |
And you can see here, it's going to start sending, you know, a bunch of these in. Um, and so it does that. 00:52:47.320 |
It looks to see if there's anything scheduled. There's not. Right. And so now it says, all right, this is, 00:52:52.320 |
this is sort of the point where Claude does its little explaining thing. I understand what's going on. 00:52:57.320 |
The patient's in the three day check-in phase. I already asked about bleeding and cramping. 00:53:01.320 |
I asked about blood clots and the patient, you know, basically just said, yes, they have blood clots. 00:53:05.320 |
And so I'm just going to keep on going. Right. And it goes to the next question about pregnancy symptoms. 00:53:10.320 |
This message comes directly from the blueprint. Okay. And I'll show you in a Google doc form in a second what that looks like. 00:53:16.320 |
Um, so it schedules it. It says, you should send this message, you know, as soon as you want to. 00:53:21.320 |
And then we get over to this evaluator flow. Right. And the evaluator says, all right, 00:53:25.320 |
I'm going to look at this situation. I'm going to look at everything that requires confidence scoring. Right. 00:53:29.320 |
That new message is the only thing. It's, it's the, the only thing that just happened. 00:53:32.320 |
Um, and I'm going to send it immediately. This is just a, a timestamp for immediately. 00:53:36.320 |
Um, I then get this kind of report. Right. And the way that we set up our framework, um, 00:53:42.320 |
and I'll show it in code a little bit clearer is, you know, do we know what the user is saying? 00:53:47.320 |
Do we know what to say? And do we think that we did a good job? 00:53:50.320 |
Again, this is a tough one. Right. Um, generally speaking, the LLM, you know, says at all times, yes, I know what I'm doing. 00:53:56.320 |
And you know, like buzz off. Um, but what I can also do is I can say, all right, then there's a bunch of cases in which if I set an anchor, 00:54:04.320 |
if I updated the patient's data, like maybe I changed their time zone offset, maybe I changed their name. Right. 00:54:09.320 |
That's a weird thing that, you know, if it happened, you'd probably want a human to look at. Um, do I, am I sending multiple messages? 00:54:14.320 |
Do I send it, am I sending duplicate messages accidentally? Do I have reminders for things that have already happened? 00:54:19.320 |
All of those things would deduct from the score and cause a human to get involved. Right. 00:54:24.320 |
That that's part of how we do this is to combine. Does the model think it's okay? Right. That's this top part. 00:54:29.320 |
And then overall, is there a weird circumstance that I should try to catch? Right. And then I should try to, to, to show people, uh, to show a human for review. 00:54:36.320 |
Um, in this case, nothing came up. I update the confidence. It's confidence of a hundred percent. Um, and then essentially the virtual OA, you know, as, as a final thing, it's very hard to get Claude not to summarize itself. 00:54:46.320 |
It does. Um, it basically just says, here's everything I did. I'm good. And then if you go down here to the bottom, this is the output state. 00:54:53.320 |
So this output state says, well, I have a hundred percent confidence. Again, it's version of it that I did the right thing. I, you know, here's my anchors. Here's my messages. 00:55:02.320 |
And here's the unsent message that I'm, I'm now going to send. And because it's a hundred percent confidence, it just goes out. Right. 00:55:08.320 |
It goes back to the text message gateway and it just goes out. Um, that is a risk, right? You know, if you want it to be perfectly safe, you have a human review, all of these things. 00:55:16.320 |
We don't want to do that because we're trying to scale, right? So we are comfortable in general with things that are, are, you know, coming back with a hundred percent confidence that we just send those messages out. 00:55:26.320 |
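As a sketch of the gating just described: the model's own self-assessment, minus deductions for circumstances that should pull a human in. The specific deduction values are invented; the 75% auto-send threshold is the one that comes up later in the talk.

```python
# Sketch of the confidence gating; deduction values are invented, the 75%
# threshold is the one mentioned later in the talk.
AUTO_SEND_THRESHOLD = 75  # below this, a human approves, gives feedback, or defers

def overall_confidence(self_score: int, events: dict) -> int:
    """Start from the model's self-assessment, deduct for circumstances that
    should surface the interaction for human review."""
    score = self_score
    if events.get("anchor_set"):                score -= 15
    if events.get("patient_data_updated"):      score -= 20  # e.g. name or timezone changed
    if events.get("duplicate_message"):         score -= 30
    if events.get("reminder_for_past_event"):   score -= 30
    if events.get("multiple_messages_at_once"): score -= 10
    return max(0, score)

def route(score: int) -> str:
    return "send" if score >= AUTO_SEND_THRESHOLD else "human_review"
```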
Yeah. So when you're doing that evaluation stuff. 00:55:30.320 |
Are you, like, bucketing those situations somehow, like tracking what the agent is having trouble, like, determining in any way to, like, create that, like, later on? 00:55:39.320 |
Uh, yeah. So the question is just, uh, how do we determine sort of the situations that might have confidence issues? 00:55:45.320 |
Um, it is very hand tuned and geared to this evaluation team, like basically the virtual OA team that exists now as, as humans. 00:55:52.320 |
Um, we will review, you know, in sort of spot checks, you know, a bunch of situations just to kind of see like, hey, is, is, does it seem like it's okay? 00:55:59.320 |
Um, when the, when a patient writes back, cause there are cases where a patient will write back and say, you got that wrong. 00:56:04.320 |
Like, that's not the time I said, like, you know, I, I'm actually taking it now. 00:56:07.320 |
Um, the confidence system is pretty good at picking up that that happened and basically saying, all right, even if I think I'm confident something's wrong, right? 00:56:14.320 |
You know, a human should take a look at this. 00:56:16.320 |
Um, but I mean, the, the answer is it's, it's more art than science. 00:56:19.320 |
It's not something that we are perfect at even now. 00:56:21.320 |
And because we want to scale, we've chosen to say, look, the, the worst that happens is essentially something weird happens and a couple of text messages go back and forth that are just wrong. 00:56:30.320 |
Usually the human will get involved and say, that doesn't sound right to me, right? 00:56:34.320 |
It's not, it's not a case where the patient is in danger. 00:56:36.320 |
Um, you know, if they say, well, I'm having these symptoms, you're not helping me, like a human will step in. 00:56:41.320 |
Like that's, that's something we're pretty good at flagging. 00:56:45.320 |
I guess, like, are you, like, tracking that somehow? 00:56:47.320 |
So it's like, if, like, if he says, like, okay, I'm having this abnormal bleeding, I'm taking the medication. 00:56:54.320 |
And do you, like, go back and, like, address that? 00:56:59.320 |
I mean, so the, the short answer is, um, we can look at interactions that ultimately are scored as low confidence, and then we can trace back from there, right? 00:57:06.320 |
So a lot of what we're doing is when something gets flagged and a human is like, well, there's something weird here. 00:57:11.320 |
Um, you know, we share those things internally, right? 00:57:13.320 |
The, the Slack channel that I was talking about before where they talked to the physician's assistant, that's largely been repurposed to people saying, hey, this behavior is off. 00:57:20.320 |
And that ends up essentially in my queue as, you know, I got to go check my evals. 00:57:24.320 |
I got to see if there's something I can do to catch this. 00:57:26.320 |
And maybe it's a matter of changing the behavior. 00:57:28.320 |
Um, but so it's usually, it, when we, when we know there's an issue, we can backtrack. 00:57:38.320 |
Do you have any data from, like, before in the system of, like, the percent or error rate for human response versus AI? 00:57:46.320 |
Uh, so the question was about human versus AI error response from prior data. 00:57:50.320 |
And, and yes, the answer is we do have that data. 00:57:53.320 |
And that's one of the reasons that the client is as comfortable as they are with letting an LLM kind of run amok, right? 00:57:58.320 |
Is the idea that humans do make mistakes now. 00:58:00.320 |
And when they get escalated, you know, it's something where you can look back and be like, oh yeah, that was a little bit off. 00:58:05.320 |
Um, this is kind of unique in that, again, it, it needs to be, you know, precisely worded. 00:58:10.320 |
Like, one of the biggest risks is just that you give sort of off-label medical advice. 00:58:13.320 |
But if the idea is that, like, oh, you misunderstood and you have to go back and correct yourself, that's okay, right? 00:58:18.320 |
It's, it's, it's, that's not a fatal error, right? 00:58:21.320 |
So a lot of it is that, you know, we think that we can get better use out of our humans by reviewing these situations, you know, 00:58:26.320 |
than we can out of just having to push the buttons because they will occasionally push buttons wrong, right? 00:58:31.320 |
So, um, we were just talking about mistakes and, uh, this has been running for a while. 00:58:37.320 |
Have you thought about, um, fine-tuning a model with de-identified, uh, messages? 00:58:45.320 |
Yeah, I mean, so the question was about, um, have we thought about fine-tuning? 00:58:50.320 |
Um, we have already seen two major model releases in the time we've been working on this. 00:58:55.320 |
Um, we generally don't think that fine-tuning is a great use of our dollars. 00:58:59.320 |
Um, it, obviously, could be cheaper. 00:59:02.320 |
We can, I mean, one, one example is, um, we tried, you know, at one point to use Haiku. 00:59:06.320 |
Um, and, you know, Haiku is not even that much cheaper. 00:59:10.320 |
Um, we, we got to a point where we made our blueprints better, in part, because, like, 00:59:15.320 |
we'd sort of had some shortcuts where we just didn't have to be as, as precise with Sonnet, right? 00:59:18.320 |
You know, we had to be more precise with Haiku, and then it worked. 00:59:23.320 |
Haiku was terrible at figuring out what times it needed to sort of put on things. 00:59:27.320 |
And so, the, the kind of thing we would have to do there, like, it either just kind of requires a smarter model, 00:59:31.320 |
and there were smarter models from multiple people. 00:59:33.320 |
Like, o4-mini really is both, you know, I mean, it costs a little bit less than Haiku, I think, right? 00:59:40.320 |
We chose not to go with it, in part, because it wasn't as transparent. 00:59:43.320 |
Um, but so, in, in general, we don't believe that fine-tuning is warranted, because we think the models are just gonna keep getting better and cheaper, 00:59:49.320 |
and that we, you know, we'll be able to just kind of switch wholesale, as opposed to having fine-tuned something. 00:59:53.320 |
So, when you're going through that, you had this, like, chattiness with the model, where it was describing its actions and then calling tools? 01:00:02.320 |
Is that, like, was that an intentional choice? 01:00:03.320 |
Because I feel like you could just skip that and just do standard outputs. 01:00:06.320 |
It, well, so, yes, it was, it was kind of intentional choice, right? 01:00:10.320 |
This is partly that we, we already get the, the rationales and sort of the general, you know, explanation of its actions. 01:00:16.320 |
Um, but there are times where you want to be like, look, why did it do this? 01:00:20.320 |
And, you know, if it's thinking out loud, it's a lot easier to catch. 01:00:23.320 |
Um, so, yes, it, it's possible that we could eradicate some of that. 01:00:26.320 |
We don't really think the juice is worth the squeeze. 01:00:29.320 |
Well, if you have one miscarriage, most likely you're going to suffer a second miscarriage. 01:00:33.320 |
How is the current structure set up so that you have a new anchor point to see, like, this person is using this medication all over again? 01:00:42.320 |
So, uh, the questions are about, essentially, multiple treatments, patients coming back again after a while. 01:00:48.320 |
So one is that, um, you know, again, depending on how you get there, if you scan a QR code, 01:00:55.320 |
we can know that you're coming in a second time. 01:00:57.320 |
Um, but people will write back after, you know, two months and say, I have a question, right? 01:01:02.320 |
And, and so we either can just reactivate that conversation. 01:01:05.320 |
The other thing is different treatments would usually come from different phone numbers. 01:01:09.320 |
So there's a few different ways to kind of disambiguate, you know, what somebody is actually up to. 01:01:12.320 |
But that notion of, like, you know, the same thing happened to me again. 01:01:18.320 |
You just say, like, hey, I had a miscarriage two months ago. 01:01:26.320 |
Can I share some skepticism on the evaluator, though? 01:01:31.320 |
Because there's plenty to, there's plenty to share. 01:01:35.320 |
Obviously, it will have the intention of improving, right? 01:01:41.320 |
I guess I'm skeptical that doubling the costs is yielding 100% better outcomes, right? 01:01:51.320 |
Do you have, like, a funnel of how often the evaluator might be impacting? 01:02:01.320 |
Was there an intentional decision to stick with Claude rather than switching model families, 01:02:06.320 |
where, in theory, hypothetically, you've got a different brain looking at the other brain 01:02:17.320 |
Was the inclusion of this evaluation node, like, something that made the client feel better? 01:02:20.320 |
So were there other impacts besides just, like, this is a performance thing that made it better? 01:02:26.320 |
So questions are all about sort of the evaluator node and the processes. 01:02:28.320 |
So the shortest possible answer is, yes, we're also skeptical about it. 01:02:33.320 |
And at the same time, we think that there's still value in trying, you know, essentially, 01:02:39.320 |
just getting a second bite at the apple, right? 01:02:40.320 |
We do think that just having a different system prompt in the same conversation does occasionally 01:02:45.320 |
But you could have the virtual OA evaluating the complexity of its own situation. 01:02:49.320 |
I don't think you could get it to evaluate whether it was right or not. 01:02:52.320 |
Just typically, LLMs are terrible at that anyway. 01:02:54.320 |
And so I think the basic answer, though, is that we wanted the flexibility in part so we 01:02:58.320 |
could do things like try a different model entirely, right? 01:03:01.320 |
Or, you know, have something where maybe you did fine-tune a model specifically to catch these 01:03:06.320 |
Like, that, I think, wouldn't be crazy at all. 01:03:08.320 |
So, yeah, we wanted kind of that optionality. 01:03:10.320 |
And at this point, you know, it's still early enough, right? 01:03:15.320 |
If we get to a point where, like, look, the only issue with this is how much it costs 01:03:18.320 |
or, like, specific details about, like, how good it is at catching errors, we'd go harder 01:03:23.320 |
But we're pretty happy with the balance of it usually escalates situations that need review, 01:03:29.320 |
It will sometimes screw up something just because it thinks that it was easy and it wasn't. 01:03:34.320 |
So, like, we sort of are meeting the bar that we'd set for ourselves in the first place. 01:03:43.320 |
So, it's not really just the same thing two times. 01:03:49.320 |
And so, one thing, actually, though, is that the evaluator can see what the VOA is supposed to do. 01:03:54.320 |
So, it is able to basically say, "You didn't do that right, because I know what you were told to do." 01:03:58.320 |
And likewise, the virtual OA can see the evaluator's confidence framework. 01:04:02.320 |
And it can say, "Well, I'm going to be scored against these things. 01:04:07.320 |
Again, this is very much more art than science. 01:04:09.320 |
But, I mean, you're asking the right question about, like, could we just have either a more 01:04:17.320 |
So, again, this idea of, like, every interaction looks like this. 01:04:21.320 |
It is a starting state, a conversation, an ending state, which then goes back to the system. 01:04:26.320 |
And so, what I wanted to show you here was if I go back to a conversation, right? 01:04:33.320 |
So, I said the pregnancy symptoms have decreased. 01:04:34.320 |
The next question in the blueprint is, do you think you're done? 01:04:38.320 |
You know, do you believe that, you know, the miscarriage and sort of the changes that these 01:04:41.320 |
medicines were supposed to elicit have completed? 01:04:44.320 |
And there's basically one more message after this, which kind of confirms and says, like, 01:04:47.320 |
"Hey, let us know if you have any questions." 01:04:50.320 |
Back and forth, back and forth, assessing the state as it currently exists, is what this is doing. 01:04:56.320 |
And we're compressing after every one of these interactions into only the changes that happened. 01:05:01.320 |
We're not saving -- you know, in Langsmith, we're saving the entire conversation, right? 01:05:05.320 |
This data -- sorry, that's the wrong tab -- this data, you know, about, like, what the virtual 01:05:09.320 |
OA and the evaluator said to each other and what tools they called, this is preserved. 01:05:14.320 |
But we do not save this in the state on the blue box, right? 01:05:17.320 |
That's not part of the patient's interactions with us. 01:05:20.320 |
And we don't reload it every time you go back with a new message because that would ultimately 01:05:25.320 |
both confuse things and blow up the context window. 01:05:32.320 |
I have another conversation here which actually needs response. 01:05:36.320 |
So I'm going to grab this and put it in the sandbox so you can see what this looks like. 01:05:42.320 |
So how do you keep persistence then if you only have an online . 01:05:49.320 |
You just don't save the process of the model talking to itself, right? 01:06:03.320 |
Input and output is all that we snapshot in the larger system. 01:06:15.320 |
Actually, I don't have to go to the sandbox to look at this. 01:06:18.320 |
This is an example of what happens when things are complicated enough that we're asking for 01:06:25.320 |
So in this case, I've just started this conversation. 01:06:27.320 |
And I said, yep, I got my medicine, came from the clinic. 01:06:30.320 |
Now this is a moment where in the treatment a lot of stuff is happening. 01:06:33.320 |
I'm figuring out what time zone they're in, right? 01:06:35.320 |
And I'm saving that as part of the patient data, right? 01:06:36.320 |
So in this case, I said I was on West Coast time. 01:06:38.320 |
So my time zone offset is 420 minutes before UTC. 01:06:38.320 |
So in the confidence framework, and I think I can find this, but I won't dig into it until 01:06:57.320 |
In the confidence framework, we say when you have all of these changes at once, you should 01:07:06.320 |
So for anything that's below 75%, I stop and I ask a human to either approve, right? 01:07:13.320 |
So if I were to approve this, it would just say, all right, these changes are fine. 01:07:20.320 |
I could say, and I'll try this now and live demos be damned. 01:07:24.320 |
Let's say, you know, I want to say, please mention the patient's name in your messages. 01:07:40.320 |
And I'm working through about this because it's kind of an operational detail. 01:07:45.320 |
So I'll have to reload this in a minute and see what happened. 01:07:47.320 |
But what's actually happening here, if I go over to Langsmith again, which I should be 01:07:50.320 |
able to do, is see that what's happening now is that it is restarting a thread that I already started. 01:08:01.320 |
So the one exception to us wiping out its brain and reloading everything is when you come back with feedback. 01:08:06.320 |
Because you want to basically be able to pick up right in the thread and say, hey, you just got feedback. 01:08:11.320 |
But everything else here, like you need to be able to see how you got to that place, right? 01:08:14.320 |
You know, so make the right decision and finish it up. 01:08:28.320 |
So you can see, the only change that happened here is that it mentioned her name, right? 01:08:34.320 |
Same time zone offset, same treatment phase, same reminder. 01:08:39.320 |
And you can see here, the rationale even includes this. 01:08:44.320 |
Now, you could imagine doing a version of this where I just had a little edit box and I said, 01:08:50.320 |
We want the LLM actually to drive these changes. 01:08:52.320 |
We think that it's better for humans to speak to them as though they're talking to a person. 01:08:56.320 |
This is a debatable choice, but it is a choice that we made. 01:08:59.320 |
And part of that means we can be very, very flexible about the treatment, right? 01:09:03.320 |
We can just give feedback on the situation rather than having to build some sort of tools 01:09:06.320 |
that are flexible enough to deal with all different types of treatments. 01:09:10.320 |
But so here, I'm just going to go ahead and say approve. 01:09:12.320 |
And now, those messages go out and the changes are made, right? 01:09:15.320 |
I have my patient local time set and I know I'm in the next part of the blueprint. 01:09:21.320 |
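For anyone curious how that approve / give-feedback / defer step could be wired up, here's a minimal LangGraph-style sketch using interrupt and resume; it illustrates the mechanism, not the actual production graph, and the field names are assumptions.

```python
# Illustrative only: pausing a thread for human review and resuming the same
# thread with LangGraph's interrupt mechanism.
from langgraph.types import interrupt, Command

def human_review_node(state: dict):
    # Pause this thread and surface the low-confidence draft to a human.
    decision = interrupt({
        "unsent_messages": state["unsent_messages"],
        "confidence": state["confidence"],
        "rationale": state["rationale"],
    })
    if decision["action"] == "approve":
        return {"status": "approved"}
    # Feedback resumes the *same* thread, so the agent keeps the context of
    # how it got here and just revises its unsent messages.
    return {"status": "revise", "human_feedback": decision["feedback"]}

# Resuming from the review UI (thread_id identifies the paused conversation):
# graph.invoke(
#     Command(resume={"action": "feedback",
#                     "feedback": "Please mention the patient's name in your messages."}),
#     config={"configurable": {"thread_id": thread_id}},
# )
```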
I think we have something like 45 minutes left. 01:09:23.320 |
Any questions on any of this so far that are not, I just want to see the code because I can do that part. 01:09:30.320 |
Have you heard any feedback from their actual customers on this experience? 01:09:34.320 |
It seems like it's a pretty highly emotional interaction. 01:09:39.320 |
The question is about feedback from patients and that it is emotional. 01:09:42.320 |
So remember, this is a system the client is already running, right? 01:09:45.320 |
So fundamentally, they already believe that they're talking to humans even when they're not exactly, right? 01:09:51.320 |
Even the humans pushing the buttons are just calling up essentially bot generated responses. 01:09:56.320 |
When things get emotional, humans can step in. 01:09:59.320 |
You know, we tend to steer them towards kind of approved knowledge-based responses. 01:10:03.320 |
Like you don't want this to be something where it goes completely freeform. 01:10:06.320 |
There's legal and other reasons not to do that. 01:10:09.320 |
So by stepping in and having LLM make the decisions, it doesn't really change the kind of current context of these treatments. 01:10:15.320 |
They're already getting, you know, basically this sort of medically approved feedback, you know, based on a certain flow chart. 01:10:20.320 |
And if it goes somewhere, you know, a little crazy, the escalation point is usually to call someone, right? 01:10:25.320 |
It's not, you know, we keep on talking forever in text because that's messy. 01:10:29.320 |
There are a bunch of points which I'm not going to be able to demo here, which basically just say, yeah, I'm sorry, I can't answer that question. 01:10:34.320 |
Call 9-1-1, go to your doctor, whatever it is, right? 01:10:38.320 |
But that is usually where it goes from there. 01:10:40.320 |
The blueprints look a lot like graphs or flowcharts themselves. 01:10:45.320 |
Like, why don't you use LangGraph for that as well? 01:10:49.320 |
So honestly, part of it is just that we needed to have something that the patient, or not the patients, the client was actually comfortable maintaining, right? 01:10:56.320 |
Because remember, part of it is that we do not want this in code, right? 01:10:59.320 |
We don't want this to be something where you can only maintain it if you have a technical person. 01:11:04.320 |
And so just to jump over for a second, I'll show you what this kind of looks like. 01:11:07.320 |
So this is essentially the thing that the client is maintaining. 01:11:14.320 |
But the idea here is that we're using terms and, you know, we'll see a little bit more of this in the code. 01:11:20.320 |
We're using terms that are defined in the framework. 01:11:23.320 |
A trigger is, you know, something that happens, you know, essentially after an event, right? 01:11:27.320 |
You know, we have the conversation of these messages. 01:11:30.320 |
We always tell the LLM why this is important. 01:11:32.320 |
If we just had this detail and we just said this is the message you send, I don't think it would perform as well. 01:11:37.320 |
It's much more helpful to actually give the LLM justification for why it would say something because then it makes better decisions. 01:11:44.320 |
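Purely as illustration, a blueprint entry in that trigger / approved-message / why-it-matters shape might look something like this; the wording below is invented for the example, not real approved clinical content.

```python
# Invented example of the blueprint format described above; not approved
# medical language, just the structure: trigger, approved message, rationale.
BLUEPRINT_EXCERPT = """
## Three-day check-in

**Trigger:** three days after the second-medication anchor, or when the
patient reports that bleeding has started.

**Message (approved language):**
"Hi {first_name}, just checking in. If you had any pregnancy symptoms before,
have they decreased?"

**Why this matters:** decreasing symptoms help confirm the treatment is
progressing; if the patient reports something this blueprint does not cover,
consult the triage overview or escalate for human review.
"""
```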
On that one, I'm not sure if this is code or not for the next part, but how complicated and how simple do you do the if statements kind of logic in there? 01:11:56.320 |
Or if you have any issues that actually got lost in the . 01:12:01.320 |
So the question was just about, you know, essentially why do we have this framework and why is it maybe not more declarative, right? 01:12:06.320 |
In terms of like specifically if then and that sort of thing, right? 01:12:09.320 |
Actually, I use something similar with a different framework, with LlamaIndex. 01:12:23.320 |
It's gotta be like one, two levels max, otherwise it gets lost. 01:12:28.320 |
So the answer is just about, again, how do you define these things as clearly but, you know, maybe not complexly as possible, right? 01:12:40.320 |
So this framework tends to work where you're really just saying, look, I'm giving you this approved language and I'm trying to give you in the bold statements here, right? 01:12:50.320 |
Primarily, I'm trying to give you a sense of, you know, what the conditioning really is. 01:12:53.320 |
But part of the reason that we did it this way is because, you know, if the patient writes back after this thing and he says, you know, yes, I have the medication and I took the pills and my stomach hurts and I'm confused, right? 01:13:05.320 |
We wouldn't want to represent something like that in a flow chart. 01:13:08.320 |
What we really want to do is just say, look, this is the outline of the thing. 01:13:12.320 |
You know, if you need to jump ahead, jump ahead and don't ask the patient questions they've already answered. 01:13:15.320 |
It just turns out that this framework really does work pretty well for letting the LLM do that sort of thing. 01:13:20.320 |
It's I know that's kind of a magic answer, but it's pretty good at it. 01:13:24.320 |
Yeah, no, I mean, well, yeah, Claude mostly nails it. 01:13:29.320 |
Does including the instructions like quantifiably improve the response quality? 01:13:36.320 |
Sorry, what do you mean by including the instructions? 01:13:40.320 |
So the question was, does including the reasoning help with the response quality? 01:13:45.320 |
I mean, this is one of these things where we started also by borrowing from human documentation, right? 01:13:50.320 |
So this was a process that was originally explained to humans who were going to push the buttons. 01:13:54.320 |
And so we took a combination of flow charts that existed to explain the flow of the treatment and these kinds of, you know, this is the message that you should send in these situations. 01:14:02.320 |
And this was kind of the hybrid output of those two things. 01:14:05.320 |
So I wouldn't say we did aggressive testing on is it really better or is it just that, you know, this is good enough. 01:14:11.320 |
It's more that like we started with this framework based on the human materials we had. 01:14:16.320 |
Did you find any sacrifices that you had to make in kind of maintaining this document as human readable versus LLM? 01:14:28.320 |
So the question is maintaining this document as human readable versus LLM. 01:14:36.320 |
So, you know, imagine the workflow here being, you know, this Google Doc is maintained essentially by our physician's assistant, right? 01:14:42.320 |
She is the co-owner of the blueprint maybe next to me. 01:14:45.320 |
When we make changes, we talk about them together. 01:14:47.320 |
We recommend in this document and then accept them. 01:14:49.320 |
And then effectively, I export it to Markdown and check it in, right? 01:14:54.320 |
We're going to build a lot of these tools into the database. 01:14:56.320 |
And so that's really where you'd be doing this instead. 01:14:58.320 |
But because this is human, you know, maintained, right? 01:15:02.320 |
Because it is basically, you know, still driven by the team. 01:15:06.320 |
I don't think it's a trade-off that's super damaging. 01:15:09.320 |
Based on your current design, just now when you do the approve, disapprove thing, what does it change after you do the approval? 01:15:20.320 |
Is it like a one-time thing, or does it improve your answers in the future, or even change the approvals? 01:15:27.320 |
So at the moment, no, and the question was just about the approve sort of defer, you know, feedback mechanism. 01:15:34.320 |
So actually, I'll go back and just show this really quick. 01:15:39.320 |
Again, we are saving this in the sense that I can see this in Langsmith, right? 01:15:45.320 |
I can look at this and I can say, well, in, you know, these cases where an approval was needed, and in this case, like, just as a visual, you know, sort of feedback, whenever you have this graph null start, right? 01:15:54.320 |
That is one of these cases where, you know, there was an approve feedback defer choice. 01:15:58.320 |
I could filter by this and I could look at all of these things and I could say, well, what kinds of things were we actually trying to approve or give feedback on? 01:16:06.320 |
We, as humans, will maybe update the blueprints. 01:16:09.320 |
We do not put this back into training data, again, for a bunch of reasons which are kind of specific to the situation. 01:16:14.320 |
But, you know, you can see here that, like, I can go all the way down here and I'll just try to find this quickly. 01:16:18.320 |
And you'll get to a point where the human says, all right, yeah, here it is. 01:16:22.320 |
So we get feedback from, yep, from the human OA. 01:16:26.320 |
This is essentially what happens whenever I push that button and I say give feedback, right? 01:16:31.320 |
The human OA has feedback about your unsent messages. 01:16:36.320 |
It's like, okay, let me look at the messages that, you know, I was sending. 01:16:43.320 |
Actually, no, I just updated that one in place. 01:16:46.320 |
One thing we've said is that we do not change the confidence score on something that a human reviewed. 01:16:54.320 |
And so in all these cases, this is also a much quicker, you know, simpler sort of operation, right? 01:16:58.320 |
And so you can see the evaluator here is basically like, yep, that message is fine, 01:17:01.320 |
but we're not going to do anything, you know, really to change the overall score. 01:17:06.320 |
So, you know, that's the kind of thing that, you know, we can look at afterwards, right? 01:17:10.320 |
But we are not at this point at least, you know, really trying to feed that back into the model. 01:17:15.320 |
Just to get a sense of your metrics, a lot of these are, you know, over a minute, and it says about, you know, a couple hundred thousand tokens. 01:18:23.320 |
Is that like a necessary evil, the time it takes, the cost, or just a testing environment? 01:18:30.320 |
And actually, just to point it out, these costs I don't believe are correct. 01:17:35.320 |
One of the shortcomings of Langsmith, and I think they've admitted this in various forms, is they don't really take into account the caching. 01:17:40.320 |
So these costs should be lower than what you see here, but fundamentally, yeah, these are expensive operations. 01:17:46.320 |
And, you know, we could change some of them at the potential cost of higher error rates, right? 01:17:51.320 |
Like we could try to cache more and have you inherit threads in progress, and it would be faster, right? 01:17:57.320 |
It would be, you know, potentially you're not reloading any context, and so, you know, you're spending maybe less on tokens. 01:18:02.320 |
And you just might have a higher error rate, and that's, you know, a thing we are trading off. 01:18:08.320 |
Is there some kind of knowledge base that your model is talking to, like, to pivot depending on the medicine that your patient is taking? 01:18:19.320 |
Because maybe the side effects are different, and the question should be different? 01:18:23.320 |
Yeah, so the question is just in terms of the knowledge bases. 01:18:27.320 |
So let me actually jump over and just show this really quick. 01:18:30.320 |
I'll jump over now, really, into just what the implementation looks like. 01:18:34.320 |
So you can see over here, you know, this idea of for Avila, right? 01:18:38.320 |
We have a handful of documents here that are, again, exported into Markdown. 01:18:42.320 |
And I'll try to blow this up because I know these are small. 01:18:47.320 |
Okay, so the idea here is that, you know, I've got all of this, you know, sort of framework data, right? 01:18:55.320 |
We're defining this every time, not in the prompt, right? 01:19:00.320 |
We're doing this as part of the context window in part because we do want this to be really flexible. 01:19:05.320 |
If you want to change the terms, you should be able to do that, right? 01:19:07.320 |
We don't want the treatments to be hamstrung by terms we use for other treatments. 01:19:14.320 |
So all of this stuff exists in part just to lay the groundwork. 01:19:17.320 |
And then this framework, right, is now referring to specific documents, right? 01:19:20.320 |
And so you can see here, like, again, these documents are all referring to each other. 01:19:24.320 |
So I can go through here and I can look, you know, and click on these links and go straight to other things. 01:19:29.320 |
If I want to do something around the knowledge base, so the way that we do that is this triage idea. 01:19:33.320 |
So, you know, if something happens that a blueprint doesn't address, right? 01:19:37.320 |
So the way that we tier it is first check on the blueprint. 01:19:40.320 |
If you have approved language, use it, right? 01:19:42.320 |
Send it back for human review, whatever you need to do, right? 01:19:45.320 |
If you don't think you can answer that question, you go and look at this, which, now, again, is broken up by topic. 01:19:50.320 |
We don't read the entire scope of medically approved knowledge all at once. 01:19:54.320 |
We let the LM decide, are they complaining about stomach pain or bleeding, right? 01:19:58.320 |
You know, if I can't find anything in any one of these, I have a larger knowledge base, right? 01:20:02.320 |
Which is, you know, sort of just a laundry list of, like, random questions people ask. 01:20:06.320 |
We've chosen to do it this way in part because it is human readable. 01:20:09.320 |
It mirrors something the client already mostly had, right? 01:20:13.320 |
And, you know, we fundamentally did not believe that it made sense to over-process, you know, these documents. 01:20:20.320 |
Now, I will say that for the thing that we have, like, kind of a backup, you know, sort of knowledge 01:20:23.320 |
store, which is almost entirely a CSV, that probably is suited for RAG, right? 01:20:30.320 |
So in some sense, it's just not worth implementing that way, at least not yet. 01:20:47.320 |
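A sketch of that tiering, with invented file names, might look like this:

```python
# Invented file layout for the tiered lookup: approved blueprint language
# first, then topic-specific triage docs, then the catch-all knowledge base.
from pathlib import Path

def answer_sources(topic: str, treatment_dir: str) -> list[Path]:
    root = Path(treatment_dir)
    tiers = [
        root / "blueprints" / "current_phase.md",  # 1. approved language for this phase
        root / "triage" / f"{topic}.md",           # 2. e.g. bleeding.md, stomach_pain.md
        root / "knowledge_base.csv",               # 3. laundry list of FAQ-style answers
    ]
    return [p for p in tiers if p.exists()]
```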
Is that also defined in the overview right now or is that . 01:20:53.320 |
So the virtual OA and the evaluator both have relatively small prompts, which I can show. 01:20:59.320 |
Those prompts are, they do reference each other, right? 01:21:04.320 |
So imagine this being, you know, again, built on the LangChain stuff, so the base agent class. 01:21:08.320 |
This prompt is basically aware of the other agent, right? 01:21:15.320 |
The evaluator knows about the virtual OA and vice versa, right? 01:21:18.320 |
You know, the things that we try to do and, you know, this is, I think, normal prompt engineering stuff 01:21:22.320 |
where people have really played with this stuff. 01:21:25.320 |
You have to tell it that, you know, if it gets called on, it has to talk, right? 01:21:28.320 |
Like, you know, one problem we have that we have to sort of frequently do retries on is 01:21:32.320 |
the LLM thinks that everything's done, it doesn't say anything, and the whole thing dies. 01:21:36.320 |
So, you know, you have to talk, but then you can be done, right? 01:21:41.320 |
We have the basic idea of you have to determine this overall confidence score, 01:21:45.320 |
but we don't include this in the prompt because we want to be able to show it to the virtual OA as well, right? 01:21:50.320 |
So the details of how you score something is factored out, but the notion of here's who you are, 01:21:56.320 |
here's who this other guy is, and here's how you work together, that is in the prompts. 01:22:00.320 |
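In code terms, the arrangement could look roughly like this; the file path and prompt wording are placeholders, and the point is just that each prompt references the other agent while the scoring rubric lives in a shared document both of them read.

```python
# Placeholder sketch of the two prompts referencing each other while the
# scoring rubric is factored out into a shared document.
from pathlib import Path

CONFIDENCE_FRAMEWORK = Path("framework/confidence.md").read_text()  # hypothetical path

VIRTUAL_OA_PROMPT = f"""You are the virtual operations associate (OA).
You read the treatment blueprint and draft, schedule, and send patient messages.
An evaluator agent will review your work and score it against this framework:

{CONFIDENCE_FRAMEWORK}

When you are called on, you must respond before finishing the turn."""

EVALUATOR_PROMPT = f"""You are the evaluator. You review the virtual OA's unsent
messages and state changes, and score the overall interaction against this framework:

{CONFIDENCE_FRAMEWORK}

If the overall confidence is low, flag the interaction for human review."""
```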
Okay, actually, on that note, let me jump over and actually show you some of the confidence stuff and the guidelines. 01:22:05.320 |
So I'll start with confidence, I'll get into the guidelines, which are much, much longer. 01:22:09.320 |
This, again, is, you know, intended to be mostly LLM readable. 01:22:13.320 |
This is not something the client generally maintains, right? 01:22:15.320 |
So this is not in the same category as these blueprints. 01:22:18.320 |
But the idea here is that, you know, I've got this confidence score, and, you know, I am trying to figure out across these multiple dimensions. 01:22:24.320 |
Do I know what's going on? You know, here's some examples. 01:22:27.320 |
We are trying to be as prescriptive as possible with examples of these different situations. 01:22:31.320 |
Do I know, you know, the knowledge that I need to know? 01:22:34.320 |
Here's an example which might speak to your question actually over here. 01:22:36.320 |
You know, the idea of, do we want the LLM to use its world knowledge to figure out that when I'm talking about an antibiotic 01:22:43.320 |
and I give a specific antibiotic that it applies to the whole class of them? 01:22:46.320 |
Yes, that's a risk that we're kind of willing to take, right? 01:22:49.320 |
We don't need to have an explicit, this specific antibiotic is safe for this treatment, right? 01:22:53.320 |
That would very quickly spiral out of control. 01:22:55.320 |
So we do have a handful of places where we ask it, use your own judgment, but refer to, you know, the knowledge base 01:23:02.320 |
And then, so after I get through these categories, then I have this idea of deductions, right? 01:23:06.320 |
And the deductions here are specifically things like, you know, you should deduct from the overall score, not the individual messages, right? 01:23:14.320 |
Because an individual message could be like, yeah, this is exactly from the blueprint, like it's the right thing to say. 01:23:18.320 |
But overall, these situations can be complicated, right? 01:23:21.320 |
And so what we're trying to do is explain that such that it can, again, score the overall interaction in a way that surfaces it for human review. 01:23:29.320 |
Okay, so I'm going to move on to the guidelines because, again, there's just a lot more in here. 01:23:33.320 |
This is long enough that I'm not going to review everything, but I'll try to get to some of the biggest parts. 01:23:40.320 |
One issue we've certainly had over time is, you know, fabrication. 01:23:43.320 |
And, you know, honestly, instructions like this do help. 01:23:53.320 |
There is a lot of stuff around, hey, you need to ask about it in the right way. 01:23:57.320 |
You don't ask about it in a way that forces someone to tell you where they are, right? 01:24:00.320 |
There are people who are very sensitive about this. 01:24:02.320 |
They don't want, you know, people knowing where they physically are. 01:24:04.320 |
But you need to know what their time is so that you can schedule the messages for them, right? 01:24:08.320 |
You want, you know, when you work with time, you have to use things like, you know, ISO timestamps. 01:24:14.320 |
Like that's how the rest of the system works. 01:24:16.320 |
But calculating these things and keeping them all straight, it requires a relatively smart model. 01:24:19.320 |
So a lot of this, you know, has sort of grown over time to just work with the idea that, you know, 01:24:24.320 |
this is how you can talk to models about this and do a pretty good job. 01:24:30.320 |
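For the time handling specifically, a couple of helpers along these lines (the names and the 9 a.m. default are assumptions) capture the idea of keeping everything in ISO-8601 and storing the patient's offset in minutes:

```python
# Illustrative time helpers: ISO-8601 everywhere, patient offset stored in
# minutes (e.g. -420 for Pacific Daylight Time). The 9 a.m. default is invented.
from datetime import datetime, timedelta, timezone

def to_patient_local(utc_iso: str, offset_minutes: int) -> str:
    dt = datetime.fromisoformat(utc_iso)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone(timedelta(minutes=offset_minutes))).isoformat()

def next_local_send_time(offset_minutes: int, hour: int = 9) -> str:
    """Next occurrence of `hour`:00 in the patient's local time, returned as UTC ISO."""
    tz = timezone(timedelta(minutes=offset_minutes))
    now_local = datetime.now(tz)
    target = now_local.replace(hour=hour, minute=0, second=0, microsecond=0)
    if target <= now_local:
        target += timedelta(days=1)
    return target.astimezone(timezone.utc).isoformat()
```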
Again, these are all the things that are core parts of the system. 01:24:34.320 |
This is all written to be generic enough that I don't have to rewrite this every time I add a new drug, right? 01:24:39.320 |
Which is one of the core requirements, right? 01:24:46.320 |
And every time you're sending it as a block, right? 01:24:48.320 |
Have you experimented with the block caching, because this doesn't change? 01:24:54.320 |
And so right now, and I'll get to caching actually in just a second. 01:24:56.320 |
I'm just trying to manage time here, but I do have time for that. 01:24:59.320 |
Like, we are doing some explicit system prompt caching. 01:25:02.320 |
And then we are caching explicitly, you know, the multiple turns of the messages such that, you know, each operation is, you know, I think the average operation with just sort of the baseline stuff is maybe 10 to 15,000 tokens, right? 01:25:17.320 |
So, you know, you do have maybe the average cost to generate a single message somewhere in the 15 to 20 cent range, right? 01:25:22.320 |
It's not cheap, but we are caching as aggressively as we can. 01:25:28.320 |
The idea of the hour cache, you know, that Claude just introduced. 01:25:31.320 |
It's not clear to us that that would help because, you know, we can't guarantee that the patient is going to get back to us within, you know, either five minutes or an hour, right? 01:25:37.320 |
It's just, it's a bit of a risk to take at the system level. 01:25:40.320 |
But if anybody here actually knows more about Claude caching than I do, please talk to me. 01:25:44.320 |
Because, like, we ideally, we would like to cache a lot of these documents. 01:25:48.320 |
It's just not clear if we can do that across sessions. 01:25:50.320 |
It's not clear if, you know, we would get the benefits that we're looking for. 01:25:53.320 |
So we've just tried to be as aggressive as we can within a single conversation. 01:25:56.320 |
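For reference, explicit prompt caching with the Anthropic API looks roughly like this -- a minimal sketch, not our exact setup; the model name and the placeholder text are illustrative:

```python
import anthropic

client = anthropic.Anthropic()

GUIDELINES_AND_BLUEPRINT = "...the long, static system prompt, guidelines, and blueprint..."

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model name
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": GUIDELINES_AND_BLUEPRINT,
            # Marks this block as cacheable so later turns in the same
            # conversation reuse it at a reduced input-token cost.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {"role": "user", "content": "Patient: I took my first dose this morning."},
    ],
)
print(response.content[0].text)
```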
That's a very long list of guidelines, isn't it? 01:26:01.320 |
How did you come up with it, and how do you optimize that? 01:26:05.320 |
Yeah, I mean, look, the real answer is that I mentioned before that, you know, 01:26:09.320 |
there's a few thousand lines of code and a few thousand lines of prompt. 01:26:15.320 |
It is something where, you know, we have tuned it over time. 01:26:18.320 |
This is, you know, myself as well as, you know, the physician's assistant. 01:26:21.320 |
Like, we have come up with something that we believe is fundamentally, you know, pretty good at handling these, you know, generic situations. 01:26:27.320 |
And when we find edge cases, we just modify these prompts. 01:26:31.320 |
We could definitely think about subdividing this. 01:26:33.320 |
We could think about moving some of it into the prompts. 01:26:35.320 |
But we think this division is, you know, roughly correct for keeping it generic so that it's, you know, it handles a bunch of different treatments. 01:26:41.320 |
And it handles the situations that we see across treatments pretty well, right? 01:26:45.320 |
You will have cases where you're doing medicine that's all in one day. 01:26:48.320 |
That's relatively unusual because, you know, that's something you just send instructions home. 01:26:51.320 |
You know, but there, I mean, I'll show an example around Ozempic. 01:26:59.320 |
We've tried to get to a point of balancing it where, you know, we do end up with a good result. 01:27:08.320 |
Have you ever seen like a cloud is for a very specific, or sometimes there's one equal or basically, but it has a very long prompt. 01:27:16.320 |
Sometimes they're adding more, and then, you know, the . 01:27:23.320 |
So the question is just about the prompt length. 01:27:27.320 |
So again, not to dismiss that out of hand, in Claude terms, this isn't actually that long, right? 01:27:32.320 |
I mean, this is, like I said, I think on average, 15,000 tokens. 01:27:35.320 |
It still leaves a lot of the context window, you know, free, right? 01:27:38.320 |
It is not actually so long that we start seeing really crazy behaviors until we start doing multiple turns, like multiple conversations in one thread, right? 01:27:49.320 |
So we just genuinely have not gotten to a point where we're like, my guidelines are too long. 01:27:53.320 |
You know, the guidelines could be a little shorter. 01:27:55.320 |
And I think we have optimized them in various places over time. 01:27:58.320 |
But, you know, we've crammed this into a box where we really can sort of process one situation all in one gulp without really feeling any pain. 01:28:10.320 |
You mentioned tools that I'm not sure you've actually shown? 01:28:15.320 |
So let me go down here to the tools code itself. 01:28:20.320 |
And I think I mentioned this really earlier on about there is some stuff coming from an MCP gateway, right? 01:28:25.320 |
So in this case, you know, I'm loading from files. 01:28:27.320 |
It's just the file system MCP, again, all localized to our VPC. 01:28:32.320 |
But I could instead load from, you know, my database, right? 01:28:35.320 |
I could choose to have that be the place where you interact. 01:28:39.320 |
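A rough sketch of what that wiring can look like with the langchain-mcp-adapters package (an assumption on my part -- the exact client API has changed across versions, and the path and server config here are placeholders):

```python
import asyncio
from langchain_mcp_adapters.client import MultiServerMCPClient

async def load_blueprint_tools():
    client = MultiServerMCPClient(
        {
            "filesystem": {
                "command": "npx",
                "args": ["-y", "@modelcontextprotocol/server-filesystem", "/srv/blueprints"],
                "transport": "stdio",
            }
        }
    )
    # The MCP server's read/list operations come back as ordinary LangChain
    # tools; point the config at a database-backed MCP server to swap sources.
    return await client.get_tools()

tools = asyncio.run(load_blueprint_tools())
```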
And so, you know, this is actually where probably most of the code in my app is. 01:28:48.320 |
And you can see it's things that enable interacting with the state, right? 01:28:51.320 |
So all of these functions that are looking at anchors and messages and confidence and the treatment and patient data, all of that stuff is local to my graph run. 01:29:03.320 |
And, you know, in this case, like this code just lives in this Python app. 01:29:08.320 |
Like the state can still live here and the code could be somewhere else. 01:29:13.320 |
But one note actually about all of this stuff is, and I'm trying to find a good example here. 01:29:17.320 |
So I am aggressively using the command object. 01:29:19.320 |
For anybody who has programmed with LangGraph, the whole idea behind this is that at any given point, you're able to pass back a message. 01:29:26.320 |
And this particular thing is just an error message. 01:29:28.320 |
But like, you know, you're able to pass back a message and a place to go, right? 01:29:34.320 |
And by the way, I know that the evaluator asked for this. 01:29:37.320 |
You can actually get around some of the graph routing this way. 01:29:39.320 |
And so we've definitely, you know, tried to work this into both the MCP tools and the state tools that we have. 01:29:56.320 |
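Here is a minimal sketch of that Command pattern (not the production code; the state fields and node names like virtual_oa are hypothetical) -- a node or tool hands back both a state update and where to go next:

```python
from langgraph.types import Command
from langchain_core.messages import AIMessage

def check_anchor(state: dict) -> Command:
    """Validate a piece of state; on failure, pass back an error message and re-route."""
    if state.get("anchor") is None:
        return Command(
            update={"messages": [AIMessage(content="Error: no anchor set for this patient.")]},
            goto="virtual_oa",  # hypothetical node name: send it back to the agent
        )
    return Command(update={"anchor_checked": True}, goto="evaluator")
```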
So the question is about how do we improve the prompt. 01:30:01.320 |
We were able to go from essentially the physician's assistant, who had the most experience with the tricky things, right? 01:30:07.320 |
With coming up with tricky situations, right? 01:30:09.320 |
So we were able to test a lot of the edges really just with her, you know, having her pretend to be the patient. 01:30:13.320 |
We then scaled up to the full team of operations associates who, you know, then tried to trick it at a higher level, right? 01:30:18.320 |
And, you know, we're putting out things that they'd seen from patients themselves, you know, trying to sort of, you know, get to these complicated cases. 01:30:25.320 |
And then, you know, with real people, you know, we're able to take that a step further. 01:30:28.320 |
But with those first two levels, we're not seeing, you know, a tremendous amount of stuff that's not expected. 01:30:35.320 |
If we'd been doing this from absolute scratch, I think we would have a lot tougher of a time coming up with what we think the edges are. 01:30:41.320 |
Whereas, you know, there's a system that already exists. 01:30:46.320 |
So we'll look at conversations in the old system and replay them here and essentially just try to figure out, you know, where the edges are. 01:31:02.320 |
Are there questions about how to decide what goes in the state or not? 01:31:07.320 |
I guess the short answer is everything that comes in in that initial payload, which I'll go back over here for, all of this stuff... 01:31:14.320 |
I think this one's probably the better example. 01:31:17.320 |
So all of this stuff over here, this is all state. 01:31:23.320 |
Like, everything that comes in to sort of preload the conversation is state. 01:31:30.320 |
Like, here examples are, you can't change the source. 01:31:33.320 |
I couldn't say, you know, in the context of the LLM operation, this is not an available patient anymore. 01:31:41.320 |
So the LLM is not allowed to look at the message queue and say message five that went out three days ago no longer exists. 01:31:48.320 |
So we've sort of just calibrated it to where the only things it can do is read the entire state, modify the patient data, modify messages, modify anchors. 01:31:58.320 |
So it's, you know, it's just a software choice. 01:32:01.320 |
So, I guess if you have a session, do you want to kind of store the messages that went back and forth within the session? 01:32:12.320 |
There's nothing inside, but if the context window gets too large and you kind of want to summarize it, in that case, would you want to be able to delete some messages or summarize it? 01:32:23.320 |
I mean, I think the real answer, though, is that hasn't happened for us. 01:32:29.320 |
Like, the way that we've structured it, there is not a single thinking operation that goes long enough to blow things up. 01:32:35.320 |
But I will talk really quickly about retries. 01:32:39.320 |
So we do have the notion, and I'll just pop this up. 01:32:42.320 |
So we do have the notion inside the graph of, you know, again, the virtual OA talks to the evaluator, you know, at a certain point in the thing. 01:32:51.320 |
But then there are cases that will cause the graph to retry a certain operation, right? 01:32:56.320 |
So one of them is one of the agents malforms a tool call, right? 01:33:00.320 |
It tries to call a tool, but it uses the wrong JSON, and, you know, otherwise things would have died. 01:33:05.320 |
We delete the message, and we say, "You screwed up that tool call. Try again." 01:33:10.320 |
That, you know, keeps it inside the graph, right? 01:33:12.320 |
Essentially, this retry node is then able to loop back via that command message. 01:33:16.320 |
It's not shown here on the graph, but via that command object. 01:33:19.320 |
It can say, "Go back to the virtual OA. Try that again." 01:33:23.320 |
We also have some cases where, again, you know, we expect the model to talk, and it doesn't. 01:33:28.320 |
There are just a bunch of cases where Claude will just end its turn prematurely. 01:33:36.320 |
Even if you're going to end your turn, just say, "I'm done." 01:33:39.320 |
That's enough, you know, for us to keep the logic going. 01:33:42.320 |
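A sketch of how a retry node like that can work in LangGraph (my reconstruction, not the code from the talk -- the retry cap, node names, and the use of RemoveMessage with the add_messages reducer are assumptions):

```python
from langgraph.types import Command
from langchain_core.messages import RemoveMessage, HumanMessage

MAX_RETRIES = 3  # assumed cap so a persistently broken call escalates instead of looping

def retry_node(state: dict) -> Command:
    bad_msg = state["messages"][-1]  # the malformed tool call or empty turn
    if state.get("retries", 0) >= MAX_RETRIES:
        return Command(update={"needs_human": True}, goto="human_review")
    return Command(
        update={
            "retries": state.get("retries", 0) + 1,
            "messages": [
                RemoveMessage(id=bad_msg.id),  # wipe its memory of the mistake
                HumanMessage(content="You screwed up that tool call. Try again."),
            ],
        },
        goto="virtual_oa",  # loop back to the agent node via the Command object
    )
```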
Will low confidence coming from the evaluator also trigger a retry? 01:33:47.320 |
So low confidence is something we want to pass through to the human. 01:33:54.320 |
So low confidence is a valid response to the graph. 01:33:57.320 |
You know, like, I have a low confidence message. 01:34:13.320 |
Do you have any automated regression against that in order to change it or add to it? 01:34:18.320 |
Because you've obviously tested it thoroughly by human. 01:34:25.320 |
Let me talk about evals a little bit just to sort of give you guys a sense of what we did 01:34:29.320 |
So this was a weird one because if you go to LangSmith, and I'll just, I'm not sure I can 01:34:35.320 |
find the exact place where this happens, but there's, is it under, maybe it's under datasets. 01:34:41.320 |
There is a place in here where you can basically say, I want to run, you know, an evaluation 01:34:45.320 |
against, you know, these, you know, this dataset that I've defined in LangSmith. 01:34:50.320 |
So I can see here, assuming this loads up, which hopefully it will. 01:34:53.320 |
So I've got a happy path dataset here that I defined in LangSmith. 01:35:00.320 |
But if I look at these things, what I'm really doing is I'm saying, okay, this is an interaction 01:35:07.320 |
You know, here's, you know, the, the input, right? 01:35:12.320 |
This conversation, you know, A, I've got this initial state here that I can look at, right? 01:35:15.320 |
So I can see, you know, what was going on in the first place. 01:35:18.320 |
I see this conversation and, you know, I get to the end and I get a message out, which 01:35:25.320 |
So this is one part of my happy path dataset. 01:35:28.320 |
The evals that I'm running are fundamentally, it's a, it's a custom harness. 01:35:35.320 |
But again, the idea here was that we couldn't just say, you know, when I ask for, you know, 01:35:42.320 |
what the weather is in San Francisco, it gives me back, you know, cold and, you know, foggy. 01:35:47.320 |
It had to be, here's examples of these input states and then, you know, let's evaluate it, 01:35:53.320 |
you know, sort of an LLM as a judge form, what the output state looks like with the caveat 01:35:56.320 |
that one of the big things we wanted to test was things like time operations. 01:35:59.320 |
So I can't put in an eval from three weeks ago and run it now and get equivalent times. 01:36:04.320 |
I have to either give it really specific guidance on how to handle time, or have to replace all the dates and times. 01:36:11.320 |
And so what this did was, my eval suite is a custom Python app stapled to a bash script. 01:36:19.320 |
I got what I wanted, but I can't vouch for much more than that. 01:36:25.320 |
I call essentially the medical agent via, in this case, like I'm running this locally. 01:36:30.320 |
Like I could run this against my cloud LangGraph instance, but I'm literally running this 01:36:33.320 |
against LangGraph on my laptop using that data set from LangSmith, having pre-processed a bunch of stuff 01:36:39.320 |
around dates, times, you know, sort of circumstances. 01:36:42.320 |
So that when I get back the result, I can not confuse the LLM as a judge about whether it's right or not. 01:36:47.320 |
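The shape of that harness, roughly (not the real code -- the dataset name, the preprocessing, and the graph import are placeholders): pull examples from the LangSmith dataset, re-anchor the dates so the judge isn't confused, and run each initial state through the locally running graph.

```python
from datetime import datetime, timezone
from langsmith import Client

from my_agent import graph  # hypothetical import of the compiled LangGraph app

ls_client = Client()

def shift_times(state: dict) -> dict:
    # Placeholder for the real preprocessing: re-anchor stored dates/times so
    # the scenario reads as if it started now.
    state["now"] = datetime.now(timezone.utc).isoformat()
    return state

results = []
for example in ls_client.list_examples(dataset_name="happy-path"):  # illustrative name
    initial_state = shift_times(dict(example.inputs))
    final_state = graph.invoke(initial_state)
    results.append((example.id, final_state))
```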
And I'm using this LLM rubric, right, to do this. 01:36:50.320 |
And so let me see if I can find my rubric here. 01:36:56.320 |
So this is just, this is Promptfoo and how it works. 01:36:58.320 |
What I do is I basically say, look, I'm trying to test, you know, this custom thing that I'm going to call, 01:37:04.320 |
you know, essentially with my, you know, my custom harness. 01:37:08.320 |
And then I want you to evaluate it, you know, with an LLM. 01:37:12.320 |
You know, this evaluation rubric is basically, is everything basically exactly the same? 01:37:18.320 |
Do I see minor discrepancies, but I don't think they're a big deal? 01:37:22.320 |
What we really want to know is, is anything completely busted, right? 01:37:26.320 |
I had to put in specific notes here, like, hey, don't be picky about like the, you know, 01:37:30.320 |
different wording that you might see in something like an anchor, right? 01:37:32.320 |
Sometimes it'll say they will take the medicines. 01:37:34.320 |
It says they did take the medicines, like who cares? 01:37:36.320 |
Like in this case, the spirit of it was right. 01:37:38.320 |
And so, you know, and the times and dates, like there's some specific language here. 01:37:42.320 |
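The actual rubric lives in the Promptfoo config; this is just a rough Python stand-in for the same LLM-as-a-judge idea, with GPT-4o as the grader and the rubric wording paraphrased:

```python
from openai import OpenAI

judge = OpenAI()

RUBRIC = """Compare the expected output state with the actual output state.
PASS if they are essentially the same, or if there are only minor
discrepancies (e.g. an anchor that says 'will take the medicines' vs
'did take the medicines'). FAIL only if something is completely busted.
Do not be picky about small differences in times and dates caused by
re-running the scenario at a different moment."""

def grade(expected: str, actual: str) -> str:
    resp = judge.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"EXPECTED:\n{expected}\n\nACTUAL:\n{actual}"},
        ],
    )
    return resp.choices[0].message.content  # e.g. 'PASS: minor wording differences'
```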
So all of this turns into basically this guy right here. 01:37:48.320 |
So when I look at this, and again, I realize there's a lot of text here. 01:37:54.320 |
I ran these three examples from my data set, and I got, you know, basically a passing grade, right? 01:37:58.320 |
I'll go into the fail, one fail, one pass in a second. 01:38:01.320 |
But in each of these cases, right, I can see what the LLM as a judge said, right? 01:38:05.320 |
So if I go down here, this is GPT-4o going, you know, opining on, 01:38:10.320 |
well, this is the source data you gave me and the sample output. 01:38:16.320 |
You know, it does point out some things, right? 01:38:18.320 |
So if I ran these evals and I was like, well, it's actually a big problem that, you know, 01:38:21.320 |
the reminder, you know, unsent message didn't show up, right? 01:38:27.320 |
But the way that we did this was just to say, look, at any given point, 01:38:29.320 |
we do need to be able to test the current state of the system. 01:38:32.320 |
We want to do it against, you know, first the happy path, 01:38:34.320 |
and then we can certainly do it against edge cases. 01:38:38.320 |
I think this might change once we actually hand this over to the client, you know. 01:38:41.320 |
We want them to have all the protection they might need. 01:38:46.320 |
Does that answer your question, more or less? 01:38:51.320 |
Or did you check to see if it caught specific cases, you know, like I know this is bad in a certain way 01:39:01.320 |
Some of that I'm redacting for a couple reasons. 01:39:03.320 |
But, yeah, more or less, we started with a happy path. 01:39:06.320 |
We do have a handful of specific, like, hey, this is busted and it's frequently busted. 01:39:09.320 |
Let's make sure it's not, you know, but it's just a different data set. 01:39:16.320 |
Well, so, again, the part that I'm showing here is entirely Python in a LangGraph container. 01:39:32.320 |
I would guess it's, you know, maybe 4,000 lines of code. 01:39:37.320 |
Like, that just -- and I'm sure I could refactor that to be shorter also. 01:39:41.320 |
It's really just enough to run essentially this graph, right? 01:39:44.320 |
You know, I have to have the routing between all of it. 01:39:48.320 |
Everything else is in the prompts and the guidelines, right? 01:39:51.320 |
And so, you know, it really is more English than it is code. 01:39:55.320 |
On the other side, right, on the other side of the box, right, this thing, that blue box is -- 01:39:59.320 |
I don't know exactly how much code, but it's entirely Node and React and Mongo. 01:40:04.320 |
And frankly, I haven't been that involved in it. 01:40:09.320 |
I also work in a highly-regulated industry, so this is a lot of . 01:40:15.320 |
How do you think about taking this, like, this is one medical procedure -- 01:40:23.320 |
How do you think about taking that and scaling it across an industry or another client 01:40:30.320 |
So let me actually jump over and show you one thing I should have probably already shown, 01:40:34.320 |
I mentioned before the idea of us doing different treatments. 01:40:37.320 |
So what I did in this particular case -- so, you know, again, we were focused on the early 01:40:45.320 |
In fact, I think I have this sitting here somewhere. 01:40:48.320 |
So let me zoom out and find it, and then I'll zoom back in. 01:40:52.320 |
And I basically said to Cline, hey, I've got -- yeah, this should be it right here. 01:41:05.320 |
Here's the link to Novo Nordisk's suggestions on how to dose Ozempic. 01:41:11.320 |
And it basically went through this process -- and I'll shrink this down -- and created, 01:41:17.320 |
you know, a basic treatment, you know, for Ozempic. 01:41:25.320 |
And here's a couple of tweaks based on the thing that you said. 01:41:30.320 |
And I ended up with what I think is a pretty serviceable Ozempic treatment, 01:41:35.320 |
If I go back to these conversations, I'm pretty sure I named these people all O. 01:41:45.320 |
You could, obviously, get to the point of having it be a different personality. 01:41:48.320 |
But in this case, this is Ava with a new treatment, asking all about Ozempic pens, 01:41:53.320 |
and helping me figure out their time, same as we were before. 01:41:59.320 |
I literally threw this through Cline, got a new set of treatments out, 01:42:04.320 |
Yeah, on your graph, what's with the retry node? 01:42:12.320 |
The question was about the retry node. It's for catching those errors around malformed tool calls, 01:42:16.320 |
which is one easy way for the graph to terminate, right? 01:42:18.320 |
So if it forgets a bracket and sends back something that's invalid JSON. 01:42:24.320 |
Like, Sonnet 4 is pretty good at it, but Sonnet 3.5 was not nearly as good. 01:42:30.320 |
So we can detect: well, you're trying to make a tool call, because I see certain things in here. 01:42:32.320 |
I either see tool call ID as a parameter, I see weird brackets. 01:42:37.320 |
And then, again, we wipe that message out, and we go back to the guy that called it 01:42:44.320 |
And so that idea of, you know, the retry node, there's a handful of those situations 01:42:48.320 |
where we aren't going to take over and do it in some deterministic way. 01:42:52.320 |
We're just telling the LLM, you made a mistake. 01:42:56.320 |
And we do have to wipe out its memory of that mistake because it can get very confused. 01:43:00.320 |
One of the ways this happened a lot earlier in our testing was it would malform tool calls 01:43:07.320 |
And so if you left the message in there, it would think that it understood the blueprint 01:43:10.320 |
even though it had made the entire blueprint up. 01:43:13.320 |
So I don't want to sugarcoat where, like, there are some weird cases that if you don't 01:43:15.320 |
very carefully control for it, you can end up with some very bad behaviors. 01:43:18.320 |
But we were able to catch the main ones and essentially just give it another shot. 01:43:25.320 |
The reason there's no loop, this is just totally -- this is actually one thing where I think 01:43:28.320 |
LangSmith and LangGraph could be better at this. 01:43:30.320 |
The retry node is capable of calling back to the other ones using the command object. 01:43:35.320 |
So it turns out, like, they were very invested, LangSmith was, or LangGraph, in graph flows. 01:43:40.320 |
And then they introduced this idea of, well, you don't really have to define it in the graph. 01:44:01.320 |
The question was about confidence scoring and how do we get it to sort of to be higher. 01:44:08.320 |
We want the score to be lower when there is a complicated situation. 01:44:13.320 |
Not because we think it's wrong, but because we want a human to review it. 01:44:23.320 |
So if the confidence score is above the threshold, meaning higher than it, right? 01:44:23.320 |
So let's say I have a confidence score of .9. 01:44:24.320 |
What that could mean, right, is that I am sending multiple messages at a time, but there's no other 01:44:29.320 |
reason for me to think that those messages are wrong. 01:44:31.320 |
We have chosen, with the client, to set the threshold lower than that because they don't want their 01:44:36.320 |
humans having to get involved every single time something is a little complicated. 01:44:39.320 |
But we've agreed that if it's below .75, they should. 01:44:42.320 |
You decide: if you want your humans involved in everything, set the threshold higher. 01:44:45.320 |
You could see every single thing that happens. 01:45:06.320 |
But that's hopefully not what people actually want, right? 01:45:09.320 |
You want a bunch of these things if they are high confidence and, you know, fundamentally recoverable, 01:45:15.320 |
Like, you know, there might be certain circumstances where you don't want a message automatically sent 01:45:19.320 |
Hopefully we can control for that, but in general, like, we wanted to set it in a place that felt 01:45:23.320 |
like we were going to get some scale out of our humans, right? 01:45:25.320 |
Which meant messages going out automatically. 01:45:31.320 |
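In graph terms, that policy is just a conditional edge on the evaluator's score -- a sketch, with the node names made up; only the 0.75 threshold comes from the talk:

```python
CONFIDENCE_THRESHOLD = 0.75  # agreed with the client; push it toward 1.0 if you
                             # want humans reviewing nearly everything

def route_on_confidence(state: dict) -> str:
    if state["confidence"] < CONFIDENCE_THRESHOLD:
        return "human_review"    # complicated or uncertain: a person looks first
    return "send_messages"       # high confidence and recoverable: goes out automatically

# Wired up with something like:
# builder.add_conditional_edges("evaluator", route_on_confidence)
```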
So it seems like you're kind of merging confidence and its complexity into one metric. 01:45:39.320 |
Do you find that that kind of has any side effects instead of trying to evaluate those two things separately and then making the determination as to whether to get a human involved? 01:45:55.320 |
So the question is about conflation of confidence scoring and complexity. 01:46:00.320 |
And again, it's one of the things where I think we're happy with how it sort of operates now, 01:46:04.320 |
but I'm not sure that we're doing the confidence piece right. 01:46:08.320 |
And at the same time, like, the concept of it, I think, is good, right? 01:46:11.320 |
You know, do you have the information you need? 01:46:13.320 |
You know, is there anything like, do you understand the user's intent or, you know, did they say something ambiguous? 01:46:18.320 |
And we do see cases where it triggers, right? 01:46:20.320 |
It's not that we never see it, you know, correctly rate itself as being like, well, I'm not totally sure what they said. 01:46:26.320 |
So we combine it with complexity in part because, you know, we just want to get it below that threshold so that we can have a human review it. 01:46:32.320 |
It's not a perfect system, but it is a system that we think works. 01:46:35.320 |
So it's more of a confidence that this falls within the happy path where we find polluting scenarios. 01:46:46.320 |
So to that point, it's not confidence that the specific response is exact and perfect and whatever it is. 01:46:53.320 |
It's confidence that we don't think that there's a blend of uncertainty on, you know, what the response is and complexity of the situation, right? 01:46:59.320 |
Either of those things can push it below the threshold. 01:47:07.320 |
So this is all -- the question was about hosting. 01:47:10.320 |
So this is all -- LangGraph has a pre-built containerization that you can use. 01:47:15.320 |
We are using it with a couple of modifications. 01:47:17.320 |
We are essentially deploying both halves of this, right? 01:47:21.320 |
So to go back to this, we're deploying both halves of this with Terraform. 01:47:24.320 |
You know, everything's hooked up with GitHub Actions. 01:47:26.320 |
I mean, like, you know, we're doing as much of it as automatically as we can, but we are using most of the built-in LangGraph containerization. 01:47:34.320 |
No, I mean, we have a conversation with LangChain actively about that. 01:47:43.320 |
So the question actually is also the specific nature of how we need to be deployed. 01:47:49.320 |
LangGraph Platform doesn't currently support that, right? 01:47:51.320 |
So, like, there -- it's -- it's all -- it's all evolving. 01:47:58.320 |
And so -- or roughly 10 minutes -- nine minutes. 01:48:00.320 |
Let me just double-check my own list of things that I wanted to talk about just to make sure I didn't miss anything major. 01:48:05.320 |
Talked about that, talked about that, dun-dun-dun-dun-dun-dun-dun. 01:48:15.320 |
I have a couple of final thoughts that I'll leave sitting up here if anyone's curious. 01:48:24.320 |
So the question was just about resource allocation to build these things. 01:48:42.320 |
So, I mean, I think the best way to say it is just that we had built a handful of things like this. 01:48:50.320 |
But we were able to bring some things in in terms of, you know, I had an open-source LangGraph project that I was comfortable using as a base for this client code, right? 01:48:59.320 |
You know, but it got us, you know, part of the way there in part because this isn't a totally special snowflake. 01:49:06.320 |
So, you know, we have, I think, spent less time on that upfront scaffolding each time we've done it. 01:49:12.320 |
And, you know, we're looking at another project right now, which would be that much quicker. 01:49:15.320 |
So a lot of it is just once you do this and you understand the mechanism, getting to, you know, good enough or getting to a starting point is just a lot quicker. 01:49:22.320 |
It is something, though, we're like, you know, again, we are still working on this and we've been at it for a few months. 01:49:28.320 |
But I think we probably built what would have been a year's worth of conventional software, maybe more, right, you know, in that short period of time. 01:49:37.320 |
So you mentioned you did a lot of Vibe coding on this one. 01:49:42.320 |
I'm curious how much of the specific frameworks the tools, like, know. 01:49:48.320 |
So the question was just about Vibe coding and how much the tools know of these frameworks. 01:49:55.320 |
So the single biggest issue with Vibe coding, right, is when you're dealing with something new enough that the models don't really understand it. 01:50:00.320 |
You can always just send them the API docs and they usually do pretty well. 01:50:05.320 |
I ran into plenty of places with LangGraph where nobody had tried to do this before. 01:50:09.320 |
No one had this exact bug and we just had to actively debug it back and forth. 01:50:13.320 |
I can definitely endorse O3 as a better debugger than Claude. 01:50:16.320 |
So, you know, we did have cases where we had to do that. 01:50:19.320 |
But for the most part, like, I mean, as long as you can point it at the docs, you know, you can get to a reasonable place. 01:50:27.320 |
I may eventually call myself a real engineer, but I kind of wouldn't right now. 01:50:30.320 |
Like, I don't have all the same practices and sort of all the same hygiene, you know, that our professional engineers do. 01:50:35.320 |
But I do know how to smell kind of bad behavior. 01:50:38.320 |
And I'm pretty good at prompting to get what I want. 01:50:54.320 |
So if you look at the connection, there's actually two and one of them I just didn't totally label. 01:50:58.320 |
The blueprint knowledge base at the top, that yellow box, that is an MCP connection in the LangGraph context. 01:51:04.320 |
And then vertically, there's another set of connections back to the database. 01:51:10.320 |
But again, it's mostly for the interchange of state. 01:51:13.320 |
And it's for this, you know, reading of documents. 01:51:17.320 |
I had a question about, like, the type of people that like to send a whole sentence, but, like, across five different messages. 01:51:31.320 |
So the question is about rapid-fire text messages. 01:51:42.320 |
But we have the idea of a configurable delay before we actually send it for processing. 01:51:45.320 |
So if someone is going to send five text messages in a row, we wait five seconds before doing anything. 01:51:52.320 |
But the other way is we will invalidate previous running threads if someone texts in afterwards. 01:52:00.320 |
We want to take whatever the complete context of the conversation was, send all of that in, 01:52:05.320 |
and then we respond to five messages at once. 01:52:08.320 |
So it's a combination of smart retries and smart invalidation. 01:52:13.320 |
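One way to sketch that debounce-and-invalidate behavior (the five-second window is from the talk; everything else here -- names, the in-memory dict, the helper callables -- is made up for illustration):

```python
import asyncio

DEBOUNCE_SECONDS = 5
pending: dict[str, asyncio.Task] = {}  # patient_id -> the currently scheduled run

async def on_inbound_text(patient_id: str, fetch_thread, process_thread):
    # A newer text invalidates any run scheduled by an earlier message in the burst.
    if patient_id in pending:
        pending[patient_id].cancel()

    async def run_later():
        await asyncio.sleep(DEBOUNCE_SECONDS)       # wait out the rapid-fire burst
        thread = await fetch_thread(patient_id)      # the complete conversation context
        await process_thread(patient_id, thread)     # one graph run answers all of it at once

    pending[patient_id] = asyncio.create_task(run_later())
```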
So in your case, if a user decides to act rogue, basically... 01:52:26.320 |
...because you are not using a RAG, the system is not getting any... 01:52:30.320 |
Well, this question was about rogue responses. 01:52:34.320 |
Well, in general, we are giving some pretty basic guidance around, you're only here to answer treatment-related questions. 01:52:43.320 |
If someone wants to talk to you about the weather, you just say, I'm sorry, I can't help with that. 01:52:48.320 |
So, generally speaking, that works pretty well. 01:52:51.320 |
We have it pretty well-tuned to escalate, right? 01:52:53.320 |
If someone just is essentially going off the rails and you can't think of a response, you just say, I'm sorry, I'm gonna get somebody to help you. 01:52:58.320 |
And it sets the confidence low, and then a human can get involved. 01:53:03.320 |
I mean, again, if you're involved in this, like, you know, if you're involved in this and you take the time to actually engage with the system, 01:53:08.320 |
you want the result, and you probably want to get back to your life. 01:53:11.320 |
So we don't see a tremendous amount of it, but that's the way that we would deal with it. 01:53:16.320 |
At any point in your development process, did you get frustrated enough with LangGraph where you thought that you'd certainly do it differently? 01:53:24.320 |
Yeah, did I get frustrated enough with LangGraph at any point? 01:53:27.320 |
Honestly, so the one thing I'll say, and I mean, this is just totally a personal project. 01:53:32.320 |
I was doing a side thing around college counseling, and I just have this sitting here because I show it off sometimes. 01:53:39.320 |
And the point was, I went into this project explicitly trying to avoid it. 01:53:44.320 |
I said, I'm going to do a very similar thing where I have an AI in the middle of a workflow, and I want it to ask questions, and I want it to think. 01:53:49.320 |
And I do not want to use LangGraph or Crew or any of these other frameworks because I don't want to be dependent on them. 01:53:53.320 |
And so I asked, you know, I asked Cline to write me a layer that was pretty good at, you know, talking to these models and structuring a thinking process, and it did fine. 01:54:03.320 |
I mean, the reason to have the framework is in part because, you know, again, we won't be with this client forever. 01:54:08.320 |
We want them to have something they can operate. 01:54:10.320 |
We want them to have something explainable and easy to use. 01:54:12.320 |
Like, this is all, you know, logs in Google Cloud. 01:54:20.320 |
Like, you get a better experience with the other ones. 01:54:25.320 |
Thank you so much for asking me that question. 01:54:32.320 |
I will admit that Cursor and Windsurf are great, by the way. 01:54:35.320 |
Like, I mean, there's a handful of them that I really do like. 01:54:37.320 |
Cursor and Windsurf, as two examples, are just trying to hit this sort of narrow, you know, thing about it has to cost $20 a month, and therefore it has to be heavily optimized about how it sends tokens in different places. 01:54:55.320 |
But, like, it's just giving tools to a smart model. 01:54:57.320 |
And that smart model can cost whatever it costs. 01:54:59.320 |
Cline is the spiritual cousin to Claude Code and Codex, right? 01:55:02.320 |
I mean, it's that style of thing, as opposed to an IDE that has to hit this very narrow target and has to do a lot of pre-optimization on how the tokens flow. 01:55:09.320 |
How do you balance the tests that you've done before you've deployed versus, like, relying on evals and production data? 01:55:20.320 |
So, I mean, honestly, this is also another place where I will raise my hand and say this is why I'm not a real software engineer. 01:55:26.320 |
I don't have a robust testing framework on my Python code. 01:55:31.320 |
Or at the very least, like, I'm using evals and sort of overall performance of the system as the better benchmark, right? 01:55:38.320 |
You know, the other side of the code, like, you know, Stride is a TDD shop. 01:55:41.320 |
Like, our software engineers are very, very good at test-driven development. 01:55:48.320 |
But just my code, you know, very much is eval sort of tested instead. 01:55:52.320 |
It's not a great answer, but, like, that is kind of how I thought about it.