Good to be here. Good to see a fair number of people. It's early, so I wasn't sure if anyone would come, but thank you for coming. One quick note, I just put my slides on Twitter. I wasn't sure of the best way to reach everyone. I'm @rlancemartin. If there's another way I can get everyone the slides, then, yeah, I see some people opening them.
So the slides will link to a few different -- it will link to a colab and other notebooks I'm providing, so all the code will be available for you. Okay, good. I see people finding it. That's fantastic. If there's a better way, let me know, but I figured this is somewhat easy.
Well, it was great to be here. We have a bit of time. So I think the format is I'll lay out some slides to set the preliminaries and give the big picture, and then there will be a bunch of time where I can just walk around, talk to people, because we have, I guess, three hours.
So I think the idea of this will be a hands-on workshop. I provide a bunch of starter code, and also one of my slides shows it will be a choose-your-own-adventure format. So why don't I kick it off? I think maybe one or two people are still coming in, but -- so the theme here is building and testing reliable agents.
And let me go to slideshow mode here. And maybe I'll just kind of start with, like, the very basics. You know, LLM applications follow a general control flow of some sort. You start -- usually it's a user input. There's some set of steps, and then you end. And you've heard a lot about chains.
You know, when you build applications, oftentimes we talk about this idea of chains. And chain is just basically, you know, it is some control flow set by the developer. Again, you start, proceed through some steps, and you end. So retrieval augmented generation is a super popular application many -- some of you may be familiar with -- basically refers to retrieving documents from an index and passing them to an LLM.
This is a good example of a chain. It's a control flow set by the developer. The question is, you know, provided, the vector store retrieves documents, they're passed to an LLM, and the LLM produces an answer. So this is kind of a classic chain. Now, when you get into agents, there's a lot of different confusing interpretations.
What is an agent? Here's the way I might think about it, which is just a really simple kind of framing: an agent is just a control flow set by an LLM. And so you can imagine we talked about this process of you start your app, step one, step two.
In this case, I have an LLM in there. The LLM looks at the output of step one and makes a decision: do I go back or do I proceed? So that's like the simple way to think about an agent. So again, chains: developer-defined control flow. I set it ahead of time.
I follow some set of steps every time. An agent, an LLM kind of determines the control flow. An LLM makes a decision about where to go inside my application. That's one simple way to think about it. Now you hear about function calling a lot. And this is kind of a confusing topic, so I want to talk through it kind of carefully.
Agents typically use function calling to kind of determine what step to go to. So usually the way this works is what you do is you basically give the LLM awareness of some number of tools or steps that it can take. So in my little example here, I define this tool and this little decorator is a LangChain thing.
But the point is, I have some step, it's some function. I'm defining it as a tool and I'm binding it to the LLM. So then LLM has awareness of this tool. And here's the key point. When it sees an input, like what is the output of step two, it actually produces the payload needed to run that tool.
Now, this is often confusing. Remember LLMs are just string to string. They don't have the ability to magically call some function. What they can do is produce the payload or arguments needed to run that function and the function name. So really think about tool calling or function calling as just an LLM producing a structured output.
Still, you know, obviously a string, but it's a structured output that then can be used to call a tool. So that's all the function calling is. And you might have heard of React agents. So the way to think about this is, it's just basically binding some set of tools to my LLM.
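Here's a minimal sketch of that tool-binding idea, assuming a LangChain chat model; the tool name, its body, and the example payload are illustrative, not the exact code from the workshop notebooks:

```python
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

@tool
def step_two(x: int) -> int:
    """Run step two of the pipeline on the input x."""
    return x + 1

llm = ChatOpenAI(model="gpt-4o")
llm_with_tools = llm.bind_tools([step_two])

# The LLM does not execute the function; it returns a structured payload
# (the tool name plus the arguments needed to call it).
msg = llm_with_tools.invoke("What is the output of step two on 3?")
print(msg.tool_calls)  # e.g. [{'name': 'step_two', 'args': {'x': 3}, ...}]
```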
And again, we talked about tool calling. So LLM makes decisions about what step or what tool to use. And you have some node that will call that tool. So LLM says, okay, run step one. I have some node that runs step one and passes the output of step one back to my agent.
React basically stands for reason plus act. So the LLM reasons and chooses the action. A tool is run. It observes the output. That's what goes back to the agent. It observes that tool response, thinks about what to do next, maybe runs another tool. And this runs in a loop until you end.
And usually the end condition is the LLM just outputs a string response, not a tool call. So this is the way to think about a classic React agent. And it's really flexible. That's the nice thing about a React agent. So basically, it can implement many different control flows. It can do step one only.
Step two, one, two, two, one. That's the beauty of these open-ended style React agents. And these have a lot of promise. These kind of flexible tool calling agents were really hyped last year. They're still really hyped. It's really exciting because they're flexible. They're open-ended. You can give them a task.
Give them some tools. They can just execute arbitrary control flows given those tools to solve open-ended problems. The catch, and this is kind of the crux of what we're getting to with this workshop, is they do have poor reliability, or they can. You've probably seen this if you've played with agents.
As you can see, they kind of get caught on one step and they keep calling the same tool. And, you know, really this is often caused by LLM non-determinism. LLMs are non-deterministic. And also errors in tool calling. So tool calling is kind of tricky. If you think about it, the LLM has to basically pick the right tool given the input and has to pick the right payload.
So it has to produce the right inputs needed to run the tool. And these both can break. So here's a good example. The tool I'm passing is step two, and the LLM is saying the tool name to run is step three. So that's obviously wrong. Or I'm asking what is step two of the input three,
and the LLM says, okay, pass four. So both of these errors can happen. Tool calling is a tricky thing. And it's exacerbated: if you pass an LLM five tools, ten tools, it actually gets worse. If you have very long dialogues, it gets worse. And so this idea of open-ended tool calling agents is really promising.
It's really exciting. But it's really challenging because of these issues that we were mentioning here. So this is kind of the buildup here. Like, so can we envision something in the middle? So again, we talked about chains. They are not flexible, but they're very reliable. This chain will always run step one, two in order.
We talked about like react agents on the other extreme. They're extremely flexible. They can run any sequence of tool calls. You know, can run step one, two, step one only, two only, two, one. But they do have reliability issues. So can we imagine something in the middle that's both flexible and reliable?
So here's kind of the setup. And this is kind of the intuition. Like a lot of times in many applications, you have some idea of what you want the thing to do every time. So some parts of the application may be fixed. Like the developer can set, okay, I always want to run step one and I want to end with step two.
And you can inject an LLM in certain places that you want there to be some kind of branching or kind of optionality in the control flow. Okay. So this is the motivation for what we call LangGraph. So LangGraph is basically a library from the LangChain team that can be used to express control flows as graphs.
And it is a very general tool and I put out a bunch of videos on it and we're going to use it today. And by the end of this, you will all have an agent that runs reliably using LangGraph, hopefully, and we'll see. So you should test me on that.
If things don't work for you, then we'll work it out. But the idea is kind of this. This graph has some set of nodes and edges. So nodes you can think about are basically, well, maybe I should start with this. This graph has something called state. So it's like short term memory that lives across the lifetime of this graph that contains things you want to operate on.
Nodes modify the state in some way. So basically each node can like call a tool and can modify the state. Edges just make decisions about what node to go to next. Okay. So you basically have this idea of memory. And this is the same as common agents, right? Agents are characterized by having tool calling and short term memory as well as planning.
Those same things are present in LangGraph. Memory is the state that lives across your graph. Tools exist within your nodes. And planning, basically, you can incorporate LLM dictated decision making in the edges of your graph. So, like, why is this interesting? And where has this been cropping up? We've actually been seeing this theme crop up a lot of places.
So there's a really interesting paper. There's actually a few I really like. This one's called Corrective RAG. And the idea is pretty simple. With a naive RAG pipeline, you're doing a retrieval, you're taking retrieved docs, and you're generating your answer. Corrective RAG is doing one step more where it's saying, well, why don't we reflect on the docs we retrieved and ask, are they actually relevant?
You can have lots of issues with retrieval. You can reflect on the documents, see if they're relevant. If they're not relevant, you can do different things. You can kick out and do a web search. So it makes your application a lot more robust to poor quality retrieval. So this is one of the first videos I put on LangGraph back in February.
It was very popular. And I basically showed you can build Corrective RAG inside LangGraph. And it's super simple. This is what the graph looks like. I do retrieval. I grade my documents. And we're going to actually, we're going to do this today. And I have a bunch of code for you that does exactly this.
So we're going to go way in detail on this one. But this is kind of the setup. And I showed this working. I showed it works locally with Ollama using, at that time, Mistral 7B. And it works really well. So this is like one simple illustration of how you can use LangGraph to build kind of a self-reflective or corrective RAG application.
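To make the graph structure concrete, here's a hedged sketch of that corrective-RAG flow in LangGraph. The node internals are stubbed and the node names are mine, not the exact notebook code, but the state, node, and conditional-edge shape mirrors what's described above:

```python
from typing import List, TypedDict
from langgraph.graph import StateGraph, END

class GraphState(TypedDict):
    question: str
    documents: List[str]
    generation: str

def retrieve(state):         # pull docs from the vector store
    return {"documents": ["...retrieved docs..."]}

def grade_documents(state):  # LLM grades each doc for relevance; keep only relevant ones
    return {"documents": [d for d in state["documents"] if d]}

def web_search(state):       # supplement with a web search tool
    return {"documents": state["documents"] + ["...web results..."]}

def generate(state):         # LLM writes the answer from the docs
    return {"generation": "..."}

def decide_to_generate(state):
    # Conditional edge: if nothing relevant was retrieved, kick out to web search.
    return "generate" if state["documents"] else "web_search"

workflow = StateGraph(GraphState)
workflow.add_node("retrieve", retrieve)
workflow.add_node("grade_documents", grade_documents)
workflow.add_node("web_search", web_search)
workflow.add_node("generate", generate)
workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "grade_documents")
workflow.add_conditional_edges("grade_documents", decide_to_generate,
                               {"web_search": "web_search", "generate": "generate"})
workflow.add_edge("web_search", "generate")
workflow.add_edge("generate", END)

app = workflow.compile()
result = app.invoke({"question": "What is corrective RAG?"})
```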
Now another cool paper was called Self-RAG, which actually looked at the generation. So basically we're all familiar with the idea of hallucinations. It's a real problem. Instead of just allowing hallucinations to propagate to the user, you can actually reflect on the answer relative to the documents and catch hallucinations.
If there are hallucinations, you can basically do different things. And they propose a few ideas here. I implemented this. And this is actually our most popular video of all time. So this was showing LangGraph and Llama 3 implementing three different things: corrective RAG, which we just talked about; the Self-RAG thing of hallucination checking;
and this adaptive RAG thing. So I can kind of walk through it. This all runs in LangGraph locally. And I have the notebook here. If you want to test that today, you definitely could. So that's the point. It's reliable enough to run this whole thing locally. So what's happening here is I take a question.
I route it either to my index or to web search. I then retrieve documents. I grade them for relevance. If any are not relevant, I kick out and do a web search to supplement my retrieval. If they're relevant, I generate my answer. I check it for hallucinations. And then I finally check it for answer relevance.
So basically, does it have hallucinations relative to my documents? And does it answer my question? If all that passes, I finish and return that to the user. So this is kind of like a complex RAG flow. But with LangGraph, you can actually run this on your laptop. It is reliable enough to run on a laptop with LangGraph.
And the intuition, again, is that you're constraining the control flow. You're allowing the LLM to make certain decisions, but at very discrete points. If you implement this as a RAG agent, this could be very open-ended with a lot of opportunities for breakage. And so that's the real intuition here.
Now, a final theme is Karpathy kind of mentioned this idea of flow engineering related to this AlphaCodium paper, a really nice paper on code generation. And the intuition here is: produce a code solution. They tested this on a bunch of coding challenges. Produce a code solution and check it against a number of unit tests, auto-generated or pre-existing.
And basically, if it fails unit tests, feed back those errors and try again. Really simple idea. I implemented this in LangGraph. Again, the code is here. And this works really well. I share a blog post as well. I ran this on our internal coding use case; we have an internal application for RAG at LangChain.
And we're actually working on implementing this right now in production because the performance is way better. And a common thing this can fix with code generation and code solutions is hallucinated imports. We see that a lot with our RAG app. So what I did was, I very simply implemented a unit test for import checks.
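A minimal sketch of that kind of import check, not the exact production implementation; the feedback wording and the example code block are assumptions:

```python
# Minimal in-app "unit test" for a generated code block: verify its imports exist,
# and if not, feed the errors back to the LLM for another attempt.
import importlib
import re

def check_imports(code_block: str) -> list[str]:
    """Return error messages for any import in the code block that fails."""
    errors = []
    for line in code_block.splitlines():
        match = re.match(r"\s*(?:from|import)\s+([\w\.]+)", line)
        if match:
            try:
                importlib.import_module(match.group(1))
            except ImportError as e:
                errors.append(f"Import failed: {line.strip()} ({e})")
    return errors

generated_code = "import os\nfrom langchain_nonexistent import Magic"  # example LLM output
errors = check_imports(generated_code)
if errors:
    retry_prompt = "These imports do not exist, fix them and try again:\n" + "\n".join(errors)
    # ...send retry_prompt back to the code-generation LLM...
```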
Just run that. It significantly improves performance relative to without doing it. And so we're actually working on implementing this in our internal rag system. So super simple idea that can really improve code generation. So, you know, if I kind of like back up, what did we talk about? I mean, we talked about chains.
They are not flexible, which is fine in some cases. But in a lot of interesting newer papers with RAG, for example, this idea of self-reflection is really beneficial. The ability to kind of self-correct applications can be really beneficial, and beyond RAG as well, for coding. So chains are very reliable, but they're not flexible.
Now, if you go to the other end, a classic React agent is very flexible. It can implement any sequence of control flows through your different tools. But it does have reliability problems due to things we talked about: non-determinism, tool calling errors. And LangGraph kind of sits in the middle where you can actually implement these user-defined slash LLM-gated control flows.
And they can actually be extremely reliable because of that constraint. They are less flexible than a classic react agent. So that is true. So for very open-ended tasks, I do agree. Maybe you do need, you know, a very open-ended more like autonomous style react agent. But for a lot of applications that our customers are working on and seeing, these kinds of like hybrid flows are sufficient and you gain a lot of reliability.
And so we've talked to a lot of companies that have implemented LangGraph successfully for agentic flows for this reason. Because reliability is just incredibly important in a production setting. So this gets into, if you look at the slides, I have a few different notebooks. And what I show is, we talked about corrective RAG.
These notebooks show how to build corrective RAG yourself. And I thought that's a fun starting application. It's a really popular one. It's super simple. There are not many dependencies. You can use whatever tool you want for your web search. You can use other things as well.
You should have a look at the notebooks and I'll kind of walk around. And we're going to, we're going to keep going. I'm just wanting to like, this is just like a placeholder here. But, so if you want to test this locally, if you have a laptop capable of running things locally, then we have a notebook to support that.
I use Ollama. I can talk a lot about that. That's a really cool thing. If you don't, then I have two options for you. So one is a colab. So that's probably the easiest. I've tested it, but if there are issues, let me know. So if you have a Google account, you're going to go over here.
If you have a Google account, you're going to spin up a colab. All you need is a few API keys, depending on what models you want to use. It's all kind of there. You can set those accordingly. And I also have a notebook. So this just kind of gives you a roadmap of the different things you can try today, since this is a workshop format.
And I'll just be walking around and we'll do questions for a while, but I want to talk about the second half of this, of this, you know, story. So one of the things we're seeing a lot, I think you're going to hear a lot at this conference is the challenge of testing and evaluation.
And this is a real pain point. Like, for example, how do I actually know that my LangGraph agent is more reliable than the React agent? How do I know what LLM to use? How do I know what prompt to use? Right? So testing agents is, testing in general is really hard.
And agents in particular are challenging. So there's kind of three types of testing loops I like to think about. One is this in-app error correction. And that's actually what we just talked about. So LangGraph and LangGraph agents are really good for that. So basically, in-app error handling, where you can catch and fix errors, is really useful for code generation and for RAG.
We just talked about that. So that's like placeholder one. Now we get into this idea of pre-production testing. And then finally production monitoring. And I want to introduce a few ideas on the latter two. So we just talked through this. Here we're going to build corrective RAG a few different ways.
And I just showed the choose-your-own-adventure stuff. And so this is just kind of reiterating that. But I want to show you some other things. So LangSmith is a tool from the LangChain team that supports testing and evaluation as well as monitoring and tracing. And so we've seen a lot of interest in this and it's quite popular.
It is really useful for doing these types of testing and evaluation. So there's a key idea behind LangSmith, and the notebooks actually have this. So this is totally optional. If you just want to build an agent, that's totally fine. If you want to also test it, you can use LangSmith.
You don't have to, of course, but I have it all set up to use LangSmith if you want. It's free to use, of course. And so the idea is there's kind of four components that I like to think about when it comes to testing/evaluation. You have some data sets.
That's some set of examples you want to test on. So say you have a rag app. That's like a set of ground truth question and answer pairs you've built. Like you're testing your system. You have question and answer pairs that you know are correct. Can your system produce those answers?
How many will actually get right? You have your application. That's your agent. That's your rag app. That's your code app. Whatever that is. That's your application. Now the thing that's often the trick is you have this evaluator thing. And the notebooks show you in detail. But this evaluator is something as simple as a user-defined function that can implement a few different things.
You can think about using an LLM to actually judge your output. So in that case, let's take RAG as an example. My application produces an answer. I have a ground truth answer. You can actually have an LLM look at those two answers jointly and reason: is it correct? And this is often very effective.
It requires some prompt engineering. I have some nice templates in the notebooks to show you. But this is something that's very popular. This idea of LLM as judge evaluators is very interesting. A lot of people -- actually, you'll probably hear about it this week. It's a really good theme.
It's still kind of in development. But that's like one placeholder to keep in mind. So one option for testing is this idea of using LLMs themselves as judges. The other is building your own heuristic evaluator. So a custom evaluator of some sort. And actually, the notebooks that I share have both.
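As a rough illustration, here's a hedged sketch of what that can look like with the LangSmith SDK. The dataset name, prompt wording, and output keys are assumptions, not the notebook's exact code:

```python
from langsmith.evaluation import evaluate
from langchain_openai import ChatOpenAI

judge = ChatOpenAI(model="gpt-4o", temperature=0)

def answer_correctness(run, example) -> dict:
    """LLM-as-judge evaluator: compare the app's answer to the reference answer."""
    prompt = (
        "You are grading a RAG answer against a reference answer.\n"
        f"QUESTION: {example.inputs['question']}\n"
        f"REFERENCE ANSWER: {example.outputs['answer']}\n"
        f"STUDENT ANSWER: {run.outputs['answer']}\n"
        "Reply with a single word: CORRECT or INCORRECT."
    )
    grade = judge.invoke(prompt).content.strip().upper()
    return {"key": "answer_correctness", "score": 1 if grade == "CORRECT" else 0}

def my_rag_app(inputs: dict) -> dict:
    # Your agent or RAG chain goes here; it receives the dataset example's inputs.
    return {"answer": "..."}

evaluate(
    my_rag_app,
    data="my-rag-eval-set",            # hypothetical LangSmith dataset of Q/A pairs
    evaluators=[answer_correctness],
    experiment_prefix="rag-agent",
)
```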
And so we're actually going to -- the notebooks actually show how to evaluate an agent specifically. And there's a few different things you can look at with an agent. So one is if you go to that far right in blue, the end-to-end performance. So our notebooks are basically going to be a RAG agent.
The eval set has five questions. And I basically have a set of question-answer pairs. So basically, I'm going to compare my agent answers to reference answers. And we'll walk through that in the notebook. But that's kind of one thing. I just want to introduce the idea. So one big idea is you can evaluate the end-to-end performance of your agent, right?
You don't care anything about what's happening inside the end-to-end performance. The other, which I actually like to look at a lot, is this thing on top. What are actually the tool calls that the agent executed? This is how you can actually test the agent's reasoning. So what you see often with agents is they can make some weird trajectory of tool calls.
It's highly inefficient but still gets to the right answer. You don't get that if you only look at the answer. You say, oh, okay, it's got the right answer. But if you look at the trajectory, it's some crazy path. And so you want to actually look at both. Like how efficient, how correct is the trajectory?
And does it get the right answer, right? And so the notebooks I share actually do both. Now this is actually an evaluation that I ran. And this data set is public. This is on the agents that we actually just talked about. So this is kind of what you see when you open Langsmith.
So these are different experiment names. This is just saying like I've run three replicates of each experiment. And these are my aggregate scores. So this first score is basically the answer correctness thing. And the second score is like the tool use trajectory. Or like does it use the right reasoning trace?
And I can go through my experiments. So this top one is actually, this is kind of cool. This is actually my local agent running on my laptop with LangGraph. Okay. It's a five-question eval set. Small eval set, just some very small test examples. But basically my local agent does fine.
It does 60% in terms of the ultimate answer. So that's not amazing. But it does do very well in terms of the tool calling trajectory. So it's very reliable in terms of reasoning. It's an 8 billion parameter model. So I think the quality of its outputs is a little bit lower than you might see with larger models.
Now Firefunction v2 is another option. It's basically a fine-tuned Llama 70B from Fireworks. This one with LangGraph, so this is actually showing this at the top, actually gets up to 80%. So very strong performance in terms of answers. And 100% again in terms of tool calling. So the key observation here is the tool calling or reasoning is consistent.
Whether you're using a local model or a 70 billion parameter model with LangGraph, you get very high consistency in your tool calling. The answer quality degrades. That's more an LLM capacity problem. But the reasoning of the agent is consistent. So that's the key point. Now here's where it gets interesting.
Firefunction v2, again, that's Llama 70B. This is with a React agent. What you can see here is the answer quality is degraded. But here's the interesting thing. The tool calling trajectories are really bad. And this again gets back to that problem with React agents. They're open-ended. They can choose arbitrary sequences of tool calls.
And you can deviate really quickly from your expected trajectory. So that's the key intuition here. Now, the final two are GPT-4o. That's obviously a flagship model. It's maybe number two now, on the Chatbot Arena at least. You know, again, answers ultimately are strong. The tool calling, though, even here is degraded.
So basically it follows some weird trajectories to get to its answers that are unexpected. So what's the high-level point here? The high-level point is LangGraph allows you to significantly constrain the control flow of your app and get higher reliability. And if you look at these tool calling scores, it's very kind of consistent, going all the way down to local models.
It follows the same sequence every time. React agents kind of go off the rails much more easily. The answer performance is really a function of the model capacity. So using an 8 billion parameter model locally, the answer quality is lower than a 70 billion. That's to be expected. But the reasoning of my app is consistent and strong.
So that's the key thing that you kind of get with LangGraph, and this is all public. And hopefully some of you will actually, you know, implement this or reproduce this today. And this is just walking through those same insights I just mentioned. And then deployment, we're going to be talking about later this week.
We have an announcement related to deployment of LangGraph. So this is actually a very good setup. If you're playing with LangGraph and you enjoy working with it, we're going to have some really nice options for deploying later this week. And so Harrison will be here on Thursday to give a keynote on that one.
And if you've deployed, we also have some really nice tools in LangSmith to actually monitor deployments. And this is not as relevant for this workshop. It's something to just be aware of; I can talk about it if you're interested. So maybe to close out, this is a really nice write-up.
These guys are actually going to give a keynote later this week. It's Jason and company, Hamel and others. And they kind of made a really nice point that the model is not the moat. Like, LLMs are always changing. The moat is really the systems you build around your application.
That's what we talked about today. Like, do you have an orchestration framework, for example, like LangGraph? Do you have an evaluation chassis like LangSmith? And again, you don't have to use LangGraph. You don't have to use LangSmith for these things. But this workshop will introduce these ideas to you.
And frankly, I think it's important just to understand the ideas rather than the implementation. Whether or not you use LangGraph, whether or not you use LangSmith, I think understanding these principles is still helpful. But, you know, an evaluation chassis, guardrails, a data flywheel: these are like the components that give you the ability to improve your app over time.
That's really the big idea. That's the goal. And I think you'll hear more on that later this week. The goal here is how are you measuring improvement of your app and ensuring it always gets better. That's what we're actually trying to achieve here. And that's kind of what evaluation is giving you.
And, yeah, this is kind of my last slide. Then maybe we can just move into maybe some Q&A. I can actually show the notebooks themselves if you want to walk through them together. I mean, I'll just do that and I'll let you guys kind of hack on them in parallel as I walk through them.
And then I can just walk around and talk to people, something like that. So, you know, the three types of feedback loops. You have this design phase feedback. Something like LangGraph. In-app error handling. That's kind of step one we talked about. Cool examples there for coding, for RAG. A lot of nice papers.
Really promising. I'm very excited about anything you can do here with terms of kind of agentic self-correction, self-reflection in your app itself. Pre-production testing. We just talked through that. Building evaluation sets. Running evaluations. Testing for an agent. Like your tool use trajectory. Your answer quality. All really interesting and important.
And then in production phase, production monitoring. This gets into, we didn't talk about it too much, but basically this stuff. So basically you can have evaluators running on your app in production. Looking at inputs. Looking at outputs. Tagging them accordingly. And then you can go back and look later.
So that's kind of the set up here. I know that's probably a lot. And we went about half an hour. So if there's any questions, I can just open it up. And we can kind of talk through stuff. I can also start ripping through some of the notebooks just to kind of give you an overview of the code itself.
But if there's any questions here, maybe, you know, happy to take a few and give you a minute to digest all that. Yeah. Is there a non-Twitter link to the slides yet? That's a good point. Let's see. Is there a non-Twitter link to the slides? Let's see if the conference organizers give me some.
I don't know if I have an email list for everyone in here. Is there, okay, is there a Slack? Yeah, if someone can, I actually didn't know there's a Slack. So that's very helpful. Yeah, if there's a Slack or an app for this conference, then please, someone post. I appreciate that.
I actually didn't know that. Thank you for that question. Yes, sir. So about testing and evaluation, does it really scale to predict the exact sequence of the agent? If it's smart enough dealing with a complex problem, it's hard to say exactly how it achieves the task. Okay. This is a very good point.
So I'm going to repeat the question now. How do you evaluate an agent's like reasoning trajectory? If it's a large and open-ended problem, that can be solved in many different ways. In that particular case, you are right. It is hard to enumerate a specific trajectory that is actually reasonable.
For really open-ended, long-running type problems, trajectory evaluation may not be appropriate. One thing I would think about is, can you define a few canonical trajectories of tool use through your application? So, it depends on the task. If it's a very long running agent, I think it's probably infeasible.
If it is a shorter run agent where it's like, you expect something in the order of maybe five to ten steps, you can probably enumerate some set of reasonable trajectories and basically check does it follow any of these. You can also do things, so you can do things like this.
You can do things like check for the repeat of certain tool calls. You can be very flexible at this, and it's kind of open to you, so you can look for like, is it repeating certain tool calls? You can look at recall, like, is it for sure calling this tool or not?
So, you can actually be very flexible. Actually, the way we set up the LangSmith evaluator for this, it's just a simple function that you can define yourself. So, it's a very good point. You can be arbitrarily creative about how you evaluate that. But I would say, for very long running agents, you're right.
You can't really articulate step one, two, and three. But I would then think about more like, evaluating, is it repeating steps? Can you evaluate for clearly aberrant behaviors? Excessive number of tool use repeats, excessive number of overall tool calls. So, kind of like guardrails related to, like, kind of clear aberrant behavior.
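A minimal sketch of those kinds of heuristic guardrail checks, assuming the tool-call names have already been extracted from a run as an ordered list of strings (the thresholds here are arbitrary):

```python
from collections import Counter

def trajectory_flags(tool_calls: list[str],
                     required_tool: str = "retrieve_documents",
                     max_total_calls: int = 10,
                     max_repeats: int = 3) -> dict:
    """Cheap heuristic checks over an agent's tool-call trajectory."""
    counts = Counter(tool_calls)
    return {
        # Recall: did it ever call the tool it definitely should have used?
        "called_required_tool": required_tool in counts,
        # Runaway behavior: way too many calls overall.
        "excessive_total_calls": len(tool_calls) > max_total_calls,
        # Stuck in a loop: the same tool repeated too many times.
        "excessive_repeats": max(counts.values(), default=0) > max_repeats,
    }
```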
If it's very short-running, you can actually enumerate the trajectory specifically. That's a good question. So, in the code, you can see we actually lay out a custom function. You can define that yourself. So, that's a very good point, though. Yeah. Yep. Can you transition to human in the loop? Yeah.
Human in the loop is a good one. The workshop notebooks I share do not have that. But LangGraph does have some good support for human in the loop. And I can share with you some notebooks that showcase that. Also, what we have shipping on Thursday has very good support for human in the loop.
So, I will share some notebooks with you for that. And wait for Thursday for even more there. So, if we're building a RAG-like application, right? Pre-production, we can do this testing framework using a known set of, this is a question, this is a proper answer. So then, what I've been struggling with is, like, the right way to approach it past pre-production, where we don't know what the right answer is.
And I'm wondering about that. Yeah. So, the question was, this is actually a really good one. For RAG, I'm just going to go to our docs because I actually wrote a doc on this recently. So, you can still see my slides. For RAG, in a pre-production setting, it's easy to define, or not even easy, but you can define a set of question answer pairs and evaluate them.
When you're in production, though, how do you evaluate your app because you don't have a ground truth answer? So, what are other things you can actually evaluate for RAG app that don't require a reference? Yeah. So, there is a conceptual guide that I will share. So, this is actually our RAG section.
I have kind of a nice overview of this. There's actually a few different things you can evaluate for RAG that don't require a reference that are very useful. Yeah. So, it's this right here. So, this is like a typical RAG flow. So, I have a question. I retrieve documents.
I pass them to an LLM. I get an answer. Right? What we just talked about and we showed is comparing your answer to some reference answer. Now, this is, to be honest, pretty hard to do. You have to build an eval set of question answer pairs. Very important to do, but it's not easy.
So, what else can you do? So, some of this we've seen that are really easy and actually pretty popular. There's three different types of grading you can do that don't require a reference. They're like internal checks you can run. I mean, you can run them online. So, one is retrieval grading.
So, basically looking at your retrieved documents relative to your question. So, like an internal self-consistency check. So, this is actually a great check to run, and actually the corrective RAG implementation that is in the cookbooks that I share here does this. So, you can play with the prompt and all that.
But basically this is just checking the consistency of your retrieved docs relative to your question. You can do that and we have some really good prompts for that. Another one I like is just comparing your answer to your question. Have an LLM look at, here's my answer, here's the question.
Is this sane? Are they related? And this is a really nice check just for like, you know, of course you don't have a reference answer, but like you can still sanity check and say does this deviate significantly from what the questioner is asking. The other, this is a great one, is hallucination.
And this is super intuitive. Compare your answer to the retrieved documents. So, if the LLM went off the rails and didn't ground the answer properly and you hallucinated, you can catch that really easily. And so, I need to get on this slack because I want to share this link with you.
I'll figure that out. I'll find you. But this is in our LangSmith docs. If you search LangSmith evaluation concepts, we have, I actually have a bunch of videos that showcase how to do this. And I have a bunch of code as well. So, but those are three things you can do that don't require a reference.
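As one concrete example, here's a hedged sketch of the hallucination check, grading the answer against the retrieved documents. The prompt and schema names are assumptions, loosely following the pattern in the cookbooks rather than their exact code:

```python
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI

class GradeHallucinations(BaseModel):
    """Binary grade: is the answer grounded in the documents?"""
    binary_score: str = Field(description="'yes' if the answer is grounded in the documents, 'no' otherwise")

llm = ChatOpenAI(model="gpt-4o", temperature=0)
grader = llm.with_structured_output(GradeHallucinations)

def answer_is_grounded(documents: list[str], answer: str) -> bool:
    docs = "\n\n".join(documents)
    prompt = (
        "You are checking whether an answer is grounded in a set of retrieved documents.\n"
        f"DOCUMENTS:\n{docs}\n\n"
        f"ANSWER: {answer}\n"
        "Is the answer supported by the documents? Grade 'yes' or 'no'."
    )
    return grader.invoke(prompt).binary_score == "yes"
```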
And we do run those as online evaluations with our application. So, yeah. Yep. Yeah, unit testing. So, yeah. Okay. So, do we have any thoughts on unit testing? So, LangSmith supports pytest-style unit tests. But basically, it depends what you mean by unit tests. Typically, conventional software engineering unit tests are very effectively done in things like pytest.
There's a lot of frameworks for that. What I like to think about in unit testing with respect to LLM apps is kind of like what we show in this code generation example. So, here we use some really simple unit tests just for like imports and code execution. Simple unit tests like this can actually run in your app itself.
So, basically, one place you can think about putting unit tests is actually in your app itself, within LangGraph, for kind of in-app error handling or self-correction. So, that's one place for unit tests that's kind of interesting and new with LLM applications. They can live inside your app itself.
Another good one for unit tests within the app is if you're doing structured output anywhere in your application, which is a really common thing: people like to confirm the schema is correct. That's another good use case for unit tests within your application. Both of those things could also be done independently, like in CI outside of your application.
So, we are going to have more integration support for CI with LangSmith soon. I will check with the team on that. But I think the interesting idea for unit tests with LLM applications is this idea of running them inline within your app itself. Because LLMs are so good at self-correcting, if you run unit tests in your application, they can often catch the error and then correct themselves.
And unit tests are fast and cheap to run. So, it's actually a really nice kind of piece of alpha. In fact, that's exactly what Karpathy was mentioning here: that, you know, running unit tests in line with your application is actually really quite nice. And it produced a significant improvement in performance in AlphaCodium.
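For the structured-output schema check mentioned above, a minimal sketch might look like this; the schema fields are hypothetical, and the validation error is what you'd feed back to the LLM for a retry:

```python
from pydantic import BaseModel, ValidationError

class Citation(BaseModel):
    source: str
    quote: str

class AnswerWithCitations(BaseModel):
    answer: str
    citations: list[Citation]

def validate_llm_output(raw_json: str) -> tuple[bool, str]:
    """Return (ok, error_message); the error can be fed back to the LLM for a retry."""
    try:
        AnswerWithCitations.model_validate_json(raw_json)
        return True, ""
    except ValidationError as e:
        return False, str(e)
```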
Cool. Yep. Okay. Yeah, yeah. So, what the question was, if I wanted to do some of this in-app error correction stuff. So, let's take this example. The corrective rag thing. If I actually want this to run in my application, it obviously needs to be super fast. So, that's actually what we've seen.
The tricks we've seen here are basically use very fast, smaller LLMs. So, you mentioned, for example, even the ability to fine tune. That's actually a good idea. If you have a judging task that is very consistent, it's a very good use case for fine tuning actually. Fine tune a small, low capacity, extremely fast, and effectively very cheap to deploy model.
That's a very good idea. That's a very good idea. We've seen people do that. Also, use very simple grading criteria. Don't have some kind of arbitrary scale from zero to five with high cognitive load that LLM has to think about. Yes, no. Very simple binary grading. Even, for some of the stuff, you can even be old school and fine tune a classifier.
But basically, really simple, lightweight, fast, LLM-as-judge style. A classifier wouldn't be an LLM necessarily. But basically, very simple, fast tests for in-app, anything kind of at runtime, is what you will need or want. Another cool use case for this, or another interesting option for this, is Groq, which is very, very fast,
kind of with their LPU stuff. And they actually would be a very interesting option. We've done some work on that with Groq. Basically, for any of these kind of in-app LLM-as-judge error correction things, use something like Groq, which is extremely fast. But it's a very good insight. Fine tuning your own model is actually a really good idea.
We've seen people do that for these types of things with LangGraph and in-app error correction. Cool. Yep. Yeah, we have some good cookbooks talking about multi-agent, which I, again, will need to find a way to share with you. Yep. So we have this, if you go to the LangGraph GitHub, examples, multi-agent, there's a few different notebooks here that are worth checking out.
Yep. Yeah. Yeah. So actually, LangGraph is specifically designed for cycles. So some of the examples, like what we're showing today, are only a branch. So it's a simpler graph. But for example, the React agent we'll show today is a cycle. So it's basically going to continue in a loop just like this.
And what you do is you set a recursion limit in your LangGraph config. So you basically tell it to only proceed for some number of cycles. And this is set by default for you. I believe it's like 20 or something like that. But that's what you're going to, that's what you're going to want to do.
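A minimal sketch of that, assuming `app` is a compiled LangGraph graph (the limit value here is arbitrary; the framework also ships its own default):

```python
# Cap the number of supersteps the graph is allowed to take, so a cyclic
# agent cannot loop forever. LangGraph raises an error when the limit is hit.
result = app.invoke(
    {"question": "What is corrective RAG?"},
    config={"recursion_limit": 10},
)
```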
Yep. So the question was about timing of the responses. So do you mean like if you're implementing some kind of self-correction, how long that takes? Well, that's kind of a -- that gets back to the question that this gentleman asked. It depends a lot on the LLM you choose to use for your judging.
And its latency. So that's kind of where if you're -- so maybe there's two sides of this. One side of it is choosing an LLM that's very, very fast, and that's very important to do. Could be something like Groq. Could be a fine-tuned deployment that you do yourself.
Could be GPT-3.5. So that's like one side of it. The other side of it is how do you actually kind of monitor, measure that? And so again, LangSmith actually does have very good support for tracing and observability. And we do have timings. In fact, I can go ahead and show you very quickly if you want to see.
So this is my LangSmith dashboard. I'll zoom in a little bit. These are my experiments. Now if I zoom in here, I can open up one. The Wi-Fi is a little bit slow. These are my replicates. I can open up one of my traces. And what I can see here is over here I get the timing.
So this is the timing of the entire graph. And I can go through my steps. So this is like the retrieval is really fast. That's good. You know, less than a second. Okay. Now here's what's interesting. My grading in this particular case is like four seconds. So that's, you know, not acceptable in a production setting most likely.
But again, this is just like a, this is a test case. In fact, I'm using, what am I using to grade here? GPT-4o. So, you know, there are ways you could speed this up by using different models or different prompts or grading everything in bulk. There's a lot of ideas.
I actually grade each document independently. Oh, actually, you know what? This is using ChatFireworks. So anyway, you can look at your timings in LangSmith, and that's a really nice way, that's what I like to do, to kind of monitor the timing of my applications.
Yep. So the case you showed is sort of, like, from scratch, like, when the agent is running from the beginning, it's running from scratch. But in reality, usually people have some context or history that's being passed. Do you have any suggestions as to how you test that as it's going through? And the question was, with agents, you typically pass them a message history.
That is absolutely true. And in fact, the React agent that we implement here, like, I can even open up one of the traces. We can look at it together. So here's a React agent with GPT-4o. Here's one of my traces. Let's open it up and actually see what's going on here.
So what happens is, first, my assistant is right here. So this is OpenAI. You know, here's the system prompt, right? You're a helpful assistant, you're answering questions. Here's the human question, right? So again, this is the start of our message history. Okay? So what the -- and also, these are the tools that the LLM has.
And this is pretty cool. We can see this is the one it called. So what happened is, our LLM, it looked at, here's our system prompt, here's the human instruction. And it says, okay, I'm going to retrieve documents. Great. So then it goes, and this is the tool call.
It goes and retrieves the documents. Now that goes -- so you look here, we can actually open up the retriever. What do we get? Here's our documents. Cool. Now I go -- it goes back to the assistant. So back -- this is a looping thing. It started with our assistant.
It made the tool call. The tool ran. We got documents. Now we get them back. Now let's go back to our LLM. So now our LLM -- this is pretty cool, right? Here's the message history, like you were saying. Instructions, question, document retrieval. The documents that are retrieved are right here.
And then now the LLM says, okay, I want to grade them. It calls the grader tool. And this is its reasoning and this is its grade. So anyway, you are right that as this goes through, you basically accumulate a message history. And the LLM will use the message history and most recent instruction to reason about what tool to call next.
That's exactly how it works. I think I answered the question. Is there anything that isn't clear about that? So it is true. Like let's look at an example, right? So in this particular case, right, the LLM sees the retrieved documents from the tool. And then it makes a decision that says, okay, I have this tool response.
It's retrieved documents. What should I do next? And it says, okay, well, why don't I go ahead and grade them? And it calls the grade tool. So it looks at the message history and reasons about what tool it'll call. And that's exactly how these React style agents work. And the whole issue is that's kind of a noisy process.
Like it can look at that whole trajectory. It can get confused. It can call the wrong tool. Then it's on the wrong track. And that's exactly why these more open-ended tool calling agents fail. Yup. So is that the same for a second, follow-up question to the agent? That technically should follow a similar pattern, right?
Because the history from the first question gets brought back when you ask a question after that. Yeah. Okay. Right. So I think that the question is, well, let's say this is a multi-turn conversation where the user can go ahead and ask a second question, of course, and that whole kind of message history will be propagated.
Um, yes, that is, that is a common pattern. Um, and that, let's see, I mean, what's the question on that though? Like it, it could use context from its initial trajectory to answer the second question for sure. Um, it'll probably look at that jointly when it's deciding what tool to call.
So for example, if it receives a question and in its message history it sees the context needed to answer the question, the agent could probably decide, okay, I don't need to retrieve documents again, I have the documents needed to answer the question, I'll answer that question directly. So it is true that in a multi-turn conversation, the agent can look at its message history to inform what to do next.
That's definitely true. Here, I don't cover evaluation of multi-turn conversations. Um, but that is, it's a good topic actually. Um, I don't quite have a tutorial for that yet, but I could think about putting that together. Yeah. That's a good point. Uh, I'll make a note of that actually.
Yeah. So multi-turn is a good one. Cool. Um, okay. Yep. So, uh, unit tests are cool because we can actually run them in the app itself. Have you tested that? Um, so the question is, I believe, um, have I tested, like, the ability to do this?
Um, have I tested, like, the ability to kind of auto-generate unit tests online? Yeah. Okay. So that's a big topic. So basically the AlphaCodium paper, um, that Karpathy references here does that. I have not tested that because it does ramp up the complexity. I think that would be aggressive for a production setting, because you'd basically be relying on an LLM to auto-generate unit tests, and testing against things that are auto-generated.
There's a lot of opportunity for error there. I think it's interesting, particularly in terms of offline challenges like this, but in terms of a production application that feels pretty hard and risky. But it's an interesting theme. The thing I've tested more, and found to be very effective, is super simple, crisp, lightweight, effectively free unit tests.
Like, again, the good use case I found was ours. Um, so at LangChain, we have an internal RAG application called Chat LangChain, and it indexes our documents and provides QA. It occasionally hallucinates the imports, and that's a really bad experience, right, if you take a code block you get from this app, and then this import doesn't exist.
It's like, what the hell? You know, that's really annoying. So I incorporate a really simple check where I have a unit test that just, it does a function call where it extracts the imports from the code block from the answer, and it tests the imports in isolation. If they don't exist, there's an error.
I feed it back to the LLM and say, look, this isn't a real import. Try again. And you can do other tricks, like then context-stuff, you know, stuff in relevant documents or something like that. Anyway, you can handle that differently, but just that little bit of alpha significantly improved our performance.
So I kind of like simple, lightweight, free unit tests. Um, the idea of online-generated unit tests is interesting, but it opens up a lot more surface area for errors. Yeah. Um, so the follow-up there was, let's see, you're trying to write the tests before the app is implemented.
Um, I see. Yes. So they also do that. So basically for each question, the AlphaCodium work references, uh, existing unit tests for a given question, as well as auto-generating them.
So you are right. It would be interesting to have a bunch of pre-generated unit tests that you know are good for certain questions and to run them. Absolutely. Hard to do in a production setting with an open-ended input, but potentially very useful. Well, okay, even in a production setting, you could maybe have some battery of unit tests.
And based upon the question type, pull related unit tests that you know are going to be relevant. It's a good idea. Yeah. It's a good idea for sure. So basically using some battery of dynamically chosen pre-existing unit tests based on the question type or the documentation, whatever documentation they're asking about, that's a good idea.
Yeah, he was saying, so for larger projects, you can also use that to test for regressions. Yes, that's definitely a good idea. And this paper does incorporate that idea as well as this auto-generated unit test thing, which is a little bit more aggressive. Cool. Yep.
Yeah. So the question was, in some of these self-reflective applications, like let's say this one, the self-reflective RAG, right? We're doing a few different checks here. We're checking documents. We're checking hallucinations. We're checking answer quality. So instead of having some hard-coded single prompt to do that, can you have another agent, like kind of a checker or grader agent, that's a little bit more sophisticated?
He mentioned a few frameworks; I have not played with them. I think it's interesting. I think it's one of these things that's really good for kind of academic papers. It's very interesting. It could be good for offline testing; in an online or production setting, the latency and complexity is probably a bit much.
I think in a production setting, it goes back to what I think this guy was referencing. You probably would want something that's extremely fast and lightweight. And I would not think about a multi-agent system in a production setting doing any kind of grading. This whole idea of LLMs as graders is kind of a new idea.
So I think this idea of like more complex agent graders is interesting. But we're kind of in the, we're taking baby steps at this point, especially in a production setting. So I'd probably shy away from that for now, particularly if you're thinking about production. But for something offline or experimentation, it's probably interesting.
Yeah. What's the best practice that you've seen so far for performing tests or evaluations on those cycles? Because obviously, you can end up with many, many different types of input and that sort of thing. And obviously, LangGraph provides some tracing there. But what about, hey, here's an interesting test case.
What kind of best practices do you see for actually performing those routine evaluations? Yeah, exactly. So actually, the notebooks shared here go into it a little bit. So the way I like to do it is, and I can even, why don't I just show you one of the notebooks.
So basically, so at the bottom, I have kind of all the different evaluations. So this goes back to a question that someone mentioned down here as well. Which one is it? So this is, so there's a colab, and then there's a notebook, and they are both the same. So they have the same evaluation section.
This is ragagenttesting.ipynb, and there's also a colab, which is the same notebook, basically. It's in the slides, just to make sure you've got those links. But to your question, so the way I did it, and you can be very flexible with this, is basically, I define the expected trajectory through the nodes in my graph.
And in this particular simple case, the trajectories that I expect are basically retrieval, grading, web search, generate, or retrieval, grade, generate. Those are the two expected trajectories that I want to take. Now, in this case, I don't do cycles. If you did cycles, you could just incorporate more steps here to say, okay, here's kind of the expected number of steps I want or expect to see through that cycle.
And I think someone mentioned here, if you have a really open-ended, challenging task, then it may be hard to enumerate exactly. But for a lot of, like, more production-setting applications where these are not extremely long-running, you can enumerate, here's the steps I expect to take through the cycle. And the way I do the evaluator, it's as simple as this.
There's not much code, it's all in the notebooks. But basically, I compare the tool calls that the thing did to these trajectories. That's it. Super simple. So that's the way I would think about it. I keep it really simple. I would basically try to enumerate. Here's the steps through the difference, the cycle I want it to take.
And, yeah, go ahead. Just to make sure I understand. Right. Basically, you're using, so in the LangGraph concept, basically using the nodes to say, hey, I expected to be calling these tools along the way. That's it. And that's how I perform my evaluation. That's it. That's it.
So it's actually really simple. And this is actually a good point. In the LangGraph case, for this custom agent, you'll see, all I do is, for every node, and we can get into the code if we want. Maybe people have already explored. So, well, I'll just answer the question directly.
Then we can back up and go through all the code if we want. But basically, each node in my graph, I just append this step name. Right. So, like, retrieve, I just say retrieve documents. This is my generate node. I just, you know, append this thing, generate answer. And I return that in my state as steps.
So then I have this record of the steps I took in my graph. That's it. And I can go ahead and fetch that at eval time and compare it to what I expect. That's all you're doing. So that's how it would work with, like, the custom LangGraph thing.
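A hedged sketch of that pattern: each node appends its step name to the state, and the evaluator just compares the logged steps against the expected trajectories. The field and key names are assumptions, not the notebook's exact code:

```python
# Expected paths through the corrective-RAG graph described earlier.
EXPECTED_TRAJECTORIES = [
    ["retrieve_documents", "grade_document_retrieval", "web_search", "generate_answer"],
    ["retrieve_documents", "grade_document_retrieval", "generate_answer"],
]

def retrieve(state):
    # ...do the retrieval...
    # Append this node's step name so the run carries its own trajectory.
    return {"steps": state.get("steps", []) + ["retrieve_documents"]}

def tool_calls_in_order(run, example) -> dict:
    """LangSmith-style custom evaluator: did the agent follow an expected path?"""
    steps = run.outputs.get("steps", [])
    return {"key": "tool_calls_in_exact_order",
            "score": 1 if steps in EXPECTED_TRAJECTORIES else 0}
```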
And now with the React agent, actually, it's a little easier because the React agent uses a message history. So I can just go to my message history, and that's exactly what I show here. You can strip out -- I guess I do it up above. But basically, I have this function in the notebook.
Let's see. Where is it? Yeah. It's find tool calls react. So it's this little function that will basically look at my message history and strip out all the tool calls. So, yeah, it's a nice little -- nice little thing with a React agent. It's really easy to get the tool calls out.
With LangGraph, I just log them at every node. And then at eval time, I can just extract them. Yeah. So the question was, with an agent in a multi-turn setting, if you elaborate on a certain scenario with context switching, where is the cognitive ability to say, I don't have the answer, where does that decision actually get made?
Yeah. So the question was, with an agent in a multi-turn conversation setting, how does it know whether or not it has the answer to a given question? Where to go. Where to go. Exactly. Yeah. That's right. So there's a couple different ways to kind of break this down. So with these agents, there's a few levels of instruction you give it.
First, you give it an overall agent prompt. So if I look at the notebook here, we can go look at the React agent as an example of this. So the React agent is defined right here. This is kind of like the planning step. Here's my, like, naive prompt. Okay.
So your helpful assistant is tasked with answering tasks. Use the provided vector store to retrieve documents, grade them, and go on. Now let's take the case that's more complicated. Let's say I had two vector stores. So one thing I can do is I can explicitly put in the agent prompt, you have two vector stores, A, B.
A has this, B has this. And then you're implicitly giving the LLM the ability to kind of reason out which one to use. Now this is where the second piece comes in. You also have to bind it to a set of tools. This is really where the decision-making comes in.
When you create this tool, so here's retrieve documents, right? This tool description is captured by the agent so the agent knows what's in this tool. And this is really where that decision to use this retriever tool versus another one would be made. It'd be a combination of the prompt you give to the agent and/or the tool description.
So if you had two vector stores, you could basically say retrieve documents one. This vector store contains information about X, another one contains information about Y. Then the agent is deciding what tool to call based on that description and maybe based on its overall prompt. But to your point, it's not easy.
So actually, that's with the React-style agent. With the LangGraph custom agent, you can do it a little bit differently. I actually don't have it in this notebook, but I have other cases where you can build a router node. And I mean, I'll show you actually.
So this particular notebook, this self-RAG one, it's in the slides. If you open this up, we did this with the Llama folks. So this is actually a trick I really like. If you go to the LangGraph RAG agent local notebook here. The Wi-Fi is a little slow.
What I define here is a very specific router at the start of my agent that decides where to send the query. And this is something I really like to do because like we saw with the React thing, it has to kind of decide the right tool to use, which can be kind of noisy, versus right here.
So here's my router, right? This is reliable enough to run locally. And what I do here is I run this at the start of my graph. And I have the agent, or yeah, the agent explicitly take the question and decide what to use. And based on what decision it makes, I can send it to either web search, or in this case, a vector store.
So to answer your question, if I pull all the way back, I personally like to do explicit routing as a node at the start of my graph. Because it's, it's pretty clean. And you can see in the flow of this overall graph, this router runs first. And it looks at my question and sends it to one of two places.
And this can be more complex. You can send it to one of n places. But I see one of two here. This is with, like, a custom LangGraph agent. This is what I like to do. If you're using a React agent, it is then using a combination of the tool definitions and the overall agent prompt to decide what tool to call.
But you can see it's more squishy because it has to call the right tool. And as opposed to giving it a router and saying always run this router first, it has to kind of make the right decision as to what tool to call based on the context question, which is harder.
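A rough sketch of that explicit-routing pattern, assuming a recent version of LangGraph; the routing decision is stubbed out (in the notebook it's an LLM call returning a structured choice), and the node names are illustrative:

```python
# An explicit router that always runs first, then sends the question to either
# the vector store path or web search.

from typing_extensions import TypedDict
from langgraph.graph import StateGraph, END

class GraphState(TypedDict):
    question: str

def route_question(state: GraphState) -> str:
    # Stub: a real router would ask an LLM to pick a datasource.
    return "vectorstore" if "agent" in state["question"].lower() else "web_search"

def retrieve(state: GraphState) -> dict:
    return {"question": state["question"]}   # placeholder: query the vector store here

def web_search(state: GraphState) -> dict:
    return {"question": state["question"]}   # placeholder: call a search tool here

builder = StateGraph(GraphState)
builder.add_node("retrieve", retrieve)
builder.add_node("web_search", web_search)
builder.set_conditional_entry_point(
    route_question,
    {"vectorstore": "retrieve", "web_search": "web_search"},
)
builder.add_edge("retrieve", END)
builder.add_edge("web_search", END)
app = builder.compile()
```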
And that gets back to our overall story about how these kind of LangGraph explicitly defined agents are more reliable, because you can lay out this routing step right up front and have it always execute that step before going forward. So, if you're looking back at the history, you might say, hey, I actually have the answer already -- you've got to actually go to the graph and say, do I have the answer in my history? Can I answer from that? Yep. Okay. Got it. So, the question was, how do I incorporate the idea of routing with history? So, here's what you do.
It's actually, you know, kind of, it should be pretty straightforward. You can define this router node in your graph. And that router node, I'll actually go down to it. So, basically, here's my graph and here's the, the route question node, right? Yeah. Actually, in this particular case, it's, it's an edge.
Don't worry about those details. Basically, what you could do is, you could have a node that takes in the state. Now, that state could include your history. So, what you could do is, in that router prompt, you could really easily, here, include another placeholder variable for, like, your message history or something.
And then what you could say is: make a decision about where to go next based upon the question and based upon something in our history. And so, you actually would plumb in your message history here and use it to jointly decide what to do next. So, that's actually really easily handled in LangGraph using a node, and you can reference the history. It can be passed into that node as state. Cool. Yep. The notion of state. Yeah, let's talk about state in a little bit more detail. So, let's actually go to the notebooks that we're working with here that I've shared. So, here's this rag agent testing notebook.
So, if you go down to the custom LangGraph agent, the way you do it is, I'll find the state. Yeah. So, here's what I call graph state. So, the graph state is basically something that lives across the lifetime of my graph. I typically like to do something as simple as just a dictionary.
So, basically, this is a rag graph. And here, I'm basically going to define a number of attributes in my state that are relevant to what I want to do with rag. A question, my answer generated, whether or not to run search, some documents, my step list. And basically, the idea here is that I define my state up front.
And then at every node, I basically accept state as the input. And then I operate on it in some way and write back out to state. So, basically, what's happening is I define state generally up front as, like, a dictionary or something like that. It's the placeholder for things I want to modify throughout my graph, throughout my agent.
And every node just takes in state, does something, and writes back out to state. That's really it. So, basically, it's a way you can think of it as a really simple mechanism to persist information across the lifetime of my agent. And for this rag agent, it's things that are really intuitive for rag.
It's like question, it's documents. And so, let's take an example. Like, okay, here's a fun one. So, my grade documents node, right? What I'm doing here is I'm taking in state. And from my state, it's just a dictionary. So, I'm extracting my question. I'm extracting my documents. And I'm appending a new step.
I'm notifying, hey, here's my new step. And basically, I'm doing some operation. I'm iterating through my documents. I'm grading each one. If the grade's yes, I keep it. If the grade is no, so yes, no means like, is it relevant or not, basically. So, if yes, it's relevant, I keep it.
I put in this filter docs list. If it's not, I set the search flag to yes, which means I'm going to run web search. Because I want to supplement, I have some docs that are irrelevant. And I write back to state at the end. My filter docs, my question, search, and the steps.
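Condensed into code, that state-plus-node pattern looks roughly like this; the is_relevant grader is a hypothetical stand-in for the LLM grader used in the notebook:

```python
# A dict-like state defined up front, and a grade-documents node that reads
# from it, filters, flags web search, and writes back.

from typing import List
from typing_extensions import TypedDict

class GraphState(TypedDict):
    question: str
    generation: str
    search: str
    documents: List
    steps: List[str]

def grade_documents(state: GraphState) -> dict:
    question = state["question"]
    documents = state["documents"]
    steps = state["steps"]
    steps.append("grade_document_retrieval")

    filtered_docs = []
    search = "No"
    for doc in documents:
        if is_relevant(doc, question):       # hypothetical yes/no grader
            filtered_docs.append(doc)
        else:
            search = "Yes"                   # at least one irrelevant doc: supplement with web search

    return {"documents": filtered_docs, "question": question,
            "search": search, "steps": steps}
```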
That's it. So, state's a really convenient way to just pass information across my agent. And I like using a dictionary, just like nice and clean to manage. As opposed to a message history, which is like a little more confusing. Like, in any node, if you use a message history, it's like a stack of messages.
So, if you want the question, you have to like, it's like usually the first message. You have to just index it. It's just kind of ugly. I like using a dict, which is like, I just get the question out as a key in my dictionary. Okay. Cool. Let's see.
We're about an hour in. I can also, you know, let people just hack and walk around, talk to people, stay up here, whatever's best. And you keep asking questions if you want to. I think people are just working, doing their own thing now anyway. So, I might ask one question just for fun.
Is anyone interested in local agents? We didn't talk about that too much. It's a big theme. Yeah. So, I shared a notebook for that. And by default, I am using Llama 3. You can absolutely test other models. So, this is set up to test just Llama 3 with Ollama.
Try other things. I have an M2 with 32 gigs. So, 8B runs fine for me. If you have something bigger, you can actually bump that up a little bit. So, that can be a nice idea. Yeah, exactly. 70B is, it's unfortunate that, yeah, I actually want a bigger machine so I can run that. Because I found for tool calling, 8B is really at the edge of reliability. So, that's actually why you really can't run the React agent locally with an 8B model reliably. You can run the LangGraph agent very reliably because it doesn't actually need tool calling. It only needs structured outputs, you'll see in the notebook.
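For reference, getting structured output from a local model via Ollama can look roughly like this -- a sketch assuming the langchain-community ChatOllama wrapper and a locally pulled llama3 model; the router prompt is illustrative:

```python
# Structured (JSON) output from a local model, which is all the custom
# LangGraph agent needs -- no tool calling required.

import json
from langchain_community.chat_models import ChatOllama

llm = ChatOllama(model="llama3", format="json", temperature=0)

router_prompt = (
    "You are routing a user question to a vectorstore or web search.\n"
    "Return JSON with a single key 'datasource' set to 'vectorstore' or 'web_search'.\n\n"
    "Question: {question}"
)

response = llm.invoke(router_prompt.format(question="What is agent memory?"))
decision = json.loads(response.content)   # e.g. {"datasource": "vectorstore"}
```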
So, that's actually a really nice thing. But React agent won't run locally. Reliably, at least. Yep. Regarding RAG, what is the typical chunk size? You say chunk documents? Yeah. Okay, yeah. So, this question always comes up. With RAG, what is the typical chunk size? Yeah. If you ask 10 people, you'll get 10 answers.
It's notoriously ad hoc. You know what? To be honest, I did kind of a talk on the future of RAG with long context models. I'm kind of a fan of trying to keep chunk size as large as possible. Actually, I think, this is a whole tangent, but I actually think one of the nicest tricks, let me see if I have a good visual for it, basically, let me try to find something here.
This is a whole separate talk. But basically, I think, yeah, this one, RAG and long context. So, this is a whole different thing. But, yeah, this idea. So, I think for RAG, one of the ideas I like the most is decoupling. I'll explain this. I'll just say it and then I'll explain it.
Decoupling what you actually index for retrieval from what you pass the LLM. Because you have this weird tension, right? Smaller semantic-related chunks are good for retrieval relative to a question, right? But LLMs can process huge amounts of context at this point, you know, up to, say, a million tokens.
So, historically, what we would do is you would chunk really small, like, very, very -- try to get it as tight as possible. All these tricks, semantic chunking, a lot of things to really compress down to just compress and group semantic-related chunks of context, right? But the problem is then you're passing to the LLM very narrow chunks of information, which has problems in recall.
So, basically, it's more likely that you'll miss context necessary. So, it's good for retrieval but bad for answer generation. So, a nice trick is, for retrieval, use some chunking strategy, whatever you want. Like, make it, you know, small. But you can actually use, like, a doc store to store the full document.
And what you can do is retrieve based on small chunks but then reference the full document and pass the full document to the LLM for actual generation time. So, you decouple the problem of, like, are you sure you're passing sufficient context to LLM itself? Now, I also understand if you have massive documents, it can be wasteful in terms of tokens to pass full documents through.
But there's some Pareto optimum here where I think being too strict with your indexing approach doesn't make sense anymore given that you can process very large context in your LLM. So, you want to avoid being overly restrictive with your chunk size. So, this is a nice way to get around it, basically, to summarize.
You can choose a chunking strategy, whatever one you want. But I like this idea of referencing full documents and then basically passing full documents to the LLM for the answer itself. It gets around the problem of an overly aggressive chunking strategy that misses context needed to actually answer your question.
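A library-agnostic sketch of that decoupling idea (LangChain's MultiVectorRetriever and ParentDocumentRetriever implement the same pattern); the vectorstore and split helpers here are hypothetical:

```python
# Index small chunks for retrieval, keep full documents in a docstore keyed by
# id, and hand the LLM the full parents at generation time.

doc_store = {}   # doc_id -> full document text

def index(doc_id: str, full_text: str) -> None:
    doc_store[doc_id] = full_text
    for chunk in split(full_text):                         # small, semantically tight chunks
        vectorstore.add(text=chunk, metadata={"doc_id": doc_id})

def retrieve_full_docs(question: str, k: int = 4) -> list[str]:
    chunks = vectorstore.search(question, k=k)             # retrieve against small chunks
    parent_ids = {chunk.metadata["doc_id"] for chunk in chunks}
    return [doc_store[doc_id] for doc_id in parent_ids]    # but pass full documents to the LLM
```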
And I think with long context LLMs getting cheaper and cheaper, this is starting to make more sense. I'm trying to think. I had a whole kind of, like, slide on this. Yeah, this one is kind of, like, balancing system complexity and latency. So, it's kind of, like, on the left is, like, maybe the historical view.
You need the exact relevant chunk. You can get really complex chunking schemes, a lot of overengineering, lower recall. Like, if you're passing 100 token chunks to your LLM for the final answer, you might miss the exact, you know, part of the document that's necessary. Very sensitive to chunk size K, all these weird parameters, right?
On the other extreme, just throw in everything, throw everything into context. Google, actually, I think this week will probably announce some interesting stuff with context caching. Seems really cool. Maybe that actually could be a really good option for this. But higher latency, higher token usage, can't audit retrieval, security authentication.
Like, if you're passing 10 million tokens of context in for your answer generation. So, something in the middle is what I'm advocating for. And I think this kind of document level decoupling, where basically you index and reference full documents and pass full documents to your LLM, is, like, a nice trick.
We've seen a lot of people use this pretty effectively. So, yeah. Yeah. Is there any way, or any guarantees or tricks that you have, to make sure that you're not actually missing any of the elements of the context across the entire window? Like, how do you measure that you're actually using all of the context -- the middle as well as the beginning or end?
Okay, that's interesting. So, the question was, how can you evaluate the amount of context you are using? Okay. Yeah, like, does it definitely use the whole state? I don't want to miss any. Okay. But I guess, so, in a RAG context, you have a question, you have an answer.
So, you have a question, you have an answer, and you have some retrieved documents. So, you can evaluate the question relative to your document. That's one way to get at this. Like, how much of the document is relevant to your question? So, that's maybe one approach. And, actually, the notebooks show a few prompts to kind of get at that.
And, actually, a good way to think about that is you can think about document precision and document recall. And this is a little confusing, maybe, so I should explain it. Basically, document recall is, does the document contain the answer to my question anywhere? So, let's say you have a 100-page document on page 55 is my answer.
Recall is one. It's in there. Precision is the other side of that coin, which is, does it contain information not relevant to my question? In that particular case, huge amount of irrelevant information. So, recall will be one, precision will be very low. So, that's one thing you can do.
You can actually look at your retrieved docs, measure precision and recall. That's, like, one thing I think I would probably like to do there. And that's probably the best way to get at this question of, like, how much of my documents am I actually using? Now, with this approach, your recall will be high.
Your precision will be kind of low. And you would say, I don't care. It's fine. If I have a model that's super cheap to use a large number of tokens, I'm okay. Maybe I'll frame it another way. I care more about recall than precision. I want to make sure I always -- I answer the question, if I pass a little bit more context than necessary, I'm okay with that.
Versus if you're a precision-gated system, then you would say, okay, I'm going to miss the answer sometimes, but I'm okay with that because I never want to pass more context than necessary. I think a lot of people are moving towards higher recall because these LLMs are getting cheaper and they can process more context.
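As a sketch, those two reference-free graders could be as simple as a pair of prompts along these lines -- the wording is illustrative, not taken from the notebooks:

```python
# Illustrative reference-free grading prompts: one recall-flavored (is the
# answer in there at all?) and one precision-flavored (how much is relevant?).

recall_prompt = """You are grading retrieved documents for a RAG system.
Question: {question}
Documents: {documents}
Does ANY part of the documents contain the information needed to answer the
question? Answer with a single word: yes or no."""

precision_prompt = """You are grading retrieved documents for a RAG system.
Question: {question}
Documents: {documents}
Roughly what fraction of the document content is directly relevant to the
question? Answer with a number between 0 and 1."""
```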
So, that's kind of how I might think about it. And I think this approach is a nice idea of, you know, indexing full documents -- or indexing chunks, but then referencing full documents, passing full documents to your LLM to actually generate answers. So, yep. I was wondering if it seems like there's a couple of options for where exactly you inject the context into the conversation.
So, just for one example you could stick something into the system and say, hey, this is your knowledge that you know. And then the user just asks the question and there's no prompting from the user. The other thing is you can completely modify what the user -- what the quote-unquote user is, and then they answer the question.
And then, like, a further idea on top of that is you ask a question and you ask a follow-up question. Maybe, for the follow-up, there's no need to actually retrieve anything -- what you already have can just be passed along. Or you ask something else that does need retrieval. Do you have any, like, observations as far as things that work well, or don't work well -- continually appending things into the system prompt, or, like, adding another user message? Like, where do you stick that context? How much of the history? Yep. Yeah, okay. So, this is a great question.
The question was related to kind of, like, agentic rag and where to actually put documents. So, let's walk through the cases. So, case one is I have a fixed context for every question you're just going to ask. Let's say I have, like, a rag bot against, like, one particular document.
And that document is always going to be referenced. So, that -- you make a very good point. In that case, what I would do is -- let's go back to our, like, agent example. You can put that in the system prompt itself, like you said. So, and actually, I'll mention something else about this.
It would be, like, right here. So, here's your system prompt for your agent. You just plumb that whole document in. You say, every question you're going to reference this document. You don't need retriever system. You're done. That's a very nice case. No retrieval complexity. You just context stuff your whole document.
So, the thing that Google is going to announce this week, I believe, this context caching, seems really interesting for this. Because, basically, what they're saying is -- if I get it right, I think Logan will speak to this. But I think they have a context window of a million to 10 million tokens, which is huge, right?
So, basically, you can take a large set of documents and stuff them into this model, effectively. And they house them for you, somehow. I think you have some minor data storage fee. But then, for every inference call, they don't charge you for all those tokens that are cached. Which is pretty nice.
So, basically, here's the use case, to your point exactly. I have some set of documentation. It's 10 million tokens. That's a lot of pages. Hundreds of pages. I have it cached with the model. And every time I usually ask a question, I don't get charged, you know, 10 million tokens to process the answer.
They just cached for me. Really nice idea. So, that's your point of, like, the first thing. That's, like, not quite in the system prompt. It's in, like, the cache. But that's the same idea. So, you have cached or system prompt fixed context. That's, like, case one. So, case two is you have -- you want to dynamically retrieve.
So, you can't stuff your context. Maybe you have a few different vector stores, like we were talking about here with routing. So, in that case, yeah, you have to use an index of some sort. Maybe a router to choose which index to retrieve from. So, that's kind of case two.
And I'm trying to remember the -- oh, okay. So, like, in that particular case, for follow-up questions, how do I kind of control whether I re-retrieve or not? So, that's the nice thing about either one of these agents. It has some state. So, the state lives across the lifetime of the agent. So, basically, the agent -- and this actually gets at exactly what the other question was on. Let's say I built my agent with -- I'll show you right here. So, let's say I have a router node at the start of my agent, okay? And that router has access to state.
What I can do is then, given a question -- this could be -- let's say it's a multi-turn thing. This is the second question in my conversation. I have an appended state from the rest of my discussion here. The agent knows it returned an answer. So, basically, when a new question comes in, you could pass, like, the entire state back to that router.
And the router could know, okay, here's the docs I've already retrieved. And it can basically then decide to answer directly because I already have the answer to the question. So, that's a long way of saying you can use state, either message history or explicitly defined in your LangGraph agent, to preserve docs that you've retrieved already, and then to just use them to answer the question without re-retrieving.
So, that's kind of what these -- these rag agents can be really good at. That was kind of, like, storing that in short-term memory and reasoning about, hey, do I need to re-retrieve or not? So, that's exactly the intuition behind why these rag agents can be pretty nice. Yeah?
Yeah. There's no problem with that. It seems like, you know, this idea that RAG is a hack, right? Like, you just kind of -- yeah? I mean, it seems like there's a bit of a back and forth going on, where the model is designed in this kind of purpose-built way, where the coder packs something together around it, and now, you know, you go back to training the model, and then they say, oh, well, we've actually trained the model, and this is going to be good at this. And I'm curious -- it almost seems like you're going to propose all these ideas, and then the trained model is going to work better.
Yeah, yeah. Exactly. Okay. This is a really good discussion and whole debate. So, the highest level framing of this is: how do you want your model to learn? So, one option is you can modify the weights of the model itself with something like fine-tuning. Another is you can use what we call in-context learning through your prompt. So, RAG is, like, a form of in-context learning. I'm basically giving it some documents. It's reasoning over those documents and producing answers. It's not touching the weights of my model. Fine-tuning would be taking the knowledge I want to run RAG on, fine-tuning your model, and updating the weights so it has that knowledge. There's a lot of debate on this, and actually I think Hamel has a whole course on fine-tuning.
Oh, yeah. Do you have a... I just want to clarify. I don't mean fine-tuning to add information. I just mean fine-tuning so that it has the idea. Like, if you could fine-tune it so that it knows, oh, I should always retrieve this information from my system and use that when answering the question. Before you... Okay. ...you really use whatever extra content the users provide. So, just as a simple example of that, if you have the system prompt at the very beginning of the chat, you would be training the model to focus more on the information at the beginning of the chat. Whereas if you were fine-tuning the model to be good at using the context that the user provides, then it's focusing more attention on the end of the chat. But it's not adding the information in, it's just how do I use...
But it's not adding the information in, it's just how do I use... Yep. I got it. Okay. I'll repeat that. The clarification was thinking about using fine-tuning more to govern the behavior of the agent rather than to encode facts, which is a very good clarification because I was going to say using fine-tuning to encode facts, I think a lot of literature has pointed to that being a bad idea for a lot of reasons.
It's costly, you have to continually fine-tune as facts change, so let's dispatch that. I think that's kind of not a great idea. But you make a very interesting point about fine-tuning to govern behavior. Now, there's a paper called RAFT that came out kind of recently, and actually, as far as my understanding -- I haven't played with it myself -- it's fine-tuning to kind of do what our notebooks show today: this kind of, like, look at the documents that are retrieved, reason about whether they're relevant, and then automatically filter them out if they're not.
They're doing exactly what we're doing in this LangGraph thing, but it's kind of achieving that same outcome through process fine-tuning. So that's a very good insight, you're right. It seems promising to have these kind of fine-tuned RAG agents, so to speak -- or it wouldn't be an agent, it would be an LLM fine-tuned for RAG -- that incorporates this kind of logical reasoning, or, what you're saying, maybe some kind of reasoning about, if you have a multi-turn conversation, avoiding recency bias, whatever it is. That seems like a very good and interesting trend.
The challenge is, a little bit, if you're fine-tuning yourself, fine-tuning is hard and somewhat advanced and all that. Alternatively, even for a very niche use case, like the RAFT approach, which basically fine-tunes to do this -- as models change or update all the time, you kind of need to keep your fine-tuned model up to date, if you see what I'm saying. So I think I'm still a little bit queasy about using fine-tuning even in that context, because of the challenge of keeping it up to date with the state of the art. But it's interesting. I think the RAFT paper is a good reference in this direction. And it does exactly what we do in this workshop, but I believe it fine-tunes this into the model, or attempts to, which is a very intuitive thing to think about.
Basically, let the model reflect automatically on retrieved documents and like automatically filter them for you. It seems like it should be able to do that. It seems like a good idea. So, but my hesitation would still be like, what if I want to switch my models? I need to like re-fine-tune.
I want to use Llama 3. I have to fine-tune Llama 3 on this task. I can't use a proprietary model -- well, maybe I can fine-tune; you know, GPT-4o might have fine-tuning now, I'm not even sure. So again, if I fine-tune myself, that's hard. It still feels a little bit like I'd rather just set up a simple orchestrated agent that does it rather than rely on fine-tuning.
My sense. Yeah. I guess I'm not really imagining the person doing it themselves -- the provider. In an analogy, you know, like Tesla's data engine. Right. Yep. But it's going to start to become that data engine. Right. There's got to be some kind of, like, back and forth. Right. Yeah. Okay. So that's a very good point. I think this is also a very big debate. So OpenAI just did an acquisition this week on a retrieval company. I forget the name. Rockset, I believe. So I think they are moving more in the direction of retrieval.
I could absolutely see them offering, you know, an API that potentially does retrieval for you and incorporates some of these ideas for you. So how much of this does get pushed behind APIs and they take care of whatever is necessary behind the scenes for you. That could absolutely be the case.
I would not be surprised at all if they move in that direction. And I think there's always, you know, it's an interesting trade off. Like how much are you willing to kind of, you know, abstract behind an API versus not. I think there's always a lot of companies, developers that want to kind of control everything themselves and build it themselves, have full transparency and others that don't.
And so, you know, it's an interesting question. But of course, for certain functionalities, multimodality. Very few people are going to stand that up themselves. You kind of let that live behind an API. So what do you allow to live behind an API or not? My only concern is, I think for some of these kind of things, they could be very domain specific.
Like what you consider relevant or not could be very relevant to you and your application. You kind of want to be able to control that. That's the only thing I can imagine. It could be kind of hard to abstract that all behind an API, which I think is maybe why OpenAI hasn't done too much in retrieval yet.
It's just a hard beast. I know they've been trying for a while. I don't know. It's a great debate, though. Yeah, we can discuss more. Yeah, it's a good topic for sure. Yep. You mentioned about long context windows. Oh, yeah. So there's a problem of like, loss in the middle.
For example, the precision -- like, for example, in the context, say 95% of it is one thing and the 5% you actually need is somewhere in the middle. And somehow the model misses it at that point. Like, do you have any thoughts on that? Yeah. Yeah. So the question is on loss in the middle of long context.
I actually did a whole study on this with Greg Cameron. Yeah, it's a really interesting topic. So the insight was basically that long context LLMs tend to have lower recall or like, you know, factual recall for things in the middle of the context. Okay. So that was one observation.
At least that's what their paper reported. So I actually looked at this with Greg, and we did something a little bit even harder. We actually tested for multiple fact retrieval. So we tested, can you retrieve one, three, or ten different facts from the context? And this was using GPT-4.
GPT-4 turbo, single turn. And on the x-axis, you can see the fraction of the needles that it basically can get. And then on the y is the number of needles. So basically one needle, three needles, ten needles. Green versus red is basically just retrieving versus retrieving and reasoning. So it's like reasoning is a little bit harder than just retrieving.
These needles were actually pizza ingredients. So basically the background was, you know, this was 120,000 tokens of Paul Graham essays. And three secret pizza ingredients -- or however many, one, three, or ten -- were injected into that context. And I basically asked the LLM, what are the ingredients needed to build the secret pizza?
So I'd have to find them in there. And basically as you ramp up the number of needles, go from one to ten, it gets worse. So with ten, it's actually retrieval itself is only like 60%. So then I looked at, okay, well where is it failing? And that's what I look at here in this heat map.
So basically this is telling you like how long the context is. So a thousand tokens all up to 120,000. And then here's like the needle placement. So one to ten. So this red means you couldn't retrieve it. And what I found is actually it doesn't get them towards the start of the document.
So the retrieval gets worse if the needle's at the front. So it's like this. I read a book. I asked you a question about the first chapter. You forgot because I read that a month ago or something. Same idea. And actually I put this on Twitter and then someone said, oh yeah, it's probably recency bias.
And that's a good point, that basically the most informative tokens in next token prediction are often the more recent ones. So, you know, basically these LLMs learn a bias to attend to recent tokens. And that's not good for RAG. So that is all to say, I'm a little wary about long context retrieval.
I wouldn't quite trust basically high quality rag across a million tokens of context. You can see, look, if it's 1,000 tokens, no problem. If it's 120,000 tokens of context, you know, it depends a lot on where those facts are. If they're towards the start, you actually can have much lower recall.
And so that's a real risk. Which is kind of why it goes back to this whole thing of, like, I don't really buy just stuffing everything into context, that far right side. I think there's too many issues with bad recall, recency bias, like you said. And so I think until we have very, and by the way, I also don't really trust.
You know when they show those needle in the haystack charts, it's like perfect. I don't trust any of that. I did my own study. I found there's like a lot of errors. And I think it depends a lot on a couple different things. One, how many needles? So in this case, you see with one, it's okay.
With 10, it's really bad, right? So how many needles? And then I saw an interesting study saying that, like, how different the needles are relative to your context matters -- if they're very different, it's easier. So like in these studies, pizza ingredients in Paul Graham essays are really different. But if the needle is only slightly different from the surrounding context, it's actually harder still.
So that is to say, I don't really trust the needle in a haystack studies. I don't particularly trust passing a million tokens of context and counting on that to just work effectively. I'd be very wary about that. That's kind of my thing. But you know, in these studies, look, even a thousand tokens of context -- I mean, if you're stuffing a thousand tokens, that's actually still pretty small.
So yeah, I'd just be wary about retrieval from very large contexts. Yeah. Okay, that's a great question with agents, the number of tools. This is a really big issue I hear mentioned a lot. So if you recall, if you go back to the agent stuff. So you're basically binding some set of tools to your LLM, right?
And that's what we show here, right? I've seen a lot of issues with a large number of tools. So I don't know exactly what the cutoff is. But this is one of the big problems with open-ended tool calling agents: if I am basically selecting from 20 different tools. Maybe the Berkeley leaderboard has data on this -- so if someone knows, feel free to mention it -- but reliability of tool calling, even with a small number of tools, like on the order of five, is already challenging. If you're talking about dozens or hundreds of tools, I think it's quite challenging. Which is why I've seen more success in not using these open-ended style tool calling agents, and laying it out more explicitly as a LangGraph where the tool calls live inside nodes. And you're not relying on your agent to pick from, like, 20 different tools. So you can lay out more of, like, a control flow where you route to different tool nodes based upon the logic.
So, so that's kind of one thing I've seen. Another thing I've seen is maybe multi-agent type things where you have different agents with subtasks with each having like maybe a small number of tools. But basically what I've seen is it seems to be that, okay, maybe it's two things.
Selection from large number of tools is definitely challenging. One of the most interesting things I saw is something like you can use something like RAG where basically you can take a description of your tools, create a natural language description, embed it, and then use basic like semantic similarity search, your query versus the embedded summaries to select using semantics.
That's actually not a bad idea. I would actually use that more than I would trust an LLM to just like do the tool selection from a list of 100. That's not going to work. So actually I think that like RAG for tool selection is a cool idea. I was going to do a little like test that out and do a little tutorial.
So actually maybe I'll just make a note of that. That's it. That's a great, a great question. To do RAG for many tools. Right. Yeah. Well I think, I think using semantic similarity for tool selections is a good idea. Definitely. What about data querying? What do you mean by data querying though?
Right. Let me make sure I understand. I think the way I would think about it is, so you know how when you're in the notebook, like in the code, for all of our tools, right, you have this little tool description. Right. Like retrieve documents, grade them, run web search.
I would actually write verbose descriptions for all my tools and then index those descriptions. Right. Or embed them. And then I would do, and I would probably create like very, very verbose high quality summaries of what the tool actually does and then do semantic similarity search against those summaries.
I think that could actually, I haven't done that yet, but I think that could work really well. because it's a very tall task to ask an LLM to differentiate between like 20 different tools. Whereas, you could do something like semantic similarity, that would actually probably be very effective. Yep.
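A sketch of what that RAG-for-tool-selection idea could look like, assuming an embeddings object with an embed_query-style method (for example, OpenAIEmbeddings); the tool names and descriptions are made up:

```python
# Embed verbose tool descriptions once, then pick the top tools for a query by
# cosine similarity instead of asking the LLM to choose among dozens.

import numpy as np

tool_descriptions = {
    "retrieve_documents": "Search the internal vector store of product docs ...",
    "web_search": "Run a live web search for recent information ...",
    "sql_query": "Query the analytics database with SQL ...",
}

tool_vectors = {name: embeddings.embed_query(desc)          # embed each description once
                for name, desc in tool_descriptions.items()}

def cosine(a, b) -> float:
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_tools(question: str, k: int = 3) -> list[str]:
    query_vec = embeddings.embed_query(question)
    ranked = sorted(tool_vectors,
                    key=lambda name: cosine(query_vec, tool_vectors[name]),
                    reverse=True)
    return ranked[:k]   # bind only these tools to the LLM for this turn
```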
Right. Like have you done any, like, experiments with that? Like in terms of, like, knowing if . Yeah. Yeah. Yeah, yeah, so the question was about multi-agent context when you want to orchestrate a large number of tools, I think sub-agents with specializations that manage some small set of tools each.
And how do you kind of move between them? So, if you look at our LangGraph repo, we do have a subdirectory. It's under, it's under LangGraph examples multi-agent. We have a few different notebooks that have multi-agent style kind of layouts, which I would encourage you to look at. I haven't personally done too much work on it.
It seems promising, but I haven't played with it. Multi-agent in general, and these all reference papers. You can also look at the papers. But multi-agent in general for production setting feels quite aggressive. Although, that said, as far as I understand, I remember looking at the code a while ago, Devon and some of the software agents do use, like, multi-agent style setups.
So, maybe have a look at the Devon repo, or there's OpenDevon. Have a look at these notebooks. Those could all be useful if you want to learn more about multi-agent. Yep. So, I'm wondering, like, and I stepped up for this. Yeah, sure. But, I'm wondering, like, we've been talking a lot about kind of the variability on the picture of how many LL and disciples to recall.
Right. So, given that, I'm wondering, how do you think about when it makes sense to wrap the lag inside of an agent versus just make it a chain on its own system? That's a classic question, yeah. So, the question was, when do I use a chain versus an agent?
So, that's very good. So, we kind of touched on it a little bit, kind of here. So, I think that the intuition behind where and why an agent can make sense is simply that sometimes you want your application control flow to be variable. And if you want some flexibility within your application, an agent is a nice idea.
And so, all this self-corrective type stuff we're talking about, the corrective rag thing, those are all kind of agentic flows where the control flow depends upon the grading of the documents. And so, you know, historically, people have largely been building chains. And chains are very reliable and they're easy to ship and all that.
I think with things like LangGraph, and of course, I work at LangChain, so I'll speak my book about LangGraph, but I've really used it quite a bit. And I found it to be very reliable. And we are seeing a lot of people starting to deploy with it because you can actually ship and deploy a reliable agent with LangGraph.
And so, I think a blocker to the ability to kind of ship agents has been reliability. And I think I would actually encourage you to play with the notebooks and look at LangGraph, because it does allow you to have that kind of reliability that would be necessary to ship something in production.
And we do have customers that have LangGraph in production. Whereas a React agent in production is not recommended. Yeah. Yeah. Yep. Sure. A hundred percent. So, okay. So, I think, yeah. So, the different, why would you ever want kind of like an agent, be it LangGraph, React or otherwise, versus not?
And I get, again, I think it goes back to, do you want your application to have any kind of adaptability? So, okay. So, okay. Here's one we can talk about. Routing. I have three different vector stores. I want to be able to route between them. That is kind of a quote-unquote agentic use case because the control flow depends on the question.
So, that's one. You might want routing. You may want self-correction. So, that's kind of what we talked about here a whole bunch with the corrective rack stuff. So, you want routing. You want self-correction. I mean, those are two obvious ones in the context of rack itself. I mean, that's one thing I've often found the problem with the rack systems is the routing thing is a real issue.
Like, you want your system to be flexible enough to deal with questions that are out of domain for your vector store. And you need some kind of dynamism in your application to handle that. So, looking at the question saying, okay, just answer this directly. Don't use the vector store.
Yeah. So, those are like the most popular ones. Self-correction or routing. Yeah. Yeah. I'm wondering if there's anything to be said about building evaluation data sets. Like, question-answer pairs are so domain specific. Yeah. I'm wondering if there are like general best practices, mental models, like things to think about when sitting down to build an evaluation data set.
Yeah, yeah. Okay. So, the question was about kind of building eval data sets. Okay. That's a great question. It's often a very, very challenging part of app development. So, if you have a RAG application that's domain specific, then oftentimes you have some set of canonical question-answer pairs you care about.
You know, it's hard to find, like, very, very general rules for that. I think it depends on your application. I think there's kind of this hurdle: any evaluation is better than no evaluation. So, small-scale eval sets that you can use and that just work are already better than not doing any evaluation. So, I mean, for this particular case, I just looked at the document.
Now, maybe I'll back up and answer. So, one thing I've seen, and I've done this a little bit, is you can use LLM-assisted QA generation. So, here's one thing you can do. And I've done this a little bit with the LangChain docs. I can build a prompt that says: given this document, produce three high-quality question-answer pairs from it, right? And I can just basically load my documents and pass them into that LLM. I use a high-capacity model like Sonnet or GPT-4o and have it generate QA pairs for me and then audit them. That's a nice trick. I've used that. It actually kind of works. You have to be careful with it.
You only would pass it, like, usually one document at a time to keep it really, like, you know, restricted. And you audit them. But actually, that's a nice way to bootstrap your eval sets. That's, like, idea one. And that gets into the whole idea of synthetic data sets. But if you're building, you know, domain-specific synthetic QA pair data sets, that's a nice trick.
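A hedged sketch of that bootstrapping trick -- one document at a time, a high-capacity model, and a prompt asking for a few QA pairs to audit by hand; the model name and prompt wording are illustrative:

```python
# LLM-assisted QA generation for bootstrapping a domain-specific eval set.

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)

qa_prompt = """Here is a document:

{document}

Produce three high-quality question-answer pairs that can be answered solely
from this document. Return them as a JSON list of objects with keys
"question" and "answer"."""

def generate_qa_pairs(document: str) -> str:
    # Audit the output by hand before adding it to the eval set.
    return llm.invoke(qa_prompt.format(document=document)).content
```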
So, basically use an LLM to help bootstrap. I think that's one idea that can help a lot. Yeah. And otherwise, I think that basically trying to stand up a small evaluation set, for example, for RAG is, even in this case, five questions. But you can already see. I can get some nice insights.
It's very simple to set these up. I have my little experiments all over here. And, again, it's only five questions. But it gives me some immediate insights about the reliability of React versus LangGraph agent. So, keep it small. Potentially use synthetic data. Start with something. And then, like, kind of build it out over time.
Now, a whole other thing here is, we didn't talk about this too much, but if you have an app in production then, the way this whole thing kind of comes together is, you can actually have different types of evaluators that run on your app online. We call those online evaluators, okay?
So, this is with our internal app. And this gets back to the question I think he mentioned of, you can have a bunch of evaluators for rag that don't require a reference. So, like, I don't show it here, but basically I can look at, like, document retrieval quality. I can look at my answer relevance or hallucinations.
I can run that online. I can flag cases where things did not work well. And I can actually roll those back into my eval set. So, if I do that, then I actually have this self-perpetuating loop, like Karpathy talked about -- the data flywheel -- where I'm running my app in production, I'm collecting cases of bad behavior that I'm tagging with, like, online evaluation, and I'm rolling those back into my offline eval set.
So, what you would do there is look at the case that the app is doing poorly in production, audit them, correct them, so build, like, a canonical question-answer pair from that and put that back into your test set. And that's a good way to bootstrap and build it up.
So, I'd start cold start, synthetic data, small-scale examples, online evaluation, or some system to check online where it's failing, loop those back in and build it up that way. That's like your data flywheel. Yeah. Actually, I even had a slide on this in one of my older talks. I used to work in self-driving for many years, and I actually was a big fan of Karpathy's stuff at Tesla, and actually I've had his thing here.
This is like the data engine thing of, like, you know, you ship your model, you do some kind of online evaluation, where's it failing, capture those failures, curate them, put them back in your test set, run that as a loop. That's like -- he called it operation vacation because you can go on vacation and the model keeps getting better.
And this was more in the context of, like, training models, because basically all those failed examples, once they're labeled, they become part of your training set. But the same thing applies here with LLM apps. Cool. Yeah? Yeah. So, I just wanted to ask, you just mentioned text to SQL, right? Oh, yeah. So, I just had a question where, like, for example, what happens when you send the schema and generate a SQL query and then run it on the database, and the result is too large or something, so we cannot send it to another LLM to generate a user-friendly answer or something, right? How do we handle that? What do you send it? The result is so big that we cannot fit it all into the context. Yeah. Yeah, the question was on text to SQL. We actually have a pretty nice text to SQL agent example here.
So, it's in LangGraph examples. I think it's in -- is it SQL? Where is it? I'll find it here. Tutorials -- oh, yeah, it's in tutorials. So, SQL agent here. I think a lot's in the prompting. So, basically, in this particular case, I believe Ankush from our team set this up. You can do a couple things. So, you can prompt your SQL agent to -- where is it? It's somewhere where he tells it to -- yeah, it's basically in all these instructions. So, you can instruct the SQL agent, when it's writing its query, to ensure it doesn't extract an excessive amount of context.
And I can't remember exactly where it does that. Okay, yeah, it's right here. Limit your -- always use a limit statement on your query to restrict it to, like, whatever it is, five results. And that's the hard-coded thing here. And then, also, in your SQL agent, you can incorporate a query check node to actually look at the query before you execute it to sanity check for things like this.
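The two ideas boil down to something like this -- an instruction in the query-generation prompt plus a pre-execution check; a simplified sketch, not the exact text from the LangGraph SQL agent tutorial:

```python
# Keep text-to-SQL results manageable: prompt for a LIMIT, and check the query
# before executing it.

query_gen_instructions = (
    "When writing a SQL query, always include a LIMIT clause restricting the "
    "result to at most 5 rows unless the user explicitly asks for more."
)

def check_query(sql: str) -> str:
    """Naive pre-execution check: force a LIMIT if the model forgot one."""
    if "limit" not in sql.lower():
        sql = sql.rstrip().rstrip(";") + " LIMIT 5;"
    return sql
```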
So, basically, I would have a look at the LangGraph examples tutorial SQL agent notebook. I actually ran this and evaluated it, and I found it did work pretty well. So, that's one thing that I would look at. Okay. One more question about SQL. So, a lot of times, like, text-to-SQL does well on, like, the group-by kind of questions, but how do you handle, like, filtering, like, WHERE clauses, when you have a high-cardinality column, right? Like, if you write a SQL query, the WHERE clause has to actually translate to, like, a specific column value -- any value from the column. Yeah. So, the question was related to, like, how do you handle high-cardinality columns, and I guess that is related to restricting the output size. I mean, I'm actually not really a SQL expert, so I'm probably not the right person to ask about the very gory details of text-to-SQL, but, in general, I would kind of consider, can you prompt the LLM effectively? Like, okay, two different things. One, the general ideas here were basically upfront prompting of your LLM to kind of follow some general query formulation criteria, and then, two, an actual query check explicitly to review and confirm it doesn't have some of these issues, like extracting excessive context.
But, for anything more detailed than that, I'm probably not the person for those insights. There might be a specific text-to-SQL deep dive at some other point in this conference, and you should definitely seek that out. But, I haven't done that much with text-to-SQL. Yeah? Yeah? It seems like a lot of, you know, question-answer RAG patterns, but one thing I'm kind of interested in is more long-form document generation -- so, you know, writing a report or something like that. Yeah. Have you seen interesting patterns, or, like, what do you think about, you know, architecting solutions like that with LangGraph or other types of scripting? Yeah. Okay, so the question is related to kind of document generation. That is a really good theme. So, we actually have kind of -- there was an interesting paper.
I did this a while ago. I need to find it. Where is it? So, we have a notebook. If you look in LangGraph examples, STORM. So, this was actually for wiki article generation. Here's kind of the diagram for it. We actually have a video on this, too. I'm actually trying to refresh myself. I did this, like, three or four months ago. But it was basically a multi-agent style setup in LangGraph where, if I recall correctly -- I'm just looking at the flow here myself -- basically, what it did was you give it a topic and it'll initially do this kind of, like, generation of related topics. And it actually uses the multi-agent thing of, like, editors and experts. The experts go and do, like, web research. Don't worry too much about the details. So, the point is, it was an interesting paper and flow for wiki article generation using LangGraph in a multi-step process. And actually, the wikis are, like, pretty good.
I think at the bottom of the notebook I have an example wiki. So, you can see all the code here. That's the graph. Yeah. And then here's the final wiki that you get out from this type thing. So, it's pretty good. Have a look at that notebook. I think, yeah, Jason Liu also had a post on this recently.
this idea of, like, report generation is a theme that we're going to see more and more of. This was one idea that was pretty sophisticated, though. You can probably simplify it a lot. I've done a lot of work just on, like, a simple kind of distillation prompt. Like, perform rag and then have some generation prompt give a bunch of instructions for how I want the output to be formatted.
That also is really effective. Yeah. Yeah. So, the user-fitting is that. What's the statistic that you implement classification that the . So, during the . You might, you know, you might get a little, like, whatever that information that they provide. Like, say, mostly or most . There is some information that is very specific to the .
What is the way to implement the . Yeah. So, the idea, the question was related to user feedback. That's a really good one. So, we do have, I think I mentioned previously, if you look at line graph, where is it? We, I believe we have some user feedback examples.
Thursday, we're definitely going to be announcing something that has a lot of support for user feedback. And I would encourage you to keep an eye out for that. So, Tarasyn is going to launch that here on Thursday. I will look for, I know we have some user feedback examples in line graph, but I will need to find them.
Let me see. Let's see. We probably tweeted about it at some point. I haven't actually done anything with user feedback, though. Let's see. Yeah, maybe I might have to get back to you on that. I thought we had some nice examples with LangGraph. Yeah, I'll have to get back to you on that one. Let's see. Feedback. Customer support might be in here. Hmm. Let's try something else. Look at the LangGraph docs. Hmm. I'd poke around the LangGraph docs. We have a bunch of tutorials in the docs here. Just Google LangGraph documentation.
I'm just poking around here for user feedback. Ah, here we go. Look at this. So, LangGraph how-tos, human in the loop. I would have a look at that. I've not played with that myself. But, yeah. Okay. Right. So, for, like, mid to longer term problem-solving tasks, how do you incorporate user feedback to ask for more information? So, I would have a look at this documentation, because I would imagine it will cover examples along those lines. I haven't personally done that. I would also have a look at the customer support bot, which Will and my team did, because that's an example of a kind of multi-turn interaction between a user and a support agent. As well as the documentation on human in the loop. So, those are two things I would check out there. Nice. Yep. Is there any research into model architecture or training models for agentic reasoning? Optimized for agentic reasoning?
Yeah. For agentic reasoning? Yeah. So, that's kind of an interesting question. So, the question is related to training models specifically for agentic reasoning. I mean, there's a lot of work on prompting approaches for different types of reasoning, for sure.
I'm a little bit less familiar with, like, efforts to fine-tune a model specifically for a particular agentic architecture or use case. But you could imagine it. Most of the work that I've encountered, though, is just using generalist, high-capacity models with tool calling and specific prompting techniques. So, like, React is kind of a particular orchestration flow and prompting technique rather than, you know, a particular model, and you can interchange the LLM according to that.
I think the main thing, typically or historically, for agents has been the ability to perform high-quality and accurate tool calling. Because that's one of the central components of agents. And so that's kind of been the gating thing. And I think model providers have been focused a lot on just high-quality tool calling, which helps kind of all agent architectures.
I haven't seen as much on, like, fine-tuning for one particular architecture. I think it's, like, high-capacity generalist models with tool calling and then prompting. So it's more like in-context learning. That's kind of the trend I've seen, at least. Yeah. Yeah. Can you talk a little bit more about checkpointing? Because I just saw that, and I was thinking it's more about sort of saving the state.
Because I just saw that in the ..., and what I was thinking is that it's more about ... and then sort of saving the state. Yeah, so the checkpointing stuff — actually, this Thursday that's going to be a lot more relevant, because we're launching some stuff to support deployments for LangGraph.
In which case you can do a bunch of different things. You can have a single state that persists across many different sessions. You can also have checkpoints, so you can return to a state and revisit an agent from a particular point. Don't worry about that too much for now.
I think there will be a lot more documentation and context for that on Thursday, when the deployment stuff comes out. But it's good to be somewhat aware of. I would poke around the documentation for a little more on checkpointing, but it really becomes relevant with the stuff we're announcing on Thursday.
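To make the checkpointing idea a bit more concrete, here's a rough sketch of what persisting state across turns can look like in LangGraph today. It assumes you already have a `StateGraph` builder named `graph`; the in-memory `MemorySaver` and the `thread_id` value are illustrative, and the exact APIs may differ from whatever ships on Thursday.

```python
# A rough sketch, not the deployment feature being announced: compile the graph
# with a checkpointer, then reuse a thread_id to resume its saved state later.
from langgraph.checkpoint.memory import MemorySaver

memory = MemorySaver()                     # in-memory; persistent backends exist too
app = graph.compile(checkpointer=memory)   # `graph` is your StateGraph builder (assumed)

config = {"configurable": {"thread_id": "user-42"}}  # illustrative thread id
app.invoke({"messages": [("user", "first question")]}, config)

# Later, the same thread_id picks up the saved state, so the agent
# sees the earlier turns without you re-sending them.
app.invoke({"messages": [("user", "a follow-up question")]}, config)
```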
So I would have a look then. Let's see if we've updated our docs... Yeah, so there is some documentation on it now, but it will become a lot more interesting and relevant come Thursday, when we have a lot more support for deployment. Yep. Okay.
So that's a really good question — what happens when a tool call errors? It depends on the architecture. With the ReAct architecture — let's see if I can find an example of it. So here's the ReAct agent; let's look at one of the traces and see if I have an example.
Basically, the tool call itself will return an error, and the LLM is then expected to self-correct from that error. So that's one approach, at least what we do with the ReAct agent. You can actually see it in the notebook.
If you go to — let me find some traces with that example and I'll pull them up; I think it's in utilities somewhere. Yeah. So basically, with this tool node with fallbacks, if there's an error in the tool call itself, it'll return that error,
and usually the agent — the LLM assistant — will look at that and self-correct its tool call. That's typically how it's done, and it's actually reasonably effective. But again, the nice thing about the other implementation — the custom agent, I call it in the notebook — is that you don't rely on tool calling in this way,
so you can get around this type of issue. But basically, catching the errors in the tool call itself with this code is what's currently done. Let me see if I can actually find an example — one where it gets the answer wrong.
Let's see this one — let's see if we can find a tool call failure. So here's the trace. Let's see... eh, okay, it didn't have a tool call error. But basically, what you'll see in the message history is that the tool itself will return an error message,
and then the LLM will say, oh, okay, I need to retry. And then it'll retry and hopefully get it right. Yeah.
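For reference, the pattern being described is roughly the following — a sketch in the spirit of the notebook's utilities, not the exact code: wrap the tool node with a fallback that turns an exception into a `ToolMessage` the LLM can read and recover from.

```python
# Sketch of a "tool node with fallbacks": if a tool raises, return the error as a
# ToolMessage so the LLM sees it in the message history and can self-correct.
from langchain_core.messages import ToolMessage
from langchain_core.runnables import RunnableLambda
from langgraph.prebuilt import ToolNode

def handle_tool_error(state) -> dict:
    error = state.get("error")
    tool_calls = state["messages"][-1].tool_calls  # the AI message that requested the tool
    return {
        "messages": [
            ToolMessage(
                content=f"Error: {repr(error)}\nPlease fix your mistakes.",
                tool_call_id=tc["id"],
            )
            for tc in tool_calls
        ]
    }

def create_tool_node_with_fallback(tools: list):
    # On exception, fall back to handle_tool_error instead of crashing the graph.
    return ToolNode(tools).with_fallbacks(
        [RunnableLambda(handle_tool_error)], exception_key="error"
    )
```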
Yep. Following up on that question — I use Python, Pydantic, and Jason Liu's Instructor, which is great for that exact problem, like the instant re-validation of the output, especially for those simple little errors. I'm wondering if it would work here — I haven't used it much, but could you use Instructor with Pydantic? Okay, so this is a very good point. Yeah, I'm a big fan of Instructor.
I haven't used it as much, but what you're describing applies to one particular type of tool call. Basically, that pertains, I believe, more to structured outputs, which is indeed a kind of tool call. And when you're using something like a Pydantic schema, you're right — it's very easy to check and correct errors.
So I've found that catching errors with schema validation, like using Instructor, is really good. And we have some other things you can use within LangChain to do the same thing. So that's one type of error that's actually particularly easy to detect and correct. What we show in this notebook, and the code I showed, is more for any general tool.
So this code here will operate on any tool you call, regardless. It doesn't have to do with structured outputs or anything; it's just a more general check for tool call errors. Now, in terms of Instructor with LangChain — maybe I'll just back up a little bit.
LangGraph does not require LangChain at all — that's point one — and neither does LangSmith. So actually, everything we're doing here does not need to use LangChain. That could be a pretty interesting thing to try for the choose-your-own-adventure part.
But basically, in the custom agent part, I use `with_structured_output` to do the grading. So if you look at the retrieval grader here, this is using the LLM with structured output, and here's my grade schema. Try that one with Instructor.
That should work great. You don't need LangChain at all for this, and it'll fit right into LangGraph. So actually, I think it'd be great to use Instructor with LangGraph for this particular use case. And I do agree that Instructor is really nice for that kind of schema validation and error correction.
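As a rough sketch of that idea — hedged, since I haven't wired this up myself — a grader node could use Instructor directly instead of LangChain's `with_structured_output`. The model name, prompt, and the `GradeDocuments` schema below are illustrative, just mirroring the shape of the notebook's grader:

```python
# A minimal sketch: an Instructor-based retrieval grader that could sit inside a
# LangGraph node. Assumes the `instructor` and `openai` packages; names are illustrative.
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field

class GradeDocuments(BaseModel):
    """Binary relevance score for a retrieved document."""
    binary_score: str = Field(description="'yes' if the document is relevant to the question, else 'no'")

client = instructor.from_openai(OpenAI())

def grade_document(question: str, doc_text: str) -> GradeDocuments:
    # Instructor validates the response against the Pydantic schema and,
    # with max_retries > 1, re-asks the model when validation fails.
    return client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        response_model=GradeDocuments,
        max_retries=2,
        messages=[
            {"role": "system", "content": "Grade whether the document is relevant to the question."},
            {"role": "user", "content": f"Question: {question}\n\nDocument: {doc_text}"},
        ],
    )
```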
You could plug and play that. I'm going to make a note — that's a really good choose-your-own-adventure case. Where should I put that? "Try Instructor with LangGraph for grading." Yeah, I like that a lot. Yep. Kind of more of a meta question,
really aligned with what you guys are trying to tackle: what's the path forward to continue making this better? And specifically, where does the RAG pipeline stand right now? What do we still have to do,
and what do you guys have to do? Yeah, so the question was related to RAG in general — where RAG and RAG agents stand. Well, to be honest, a lot of the problems with RAG — and I think about our own internal application, Chat LangChain —
a lot of the problems with RAG are actually retrieval problems. Retrieval is just hard. I'll give a good example: at LangChain we have — I'm trying to remember — something like five million tokens of content across all our docs. And we have a very long tail of integration docs.
You want very high-quality coverage for questions across all of that. There's a lot in how you index all that stuff to ensure you boost retrievals from the more canonical how-to guides, which are much better documented, while still having coverage over long-tail content for long-tail questions.
For example, if you're using raw semantic similarity search, a query can be relevant to, say, a how-to guide that's really well developed and to three random long-tail documents that are not well developed, and they'll all get returned. So how do you overlay different systems on top of that? It could be re-ranking, to promote content you believe to be more accurate or better based on some criteria.
That is all to say, I think with RAG the challenge is really domain-specific retrieval for your application. That's just a hard problem. There's been a lot of work on this — it's been around for a long time — and I think that's really the limiter. And there's kind of no silver bullet.
In our case, we're having to look at the structure of our documents very carefully and design our retrieval strategy based on that doc structure. In particular, we're thinking about applying certain post-retrieval ranking to docs of certain types based upon their importance, and about retrieving a large initial number of docs and then boiling them down with re-ranking based upon importance.
So I still think retrieval is very hard. It's very domain-specific, it depends on the structure of your documentation, and there's kind of no free lunch. The thing that's good for RAG is that context windows are getting much larger for LLMs. So, back to the point I was making before, we're seeing — and we're considering this ourselves — less worry about the exact right chunk size.
You can think about chunking in different ways and then pass full documents to your final model. So I think that part of it's really good. But still, even in that case, you probably need some re-ranking to promote the most important documents.
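One common way to get the "index small chunks, pass full documents" behavior — offered here as a hedged sketch, not necessarily what Chat LangChain uses — is LangChain's `ParentDocumentRetriever`, which searches over child chunks but returns their parent documents. The embedding model and splitter settings below are illustrative.

```python
# A sketch of indexing small chunks while returning full parent documents.
# Assumes `docs` is a list of langchain Document objects you've already loaded.
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

vectorstore = Chroma(collection_name="docs", embedding_function=OpenAIEmbeddings())
docstore = InMemoryStore()  # holds the full parent documents

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=RecursiveCharacterTextSplitter(chunk_size=400),  # illustrative size
)
retriever.add_documents(docs)

# Similarity search runs over the small chunks, but full parent docs come back:
full_docs = retriever.invoke("how do I build an agent?")
```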
So I think retrieval is still quite hard. Even looking at the LangChain docs in particular, the challenge is the overlay of document importance on top of raw semantic similarity search. Take a case where I have a question that's semantically similar to ten different documents. Those documents, though, vary widely in their quality and their higher-level relevance.
Maybe a passage is related, but it might be a general question about how to build an agent, and then some random integration doc talks about building an agent for integration X. I want to make sure that the more canonical, well-developed agent overview doc gets promoted and passed back in the end.
Stuff like that. Sorry, it's a long answer, but basically RAG is hard — I think retrieval is really the hard part. The generation part is getting better and better as long contexts grow. Yep. So, for this re-ranking approach for your documents — what's the metadata? Do you have a relevancy to a particular topic as well as the numerical ranking?
Yeah. Okay. That's a great question. So the question was, when we're talking about this re-ranking, how do you assign this relevance to your documents? What is that? So, I'll just give you what we've been thinking about. I actually think it is, for us, going to be a hand-tuned kind of relevance score based upon our doc structure.
So, if you look at the LangChain docs — go to the LangChain documentation — we have these sections up here: tutorials, how-to guides, conceptual guides, which are really well-developed, more recent, well-curated. These, you can imagine, have a relevance or importance ranking of one, or highest.
So, these are documents that contain very high-quality, well-curated answers that we want to promote and serve to users in the generation phase. However, let's say someone asks a question about one particular integration. If you go to integrations, we have all these pages — components, retrievers — look at the Zep Cloud retriever.
This is some stuff related to Zep Cloud specifically. If someone asks about Zep Cloud, you do want to be able to retrieve that doc. So you want some ability to differentiate between questions that need general answers, in which case you promote your more canonical how-to guides and conceptual docs, versus questions that require retrieval from very specific integration docs, in which case you would still promote that information.
That's kind of the crux of it. And I think we'll probably use kind of manual or heuristic scoring to up-weight or up-rank our core, like, how-to guides and conceptual guides over longer-tailed integration docs. And we might have a router that will indicate whether the question is general or specific.
So, those are the two things I'd probably do: routing on the question side, and then some kind of heuristic relevance, importance, or quality grading on the document side. And that can be packed into the metadata you index along with your chunks. Yep.
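To make that concrete, here's a hedged sketch of what a heuristic importance re-rank could look like — not what Chat LangChain actually ships, just the shape of the idea. It assumes each chunk was indexed with an `importance` value in its metadata (say 1.0 for canonical how-to and conceptual guides, 0.5 for long-tail integration docs), and that you retrieve with similarity scores first.

```python
# A sketch of post-retrieval re-ranking that blends semantic similarity with a
# hand-assigned importance score stored in each chunk's metadata (assumed field name).
def rerank_by_importance(docs_with_scores, weight: float = 0.3, k: int = 4):
    """docs_with_scores: list of (Document, similarity) pairs, e.g. from
    vectorstore.similarity_search_with_relevance_scores(question, k=20)."""
    def combined_score(pair):
        doc, similarity = pair
        importance = doc.metadata.get("importance", 0.5)  # default for unlabeled docs
        return (1 - weight) * similarity + weight * importance

    ranked = sorted(docs_with_scores, key=combined_score, reverse=True)
    return [doc for doc, _ in ranked[:k]]
```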
Yeah. So, let's say a typical RAG application with a question-and-answer flow, but we maintain multi-turn — for example, we maintain the conversation history of the user to create a ... The problem is, let's say a question is asked, and the retrieved chunks are, say, five.
Right. And then a subsequent question is asked that's related to the first question, but still, in the first node you transform the query, and the retrieved chunks are still the same. So I would just get an answer that's more or less the same as the first time.
So my question is, how do you make sure that if the user wants to deep-dive into the document — into more context — that happens? Yeah, so the question, I guess, was: in a multi-turn RAG context, say a user asks an initial question, you retrieve some documents and produce an answer, and then they ask a follow-up like, give me more information about this.
Now, do you want to re-retrieve, or do you want to reference those same docs? And so what happens in my case is, I try to rewrite that question. Okay, so you do a rewrite — you rewrite the question — and then go retrieve the documents.
So most often the documents would be the same as before. The same as before, yeah. So the answers would be mostly the same. Okay, interesting. So the problem there is more of a retrieval problem: you're doing a rewrite, but you're still retrieving the same set of documents.
Now, what do you want to have happen — do you actually want to retrieve different documents, or...? It's kind of like a deep dive, for example. Yeah, but that's the question: what do you mean by deep dive? Say it's a chapter of a book.
You're retrieving only the first page, and you want to retrieve the whole chapter. Okay, then I think a question rewrite would probably not be sufficient. What I would think about more is, for that second pass, you could do something like metadata filtering on your chunks. If your data or documents are partitioned by chapters or sections, I would just do a bulk retrieval of the whole section, or something like that.
So it's more a trick on the retrieval side rather than a rewrite of the query. Because I see what you're saying: you rewrite the query, you might get the same docs back. If you want to guarantee that you actually get a deeper dive into your docs, then maybe it's something in your retriever itself.
You can increase k, so you retrieve more docs, or you can use metadata filtering to ensure you get all the docs in a given chapter — something like the sketch below. So I think it's more a retrieval thing. But that's kind of an interesting point, though. Yeah.
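Here's a minimal sketch of that "pull the whole chapter" idea, assuming the chunks were indexed with a `chapter` metadata field and a Chroma vector store — both assumptions, since how you partition your docs is up to you.

```python
# First pass: normal semantic retrieval to find where the answer lives.
top_hit = vectorstore.similarity_search(question, k=1)[0]
chapter = top_hit.metadata["chapter"]  # assumes chunks carry a "chapter" field

# Deep-dive pass: instead of rewriting the query and getting the same chunks back,
# filter to that chapter and cast a much wider net.
deep_dive_docs = vectorstore.similarity_search(
    question,
    k=50,                          # retrieve many more chunks
    filter={"chapter": chapter},   # Chroma-style metadata filter
)
```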
Cool. Well, I know it's been almost two and a half hours. So, there we go — it's good. I'm hanging out until noon, though. Oh, you're doing the local tutorial? Cool. And it's working fine — you were just asking about the sports questions, whether those are, like, control-group questions? Yeah, okay. This is good.
So, the question was: he's doing the local agent tutorial, and the question is about the eval set. That's a fun one — modify the questions any way you want. The key point was, I wanted some questions that are definitely outside the vector store, so I asked a couple of things about sports,
because I know that's not in my vector store about agents. I think I indexed three blog posts about agents, prompting, and adversarial examples, and I just wanted some orthogonal questions that would force web search. That's the only thing there. But you can actually play with those and modify them and all that.
Yeah. But that's cool — it's working. Are you using Llama 3? Yeah? Cool. The 70B? Oh, you have a laptop big enough for 70B? Not really — okay, okay, you're at the edge. The 8B, yeah, exactly. I mean, it's actually kind of nice.
Whether you can even run the 70B, to be honest — I'm not sure I can even run it. But yeah, that's cool. Nice. Yeah, I can just hang out — oh, yeah, sure. What would be the best way to handle follow-up questions? So, the question was related to how you incorporate multi-turn.
So, if you look at the ReAct agent, it uses chat history as its state. In that case, follow-up questions are captured just in the message history as part of the chat. The current layout of the custom LangGraph agent, though, is a little more single-turn.
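As a quick illustration of that first case — a hedged sketch, with `llm` and `tools` standing in for whatever you've defined in the notebook — the prebuilt ReAct agent's state is just a message list, so multi-turn works by passing the accumulated history back in on the next call:

```python
# Multi-turn with the prebuilt ReAct agent: its state is a list of messages,
# so follow-ups just extend that list. `llm` and `tools` are placeholders here.
from langgraph.prebuilt import create_react_agent

agent = create_react_agent(llm, tools)

# First turn
state = agent.invoke({"messages": [("user", "What is task decomposition?")]})

# Follow-up turn: include the prior messages so the agent sees the whole conversation
state = agent.invoke(
    {"messages": state["messages"] + [("user", "Can you give a concrete example?")]}
)
```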
So the custom agent would be more question-answer. Oh — and the agent actually has to come back with a question, too, about details? Okay, got it. So the question was, how do you modify the agent so that it will come back to the user if it needs more clarification?
Yeah. These particular agent examples don't do that. But, again, I think that's maybe a good takeaway for me: I should add a simple example of multi-turn to these tutorials. I will do that and send it to you.
So now you'd have, like, a simple follow-up question? Yes, I think that should be ... Yes, exactly. So, as I mentioned previously, if you look at LangGraph — let me find it, it's one of our notebooks — the customer support agent. LangGraph examples, customer support, is an example of an agent that has multi-turn dialogue.
But it's complicated, so I'd like to maybe augment these tutorials with a simpler example. I will follow up on that — give me your contact info and I'll send you something. Cool. Well, I'll just sit up here; anyone can come and grab me. Thanks for everything. Hopefully the cookbooks are working. Yeah, it was good — made it two and a half hours. So... thanks.