Back to Index

Building Context-Aware Reasoning Applications with LangChain and LangSmith: Harrison Chase


Chapters

0:00 Intro
1:17 What is LangChain
1:48 Context
2:50 Examples
3:36 Retrieval Augmented Generation
4:18 Fine Tuning
5:13 Reasoning
10:21 Magical Experiences
10:38 Challenges
12:10 Data Engineering
14:21 Evaluation
16:52 Collaboration
18:06 Conclusion

Transcript

Thank you guys for having me, and thank you guys for being here. This is maybe one of the most famous screens of 2023, and yet I believe, and I think we all believe, and that's why we're all here, that this is just the beginning of a lot of amazing things that we're all going to create.

Because as good as ChatGPT is, and as good as the language models that underlie it are, by themselves, they're just the start. By themselves, they don't know about current events, they cannot run the code that you write, and they don't remember their previous interactions with you. In order to get to a future where we have truly personalized and actually helpful AI assistants, we're going to need to take these language models and use them as one part of a larger system, and that's what I think a lot of us in here are trying to do.

These systems will be able to produce seemingly amazing and magical experiences. They'll understand the appropriate context, and they'll be able to reason about it and respond appropriately. At LangChain, we're trying to help teams close that gap between these magical experiences and the work that's actually required to get there, and we believe that behind all of these seemingly magical product moments, there is an extraordinary feat of engineering, and that's why it's awesome to be here at the AI Engineering Summit.

And so I'm going to talk a little bit about some of the approaches that we see work for developers when they're building these context-aware reasoning applications that are going to power the future. So first, I'm going to talk about context, and when I say context, I mean bringing relevant context to the language model so that it can reason about what to do, and bringing that context is really, really important, because if you don't provide that context, no matter how good the language model is, it's not going to be able to figure out what to do.

And so the first type of context, and probably the most common type of context that we see people bringing to the language model, we see them bringing through this instruction prompting type of approach, where they basically tell the language model how to respond to specific scenarios or specific inputs.

This is pretty straightforward, and I think the way to think about it is: if you have a new employee who shows up on their first day of work, you give them an employee handbook that tells them how they should behave in certain scenarios. I equate that to this instruction-prompting technique. It's pretty straightforward, and I think that's why people start with it, and as the models get better and better, this zero-shot type of prompting is going to be able to carry a lot of the relevant context for how you expect the language model to behave.
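To make this concrete, here is a minimal sketch of that instruction-prompting approach, assuming an OpenAI-backed chat model through LangChain (import paths vary a bit by version) and an invented support-bot "handbook":

```python
from langchain.chat_models import ChatOpenAI
from langchain.schema import SystemMessage, HumanMessage

# The "employee handbook": tell the model how to behave rather than show it.
INSTRUCTIONS = (
    "You are a support assistant for an internet provider.\n"
    "- If the user reports an outage, first ask for their postcode.\n"
    "- Never quote prices; point to the pricing page instead.\n"
    "- Keep answers under three sentences."
)

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

def answer(user_message: str) -> str:
    response = llm.invoke([
        SystemMessage(content=INSTRUCTIONS),
        HumanMessage(content=user_message),
    ])
    return response.content

print(answer("My internet has been down all morning."))
```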

There are some cases where telling the language model what to do is actually quite hard, and it becomes better to give it some few-shot examples. It becomes better to give it examples where you show the language model how to behave rather than just tell it how to behave. A few concrete places where this works are where it's actually a little bit difficult to describe how exactly the language model should respond.

So tone, I think, is a good use case for this, and structured output is another. You can give examples of the structured output format, and you can give examples of the desired tone, a little more easily than you could describe my particular tone in words.

Structured output you can describe in words to some extent, but as it starts to get more and more complicated, giving these really specific examples can help.
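To make the contrast with pure instructions concrete, here is a small few-shot sketch for structured output, where the model is shown example input/output pairs instead of a description of the format; the schema and the examples are invented for illustration:

```python
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

# Show the model the format with a couple of worked examples rather than
# describing it in prose. Note the doubled braces to escape literal JSON
# inside the prompt template.
prompt = ChatPromptTemplate.from_messages([
    ("system", "Extract the person mentioned and the sentiment as JSON."),
    ("human", "The barista, Maya, was wonderfully patient with us."),
    ("ai", '{{"person": "Maya", "sentiment": "positive"}}'),
    ("human", "Our driver Tom got lost twice and barely apologised."),
    ("ai", '{{"person": "Tom", "sentiment": "negative"}}'),
    ("human", "{text}"),
])

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
chain = prompt | llm
print(chain.invoke({"text": "Dr. Ahmed explained everything clearly."}).content)
```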

The next type of context is maybe the one that pops to mind first when you hear about bringing context to the language model: retrieval augmented generation. Contrasting this with the first two, retrieval augmented generation uses context not to decide how to respond, but to decide what to base its response on. So the canonical thing is: you have a user question, you run some retrieval strategy, you get back some context, you pass that to the language model, and you say, answer this question based on the context that is provided to you. This is a little bit different from the instructions; it is more like asking someone to take an open-book test, where you can look at the book for the answers, and in this case the book is the text that you pass in as context.
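Here is a minimal sketch of that canonical flow, using a small in-memory FAISS index; it assumes faiss-cpu and an OpenAI key are available, the documents and question are placeholders, and the LangChain import paths vary by version:

```python
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# A toy "knowledge base" to retrieve from.
docs = [
    "Acme's refund window is 30 days from delivery.",
    "Acme ships to the EU and UK; customs fees are the buyer's responsibility.",
]

vectorstore = FAISS.from_texts(docs, OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 1})
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

question = "How long do I have to request a refund?"
# Retrieve relevant context, then ask the model to answer based only on it.
context = "\n".join(d.page_content for d in retriever.get_relevant_documents(question))
answer = llm.invoke(
    f"Answer the question using only this context:\n{context}\n\nQuestion: {question}"
)
print(answer.content)
```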

And then the fourth way that we see people providing context to language models is through fine-tuning, so updating the actual weights of the language model. This is still kind of like in its infancy, and I think we are starting to figure out how best to do this and what scenarios this is good to do in.

One of the things that we have seen is that this is good for the same use cases where few-shot examples are good; it just takes it to another extreme. And so for tone and structured data parsing, these are two use cases where we have seen it be pretty beneficial to start doing some fine-tuning.

And the idea here is that, yes, it can be helpful to have three examples of how your model should respond and what the tone should be, but what if you could give it 10,000 examples and it updates its weights accordingly? And so for cases where the output is in a specific format, and again, you need more examples, you need to show it a lot more than you can tell it, this is where we see fine-tuning starting to become helpful, and I think we will see that grow more and more over time.
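As a rough sketch of what "showing it 10,000 examples" looks like in practice, here is the general shape of a chat-style fine-tuning dataset written out as JSONL; the exact schema depends on the provider, and the examples here are invented:

```python
import json

# Each line is one worked example of the tone and format you want
# the model to absorb through fine-tuning.
examples = [
    {
        "messages": [
            {"role": "system", "content": "Reply in the brand's cheerful, concise voice."},
            {"role": "user", "content": "Where is my order?"},
            {"role": "assistant", "content": "Great question! It left our warehouse yesterday, and tracking is in your inbox."},
        ]
    },
    # ... thousands more examples like this ...
]

with open("finetune_train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```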

So we have talked about context, and now I want to talk a little bit about the reasoning bit, and I think this is the most exciting and the most new bit of it as well. And so we have tried to think and categorize some of the approaches that we have seen to allow these applications to do this reasoning component.

And so we have listed a few of them out here and tried to discern a few different axes along which they kind of vary. So if we think about kind of like just plain old code, this is kind of like the way things were, you know, like a year ago, so a long, long time ago.

And so in code, it's all declared: it says what to run, what the outputs are, what steps to take, things like that. Then we start adding in a language model call, and this is the simplest form of these reasoning applications: here you're using the language model to determine what the output should be, but that's it.

You're not using it to take actions yet, nothing fancy, you're just using it to determine what the output should be, and it's just a single language model call, so you're providing the context and then you're returning the output to the user. If we take it up a little bit, then we start to get into a chain of language model calls, or a chain of language model call to API call back to language model.

And so this, again, is used to decide the output, and here there are multiple calls happening. This can be used to break down more complex tasks into individual components, and it can be used to insert knowledge dynamically in the middle: you do one language model call, then you go fetch some knowledge based on that call, and then you do another one.
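A toy version of such a chain, with the steps fixed in code: one call drafts a search query, a lookup fetches some knowledge, and a second call writes the answer. The search_wikipedia helper is a hypothetical stand-in for whatever API sits in the middle:

```python
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

def search_wikipedia(query: str) -> str:
    # Hypothetical stand-in for a real API call; returns a canned snippet here.
    return "The Eiffel Tower is 330 metres tall."

def answer_with_lookup(question: str) -> str:
    # Step 1: language model call to produce a search query.
    query = llm.invoke(f"Write a short search query for: {question}").content
    # Step 2: deterministic API call to fetch knowledge.
    snippet = search_wikipedia(query)
    # Step 3: language model call to compose the final answer.
    return llm.invoke(
        f"Using this snippet:\n{snippet}\n\nAnswer the question: {question}"
    ).content

print(answer_with_lookup("How tall is the Eiffel Tower?"))
```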

But importantly here, the steps are known. You do this, and then you do this, and then you do this. And so it's a chain of events, and that starts to change a little bit when you use a router. And so in here, you're now using the language model call to start determining which steps to take.

So that's the big difference here. It's no longer just determining the output of the system, but it's determining which steps to take. So you can use it to determine which prompts to use, so route between a prompt that's really good at math problems versus a prompt that's really good at writing English essays.

You can use it to route between language models, since one model might be better than another at a given task. You might want to use Claude because of its long context window, or you might want to use GPT-4 because it's really good at reasoning, and so you have the language model look at the question and decide whether it needs to reason deeply or respond in a long-form fashion. You can use it to determine which branches to go down, and I think another common use case is using it to determine which of several tools to call.

So, do I want to call this tool or that tool, and what should the input to that tool be? And so we have this router here, and before going on to the next step, the main thing that distinguishes a router from it is that there are no cycles.
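A minimal sketch of that kind of routing: one classification call picks a branch, then the chosen prompt handles the request. There are no cycles, just a single fork; the categories and prompts are invented for illustration:

```python
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

PROMPTS = {
    "math": "You are a careful math tutor. Solve step by step:\n{input}",
    "essay": "You are an eloquent writing assistant. Draft a short essay on:\n{input}",
}

def route(user_input: str) -> str:
    # One LLM call decides which branch to take...
    choice = llm.invoke(
        "Classify this request as 'math' or 'essay'. Reply with one word.\n"
        f"Request: {user_input}"
    ).content.strip().lower()
    prompt = PROMPTS.get(choice, PROMPTS["essay"])
    # ...then the chosen prompt handles it. No loops: a single fork, then done.
    return llm.invoke(prompt.format(input=user_input)).content

print(route("What is the integral of x^2?"))
```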

You don't get these loops; you're just choosing which branch to go down. Once you start adding in these loops, this is where we see a lot more complex applications. These are things that we often see being called agents out in the wild. It's essentially a while loop: in that loop you're doing a series of steps, and the language model is determining which steps to do. At some point it can choose whether to end the loop or not; if it ends the loop, then you finish and return to the user, otherwise you go back and continue the loop.
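In rough Python, that loop looks something like the sketch below; the tools, the prompt format, and the parsing are all simplified, hypothetical stand-ins for what a real agent executor does:

```python
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model="gpt-4", temperature=0)

# Hypothetical tools the agent can choose between.
TOOLS = {
    "calculator": lambda expr: str(eval(expr)),  # toy example only; never eval untrusted input
    "search": lambda query: "No results found.",  # canned stand-in
}

def run_agent(task: str, max_steps: int = 5) -> str:
    scratchpad = ""
    for _ in range(max_steps):  # the loop, with a safety cap on iterations
        decision = llm.invoke(
            f"Task: {task}\nSteps so far:\n{scratchpad}\n"
            "Reply either 'FINISH: <answer>' or '<tool>: <tool input>'. "
            f"Available tools: {', '.join(TOOLS)}."
        ).content.strip()
        if decision.startswith("FINISH:"):  # the model chooses to end the loop
            return decision.removeprefix("FINISH:").strip()
        tool_name, _, tool_input = decision.partition(":")
        tool = TOOLS.get(tool_name.strip().lower(), lambda x: "unknown tool")
        scratchpad += f"{decision} -> {tool(tool_input.strip())}\n"
    return "Stopped after reaching the step limit."
```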

And so here the language model is deciding what the outputs are, it decides what steps to take, and you do have these cycles. The last thing, and I think this is largely what we would describe as what AutoGPT did when it took the world by storm, is this idea of an agent where you remove a lot of the guardrails around what steps to take.

So here the sequences of steps that are available are almost like determined by the LLM, and what I mean by this is that here's where you can start doing things like adding in tools that the language model can take, so if you guys are familiar with the Voyager paper, it starts adding in tools and building up a skill set of tools over time, and so some of the actions that the language model can take are dynamically created.

And then I think the other big thing here is that you remove some of the scaffolding of the state machines. If I go back a little bit: a lot of these cycles that we see in the wild break things down into discrete states.

The most common ones that we see are plan, execute, and validate. So you ask the language model to plan what to do, then it executes the plan, and then you validate the result, often with another language model call or something like that. And I think the big difference between that and the autonomous-agent style of thing is that there you're implicitly asking the agent to do all of those things in one go.

It should know when it should plan, it should know when it should validate, and it should know when it should kind of like determine what action to take. And you're asking it all to do that implicitly. You don't have these kind of like distinct sequences of steps laid out in the code.
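To make that contrast concrete, here is roughly what the explicit plan / execute / validate scaffolding looks like when it is laid out in code rather than left for the agent to do implicitly; each stage is just another language model call in this sketch:

```python
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model="gpt-4", temperature=0)

def plan(task: str) -> str:
    return llm.invoke(f"List the steps needed to accomplish: {task}").content

def execute(task: str, steps: str) -> str:
    return llm.invoke(f"Carry out these steps for '{task}':\n{steps}").content

def validate(task: str, result: str) -> bool:
    verdict = llm.invoke(
        f"Does this result fully accomplish '{task}'? Answer YES or NO.\n{result}"
    ).content
    return verdict.strip().upper().startswith("YES")

def run(task: str, retries: int = 2) -> str:
    # The states -- plan, execute, validate -- are explicit in the code,
    # rather than left for the agent to decide on its own.
    for _ in range(retries + 1):
        result = execute(task, plan(task))
        if validate(task, result):
            return result
    return result
```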

And so this is a little bit about how we're thinking about it. The thing that I like to say here as well, which goes back to the beginning, is that the main thing we believe is that it's still just extremely early on in this space.

We still think it's the beginning. And this could, you know, in three months be kind of irrelevant as the space progresses. So I would just keep that in mind. If we think about kind of like some of the magical experiences like this where it can reason over the relevant context, what is it going to take to kind of like build it under the hood?

What is the engineering that's going to go in to all these seemingly magical experiences? And so this is an example of what could be going under the hood of something like this. It's going to be a challenging experience to build these complex systems. And that's why we're building some of the tooling like this, what you see here, to help debug, understand, and iterate on these systems of the future.

And so what exactly are the challenges associated with building these complex context-aware reasoning applications? The first is the orchestration layer: figuring out which of the different reasoning architectures, or cognitive architectures, you should be using. Should you be using a simple chain? Should you be using a router, or a more complex agent?

And I think the thing to remember here is that it's not necessarily that one is better than the other or superior to the other. They all have kind of like their pros and cons and strengths and weaknesses. So chains are really good because you have more control over the sequence of steps that are taken.

Agents are better because they can more dynamically react to unexpected inputs and handle edge cases. And so being able to choose the right cognitive architecture that you want and being able to quickly experiment with a bunch of other ones are part of what inspired the initial release of LangChain and kind of how we aim to help people prototype these types of applications.

And then LangSmith, which is this thing here, provides a lot of visibility into what actually is going on. As these applications start to get more and more complex, understanding what exact sequences of tools are being used and what exact sequences of language model calls are being made becomes increasingly important.
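For reference, at around the time of this talk, turning on LangSmith tracing for a LangChain application was typically a matter of setting a couple of environment variables before running your code; treat the exact variable names as version-dependent and check the current docs:

```python
import os

# Commonly documented settings for sending traces to LangSmith at the time;
# the key itself is a placeholder.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your LangSmith API key>"
os.environ["LANGCHAIN_PROJECT"] = "my-agent-experiments"

# Any chain or agent run after this point is traced, so every tool call and
# language model call can be inspected in the LangSmith UI.
```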

Another big thing that we see people struggling with and spending a lot of time on is good old-fashioned data engineering. A lot of this comes down to providing the right context to the language models, and the right context is often data. And so you need to have ways to load that data.

You need to have ways to transform that data, to transport that data, and then you often want to have observability into what exact data is getting passed around and where. LangChain itself has a lot of open-source modules for loading and transforming that data, and LangSmith we often see being really useful for debugging what exactly that data looks like by the time it gets to the language model.
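As a small example of that loading-and-transforming step, here is a sketch using LangChain's document loaders and text splitters; the URL is a placeholder and the import paths move around between versions:

```python
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load raw documents, then split them into chunks sized for retrieval.
docs = WebBaseLoader("https://example.com/handbook").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# Observability starts with simply looking at what you are about to index.
print(f"{len(chunks)} chunks")
print(chunks[0].page_content[:200])
```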

Have you extracted the right documents from your vector store? Have you transformed and formatted them in the right way, so that it is clear to the language model what is actually in them? These are all things that you are going to want to be able to debug, so that no small errors or issues pop up.

And then the third thing that we see a lot of people spending time on when building these applications is good old-fashioned prompt engineering. The main new thing here is language models, and the main way of interacting with language models is through prompts. And so being able to understand what exactly the fully formatted prompt looks like by the time it goes into the language model is really important.

How are you combining the system instructions with maybe the few-shot examples, any retrieved context, the chat history that you have going on, and any previous steps that the agent took? What does this all look like by the time it gets to the language model? And what does this look like in the middle of a complex application?
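Spelled out, the fully formatted prompt is just the concatenation of all of those pieces; here is a rough sketch of how they might come together, where every section is a placeholder:

```python
def build_prompt(system, few_shot_examples, retrieved_context, chat_history, agent_steps, question):
    # The model only ever sees this one final artifact, so this is the thing
    # you actually need to be able to inspect and debug.
    return "\n\n".join([
        system,
        "Examples:\n" + "\n".join(few_shot_examples),
        "Context:\n" + retrieved_context,
        "Conversation so far:\n" + "\n".join(chat_history),
        "Previous agent steps:\n" + "\n".join(agent_steps),
        "User question: " + question,
    ])
```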

So it is easy enough to test and debug this if it is the first call, the first part of the system. But after it has already done three of these steps, if you want to debug what that fully formatted prompt looks like, being able to do that becomes increasingly difficult as the systems scale up and become more entangled.

And so we have tried to make it really easy to hop into any kind of like particular language model call at any point in time, open it up in a playground like this so you can edit it directly and experiment with that prompt engineering and go kind of like change some of the instructions and see how it responds or swap out model providers so that you can see if another model provider does better.

The other big challenge with these language model applications, and one that is probably worth a talk of its own, is evaluating them. I think evaluation is really hard for a few reasons; the two primary ones are a lack of data and a lack of good metrics.

So comparing to traditional kind of like data science and machine learning, with those you generally started with a data set. You needed that to build your model and so then when it came time to evaluate it you at least had those data points that you could look at and evaluate on.

And I think that is a little bit different with a lot of these LLM applications because these models are fantastic zero shot kind of like learners. That is kind of like the whole exciting bit of them. And so you can get to a working MVP without building up kind of like any data set at all.

And that is awesome. But it does make it a little bit of a challenge when it comes to evaluating them, because you don't have those data points. And so one of the things that we often encourage a lot of people to do, and try to help them do as well, is build up these data sets and iterate on them. Those can come from labeling data points by hand, looking at production traffic and pulling things in, or auto-generating things with LLMs.
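One lightweight way to start building that data set is simply to log production inputs and outputs as candidate examples and curate them later; the shape below is generic JSONL that could then be uploaded to whatever evaluation tooling you use:

```python
import json

def log_example(path: str, inputs: dict, outputs: dict, source: str = "production") -> None:
    # Append one candidate evaluation example; hand-label or curate these later.
    record = {"inputs": inputs, "outputs": outputs, "source": source}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

log_example(
    "eval_dataset.jsonl",
    inputs={"question": "How do I reset my password?"},
    outputs={"answer": "Use the 'Forgot password' link on the sign-in page."},
)
```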

The second big challenge in evaluation is lack of metrics. I think most traditional kind of like quantitative metrics don't perform super well for large unstructured outputs. A lot of what we see people doing is still doing a kind of like vibe check to kind of like see how the model is performing.

And as unsatisfying as that is, I still think that is probably the best way to gain kind of like intuition as to what is going on. So a lot of what we try to do is make it really easy to observe the outputs and the inputs of the language models so that you can build up that intuition.

In terms of more quantitative and systematic metrics, we are very bullish on LLM-assisted evaluation. So using LLMs to evaluate the outputs. And then I think maybe the biggest thing that we see people doing in production is just keeping track of feedback, whether it be direct or indirect feedback. So do they leave kind of like a thumbs up or a thumbs down on your application?

That is an example of direct feedback that you are gathering. An example of indirect feedback might be whether they click on a link, which might be a sign that you provided a good suggestion. Or if they respond really confused to your chatbot, that might be a good indication that your chatbot actually did not perform well.
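A tiny sketch of both ideas together: record direct feedback (a thumbs up or down) alongside an LLM-assisted score for each interaction, so there is something to track over time; the grading prompt and the storage here are simplified placeholders:

```python
import json
from langchain.chat_models import ChatOpenAI

judge = ChatOpenAI(model="gpt-4", temperature=0)

def llm_score(question: str, answer: str) -> int:
    # LLM-assisted evaluation: ask a model to grade the output from 1 to 5.
    reply = judge.invoke(
        f"Rate this answer to '{question}' from 1 (bad) to 5 (great). "
        f"Reply with a single digit.\n{answer}"
    ).content
    return int(reply.strip()[0])  # assumes the judge follows the single-digit instruction

def record_feedback(run_id: str, thumbs_up: bool, question: str, answer: str) -> None:
    # Direct feedback (thumbs) plus the automated judge score, logged per run.
    entry = {"run_id": run_id, "thumbs_up": thumbs_up, "judge_score": llm_score(question, answer)}
    with open("feedback_log.jsonl", "a") as f:
        f.write(json.dumps(entry) + "\n")
```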

So tracking these over time and doing A/B testing with that using kind of like traditional A/B testing software can be pretty impactful for gathering a sense online of how your model is doing. And then the last interesting thing that we are spending a lot of time thinking about is collaboration.

So as these systems get bigger and bigger, building them is doubtless going to be a collaboration among a lot of people. And so who exactly is working on these systems? Is it all AI engineers? Is it a combination of AI engineers and data engineers and data scientists and product managers?

And I think one of the interesting trends that we are seeing is that it is still a little bit unclear what the best skill sets for this new AI-engineer type of role are. And there could very well be a bunch of different skill sets that are valuable. So going back to the two things that we see making up a lot of these applications: the context awareness and the reasoning bit.

The context awareness is bringing the right context to these applications. You often need kind of like a data engineering team to get in there and assist with that. The reasoning bit is often done through prompting. And oftentimes that is best done by non-technical people who can really outline kind of like the exact specification of the app that they are building.

Whether they be product managers or subject matter experts. And so how do you enable collaboration between these two different types of folks? And what exactly does that look like? I do not think that is something anyone really knows yet, or has definitively solved. But I think that is a really interesting trend that we are thinking a lot about going forward.

And so I think like the main thing that I want to leave you all with is that the big thing that we believe is that it is still really, really early on in this journey. It is just the beginning. As crazy as things have been over the past year, they are hopefully going to get even crazier.

You saw an incredible demo of GPT-4V. Things like that are going to change things even further. And so we think behind all of these things it is going to take a lot of engineering, and we are trying to build some of the tooling to help enable that. And I think you guys are all on the right track towards becoming those types of engineers by being at a conference like this.

So thank you, swyx, for having me. Thank you guys for being here. Have a good rest of your day.