Building Context-Aware Reasoning Applications with LangChain and LangSmith: Harrison Chase

Chapters
0:00 Intro
1:17 What is LangChain
1:48 Context
2:50 Examples
3:36 Retrieval Augmented Generation
4:18 Fine Tuning
5:13 Reasoning
10:21 Magical Experiences
10:38 Challenges
12:10 Data Engineering
14:21 Evaluation
16:52 Collaboration
18:06 Conclusion
00:00:00.000 |
Thank you guys for having me, and then thank you guys for being here. 00:00:15.900 |
This is maybe one of the most famous screens of 2023, and yet I believe, and I think we 00:00:25.420 |
all believe, and that's why we're all here, that this is just the beginning of a lot of 00:00:29.540 |
amazing things that we're all going to create. 00:00:32.920 |
Because as good as ChatGPT is, and as good as the language models that underlie it are, 00:00:41.700 |
by themselves they don't know about current events, they cannot run the code that you write, 00:00:46.620 |
and they don't remember their previous interactions with you. 00:00:49.320 |
In order to get to a future where we have truly personalized and actually helpful AI assistants, 00:00:55.800 |
we're going to need to take these language models and use them as one part of a larger 00:01:00.480 |
system, and that's what I think a lot of us in here are trying to do. 00:01:04.160 |
These systems will be able to produce seemingly amazing and magical experiences. 00:01:11.080 |
They'll understand the appropriate context, and they'll be able to reason about it and respond accordingly. 00:01:17.460 |
At LangChain, we're trying to help teams close that gap between these magical experiences and 00:01:24.900 |
the work that's actually required to get there, and we believe that behind all of these seemingly 00:01:29.460 |
magical product moments, there is an extraordinary feat of engineering, and that's why it's awesome. 00:01:36.460 |
And so I'm going to talk a little bit about some of the approaches that we see work for 00:01:41.320 |
developers when they're building these context-aware reasoning applications that are going to power these experiences. 00:01:47.680 |
So first, I'm going to talk about context, and when I say context, I mean bringing relevant 00:01:54.140 |
context to the language model so that it can reason about what to do, and bringing that context 00:01:58.200 |
is really, really important, because if you don't provide that context, no matter how good the 00:02:01.920 |
language model is, it's not going to be able to figure out what to do. 00:02:06.820 |
And so the first type of context, and probably the most common type of context that we see 00:02:11.360 |
people bringing to the language model, we see them bringing through this instruction prompting 00:02:15.840 |
type of approach, where they basically tell the language model how to respond in specific situations. 00:02:23.840 |
This is pretty straightforward, and I think the way to think about it is if you have a new 00:02:27.000 |
employee who shows up on the first day of work, you give them an employee handbook and 00:02:30.780 |
it tells them how they should behave in certain scenarios, and I equate that to kind of 00:02:35.100 |
like this instruction prompting technique. It's pretty straightforward, I think that's why 00:02:40.400 |
people start with it, and as the models get better and better, this zero-shot type of prompting 00:02:44.640 |
is going to be able to carry a lot of the relevant context for how you expect the language model to respond. 00:02:50.680 |
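As a rough sketch of what this instruction prompting might look like in code (assuming the 2023-era langchain API and an OPENAI_API_KEY in the environment; the handbook-style instructions are made up):

```python
# Instruction prompting: the "employee handbook" goes in the system message.
from langchain.chat_models import ChatOpenAI
from langchain.schema import SystemMessage, HumanMessage

llm = ChatOpenAI()  # reads OPENAI_API_KEY from the environment

messages = [
    SystemMessage(content=(
        "You are a support assistant for Acme Corp. "  # hypothetical instructions
        "Answer in two sentences or fewer. "
        "If you are unsure, say so and offer to escalate to a human."
    )),
    HumanMessage(content="How do I reset my password?"),
]
print(llm(messages).content)
```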
There are some cases where telling the language model is actually quite hard, and it becomes 00:02:57.180 |
better to give it examples where you show the language model how to behave rather than tell it. 00:03:03.660 |
So I think a few concrete places where this works is where it's actually a little bit difficult 00:03:08.420 |
to describe how exactly the language model should respond. 00:03:12.380 |
So tone I think is a good use case for this, and then also structured output is a good use case. 00:03:17.980 |
You can give examples of the structured output format, and you can give examples of the output 00:03:23.060 |
tone a little bit more easily than you could describe a particular tone in language. 00:03:27.540 |
Structured output you can describe to some extent, but I think as it 00:03:30.860 |
starts to get more and more complicated, giving these really specific examples can help. 00:03:36.500 |
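A minimal few-shot sketch of the tone case, assuming langchain's prompt templates; the examples themselves are invented:

```python
# Few-shot prompting: show the tone instead of describing it.
from langchain.prompts import PromptTemplate, FewShotPromptTemplate

example_prompt = PromptTemplate(
    input_variables=["question", "answer"],
    template="Q: {question}\nA: {answer}",
)

examples = [  # made-up examples demonstrating the desired tone
    {"question": "Is my order late?",
     "answer": "Ugh, shipping gremlins! Let me hunt that package down."},
    {"question": "Can I get a refund?",
     "answer": "Absolutely, no drama. Kicking that off right now."},
]

prompt = FewShotPromptTemplate(
    examples=examples,
    example_prompt=example_prompt,
    prefix="Answer in the same playful tone as these examples.",
    suffix="Q: {question}\nA:",
    input_variables=["question"],
)
print(prompt.format(question="Do you ship to Canada?"))
```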
The next type of context is maybe the one that pops to mind most when you hear about 00:03:42.560 |
bringing context to the language model. Contrasting this with the first two, retrieval augmented 00:03:46.740 |
generation uses context not to decide how to respond, but to decide what to base its response on. 00:03:54.600 |
So the canonical thing is you have a user question, you do some retrieval strategy, you get back 00:03:59.160 |
some context, you pass that to the language model, and you say: answer this question based on this context. 00:04:05.240 |
So this is a little bit different from the instructions; it is maybe more like asking 00:04:08.880 |
someone to take an open-book test: you can look at the book, you can look 00:04:13.000 |
at the answers, and in this case the answers are the text that you pass in as this context. 00:04:19.180 |
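A sketch of that canonical retrieve-then-answer flow; the `retrieve` function here returns canned snippets as a stand-in for a real retrieval strategy (vector store, keyword search, and so on):

```python
# Retrieval augmented generation: retrieve context, then answer from it.
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

llm = ChatOpenAI()
prompt = ChatPromptTemplate.from_template(
    "Answer the question based only on this context:\n\n{context}\n\nQuestion: {question}"
)

def retrieve(question: str) -> list:
    # Stand-in for a real retrieval step, so the sketch runs on its own.
    return ["LangChain is a framework for building LLM applications.",
            "LangSmith adds tracing and evaluation on top of it."]

def answer(question: str) -> str:
    context = "\n\n".join(retrieve(question))
    return llm(prompt.format_messages(context=context, question=question)).content
```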
And then the fourth way that we see people providing context to language models is through fine-tuning, 00:04:24.060 |
so updating the actual weights of the language model. 00:04:28.020 |
This is still kind of like in its infancy, and I think we are starting to figure out how best 00:04:31.520 |
to do this and what scenarios this is good to do in. 00:04:35.700 |
One of the things that we have seen is that this is good for the same use cases where few-shot 00:04:39.740 |
examples are kind of good; it just takes it to another extreme. 00:04:42.780 |
And so for tone and structured data parsing, these are two use cases where we have seen it 00:04:47.600 |
be pretty beneficial to start doing some fine-tuning. 00:04:50.140 |
And the idea here is that, yeah, it can be helpful to have three examples of how your model should 00:04:54.440 |
respond and what the tone there should be, but what if you could give it 10,000 examples instead? 00:05:00.100 |
And so I think for those where the output is in a specific format, and again, you need 00:05:04.740 |
more examples, you need to show it a lot more than you can tell it, this is where we see 00:05:08.340 |
fine-tuning starting to become helpful, and I think we will see that grow more and more over time. 00:05:12.300 |
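As a concrete sketch of the structured-output case, fine-tuning data might be prepared as JSONL in OpenAI's chat fine-tuning format; the examples here are made up, and in practice you would collect thousands:

```python
# Preparing fine-tuning data: one JSON object per line, chat format.
import json

examples = [  # two made-up examples; real datasets are far larger
    {"in": "Is my order late?",   "out": '{"intent": "order_status"}'},
    {"in": "Can I get a refund?", "out": '{"intent": "refund"}'},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        record = {"messages": [
            {"role": "system", "content": "Extract the intent as JSON."},
            {"role": "user", "content": ex["in"]},
            {"role": "assistant", "content": ex["out"]},
        ]}
        f.write(json.dumps(record) + "\n")
```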
So we have talked about context, and now I want to talk a little bit about the reasoning 00:05:17.280 |
bit, and I think this is the most exciting and the newest bit of it as well. 00:05:20.820 |
And so we have tried to think and categorize some of the approaches that we have seen to 00:05:24.860 |
allow these applications to do this reasoning component. 00:05:29.220 |
And so we have listed a few of them out here and tried to discern a few different axes along which they vary. 00:05:35.460 |
So if we think about kind of like just plain old code, this is kind of like the way things 00:05:38.820 |
were, you know, like a year ago, so a long, long time ago. 00:05:44.000 |
And so in code, it's all there, it's all declared: it says what to run, it says what the outputs should be. 00:05:52.200 |
We start adding in a language model call, and so this is like the simplest form of these reasoning 00:05:57.240 |
applications, and here you're using the language model to determine what the output should be. 00:06:02.480 |
You're not using it to take actions yet, nothing fancy, you're just using it to determine what 00:06:05.520 |
the output should be, and it's just a single language model call, so you're providing the 00:06:08.960 |
context and then you're returning the output to the user. 00:06:13.240 |
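In code, that simplest form is just one call (a minimal sketch, assuming the 2023-era langchain API):

```python
# A single language model call: context in, output back to the user.
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI()  # reads OPENAI_API_KEY from the environment
print(llm.predict(
    "Summarize in one sentence: LangChain helps developers build "
    "context-aware reasoning applications."
))
```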
If we take it up a little bit, then we start to get into a chain of language model calls or 00:06:18.340 |
a chain that goes from a language model call to an API call and back to a language model. 00:06:22.480 |
And so this is again used to decide the output, not which steps to run. 00:06:28.920 |
And here, there's multiple calls that are happening, and this can be used to break down more complex 00:06:35.460 |
tasks into individual components, it can be used to insert knowledge dynamically in the 00:06:39.500 |
middle of kind of like one language model call, then you go fetch some knowledge based on that 00:06:43.240 |
language model call, and then you do another one. 00:06:47.480 |
You do this, and then you do this, and then you do this. 00:06:49.480 |
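A sketch of such a chain, with a hypothetical `lookup_weather` function standing in for the API call in the middle; the steps are fixed in code, and the model only decides the outputs:

```python
# A chain: model call -> knowledge fetch -> model call, in a fixed order.
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI()

def lookup_weather(city: str) -> str:
    # Hypothetical stand-in for a real API call.
    return f"Current weather in {city}: 18C, light rain."

def chain(question: str) -> str:
    # Call 1: the model extracts what to fetch.
    city = llm.predict(
        f"Which city is this question about? Reply with only the city name.\n\n{question}"
    ).strip()
    # Fixed step: fetch knowledge based on that output.
    facts = lookup_weather(city)
    # Call 2: the model produces the final output from the fetched context.
    return llm.predict(f"Answer using these facts:\n{facts}\n\nQuestion: {question}")
```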
And so it's a chain of events, and that starts to change a little bit when you use a router. 00:06:55.320 |
And so in here, you're now using the language model call to start determining which steps to take. 00:07:01.200 |
It's no longer just determining the output of the system, but it's determining which steps to take. 00:07:05.200 |
So you can use it to determine which prompts to use, so route between a prompt that's really 00:07:09.520 |
good at math problems versus a prompt that's really good at writing English essays. 00:07:14.740 |
You can use it to route between language models, so one model might be better than another. 00:07:18.620 |
You might want to use Claude because of its long context window, or you might want to use another model for its other strengths. 00:07:24.500 |
And so having the language model look at the question and decide whether it needs to reason 00:07:27.740 |
or whether it wants to respond in a long-form fashion, you can determine which branches 00:07:31.040 |
to go down, or I think another common use case is using it to determine which of several tools to call. 00:07:37.120 |
So do I want to call this tool, or do I want to call that tool, and what should the input be? 00:07:41.760 |
And so we have this router here, and I think before going on to the next step, the main thing 00:07:46.280 |
here that distinguishes it from that step is that there's no kind of like cycles. 00:07:53.400 |
You're just choosing kind of like which branch to go down. 00:07:56.160 |
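A minimal routing sketch: the first model call picks the branch, and the chosen prompt handles the question; the prompt texts are made up:

```python
# A router: the model decides which branch to take, not the final answer.
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI()

PROMPTS = {  # hypothetical branch prompts
    "math": "You are a careful math tutor. Show your work.\n\n{q}",
    "essay": "You are an eloquent essay writer.\n\n{q}",
}

def route(question: str) -> str:
    # The routing call: classify first, then dispatch.
    choice = llm.predict(
        f"Classify this question as 'math' or 'essay'. One word only.\n\n{question}"
    ).strip().lower()
    template = PROMPTS.get(choice, PROMPTS["essay"])  # fall back if it goes off-script
    return llm.predict(template.format(q=question))
```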
Once you start adding in these loops, this is where we see a lot more complex applications. 00:08:01.020 |
These are things that we often see being called agents, kind of like out in the wild, and 00:08:07.100 |
it's essentially kind of like a while loop, and then in that loop you're doing a series 00:08:11.340 |
of steps, and the language model is determining which steps to do, and then at some point there's 00:08:15.840 |
a point where it can choose whether to end the loop or not, and if it ends the loop then 00:08:19.640 |
you finish and return to the user, otherwise you go back and continue the loop. 00:08:23.840 |
And so here you get the language model deciding what the outputs are, it decides what steps to take. 00:08:31.960 |
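Stripped down, that loop might look like the sketch below; the tool names and the naive parsing are simplifications for illustration:

```python
# An agent as a while loop: the model picks each step and decides when to stop.
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI()

def run_tool(name: str, arg: str) -> str:
    # Toy tools for the sketch; eval() is unsafe outside a demo.
    tools = {"search": lambda q: f"(search results for {q!r})",
             "calculator": lambda e: str(eval(e))}
    return tools[name](arg)

def agent(question: str, max_steps: int = 5) -> str:
    scratchpad = ""
    for _ in range(max_steps):  # the loop; the model decides when to exit
        decision = llm.predict(
            "Reply either 'TOOL <search|calculator> <input>' or 'FINAL <answer>'.\n"
            f"Question: {question}\nSteps so far:\n{scratchpad}"
        ).strip()
        if decision.startswith("FINAL"):
            return decision[len("FINAL"):].strip()  # model chose to end the loop
        _, name, arg = decision.split(" ", 2)       # naive parsing for the sketch
        scratchpad += f"{name}({arg}) -> {run_tool(name, arg)}\n"
    return "Stopped after max_steps."
```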
The last thing, and I think this is largely what we would describe as kind of like what 00:08:37.860 |
AutoGPT did that took the world by storm, is this idea of an agent where you kind of like remove 00:08:43.760 |
a lot of the kind of like guardrails around what steps to take. 00:08:50.040 |
So here the sequences of steps that are available are almost like determined by the LLM, and 00:08:55.860 |
what I mean by this is that here's where you can start doing things like adding in tools 00:08:59.760 |
that the language model can take, so if you guys are familiar with the Voyager paper, it 00:09:04.760 |
starts adding in tools and building up a skill set of tools over time, and so some of the actions 00:09:08.700 |
that the language model can take are dynamically created. 00:09:12.320 |
And then I think the other big thing here is that you remove some of the scaffolding from 00:09:17.540 |
the application. If I go back a little bit, a lot of these kind of like cycles that we 00:09:22.940 |
see in the wild break things down into discrete states. 00:09:26.920 |
The most common one that we see are kind of like plan, execute, and validate. 00:09:30.380 |
So you ask the language model to plan what to do, then it goes and does it, and then you 00:09:33.140 |
validate it, often with a language model call or something like that. 00:09:36.720 |
And I think the big difference between that and then the autonomous agent style thing is 00:09:41.100 |
that here you're implicitly asking the agent to do all of those things in one go. 00:09:45.800 |
It should know when it should plan, it should know when it should validate, and it should 00:09:49.080 |
know when it should kind of like determine what action to take. 00:09:51.480 |
And you're asking it all to do that implicitly. 00:09:53.700 |
You don't have these kind of like distinct sequences of steps laid out in the code. 00:10:00.500 |
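For contrast, a sketch of the scaffolded version, where plan, execute, and validate are explicit stages in code rather than something the agent decides implicitly; the prompt texts are invented:

```python
# Explicit scaffolding: plan, execute, validate as distinct stages in code.
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI()

def plan_execute_validate(task: str, retries: int = 2) -> str:
    plan = llm.predict(f"Write a short numbered plan for this task:\n{task}")
    result = llm.predict(
        f"Carry out this plan and return only the result.\nTask: {task}\nPlan:\n{plan}"
    )
    verdict = llm.predict(
        f"Does this result complete the task? Answer YES or NO.\nTask: {task}\nResult:\n{result}"
    )
    if verdict.strip().upper().startswith("YES") or retries == 0:
        return result
    return plan_execute_validate(task, retries - 1)  # re-plan on failed validation
```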
And so this is a little bit about how we're thinking about it. 00:10:02.560 |
The thing that I like to say when saying this as well, which goes 00:10:08.520 |
back to the beginning, is that the main thing that we think is that it's still just extremely early on. 00:10:14.560 |
And this could, you know, in three months be kind of irrelevant as the space progresses. 00:10:21.880 |
If we think about kind of like some of the magical experiences like this where it can reason over 00:10:27.760 |
the relevant context, what is it going to take to kind of like build it under the hood? 00:10:33.000 |
What is the engineering that's going to go in to all these seemingly magical experiences? 00:10:39.160 |
And so this is an example of what could be going under the hood of something like this. 00:10:43.780 |
It's going to be a challenging experience to build these complex systems. 00:10:48.000 |
And that's why we're building some of the tooling like this, what you see here, to help debug, 00:10:52.740 |
understand, and iterate on these systems of the future. 00:10:56.680 |
And so what exactly are the challenges associated with building these complex context-aware reasoning applications? 00:11:04.780 |
The first is kind of just the orchestration layer. 00:11:07.240 |
So figuring out which of the different reasoning kind of like cognitive architectures you should use. 00:11:15.640 |
Should you be using a router, a more complex agent? 00:11:18.120 |
And I think the thing to remember here is that it's not necessarily that one is better than another. 00:11:24.000 |
They all have kind of like their pros and cons and strengths and weaknesses. 00:11:26.680 |
So chains are really good because you have more control over the sequence of steps that will happen. 00:11:31.720 |
Agents are better because they can more dynamically react to unexpected inputs and handle edge cases. 00:11:36.880 |
And so being able to choose the right cognitive architecture that you want and being able 00:11:41.160 |
to quickly experiment with a bunch of other ones are part of what inspired the initial 00:11:45.440 |
release of LangChain and kind of how we aim to help people prototype these types of applications. 00:11:52.780 |
And then LangSmith, which is this thing here, provides a lot of visibility into what actually 00:11:56.620 |
is going on as these applications start to get more and more complex, understanding what 00:12:02.380 |
exact sequences of tools are being used, what exact sequences of language model calls are being made. 00:12:11.720 |
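Getting that visibility is mostly configuration: LangSmith tracing is enabled through environment variables (this assumes a LangSmith account and API key):

```python
# Turning on LangSmith tracing; every chain or agent run is then recorded
# with its nested model calls and tool calls.
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "my-agent-dev"  # optional: group runs by project

# ...then run any chain or agent as usual; traces appear in the LangSmith UI.
```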
Another big thing that we see people struggling with and spending a lot of time on is good 00:12:17.340 |
data engineering. So a lot of this comes down to providing the right context to the language models. 00:12:23.340 |
And so you need to have ways to load that data. 00:12:25.340 |
You need to have ways to transform that data, to transport that data. 00:12:28.560 |
And then you often want to have observability into what exact data is getting passed around. 00:12:32.880 |
And so LangChain itself has a lot of open source modules for loading that data and transforming it. 00:12:39.220 |
And then LangSmith, we often see being really useful for debugging what exactly does that data 00:12:43.580 |
look like by the time it is getting to the language model. 00:12:46.600 |
Have you extracted the right documents from your vector store? 00:12:49.600 |
Have you transformed them and formatted them in the right way, where it is clear to the language model? 00:12:54.640 |
These are all things that you are going to want to be able to debug so there are no little surprises hiding in there. 00:13:02.820 |
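A small sketch of that load/transform/inspect loop, assuming langchain's document loaders and text splitters; the file path is made up:

```python
# Data engineering for context: load, transform, then actually look at it.
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

docs = TextLoader("handbook.txt").load()  # load (hypothetical path)
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_documents(docs)                   # transform

# Low-tech observability: inspect what would be passed around
# before wiring it into a chain.
for c in chunks[:3]:
    print(len(c.page_content), repr(c.page_content[:80]))
```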
And then the third thing that we see a lot of people spending time on when building these 00:13:06.600 |
applications is good old-fashioned prompt engineering. 00:13:08.980 |
So the main new thing here is language models. 00:13:11.760 |
And the main way of interacting with language models is through prompts. 00:13:14.840 |
And so being able to understand what exactly does the fully formatted prompt look like by the 00:13:19.580 |
time it is going into the language model is really important. 00:13:23.700 |
How are you combining the system instructions with maybe the few shot examples, any retrieved 00:13:29.380 |
context, the chat history that you have going on, any previous steps that the agent took? 00:13:34.360 |
What does this all look like by the time it gets to the language model? 00:13:37.700 |
And what does this look like in the middle of this complex application? 00:13:40.880 |
So it is easy enough to kind of test and debug this if it is the first call, the first part of the application. 00:13:46.300 |
But after it has already done three of these steps, if you want to kind of debug what that 00:13:49.700 |
prompt looks like, what that fully formatted prompt looks like, being able to do that becomes 00:13:54.580 |
increasingly difficult as the systems kind of scale up in their entanglement. 00:13:58.600 |
And so we have tried to make it really easy to hop into any kind of like particular language 00:14:03.000 |
model call at any point in time, open it up in a playground like this so you can edit it 00:14:07.500 |
directly and experiment with that prompt engineering and go kind of like change some of the instructions 00:14:15.260 |
and see how it responds or swap out model providers so that you can see if another model provider does better. 00:14:23.080 |
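The low-tech version of that debugging move is simply rendering the fully formatted prompt yourself; a sketch with invented prompt contents, assuming the 2023-era langchain API:

```python
# Inspect the fully formatted prompt right before it goes to the model.
from langchain.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant.\n\nContext:\n{context}"),
    ("human", "{question}"),
])

messages = prompt.format_messages(
    context="(retrieved documents would go here)",
    question="What is our refund policy?",
)
for m in messages:  # exactly what the model will see
    print(f"[{m.type}] {m.content}")
```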
So the other big challenge with these language model applications, and probably worth a talk of its own, is evaluation. 00:14:29.580 |
And so I think evaluation is really hard for a few reasons. 00:14:32.580 |
I think the two primary ones are a lack of data and a lack of good metrics. 00:14:36.920 |
So comparing to traditional kind of like data science and machine learning, with those you needed data. 00:14:42.680 |
You needed that to build your model and so then when it came time to evaluate it you at least 00:14:46.400 |
had those data points that you could look at and evaluate on. 00:14:49.400 |
And I think that is a little bit different with a lot of these LLM applications because 00:14:53.220 |
these models are fantastic zero-shot kind of like learners. 00:14:57.160 |
That is kind of like the whole exciting bit of them. 00:15:00.460 |
And so you can get to a working MVP without building up kind of like any data set at all. 00:15:06.000 |
But that does make it a little bit of a challenge when it comes to evaluating them because you do not have that data to test on. 00:15:11.880 |
And so one of the things that we often encourage a lot of people to do, and try to help them do 00:15:16.540 |
as well, is build up these data sets and iterate on those, and those can come from either labeling 00:15:21.540 |
data points by hand, or looking at production traffic and pulling things in, or auto-generating them. 00:15:29.720 |
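A sketch of building such a data set with the langsmith client; the dataset name and example contents are made up:

```python
# Building up an eval dataset over time in LangSmith.
from langsmith import Client

client = Client()  # reads LANGCHAIN_API_KEY from the environment
dataset = client.create_dataset("support-bot-evals")  # hypothetical name

# Examples can come from hand labels, production traces, or generation.
client.create_example(
    inputs={"question": "How do I reset my password?"},
    outputs={"answer": "Settings > Security > Reset password."},
    dataset_id=dataset.id,
)
```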
The second big challenge in evaluation is lack of metrics. 00:15:33.140 |
I think most traditional kind of like quantitative metrics don't perform super well for large unstructured outputs. 00:15:40.160 |
A lot of what we see people doing is still doing a kind of like vibe check to kind of see how the system is doing. 00:15:47.160 |
And as unsatisfying as that is, I still think that is probably the best way to gain that kind of intuition. 00:15:56.160 |
So a lot of what we try to do is make it really easy to observe the outputs and the inputs of 00:15:59.160 |
the language models so that you can build up that intuition. 00:16:03.660 |
In terms of more quantitative and systematic metrics, we are very bullish on LLM-assisted evaluation. 00:16:14.160 |
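A minimal LLM-assisted evaluation sketch, using a model as the judge with a made-up rubric:

```python
# LLM-as-judge: use a model to grade outputs where traditional metrics fail.
from langchain.chat_models import ChatOpenAI

judge = ChatOpenAI(temperature=0)  # keep the grader as deterministic as possible

def grade(question: str, answer: str) -> bool:
    verdict = judge.predict(
        "Is this answer correct, relevant, and clearly written? Reply YES or NO.\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    return verdict.strip().upper().startswith("YES")
```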
And then I think maybe the biggest thing that we see people doing in production is just keeping 00:16:18.160 |
track of feedback, whether it be direct or indirect feedback. 00:16:21.160 |
So do they leave kind of like a thumbs up or a thumbs down on your application? 00:16:24.160 |
That is an example of direct feedback where you are gathering that. 00:16:27.160 |
An example of indirect feedback might be if they click on a link, which might be a good 00:16:33.160 |
indication that the response was helpful. Or if they respond really confused to your chatbot, that might be a good indication that something went wrong. 00:16:39.160 |
So tracking these over time and doing A/B testing with that using kind of like traditional 00:16:44.160 |
A/B testing software can be pretty impactful for gathering a sense online of how your model is performing. 00:16:52.160 |
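A sketch of logging that direct feedback against a traced run with the langsmith client; `run_id` is a hypothetical placeholder for the run the user reacted to:

```python
# Attaching direct feedback (thumbs up/down) to a traced run in LangSmith.
from langsmith import Client

client = Client()
run_id = "<id-of-the-traced-run>"  # hypothetical: the run the user rated

client.create_feedback(
    run_id,
    key="thumbs_up",  # direct feedback signal
    score=1,          # 1 = thumbs up, 0 = thumbs down
)
```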
And then the last interesting thing that we are spending a lot of time thinking about is collaboration. 00:16:56.160 |
So as these systems get bigger and bigger, they are doubtless going to be a collaboration between many different people. 00:17:01.160 |
And so who exactly is working on these systems? 00:17:07.160 |
Is it a combination of AI engineers and data engineers and data scientists and product managers? 00:17:14.160 |
And I think one of the interesting trends that we are seeing is it is still a little bit 00:17:17.160 |
unclear what the best skill sets for this new AI engineer type role is. 00:17:22.160 |
And there could very well be a bunch of different skill sets that are valuable. 00:17:26.160 |
So going back to kind of like the two things that we see making up a lot of these applications, context awareness and reasoning. 00:17:32.160 |
The context awareness is bringing the right context to these applications. 00:17:36.160 |
You often need kind of like a data engineering team to get in there and assist with that. 00:17:40.160 |
The reasoning bit is often done through prompting. 00:17:42.160 |
And oftentimes that is best done by non-technical people who can really outline kind of like the 00:17:46.160 |
exact specification of the app that they are building. 00:17:48.160 |
Whether they be product managers or subject matter experts. 00:17:51.160 |
And so how do you enable collaboration between these two different types of folks? 00:17:57.160 |
I do not think that is something that anyone kind of like knows yet, or has definitively solved. 00:18:01.160 |
But I think that is a really interesting trend that we are thinking a lot about going forward. 00:18:07.160 |
And so I think like the main thing that I want to leave you all with is that the big thing 00:18:14.160 |
that we believe is that it is still really, really early on in this journey. 00:18:19.160 |
As crazy as things have been over the past year, they are hopefully going to get even crazier. 00:18:28.160 |
And so we think behind all of these things it is going to take a lot of engineering. 00:18:32.160 |
And we are trying to build some of the tooling to help enable that. 00:18:34.160 |
And I think you guys are all on the right track towards becoming those types of engineers.