
Building Context-Aware Reasoning Applications with LangChain and LangSmith: Harrison Chase


Chapters

0:00 Intro
1:17 What is LangChain
1:48 Context
2:50 Examples
3:36 Retrieval Augmented Generation
4:18 Fine Tuning
5:13 Reasoning
10:21 Magical Experiences
10:38 Challenges
12:10 Data Engineering
14:21 Evaluation
16:52 Collaboration
18:06 Conclusion

Whisper Transcript

00:00:00.000 | Thank you guys for having me, and then thank you guys for being here.
00:00:15.900 | This is maybe one of the most famous screens of 2023, and yet I believe, and I think we
00:00:25.420 | all believe, and that's why we're all here, that this is just the beginning of a lot of
00:00:29.540 | amazing things that we're all going to create.
00:00:32.920 | Because as good as ChatGPT is, and as good as the language models that underlie it are,
00:00:39.120 | by themselves, they're just the start.
00:00:41.700 | By themselves, they don't know about current events, they cannot run the code that you write,
00:00:46.620 | and they don't remember their previous interactions with you.
00:00:49.320 | In order to get to a future where we have truly personalized and actually helpful AI assistants,
00:00:55.800 | we're going to need to take these language models and use them as one part of a larger
00:01:00.480 | system, and that's what I think a lot of us in here are trying to do.
00:01:04.160 | These systems will be able to produce seemingly amazing and magical experiences.
00:01:11.080 | They'll understand the appropriate context, and they'll be able to reason about it and respond
00:01:16.080 | appropriately.
00:01:17.460 | At LangChain, we're trying to help teams close that gap between these magical experiences and
00:01:24.900 | the work that's actually required to get there, and we believe that behind all of these seemingly
00:01:29.460 | magical product moments, there is an extraordinary feat of engineering, and that's why it's awesome
00:01:35.000 | to be here at the AI Engineering Summit.
00:01:36.460 | And so I'm going to talk a little bit about some of the approaches that we see work for
00:01:41.320 | developers when they're building these context-aware reasoning applications that are going to
00:01:46.840 | power the future.
00:01:47.680 | So first, I'm going to talk about context, and when I say context, I mean bringing relevant
00:01:54.140 | context to the language model so that it can reason about what to do, and bringing that context
00:01:58.200 | is really, really important, because if you don't provide that context, no matter how good the
00:02:01.920 | language model is, it's not going to be able to figure out what to do.
00:02:06.820 | And so the first type of context, and probably the most common type of context that we see
00:02:11.360 | people bringing to the language model, we see them bringing through this instruction prompting
00:02:15.840 | type of approach, where they basically tell the language model how to respond to specific
00:02:20.920 | scenarios or specific inputs.
00:02:23.840 | This is pretty straightforward, and I think the way to think about it is if you have a new
00:02:27.000 | employee who shows up on the first day of work, you give them an employee handbook and
00:02:30.780 | it tells them how they should behave in certain scenarios, and I equate that to kind of
00:02:35.100 | like this instruction prompting technique, it's pretty straightforward, I think that's why
00:02:40.400 | people start with it, and as the models get better and better, this zero-shot type of prompting
00:02:44.640 | is going to be able to carry a lot of the relevant context for how you expect the language model to
00:02:49.680 | behave.
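
As a concrete illustration of that instruction-prompting approach (a minimal sketch, not a specific LangChain API), here the `llm` helper is a hypothetical stand-in for whatever chat-model client you use:

```python
# Hypothetical stand-in for any chat-model client.
def llm(system: str, user: str) -> str:
    return "(model response)"

# Zero-shot instruction prompting: tell the model how to behave,
# much like an employee handbook tells a new hire what to do.
SYSTEM_INSTRUCTIONS = """You are a support assistant for a retail company.
- Answer in two sentences or fewer.
- If the user asks about refunds, point them to the refunds page.
- If you are unsure, say so instead of guessing."""

answer = llm(SYSTEM_INSTRUCTIONS, "How do I get a refund for my order?")
print(answer)
```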
00:02:50.680 | There are some cases where telling the language model is actually quite hard, and it becomes
00:02:55.340 | better to give it some few-shot examples.
00:02:57.180 | It becomes better to give it examples where you show the language model how to behave rather
00:03:01.660 | than just tell it how to behave.
00:03:03.660 | So I think a few concrete places where this works is where it's actually a little bit difficult
00:03:08.420 | to describe how exactly the language model should respond.
00:03:12.380 | So tone I think is a good use case for this, and then also structured output is a good use
00:03:16.980 | case for this.
00:03:17.980 | You can give examples of the structured output format, and you can give examples of the output
00:03:23.060 | tone a little bit more easily than you could describe my particular tone in language.
00:03:27.540 | You can describe structured output to some extent, but I think as it
00:03:30.860 | starts to get more and more complicated, giving these really specific examples can help.
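
A sketch of that few-shot approach, again with a hypothetical `llm` helper and an illustrative intent-extraction schema: the examples carry the structured-output format that would be tedious to spell out in instructions:

```python
def llm(prompt: str) -> str:
    # Hypothetical stand-in for a chat-model call.
    return "(model response)"

# Few-shot prompting: show the model the structured output you want
# instead of trying to describe it exhaustively.
examples = [
    ("Order #123 never arrived", '{"intent": "shipping_issue", "order_id": "123"}'),
    ("Can I change my email address?", '{"intent": "account_update", "order_id": null}'),
]

prompt = "Extract a JSON object with `intent` and `order_id` from the message.\n\n"
for message, output in examples:
    prompt += f"Message: {message}\nJSON: {output}\n\n"
prompt += "Message: I was charged twice for order #987\nJSON:"

print(llm(prompt))
```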
00:03:36.500 | The next type of context is maybe the one that pops to mind most when you hear about
00:03:42.560 | bringing context to the language model. Contrasting this with the first two, retrieval augmented
00:03:46.740 | generation uses context not to decide how to respond, but to decide what to base its response on.
00:03:54.600 | So the canonical thing is you have a user question, you do some retrieval strategy, you get back
00:03:59.160 | some context, you pass that to the language model, and you say answer this question based
00:04:02.740 | on the context that is provided to you.
00:04:05.240 | So this is a little bit different from the instructions, it is maybe the same as asking
00:04:08.880 | someone to take an open book test, you can look at the book, you can look
00:04:13.000 | at the answers, and in this case the answers are the text that you pass in as this context.
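
A sketch of that canonical retrieval-augmented flow; `retrieve` and `llm` here are hypothetical stand-ins for your retrieval strategy and your model:

```python
def retrieve(question: str, k: int = 4) -> list:
    # Hypothetical retrieval strategy (vector store, keyword search, etc.).
    return ["(retrieved document 1)", "(retrieved document 2)"]

def llm(prompt: str) -> str:
    # Hypothetical chat-model call.
    return "(model response)"

def rag_answer(question: str) -> str:
    # 1. Run the retrieval strategy over the user question.
    docs = retrieve(question)
    # 2. Pass the retrieved context to the model and ask it to answer
    #    based only on that context -- the "open book" part of the test.
    context = "\n\n".join(docs)
    prompt = (
        "Answer the question based only on the context provided.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)

print(rag_answer("What is our refund policy?"))
```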
00:04:19.180 | And then the fourth way that we see people providing context to language models is through fine-tuning,
00:04:24.060 | so updating the actual weights of the language model.
00:04:28.020 | This is still kind of like in its infancy, and I think we are starting to figure out how best
00:04:31.520 | to do this and what scenarios this is good to do in.
00:04:35.700 | One of the things that we have seen is that this is good for the same use cases where few-shot
00:04:39.740 | examples are kind of good, it takes it to another extreme.
00:04:42.780 | And so for tone and structured data parsing, these are two use cases where we have seen it
00:04:47.600 | pretty beneficial to start doing some fine-tuning.
00:04:50.140 | And the idea here is that, yeah, it can be helpful to have three examples of how your model should
00:04:54.440 | respond and what the tone there should be, but what if you could give it 10,000 examples
00:04:57.960 | and it updates its weights accordingly?
00:05:00.100 | And so I think for those where the output is in a specific format, and again, you need
00:05:04.740 | more examples, you need to show it a lot more than you can tell it, this is where we see
00:05:08.340 | fine-tuning starting to become helpful, and I think we will see that grow more and more
00:05:11.300 | over time.
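
As a rough sketch of that "show it 10,000 examples" idea, here is what training data for a fine-tune might look like; the JSONL chat format shown is an assumption based on what OpenAI's fine-tuning endpoint accepted around the time of this talk, so verify the exact schema against current docs:

```python
import json

# Each line is one training example: the same kind of input/output pair you
# would use as a few-shot example, just at a much larger scale.
examples = [
    {
        "messages": [
            {"role": "system", "content": "Extract a JSON object with `intent` and `order_id`."},
            {"role": "user", "content": "Order #123 never arrived"},
            {"role": "assistant", "content": '{"intent": "shipping_issue", "order_id": "123"}'},
        ]
    },
    # ... thousands more examples ...
]

with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```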
00:05:12.300 | So we have talked about context, and now I want to talk a little bit about the reasoning
00:05:17.280 | bit, and I think this is the most exciting and the most new bit of it as well.
00:05:20.820 | And so we have tried to think and categorize some of the approaches that we have seen to
00:05:24.860 | allow these applications to do this reasoning component.
00:05:29.220 | And so we have listed a few of them out here and tried to discern a few different axes along
00:05:33.640 | which they kind of vary.
00:05:35.460 | So if we think about kind of like just plain old code, this is kind of like the way things
00:05:38.820 | were, you know, like a year ago, so a long, long time ago.
00:05:44.000 | And so in code, it's all there, it's all declared: it says what to run, it says what the outputs
00:05:49.520 | are, what steps to take, things like that.
00:05:52.200 | We start adding in a language model call, and so this is like the simplest form of these reasoning
00:05:57.240 | applications, and here you're using the language model to determine what the output should be,
00:06:01.480 | but that's it.
00:06:02.480 | You're not using it to take actions yet, nothing fancy, you're just using it to determine what
00:06:05.520 | the output should be, and it's just a single language model call, so you're providing the
00:06:08.960 | context and then you're returning the output to the user.
00:06:13.240 | If we take it up a little bit, then we start to get into a chain of language model calls or
00:06:18.340 | a chain of a language model call to an API and back to a language model.
00:06:22.480 | And so this is again used to decide what the outputs should be.
00:06:28.920 | And here, there's multiple calls that are happening, and this can be used to break down more complex
00:06:35.460 | tasks into individual components, it can be used to insert knowledge dynamically in the
00:06:39.500 | middle of kind of like one language model call, then you go fetch some knowledge based on that
00:06:43.240 | language model call, and then you do another one.
00:06:45.860 | But importantly here, the steps are known.
00:06:47.480 | You do this, and then you do this, and then you do this.
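
A sketch of such a chain, where the steps are fixed in code and the model only fills in the outputs; `llm` and `search` are hypothetical helpers:

```python
def llm(prompt: str) -> str:
    # Hypothetical chat-model call.
    return "(model output)"

def search(query: str) -> str:
    # Hypothetical knowledge lookup (an API, a vector store, etc.).
    return "(fetched knowledge)"

def answer_with_lookup(question: str) -> str:
    # The steps are known ahead of time: you do this, then this, then this.
    query = llm(f"Write a short search query for: {question}")   # step 1: LLM call
    context = search(query)                                      # step 2: fetch knowledge
    return llm(f"Using this context:\n{context}\n\nAnswer: {question}")  # step 3: LLM call

print(answer_with_lookup("What changed in the latest release?"))
```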
00:06:49.480 | And so it's a chain of events, and that starts to change a little bit when you use a router.
00:06:55.320 | And so in here, you're now using the language model call to start determining which steps
00:06:59.200 | to take.
00:07:00.200 | So that's the big difference here.
00:07:01.200 | It's no longer just determining the output of the system, but it's determining which steps
00:07:04.200 | to take.
00:07:05.200 | So you can use it to determine which prompts to use, so route between a prompt that's really
00:07:09.520 | good at math problems versus a prompt that's really good at writing English essays.
00:07:14.740 | You can use it to route between language models, so one model might be better than another.
00:07:18.620 | You might want to use Claude because of its long context window, or you might want to use
00:07:22.020 | GPT-4 because it's really good at reasoning.
00:07:24.500 | And so having the language model look at the question and decide whether it needs to reason
00:07:27.740 | or whether it wants to respond in a long-form fashion, you can determine which branches
00:07:31.040 | to go down, or I think another common use case is using it to determine which of several
00:07:36.120 | tools to call.
00:07:37.120 | So do I want to call this tool, or do I want to call this tool, and what should the input
00:07:40.080 | to that tool be?
00:07:41.760 | And so we have this router here, and I think before going on to the next step, the main thing
00:07:46.280 | here that distinguishes it from that step is that there's no kind of like cycles.
00:07:50.440 | You don't kind of get these loops.
00:07:53.400 | You're just choosing kind of like which branch to go down.
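
A sketch of a router: one language model call picks which branch to go down (which prompt, model, or tool), but there are no loops back; the prompts and the classifier call are illustrative:

```python
def llm(prompt: str) -> str:
    # Hypothetical chat-model call.
    return "math"

MATH_PROMPT = "You are excellent at step-by-step math. Solve: {question}"
ESSAY_PROMPT = "You are an eloquent essay writer. Write about: {question}"

def route(question: str) -> str:
    # The model decides which step to take, not just what the output is.
    choice = llm(
        "Classify this question as `math` or `essay`. Reply with one word.\n\n"
        f"Question: {question}"
    )
    prompt = MATH_PROMPT if choice.strip() == "math" else ESSAY_PROMPT
    # One pass down the chosen branch -- no cycles back to the router.
    return llm(prompt.format(question=question))

print(route("What is the integral of x^2?"))
```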
00:07:56.160 | Once you start adding in these loops, this is where we see a lot more complex applications.
00:08:01.020 | These are things that we often see being called agents, kind of like out in the wild, and
00:08:07.100 | it's essentially kind of like a while loop, and then in that loop you're doing a series
00:08:11.340 | of steps, and the language model is determining which steps to do, and then at some point
00:08:15.840 | it can choose whether to end the loop or not, and if it ends the loop then
00:08:19.640 | you finish and return to the user, otherwise you go back and continue the loop.
00:08:23.840 | And so here you get the language model deciding what the outputs are, it decides what steps to
00:08:27.540 | take, and you do have these cycles.
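
A sketch of that loop; the `decide_next_step` call and the tools are hypothetical, but the shape is the important part: the model picks the next action each time around and decides when to end the loop:

```python
def decide_next_step(question: str, steps: list) -> dict:
    # Hypothetical LLM call that returns either
    # {"tool": "search", "input": "..."} or {"finish": "final answer"}.
    return {"finish": "(final answer)"}

TOOLS = {
    "search": lambda q: "(search results)",      # hypothetical tool
    "calculator": lambda q: "(computed value)",  # hypothetical tool
}

def run_agent(question: str, max_steps: int = 10) -> str:
    steps = []
    for _ in range(max_steps):  # essentially a while loop with a safety cap
        decision = decide_next_step(question, steps)
        if "finish" in decision:
            return decision["finish"]  # the model chose to end the loop
        observation = TOOLS[decision["tool"]](decision["input"])
        steps.append((decision, observation))  # results feed into the next turn
    return "Stopped after too many steps."

print(run_agent("How many customers signed up last week?"))
```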
00:08:31.960 | The last thing, and I think this is largely what we would describe as kind of like what
00:08:37.860 | AutoGPT did that took the world by storm, is this idea of an agent where you kind of like remove
00:08:43.760 | a lot of the kind of like guardrails around what steps to take.
00:08:50.040 | So here the sequences of steps that are available are almost like determined by the LLM, and
00:08:55.860 | what I mean by this is that here's where you can start doing things like adding in tools
00:08:59.760 | that the language model can use, so if you guys are familiar with the Voyager paper, it
00:09:04.760 | starts adding in tools and building up a skill set of tools over time, and so some of the actions
00:09:08.700 | that the language model can take are dynamically created.
00:09:12.320 | And then I think the other big thing here is that you remove some of the scaffolding from
00:09:15.920 | the state machines.
00:09:17.540 | So some of the, if I go back a little bit, so a lot of these kind of like cycles that we
00:09:22.940 | see in the wild break things down into discrete states.
00:09:26.920 | The most common ones that we see are kind of like plan, execute, and validate.
00:09:30.380 | So you ask the language model to plan what to do, then it goes and executes it, and then you
00:09:33.140 | validate it, often with a language model call or something like that.
00:09:36.720 | And I think the big difference between that and then the autonomous agent style thing is
00:09:41.100 | that here you're implicitly asking the agent to do all of those things in one go.
00:09:45.800 | It should know when it should plan, it should know when it should validate, and it should
00:09:49.080 | know when it should kind of like determine what action to take.
00:09:51.480 | And you're asking it all to do that implicitly.
00:09:53.700 | You don't have these kind of like distinct sequences of steps laid out in the code.
00:10:00.500 | And so this is a little bit about how we're thinking about it.
00:10:02.560 | I think the thing that I like to say when saying this as well, which goes
00:10:08.520 | back to the beginning, is that the main thing that we think is it's still just extremely early
00:10:12.320 | on in the space.
00:10:13.320 | We still think it's the beginning.
00:10:14.560 | And this could, you know, in three months be kind of irrelevant as the space progresses.
00:10:19.000 | So I would just keep that in mind.
00:10:21.880 | If we think about kind of like some of the magical experiences like this where it can reason over
00:10:27.760 | the relevant context, what is it going to take to kind of like build it under the hood?
00:10:33.000 | What is the engineering that's going to go into all these seemingly magical experiences?
00:10:39.160 | And so this is an example of what could be going on under the hood of something like this.
00:10:43.780 | It's going to be a challenging experience to build these complex systems.
00:10:48.000 | And that's why we're building some of the tooling like this, what you see here, to help debug,
00:10:52.740 | understand, and iterate on these systems of the future.
00:10:56.680 | And so what exactly are the challenges associated with building these complex context-aware reasoning
00:11:02.880 | applications?
00:11:04.780 | The first is kind of just the orchestration layer.
00:11:07.240 | So figuring out which of the different reasoning kind of like cognitive architectures you should
00:11:12.340 | be using.
00:11:13.340 | Should you be using a simple chain?
00:11:15.640 | Should you be using a router, a more complex agent?
00:11:18.120 | And I think the thing to remember here is that it's not necessarily that one is better than
00:11:22.280 | the other or superior to the other.
00:11:24.000 | They all have kind of like their pros and cons and strengths and weaknesses.
00:11:26.680 | So chains are really good because you have more control over the sequence of steps that
00:11:30.720 | are taken.
00:11:31.720 | Agents are better because they can more dynamically react to unexpected inputs and handle edge cases.
00:11:36.880 | And so being able to choose the right cognitive architecture that you want and being able
00:11:41.160 | to quickly experiment with a bunch of other ones are part of what inspired the initial
00:11:45.440 | release of LangChain and kind of how we aim to help people prototype these types of applications.
00:11:52.780 | And then LangSmith, which is this thing here, provides a lot of visibility into what actually
00:11:56.620 | is going on as these applications start to get more and more complex, understanding what
00:12:02.380 | exact sequences of tools are being used, what exact sequences of language model calls are being
00:12:08.060 | made becomes increasingly important.
00:12:11.720 | Another big thing that we see people struggling with and spending a lot of time on is good
00:12:15.220 | old-fashioned data engineering.
00:12:17.340 | So a lot of this comes down to providing the right context to the language models.
00:12:21.820 | And the right context is often data.
00:12:23.340 | And so you need to have ways to load that data.
00:12:25.340 | You need to have ways to transform that data, to transport that data.
00:12:28.560 | And then you often want to have observability into what exact data is getting passed around
00:12:31.880 | and where.
00:12:32.880 | And so LangChain itself has a lot of open source modules for loading that data and transforming
00:12:37.820 | that data.
00:12:39.220 | And then LangSmith we often see being really useful for debugging what exactly that data
00:12:43.580 | looks like by the time it is getting to the language model.
00:12:46.600 | Have you extracted the right documents from your vector store?
00:12:49.600 | Have you transformed them and formatted them in the right way so that it is clear to the language
00:12:52.880 | model what is actually in them?
00:12:54.640 | These are all things that you are going to want to be able to debug so that there are no
00:12:58.320 | small errors or small issues that pop up.
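
A small sketch of that data-engineering loop: load, transform, and then actually look at what will reach the model; the loader and splitter here are simple stand-ins rather than any particular LangChain module:

```python
from pathlib import Path

def load_documents(folder: str) -> list:
    # Stand-in loader: read every text file in a folder.
    return [p.read_text() for p in Path(folder).glob("*.txt")]

def split(doc: str, chunk_size: int = 1000) -> list:
    # Stand-in transformer: naive fixed-size chunking.
    return [doc[i : i + chunk_size] for i in range(0, len(doc), chunk_size)]

chunks = [chunk for doc in load_documents("./docs") for chunk in split(doc)]

# Observability: inspect what the model will actually see, so formatting
# problems surface here instead of becoming silent quality issues downstream.
for chunk in chunks[:3]:
    print(repr(chunk[:200]))
```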
00:13:02.820 | And then the third thing that we see a lot of people spending time on when building these
00:13:06.600 | applications is good old-fashioned prompt engineering.
00:13:08.980 | So the main new thing here is language models.
00:13:11.760 | And the main way of interacting with language models is through prompts.
00:13:14.840 | And so being able to understand what exactly does the fully formatted prompt look like by the
00:13:19.580 | time it is going into the language model is really important.
00:13:23.700 | How are you combining the system instructions with maybe the few shot examples, any retrieved
00:13:29.380 | context, the chat history that you have going on, any previous steps that the agent took?
00:13:34.360 | What does this all look like by the time it gets to the language model?
00:13:37.700 | And what does this look like in the middle of this complex application?
00:13:40.880 | So it is easy enough to kind of test and debug this if it is the first call, the first part
00:13:45.300 | of the system.
00:13:46.300 | But after it has already done three of these steps, if you want to kind of debug what that
00:13:49.700 | prompt looks like, what that fully formatted prompt looks like, being able to do that becomes
00:13:54.580 | increasingly difficult as the systems kind of scale up in their entangledness.
00:13:58.600 | And so we have tried to make it really easy to hop into any kind of like particular language
00:14:03.000 | model call at any point in time, open it up in a playground like this so you can edit it
00:14:07.500 | directly and experiment with that prompt engineering and go kind of like change some of the instructions
00:14:15.260 | and see how it responds or swap out model providers so that you can see if another model provider
00:14:19.540 | does better.
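
A sketch of why seeing the fully formatted prompt matters: the string that finally reaches the model is assembled from several moving pieces, and printing it mid-run is the quickest sanity check; the assembly function is illustrative, not a specific LangChain API:

```python
def build_prompt(system, examples, context, history, question) -> str:
    parts = [system]
    parts += [f"Example:\nUser: {inp}\nAssistant: {out}" for inp, out in examples]
    if context:
        parts.append(f"Retrieved context:\n{context}")
    parts += [f"{role}: {msg}" for role, msg in history]
    parts.append(f"user: {question}")
    return "\n\n".join(parts)

prompt = build_prompt(
    system="You are a helpful support assistant.",
    examples=[("What are your hours?", "We are open 9 to 5, Monday to Friday.")],
    context="(retrieved documents)",
    history=[("user", "Hi"), ("assistant", "Hello! How can I help?")],
    question="Do you ship to Canada?",
)
print(prompt)  # inspect exactly what the model will receive at this step
```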
00:14:23.080 | So the other big challenge with these language model applications, and it is probably worth a
00:14:26.840 | talk on its own, is evaluation of them.
00:14:29.580 | And so I think evaluation is really hard for a few reasons.
00:14:32.580 | I think the two primary ones are a lack of data and a lack of good metrics.
00:14:36.920 | So compared to traditional kind of like data science and machine learning, with those you
00:14:40.680 | generally started with a data set.
00:14:42.680 | You needed that to build your model and so then when it came time to evaluate it you at least
00:14:46.400 | had those data points that you could look at and evaluate on.
00:14:49.400 | And I think that is a little bit different with a lot of these LLM applications because
00:14:53.220 | these models are fantastic zero-shot kind of like learners.
00:14:57.160 | That is kind of like the whole exciting bit of them.
00:15:00.460 | And so you can get to a working MVP without building up kind of like any data set at all.
00:15:05.000 | And that is awesome.
00:15:06.000 | But that does make it a little bit of a challenge when it comes to evaluating them because you
00:15:09.420 | don't have these data points.
00:15:11.880 | And so one of the things that we often encourage a lot of people to do, and try to help them do
00:15:16.540 | as well, is build up these data sets and iterate on those, and those can come from either labeling
00:15:21.540 | data points by hand or looking at production traffic and pulling things in or auto-generating
00:15:27.060 | things with LLMs.
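
A sketch of what building up that kind of dataset can look like; the `app` function and the examples are hypothetical:

```python
# A dataset is just input/reference pairs, whether hand-labeled, pulled from
# production traffic, or auto-generated with an LLM.
dataset = [
    {"input": "How do I reset my password?",
     "reference": "Use the reset link on the login page."},
    {"input": "Do you ship to Canada?",
     "reference": "Yes, shipping to Canada takes 5-7 business days."},
]

def app(question: str) -> str:
    # Hypothetical application under test.
    return "(application output)"

results = [
    {**example, "output": app(example["input"])}
    for example in dataset
]
```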
00:15:29.720 | The second big challenge in evaluation is lack of metrics.
00:15:33.140 | I think most traditional kind of like quantitative metrics don't perform super well for large unstructured
00:15:39.160 | outputs.
00:15:40.160 | A lot of what we see people doing is still doing a kind of like vibe check to kind of
00:15:45.160 | like see how the model is performing.
00:15:47.160 | And as unsatisfying as that is, I still think that is probably the best way to gain kind of
00:15:53.440 | like intuition as to what is going on.
00:15:56.160 | So a lot of what we try to do is make it really easy to observe the outputs and the inputs of
00:15:59.160 | the language models so that you can build up that intuition.
00:16:03.660 | In terms of more quantitative and systematic metrics, we are very bullish on LLM-assisted
00:16:09.160 | evaluation.
00:16:10.160 | So using LLMs to evaluate the outputs.
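
A sketch of LLM-assisted evaluation: ask a model to grade each output against a reference answer; the grading prompt and `llm` helper are illustrative:

```python
def llm(prompt: str) -> str:
    # Hypothetical chat-model call used as the grader.
    return "CORRECT"

def grade(question: str, reference: str, output: str) -> bool:
    verdict = llm(
        "You are grading an answer. Reply with CORRECT or INCORRECT only.\n\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Submitted answer: {output}"
    )
    return verdict.strip().upper().startswith("CORRECT")

print(grade("Do you ship to Canada?", "Yes, within 5-7 days.", "We do ship to Canada."))
```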
00:16:14.160 | And then I think maybe the biggest thing that we see people doing in production is just keeping
00:16:18.160 | track of feedback, whether it be direct or indirect feedback.
00:16:21.160 | So do they leave kind of like a thumbs up or a thumbs down on your application?
00:16:24.160 | That is an example of direct feedback where you are gathering that.
00:16:27.160 | An example of indirect feedback might be if they click on a link; that might be a good
00:16:31.160 | sign that you provided a good suggestion.
00:16:33.160 | Or if they respond really confused to your chatbot, that might be a good indication that
00:16:37.160 | your chatbot actually did not perform well.
00:16:39.160 | So tracking these over time and doing A/B testing with that using kind of like traditional
00:16:44.160 | A/B testing software can be pretty impactful for gathering a sense online of how your model
00:16:50.160 | is doing.
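
And a tiny sketch of what tracking that feedback can look like, with an illustrative in-memory log in place of whatever analytics or A/B-testing tool you actually use:

```python
from collections import defaultdict

feedback_log = defaultdict(list)

def record_feedback(run_id: str, score: float, kind: str) -> None:
    # kind: "thumbs" for direct feedback; "clicked_link" or "user_confused"
    # for indirect signals inferred from behavior.
    feedback_log[kind].append({"run_id": run_id, "score": score})

record_feedback("run-123", 1.0, "thumbs")         # explicit thumbs up
record_feedback("run-124", 1.0, "clicked_link")   # indirect positive signal
record_feedback("run-125", 0.0, "user_confused")  # indirect negative signal
```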
00:16:52.160 | And then the last interesting thing that we are spending a lot of time thinking about
00:16:55.160 | is collaboration.
00:16:56.160 | So as these systems get bigger and bigger, they are doubtless going to be a collaboration
00:16:59.160 | among a lot of people.
00:17:01.160 | And so who exactly is working on these systems?
00:17:05.160 | Is it all AI engineers?
00:17:07.160 | Is it a combination of AI engineers and data engineers and data scientists and product
00:17:13.160 | managers?
00:17:14.160 | And I think one of the interesting trends that we are seeing is it is still a little bit
00:17:17.160 | unclear what the best skill sets for this new AI engineer type role are.
00:17:22.160 | And there could very well be a bunch of different skill sets that are valuable.
00:17:26.160 | So going back to kind of like the two things that we see making up a lot of these applications,
00:17:30.160 | the context awareness and the reasoning bit.
00:17:32.160 | The context awareness is bringing the right context to these applications.
00:17:36.160 | You often need kind of like a data engineering team to get in there and assist with that.
00:17:40.160 | The reasoning bit is often done through prompting.
00:17:42.160 | And oftentimes that is best done by non-technical people who can really outline kind of like the
00:17:46.160 | exact specification of the app that they are building.
00:17:48.160 | Whether they be product managers or subject matter experts.
00:17:51.160 | And so how do you enable collaboration between these two different types of folks?
00:17:55.160 | And what exactly does that look like?
00:17:57.160 | I do not think that is something that anyone kind of like knows or has definitively solved.
00:18:01.160 | But I think that is a really interesting trend that we are thinking a lot about going forward.
00:18:07.160 | And so I think like the main thing that I want to leave you all with is that the big thing
00:18:14.160 | that we believe is that it is still really, really early on in this journey.
00:18:18.160 | It is just the beginning.
00:18:19.160 | As crazy as things have been over the past year, they are hopefully going to get even crazier.
00:18:23.160 | You saw an incredible demo of GPT-4V.
00:18:26.160 | Things like that are going to change it.
00:18:28.160 | And so we think behind all of these things it is going to take a lot of engineering.
00:18:32.160 | And we are trying to build some of the tooling to help enable that.
00:18:34.160 | And I think you guys are all on the right track towards becoming those types of engineers
00:18:39.160 | by being at a conference like this.
00:18:41.160 | So thank you, swyx, for having me.
00:18:42.160 | Thank you guys for being here.
00:18:43.160 | Have a good rest of your day.
00:18:45.160 | Thank you.