
Prompt Templating and Techniques in LangChain


Chapters

0:00 Prompts are Fundamental to LLMs
2:13 Building Good LLM Prompts
7:13 LangChain Prompts Code Setup
11:36 Using our LLM with Templates
16:54 Few-shot Prompting
23:11 Chain of Thought Prompting


00:00:00.000 | Now we're going to move on to the chapter on prompts in LangChain. Prompts seem like
00:00:07.280 | a simple concept, and they are a simple concept, but there's actually quite a lot to them once you
00:00:11.280 | start diving in, and they truly have been a fundamental part of what has propelled us
00:00:19.760 | forwards from pre-LLM times to the current LLM times. You have to remember that until LLMs became widespread,
00:00:27.840 | the way to fine-tune an AI or ML model was to get loads of data for your particular
00:00:38.880 | use case and spend a load of time training your specific transformer, or part of the transformer, to essentially
00:00:46.000 | adapt it for that particular task. That could take a long time. Depending on the task it could take
00:00:53.600 | you months, or for a simpler task it might take days, potentially
00:01:00.720 | weeks. Now, the interesting thing with LLMs is that rather than needing to go through this whole fine-
00:01:09.040 | tuning process to modify a model for one task over another, we just prompt it
00:01:18.000 | differently: we literally tell the model, "hey, I want you to do this in this particular way", and that is a
00:01:24.400 | paradigm shift in what you're doing. It's so much faster, taking you a
00:01:30.320 | couple of minutes rather than days, weeks, or months, and LLMs are incredibly powerful when it comes to
00:01:37.840 | generalizing across these many different tasks. So prompts, which carry those instructions,
00:01:45.600 | are a fundamental part of that. Now, LangChain naturally has many functionalities around prompts,
00:01:53.360 | and we can build very dynamic prompting pipelines that modify the structure and content of what we're
00:01:59.120 | actually feeding into our LLM depending on different variables and inputs. We'll see that in
00:02:05.760 | this chapter, where we're going to work through prompting within the scope of a RAG example. So let's start
00:02:14.000 | by just dissecting the various parts of a prompt that we might expect to see for a use case like RAG.
00:02:21.760 | So our typical prompt for RAG or retrieval augmented generation will include rules for
00:02:29.520 | the LLM, and this you will see in most prompts, if not all. This part of the prompt sets up
00:02:38.480 | the behavior of the LLM: how it should be responding to user queries, what sort of personality it should
00:02:46.320 | be taking on, what it should be focusing on when it is responding, and any particular rules or boundaries that
00:02:52.560 | we want to set. Really, what we're trying to do here is simply provide as much information
00:03:00.320 | as possible to the LLM about what we're doing. We just want to give the LLM context as to the
00:03:11.120 | place it finds itself in, because the LLM has no idea where it is; it just takes in some
00:03:17.520 | information and spits out information. If the only information it receives is the user's
00:03:22.560 | query, it doesn't know the context: what application is it within,
00:03:29.680 | what is its objective, what is its aim, what are the boundaries? All of this we need to assume
00:03:37.040 | the LLM has absolutely no idea about, because it truly does not. So we provide as much context as we can,
00:03:44.800 | but it's important that we don't overdo it. We see this all the time: people will over-prompt an LLM.
00:03:53.280 | You want to be concise, you don't want fluff, and in general, the more
00:04:00.000 | concise and less fluffy you can make every part of your prompt, the better. Now, those rules or instructions typically live in the system
00:04:06.640 | prompt of your LLM. The second component is the context, which is RAG-specific. The context refers to some
00:04:14.240 | sort of external information that you are feeding into your LLM. We may have received this information
00:04:21.120 | from a web search, a database query, or, quite often in the case of RAG, a vector database. This
00:04:29.600 | external information that we provide is essentially the "RA", the retrieval augmentation,
00:04:37.040 | of RAG. We are augmenting the knowledge of our LLM, which is contained within
00:04:45.040 | the LLM's model weights, with some external knowledge. That's what we're doing
00:04:51.360 | here. Now, for chat LLMs this context is typically placed within the conversation, inside the user
00:05:01.680 | or assistant messages, and with more recent models it can also be placed within tool messages as well.
00:05:12.640 | Then we have the question. It's pretty straightforward: this is the query from the user,
00:05:19.680 | and it's usually a user message, of course. There might be some additional formatting around it. You might
00:05:26.960 | add a little bit of extra context, or you might add some additional instructions if you find that your LLM
00:05:33.680 | sometimes veers off the rules that you've set within the system prompt. You might append or prefix
00:05:40.240 | something here, but for the most part it's probably just going to be the user's input. And finally,
00:05:45.680 | those are all the inputs for our prompt; the final piece is the output that we get, the answer
00:05:53.040 | from the assistant. That's not even specific to RAG, it's just what you would expect from a
00:05:58.400 | chat LLM, or any LLM, and of course that would be an assistant message. So putting all of that together
00:06:06.560 | into an actual prompt, you can see everything we have here. We have the rules for our prompt, the instructions,
00:06:13.120 | where we're just saying: answer the question based on the context below. If you cannot answer the question
00:06:17.680 | using the provided information, answer with "I don't know". Then we have some context. In this scenario, that
00:06:26.000 | context we're feeding in, because it's the first message, we might put into the system prompt. But that
00:06:32.400 | may also be turned around. If you, for example, have an agent, you might have your question
00:06:38.800 | before the context, coming from a user message, and then this context
00:06:45.600 | would follow the question and be recognized as a tool message; it would be fed in that way as well.
00:06:53.120 | It kind of depends on what sort of structure you're going for, but you can do either. You can feed it
00:06:57.360 | into the system message if it's less conversational, whereas if it's more conversational we might feed
00:07:03.520 | it in as a tool message. Then we have the user query, and then we'd have the AI
00:07:09.600 | answer, which obviously would be generated by the model. A rough sketch of this layout is shown below.
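Put together, the assembled prompt might look roughly like this (a loose sketch; the wording here is illustrative rather than the notebook's verbatim prompt):

```
System:    Answer the user's query based on the context below. If you cannot
           answer the question using the provided information, answer with
           "I don't know".

           Context: <retrieved information, e.g. from a vector database>

User:      <the user's question>

Assistant: <the LLM's generated answer>
```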
00:07:16.320 | Now let's switch across to the code. We're in the LangChain course repo, notebooks, 03 prompts. I'm just going to open this in Colab. Let's scroll
00:07:23.040 | down and start by installing the prerequisites, so we just have the various
00:07:28.800 | libraries again. As I mentioned before, LangSmith is optional. You don't need to install it, but if you
00:07:33.840 | would like to see your traces and everything in LangSmith then I would recommend doing that.
00:07:39.040 | And if you are using LangSmith you will need to enter your API key here. Again, if you're not using
00:07:45.120 | LangSmith you don't need to enter anything; you just skip that cell. A sketch of that setup is shown below.
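As a rough sketch of what that setup cell typically looks like in Colab (the package list and environment variable names here are my assumptions, not copied from the notebook):

```python
# Install the libraries used in this chapter (package list is illustrative).
!pip install -qU langchain langchain-core langchain-openai langsmith

import os
from getpass import getpass

# Optional: enable LangSmith tracing. Skip this cell if you aren't using LangSmith.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass("Enter LangSmith API key (or leave blank): ")
```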
00:07:51.680 | Cool, let's jump into basic prompting then. We're going to start with this prompt: answer the user's query based on the context
00:07:58.080 | below. We're just structuring what we just saw in code, and we're going to be using the chat prompt
00:08:06.240 | template because, generally speaking, we're using chat LLMs in most cases nowadays. So we have our chat prompt
00:08:15.360 | template, and that is going to contain a list of messages: a system message to begin with, which is just
00:08:21.360 | going to contain this, and we're feeding the context in within that. And we have our user query here.
00:08:29.520 | Okay, so we'll run this, and if we take a look here, we haven't specified what our
00:08:38.160 | input variables are, but we can see that we have query and we have context up here. So we can
00:08:47.360 | see that these are the input variables, we just haven't explicitly defined them. So let's just confirm
00:08:56.400 | that LangChain did pick those up, and we can see that it did. It has context and query as
00:09:01.360 | the input variables for the prompt template that we just defined. Okay, so we can also see the structure
00:09:09.200 | of our template. Let's have a look. We can see that within messages we have a system message
00:09:18.160 | prompt template. The way that we define this, you can see here, is with from_messages, and this will
00:09:23.520 | consume various different structures: from_messages takes a sequence of
00:09:34.320 | message-like representations. So we could pass in a system prompt template object and then a user prompt
00:09:41.840 | template object, or we can just use a tuple like this, which defines "this is a system message, this is a user message", and
00:09:50.320 | you could also do assistant or tool messages here as well using the same structure. Then we
00:09:58.400 | can look inside and, of course, that is being translated into the system message prompt template and human
00:10:05.280 | message prompt template. We have our input variables in there and then we have the template too, something like the sketch below.
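A minimal sketch of that tuple-based definition and the checks described here (the exact prompt strings are assumptions based on the template read out later in the video):

```python
from langchain_core.prompts import ChatPromptTemplate

# Tuple-based definition: ("system", ...) and ("user", ...) tell LangChain
# which message type each template belongs to.
prompt_template = ChatPromptTemplate.from_messages([
    ("system", (
        "Answer the user's query based on the context below. "
        "If you cannot answer the question using the provided "
        "information, answer with \"I don't know\".\n\n"
        "Context: {context}"
    )),
    ("user", "{query}"),
])

# LangChain infers the input variables from the {placeholders} above.
print(prompt_template.input_variables)  # ['context', 'query']
print(prompt_template.messages)         # system + human message prompt templates
```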
00:10:14.080 | Now let's continue, and we'll see what I just said. We're importing our system message prompt
00:10:21.520 | template and human message prompt template, and you can see we're using the same from_messages method here,
00:10:27.600 | and it's still a sequence of message-like representations, it's just that what that
00:10:34.800 | actually means can vary. So here we have SystemMessagePromptTemplate.from_template with our prompt and HumanMessagePromptTemplate.from_template
00:10:41.040 | with our query. There are various ways you might want to do this; it just depends on how explicit
00:10:47.280 | you want to be. Generally speaking, I think for myself I would prefer that we stick with the objects
00:10:56.560 | themselves and be explicit, but it is definitely a little harder to parse when you're reading
00:11:02.400 | it, so I understand why you might also prefer the tuples; they're definitely cleaner and do look
00:11:08.640 | simpler, so it just depends, I suppose, on preference. So you can see again that this is exactly the same:
00:11:20.160 | a chat prompt template containing this and this. You probably want to see the exact output,
00:11:28.320 | so here it is as messages: exactly the same as what I output before. A sketch of that explicit version is below.
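For reference, a minimal sketch of the more explicit, object-based equivalent (again, the exact strings are assumptions mirroring the tuple version above):

```python
from langchain_core.prompts import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)

# Same prompt text as the tuple-based version above (illustrative wording).
system_prompt = (
    "Answer the user's query based on the context below. "
    "If you cannot answer the question using the provided "
    "information, answer with \"I don't know\".\n\n"
    "Context: {context}"
)

# Explicit prompt-template objects instead of ("system", ...) / ("user", ...) tuples.
prompt_template = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template(system_prompt),
    HumanMessagePromptTemplate.from_template("{query}"),
])
```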
00:11:38.960 | Cool, so we have all that. Let's see how we would invoke our LLM with these. We're going to be using GPT-4o mini again, so we do need our API key; enter that
00:11:50.000 | and we'll just initialize our LLM. We are going with a low temperature here, so less randomness or less
00:11:57.360 | creativity, and in many cases this is actually what I would be doing. The reason
00:12:03.360 | we're going with low temperature in this scenario is that we're doing RAG, and if you
00:12:10.240 | remember from before, if we scroll up a little bit, our template says: answer the user's query based on the
00:12:15.520 | context below, and if you cannot answer the question using the provided information, answer with
00:12:21.600 | "I don't know". Just from reading that, we know that we want our LLM to be as truthful and accurate
00:12:30.400 | as possible, so a more creative LLM is going to struggle with that and is more likely to hallucinate,
00:12:38.240 | whereas a low-creativity or low-temperature LLM will probably stick to the rules a little better.
00:12:45.120 | So again, it depends on your use case. If you're doing creative writing you might want to go
00:12:50.240 | with a higher temperature, but for things like RAG, where the information being output should be accurate
00:12:57.280 | and truthful, it's important, I think, that we keep the temperature low. I talk about that a little bit
00:13:05.760 | here: a lower temperature of zero makes the LLM's output more deterministic, which in theory
00:13:12.160 | should lead to less hallucination. Okay, so we're going to go with LCEL again here; for those of you
00:13:19.600 | that have used LangChain in the past, this is equivalent to an LLMChain object. Our prompt template is being piped
00:13:26.960 | into our LLM, and with that we have our pipeline, sketched out below.
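A minimal sketch of that LLM initialization and LCEL pipeline (the model name and temperature follow what's described here; treat the rest as an assumption):

```python
import os
from getpass import getpass

from langchain_openai import ChatOpenAI

os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")

# Low temperature = less randomness, which suits RAG-style factual answers.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)

# LCEL: pipe the prompt template into the LLM, equivalent to the old LLMChain.
pipeline = prompt_template | llm
```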
00:13:36.720 | Now let's see how we would use that pipeline. We're going to create some context here; this is just some context around Aurelio AI,
00:13:43.920 | mentioning that we built semantic router and semantic chunkers, that there's an AI platform,
00:13:52.640 | and development services. We also mention, and I think we specifically ask about this later on in the example,
00:13:59.520 | the LangChain Experts part, a little piece of information. Now, most LLMs would not have been
00:14:05.120 | trained on the recent internet, and the fact that this came in September 2024 is relatively recent, so a lot of
00:14:13.200 | LLMs out of the box you wouldn't expect to know it, which makes it a good little bit of information
00:14:20.480 | to ask about. So we invoke: we have our query, "what do we do?", and we have that context, and we're
00:14:28.080 | feeding that into the pipeline that we defined here. When we invoke it, it's automatically
00:14:33.840 | going to take query and context and feed them into our prompt template. If we
00:14:41.680 | want to, we can also be a little more explicit. You will probably see me doing this throughout the
00:14:48.800 | course, because I do like to be explicit with everything, to be honest, so you'll probably see me doing this,
00:14:57.680 | and this is doing the same thing, or you'll see in a moment that it's doing the exact same thing.
00:15:11.440 | Again, this is just an LCEL thing. All I'm doing in this scenario is saying: take
00:15:20.080 | the query key from the input dictionary, and also take the context key from that input dictionary.
00:15:29.440 | So this is doing the exact same thing; the reason that we might want to write it this way is mainly for
00:15:39.200 | clarity, to be honest, just to explicitly say, okay, these are the inputs, because otherwise we don't
00:15:44.240 | really have them in the code other than within our original prompts up here, which is not super clear.
00:15:51.920 | So I think it's usually a good idea to just be more explicit with these things, and of course,
00:15:57.840 | if you decide you're going to modify things a little bit, let's say you rename this to "input" down the line, you
00:16:04.080 | can still feed in the same input here, you're just mapping it between different keys essentially,
00:16:09.360 | or if you would like to modify that value, say lowercase it on the way in or something, you can do that.
00:16:15.760 | So you have that; I'll just redefine it, something like the sketch below, and we'll invoke again.
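A rough sketch of that more explicit mapping (the lambda-based dictionary is my assumption of how the notebook wires it up; `query`, `context`, `prompt_template`, and `llm` are assumed defined earlier):

```python
# Explicitly map the incoming dict's keys to the prompt's input variables.
# The dictionary of callables is coerced by LCEL into a parallel runnable.
pipeline = (
    {
        "query": lambda x: x["query"],
        "context": lambda x: x["context"],
    }
    | prompt_template
    | llm
)

result = pipeline.invoke({"query": query, "context": context})
print(result.content)
```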
00:16:27.200 | And we see that this is the exact same thing. So, Aurelio AI, and this is an AI message just
00:16:33.440 | generated by the LLM: expertise in building AI agents, several open source frameworks, semantic router, the AI platform,
00:16:42.640 | and so on. So it has everything there other than the LangChain
00:16:49.120 | Experts part; it didn't mention that, but we'll test it on that later. Okay, so on to few-shot
00:16:55.920 | prompting. This is a specific prompting technique. Now, many state-of-the-art LLMs
00:17:03.280 | are very good at instruction following, so you'll find that few-shot prompting is less common now than
00:17:09.920 | it used to be, at least for the bigger, more state-of-the-art models, but when you start using smaller models,
00:17:17.840 | which is not really what we're using here, but let's say you're using an open source model like Llama 3
00:17:25.200 | or Llama 2, which is much smaller, you will probably need to consider things like few-shot prompting,
00:17:32.800 | although, that being said, with OpenAI models, at least the current ones, this is not so
00:17:40.560 | important. Nonetheless, it can be useful. The idea behind few-shot prompting is that you are providing a few
00:17:47.440 | examples to your LLM of how it should behave before you actually go into the main
00:17:56.160 | part of the conversation. So let's see how that would look. We create an example prompt, with our
00:18:03.120 | human and AI messages, so human input, AI response. We're basically setting up: with this type of input you should
00:18:10.720 | provide this type of output. That's what we're doing here, and we're just going to provide some examples,
00:18:17.120 | so we have our input, "here is query 1", and "here is the answer 1". This is just to
00:18:24.000 | show you how it works; it's not what we'd actually feed into our LLM. Then, with both these examples and our
00:18:30.720 | example prompt, we feed both of these into LangChain's FewShotChatMessagePromptTemplate, and
00:18:40.560 | you'll see what we get out of it: it basically formats and structures everything
00:18:46.080 | for us, roughly as in the sketch below.
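A minimal sketch of that (the placeholder example text is mine, mirroring the "query 1 / answer 1" idea described here):

```python
from langchain_core.prompts import ChatPromptTemplate, FewShotChatMessagePromptTemplate

# The per-example prompt: how each (input, output) pair is rendered.
example_prompt = ChatPromptTemplate.from_messages([
    ("human", "{input}"),
    ("ai", "{output}"),
])

# Illustrative examples only; not something you'd feed to a real application.
examples = [
    {"input": "Here is query 1", "output": "Here is the answer 1"},
    {"input": "Here is query 2", "output": "Here is the answer 2"},
]

few_shot_prompt = FewShotChatMessagePromptTemplate(
    example_prompt=example_prompt,
    examples=examples,
)

# Inspect the formatted human/AI message pairs it produces.
print(few_shot_prompt.format())
```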
00:18:51.520 | And how you use this depends on your application. Let's say you see that your user is talking about a particular topic and you would like to guide your
00:19:00.960 | LLM to talk about that particular topic in a particular way. You could identify that the user is
00:19:07.440 | talking about that topic, via a keyword match or a semantic similarity match, and based on
00:19:12.960 | that you might want to modify the examples that you feed into your FewShotChatMessagePromptTemplate.
00:19:19.360 | That could be what you do for topic A; for topic B you might have another
00:19:25.280 | set of examples that you feed in. All this time your example prompt remains the same,
00:19:30.640 | but you're just modifying the examples that are going in so that they're more relevant to whatever
00:19:35.200 | it is your user is actually talking about. So that can be useful, and a sketch of the idea follows.
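This dynamic swapping isn't shown in the notebook itself, so treat the following as a hypothetical sketch of the idea: the example_prompt stays fixed while the examples list is chosen per detected topic.

```python
# Hypothetical per-topic example sets; the example_prompt itself never changes.
examples_by_topic = {
    "topic_a": [
        {"input": "A question about topic A", "output": "An answer in topic A's preferred style"},
    ],
    "topic_b": [
        {"input": "A question about topic B", "output": "An answer in topic B's preferred style"},
    ],
}

def build_few_shot_prompt(topic: str) -> FewShotChatMessagePromptTemplate:
    """Pick the example set for the topic detected via keyword or semantic match."""
    return FewShotChatMessagePromptTemplate(
        example_prompt=example_prompt,
        examples=examples_by_topic[topic],
    )
```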
00:19:41.120 | Now let's see an example of that. If we were using a tiny LLM, its ability would be limited, although I think we are probably fine here.
00:19:48.240 | We're going to say: answer the user's query based on the context below, always answer in markdown format,
00:19:54.240 | so being very specific in the system prompt. That's nice, but what we've said here is,
00:20:02.400 | always answer in markdown format, and when doing so please provide headers, short summaries, and
00:20:09.840 | follow with bullet points, then conclude. So you see this here, and we get this overview
00:20:17.040 | already; you have this and this, and it's actually quite good. But if we come down here, what I
00:20:23.600 | specifically want is for it to always follow this structure: we have the double header for the topic,
00:20:30.880 | a summary header, a couple of bullet points, and then I always want to follow this pattern where the
00:20:37.200 | "to conclude" is always bold. I want to be very specific about what I want, and, to be
00:20:45.280 | fully honest, with GPT-4o mini you can actually just prompt most of this in, but for the sake of the example
00:20:52.960 | we're going to provide few-shot examples in our few-shot prompt template instead to get this. So we're
00:21:00.560 | going to provide one example here, a second example here, and you see we're just following that same pattern;
00:21:06.800 | we're just setting up the pattern that the LLM should use. So we set that up here: we have our
00:21:14.400 | main header, a little summary, some subheaders with bullet points, subheader, bullet points, subheader, bullet
00:21:20.720 | points, "to conclude", and so on, same with this one here. And let's see what we get.
00:21:29.760 | Okay, so this is the structure of our new few-shot prompt template, and you can see what it all looks like.
00:21:40.320 | Let's come down, and we're basically going to insert that directly into our chat prompt
00:21:46.960 | template, so we have from_messages with the system prompt and the user prompt, and in between we have these. Let me
00:21:56.960 | actually show you very quickly: we just have our FewShotChatMessagePromptTemplate,
00:22:04.800 | which will be fed into the middle here. Run that, and then feed all of this back into our pipeline,
00:22:10.240 | and this will modify the structure so that we have that bold "to conclude" at the end, roughly as sketched below.
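A minimal sketch of how the few-shot block slots into the chat prompt and the pipeline. The system prompt wording is paraphrased from what's described above, and `few_shot_prompt`, `llm`, `query`, and `context` are assumed to be defined as in the earlier sketches (with `few_shot_prompt` built from examples that follow the markdown pattern):

```python
new_system_prompt = (
    "Answer the user's query based on the context below. Always answer in "
    "markdown format. When doing so, provide headers, short summaries, "
    "follow with bullet points, then conclude.\n\n"
    "Context: {context}"
)

# The few-shot examples sit between the system prompt and the user query.
prompt_template = ChatPromptTemplate.from_messages([
    ("system", new_system_prompt),
    few_shot_prompt,
    ("user", "{query}"),
])

pipeline = prompt_template | llm
out = pipeline.invoke({"query": query, "context": context})
print(out.content)
```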
00:22:16.160 | We can see nicely here that we now get exactly the structure that we were after. Again,
00:22:23.280 | with GPT-4o models and many other OpenAI models you don't really need to do this, but you will see it
00:22:30.560 | in other examples; we do have an example of this where we're using Llama, I think Llama 2
00:22:37.360 | if I'm not wrong, and you can see that adding this few-shot prompt template is actually a very good way of
00:22:45.840 | getting those smaller, less capable models to follow your instructions. So it's really when you're
00:22:52.000 | working with those smaller LLMs that this can be super useful, but even for state-of-the-art models like GPT-4o, if you do find
00:22:58.960 | that you're struggling with the prompting and it's just not quite following exactly what you want it to do,
00:23:03.520 | this is a very good technique for getting it to follow a very strict structure or behavior. Okay, so
00:23:11.760 | moving on, we have chain-of-thought prompting. This is a more common prompting technique that
00:23:19.440 | encourages the LLM to think through its reasoning, or its thoughts, step by step, hence a chain of
00:23:26.880 | thought. The idea behind this is that in math class, when you're a kid, the teachers would always
00:23:34.240 | push you to put down your working out, and there are multiple reasons for that. One of them is
00:23:40.640 | to get you to think, because they know that in a lot of cases you're a kid, you're
00:23:44.800 | in a rush, you don't really care about this test, and they're just trying to get you to
00:23:50.880 | slow down a little bit and actually put down your reasoning. That kind of forces you to think: oh,
00:23:55.840 | actually I'm skipping a little bit in my head because I'm trying to do everything up there; if I write
00:24:00.800 | it down, all of a sudden it's like, oh, actually I need to do that slightly
00:24:06.320 | differently; you realize you were probably rushing a little bit. Now, I'm not saying an LLM is
00:24:10.640 | rushing, but it's a similar effect: by writing everything down, LLMs tend to get things
00:24:17.360 | right more frequently. And at the same time, similar to when you're a child and a teacher is reviewing
00:24:24.400 | your exam work, by having the LLM write down its reasoning, you as a human or engineer
00:24:31.200 | can see where the LLM went wrong, if it did go wrong, which can be very useful when you're trying
00:24:36.960 | to diagnose problems. So with chain of thought we should see fewer hallucinations and less generally bad
00:24:43.760 | performance. Now, to implement chain of thought in LangChain there are no specific LangChain
00:24:48.560 | objects that do it; instead, it's just prompting. So let's go down and see how we might do
00:24:54.560 | that. Okay, so: "Be a helpful assistant and answer the user's question. You must answer the question directly without
00:25:00.720 | any other text or explanation." That's our no-chain-of-thought system prompt. I will just note here,
00:25:07.840 | especially with OpenAI, and again this is one of those things where you'll see it more with the smaller models,
00:25:13.440 | most LLMs are actually trained to use chain-of-thought prompting by default, so we're
00:25:17.920 | specifically telling it here, you must answer the question directly without any other text or
00:25:23.200 | explanation; we're actually kind of reverse-prompting it to not use chain of thought, otherwise
00:25:29.440 | by default it will try to do that, because it's been trained to. That's how relevant
00:25:34.880 | chain of thought is. So I'm going to ask: how many keystrokes are needed to type the numbers
00:25:40.800 | from one to five hundred? We set up our LLM-chain-style pipeline, roughly as below, and we're going to just invoke it.
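A minimal sketch of that no-chain-of-thought setup (the system prompt is paraphrased from the transcript; variable names are mine, and `ChatPromptTemplate` and `llm` are assumed from earlier):

```python
no_cot_system_prompt = (
    "Be a helpful assistant and answer the user's question.\n"
    "You MUST answer the question directly without any other "
    "text or explanation."
)

no_cot_prompt = ChatPromptTemplate.from_messages([
    ("system", no_cot_system_prompt),
    ("user", "{query}"),
])

no_cot_pipeline = no_cot_prompt | llm

query = "How many keystrokes are needed to type the numbers from 1 to 500?"
print(no_cot_pipeline.invoke({"query": query}).content)
```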
00:25:48.560 | And we see what we get: the total number of keystrokes needed to type the numbers from
00:25:53.840 | one to five hundred is 1,511. The actual answer, as I've written here, is
00:26:01.840 | 1,392, so without chain of thought it is hallucinating. Now let's go ahead
00:26:07.680 | and see what it does with chain-of-thought prompting. So: "Be a helpful assistant and answer the user's
00:26:13.360 | question. To answer the question, you MUST list systematically and in precise detail all sub-problems
00:26:20.720 | that need to be solved to answer the question. Solve each sub-problem individually", you have to shout
00:26:26.720 | at the LLM sometimes to get it to listen, "and in sequence. Finally, use everything you've worked through to provide the final answer." So we're forcing it to go through
00:26:38.000 | the full problem. Now, we can remove that context variable; I'm not sure why it's there, so I'll run it again without it,
00:26:45.200 | and let's see. A sketch of this chain-of-thought setup is below.
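And a corresponding sketch of the chain-of-thought version (again, the prompt is paraphrased from what's read out above, and `query` is reused from the previous sketch):

```python
cot_system_prompt = (
    "Be a helpful assistant and answer the user's question.\n"
    "To answer the question, you MUST list systematically and in precise "
    "detail all sub-problems that need to be solved to answer the question. "
    "Solve each sub-problem INDIVIDUALLY and in sequence. Finally, use "
    "everything you have worked through to provide the final answer."
)

cot_prompt = ChatPromptTemplate.from_messages([
    ("system", cot_system_prompt),
    ("user", "{query}"),
])

cot_pipeline = cot_prompt | llm
print(cot_pipeline.invoke({"query": query}).content)
```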
00:26:48.080 | You can see straight away that it's taking a lot longer to generate the output; that's because
00:26:55.760 | it's generating so many more tokens, so that's just one drawback of this approach. But let's see what we have. So,
00:27:01.760 | to determine how many keystrokes are needed to type those numbers, it is breaking it down into several sub-problems:
00:27:07.760 | count the number of digits from one to nine, from ten to ninety-nine, and so on, and count the digits in the number 500.
00:27:15.440 | Interesting, so that's how it's breaking it up, then summing the digits counted in the previous steps. So we go
00:27:22.320 | through the total digits, and we see nine digits for this range, 180 for this one, 1,200 for this one, and then of course
00:27:34.560 | three here. It sums all those digits and actually comes to the right answer. So that
00:27:42.640 | is the difference with chain of thought versus without: without it, we just get
00:27:48.960 | the wrong answer, basically guessing; with chain of thought we get the right answer, just by the LLM
00:27:55.840 | writing down its reasoning and breaking the problem down into multiple parts, which I found super
00:28:01.920 | interesting. So that's pretty cool. Now, as we mentioned
00:28:10.320 | before, most LLMs nowadays are actually trained to use chain-of-thought prompting by default, so let's just
00:28:15.600 | see what happens if we don't mention anything: "Be a helpful assistant and answer the user's question." We're not telling it
00:28:21.200 | not to think through its reasoning, and we're not telling it to think through its reasoning; let's just
00:28:26.320 | see what it does. You can see it's actually doing essentially the same reasoning. It doesn't
00:28:37.200 | give us the sub-problems at the start, but it is going through and breaking everything
00:28:42.160 | apart, which is quite interesting, and we get the same correct answer. The formatting here is slightly
00:28:48.640 | different, probably a little cleaner actually, although in the earlier output we got a
00:28:55.760 | lot more information, so both are fine, and in this scenario we do get the right answer as
00:29:03.680 | well. So you can see that chain-of-thought prompting has quite literally been trained
00:29:10.320 | into the model, and you'll see that with most, well, I think all, state-of-the-art LLMs. Okay, cool. So that is our
00:29:19.600 | chapter on prompting. Again, we're focusing very much on the fundamentals of prompting
00:29:28.400 | there, and of course tying that back to the actual objects and methods within LangChain. But for now, that's
00:29:36.240 | it for prompting, and we'll move on to the next chapter.