
Prompt Templating and Techniques in LangChain


Chapters

0:00 Prompts are Fundamental to LLMs
2:13 Building Good LLM Prompts
7:13 LangChain Prompts Code Setup
11:36 Using our LLM with Templates
16:54 Few-shot Prompting
23:11 Chain of Thought Prompting


00:00:00.000 | Now we're going to move on to the chapter on prompts in LangChain. Prompts seem like
00:00:07.280 | a simple concept, and they are a simple concept, but there's actually quite a lot to them once you
00:00:11.280 | start diving in, and they truly have been a fundamental part of what has propelled us
00:00:19.760 | forwards from pre-LLM times to the current LLM times. You have to remember that until LLMs became widespread,
00:00:27.840 | the way to fine-tune an AI or ML model was to get loads of data for your particular
00:00:38.880 | use case and spend a load of time training your specific transformer, or part of the transformer, to essentially
00:00:46.000 | adapt it for that particular task. That could take a long time. Depending on the task it could take
00:00:53.600 | you months, or for a simpler task it might take days, potentially
00:01:00.720 | weeks. Now, the interesting thing with LLMs is that rather than needing to go through this whole fine-
00:01:09.040 | tuning process to modify a model for one task over another, we just prompt it
00:01:18.000 | differently: we literally tell the model, "hey, I want you to do this in this particular way", and that is a
00:01:24.400 | paradigm shift in what you're doing. It's so much faster, taking you a
00:01:30.320 | couple of minutes rather than days, weeks, or months, and LLMs are incredibly powerful when it comes to
00:01:37.840 | generalizing across these many different tasks. So prompts, which carry those instructions,
00:01:45.600 | are a fundamental part of that. Now, LangChain naturally has many functionalities around prompts,
00:01:53.360 | and we can build very dynamic prompting pipelines that modify the structure and content of what we're
00:01:59.120 | actually feeding into our LLM depending on different variables and inputs. We'll see that in
00:02:05.760 | this chapter, where we're going to work through prompting within the scope of a RAG example. So let's start
00:02:14.000 | by just dissecting the various parts of a prompt that we might expect to see for a use case like RAG.
00:02:21.760 | So our typical prompt for RAG or retrieval augmented generation will include rules for
00:02:29.520 | the LLM, and this you will see in most prompts, if not all. This part of the prompt sets up
00:02:38.480 | the behavior of the LLM: how it should be responding to user queries, what sort of personality it should
00:02:46.320 | be taking on, what it should be focusing on when it is responding, and any particular rules or boundaries that
00:02:52.560 | we want to set. Really, what we're trying to do here is simply provide as much information
00:03:00.320 | as possible to the LLM about what we're doing. We just want to give the LLM context as to the
00:03:11.120 | place it finds itself in, because the LLM has no idea where it is; it just takes in some
00:03:17.520 | information and spits out information. If the only information it receives is the user's
00:03:22.560 | query, it doesn't know the context: what application is it within,
00:03:29.680 | what is its objective, what is its aim, what are the boundaries? All of this we need to assume
00:03:37.040 | the LLM has absolutely no idea about, because it truly does not. So we provide as much context as we can,
00:03:44.800 | but it's important that we don't overdo it. We see this all the time: people will over-prompt an LLM.
00:03:53.280 | You want to be concise, you don't want fluff, and in general, the more
00:04:00.000 | concise and less fluffy you can make every part of your prompt, the better. Now, those rules or instructions typically live in the system
00:04:06.640 | prompt of your LLM. The second component is the context, which is RAG-specific. The context refers to some
00:04:14.240 | sort of external information that you are feeding into your LLM. We may have received this information
00:04:21.120 | from a web search, a database query, or, quite often in the case of RAG, a vector database. This
00:04:29.600 | external information that we provide is essentially the "RA", the retrieval augmentation,
00:04:37.040 | of RAG. We are augmenting the knowledge of our LLM, which is contained within
00:04:45.040 | the LLM's model weights, with some external knowledge. That's what we're doing
00:04:51.360 | here. Now, for chat LLMs this context is typically placed within the conversation, inside the user
00:05:01.680 | or assistant messages, and with more recent models it can also be placed within tool messages as well.
00:05:12.640 | Then we have the question. It's pretty straightforward: this is the query from the user,
00:05:19.680 | and it's usually a user message, of course. There might be some additional formatting around it. You might
00:05:26.960 | add a little bit of extra context, or you might add some additional instructions if you find that your LLM
00:05:33.680 | sometimes veers off the rules that you've set within the system prompt. You might append or prefix
00:05:40.240 | something here, but for the most part it's probably just going to be the user's input. And finally,
00:05:45.680 | those are all the inputs for our prompt; the final piece is the output that we get, the answer
00:05:53.040 | from the assistant. That's not even specific to RAG, it's just what you would expect from a
00:05:58.400 | chat LLM, or any LLM, and of course that would be an assistant message. So putting all of that together
00:06:06.560 | into an actual prompt, you can see everything we have here. We have the rules for our prompt, the instructions,
00:06:13.120 | where we're just saying: answer the question based on the context below. If you cannot answer the question
00:06:17.680 | using the provided information, answer with "I don't know". Then we have some context. In this scenario, that
00:06:26.000 | context we're feeding in, because it's the first message, we might put into the system prompt. But that
00:06:32.400 | may also be turned around. If you, for example, have an agent, you might have your question
00:06:38.800 | before the context, coming from a user message, and then this context
00:06:45.600 | would follow the question and be recognized as a tool message; it would be fed in that way as well.
00:06:53.120 | It kind of depends on what sort of structure you're going for, but you can do either. You can feed it
00:06:57.360 | into the system message if it's less conversational, whereas if it's more conversational we might feed
00:07:03.520 | it in as a tool message. Then we have the user query, and then we'd have the AI
00:07:09.600 | answer, which obviously would be generated by the model. A rough sketch of this layout is shown below.
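Put together, the assembled prompt might look roughly like this (a loose sketch; the wording here is illustrative rather than the notebook's verbatim prompt):

```
System:    Answer the user's query based on the context below. If you cannot
           answer the question using the provided information, answer with
           "I don't know".

           Context: <retrieved information, e.g. from a vector database>

User:      <the user's question>

Assistant: <the LLM's generated answer>
```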
00:07:16.320 | Now let's switch across to the code. We're in the LangChain course repo, notebooks, 03 prompts. I'm just going to open this in Colab. Let's scroll
00:07:23.040 | down and start by installing the prerequisites, so we just have the various
00:07:28.800 | libraries again. As I mentioned before, LangSmith is optional. You don't need to install it, but if you
00:07:33.840 | would like to see your traces and everything in LangSmith then I would recommend doing that.
00:07:39.040 | And if you are using LangSmith you will need to enter your API key here. Again, if you're not using
00:07:45.120 | LangSmith you don't need to enter anything; you just skip that cell. A sketch of that setup is shown below.
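As a rough sketch of what that setup cell typically looks like in Colab (the package list and environment variable names here are my assumptions, not copied from the notebook):

```python
# Install the libraries used in this chapter (package list is illustrative).
!pip install -qU langchain langchain-core langchain-openai langsmith

import os
from getpass import getpass

# Optional: enable LangSmith tracing. Skip this cell if you aren't using LangSmith.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass("Enter LangSmith API key (or leave blank): ")
```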
00:07:51.680 | Cool, let's jump into basic prompting then. We're going to start with this prompt: answer the user's query based on the context
00:07:58.080 | below. We're just structuring what we just saw in code, and we're going to be using the chat prompt
00:08:06.240 | template because, generally speaking, we're using chat LLMs in most cases nowadays. So we have our chat prompt
00:08:15.360 | template, and that is going to contain a list of messages: a system message to begin with, which is just
00:08:21.360 | going to contain this, and we're feeding the context in within that. And we have our user query here.
00:08:29.520 | Okay, so we'll run this, and if we take a look here, we haven't specified what our
00:08:38.160 | input variables are, but we can see that we have query and we have context up here. So we can
00:08:47.360 | see that these are the input variables, we just haven't explicitly defined them. So let's just confirm
00:08:56.400 | that LangChain did pick those up, and we can see that it did. It has context and query as
00:09:01.360 | the input variables for the prompt template that we just defined. Okay, so we can also see the structure
00:09:09.200 | of our template. Let's have a look. We can see that within messages we have a system message
00:09:18.160 | prompt template. The way that we define this, you can see here, is with from_messages, and this will
00:09:23.520 | consume various different structures: from_messages takes a sequence of
00:09:34.320 | message-like representations. So we could pass in a system prompt template object and then a user prompt
00:09:41.840 | template object, or we can just use a tuple like this, which defines "this is a system message, this is a user message", and
00:09:50.320 | you could also do assistant or tool messages here as well using the same structure. Then we
00:09:58.400 | can look inside and, of course, that is being translated into the system message prompt template and human
00:10:05.280 | message prompt template. We have our input variables in there and then we have the template too, something like the sketch below.
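A minimal sketch of that tuple-based definition and the checks described here (the exact prompt strings are assumptions based on the template read out later in the video):

```python
from langchain_core.prompts import ChatPromptTemplate

# Tuple-based definition: ("system", ...) and ("user", ...) tell LangChain
# which message type each template belongs to.
prompt_template = ChatPromptTemplate.from_messages([
    ("system", (
        "Answer the user's query based on the context below. "
        "If you cannot answer the question using the provided "
        "information, answer with \"I don't know\".\n\n"
        "Context: {context}"
    )),
    ("user", "{query}"),
])

# LangChain infers the input variables from the {placeholders} above.
print(prompt_template.input_variables)  # ['context', 'query']
print(prompt_template.messages)         # system + human message prompt templates
```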
00:10:14.080 | Now let's continue, and we'll see what I just said. We're importing our system message prompt
00:10:21.520 | template and human message prompt template, and you can see we're using the same from_messages method here,
00:10:27.600 | and it's still a sequence of message-like representations, it's just that what that
00:10:34.800 | actually means can vary. So here we have SystemMessagePromptTemplate.from_template with our prompt and HumanMessagePromptTemplate.from_template
00:10:41.040 | with our query. There are various ways you might want to do this; it just depends on how explicit
00:10:47.280 | you want to be. Generally speaking, I think for myself I would prefer that we stick with the objects
00:10:56.560 | themselves and be explicit, but it is definitely a little harder to parse when you're reading
00:11:02.400 | it, so I understand why you might also prefer the tuples; they're definitely cleaner and do look
00:11:08.640 | simpler, so it just depends, I suppose, on preference. So you can see again that this is exactly the same:
00:11:20.160 | a chat prompt template containing this and this. You probably want to see the exact output,
00:11:28.320 | so here it is as messages: exactly the same as what I output before. A sketch of that explicit version is below.
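For reference, a minimal sketch of the more explicit, object-based equivalent (again, the exact strings are assumptions mirroring the tuple version above):

```python
from langchain_core.prompts import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)

# Same prompt text as the tuple-based version above (illustrative wording).
system_prompt = (
    "Answer the user's query based on the context below. "
    "If you cannot answer the question using the provided "
    "information, answer with \"I don't know\".\n\n"
    "Context: {context}"
)

# Explicit prompt-template objects instead of ("system", ...) / ("user", ...) tuples.
prompt_template = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template(system_prompt),
    HumanMessagePromptTemplate.from_template("{query}"),
])
```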
00:11:38.960 | Cool, so we have all that. Let's see how we would invoke our LLM with these. We're going to be using GPT-4o mini again, so we do need our API key; enter that
00:11:50.000 | and we'll just initialize our LLM. We are going with a low temperature here, so less randomness or less
00:11:57.360 | creativity, and in many cases this is actually what I would be doing. The reason
00:12:03.360 | we're going with low temperature in this scenario is that we're doing RAG, and if you
00:12:10.240 | remember from before, if we scroll up a little bit, our template says: answer the user's query based on the
00:12:15.520 | context below, and if you cannot answer the question using the provided information, answer with
00:12:21.600 | "I don't know". Just from reading that, we know that we want our LLM to be as truthful and accurate
00:12:30.400 | as possible, so a more creative LLM is going to struggle with that and is more likely to hallucinate,
00:12:38.240 | whereas a low-creativity or low-temperature LLM will probably stick to the rules a little better.
00:12:45.120 | So again, it depends on your use case. If you're doing creative writing you might want to go
00:12:50.240 | with a higher temperature, but for things like RAG, where the information being output should be accurate
00:12:57.280 | and truthful, it's important, I think, that we keep the temperature low. I talk about that a little bit
00:13:05.760 | here: a lower temperature of zero makes the LLM's output more deterministic, which in theory
00:13:12.160 | should lead to less hallucination. Okay, so we're going to go with LCEL again here; for those of you
00:13:19.600 | that have used LangChain in the past, this is equivalent to an LLMChain object. Our prompt template is being piped
00:13:26.960 | into our LLM, and with that we have our pipeline, sketched out below.
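A minimal sketch of that LLM initialization and LCEL pipeline (the model name and temperature follow what's described here; treat the rest as an assumption):

```python
import os
from getpass import getpass

from langchain_openai import ChatOpenAI

os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")

# Low temperature = less randomness, which suits RAG-style factual answers.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)

# LCEL: pipe the prompt template into the LLM, equivalent to the old LLMChain.
pipeline = prompt_template | llm
```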
00:13:36.720 | Now let's see how we would use that pipeline. We're going to create some context here; this is just some context around Aurelio AI,
00:13:43.920 | mentioning that we built semantic router and semantic chunkers, that there's an AI platform,
00:13:52.640 | and development services. We also mention, and I think we specifically ask about this later on in the example,
00:13:59.520 | the LangChain Experts part, a little piece of information. Now, most LLMs would not have been
00:14:05.120 | trained on the recent internet, and the fact that this came in September 2024 is relatively recent, so a lot of
00:14:13.200 | LLMs out of the box you wouldn't expect to know it, which makes it a good little bit of information
00:14:20.480 | to ask about. So we invoke: we have our query, "what do we do?", and we have that context, and we're
00:14:28.080 | feeding that into the pipeline that we defined here. When we invoke it, it's automatically
00:14:33.840 | going to take query and context and feed them into our prompt template. If we
00:14:41.680 | want to, we can also be a little more explicit. You will probably see me doing this throughout the
00:14:48.800 | course, because I do like to be explicit with everything, to be honest, so you'll probably see me doing this,
00:14:57.680 | and this is doing the same thing, or you'll see in a moment that it's doing the exact same thing.
00:15:11.440 | Again, this is just an LCEL thing. All I'm doing in this scenario is saying: take
00:15:20.080 | the query key from the input dictionary, and also take the context key from that input dictionary.
00:15:29.440 | So this is doing the exact same thing; the reason that we might want to write it this way is mainly for
00:15:39.200 | clarity, to be honest, just to explicitly say, okay, these are the inputs, because otherwise we don't
00:15:44.240 | really have them in the code other than within our original prompts up here, which is not super clear.
00:15:51.920 | So I think it's usually a good idea to just be more explicit with these things, and of course,
00:15:57.840 | if you decide you're going to modify things a little bit, let's say you rename this to "input" down the line, you
00:16:04.080 | can still feed in the same input here, you're just mapping it between different keys essentially,
00:16:09.360 | or if you would like to modify that value, say lowercase it on the way in or something, you can do that.
00:16:15.760 | So you have that; I'll just redefine it, something like the sketch below, and we'll invoke again.
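A rough sketch of that more explicit mapping (the lambda-based dictionary is my assumption of how the notebook wires it up; `query`, `context`, `prompt_template`, and `llm` are assumed defined earlier):

```python
# Explicitly map the incoming dict's keys to the prompt's input variables.
# The dictionary of callables is coerced by LCEL into a parallel runnable.
pipeline = (
    {
        "query": lambda x: x["query"],
        "context": lambda x: x["context"],
    }
    | prompt_template
    | llm
)

result = pipeline.invoke({"query": query, "context": context})
print(result.content)
```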
00:16:27.200 | And we see that this is the exact same thing. So, Aurelio AI, and this is an AI message just
00:16:33.440 | generated by the LLM: expertise in building AI agents, several open source frameworks, semantic router, the AI platform,
00:16:42.640 | and so on. So it has everything there other than the LangChain
00:16:49.120 | Experts part; it didn't mention that, but we'll test it on that later. Okay, so on to few-shot
00:16:55.920 | prompting. This is a specific prompting technique. Now, many state-of-the-art LLMs
00:17:03.280 | are very good at instruction following, so you'll find that few-shot prompting is less common now than
00:17:09.920 | it used to be, at least for the bigger, more state-of-the-art models, but when you start using smaller models,
00:17:17.840 | which is not really what we're using here, but let's say you're using an open source model like Llama 3
00:17:25.200 | or Llama 2, which is much smaller, you will probably need to consider things like few-shot prompting,
00:17:32.800 | although, that being said, with OpenAI models, at least the current ones, this is not so
00:17:40.560 | important. Nonetheless, it can be useful. The idea behind few-shot prompting is that you are providing a few
00:17:47.440 | examples to your LLM of how it should behave before you actually go into the main
00:17:56.160 | part of the conversation. So let's see how that would look. We create an example prompt, with our
00:18:03.120 | human and AI messages, so human input, AI response. We're basically setting up: with this type of input you should
00:18:10.720 | provide this type of output. That's what we're doing here, and we're just going to provide some examples,
00:18:17.120 | so we have our input, "here is query 1", and "here is the answer 1". This is just to
00:18:24.000 | show you how it works; it's not what we'd actually feed into our LLM. Then, with both these examples and our
00:18:30.720 | example prompt, we feed both of these into LangChain's FewShotChatMessagePromptTemplate, and
00:18:40.560 | you'll see what we get out of it: it basically formats and structures everything
00:18:46.080 | for us, roughly as in the sketch below.
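A minimal sketch of that (the placeholder example text is mine, mirroring the "query 1 / answer 1" idea described here):

```python
from langchain_core.prompts import ChatPromptTemplate, FewShotChatMessagePromptTemplate

# The per-example prompt: how each (input, output) pair is rendered.
example_prompt = ChatPromptTemplate.from_messages([
    ("human", "{input}"),
    ("ai", "{output}"),
])

# Illustrative examples only; not something you'd feed to a real application.
examples = [
    {"input": "Here is query 1", "output": "Here is the answer 1"},
    {"input": "Here is query 2", "output": "Here is the answer 2"},
]

few_shot_prompt = FewShotChatMessagePromptTemplate(
    example_prompt=example_prompt,
    examples=examples,
)

# Inspect the formatted human/AI message pairs it produces.
print(few_shot_prompt.format())
```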
00:18:51.520 | And how you use this depends on your application. Let's say you see that your user is talking about a particular topic and you would like to guide your
00:19:00.960 | LLM to talk about that particular topic in a particular way. You could identify that the user is
00:19:07.440 | talking about that topic, via a keyword match or a semantic similarity match, and based on
00:19:12.960 | that you might want to modify the examples that you feed into your FewShotChatMessagePromptTemplate.
00:19:19.360 | That could be what you do for topic A; for topic B you might have another
00:19:25.280 | set of examples that you feed in. All this time your example prompt remains the same,
00:19:30.640 | but you're just modifying the examples that are going in so that they're more relevant to whatever
00:19:35.200 | it is your user is actually talking about. So that can be useful, and a sketch of the idea follows.
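This dynamic swapping isn't shown in the notebook itself, so treat the following as a hypothetical sketch of the idea: the example_prompt stays fixed while the examples list is chosen per detected topic.

```python
# Hypothetical per-topic example sets; the example_prompt itself never changes.
examples_by_topic = {
    "topic_a": [
        {"input": "A question about topic A", "output": "An answer in topic A's preferred style"},
    ],
    "topic_b": [
        {"input": "A question about topic B", "output": "An answer in topic B's preferred style"},
    ],
}

def build_few_shot_prompt(topic: str) -> FewShotChatMessagePromptTemplate:
    """Pick the example set for the topic detected via keyword or semantic match."""
    return FewShotChatMessagePromptTemplate(
        example_prompt=example_prompt,
        examples=examples_by_topic[topic],
    )
```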
00:19:41.120 | Now let's see an example of that. If we were using a tiny LLM, its ability would be limited, although I think we are probably fine here.
00:19:48.240 | We're going to say: answer the user's query based on the context below, always answer in markdown format,
00:19:54.240 | so being very specific in the system prompt. That's nice, but what we've said here is,
00:20:02.400 | always answer in markdown format, and when doing so please provide headers, short summaries, and
00:20:09.840 | follow with bullet points, then conclude. So you see this here, and we get this overview
00:20:17.040 | already; you have this and this, and it's actually quite good. But if we come down here, what I
00:20:23.600 | specifically want is for it to always follow this structure: we have the double header for the topic,
00:20:30.880 | a summary header, a couple of bullet points, and then I always want to follow this pattern where the
00:20:37.200 | "to conclude" is always bold. I want to be very specific about what I want, and, to be
00:20:45.280 | fully honest, with GPT-4o mini you can actually just prompt most of this in, but for the sake of the example
00:20:52.960 | we're going to provide few-shot examples in our few-shot prompt template instead to get this. So we're
00:21:00.560 | going to provide one example here, a second example here, and you see we're just following that same pattern;
00:21:06.800 | we're just setting up the pattern that the LLM should use. So we set that up here: we have our
00:21:14.400 | main header, a little summary, some subheaders with bullet points, subheader, bullet points, subheader, bullet
00:21:20.720 | points, "to conclude", and so on, same with this one here. And let's see what we get.
00:21:29.760 | Okay, so this is the structure of our new few-shot prompt template, and you can see what it all looks like.
00:21:40.320 | Let's come down, and we're basically going to insert that directly into our chat prompt
00:21:46.960 | template, so we have from_messages with the system prompt and the user prompt, and in between we have these. Let me
00:21:56.960 | actually show you very quickly: we just have our FewShotChatMessagePromptTemplate,
00:22:04.800 | which will be fed into the middle here. Run that, and then feed all of this back into our pipeline,
00:22:10.240 | and this will modify the structure so that we have that bold "to conclude" at the end, roughly as sketched below.
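A minimal sketch of how the few-shot block slots into the chat prompt and the pipeline. The system prompt wording is paraphrased from what's described above, and `few_shot_prompt`, `llm`, `query`, and `context` are assumed to be defined as in the earlier sketches (with `few_shot_prompt` built from examples that follow the markdown pattern):

```python
new_system_prompt = (
    "Answer the user's query based on the context below. Always answer in "
    "markdown format. When doing so, provide headers, short summaries, "
    "follow with bullet points, then conclude.\n\n"
    "Context: {context}"
)

# The few-shot examples sit between the system prompt and the user query.
prompt_template = ChatPromptTemplate.from_messages([
    ("system", new_system_prompt),
    few_shot_prompt,
    ("user", "{query}"),
])

pipeline = prompt_template | llm
out = pipeline.invoke({"query": query, "context": context})
print(out.content)
```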
00:22:16.160 | We can see nicely here that we now get exactly the structure that we were after. Again,
00:22:23.280 | with GPT-4o models and many other OpenAI models you don't really need to do this, but you will see it
00:22:30.560 | in other examples; we do have an example of this where we're using Llama, I think Llama 2
00:22:37.360 | if I'm not wrong, and you can see that adding this few-shot prompt template is actually a very good way of
00:22:45.840 | getting those smaller, less capable models to follow your instructions. So it's really when you're
00:22:52.000 | working with those smaller LLMs that this can be super useful, but even for state-of-the-art models like GPT-4o, if you do find
00:22:58.960 | that you're struggling with the prompting and it's just not quite following exactly what you want it to do,
00:23:03.520 | this is a very good technique for getting it to follow a very strict structure or behavior. Okay, so
00:23:11.760 | moving on, we have chain-of-thought prompting. This is a more common prompting technique that
00:23:19.440 | encourages the LLM to think through its reasoning, or its thoughts, step by step, hence a chain of
00:23:26.880 | thought. The idea behind this is that in math class, when you're a kid, the teachers would always
00:23:34.240 | push you to put down your working out, and there are multiple reasons for that. One of them is
00:23:40.640 | to get you to think, because they know that in a lot of cases you're a kid, you're
00:23:44.800 | in a rush, you don't really care about this test, and they're just trying to get you to
00:23:50.880 | slow down a little bit and actually put down your reasoning. That kind of forces you to think: oh,
00:23:55.840 | actually I'm skipping a little bit in my head because I'm trying to do everything up there; if I write
00:24:00.800 | it down, all of a sudden it's like, oh, actually I need to do that slightly
00:24:06.320 | differently; you realize you were probably rushing a little bit. Now, I'm not saying an LLM is
00:24:10.640 | rushing, but it's a similar effect: by writing everything down, LLMs tend to get things
00:24:17.360 | right more frequently. And at the same time, similar to when you're a child and a teacher is reviewing
00:24:24.400 | your exam work, by having the LLM write down its reasoning, you as a human or engineer
00:24:31.200 | can see where the LLM went wrong, if it did go wrong, which can be very useful when you're trying
00:24:36.960 | to diagnose problems. So with chain of thought we should see fewer hallucinations and less generally bad
00:24:43.760 | performance. Now, to implement chain of thought in LangChain there are no specific LangChain
00:24:48.560 | objects that do it; instead, it's just prompting. So let's go down and see how we might do
00:24:54.560 | that. Okay, so: "Be a helpful assistant and answer the user's question. You must answer the question directly without
00:25:00.720 | any other text or explanation." That's our no-chain-of-thought system prompt. I will just note here,
00:25:07.840 | especially with OpenAI, and again this is one of those things where you'll see it more with the smaller models,
00:25:13.440 | most LLMs are actually trained to use chain-of-thought prompting by default, so we're
00:25:17.920 | specifically telling it here, you must answer the question directly without any other text or
00:25:23.200 | explanation; we're actually kind of reverse-prompting it to not use chain of thought, otherwise
00:25:29.440 | by default it will try to do that, because it's been trained to. That's how relevant
00:25:34.880 | chain of thought is. So I'm going to ask: how many keystrokes are needed to type the numbers
00:25:40.800 | from one to five hundred? We set up our LLM-chain-style pipeline, roughly as below, and we're going to just invoke it.
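A minimal sketch of that no-chain-of-thought setup (the system prompt is paraphrased from the transcript; variable names are mine, and `ChatPromptTemplate` and `llm` are assumed from earlier):

```python
no_cot_system_prompt = (
    "Be a helpful assistant and answer the user's question.\n"
    "You MUST answer the question directly without any other "
    "text or explanation."
)

no_cot_prompt = ChatPromptTemplate.from_messages([
    ("system", no_cot_system_prompt),
    ("user", "{query}"),
])

no_cot_pipeline = no_cot_prompt | llm

query = "How many keystrokes are needed to type the numbers from 1 to 500?"
print(no_cot_pipeline.invoke({"query": query}).content)
```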
00:25:48.560 | And we see what we get: the total number of keystrokes needed to type the numbers from
00:25:53.840 | one to five hundred is 1,511. The actual answer, as I've written here, is
00:26:01.840 | 1,392, so without chain of thought it is hallucinating. Now let's go ahead
00:26:07.680 | and see what it does with chain-of-thought prompting. So: "Be a helpful assistant and answer the user's
00:26:13.360 | question. To answer the question, you MUST list systematically and in precise detail all sub-problems
00:26:20.720 | that need to be solved to answer the question. Solve each sub-problem individually", you have to shout
00:26:26.720 | at the LLM sometimes to get it to listen, "and in sequence. Finally, use everything you've worked through to provide the final answer." So we're forcing it to go through
00:26:38.000 | the full problem. Now, we can remove that context variable; I'm not sure why it's there, so I'll run it again without it,
00:26:45.200 | and let's see. A sketch of this chain-of-thought setup is below.
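And a corresponding sketch of the chain-of-thought version (again, the prompt is paraphrased from what's read out above, and `query` is reused from the previous sketch):

```python
cot_system_prompt = (
    "Be a helpful assistant and answer the user's question.\n"
    "To answer the question, you MUST list systematically and in precise "
    "detail all sub-problems that need to be solved to answer the question. "
    "Solve each sub-problem INDIVIDUALLY and in sequence. Finally, use "
    "everything you have worked through to provide the final answer."
)

cot_prompt = ChatPromptTemplate.from_messages([
    ("system", cot_system_prompt),
    ("user", "{query}"),
])

cot_pipeline = cot_prompt | llm
print(cot_pipeline.invoke({"query": query}).content)
```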
00:26:48.080 | You can see straight away that it's taking a lot longer to generate the output; that's because
00:26:55.760 | it's generating so many more tokens, so that's just one drawback of this approach. But let's see what we have. So,
00:27:01.760 | to determine how many keystrokes are needed to type those numbers, it is breaking it down into several sub-problems:
00:27:07.760 | count the number of digits from one to nine, from ten to ninety-nine, and so on, and count the digits in the number 500.
00:27:15.440 | Interesting, so that's how it's breaking it up, then summing the digits counted in the previous steps. So we go
00:27:22.320 | through the total digits, and we see nine digits for this range, 180 for this one, 1,200 for this one, and then of course
00:27:34.560 | three here. It sums all those digits and actually comes to the right answer. So that
00:27:42.640 | is the difference with chain of thought versus without: without it, we just get
00:27:48.960 | the wrong answer, basically guessing; with chain of thought we get the right answer, just by the LLM
00:27:55.840 | writing down its reasoning and breaking the problem down into multiple parts, which I found super
00:28:01.920 | interesting. So that's pretty cool. Now, as we mentioned
00:28:10.320 | before, most LLMs nowadays are actually trained to use chain-of-thought prompting by default, so let's just
00:28:15.600 | see what happens if we don't mention anything: "Be a helpful assistant and answer the user's question." We're not telling it
00:28:21.200 | not to think through its reasoning, and we're not telling it to think through its reasoning; let's just
00:28:26.320 | see what it does. You can see it's actually doing essentially the same reasoning. It doesn't
00:28:37.200 | give us the sub-problems at the start, but it is going through and breaking everything
00:28:42.160 | apart, which is quite interesting, and we get the same correct answer. The formatting here is slightly
00:28:48.640 | different, probably a little cleaner actually, although in the earlier output we got a
00:28:55.760 | lot more information, so both are fine, and in this scenario we do get the right answer as
00:29:03.680 | well. So you can see that chain-of-thought prompting has quite literally been trained
00:29:10.320 | into the model, and you'll see that with most, well, I think all, state-of-the-art LLMs. Okay, cool. So that is our
00:29:19.600 | chapter on prompting. Again, we're focusing very much on the fundamentals of prompting
00:29:28.400 | there, and of course tying that back to the actual objects and methods within LangChain. But for now, that's
00:29:36.240 | it for prompting, and we'll move on to the next chapter.