Prompt Templating and Techniques in LangChain

Chapters
0:00 Prompts are Fundamental to LLMs
2:13 Building Good LLM Prompts
7:13 LangChain Prompts Code Setup
11:36 Using our LLM with Templates
16:54 Few-shot Prompting
23:11 Chain of Thought Prompting
Now we're going to move on to the chapter on prompts in LangChain. Prompts seem like a simple concept, and they are a simple concept, but there's actually quite a lot to them when you start diving in, and they truly have been a fundamental part of what has propelled us forward from pre-LLM times to the current LLM times. You have to remember that until LLMs became widespread, the way to adapt an AI or ML model was to gather loads of data for your particular use case and spend a lot of time training your specific transformer, or part of the transformer, to essentially adapt it for that particular task. That could take a long time. Depending on the task it could take you months, or for a simpler task probably days, potentially weeks.

Now, the interesting thing with LLMs is that rather than needing to go through this whole fine-tuning process to modify a model for one task over another, we just prompt it differently. We literally tell the model, "hey, I want you to do this in this particular way", and that is a paradigm shift. It's so much faster: it takes a couple of minutes rather than days, weeks, or months, and LLMs are incredibly powerful when it comes to generalizing across these many different tasks. So prompts, which carry those instructions, are a fundamental part of that.

Now, LangChain naturally has many functionalities around prompts, and we can build very dynamic prompting pipelines that modify the structure and content of what we're actually feeding into our LLM depending on different variables and different inputs. We'll see that in this chapter, where we're going to work through prompting within the scope of a RAG example. So let's start by dissecting the various parts of a prompt that we might expect to see for a use case like RAG.
Our typical prompt for RAG, or retrieval augmented generation, will include rules for the LLM, and this part you will see in most prompts, if not all. It sets up the behavior of the LLM: how it should respond to user queries, what sort of personality it should take on, what it should focus on when responding, and any particular rules or boundaries we want to set. Really, what we're trying to do here is provide as much information as possible to the LLM about what we're doing. We want to give the LLM context as to the place it finds itself in, because on its own an LLM has no idea where it is; it just takes in some information and spits out information. If the only information it receives is the user's query, it doesn't know the context: what application it is within, what its objective is, what its aims are, what the boundaries are. We need to assume the LLM has absolutely no idea about any of this, because it truly does not. So we provide as much context as we can, but it's important that we don't overdo it. We see this all the time: people will over-prompt an LLM. You want to be concise, you don't want fluff, and in general, the more concise and less fluffy you can make every part of your prompt, the better. Those rules or instructions typically go in the system prompt of your LLM.

The second component is the context, which is RAG-specific. The context refers to some sort of external information that we're feeding into our LLM. We may have received this information from a web search, a database query, or, quite often in the case of RAG, a vector database. This external information that we provide is essentially the "RA" of RAG, the retrieval augmentation. We are augmenting the knowledge of our LLM, which is contained within the model weights, with some external knowledge. That's what we're doing here. For chat LLMs this context is typically placed within the conversation, inside the user or assistant messages, and with more recent models it can also be placed within tool messages.
Then we have the question. This is pretty straightforward: it's the query from the user, and it's usually a user message, of course. There might be some additional formatting around it. You might add a little bit of extra context, or some additional instructions if you find that your LLM sometimes veers off the rules you've set within the system prompt; you might append or prefix something here. But for the most part it's probably just going to be the user's input. Those are all the inputs for our prompt. Finally, there is the output that we get back: the answer from the assistant. That's not even specific to RAG, it's just what you would expect from any chat LLM, and of course it comes back as an assistant message.
Putting all of that together into an actual prompt, you can see everything we have here. We have the rules, the instructions, where we're just saying: answer the question based on the context below; if you cannot answer the question using the provided information, answer with "I don't know". Then we have some context. In this scenario, because it's the first message, we might put that context into the system prompt, but that can also be turned around. For example, if you have an agent, you might have your question before the context, coming in as a user message, and then the context would follow the question and be recognized as a tool message; it can be fed in that way as well. It depends on the sort of structure you're going for, but you can do either: feed it into the system message if it's less conversational, or, if it's more conversational, feed it in as a tool message. Then we have the user query, and finally the AI answer, which is what gets generated. So let's switch across to the code.
We're in the LangChain course repo, notebooks, 03 prompts; I'm just going to open this in Colab. Let's scroll down and start by installing the prerequisites. We just have the various libraries again. As I mentioned before, LangSmith is optional: you don't need to install it, but if you would like to see your traces and everything in LangSmith, then I would recommend doing so. And if you are using LangSmith you will need to enter your API key here; if you're not using LangSmith you don't need to enter anything, you just skip that cell.
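As a rough sketch (the notebook's actual cell may differ slightly), that optional LangSmith setup usually amounts to exporting the standard LangSmith environment variables before running anything else:

```python
# Optional: enable LangSmith tracing. Skip this entirely if you're not using it.
# These are the standard LangSmith environment variables; the project name here
# is just an illustrative placeholder.
import os
from getpass import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass("Enter your LangSmith API key: ")
os.environ["LANGCHAIN_PROJECT"] = "langchain-course-prompts"
```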
Okay, cool. Let's jump into basic prompting then. We're going to start with this prompt: answer the user's query based on the context below. We're just structuring in code what we just saw, and we're going to be using the ChatPromptTemplate, because generally speaking we're using chat LLMs in most cases nowadays. So our ChatPromptTemplate contains a list of messages: a system message to begin with, which contains those rules and the context, and then our user query. We run this, and note that we haven't specified what our input variables are, but we can see that we have query and context in the template. Let's just confirm that LangChain did pick those up, and we can see that it did: it has context and query as the input variables for the prompt template we just defined.
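A minimal sketch of that template and the input-variable check (the prompt wording is paraphrased from what we saw earlier, so the notebook's exact string may differ slightly):

```python
from langchain.prompts import ChatPromptTemplate

prompt_template = ChatPromptTemplate.from_messages([
    (
        "system",
        "Answer the user's query based on the context below.\n"
        "If you cannot answer the question using the provided information, "
        "answer with \"I don't know\".\n\n"
        "Context: {context}",
    ),
    ("user", "{query}"),
])

# LangChain infers the placeholders for us, so there's no need to
# declare the input variables explicitly.
print(prompt_template.input_variables)  # ['context', 'query']
```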
We can also look at the structure of our templates. Within messages we have a system message prompt template and a human message prompt template. The way we defined this was with from_messages, which will consume various different structures: it takes a sequence of message-like representations. So we could pass in a system prompt template object and then a user prompt template object, or we can just use a tuple like this, where the first element says this one is a system message and this one is a user message; you could also pass assistant or tool messages here using the same structure. Looking inside, those tuples are translated into a SystemMessagePromptTemplate and a HumanMessagePromptTemplate, each with its input variables and its template.
Now let's continue, and we'll see what I just described: we import SystemMessagePromptTemplate and HumanMessagePromptTemplate, and we use the same from_messages method. It's still a sequence of message-like representations; what that actually means can just vary. Here we have SystemMessagePromptTemplate.from_template with our prompt and HumanMessagePromptTemplate.from_template with our query, so there are various ways you might want to do this, and it just depends on how explicit you want to be. Generally speaking, I would prefer to stick with the objects themselves and be explicit, but it is definitely a little harder to parse when you're reading it, so I understand why you might prefer the tuples: it's cleaner and it does look simpler. It just depends on preference, I suppose. You can see that this produces exactly the same ChatPromptTemplate, containing the same system and human templates, and if we print it as messages, it's exactly the same as what I output before.
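For reference, a sketch of that more explicit style, building the same template from the message prompt template objects rather than tuples:

```python
from langchain.prompts import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)

system_prompt = SystemMessagePromptTemplate.from_template(
    "Answer the user's query based on the context below.\n"
    "If you cannot answer the question using the provided information, "
    "answer with \"I don't know\".\n\n"
    "Context: {context}"
)
user_prompt = HumanMessagePromptTemplate.from_template("{query}")

prompt_template = ChatPromptTemplate.from_messages([system_prompt, user_prompt])
```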
Cool, so we have all of that. Let's see how we would invoke our LLM with these prompts. We're going to be using GPT-4o mini again, so we do need our OpenAI API key; enter that and we'll just initialize our LLM. We're going with a low temperature here, so less randomness, or less creativity, and in many cases this is actually what I would do. The reason we're going with a low temperature in this scenario is that we're doing RAG, and if you remember, our template says to answer the user's query based on the context below, and if you cannot answer the question using the provided information, answer with "I don't know". Just from reading that, we know we want our LLM to be as truthful and accurate as possible. A more creative LLM is going to struggle with that and is more likely to hallucinate, whereas a low-creativity, low-temperature LLM will probably stick to the rules a little better. Again, it depends on your use case: for creative writing you might want a higher temperature, but for things like RAG, where the information being output should be accurate and truthful, I think it's important that we keep the temperature low. I talk about that a little here: a lower temperature of zero makes the LLM's output more deterministic, which in theory should lead to less hallucination.
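Roughly, assuming the key is collected the usual way, that initialization looks something like this:

```python
import os
from getpass import getpass
from langchain_openai import ChatOpenAI

# Only prompt for the key if it isn't already set in the environment.
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

# Low temperature: more deterministic, less "creative", which is what we
# want for RAG-style factual answers.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)
```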
So we're going to go with LCEL again here. For those of you that have used LangChain in the past, this is equivalent to an LLMChain object: our prompt template is being fed into our LLM, and with that we have our pipeline.
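That piping is as simple as it sounds; a sketch:

```python
# LCEL: the | operator chains runnables together. The dict we later pass to
# .invoke() is fed into the prompt template, and the formatted messages are
# then passed on to the LLM.
pipeline = prompt_template | llm
```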
Now let's see how we would use that pipeline. We're going to create some context here. This is just some context about Aurelio AI: it mentions that we built semantic router and semantic chunkers, that there's an AI platform and development services, and, the piece of information I think we specifically ask about later in the example, that we are LangChain Experts. Most LLMs will not have been trained on the recent internet, and the fact that this came in September 2024 is relatively recent, so a lot of LLMs out of the box wouldn't be expected to know it; that makes it a good little bit of information to ask about. So we invoke: we have our query, "what do we do?", and we have that context, and we're feeding both into the pipeline we just defined. When we invoke it, it's automatically going to take query and context and feed them into our prompt template.
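As a sketch (the context string here is just a stand-in for the notebook's full Aurelio AI description, and the query wording is illustrative):

```python
context = (
    "Aurelio AI is an AI company building tooling for AI engineers, including "
    "the semantic router and semantic chunkers open-source libraries, an AI "
    "platform, and development services. Aurelio AI became LangChain Experts "
    "in September 2024."
)

result = pipeline.invoke({"query": "What does Aurelio AI do?", "context": context})
# The pipeline returns an AIMessage; .content holds the generated answer text.
print(result.content)
```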
If we want to, we can also be a little more explicit. You will probably see me doing this throughout the course, because I do like to be explicit with everything, to be honest. This is doing the exact same thing; all I'm doing here is saying: take the query key from the input dictionary, and also take the context key from that input dictionary. The reason we might want to write it this way is mainly clarity, to be explicit about what the inputs are, because otherwise they don't really appear in the code other than inside our original prompt up here, which is not super clear. So I think it's usually a good idea to be more explicit with these things. And of course, if you decide to modify things a little down the line, say you rename a field in the input, you can still feed in the same input here; you're just mapping it between different keys. Or if you'd like to modify a value on the way in, say lowercase it, you can do that here too.
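A sketch of that more explicit version, with a mapping at the front of the LCEL pipeline (LCEL coerces the dict and the lambdas into runnables for us):

```python
pipeline = (
    {
        # Pull each prompt variable out of the input dict explicitly. This is
        # also where you could rename keys or transform values on the way in.
        "query": lambda x: x["query"],
        "context": lambda x: x["context"],
    }
    | prompt_template
    | llm
)
```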
So you have that; I'll just redefine the pipeline and we'll invoke it again.
We see that the result is the exact same thing: it's an AI message generated by the LLM about Aurelio AI, mentioning expertise in building AI agents, the several open-source frameworks, the AI platform, and so on. So it has everything there other than the LangChain Experts part; it didn't mention that, but we'll test for it later on.
Now, on to few-shot prompting. This is a specific prompting technique. Many state-of-the-art LLMs are very good at instruction following, so you'll find that few-shot prompting is less common now than it used to be, at least for these bigger, state-of-the-art models. But when you start using smaller models, which isn't really what we're doing here, say an open-source model like Llama 3 or Llama 2 that is much smaller, you will probably need to consider things like few-shot prompting. That being said, with OpenAI models, at least the current ones, this is not so important; nonetheless it can be useful. The idea behind few-shot prompting is that you provide a few examples to your LLM of how it should behave before you actually go into the main part of the conversation.
So let's see how that would look. We create an example prompt with a human and an AI part: a human input and an AI response. We're basically setting up: with this type of input, you should provide this type of output. Then we just provide some examples: our input is "here is query #1", and here is answer #1, and so on. This is just to show you how it works; it's not what we'd actually feed into our LLM. Then, with both these examples and our example prompt, we feed them into LangChain's FewShotChatMessagePromptTemplate, and you'll see what we get out of it: it basically formats and structures everything for us.
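A rough sketch of that setup (the example strings here are placeholders, just as in the walkthrough):

```python
from langchain.prompts import (
    ChatPromptTemplate,
    FewShotChatMessagePromptTemplate,
)

# The pattern each example should follow: a human input and an AI response.
example_prompt = ChatPromptTemplate.from_messages([
    ("human", "{input}"),
    ("ai", "{output}"),
])

# Placeholder examples of the input -> output behavior we want to demonstrate.
examples = [
    {"input": "Here is query #1", "output": "Here is the answer #1"},
    {"input": "Here is query #2", "output": "Here is the answer #2"},
]

few_shot_prompt = FewShotChatMessagePromptTemplate(
    example_prompt=example_prompt,
    examples=examples,
)

# Renders the examples as alternating human/AI messages.
print(few_shot_prompt.format())
```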
Where you use this of course depends on the situation. Let's say you see that your user is talking about a particular topic and you would like to guide your LLM to talk about that topic in a particular way. You could identify that the user is talking about that topic with something like a keyword match or a semantic similarity match, and based on that, modify the examples that you feed into your FewShotChatMessagePromptTemplate. That could be what you do for topic A; for topic B you might have another set of examples that you feed in. All this time your example prompt stays the same; you're just modifying the examples going in so that they're more relevant to whatever your user is actually talking about. So that can be useful.
Now let's see an example of that. When we're using a tiny LLM its ability to follow instructions will be limited, although I think we're probably fine here. We're going to say: answer the user's query based on the context below, and always answer in markdown format; being very specific in the system prompt. That's nice, but what we've essentially said is: always answer in markdown format, and when doing so, please provide headers, short summaries, follow with bullet points, then conclude. You can see the overview we get already has headers and bullet points, and it's actually quite good. But if we come down here, what I specifically want is for it to always follow this structure: a double-hash header for the topic, a summary header, a couple of bullet points, and then always finish with this "To conclude" pattern, which is always bold. I want to be very specific about what I want. To be fully honest, with GPT-4o mini you could actually just prompt most of this in, but for the sake of the example,
we're going to provide few-shot examples via our few-shot prompt template instead. We provide one example here and a second example here, and you can see we're just following that same pattern: we're setting up the structure the LLM should use. Each example has our main header, a little summary, some subheaders with bullet points, and the "To conclude" at the end, and so on; same with the second one. Then let's see what we get: this is the structure of our new few-shot prompt template, and you can see what it all looks like. Coming down, we basically insert that directly into our chat prompt template, so in from_messages we have the system prompt, then this FewShotChatMessagePromptTemplate in the middle, and then the user prompt. We run that and then feed all of this back into our pipeline.
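A sketch of how that slots together, reusing the few_shot_prompt from above (the system prompt wording here is paraphrased, not the notebook's exact text):

```python
new_system_prompt = (
    "Answer the user's query based on the context below. Always answer in "
    "markdown format, providing headers, short summaries, and bullet points, "
    "and finish with a bold \"To conclude\" statement.\n\n"
    "Context: {context}"
)

# Drop the few-shot examples between the system prompt and the user query.
prompt_template = ChatPromptTemplate.from_messages([
    ("system", new_system_prompt),
    few_shot_prompt,
    ("user", "{query}"),
])

pipeline = prompt_template | llm
out = pipeline.invoke({"query": "What does Aurelio AI do?", "context": context})
print(out.content)
```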
This will modify the structure so that we get that bold "To conclude" at the end, and we can see that nicely here: we get much more of the exact structure we asked for. Again, with GPT-4o models and many other OpenAI models you don't really need to do this, but you will see it in other examples. We do have an example of this where we're using Llama, I think Llama 2 if I'm not wrong, and you can see that adding this few-shot prompt template is actually a very good way of getting those smaller, less capable models to follow your instructions. So this can be super useful when you're working with those smaller LLMs, but even for state-of-the-art models like GPT-4o, if you do find that you're struggling with the prompting and it's just not quite following exactly what you want it to do, this is a very good technique for getting it to follow a very strict structure or behavior.
Moving on, we have chain-of-thought prompting. This is a more common prompting technique that encourages the LLM to think through its reasoning, its thoughts, step by step: a chain of thoughts. The idea behind this is that, in math class when you're a kid, the teachers would always push you to write down your working, and there are multiple reasons for that. One of them is to get you to think, because they know that in a lot of cases you're a kid in a rush who doesn't really care about this test, and they're just trying to get you to slow down a little and actually put down your reasoning. That forces you to notice: actually, I'm skipping a step in my head because I'm trying to do everything up there; if I write it down, all of a sudden it's, oh, I actually need to do that slightly differently, and you realize you were probably rushing a little. Now, I'm not saying an LLM is rushing, but there's a similar effect: by writing everything down, an LLM tends to get things right more frequently. And at the same time, also like when a teacher reviews your exam work, by having the LLM write down its reasoning, you as a human or engineer can see where the LLM went wrong, if it did go wrong, which can be very useful when you're trying to diagnose problems. So with chain of thought we should see fewer hallucinations and generally better performance. Now, to implement chain of thought in LangChain there are no specific LangChain objects for it; it's just prompting. So let's go down and see how we might do that.
So: "Be a helpful assistant and answer the user's question. You MUST answer the question directly without any other text or explanation." That's our no-chain-of-thought system prompt. I will just note, especially with OpenAI, and again this is one of those things you'll see more with smaller models, that most LLMs are actually trained to use chain-of-thought prompting by default. So we're specifically telling it here: you must answer the question directly without any other text or explanation. We're essentially reverse-prompting it to not use chain of thought, because otherwise by default it will try to; it's been trained to. That's how relevant chain of thought is. Then we're going to ask: how many keystrokes are needed to type the numbers from 1 to 500? We set up our LCEL pipeline and we're going to just invoke it with our query.
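A sketch of that no-chain-of-thought setup, following the same pipeline pattern as before:

```python
no_cot_system_prompt = (
    "Be a helpful assistant and answer the user's question.\n\n"
    "You MUST answer the question directly without any other text or explanation."
)

no_cot_prompt_template = ChatPromptTemplate.from_messages([
    ("system", no_cot_system_prompt),
    ("user", "{query}"),
])

no_cot_pipeline = no_cot_prompt_template | llm

query = "How many keystrokes are needed to type the numbers from 1 to 500?"
print(no_cot_pipeline.invoke({"query": query}).content)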
We'll see what we get: the total number of keystrokes needed to type the numbers from 1 to 500 is, it says, 1,511. The actual answer, as I've written here, is 1,392, so without chain of thought it's hallucinating.
Now let's go ahead and see what it does with chain-of-thought prompting: "Be a helpful assistant and answer the user's question. To answer the question, you MUST list systematically, and in precise detail, all sub-problems that need to be solved to answer the question. Solve each sub-problem INDIVIDUALLY" (you have to shout at the LLM sometimes to get it to listen) "and in sequence. Finally, use everything you have worked through to provide the final answer." So we're forcing it to work through the full problem.
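And a sketch of the chain-of-thought version of the same pipeline:

```python
cot_system_prompt = (
    "Be a helpful assistant and answer the user's question.\n\n"
    "To answer the question, you MUST:\n"
    "- List systematically and in precise detail all sub-problems that need "
    "to be solved to answer the question.\n"
    "- Solve each sub-problem INDIVIDUALLY and in sequence.\n"
    "- Finally, use everything you have worked through to provide the final answer."
)

cot_prompt_template = ChatPromptTemplate.from_messages([
    ("system", cot_system_prompt),
    ("user", "{query}"),
])

cot_pipeline = cot_prompt_template | llm
print(cot_pipeline.invoke({"query": query}).content)
```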
(We can remove the leftover context input there; I'm not sure why it was included, so I'll run that again without it.) You can see straight away that it's taking a lot longer to generate the output; that's because it's generating so many more tokens, which is one drawback of this approach. But let's see what we have.
To determine how many keystrokes are needed to type those numbers, it breaks the problem down into several sub-problems: count the number of digits in the numbers from 1 to 9, from 10 to 99, from 100 to 499, and count the digits in the number 500. Interesting, so that's how it breaks it up. Then it sums the digits counted in the previous steps: 9 digits for 1 to 9, 180 for 10 to 99, 1,200 for 100 to 499, and of course 3 for 500. It adds all those up and actually comes to the right answer.
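That breakdown checks out; a quick sanity check in Python:

```python
# 1-9: 9 numbers x 1 digit, 10-99: 90 x 2, 100-499: 400 x 3, plus 3 for "500".
assert 9 * 1 + 90 * 2 + 400 * 3 + 3 == 1392

# Or counted directly:
print(sum(len(str(n)) for n in range(1, 501)))  # 1392
```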
So that's the difference with chain of thought versus without: without it, we just get the wrong answer, basically a guess; with chain of thought, we get the right answer, just from the LLM writing down its reasoning and breaking the problem down into multiple parts, which I find super interesting. Now, as we mentioned before, most LLMs nowadays are actually trained to use chain-of-thought prompting by default, so let's just see what happens if we don't mention anything either way: "Be a helpful assistant and answer the user's question." We're not telling it not to think through its reasoning, and we're not telling it to think through its reasoning. Let's just see what it does.
You can see it's actually doing the exact same reasoning. It doesn't give us the sub-problems at the start, but it is going through and breaking everything apart, which is quite interesting, and we get the same correct answer. The formatting here is slightly different, probably a little cleaner actually, although the earlier version gives us a lot more information; both are fine, and in this scenario we do get the right answer as well. So you can see that chain-of-thought prompting has quite literally been trained into the model, and you'll see that with most, I think all, state-of-the-art LLMs. Cool, so that is our chapter on prompting. We focused very much on the fundamentals of prompting there, and of course tied that back to the actual objects and methods within LangChain. For now, that's it for prompting, and we'll move on to the next chapter.