Hugging Face LLMs with SageMaker + RAG with Pinecone
Chapters
0:00 Open Source LLMs on AWS SageMaker
0:27 Open Source RAG Pipeline
4:25 Deploying Hugging Face LLM on SageMaker
8:33 LLM Responses with Context
10:39 Why Retrieval Augmented Generation
11:50 Deploying our MiniLM Embedding Model
14:34 Creating the Context Embeddings
19:49 Downloading the SageMaker FAQs Dataset
20:23 Creating the Pinecone Vector Index
24:51 Making Queries in Pinecone
25:58 Implementing Retrieval Augmented Generation
30:00 Deleting our Running Instances
Today, we're going to be taking a look at how we do retrieval-augmented generation with open-source models using AWS's SageMaker. To do this, we're going to need to set up a few different components. Obviously, for our LLM we're using an open-source model, so we need a specialized compute instance to run it on, and we can actually get that through SageMaker as well. We'll also need an embedding model and a vector database, and if you don't know why we would want those, we'll cover that as we go through.

Now, for this we're not going to use anything crazy big, but we will need a dataset, so we'll have our dataset down here. What we're going to do is take our dataset, which is essentially chunks of information about AWS, and use them to inform our large language model. That matters because our LLM has been trained on internet data, so I'm sure it does know about AWS and the services they provide, but it's probably not up to date, and that will be very important.

So what we'll do is take this relevant information here, feed it into the embedding model, and from that we get what we call vector embeddings. We're going to store those within our vector database. Then, when we ask our LLM something - so we have our query, we want to know something about AWS - that query isn't going to go straight to the LLM to give us a response like it usually would. Instead, it's actually going to go to our embedding model (I haven't drawn or organized this very well), and from that we get what we would call a query vector. I'm going to call it xq, and we take that into Pinecone, and we say to Pinecone, "Okay, given my query vector, what are the relevant other records that you have stored?" So those AWS documents that we embedded and stored before - we're going to get the most relevant of them back, a load of relevant information, essentially. We take our query and we take those contexts, we put them together, and that gives us a context or retrieval-augmented prompt. We feed that to the LLM, and now it can actually give us some relevant information - up-to-date information, which it usually wouldn't be able to do.

So let's actually dive into how we would implement all this. Okay, so I'm going to begin on the homepage of my AWS console. You can also just search for SageMaker at the top here and click on that. There will also be a link to this notebook on GitHub, which you can copy across into your own SageMaker environment. We're going to start by just installing everything we need: that's just SageMaker, the Pinecone client, and ipywidgets.
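For reference, the install step is just a pip line along these lines; the package names are as I'd expect them today, and the exact version pins in the notebook may differ.

```python
# Install the libraries used in this walkthrough.
# Versions are illustrative, not the exact pins from the original notebook.
!pip install -qU sagemaker pinecone-client ipywidgets
```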
Okay, now I said we're going to be using open-source models, and we are going to be getting those from Hugging Face. So we can just import everything we need here. We're going to import the Hugging Face model class, and this is essentially the configuration for the image that we'll be deploying.

Okay, so we can decide which model we'd like to use. Over on the Hugging Face model hub, I think there should be a text-generation filter here, which will give us a list of all the text-generation models. We are going to go for the Google Flan-T5 model, which actually isn't even tagged under text generation. So maybe we could just apply both filters - text generation and text-to-text generation - and we can see T5 is there. There are larger versions of the model, but you'll need a bigger compute instance to run them on, particularly when we're retrieving all this relevant information. So the model ID is what we're copying into SageMaker.

What we then need to do is retrieve the LLM image URI. There are different images that you can use for Hugging Face models, and for an LLM like this, this is the one that you're going to want to use. Then we're going to deploy it to this instance. Actually, let me take the instance name and just Command-F in here to check its specs - the GPU memory, which is probably more important, is 24 gigabytes. Okay, so that's definitely big enough for our T5 XL model.
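Put together, the deployment step looks roughly like the sketch below. It's a minimal version built on the SageMaker Hugging Face LLM container; the env keys follow the public SageMaker examples, and the instance type (ml.g5.2xlarge, a 24 GB GPU) and endpoint name are my assumptions rather than values confirmed in the video.

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()  # IAM role attached to the notebook

# Retrieve the Hugging Face LLM container image for this region.
llm_image = get_huggingface_llm_image_uri("huggingface")

# Tell the container which model to pull from the Hugging Face Hub.
llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env={
        "HF_MODEL_ID": "google/flan-t5-xl",  # the open-source LLM we're using
        "SM_NUM_GPUS": "1",                  # GPUs available on the instance
    },
)

# Deploy to a GPU instance with 24 GB of memory (e.g. ml.g5.2xlarge).
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    endpoint_name="flan-t5-demo",  # hypothetical endpoint name
)
```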
Cool. So first, let me just show you in here - this is the Amazon SageMaker console we saw before, and we have nothing in endpoints or endpoint configurations. Now that will change as soon as I run this next step. It does take a little bit of time to deploy, but you will see a little loading bar at the bottom in a moment, so I'm just going to skip ahead to when that is done. Actually, while we're waiting for that to load, I'll show you where we are in that rough diagram from before: our LLM here is what's now being initialized, and for that we're using, like I said, the Flan-T5 XL model.

Okay, so that has just finished and we can move on to the next steps. What I want to show you first is the difference between asking an LLM a question directly and asking it a question when you provide some context, which is obviously what we want to do with RAG in this instance. So I'm going to ask which instances I can use with managed spot training in SageMaker, and we're going to send this directly to the LLM. Okay, and we get the generated text of "SageMaker and SageMaker XL", which sounds like a great product but, as far as I'm aware, doesn't exist.

So what we need to do is pass some relevant context into the model. That relevant context would look something like this. Here, this is just an example - this is not how we're going to do it, I just want to show you what actually happens. So we're going to tell it that managed spot training can be used with all instances supported in Amazon SageMaker. Then what we do is create a prompt template: we feed in our context here along with the question, and that gives us a, well, kind of retrieval-augmented prompt - retrieval in the sense that we've put that information in there. This time the answer we get back is "all instances supported in Amazon SageMaker", which is actually the correct answer.

The other thing we want to check is whether our LLM is capable of following our instructions. Because here I said: if you do not know the answer based on the context, say "I don't know" - and then asked it something the context doesn't cover. They're not that good yet, so it says, "I don't know."
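As a concrete sketch, the context-injected call can look something like this. The template wording is my paraphrase of what's described above, and it assumes `llm` is the predictor returned by `deploy()` earlier.

```python
question = "Which instances can I use with Managed Spot Training in SageMaker?"

context = (
    "Managed Spot Training can be used with all instances "
    "supported in Amazon SageMaker."
)

# Ask the model to answer only from the context, and to admit when it can't.
prompt_template = """Answer the following QUESTION based on the CONTEXT given.
If you do not know the answer and the CONTEXT doesn't contain the answer,
say "I don't know".

CONTEXT:
{context}

QUESTION:
{question}

ANSWER:
"""

payload = {"inputs": prompt_template.format(context=context, question=question)}
out = llm.predict(payload)
print(out[0]["generated_text"])  # -> "all instances supported in Amazon SageMaker"
```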
That's great, but obviously I just fed in the context manually, and we're not going to do that in a real use case. In reality, we're probably going to have tons of documents and we only want the relevant little bits of information from those documents. One thing that I have seen people doing a lot is feeding all of those documents into the LLM. Basically, don't do that, because it doesn't work very well. There's actually a paper on this - I'll make sure there's a link, probably at the top of the video right now, if you want to read it - basically showing that if you fill your context window with documents, the model tends to miss everything that isn't either at the start or the end of that window. So it's not a good idea, and it's also expensive: the more tokens you use, the more you're going to pay.

Instead, what we want to do is retrieval. Essentially, we're going to be looking at our question and pulling in only the chunks of information that seem like they'll probably answer that question, and this is where our embedding model comes into play. So right now, what we need to do is deploy that embedding model. Let's go ahead and see how we do that.

Again, we're going to be using Hugging Face Transformers. We can copy the model ID - we can just go to Models again on the Hugging Face Hub and grab the MiniLM sentence-transformer model. That means, one, when we're building this Hugging Face model object, we swap in that model ID, and we also change the task that we're doing here to feature extraction, because we want embeddings back rather than generated text. Then we come down to here and we're going to deploy it.

Okay, so you can see some kind of similar instance options here. The t3.large is actually just a CPU instance, but again, this embedding model is very small. You could use a GPU if you wanted it to be quicker.
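A minimal sketch of that embedding-model deployment is below. The model ID is the standard sentence-transformers MiniLM checkpoint; the container versions, instance type, and endpoint name are assumptions, so swap in whatever your account and SageMaker SDK version support.

```python
from sagemaker.huggingface import HuggingFaceModel

# Serve MiniLM as a feature-extraction endpoint (it returns token-level embeddings).
hub_config = {
    "HF_MODEL_ID": "sentence-transformers/all-MiniLM-L6-v2",
    "HF_TASK": "feature-extraction",
}

encoder_model = HuggingFaceModel(
    env=hub_config,
    role=role,
    transformers_version="4.26",  # illustrative DLC versions; use a supported combo
    pytorch_version="1.13",
    py_version="py39",
)

# MiniLM is tiny, so a small CPU instance is enough; pick a GPU for lower latency.
encoder = encoder_model.deploy(
    initial_instance_count=1,
    instance_type="ml.t2.large",   # small CPU instance; adjust to what you have
    endpoint_name="minilm-demo",   # hypothetical endpoint name
)
```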
Now we have both our LLM and our embedding model deployed, and we can go and have a look over in SageMaker. We can see we have both of these models, and we can go over to endpoint configurations and endpoints: we have the MiniLM demo and the Flan-T5 demo. Those are the endpoints we're going to be calling - well, actually, we already called the Flan-T5 endpoint, and we're about to call the MiniLM endpoint.

All I'm going to do is create some little examples and pass those into our embedding model, and that will create our query vectors - the xq vectors, okay? We actually just do encoder.predict, that's it. We're taking two contexts, or chunks of text, or documents, so that means we expect to get two embeddings back. From MiniLM, or any other embedding model, we expect it to output one vector per input, and the vector dimensionality I'm expecting from this model is 384. But that's not what we get: what we see here is two eight-dimensional somethings. Let's take a look at what is within these somethings, okay?
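If you want to poke at that raw response yourself, something like this works; the two texts are placeholders I've made up rather than the exact examples from the video.

```python
import numpy as np

texts = [
    "Managed Spot Training can be used with all instances supported in Amazon SageMaker.",
    "Which instances can I use with Managed Spot Training in SageMaker?",
]

# The endpoint returns nested lists: one entry per input text.
out = encoder.predict({"inputs": texts})
print(np.array(out).shape)  # something like (2, 8, 384), depending on tokenization
```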
Inside, they just look like lists of fairly random numbers. What's actually going on is that each chunk of text we pass in is going to be broken apart into what we call tokens, and those tokens might look something like this. Let's say that this one contains eight tokens and the other one fewer. Basically, what happens is that the shorter sentence gets padded with what we call padding tokens - it gets extended with some extra padding tokens so that it matches the longest sequence within the batch that we're passing to the model.

So what we have here is two lists of eight tokens each. These are then passed to our embedding model, and what it's going to do is output a vector embedding at the token level. That means we get one, two, three, four, five, six, seven, eight token-level embeddings here. But we want to represent the whole sentence, or each document. So what we actually do is something called mean pooling: we take the average over those token-level embeddings, and from that we create a single sentence embedding. So we just need to add that onto the end of our process here, and with that we would get xq, our query vector. Actually, sorry - not necessarily our query vector in this case. These would be our, we can call them xc or xd, which would be our context vectors or document vectors. So that's my bad, because right here, this red bit, we're not actually doing that yet - we're actually going along this line here, the indexing side.

Back in the code, to do the mean pooling we're going to take the mean across a single axis, and you can see that now we have two 384-dimensional vector embeddings. Now, what I want to do is just package that into a single function. It's going to take a list of strings, create the token-level embeddings, and then do mean pooling to create a sentence-level embedding for each one. Okay, and that is how we create our context, or document, embeddings.
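That helper looks roughly like the following; the notebook names it something like `embed_docs`, and this version assumes the `encoder` predictor deployed above.

```python
from typing import List

import numpy as np


def embed_docs(docs: List[str]) -> List[List[float]]:
    """Return one 384-d embedding per input string."""
    # Token-level embeddings from the MiniLM endpoint: (batch, tokens, 384).
    out = encoder.predict({"inputs": docs})
    # Mean pooling across the token axis gives one vector per document.
    embeddings = np.mean(np.array(out), axis=1)
    return embeddings.tolist()
```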
Now what we want to do is take what we've just learned and apply it to an actual dataset. We're going to be using the Amazon SageMaker FAQs, and we're just going to open that dataset with pandas. Okay, so we have the Question and Answer columns here, and we're going to drop the questions, because we just want the answers here. This is all we're going to be indexing.
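The loading step is just pandas. The filename and the assumption that the CSV has no header row are mine, so adjust them to match your copy of the FAQ file.

```python
import pandas as pd

# Assumes the FAQ CSV has already been downloaded into the working directory.
df = pd.read_csv(
    "Amazon_SageMaker_FAQs.csv",   # hypothetical local filename
    header=None,
    names=["Question", "Answer"],
)

# We only index the answers; the questions come from the user at query time.
df = df.drop(["Question"], axis=1)
df.head()
```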
What we should do now is, well, we need to embed those answers and store them somewhere. So we're going to take our context vectors, and for that we first need to initialize a vector index to actually store them within.

To initialize our connection to Pinecone, we need an API key. Once we are in the Pinecone console, we go over to API Keys - you want to copy the key and also make a note of the environment here. The API key gets pasted into the notebook, we initialize our connection to Pinecone, and we can make sure it's connected by listing the indexes. Mine isn't empty, because I'm doing a lot of things in Pinecone right now, but in reality we'd expect yours to be. The first thing I'll do is make sure I delete this index, which is already running: I come down to here, we have that index name, I check whether an index with that name already exists, and then what I'm going to do is create a new index with the same name.

For the dimension, we know that - it's the 384 that we saw before for our MiniLM model. For the similarity metric, cosine works here, but some embedding models may need you to use dot product or Euclidean distance instead, so basically just check which embedding model you're using and which of those metrics it recommends. And then here, we're just waiting for the index to finish initializing before we move on to the next step.
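Here's a sketch of that setup using the older pinecone-client v2 interface that was current when this video was made (newer SDK versions use a `Pinecone` class instead). The index name is a guess based on what's mentioned in the video, and the API key and environment are read from environment variables here.

```python
import os
import time

import pinecone  # pinecone-client v2-style API

# API key and environment come from the Pinecone console (API Keys page).
pinecone.init(
    api_key=os.environ["PINECONE_API_KEY"],
    environment=os.environ["PINECONE_ENVIRONMENT"],
)

index_name = "retrieval-augmentation-aws"  # hypothetical index name

# Start from a clean slate if an index with this name already exists.
if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)

# 384 dimensions to match MiniLM; cosine similarity suits this model.
pinecone.create_index(name=index_name, dimension=384, metric="cosine")

# Wait for the index to be ready before upserting anything.
while not pinecone.describe_index(index_name).status["ready"]:
    time.sleep(1)
```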
For me that's quick, because I've already had this retrieval-augmentation-aws index in there before. Then I've put here that we do this in batches of 128 - ignore that, we're actually doing it in batches of two for this demo. Obviously, if you're wanting to do this for a big dataset you'll want a larger batch size, and if you do use a larger batch size, just make sure the embedding endpoint - the MiniLM model that we're using - can keep up with it. We're going to upload, or upsert, at most 1,000 of those vectors.

We then initialize our connection to the specific index, and we just loop through and upsert everything. The IDs here are just 0, 1, 2, 3 - nothing special - though you'd probably want something more meaningful if you're doing something real with this. I also store the text of each answer as metadata within Pinecone, because it makes things a little bit easier later on when we retrieve. So we take the answers - the documents within each batch - we run them through embed_docs, the function we defined earlier, and then we upsert everything. Okay, and that's it.
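The upsert loop looks roughly like this, reusing `embed_docs` and the `df` dataframe from above; the batch size, ID scheme, and 1,000-record cap mirror what's described in the video.

```python
index = pinecone.Index(index_name)

batch_size = 2                           # tiny batches for the demo
answers = df["Answer"].tolist()[:1000]   # cap at 1,000 records

for i in range(0, len(answers), batch_size):
    batch = answers[i : i + batch_size]
    ids = [str(n) for n in range(i, i + len(batch))]   # simple numeric IDs
    metadata = [{"text": text} for text in batch]      # keep the raw answer text
    embeddings = embed_docs(batch)                     # helper defined earlier
    index.upsert(vectors=zip(ids, embeddings, metadata))

print(index.describe_index_stats())  # should report ~154 vectors for this dataset
```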
Now, if we take a look at the number of records within the index, we should see that there are 154, which is a tiny, tiny index - honestly, you probably wouldn't use Pinecone for something this small. You really want to be working with tens of thousands, hundreds of thousands, millions of vectors.

Okay, so let's take the question we initialized earlier: which instances can I use with managed spot training in SageMaker? What I'm going to do is embed that to create our query vector - which I was calling xq earlier - and query Pinecone with it, including the metadata so that we can see the text of the answers as well. All right, so we can run that, and we get these contexts back.
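The query step is only a couple of lines; this sketch assumes the `index` handle and `embed_docs` helper from above.

```python
question = "Which instances can I use with Managed Spot Training in SageMaker?"

# Embed the question to get the query vector xq, then search Pinecone.
xq = embed_docs([question])[0]
res = index.query(vector=xq, top_k=5, include_metadata=True)

for match in res.matches:
    print(round(match.score, 3), match.metadata["text"][:80])
```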
So far, that means we have basically done the indexing side of the diagram. Now what we're doing is asking a question - a query - and this time we are actually going along this path here: creating our query vector, taking that into Pinecone, and getting back the most similar records. Those matches that we just saw are our relevant contexts. So what's left is that we need to take our query, take these contexts, and feed them into the LLM. I can even show you here - it's literally just a list of those contexts.

What we're going to do is construct the context string with a limit on how much data we feed in at any one time, because we're not using a massive large language model here: it's decent, but it cannot store a ton of text within its context window. So we're going to add each context until we get to a point where we can't add any more, because we've reached the limit we set. Then we can actually build that and see what it returns us - the context string. Okay, so we get - I'm not sure how many that is at a glance - okay, here, it's telling us: we retrieved five, and we're just going to pass through the top four. It's that last one we couldn't fit into the limit we set.
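A simple way to do that packing is below. The 1,000-character budget and the separator are my stand-ins for the limit used in the notebook, and `res` is the Pinecone query response from above.

```python
contexts = [m.metadata["text"] for m in res.matches]

# Add retrieved passages one by one until we hit a rough size budget
# (a character cap stands in for a proper token count here).
max_chars = 1000
chosen, total = [], 0
for text in contexts:
    if total + len(text) > max_chars:
        break
    chosen.append(text)
    total += len(text)

context_str = "\n---\n".join(chosen)
print(f"Kept {len(chosen)} of {len(contexts)} retrieved passages.")
```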
So now what we want to do is the same as what we did earlier: answer the question below given the context. Okay - answer the following question based on the context. We have our context here, which we're feeding in, so let's now predict with that and see what we get for "Which instances can I use with managed spot training in SageMaker?" - and we get the right answer again. Okay, cool.

So with that, we've now taken our contexts and fed them into what is our new prompt: we take the query, take the retrieved contexts, and use all of that to create our retrieval-augmented prompt, which the LLM then uses to create our answer. We now have our retrieval-augmented generation pipeline. So now we can pull all of that together into a single RAG query function. I'm going to ask the same question initially, because I don't know too much about AWS and I just want to ask something that I know the answer to - the spot instances question.
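Pulling the whole pipeline into one helper might look like this; it reuses `embed_docs`, `index`, `prompt_template`, and the `llm` predictor from earlier, so treat it as a sketch of the flow rather than the notebook's exact function.

```python
def rag_query(question: str) -> str:
    """Retrieve relevant FAQ answers from Pinecone and answer with the LLM."""
    # 1. Embed the question and fetch the most similar stored answers.
    xq = embed_docs([question])[0]
    res = index.query(vector=xq, top_k=5, include_metadata=True)
    contexts = [m.metadata["text"] for m in res.matches]

    # 2. Pack as many contexts as fit into our size budget.
    max_chars, chosen, total = 1000, [], 0
    for text in contexts:
        if total + len(text) > max_chars:
            break
        chosen.append(text)
        total += len(text)
    context_str = "\n---\n".join(chosen)

    # 3. Build the retrieval-augmented prompt and query the LLM.
    prompt = prompt_template.format(context=context_str, question=question)
    out = llm.predict({"inputs": prompt})
    return out[0]["generated_text"]


print(rag_query("Which instances can I use with Managed Spot Training in SageMaker?"))
```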
I also want to try something that isn't covered. I checked the dataset, and there isn't actually any mention of the Hugging Face LLM instances in SageMaker, so I'm expecting the model to tell me it doesn't know, because it doesn't actually have that piece of information.

Okay, and the first chunk that we're getting here is the set of contexts being returned. We can see they don't contain anything about this, and the reason we're retrieving this irrelevant information is that when we do our embedding and query Pinecone, we're saying "retrieve the top five most relevant contexts" - even if nothing in the index is truly relevant, it's still going to return the top five most similar items. But remember, we told the LLM in the prompt that if the context doesn't contain relevant information, it should say it doesn't know. And that is exactly what it does here: it responds with "I don't know."

Okay, so with that, we've seen how we can use SageMaker for retrieval-augmented generation with Pinecone using open-source models, which is, I think, pretty cool. One thing I should show you very quickly before finishing: right now our endpoints are still running, and we'll want to shut everything down. We can do that by going to our endpoints here, selecting them, and clicking delete, and the endpoint will be deleted. Then we'll just want to do the same for our other items - the endpoint configurations and the models - you can delete those as well.
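If you prefer to tidy up from the notebook instead of clicking through the console, the predictors returned by `deploy()` can do it: `delete_endpoint()` removes the endpoint and its configuration, and `delete_model()` removes the model. Deleting the Pinecone index is optional.

```python
# Shut everything down so the endpoints stop billing.
llm.delete_model()
llm.delete_endpoint()

encoder.delete_model()
encoder.delete_endpoint()

# Optionally remove the Pinecone index as well.
pinecone.delete_index(index_name)
```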
So, yeah, once you've gone through and deleted those, you won't be paying anything more for following along with this. We've obviously seen how to do RAG with open-source models and Pinecone. When you're wanting more performant generations, you'll probably want to switch up to a larger model - the Flan-T5 XL that we demoed here is pretty limited in its abilities, but it's not bad, and it's definitely not bad for a demo. So, yeah, I hope this has all been useful and interesting.