
Hugging Face LLMs with SageMaker + RAG with Pinecone


Chapters

0:00 Open Source LLMs on AWS SageMaker
0:27 Open Source RAG Pipeline
4:25 Deploying Hugging Face LLM on SageMaker
8:33 LLM Responses with Context
10:39 Why Retrieval Augmented Generation
11:50 Deploying our MiniLM Embedding Model
14:34 Creating the Context Embeddings
19:49 Downloading the SageMaker FAQs Dataset
20:23 Creating the Pinecone Vector Index
24:51 Making Queries in Pinecone
25:58 Implementing Retrieval Augmented Generation
30:00 Deleting our Running Instances

Transcript

Today, we're going to be taking a look at how we do retrieval-augmented generation with open-source models using AWS's SageMaker. To do this, we're going to need to set up a few different components. Obviously, with our LLM, we're using an open-source model, but it's still a big language model. So we need a specialized instance in order to do this, and we can actually do that through SageMaker as well.

So we're going to have our instance here, which is where we're going to store our LLM, and that's within SageMaker. We're also going to have another instance that will store our embedding model, and if you don't know why we would want that, I'll explain in a moment. So we'll just call this embed.

So we're going to have those two instances. We're going to see how to set those up. Then we're going to need to get a dataset. Now, for this, we're not going to use anything crazy big, but we will need it, so we'll have our dataset down here. What we're going to do is we're going to take our dataset, which is essentially chunks of information about AWS, and we are going to use them to inform our large language model.

So these will become the long-term memory or the external knowledge base for our LLM, because our LLM has been trained on internet data, so I'm sure it does know about AWS and the services they provide, but it's probably not up to date, so that will be very important. Now, what this will actually look like is we'll take this relevant information here.

We're going to take it to the embedding model, and then from this, we're going to get what we call vector embeddings. We're going to store those within our vector database, which is Pinecone. It will look like this, and what we will do is when we ask our LLM something, so we have our query, we want to know something about AWS, that is actually not going to go to our LLM and give us a response like it usually would.

This is a usual approach. Instead, it is actually going to go to our embedding model. I haven't drawn this very well or organized this very well. It's going to go to our embedding model, and from that, we get what we would call a query vector. I'm going to call it xq, and we take that into Pinecone, and we say to Pinecone, "Okay, given my query vector, what are the relevant other records that you have stored?" So those AWS documents that we embedded and stored before, and it will just give us a set of responses.

Okay, so we're going to get a load of relevant information, essentially. Now what we do is we take our query, so we can just bring it here, and we take our contexts, and we put those together. So now we have the query and the contexts, and that gives us a context-augmented or retrieval-augmented prompt.

We feed that into our LLM, and now it can actually give us some relevant information. So up-to-date information, which it usually wouldn't be able to do. So let's actually dive into how we would implement all this. Okay, so I'm going to begin in the homepage of my console. I'm going to go to Amazon SageMaker.

You can also just do SageMaker at the top here and click on that. I'm going to go to my domain here, and I'm just going to launch my studio. Okay, so here we are within the studio, and we're going to go through this notebook. There will also be a link to this notebook on GitHub, which you can copy across over into your own SageMaker.

So we're going to start by just installing everything we need. So it's just going to be SageMaker, the Pinecone client, and the widgets. Okay, now I said we're going to be using open-source models, and we are going to be getting those from Hugging Face. So to do that, we will need at least, I think, this version of SageMaker.
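As a rough sketch, that install cell might look something like this; the video doesn't pin exact versions on screen, so treat the package list as an assumption:

```python
# Minimal sketch of the install cell, assuming a recent sagemaker release
# (the Hugging Face LLM image helper needs a fairly new version), plus the
# Pinecone client and ipywidgets.
!pip install -qU sagemaker pinecone-client ipywidgets
```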

And what we have is within SageMaker, they actually have support for Hugging Face. So we can just import everything we need here. So we're going to import the Hugging Face model. And this is essentially the configuration for the image that we'll be initializing. Okay, so we can decide which model we'd like to use.

So I'm going to use Google's Flan-T5 XL. Now, where would I get this model ID from? You can go to huggingface.co. You can head over to Models at the top here, and you can search for models, right? So if we wanted to come down to, I think there should be a text generation filter here.

This will give us a list of all the text generation models. You can see we've got Llama 2 in there. So we are going to go for the Google Flan-T5 model, which actually isn't even tagged under text generation. So let's remove that filter and go again. So this is actually text-to-text generation.

So you can also, maybe we could just apply both. Oh, one at a time. Okay, great. So text to text generation, and we can see T5 is there. And there's also this bigger model. You could also try that as well. Obviously, your results will be better, but you'll need a bigger compute instance to run it on.

So I'm just going to go with the Flan-T5 XL model, which gives us good enough results, particularly when we're retrieving all this relevant information. And this is just the model ID right here. So that's what we're copying into SageMaker. Okay. We are going to use text generation. And what we then need to do is retrieve the LLM image URI.

So there are different images that you can use for Hugging Face models. If you're using a large language model, this is the one that you're going to want to use. Okay. And then we can initialize the model configuration with that image. Okay.
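A sketch of that setup, assuming the Flan-T5 XL model ID copied from the Hub and the standard Hugging Face LLM image helper in the SageMaker SDK:

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# Retrieve the dedicated LLM serving image for Hugging Face models
llm_image = get_huggingface_llm_image_uri("huggingface")

# Configure the model: the Hub model ID and the task it will serve
llm_model = HuggingFaceModel(
    env={
        "HF_MODEL_ID": "google/flan-t5-xl",
        "HF_TASK": "text-generation",
    },
    role=role,
    image_uri=llm_image,
)
```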

And then this here will deploy that image, and we're going to deploy it to this instance. We can see a list of instances here. So this is aws.amazon.com/sagemaker/pricing. And actually, let me take this and just Command+F in here. Okay. So here we can see it, right? This is the one we're using.

It's an NVIDIA A10G, what do we have? Instance memory is 64 gigabytes; GPU memory, which is probably more important, is 24 gigabytes. Okay. So definitely big enough for our Flan-T5 XL model. Cool. So first, let me just show you in here; this is the Amazon SageMaker console we saw before.

And if we go down to inference, open this, and we take a look at models, okay, there's nothing in there at the moment. And we have nothing in endpoints or endpoint configurations. Now that will change as soon as I run this next step. So this is going to deploy our model.
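The deploy step itself is a single call. The instance type below is an assumption based on the specs mentioned above (an A10G with 24 GB of GPU memory and 64 GB of instance memory points at the ml.g5 family), so check the pricing page for your region:

```python
# Deploy the model to a GPU-backed real-time endpoint.
# ml.g5.4xlarge matches the 64 GB RAM / 24 GB A10G described, but treat
# the exact instance type as an assumption.
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.4xlarge",
)
```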

It will take a moment, it does take a little bit of time to deploy, but you will see like a little loading bar at the bottom in a moment. So I'm just going to go and skip ahead for when that is done. Now, actually, while we're waiting for that to load, I will show you where we are in that sort of rough diagram I created before.

Okay. So right now, what have we done? We have just initialized this endpoint. Okay. So our LLM here, that is now being initialized and deployed within SageMaker. So for that, we're using, like I said, the Flan T5 XL model. Okay. So that has just finished and we can move on to the next steps.

So what I want to do here is just show you what the difference is between asking an LLM question directly and asking it a question when you provide some context, which is obviously what we want to do with RAG in this instance. So I'm going to ask which instances can I use with managed spot training in SageMaker, and we're going to send this directly to the LLM.

Okay. And we get the generated text of SageMaker and SageMaker XL, which sounds like a great product, but as far as I'm aware, doesn't exist. So what we need to do is pass some relevant context into the model. So that relevant context would look something like this, right?

Here we're just, this is an example. This is not how we're going to do it. I just want to show you what actually happens. So we're going to tell it managed spot training can be used with all instances supported in Amazon SageMaker. Okay. Let's run that. And then what we do is create a prompt template.

So we're just going to feed in our context here and then feed in our user question here. And that creates our full prompt. And then we call LLM predict again. But this time we have that retrieval, well, kind of retrieval augmented prompt here. It's retrieval as in we put that information in there.

Later, of course, we'll automate that. Okay. And then the answer we get this time is all instances supported in Amazon SageMaker. Okay. So that is actually the correct answer this time. Okay. And I just want to also see, is our LLM capable of following our instructions? All right, because here I said, if you do not know the answer and the context doesn't contain the answer, truthfully say, I don't know.
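As a sketch of that manual context-injection demo, it might look like the following; the prompt wording is approximated from the narration rather than copied from the notebook:

```python
question = "Which instances can I use with Managed Spot Training in SageMaker?"

# Asking the LLM directly, with no supporting context
out = llm.predict({"inputs": question})
print(out[0]["generated_text"])

# Now hand it the relevant context manually
context = (
    "Managed Spot Training can be used with all instances "
    "supported in Amazon SageMaker."
)

prompt_template = """Answer the following QUESTION based on the CONTEXT given.
If you do not know the answer and the CONTEXT doesn't contain the answer,
truthfully say "I don't know".

CONTEXT:
{context}

QUESTION:
{question}

ANSWER:
"""

text_input = prompt_template.format(context=context, question=question)
out = llm.predict({"inputs": text_input})
print(out[0]["generated_text"])
```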

Okay. So what color is my desk? It's white. Obviously, the LLM doesn't know this. They're not that good yet. So it says, I don't know. That's great, but obviously I just fed in the context. We're not going to do that in a real use case. In reality, we're probably going to have tons of documents and we're going to want to extract little bits of information from those documents.

One thing that I have seen people doing a lot is feeding in all those documents into an LLM. Basically, don't do that because it doesn't work very well. There's actually a paper on this and I'll make sure there's a link probably at the top of the video right now if you want to read that.

Basically showing that if you fill your LLM's context window with loads of text, it's going to miss everything that isn't either at the start of what you've fed into that context window or at the end of that context window. So it's not a good idea. It's also expensive.

The more tokens you use, the more you're going to pay. So we don't want to do that. What we do want to do is be more efficient. And to do that, we can use RAG, which is Retrieval Augmented Generation. Essentially, we're going to be looking at our question and we're going to be finding chunks of text from a larger database that seem like they'll probably answer our question.

Now, to make this work, this is where our embedding model comes into play. So right now, what we need to do is we need to deploy this here. Our embedding model. So let's go ahead and see how we will do that. Again, we're going to be using Hugging Face Transformers.

We can actually copy this and we can just go to Models again, and we can do this. So we're using this model here. It's a very small and efficient model, but the performance is actually fairly good. So that means, one, when we're doing this Hugging Face model here, we don't need to use that LLM image because this isn't a large language model.

It's just a small transformer model. And we also change the task that we're doing here to feature extraction, because we're extracting embeddings, which are like features, from the model. Okay, so we will run that. And then we come down to here and we're going to deploy it. We are going to deploy to this instance.

Come over here and let's see where that is. It's actually not even on this page. I don't remember where I found it, but essentially, it's a small model. Maybe if I search T2. Okay, so you can see some similar instances here. We are using the ml.t2.large, which is actually just a CPU instance, right?

It's not even GPU. But again, this embedding model is very small. You could use GPU if you want it to be quicker. And the memory is just eight gigabytes, which is actually plenty for this model. I think in reality, you can load this model onto like two gigabytes of GPU RAM.
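A sketch of that embedding-model deployment follows. The exact MiniLM checkpoint isn't shown on screen here, so sentence-transformers/all-MiniLM-L6-v2 (which outputs 384-dimensional vectors) is an assumption, as are the container framework versions and the CPU instance type:

```python
from sagemaker.huggingface import HuggingFaceModel

# A small sentence-transformer model; no LLM image needed, just the regular
# Hugging Face inference container with the feature-extraction task.
# The framework versions below are assumptions; pick a supported combination.
encoder_model = HuggingFaceModel(
    env={
        "HF_MODEL_ID": "sentence-transformers/all-MiniLM-L6-v2",
        "HF_TASK": "feature-extraction",
    },
    role=role,
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
)

# A small CPU instance is plenty for this model
encoder = encoder_model.deploy(
    initial_instance_count=1,
    instance_type="ml.t2.large",
)
```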

So that should be okay. Now let's deploy that. Okay, so that has now deployed. Now we have both our LLM and embedding model deployed. And we can go and have a look over in SageMaker here. So now we can see we have both of these models, which one is which.

So this one deployed earlier is our LLM. And this one here is our embedding model. And we can go over to endpoint configurations. We can see these are our images. So we have the MiniLM demo image and the Flan-T5 demo image. And then our actual endpoints that we are going to be calling later when we're...

Well, actually, we already called the Flan-T5 endpoint. And we're about to call the MiniLM endpoint. So what am I showing you next? Next, I'm just going to show you basically how we create xq here, okay? I'm not going to do it with the dataset yet. All I'm going to do is I'm going to create some little examples, and I'm going to pass those into our embedding model.

And then we're going to go on over here, and that will create our query vectors, the xq vectors, okay? So to do that, we have our encoder. We actually just do encoder.predict, that's it. Here, we're creating two... We're taking two contexts or chunks of text or documents, whatever you want to call them.

So that means we're going to get two embeddings back, which we can see here. And each of those, if we take a look, is not what we'd expect, right? So when we're creating these embeddings from these sentence transformer models or any other embedding model, we expect it to output a vector.

But the vector dimensionality that I'm expecting from the MiniLM model is 384 dimensions. What we see here is two eight-dimensional somethings, which is not what we would expect. Let's take a look at what is within these somethings, okay? So it looks like we have two records. That's fine. That makes sense.

But each record or each embedding output is actually eight 384-dimensional vectors. Now, the reason that we have this is because we have input some text, right? We created two inputs. I can't remember exactly what they were; they were something random, I think, okay? And basically, each one of those sentences, or the text within those sentences, is going to be broken apart into what we call tokens, right?

And those tokens might look something like this, okay? So I think it was called "something random". I don't remember exactly what I wrote. Let's say it's "something random", right? And then we had another one here. And let's say that this one contains eight tokens, right? So there's eight tokens in this one.

So it's a bit longer. Basically, what would happen here is this shorter sentence will be padded with what we call padding tokens. So actually, this gets extended with some extra padding tokens, as mentioned, to align with the same size or the same length as the longest sequence within the batch that we're passing to the model, okay?

So it's going to look like this. Now, what we have here is two lists of eight tokens each. These are then passed to our embedding model, right? So embedding model, it gets these. So this is our embedding model. And what it's going to do is it's going to output a vector embedding at the token level, right?

So that means that we get one, two, three, four, five, six, seven, eight token level embeddings here. But we want to represent the whole sentence or each document. So what we actually do here is something called mean pooling, where we essentially just take the average across each dimension here.

We take the average, and using that, we create a single sentence embedding, right? So we just need to add that on to the end of our process here. And with that, we would get xq, our query vector. Actually, sorry, not necessarily our query vector in this case. It would instead be what we can call xc or xd, which would be our context vectors or document vectors.

So that's actually my bad because right here, this red bit, we're not actually doing that yet. We're actually going along this line here, right? So from here into here and creating those, we'll call them xc vectors. So ignore the xq bit for now. All right, so to get those, we're going to take the mean across a single axis.

Let's do that. And you can see that now we have two 384 dimensional vector embeddings. Now, what I want to do is just kind of package that into a single function. All right, so this is going to take a list of strings, which are our documents or context, and we're going to create the token level embeddings.

And then we're going to take the mean, or we're going to do mean pooling to create a sentence level embedding. Okay, now that is how we create our context/document embeddings. Now what we want to do is actually take what we've just learned and apply that to an actual dataset.
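Before moving on to the dataset, here's a minimal sketch of that helper, assuming the encoder endpoint accepts a list of strings and returns token-level embeddings as nested lists, padded to the same length within each batch (the function name is taken from how it's referenced later):

```python
import numpy as np

def embed_docs(docs: list) -> list:
    # Token-level embeddings: one 384-d vector per token, per document
    out = encoder.predict({"inputs": docs})
    # Mean pooling over the token axis gives one 384-d vector per document
    embeddings = np.mean(np.array(out), axis=1)
    return embeddings.tolist()
```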

So we're going to be using the Amazon SageMaker FAQs, which we can download from here. And we're just going to open that dataset with pandas. Okay, so we have the question and answer columns here. We're going to drop the question column because we just want to look at answers here.
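A sketch of that step, assuming the FAQs come as a two-column CSV (the exact download URL and filename aren't shown here, so they're placeholders):

```python
import pandas as pd

# Hypothetical local filename; the CSV is downloaded from the link in the notebook
df = pd.read_csv("Amazon_SageMaker_FAQs.csv", names=["Question", "Answer"])
df = df.drop(columns=["Question"])  # we only index the answers
df.head()
```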

Okay, this is all we're going to be indexing. Now, that gives us our database down here. What we should do now is, well, we need to embed them, but we also need somewhere to store them. All right, so we need to take our context vectors and store them within Pinecone here.

Okay, to store them within Pinecone, we need to initialize a vector index to actually store them within. So let's do that. Okay, so to initialize our connection to Pinecone, we're going to need a free API key. We can get that from app.pinecone.io. Okay, once we are in Pinecone, we go over to our API keys.

You want to copy that and also remember the environment here. We need that as well. So the environment you can put in here. So mine was us-west1-gcp. And for the API key, you need to paste it into here. Okay, and then with those, we just initialize our connection to Pinecone and we can make sure it's connected by taking a look at this.
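A sketch of that connection step, using the older environment-based Pinecone client that the video is built around; swap in your own key and environment:

```python
import pinecone

pinecone.init(
    api_key="YOUR_API_KEY",      # from app.pinecone.io
    environment="us-west1-gcp",  # the environment shown next to your key
)

pinecone.list_indexes()  # sanity check that the connection works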

Now, mine is not empty because I'm doing a lot of things to Pinecone right now. So, yeah, not empty, but we would expect that in reality, it should be. Now, one thing I do need to do is make sure I delete this index, which is already running. So I come down to here, we have that index name, and I'm going to check if that index is already running.

And if so, I'm going to delete the index because I want to start fresh. And then what I'm going to do is create a new index with the same name. The dimensionality is telling us what is the dimensionality of the vectors that we'll be putting into our index. We know that, that's the 384 that we saw before for our MiniLM model.

And we also have this metric. So basically, most embedding models, you can use the cosine metric, but some of them may need you to use dot product or also Euclidean distance. So basically, just check which embedding model you're using. If you can't see any information on which one of those metrics you should use, just assume it's probably cosine.
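A sketch of the delete-then-create logic just described; the index name here is an assumption based on how it's referred to later in the walkthrough:

```python
index_name = "retrieval-augmentation-aws"  # hypothetical name matching the narration

# Start fresh: delete the index if it already exists
if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)

pinecone.create_index(
    name=index_name,
    dimension=384,    # MiniLM outputs 384-dimensional vectors
    metric="cosine",  # cosine is a safe default for most embedding models
)
```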

And then here, we're just waiting for the index to finish initializing before we move on to the next step, which would be list indexes again, which will look exactly the same because I've already had the Retrieval Augmentation AWS index in there. So it takes like a minute for that to run.

Now it has. And then I've put here that we do this in batches of 128. Ignore that, we're doing it in batches of two, which, given the dataset size, is fine. It's a really small dataset. Obviously, if you're wanting to do this for a big dataset, you should use a larger batch size.

Otherwise, it can take a really long time. And if you do want to use a larger batch size, you need to use a larger instance than the, what was it, the ml.t2.large that we're using. We're just going to upload or upsert up to 1000 of those vectors. Okay, so that's what we're doing here.

We then initialize our connection to the specific index that we created up here. And then we just loop through and upsert everything. So I'm going to run that. And let me just explain what is happening. So we're going through in batches of two. We are getting our IDs for those batches.

So the IDs here are just like 0, 1, 2, 3, nothing special. You should probably use actual IDs if you're wanting to do something real with this. We create metadata for each batch. So basically, I just want to store the text for each answer within Pinecone because it makes things a little bit easier later on when we're retrieving everything.

And then we want to create our embeddings. So we take the answers, like the documents within this batch, and we do embed_docs, which is a function that we defined earlier. And then we upsert everything. Okay, and that's it. Now, if we take a look at our number of records within the index, we should see that there are 154, which is a tiny, tiny index.
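Pulling that loop together, a rough sketch might look like this; the batch size and positional IDs follow the narration, while the DataFrame column name and vector cap are assumptions carried over from the earlier sketches:

```python
from tqdm.auto import tqdm

batch_size = 2       # tiny dataset, so a tiny batch size is fine here
vector_limit = 1000  # upper bound on how many records we upsert

answers = df["Answer"].tolist()[:vector_limit]

# Connect to the specific index we created above
index = pinecone.Index(index_name)

for i in tqdm(range(0, len(answers), batch_size)):
    i_end = min(i + batch_size, len(answers))
    ids = [str(x) for x in range(i, i_end)]                   # simple positional IDs
    metadata = [{"text": text} for text in answers[i:i_end]]  # keep raw text for later
    embeddings = embed_docs(answers[i:i_end])                 # helper defined earlier
    index.upsert(vectors=zip(ids, embeddings, metadata))

index.describe_index_stats()  # should report roughly 154 vectors for this dataset
```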

Honestly, in reality, you probably wouldn't use Pinecone for something this small. You really want to be using 10,000, 50,000, 100,000, even a million vectors. But for this example, it's fine. Okay, so let's just take a look at the question we initialized earlier. Which instances can I use with managed spot training in SageMaker?

All right, so that was the question. What I'm going to do is I'm going to embed that to create our query vector, which I was calling xq earlier. And then we're going to query Pinecone, and we're going to include the metadata. So that will allow us to see the text of the answers as well.
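A sketch of that query step, reusing the embed_docs helper from the earlier sketch:

```python
question = "Which instances can I use with Managed Spot Training in SageMaker?"

# Embed the question and retrieve the most similar answers, with their text
xq = embed_docs([question])[0]
res = index.query(vector=xq, top_k=5, include_metadata=True)
res
```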

All right, so we can run that, and we get these contexts here. So, so far, that means we have basically done, okay, we've created our database. And now what we're doing is we're asking a question, a query, taking it through to our embedding model. And this time, we are actually going along this path here and creating our query vector, taking that into Pinecone and getting the relevant context from there.

So we've just, that's what we've just seen. Those matches that we just saw, they're our relevant contexts here. So what is left? Okay, we need to take our query, we need to take these contexts, and feed them into the LLM. Okay, so let's do that. I'm going to get those contexts.

Okay, so it's just a list of these. I can even show you here, it's literally just a list of those. And what we're going to do is we're going to construct this into a single string. And what we need to do here is be careful as to how much data we feed in at any one time.

Because we're not using a massive large language model here. We're using Flan-T5 XL, which is okay, but it cannot store a ton of text within its context window. So we need to be extra careful here. And what we're going to do is we're going to say, for each text within the contexts here, we're going to add each one until we get to a point where we can't add any more because we reach the context window limit.

And what is our limit going to be? We've just set it to 1000 characters. Okay, cool. So we run that. And then we can actually run that with the contexts. And let's just see what that actually returns us, the context string. Okay, so we get, I'm not sure how many that is.

Just print it, it's a bit easier. So I think we're actually, okay, here, it's telling us, sorry, I forgot that I added that in there. So with maximum sequence length 1000, it selected the top four document sections. So we retrieved five, but we're just going to pass through the top four.

It's that last one we couldn't fit into the limit we set. Okay, that's great. So now what we want to do is same as what we did a lot earlier, where we had that prompt where it was like, answer the question below, given the context, we're going to do that.

Okay, let me do that on in another cell, so I can at least show it to you. So I'm going to do print text input. Okay, answer the following question based on the context, right? So we have our context here, which we're feeding in. And we have our question.

Okay, so let's now predict with that and see what we get. Which instances can I use with managed spot training in SageMaker? All instances are supported. Okay, cool. Okay, so with that, we've now taken our contexts and we've fed them into what will be our new prompt. Also taking our question, okay.

And we've fed that into the new prompt and used all that to create our retrieval augmented prompt, which the LLM then uses to go through and create our answer. Okay, cool. So yeah, that is the process. We now have our retrieval augmented generation pipeline and the answer that it is producing.
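Putting those steps into code, a rough sketch could look like this; the helper name and character budget follow the narration, the prompt template is reused from the earlier sketch, and the exact wording should be treated as an assumption:

```python
# Build a single context string, stopping before we blow the character budget
def construct_context(contexts: list, max_section_len: int = 1000) -> str:
    chosen, total_len = [], 0
    for text in contexts:
        total_len += len(text) + 2
        if total_len > max_section_len:
            break
        chosen.append(text)
    print(f"Selected {len(chosen)} of {len(contexts)} document sections")
    return "\n".join(chosen)

contexts = [match["metadata"]["text"] for match in res["matches"]]
context_str = construct_context(contexts)

text_input = prompt_template.format(context=context_str, question=question)
out = llm.predict({"inputs": text_input})
print(out[0]["generated_text"])
```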

So now we can pull that together into our single RAG query. Okay, so run that. And I'm going to ask the same question initially, because I don't know too much about AWS here. So I just want to ask something that I know the answer to, which is the question, the spot instances one.
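And the single end-to-end helper might look like this sketch, simply chaining the pieces from the sketches above:

```python
def rag_query(question: str) -> str:
    # Embed the question, retrieve context from Pinecone, build the prompt, generate
    xq = embed_docs([question])[0]
    res = index.query(vector=xq, top_k=5, include_metadata=True)
    contexts = [match["metadata"]["text"] for match in res["matches"]]
    context_str = construct_context(contexts)
    text_input = prompt_template.format(context=context_str, question=question)
    out = llm.predict({"inputs": text_input})
    return out[0]["generated_text"]

rag_query("Which instances can I use with Managed Spot Training in SageMaker?")
```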

Okay, and we get this, right? Now I checked the dataset and there isn't actually any mention of the Hugging Face instances in SageMaker, right? So although this is a relevant question, the model should say, I don't know, because it doesn't actually know about this piece of information. So we can test that.

Okay, and the first chunk that we're getting here is actually the contexts that are being returned. Now these contexts, we can see they don't contain anything about Hugging Face, right? That is talking about something else. And the reason that we're retrieving this irrelevant information is because when we do our embedding and we query Pinecone, we're saying retrieve the top five most relevant contexts within the database.

Now, there is nothing that contains anything about Hugging Face within our database, but it's still going to return the top five most relevant items. So it does that. But fortunately, we've told our LLM that if the context doesn't contain relevant information, you need to respond with, I don't know.

So that is exactly what it does here, it responds with, I don't know. Okay, so with that, we've seen how we can use SageMaker for retrieval augmented generation with Pinecone using open-source models, which is, I think, pretty cool and relatively easy to set up. One thing that I should show you very quickly before finishing is that right now, we have some running instances in SageMaker.

You should probably shut those down. So we can do that by going to our endpoints here, selecting those, and clicking delete, and they will be deleted. Okay, and we'll just want to do that for our other items. We have the images, that is, the endpoint configurations; you can delete those as well. And also the models.
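You can also do that same cleanup from the notebook; a sketch using the two predictor objects created in the earlier sketches:

```python
# Delete the models and endpoints so the instances stop accruing charges
llm.delete_model()
llm.delete_endpoint()
encoder.delete_model()
encoder.delete_endpoint()
```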

Okay, cool. So, yeah, once you've gone through and deleted those, you won't be paying anything more for following this. So, yeah, that's it for this video. We've obviously seen how to do RAG with open-source models and Pinecone. And it seems to work pretty well. Obviously, when you're wanting more performant generations, you'll probably want to switch up to a larger model.

The Flan T5 XL that we demoed here is pretty limited in its abilities, but it's not bad and it's definitely not bad for like a demo. So, yeah, I hope this has all been useful and interesting. Thank you very much for watching. And I will see you again in the next one.

Bye.