
Better Llama 2 with Retrieval Augmented Generation (RAG)


Chapters

0:00 Retrieval Augmented Generation with Llama 2
0:29 Python Prerequisites and Llama 2 Access
1:39 Retrieval Augmented Generation 101
3:53 Creating Embeddings with Open Source
6:23 Building Pinecone Vector DB
8:38 Creating Embedding Dataset
11:45 Initializing Llama 2
14:38 Creating the RAG RetrievalQA Component
15:43 Comparing Llama 2 vs RAG Llama 2

Transcript

In today's video we're going to be looking at more Llama 2. This time we're going to be looking at a very simple version of Retrieval Augmented Generation (RAG) using the 13 billion parameter Llama 2 model, which we're going to quantize and actually fit onto a single T4 GPU. That GPU is included within the free tier of Colab, so anyone can run this.

It should be pretty fun. Let's jump straight into the code. So to get started with this notebook, there'll be a link to it at the top of the video right now. The first thing that you will have to do, if you haven't already, is request access to Llama 2, which you can do via a form.

If you need some guidance on that, there'll be a link to another video of mine, the previous Llama 2 video, where I describe how to go through that and get access. So the first thing you'll want to do after getting access is go to "Change runtime type" and make sure that you're using GPU as the hardware accelerator and T4 as the GPU type.

If you have Colab Pro, you can use one of the bigger GPUs and it will run a lot faster, but a T4 is good enough. Cool. So we just have to install everything we need. Okay. And once that is ready, we come down to here: the Hugging Face embedding pipeline. Before we dive into the embedding pipeline, maybe what I should do is try to explain a little of what this retrieval augmented generation thing is and why it's so important.

So a problem that we have with LLMs is that they don't have access to the outside world. The only knowledge contained within them is knowledge that they learned during training, which can be super limiting. So in this example here, from a little while ago, I asked GPT-4 how to use the LLMChain in LangChain.

Okay, so LangChain being the sort of new LLM framework. And the answer it gave me described this "LangChain" as a blockchain-based decentralized AI language model, which is completely wrong. Basically, it hallucinated. And the reason for that is that GPT-4 just didn't know anything about LangChain.

And that's because it didn't have access to the outside world. It just had what's called parametric knowledge: knowledge stored within the model itself, gained during training. So the idea behind retrieval augmented generation is that you give your LLM access to the outside world. And the way that we do it, at least in this example, is we're going to give it access to a subset of the outside world, not the entire outside world.

And we're going to do that by searching with natural language, which is ideal when it comes to our LLM, because our LLM works with natural language. So we interact with the LLM using natural language, and then we search with natural language. And what that will allow us to do is ask a question and get relevant information about that question from somewhere else.

And we feed that relevant information, plus our original question, back into the LLM, giving it that access. This is what we would call source knowledge, rather than parametric knowledge. Now, part of this is the embedding model. The embedding model is how we build this retrieval system.

It's how we translate human-readable text into machine-readable vectors. And we need machine-readable vectors in order to perform a search based on semantic meaning, rather than a traditional search, which would be based more on keywords. So in the spirit of going with open source or open access models, as is the case with Llama 2, we're going to use an open source embedding model.

So we're going to use the Sentence Transformers library. If you've been watching my videos for a while, this will be kind of a flashback to a little while ago. We used Sentence Transformers a lot before the whole OpenAI ChatGPT thing kicked off. Now, this model here is a very small model, super easy to run.

You can run it on CPU. Okay. Let's have a look at how much RAM we've just used. At the moment, it seems like we're not really using any. So I think we may need to wait until we actually start creating embeddings, which we do next. You can see that we're using the CUDA device.

Here, we're going to create some embeddings. Okay. You can see that we're using some GPU RAM now, but very little, 0.9 gigabytes, which is nothing. That's pretty cool. So what we've done here is we've created these two documents, or chunks of text, and embedded them using our embedding model. So if I just come up to here, the way that we've initialized our Sentence Transformer is a little different to how I used to do it.

So we've essentially initialized it through Hugging Face, and then we've loaded that into the LangChain HuggingFaceEmbeddings object. So we're using Hugging Face via LangChain to use Sentence Transformers. There are a few abstractions there, but this will make things a lot easier for us later on. Okay. Cool.
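As a rough sketch of what that initialization might look like (assuming the all-MiniLM-L6-v2 sentence-transformers model; the exact model name and settings in the notebook may differ):

```python
from langchain.embeddings import HuggingFaceEmbeddings

# Load a small sentence-transformers model through LangChain's
# HuggingFaceEmbeddings wrapper (model name assumed for illustration).
embed_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cuda"},   # runs fine on CPU too, just slower
    encode_kwargs={"batch_size": 32},
)

docs = [
    "this is one document",
    "and another document",
]

# Two documents in, two 384-dimensional vectors out.
embeddings = embed_model.embed_documents(docs)
print(len(embeddings), len(embeddings[0]))  # -> 2 384
```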

And let's move on. So we have loaded our embedding model, and we have two document embeddings, because we have two documents here, and each of those has a dimensionality of 384. With OpenAI, for comparison, you'd be embedding to a dimensionality of 1,536, I think it is.

So with this, particularly with Pinecone, the vector database I'll be talking about later, you can fit roughly four of these for every one OpenAI embedding. The performance is lower with these, to be honest, but it kind of depends on your use case. A lot of the time, you don't need the performance that OpenAI embeddings give you.

Like in this example, it actually works really well with this very small model. So that's pretty useful. Now let's move on to the Pinecone bit, where we're going to create our vector database and build our vector index. To do that, we're going to need a free Pinecone API key.

So I'm going to click on this link here. That's going to take us to here, app.pinecone.io. I'm going to come over to my default project, zoom in a little bit here, and go to API keys, right? And we need the environment here. So us-west1-gcp, remember that, or for you, this environment will be different.

So whatever environment you have next to your API key, remember that, and then just copy your API key. Come back over to here. You're going to put in your API key here, and you're also going to put in that environment or the cloud region. So it was us-west1-gcp for me.

Okay. And I initialize that with my API key. And now we move on to the next cell. So in this next cell, we're going to initialize the index, basically just create where we're going to store all of our vectors that we create with that embedding model. There are a few items here.

So dimension: this needs to match the dimensionality of your embedding model. We already found ours before, so it's that 384, and we feed that into there. And then the metric: the metric can change depending on your embedding model. With OpenAI's text-embedding-ada-002, you can use either cosine or dot product.

With open source models, it varies a bit more. Sometimes you have to use cosine, sometimes dot product, and sometimes Euclidean, although that one is a little less common. So it's worth checking; you can usually find which metric you need to use in the model card on Hugging Face, but the most common, kind of go-to, is cosine.

All right, cool. So we initialize that. It does take a minute or so to create the index; for me, it was about a minute just now. And then we want to connect to the index, so we pass in the index name, and then we can describe that index as well, just to see what is in there at the moment, which should for now be nothing.
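Putting that together, here's a minimal sketch of the index setup, assuming the classic pinecone-client (v2) API that was current when this video was made (newer client versions use a different interface); the API key and index name are placeholders:

```python
import time
import pinecone

# Connect to Pinecone with the classic (v2) client; the API key and
# environment below are placeholders for your own values.
pinecone.init(
    api_key="YOUR_API_KEY",
    environment="us-west1-gcp",
)

index_name = "llama-2-rag"  # assumed name for illustration

# Create the index if it doesn't already exist. The dimension must match
# the embedding model (384 for the MiniLM model above).
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=384,
        metric="cosine",
    )
    # Wait until the index is ready before connecting to it.
    while not pinecone.describe_index(index_name).status["ready"]:
        time.sleep(1)

index = pinecone.Index(index_name)
index.describe_index_stats()  # should show zero vectors at this point
```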

Okay, cool. Now, with the index ready and the embeddings ready, we can begin populating our database. Just like a traditional database, with a vector database you need to put things in there in order to retrieve things from it later on. So that's what we're going to do now.

So we're going to come down to here. I quickly pulled this together; it's essentially a small dataset, I think with around 5,000 items in there, and it just contains chunks of text from the Llama 2 paper and a few other related papers. I built that by going through the Llama 2 paper, extracting the references, and extracting those papers as well.

And then just kind of repeating that loop a few times. All right. So once we download that, we come down to here and convert that Hugging Face dataset (this is using Hugging Face Datasets) into a pandas DataFrame. And we're specifying here that we would like to upload everything in batches of 32.

Honestly, we could definitely increase that to like 100 or so, but it doesn't really matter because it's not a big dataset. It's not going to take long to push everything to Pinecone. So let's just have a look at this loop. We're going through in these batches of 32. We are getting our batch from the data frame.

We're getting IDs first. Then we get the chunks of text from the data frame, and then we get our metadata from the data frame. So maybe what would actually be helpful here is if I just show you what's in that data frame, so data.head(). Okay. So you can see here, we just have a chunk ID.

So I'm going to use, I think I use DOI and chunk ID to create the ID for each entry. Yeah. And then we have the chunk, which is just like a chunk of text. Okay. You can kind of see that here. We have the paper IDs, the title of the paper, some summaries, the source, several other things in there.

Okay. But we don't need all of that. So for the metadata, we actually just keep the text, the source, and the title. And yeah, we can run that; it should be pretty quick. Okay, so that took 30 seconds for me. You can also, and I kind of forgot to do this, do from tqdm.auto import tqdm and add a progress bar so that you can actually see the progress, as in the sketch below.
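A rough sketch of that upsert loop with the progress bar added; the column names (doi, chunk-id, chunk, source, title) follow what data.head() shows above but may not match the notebook exactly:

```python
from tqdm.auto import tqdm

batch_size = 32

# Upsert the dataset to Pinecone in batches, with a progress bar.
for i in tqdm(range(0, len(data), batch_size)):
    batch = data.iloc[i:i + batch_size]
    # Build a unique ID for each chunk from the paper DOI and chunk number.
    ids = [f"{row['doi']}-{row['chunk-id']}" for _, row in batch.iterrows()]
    # Embed the chunk text with the same model we'll use for queries later.
    texts = batch["chunk"].tolist()
    embeds = embed_model.embed_documents(texts)
    # Keep only the metadata fields we need: text, source, and title.
    metadata = [
        {"text": row["chunk"], "source": row["source"], "title": row["title"]}
        for _, row in batch.iterrows()
    ]
    index.upsert(vectors=list(zip(ids, embeds, metadata)))
```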

Okay. So that's just a little bit nicer if you would rather not be staring at a cell while it works. Okay, cool. So now, if we describe the index stats, we should see about 5,000 vectors in there. Pretty cool. So we have our index, our database, ready.

What we want to do now is add in the LLM, so we want to add in Llama 2. To do that, we're going to be using the text generation pipeline from Hugging Face, and then we're going to be loading that into LangChain. We're going to be using the Llama 2 13B chat model, which you can see here, and everything that comes with that.

I've explained this stuff, like how to load the model, the quantization, and everything else, several times before, so I'm not going to go through it again. If you do want to go through that, it's in the video that I linked earlier, the previous Llama 2 video. But what I will do is show you how to get this Hugging Face authentication token.

So for that, we go to huggingface.co. You want to go to your profile icon at the top here, then Settings, and then Access Tokens. You would have to create a new token here; I've already created mine. Just make it a read token. You can use a write token if you want, but that just gives more permissions than you need for this.

But I've created mine here. I'm just going to copy it, put it into this string here, and run that. That's just going to load everything. We need that authentication token because Llama 2, all of those models, requires permission to use. You get that by signing up through Meta's forms and so on, as I mentioned earlier.

So in this case, which you don't for every model on Hugging Face, but for this model, you do need to authenticate yourself. Okay. So that will take a moment to load. Just note here, I'm using a GPU and then I'm switching the model to evaluation mode.

And actually, sorry, we don't need to use that GPU code here because the device map actually figures it out by itself. But it's good to make sure that we actually are using CUDA, so that will just print out down here. It should print out something like "model loaded on cuda:0".
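Since the loading code itself is only covered in that earlier video, here is a minimal sketch of what quantized loading might look like with transformers and bitsandbytes; the exact quantization settings are assumptions, and the token is a placeholder:

```python
import torch
import transformers

model_id = "meta-llama/Llama-2-13b-chat-hf"
hf_auth = "YOUR_HF_TOKEN"  # read-access token from Hugging Face settings

# 4-bit quantization config (assumed settings) so the 13B model fits on a T4.
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",       # places the model on the available GPU
    use_auth_token=hf_auth,  # required because the Llama 2 weights are gated
)
model.eval()  # switch to evaluation mode for inference

print(f"Model loaded on {next(model.parameters()).device}")
```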

So this will take a moment to load, so I'll just skip ahead to when it's ready. Okay, so that has finished loading. It took eight minutes, and we can see that the GPU memory has gone up to 8.2 gigabytes. So it's using more now, considering also that about 1.2 gigabytes of that was used by the MiniLM embedding model.

We're using around seven gigabytes for this quantized version of the model, which is pretty cool. Now I'm loading the tokenizer and the pipeline. Again, I went through all of this before, so I'm not going to go through it again. And then what we do is just initialize that in LangChain, roughly as sketched below.
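Here's a hedged sketch of that step, assuming a standard transformers text-generation pipeline wrapped in LangChain's HuggingFacePipeline (the generation settings here are assumptions, not necessarily what the notebook uses):

```python
from langchain.llms import HuggingFacePipeline

tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id, use_auth_token=hf_auth
)

# Text-generation pipeline wrapping the quantized Llama 2 model.
generate_text = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    return_full_text=True,   # LangChain expects the full generated text
    temperature=0.1,         # assumed generation settings
    max_new_tokens=512,
    repetition_penalty=1.1,
)

# Wrap the pipeline so it can be used anywhere LangChain expects an LLM.
llm = HuggingFacePipeline(pipeline=generate_text)
```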

So now we can start using all the different LangChain utilities. So, coming down to here, what we need to do is initialize the RetrievalQA chain. This is about the simplest form of RAG that you can get for your LLM. For the RetrievalQA chain, we need a vector store, which is another LangChain object, and our LLM, which we already have.

So let's initialize our vector store and confirm that it works. We have this query, and I'm going to do a similarity search. So this is not using the LLM here; this is just retrieving what it believes are the relevant documents. Now, it's kind of hard to read these, to be honest, I at least struggle, but we'll see in a moment that the LLM does actually manage to get good information from them.
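A minimal sketch of that, using the LangChain Pinecone wrapper as it existed at the time (the "text" metadata field name matches what was stored in the upsert loop above):

```python
from langchain.vectorstores import Pinecone

text_field = "text"  # the metadata field that holds the chunk text

# Wrap the Pinecone index in a LangChain vector store so it can act as
# a retriever for the chain below.
vectorstore = Pinecone(index, embed_model.embed_query, text_field)

query = "What is so special about Llama 2?"
vectorstore.similarity_search(query, k=3)  # returns the 3 most similar chunks
```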

So we create our RAG pipeline like so: we just pass in our LLM, our retriever, and the chain type. The "stuff" chain type basically just means it's going to stuff all of the retrieved context into the context window of the LLM alongside the query. And then we can begin asking questions. So let's begin by asking what is so special about Llama 2, first just with the plain LLM, without retrieval.
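A minimal sketch of that chain, assuming the components defined above:

```python
from langchain.chains import RetrievalQA

# Build the RAG pipeline: retrieve relevant chunks from Pinecone, stuff them
# into the prompt, and let Llama 2 generate the answer.
rag_pipeline = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
)

query = "What is so special about Llama 2?"
llm(query)           # plain Llama 2, no retrieval
rag_pipeline(query)  # retrieval augmented Llama 2
```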

We run that. This will take a little bit of time; again, we're using the smallest GPU possible here, and the quantization steps that we used to make this model so small also add to the processing time, or slow the inference speed. So we do have to wait a moment.

Okay. And we get our response; it took about a minute. Again, if you actually want to run this in production, you're probably going to want more GPU power, and probably not to quantize the model. So, yeah, we get this: it's talking about actual llamas. It just tells us a load of random things, like their coats can be a variety of colors.

They are silky, I think it says somewhere; I know it did in a previous output. They're calm, and so on and so on. We don't need that. What we actually want to ask about is Llama 2, the large language model. So now what we're going to do is run it through our RAG pipeline and see what we get.

Okay. So that took 30 seconds to run; I think maybe the first time you run the model it's a little slower, but yeah, that was quicker. So we get: Llama 2 is a collection of pretrained and fine-tuned large language models. Additionally, they're considered a suitable substitute for closed-source models like ChatGPT, Bard, and Claude.

They are optimized for dialogue and outperform open-source chat models on most benchmarks tested, which I think is indeed the special thing about Llama 2. Cool. Now, let's try some more questions; I'd say that RAG example works a lot better. So: what safety measures were used in the development of Llama 2?

Just using the LLM, without retrieval augmentation, we get this. I don't even know what it's talking about; it's almost like it's rambling about something, and I'm not sure what that something is, but it's not a good answer. Now, if we look at what we get with retrieval augmentation, we get: the development of Llama 2 included safety measures such as pretraining, fine-tuning, and model safety approaches.

The release of the 34 billion parameter model was delayed because they didn't have time to red team it. That's a pretty good answer, but let's ask a little more about the red teaming procedures. I'm not going to bother asking the plain LLM, because it clearly isn't capable of giving us good answers here.

So let's just go straight to the retrieval augmented pipeline. We ask what the red teaming procedures for Llama 2 were, and it describes: okay, the red teaming procedures used for Llama 2 included creating prompts that might elicit unsafe or undesirable responses from the model, such as sensitive topics or prompts that could cause harm if the model was used inappropriately.

These exercises were performed by a set of experts, and it also notes that the paper mentions multiple additional rounds of red teaming were performed over several months to ensure the robustness of the model. Cool. Now, let's ask one final question: how does the performance of Llama 2 compare to other local LLMs?

The performance of Llama 2 is compared to other local LLMs such as Chinchilla and Bard in the paper, although I wouldn't call Bard a local LLM, but fine. Specifically, the authors report that Llama 2 outperforms the other models on the series of helpfulness and safety benchmarks that they tested. Llama 2 appears to be on par with some of the closed-source models, at least on the human evaluations they performed.

So that would be models like GPT-3.5, which seems a little bit better than Llama 2, but not by that much, except for coding. For coding, Llama 2 is pretty terrible; for everything else, it seems pretty good. Now, yeah, that's the example. We can see very clearly that retrieval augmentation works a lot better than no retrieval augmentation.

That's why this sort of technique is super powerful. It means your LLM can answer questions about more up-to-date topics, which it otherwise can't. And it means it can answer questions about internal information; if you work in an organization with internal documents, for example, it can answer questions about those.

So overall, retrieval augmentation in most cases is really useful. Now that's it for this video. I hope this has been useful and interesting. So thank you very much for watching and I will see you again in the next one. Bye.