Better Llama 2 with Retrieval Augmented Generation (RAG)
Chapters
0:00 Retrieval Augmented Generation with Llama 2
0:29 Python Prerequisites and Llama 2 Access
1:39 Retrieval Augmented Generation 101
3:53 Creating Embeddings with Open Source
6:23 Building Pinecone Vector DB
8:38 Creating Embedding Dataset
11:45 Initializing Llama 2
14:38 Creating the RAG RetrievalQA Component
15:43 Comparing Llama 2 vs RAG Llama 2
00:00:00.000 |
In today's video we're going to be looking at more LLAMA2. 00:00:03.360 |
This time we're going to be looking at a very simple version of Retrieval Augmented Generation 00:00:11.040 |
using the 13 billion parameter LLAMA2 model, which we're going to quantize and actually fit 00:00:16.640 |
that onto a single T4 GPU, which is included within the free tier of Colab, so anyone can 00:00:24.560 |
actually run this. It should be pretty fun. Let's jump straight into the code. 00:00:29.680 |
So to get started with this notebook, there'll be a link to this at the top of the video right now. 00:00:33.920 |
The first thing that you will have to do if you haven't already is actually request access to 00:00:40.960 |
LLAMA2, which you can do via a form. If you need some guidance on that, there'll be a link to 00:00:48.160 |
another video of mine, the previous LLAMA2 video, where I describe how to go through that and get 00:00:55.120 |
access. So the first thing we want to do after getting access is go to 00:01:02.480 |
change runtime type and you want to make sure that you're using GPU for hardware accelerator 00:01:08.720 |
and T4 for your GPU type. If you have Colab Pro, you can use one of the faster GPUs and it will run a lot 00:01:15.280 |
faster, but T4 is good enough. Cool. So we just have to install everything we need. Okay. 00:01:23.920 |
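For reference, the install cell looks something like the sketch below. The exact package list is an assumption based on the libraries used later in the notebook, so check the notebook itself for the pinned versions.

```python
# A sketch of the Colab install cell (package names assumed from what is used later)
!pip install -qU \
    transformers \
    accelerate \
    bitsandbytes \
    sentence-transformers \
    langchain \
    pinecone-client \
    datasets
```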
And once that is ready, we come down to here, to the HuggingFace embedding pipeline. So before we dive 00:01:29.920 |
into the embedding pipeline, maybe what I should do is try to explain a little bit of what this 00:01:35.040 |
retrieval augmented generation thing is and why it's so important. So a problem that we have with 00:01:40.400 |
LLMs is that they don't have access to the outside world. The only knowledge contained within them 00:01:46.800 |
is knowledge that they learned during training, which can be super limiting. So in this example 00:01:52.400 |
here, this was a little while ago, I asked GPT-4 how to use LLMChain in LangChain. Okay. So 00:01:58.800 |
LangChain being the sort of new LLM framework. And the answer it gave me described 00:02:07.040 |
LangChain as a blockchain-based decentralized AI language model, which is completely wrong. 00:02:12.720 |
Basically it hallucinated. And the reason for that is that GPT-4 just didn't know anything 00:02:17.760 |
about LangChain. And that's because it didn't have access to the outside world. It just had 00:02:24.080 |
its parametric knowledge, that is, the knowledge stored within the model itself, 00:02:29.120 |
gained during training. So the idea behind retrieval augmented generation is that you give 00:02:35.040 |
your LLM access to the outside world. And the way that we do it, at least in this example, 00:02:40.960 |
is we're going to give it access to a subset of the outside world, 00:02:47.120 |
not the entire outside world. And we're going to do that by searching with natural language, 00:02:52.400 |
which is ideal when it comes to our LLM, because our LLM works with natural language. So we 00:03:00.400 |
interact with the LLM using natural language, and then we search with natural language. 00:03:04.640 |
And what that will allow us to do is we'll ask a question, we'll get relevant information about 00:03:11.920 |
that question from somewhere else. And we get to feed that relevant information plus our original 00:03:19.280 |
question back into the LLM, giving it access to that information. So this is what we would call source knowledge, 00:03:25.920 |
rather than parametric knowledge. Now, a key part of this is the embedding model. 00:03:29.760 |
So the embedding model is how we build this retrieval system. It's how we translate human 00:03:38.320 |
readable text into machine readable vectors. And we need machine readable vectors in order to 00:03:45.280 |
perform a search, and to perform it based on semantic meaning rather than a traditional 00:03:51.040 |
search, which would be based more on keywords. So in the spirit of going with open source or open access 00:03:56.880 |
models, as is the case with LLAMA2, we're going to use an open source model. So we're going to 00:04:02.400 |
use the Sentence Transformers library. If you've been watching my videos for a while, this will be 00:04:08.000 |
kind of like a flashback to a little while ago. So we used Sentence Transformers a lot before the 00:04:17.440 |
whole OpenAI ChatGPT thing kicked off. Now, this model here is a very small model, 00:04:25.280 |
super easy to run. You can run it on CPU. Okay. Let's have a look at how much RAM I just used. 00:04:30.560 |
Okay. At the moment, it seems like we're not really even using any. So I think we may need 00:04:37.680 |
to wait until we actually start creating embeddings, which we do next. So you can see 00:04:42.800 |
that we're using the CUDA device. Here, we're going to create some embeddings. Okay. You see 00:04:47.840 |
that we're using some GPU RAM now, but very little, 0.9 gigabytes, which is nothing. That's pretty 00:04:54.320 |
cool. So what we've done here is we've created these two documents or chunks of text. We embed 00:04:59.280 |
them using our embedding model. So if I just come up to here, the way that we've initialized our 00:05:05.520 |
Sentence Transformer is a little different to how I used to do it. So we've essentially 00:05:10.400 |
initialized it through HuggingFace. And then we have actually loaded that into the LangChain 00:05:17.520 |
HuggingFace embeddings object. Okay. So we're using HuggingFace via LangChain to use Sentence 00:05:23.600 |
Transformers. So there's a few abstractions there, but this will make things a lot easier for us 00:05:28.240 |
later on. Okay, cool. 00:05:38.640 |
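For reference, the setup just described looks roughly like the sketch below. The model ID is an assumption, a small 384-dimensional Sentence Transformers model matching the dimensionality discussed next.

```python
from langchain.embeddings import HuggingFaceEmbeddings

# Assumed model: a small 384-dimensional Sentence Transformers model
embed_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cuda"},   # run on the T4 GPU
    encode_kwargs={"batch_size": 32},
)

docs = [
    "this is one document",
    "and another document",
]

embeddings = embed_model.embed_documents(docs)
print(len(embeddings), len(embeddings[0]))  # 2 embeddings, 384 dimensions each
```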
So we have loaded our embedding model, and we have two document embeddings because we passed in two documents here. And each of those has a 00:05:42.560 |
dimensionality of 384. Now with OpenAI, for comparison, we're going to be embedding to a 00:05:48.640 |
dimensionality of 1,536, I think it is. So with this, you can, particularly with Pinecone, the 00:05:58.320 |
vector database I'm talking about later, you can fit roughly four of these for every one OpenAI embedding. 00:06:04.640 |
The performance is less with these, to be honest, but it kind of depends on your use case. A lot of 00:06:11.600 |
the time, you don't need the performance that OpenAI embeddings give you. Like in this example, 00:06:17.360 |
it actually works really well with this very small model. So that's pretty useful. Now, yeah, 00:06:24.480 |
let's move on to the Pinecone bit, where we're going to create our vector database and build our 00:06:30.400 |
vector index. So to do that, we're going to need a free Pinecone API key. So I'm going to click on 00:06:37.120 |
this link here. That's going to take us to here, app.pinecone.io. I'm going to come over to my 00:06:45.600 |
default project, zoom in a little bit here, and go to API keys, right? And we need the environment 00:06:52.880 |
here. So us-west1-gcp, remember that, or for you, this environment will be different. So whatever 00:07:00.080 |
environment you have next to your API key, remember that, and then just copy your API key. Come back 00:07:05.200 |
over to here. You're going to put in your API key here, and you're also going to put in that 00:07:09.360 |
environment or the cloud region. So it was us-west1-gcp for me. Okay. And I initialize that 00:07:18.000 |
with my API key. And now we move on to the next cell. So in this next cell, we're going to 00:07:23.600 |
initialize the index, basically just create where we're going to store all of our vectors that we 00:07:28.960 |
create with that embedding model. There are a few items here. So dimension, this needs to match 00:07:33.920 |
the dimensionality of your embedding model. We already found ours before: it's 384. 00:07:39.440 |
So we feed that into there. And then the metric, metrics can change depending on your embedding 00:07:44.800 |
model. With OpenAI's Ada-002, you can use either cosine or dot product. 00:07:51.600 |
With open source models, it varies a bit more. Sometimes you have to use cosine. Sometimes you 00:07:57.680 |
have to use dot product. Sometimes you have to use Euclidean, although that one is a little less 00:08:02.160 |
common. So it's worth just checking. You can usually find in the model card on HuggingFace 00:08:08.080 |
which metric you need to use, but the most common, the kind of go-to, is cosine. All right, cool. So 00:08:15.840 |
we initialize that. Okay, cool. So that initialization does take a minute or so. For me, it was about a 00:08:22.640 |
minute just now. And then we want to connect to the index, so we do that with the index name, 00:08:30.400 |
and then we can describe that index as well, just to see what is in there at the moment, 00:08:34.400 |
which should for now be nothing. Okay, cool. 00:08:44.480 |
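Those cells look roughly like the sketch below, which follows the older pinecone-client API used around the time of the video; the index name is a hypothetical placeholder.

```python
import pinecone

# Placeholders: use your own API key and the environment shown next to it
pinecone.init(
    api_key="YOUR_PINECONE_API_KEY",
    environment="us-west1-gcp",
)

index_name = "llama-2-rag"  # hypothetical index name

if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=384,    # must match the embedding model's dimensionality
        metric="cosine",  # check the model card for the right metric
    )

index = pinecone.Index(index_name)
print(index.describe_index_stats())  # should report 0 vectors for now
```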
Now, with the index ready and the embedding model ready, we're ready to begin populating our database. Okay. So just like a typical traditional database, 00:08:51.040 |
with a vector database, you need to put things in there in order to retrieve things from that 00:08:57.040 |
database later on. So that's what we're going to do now. So we're going to come down to here. 00:09:02.160 |
I quickly just pulled this together. It's essentially a small dataset. I think it's 00:09:09.920 |
just around 5,000 items in there. And it just contains chunks of text from the LLAMA2 paper 00:09:17.840 |
and a few other related papers. So I just built that by kind of going through the LLAMA2 paper 00:09:26.320 |
and extracting the references and extracting those papers as well. And just kind of like 00:09:31.600 |
repeating that loop a few times. All right. So once we download that, we come down to here, 00:09:38.160 |
we're going to convert that HuggingFace dataset (this is using the HuggingFace datasets library) into a pandas data frame. 00:09:44.480 |
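As a sketch, with the dataset ID as a placeholder for the one linked in the notebook:

```python
from datasets import load_dataset

# Placeholder dataset ID: substitute the dataset linked in the notebook
data = load_dataset("your-username/llama-2-papers-chunked", split="train")

# Convert the HuggingFace dataset to a pandas DataFrame
data = data.to_pandas()
data.head()
```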
And we're specifying here that we would like to upload 00:09:51.360 |
everything in batches of 32. Honestly, we could definitely increase that to like 100 or so, 00:09:59.120 |
but it doesn't really matter because it's not a big dataset. It's not going to take long to 00:10:04.800 |
push everything to Pinecone. So let's just have a look at this loop. We're going through in these 00:10:09.680 |
batches of 32. We are getting our batch from the data frame. We're getting IDs first. Then we get 00:10:19.920 |
the chunks of texts from the data frame, and then we get our metadata from the data frame. 00:10:25.120 |
So maybe what would actually be helpful here is if I just show you what's in that data frame. 00:10:30.400 |
So data.head. Okay. So you can see here, we just have a chunk ID. So I'm going to use, 00:10:39.760 |
I think I use DOI and chunk ID to create the ID for each entry. Yeah. And then we have the chunk, 00:10:46.160 |
which is just like a chunk of text. Okay. You can kind of see that here. We have the paper IDs, 00:10:52.240 |
the title of the paper, some summaries, the source, several other things in there. Okay. 00:10:58.000 |
But we don't need all of that. So for the metadata, we actually just keep the text, 00:11:02.960 |
the source, and the title. And yeah, we can run that. It should be pretty quick. 00:11:10.080 |
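The loop just described looks roughly like this sketch; the column names (doi, chunk-id, chunk, source, title) are assumptions based on the fields mentioned above.

```python
batch_size = 32

for i in range(0, len(data), batch_size):
    # Get one batch of rows from the DataFrame
    batch = data.iloc[i:i + batch_size]
    # Build IDs from the DOI plus the chunk number (column names assumed)
    ids = [f"{row['doi']}-{row['chunk-id']}" for _, row in batch.iterrows()]
    # Embed the text chunks with the embedding model
    texts = [row["chunk"] for _, row in batch.iterrows()]
    embeds = embed_model.embed_documents(texts)
    # Keep only the text, source, and title as metadata
    metadata = [
        {"text": row["chunk"], "source": row["source"], "title": row["title"]}
        for _, row in batch.iterrows()
    ]
    # Upsert (id, vector, metadata) tuples into the Pinecone index
    index.upsert(vectors=list(zip(ids, embeds, metadata)))
```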
Okay. So that took 30 seconds for me. You can also, I kind of forgot to do this, but you can do from tqdm.auto 00:11:17.120 |
import tqdm, and you can add a progress bar so that you can actually see the progress 00:11:24.960 |
like that. Okay. So that's just a little bit nicer if you would rather not just be staring at a cell doing something. 00:11:34.800 |
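In code, that's just wrapping the same loop:

```python
from tqdm.auto import tqdm

# Same upsert loop as before, wrapped in tqdm to show a progress bar
for i in tqdm(range(0, len(data), batch_size)):
    ...
```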
Okay. Cool. So now if we describe the index stats, we should see about 5,000 00:11:41.680 |
vectors in there. Okay. So that's pretty cool. Now, we have our vector index, our 00:11:48.000 |
database, ready. What we want to do now is add in the LLM; we want to add in LLAMA2. 00:11:55.920 |
To do that, we're going to be using the text generation pipeline from HuggingFace. And then 00:12:00.160 |
we're going to be loading that into LangChain. We're going to be using the LLAMA2 13B 00:12:05.520 |
chat model, which you can see here and everything that comes with that. I've explained this stuff 00:12:14.880 |
here. So like how to load the model, the quantization, everything else several times. 00:12:19.840 |
So I'm not going to go through that again. If you do want to go through that, it's in the video that 00:12:25.040 |
I linked earlier, the previous LLAMA2 video. But what I will do is show you how to get this HuggingFace 00:12:30.400 |
authentication token. So for that, we go to HuggingFace.co. We want to go to your profile 00:12:38.160 |
icon at the top here, settings, and then you go to access tokens. You would have to create a new 00:12:44.480 |
token here. I've already created mine. Just make it a read token. You can use a write if you want, 00:12:49.440 |
but it just gives more permissions that you don't need for this. But I've created mine here. I'm 00:12:53.840 |
just going to copy it and I will put it into this string here and we run that. That's just going to 00:13:02.160 |
load everything. So we need that authentication token because LLAMA2, all those models, you need 00:13:08.640 |
permission to use them. You get that by signing up through Meta's forms and everything, as I 00:13:15.600 |
mentioned earlier. So in this case, which isn't true for every model on HuggingFace, 00:13:21.840 |
but for this model, you do need to authenticate yourself. Okay. So that will take a moment to 00:13:28.240 |
load. Just note here, I'm using a GPU and then I am switching the model to like evaluation mode. 00:13:36.240 |
And actually, sorry, we don't need to use that GPU code here because the device actually figures 00:13:43.840 |
it out by itself. But it's good to make sure that we actually are using CUDA. So that would just 00:13:50.160 |
print out down here. It should print out something like model loaded on CUDA zero. So this will take 00:13:56.160 |
a moment to load. So I'll just skip ahead to when it's ready. Okay. So that has finished loading. 00:14:03.600 |
It took eight minutes and we can see that the GPU memory has gone up to 8.2 gigabytes. So it's using 00:14:11.040 |
more now, considering also that about 1.2 gigabytes of that was used by the MiniLM embedding model. We're 00:14:17.680 |
using like seven gigabytes for this quantized version of the model, which is pretty cool. 00:14:23.040 |
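For completeness, the loading step described above follows the standard transformers plus bitsandbytes pattern, roughly like this sketch; the exact quantization settings in the notebook may differ.

```python
import torch
import transformers

model_id = "meta-llama/Llama-2-13b-chat-hf"
hf_auth = "YOUR_HF_ACCESS_TOKEN"  # the read token created above

# 4-bit quantization so the 13B model fits on a single T4 (settings are assumptions)
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    use_auth_token=hf_auth,
)
model.eval()  # switch to evaluation (inference) mode
print(f"Model loaded on {next(model.parameters()).device}")
```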
Now I'm loading the tokenizer and the pipeline. Again, I went through all this stuff before, 00:14:27.920 |
so I'm not going to go through it again. And then what we do is just initialize that in 00:14:33.840 |
LangChain, so that we can start using all the different LangChain utilities. 00:14:40.240 |
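That wrapping step looks roughly like this; the generation parameters are illustrative.

```python
from langchain.llms import HuggingFacePipeline

tokenizer = transformers.AutoTokenizer.from_pretrained(model_id, use_auth_token=hf_auth)

generate_text = transformers.pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,   # LangChain expects the prompt plus the completion
    max_new_tokens=512,      # illustrative generation settings
    temperature=0.1,
    repetition_penalty=1.1,
)

# Wrap the HuggingFace pipeline so LangChain can use it as an LLM
llm = HuggingFacePipeline(pipeline=generate_text)
```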
So, coming down to here, what we need to do is initialize the RetrievalQA chain. This is like the simplest 00:14:45.760 |
form of RAG that you can get for your LLM. So for that RetrievalQA chain, we need a 00:14:54.000 |
vector store, which is another LangChain object, and our LLM, which we already have. So 00:15:02.240 |
let's initialize our vector store, and we'll just confirm that it works. So we have this query. 00:15:08.560 |
I'm going to do a similarity search. So this is not using the LLM here; this is just retrieving 00:15:13.920 |
what it believes are relevant documents. Now it's kind of hard to read these, to be honest, 00:15:19.600 |
I at least struggle, but we'll see in a moment that the LLM does actually manage to get good information from these. 00:15:25.600 |
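The vector store setup and that similarity search look roughly like this sketch; "text" matches the metadata field we stored earlier.

```python
from langchain.vectorstores import Pinecone

text_field = "text"  # the metadata field holding the raw chunk text

# Wrap the Pinecone index as a LangChain vector store
vectorstore = Pinecone(index, embed_model.embed_query, text_field)

query = "what is so special about llama 2?"
# Retrieve the three most similar chunks; no LLM involved yet
vectorstore.similarity_search(query, k=3)
```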
So we create our RAG pipeline like so: we just pass in our LLM, 00:15:32.960 |
our retriever and the chain type. Chain type basically just means it's going to stuff all 00:15:37.040 |
of the retrieved context into the context window of the LLM query. And then we can begin asking questions. 00:15:45.040 |
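Putting that together, the RetrievalQA chain is roughly:

```python
from langchain.chains import RetrievalQA

rag_pipeline = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",                    # stuff the retrieved docs into the prompt
    retriever=vectorstore.as_retriever(),
)

# RAG-augmented answer
rag_pipeline("what is so special about llama 2?")
# Compare against the LLM on its own
llm("what is so special about llama 2?")
```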
So let's begin by asking: what is so special about LLAMA2? We run that. This will take a while; again, 00:15:55.120 |
we're using the smallest GPU possible here. So it's going to take a little bit of time. 00:16:00.160 |
Also, the quantization step that we used to make this model so small adds time to the processing, 00:16:06.960 |
or slows the inference speed. So we do have to wait a moment. Okay. And we get our response. It took 00:16:13.200 |
like a minute. Again, if you actually want to run this in production, you're probably going to want 00:16:18.640 |
more GPU power and also not to quantize the model. So yeah, we get this. It's talking about actual 00:16:25.920 |
llamas, the animal. It just tells us a load of random things, like their coats can be a variety of colors. 00:16:31.520 |
They are silky, I think it says somewhere. I know it did in the previous output. They're calm, 00:16:37.600 |
so on and so on. We don't need that. So what we actually want to ask about is LLAMA2, the 00:16:43.600 |
large language model. So now what we're going to do is run it through our RAG pipeline and see what 00:16:50.480 |
we get. Okay. So that was 30 seconds to run. I think maybe the first time that you run the model 00:16:57.120 |
it's a little bit slower. But yeah, that was quicker. So we get LLAMA2 is a collection of 00:17:02.240 |
pre-trained and fine-tuned large language models. Additionally, they're considered a suitable 00:17:07.360 |
substitute for closed-source models like ChatGPT, Bard, and Claude. They are optimized for dialogue 00:17:13.120 |
and outperform open-source chat models on most benchmarks tested, which I think is the special 00:17:18.480 |
thing about LLAMA2. Cool. Now, let's try some more questions. I'd say that RAG example 00:17:26.880 |
works a lot better. So: what safety measures were used in the development of LLAMA2? Just using 00:17:32.640 |
the LLM without retrieval augmentation, we get this. So it just, I don't even know what it's 00:17:39.600 |
talking about. It kind of just, it's almost like it's rambling about something. I'm not sure what 00:17:44.080 |
that something is, but yeah, not a good answer. Now, if we look at what we get with retrieval 00:17:49.600 |
augmentation, we get the development of LLAMA2 included safety measures, such as pre-training, 00:17:54.480 |
fine-tuning, and model safety approaches. The release of the 34 billion parameter model was 00:18:00.320 |
delayed because they didn't have time to red team. That's a pretty good answer, but let's ask a 00:18:07.680 |
little more about the red teaming procedures. I'm not going to bother asking the LLM because it 00:18:13.760 |
clearly isn't capable of giving us good answers here. So let's just go straight for the retrieval 00:18:19.200 |
augmented pipeline. So we asked what are the red teaming procedures for LLAMA2 and it describes, 00:18:28.000 |
okay, red teaming procedures used for LLAMA2 included creating prompts that might elicit 00:18:33.280 |
unsafe or undesirable responses from the model, such as sensitive topics or prompts that could 00:18:40.400 |
cause harm if the model was used inappropriately. These exercises were performed by a set of experts 00:18:47.360 |
and it also notes that the paper mentions that multiple additional rounds of red teaming 00:18:52.400 |
were performed over several months to ensure the robustness of the model. Cool. Now, let's ask one 00:18:59.920 |
more final question. How does the performance of LLAMA2 compare to other local LLMs? The performance 00:19:05.600 |
of LLAMA2 is compared to other local LLMs such as Chinchilla and Bard in the paper, although I 00:19:10.160 |
wouldn't call Bard a local LLM. Fine. Specifically, the authors report that LLAMA2 outperforms the 00:19:18.480 |
other models on the series of helpfulness and safety benchmarks that they tested. LLAMA2 appears 00:19:23.760 |
to be on par with some of the closed source models, at least on the human evaluations they 00:19:28.080 |
performed. So that would be models like GPT-3.5, which seems a little bit better than LLAMA2, 00:19:36.080 |
but not by that much. Except maybe for coding; at coding, LLAMA2 is pretty terrible. 00:19:41.680 |
Everything else, it seems pretty good. Now, yeah, that's the example. We can see very clearly that 00:19:50.880 |
retrieval augmentation works a lot better than without retrieval augmentation. That's why this 00:19:57.440 |
sort of technique is super powerful. It means your LLM can answer questions about more up-to-date 00:20:04.720 |
topics, which it otherwise can't. It also means that if you work in an 00:20:10.880 |
organization with internal documents, it can answer 00:20:15.360 |
questions about that. So overall, retrieval augmentation in most cases is really useful. 00:20:23.760 |
Now that's it for this video. I hope this has been useful and interesting. 00:20:29.520 |
So thank you very much for watching and I will see you again in the next one. Bye.