Back to Index

Superfast RAG with Llama 3 and Groq


Chapters

0:00 Groq and Llama 3 for RAG
0:37 Llama 3 in Python
4:25 Initializing E5 for Embeddings
5:56 Using Pinecone for RAG
7:24 Why We Concatenate Title and Content
10:15 Testing RAG Retrieval Performance
11:28 Initialize connection to Groq API
12:24 Generating RAG Answers with Llama 3 70B
14:37 Final Points on Why Groq Matters

Transcript

Today, we are going to be taking a look at using the Groq API with Llama 3 for RAG. Now, the Groq API, for those of you who haven't heard of it, is an API that gives us access to what Groq, the company, calls an LPU, or language processing unit. An LPU is essentially specialized hardware that allows us to run LLMs very, very quickly.

So you'll see that the token throughput, when you're calling even big LLMs like we will today, through this API is insanely fast. To find the code, we're going to go to the Pinecone examples repo, then to examples, integrations, groq, and then to this Groq Llama 3 RAG notebook.

I'm going to go ahead and just open this directly in Colab. Okay, and here we are. I'm going to go ahead and connect, and I'm going to make sure I'm connected to a GPU, because for our embeddings we're actually going to be using a local embedding model.

It's not necessary, but it is quicker if we use a GPU there. Then, what I'm going to do is go ahead and install the prerequisite libraries. So what are we using here? We have Hugging Face Datasets, which is where we're going to be getting our data from. We have groq, which is the client for the Groq API and the LPU that I mentioned.

We have semantic-router, which is where we're going to be pulling in our encoder, or embedding model, from. And we also have the Pinecone client, which is, of course, where we're going to be storing our embeddings. Installing these might take a moment, but once it's done, we're going to go ahead and download just part of the dataset, not the full thing.
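For reference, the install cell looks roughly like this. Package names are current at the time of writing, and the [local] extra on semantic-router is an assumption about what's needed to run the E5 encoder locally; pin versions as needed.

```python
# Quiet, upgrade-if-present install of the four libraries used in this walkthrough.
!pip install -qU \
    datasets \
    groq \
    "semantic-router[local]" \
    pinecone-client
```

With those installed, we can pull the data down.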

So 10,000 rows from this ai-arxiv2-semantic-chunks dataset that I created in the past. Essentially, it is a set of AI ArXiv papers, and they have been semantically chunked. So roughly paragraph-sized chunks, but it varies because we're chunking semantically here, so it's essentially looking for where the topic within those papers changes.
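A sketch of that download, assuming the dataset ID jamescalam/ai-arxiv2-semantic-chunks on the Hugging Face Hub:

```python
from datasets import load_dataset

# Grab only the first 10,000 chunks rather than the full dataset.
data = load_dataset(
    "jamescalam/ai-arxiv2-semantic-chunks",
    split="train[:10000]",
)
data
```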

So I'm going to run that, we'll download those, and we'll just have a quick look at what that looks like. One of the papers we have in there is the Mixtral of Experts paper, which introduced Mixtral. And you can see that this first chunk is basically the introduction to the paper.

So you can see the title of the paper, you can see the authors, and then I believe it also includes the abstract here. And then once it gets past the abstract, it cuts, right? It senses that there is a change in topic and moves on to the next chunk.

So yes, that is our dataset. And we'll come down here, and I'll just wait a moment for this to actually finish installing everything. Okay, so that is complete. We can also see the structure of that dataset: you have the features and the number of rows. Then I'm going to come down here, and we're going to rearrange that structure, essentially rebuilding it into a friendlier format for when we embed everything and place it in Pinecone.

So I want to keep the IDs for each of our chunks, so I keep that. And then for metadata, I want the title and the content. We're going to be using both of those later in our embeddings, so we'll need to keep both. And then I'm going to remove all the unneeded columns, so everything else.

So let's remove those. Just note, of course, that title and content are being removed here as columns, but we've copied them into the metadata column, so we're not actually losing them.
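Roughly, that restructuring looks like this, assuming the source columns are named id, title, and content:

```python
# Build a Pinecone-friendly record: keep the chunk ID and move
# title + content into a single metadata dict.
data = data.map(lambda x: {
    "id": x["id"],
    "metadata": {
        "title": x["title"],
        "content": x["content"],
    },
})

# Drop every other column, including the original title and content,
# which now live inside metadata.
data = data.remove_columns([
    col for col in data.column_names if col not in ["id", "metadata"]
])
```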

Cool, so next I'm going to go ahead and initialize an encoder model. We're going to be using this E5 model, and this particular E5 model actually has a longer context length. Typically they're quite small, around 512 tokens, and most of our chunks are actually below 512 tokens, but occasionally you might find one which is a little bit higher. So I do want to use this slightly larger context window version of E5.

So I'm going to run that. E5, for those of you who don't know, is a very good open-source embedding model. It's pretty reliable. It tends to perform well on benchmarks, and when you test it, it actually tends to perform well on real data as well, which is, to be honest, quite rare.

Usually, they perform well on benchmarks, and then you try them on actual data, and they are not so good. So the E5 models are usually a good bet when it comes to open source. Now I'm going to go ahead and create our embeddings. I'm going to create an embedding like this.
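Here's a minimal sketch of that, assuming semantic-router's HuggingFaceEncoder and a long-context E5 checkpoint; the exact class and model name may differ from the notebook:

```python
from semantic_router.encoders import HuggingFaceEncoder

# A long-context variant of E5; swap in whichever E5 checkpoint you prefer.
encoder = HuggingFaceEncoder(name="dwzhu/e5-base-4k")

# Quick test: embed one "title + content" style string and check the dimensionality.
embeds = encoder(["this is a test title\nthis is some test content"])
print(len(embeds[0]))  # 768 for base-sized E5 models
```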

So I'll just pass in a list of what will soon be our title and content. I'm just performing a quick test with it here. And you can also see the dimensionality by checking the length of what you just created here. So 768 dimensions. Cool. So we have that. Now I'm going to want to jump ahead and set up my Pinecone API key.

So to do that, I'm going to head on over to app.pinecone.io. I'm going to get API keys, and I'm just going to copy one here. And I'm just going to enter my API key when that little box pops up. Cool. And then I'm going to set up this to work with serverless.

Okay, so I have my serverless spec; initialize that. Now I'm going to go ahead and just check: do I already have this index? I've called this index groq-llama-3-rag, but call it whatever you want. I'm going to check whether it already exists, and if it does, then I'm going to skip ahead and just connect to it.

Otherwise, I'm going to create the index. Now, when I am initializing that, I need to make sure I'm using the metric that the E5 model is trained to use, which is cosine. I need to make sure dimensionality is the same, so this is 768. And yes, I have my serverless spec.
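Put together, the Pinecone setup is roughly this; the cloud and region in the serverless spec are assumptions, so use whichever your project supports:

```python
from getpass import getpass
import time
from pinecone import Pinecone, ServerlessSpec

# Paste the API key copied from app.pinecone.io when prompted.
pc = Pinecone(api_key=getpass("Pinecone API key: "))

spec = ServerlessSpec(cloud="aws", region="us-east-1")

index_name = "groq-llama-3-rag"

# Create the index only if it doesn't already exist.
if index_name not in pc.list_indexes().names():
    pc.create_index(
        index_name,
        dimension=768,    # must match the E5 embedding dimensionality
        metric="cosine",  # the metric E5 is trained for
        spec=spec,
    )
    # Wait until the index is ready to accept upserts.
    while not pc.describe_index(index_name).status["ready"]:
        time.sleep(1)

index = pc.Index(index_name)
index.describe_index_stats()
```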

So that will initialize everything for me. You can see that my total vector count is zero, which means I've just initialized it. So, cool. We can go ahead and actually add our vectors into Pinecone. I'm going to do that in batches of 128. And you can also see what I'm doing here.
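A sketch of that upsert loop; the exact separator used when joining title and content is an assumption:

```python
from tqdm.auto import tqdm

batch_size = 128  # bumped up later once we see how much GPU RAM is free

for i in tqdm(range(0, len(data), batch_size)):
    batch = data[i:i + batch_size]
    # Embed title and content together so each chunk carries paper-level context.
    chunks = [f'{m["title"]}: {m["content"]}' for m in batch["metadata"]]
    embeds = encoder(chunks)
    # Upsert as (id, vector, metadata) tuples.
    index.upsert(vectors=list(zip(batch["id"], embeds, batch["metadata"])))
```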

This is kind of important. Rather than just passing the content of those paragraphs to E5 and embedding those, I'm actually passing the title and the content. Now, the reason I'm doing that is that it essentially provides more context to the paragraph, right? So you could imagine, for example, with the Llama 2 or Llama 3 paper, the title contains "Llama 2".

This is the paper about Llama 2, right? We have that information from the title. But then in the middle of the paper, there might be a paragraph where it's talking about the performance of the model but doesn't explicitly say Llama 2 or Llama 3. So by concatenating both of those and embedding them together, we're providing more context to our embedding model, which in theory means we should be able to get better results from our search.

So that's why I'm doing that. I'm going to go ahead and run that. And yeah, that will take a moment to run; it shouldn't be too long. You can also check your resource usage here, especially the GPU RAM. And if this is very low, you can probably go ahead and just increase your batch size.

So I'm just going to check for a moment. It doesn't seem to be increasing beyond 2.6. So I'm going to go ahead and just increase this number, probably increase it quite a bit. Let's go to 384. And essentially what we're going to do is embed more chunks in one go.

Because we're running all of this locally on a GPU, we can increase that, and that should, in theory, reduce our waiting time as well. And you see the GPU RAM increasing after a moment. It didn't increase immediately; I believe, although I'm not 100% sure, that's simply because it's caching the initial tensors that we created.

And yeah, you can see it jumping up a lot now. So we'll need to keep an eye on that and make sure it doesn't go too high. If it does go too high, we'll probably want to reduce that batch size a little bit. But I think we should be good.

Okay, so that is complete. And let me close this. Okay, we can jump on down to... do we need to look at this? We can see this: it's just the metadata. So for every single record we have the content and the title, and we were just merging those.

But yeah, I mean, not super important, to be honest. So let me remove that, and we can go on to testing our retrieval. Okay, so I'm just going to wrap the retrieval component in this function called get_docs. It will take a query, and it's going to embed that using our encoder, the E5 model.

It's also going to consume a top_k parameter, which just allows us to control how many chunks we're going to be returning. Then I'm going to extract the metadata, or more specifically the content from that metadata, and return it. And my first query is just going to be, "Can you tell me about the Llama LLMs?" Okay, so we run that, and we'll see.
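For reference, a sketch of that get_docs function and the first query; the parameter names here are illustrative:

```python
def get_docs(query: str, top_k: int) -> list[str]:
    # Embed the query with the same E5 encoder used at indexing time.
    xq = encoder([query])[0]
    # Search Pinecone and return just the chunk content from the metadata.
    res = index.query(vector=xq, top_k=top_k, include_metadata=True)
    return [match["metadata"]["content"] for match in res["matches"]]

query = "can you tell me about the Llama LLMs?"
docs = get_docs(query, top_k=5)
print("\n---\n".join(docs))
```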

So we get a variety of different papers here. We have Code Llama, some Chinese-oriented Llama models, and you can see there's some Llama stuff in there: the MMLU benchmark, which probably mentions something about Llama, the original Llama model, and something else about Llama here. So plenty of Llama results there, which is good.

Now, the next thing we want to do is pair that with the Groq API, and specifically Llama 3, the 70-billion-parameter version, behind the Groq API. To use this, we need to go ahead and actually get access to the Groq API, so let's see how we do that.

So I'll come over here, and let's look for the Groq API. I'm just going to go to Get Started, at console.groq.com. I think that was correct. And then, yes, we have API Keys here. So I'm going to log in, go to my API keys, and just go ahead and create an API key.

I'm going to call it groq-demo. And once I have it, I'm going to come over here and just enter it when I get this little pop-up. Okay, and that's it. I'm now authenticated with the Groq API. It was pretty easy.
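The connection itself is just a couple of lines; in this sketch the key is read interactively rather than hard-coded:

```python
from getpass import getpass
from groq import Groq

# Paste the key created at console.groq.com when prompted.
groq_client = Groq(api_key=getpass("Groq API key: "))
```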

Now I'm going to create another function called generate. This is just going to consume my original query plus all of the documents I retrieved with get_docs, put them together, pass them to Llama 3 via Groq, and generate a response. Here, we're going to be using the llama3-70b model, the 70-billion-parameter model, with, I think, a context window of 8,192 tokens.
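A minimal sketch of that generate function; the system prompt wording and the way the documents are joined are assumptions, not necessarily what the notebook uses:

```python
def generate(query: str, docs: list[str]) -> str:
    # Stuff the retrieved chunks into the system message as context for the model.
    system_message = (
        "You are a helpful assistant that answers questions about AI using the "
        "context provided below.\n\n"
        "CONTEXT:\n" + "\n---\n".join(docs)
    )
    chat_response = groq_client.chat.completions.create(
        model="llama3-70b-8192",  # Llama 3 70B on Groq, 8,192-token context window
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": query},
        ],
    )
    return chat_response.choices[0].message.content

out = generate(query=query, docs=docs)
print(out)
```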

So run that, and then let's see what we get. Okay. So, yeah, it looks kind of good. We have, okay, the Llama LLMs refers to a series of large language models developed by various researchers and teams, so on and so on. Right? So pretty straightforward. And if we want to ask more questions, we can.

So we would just add this, take this as well, bring it down here, and ask another question. So, I don't know: can you tell me about the mixture of experts paper? Okay, so I could ask about that.

And again, we'll get some output. And just see how quick that is; it's so fast. That was retrieval and the generate component as well, which is just kind of insane, to be honest. So, yeah, we get this. The output is even nicely formatted; we have some markdown here, which is cool.

And generally speaking, I think it looks pretty accurate. I haven't been through it all, but it looks reasonable. So, yeah, that is the Groq API and Llama 3. Once you are done, of course, you can go through and ask more questions. And when you're finished, you can delete your index from Pinecone, so that you save resources.
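That cleanup is a one-liner, using the client and index name from earlier:

```python
# Delete the serverless index once you're finished to free resources.
pc.delete_index(index_name)
```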

So I'm going to go ahead and do that. And, yeah, that is it for this little walkthrough. We've seen how we use the Groq API for RAG with Llama 3. And as you've seen, it's just insanely fast, even with the 70-billion-parameter model. I really want to point out how insane that sort of response speed was.

That's the sort of response speed that you'd expect from something like GPT-3.5; it's just so quick. But here we're using Llama 3 70B, which, yeah, is kind of nuts. So, very interesting. I think if you start pairing Groq with things like agent flows, that's a huge deal. Agents are probably the hardest thing to get working nicely and responding quickly: you have users talking with them, and the agent needs to decide what tool it's going to use, retrieve that information, maybe use another tool, come back, and then answer.

That takes time, especially with larger LLMs, and especially with open-source LLMs. So Groq has made that much less of an issue, because, as you've seen, we can just use Llama 3 70B, a huge model, and it's super fast. Even as an agent, I think we're going to get pretty good response times.

So, it's very interesting. I think it's a really cool service. And, yeah, it makes using open-source LLMs a lot easier as well, which is nice. So, that's it for this video. I hope this has all been useful and interesting. But for now, I'll leave it there. Thank you very much for watching, and I will see you again in the next one.

Bye.