Superfast RAG with Llama 3 and Groq
Chapters
0:00 Groq and Llama 3 for RAG
0:37 Llama 3 in Python
4:25 Initializing e5 for Embeddings
5:56 Using Pinecone for RAG
7:24 Why We Concatenate Title and Content
10:15 Testing RAG Retrieval Performance
11:28 Initialize connection to Groq API
12:24 Generating RAG Answers with Llama 3 70B
14:37 Final Points on Why Groq Matters
00:00:00.000 |
Today, we are going to be taking a look at using the Groq API with Llama 3 for RAG. 00:00:07.380 |
Now, the Groq API, for those of you that haven't heard of it, 00:00:11.040 |
is an API that gives us access to what Groq, the company, calls an LPU, a Language Processing Unit. 00:00:19.340 |
An LPU is essentially specialized hardware that allows us to run LLMs very, very quickly. 00:00:27.560 |
So you'll see that the token throughput, when you're calling even big LLMs, 00:00:32.700 |
like we will today, through this API is insanely fast. 00:00:37.920 |
So to find the code, we're going to go to the Pinecone examples repo, 00:00:42.020 |
and we're going to go to examples, integrations, groq, 00:00:49.200 |
I'm going to go ahead and just open this directly in Colab. 00:00:52.320 |
Okay, and here we are. I'm going to go ahead and connect. 00:00:56.420 |
I'm going to make sure I'm connected to a GPU 00:00:58.420 |
because we're actually going to be using a local embedding model for our embeddings. 00:01:05.460 |
So it's just quicker. It's not necessary, but it is quicker if we use GPU there. 00:01:11.100 |
Then, what I'm going to do is go ahead and install the prerequisite libraries. 00:01:19.320 |
We have Hugging Face datasets, which is where we're going to be getting our data from. 00:01:24.040 |
We have groq, which is the client for the Groq API and the LPU that I mentioned. 00:01:31.280 |
We have the library that we're going to be pulling our encoder, or embedding model, from. 00:01:37.720 |
And we have the Pinecone client, which is, of course, where we're going to be storing our embeddings. 00:01:41.020 |
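As a rough sketch, the install cell looks something like the following; the exact package list is an assumption (in particular, sentence-transformers for the encoder), so match it to whatever the notebook actually installs:

```python
# Colab install cell (package names are assumptions based on the walkthrough)
!pip install -qU \
    datasets \
    groq \
    sentence-transformers \
    pinecone-client
```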
So installing these, that might take a moment, 00:01:45.580 |
but once it's done, we're going to go ahead and download 00:01:48.860 |
just part of the dataset, not the full thing. 00:01:51.840 |
So 10,000 rows from this ai-arxiv2-semantic-chunks dataset. 00:01:58.860 |
Essentially, it is a set of AI arXiv papers that have already been chunked. 00:02:11.020 |
The chunk size varies because we're chunking semantically here, 00:02:14.100 |
so it's essentially looking for where the topic within those papers changes. 00:02:21.120 |
So I'm going to run that, we'll download those, 00:02:24.140 |
and we'll just have a quick look at what that looks like. 00:02:27.820 |
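A minimal sketch of that download step, assuming the dataset lives on Hugging Face under an ID like the one below (treat the exact ID as an assumption):

```python
from datasets import load_dataset

# Pull only the first 10,000 rows rather than the full dataset
data = load_dataset(
    "jamescalam/ai-arxiv2-semantic-chunks",  # dataset ID is an assumption
    split="train[:10000]",
)
data
```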
So one of the papers we have in there is the Mixture of Experts paper, 00:02:36.500 |
And you can see that this first chunk is basically, 00:02:39.780 |
well, it's like the introduction to the paper. 00:02:42.220 |
So you can see the title of the paper, you can see the authors, 00:02:46.960 |
and then I believe it also includes the abstract here. 00:02:52.220 |
And then once it gets past the abstract, it cuts, right? 00:02:56.220 |
It senses that there is a change in topic and moves on to the next chunk. 00:03:09.540 |
Now I'll just wait a moment for everything to actually finish installing. 00:03:17.700 |
We can also see the structure of that dataset there, 00:03:27.300 |
and we're going to just rearrange that structure 00:03:36.780 |
for when we're going to be embedding everything and placing it into Pinecone. 00:03:42.080 |
So I want to keep the IDs for each of our chunks. 00:03:49.460 |
And then for metadata, I want the title and the content. 00:03:52.780 |
And we're going to be using both of those later in our embeddings. 00:04:00.860 |
And then I'm going to remove all the unneeded columns. 00:04:09.280 |
Just note, of course, that title and content are being removed here, 00:04:12.700 |
but we've already moved them into the metadata column, 00:04:20.020 |
so we're only removing the original versions of title and content. 00:04:25.900 |
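Roughly, that reshaping step looks like this; the raw column names ("id", "title", "content") are assumptions, so adjust them to whatever the dataset actually uses:

```python
# Build the structure Pinecone expects: an ID plus a metadata dict
data = data.map(lambda x: {
    "id": x["id"],
    "metadata": {
        "title": x["title"],      # column names are assumptions
        "content": x["content"],
    },
})
# Drop every column except the two we just built
data = data.remove_columns([
    col for col in data.column_names if col not in ("id", "metadata")
])
```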
Cool, so I'm going to go ahead and just initialize an encoder model. 00:04:34.180 |
We're using E5, and this E5 model actually has a longer context length. 00:04:37.720 |
Typically, E5 context windows are quite small, around 512 tokens, 00:04:42.380 |
and most of our chunks are actually below 512 tokens. 00:04:49.380 |
But occasionally, you might find one which is a little bit higher. 00:04:52.540 |
So I do want to use this slightly larger context window version of E5. 00:05:02.820 |
E5 is also just a very good open source embedding model. 00:05:12.180 |
And when you test it, it actually tends to perform well on real data as well, 00:05:21.580 |
unlike a lot of models that look great on benchmarks, and then you try them on actual data, and they are not so good. 00:05:26.080 |
So the E5 models are usually a good bet when it comes to open source. 00:05:31.780 |
So I'm going to go ahead and create our embeddings. 00:05:37.100 |
So I'll just pass in a list of what will soon be our title and content; 00:05:43.180 |
I'm just performing a quick test with it here. 00:05:48.860 |
You can check the embedding dimensionality by checking the length of what we just created here. 00:05:58.540 |
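A sketch of that encoder setup, using sentence-transformers and a base-sized E5 checkpoint as a stand-in (the video uses a longer-context E5 variant, so treat the model ID as an assumption):

```python
from sentence_transformers import SentenceTransformer

# Load E5 onto the GPU; base-sized E5 checkpoints produce 768-dim vectors
encoder = SentenceTransformer("intfloat/e5-base-v2", device="cuda")

# Quick test: embed one "title: content" style string and check the dimensionality
test_embedding = encoder.encode(["some title: some chunk content"])
print(len(test_embedding[0]))  # 768
```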
Now I'm going to want to jump ahead and set up my Pinecone API key. 00:06:03.980 |
So to do that, I'm going to head on over to app.pinecone.io. 00:06:08.600 |
I'm going to get API keys, and I'm just going to copy one here. 00:06:14.900 |
And I'm just going to enter my API key when that little box pops up. 00:06:19.120 |
Cool. And then I'm going to set this up to work with serverless. 00:06:19.120 |
Okay, so I have my serverless spec, initialize that. 00:06:28.580 |
Now I'm going to go ahead and just check, do I already have this index? 00:06:32.620 |
So I've called this index groq-llama-3-rag, but call it whatever you want. 00:06:32.620 |
But I'm going to check, does that already exist? 00:06:41.020 |
If it does already exist, then I'm going to skip ahead and just connect to it. 00:06:51.540 |
Otherwise, when creating it, I need to make sure I'm using the metric that the E5 model is trained to use, which is cosine, 00:06:57.180 |
and I need to make sure the dimensionality is the same, so this is 768. 00:07:06.620 |
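Putting that together with the current Pinecone client, the index setup looks roughly like this (the cloud and region values are assumptions):

```python
import os
from pinecone import Pinecone, ServerlessSpec

# Client initialized with the API key entered above
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Serverless spec; cloud/region here are assumptions
spec = ServerlessSpec(cloud="aws", region="us-east-1")

index_name = "groq-llama-3-rag"

# Create the index only if it doesn't already exist
if index_name not in pc.list_indexes().names():
    pc.create_index(
        index_name,
        dimension=768,    # must match the E5 embedding size
        metric="cosine",  # the metric E5 is trained for
        spec=spec,
    )

index = pc.Index(index_name)
index.describe_index_stats()
```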
You can see that my total vector count is zero 00:07:14.860 |
because we haven't added anything yet, so we can go ahead and actually add our vectors into Pinecone. 00:07:30.060 |
So rather than just passing the content of those paragraphs to E5, 00:07:35.900 |
I'm actually passing the title and the content. 00:07:39.160 |
Now, the reason I'm doing that is that it essentially provides more context for each chunk. 00:07:47.140 |
So you could imagine, for example, with the Llama 3 paper or Llama 2 paper, 00:08:05.220 |
there might be a paragraph where it's talking about the performance of the model, 00:08:10.060 |
but it doesn't explicitly say Llama 2 or Llama 3. 00:08:21.340 |
By prepending the title, we're providing more context to our embedding model, 00:08:24.540 |
which in theory means we should be able to get better retrieval results. 00:08:47.740 |
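As a sketch, the embed-and-upsert loop might look like the following; the "title: content" separator and the batch size value are assumptions, and the batch size is the knob discussed next:

```python
from tqdm.auto import tqdm

batch_size = 128  # increase this if your GPU has memory to spare

for i in tqdm(range(0, len(data), batch_size)):
    batch = data[i:i + batch_size]  # slicing a Dataset returns a dict of lists
    # Concatenate title and content so each embedding carries the paper's context
    texts = [
        f"{meta['title']}: {meta['content']}" for meta in batch["metadata"]
    ]
    embeds = encoder.encode(texts)
    # Upsert (id, vector, metadata) tuples into the Pinecone index
    index.upsert(vectors=list(zip(batch["id"], embeds.tolist(), batch["metadata"])))
```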
If you're running on a GPU, you can probably go ahead and just increase your batch size. 00:08:57.980 |
So I'm going to go ahead and just increase this number, 00:09:05.280 |
and essentially what we're going to do is embed more chunks in one go, 00:09:10.540 |
because we're running all this locally on a GPU, 00:09:16.260 |
and that should, in theory, reduce our waiting time as well. 00:09:21.620 |
And you'll see the GPU RAM increasing after a moment; 00:09:31.640 |
that is simply because it's caching the initial tensors that we created. 00:09:36.140 |
And yeah, you can see it jumping up a lot now. 00:09:39.100 |
So we'll need to keep an eye on that and make sure it doesn't go too high. 00:09:41.820 |
If it does go too high, we'll probably want to reduce the batch size again. 00:10:00.840 |
We can see this here: so this is just the metadata, 00:10:03.860 |
and for every single record, we have the content and the title. 00:10:10.260 |
But yeah, I mean, not super important, to be honest. 00:10:19.500 |
Okay, so I'm just going to wrap the retrieval component in a function that takes a query. 00:10:28.640 |
It's going to embed that using our encoder, the E5 model. 00:10:33.540 |
And it's also going to consume a top_k parameter, 00:10:37.900 |
which just allows us to control how much information we're going to be returning. 00:10:44.900 |
Then it returns the metadata for each match, or more specifically, the content from that metadata. 00:11:02.100 |
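A minimal sketch of that retrieval wrapper; the function name and the example query here are just illustrative:

```python
def get_docs(query: str, top_k: int) -> list[str]:
    # Embed the query with the same E5 encoder used at index time
    xq = encoder.encode([query])[0].tolist()
    # Fetch the top_k most similar chunks from Pinecone
    res = index.query(vector=xq, top_k=top_k, include_metadata=True)
    # Return just the content field from each match's metadata
    return [match["metadata"]["content"] for match in res["matches"]]

docs = get_docs("can you tell me about llama?", top_k=5)
```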
So we get a variety of different papers here. 00:11:06.020 |
So we have Code Llama and some Chinese-oriented models, 00:11:11.140 |
and you can see there's some Llama stuff in there: 00:11:13.940 |
the MMLU benchmark, which probably mentions something about Llama, 00:11:19.140 |
the original Llama model, and something else about Llama here. 00:11:23.720 |
So plenty of Llama results there, which is good. 00:11:28.340 |
Now, the next thing we want to do is pair that with the Groq API, 00:11:37.580 |
specifically Llama 3, the 70 billion parameter version, behind the Groq API. 00:11:41.220 |
So to use this, we need to go ahead and actually get access to the Groq API. 00:11:53.860 |
I'm just going to go to get started, at console.groq.com. 00:11:56.900 |
I think that was correct. And then, yes, we have API keys here. 00:12:03.460 |
And I'll just go ahead and create an API key. 00:12:09.940 |
And once I have it, I'm going to come over to here, 00:12:12.100 |
and I'm just going to enter it when I get this little pop-up. 00:12:24.540 |
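Initializing the Groq client is then just a couple of lines (a sketch, assuming the key is entered via getpass):

```python
import os
from getpass import getpass
from groq import Groq

# Prompt for the Groq API key (the little pop-up mentioned above)
os.environ["GROQ_API_KEY"] = os.getenv("GROQ_API_KEY") or getpass("Groq API key: ")

groq_client = Groq(api_key=os.environ["GROQ_API_KEY"])
```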
And now I'm going to create another function called generate. 00:12:30.700 |
Well, it's going to consume my original query, 00:12:33.020 |
plus all of the documents I retrieved with the retrieve function, 00:12:38.500 |
put them together, pass them to Llama 3 via Groq, and return an answer. 00:12:45.680 |
So here, we're going to be using the Llama 3 70B model, 00:12:53.260 |
with, I think, a context window of 8,192 tokens. 00:13:04.500 |
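A sketch of that generate function, using Groq's OpenAI-style chat completions endpoint with the llama3-70b-8192 model; the prompt template is an assumption:

```python
def generate(query: str, docs: list[str]) -> str:
    # Stuff the retrieved chunks into the system prompt as context
    system_message = (
        "You are a helpful assistant that answers questions about AI "
        "using the context provided below.\n\n"
        "CONTEXT:\n" + "\n---\n".join(docs)
    )
    chat = groq_client.chat.completions.create(
        model="llama3-70b-8192",  # Llama 3 70B on Groq, 8,192-token context
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": query},
        ],
    )
    return chat.choices[0].message.content

query = "can you tell me about llama?"
print(generate(query=query, docs=get_docs(query, top_k=5)))
```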
We have, okay, the Llama LLMs refers to a series of large language models 00:13:09.580 |
developed by various researchers and teams, so on and so on. 00:13:16.020 |
And if we want to ask more questions, we can. 00:13:31.140 |
Can you tell me about the mixture of experts paper? 00:13:34.820 |
Right? So tell me about the mixture of experts paper. 00:13:49.560 |
And just see how quick, like, that's so fast. 00:14:09.900 |
And generally speaking, I think it looks pretty accurate. 00:14:14.320 |
Although I haven't been through it in detail, it looks reasonable. 00:14:21.880 |
Once you are done, of course, you can go through and ask more questions. 00:14:25.640 |
And when you are finished, you can delete your index from Pinecone to save resources. 00:14:33.480 |
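Cleanup is a single call with the client from earlier:

```python
# Delete the serverless index once you're finished to free resources
pc.delete_index(index_name)
```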
And, yeah, that is it for this little walkthrough. 00:14:59.480 |
That's the sort of response speed that you'd expect from a much smaller model, not a 70-billion-parameter one. 00:15:15.440 |
I think if you start pairing Groq with things like agent flows, that's where it gets really interesting. 00:15:23.520 |
Because agents are probably the hardest thing to keep feeling responsive: 00:15:31.000 |
when you have users sort of talking with them, 00:15:34.460 |
the agent needs to decide what to do, 00:15:40.800 |
retrieve that information, use another tool, and so on. 00:15:44.520 |
That takes time, especially with larger LLMs, 00:15:51.920 |
so Groq has kind of made that not so much of an issue. 00:15:59.640 |
Because, yeah, we just used Llama 3 70B, a huge model, 00:16:08.520 |
and I think we're going to get pretty good response times. 00:16:15.960 |
And, yeah, it makes using open source LLMs a lot easier as well. 00:16:23.280 |
I hope this has all been useful and interesting. 00:16:27.960 |
Thank you very much for watching, and I will see you again in the next one.