RAG with Mistral AI!
Chapters
0:00 RAG with Mistral and Pinecone
1:07 Mistral API in Python
1:44 Setting up Vector DB
3:14 Mistral Embeddings
4:12 Creating Pinecone Index
8:24 RAG with Mistral
11:20 Final Thoughts on Mistral
Today we are going to be taking a look at using the Mistral API for RAG, with their Mistral Embed model and the Mistral Large LLM. Mistral is a pretty interesting AI company. Their approach to releasing a lot of their models has been to open source them for the most part and then provide an API service. Just the fact that they open sourced the models in the first place is kind of cool and different, and the models themselves are really good. For example, Mistral 7B, which they released a while ago now: running it locally was, one, possible and, two, you could actually use it for a lot of things that you could not do with any other open source model at the time. That was really cool and very impressive. Now they also have the API, which makes using their models much easier again. So let's jump into it.

We're going to be going through this example here. It's from Pinecone: examples, integrations, Mistral AI, and then the Mistral RAG notebook. I'm going to go ahead and open it in Colab.
Okay, cool. First I'm going to install the prerequisites. We have Hugging Face datasets for our data, the Mistral AI client, which is of course for hitting their APIs, and the Pinecone client so that we can store all of our embeddings and retrieve them for RAG. That will take a moment to install.
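For reference, the install cell looks roughly like this; I'm not pinning versions here, and the Pinecone package has since been renamed from `pinecone-client` to `pinecone`, so adjust as needed:

```python
# Rough equivalent of the notebook's install cell (versions not pinned here)
!pip install -qU \
    datasets \
    mistralai \
    pinecone-client
```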
After they have installed, I'm going to download the dataset from Hugging Face. This is the AI ArXiv2 semantic chunks dataset, which I use all the time. Essentially it is a load of arXiv papers, semantically chunked; that is kind of all it is. We actually have 10,000 chunks, not 200,000, so ignore the 200K you see there.
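Loading it looks roughly like this; the exact Hugging Face dataset ID is my assumption of the one used in the notebook:

```python
from datasets import load_dataset

# Dataset ID is an assumption: an "AI ArXiv2 semantic chunks" dataset on
# Hugging Face, taking the first 10,000 pre-chunked records.
data = load_dataset(
    "jamescalam/ai-arxiv2-semantic-chunks",
    split="train[:10000]",
)
print(data)
print(data[0])  # inspect the first chunk
```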
The records look like this. You can see that the first chunk is from the Mixture of Experts paper, and it includes the title, the authors, and the abstract of the paper before the semantic chunking algorithm moves on to the next chunk.
Now I'm going to restructure what we have here into something simpler, which is what we need for Pinecone. I'm going to keep the ID for each of our chunks, which I do need, and I'm going to move the title and content over into a metadata field, because we're going to be adding both of those into Pinecone. Everything else we just remove. So now we're left with two features, ID and metadata, and within metadata we of course have the title and the content.
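A minimal sketch of that restructuring, assuming the dataset exposes `id`, `title`, and `content` columns:

```python
# Keep only what Pinecone needs: an ID plus a metadata dict holding the
# title and chunk content.
data = data.map(lambda x: {
    "id": x["id"],
    "metadata": {
        "title": x["title"],
        "content": x["content"],
    },
})
# Drop every original column except the two we just built.
data = data.remove_columns([
    c for c in data.column_names if c not in ("id", "metadata")
])
```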
Now I want to initialize my connection to Mistral, because we're going to be using Mistral for the embeddings. So I go to the Mistral AI console, under API keys, and create a new key. Let's call it "mistral-demo"; the expiration doesn't matter. Once you have your API key, you come back to the notebook, run the cell, and enter the key there. Now we can create our embeddings. The embedding model is Mistral Embed, which I think is the only embedding model they offer at the moment, and we call the Mistral embeddings endpoint with the model and the input. That's it. We can also check the dimensionality of the model, which is 1024.
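A sketch of those steps, using the `mistralai` client API as it was around the time of this walkthrough (a `MistralClient` with an `embeddings` method); newer client versions expose this differently:

```python
import os
from getpass import getpass
from mistralai.client import MistralClient

# Paste your key from the Mistral AI console (API keys page)
mistral_api_key = os.getenv("MISTRAL_API_KEY") or getpass("Mistral API key: ")
mistral = MistralClient(api_key=mistral_api_key)

embed_model = "mistral-embed"
res = mistral.embeddings(model=embed_model, input=["some text to embed"])

# Dimensionality of mistral-embed, used later to create the Pinecone index
dims = len(res.data[0].embedding)
print(dims)  # 1024
```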
We need that value because we're going to use it to initialize our Pinecone index. For Pinecone we also need an API key, so I'm going to go over there, copy it, and pull it in. Now I have both Mistral and Pinecone set up, and we can initialize an index. I'm going to be using serverless, of course, and I'm going to call the index mistral-rag; call it whatever you like, it's up to you. I check whether it already exists, and if it does not, I create it. When creating it, I need to pass in the dimensionality of the Mistral Embed model, the metric that the model has been trained to use, which is dot product, and the index specification, which is the serverless spec that I initialized here. Then we initialize that and connect to the index once it's ready.
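Index creation with the serverless spec looks roughly like this; the cloud and region here are assumptions, so use whatever your project runs on:

```python
import os
import time
from getpass import getpass
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY") or getpass("Pinecone API key: "))

index_name = "mistral-rag"
# cloud/region are assumptions; pick what your project supports
spec = ServerlessSpec(cloud="aws", region="us-east-1")

if index_name not in pc.list_indexes().names():
    pc.create_index(
        index_name,
        dimension=1024,       # dimensionality of mistral-embed
        metric="dotproduct",  # metric the embedding model was trained with
        spec=spec,
    )
    # wait until the index is ready before connecting
    while not pc.describe_index(index_name).status["ready"]:
        time.sleep(1)

index = pc.Index(index_name)
index.describe_index_stats()  # empty at this point
```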
My index is currently empty, because I've just initialized it, so we will need to go ahead and actually put everything in there. First I'm going to define an embedding function, because the first time I ran through this I realized that if you go over a certain token limit per request, which was around 16,000 tokens at the time (that may have increased by now; it's been more than a month since I last checked), you're going to see a Mistral API exception, and you don't want that. All I'm doing here is trying to embed, and if we receive that exception, we cut the batch in half. That's all this embedding function does; it'll just handle that for us.
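Here's my sketch of that helper, assuming the exception raised is the client's `MistralAPIException`; the notebook's version may differ in the details:

```python
from mistralai.exceptions import MistralAPIException

def embed(texts: list[str]) -> list[list[float]]:
    """Embed a batch, recursively splitting it in half if the request
    exceeds the API's token limit."""
    try:
        res = mistral.embeddings(model=embed_model, input=texts)
        return [r.embedding for r in res.data]
    except MistralAPIException:
        if len(texts) == 1:
            raise  # a single chunk is too large; nothing left to split
        mid = len(texts) // 2
        return embed(texts[:mid]) + embed(texts[mid:])
```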
We define that, and now I'm going to define the usual embedding for loop. I'm going to start with a batch size of 32; of course, if we hit that limit, the batch will be halved, and halved again if needed. Then we go through, embed everything, and add everything to Pinecone, and that's kind of it. One thing that is worth noting here, which I didn't mention, is that for the input to our embeddings, the text that we're embedding, we're actually including both the title and the content. The reason we do that is this: imagine we're halfway through the Mixture of Experts paper and the authors are describing how they red teamed the model. The chunk might say "we red teamed the LLM and we found these results" without saying anything about Mixture of Experts at all. By adding the title into every embedding, we're adding far more context to our embeddings, which means that when we later search for Mixture of Experts red teaming information, it will at least in theory be able to find that chunk for us. If we don't include the title, it will not be able to find it. So this concatenation of title and content can be really useful, and if you have them, headings and subheadings are also really good things to include.
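Putting that together, the indexing loop looks roughly like this; note the title plus content concatenation for the embedded text:

```python
from tqdm.auto import tqdm

batch_size = 32

for i in tqdm(range(0, len(data), batch_size)):
    batch = data[i:i + batch_size]   # slicing a Dataset returns a dict of lists
    ids = batch["id"]
    metadata = batch["metadata"]
    # embed title + content together to give each vector more context;
    # if the batch exceeds the token limit, embed() splits it in half internally
    texts = [f'{m["title"]}\n{m["content"]}' for m in metadata]
    embeds = embed(texts)
    # upsert (id, vector, metadata) tuples into the index
    index.upsert(vectors=list(zip(ids, embeds, metadata)))
```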
Anyway, we can run that. Depending on when you're running this, it may hit that API exception; you'll see it printed below the progress bar, but then it will halve the batch size and continue, so you don't need to worry about it. I'm going to skip ahead to when this is complete. That is complete now, and we can move on to testing everything.
First, retrieval. I'm going to define a get_docs function. It takes a query, which is a question, and a top_k value, which defines how many results we'd like returned. I embed the query with Mistral and assign the embedding to xq, which is just shorthand for query vector. Then I pass the query vector to Pinecone, do the search, include the metadata, and return what we need from that metadata.
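A sketch of that retrieval function, assuming the chunk text is stored under a content key in the metadata:

```python
def get_docs(query: str, top_k: int) -> list[str]:
    # embed the query with the same model used for the documents
    xq = mistral.embeddings(model=embed_model, input=[query]).data[0].embedding
    # query Pinecone for the top_k most similar chunks, with their metadata
    res = index.query(vector=xq, top_k=top_k, include_metadata=True)
    return [match["metadata"]["content"] for match in res["matches"]]
```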
Then I'm going to ask: can you tell me about the Mistral LLM? We'll see what results we get from that. This is just the retrieval component of RAG, not the generation component. At the top I think we have the abstract from Mistral, or is it maybe Llama? Okay, no, that one is some references from a few papers. Then we have this one: "Mistral 7B outperforms the previous best 13 billion parameter model, Llama 2", so it's talking about exactly this. Then I think we have the Llama 7B abstract and some other bits in here, but again mentioning Mistral. So it seems pretty relevant, and we can move on to the generation component.
component now then. And for generation, we're going to be passing in the query that we already 00:10:07.440 |
asked. And we're going to be passing in the list of documents that we retrieved from that getDocs 00:10:14.800 |
function. So, this docs here. And then what we do is we basically format all of them into a system 00:10:24.240 |
message. Well, pass the system message into the system message of a conversation. Then we pass 00:10:31.600 |
our query into the user message for the conversation. And then what we're going to do is 00:10:37.360 |
generate a response using the Mistral large model. Okay? And the latest version of that. So, we run 00:10:45.280 |
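My sketch of that generation step, again using the older `mistralai` chat interface (`client.chat` with `ChatMessage` objects), with an illustrative system prompt and top_k:

```python
from mistralai.models.chat_completion import ChatMessage

def generate(query: str, docs: list[str]) -> str:
    # pack the retrieved chunks into the system message as context
    system_message = (
        "You are a helpful assistant that answers questions about AI "
        "using the context provided below.\n\n"
        "CONTEXT:\n" + "\n---\n".join(docs)
    )
    messages = [
        ChatMessage(role="system", content=system_message),
        ChatMessage(role="user", content=query),
    ]
    res = mistral.chat(model="mistral-large-latest", messages=messages)
    return res.choices[0].message.content

query = "can you tell me about the Mistral LLM?"
docs = get_docs(query, top_k=5)  # top_k=5 is an arbitrary choice here
print(generate(query=query, docs=docs))
```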
We run that, generate, and see what we get. It looks pretty good: Mistral 7B is a 7 billion parameter language model, grouped query attention, fast inference, and so on. Looks relevant to me. So that is how we would use the Mistral API. You can of course ask more questions if you like, but once you are done, you can go ahead and delete your Pinecone index to save resources, and with that we're done.
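For reference, the cleanup is a single call on the Pinecone client:

```python
pc.delete_index(index_name)  # removes the serverless index and all stored vectors
```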
So that was this walkthrough of using the Mistral API for RAG. As you can see, they have pretty much everything you would need: the embedding model and the LLM, and they have quite a few different LLMs, so you can mix and match depending on what you need in terms of latency and accuracy. Of course, you don't always need the biggest model, though sometimes you do, so it depends. But it's a nice option beyond Anthropic and OpenAI, which I like. That's it for this video. I hope all of this has been useful and interesting. I'll leave it there for now, so thank you very much for watching, and I will see you again in the next one. Bye.