
RAG with Mistral AI!


Chapters

0:00 RAG with Mistral and Pinecone
1:07 Mistral API in Python
1:44 Setting up Vector DB
3:14 Mistral Embeddings
4:12 Creating Pinecone Index
8:24 RAG with Mistral
11:20 Final Thoughts on Mistral


00:00:00.000 | Today we are going to be taking a look at using the Mistral API for RAG with their Mistral embed
00:00:08.400 | model and also the Mistral Large LLM. Now Mistral AI is a pretty interesting AI company. Their approach
00:00:18.000 | to releasing a lot of their models has been to open source them for the most part and then
00:00:24.560 | provide an API service. So just the fact that they open sourced the models in the first place is
00:00:32.000 | kind of cool and different and the models themselves are really good. So for example,
00:00:39.200 | the original Mistral model, which they released a while ago now: using it locally was, one, possible and, two,
00:00:48.720 | you could actually use it for a lot of different things that you could not do with any other open
00:00:54.800 | source models at the time. So that was really cool, very impressive. And now they also have
00:01:01.680 | the API, which you can use and it makes using their models much easier again. So let's jump
00:01:08.000 | into it. Now we're going to be going through this example here. So this is from Pinecone,
00:01:12.080 | examples, integrations, Mistral AI, and then we're going to Mistral RAG. I'm going to go ahead and
00:01:18.720 | just open this in Colab. Okay, cool. So I'm going to go ahead and now just install the prerequisites.
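For reference, the install cell boils down to something like this sketch (the packages follow the walkthrough's description; any version pins from the notebook are omitted):

```python
# Hugging Face datasets for the data, the Mistral client for embeddings/chat,
# and the Pinecone client for the vector database
!pip install -qU datasets mistralai pinecone-client
```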
00:01:25.200 | So I'll just run this. So we have Hugging Face Datasets for our data, the Mistral AI client,
00:01:32.960 | which is of course to hit their APIs, and then also Pinecone again for the API so that we can
00:01:40.160 | go and store all of our embeddings and retrieve them for RAG. So that will take a moment to
00:01:46.480 | install. And then after they have installed, I'm going to go ahead and just download this dataset
00:01:52.880 | from Hugging Face. So this is the AI arXiv 2 semantic chunks dataset. I use this all the time.
00:01:58.400 | Essentially, it's a load of arXiv papers, semantically chunked. That is kind of all it is. Okay,
00:02:07.200 | cool. So we actually have 10,000 chunks, not 200,000. Yes, 10,000. So ignore that 200K there.
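As a rough sketch, loading that dataset looks something like this (the Hugging Face repo id is my assumption based on the name given in the walkthrough):

```python
from datasets import load_dataset

# assumed repo id for the "AI arXiv 2 semantic chunks" dataset;
# take the first 10,000 chunks as described in the video
data = load_dataset("jamescalam/ai-arxiv2-semantic-chunks", split="train[:10000]")
print(data)
```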
00:02:17.920 | And they look like this. So you can see you actually have the Mixture of Experts paper here,
00:02:23.280 | first chunk, which includes the title, the authors,
00:02:29.280 | and also the abstract of the paper here before moving on to the next chunk based on the semantic
00:02:36.240 | chunking algorithm. Okay, I'm going to go ahead and restructure this, basically what we have
00:02:44.160 | here into something simpler, which we need for Pinecone. So I'm going to include the ID for each
00:02:49.520 | of our chunks, which I do need. And I'm going to move the title and content over into the metadata
00:02:55.760 | field because we're going to be adding both those into Pinecone. So yeah, we do that. And then
00:03:01.440 | everything else you just remove. Okay, so now we're just left with these two features, ID and
00:03:06.720 | metadata. Within metadata, we of course have the title and the content. Cool. So we have this.
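The restructuring just described might look roughly like this sketch, assuming the dataset's original columns include id, title, and content:

```python
# build the structure Pinecone expects: an id plus a metadata dict
data = data.map(lambda x: {
    "id": x["id"],
    "metadata": {"title": x["title"], "content": x["content"]},
})
# drop every other column so we're left with just id and metadata
data = data.remove_columns([
    col for col in data.column_names if col not in ["id", "metadata"]
])
```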
00:03:14.080 | And now I just want to go ahead and initialize my connection to Mistral because we're going to be
00:03:19.280 | using Mistral for the embeddings, right? So we need that. So I'm going to go here, console Mistral AI
00:03:26.400 | API keys. Okay, cool. I will create a new key. What do we call it? Let's call it just Mistral
00:03:33.600 | demo. Expiration doesn't matter. Okay, once you have your API key, you're going to come into here,
00:03:40.160 | run this cell and just enter your API key in here. Okay. And now we can just create our embeddings
00:03:48.320 | like this. So we have the embedding model. I'm going to be using Mistral embed. I think it's
00:03:52.240 | the only one they offer at the moment. And then we do Mistral embeddings, the model and the input.
00:03:57.520 | That's it. We can check the dimensionality of the model, which is 1024, as you can see here. I'm
00:04:05.280 | going to use that. So we need this value here because we're going to be using that to initialize
00:04:10.880 | our Pinecone index. So Pinecone, we also need to get our API key. So I'm going to go over there,
00:04:19.680 | copy and pull this in. Okay. So now I have both Mistral and Pinecone set up.
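Sketching that setup with the older mistralai Python client (the SDK has since changed, so treat the exact class and method names as assumptions for that era):

```python
from getpass import getpass
from mistralai.client import MistralClient
from pinecone import Pinecone

# paste your keys when prompted
mistral = MistralClient(api_key=getpass("Mistral API key: "))
pc = Pinecone(api_key=getpass("Pinecone API key: "))

embed_model = "mistral-embed"

# embed a quick test input and grab the dimensionality we need for the index
res = mistral.embeddings(model=embed_model, input=["some test text"])
dims = len(res.data[0].embedding)
print(dims)  # 1024 for mistral-embed
```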
00:04:30.720 | We're going to go ahead and just initialize an index. So I'm going to be using serverless,
00:04:35.760 | of course. I'm going to call that index Mistral rag. Call it whatever you like. It's up to you.
00:04:41.680 | And I'm just going to check if it already exists. If it does not, I'm going to create it.
00:04:47.760 | Now, when I'm creating it, I need to make sure I pass in the dimensionality of the Mistral embed
00:04:53.280 | model. I need to pass in the metric that that model has been trained to use, which is dot
00:04:58.400 | product. And I also need to pass in my index specification, which is the serverless spec
00:05:06.000 | that I initialized here. Cool. Then we create the index, and we will connect
00:05:13.600 | to that index once it's ready. So we have this. My index is currently empty, because I've just initialized it.
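Continuing the sketch, the check-and-create step might look like this (the cloud/region in the serverless spec and the exact index name are assumptions):

```python
import time
from pinecone import ServerlessSpec

index_name = "mistral-rag"  # call it whatever you like
spec = ServerlessSpec(cloud="aws", region="us-east-1")  # assumed cloud/region

# create the index only if it doesn't already exist
if index_name not in pc.list_indexes().names():
    pc.create_index(
        index_name,
        dimension=dims,        # 1024, from mistral-embed
        metric="dotproduct",   # the metric the embedding model was trained with
        spec=spec,
    )
    # wait until the index is ready before connecting
    while not pc.describe_index(index_name).status["ready"]:
        time.sleep(1)

index = pc.Index(index_name)
index.describe_index_stats()  # should report zero vectors at this point
```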
00:05:20.800 | So, yeah. We now need to go ahead and actually throw everything in
00:05:28.400 | there. So I'm going to define this embedding function, because the first time running
00:05:34.960 | through this, I realized that if you go over a certain token limit, which is this -- like the
00:05:41.040 | 16,000 here, that may have increased by now. It's actually like more than a month later than the
00:05:47.840 | last time I checked. So it may be more now. But if you go over that token limit, you're going to
00:05:55.920 | see this Mistral API exception. So you don't want that. So here all I'm doing is checking, okay,
00:06:03.120 | can I embed? Okay. I try. If I cannot, if we receive this exception, we're going to cut the
00:06:12.160 | batch size in half, essentially. So, yeah. That's all this embedding function is doing here. It'll
00:06:19.520 | just handle that for us. So, yep. We define that. And now what I'm going to do is go ahead and
00:06:26.720 | define the sort of, you know, embedding for loop that we do. So, I'm going to start with a batch
00:06:35.120 | size of 32. Of course, if we hit that limit, it's going to halve and halve and halve. Then we're going
00:06:42.320 | to go through, embed everything, and then we're going to add everything to Pinecone. And that's
00:06:48.080 | kind of it. Right? One thing that is probably worth noting here, actually, which I didn't
00:06:52.880 | mention, is that for the input to our embeddings, the text that we're embedding, we're actually
00:06:58.080 | including both the title and the content. Right? And the reason that we do that is, I mean, you
00:07:03.280 | imagine we're halfway through the Mixture of Experts paper, and it's talking about how they
00:07:09.680 | red teamed the model, right? They could be talking about this, and they might say we red teamed the
00:07:17.360 | LLM and we found these results, but it might not say anything about Mixtral. Okay? So, just by
00:07:24.880 | adding the title into every embedding there, we're adding far more context to our embeddings,
00:07:31.120 | and that means that when we are searching later on for Mixtral red-teaming information, it in
00:07:38.640 | theory will at least be able to find that for us. If we don't include the title, it will not be able
00:07:43.520 | to find that for us. So, it can be really useful to just do this concatenation between title and
00:07:49.840 | content. Also, you know, if you can, headings and subheadings are all really good things to include in
00:07:55.360 | there. But anyway, we can run that. And you will see that, depending on when
00:08:06.480 | you're running this, it may hit that API exception. So, you'll see a printout below the progress bar
00:08:14.080 | here, but then it will halve the batch size and continue. So, you don't need to worry about
00:08:19.520 | it. But yeah, I'm gonna go ahead and just skip ahead to when this is complete.
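Here is a rough sketch of that embed helper plus the indexing loop; the exception class and the exact halving behaviour are assumptions in line with what's described above:

```python
from mistralai.exceptions import MistralAPIException  # assumed exception class
from tqdm.auto import tqdm

def embed(metadata: list[dict]) -> list[list[float]]:
    """Embed title + content, halving the sub-batch size on token-limit errors."""
    batch_size = len(metadata)
    while batch_size >= 1:
        try:
            embeds = []
            for i in range(0, len(metadata), batch_size):
                batch = metadata[i:i + batch_size]
                # concatenate title and content for extra context in each vector
                texts = [f'{m["title"]}\n{m["content"]}' for m in batch]
                res = mistral.embeddings(model=embed_model, input=texts)
                embeds.extend(d.embedding for d in res.data)
            return embeds
        except MistralAPIException:
            # too many tokens in one request: halve the batch size and retry
            batch_size = batch_size // 2
    raise RuntimeError("could not embed even with a batch size of 1")

batch_size = 32
for i in tqdm(range(0, len(data), batch_size)):
    batch = data[i:i + batch_size]          # dict of columns: "id", "metadata"
    embeds = embed(batch["metadata"])       # embed title + content together
    # upsert (id, vector, metadata) tuples into Pinecone
    index.upsert(vectors=list(zip(batch["id"], embeds, batch["metadata"])))
```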
00:08:26.240 | That is complete now, and we can move on to testing everything. So, first, retrieval. I'm going to define this
00:08:34.080 | get docs function. It's going to consume a query, which is a question, and a top K value, which is
00:08:41.040 | essentially defining how many results we'd like to return. So, we, yeah, I use this. I embed with
00:08:51.680 | Mistral, return those embeddings to XQ, which is just shorthand for query vector. And then I pass
00:08:59.840 | the query vector to Pinecone. We do the search, include the metadata, and then return that
00:09:05.440 | metadata or return what we need from that metadata back. So, we do that. And then I'm going to ask,
00:09:14.160 | can you tell me about the Mistral LLM? Okay. So, we'll see what results we get from
00:09:19.760 | that. So, this is just the results, the retrieval component of RAG, not the generation component.
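A sketch of that retrieval function, along with the query used here (the top_k value is illustrative):

```python
def get_docs(query: str, top_k: int) -> list[str]:
    # embed the query with the same model used for the documents
    xq = mistral.embeddings(model=embed_model, input=[query]).data[0].embedding
    # search Pinecone and pull the stored metadata back out
    res = index.query(vector=xq, top_k=top_k, include_metadata=True)
    return [match["metadata"]["content"] for match in res["matches"]]

query = "can you tell me about the Mistral LLM?"
docs = get_docs(query, top_k=5)
print("\n---\n".join(docs))
```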
00:09:26.960 | So, you can see, I think this is the abstract from Mistral at the top here, or is it
00:09:32.400 | maybe Llama? Okay, no. So, these are some references from a few papers. Then we have this one: Mistral
00:09:40.560 | 7B outperforms the previous best 13 billion parameter model, Llama 2. So, talking about this.
00:09:46.960 | Then here I think we do have the Llama 7B abstract and some other stuff in here. Okay. But again,
00:09:56.000 | including Mistral. So, it seems pretty relevant. We can go ahead and move on to the generation
00:10:01.680 | component now then. And for generation, we're going to be passing in the query that we already
00:10:07.440 | asked. And we're going to be passing in the list of documents that we retrieved from that getDocs
00:10:14.800 | function. So, this docs here. And then what we do is we basically format all of them into a system
00:10:24.240 | message, and pass that in as the system message of the conversation. Then we pass
00:10:31.600 | our query into the user message for the conversation. And then what we're going to do is
00:10:37.360 | generate a response using the Mistral Large model. Okay? And the latest version of that. So, we run
00:10:45.280 | that. And then we can generate. And we can see what we get. So, looks pretty good. Mistral 7B
00:10:52.400 | is a 7 billion parameter language model. Grouped query attention, fast inference,
00:10:58.160 | so on and so on. Looks relevant to me. So, yeah. That is how we would use the Mistral API.
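As a sketch of the generation step, again assuming the older client's chat interface; the system prompt wording here is mine, not the notebook's:

```python
from mistralai.models.chat_completion import ChatMessage

def generate(query: str, docs: list[str]) -> str:
    # pack the retrieved documents into the system message as context
    system_message = (
        "You are a helpful assistant that answers questions using the "
        "context provided below.\n\nCONTEXT:\n" + "\n---\n".join(docs)
    )
    messages = [
        ChatMessage(role="system", content=system_message),
        ChatMessage(role="user", content=query),
    ]
    # generate the answer with the latest Mistral Large model
    res = mistral.chat(model="mistral-large-latest", messages=messages)
    return res.choices[0].message.content

print(generate(query, docs))
```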
00:11:09.040 | You can, of course, ask more questions as you prefer. But once you are done, you can go ahead
00:11:15.920 | and just delete your Pinecone index to save resources. And with that, we're done. So,
00:11:22.560 | that was it for this walkthrough of using the Mistral API for RAG. As you can see, they kind
00:11:30.480 | of have everything that you would need. So, we have the embedding model, we have the LLM. And
00:11:35.040 | they have quite a few different LLMs. So, you can sort of mix and match depending on what you need
00:11:41.600 | in terms of latency and accuracy. Because, of course, you don't always need the best model,
00:11:46.880 | like the biggest model. And, of course, you sometimes do. So, it depends. But it's a nice
00:11:53.520 | option beyond sort of Anthropic and OpenAI, which I like. So, yeah. That's it for this video. I hope
00:12:00.720 | all this has been useful and interesting. But I'll leave it there for now. So, thank you very
00:12:04.560 | much for watching. And I will see you again in the next one. Bye.
00:12:18.400 | (Music)