
RAG with Mistral AI!


Chapters

0:00 RAG with Mistral and Pinecone
1:07 Mistral API in Python
1:44 Setting up Vector DB
3:14 Mistral Embeddings
4:12 Creating Pinecone Index
8:24 RAG with Mistral
11:20 Final Thoughts on Mistral

Transcript

Today we are going to be taking a look at using the Mistral API for RAG, with their Mistral Embed model and their Mistral Large LLM. Now, Mistral is a pretty interesting AI company. Their approach to releasing their models has, for the most part, been to open source them and then provide an API service on top.

So just the fact that they open sourced the models in the first place is kind of cool and different, and the models themselves are really good. Take Mistral 7B, which they released a while ago now: running it locally was, one, possible, and two, you could actually use it for a lot of things that you could not do with any other open-source model at the time.

So that was really cool, very impressive. And now they also have the API, which makes using their models much easier still. So let's jump into it. We're going to be going through this example here, which is from Pinecone's examples, under integrations, Mistral AI, and then Mistral RAG.

I'm going to go ahead and just open this in Colab. Okay, cool. Now I'll install the prerequisites, so I'll just run this. We have Hugging Face datasets for our data, the Mistral AI client, which is of course for hitting their API, and then Pinecone, again via its API, so that we can store all of our embeddings and retrieve them for RAG.
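
That install cell looks roughly like the following; package names are the usual ones, but the exact pinned versions in the notebook may differ.

```python
# Run in a Colab / Jupyter cell. Versions are illustrative, not the notebook's exact pins.
!pip install -qU datasets mistralai pinecone-client
```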

That will take a moment to install. After everything has installed, I'm going to go ahead and download this dataset from Hugging Face. This is the AI arXiv 2 semantic chunks dataset, which I use all the time. Essentially, it's a load of arXiv papers, semantically chunked. That is pretty much all it is.
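
Loading it is a one-liner with Hugging Face datasets. The dataset ID below is my best guess from the description, so treat it as an assumption and swap in whatever the notebook actually loads.

```python
from datasets import load_dataset

# Dataset ID assumed from the description ("AI arXiv 2 semantic chunks").
data = load_dataset(
    "jamescalam/ai-arxiv2-semantic-chunks",
    split="train[:10000]",  # we only want the first 10,000 chunks
)
print(data)
```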

Okay, cool. So we actually have 10,000 chunks, not 200,000. Yes, 10,000, so ignore that 200K there. And they look like this: you can see we actually have the Mixture of Experts paper here, and the first chunk includes the title, the authors, and the abstract of the paper before moving on to the next chunk, as decided by the semantic chunking algorithm.

Okay, I'm going to go ahead and restructure what we have here into something simpler, which we need for Pinecone. I'm going to include the ID for each of our chunks, which I do need, and I'm going to move the title and content over into the metadata field, because we're going to be adding both of those into Pinecone.

So, yeah, we do that, and everything else we just remove. Now we're left with these two features, ID and metadata, and within metadata we of course have the title and the content. Cool. So we have this, and now I just want to go ahead and initialize my connection to Mistral, because we're going to be using Mistral for the embeddings, right?
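
A minimal sketch of that restructuring, assuming the raw records expose id, title, and content columns (those column names are my assumption, not confirmed from the notebook):

```python
# Build the {id, metadata} structure we want for Pinecone.
# Column names ("id", "title", "content") are assumed.
data = data.map(lambda x: {
    "id": x["id"],
    "metadata": {"title": x["title"], "content": x["content"]},
})
# Drop every column except the two we need.
data = data.remove_columns([
    col for col in data.column_names if col not in ("id", "metadata")
])
print(data)
```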

So we need that. I'm going to go to the Mistral AI console, then API keys. Okay, cool. I will create a new key. What do we call it? Let's call it just Mistral demo; the expiration doesn't matter. Once you have your API key, you're going to come in here, run this cell, and just enter your API key.

Okay. And now we can just create our embeddings like this. We have the embedding model; I'm going to be using mistral-embed, which I think is the only embedding model they offer at the moment. Then we call the Mistral embeddings endpoint with the model and the input. That's it. We can check the dimensionality of the model, which is 1,024, as you can see here.
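
Here's roughly what that looks like in code. I'm assuming the legacy v0.x mistralai Python SDK here; the newer v1 SDK uses Mistral(...) and client.embeddings.create(...) instead.

```python
import os
from getpass import getpass

from mistralai.client import MistralClient  # legacy v0.x client

# Read the key from the environment, or prompt for it in the notebook.
mistral_api_key = os.getenv("MISTRAL_API_KEY") or getpass("Mistral API key: ")
mistral = MistralClient(api_key=mistral_api_key)

embed_model = "mistral-embed"
res = mistral.embeddings(model=embed_model, input=["some text to embed"])
dims = len(res.data[0].embedding)
print(dims)  # 1024
```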

I'm going to use that. We need this value because we're going to be using it to initialize our Pinecone index. For Pinecone, we also need to get our API key, so I'm going to go over there, copy it, and pull it in. Okay, so now I have both Mistral and Pinecone set up.

We're going to go ahead and initialize an index. I'm going to be using serverless, of course, and I'm going to call the index mistral-rag; call it whatever you like, it's up to you. I'm just going to check whether it already exists, and if it does not, I'm going to create it.

Now, when I'm creating it, I need to make sure I pass in the dimensionality of the mistral-embed model, the metric that the model has been trained to use, which is dot product, and my index specification, which is the serverless spec that I initialized here.
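
With the Pinecone client, that might look like the following. The cloud and region are assumptions; use whatever your project is set up for.

```python
import os
from getpass import getpass

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY") or getpass("Pinecone API key: "))
spec = ServerlessSpec(cloud="aws", region="us-east-1")  # assumed cloud/region

index_name = "mistral-rag"
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=dims,       # 1024, from mistral-embed above
        metric="dotproduct",  # the metric mistral-embed is meant to be used with
        spec=spec,
    )

index = pc.Index(index_name)
index.describe_index_stats()  # should report 0 vectors at this point
```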

Cool. Then we initialize that, and we connect to the index once it's ready. So we have this: my index is currently empty, because I've just initialized it, so we will need to go ahead and actually throw everything in there. I'm going to define this embedding function because, the first time running through this, I realized that if you go over a certain token limit, which is the 16,000 shown here (that figure may have increased by now), you run into problems.

It's been more than a month since I last checked, so it may well be higher now. But if you go over that token limit, you're going to see this Mistral API exception, and you don't want that. So here, all I'm doing is checking: okay, can I embed?
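
One way to write that helper is sketched below; it is not necessarily the notebook's exact code, and the MistralAPIException name is taken from the legacy v0.x client.

```python
from mistralai.exceptions import MistralAPIException  # v0.x exception class

def embed(metadata: list[dict]) -> list[list[float]]:
    """Embed title + content; split the batch in half if we hit the token limit."""
    texts = [f'{m["title"]}\n{m["content"]}' for m in metadata]
    try:
        res = mistral.embeddings(model=embed_model, input=texts)
        return [record.embedding for record in res.data]
    except MistralAPIException:
        if len(metadata) == 1:
            raise  # a single chunk is still too large; let the error surface
        # Too many tokens in one request: halve the batch and embed each half.
        mid = len(metadata) // 2
        return embed(metadata[:mid]) + embed(metadata[mid:])
```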

I try, and if I cannot, if we receive this exception, we cut the batch size in half, essentially. That's all this embedding function is doing; it'll just handle that for us. So, yep, we define that. And now what I'm going to do is go ahead and define the usual embedding for loop.

I'm going to start with a batch size of 32. Of course, if we hit that limit, it's going to halve, and halve again as needed. Then we go through, embed everything, and add everything to Pinecone. That's kind of it, right? One thing that is probably worth noting here, which I didn't mention, is that for the input to our embeddings, the text that we're embedding, we're actually including both the title and the content.

And the reason we do that is, imagine we're halfway through the Mixture of Experts paper, in a section about how they red teamed the model. The chunk might say "we red teamed the LLM and we found these results", but it might not say anything about the Mixture of Experts model itself.

So, just by adding the title into every embedding, we're adding far more context to our embeddings, which means that when we later search for Mixture of Experts red-teaming information, it should in theory be able to find that for us. If we don't include the title, it will not be able to find that for us.

So it can be really useful to just do this concatenation of title and content. Also, if you can, headings and subheadings are all really good things to include in there. But anyway, we can run that, and depending on when you're running this, it may hit that API exception.
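
The indexing loop then looks roughly like this, upserting IDs, vectors, and metadata into Pinecone in batches (same assumptions as the earlier sketches):

```python
from tqdm.auto import tqdm

batch_size = 32  # starting batch size; embed() handles oversized batches itself

for i in tqdm(range(0, len(data), batch_size)):
    batch = data[i:i + batch_size]   # slicing a HF Dataset returns a dict of lists
    ids = batch["id"]
    metadata = batch["metadata"]
    embeds = embed(metadata)         # embeds title + content together
    index.upsert(vectors=list(zip(ids, embeds, metadata)))
```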

If it does, you'll see it printed out below the progress bar here, but then it will halve the batch size and continue, so you don't need to worry about it. I'm going to go ahead and just skip to when this is complete. That is complete now, and we can move on to testing everything.

So, first, retrieval. I'm going to define this get_docs function. It takes a query, which is a question, and a top_k value, which defines how many results we'd like to return. I embed the query with Mistral and assign the resulting embedding to xq, which is just shorthand for query vector.

Then I pass the query vector to Pinecone, we do the search, include the metadata, and return what we need from that metadata. So we do that, and then I'm going to ask: can you tell me about the Mistral LLM? We'll see what results we get from that.
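
A sketch of get_docs under the same assumptions (legacy Mistral client, metadata stored with a content field):

```python
def get_docs(query: str, top_k: int) -> list[str]:
    # Embed the query with mistral-embed; xq is shorthand for "query vector".
    xq = mistral.embeddings(model=embed_model, input=[query]).data[0].embedding
    # Search Pinecone, returning the stored metadata alongside each match.
    res = index.query(vector=xq, top_k=top_k, include_metadata=True)
    return [match["metadata"]["content"] for match in res["matches"]]

docs = get_docs("can you tell me about the Mistral LLM?", top_k=5)
print("\n---\n".join(docs))
```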

This is just the retrieval component of RAG, not the generation component. You can see we have, I think, the abstract from Mistral at the top here, or is it maybe Llama? Okay, no, this is some references from a few papers. Then we have this one: Mistral 7B outperforms the previous best 13-billion-parameter model, Llama 2.

So, it's talking about this. Then here I think we do have the Llama 7B abstract and some other stuff, but again, including Mistral. So it seems pretty relevant, and we can go ahead and move on to the generation component. For generation, we're going to be passing in the query that we already asked.

And we're going to be passing in the list of documents that we retrieved from that get_docs function, so this docs here. What we do is format all of them into a system message and pass that in as the system message of a conversation. Then we pass our query in as the user message of the conversation.
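
With the legacy client's chat API, that step might look roughly like this; the system prompt wording is my own, not the notebook's.

```python
from mistralai.models.chat_completion import ChatMessage  # v0.x client

def generate(query: str, docs: list[str]) -> str:
    # Pack the retrieved documents into the system message as context.
    system_message = (
        "You are a helpful assistant that answers questions using the "
        "context provided below.\n\nCONTEXT:\n" + "\n---\n".join(docs)
    )
    messages = [
        ChatMessage(role="system", content=system_message),
        ChatMessage(role="user", content=query),
    ]
    res = mistral.chat(model="mistral-large-latest", messages=messages)
    return res.choices[0].message.content
```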

Then we generate a response using the Mistral Large model, the latest version of it. So, we run that, we generate, and we can see what we get. It looks pretty good: Mistral 7B is a 7-billion-parameter language model.

Grouped-query attention, fast inference, and so on. Looks relevant to me. So, yeah, that is how we would use the Mistral API. You can, of course, ask more questions as you prefer, but once you are done, you can go ahead and just delete your Pinecone index to save resources.
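
Cleanup is a one-liner with the Pinecone client:

```python
# Delete the index once you're finished to avoid paying for idle resources.
pc.delete_index(index_name)
```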

And with that, we're done. So, that was it for this walkthrough of using the Mistral API for RAG. As you can see, they kind of have everything that you would need. So, we have the embedding model, we have the LLM. And they have quite a few different LLMs. So, you can sort of mix and match depending on what you need in terms of latency and accuracy.

Because, of course, you don't always need the biggest, best model, although sometimes you do; it depends. But it's a nice option beyond Anthropic and OpenAI, which I like. So, yeah, that's it for this video. I hope all this has been useful and interesting.

But I'll leave it there for now. So, thank you very much for watching. And I will see you again in the next one. Bye.