Hybrid Search Walkthrough in Pinecone

Today we're going to have a look at how we can perform a hybrid search in Pinecone. Now a hybrid search is where we perform a semantic search and also a keyword search. Now we know semantic search is an incredibly useful tool that allows us to search based on meaning or concepts rather than relying on specific keywords, but sometimes a more traditional basic keyword search can be quite useful, particularly if you know what keywords appear in the documents that you're searching for.

So Pinecone allows us to perform a hybrid search, combining a semantic search with a keyword search. Let's have a look at how that works. We start with our full index with all of our vectors, and what we do is apply our keyword search to filter out irrelevant vectors from our search scope.

And then we introduce our query vector, and using that query vector we find the top k, in this case three, most similar vectors. This is the semantic search portion of our query. And those are our top k most similar results using hybrid search. Now let's have a look at how we can actually implement that in Pinecone and start adding some basic keyword search logic in there using AND, OR, and NOT modifiers.
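To make the two stages concrete, here is a minimal pure-Python sketch of the idea: filter by keyword first, then rank what is left by similarity. It is not how Pinecone implements this internally. The records, the two-dimensional vectors, and the hybrid_search helper are all made up for illustration.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length, non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def hybrid_search(records, query_vector, keyword, top_k=3):
    # Stage 1 (keyword search): keep only records whose tokens include the keyword.
    candidates = [r for r in records if keyword in r["tokens"]]
    # Stage 2 (semantic search): rank the remaining records by similarity to the query.
    ranked = sorted(candidates, key=lambda r: cosine(query_vector, r["vector"]), reverse=True)
    return [r["id"] for r in ranked[:top_k]]

# Hypothetical toy records, not the data from the walkthrough.
records = [
    {"id": "0", "vector": [1.0, 0.0], "tokens": ["bananas", "way"]},
    {"id": "1", "vector": [0.0, 1.0], "tokens": ["bananas"]},
    {"id": "2", "vector": [0.9, 0.1], "tokens": ["way"]},
    {"id": "3", "vector": [0.7, 0.7], "tokens": ["bananas", "yellow"]},
]
top_ids = hybrid_search(records, [1.0, 0.0], "bananas", top_k=2)
print(top_ids)
```

Record "2" is close to the query vector but never gets ranked, because the keyword stage removes it before the similarity stage runs.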

So we're going to start with a few sentences. So here we just have 10 sentences that are completely random. And what we first need to do with these sentences, as we usually would with semantic search, is we need to encode them. And we'll be encoding them to produce sentence embeddings.

For that, the easiest approach is to use the sentence transformers library, which you can pip install using this code here. And of course, you will also need the Pinecone client as well. So here, all we're doing is initializing a sentence transformer. And we're using one of the more recent sentence transformers to produce our embeddings.

To produce our embeddings, we use the model we've initialized up here. All we do is call the encode method and pass all of our sentences to it, and that will produce all of our embeddings. And we can check the shape and see that we have 10 sentence embeddings.
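As a rough sketch, the encoding step could look like the following. The walkthrough does not name the model, so all-mpnet-base-v2 is an assumption here, picked only because it produces 768-dimensional embeddings; encode_sentences is a hypothetical helper. The packages would be installed with pip install sentence-transformers.

```python
def encode_sentences(sentences, model_name="all-mpnet-base-v2"):
    # Imported lazily so this function can be defined without the package installed.
    from sentence_transformers import SentenceTransformer

    # model_name is an assumption; swap in whichever sentence transformer you use.
    model = SentenceTransformer(model_name)
    # encode returns one embedding per input sentence.
    return model.encode(sentences)

# Usage (downloads the model on first run):
# embeddings = encode_sentences(["first sentence", "second sentence"])
# embeddings.shape  # (number of sentences, embedding dimension)
```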

And each one of those has a dimensionality of 768. So that's the semantic search portion of our data. Now we need to deal with the keyword search portion of our data. When we upsert our data to Pinecone, we're going to need to include a list of tokens, or a list of words, so that we can then use that list of words to filter and perform our keyword search.

So to build that list of tokens for each one of our sentences, we're going to use Hugging Face's transformers library. So for that, we're going to write from transformers import AutoTokenizer. Now it's important that we use a tokenizer that uses word-level tokenization, because many of these tokenizers do not split sentences into words; they split into sub-words or even byte-level encodings.

So we need to make sure that we're using a word-level tokenizer. And that is what this model here is. So this is transfo-xl-wt103, and the tokenizer for that is a word-level tokenizer. So we initialize that, and we'll put all of our tokens within a variable called all_tokens. And we'll use a list comprehension.

All we need to do is write tokenizer, and we use the tokenize method. And then in here, we want to pass our sentence. We also need to lowercase it, because this tokenizer will not lowercase our text by default. So we just handle that, and we're doing that for each sentence in all of our sentences.

And let's have a look at what that looks like for our first sentence. So we see that we've split our first sentence, which you can see up here, into a list of words, which is exactly what we need in Pinecone to perform a keyword search. So that's everything we need in terms of data.
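If you just want to see the shape of the output without downloading a tokenizer, a rough stand-in is shown below. This is not the Hugging Face tokenizer from the walkthrough: it is a plain regular expression that lowercases the text and keeps runs of letters, digits, and apostrophes, so it drops punctuation that a real word-level tokenizer may keep as separate tokens. The sentences and the word_tokenize helper are hypothetical.

```python
import re

def word_tokenize(sentence):
    # Lowercase first, then pull out runs of letters, digits, and apostrophes.
    return re.findall(r"[a-z0-9']+", sentence.lower())

# Hypothetical example sentences, not the ones from the walkthrough.
all_sentences = [
    "No way, chimps go bananas for snacks!",
    "Bananas are a good source of potassium.",
]

# One list of lowercase word tokens per sentence.
all_tokens = [word_tokenize(sentence) for sentence in all_sentences]
print(all_tokens[0])
```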

We have our dense vector representations, the sentence embeddings. And we also have our keywords, the list of tokens. So let's continue, and we will connect to a Pinecone instance. If you haven't used Pinecone before, you can get a free API key over here. So we run our initialization cell, and then what we'll need to do is create a new index.

Now before we create that index, what I'm going to do is list all of my current indexes to make sure I don't overwrite any existing indexes. Now I don't have any at the moment, so that's fine. I can call this whatever I want, but I'm going to go with keyword search.

Now you can name this anything you'd like. You don't have to use the same name as what I'm using here. So what I'm going to do is create the index, and then after that, I initialize my connection to that index. So I run both of those. And then just note here, I'm passing the vector dimensionality when I create the index there.
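Here is a hedged sketch of the list, create, connect sequence. It assumes the older module-level Pinecone client (the 2.x style, where you call pinecone.init and then pinecone.list_indexes, pinecone.create_index, and pinecone.Index); newer client versions use a different interface. The get_or_create_index helper and the index name keyword-search are illustrative choices.

```python
def get_or_create_index(pc, name, dimension):
    # pc is assumed to be the legacy `pinecone` module after pinecone.init(...).
    # list_indexes() is assumed to return a list of existing index names.
    if name not in pc.list_indexes():
        # Only create the index if the name is not already taken.
        pc.create_index(name, dimension=dimension)
    # Return a connection to the index.
    return pc.Index(name)

# Usage with the legacy client (needs a real API key and environment):
# import pinecone
# pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")
# index = get_or_create_index(pinecone, "keyword-search", 768)
```

Checking the existing names first mirrors the step in the walkthrough of listing indexes before creating a new one.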

And we can check this, so this will be the 768 that we saw earlier. So you can see the 768 there. And now what we want to do is merge all the data that we've created so far. So when we upsert data to Pinecone, we want a list of tuples.

Each one of those tuples is going to contain an ID, a value, which is our sentence embedding, and also any metadata. Now the tokens that we're creating, we will include within the metadata field. And we'll include that using this format here. So we can imagine that within that metadata field, for every single record or sample, we are going to have this tokens key, and that will map to the list of tokens for each sentence.
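A minimal sketch of building that list of tuples might look like this. The build_upserts helper, the tiny three-dimensional vectors, and the token lists are hypothetical; the IDs are simply the sentence positions as strings.

```python
def build_upserts(embeddings, all_tokens):
    upserts = []
    for i, (embedding, tokens) in enumerate(zip(embeddings, all_tokens)):
        # Convert numpy arrays to plain lists; leave plain sequences as lists.
        values = embedding.tolist() if hasattr(embedding, "tolist") else list(embedding)
        # Each record is (id, values, metadata), with the tokens stored in metadata.
        upserts.append((str(i), values, {"tokens": tokens}))
    return upserts

# Hypothetical toy data standing in for the real embeddings and tokens.
embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
all_tokens = [["no", "way"], ["bananas", "are", "yellow"]]
upserts = build_upserts(embeddings, all_tokens)
print(upserts[1])

# With a connected index, the upsert itself would then be something like:
# index.upsert(vectors=upserts)
```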

So we'll execute that. And then we upsert all of that to our index. And we'll see a little response here telling us how many samples or records we upserted, which in this case is 10, as we would expect. Now alternatively, if you'd like, you can also upsert with curl.

And for that, you just reformat your data into a dictionary format, save it to a JSON object, and then upsert it using this curl command here. Now the URL that you see here, you will have to go into your Pinecone dashboard and find the URL for your index. Now we've upserted all of the data into our index.
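As a sketch of that reformatting step, the tuples can be turned into a dictionary with a top-level vectors list and serialized to JSON. The to_upsert_payload helper and the sample record are hypothetical, and the exact endpoint and headers for the curl request should be taken from your Pinecone dashboard and the API reference rather than from this example.

```python
import json

def to_upsert_payload(upserts):
    # Turn (id, values, metadata) tuples into the dictionary layout used for a JSON upsert.
    return {
        "vectors": [
            {"id": record_id, "values": values, "metadata": metadata}
            for record_id, values, metadata in upserts
        ]
    }

# Hypothetical single record for illustration.
upserts = [("0", [0.1, 0.2], {"tokens": ["no", "way"]})]
payload = to_upsert_payload(upserts)

# Serialize to a JSON string; this could be written to a file and passed to curl.
body = json.dumps(payload)
print(body)
```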

So let's go ahead and start querying. So the first thing we need to do is create a query sentence. So we just have this string here. And what we do is we encode that using the same model that we used earlier to encode all of our sentences. And we then convert that to a list, because it is otherwise a numpy array, and we need to make sure we are sending our requests with a list.

So we execute that. And let's start with a simple query without any keyword search at the moment. So we pass our query vector, xq. We say we'd like to return the top k results. I'm going to set that equal to 10, so we're just returning everything. And I'm going to include metadata just for this query, and then I'll remove it, just so you can see what we have in our index.

OK, so we can see we have our ID in here. And inside this metadata field we have all of our tokens, and it's using these tokens that we'll be performing our keyword search. And if we just save this, we can iterate through those results with a for loop. Let's just have a look here at results.

We see we have results, and then we want to enter the 0 index of that list. And then we go into matches to get to the records that have been returned to us. So we write for x in results['results'][0]['matches']. You can see here that we're returning the 10 IDs of our sentences.
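Pulling the IDs out of the response can be sketched like this. The nested results, then index 0, then matches layout is assumed from the description above and matches the older batched query response; newer client versions may return matches at the top level. The match_ids helper and the sample response are hypothetical.

```python
def match_ids(response):
    # Assumed layout: response["results"][0]["matches"] is a list of matched records.
    return [match["id"] for match in response["results"][0]["matches"]]

# Hypothetical response with two matches, shaped like the one described above.
response = {
    "results": [
        {
            "matches": [
                {"id": "3", "score": 0.9},
                {"id": "0", "score": 0.4},
            ]
        }
    ]
}
ids = match_ids(response)
print(ids)
```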

So what we now want to do is move on to actually implementing a keyword search. We'll make it a very simple query to start with. So we call index.query with xq, and top k we'll set to 10 again, so we're just returning everything. And then we can set our filter, and it's through this filter that we perform our keyword search.

So we want to return only records where, within tokens, there is the word bananas. And again, we will get these IDs from here, but we're going to store them in the IDs variable. Let's have a look at what we get. OK, so you see straightaway we're restricting our search, and there are only four records that contain the word bananas.

So what we can now do is, for i in IDs, print each one of those sentences. So we have all_sentences[i]. OK, and we need to make sure that we're converting this back to an integer value. And now we return those sentences. So we can see each one of these does contain the word bananas.
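The single-keyword filter and the lookup back to sentences can be sketched locally as follows. The filter dictionary is the kind of thing you would pass to the query; the matching function only emulates, in plain Python, the behaviour described above and is not Pinecone's implementation. The sentences, tokens, and matches_keyword helper are hypothetical.

```python
# Hypothetical sentences and their word-level tokens.
all_sentences = [
    "no way chimps go bananas",
    "the way home is long",
    "bananas are yellow",
]
all_tokens = [sentence.split() for sentence in all_sentences]

# Filter meaning: keep records whose tokens list contains "bananas".
keyword_filter = {"tokens": "bananas"}

def matches_keyword(tokens, flt):
    # Local emulation: the filter value must appear in the record's token list.
    return flt["tokens"] in tokens

# IDs are stored as strings, as they would be in the index.
ids = [str(i) for i, tokens in enumerate(all_tokens) if matches_keyword(tokens, keyword_filter)]

for i in ids:
    # Convert the string ID back to an integer to index into the sentence list.
    print(all_sentences[int(i)])
```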

OK, now what we might say is that we'd like to return sentences where we have one of two words. So we're going to do bananas and way this time. So we're going to introduce the or logic. Now we take this code here, and let's take this as well. And what we're going to do, in our filter here, we are going to modify this to use the or modifier.

So we have or, and then using or, we can pass a list of conditions. And if any one of these conditions is true, we will return that record. So we're going to say tokens contains bananas, or tokens contains way. And let's return and see what we get. So you can see that we're returning one new sentence, which is this one here, which does not contain bananas, but it does contain way.

Now that's using the or statement here, but we can also use, which is probably simpler, we can also write this using the in modifier. So we first write tokens, and then we say within tokens, within or in, we want to search for any records that contain either bananas or way.

And this will produce the exact same results as what we got before. So you see, we return those same five sentences. So those are the two alternatives we have for or logic in our keyword search. And let's copy this one. And what we're going to do is just modify or, and replace it with and.
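Written out as filter dictionaries, the two alternatives use Pinecone's $or and $in operators. The sketch below also emulates both in plain Python to show they select the same records; the token lists and helper functions are hypothetical, and the emulation only follows the behaviour described above.

```python
# Two ways of saying "tokens contains bananas OR tokens contains way".
or_filter = {"$or": [{"tokens": "bananas"}, {"tokens": "way"}]}
in_filter = {"tokens": {"$in": ["bananas", "way"]}}

# Hypothetical token lists keyed by record ID.
all_tokens = {
    "0": ["no", "way", "chimps", "go", "bananas"],
    "1": ["the", "way", "home", "is", "long"],
    "2": ["bananas", "are", "yellow"],
    "3": ["purple", "is", "a", "colour"],
}

def matches_or(tokens, flt):
    # True if any one of the listed conditions holds.
    return any(condition["tokens"] in tokens for condition in flt["$or"])

def matches_in(tokens, flt):
    # True if any one of the listed words appears in the tokens.
    return any(word in tokens for word in flt["tokens"]["$in"])

or_ids = [i for i, tokens in all_tokens.items() if matches_or(tokens, or_filter)]
in_ids = [i for i, tokens in all_tokens.items() if matches_in(tokens, in_filter)]
print(or_ids, in_ids)
```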

So now we're saying we only want to return sentences that contain both the word bananas and also the word way. So we do that. And we see now we're only returning these two, which contain both of those words. Now another thing that you might want to add here is, let's say maybe we do want the word way, but we also want to specify that we actually don't want any records that contain the word bananas.

And again, we can still use the and statement here. And the only thing we actually need to change is we have to add a not equals, $ne, to the bananas condition. And this will invert that single condition here. Okay, so now we're searching for any records that do not contain bananas and contain the word way.
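The and filter and its negated variant can be sketched with the $and and $ne operators. As before, the plain-Python matching below only emulates the behaviour described in the walkthrough, where $ne on the tokens list means the word is not present; the token lists and helpers are hypothetical.

```python
# "tokens contains bananas AND tokens contains way"
and_filter = {"$and": [{"tokens": "bananas"}, {"tokens": "way"}]}
# "tokens does NOT contain bananas AND tokens contains way"
and_not_filter = {"$and": [{"tokens": {"$ne": "bananas"}}, {"tokens": "way"}]}

# Hypothetical token lists keyed by record ID.
all_tokens = {
    "0": ["no", "way", "chimps", "go", "bananas"],
    "1": ["the", "way", "home", "is", "long"],
    "2": ["bananas", "are", "yellow"],
}

def condition_holds(tokens, condition):
    value = condition["tokens"]
    if isinstance(value, dict):
        # {"$ne": word}: the word must be absent from the tokens.
        return value["$ne"] not in tokens
    # Plain string: the word must be present in the tokens.
    return value in tokens

def matches_and(tokens, flt):
    # True only if every listed condition holds.
    return all(condition_holds(tokens, condition) for condition in flt["$and"])

and_ids = [i for i, tokens in all_tokens.items() if matches_and(tokens, and_filter)]
and_not_ids = [i for i, tokens in all_tokens.items() if matches_and(tokens, and_not_filter)]
print(and_ids, and_not_ids)
```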

So we execute that and we'll see that there's only actually one of those. So there's only one sentence that contains the word way and does not contain the word bananas. And what if we'd like to negate both of these? Well, we could just add this $ne to the way condition as well, or, what is simpler, use the not in modifier, $nin.

So actually what we can do is come up here, and you see we have the in modifier here. It's very similar. All we need to do is bring that down here and replace in with not in. And if we then search, what we're doing here is searching for any records that do not contain the word bananas or the word way.
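The not in version uses the $nin operator. Here is a small sketch, again with a plain-Python emulation of the described behaviour rather than Pinecone's own logic; the token lists and the matches_nin helper are hypothetical.

```python
# "tokens contains neither bananas nor way"
nin_filter = {"tokens": {"$nin": ["bananas", "way"]}}

# Hypothetical token lists keyed by record ID.
all_tokens = {
    "0": ["no", "way", "chimps", "go", "bananas"],
    "1": ["the", "way", "home", "is", "long"],
    "2": ["bananas", "are", "yellow"],
    "3": ["purple", "is", "a", "colour"],
}

def matches_nin(tokens, flt):
    # True only if none of the listed words appear in the tokens.
    return not any(word in tokens for word in flt["tokens"]["$nin"])

nin_ids = [i for i, tokens in all_tokens.items() if matches_nin(tokens, nin_filter)]
print(nin_ids)
```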

So here we're saying any sentences that contain just one of these words, we're not interested, we exclude those. So that's it for this introduction to hybrid search using both semantic search and keyword search in Pinecone. We hope that this has been useful and we'll see you in the next video.
