back to indexHybrid Search Walkthrough in Pinecone
Chapters
0:0
0:52 How Hybrid Search Works
1:25 Preprocessing
3:1 Creating Keywords
5:34 Creating an Index
6:50 Data Upsert
8:33 Query Setup
10:52 Keyword Search
12:31 OR Logic
14:49 AND Logic
15:10 Negation
17:4 Outro
00:00:00.000 |
Today we're going to have a look at how we can perform a hybrid search in Pinecone. 00:00:06.620 |
Now a hybrid search is where we perform a semantic search and also a keyword search. 00:00:15.240 |
Now we know semantic search is an incredibly useful tool that allows us to search based 00:00:21.280 |
on meaning or concepts rather than relying on specific keywords, but sometimes a more 00:00:30.360 |
traditional basic keyword search can be quite useful, particularly if you know what keywords 00:00:36.300 |
appear in the documents that you're searching for. 00:00:39.780 |
So Pinecone allows you to perform a hybrid search, allowing us to both perform a semantic 00:00:52.400 |
So we start with our full index with all of our vectors, and what we do is apply our 00:00:59.040 |
keyword search to filter out irrelevant vectors from our search scope. 00:01:05.960 |
And then we introduce our query vector, and using that query vector we find top K, in 00:01:15.580 |
And this is a semantic search portion of our query. 00:01:19.880 |
And those are our top K most similar results using hybrid search. 00:01:25.680 |
Now let's have a look at how we can actually implement that in Pinecone and start adding 00:01:32.120 |
some basic keyword search logic in there using AND, OR, AND, NOT modifiers. 00:01:40.720 |
So we're going to start with a few sentences. 00:01:43.160 |
So here we just have 10 sentences that are completely random. 00:01:48.160 |
And what we first need to do with these sentences, as we usually would with semantic search, 00:01:57.300 |
And we'll be encoding them to produce sentence embeddings. 00:02:00.680 |
For that, the easiest approach is to use the sentence transformers library, which you can 00:02:10.080 |
And of course, you will also need the Pinecone client as well. 00:02:15.120 |
So here, all we're doing is initializing a sentence transformer. 00:02:20.420 |
And we're using one of the more recent sentence transformers to produce our embeddings. 00:02:27.960 |
To produce our embeddings, we've initialized our model up here. 00:02:34.600 |
And all we do is we call the encode method and pass all of our sentences to that. 00:02:42.320 |
And we can find a shape that it will be 10 embeddings, or 10 sentence embeddings. 00:02:49.160 |
And each one of those has a dimensionality of 768. 00:02:54.500 |
So now what we need to do-- so that's the semantic search portion of our data. 00:03:01.960 |
Now we need to deal with the keyword search portion of our data. 00:03:07.840 |
So when we upset our data to Pinecone, we're going to need to include a list of tokens 00:03:15.720 |
or a list of words so that we can then use that list of words to filter and perform our 00:03:25.140 |
So to build that list of tokens for each one of our sentences, we're going to use Hugging 00:03:33.080 |
So for that, we're going to write, from transformers, import, and we're going to import the AutoTokenizer 00:03:43.640 |
Now it's important that we use a tokenizer that uses word-level tokenization, because 00:03:51.080 |
many of these tokenizers do not split sentences into words, but they split into sub-words 00:04:01.080 |
So we need to make sure that we're using a word-level tokenizer. 00:04:12.360 |
So this transform XLWT103, the tokenizer for that is a word-level tokenizer. 00:04:24.320 |
So we initialize that, and we'll put all of our INC tokens within a variable called AllTokens. 00:04:35.480 |
All we need to do is write tokenizer, and we use the tokenize method. 00:04:42.280 |
And then in here, we want to pass our sentence. 00:04:46.440 |
We also need to lowercase it, because this tokenizer will not lowercase our text by default. 00:04:53.560 |
So we just handle that, and we're doing that for each sentence in all of our sentences. 00:05:00.400 |
And let's have a look at what that looks like for our first sentence. 00:05:08.360 |
So we see that we've split our first sentence, which you can see up here, into a list of 00:05:15.400 |
words, which is exactly what we need in Pinecone to perform a keyword search. 00:05:21.060 |
So that's everything we need in terms of data. 00:05:24.620 |
We have our dense vector representations, the sentence embeddings. 00:05:29.400 |
And we also have our keywords, the list of tokens. 00:05:33.700 |
So let's continue, and we will connect to a Pinecone instance. 00:05:39.780 |
If you haven't used Pinecone before, you can get a free API key over here. 00:05:45.740 |
So we run our initialization cell, and then what we'll need to do is create a new index. 00:05:55.180 |
Now before we create that index, what I'm going to do is list all of my current indexes 00:06:01.120 |
to make sure I don't overwrite any existing indexes. 00:06:05.100 |
Now I don't have any at the moment, so that's fine. 00:06:07.780 |
I can call this whatever I want, but I'm going to go with keyword search. 00:06:15.540 |
You don't have to use the same name as what I'm using here. 00:06:17.860 |
So what I'm going to do is create the index, and then after that, I initialize my connection 00:06:32.220 |
And then just note here, I'm passing the vector dimensionality when I create the index there. 00:06:40.580 |
And we can check this, so this will be the 768 that we saw earlier. 00:06:51.140 |
And now what we want to do is merge all the data that we've created so far. 00:06:56.980 |
So when we upset data to Pinecone, we want a list of tuples. 00:07:03.580 |
Each one of those tuples is going to contain an ID, a value, which is our sentence embedding, 00:07:11.940 |
Now the tokens that we're creating, we will include within the metadata field. 00:07:18.020 |
And we'll include that using this format here. 00:07:21.420 |
So we can imagine within that metadata field for every single record or sample, we are 00:07:32.100 |
going to have this tokens, and that will map to the list of tokens for each sentence. 00:07:54.060 |
And we'll see a little response here telling us how many samples or records we upserted, 00:08:02.340 |
which in this case is 10, as we would expect. 00:08:05.420 |
Now alternatively, if you'd like, you can also upsert with a curl. 00:08:11.400 |
And for that, you just reformat your data into a dictionary format, save it to a JSON 00:08:18.500 |
object, and then upsert it using this curl command here. 00:08:23.740 |
Now the URL that you see here, you will have to go into your Pinecone dashboard and find 00:08:33.220 |
Now we've upserted all of the data into our index. 00:08:39.820 |
So the first thing we need to do is create a query sentence. 00:08:49.540 |
And what we do is we encode that using the same model that we used earlier to encode 00:08:55.940 |
And we then convert that to a list, because it is otherwise a numpy array, and we need 00:09:01.420 |
to make sure we are sending our requests with a list. 00:09:08.580 |
And let's start with a simple query without any keyword search at the moment. 00:09:21.260 |
We say we'd like to return the top k results. 00:09:25.340 |
I'm going to set that equal to 10, so we're just returning everything. 00:09:30.220 |
And I'm going to include metadata just for this column, and then I'll remove this just 00:09:46.460 |
We have all of our tokens, and it's using these tokens here that will be performing 00:09:55.480 |
And if we just save this, and if we just iterate through those results, we like this for x 00:10:15.840 |
We see we have results, and then we want to enter the 0 index of that list. 00:10:22.500 |
And then we're going to matches to get to the records that have been returned to us. 00:10:31.620 |
So we write in results, results, 0, and matches. 00:10:41.740 |
You can see here that we're returning the 10 IDs of our sentences. 00:10:51.780 |
So what we now want to do is move on to actually implementing a keyword search. 00:10:58.300 |
So we'll make it a very simple query to start with. 00:11:02.860 |
So the index.query of xq top k, we'll set that 10 again, we're just returning everything. 00:11:14.940 |
And then we can set our filter, and it's through this filter that we perform our keyword search. 00:11:22.540 |
So we want to return only records where, within tokens, there is the word bananas. 00:11:34.300 |
And again, we will get these IDs from here, but we're going to store them in the IDs variable. 00:11:45.100 |
OK, so you see straightaway we're restricting our search, and there are only four records 00:11:51.740 |
that contain the word bananas, so we're now restricting our search. 00:11:57.020 |
So what we can now do is, for i in IDs, I'm going to print each one of those sentences. 00:12:09.620 |
OK, and we can see to make sure that we're converting this back to an integer value. 00:12:23.500 |
So we can see each one of these does contain the word bananas. 00:12:29.440 |
OK, now what we might say is that we'd like to return sentences where we have one of two 00:12:41.120 |
So we're going to do bananas and way this time. 00:12:49.220 |
Now we take this code here, and let's take this as well. 00:12:59.740 |
And what we're going to do, in our filter here, we are going to modify this to use the 00:13:07.720 |
So we have or, and then using or, we can pass a list of conditions. 00:13:15.860 |
And if any one of these conditions is true, we will return that record. 00:13:23.140 |
So we're going to say tokens contains bananas, or tokens contains way. 00:13:40.660 |
So you can see that we're returning one new sentence, which is this one here, which does 00:13:48.980 |
not contain bananas, but it does contain way. 00:13:53.500 |
Now that's using the or statement here, but we can also use, which is probably simpler, 00:14:04.780 |
we can also write this using the in modifier. 00:14:10.760 |
So we first write tokens, and then we say within tokens, within or in, we want to search 00:14:23.140 |
for any records that contain either bananas or way. 00:14:30.500 |
And this will produce the exact same results as what we got before. 00:14:34.020 |
So you see, we return those same five sentences. 00:14:39.380 |
So they're the two tool alternatives we have for or logic in our keyword search. 00:14:49.380 |
And what we're going to do is just modify or, and replace it with and. 00:14:54.860 |
So now we're saying we only want to return sentences that contain both the word bananas 00:15:06.220 |
And we see now we're only returning these two, which contain both of those words. 00:15:10.220 |
Now another thing that you might want to add here is, let's say maybe we do want the word 00:15:19.780 |
way, but we also want to specify that we actually don't want any records that contain the word 00:15:28.620 |
And again, we can still use the and statement here. 00:15:32.620 |
And the only thing we actually need to change is we have to add a not equals ne to the bananas 00:15:43.820 |
And this will invert that single condition here. 00:15:48.180 |
Okay, so now we're searching for any records that do not contain bananas and contain the 00:15:59.040 |
So we execute that and we'll see that there's only actually one of those. 00:16:03.260 |
So there's only one sentence that contains the word way and does not contain the word 00:16:10.300 |
And what if we'd like to negate both of these? 00:16:14.500 |
Well, we could just add this, this any to the way condition as well, or what is simpler 00:16:27.940 |
So actually what we can do is we'll come up here and you see we have the, we have the 00:16:38.380 |
All we need to do is bring that down here and replace in with not in. 00:16:44.940 |
And if we then search, what we're doing here is searching for any records that do not contain 00:16:56.420 |
So here we're saying any sentences that contain just one of these words, we're not interested, 00:17:04.580 |
So that's it for this introduction to hybrid search using both semantic search and keyword 00:17:14.380 |
We hope that this has been useful and we'll see you in the next video.