back to index

Hybrid Search Walkthrough in Pinecone


Chapters

0:0
0:52 How Hybrid Search Works
1:25 Preprocessing
3:1 Creating Keywords
5:34 Creating an Index
6:50 Data Upsert
8:33 Query Setup
10:52 Keyword Search
12:31 OR Logic
14:49 AND Logic
15:10 Negation
17:4 Outro

Whisper Transcript | Transcript Only Page

00:00:00.000 | Today we're going to have a look at how we can perform a hybrid search in Pinecone.
00:00:06.620 | Now a hybrid search is where we perform a semantic search and also a keyword search.
00:00:15.240 | Now we know semantic search is an incredibly useful tool that allows us to search based
00:00:21.280 | on meaning or concepts rather than relying on specific keywords, but sometimes a more
00:00:30.360 | traditional basic keyword search can be quite useful, particularly if you know what keywords
00:00:36.300 | appear in the documents that you're searching for.
00:00:39.780 | So Pinecone allows you to perform a hybrid search, allowing us to both perform a semantic
00:00:46.580 | search and a keyword search.
00:00:50.080 | Let's have a look at how that works.
00:00:52.400 | So we start with our full index with all of our vectors, and what we do is apply our
00:00:59.040 | keyword search to filter out irrelevant vectors from our search scope.
00:01:05.960 | And then we introduce our query vector, and using that query vector we find top K, in
00:01:11.640 | this case three, most similar vectors.
00:01:15.580 | And this is a semantic search portion of our query.
00:01:19.880 | And those are our top K most similar results using hybrid search.
00:01:25.680 | Now let's have a look at how we can actually implement that in Pinecone and start adding
00:01:32.120 | some basic keyword search logic in there using AND, OR, AND, NOT modifiers.
00:01:40.720 | So we're going to start with a few sentences.
00:01:43.160 | So here we just have 10 sentences that are completely random.
00:01:48.160 | And what we first need to do with these sentences, as we usually would with semantic search,
00:01:54.320 | is we need to encode them.
00:01:57.300 | And we'll be encoding them to produce sentence embeddings.
00:02:00.680 | For that, the easiest approach is to use the sentence transformers library, which you can
00:02:06.120 | pip install using this code here.
00:02:10.080 | And of course, you will also need the Pinecone client as well.
00:02:15.120 | So here, all we're doing is initializing a sentence transformer.
00:02:20.420 | And we're using one of the more recent sentence transformers to produce our embeddings.
00:02:27.960 | To produce our embeddings, we've initialized our model up here.
00:02:34.600 | And all we do is we call the encode method and pass all of our sentences to that.
00:02:39.740 | And that will produce all of our embeddings.
00:02:42.320 | And we can find a shape that it will be 10 embeddings, or 10 sentence embeddings.
00:02:49.160 | And each one of those has a dimensionality of 768.
00:02:54.500 | So now what we need to do-- so that's the semantic search portion of our data.
00:03:01.960 | Now we need to deal with the keyword search portion of our data.
00:03:07.840 | So when we upset our data to Pinecone, we're going to need to include a list of tokens
00:03:15.720 | or a list of words so that we can then use that list of words to filter and perform our
00:03:21.400 | keyword search.
00:03:25.140 | So to build that list of tokens for each one of our sentences, we're going to use Hugging
00:03:30.280 | Faces transformers library.
00:03:33.080 | So for that, we're going to write, from transformers, import, and we're going to import the AutoTokenizer
00:03:41.720 | class.
00:03:43.640 | Now it's important that we use a tokenizer that uses word-level tokenization, because
00:03:51.080 | many of these tokenizers do not split sentences into words, but they split into sub-words
00:03:58.620 | or even byte-level encodings.
00:04:01.080 | So we need to make sure that we're using a word-level tokenizer.
00:04:07.840 | And that is what this model here is.
00:04:12.360 | So this transform XLWT103, the tokenizer for that is a word-level tokenizer.
00:04:24.320 | So we initialize that, and we'll put all of our INC tokens within a variable called AllTokens.
00:04:34.480 | And we'll use list comprehension.
00:04:35.480 | All we need to do is write tokenizer, and we use the tokenize method.
00:04:42.280 | And then in here, we want to pass our sentence.
00:04:46.440 | We also need to lowercase it, because this tokenizer will not lowercase our text by default.
00:04:53.560 | So we just handle that, and we're doing that for each sentence in all of our sentences.
00:05:00.400 | And let's have a look at what that looks like for our first sentence.
00:05:08.360 | So we see that we've split our first sentence, which you can see up here, into a list of
00:05:15.400 | words, which is exactly what we need in Pinecone to perform a keyword search.
00:05:21.060 | So that's everything we need in terms of data.
00:05:24.620 | We have our dense vector representations, the sentence embeddings.
00:05:29.400 | And we also have our keywords, the list of tokens.
00:05:33.700 | So let's continue, and we will connect to a Pinecone instance.
00:05:39.780 | If you haven't used Pinecone before, you can get a free API key over here.
00:05:45.740 | So we run our initialization cell, and then what we'll need to do is create a new index.
00:05:55.180 | Now before we create that index, what I'm going to do is list all of my current indexes
00:06:01.120 | to make sure I don't overwrite any existing indexes.
00:06:05.100 | Now I don't have any at the moment, so that's fine.
00:06:07.780 | I can call this whatever I want, but I'm going to go with keyword search.
00:06:12.520 | Now you can name this anything you'd like.
00:06:15.540 | You don't have to use the same name as what I'm using here.
00:06:17.860 | So what I'm going to do is create the index, and then after that, I initialize my connection
00:06:25.460 | to that index.
00:06:28.060 | So I run both of those.
00:06:32.220 | And then just note here, I'm passing the vector dimensionality when I create the index there.
00:06:40.580 | And we can check this, so this will be the 768 that we saw earlier.
00:06:47.420 | So you can see the 768 there.
00:06:51.140 | And now what we want to do is merge all the data that we've created so far.
00:06:56.980 | So when we upset data to Pinecone, we want a list of tuples.
00:07:03.580 | Each one of those tuples is going to contain an ID, a value, which is our sentence embedding,
00:07:09.780 | and also any metadata.
00:07:11.940 | Now the tokens that we're creating, we will include within the metadata field.
00:07:18.020 | And we'll include that using this format here.
00:07:21.420 | So we can imagine within that metadata field for every single record or sample, we are
00:07:32.100 | going to have this tokens, and that will map to the list of tokens for each sentence.
00:07:40.740 | So we'll execute that.
00:07:45.300 | And then we upset all of that to our index.
00:07:54.060 | And we'll see a little response here telling us how many samples or records we upserted,
00:08:02.340 | which in this case is 10, as we would expect.
00:08:05.420 | Now alternatively, if you'd like, you can also upsert with a curl.
00:08:11.400 | And for that, you just reformat your data into a dictionary format, save it to a JSON
00:08:18.500 | object, and then upsert it using this curl command here.
00:08:23.740 | Now the URL that you see here, you will have to go into your Pinecone dashboard and find
00:08:30.700 | the URL for your index.
00:08:33.220 | Now we've upserted all of the data into our index.
00:08:37.400 | So let's go ahead and start querying.
00:08:39.820 | So the first thing we need to do is create a query sentence.
00:08:45.020 | So we just have this string here.
00:08:49.540 | And what we do is we encode that using the same model that we used earlier to encode
00:08:54.940 | all of our sentences.
00:08:55.940 | And we then convert that to a list, because it is otherwise a numpy array, and we need
00:09:01.420 | to make sure we are sending our requests with a list.
00:09:05.940 | So we execute that.
00:09:08.580 | And let's start with a simple query without any keyword search at the moment.
00:09:17.460 | So we pass our query vector, xq.
00:09:21.260 | We say we'd like to return the top k results.
00:09:25.340 | I'm going to set that equal to 10, so we're just returning everything.
00:09:30.220 | And I'm going to include metadata just for this column, and then I'll remove this just
00:09:36.060 | so you can see what we have in our index.
00:09:39.500 | OK, so we can see we have our ID in here.
00:09:43.860 | And we also have inside this metadata field.
00:09:46.460 | We have all of our tokens, and it's using these tokens here that will be performing
00:09:52.860 | our keyword search.
00:09:55.480 | And if we just save this, and if we just iterate through those results, we like this for x
00:10:10.300 | in-- we can just have a look here, results.
00:10:15.840 | We see we have results, and then we want to enter the 0 index of that list.
00:10:22.500 | And then we're going to matches to get to the records that have been returned to us.
00:10:31.620 | So we write in results, results, 0, and matches.
00:10:41.740 | You can see here that we're returning the 10 IDs of our sentences.
00:10:51.780 | So what we now want to do is move on to actually implementing a keyword search.
00:10:58.300 | So we'll make it a very simple query to start with.
00:11:02.860 | So the index.query of xq top k, we'll set that 10 again, we're just returning everything.
00:11:14.940 | And then we can set our filter, and it's through this filter that we perform our keyword search.
00:11:22.540 | So we want to return only records where, within tokens, there is the word bananas.
00:11:34.300 | And again, we will get these IDs from here, but we're going to store them in the IDs variable.
00:11:42.820 | Let's have a look what we get.
00:11:45.100 | OK, so you see straightaway we're restricting our search, and there are only four records
00:11:51.740 | that contain the word bananas, so we're now restricting our search.
00:11:57.020 | So what we can now do is, for i in IDs, I'm going to print each one of those sentences.
00:12:06.700 | So we have all sentences i.
00:12:09.620 | OK, and we can see to make sure that we're converting this back to an integer value.
00:12:20.340 | And now we return those sentences.
00:12:23.500 | So we can see each one of these does contain the word bananas.
00:12:29.440 | OK, now what we might say is that we'd like to return sentences where we have one of two
00:12:40.020 | words.
00:12:41.120 | So we're going to do bananas and way this time.
00:12:45.220 | So we're going to introduce the or logic.
00:12:49.220 | Now we take this code here, and let's take this as well.
00:12:59.740 | And what we're going to do, in our filter here, we are going to modify this to use the
00:13:06.720 | or modifier.
00:13:07.720 | So we have or, and then using or, we can pass a list of conditions.
00:13:15.860 | And if any one of these conditions is true, we will return that record.
00:13:23.140 | So we're going to say tokens contains bananas, or tokens contains way.
00:13:36.900 | And let's return and see what we get.
00:13:40.660 | So you can see that we're returning one new sentence, which is this one here, which does
00:13:48.980 | not contain bananas, but it does contain way.
00:13:53.500 | Now that's using the or statement here, but we can also use, which is probably simpler,
00:14:04.780 | we can also write this using the in modifier.
00:14:10.760 | So we first write tokens, and then we say within tokens, within or in, we want to search
00:14:23.140 | for any records that contain either bananas or way.
00:14:30.500 | And this will produce the exact same results as what we got before.
00:14:34.020 | So you see, we return those same five sentences.
00:14:39.380 | So they're the two tool alternatives we have for or logic in our keyword search.
00:14:46.700 | And let's copy this one.
00:14:49.380 | And what we're going to do is just modify or, and replace it with and.
00:14:54.860 | So now we're saying we only want to return sentences that contain both the word bananas
00:15:01.540 | and also the word way.
00:15:04.480 | So we do that.
00:15:06.220 | And we see now we're only returning these two, which contain both of those words.
00:15:10.220 | Now another thing that you might want to add here is, let's say maybe we do want the word
00:15:19.780 | way, but we also want to specify that we actually don't want any records that contain the word
00:15:27.260 | bananas.
00:15:28.620 | And again, we can still use the and statement here.
00:15:32.620 | And the only thing we actually need to change is we have to add a not equals ne to the bananas
00:15:42.820 | condition.
00:15:43.820 | And this will invert that single condition here.
00:15:48.180 | Okay, so now we're searching for any records that do not contain bananas and contain the
00:15:56.980 | word way.
00:15:59.040 | So we execute that and we'll see that there's only actually one of those.
00:16:03.260 | So there's only one sentence that contains the word way and does not contain the word
00:16:08.420 | bananas.
00:16:10.300 | And what if we'd like to negate both of these?
00:16:14.500 | Well, we could just add this, this any to the way condition as well, or what is simpler
00:16:23.340 | using the not in modifier.
00:16:27.940 | So actually what we can do is we'll come up here and you see we have the, we have the
00:16:34.700 | in modifier here.
00:16:37.380 | Very very similar.
00:16:38.380 | All we need to do is bring that down here and replace in with not in.
00:16:44.940 | And if we then search, what we're doing here is searching for any records that do not contain
00:16:52.380 | the word bananas or the word way.
00:16:56.420 | So here we're saying any sentences that contain just one of these words, we're not interested,
00:17:03.220 | we exclude those.
00:17:04.580 | So that's it for this introduction to hybrid search using both semantic search and keyword
00:17:11.780 | search in Pinecone.
00:17:14.380 | We hope that this has been useful and we'll see you in the next video.