Pinecone's New *Hybrid* Search - the future of search?
Vector search has unlocked the door to another level of relevance and efficiency when it comes to retrieving data. In the past year alone, the number of vector search use cases has exploded, and there are no signs of that slowing down anytime soon. Now, the capabilities of vector search are pretty impressive, but it's not always a perfect technology. In fact, unless we have big domain-specific datasets to fine-tune embedding models with, traditional search still has some advantages. We repeatedly see that vector search unlocks incredible potential for really intelligent and powerful retrieval, but it struggles when adapting to a new domain, particularly when that new domain is very different from the domain the embedding model was fine-tuned on. Traditional search, on the other hand, adapts to new domains much better, but we're limited to a fixed level of performance.

So both approaches have their pros and cons, but what if we could eliminate a few of those cons? Could we create a hybrid search with the heightened performance potential of vector search and the zero-shot adaptability of traditional search? Today, that's exactly what we're going to look at: the new hybrid search that Pinecone has come up with, which merges vector search and a more traditional search into one single index.
Vector search, or dense retrieval, has been shown to significantly outperform traditional methods, but only when the embedding models creating these dense vector embeddings have been fine-tuned on the target domain. When we try to use the same models for out-of-domain tasks, that performance doesn't tend to hold up so well. That means if we have a large amount of data covering a very specific domain, like medical question answering, then we're okay: we can fine-tune our model and everything will work. But if we don't have a large amount of data to fine-tune our embedding model, then chances are we might actually find better performance from a traditional, sparse retrieval method like BM25. That gives us a best-case performance set by BM25, with no potential to fine-tune and push that performance towards more human-like, intelligent retrieval. So if we want better performance, we're left with two options: we either need to annotate a large dataset and use it to fine-tune our embedding model, or we can just go ahead and use hybrid search.
The problem is that hybrid search isn't an easy thing to do. In the past, engineering teams had to maintain two separate solutions, a sparse search index and a dense search index, plus another system to merge the scores from both in an intelligent way and re-rank the results. With Pinecone, we don't need to handle any of that anymore; we have a single endpoint and Pinecone does the rest. We can even adjust whether we want to lean more towards a sparse search or more towards a dense vector search, with a new parameter called alpha.
So how does a typical hybrid search pipeline look? We start with our input data, which could be text, audio, or something else, and essentially we take that input data and create two vectors from it: a dense vector embedding and a sparse vector embedding. Everything else is handled by Pinecone; within the dotted lines of the diagram, that is just Pinecone doing its thing and building a highly optimized hybrid index. But before we get there, we still need to create those sparse and dense vector representations, so let's have a look at how we can actually do that.
To get started, we'll need to install a few dependencies: torch, datasets, transformers, and sentence-transformers. That's all we're going to need for this. One thing to point out now: the Pinecone Python client doesn't currently support the hybrid index, so we have to interface directly with the hybrid endpoints. For now, we have a helper class that handles a lot of that for us and essentially acts as a temporary, hybrid-index-enabled Python client.
You'll find a link to it in the video description if you'd like to follow along. For now, let's jump ahead to building the sparse and dense vectors. The first thing we need, obviously, is some data to create our embeddings from, and we're going to use a very domain-specific medical Q&A dataset called PubMed QA. Running the next cell downloads the dataset via Hugging Face Datasets, which we installed with the other dependencies.
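As a rough sketch of the loading step (the `pqa_labeled` config name and field layout are assumptions based on the dataset card; the video only names the dataset):

```python
from datasets import load_dataset

# PubMed QA: the "pqa_labeled" config holds 1,000 expert-annotated examples.
pubmed = load_dataset("pubmed_qa", "pqa_labeled", split="train")

# Flatten each record's context paragraphs into one string to index,
# and keep the questions around for querying later.
contexts = [" ".join(c["contexts"]) for c in pubmed["context"]]
questions = pubmed["question"]
```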
That will take just a moment. Once it has downloaded, we can see the dataset features. What's most important here is the context; I'm going to use the questions later as well. The context field contains the long paragraphs that we're going to index in Pinecone using both our dense and sparse vectors, and we have just a thousand of these. It's a pretty small dataset, but a good fit for this example.
The reason it's a good fit is that if we look at a few of these contexts, we can see the language is very specific. I can read one and it doesn't really make any sense to me, and if it doesn't make sense to a typical person, it probably doesn't make sense to a typical out-of-the-box pre-trained model either. Ideally, we would have a model that has been fine-tuned for this specific domain and understands this specific language. But let's say that's not possible: that is exactly where we would want to use hybrid search. So let's go ahead and look at how we can build our sparse vectors.
There are multiple methods for building sparse vectors; this is just one of them. We're going to use a BERT tokenizer, just the tokenizer, not the BERT transformer model itself. To start, we'll tokenize a single context, the one at position zero. Running that, we get input IDs, token type IDs, and an attention mask. If we were using BERT itself we would want to keep all of these tensors, but since we only want to create a sparse vector embedding, all we actually need are the input IDs.

Looking at them, we can see a list of integer ID values. Each of these token IDs represents a specific word or subword that has been extracted from our paragraph using the BERT tokenizer's rule-based tokenization logic. What we need to do is convert each of these big paragraphs, now represented as a list of token IDs, into a dictionary that simply maps each token ID to the number of times that token appears within the paragraph; in other words, a frequency dictionary. We can do that super easily using the Counter class from the collections module: we import it, run it over the input IDs, and get our frequency dictionary.
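In code, that whole step looks roughly like this (the exact BERT checkpoint is an assumption; the video only says it uses a BERT tokenizer):

```python
from collections import Counter
from transformers import BertTokenizerFast

# We only need the tokenizer, not the BERT model itself.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Tokenize a single context; of the returned tensors we only keep input_ids.
tokens = tokenizer(contexts[0])
input_ids = tokens["input_ids"]

# Map each token ID to its frequency within the paragraph.
sparse_vec = dict(Counter(input_ids))
print(sparse_vec)
```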
Most of these values are not high, and when you consider the total number of tokens in the vocabulary, most tokens don't appear at all, so they aren't even in the dictionary. That's essentially the definition of a sparse vector: you expect the information within the vector to be very sparse. If you imagine it as a full vector, most values would be zero, while some would be one or two or whatever other number, depending on the method you're using to build your sparse vector.
To make this easier, we're going to define a couple of functions so we don't need to rewrite everything every time. The first is just a function to build the dictionary we built with Counter, with one addition: it removes BERT's special tokens. These are tokens used by BERT for parts of its own processing, so BERT knows where a sequence starts and ends, or which tokens are padding. They only really have meaning for BERT itself, and carry none outside that context, so there's no point keeping them in our sparse embeddings; leaving them in would just add noise. We run that, and then we also run a second function that handles the creation of the sparse vectors from start to finish: we pass in a batch of contexts, tokenize everything, build the dictionaries, and return them. That's sparse vector creation done.
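As a sketch of the two helpers (assuming the tokenizer and Counter import from the previous snippet; the real notebook may differ in detail):

```python
def build_dict(input_batch):
    # Drop BERT's special tokens ([CLS], [SEP], [PAD], etc.); they only have
    # meaning inside BERT and would add noise to the sparse embeddings.
    special_ids = set(tokenizer.all_special_ids)
    sparse_embeds = []
    for token_ids in input_batch:
        counts = Counter(_id for _id in token_ids if _id not in special_ids)
        sparse_embeds.append(dict(counts))
    return sparse_embeds

def generate_sparse_vectors(context_batch):
    # Tokenize a batch of contexts and build the frequency dictionaries
    # from start to finish.
    input_ids = tokenizer(
        context_batch, padding=True, truncation=True, max_length=512
    )["input_ids"]
    return build_dict(input_ids)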
It's not too complex, but we also need to create our dense vectors, and this is actually more straightforward. We're just going to use the sentence-transformers library. If you have CUDA, that's great, you can use that, or you can use MPS if you're on a Mac. We'll initialize a sentence transformer model, a Q&A model, and because, as I'll explain in a moment, we're currently restricted to using dot product as our similarity metric, we ideally want a model that has been trained with dot product similarity, although a model trained with cosine similarity will work as well. Once initialized, we can encode some text really easily, just model.encode, and we get a dense vector embedding. After waiting a moment for the model to load and the encoding to run, we can see that we have a 384-dimensional dense vector embedding.
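A minimal version of the dense encoding step might look like this (the specific checkpoint is an assumption; the video only says it's a 384-dimensional Q&A model):

```python
import torch
from sentence_transformers import SentenceTransformer

# Use CUDA if available, MPS on Apple silicon, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else (
    "mps" if torch.backends.mps.is_available() else "cpu"
)

# A 384-dimensional Q&A model; this one is trained for cosine similarity,
# which the video notes also works with the dot product metric.
model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1", device=device)

emb = model.encode(contexts[0])
print(emb.shape)  # (384,)
```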
That's pretty much everything we need to do to build our sparse and dense vector representations. Now we need to initialize our hybrid index using the helper class we defined at the start, and then encode all of our contexts and add them to that hybrid index. So let's go back up to the top, to where we initialize everything. We'll need an API key. At the moment you won't get the key from the usual console, because the feature is in a private preview, so you'll have to request access to the hybrid index; again, the instructions are in the description of this video, so you can just follow those. I'll initialize this, which just sets up my connection to Pinecone.
Then I need to actually create a hybrid index, and there are a few things that are important to take note of here. First, we're using the dot product metric. That's important: at the moment, dot product is the only metric supported by a hybrid index, so you have to specify it here. Second, to actually get a hybrid index, we need to append an 'h' to the pod type we'd like to use. Here we're using an s1 pod, so we request the hybrid version of that s1 pod. When we run this, we should see a 201 status code, which just means the index has been created. Coming down a bit, we can describe the index (this part is closely aligned with the typical Pinecone client), and we can see the index is now ready. As soon as ready equals true, we can move on to the next bit.
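Since this goes through the preview-only helper rather than the standard client, here is one way the underlying request might look (the environment name and exact REST path are assumptions based on Pinecone's controller API at the time):

```python
import requests

api_key = "YOUR_API_KEY"  # requires access to the hybrid search preview

# Create a hybrid-enabled index: dot product is currently the only supported
# metric, and the pod type gets an "h" suffix (e.g. "s1.x1h" instead of "s1.x1").
res = requests.post(
    "https://controller.us-west1-gcp.pinecone.io/databases",
    headers={"Api-Key": api_key},
    json={
        "name": "hybrid-demo",
        "dimension": 384,        # matches the dense embedding model
        "metric": "dotproduct",
        "pod_type": "s1.x1h",
    },
)
print(res.status_code)  # expect 201 (created)
```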
That might take a few seconds, but it shouldn't take long, and then we can connect to the index. To start with, our index is completely empty; there's nothing in there at the moment, which is obviously what we'd expect, since we haven't added anything yet. So now let's go ahead and actually begin adding those sparse and dense vectors.

We're going to use the upsert function and iterate through all of our contexts in batches of 32 at any one time, so we set the batch size to 32. We've also got tqdm, which is just a progress bar; you may need to install it. The first thing we do in each step is find the end of the batch, so we're extracting 32 or fewer items at a time. We extract those contexts, create some IDs (just a count: 0, 1, 2, 3, and so on), and then add metadata to each of our records. The metadata is just the text of the context, and it lets us see a human-readable version of whatever we return; otherwise we'd only get vectors back, and we can't understand what those are. After that, we create our dense and sparse vectors, which just repeats what we did before, and then we build the record that we'll be adding to Pinecone.

The format here is slightly different from what you might be used to if you've used Pinecone before. Typically, a record would contain the ID, the values (just a dense vector), and the metadata, which here is the context. Because we're using a hybrid index, there's an extra field in there, the sparse values, which holds our sparse vector. Then we just upsert those records.
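Put together, the indexing loop might look something like this (assuming the `index` connection from the helper class, plus the `model` and `generate_sparse_vectors` helpers from earlier; the dictionary-style sparse values follow the preview API described here):

```python
from tqdm.auto import tqdm

batch_size = 32

for i in tqdm(range(0, len(contexts), batch_size)):
    # Find the end of this batch (the final batch may hold fewer than 32 items).
    i_end = min(i + batch_size, len(contexts))
    batch = contexts[i:i_end]
    # Simple string IDs: "0", "1", "2", ...
    ids = [str(n) for n in range(i, i_end)]
    # Attach the raw text as metadata so results are human-readable.
    metadata = [{"context": text} for text in batch]
    # Dense and sparse representations for the batch.
    dense = model.encode(batch).tolist()
    sparse = generate_sparse_vectors(batch)
    # Hybrid records carry both 'values' (dense) and 'sparse_values'.
    upserts = [
        {"id": _id, "values": d, "sparse_values": s, "metadata": m}
        for _id, d, s, m in zip(ids, dense, sparse, metadata)
    ]
    index.upsert(upserts)
```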
It's worth pointing out that this upsert is not the typical upsert, in the sense that it isn't going to the endpoint we would normally use. If we look at the helper class, the upsert here goes to the hybrid endpoint: before, requests went to vectors/upsert, and now the path simply has an extra 'hybrid' segment included. That's the only difference. Going back down, we can run this, and it will take a moment to add everything, though not too long. It took 47 seconds here, and we now have the 1,000 vectors in the index.
Now I want to move on to querying: how do we actually query our hybrid index? There's a slight difference here as well: we add a sparse vector to our query. We're using a small function that just encodes everything and then makes the query. We add a sparse vector item to the query request, we still have the dense vector, and everything else, apart from alpha, is the same as usual. Alpha is also a new parameter, and I'll explain it in a moment. Then we just query. Similar to the upsert endpoint, the query endpoint also now has 'hybrid' in front of it, so it's a hybrid query, but other than that there isn't really any difference.
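The shape of such a query might look roughly like this (the helper's query signature is an assumption based on what's described in the video):

```python
question = questions[0]  # a technical question from the dataset

# Encode the query both ways: dense with the sentence transformer, sparse
# with the same frequency-dictionary method used at indexing time.
dense_q = model.encode(question).tolist()
sparse_q = generate_sparse_vectors([question])[0]

result = index.query(
    top_k=5,
    vector=dense_q,
    sparse_vector=sparse_q,
    alpha=1.0,  # 1.0 = pure dense (semantic) search; we'll lower this shortly
    include_metadata=True,
)
```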
The question we're testing with is very technical and definitely out of domain for most models. We run it, and the first thing we'll do is a pure semantic search, a pure vector search. I'm not going to walk through the returned passages, because I hardly understand them myself, but the answer we want is the one with ID 711. We want that ranked at position one, yet it comes back at position two, which isn't bad, but it could be better. This is the result of a pure semantic or vector search, and the reason it's a pure semantic search is that we set alpha equal to one. Alpha is the parameter that lets us shift the weighting between sparse and dense vector search: at one, we're doing a purely dense vector search, which is what Pinecone would typically have done in the past; at zero, we're doing a fully sparse search; and anything in between is a hybrid search.
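One way to think about alpha is as a convex combination of the two scores; this is a sketch of the idea, not necessarily Pinecone's exact internal computation:

```python
def hybrid_score(dense_score: float, sparse_score: float, alpha: float) -> float:
    # alpha = 1.0 -> pure dense (semantic) search,
    # alpha = 0.0 -> pure sparse (keyword-style) search,
    # values in between blend the two.
    return alpha * dense_score + (1.0 - alpha) * sparse_score
```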
So we don't get the perfect result using pure dense vector search; let's have a look at what happens if we use a hybrid search. This time we set alpha to 0.3, so the query leans more towards sparse search than dense vector search while still using both, and if we scroll up to the top of the results, we can see that ID 711 is now returned at position one. Now we're getting the perfect result. That's just one example of hybrid search and how it can help us get better results really, really easily.
And that's it for our introduction to hybrid search and how we can implement it in Pinecone. With this, we're able to reap the benefits of dense vector retrieval, or vector search, whilst also sidestepping one of its most common pitfalls: out-of-domain search. If you'd like to get started with this hybrid search functionality in Pinecone, you'll have to go through the preview for now; if you're watching this in the future, it's probably already generally available, and you can just go ahead and install the Pinecone client. For now, if you're interested in trying it, the instructions for requesting access are in the video description. That's everything, so I hope this has all been interesting. Thank you very much for watching, and I'll see you again.