Pinecone's New *Hybrid* Search - the future of search?
Vector search has unlocked the door to another level of relevance and efficiency when it comes to retrieving data. In the past year alone, the number of vector search use cases has exploded, and there are no signs of that slowing down anytime soon. Now, the capabilities of vector search are pretty impressive, but it's not always a perfect technology. In fact, unless we have big domain-specific datasets to fine-tune embedding models with, traditional search still has some advantages. We repeatedly see that vector search unlocks incredible potential for really intelligent and powerful retrieval, but it struggles when adapting to a new domain, particularly when that new domain is very different from the domain the embedding model was fine-tuned on. Traditional search, on the other hand, adapts to new domains much better, but we're limited to a fixed level of performance.

So both approaches have their pros and cons, but what if we could eliminate a few of those cons? Could we create a hybrid search with the heightened performance potential of vector search and the zero-shot adaptability of traditional search? Today, that's exactly what we're going to look at: the new hybrid search that Pinecone has come up with, which merges vector search and a more traditional search into one single index.
Vector search, or dense retrieval, has been shown to significantly outperform traditional methods, but only when the embedding models creating these dense vector embeddings have been fine-tuned on the target domain. When we try to use the same models for out-of-domain tasks, that performance doesn't tend to hold up so well. That means if we have a large amount of data covering a very specific domain, like medical question answering, then we're okay: we can fine-tune our model and everything will work. But if we don't have a large amount of data to fine-tune our embedding model, then chances are we might actually find better performance from a traditional, sparse retrieval method like BM25. That gives us a best-case performance set by BM25, with no potential to fine-tune and push that performance towards more human-like, intelligent retrieval. So if we want better performance, we're left with two options: we either need to annotate a large dataset and use it to fine-tune our embedding model, or we can just go ahead and use hybrid search.
The problem is that hybrid search isn't an easy thing to do. In the past, engineering teams had to maintain two separate solutions, a sparse search index and a dense search index, plus another system to merge the scores from both in an intelligent way and re-rank the results. With Pinecone, we don't need to handle any of that anymore; we have a single endpoint and Pinecone does the rest. We can even adjust whether we want to lean more towards a sparse search or more towards a dense vector search, with a new parameter called alpha.
So how does a typical hybrid search pipeline look? We start with our input data, which could be text, audio, or something else, and essentially we take that input data and create two vectors from it: a dense vector embedding and a sparse vector embedding. Everything else is handled by Pinecone; within the dotted lines of the diagram, that is just Pinecone doing its thing and building a highly optimized hybrid index. But before we get there, we still need to create those sparse and dense vector representations, so let's have a look at how we can actually do that.
To get started, we'll need to install a few dependencies: torch, datasets, transformers, and sentence-transformers. That's all we're going to need for this. One thing to point out now: the Pinecone Python client doesn't currently support the hybrid index, so we have to interface directly with the hybrid endpoints. For now, we have a helper class that handles a lot of that for us and essentially acts as a temporary, hybrid-index-enabled Python client.
You'll find a link to it in the video description if you'd like to follow along. For now, let's jump ahead to building the sparse and dense vectors. The first thing we need, obviously, is some data to create our embeddings from, and we're going to use a very domain-specific medical Q&A dataset called PubMed QA. Running the next cell downloads the dataset via Hugging Face Datasets, which we installed with the other dependencies.
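As a rough sketch of the loading step (the `pqa_labeled` config name and field layout are assumptions based on the dataset card; the video only names the dataset):

```python
from datasets import load_dataset

# PubMed QA: the "pqa_labeled" config holds 1,000 expert-annotated examples.
pubmed = load_dataset("pubmed_qa", "pqa_labeled", split="train")

# Flatten each record's context paragraphs into one string to index,
# and keep the questions around for querying later.
contexts = [" ".join(c["contexts"]) for c in pubmed["context"]]
questions = pubmed["question"]
```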
That will take just a moment. Once it has downloaded, we can see the dataset features. What's most important here is the context; I'm going to use the questions later as well. The context field contains the long paragraphs that we're going to index in Pinecone using both our dense and sparse vectors, and we have just a thousand of these. It's a pretty small dataset, but a good fit for this example.
The reason it's a good fit is that if we look at a few of these contexts, we can see the language is very specific. I can read one and it doesn't really make any sense to me, and if it doesn't make sense to a typical person, it probably doesn't make sense to a typical out-of-the-box pre-trained model either. Ideally, we would have a model that has been fine-tuned for this specific domain and understands this specific language. But let's say that's not possible: that is exactly where we would want to use hybrid search. So let's go ahead and look at how we can build our sparse vectors.
There are multiple methods for building sparse vectors; this is just one of them. We're going to use a BERT tokenizer, just the tokenizer, not the BERT transformer model itself. To start, we'll tokenize a single context, the one at position zero. Running that, we get input IDs, token type IDs, and an attention mask. If we were using BERT itself we would want to keep all of these tensors, but since we only want to create a sparse vector embedding, all we actually need are the input IDs.

Looking at them, we can see a list of integer ID values. Each of these token IDs represents a specific word or subword that has been extracted from our paragraph using the BERT tokenizer's rule-based tokenization logic. What we need to do is convert each of these big paragraphs, now represented as a list of token IDs, into a dictionary that simply maps each token ID to the number of times that token appears within the paragraph; in other words, a frequency dictionary. We can do that super easily using the Counter class from the collections module: we import it, run it over the input IDs, and get our frequency dictionary.
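In code, that whole step looks roughly like this (the exact BERT checkpoint is an assumption; the video only says it uses a BERT tokenizer):

```python
from collections import Counter
from transformers import BertTokenizerFast

# We only need the tokenizer, not the BERT model itself.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Tokenize a single context; of the returned tensors we only keep input_ids.
tokens = tokenizer(contexts[0])
input_ids = tokens["input_ids"]

# Map each token ID to its frequency within the paragraph.
sparse_vec = dict(Counter(input_ids))
print(sparse_vec)
```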
Most of these values are not high, and when you consider the total number of tokens in the vocabulary, most tokens don't appear at all, so they aren't even in the dictionary. That's essentially the definition of a sparse vector: you expect the information within the vector to be very sparse. If you imagine it as a full vector, most values would be zero, while some would be one or two or whatever other number, depending on the method you're using to build your sparse vector.
To make this easier, we're going to define a couple of functions so we don't need to rewrite everything every time. The first is just a function to build the dictionary we built with Counter, with one addition: it removes BERT's special tokens. These are tokens used by BERT for parts of its own processing, so BERT knows where a sequence starts and ends, or which tokens are padding. They only really have meaning for BERT itself, and carry none outside that context, so there's no point keeping them in our sparse embeddings; leaving them in would just add noise. We run that, and then we also run a second function that handles the creation of the sparse vectors from start to finish: we pass in a batch of contexts, tokenize everything, build the dictionaries, and return them. That's sparse vector creation done.
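As a sketch of the two helpers (assuming the tokenizer and Counter import from the previous snippet; the real notebook may differ in detail):

```python
def build_dict(input_batch):
    # Drop BERT's special tokens ([CLS], [SEP], [PAD], etc.); they only have
    # meaning inside BERT and would add noise to the sparse embeddings.
    special_ids = set(tokenizer.all_special_ids)
    sparse_embeds = []
    for token_ids in input_batch:
        counts = Counter(_id for _id in token_ids if _id not in special_ids)
        sparse_embeds.append(dict(counts))
    return sparse_embeds

def generate_sparse_vectors(context_batch):
    # Tokenize a batch of contexts and build the frequency dictionaries
    # from start to finish.
    input_ids = tokenizer(
        context_batch, padding=True, truncation=True, max_length=512
    )["input_ids"]
    return build_dict(input_ids)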
It's not too complex, but we also need to create our dense vectors, and this is actually more straightforward. We're just going to use the sentence-transformers library. If you have CUDA, that's great, you can use that, or you can use MPS if you're on a Mac. We'll initialize a sentence transformer model, a Q&A model, and because, as I'll explain in a moment, we're currently restricted to using dot product as our similarity metric, we ideally want a model that has been trained with dot product similarity, although a model trained with cosine similarity will work as well. Once initialized, we can encode some text really easily, just model.encode, and we get a dense vector embedding. After waiting a moment for the model to load and the encoding to run, we can see that we have a 384-dimensional dense vector embedding.
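A minimal version of the dense encoding step might look like this (the specific checkpoint is an assumption; the video only says it's a 384-dimensional Q&A model):

```python
import torch
from sentence_transformers import SentenceTransformer

# Use CUDA if available, MPS on Apple silicon, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else (
    "mps" if torch.backends.mps.is_available() else "cpu"
)

# A 384-dimensional Q&A model; this one is trained for cosine similarity,
# which the video notes also works with the dot product metric.
model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1", device=device)

emb = model.encode(contexts[0])
print(emb.shape)  # (384,)
```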
That's pretty much everything we need to do to build our sparse and dense vector representations. Now we need to initialize our hybrid index using the helper class we defined at the start, and then encode all of our contexts and add them to that hybrid index. So let's go back up to the top, to where we initialize everything. We'll need an API key. At the moment you won't get the key from the usual console, because the feature is in a private preview, so you'll have to request access to the hybrid index; again, the instructions are in the description of this video, so you can just follow those. I'll initialize this, which just sets up my connection to Pinecone.
Then I need to actually create a hybrid index, and there are a few things that are important to take note of here. First, we're using the dot product metric. That's important: at the moment, dot product is the only metric supported by a hybrid index, so you have to specify it here. Second, to actually get a hybrid index, we need to append an 'h' to the pod type we'd like to use. Here we're using an s1 pod, so we request the hybrid version of that s1 pod. When we run this, we should see a 201 status code, which just means the index has been created. Coming down a bit, we can describe the index (this part is closely aligned with the typical Pinecone client), and we can see the index is now ready. As soon as ready equals true, we can move on to the next bit.
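Since this goes through the preview-only helper rather than the standard client, here is one way the underlying request might look (the environment name and exact REST path are assumptions based on Pinecone's controller API at the time):

```python
import requests

api_key = "YOUR_API_KEY"  # requires access to the hybrid search preview

# Create a hybrid-enabled index: dot product is currently the only supported
# metric, and the pod type gets an "h" suffix (e.g. "s1.x1h" instead of "s1.x1").
res = requests.post(
    "https://controller.us-west1-gcp.pinecone.io/databases",
    headers={"Api-Key": api_key},
    json={
        "name": "hybrid-demo",
        "dimension": 384,        # matches the dense embedding model
        "metric": "dotproduct",
        "pod_type": "s1.x1h",
    },
)
print(res.status_code)  # expect 201 (created)
```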
That might take a few seconds, but it shouldn't take long, and then we can connect to the index. To start with, our index is completely empty; there's nothing in there at the moment, which is obviously what we'd expect, since we haven't added anything yet. So now let's go ahead and actually begin adding those sparse and dense vectors.

We're going to use the upsert function and iterate through all of our contexts in batches of 32 at any one time, so we set the batch size to 32. We've also got tqdm, which is just a progress bar; you may need to install it. The first thing we do in each step is find the end of the batch, so we're extracting 32 or fewer items at a time. We extract those contexts, create some IDs (just a count: 0, 1, 2, 3, and so on), and then add metadata to each of our records. The metadata is just the text of the context, and it lets us see a human-readable version of whatever we return; otherwise we'd only get vectors back, and we can't understand what those are. After that, we create our dense and sparse vectors, which just repeats what we did before, and then we build the record that we'll be adding to Pinecone.

The format here is slightly different from what you might be used to if you've used Pinecone before. Typically, a record would contain the ID, the values (just a dense vector), and the metadata, which here is the context. Because we're using a hybrid index, there's an extra field in there, the sparse values, which holds our sparse vector. Then we just upsert those records.
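Put together, the indexing loop might look something like this (assuming the `index` connection from the helper class, plus the `model` and `generate_sparse_vectors` helpers from earlier; the dictionary-style sparse values follow the preview API described here):

```python
from tqdm.auto import tqdm

batch_size = 32

for i in tqdm(range(0, len(contexts), batch_size)):
    # Find the end of this batch (the final batch may hold fewer than 32 items).
    i_end = min(i + batch_size, len(contexts))
    batch = contexts[i:i_end]
    # Simple string IDs: "0", "1", "2", ...
    ids = [str(n) for n in range(i, i_end)]
    # Attach the raw text as metadata so results are human-readable.
    metadata = [{"context": text} for text in batch]
    # Dense and sparse representations for the batch.
    dense = model.encode(batch).tolist()
    sparse = generate_sparse_vectors(batch)
    # Hybrid records carry both 'values' (dense) and 'sparse_values'.
    upserts = [
        {"id": _id, "values": d, "sparse_values": s, "metadata": m}
        for _id, d, s, m in zip(ids, dense, sparse, metadata)
    ]
    index.upsert(upserts)
```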
It's worth pointing out that this upsert is not the typical upsert, in the sense that it isn't going to the endpoint we would normally use. If we look at the helper class, the upsert here goes to the hybrid endpoint: before, requests went to vectors/upsert, and now the path simply has an extra 'hybrid' segment included. That's the only difference. Going back down, we can run this, and it will take a moment to add everything, though not too long. It took 47 seconds here, and we now have the 1,000 vectors in the index.
Now I want to move on to querying: how do we actually query our hybrid index? There's a slight difference here as well: we add a sparse vector to our query. We're using a small function that just encodes everything and then makes the query. We add a sparse vector item to the query request, we still have the dense vector, and everything else, apart from alpha, is the same as usual. Alpha is also a new parameter, and I'll explain it in a moment. Then we just query. Similar to the upsert endpoint, the query endpoint also now has 'hybrid' in front of it, so it's a hybrid query, but other than that there isn't really any difference.
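The shape of such a query might look roughly like this (the helper's query signature is an assumption based on what's described in the video):

```python
question = questions[0]  # a technical question from the dataset

# Encode the query both ways: dense with the sentence transformer, sparse
# with the same frequency-dictionary method used at indexing time.
dense_q = model.encode(question).tolist()
sparse_q = generate_sparse_vectors([question])[0]

result = index.query(
    top_k=5,
    vector=dense_q,
    sparse_vector=sparse_q,
    alpha=1.0,  # 1.0 = pure dense (semantic) search; we'll lower this shortly
    include_metadata=True,
)
```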
The question we're testing with is very technical and definitely out of domain for most models. We run it, and the first thing we'll do is a pure semantic search, a pure vector search. I'm not going to walk through the returned passages, because I hardly understand them myself, but the answer we want is the one with ID 711. We want that ranked at position one, yet it comes back at position two, which isn't bad, but it could be better. This is the result of a pure semantic or vector search, and the reason it's a pure semantic search is that we set alpha equal to one. Alpha is the parameter that lets us shift the weighting between sparse and dense vector search: at one, we're doing a purely dense vector search, which is what Pinecone would typically have done in the past; at zero, we're doing a fully sparse search; and anything in between is a hybrid search.
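One way to think about alpha is as a convex combination of the two scores; this is a sketch of the idea, not necessarily Pinecone's exact internal computation:

```python
def hybrid_score(dense_score: float, sparse_score: float, alpha: float) -> float:
    # alpha = 1.0 -> pure dense (semantic) search,
    # alpha = 0.0 -> pure sparse (keyword-style) search,
    # values in between blend the two.
    return alpha * dense_score + (1.0 - alpha) * sparse_score
```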
So we don't get the perfect result using pure dense vector search; let's have a look at what happens if we use a hybrid search. This time we set alpha to 0.3, so the query leans more towards sparse search than dense vector search while still using both, and if we scroll up to the top of the results, we can see that ID 711 is now returned at position one. Now we're getting the perfect result. That's just one example of hybrid search and how it can help us get better results really, really easily.
And that's it for our introduction to hybrid search and how we can implement it in Pinecone. With this, we're able to reap the benefits of dense vector retrieval, or vector search, whilst also sidestepping one of its most common pitfalls: out-of-domain search. If you'd like to get started with this hybrid search functionality in Pinecone, you'll have to go through the preview for now; if you're watching this in the future, it's probably already generally available, and you can just go ahead and install the Pinecone client. For now, if you're interested in trying it, the instructions for requesting access are in the video description. That's everything, so I hope this has all been interesting. Thank you very much for watching, and I'll see you again.