
Pinecone's New *Hybrid* Search - the future of search?


Transcript

Vector search has unlocked the door to another level of relevance and efficiency when it comes to retrieving data. In the past year alone, the number of vector search use cases has exploded, and there are no signs of that slowing down anytime soon. Now, the capabilities of vector search are pretty impressive, but it's not always a perfect technology.

In fact, unless we have big domain-specific datasets to fine-tune embedding models with, traditional search still has some advantages. We repeatedly see that vector search unlocks incredible potential for really intelligent and powerful retrieval performance, but it struggles when adapting to a new domain, particularly when that new domain is very different from the one the embedding model was fine-tuned on.

Traditional search, on the other hand, adapts to new domains much better, but we're limited to a very specific performance ceiling. So both approaches have their pros and cons, but what if we could somehow eliminate a few of those cons? Could we create a hybrid search with the heightened performance potential of vector search and the zero-shot adaptability of traditional search?

Today, that's exactly what we're going to look at: the new hybrid search that Pinecone has come up with, which merges both vector search and a more traditional search into one single index. Vector search, or dense retrieval, has been shown to significantly outperform traditional methods, but only when the embedding models creating these dense vector embeddings have been fine-tuned on the target domain.

When we try to use the same models for out-of-domain tasks, this performance doesn't tend to hold up so well. That means if we have a large amount of data covering a very specific domain, like medical question answering, then we're okay: we can fine-tune our model and everything will work. That's great.

But if we don't have a large amount of data to fine-tune our embedding model, then the chances are we'll actually find better performance from a traditional, sparse retrieval method like BM25. That leaves us with a best-case performance ceiling set by BM25, and no potential to fine-tune and improve on it to get more human-like, intelligent retrieval.

So if we want better performance, we're left with two options: we either need to annotate a large dataset and use it to fine-tune our embedding model, or we can just go ahead and use hybrid search. The problem is that hybrid search isn't a very easy thing to do.

In the past, engineering teams had to maintain two separate solutions, a sparse search index and a dense search index, plus another system that merged the scores from both in an intelligent way and re-ranked the results. With Pinecone, we don't need to handle that anymore: we just have a single endpoint, and Pinecone does the rest.

We can even adjust whether we want to lean more towards a sparse search or more towards a dense vector search with a new parameter called alpha. So how does a typical hybrid search pipeline look? Well, we start with our input data, which could be text, audio, or something else, and essentially we're going to take that input data and create two vectors from it.

We're going to create our dense vector embedding, and we're going to create our sparse vector embedding. Then everything else is handled by Pinecone: within those dotted lines, you can see that it's just Pinecone doing its thing and building this very optimized hybrid index. But obviously, before we get there, we still need to create those sparse and dense vector representations.

So let's have a look at how we can actually do that. To get started, we need to install a few dependencies: Torch, Datasets, Transformers, and Sentence Transformers. That's all we're going to need for this. And for now, I just want to point out that the Pinecone Python client doesn't currently support the hybrid index.
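If you're following along in a notebook, an install cell like the following should cover it (package names inferred from the libraries listed above; versions unpinned):

```python
# install the dependencies for this walkthrough (notebook cell)
!pip install torch datasets transformers sentence-transformers
```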

We have to interface directly with the hybrid endpoint. So, for now, we have this sort of helper class that handles a lot of that for us and essentially acts as a temporary, hybrid-index-enabled Python client. You'll find a link to this in the video description if you'd like to follow along.

What we'll do for now is jump ahead directly to building the sparse and dense vectors. The first thing we need, obviously, is some data to create our embeddings from, and we're going to use a very domain-specific medical Q&A dataset called PubMed QA. When we run that, it's going to download the dataset from Hugging Face Datasets, which you can install as shown above if you haven't already.
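A minimal sketch of that download step; the pqa_labeled config name is an assumption based on the 1,000-example labeled split described below:

```python
from datasets import load_dataset

# PubMed QA's labeled config contains roughly 1,000 expert-annotated examples
pubmed = load_dataset("pubmed_qa", "pqa_labeled", split="train")
print(pubmed)
```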

That will just take a moment. Okay, once it has downloaded, we'll be able to see the dataset features. What's most important here is that we have the context; I'm going to use the questions later as well. The context field contains all of the long paragraphs that we're going to index in Pinecone using both our dense and sparse vectors.

And we have just a thousand of these, so it's a pretty small dataset, but pretty good for this example. The reason it's pretty good is that if we have a look at a few of the contexts we're indexing here, we can see that the language is very specific.

Like, I can read this and it doesn't really make any sense to me. And if it doesn't make sense to a typical person, it probably doesn't make sense to a typical out-of-the-box pre-trained model. So what we would ideally have here is a model that has been fine-tuned for this specific domain and understands this specific language.

But let's say that's not possible. That is where we would want to use hybrid search. So let's go ahead and have a look at how we can build our sparse vectors. Now, there are multiple methods for building sparse vectors, and this is just one of them: we're going to use a BERT tokenizer.

We're just going to use the BERT tokenizer; I'm not going to use the BERT transformer itself. What we're going to do is tokenize a single context to get started, so just the context at position 0. We run that, and what we'll see is that we get input IDs, token type IDs, and an attention mask.

Now, if we were using BERT, we would want to keep all of these tensors, but we're only creating our sparse vector embedding, so in reality all we need are the input IDs. Let's have a look at what they look like: you can see that we just have all these integer ID values.
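Here's a minimal sketch of that step, assuming the standard bert-base-uncased tokenizer and that `contexts` is a flat Python list of the paragraph strings extracted from the dataset's context field:

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# tokenize a single paragraph; BERT tokenizers return input_ids,
# token_type_ids, and attention_mask
tokens = tokenizer(contexts[0])
input_ids = tokens["input_ids"]  # the only field we need for sparse vectors
print(input_ids[:10])
```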

Okay. Each one of these IDs, or token IDs if you like, represents a specific word or subword that has been extracted from our paragraph using the BERT tokenizer's rule-based tokenization logic. So each one is just a unique word or subword. Now, each of our big paragraphs has been converted into one of these input ID lists, or token ID lists.

We need to convert that list into a dictionary that simply maps each token ID to the number of times that token appears within the paragraph: a frequency dictionary. We can do that super easily using the Counter class from collections. So we import that and we just run this.
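Continuing the sketch above, that step is just:

```python
from collections import Counter

# map each token ID to how many times it appears in this paragraph
sparse_vec = dict(Counter(input_ids))
print(sparse_vec)
```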

And we can see that we now have this. Right, so most of these are not going to be high values, and when you consider the tokenizer's full vocabulary, most tokens don't appear at all, so they're not even within this dictionary. And that's essentially the definition of a sparse vector.

You're expecting the information within that vector to be very sparse: most values, if you imagine it as a vector, would be zero, but some of them are one or two, or whatever other number, depending on the method you're using to build your sparse vector.

So, just to make this easier, we're going to define a couple of functions to do all of this without us needing to rewrite everything every time. This first one is just a function to build the dictionary we just built using Counter. One additional thing we're doing here is removing all of these tokens.

These are special tokens used by BERT as part of its processing: they tell BERT that this is the start of a sequence, this is the end of a sequence, this is a padding token, and so on. They're tokens that only BERT really needs, and they don't have any meaning outside of that context.

So there's no point in having them within our sparse embeddings; we just remove them, because otherwise we're only adding noise. We run that, and then we also run this second function, which is going to handle the creation of those sparse vectors from start to finish.

We're just going to pass in a batch of contexts, tokenize everything, and then build those dictionaries and return them. Okay, that's the sparse vector creation; it's not too complex. But we also need to create our dense vectors, and actually this is more straightforward.
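The two helpers might look roughly like this (a sketch continuing the snippets above; the special-token IDs shown are those of bert-base-uncased):

```python
def build_dict(input_batch):
    # bert-base-uncased special tokens: [PAD]=0, [UNK]=100, [CLS]=101,
    # [SEP]=102, [MASK]=103 -- meaningless outside BERT, so drop them
    special_ids = {0, 100, 101, 102, 103}
    sparse_vecs = []
    for input_ids in input_batch:
        counts = Counter(i for i in input_ids if i not in special_ids)
        sparse_vecs.append(dict(counts))
    return sparse_vecs


def generate_sparse_vectors(context_batch):
    # tokenize a batch of paragraphs, then build the frequency dictionaries
    input_ids = tokenizer(
        context_batch, padding=True, truncation=True, max_length=512
    )["input_ids"]
    return build_dict(input_ids)
```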

So we're just going to use the Sentence Transformers library. If you have CUDA, that's great, you can use that, or you can use MPS if you're on a Mac. And we're just going to initialize this sentence transformer model. It's a Q&A model, and because, as I'll explain in a moment, we're currently restricted to using dot product as our similarity metric,

we want to use a model that has been trained with dot product similarity or, failing that, cosine similarity; dot product is the more ideal choice, but cosine will work as well. So we initialize that, and then we can encode some text really easily: just model.encode, and then we get this dense vector embedding.

We wait a moment for the model to load and for this to run. Okay, and we can see that we have this 384-dimensional dense vector embedding. That is pretty much everything we need to do in terms of building our sparse and dense vector representations. Now what we need to do is initialize our hybrid index using that helper class we defined at the start, and then we'll encode all of our contexts and add them to that hybrid index.
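A sketch of the dense side; the model name here is an assumption, standing in for any 384-dimensional QA model trained with dot product similarity:

```python
import torch
from sentence_transformers import SentenceTransformer

# prefer CUDA if available; on a Mac you could pass device="mps" instead
device = "cuda" if torch.cuda.is_available() else "cpu"

# a QA model trained for dot product similarity, producing 384-d vectors
model = SentenceTransformer("multi-qa-MiniLM-L6-dot-v1", device=device)

dense_vec = model.encode(contexts[0])
print(dense_vec.shape)  # (384,)
```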

So let's go back up to the top and find where we're initializing everything. Here we are. We'll need an API key now. At the moment you won't get the API key from here, because the hybrid index is in a private preview, so you'll have to request access; again, everything is in the description of this video, so you can just follow those instructions.

So I'm going to initialize this. That just initializes my connection to Pinecone, and then what I need to do is actually create a hybrid index. Now, there are a few things that are important to take note of here. One, we're using the dot product metric; that's important because, at the moment, the only metric supported by a hybrid index is dot product.

So you have to specify that you want to use the dot product metric here. Another thing is that, to actually use a hybrid index, we need to append an "h" to the pod type we'd like to use. So right here we're using an s1 pod, and we're using the hybrid index version of that s1 pod. When we run that, we should see a 201 response, which just means that the index has been created.
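Because the preview client isn't public, the exact call is an assumption; through the helper class it might look roughly like this, where `hybrid` is the helper instance and the index name, dimension, and "s1h" pod-type spelling are all hypothetical:

```python
# hypothetical helper usage; the real interface is in the linked notebook
hybrid.create_index(
    name="hybrid-demo",
    dimension=384,        # must match the dense model's output dimension
    metric="dotproduct",  # the only metric hybrid indexes currently support
    pod_type="s1h",       # the "h" suffix selects the hybrid s1 pod
)
# a 201 status code confirms the index was created
```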

If we come down here, we can describe the index; this is pretty well aligned with the typical Pinecone client, and we can see that the index is now ready. As soon as ready is equal to true, we can move on to the next bit. That might take a few seconds, but it shouldn't take long, and then we can connect to the index.

Okay, great. So we can see, to start with, that our index is completely empty. There's nothing in there at the moment, which is obviously what we would expect, since we haven't added anything yet. So now let's go ahead and actually begin adding those sparse and dense vectors. We'll come down, and we're going to be using the upsert function; what we're going to do is iterate through all of our contexts in batches of 32 at a time.

So we set the batch size to 32 here. We've got tqdm, which is just a progress bar; you may need to install that with pip install tqdm. The first thing we do is find the end of the batch, so we're just extracting 32 or fewer items at a time.

We extract those contexts, create some IDs, which are just a count like 0, 1, 2, 3, and so on, and then we want to add metadata to each one of our records. This is just the text of the context, and it allows us to see a human-readable version of whatever we're returning; otherwise we'd only get the vectors back, and we can't understand what they are.

After that, we create our dense and sparse vectors, which is just repeating what we did before, and then we create the vector, or rather the record, that we'll be adding to Pinecone. There's a slightly different format here from what we might be used to if we've used Pinecone before.

Typically, a Pinecone record would look like this: we'd have the ID, the values, which is just the dense vector, and then the metadata, which is our context. But since we're using a hybrid index, there's an extra field in there, and that is sparse values, which is just our sparse vector. Then we just upsert those records.
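Putting the loop together, a sketch under the same assumptions as before (`index` is the connected helper object; whether the endpoint wants the raw frequency dict or some other sparse format is handled inside the helper, so the raw dict is shown here):

```python
from tqdm.auto import tqdm  # progress bar

batch_size = 32
for i in tqdm(range(0, len(contexts), batch_size)):
    i_end = min(i + batch_size, len(contexts))        # end of this batch
    batch = contexts[i:i_end]
    ids = [str(n) for n in range(i, i_end)]           # simple count-style IDs
    metadata = [{"context": text} for text in batch]  # human-readable payload

    dense_vecs = model.encode(batch).tolist()
    sparse_vecs = generate_sparse_vectors(batch)

    # hybrid records carry sparse_values alongside the usual fields
    upserts = [
        {"id": _id, "values": dv, "sparse_values": sv, "metadata": meta}
        for _id, dv, sv, meta in zip(ids, dense_vecs, sparse_vecs, metadata)
    ]
    index.upsert(upserts)  # routed to the hybrid upsert endpoint
```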

Okay, now it's worth pointing out that this upsert is not the typical upsert, or rather it isn't going to the same endpoint that we would typically use. If we come up to our helper class up here, the upsert here is going to the hybrid endpoint, whereas before it was going to vectors/upsert.

All it has is this extra hybrid path included in there; that's the only difference. So if we go back down to here, you can run this, and it will take a moment to add everything, though it won't take too long. Okay, that took 47 seconds, and we now have the 1,000 vectors in there.

So now what I want to do is move on to querying: how do we actually query our hybrid index? There's a slight difference again, in that we add a sparse vector to our query. We're just using this function; all it's doing is encoding everything and then making our query.

So we just add this sparse vector item to our query request; we still have the dense vector, and everything else, apart from alpha, is also the same. Alpha is also a new parameter, and I'll explain it in a moment. Then we just query. The query, again, similar to the upsert endpoint, also has hybrid in front of it, so now it's hybrid query, but other than that there isn't really any difference.
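Such a query function might look like this (same caveats about the helper's exact signature):

```python
def hybrid_query(question, top_k=5, alpha=1.0):
    # encode the question with both the dense model and the sparse tokenizer
    dense_vec = model.encode(question).tolist()
    sparse_vec = generate_sparse_vectors([question])[0]

    # the request carries both vectors plus the alpha weighting parameter
    return index.query(
        vector=dense_vec,
        sparse_vector=sparse_vec,
        alpha=alpha,
        top_k=top_k,
        include_metadata=True,  # return the human-readable context too
    )
```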

So we have this question, which is very technical and definitely out of domain for most models. We run that, and what we're going to do first is a pure semantic search, or pure vector search. So let's run that. Looking at the results, I'm not going to go through them, because I hardly understand them myself, but the answer we want is actually ID 711 here.

We want that to be ranked at position number one, but it's not; it's ranked at position number two, which isn't bad, but it could be better. So this is the result of a pure semantic, or vector, search, and the reason it's a pure semantic search is that we set alpha equal to one.

Now, alpha is the parameter that allows us to shift the weighting between sparse and dense vector search. At one, we're doing a purely dense vector search, which is what Pinecone would typically have done in the past; at zero, we're doing a fully sparse search; and anything in between is a hybrid search.
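Conceptually, you can think of alpha as a convex combination of the two scores (a mental model, not necessarily Pinecone's exact internals):

```python
def hybrid_score(dense_score, sparse_score, alpha):
    # alpha = 1.0 -> purely dense; alpha = 0.0 -> purely sparse
    return alpha * dense_score + (1 - alpha) * sparse_score
```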

Okay, so we don't get the perfect results when we're using a pure dense vector search, so let's have a look at what happens if we use a hybrid search. We run this, setting alpha to 0.3, so it's more of a sparse search than a dense vector search, but still using both; and if we come up to the top, we can see that ID 711 has been returned at position one, so now we're getting the perfect result.

So that's just one example of hybrid search and how it can help us get better results really easily. And that's it for our introduction to hybrid search and how we can implement it in Pinecone. With this, we're able to reap the benefits of dense vector retrieval, or vector search, whilst also sidestepping one of its most common pitfalls: out-of-domain search.

Now, if you would like to get started with this hybrid search functionality in Pinecone, you'll have to go through the preview at the moment. If you're watching this in the future, it's probably already generally available, and you can just go ahead and install the Pinecone client; but for now, if you're interested in trying it, the instructions to request access will be in the video description.

But for now, that is everything, so I hope this has all been interesting. Thank you very much for watching, and I'll see you again in the next one. Bye.