
Pinecone's New *Hybrid* Search - the future of search?


Whisper Transcript

00:00:00.000 | Vector search has unlocked the door to another level of relevance and
00:00:05.880 | efficiency when it comes to retrieving data. In the past year alone, the number
00:00:11.800 | of vector search use cases has exploded, and there are no signs of that slowing
00:00:18.320 | down anytime soon. Now, the capabilities of vector search are pretty impressive,
00:00:23.800 | but it's not always a perfect technology. In fact, unless we have big
00:00:29.960 | domain-specific datasets to fine-tune embedding models with, traditional search still has
00:00:35.480 | some advantages. We repeatedly see that vector search unlocks incredible
00:00:42.200 | potential for really intelligent and powerful retrieval performance, but it
00:00:50.360 | really struggles when adapting to a new domain, particularly when that new
00:00:55.280 | domain is very different from the domain that the embedding model was fine-tuned
00:01:00.560 | on. Traditional search, on the other hand, adapts to new domains much better, but
00:01:07.260 | we're limited to a very specific performance level. So both approaches
00:01:12.680 | have their pros and cons, but what if we could somehow eliminate a few
00:01:18.080 | of those cons? Could we create a hybrid search with the heightened performance
00:01:23.680 | potential of vector search and the zero-shot adaptability of traditional
00:01:28.880 | search? Today, that's exactly what we're going to look at: the new
00:01:32.160 | hybrid search that Pinecone has come up with, which merges vector search
00:01:38.800 | and a more traditional search into one single index.
00:01:44.480 | Vector search, or dense retrieval, has been shown to significantly outperform traditional
00:01:49.040 | methods, but only when the embedding models creating these dense
00:01:53.480 | vector embeddings have been fine-tuned on the target domain. When we try to
00:01:59.720 | use the same models for out-of-domain tasks, this performance doesn't tend to
00:02:04.160 | hold up so well. That means if we have a large amount of data covering a very
00:02:09.280 | specific domain, like medical question answering, then we're okay: we can
00:02:14.560 | fine-tune our model, everything will work, and that's great. But if we don't have a
00:02:18.840 | large amount of data to fine-tune our embedding model, then chances are
00:02:23.040 | we might actually find better performance from a traditional, or
00:02:27.680 | sparse, retrieval method like BM25. That leaves our best-case performance capped
00:02:34.080 | by BM25, with no potential to fine-tune and improve that
00:02:38.600 | performance towards more human-like, intelligent retrieval. So if we want
00:02:43.640 | better performance, we're left with two options: we either need to annotate a
00:02:47.800 | large dataset to fine-tune our embedding model with, or we can just go
00:02:53.680 | ahead and use hybrid search. The problem is that hybrid search isn't a very easy
00:02:59.520 | thing to do. In the past, engineering teams had to maintain two separate solutions,
00:03:05.800 | a sparse search index and a dense search index, and they
00:03:10.360 | would need yet another system to merge the scores from both of those in
00:03:15.640 | an intelligent way and re-rank the results. With Pinecone, we don't
00:03:21.040 | need to handle that anymore; we just have a single endpoint, and Pinecone does the
00:03:26.720 | rest. We can even adjust whether we lean more towards a sparse search
00:03:31.520 | or more towards a dense vector search with a new parameter called alpha. So how does
00:03:37.520 | a typical hybrid search pipeline look? Well, we start with our input data,
00:03:43.800 | which could be text, audio, or something else, and essentially what we're going to
00:03:47.800 | do is take that text or other input data and
00:03:53.600 | create two vectors from it: our dense vector embedding and
00:03:57.360 | our sparse vector embedding. Everything else is
00:04:01.280 | handled by Pinecone. Within those dotted lines, you can see it's just Pinecone
00:04:06.280 | doing its thing and building this very optimized hybrid index. But obviously,
00:04:12.080 | before we get there, we still need to create those sparse and dense vector
00:04:17.180 | representations. So let's have a look at how we can actually do that. To get
00:04:22.040 | started, we need to install a few dependencies: Torch,
00:04:26.760 | Datasets, Transformers, and Sentence Transformers. That's all we're going
00:04:30.960 | to need for this.
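
In a notebook, that install step might look like this:

```
!pip install torch datasets transformers sentence-transformers
```
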
00:04:38.680 | Now, I just want to point out that the Pinecone Python client doesn't currently
00:04:45.860 | support the hybrid index, so we have to interface directly with the hybrid
00:04:53.760 | endpoint. For now, we have a helper class that handles a lot of that for us and
00:04:57.760 | essentially acts as a temporary hybrid-index-enabled Python client. You'll find
00:05:04.640 | a link to it in the video description if you'd like to follow along. For now,
00:05:09.800 | let's jump ahead to building the sparse and dense vectors.
00:05:14.240 | So the first thing we need, obviously, is some data to create our embeddings from,
00:05:19.160 | and we're going to use a very domain-specific medical Q&A dataset
00:05:23.640 | called PubMed QA. If we run that, it's going to download the dataset
00:05:29.800 | from Hugging Face Datasets, which you can install like that if you haven't already.
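
For reference, a minimal sketch of the loading step, assuming the `pqa_labeled` config (the 1,000-example labeled subset) and that each record's list of context passages gets joined into a single paragraph:

```python
from datasets import load_dataset

# load the labeled PubMed QA subset (1,000 examples)
pubmed = load_dataset("pubmed_qa", "pqa_labeled", split="train")

# flatten each record's list of context passages into one paragraph
contexts = [" ".join(ctx["contexts"]) for ctx in pubmed["context"]]
print(len(contexts))  # 1000
```
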
00:05:34.680 | That will just take a moment. Okay, once it has downloaded, we'll be able
00:05:39.640 | to see the dataset features. What's most important here is that we have the
00:05:44.560 | contexts. I'm going to use the questions later as well, but the contexts
00:05:47.920 | contain all of the long paragraphs that we're going to index in Pinecone using
00:05:54.560 | both our dense and sparse vectors. We have just a thousand of these, so it's a
00:05:59.280 | pretty small dataset, but pretty good for this example. The reason it's
00:06:04.920 | pretty good is that if we have a look at a few of the contexts we're
00:06:09.520 | using here, we can see the language is very domain-specific. I can read this,
00:06:16.560 | and it doesn't really make any sense to me. And if it doesn't make sense to a
00:06:21.760 | typical person, that means it probably doesn't make sense to a typical out-of-
00:06:27.840 | the-box pre-trained model. So what we would ideally have here is a model that
00:06:34.320 | has been fine-tuned for this specific domain and understands this specific
00:06:38.820 | language. But let's say that's not possible. That is where we would want to
00:06:44.520 | use hybrid search. So let's go ahead and have a look at how we can build
00:06:49.000 | our sparse vectors. Now, there are multiple methods for building sparse
00:06:55.240 | vectors, and this is just one of them. We're going to use a BERT
00:06:59.560 | tokenizer, just the tokenizer, not the
00:07:04.200 | BERT transformer itself. What we're going to do is tokenize a
00:07:08.880 | single context to get started, just the context at position 0. We run that,
00:07:15.720 | and what we see is that we get input IDs, token type IDs, and an attention
00:07:21.680 | mask. Now, if we were using BERT, we would want to keep all of these tensors, but
00:07:28.200 | we only want to create our sparse vector embedding, so in reality all we
00:07:33.160 | need now are the input IDs. So let's have a look at what they look like.
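
A rough sketch of that tokenization step; the checkpoint name, and the `contexts` list from earlier, are assumptions:

```python
from transformers import BertTokenizerFast

# we only need the tokenizer, not the BERT model itself
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# tokenize a single context paragraph
tokens = tokenizer(contexts[0])
print(tokens.keys())        # input_ids, token_type_ids, attention_mask
print(tokens["input_ids"])  # e.g. [101, 2000, 5136, ...]
```
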
00:07:40.760 | We can see that we just have all these integer ID values. Each one of these
00:07:45.080 | IDs, or token IDs if you like, represents a specific word or
00:07:50.880 | subword extracted from our paragraph by the BERT tokenizer's
00:07:58.280 | rule-based tokenization logic. So each one of those is just a unique word or
00:08:05.440 | subword. What we need to do is take each of those big
00:08:11.760 | paragraphs, now converted into an input ID (or token ID) list, and
00:08:16.680 | convert it into a dictionary that simply maps each token ID to the
00:08:23.400 | number of times that token appears within that paragraph. So it's like a
00:08:28.360 | frequency dictionary, and we can build it super easily using the Counter class
00:08:34.280 | from collections. So we import that and we just run this.
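
That frequency-dictionary step might look roughly like this:

```python
from collections import Counter

# map each token ID to how often it appears in this paragraph
sparse_vec = dict(Counter(tokens["input_ids"]))
print(sparse_vec)  # e.g. {101: 1, 7592: 2, 2088: 1, ...}
```
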
00:08:40.480 | We can see that we now have our frequency dictionary. Most of these are not
00:08:46.120 | high values, and when you consider the total number of tokens in the vocabulary,
00:08:51.600 | most tokens don't appear at all, so they're not even within this dictionary.
00:08:54.600 | And that's essentially the definition of a sparse vector: you expect the
00:08:59.840 | information within that vector to be very sparse. So most values, if you
00:09:05.320 | imagine it as a vector, would be zero, while some of them are one or two or
00:09:10.360 | whatever other number, depending on the method that you're using to
00:09:14.820 | build your sparse vector. So, to make this easier, we're going to define a
00:09:19.680 | couple of functions that do all of this without us needing to rewrite everything
00:09:24.120 | every time. The first is just a function to build the dictionary that we just
00:09:29.440 | built using Counter. One additional thing that we're doing here
00:09:36.120 | is removing all of these tokens here; these are special
00:09:43.040 | tokens used by BERT as part of its internal processing. So,
00:09:49.920 | for example, BERT knows that this is the start of a sequence, the end of a
00:09:55.160 | sequence, or a padding token. They are special tokens that only
00:10:00.840 | BERT really needs, and they don't have any meaning outside of that context,
00:10:05.600 | so there's no point in keeping them within our sparse embeddings. We just
00:10:10.400 | remove them, because otherwise we're just adding noise to those embeddings. We
00:10:16.320 | run that, and then we also run this, which handles the creation of
00:10:21.280 | those sparse vectors from start to finish: we pass in a batch
00:10:25.320 | of contexts, tokenize everything, and then build
00:10:29.840 | those dictionaries and return them.
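
The exact helpers aren't shown in the transcript, so this is only a minimal sketch of what they might look like; the names `build_dict` and `generate_sparse_vectors` are hypothetical:

```python
def build_dict(input_ids):
    # drop BERT's special tokens ([CLS], [SEP], [PAD], etc.), since they
    # carry no meaning outside of BERT itself
    special_ids = set(tokenizer.all_special_ids)
    return {tid: count for tid, count in Counter(input_ids).items()
            if tid not in special_ids}

def generate_sparse_vectors(context_batch):
    # tokenize a batch of contexts and build one frequency dict per context
    input_ids = tokenizer(context_batch)["input_ids"]
    return [build_dict(ids) for ids in input_ids]
```
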
00:10:35.360 | That's the sparse vector creation; it's not too complex. But we also need to
00:10:40.760 | create our dense vectors, and actually this is more straightforward. We're just
00:10:45.040 | going to use the Sentence Transformers library. If you have CUDA, that's great,
00:10:49.360 | you can use that, or you can use MPS if you're on a Mac. And we're going to
00:10:55.600 | initialize this sentence transformer model. It's a Q&A model, and because, as
00:11:01.760 | I'll explain in a moment, we're restricted to using dot product as our similarity
00:11:04.880 | metric, we ideally want a model that has been trained with dot product
00:11:08.840 | similarity, though one trained with cosine similarity will work as well.
00:11:13.440 | So we initialize that, and then we can encode some text really easily, just
00:11:21.080 | model.encode, and we will get this dense vector embedding. We wait a
00:11:26.080 | moment for the model to load and for this to run, and we can see that we have a
00:11:31.960 | 384-dimensional dense vector embedding.
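
A sketch of the dense encoding step; the specific checkpoint is an assumption, chosen only because it's a 384-dimensional Q&A retrieval model:

```python
from sentence_transformers import SentenceTransformer

# a 384-dimensional Q&A retrieval model; swap in whichever model you prefer
model = SentenceTransformer(
    "multi-qa-MiniLM-L6-cos-v1",
    device="cuda",  # or "mps" on a Mac, or "cpu"
)

dense_vec = model.encode("how do bacteria respond to antibiotics?")
print(dense_vec.shape)  # (384,)
```
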
00:11:40.960 | So that's pretty much everything we need to do in terms of building our sparse
00:11:46.220 | and dense vector representations. Now we need to initialize our hybrid index
00:11:51.920 | using the helper class that we defined at the start, and then we'll encode all
00:11:55.560 | of our contexts and add them to that hybrid
00:12:01.280 | index. So let's go back up to the top and find where we are initializing
00:12:08.480 | everything. So here we are. We'll need an API key now. At the moment, you won't
00:12:14.640 | get the API key from here, because the feature is in private preview, so you
00:12:19.880 | will have to request access to the hybrid index. To do that, again,
00:12:26.560 | everything will be in the description of this video, so you can just follow those
00:12:30.760 | instructions. So I'm going to initialize this, which just initializes my connection
00:12:36.200 | to Pinecone. Then what I need to do is actually create a hybrid index. Now,
00:12:41.040 | there are a few things that are important to take note of here. One,
00:12:45.480 | we're using the dot product metric. That's important: at the moment, the only
00:12:50.320 | metric that is supported with a hybrid index is dot product, so you have to
00:12:54.580 | specify that you want the dot product metric here. Another thing is that
00:12:59.200 | to actually use a hybrid index, we need to append an "h" to the pod type we would
00:13:04.480 | like to use. So right here we're using an s1 pod, and we're using the hybrid-index
00:13:09.480 | version of that s1 pod. When we run that, we should see a 201 response, which means
00:13:15.240 | that the index has been created. If we come down here, we can describe the index;
00:13:22.160 | this is pretty well aligned with the typical Pinecone client, and we can see that the
00:13:28.720 | index is now ready. As soon as ready is equal to true, we can move on to the next bit.
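
Since the helper client from the video isn't public, the following is only a hypothetical sketch of that setup; the `HybridPinecone` name, its method signatures, and the environment value are all assumptions:

```python
# hypothetical helper standing in for the temporary hybrid-enabled client
pinecone = HybridPinecone(api_key="YOUR_API_KEY", environment="us-west1-gcp")

pinecone.create_index(
    name="hybrid-demo",
    dimension=384,        # must match the dense embedding model
    metric="dotproduct",  # the only metric hybrid indexes currently support
    pod_type="s1h",       # the "h" marks the hybrid version of an s1 pod
)
# a 201 response here means the index has been created

pinecone.describe_index("hybrid-demo")  # poll until ready == True
```
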
00:13:32.800 | That might take a few seconds, but it shouldn't take long, and then we can connect
00:13:38.160 | to the index. Okay, great. To start with, we can see that our index is completely
00:13:43.560 | empty. There's nothing in there at the moment, which is obviously what we'd
00:13:46.880 | expect, as we haven't added anything yet. So now let's go ahead and begin adding those
00:13:51.440 | sparse and dense vectors. So we'll come down, and we're going to be using the
00:13:57.160 | upsert function. What we're going to do is iterate through all of
00:14:02.040 | our contexts in batches of 32 contexts at
00:14:06.640 | a time, so we set the batch size to 32 here. We've also got tqdm, which is
00:14:11.480 | just a progress bar; if you need to, you can
00:14:16.840 | install tqdm. The first thing we do is find the end of the batch, so we're
00:14:24.440 | extracting 32 or fewer items at a time. We extract those contexts,
00:14:31.180 | create some IDs, which are just a count like 0, 1, 2, 3, and so on, and then we
00:14:37.200 | add metadata to each one of our records. This is just the text of the
00:14:43.520 | context, and it lets us see a human-readable version of whatever
00:14:49.560 | we're returning; otherwise we'd only get the vectors back, and we can't
00:14:53.640 | understand what they are. After that, we create our dense and sparse vectors;
00:15:00.440 | this is just repeating what we did before. Then here we create the
00:15:07.600 | vector, or the record, that we'll be adding to Pinecone. There's a slightly
00:15:13.160 | different format here from what we might be used to if you've used Pinecone
00:15:17.320 | before. Typically in Pinecone a record would look like this: we'd have
00:15:22.240 | the ID, you'd have values, which is just the dense vector, and then you'd have
00:15:25.480 | your metadata, which here is the context. But we're using a hybrid index, so
00:15:30.480 | there's an extra field in there, the sparse values, which holds our
00:15:35.680 | sparse vectors. Then we just upsert those.
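
Putting that loop together, roughly; the record layout (in particular the sparse-values field, shown here as the token-ID-to-count mapping we built) and the helper's interface are assumptions, and the real endpoint may expect a slightly different shape:

```python
from tqdm.auto import tqdm

index = pinecone.Index("hybrid-demo")  # connect to the new index

batch_size = 32
for i in tqdm(range(0, len(contexts), batch_size)):
    i_end = min(i + batch_size, len(contexts))  # end of this batch
    batch = contexts[i:i_end]
    ids = [str(n) for n in range(i, i_end)]
    # dense and sparse representations, exactly as built earlier
    dense = model.encode(batch).tolist()
    sparse = generate_sparse_vectors(batch)
    vectors = [
        {
            "id": _id,
            "values": d,                    # the dense vector
            "sparse_values": s,             # token ID -> frequency
            "metadata": {"context": text},  # human-readable text
        }
        for _id, d, s, text in zip(ids, dense, sparse, batch)
    ]
    index.upsert(vectors)
```
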
00:15:40.480 | Now, it's worth pointing out that this upsert isn't the typical upsert, or isn't
00:15:45.840 | going to the same endpoint that we would typically use. If we come up to our
00:15:51.360 | helper class up here, the upsert here is going to the hybrid endpoint, whereas
00:15:59.120 | before it was going to vectors/upsert. Now it just has this extra hybrid path
00:16:06.080 | included within there. That's the only difference. So if we go back
00:16:10.320 | down to here, you can run this, and it will take a moment just to add
00:16:20.400 | everything, though it won't take too long. Okay, that took 47 seconds, and we
00:16:27.000 | have the 1,000 vectors in there now. So now I want to move on to
00:16:32.320 | querying: how do we actually query our hybrid index? There's a slight
00:16:38.760 | difference again, in that we add a sparse vector to our query. We're using
00:16:44.520 | this function; all it's doing is encoding everything and then making our query.
00:16:48.960 | We just add this sparse vector item to our query request; we still have the
00:16:55.760 | dense vector, and everything else, apart from alpha, is also the same. Alpha
00:17:00.400 | is also a new parameter, and I'll explain it in a moment. Then we
00:17:04.880 | just query. The query, similar to the upsert endpoint, also has
00:17:09.760 | hybrid in front of it now, so it's hybrid query, but other than that
00:17:15.000 | there isn't really any difference.
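
A sketch of that query helper; the function name and the parameter names on the hybrid endpoint are assumptions based on the description above:

```python
def hybrid_query(question, top_k=5, alpha=1.0):
    # encode the question into both representations
    sparse_vec = generate_sparse_vectors([question])[0]
    dense_vec = model.encode(question).tolist()
    # alpha = 1.0 -> pure dense search, alpha = 0.0 -> pure sparse search
    return index.query(
        vector=dense_vec,
        sparse_vector=sparse_vec,
        top_k=top_k,
        alpha=alpha,
        include_metadata=True,
    )

results = hybrid_query("a very technical PubMed-style question", alpha=1.0)
```
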
00:17:21.840 | So we have this question, which is very technical and definitely out of
00:17:29.320 | domain for most models. We run that, and first we're going to do
00:17:34.280 | a pure semantic search, or pure vector search. So let's run that. I'm not going
00:17:42.560 | to go through the results, because I hardly understand them myself, but the answer
00:17:46.960 | we want is actually ID 711 here. We want that to be ranked at position
00:17:53.840 | number one, but it's not; it's ranked at position number two, which isn't bad, but it
00:17:58.680 | could be better. So this is the result of a pure semantic or vector search, and the
00:18:06.320 | reason it's a pure semantic search is that we set alpha equal to one. Now,
00:18:11.120 | alpha is the parameter that lets us shift the weighting between sparse and
00:18:16.540 | dense vector search. At one, that means we're doing a purely dense vector search,
00:18:23.800 | which is what Pinecone would typically have done in the past. If we set it to zero, we're
00:18:29.560 | doing a fully sparse search, and anything in between
00:18:37.960 | is a hybrid search.
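
Conceptually, you can think of alpha as a convex combination of the two scores. This is only an illustration of the idea, not Pinecone's internal implementation:

```python
def combined_score(dense_score, sparse_score, alpha):
    # alpha = 1.0 -> pure dense, alpha = 0.0 -> pure sparse
    assert 0.0 <= alpha <= 1.0
    return alpha * dense_score + (1.0 - alpha) * sparse_score
```
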
00:18:43.240 | We didn't get perfect results using pure dense vector search, so let's
00:18:50.400 | have a look at what happens if we use a hybrid search. We run this, setting
00:18:55.960 | alpha to 0.3, so it's more of a sparse search than a dense vector search,
00:19:04.000 | but still using both, and if we come up to the top we can see that ID 711 has
00:19:08.640 | been returned at position one, so now we're getting the perfect results. So
00:19:14.960 | that's just one example of hybrid search and how it can help us get better
00:19:22.640 | results really easily. And that's it for our introduction to hybrid search
00:19:28.760 | and how we can implement it in Pinecone. With this, we're able to reap the
00:19:35.740 | benefits of dense vector retrieval whilst also sidestepping one of its most
00:19:42.160 | common pitfalls: out-of-domain search. Now, if you would like to get started
00:19:47.520 | with this hybrid search functionality in Pinecone, you will have to go through
00:19:52.080 | the preview at the moment. If you're watching this in the future, it's probably
00:19:55.640 | already generally available, and you can just go ahead and install the Pinecone
00:20:00.080 | client. But for now, if you're interested in trying it, the instructions to
00:20:04.320 | request access will be in the video description. That is everything,
00:20:11.400 | so I hope this has all been interesting. Thank you very much for watching, and
00:20:15.440 | I'll see you again in the next one. Bye.