Cohere AI's LLM for Semantic Search in Python
Chapters
0:00 Semantic search with Cohere LLM and Pinecone
0:45 Architecture overview
4:06 Getting code and prerequisites install
4:50 Cohere and Pinecone API keys
6:12 Initialize Cohere, get data, create embeddings
7:43 Creating Pinecone vector index
10:37 Querying with Cohere and Pinecone
12:56 Testing a few queries
14:35 Final notes
00:00:00.000 |
Today we are going to take a look at how to build a semantic search tool using 00:00:05.520 |
Cohere's Embed API endpoint and Pinecone's Vector Database. We'll be 00:00:11.420 |
using Cohere's large language model to embed sentences or paragraphs into a 00:00:17.640 |
vector space and then we'll be using Pinecone's Vector Database to actually 00:00:20.920 |
search through that vector space and retrieve relevant answers to our 00:00:25.680 |
particular queries based on the semantics of those queries rather than 00:00:30.480 |
just keyword matching. Now both of these services together are a pretty good 00:00:35.480 |
combination and they make building this sort of tool incredibly easy as we'll 00:00:39.440 |
see. But before we start building it let's take a look at what the overall 00:00:43.680 |
architecture will look like. So we're going to be starting with our data; it's 00:00:46.720 |
just going to be a load of text. It can be split into sentences or roughly 00:00:52.840 |
paragraph-sized chunks of text, depending on what we're trying to do, and what 00:00:58.480 |
we're going to do is feed those into Cohere's embedding endpoint which is 00:01:02.880 |
just going to go to a large language model and what that will do is encode 00:01:08.960 |
each of the chunks of text that we feed into it into a single vector. Now we're 00:01:16.740 |
going to have a thousand of these chunks of text. We're going to have quite 00:01:22.040 |
small questions from the TREC dataset, so we'll end up with a thousand of these 00:01:27.840 |
vectors. Okay, and once we have them, we then take them and put them into Pinecone, 00:01:34.960 |
and whilst they are stored in Pinecone, or even just as vectors by themselves, 00:01:41.360 |
we can think of them as being represented 00:01:45.560 |
in a vector space. So if two of these chunks of text are semantically 00:01:51.560 |
similar, i.e. they have a similar meaning, they would be very close together in 00:01:55.600 |
that vector space whereas two sentences that have a very dissimilar meaning 00:02:01.360 |
would be very far apart within that vector space. Pinecone is the storage, the 00:02:05.960 |
database, that stores all these vectors and also allows us to search through these 00:02:09.560 |
vectors very efficiently, so we can literally store millions, tens of 00:02:13.600 |
millions, billions of vectors in here and search through them incredibly fast. 00:02:19.360 |
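To make the "close together in vector space" idea concrete, here is a tiny illustrative sketch of cosine similarity, the metric we'll configure the Pinecone index with later. The 2-dimensional vectors below are made up purely for illustration; real embeddings will have 1,024 dimensions.

```python
import numpy as np

# Toy illustration: similar meanings -> nearby vectors -> high cosine similarity.
def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

xq = np.array([0.85, 0.15])         # pretend query vector
similar = np.array([0.90, 0.10])    # pretend "similar meaning" vector
dissimilar = np.array([-0.20, 0.95])

print(cosine_sim(xq, similar))      # ~1.0 -> close in vector space
print(cosine_sim(xq, dissimilar))   # much lower -> far apart
```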
Now, all of this together is what we would call indexing, and on the other side of 00:02:26.000 |
this we have the querying phase so when we're making queries let's say we have a 00:02:31.120 |
little search box here obviously this input can be anything we like but we 00:02:36.980 |
have a search box here, and our users are going to enter a query, like a Google 00:02:41.200 |
search. That query will go over to Cohere first, using the same large language 00:02:46.560 |
model and what we will get is a single what we call a query vector so here is 00:02:54.560 |
our query vector in code we usually refer to it as XQ and we're going to 00:02:59.320 |
pass that over to Pinecone here and we're going to say to Pinecone okay with 00:03:04.680 |
this query vector, what are the top K? So, like, the number here, top K, maybe we say 00:03:12.680 |
equal to 3: what are the top K most similar already-indexed vectors? So with 00:03:17.880 |
top K equal to 3 if this was our query vector we would return the top 3 most 00:03:22.840 |
similar items here so I think maybe number 1 would be this vector here maybe 00:03:28.040 |
number 2 would be this one and number 3 would be this one, and then Pinecone 00:03:31.760 |
would return those to us so we'd have 3 vectors here but obviously we can't read 00:03:39.040 |
or understand what these vectors mean all right they're just numbers so what 00:03:44.440 |
we need to do is find the metadata that was attached to those so the metadata is 00:03:49.240 |
going to contain the original text from there so we would actually get that 00:03:53.280 |
original text and we would return all that to our user so that's what we're 00:03:58.200 |
going to be building it probably looks much more complicated from this chart 00:04:01.920 |
than it actually is. It's incredibly easy, as we'll see; it won't take long to go 00:04:06.240 |
through. So let's get started. We are going to be using this guide on Pinecone, 00:04:10.000 |
so docs.pinecone.io/docs/cohere. We're not going to go 00:04:15.520 |
through this page here; we're actually just going to go over to here, opening 00:04:19.000 |
Colab. Okay, and we get this little chart that is pretty much exactly the same as 00:04:23.400 |
what I just explained just a little simpler and we're going to go down here 00:04:27.600 |
and we just need to install a few prerequisites. So if we take a look up 00:04:33.600 |
here, we have Cohere and Pinecone; you know why we're using those, I 00:04:37.720 |
just mentioned it. And then we also have Hugging Face datasets; this is where 00:04:41.140 |
we're going to be sourcing the dataset that we're going to be indexing and 00:04:45.160 |
then querying against later on. Okay, it looks good. 00:04:51.520 |
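For reference, a minimal install cell along these lines should cover the three libraries. Exact versions in the notebook may differ, and note that newer Pinecone SDKs use the package name pinecone rather than pinecone-client.

```python
# Run in a Colab/notebook cell; installs the Cohere client, the (classic)
# Pinecone client, and Hugging Face datasets.
!pip install -qU cohere pinecone-client datasets
```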
So come down to here; we need to sign up for API keys for Cohere and Pinecone, and both of these we can 00:04:55.680 |
actually do this completely free so Cohere has a trial amount that we can 00:05:01.400 |
query with so we're going to be using that click on Cohere here that will take 00:05:06.920 |
us to os.cohere.ai and it will also just redirect you to the dashboard if 00:05:12.920 |
you're already logged in if this is your first time using Cohere of course you 00:05:15.920 |
will not be so you have to go to the top right over here and create an account or 00:05:20.840 |
log in once you have done that you can go to settings on the left go to API 00:05:25.400 |
keys and you should have a trial key here so I've got the default key I'm 00:05:30.480 |
going to copy that and I'm going to go ahead and put it in here okay so I've 00:05:34.720 |
just put mine in and then for the Pinecone key click here and that will 00:05:39.840 |
take us to app.pinecone.io. If you're already logged in, you will see 00:05:44.040 |
something like this otherwise you're going to see a little login screen or 00:05:47.040 |
create an account page so you go ahead do that and then you should be 00:05:51.560 |
redirected to here. Now, if this is your first time using Pinecone, this will be empty; 00:05:56.440 |
you see I have a few indexes in here already but of course if you haven't created any 00:06:00.800 |
already this will be empty so all we need to do is head on over to API keys 00:06:05.280 |
on the left; we should have a default API key. You can copy that, and we will place it 00:06:10.880 |
in here okay so I've just added mine so I've got my Cohere and Pinecone keys in 00:06:16.080 |
there first thing we want to do is create our embeddings for that we are going to 00:06:19.960 |
need to use the Cohere embed endpoint and we also need some data so let's get 00:06:25.880 |
both of those so here we just initialize our connection to Cohere using that API 00:06:31.440 |
key from before and then here we're going to use HuggingFace datasets and 00:06:35.440 |
we're going to download the first 1,000 questions from the TREC dataset, or the first 00:06:43.200 |
1,000 rows, actually, and then the questions themselves are stored within 00:06:46.640 |
this text feature, which we can see down here. Okay, cool. 00:06:53.760 |
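As a rough sketch of those two cells (initializing Cohere and loading the data), something like the following, with your trial key pasted in place of the placeholder; the TREC dataset name follows the Hugging Face hub, though config details can vary across datasets versions.

```python
import cohere
from datasets import load_dataset

# Initialize the connection to Cohere with your trial API key.
co = cohere.Client("<COHERE_API_KEY>")

# First 1,000 rows of the TREC question dataset; the questions live in
# the "text" feature.
trec = load_dataset("trec", split="train[:1000]")
print(trec["text"][:3])
```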
After this, we are going to be using the Cohere embed endpoint. We're going to pass all of those 00:06:58.680 |
items. So, yes, we have 1,000 items in there; we just pass them all to Cohere at 00:07:02.720 |
once and this client will just automatically batch those and iterate 00:07:06.880 |
through everything so we don't overload the API requests we're going to be using 00:07:11.520 |
the small model and we're going to be truncating so we're going to keep 00:07:15.760 |
everything on the left here and then after that we want to extract the 00:07:20.560 |
embeddings from what we return there. So we run this; it should be pretty quick, 00:07:24.600 |
a second, super fast. And then let's just have a look at the dimensionality of what 00:07:29.680 |
we return there so we have 1,000 vectors each one of those vectors is 1,024 00:07:36.680 |
dimensions which is just the output dimensionality of Cohere's small large 00:07:42.360 |
language model. Okay. 00:07:48.160 |
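That embedding cell looks roughly like this; the model name and the truncate argument follow the older Cohere SDK used in the video, so newer SDK versions will differ.

```python
import numpy as np

# Embed all 1,000 questions in one call; the client batches the requests
# for us so we don't overload the API. truncate="LEFT" handles any inputs
# that exceed the model's max length (as in the notebook).
embeds = co.embed(
    texts=trec["text"],
    model="small",
    truncate="LEFT",
).embeddings

print(np.array(embeds).shape)  # (1000, 1024) for the small model
```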
And with those embeddings all created, we can move on to actually initializing our vector index, which is where we're going to store 00:07:51.840 |
everything. So for that, we initialize our connection to Pinecone using the API key we 00:07:56.520 |
used before we are going to be using this index name you can change this to 00:08:00.880 |
whatever you want I would just recommend that you keep it descriptive so that 00:08:04.480 |
you're not getting confused if you have multiple indexes later on and then here 00:08:08.920 |
if this is your first time using Pinecone or going through this notebook this will 00:08:13.160 |
just run here so it will create the index if you've already run this 00:08:16.840 |
notebook and the index already exists within your Pinecone account then this 00:08:21.680 |
is going to say: if that index name does exist, do not create the index again, 00:08:27.080 |
because it already exists. But if it needs to create it, then within that create 00:08:30.120 |
index call we have the index name, the dimensionality, so that's the 1,024 that 00:08:35.200 |
we saw before from the Cohere small model, and also cosine similarity as the 00:08:40.320 |
similarity metric there as well after that we just go ahead and connect to the 00:08:45.360 |
index. So let's run that. 00:08:50.040 |
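A sketch of that cell with the classic pinecone-client; the environment value is a placeholder, and the current Pinecone SDK uses a Pinecone class instead of pinecone.init.

```python
import pinecone

# Initialize the connection to Pinecone.
pinecone.init(api_key="<PINECONE_API_KEY>", environment="<YOUR_ENVIRONMENT>")

index_name = "cohere-pinecone-trec"

# Only create the index if it doesn't already exist.
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=1024,   # must match the embedding dimensionality
        metric="cosine",  # cosine similarity as the similarity metric
    )

# Connect to the index.
index = pinecone.Index(index_name)
```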
Now, once that is all done, we're going to move on to actually upserting everything: adding all of those vectors, the relevant 00:08:54.720 |
metadata, and some unique IDs to our index. And that will be in this format: 00:09:00.440 |
we're going to have a big list where each record's content is within a tuple containing a 00:09:06.000 |
unique ID, a vector that we've created from Cohere's embed endpoint, and 00:09:11.040 |
the metadata which is just going to contain the plain text version of the 00:09:15.420 |
information so come down here and we will go ahead and create that structure 00:09:20.880 |
So that's what we're doing here: creating a zipped list of the unique IDs, which 00:09:26.000 |
are just a count; the embeddings, which we've created before with Cohere; and the 00:09:30.640 |
metadata, which you can see we're creating here. It's just a dictionary; 00:09:34.280 |
metadata is always within the dictionary format, and we just have a key, 00:09:39.760 |
which is text, and the value, which is the original plain text of our data. Now 00:09:45.800 |
we'll upsert using a batch size of 128; that is so that we're not overloading the API calls 00:09:51.840 |
that we're making to Pinecone, and we are actually upserting in batches of 128. 00:09:57.480 |
Okay, so we can run that. 00:10:02.920 |
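Those cells look roughly like this; the "text" metadata key and count-based ID scheme follow what was just described, and the upsert signature is the classic pinecone-client's.

```python
batch_size = 128

# (id, vector, metadata) tuples: IDs are just a count, metadata holds
# the original plain text under a "text" key.
ids = [str(i) for i in range(len(embeds))]
meta = [{"text": text} for text in trec["text"]]
to_upsert = list(zip(ids, embeds, meta))

# Upsert in batches of 128 so individual API calls stay small.
for i in range(0, len(to_upsert), batch_size):
    index.upsert(vectors=to_upsert[i : i + batch_size])

print(index.describe_index_stats())  # vector count should be 1000
```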
At the end here, we're going to describe the index statistics, which is just so we can see the vector count, so whether we've upserted 00:10:08.040 |
everything into our vector index there, and we can see here that we have. And 00:10:11.400 |
from here, we can also check the dimensionality of our index, which again 00:10:15.320 |
should align to the model output dimensionality, again the 1024 that we 00:10:23.000 |
saw from before and okay so with that we've actually done all the indexing 00:10:28.800 |
stage of our workflow so we can actually cross off this bit here so the indexing 00:10:35.560 |
part this is all done now all we need to do is the querying part so we can see we 00:10:40.760 |
have our plain text query, we're going to take that to Cohere, and from 00:10:44.840 |
Cohere we're going to create that query embedding. We query that with Pinecone, 00:10:49.920 |
and we return some vectors and the metadata with those vectors to the user 00:10:55.360 |
So to do that, it's pretty much the same process again. We have our query: 00:11:00.920 |
"What caused the 1929 Great Depression?" We are going to do the exact same thing 00:11:05.960 |
that we did before with the TREC data. We are going to use Cohere embed, use the 00:11:11.920 |
small model, we're going to truncate to the left, and we get those embeddings. We 00:11:16.200 |
can also print the shape here so let me run this okay so the shape is just one 00:11:21.560 |
vector this time which is our query vector and it's still 1024 dimensionality 00:11:26.480 |
because we're using that small Cohere large language model now from there we 00:11:32.480 |
query Pinecone with this we're saying we want to return the top 10 most similar 00:11:36.800 |
results and we want to include the metadata that includes the plain text so 00:11:41.440 |
that we can actually read the results. 00:11:45.800 |
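Put together, the query cells look something like this; the response format follows the classic pinecone-client.

```python
query = "What caused the 1929 Great Depression?"

# Embed the query with the same model used for the indexed data.
xq = co.embed(texts=[query], model="small", truncate="LEFT").embeddings

# Return the top 10 most similar indexed vectors, with their text metadata.
res = index.query(xq, top_k=10, include_metadata=True)

# Print each match's (rounded) similarity score and original text.
for match in res["matches"]:
    print(f"{round(match['score'], 2)}: {match['metadata']['text']}")
```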
And we get this sort of response, which we can read relatively easily, but let's clean it up a little bit more. So 00:11:50.080 |
we come down here run this so we're going to go through each of those 00:11:53.200 |
matches that we returned here and we're just going to print out the score 00:11:57.080 |
rounded, so it's easier to read, and we're going to return the metadata text, 00:11:57.080 |
and when we print all of those out, we get something like this. Now, the top two 00:12:06.720 |
results have much higher similarity scores than the rest of our 00:12:11.320 |
results and they are indeed far more relevant to our question these would 00:12:16.760 |
clearly be counted as duplicates to our question or at least very similar and 00:12:22.200 |
then the rest of these we can see that these are not directly relevant but I 00:12:26.760 |
think most of these kind of occur within the same sort of time era, so it's 00:12:31.240 |
interesting that it manages to kind of identify some sort of relationship 00:12:35.760 |
there and return those. But of course, within that 1,000-question 00:12:41.600 |
dataset that we have from TREC, these are the only two items that refer 00:12:47.080 |
to the Great Depression; the rest of them, as you can see, are not relevant, at least not 00:12:53.600 |
directly relevant so they're very good results now let's adjust this a little 00:12:58.800 |
bit so I mentioned before that we're searching based on the meaning of these 00:13:03.400 |
queries not the keywords so what we're going to do is replace the keyword 00:13:08.360 |
depression with the incorrect term recession. Now, although it's incorrect, 00:13:13.200 |
you know, we as humans would understand that it means the same thing: 00:13:17.680 |
someone is trying to ask about that specific event and indeed we can see 00:13:22.440 |
that the results are pretty much exactly the same now the similarity scores 00:13:26.000 |
dropped a bit because we're using that incorrect term but nonetheless it is 00:13:31.320 |
identifying that the top two are exactly the same I think most of these are also 00:13:35.880 |
the same there's just a little bit of different order in there and a couple 00:13:39.680 |
that may be dropped from the top there now in this case we still have a lot of 00:13:44.200 |
similar keywords there, so maybe recession is very closely identified with 00:13:49.720 |
depression, matched with "great" and so on. So maybe we can modify this even more and 00:13:56.840 |
just kind of drop all those similar words and we can be kind of more 00:14:00.880 |
descriptive here as well so why was there a long-term economic downturn in 00:14:05.520 |
the early 20th century so this is very different to the results that we would 00:14:10.640 |
expect to find and yet again we see those two right at the top and the rest 00:14:15.240 |
of these results are also very similar now interestingly I think because we are 00:14:19.920 |
using more descriptive language here it's managing to identify these two as 00:14:24.440 |
being more similar than it did previously, where we used the incorrect 00:14:29.840 |
term; you can see that the similarity score here is lower than it is here. So 00:14:35.320 |
you can see already how easy it is to build a pretty high-performing semantic search 00:14:41.720 |
engine using very little code and literally no prior knowledge about this 00:14:49.040 |
technology. All we do is make a few API calls to Cohere and make a few API 00:14:53.080 |
calls to Pinecone and we have this semantic search tool now that's it for 00:14:59.320 |
this video I hope it has been useful and interesting but for now thank you very 00:15:05.800 |
much for watching, and I'll see you again in the next one. Bye.