
NER-Powered Semantic Search in Python


Chapters

0:00 NER-Powered Semantic Search
1:19 Dependencies and Hugging Face Datasets Prep
4:18 Creating NER Entities with Transformers
7:00 Creating Embeddings with Sentence Transformers
7:48 Using Pinecone Vector Database
11:33 Indexing the Full Medium Articles Dataset
15:09 Making Queries to Pinecone
17:01 Final Thoughts

Whisper Transcript

00:00:00.000 | Today we're going to go through an example of using named entity recognition alongside a vector
00:00:08.080 | search and this is a really interesting way of just making sure our search is very much focused
00:00:16.240 | on exactly whatever it is we're looking for. So for example if we are going to search through
00:00:22.240 | articles and we are searching for something to do with Tesla, we can use this to restrict the search
00:00:32.080 | to whatever it is we're looking for. Maybe we're looking for some news about Tesla's full self-driving,
00:00:38.640 | and using this we're going to restrict the search scope for "full self-driving" to specifically articles or
00:00:46.800 | parts of articles that contain the named entity Tesla. Now I think this is a really interesting
00:00:53.840 | use case, and it can definitely help make search more specific. So let's jump into the example.
00:01:01.680 | So we're working from this example page on Pinecone so pinecone.io/examples/nersearch
00:01:09.280 | and Ashok thought of this and put all this together. Yeah I think it's a really cool
00:01:15.120 | example. So we'll work through the code. So to get started I'm going to come over to Colab here
00:01:20.880 | and I'm just going to install a few dependencies. So sentence transformers, pinecone client and
00:01:26.160 | datasets. Okay, so that will just install everything.
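
For reference, the install step would look something like this in a Colab cell (package names as commonly used today; pin versions as needed):

```python
# Install the dependencies used in this walkthrough (Colab / Jupyter cell)
!pip install -U sentence-transformers pinecone-client datasets
```
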
00:01:34.000 | Okay, great. And then we can download the dataset that we're going to be using. So we're going to go: from datasets,
00:01:40.320 | import load dataset. We are going to be using
00:01:44.640 | this dataset here. So it's a Medium articles dataset; I'll just search for it right here. Okay, and
00:01:55.920 | it just contains a ton of articles straight from Medium. So to use it, we use
00:02:04.000 | the name of the dataset from here, and we're going to load it into a
00:02:10.400 | Hugging Face dataset. So data_files is medium_articles.csv,
00:02:19.360 | and we want the train split. Okay, we can have a look at that, or have a look at the head,
00:02:28.800 | after converting it to a pandas dataframe, so .to_pandas(). It'll just take a moment to download.
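
A sketch of the loading step; the dataset id and file name here are taken from the Pinecone example page this video follows, so treat them as assumptions:

```python
from datasets import load_dataset

# Load the Medium articles dataset and convert it to a pandas dataframe
# (dataset id and file name assumed from the accompanying Pinecone example)
df = load_dataset(
    "fabiochiu/medium-articles",
    data_files="medium_articles.csv",
    split="train",
).to_pandas()
df.head()
```
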
00:02:37.120 | Okay and then zoom out a little bit so you can see what we have here. So title, text, url and
00:02:44.880 | a few other things in here. Okay so obviously the text is going to be the main part for us
00:02:50.720 | and what we will also do is just drop a few of these. So drop any rows that are empty,
00:02:59.040 | that's the dropna here, and then we're going to sample 50,000 of these articles at random,
00:03:06.400 | and we'll do that with a random state of 32. Essentially this is so you can get the same
00:03:12.880 | 50,000 articles that I will get here as well. Okay, cool.
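
A minimal sketch of the cleaning and sampling step as described:

```python
# Drop empty rows, then take a reproducible random sample of 50,000 articles
df = df.dropna().sample(50000, random_state=32)
```
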
00:03:21.040 | Now, for each article there are a few things we could do when putting the embeddings together. So we're going to have to create
00:03:26.800 | embeddings for every single article here. What we could do is split the article into parts
00:03:33.200 | because our embedding models can only handle so many words at one time. So we could split
00:03:38.000 | the article into parts and embed all those different parts but in this case we can usually
00:03:43.280 | get an idea of what the article will talk about based on the title and usually the introduction.
00:03:48.560 | So what we'll do is actually take the first 1,000 characters, as you can see here,
00:03:56.080 | and then we're just going to join the article title and the text, and we'll keep all that within
00:04:02.240 | a title_text feature. Then we'll do df.head again, and maybe we can just have a look at title_text, so
00:04:09.680 | let's do that. Okay, so we have this.
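
A sketch of how that title_text column might be built; the title and text column names come from the dataframe above, and the separator is an assumption:

```python
# Keep just the first 1,000 characters of each article body, then join
# title and text into a single title_text column
df["text"] = df["text"].str[:1000]
df["title_text"] = df["title"] + ". " + df["text"]
df[["title_text"]].head()
```
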
00:04:22.480 | Now the next thing we want to do is initialize our NER model. How do we do that? We're going to be using
00:04:32.080 | Hugging Face for this, so I'll just copy the code in.
00:04:35.680 | We have all of this; this is a NER pipeline with a NER model. So we have this dslim
00:04:44.800 | bert-base model for NER, named entity recognition. We have our tokenizer, the model itself, and then we
00:04:52.000 | just load all that into our pipeline. We need to select our device so if we have a GPU running
00:04:58.160 | we will want to use a GPU it will be much faster. So I'm going to import torch and what we're going
00:05:05.840 | to do is say device equals "cuda" if a CUDA device is available, so torch.cuda.is_available(),
00:05:19.040 | otherwise we're just going to use "cpu". Now, if you are on Colab, what you can do to make
00:05:28.160 | sure you are going to be using a GPU is go to Runtime > Change runtime type, and here change this
00:05:36.720 | to GPU. Okay, so mine wasn't set to GPU, so now I need to change it, save, and actually rerun everything.
00:05:44.560 | So I'll go ahead and do that. Okay, so that all just reran, and what I'm going to do is just print
00:05:51.440 | the device here so that we can see that we are in fact using CUDA, hopefully. Okay, cool.
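
Putting those pieces together, a sketch of the device selection and pipeline setup; the dslim/bert-base-NER checkpoint and the aggregation setting are assumed from the example page:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Use a GPU if one is available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

model_id = "dslim/bert-base-NER"  # NER checkpoint (assumed from the example page)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# Wrap everything in a token-classification pipeline; aggregation_strategy
# merges sub-word tokens back into whole entities (setting is an assumption)
nlp = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="max",
    device=torch.device(device),
)
```
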
00:05:58.880 | and then here we will run all this and that will just download and initialize the model as well
00:06:06.400 | so it may take a moment. Okay, so after that let's try it on a quick example. So we have "London
00:06:14.320 | is the capital of England and the United Kingdom". We would expect a few named entities to be within this, so
00:06:20.960 | let's run that and see what we return. Okay, with this, maybe I need to, here, wrap the device in torch.device.
00:06:30.320 | Okay, cool. So here we have a few things: we have a location, that's the entity type,
00:06:44.880 | London. Okay, cool, that is definitely true; location again, England; and location again, United Kingdom.
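
The quick test would look something like this; the sentence is reconstructed from the entities returned in the video:

```python
# Quick sanity check on a single sentence
print(nlp("London is the capital of England and the United Kingdom"))
# Expect location (LOC) entities for London, England, and United Kingdom
```
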
00:06:52.880 | Okay, all great, so that's definitely working. Let's move on to creating our embeddings.
00:07:00.080 | So to create those embeddings we will need a retriever model. To initialize that, we're
00:07:06.640 | going to be using the sentence transformers library that will look something like this.
00:07:12.160 | So from sentence transformers import sentence transformer and then we initialize this model
00:07:18.400 | here. So this is just a pretty good sentence transformer model that's been trained on
00:07:23.200 | a lot of examples, so the performance is generally really good. Okay, now we can see the model format
00:07:31.680 | here: we have a max sequence length of 128 tokens and a word embedding dimension of 768, so that's how
00:07:40.560 | big the sentence embeddings, the sentence vectors, will be.
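
The exact checkpoint isn't named in the audio, so this sketch assumes an MPNet model whose 128-token / 768-dimension figures match the ones mentioned:

```python
from sentence_transformers import SentenceTransformer

# Checkpoint assumed: it matches the 128-token max sequence length and
# 768-dimension embeddings described in the video
retriever = SentenceTransformer("flax-sentence-embeddings/all_datasets_v3_mpnet-base")
print(retriever)
```
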
00:07:48.080 | We can use this model to create our embeddings, but we need somewhere to store them, and for that we're going to be using
00:07:51.760 | Pinecone, so we will initialize that. For that we need to do this: import pinecone, then
00:08:00.160 | initialize our connection to Pinecone, and for that we need an API key, which is
00:08:05.280 | free, and we can get it from here: app.pinecone.io.
00:08:13.920 | We need to log in here. Now, you probably just have one project in here, which would probably be
00:08:19.600 | your name's default project. So go into that, go to API Keys, and just press copy
00:08:26.960 | on your API key here, and then paste it in. Now, I put my API key into a variable
00:08:34.400 | called api_key, and with that we can initialize the connection.
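
A sketch of the connection setup, using the pinecone-client API as it was at the time of this video; the environment value is a placeholder, and yours is shown next to the key in the console:

```python
import pinecone

api_key = "YOUR_API_KEY"  # copied from app.pinecone.io

# Initialize the connection; the environment value is an assumption
pinecone.init(api_key=api_key, environment="us-west1-gcp")
```
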
00:08:46.160 | Now we'll create a new index called "ner-search", and we do that with
00:08:51.360 | pinecone.create_index, and for this index there are a few things that we need here. So, for
00:09:01.680 | create_index,
00:09:02.400 | we need the index name and the dimensionality, so dimension. Now, this is going to be equal to the
00:09:13.200 | retriever model that we just created, which has an embedding dimension, and we actually have that
00:09:20.240 | up here, it's 768. But we're not going to hard-code it; we're just going to get it
00:09:27.280 | from here, so get_sentence_embedding_dimension. Let me see, yeah, this here.
00:09:34.000 | Okay, so if we just have a look at what that will give us, it should be 768. Yeah. And then we also want
00:09:44.240 | to use the cosine similarity metric, and then that will go ahead and create our index. After it's been
00:09:51.280 | created, we'll connect to it, so we do pinecone.Index and just use the index name again,
00:09:57.440 | and then we can describe the index just to see what is in there, which should be nothing.
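
A sketch of index creation and connection under the same client-version assumption; the index name follows the one mentioned above:

```python
index_name = "ner-search"

# Take the dimension from the retriever rather than hard-coding 768
pinecone.create_index(
    index_name,
    dimension=retriever.get_sentence_embedding_dimension(),
    metric="cosine",
)

# Connect to the new index and check its stats (it should be empty)
index = pinecone.Index(index_name)
index.describe_index_stats()
```
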
00:10:04.960 | Okay, cool. So we have the dimensionality, and you can see that the index is empty, as we haven't added
00:10:17.120 | anything into it yet. So the next thing is actually preparing all of our data to be added to Pinecone,
00:10:24.480 | allowing us to do this search. What we're going to do first is initialize this
00:10:29.920 | extract_named_entities function, which, for a batch of text, is going to
00:10:36.560 | extract named entities for each one of those text records. So let's initialize that, and then what
00:10:44.640 | we're going to do is just have a look at how it will work. So, extract_named_entities,
00:10:50.320 | and in there we're just going to pass the data that we have, which is df title_text.
00:11:00.480 | What we'll do is maybe turn this into a list. Yeah, let's do that. So df title_text, I want to go
00:11:12.080 | for, let's say, the first three items, okay, and we'll turn it into a list if it isn't already a list
00:11:18.800 | there. Okay, cool. So for the first one we get Data, Christmas, Anonymous, America, Light Switch, and so
00:11:25.440 | on, and yeah, you can see we're getting those entities being extracted
00:11:30.240 | from there, so that is working.
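
A sketch of what that helper might look like, based on the description here; note this is the first version, before the deduplication fix that comes a little later:

```python
def extract_named_entities(text_batch):
    # Run the NER pipeline over the whole batch at once
    extracted_batch = nlp(text_batch)
    entities = []
    for text in extracted_batch:
        # Keep just the entity strings ("word") for each record
        ne = [entity["word"] for entity in text]
        entities.append(ne)
    return entities

# Quick look at how it works on the first three records
extract_named_entities(df["title_text"][:3].tolist())
```
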
00:11:37.200 | Now what we can do is actually loop through the full dataset, the full 50,000 that we have extracted, and we're going to do this. So let me
00:11:45.120 | walk you through all of it. We're going to be doing everything in batches of 64; that's to
00:11:50.480 | avoid overwhelming our GPU, which is going to be encoding everything in batches of 64, and also, when
00:11:56.880 | we're sending these requests to Pinecone, we can't send too many or too-large requests at any one
00:12:02.880 | time, so with 64 we're pretty safe here. We're going to find the end of the batch, so we're going
00:12:09.760 | in batches of 64, from i to i_end. Obviously, when you get to the end of the dataset, the batch size will
00:12:15.760 | probably be smaller, because the dataset probably doesn't fit into perfect batches of 64.
00:12:22.400 | Okay, and then we extract the batch, generate those embeddings, and extract the named entities using our
00:12:29.520 | function. There might be duplicates, so we might have the same named entity a few times in a
00:12:35.680 | single sentence, of course, or a single paragraph. In that case we don't need them all, because we're
00:12:41.920 | just going to be filtering, so we actually only need one instance of each named entity. To do
00:12:48.960 | that, we just deduplicate our entities here and convert them back into a list. Then what we're
00:12:56.400 | doing here is dropping the title_text from our batch, because we actually don't need it
00:13:03.440 | right here; we've just extracted our named entities, and that's all we care about at the
00:13:08.080 | moment. Then from there we create our metadata, so this is going to be some metadata that we saw
00:13:14.720 | in the dictionary-like format in Pinecone, create our unique IDs, which is the count, and we upsert
00:13:22.320 | everything, or add everything to an upsert list and upsert it. So we can run that; it will possibly take
00:13:29.120 | a little while to run. Ah, I may have to just deal with this quickly: entities
00:13:36.240 | can contain a list, so probably the best way to deal with that is within the function here. So
00:13:46.000 | let's come up here, and what we'll do is just remove the duplicates there, so
00:13:52.720 | ne = list(set(ne)), okay, and then we don't have to do it down here, so I can remove that,
00:14:01.680 | and we'll just call this, I suppose, batch_named_entities.
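
Putting the whole loop together, a sketch under the same assumptions as before, with the list(set(...)) deduplication fix applied inside the helper; the metadata fields are reconstructed from the description rather than copied from the notebook:

```python
from tqdm.auto import tqdm

batch_size = 64  # small enough for the GPU and for Pinecone request limits

for i in tqdm(range(0, len(df), batch_size)):
    # Find the end of this batch (the final batch may be smaller than 64)
    i_end = min(i + batch_size, len(df))
    batch = df.iloc[i:i_end].copy()
    # Generate embeddings for the batch
    emb = retriever.encode(batch["title_text"].tolist()).tolist()
    # Extract named entities (now deduplicated inside the helper)
    batch["named_entities"] = extract_named_entities(batch["title_text"].tolist())
    # Drop title_text; we only needed it for the embeddings and entities
    batch = batch.drop("title_text", axis=1)
    # Build metadata dictionaries (the original notebook may select
    # specific fields) and unique string IDs from the running count
    meta = batch.to_dict(orient="records")
    ids = [str(x) for x in range(i, i_end)]
    # Upsert everything into the Pinecone index
    index.upsert(vectors=list(zip(ids, emb, meta)))
```
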
00:14:12.640 | Okay, cool. So let's try and run that again. Great, so that looks like it will work. It will take a
00:14:18.560 | little bit of time, so I will leave it to run, and when it's complete we'll go through actually
00:14:26.080 | making queries and see how that works. Okay, so that is done. I've just run it on another, faster
00:14:35.360 | computer, so we should be able to describe the index stats here, and we should see
00:14:39.920 | 50,000 items. The one thing here is I need to just refresh the connection, so let me do that quickly.
00:14:50.080 | We'll come here, I am going to run this again, and I'm also going to run this without the create
00:14:56.320 | index step. Come down here, run this, and here we go: we have 50,000 vectors in there now. What we want
00:15:06.960 | to do now is use this here. This is going to search through Pinecone. So we're going to have our query
00:15:15.200 | here; we're going to extract the named entities from the query, embed the query, and what we do is
00:15:22.320 | query with our embedded query, with our query vector, which is this xq. We're going to return
00:15:29.120 | the top 10 most similar matches, we're going to include the metadata for those matches, and importantly,
00:15:36.720 | for the named entity part of this, the named entity filtering, what we do is filter for named
00:15:44.560 | entities that are within the list that we have here. So let's run that; this will just return all
00:15:56.560 | the metadata that we have there, so the titles that we might want from that.
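
A sketch of that search helper, reconstructed from the description; the filter uses Pinecone's $in operator on the named_entities metadata field, and the function and variable names are assumptions:

```python
def search_pinecone(query):
    # Extract named entities from the query itself
    ne = extract_named_entities([query])[0]
    # Embed the query to get the query vector xq
    xq = retriever.encode(query).tolist()
    # Top 10 matches, with metadata, filtered so that an article's
    # named_entities metadata overlaps the query's entities
    xc = index.query(
        vector=xq,
        top_k=10,
        include_metadata=True,
        filter={"named_entities": {"$in": ne}},
    )
    return xc
```
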
00:16:03.360 | What we do then is simply make a few queries. So "What are the best places to visit in Greece?" will be our first
00:16:09.680 | query, and we get these titles here: budget-friendly holidays, the best summer destination
00:16:18.720 | Greece, exploring Greece, Santorini island. Yeah, they all seem like pretty relevant articles
00:16:25.600 | to me. Let's ask a few more questions: "What are the best places to visit in London?" Run this, and then we
00:16:36.320 | get a few more. Oh, I think pretty relevant again. And then let's try
00:16:43.360 | one slightly different query: "Why does SpaceX want to build a city on Mars?"
00:16:47.840 | And here we go. For sure, I think all these are definitely very relevant articles to be returned.
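
Called on the three example queries from the video, under the same assumptions:

```python
# The example queries from the walkthrough
for q in [
    "What are the best places to visit in Greece?",
    "What are the best places to visit in London?",
    "Why does SpaceX want to build a city on Mars?",
]:
    print(search_pinecone(q))
```
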
00:17:01.840 | Okay, so that's it for this example, covering vector search with this sort of named entity
00:17:08.800 | filtering component, just to try and improve our results and make sure we are searching within a
00:17:15.840 | space that contains whatever it is we are looking for, or just making our search scope more specific.
00:17:24.240 | So I hope this has been interesting and useful. Thank you very much for watching, and I will see
00:17:30.880 | you again in the next one. Bye.