Using Semantic Search to Find GIFs
Chapters
0:00 Intro
0:17 GIF Search Demo
1:56 Pipeline Overview
5:33 Data Preparation
8:17 Vector Database and Retriever
12:37 Querying
15:42 Streamlit App Code
Today we're going to have a look at how we can use semantic search to intelligently search GIFs, and I think the best way to explain what I mean by that is to simply show you. Here we have a very simple interface for our search pipeline, and I'm going to search with a natural language query. Now, there is some dependence on keywords here, or at least there will be some keywords that are shared (you know, we have "dogs"), but because we're using semantic search we can go beyond that. So if we go for something that has the same meaning but just doesn't include the word "dog", we'll see that we actually still return this one here. We've lost this one, which is a little further back now, but "four-legged mammal talking on the phone" still works pretty well.
Now, that's pretty cool, but what I want to do here is show you how you can build this yourself. It's particularly useful when you're searching for GIFs to use on social media and so on, but the real power here is its potential: we can use the same pipeline to do a lot more than just GIF search, and we can adapt it super easily to other use cases.
So let's have a quick, high-level view of what this pipeline actually looks like. We start with our query at the top, so "dogs on the phone" or "dogs talking on the phone". It's going to come down here and go into what we call a retriever model. A retriever is a transformer model that's specially trained to output meaningful vector representations of text. So from here we're going to get this big, long vector; I think the retriever we use is going to output a 384-dimensional vector, or something along those lines, which is actually quite small compared to a lot of these models. We get that vector embedding and it's going to go into a vector database. The vector database for us is going to be Pinecone.
Okay, so the GIF you saw before might have a description that says something like "vintage dogs talking on the phone", or something along those lines. That description has gone through the same retriever model and is stored as a vector embedding inside Pinecone. Now, this has been done for hundreds of thousands of GIFs and their descriptions. So then what we do is look inside Pinecone, where we have all of these description vectors, and we introduce our query, which is this vector here, and we say, okay, which of these other items are the closest to our query? Maybe we say, okay, it's these two here. We don't really care about the vectors themselves, but we do care about the metadata that's been attached to those vectors. In that metadata we're going to have the URL, that is, the URL where we can find that GIF file, so that's included in here. We use that to display the GIF on our screen, and that's what you saw before: we just have some HTML, just an image tag, that we use when we display them. And that's what we are going to build.
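At a very high level, the whole thing could be sketched like this. Every name here is a placeholder to make the flow concrete; the actual code we walk through below is more complete.

```python
# A very rough sketch of the two halves of the pipeline: indexing (done once)
# and querying (done on every search). All names here are placeholders.

def build_index(records, retriever, index):
    # One-off: embed every GIF description and store it with its URL as metadata
    for i, (description, url) in enumerate(records):
        vector = retriever.encode(description).tolist()
        index.upsert(vectors=[(str(i), vector, {"description": description, "url": url})])

def search(query, retriever, index, top_k=10):
    # Every search: embed the query and return the URLs of the closest matches
    xq = retriever.encode(query).tolist()
    results = index.query(vector=xq, top_k=top_k, include_metadata=True)
    return [match["metadata"]["url"] for match in results["matches"]]
```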
Now, you'll be able to find all of this code via a link in the video description. With that in mind, I'm not going to go really in-depth on the code; I'm going to go through it fairly quickly and just give you an idea of what we're actually doing. First, we install any of these libraries that we don't have installed already. We have our client for Pinecone here, we have sentence-transformers, which is our retriever model, tqdm, which is just a progress bar, and pandas for DataFrames. This part here is not really that important if you're building an app, but you will want it in the notebook, because it allows us to display HTML within our Jupyter notebook, which is important if we want to see which GIFs we're returning.
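As a rough sketch, the setup looks something like this (the package names are my assumptions based on the libraries mentioned above):

```python
# Install the dependencies if they are not already available:
#   pip install pinecone-client sentence-transformers tqdm pandas

import pandas as pd                                     # DataFrames for the GIF descriptions
from tqdm.auto import tqdm                              # progress bar for the indexing loop
from sentence_transformers import SentenceTransformer   # retriever model
import pinecone                                         # vector database client
from IPython.display import HTML, display               # render <img> tags inside Jupyter
```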
We obviously need a dataset, so we have this dataset here. If we go down, we can see it's the Tumblr GIF dataset. There's a bit of an imbalance there, which means that in some cases (not all) there are multiple descriptions for a single GIF. We can have a look at that in a moment, but this is the dataset structure: we have the URL, and then we have a description. Let me show you this first. So we come down here, and this is an example: we have this image, which is pulled in from the URL (you can see here we have the image tag, just plugging that URL in), and then we have the description. It pretty accurately describes what's happening in the GIF. Now, the duplicate descriptions: let's have a quick look at those as well. These are a few of those duplicates, and we can see that they are all the same GIF, as expected; they just have different descriptions, but the descriptions are all pretty accurate, so they don't mis-describe the GIF. Keeping these duplicates makes sense, because we simply have multiple descriptions that are going to point to the same GIF, and all of those descriptions are accurate, so it's not really an issue.
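A minimal sketch of that inspection step might look like this, assuming the data has been loaded into a DataFrame with "url" and "description" columns (the file name and column names are assumptions, and it reuses the imports from the setup sketch above):

```python
# Load the GIF URLs and descriptions (hypothetical local copy of the dataset)
df = pd.read_csv("tgif.csv")
print(df.head())

# Render one example GIF inside the notebook with a plain <img> tag
row = df.iloc[0]
display(HTML(f'<img src="{row["url"]}" style="width: 200px;">'))
print(row["description"])
```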
So we have our dataset. Now let's have a quick look at our graph here again. The first thing we need to do is initialize our retriever model and our vector database; as you can see on the left here, we set up our retriever and vector database before we begin querying anything. Basically, we're just going to take all of that data and index it. So the first thing I'm going to do is initialize the retriever model, which is here: I'm going to initialize this sentence transformer model. If you don't know anything about sentence transformers, there are a lot of videos on my YouTube channel and a lot of articles on Pinecone that cover them, so I'd recommend having a look at those, because they're really interesting. One important thing to know here is that our model is going to be outputting 384-dimensional vectors. That's important; we need to know that when we're initializing our vector index.
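For reference, the retriever setup might look roughly like this. The exact model name is an assumption (a MiniLM sentence transformer that outputs 384-dimensional vectors, as mentioned later in the video):

```python
from sentence_transformers import SentenceTransformer

retriever = SentenceTransformer("all-MiniLM-L6-v2")

# Confirm the embedding dimensionality we will need when creating the index
dim = retriever.get_sentence_embedding_dimension()
print(dim)  # 384
```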
We can do that like this. So we import pinecone, then you head over to app.pinecone.io to get a free API key and you just put it in here. With that, we can initialize our connection to Pinecone. Then all we need to do is pass an index name. The index name can be anything you want; just make sure it makes sense. For me this is a GIF search index, so I'm going to create that index with a dimensionality of 384, because that's what we got from the retriever model earlier. The metric is also important: this should align with what the sentence transformer retriever model has been fine-tuned for. And then we connect to the index we've just created.
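A sketch of that step, using the classic pinecone-client interface from around the time of this video (newer client versions expose a different API, so treat this as illustrative); the index name and environment are just examples:

```python
import pinecone

pinecone.init(
    api_key="YOUR_API_KEY",       # free key from app.pinecone.io
    environment="us-west1-gcp",   # assumed environment; check your console
)

index_name = "gif-search"

# Dimensionality matches the retriever output (384); the metric should match
# what the sentence transformer was trained with (cosine is a common choice).
if index_name not in pinecone.list_indexes():
    pinecone.create_index(index_name, dimension=384, metric="cosine")

index = pinecone.Index(index_name)
```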
Okay, so once we've done that, we've initialized the vector database and initialized the retriever, and now we use the retriever model to create our embeddings, the vectors that represent all of our GIF descriptions, and then we insert all of those into Pinecone. We do that in this loop; it's pretty simple, and we do it in batches of 64. Here I'm extracting a batch from the DataFrame that we have, and we retrieve the metadata for that batch. The metadata is going to be in a format like this: a description field, containing a description of what's happening in the GIF, and then a URL field, containing the URL. That's our metadata, and we have one of those for each record or item. Then we create our IDs. The IDs need to be unique and they need to be strings, and that's what we're doing here: the IDs are simply 0, 1, 2, 3, but as strings. Then we pull all of that together and upsert it into Pinecone, again in batches of 64. At the end here I'm just checking that we have all the vectors in the index, which we do, a hundred and twenty-five thousand or so. We can also see the index fullness: we can go up to about a million vectors on the free tier of Pinecone, so here you can see we have plenty of space left, which is obviously pretty useful.
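A minimal sketch of that loop, assuming the `df`, `retriever`, and `index` objects from the earlier sketches (the column names are again assumptions):

```python
from tqdm.auto import tqdm

batch_size = 64
for i in tqdm(range(0, len(df), batch_size)):
    batch = df.iloc[i:i + batch_size]                   # slice a batch of 64 rows
    embeddings = retriever.encode(batch["description"].tolist()).tolist()
    meta = batch.to_dict(orient="records")              # [{"url": ..., "description": ...}, ...]
    ids = [str(x) for x in range(i, i + len(batch))]    # unique string IDs: "0", "1", "2", ...
    index.upsert(vectors=list(zip(ids, embeddings, meta)))

# Check that everything made it in, and how full the index is
print(index.describe_index_stats())
```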
With that done, we can move on to the right of our chart, which is the querying step. Querying is what we're going to do all the time, whereas indexing we do once, unless we're adding more data, in which case we might do it again. Otherwise, querying is the main task of our pipeline: every time a user makes a query and searches for something, we're going to go through this pipeline. It's pretty much the same as before: the query goes through to our retriever model, which creates what we call a query vector. We pass that to Pinecone, then we search for the most similar GIF description vectors, or context vectors, and from there we return the most similar ones. Within those records returned from Pinecone we also include the metadata, and we use the URLs in that metadata, plus a bit of HTML, to display those GIFs to the user. Let's have a look at how we actually do that in code.
There's a search section, which is searching through those context vectors, and that's what we're doing here: we encode the query with our retriever, we query Pinecone, and we return the top 10 most similar contexts. You can see here that we're getting the metadata as well, which is also important: include_metadata needs to be true here, otherwise we're not going to be returning any metadata and we can't extract those URLs. Okay, and I'm just putting all of that into a search_gif function.
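That function might look something like this, reusing the `retriever` and `index` objects from above (the function name follows what's described in the video; the details are a sketch):

```python
def search_gif(query, top_k=10):
    # Encode the natural-language query into a 384-d query vector
    xq = retriever.encode(query).tolist()
    # Retrieve the most similar description vectors, with their metadata
    result = index.query(vector=xq, top_k=top_k, include_metadata=True)
    # Keep only the GIF URLs stored in each match's metadata
    return [match["metadata"]["url"] for match in result["matches"]]
```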
Then we have this display_gifs function. All it is, is some really simple HTML: inside here we have a figure element and an image with the URL as its source. And then, to perform a search, we just call search_gif, with "a dog being confused" in this case, and we display the results. We can see that we're getting all of these GIFs where a dog is confused, which I think is pretty cool. We have loads of examples here; I just went through a load of them to see what we get. This one is quite specific, and you can get really specific with this as well: "fluffy dog being cute and dancing like a person", and we get this one up at the top here.
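A sketch of that display helper, assuming the IPython imports from the setup sketch above (the exact markup and function name are assumptions):

```python
def display_gifs(urls):
    # Wrap each GIF URL in a simple <figure><img></figure> block
    figures = "".join(
        f'<figure style="margin: 5px; float: left;">'
        f'<img src="{url}" style="width: 120px; height: 90px;">'
        f"</figure>"
        for url in urls
    )
    display(HTML(figures))

# Usage: retrieve the top matches for a query, then render them in the notebook
display_gifs(search_gif("a dog being confused"))
```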
So yeah, I think that is pretty cool. Now, if we want to replicate the sort of app, the Streamlit app, that you saw before, we can, and if you just want to test it you can do that too; creating this app is super simple. Obviously we're using Streamlit, and this is all we do. I'm going to zoom out a bit, so it might be kind of hard to read on your screen, but honestly there isn't much to it, and again, you can just download this anyway, so it shouldn't really be an issue. This code is slightly outdated, actually: this model here is not using the MPNet model anymore, it's actually using the MiniLM model, which is just a smaller model. All we do is initialize Pinecone and initialize our retriever, like we did in the notebook. We have our HTML here, exactly the same as in the notebook, and then we just write our app.
"What are you looking for?" is just a text input that passes into query. Whenever query is not empty, we encode the query, as we did before, to create our query vector, we retrieve the most similar contexts, we pull the URL metadata out of those, and then we display them. That's literally it. It's super simple, and we get this really cool GIF search app from it.
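For reference, a stripped-down Streamlit version of the same flow might look like this. It reuses the search_gif sketch from above, and the widget choices here are my assumptions rather than the exact app code:

```python
import streamlit as st

st.title("GIF Search")

query = st.text_input("What are you looking for?", "")

if query != "":
    # Same pipeline as the notebook: encode the query, retrieve matches,
    # pull the GIF URLs out of the metadata, and render them as <img> tags.
    urls = search_gif(query)
    html = "".join(
        f'<img src="{url}" style="width: 120px; height: 90px; margin: 5px;">'
        for url in urls
    )
    st.markdown(html, unsafe_allow_html=True)
```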
So, that's it for this video. I hope this has been useful. Thank you very much for watching, and I will see you again in the next one. Bye.