Using Semantic Search to Find GIFs

00:00:00.000 | Today we are going to have a look at how we can use semantic search to

00:00:05.560 | intelligently search gifs and I think the best way for me to

00:00:11.420 | explain what I mean by that is to simply show you. So here we have a very simple

00:00:18.960 | Streamlit

00:00:20.840 | front-end for our

00:00:22.840 | search pipeline, and I'm going to search for

00:00:27.840 | dogs talking on the phone now

00:00:30.760 | this is a

00:00:33.520 | natural language query. There is some dependence on keywords here,

00:00:39.520 | Or there will be some keywords that are shared, you know, we have dogs

00:00:42.960 | phone

00:00:45.560 | But because we're using semantic search we can

00:00:48.520 | Try and avoid using these keywords

00:00:51.000 | So if we go for something that has the same meaning but just doesn't include the word dog

00:00:56.960 | so:

00:00:58.720 | four-legged mammal.

00:01:00.720 | We'll see that we actually still return this one here. We've lost this one; that's a little further back now.

00:01:08.160 | But "four-legged mammal talking on the phone" still works pretty well.

00:01:13.800 | Now that's pretty cool. But what I want to do here is show you how you can do this and

00:01:21.200 | In particular, okay, this is gif search.

00:01:25.640 | Maybe it is actually pretty useful,

00:01:28.120 | particularly when you are searching for gifs to use on social media and so on, but I

00:01:36.120 | Think the real

00:01:38.880 | Power for this is its potential we can use the same pipeline

00:01:43.880 | To do a lot more than just gif search. We can adapt it super easily to

00:01:48.280 | financial documents

00:01:51.000 | image search

00:01:52.480 | video search and a lot more

00:01:55.760 | So let's just have a quick high-level view of what this pipeline actually looks like

00:02:01.840 | so

00:02:03.760 | We start with our query at the top. So dogs on the phone or dogs talking on the phone

00:02:10.000 | Okay, that is our query

00:02:18.440 | it's going to come down here and it's going to go into what we call a retriever model a

00:02:24.720 | retriever model

00:02:26.720 | it's basically a

00:02:28.840 | transformer model

00:02:30.840 | Let me call it

00:02:32.600 | retriever. Yeah, it's a transformer model that's specially trained to output meaningful

00:02:40.400 | vector representations of that text. So from here, we're going to get this big long vector. There are

00:02:47.920 | plenty of

00:02:50.240 | numbers in there. I think the retriever we use is going to output a

00:02:55.600 | dimensionality of I think

00:02:58.520 | 374 if I'm not wrong

00:03:01.600 | Or something along those lines, but that's actually a very small

00:03:05.000 | Vector usually they are much larger

00:03:07.760 | So we get that vector embedding and it's going to go into a vector database now

00:03:19.680 | The vector database for us is going to be pinecone

00:03:23.440 | And in pinecone we have all of these

00:03:29.460 | We have, so, the gifs you just saw.

00:03:32.800 | Each one of those has a little description

00:03:35.680 | Okay, so the one you saw before might say I think it says something like

00:03:40.640 | vintage dogs talking on the phone or something along those lines and

00:03:46.720 | We take that description

00:03:48.720 | We pass it through the retriever model,

00:03:52.600 | so the same retriever model we used here, and store it as a vector embedding inside Pinecone.

00:03:57.840 | Now this has been done for hundreds of thousands of gifs and their descriptions.

00:04:03.200 | So then what we do is we look inside pinecone. So we have all of these

00:04:08.880 | different

00:04:11.680 | gif descriptions, and

00:04:14.280 | what we do, and I'll use a different color here, is we

00:04:17.480 | introduce our query. So this is this vector here, and we say, okay, which of these other

00:04:26.240 | items are the closest to our query? And maybe we say, okay, it's these two here.

00:04:31.560 | So what we would do is return

00:04:34.640 | those

00:04:37.480 | over here and

00:04:39.560 | We don't really care about the vectors themselves, but we care about the metadata that's been attached to those vectors

00:04:45.840 | So in that metadata, we are going to have the URL

00:04:50.540 | Okay, so the URL where we can find that gif image or the gif file so that's included in here

00:05:00.480 | okay, so that comes along with us and

00:05:04.420 | We use that to display the gif on our screen. So that's what you saw before
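The retrieval step described above can be sketched with a toy example. This is only an illustration under stated assumptions: the 3-dimensional vectors, the example.com URLs, and the helper names `cosine_similarity` and `nearest` are all made up for the sketch, and a real vector database uses much faster approximate search rather than sorting every record.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query_vec, records, top_k=2):
    """Return the metadata of the top_k records most similar to query_vec."""
    ranked = sorted(
        records,
        key=lambda record: cosine_similarity(query_vec, record["values"]),
        reverse=True,
    )
    return [record["metadata"] for record in ranked[:top_k]]


# Hypothetical 3-dimensional "embeddings"; real ones have hundreds of dimensions.
records = [
    {"values": [0.9, 0.1, 0.0], "metadata": {"url": "https://example.com/dog-phone.gif"}},
    {"values": [0.8, 0.3, 0.1], "metadata": {"url": "https://example.com/dog-call.gif"}},
    {"values": [0.0, 0.2, 0.9], "metadata": {"url": "https://example.com/cat-piano.gif"}},
]
query_vec = [1.0, 0.2, 0.0]

# We only keep the metadata (the URLs), not the vectors themselves.
results = nearest(query_vec, records)
```

The point of the sketch is the last line: the vectors decide the ranking, but what comes back to the app is the attached metadata.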

00:05:10.660 | We just have some HTML. It's just like an image tag.

00:05:15.000 | And

00:05:18.940 | We just have the URL in there

00:05:21.100 | Okay when we display them and that's what we are going to build so

00:05:28.140 | let's take a look at how we can do that.

00:05:33.220 | now you will be able to find all of this code in a link in the video description or if you're

00:05:40.260 | Watching this on the article

00:05:42.660 | It will be at the bottom of the article

00:05:45.060 | So with that in mind, I'm not going to go really in-depth on the code

00:05:49.420 | I'm just going to kind of go through it quite quickly and give you an idea of what we're actually doing.

00:05:54.000 | so the first thing we would need to do is

00:05:57.540 | Install any of these libraries if we don't have them installed already

00:06:02.940 | And so we have our connection to Pinecone here. We have sentence-transformers, which is our retriever model.

00:06:08.920 | tqdm is just a progress bar, and pandas is just for pandas DataFrames.

00:06:13.900 | Now this here is not really that important if you're building an app,

00:06:19.580 | but if you're doing this in a notebook,

00:06:21.740 | you will want this, because this will allow us to display HTML within our Jupyter notebook,

00:06:27.260 | which is obviously important if we want to see what gifs we are returning.

00:06:31.780 | and then

00:06:33.460 | We obviously need a data set. So we have this data set

00:06:38.860 | here and

00:06:41.620 | If we go down, we see it's a Tumblr gif dataset:

00:06:46.340 | 100,000 animated gifs and

00:06:49.740 | 120,000 sentences describing those gifs. Now,

00:06:53.660 | what that means is there's a bit of an imbalance there. That means that,

00:06:59.540 | in some cases, not all, there are multiple descriptions for a single gif, and

00:07:04.220 | we can go down here and have a look at that in a moment. But this is the

00:07:10.360 | Dataset structure. We have the URL and then we have a description and

00:07:14.860 | We'll print these out in a moment

00:07:18.540 | so

00:07:20.540 | Let me show you this first. So we come down here and this is an example, right?

00:07:25.500 | So we have this image, which is from the URL. So I've pulled that in from the URL.

00:07:30.960 | See here, we have the image tag, just plugging that in, and then we have the description.

00:07:36.820 | It pretty accurately describes what is happening in the gif.

00:07:40.580 | Okay

00:07:42.700 | now we have those duplicates or

00:07:45.900 | Duplicate descriptions. Let's have a quick look at those as well

00:07:50.020 | so these are a few of those duplicates and we can see that they are all the same gif as expected and

00:07:57.900 | They just have different descriptions, but the descriptions are all pretty accurate. They're not, you know, they don't not describe the gif

00:08:06.940 | So in this case

00:08:08.740 | keeping these duplicates makes sense, because we simply have multiple descriptions that are going to point to the same gif.

00:08:13.940 | All of those descriptions are accurate. So it's not really an issue
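To make the duplicate situation concrete, here is a small sketch that groups descriptions by URL. The rows are hypothetical stand-ins for the (URL, description) structure described above, and plain Python is used instead of pandas so the snippet stays self-contained.

```python
from collections import defaultdict

# Hypothetical rows in the (url, description) shape described above.
rows = [
    ("https://example.com/a.gif", "a dog is talking on the phone"),
    ("https://example.com/a.gif", "a puppy holds a telephone to its ear"),
    ("https://example.com/b.gif", "a man is dancing in the street"),
]


def descriptions_by_url(rows):
    """Group every description under the URL of the gif it describes."""
    grouped = defaultdict(list)
    for url, description in rows:
        grouped[url].append(description)
    return dict(grouped)


grouped = descriptions_by_url(rows)

# Gifs with more than one description: each description becomes its own
# vector later, and all of them point back to the same URL.
duplicated = {url: texts for url, texts in grouped.items() if len(texts) > 1}
```

Keeping every row means a gif with several descriptions simply has several chances of matching a query.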

00:08:17.140 | So we have our data set now. Let's just have a quick look at our

00:08:21.160 | Graph here. So the first thing we need to do here is initialize our retrieval model and our

00:08:28.700 | Vector database and as you can see on the left here

00:08:33.420 | We are going to be indexing our data

00:08:36.740 | Using our retriever and vector database before we begin querying anything

00:08:41.380 | Basically, we're just going to take all that data

00:08:43.900 | and put it into our vector database.

00:08:47.020 | So the first thing I'm going to do here is initialize the

00:08:50.380 | retriever model, which is here, and I'm going to initialize this sentence transformer model.

00:08:56.980 | If you don't know anything about sentence transformers

00:08:58.800 | there are a lot of videos on my YouTube channel and a lot of articles on Pinecone that cover

00:09:04.860 | these in a lot of detail, so

00:09:08.220 | I'd recommend

00:09:10.900 | Having a look at those because they are really interesting

00:09:13.780 | now

00:09:15.940 | One important thing here to know is that our model is going to be outputting

00:09:20.460 | vectors with dimensionality

00:09:23.140 | 384 okay, I think earlier I said

00:09:25.540 | 374 it's 384
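One way to confirm the dimensionality rather than remember it is to encode a single sentence and measure the vector. This is a sketch, not the notebook's code: the checkpoint name `all-MiniLM-L6-v2` is an assumption (the video only says a MiniLM model with 384 dimensions), and `load_retriever` and `embedding_dimension` are hypothetical helper names.

```python
def load_retriever(model_name="all-MiniLM-L6-v2"):
    """Load a sentence-transformers model.

    The default checkpoint name is an assumption. Requires
    `pip install sentence-transformers`; imported lazily so this
    file can be imported without the library installed.
    """
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_name)


def embedding_dimension(model, sample="dogs talking on the phone"):
    """Encode one sentence and return the length of its vector."""
    vector = model.encode([sample])[0]
    return len(vector)


# Usage (downloads the model on first run):
# retriever = load_retriever()
# dimension = embedding_dimension(retriever)
```

Whatever number this returns is the value the index has to be created with.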

00:09:28.060 | That's important. We need to know that when we're initializing our

00:09:31.220 | vector database

00:09:33.620 | Which we can do like this so we import pinecone

00:09:37.860 | For the API key here. You do need to go

00:09:42.620 | to app.pinecone.io to get a free API key, and you just put it in here.

00:09:49.420 | And then with that we can initialize our connection to Pinecone.

00:09:54.620 | Then all we need to do is pass an index name. So the index name can be anything you want.

00:10:00.060 | Just make sure it makes sense. So for me, this is a gif search

00:10:04.620 | Index, so I am going to create that index with the dimensionality

00:10:11.140 | 384, because that's what we got from the retriever model earlier.

00:10:14.340 | the metric is also important this should align to what the

00:10:19.780 | Sentence transformer retrieval model has been fine-tuned

00:10:22.980 | to work with

00:10:25.540 | And then we connect to the index we've just created.
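Why the metric matters can be shown with a tiny, made-up example. The vectors and helper names below are assumptions for illustration only; the video does not say which metric the index uses, only that it should match what the model was fine-tuned with.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(dot(v, v))


def cosine(a, b):
    """Cosine similarity: dot product with vector lengths divided out."""
    return dot(a, b) / (norm(a) * norm(b))


def normalize(v):
    """Scale a vector to unit length."""
    length = norm(v)
    return [x / length for x in v]


query = [1.0, 0.0]
same_direction = [1.0, 0.0]   # points exactly where the query points
long_vector = [3.0, 3.0]      # 45 degrees off, but much longer

# Dot product rewards length, so it prefers the long vector...
dot_prefers_long = dot(query, long_vector) > dot(query, same_direction)

# ...while cosine similarity only looks at direction.
cosine_prefers_same = cosine(query, same_direction) > cosine(query, long_vector)

# On unit-length vectors the two metrics give the same score.
agree_when_normalized = math.isclose(
    dot(normalize(query), normalize(long_vector)),
    cosine(query, long_vector),
)
```

The same pair of vectors is ranked differently by the two metrics, which is why the index metric should follow the model's training setup.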

00:10:29.580 | Okay, so once we've done that we move on to actually

00:10:37.380 | indexing. Okay, so we've initialized the

00:10:41.140 | vector database and initialized the retriever. Now we use the

00:10:45.140 | retriever model to create our embeddings, or the vectors

00:10:49.500 | that represent all of our gif descriptions, and then we insert all of those into Pinecone.

00:10:55.940 | So we do that in this loop. It's pretty simple. We do it in batches of 64.

00:11:01.260 | Here I'm extracting a batch from the data frame that we have. I

00:11:05.980 | generate embeddings for that batch

00:11:10.340 | We retrieve the metadata for that batch. So the metadata is going to be in a format. Do I have something here?

00:11:18.300 | Maybe I don't so that metadata will be in the format. It's like description

00:11:28.580 | Description and then we have a description of what is happening in there and

00:11:35.660 | Then also the URL and that will go to the URL, right?

00:11:40.100 | pretty simple

00:11:42.860 | That's our metadata and we have one of those for each record or item and

00:11:49.540 | Then we create our IDs. So the IDs need to be unique and they need to be strings, and

00:11:55.980 | that's what we're doing here. Okay, so we're just going through; the IDs are simply 0, 1, 2, 3,

00:12:01.380 | but obviously as strings, and then we pull that together. Okay, and we insert these into Pinecone.
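The indexing loop described above can be sketched roughly like this. It is written against duck-typed objects rather than a specific client version: `model` is assumed to have an `encode` method like a sentence-transformers model, and `index` is assumed to have an `upsert(vectors=...)` method that accepts (id, values, metadata) tuples, as the Pinecone index object does. The function names and the row layout are assumptions for the sketch.

```python
def batched(items, batch_size):
    """Yield (start_offset, batch) pairs over a list."""
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]


def index_descriptions(rows, model, index, batch_size=64):
    """Embed each description and upsert it with its metadata.

    rows: list of dicts with "url" and "description" keys (assumed layout).
    """
    for start, batch in batched(rows, batch_size):
        descriptions = [row["description"] for row in batch]
        embeddings = model.encode(descriptions)
        # sentence-transformers returns a NumPy array; convert it to plain
        # lists. Anything that is already a list passes through unchanged.
        if hasattr(embeddings, "tolist"):
            embeddings = embeddings.tolist()
        # IDs must be unique strings: "0", "1", "2", ...
        ids = [str(start + offset) for offset in range(len(batch))]
        metadata = [
            {"description": row["description"], "url": row["url"]}
            for row in batch
        ]
        index.upsert(vectors=list(zip(ids, embeddings, metadata)))
```

Using the running offset for the IDs keeps them unique across batches without any extra bookkeeping.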

00:12:08.060 | So we do that in batches of 64, and then at the end here

00:12:11.020 | I'm just checking that we have all the vectors in the index, which we do have a hundred and twenty five

00:12:16.920 | Thousand or just over

00:12:19.940 | Okay, cool and we can also see the index fullness so

00:12:24.380 | We can go up to about a million vectors on the free tier of pinecone. So

00:12:30.260 | Here you can see we have plenty of space left, which is obviously pretty useful

00:12:34.660 | so

00:12:37.340 | With that we can move on to the right of our chart here

00:12:41.580 | which is the querying step. So querying is what we're going to do all the time, whereas indexing we do once,

00:12:47.700 | Unless we're sort of adding more data and then we might do it again

00:12:51.340 | Otherwise querying is the main task of our pipeline

00:12:57.820 | Every time a user makes a query or searches for something, we're going to be going through this pipeline. So

00:13:04.820 | To do that again

00:13:07.220 | it's pretty much the same as before. The query goes through to our retriever model, which creates what we call a query vector.

00:13:12.860 | We pass that to pinecone and then we search for the most similar

00:13:17.140 | already indexed

00:13:19.820 | gif description vectors or context vectors and

00:13:23.140 | then

00:13:25.500 | From there, we return the most similar ones. Okay, and within those

00:13:30.540 | records from Pinecone, we also include the metadata, and

00:13:35.020 | that includes the URL to the original gif.

00:13:38.580 | So then we can just use some simple

00:13:41.700 | HTML to display those gifs to the user. Let's have a look at how we actually do that in code.

00:13:48.620 | So I've split it into two steps.

00:13:52.340 | There's a search section or part, which is

00:13:56.940 | encoding our query and

00:14:00.060 | also

00:14:01.580 | Searching for the most relevant

00:14:03.580 | Context vectors and that's what we're doing here. So we encode with our retriever

00:14:09.460 | we query, and we are going to return the top 10 most similar contexts there, and

00:14:15.580 | Yeah, we just append that to this list

00:14:21.060 | Yeah

00:14:22.340 | Okay, you can see here. We're getting the metadata as well. It's also important

00:14:26.740 | So here include metadata needs to be true. Otherwise, we're not going to be returning any metadata, and we can't extract those URLs.
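The search step just described (encode, query for the top 10, keep the URLs) might look roughly like the sketch below. As before, `model` and `index` are duck-typed assumptions: `index.query` is assumed to accept `vector`, `top_k`, and `include_metadata` keyword arguments and to return something with a `"matches"` list, which is how the Pinecone client behaves to the best of my knowledge. The exact code in the notebook may differ.

```python
def search_gif(query, model, index, top_k=10):
    """Encode the query, fetch the closest vectors, and return their URLs."""
    query_vector = model.encode([query])[0]
    # Convert a NumPy vector to a plain list; lists pass through unchanged.
    if hasattr(query_vector, "tolist"):
        query_vector = query_vector.tolist()
    response = index.query(
        vector=query_vector,
        top_k=top_k,
        include_metadata=True,  # without this there are no URLs to extract
    )
    return [match["metadata"]["url"] for match in response["matches"]]
```

Only the URLs are returned because that is all the display step needs.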

00:14:33.820 | Okay, and I'm just putting that in a search gif function there, and

00:14:38.820 | then we have this display gifs. So all this is is the really simple

00:14:46.700 | HTML that we are

00:14:49.780 | displaying

00:14:51.180 | So we just have these

00:14:53.180 | div elements inside here. We have a figure and our image with the URL source that we have. Okay.
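A minimal sketch of that HTML, assuming nothing more than a div wrapping one figure and image per URL. The inline styles and the function name `gif_html` are made up for the illustration; the notebook's actual markup may differ.

```python
from html import escape


def gif_html(urls):
    """Build a div containing one <figure><img></figure> per gif URL."""
    figures = []
    for url in urls:
        figures.append(
            '<figure style="margin: 5px;">'
            # Escape the URL so a stray quote cannot break the attribute.
            f'<img src="{escape(url, quote=True)}" style="width: 120px;">'
            "</figure>"
        )
    return (
        '<div style="display: flex; flex-flow: row wrap;">'
        + "".join(figures)
        + "</div>"
    )


# In a Jupyter notebook the string can be rendered with IPython:
# from IPython.display import HTML
# HTML(gif_html(["https://example.com/a.gif"]))
```

Browsers animate a gif inside a normal image tag, so nothing gif-specific is needed.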

00:15:00.660 | And then to do that or to perform a search we just do search gif

00:15:06.060 | Dog being confused in this case and we display them

00:15:09.660 | Okay, and we can see that we're getting all of these gifs where a dog is confused, which is, I think, pretty cool.

00:15:16.300 | And we have loads of examples here. I just went through a load of them to see what we get.

00:15:21.900 | Yeah, oh this one quite specific you can get really specific with this as well

00:15:31.780 | So fluffy dog being cute and dancing like a person and then we get this one up at the top here

00:15:38.940 | So yeah, I think that is pretty cool. Now if we want to

00:15:45.020 | replicate the sort of app, the Streamlit app that you saw before,

00:15:48.020 | you can. If you just want to test it, you can, but to create this app is super simple.

00:15:56.140 | Obviously we're using Streamlit, and this is all we do.

00:16:00.260 | I'm going to zoom out a bit, so it might be kind of hard to read on your screen, but it's just to

00:16:04.820 | kind of show the whole code, and

00:16:06.820 | Again, you can just download this anyway, so it shouldn't really be an issue

00:16:13.180 | This code is slightly outdated, actually. So this model here is not using the MPNet model anymore.

00:16:19.300 | It's actually using the MiniLM model. It's just a smaller model, and

00:16:22.980 | yeah, all we do is initialize Pinecone and initialize our retriever like we did in those notebooks.

00:16:31.380 | We have our HTML here exactly the same as in the notebooks again, and then we just write our app

00:16:39.820 | So we have "AI-powered gif search",

00:16:42.540 | "What are you looking for?" It's just a text input

00:16:45.040 | that passes into query. Whenever query is not empty,

00:16:49.240 | we will begin to search. Okay, so we

00:16:53.560 | encode, as we did before, the query to create our query vector. We retrieve the most similar contexts, and then we

00:17:01.580 | extract the

00:17:04.220 | URL metadata from those, and then we display them, and that's literally it. It's super simple, and

00:17:11.740 | we get this really cool gif search app from that. So
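The app flow described above (text input, search when the query is not empty, render the HTML) can be sketched as a single function. To keep the sketch importable without Streamlit installed, the `streamlit` module is passed in as the `st` argument; `run_app`, `search`, and `render` are hypothetical names, and the real app file is probably laid out differently.

```python
def run_app(st, search, render):
    """One pass of the app script.

    st:     the streamlit module (passed in so this file imports without it)
    search: function taking a query string and returning a list of gif URLs
    render: function taking a list of URLs and returning an HTML string
    """
    st.title("AI-powered gif search")
    query = st.text_input("What are you looking for?")
    # Streamlit re-runs the script on every interaction; the text input
    # returns an empty string until the user types something.
    if query != "":
        urls = search(query)
        # unsafe_allow_html is needed so the <img> tags are rendered
        # rather than shown as text.
        st.markdown(render(urls), unsafe_allow_html=True)


# Usage inside an app file started with `streamlit run app.py`
# (my_search and my_render are placeholders for your own functions):
# import streamlit as st
# run_app(st, search=my_search, render=my_render)
```

Nothing is searched or rendered until the text input returns a non-empty string.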

00:17:15.500 | That's it for this video. I hope this has been

00:17:20.500 | interesting so

00:17:23.220 | Thank you very much for watching and I will see you again in the next one. Bye
