
Using Semantic Search to Find GIFs


Chapters

0:00 Intro
0:17 GIF Search Demo
1:56 Pipeline Overview
5:33 Data Preparation
8:17 Vector Database and Retriever
12:37 Querying
15:42 Streamlit App Code


00:00:00.000 | Today we are going to have a look at how we can use semantic search to
00:00:05.560 | intelligently search GIFs, and I think the best way for me to
00:00:11.420 | explain what I mean by that is to simply show you. So here we have a very simple
00:00:18.960 | Streamlit
00:00:20.840 | front end for our
00:00:22.840 | search pipeline, and I'm going to search for
00:00:27.840 | "dogs talking on the phone". Now,
00:00:30.760 | this is a
00:00:33.520 | natural language query. There is some dependence on keywords here,
00:00:39.520 | or there will be some keywords that are shared. You know, we have "dogs",
00:00:42.960 | "phone".
00:00:45.560 | But because we're using semantic search, we can
00:00:48.520 | try to avoid using these keywords.
00:00:51.000 | So if we go for something that has the same meaning but just doesn't include the word "dog",
00:00:56.960 | so
00:00:58.720 | "four-legged mammal",
00:01:00.720 | we'll see that we actually still return this one here. We've lost this one; that's a little further back now.
00:01:08.160 | But "four-legged mammal talking on the phone" still works pretty well.
00:01:13.800 | Now, that's pretty cool. But what I want to do here is show you how you can do this.
00:01:21.200 | In particular, okay, this is GIF search.
00:01:25.640 | Maybe it is actually pretty useful,
00:01:28.120 | particularly when you are searching for GIFs to use on social media and so on, but I
00:01:36.120 | think the real
00:01:38.880 | power of this is its potential. We can use the same pipeline
00:01:43.880 | to do a lot more than just GIF search. We can adapt it super easily to
00:01:48.280 | financial documents,
00:01:51.000 | image search,
00:01:52.480 | video search, and a lot more.
00:01:55.760 | So let's just have a quick high-level view of what this pipeline actually looks like.
00:02:03.760 | We start with our query at the top, so "dogs on the phone" or "dogs talking on the phone".
00:02:10.000 | Okay, that is our query.
00:02:18.440 | It's going to come down here and it's going to go into what we call a retriever model.
00:02:24.720 | A retriever model,
00:02:26.720 | it's basically a
00:02:28.840 | transformer model.
00:02:30.840 | Let me call it
00:02:32.600 | "retriever". Yeah, it's a transformer model that's specially trained to output meaningful
00:02:40.400 | vector representations of that text. So from here, we're going to get this big long vector with
00:02:47.920 | plenty of
00:02:50.240 | numbers in there. I think the retriever we use is going to output a
00:02:55.600 | dimensionality of, I think,
00:02:58.520 | 374, if I'm not wrong,
00:03:01.600 | or something along those lines, but that's actually a very small
00:03:05.000 | vector. Usually they are much larger.
00:03:07.760 | So we get that vector embedding and it's going to go into a vector database. Now,
00:03:19.680 | the vector database for us is going to be Pinecone.
00:03:23.440 | And in Pinecone we have all of these,
00:03:29.460 | so, the GIFs you just saw.
00:03:32.800 | Each one of those has a little description.
00:03:35.680 | Okay, so the one you saw before, I think it says something like
00:03:40.640 | "vintage dogs talking on the phone" or something along those lines, and
00:03:46.720 | we take that description,
00:03:48.720 | we pass it through the retriever model,
00:03:52.600 | so the same retriever model we used here, and store it as a vector embedding inside Pinecone.
00:03:57.840 | Now, this has been done for hundreds of thousands of GIFs and their descriptions.
00:04:03.200 | So then what we do is we look inside Pinecone. So we have all of these
00:04:08.880 | different
00:04:11.680 | GIF descriptions, and
00:04:14.280 | what we do, I'll use a different color here, is we
00:04:17.480 | introduce our query. So this is this vector here, and we say, okay, which of these other
00:04:26.240 | items are the closest to our query? And maybe we say, okay, it's these two here.
00:04:31.560 | So what we would do is return
00:04:34.640 | those
00:04:37.480 | over here, and
00:04:39.560 | we don't really care about the vectors themselves, but we care about the metadata that's been attached to those vectors.
00:04:45.840 | So in that metadata, we are going to have the URL.
00:04:50.540 | Okay, so the URL where we can find that GIF image or the GIF file, so that's included in here.
00:05:00.480 | Okay, so that comes along with us and
00:05:04.420 | we use that to display the GIF on our screen. So that's what you saw before.
00:05:10.660 | We just have some HTML. It's just like an image tag.
00:05:18.940 | We just have the URL in there
00:05:21.100 | when we display them, and that's what we are going to build. So
00:05:28.140 | let's take a look at how we can do that.
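The "which items are closest to our query" step can be sketched in plain Python. This is only an illustration under stated assumptions: the three-dimensional vectors, the example.com URLs, and the helper names `cosine_similarity` and `top_k_urls` are all made up here, cosine similarity is assumed as the closeness measure, and a real vector database would not scan every record like this.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" with attached metadata (hypothetical data).
# A real retriever outputs hundreds of dimensions per description.
indexed = [
    {"vector": [0.9, 0.1, 0.0], "metadata": {"url": "https://example.com/dogs-phone.gif"}},
    {"vector": [0.0, 0.2, 0.9], "metadata": {"url": "https://example.com/cat-sleeping.gif"}},
    {"vector": [0.8, 0.3, 0.1], "metadata": {"url": "https://example.com/puppy-call.gif"}},
]

def top_k_urls(query_vector, records, k=2):
    """Rank records by similarity to the query and return the URLs of the top k."""
    ranked = sorted(
        records,
        key=lambda r: cosine_similarity(query_vector, r["vector"]),
        reverse=True,
    )
    return [r["metadata"]["url"] for r in ranked[:k]]

query_vector = [1.0, 0.2, 0.0]  # stands in for the embedded query
print(top_k_urls(query_vector, indexed))
```

Only the metadata of the closest records is returned, which mirrors the point that the vectors themselves are not what gets displayed.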
00:05:33.220 | Now, you will be able to find all of this code in a link in the video description, or if you're
00:05:40.260 | reading this in the article,
00:05:42.660 | it will be at the bottom of the article.
00:05:45.060 | So with that in mind, I'm not going to go really in-depth on the code.
00:05:49.420 | I'm just going to go through it quite quickly and give you an idea of what we're actually doing.
00:05:54.000 | So the first thing we need to do is
00:05:57.540 | install any of these libraries if we don't have them installed already.
00:06:02.940 | So we have our connection to Pinecone here. We have sentence-transformers, which is our retriever model,
00:06:08.920 | tqdm is just a progress bar, and pandas is just for pandas DataFrames.
00:06:13.900 | Now, this here is not really that important if you're building an app,
00:06:19.580 | but if you're doing this in a notebook,
00:06:21.740 | you will want this because it will allow us to display HTML within our Jupyter notebook,
00:06:27.260 | which is obviously important if we want to see what GIFs we are returning.
00:06:31.780 | And then
00:06:33.460 | we obviously need a dataset. So we have this dataset
00:06:38.860 | here, and
00:06:41.620 | if we go down we see it's the Tumblr GIF dataset:
00:06:46.340 | 100,000 animated GIFs and
00:06:49.740 | 120,000 sentences describing those GIFs. Now,
00:06:53.660 | what that means, so there's a bit of an imbalance there, is that there are multiple,
00:06:59.540 | in some cases, not all, multiple descriptions for a single GIF.
00:07:04.220 | We can go down here and have a look at that in a moment. But this is the
00:07:10.360 | dataset structure. We have the URL and then we have a description, and
00:07:14.860 | we'll print these out in a moment.
00:07:20.540 | Let me show you this first. So we come down here and this is an example, right?
00:07:25.500 | So we have this image, which is from the URL. So I've pulled that in from the URL.
00:07:30.960 | See here, we have the image tag, just plugging that in, and then we have the description.
00:07:36.820 | It pretty accurately describes what is happening in the GIF.
00:07:42.700 | Now, we have those duplicates, or
00:07:45.900 | duplicate descriptions. Let's have a quick look at those as well.
00:07:50.020 | So these are a few of those duplicates, and we can see that they are all the same GIF, as expected, and
00:07:57.900 | they just have different descriptions, but the descriptions are all pretty accurate. They all do describe the GIF.
00:08:06.940 | So in this case,
00:08:08.740 | keeping these duplicates makes sense because we simply have multiple descriptions that are going to point to the same GIF.
00:08:13.940 | All of those descriptions are accurate, so it's not really an issue.
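A quick way to see which GIFs carry more than one description is to group the rows by URL. The sketch below uses a handful of invented (URL, description) pairs rather than the real dataset, and plain Python rather than pandas, so treat the data and variable names as placeholders.

```python
from collections import defaultdict

# Hypothetical rows in the same (url, description) shape as the dataset.
rows = [
    ("https://example.com/a.gif", "a dog is talking on a phone"),
    ("https://example.com/a.gif", "a puppy holds a telephone to its ear"),
    ("https://example.com/b.gif", "a man is dancing in the street"),
]

# Collect every description under the URL it belongs to.
descriptions_by_url = defaultdict(list)
for url, description in rows:
    descriptions_by_url[url].append(description)

# Keep only the GIFs that have more than one description.
multi_described = {
    url: descriptions
    for url, descriptions in descriptions_by_url.items()
    if len(descriptions) > 1
}

for url, descriptions in multi_described.items():
    print(url, descriptions)
```

Each description stays a separate row at indexing time, so a GIF with two descriptions simply gets two vectors that both point to the same URL.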
00:08:17.140 | So we have our dataset now. Let's just have a quick look at our
00:08:21.160 | graph here. So the first thing we need to do here is initialize our retriever model and our
00:08:28.700 | vector database, and as you can see on the left here,
00:08:33.420 | we are going to be indexing our data
00:08:36.740 | using our retriever and vector database before we begin querying anything.
00:08:41.380 | Basically, we're just going to take all that data and
00:08:43.900 | put it into our vector database.
00:08:47.020 | So the first thing I'm going to do here is initialize the
00:08:50.380 | retriever model, which is here, and I'm going to initialize this sentence transformer model.
00:08:56.980 | If you don't know anything about sentence transformers,
00:08:58.800 | there are a lot of videos on my YouTube channel and a lot of articles on Pinecone that cover
00:09:04.860 | these in a lot of detail, so
00:09:08.220 | I'd recommend
00:09:10.900 | having a look at those because they are really interesting.
00:09:15.940 | One important thing to know here is that our model is going to be outputting
00:09:20.460 | vectors with dimensionality
00:09:23.140 | 384. Okay, I think earlier I said
00:09:25.540 | 374; it's 384.
00:09:28.060 | That's important. We need to know that when we're initializing our
00:09:31.220 | vector database,
00:09:33.620 | which we can do like this. So we import pinecone.
00:09:37.860 | For the API key here, you do need to go
00:09:42.620 | to app.pinecone.io to get a free API key, and you just put it in here.
00:09:49.420 | And then with that we can initialize our connection to Pinecone.
00:09:54.620 | Then all we need to do is pass an index name. So the index name can be anything you want.
00:10:00.060 | Just make sure it makes sense. So for me, this is a GIF search
00:10:04.620 | index, so I am going to create that index with the dimensionality
00:10:11.140 | 384, because that's what we got from the retriever model earlier.
00:10:14.340 | The metric is also important. This should align with what the
00:10:19.780 | sentence transformer retriever model has been fine-tuned
00:10:22.980 | to work with.
00:10:25.540 | And then we connect to the index we've just created.
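To illustrate why the index dimensionality has to match the retriever output, here is a toy in-memory index. It is a hypothetical stand-in written for this example, not the Pinecone client: the `ToyIndex` class, the "gif-search" name, and the "cosine" metric value are all assumptions.

```python
class ToyIndex:
    """Hypothetical in-memory stand-in for a vector index (not the Pinecone client)."""

    def __init__(self, name, dimension, metric="cosine"):
        self.name = name
        self.dimension = dimension  # must equal the retriever's output size
        self.metric = metric        # assumed; should match how the model was tuned
        self.records = {}

    def upsert(self, vectors):
        """Store (id, values, metadata) tuples, rejecting wrongly sized vectors."""
        for vec_id, values, metadata in vectors:
            if len(values) != self.dimension:
                raise ValueError(
                    f"expected {self.dimension} values, got {len(values)}"
                )
            self.records[vec_id] = (values, metadata)

index = ToyIndex("gif-search", dimension=384, metric="cosine")

# A 384-dimensional vector is accepted.
index.upsert([("0", [0.0] * 384, {"url": "https://example.com/a.gif"})])

# A 374-dimensional vector (the misremembered size) is rejected.
try:
    index.upsert([("1", [0.0] * 374, {"url": "https://example.com/b.gif"})])
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

print(mismatch_rejected, len(index.records))
```

The dimension is fixed when the index is created, so it is worth reading it from the retriever rather than typing it from memory.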
00:10:29.580 | Okay, so once we've done that we move on to actually
00:10:37.380 | indexing. Okay, so we've initialized the
00:10:41.140 | vector database and initialized the retriever. Now we use the
00:10:45.140 | retriever model to create our embeddings, or the vectors
00:10:49.500 | that represent all of our GIF descriptions, and then we insert all of those into Pinecone.
00:10:55.940 | So we do that in this loop. It's pretty simple. We do it in batches of 64.
00:11:01.260 | Here I'm extracting a batch from the DataFrame that we have. I
00:11:05.980 | generate embeddings for that batch.
00:11:10.340 | We retrieve the metadata for that batch. So the metadata is going to be in a format. Do I have something here?
00:11:18.300 | Maybe I don't. So that metadata will be in a format like
00:11:28.580 | "description", and then we have a description of what is happening in there, and
00:11:35.660 | then also the "url", and that will point to the URL, right?
00:11:40.100 | Pretty simple.
00:11:42.860 | That's our metadata, and we have one of those for each record or item, and
00:11:49.540 | then we create our IDs. So the IDs need to be unique and they need to be strings, and
00:11:55.980 | that's what we're doing here. Okay, so the IDs are simply 0, 1, 2, 3,
00:12:01.380 | but obviously as strings, and then we pull that together. Okay, and we insert these into Pinecone.
00:12:08.060 | So we do that in batches of 64, and then at the end here
00:12:11.020 | I'm just checking that we have all the vectors in the index, which we do: a hundred and twenty-five
00:12:16.920 | thousand, or just over.
00:12:19.940 | Okay, cool, and we can also see the index fullness. So
00:12:24.380 | we can go up to about a million vectors on the free tier of Pinecone. So
00:12:30.260 | here you can see we have plenty of space left, which is obviously pretty useful.
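The shape of that indexing loop can be sketched without any external services. Everything here is a placeholder: `fake_embed` stands in for the retriever, the `upserted` list stands in for the vector database, and the 150 generated rows stand in for the dataset. Only the batch size of 64, the string IDs, and the description/url metadata layout come from the walkthrough.

```python
def fake_embed(texts, dim=4):
    """Placeholder for the retriever: one fixed-size vector per input text."""
    return [[float(len(text) % 7)] * dim for text in texts]

# Hypothetical dataset rows with the two fields described above.
rows = [
    {"url": f"https://example.com/{i}.gif", "description": f"description {i}"}
    for i in range(150)
]

batch_size = 64
upserted = []  # stands in for the vector database

for start in range(0, len(rows), batch_size):
    batch = rows[start:start + batch_size]
    # 1. Embed the descriptions in this batch.
    embeddings = fake_embed([row["description"] for row in batch])
    # 2. Build one metadata dict per record.
    metadata = [
        {"description": row["description"], "url": row["url"]} for row in batch
    ]
    # 3. IDs are unique strings: "0", "1", "2", ...
    ids = [str(i) for i in range(start, start + len(batch))]
    # 4. Combine into (id, vector, metadata) tuples and "insert" them.
    upserted.extend(zip(ids, embeddings, metadata))

print(len(upserted))
```

The final count check plays the same role as checking the number of vectors in the index after the real upsert loop finishes.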
00:12:37.340 | With that we can move on to the right of our chart here,
00:12:41.580 | which is the querying step. So querying is what we're going to do all the time. Indexing we do once,
00:12:47.700 | unless we're adding more data, and then we might do it again.
00:12:51.340 | Otherwise, querying is the main task of our pipeline.
00:12:57.820 | Every time a user makes a query or searches for something, we're going to be going through this pipeline. So
00:13:04.820 | to do that, again,
00:13:07.220 | it's pretty much the same as before. The query goes through our retriever model, which creates what we call a query vector.
00:13:12.860 | We pass that to Pinecone and then we search for the most similar
00:13:17.140 | already indexed
00:13:19.820 | GIF description vectors, or context vectors, and
00:13:25.500 | from there we return the most similar ones. Okay, and within those
00:13:30.540 | records from Pinecone we also include the metadata, and
00:13:35.020 | that includes the URL to the original GIF.
00:13:38.580 | So then we can just use some simple
00:13:41.700 | HTML to display those GIFs to the user. Let's have a look at how we actually do that in code.
00:13:48.620 | So I split it into two steps.
00:13:52.340 | There's a search section or part, which is
00:13:56.940 | encoding our query and
00:14:01.580 | searching for the most relevant
00:14:03.580 | context vectors, and that's what we're doing here. So we encode with our retriever,
00:14:09.460 | we query, and we are going to return the top 10 most similar contexts, and
00:14:15.580 | yeah, we just append that to this list.
00:14:22.340 | Okay, you can see here we're getting the metadata as well. That's also important.
00:14:26.740 | So here include_metadata needs to be true. Otherwise, we're not going to be returning any metadata and we can't extract those URLs.
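The step of pulling URLs out of the query results can be sketched against a mocked response. The dictionary below is an assumption about the general shape of such a response (a list of matches, each with an id, a score, and optional metadata); the helper name `extract_urls` and the example values are invented for illustration.

```python
def extract_urls(response):
    """Collect the URL from every match that carries metadata."""
    urls = []
    for match in response["matches"]:
        metadata = match.get("metadata")
        if metadata is not None:
            urls.append(metadata["url"])
    return urls

# Mocked response (hypothetical values). The second match shows what is left
# when metadata is not requested: there is no URL to display.
response = {
    "matches": [
        {
            "id": "12",
            "score": 0.91,
            "metadata": {
                "description": "a dog looks confused",
                "url": "https://example.com/confused-dog.gif",
            },
        },
        {"id": "40", "score": 0.88},
    ]
}

print(extract_urls(response))
```

Without the metadata, the IDs and scores alone give nothing to render, which is why the metadata flag matters for this use case.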
00:14:33.820 | Okay, and I'm just putting that in a search_gif function there, and
00:14:38.820 | then we have this GIF display function. So all this is is the really simple
00:14:46.700 | HTML that we are
00:14:49.780 | displaying.
00:14:51.180 | So we just have these
00:14:53.180 | div elements. Inside here, we have a figure and our image with the URL source that we have. Okay.
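A minimal sketch of that HTML, assuming a wrapping div with one figure and image tag per URL. The function name `build_gif_html`, the inline flex styling, and the example URLs are illustrative choices, not the original code.

```python
from html import escape

def build_gif_html(urls):
    """Return an HTML string with one <figure><img></figure> per GIF URL."""
    figures = "".join(
        # Escape the URL so characters like & or " cannot break the attribute.
        f'<figure><img src="{escape(url, quote=True)}"></figure>'
        for url in urls
    )
    return f'<div style="display: flex; flex-wrap: wrap;">{figures}</div>'

markup = build_gif_html(
    [
        "https://example.com/a.gif",
        "https://example.com/b.gif?x=1&y=2",
    ]
)
print(markup)
```

In a Jupyter notebook the resulting string could be passed to `IPython.display.HTML` to render the GIFs inline; in an app it would go to whatever HTML rendering call the framework provides.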
00:15:00.660 | And then to perform a search we just do search_gif,
00:15:06.060 | "dog being confused" in this case, and we display them.
00:15:09.660 | Okay, and we can see that we're getting all of these GIFs where a dog is confused, which is, I think, pretty cool.
00:15:16.300 | And we have loads of examples here. I just went through a load of them to see what we get.
00:15:21.900 | Yeah, this one is quite specific. You can get really specific with this as well.
00:15:31.780 | So "fluffy dog being cute and dancing like a person", and then we get this one up at the top here.
00:15:38.940 | So yeah, I think that is pretty cool. Now, if we want to
00:15:45.020 | replicate the sort of app, the Streamlit app that you saw before,
00:15:48.020 | you can, if you just want to test it, but to create this app, it's super simple.
00:15:56.140 | Obviously we're using Streamlit, and this is all we do.
00:16:00.260 | I'm going to zoom out a bit, so it might be kind of hard to read on your screen, but I just want to
00:16:04.820 | show the whole code, and
00:16:06.820 | again, you can just download this anyway, so it shouldn't really be an issue.
00:16:13.180 | This code is slightly outdated, actually. So this model here is not the MPNet model anymore.
00:16:19.300 | It's actually using the MiniLM model. It's just a smaller model, and
00:16:22.980 | yeah, all we do is initialize Pinecone and initialize our retriever like we did in those notebooks.
00:16:31.380 | We have our HTML here, exactly the same as in the notebooks again, and then we just write our app.
00:16:39.820 | So we have "AI-powered GIF search".
00:16:42.540 | "What are you looking for?" It's just a text input
00:16:45.040 | that passes into query. Whenever query is not empty,
00:16:49.240 | we will begin to search. Okay, so we
00:16:53.560 | encode the query, as we did before, to create our query vector. We retrieve the most similar contexts and then we
00:17:01.580 | extract the
00:17:04.220 | URL metadata from those and then we display them, and that's literally it. It's super simple, and
00:17:11.740 | we get this really cool GIF search app from that. So
00:17:15.500 | that's it for this video. I hope this has been
00:17:20.500 | interesting. So
00:17:23.220 | thank you very much for watching, and I will see you again in the next one. Bye.