back to index

How to Build an AI-Powered Video Search App


Chapters

0:0 Intro
2:56 YouTube Search App
4:43 Getting Data
7:58 Enhancing the Data
12:45 Scraping Other Metadata
14:52 Loading Data from Hugging Face
15:42 Index and Query the Data
20:43 Streamlit App Code

Whisper Transcript | Transcript Only Page

00:00:00.000 | Today I want to have a look at how we can build a NLP powered
00:00:06.300 | Intelligent search for video and more specifically we're gonna have a look at how we can do that for YouTube now
00:00:15.240 | YouTube is
00:00:17.680 | obviously huge
00:00:19.680 | But it started with a very simple video
00:00:24.800 | Titled I'm at the zoo
00:00:28.600 | This video was a 19 second clip of
00:00:31.200 | YouTube's co-founder
00:00:33.840 | literally at the zoo showing you some elephants in the enclosure behind him and
00:00:39.680 | it seems really silly now, but
00:00:43.140 | When you think about this that was back in
00:00:47.600 | 2005 and up until that point the only
00:00:51.240 | video or
00:00:53.780 | Insights into other people's lives that we we will have seen
00:00:58.080 | is that of
00:01:00.080 | celebrities or
00:01:02.220 | Politicians and all of it was very orchestrated and not natural. So
00:01:07.460 | YouTube
00:01:09.740 | sort of brought this very
00:01:12.260 | unique
00:01:14.460 | age where we're able to actually see into other people's lives and normal people's lives like me or you and
00:01:22.300 | See just what they are doing and
00:01:29.200 | I'm at the zoo. It's a kind of odd very short video, but it really marks the beginning of this almost
00:01:37.460 | Explosion of
00:01:41.320 | Normal people made content which I think it is really cool and
00:01:49.240 | What YouTube brings to us is a lot more than just an insight in someone's life
00:01:54.120 | it can really range from anything you're watching this video right now to learn about how to build a
00:02:01.440 | search tool for video using natural language processing and
00:02:06.080 | Without YouTube
00:02:09.600 | Sure, there'd probably be something else but YouTube was really the place where this all started
00:02:15.060 | So building a search tool that allows you to search through some of the most engaging
00:02:22.160 | content on the planet I
00:02:24.100 | Think is really cool. And of course, we can just use normal Google search or YouTube's actual search bar
00:02:30.960 | well, as we'll see we can actually build something that maybe
00:02:35.240 | Does quite well in comparison to those tools. So
00:02:40.620 | before we really jump into the
00:02:43.920 | technology and
00:02:46.400 | How to build it. Let's have a quick look at a few examples of what we are going to actually be able to build up to
00:02:53.620 | following through this video
00:02:56.280 | So this is just a really simple interface
00:02:59.100 | Built using streamlet and we can just say okay. What is deep learning?
00:03:10.080 | Okay, we're gonna have here a few videos. I think we have five
00:03:14.360 | Okay, and we can just kind of click on one of these so I mean all of these seem pretty relevant
00:03:20.400 | But I just want to show you something really cool
00:03:22.760 | If we click on one of these will take us straight to the point of the video. So here it started at 329 and
00:03:28.840 | It's going to show us
00:03:32.360 | What is deep learning? Okay, so that's really cool like straight away. It's almost like a very intelligent and fast
00:03:39.080 | search tool
00:03:41.680 | for YouTube and
00:03:43.680 | YouTube is full of amazing content. So this is is I think really cool. I was playing around with this quite a lot
00:03:49.280 | If we're interested in something else, so what is a transform up? What is a transformer?
00:03:57.000 | model
00:03:59.800 | What is a Bert transform model?
00:04:01.800 | Okay, it's a Bert this looks pretty promising
00:04:11.280 | Transformers here looks pretty good. And then if we have a look at this one's also the first result
00:04:16.820 | This is the most relevant and this explains
00:04:19.400 | Bert pretty well as well. So that's really cool and
00:04:23.800 | What I think is coolest about this is we just use off-the-shelf models and put this together super quickly it is
00:04:32.400 | Not complicated as you will see to pull this together. You can probably do it within
00:04:38.760 | Next 20 minutes after following this video
00:04:41.400 | So we're gonna have a look at this data set here so if we have a look in here we have
00:04:49.080 | 116,000 text files and we also have the audio files as well
00:04:56.320 | So we're only going to use the text files and these are basically for every small
00:05:01.920 | Segment video we have the subtitles from that segment and that's what we're going to be performing our search on
00:05:10.520 | Let's have a look at what?
00:05:12.520 | how we can download this data from from Kaggle and
00:05:18.080 | What the data actually looks like so to start you will need an account
00:05:22.320 | So once you have created an account on Kaggle you go to the right here you go to I think account
00:05:29.720 | We have this nice picture of me with hair and
00:05:35.760 | We can scroll down and there's this little API section here and we can create a new API token
00:05:42.520 | once you've done that that's going to download a
00:05:45.680 | Kaggle JSON file to your to your computer and
00:05:51.120 | we use that Kaggle JSON file to
00:05:55.920 | Authenticate the Kaggle Python client so the Kaggle Python client
00:06:02.640 | We're just pipping so Kaggle and then when you run this import Kaggle the first time if you don't have that Kaggle JSON
00:06:08.960 | download onto your system
00:06:11.160 | This will come with an error and it will tell you where to place your Kaggle JSON
00:06:15.560 | So you just web directory tells you to put in put your Kaggle JSON in there rerun import Kaggle and it should work
00:06:24.640 | then to use our
00:06:28.440 | Let's use this
00:06:30.040 | We just run this so we're so this is in the command line. You can also do this in a terminal window
00:06:35.640 | Kaggle datasets download and this is that dates that I just showed you
00:06:40.400 | we will need to unzip that and
00:06:43.200 | Once we unzip that
00:06:47.120 | You will find everything in this data directory here now you can see in this data directory we there are a lot of files
00:06:56.800 | So all of these represent a video ID
00:07:02.120 | So if we go into here, we have all these timestamps and this is just a range of a timestamp and in there
00:07:09.000 | We have the audio that represents the audio from that range and we also have the subtitles from that range
00:07:16.240 | So if we look over here
00:07:18.480 | So within the range of
00:07:22.440 | From three seconds in to about six seconds in
00:07:27.180 | the word
00:07:30.440 | or the words
00:07:32.440 | Spoken in the video of machine learning. It's a buzzword, but I and then we're on to the next timestamp
00:07:40.280 | Which would be here. We have look
00:07:43.280 | So was but I would also claim it's a lot more than and then I assume they're gonna talk it
00:07:51.920 | Talk about how machine learning is is more than a buzzword
00:07:55.280 | That's fine
00:07:59.800 | Yeah, that's what we have there
00:08:01.800 | But that's very little you saw in that little app that we had like thumbnails. We had the
00:08:08.600 | video title which we don't have in that data set and
00:08:12.440 | We even I think we had the video description as well. Okay, so you can see all that here
00:08:18.640 | We didn't have any of this in that in that data set
00:08:22.540 | So, how do we do that? How do we how do we scrape that data?
00:08:26.880 | To do that. We need a few things. So we need beautiful soup
00:08:30.920 | TQDM you don't you don't necessarily need it, but it's as best we do and
00:08:35.800 | Datasets use datasets here because rather than having all this data
00:08:41.580 | Stored locally we can actually save it to the hooking face hub and
00:08:48.160 | If you do if you're not really interested you just want to build a separate it quickly you can download that data and I will show
00:08:54.960 | You how you can do that in what's probably the next chapter of this video
00:09:00.160 | So you can hover over the timeline of this video and you'll be able to see
00:09:04.000 | The next chapter and it saw that you'll be able to see how we can load that data set
00:09:08.480 | Directly from putting face rather than going through all this pre-processing. I
00:09:14.400 | want to show you because this is how you you like a big part of machine learning and building these things is
00:09:20.920 | The pre-processing so it's it would be a shame to miss it
00:09:24.860 | So we can see yeah what I showed you before we have video IDs
00:09:29.480 | Which is just a list of the directories in each one those we have the timestamps and we can load them
00:09:37.040 | So so we have in data video ID splits
00:09:40.220 | And then we have the the subtitles of text file
00:09:45.400 | Cool, and what I want to do is just loop through all these files
00:09:49.080 | To give us that, you know, give us what we can get from those files. So that is a video ID
00:09:55.600 | From the directory names the text from subtitles up text the start second and second which we can get from the
00:10:04.440 | directory
00:10:06.440 | timestamps and
00:10:08.000 | Also the URL because the URL is actually we can pull that from the video ID
00:10:15.720 | So I won't go through this in too much depth
00:10:19.120 | But what we're doing is we're going through each one of those splits
00:10:23.560 | Extracting that small chunk of text is very small chunk of text. If you remember it's just a few seconds long. So
00:10:29.880 | With Q&A, we really want to have longer chunks of text and like five words
00:10:35.720 | So what I've done here is said, okay once we reach about three or four sentences
00:10:42.960 | We are going to
00:10:45.280 | Save that as a chunk. Okay, there are better ways to do this. It's like a very
00:10:52.520 | Good approach is to have overlapping chunks so that you're not missing anything at all there because we might be cutting things off
00:11:00.560 | Right in the middle of a sentence. It's probably not the best idea
00:11:03.720 | But just for this demo, this is good enough. It's not problem
00:11:09.240 | So we create our start and end seconds using the timestamps that we have in that directory name
00:11:15.760 | okay, so that's literally just the number of seconds into the video that we are at and
00:11:21.680 | Then we create a document which is just all of the details for that particular chunk of text
00:11:28.400 | including the video that comes from the text itself the sign and seconds of that chunk of text and
00:11:37.080 | Also this which is the URL so the URL directs us to the video the specific video and also
00:11:43.820 | the start of that chunk of text that we're going to be returning later on and
00:11:48.520 | Yeah, so we create a list of documents like that
00:11:54.560 | Shouldn't take long
00:11:59.320 | This is one example and you can see okay like here. We kind of just cut off in the middle of
00:12:08.680 | Sentence and it's not perfect as well, but that's fine. It works well enough
00:12:13.080 | So it would be better to have some sort of window where we overlap so we for example took this
00:12:20.500 | Paragraph and then we maybe took half the first this second half of the paragraph followed by the next half of the next
00:12:29.040 | document
00:12:31.760 | But this is it's good enough
00:12:33.760 | You see a few of those and we have this starting in seconds URL, which you can see we have the 41 seconds 41
00:12:45.720 | Now as I said before there's that other
00:12:47.920 | Metadata that we don't have in here now
00:12:52.280 | You might need to show on Mac. You might need to pin the pip or conda install
00:12:59.360 | XML and that is for a beautiful soup a beautiful soup is a
00:13:04.040 | like a data scraping library or
00:13:08.040 | or HTML
00:13:10.480 | processing library almost
00:13:12.480 | So it's really good when we're scraping information from websites, which we can do so we can okay
00:13:19.240 | We're going to go to each video and we're going to capture the data thumbnail and any
00:13:25.520 | other
00:13:27.560 | Information we can from there. So in this case, we just have the title and the
00:13:33.360 | thumbnail and
00:13:36.440 | We saw those within this metadata
00:13:39.720 | Dictionary. Okay, so the title thumbnail
00:13:43.480 | If there's a error so there were a couple of
00:13:48.640 | Exceptions here rather than just
00:13:52.880 | Throwing an error and not returning anything or stopping the process
00:13:57.280 | I'm just returning an empty title and thumbnail because there's two out of the 127 that we scrape there
00:14:05.000 | So I think right is here's too much of an issue
00:14:07.040 | Okay, so now we have the document which is what we originally pulled and then we have the title and thumbnail so
00:14:16.160 | what we can need to do here is pull those together and
00:14:20.520 | There we have our full
00:14:22.520 | Our full document, okay
00:14:29.200 | When we are
00:14:31.200 | Saving things to the hugging face hub. We can just save it as a Jason Lyons file
00:14:36.040 | Okay, so it's like the this list of dictionaries save that to a Jason L
00:14:41.360 | file and then you can actually just upload that directly to
00:14:44.960 | hugging face, so
00:14:48.400 | That's yeah pretty straightforward
00:14:50.960 | Okay, so as promised this is how you can
00:14:57.200 | Download the data that we just created. So
00:15:01.540 | Super easy. It's exactly the same as what you saw before we have video ID takes saw and URL title and thumbnail and we have
00:15:10.520 | 11,000 items there. Okay, and
00:15:15.160 | So we have 11,000 sort of documents and that is spread across
00:15:19.640 | 127 videos or you know unique videos as far as I can remember and
00:15:26.320 | We can just see one of those and you can see we've already seen this example from before so it's exactly the same but obviously
00:15:34.200 | a lot easier
00:15:35.840 | For us to actually use because we're just pulling it from hugging face hub, which I think is really cool
00:15:44.000 | The next thing we need to do is actually
00:15:46.000 | Index all those documents within vector database. Of course, we're using pinecone here. So
00:15:52.380 | First we need to do to begin doing that is initialize a sentence transformer
00:15:58.640 | so the sentence transformer is a model that is going to take the
00:16:02.920 | text and convert it into a vector which we can then place inside our vector database and use that to
00:16:11.240 | Perform our sort of intelligent semantic search through all of its documents
00:16:17.400 | We have the match sequence length
00:16:19.400 | 128 here so I use this
00:16:22.200 | To come up with this of three to four sentence length of our paragraphs
00:16:28.660 | Because typically a token which is is what this is 128 tokens. It's typically going to be something like
00:16:36.880 | Three to five characters. So we'll go with three
00:16:41.520 | characters here and
00:16:44.120 | so so that's why that's why we got 360 from and
00:16:48.320 | We have this
00:16:53.160 | Sentence embedding dimensionality, so that's important
00:16:56.800 | So we pull that in here to the embedding dim variable and then we're going to use that when we are
00:17:02.000 | initializing our
00:17:05.200 | index, so
00:17:07.200 | We need to get an API key for this. So you go to app the pinecone I/O you create an API key
00:17:13.000 | And then you can just go here and take that. So if you're just looking at where to actually get your API key
00:17:21.560 | Okay, so you go within your your project. It's probably going to be called someone's default project
00:17:28.220 | You click in there you go to API keys and then you have this you just copy this and then you would paste it inside
00:17:36.160 | Pretty simple and then we can create our
00:17:39.040 | Index, whatever you can call it. Whatever you want. I call it YouTube search because that's what this is for me
00:17:45.560 | Again, you call it whatever you want for this model
00:17:49.020 | We need to use case on similarity and we need to align the model embedding dimension and the dimensionality of our index
00:17:56.020 | Okay, so that's what I've done there and it will connect to our new index using the name that we gave it here
00:18:03.200 | So what I'm going to do is index our data and batches of 64
00:18:07.080 | the data we insert into our index will include the
00:18:14.120 | Document ID
00:18:18.080 | The embedded vector that we create using the sentence transform model and any metadata we'd like to include
00:18:24.600 | So that's what we saw before the title start seconds a text itself. We're going to include all of that in there
00:18:32.360 | So to do that we create this loop. This is where I'm using tqdm. This is just a set
00:18:36.760 | This is a progress bar so that we can see, you know, how long this is gonna take
00:18:41.440 | it shouldn't take too long though, by the way and
00:18:44.320 | What we do so in batches of
00:18:47.880 | 64 I'm going to encode all of the text
00:18:52.680 | I'm going to create the the IDs so we have
00:18:58.560 | Two sets of so we have a video ID, but the video ID is not unique for every single snippet
00:19:04.840 | So what I'm doing here is taking video ID and a start second because that is unique and placing those together
00:19:11.280 | okay, and
00:19:14.240 | Then what we're doing is creating the metadata. So I just want these items
00:19:18.760 | from the metadata
00:19:21.560 | from before so
00:19:23.560 | Just pulling those in there's nothing
00:19:27.560 | Nothing particularly unique there
00:19:30.160 | Okay, and then just upsetting all those so inserting everything into pinecone
00:19:36.880 | The IDs the embeddings and the metadata and yeah, just adding all that to pinecone super easy
00:19:44.160 | And then we describe our indexes. You don't need to do this, but this is so we can see what is in there
00:19:50.320 | So we have the dimensionality how full our index is. We don't have much in there at the moment. So it's
00:19:55.760 | It's not very full. You can usually fit about a million
00:19:58.640 | embeddings into a typical
00:20:01.680 | what called pod which is like a hardware unit of
00:20:07.920 | of the vector database you can have more pods if you need more than that and
00:20:12.320 | Yeah, so this is our vector count we have just over 11,000 items in there
00:20:20.480 | And then we go this is what we're doing for wise deep learning and then we return those results. So this is just a text and
00:20:27.480 | Yeah, everything is pretty relevant there. So that is how it works
00:20:33.840 | It's pretty I think simple
00:20:37.000 | The only other thing to do is actually put all that into a streamer app, which is not hard at all
00:20:42.760 | okay, so
00:20:45.080 | This is our code. I'm gonna zoom out a little bit. I'm sorry if this is kind of small
00:20:51.440 | It's yeah, it's hard to get that all I'm gonna squeeze into the screen. So
00:20:57.760 | Yeah, I mean the same as before we're just initializing things
00:21:02.120 | We have this little card, which is like the each component or result
00:21:06.920 | We return just HTML really basic HMS or nothing complicated there
00:21:11.640 | And then yeah, we're just using stream that components stream lit right markdown
00:21:18.880 | There's nothing nothing really in there. We have the search bar whenever there's the search bar has something inside it. We research and
00:21:26.080 | just return
00:21:28.400 | That information in the format of a card and all that together is how we build this
00:21:33.320 | So it's really I think super straightforward and really easy to use
00:21:38.640 | Yeah, that's it for for this video
00:21:43.040 | I hope this has been being useful and at least insightful how this is how you can build something like this
00:21:49.080 | Of course, this is just one example like YouTube search video search
00:21:52.800 | you can you can search through anything that you can imagine as long as you have some sort of data that represents it and some
00:22:00.860 | Way to represent that sort of data as a question and answer
00:22:05.400 | format
00:22:07.400 | you can
00:22:09.240 | You can you can build something like this super easily
00:22:14.840 | Hope that's been useful. Thank you very much for watching and I will see you again in the next one. Bye