How to Build an AI-Powered Video Search App
Chapters
0:00 Intro
2:56 YouTube Search App
4:43 Getting Data
7:58 Enhancing the Data
12:45 Scraping Other Metadata
14:52 Loading Data from Hugging Face
15:42 Index and Query the Data
20:43 Streamlit App Code
Today I want to have a look at how we can build NLP-powered intelligent search for video, and more specifically how we can do that for YouTube. Now, the very first YouTube video was a guy literally at the zoo, showing you some elephants in the enclosure behind him. Before that, the insights into other people's lives that we would have seen were of celebrities and politicians, and all of it was very orchestrated and not natural. So YouTube really began the age where we're able to actually see into other people's lives, normal people's lives, like me or you. That first video, "Me at the zoo", is a kind of odd, very short video, but it really marks the beginning of this normal-people-made content, which I think is really cool. And what YouTube brings us is a lot more than just an insight into someone's life; it can really range across anything. You're watching this video right now to learn how to build a search tool for video using natural language processing. Sure, there would probably be something else if YouTube didn't exist, but YouTube was really the place where this all started.
So building a search tool that allows you to search through some of the most engaging content out there is, I think, really cool. Of course, we could just use normal Google search or YouTube's actual search bar, but as we'll see, we can actually build something that maybe does quite well in comparison to those tools. Before we look at how to build it, let's have a quick look at a few examples of what we're actually going to be able to build up to.
The demo app is built using Streamlit, and we can just ask, okay, what is deep learning? We get a few videos here; I think we have five. We can just click on one of these, and I mean, all of these seem pretty relevant, but I just want to show you something really cool. If we click on one of these, it will take us straight to the relevant point in the video. So here it started at 3:29: "What is deep learning?" That's really cool; straight away, it's almost like a very intelligent and fast way of searching through video. YouTube is full of amazing content, so this is, I think, really cool, and I was playing around with this quite a lot. If we're interested in something else, say, what is a transformer? Okay, BERT; this looks pretty promising. Transformers here looks pretty good, and then if we have a look at this one, also the first result, it covers BERT pretty well too. So that's really cool. What I think is coolest about this is that we just use off-the-shelf models and put this together super quickly. It is not complicated, as you will see, to pull this together; you can probably do it very quickly yourself.
So, we're going to have a look at this dataset here. If we look inside, we have 116,000 text files, and we also have the audio files as well. We're only going to use the text files: for every small segment of video, we have the subtitles from that segment, and that's what we're going to be performing our search on.
Let's look at how we can download this data from Kaggle and what the data actually looks like. To start, you will need a Kaggle account. Once you have created an account on Kaggle, you go to the top right, you go to, I think, Account, where we have this nice picture of me with hair. We can scroll down, and there's this little API section here, where we can create a new API token. Once you've done that, it's going to download a kaggle.json file to your computer, which we use to authenticate the Kaggle Python client. We just pip install kaggle, and then when you run `import kaggle` for the first time, if you don't have that kaggle.json in place, it will raise an error and tell you where to place your kaggle.json. So you just put your kaggle.json in the directory it tells you, rerun `import kaggle`, and it should work. To download the data, we just run this as a shell command, or you can do it in a terminal window: `kaggle datasets download` followed by the dataset I just showed you.
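As a rough sketch, the download step looks something like this from Python (the dataset slug here is a placeholder; use the real one shown on the dataset page):

```python
# Download and unzip the Kaggle dataset into ./data; assumes kaggle.json
# is already in place so that importing kaggle authenticates successfully.
import kaggle

kaggle.api.dataset_download_files(
    "owner/youtube-subtitles",  # hypothetical slug, swap in the real dataset
    path="./data",
    unzip=True,
)
```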
Once that's done, you will find everything in this data directory, and you can see there are a lot of files in there. If we go into one, we have all these timestamps; each is just a timestamp range, and inside we have the audio that represents that range and also the subtitles from that range. So, for example, from three seconds in to about six seconds in, the text spoken in the video is "machine learning. It's a buzzword, but I", and then we're on to the next timestamp, which was "but I would also claim it's a lot more than", and then I assume they're going to talk about how machine learning is more than a buzzword.
But that's very little. In that little app, you saw that we had thumbnails, we had the video title, which we don't have in that dataset, and I think we even had the video description as well. You can see all of that here, but we didn't have any of it in that dataset. So how do we scrape that data?
To do that, we need a few things. We need Beautiful Soup; tqdm, which you don't necessarily need, but it's useful to have; and datasets. We use datasets here because, rather than having all this data stored locally, we can actually save it to the Hugging Face Hub. If you're not really interested in the preprocessing and just want to build the app quickly, you can download that prepared data, and I will show you how in what's probably the next chapter of this video. You can hover over the timeline of this video and you'll be able to see the next chapter, and in that you'll see how we can load the dataset directly from Hugging Face rather than going through all this preprocessing. I do want to show you the preprocessing, though, because a big part of machine learning and building these things is the preprocessing, so it would be a shame to miss it.
So, as I showed you before, we have video IDs, which are just a list of the directories; in each of those we have the timestamps, which we can load, and then we have the subtitles as a text file. What I want to do is just loop through all these files to extract what we can get from them: the video ID from the directory names, the text from subtitles.txt, the start and end seconds, which we can get from the timestamp directory names, and also the URL, because we can actually build that from the video ID.
What we're doing is going through each one of those splits and extracting that small chunk of text, and it is a very small chunk; if you remember, it's just a few seconds long. For Q&A, we really want to have longer chunks of text than just, like, five words. So what I've done here is say, okay, once we reach about three or four sentences, save that as a chunk. There are better ways to do this; a very good approach is to have overlapping chunks so that you're not missing anything, because here we might be cutting things off right in the middle of a sentence, which is probably not the best idea. But just for this demo, this is good enough; it's not a problem.
So we create our start and end seconds using the timestamps that we have in the directory name; that's literally just the number of seconds into the video that we're at. Then we create a document, which is just all of the details for that particular chunk of text, including the video it comes from, the text itself, the start and end seconds of that chunk, and also the URL. The URL directs us to the specific video and also to the start of that chunk of text, which we're going to be returning later on. So we create a list of documents like that.
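As a minimal sketch, assuming the directory layout described above (data/<video_id>/<start>_<end>/subtitles.txt; the exact names are assumptions), the loop looks something like this:

```python
from pathlib import Path

documents = []
for video_dir in Path("data").iterdir():
    video_id = video_dir.name
    chunk, chunk_start = "", None
    # sort the timestamp directories numerically by their start second
    splits = sorted(video_dir.iterdir(), key=lambda p: int(p.name.split("_")[0]))
    for ts_dir in splits:
        start, end = (int(x) for x in ts_dir.name.split("_"))
        text = (ts_dir / "subtitles.txt").read_text().strip()
        chunk = f"{chunk} {text}".strip()
        chunk_start = start if chunk_start is None else chunk_start
        # counting full stops is a rough proxy for "three or four sentences"
        if chunk.count(".") >= 3:
            documents.append({
                "video_id": video_id,
                "text": chunk,
                "start_second": chunk_start,
                "end_second": end,
                "url": f"https://youtu.be/{video_id}?t={chunk_start}",
            })
            chunk, chunk_start = "", None
```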
Here's one example, and you can see that here we kind of just cut off in the middle of a sentence. It's not perfect, but that's fine; it works well enough. It would be better to have some sort of window where we overlap: for example, one chunk would be this paragraph, and the next chunk would be the second half of this paragraph followed by the first half of the next one, and so on. You can see a few of those documents here, with the start in seconds and the URL; notice the start of 41 seconds appears as the t=41 at the end of the URL.
One thing to note: on Mac, you might need to pip or conda install lxml, which is for Beautiful Soup. Beautiful Soup is a library for parsing HTML, so it's really good when we're scraping information from websites. What we're going to do is go to each video page and capture the thumbnail and any other information we can from there; in this case, we just grab the title and the thumbnail. If a page fails, rather than throwing an error and not returning anything, or stopping the whole process, I'm just returning an empty title and thumbnail, because there are only two out of the 127 videos we scrape where that happens, so I don't think it's too much of an issue.
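A hedged sketch of that scrape, not the video's exact code (the og: meta tags are an assumption about YouTube's page markup):

```python
import requests
from bs4 import BeautifulSoup

def get_meta(video_id: str) -> tuple[str, str]:
    """Return (title, thumbnail) for a video, or empty strings on failure."""
    try:
        url = f"https://www.youtube.com/watch?v={video_id}"
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "lxml")
        title = soup.find("meta", property="og:title")["content"]
        thumbnail = soup.find("meta", property="og:image")["content"]
        return title, thumbnail
    except Exception:
        # a couple of the 127 videos fail; skip them rather than stop
        return "", ""
```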
Okay, so now we have the documents, which are what we originally pulled, and then we have the title and thumbnail for each video. What we need to do here is pull those together.
Next is saving things to the Hugging Face Hub. We can just save the data as a JSON Lines file: we take this list of dictionaries, save it to a .jsonl file, and then you can actually just upload that directly to the Hugging Face Hub.
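That save step is only a few lines (the filename here is an assumption):

```python
import json

# one JSON object per line, which is all the JSON Lines format is
with open("youtube-transcriptions.jsonl", "w") as f:
    for doc in documents:
        f.write(json.dumps(doc) + "\n")
```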
Loading it back is super easy, and it's exactly the same as what you saw before: we have the video ID, text, start and end seconds, URL, title, and thumbnail. We have around 11,000 documents, and they're spread across 127 unique videos, as far as I remember. We can just look at one of those; we've already seen this example, so it's exactly the same, but obviously now it's much easier for us to actually use, because we're just pulling it from the Hugging Face Hub, which I think is really cool.
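Loading it looks something like this (the repo id is an assumption; use the one linked for this video):

```python
from datasets import load_dataset

docs = load_dataset("jamescalam/youtube-transcriptions", split="train")
print(docs[0])  # video_id, text, start/end seconds, url, title, thumbnail
```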
Now we index all those documents within a vector database; of course, we're using Pinecone here. The first thing we need to do is initialize a sentence transformer. The sentence transformer is a model that is going to take the text and convert it into a vector, which we can then place inside our vector database and use to perform our intelligent semantic search through all of these documents.
The model's 128-token limit is also how we came up with the three-to-four-sentence length of our paragraphs: a token, which is what the 128 refers to, is typically going to be something like three to five characters, so we'll go with three, and that's where we got the 360 from. We also need the sentence embedding dimensionality; that's important. We pull that into the embedding dim variable, and then we're going to use it when we initialize our Pinecone index.
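A minimal sketch of that initialization, assuming a sentence-transformers QA model (the exact checkpoint used in the video may differ):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")  # assumed checkpoint
embed_dim = model.get_sentence_embedding_dimension()  # 384 for this model
print(model.max_seq_length, embed_dim)
```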
We need to get an API key for Pinecone, so you go to app.pinecone.io and create an API key; then you can just come here and grab it. If you're looking for where to actually get your API key: you go into your project, which is probably going to be called someone's default project, you click in there, you go to API Keys, you copy the key, and then you paste it in here. Then we create the index. You can call it whatever you want; I call it youtube-search, because that's what this is for me. For this model, we need to use cosine similarity, and we need to align the dimensionality of our index with the model's embedding dimension. That's what I've done there, and then we connect to our new index using the name we gave it here.
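A sketch using the classic pinecone-client API from around the time of the video (newer client versions use `from pinecone import Pinecone` instead):

```python
import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")

# cosine metric, with the index dimension matched to the embedding model
if "youtube-search" not in pinecone.list_indexes():
    pinecone.create_index("youtube-search", dimension=embed_dim, metric="cosine")

index = pinecone.Index("youtube-search")
```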
What I'm going to do now is index our data in batches of 64. The data we insert into our index will include an ID, the embedded vector that we create using the sentence transformer model, and any metadata we'd like to include; that's what we saw before, the title, the start seconds, the text itself, and we're going to include all of that. To do that, we create this loop. This is where I'm using tqdm, which just gives us a progress bar so we can see how long this is going to take; it shouldn't take too long, by the way. For the IDs, we have a video ID, but the video ID is not unique for every single snippet, so what I'm doing here is taking the video ID and the start second, because that combination is unique, and placing those together. Then we create the metadata; I just want these items. Then we upsert all of those, inserting everything into Pinecone: the IDs, the embeddings, and the metadata. Super easy.
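As a hedged sketch of that loop (field names follow the documents we built earlier; the video's exact code may differ slightly):

```python
from tqdm.auto import tqdm  # progress bar over the batches

batch_size = 64
for i in tqdm(range(0, len(documents), batch_size)):
    batch = documents[i:i + batch_size]
    # video_id alone is not unique per snippet, so append the start second
    ids = [f"{d['video_id']}-{d['start_second']}" for d in batch]
    embeds = model.encode([d["text"] for d in batch]).tolist()
    metadata = [
        {k: d[k] for k in ("title", "text", "start_second", "url", "thumbnail")}
        for d in batch
    ]
    index.upsert(vectors=list(zip(ids, embeds, metadata)))
```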
Then we describe our index. You don't need to do this, but it lets us see what's in there: the dimensionality and how full our index is. We don't have much in there at the moment, so it's not very full; you can usually fit about a million vectors in what's called a pod, which is like a hardware unit of the vector database, and you can have more pods if you need more than that. So this is our vector count: we have just over 11,000 items in there.
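That check is a one-liner:

```python
# reports the index dimension, fullness, and total vector count
index.describe_index_stats()
```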
And then we query; this is what we're doing for "what is deep learning?", and we return those results. This is just the text, and everything there is pretty relevant. So that is how it works.
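A query is just the same encoder plus one call to the index:

```python
query = "what is deep learning?"
xq = model.encode(query).tolist()
results = index.query(vector=xq, top_k=5, include_metadata=True)
for match in results["matches"]:
    print(match["score"], match["metadata"]["text"][:80])
```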
The only other thing to do is actually put all of that into a Streamlit app, which is not hard at all. This is our code; I'm going to zoom out a little bit, and I'm sorry if this is kind of small, it's hard to squeeze it all onto the screen. It's the same as before: we're just initializing things. We have this little card, which is each component, or result, we return; it's just really basic HTML, nothing complicated there. And then we're just using Streamlit components, like st.write and st.markdown; there's nothing much in there. We have the search bar, and whenever the search bar has something inside it, we run the search and return that information in the format of a card. All of that together is how we build this.
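A stripped-down sketch of the app (the card markup and names are assumptions, and `model` and `index` are initialized exactly as before):

```python
import streamlit as st

st.title("YouTube Q&A Search")
query = st.text_input("Search:", "")

def card(thumbnail: str, title: str, url: str, text: str) -> None:
    # one result rendered as a basic HTML card
    st.markdown(
        f'<a href="{url}"><img src="{thumbnail}" width="240"/></a>'
        f'<h4><a href="{url}">{title}</a></h4><p>{text}</p>',
        unsafe_allow_html=True,
    )

if query:  # whenever the search bar has something in it, run the search
    xq = model.encode(query).tolist()
    results = index.query(vector=xq, top_k=5, include_metadata=True)
    for match in results["matches"]:
        m = match["metadata"]
        card(m["thumbnail"], m["title"], m["url"], m["text"])
```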
So it's really, I think, super straightforward and really easy to use. I hope this has been useful, or at least insightful as to how you can build something like this. Of course, this is just one example, YouTube search, video search; you can search through anything you can imagine, as long as you have some sort of data that represents it and some way to represent that data for question answering. You can build something like this super easily. I hope that's been useful. Thank you very much for watching, and I will see you again in the next one. Bye.