How to Build an AI-Powered Video Search App

00:00:00.000 | Today I want to have a look at how we can build an NLP-powered

00:00:06.300 | Intelligent search for video and more specifically we're gonna have a look at how we can do that for YouTube now

00:00:15.240 | YouTube is

00:00:17.680 | obviously huge

00:00:19.680 | But it started with a very simple video

00:00:24.800 | titled "Me at the zoo".

00:00:28.600 | This video was a 19 second clip of

00:00:31.200 | YouTube's co-founder

00:00:33.840 | literally at the zoo showing you some elephants in the enclosure behind him and

00:00:39.680 | it seems really silly now, but

00:00:43.140 | When you think about this that was back in

00:00:47.600 | 2005 and up until that point the only

00:00:51.240 | video or

00:00:53.780 | insights into other people's lives that we would have seen

00:00:58.080 | is that of

00:01:00.080 | celebrities or

00:01:02.220 | Politicians and all of it was very orchestrated and not natural. So

00:01:07.460 | YouTube

00:01:09.740 | sort of brought this very

00:01:12.260 | unique

00:01:14.460 | age where we're able to actually see into other people's lives and normal people's lives like me or you and

00:01:22.300 | See just what they are doing and

00:01:27.960 | Okay

00:01:29.200 | "Me at the zoo". It's a kind of odd, very short video, but it really marks the beginning of this almost

00:01:37.460 | Explosion of

00:01:41.320 | content made by normal people, which I think is really cool, and

00:01:47.140 | now

00:01:49.240 | What YouTube brings to us is a lot more than just an insight into someone's life.

00:01:54.120 | It can really be anything. You're watching this video right now to learn about how to build a

00:02:01.440 | search tool for video using natural language processing and

00:02:06.080 | Without YouTube

00:02:09.600 | Sure, there'd probably be something else but YouTube was really the place where this all started

00:02:15.060 | So building a search tool that allows you to search through some of the most engaging

00:02:22.160 | content on the planet I

00:02:24.100 | Think is really cool. And of course, we can just use normal Google search or YouTube's actual search bar

00:02:30.960 | but, as we'll see, we can actually build something that maybe

00:02:35.240 | Does quite well in comparison to those tools. So

00:02:40.620 | before we really jump into the

00:02:43.920 | technology and

00:02:46.400 | How to build it. Let's have a quick look at a few examples of what we are going to actually be able to build up to

00:02:53.620 | by following along with this video

00:02:56.280 | So this is just a really simple interface

00:02:59.100 | built using Streamlit, and we can just say, okay, "What is deep learning?"

00:03:06.400 | And

00:03:10.080 | Okay, we're gonna have here a few videos. I think we have five

00:03:14.360 | Okay, and we can just kind of click on one of these so I mean all of these seem pretty relevant

00:03:20.400 | But I just want to show you something really cool

00:03:22.760 | If we click on one of these, it will take us straight to the relevant point in the video. So here it started at 3:29 and

00:03:28.840 | It's going to show us

00:03:32.360 | What is deep learning? Okay, so that's really cool like straight away. It's almost like a very intelligent and fast

00:03:39.080 | search tool

00:03:41.680 | for YouTube and

00:03:43.680 | YouTube is full of amazing content. So this is, I think, really cool. I was playing around with this quite a lot

00:03:49.280 | If we're interested in something else, so: what is a transformer

00:03:57.000 | model?

00:03:59.800 | What is a BERT transformer model?

00:04:01.800 | Okay, it's BERT; this looks pretty promising

00:04:11.280 | Transformers here looks pretty good. And then if we have a look at this one's also the first result

00:04:16.820 | This is the most relevant and this explains

00:04:19.400 | BERT pretty well as well. So that's really cool and

00:04:23.800 | What I think is coolest about this is we just use off-the-shelf models and put this together super quickly it is

00:04:32.400 | Not complicated as you will see to pull this together. You can probably do it within

00:04:38.760 | the next 20 minutes after following this video

00:04:41.400 | So we're gonna have a look at this data set here so if we have a look in here we have

00:04:49.080 | 116,000 text files and we also have the audio files as well

00:04:56.320 | So we're only going to use the text files and these are basically for every small

00:05:01.920 | segment of video we have the subtitles from that segment, and that's what we're going to be performing our search on

00:05:08.640 | now

00:05:10.520 | Let's have a look at

00:05:12.520 | how we can download this data from Kaggle and

00:05:15.440 | also

00:05:18.080 | What the data actually looks like so to start you will need an account

00:05:22.320 | So once you have created an account on Kaggle you go to the right here you go to I think account

00:05:29.720 | We have this nice picture of me with hair and

00:05:35.760 | We can scroll down and there's this little API section here and we can create a new API token

00:05:42.520 | once you've done that that's going to download a

00:05:45.680 | kaggle.json file to your computer, and

00:05:51.120 | we use that kaggle.json file to

00:05:55.920 | authenticate the Kaggle Python client. So the Kaggle Python client,

00:06:02.640 | we just pip install kaggle, and then when you run import kaggle the first time, if you don't have that kaggle.json

00:06:08.960 | downloaded onto your system,

00:06:11.160 | this will come up with an error, and it will tell you where to place your kaggle.json.

00:06:15.560 | So whichever directory it tells you to use, put your kaggle.json in there, rerun import kaggle, and it should work

00:06:22.120 | and

00:06:24.640 | then to use it,

00:06:28.440 | let's use this.

00:06:30.040 | We just run this. So this is in the command line; you can also do this in a terminal window:

00:06:35.640 | kaggle datasets download, and this is that dataset that I just showed you.

00:06:40.400 | we will need to unzip that and

00:06:43.200 | Once we unzip that

00:06:47.120 | you will find everything in this data directory here. Now you can see in this data directory there are a lot of files

00:06:56.800 | So all of these represent a video ID

00:07:00.360 | Okay

00:07:02.120 | So if we go into here, we have all these timestamps and this is just a range of a timestamp and in there

00:07:09.000 | We have the audio that represents the audio from that range and we also have the subtitles from that range

00:07:16.240 | So if we look over here

00:07:18.480 | So within the range of

00:07:22.440 | From three seconds in to about six seconds in

00:07:27.180 | the word

00:07:30.440 | or the words

00:07:32.440 | spoken in the video are "machine learning. It's a buzzword, but I", and then we're on to the next timestamp,

00:07:40.280 | which would be here. If we have a look,

00:07:43.280 | so it was "but I would also claim it's a lot more than", and then I assume they're gonna

00:07:51.920 | talk about how machine learning is more than a buzzword

00:07:55.280 | That's fine

00:07:58.160 | So

00:07:59.800 | Yeah, that's what we have there

00:08:01.800 | But that's very little you saw in that little app that we had like thumbnails. We had the

00:08:08.600 | video title which we don't have in that data set and

00:08:12.440 | We even I think we had the video description as well. Okay, so you can see all that here

00:08:18.640 | We didn't have any of this in that data set

00:08:22.540 | So, how do we do that? How do we scrape that data?

00:08:26.880 | To do that, we need a few things. So we need Beautiful Soup,

00:08:30.920 | tqdm, you don't necessarily need it, but it's best we do, and

00:08:35.800 | datasets. We use datasets here because rather than having all this data

00:08:41.580 | stored locally, we can actually save it to the Hugging Face Hub, and

00:08:48.160 | if you're not really interested and you just want to build this quickly, you can download that data, and I will show

00:08:54.960 | You how you can do that in what's probably the next chapter of this video

00:09:00.160 | So you can hover over the timeline of this video and you'll be able to see

00:09:04.000 | the next chapter, and in that you'll be able to see how we can load that data set

00:09:08.480 | directly from Hugging Face rather than going through all this pre-processing. I

00:09:14.400 | want to show you because, like, a big part of machine learning and building these things is

00:09:20.920 | the pre-processing, so it would be a shame to miss it

00:09:24.860 | So we can see yeah what I showed you before we have video IDs

00:09:29.480 | which is just a list of the directories, and in each one of those we have the timestamps, and we can load them

00:09:37.040 | So we have data, video ID, splits,

00:09:40.220 | and then we have the subtitles.txt file
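
The directory layout described here (data, then video ID, then timestamp split, then a subtitles file) can be walked with a few lines of Python. This is only a sketch: the function name `read_segments` is made up for illustration, and the file name `subtitles.txt` is an assumption read off the transcript rather than confirmed against the dataset.

```python
from pathlib import Path


def read_segments(data_dir):
    """Yield (video_id, split_name, text) for every subtitles file found.

    Assumed layout: <data_dir>/<video_id>/<timestamp split>/subtitles.txt
    """
    for video_dir in sorted(Path(data_dir).iterdir()):
        if not video_dir.is_dir():
            continue
        for split_dir in sorted(video_dir.iterdir()):
            subtitles = split_dir / "subtitles.txt"  # assumed file name
            if subtitles.is_file():
                text = subtitles.read_text(encoding="utf-8").strip()
                yield video_dir.name, split_dir.name, text
```

Splits without a subtitles file are skipped rather than raising, which keeps one bad directory from stopping the whole loop.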

00:09:44.200 | Okay

00:09:45.400 | Cool, and what I want to do is just loop through all these files

00:09:49.080 | To give us that, you know, give us what we can get from those files. So that is a video ID

00:09:55.600 | from the directory names, the text from subtitles.txt, the start second and end second, which we can get from the

00:10:04.440 | directory

00:10:06.440 | timestamps, and

00:10:08.000 | also the URL, because we can actually build that from the video ID

00:10:13.720 | Okay

00:10:15.720 | So I won't go through this in too much depth

00:10:19.120 | But what we're doing is we're going through each one of those splits

00:10:23.560 | extracting that small chunk of text. It's a very small chunk of text; if you remember, it's just a few seconds long. So

00:10:29.880 | with Q&A, we really want to have longer chunks of text than, like, five words

00:10:35.720 | So what I've done here is said, okay once we reach about three or four sentences

00:10:42.960 | We are going to

00:10:45.280 | save that as a chunk. Okay, there are better ways to do this. A very

00:10:52.520 | good approach is to have overlapping chunks so that you're not missing anything at all, because we might be cutting things off

00:11:00.560 | right in the middle of a sentence, which is probably not the best idea

00:11:03.720 | But just for this demo, this is good enough. It's not a problem

00:11:09.240 | So we create our start and end seconds using the timestamps that we have in that directory name

00:11:15.760 | okay, so that's literally just the number of seconds into the video that we are at and

00:11:21.680 | Then we create a document which is just all of the details for that particular chunk of text

00:11:28.400 | including the video that it comes from, the text itself, the start and end seconds of that chunk of text, and

00:11:37.080 | Also this which is the URL so the URL directs us to the video the specific video and also

00:11:43.820 | the start of that chunk of text that we're going to be returning later on and

00:11:48.520 | Yeah, so we create a list of documents like that

00:11:54.560 | Shouldn't take long
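
As a rough illustration of the chunking just described, the sketch below merges consecutive few-second segments until a chunk is long enough, then records the start and end seconds and a URL that jumps to the start. The field names, the helper names, and the use of 360 characters as the cut-off (the figure mentioned later in the video) are assumptions, not the code shown on screen.

```python
def _make_document(video_id, texts, start, end):
    """Bundle one chunk of text with the details needed to link back to it."""
    return {
        "video_id": video_id,
        "text": " ".join(texts),
        "start_second": start,
        "end_second": end,
        # The t= parameter makes YouTube start playback at that second.
        "url": f"https://www.youtube.com/watch?v={video_id}&t={start}s",
    }


def build_documents(video_id, segments, min_chars=360):
    """Merge consecutive (start, end, text) segments into longer chunks.

    A chunk is closed as soon as it holds at least `min_chars` characters,
    so it can still end mid-sentence, as noted in the video.
    """
    documents = []
    texts, chunk_start, chunk_end = [], None, None
    for start, end, text in segments:
        if chunk_start is None:
            chunk_start = start
        texts.append(text)
        chunk_end = end
        if sum(len(t) for t in texts) >= min_chars:
            documents.append(_make_document(video_id, texts, chunk_start, chunk_end))
            texts, chunk_start = [], None
    if texts:  # keep whatever is left at the end of the video
        documents.append(_make_document(video_id, texts, chunk_start, chunk_end))
    return documents
```

Calling `build_documents("abc", [(0, 3, "..."), (3, 6, "...")])` returns a list of dictionaries, one per chunk.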

00:11:58.120 | so

00:11:59.320 | This is one example and you can see okay like here. We kind of just cut off in the middle of

00:12:06.840 | a

00:12:08.680 | sentence, and it's not perfect, but that's fine. It works well enough

00:12:13.080 | So it would be better to have some sort of window where we overlap so we for example took this

00:12:20.500 | paragraph, and then we maybe took the second half of this paragraph followed by the first half of the next

00:12:29.040 | document

00:12:31.760 | But this is good enough
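
The overlapping-window idea mentioned here was not implemented in the video, but one way it might look is sketched below. The function name and the `window` and `stride` defaults are illustrative assumptions; any stride smaller than the window gives overlap between neighbouring chunks.

```python
def sliding_windows(segments, window=6, stride=3):
    """Group (start, end, text) segments into overlapping windows.

    Each window covers up to `window` consecutive segments and the next
    window begins `stride` segments later, so stride < window means that
    neighbouring windows share some text.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    windows = []
    for i in range(0, len(segments), stride):
        batch = segments[i:i + window]
        windows.append({
            "start_second": batch[0][0],
            "end_second": batch[-1][1],
            "text": " ".join(text for _, _, text in batch),
        })
        if i + window >= len(segments):
            break  # the last segment is already covered
    return windows
```

With `window=4` and `stride=2`, the second window repeats the last two segments of the first, so a sentence cut off at one boundary appears whole in the neighbouring chunk.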

00:12:33.760 | You see a few of those, and we have this start and end seconds and the URL, in which you can see we have the 41 seconds, 41

00:12:41.640 | Okay

00:12:45.720 | Now as I said before there's that other

00:12:47.920 | Metadata that we don't have in here now

00:12:52.280 | I'm not sure, but on Mac you might need to pip or conda install

00:12:59.360 | lxml, and that is for Beautiful Soup. Beautiful Soup is a

00:13:04.040 | like a data scraping library or

00:13:08.040 | or HTML

00:13:10.480 | processing library almost

00:13:12.480 | So it's really good when we're scraping information from websites, which we can do so we can okay

00:13:19.240 | We're going to go to each video and we're going to capture the data thumbnail and any

00:13:25.520 | other

00:13:27.560 | Information we can from there. So in this case, we just have the title and the

00:13:33.360 | thumbnail and

00:13:36.440 | We store those within this metadata

00:13:39.720 | dictionary. Okay, so the title, thumbnail

00:13:43.480 | If there's an error, so there were a couple of

00:13:48.640 | Exceptions here rather than just

00:13:52.880 | Throwing an error and not returning anything or stopping the process

00:13:57.280 | I'm just returning an empty title and thumbnail, because there are two out of the 127 that we scrape there

00:14:05.000 | So I don't think that's too much of an issue

00:14:07.040 | Okay, so now we have the document which is what we originally pulled and then we have the title and thumbnail so

00:14:16.160 | what we need to do here is pull those together and

00:14:20.520 | There we have our full

00:14:22.520 | Our full document, okay
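
The video does this step with Beautiful Soup. The sketch below swaps in the standard library's `html.parser` purely so it runs without extra installs, and it assumes the title and thumbnail are exposed as Open Graph `og:title` and `og:image` meta tags, which is an assumption about the page markup rather than something shown in the video. As in the video, a page that cannot be parsed yields an empty title and thumbnail instead of an error.

```python
from html.parser import HTMLParser


class _OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> values from a page."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop, content = attrs.get("property"), attrs.get("content")
        if prop and content and prop.startswith("og:"):
            self.properties.setdefault(prop, content)


def extract_metadata(html):
    """Return {'title', 'thumbnail'} for a page, or empty strings on failure."""
    try:
        parser = _OpenGraphParser()
        parser.feed(html)
        return {
            "title": parser.properties["og:title"],
            "thumbnail": parser.properties["og:image"],
        }
    except (KeyError, TypeError):
        # Missing tags or a bad response: fall back rather than stop the run.
        return {"title": "", "thumbnail": ""}


def merge_metadata(document, metadata):
    """Combine a chunk document with its video's title and thumbnail."""
    return {**document, **metadata}
```

Fetching the page itself is left out; `extract_metadata` only needs the HTML as a string, however it was downloaded.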

00:14:26.240 | so

00:14:29.200 | When we are

00:14:31.200 | saving things to the Hugging Face Hub, we can just save it as a JSON Lines file

00:14:36.040 | Okay, so it's like this list of dictionaries; save that to a JSONL

00:14:41.360 | file and then you can actually just upload that directly to

00:14:44.960 | Hugging Face, so

00:14:48.400 | That's yeah pretty straightforward
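
For reference, a JSON Lines file is just one JSON object per line. A minimal sketch of writing and reading one with the standard library follows; the function names are illustrative.

```python
import json


def write_jsonl(documents, path):
    """Write a list of dictionaries as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as fp:
        for doc in documents:
            fp.write(json.dumps(doc, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dictionaries."""
    with open(path, encoding="utf-8") as fp:
        return [json.loads(line) for line in fp if line.strip()]
```

Because each line is independent, the file can be appended to or streamed without loading everything into memory.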

00:14:50.960 | Okay, so as promised this is how you can

00:14:57.200 | Download the data that we just created. So

00:15:01.540 | Super easy. It's exactly the same as what you saw before: we have video ID, text, start, end, URL, title and thumbnail, and we have

00:15:10.520 | 11,000 items there. Okay, and

00:15:15.160 | So we have 11,000 sort of documents and that is spread across

00:15:19.640 | 127 videos or you know unique videos as far as I can remember and

00:15:26.320 | We can just see one of those and you can see we've already seen this example from before so it's exactly the same but obviously

00:15:34.200 | a lot easier

00:15:35.840 | For us to actually use because we're just pulling it from hugging face hub, which I think is really cool

00:15:42.160 | now

00:15:44.000 | The next thing we need to do is actually

00:15:46.000 | index all those documents within a vector database. Of course, we're using Pinecone here. So

00:15:52.380 | the first thing we need to do to begin doing that is initialize a sentence transformer

00:15:58.640 | so the sentence transformer is a model that is going to take the

00:16:02.920 | text and convert it into a vector which we can then place inside our vector database and use that to

00:16:11.240 | perform our sort of intelligent semantic search through all of these documents

00:16:15.560 | so

00:16:17.400 | We have the max sequence length,

00:16:19.400 | 128 here, so I used this

00:16:22.200 | to come up with this three-to-four-sentence length for our paragraphs,

00:16:28.660 | because a token, which is what this is, 128 tokens, is typically going to be something like

00:16:36.880 | three to five characters. So we'll go with three

00:16:41.520 | characters here, and

00:16:44.120 | so that's where we got 360 from, and

00:16:48.320 | We have this

00:16:51.160 | 768

00:16:53.160 | Sentence embedding dimensionality, so that's important

00:16:56.800 | So we pull that in here to the embedding dim variable and then we're going to use that when we are

00:17:02.000 | initializing our

00:17:05.200 | index, so

00:17:07.200 | We need to get an API key for this. So you go to app.pinecone.io and you create an API key,

00:17:13.000 | and then you can just go here and take that. So if you're just looking at where to actually get your API key:

00:17:21.560 | Okay, so you go within your project. It's probably going to be called someone's default project

00:17:28.220 | You click in there you go to API keys and then you have this you just copy this and then you would paste it inside

00:17:34.720 | this

00:17:36.160 | Pretty simple and then we can create our

00:17:39.040 | index. You can call it whatever you want. I call it YouTube search because that's what this is for me

00:17:45.560 | Again, you call it whatever you want. For this model,

00:17:49.020 | we need to use cosine similarity, and we need to align the model embedding dimension and the dimensionality of our index

00:17:56.020 | Okay, so that's what I've done there and it will connect to our new index using the name that we gave it here
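
For anyone unfamiliar with the metric, cosine similarity compares the direction of two vectors and ignores their length. The vector database computes this internally, so the plain-Python sketch below is only an illustration of the formula, not code from the video.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for zero vectors, which have no direction
    return dot / (norm_a * norm_b)
```

The length check mirrors the point made above: query vectors and indexed vectors must share the same dimensionality (768 for the model in the video) or the comparison is meaningless.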

00:18:03.200 | So what I'm going to do is index our data in batches of 64

00:18:07.080 | the data we insert into our index will include the

00:18:14.120 | Document ID

00:18:18.080 | the embedded vector that we create using the sentence transformer model, and any metadata we'd like to include

00:18:24.600 | So that's what we saw before: the title, start seconds, the text itself. We're going to include all of that in there

00:18:32.360 | So to do that we create this loop. This is where I'm using tqdm.

00:18:36.760 | This is a progress bar so that we can see, you know, how long this is gonna take

00:18:41.440 | it shouldn't take too long though, by the way and

00:18:44.320 | What we do so in batches of

00:18:47.880 | 64 I'm going to encode all of the text

00:18:52.680 | I'm going to create the the IDs so we have

00:18:58.560 | So we have a video ID, but the video ID is not unique for every single snippet

00:19:04.840 | So what I'm doing here is taking video ID and a start second because that is unique and placing those together

00:19:11.280 | okay, and

00:19:14.240 | Then what we're doing is creating the metadata. So I just want these items

00:19:18.760 | from the metadata

00:19:21.560 | from before so

00:19:23.560 | Just pulling those in there's nothing

00:19:27.560 | Nothing particularly unique there

00:19:30.160 | Okay, and then just upserting all those, so inserting everything into Pinecone:

00:19:36.880 | the IDs, the embeddings and the metadata, and yeah, just adding all that to Pinecone. Super easy
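
The batching loop described above can be sketched without any external service. In the sketch below, `encode` stands in for the sentence transformer's encoding step and each yielded batch is what would be handed to the index's upsert call. The ID format (video ID and start second joined with a hyphen), the field names and the metadata keys are assumptions made for illustration.

```python
METADATA_KEYS = ("title", "text", "start_second", "end_second", "url", "thumbnail")


def batched_records(documents, encode, batch_size=64):
    """Yield lists of (id, vector, metadata) tuples, one list per batch.

    `encode` is any callable that maps a list of strings to a list of
    vectors, in the same order.
    """
    for i in range(0, len(documents), batch_size):
        batch = documents[i:i + batch_size]
        vectors = encode([doc["text"] for doc in batch])
        records = []
        for doc, vector in zip(batch, vectors):
            # The video ID alone repeats across chunks, so the start second
            # is appended to make each record ID unique.
            record_id = f"{doc['video_id']}-{doc['start_second']}"
            metadata = {key: doc[key] for key in METADATA_KEYS if key in doc}
            records.append((record_id, list(vector), metadata))
        yield records
```

Wrapping the outer loop in a progress bar, as the video does with tqdm, only changes how progress is displayed and not what is produced.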

00:19:44.160 | And then we describe our index. You don't need to do this, but this is so we can see what is in there

00:19:50.320 | So we have the dimensionality how full our index is. We don't have much in there at the moment. So it's

00:19:55.760 | It's not very full. You can usually fit about a million

00:19:58.640 | embeddings into a typical

00:20:01.680 | what's called a pod, which is like a hardware unit of

00:20:05.720 | the

00:20:07.920 | of the vector database you can have more pods if you need more than that and

00:20:12.320 | Yeah, so this is our vector count we have just over 11,000 items in there

00:20:20.480 | And then we query. This is what we're doing for "what is deep learning", and then we return those results. So this is just the text, and

00:20:27.480 | Yeah, everything is pretty relevant there. So that is how it works

00:20:33.840 | It's pretty I think simple

00:20:37.000 | The only other thing to do is actually put all that into a streamer app, which is not hard at all

00:20:42.760 | okay, so

00:20:45.080 | This is our code. I'm gonna zoom out a little bit. I'm sorry if this is kind of small

00:20:49.920 | but

00:20:51.440 | yeah, it's hard to get that all squeezed into the screen. So

00:20:57.760 | Yeah, I mean the same as before we're just initializing things

00:21:02.120 | We have this little card, which is, like, each component or result

00:21:06.920 | We return just HTML, really basic HTML, nothing complicated there

00:21:11.640 | And then yeah, we're just using Streamlit components: Streamlit write, markdown

00:21:18.880 | There's nothing really in there. We have the search bar; whenever the search bar has something inside it, we search and

00:21:26.080 | just return

00:21:28.400 | That information in the format of a card and all that together is how we build this
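
The card is described only as "really basic HTML", so the sketch below is a guess at its shape rather than the markup used in the video. The function name, the class name and the layout are assumptions. Values are escaped so that a title containing angle brackets or a URL containing `&` cannot break the markup.

```python
from html import escape


def render_card(title, url, thumbnail, text):
    """Return a small HTML snippet for one search result."""
    safe_url = escape(url, quote=True)
    return (
        '<div class="card">'
        f'<a href="{safe_url}"><img src="{escape(thumbnail, quote=True)}" width="160"></a>'
        f'<h4><a href="{safe_url}">{escape(title)}</a></h4>'
        f'<p>{escape(text)}</p>'
        '</div>'
    )
```

In a Streamlit app, a string like this can be displayed with `st.markdown(card, unsafe_allow_html=True)`, once per returned result.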

00:21:33.320 | So it's really I think super straightforward and really easy to use

00:21:38.640 | Yeah, that's it for this video

00:21:43.040 | I hope this has been useful, or at least insightful, in showing how you can build something like this

00:21:49.080 | Of course, this is just one example like YouTube search video search

00:21:52.800 | you can search through anything that you can imagine as long as you have some sort of data that represents it and some

00:22:00.860 | Way to represent that sort of data as a question and answer

00:22:05.400 | format

00:22:07.400 | you can

00:22:09.240 | you can build something like this super easily

00:22:12.840 | so I

00:22:14.840 | Hope that's been useful. Thank you very much for watching and I will see you again in the next one. Bye
