
How to Build an AI-Powered Video Search App


Chapters

0:00 Intro
2:56 YouTube Search App
4:43 Getting Data
7:58 Enhancing the Data
12:45 Scraping Other Metadata
14:52 Loading Data from Hugging Face
15:42 Index and Query the Data
20:43 Streamlit App Code

Transcript

Today I want to have a look at how we can build NLP-powered intelligent search for video, and more specifically how we can do that for YouTube. Now, YouTube is obviously huge, but it started with a very simple video titled "Me at the zoo". This video was a 19-second clip of YouTube's co-founder literally at the zoo, showing you some elephants in the enclosure behind him. It seems really silly now, but that was back in 2005, and up until that point the only video insights into other people's lives we would have seen were those of celebrities or politicians, and all of it was very orchestrated and unnatural.

So YouTube brought in this very unique age where we're able to see into the lives of normal people, people like me or you, and just see what they are doing. "Me at the zoo" is a kind of odd, very short video, but it really marks the beginning of an explosion of content made by normal people, which I think is really cool. And what YouTube brings us now is a lot more than just an insight into someone's life; it can range across almost anything. You're watching this video right now to learn how to build a search tool for video using natural language processing, and without YouTube, sure, there'd probably be something else, but YouTube was really the place where this all started. So building a search tool that lets you search through some of the most engaging content on the planet is, I think, really cool.

And of course, we could just use normal Google search or YouTube's actual search bar, but as we'll see, we can actually build something that compares quite well to those tools. So before we really jump into the technology and how to build it, let's have a quick look at a few examples of what we're going to be able to build by following this video. This is just a really simple interface built using Streamlit, and we can just ask it something.

"What is deep learning?" And we get back a few videos here, five I think, and we can just click on one of them. All of these seem pretty relevant, but I want to show you something really cool: if we click on one of these, it will take us straight to the relevant point in the video.

So here it started at 3:29, and it's going to show us what deep learning is. That's really cool: straight away it's almost like a very intelligent and fast search tool for YouTube, and YouTube is full of amazing content, so this is, I think, really cool.

I was playing around with this quite a lot. If we're interested in something else, we can try "What is a transformer model?" or "What is a BERT transformer model?" Okay, it's BERT; this looks pretty promising, the transformers result here looks pretty good, and if we have a look at this one, also the first result, it's the most relevant and explains BERT pretty well too.

So that's really cool, and what I think is coolest about this is that we just use off-the-shelf models and put this together super quickly. As you will see, it is not complicated to pull this together; you can probably do it within the next 20 minutes after following this video. We're going to work with this dataset here: if we have a look inside, we have 116,000 text files, and we have the audio files as well. We're only going to use the text files; for every small segment of video, these give us the subtitles from that segment, and that's what we're going to perform our search on. Now let's have a look at

how we can download this data from Kaggle, and what the data actually looks like. To start, you will need an account. Once you have created an account on Kaggle, you go to the top right, to Account I think (where we have this nice picture of me with hair), scroll down to the little API section, and create a new API token. That downloads a kaggle.json file to your computer, and we use that kaggle.json to authenticate the Kaggle Python client. We install the client with pip install kaggle, and when you run import kaggle for the first time without kaggle.json on your system, it will raise an error telling you where to place the file; put kaggle.json in the directory it points to, rerun import kaggle, and it should work. Then, to download the dataset, we just run this in the command line.
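If you'd rather stay in Python, here's a minimal sketch of the same download using the Kaggle client; the dataset slug below is a placeholder, so substitute the one shown in the video:

```python
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads the kaggle.json you downloaded earlier

api.dataset_download_files(
    "owner/youtube-subtitles",  # hypothetical slug; use the dataset from the video
    path="./",
    unzip=True,  # equivalent to downloading the zip and unzipping it yourself
)
```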

You can also do this in a terminal window: kaggle datasets download, followed by the dataset I just showed you. We will need to unzip that, and once we do, you will find everything in this data directory here. You can see there are a lot of files in this directory: each one represents a video ID, and if we go into one of them, we have all these timestamp ranges. Inside each range we have the audio from that range, and we also have the subtitles from that range. So if we look over here, within the range from three seconds in to about six seconds in, the words spoken in the video are "machine learning. It's a buzzword, but I", and then we're on to the next timestamp,

It's a buzzword, but I and then we're on to the next timestamp Which would be here. We have look So was but I would also claim it's a lot more than and then I assume they're gonna talk it Talk about how machine learning is is more than a buzzword That's fine So Yeah, that's what we have there But that's very little you saw in that little app that we had like thumbnails.

We had the video title, which we don't have in this dataset, and I think we even had the video description as well. You can see all of that here; we didn't have any of it in the dataset. So how do we get it?

How do we scrape that data? We need a few things: Beautiful Soup; tqdm, which you don't strictly need, but it's best if we use it; and datasets. We use datasets here because, rather than keeping all this data stored locally, we can save it to the Hugging Face Hub. If you're not really interested in the pre-processing and just want to build the app quickly, you can simply download that finished dataset, and I will show you how in what is probably the next chapter of this video. Hover over the timeline of this video and you'll see the next chapter, where we load the dataset directly from Hugging Face rather than going through all this pre-processing.

I want to show you the pre-processing anyway, because it's a big part of machine learning and of building these things, and it would be a shame to miss it. So, as I showed you before, we have the video IDs, which are just a list of the directory names; in each of those we have the timestamp splits, and we can load them. So we have data/{video_id}/{split}, and in there we have the subtitles in a subtitles.txt file. What I want to do is just loop through all these files and pull out everything we can get from them.

That is: the video ID, from the directory names; the text, from subtitles.txt; the start and end seconds, which we can get from the directory timestamps; and also the URL, because we can actually build that from the video ID. I won't go through this in too much depth, but what we're doing is going through each one of those splits and extracting that small chunk of text, and it is a very small chunk of text.
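Here's a rough sketch of that loop. The data/{video_id}/{start}_{end}/subtitles.txt layout and the integer timestamps are assumptions based on the description above, and names like raw_segments are just for this sketch:

```python
from pathlib import Path

raw_segments = []
for video_dir in sorted(Path("data").iterdir()):   # one directory per video ID
    video_id = video_dir.name
    for split_dir in sorted(video_dir.iterdir()):  # one directory per time range
        # assumed "start_end" naming for the timestamp directories
        start_s, end_s = (int(x) for x in split_dir.name.split("_"))
        text = (split_dir / "subtitles.txt").read_text()
        raw_segments.append({
            "video_id": video_id,
            "text": text,
            "start_second": start_s,
            "end_second": end_s,
            # link straight to this point in the video
            "url": f"https://youtu.be/{video_id}?t={start_s}",
        })
```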

If you remember, each chunk is just a few seconds long. For Q&A we really want longer chunks of text, not just a few words, so what I've done here is say: once we reach about three or four sentences, we save that as a chunk. There are better ways to do this.

It's like a very Good approach is to have overlapping chunks so that you're not missing anything at all there because we might be cutting things off Right in the middle of a sentence. It's probably not the best idea But just for this demo, this is good enough. It's not problem So we create our start and end seconds using the timestamps that we have in that directory name okay, so that's literally just the number of seconds into the video that we are at and Then we create a document which is just all of the details for that particular chunk of text including the video that comes from the text itself the sign and seconds of that chunk of text and Also this which is the URL so the URL directs us to the video the specific video and also the start of that chunk of text that we're going to be returning later on and Yeah, so we create a list of documents like that Shouldn't take long so This is one example and you can see okay like here.

We kind of just cut off in the middle of a Sentence and it's not perfect as well, but that's fine. It works well enough So it would be better to have some sort of window where we overlap so we for example took this Paragraph and then we maybe took half the first this second half of the paragraph followed by the next half of the next document But this is it's good enough You see a few of those and we have this starting in seconds URL, which you can see we have the 41 seconds 41 Okay Now as I said before there's that other Metadata that we don't have in here now You might need to show on Mac.

On Mac you might need to pip or conda install lxml, and that is for Beautiful Soup. Beautiful Soup is a data-scraping, or really an HTML-processing, library, so it's really good when we're scraping information from websites, which is what we'll do: we're going to go to each video's page and capture the title, the thumbnail, and any other information we can from there.

So in this case, we just have the title and the thumbnail, and we store those in this metadata dictionary. If there's an error (there were a couple of exceptions here), rather than throwing it and returning nothing or stopping the process, I just return an empty title and thumbnail; it's only two out of the 127 videos we scrape, so I don't think it's too much of an issue. Okay, so now we have the document, which is what we originally pulled, plus the title and thumbnail, and what we need to do here is merge those together, and there we have our full document.
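As a sketch of that scraping step: reading the og:title and og:image meta tags from the watch page is one way to get the title and thumbnail, though not necessarily the exact approach used in the video:

```python
import requests
from bs4 import BeautifulSoup

def get_video_meta(video_id: str) -> dict:
    """Scrape a video's title and thumbnail URL from its watch page."""
    try:
        html = requests.get(f"https://www.youtube.com/watch?v={video_id}").text
        soup = BeautifulSoup(html, "lxml")
        return {
            "title": soup.find("meta", property="og:title")["content"],
            "thumbnail": soup.find("meta", property="og:image")["content"],
        }
    except Exception:
        # a couple of videos failed to scrape, so return empty fields
        # rather than stopping the whole process
        return {"title": "", "thumbnail": ""}

# merge the scraped metadata into every document, one request per video
meta = {vid: get_video_meta(vid) for vid in {d["video_id"] for d in documents}}
for doc in documents:
    doc.update(meta[doc["video_id"]])
```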

When we're saving things to the Hugging Face Hub, we can just save this list of dictionaries as a JSON Lines (JSONL) file and then upload that file directly to Hugging Face, so that's pretty straightforward.
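Saving the documents list as JSONL is just one json.dumps call per line, something like:

```python
import json

with open("documents.jsonl", "w", encoding="utf-8") as f:
    for doc in documents:
        f.write(json.dumps(doc) + "\n")  # one JSON object per line
```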

So, as promised, here is how you can download the data we just created. Super easy, and it's exactly the same as what you saw before: we have video ID, text, start and end seconds, URL, title, and thumbnail, and we have about 11,000 items there. Those 11,000 documents are spread across 127 unique videos, as far as I can remember. We can look at one of them, and it's exactly the example we saw before, but obviously a lot easier for us to use because we're just pulling it from the Hugging Face Hub, which I think is really cool. Now the next thing we need to do is actually index all those documents in a vector database.
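A sketch of the loading step; the dataset id below is a hypothetical placeholder, so substitute the one shown in the video:

```python
from datasets import load_dataset

# hypothetical dataset id -- use the one from the video
docs = load_dataset("your-username/youtube-search", split="train")
print(len(docs))  # ~11,000 documents across 127 videos
print(docs[0])    # video_id, text, start_second, end_second, url, title, thumbnail
```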

Of course, we're using Pinecone here. The first thing we need to do is initialize a sentence transformer: a model that takes text and converts it into a vector, which we can then place inside our vector database and use to perform our intelligent semantic search through all of these documents. We have the max sequence length of 128 here, and I used this to come up with the three-to-four-sentence length of our paragraphs, because that limit is 128 tokens, and a token

is typically something like three to five characters. Going with three characters per token, that's where the figure of roughly 360 characters came from. We also have the 768-dimensional sentence embeddings, which is important: we pull that into the embedding_dim variable, and we're going to use it when we initialize our index.
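A minimal sketch of the retriever setup; the transcript doesn't name the exact checkpoint, so the model below is just one 768-dimensional semantic search model that fits the description:

```python
from sentence_transformers import SentenceTransformer

# assumed checkpoint: any 768-dim model tuned for semantic search works here
retriever = SentenceTransformer("multi-qa-mpnet-base-cos-v1")
print(retriever.max_seq_length)  # the token limit the chunk size was based on
embedding_dim = retriever.get_sentence_embedding_dimension()  # 768
```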

So you go to app the pinecone I/O you create an API key And then you can just go here and take that. So if you're just looking at where to actually get your API key Okay, so you go within your your project. It's probably going to be called someone's default project You click in there you go to API keys and then you have this you just copy this and then you would paste it inside this Pretty simple and then we can create our Index, whatever you can call it.

You can call the index whatever you want; I call it youtube-search, because that's what this is. For this model we need to use cosine similarity, and we need to align the dimensionality of our index with the model's embedding dimension; that's what I've done there, and then we connect to our new index using the name we gave it, as in the sketch below. What I'm going to do next is index our data in batches of 64. The data we insert into our index will include the document ID, the embedding vector we create using the sentence transformer model, and any metadata we'd like to include.
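Here's a sketch of the index setup, written against the classic pinecone-client API from around the time of the video; the environment value is an assumption, and yours is shown next to the key in the console:

```python
import pinecone

pinecone.init(
    api_key="YOUR_API_KEY",      # from app.pinecone.io
    environment="us-west1-gcp",  # assumed; check your console
)
# the index name is arbitrary; the video uses "youtube-search"
pinecone.create_index(
    "youtube-search",
    dimension=embedding_dim,  # must match the 768-dim embeddings
    metric="cosine",          # the similarity metric this model expects
)
index = pinecone.Index("youtube-search")
```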

We're going to include all of that in there So to do that we create this loop. This is where I'm using tqdm. This is just a set This is a progress bar so that we can see, you know, how long this is gonna take it shouldn't take too long though, by the way and What we do so in batches of 64 I'm going to encode all of the text I'm going to create the the IDs so we have Two sets of so we have a video ID, but the video ID is not unique for every single snippet So what I'm doing here is taking video ID and a start second because that is unique and placing those together okay, and Then what we're doing is creating the metadata.

I just want those same items from the metadata as before, so I'm pulling them in; nothing particularly unique there. Then we upsert all of it, inserting everything into Pinecone: the IDs, the embeddings, and the metadata. Super easy. And then we describe our index: you don't need to do this, but it lets us see what is in there.
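The indexing loop, sketched against the docs dataset and the retriever and index from the earlier sketches:

```python
from tqdm.auto import tqdm

batch_size = 64
for i in tqdm(range(0, len(docs), batch_size)):
    batch = docs[i:i + batch_size]  # slicing a HF Dataset gives a dict of lists
    # embed the text of every document in the batch
    embeds = retriever.encode(batch["text"]).tolist()
    # video_id alone isn't unique per snippet, so append the start second
    ids = [f"{v}-{s}" for v, s in zip(batch["video_id"], batch["start_second"])]
    # keep only the metadata fields we want back at query time
    meta = [
        {"title": t, "text": x, "start_second": s, "url": u, "thumbnail": th}
        for t, x, s, u, th in zip(
            batch["title"], batch["text"], batch["start_second"],
            batch["url"], batch["thumbnail"],
        )
    ]
    index.upsert(vectors=list(zip(ids, embeds, meta)))
```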

You don't need to do this, but this is so we can see what is in there So we have the dimensionality how full our index is. We don't have much in there at the moment. So it's It's not very full. You can usually fit about a million embeddings into a typical what called pod which is like a hardware unit of the of the vector database you can have more pods if you need more than that and Yeah, so this is our vector count we have just over 11,000 items in there And then we go this is what we're doing for wise deep learning and then we return those results.

So this is just a text and Yeah, everything is pretty relevant there. So that is how it works It's pretty I think simple The only other thing to do is actually put all that into a streamer app, which is not hard at all okay, so This is our code.

I'm going to zoom out a little bit; I'm sorry if this is kind of small, but it's hard to squeeze it all onto the screen. It's the same as before: we're just initializing things. We have this little card, which is the component for each result we return; it's just really basic HTML, nothing complicated. Then we're just using Streamlit components like st.write and st.markdown; there's nothing really complex in there. Whenever the search bar has something inside it, we rerun the search and return the results in the format of a card; a minimal sketch of the whole app follows.
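Here's a minimal sketch of the whole app, reusing the assumed checkpoint and Pinecone settings from earlier; the card HTML is simplified relative to the video's version:

```python
import pinecone
import streamlit as st
from sentence_transformers import SentenceTransformer

# same setup as before (checkpoint and environment are assumptions)
retriever = SentenceTransformer("multi-qa-mpnet-base-cos-v1")
pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")
index = pinecone.Index("youtube-search")

def card(thumbnail: str, title: str, url: str, context: str) -> str:
    # one result component: thumbnail, linked title, and the matched text
    return (
        f'<a href="{url}"><img src="{thumbnail}" width="240"/></a><br>'
        f'<a href="{url}"><b>{title}</b></a><p>{context}</p>'
    )

st.write("# YouTube Q&A Search")
query = st.text_input("Search", "")

if query:  # rerun the search whenever the bar has something in it
    xq = retriever.encode(query).tolist()
    results = index.query(vector=xq, top_k=5, include_metadata=True)
    for match in results["matches"]:
        m = match["metadata"]
        st.markdown(
            card(m["thumbnail"], m["title"], m["url"], m["text"]),
            unsafe_allow_html=True,
        )
```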

All of that together is how we build this, so I think it's really straightforward and easy to use. That's it for this video; I hope it's been useful, or at least insightful about how you can build something like this. Of course, YouTube search is just one example: with video search you can search through anything you can imagine, as long as you have some sort of data that represents it and some way to frame that data as a question-and-answer task. You can build something like this super easily, so I hope that's been useful.

Thank you very much for watching, and I will see you again in the next one. Bye!