
How to Use OpenAI Whisper to Fix YouTube Search


Chapters

0:00 OpenAI's Whisper
1:48 Idea Behind Better Search
6:56 Downloading Audio for Whisper
8:22 Download YouTube Videos with Python
16:52 Speech-to-Text with OpenAI Whisper
20:54 Hugging Face Datasets and Preprocessing
26:48 Using a Sentence Transformer
27:45 Initializing a Vector Database
28:45 Build Embeddings and Vector Index
31:35 Asking Questions
34:08 Hugging Face Ask YouTube App

Transcript

Search on YouTube is good, but it has some limitations. With trillions of hours of content on there, you would expect there to be an answer to pretty much every question you can think of. Yet, if we have a specific question that we want answered, like "What is OpenAI's CLIP?", we're actually just served dozens of 20-plus minute videos, and maybe we don't want that.

Maybe we just want a very brief 20-second definition. The current YouTube search has no solution for this. Maybe there's a good financial reason for doing so. Obviously, if people are watching longer videos or more of a video, that gives YouTube more time to serve ads. So, I can understand from a business perspective why that might be the case.

But, particularly for us, tech people that are wanting quick answers to quick problems, a better search would be incredibly useful where we can actually pinpoint the specific parts of a video that contain an answer to one of our questions. Very recently, a solution to this problem may have appeared in the form of OpenAI's Whisper.

In this video, I want to have a look at if we can build something better, a better YouTube search experience for people like us that want quick answers. And we're going to take a look at how we can use OpenAI's Whisper to actually do this. So, let's try and flesh out the idea a little bit more.

So, the idea is we want to get specific timestamps that answer a particular question. Now, looking at YouTube, we can already do this. So, we just kind of hover over this video here, we right-click, and look what we can do. We can copy the video URL at the current time.

Okay, so I can copy this, I can come over here, and I can paste it into my browser, and it will open that video at that time. You see we have this little time here, so let's run that. And yeah, we get that. So, it should be completely possible to do something with that, right?

We should be able to serve users' results based on those timestamps. Now, the only thing here is that we need a way to search through these videos. Now, YouTube does provide captions, and they work relatively well, but sometimes they can be a little bit weird. Now, this is where OpenAI's Whisper comes in.

OpenAI's Whisper is, you can think of it as the GPT-3 or DALL-E 2 equivalent for speech-to-text. And it's also open source, which is pretty cool. Now, you might expect this model to be absolutely huge, like GPT-3 or DALL-E 2, but in reality, it's actually a relatively small model. I believe the largest version of the model can be run with about 10 gigabytes of VRAM, which is not that much.

So, we should be able to use OpenAI's Whisper with videos on YouTube to transcribe them more accurately than what YouTube captions can provide. So, that would be our first, well, that would be almost our first step, because we actually need the videos. So, our very first step would be getting videos from YouTube.

So, we have YouTube up here. First thing we need to do is actually get the audio files, so the MP3 files. So, we need to download those. We're going to store them locally or, you know, wherever local is for you. And once we have those, we want to use OpenAI's Whisper.

So, I don't know if they have a logo, but we'll just write Whisper, and that's going to create text from the audio. Okay, so we have all this text. And what is pretty cool with OpenAI's Whisper is that it will also include the timestamps where it found that particular text. Okay, so it's actually not going to look so much like a long piece of text.

It's going to look more like segments of text, like this. And those segments of text will have like a start and end second. So, it'll be like here, up to 7 seconds in, and then we'll have 7 to 12 seconds in, some more text. So, with that, we can then take those segments.
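To make that concrete, here's an illustrative sketch of the kind of segment structure Whisper produces (the values and wording here are made up, not real output):

```python
# Illustrative only: the shape of the segments Whisper gives back.
segments = [
    {"start": 0.0, "end": 7.0, "text": "Hi, welcome to the video."},
    {"start": 7.0, "end": 12.0, "text": "Today we're looking at OpenAI's Whisper."},
]
```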

We can encode them with a sentence transformer model. So, let's just put SBERT. We're not going to use SBERT exactly, but something along those lines. And then we get some vectors. All right, so we have our little vector space here. I should also mention here that the SBERT model will be a Q&A model, for question answering.

Okay, so that's the specific type of machine learning where given a natural language question, you expect a natural language answer. Right, so these over here, these segments, they're our answers. Okay. And what we're going to do with that is we want to put this into a vector index or vector database.

So, we have another database over here. And when someone searches for something, so we're going to have a little search bar over here. Someone's going to search for something. They're going to write in their text, "What is OpenAI's CLIP?". And that's going to be passed into the SBERT model, well, not exactly an SBERT model.

I'm going to call it the QA model instead, that's better. It's the same one as we have up here. That's going to encode the query into a vector. So, we have our little vector here. And we pass that into the vector database. And then from there, we return the most relevant segments.

So, these up here. We return those to the user. So, the user will get, okay, at this timestamp in a video, so zero to seven seconds in one particular video, we have this answer for you. And then from seven to 12 in another video, we have this answer for you.

That's essentially what we're going to build. Now, maybe it looks kind of complicated from here. But in reality, I think all of this is relatively easy to use. So, let's dive into it and we'll start with the first part, which is actually getting our MP3 files. Now, as with most machine learning projects, I always expect the data preprocessing step to be the hardest.

And I think this is also the case here. So, the first thing I needed to figure out here is, okay, how do we get all this data from YouTube? Now, as a YouTube creator, I can download channel metadata and I can also download all of my videos. So, that seemed like the best approach initially, download all those videos.

Now, there's a couple of limitations here. One, you can't download other people's videos. So, the search scope is just limited to your own channel, which isn't much fun. And two, you have to download all these videos and there's a lot of them and it takes such a long time.

So, after a few days of trying to do this and trying to make it work and trying to push these videos to remote machines to process them and everything, I gave up and looked for an alternative. And the alternative is so much easier. In reality, we don't need to download all these things.

All we need is a video ID. And with a video ID, we can get everything we need using a Python library called PyTube. So, we can install PyTube like this or even actually like this, it's about the same. And once we've installed it, in this case, I use my channel metadata, which you can also download.
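As a rough sketch of what that looks like in Python (the video ID below is just a placeholder, not one of the real IDs from the dataset):

```python
# pip install pytube
from pytube import YouTube

video_id = "abc123xyz"  # placeholder video ID
# the YouTube object holds all of the information for that particular video
yt = YouTube(f"https://youtu.be/{video_id}")
print(yt.title)
```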

So, this is using the Hugging Face datasets library. So, in here, I will show you, if we come over to here, we have this jamescalam channel metadata dataset. So, that's huggingface.co/datasets/jamescalam/channel-metadata. And you can see the sort of data we have in here.

All right. So, zoom out a little bit. And we have all this. So, we have the video ID, the channel ID, the title, when it was created, or all these sort of things, description. There's a lot of things. And most of this, we actually don't even need. All we really need is a video ID and a title.

Now, using this dataset through Hugging Face datasets, you can download it like this. Okay. At some point in the future, I want to add other channels, either to this dataset or the next dataset, which you will see soon. But for now, this is just videos from my channel. So, we're taking the train split, and then we come down and see there's 222 rows there, which is quite a lot.
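In code, that download looks roughly like this, assuming the dataset ID is jamescalam/channel-metadata on the Hugging Face hub:

```python
from datasets import load_dataset

# take the train split of the channel metadata dataset
meta = load_dataset("jamescalam/channel-metadata", split="train")
print(len(meta))  # around 222 rows, with some duplicated video entries
```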

But included in there, I think there is some degree of duplication of video entries. I'm not sure why. So, what we want to do is create this meta item, and then we go through it here. So, it's called meta here, okay, so that it aligns with this.

And what we want to do is say, okay, where are we going to save all of our MP3 files? And what we're going to do is just go through, we have the video ID in that dataset. And we're going to use PyTube here to create this YouTube object, which is just like an object with all of the information for that particular video inside it.

Now, sometimes there is a bug in PyTube. If some of these video IDs have a particular character in them, it's going to give you this regex match error. Now, you could probably fix this, I think. But just for the sake of running through this quickly, I just added this try except statement in.

Because there are very few videos that trigger this issue, this bug. So, after that, we set this itag to none. So, the itag is almost like the identification number that PyTube uses for different files attributed to each video. Because each video is not just a single file, it's actually a few different files.

It includes an MP4 video file, an MP3 audio file, and it includes those at different resolutions as well. So, with each video, you get quite a few different streams or files. So, what I'm doing here is I'm getting those streams or files, and I'm saying we only want the audio ones.

So, this actually returns a few different choices. And the MP3 files that we actually want will have this MIME type. Okay? So, this is like a multimedia type, I believe. And although this says MP4, it's actually the audio related to the MP4 file, which is an MP3 file. So, we loop through all of the files attributed to a particular video, and the first one that we see that is an MP3 file, we return the itag for that, and then we break from this loop.

And in the case that we loop through all the files and no MP3 file is found, which I didn't see happen once, so it probably won't happen. But just in case, I also added this. So, if the itag is none, i.e. nothing was found, we continue. So, we ignore this, and we just move on to the next file or next video.

Now, from here, we get the correct MP3 audio stream based on the itag that we identified here, and then we download it. Okay? So, we want to download. We have the output path, which is the save path, it's just the MP3 directory. And then we have a file name here, which is just the video ID plus .mp3.

And we go through, and you see there's a couple of those regex match errors, but very few, honestly, it's nothing significant. After doing that, you should be able to see a new MP3 directory, and it will just contain a ton of MP3 audio files. Now, it does take a little bit of time to download everything.
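Putting the last few paragraphs together, the download loop looks roughly like this. It's a sketch, not the exact notebook code; in particular, the "Video ID" column name and the exact MIME-type check are assumptions.

```python
from pathlib import Path
from pytube import YouTube
from pytube.exceptions import RegexMatchError

save_dir = Path("./mp3")  # where we store the audio locally
save_dir.mkdir(exist_ok=True)

for row in meta:
    video_id = row["Video ID"]  # column name is an assumption
    try:
        yt = YouTube(f"https://youtu.be/{video_id}")
        streams = yt.streams.filter(only_audio=True)
    except RegexMatchError:
        continue  # a handful of IDs trigger this PyTube bug, so we just skip them
    itag = None
    for stream in streams:
        # the audio stream we want reports an "audio/mp4" MIME type
        if stream.mime_type == "audio/mp4":
            itag = stream.itag
            break
    if itag is None:
        continue  # no audio stream found, move on to the next video
    yt.streams.get_by_itag(itag).download(
        output_path=str(save_dir), filename=f"{video_id}.mp3"
    )
```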

It's not that bad, though, to be honest. But if you don't want to, it's fine, you can skip ahead. We also have an already transcribed dataset available that you can just use. But we'll get on to that pretty soon. So, as of yet, we have done this step here.

So, we've got our MP3s, and we've now stored them locally. Okay? And so, now we need to move on to Whisper. So, opening up Whisper, we come to here, and this is how we install it. So, we pip install from GitHub. And then, this is the install for the FFmpeg software for Ubuntu or Debian.

So, this depends on your system, okay? Let me, there are a few install instructions. We just go to here. So, on the Whisper GitHub repo, we come down, and it's here. Okay? So, we have the different install instructions. So, after installing, we come down, and we go over to here.

So, we just import Whisper. And we also need to import Torch as well, so that we can move the Whisper model over to a GPU, if you have a GPU. Otherwise, you can use CPU, but it will be slower. And if you are doing that, it's probably best if you use the small model.

Now, as for the different models, there are a few options. Again, we'll refer to the repo for this. You can see here, we have tiny, base, small, medium, and large. Now, you can see here the required VRAM amounts. Now, 10 gigabytes for the large model is actually very good.

But if you are limited on time, or just the amount of RAM that you do have available, you can use the other models, and they're actually fairly good as well. One thing to know is, if you are using this for English, and you're not using the large model, you should use the English-specific models, because they tend to perform better.

But otherwise, you can use it for multilingual speech-to-text, and they're all capable of doing that as well, without the .en. Now, here we're using the large model to get the best quality results. And we're saying, okay, move the Whisper large model to CUDA, if it's available. Now, at the moment, it doesn't work on MPS.
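A minimal sketch of that initialization, assuming a CUDA GPU is available (otherwise it falls back to CPU):

```python
import torch
import whisper

# move the large model to the GPU if we have one; MPS isn't supported at the time of writing
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("large").to(device)
```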

But hopefully, that will be fixed relatively soon. We use the channel metadata here to match up the video ID from the MP3 filenames to the video titles, so that we can display that in the browser. So, come down here. And all I'm doing here is creating a videos metadata dictionary.

And all that is, is we have a dictionary with video IDs, and that maps to the title and URL, which we're actually just building here. Now, we don't actually need the publish date. I just included that as we have all this metadata. Maybe at some point, it'll be useful. Maybe not.

And maybe we wanted to filter based on the publication date or something like that. But we're not actually going to use that. Okay, so we have video ID that maps the title and URL. And then what we want to do is, all of those MP3 files that we just downloaded, we're going to go through each one of those.
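Before going through those MP3s, here's roughly how that videos metadata dictionary might be built; the column names are my assumption about the metadata dataset:

```python
# map each video ID to the title and URL we want to show in the browser
videos_meta = {
    row["Video ID"]: {
        "title": row["Title"],                        # column names are assumptions
        "url": f"https://youtu.be/{row['Video ID']}",
    }
    for row in meta
}
```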

You can see that we have 108 of those, and they all look like this. So, we have the MP3 directory, we have the video ID, and there's an MP3 file. And then all we need to do here is also just import tqdm. So, from tqdm.auto, import tqdm.

Okay. And then we just enumerate through each of those paths. So, what we want to do is, we can get the ID from that path if we needed to, like so. And all we're doing here is transcribing to get that text data. Okay, so given the path to the MP3 file, we just pass that to Whisper, use the transcribe method, and we actually get all of these, what are called segments.

Now, these segments are just really short snippets of text with the start and end seconds where that text or that audio was transcribed from. Okay, and then what we do is, here I'm going to create this transcription JSON lines file, the file that we're going to use to save everything.

And what I'm going to do here is basically just save everything. Okay, so you can modify it. So, each one of these snippets is pretty short. So, you can actually modify this, you can increase the window to like six and then the stride to like three, for example. But what we're going to do is actually do that later.

And in this case, we'll just take out the segments directly. So, we transcribe, we get our segments, and we get the video metadata. So, this is from the videos metadata dictionary, which includes the title that we need and the URL. This bit is not so important because we're not using a window and stride greater than one.

So, it doesn't really matter here. This is, again, for if you're using a window greater than one, and we'll explain that later. But we do want the start and end positions for each segment. Okay, and we also want to create a new row ID because at the moment, we just have video IDs.

And of course, that means that there's a single video ID for a ton of segments. We don't want that. We want a single unique ID for every segment. So, we just create that by taking the video ID plus the actual timestamp. And then we create this meta dictionary. Okay, we append that to data here.

We don't actually need to do that because we're also saving it directly to the file as we go along. It's just more of a backup. Okay, and then from there, we can check the length of the dataset. And we see that we have 27.2 thousand segments.
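Here's a rough sketch of that transcription loop, using the model and videos_meta from earlier; the exact field names and ID format are my approximation of what's described, not the notebook code verbatim:

```python
import json
from pathlib import Path
from tqdm.auto import tqdm

paths = list(Path("./mp3").glob("*.mp3"))
data = []  # in-memory copy, just a backup

with open("transcriptions.jsonl", "w", encoding="utf-8") as fp:
    for path in tqdm(paths):
        video_id = path.stem
        # transcribe returns a dict with a "segments" list of short snippets
        result = model.transcribe(str(path))
        for seg in result["segments"]:
            video_info = videos_meta[video_id]
            record = {
                "id": f"{video_id}-t{int(seg['start'])}",  # unique ID: video ID + start time
                "text": seg["text"].strip(),
                "start": seg["start"],
                "end": seg["end"],
                "title": video_info["title"],
                "url": video_info["url"],
            }
            data.append(record)
            fp.write(json.dumps(record) + "\n")  # save to the JSON lines file as we go
```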

So, each segment is small, roughly a five-to-seven-word sentence. Okay, so let's take a look at where we are now. So, we've just done, well, we initialized Whisper and then we created these segments here. So, this bit, we are actually done with. So, the next bit is encoding these with SBERT. Now, again, if you're processing this, this can take a little bit of time.

So, on an A100 GPU, for me, I think it took around 10 hours. Okay, and that is for, I don't know how many hours of video, but it's 108 of my videos, which are probably on average maybe like 30 minutes long, maybe a bit longer. So, I'd say at least 50 hours there, probably a fair bit more.

So, it's a lot faster than real-time processing, which is pretty cool, but it's still a while. So, I know you probably don't want to wait 10 hours to process everything if you have an A100, or even longer otherwise. So, what you can do instead is use this transcriptions dataset, which is available on Hugging Face.

So, if we go to huggingface.co, we have this jamescalam/youtube-transcriptions dataset, and you can see what those look like here. So, let's have a look at where our segments are. Okay, so look, we have this text. It's like, "Hi, welcome to the video." Like, it's pretty short.

And then we have, yep, it continues. So, they're all relatively short little segments, but we have the start and the end here. And obviously, if you think about this, we can also increase the size of those segments. So, we could merge, like, these five, for example, and then we just have the start at zero, and the end at 20.6.

We have the specific timestamp ID here, the URL to that video, and the actual video title. So, we have everything we need there. And this is over here, so we can copy this jamescalam/youtube-transcriptions, and we can download it like we did the other dataset, and I'll show you how.

So, we come to this Build Embeddings notebook, and like I said, here we have that data set. So, you can use this, and you'll get all those transcribed video segments. You don't need to actually do it yourself, but of course, you can if you want. So, we can see in there, we have the same things we saw before, start and text, title, URL.

They're the most important bits, and the ID. Okay, here's a few examples. So, what you saw before: "Hi, welcome to the video. This is the fourth video in a Transformers", and it cuts off before "from Scratch mini-series". So, you can see here that if we encode this with our Sentence Transformers model, we're going to lose a lot of meaning.

Okay, there's not that much meaning in this four-word sentence. It's not even a full sentence, it's part of a sentence. So, what we need to do is actually merge these into larger sentences, and that's where the window and stride from the previous chunk of code is useful. Okay, so the window is essentially every six segments, we're going to look at those as a whole.

But then if you consider that we're looking at these six segments at a time, we still have the problem that every six segments we're going to be cutting, and then starting a new segment, right? Those cuts could be in the middle of a sentence, or in the middle between two sentences that are relevant to each other.

So, we'd end up losing that meaning, and we don't really want to do that. So, a typical method used to avoid this in question answering is to include something called a stride. So, we're going to look at every six segments, but then in the next step, we're going to step across only three segments.

By doing this, any meaningful segments that would otherwise be cut by, you know, just the fact that we're jumping over them like this, would be included in the next step, okay, because we have that overlap. So, we use that, and what we can do is we just iterate through our data set with this stride, and we take a batch of six segments at a time.

Now, once we get to the end of each video, there's no sort of defining split in our data. The only way we can recognize that we've gone on to another video between each segment, is that the title is going to be different. So, what we'll do is if the title at the start of our batch is different to the title at the end of our batch, we just skip this chunk of data.

And the reason we can just skip it rather than trying to, like, keep part of it, keep the final part of the video or the very start of the video, is because the final part of the video and the very start of every video, usually doesn't contain any meaningful information.

It's just either me saying hello, or it's me saying goodbye, okay. So, I don't think anyone's going to be asking questions about saying hello or goodbye. So, we can just skip those little bits at the starts and ends of videos. And then what we do, so we have our six segments as a sort of chunk within the list, and we just join them with a space character.

Okay, so now we get a larger chunk. And then we create a new set of metadata. Now, the only thing here is we include everything from the start of the batch. Okay, so the title and everything, and the ID in particular. But the one thing that does switch up from this is the end position.

So, the end position obviously needs to come from the end of the batch, because we have now extended that segment. Okay, so with that, we have created our new dataset. Obviously, there are fewer items in this because we are batching together all these segments. So, we now have 9,000 segments, but they will be a lot more meaningful.
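As a sketch, the merging step described over the last few paragraphs looks something like this, working over the data list built during transcription:

```python
window = 6  # number of segments to merge into one chunk
stride = 3  # step size, so consecutive chunks overlap by three segments

new_data = []
for i in range(0, len(data), stride):
    batch = data[i:i + window]
    if len(batch) < window:
        break
    # if the batch spans two videos, the titles differ, so we skip the chunk
    if batch[0]["title"] != batch[-1]["title"]:
        continue
    new_data.append({
        "id": batch[0]["id"],
        "text": " ".join(seg["text"] for seg in batch),  # join segments with a space
        "start": batch[0]["start"],
        "end": batch[-1]["end"],  # end position comes from the end of the batch
        "title": batch[0]["title"],
        "url": batch[0]["url"],
    })
```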

So, if we have a look at a few of these. So, here we see the "Hi, welcome to the video", fourth video in the Transformers from Scratch series chunk. And you can see that there's a lot more information conveyed in that paragraph than before, where we just had "Hi, welcome to the video" or "from Scratch" mini-series.

We come a little further down and we see, okay, training, testing, and so on. And you can see there's a lot more information in here. We'll come down a little bit more. Token type IDs, number zero, and so on, okay. From these in particular, you're probably not going to get any answers, but there are other points in the videos, with chunks that are like a paragraph long, where we will find answers, and we'll see that pretty soon.

So, now we need to move on to actually embedding all of these chunks of segments into vector embeddings. Okay, so that we can actually search through them. So, to create those embeddings, we're going to use this QA model, which means question answering. It's also multilingual, if you are using a multilingual corpus here.

And one thing to note here is that it uses dot product similarity. Okay, so that's important later on, as we'll see. So, we initialize the Sentence Transformer, we can see it's this MPNet model, has this word embedding dimension 768, and it uses, what does it use? It uses the CLS token as pooling, that's the classifier token.
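A minimal sketch of that initialization; the exact model ID is my assumption (a QA-style MPNet model trained for dot-product similarity), so swap in whichever retriever you prefer:

```python
from sentence_transformers import SentenceTransformer

retriever = SentenceTransformer("multi-qa-mpnet-base-dot-v1")  # assumed model ID
dim = retriever.get_sentence_embedding_dimension()  # 768 for this MPNet model
```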

Okay, one thing here is we get the word embedding dimension, which is just this, and we need this for Pinecone, which is the vector database that we're going to be using. Okay, so we come down here, we need to pip install this. So, pip install Pinecone Client. And then we need to initialize the vector index, or vector database that we're going to be searching through.

We're going to store all of our vectors there, all of our metadata, and then we're going to search through it with our queries. So, this will need an API key, which is free. So, you just go over here, get an API key, and put it in here. And then what we're doing here is saying, if the index ID, this YouTube search, which you can change, by the way, you don't have to keep YouTube search here.

If that doesn't already exist, I want to create a new index. Okay, so we're going to create the index. We're going to set dimensionality to 768, which fits with the embedding dimensions that we have. And we're going to use a dot product metric. And remember, that's because, as I said up here, this model is embedding within a dot product vector space.
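In code, and using the older pinecone-client API that was current when this was recorded (newer client versions look different), that is roughly:

```python
import pinecone  # pip install pinecone-client

pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")

index_id = "youtube-search"  # you can change this name
if index_id not in pinecone.list_indexes():
    # dimensionality must match the embedding model, and the metric must match
    # the dot-product space the model was trained for
    pinecone.create_index(index_id, dimension=dim, metric="dotproduct")
index = pinecone.Index(index_id)
```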

So, that's pretty important. We need to now begin building the embeddings. So, what we do here is, we're going to encode everything and insert them into our vector index in batches of 64. Okay, that's just to process things in parallel and speed things up. So, we do that here.

We're going to go through the entire set of the new dataset in batches of 64. We find the end of the batch. And then here, we're just extracting the metadata that we want to include. So, the things that are important here are actually all of these, I think.

So, text, the start and end positions of that transcribed text, the URL of that video, and also the title of that video. We need all of those, they're all pretty important. And then what we want to do here is also just extract the text by itself within a list, because we're then going to use that to create the embedded vectors of our segments.

Okay, to convert those segments into vectors. And then, we also want the IDs. So, every vector or every entry within our vector index needs to have a unique ID. Okay, that's why we create unique IDs rather than just using video IDs earlier on. And then, what we can do is we create this list, which includes our batch IDs, our embeddings, and batch metadata.
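Here's a rough sketch of that loop, including the upsert into Pinecone described in the next paragraph; the field names follow the record structure sketched earlier:

```python
from tqdm.auto import tqdm

batch_size = 64
for i in tqdm(range(0, len(new_data), batch_size)):
    i_end = min(i + batch_size, len(new_data))  # find the end of the batch
    batch = new_data[i:i_end]
    # metadata we want stored alongside each vector
    metadata = [
        {"text": x["text"], "start": x["start"], "end": x["end"],
         "url": x["url"], "title": x["title"]}
        for x in batch
    ]
    texts = [x["text"] for x in batch]  # text on its own, for encoding
    ids = [x["id"] for x in batch]      # one unique ID per chunk
    embeds = retriever.encode(texts).tolist()
    # upsert (id, vector, metadata) tuples into the vector index
    index.upsert(vectors=list(zip(ids, embeds, metadata)))
```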

And we insert or upsert into Pinecone, which is our vector database, okay? And then after we've done that, we can check that everything has actually been added. So, we can see here that I actually have more than the 9,000 that we created earlier on. And the reason for that is because I have been testing different sizes and re-uploading things.

And at some point, I made a mess and forgot to delete the old vectors. So, there's actually some duplication that I need to remove. So, in reality, you should get this, okay? So, 9,072. And another thing I should point out is earlier on, when you do the same thing here, when you've just created your index, you'd actually expect this to be zero, not already full, okay?

So, that's something I need to do. So, here, we would actually expect the 9,000. And then after that, well, we've done the next step, okay? So, we come back over here, and we've now created our, well, we initialized this, but we created these vectors. And then we have inserted them into Pinecone over here, which is our vector database.

So, at that point, we're now ready for the actual querying step, okay? So, where the user is going to come to a little search bar over here, and we're going to query. So, first, let me show you how we do that in code very quickly, okay? So, it's really simple.

Our query: "What is OpenAI's CLIP?". When we encode that query, we use the same model, the same QA sentence transformer model that we used to encode all the segments. And then we just convert that into a list, right? And that gives us what we call a query vector, which is the xq variable here.

Then we query with xq, the query vector. We return the top five most similar of the items. And we also want to include the metadata, because that includes actual text, the start and end positions, and I think that's everything. Oh, also the title and URL, okay? All these things we need.
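A sketch of that querying step, reusing the retriever and index from before:

```python
query = "What is OpenAI's CLIP?"

# encode the query into the same dot-product vector space as the segments
xq = retriever.encode(query).tolist()

# return the top five most similar chunks, with their metadata
res = index.query(vector=xq, top_k=5, include_metadata=True)
for match in res["matches"]:
    meta_item = match["metadata"]
    print(meta_item["title"], meta_item["start"], meta_item["text"][:80])
```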

So, let's zoom out a little bit. So, what is the question? "What is OpenAI's CLIP?" Okay, so we come down here, you can see these are two chunks from the same video. One is just slightly, it's about 20 seconds after the other. And you can see that, okay, OpenAI CLIP is a Contrastive Language-Image Pre-training model.

It uses pairs of images and text and trains on a matrix of cosine similarities between the text and each image. And then there's more information after that as well. That's kind of cut off, but when we actually visualize everything, we'll just be able to click on the video and we can actually just watch it through.

So, let's take a look at how that would work. So, one thing we actually do need to do here is you take the start and you add that to your URL, and then you create a link. So, I can even show you here, right? So, if we take this, come down here, we do this, and then we add question mark T equals, and maybe we want this bit here.
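That link-building step is just string formatting; a tiny helper (with a made-up example URL) might look like this:

```python
def timestamped_url(url: str, start: float) -> str:
    """Deep-link into a video at the answer's start time (whole seconds)."""
    sep = "&" if "?" in url else "?"
    return f"{url}{sep}t={int(start)}"

# e.g. timestamped_url("https://youtu.be/abc123xyz", 128.4) -> "https://youtu.be/abc123xyz?t=128"
```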

Okay, so we have that. Let's copy this and let's enter it into the browser. I'm going to put captions on. Come down here, see what returns, and you can see I'm just reading from the text here, right? This is another Q&A app I built. So, you actually have the answer on the screen and I'm also reading through it.

So, I think that's pretty cool. And what we can do is actually package all of what we've just done, or the querying part of it, into a nice little sort of web interface. A little search bar and we search and we get our answers. Now, this is really easy using Streamlit, which is what I'm going to show you.
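This isn't the actual app code, but a minimal Streamlit sketch of the idea, reusing the retriever, index, and timestamped_url helper from above, would be something like:

```python
import streamlit as st

st.title("Ask YouTube")
query = st.text_input("Ask a question")  # the little search bar

if query:
    xq = retriever.encode(query).tolist()
    res = index.query(vector=xq, top_k=5, include_metadata=True)
    for match in res["matches"]:
        meta_item = match["metadata"]
        link = timestamped_url(meta_item["url"], meta_item["start"])
        st.markdown(f"**[{meta_item['title']}]({link})**")
        st.write(meta_item["text"])
```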

I'm going to go over to Hugging Face again. I'm going to click on my profile and it's right at the top here. YouTube Tech Search. Click on there, okay. YouTube Q&A, I'm going to say, what was the question? "What is OpenAI's CLIP?" We'll ask some other questions soon.

Sometimes you get this, it just means you need to refresh. Something I need to fix in the actual Streamlit code. Okay, and then we get this. Now, what I've done is I've put it together so that when you have parts of the video, those segments that are directly next to each other, it will do this.

It's like a continuous text and then you just have a little timestamp here, here. And then if you click, let's say I click on here, it's going to take me to 2:09 in that video. So, let's close that. And then if I click here, it's going to take me to 2:27 in that video, right?

And then you also have other answers here. So, Intro to Dense Vectors, NLP and Vision, and so on and so on. And also, because we're using vector search here, we can actually kind of mess this up. So, maybe if I misspell "open" and just make a mess of everything like this.

And we still actually get very similar results even though I completely butchered the OpenAI there. Okay, so let's try some other questions. So, I'm going to grab this one. So, this is Asherak came up with this question. Thank you, Asherak. What is the best unsupervised method to train a sentence transformer?

Okay, unsupervised method. There's only one, like, fully unsupervised method, and it is TSDAE. Okay. So, the answer reads something like: pre-training produces... a sentence transformer using an unsupervised training method called Transformer-based Sequential Denoising Auto-Encoder. I'm surprised it transcribed all of that correctly. So, we can click on that, come to here, turn on the captions.

Okay, there we go. TSDAE. That is correct. That's pretty cool. Let's try another question. So, very similar, but what if I have little to no data? So, there's a few different approaches here. So, again, training sentence transformer, little to no data. TSDAE is a good option. We can also use something called GenQ, which generates queries for you.

This is particularly good for asymmetric search, e.g. question answering. And we also have Augmented SBERT as well. So, again, really good results. I'm pretty impressed with this. What is vector search? Okay, and this is kind of harder because I don't think I really answer this very often in my videos.

But there are still, you know, when I searched this, I was surprised there are actually some answers. So, the first one, I don't think there's anything relevant in the first answer, but then we come down. So, we have an answer here, and we also have another answer here now.

So, imagine we came up with a new query. I like this one the most. Convert the query into a vector, place it within the vector space, and then search for the most similar other vectors. So, let's go on there. Let's see. Okay, and we even get a nice little visual here as well.

So, we've got all of our vectors in this vector space. That looks pretty cool. You have a query vector here, xq, and then we're going to compare the distance between those, right? So, I think, again, we've got a really good answer there, both visually from the video and also just from the audio transcribed into text.

Okay, that's it for this project-focused demo on building a better YouTube search. So, I hope you found this video helpful and useful.
