How to Use OpenAI Whisper to Fix YouTube Search
Chapters
0:00 OpenAI's Whisper
1:48 Idea Behind Better Search
6:56 Downloading Audio for Whisper
8:22 Download YouTube Videos with Python
16:52 Speech-to-Text with OpenAI Whisper
20:54 Hugging Face Datasets and Preprocessing
26:48 Using a Sentence Transformer
27:45 Initializing a Vector Database
28:45 Build Embeddings and Vector Index
31:35 Asking Questions
34:08 Hugging Face Ask YouTube App
00:00:00.000 |
Search on YouTube is good, but it has some limitations. 00:00:05.840 |
With trillions of hours of content on there, you would expect there to be an answer to pretty much every question you can think of. 00:00:15.740 |
Yet, if we have a specific question that we want answered like, "What is OpenAI's CLIP?" 00:00:21.580 |
We're actually just served dozens of 20-plus minute videos, and maybe we don't want that. Maybe we just want a very brief 20-second definition. 00:00:34.300 |
The current YouTube search has no solution for this. 00:00:39.540 |
Maybe there's a good financial reason for doing so. 00:00:43.380 |
Obviously, if people are watching longer videos or more of a video, that gives YouTube more time to serve ads. 00:00:54.140 |
So, I can understand from a business perspective why that might be the case. 00:01:00.880 |
But, particularly for us, tech people that are wanting quick answers to quick problems, a better search would be incredibly useful where we can actually pinpoint the specific parts of a video that contain an answer to one of our questions. 00:01:21.640 |
Very recently, a solution to this problem may have appeared in the form of OpenAI's Whisper. 00:01:29.520 |
In this video, I want to have a look at if we can build something better, a better YouTube search experience for people like us that want quick answers. 00:01:40.020 |
And we're going to take a look at how we can use OpenAI's Whisper to actually do this. 00:01:44.380 |
So, let's try and flesh out the idea a little bit more. 00:01:48.540 |
So, the idea is we want to get specific timestamps that answer a particular question. 00:01:56.020 |
Now, looking at YouTube, we can already do this. 00:01:59.520 |
So, we just kind of hover over this video here, we right-click, and look what we can do. 00:02:04.380 |
We can copy the video URL at the current time. 00:02:08.120 |
Okay, so I can copy this, I can come over here, and I can paste it into my browser, and it will open that video at that time. 00:02:15.080 |
You see we have this little time here, so let's run that. 00:02:22.880 |
So, it should be completely possible to do something with that, right? 00:02:30.120 |
We should be able to serve users' results based on those timestamps. 00:02:34.980 |
Now, the only thing here is that we need a way to search through these videos. 00:02:41.020 |
Now, YouTube does provide captions, and they work relatively well, but sometimes they can be a little bit weird. 00:02:50.180 |
Now, this is where OpenAI's Whisper comes in. 00:02:53.220 |
OpenAI's Whisper is, you can think of it as the GPT-3 or DALL-E 2 equivalent for speech-to-text. 00:03:02.980 |
And it's also open source, which is pretty cool. 00:03:06.880 |
Now, you might expect this model to be absolutely huge, like GPT-3 or DALL-E 2, but in reality, it's actually a relatively small model. 00:03:17.720 |
I believe the largest version of the model can be run with about 10 gigabytes of VRAM, which is not that much. 00:03:27.620 |
So, we should be able to use OpenAI's Whisper with videos on YouTube to transcribe them more accurately than what YouTube captions can provide. 00:03:38.860 |
So, that would be our first, well, that would be almost our first step, because we actually need the videos. 00:03:44.120 |
So, our very first step would be getting videos from YouTube. 00:03:50.420 |
First thing we need to do is actually get the audio files, so the MP3 files. 00:03:57.380 |
We're going to store them locally or, you know, wherever local is for you. 00:04:03.780 |
And once we have those, we want to use OpenAI's Whisper. 00:04:07.720 |
So, I don't know if they have a logo, but we'll just go Whisper. 00:04:19.620 |
And what is pretty cool with OpenAI's Whisper is that it will also include the timestamps where it found that particular text. 00:04:30.080 |
Okay, so it's actually not going to look so much like a long piece of text. 00:04:34.520 |
It's going to look more like segments of text, like this. 00:04:40.360 |
And those segments of text will have like a start and end second. 00:04:44.720 |
So, it'll be like here, up to 7 seconds in, and then we'll have 7 to 12 seconds in, some more text. 00:04:51.760 |
So, with that, we can then take those segments. 00:04:57.360 |
We can encode them with a sentence transformer model. 00:05:03.480 |
We're not going to use SBERT itself, but something along those lines. 00:05:09.620 |
All right, so we have our little vector space here. 00:05:12.920 |
Should also mention here the SBERT model will be a Q&A model, question answering. 00:05:17.680 |
Okay, so that's the specific type of machine learning where given a natural language question, you expect a natural language answer. 00:05:26.960 |
Right, so these over here, these segments, they're our answers. 00:05:33.420 |
And what we're going to do with that is we want to put this into a vector index or vector database. 00:05:44.460 |
And when someone searches for something, so we're going to have a little search bar over here. 00:05:54.860 |
And they're going to write in their text, "What is OpenAI's CLIP?" 00:06:00.260 |
And that's going to be passed into the SBERT model. 00:06:19.660 |
And then from there, we return the most relevant segments. 00:06:31.460 |
So, the user will get, okay, at this timestamp in a video. 00:06:34.560 |
So, zero to seven seconds in one particular video, we have this answer for you. 00:06:40.960 |
And then from seven to 12 in another video, we have this answer for you. 00:06:45.820 |
That's essentially what we're going to build. 00:06:47.560 |
Now, maybe it looks kind of complicated from here. 00:06:50.620 |
But in reality, I think all of this is relatively easy to use. 00:06:56.320 |
So, let's dive into it and we'll start with the first part. 00:07:06.720 |
I always expect the data preprocessing step to be the hardest. 00:07:13.620 |
So, the first thing I needed to figure out here is how to actually get those videos. 00:07:21.860 |
Now, as a YouTube creator, I can download channel metadata. 00:07:31.020 |
So, that seemed like the best approach initially, download all those videos. 00:07:37.360 |
But there are a couple of problems with this. One, you can't download other people's videos. 00:07:41.400 |
So, the search scope is just limited to your own channel, which isn't much fun. 00:07:47.000 |
And two, you have to download all these videos, and there's a lot of them. 00:07:55.760 |
So, after a few days of trying to do this and trying to make it work, 00:07:59.200 |
and trying to push these videos to remote machines to process them and everything, 00:08:10.900 |
I realized that, in reality, we don't need to download all these things. 00:08:17.000 |
And with a video ID, we can get everything we need using a Python library called PyTube. 00:08:22.900 |
So, we can install PyTube like this or even actually like this, it's about the same. 00:08:30.760 |
And once we've installed it, in this case, I use my channel metadata, 00:08:38.060 |
So, this is using the Hugging Face datasets library. 00:08:42.300 |
So, in here, I will show you, if we come over to here, we have this James Callum channel metadata. 00:08:48.700 |
So, this is on huggingface.co, under datasets: James Callum channel metadata. 00:08:54.400 |
And you can see the sort of data we have in here. 00:09:01.300 |
So, we have the video ID, the channel ID, the title, when it was created, 00:09:10.460 |
And most of this, we actually don't even need. 00:09:13.560 |
All we really need is a video ID and a title. 00:09:18.400 |
Now, using this dataset through Hugging Face datasets, you can download it like this. 00:09:25.260 |
Okay. At some point in the future, I want to add other channels, 00:09:29.320 |
either to this dataset or the next dataset, which you will see soon. 00:09:34.760 |
But for now, this is just videos from my channel. 00:09:38.800 |
So, we're taking the train split, and then we come down and see there's 222 rows there, 00:09:46.600 |
But included in there, I think there is some degree of duplication of video entries. 00:09:54.200 |
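As a rough sketch, pulling that channel metadata down looks something like this (the dataset ID and column names are assumptions; check the dataset page for the exact schema):

```python
from datasets import load_dataset

# Channel metadata from the Hugging Face Hub
# (dataset ID assumed from what's shown on screen).
meta = load_dataset("jamescalam/channel-metadata", split="train")
print(len(meta))   # roughly 222 rows, some possibly duplicated
print(meta[0])     # video ID, channel ID, title, publication date, ...
```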
So, what we want to do is actually we create this meta item, 00:10:05.860 |
Okay. So, we have meta here to align with this. 00:10:11.420 |
And what we want to do is say, okay, where are we going to save all of our MP3 files? 00:10:16.960 |
And what we're going to do is just go through, we have the video ID in that dataset. 00:10:21.460 |
And we're going to use PyTube here to create this YouTube object, 00:10:29.500 |
which is just like an object with all of the information for that particular video inside it. 00:10:39.440 |
If some of these video IDs have a particular character in them, 00:10:44.660 |
it's going to give you this RegexMatchError. 00:10:50.440 |
But just for the sake of running through this quickly, I just added this try except statement in. 00:10:56.340 |
Because there's very few videos that trigger this issue, this bug. 00:11:05.540 |
So, the itag is almost like the identification number that PyTube uses 00:11:11.800 |
for different files attributed to each video. 00:11:15.600 |
Because each video is not just a single file, it's actually a few different files. 00:11:20.740 |
It includes an MP4 video file, an MP3 audio file, 00:11:26.840 |
and it includes those at different resolutions as well. 00:11:30.800 |
So, with each video, you get quite a few different streams or files. 00:11:37.600 |
So, what I'm doing here is I'm getting those streams or files, 00:11:44.100 |
So, this actually returns a few different choices. 00:11:48.200 |
And the MP3 files that we actually want will have this MIME type. 00:11:53.600 |
Okay? So, this is like a multimedia type, I believe. 00:11:59.240 |
And although this says MP4, it's actually the audio related to the MP4 file, 00:12:07.640 |
So, we loop through all of the files attributed to a particular video, 00:12:11.500 |
and the first one that we see that is an MP3 file, 00:12:14.700 |
we return the itag for that, and then we break from this loop. 00:12:18.540 |
And in the case that we loop through all the files and no MP3 file is found, 00:12:23.840 |
which I didn't see happen once, so it probably won't happen. 00:12:30.440 |
So, if the itag is None, i.e. nothing was found, we continue. 00:12:34.700 |
So, we ignore this, and we just move on to the next file or next video. 00:12:39.600 |
Now, from here, we get the correct MP3 audio stream 00:12:45.100 |
based on the itag that we identified here, and then we download it. 00:12:53.700 |
We have the output path, which is the save path, it's just the MP3 directory. 00:12:58.700 |
And then we have a filename here, which is just the video ID plus .mp3. 00:13:03.300 |
And we go through, you see there's a couple of those RegexMatchErrors, 00:13:08.300 |
but very few, honestly, it's nothing significant. 00:13:13.200 |
After doing that, you should be able to see a new MP3 directory, 00:13:18.800 |
and it will just contain a ton of MP3 audio files. 00:13:23.700 |
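A minimal sketch of that download loop with PyTube might look like this (the directory name and the metadata column name are assumptions; `meta` is the channel metadata dataset loaded above):

```python
from pathlib import Path
from pytube import YouTube
from pytube.exceptions import RegexMatchError

save_dir = Path("./mp3")            # local directory for the audio files (assumed name)
save_dir.mkdir(exist_ok=True)

for row in meta:
    video_id = row["Video ID"]      # column name assumed; adjust to the dataset's schema
    url = f"https://youtu.be/{video_id}"
    try:
        yt = YouTube(url)
    except RegexMatchError:
        print(f"RegexMatchError for {url}, skipping")
        continue
    # find the itag of the first audio-only stream with the audio/mp4 MIME type
    itag = None
    for stream in yt.streams.filter(only_audio=True):
        if stream.mime_type == "audio/mp4":
            itag = stream.itag
            break
    if itag is None:
        continue                    # no audio stream found, move on to the next video
    stream = yt.streams.get_by_itag(itag)
    stream.download(output_path=str(save_dir), filename=f"{video_id}.mp3")
```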
Now, it does take a little bit of time to download everything. 00:13:29.400 |
But if you don't want to, it's fine, you can skip ahead. 00:13:32.400 |
We also have an already transcribed dataset available that you can just use. 00:13:46.800 |
So, we've got our MP3s, and we've now stored them locally. 00:13:51.800 |
Okay? And so, now we need to move on to Whisper. 00:13:54.700 |
So, opening up Whisper, we come to here, and this is how we install it. 00:14:03.200 |
And then, this is the install for the FFmpeg software for Ubuntu or Debian. 00:14:14.700 |
Let me show you; there are a few install instructions. 00:14:20.100 |
So, on the Whisper GitHub repo, we come down, and it's here. 00:14:26.800 |
Okay? So, we have the different install instructions. 00:14:30.100 |
So, after installing, we come down, and we go over to here, 00:14:39.700 |
so that we can move the Whisper model over to a GPU, if you have a GPU. 00:14:44.900 |
Otherwise, you can use CPU, but it will be slower. 00:14:47.000 |
And if you are doing that, it's probably best if you use the small model. 00:14:52.100 |
Now, as for the different models, there are a few options. 00:14:59.300 |
You can see here, we have tiny, base, small, medium, and large. 00:15:02.900 |
Now, you can see here the required VRAM amounts. 00:15:07.600 |
Now, 10 gigabytes for the large model is actually very good. 00:15:12.600 |
But if you are limited on time, or just the amount of RAM that you do have available, 00:15:17.800 |
you can use the other models, and they're actually fairly good as well. 00:15:21.200 |
One thing to know is, if you are using this for English, 00:15:27.800 |
you should use the English-specific models, because they tend to perform better. 00:15:32.100 |
But otherwise, you can use it for multilingual speech-to-text, 00:15:36.300 |
and they're all capable of doing that as well, without the .en. 00:15:43.300 |
Now, here we're using the large model to get the best quality results. 00:15:48.100 |
And we're saying, okay, move the Whisper large model to CUDA, if it's available. 00:15:58.300 |
But hopefully, that will be fixed relatively soon. 00:16:02.500 |
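A minimal sketch of that setup with the standard Whisper Python API (model size is your choice):

```python
import torch
import whisper  # pip install git+https://github.com/openai/whisper.git (ffmpeg also required)

device = "cuda" if torch.cuda.is_available() else "cpu"

# "large" gives the best quality; "small.en" or "base.en" are faster,
# English-only alternatives if you're limited on VRAM or time.
model = whisper.load_model("large").to(device)
```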
We use the channel metadata here to match up the video ID from the MP3 filenames 00:16:08.200 |
to the video titles, so that we can display that in the browser. 00:16:15.400 |
And all I'm doing here is creating a videos metadata dictionary. 00:16:19.800 |
And all that is, is we have a dictionary with video IDs, 00:16:23.700 |
and that maps to the title and URL, which we're actually just building here. 00:16:30.400 |
I just included that as we have all this metadata. 00:16:37.000 |
And maybe we wanted to filter based on the publication date or something like that. 00:16:44.100 |
Okay, so we have video ID that maps the title and URL. 00:16:48.500 |
And then what we want to do is get the paths to all of those MP3 files that we just downloaded. 00:16:55.500 |
You can see that we have 108 of those, and they all look like this. 00:16:58.500 |
So, we have the MP3 directory, we have the video ID, and there's an MP3 file. 00:17:02.800 |
And then all we need to do here is also just import tqdm. 00:17:12.200 |
And then we just enumerate through each of those paths. 00:17:17.600 |
So, what we want to do is, we can get the ID from that path if we needed to, like so. 00:17:25.900 |
And all we're doing here is transcribing to get that text data. 00:17:32.200 |
Okay, so given the path to the MP3 file, we just pass that to Whisper, 00:17:37.700 |
use the transcribe method, and we actually get all of these, what are called segments. 00:17:43.800 |
Now, these segments are just really short snippets of text with the start and end seconds 00:17:50.900 |
where that text or that audio was transcribed from. 00:17:55.700 |
Okay, and then what we do is, here I'm going to create this transcription JSON lines file, 00:18:01.700 |
the file that we're going to use to save everything. 00:18:04.500 |
And what I'm going to do here is basically just save everything. 00:18:10.700 |
So, each one of these snippets is pretty short. 00:18:13.300 |
So, you can actually modify this, you can increase the window to like six 00:18:16.400 |
and then the stride to like three, for example. 00:18:19.600 |
But what we're going to do is actually do that later. 00:18:23.300 |
And in this case, we'll just take out the segments directly. 00:18:27.900 |
So, we transcribe, we get our segments, get the video metadata. 00:18:34.900 |
So, this is from the video's date, which includes the title that we need and the URL. 00:18:39.800 |
This bit isn't so important here, 00:18:44.600 |
because we're not using a window and stride greater than one. 00:18:51.200 |
This is, again, for if you're using a window greater than one; we'll explain that later. 00:18:56.000 |
But we do want the start and end positions for each segment. 00:18:59.400 |
Okay, and we also want to create a new row ID because at the moment, we just have video IDs. 00:19:04.400 |
And of course, that means that there's a single video ID for a ton of segments. 00:19:09.300 |
We don't want that. We want a single unique ID for every segment. 00:19:13.800 |
So, we just create that by taking the video ID plus the actual timestamp. 00:19:26.500 |
We don't actually need to do that because we're also saving it directly to the file as we go along. 00:19:34.600 |
Okay, and then from there, we can check the length of the data set. 00:19:38.200 |
And we see that we have 27.2 thousand segments. 00:19:42.900 |
So, each segment is small; roughly five-to-seven-word sentences. 00:19:47.600 |
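Putting that together, a sketch of the transcription loop might look like this (file paths, the `videos_meta` dictionary name, and field names are assumptions; the windowing logic mentioned above is left out since window and stride are 1 here):

```python
import json
from pathlib import Path
from tqdm.auto import tqdm

paths = list(Path("./mp3").glob("*.mp3"))      # the 108 audio files downloaded earlier

with open("transcriptions.jsonl", "w", encoding="utf-8") as fp:
    for i, path in enumerate(tqdm(paths)):
        video_id = path.stem                   # filename is "<video_id>.mp3"
        result = model.transcribe(str(path))   # Whisper returns full text plus segments
        for segment in result["segments"]:
            info = videos_meta[video_id]       # {"title": ..., "url": ...} built earlier
            row = {
                # unique per-segment ID: video ID plus the start timestamp
                "id": f"{video_id}-t{segment['start']}",
                "video_id": video_id,
                "text": segment["text"].strip(),
                "start": segment["start"],
                "end": segment["end"],
                "title": info["title"],
                "url": info["url"],
            }
            fp.write(json.dumps(row) + "\n")   # save directly to file as we go along
```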
Okay, so let's take a look at where we are now. 00:19:51.200 |
So, we've just done, well, we initialized Whisper and then we created these segments here. 00:20:03.900 |
So, the next bit is encoding these with SBERT. 00:20:07.400 |
Now, again, if you're processing this, this can take a little bit of time. 00:20:14.500 |
So, on an A100 GPU, for me, I think it took around 10 hours. 00:20:23.000 |
Okay, and that is for, I don't know how many hours of video, 00:20:26.600 |
but it's 108 of my videos, which are probably on average maybe like 30 minutes long, maybe a bit longer. 00:20:34.900 |
So, I'd say at least 50 hours there, probably a fair bit more. 00:20:42.000 |
So, it's a lot faster than real-time processing, which is pretty cool, but it's still a while. 00:20:48.700 |
So, I know you probably don't want to wait 10 hours to process everything if you have an A100, or longer if you don't. 00:20:54.800 |
So, what you can do is this transcriptions data set is available on Hugging Face. 00:21:01.800 |
So, if we go to HuggingFace.co, and we have this James Callum YouTube transcriptions data set, 00:21:12.800 |
So, let's have a look at where our segments are. 00:21:24.500 |
So, they're all relatively short little segments, but we have the start and the end here. 00:21:29.600 |
And obviously, if you think about this, we can also increase the size of those segments. 00:21:34.600 |
So, we could merge, like, these five, for example, and then we just have the start at zero, and the end is 20.6. 00:21:42.200 |
We have the specific timestamp ID here, URL to that video, and the actual video title. 00:21:56.400 |
And this is over here, so we can copy this James Callum YouTube transcriptions, 00:22:01.000 |
and we can download it like we did the other data set, and I'll show you how. 00:22:04.800 |
So, we come to this Build Embeddings notebook, and like I said, here we have that data set. 00:22:10.400 |
So, you can use this, and you'll get all those transcribed video segments. 00:22:15.800 |
You don't need to actually do it yourself, but of course, you can if you want. 00:22:21.900 |
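As a quick sketch, pulling that pre-built dataset down looks something like this (dataset ID assumed from what's shown on screen):

```python
from datasets import load_dataset

# Already-transcribed segments, so you can skip the long Whisper step.
# Dataset ID assumed; check the jamescalam profile on huggingface.co for the exact name.
data = load_dataset("jamescalam/youtube-transcriptions", split="train")
print(data[0])  # {'title': ..., 'url': ..., 'id': ..., 'text': ..., 'start': ..., 'end': ...}
```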
So, we can see in there, we have the same things we saw before, start and text, title, URL. 00:22:32.000 |
So, what you saw before was "Hi, welcome to the video. 00:22:34.600 |
This is the fourth video in a Transformers", and it cuts off before the "from Scratch" mini-series part. 00:22:39.900 |
So, you can see here that if we encode this with our Sentence Transformers model, we're not going to get much out of it. 00:22:49.200 |
Okay, there's not that much meaning in this four-word sentence. 00:22:52.800 |
It's not even a full sentence, it's part of a sentence. 00:22:55.200 |
So, what we need to do is actually merge these into larger sentences, 00:22:59.700 |
and that's where the window and stride from the previous chunk of code is useful. 00:23:05.300 |
Okay, so the window is essentially every six segments, we're going to look at those as a whole. 00:23:10.600 |
But then if you consider that we're looking at these six segments at a time, 00:23:16.400 |
we still have the problem that every six segments we're going to be making a cut. 00:23:23.400 |
Those cuts could be in the middle of a sentence, 00:23:29.200 |
or in the middle of two sentences that are relevant to each other. 00:23:33.300 |
So, we'd end up losing that meaning, and we don't really want to do that. 00:23:38.400 |
So, a typical method used to avoid this in question answering is to include something called a stride. 00:23:46.000 |
So, we're going to look at every six segments, 00:23:48.800 |
but then in the next step, we're going to step across only three segments. 00:23:53.700 |
By doing this, any meaningful segments that would otherwise be cut by, 00:23:58.900 |
you know, just the fact that we're jumping over them like this, 00:24:01.900 |
would be included in the next step, okay, because we have that overlap. 00:24:06.700 |
So, we use that, and what we can do is we just iterate through our data set with this stride, 00:24:14.900 |
and we take a batch of six segments at a time. 00:24:18.800 |
Now, once we get to the end of each video, there's no sort of defining split in our data. 00:24:27.300 |
The only way we can recognize that we've gone on to another video between each segment is by looking at the title. 00:24:34.600 |
So, what we'll do is, if the title at the start of our batch is different to the title at the end of our batch, we'll just skip that batch. 00:24:44.300 |
And the reason we can just skip it rather than trying to, like, keep part of it, 00:24:48.900 |
keep the final part of the video or the very start of the video, 00:24:52.100 |
is because the final part and the very start of every video 00:24:56.600 |
usually don't contain any meaningful information. 00:25:00.000 |
It's just either me saying hello, or it's me saying goodbye, okay. 00:25:04.600 |
So, I don't think anyone's going to be asking questions about saying hello or goodbye. 00:25:09.600 |
So, we can just skip those little bits at the start and ends of videos. 00:25:14.800 |
And then what we do, so we have our six segments as a sort of chunk within the list, and we join them together into one chunk of text. 00:25:30.700 |
Now, the only thing here is we include everything from the start of the batch. 00:25:35.800 |
Okay, so the title and everything, and the ID in particular. 00:25:39.900 |
But the one thing that does switch up from this is the end position. 00:25:46.000 |
So, the end position obviously needs to come from the end of the batch, not the start. 00:25:52.000 |
Okay, so with that, we have created our new dataset. 00:25:56.100 |
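A sketch of that merging step, assuming window = 6 and stride = 3 as described, and the `data` dataset loaded above:

```python
window = 6   # number of segments merged into one chunk
stride = 3   # step size, so consecutive chunks overlap by three segments

new_data = []
for i in range(0, len(data), stride):
    batch = data[i:i + window]                   # dict of lists for this slice of segments
    if batch["title"][0] != batch["title"][-1]:
        continue                                 # batch spans two videos, skip it
    new_data.append({
        "id": batch["id"][0],                    # ID, title, URL, start come from the batch start
        "title": batch["title"][0],
        "url": batch["url"][0],
        "start": batch["start"][0],
        "end": batch["end"][-1],                 # only the end comes from the batch end
        "text": " ".join(batch["text"]),         # merge the six short segments into one passage
    })

print(len(new_data))  # roughly 9,000 merged chunks
```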
Obviously, there are fewer items in this because we are batching together all these segments. 00:26:02.600 |
So, we now have 9,000 segments, but they will be a lot more meaningful. 00:26:08.500 |
So, here we see a "Hi, welcome to the video", "Support video", "Transformers from Scratch". 00:26:15.000 |
And you can see that there's a lot more information conveyed in that paragraph than before, 00:26:19.400 |
where we just had "Hi, welcome to the video" or "From Scratch" miniseries. 00:26:23.900 |
We come a little further down and we see, okay, training, testing, tallying that. 00:26:29.400 |
And you can see there's a lot more information in here. 00:26:33.600 |
Token type IDs, let's go number zero, and so on, okay. 00:26:38.000 |
These are not particularly meaningful; you're probably not going to get any answers from each of these, 00:26:42.000 |
but there are other points in the videos, which is like a paragraph long, 00:26:45.500 |
where we will find answers and we'll see that pretty soon. 00:26:48.600 |
So, now we need to move on to actually embedding all of these chunks of segments into vector embeddings. 00:26:57.200 |
Okay, so that we can actually search through them. 00:26:59.500 |
So, to create those embeddings, we're going to use this QA model, which means question answering. 00:27:05.200 |
It's also multilingual, if you are using a multilingual corpus here. 00:27:10.300 |
And one thing to note here is that it uses dot product similarity. 00:27:14.500 |
Okay, so that's important later on, as we'll see. 00:27:17.200 |
So, we initialize the Sentence Transformer, we can see it's this MPNet model, 00:27:22.900 |
has this word embedding dimension 768, and it uses, what does it use? 00:27:28.600 |
It uses the CLS token as pooling, that's the classifier token. 00:27:34.300 |
Okay, one thing here is we get the word embedding dimension, which is just this, 00:27:40.000 |
and we need this for Pinecone, which is the vector database that we're going to be using. 00:27:45.200 |
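As a sketch, initializing that retriever with the sentence-transformers library (the exact model ID is an assumption; swap in whichever QA sentence transformer you're using):

```python
from sentence_transformers import SentenceTransformer

# A QA MPNet model trained for dot-product similarity (model ID assumed).
retriever = SentenceTransformer("multi-qa-mpnet-base-dot-v1")

dim = retriever.get_sentence_embedding_dimension()
print(dim)  # 768, needed when creating the vector index
```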
Okay, so we come down here, we need to pip install this. 00:27:55.100 |
And then we need to initialize the vector index, or vector database that we're going to be searching through. 00:28:00.600 |
Going to store all of our vectors there, all of our metadata, and then we're going to search through our queries. 00:28:06.000 |
So, this would need an API key, which is free. 00:28:09.300 |
So, you just go over here, get an API key, and put it in here. 00:28:13.600 |
And then what we're doing here is saying, if the index ID, this YouTube search, 00:28:17.800 |
which you can change, by the way, you don't have to keep YouTube search here. 00:28:21.700 |
If that doesn't already exist, I want to create a new index. 00:28:27.400 |
We're going to set dimensionality to 768, which fits with the embedding dimensions that we have. 00:28:37.200 |
We also set the metric to dot product, and remember, that's because, up here, this model is embedding within a dot product vector space. 00:28:45.800 |
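A rough sketch of that initialization with the Pinecone client (note the client API has changed across versions, so this is only indicative; the index name and key handling are assumptions):

```python
import pinecone  # pip install pinecone-client (older client API shown; newer versions differ)

pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")  # free API key from the Pinecone console

index_id = "youtube-search"   # index name is arbitrary, change it if you like
if index_id not in pinecone.list_indexes():
    pinecone.create_index(
        index_id,
        dimension=768,         # must match the sentence transformer's embedding dimension
        metric="dotproduct",   # the QA model embeds into a dot-product vector space
    )
index = pinecone.Index(index_id)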
We need to now begin building the embeddings. 00:28:50.800 |
So, what we do here is, we're going to encode everything and insert them into our vector index in batches of 64. 00:28:59.000 |
Okay, that's just to keep things in parallel. 00:29:04.600 |
We're going to go through the entire set of the new dataset in batches of 64. 00:29:12.100 |
And then here, we're just extracting the metadata that we want to include. 00:29:16.600 |
So, the things that are important here are actually all of these, I think. 00:29:21.000 |
So, text, the start and end positions of that transcribed text, the URL of that video, and also the title of that video. 00:29:32.000 |
We need all of those, they're all pretty important. 00:29:36.200 |
And then what we want to do here is also just extract the text by itself within a list, 00:29:40.700 |
because we're then going to use that to create the embedded vectors of our segments. 00:29:46.500 |
Okay, to convert those segments into vectors. 00:29:51.600 |
So, every vector or every entry within our vector index needs to have a unique ID. 00:30:00.200 |
Okay, that's why we create unique IDs rather than just using video IDs earlier on. 00:30:04.600 |
And then, what we can do is we create this list, which includes our batch IDs, our embeddings, and batch metadata. 00:30:12.300 |
And we insert or upsert into Pinecone, which is our vector database, okay? 00:30:18.800 |
And then after we've done that, we can check that everything has actually been added. 00:30:23.400 |
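As a sketch, that embed-and-upsert loop over batches of 64 might look like this, using the `new_data`, `retriever`, and `index` objects from above:

```python
from tqdm.auto import tqdm

batch_size = 64
for i in tqdm(range(0, len(new_data), batch_size)):
    batch = new_data[i:i + batch_size]
    # metadata we want returned with each search result
    metadata = [{
        "text": x["text"], "start": x["start"], "end": x["end"],
        "url": x["url"], "title": x["title"],
    } for x in batch]
    texts = [x["text"] for x in batch]            # just the text, for embedding
    ids = [x["id"] for x in batch]                # unique per-chunk IDs
    embeds = retriever.encode(texts).tolist()     # 768-dimensional vectors
    index.upsert(vectors=list(zip(ids, embeds, metadata)))

print(index.describe_index_stats())  # should report roughly 9,000 vectors
```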
So, we can see here that I actually have more than the 9,000 that we created earlier on. 00:30:29.900 |
And the reason for that is because I have been testing different sizes and re-uploading things. 00:30:37.100 |
And at some point, I made a mess and forgot to delete the old vectors. 00:30:41.800 |
So, there's actually some duplication that I need to remove. 00:30:51.400 |
And another thing I should point out is earlier on, when you do the same thing here, 00:30:56.600 |
when you've just created your index, you'd actually expect this to be zero, not already full, okay? 00:31:07.600 |
So, here, we would actually expect the 9,000. 00:31:11.400 |
And then after that, well, we've done the next step, okay? 00:31:15.600 |
So, we come back over here, and we've now created our, well, we initialized this, but we created these vectors. 00:31:28.000 |
And then we have inserted them into Pinecone over here, which is our vector database. 00:31:36.000 |
So, at that point, we're now ready for the actual querying step, okay? 00:31:41.100 |
So, where the user is going to come to a little search bar over here, and we're going to query. 00:31:45.700 |
So, first, let me show you how we do that in code very quickly, okay? 00:31:53.400 |
When we encode that query, we use the same model, the same QA sentence transform model that we use to encode all the segments. 00:32:02.200 |
And then we just convert that into a list, right? 00:32:04.600 |
And that gives us what we call a query vector, which is the xq variable here. 00:32:12.700 |
We return the top five most similar items. 00:32:16.400 |
And we also want to include the metadata, because that includes actual text, the start and end positions, and I think that's everything. 00:32:35.100 |
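A minimal sketch of that query step (the Pinecone response format varies slightly between client versions):

```python
query = "What is OpenAI's CLIP?"

# Encode the question with the same QA sentence transformer used for the segments
xq = retriever.encode(query).tolist()

# Return the five most similar chunks, with their metadata (text, start, end, url, title)
results = index.query(vector=xq, top_k=5, include_metadata=True)
for match in results["matches"]:
    info = match["metadata"]
    print(info["title"], info["start"], info["text"][:80])
```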
Okay, so we come down here, you can see these are two chunks from the same video. 00:32:40.000 |
One is just slightly, it's about 20 seconds after the other. 00:32:45.500 |
And you can see that, okay, OpenAI CLIP is a contrastive learning image pre-training model. 00:32:51.700 |
Use pairs of images and text in terms of matrix, cosine similarity between text and each image. 00:32:58.400 |
And then there's more information after that as well. 00:33:01.100 |
That's kind of cut, but when we actually visualize everything, we'll just be able to click on the video and we can actually just watch it through. 00:33:09.300 |
So, let's take a look at how that would work. 00:33:14.000 |
So, one thing we actually do need to do here is you take the start and you add that to your URL, and then you create a link. 00:33:23.400 |
So, if we take this, come down here, we do this, and then we add question mark T equals, and maybe we want this bit here. 00:33:40.200 |
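That's just string formatting, something like this:

```python
# Take the top match and build a link that opens the video at that segment's start time.
# Assumes a youtu.be-style URL with no existing query string.
top = results["matches"][0]["metadata"]
start = int(top["start"])                      # start time in seconds, e.g. 85
timestamped_url = f"{top['url']}?t={start}"
print(timestamped_url)
```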
Let's copy this and let's enter it into the browser. 00:33:52.400 |
Come down here, see what returns, and you can see I'm just reading from the text here, right? 00:34:01.800 |
So, you actually have the answer on the screen and I'm also reading through it. 00:34:10.000 |
And what we can do is actually package all of what we've just done, or the querying part of it, into a nice little sort of web interface. 00:34:22.600 |
A little search bar and we search and we get our answers. 00:34:25.500 |
Now, this is really easy using Streamlit, which is what I'm going to show you. 00:34:32.600 |
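As a very rough sketch of the Streamlit side (the real app's layout is more involved; names here are placeholders and reuse the `retriever` and `index` from earlier):

```python
import streamlit as st

st.title("YouTube Q&A")
query = st.text_input("Ask a question about the videos")

if query:
    xq = retriever.encode(query).tolist()
    results = index.query(vector=xq, top_k=5, include_metadata=True)
    for match in results["matches"]:
        info = match["metadata"]
        start = int(info["start"])
        # link the title straight to the matched timestamp in the video
        st.markdown(f"**[{info['title']}]({info['url']}?t={start})**")
        st.write(info["text"])
```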
I'm going to click on my profile and it's right at the top here. 00:34:41.200 |
YouTube Q&A, I'm going to say, what was the question? 00:34:49.900 |
Sometimes you get this, it just means you need to refresh. 00:34:52.800 |
Something I need to fix in the actual Streamlit code. 00:34:59.000 |
Now, what I've done is I've put it together so that when you have parts of the video, 00:35:05.700 |
those segments that are directly next to each other, it will do this. 00:35:11.100 |
It's like a continuous text and then you just have a little timestamp here, here. 00:35:17.300 |
And then if you click, let's say I click on here, it's going to take me to 2:09 in that video. 00:35:25.300 |
And then if I click here, it's going to take me to 2:27 in that video, right? 00:35:34.700 |
So, Intro to Dense Vectors, NLP and Vision, and so on and so on. 00:35:39.400 |
And also, because we're using Vector Search here, we can actually kind of like mess up this. 00:35:44.800 |
So, if I, maybe if I like go open, just make a mess of everything like this. 00:35:51.900 |
And we still actually get very similar results even though I completely butchered the OpenAI there. 00:36:05.500 |
So, this is a question Asherak came up with. 00:36:10.700 |
What is the best unsupervised method to train a sentence transformer? 00:36:18.100 |
There's only one unsupervised, like fully unsupervised method and it is TSDAE. 00:36:24.700 |
Okay. So, with something called pre-train produces na-na-na-na. 00:36:29.900 |
Sentence transformer using unsupervised training method called 00:36:33.400 |
Transformer-based Sequential Denoising Auto-Encoder. 00:36:36.300 |
I'm surprised it transcribed all of that correctly. 00:36:38.800 |
So, we can click on that, come to here, turn on the captions. 00:36:53.200 |
So, very similar, but what if I have little to no data? 00:36:59.200 |
So, again, training sentence transformer, little to no data. 00:37:06.300 |
We can also use something called GenQ, which generates queries for you. 00:37:10.800 |
This is particularly good for asymmetric search, e.g. question answering. 00:37:30.500 |
Okay, and this is kind of harder because I don't think I really answer this very often in my videos. 00:37:36.700 |
But there are still, you know, when I searched this, I was surprised there are actually some answers. 00:37:41.800 |
So, the first one, I don't think there's anything relevant in the first answer, but then we come down. 00:37:49.500 |
So, we have an answer here, and we also have another answer here now. 00:37:58.200 |
Convert query into vector, place within vector space, and then search for the most similar other vectors. 00:38:08.300 |
Okay, and we even get a nice little visual here as well. 00:38:12.200 |
So, we've got all of our vectors in this vector space. 00:38:19.600 |
You have a query vector here, xq, and then we're going to compare the distance between those, right? 00:38:29.200 |
So, I think, again, we've got a really good answer there, both visually from the video and also just from the audio transcribed into text. 00:38:39.300 |
Okay, that's it for this project focus demo on building a better YouTube search.