How to Use OpenAI Whisper to Fix YouTube Search
Chapters
0:00 OpenAI's Whisper
1:48 Idea Behind Better Search
6:56 Downloading Audio for Whisper
8:22 Download YouTube Videos with Python
16:52 Speech-to-Text with OpenAI Whisper
20:54 Hugging Face Datasets and Preprocessing
26:48 Using a Sentence Transformer
27:45 Initializing a Vector Database
28:45 Build Embeddings and Vector Index
31:35 Asking Questions
34:08 Hugging Face Ask YouTube App
00:00:00.000 |
Search on YouTube is good, but it has some limitations. 00:00:05.840 |
With trillions of hours of content on there, you would expect there to be an answer to pretty much every question you can think of. 00:00:15.740 |
Yet, if we have a specific question that we want answered like, "What is OpenAI's CLIP?" 00:00:21.580 |
We're actually just served dozens of 20-plus minute videos, and maybe we don't want that. Maybe we just want a very brief 20-second definition. 00:00:34.300 |
The current YouTube search has no solution for this. 00:00:39.540 |
Maybe there's a good financial reason for doing so. 00:00:43.380 |
Obviously, if people are watching longer videos or more of a video, that gives YouTube more time to serve ads. 00:00:54.140 |
So, I can understand from a business perspective why that might be the case. 00:01:00.880 |
But, particularly for us, tech people that are wanting quick answers to quick problems, a better search would be incredibly useful where we can actually pinpoint the specific parts of a video that contain an answer to one of our questions. 00:01:21.640 |
Very recently, a solution to this problem may have appeared in the form of OpenAI's Whisper. 00:01:29.520 |
In this video, I want to have a look at if we can build something better, a better YouTube search experience for people like us that want quick answers. 00:01:40.020 |
And we're going to take a look at how we can use OpenAI's Whisper to actually do this. 00:01:44.380 |
So, let's try and flesh out the idea a little bit more. 00:01:48.540 |
So, the idea is we want to get specific timestamps that answer a particular question. 00:01:56.020 |
Now, looking at YouTube, we can already do this. 00:01:59.520 |
So, we just kind of hover over this video here, we right-click, and look what we can do. 00:02:04.380 |
We can copy the video URL at the current time. 00:02:08.120 |
Okay, so I can copy this, I can come over here, and I can paste it into my browser, and it will open that video at that time. 00:02:15.080 |
You see we have this little time here, so let's run that. 00:02:22.880 |
So, it should be completely possible to do something with that, right? 00:02:30.120 |
We should be able to serve users' results based on those timestamps. 00:02:34.980 |
Now, the only thing here is that we need a way to search through these videos. 00:02:41.020 |
Now, YouTube does provide captions, and they work relatively well, but sometimes they can be a little bit weird. 00:02:50.180 |
Now, this is where OpenAI's Whisper comes in. 00:02:53.220 |
OpenAI's Whisper is, you can think of it as the GPT-3 or DALL-E 2 equivalent for speech-to-text. 00:03:02.980 |
And it's also open source, which is pretty cool. 00:03:06.880 |
Now, you might expect this model to be absolutely huge, like GPT-3 or DALL-E 2, but in reality, it's actually a relatively small model. 00:03:17.720 |
I believe the largest version of the model can be run with about 10 gigabytes of VRAM, which is not that much. 00:03:27.620 |
So, we should be able to use OpenAI's Whisper with videos on YouTube to transcribe them more accurately than what YouTube captions can provide. 00:03:38.860 |
So, that would be our first, well, that would be almost our first step, because we actually need the videos. 00:03:44.120 |
So, our very first step would be getting videos from YouTube. 00:03:50.420 |
First thing we need to do is actually get the audio files, so the MP3 files. 00:03:57.380 |
We're going to store them locally or, you know, wherever local is for you. 00:04:03.780 |
And once we have those, we want to use OpenAI's Whisper. 00:04:07.720 |
So, I don't know if they have a logo, but we'll just go Whisper. 00:04:19.620 |
And what is pretty cool with OpenAI's Whisper is that it will also include the timestamps where it found that particular text. 00:04:30.080 |
Okay, so it's actually not going to look so much like a long piece of text. 00:04:34.520 |
It's going to look more like segments of text, like this. 00:04:40.360 |
And those segments of text will have like a start and end second. 00:04:44.720 |
So, it'll be like here, up to 7 seconds in, and then we'll have 7 to 12 seconds in, some more text. 00:04:51.760 |
So, with that, we can then take those segments. 00:04:57.360 |
We can encode them with a sentence transformer model. 00:05:03.480 |
We're not going to use SBERT itself, but something along those lines. 00:05:09.620 |
All right, so we have our little vector space here. 00:05:12.920 |
Should also mention here the SBERT model will be a Q&A model, question answering. 00:05:17.680 |
Okay, so that's the specific type of machine learning where given a natural language question, you expect a natural language answer. 00:05:26.960 |
Right, so these over here, these segments, they're our answers. 00:05:33.420 |
And what we're going to do with that is we want to put this into a vector index or vector database. 00:05:44.460 |
And when someone searches for something, so we're going to have a little search bar over here. 00:05:54.860 |
And they're going to write in their text, "What is OpenAI's CLIP?" 00:06:00.260 |
And that's going to be passed into the SBERT model. 00:06:19.660 |
And then from there, we return the most relevant segments. 00:06:31.460 |
So, the user will get, okay, at this timestamp in a video. 00:06:34.560 |
So, zero to seven seconds in one particular video, we have this answer for you. 00:06:40.960 |
And then from seven to 12 in another video, we have this answer for you. 00:06:45.820 |
That's essentially what we're going to build. 00:06:47.560 |
Now, maybe it looks kind of complicated from here. 00:06:50.620 |
But in reality, I think all of this is relatively easy to use. 00:06:56.320 |
So, let's dive into it and we'll start with the first part. 00:07:06.720 |
I always expect the data preprocessing step to be the hardest. 00:07:13.620 |
So, the first thing I needed to figure out here is how to actually get those videos. 00:07:21.860 |
Now, as a YouTube creator, I can download channel metadata. 00:07:31.020 |
So, that seemed like the best approach initially, download all those videos. 00:07:37.360 |
But there are a couple of problems with this. One, you can't download other people's videos. 00:07:41.400 |
So, the search scope is just limited to your own channel, which isn't much fun. 00:07:47.000 |
And two, you have to download all these videos, and there's a lot of them. 00:07:55.760 |
So, after a few days of trying to do this and trying to make it work, 00:07:59.200 |
and trying to push these videos to remote machines to process them and everything, 00:08:10.900 |
I realized that, in reality, we don't need to download all these things. 00:08:17.000 |
And with a video ID, we can get everything we need using a Python library called PyTube. 00:08:22.900 |
So, we can install PyTube like this or even actually like this, it's about the same. 00:08:30.760 |
And once we've installed it, in this case, I use my channel metadata, 00:08:38.060 |
So, this is using the Hugging Face datasets library. 00:08:42.300 |
So, in here, I will show you, if we come over to here, we have this James Callum channel metadata. 00:08:48.700 |
So, this is on huggingface.co, under datasets: James Callum channel metadata. 00:08:54.400 |
And you can see the sort of data we have in here. 00:09:01.300 |
So, we have the video ID, the channel ID, the title, when it was created, 00:09:10.460 |
And most of this, we actually don't even need. 00:09:13.560 |
All we really need is a video ID and a title. 00:09:18.400 |
Now, using this dataset through Hugging Face datasets, you can download it like this. 00:09:25.260 |
Okay. At some point in the future, I want to add other channels, 00:09:29.320 |
either to this dataset or the next dataset, which you will see soon. 00:09:34.760 |
But for now, this is just videos from my channel. 00:09:38.800 |
So, we're taking the train split, and then we come down and see there's 222 rows there, 00:09:46.600 |
But included in there, I think there is some degree of duplication of video entries. 00:09:54.200 |
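As a rough sketch, pulling that channel metadata down looks something like this (the dataset ID and column names are assumptions; check the dataset page for the exact schema):

```python
from datasets import load_dataset

# Channel metadata from the Hugging Face Hub
# (dataset ID assumed from what's shown on screen).
meta = load_dataset("jamescalam/channel-metadata", split="train")
print(len(meta))   # roughly 222 rows, some possibly duplicated
print(meta[0])     # video ID, channel ID, title, publication date, ...
```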
So, what we want to do is actually we create this meta item, 00:10:05.860 |
Okay. So, we have meta here to align with this. 00:10:11.420 |
And what we want to do is say, okay, where are we going to save all of our MP3 files? 00:10:16.960 |
And what we're going to do is just go through, we have the video ID in that dataset. 00:10:21.460 |
And we're going to use PyTube here to create this YouTube object, 00:10:29.500 |
which is just like an object with all of the information for that particular video inside it. 00:10:39.440 |
If some of these video IDs have a particular character in them, 00:10:44.660 |
it's going to give you this RegexMatchError. 00:10:50.440 |
But just for the sake of running through this quickly, I just added this try except statement in. 00:10:56.340 |
Because there's very few videos that trigger this issue, this bug. 00:11:05.540 |
So, the itag is almost like the identification number that PyTube uses 00:11:11.800 |
for different files attributed to each video. 00:11:15.600 |
Because each video is not just a single file, it's actually a few different files. 00:11:20.740 |
It includes an MP4 video file, an MP3 audio file, 00:11:26.840 |
and it includes those at different resolutions as well. 00:11:30.800 |
So, with each video, you get quite a few different streams or files. 00:11:37.600 |
So, what I'm doing here is I'm getting those streams or files, 00:11:44.100 |
So, this actually returns a few different choices. 00:11:48.200 |
And the MP3 files that we actually want will have this MIME type. 00:11:53.600 |
Okay? So, this is like a multimedia type, I believe. 00:11:59.240 |
And although this says MP4, it's actually the audio related to the MP4 file, 00:12:07.640 |
So, we loop through all of the files attributed to a particular video, 00:12:11.500 |
and the first one that we see that is an MP3 file, 00:12:14.700 |
we return the itag for that, and then we break from this loop. 00:12:18.540 |
And in the case that we loop through all the files and no MP3 file is found, 00:12:23.840 |
which I didn't see happen once, so it probably won't happen. 00:12:30.440 |
So, if the itag is None, i.e. nothing was found, we continue. 00:12:34.700 |
So, we ignore this, and we just move on to the next file or next video. 00:12:39.600 |
Now, from here, we get the correct MP3 audio stream 00:12:45.100 |
based on the itag that we identified here, and then we download it. 00:12:53.700 |
We have the output path, which is the save path, it's just the MP3 directory. 00:12:58.700 |
And then we have a filename here, which is just the video ID plus .mp3. 00:13:03.300 |
And we go through, you see there's a couple of those RegexMatchErrors, 00:13:08.300 |
but very few, honestly, it's nothing significant. 00:13:13.200 |
After doing that, you should be able to see a new MP3 directory, 00:13:18.800 |
and it will just contain a ton of MP3 audio files. 00:13:23.700 |
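A minimal sketch of that download loop with PyTube might look like this (the directory name and the metadata column name are assumptions; `meta` is the channel metadata dataset loaded above):

```python
from pathlib import Path
from pytube import YouTube
from pytube.exceptions import RegexMatchError

save_dir = Path("./mp3")            # local directory for the audio files (assumed name)
save_dir.mkdir(exist_ok=True)

for row in meta:
    video_id = row["Video ID"]      # column name assumed; adjust to the dataset's schema
    url = f"https://youtu.be/{video_id}"
    try:
        yt = YouTube(url)
    except RegexMatchError:
        print(f"RegexMatchError for {url}, skipping")
        continue
    # find the itag of the first audio-only stream with the audio/mp4 MIME type
    itag = None
    for stream in yt.streams.filter(only_audio=True):
        if stream.mime_type == "audio/mp4":
            itag = stream.itag
            break
    if itag is None:
        continue                    # no audio stream found, move on to the next video
    stream = yt.streams.get_by_itag(itag)
    stream.download(output_path=str(save_dir), filename=f"{video_id}.mp3")
```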
Now, it does take a little bit of time to download everything. 00:13:29.400 |
But if you don't want to, it's fine, you can skip ahead. 00:13:32.400 |
We also have an already transcribed dataset available that you can just use. 00:13:46.800 |
So, we've got our MP3s, and we've now stored them locally. 00:13:51.800 |
Okay? And so, now we need to move on to Whisper. 00:13:54.700 |
So, opening up Whisper, we come to here, and this is how we install it. 00:14:03.200 |
And then, this is the install for the FFmpeg software for Ubuntu or Debian. 00:14:14.700 |
Let me show you; there are a few install instructions. 00:14:20.100 |
So, on the Whisper GitHub repo, we come down, and it's here. 00:14:26.800 |
Okay? So, we have the different install instructions. 00:14:30.100 |
So, after installing, we come down, and we go over to here, 00:14:39.700 |
so that we can move the Whisper model over to a GPU, if you have a GPU. 00:14:44.900 |
Otherwise, you can use CPU, but it will be slower. 00:14:47.000 |
And if you are doing that, it's probably best if you use the small model. 00:14:52.100 |
Now, as for the different models, there are a few options. 00:14:59.300 |
You can see here, we have tiny, base, small, medium, and large. 00:15:02.900 |
Now, you can see here the required VRAM amounts. 00:15:07.600 |
Now, 10 gigabytes for the large model is actually very good. 00:15:12.600 |
But if you are limited on time, or just the amount of RAM that you do have available, 00:15:17.800 |
you can use the other models, and they're actually fairly good as well. 00:15:21.200 |
One thing to know is, if you are using this for English, 00:15:27.800 |
you should use the English-specific models, because they tend to perform better. 00:15:32.100 |
But otherwise, you can use it for multilingual speech-to-text, 00:15:36.300 |
and they're all capable of doing that as well, without the .en. 00:15:43.300 |
Now, here we're using the large model to get the best quality results. 00:15:48.100 |
And we're saying, okay, move the Whisper large model to CUDA, if it's available. 00:15:58.300 |
But hopefully, that will be fixed relatively soon. 00:16:02.500 |
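A minimal sketch of that setup with the standard Whisper Python API (model size is your choice):

```python
import torch
import whisper  # pip install git+https://github.com/openai/whisper.git (ffmpeg also required)

device = "cuda" if torch.cuda.is_available() else "cpu"

# "large" gives the best quality; "small.en" or "base.en" are faster,
# English-only alternatives if you're limited on VRAM or time.
model = whisper.load_model("large").to(device)
```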
We use the channel metadata here to match up the video ID from the MP3 filenames 00:16:08.200 |
to the video titles, so that we can display that in the browser. 00:16:15.400 |
And all I'm doing here is creating a videos metadata dictionary. 00:16:19.800 |
And all that is, is we have a dictionary with video IDs, 00:16:23.700 |
and that maps to the title and URL, which we're actually just building here. 00:16:30.400 |
I just included that as we have all this metadata. 00:16:37.000 |
And maybe we wanted to filter based on the publication date or something like that. 00:16:44.100 |
Okay, so we have video ID that maps the title and URL. 00:16:48.500 |
And then what we want to do is get the paths to all of those MP3 files that we just downloaded. 00:16:55.500 |
You can see that we have 108 of those, and they all look like this. 00:16:58.500 |
So, we have the MP3 directory, we have the video ID, and there's an MP3 file. 00:17:02.800 |
And then all we need to do here is also just import tqdm. 00:17:12.200 |
And then we just enumerate through each of those paths. 00:17:17.600 |
So, what we want to do is, we can get the ID from that path if we needed to, like so. 00:17:25.900 |
And all we're doing here is transcribing to get that text data. 00:17:32.200 |
Okay, so given the path to the MP3 file, we just pass that to Whisper, 00:17:37.700 |
use the transcribe method, and we actually get all of these, what are called segments. 00:17:43.800 |
Now, these segments are just really short snippets of text with the start and end seconds 00:17:50.900 |
where that text or that audio was transcribed from. 00:17:55.700 |
Okay, and then what we do is, here I'm going to create this transcription JSON lines file, 00:18:01.700 |
the file that we're going to use to save everything. 00:18:04.500 |
And what I'm going to do here is basically just save everything. 00:18:10.700 |
So, each one of these snippets is pretty short. 00:18:13.300 |
So, you can actually modify this, you can increase the window to like six 00:18:16.400 |
and then the stride to like three, for example. 00:18:19.600 |
But what we're going to do is actually do that later. 00:18:23.300 |
And in this case, we'll just take out the segments directly. 00:18:27.900 |
So, we transcribe, we get our segments, get the video metadata. 00:18:34.900 |
So, this is from the video's date, which includes the title that we need and the URL. 00:18:39.800 |
This bit isn't so important here, 00:18:44.600 |
because we're not using a window and stride greater than one. 00:18:51.200 |
This is, again, for if you're using a window greater than one; we'll explain that later. 00:18:56.000 |
But we do want the start and end positions for each segment. 00:18:59.400 |
Okay, and we also want to create a new row ID because at the moment, we just have video IDs. 00:19:04.400 |
And of course, that means that there's a single video ID for a ton of segments. 00:19:09.300 |
We don't want that. We want a single unique ID for every segment. 00:19:13.800 |
So, we just create that by taking the video ID plus the actual timestamp. 00:19:26.500 |
We don't actually need to do that because we're also saving it directly to the file as we go along. 00:19:34.600 |
Okay, and then from there, we can check the length of the data set. 00:19:38.200 |
And we see that we have 27.2 thousand segments. 00:19:42.900 |
So, each segment is small; roughly five-to-seven-word sentences. 00:19:47.600 |
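Putting that together, a sketch of the transcription loop might look like this (file paths, the `videos_meta` dictionary name, and field names are assumptions; the windowing logic mentioned above is left out since window and stride are 1 here):

```python
import json
from pathlib import Path
from tqdm.auto import tqdm

paths = list(Path("./mp3").glob("*.mp3"))      # the 108 audio files downloaded earlier

with open("transcriptions.jsonl", "w", encoding="utf-8") as fp:
    for i, path in enumerate(tqdm(paths)):
        video_id = path.stem                   # filename is "<video_id>.mp3"
        result = model.transcribe(str(path))   # Whisper returns full text plus segments
        for segment in result["segments"]:
            info = videos_meta[video_id]       # {"title": ..., "url": ...} built earlier
            row = {
                # unique per-segment ID: video ID plus the start timestamp
                "id": f"{video_id}-t{segment['start']}",
                "video_id": video_id,
                "text": segment["text"].strip(),
                "start": segment["start"],
                "end": segment["end"],
                "title": info["title"],
                "url": info["url"],
            }
            fp.write(json.dumps(row) + "\n")   # save directly to file as we go along
```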
Okay, so let's take a look at where we are now. 00:19:51.200 |
So, we've just done, well, we initialized Whisper and then we created these segments here. 00:20:03.900 |
So, the next bit is encoding these with SBERT. 00:20:07.400 |
Now, again, if you're processing this, this can take a little bit of time. 00:20:14.500 |
So, on an A100 GPU, for me, I think it took around 10 hours. 00:20:23.000 |
Okay, and that is for, I don't know how many hours of video, 00:20:26.600 |
but it's 108 of my videos, which are probably on average maybe like 30 minutes long, maybe a bit longer. 00:20:34.900 |
So, I'd say at least 50 hours there, probably a fair bit more. 00:20:42.000 |
So, it's a lot faster than real-time processing, which is pretty cool, but it's still a while. 00:20:48.700 |
So, I know you probably don't want to wait 10 hours to process everything if you have an A100, or longer if you don't. 00:20:54.800 |
So, what you can do is this transcriptions data set is available on Hugging Face. 00:21:01.800 |
So, if we go to HuggingFace.co, and we have this James Callum YouTube transcriptions data set, 00:21:12.800 |
So, let's have a look at where our segments are. 00:21:24.500 |
So, they're all relatively short little segments, but we have the start and the end here. 00:21:29.600 |
And obviously, if you think about this, we can also increase the size of those segments. 00:21:34.600 |
So, we could merge, like, these five, for example, and then we just have the start at zero, and the end is 20.6. 00:21:42.200 |
We have the specific timestamp ID here, URL to that video, and the actual video title. 00:21:56.400 |
And this is over here, so we can copy this James Callum YouTube transcriptions, 00:22:01.000 |
and we can download it like we did the other data set, and I'll show you how. 00:22:04.800 |
So, we come to this Build Embeddings notebook, and like I said, here we have that data set. 00:22:10.400 |
So, you can use this, and you'll get all those transcribed video segments. 00:22:15.800 |
You don't need to actually do it yourself, but of course, you can if you want. 00:22:21.900 |
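As a quick sketch, pulling that pre-built dataset down looks something like this (dataset ID assumed from what's shown on screen):

```python
from datasets import load_dataset

# Already-transcribed segments, so you can skip the long Whisper step.
# Dataset ID assumed; check the jamescalam profile on huggingface.co for the exact name.
data = load_dataset("jamescalam/youtube-transcriptions", split="train")
print(data[0])  # {'title': ..., 'url': ..., 'id': ..., 'text': ..., 'start': ..., 'end': ...}
```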
So, we can see in there, we have the same things we saw before, start and text, title, URL. 00:22:32.000 |
So, what you saw before was "Hi, welcome to the video. 00:22:34.600 |
This is the fourth video in a Transformers", and it cuts off before the "from Scratch" mini-series part. 00:22:39.900 |
So, you can see here that if we encode this with our Sentence Transformers model, we're not going to get much out of it. 00:22:49.200 |
Okay, there's not that much meaning in this four-word sentence. 00:22:52.800 |
It's not even a full sentence, it's part of a sentence. 00:22:55.200 |
So, what we need to do is actually merge these into larger sentences, 00:22:59.700 |
and that's where the window and stride from the previous chunk of code is useful. 00:23:05.300 |
Okay, so the window is essentially every six segments, we're going to look at those as a whole. 00:23:10.600 |
But then if you consider that we're looking at these six segments at a time, 00:23:16.400 |
we still have the problem that every six segments we're going to be making a cut. 00:23:23.400 |
Those cuts could be in the middle of a sentence, 00:23:29.200 |
or in the middle of two sentences that are relevant to each other. 00:23:33.300 |
So, we'd end up losing that meaning, and we don't really want to do that. 00:23:38.400 |
So, a typical method used to avoid this in question answering is to include something called a stride. 00:23:46.000 |
So, we're going to look at every six segments, 00:23:48.800 |
but then in the next step, we're going to step across only three segments. 00:23:53.700 |
By doing this, any meaningful segments that would otherwise be cut by, 00:23:58.900 |
you know, just the fact that we're jumping over them like this, 00:24:01.900 |
would be included in the next step, okay, because we have that overlap. 00:24:06.700 |
So, we use that, and what we can do is we just iterate through our data set with this stride, 00:24:14.900 |
and we take a batch of six segments at a time. 00:24:18.800 |
Now, once we get to the end of each video, there's no sort of defining split in our data. 00:24:27.300 |
The only way we can recognize that we've gone on to another video between each segment is by looking at the title. 00:24:34.600 |
So, what we'll do is, if the title at the start of our batch is different to the title at the end of our batch, we'll just skip that batch. 00:24:44.300 |
And the reason we can just skip it rather than trying to, like, keep part of it, 00:24:48.900 |
keep the final part of the video or the very start of the video, 00:24:52.100 |
is because the final part and the very start of every video 00:24:56.600 |
usually don't contain any meaningful information. 00:25:00.000 |
It's just either me saying hello, or it's me saying goodbye, okay. 00:25:04.600 |
So, I don't think anyone's going to be asking questions about saying hello or goodbye. 00:25:09.600 |
So, we can just skip those little bits at the start and ends of videos. 00:25:14.800 |
And then what we do, so we have our six segments as a sort of chunk within the list, and we join them together into one chunk of text. 00:25:30.700 |
Now, the only thing here is we include everything from the start of the batch. 00:25:35.800 |
Okay, so the title and everything, and the ID in particular. 00:25:39.900 |
But the one thing that does switch up from this is the end position. 00:25:46.000 |
So, the end position obviously needs to come from the end of the batch, not the start. 00:25:52.000 |
Okay, so with that, we have created our new dataset. 00:25:56.100 |
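A sketch of that merging step, assuming window = 6 and stride = 3 as described, and the `data` dataset loaded above:

```python
window = 6   # number of segments merged into one chunk
stride = 3   # step size, so consecutive chunks overlap by three segments

new_data = []
for i in range(0, len(data), stride):
    batch = data[i:i + window]                   # dict of lists for this slice of segments
    if batch["title"][0] != batch["title"][-1]:
        continue                                 # batch spans two videos, skip it
    new_data.append({
        "id": batch["id"][0],                    # ID, title, URL, start come from the batch start
        "title": batch["title"][0],
        "url": batch["url"][0],
        "start": batch["start"][0],
        "end": batch["end"][-1],                 # only the end comes from the batch end
        "text": " ".join(batch["text"]),         # merge the six short segments into one passage
    })

print(len(new_data))  # roughly 9,000 merged chunks
```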
Obviously, there are fewer items in this because we are batching together all these segments. 00:26:02.600 |
So, we now have 9,000 segments, but they will be a lot more meaningful. 00:26:08.500 |
So, here we see a "Hi, welcome to the video", "Support video", "Transformers from Scratch". 00:26:15.000 |
And you can see that there's a lot more information conveyed in that paragraph than before, 00:26:19.400 |
where we just had "Hi, welcome to the video" or "From Scratch" miniseries. 00:26:23.900 |
We come a little further down and we see, okay, training, testing, tallying that. 00:26:29.400 |
And you can see there's a lot more information in here. 00:26:33.600 |
Token type IDs, let's go number zero, and so on, okay. 00:26:38.000 |
These are not particularly meaningful; you're probably not going to get any answers from each of these, 00:26:42.000 |
but there are other points in the videos, which is like a paragraph long, 00:26:45.500 |
where we will find answers and we'll see that pretty soon. 00:26:48.600 |
So, now we need to move on to actually embedding all of these chunks of segments into vector embeddings. 00:26:57.200 |
Okay, so that we can actually search through them. 00:26:59.500 |
So, to create those embeddings, we're going to use this QA model, which means question answering. 00:27:05.200 |
It's also multilingual, if you are using a multilingual corpus here. 00:27:10.300 |
And one thing to note here is that it uses dot product similarity. 00:27:14.500 |
Okay, so that's important later on, as we'll see. 00:27:17.200 |
So, we initialize the Sentence Transformer, we can see it's this MPNet model, 00:27:22.900 |
has this word embedding dimension 768, and it uses, what does it use? 00:27:28.600 |
It uses the CLS token as pooling, that's the classifier token. 00:27:34.300 |
Okay, one thing here is we get the word embedding dimension, which is just this, 00:27:40.000 |
and we need this for Pinecone, which is the vector database that we're going to be using. 00:27:45.200 |
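As a sketch, initializing that retriever with the sentence-transformers library (the exact model ID is an assumption; swap in whichever QA sentence transformer you're using):

```python
from sentence_transformers import SentenceTransformer

# A QA MPNet model trained for dot-product similarity (model ID assumed).
retriever = SentenceTransformer("multi-qa-mpnet-base-dot-v1")

dim = retriever.get_sentence_embedding_dimension()
print(dim)  # 768, needed when creating the vector index
```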
Okay, so we come down here, we need to pip install this. 00:27:55.100 |
And then we need to initialize the vector index, or vector database that we're going to be searching through. 00:28:00.600 |
Going to store all of our vectors there, all of our metadata, and then we're going to search through our queries. 00:28:06.000 |
So, this would need an API key, which is free. 00:28:09.300 |
So, you just go over here, get an API key, and put it in here. 00:28:13.600 |
And then what we're doing here is saying, if the index ID, this YouTube search, 00:28:17.800 |
which you can change, by the way, you don't have to keep YouTube search here. 00:28:21.700 |
If that doesn't already exist, I want to create a new index. 00:28:27.400 |
We're going to set dimensionality to 768, which fits with the embedding dimensions that we have. 00:28:37.200 |
We also set the metric to dot product, and remember, that's because, up here, this model is embedding within a dot product vector space. 00:28:45.800 |
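A rough sketch of that initialization with the Pinecone client (note the client API has changed across versions, so this is only indicative; the index name and key handling are assumptions):

```python
import pinecone  # pip install pinecone-client (older client API shown; newer versions differ)

pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")  # free API key from the Pinecone console

index_id = "youtube-search"   # index name is arbitrary, change it if you like
if index_id not in pinecone.list_indexes():
    pinecone.create_index(
        index_id,
        dimension=768,         # must match the sentence transformer's embedding dimension
        metric="dotproduct",   # the QA model embeds into a dot-product vector space
    )
index = pinecone.Index(index_id)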
We need to now begin building the embeddings. 00:28:50.800 |
So, what we do here is, we're going to encode everything and insert them into our vector index in batches of 64. 00:28:59.000 |
Okay, that's just to keep things in parallel. 00:29:04.600 |
We're going to go through the entire set of the new dataset in batches of 64. 00:29:12.100 |
And then here, we're just extracting the metadata that we want to include. 00:29:16.600 |
So, the things that are important here are actually all of these, I think. 00:29:21.000 |
So, text, the start and end positions of that transcribed text, the URL of that video, and also the title of that video. 00:29:32.000 |
We need all of those, they're all pretty important. 00:29:36.200 |
And then what we want to do here is also just extract the text by itself within a list, 00:29:40.700 |
because we're then going to use that to create the embedded vectors of our segments. 00:29:46.500 |
Okay, to convert those segments into vectors. 00:29:51.600 |
So, every vector or every entry within our vector index needs to have a unique ID. 00:30:00.200 |
Okay, that's why we create unique IDs rather than just using video IDs earlier on. 00:30:04.600 |
And then, what we can do is we create this list, which includes our batch IDs, our embeddings, and batch metadata. 00:30:12.300 |
And we insert or upsert into Pinecone, which is our vector database, okay? 00:30:18.800 |
And then after we've done that, we can check that everything has actually been added. 00:30:23.400 |
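As a sketch, that embed-and-upsert loop over batches of 64 might look like this, using the `new_data`, `retriever`, and `index` objects from above:

```python
from tqdm.auto import tqdm

batch_size = 64
for i in tqdm(range(0, len(new_data), batch_size)):
    batch = new_data[i:i + batch_size]
    # metadata we want returned with each search result
    metadata = [{
        "text": x["text"], "start": x["start"], "end": x["end"],
        "url": x["url"], "title": x["title"],
    } for x in batch]
    texts = [x["text"] for x in batch]            # just the text, for embedding
    ids = [x["id"] for x in batch]                # unique per-chunk IDs
    embeds = retriever.encode(texts).tolist()     # 768-dimensional vectors
    index.upsert(vectors=list(zip(ids, embeds, metadata)))

print(index.describe_index_stats())  # should report roughly 9,000 vectors
```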
So, we can see here that I actually have more than the 9,000 that we created earlier on. 00:30:29.900 |
And the reason for that is because I have been testing different sizes and re-uploading things. 00:30:37.100 |
And at some point, I made a mess and forgot to delete the old vectors. 00:30:41.800 |
So, there's actually some duplication that I need to remove. 00:30:51.400 |
And another thing I should point out is earlier on, when you do the same thing here, 00:30:56.600 |
when you've just created your index, you'd actually expect this to be zero, not already full, okay? 00:31:07.600 |
So, here, we would actually expect the 9,000. 00:31:11.400 |
And then after that, well, we've done the next step, okay? 00:31:15.600 |
So, we come back over here, and we've now created our, well, we initialized this, but we created these vectors. 00:31:28.000 |
And then we have inserted them into Pinecone over here, which is our vector database. 00:31:36.000 |
So, at that point, we're now ready for the actual querying step, okay? 00:31:41.100 |
So, where the user is going to come to a little search bar over here, and we're going to query. 00:31:45.700 |
So, first, let me show you how we do that in code very quickly, okay? 00:31:53.400 |
When we encode that query, we use the same model, the same QA sentence transform model that we use to encode all the segments. 00:32:02.200 |
And then we just convert that into a list, right? 00:32:04.600 |
And that gives us what we call a query vector, which is the xq variable here. 00:32:12.700 |
We return the top five most similar items. 00:32:16.400 |
And we also want to include the metadata, because that includes actual text, the start and end positions, and I think that's everything. 00:32:35.100 |
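A minimal sketch of that query step (the Pinecone response format varies slightly between client versions):

```python
query = "What is OpenAI's CLIP?"

# Encode the question with the same QA sentence transformer used for the segments
xq = retriever.encode(query).tolist()

# Return the five most similar chunks, with their metadata (text, start, end, url, title)
results = index.query(vector=xq, top_k=5, include_metadata=True)
for match in results["matches"]:
    info = match["metadata"]
    print(info["title"], info["start"], info["text"][:80])
```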
Okay, so we come down here, you can see these are two chunks from the same video. 00:32:40.000 |
One is just slightly, it's about 20 seconds after the other. 00:32:45.500 |
And you can see that, okay, OpenAI CLIP is a contrastive learning image pre-training model. 00:32:51.700 |
Use pairs of images and text in terms of matrix, cosine similarity between text and each image. 00:32:58.400 |
And then there's more information after that as well. 00:33:01.100 |
That's kind of cut, but when we actually visualize everything, we'll just be able to click on the video and we can actually just watch it through. 00:33:09.300 |
So, let's take a look at how that would work. 00:33:14.000 |
So, one thing we actually do need to do here is you take the start and you add that to your URL, and then you create a link. 00:33:23.400 |
So, if we take this, come down here, we do this, and then we add question mark T equals, and maybe we want this bit here. 00:33:40.200 |
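That's just string formatting, something like this:

```python
# Take the top match and build a link that opens the video at that segment's start time.
# Assumes a youtu.be-style URL with no existing query string.
top = results["matches"][0]["metadata"]
start = int(top["start"])                      # start time in seconds, e.g. 85
timestamped_url = f"{top['url']}?t={start}"
print(timestamped_url)
```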
Let's copy this and let's enter it into the browser. 00:33:52.400 |
Come down here, see what returns, and you can see I'm just reading from the text here, right? 00:34:01.800 |
So, you actually have the answer on the screen and I'm also reading through it. 00:34:10.000 |
And what we can do is actually package all of what we've just done, or the querying part of it, into a nice little sort of web interface. 00:34:22.600 |
A little search bar and we search and we get our answers. 00:34:25.500 |
Now, this is really easy using Streamlit, which is what I'm going to show you. 00:34:32.600 |
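As a very rough sketch of the Streamlit side (the real app's layout is more involved; names here are placeholders and reuse the `retriever` and `index` from earlier):

```python
import streamlit as st

st.title("YouTube Q&A")
query = st.text_input("Ask a question about the videos")

if query:
    xq = retriever.encode(query).tolist()
    results = index.query(vector=xq, top_k=5, include_metadata=True)
    for match in results["matches"]:
        info = match["metadata"]
        start = int(info["start"])
        # link the title straight to the matched timestamp in the video
        st.markdown(f"**[{info['title']}]({info['url']}?t={start})**")
        st.write(info["text"])
```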
I'm going to click on my profile and it's right at the top here. 00:34:41.200 |
YouTube Q&A, I'm going to say, what was the question? 00:34:49.900 |
Sometimes you get this, it just means you need to refresh. 00:34:52.800 |
Something I need to fix in the actual Streamlit code. 00:34:59.000 |
Now, what I've done is I've put it together so that when you have parts of the video, 00:35:05.700 |
those segments that are directly next to each other, it will do this. 00:35:11.100 |
It's like a continuous text and then you just have a little timestamp here, here. 00:35:17.300 |
And then if you click, let's say I click on here, it's going to take me to 2:09 in that video. 00:35:25.300 |
And then if I click here, it's going to take me to 2:27 in that video, right? 00:35:34.700 |
So, Intro to Dense Vectors, NLP and Vision, and so on and so on. 00:35:39.400 |
And also, because we're using Vector Search here, we can actually kind of like mess up this. 00:35:44.800 |
So, if I, maybe if I like go open, just make a mess of everything like this. 00:35:51.900 |
And we still actually get very similar results even though I completely butchered the OpenAI there. 00:36:05.500 |
So, this is a question Asherak came up with. 00:36:10.700 |
What is the best unsupervised method to train a sentence transformer? 00:36:18.100 |
There's only one unsupervised, like fully unsupervised method and it is TSDAE. 00:36:24.700 |
Okay. So, with something called pre-train produces na-na-na-na. 00:36:29.900 |
Sentence transformer using unsupervised training method called 00:36:33.400 |
Transformer-based Sequential Denoising Auto-Encoder. 00:36:36.300 |
I'm surprised it transcribed all of that correctly. 00:36:38.800 |
So, we can click on that, come to here, turn on the captions. 00:36:53.200 |
So, very similar, but what if I have little to no data? 00:36:59.200 |
So, again, training sentence transformer, little to no data. 00:37:06.300 |
We can also use something called GenQ, which generates queries for you. 00:37:10.800 |
This is particularly good for asymmetric search, e.g. question answering. 00:37:30.500 |
Okay, and this is kind of harder because I don't think I really answer this very often in my videos. 00:37:36.700 |
But there are still, you know, when I searched this, I was surprised there are actually some answers. 00:37:41.800 |
So, the first one, I don't think there's anything relevant in the first answer, but then we come down. 00:37:49.500 |
So, we have an answer here, and we also have another answer here now. 00:37:58.200 |
Convert query into vector, place within vector space, and then search for the most similar other vectors. 00:38:08.300 |
Okay, and we even get a nice little visual here as well. 00:38:12.200 |
So, we've got all of our vectors in this vector space. 00:38:19.600 |
You have a query vector here, xq, and then we're going to compare the distance between those, right? 00:38:29.200 |
So, I think, again, we've got a really good answer there, both visually from the video and also just from the audio transcribed into text. 00:38:39.300 |
Okay, that's it for this project focus demo on building a better YouTube search.