Processing Videos for GPT-4o and Search
Chapters
0:00 Semantic Chunking
0:24 Video Chunking and GPT-4o
1:59 Video Chunking Code
3:28 Setting up the Vision Transformer
5:56 ViT vs. CLIP and other models
6:40 Video Chunking Results
8:37 Using CLIP for Vision Chunking
11:29 Final Conclusion on Video Processing
Today we are going to be taking a look at how we can process video more efficiently and accurately using what we call semantic chunkers. You may have heard of semantic chunkers within the realm of text processing, and in particular RAG, but the same concept can be applied to other modalities such as audio and video. So why would we care about processing video, or chunking video, in this way?
Well, we've recently seen models like GPT-4o which can consume video, and the way they consume video is that you are essentially sending them frames. You can do this in essentially one of two ways. You can send a frame every second, and that will work. But for fast-moving or slow-moving videos especially, you can either, in the case of fast-moving video, miss a lot of content, or, in the case of slow-moving video, send many frames that basically show the same thing. That increases the time spent waiting for the processing to finish, and you also end up spending far more money, because you're sending tons of frames you don't really need when they're all the same. By semantically chunking video, you can identify where the content of a video actually changes, and then focus on those areas. So let's take a look at how we can actually do this.
So I'm going to the semantic-chunkers library. It's a new library, so we only have a couple of docs at the moment, but one of those happens to cover video chunking. I'm going to go into the video chunking notebook here, and I'm just going to go ahead and open it. We're going to work through this notebook. The first thing we need to do is install the prerequisites.
I'm going to be using a vision model from the semantic-router library along with semantic-chunkers, and I'm going to include the stats here just so that we can visualize a bit of what we're doing. We're also going to be using the OpenCV library, because we're doing image processing and that's a typical library to use for it.
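As a rough sketch, the install cell might look something like this; the exact package names and extras here are assumptions, so check the semantic-chunkers docs for the current install command:

```python
# Hypothetical install cell; the "[vision]" extra is an assumption, verify against the docs
!pip install -qU semantic-chunkers "semantic-router[vision]" opencv-python matplotlib
```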
Now, because we are in Colab, we can actually change our runtime type to use a GPU. Okay, so that has run; now we come down to here.
We're going to download this video; I can quickly show you what it is. It's this one, and when you watch it, there are essentially two scenes. There's the first scene here, where the angle is from the sky and the bunny is looking up at the butterfly, and then there's this scene, which is more of a landscape view. Okay, so there are two parts here; it switches just there. In total we have 250 image frames from this video.
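The notebook's exact download-and-split code isn't shown here, but a minimal frame-extraction sketch with OpenCV might look like this (the file name is a placeholder):

```python
import cv2
from PIL import Image

def extract_frames(video_path, sample_rate=1):
    """Read a video and return every `sample_rate`-th frame as a PIL image."""
    capture = cv2.VideoCapture(video_path)
    frames = []
    i = 0
    while True:
        ok, frame = capture.read()
        if not ok:  # end of video (or a read error)
            break
        if i % sample_rate == 0:
            # OpenCV decodes frames as BGR; convert to RGB for PIL / the encoder
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        i += 1
    capture.release()
    return frames

frames = extract_frames("bunny.mp4")  # placeholder file name
print(len(frames))  # 250 for the video used here
```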
So let's go ahead and initialize our encoder. Whenever we do this semantic chunking, we always end up using an encoder in some way. This encoder is a little bit different: it uses a Vision Transformer (ViT), which is actually a fairly old model now. There are definitely more recent models you could use, but we're going to go ahead with this one. Okay, so we've decided which device we're going to use here. If you're on Apple Silicon, you should be able to get MPS running; if you're on an NVIDIA GPU, you should get CUDA; and if you're just on CPU, you should see CPU here. I have a CUDA-enabled GPU, so we're using CUDA.
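Roughly, that device check and encoder setup might look like this; the VitEncoder class name and its device argument are assumptions based on semantic-router's vision encoders, so verify against the notebook:

```python
import torch
from semantic_router.encoders import VitEncoder  # assumed class name

# Pick the best available device: CUDA on NVIDIA, MPS on Apple Silicon, else CPU
device = (
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)
print(device)

encoder = VitEncoder(device=device)  # device kwarg is an assumption
```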
Then what we can do is come down to here, where we're going to be using the consecutive chunker. I'm going to set a threshold of 0.6, and you can increase or decrease this based on how granular you want your splits within the video to be, or how sensitive you want them to be.
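A minimal sketch of that call, assuming semantic-chunkers exposes a ConsecutiveChunker that takes the encoder and a score threshold (check the library for the exact signature):

```python
from semantic_chunkers import ConsecutiveChunker  # assumed import path

chunker = ConsecutiveChunker(encoder=encoder, score_threshold=0.6)
# Pass the frames as a single "document"; assumed to return one list of chunks per doc
chunks = chunker(docs=[frames])
print(len(chunks[0]))  # number of chunks found (two in this example)
```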
Okay, so it's pretty quick; it doesn't take too long. And we've identified two chunks. Let's have a look at what those chunks look like.
We're just sampling from each of those chunks, one chunk per row in this visual. The color mapping here is kind of messed up, but you can see that we have these three frames at the top from our first chunk, and these three frames below from our second chunk.
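For reference, a simple way to build that kind of grid with matplotlib, assuming each chunk can be treated as a plain list of PIL images (adapt the indexing to whatever the chunker actually returns):

```python
import matplotlib.pyplot as plt

def show_chunk_samples(chunks, samples_per_chunk=3):
    """Plot a few evenly spaced frames from each chunk, one chunk per row."""
    fig, axes = plt.subplots(
        len(chunks), samples_per_chunk,
        figsize=(3 * samples_per_chunk, 3 * len(chunks)),
        squeeze=False,  # keep a 2D axes array even with a single chunk
    )
    for row, chunk in enumerate(chunks):
        step = max(len(chunk) // samples_per_chunk, 1)
        for col in range(samples_per_chunk):
            frame = chunk[min(col * step, len(chunk) - 1)]
            axes[row][col].imshow(frame)
            axes[row][col].axis("off")
    plt.show()

show_chunk_samples(chunks[0])  # chunks[0] assumed to hold the chunks for our one doc
```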
We can also, like I said, change the threshold here if we want to increase or decrease the number of chunks. Going to one extreme, we end up with a lot of chunks, so maybe let's try going a little lower instead. Now we get three chunks; I'm kind of curious, so let's see what those are. Okay, so in the first chunk we have the overhead view with the butterfly, and then in the third chunk we have the other scene as well.
So, for example, we are using the Vision Transformer model right now. We can also try different models, and maybe we'll come back to those soon. But one thing to be aware of with Vision Transformer models is that they're trained for image classification. That doesn't always mean they are the best at identifying the actual meaning or context of a scene; they're better at these broader classifications. So if you do want to get a little more detailed, like maybe you want to identify that, okay, now there's a ball in the video, you might want to try using a CLIP or BLIP model, or something a bit more recent that has been trained for similarity rather than classification. But let's continue with ViT for now, and let's try another video.
Okay, so we have this new video; I can open it so we can see what it is. There's a lot more complexity in this one. Let's try with a threshold of 0.65 here, then throw all those video frames in and see what we get. It's a long video, so it will take a bit longer to process. Again, the color is kind of messed up here, but you can still see what's going on.
I think this gives us pretty interesting results. We have the first chunk, which is just black; there's nothing in there. Then it switches to the initial scene where the guy is talking in the car, then you have him on the back of his truck, then the scene where he's driving his truck like this. Continuing, we have him back in his car again, then this big traffic jam, then this specific car, then him back in his car once more. Then we have this one, which is kind of interesting because you can see the angle actually changes pretty significantly, but the topic of what is within the video, i.e. this car on this road, stays the same. So this seems to work very well: it's identifying all the correct scenes within the video, and generally speaking it's doing what we'd expect. So yeah, you can see that works pretty well.
I also want to quickly show you how you can use different models here. So let me show you how we can use a CLIP encoder, for example. It focuses more on semantic similarity, rather than classification like the Vision Transformer model. So in theory it should have a more nuanced understanding of what is within these videos, and because of that, it should be able to get us better performance.
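Swapping encoders might look something like this; again, the ClipEncoder class name and device argument are assumptions based on semantic-router's vision encoders:

```python
from semantic_router.encoders import ClipEncoder  # assumed class name

clip_encoder = ClipEncoder(device=device)  # device kwarg is an assumption
chunker = ConsecutiveChunker(encoder=clip_encoder, score_threshold=0.6)
chunks = chunker(docs=[frames])
```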
Let's try it first by just reloading the rabbit video, and run that. Okay, so this one again identified those two chunks, and we can see it's the same as before: it's identifying the same things as we saw with the previous model.
Let's see if we get anything different by adjusting the threshold. We can't see anything particularly clearly there; it's far too zoomed out. But I don't know if you can see that in this final row it looks like we have the scene where the butterfly is flying, and then all of a sudden, especially in this frame here, the ball has dropped on the butterfly. That's kind of what I was looking for; let me see if I can reduce this a little more and see if we still get the same split. But in any case, we can see that the CLIP model is able to at least identify that split between the second scene, where there's just the butterfly, versus when this ball appears.
Then let's try again with the other video and see how it performs. I haven't tried tweaking the threshold here, so I don't know how it will do. We can see we have this first scene; there are some slight differences there, but it's definitely similar. In general, the scenes are, I think, relatively well separated again. And without even trying to modify the threshold, we're actually getting not bad results at all.
So that is it for this look at semantic chunking for processing video in a more intelligent way. As I said at the start, this is ideal for those use cases where you need to feed video frames into AI models, because generally speaking, AI models that include vision are quite expensive and take a long time to process. Just throwing everything you have at them is generally not a good idea, and it can be costly. So this is mainly focused on being a solution to that, although I'm sure there are other use cases as well. But anyway, that is it for this introduction to these video chunkers. I hope it has been useful and interesting, but for now I'll leave it there. Thank you very much for watching, and I will see you again in the next one.