
Processing Videos for GPT-4o and Search


Chapters

0:00 Semantic Chunking
0:24 Video Chunking and gpt-4o
1:59 Video Chunking Code
3:28 Setting up the Vision Transformer
5:56 ViT vs. CLIP and other models
6:40 Video Chunking Results
8:37 Using CLIP for Vision Chunking
11:29 Final Conclusion on Video Processing

Whisper Transcript

00:00:00.000 | Today we are going to be taking a look at how we can process video more efficiently
00:00:06.060 | and accurately using what we call semantic chunkers.
00:00:09.600 | So you may have heard of semantic chunkers within the realm of text processing and in
00:00:15.720 | particular RAG, but the same concept can be applied to different modalities such as audio
00:00:22.580 | and also video.
00:00:23.760 | So why would we care about processing video or chunking video in this way?
00:00:30.320 | Well we've seen recently models like GPT-4o which can consume video.
00:00:37.480 | And the way that they can consume video is that you are essentially sending them frames
00:00:43.000 | or image frames from the video.
00:00:45.800 | And you can do this in essentially one of two ways.
00:00:49.160 | You can either send it, you know, every second you send it a frame and that will work.
00:00:55.600 | But especially for either fast-moving videos or slow-moving videos, you can either, in
00:01:03.080 | the case of fast-moving, miss a lot of stuff or in the case of slow-moving, send many frames
00:01:08.480 | that basically show the same thing, therefore increasing the time spent waiting for the
00:01:16.820 | processing to finish and also ending up spending far more money because you're just sending
00:01:22.440 | tons of frames when you don't really need to and they're all the same.
00:01:26.360 | So why keep sending the same frames?
00:01:28.620 | So by semantically chunking video, you can identify where a video actually changes, where
00:01:35.720 | the content of a video changes, and then you focus on those areas.
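
As a rough illustration of what "sending frames" looks like in practice, here is a minimal sketch (not taken from the video; the filename, the ~25 fps assumption, and the naive one-frame-per-second selection are placeholders) of attaching base64-encoded frames to a GPT-4o chat request:

```python
import base64
import cv2
from openai import OpenAI

# Load a video and grab every frame (placeholder path, assuming ~25 fps).
cap = cv2.VideoCapture("video.mp4")
frames = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frames.append(frame)
cap.release()

# The naive "one frame per second" strategy described above.
selected = frames[::25]

def to_data_url(frame) -> str:
    # Encode a BGR frame as a base64 JPEG data URL.
    ok, buf = cv2.imencode(".jpg", frame)
    return "data:image/jpeg;base64," + base64.b64encode(buf.tobytes()).decode()

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what happens in this video."},
            *[{"type": "image_url", "image_url": {"url": to_data_url(f)}}
              for f in selected],
        ],
    }],
)
print(response.choices[0].message.content)
```

Every attached frame adds tokens and latency, which is exactly what semantic chunking tries to cut down by picking frames only where the content changes.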
00:01:40.520 | So let's take a look at how we can actually do this.
00:01:44.320 | So I'm going to go to the Semantic Chunkers library.
00:01:46.680 | It's a new library, so we only have a couple of docs at the moment, but one of those happens
00:01:51.480 | to be this video chunking one.
00:01:53.480 | So I'm going to go into the video chunking notebook here and I'm just going to go ahead
00:01:57.520 | and open it in Colab.
00:01:59.060 | So we're going to work through this notebook.
00:02:00.920 | First thing that we need to do is just install the prerequisites.
00:02:05.080 | So I'm going to be using a vision model from the semantic router library and semantic chunkers,
00:02:11.080 | I'm going to include the stats here just so that we can visualize a bit of what we're
00:02:17.680 | doing.
00:02:19.040 | And then we're also going to be using the OpenCV library because we're doing image processing
00:02:23.900 | and that's a typical library that you would use.
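
For reference, the install step looks roughly like this; a sketch in which the exact package extras (vision for the encoder, matplotlib for plotting) are assumptions based on what is described here:

```python
# In a Colab/Jupyter cell; the extras are assumed from the walkthrough.
!pip install -qU \
    semantic-chunkers \
    "semantic-router[vision]" \
    opencv-python \
    matplotlib
```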
00:02:27.200 | Now because we are in Colab, we can actually change our runtime type to use a GPU.
00:02:35.480 | So maybe I'll do that quickly.
00:02:37.480 | Okay so that is run, now we come down to here.
00:02:40.720 | So we're going to download this video, I can just show you what the video is quickly.
00:02:46.880 | So it is this and when you watch this video, there's kind of like two scenes in the video.
00:02:54.640 | So there's this first scene here where the angle is like from the sky and the bunny thing
00:03:02.140 | is looking up at the butterfly and then there's this scene where it's more of a landscape
00:03:07.960 | and it's looking at the butterfly still.
00:03:11.080 | Okay so there's two bits here, it kind of switches just there.
00:03:17.080 | So that's where we want our split to be.
00:03:20.600 | So let's go ahead and try that.
00:03:23.680 | So in total we have 250 image frames from this video.
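
A minimal sketch of pulling those frames out of the downloaded file with OpenCV (the filename is a placeholder, and converting to RGB PIL images for the encoder is an assumption about what the notebook does):

```python
import cv2
from PIL import Image

frames = []
cap = cv2.VideoCapture("bunny.mp4")  # placeholder filename
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # OpenCV returns BGR arrays; convert to RGB PIL images for the encoder.
    frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
cap.release()
print(len(frames))  # e.g. 250 frames for this clip
```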
00:03:28.960 | So let's go ahead and initialize our encoder.
00:03:32.960 | So whenever we do this semantic chunking, we always end up using an encoder in some
00:03:37.840 | form or another.
00:03:39.340 | This encoder is a little bit different: it is using a Vision Transformer (ViT), which is actually
00:03:43.360 | quite an old model.
00:03:44.560 | There are definitely more recent models that you can use, but we're going to go ahead and
00:03:48.320 | use this one anyway.
00:03:50.440 | Okay so we've decided which device we're going to use here.
00:03:55.680 | If you're on Apple Silicon, you should be able to get MPS running.
00:03:59.800 | If you are on an NVIDIA GPU, you should get CUDA.
00:04:03.220 | And if you're just on CPU, you should be seeing CPU here.
00:04:08.280 | So I have a CUDA-enabled GPU, so we're using CUDA.
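
The device check plus encoder setup looks something like this; a sketch assuming the VitEncoder name and its device argument from the semantic-router library:

```python
import torch
from semantic_router.encoders import VitEncoder  # assumed import path

# Pick CUDA on NVIDIA GPUs, MPS on Apple Silicon, otherwise fall back to CPU.
device = (
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)
print(device)

encoder = VitEncoder(device=device)
```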
00:04:14.640 | And then what we can do is come down to here and we're going to be using the consecutive
00:04:18.560 | chunker.
00:04:19.560 | I'm going to set a threshold of 0.6 and you can increase or decrease this based on how
00:04:26.280 | granular you want your splits within the video to be, or how sensitive you want them to be.
00:04:32.260 | And we'll go ahead and run this.
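
A sketch of that chunking step, assuming the ConsecutiveChunker name and score_threshold parameter from the semantic-chunkers library, and that the frames are passed in as a single document:

```python
from semantic_chunkers import ConsecutiveChunker  # assumed import path

chunker = ConsecutiveChunker(encoder=encoder, score_threshold=0.6)
chunks = chunker(docs=[frames])  # one "document" made of image frames
print(len(chunks[0]))            # e.g. 2 chunks for the bunny clip
```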
00:04:34.760 | Okay so it's pretty quick, it doesn't take too long.
00:04:38.480 | And we've identified two chunks, let's have a look at what those chunks look like.
00:04:43.500 | So we're just sampling from each one of those chunks on each row in this visual here.
00:04:48.600 | So yeah, we can see here, the color mapping here is kind of messed up, but you can see
00:04:55.280 | that we have these three frames at the top from our first chunk, and these three frames
00:05:02.720 | at the bottom from our second chunk.
00:05:07.280 | So yeah, it looks pretty good.
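
To reproduce a plot like that, a rough matplotlib sketch (assuming each chunk exposes its frames through a splits attribute) can sample the first, middle, and last frame of every chunk and put one chunk per row:

```python
import matplotlib.pyplot as plt

video_chunks = chunks[0]
fig, axes = plt.subplots(
    len(video_chunks), 3, figsize=(9, 3 * len(video_chunks)), squeeze=False
)
for row, chunk in enumerate(video_chunks):
    # Sample the first, middle, and last frame of this chunk.
    for col, idx in enumerate([0, len(chunk.splits) // 2, len(chunk.splits) - 1]):
        axes[row][col].imshow(chunk.splits[idx])
        axes[row][col].axis("off")
plt.show()
```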
00:05:10.120 | We can also, like I said, we can change the threshold here if we want to increase or decrease
00:05:15.280 | the sensitivity.
00:05:16.960 | So let's try increasing it a little bit.
00:05:20.640 | So going at the extreme, we end up with a lot of chunks, so maybe let's try going a
00:05:26.000 | little bit lower.
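
Re-running with a different threshold is just a matter of rebuilding the chunker; the 0.65 value here is purely illustrative:

```python
# A higher score_threshold means more sensitive splits, i.e. more, smaller chunks.
chunker = ConsecutiveChunker(encoder=encoder, score_threshold=0.65)
chunks = chunker(docs=[frames])
print(len(chunks[0]))
```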
00:05:28.740 | So now we get three chunks, kind of curious, let's see what those are.
00:05:31.960 | Okay, so the first chunk is, you know, we have the overhead view, the butterfly is on
00:05:36.440 | the left.
00:05:37.440 | Second chunk, it's over on the right.
00:05:39.280 | And then third chunk, we have the other scene as well.
00:05:42.680 | Now we can also modify this a lot as well.
00:05:45.440 | So for example, we are using the Vision Transformer model right now.
00:05:50.720 | We can also try using different models, and maybe we'll come back to trying those soon.
00:05:56.640 | But one thing to be aware of with the Vision Transformer models is that they're trained for
00:06:02.200 | classification.
00:06:03.200 | So that doesn't always mean they are the best at identifying the actual meaning or the context
00:06:09.360 | within a video.
00:06:10.360 | They're better at these almost like broader classifications.
00:06:15.560 | So if you do want to get a little more detailed, like maybe you want to try and identify that,
00:06:21.000 | okay, now there's a ball in the video instead, you might want to try using a CLIP or a BLIP
00:06:26.160 | model or something probably a bit more recent that has been trained for similarity
00:06:33.300 | rather than classification.
00:06:35.320 | But let's continue with ViT for now, and let's try another video.
00:06:39.600 | Okay, so we have this new video, I can open it again so we can see what it is.
00:06:47.560 | Okay, so some guy doing car stuff.
00:06:54.840 | So there's a lot more complexity in this video.
00:06:57.360 | So we can go ahead and just see what we get.
00:07:01.700 | So let's try with 0.65 here, and then we just throw all those video frames in and just see
00:07:11.620 | what we get out.
00:07:13.280 | So it's a long video, it will take a bit longer to process.
00:07:16.780 | Okay, and now let's visualize that.
00:07:19.900 | Again, the color is kind of messed up here, but you can still see what's going on.
00:07:26.100 | Okay, cool.
00:07:27.380 | So I think this gives us pretty interesting results.
00:07:30.900 | So remember, each row here is a chunk.
00:07:35.220 | We have the first which is just black, there's nothing in there.
00:07:39.120 | Then it switches to the initial scene where the guy is talking in the car,
00:07:43.300 | then you have him, you know, on the back of his truck.
00:07:46.620 | Then you have the scene where he's driving his truck like this.
00:07:50.660 | Continuing, we have him back in his car again, we have this big traffic jam
00:07:57.220 | of cars.
00:07:58.820 | We have this specific car, him back in his car again.
00:08:02.180 | We have this one, and this one's kind of interesting because you can see the angle actually changes
00:08:07.920 | pretty significantly, but the topic of what is within the video, i.e. this car on this
00:08:13.380 | road is still the same.
00:08:15.660 | And yeah, we continue.
00:08:16.660 | So this one seems to work, I mean, very well even.
00:08:23.460 | It's identifying all the correct scenes within the video and yeah, I mean, generally speaking
00:08:31.940 | I think it looks pretty flawless.
00:08:34.720 | So yeah, you can see that that works pretty well.
00:08:37.660 | I want to quickly, you know, show you how you can use different models in this as
00:08:41.420 | well if you prefer.
00:08:43.980 | So let me show you very quickly how we can use a CLIP encoder, for example.
00:08:49.280 | So we go CLIP encoder, yeah.
00:08:56.160 | So we can use this to download the model.
00:08:59.380 | CLIP is a more recent model.
00:09:01.100 | It focuses more on semantic similarity rather than classification like the Vision Transformer
00:09:08.000 | model we just used.
00:09:09.800 | So in theory, it should have more nuanced understanding of what is within these videos
00:09:16.560 | and then technically because of that, it should be able to basically get us better performance.
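
Swapping encoders is a one-line change; a sketch assuming a CLIPEncoder class in semantic-router with the same device argument as before:

```python
from semantic_router.encoders import CLIPEncoder  # assumed import path

clip_encoder = CLIPEncoder(device=device)
chunker = ConsecutiveChunker(encoder=clip_encoder, score_threshold=0.6)
chunks = chunker(docs=[frames])
```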
00:09:22.000 | So I'm processing that other video again.
00:09:24.880 | Let's first try just reloading the rabbit video, run that, okay.
00:09:34.220 | So this one again identified those two chunks, and yes, we can see it's the same as before.
00:09:39.420 | So it's identifying the same stuff as what we saw with the previous model.
00:09:43.880 | Let's see if we get anything different by doing this.
00:09:47.600 | Okay, now we get 15 chunks.
00:09:51.760 | Okay.
00:09:52.760 | And we can, I mean, we can't see anything particularly great there, it's far too zoomed
00:09:59.360 | out. Let's try something a bit smaller.
00:10:00.360 | Okay.
00:10:01.360 | And I don't know if you can see, but in this final one here, it looks like we have the
00:10:07.120 | scene where we have the butterfly flying, and then all of a sudden, especially
00:10:13.600 | in this one here, the ball has dropped on the butterfly.
00:10:18.880 | So that's kind of what I was looking for, to see if we could get that. Now let's see if I can
00:10:23.440 | reduce this a little more and see if we can still get the same split.
00:10:28.340 | Maybe not.
00:10:29.340 | Okay.
00:10:30.340 | Not quite.
00:10:31.340 | But in any case, we can see that the clip model is able to at least identify that split
00:10:38.000 | between this second scene where there's just a butterfly versus when there's this ball
00:10:43.880 | falling on the butterfly.
00:10:46.080 | And then let's try again with this video, see how it performs.
00:10:51.280 | Okay.
00:10:52.280 | So I haven't tried tweaking the threshold here, so I don't know how it performs.
00:11:00.280 | So we can see we have this first scene, there's some slight differences there, but it's definitely,
00:11:07.000 | you know, probably catching too much.
00:11:10.880 | But in general, we can see that the scenes are, I think, relatively well separated again
00:11:15.900 | here, and we see a bit of a mix here as well.
00:11:21.000 | But yeah, without even trying to modify the threshold there, we're actually getting not
00:11:27.640 | perfect but a decent result.
00:11:29.920 | So yeah, that is it for this look at semantic chunking for processing video in a more intelligent way.
00:11:40.520 | As I said at the start, this is ideal for those use cases where you're needing to feed
00:11:45.160 | video frames into AI models because generally speaking, AI models that include vision are
00:11:54.720 | quite expensive and they take a long time to process.
00:11:58.200 | So just throwing everything you have at them is generally not a good idea and it can be
00:12:04.160 | expensive and it's just not efficient.
00:12:07.360 | So this is really mainly focused on being a solution to that, although I'm sure there
00:12:14.600 | are many other use cases out there as well.
00:12:17.000 | But anyway, that is it for this introduction to these video chunkers.
00:12:23.560 | I hope this has been useful and interesting, but for now I'll leave it there.
00:12:27.240 | So thank you very much for watching and I will see you again in the next one.
00:12:31.400 | [MUSIC]
00:12:41.400 | [END]