
Processing Videos for GPT-4o and Search


Chapters

0:00 Semantic Chunking
0:24 Video Chunking and gpt-4o
1:59 Video Chunking Code
3:28 Setting up the Vision Transformer
5:56 ViT vs. CLIP and other models
6:40 Video Chunking Results
8:37 Using CLIP for Vision Chunking
11:29 Final Conclusion on Video Processing

Whisper Transcript

00:00:00.000 | Today we are going to be taking a look at how we can process video more efficiently
00:00:06.060 | and accurately using what we call semantic chunkers.
00:00:09.600 | So you may have heard of semantic chunkers within the realm of text processing and in
00:00:15.720 | particular RAG, but the same concept can be applied to different modalities such as audio
00:00:22.580 | and also video.
00:00:23.760 | So why would we care about processing video or chunking video in this way?
00:00:30.320 | Well we've seen recently models like GPT-4o which can consume video.
00:00:37.480 | And the way that they can consume video is that you are essentially sending them frames
00:00:43.000 | or image frames from the video.
00:00:45.800 | And you can do this in essentially one of two ways.
00:00:49.160 | You can either send it, you know, every second you send it a frame and that will work.
00:00:55.600 | But especially for either fast-moving videos or slow-moving videos, you can either, in
00:01:03.080 | the case of fast-moving, miss a lot of stuff or in the case of slow-moving, send many frames
00:01:08.480 | that basically show the same thing, therefore increasing the time spent waiting for the
00:01:16.820 | processing to finish and also ending up spending far more money because you're just sending
00:01:22.440 | tons of frames when you don't really need to and they're all the same.
00:01:26.360 | So why keep sending the same frames?
00:01:28.620 | So by semantically chunking video, you can identify where a video actually changes, where
00:01:35.720 | the content of a video changes, and then you focus on those areas.
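
As a rough illustration of what "sending frames" looks like in practice, here is a minimal sketch (not taken from the video; the filename, the ~25 fps assumption, and the naive one-frame-per-second selection are placeholders) of attaching base64-encoded frames to a GPT-4o chat request:

```python
import base64
import cv2
from openai import OpenAI

# Load a video and grab every frame (placeholder path, assuming ~25 fps).
cap = cv2.VideoCapture("video.mp4")
frames = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frames.append(frame)
cap.release()

# The naive "one frame per second" strategy described above.
selected = frames[::25]

def to_data_url(frame) -> str:
    # Encode a BGR frame as a base64 JPEG data URL.
    ok, buf = cv2.imencode(".jpg", frame)
    return "data:image/jpeg;base64," + base64.b64encode(buf.tobytes()).decode()

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what happens in this video."},
            *[{"type": "image_url", "image_url": {"url": to_data_url(f)}}
              for f in selected],
        ],
    }],
)
print(response.choices[0].message.content)
```

Every attached frame adds tokens and latency, which is exactly what semantic chunking tries to cut down by picking frames only where the content changes.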
00:01:40.520 | So let's take a look at how we can actually do this.
00:01:44.320 | So I'm going to go to the Semantic Chunkers library.
00:01:46.680 | It's a new library, so we only have a couple of docs at the moment, but one of those happens
00:01:51.480 | to be this video chunking one.
00:01:53.480 | So I'm going to go into the video chunking notebook here and I'm just going to go ahead
00:01:57.520 | and open it in Colab.
00:01:59.060 | So we're going to work through this notebook.
00:02:00.920 | First thing that we need to do is just install the prerequisites.
00:02:05.080 | So I'm going to be using a vision model from the semantic router library and semantic chunkers,
00:02:11.080 | I'm going to include the stats here just so that we can visualize a bit of what we're
00:02:17.680 | doing.
00:02:19.040 | And then we're also going to be using the OpenCV library because we're doing image processing
00:02:23.900 | and that's a typical library that you would use.
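
For reference, the install step looks roughly like this; a sketch in which the exact package extras (vision for the encoder, matplotlib for plotting) are assumptions based on what is described here:

```python
# In a Colab/Jupyter cell; the extras are assumed from the walkthrough.
!pip install -qU \
    semantic-chunkers \
    "semantic-router[vision]" \
    opencv-python \
    matplotlib
```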
00:02:27.200 | Now because we are in Colab, we can actually change our runtime type to use a GPU.
00:02:35.480 | So maybe I'll do that quickly.
00:02:37.480 | Okay so that is run, now we come down to here.
00:02:40.720 | So we're going to download this video, I can just show you what the video is quickly.
00:02:46.880 | So it is this and when you watch this video, there's kind of like two scenes in the video.
00:02:54.640 | So there's this first scene here where the angle is like from the sky and the bunny thing
00:03:02.140 | is looking up at the butterfly and then there's this scene where it's more of a landscape
00:03:07.960 | and it's looking at the butterfly still.
00:03:11.080 | Okay so there's two bits here, it kind of switches just there.
00:03:17.080 | So that's where we want our split to be.
00:03:20.600 | So let's go ahead and try that.
00:03:23.680 | So in total we have 250 image frames from this video.
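
A minimal sketch of pulling those frames out of the downloaded file with OpenCV (the filename is a placeholder, and converting to RGB PIL images for the encoder is an assumption about what the notebook does):

```python
import cv2
from PIL import Image

frames = []
cap = cv2.VideoCapture("bunny.mp4")  # placeholder filename
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # OpenCV returns BGR arrays; convert to RGB PIL images for the encoder.
    frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
cap.release()
print(len(frames))  # e.g. 250 frames for this clip
```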
00:03:28.960 | So let's go ahead and initialize our encoder.
00:03:32.960 | So whenever we do this semantic chunking, we always end up using an encoder in some
00:03:37.840 | form or another.
00:03:39.340 | This encoder is a little bit different: it is using a Vision Transformer (ViT), which is actually
00:03:43.360 | quite an old model.
00:03:44.560 | There are definitely more recent models that you can use, but we're going to go ahead and
00:03:48.320 | use this one anyway.
00:03:50.440 | Okay so we've decided which device we're going to use here.
00:03:55.680 | If you're on Apple Silicon, you should be able to get MPS running.
00:03:59.800 | If you are on an NVIDIA GPU, you should get CUDA.
00:04:03.220 | And if you're just on CPU, you should be seeing CPU here.
00:04:08.280 | So I have a CUDA-enabled GPU, so we're using CUDA.
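
The device check plus encoder setup looks something like this; a sketch assuming the VitEncoder name and its device argument from the semantic-router library:

```python
import torch
from semantic_router.encoders import VitEncoder  # assumed import path

# Pick CUDA on NVIDIA GPUs, MPS on Apple Silicon, otherwise fall back to CPU.
device = (
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)
print(device)

encoder = VitEncoder(device=device)
```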
00:04:14.640 | And then what we can do is come down to here and we're going to be using the consecutive
00:04:18.560 | chunker.
00:04:19.560 | I'm going to set a threshold of 0.6 and you can increase or decrease this based on how
00:04:26.280 | granular you want your splits within the video to be, or how sensitive you want them to be.
00:04:32.260 | And we'll go ahead and run this.
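
A sketch of that chunking step, assuming the ConsecutiveChunker name and score_threshold parameter from the semantic-chunkers library, and that the frames are passed in as a single document:

```python
from semantic_chunkers import ConsecutiveChunker  # assumed import path

chunker = ConsecutiveChunker(encoder=encoder, score_threshold=0.6)
chunks = chunker(docs=[frames])  # one "document" made of image frames
print(len(chunks[0]))            # e.g. 2 chunks for the bunny clip
```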
00:04:34.760 | Okay so it's pretty quick, it doesn't take too long.
00:04:38.480 | And we've identified two chunks, let's have a look at what those chunks look like.
00:04:43.500 | So we're just sampling from each one of those chunks on each row in this visual here.
00:04:48.600 | So yeah, we can see here, the color mapping here is kind of messed up, but you can see
00:04:55.280 | that we have these three frames at the top from our first chunk, and these three frames
00:05:02.720 | at the bottom from our second chunk.
00:05:07.280 | So yeah, it looks pretty good.
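
To reproduce a plot like that, a rough matplotlib sketch (assuming each chunk exposes its frames through a splits attribute) can sample the first, middle, and last frame of every chunk and put one chunk per row:

```python
import matplotlib.pyplot as plt

video_chunks = chunks[0]
fig, axes = plt.subplots(
    len(video_chunks), 3, figsize=(9, 3 * len(video_chunks)), squeeze=False
)
for row, chunk in enumerate(video_chunks):
    # Sample the first, middle, and last frame of this chunk.
    for col, idx in enumerate([0, len(chunk.splits) // 2, len(chunk.splits) - 1]):
        axes[row][col].imshow(chunk.splits[idx])
        axes[row][col].axis("off")
plt.show()
```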
00:05:10.120 | We can also, like I said, we can change the threshold here if we want to increase or decrease
00:05:15.280 | the sensitivity.
00:05:16.960 | So let's try increasing it a little bit.
00:05:20.640 | So going at the extreme, we end up with a lot of chunks, so maybe let's try going a
00:05:26.000 | little bit lower.
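
Re-running with a different threshold is just a matter of rebuilding the chunker; the 0.65 value here is purely illustrative:

```python
# A higher score_threshold means more sensitive splits, i.e. more, smaller chunks.
chunker = ConsecutiveChunker(encoder=encoder, score_threshold=0.65)
chunks = chunker(docs=[frames])
print(len(chunks[0]))
```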
00:05:28.740 | So now we get three chunks, kind of curious, let's see what those are.
00:05:31.960 | Okay, so the first chunk is, you know, we have the overhead view, the butterfly is on
00:05:36.440 | the left.
00:05:37.440 | Second chunk, it's over on the right.
00:05:39.280 | And then third chunk, we have the other scene as well.
00:05:42.680 | Now we can also modify this a lot as well.
00:05:45.440 | So for example, we are using the Vision Transformer model right now.
00:05:50.720 | We can also try using different models, and maybe we'll come back to trying those soon.
00:05:56.640 | But one thing to be aware of with the Vision Transformer models is that they're trained for
00:06:02.200 | classification.
00:06:03.200 | So that doesn't always mean they are the best at identifying the actual meaning or the context
00:06:09.360 | within a video.
00:06:10.360 | They're better at these almost like broader classifications.
00:06:15.560 | So if you do want to get a little more detailed, like maybe you want to try and identify that,
00:06:21.000 | okay, now there's a ball in the video instead, you might want to try using a CLIP or a BLIP
00:06:26.160 | model or something probably a bit more recent that has been trained for similarity
00:06:33.300 | rather than classification.
00:06:35.320 | But let's continue with ViT for now, and let's try another video.
00:06:39.600 | Okay, so we have this new video, I can open it again so we can see what it is.
00:06:47.560 | Okay, so some guy doing car stuff.
00:06:54.840 | So there's a lot more complexity in this video.
00:06:57.360 | So we can go ahead and just see what we get.
00:07:01.700 | So let's try with 0.65 here, and then we just throw all those video frames in and just see
00:07:11.620 | what we get out.
00:07:13.280 | So it's a long video, it will take a bit longer to process.
00:07:16.780 | Okay, and now let's visualize that.
00:07:19.900 | Again, the color is kind of messed up here, but you can still see what's going on.
00:07:26.100 | Okay, cool.
00:07:27.380 | So I think this gives us pretty interesting results.
00:07:30.900 | So remember, each row here is a chunk.
00:07:35.220 | We have the first which is just black, there's nothing in there.
00:07:39.120 | Then it switches to the initial scene where the guy is talking in the car,
00:07:43.300 | then you have him, you know, on the back of his truck.
00:07:46.620 | Then you have the scene where he's driving his truck like this.
00:07:50.660 | Continuing, we have him back in his car again, we have this big traffic jam
00:07:57.220 | of cars.
00:07:58.820 | We have this specific car, him back in his car again.
00:08:02.180 | We have this one, and this one's kind of interesting because you can see the angle actually changes
00:08:07.920 | pretty significantly, but the topic of what is within the video, i.e. this car on this
00:08:13.380 | road is still the same.
00:08:15.660 | And yeah, we continue.
00:08:16.660 | So this one seems to work, I mean, very well even.
00:08:23.460 | It's identifying all the correct scenes within the video and yeah, I mean, generally speaking
00:08:31.940 | I think it looks pretty flawless.
00:08:34.720 | So yeah, you can see that that works pretty well.
00:08:37.660 | I want to quickly, you know, show you how you can use different models in this as
00:08:41.420 | well if you prefer.
00:08:43.980 | So let me show you very quickly how we can use a CLIP encoder, for example.
00:08:49.280 | So we go CLIP encoder, yeah.
00:08:56.160 | So we can use this to download the model.
00:08:59.380 | CLIP is a more recent model.
00:09:01.100 | It focuses more on semantic similarity rather than classification like the Vision Transformer
00:09:08.000 | model we just used.
00:09:09.800 | So in theory, it should have more nuanced understanding of what is within these videos
00:09:16.560 | and then technically because of that, it should be able to basically get us better performance.
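
Swapping encoders is a one-line change; a sketch assuming a CLIPEncoder class in semantic-router with the same device argument as before:

```python
from semantic_router.encoders import CLIPEncoder  # assumed import path

clip_encoder = CLIPEncoder(device=device)
chunker = ConsecutiveChunker(encoder=clip_encoder, score_threshold=0.6)
chunks = chunker(docs=[frames])
```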
00:09:22.000 | So I'm processing that other video again.
00:09:24.880 | Let's first try just reloading the rabbit video, run that, okay.
00:09:34.220 | So this one again identified those two chunks, and yes, we can see it's the same as before.
00:09:39.420 | So it's identifying the same stuff as what we saw with the previous model.
00:09:43.880 | Let's see if we get anything different by doing this.
00:09:47.600 | Okay, now we get 15 chunks.
00:09:51.760 | Okay.
00:09:52.760 | And we can, I mean, we can't see anything particularly great there, it's far too zoomed
00:09:59.360 | out. Let's try something a bit smaller.
00:10:00.360 | Okay.
00:10:01.360 | And I don't know if you can see, but in this final one here, it looks like we have the
00:10:07.120 | scene where we have the butterfly flying, and then all of a sudden, especially
00:10:13.600 | in this one here, the ball has dropped on the butterfly.
00:10:18.880 | So that's kind of what I was looking for, to see if we could get that. Now let's see if I can
00:10:23.440 | reduce this a little more and see if we can still get the same split.
00:10:28.340 | Maybe not.
00:10:29.340 | Okay.
00:10:30.340 | Not quite.
00:10:31.340 | But in any case, we can see that the clip model is able to at least identify that split
00:10:38.000 | between this second scene where there's just a butterfly versus when there's this ball
00:10:43.880 | falling on the butterfly.
00:10:46.080 | And then let's try again with this video, see how it performs.
00:10:51.280 | Okay.
00:10:52.280 | So I haven't tried tweaking the threshold here, so I don't know how it performs.
00:11:00.280 | So we can see we have this first scene, there's some slight differences there, but it's definitely,
00:11:07.000 | you know, probably catching too much.
00:11:10.880 | But in general, we can see that the scenes are, I think, relatively well separated again
00:11:15.900 | here, and we see a bit of a mix here as well.
00:11:21.000 | But yeah, without even trying to modify the threshold there, we're actually getting not
00:11:27.640 | perfect but a decent result.
00:11:29.920 | So yeah, that is it for this look at semantic chunking for processing video in a more intelligent way.
00:11:40.520 | As I said at the start, this is ideal for those use cases where you're needing to feed
00:11:45.160 | video frames into AI models because generally speaking, AI models that include vision are
00:11:54.720 | quite expensive and they take a long time to process.
00:11:58.200 | So just throwing everything you have at them is generally not a good idea and it can be
00:12:04.160 | expensive and it's just not efficient.
00:12:07.360 | So this is really mainly focused on being a solution to that, although I'm sure there
00:12:14.600 | are many other use cases out there as well.
00:12:17.000 | But anyway, that is it for this introduction to these video chunkers.
00:12:23.560 | I hope this has been useful and interesting, but for now I'll leave it there.
00:12:27.240 | So thank you very much for watching and I will see you again in the next one.
00:12:31.400 | [MUSIC]
00:12:41.400 | [END]