
Processing Videos for GPT-4o and Search


Chapters

0:00 Semantic Chunking
0:24 Video Chunking and gpt-4o
1:59 Video Chunking Code
3:28 Setting up the Vision Transformer
5:56 ViT vs. CLIP and other models
6:40 Video Chunking Results
8:37 Using CLIP for Vision Chunking
11:29 Final Conclusion on Video Processing

Transcript

Today we are going to be taking a look at how we can process video more efficiently and accurately using what we call semantic chunkers. So you may have heard of semantic chunkers within the realm of text processing and in particular RAG, but the same concept can be applied to different modalities such as audio and also video.

So why would we care about processing video or chunking video in this way? Well, we've recently seen models like GPT-4o which can consume video. And the way they consume video is that you are essentially sending them image frames from the video. And you can do this in essentially one of two ways.

You can send a frame every second, and that will work. But for fast-moving videos you can miss a lot, and for slow-moving videos you end up sending many frames that show basically the same thing, which increases the time spent waiting for processing to finish and means you spend far more money, because you're sending tons of near-identical frames you don't really need.

So why keep sending the same frames? By semantically chunking video, you can identify where the content of a video actually changes, and then focus on those areas. So let's take a look at how we can actually do this. I'm going to go to the semantic chunkers library.

It's a new library, so we only have a couple of docs at the moment, but one of those happens to be this video chunking example. So I'm going to go into the video chunking notebook here and open it in Colab. We're going to work through this notebook.

The first thing we need to do is install the prerequisites. I'm going to be using a vision model from the semantic router library, and for semantic chunkers I'm going to include the stats extra just so that we can visualize a bit of what we're doing. We're also going to be using the OpenCV library, because we're doing image processing and that's a typical library you would use for it.
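For reference, the install cell looks roughly like this; the exact package extras are an assumption and may differ slightly from the notebook:

```python
# Install the libraries used in this walkthrough (run inside Colab / Jupyter).
# Assumed extras: [stats] pulls in plotting dependencies for semantic-chunkers,
# [vision] pulls in the image encoders for semantic-router.
!pip install -qU \
    "semantic-chunkers[stats]" \
    "semantic-router[vision]" \
    opencv-python
```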

Now, because we are in Colab, we can change our runtime type to use a GPU, so maybe I'll do that quickly. Okay, so that has run; now we come down to here. We're going to download this video; I can just show you what the video is quickly.

So it is this, and when you watch it there are roughly two scenes. There's this first scene where the angle is from overhead and the bunny is looking up at the butterfly, and then there's this scene where it's more of a landscape view and it's still looking at the butterfly.

Okay, so there are two parts here, and it switches just there; that's where we want our split to be. So let's go ahead and try that. In total we have 250 image frames from this video. Now let's initialize our encoder. Whenever we do this kind of semantic chunking, we always end up using an encoder in some form or another.
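Before we look at the encoder itself, here is roughly what the frame extraction looks like with OpenCV; a minimal sketch assuming we keep every frame as a PIL image (the filename is just a placeholder):

```python
import cv2
from PIL import Image


def video_to_frames(path: str) -> list[Image.Image]:
    """Read a video file and return its frames as RGB PIL images."""
    capture = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = capture.read()
        if not ok:  # no more frames to read
            break
        # OpenCV returns BGR arrays; convert to RGB before wrapping in PIL
        frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    capture.release()
    return frames


frames = video_to_frames("bunny_video.mp4")  # placeholder filename
print(len(frames))  # ~250 frames for this clip
```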

This encoder is a little bit different: it uses a vision transformer, which is actually quite an old model. There are definitely more recent models you could use, but we're going to go ahead and use this one anyway. Okay, so we've decided which device we're going to use here.

If you're on Apple Silicon, you should be able to get MPS running. If you are on an NVIDIA GPU, you should get CUDA. And if you're just on CPU, you should see CPU here. I have a CUDA-enabled GPU, so we're using CUDA. Then what we can do is come down to here, where we're going to be using the consecutive chunker.
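Before moving on to the chunker, the device check and encoder setup look something like this; a sketch assuming the encoder class is called VitEncoder and accepts a device argument, as in the semantic-router docs:

```python
import torch
from semantic_router.encoders import VitEncoder

# Pick the best available device: CUDA on NVIDIA GPUs, MPS on Apple Silicon,
# otherwise fall back to CPU.
device = (
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)
print(device)

# Vision Transformer encoder used to embed each video frame (assumed signature)
encoder = VitEncoder(device=device)
```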

I'm going to set a threshold of 0.6, and you can increase or decrease this based on how granular or how sensitive you want your splits within the video to be. And we'll go ahead and run this. Okay, so it's pretty quick; it doesn't take too long.
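That chunking step looks roughly like the following; a sketch assuming the chunker accepts the list of PIL frames through the same docs= interface used for text and exposes a score_threshold parameter:

```python
from semantic_chunkers import ConsecutiveChunker

# Compare each frame's embedding to the previous one and start a new chunk
# whenever the similarity drops below the threshold (assumed parameter name).
chunker = ConsecutiveChunker(encoder=encoder, score_threshold=0.6)

# One list of chunks is returned per input "document" of frames (assumption).
chunks = chunker(docs=[frames])[0]
print(len(chunks))  # expecting 2 chunks for this clip at a 0.6 threshold
```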

And we've identified two chunks; let's have a look at what those chunks look like. We're just sampling frames from each of those chunks, one chunk per row in this visual. The color mapping here is a bit messed up, but you can see that we have these three frames at the top from our first chunk, and these three frames at the bottom from our second chunk.
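The grid itself is just a few sampled frames per chunk plotted with matplotlib; a rough sketch, assuming each chunk exposes its frames through a .splits attribute:

```python
import matplotlib.pyplot as plt


def plot_chunk_samples(chunks, samples: int = 3):
    """Show a few evenly spaced frames from each chunk, one chunk per row."""
    fig, axes = plt.subplots(
        len(chunks), samples, figsize=(3 * samples, 2 * len(chunks))
    )
    for row, chunk in enumerate(chunks):
        chunk_frames = chunk.splits  # assumption: frames live on `.splits`
        step = max(1, len(chunk_frames) // samples)
        for col in range(samples):
            ax = axes[row, col] if len(chunks) > 1 else axes[col]
            ax.imshow(chunk_frames[min(col * step, len(chunk_frames) - 1)])
            ax.axis("off")
    plt.show()


plot_chunk_samples(chunks)
```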

So yeah, it looks pretty good. Like I said, we can also change the threshold here if we want to increase or decrease the sensitivity, so let's try increasing it a little bit. At the extreme we end up with a lot of chunks, so maybe let's try going a little bit lower.

So now we get three chunks, kind of curious, let's see what those are. Okay, so the first chunk is, you know, we have the overhead view, the butterfly is on the left. Second chunk, it's over on the right. And then third chunk, we have the other scene as well.

Now, we can also modify this quite a lot. For example, we are using the Vision Transformer (ViT) model right now; we can try different models too, and maybe we'll come back to those soon. But one thing to be aware of with Vision Transformer models is that they're trained for classification.

So that doesn't always mean they are the best at identifying the actual meaning or context within a video; they're better at those broader classifications. So if you want to get a little more detailed, like maybe you want to identify that there's now a ball in the video, you might want to try using a CLIP or BLIP model, or something a bit more recent that has been trained for similarity rather than classification.

But let's continue with ViT for now, and let's try another video. Okay, so we have this new video; I can open it again so we can see what it is. Okay, so it's some guy doing car stuff, and there's a lot more complexity in this video. So we can go ahead and just see what we get.

So let's try with 0.65 here, and then we just throw all those video frames in and see what we get out. It's a long video, so it will take a bit longer to process. Okay, and now let's visualize that. Again, the color is kind of messed up here, but you can still see what's going on.

Okay, cool. I think this gives us pretty interesting results. Remember, each row here is a chunk. We have the first chunk, which is just black, nothing in there. Then it switches to the initial scene where the guy is talking in the car, then you have him on the back of his truck.

Then you have the scene where he's driving his truck like this. Continuing, we have him back in his car again, we have this big traffic jam of cars, we have this specific car, and him back in his car again. Then we have this one, which is kind of interesting because you can see the angle actually changes pretty significantly, but the topic of what is within the video, i.e.

this car on this road, is still the same. And yeah, we continue. So this one seems to work very well, even. It's identifying all the correct scenes within the video and, generally speaking, I think it looks pretty flawless. So you can see that that works pretty well.

I want to quickly show you how you can use different models here as well if you prefer. So let me show you very quickly how we can use a CLIP encoder, for example. We import the CLIP encoder, and we can use this to download the model.

CLIP is a more recent model. It focuses on semantic similarity rather than classification like the Vision Transformer model we just used. So in theory, it should have a more nuanced understanding of what is within these videos, and because of that it should be able to get us better performance.
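Swapping the encoder is a small change; a sketch assuming semantic-router exposes a CLIPEncoder class that accepts the same device argument:

```python
from semantic_router.encoders import CLIPEncoder
from semantic_chunkers import ConsecutiveChunker

# Same chunking pipeline as before, just a different embedding model.
clip_encoder = CLIPEncoder(device=device)  # assumed class name and signature
chunker = ConsecutiveChunker(encoder=clip_encoder, score_threshold=0.6)
chunks = chunker(docs=[frames])[0]
print(len(chunks))
```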

So I'll process that other video again in a moment, but let's first just reload the rabbit video and run that. Okay, so this one again identified those two chunks, and we can see it's the same as before; it's identifying the same splits as we saw with the previous model.

Let's see if we get anything different by doing this. Okay, now we get 15 chunks. We can't really see anything useful there, it's far too zoomed out, so let's try something a bit smaller. Okay. And I don't know if you can see it, but in this final one here, it looks like we have the scene where the butterfly is flying, and then all of a sudden, especially in this one here, the ball has dropped on the butterfly.

So that's kind of what I was looking for, to see if we could get that. Let's see if I can reduce this a little more and still get the same split. Maybe not. Okay, not quite. But in any case, we can see that the CLIP model is able to at least identify that split between the second scene, where there's just a butterfly, and when the ball falls on the butterfly.

And then let's try again with this video and see how it performs. Okay. So I haven't tried tweaking the threshold here, so I don't know how it will do. We can see we have this first scene; there are some slight differences there, but it's definitely, you know, probably catching too much.

But in general, we can see that the scenes are, I think, relatively well separated again here, and we see a bit of a mix here as well. But yeah, without even trying to modify the threshold, we're getting a decent, if not perfect, result. So that is it for this look at semantic chunking for processing video in a more intelligent way.

As I said at the start, this is ideal for those use cases where you need to feed video frames into AI models, because generally speaking, AI models that include vision are quite expensive and take a long time to process. So just throwing everything you have at them is generally not a good idea; it's expensive and it's just not efficient.
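As a concrete example of that use case, once you have chunks you can pick one representative frame per chunk and send only those to GPT-4o. This is a sketch using the OpenAI Python SDK, where the middle-frame selection, the .splits attribute, and the prompt are all just illustrative assumptions:

```python
import base64
from io import BytesIO

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def frame_to_data_url(frame) -> str:
    """Encode a PIL frame as a base64 data URL for the vision API."""
    buffer = BytesIO()
    frame.save(buffer, format="JPEG")
    return "data:image/jpeg;base64," + base64.b64encode(buffer.getvalue()).decode()


# One representative (middle) frame per chunk, instead of every frame.
representatives = [chunk.splits[len(chunk.splits) // 2] for chunk in chunks]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what happens across these scenes."},
            *[
                {"type": "image_url", "image_url": {"url": frame_to_data_url(f)}}
                for f in representatives
            ],
        ],
    }],
)
print(response.choices[0].message.content)
```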

So this is really mainly focused on being a solution to that, although I'm sure there are many other use cases out there as well. But anyway, that is it for this introduction to these video chunkers. I hope this has been useful and interesting, but for now I'll leave it there.

So thank you very much for watching and I will see you again in the next one. Bye.