
NSFW Image Detection with AI


Chapters

0:00 AI Image Classification
0:23 How to use Multi-Modal AI
1:47 Finding Image Detection Notebook
2:18 Shrek Dataset
4:55 Creating Multi-Modal Routes
6:36 Testing NSFW Image Detection
7:53 Final Notes on Multi-Modal AI

Transcript

Today we're going to be taking a look at the new vision and image features of Semantic Router. We've added vision transformers, thanks to Bogdan, who also put together the demo that we're going to be walking through, and CLIP, which is a multimodal model. Together, these mean that we now have the ability to use image routes and also multimodal routes.

Now, why would we want to use vision or multimodal routes? Well, there are actually a lot of use cases, for example data pre-processing. We can use this to route our processing in different directions, for example with PDFs: they contain both text and images, and based on the type of text or the types of images that we're seeing, we might want to process them in different ways, so we can use it there.

We can also use these encoders in a slightly different way with the semantic splitters in the Semantic Router library, which I haven't spoken about yet but will soon, to split video automatically based on the imagery within it. And I think by far the most obvious use case here is image detection, in particular with the increase in AI-generated images, and maybe just people-generated images as well. You can use this for things like SFW versus NSFW image detection, which is what we're going to see here.

So let's start by having a look at our SFW/NSFW dataset. For those of you that are not aware, SFW means Shrek for work, and NSFW is not Shrek for work. The idea is that when you're at work, you only want to be viewing SFW pictures, i.e. pictures containing Shrek.

Whereas when you're not at work, you can look at any images you want. So we have an example notebook for this, of course. We're going to come to the Semantic Router library docs, under multimodal, and I'm going to click Open in Colab. Okay, great. So we're going to first do a pip install of Semantic Router, the version that we need for this, and we're specifying the vision dependencies here.

There are a few vision dependencies, you've got things like torchvision in there, so this can take a little bit of time to actually install everything. We're also going to be using Hugging Face Datasets, because we're going to be downloading the dataset I just showed you, the Shrek versus not Shrek dataset, to use as routes here.

So while we are waiting for that to install, I'm going to come down to here and I'll just show you what this dataset actually looks like. So we have two splits in the dataset, a training split and a test split. Now to load it, we're going to use this, and then you can see that we have these images here.

So we have this one, which counts as a Shrek image. What we're going to want to do is set up some routes that detect Shrek or not Shrek, and we're going to be using these images within the training split. We also have, if we come down here, our test split.

We won't use any of these to create our routes, because we want to see that this transfers, i.e. generalizes, to our test data as well. And obviously we see some slightly different Shrek and not Shrek images in here. So I'll skip ahead to when our install is complete.

Okay, so it's installed. You will see this little warning here; it's not a big deal, it does work. Okay, run that and we should see that this works; you should see the Shrek Rock image pop up. Okay, cool, looks good. Now what we want to do is grab all the images that are labeled with isShrek.

So you can see in the data, maybe I'll come here and show you. Let's look at the data. We have three fields: text, which is a kind of descriptive field of what is within the image, although it's not really that descriptive, like this one is Dwayne Johnson with hair.

We have the image field, and then we also have the isShrek flag. Okay, this one isn't Shrek, and we can have a look. All right, so we have the image; let's take a look. Okay, not Shrek. This one is Shrek. So what we want to do is grab the images that are labeled with isShrek.

So, for example, the third one that we have here, this is Dwayne Johnson as Shrek, and it is Shrek. So we're going to go through and grab those, and we're creating a list here of images that are Shrek and images that are not. Okay, so we have five that are Shrek and 19 that are not.
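The filtering step above can be sketched with stand-in rows; the snake_case field name `is_shrek` and the sample values are illustrative assumptions, and in the real notebook you'd keep the `image` field rather than `text`.

```python
# Stand-in rows mimicking the dataset's fields; real rows also carry
# a PIL image under an "image" key.
rows = [
    {"text": "Shrek in a swamp", "is_shrek": True},
    {"text": "Dwayne Johnson with hair", "is_shrek": False},
    {"text": "a coral reef", "is_shrek": False},
]

# Bucket the rows into Shrek / not-Shrek lists, as in the walkthrough
# (where this yields 5 Shrek images and 19 not-Shrek images).
shrek = [r["text"] for r in rows if r["is_shrek"]]
not_shrek = [r["text"] for r in rows if not r["is_shrek"]]

print(len(shrek), len(not_shrek))  # 1 2
```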

Okay, so we're going to create our routes using the images. The images actually go in place of where we'd usually put our utterances. So we can create that, and we're also going to create a not Shrek route as well. We could, I think, just avoid that to be honest, but it's okay.

We could do either, really; it's good to be very verbose with your embeddings. We're going to initialize our multimodal CLIP encoder here. Okay, it'll take a moment to download. This is the model size here, almost 600 megabytes; it's not huge, but it isn't small either.

And then what we want to do is initialize our route layer. A route layer, as we've seen before, always requires an encoder, which in this case is our multimodal CLIP encoder, and the routes, which we defined before with Shrek and not Shrek. And we're going to test it, okay, so we're going to see: don't you love politics?

That shouldn't really trigger either route, right. And you can see here that I'm using text to classify, even though we used just images in our routes. So that's the kind of interesting thing that we can do here. Okay, so we can see that the text "Shrek" is classified as Shrek in the routes, which is cool.

So it's, you know, putting it into the images-of-Shrek bucket. And then for Dwayne "The Rock" Johnson, it's seen Dwayne Johnson in the images that are tagged as not Shrek, so it's giving us the not Shrek route. Okay, so we have everything being classified correctly there with that text.

But what we really want to be doing here as well, since we can do both, of course, like we've seen, is take some images that we haven't seen before and see if we can label them correctly. So we're going to be loading this other dataset.

So the test data set, and then here there's a mix of, you know, as I said, Shrek and not Shrek. So this one, we have Shrek and we will see what classification we get here, which you can actually see already. So I run this and we can see Shrek.

If we remove the name here, it gives us the full route choice object. Okay, so name: Shrek. We have another image here, again of Shrek, and if we come down here, yeah, I've already run it, so you can see that it is classifying as Shrek. And then we have our not Shrek picture here.

Where did that go? Okay, so we have this nice coral reef, and if we come down to here, it's saying this is not Shrek. Okay, so I think in the training data for not Shrek we have some nature images, so it puts nature images into that route rather than none.

So, yeah, that is it. We have our multimodal route layer here, and it seems to be working pretty well. If you want to take this further, you can go have a look at the route optimization stuff that we've talked about, where you're literally training your route layer on a training set of utterances mapped to the routes that they should trigger.

With that, you can get pretty good results. And we have an image detection, or classification, route layer here, which works pretty well. So that is it for this video. As I mentioned at the start, there's a lot more that we can actually do with this, ranging from the route layers that we've seen here to video splitting or more intelligent data processing.

And I'm sure there's plenty of other ways that we can use this as well that I just haven't thought of yet. So I'm very interested to see what people build with this. If you do decide to build something cool, I'd love to hear about it. But for now, I'm going to leave it there.

So thank you very much for watching. I hope this has been useful and interesting. And I will see you again in the next one. Bye.