back to indexNSFW Image Detection with AI
Chapters
0:0 AI Image Classification
0:23 How to use Multi-Modal AI
1:47 Finding Image Detection Notebook
2:18 Shrek Dataset
4:55 Creating Multi-Modal Routes
6:36 Testing NSFW Image Detection
7:53 Final Notes on Multi-Modal AI
00:00:00.000 |
Today we're going to be taking a look at the new Vision and Image features of Semantic Router. 00:00:05.840 |
So we've added Vision transformers, thanks to Bogdan who also put together the demo that we're going to be walking through, 00:00:15.240 |
Now both of these together mean that we now have the ability to use image routes and also multimodal routes. 00:00:23.720 |
Now, why would we want to use Vision or multimodal routes? 00:00:28.120 |
Well, there are actually a lot of use cases, for example, data pre-processing. 00:00:32.000 |
We can use this to route our processing methodology in different directions, for example with PDFs. 00:00:38.360 |
We see both text and image, and based on the type of text or the types of images that we're seeing, 00:00:44.760 |
we might want to process them in different ways, so we can use it there. 00:00:48.160 |
We can also use these encoders in a slightly different way using the Semantic splitters in the Semantic Router library, 00:00:54.520 |
which I haven't spoken about yet, but I will soon, to split video automatically based on the imagery that you're seeing within the video. 00:01:02.880 |
And I think by far probably the most obvious use case here is image detection, 00:01:08.920 |
and in particular with the increase in AI-generated images and maybe just people-generated images as well, 00:01:15.080 |
you can use this for things like SFW versus NSFW image detection, which is what we're going to see here. 00:01:22.160 |
So let's start by having a look at our NSFW SFW dataset. 00:01:27.320 |
So for those of you that are not aware, SFW meaning Shrek for work, and NSFW is not Shrek for work. 00:01:35.720 |
The idea is that when you're at work, you only want to be viewing SFW pictures, i.e. pictures containing Shrek. 00:01:43.720 |
Whereas when you're not at work, you can look at any images you want. 00:01:47.360 |
So we have an example notebook for this, of course. 00:01:51.120 |
We're going to come to the Semantic Router library, docs, multimodal here, and I'm going to click open in Colab. 00:01:58.120 |
Okay, great. So we're going to first do a pip install of Semantic Router, the version that we need for this, 00:02:05.120 |
and we're specifying the vision dependencies here. 00:02:08.560 |
There are a few vision dependencies, you've got like torch vision and things in there, 00:02:13.280 |
so this can take a little bit of time to actually install everything. 00:02:18.080 |
We're also going to be using hung face datasets, that is because we're going to be downloading the dataset I just showed you, 00:02:25.520 |
the Shrek versus not Shrek dataset to use as routes here. 00:02:30.720 |
So while we are waiting for that to install, I'm going to come down to here and I'll just show you what this dataset actually looks like. 00:02:37.520 |
So we have two splits in the dataset, a training split and a test split. 00:02:43.600 |
Now to load it, we're going to use this, and then you can see that we have these images here. 00:02:47.840 |
So we have this, I've counted this as a Shrek image. 00:02:51.360 |
So what we're going to want to do is set up some routes that detect Shrek or not Shrek. 00:02:59.520 |
And we're going to be using these images within the training splits. 00:03:03.680 |
We also have, if we come down here, we also have our test split. 00:03:08.000 |
We won't use any of these to create our routes because we want to see that this does apply or does transfer, 00:03:18.000 |
And obviously we see some slightly different Shrek and not Shrek images in here. 00:03:25.040 |
So I'll skip ahead to when our install is complete. 00:03:36.680 |
Okay, run that and we should see that this will work. 00:03:45.240 |
Now what we want to do is grab all the images that are labeled with isShrek. 00:03:49.720 |
So you can see in the data, maybe I'll come here and show you. 00:04:01.200 |
So text, which is like a kind of descriptive field of what is within the image, although it's not really that descriptive. 00:04:11.480 |
So like this one is Dwayne Johnson with hair. 00:04:14.920 |
We have the image file and then we also have, okay, this isn't Shrek and we can have a look. 00:04:29.800 |
So what we want to do is grab the images that are labeled with isShrek. 00:04:33.960 |
So, for example, the third one that we have here, this is Dwayne Johnson's Shrek and it is Shrek here. 00:04:44.240 |
So we're going to go through, grab those and we're creating a list here of images that are Shrek and images that are not. 00:04:52.360 |
Okay, so we have five that are Shrek, 19 that are not. 00:04:55.800 |
Okay, so we're going to create our routes using the images. 00:04:59.320 |
So the image is actually going in place of where we'd usually put our utterances, okay. 00:05:04.360 |
So, yeah, we can create that and we're also going to create a not Shrek route as well. 00:05:10.200 |
We could, I think we could just avoid that to be honest, but it's okay. 00:05:16.920 |
It's good to be very verbose with your embeddings. 00:05:21.400 |
We're going to initialize our multimodal clip encoder here. 00:05:27.920 |
It's not a massive, this is the model size here. 00:05:30.840 |
It's like almost 600 megabytes, it's not huge, but it isn't small either. 00:05:37.720 |
And then what we want to do is initialize our route layer. 00:05:39.960 |
A route layer always, as we've seen before, requires a encoder, which is in this case our multimodal clip encoder and the routes. 00:05:49.240 |
Okay, the routes we defined before with Shrek and not Shrek. 00:05:54.320 |
And we're going to test, okay, so we're going to see don't you love politics? 00:05:58.400 |
That shouldn't really be either of, it shouldn't trigger anything, right. 00:06:03.280 |
And this is a, you know, you can see here that I'm using text to classify here, even though we use just images in our routes. 00:06:11.200 |
So that's the kind of interesting thing that we can do here. 00:06:14.160 |
Okay, so we can see that here Shrek, the text is classified as Shrek in the routes, which is cool. 00:06:21.960 |
So it's, you know, putting them into the images of Shrek bucket. 00:06:25.560 |
And then Dwayne "The Rock" Johnson, it's seen Dwayne Johnson in the images and they are tagged as not Shrek. 00:06:36.400 |
Okay, so we have everything being classified correctly there with those, you know, with that text. 00:06:42.600 |
But what we really want to be doing here as well, you know, we can do both, of course, like we've seen, 00:06:48.280 |
is we can take some images that we haven't seen before and see if we can label them correctly. 00:06:54.840 |
So we're going to be loading this other data set. 00:06:58.520 |
So the test data set, and then here there's a mix of, you know, as I said, Shrek and not Shrek. 00:07:05.040 |
So this one, we have Shrek and we will see what classification we get here, which you can actually see already. 00:07:15.640 |
We can remove the name here, it gives us the full route choice object. 00:07:24.600 |
And if we come down here, yeah, I mean, I've already run it. 00:07:29.320 |
So you can see that it is classifying as Shrek. 00:07:41.280 |
And if we come down to here, it's saying this is not Shrek. 00:07:45.360 |
Okay, so I think in the training data for not Shrek, we have some nature images. 00:07:50.400 |
So it puts nature images into that rather than none. 00:07:57.320 |
We have our multimodal route layer here, it seems to be working pretty well. 00:08:03.000 |
We can also, you know, if you want to take us further, you can go have a look at the route optimization stuff that we've talked about, 00:08:09.120 |
where you're literally training your route layer on like a training set of utterances to the routes that they should trigger. 00:08:20.720 |
And we have like an image detection or classification route layer here, which works pretty well. 00:08:30.280 |
As I mentioned at the start, there's a lot more that we can actually do with this, 00:08:33.680 |
ranging from the route layers that we've seen here to simple even video splitting or more intelligent data processing. 00:08:43.440 |
And I'm sure there's plenty of other ways that we can use this as well that I just haven't thought of yet. 00:08:50.720 |
So I'm very interested to see what people build this. 00:08:54.760 |
If you do decide to build something cool, I'd love to hear about it.