OpenAI CLIP Explained | Multi-modal ML
Chapters
0:00
9:39 Architecture of CLIP
12:12 Contrastive Training
19:24 Padding Tokens
20:10 Input IDs and an Attention Mask
24:26 Encoding Images
27:07 Cosine Similarity
31:22 Object Detection
Today we're talking about what is quite possibly the future of machine learning in both computer vision and NLP: the combination of both into a single multi-modal model. That model is CLIP, which was built and trained by OpenAI. CLIP is open source and has been around for a little while now, since the start of 2021, but in the past few months we've seen its adoption grow at a pretty insane rate. It has found uses in a load of things, so what we will be focusing on is text, or multi-modal, search across text and images. We're also going to talk a little bit about zero-shot classification and zero-shot object detection, and at some point in the future we'll also cover the use of models like CLIP, or even CLIP itself, in the diffusion models that have become incredibly popular recently, like DALL-E 2, Imagen and Midjourney.

To understand why CLIP is so important, we can use a paper released in 2020 called Experience Grounds Language. In it, the authors define five different world scopes: corpus, internet, perception, embodiment and social.
Most of the models we're aware of in NLP (this is a language-focused paper) sit in the first two world scopes; until very recently these were the state of the art, so you have BERT, T5, GPT and so on. Perception is where we start to see not just NLP but also computer vision, and that is where we are now with CLIP. Quite interestingly, the next world scope, embodiment, is where you start to get reinforcement learning, and as you continue you almost see a culmination of all the different disciplines of machine learning coming together into one larger discipline. The distinction between these world scopes is mostly about the data used to train the models. In world scope one you have the first models we saw, things like Word2Vec, which is probably one of the earliest examples of deep learning in NLP. That consisted of training a neural network on an amount of data that was small compared to future datasets: a single corpus, for example 100,000 sentences from Wikipedia.
Going forwards, we have the internet-sized corpora. These are built from huge web scrapes across many different sources, and the models trained on them were able to pull in a more general understanding of language from purely text data. So it's a lot of data, but it's still only text. The next one, world scope three, which we're focusing on, is not just that: in our example it's text and image data. We're training a model to understand both of these modalities of information, and this is almost like AI moving from a purely digital, very abstract space to a more realistic, real-world space, because in the real world we don't just rely on text data; we rely on a huge number of sensory inputs: visual, audio, touch and everything else. So we're moving towards that broader scope of inputs from different modalities, where a modality is something like text, images or audio, and so on. For us in the real world, that chaotic ensemble of sensory inputs is what creates, or trains, our internal model of the outside world, so it makes sense that this is the direction machine learning and AI may also go in.
To achieve this multi-modality, CLIP actually uses two models that are trained to almost speak the same language. One of them is a text encoder, the other an image encoder, and both create a vector representation of whatever they are given. The text encoder might get the sentence "two dogs running across a frosty field", and we might also have an image of two dogs running across a frosty field. CLIP is trained so that when the text encoder consumes that sentence, it outputs a vector representation that is very closely aligned with what the image encoder outputs for an image of the same concept. By training both models to encode into a similar vector space, we are teaching them to speak the same vector language. It's very abstract: this vector language is a 512-dimensional space, so it's very difficult for us to directly understand what is actually happening in there. But the two models do output patterns that are logical and make sense, and we can see some of this by comparing the similarity between the vectors they output. The two vectors for dogs running across a frosty field, the text vector and the image vector, sit within a very similar part of the vector space, whereas something else like "elephants in the Serengeti", whether text or image, is not near our two dogs; it's somewhere else entirely, in a completely different region. What we can do with that is calculate the similarity between these vectors and identify which ones are similar or dissimilar according to CLIP.
From these meaningful vectors that CLIP outputs, we are able to build a content-based image retrieval system. Content-based image retrieval is where, using some text, or maybe even another image, we search for images based on their content, and not just on some textual metadata that has been attached to them. Unlike other content-based image retrieval systems, CLIP is incredibly good at capturing the meaning across an entire image. For example, with our two dogs running across a frosty field, we might describe only the background of that image without mentioning that there are two dogs in it, and if our description aligns well with what is actually in the image, we may still get that image back. We're not just focusing on one thing in the image; CLIP allows us to focus on many things in it. As an example, within the dataset I've been using here there is only one single image of the food, a hot dog. I searched for that, and the first image returned is a dog eating a hot dog, so it's pretty relevant. But of course there are no other images of hot dogs in this dataset, so the other images returned are quite interesting, because in some way or another they also show a "hot dog": a dog looking pretty cozy in a warm room with a fire in the background, then a dog in a big woolly jumper, and another dog posing for the camera. So weirdly enough we got a load of "hot dog" images; maybe it's not exactly what we meant when we said hot dog, but a person could understand how that term and those images are related.

Now, we're not actually restricted to text-to-image search. When we encode our text and when we encode our images, we are just creating vectors, so we can search across that space in any direction with any combination of modalities. We could do a text-to-text search or an image-to-image search; we can also do image-to-text search, or we could use some text to search across both text and images. We can go in any direction and use any modality we want.
Now let's go into a little more detail on what the architecture of CLIP actually looks like. As I mentioned, CLIP is these two models, trained in parallel. One of them is the text encoder, a fairly generic 12-layer text transformer. On the image encoder side there are two options: a vision transformer (ViT) model and a ResNet model, and a few different sizes of ResNet are used as well. Both encoder models output a single 512-dimensional vector, and the way they are trained is in the name: CLIP stands for Contrastive Language-Image Pre-training, so the training used during pre-training is contrastive.

Across both NLP and computer vision, large models dominate the state of the art. The idea behind this is that just by giving a large model a huge amount of data, it can learn general patterns from what it sees and almost internalize a general rule set for that data. In language, a model may internalize the grammar rules and patterns of English; for vision models, it may be the general patterns that show up across different scenes and objects. The problem with these models, and the reason they don't already fit together very well, is that they are trained separately. By default, these state-of-the-art models have no understanding of each other, and that is what CLIP has brought to the table. With CLIP, the text and image encoders are trained while considering the context of the other modality: the text encoder is trained considering the concepts learned by the image encoder, and the image encoder does the same for the text encoder. We can almost think of the image and text encoders as sharing an indirect understanding of the other modality.
Now, contrastive training works by taking an image and text pair, for example the two dogs running across a frosty field, feeding them into the text encoder and image encoder, and learning to encode them as closely as possible. For this to work well we also need negative pairs, something to compare against. This is a general rule in contrastive learning: you can't have only positive pairs, because then everything can just be encoded into the same tiny region of space and the model never learns how to separate pairs that are dissimilar. So we need both positive and negative pairs. To get negative pairs, we can essentially just take all the positive pairs in our dataset and swap them around: given the pair (t1, i1), we can mix t1 with different images, so t1 with i2, i3, i4 and so on. As long as our dataset is reasonably large, we can assume those swapped pairs are probably not going to be similar. Occasionally we might create a pair that actually is similar, but as long as that doesn't happen too frequently it won't really affect training; it's a negligible problem.

With this idea, we can use a loss function that minimizes the difference between positive pairs and maximizes the difference between negative pairs. That looks something like this: the positive pairs sit along the diagonal of the similarity matrix, and we want to maximize those dot products while pushing the off-diagonal ones down. The image you see here is actually the pre-training for a single batch. One interesting thing to note: if we have a small batch, say a batch size of two, it's going to be very easy for the model to identify which two items are similar and which are not, whereas with 64 items in a batch it's much harder, because the model has to find more nuanced differences between them, and the odds of guessing correctly at random are much smaller. So a larger batch size is a good thing to aim for in this contrastive pre-training approach.
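To make that concrete, here is a minimal sketch of the symmetric contrastive loss described above, written in PyTorch. The fixed temperature value is an assumption for illustration (the real CLIP model learns the temperature as a parameter), and `text_emb` and `image_emb` stand in for a batch of outputs from the two encoders.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, image_emb, temperature=0.07):
    # text_emb, image_emb: (batch_size, dim) outputs of the two encoders
    # L2-normalize so that dot products are cosine similarities
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    # (batch_size, batch_size) similarity matrix; the diagonal holds the positive pairs
    logits = text_emb @ image_emb.T / temperature
    targets = torch.arange(logits.shape[0], device=logits.device)
    # cross-entropy pushes each diagonal entry up and the off-diagonal entries down,
    # computed in both directions (text -> image and image -> text)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```

The larger the batch, the more off-diagonal negatives each positive pair has to beat, which is exactly why a bigger batch size makes the task harder and the training signal stronger.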
With that, I think we have a good idea of how CLIP can be used and how it has been trained. What I really want to do now is show you how you might be able to use it yourself. We're going to be using the vision transformer version of CLIP; remember I said there are ResNet and vision transformer options for the image encoder. OpenAI have released this model through the Hugging Face library, so we can use it directly from there, which makes it really easy to get started. So let's go ahead and do that now.

For this we will need to install a few libraries: Transformers, PyTorch and Datasets. We need Datasets to actually get a dataset, and I've prepared one especially for this: an image-text dataset. There isn't much in it, just 21 image-text pairs, and we can see what they look like. We have the text "Aerial shot of a futuristic city with a large motorway"; I tried to describe each image as best I could, and that's what I got. As you just saw, there are 21 of these image-text pairs in there.
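As a rough sketch, the setup looks something like this. The dataset ID and column names here are assumptions standing in for the demo dataset shown in the video, so substitute whichever image-text dataset you are using:

```python
# pip install transformers torch datasets

from datasets import load_dataset

# hypothetical dataset ID for the 21 image-text pairs used in the video
data = load_dataset("jamescalam/image-text-demo", split="train")

text = data["text"]   # assumed column name: 21 captions
imgs = data["image"]  # assumed column name: 21 PIL images
print(len(text), text[0])
```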
Now let's go ahead and download and initialize CLIP for our use. We can look the model up by its ID on huggingface.co and find the checkpoint released by OpenAI. With this model we use two objects: a processor and the model itself. The model is CLIP; the processor is almost like a pre-processor for both our text and our images. One other thing we do here: if a CUDA device is available, we can move our model to it. At the moment, if you try to do this with MPS, so if you're on a Mac with Apple silicon, there are some operations inside CLIP that don't yet work on MPS, so I would stick with CPU. We're only doing inference, so it's still pretty fast.
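A minimal initialization sketch, assuming the ViT-B/32 checkpoint that OpenAI published on Hugging Face (`openai/clip-vit-base-patch32`):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-base-patch32"

model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

# use a CUDA GPU if available, otherwise stay on CPU (MPS currently not recommended)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
```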
As I mentioned, the processor handles both the text and image preparation that needs to happen before we feed anything into the actual encoder models that make up CLIP. For text, it works like a normal text tokenizer: a tokenizer for transformer models translates our human-readable text into transformer-readable token IDs. We pass the text and set images to None; if we gave the processor both images and text it would process them at the same time, and we could do that here as well, but I want to show the two steps separately so you can see what each is doing. We also need to set padding to True, because different sentences can have different lengths: "hello world" and "aerial shot of a futuristic city" are different lengths, and a transformer model needs every sequence in a single batch to have the same length. So what the processor does is add what are called padding tokens, up to the length of the longest sequence within that batch of text items (here, our 21 sentences). Then we return everything as PyTorch tensors, and finally move them to whichever device we're using. I'm using CPU here, so that last step isn't actually necessary, but I'm doing it in case you run the same code on a CUDA-enabled device.
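A sketch of that text preprocessing step, assuming `text` is the list of captions loaded earlier:

```python
tokens = processor(
    text=text,            # list of caption strings
    images=None,          # text only for now
    padding=True,         # pad every caption to the longest one in the batch
    return_tensors="pt",  # return PyTorch tensors
).to(device)
```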
From there we have input IDs and an attention mask, so let's have a quick look at what those are. If we look at the input IDs, we get all these integer values, and you'll see that a lot of the sequences have this 49407 repeated at the end: those are the padding tokens. They're not represented as strings but as integer IDs, and we know they're padding because they appear several times at the end of each sequence, and none of the sentences I fed in ended with the same repeated words. We can also see a start-of-sequence token, and everything in between is a token representing a word, or part of a word, from our original text.

The attention mask is just ones and zeros. The ones represent real tokens, the words that were in our text inputs; the zeros mark where the processor has added padding tokens. It's used by the internal mechanisms of the text transformer to know which tokens to pay attention to and which to ignore, because we don't want the model to focus on the padding tokens: they're meaningless and are only there to make sure we have same-size inputs going into the transformer model.
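You can inspect both tensors directly; the exact padding token ID depends on the tokenizer, but with this checkpoint it should show up as the value repeated at the end of each shorter sequence (49407 in the video):

```python
print(tokens["input_ids"].shape)    # (21, length_of_longest_caption)
print(tokens["input_ids"][0])       # integer token IDs, padded at the end
print(tokens["attention_mask"][0])  # 1 = real token, 0 = padding to ignore
```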
After we have our tokens, we use CLIP to encode them all with get_text_features: we pass in our tokens (I already moved them to the device earlier, so I don't need to call .to(device) again here and can actually remove that). What do we get back? A tensor of shape 21 by 512: 21 text inputs, which makes sense, each encoded as a 512-dimensional vector. These are our text embeddings, one for each of the sentences I wrote. One other thing I want to point out is the min and max values: they're pretty big, so these vectors are clearly not normalized. Whether that matters depends on what you're doing. If you want to compare these vectors, you either need a similarity metric that ignores the magnitude of the vectors and only considers the angle, which is what cosine similarity does, or you can normalize the vectors, after which dot product similarity works too. To normalize, so we can use dot product similarity later, we do this: detach the text embeddings from the PyTorch graph, move them to CPU if needed (not strictly necessary here, but I do it anyway), and convert them into a NumPy array. Then we calculate the value to normalize each vector by, its norm, and divide each vector by that number. After that, the minimum and maximum are about -0.15 and +0.53, so nothing goes below -1 or above +1.
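Putting that together, a sketch of the text encoding and normalization steps:

```python
import numpy as np

with torch.no_grad():
    text_emb = model.get_text_features(**tokens)
print(text_emb.shape)  # torch.Size([21, 512])

# detach from the graph, move to CPU, convert to NumPy,
# then divide each vector by its L2 norm
text_emb = text_emb.detach().cpu().numpy()
text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
print(text_emb.min(), text_emb.max())  # now bounded within [-1, 1]
```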
When it comes to encoding images, we do a very similar thing. Images are also pre-processed using the processor, just with slightly different parameters. The reason we process the images is that CLIP expects a certain image size, and it expects the pixel values to be normalized. RGB images by default have pixel values ranging from 0 to 255; we need to normalize those, and we also need to resize the images. You can see that here: the first image is pretty big (that's its width and height). So we take all the images and process them, making sure text is set to None, which means the processor only outputs one tensor, the pixel values, so we extract that directly and move it to the hardware device, in this case just CPU. If we now look at the result, we have a tensor with three color channels (RGB) and a height and width of 224, so each image has been squeezed into a smaller size, and there are 21 of them because we fed in all of our images. That's all the processor is doing here: resizing and normalizing our images, ready for the vision transformer encoder of CLIP.
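A sketch of that image preprocessing step, assuming `imgs` is the list of PIL images from the dataset:

```python
pixel_values = processor(
    text=None,           # images only this time
    images=imgs,         # list of PIL images
    return_tensors="pt",
)["pixel_values"].to(device)

print(pixel_values.shape)  # torch.Size([21, 3, 224, 224])
```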
Very similar to before, where we used get_text_features, we now use get_image_features and pass in those pixel values. As you might expect, the resulting image embeddings are not normalized either, and, as we would also expect, they have the same dimensionality as our text embeddings, which means we can compare them. But before comparing them we normalize them, the same process as before, and we can see that the values have changed.
Cool. What we now want to do is calculate the similarity between all of our image embeddings and all of our text embeddings. We can do that in a couple of ways: cosine similarity or dot product similarity. The only reason we can use dot product similarity is that we normalized the vectors, but I'm going to show you both, so that if you don't normalize you can just use cosine similarity, as we do here. Cosine similarity is actually just a dot product between the text embeddings and the image embeddings in the numerator, with the product of their norms in the denominator. That's really all it is; it's pretty simple.
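In NumPy that looks something like this (the norm terms are redundant once the embeddings are normalized, but keeping them means the same line also works on unnormalized vectors):

```python
# rows = captions, columns = images
cos_sim = np.dot(text_emb, img_emb.T) / (
    np.linalg.norm(text_emb, axis=1, keepdims=True)
    * np.linalg.norm(img_emb, axis=1, keepdims=True).T
)
print(cos_sim.shape)  # (21, 21)
```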
If we plot those similarity scores, we get this. We would expect the values along the diagonal to be the highest, because they represent the true pairs of images and text. We have some that are not quite there, like here, where an image-text pair is more similar to something else. I put these captions together very quickly, so they're not always going to be perfect, and there is also a lot of overlap between these images: there are several images of city skylines, and a lot of the time I described those as futuristic cities with a big motorway or something along those lines. That is probably where the confusion is coming from.
Now, if we calculate the dot product similarity between these, we would expect it to be the same, and from this calculation we do seem to get a very similar set of similarities. But are they exactly the same? If we go down here, we can see that they pretty much are. We can't do a straight equality comparison, cosine similarity equals dot product similarity, because the numbers are very slightly different; there is a floating-point error, since these are all floating-point numbers, so we get very, very small differences. You can see that here: we calculate the difference between the two arrays and look at the minimum difference, which is zero, exactly what we'd expect where the numbers are identical, and the maximum difference, which is about 2.98e-8, i.e. 0.0000000298, a very small number. With that, we know the only difference between the two similarity arrays is floating-point error, which is pretty cool to see. We can use this exact approach of comparing with cosine or dot product similarity to search through all of our images with a text prompt, for example.
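A quick sketch of that check:

```python
# plain dot product on the normalized embeddings
dot_sim = np.dot(text_emb, img_emb.T)

diff = np.abs(cos_sim - dot_sim)
print(diff.min(), diff.max())  # ~0 and ~3e-8: nothing but floating-point error
```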
But that's not all CLIP is good for. CLIP also has amazing performance as a zero-shot model across different tasks, not just one. It performs incredibly well out of the box for classification; we'll go through this in more detail in a future video, but the idea is that, given the set of classes from an image classification dataset, you can modify the class names a little to make them more like sentences, and then use this same idea of comparing your text representations of the classes against the images from the dataset. You calculate the similarity between them, and the text embedding (you can think of it as a class embedding) with the highest similarity to your image is the predicted class. So you get zero-shot classification, just like that, super easily.
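As a rough sketch of that idea (the class names, prompt template and `query_image` are hypothetical placeholders, not from the video):

```python
class_names = ["dog", "cat", "city skyline"]          # hypothetical classes
prompts = [f"a photo of a {name}" for name in class_names]

inputs = processor(text=prompts, images=query_image,  # query_image: a PIL image
                   padding=True, return_tensors="pt").to(device)

with torch.no_grad():
    t_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                    attention_mask=inputs["attention_mask"])
    i_emb = model.get_image_features(pixel_values=inputs["pixel_values"])

# normalize, then dot product == cosine similarity
t_emb = t_emb / t_emb.norm(dim=-1, keepdim=True)
i_emb = i_emb / i_emb.norm(dim=-1, keepdim=True)
scores = (i_emb @ t_emb.T).squeeze(0)

print(class_names[scores.argmax().item()])  # predicted class
```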
Another use case is zero-shot object detection. Let's say you're looking for a cat, or a butterfly, in an image. When you're looking for the cat, you use a chunk of text like "a fluffy cat", encode it with CLIP, and get your text embedding. Then you break your image up into lots of little patches and slide through all of them; you can include an overlap so that you don't miss anything between patches. For each patch you slide over, you extract that part of the image, process it through CLIP, and compare its encoding against the text embedding you created for "a fluffy cat". What you will see is that the patches of the image containing whatever you described get a higher similarity score. You can then overlay those scores back onto the image, and you will find that CLIP is able to essentially identify where in your image a specific object is, with you just describing that object using a natural-language prompt.
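A rough sliding-window sketch of that idea; the window size, stride and `query_image` are assumptions for illustration rather than the exact setup used in the video:

```python
window, stride = 224, 112  # 50% overlap between patches

prompt = processor(text=["a fluffy cat"], images=None,
                   padding=True, return_tensors="pt").to(device)
with torch.no_grad():
    q_emb = model.get_text_features(**prompt)
q_emb = q_emb / q_emb.norm(dim=-1, keepdim=True)

width, height = query_image.size  # query_image: a PIL image
scores = []
for top in range(0, max(height - window, 0) + 1, stride):
    for left in range(0, max(width - window, 0) + 1, stride):
        patch = query_image.crop((left, top, left + window, top + window))
        pixels = processor(text=None, images=patch,
                           return_tensors="pt")["pixel_values"].to(device)
        with torch.no_grad():
            p_emb = model.get_image_features(pixel_values=pixels)
        p_emb = p_emb / p_emb.norm(dim=-1, keepdim=True)
        # cosine similarity between this patch and the text prompt
        scores.append(((left, top), (p_emb @ q_emb.T).item()))

# the highest-scoring patches show where "a fluffy cat" sits in the image
scores.sort(key=lambda s: s[1], reverse=True)
print(scores[:3])
```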
Now, these are only a few use cases of CLIP, and they really only scratch the surface of what is possible with this model. As I said, we also see it being used in the diffusion models like DALL-E, which is a great example of how powerful CLIP can be. So that's it for this introduction to CLIP; I hope it's been useful. As I said, we're going to go into more detail on the different use cases of CLIP, and how to apply it to them, in future videos. Until then, that's it for now, so thank you very much for watching, and I will see you again in the next one. Bye.