Intro to Dense Vectors for NLP and Vision
Chapters
0:00 Intro
1:50 Why Dense Vectors?
3:55 Word2vec and Representing Meaning
8:40 Sentence Transformers
9:58 Sentence Transformers in Python
15:08 Question-Answering
18:18 DPR in Python
29:55 Vision Transformers
33:22 OpenAI's CLIP in Python
42:49 Review and What's Next
We're going to start a new series on embedding methods for NLP, but we're also going to have a look at other embedding methods beyond language, because we can build dense embeddings for images and maybe other media as well. So I think this series of articles and videos is going to be really interesting.
Now, what I want to start with is having a look at why we would use dense vectors in the first place. And whilst we do that, I'm going to refer a lot to Word2Vec, because that's the first widely adopted version of this. And then we're going to have a look at sentence embeddings, so how we can build sentence embeddings using the Sentence Transformers library, and we're going to go through the code for that as well. After that, we'll look at question answering, and we're going to focus on Facebook AI's Dense Passage Retriever, or DPR. And again, we are going to go through the code for that as well. And then another thing that I think is quite exciting is that we'll have a look at dense vectors for images too. But first, why would we use dense vectors in the first place?
Now, we have two options when it comes to representing text, and that is we can represent it as a dense vector or as a sparse vector. With a sparse vector, we're going to focus on the syntax and the actual words that we're comparing. So if we had two sentences, "Bill ran from the giraffe towards the dolphin" and the other way around, so "Bill ran from the dolphin towards the giraffe", both of these sentences have the exact same words in them, but they mean different things. In one of them, Bill is running away from a giraffe, and in the other one he's running away from a dolphin.
Now, with a sparse vector representation, we'd find it difficult to correctly identify these two sentences as having different meanings, because we tend to represent words one by one, and both sentences contain exactly the same words. Now, we can also use n-grams, so we can put two words together, but that still doesn't help us where we have different words for the same meaning. So, for example, if you want to say hello to someone, you could say hello, hi, hey, and I'm sure there's a million other ways of saying it.
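As a quick illustration of that first problem, here's a small sketch (not from the video) using scikit-learn's CountVectorizer to build bag-of-words sparse vectors for the two giraffe/dolphin sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "Bill ran from the giraffe towards the dolphin",
    "Bill ran from the dolphin towards the giraffe",
]

# Bag-of-words counts: the two sentences contain exactly the same words,
# so their sparse vectors come out identical despite the different meanings.
bow = CountVectorizer().fit_transform(sentences).toarray()
print((bow[0] == bow[1]).all())  # True
```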
So whereas sparse vectors are very good for comparing the syntax of two pieces of text, they're not very good at comparing the semantics or the meaning behind text. And that's where we want to start using dense vectors, which give us a numerical representation of the semantic meaning behind some text. And we can actually visualize a lot of these relationships. Back when Word2Vec came out, which was the first very popular dense vector embedding method for words, there were plenty of examples of people showing things like what you can see on the screen, where, for example, we'd have days of the week clustered together, or we would have months or other related abstract topics clustered together as well.
When we're actually building these dense vectors, we're working with far more dimensions than we can plot, so this is obviously a simplified version of that. And not only will we find that similar words are clustered together, we'll also find that we can perform what I think is best described as arithmetic on words. There are some well-known examples of this that came from around the same time as Word2Vec, and you'll be able to find them at the bottom of the article. If you need the article, it's in the description.
Now, what we'd find is that if we took the vector for king, subtracted the vector for man, and added the vector for woman, we'd end up very close to the vector for queen. So the nearest vector would be the vector for queen, and I mean, I think that's super interesting.
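If you want to try that word arithmetic yourself, here's a small sketch (not from the video) using the pretrained Google News Word2Vec vectors through gensim:

```python
import gensim.downloader as api

# Pretrained Word2Vec vectors (a large download on first use)
wv = api.load("word2vec-google-news-300")

# king - man + woman ~= queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```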
Now, these embedding methods have been around for a while, and they've just gotten a lot more advanced in that time. Word2Vec was one of the earliest versions, and going from the name, we know it's word to vector, so we're converting words into vectors.
Now, how this worked, there were two different methods. The one you can see now (the skip-gram approach) is where we take one word, and we would take the sparse, one-hot encoding for that word as the input. And then, in the vector on the right, we would have a one-hot encoding for all of the words that surround it. So for the word fox, that's surrounded by the words quick, brown, and so on. This would be run through a simple feed-forward neural network, and we would go through this compression stage in the middle, and it's from there that we would build our dense vector representation for fox. And this would be done many times over, for every occurrence of the word, until we end up with something like a numerical representation of that word. The other method (continuous bag of words, or CBOW) is essentially the same thing, we're just swapping the order of the transformation. So on the left, we have all of our context words, and then on the right, we would have the word that we're focusing on and that we're building the embedding for.
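To make that architecture a bit more concrete, here's a toy sketch in PyTorch of the skip-gram style setup just described: one word goes in, scores over its context words come out, and the dense vector lives in the middle layer. The vocabulary size, dimensions, and word indices are placeholder values, not anything from the video:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 10_000, 300  # placeholder sizes

model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),  # the "compression" to a dense vector
    nn.Linear(embed_dim, vocab_size),     # project back out to scores over the vocabulary
)

centre = torch.tensor([42])     # index standing in for "fox"
context_scores = model(centre)  # shape (1, vocab_size)

# One training step: push up the score of a surrounding word, e.g. "quick"
loss = nn.functional.cross_entropy(context_scores, torch.tensor([101]))
loss.backward()

# After training, each row of the embedding layer is a dense word vector
fox_vector = model[0].weight[42]
```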
Now, Word2Vec acted as the catalyst for a lot of other vector embeddings, one example being vector embeddings for major league baseball players. So you've got a lot of different 2Vec methods that came out of the woodwork after the original Word2Vec, and then we also had other ones like GloVe as well. But these word-level methods are fairly dated now, and we wouldn't really go ahead and use them today, so I'm not going to spend any more time on it, and we'll just move on to having a look at sentence similarity.
Now, we can think of sentence embeddings as very similar to Word2Vec, in that we're building dense vectors, but rather than representing a single word, we're representing a sentence or a paragraph. The most popular approach here builds on BERT. Now, with BERT by itself, you can build embeddings, but it's based on a token-by-token embedding, so within BERT you have all of these different embeddings, one for each token. So what the guys at Sentence Transformers did is take BERT and modify it. They called it Siamese BERT, where they had two BERTs, and they output a single vector for the full input that was fed into the model, which was around 128 tokens at max. Now, this allowed us to build a single vector for sentences, and with that, we can start comparing sentences and paragraphs.
Now, to start with the code, we need the library installed, so that's pip install sentence-transformers. I've already done this, so I'm not going to rerun it. Then we want to import the SentenceTransformer object, and from there, we can just initialize our model; in here, we just need to type our model name. Now, if you Google Sentence Transformers or SBERT, you'll find the SBERT website, and it has loads of different models on there. And usually, you will need to download the model, so you will see a load of loading bars or progress bars. I already have it downloaded, so I don't need to run it again.
Now, we have a few sentences here, and we can compare these and look at what the Sentence Transformer model gives us. One of them is "The bees decided to have a mutiny against their queen", and another describes pretty much the same thing, but they don't have any matching words between the two sentences. The meaning there is pretty much the same, but there are no shared words other than, I think, "the". So a sparse vector approach wouldn't pick that similarity up, but we'll see that with dense vectors, it will. So the first thing we want to do is encode our embeddings. We'll write embeddings equals model.encode, and pass in our sentences. And then let's have a look at what that outputs, or at least its shape. We see that we get seven vectors, or seven embeddings, each one with a dimensionality of 768 values.
And we can use cosine similarity to compare all of these. So we'll import cosine similarity from Sentence Transformers, and then what we do is calculate the cosine similarity scores between our embeddings. Now, I want to compare the final embedding against all of the others, because I want to see that this one comes out as the most similar. So I'm just going to select that, so I'll write embeddings and take the last one, and then I want the remaining embeddings to compare it against. And we will be able to see that we have something like this: this one here is the most similar, by quite a bit. So if I take the argmax of that, we should see the index of the most similar sentence, and if we go to sentences and put that index in, we get "the bees decided to have a mutiny against their queen". So it correctly identified that these two sentences are far more similar than the rest of the sentences, which I think is very cool, because there's barely a single meaningful word shared between them; the bees and queen in one line up with flying stinging insects and matriarch in the other.
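Pulling that whole walkthrough together, here's a rough sketch of the code. The model name and the exact sentence list aren't fully shown above, so treat those as stand-ins:

```python
from sentence_transformers import SentenceTransformer, util

# Stand-in sentences: the last one is the query we compare against the rest
sentences = [
    "Bill ran from the giraffe towards the dolphin",
    "the bees decided to have a mutiny against their queen",
    # ... more sentences ...
    "flying stinging insects rebelled against their matriarch",
]

model = SentenceTransformer("bert-base-nli-mean-tokens")  # assumed model choice, 768-d output
embeddings = model.encode(sentences)                      # shape (len(sentences), 768)

# Cosine similarity between the final embedding and all of the others
scores = util.cos_sim(embeddings[-1], embeddings[:-1])
most_similar = scores.argmax().item()
print(sentences[most_similar])
```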
Now, another big use case for dense vectors in language applications is question answering, and we can do that with a few different, let's say, architectures. The structure of open-domain question answering is what you can see on the screen at the moment. So we ask a question, and that gets passed to something called a retriever model. The retriever model contains a question encoder, which converts our question into a dense vector, and that gets compared against an index; within there, we will have a set of contexts. A context encoder takes those passages and encodes our contexts into the same vector space. So if we had a question, "what is the capital of France?", a context containing the answer should be encoded into the same vector space, or very, very close by. From the most similar context we can then pull out the answer, so in the previous example, we would output Paris, hopefully.
DPR is Facebook AI's Dense Passage Retriever, and it actually consists of two smaller encoders: we have a question encoder and a context encoder. During training, we pass questions and their equivalent contexts to the question and context encoder, respectively, and we optimize based on a contrastive loss function. So we compare the vectors from our question encoder and our context encoder, and we try to minimize the difference between them for pairs that belong together. Now, that's a little different from sentence transformers, which are used to identify very similar sentences. DPR is used to identify not very similar sentences, but questions and the contexts that answer them, which often don't look alike at all.
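To make that training objective a little more concrete, here's a rough sketch (not Facebook AI's actual training code) of a contrastive loss with in-batch negatives, where each question vector should score highest against its own context vector:

```python
import torch
import torch.nn.functional as F

# Placeholder encoder outputs: in practice these come from the question
# and context encoders for a batch of matching question/context pairs.
q = torch.randn(8, 768)  # question vectors
c = torch.randn(8, 768)  # context vectors, in the same order

scores = q @ c.T                        # dot-product similarity, question i vs every context
labels = torch.arange(q.shape[0])       # the i-th context is the positive for the i-th question
loss = F.cross_entropy(scores, labels)  # pull matching pairs together, push the rest apart
```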
Now, the first thing we want to do is initialize our context encoder and our question encoder. We're going to use the Hugging Face transformers library for this. If you pip installed sentence-transformers, that does include transformers as a prerequisite, so you should already have transformers as well. So the first thing we want to do is, from transformers, import a model and tokenizer for both our context encoder and our question encoder. So we write DPRContextEncoderTokenizer and DPRContextEncoder, and then we also want the question encoder tokenizer and question encoder, so we write DPRQuestionEncoderTokenizer and also DPRQuestionEncoder. And that's all we need to import, so let's run that.
And then we can go ahead and initialize our tokenizers and models, starting with the DPR context encoder. If you've ever used Hugging Face transformers before, you'll know that we initialize from a pretrained model name, which we can find on the HuggingFace.co/models website. If you search for what I'm about to type in, you will find that it comes up. So I'm going to type facebook/dpr-ctx_encoder, so ctx for context, and then the rest of the model name. Then the context tokenizer equals DPRContextEncoderTokenizer.from_pretrained, and again, we want the same model name in there. OK, so they are our context side of the model: we've got our context encoder and tokenizer. Now we want our question encoder and tokenizer, and in the model name, we are just replacing ctx with question. If you haven't already got these cached on your machine, this can take a little while, because we're downloading four sets of models and tokenizers. I already have them, so I don't need to wait for that.
Now, I have a list of contexts here, and inside it I've also put in the questions themselves, because I want to prove that this is not just a sentence transformer that's finding the most similar sentence. So when we ask "what is the best-selling sci-fi book?", it shouldn't return "what is the best-selling sci-fi book"; it should instead return the context about the best-selling sci-fi book itself. So we should see that there is a difference between using DPR and using sentence transformers. And then what we want to do is tokenize everything.
So we create our xb_tokens with the context tokenizer, and in here, we're going to pass our contexts. And then, if you've used Hugging Face transformers, you should recognize the rest as well, so max_length here, and padding. We don't need to truncate anything, I don't think. And, oh, the only thing we do need to include here is return_tensors set to pt, so we get PyTorch tensors back. And then what we can do is write xb, our context embeddings; so this is how we build them: we call the context model, and in here, we pass our tokens, xb_tokens, like that. And then for our questions, we do exactly the same thing, but of course, we just replace the context part of it with question. So here, we have the question tokenizer, we have questions, and then here, I'm going to rename xb to xq, so our query.
So first, let's have a look at what we have inside xq. We'll see that we have a few different tensors in here, so I'll just write xq.keys to see what we have. You see that we actually only have one output here, the pooler output. So we can write .shape to see the shape of those embeddings: the number of questions that we passed up here, and each one has been encoded into an embedding of 768 dimensions. And we could do the same for xb if we want as well; we'd just see a different number of vectors, because, obviously, we have more contexts than we do questions.
So what we want to do now is compare them, and I'm going to import torch for that. Again, this should have been installed already with Hugging Face transformers and also sentence-transformers. What I'm going to do is write for i, and then the query vector, in enumerate over our query embeddings. So what I'm doing here is creating a loop to go through each query vector and compare it against all of the context vectors. We take xq_vec, so the single query vector, and compare it against xb's pooler output, so a dot product between the query vector and each context vector. And from that, we want to get the argmax, so the maximum argument, the highest of those similarity scores right here. And then I'm going to print the current question that we're asking, followed by the context that it retrieves.
So we get "what is the capital city of Australia?", and it's not returning the exact same sentence back to us, it's returning the context that answers it: "Canberra is the capital city of Australia." Now, for the second one, as we had hoped, it returns the context about the best-selling sci-fi book rather than the question asking what the best-selling sci-fi book is. And then I just wanted to include this one as well, because it doesn't quite work. In this case, it didn't find the correct answer for how many searches are performed on Google; the correct answer should have been this one here, "Google serves more than 2 trillion queries annually."
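Putting the whole DPR walkthrough together, here's a rough sketch of the code. The full checkpoint names were cut off above, so I'm assuming the standard single-nq-base DPR models from the Hugging Face hub, and the contexts and questions are just a couple of stand-ins based on the examples mentioned:

```python
import torch
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

ctx_name = "facebook/dpr-ctx_encoder-single-nq-base"     # assumed checkpoint
q_name = "facebook/dpr-question_encoder-single-nq-base"  # assumed checkpoint

ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained(ctx_name)
ctx_model = DPRContextEncoder.from_pretrained(ctx_name)
q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained(q_name)
q_model = DPRQuestionEncoder.from_pretrained(q_name)

contexts = [
    "Canberra is the capital city of Australia.",
    "Google serves more than 2 trillion queries annually.",
]
questions = [
    "What is the capital city of Australia?",
    "How many searches are performed on Google?",
]

# Tokenize and encode both sides; pooler_output holds the 768-d vectors
xb_tokens = ctx_tokenizer(contexts, max_length=256, padding="max_length",
                          truncation=True, return_tensors="pt")
xq_tokens = q_tokenizer(questions, max_length=256, padding="max_length",
                        truncation=True, return_tensors="pt")
with torch.no_grad():
    xb = ctx_model(**xb_tokens)
    xq = q_model(**xq_tokens)

# For each question, take the context with the highest dot-product score
for i, xq_vec in enumerate(xq.pooler_output):
    scores = torch.matmul(xq_vec, xb.pooler_output.T)
    print(questions[i], "->", contexts[scores.argmax()])
```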
So that's DPR and question answering, which I think has a lot of potential in many businesses around the world.
So recently, computer vision has had a few advances along these same lines, and what we're finding is that a model or an architecture that works for language can also work for vision. We have the Vision Transformer, which is very recent; I think the paper is January 2021, if I'm not wrong. And although we don't need a Vision Transformer to build image embeddings, I think the fact that we can use it is pretty cool, and we can use it with Hugging Face transformers, as we will see. The model we're going to use has two encoders: the text encoder is more of a traditional transformer, and the image encoder is our new Vision Transformer. So it's something like we had with DPR, the bi-encoder architecture, and what we can do is train it on pairs of images and language and map them to the same point in a vector space. So we could take an image and process that through the image encoder, which would give us a dense vector in that shared space.
Now, I'm going to be using these three pictures that I've pulled together. So we have these three pictures, and I'm also going to encode these three captions that go with them. Then we're going to perform a similarity comparison, a cosine similarity, across them, and we'll see the results are pretty cool, in my opinion.
And we're going to be using a new model from OpenAI called CLIP. Similar to DPR, where DPR is doing question and context encoding, CLIP is using two encoders to do image and caption encoding. So we're going to do, from transformers, import the CLIP model class and the CLIP processor, which is roughly what we would call a tokenizer in a typical language model. That's like we did with DPR, where we imported four classes. And then what we want to do is just initialize those. In the model name you can see it's the Vision Transformer, this ViT you see here, which CLIP is using, or is based on, at least for the vision aspect; and the patch part of that is referring to the way the image gets split up into patches before it's fed into the model, and that's the patch size, the patch 32 there. So we also want the processor, which, again, we can kind of see as akin to or equivalent to our tokenizer, and we initialize it just as we would for a language model. And I'm using requests to actually get the images from their URLs, so I need matplotlib in there as well to display them: import matplotlib.pyplot as plt, and numpy as well. OK, and we'll see those images that we saw before.
I tried to pick these so they could be confused, because they look like, well, there's a tree here and similar things going on there, to try and make it a little bit more difficult. But, I mean, they're reasonably straightforward still. Now, to feed everything into the model, you can imagine, you can see these being turned into tokens. We create our inputs with the processor, similar to our tokenizer again. So we have the text, where we want to input our captions, the images, and then we want to set the return tensors format, so PyTorch tensors. And let me have a quick look at what we have here: so we have our input IDs, pixel values, and so on. So input IDs, and we also have the attention mask here as well.
And now what we want to do is create our encodings. So in here, because we're using the full CLIP model, it's actually going to perform the encodings, and it's also going to do the whole similarity checking for us as well, and identify which images and captions belong together. Or rather, what it's going to do is go through each image and find the caption that it believes belongs to it. In the output we have the logits per image and per text, and we can use these to find the most probable caption for each image. And then, what we were doing before, where we were just extracting the embeddings, we can also do that, and maybe I'll just copy in the code for that as well. So we have the text embeddings, and we also have the image embeddings in here as well, and then a little further down, we have the logits somewhere,
as well as the vision model output and text model output. Now what we'll do is I'm going to paste this code in, and here I'm going to go through each image and get the argmax, so the caption that it believes belongs to that image, from the probabilities, so probs equals the logits per image from our outputs. So it's predicting caption 2, caption 0, and then caption 1 for the three images, which match up, and then two dogs running as well, which, I don't know; for me, maybe because I'm usually working with language, I think seeing both language and images together is super, I don't know, fascinating, that it actually works.
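Here's a rough sketch of that whole CLIP flow in code. The actual dog pictures and captions from the video aren't reproduced here, so I'm using a single stand-in image (the COCO cats photo used in the Hugging Face docs) and made-up captions; the model ID and logits_per_image output are the standard Hugging Face CLIP ones:

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

# Stand-in image and captions
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)
captions = [
    "two cats lying on a couch",
    "a dog hiding behind a tree",
    "two dogs running",
]

# The processor tokenizes the text and preprocesses the image pixels together
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image: one row per image, one column per caption
best = outputs.logits_per_image.argmax(dim=1)
for i, j in enumerate(best):
    print(f"image {i} -> {captions[j]}")
```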
So another thing that I want to show you very quickly, because I don't want to go through all of it: these are the embeddings, if we wanted to extract them and put them in a vector index somewhere, a vector database. So I'm going to take "a dog hiding behind a tree" as a query and compare it against the context, or not the context, the images, the image embeddings. With the cosine similarity, we get the highest score for one of them, so let's have a look at what our prediction is then. And it shows us the dog hiding behind the tree for our query, which is "a dog hiding behind a tree".
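And for that last retrieval-style step, here's a sketch (continuing from the code above) of pulling the embeddings out yourself and comparing a free-text query against the image embeddings with cosine similarity; get_text_features and image_embeds are the standard CLIP outputs in transformers:

```python
import torch.nn.functional as F

# Encode the text query on its own
query_inputs = processor(text=["a dog hiding behind a tree"],
                         return_tensors="pt", padding=True)
with torch.no_grad():
    query_emb = model.get_text_features(**query_inputs)

# Image embeddings from the earlier forward pass (one row per image)
image_embs = outputs.image_embeds

# Cosine similarity between the query and each image embedding
scores = F.cosine_similarity(query_emb, image_embs)
print(scores, scores.argmax())
```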
We've, I think, covered quite a lot of embedding methods. We've gone from an intro to dense vectors with Word2Vec and where it all came from, had a look at sentence embeddings and Sentence Transformers, moved on to Q&A with Facebook AI's DPR, and now we've had a look at the new Vision Transformer and how we can use that with other transformer models to build these really cool cross-media embeddings that we can compare, which has blown me away a little bit. But like I said, this is the first video and article in the series, so there's a lot more to come. For now, thank you very much for watching.