
Intro to Dense Vectors for NLP and Vision


Chapters

0:00 Intro
1:50 Why Dense Vectors?
3:55 Word2vec and Representing Meaning
8:40 Sentence Transformers
9:58 Sentence Transformers in Python
15:08 Question-Answering
18:18 DPR in Python
29:55 Vision Transformers
33:22 OpenAI's CLIP in Python
42:49 Review and What's Next

Transcript

Hi, and welcome to this video. We're going to start a new series on embedding methods for NLP, but we're also going to have a look at other embedding methods as well. So mainly, we're going to be focusing on dense embeddings for language. We might have a look at sparse embeddings, but we've already covered those before.

So I'm not 100% sure on that. But definitely dense embeddings. We're going to also have a look at how we can build dense embeddings for images and maybe some other media formats as well. So I think this series of articles and videos will be pretty exciting. Now, what I want to start with is having a look at-- well, basically, quickly introducing what dense vectors and dense embeddings are.

And whilst we do that, I'm going to refer a lot to Word2Vec because that's the first widely adopted version of this. And then we're going to have a look at sentence embeddings, so how we can build sentence embeddings using the Sentence Transformers library. And we're going to go through the code for that as well.

Then we're going to have a look at Q&A. So Q&A is quite interesting, I think. And we're going to focus on Facebook AI's Dense Passage Retriever for that. And again, we are going to go through the code for that as well. And then another thing that I think is quite exciting is image and text embeddings.

So to do that, we're going to have a look at the new Vision Transformer. So I think all of that's pretty cool. So let's jump straight into it. So I think the first question we want to ask is, why would we use dense vectors in the first place? Now, we have two options when it comes to representing text.

And that is we can represent it as a dense vector or as a sparse vector. Now, sparse vectors are good if we're going to focus on the syntax and the words that we're comparing. So if we had two sentences, Bill ran from the giraffe towards the dolphin. And then we said the opposite.

So Bill ran from the dolphin towards the giraffe. Both of these sentences have exactly the same words in them. But they have different meanings, right? So in one of them, Bill is running away from a giraffe, and in the other one, he's running away from a dolphin. Now, when it comes to sparse vector representations, we'd find it difficult to correctly identify these as not being the same sentence.

Because we tend to represent words one by one in some sort of one-hot encoding, and then compare those vectors. Now, we can also use n-grams so we can put two words together. And in that case, we would identify that there is a difference. But it's not that effective. And then we also want to consider where we have different words for the same meaning.

So for example, if you want to say hello to someone, you say hi, hello, hey. I'm sure there's a million other ways of saying it. And sparse vector representations would view these as different words. So whereas sparse vectors are very good for comparing the syntax of text, they're not very good at comparing the semantics or the meaning behind text.

And that's where we want to start using dense vectors. So we can see dense vectors as pretty much a numerical representation of the semantic meaning behind some text. And we can actually visualize a lot of these relationships. So around 2013, we had Word2Vec, which was the first very popular dense vector embedding for words.

And around that time, we had a lot of people showing that you had things like this, what you can see on the screen, where, for example, we'd have days of the week clustered together, or we would have months or other related abstract topics represented or clustered together in our highly dimensional space.

Now, of course, this is a 3D graph. When we're actually building these dense vectors, we have many more dimensions, more towards the 500, 700, 800 or so. So this is obviously a simplified version of that. And not only will we find that similar words are clustered in the same area, but we also find that we can perform what I think is best described as arithmetic on words.

So this is a very popular example that came from around the same time as Word2Vec. If you want references and everything, you'll be able to find them at the bottom of the article that this video is attached to. If you need the article, it's in the description. Now, what we'd find is if we took the vector for king, subtracted the vector for man, and added the vector for woman, we would not get the exact vector for queen, but we'd get very, very close.

So the nearest vector would be the vector for queen. And I mean, I think that's super interesting. And this is from the start of when we had these vector embeddings. So this is eight years ago now. And they've just gotten a lot more advanced in that time. So as I said, these examples are coming from the Word2Vec era.

And Word2Vec was one of the earliest versions of these dense representations. And going from the name, we know it's Word2Vector. So we're converting words into vectors. Now, how this worked, there were two different methods. We had the skip-gram method, which is what you can see now, which is where we take one word, and we would take the sparse vector encoding for that word on the left that you can see.

And then we would, in the vector on the right, we would have a one-hot encoding for all of the words that surround that first word. So in this case, we have fox. And that's surrounded by the words quick, brown, jumped, and over. And this would be run through a simple feed-forward neural network.

And we would go through this compression stage. And it is within that compression stage that we would build our dense vector representation for fox. And that would simply be a neural network being optimized to go from fox and predict quick, brown, jumped, and over. And this would be done many times over for every time that word appears in a big corpus of text with its multiple contexts.
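As a quick illustration (this isn't shown in the video), training a skip-gram model yourself only takes a few lines with the gensim library. This is a minimal sketch, assuming gensim 4.x; the tiny corpus is just a stand-in for real text.

```python
# Minimal skip-gram sketch with gensim (assumes gensim >= 4.0).
from gensim.models import Word2Vec

corpus = [
    ["the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"],
    ["the", "fox", "ran", "from", "the", "dog"],
]

# sg=1 selects skip-gram (sg=0 would give continuous bag of words);
# window=2 means two context words either side of the center word.
model = Word2Vec(sentences=corpus, vector_size=100, window=2, sg=1, min_count=1)

print(model.wv["fox"].shape)                 # the dense vector for "fox", (100,)
print(model.wv.most_similar("fox", topn=3))  # nearest words in the vector space
```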

And what that does is it just builds up like a numerical representation of that word. And then there was the other approach, which Word2Vec also used, which is called continuous bag of words. And it's basically the same. We're just swapping the order of the transformation. So on the left, we have all of our context words.

And then on the right, we would have the word that we're focusing on and we're building the embedding for. Now, Word2Vec really seemed to act as the catalyst for a lot of other vector embeddings. From Word2Vec, for example, we have like sentence2vec, doc2vec. We even had this one that I found when I was researching for this, which is called batter-pitcher2vec, which is vector embeddings for Major League Baseball players.

So you've got a lot of different 2vec methods that came out of the woodwork after the original Word2Vec. And then we also had other ones like GloVe as well, which is worth a mention. Now, nowadays, Word2Vec is pretty outdated and we wouldn't really go ahead and use that. So I'm not going to spend any more time on it.

And we'll just move on to having a look at sentence similarity. So you can see sentence similarity as very similar to Word2Vec in that we're building these dense representations. But rather than representing a single word, we're representing a sentence or a paragraph. And the way that this would be done is using the current transform models.

So BERT was the first example of doing this. And with BERT by itself, you can build embeddings. But it's based on a token-by-token embedding. So within BERT, you have all of these different embeddings, but they each represent a single token. So what the team behind Sentence Transformers did is they trained what they called a Siamese BERT.

So they had two BERT models trained in parallel, and they output a single vector for the full input to the model, which was around 128 tokens at max. Now, this allowed us to build a single vector for sentences. And that's very good, because then we can start comparing sentences and paragraphs.

So let's have a look at how we can actually build that in code. So the first thing you'll need to do is pip install Sentence Transformers. Now, I've already done this, so I'm not going to rerun it. But if you don't have Sentence Transformers, you will need to install it.

And then after that, all we want to do is we want to write from Sentence Transformers. We want to import the Sentence Transformer object. And from there, we can just initialize our model. Super easy. We just write model, Sentence Transformer. And then in here, we just need to type our model name.

Now, if you Google Sentence Transformers or SBERT, you will find the web page for this library. And it has loads of different models on there. One of the highest performing ones that I found on there at the moment is called all-mpnet-base-v2. So we just execute that.
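In code, that initialization comes out roughly like this (a minimal sketch, using the model name as spoken in the video):

```python
from sentence_transformers import SentenceTransformer

# downloads the model on first run, then loads it from the local cache
model = SentenceTransformer("all-mpnet-base-v2")
```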

And usually, you will need to download the model. So you will see a load of loading bars or progress bars. That's fine. It's just downloading the model for you. I already have it downloaded, so I don't need to run it again. And then what we need is a set of sentences so that we can actually compare them and see which ones the sentence transformer believes are the most similar.

Now, all of these are completely random, but we have this one here. "The bees decided to have a mutiny against their queen." And I just rewrote that in a way that we don't have any matching words between the two sentences. So we have "flying, stinging insects rebelled in opposition to the matriarch." Now, the meaning there is pretty much the same.

Maybe not exactly the same, but pretty much. But there are no shared words other than, I think, "to" and "the." Yeah, "to" and "the." So in terms of sparse vector encoding, this wouldn't score very well. But we'll see that with dense vectors, it will. So the first thing we want to do is encode our embeddings.

So we'll write "embeddings" plus "model.encodeSentences." And then let's have a look at what that outputs, or at least the shape of what it outputs. And we see that we get seven vectors, or seven embeddings, each one with a dimensionality of 768 values. And we can use cosine similarity to compare all of these.

Now, the easiest way to do this is we just import cosine similarity from Sentence Transformers. So from sentence_transformers.util import cos_sim. And then what we do is calculate the cosine similarity scores between all of our vectors. Now, I want to compare the final item here, so this last one, against the rest of them, because I want to see that this is the most similar.

So I'm just going to select that, so write "embeddings." And we're just taking the last vector. And then I want "embeddings," well, the remaining of them, so all the vectors except from the last one. And let's just have a look. We will be able to see that we have something that seems pretty obvious.

So this one here is the most similar, by quite a bit, at 0.6. The next closest is 0.19 here. So it's definitely calculating that as a lot more similar than the other ones. So if I take the argmax of that, we should see 3, and take the item. And if we go to sentences and index number 3, we see, OK, the bees decided to have a mutiny against their queen.
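Put together, the comparison looks roughly like this; the first two sentences below are made-up fillers standing in for the full list used in the video, so the shapes differ slightly:

```python
from sentence_transformers.util import cos_sim

sentences = [
    "the weather is nice today",                                      # filler
    "he bought a new pair of running shoes",                          # filler
    "the bees decided to have a mutiny against their queen",
    "flying stinging insects rebelled in opposition to the matriarch",
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # (4, 768) here; (7, 768) with the video's full list

# compare the last embedding against all of the others
scores = cos_sim(embeddings[-1], embeddings[:-1])
best = int(scores.argmax())
print(sentences[best])   # -> "the bees decided to have a mutiny against their queen"
```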

So it correctly identified that these two, this and this, are far more similar than the rest of the sentences, which I think is very cool, because there's not even any similar words in there. And even as a human, it's kind of, you know, the bees, flying, stinging insects, and matriarch and queen, you know, it's not obvious.

So I think that's really cool. Another popular use of embeddings for language applications is question answering. Now, question answering can be done with a few different, let's say, architectures. And one of the, I think, most popular ones is open domain question answering. Now, the structure of open domain question answering is what you can see on the screen at the moment.

So we ask a question that gets passed to something called a retriever model. The retriever model contains a question encoder, which encodes the question, passes it along to our index database. And within there, we will have a set of contexts. Now, contexts are usually a paragraph that contains the answer to our question.

And DPR both encodes our questions and encodes our context into the same vector space. So what we would get is, for example, if we had a question, what is the capital of France? And then we also had a context. The capital of France is Paris. DPR would attempt to encode both of those into the same vector space, or very, very close by.

So the vectors produced by both of those would be very, very similar. So all we're doing in that index database is finding the most similar embeddings to our question embedding. And then from there, we pass that along. We pass our context and the question again to our reader model.

Here, I've used a BERT Q&A model. It doesn't have to be BERT. It can be any reader for question answering. And then that outputs the specific part of our context, which contains our answer. So in the previous example, we would output Paris, hopefully. Now, we had that DPR retriever model.

DPR is Facebook AI's Dense Passage Retriever. And it actually consists of two smaller encoders. We have a question encoder and a context encoder. Now, during training, what we do is we train both of these encoders in parallel. And we pass questions and their equivalent context to the question and context encoder, respectively.

And we optimize based on a contrastive loss function. So we compare the vectors from our question encoder and the context encoder. And we try to minimize the difference between them, the question and context pairs. And that's how we build the DPR model. That's why it works for question answering.

So it's not like our sentence transformers, where they are just a single model and they're used to identify very similar sentences. This is used to identify not very similar sentences, but very similar question and context pairs. And we will see the difference in a moment when we go through the code.

So let's get started with that. So come down here. And the first thing we probably want to do is initialize our context encoder and our question encoder from DPR. Now, we're going to use the Hugging Face transformers library for this. So if you don't have it already, you'll need to pip install transformers.

Now, if you pip installed sentence transformers, that does include transformers as a prerequisite. So if you installed that already, you should already have transformers as well. So the first thing we want to do is, from transformers, we want to import a fair few classes here. So we need both the model, or the encoder, and the tokenizer for both our context encoder and our question encoder.

So let's do the context encoder first. So write DPR context encoder tokenizer and DPR context encoder here. And then, as well as that, we also want the question encoder tokenizer and question encoder. So we write DPR question encoder tokenizer and also DPR question encoder. And that's all we need to import, so let's run that.

And then, we can go ahead and initialize our tokenizer model. So we have the context model. Now, this is going to be the DPR context encoder from pre-trained. If you've ever used HuggingFace transformers before, you should recognize this from pre-trained. We're just going to load in a model, which we can find on the HuggingFace.co/models website.

So if you go to that address, and you type in what I'm about to type in, you will find that it comes up. So I'm going to type facebook/dpr-ctx_encoder, so CTX for context. And we want single-nq-base. And I'm going to copy this, because we are going to use it again in just a moment, for our context tokenizer.

So context tokenizer equals DPR context encoder tokenizer from pre-trained again, from pre-trained. And then, again, we want the same model name in there. OK, so they are our context side of the model. But we also need to get the question side. So we've got our context encoder and tokenizer.

Now we want the question encoder and tokenizer. So I write question here and here. And we're just replacing everything where we've put CTX with question in here. So it's this question and this as well. And then in the model name, we are just replacing ctx again with question. It's pretty straightforward.

Now, I'm going to run that with you. If you haven't already got these cached on your machine, it can take a little bit of time, because we're downloading four sets of models and tokenizers. So it can take a little bit of time. Now, I already have them, so I don't need to wait for that.
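For reference, the initialization described above looks roughly like this; these are the standard Hugging Face DPR class names and Facebook's public checkpoints:

```python
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

# context (passage) side
ctx_model = DPRContextEncoder.from_pretrained(
    "facebook/dpr-ctx_encoder-single-nq-base")
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained(
    "facebook/dpr-ctx_encoder-single-nq-base")

# question side
question_model = DPRQuestionEncoder.from_pretrained(
    "facebook/dpr-question_encoder-single-nq-base")
question_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained(
    "facebook/dpr-question_encoder-single-nq-base")
```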

Now, the first thing I want to do is set up a set of questions and contexts. So I have three questions here. Well, you can read them. I'm not going to go through them all. And then we have contexts. Each question has a couple of contexts that are kind of relevant, but then just one that is actually the answer.

And inside here, I've also put in the questions themselves, because I want to prove that this is not just a sentence transformer where it's finding the most similar sentence. So it should, when we have these questions, it shouldn't return-- like for this one, it shouldn't return what is a best-selling sci-fi book.

It should instead return the best-selling sci-fi book is "Dune." So we should see that there is a difference between using DPR and using sentence transformers. So run that. And then what we want to do is tokenize everything. So we're going to tokenize our contexts. So I'm going to write xb_tokens.

And we want the context tokenizer. And then in here, we're going to pass our context. And then if you use HuggingFace transformers, you should recognize this as well, so maxLength here. So for this, I'm going to put 256. And I'll set padding equal to maxLength. We don't need to truncate anything, I don't think.

No, they're all very short. So this maxLength, we could even reduce it to something pretty small. But I'm going to leave it at that. So we'll pad up to the maxLength. And oh, the only thing we do need to include here is that we want to return PyTorch tensors.

So return tensors equals pt. OK. And then what we can do is we write xbEmbeddings. So this is how we build our context embeddings. xbEmbed, I'll just call it xb, is equal to model, the context model, ctxModel. And then in here, we pass our tokens, xbTokens, like that. And then for our questions, we do exactly the same thing.

But of course, we just replace the context part of it with questions. So here, we have the question tokenizer, we have questions, and we have the question model. And then here, I'm going to rename xb to xq, so our query. OK, let's have a look at what we get.
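Altogether, the tokenization and encoding steps come out roughly like this. The questions and contexts lists here are placeholders for the ones written in the notebook, and the torch.no_grad() wrapper is just an inference-time nicety, not something from the video:

```python
import torch

questions = ["What is the capital city of Australia?"]     # placeholder list
contexts = ["Canberra is the capital city of Australia."]  # placeholder list

# tokenize contexts (xb) and questions (xq)
xb_tokens = ctx_tokenizer(contexts, max_length=256,
                          padding="max_length", return_tensors="pt")
xq_tokens = question_tokenizer(questions, max_length=256,
                               padding="max_length", return_tensors="pt")

with torch.no_grad():
    xb = ctx_model(**xb_tokens)        # context embeddings
    xq = question_model(**xq_tokens)   # question embeddings

print(xq.pooler_output.shape)  # (number of questions, 768)
print(xb.pooler_output.shape)  # (number of contexts, 768)
```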

So first, let's have a look at what we have inside xq. So we'll see that we have a few different tensors in here. So I'll just write xq keys to see what we have. You see that we actually only have one output here, so the pooler output, which is fine because that's what we need.

So we write xq, pooler output. And these here are our embeddings. So we can write shape to see the shape of those embeddings. So we have three vectors, so the number of questions that we passed up here, and each one of those questions has been encoded into an embedding of 768 dimensions.

So that looks good. And we could do the same for xb if we want as well. It's exactly the same. So write xb, and we'll see the shape. Just this time, we have nine vectors because, obviously, we have more contexts than we do questions. So what we want to do now, I'm going to import torch.

So again, this should have been installed already with Hugging Face Transformers and also Sentence Transformers. So if you've gotten this far, you don't need to worry about installing this. What I'm going to do is write for i, and then the query vector, in xq.pooler_output.

So I'm going to enumerate that. So what I'm doing here is I'm going to run through. I'm going to create a loop to go through each query and to get the most similar vector from xb, so from our encoded context. So we write probs equals cosine similarity. So these are our similarity scores, doing exactly the same as we did before.

We're just going to write xq_vec, so the single vector at the moment. And from here, we just want xb.pooler_output. And from there, we want to get the argmax, so the maximum argument, so the highest score right here. So torch.argmax, and here we have probs.

And then what I'm going to do is I'm going to print the current question that we're asking, so questions i. Now I'm going to print the context which has been chosen from our argmax. So we just write context argmax. And then I'm just going to put this in here so we have a little bit of separation.
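The loop being described, continuing from the sketch above and reusing cos_sim from earlier, looks roughly like this:

```python
import torch
from sentence_transformers.util import cos_sim

for i, xq_vec in enumerate(xq.pooler_output):
    # similarity between this question vector and every context vector
    probs = cos_sim(xq_vec, xb.pooler_output)
    argmax = int(torch.argmax(probs))
    print(questions[i])
    print(contexts[argmax])
    print("---")
```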

Let's have a look at what we have. So we get: what is the capital city of Australia? Now remember, this exact question was also in our contexts. And it's not returning the exact sentence back to us, or the exact question back to us. It's actually returning the answer: Canberra is the capital city of Australia.

Now, for the second one, as we had hoped, the best-selling sci-fi book, Dune, has been chosen here. And then I just wanted to include this one as well to point out that it's not perfect. It doesn't always get things right. So in this case, it didn't find the correct answer of how many searches are performed on Google.

If we have a look at the context, so the correct answer should have been this one here. So Google serves more than 2 trillion queries annually. So it didn't get that one. But the other two it did get, despite having the actual questions in there as well. One of them here.

So again, I think that's really cool. And I think Q&A is something that has a lot of potential in many businesses around the world. So I think that's a very cool one to use. OK, so the next one I want to cover is a mix of language and also vision.

So recently, computer vision has had a few advances from the discipline of NLP. So in NLP, we've been using transformers for a reasonable amount of time now. And transformers have proven to be incredible models for language. And very recently, transformers have been applied to computer vision as well, which is very cool, I think.

And what we're finding is that a model or an architecture that can be used for language can also be used for computer vision. And I think that's super cool. So I want to show you one of those models, or briefly touch upon one of those models. We will go into it in more detail in a future article and video.

But for now, I'm just going to mention it. We have the Vision Transformer, which is very recent. I think the paper is January 2021, if I'm not wrong. And although we don't need a Vision Transformer to build an embedding for an image, I think the fact that we can use it is pretty cool.

And we can really do it very easily with Hugging Face Transformers, as we will see when we go through the code. Now, a very interesting use of this is to actually take two different encoders, both transformers. The text encoder is more of a traditional transformer, obviously. And the image encoder is our new Vision Transformer.

And we can actually train them together, like we did with DPR, the bi-encoder architecture. And what we can do is train it to put images and language, so language that describes an image, and map them to the same point in a vector space, or very close, at least. And that's what I've tried to visualize.

You can see on the screen now. So we have "two dogs running." We process that through our text encoder. And we get a very similar vector to if we took the picture of two dogs running and processed that through an image encoder, which would be our Vision Transformer. So I think that's-- I don't know.

For me, I think that's so cool. Now, I'm going to be using these three pictures that I got from Unsplash. If you want to see the photo credits, they will be either in the article, if you're reading the article, or they'll be in the video description, if not. And what I'm going to do is we have these three pictures.

We're going to encode those. And I'm also going to encode these three captions, and a few other captions as well. And we're going to see if they match. So we're going to perform a similarity, or a cosine similarity search across them, and see which pairs match the closest. And we'll see the results are pretty cool, in my opinion.

So let's jump into it. Again, we're going to be using Transformers. And we're going to be using a new model from OpenAI, which is for image and text. Similar to DPR, where DPR does question and context encoding, CLIP is using two encoders to do image and caption encoding, which is pretty cool.

So we're going to do, from Transformers, import CLIPProcessor. So I'm kind of viewing this processor as what we could call a tokenizer in typical language transformers. And then we want the CLIPModel. So this contains both encoders for us, so we don't have to mess around like we did with DPR, where we imported four classes.

Here, we're just importing the two. And then what we want to do is we'll just initialize those. So again, very similar. So we do CLIPModel from_pretrained. And in here, we write openai/clip-vit. So it's the Vision Transformer, this ViT you see here. It refers to the Vision Transformer which CLIP is using or is based on, at least for the vision aspect.

And we want to write base-patch32. So I mean, we'll go into it in more detail, but the patch part of that is referring to the way that the model almost tokenizes your images. It splits an image into different patches. And that's the patch size, the patch 32 there.

So we also want the processor, which again, we can kind of see that as akin to or equivalent to our tokenizer. And we're just doing this for language models. And again, I'll just copy that across. OK, so model processor looks good. Let me rerun it. OK, again, I already have it cached, so it won't download for me.
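Written out, that initialization is roughly this; these are the standard Hugging Face class names and OpenAI's public checkpoint:

```python
from transformers import CLIPProcessor, CLIPModel

model_id = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)
```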

And you'll get this thing here. Don't worry about it. It still works. Now I'm going to copy in the code I'm using to get the photos. So I have the photo URLs here. I'm using PIL to create the image object. And I'm using requests to actually get the image from the URL that we have here.

And then down here, I'm just going to show you what images we have. So I actually need to get matplotlib in there as well. So import matplotlib.pyplot as plt and numpy as well. OK, and we'll see those images that we saw before. So we have the puppy or dog running, the dog hiding behind the tree, and then we have the two dogs running.
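The image-loading code is roughly as follows; the URLs below are placeholders only, the real photo URLs and credits are in the article linked in the description:

```python
import requests
from PIL import Image
import matplotlib.pyplot as plt

urls = [
    "https://example.com/dog-running.jpg",       # placeholder URLs; see the
    "https://example.com/dog-behind-tree.jpg",   # article for the real ones
    "https://example.com/two-dogs-running.jpg",
]
images = [Image.open(requests.get(url, stream=True).raw) for url in urls]

for image in images:
    plt.imshow(image)
    plt.show()
```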

OK, so they are our images, and we've stored them in images here, OK? And the next part are captions. So I've just written these six captions. The first three are actually the captions, and then the other three I just made up. I included trees and park in there because they look like, well, there's a tree here and there's a park here.

So that's to try and make it a little bit more difficult. But I mean, they're reasonably straightforward still, I think. And then to create our inputs, which you can see as something like tokens, we do inputs = processor, similar to our tokenizer again. And we have a few inputs here. So we have the text, and we want to input our captions.

And then we also have images. And of course, we just input our images. And then we want to set return_tensors equal to pt, and we set padding to true, OK? And let me have a quick look at what we have here.
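That processing step looks roughly like this; only three of the six captions are shown here, as stand-ins:

```python
captions = [
    "a dog running",                # three of the six captions; the other three
    "a dog hiding behind a tree",   # are distractors mentioning trees and a park
    "two dogs running",
]

# tokenize the captions and preprocess the images in one call
inputs = processor(text=captions, images=images,
                   return_tensors="pt", padding=True)

print(inputs.keys())  # input_ids, attention_mask, pixel_values
```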

So we have our input IDs, pixel values, and so on. So input IDs-- we also have the attention mask here as well. So those first two are for our text, and then pixel values are for the images. And now what we want to do is create our encodings. So in here, because we're using the CLIP model, we're actually going to perform the encodings.

And it's also going to do the whole similarity checking for us as well, and identify which images and captions are the closest pairs. Or what it's going to do is go through each image and find the caption that it believes belongs to it. So like before, we just write inputs here.

And I think maybe let's have a look at what we have in our outputs. So we can see we'll have a few things here that I think are pretty useful. So we have the logits per image and per text. So for these, for each of our texts, we can use this to get the most probable image that is assigned to each caption.

And in logits per image, we can use these to find the most probable caption for each image. And then-- so what we were doing before where we were just extracting the embeddings, we can also do that. And maybe I'll just copy in the code for that as well. So we have the text embeddings here.

So we can extract those if we want. And we also have the image embeddings in here as well. And then a little further down, we have the logits somewhere, and the pooler output here. Yeah. So we have the pooler outputs and the logits. OK, so let me just close that. And I do believe we also have a few more.

So let me just show you those quickly. Yeah, we have a few tensors there as well, vision model output, text model output as well. Now what we'll do is I'm going to paste this code in. And so here, I'm going to go for image in each image. I'm going to iterate through.

I'm going to get the argmax, so the caption that it believes or is predicted for that image. And then we're going to show it. And we're going to print both out. Let's see if they match. Oh, so I'm getting ahead of myself there. So we also need the probabilities, so probs equals outputs.

And we want the logits per image. And we'll take the argmax while we're here. So dim equals 1 for that. And let's have a look at what we get. We'll see that we get this. So it's predicting caption 2, caption 0, and then caption 1 for our three images.
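In code, continuing from the sketches above, that step is roughly:

```python
# run both encoders; CLIP also computes the image-caption similarity logits
outputs = model(**inputs)

# most probable caption index for each image (named probs as in the video,
# although after the argmax these are really indices)
probs = outputs.logits_per_image.argmax(dim=1)
print(probs)

for i, image in enumerate(images):
    plt.imshow(image)
    plt.show()
    print(captions[int(probs[i])])
```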

Let's look through that. And we'll see we get a dog running. Cool. A dog hiding behind a tree. And then two dogs running as well, which I don't know. For me, maybe because I'm usually working with language, I think seeing both language and images together is-- I don't know-- really cool.

Super-- I don't know-- fascinating that it actually works like that so easily. So another thing that I want to show you very quickly-- I'm just going to copy the code in, because I don't want to go through all of it. It'll take a while. So we just have the embeddings.

So these are the embeddings if we wanted to extract them and do what we did before with them. Or if you wanted to take these embeddings, put them in a vector index somewhere, a vector database. And we can get our query. So I'm going to do a dog hiding behind a tree.

We can get the context-- or not the context, the images, the image embeddings. Again, like before, we do the similarity. So the cosine similarity, we get the highest one is the second one here. So it's looking pretty good. And from there, we get our prediction, which is argmax, so we'll take number 1.
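Pieced together, that last part looks roughly like this; get_text_features is the standard Hugging Face way to embed just a text query on its own, and the query string is the one used in the video:

```python
import torch
from sentence_transformers.util import cos_sim

text_emb = outputs.text_embeds    # caption embeddings, (num_captions, 512)
img_emb = outputs.image_embeds    # image embeddings, (num_images, 512)

# embed the text query by itself
query = processor(text=["a dog hiding behind a tree"],
                  return_tensors="pt", padding=True)
with torch.no_grad():
    query_emb = model.get_text_features(**query)

scores = cos_sim(query_emb, img_emb)
pred = int(scores.argmax())       # index 1 in the video: the dog behind the tree

plt.imshow(images[pred])
plt.show()
```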

And let's have a look at what our prediction is then. So we will plot that. We'll show you the image again. We have prediction. So it's shown as the dog hiding behind the tree for our query, which is a dog hiding behind a tree. So again, super cool. Now, that's it for this video.

We've, I think, covered quite a lot of embedding methods. We've had a look at some introduction to dense vectors with Word2Vec and where it came from and how it quickly evolved. And we've had a look at sentence embeddings and sentence transformers, moved on to Q&A with Facebook AI's DPR.

And now we've had a look at the new Vision Transformer and how we can use that with other transformer models to build these really cool cross-media embeddings that we can compare, which has blown me away a little bit. Now, that's it for this video. But like I said, this is the first video and article in what will be a series on embeddings.

So there's a lot more to come. But for now, thank you very much for watching. And I'll see you in the next one. Bye.