back to index

Intro to Dense Vectors for NLP and Vision


Chapters

0:00 Intro
1:50 Why Dense Vectors?
3:55 Word2vec and Representing Meaning
8:40 Sentence Transformers
9:58 Sentence Transformers in Python
15:08 Question-Answering
18:18 DPR in Python
29:55 Vision Transformers
33:22 OpenAI's CLIP in Python
42:49 Review and What's Next

Whisper Transcript | Transcript Only Page

00:00:00.000 | Hi, and welcome to this video.
00:00:02.240 | We're going to start a new series on embedding methods
00:00:07.180 | for NLP, but we're also going to have a look at other embedding
00:00:11.160 | methods as well.
00:00:12.360 | So mainly, we're going to be focusing
00:00:14.600 | on language-dense embeddings.
00:00:17.920 | We might have a look at sparse embeddings,
00:00:19.640 | but we've already covered that before.
00:00:21.520 | So I'm not 100% sure on that.
00:00:23.920 | But definitely dense embeddings.
00:00:26.000 | We're going to also have a look at how
00:00:28.280 | we can build dense embeddings for images and maybe
00:00:31.080 | some other media formats as well.
00:00:33.880 | So I think this series of articles and videos
00:00:38.200 | will be pretty exciting.
00:00:39.960 | Now, what I want to start with is having a look at--
00:00:45.960 | well, basically, quickly introducing
00:00:47.800 | what dense vectors and dense embeddings are.
00:00:50.920 | And whilst we do that, I'm going to refer a lot to Word2Vec
00:00:55.080 | because that's the first widely adopted version of this.
00:01:01.320 | And then we're going to have a look at sentence embeddings,
00:01:06.000 | so how we can build sentence embeddings using the Sentence
00:01:08.720 | Transformers library.
00:01:09.600 | And we're going to go through the code for that as well.
00:01:11.940 | Then we're going to have a look at Q&A.
00:01:13.760 | So Q&A is quite interesting, I think.
00:01:16.360 | And we're going to focus on Facebook AI's Dense Passage
00:01:20.960 | Retriever for that.
00:01:23.800 | And again, we are going to go through the code for that
00:01:26.240 | as well.
00:01:27.600 | And then another thing that I think is quite exciting
00:01:31.120 | is image and text embeddings.
00:01:34.200 | So to do that, we're going to have a look
00:01:36.920 | at the new Vision Transformer.
00:01:38.880 | So I think all of that's pretty cool.
00:01:42.400 | So let's jump straight into it.
00:01:51.360 | So I think the first question we want to ask
00:01:54.200 | is, why would we use dense vectors in the first place?
00:01:58.560 | Now, we have two options when it comes to representing text.
00:02:01.320 | And that is we can represent it as a dense vector
00:02:04.920 | or as a sparse vector.
00:02:06.560 | Now, sparse vectors are good if we're
00:02:08.880 | going to focus on the syntax and the words that we're comparing.
00:02:12.680 | So if we had two sentences, Bill ran from the giraffe
00:02:18.280 | towards the dolphin.
00:02:20.400 | And then we said the opposite.
00:02:22.040 | So Bill ran from the dolphin towards the giraffe.
00:02:26.240 | Both of these sentences have the exact same words in it.
00:02:30.120 | But they have different meanings, right?
00:02:32.080 | So in one of them, Bill is running away from a giraffe.
00:02:34.520 | And the other one is running away from a dolphin.
00:02:36.600 | Now, when it comes to sparse vector representation,
00:02:41.520 | we'd find it difficult to correctly identify these
00:02:46.440 | as not being the same sentence.
00:02:48.720 | Because we tend to represent words one by one
00:02:53.200 | in some sort of one-hot encoding,
00:02:54.920 | and then compare those vectors.
00:02:56.600 | Now, we can also use n-grams so we can put two words together.
00:02:59.520 | And in that case, we would identify
00:03:01.360 | that there is a difference.
00:03:02.800 | But it's not that effective.
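As a quick illustration of that point, here is a hypothetical sketch (my own, using scikit-learn, which isn't part of the video) showing that a plain bag-of-words view gives those two sentences identical vectors:

```python
# Hypothetical sketch (not from the video): a bag-of-words (sparse) view
# cannot distinguish the two sentences, because they contain the same words.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "Bill ran from the giraffe towards the dolphin",
    "Bill ran from the dolphin towards the giraffe",
]

bow = CountVectorizer().fit_transform(sentences)  # word counts, order ignored
print(cosine_similarity(bow[0], bow[1]))          # [[1.0]] -- identical sparse vectors
```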
00:03:05.360 | And then we also want to consider
00:03:07.240 | where we have different words for the same meaning.
00:03:11.680 | So for example, if you want to say hello to someone,
00:03:15.840 | you say hi, hello, hey.
00:03:18.040 | I'm sure there's a million other ways of saying it.
00:03:20.680 | And sparse vector representations
00:03:24.240 | would view these as different words.
00:03:27.240 | So whereas sparse vectors are very good for comparing
00:03:31.160 | the syntax of text, it's not very good
00:03:34.840 | at comparing the semantics or the meaning behind text.
00:03:38.160 | And that's where we want to start using dense vectors.
00:03:41.760 | So we can see dense vectors as pretty much
00:03:45.840 | a numerical representation of the semantic meaning
00:03:51.200 | behind some text.
00:03:53.840 | And we can actually visualize a lot of these relationships.
00:03:57.760 | So around 2013, we had Word2Vec,
00:04:01.880 | which was the first very popular dense vector
00:04:06.280 | embedding for words.
00:04:08.320 | And around that time, we had a lot
00:04:10.720 | of people showing that you had things like this,
00:04:13.320 | what you can see on the screen, where, for example,
00:04:15.840 | we'd have days of the week clustered together,
00:04:17.920 | or we would have months or other related abstract topics
00:04:24.720 | represented or clustered together
00:04:26.720 | in our highly dimensional space.
00:04:30.120 | Now, of course, this is a 3D graph.
00:04:32.760 | When we're actually building these dense vectors,
00:04:34.800 | we have many more dimensions, more
00:04:38.480 | towards the 500, 700, 800 or so.
00:04:44.280 | So this is obviously a simplified version of that.
00:04:48.040 | And not only will we find that similar words are
00:04:51.360 | clustered in the same area, but we also
00:04:53.160 | find that we can perform what I think is best described
00:04:57.680 | as arithmetic on words.
00:05:01.480 | So this is a very popular example
00:05:05.000 | that came from around the same time as Word2Vec.
00:05:07.560 | If you want references and everything,
00:05:09.760 | you'll be able to find them at the bottom of the article
00:05:12.360 | that this video is attached to.
00:05:14.280 | If you need the article, it's in the description.
00:05:16.600 | Now, what we'd find is if we took the vector for king,
00:05:21.200 | subtracted the vector for man, added the vector for woman,
00:05:26.340 | we would not get the exact vector for queen,
00:05:29.640 | but we'd get very, very close.
00:05:32.120 | So the nearest vector would be the vector for queen.
00:05:35.320 | And I mean, I think that's super interesting.
00:05:38.800 | And this is from the start of when we
00:05:41.480 | had these vector embeddings.
00:05:43.160 | So this is eight years ago now.
00:05:46.880 | And they've just gotten a lot more advanced in that time.
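If you want to try that arithmetic yourself, here's a minimal sketch using pretrained vectors from gensim's downloader (an assumption on my part; the video doesn't use gensim):

```python
# Hypothetical sketch: the "king - man + woman ~= queen" arithmetic with gensim.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")  # any pretrained word-vector set works here

# positive vectors are added, negative vectors are subtracted
result = wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # nearest neighbour is "queen"
```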
00:05:51.000 | So as I said, these examples are coming
00:05:54.040 | from the era of the Word2Vec.
00:05:56.780 | And Word2Vec was one of the earliest versions
00:06:01.880 | of these dense representations.
00:06:04.100 | And going from the name, we know it's Word2Vector.
00:06:06.920 | So we're converting words into vectors.
00:06:10.520 | Now, how this worked, there were two different methods.
00:06:14.240 | We had the skip-gram method, which
00:06:15.880 | is what you can see now, which is where we take one word,
00:06:19.840 | and we would take the sparse vector encoding for that word
00:06:24.560 | on the left that you can see.
00:06:26.480 | And then we would, in the vector on the right,
00:06:31.440 | we would have a one-hot encoding for all of the words that
00:06:35.920 | surround that first word.
00:06:38.680 | So in this case, we have fox.
00:06:40.080 | And that's surrounded by the words quick, brown,
00:06:42.240 | jumped, and over.
00:06:44.320 | And this would be run through a simple feed-forward neural
00:06:47.760 | network.
00:06:48.760 | And we would go through this compression stage.
00:06:52.200 | And it is within that compression stage
00:06:54.080 | that we would build our dense vector representation for fox.
00:07:00.120 | And that would simply be a neural network
00:07:04.320 | being optimized to go from fox and predict
00:07:07.520 | the words quick, brown, jumped, and over.
00:07:09.200 | And this would be done many times over for every time
00:07:13.000 | that word appears in a big corpus of text
00:07:17.440 | with its multiple contexts.
00:07:20.300 | And what that does is it just builds up
00:07:21.920 | like a numerical representation of that word.
00:07:26.600 | And then there was the other approach,
00:07:28.260 | which Word2Vec also used, which is
00:07:30.560 | called continuous bag of words.
00:07:32.520 | And it's basically the same.
00:07:34.320 | We're just swapping the order of the transformation.
00:07:38.280 | So on the left, we have all of our context words.
00:07:41.160 | And then on the right, we would have the word
00:07:43.800 | that we're focusing on and we're building the embedding for.
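To make the skip-gram versus CBOW distinction concrete, here's a minimal sketch with gensim's Word2Vec implementation (again an assumption; the video doesn't use gensim, and any tokenized corpus would do):

```python
# Hypothetical sketch: training Word2Vec both ways with gensim.
from gensim.models import Word2Vec

corpus = [
    ["the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"],
    # ... many more tokenized sentences in practice
]

# sg=1 -> skip-gram: predict the surrounding words from the centre word
skipgram = Word2Vec(corpus, vector_size=300, window=2, sg=1, min_count=1)

# sg=0 -> CBOW: predict the centre word from its surrounding words
cbow = Word2Vec(corpus, vector_size=300, window=2, sg=0, min_count=1)

print(skipgram.wv["fox"].shape)  # (300,) -- the dense vector for "fox"
```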
00:07:46.860 | Now, Word2Vec really seemed to act
00:07:51.120 | as the catalyst for a lot of other vector embeddings.
00:07:56.080 | From Word2Vec, for example, we have
00:07:58.240 | like sentence2vec and doc2vec.
00:08:01.120 | We even had this one that I found
00:08:03.640 | when I was researching for this, which
00:08:05.600 | is called batter-pitcher-2vec, which
00:08:08.880 | is vector embeddings for Major League Baseball players.
00:08:14.000 | So you've got a lot of different 2Vec methods
00:08:18.280 | that came out of the woodworks after the original Word2Vec.
00:08:22.880 | And then we also had other ones like GloVe as well,
00:08:25.320 | which is worth a mention.
00:08:27.520 | Now, nowadays, Word2Vec is pretty outdated
00:08:30.760 | and we wouldn't really go ahead and use that.
00:08:32.680 | So I'm not going to spend any more time on it.
00:08:34.840 | And we'll just move on to having a look at sentence similarity.
00:08:39.680 | So you can see sentence similarity
00:08:41.160 | as very similar to Word2Vec in that we're building
00:08:46.200 | these dense representations.
00:08:48.040 | But rather than representing a single word,
00:08:49.840 | we're representing a sentence or a paragraph.
00:08:53.520 | And the way that this would be done
00:08:56.000 | is using the current transform models.
00:08:58.520 | So BERT was the first example of doing this.
00:09:03.640 | And BERT by itself, you can build embeddings.
00:09:07.880 | But it's based on a token-by-token embedding.
00:09:10.640 | So within BERT, you have all of these different embeddings,
00:09:13.360 | but they each represent a single token.
00:09:15.800 | So what the guys at Sentence Transformers did
00:09:20.200 | is they trained like a Siamese.
00:09:23.480 | They called it Siamese BERT, where they had two BERTs.
00:09:26.360 | And they were trained in parallel.
00:09:29.160 | And they output a single vector for the full input that
00:09:35.040 | was input into the model, which was around 128 tokens at max.
00:09:42.840 | Now, this allowed us to build a single vector for sentences.
00:09:48.120 | And that's very good, because then we
00:09:50.240 | can start comparing sentences and paragraphs.
00:09:54.480 | So let's have a look at how we can actually
00:09:57.680 | build that in code.
00:10:00.560 | So the first thing you'll need to do
00:10:02.480 | is pip install Sentence Transformers.
00:10:06.120 | Now, I've already done this, so I'm not going to rerun it.
00:10:10.160 | But if you don't have Sentence Transformers,
00:10:12.000 | you will need to install it.
00:10:16.120 | And then after that, all we want to do is we
00:10:18.840 | want to write from Sentence Transformers.
00:10:22.600 | We want to import the Sentence Transformer object.
00:10:30.360 | And from there, we can just initialize our model.
00:10:33.120 | Super easy.
00:10:33.800 | We just write model equals SentenceTransformer.
00:10:38.040 | And then in here, we just need to type our model name.
00:10:40.800 | Now, if you Google Sentence Transformers or SBERT,
00:10:45.320 | you will find the web page for this library.
00:10:49.320 | And it has loads of different models on there.
00:10:53.080 | One of the highest performing ones
00:10:54.760 | that I found on there at the moment
00:10:57.000 | is called all-mpnet-base-v2.
00:11:03.880 | So we just execute that.
00:11:05.240 | And usually, you will need to download the model.
00:11:11.880 | So you will see a load of loading bars or progress bars.
00:11:16.040 | That's fine.
00:11:16.640 | It's just downloading the model for you.
00:11:18.320 | I already have it downloaded, so I don't need to run it again.
00:11:21.840 | And then what we need is a set of sentences
00:11:24.400 | so that we can actually compare what we--
00:11:28.640 | we can compare these and look at what the Sentence Transformer
00:11:33.520 | believes is the most similar.
00:11:35.720 | Now, all of these are completely random,
00:11:38.600 | but we have this one here.
00:11:40.720 | "The bees decided to have a mutiny against their queen."
00:11:43.560 | And I just rewrote that in a way that we
00:11:47.920 | don't have any matching words between the two sentences.
00:11:51.760 | So we have "flying, singing insects
00:11:53.960 | rebelled in opposition to the matriarch."
00:11:56.440 | Now, the meaning there is pretty much the same.
00:11:59.080 | Maybe not exactly the same, but pretty much.
00:12:02.440 | But there are no shared words other than, I think,
00:12:05.920 | "to" and "the."
00:12:08.360 | Yeah, "to" and "the."
00:12:10.560 | So in terms of sparse vector encoding,
00:12:15.840 | this wouldn't score very well.
00:12:17.080 | But we'll see that with dense vectors, it will.
00:12:23.920 | So the first thing we want to do is encode our embeddings.
00:12:30.120 | So we'll write embeddings = model.encode(sentences).
00:12:39.240 | And then let's have a look at what that outputs, or at least
00:12:43.280 | the shape of what it outputs.
00:12:44.440 | And we see that we get seven vectors, or seven embeddings,
00:12:50.920 | each one with a dimensionality of 768 values.
00:12:56.800 | And we can use cosine similarity to compare all of these.
00:13:01.120 | Now, the easiest way to do this is we just
00:13:04.800 | import cosine similarity from Sentence Transformers.
00:13:10.000 | So from sentence_transformers.util import cos_sim.
00:13:17.760 | And then what we do is calculate the cosine similarity scores
00:13:23.360 | between all of our vectors.
00:13:26.040 | Now, I want to compare the final item here,
00:13:32.440 | so this last one, against the rest of them,
00:13:34.480 | because I want to see that this is the most similar.
00:13:38.240 | So I'm just going to select that, so write embeddings[-1].
00:13:43.880 | And we're just taking the last vector.
00:13:46.880 | And then I want embeddings[:-1], so the remaining of them,
00:13:50.400 | so all the vectors except the last one.
00:13:53.680 | And let's just have a look.
00:13:55.480 | We will be able to see that we have something
00:13:59.000 | that seems pretty obvious.
00:14:00.440 | So this one here is the most similar, by quite a bit.
00:14:04.960 | The next closest is only 0.19 here.
00:14:07.720 | So it's definitely calculating that
00:14:09.920 | as a lot more similar than the other ones.
00:14:13.040 | So if I take the argmax of that, we should see--
00:14:22.120 | so 3, and take the item.
00:14:25.800 | And if we go Sentences, and we put that, Sentences,
00:14:31.760 | and we index number 3, we see, OK,
00:14:35.520 | the bees decided to have a mutiny against their queen.
00:14:38.320 | So it correctly identified that these two, this and this,
00:14:46.480 | are far more similar than the rest of the sentences, which
00:14:50.360 | I think is very cool, because there's not even
00:14:53.200 | any similar words in there.
00:14:54.880 | And even as a human, it's kind of, you know,
00:14:57.760 | the bees, flying, stinging insects, and matriarch
00:15:02.360 | and queen, you know, it's not obvious.
00:15:06.760 | So I think that's really cool.
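Putting those steps together, a minimal end-to-end sketch looks something like this (the sentence list here is shortened and partly made up; only the two "bee" sentences are from the video):

```python
# Condensed sketch of the sentence-similarity walkthrough above.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-mpnet-base-v2")

sentences = [
    "it caught him off guard that space smelled of seared steak",      # filler example
    "she could not decide between painting her teeth or brushing her nails",  # filler example
    "the bees decided to have a mutiny against their queen",
    "flying stinging insects rebelled in opposition to the matriarch",
]

embeddings = model.encode(sentences)               # shape: (4, 768)
scores = cos_sim(embeddings[-1], embeddings[:-1])  # last sentence vs. the rest
best = scores.argmax().item()
print(sentences[best])  # -> "the bees decided to have a mutiny against their queen"
```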
00:15:08.920 | Another popular use of embeddings
00:15:12.000 | for language applications is question answering.
00:15:15.560 | Now, question answering can be done
00:15:18.280 | with a few different, let's say, architectures.
00:15:23.200 | And one of the, I think, most popular ones
00:15:25.840 | is open domain question answering.
00:15:28.200 | Now, the structure of open domain question answering
00:15:32.080 | is what you can see on the screen at the moment.
00:15:35.480 | So we ask a question that gets passed to something
00:15:39.640 | called a retriever model.
00:15:41.480 | The retriever model contains a question encoder,
00:15:45.600 | which encodes the question, passes it along
00:15:47.640 | to our index database.
00:15:50.880 | And within there, we will have a set of contexts.
00:15:54.040 | Now, contexts are usually a paragraph
00:15:57.000 | that contains the answer to our question.
00:15:59.760 | And DPR both encodes our questions
00:16:03.160 | and encodes our context into the same vector space.
00:16:08.120 | So what we would get is, for example,
00:16:10.720 | if we had a question, what is the capital of France?
00:16:16.000 | And then we also had a context.
00:16:17.440 | The capital of France is Paris.
00:16:20.000 | DPR would attempt to encode both of those
00:16:22.440 | into the same vector space, or very, very close by.
00:16:26.960 | So the vectors produced by both of those
00:16:28.800 | would be very, very similar.
00:16:30.200 | So all we're doing in that index database
00:16:35.840 | is finding the most similar embeddings
00:16:39.080 | to our question embedding.
00:16:42.280 | And then from there, we pass that along.
00:16:44.240 | We pass our context and the question again
00:16:46.280 | to our reader model.
00:16:48.680 | Here, I've used a BERT Q&A model.
00:16:50.960 | It doesn't have to be BERT.
00:16:52.080 | It can be any reader for question answering.
00:16:55.520 | And then that outputs the specific part
00:16:57.920 | of our context, which contains our answer.
00:17:00.360 | So in the previous example, we would output Paris, hopefully.
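For the reader stage, a rough sketch with the transformers question-answering pipeline would look like the following (using the pipeline's default extractive-QA model, which is my own shortcut rather than the exact reader shown in the diagram):

```python
# Hypothetical sketch of the reader stage: extract the answer span from a context.
from transformers import pipeline

reader = pipeline("question-answering")  # default extractive-QA model

result = reader(
    question="What is the capital of France?",
    context="The capital of France is Paris.",
)
print(result["answer"])  # -> "Paris"
```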
00:17:04.560 | Now, we had that DPR retriever model.
00:17:07.040 | DPR is Facebook AI's Dense Passage Retriever.
00:17:11.080 | And it actually consists of two smaller encoders.
00:17:15.480 | We have a question encoder and a context encoder.
00:17:18.040 | Now, during training, what we do is
00:17:20.840 | we train both of these encoders in parallel.
00:17:24.200 | And we pass questions and their equivalent context
00:17:27.960 | to the question and context encoder, respectively.
00:17:31.720 | And we optimize based on a contrastive loss function.
00:17:36.600 | So we compare the vectors from our question encoder
00:17:40.520 | and the context encoder.
00:17:41.720 | And we try to minimize the difference between them,
00:17:44.640 | the question and context pairs.
00:17:47.160 | And that's how we build the DPR model.
00:17:50.680 | That's why it works for question answering.
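That contrastive objective, sketched very roughly (this is my own simplified version with in-batch negatives, not Facebook AI's actual training code), looks something like this:

```python
# Rough sketch of an in-batch contrastive loss for question/context pairs.
import torch
import torch.nn.functional as F

def contrastive_loss(question_vecs: torch.Tensor, context_vecs: torch.Tensor) -> torch.Tensor:
    # question_vecs, context_vecs: (batch, dim); row i of each is a true pair
    scores = question_vecs @ context_vecs.T   # similarity of every question with every context
    labels = torch.arange(scores.size(0))     # the matching context sits on the diagonal
    return F.cross_entropy(scores, labels)    # pull true pairs together, push the rest apart
```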
00:17:53.160 | So it's not like our sentence transformers,
00:17:56.560 | where they are just a single model.
00:18:00.040 | And they're used to identify very similar sentences.
00:18:04.240 | This is used to identify not very similar sentences,
00:18:08.600 | but very similar question and context pairs.
00:18:11.920 | And we will see the difference in a moment
00:18:14.880 | when we go through the code.
00:18:18.840 | So let's get started with that.
00:18:21.320 | So come down here.
00:18:23.520 | And the first thing we probably want to do
00:18:25.600 | is initialize our context encoder and our question
00:18:33.720 | encoder from DPR.
00:18:35.440 | Now, we're going to use the Hugging Face transformers
00:18:37.520 | library for this.
00:18:39.480 | So if you do not already, you'd have
00:18:42.880 | to pip install transformers.
00:18:46.520 | Now, if you pip installed sentence transformers,
00:18:50.120 | that does include transformers as a prerequisite.
00:18:53.400 | So if you installed that already,
00:18:55.800 | you should already have transformers as well.
00:18:57.640 | So first thing we want to do is, from transformers,
00:19:04.120 | we want to import a fair few classes here.
00:19:07.920 | So we need both the model, or the encoder,
00:19:10.760 | and tokenizer for each for both our context encoder
00:19:15.600 | and our question encoder.
00:19:16.960 | So let's do the context encoder first.
00:19:20.120 | So write DPRContextEncoderTokenizer and DPRContextEncoder
00:19:27.120 | here.
00:19:30.200 | And then, as well as that, we also
00:19:36.320 | want the question encoder tokenizer and question encoder.
00:19:39.880 | So we write DPRQuestionEncoderTokenizer and also
00:19:46.480 | DPRQuestionEncoder.
00:19:52.200 | And that's all we need to import, so let's run that.
00:19:55.400 | And then, we can go ahead and initialize our tokenizer model.
00:20:01.320 | So we have the context model.
00:20:04.960 | Now, this is going to be the DPRContextEncoder
00:20:11.560 | from_pretrained method.
00:20:12.480 | If you've ever used HuggingFace transformers before,
00:20:15.280 | you should recognize this from_pretrained.
00:20:17.200 | We're just going to load in a model, which
00:20:21.080 | we can find on the HuggingFace.co/models website.
00:20:26.200 | So if you go to that address, and you type
00:20:27.960 | in what I'm about to type in, you will find that it comes up.
00:20:31.360 | So I'm going to type facebook/dpr-ctx_encoder, so CTX for context encoder.
00:20:45.840 | And we want single-nq-base, so the full name is facebook/dpr-ctx_encoder-single-nq-base.
00:20:49.920 | And I'm going to copy this, because we
00:20:52.280 | are going to use it again in just a moment,
00:20:55.560 | for our context tokenizer.
00:20:58.640 | So context_tokenizer equals DPRContextEncoderTokenizer
00:21:06.200 | from_pretrained again.
00:21:10.880 | And then, again, we want the same model name in there.
00:21:16.080 | OK, so they are our context side of the model.
00:21:27.680 | But we also need to get the question side.
00:21:32.760 | So we've got our context encoder and tokenizer.
00:21:35.800 | Now we want to question encoder and tokenizer.
00:21:40.320 | So I write question here and here.
00:21:43.520 | And we're just replacing everything
00:21:45.160 | where we've put CTX with question in here.
00:21:48.320 | So it's this question and this as well.
00:21:57.120 | And then in the model, we are just replacing CTX again
00:22:01.840 | with question.
00:22:04.040 | It's pretty straightforward.
00:22:05.520 | Now, I'm going to run that with you.
00:22:11.880 | If you haven't already got these cached on your machine,
00:22:15.280 | it can take a little bit of time,
00:22:16.700 | because we're downloading four sets of models and tokenizers.
00:22:21.200 | So it can take a little bit of time.
00:22:25.480 | Now, I already have them, so I don't need to wait for that.
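If you're following along, the initialization described above comes out roughly like this (model names as given in the video; the variable names are my own):

```python
# Sketch of initializing the DPR context and question encoders plus their tokenizers.
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

ctx_model = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

question_model = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
question_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
```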
00:22:29.880 | Now, first thing I want to do is set up
00:22:35.120 | a set of questions and context.
00:22:37.440 | So I have three questions here.
00:22:39.680 | Well, you can read them.
00:22:40.680 | I'm not going to go through them all.
00:22:42.220 | And then we have context.
00:22:43.600 | Each question has a couple of contexts
00:22:46.800 | that are kind of relevant, but then just one
00:22:50.160 | that is actually the answer.
00:22:52.760 | And inside here, I've also put in the questions themselves,
00:22:59.040 | because I want to prove that this is not just a sentence
00:23:02.920 | transformer where it's finding the most similar sentence.
00:23:08.120 | So it should, when we have these questions,
00:23:11.520 | it shouldn't return-- like for this one,
00:23:13.600 | it shouldn't return what is a best-selling sci-fi book.
00:23:16.400 | It should instead return the best-selling sci-fi book
00:23:19.840 | is "Doom."
00:23:20.740 | So we should see that there is a difference between using DPR
00:23:25.040 | and using sentence transforms.
00:23:29.000 | So run that.
00:23:30.640 | And then what we want to do is tokenize everything.
00:23:33.080 | So we're going to tokenize our context.
00:23:37.960 | So I'm going to write xb_tokens.
00:23:40.640 | And we want the context tokenizer.
00:23:45.400 | And then in here, we're going to pass our context.
00:23:48.840 | And then if you use HuggingFace transformers,
00:23:53.120 | you should recognize this as well, so maxLength here.
00:23:57.800 | So for this, I'm going to put 256.
00:24:01.280 | And I'll set padding equal to maxLength.
00:24:04.680 | We don't need to truncate anything, I don't think.
00:24:08.640 | No, they're all very short.
00:24:10.600 | So this maxLength, we could even reduce it
00:24:13.600 | to something pretty small.
00:24:14.960 | But I'm going to leave it at that.
00:24:19.520 | So we'll pad up to the max_length.
00:24:21.400 | And oh, the only thing we do need to include here
00:24:25.000 | is that we want to return PyTorch tensors.
00:24:27.600 | So return_tensors equals "pt".
00:24:32.800 | And then what we can do is build our context embeddings.
00:24:38.920 | So this is how we encode our contexts.
00:24:43.720 | I'll just call the output xb, and it is
00:24:47.640 | equal to the context model, ctx_model.
00:24:53.320 | And then in here, we pass our tokens, xb_tokens, like that.
00:24:58.840 | And then for our questions, we do exactly the same thing.
00:25:02.720 | But of course, we just replace the context part of it
00:25:05.760 | with questions.
00:25:06.960 | So here, we have the question tokenizer, we have questions,
00:25:14.680 | and we have the question model.
00:25:17.920 | And then here, I'm going to rename xb to xq, so our query.
00:25:23.200 | OK, let's have a look at what we get.
00:25:28.480 | So first, let's have a look at what we have inside xq.
00:25:34.880 | So we'll see that we have a few different tensors in here.
00:25:38.320 | So I'll just write xq.keys() to see what we have.
00:25:40.600 | You see that we actually only have one output here,
00:25:48.320 | so the pooler output, which is fine
00:25:49.960 | because that's what we need.
00:25:52.160 | So we write xq.pooler_output.
00:25:55.440 | And these here are our embeddings.
00:26:00.200 | So we can write shape to see the shape of those embeddings.
00:26:02.960 | So we have three vectors.
00:26:04.720 | So the number of questions that we passed up here,
00:26:09.120 | and each one of those questions has
00:26:11.080 | been encoded into an embedding of 768 dimensions.
00:26:17.920 | So that looks good.
00:26:19.720 | And we could do the same for xb if we want as well.
00:26:23.200 | It's exactly the same.
00:26:24.480 | So write xb, and we'll see the shape.
00:26:29.000 | Just at this time, we have nine vectors
00:26:31.000 | because, obviously, we have more context than we do questions.
00:26:35.280 | So what we want to do now, I'm going to import Torch.
00:26:39.960 | So again, this should have been installed already
00:26:43.280 | with Hugging Face Transformers and also Sentence Transformers.
00:26:47.920 | So if you've gotten this far, you
00:26:49.280 | don't need to worry about installing this.
00:26:52.920 | What I'm going to do is go for i, and then the query vector
00:26:58.600 | in xq.pooler_output.
00:27:07.880 | So I'm going to enumerate that.
00:27:09.120 | So what I'm doing here is I'm going to run through.
00:27:14.680 | I'm going to create a loop to go through each query
00:27:18.200 | and to get the most similar vector from xb,
00:27:24.760 | so from our encoded context.
00:27:28.680 | So we write probs equals cosine similarity.
00:27:32.640 | So these are our similarity scores,
00:27:34.680 | doing exactly the same as we did before.
00:27:37.600 | We're still going to write xq vec, so the single vector
00:27:41.840 | at the moment.
00:27:42.640 | And from here, we just want xb pooler output, pooler output.
00:27:50.800 | And from there, we want to get the argmax, so the maximum
00:27:58.200 | argument, so the highest score in our probability right here.
00:28:02.800 | So torch.argmax, and here we have probs.
00:28:05.960 | And then what I'm going to do is I'm
00:28:07.420 | going to print the current question that we're asking,
00:28:11.120 | so questions[i].
00:28:13.880 | Now I'm going to print the context which has
00:28:17.840 | been chosen from our argmax.
00:28:20.000 | So we just write context[argmax].
00:28:24.480 | And then I'm just going to put this in here
00:28:27.200 | so we have a little bit of separation.
00:28:29.960 | Let's have a see what we have.
00:28:31.720 | So we get, what is the capital city of Australia?
00:28:35.040 | Now remember, this exact question
00:28:37.200 | was also in our context.
00:28:39.560 | And it's not returning the exact sentence back to us
00:28:46.080 | or the exact question back to us.
00:28:47.440 | It's actually returning as the answer.
00:28:50.200 | So Canberra is the capital city of Australia.
00:28:53.760 | Now second one, as we had hoped, the best-selling sci-fi book
00:28:59.400 | has been chosen to be Dune here.
00:29:01.560 | And then I just wanted to include this one as well
00:29:03.840 | to point out that it's not perfect.
00:29:06.480 | It doesn't always get things right.
00:29:08.620 | So in this case, it didn't find the correct answer
00:29:12.700 | of how many searches are performed on Google.
00:29:14.960 | If we have a look at the context,
00:29:17.520 | so the correct answer should have been this one here.
00:29:22.040 | So Google serves more than 2 trillion queries annually.
00:29:27.240 | So it didn't get that one.
00:29:28.880 | But the other two it did get, despite having
00:29:32.680 | the actual questions in there as well.
00:29:35.420 | One of them here.
00:29:38.740 | So again, I think that's really cool.
00:29:40.660 | And I think Q&A is something that
00:29:43.380 | has a lot of potential in many businesses around the world.
00:29:50.580 | So I think that's a very cool one to use.
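Pulled together, the tokenize-encode-compare loop from this section looks roughly like the following, continuing from the initialization sketch above (the questions and contexts here are illustrative placeholders, not the full set used in the video):

```python
# Sketch of the DPR retrieval loop: encode contexts and questions, then match them.
import torch

questions = ["What is the capital city of Australia?"]          # placeholder examples
context = [
    "Canberra is the capital city of Australia.",
    "What is the capital city of Australia?",
    "Sydney is the largest city in Australia.",
]

xb_tokens = ctx_tokenizer(context, max_length=256, padding="max_length", return_tensors="pt")
xq_tokens = question_tokenizer(questions, max_length=256, padding="max_length", return_tensors="pt")

xb = ctx_model(**xb_tokens).pooler_output        # (num_contexts, 768)
xq = question_model(**xq_tokens).pooler_output   # (num_questions, 768)

for i, xq_vec in enumerate(xq):
    probs = torch.cosine_similarity(xq_vec, xb)  # one score per context
    argmax = torch.argmax(probs)
    print(questions[i])
    print(context[argmax])
```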
00:29:55.020 | OK, so the next one I want to cover
00:30:01.540 | is a mix of language and also vision.
00:30:09.120 | So recently, computer vision has had a few advances
00:30:15.160 | from the discipline of NLP.
00:30:18.640 | So in NLP, we've been using transformers
00:30:20.720 | for a reasonable amount of time now.
00:30:24.360 | And transformers have proven to be
00:30:27.880 | incredible models for language.
00:30:31.480 | And very recently, transformers have
00:30:34.340 | been applied to computer vision as well,
00:30:37.740 | which is very cool, I think.
00:30:40.340 | And what we're finding is that a model or an architecture that
00:30:45.660 | can be used for language can also
00:30:46.980 | be used for computer vision.
00:30:50.320 | And I think that's super cool.
00:30:54.420 | So I want to show you one of those models,
00:30:59.220 | or briefly touch upon one of those models.
00:31:02.260 | We will go into it in more detail
00:31:03.940 | in a future article and video.
00:31:06.900 | But for now, I'm just going to mention it.
00:31:09.060 | We have the Vision Transformer, which is very recent.
00:31:14.100 | I think the paper is January 2021, if I'm not wrong.
00:31:19.180 | And although we don't need a Vision Transformer
00:31:22.100 | to build an embedding for an image,
00:31:26.380 | I think the fact that we can use it is pretty cool.
00:31:29.360 | And we can really do it very easily
00:31:31.300 | with Hugging Face Transformers, as we will see
00:31:33.780 | when we go through the code.
00:31:35.540 | Now, a very interesting use of this
00:31:40.020 | is to actually take two different encoders,
00:31:43.400 | both transformers.
00:31:45.340 | The text encoder is more of a traditional transformer,
00:31:48.860 | obviously.
00:31:49.860 | And the image encoder is our new Vision Transformer.
00:31:54.460 | And we can actually train them together,
00:31:57.220 | like we did with DPR, the bi-encoder architecture.
00:32:00.780 | And what we can do is train it to put images and language,
00:32:07.180 | so language that describes an image,
00:32:09.580 | and map them to the same point in a vector space,
00:32:16.060 | or very close, at least.
00:32:18.360 | And that's what I've tried to visualize.
00:32:20.020 | You can see on the screen now.
00:32:21.380 | So we have the text "two dogs running."
00:32:22.540 | We process that through our text encoder.
00:32:24.980 | And we get a very similar vector to if we
00:32:26.980 | took the picture of two dogs running
00:32:29.460 | and process that through an image encoder, which would
00:32:32.900 | be our Vision Transformer.
00:32:35.540 | So I think that's--
00:32:37.300 | I don't know.
00:32:37.820 | For me, I think that's so cool.
00:32:41.060 | Now, I'm going to be using these three pictures that I
00:32:46.340 | got from Unsplash.
00:32:47.660 | If you want to see the photo credits,
00:32:50.020 | they will be either in the article,
00:32:52.140 | if you're reading the article, or they'll
00:32:53.800 | be in the video description, if not.
00:32:57.100 | And what I'm going to do is we have these three pictures.
00:33:00.300 | We're going to encode those.
00:33:01.540 | And I'm also going to encode these three captions,
00:33:04.620 | and a few other captions as well.
00:33:06.660 | And we're going to see if they match.
00:33:09.540 | So we're going to perform a similarity, or a cosine
00:33:12.580 | similarity search across them, and see
00:33:14.940 | which pairs match the closest.
00:33:17.780 | And we'll see the results are pretty cool, in my opinion.
00:33:23.980 | So let's jump into it.
00:33:26.980 | Again, we're going to be using Transformers.
00:33:29.660 | And we're going to be using a new model from OpenAI,
00:33:33.300 | which is for the image and text.
00:33:37.420 | Similar to DPR, where DPR is in question and context encoding,
00:33:43.060 | Clip is using two encoders to do image and caption encoding,
00:33:49.260 | which is pretty cool.
00:33:50.720 | So we're going to do, from Transformers, import Clip
00:33:58.300 | Processor.
00:33:59.140 | So I'm kind of viewing this processor
00:34:01.700 | as what we could call a tokenizer in typical language
00:34:07.900 | transformers.
00:34:10.060 | And then we want the CLIPModel.
00:34:12.660 | So this contains both encoders for us,
00:34:15.140 | so we don't have to mess around.
00:34:17.220 | Like we did with DPR, where we imported four classes.
00:34:23.100 | Here, we're just importing the two.
00:34:26.820 | And then what we want to do is we'll just initialize those.
00:34:29.580 | So again, very similar.
00:34:32.180 | So we do CLIPModel.from_pretrained.
00:34:38.900 | And in here, we write openai/clip-vit.
00:34:48.620 | So it's the Vision Transformer, this ViT you see here.
00:34:51.780 | It refers to the Vision Transformer
00:34:53.700 | which CLIP is using or is based on, at least the vision aspect.
00:35:01.540 | And we want to write Base Patch 32.
00:35:06.500 | So I mean, we'll go into it in more detail,
00:35:08.860 | but the patch part of that is referring to the way
00:35:11.500 | that the model almost tokenizes your images.
00:35:16.380 | It splits an image into different patches.
00:35:19.340 | And that's the patch size, the patch 32 there.
00:35:23.380 | So we also want the processor, which again, we
00:35:27.540 | can kind of see that as akin to or equivalent to our tokenizer.
00:35:34.340 | just as we would do for language models.
00:35:39.420 | And again, I'll just copy that across.
00:35:42.580 | OK, so model processor looks good.
00:35:50.420 | Let me rerun it.
00:35:53.740 | OK, again, I already have it cached,
00:35:56.620 | so it won't download for me.
00:35:58.900 | And you'll get this thing here.
00:36:02.420 | Don't worry about it.
00:36:03.340 | It still works.
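So far, that initialization is just the following short sketch (model name as given above; variable names are my own):

```python
# Sketch of loading CLIP and its processor from Hugging Face Transformers.
from transformers import CLIPProcessor, CLIPModel

model_id = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)
```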
00:36:04.740 | Now I'm going to copy in the code
00:36:07.020 | I'm using to get the photos.
00:36:09.260 | So I have the photo URLs here.
00:36:11.220 | I'm using PIL to create the image object.
00:36:17.620 | And I'm using requests to actually get the image
00:36:20.300 | from the URL that we have here.
00:36:22.460 | And then down here, I'm just going
00:36:23.880 | to show you what images we have.
00:36:25.460 | So I actually need to get matplotlib in there as well.
00:36:30.260 | So import matplotlib.pyplot as plt and numpy as well.
00:36:39.380 | OK, and we'll see those images that we saw before.
00:36:46.820 | So we have the puppy or dog running,
00:36:50.060 | the dog hiding behind tree, and then we
00:36:51.820 | have the two dogs running.
00:36:53.540 | OK, so they are our images, and we've
00:36:55.740 | stored them in images here, OK?
00:37:00.500 | And the next part are captions.
00:37:02.820 | So I've just written these six captions.
00:37:06.620 | The first three are actually the captions,
00:37:10.460 | and then the other three I just made up.
00:37:13.580 | I included trees and park in there
00:37:15.620 | because they look like, well, there's a tree here
00:37:17.740 | and there's a park here.
00:37:18.740 | So try and make it a little bit more difficult.
00:37:23.620 | But I mean, they're reasonably straightforward still,
00:37:26.780 | I think.
00:37:27.660 | And then to create our--
00:37:29.300 | you can imagine, you can see these as tokens.
00:37:32.620 | We do inputs equals processor, similar to our tokenizer again.
00:37:39.700 | And we have a few inputs here.
00:37:42.540 | So we have the text, and we want to input our captions.
00:37:46.420 | And then we also have images.
00:37:48.420 | And of course, we just input our images.
00:37:51.300 | And then we want to set return_tensors, or return tensors,
00:37:57.300 | equal to "pt".
00:37:59.660 | And we set padding to True, OK?
00:38:03.660 | So return_tensors="pt", OK?
00:38:15.500 | And if we-- let me have a quick look at what we have here.
00:38:18.220 | So we have our input IDs, pixel values, and so on.
00:38:22.020 | So input IDs, we also have attention here as well.
00:38:26.060 | So these first two are for our text,
00:38:30.500 | and then pixel values are for the images.
00:38:32.340 | And now what we want to do is create our encodings.
00:38:39.540 | So in here, because we're using the clip model,
00:38:42.780 | we're actually going to perform the encodings.
00:38:45.180 | And it's also going to do the whole similarity checking
00:38:49.300 | for us as well, and identify which images and captions
00:38:53.300 | are the closest pairs.
00:38:55.660 | Or what it's going to do is go through each image
00:38:58.140 | and find the caption that it believes belongs to it.
00:39:01.620 | So like before, we just write inputs here.
00:39:05.020 | And I think maybe let's have a look
00:39:07.740 | at what we have in our outputs.
00:39:10.180 | So we can see we'll have a few things here
00:39:13.500 | that I think are pretty useful.
00:39:15.820 | So we have the logits per image and per text.
00:39:19.300 | So for these, we can--
00:39:21.460 | for each of our text, we can use this
00:39:24.660 | to get the most probable image that
00:39:29.540 | is assigned to each caption.
00:39:31.940 | And in logits per image, we can use
00:39:34.380 | these to find the most probable caption for each image.
00:39:39.820 | And then-- so what we were doing before where we were just
00:39:42.340 | extracting the embeddings, we can also do that.
00:39:45.540 | And maybe I'll just copy in the code for that as well.
00:39:48.860 | So we have the text embeddings here.
00:39:50.380 | So we can extract those if we want.
00:39:51.840 | And we also have the image embeddings in here as well.
00:39:55.620 | And then a little further down, we have the logits somewhere,
00:39:59.260 | pooler output here.
00:40:00.740 | Yeah.
00:40:02.700 | So we have the pooler outputs and the logits.
00:40:05.820 | OK, so let me just close that.
00:40:09.660 | And I do believe we also have a few more.
00:40:11.460 | So let me just show you those quickly.
00:40:14.620 | Yeah, we have a few tensors there as well,
00:40:17.860 | vision model output, text model output as well.
00:40:20.900 | Now what we'll do is I'm going to paste this code in.
00:40:24.140 | And so here, I'm going to go for image in each image.
00:40:29.220 | I'm going to iterate through.
00:40:30.860 | I'm going to get the argmax, so the caption that it believes
00:40:35.780 | or is predicted for that image.
00:40:37.660 | And then we're going to show it.
00:40:39.060 | And we're going to print both out.
00:40:40.480 | Let's see if they match.
00:40:44.020 | Oh, so I'm getting ahead of myself there.
00:40:47.180 | So we also need to--
00:40:48.780 | so the probability there is the probs equals outputs.
00:40:56.380 | And we want the logics pair image.
00:41:00.620 | And we'll take the argmax while we're here.
00:41:03.220 | So dim equals 1 for that.
00:41:05.440 | And let's have a look at what we get.
00:41:07.460 | We'll see that we get this.
00:41:09.740 | So it's predicting caption 2, caption 0, and then caption 1
00:41:13.420 | for our three images.
00:41:15.740 | Let's look through that.
00:41:16.780 | And we'll see we get a dog running.
00:41:18.940 | Cool.
00:41:19.940 | A dog hiding behind a tree.
00:41:21.660 | And then two dogs running as well, which I don't know.
00:41:25.300 | For me, maybe because I'm usually working with language,
00:41:29.500 | I think seeing both language and images together is--
00:41:34.860 | I don't know-- really cool.
00:41:37.460 | Super-- I don't know-- fascinating that it actually
00:41:40.580 | works like that so easily.
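The matching step just described, as a compact sketch continuing from the CLIP sketch above (the captions and image URLs here are placeholders; the video uses three Unsplash photos and six captions):

```python
# Sketch of matching each image to its most probable caption with CLIP.
import requests
from PIL import Image

captions = [
    "a dog running in a field",            # placeholder captions
    "a dog hiding behind a tree",
    "two dogs running across a beach",
]
urls = ["https://example.com/dog1.jpg"]    # placeholder image URLs
images = [Image.open(requests.get(u, stream=True).raw) for u in urls]

inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image: one row per image, one column per caption
preds = outputs.logits_per_image.argmax(dim=1)
for img_idx, cap_idx in enumerate(preds):
    print(f"image {img_idx} -> {captions[cap_idx]}")
```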
00:41:43.420 | So another thing that I want to show you very quickly--
00:41:48.020 | I'm just going to copy the code in,
00:41:49.500 | because I don't want to go through all of it.
00:41:52.180 | It'll take a while.
00:41:53.620 | So we just have the embeddings.
00:41:55.100 | So these are the embeddings if we wanted to extract them
00:41:57.420 | and do what we did before with them.
00:41:59.300 | Or if you wanted to take these embeddings,
00:42:01.020 | put them in a vector index somewhere, a vector database.
00:42:06.940 | And we can get our query.
00:42:10.860 | So I'm going to do a dog hiding behind a tree.
00:42:12.820 | We can get the context--
00:42:15.700 | or not the context, the images, the image embeddings.
00:42:19.420 | Again, like before, we do the similarity.
00:42:22.580 | So the cosine similarity, we get the highest one
00:42:24.900 | is the second one here.
00:42:26.060 | So it's looking pretty good.
00:42:27.700 | And from there, we get our prediction,
00:42:29.700 | which is argmax, so we'll take number 1.
00:42:33.260 | And let's have a look at what our prediction is then.
00:42:35.500 | So we will plot that.
00:42:37.260 | We'll show you the image again.
00:42:38.660 | We have prediction.
00:42:39.580 | So it's shown as the dog hiding behind the tree
00:42:42.940 | for our query, which is a dog hiding behind a tree.
00:42:46.100 | So again, super cool.
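And that last text-to-image search, where we extract the embeddings ourselves and compare them directly, looks roughly like this (continuing from the outputs above; again, the variable names and the choice of query index are mine):

```python
# Sketch of querying the image embeddings with a text query directly.
from torch.nn.functional import cosine_similarity

text_emb = outputs.text_embeds     # (num_captions, 512) projected text embeddings
image_emb = outputs.image_embeds   # (num_images, 512) projected image embeddings

query = text_emb[1]                           # e.g. "a dog hiding behind a tree"
scores = cosine_similarity(query, image_emb)  # one score per image
pred = scores.argmax().item()
print(f"best-matching image index: {pred}")
```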
00:42:49.820 | Now, that's it for this video.
00:42:51.460 | We've, I think, covered quite a lot of embedding methods.
00:42:56.260 | We've had a look at some introduction
00:42:58.780 | to dense vectors with Word2Vec and where it came from
00:43:03.460 | and how it quickly evolved.
00:43:05.500 | And we've had a look at sentence embeddings and sentence
00:43:08.700 | transformers, moved on to Q&A with Facebook AI's DPR.
00:43:14.460 | And now we've had a look at the new Vision Transformer
00:43:17.300 | and how we can use that with other transform models
00:43:20.300 | to build these really cool cross-media embeddings
00:43:26.500 | that we can compare, which has blown me away a little bit.
00:43:30.060 | Now, that's it for this video.
00:43:33.340 | But like I said, this is the first video and article
00:43:36.940 | in what will be a series on embeddings.
00:43:39.980 | So there's a lot more to come.
00:43:43.100 | But for now, thank you very much for watching.
00:43:45.140 | And I'll see you in the next one.