back to index

Intro to Dense Vectors for NLP and Vision


Chapters

0:00 Intro
1:50 Why Dense Vectors?
3:55 Word2vec and Representing Meaning
8:40 Sentence Transformers
9:58 Sentence Transformers in Python
15:08 Question-Answering
18:18 DPR in Python
29:55 Vision Transformers
33:22 OpenAI's CLIP in Python
42:49 Review and What's Next

Whisper Transcript | Transcript Only Page

00:00:00.000 | Hi, and welcome to this video.
00:00:02.240 | We're going to start a new series on embedding methods
00:00:07.180 | for NLP, but we're also going to have a look at other embedding
00:00:11.160 | methods as well.
00:00:12.360 | So mainly, we're going to be focusing
00:00:14.600 | on language-dense embeddings.
00:00:17.920 | We might have a look at sparse embeddings,
00:00:19.640 | but we've already covered that before.
00:00:21.520 | So I'm not 100% sure on that.
00:00:23.920 | But definitely dense embeddings.
00:00:26.000 | We're going to also have a look at how
00:00:28.280 | we can build dense embeddings for images and maybe
00:00:31.080 | some other media formats as well.
00:00:33.880 | So I think this series of articles and videos
00:00:38.200 | will be pretty exciting.
00:00:39.960 | Now, what I want to start with is having a look at--
00:00:45.960 | well, basically, quickly introducing
00:00:47.800 | what dense vectors and dense embeddings are.
00:00:50.920 | And whilst we do that, I'm going to refer a lot to Word2Vec
00:00:55.080 | because that's the first widely adopted version of this.
00:01:01.320 | And then we're going to have a look at sentence embeddings,
00:01:06.000 | so how we can build sentence embeddings using the Sentence
00:01:08.720 | Transformers library.
00:01:09.600 | And we're going to go through the code for that as well.
00:01:11.940 | Then we're going to have a look at Q&A.
00:01:13.760 | So Q&A is quite interesting, I think.
00:01:16.360 | And we're going to focus on Facebook AI's Dense Passage
00:01:20.960 | Retriever for that.
00:01:23.800 | And again, we are going to go through the code for that
00:01:26.240 | as well.
00:01:27.600 | And then another thing that I think is quite exciting
00:01:31.120 | is image and text embeddings.
00:01:34.200 | So to do that, we're going to have a look
00:01:36.920 | at the new Vision Transformer.
00:01:38.880 | So I think all of that's pretty cool.
00:01:42.400 | So let's jump straight into it.
00:01:51.360 | So I think the first question we want to ask
00:01:54.200 | is, why would we use dense vectors in the first place?
00:01:58.560 | Now, we have two options when it comes to representing text.
00:02:01.320 | And that is we can represent it as a dense vector
00:02:04.920 | or as a sparse vector.
00:02:06.560 | Now, sparse vectors are good if we're
00:02:08.880 | going to focus on the syntax and the words that we're comparing.
00:02:12.680 | So if we had two sentences, Bill ran from the giraffe
00:02:18.280 | towards the dolphin.
00:02:20.400 | And then we said the opposite.
00:02:22.040 | So Bill ran from the dolphin towards the giraffe.
00:02:26.240 | Both of these sentences have the exact same words in it.
00:02:30.120 | But they have different meanings, right?
00:02:32.080 | So in one of them, Bill is running away from a giraffe.
00:02:34.520 | And the other one is running away from a dolphin.
00:02:36.600 | Now, when it comes to sparse vector representation,
00:02:41.520 | we'd find it difficult to correctly identify these
00:02:46.440 | as not being the same sentence.
00:02:48.720 | Because we tend to represent words one by one
00:02:53.200 | in some sort of one-hot encoding,
00:02:54.920 | and then compare those vectors.
00:02:56.600 | Now, we can also use n-grams so we can put two words together.
00:02:59.520 | And in that case, we would identify
00:03:01.360 | that there is a difference.
00:03:02.800 | But it's not that effective.
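As a quick illustration of that point, here is a hypothetical sketch (my own, using scikit-learn, which isn't part of the video) showing that a plain bag-of-words view gives those two sentences identical vectors:

```python
# Hypothetical sketch (not from the video): a bag-of-words (sparse) view
# cannot distinguish the two sentences, because they contain the same words.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "Bill ran from the giraffe towards the dolphin",
    "Bill ran from the dolphin towards the giraffe",
]

bow = CountVectorizer().fit_transform(sentences)  # word counts, order ignored
print(cosine_similarity(bow[0], bow[1]))          # [[1.0]] -- identical sparse vectors
```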
00:03:05.360 | And then we also want to consider
00:03:07.240 | where we have different words for the same meaning.
00:03:11.680 | So for example, if you want to say hello to someone,
00:03:15.840 | you say hi, hello, hey.
00:03:18.040 | I'm sure there's a million other ways of saying it.
00:03:20.680 | And sparse vector representations
00:03:24.240 | would view these as different words.
00:03:27.240 | So whereas sparse vectors are very good for comparing
00:03:31.160 | the syntax of text, it's not very good
00:03:34.840 | at comparing the semantics or the meaning behind text.
00:03:38.160 | And that's where we want to start using dense vectors.
00:03:41.760 | So we can see dense vectors as pretty much
00:03:45.840 | a numerical representation of the semantic meaning
00:03:51.200 | behind some text.
00:03:53.840 | And we can actually visualize a lot of these relationships.
00:03:57.760 | So around 2013, we had Word2Vec,
00:04:01.880 | which was the first very popular dense vector
00:04:06.280 | embedding for words.
00:04:08.320 | And around that time, we had a lot
00:04:10.720 | of people showing that you had things like this,
00:04:13.320 | what you can see on the screen, where, for example,
00:04:15.840 | we'd have days of the week clustered together,
00:04:17.920 | or we would have months or other related abstract topics
00:04:24.720 | represented or clustered together
00:04:26.720 | in our highly dimensional space.
00:04:30.120 | Now, of course, this is a 3D graph.
00:04:32.760 | When we're actually building these dense vectors,
00:04:34.800 | we have many more dimensions, more
00:04:38.480 | towards the 500, 700, 800 or so.
00:04:44.280 | So this is obviously a simplified version of that.
00:04:48.040 | And not only will we find that similar words are
00:04:51.360 | clustered in the same area, but we also
00:04:53.160 | find that we can perform what I think is best described
00:04:57.680 | as arithmetic on words.
00:05:01.480 | So this is a very popular example
00:05:05.000 | that came from around the same time as Word2Vec.
00:05:07.560 | If you want references and everything,
00:05:09.760 | you'll be able to find them at the bottom of the article
00:05:12.360 | that this video is attached to.
00:05:14.280 | If you need the article, it's in the description.
00:05:16.600 | Now, what we'd find is if we took the vector for king,
00:05:21.200 | subtracted the vector for man, added the vector for woman,
00:05:26.340 | we would not get the exact vector for queen,
00:05:29.640 | but we'd get very, very close.
00:05:32.120 | So the nearest vector would be the vector for queen.
00:05:35.320 | And I mean, I think that's super interesting.
00:05:38.800 | And this is from the start of when we
00:05:41.480 | had these vector embeddings.
00:05:43.160 | So this is eight years ago now.
00:05:46.880 | And they've just gotten a lot more advanced in that time.
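If you want to try that arithmetic yourself, here's a minimal sketch using pretrained vectors from gensim's downloader (an assumption on my part; the video doesn't use gensim):

```python
# Hypothetical sketch: the "king - man + woman ~= queen" arithmetic with gensim.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")  # any pretrained word-vector set works here

# positive vectors are added, negative vectors are subtracted
result = wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # nearest neighbour is "queen"
```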
00:05:51.000 | So as I said, these examples are coming
00:05:54.040 | from the era of the Word2Vec.
00:05:56.780 | And Word2Vec was one of the earliest versions
00:06:01.880 | of these dense representations.
00:06:04.100 | And going from the name, we know it's Word2Vector.
00:06:06.920 | So we're converting words into vectors.
00:06:10.520 | Now, how this worked, there were two different methods.
00:06:14.240 | We had the skip-gram method, which
00:06:15.880 | is what you can see now, which is where we take one word,
00:06:19.840 | and we would take the sparse vector encoding for that word
00:06:24.560 | on the left that you can see.
00:06:26.480 | And then we would, in the vector on the right,
00:06:31.440 | we would have a one-hot encoding for all of the words that
00:06:35.920 | surround that first word.
00:06:38.680 | So in this case, we have fox.
00:06:40.080 | And that's surrounded by the words quick, brown,
00:06:42.240 | jumped, and over.
00:06:44.320 | And this would be run through a simple feed-forward neural
00:06:47.760 | network.
00:06:48.760 | And we would go through this compression stage.
00:06:52.200 | And it is within that compression stage
00:06:54.080 | that we would build our dense vector representation for fox.
00:07:00.120 | And that would simply be a neural network
00:07:04.320 | being optimized to go from fox and predict
00:07:07.520 | the words quick, brown, jumped, and over.
00:07:09.200 | And this would be done many times over for every time
00:07:13.000 | that word appears in a big corpus of text
00:07:17.440 | with its multiple contexts.
00:07:20.300 | And what that does is it just builds up
00:07:21.920 | like a numerical representation of that word.
00:07:26.600 | And then there was the other approach,
00:07:28.260 | which Word2Vec also used, which is
00:07:30.560 | called continuous bag of words.
00:07:32.520 | And it's basically the same.
00:07:34.320 | We're just swapping the order of the transformation.
00:07:38.280 | So on the left, we have all of our context words.
00:07:41.160 | And then on the right, we would have the word
00:07:43.800 | that we're focusing on and we're building the embedding for.
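To make the skip-gram versus CBOW distinction concrete, here's a minimal sketch with gensim's Word2Vec implementation (again an assumption; the video doesn't use gensim, and any tokenized corpus would do):

```python
# Hypothetical sketch: training Word2Vec both ways with gensim.
from gensim.models import Word2Vec

corpus = [
    ["the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"],
    # ... many more tokenized sentences in practice
]

# sg=1 -> skip-gram: predict the surrounding words from the centre word
skipgram = Word2Vec(corpus, vector_size=300, window=2, sg=1, min_count=1)

# sg=0 -> CBOW: predict the centre word from its surrounding words
cbow = Word2Vec(corpus, vector_size=300, window=2, sg=0, min_count=1)

print(skipgram.wv["fox"].shape)  # (300,) -- the dense vector for "fox"
```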
00:07:46.860 | Now, Word2Vec really seemed to act
00:07:51.120 | as the catalyst for a lot of other vector embeddings.
00:07:56.080 | From Word2Vec, for example, we have
00:07:58.240 | like sentence2vec and doc2vec.
00:08:01.120 | We even had this one that I found
00:08:03.640 | when I was researching for this, which
00:08:05.600 | is called batter-pitcher-2vec, which
00:08:08.880 | is vector embeddings for Major League Baseball players.
00:08:14.000 | So you've got a lot of different 2Vec methods
00:08:18.280 | that came out of the woodworks after the original Word2Vec.
00:08:22.880 | And then we also had other ones like GloVe as well,
00:08:25.320 | which is worth a mention.
00:08:27.520 | Now, nowadays, Word2Vec is pretty outdated
00:08:30.760 | and we wouldn't really go ahead and use that.
00:08:32.680 | So I'm not going to spend any more time on it.
00:08:34.840 | And we'll just move on to having a look at sentence similarity.
00:08:39.680 | So you can see sentence similarity
00:08:41.160 | as very similar to Word2Vec in that we're building
00:08:46.200 | these dense representations.
00:08:48.040 | But rather than representing a single word,
00:08:49.840 | we're representing a sentence or a paragraph.
00:08:53.520 | And the way that this would be done
00:08:56.000 | is using the current transform models.
00:08:58.520 | So BERT was the first example of doing this.
00:09:03.640 | And BERT by itself, you can build embeddings.
00:09:07.880 | But it's based on a token-by-token embedding.
00:09:10.640 | So within BERT, you have all of these different embeddings,
00:09:13.360 | but they each represent a single token.
00:09:15.800 | So what the guys at Sentence Transformers did
00:09:20.200 | is they trained like a Siamese.
00:09:23.480 | They called it Siamese BERT, where they had two BERTs.
00:09:26.360 | And they were trained in parallel.
00:09:29.160 | And they output a single vector for the full input that
00:09:35.040 | was input into the model, which was around 128 tokens at max.
00:09:42.840 | Now, this allowed us to build a single vector for sentences.
00:09:48.120 | And that's very good, because then we
00:09:50.240 | can start comparing sentences and paragraphs.
00:09:54.480 | So let's have a look at how we can actually
00:09:57.680 | build that in code.
00:10:00.560 | So the first thing you'll need to do
00:10:02.480 | is pip install Sentence Transformers.
00:10:06.120 | Now, I've already done this, so I'm not going to rerun it.
00:10:10.160 | But if you don't have Sentence Transformers,
00:10:12.000 | you will need to install it.
00:10:16.120 | And then after that, all we want to do is we
00:10:18.840 | want to write from Sentence Transformers.
00:10:22.600 | We want to import the Sentence Transformer object.
00:10:30.360 | And from there, we can just initialize our model.
00:10:33.120 | Super easy.
00:10:33.800 | We just write model equals SentenceTransformer.
00:10:38.040 | And then in here, we just need to type our model name.
00:10:40.800 | Now, if you Google Sentence Transformers or SBERT,
00:10:45.320 | you will find the web page for this library.
00:10:49.320 | And it has loads of different models on there.
00:10:53.080 | One of the highest performing ones
00:10:54.760 | that I found on there at the moment
00:10:57.000 | is called all-mpnet-base-v2.
00:11:03.880 | So we just execute that.
00:11:05.240 | And usually, you will need to download the model.
00:11:11.880 | So you will see a load of loading bars or progress bars.
00:11:16.040 | That's fine.
00:11:16.640 | It's just downloading the model for you.
00:11:18.320 | I already have it downloaded, so I don't need to run it again.
00:11:21.840 | And then what we need is a set of sentences
00:11:24.400 | so that we can actually compare what we--
00:11:28.640 | we can compare these and look at what the Sentence Transformer
00:11:33.520 | believes is the most similar.
00:11:35.720 | Now, all of these are completely random,
00:11:38.600 | but we have this one here.
00:11:40.720 | "The bees decided to have a mutiny against their queen."
00:11:43.560 | And I just rewrote that in a way that we
00:11:47.920 | don't have any matching words between the two sentences.
00:11:51.760 | So we have "flying, singing insects
00:11:53.960 | rebelled in opposition to the matriarch."
00:11:56.440 | Now, the meaning there is pretty much the same.
00:11:59.080 | Maybe not exactly the same, but pretty much.
00:12:02.440 | But there are no shared words other than, I think,
00:12:05.920 | "to" and "the."
00:12:08.360 | Yeah, "to" and "the."
00:12:10.560 | So in terms of sparse vector encoding,
00:12:15.840 | this wouldn't score very well.
00:12:17.080 | But we'll see that with dense vectors, it will.
00:12:23.920 | So the first thing we want to do is encode our embeddings.
00:12:30.120 | So we'll write embeddings = model.encode(sentences).
00:12:39.240 | And then let's have a look at what that outputs, or at least
00:12:43.280 | the shape of what it outputs.
00:12:44.440 | And we see that we get seven vectors, or seven embeddings,
00:12:50.920 | each one with a dimensionality of 768 values.
00:12:56.800 | And we can use cosine similarity to compare all of these.
00:13:01.120 | Now, the easiest way to do this is we just
00:13:04.800 | import cosine similarity from Sentence Transformers.
00:13:10.000 | So from sentence_transformers.util import cos_sim.
00:13:17.760 | And then what we do is calculate the cosine similarity scores
00:13:23.360 | between all of our vectors.
00:13:26.040 | Now, I want to compare the final item here,
00:13:32.440 | so this last one, against the rest of them,
00:13:34.480 | because I want to see that this is the most similar.
00:13:38.240 | So I'm just going to select that, so write embeddings[-1].
00:13:43.880 | And we're just taking the last vector.
00:13:46.880 | And then I want embeddings[:-1], so the remaining of them,
00:13:50.400 | so all the vectors except the last one.
00:13:53.680 | And let's just have a look.
00:13:55.480 | We will be able to see that we have something
00:13:59.000 | that seems pretty obvious.
00:14:00.440 | So this one here is the most similar, by quite a bit.
00:14:04.960 | The next closest is only 0.19 here.
00:14:07.720 | So it's definitely calculating that
00:14:09.920 | as a lot more similar than the other ones.
00:14:13.040 | So if I take the argmax of that, we should see--
00:14:22.120 | so 3, and take the item.
00:14:25.800 | And if we go Sentences, and we put that, Sentences,
00:14:31.760 | and we index number 3, we see, OK,
00:14:35.520 | the bees decided to have a mutiny against their queen.
00:14:38.320 | So it correctly identified that these two, this and this,
00:14:46.480 | are far more similar than the rest of the sentences, which
00:14:50.360 | I think is very cool, because there's not even
00:14:53.200 | any similar words in there.
00:14:54.880 | And even as a human, it's kind of, you know,
00:14:57.760 | the bees, flying, stinging insects, and matriarch
00:15:02.360 | and queen, you know, it's not obvious.
00:15:06.760 | So I think that's really cool.
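Putting those steps together, a minimal end-to-end sketch looks something like this (the sentence list here is shortened and partly made up; only the two "bee" sentences are from the video):

```python
# Condensed sketch of the sentence-similarity walkthrough above.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-mpnet-base-v2")

sentences = [
    "it caught him off guard that space smelled of seared steak",      # filler example
    "she could not decide between painting her teeth or brushing her nails",  # filler example
    "the bees decided to have a mutiny against their queen",
    "flying stinging insects rebelled in opposition to the matriarch",
]

embeddings = model.encode(sentences)               # shape: (4, 768)
scores = cos_sim(embeddings[-1], embeddings[:-1])  # last sentence vs. the rest
best = scores.argmax().item()
print(sentences[best])  # -> "the bees decided to have a mutiny against their queen"
```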
00:15:08.920 | Another popular use of embeddings
00:15:12.000 | for language applications is question answering.
00:15:15.560 | Now, question answering can be done
00:15:18.280 | with a few different, let's say, architectures.
00:15:23.200 | And one of the, I think, most popular ones
00:15:25.840 | is open domain question answering.
00:15:28.200 | Now, the structure of open domain question answering
00:15:32.080 | is what you can see on the screen at the moment.
00:15:35.480 | So we ask a question that gets passed to something
00:15:39.640 | called a retriever model.
00:15:41.480 | The retriever model contains a question encoder,
00:15:45.600 | which encodes the question, passes it along
00:15:47.640 | to our index database.
00:15:50.880 | And within there, we will have a set of contexts.
00:15:54.040 | Now, contexts are usually a paragraph
00:15:57.000 | that contains the answer to our question.
00:15:59.760 | And DPR both encodes our questions
00:16:03.160 | and encodes our context into the same vector space.
00:16:08.120 | So what we would get is, for example,
00:16:10.720 | if we had a question, what is the capital of France?
00:16:16.000 | And then we also had a context.
00:16:17.440 | The capital of France is Paris.
00:16:20.000 | DPR would attempt to encode both of those
00:16:22.440 | into the same vector space, or very, very close by.
00:16:26.960 | So the vectors produced by both of those
00:16:28.800 | would be very, very similar.
00:16:30.200 | So all we're doing in that index database
00:16:35.840 | is finding the most similar embeddings
00:16:39.080 | to our question embedding.
00:16:42.280 | And then from there, we pass that along.
00:16:44.240 | We pass our context and the question again
00:16:46.280 | to our reader model.
00:16:48.680 | Here, I've used a BERT Q&A model.
00:16:50.960 | It doesn't have to be BERT.
00:16:52.080 | It can be any reader for question answering.
00:16:55.520 | And then that outputs the specific part
00:16:57.920 | of our context, which contains our answer.
00:17:00.360 | So in the previous example, we would output Paris, hopefully.
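For the reader stage, a rough sketch with the transformers question-answering pipeline would look like the following (using the pipeline's default extractive-QA model, which is my own shortcut rather than the exact reader shown in the diagram):

```python
# Hypothetical sketch of the reader stage: extract the answer span from a context.
from transformers import pipeline

reader = pipeline("question-answering")  # default extractive-QA model

result = reader(
    question="What is the capital of France?",
    context="The capital of France is Paris.",
)
print(result["answer"])  # -> "Paris"
```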
00:17:04.560 | Now, we had that DPR retriever model.
00:17:07.040 | DPR is Facebook AI's Dense Passage Retriever.
00:17:11.080 | And it actually consists of two smaller encoders.
00:17:15.480 | We have a question encoder and a context encoder.
00:17:18.040 | Now, during training, what we do is
00:17:20.840 | we train both of these encoders in parallel.
00:17:24.200 | And we pass questions and their equivalent context
00:17:27.960 | to the question and context encoder, respectively.
00:17:31.720 | And we optimize based on a contrastive loss function.
00:17:36.600 | So we compare the vectors from our question encoder
00:17:40.520 | and the context encoder.
00:17:41.720 | And we try to minimize the difference between them,
00:17:44.640 | the question and context pairs.
00:17:47.160 | And that's how we build the DPR model.
00:17:50.680 | That's why it works for question answering.
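That contrastive objective, sketched very roughly (this is my own simplified version with in-batch negatives, not Facebook AI's actual training code), looks something like this:

```python
# Rough sketch of an in-batch contrastive loss for question/context pairs.
import torch
import torch.nn.functional as F

def contrastive_loss(question_vecs: torch.Tensor, context_vecs: torch.Tensor) -> torch.Tensor:
    # question_vecs, context_vecs: (batch, dim); row i of each is a true pair
    scores = question_vecs @ context_vecs.T   # similarity of every question with every context
    labels = torch.arange(scores.size(0))     # the matching context sits on the diagonal
    return F.cross_entropy(scores, labels)    # pull true pairs together, push the rest apart
```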
00:17:53.160 | So it's not like our sentence transformers,
00:17:56.560 | where they are just a single model.
00:18:00.040 | And they're used to identify very similar sentences.
00:18:04.240 | This is used to identify not very similar sentences,
00:18:08.600 | but very similar question and context pairs.
00:18:11.920 | And we will see the difference in a moment
00:18:14.880 | when we go through the code.
00:18:18.840 | So let's get started with that.
00:18:21.320 | So come down here.
00:18:23.520 | And the first thing we probably want to do
00:18:25.600 | is initialize our context encoder and our question
00:18:33.720 | encoder from DPR.
00:18:35.440 | Now, we're going to use the Hugging Face transformers
00:18:37.520 | library for this.
00:18:39.480 | So if you do not already, you'd have
00:18:42.880 | to pip install transformers.
00:18:46.520 | Now, if you pip installed sentence transformers,
00:18:50.120 | that does include transformers as a prerequisite.
00:18:53.400 | So if you installed that already,
00:18:55.800 | you should already have transformers as well.
00:18:57.640 | So first thing we want to do is, from transformers,
00:19:04.120 | we want to import a fair few classes here.
00:19:07.920 | So we need both the model, or the encoder,
00:19:10.760 | and tokenizer for each for both our context encoder
00:19:15.600 | and our question encoder.
00:19:16.960 | So let's do the context encoder first.
00:19:20.120 | So write DPRContextEncoderTokenizer and DPRContextEncoder
00:19:27.120 | here.
00:19:30.200 | And then, as well as that, we also
00:19:36.320 | want the question encoder tokenizer and question encoder.
00:19:39.880 | So we write DPRQuestionEncoderTokenizer and also
00:19:46.480 | DPRQuestionEncoder.
00:19:52.200 | And that's all we need to import, so let's run that.
00:19:55.400 | And then, we can go ahead and initialize our tokenizer model.
00:20:01.320 | So we have the context model.
00:20:04.960 | Now, this is going to be the DPRContextEncoder
00:20:11.560 | from_pretrained method.
00:20:12.480 | If you've ever used HuggingFace transformers before,
00:20:15.280 | you should recognize this from_pretrained.
00:20:17.200 | We're just going to load in a model, which
00:20:21.080 | we can find on the HuggingFace.co/models website.
00:20:26.200 | So if you go to that address, and you type
00:20:27.960 | in what I'm about to type in, you will find that it comes up.
00:20:31.360 | So I'm going to type facebook/dpr-ctx_encoder, so CTX for context encoder.
00:20:45.840 | And we want single-nq-base, so the full name is facebook/dpr-ctx_encoder-single-nq-base.
00:20:49.920 | And I'm going to copy this, because we
00:20:52.280 | are going to use it again in just a moment,
00:20:55.560 | for our context tokenizer.
00:20:58.640 | So context_tokenizer equals DPRContextEncoderTokenizer
00:21:06.200 | from_pretrained again.
00:21:10.880 | And then, again, we want the same model name in there.
00:21:16.080 | OK, so they are our context side of the model.
00:21:27.680 | But we also need to get the question side.
00:21:32.760 | So we've got our context encoder and tokenizer.
00:21:35.800 | Now we want to question encoder and tokenizer.
00:21:40.320 | So I write question here and here.
00:21:43.520 | And we're just replacing everything
00:21:45.160 | where we've put CTX with question in here.
00:21:48.320 | So it's this question and this as well.
00:21:57.120 | And then in the model, we are just replacing CTX again
00:22:01.840 | with question.
00:22:04.040 | It's pretty straightforward.
00:22:05.520 | Now, I'm going to run that with you.
00:22:11.880 | If you haven't already got these cached on your machine,
00:22:15.280 | it can take a little bit of time,
00:22:16.700 | because we're downloading four sets of models and tokenizers.
00:22:21.200 | So it can take a little bit of time.
00:22:25.480 | Now, I already have them, so I don't need to wait for that.
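If you're following along, the initialization described above comes out roughly like this (model names as given in the video; the variable names are my own):

```python
# Sketch of initializing the DPR context and question encoders plus their tokenizers.
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

ctx_model = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

question_model = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
question_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
```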
00:22:29.880 | Now, first thing I want to do is set up
00:22:35.120 | a set of questions and context.
00:22:37.440 | So I have three questions here.
00:22:39.680 | Well, you can read them.
00:22:40.680 | I'm not going to go through them all.
00:22:42.220 | And then we have context.
00:22:43.600 | Each question has a couple of contexts
00:22:46.800 | that are kind of relevant, but then just one
00:22:50.160 | that is actually the answer.
00:22:52.760 | And inside here, I've also put in the questions themselves,
00:22:59.040 | because I want to prove that this is not just a sentence
00:23:02.920 | transformer where it's finding the most similar sentence.
00:23:08.120 | So it should, when we have these questions,
00:23:11.520 | it shouldn't return-- like for this one,
00:23:13.600 | it shouldn't return what is a best-selling sci-fi book.
00:23:16.400 | It should instead return the best-selling sci-fi book
00:23:19.840 | is "Doom."
00:23:20.740 | So we should see that there is a difference between using DPR
00:23:25.040 | and using sentence transforms.
00:23:29.000 | So run that.
00:23:30.640 | And then what we want to do is tokenize everything.
00:23:33.080 | So we're going to tokenize our context.
00:23:37.960 | So I'm going to write xb_tokens.
00:23:40.640 | And we want the context tokenizer.
00:23:45.400 | And then in here, we're going to pass our context.
00:23:48.840 | And then if you use HuggingFace transformers,
00:23:53.120 | you should recognize this as well, so maxLength here.
00:23:57.800 | So for this, I'm going to put 256.
00:24:01.280 | And I'll set padding equal to maxLength.
00:24:04.680 | We don't need to truncate anything, I don't think.
00:24:08.640 | No, they're all very short.
00:24:10.600 | So this maxLength, we could even reduce it
00:24:13.600 | to something pretty small.
00:24:14.960 | But I'm going to leave it at that.
00:24:19.520 | So we'll pad up to the max_length.
00:24:21.400 | And oh, the only thing we do need to include here
00:24:25.000 | is that we want to return PyTorch tensors.
00:24:27.600 | So return_tensors equals "pt".
00:24:32.800 | And then what we can do is build our context embeddings.
00:24:38.920 | So this is how we encode our contexts.
00:24:43.720 | I'll just call the output xb, and it is
00:24:47.640 | equal to the context model, ctx_model.
00:24:53.320 | And then in here, we pass our tokens, xb_tokens, like that.
00:24:58.840 | And then for our questions, we do exactly the same thing.
00:25:02.720 | But of course, we just replace the context part of it
00:25:05.760 | with questions.
00:25:06.960 | So here, we have the question tokenizer, we have questions,
00:25:14.680 | and we have the question model.
00:25:17.920 | And then here, I'm going to rename xb to xq, so our query.
00:25:23.200 | OK, let's have a look at what we get.
00:25:28.480 | So first, let's have a look at what we have inside xq.
00:25:34.880 | So we'll see that we have a few different tensors in here.
00:25:38.320 | So I'll just write xq.keys() to see what we have.
00:25:40.600 | You see that we actually only have one output here,
00:25:48.320 | so the pooler output, which is fine
00:25:49.960 | because that's what we need.
00:25:52.160 | So we write xq.pooler_output.
00:25:55.440 | And these here are our embeddings.
00:26:00.200 | So we can write shape to see the shape of those embeddings.
00:26:02.960 | So we have three vectors.
00:26:04.720 | So the number of questions that we passed up here,
00:26:09.120 | and each one of those questions has
00:26:11.080 | been encoded into an embedding of 768 dimensions.
00:26:17.920 | So that looks good.
00:26:19.720 | And we could do the same for xb if we want as well.
00:26:23.200 | It's exactly the same.
00:26:24.480 | So write xb, and we'll see the shape.
00:26:29.000 | Just at this time, we have nine vectors
00:26:31.000 | because, obviously, we have more context than we do questions.
00:26:35.280 | So what we want to do now, I'm going to import Torch.
00:26:39.960 | So again, this should have been installed already
00:26:43.280 | with Hugging Face Transformers and also Sentence Transformers.
00:26:47.920 | So if you've gotten this far, you
00:26:49.280 | don't need to worry about installing this.
00:26:52.920 | What I'm going to do is go for i, and then the query vector
00:26:58.600 | in xq.pooler_output.
00:27:07.880 | So I'm going to enumerate that.
00:27:09.120 | So what I'm doing here is I'm going to run through.
00:27:14.680 | I'm going to create a loop to go through each query
00:27:18.200 | and to get the most similar vector from xb,
00:27:24.760 | so from our encoded context.
00:27:28.680 | So we write probs equals cosine similarity.
00:27:32.640 | So these are our similarity scores,
00:27:34.680 | doing exactly the same as we did before.
00:27:37.600 | We're still going to write xq vec, so the single vector
00:27:41.840 | at the moment.
00:27:42.640 | And from here, we just want xb pooler output, pooler output.
00:27:50.800 | And from there, we want to get the argmax, so the maximum
00:27:58.200 | argument, so the highest score in our probability right here.
00:28:02.800 | So torch.argmax, and here we have probs.
00:28:05.960 | And then what I'm going to do is I'm
00:28:07.420 | going to print the current question that we're asking,
00:28:11.120 | so questions[i].
00:28:13.880 | Now I'm going to print the context which has
00:28:17.840 | been chosen from our argmax.
00:28:20.000 | So we just write context[argmax].
00:28:24.480 | And then I'm just going to put this in here
00:28:27.200 | so we have a little bit of separation.
00:28:29.960 | Let's have a see what we have.
00:28:31.720 | So we get, what is the capital city of Australia?
00:28:35.040 | Now remember, this exact question
00:28:37.200 | was also in our context.
00:28:39.560 | And it's not returning the exact sentence back to us
00:28:46.080 | or the exact question back to us.
00:28:47.440 | It's actually returning as the answer.
00:28:50.200 | So Canberra is the capital city of Australia.
00:28:53.760 | Now second one, as we had hoped, the best-selling sci-fi book
00:28:59.400 | has been chosen to be Dune here.
00:29:01.560 | And then I just wanted to include this one as well
00:29:03.840 | to point out that it's not perfect.
00:29:06.480 | It doesn't always get things right.
00:29:08.620 | So in this case, it didn't find the correct answer
00:29:12.700 | of how many searches are performed on Google.
00:29:14.960 | If we have a look at the context,
00:29:17.520 | so the correct answer should have been this one here.
00:29:22.040 | So Google serves more than 2 trillion queries annually.
00:29:27.240 | So it didn't get that one.
00:29:28.880 | But the other two it did get, despite having
00:29:32.680 | the actual questions in there as well.
00:29:35.420 | One of them here.
00:29:38.740 | So again, I think that's really cool.
00:29:40.660 | And I think Q&A is something that
00:29:43.380 | has a lot of potential in many businesses around the world.
00:29:50.580 | So I think that's a very cool one to use.
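Pulled together, the tokenize-encode-compare loop from this section looks roughly like the following, continuing from the initialization sketch above (the questions and contexts here are illustrative placeholders, not the full set used in the video):

```python
# Sketch of the DPR retrieval loop: encode contexts and questions, then match them.
import torch

questions = ["What is the capital city of Australia?"]          # placeholder examples
context = [
    "Canberra is the capital city of Australia.",
    "What is the capital city of Australia?",
    "Sydney is the largest city in Australia.",
]

xb_tokens = ctx_tokenizer(context, max_length=256, padding="max_length", return_tensors="pt")
xq_tokens = question_tokenizer(questions, max_length=256, padding="max_length", return_tensors="pt")

xb = ctx_model(**xb_tokens).pooler_output        # (num_contexts, 768)
xq = question_model(**xq_tokens).pooler_output   # (num_questions, 768)

for i, xq_vec in enumerate(xq):
    probs = torch.cosine_similarity(xq_vec, xb)  # one score per context
    argmax = torch.argmax(probs)
    print(questions[i])
    print(context[argmax])
```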
00:29:55.020 | OK, so the next one I want to cover
00:30:01.540 | is a mix of language and also vision.
00:30:09.120 | So recently, computer vision has had a few advances
00:30:15.160 | from the discipline of NLP.
00:30:18.640 | So in NLP, we've been using transformers
00:30:20.720 | for a reasonable amount of time now.
00:30:24.360 | And transformers have proven to be
00:30:27.880 | incredible models for language.
00:30:31.480 | And very recently, transformers have
00:30:34.340 | been applied to computer vision as well,
00:30:37.740 | which is very cool, I think.
00:30:40.340 | And what we're finding is that a model or an architecture that
00:30:45.660 | can be used for language can also
00:30:46.980 | be used for computer vision.
00:30:50.320 | And I think that's super cool.
00:30:54.420 | So I want to show you one of those models,
00:30:59.220 | or briefly touch upon one of those models.
00:31:02.260 | We will go into it in more detail
00:31:03.940 | in a future article and video.
00:31:06.900 | But for now, I'm just going to mention it.
00:31:09.060 | We have the Vision Transformer, which is very recent.
00:31:14.100 | I think the paper is January 2021, if I'm not wrong.
00:31:19.180 | And although we don't need a Vision Transformer
00:31:22.100 | to build an embedding for an image,
00:31:26.380 | I think the fact that we can use it is pretty cool.
00:31:29.360 | And we can really do it very easily
00:31:31.300 | with Hugging Face Transformers, as we will see
00:31:33.780 | when we go through the code.
00:31:35.540 | Now, a very interesting use of this
00:31:40.020 | is to actually take two different encoders,
00:31:43.400 | both transformers.
00:31:45.340 | The text encoder is more of a traditional transformer,
00:31:48.860 | obviously.
00:31:49.860 | And the image encoder is our new Vision Transformer.
00:31:54.460 | And we can actually train them together,
00:31:57.220 | like we did with DPR, the bi-encoder architecture.
00:32:00.780 | And what we can do is train it to put images and language,
00:32:07.180 | so language that describes an image,
00:32:09.580 | and map them to the same point in a vector space,
00:32:16.060 | or very close, at least.
00:32:18.360 | And that's what I've tried to visualize.
00:32:20.020 | You can see on the screen now.
00:32:21.380 | So we have the text "two dogs running."
00:32:22.540 | We process that through our text encoder.
00:32:24.980 | And we get a very similar vector to if we
00:32:26.980 | took the picture of two dogs running
00:32:29.460 | and process that through an image encoder, which would
00:32:32.900 | be our Vision Transformer.
00:32:35.540 | So I think that's--
00:32:37.300 | I don't know.
00:32:37.820 | For me, I think that's so cool.
00:32:41.060 | Now, I'm going to be using these three pictures that I
00:32:46.340 | got from Unsplash.
00:32:47.660 | If you want to see the photo credits,
00:32:50.020 | they will be either in the article,
00:32:52.140 | if you're reading the article, or they'll
00:32:53.800 | be in the video description, if not.
00:32:57.100 | And what I'm going to do is we have these three pictures.
00:33:00.300 | We're going to encode those.
00:33:01.540 | And I'm also going to encode these three captions,
00:33:04.620 | and a few other captions as well.
00:33:06.660 | And we're going to see if they match.
00:33:09.540 | So we're going to perform a similarity, or a cosine
00:33:12.580 | similarity search across them, and see
00:33:14.940 | which pairs match the closest.
00:33:17.780 | And we'll see the results are pretty cool, in my opinion.
00:33:23.980 | So let's jump into it.
00:33:26.980 | Again, we're going to be using Transformers.
00:33:29.660 | And we're going to be using a new model from OpenAI,
00:33:33.300 | which is for the image and text.
00:33:37.420 | Similar to DPR, where DPR is in question and context encoding,
00:33:43.060 | Clip is using two encoders to do image and caption encoding,
00:33:49.260 | which is pretty cool.
00:33:50.720 | So we're going to do, from Transformers, import Clip
00:33:58.300 | Processor.
00:33:59.140 | So I'm kind of viewing this processor
00:34:01.700 | as what we could call a tokenizer in typical language
00:34:07.900 | transformers.
00:34:10.060 | And then we want the CLIPModel.
00:34:12.660 | So this contains both encoders for us,
00:34:15.140 | so we don't have to mess around.
00:34:17.220 | Like we did with DPR, where we imported four classes.
00:34:23.100 | Here, we're just importing the two.
00:34:26.820 | And then what we want to do is we'll just initialize those.
00:34:29.580 | So again, very similar.
00:34:32.180 | So we do CLIPModel.from_pretrained.
00:34:38.900 | And in here, we write openai/clip-vit.
00:34:48.620 | So it's the Vision Transformer, this ViT you see here.
00:34:51.780 | It refers to the Vision Transformer
00:34:53.700 | which CLIP is using or is based on, at least the vision aspect.
00:35:01.540 | And we want to write Base Patch 32.
00:35:06.500 | So I mean, we'll go into it in more detail,
00:35:08.860 | but the patch part of that is referring to the way
00:35:11.500 | that the model almost tokenizes your images.
00:35:16.380 | It splits an image into different patches.
00:35:19.340 | And that's the patch size, the patch 32 there.
00:35:23.380 | So we also want the processor, which again, we
00:35:27.540 | can kind of see that as akin to or equivalent to our tokenizer.
00:35:34.340 | just as we would do for language models.
00:35:39.420 | And again, I'll just copy that across.
00:35:42.580 | OK, so model processor looks good.
00:35:50.420 | Let me rerun it.
00:35:53.740 | OK, again, I already have it cached,
00:35:56.620 | so it won't download for me.
00:35:58.900 | And you'll get this thing here.
00:36:02.420 | Don't worry about it.
00:36:03.340 | It still works.
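So far, that initialization is just the following short sketch (model name as given above; variable names are my own):

```python
# Sketch of loading CLIP and its processor from Hugging Face Transformers.
from transformers import CLIPProcessor, CLIPModel

model_id = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)
```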
00:36:04.740 | Now I'm going to copy in the code
00:36:07.020 | I'm using to get the photos.
00:36:09.260 | So I have the photo URLs here.
00:36:11.220 | I'm using PIL to create the image object.
00:36:17.620 | And I'm using requests to actually get the image
00:36:20.300 | from the URL that we have here.
00:36:22.460 | And then down here, I'm just going
00:36:23.880 | to show you what images we have.
00:36:25.460 | So I actually need to get matplotlib in there as well.
00:36:30.260 | So import matplotlib.pyplot as plt and numpy as well.
00:36:39.380 | OK, and we'll see those images that we saw before.
00:36:46.820 | So we have the puppy or dog running,
00:36:50.060 | the dog hiding behind tree, and then we
00:36:51.820 | have the two dogs running.
00:36:53.540 | OK, so they are our images, and we've
00:36:55.740 | stored them in images here, OK?
00:37:00.500 | And the next part are captions.
00:37:02.820 | So I've just written these six captions.
00:37:06.620 | The first three are actually the captions,
00:37:10.460 | and then the other three I just made up.
00:37:13.580 | I included trees and park in there
00:37:15.620 | because they look like, well, there's a tree here
00:37:17.740 | and there's a park here.
00:37:18.740 | So try and make it a little bit more difficult.
00:37:23.620 | But I mean, they're reasonably straightforward still,
00:37:26.780 | I think.
00:37:27.660 | And then to create our--
00:37:29.300 | you can imagine, you can see these as tokens.
00:37:32.620 | We do inputs equals processor, similar to our tokenizer again.
00:37:39.700 | And we have a few inputs here.
00:37:42.540 | So we have the text, and we want to input our captions.
00:37:46.420 | And then we also have images.
00:37:48.420 | And of course, we just input our images.
00:37:51.300 | And then we want to set return_tensors, or return tensors,
00:37:57.300 | equal to "pt".
00:37:59.660 | And we set padding to True, OK?
00:38:03.660 | So return_tensors="pt", OK?
00:38:15.500 | And if we-- let me have a quick look at what we have here.
00:38:18.220 | So we have our input IDs, pixel values, and so on.
00:38:22.020 | So input IDs, we also have attention here as well.
00:38:26.060 | So these first two are for our text,
00:38:30.500 | and then pixel values are for the images.
00:38:32.340 | And now what we want to do is create our encodings.
00:38:39.540 | So in here, because we're using the clip model,
00:38:42.780 | we're actually going to perform the encodings.
00:38:45.180 | And it's also going to do the whole similarity checking
00:38:49.300 | for us as well, and identify which images and captions
00:38:53.300 | are the closest pairs.
00:38:55.660 | Or what it's going to do is go through each image
00:38:58.140 | and find the caption that it believes belongs to it.
00:39:01.620 | So like before, we just write inputs here.
00:39:05.020 | And I think maybe let's have a look
00:39:07.740 | at what we have in our outputs.
00:39:10.180 | So we can see we'll have a few things here
00:39:13.500 | that I think are pretty useful.
00:39:15.820 | So we have the logits per image and per text.
00:39:19.300 | So for these, we can--
00:39:21.460 | for each of our text, we can use this
00:39:24.660 | to get the most probable image that
00:39:29.540 | is assigned to each caption.
00:39:31.940 | And in logits per image, we can use
00:39:34.380 | these to find the most probable caption for each image.
00:39:39.820 | And then-- so what we were doing before where we were just
00:39:42.340 | extracting the embeddings, we can also do that.
00:39:45.540 | And maybe I'll just copy in the code for that as well.
00:39:48.860 | So we have the text embeddings here.
00:39:50.380 | So we can extract those if we want.
00:39:51.840 | And we also have the image embeddings in here as well.
00:39:55.620 | And then a little further down, we have the logits somewhere,
00:39:59.260 | pooler output here.
00:40:00.740 | Yeah.
00:40:02.700 | So we have the pooler outputs and the logits.
00:40:05.820 | OK, so let me just close that.
00:40:09.660 | And I do believe we also have a few more.
00:40:11.460 | So let me just show you those quickly.
00:40:14.620 | Yeah, we have a few tensors there as well,
00:40:17.860 | vision model output, text model output as well.
00:40:20.900 | Now what we'll do is I'm going to paste this code in.
00:40:24.140 | And so here, I'm going to go for image in each image.
00:40:29.220 | I'm going to iterate through.
00:40:30.860 | I'm going to get the argmax, so the caption that it believes
00:40:35.780 | or is predicted for that image.
00:40:37.660 | And then we're going to show it.
00:40:39.060 | And we're going to print both out.
00:40:40.480 | Let's see if they match.
00:40:44.020 | Oh, so I'm getting ahead of myself there.
00:40:47.180 | So we also need to--
00:40:48.780 | so the probability there is the probs equals outputs.
00:40:56.380 | And we want the logics pair image.
00:41:00.620 | And we'll take the argmax while we're here.
00:41:03.220 | So dim equals 1 for that.
00:41:05.440 | And let's have a look at what we get.
00:41:07.460 | We'll see that we get this.
00:41:09.740 | So it's predicting caption 2, caption 0, and then caption 1
00:41:13.420 | for our three images.
00:41:15.740 | Let's look through that.
00:41:16.780 | And we'll see we get a dog running.
00:41:18.940 | Cool.
00:41:19.940 | A dog hiding behind a tree.
00:41:21.660 | And then two dogs running as well, which I don't know.
00:41:25.300 | For me, maybe because I'm usually working with language,
00:41:29.500 | I think seeing both language and images together is--
00:41:34.860 | I don't know-- really cool.
00:41:37.460 | Super-- I don't know-- fascinating that it actually
00:41:40.580 | works like that so easily.
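The matching step just described, as a compact sketch continuing from the CLIP sketch above (the captions and image URLs here are placeholders; the video uses three Unsplash photos and six captions):

```python
# Sketch of matching each image to its most probable caption with CLIP.
import requests
from PIL import Image

captions = [
    "a dog running in a field",            # placeholder captions
    "a dog hiding behind a tree",
    "two dogs running across a beach",
]
urls = ["https://example.com/dog1.jpg"]    # placeholder image URLs
images = [Image.open(requests.get(u, stream=True).raw) for u in urls]

inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image: one row per image, one column per caption
preds = outputs.logits_per_image.argmax(dim=1)
for img_idx, cap_idx in enumerate(preds):
    print(f"image {img_idx} -> {captions[cap_idx]}")
```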
00:41:43.420 | So another thing that I want to show you very quickly--
00:41:48.020 | I'm just going to copy the code in,
00:41:49.500 | because I don't want to go through all of it.
00:41:52.180 | It'll take a while.
00:41:53.620 | So we just have the embeddings.
00:41:55.100 | So these are the embeddings if we wanted to extract them
00:41:57.420 | and do what we did before with them.
00:41:59.300 | Or if you wanted to take these embeddings,
00:42:01.020 | put them in a vector index somewhere, a vector database.
00:42:06.940 | And we can get our query.
00:42:10.860 | So I'm going to do a dog hiding behind a tree.
00:42:12.820 | We can get the context--
00:42:15.700 | or not the context, the images, the image embeddings.
00:42:19.420 | Again, like before, we do the similarity.
00:42:22.580 | So the cosine similarity, we get the highest one
00:42:24.900 | is the second one here.
00:42:26.060 | So it's looking pretty good.
00:42:27.700 | And from there, we get our prediction,
00:42:29.700 | which is argmax, so we'll take number 1.
00:42:33.260 | And let's have a look at what our prediction is then.
00:42:35.500 | So we will plot that.
00:42:37.260 | We'll show you the image again.
00:42:38.660 | We have prediction.
00:42:39.580 | So it's shown as the dog hiding behind the tree
00:42:42.940 | for our query, which is a dog hiding behind a tree.
00:42:46.100 | So again, super cool.
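And that last text-to-image search, where we extract the embeddings ourselves and compare them directly, looks roughly like this (continuing from the outputs above; again, the variable names and the choice of query index are mine):

```python
# Sketch of querying the image embeddings with a text query directly.
from torch.nn.functional import cosine_similarity

text_emb = outputs.text_embeds     # (num_captions, 512) projected text embeddings
image_emb = outputs.image_embeds   # (num_images, 512) projected image embeddings

query = text_emb[1]                           # e.g. "a dog hiding behind a tree"
scores = cosine_similarity(query, image_emb)  # one score per image
pred = scores.argmax().item()
print(f"best-matching image index: {pred}")
```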
00:42:49.820 | Now, that's it for this video.
00:42:51.460 | We've, I think, covered quite a lot of embedding methods.
00:42:56.260 | We've had a look at some introduction
00:42:58.780 | to dense vectors with Word2Vec and where it came from
00:43:03.460 | and how it quickly evolved.
00:43:05.500 | And we've had a look at sentence embeddings and sentence
00:43:08.700 | transformers, moved on to Q&A with Facebook AI's DPR.
00:43:14.460 | And now we've had a look at the new Vision Transformer
00:43:17.300 | and how we can use that with other transform models
00:43:20.300 | to build these really cool cross-media embeddings
00:43:26.500 | that we can compare, which has blown me away a little bit.
00:43:30.060 | Now, that's it for this video.
00:43:33.340 | But like I said, this is the first video and article
00:43:36.940 | in what will be a series on embeddings.
00:43:39.980 | So there's a lot more to come.
00:43:43.100 | But for now, thank you very much for watching.
00:43:45.140 | And I'll see you in the next one.