Intro to Sentence Embeddings with Transformers
Chapters
0:00 Introduction
0:58 Machine Translation
7:32 Transformer Models
9:18 Cross-Encoders
15:32 Softmax Loss Approach
16:12 Label Feature
18:54 Python Implementation
We're going to explore how we can use sentence transformers and sentence embeddings in NLP for semantic similarity applications. Now, in the video, we're going to have a quick recap of where these models came from. So we're going to have a quick look at recurrent neural networks, use that to define what the difference is between a transformer and a sentence transformer, and also understand the embeddings produced by transformers or sentence transformers. Then we'll go through how we can implement our own sentence transformers. So I think we should just jump straight into it.
I think it makes sense if we piece together where transformers come from, and from there try and figure out the difference between a transformer and a sentence transformer. So we're going to start with recurrent neural networks. An encoder-decoder network like this is a set of, well, recurrent units, usually something like LSTM or GRU units. The encoder would read in a sentence and compress it into a single context vector, and then this context vector would be passed along to the decoder. And the decoder would decode that into another language, say, like French, or in this case, actually, Italian. The problem is that we're trying to pass a lot of information through that single point between the encoder and the decoder. Now, that creates what is called an information bottleneck. There's too much information trying to be crammed through that single connection.
And what the attention mechanism does is, for every step or every token that is decoded by our decoder, that token is sent to the attention mechanism, and the alignment between the decoder at that time step and every encoder time step is calculated. So it tells the decoder which tokens from the encoder it should focus on: which ones do I need to pay attention to, given whatever my current unit is? And what this attention mechanism produces is essentially an alignment matrix. So this is from another very well-known paper in 2015. Along one side, we have the French words or the French translation, and then on the top, we have the English translation. And all of these boxes you see are the activations of the attention mechanism between those word pairs, so we can see essentially which words are the most aligned. And that is essentially what the attention mechanism did: it allowed the decoder to focus on the most relevant words from the encoder at each step.
Now, moving on, in 2017, there was another paper which found that you could remove the recurrent part of the encoder-decoder network and keep just the attention mechanism. And what they produced, with a few modifications, performed far better than any of the recurrent neural networks with attention that came before it. And what they named this new model was the transformer. So this 2017 paper, "Attention Is All You Need," really marked the end of attention being something that was aimed to help improve recurrent neural networks. Now, of course, like I said, the attention mechanism they used was not just the same plain attention mechanism that was used before in recurrent neural networks; they modified it. And those modifications really came down to three key changes.
The first was positional encoding, which replaced the key advantage of recurrent neural networks: because they processed their inputs recurrently, they considered one word or one time step after the other, so there was a sense of order to those models. Positional encoding replaces that by adding a set of varying sine wave activations to each token embedding, and these activations vary based on the position of the token. So what you have there is a way for the network to identify the order of the tokens, or the token activations or embeddings, that are being processed.
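To make that concrete, here is a minimal NumPy sketch of the sinusoidal positional encodings from "Attention Is All You Need"; the sequence length and embedding size below are just example values, not anything specified in the video.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal position encodings: sine on even dimensions, cosine on odd ones."""
    pos = np.arange(seq_len)[:, None]          # token positions 0..seq_len-1
    i = np.arange(d_model)[None, :]            # embedding dimensions 0..d_model-1
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])      # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])      # odd dimensions use cosine
    return pe

# these encodings are added to the token embeddings before they enter the encoder
pe = positional_encoding(seq_len=128, d_model=768)
print(pe.shape)  # (128, 768)
```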
Now, self-attention is where the attention mechanism is applied between a word and all of the other words in its own context, so the sentence or the paragraph that it belongs in. Before, attention was being applied between the decoder and the encoder; this is like applying attention within the encoder itself, between the encoder and the encoder. And what this did is, rather than just embedding the meaning of a word, it also embeds the context of that word into its representation, which obviously greatly enriched the amount of information in each embedding.
And then the third and final change that they made was multi-head attention, which we can see as several parallel attention mechanisms working together. Using multiple heads allowed the representation of several sets of relationships rather than just one, which, of course, you could only do to an extent with a single attention mechanism.
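As a quick illustration of both self-attention and multiple heads, here's a small PyTorch sketch using torch.nn.MultiheadAttention; the eight heads and 768-dimensional embeddings are just example values, not something taken from the video.

```python
import torch
import torch.nn as nn

# 8 parallel attention heads over 768-dimensional token embeddings (example values)
self_attn = nn.MultiheadAttention(embed_dim=768, num_heads=8, batch_first=True)

x = torch.randn(1, 10, 768)      # (batch, sequence length, embedding dim)
out, attn = self_attn(x, x, x)   # self-attention: query, key, and value are all x
print(out.shape, attn.shape)     # torch.Size([1, 10, 768]) torch.Size([1, 10, 10])
```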
So with transformer models, we take the core of the model, which has been trained using a significant amount of time and computing power by the likes of Google and OpenAI. We just take that core model, add a few layers onto the end of it that are designed for our specific use case, and train it a little bit more. And I think one of the most widely known, or the most popular, of these models is probably BERT. And of course, there are other models as well; later in this video, we're going to have a look at one of those too.
Now, so far, we've explained that transformers have much richer word or token embeddings than the models that came before them. But we're interested not in word or token-level embeddings; we're interested in sentence-level embeddings, and everything we've described so far is based on word-level or token-level embeddings.
So what would happen before sentence transformers is we'd use something called a cross-encoder. And like I said before, we have the core BERT model, and then we just add a couple of layers onto the end of it. So it's a core BERT model followed by a feedforward neural network. We pass two sentences into the BERT model at once, the BERT model embeds them with very rich word and token-level embeddings, and the feedforward network takes those and decides how similar those two sentences are.
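The sentence-transformers library actually ships a CrossEncoder class that implements exactly this pattern, a transformer scoring a sentence pair in a single forward pass. A minimal sketch, with the checkpoint name and sentences being my own examples rather than anything named in the video:

```python
from sentence_transformers import CrossEncoder

# any cross-encoder checkpoint works; this STS-trained one is just an example
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-base")

# each pair is passed through the model together and scored directly
score = cross_encoder.predict(
    [("A man is eating food.", "Someone is having a meal.")]
)
print(score)  # one similarity score per sentence pair
```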
Now, this is fine, and it is actually accurate. But it's not really scalable, because essentially, every time you want to compare a pair of sentences, you have to run a full inference computation on BERT. So let's say we wanted to build a semantic similarity search across 100,000 sentences. We would have to run the BERT inference computation over and over to calculate the similarity between all of those sentences, and comparing every pair, we would end up with just under 500 million comparisons there. So running a full BERT inference prediction 500 million times is going to take a very long time.
What we really need is something like word or token embeddings, but for sentences. Now, with the original BERT, we could actually produce these. We take the token embeddings that the model outputs, and typically with BERT there are 512 of those being output by the model. We could take the average across all of those and take that average as what we call a sentence embedding. Now, there were other methods of doing this as well, but that was probably the most popular and effective one.
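A rough sketch of that naive mean pooling approach using the Hugging Face transformers library; the checkpoint and sentences below are just examples. We average the token embeddings, using the attention mask to ignore padding:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["A man is eating food.", "Someone is having a meal."]  # example inputs
tokens = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    token_embeddings = bert(**tokens).last_hidden_state     # (batch, seq_len, 768)

# mean pooling: average the token embeddings, masking out the padding tokens
mask = tokens["attention_mask"].unsqueeze(-1).float()        # (batch, seq_len, 1)
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embeddings.shape)                             # (batch, 768)
```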
And we could take that sentence embedding, store it somewhere, and then we could just compare it to other sentence embeddings using something much faster than a full BERT computation, like cosine similarity. That's fast enough for us, but it's just not that accurate. It was actually found that comparing averaged GloVe embeddings, which came long before BERT, produced better results. So we can't really use what is called a mean pooling approach with BERT on its own.
That changed in 2019, when the authors of the SBERT paper introduced what is the first sentence transformer, sentence-BERT. And it was found that sentence-BERT, or SBERT, outperformed the previous state-of-the-art sentence embedding methods on semantic textual similarity tasks. To give an idea of the speed difference: if we wanted to find the most similar sentence pair from 10,000 sentences, the cross-encoder approach would take around 65 hours of BERT inference. With SBERT embeddings, they could create all the embeddings in around five seconds, and then they could compare all of those with cosine similarity in around 0.01 seconds.
Now, I think that's pretty much all the context we need, and what we'll do now is dive into a little bit of how SBERT actually works. Now, we said before we have the core transformer models, and what SBERT does is fine-tune on sentence pairs using what is called a Siamese architecture, or Siamese network, which we can view as two BERT models that are identical, with the weights between those two models tied. Now, in reality, when we're implementing this, we just use a single BERT model. And what we do is we process one sentence, sentence A, through the model, and then we process another sentence, sentence B, through that same model. So whereas with our cross-encoder we were processing the sentence pair together, here the sentences are processed separately, and the model is optimized to reduce the difference between the two vector embeddings, or two sentence embeddings, that are produced for sentence A and sentence B when the pair is similar. And those sentence embeddings are what we're ultimately after.
Now, to actually create those sentence embeddings, we take the token embeddings that BERT outputs and apply a mean pooling operation, just like I described before. Now, there are several different ways of training SBERT, but the one that was covered most prominently in the original paper is called the softmax loss approach, and that's what we are going to be describing here. For the softmax loss approach, we can train on natural language inference datasets. Now, the 2019 paper used two of those: the Stanford Natural Language Inference (SNLI) corpus and the Multi-Genre Natural Language Inference (MNLI) corpus.
And what we have inside there are sentence pairs. One is a premise, which suggests a certain hypothesis. So those two sentences are, in some cases, related, and we can tell whether they're related or not using the label feature. Now, the label feature contains three different classes. We have 0, which is entailment, meaning that the premise sentence suggests the hypothesis sentence. Then we have 1, which is neutral: they could both be true, but they are not necessarily related. And then we have number 2, which is contradiction, which means the premise and hypothesis sentences contradict each other.
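As a quick sketch of what those records look like, assuming the Hugging Face datasets library and its "snli" dataset id:

```python
from datasets import load_dataset

# each record has a premise, a hypothesis, and a label
# (0 = entailment, 1 = neutral, 2 = contradiction, -1 = no gold label)
snli = load_dataset("snli", split="train")
print(snli[0])
```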
The Siamese BERT, or our single shared BERT, outputs two sentence embeddings, u and v, and what we do is concatenate those two sentence embeddings. Now, the paper explained a few different ways of doing that, but the default is this: we take the absolute value of u minus v, so we're basically finding the element-wise difference between u and v, and then we concatenate u, v, and |u - v| all together. They are then passed into a feedforward neural network, which takes that concatenated vector, three times the sentence embedding length, as its input and outputs just three activations. Now, those three activations align with our three labels. So what we do from there is take those three activations and calculate the softmax loss between those predicted labels and the true labels. Now, softmax loss is nothing more than a cross-entropy loss applied across those activations.
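Here's a minimal PyTorch sketch of that classification head, assuming 768-dimensional sentence embeddings u and v coming out of the shared BERT-plus-mean-pooling model; the batch size is just an example.

```python
import torch
import torch.nn as nn

class SoftmaxLossHead(nn.Module):
    """Feedforward head over (u, v, |u - v|) with a cross-entropy ("softmax") loss."""

    def __init__(self, embed_dim: int = 768, num_labels: int = 3):
        super().__init__()
        self.classifier = nn.Linear(embed_dim * 3, num_labels)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, u, v, labels):
        features = torch.cat([u, v, torch.abs(u - v)], dim=-1)  # concatenation step
        logits = self.classifier(features)                       # three activations
        return self.loss_fn(logits, labels)

# example usage with a batch of 16 sentence pairs
head = SoftmaxLossHead()
u, v = torch.randn(16, 768), torch.randn(16, 768)
labels = torch.randint(0, 3, (16,))   # 0 = entailment, 1 = neutral, 2 = contradiction
loss = head(u, v, labels)
```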
And that's really all there is to training one of these models. Now, we are going to cover all of this in full, and we're going to go through the code and everything you need to train your own model, but for now, we're just describing the process; we'll do that properly, well, probably in the next video and article. So that's the theory behind sentence embeddings and where sentence transformers and transformers come from. What I want to do now is actually use some of these models with the sentence-transformers library, which was built by the same people who designed SBERT.
Now, if you do not already have sentence transformers installed, you can install it with pip. So all I'm going to do now is, from sentence_transformers, import the SentenceTransformer class object. And then from there, we can initialize a sentence transformer model. So all we need to write is model equals SentenceTransformer, and then in here, we just need to write our model name.
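For reference, roughly what that looks like; the exact checkpoint isn't visible in this transcript, so "bert-base-nli-mean-tokens" below is an assumption based on the BERT-plus-mean-pooling model described next.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# model name assumed; any SBERT-style checkpoint can be dropped in here
model = SentenceTransformer("bert-base-nli-mean-tokens")
print(model)  # shows the Transformer module followed by the Pooling module
```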
Inside this model, we have the transformer model, and we have the pooling layer, which is the mean pooling that I mentioned before. Now, for the transformer, we can see the max sequence length, so the maximum number of tokens that we can input there, and we can see that it's a BERT model from the Hugging Face library. Then in pooling, we can see the output dimension of our sentence embeddings, and we can also see the way that the tokens have been pooled to create the sentence embedding, which in this case is mean pooling.
Next, we write a few sentences and encode them with model.encode to create our sentence embeddings. And let's see what they look like, so embeddings again. We see that we have these dense vectors, which are our embeddings, and each one of these is for a single sentence. We can check the shape of that if we want as well, and you see that we have five embeddings for five sentences, each with 768 dimensions.
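The five sentences used in the video aren't fully recoverable from this transcript, so the ones below are stand-ins, with the last two written to share meaning but no descriptive words, in the same spirit as the video's examples.

```python
# stand-in sentences; the video uses its own set of five
sentences = [
    "The bartender poured a glass of wine.",
    "A man is playing a guitar on stage.",
    "The new movie was full of plot twists.",
    "Someone in a gorilla costume is playing a set of drums.",
    "A person dressed as an ape is drumming.",
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # (5, 768): five sentence embeddings, 768 dimensions each
```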
And now that we have our sentence embeddings, we can use these to quickly compare sentence similarity for quite a few use cases. The most popular are semantic textual similarity, which is a comparison of sentence pairs; semantic search, which we've covered a lot already and which enables us to search based on concepts rather than specific words, which is pretty cool; and clustering, which is obviously useful for things like topic modeling.
So what we're going to do now is compare all of these sentence embeddings, using nothing more than a cosine similarity function and NumPy. And we're also going to use the sentence-transformers util module, so we write from sentence_transformers.util import cos_sim. Then what I'm going to do is initialize an empty array of zeros, and I want that to be the length of our sentences in both dimensions. And for each row, we just want to say the similarity on that specific row is going to be equal to the cosine similarity between that sentence's embedding and all of our embeddings. We don't strictly need to fill both halves of the matrix, because, well, we've already got those pairs on the other side of the array, but computing the full thing is fine too.
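A sketch of that similarity matrix loop, assuming a sentence-transformers version that exposes cos_sim in its util module, and reusing the sentences and embeddings from the previous step:

```python
import numpy as np
from sentence_transformers.util import cos_sim

n = len(sentences)
sim = np.zeros((n, n))          # empty similarity matrix

for i in range(n):
    # cos_sim returns a (1, n) tensor of similarities against every embedding
    sim[i] = cos_sim(embeddings[i], embeddings).numpy().flatten()
```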
To visualize that, we're going to import matplotlib.pyplot as plt and seaborn as sns; this makes it a little bit easier and nicer to look at as well. And I'm just going to write sns.heatmap and pass in sim.
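And the plotting code, roughly, with sim taken from the loop above; annot=True just writes each similarity score inside its cell so the values are easy to read off.

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.heatmap(sim, annot=True)    # annotate each cell with its similarity score
plt.show()
```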
And we can see here, so we have the sentence values along each axis. Actually, let me print them out here instead. So if I print sentences, maybe it's better if I look at them that way: number zero is obviously this first sentence, and so on down the list. Now, if we want to look at the most similar pair according to the heatmap, it's sentences 3 and 4. And if we have a look, 3 and 4 are these two, and these two are the only ones that kind of mean the same thing. I've written these so that they basically carry the same meaning while sharing none of the same descriptive words. So you can see that it's identifying those two as the most similar pair. And then we get some other similarity scores that are higher than you might expect; one of those pairs is talking about eggplants and mannequin heads. I suppose, in reality, someone being an eggplant and this sort of thing are both kind of weird, strange things to say, so maybe that's why it's capturing similarity there. So generally, I think it's good that it identifies the right pair, even if some of the other scores are a bit high.
And we do find that with the more recent models, those scores are more in line with what you'd expect. So what we can do is get this other model. We're going to use SentenceTransformer again, and this time we are using the MPNet base model. Now, this is basically the highest performing sentence transformer model at the moment. Although I was told on the channel's Discord by Ashraq that, when they were training this MPNet model shown on the Sentence Transformers home page, they also trained another version of it, and that one does have a slightly higher performance as well. Looking at this model, there are a couple of differences. For starters, the max sequence length is three times as long with the MPNet base model as it was with the BERT model, and we also have this additional normalization function at the end. So I'm just going to take the code from before, bring it down here, and let's just use the heatmap straight away. So let's write embeddings equals model.encode(sentences).
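Again, the exact checkpoint isn't visible in this transcript; "all-mpnet-base-v2" below is an assumption that matches the description (MPNet base, a 384-token max sequence length, and an extra normalization layer). The sentences list is the same one as before.

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# checkpoint assumed; described in the video as the MPNet base sentence transformer
model = SentenceTransformer("all-mpnet-base-v2")
print(model)  # Transformer (max_seq_length=384) -> Pooling -> Normalize

embeddings = model.encode(sentences)   # re-encode the same sentences as before

n = len(sentences)
sim = np.zeros((n, n))
for i in range(n):
    sim[i] = cos_sim(embeddings[i], embeddings).numpy().flatten()

sns.heatmap(sim, annot=True)
plt.show()
```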
Now, if we run it, we'll see that the similarity of these other sentence pairs drops right down, but it's still identifying 3 and 4 as pretty similar. So we get a decent performance increase from this MPNet model, which is the most recent model, compared to the original SBERT model here. So I think that's pretty cool to see as well. That's it for this introduction to sentence embeddings and the Sentence Transformers library. Now, going forward, obviously, this is a series of videos, so we're going to be covering a lot more than just Sentence Transformers. Next, we'll look at how we can train a sentence BERT model, an SBERT model.