Back to Index

Intro to Sentence Embeddings with Transformers


Chapters

0:00 Introduction
0:58 Machine Translation
7:32 Transformer Models
9:18 Cross-Encoders
15:32 Softmax Loss Approach
16:12 Label Feature
18:54 Python Implementation

Transcript

Hi, welcome to the video. We're going to explore how we can use sentence transformers and sentence embeddings in NLP for semantic similarity applications. Now, in the video, we're going to have a quick recap on transformers and where they came from. So we're going to have a quick look at recurrent neural networks and the attention mechanism.

And then we're going to move on to trying to define what is the difference between a transformer and a sentence transformer, and also understanding, OK, why are these embeddings that are produced by transformers or sentence transformers specifically so good? And at the end, we're also going to go through how we can implement our own sentence transformers in Python as well.

So I think we should just jump straight into it. Before we dive into sentence transformers, I think it would make a lot of sense if we pieced together where transformers come from with the intention of trying to understand why we use transformers now rather than some other architecture. And I think it's also very important if we try and figure out the difference between a transformer and a sentence transformer as well.

So we're going to start with recurrent neural networks. And more specifically, I want to have a look at machine translation. So machine translation used something called an encoder-decoder network, where you would have an encoder, which is a set of, well, recurrent units, usually something like LSTMs or GRUs. And that encoder would encode some sort of input text.

So in, let's say, English, it would encode that English text into something called a context vector. And then this context vector would be passed along to a decoder network, which, again, is just another set of LSTM or GRU units. And it would decode that into another language, say, like French, or in this case, actually, Italian.

That is how machine translation worked back with recurrent neural networks. The only issue is that we're trying to pass a lot of information through that single point between the encoder and the decoder. Now, that creates what is called an information bottleneck. There's too much information trying to be crammed through that single point.

So what they came up with is something called the attention mechanism. And what the attention mechanism does is, for every step or every token that is decoded by our decoder, that token is sent to the attention mechanism. And the alignment between the decoder hidden state at that time step and each of the encoder hidden states is calculated.

And what that does is essentially build this type of attention. So it tells the decoder which tokens from the encoder to focus on. So it's literally saying, where do I need to pay attention, given whatever my current unit is? And this attention mechanism, what it produces is something like this.

So this is from another very well-known paper in 2015. And what you can see is a matrix of the-- we have the French words or the French translation on the left, on the y-axis. And then on the top, we have the English translation. And all of these boxes you see are the activations of the attention mechanism.

So we can see essentially which words are the most aligned. And that is essentially what the attention mechanism did. It allowed the decoder to focus on the most relevant words in the encoder part of the network. Now, moving on, in 2017, there was another paper called "Attention is All You Need." And this really marked, I think, what is a turning point in NLP.

What was found in this paper is that they could remove the recurrent part of the encoder-decoder network and maintain just the attention mechanism. And what they produced, with a few modifications to the attention, was a higher-performing model than any of the recurrent neural networks, with attention or without attention, that came before it.

And what they named this new model was a transformer. So this 2017 paper, "Attention is All You Need," is where transformers came from. And they actually came from a mechanism that was aimed to help improve recurrent neural networks. Now, of course, like I said, the attention mechanism was not just the same plain attention mechanism that was used before in recurrent neural networks.

It had been modified a little bit as well. And those modifications really came down to three key changes. And those were positional encoding, which replaced the key advantage of recurrent neural networks in NLP, which was the ability to consider the order of a sequence. Because they were recurrent neural networks, they considered one word or one time step after the other.

So there was a sense of order to those models that does not appear in, for example, convolutional neural networks. And this positional encoding worked by adding a set of varying sine wave activations to each input embedding. And these activations varied based on the position of the word or token.

So what you have there is a way for the network to identify the order of the tokens or the token activations or embeddings that are being processed. The next change was self-attention. Now, self-attention is where the attention mechanism is applied between a word and all of the other words in its own context.

So the sentence or the paragraph that it belongs in. Now, we saw with the encoder/decoder that attention was being applied between the decoder and the encoder. This is like applying attention to the encoder and the encoder again. And what this did is, rather than just embedding the meaning of a word, it also embeds the context of a word into its vector or word representations, which obviously greatly enriched the amount of information that you have within that embedding.

And then the third and final change that they made was the addition of multi-head attention. And we can see multi-head attention as several parallel attention mechanisms working together. And using these multiple attention heads allowed the representation of several sets of relationships rather than just a single set.
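To make that more concrete, here is a minimal PyTorch sketch of multi-head self-attention using torch.nn.MultiheadAttention, where the same sequence supplies the queries, keys, and values; the sizes are illustrative only:

```python
import torch
import torch.nn as nn

seq_len, d_model, n_heads = 128, 768, 12     # illustrative, BERT-base-like sizes
x = torch.randn(seq_len, 1, d_model)         # (sequence length, batch, embedding dim)

# self-attention: the same sequence provides the queries, keys, and values,
# and several heads attend to different sets of relationships in parallel
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads)
context, attn_weights = mha(query=x, key=x, value=x)

print(context.shape)       # torch.Size([128, 1, 768]) - context-aware token representations
print(attn_weights.shape)  # torch.Size([1, 128, 128]) - attention between every pair of tokens
```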

Now, these new transformer models also had the benefit of generalizing very well. So what we find with transformer models-- and of course, you could do this to an extent with recurrent neural networks as well. But it was far less effective. So with transformer models, we take the core of the model, which has been trained using a significant amount of time and computing power by the likes of Google and OpenAI.

We just take that core model, add a few layers onto the end of it that are designed in a way for our specific use case, and train it a little bit more. So we fine-tune it. And I think one of the most widely known of these or the most popular of these models is probably BERT.

And of course, there are other models as well. Later in this video, we're going to have a look at one called the MPNet model. But BERT is certainly one of the, I think, most popular of those. Now, so far, we've explained that transformers have much richer embeddings, or word or token embeddings, than anything that came before.

And that's good. But we're interested in not word or token-level embeddings, but sentence embeddings. Because we want to do semantic similarity between sentences or paragraphs. And of course, transformers, they work-- the inside of the transformer works based on word-level or token-level embeddings. So that doesn't really help us that much.

So what would happen before sentence transformers were introduced is we would use something called a cross-encoder. So we would have a BERT cross-encoder model. And like I said before, we have the core BERT model. And then we just add a couple of layers onto the end of it and fine-tune it for our specific use case.

And that's what a cross-encoder is. It's the core BERT model. As you can see on the screen right now, it's a core BERT model followed by a feedforward neural network. We pass two sentences into the BERT model at once. The BERT model embeds them with very rich word and token-level embeddings.

The feedforward network takes those and decides how similar those two sentences are. Now, this is fine, and it is actually accurate. It works well. But it's not really scalable, because essentially, every time you want to compare a pair of sentences, you have to run a full inference computation on BERT.

So let's say we wanted to form a semantic similarity search across just 100,000 sentences, which is a reasonably small data set. We would have to run the BERT inference computation 100,000 times to actually go through and identify the similarity between all of those sentences. And that's going to take a lot of time.

And it gets worse as well. I mean, if you consider clustering all of those 100,000 sentences, we would end up with just under five billion comparisons there. So running a full BERT inference prediction roughly five billion times just to cluster 100,000 sentences, obviously, that's not scalable. That will take a very, very long time.
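As a quick check on that combinatorics, comparing every sentence against every other sentence is n(n-1)/2 pairs:

```python
n = 100_000
pairs = n * (n - 1) // 2
print(f"{pairs:,}")  # 4,999,950,000 - roughly five billion full BERT inference computations
```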

So ideally, what we need is something like word or token embeddings, but for sentences. Now, with the original BERT, we could actually produce these. They just were not very good. So what we could do is take the mean value across all of our word embeddings. Typically with BERT, that is up to 512 token embeddings being output by the model.

We could take the average across all of those and take that average as what we call a sentence embedding. Now, there were other methods of doing this as well. That was probably the most popular and effective one. And we could take that sentence embedding, store it somewhere, and then we could just compare it to other sentence embeddings using something that's a bit simpler than a full BERT computation, like cosine similarity.
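Here is a rough sketch of that mean pooling idea with vanilla BERT via the Hugging Face transformers library; the attention mask keeps padding tokens out of the average. This is illustrative rather than the exact code from the video:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def mean_pool(sentence: str) -> torch.Tensor:
    """Average BERT's token embeddings into a single 768-dimensional sentence vector."""
    tokens = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        token_embeddings = bert(**tokens).last_hidden_state   # (1, seq_len, 768)
    mask = tokens["attention_mask"].unsqueeze(-1).float()      # (1, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)              # sum over real tokens only
    counts = mask.sum(dim=1)                                   # number of real tokens
    return (summed / counts).squeeze(0)                        # (768,)

vec = mean_pool("I like to chew on bricks.")
print(vec.shape)  # torch.Size([768])
```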

And that is much faster. That's fast enough for us, but it's just not that accurate. And it was actually found that comparing averaged GloVe embeddings, which were introduced in 2014, was actually more accurate. So we can't really do that. We can't use what is called a mean pooling approach.

Or we can't use it in its current form. Now, the solution to this problem was introduced by two people in 2019, Nils Reimers and Iryna Gurevych. They introduced what is the first sentence transformer, or sentence BERT. And it was found that sentence BERT, or SBERT, outperformed all of the previous state-of-the-art models on pretty much all benchmarks.

Not all of them, but most of them. And it did it in a very quick time. So if we compare it to BERT, if we wanted to find the most similar sentence pair from 10,000 sentences, in that 2019 paper, they found that with BERT, that took 65 hours. With SBERT embeddings, they could create all the embeddings in just around 5 seconds.

And then they could compare all of those with cosine similarity in 0.01 seconds. So it's a lot faster. We go from 65 hours to just over 5 seconds, which is, I think, pretty incredible. Now, I think that's pretty much all the context we need behind sentence transformers. And what we'll do now is dive into a little bit of how they actually work.

Now, we said before we have the core transformer models. And what SBERT does is fine-tune on sentence pairs using what is called a Siamese architecture or Siamese network. What we mean by a Siamese network is that we have what we can view as two BERT models that are identical.

And the weights between those two models are tied. Now, in reality, when we're implementing this, we just use a single BERT model. And what we do is we process one sentence, sentence A, through the model. And then we process another sentence, sentence B, through the model. And that's the sentence pair.

So with our cross-encoder, we were processing the sentence pair together. We were putting them both together, processing them all at once. This time, we processed them separately. And during training, what happens is the weights within BERT are optimized to reduce the difference between two vector embeddings or two sentence embeddings that are produced for sentence A and sentence B.

And those sentence embeddings are called U and V. Now, to actually create those sentence embeddings, we do what we did before with BERT, where we do the mean pooling operation. Now, the reason that this works better is because we're fine-tuning it. So with BERT, we didn't fine-tune it. This time, we are fine-tuning it.

Now, there are several different ways of training SBERT. But the one that was covered most prominently in the original paper is called the Softmax Loss Approach. And that's what we are going to be describing here. Now, for the Softmax Loss Approach, we can train on natural language inference data sets.

Now, the 2019 paper used two of those, the Stanford Natural Language Inference Corpus and the Multi-Genre Natural Language Inference Corpus. Now, both of these were merged together. And what we have inside there are sentence pairs. One is a premise, which suggests a certain hypothesis. So the two sentences in each pair are, in some cases, related.

And we can tell whether they're related or not using the label feature. Now, the label feature contains three different classes. We have 0, which is called entailment. And what this is, is it indicates that the premise sentence suggests the hypothesis sentence. Then we have class 1, or label 1.

That means the two sentences are neutral. So they could both be true, but they are not necessarily related. And then we have number 2, which is contradiction, which means the premise and hypothesis sentences contradict each other.
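For reference, these pairs and labels are available through the Hugging Face datasets library; a quick way to peek at the SNLI records:

```python
from datasets import load_dataset

# each SNLI record has a premise, a hypothesis, and a label,
# where 0 = entailment, 1 = neutral, 2 = contradiction
snli = load_dataset("snli", split="train")
print(snli[0])
```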

Now, given this data, we feed sentence A into BERT first for our fine-tuning. And then we feed sentence B into BERT. So that's our premise and hypothesis. The Siamese BERT outputs two sentence embeddings, U and V, from this process. And what we do is concatenate those two sentence embeddings. Now, the paper explained a few different ways of doing that. But the most effective was to take U and V.

And what we do is we take the absolute value of U minus V, which is an element-wise operation. So we're basically finding the difference between U and V. And that produces this other vector, which is |U - V|. And then we concatenate all of those together.

So they all get concatenated together. And then they are passed into a feed-forward neural network. Now, this feed-forward neural network takes the concatenated vector length, or sentence embedding length, as its input and outputs just three activations. Now, those three activations align with our three labels. So what we do from there, we have those three activations.

We then need to calculate the softmax loss between those predicted labels and the true labels. Now, softmax loss is nothing more than a cross-entropy loss function. And that's really all there is to training one of these models.
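Here is a minimal PyTorch sketch of that classification head; the random vectors simply stand in for the pooled outputs U and V that would come from the shared BERT during real fine-tuning:

```python
import torch
import torch.nn as nn

embed_dim = 768                               # sentence embedding size from BERT + mean pooling
classifier = nn.Linear(embed_dim * 3, 3)      # three outputs: entailment / neutral / contradiction
loss_fn = nn.CrossEntropyLoss()               # the "softmax loss" is just cross-entropy over the logits

# U and V would be the pooled outputs for sentence A and sentence B from the same BERT
u = torch.randn(16, embed_dim)                # batch of 16 premise embeddings (stand-ins)
v = torch.randn(16, embed_dim)                # batch of 16 hypothesis embeddings (stand-ins)
labels = torch.randint(0, 3, (16,))           # NLI labels for the batch

features = torch.cat([u, v, torch.abs(u - v)], dim=-1)   # (16, 2304): concat of U, V, |U - V|
logits = classifier(features)                             # (16, 3)
loss = loss_fn(logits, labels)
loss.backward()   # in real training, gradients flow back through U and V into BERT
```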

Now, we are going to cover all of this in full. We're going to go through the code and everything and train our own sentence BERT model. But for now, we're just describing the process. And we'll cover that in another-- well, probably in the next video and article. Now, that's really, I think, everything we need to know for now on how they work and where sentence transformers and transformers come from. So let's jump into Python.

And what we'll do is actually implement some of these models using the Sentence Transformers library, which was built by the same people who designed the first sentence transformer, SBERT. So let's go ahead and do that now. So the first thing that you will need to do, if you do not already have Sentence Transformers installed, is just pip install sentence-transformers.

So I already have it installed. So I'm not going to go ahead and run that. So all I'm going to do now is, from sentence transformers, I'm going to import the SentenceTransformer class. And then from there, we can initialize a sentence transformer, super easy. So all we need to write is model equals SentenceTransformer.

And then in here, we just need to write our model name. Now, if you go to this website, sbert.net, you will find a load of different models. Now, the one that we will be using is this. So this is the original SBERT model. And if we just come down here and print out what that will return to us, we will see a few different components.
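In code, that looks something like this, assuming the original SBERT model referenced here is bert-base-nli-mean-tokens from sbert.net:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# assumed model name for the original SBERT model (BERT-base + NLI fine-tuning + mean pooling)
model = SentenceTransformer("bert-base-nli-mean-tokens")
print(model)
```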

So you can see that we have two components. We have the transformer model, and we have the pooling layer, which is the mean pooling that I mentioned before. Now, for the transformer, we see that the max sequence length, so the maximum number of tokens that we can input there, is 128.

And we see that we're using the base model. It's a BERT model from the Hugging Face library. Then in pooling, we can see the output dimension of our sentence vector, which is 768. And we can also see the way that the tokens have been pooled to create the sentence embedding, which is just the mean pooling approach.
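The sentences used in the video are along these lines (treat these as stand-ins; the exact wording on screen may differ):

```python
# stand-in sentences; the similar pair is intentionally the last two,
# which share a meaning but none of the same descriptive words
sentences = [
    "the fifty mannequin heads floating in the pool kind of freaked them out",
    "she swore she just saw her sushi move",
    "he embraced his new life as an eggplant",
    "my dentist tells me that chewing bricks is very bad for your teeth",
    "the dental specialist recommended an immediate stop to flossing with construction materials",
]
```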

Now, given these sentences here, I'm going to run that. All we need to do to actually encode those is we write model.encode. And then we just pass sentences. And that will create our embeddings. So I'm just going to call this embeddings. And let's see what they look like, so embeddings again.

And we see that we have these, which are our embeddings. So each one of these is for a single sentence. So here we have the first sentence here. And these are each of dimensionality 768. So we can check the shape of that if we want as well, just to confirm that is true.
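Which, in code, is simply:

```python
# encode all five sentences into dense sentence embeddings
embeddings = model.encode(sentences)
print(embeddings.shape)  # (5, 768)
```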

So you see that we have five embeddings, five sentences, each one 768 dimensions. And now that we have our sentence embeddings, we can use these to quickly compare sentence similarity for quite a few different use cases. The most popular are semantic textual similarity, which is a comparison of sentence pairs, which is what we are going to do here.

And generally, this is probably most often used for benchmarking these kinds of models. Then we have semantic search. Now, we've covered semantic search a lot already in other articles and videos. And this is information retrieval using semantic meaning. So given a set of sentences, we can search using a query sentence and identify the most similar records.

So this enables us to search based on concepts rather than specific words, which is pretty cool. Now, we also have clustering. So we can cluster our sentences, which is obviously useful for things like topic modeling. Now, we can put together a very fast STS, or semantic textual similarity, example using nothing more than a cosine similarity function and NumPy.

So we just want to import NumPy as np. And we are also going to use the Sentence Transformers cosine similarity function. So we write from sentence_transformers.util import cos_sim. OK, so from there, what I'm going to do is I'm going to initialize an empty array of zeros. And I want that to be the length of our sentences.

So I want it to be a 5 by 5 array. And what we're going to do is loop through all of these sentence embeddings we produced from SBERT and compare those with the cosine similarity. So we just want to write for i in range of the length of the sentences.

And we just want to say the similarity array from i to the end, on this specific column, is going to be equal to cos_sim. And then here, we want to put our embeddings. We just want the current sentence embedding followed by the remaining embeddings. And that will populate our similarity array.
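Put together, that loop looks roughly like this (cos_sim lives in sentence_transformers.util; older releases expose the same function as pytorch_cos_sim):

```python
import numpy as np
from sentence_transformers.util import cos_sim

# empty 5 x 5 array to hold the pairwise similarity scores
sim = np.zeros((len(sentences), len(sentences)))

# fill the lower triangle: compare each embedding against itself and everything after it
for i in range(len(sentences)):
    sim[i:, i] = cos_sim(embeddings[i], embeddings[i:]).numpy().flatten()
```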

So we can print that out. And we'll see, OK, so down the middle here-- so I've not populated these because these-- well, we've already got those pairs on this side of the array. You'll see we're going to visualize it. So that'll probably make more sense. So let's do that now.

So we're going to import matplotlib.pyplot as plt. And we're also going to import Seaborn. This makes it a little bit easier and nicer as well. And I'm just going to write sns.heatmap in here with sim. And we want annotations set to true. OK, so this is just a visual of that array that we produced just now.
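In code:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# heatmap of the similarity array, with the scores annotated in each cell
sns.heatmap(sim, annot=True)
plt.show()
```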

And we can see here, so we have the sentence values or sentence positions. So sentence zero, if we go here-- actually, let me print them out here instead. That makes more sense. So if I print sentences, maybe it's better if I-- so I'll put them like this, yeah. So we have number zero is obviously this first sentence.

And then we have number 1, 2, 3, and 4. And that correlates to 0 to 4 and 0 to 4 here on the two axes. So if we want to look at the most similar pair according to our SBERT model, it's this 4 and 3, which gets a cosine similarity of 0.64.

And if we have a look, 3 and 4 are these two. So these two are the only ones that kind of mean the same thing. And I've written these so that they basically carry-- they share none of the same descriptive words. So for dentists, we have dental specialists. Chewing bricks, I put flossing.

So not even the same thing. And construction materials as well. It's not even the same thing there. But very similar sort of concepts that we're talking about there. So you can see that it's identifying those two as the most similar concept, which is pretty cool. And then we get some other similarity scores, which are kind of high here.

And they're not really related. So we have 3 and 0. It's talking about eggplants and mannequin heads. So it's pretty different. I suppose, in reality, someone being an eggplant and this sort of thing is both kind of weird, strange things to happen. So maybe that's why it's capturing similarity there.

But it's not obviously similar. So generally, I think it's good that it identifies this as being the most similar. But it could be better. And we do find that with the more recent models, it is, in fact, better. So what we can do is I'm going to get this other model.

We'll just call it model. And we're going to do SentenceTransformer again. And this time, we are using the MPNet base model. Now, this is basically the highest performing model at the moment. Although I was told on the channel's Discord by Ashraq that there is actually, when they were training this MPNet model, they also trained a RoBERTa model.

And although the RoBERTa model is not shown on the Sentence Transformers home page, you can see in the competition where they trained both these models that they did also train that model. And it does have a slightly higher performance as well. It might be slower. I'm not sure. Probably because it's RoBERTa.

But the performance of that is actually higher than the MPNet version of that model, which is pretty cool. Now, here, we can see a few things that are slightly different. For starters, the max sequence length is three times as long as it was with the BERT model, now that we're using the MPNet base model.

And we also have this additional normalization function. Now, let's just take what we wrote up here. So I'm just going to take this, bring it down here. And let's just use the heat map straight away. So sns.heatmap with sim and annot equals true. And we'll see something slightly different.

Or we should do. So I haven't processed the similarity yet or the embeddings. So let's write embeddings equals model.encode sentences. Now, if we run it, we'll see that the similarity of these ones, of these other sentence pairs, is now a lot lower. But it's still identifying 4 and 3 as pretty similar.
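Putting those steps together for the newer model, and assuming it is all-mpnet-base-v2, which matches the 384-token limit and the extra normalization layer described above:

```python
# assumed model name; it fits the description of a 384 max sequence length plus a Normalize layer
model = SentenceTransformer("all-mpnet-base-v2")
print(model)

# re-encode the same sentences and rebuild the similarity array
embeddings = model.encode(sentences)
sim = np.zeros((len(sentences), len(sentences)))
for i in range(len(sentences)):
    sim[i:, i] = cos_sim(embeddings[i], embeddings[i:]).numpy().flatten()

sns.heatmap(sim, annot=True)
plt.show()
```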

So we can see straight away there's a decent performance increase between this MPNet model, which is the most recent model, and the original SBERT model here. So I think that's pretty cool to see as well. Now, that's it for this video introducing sentence embeddings and the Sentence Transformers library and models.

Now, going forward, obviously, this is a series of videos. So we're going to be covering a lot more than just Sentence Transformers. But next, we are actually going to cover how we can train a sentence BERT model, an SBERT model, which I think will be pretty cool. So until then, that's it for now.

So thank you very much for watching. And I will see you in the next one. Bye.