Intro to Sentence Embeddings with Transformers
Chapters
0:00 Introduction
0:58 Machine Translation
7:32 Transformer Models
9:18 Cross-Encoders
15:32 Softmax Loss Approach
16:12 Label Feature
18:54 Python Implementation
We're going to explore how we can use sentence transformers and sentence embeddings in NLP for semantic similarity applications. Now, in the video, we're going to have a quick recap of where these models came from. So we're going to have a quick look at recurrent neural networks, use that to define what the difference is between a transformer and a sentence transformer, and also understand the embeddings produced by transformers or sentence transformers. Then we'll go through how we can implement our own sentence transformers. So I think we should just jump straight into it.
I think it makes sense if we piece together where transformers come from, and from there try and figure out the difference between a transformer and a sentence transformer. So we're going to start with recurrent neural networks. An encoder-decoder network like this is a set of, well, recurrent units, usually something like LSTM or GRU units. The encoder would read in a sentence and compress it into a single context vector, and then this context vector would be passed along to the decoder. And the decoder would decode that into another language, say, like French, or in this case, actually, Italian. The problem is that we're trying to pass a lot of information through that single point between the encoder and the decoder. Now, that creates what is called an information bottleneck. There's too much information trying to be crammed through that single connection.
And what the attention mechanism does is, for every step or every token that is decoded by our decoder, that token is sent to the attention mechanism, and the alignment between the decoder at that time step and every encoder time step is calculated. So it tells the decoder which tokens from the encoder it should focus on: which ones do I need to pay attention to, given whatever my current unit is? And what this attention mechanism produces is essentially an alignment matrix. So this is from another very well-known paper in 2015. Along one side, we have the French words or the French translation, and then on the top, we have the English translation. And all of these boxes you see are the activations of the attention mechanism between those word pairs, so we can see essentially which words are the most aligned. And that is essentially what the attention mechanism did: it allowed the decoder to focus on the most relevant words from the encoder at each step.
Now, moving on, in 2017, there was another paper which found that you could remove the recurrent part of the encoder-decoder network and keep just the attention mechanism. And what they produced, with a few modifications, performed far better than any of the recurrent neural networks with attention that came before it. And what they named this new model was the transformer. So this 2017 paper, "Attention Is All You Need," really marked the end of attention being something that was aimed to help improve recurrent neural networks. Now, of course, like I said, the attention mechanism they used was not just the same plain attention mechanism that was used before in recurrent neural networks; they modified it. And those modifications really came down to three key changes.
The first was positional encoding, which replaced the key advantage of recurrent neural networks: because they processed their inputs recurrently, they considered one word or one time step after the other, so there was a sense of order to those models. Positional encoding replaces that by adding a set of varying sine wave activations to each token embedding, and these activations vary based on the position of the token. So what you have there is a way for the network to identify the order of the tokens, or the token activations or embeddings, that are being processed.
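To make that concrete, here is a minimal NumPy sketch of the sinusoidal positional encodings from "Attention Is All You Need"; the sequence length and embedding size below are just example values, not anything specified in the video.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal position encodings: sine on even dimensions, cosine on odd ones."""
    pos = np.arange(seq_len)[:, None]          # token positions 0..seq_len-1
    i = np.arange(d_model)[None, :]            # embedding dimensions 0..d_model-1
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])      # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])      # odd dimensions use cosine
    return pe

# these encodings are added to the token embeddings before they enter the encoder
pe = positional_encoding(seq_len=128, d_model=768)
print(pe.shape)  # (128, 768)
```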
Now, self-attention is where the attention mechanism is applied between a word and all of the other words in its own context, so the sentence or the paragraph that it belongs in. Before, attention was being applied between the decoder and the encoder; this is like applying attention within the encoder itself, between the encoder and the encoder. And what this did is, rather than just embedding the meaning of a word, it also embeds the context of that word into its representation, which obviously greatly enriched the amount of information in each embedding.
And then the third and final change that they made was multi-head attention, which we can see as several parallel attention mechanisms working together. Using multiple heads allowed the representation of several sets of relationships rather than just one, which, of course, you could only do to an extent with a single attention mechanism.
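As a quick illustration of both self-attention and multiple heads, here's a small PyTorch sketch using torch.nn.MultiheadAttention; the eight heads and 768-dimensional embeddings are just example values, not something taken from the video.

```python
import torch
import torch.nn as nn

# 8 parallel attention heads over 768-dimensional token embeddings (example values)
self_attn = nn.MultiheadAttention(embed_dim=768, num_heads=8, batch_first=True)

x = torch.randn(1, 10, 768)      # (batch, sequence length, embedding dim)
out, attn = self_attn(x, x, x)   # self-attention: query, key, and value are all x
print(out.shape, attn.shape)     # torch.Size([1, 10, 768]) torch.Size([1, 10, 10])
```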
So with transformer models, we take the core of the model, which has been trained using a significant amount of time and computing power by the likes of Google and OpenAI. We just take that core model, add a few layers onto the end of it that are designed for our specific use case, and train it a little bit more. And I think one of the most widely known, or the most popular, of these models is probably BERT. And of course, there are other models as well; later in this video, we're going to have a look at one of those too.
Now, so far, we've explained that transformers have much richer word or token embeddings than the models that came before them. But we're interested not in word or token-level embeddings; we're interested in sentence-level embeddings, and everything we've described so far is based on word-level or token-level embeddings.
So what would happen before sentence transformers is we'd use something called a cross-encoder. And like I said before, we have the core BERT model, and then we just add a couple of layers onto the end of it. So it's a core BERT model followed by a feedforward neural network. We pass two sentences into the BERT model at once, the BERT model embeds them with very rich word and token-level embeddings, and the feedforward network takes those and decides how similar those two sentences are.
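The sentence-transformers library actually ships a CrossEncoder class that implements exactly this pattern, a transformer scoring a sentence pair in a single forward pass. A minimal sketch, with the checkpoint name and sentences being my own examples rather than anything named in the video:

```python
from sentence_transformers import CrossEncoder

# any cross-encoder checkpoint works; this STS-trained one is just an example
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-base")

# each pair is passed through the model together and scored directly
score = cross_encoder.predict(
    [("A man is eating food.", "Someone is having a meal.")]
)
print(score)  # one similarity score per sentence pair
```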
Now, this is fine, and it is actually accurate. But it's not really scalable, because essentially, every time you want to compare a pair of sentences, you have to run a full inference computation on BERT. So let's say we wanted to build a semantic similarity search across 100,000 sentences. We would have to run the BERT inference computation over and over to calculate the similarity between all of those sentences, and comparing every pair, we would end up with just under 500 million comparisons there. So running a full BERT inference prediction 500 million times is going to take a very long time.
What we really need is something like word or token embeddings, but for sentences. Now, with the original BERT, we could actually produce these. We take the token embeddings that the model outputs, and typically with BERT there are 512 of those being output by the model. We could take the average across all of those and take that average as what we call a sentence embedding. Now, there were other methods of doing this as well, but that was probably the most popular and effective one.
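A rough sketch of that naive mean pooling approach using the Hugging Face transformers library; the checkpoint and sentences below are just examples. We average the token embeddings, using the attention mask to ignore padding:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["A man is eating food.", "Someone is having a meal."]  # example inputs
tokens = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    token_embeddings = bert(**tokens).last_hidden_state     # (batch, seq_len, 768)

# mean pooling: average the token embeddings, masking out the padding tokens
mask = tokens["attention_mask"].unsqueeze(-1).float()        # (batch, seq_len, 1)
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embeddings.shape)                             # (batch, 768)
```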
And we could take that sentence embedding, store it somewhere, and then we could just compare it to other sentence embeddings using something much faster than a full BERT computation, like cosine similarity. That's fast enough for us, but it's just not that accurate. It was actually found that comparing averaged GloVe embeddings, which came long before BERT, produced better results. So we can't really use what is called a mean pooling approach with BERT on its own.
That changed in 2019, when the authors of the SBERT paper introduced what is the first sentence transformer, sentence-BERT. And it was found that sentence-BERT, or SBERT, outperformed the previous state-of-the-art sentence embedding methods on semantic textual similarity tasks. To give an idea of the speed difference: if we wanted to find the most similar sentence pair from 10,000 sentences, the cross-encoder approach would take around 65 hours of BERT inference. With SBERT embeddings, they could create all the embeddings in around five seconds, and then they could compare all of those with cosine similarity in around 0.01 seconds.
Now, I think that's pretty much all the context we need, and what we'll do now is dive into a little bit of how SBERT actually works. Now, we said before we have the core transformer models, and what SBERT does is fine-tune on sentence pairs using what is called a Siamese architecture, or Siamese network, which we can view as two BERT models that are identical, with the weights between those two models tied. Now, in reality, when we're implementing this, we just use a single BERT model. And what we do is we process one sentence, sentence A, through the model, and then we process another sentence, sentence B, through that same model. So whereas with our cross-encoder we were processing the sentence pair together, here the sentences are processed separately, and the model is optimized to reduce the difference between the two vector embeddings, or two sentence embeddings, that are produced for sentence A and sentence B when the pair is similar. And those sentence embeddings are what we're ultimately after.
Now, to actually create those sentence embeddings, we take the token embeddings that BERT outputs and apply a mean pooling operation, just like I described before. Now, there are several different ways of training SBERT, but the one that was covered most prominently in the original paper is called the softmax loss approach, and that's what we are going to be describing here. For the softmax loss approach, we can train on natural language inference datasets. Now, the 2019 paper used two of those: the Stanford Natural Language Inference (SNLI) corpus and the Multi-Genre Natural Language Inference (MNLI) corpus.
And what we have inside there are sentence pairs. One is a premise, which suggests a certain hypothesis. So those two sentences are, in some cases, related, and we can tell whether they're related or not using the label feature. Now, the label feature contains three different classes. We have 0, which is entailment, meaning that the premise sentence suggests the hypothesis sentence. Then we have 1, which is neutral: they could both be true, but they are not necessarily related. And then we have number 2, which is contradiction, which means the premise and hypothesis sentences contradict each other.
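As a quick sketch of what those records look like, assuming the Hugging Face datasets library and its "snli" dataset id:

```python
from datasets import load_dataset

# each record has a premise, a hypothesis, and a label
# (0 = entailment, 1 = neutral, 2 = contradiction, -1 = no gold label)
snli = load_dataset("snli", split="train")
print(snli[0])
```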
The Siamese BERT, or our single shared BERT, outputs two sentence embeddings, u and v, and what we do is concatenate those two sentence embeddings. Now, the paper explained a few different ways of doing that, but the default is this: we take the absolute value of u minus v, so we're basically finding the element-wise difference between u and v, and then we concatenate u, v, and |u - v| all together. They are then passed into a feedforward neural network, which takes that concatenated vector, three times the sentence embedding length, as its input and outputs just three activations. Now, those three activations align with our three labels. So what we do from there is take those three activations and calculate the softmax loss between those predicted labels and the true labels. Now, softmax loss is nothing more than a cross-entropy loss applied across those activations.
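Here's a minimal PyTorch sketch of that classification head, assuming 768-dimensional sentence embeddings u and v coming out of the shared BERT-plus-mean-pooling model; the batch size is just an example.

```python
import torch
import torch.nn as nn

class SoftmaxLossHead(nn.Module):
    """Feedforward head over (u, v, |u - v|) with a cross-entropy ("softmax") loss."""

    def __init__(self, embed_dim: int = 768, num_labels: int = 3):
        super().__init__()
        self.classifier = nn.Linear(embed_dim * 3, num_labels)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, u, v, labels):
        features = torch.cat([u, v, torch.abs(u - v)], dim=-1)  # concatenation step
        logits = self.classifier(features)                       # three activations
        return self.loss_fn(logits, labels)

# example usage with a batch of 16 sentence pairs
head = SoftmaxLossHead()
u, v = torch.randn(16, 768), torch.randn(16, 768)
labels = torch.randint(0, 3, (16,))   # 0 = entailment, 1 = neutral, 2 = contradiction
loss = head(u, v, labels)
```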
And that's really all there is to training one of these models. Now, we are going to cover all of this in full, and we're going to go through the code and everything you need to train your own model, but for now, we're just describing the process; we'll do that properly, well, probably in the next video and article. So that's the theory behind sentence embeddings and where sentence transformers and transformers come from. What I want to do now is actually use some of these models with the sentence-transformers library, which was built by the same people who designed SBERT.
Now, if you do not already have sentence transformers installed, you can install it with pip. So all I'm going to do now is, from sentence_transformers, import the SentenceTransformer class object. And then from there, we can initialize a sentence transformer model. So all we need to write is model equals SentenceTransformer, and then in here, we just need to write our model name.
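For reference, roughly what that looks like; the exact checkpoint isn't visible in this transcript, so "bert-base-nli-mean-tokens" below is an assumption based on the BERT-plus-mean-pooling model described next.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# model name assumed; any SBERT-style checkpoint can be dropped in here
model = SentenceTransformer("bert-base-nli-mean-tokens")
print(model)  # shows the Transformer module followed by the Pooling module
```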
Inside this model, we have the transformer model, and we have the pooling layer, which is the mean pooling that I mentioned before. Now, for the transformer, we can see the max sequence length, so the maximum number of tokens that we can input there, and we can see that it's a BERT model from the Hugging Face library. Then in pooling, we can see the output dimension of our sentence embeddings, and we can also see the way that the tokens have been pooled to create the sentence embedding, which in this case is mean pooling.
Next, we write a few sentences and encode them with model.encode to create our sentence embeddings. And let's see what they look like, so embeddings again. We see that we have these dense vectors, which are our embeddings, and each one of these is for a single sentence. We can check the shape of that if we want as well, and you see that we have five embeddings for five sentences, each with 768 dimensions.
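The five sentences used in the video aren't fully recoverable from this transcript, so the ones below are stand-ins, with the last two written to share meaning but no descriptive words, in the same spirit as the video's examples.

```python
# stand-in sentences; the video uses its own set of five
sentences = [
    "The bartender poured a glass of wine.",
    "A man is playing a guitar on stage.",
    "The new movie was full of plot twists.",
    "Someone in a gorilla costume is playing a set of drums.",
    "A person dressed as an ape is drumming.",
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # (5, 768): five sentence embeddings, 768 dimensions each
```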
And now that we have our sentence embeddings, we can use these to quickly compare sentence similarity for quite a few use cases. The most popular are semantic textual similarity, which is a comparison of sentence pairs; semantic search, which we've covered a lot already and which enables us to search based on concepts rather than specific words, which is pretty cool; and clustering, which is obviously useful for things like topic modeling.
So what we're going to do now is compare all of these sentence embeddings, using nothing more than a cosine similarity function and NumPy. And we're also going to use the sentence-transformers util module, so we write from sentence_transformers.util import cos_sim. Then what I'm going to do is initialize an empty array of zeros, and I want that to be the length of our sentences in both dimensions. And for each row, we just want to say the similarity on that specific row is going to be equal to the cosine similarity between that sentence's embedding and all of our embeddings. We don't strictly need to fill both halves of the matrix, because, well, we've already got those pairs on the other side of the array, but computing the full thing is fine too.
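A sketch of that similarity matrix loop, assuming a sentence-transformers version that exposes cos_sim in its util module, and reusing the sentences and embeddings from the previous step:

```python
import numpy as np
from sentence_transformers.util import cos_sim

n = len(sentences)
sim = np.zeros((n, n))          # empty similarity matrix

for i in range(n):
    # cos_sim returns a (1, n) tensor of similarities against every embedding
    sim[i] = cos_sim(embeddings[i], embeddings).numpy().flatten()
```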
To visualize that, we're going to import matplotlib.pyplot as plt and seaborn as sns; this makes it a little bit easier and nicer to look at as well. And I'm just going to write sns.heatmap and pass in sim.
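And the plotting code, roughly, with sim taken from the loop above; annot=True just writes each similarity score inside its cell so the values are easy to read off.

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.heatmap(sim, annot=True)    # annotate each cell with its similarity score
plt.show()
```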
And we can see here, so we have the sentence values along each axis. Actually, let me print them out here instead. So if I print sentences, maybe it's better if I look at them that way: number zero is obviously this first sentence, and so on down the list. Now, if we want to look at the most similar pair according to the heatmap, it's sentences 3 and 4. And if we have a look, 3 and 4 are these two, and these two are the only ones that kind of mean the same thing. I've written these so that they basically carry the same meaning while sharing none of the same descriptive words. So you can see that it's identifying those two as the most similar pair. And then we get some other similarity scores that are higher than you might expect; one of those pairs is talking about eggplants and mannequin heads. I suppose, in reality, someone being an eggplant and this sort of thing are both kind of weird, strange things to say, so maybe that's why it's capturing similarity there. So generally, I think it's good that it identifies the right pair, even if some of the other scores are a bit high.
And we do find that with the more recent models, those scores are more in line with what you'd expect. So what we can do is get this other model. We're going to use SentenceTransformer again, and this time we are using the MPNet base model. Now, this is basically the highest performing sentence transformer model at the moment. Although I was told on the channel's Discord by Ashraq that, when they were training this MPNet model shown on the Sentence Transformers home page, they also trained another version of it, and that one does have a slightly higher performance as well. Looking at this model, there are a couple of differences. For starters, the max sequence length is three times as long with the MPNet base model as it was with the BERT model, and we also have this additional normalization function at the end. So I'm just going to take the code from before, bring it down here, and let's just use the heatmap straight away. So let's write embeddings equals model.encode(sentences).
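Again, the exact checkpoint isn't visible in this transcript; "all-mpnet-base-v2" below is an assumption that matches the description (MPNet base, a 384-token max sequence length, and an extra normalization layer). The sentences list is the same one as before.

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# checkpoint assumed; described in the video as the MPNet base sentence transformer
model = SentenceTransformer("all-mpnet-base-v2")
print(model)  # Transformer (max_seq_length=384) -> Pooling -> Normalize

embeddings = model.encode(sentences)   # re-encode the same sentences as before

n = len(sentences)
sim = np.zeros((n, n))
for i in range(n):
    sim[i] = cos_sim(embeddings[i], embeddings).numpy().flatten()

sns.heatmap(sim, annot=True)
plt.show()
```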
Now, if we run it, we'll see that the similarity of these other sentence pairs drops right down, but it's still identifying 3 and 4 as pretty similar. So we get a decent performance increase from this MPNet model, which is the most recent model, compared to the original SBERT model here. So I think that's pretty cool to see as well. That's it for this introduction to sentence embeddings and the Sentence Transformers library. Now, going forward, obviously, this is a series of videos, so we're going to be covering a lot more than just Sentence Transformers. Next, we'll look at how we can train a sentence BERT model, an SBERT model.