
Intro to Sentence Embeddings with Transformers


Chapters

0:00 Introduction
0:58 Machine Translation
7:32 Transformer Models
9:18 Cross-Encoders
15:32 Softmax Loss Approach
16:12 Label Feature
18:54 Python Implementation

Whisper Transcript

00:00:00.000 | Hi, welcome to the video.
00:00:02.120 | We're going to explore how we can use sentence transformers
00:00:05.680 | and sentence embeddings in NLP for semantic similarity
00:00:11.160 | applications.
00:00:12.560 | Now, in the video, we're going to have a quick recap
00:00:16.360 | on transformers and where they came from.
00:00:18.400 | So we're going to have a quick look at recurrent neural
00:00:20.740 | networks and the attention mechanism.
00:00:22.800 | And then we're going to move on to trying
00:00:25.920 | to define what is the difference between a transformer
00:00:29.440 | and a sentence transformer, and also understanding, OK,
00:00:33.440 | why are these embeddings that are
00:00:35.840 | produced by transformers or sentence transformers
00:00:38.640 | specifically so good?
00:00:40.600 | And at the end, we're also going to go
00:00:42.960 | through how we can implement our own sentence transformers
00:00:46.400 | in Python as well.
00:00:48.880 | So I think we should just jump straight into it.
00:00:59.000 | Before we dive into sentence transformers,
00:01:01.280 | I think it would make a lot of sense
00:01:03.640 | if we pieced together where transformers come from
00:01:06.840 | with the intention of trying to understand
00:01:09.280 | why we use transformers now rather
00:01:12.160 | than some other architecture.
00:01:14.680 | And I think it's also very important
00:01:16.520 | if we try and figure out the difference between a transformer
00:01:19.280 | and a sentence transformer as well.
00:01:21.400 | So we're going to start with recurrent neural networks.
00:01:24.300 | And more specifically, I want to have
00:01:26.880 | a look at machine translation.
00:01:28.960 | So machine translation used something
00:01:32.040 | called an encoder-decoder network,
00:01:34.480 | where you would have an encoder, which
00:01:36.840 | is a set of, well, recurrent units, usually something
00:01:40.840 | like LSTMs or GRUs.
00:01:43.200 | And that encoder
00:01:46.720 | would encode some sort of input text.
00:01:50.120 | So in, let's say, English, it would
00:01:52.800 | encode that English text into something
00:01:56.040 | called a context vector.
00:01:57.600 | And then this context vector would be passed along
00:01:59.720 | to a decoder network, which, again, is just
00:02:03.160 | another set of LSTM or GRU units.
00:02:07.200 | And it would decode that into another language,
00:02:10.920 | say, like French, or in this case, actually, Italian.
00:02:14.560 | That is how machine translation worked back
00:02:17.960 | with recurrent neural networks.
00:02:19.840 | The only issue is that we're trying
00:02:21.320 | to pass a lot of information through that single point
00:02:24.880 | between the encoder and the decoder.
00:02:28.240 | Now, that creates what is called an information bottleneck.
00:02:32.600 | There's too much information trying to be crammed
00:02:35.280 | through that one single point.
00:02:37.440 | So what they came up with is something
00:02:41.120 | called the attention mechanism.
00:02:42.760 | And what the attention mechanism does is for every step
00:02:45.800 | or every token that is decoded by our decoder,
00:02:49.560 | that token is sent to the attention mechanism.
00:02:53.520 | And the alignment between the decoder at that time step
00:02:58.880 | is compared to all of the encoder units
00:03:02.920 | or all of the encoder hidden states.
00:03:06.880 | And what that does is essentially
00:03:09.280 | builds this type of attention.
00:03:12.000 | So it tells the decoder which tokens from the encoder
00:03:15.200 | to focus on.
00:03:16.400 | So it's literally saying, where do
00:03:18.640 | I need to pay attention to whatever my current unit is?
00:03:23.920 | And this attention mechanism, what it produces
00:03:26.880 | is something like this.
00:03:29.400 | So this is from another very well-known paper in 2015.
00:03:34.840 | And what you can see is a matrix of the-- we
00:03:39.520 | have the French words or the French translation
00:03:42.600 | on the left, on the y-axis.
00:03:45.200 | And then on the top, we have the English translation.
00:03:48.960 | And all of these boxes you see are the activations
00:03:53.480 | of the attention mechanism.
00:03:55.800 | So we can see essentially which words are the most aligned.
00:04:00.440 | And that is essentially what the attention mechanism did.
00:04:04.280 | It allowed the decoder to focus on the most relevant words
00:04:08.920 | in the encoder part of the network.
00:04:11.880 | Now, moving on, in 2017, there was another paper
00:04:16.040 | called "Attention is All You Need."
00:04:17.800 | And this really marked, I think, what
00:04:21.240 | is a turning point in NLP.
00:04:24.040 | What was found in this paper is that they
00:04:26.520 | could remove the recurrent part of the encoder decoder network
00:04:31.520 | and maintain just the attention mechanism.
00:04:35.200 | And what they produced with a few modifications
00:04:39.040 | to the attention was a higher-performing model
00:04:42.160 | than any of the recurrent neural networks, with attention
00:04:44.840 | or without, that came before it.
00:04:47.520 | And what they named this new model was a transformer.
00:04:53.560 | So this 2017 paper, "Attention is All You Need,"
00:04:57.600 | is where transformers came from.
00:05:00.840 | And they actually came from a mechanism
00:05:04.640 | that was aimed to help improve recurrent neural networks.
00:05:10.160 | Now, of course, like I said, the attention mechanism
00:05:12.720 | was not just the same plain attention mechanism
00:05:16.360 | that was used before in recurrent neural networks.
00:05:19.400 | It had been modified a little bit as well.
00:05:22.240 | And those modifications really came down to three key changes.
00:05:28.200 | And those were positional encoding,
00:05:31.080 | which replaced the key advantage of recurrent neural networks
00:05:34.920 | in NLP, which was the ability to consider
00:05:37.800 | the order of a sequence.
00:05:39.640 | Because they were recurrent neural networks,
00:05:41.920 | they considered one word or one time step after the other.
00:05:45.880 | So there was a sense of order to those models
00:05:49.880 | that does not appear in, for example,
00:05:51.800 | convolutional neural networks.
00:05:53.760 | And this positional encoding worked
00:05:56.480 | by adding a set of varying sine wave activations
00:06:00.160 | to each input embedding.
00:06:01.880 | And these activations varied based on the position
00:06:06.720 | of the word or token.
00:06:07.840 | So what you have there is a way for the network
00:06:12.480 | to identify the order of the tokens or the token
00:06:16.280 | activations or embeddings that are being processed.
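As a rough sketch of that idea (not the exact code inside any particular model), the sinusoidal positional encodings from the original paper can be generated and added to the token embeddings like this, here in plain NumPy with a made-up sequence length and dimension:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sine/cosine positional encodings in the style of 'Attention Is All You Need'."""
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(d_model)[None, :]               # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                 # (seq_len, d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])      # sine on even dimensions
    encoding[:, 1::2] = np.cos(angles[:, 1::2])      # cosine on odd dimensions
    return encoding

# token_embeddings stands in for the (seq_len, d_model) input embeddings;
# the position information is simply added element-wise.
token_embeddings = np.random.rand(128, 768)
inputs = token_embeddings + sinusoidal_positional_encoding(128, 768)
```

Because every position gets its own pattern of sine and cosine values, the network can recover token order even though the attention operation itself has no notion of sequence.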
00:06:20.080 | The next change was self-attention.
00:06:22.920 | Now, self-attention is where the attention mechanism
00:06:26.280 | is applied between a word and all of the other words
00:06:29.760 | in its own context.
00:06:32.160 | So the sentence or the paragraph that it belongs in.
00:06:35.400 | Now, we saw with the encoder/decoder
00:06:37.520 | that attention was being applied between the decoder
00:06:41.280 | and the encoder.
00:06:42.880 | This is like applying attention to the encoder and the encoder
00:06:47.280 | again.
00:06:48.160 | And what this did is, rather than just embedding
00:06:52.280 | the meaning of a word, it also embedded the context of a word
00:06:56.480 | into its vector or word representations,
00:07:01.040 | which obviously greatly enriched the amount of information
00:07:04.480 | that you have within that embedding.
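To make that concrete, here is a deliberately minimal single-head self-attention sketch in NumPy. Real transformers use learned query, key, and value projections plus multiple heads; those are omitted here so the core comparison-and-mixing step stays visible:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def toy_self_attention(x: np.ndarray) -> np.ndarray:
    """x: (seq_len, d_model) token embeddings for one sequence.
    Every token is scored against every other token in the same sequence,
    and its output becomes a weighted mix of the whole context."""
    d_model = x.shape[-1]
    scores = x @ x.T / np.sqrt(d_model)   # (seq_len, seq_len) alignment scores
    weights = softmax(scores)             # attention weights per token
    return weights @ x                    # context-aware token representations

contextual = toy_self_attention(np.random.rand(10, 64))
```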
00:07:07.560 | And then the third and final change that they made
00:07:11.120 | was the addition of multi-head attention.
00:07:14.640 | And we can see multi-head attention
00:07:17.480 | as several parallel attention mechanisms working together.
00:07:22.880 | And using these multiple attention heads
00:07:26.600 | allowed the representation of several sets of relationships
00:07:30.640 | rather than just a single set.
00:07:32.840 | Now, these new transformer models also
00:07:36.520 | had the benefit of generalizing very well.
00:07:39.520 | So what we find with transformer models--
00:07:41.720 | and of course, you could do this to an extent
00:07:44.120 | with recurrent neural networks as well.
00:07:46.760 | But it was far less effective.
00:07:50.640 | So with transform models, we take the core of the model,
00:07:54.000 | which has been trained using a significant amount of time
00:07:58.360 | and computing power by the likes of Google and OpenAI.
00:08:04.120 | We just take that core model, add a few layers
00:08:06.960 | onto the end of it that are designed in a way
00:08:09.600 | for our specific use case, and train it a little bit more.
00:08:13.680 | So we fine-tune it.
00:08:15.960 | And I think one of the most widely known of these
00:08:20.480 | or the most popular of these models is probably BERT.
00:08:25.080 | And of course, there are other models as well.
00:08:28.400 | Later in this video, we're going to have a look at one
00:08:30.880 | called the MPNet model.
00:08:33.920 | But BERT is certainly one of the, I think,
00:08:36.280 | most popular of those.
00:08:38.200 | Now, so far, we've explained that transformers
00:08:42.520 | have much richer embeddings, or word or token embeddings,
00:08:47.800 | than anything that came before.
00:08:50.240 | And that's good.
00:08:51.360 | But we're interested in not word or token-level embeddings,
00:08:56.280 | but sentence embeddings.
00:08:57.600 | Because we want to do semantic similarity
00:09:01.160 | between sentences or paragraphs.
00:09:04.360 | And of course, transformers, they work--
00:09:08.160 | the inside of the transformer works
00:09:10.240 | based on word-level or token-level embeddings.
00:09:14.760 | So that doesn't really help us that much.
00:09:18.680 | So what would happen before sentence transformers
00:09:22.480 | were introduced is we would use something
00:09:25.480 | called a cross-encoder.
00:09:27.000 | So we would have a BERT cross-encoder model.
00:09:29.560 | And like I said before, we have the core BERT model.
00:09:32.760 | And then we just add a couple of layers onto the end of it
00:09:35.360 | and fine-tune it for our specific use case.
00:09:38.200 | And that's what a cross-encoder is.
00:09:39.640 | It's the core BERT model.
00:09:42.160 | As you can see on the screen right now,
00:09:44.520 | it's a core BERT model followed by a feedforward neural
00:09:47.880 | network.
00:09:48.720 | We pass two sentences into the BERT model at once.
00:09:52.480 | The BERT model embeds them with very rich word and token-level
00:09:58.000 | embeddings.
00:09:58.840 | The feedforward network takes those and decides
00:10:01.520 | how similar those two sentences are.
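For reference, the sentence-transformers library exposes this setup through its CrossEncoder class. A minimal sketch, using a publicly available STS cross-encoder checkpoint rather than the exact model in the video, might look like this:

```python
from sentence_transformers import CrossEncoder

# Example cross-encoder checkpoint fine-tuned for semantic textual similarity
# (assumed here for illustration; any cross-encoder model name would work).
model = CrossEncoder("cross-encoder/stsb-roberta-base")

# Each pair goes through BERT together; the head returns one similarity score per pair.
scores = model.predict([
    ("How old are you?", "What is your age?"),
    ("How old are you?", "The weather is nice today."),
])
print(scores)
```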
00:10:06.200 | Now, this is fine, and it is actually accurate.
00:10:09.360 | It works well.
00:10:11.200 | But it's not really scalable, because essentially,
00:10:14.160 | every time you want to compare a pair of sentences,
00:10:17.760 | you have to run a full inference computation on BERT.
00:10:22.720 | So let's say we wanted to perform a semantic similarity
00:10:26.040 | search across just 100,000 sentences, which
00:10:29.240 | is a reasonably small data set.
00:10:32.360 | We would have to run the BERT inference computation 100,000
00:10:37.880 | times to actually go through and identify
00:10:42.160 | the similarity between all of those sentences.
00:10:44.640 | And that's going to take a lot of time.
00:10:46.720 | And it gets worse as well.
00:10:48.120 | I mean, if you consider clustering
00:10:50.400 | all of those 100,000 sentences, we
00:10:55.520 | would end up with just under 5 billion comparisons there.
00:11:01.320 | So running a full BERT inference prediction 5 billion times
00:11:07.800 | just to cluster 100,000 sentences,
00:11:11.280 | obviously, that's not scalable.
00:11:14.040 | That will take a very, very long time.
00:11:16.480 | So ideally, what we need is something
00:11:19.400 | like word or token embeddings, but for sentences.
00:11:23.080 | Now, with the original BERT, we could actually produce these.
00:11:28.360 | They just were not very good.
00:11:29.760 | So what we could do is take the mean value
00:11:33.720 | across all of our word embeddings.
00:11:36.000 | Typically, with BERT, there are 512 of those being output by the model.
00:11:40.200 | We could take the average across all of those
00:11:42.240 | and take that average as what we call a sentence embedding.
00:11:45.880 | Now, there were other methods of doing this as well.
00:11:49.280 | That was probably the most popular and effective one.
00:11:52.520 | And we could take that sentence embedding, store it somewhere,
00:11:58.240 | and then we could just compare it to other sentence embeddings
00:11:58.240 | using something that's a bit simpler
00:11:59.720 | than a full BERT computation, like cosine similarity.
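A minimal sketch of that mean pooling idea with plain, un-fine-tuned BERT via the Hugging Face transformers library (the two sentences here are just placeholders):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "Dentists recommend flossing daily.",
    "Dental specialists suggest cleaning between your teeth.",
]
tokens = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    token_embeddings = bert(**tokens).last_hidden_state   # (batch, seq_len, 768)

# Mean pooling: average the token embeddings, ignoring padding positions.
mask = tokens["attention_mask"].unsqueeze(-1)              # (batch, seq_len, 1)
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)

# Now a cheap cosine similarity replaces a full BERT pass per sentence pair.
similarity = torch.nn.functional.cosine_similarity(
    sentence_embeddings[0], sentence_embeddings[1], dim=0
)
print(similarity)
```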
00:12:04.640 | And that is much faster.
00:12:07.320 | That's fast enough for us, but it's just not that accurate.
00:12:12.880 | And it was actually found that comparing averaged GloVe
00:12:16.440 | embeddings, which were produced in 2014,
00:12:20.320 | was actually more accurate.
00:12:23.920 | So we can't really do that.
00:12:25.720 | We can't use what is called a mean pooling approach.
00:12:32.040 | Or we can't use it in its current form.
00:12:35.120 | Now, the solution to this problem
00:12:36.680 | was introduced by two people in 2019.
00:12:40.440 | Nils Reimers and Iryna Gurevych.
00:12:44.120 | They introduced what is the first sentence transformer,
00:12:48.440 | or sentence BERT.
00:12:50.040 | And it was found that sentence BERT or SBERT outperformed
00:12:54.480 | all of the previous state-of-the-art models
00:12:57.920 | on pretty much all benchmarks.
00:12:59.880 | Not all of them, but most of them.
00:13:02.280 | And it did it in a very quick time.
00:13:07.520 | So if we compare it to BERT, if we
00:13:10.240 | wanted to find the most similar sentence pair from 10,000
00:13:14.120 | sentences, in that 2019 paper, they
00:13:17.200 | found that with BERT, that took 65 hours.
00:13:21.160 | With SBERT embeddings, they could create all the embeddings
00:13:25.120 | in just around 5 seconds.
00:13:28.080 | And then they could compare all of those with cosine similarity
00:13:31.360 | in 0.01 seconds.
00:13:33.680 | So it's a lot faster.
00:13:36.120 | We go from 65 hours to just over 5 seconds,
00:13:39.920 | which is, I think, pretty incredible.
00:13:42.920 | Now, I think that's pretty much all the context we need
00:13:45.920 | behind sentence transformers.
00:13:47.960 | And what we'll do now is dive into a little bit
00:13:50.680 | of how they actually work.
00:13:52.680 | Now, we said before we have the core transform models.
00:13:57.280 | And what SBERT does is fine-tunes on sentence pairs
00:14:02.840 | using what is called a Siamese architecture or Siamese
00:14:06.560 | network.
00:14:08.360 | What we mean by a Siamese network
00:14:10.520 | is that we have what we can see, what
00:14:12.760 | can view as two BERT models that are identical.
00:14:18.240 | And the weights between those two models are tied.
00:14:20.680 | Now, in reality, when we're implementing this,
00:14:23.640 | we just use a single BERT model.
00:14:25.440 | And what we do is we process one sentence, sentence A,
00:14:29.920 | through the model.
00:14:31.040 | And then we process another sentence, sentence B,
00:14:33.760 | through the model.
00:14:34.560 | And that's the sentence pair.
00:14:36.240 | So with our cross-encoder, we were processing the sentence
00:14:39.360 | pair together.
00:14:40.040 | We were putting them both together,
00:14:41.500 | processing them all at once.
00:14:43.560 | This time, we processed them separately.
00:14:45.760 | And during training, what happens
00:14:48.400 | is the weights within BERT are optimized
00:14:52.160 | to reduce the difference between two vector embeddings
00:14:56.320 | or two sentence embeddings that are produced for sentence A
00:14:59.800 | and sentence B. And those sentence embeddings
00:15:03.880 | are called U and V.
00:15:05.600 | Now, to actually create those sentence embeddings,
00:15:07.960 | we do what we did before with BERT, where we
00:15:10.600 | do the mean pooling operation.
00:15:12.680 | Now, the reason that this works better
00:15:14.920 | is because we're fine-tuning it.
00:15:16.560 | So with BERT, we didn't fine-tune it.
00:15:18.720 | This time, we are fine-tuning it.
00:15:20.680 | Now, there are several different ways of training SBERT.
00:15:24.960 | But the one that was covered most prominently
00:15:28.640 | in the original paper is called the Softmax Loss Approach.
00:15:32.880 | And that's what we are going to be describing here.
00:15:36.400 | Now, for the Softmax Loss Approach,
00:15:39.680 | we can train on natural language inference data sets.
00:15:43.520 | Now, the 2019 paper used two of those, the Stanford Natural
00:15:47.640 | Language Inference Corpus and the Multi-Genre Natural Language
00:15:51.920 | Inference Corpus.
00:15:53.800 | Now, both of these were merged together.
00:15:57.720 | And what we have inside there are sentence pairs.
00:16:02.200 | One is a premise, which suggests a certain hypothesis.
00:16:07.720 | So those two sentences are, in some cases, related.
00:16:12.400 | And we can tell whether they're related or not
00:16:14.560 | using the label feature.
00:16:16.920 | Now, the label feature contains three different classes.
00:16:22.080 | We have 0, which is called entailment.
00:16:25.480 | And what this is, is it indicates
00:16:29.360 | that the premise sentence suggests the hypothesis
00:16:33.480 | sentence.
00:16:34.720 | Then we have class 1, or label 1.
00:16:37.680 | That means the two sentences are neutral.
00:16:41.200 | So they could both be true, but they are not necessarily
00:16:43.760 | related.
00:16:44.960 | And then we have number 2, which is contradiction,
00:16:47.840 | which means the premise and hypothesis sentences
00:16:51.520 | contradict each other.
00:16:53.320 | Now, given this data, we feed sentence A
00:16:56.880 | into BERT first for our fine-tuning.
00:17:01.400 | And then we feed sentence B into BERT.
00:17:05.080 | So that's our premise and hypothesis.
00:17:09.280 | The Siamese BERT, or BERT, outputs two sentence embeddings,
00:17:15.280 | U and V, from this process.
00:17:18.720 | And what we do is concatenate those two sentence embeddings.
00:17:23.440 | Now, the paper explained a few different ways of doing that.
00:17:27.760 | But the most effective was to take U and V.
00:17:31.080 | And what we do is we take the absolute value of U minus V,
00:17:36.800 | which is an element-wise operation.
00:17:38.520 | So we're basically finding the difference between U and V.
00:17:41.800 | And that produces this other vector, which
00:17:44.040 | is written as |U - V|.
00:17:47.760 | And then we concatenate all of those together.
00:17:50.960 | So they all get concatenated together.
00:17:54.000 | And then they are passed into a feed-forward neural network.
00:17:58.880 | Now, this feed-forward neural network
00:18:02.120 | takes the concatenated vector length, or sentence embedding
00:18:06.080 | length, as its input and outputs just three activations.
00:18:11.680 | Now, those three activations align with our three labels.
00:18:16.280 | So what we do from there, we have those three activations.
00:18:19.520 | We then need to calculate the softmax loss
00:18:21.480 | between those predicted labels and the true labels.
00:18:26.160 | Now, softmax loss is nothing more than a cross-entropy loss
00:18:32.000 | function.
00:18:33.040 | And that's really all there is to training one of these models.
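As a sketch of just that loss head, assuming 768-dimensional embeddings and dummy data (the sentence-transformers library wraps this up for you, and in real fine-tuning the gradients also flow back into BERT):

```python
import torch
import torch.nn as nn

dim, num_labels = 768, 3                      # 0 = entailment, 1 = neutral, 2 = contradiction
classifier = nn.Linear(dim * 3, num_labels)   # input is (u, v, |u - v|) concatenated
softmax_loss = nn.CrossEntropyLoss()          # "softmax loss" is just cross-entropy

# Dummy mean-pooled sentence embeddings for a batch of premise/hypothesis pairs.
u = torch.randn(16, dim)                      # premise embeddings
v = torch.randn(16, dim)                      # hypothesis embeddings
labels = torch.randint(0, num_labels, (16,))  # NLI labels

features = torch.cat([u, v, torch.abs(u - v)], dim=-1)   # (16, 2304)
loss = softmax_loss(classifier(features), labels)
loss.backward()   # during real training this updates the classifier and BERT weights
```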
00:18:37.680 | Now, we are going to cover all of this in full.
00:18:40.920 | We're going to go through the code and everything
00:18:42.920 | and train our own sentence BERT model.
00:18:45.920 | But for now, we're just describing the process.
00:18:48.800 | And we'll cover that in another--
00:18:51.000 | well, probably in the next video and article.
00:18:54.200 | Now, that's really, I think, everything
00:18:58.240 | we need to know for now on how they work
00:19:00.920 | and where sentence transformers and transformers come from.
00:19:04.880 | So let's jump into Python.
00:19:08.040 | And what we'll do is actually implement
00:19:09.960 | some of these models using the sentence transformers library,
00:19:12.840 | which was built by the same people who designed
00:19:17.680 | the first sentence transformer, SBERT.
00:19:20.240 | So let's go ahead and do that now.
00:19:22.600 | So the first thing that you will need to do,
00:19:24.600 | if you do not already have sentence transformers installed,
00:19:29.240 | is to just pip install sentence-transformers.
00:19:33.600 | So I already have it installed.
00:19:35.160 | So I'm not going to go ahead and run that.
00:19:38.640 | So all I'm going to do now is, from sentence transformers,
00:19:44.160 | I'm going to import the SentenceTransformer class object.
00:19:52.920 | And then from there, we can initialize a sentence
00:19:56.360 | transformer, super easy.
00:19:57.720 | So all we need to write is model equals sentence transformer.
00:20:04.480 | And then in here, we just need to write our model name.
00:20:08.160 | Now, if you go to this website, SBERT.net,
00:20:16.000 | you will find a load of different models.
00:20:19.600 | Now, the one that we will be using is this.
00:20:21.920 | So this is the original SBERT model.
00:20:25.000 | And if we just come down here and print out
00:20:27.080 | what that will return to us, we will
00:20:31.080 | see a few different components.
00:20:33.200 | So you can see that we have two components.
00:20:40.200 | We have the transformer model, and we have the pooling layer,
00:20:43.560 | which is the mean pooling that I mentioned before.
00:20:46.160 | Now, the transformer, we see that the max sequence
00:20:48.760 | length, so the maximum number of tokens that we can input there
00:20:51.400 | is 128.
00:20:52.840 | And we see that we're using the base model.
00:20:55.360 | It's a BERT model from the Hugging Face library.
00:21:00.320 | Then in pooling, we can see the output dimension
00:21:04.160 | of our sentence vector, which is 768.
00:21:07.800 | And we can also see the way that the tokens have
00:21:11.120 | been pooled to create the sentence embedding, which
00:21:13.400 | is just the mean pooling approach.
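Pieced together, the steps described so far amount to something like the following; the original SBERT checkpoint referred to here is presumably bert-base-nli-mean-tokens:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# Presumed name of the original SBERT model discussed above.
model = SentenceTransformer("bert-base-nli-mean-tokens")
print(model)
# Prints the two components described above: a Transformer module
# (max_seq_length 128, BERT base) followed by a mean Pooling module (768 dims).
```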
00:21:17.120 | Now, given these sentences here, I'm
00:21:21.680 | going to run that.
00:21:23.800 | All we need to do to actually encode those
00:21:26.640 | is we write model.encode.
00:21:29.720 | And then we just pass sentences.
00:21:33.200 | And that will create our embeddings.
00:21:34.960 | So I'm just going to call this embeddings.
00:21:38.240 | And let's see what they look like, so embeddings again.
00:21:41.200 | And we see that we have these, which are our embeddings.
00:21:48.560 | So each one of these is for a single sentence.
00:21:51.520 | So here we have the first sentence here.
00:21:54.640 | And these are each of dimensionality 768.
00:22:01.640 | So we can check the shape of that if we want as well,
00:22:04.360 | just to confirm that is true.
00:22:07.200 | So you see that we have five embeddings, five sentences,
00:22:11.080 | each one 768 dimensions.
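Continuing from the model loaded above, with five placeholder sentences written to mirror the ones described later in the video (dentists, flossing, chewing bricks, eggplants, mannequin heads), not the exact on-screen text:

```python
# Placeholder stand-ins for the five on-screen sentences.
sentences = [
    "Someone in an eggplant costume knocked over the mannequin heads.",
    "The weather forecast says it will rain all weekend.",
    "I finally finished reading that novel last night.",
    "The dentist warned me that chewing bricks is bad for my teeth.",
    "Dental specialists say flossing beats gnawing on construction materials.",
]

embeddings = model.encode(sentences)   # model is the SentenceTransformer loaded above
print(embeddings.shape)                # (5, 768): five sentences, 768 dimensions each
```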
00:22:13.960 | And now that we have our sentence embeddings,
00:22:17.200 | we can use these to quickly compare sentence similarity
00:22:20.920 | for quite a few different use cases.
00:22:24.600 | The most popular are semantic textual similarity,
00:22:29.120 | which is a comparison of sentence pairs, which
00:22:31.480 | is what we are going to do here.
00:22:33.880 | And generally, this is probably most often
00:22:37.160 | used for benchmarking these kinds of models.
00:22:41.680 | Then we have semantic search.
00:22:43.120 | Now, we've covered semantic search a lot already
00:22:46.800 | in other articles and videos.
00:22:49.960 | And this is information retrieval
00:22:52.240 | using semantic meaning.
00:22:54.840 | So given a set of sentences, we can
00:22:57.600 | search using a query sentence and identify
00:23:01.240 | the most similar records.
00:23:03.200 | So this enables us to search based on concepts
00:23:07.160 | rather than specific words, which is pretty cool.
00:23:11.520 | Now, we also have clustering.
00:23:12.760 | So we can cluster our sentences, which
00:23:14.680 | is obviously useful for things like topic modeling.
00:23:17.560 | Now, we can put together a very fast STS,
00:23:21.400 | so the semantic textual similarity example,
00:23:26.240 | using nothing more than a cosine similarity function and NumPy.
00:23:30.320 | So we just want to import NumPy as np.
00:23:33.720 | And we also are going to use the sentence transformers
00:23:40.200 | cosine similarity function.
00:23:43.280 | So we write from sentence_transformers.util import cos_sim.
00:23:50.260 | OK, so from there, what I'm going to do
00:23:52.680 | is I'm going to initialize an empty array of zeros.
00:23:59.040 | And I want that to be the length of our sentences.
00:24:01.760 | So I want it to be a 5 by 5 array.
00:24:05.560 | And what we're going to do is loop
00:24:09.600 | through all of these sentence embeddings
00:24:11.260 | we produced from SBERT and compare those
00:24:13.920 | with the cosine similarity.
00:24:16.320 | So we just want to write for i in range,
00:24:19.640 | and we'll do len of the sentences.
00:24:21.160 | And we just want to say the similarity function
00:24:28.040 | or the similarity array from i to the end
00:24:34.320 | on this specific column is going to be equal to cos_sim.
00:24:41.040 | And then here, we want to put our embeddings.
00:24:42.880 | We just want the current sentence embedding followed
00:24:46.000 | by the other embeddings.
00:24:52.020 | And that will populate our similarity array.
00:24:56.120 | So we can print that out.
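Put together, a minimal version of that loop (using the sentences and embeddings from the earlier snippet, and the cos_sim helper from recent versions of the library) might look like this:

```python
import numpy as np
from sentence_transformers.util import cos_sim

# embeddings: the (5, 768) array produced by model.encode(sentences) above.
sim = np.zeros((len(sentences), len(sentences)))

for i in range(len(sentences)):
    # Fill column i from row i downwards: sentence i compared against itself
    # and every later sentence (the other triangle of the array stays at zero).
    sim[i:, i] = cos_sim(embeddings[i], embeddings[i:])[0]

print(sim)
```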
00:24:57.780 | And we'll see, OK, so down the middle here--
00:25:01.320 | so I've not populated these because these--
00:25:06.680 | well, we've already got those pairs on this side of the array.
00:25:11.400 | You'll see we're going to visualize it.
00:25:13.160 | So that'll probably make more sense.
00:25:15.400 | So let's do that now.
00:25:16.880 | So we're going to import matplotlib.pyplot as plt.
00:25:22.800 | And we're also going to import seaborn.
00:25:24.600 | This makes it a little bit easier and nicer as well.
00:25:27.480 | And I'm just going to write sns.heatmap in here, sim.
00:25:32.640 | And we want annotations set to true.
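The plotting step, assuming the sim array from the previous snippet:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# sim is the 5x5 similarity array filled in above.
sns.heatmap(sim, annot=True)
plt.show()
```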
00:25:36.360 | OK, so this is just a visual of that array
00:25:43.440 | that we produced just now.
00:25:45.240 | And we can see here, so we have the sentence values
00:25:51.040 | or sentence positions.
00:25:53.120 | So sentence zero, if we go here--
00:25:56.520 | actually, let me print them out here instead.
00:25:58.520 | That makes more sense.
00:26:00.480 | So if I print sentences, maybe it's better if I--
00:26:05.320 | so I'll put them like this, yeah.
00:26:09.080 | So we have number zero is obviously this first sentence.
00:26:13.480 | And then we have number 1, 2, 3, and 4.
00:26:18.720 | And that corresponds to 0 to 4 and 0 to 4
00:26:23.080 | here on the two axes.
00:26:25.200 | So if we want to look at the most similar pair according
00:26:29.800 | to our SBERT model, it's this 4 and 3,
00:26:32.960 | which gets a cosine similarity of 0.64.
00:26:37.280 | And if we have a look, 3 and 4 are these two.
00:26:42.520 | So these two are the only ones that kind of mean
00:26:44.600 | the same thing.
00:26:45.720 | And I've written these so that they basically carry--
00:26:53.080 | they share none of the same descriptive words.
00:26:57.040 | So for dentists, we have dental specialists.
00:27:00.280 | Chewing bricks, I put flossing.
00:27:02.800 | So not even the same thing.
00:27:04.840 | And construction materials as well.
00:27:06.600 | It's not even the same thing there.
00:27:08.680 | But very similar sort of concepts
00:27:11.200 | that we're talking about there.
00:27:12.520 | So you can see that it's identifying those two as the
00:27:14.560 | most similar concept, which is pretty cool.
00:27:16.480 | And then we get some other similarity scores,
00:27:18.880 | which are kind of high here.
00:27:20.080 | And they're not really related.
00:27:21.480 | So we have 3 and 0.
00:27:24.520 | It's talking about eggplants and mannequin heads.
00:27:27.080 | So it's pretty different.
00:27:29.280 | I suppose, in reality, someone being an eggplant
00:27:32.040 | and this sort of thing is both kind of weird, strange things
00:27:37.280 | to happen.
00:27:38.080 | So maybe that's why it's capturing similarity there.
00:27:41.040 | But it's not obviously similar.
00:27:46.480 | So generally, I think it's good that it identifies this
00:27:51.400 | as being the most similar.
00:27:53.560 | But it could be better.
00:27:54.640 | And we do find that with the more recent models,
00:27:57.840 | it is, in fact, better.
00:28:00.000 | So what we can do is I'm going to get this other model.
00:28:03.960 | We'll just call it model.
00:28:05.120 | And we're going to do sentence transformer again.
00:28:10.360 | And this time, we are using the MPNet base model.
00:28:15.680 | Now, this is basically the highest performing model
00:28:20.720 | at the moment.
00:28:22.040 | Although I was told on the channel's Discord by Ashraq
00:28:30.120 | that when they were training this MPNet
00:28:33.760 | model, they also trained a RoBERTa model.
00:28:38.320 | And although the RoBERTa model is not
00:28:41.280 | shown on the Sentence Transformers home page,
00:28:44.800 | you can see in the competition where
00:28:46.600 | they trained both these models that they did also
00:28:48.960 | train that model.
00:28:50.400 | And it does have a slightly higher performance as well.
00:28:54.520 | It might be slower.
00:28:55.680 | I'm not sure.
00:28:56.200 | Probably because it's RoBERTa.
00:28:58.280 | But the performance of that is actually
00:29:00.800 | higher than the MPNet version of that model,
00:29:04.920 | which is pretty cool.
00:29:06.200 | Now, here, we can see a few things
00:29:09.760 | that are slightly different.
00:29:10.880 | For starters, with the MPNet base model, the max sequence length
00:29:13.640 | is three times as long as it was with the BERT model.
00:29:18.120 | And we also have this additional normalization function.
00:29:23.200 | Now, let's just take what we wrote up here.
00:29:27.440 | So I'm just going to take this, bring it down here.
00:29:32.840 | And let's just use the heat map straight away.
00:29:35.800 | So sns.heatmap, with sim, and annot set to true.
00:29:45.240 | And we'll see something slightly different.
00:29:47.480 | Or we should do.
00:29:49.880 | So I haven't processed the similarity yet
00:29:53.680 | or the embeddings.
00:29:54.920 | So let's write embeddings equals model.encode sentences.
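The MPNet-based model being loaded here is presumably all-mpnet-base-v2, which matches the 384 max sequence length and the extra Normalize module described above. Repeating the earlier steps with it looks roughly like this:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
import numpy as np
import seaborn as sns

# Presumed model name for the MPNet-based sentence transformer discussed here.
model = SentenceTransformer("all-mpnet-base-v2")

embeddings = model.encode(sentences)   # same five sentences as before

sim = np.zeros((len(sentences), len(sentences)))
for i in range(len(sentences)):
    sim[i:, i] = cos_sim(embeddings[i], embeddings[i:])[0]

sns.heatmap(sim, annot=True)
```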
00:30:01.920 | Now, if we run it, we'll see that the similarity
00:30:06.400 | of these ones, of these other sentence pairs,
00:30:10.320 | is now a lot lower.
00:30:12.760 | But it's still identifying 4 and 3 as pretty similar.
00:30:16.360 | So we can see straight away there's
00:30:17.960 | a decent performance increase between this MPNet model, which
00:30:21.800 | is the most recent model, and the original SBERT model here.
00:30:27.960 | So I think that's pretty cool to see as well.
00:30:30.600 | Now, that's it for this video introducing
00:30:33.120 | sentence embeddings and the Sentence Transformers
00:30:36.520 | library and models.
00:30:38.000 | Now, going forward, obviously, this is a series of videos.
00:30:41.720 | So we're going to be covering a lot more than just Sentence
00:30:44.120 | Transformers.
00:30:45.120 | But next, we are actually going to cover
00:30:48.280 | how we can train a sentence BERT model, an SBERT model, which
00:30:54.200 | I think will be pretty cool.
00:30:56.400 | So until then, that's it for now.
00:30:59.840 | So thank you very much for watching.
00:31:01.800 | And I will see you in the next one.