Today Unsupervised Sentence Transformers, Tomorrow Skynet (how TSDAE works)
Chapters
0:00 Why Language Embedding Matters
5:12 Supervised Methods
5:29 Natural Language Inference
7:15 Semantic Textual Similarity
7:43 Multilingual Training
10:00 TSDAE (Unsupervised)
18:50 Data Preparation
29:05 Initialize Model
32:39 Model Training
36:25 NLTK Error
37:15 Evaluation
41:01 TSDAE vs Supervised Methods
42:42 Why TSDAE is Cool
In this video we're going to look at how we can train sentence transformers without needing any labeled data. If you're new to sentence transformers, sentence embeddings or sentence vectors: a sentence vector, as we'll call it, is simply a numerical representation of a sentence or paragraph. Language is a very human-centric concept; it isn't built for computers, so computers really struggle to grasp the meaning and concepts that we as humans find so easy to communicate through language.

Modern-ish computers appeared during and around World War Two, and the first application of NLP came soon after, in 1954, with the Georgetown machine translation experiment. In that first decade of research, those involved were pretty optimistic that machine translation would be solved within a few short years. Obviously they were a little too optimistic: machine translation still isn't solved, and the same goes for pretty much everything else in NLP. In the past decade especially, though, there have been a lot of breakthroughs. The field of NLP has progressed at an incredible rate, and we now have an incredible ecosystem of language models and techniques that we can use for a lot of different use cases. A lot of that recent success is thanks, in part, to dense vector representations of language.
Dense vectors are numerical vectors that a machine can understand, but they are built in such a way that they encode the semantics, the meaning, behind whatever they represent, whether that's tokens, words, sentences or paragraphs. With dense vectors, computers can comprehend, to an extent, the semantic meaning behind language. To build them, given a lot of data and a lot of compute, we tend to use transformer models; in NLP, transformers are the de facto standard, and for building representations of sentences or paragraphs there is a subcategory of transformers called sentence transformers. The training process begins with pre-training, which produces a generic transformer model, and we then fine-tune it, that is, train it further using special methods, to build a sentence transformer that can produce very information-rich and accurate sentence vectors.
Whereas pre-training tends to use unsupervised methods, fine-tuning tends to be supervised, which means we need a lot of labeled data, and for some domains and languages there simply is not enough labeled data out there to build a sentence transformer. That leaves two options: spend a long time gathering and labeling data until you have tens of thousands of labeled samples, or try fine-tuning the model with unsupervised training. I'll tell you straight away that unsupervised training is not going to give you the performance of a supervised approach; however, if you don't have the labeled data for supervised training, unsupervised training is your best bet, and it still works pretty well.
That's what we're going to cover in this video: how to train, or fine-tune, a sentence transformer using an unsupervised method called the Transformer-based Sequential Denoising AutoEncoder, or TSDAE. We'll jump straight into it and look at where we might want to use this training approach and how we can actually implement it.
The first question we need to ask is: do we really need to resort to unsupervised training? To answer that, let's look at a few of the most popular training approaches and the sort of data each one needs.

The first is natural language inference, or NLI. NLI requires pairs of sentences labeled as either contradictory, neutral (meaning they're not necessarily related), or entailing, i.e. inferring each other. So you have pairs that entail each other and are therefore very similar, pairs that are neutral, and pairs that contradict each other; that is the traditional NLI data. Using another version of NLI fine-tuning, with multiple negatives ranking loss, you can get by with only entailment pairs, that is, positive pairs of related sentences. It can also use contradictory pairs to improve training, but it doesn't need them. So if you have positive pairs of related sentences, you can go ahead and try fine-tuning with NLI and multiple negatives ranking loss.
If you don't have that, fine: another option is semantic textual similarity (STS) data. Here you have sentence A, sentence B, and a score from 0 to 1 that tells you how similar those two sentences are, and you would train on this with something like cosine similarity loss.
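To make those two data formats concrete, here's a minimal sketch using the sentence-transformers InputExample class; the sentences and the 0.9 score are invented purely for illustration.

```python
from sentence_transformers import InputExample

# Entailment-only ("positive") pairs: enough for multiple negatives ranking loss.
nli_samples = [
    InputExample(texts=[
        "A man is playing a guitar on stage.",
        "Someone is playing a musical instrument.",
    ]),
]

# STS pairs: two sentences plus a similarity score (here already scaled to 0-1),
# trained with something like cosine similarity loss.
sts_samples = [
    InputExample(texts=[
        "A man is playing a guitar on stage.",
        "A man plays the guitar.",
    ], label=0.9),
]
```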
If that's not an option either, and your use case is building a sentence transformer for another language where no sentence transformer currently exists, you can use multilingual parallel data. Parallel data just means translation pairs: for example, an English sentence paired with a sentence in some other language, XX, where XX is your target language. With that, you can fine-tune a model using multilingual knowledge distillation, which takes a monolingual model, for example in English, and, using those translation pairs, distills the knowledge, the semantic similarity knowledge, from the monolingual English model into a multilingual model that can handle both English and your target language.

So those are three quite popular, very common options you can go for, and because they are supervised methods, the chances are they will outperform anything you do with unsupervised training, at least for now.
If none of those sound like data you can realistically get, or they just don't match your use case, then we have to move on to unsupervised training. In short, you want to go unsupervised if you have little to no labeled data in your unique domain, or you are working with a low-resource language for which you also have no translation data from a source language like English (other source languages work too, as long as a monolingual model exists for that source language) into your target language.
If we can't do any of that, we move on to unsupervised learning, and one of the best approaches at the moment is the Transformer-based Sequential Denoising AutoEncoder, TSDAE. There are other approaches as well, which we won't cover here, and some of the methods currently being researched do look quite promising, but for now TSDAE is probably your best bet.

TSDAE works like this. You take a sentence and corrupt it, removing parts of it or otherwise modifying it, so that it ends up slightly different, but not so different that it stops being similar. You feed that corrupted input into your encoder model, the transformer, which outputs a set of token vectors, and you use some pooling method to convert those into a single sentence vector. That sentence vector is then processed by another model, a decoder, and the decoder's job is to reproduce the original, uncorrupted sentence. The weights of both the encoder and the decoder are optimized so that the decoder can actually do that. And that is TSDAE. It isn't particularly complex when you look at it at a high level, and it's a genuinely clever way of building a sentence transformer without any labeled data.
Now let's look at a little graphic comparing TSDAE to MLM. MLM is masked language modeling, a pre-training approach that many of you will be familiar with, which is why I wanted to include the comparison. With MLM, at the bottom of the graphic, we take some input text and mask one of its tokens. We pass it through an encoder, which outputs token vectors, roughly one for every word or subword. All of those pass into a decoder, and the decoder attempts to predict which word or token sits behind the masked position, so it is optimizing for that mask to become 'elephant'. That is one of the ways you can pre-train a transformer.

TSDAE differs in a couple of important ways. First, we are not masking the input: the TSDAE paper tested both, and it was found that the better approach is to simply delete the token, so where the mask would have been, the token is just gone. Second, we still have an encoder, but it is followed by a pooling step that converts the token vectors into a single sentence vector, and it is that sentence vector that gets passed on to the decoder. If you compare the two, the decoder in masked language modeling receives a lot more information: it gets token-level information, whereas the decoder in TSDAE works with far less, sentence-level information rather than token-level information. It is still optimizing for the same thing, to predict the original text, 'elephant' included, rather than the missing or corrupted text that was fed into the encoder. That is the main difference between the two.
In the TSDAE paper from Wang, Reimers, and Gurevych, the authors tested different choices for this fine-tuning setup. The first is the noise type: when we take the original text and corrupt it, what is the best way to corrupt it? They found that deleting tokens produced the best results. There are other options, you can swap one token for another, mask tokens as we saw with masked language modeling, replace tokens, add new tokens, and so on, but by far the best was simply deleting them. There is also the question of how many tokens to delete: you go through each token and assign a probability of that token being deleted, which is the noise ratio, and the best value there was 0.6. So each token has a 60% chance of being deleted, which means you are removing quite a lot of the data.
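Just to build intuition for that deletion noise, here is a rough, simplified sketch. It splits on whitespace, whereas the library's own implementation, which we'll rely on later, uses a proper word tokenizer, so treat this as an illustration rather than the exact method.

```python
import random

def delete_noise(text: str, del_ratio: float = 0.6) -> str:
    """Illustrative deletion noise: drop each token with probability del_ratio."""
    words = text.split()
    if not words:
        return text
    kept = [w for w in words if random.random() > del_ratio]
    if not kept:                      # make sure at least one token survives
        kept = [random.choice(words)]
    return " ".join(kept)

print(delete_noise("TSDAE trains sentence embeddings without any labels"))
# e.g. -> "TSDAE embeddings any"
```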
Then there is the pooling method, the little circle after the encoder in the diagram. The highest-scoring approach there was mean pooling, but mean pooling involves the extra step of taking the average across all of the word or token vectors, whereas with CLS pooling you don't do that: you just take the CLS token, the classifier token from BERT that sits at the front of the sequence. So the configuration they stuck with is deletion only when corrupting the data, with a deletion ratio of 0.6, in other words a 60% probability per token, and CLS pooling at the end. You can use mean pooling as well, or even max pooling, it's up to you, but later on we're going to stick with CLS to follow along with the paper.
Here's a quick explanation of CLS and mean pooling in case it's new to you. Our encoder outputs a whole set of token vectors. With CLS pooling, we just take that single CLS vector and use it as our sentence vector; we're not really computing anything, just extracting that one vector. With mean pooling, we instead take an average over all of the output token vectors to create our sentence vector.
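As a minimal sketch, assuming BERT-style encoder outputs of shape (batch, sequence length, hidden size), the two pooling modes look roughly like this; in practice the sentence-transformers Pooling module handles this for us.

```python
import torch

def cls_pool(token_embeddings: torch.Tensor) -> torch.Tensor:
    # token_embeddings: (batch, seq_len, hidden); just take the first ([CLS]) position
    return token_embeddings[:, 0]

def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # average over real tokens only, ignoring padding positions
    mask = attention_mask.unsqueeze(-1).float()     # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)        # avoid division by zero
    return summed / counts

# toy example: batch of 2 sequences, 4 tokens each, hidden size 8
emb = torch.randn(2, 4, 8)
mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])
print(cls_pool(emb).shape, mean_pool(emb, mask).shape)  # both torch.Size([2, 8])
```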
So that's it for the visual explanation; let's jump into how we actually build, or fine-tune, a model using this approach.

First we load a dataset. We're using Hugging Face datasets here; if you haven't used it before, you'll want to pip install datasets. We're getting the OSCAR corpus, which is a massive multilingual corpus with a lot of different languages in it; I'm sure not every single language, but if you can think of one, it's probably in there. We take the English portion, just so I can actually read and understand things, and the train split. And this part is important: the English portion of OSCAR is quite massive, around 1.8 terabytes of data. If you don't set streaming=True, all of that will download to your computer, and a lot of us probably don't even have that much space available on our machines, so it won't work. With streaming=True, each sample is downloaded and pulled through only as we request it, one at a time rather than the full thing, which is obviously a lot more efficient and isn't going to break our computer. Now, because we're streaming, we have to iterate through the dataset, so to show you part of it I'm going to write for row in oscar, print the row, and break, so we just print the first item.
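Here is that loading and peeking step as a minimal sketch; the OSCAR config name below is my assumption of how the English subset is named on the datasets hub, so adjust it if yours differs.

```python
from datasets import load_dataset  # pip install datasets

# stream the English portion of OSCAR so nothing is downloaded up front
oscar = load_dataset(
    "oscar", "unshuffled_deduplicated_en",
    split="train", streaming=True,
)

# peek at the first record only
for row in oscar:
    print(row)  # {'id': ..., 'text': '...'}
    break
```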
In that first record we have two features: id, which is just an index value, and text. The text is pretty long, with multiple paragraphs and multiple sentences in it, and when we're training with TSDAE we only want small-ish chunks, one sentence per sample. So we need to split that text up into individual sentences.
To do that, I'm going to import re, because I want to split on full stops and on newline characters, and strip the surrounding spaces at the same time; essentially, I want to split on anything that indicates the start of a new sentence or paragraph. So I create a regex with re.compile that matches a full stop, optionally followed by a space, optionally followed by a newline character. It will always match the full stop, and it will also consume a trailing space and a newline if they're there, which captures everything we need. If we run splitter.split on the row's text, we now get all these nice individual sentences rather than one massive paragraph. You'll see some of them are not very long, and we'll remove those later on, because they aren't really sentences.
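Here's roughly what that splitter looks like; the exact pattern is my reconstruction of the regex described above.

```python
import re

# a full stop, optionally followed by a space, optionally followed by a newline
splitter = re.compile(r"\.\s?\n?")

print(splitter.split(row["text"])[:5])  # first few sentences of the record above
```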
So let's do that. I'm going to create a counter for the number of sentences we manage to capture. In the TSDAE paper they found that roughly 10K sentences is pretty much all you need, and you can go up to around 100K if you want, so we'll go up to 100K here. We probably don't strictly need to; 10K, maybe even fewer, would likely do, and English is probably a reasonably easy case, so you could go lower still and get decent results. That's one nice thing about TSDAE: you need very little data, which is pretty cool, especially when it isn't labeled. So I create a sentences list and iterate through OSCAR: for each row, split the text into new_sentences using the splitter, then keep only the lines whose character length is above some minimum. I was going to use 10, but looking at the short fragments, 20 seems safer, so let's go with 20 for now and see how it goes. Those are the new sentences from a single sample, so we extend our sentences list with them, add their count to the running total, and break once that total exceeds 100K, because OSCAR is a massive dataset and we don't need more than that.
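Put together, that loop looks something like this, with the length threshold of 20 and the 100K cap discussed above.

```python
num_sentences = 0
sentences = []

for row in oscar:
    new_sentences = splitter.split(row["text"])
    # keep only strings long enough to plausibly be full sentences
    new_sentences = [line for line in new_sentences if len(line) > 20]
    sentences.extend(new_sentences)
    num_sentences += len(new_sentences)
    if num_sentences > 100_000:  # ~10K-100K is plenty according to the TSDAE paper
        break

print(len(sentences))
```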
Running that is quite quick and easy. The next step, as we usually would with PyTorch, is to put this data into a dataset object and then load that dataset into a data loader. Because we're doing this thing where we corrupt our data by adding noise to it, we either need to do that manually when building our dataset object, or we can just use the DenoisingAutoEncoderDataset from the sentence-transformers library. So we import DenoisingAutoEncoderDataset from sentence_transformers.datasets, and, as we usually would in PyTorch, DataLoader from torch.utils.data. We create the dataset by passing in our sentences, and that's all it needs. From there we create the loader, passing in the dataset, a batch size of 8, shuffle set to True, and drop_last set to True, because the final batch will most likely not be the same size as the rest of our batches, and it's easier to just drop it.
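As a sketch, the dataset and loader setup described above looks like this.

```python
from torch.utils.data import DataLoader
from sentence_transformers.datasets import DenoisingAutoEncoderDataset

# wraps each sentence and applies the deletion noise on the fly,
# pairing a corrupted copy of the sentence with the original
dataset = DenoisingAutoEncoderDataset(sentences)

loader = DataLoader(
    dataset,
    batch_size=8,
    shuffle=True,
    drop_last=True,  # drop the final, smaller batch
)
```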
We now have that: our data is prepared for fine-tuning, and we can move on to the final preparation before we actually fine-tune the model, which is setting up the loss function and the training call itself. Before we even get to the loss function, we need to define the model we're going to train. We'll use bert-base-uncased from the Hugging Face models repository, initialized through sentence-transformers, so we import SentenceTransformer and models from sentence_transformers. The BERT module is models.Transformer, and we just pass it the name 'bert-base-uncased'; this downloads the model directly from the Hugging Face model hub, just as we would with the transformers library. We also need a pooling layer, models.Pooling, and as we saw earlier we want to pool using the CLS, or classifier, token. Before we can create it, though, we need to tell the pooling layer what dimensionality to expect from the token vectors, which we get with bert.get_word_embedding_dimension(); for BERT base that is 768. At this point we have two separate modules, and we need to combine them into a single sentence transformer model: this is where the SentenceTransformer class comes in, with modules set to the BERT module followed by the pooling layer. If we print the model, we see the Transformer wrapping the BERT model and the Pooling layer with a word embedding dimension of 768 and pooling_mode_cls_token set to True, while the other pooling modes are False, because we're not using them.
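That model setup, roughly:

```python
from sentence_transformers import SentenceTransformer, models

# BERT backbone, downloaded from the Hugging Face model hub
bert = models.Transformer("bert-base-uncased")

# CLS pooling, sized to the backbone's output dimension (768 for BERT base)
pooling = models.Pooling(
    bert.get_word_embedding_dimension(), pooling_mode="cls"
)

model = SentenceTransformer(modules=[bert, pooling])
print(model)
```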
With the model defined, we can define our loss function. We import DenoisingAutoEncoderLoss from sentence_transformers.losses and instantiate it with our model, so it knows what to actually optimize. We also need to make sure the encoder and decoder weights are tied, so we set that to True; you can set it to False, but the performance will not be as good.
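And the loss, as described:

```python
from sentence_transformers.losses import DenoisingAutoEncoderLoss

# tie the encoder and decoder weights, as recommended for TSDAE
loss = DenoisingAutoEncoderLoss(model, tie_encoder_decoder=True)
```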
That is everything, so we can go ahead and move on to the training call, model.fit. A few points on the setup first: we're going to use an Adam optimizer with a learning rate of 3e-5, and that learning rate is constant. If you've watched the previous videos, we have tended to warm up to the target learning rate; here we just use a constant learning rate all the way through, and there's also no weight decay. In model.fit we first need to give the training objectives, train_objectives, which is a list of pairs of data loaders and the loss functions used to optimize on that data; in our case there is just one pair, our loader and our loss. We train for one epoch, so epochs=1. The default optimizer here is AdamW, the Adam-with-weight-decay optimizer from the Transformers library, so to get plain Adam behaviour we set weight_decay to zero. We set the scheduler to a constant learning rate, which is why I mentioned it before: no warm-up, just the same rate throughout. Then we pass our optimizer parameters, which here is just the learning rate of 3e-5, nothing more, and since we'll probably want to watch the progress bar while training, we set show_progress_bar to True as well. After training finishes, you'll probably want to save the model, and you can save it wherever you'd like; I'm going to save mine as 'tsdae'.
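The training call, roughly as described; the save path is just an example, use whatever you like.

```python
model.fit(
    train_objectives=[(loader, loss)],
    epochs=1,
    weight_decay=0,                # default optimizer is AdamW; zero decay gives plain Adam behaviour
    scheduler="constantlr",        # constant learning rate, no warm-up
    optimizer_params={"lr": 3e-5},
    show_progress_bar=True,
)

model.save("tsdae")
```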
I'm going to start that running and then stop it, because, to be fair, it doesn't take long anyway: with 100,000 samples this took about 20 minutes on a reasonably good GPU, an RTX 3090, so obviously a decent card, but nothing like a Tesla or anything like that. So it's really quick, and I was pretty impressed. You can see that it's training, so I'll stop it here and switch over to the other notebook where I have the completed training and the evaluation I performed afterwards, and we'll run through it really quickly. Everything there is the same as before.
I've saved the model there, and there was one thing I did want to mention: if you get an error from NLTK, you just need to pip install nltk if you haven't already got it installed, then import nltk and download the punkt tokenizer. That tokenizer is used in the denoising process, where we're adding and removing noise, which is why it's needed. So if you do get that error, just run the following.
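That fix looks like this:

```python
import nltk  # pip install nltk if it isn't installed already

# the noise/denoising step tokenizes text with NLTK's punkt models
nltk.download("punkt")
```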
After that, with the model trained, we want to evaluate its performance, because we want to see that it has actually worked. One benchmark you've probably seen me use a few times already is the Semantic Textual Similarity Benchmark, or STSb, which again we can pull with Hugging Face datasets: it's the stsb subset of the GLUE dataset, and we take the validation split. We're not taking the training data, though we could, since we haven't trained on it, but just in case. Each record contains sentence one, sentence two, and a label, which is a score; if you remember, earlier we talked about STS data that we could train on using cosine similarity loss, and this is exactly that kind of data. In this dataset the score ranges from 0 to 5, and we want to normalize it to the range 0 to 1, so I use the datasets map function with a lambda that divides every label by 5. Then we reformat that benchmark data, the STSb data, using the sentence-transformers InputExample class; we need this because we're using the evaluators from the sentence-transformers library later on. All we do is loop through, building a list of InputExample objects, each containing the two sentences and the label, i.e. the score. Then we initialize a similarity evaluator, the EmbeddingSimilarityEvaluator, which is made for this type of STS data, and pass in all the samples we have. You can have it write to CSV if you want, but I'm not doing that because I just want the overall score, not all the detail. Finally, we evaluate the model by simply passing it to the evaluator.
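The whole evaluation, roughly; depending on your sentence-transformers version, the evaluator may return a single score or a small dictionary of scores.

```python
from datasets import load_dataset
from sentence_transformers import InputExample
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

sts = load_dataset("glue", "stsb", split="validation")
sts = sts.map(lambda x: {"label": x["label"] / 5.0})  # rescale 0-5 labels to 0-1

samples = [
    InputExample(texts=[row["sentence1"], row["sentence2"]], label=row["label"])
    for row in sts
]

evaluator = EmbeddingSimilarityEvaluator.from_input_examples(samples, write_csv=False)
print(evaluator(model))  # Spearman correlation of cosine similarities vs. the labels
```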
Using the model we trained with TSDAE, we get 0.75. That's the Spearman correlation coefficient, which is basically asking: where our model scores pairs highly, does that correlate with where the true labels are high? A value of 0.75 says yes, there is correlation there, and it's pretty strong; not the strongest, as we'll see in a minute, but pretty strong. So that's a good score, and we can see that it is working. Now, if we compare that to an untrained model, so what we had before we actually fine-tuned it with TSDAE, I've reinitialized the same model as before further down and evaluated it, and the score is about 0.32, which is obviously way lower; there's some correlation there, but it's not great. So TSDAE is giving us pretty good results, I think; it's really not bad.
Something we mentioned earlier is that, yes, TSDAE works, but as an unsupervised method it can't really be compared to supervised methods in terms of the performance you'll get from your sentence transformer, and I wanted to show that here. The first thing I've done is take the original SBERT model, and it gets 0.80, or 0.81 if you want to round up. So it's about 7% better than the unsupervised method, which is a fair bit, but not massive, so at least our unsupervised model is up in a good area of performance; it's not the best, but it does work. And in the paper they did do better than this, I think around 0.78, so they beat what I got here anyway, and you can probably do better too. Then I wanted to look at a more advanced model, one of the more recent ones at least, an MPNet-based model, and that scored pretty much 0.89. So we get an okay, I think pretty good, result with TSDAE, but obviously it can't quite compare to those supervised models.
Now that's it for this video. For me at least, this unsupervised approach is one of the coolest approaches to training and building models that I've seen in sentence transformers, possibly only paralleled by multilingual knowledge distillation for training multilingual models; the two of them together, I think, are really incredible. I know this isn't the best performance, but when you think about all of the low-resource languages out there that don't really have much data, just unstructured text like this, or the very specific domains that have loads of text data but no labeled data, and can't afford to pay someone to create it all, or don't have the time to, I think this is a really cool method to actually be able to use. You don't really need anything. So for that reason, I think this is really cool, and it's definitely one of the most interesting ways of training, in my opinion: training something without labeled data that actually works pretty well. So yeah, that's it for this video. I hope it's been useful, and thank you very much for watching, and I'll see you in the next one.