
Today Unsupervised Sentence Transformers, Tomorrow Skynet (how TSDAE works)


Chapters

0:00 Why Language Embedding Matters
5:12 Supervised Methods
5:29 Natural Language Inference
7:15 Semantic Textual Similarity
7:43 Multilingual Training
10:00 TSDAE (Unsupervised)
18:50 Data Preparation
29:05 Initialize Model
32:39 Model Training
36:25 NLTK Error
37:15 Evaluation
41:01 TSDAE vs Supervised Methods
42:42 Why TSDAE is Cool

Transcript

In this video we're going to have a look at how we can train sentence transformers without needing any labeled data. So if you're new to sentence transformers, sentence embeddings or vectors, a sentence vector, as we'll call it, is simply a numerical representation of a sentence or paragraph. If you think about language, it's a very human-centric concept; it's not built for computers, so computers really struggle to get the meaning or concepts that we as humans find very easy to communicate using language.

Now the modern-ish computers appeared during and around World War Two. The first application of NLP came soon after in 1954 with the Georgetown machine translation experiment. In that first decade of research those involved were pretty optimistic that they were going to solve the problem of machine translation in just a few short years.

Obviously they were a little bit optimistic; machine translation still isn't solved, and the same goes for pretty much everything else in NLP, but in the past decade especially there have been a lot of breakthroughs. The field of NLP has progressed at an incredible rate in just the past decade, and we now have an incredible ecosystem of language models and techniques that we can use for a lot of different use cases.

Now a lot of this recent success is in part thanks to dense vector representations of language. Those are numerical vectors that a machine can understand, but built in such a way that they actually provide a numerical representation of the semantics, the meaning, behind whatever it is those vectors represent, whether that be tokens, words, sentences, paragraphs and so on.

So with those dense vectors we now have a way for computers to comprehend and understand, to an extent, the semantic meaning behind language. To build those, given a lot of data and a lot of compute, we tend to use transformer models. In NLP, transformers are the de facto standard, and for building representations of sentences or paragraphs there is a subcategory of transformers called sentence transformers.

Now the training process to build a transformer begins with something called pre-training, which produces a generic transformer model, and then we fine-tune that, so we train it further using special methods, to build sentence transformers that can produce these very information-rich and accurate sentence vectors. Whereas pre-training tends to use unsupervised training methods, fine-tuning tends to be more along the lines of supervised training, and that means we need a lot of labeled data, and for some domains and languages there simply is not enough labeled data out there to actually build a sentence transformer for those specific domains or languages.

So that means you can either spend a long time gathering and labeling data to get tens of thousands of labeled samples, or you can go ahead and try fine-tuning a model using unsupervised training. Now unsupervised training, I will tell you straight away, is not going to get you the performance that you would get from a supervised training approach; however, if you do not have the labeled data to train using a supervised approach, unsupervised training is your best bet, and it still works pretty well.

So in this video that's what we're going to cover: how we can train, or fine-tune, a sentence transformer using an unsupervised training method called Transformer-based Sequential Denoising Auto-Encoder, or TSDAE. So what we'll do is jump straight into it and take a look at where we might want to use this training approach and how we can actually implement it.

So the first question we need to ask is: do we really need to resort to unsupervised training? What we're going to do here is have a look at a few of the most popular training approaches and what sort of data we need for each. The first one we're looking at is natural language inference, or NLI, and NLI requires that we have pairs of sentences that are labeled as either contradictory, neutral (which means they're not necessarily related), or as entailing or inferring each other.

So you have pairs that entail each other, so they are both very similar, pairs that are neutral, and also pairs that are contradictory. And this is the traditional NLI data. Now, using another version of fine-tuning with NLI called multiple negatives ranking loss, you can get by with only entailment pairs, so pairs that are related to each other, or positive pairs.

It can also use contradictory pairs to improve the performance of training, but you don't need them. So if you have positive pairs of related sentences, you can go ahead and try training or fine-tuning using NLI with multiple negatives ranking loss. If you don't have that, that's fine.

Another option is that you have semantic textual similarity data, or STS, and what that is: you have sentence A here, sentence B here, and then you have a score from 0 to 1 that tells you the similarity between those two sentences. And you would train on this using something like cosine similarity loss.

Now if that's not an option and your focus or use case is on building a sentence transformer for another language where there is no current sentence transformer, you can use multilingual parallel data. What I mean by that is translation pairs, which is all parallel data means. So if you have, for example, an English sentence and then another language here, it can be anything, I'm just going to put XX, and that XX is your target language, you can fine-tune a model using something called multilingual knowledge distillation.

And what that does is take a monolingual model, for example in English, and, using those translation pairs, distill the knowledge, the semantic similarity knowledge, from that monolingual English model into a multilingual model which can handle both English and your target language. So those are three options that are quite popular, very common, that you can go for.

And as they're supervised methods, the chances are they're probably going to outperform anything you do with unsupervised training, at least for now. So if none of those datasets sound like something you can realistically get hold of, or they just don't match your use case, then we have to move on to unsupervised training.

So, like I've written here, you want to go for unsupervised training if you have little to no data in your unique domain or your low-resource language, and, in the low-resource language case, no translation data either from a source language like English (or any other language, as long as there's a monolingual model in that source language) to your target language.

So if we can't do that, we move on to unsupervised learning. And one of the best approaches at the moment is this Transformer-based Sequential Denoising Auto-Encoder, TSDAE. Now there are other approaches as well, but we're not going to cover those, and I think for now this is probably your best bet, although there are other methods being researched that do look quite promising as well.

So the way that TSDAE works is this: let's say you have a sentence here. What you do is take your sentence and corrupt it, so we remove parts of it, or modify it in some other way.

So it's slightly different, but not so different that it's no longer similar. And we take that modified input and feed it into our encoder model, our transformer. The transformer outputs a set of token vectors, and we use some sort of pooling method to convert those into a single sentence vector.

So with that sentence vector, we process it through another model here, a decoder model. It goes into the decoder, and what that decoder must do is optimize to reproduce the original text. So it has to try and predict the original sentence.

And these weights in the decoder and encoder are optimized in order for the decoder to be able to actually do that. And that is TSDAE. It's not particularly complex when you sort of look at a very high level, and it's certainly a very intelligent way of building a sentence transformer without any labeled data.

Now let's have a look at a little graphic here to compare TSDAE to MLM. MLM is masked language modeling, and that is a pre-training approach that a few of you will probably be familiar with, which is why I wanted to include this comparison. So with MLM, which is the bottom row down here, we take some input text and we mask one of the tokens in that text.

We pass it through an encoder which outputs all these token vectors, so every word or subword here will be represented by one of these token vectors or token embeddings. All of those pass into the decoder, and the decoder attempts to predict which word or token is behind that masked token here.

So it's trying to optimize for that mask to become "elephant". And that's one of the ways that you can pre-train a transformer. TSDAE is different for quite a few reasons, but I think the main ones are these: one, we are not necessarily masking the input; it was found that the best approach is to actually delete the token.

So you see here we should have that mask here, but it's not there anymore, we just removed it. So one, we delete rather than mask, though in the TSDAE paper they did test both. We have an encoder as before, but that is followed by this pooling step. And this pooling step takes the token vectors that we see down here, and it converts them into a single sentence vector.

So that sentence vector is passed on to the decoder, so if you compare both of these steps here, these two, the decoder in Masked Language Modeling is getting a lot more information. It's getting token level information, whereas the decoder in TSDAE is dealing with a lot less data and it's dealing with sentence level information rather than token level information.

And it is then optimizing for the same thing, to try and predict that we should have this text here with "elephant" rather than the missing or corrupted text that was input into the encoder initially. So that's the main difference between the two. Now in the TSDAE paper from Wang, Reimers, and Gurevych, they tested different approaches to fine-tuning.

So the first of those is the noise type. When we take that original text and corrupt it, what is the best approach, in other words, how do we corrupt it? They found that deleting tokens, so this box here, produced the best results. Other options: you can swap tokens, so swap one word for another; you can mask, as we saw with masked language modeling; you can replace tokens; you can add new tokens; you can do different things.

But by far the best here was to just delete tokens. Now there's also the question of how many tokens to delete. Going through each token, you assign a probability of that token being deleted, and that's what we see here with this noise ratio. So again, the best approach there is 0.6.

Okay, so we're going through each token and assigning a probability of 60% that that token will be deleted. So you are removing quite a lot of data. And then we have the pooling method, the little circle after the encoder. The best, or highest performing, approach here was using mean pooling, okay?

But to mean pool, you have another step of actually taking the average across all those word vectors or token vectors, whereas with CLS pooling you don't do that. You just take the CLS token, which is the classification token from BERT. So if you've seen it before, it looks like this.

Okay, and then you have your other tokens following it. So that is the approach that they stuck with. They used deletion only when corrupting the data, with a noise ratio of 0.6, so a deletion probability of 60%. And at the end, they used CLS pooling. You can use mean pooling as well, or even max.

It's up to you, but later on we're going to stick with CLS to follow along with the paper. Here's just a quick explanation of CLS and mean pooling if they're new to you. We have our encoder, and it outputs loads of token vectors. With CLS pooling, we just take that single CLS vector, and that is our sentence vector.

So we're not really doing anything there; we're just extracting that vector. Whereas with mean pooling, we're taking an average over all of the output token vectors to create our sentence vector. So that's it for the visual explanation. Let's jump into how to actually build or fine-tune a model using this approach.
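Before we do, here's a minimal sketch of what CLS pooling and mean pooling actually do to an encoder output, using a made-up tensor just to show the shapes (none of these numbers come from a real model):

```python
import torch

# toy "encoder output": batch of 1 sequence, 5 tokens, 768-dim vectors
token_embeddings = torch.randn(1, 5, 768)
attention_mask = torch.tensor([[1, 1, 1, 1, 0]])  # last token is padding

# CLS pooling: just take the first token's vector as the sentence vector
cls_vector = token_embeddings[:, 0]  # shape (1, 768)

# mean pooling: average over the real (non-padding) token vectors
mask = attention_mask.unsqueeze(-1).float()                    # shape (1, 5, 1)
mean_vector = (token_embeddings * mask).sum(1) / mask.sum(1)   # shape (1, 768)
```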

Okay, so here I'm just loading a dataset. We're using Hugging Face datasets here; if you haven't used it before, you'll want to pip install datasets. And what we're doing is getting the OSCAR dataset, the OSCAR corpus, which is basically a massive multilingual corpus. It has a lot of different languages in there.

I'm sure not every single language, but if you can think of a language, it's probably in there. And we're taking the English portion of that, just so I can actually read and understand things, and we're taking the training data. Okay, and this is important: the OSCAR dataset, at least for English, is quite massive.

It's 1.8 terabytes of data. If you don't include streaming=True, all of that is going to download to your computer, and a lot of us probably don't even have that much space available on our machines anyway, so it won't work. So you need to add streaming=True, because what that will do is, as we request a sample from the dataset, download it and pull it through for us one sample at a time, not the full thing.

So it's obviously a lot more efficient, and it's not going to break our computer or anything. Now, because we're streaming it, we have to iterate through it. So to show you part of it, I'm going to go for row in oscar, print that row, and just break.
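As a rough sketch, the loading step might look like this; the config name "unshuffled_deduplicated_en" is my assumption for the English split, so check the dataset card for the exact name:

```python
from datasets import load_dataset  # pip install datasets

# stream the English split of OSCAR so we never download the full ~1.8TB
oscar = load_dataset(
    "oscar",
    "unshuffled_deduplicated_en",  # assumed config name for English
    split="train",
    streaming=True,
)

# streaming datasets are iterated, not indexed
for row in oscar:
    print(row)  # {'id': ..., 'text': '...'}
    break
```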

Okay, so we're just going to print the first item there. So we have these two features, ID, which is just an index ID value, and we also have text. Now in here, we can see quite a lot. So there's this text, and it's pretty long. So there's multiple paragraphs in there, multiple sentences.

And when we're training with TSDAE, we only want small-ish sentences; we just want one sentence for each sample. So we need to split that text up into just sentences. To do that, I'm going to import re, because I want to split on periods, full stops.

And I also want to split on newline characters, and I want to remove spaces at the same time. So I just want to remove anything that indicates that this is a new sentence or paragraph, and split based on that character. So I'm going to create a regex, re.compile, and that is going to be any full stop followed by a space.

That's optional, so it doesn't have to be followed by a space, and it's also optionally followed by a newline character. Okay, so this is going to match any full stop for sure, it will allow for there to be a space included, and it will also allow for a newline character as well.

So that's going to capture everything for us. Let's see what that looks like. We write splitter.split and pass in row['text'], and yeah, okay, now we see that we have all these nice sentences rather than just one massive paragraph.
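Putting that into code, here's a minimal sketch of the splitter; the exact pattern is my reading of the description above, so adjust it if your text splits badly:

```python
import re

# match a full stop, optionally followed by a space and/or a newline
splitter = re.compile(r"\.\s?\n?")

new_sentences = splitter.split(row["text"])
print(new_sentences[:5])  # first few sentences from that one sample
```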

You see some here are not very long, and what we'll do is remove them later on, because they aren't really sentences. So let's do that. I'm going to create a counter here which is going to count the number of sentences we manage to capture. In reality, or at least in the TSDAE paper, they found that 10k sentences is pretty much all you need, and you can go up to 100k as well if you want.

So we're going to go up to 100k. We probably don't necessarily need to; probably 10k, maybe even lower, would do. English is probably a reasonably easy one for this to figure out, so you could possibly go even lower and still get decent results. That's one thing about TSDAE: you need very little data, which is pretty cool, especially when it's not labeled.

So we have sentences; I'm going to create a list, and we're going to iterate through the dataset. So for row in oscar, we create our new sentences: new_sentences equals splitter.split on row['text'], like we did before. We also want to filter out the short lines, so we keep line for line in new_sentences if the length of that line is greater than 10.

Okay, so we're saying we only want to include strings that have a character length greater than 10, and we can maybe even increase that, because if we look at this, this is definitely more than 10.

So let's just go with 20 for now and see how that goes. Now, those are our new sentences from a single sample, and we want to extend our sentences list with them, so we write sentences.extend(new_sentences). Okay, and like I said, OSCAR is a massive dataset.

If we run through this for the whole dataset, we're going to end up with a lot of sentences, and we don't need that many; we only need 10 to 100 thousand. So what I'm going to do is take the number of new sentences we've just added and add it onto num_sentences.

So once that exceeds 100k, we want to break out of the loop. Okay, and with that, we should be able to run it.
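Putting the whole collection loop together, a minimal sketch, reusing the oscar stream and the splitter from above, with the 20-character cut-off mentioned a moment ago:

```python
num_sentences = 0
sentences = []

for row in oscar:
    new_sentences = splitter.split(row["text"])
    # keep only strings long enough to plausibly be real sentences
    new_sentences = [line for line in new_sentences if len(line) > 20]
    sentences.extend(new_sentences)
    num_sentences += len(new_sentences)
    if num_sentences > 100_000:
        break  # 10k-100k unlabeled sentences is plenty for TSDAE
```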

It should be quite quick. Okay, pretty nice and easy. The next step, as we usually would with PyTorch, is to put this data into a dataset object, and then load that dataset object into a data loader. Now, because we're doing this thing where we corrupt our data by adding noise to it, we either need to do that manually when we're building our dataset object, or we can just use the sentence-transformers DenoisingAutoEncoderDataset object.

To use that, we just write from sentence_transformers.datasets import DenoisingAutoEncoderDataset, and we're also going to create a data loader now as well. So, as we usually would in PyTorch, we import that too: from torch.utils.data import DataLoader. Then we want to create a dataset.

So dataset is equal to DenoisingAutoEncoderDataset, and we just pass our sentences into that; that is all we need for our dataset. From there we can create our data loader: loader equals DataLoader, and we pass in our dataset, and we also want to say what the batch size is and whether we want to shuffle the data.

For the batch size, we're going to put 8. We do want to shuffle the data, so shuffle is True, and we also want to drop the last batch, because it will most likely not be the same size as the rest of our batches. So we'll just drop it; it's easier.
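Put together, the dataset and loader setup might look like this sketch:

```python
from sentence_transformers.datasets import DenoisingAutoEncoderDataset
from torch.utils.data import DataLoader

# the dataset object applies the deletion noise for us, yielding
# (corrupted sentence, original sentence) pairs
dataset = DenoisingAutoEncoderDataset(sentences)

loader = DataLoader(dataset, batch_size=8, shuffle=True, drop_last=True)
```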

So we run that, and our data is now prepared for fine-tuning. We need to move on to the final preparation before we actually fine-tune the model, which is setting up the loss function and the training call itself. But before we even do the loss function, we need to define the model that we're going to be training.

We're going to be using bert-base-uncased from the Hugging Face transformers library, and we will initialize it through sentence-transformers. So from sentence_transformers we import SentenceTransformer, and we also want models. Now the BERT model is going to be models.Transformer, and in there we just pass the name bert-base-uncased, so this will download the model directly from the Hugging Face model repository, as we would with the transformers library.

So we write bert-base-uncased, and then we also need a pooling layer. We write pooling equals models.Pooling, and as we saw before when we were working through everything, we want to pool using the CLS, or classification, token. To do that, we will need to pass 'cls' in there.

But before we do that, we also need to tell the pooling layer what dimensionality to expect from those token vectors. To get that, we can just call bert.get_word_embedding_dimension(), and we'll see 768, but only once the transformer layer is actually defined, so let me put that in there.

So I'll add that in there. Okay, run this, and you'll see in a sec. That's just initializing everything, and we have that 768. Now what we have here are two separate layers, and we need to combine or merge them into a single sentence transformer model. To do that, we write model equals SentenceTransformer (this is where the SentenceTransformer part comes in) with modules equal to the BERT layer followed by the pooling layer.

And then we can also print out the description of that model. We see we have a Transformer using a BERT model, and we also see the Pooling layer. We have the word embedding dimension, and we have pooling mode CLS token set to True, whereas the rest are False, because we're not using those pooling methods.
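As a sketch, the model initialization described above:

```python
from sentence_transformers import SentenceTransformer, models

bert = models.Transformer("bert-base-uncased")
pooling = models.Pooling(bert.get_word_embedding_dimension(), "cls")  # 768-dim, CLS pooling

model = SentenceTransformer(modules=[bert, pooling])
print(model)  # shows the Transformer and Pooling modules
```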

Now with the model defined, we can define our loss function. We write from sentence_transformers.losses import DenoisingAutoEncoderLoss, and we'll use that to define our loss function. So we just write DenoisingAutoEncoderLoss and pass in our model, so it knows what to actually optimize.

And we also need to make sure that we tie the encoder and decoder weights, so we set tie_encoder_decoder to True. Okay, because we have that encoder-decoder structure, and we're tying those weights together because the performance is better that way. We can also set it to False, but the performance will not be as good.
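And the loss, as a sketch:

```python
from sentence_transformers.losses import DenoisingAutoEncoderLoss

# tie the encoder and decoder weights, as recommended in the TSDAE paper
loss = DenoisingAutoEncoderLoss(model, tie_encoder_decoder=True)
```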

So that is everything; we can go ahead and move on to the training call, model.fit. What we're going to be doing here is write model.fit, and I'll just add a few points. We're going to be using an Adam optimizer with a learning rate of 3e-5, and that learning rate is constant, so we're going to be using a constant learning rate.

If you've watched the last videos, we have tended to use a warm-up before we actually move to that learning rate, so we warm up to it. With this, we're just going to use a constant learning rate all the way through, and there's also no weight decay.

So in model.fit, we need to say what our training objectives are, so we write train_objectives, and in there we pass a list, and in that list we need pairs of data loaders and the loss functions we're going to use to optimize with that data. In our case, we just have one of these, so it's just our loader and our loss.

We're going to train for one epoch, so epochs equals 1. We are going to be using the Adam optimizer; the default optimizer here is AdamW, the Adam with weight decay optimizer from the transformers library, so if we want plain Adam, we just set the weight decay to zero.

So we write weight_decay equals 0. We also need to set the scheduler, which is why I mentioned the learning rate before, so scheduler equals constant learning rate. Okay, so we're not doing any warm-up, we're just going with a constant learning rate all the way through, and then we want to pass our optimizer parameters.

Here we're just passing the learning rate, nothing more, so that is 3e-5, and while training we're probably going to want to see the progress bar, so I'm going to set that to True as well. After we finish training, you're probably going to want to save the model, and you can save it wherever you would like.
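Putting the training call together, a sketch of the fit call with those settings; the save path here is just a placeholder:

```python
model.fit(
    train_objectives=[(loader, loss)],
    epochs=1,
    weight_decay=0,                 # plain Adam rather than AdamW
    scheduler="constantlr",         # constant learning rate, no warm-up
    optimizer_params={"lr": 3e-5},
    show_progress_bar=True,
)

model.save("tsdae-bert-base-uncased")  # choose whatever path you like
```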

So I'm going to name it TSDAE, and I'm going to start running that and then stop it, because it doesn't take long, to be fair. With 100,000 samples, this took about 20 minutes on a reasonably good GPU, an RTX 3090, so it's obviously a decent GPU, but it's nothing like a Tesla or anything like that.

So this is really quick, I was pretty impressed. Okay, so you can see that this is training, so what I'm going to do is pause that, or stop that, and I'm going to switch over to the other notebook where I have that training, and the evaluation I performed afterwards, we'll just run through it really quickly.

Okay, so we're here now, and you can see everything is the same as before; model.save, I've saved it there. Oh, there was one thing I did want to mention. If you do get an error with NLTK, you just need to do this: pip install nltk if you haven't already got it installed, and then run this, import nltk and download the punkt tokenizer. That tokenizer is used in the denoising process, where we're adding noise and removing it, so that's why we need it.
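A minimal sketch of that fix:

```python
import nltk  # pip install nltk

nltk.download("punkt")  # tokenizer used internally during the denoising step
```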

So if you do get that error, just run that. After that, we train the model, and we want to evaluate its performance, because we want to see that it has actually worked. One benchmark that you've probably seen me use a few times already is the Semantic Textual Similarity Benchmark, or STSb, which, again, we can pull with Hugging Face datasets; it's the STSb part of the GLUE dataset, and we're going to take the validation split rather than the training data, though we could use the training split too because we haven't trained on it, just in case.

And that contains sentence one, sentence two, and a label. The label is a score: if you remember, earlier we were talking about STS data that we can train on using cosine similarity loss, and this is the data we would use. So we have that label score, and in this dataset it ranges from 0 to 5; we want to normalize it to the range 0 to 1.

That's what I'm doing here: I'm using the datasets map function with a lambda, and we're just dividing all those scores by 5, mapping them all to 0 to 1. Then we reformat that benchmark data, the STSb data, using the sentence-transformers InputExample class; we need this because we're using the evaluators from the sentence-transformers library later.

So all we do there is loop through, create a list, and append InputExample objects that contain the two sentences and the label, the score. Then we initialize a similarity evaluator; we can see this is called EmbeddingSimilarityEvaluator, which is for this type of data, the STS data, and we just pass in all the samples we have. You can write to CSV if you want, but I'm not doing that because I just want to see the overall score, not all the detail.

And then we just evaluate the model, so you pass the model to your evaluator, and using the model that we trained with TSDAE, we get 0.75, which is Spearman's correlation coefficient. It's basically asking: where our model scores pairs high, does that correlate with where the true labels are high? And 0.75 means yes, there is correlation there, and it's pretty strong; not the strongest, as we'll see in a minute, but pretty strong.
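For reference, here's a sketch of the whole evaluation flow described above; the column names follow the GLUE STSb split, so adjust if yours differ:

```python
from datasets import load_dataset
from sentence_transformers import InputExample
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# STSb validation split, labels rescaled from 0-5 down to 0-1
stsb = load_dataset("glue", "stsb", split="validation")
stsb = stsb.map(lambda x: {"label": x["label"] / 5.0})

samples = [
    InputExample(texts=[row["sentence1"], row["sentence2"]], label=row["label"])
    for row in stsb
]

evaluator = EmbeddingSimilarityEvaluator.from_input_examples(samples, write_csv=False)
print(evaluator(model))  # Spearman correlation on the benchmark
```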

So that's a good score, and we can see that it is working. Now, if we compare that to an untrained model, so what we had before, before we actually fine-tuned it with TSDAE, so we scroll down, I've just reinitialized something here, the same model as before, and evaluated it, scroll down a little bit to find the score, and you see that it's like 0.32, which is obviously way lower, and yeah, there's some correlation there, but it's not great.

So TSDAE is giving us pretty good results, I think; it is really not bad. Now, something we mentioned earlier: yes, TSDAE works, and it's an unsupervised method, but it cannot really be compared to supervised methods in terms of the performance you'll get from your sentence transformer, and I wanted to show you that here.

So the first thing I've done is take the original SBERT, and okay, we get 0.8, or 0.81 if you want to round up. So it's about 7% better than the unsupervised method, which is a fair bit, but it's not massive, so at least our unsupervised model is up there in a good area of performance.

It's not the best, but it does work. And in the paper they did do better than this; I think they got to around 0.78, maybe. So they did better than what I got here anyway, so you can probably do better too. Then I wanted to look at a more advanced model, one of the more recent models at least, like MPNet.

So the MPNet model scored pretty much 0.89. So we get an okay result, I think a pretty good result, with TSDAE, but obviously it can't compare to those supervised models. Now, that's it for this video. I think, at least for me, this unsupervised approach to training is actually one of the coolest things, or at least one of the coolest approaches to training and building models, that I've seen in sentence transformers.

Possibly only really paralleled by multilingual knowledge distillation for training multilingual models. Both of these together, for me, are really incredible. I know this isn't the best performance, but think about all of the low-resource languages out there that don't really have that much data, that just have unstructured text data like this, or the very specific domains where there's loads of text data but no labeled data, and they can't afford to pay someone to go and create all that data, or don't even have the time to.

I think this is a really cool method to actually be able to use. I mean, you don't really need anything. So for that reason, I think this is really cool, and it's definitely one of the most interesting ways of training, in my opinion: just training something without labeled data that actually works pretty well.

So yeah, that's it for this video. I hope it's been useful, and thank you very much for watching, and I'll see you in the next one.