Today Unsupervised Sentence Transformers, Tomorrow Skynet (how TSDAE works)
Chapters
0:00 Why Language Embedding Matters
5:12 Supervised Methods
5:29 Natural Language Inference
7:15 Semantic Textual Similarity
7:43 Multilingual Training
10:00 TSDAE (Unsupervised)
18:50 Data Preparation
29:05 Initialize Model
32:39 Model Training
36:25 NLTK Error
37:15 Evaluation
41:01 TSDAE vs Supervised Methods
42:42 Why TSDAE is Cool
In this video we're going to look at how we can train sentence transformers without needing any labeled data. If you're new to sentence transformers, sentence embeddings or sentence vectors: a sentence vector, as we'll call it, is simply a numerical representation of a sentence or paragraph. Language is a very human-centric concept; it isn't built for computers, so computers really struggle to grasp the meaning and concepts that we as humans find so easy to communicate through language.

Modern-ish computers appeared during and around World War Two, and the first application of NLP came soon after, in 1954, with the Georgetown machine translation experiment. In that first decade of research, those involved were pretty optimistic that machine translation would be solved within a few short years. Obviously they were a little too optimistic: machine translation still isn't solved, and the same goes for pretty much everything else in NLP. In the past decade especially, though, there have been a lot of breakthroughs. The field of NLP has progressed at an incredible rate, and we now have an incredible ecosystem of language models and techniques that we can use for a lot of different use cases. A lot of that recent success is thanks, in part, to dense vector representations of language.
Dense vectors are numerical vectors that a machine can understand, but they are built in such a way that they encode the semantics, the meaning, behind whatever they represent, whether that's tokens, words, sentences or paragraphs. With dense vectors, computers can comprehend, to an extent, the semantic meaning behind language. To build them, given a lot of data and a lot of compute, we tend to use transformer models; in NLP, transformers are the de facto standard, and for building representations of sentences or paragraphs there is a subcategory of transformers called sentence transformers. The training process begins with pre-training, which produces a generic transformer model, and we then fine-tune it, that is, train it further using special methods, to build a sentence transformer that can produce very information-rich and accurate sentence vectors.
Whereas pre-training tends to use unsupervised methods, fine-tuning tends to be supervised, which means we need a lot of labeled data, and for some domains and languages there simply is not enough labeled data out there to build a sentence transformer. That leaves two options: spend a long time gathering and labeling data until you have tens of thousands of labeled samples, or try fine-tuning the model with unsupervised training. I'll tell you straight away that unsupervised training is not going to give you the performance of a supervised approach; however, if you don't have the labeled data for supervised training, unsupervised training is your best bet, and it still works pretty well.
That's what we're going to cover in this video: how to train, or fine-tune, a sentence transformer using an unsupervised method called the Transformer-based Sequential Denoising AutoEncoder, or TSDAE. We'll jump straight into it and look at where we might want to use this training approach and how we can actually implement it.
The first question we need to ask is: do we really need to resort to unsupervised training? To answer that, let's look at a few of the most popular training approaches and the sort of data each one needs.

The first is natural language inference, or NLI. NLI requires pairs of sentences labeled as either contradictory, neutral (meaning they're not necessarily related), or entailing, i.e. inferring each other. So you have pairs that entail each other and are therefore very similar, pairs that are neutral, and pairs that contradict each other; that is the traditional NLI data. Using another version of NLI fine-tuning, with multiple negatives ranking loss, you can get by with only entailment pairs, that is, positive pairs of related sentences. It can also use contradictory pairs to improve training, but it doesn't need them. So if you have positive pairs of related sentences, you can go ahead and try fine-tuning with NLI and multiple negatives ranking loss.
If you don't have that, fine: another option is semantic textual similarity (STS) data. Here you have sentence A, sentence B, and a score from 0 to 1 that tells you how similar those two sentences are, and you would train on this with something like cosine similarity loss.
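To make those two data formats concrete, here's a minimal sketch using the sentence-transformers InputExample class; the sentences and the 0.9 score are invented purely for illustration.

```python
from sentence_transformers import InputExample

# Entailment-only ("positive") pairs: enough for multiple negatives ranking loss.
nli_samples = [
    InputExample(texts=[
        "A man is playing a guitar on stage.",
        "Someone is playing a musical instrument.",
    ]),
]

# STS pairs: two sentences plus a similarity score (here already scaled to 0-1),
# trained with something like cosine similarity loss.
sts_samples = [
    InputExample(texts=[
        "A man is playing a guitar on stage.",
        "A man plays the guitar.",
    ], label=0.9),
]
```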
If that's not an option either, and your use case is building a sentence transformer for another language where no sentence transformer currently exists, you can use multilingual parallel data. Parallel data just means translation pairs: for example, an English sentence paired with a sentence in some other language, XX, where XX is your target language. With that, you can fine-tune a model using multilingual knowledge distillation, which takes a monolingual model, for example in English, and, using those translation pairs, distills the knowledge, the semantic similarity knowledge, from the monolingual English model into a multilingual model that can handle both English and your target language.

So those are three quite popular, very common options you can go for, and because they are supervised methods, the chances are they will outperform anything you do with unsupervised training, at least for now.
If none of those sound like data you can realistically get, or they just don't match your use case, then we have to move on to unsupervised training. In short, you want to go unsupervised if you have little to no labeled data in your unique domain, or you are working with a low-resource language for which you also have no translation data from a source language like English (other source languages work too, as long as a monolingual model exists for that source language) into your target language.
If we can't do any of that, we move on to unsupervised learning, and one of the best approaches at the moment is the Transformer-based Sequential Denoising AutoEncoder, TSDAE. There are other approaches as well, which we won't cover here, and some of the methods currently being researched do look quite promising, but for now TSDAE is probably your best bet.

TSDAE works like this. You take a sentence and corrupt it, removing parts of it or otherwise modifying it, so that it ends up slightly different, but not so different that it stops being similar. You feed that corrupted input into your encoder model, the transformer, which outputs a set of token vectors, and you use some pooling method to convert those into a single sentence vector. That sentence vector is then processed by another model, a decoder, and the decoder's job is to reproduce the original, uncorrupted sentence. The weights of both the encoder and the decoder are optimized so that the decoder can actually do that. And that is TSDAE. It isn't particularly complex when you look at it at a high level, and it's a genuinely clever way of building a sentence transformer without any labeled data.
Now let's look at a little graphic comparing TSDAE to MLM. MLM is masked language modeling, a pre-training approach that many of you will be familiar with, which is why I wanted to include the comparison. With MLM, at the bottom of the graphic, we take some input text and mask one of its tokens. We pass it through an encoder, which outputs token vectors, roughly one for every word or subword. All of those pass into a decoder, and the decoder attempts to predict which word or token sits behind the masked position, so it is optimizing for that mask to become 'elephant'. That is one of the ways you can pre-train a transformer.

TSDAE differs in a couple of important ways. First, we are not masking the input: the TSDAE paper tested both, and it was found that the better approach is to simply delete the token, so where the mask would have been, the token is just gone. Second, we still have an encoder, but it is followed by a pooling step that converts the token vectors into a single sentence vector, and it is that sentence vector that gets passed on to the decoder. If you compare the two, the decoder in masked language modeling receives a lot more information: it gets token-level information, whereas the decoder in TSDAE works with far less, sentence-level information rather than token-level information. It is still optimizing for the same thing, to predict the original text, 'elephant' included, rather than the missing or corrupted text that was fed into the encoder. That is the main difference between the two.
In the TSDAE paper from Wang, Reimers, and Gurevych, the authors tested different choices for this fine-tuning setup. The first is the noise type: when we take the original text and corrupt it, what is the best way to corrupt it? They found that deleting tokens produced the best results. There are other options, you can swap one token for another, mask tokens as we saw with masked language modeling, replace tokens, add new tokens, and so on, but by far the best was simply deleting them. There is also the question of how many tokens to delete: you go through each token and assign a probability of that token being deleted, which is the noise ratio, and the best value there was 0.6. So each token has a 60% chance of being deleted, which means you are removing quite a lot of the data.
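Just to build intuition for that deletion noise, here is a rough, simplified sketch. It splits on whitespace, whereas the library's own implementation, which we'll rely on later, uses a proper word tokenizer, so treat this as an illustration rather than the exact method.

```python
import random

def delete_noise(text: str, del_ratio: float = 0.6) -> str:
    """Illustrative deletion noise: drop each token with probability del_ratio."""
    words = text.split()
    if not words:
        return text
    kept = [w for w in words if random.random() > del_ratio]
    if not kept:                      # make sure at least one token survives
        kept = [random.choice(words)]
    return " ".join(kept)

print(delete_noise("TSDAE trains sentence embeddings without any labels"))
# e.g. -> "TSDAE embeddings any"
```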
Then there is the pooling method, the little circle after the encoder in the diagram. The highest-scoring approach there was mean pooling, but mean pooling involves the extra step of taking the average across all of the word or token vectors, whereas with CLS pooling you don't do that: you just take the CLS token, the classifier token from BERT that sits at the front of the sequence. So the configuration they stuck with is deletion only when corrupting the data, with a deletion ratio of 0.6, in other words a 60% probability per token, and CLS pooling at the end. You can use mean pooling as well, or even max pooling, it's up to you, but later on we're going to stick with CLS to follow along with the paper.
Here's a quick explanation of CLS and mean pooling in case it's new to you. Our encoder outputs a whole set of token vectors. With CLS pooling, we just take that single CLS vector and use it as our sentence vector; we're not really computing anything, just extracting that one vector. With mean pooling, we instead take an average over all of the output token vectors to create our sentence vector.
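As a minimal sketch, assuming BERT-style encoder outputs of shape (batch, sequence length, hidden size), the two pooling modes look roughly like this; in practice the sentence-transformers Pooling module handles this for us.

```python
import torch

def cls_pool(token_embeddings: torch.Tensor) -> torch.Tensor:
    # token_embeddings: (batch, seq_len, hidden); just take the first ([CLS]) position
    return token_embeddings[:, 0]

def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # average over real tokens only, ignoring padding positions
    mask = attention_mask.unsqueeze(-1).float()     # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)        # avoid division by zero
    return summed / counts

# toy example: batch of 2 sequences, 4 tokens each, hidden size 8
emb = torch.randn(2, 4, 8)
mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])
print(cls_pool(emb).shape, mean_pool(emb, mask).shape)  # both torch.Size([2, 8])
```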
So that's it for the visual explanation; let's jump into how we actually build, or fine-tune, a model using this approach.

First we load a dataset. We're using Hugging Face datasets here; if you haven't used it before, you'll want to pip install datasets. We're getting the OSCAR corpus, which is a massive multilingual corpus with a lot of different languages in it; I'm sure not every single language, but if you can think of one, it's probably in there. We take the English portion, just so I can actually read and understand things, and the train split. And this part is important: the English portion of OSCAR is quite massive, around 1.8 terabytes of data. If you don't set streaming=True, all of that will download to your computer, and a lot of us probably don't even have that much space available on our machines, so it won't work. With streaming=True, each sample is downloaded and pulled through only as we request it, one at a time rather than the full thing, which is obviously a lot more efficient and isn't going to break our computer. Now, because we're streaming, we have to iterate through the dataset, so to show you part of it I'm going to write for row in oscar, print the row, and break, so we just print the first item.
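Here is that loading and peeking step as a minimal sketch; the OSCAR config name below is my assumption of how the English subset is named on the datasets hub, so adjust it if yours differs.

```python
from datasets import load_dataset  # pip install datasets

# stream the English portion of OSCAR so nothing is downloaded up front
oscar = load_dataset(
    "oscar", "unshuffled_deduplicated_en",
    split="train", streaming=True,
)

# peek at the first record only
for row in oscar:
    print(row)  # {'id': ..., 'text': '...'}
    break
```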
In that first record we have two features: id, which is just an index value, and text. The text is pretty long, with multiple paragraphs and multiple sentences in it, and when we're training with TSDAE we only want small-ish chunks, one sentence per sample. So we need to split that text up into individual sentences.
To do that, I'm going to import re, because I want to split on full stops and on newline characters, and strip the surrounding spaces at the same time; essentially, I want to split on anything that indicates the start of a new sentence or paragraph. So I create a regex with re.compile that matches a full stop, optionally followed by a space, optionally followed by a newline character. It will always match the full stop, and it will also consume a trailing space and a newline if they're there, which captures everything we need. If we run splitter.split on the row's text, we now get all these nice individual sentences rather than one massive paragraph. You'll see some of them are not very long, and we'll remove those later on, because they aren't really sentences.
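Here's roughly what that splitter looks like; the exact pattern is my reconstruction of the regex described above.

```python
import re

# a full stop, optionally followed by a space, optionally followed by a newline
splitter = re.compile(r"\.\s?\n?")

print(splitter.split(row["text"])[:5])  # first few sentences of the record above
```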
So let's do that. I'm going to create a counter for the number of sentences we manage to capture. In the TSDAE paper they found that roughly 10K sentences is pretty much all you need, and you can go up to around 100K if you want, so we'll go up to 100K here. We probably don't strictly need to; 10K, maybe even fewer, would likely do, and English is probably a reasonably easy case, so you could go lower still and get decent results. That's one nice thing about TSDAE: you need very little data, which is pretty cool, especially when it isn't labeled. So I create a sentences list and iterate through OSCAR: for each row, split the text into new_sentences using the splitter, then keep only the lines whose character length is above some minimum. I was going to use 10, but looking at the short fragments, 20 seems safer, so let's go with 20 for now and see how it goes. Those are the new sentences from a single sample, so we extend our sentences list with them, add their count to the running total, and break once that total exceeds 100K, because OSCAR is a massive dataset and we don't need more than that.
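Put together, that loop looks something like this, with the length threshold of 20 and the 100K cap discussed above.

```python
num_sentences = 0
sentences = []

for row in oscar:
    new_sentences = splitter.split(row["text"])
    # keep only strings long enough to plausibly be full sentences
    new_sentences = [line for line in new_sentences if len(line) > 20]
    sentences.extend(new_sentences)
    num_sentences += len(new_sentences)
    if num_sentences > 100_000:  # ~10K-100K is plenty according to the TSDAE paper
        break

print(len(sentences))
```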
Running that is quite quick and easy. The next step, as we usually would with PyTorch, is to put this data into a dataset object and then load that dataset into a data loader. Because we're doing this thing where we corrupt our data by adding noise to it, we either need to do that manually when building our dataset object, or we can just use the DenoisingAutoEncoderDataset from the sentence-transformers library. So we import DenoisingAutoEncoderDataset from sentence_transformers.datasets, and, as we usually would in PyTorch, DataLoader from torch.utils.data. We create the dataset by passing in our sentences, and that's all it needs. From there we create the loader, passing in the dataset, a batch size of 8, shuffle set to True, and drop_last set to True, because the final batch will most likely not be the same size as the rest of our batches, and it's easier to just drop it.
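As a sketch, the dataset and loader setup described above looks like this.

```python
from torch.utils.data import DataLoader
from sentence_transformers.datasets import DenoisingAutoEncoderDataset

# wraps each sentence and applies the deletion noise on the fly,
# pairing a corrupted copy of the sentence with the original
dataset = DenoisingAutoEncoderDataset(sentences)

loader = DataLoader(
    dataset,
    batch_size=8,
    shuffle=True,
    drop_last=True,  # drop the final, smaller batch
)
```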
We now have that: our data is prepared for fine-tuning, and we can move on to the final preparation before we actually fine-tune the model, which is setting up the loss function and the training call itself. Before we even get to the loss function, we need to define the model we're going to train. We'll use bert-base-uncased from the Hugging Face models repository, initialized through sentence-transformers, so we import SentenceTransformer and models from sentence_transformers. The BERT module is models.Transformer, and we just pass it the name 'bert-base-uncased'; this downloads the model directly from the Hugging Face model hub, just as we would with the transformers library. We also need a pooling layer, models.Pooling, and as we saw earlier we want to pool using the CLS, or classifier, token. Before we can create it, though, we need to tell the pooling layer what dimensionality to expect from the token vectors, which we get with bert.get_word_embedding_dimension(); for BERT base that is 768. At this point we have two separate modules, and we need to combine them into a single sentence transformer model: this is where the SentenceTransformer class comes in, with modules set to the BERT module followed by the pooling layer. If we print the model, we see the Transformer wrapping the BERT model and the Pooling layer with a word embedding dimension of 768 and pooling_mode_cls_token set to True, while the other pooling modes are False, because we're not using them.
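That model setup, roughly:

```python
from sentence_transformers import SentenceTransformer, models

# BERT backbone, downloaded from the Hugging Face model hub
bert = models.Transformer("bert-base-uncased")

# CLS pooling, sized to the backbone's output dimension (768 for BERT base)
pooling = models.Pooling(
    bert.get_word_embedding_dimension(), pooling_mode="cls"
)

model = SentenceTransformer(modules=[bert, pooling])
print(model)
```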
With the model defined, we can define our loss function. We import DenoisingAutoEncoderLoss from sentence_transformers.losses and instantiate it with our model, so it knows what to actually optimize. We also need to make sure the encoder and decoder weights are tied, so we set that to True; you can set it to False, but the performance will not be as good.
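And the loss, as described:

```python
from sentence_transformers.losses import DenoisingAutoEncoderLoss

# tie the encoder and decoder weights, as recommended for TSDAE
loss = DenoisingAutoEncoderLoss(model, tie_encoder_decoder=True)
```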
That is everything, so we can go ahead and move on to the training call, model.fit. A few points on the setup first: we're going to use an Adam optimizer with a learning rate of 3e-5, and that learning rate is constant. If you've watched the previous videos, we have tended to warm up to the target learning rate; here we just use a constant learning rate all the way through, and there's also no weight decay. In model.fit we first need to give the training objectives, train_objectives, which is a list of pairs of data loaders and the loss functions used to optimize on that data; in our case there is just one pair, our loader and our loss. We train for one epoch, so epochs=1. The default optimizer here is AdamW, the Adam-with-weight-decay optimizer from the Transformers library, so to get plain Adam behaviour we set weight_decay to zero. We set the scheduler to a constant learning rate, which is why I mentioned it before: no warm-up, just the same rate throughout. Then we pass our optimizer parameters, which here is just the learning rate of 3e-5, nothing more, and since we'll probably want to watch the progress bar while training, we set show_progress_bar to True as well. After training finishes, you'll probably want to save the model, and you can save it wherever you'd like; I'm going to save mine as 'tsdae'.
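The training call, roughly as described; the save path is just an example, use whatever you like.

```python
model.fit(
    train_objectives=[(loader, loss)],
    epochs=1,
    weight_decay=0,                # default optimizer is AdamW; zero decay gives plain Adam behaviour
    scheduler="constantlr",        # constant learning rate, no warm-up
    optimizer_params={"lr": 3e-5},
    show_progress_bar=True,
)

model.save("tsdae")
```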
I'm going to start that running and then stop it, because, to be fair, it doesn't take long anyway: with 100,000 samples this took about 20 minutes on a reasonably good GPU, an RTX 3090, so obviously a decent card, but nothing like a Tesla or anything like that. So it's really quick, and I was pretty impressed. You can see that it's training, so I'll stop it here and switch over to the other notebook where I have the completed training and the evaluation I performed afterwards, and we'll run through it really quickly. Everything there is the same as before.
I've saved the model there, and there was one thing I did want to mention: if you get an error from NLTK, you just need to pip install nltk if you haven't already got it installed, then import nltk and download the punkt tokenizer. That tokenizer is used in the denoising process, where we're adding and removing noise, which is why it's needed. So if you do get that error, just run the following.
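That fix looks like this:

```python
import nltk  # pip install nltk if it isn't installed already

# the noise/denoising step tokenizes text with NLTK's punkt models
nltk.download("punkt")
```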
After that, with the model trained, we want to evaluate its performance, because we want to see that it has actually worked. One benchmark you've probably seen me use a few times already is the Semantic Textual Similarity Benchmark, or STSb, which again we can pull with Hugging Face datasets: it's the stsb subset of the GLUE dataset, and we take the validation split. We're not taking the training data, though we could, since we haven't trained on it, but just in case. Each record contains sentence one, sentence two, and a label, which is a score; if you remember, earlier we talked about STS data that we could train on using cosine similarity loss, and this is exactly that kind of data. In this dataset the score ranges from 0 to 5, and we want to normalize it to the range 0 to 1, so I use the datasets map function with a lambda that divides every label by 5. Then we reformat that benchmark data, the STSb data, using the sentence-transformers InputExample class; we need this because we're using the evaluators from the sentence-transformers library later on. All we do is loop through, building a list of InputExample objects, each containing the two sentences and the label, i.e. the score. Then we initialize a similarity evaluator, the EmbeddingSimilarityEvaluator, which is made for this type of STS data, and pass in all the samples we have. You can have it write to CSV if you want, but I'm not doing that because I just want the overall score, not all the detail. Finally, we evaluate the model by simply passing it to the evaluator.
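The whole evaluation, roughly; depending on your sentence-transformers version, the evaluator may return a single score or a small dictionary of scores.

```python
from datasets import load_dataset
from sentence_transformers import InputExample
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

sts = load_dataset("glue", "stsb", split="validation")
sts = sts.map(lambda x: {"label": x["label"] / 5.0})  # rescale 0-5 labels to 0-1

samples = [
    InputExample(texts=[row["sentence1"], row["sentence2"]], label=row["label"])
    for row in sts
]

evaluator = EmbeddingSimilarityEvaluator.from_input_examples(samples, write_csv=False)
print(evaluator(model))  # Spearman correlation of cosine similarities vs. the labels
```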
Using the model we trained with TSDAE, we get 0.75. That's the Spearman correlation coefficient, which is basically asking: where our model scores pairs highly, does that correlate with where the true labels are high? A value of 0.75 says yes, there is correlation there, and it's pretty strong; not the strongest, as we'll see in a minute, but pretty strong. So that's a good score, and we can see that it is working. Now, if we compare that to an untrained model, so what we had before we actually fine-tuned it with TSDAE, I've reinitialized the same model as before further down and evaluated it, and the score is about 0.32, which is obviously way lower; there's some correlation there, but it's not great. So TSDAE is giving us pretty good results, I think; it's really not bad.
Something we mentioned earlier is that, yes, TSDAE works, but as an unsupervised method it can't really be compared to supervised methods in terms of the performance you'll get from your sentence transformer, and I wanted to show that here. The first thing I've done is take the original SBERT model, and it gets 0.80, or 0.81 if you want to round up. So it's about 7% better than the unsupervised method, which is a fair bit, but not massive, so at least our unsupervised model is up in a good area of performance; it's not the best, but it does work. And in the paper they did do better than this, I think around 0.78, so they beat what I got here anyway, and you can probably do better too. Then I wanted to look at a more advanced model, one of the more recent ones at least, an MPNet-based model, and that scored pretty much 0.89. So we get an okay, I think pretty good, result with TSDAE, but obviously it can't quite compare to those supervised models.
Now that's it for this video. For me at least, this unsupervised approach is one of the coolest approaches to training and building models that I've seen in sentence transformers, possibly only paralleled by multilingual knowledge distillation for training multilingual models; the two of them together, I think, are really incredible. I know this isn't the best performance, but when you think about all of the low-resource languages out there that don't really have much data, just unstructured text like this, or the very specific domains that have loads of text data but no labeled data, and can't afford to pay someone to create it all, or don't have the time to, I think this is a really cool method to actually be able to use. You don't really need anything. So for that reason, I think this is really cool, and it's definitely one of the most interesting ways of training, in my opinion: training something without labeled data that actually works pretty well. So yeah, that's it for this video. I hope it's been useful, and thank you very much for watching, and I'll see you in the next one.