Fine-tune Sentence Transformers the OG Way (with NLI Softmax loss)
Chapters
0:00 Intro
0:42 NLI Fine-tuning
1:44 Softmax Loss Training Overview
5:47 Preprocessing NLI Data
12:48 PyTorch Process
19:48 Using Sentence-Transformers
30:45 Results
35:49 Outro
00:00:04.200 |
can train an SBERT model, or a Sentence Transformer, 00:00:12.240 |
is kind of like the original way of training these models 00:00:15.320 |
or fine-tuning these models, which is using Softmax Loss. 00:00:47.320 |
is part of what we could call the natural language inference 00:00:55.280 |
And within that sort of category of training, 00:01:01.520 |
We have Softmax Loss or Softmax Classification Loss, 00:01:06.600 |
And then we also have something called a Multiple Negatives 00:01:12.040 |
Now, in reality, you probably wouldn't use Softmax Loss, 00:01:17.840 |
because it's just nowhere near as good as using 00:01:22.120 |
the other form of Loss, the Multiple Negatives Ranking. 00:01:42.360 |
I'm going to just kind of go through it very quickly. 00:01:47.520 |
we can either use what is called a Siamese network 00:01:52.880 |
Now, what you can see right now is a Siamese network. 00:01:55.760 |
So we have almost like two copies of the same BERT, 00:02:03.360 |
And the idea is we would have two sentences, sentence A 00:02:07.480 |
and sentence B, and we would feed both of those 00:02:10.680 |
through our BERT model and produce the token embeddings 00:02:18.440 |
so like a mean average pooling, and then from that, 00:02:29.920 |
to try and get those sentence embeddings as close as 00:02:39.720 |
them to be as far away from each other as possible. 00:02:46.560 |
So that's kind of like the start of the model, 00:02:57.040 |
is concatenate those two together, so U plus V here. 00:03:03.480 |
And then we're also going to do this other operation here. 00:03:17.480 |
So we're just getting a positive number, which 00:03:24.080 |
And we create this big vector, which is U, V, and |U - V|, 00:03:40.680 |
into a very simple feed-forward neural network. 00:03:49.520 |
we're going to have the dimensionality of 768. 00:03:55.040 |
So obviously, the dimensionality or input dimension 00:04:05.120 |
And then the output are our three output activations here. 00:04:15.800 |
will remember that we had three labels in our training data. 00:04:22.960 |
So in our NLI training data, we had entailment, neutral, and contradiction. 00:04:34.400 |
So those sentence pairs, and we're trying to classify, 00:04:43.920 |
and then we just optimize using cross-entropy loss, which 00:05:11.640 |
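Putting that overview into a single formula (with $u$ and $v$ as the 768-dimensional pooled sentence embeddings and $W$ the feed-forward layer's weight matrix), the objective is roughly:

$$\hat{y} = \mathrm{softmax}\!\left(W\,[\,u \,\Vert\, v \,\Vert\, |u - v|\,]\right), \qquad \mathcal{L} = \mathrm{CrossEntropy}(\hat{y},\, y)$$

where $\Vert$ denotes concatenation and $y$ is the entailment/neutral/contradiction label.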
we're mainly going to focus on how we form our data to put it 00:05:17.400 |
to how we actually train all of this using the sentence 00:05:26.400 |
But we'll just very quickly run through the code 00:05:29.680 |
in PyTorch, so you can just see how it works. 00:05:36.720 |
you can obviously just take a look at the code 00:05:58.120 |
We're going to have a look at our data anyway now. 00:06:00.560 |
So I'm going to use the HuggingFace data sets library. 00:06:07.280 |
And we're actually using two different data sets. 00:06:10.160 |
We're using the Stanford Natural Language Inference, 00:06:12.520 |
or SNLI data set, and also the Multi-Genre NLI, or MNLI, data set 00:06:31.160 |
And we want the training subset of that, so train. 00:06:52.120 |
So these are columns, or you can call them columns if you want. 00:06:56.280 |
So we have the premise, hypothesis, and label. 00:07:08.920 |
And then we saw labels at the end, so it's just the same. 00:07:14.360 |
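As a rough sketch, assuming the "snli" dataset id on the Hugging Face Hub, that loading step looks like this:

```python
from datasets import load_dataset

# Load the Stanford NLI training split. Each row has a premise, a hypothesis,
# and an integer label: 0 = entailment, 1 = neutral, 2 = contradiction.
snli = load_dataset("snli", split="train")

print(snli)     # shows the three columns and the row count
print(snli[0])  # a single premise/hypothesis/label example
```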
Now, if you want to have a quick look at one of those, 00:07:28.120 |
That's neutral, so the premise and the hypothesis 00:07:30.280 |
could both be true, but they are not necessarily related. 00:07:37.120 |
The person is training his horse for the competition. 00:07:50.040 |
is a contradiction or something else, why did I spawn again? 00:07:59.240 |
So a person on a horse jumps over a broken down airplane. 00:08:05.440 |
So those two things aren't about the same topic, 00:08:09.520 |
And then the other one, we have just, I think, if I do, 00:08:17.360 |
So a person on a horse jumps over a broken down airplane. 00:08:28.440 |
sorry, this here, this premise infers this hypothesis. 00:08:36.120 |
And like I said, we have two of those data sets. 00:08:58.520 |
And again, we want to split to be equal to train. 00:09:06.120 |
see a very similar format, but not exactly the same. 00:09:17.280 |
We need to reformat our MNLI data set a little bit. 00:09:40.240 |
get an error, which is annoying, but it's fine. 00:09:46.080 |
because we call datasets.concatenate_datasets 00:09:55.440 |
on SNLI and MNLI, and we're going to get this error. 00:09:59.240 |
OK, so the schema, so the format of the data set is different. 00:10:04.480 |
Even though they both contain the same columns, 00:10:07.720 |
I think one of them has a slightly different format. 00:10:15.840 |
So they both have slightly different formats. 00:10:20.440 |
So to fix that, we just want to change the schema 00:10:27.120 |
And all we do for that is we're going to change the SNLI data 00:10:32.120 |
set and say snli.cast(features), just cast, maybe, 00:10:53.400 |
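A minimal sketch of that cast-and-concatenate fix, assuming the MNLI split comes from the "glue"/"mnli" config (the extra idx column is dropped so the schemas can line up):

```python
from datasets import load_dataset, concatenate_datasets

snli = load_dataset("snli", split="train")
mnli = load_dataset("glue", "mnli", split="train")
mnli = mnli.remove_columns(["idx"])  # drop the extra index column

# Cast SNLI's features to MNLI's schema so the two datasets match exactly,
# then concatenate them into one training set.
snli = snli.cast(mnli.features)
dataset = concatenate_datasets([snli, mnli])
print(dataset.num_rows)
```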
We can look and see, OK, we now have about 943,000 rows, 00:11:12.320 |
which we have up at the top here, 0, 1, and 2. 00:11:15.840 |
But there's actually some rows that have the label minus 1. 00:11:24.280 |
It's where someone couldn't figure out what to actually 00:11:29.760 |
So what we're going to do is just remove those. 00:11:41.240 |
This lambda function is going to select rows where the label 00:11:49.800 |
So we're going to say false if the label value, so label, 00:12:20.240 |
So removed, I think it's like 700 or so rows. 00:12:24.600 |
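That filter step might look roughly like this:

```python
# Keep only rows with a usable label; -1 marks examples where the
# annotators could not agree, so we drop them.
dataset = dataset.filter(lambda row: row["label"] != -1)
print(dataset.num_rows)
```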
So if we're using the sentence transformers way of training 00:12:32.080 |
the models, this is pretty much all we have to do. 00:12:38.240 |
which is to convert the data into input examples or a list 00:12:44.440 |
of input examples, which we'll move on to in a moment. 00:12:48.880 |
I'm going to quickly just cover the other training 00:13:00.280 |
using that approach, the model was nowhere near as good 00:13:03.680 |
as when I trained it using sentence transformers. 00:13:09.480 |
But if you're interested, this is how we do it. 00:13:19.240 |
And OK, we're going to see it's basically doing the same thing. 00:13:31.360 |
So the difference here, so we're importing mainly 00:13:35.320 |
the BERT tokenizer is what we're focusing on here. 00:13:58.520 |
We're tokenizing both the premise sentences and also 00:14:04.080 |
And we get the input IDs and the attention mask out of those. 00:14:11.840 |
see that this is what we end up with at the end there. 00:14:19.120 |
And then we also have the input IDs and attention 00:14:28.960 |
And then after that, we need to do this as well. 00:14:43.680 |
So we're setting up a data loader using batch size 16. 00:14:50.600 |
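Roughly, that preprocessing looks like the sketch below (assuming bert-base-uncased and a max length of 128; in practice you would tokenize in batches via dataset.map rather than all at once):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Tokenize premises and hypotheses separately; each side gets its own
# input_ids and attention_mask because each goes through BERT on its own.
premise = tokenizer(dataset["premise"], max_length=128, padding="max_length",
                    truncation=True, return_tensors="pt")
hypothesis = tokenizer(dataset["hypothesis"], max_length=128, padding="max_length",
                       truncation=True, return_tensors="pt")
labels = torch.tensor(dataset["label"])

train_set = TensorDataset(premise["input_ids"], premise["attention_mask"],
                          hypothesis["input_ids"], hypothesis["attention_mask"],
                          labels)
loader = DataLoader(train_set, batch_size=16, shuffle=True)
```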
And then if we come down, this is all just examples. 00:14:57.440 |
So I'm actually going to go a little further down. 00:15:08.120 |
So here, I'm defining the-- you remember before in that graph, 00:15:17.160 |
They both went into the BERT, or the Siamese BERT. 00:15:31.840 |
and compressed them into just a single 768 dimensional vector. 00:15:45.000 |
Sentence Transformers, the library, by the way, 00:15:47.400 |
the framework, that's probably a bit confusing. 00:15:51.680 |
But I mean, when I say Sentence Transformers, or using 00:15:54.920 |
Sentence Transformers, I mean the framework or library, 00:16:14.000 |
which is why we're not just taking the average straight. 00:16:42.880 |
And then we pass them to a Feedforward Neural Network. 00:16:47.160 |
is the size of our sentence embeddings multiplied by 3. 00:16:58.040 |
And then we also use a cross-entropy loss function 00:17:02.760 |
between what the Feedforward Neural Network outputs 00:17:20.280 |
rather than using the Sentence Transformers library. 00:17:22.960 |
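A sketch of those pieces in plain PyTorch (768 is BERT-base's hidden size; mean_pool is just a helper name assumed here, not a library function):

```python
import torch
import torch.nn as nn
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-uncased")

def mean_pool(token_embeds, attention_mask):
    # Mean pooling over the token embeddings, ignoring padding positions.
    mask = attention_mask.unsqueeze(-1).expand(token_embeds.size()).float()
    summed = torch.sum(token_embeds * mask, dim=1)
    counts = torch.clamp(mask.sum(dim=1), min=1e-9)
    return summed / counts

# Classification head: (u, v, |u - v|) -> three NLI labels.
ffnn = nn.Linear(768 * 3, 3)
loss_fn = nn.CrossEntropyLoss()
```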
So here, we're getting this get linear schedule with warmup. 00:17:30.360 |
So that's just saying, for the first 10% of our steps, 00:17:37.000 |
So I'm not going to go full-on training at 1e-5. 00:17:45.040 |
Now, in the SBERT paper, they used 2e-5. 00:17:54.960 |
But if you can get it working with 2e-5, 00:18:05.400 |
And then they only train for one epoch as well. 00:18:09.360 |
And then also here, I'm using AdamW, so Adam with weight decay. 00:18:31.600 |
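Roughly, that optimizer and warmup schedule could be set up like this (a sketch; the exact learning rate is whichever of 1e-5 or 2e-5 works for you):

```python
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

epochs = 1
total_steps = len(loader) * epochs

# AdamW over both BERT and the classification head.
optim = AdamW(list(bert.parameters()) + list(ffnn.parameters()), lr=2e-5)

# Linear warmup over the first 10% of steps, then linear decay.
scheduler = get_linear_schedule_with_warmup(
    optim,
    num_warmup_steps=int(0.1 * total_steps),
    num_training_steps=total_steps,
)
```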
And then we're just getting all the data out. 00:18:43.680 |
actually, sorry, so U and V here are actually token embeddings. 00:18:47.600 |
Here, we're converting them into sentence embeddings. 00:18:51.360 |
And then we're getting the |U - V|, the absolute value, 00:19:04.160 |
that we then feed into the feedforward neural network. 00:19:30.760 |
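Putting those pieces together, a single training step would look something like this (using the mean_pool, ffnn, loss_fn, optim, and scheduler names from the sketches above):

```python
for a_ids, a_mask, b_ids, b_mask, label in loader:
    optim.zero_grad()

    # Siamese setup: both sentences go through the same BERT.
    u_tokens = bert(a_ids, attention_mask=a_mask).last_hidden_state
    v_tokens = bert(b_ids, attention_mask=b_mask).last_hidden_state

    # Pool token embeddings into sentence embeddings u and v.
    u = mean_pool(u_tokens, a_mask)
    v = mean_pool(v_tokens, b_mask)

    # Concatenate (u, v, |u - v|) and classify into the three NLI labels.
    x = torch.cat([u, v, torch.abs(u - v)], dim=-1)
    loss = loss_fn(ffnn(x), label)

    loss.backward()
    optim.step()
    scheduler.step()
```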
seeing if I could see what happened if I did two epochs. 00:19:30.760 |
work through the actual training with sentence transformers, 00:19:57.040 |
OK, so I said before we had the list of input examples. 00:20:08.160 |
So we just want to write from sentence transformers, 00:20:14.720 |
And then all I'm going to do here is write from tqdm or-- 00:20:34.960 |
actually, we want to create our training examples first, 00:20:46.280 |
through all of our training data, through our data set, 00:20:51.560 |
which is just sentence A, sentence B, and the label. 00:21:03.200 |
So just adding tqdm in there so we have a progress bar, 00:21:09.880 |
All we need to do is write train samples, append input example. 00:21:39.280 |
So you have to pass your text, which is the input text 00:21:43.800 |
that you're going to process into your model. 00:21:46.080 |
So we go row, premise, and also row hypothesis. 00:21:54.720 |
So they're just our two text features from our data set. 00:22:07.160 |
It's just the feature names from our data set, 00:22:18.960 |
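In code, building that list of InputExamples looks roughly like this:

```python
from tqdm.auto import tqdm
from sentence_transformers import InputExample

train_samples = []
for row in tqdm(dataset):
    # texts holds the sentence pair; label is the integer NLI class.
    train_samples.append(
        InputExample(texts=[row["premise"], row["hypothesis"]],
                     label=row["label"])
    )
```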
And then from there, we need to-- you remember before, 00:22:32.600 |
from the Sentence Transformers library, which are quite good. 00:22:35.560 |
But for this, we're just using a normal PyTorch data loader. 00:22:41.760 |
So we can just write from torch utils data, import data loader. 00:22:48.920 |
Same as the paper, we're using batch size of 16. 00:22:57.880 |
And the data loader or loader is just data loader. 00:23:06.000 |
We pass in those train samples, specify our batch size. 00:23:13.200 |
And if you'd like to shuffle, which in this case we will, 00:23:39.520 |
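And the loader itself is just a plain PyTorch DataLoader:

```python
from torch.utils.data import DataLoader

batch_size = 16  # same batch size as the SBERT paper
loader = DataLoader(train_samples, batch_size=batch_size, shuffle=True)
```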
So we're going to have a transformer module, which 00:23:44.080 |
And then we're also going to have a pooling module, which 00:24:15.760 |
And then here, it's using the Hugging Face models. 00:24:20.400 |
So we can put anything from Hugging Face on here. 00:24:37.640 |
And we want to get the word embedding dimension. 00:24:50.920 |
And then, of course, of our sentence embedding as well. 00:24:55.560 |
And then we also want to set the type of pooling 00:25:19.000 |
This is the mean pooling, and we're going to use that. 00:25:34.560 |
And then we just want to initialize our model. 00:25:48.560 |
You'd write the sentence transformer name in here. 00:26:01.680 |
the model using the modules that we just initialized. 00:26:09.120 |
And then keep details of that model in there. 00:26:18.920 |
So this is our sentence transformer structure. 00:26:34.120 |
And we have the word embedding dimension that we'll expect, 00:26:38.080 |
And then you see here, we have those different pooling modes. 00:26:40.580 |
And we are using pooling mode mean tokens, which is true. 00:26:50.320 |
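A sketch of that module setup with the sentence-transformers building blocks:

```python
from sentence_transformers import SentenceTransformer, models

# Transformer wraps any Hugging Face checkpoint and returns token embeddings;
# Pooling mean-pools them into one fixed-size sentence embedding.
bert = models.Transformer("bert-base-uncased")
pooler = models.Pooling(
    bert.get_word_embedding_dimension(),   # 768 for BERT base
    pooling_mode_mean_tokens=True,
)

model = SentenceTransformer(modules=[bert, pooler])
print(model)  # shows the Transformer + Pooling structure described above
```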
to initialize our loss function, which is pretty straightforward 00:27:11.040 |
So what we want to do is write loss equals losses. 00:27:20.360 |
And then in here, so you think, OK, our loss function, 00:27:27.120 |
So it can get the model parameters from that. 00:27:35.720 |
And then it also needs the embedding dimension. 00:27:41.360 |
So it needs on this sentence embedding dimension. 00:28:02.800 |
OK, how many labels are we going to have in our model? 00:28:10.960 |
So I'm sure you can get that dynamically from the data set 00:28:25.200 |
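The loss initialization is then just (with the three NLI labels hard-coded here):

```python
from sentence_transformers import losses

loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,  # entailment, neutral, contradiction
)
```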
So I think we should be OK to start training. 00:28:30.600 |
So I'm going to say we go for, OK, one epoch. 00:28:34.640 |
We want to say how many warm-up steps do we want. 00:28:37.400 |
So again, it's the 10% warming up that we use. 00:28:43.120 |
So we just want 0.1 multiplied by the length of our data loader. 00:28:59.840 |
So I'm just-- it's quite rough, rounding very roughly there. 00:29:06.960 |
And then we want to just start training our model. 00:29:20.920 |
contains a single tuple, which is our loader. 00:29:28.560 |
So I think with this, you can, if you have multiple train 00:29:31.920 |
objects, you can put another loader, another loss, 00:29:46.120 |
of warm-up steps, which is just warm-up steps again. 00:30:07.120 |
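A sketch of that fit call (the output directory name here is just an assumed placeholder):

```python
epochs = 1
# Warm up over roughly the first 10% of training steps (i.e. batches).
warmup_steps = int(0.1 * len(loader) * epochs)

model.fit(
    train_objectives=[(loader, loss)],  # add more (loader, loss) pairs for multi-task setups
    epochs=epochs,
    warmup_steps=warmup_steps,
    output_path="sbert_snli_mnli_softmax",  # assumed save directory name
    show_progress_bar=True,
)
```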
I think testB is what I've called it later on. 00:30:24.440 |
loads of lines that it's printing to a new line 00:30:32.120 |
So I wouldn't do that because that's obviously 00:30:53.520 |
So this is a notebook, pretty much just covered. 00:31:00.080 |
And then here, so we have the training, the training time 00:31:05.800 |
as well, something I didn't mention just now, 00:31:09.280 |
is one hour and 15 minutes for me on an RTX 3090. 00:31:22.840 |
So I define these sentences just below random sentences, 00:31:33.040 |
So see this one, one thinks she saw her raw fish and rice 00:31:41.120 |
And this one, seeing her sushi move, weaving with spaghetti, 00:31:50.080 |
and dental specialist with construction materials, 00:32:00.320 |
but they don't share any of the same descriptive words. 00:32:04.720 |
But they kind of mean the same thing, roundabouts. 00:32:14.280 |
So with our model, so we have loaded the model here, 00:32:24.640 |
So if you've saved the model, which it does automatically 00:32:28.480 |
here, you just take this, you take that, you come down here, 00:32:41.040 |
And then you would put that in the model variable. 00:32:54.040 |
it to encode those sentences, which is just in the list, 00:32:59.720 |
And then from there, I'm getting the cosine similarity, 00:33:29.600 |
So this 7 and 5, 9 and 1, and I think 4 and 3 00:33:39.680 |
And they are, in fact, the highest three-rated scores. 00:33:51.800 |
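A sketch of that evaluation step (assuming sentences is the list of test sentences defined above, and that the model was saved to the same directory as in the fit sketch):

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("sbert_snli_mnli_softmax")  # load the fine-tuned model

embeddings = model.encode(sentences)      # one vector per sentence
scores = cosine_similarity(embeddings)    # pairwise similarity matrix

# The largest off-diagonal values should line up with the pairs that
# mean roughly the same thing.
print(np.round(scores, 2))
```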
is not the best way of training your model anymore, 00:34:00.520 |
So let me show you some of the charts from MNR loss, 00:34:21.560 |
Like, all the values are very near the same value, which 00:34:24.320 |
makes it hard to differentiate between similar and not 00:34:29.800 |
But obviously, BERT hasn't been trained for this, 00:34:41.160 |
compared to the Sentence Transformer's trained model, 00:34:48.840 |
This is the actual-- the one that they trained, 00:35:02.360 |
so this is an MNR model that I have trained using 00:35:25.760 |
And that's really the difference between models. 00:35:34.520 |
those similar and dissimilar pairs very well. 00:35:40.720 |
So that is-- I mean, that's my MNR model as well. 00:36:00.560 |
going to have a look at how we can use MNR loss 00:36:03.040 |
or multiple negative ranking loss to build a model, which 00:36:07.800 |
I think, personally, is a lot more interesting. 00:36:11.400 |
Sentence-- sorry, Softmax loss is pretty interesting. 00:36:27.400 |
It isn't very intuitive when you think about it. 00:36:29.640 |
It's kind of hard to understand why it works. 00:36:33.680 |
Because we have that weird concatenation at the end. 00:36:38.360 |
MNR loss is much more intuitive, and it makes a lot more sense. 00:36:45.200 |
So we're going to cover that in the next video. 00:36:48.920 |
So I think that should be pretty interesting.