Making The Most of Data: Augmented SBERT
Chapters
0:00
7:01 Language
7:28 Data Augmentation Techniques
9:28 Contextual Word Embeddings
12:14 Cross Encoder
15:16 Data Augmentation
28:23 Create that Unlabeled Data Set
34:52 Remove any Duplicates
35:53 Predicting the Labels of Our Cross Encoder
45:11 Pooling Layer
00:00:00.000 |
In this video, we're going to have a look at how we can make the most of limited language data using data augmentation strategies and training approaches. More specifically, we're going to focus on something called Augmented SBERT. 00:00:18.000 |
You may or may not be aware that the past decade has been a sort of renaissance, or explosion, in the field of machine learning. A lot of the early progress, with things like the perceptron and recurrent neural networks, was researched and discovered decades ago, but we didn't see it really applied in industry. There are two main reasons for this. The first is that we didn't have enough compute power back in the 50s, 60s and 70s to train the models we needed to train, and we also didn't have the data to actually train them on. 00:01:16.120 |
Compute is not really a problem anymore; we can see that if we look at this graph. It depends on what model you're training, of course. If you're OpenAI and you're training GPT-4 or 5 or whatever, then yes, maybe compute power is still pretty relevant, but for most of us we can get access to cloud machines rather than just personal machines, and we can wait a few hours or a couple of days for training to finish. 00:01:53.520 |
Now, that obviously wasn't always the case until very recently. Back in the 1950s and 60s, you can see on this graph the IBM 704, and on the y-axis we have floating-point operations per second. That's a logarithmic scale, so on a linear scale it basically looks like a flat line until a few years ago, when it shoots up. It's pretty impressive how much progress has been made in terms of compute. 00:02:32.480 |
Like I said, that's not really an issue for us anymore; in most cases we have the compute to do what we need to do. And data is not as much of a problem anymore either, but we'll talk about that in a moment. 00:02:52.200 |
There's been a big increase in data as well, not quite as big as the increase in computing power, and this graph doesn't go quite as far back, but I believe it started at around 2 zettabytes and has grown enormously since. It's a fairly big increase, not quite as much as compute power over time, but still pretty massive. 00:03:26.160 |
Now, yes, there's a lot of data out there, but is there that much data out there for what we actually need to train our models to do? In a lot of cases yes, there is, but it really depends on what you're doing, particularly if you're working in more niche domains. What I have here on the left are a couple of niche domains: sentence pairs for climate evidence and claims, for example, where you have a piece of evidence and a claim and a label for whether the evidence supports the claim or not. There is a small dataset for this called Climate FEVER, but it's not big. 00:04:13.520 |
For agriculture, I assume there's not that much data within that industry, although I'm not fully sure; I just assume there's probably not that much. And then there's niche finance, which I do at least have a bit more experience with. Finance is a big industry and there's a lot of finance data out there, but there are also a lot of niche little projects and problems in finance where you find much less data. 00:04:50.200 |
So yes, we have a lot more data nowadays, but we don't have enough for a lot of use cases. On the right here we have a couple of examples of low-resource datasets, for example for languages like the one spoken in the Maldives. 00:05:09.280 |
In these cases we need to find a different approach. One option, depending on your use case, is unsupervised learning with TSDAE, which we covered in a previous video and article. That works well when you're trying to build a model that recognizes generic semantic similarity. But with the climate claims data, for example, we are not trying to match sentence A and sentence B based on their semantic similarity; we're trying to match sentence A, which is a claim, to sentence B, which is evidence, based on whether that evidence supports the claim or not. For that, an unsupervised approach like TSDAE doesn't really work. 00:06:06.720 |
So we need alternative training approaches, and what we basically need to do is data augmentation. That is difficult, particularly for language. Data augmentation is not specific to NLP; it's used across ML, and it's more established in the field of computer vision. That makes sense, because in computer vision you can take an image and modify it in a few ways, and a person can still look at that image and think, okay, that is the same image, it's just rotated a little bit, or the colour grading has changed, or something along those lines. It's been modified slightly, but it's still in essence the same image. For language it's a bit more difficult, because language is very nuanced: if you modify words at random, the chances are you're going to produce something that doesn't make any sense. When we're augmenting our data, we don't want to just throw rubbish into our model; we want data that still makes sense. 00:07:31.320 |
We'll have a look at a couple of the simpler augmentation techniques now. There is a library called NLPAug which I think is very good for this sort of thing; it's essentially a library that allows us to do data augmentation for NLP. What you can see here are two methods using word2vec vectors and similarity. We're taking this original sentence, "the quick brown fox jumps over the lazy dog", and using word2vec we're trying to find which words could go in here, i.e. which words are most similar to the surrounding words. 00:08:23.280 |
For insertion we get something like "Alessiari", which I don't know; it seems like a name to me, and I don't think it really fits, so it's not great, it's not perfect. "Lazy superintendent's dog" does kind of make sense; it feels like a stereotype, I'm sure it's been in The Simpsons or something before. So okay, fair enough, I can see how that got in there, but again it's a bit weird, it's not great. For substitution, rather than the quick brown fox we have the easy brown fox, and rather than jumping over the lazy dog it jumps around the lazy dog, which changes the meaning slightly, but we still have a sentence that kind of makes sense. 00:09:26.000 |
You don't have to use word2vec; you can also use contextual word embeddings, like with BERT, and for me these results look better. For insertion we get "even the quick brown fox usually jumps over the lazy dog", so we're adding words that make sense, which I think is good. For substitution we're changing one word here, so we get "little quick brown fox" instead of just "quick brown fox". I think that makes sense, and this is a good way of augmenting your data, getting more data from less. 00:10:10.360 |
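As a rough sketch of what that looks like in code (assuming the NLPAug library with a BERT model from Hugging Face; the augmenter settings here are illustrative, not necessarily the exact ones shown in the video):

```python
# pip install nlpaug transformers torch
import nlpaug.augmenter.word as naw

text = "The quick brown fox jumps over the lazy dog"

# contextual word embeddings (BERT) - insert new words that fit the context
insert_aug = naw.ContextualWordEmbsAug(
    model_path="bert-base-uncased", action="insert"
)
print(insert_aug.augment(text))

# contextual word embeddings (BERT) - substitute existing words
subs_aug = naw.ContextualWordEmbsAug(
    model_path="bert-base-uncased", action="substitute"
)
print(subs_aug.augment(text))
```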
Another approach, when we're working with sentence pairs, is to take all of the data we have, say in a dataframe with a sentence A column and a sentence B column. Each sentence A is already matched up to one sentence B, and what we can do is say, okay, I want to randomly sample some other sentence Bs to pair with that sentence A, giving us three more pairs. If we did this for every row (so not really random sampling, just taking all of the possible pairs), we would end up with nine pairs in total, which is much better, and if you extend that a little further you can see quite quickly how you can take a small dataset and very quickly create a big dataset with it. Now, this is just one part of the problem, though, because our original smaller dataset has similarity scores or labels, but the new augmented dataset we've just created doesn't have any of those; we just randomly sampled new sentence pairs, so there are no scores or labels, and we need those to actually train a model. 00:12:02.480 |
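To make the combinatorial idea concrete, here is a minimal sketch (the three example pairs are made up) showing how three labelled pairs expand into nine unlabelled candidate pairs:

```python
from itertools import product

# three original (gold) pairs - illustrative examples only
sentence_a = ["A man is playing a guitar", "A dog runs on the beach", "Two kids play football"]
sentence_b = ["Someone plays an instrument", "An animal runs on sand", "Children play a sport"]

# pair every sentence A with every sentence B -> 3 x 3 = 9 candidate pairs,
# but none of these new combinations have a similarity label yet
candidate_pairs = list(product(sentence_a, sentence_b))
print(len(candidate_pairs))  # 9
```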
So what we can do is take a slightly different approach, or add another step, and that other step uses something called a cross-encoder. In semantic similarity we can use two different types of models: a cross-encoder, or a bi-encoder, which is what I would usually call a sentence transformer. A cross-encoder works by simply putting sentence A and sentence B into a BERT model together at once, so we have sentence A, a separator token, and sentence B, and we feed all of that into a single BERT model. The output embeddings all get fed into a linear layer, which converts them into a similarity score. Now, that similarity score is typically going to be more accurate than the similarity score you get from a bi-encoder, or sentence transformer. From a sentence transformer we output sentence vectors, and if we have two sentence vectors we can perform a cosine similarity or Euclidean distance calculation to get the similarity of those two vectors. 00:13:55.420 |
The downside of the cross-encoder is that it requires a full BERT inference step for every pair you compare. Comparing every sentence pair in even a modestly sized dataset with a cross-encoder would take you something like 65 hours, whereas with a bi-encoder, where you compute the vectors once and then just compare them, it's going to take you more like five seconds. That's why we usually use bi-encoders, or sentence transformers. A quick sketch of the difference between the two is below. 00:14:31.220 |
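Here is a rough sketch of the two model types using the sentence-transformers library (the model names are just common pretrained checkpoints, not necessarily the ones used in the video):

```python
from sentence_transformers import SentenceTransformer, util
from sentence_transformers.cross_encoder import CrossEncoder

sent_a = "A claim about rising sea levels"
sent_b = "Evidence describing measured sea level change"

# cross-encoder: both sentences go through BERT together -> one score per pair
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-base")
score = cross_encoder.predict([(sent_a, sent_b)])
print(score)  # similarity score for the pair

# bi-encoder: each sentence is encoded once, then the vectors are compared
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
vec_a, vec_b = bi_encoder.encode([sent_a, sent_b])
print(util.cos_sim(vec_a, vec_b))  # cosine similarity of the two sentence vectors
```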
The reason I'm talking about cross-encoders is that we get this more accurate similarity score, and another very key thing is that we need less data to train a cross-encoder than a bi-encoder. I think the original SBERT model itself was trained on something like one million sentence pairs, and some newer models are trained on a billion or more, whereas we can train a reasonable cross-encoder on something like 5K sentence pairs, or maybe even less. So we need much less data. 00:15:14.700 |
That works quite well with the data augmentation we've been talking about. We can take a small dataset and augment it to create more sentence pairs. Then what we do is train a cross-encoder on the original dataset, which we call the gold dataset, and use that fine-tuned cross-encoder to label the augmented dataset that has no labels. That creates an augmented, labelled dataset that we call the silver dataset, and it's this strategy of creating a silver dataset, which we then use alongside the gold data to train the bi-encoder, that makes up the Augmented SBERT approach. 00:16:27.100 |
What you can see in this flow diagram is that process, and we've already described most of it. We get our gold dataset, the original dataset, which is going to be quite small, let's say one to five thousand labelled sentence pairs. From that we use something like random sampling to create a larger, unlabeled dataset; let's say we create something like a hundred thousand pairs, which don't yet have similarity scores or natural language inference labels. 00:17:25.420 |
What we do is take that gold dataset down here and fine-tune a cross-encoder using that gold data, because we need less data to train a reasonably good cross-encoder. Then we use that cross-encoder alongside our unlabeled dataset to create a new, labelled dataset: the cross-encoder predicts the similarity scores, or NLI labels, for every pair. 00:18:07.860 |
We also still have the gold data, which is up here, and we train the bi-encoder, or sentence transformer, on both the gold data and the silver data. Now, one thing I would say here is that it's useful to separate out some of your gold data at the very start, and not even train your cross-encoder on it. It's good to hold it out as your evaluation or test set, and then evaluate both the cross-encoder performance and your bi-encoder performance on that separate split. Don't include it in the training data for any of your models; keep it separate, and then you can use it to figure out whether this is working or not. 00:19:08.860 |
This is the same as what we saw before, just another view of the training approach: we have the gold data, we have our unlabeled pairs, which come from random sampling on the gold data, we process those through the cross-encoder to create a silver dataset, and then the silver and gold data together are used to train the bi-encoder. 00:19:44.780 |
Now what I want to do is actually go through the code and work through an example of how we can do this. I've downloaded both the training and the validation set of the STSb data, so let's have a look at what some of that data looks like. We have sentence pairs, sentence one and sentence two, each just a simple sentence, and we have a label, which is our similarity score. 00:20:19.700 |
That similarity score varies from 0 up to 5, where 0 means the two sentences are not similar and 5 means they mean the same thing. We can see here that these two mean the same thing, as we would expect. Now, we need to change that score a little bit, because we're going to be training using cosine similarity loss, and we'd expect our label to not go up to a value of 5 but instead go up to a value of 1. So we normalize the scores by dividing everything by 5. 00:21:22.820 |
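A minimal sketch of that step, assuming the STS benchmark is loaded through Hugging Face datasets (the exact dataset identifier used in the video may differ; "glue"/"stsb" is one common source with 0-5 labels):

```python
from datasets import load_dataset

# STS benchmark: sentence1, sentence2, label (0-5)
stsb = load_dataset("glue", "stsb", split="train")

# cosine similarity loss expects labels in the range 0-1, so normalize
stsb = stsb.map(lambda row: {"label": row["label"] / 5.0})
print(stsb[0])
```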
Now what we can do is load our training data into a data loader. To do that, we first transform everything into InputExample objects and then load those into a PyTorch DataLoader. I run that, and at the same time, during training, I also want to output an evaluation score, i.e. how did the cross-encoder do on the evaluation data? So here we're importing the CECorrelationEvaluator from sentence_transformers.cross_encoder.evaluation; again I'm using InputExamples, because we're working with the sentence-transformers library, and I'm putting all of that validation data into the evaluator. 00:22:39.420 |
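Roughly, that step looks like this (a sketch continuing from the normalized STSb data above; the batch size is illustrative):

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample
from sentence_transformers.cross_encoder.evaluation import CECorrelationEvaluator

# training pairs -> InputExample objects -> DataLoader
train_examples = [
    InputExample(texts=[row["sentence1"], row["sentence2"]], label=float(row["label"]))
    for row in stsb
]
loader = DataLoader(train_examples, shuffle=True, batch_size=16)

# validation pairs -> correlation evaluator for the cross-encoder
val = load_dataset("glue", "stsb", split="validation").map(lambda r: {"label": r["label"] / 5.0})
val_examples = [
    InputExample(texts=[r["sentence1"], r["sentence2"]], label=float(r["label"]))
    for r in val
]
evaluator = CECorrelationEvaluator.from_input_examples(val_examples)
```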
Then we can move on to initializing the cross-encoder, training it, and also evaluating it. From sentence_transformers I import CrossEncoder, and to initialize the cross-encoder model, which I'll call ce, all we need to do is write CrossEncoder, very similar to when we initialize a SentenceTransformer, passing the model we want to initialize from, in this case bert-base-uncased, and also the number of labels we'd like to use. We're predicting a single similarity score between zero and one, so we just want one label. If we were doing NLI, for example, with entailment, contradiction and neutral labels, we would change this to three, but in this case it's one. 00:24:07.220 |
From there we move on to actually training, so we call ce.fit and pass the data loader; it's slightly different to the fit function we usually use with sentence transformers. You don't need to do this, but if you are going to evaluate your model during training, you also want to pass in the evaluator as well; this is the CECorrelationEvaluator, which makes sure we're evaluating with the cross-encoder. We'll train for, say, one epoch, and we should define that as a variable, because while we're training I would also like to include some warm-up steps. I'm going to include a lot of warm-up steps, actually, and I'll talk about why in a moment. I take the integer of the length of the loader, so the number of batches we have in our dataset, and multiply it by 0.4, so we do warm-up steps for 40% of our total number of batches. We also need to multiply that by the number of epochs; if we were training for two epochs we would multiply by two, in this case it's just one, so it's not strictly necessary, but it's there. 00:26:05.100 |
So that's 40% of our training steps as warm-up, and I found this works better than something like 10%, 15% or 20%. I think you could also achieve a similar result by just decreasing the learning rate of your model. By default the fit function uses optimizer_params with a learning rate of 2e-5; if you wanted to decrease that a little you could go to, say, 5e-6, and that would probably have a similar effect to having such a significant number of warm-up steps, in which case you could decrease the warm-up to 10%. But for me, the way I've tested this, I ended up going with 40% warm-up steps, and that works quite well. 00:27:21.220 |
Finally, where do we want to save our model? I'm going to give an output path and save it locally. Then we run everything; I'll just make sure it's actually running, and yep, there we go, you can see it's running, but I'm not going to run it fully here because I've already done it. 00:27:57.260 |
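Putting the cross-encoder training step together, a sketch might look like this (my reconstruction of what's described above, not the exact notebook code; the output path is an assumed local directory):

```python
from sentence_transformers.cross_encoder import CrossEncoder

num_epochs = 1

# single-label regression head -> one similarity score between 0 and 1
ce = CrossEncoder("bert-base-uncased", num_labels=1)

# warm up the learning rate over 40% of all training batches
warmup_steps = int(len(loader) * 0.4 * num_epochs)

ce.fit(
    train_dataloader=loader,
    evaluator=evaluator,          # CECorrelationEvaluator from above
    epochs=num_epochs,
    warmup_steps=warmup_steps,
    output_path="bert-stsb-cross-encoder",  # assumed save directory
)
```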
Okay, so we now have our gold dataset, which we pulled from Hugging Face datasets, and a fine-tuned cross-encoder. Before we actually go on to predicting labels with the cross-encoder, we need to create the unlabeled pairs, and we'll do that through random sampling using the gold dataset we already have. I'll add a little bit of separation in here, and now we're going to go ahead and create those pairs. As I said, we're going to be using random sampling for that, and I find the easiest way to do that is to use a pandas dataframe rather than the Dataset object we currently have, so I'm going to go ahead and initialize that with our gold data. 00:29:19.820 |
In here we're going to have sentence one, and as well as that we also have sentence two. We may also want to include our label in there, although I wouldn't say it's really necessary. I'll just check I'm not going to overwrite anything called gold, and it's okay. Then I'm going to have a look at that as well, so you can see a few examples of what we're actually working with; I'll just go ahead and rerun these cells too. Okay, so there we have our gold data. 00:30:42.580 |
Having reformatted that into a dataframe, we can use the sample method to randomly sample rows. To do that, I want to create a new dataframe; this is going to be our unlabeled silver data. It's not a silver dataset yet, because we don't have the labels or scores, but this is where we will put them, and in here we again have sentence one and sentence two columns, although at the moment they're empty, there's nothing in there yet. So what we need to do is iterate through all of the unique sentences in the gold data. 00:31:27.060 |
Before that, I'm just going to import tqdm; that's just a progress bar so you can see where we are, because I don't really like waiting with no idea how long something is taking to process. So we have the progress bar, and I take a list of a set, so we're taking all of the unique values in the gold dataframe for sentence one. That will loop through every single unique sentence one item in there, and for each one I'm going to randomly sample sentences from the other column, sentence two, to be paired with that sentence one. 00:32:33.100 |
The phrases we're going to sample come from the gold data, of course, and we only want to sample from rows where sentence one is not equal to the current sentence one, because otherwise we're possibly going to introduce duplicates. We're going to remove duplicates anyway, but let's just exclude them from the sampling in the first place. So we take all of the gold dataset where sentence one is not equal to the current sentence one, and I sample five of those rows. From that sample I extract the sentence two phrases and convert them into a list, and then for each sentence two in the sampled list I append a new pair to my pairs dataframe, where sentence one is the current sentence one and sentence two is the sampled sentence two. 00:34:06.380 |
Let me just run this for maybe the first 500 sentences, see how long that takes, and I'll also have a look at what we get from that. Because we're taking each sentence one and randomly sampling five sentence twos for it, we can see that we have a few repeats of each sentence one. Another thing we might do is remove any duplicates now. There probably aren't any duplicates here, but we can check, so pairs equals pairs drop duplicates. There weren't any duplicates anyway, but it's a good idea to add that in just in case. 00:35:36.220 |
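Here is a sketch of that sampling loop, assuming the gold pandas dataframe with sentence1/sentence2 columns described above (column names, the sample size of five, and the 500-sentence limit follow the walkthrough; the variable names are my own, and I use pd.concat rather than the older append method):

```python
import pandas as pd
from tqdm.auto import tqdm

pairs = pd.DataFrame(columns=["sentence1", "sentence2"])

# loop over (the first 500) unique sentence1 values in the gold data
for sentence1 in tqdm(list(set(gold["sentence1"]))[:500]):
    # sample 5 sentence2 values from rows with a *different* sentence1,
    # so we don't just recreate existing gold pairs
    sampled = gold[gold["sentence1"] != sentence1]["sentence2"].sample(5).tolist()
    for sentence2 in sampled:
        pairs = pd.concat(
            [pairs, pd.DataFrame([{"sentence1": sentence1, "sentence2": sentence2}])],
            ignore_index=True,
        )

# just in case, drop any duplicate pairs
pairs = pairs.drop_duplicates()
```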
Now we can actually take the cross-encoder. In fact, let's go back to our little flow chart first: we've created the unlabeled dataset, so that's done, and now we go on to predicting the labels with our cross-encoder. Down here I have the cross-encoder code, and what I've done is upload the trained model to the Hugging Face model hub, so what you can do, and what I can do, is load it from there under my jamescalam account; it's the BERT cross-encoder we just trained (or you can load your own locally saved cross-encoder from earlier). 00:36:41.860 |
We'll use that cross-encoder to create our labels, and that will give us our silver dataset. To do that, I'm going to call the variable silver for now; it isn't really the silver dataset yet, but that's fine. I create a list by zipping both of the columns from our pairs dataframe, so that's all of our pairs again, and you can have a look at those. What we want to do now is actually create our scores, so we just take the cross-encoder and call predict; I'll run it on, let's say, ten pairs, because I already have the full dataset and can show you that somewhere else. Let's have a look at what we have in those scores, the first three of them: we have an array, and we have these scores. 00:38:19.460 |
They are our similarity predictions for the first three pairs. Now, because the pairs are randomly sampled, a lot of them are not particularly relevant to each other; that's one issue with this approach. You can try to mitigate it by, after creating your scores across all of your records, removing most of the low-scoring samples and keeping all of your high-scoring samples; that will help you deal with the imbalance. 00:39:04.020 |
These scores won't actually cover all of the pairs, because we've only predicted ten here; you shouldn't do this in practice, obviously, it's just so the labels fit for the demo. Let's have a look: okay, so we now have sentence one, sentence two and some labels. What you would do then, although I'm not going to run it, is write pairs to CSV. You don't strictly need to do this if you're running everything in the same notebook, but it's probably a good idea. I'm going to save the silver data as a tab-separated file, so the separator is a tab character, and I don't want to include the index. That will create the silver data file that we can train with: the sentence pairs and the scores our cross-encoder has assigned to them. 00:40:52.100 |
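A sketch of the labelling and saving step, continuing from the pairs dataframe above (the file name is an assumption):

```python
# build (sentence1, sentence2) tuples for the cross-encoder
silver_pairs = list(zip(pairs["sentence1"], pairs["sentence2"]))

# predict a similarity score for every randomly sampled pair
scores = ce.predict(silver_pairs)
pairs["label"] = scores

# save the (now labelled) silver data as a tab-separated file, without the index
pairs.to_csv("silver.tsv", sep="\t", index=False)
```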
So I'm going to close that and go back to the demo, but first let's go back to the flow chart. I'm going to cross off the predict-labels step, and next we're going to fine-tune the bi-encoder on both the gold and silver data. 00:41:24.500 |
The gold data we already have, and the silver data I'm going to load from file. Let's have a look at what we have and make sure it's all loaded correctly; that looks good. Then we put both of those together, so all_data is equal to gold plus silver, and we ignore the index so we don't end up with clashing index values. Okay, we can see that we now have all of the data in there if we check the length; it's definitely a bigger dataset than before, which was just the gold data. 00:42:26.940 |
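That step is roughly (file name as assumed above):

```python
import pandas as pd

# load the cross-encoder-labelled silver data back from the tab-separated file
silver = pd.read_csv("silver.tsv", sep="\t")

# combine gold and silver into one training set, resetting the index
all_data = pd.concat([gold, silver], ignore_index=True)
print(len(gold), len(silver), len(all_data))
```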
Okay, so we now have a larger dataset and we can go ahead and use it to fine-tune the bi-encoder, or sentence transformer. What I'm going to do is take the code from up here; I think I've already run the import, so I don't need to import InputExample again. What we actually want to do here is loop through each row of all_data, because it's a dataframe, and for each row we have sentence one, sentence two and also a label. We load them into our train data, and we can have a look at that train data; it's a list of InputExample objects, and if you want to see what one of those has inside, you can access the texts, and in a new cell you can also access the label. That looks good, and we can now take that, like we did before, and load it into a data loader, so let me go up again and copy that. 00:44:22.180 |
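As a sketch, assuming the combined all_data dataframe from the previous step:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample

train_data = []
for _, row in all_data.iterrows():
    train_data.append(
        InputExample(
            texts=[row["sentence1"], row["sentence2"]],
            label=float(row["label"]),  # must be a float for cosine similarity loss
        )
    )

loader = DataLoader(train_data, shuffle=True, batch_size=16)
```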
We run this, which creates our data loader, and we can move on to actually initializing the sentence transformer, or bi-encoder, and training it. From sentence_transformers we're going to import models and also SentenceTransformer. 00:44:42.740 |
Now, to initialize our sentence transformer, if you've been following along with this series of videos and articles you'll know that we do something that looks like this. We have a transformer component, which is going to be models.Transformer; here we're just loading a model from Hugging Face transformers, so bert-base-uncased. We also have our pooling layer, so models again, and we have models.Pooling, which needs to know the embedding dimension it should expect, which is just going to be bert.get_word_embedding_dimension(), and it also needs to know what type of pooling we're going to use, whether that's CLS pooling, mean pooling, max pooling and so on; we are going to use mean pooling. So we have those two components of our sentence transformer, and we now need to put them together: we initialize a SentenceTransformer and just pass the modules as a list, the transformer and also the pooling layer. 00:46:18.140 |
We run that, and we can also have a look at what the model looks like. You can see we have a SentenceTransformer object, and inside there we have two layers or components: the first one is our transformer, which is a BERT model, and the second one is our pooling layer. We can see here that the only pooling method set to true is pooling_mode_mean_tokens, which means we're going to take the mean across all of the word embeddings output by BERT and use that to create our sentence embedding, or sentence vector. 00:46:56.640 |
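A sketch of that model definition (this mirrors the standard sentence-transformers pattern described above):

```python
from sentence_transformers import SentenceTransformer, models

# transformer component: a plain BERT model from Hugging Face
bert = models.Transformer("bert-base-uncased")

# pooling component: mean pooling over BERT's output word embeddings
pooling = models.Pooling(
    bert.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
)

# combine the two components into a bi-encoder / sentence transformer
model = SentenceTransformer(modules=[bert, pooling])
print(model)
```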
Next we initialize our loss function, so from sentence_transformers.losses we import CosineSimilarityLoss, and in here we simply pass the model, so it understands which parameters to actually optimize. The training call is similar to before with the cross-encoder, although slightly different, so let me take that code from a little further up and modify it. For warm-up, I'm going to warm up for 15% of the number of steps we're going to run through, and we change ce to model, because it's not the cross-encoder anymore. Like I said, there are some differences here: we have train_objectives, which is different; it's just a list of all of the training objectives we have, and we're only using one, so we pass a single data loader and loss pair. We could use an evaluator, but I'm not going to for this one; I'm going to evaluate everything afterwards. The epochs and warm-up steps are the same, and the only other thing that's different is the output path, which is going to be something like bert-stsb. That's it, so go ahead and run that; it should run, but let's check that it does. 00:48:42.040 |
Okay, so I've got an error here, so it's lucky that we checked. We have this RuntimeError: found dtype Long but expected Float. If we come up here, it's going to be in the data loader, or rather in the data we've built: here I've cast the label to int for some reason, I'm not sure why I did that. It should be a float, the label in your training data, and the same applies for the cross-encoder as well, where we would also expect a float value. So just be aware of that; I'll make sure there's a note earlier in the video for it. 00:49:39.780 |
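For reference, a sketch of the bi-encoder training call described above (output path and epoch count are illustrative; note that the labels in train_data must already be floats):

```python
from sentence_transformers import losses

num_epochs = 1

# cosine similarity loss over the bi-encoder's sentence embeddings
loss = losses.CosineSimilarityLoss(model)

# 15% of all training batches as warm-up this time
warmup_steps = int(len(loader) * 0.15 * num_epochs)

model.fit(
    train_objectives=[(loader, loss)],  # one (data loader, loss) objective
    epochs=num_epochs,
    warmup_steps=warmup_steps,
    output_path="bert-stsb-aug",  # assumed save directory for the AugSBERT model
)
```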
Okay, let's fix that and try to rerun it; it should be okay now. Oh, I need to actually rerun everything else as well. I'll let this run for a moment just to be sure it is actually running this time. It looks good, although for some reason in the notebook I'm not actually seeing the number of iterations, but okay, I'll pause it now and we can see that, yes, it did run through a couple of iterations, so it is running. 00:50:38.860 |
What I want to do now is show you the evaluation of these models. Going back to our flow chart quickly: fine-tune the bi-encoder, we've just done that, so let's move on to the evaluation. This is my evaluation script; basically all we're doing is importing the EmbeddingSimilarityEvaluator, loading STSb again and taking the validation split, which we didn't train on, converting it into InputExamples, feeding it into our EmbeddingSimilarityEvaluator, and loading the model, whose name I pass through the command line. 00:51:51.200 |
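Here is a sketch of what such an evaluation script might look like (a minimal reconstruction; the real script's name and structure aren't shown in the transcript):

```python
import sys
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, InputExample
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# STSb validation split, labels normalized to 0-1 as during training
val = load_dataset("glue", "stsb", split="validation")
examples = [
    InputExample(texts=[r["sentence1"], r["sentence2"]], label=r["label"] / 5.0)
    for r in val
]

evaluator = EmbeddingSimilarityEvaluator.from_input_examples(examples)

# model name/path passed on the command line, e.g. python eval.py bert-stsb-aug
model = SentenceTransformer(sys.argv[1])
print(evaluator(model))  # correlation between predicted and true similarities
```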
Let me switch across to the command line so we can see how that actually performs. I've switched across to my other desktop because that machine is much faster, so I can run this quickly. So, python followed by the evaluation script, and we're going to run it for all three models: the cross-encoder, the Augmented SBERT model, and a sentence transformer trained purely on the gold dataset. First, let's have a look at the gold-only model. 00:52:43.860 |
Okay, so everything downloaded, and we've got a score of 0.506; that's the correlation between the model's predictions and the actual scores, so they do correlate. It's not bad, but it's not great either. Let's have a look at the cross-encoder: 0.58. So, as we'd expect when training on just the gold data, the cross-encoder does outperform the bi-encoder, or sentence transformer. 00:53:40.980 |
Now let's run it for the Augmented SBERT model and wait for that to download. The correlation here is much higher for the model trained on the augmented data than if we had just used the gold dataset, so it really has improved the performance a lot. This is maybe an atypical performance increase; it's something like a 90% improvement, which is good, but if you look at the original paper from Reimers and co., they found an expected performance increase of, I believe, seven to nine points. So this is definitely significant, definitely a bit more than that, but it shows how useful these models, or this training strategy, can actually be. 00:54:44.140 |
That's it for this video. I hope this has been useful and that it helps you overcome the lack of data we sometimes find in a lot of our particular use cases. Thank you very much for watching, and I will see you in the next one.