
Making The Most of Data: Augmented SBERT


Chapters

0:00
7:01 Language
7:28 Data Augmentation Techniques
9:28 Contextual Word Embeddings
12:14 Cross Encoder
15:16 Data Augmentation
28:23 Create that Unlabeled Data Set
34:52 Remove any Duplicates
35:53 Predicting the Labels of Our Cross Encoder
45:11 Pooling Layer

Whisper Transcript | Transcript Only Page

00:00:00.000 | In this video, we're going to have a look at how we can make the most of the limited data
00:00:05.360 | using
00:00:07.280 | language data augmentation strategies and training approaches
00:00:12.080 | More specifically, we're going to focus on something called Augmented SBERT. So
00:00:18.000 | You may or may not be aware that the past decade has been sort of a renaissance or explosion in the field
00:00:26.400 | of machine learning and data science and
00:00:29.520 | a lot of
00:00:31.200 | that, especially early progress with things like the
00:00:34.840 | perceptron and recurrent neural networks, a lot of that was researched and discovered
00:00:41.080 | back in the 50s and 60s and 70s, but we
00:00:46.400 | didn't see that really applied in industry or
00:00:50.640 | anywhere really until
00:00:53.440 | the past decade and
00:00:56.360 | There are two main reasons for this. So the first is that we didn't have enough compute power back in the
00:01:04.280 | 50s 60s 70s to train the models that we needed to train and we also didn't have the data to actually
00:01:11.320 | train those models now
00:01:13.720 | compute power is
00:01:16.120 | Not really a problem anymore. We sort of look at this graph
00:01:20.960 | It depends on what model you're training, of course; if you are OpenAI and you're training GPT-4 or 5 or whatever,
00:01:28.360 | yeah, maybe compute power is pretty relevant, but for most of us we can get access to
00:01:36.280 | Cloud machinery not personal machines and we can wait a few hours or a couple of days and
00:01:45.120 | fine-tune or pre-train a transformer model
00:01:49.520 | that has
00:01:51.320 | good enough performance for what we need.
00:01:53.520 | Now that obviously wasn't always the case until very recently back in
00:02:00.060 | the 1960s. You see on this graph here we have the
00:02:05.780 | IBM 704, and you can see on the y-axis we have floating-point operations per second,
00:02:14.480 | and that's a logarithmic scale, so on a linear scale it basically looks like a flat line until a few years ago and then shoots up,
00:02:24.680 | pretty impressive how much progress is made in terms of
00:02:29.680 | computing power now
00:02:32.480 | Like I said, that's you know, not really an issue for us anymore
00:02:36.880 | we you know, we have the compute in most cases to do what we need to do and
00:02:43.200 | Data is not as much of a problem anymore, but we'll talk about that in a moment
00:02:48.800 | so data again, we have a very
00:02:52.200 | Big increase in data not quite as big as the computing power and this this graph here doesn't go quite as far back
00:03:00.880 | It's only 2010
00:03:02.960 | where I believe it was at 2 zettabytes, and it is now
00:03:09.360 | far larger in 2021 or so, so
00:03:13.800 | there's a
00:03:15.760 | Fairly big increase not quite as much as compute power over time, but still pretty massive now the
00:03:23.160 | thing with data is
00:03:26.160 | yes, there's a lot of data out there, but is there that much data out there for
00:03:32.280 | What we need to train models to do and in a lot of cases
00:03:37.640 | Yes, there is but it really depends on what you're doing if you are
00:03:41.760 | Focusing on
00:03:45.640 | The more niche domains. So what I have here on the left over here are a couple of niche domains
00:03:53.360 | There's not that much data out there on
00:03:57.120 | sentence pairs for climate evidence and claims for example, so where you have a
00:04:04.240 | piece of evidence and a claim, and whether the evidence supports the claim or not.
00:04:08.960 | There is a very small data set called the Climate FEVER data set, but it's not big.
00:04:13.520 | For agriculture, I assume within that industry. There's not that much data, although I
00:04:19.680 | Have never worked in that industry. So I
00:04:22.800 | I'm not fully aware, I just assume there's probably not that much, and then also niche finance,
00:04:30.160 | Which I do at least have a bit more experience with and I imagine this is probably something though
00:04:35.360 | A lot of you will find useful as well
00:04:37.360 | Because a finance is a big industry. There's a lot of finance data out there, but there's a lot of niche
00:04:44.720 | Little projects and problems in finance where you find much less data
00:04:50.200 | So, yes, we have a lot more data nowadays, but we don't have
00:04:56.000 | Enough for a lot of use cases on the right here. We have a couple of examples of low resource data sets
00:05:01.920 | so we have debates from the Maldives and also low-resource
00:05:05.520 | languages as well. So with these
00:05:09.280 | we kind of need to find a different approach. Now we can
00:05:14.440 | investigate, depending on your use case, unsupervised learning with TSDAE, which we have covered in a previous video and article, and
00:05:23.320 | That does work when you're trying to build a model that recognizes
00:05:28.760 | Generic similarity and it works very well as well. But for example with the climate claims data
00:05:36.000 | We are not necessarily trying to match sentence a and B based on their semantic similarity
00:05:43.160 | but we're trying to match sentence A, which is a claim, to
00:05:46.680 | Sentence B, which is evidence as to whether that evidence supports the claim or not
00:05:53.080 | so in that case
00:05:55.080 | an unsupervised approach like TSDAE doesn't really work.
00:06:01.440 | what we have is very little data and
00:06:03.720 | We there aren't really any
00:06:06.720 | Alternative training approaches that we can use. So basically what we need to do is
00:06:12.520 | Create more data now
00:06:16.520 | augmentation is
00:06:18.800 | Difficult particularly for language so data augmentation is not specific to NLP
00:06:25.560 | it's used across ML and it's more established in the field of computer vision and
00:06:32.320 | That makes sense because computer vision you say you have an image you can modify that image using a few
00:06:40.080 | different approaches and
00:06:42.240 | The person can still look at that image and think okay, that is the same image
00:06:46.040 | It's just maybe it's rotated a little bit. We've changed a color grading
00:06:49.640 | brightness
00:06:51.920 | Or something along those lines just modified it slightly, but it's still in essence the same image
00:07:01.120 | for language, it's a bit difficult because language is very
00:07:04.840 | abstract and
00:07:07.360 | nuanced, so if you start randomly changing
00:07:10.960 | certain words
00:07:13.160 | the chances are you're going to produce something that doesn't make any sense and
00:07:17.040 | We when we're augmenting our data, we don't want to just throw rubbish into our model. We want
00:07:22.960 | Something that makes sense. So
00:07:26.560 | there are
00:07:28.680 | some data augmentation techniques and
00:07:31.320 | We'll have a look at a couple of the simpler ones now, so
00:07:36.360 | There is a library called nlpaug which I think is very good for this sort of thing. It's
00:07:43.240 | essentially a library that allows us to do data augmentation for NLP and
00:07:48.520 | What you can see here is two methods using word2vec
00:07:55.640 | Vectors and similarity and what we're doing is taking this original
00:08:01.520 | Sentence so the quick brown fox jumps over the lazy dog
00:08:07.120 | And we're just inserting some words
00:08:11.240 | using word2vec. So we're trying to find which words word2vec thinks could go in here,
00:08:16.400 | which words are the most similar to the surrounding words and
00:08:20.160 | We have this
00:08:23.280 | 'Alessiari', which I don't know, I think it seems like a name to me, but I'm not sure,
00:08:28.680 | and I don't think it really fits there, so it's not great, it's not perfect.
00:08:37.280 | Lazy superintendents dog that does kind of make sense. I feel like a
00:08:41.480 | lazy superintendents dog is
00:08:44.440 | Maybe a stereotype or I'm sure it's been in The Simpsons or something before
00:08:50.580 | So, okay, fair enough, I can see how that got in there, but again, it's a bit weird. It's not great.
00:08:58.040 | Substitution for me seems to work better
00:09:02.080 | So rather than the quick brown fox, we have the easy brown fox and rather than jumping over the lazy dog
00:09:08.840 | Jumps around the lazy dog which changes the meaning slightly
00:09:13.160 | Easy is a bit weird there to be fair
00:09:17.720 | We still have a sentence that kind of makes sense
00:09:21.000 | So that's good. I think now we don't
00:09:26.000 | have to use word2vec, you can also use contextual word embeddings like with BERT, and
00:09:31.560 | For me, I think these results look better. So for insertion
00:09:36.520 | We get even the quick brown fox usually jumps over the lazy dog. So we're adding some words that make sense
00:09:44.080 | that's, I think, good. For substitution we are changing one word here, and
00:09:51.520 | We're changing that to a little quick brown fox instead of just quick brown fox
00:09:56.880 | So I think that makes sense. And this is a good way of augmenting your data, creating more data from less.
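For reference, a minimal sketch of this kind of augmentation using the nlpaug library; the model name and sentence are just illustrative choices, and newer nlpaug versions return a list of strings rather than a single string:

```python
import nlpaug.augmenter.word as naw

text = "The quick brown fox jumps over the lazy dog"

# contextual (BERT-based) augmentation, as described above
insert_aug = naw.ContextualWordEmbsAug(model_path="bert-base-uncased", action="insert")
subst_aug = naw.ContextualWordEmbsAug(model_path="bert-base-uncased", action="substitute")

print(insert_aug.augment(text))  # inserts words that BERT thinks fit the context
print(subst_aug.augment(text))   # swaps words for contextually similar ones
```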
00:10:05.320 | but for us because we are using
00:10:10.360 | sentence pairs, we can basically just take all of the data from, say, we have
00:10:21.760 | a table over here, imagine this is a data frame, and we have all of these
00:10:27.400 | Sentence A's and we have all these sentence B's now
00:10:31.920 | if we take
00:10:34.040 | One sentence a it's already matched up to one sentence B and what we can do is say
00:10:39.360 | okay, I want to randomly sample some other sentence B's and
00:10:43.480 | match them up to
00:10:47.200 | Sentence a so, you know, we have
00:10:49.880 | Three more pairs now. Okay, so if we if we did this if we took
00:10:55.560 | three sentence A's three sentence B's and
00:10:59.140 | we made new pairs from all of them, so not really random sampling, just taking all the possible pairs, then we end up with
00:11:10.680 | nine pairs in total, which is much better. If you extend that a bit further,
00:11:17.760 | so from just a thousand pairs
00:11:20.160 | We can end up with one million pairs
00:11:24.600 | so you can see quite quickly you can you can take a small data set and very quickly create a
00:11:31.160 | Big data set with it. Now. This is just one part of the the problem though because our
00:11:38.120 | Smaller data set will have similarity scores or
00:11:43.000 | natural language inference labels
00:11:45.880 | But the the new data set that we've just created the augmented data set doesn't have any of those
00:11:53.000 | We just randomly sampled new sentence pairs, so there are no scores or labels, and we need those to actually train a model.
00:12:02.480 | What we can do is take a slightly different approach or add another step
00:12:10.380 | Now that other step is using something called a cross encoder so in
00:12:17.740 | semantic similarity
00:12:20.780 | We can use two different types of models. We can use a cross encoder over here
00:12:27.340 | or we can use a bi-encoder, or what I would usually call a sentence transformer. Now a
00:12:36.940 | cross encoder is
00:12:38.860 | the old way of doing it and
00:12:41.380 | it works by
00:12:44.540 | Simply putting sentence a and sentence B into a BERT model together at once
00:12:50.700 | so we have sentence A, a separator token, sentence B, feed that into a BERT model, and
00:12:55.220 | From that BERT model. We will get all of our
00:12:58.420 | Embeddings output embeddings over here and they all get fed into a linear layer
00:13:04.420 | Which converts all of those into a similarity score up here?
00:13:08.900 | now that similarity score is typically going to be more accurate than a
00:13:14.940 | similarity score that you get from a bi-encoder or sentence transformer.
00:13:22.100 | The problem here is
00:13:24.580 | from our sentence transformer, we are outputting sentence vectors and
00:13:31.660 | If we have two sentence vectors, we can perform a cosine similarity or Euclidean distance
00:13:39.420 | calculation to get the similarity of those two vectors and
00:13:45.220 | cosine similarity
00:13:48.460 | Calculation or operation is
00:13:51.060 | much quicker than a full
00:13:55.420 | BERT inference step, which is what we need with a cross encoder. So I
00:14:01.340 | Think it is something like
00:14:03.340 | for clustering maybe 10,000
00:14:08.260 | sentences, using a BERT cross encoder would take you something like 65 hours,
00:14:15.140 | whereas with a bi-encoder it's going to take you about five seconds. So it's
00:14:21.700 | much much quicker
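To make the speed and accuracy trade-off concrete, here is a small sketch comparing the two approaches with the sentence-transformers library; the two pretrained model names are just common examples, not the models from the video:

```python
from sentence_transformers import SentenceTransformer, util
from sentence_transformers.cross_encoder import CrossEncoder

a = "The quick brown fox jumps over the lazy dog"
b = "A fast brown fox leaps over a sleepy dog"

# bi-encoder: encode each sentence once, then compare vectors with a cheap cosine similarity
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
vec_a, vec_b = bi_encoder.encode([a, b])
print(util.cos_sim(vec_a, vec_b))

# cross encoder: a full BERT forward pass for every pair, slower but usually more accurate
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-base")
print(cross_encoder.predict([(a, b)]))
```

The bi-encoder's vectors can be cached and reused, which is why comparing thousands of sentences stays fast.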
00:14:26.900 | That's why we use bi-encoders or sentence transformers. Now
00:14:31.220 | The reason I'm talking about cross encoders is because we get this more
00:14:35.300 | accurate similarity score
00:14:38.340 | which we can use as a label and
00:14:41.020 | Another very key thing here is that we need less data
00:14:45.460 | to train a cross encoder than a bi-encoder. I think the SBERT model
00:14:52.140 | Itself was trained on something like 1 million
00:14:57.500 | sentence pairs, and some newer models are trained on a billion or more,
00:15:01.940 | Whereas a cross encoder we can we can train a reasonable cross encoder on like 5k or maybe even less
00:15:10.620 | Sentence pairs. So we need much less data and
00:15:14.700 | That works quite well. We've been talking about the data augmentation; we can take a small data set,
00:15:19.740 | we can augment it to create more sentence pairs and
00:15:24.060 | Then what we do is train on the original data set, which we call the gold data set
00:15:30.440 | we train our cross encoder using that and
00:15:35.300 | Then we use that fine-tune cross encoder to label
00:15:40.540 | the augmented
00:15:42.740 | data set without labels, and that creates an augmented, labeled data set that we call the silver data set.
00:15:54.980 | This sort of strategy of creating a silver data set, which we would then use to
00:16:01.340 | fine-tune our bi-encoder model, is
00:16:05.360 | What we refer to as the in domain
00:16:10.220 | Augmented
00:16:19.140 | SBERT
00:16:21.140 | Training strategy
00:16:23.460 | Okay, and
00:16:27.100 | this, so what you can see in this flow diagram, is
00:16:30.940 | basically every step that we need to do to
00:16:34.660 | create an in-domain augmented SBERT
00:16:38.080 | training process.
00:16:41.020 | So we we've already described most of this so we get our gold data set the original data set
00:16:48.420 | That's gonna be quite small. Let's say one to five
00:16:51.520 | thousand sentence pairs that are labeled. From there, we're going to use something like random sampling,
00:16:58.140 | Which I'll just call ran
00:17:03.940 | We're going to use that to create a larger data set. Let's say we create something like a hundred thousand
00:17:10.940 | sentence pairs
00:17:13.300 | But these are not labeled. We don't have any
00:17:17.180 | Similarity scores or natural language inference
00:17:20.260 | labels for these.
00:17:25.420 | What we do is we take that gold data set and we take it down here and we fine-tune a cross encoder
00:17:32.420 | Using that gold data because we need less data to train a reasonably good
00:17:37.140 | cross encoder
00:17:39.540 | So we take that and we fine-tune cross encoder and then we use that cross encoder
00:17:45.860 | Alongside our unlabeled data set to create a new
00:17:51.260 | silver data set
00:17:53.980 | Now the cross encoder is going to predict the similarity scores or NLI labels for every pair.
00:18:01.120 | So with that we have our silver data
00:18:07.860 | we also have the gold data, which is up here and
00:18:13.060 | we actually take both those together and
00:18:16.140 | we fine-tune
00:18:19.060 | the by encoder or the sentence transformer on
00:18:22.380 | both the gold data and the silver data now one thing I would say here is it's useful to
00:18:30.260 | Separate some of your gold data at the very start. So don't even train your cross encoder on those
00:18:38.780 | it's good to separate them as your evaluation or test set, and
00:18:43.260 | then evaluate both the cross encoder performance and also your bi-encoder performance on that separate set.
00:18:52.060 | So don't include that in your training data for any of your models
00:18:55.300 | Keep that separate, and then you can use it to figure out: is this working or is it not working?
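A minimal sketch of holding out that evaluation split, using the STS-B data from this video as the example gold set (if your own gold data has no ready-made validation split):

```python
from datasets import load_dataset

gold = load_dataset("glue", "stsb", split="train")      # the gold data set
splits = gold.train_test_split(test_size=0.1, seed=42)  # hold out 10% for evaluation
gold_train, gold_eval = splits["train"], splits["test"]
# gold_eval is never shown to the cross encoder or the bi-encoder during training
```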
00:19:03.740 | that is
00:19:05.580 | in-domain augmented SBERT, and
00:19:08.860 | as you can see, this is the same as what we saw before, just another view of the training approach. So we have the gold-data
00:19:20.020 | trained cross encoder
00:19:22.860 | we have our unlabeled pairs, which come from random sampling on the gold data,
00:19:27.380 | we process those with the cross encoder to create a silver data set, and then the silver and
00:19:32.660 | the gold
00:19:34.660 | come over here to fine-tune a bi-encoder.
00:19:40.380 | that's it for the
00:19:42.380 | theory and the concepts and
00:19:44.780 | Now what I want to do is actually go through the code and and work to an example of how we can actually do this
00:19:53.060 | okay, so we
00:19:55.860 | downloaded both the training and the validation set of the STSB data, and
00:20:03.060 | Let's have a look at what some of that data looks like. So
00:20:10.500 | So we have sentence pair sentence one sentence two
00:20:14.060 | Just a simple sentence and we have a label which is our similarity score now
00:20:19.700 | that similarity score varies from between 0 up to 5 where 0 is
00:20:25.540 | no similarity no
00:20:28.380 | relation between the two sentence pairs and
00:20:31.860 | 5 is they mean the same thing.
00:20:39.260 | Now we can see here that these two mean the same thing, as we would expect.
00:20:51.260 | first want to modify that
00:20:53.260 | Score a little bit because we are going to be training
00:20:57.620 | using cosine similarity loss and we would expect our
00:21:01.020 | Label to not go up to a value of 5 but instead go to value 1. So
00:21:07.980 | All I'm doing here is
00:21:10.500 | changing that score so that we are dividing everything by 5, normalizing everything,
00:21:18.580 | so we do that and no problem and
00:21:22.820 | Now what we can do is load our training data into a data loader. So to do that we first
00:21:30.180 | form everything into an InputExample and then load that into our PyTorch data loader.
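A minimal sketch of that data preparation, assuming the GLUE STS-B training split and an illustrative batch size:

```python
from datasets import load_dataset
from torch.utils.data import DataLoader
from sentence_transformers import InputExample

stsb = load_dataset("glue", "stsb", split="train")
stsb = stsb.map(lambda x: {"label": x["label"] / 5.0})  # normalize the 0-5 score to 0-1

train_examples = [
    InputExample(texts=[row["sentence1"], row["sentence2"]], label=float(row["label"]))
    for row in stsb
]
loader = DataLoader(train_examples, shuffle=True, batch_size=16)
```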
00:21:38.380 | So I run that and then at the same time during training I also want to
00:21:48.140 | output an evaluation score, so how did the cross encoder do on the evaluation data?
00:21:54.940 | so to do that I
00:21:58.140 | import, so here we're importing from sentence transformers cross encoder
00:22:04.860 | evaluation, I'm importing the CE
00:22:08.580 | correlation evaluator (CECorrelationEvaluator).
00:22:11.340 | Again, I'm using input examples when working with the sentence transformers library,
00:22:18.460 | And I
00:22:20.460 | importing both text and the labels and
00:22:25.100 | Here I
00:22:29.260 | putting all that development or validation data into that
00:22:35.860 | evaluator, okay, and I can run that and
00:22:39.420 | Then we can move on to initializing a cross encoder and training it and also evaluating it. So
00:22:47.020 | to do that, we're going to
00:22:49.020 | import from sentence transformers, so from
00:22:52.820 | sentence transformers, and I'll make sure I'm working in Python,
00:23:00.060 | I'm going to import from cross encoder a
00:23:03.900 | Cross encoder. Okay, and
00:23:08.380 | To initialize that cross encoder model, I'll call it ce,
00:23:13.380 | all I
00:23:15.940 | need to do is write CrossEncoder, very similar to when we write SentenceTransformer to initialize a model,
00:23:22.180 | we specify
00:23:27.220 | a model from Hugging Face transformers that we'd like to
00:23:30.140 | initialize a cross encoder from, so bert-base-uncased, and also the number of labels that we'd like to use.
00:23:37.740 | so in this case, we are just targeting a
00:23:42.940 | Similarity score between zero and one. So we just want a
00:23:46.540 | Single label that if we were doing for example, NLI
00:23:50.980 | labels where we have
00:23:53.820 | entailment contradiction and
00:23:55.980 | Neutral labels or some other labels and we would change this to for example three, but in this case one
00:24:03.700 | We can initialize our cross encoder
00:24:07.220 | and then from now we move on to actually training so we call model or see dot fit and
00:24:13.460 | We want to specify
00:24:16.220 | the data loader, so it's slightly different to the fit function we usually use with sentence transformers.
00:24:22.460 | So we want train data loader
00:24:24.820 | we specify our loader that we
00:24:28.180 | Initialize just up here the data loader
00:24:33.700 | Don't need to do this. But if you are going to evaluate your
00:24:36.860 | Model during training you also want to add in evaluator as well
00:24:42.020 | So this is from the CE correlation evaluator, making sure here we're using a cross encoder
00:24:48.500 | evaluation class
00:24:50.620 | we would like to
00:24:52.620 | run for
00:24:55.220 | Say one epoch and we should define this because I would also like to
00:25:01.500 | While we're training I would also like to include some warm-up steps as well.
00:25:06.780 | I'm going to include a lot of warm-up steps actually, and I'll talk about it in a moment. So I
00:25:13.540 | Would say number of epochs
00:25:16.380 | Is equal to one and for the warm-up I
00:25:22.740 | would like to take integer so the length of loader so the number of
00:25:28.780 | batches that we have in our data set, and I'm going to multiply this by
00:25:33.540 | 0.4. So I'm going to
00:25:36.700 | do warm-up steps for 40% of our
00:25:42.700 | total data set size, or 40% of our total number of batches, and
00:25:48.980 | We also need to multiply that by number of epochs. Let's say for training two epochs
00:25:54.260 | We do multiply that in this case just one so not necessary, but it's there
00:25:58.900 | so we're actually
00:26:01.820 | forming warm-up for 40% of the
00:26:05.100 | Training steps and I found this works better than something like 10% 15% 20%
00:26:11.020 | However that being said I
00:26:14.740 | Think you could also achieve a similar result by just decreasing the learning rate of your model. So
00:26:23.940 | By default, so if I write the epochs here and we'll define the warm-up steps
00:26:29.940 | with warmup_steps, by default we use optimizer_params
00:26:38.540 | with a learning rate of
00:26:42.620 | 2e-5.
00:26:48.180 | okay, so if you
00:26:50.860 | Say want to decrease that a little bit you could go. Let's say
00:26:56.340 | 2e-6 or 5e-6, and this would probably have a similar effect to having
00:27:03.300 | such a significant number of warm-up steps, and then in this case you could decrease this to 10% or so.
00:27:09.340 | But for me, the way I've tested this, I've ended up going with 40% warm-up steps, and that works quite well.
00:27:18.980 | so the final step here is
00:27:21.220 | Where do we want to save our model? So I'm going to say I want to save it into the base
00:27:28.100 | cross encoder
00:27:30.860 | Or let's say
00:27:34.500 | STSB cross encoder and
00:27:36.500 | We can run that and that will
00:27:40.460 | Run everything for us. I'll just make sure it's actually yep. There we go
00:27:45.780 | So see it's running, but I'm not gonna run it because I've already done it
00:27:49.020 | so let me pause that and
00:27:54.380 | Move on to the next step
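Pulling the cross encoder steps together, a minimal sketch of the training code described above; it reuses the `loader` built from the gold data earlier, and the output path is just an illustrative name:

```python
from datasets import load_dataset
from sentence_transformers import InputExample
from sentence_transformers.cross_encoder import CrossEncoder
from sentence_transformers.cross_encoder.evaluation import CECorrelationEvaluator

# evaluator built from the validation split, which we never train on
dev = load_dataset("glue", "stsb", split="validation")
dev_examples = [
    InputExample(texts=[r["sentence1"], r["sentence2"]], label=float(r["label"]) / 5.0)
    for r in dev
]
evaluator = CECorrelationEvaluator.from_input_examples(dev_examples, name="stsb-dev")

ce = CrossEncoder("bert-base-uncased", num_labels=1)  # one output: a 0-1 similarity score

num_epochs = 1
warmup = int(len(loader) * num_epochs * 0.4)  # 40% warm-up steps, as discussed above

ce.fit(
    train_dataloader=loader,
    evaluator=evaluator,
    epochs=num_epochs,
    warmup_steps=warmup,
    output_path="bert-stsb-cross-encoder",
)
```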
00:27:57.260 | okay, so we now have our gold data set, which we have pulled from Hugging Face datasets, and
00:28:06.260 | We've just fine-tuned a cross encoder. So
00:28:09.620 | Let's cross both of those off of here
00:28:13.660 | this and this and
00:28:17.300 | Now so before we actually go on to predicting labels with the with the cross encoder
00:28:22.540 | We need to actually create that
00:28:25.060 | unlabeled data set so
00:28:27.380 | let's do that through random sampling using the gold data set you already have and
00:28:32.860 | Then we can move on to the next steps
00:28:38.980 | okay, so I'll just
00:28:41.460 | add a little bit of separation in here. So now we're going to go ahead and create the
00:28:47.900 | augmented data set.
00:28:51.340 | So as I said, we're going to be using random sampling for that, and I find that
00:28:57.580 | the easiest way to do that is to actually go ahead and use a pandas dataframe rather than using the
00:29:05.780 | Data set object that we currently have so I'm gonna go ahead and initialize that so we have our gold data
00:29:13.380 | that will be
00:29:17.340 | pd.DataFrame, and
00:29:19.820 | in here we're going to have sentence one and sentence two. Sentence one
00:29:30.340 | Is going to be equal to
00:29:35.740 | Sentence one
00:29:37.660 | Okay, and as well as that we also have sentence two
00:29:43.740 | going to be STSB
00:29:45.780 | sentence two now we may also want to include our
00:29:52.900 | Label in there, although I wouldn't say this is really necessary
00:29:57.980 | Add it in
00:30:01.380 | So our label is just like this,
00:30:04.340 | and if I have a look here, so we have...
00:30:12.700 | am I going to overwrite anything called gold? It's okay.
00:30:15.500 | So, okay, I'm gonna have a look at that as well so you can see a few examples of what we're actually working with
00:30:25.200 | I'll just go ahead and actually rerun these as well
00:30:30.860 | Okay, so there we have we have our gold data and
00:30:38.100 | now what we can do, because we've
00:30:42.580 | reformatted that into a data frame, we can use the sample method to randomly sample
00:30:48.780 | different sentences so
00:30:51.620 | To do that what I will want to do is create a new data frame
00:30:56.620 | So this is going to be our unlabeled silver data.
00:31:00.460 | So it's not it's not a silver data set yet because we don't have the labels or scores yet
00:31:04.460 | But this is going to be where we we will put them and in here. We again will have sentence one
00:31:14.260 | also sentence
00:31:16.060 | Two but at the moment that they're empty. It's nothing nothing in there yet
00:31:20.280 | so what we need to do is actually iterate through all of the
00:31:24.460 | rows in here, so
00:31:27.060 | Before that I'm just going to do an import,
00:31:31.380 | tqdm.auto,
00:31:34.260 | so from tqdm.auto import tqdm.
00:31:41.060 | That's just a progress bar so you can see you know where we are I don't I don't really like to wait and
00:31:46.580 | have no idea how long this is taking to process and
00:31:50.700 | For sentence one
00:31:56.900 | tqdm so we have the progress bar and I'll take a list of a set
00:32:02.060 | So we're taking all the unique values in the gold data frame for sentence one
00:32:08.260 | Okay, so that will just loop through every single unique sentence of one
00:32:13.820 | Item in there and I'm gonna use that and I'm going to randomly sample
00:32:21.820 | sentences from the other column, sentence two, to be paired with that sentence one. And, yeah, the
00:32:28.780 | sentence two
00:32:33.100 | Phrases that we're going to sample are going to come from the gold data, of course, and we only want to
00:32:40.060 | sample from rows where sentence one is not equal to the current sentence one because otherwise we
00:32:48.680 | Are possibly going to introduce duplicates and we're going to remove duplicates anyway
00:32:53.980 | But let's just remove them from the sampling in the in the first place. So
00:32:58.460 | we're going to
00:33:00.580 | take all of the gold data set where sentence one is
00:33:05.880 | not equal to the current sentence one, and what I'm going to do is just sample five of those rows,
00:33:12.100 | Like that now from that. I'm just going to extract
00:33:17.780 | the sentence two, so the five
00:33:21.220 | sentence two phrases that we have there and I'm going to convert them into a list and
00:33:27.460 | now for
00:33:29.900 | sentence two
00:33:31.580 | In the sampled list that we've just created. I'm going to take my pairs
00:33:37.000 | I'm going to append new pairs, so pairs.append,
00:33:41.040 | and I want sentence one to be sentence one and
00:33:47.060 | Also sentence two is going to be equal to sentence two
00:33:53.040 | now this
00:33:55.580 | Will take a little while
00:33:57.940 | So what I'm going to do is actually
00:33:59.940 | Maybe not include the the full
00:34:03.420 | data set here
00:34:06.380 | So let me possibly just go maybe the first 500
00:34:13.300 | Yeah, let's go to first 500 see how long that takes and
00:34:19.700 | I will also want to just have a look at what we what we get from that
00:34:25.820 | Okay, so yes, it's much quicker
00:34:27.820 | Okay, so we have sentence one. Let me
00:34:33.060 | Remove that from there
00:34:40.060 | Let's just say that top ten, right?
00:34:42.740 | so because we we're taking five of sentence one every time and random sampling it we can see that we have a few of those and
00:34:50.380 | Another thing that we might do is remove any duplicates now.
00:34:55.580 | There probably aren't any duplicates here, but we can check, so pairs equals
00:35:01.780 | pairs.drop_duplicates(), and
00:35:06.820 | Then we'll check the length of pairs again
00:35:12.700 | also prints
00:35:14.700 | Let me run this again and print
00:35:22.580 | So there weren't any duplicates anyway, but it's a good idea to add that in just in case, and
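A minimal sketch of the random-sampling loop just described, assuming the `stsb` gold data loaded earlier and the sentence1/sentence2 column names used in the video:

```python
import pandas as pd
from tqdm.auto import tqdm

gold = pd.DataFrame({
    "sentence1": stsb["sentence1"],
    "sentence2": stsb["sentence2"],
})

new_pairs = []
for s1 in tqdm(list(set(gold["sentence1"]))):
    # sample sentence2 values from rows with a *different* sentence1
    sampled = gold.loc[gold["sentence1"] != s1, "sentence2"].sample(5).tolist()
    for s2 in sampled:
        new_pairs.append({"sentence1": s1, "sentence2": s2})

pairs = pd.DataFrame(new_pairs).drop_duplicates()
```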
00:35:32.420 | Now I want to do is
00:35:36.220 | Actually take the cross encoder. In fact, actually let's go back to our little flow charts. So we have
00:35:44.740 | Now created a larger unlabeled data set
00:35:50.780 | So it's good, and now we go on to predicting the labels with our cross encoder. So down here
00:35:58.140 | What I'm gonna do is take
00:36:01.180 | the cross encoder code here and what I've done is
00:36:06.020 | I've trained this already and
00:36:08.860 | I've uploaded it to
00:36:13.620 | Hugging Face models. So what you can do, and what I can do, is
00:36:18.380 | this, so I'm gonna write jamescalam, and it is called bert-stsb-cross-encoder.
00:36:34.580 | Okay, so that's our cross encoder and
00:36:37.580 | Now what I want to do is
00:36:41.860 | Use that cross encoder to create our labels. So that will create our silver data set now to do that. I
00:36:49.580 | I'm gonna call it silver for now. I mean it is this isn't really the silver data set, but it's fine and
00:36:57.340 | I'm going to create a list and I'm going to zip both of the columns from our pairs,
00:37:03.820 | so pairs sentence
00:37:09.980 | one, and
00:37:13.740 | pairs
00:37:15.700 | sentence two, okay, so
00:37:18.620 | That will give us
00:37:23.180 | all of our pairs; again, you can look at those,
00:37:28.980 | okay, so it's just like this and
00:37:33.900 | What we want to do now is actually create our scores. So just take the cross encoder,
00:37:38.300 | which we loaded as ce, so ce.predict, and
00:37:42.380 | We just pass in that silver data
00:37:47.340 | Do that
00:37:49.100 | Let's run it. It might take a moment
00:37:51.340 | Okay, so it's definitely taking a moment so
00:37:55.620 | Let me pause it. I'm going to just do
00:37:59.500 | Let's say 10 because I already have the full data set so I can I can show you that
00:38:04.300 | Somewhere else and let's have a look at what you have in those scores. So three of them
00:38:11.500 | So we have an array and we we have these these scores. Okay, so
00:38:19.460 | They are our predictions, our similarity predictions, for the first three. Now, because they're randomly sampled, a lot of these are
00:38:26.980 | negative,
00:38:28.780 | so if we go to silver...
00:38:30.900 | when I say negative, I mean more that
00:38:34.820 | they're not relevant. So
00:38:37.420 | Yeah, we can see
00:38:39.900 | Not particularly relevant. Let's just one must first issue with this and you can you can
00:38:46.340 | change you can try and modify that by after creating your scores if you have if you
00:38:54.180 | oversample and
00:38:56.540 | For a lot of values or a lot of records and then just go ahead and remove
00:39:04.020 | most of the low scoring samples and keep all of your high scoring samples that will help you deal with that
00:39:13.300 | imbalance in your data
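A minimal sketch of that rebalancing step, run once the cross encoder scores have been added to a label column as described next; the 0.5 threshold and the fraction kept are hypothetical values to tune for your data:

```python
import pandas as pd

high = pairs[pairs["label"] >= 0.5]                                  # keep all high scorers
low = pairs[pairs["label"] < 0.5].sample(frac=0.2, random_state=42)  # keep only a fraction of low scorers
pairs_balanced = pd.concat([high, low], ignore_index=True)
```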
00:39:16.980 | What I'm going to do is I'm going to add
00:39:19.340 | to the labels column those scores
00:39:23.380 | which
00:39:25.500 | Will not actually cover all of them because we only have ten in here. So
00:39:30.420 | Let me maybe multiply that
00:39:34.140 | So this isn't you shouldn't do this obviously it's just so they fit
00:39:39.740 | Okay, and
00:39:44.020 | Let's have a look. Okay, so we now have sentence one, sentence two and some labels, and
00:39:50.780 | what you would do, but I'm not going to run this, is you would write pairs to CSV.
00:39:56.380 | You don't need to do this if you're running everything in the same notebook,
00:40:00.580 | but it's probably a good idea. So with to_csv I'm going to say the silver data is a tab separated file, and
00:40:09.900 | obviously the separator for that type of file is a tab character, and I don't want to include the index in there.
00:40:20.780 | Okay, and that will create the the silver data file that we can train with
00:40:27.060 | Which I do already have so
00:40:30.820 | We come over here
00:40:36.260 | we can we can see that I have this file and
00:40:40.260 | We have all of these
00:40:43.180 | different
00:40:46.580 | Sentence pairs and the scores that our encoder has assigned to them
00:40:52.100 | So I'm going to close that I'm going to go back to the demo
00:40:59.580 | What I'm now going to do is actually
00:41:01.580 | First go back to the flow chart that we had. I'm going to cross off predict labels
00:41:12.620 | We're going to go ahead and fine-tune the bi-encoder on both gold and silver data.
00:41:17.980 | So we have the gold data
00:41:21.460 | Let's have a look at we have
00:41:24.500 | Yes, and the silver I'm going to load that from file. So
00:41:28.620 | pd.read_csv,
00:41:32.540 | silver.tsv,
00:41:36.500 | the separator is a tab
00:41:42.660 | character, and
00:41:44.660 | Let's have a look what we have make sure it's all loaded correctly looks good
00:41:49.420 | now I'm going to do is
00:41:52.140 | Put both those together. So all data is equal to gold and
00:41:57.380 | silver and
00:41:59.180 | We ignore the index so they get an index error
00:42:02.580 | Sorry true and
00:42:07.020 | All data is got head
00:42:11.460 | Okay, we can see that we hopefully now have all of the all of the data in this check the length
00:42:18.500 | So it's definitely bigger a bigger data set now before which is gold
00:42:26.940 | okay, so we now have a larger data set and we can go ahead and use that to fine-tune the
00:42:36.740 | bi-encoder or sentence transformer. So what I'm going to do is take the code from up here,
00:42:43.340 | so we have this train and data and
00:42:47.020 | I think I've already run this for so I don't need to import the input example here
00:42:54.060 | But what I want to do here is for row in
00:42:58.060 | All data and
00:43:02.540 | What we actually want to do here is for i, row in all_data.iterrows(), because this is a data frame;
00:43:08.340 | iterrows
00:43:10.780 | iterates through each row. We have row sentence one, sentence two and also a label.
00:43:20.100 | We load them into our train data, and we can have a look at that train data,
00:43:26.540 | See what it looks like
00:43:31.780 | Okay, we see that we get all these
00:43:33.780 | InputExample objects. If you want to see what one of those has inside, you can access the text
00:43:41.420 | like this
00:43:44.020 | Should probably do that on a
00:43:46.100 | New cell. So let me pull this down here and you can also access a label
00:43:51.900 | To see what we what we have in there
00:43:57.780 | So that looks good and we can now take that like we did before and load it into a data loader
00:44:04.700 | So, let me go up again and we'll we'll copy that
00:44:08.180 | Where are you?
00:44:11.380 | Take this
00:44:16.860 | Bring it down here and
00:44:22.180 | Run this, it creates our data loader, and we can move on to actually initializing the sentence transformer or bi-encoder
00:44:30.140 | And actually training it. So once we run from sentence transformers, we're going to import
00:44:38.100 | Models and also going to import sentence transformer
00:44:42.740 | now to initialize our sentence transformer, if you've been following along with the series of videos and articles,
00:44:52.780 | you'll know that we do something that looks like this.
00:44:56.060 | So we're going to have bert, and that is going to be models.Transformer;
00:45:04.700 | here we're just loading a model from Hugging Face transformers, so bert-base-uncased, and
00:45:09.380 | We also have our pooling layer. So models again, and we have pooling and
00:45:15.100 | in here we want to include the
00:45:19.300 | dimensionality of the
00:45:21.300 | the vectors that the pooling
00:45:24.280 | layer should expect, which is just going to be bert.get_word_embedding_dimension(), and
00:45:32.580 | also it needs to know what type of pooling we're going to use, whether we're going to use
00:45:39.040 | CLS pooling, mean pooling, max pooling or so on. Now we are going to use
00:45:47.180 | mean pooling, so we're going to set
00:45:49.180 | pooling_mode_mean_tokens, let me set that to true.
00:45:53.500 | so they're the two
00:45:56.340 | Let's say components in our in our sentence transformer and we need to now put those together
00:46:02.440 | So we're gonna call model equals
00:46:05.100 | sentence
00:46:06.980 | transformer and
00:46:08.860 | we write modules and
00:46:10.860 | then we just pass, as a list, bert and also pooling.
00:46:18.140 | So we run that, and we can also have a look at what the model looks like.
00:46:23.160 | Okay, and you see we have a SentenceTransformer object, and inside there we have two
00:46:29.900 | layers or components: the first one is a transformer,
00:46:33.180 | It's a BERT model and the second one is our pooling and we can see here
00:46:37.540 | the only pooling method that is set to true is the
00:46:41.820 | Mode mean tokens, which means we're going to take the mean across all the word
00:46:45.740 | Embeddings output by BERT and use that to create our sentence embedding or vector
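A minimal sketch of that bi-encoder definition with the sentence-transformers library:

```python
from sentence_transformers import SentenceTransformer, models

bert = models.Transformer("bert-base-uncased")
pooling = models.Pooling(
    bert.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,  # mean pooling over the output token embeddings
)
model = SentenceTransformer(modules=[bert, pooling])
```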
00:46:53.940 | with that model now defined we can
00:46:56.640 | Initialize our loss function. So we do want to write from sentence
00:47:03.180 | transformers
00:47:06.300 | losses import CosineSimilarityLoss,
00:47:11.300 | so cosine
00:47:13.300 | similarity loss, and in here we need to pass the model, so it understands which parameters to actually optimize.
00:47:19.900 | Initialize that, and then we call our
00:47:24.060 | training function or the fit function and
00:47:27.660 | That's similar to before the cross encoder although slightly different. So let me let me take that. It's a little further up
00:47:39.420 | From here
00:47:41.420 | Then take that and we're just gonna modify it so
00:47:45.340 | Warm-up: I'm going to warm up for 15% of the number of steps now that we're going to run through,
00:47:53.060 | we change this to model, it's not ce anymore, and,
00:47:58.220 | like I said, there are some differences here. So we have train_objectives, that's different, and
00:48:06.460 | this is just a list of all the training objectives we have; we are only using one, and we just pass
00:48:12.660 | loader and
00:48:14.780 | loss into that
00:48:16.780 | Evaluator we could use an evaluator. I'm not going to
00:48:21.500 | For this one, I'm going to evaluate everything afterwards the
00:48:26.420 | epochs and warmup steps are the same. The only thing that's different is the output path, which is going to be bert-stsb.
00:48:35.180 | That's it, so go ahead and run that; it should run, let's check that it does.
00:48:42.040 | Okay, so I've got this error here so it's lucky that we we checked and
00:48:49.780 | we have this runtime error: found dtype Long but expected Float, and
00:48:54.300 | If we come up here, it's going to be in the data load or in the data that we've
00:48:59.460 | Initialized so I
00:49:03.140 | Yeah, so here I've put int for some reason; I'm not sure why I did that.
00:49:08.420 | So this should be a float, the label in your training data,
00:49:12.180 | And that should be the same
00:49:14.780 | Up here as well
00:49:17.300 | Okay, so here as well, for the cross encoder, we would expect a float value.
00:49:29.980 | So just be aware of that; I'll make sure there's a note earlier on in the video for that.
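Putting the bi-encoder training together, a minimal sketch assuming `model` and the gold-plus-silver `loader` from above, with the labels already cast to Python floats to avoid the dtype error just mentioned; the output path is just an illustrative name:

```python
from sentence_transformers import losses

loss = losses.CosineSimilarityLoss(model)

epochs = 1
warmup = int(len(loader) * epochs * 0.15)  # 15% warm-up this time

model.fit(
    train_objectives=[(loader, loss)],  # loader built from gold + silver InputExamples
    epochs=epochs,
    warmup_steps=warmup,
    output_path="bert-stsb-aug",
)
```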
00:49:36.520 | okay, and
00:49:39.780 | Okay, let's continue through that and try and rerun it should be okay now. Oh
00:49:47.140 | I need to actually rerun everything else as well
00:49:51.500 | So rerun this
00:49:56.300 | Okay label 1.0
00:49:58.620 | This is
00:50:03.260 | This for a moment just to be sure that is actually running this time, but it does
00:50:11.020 | look good, so
00:50:20.460 | Looks good when for some reason in the notebook I'm actually seeing the number of iterations, but okay
00:50:27.580 | Yeah, pause it now and we can see that. Yes. It did run through two iterations. So it is running
00:50:32.760 | Correctly now, that's good
00:50:37.300 | That's great
00:50:38.860 | What I want to do now is actually show you the evaluation of these models. So,
00:50:45.380 | back to our flow chart quickly. Okay, so fine-tune bi-encoder, we've just done it.
00:50:50.880 | So we've now finished with the in-domain
00:50:54.340 | augmented SBERT
00:50:57.200 | training strategy and
00:50:59.260 | Yeah, let's move on to the evaluation. Okay, so my evaluation script here
00:51:08.380 | Maybe not the easiest to to read
00:51:14.460 | But basically all we're doing is we're importing the
00:51:18.420 | embedding similarity evaluator down here,
00:51:21.580 | I'm loading the GLUE data,
00:51:24.980 | STSB again, and we're taking the validation split, which we didn't train on,
00:51:29.420 | we convert it into input examples, feed it into our embedding similarity evaluator, and
00:51:35.540 | load the model; the model name is passed through command line
00:51:43.980 | arguments
00:51:45.340 | up here and
00:51:47.340 | Then it just prints out the score.
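The core of that evaluation script looks roughly like this sketch, with the model name being whatever is passed on the command line:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, InputExample
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

dev = load_dataset("glue", "stsb", split="validation")
dev_examples = [
    InputExample(texts=[r["sentence1"], r["sentence2"]], label=r["label"] / 5.0)
    for r in dev
]
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_examples, name="stsb-dev")

model = SentenceTransformer("bert-stsb-aug")  # model name/path from the command line argument
print(evaluator(model))  # correlation between predicted and true similarity scores
```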
00:51:51.200 | So let me switch across to the command line, and we can see how that actually performs.
00:51:58.340 | Okay, so just switch across to my other desktop because this is much faster so I can I can actually
00:52:06.060 | Run this quickly. So Python and zero three, so we're gonna run that valuation script
00:52:14.460 | we're going to pass here is
00:52:16.460 | we have all three: the cross encoder, the
00:52:21.180 | sentence transformer trained using
00:52:24.460 | augmented SBERT, and also a sentence transformer trained purely on the gold data set. So first let's have a look at the
00:52:32.140 | bert-stsb gold data set trained
00:52:36.860 | model, so
00:52:40.740 | Run this might take a moment to download it
00:52:43.860 | Okay, so everything downloaded, and then we've got a score of 0.506. So the
00:52:52.220 | predictions
00:52:54.420 | of the model correlate to the actual scores
00:52:58.580 | with a sort of 50%
00:53:01.660 | correlation. So they do correlate; it's not bad, it's not great either. Let's have a look at the cross encoder.
00:53:11.940 | Again
00:53:14.460 | Encoder
00:53:16.800 | Okay, and we get a score of
00:53:19.020 | 0.58, so, as we'd expect, training on just the gold data,
00:53:24.420 | the cross encoder does outperform the bi-encoder or sentence transformer. And
00:53:30.260 | The final one would be okay
00:53:33.200 | with the augmented data
00:53:36.420 | How does the sentence transformer perform?
00:53:40.980 | Let's run that, again we wait for it to download, and
00:53:44.020 | we get a much better score of
00:53:48.660 | zero point six nine, so
00:53:52.300 | yeah, I mean the correlation here is much higher with the augmented data
00:53:58.340 | than if we had just used the gold data set, so it really has improved the performance a lot. Now,
00:54:05.660 | this is maybe an atypical performance increase, it's something like an 18 or 19
00:54:11.400 | point increase in
00:54:14.180 | performance, and that's good. But if you look at the original paper, from Reimers and co., they
00:54:22.420 | Found a sort of expected performance increase of I believe seven or nine
00:54:30.220 | Points. So this is definitely pretty significant. This is definitely a bit more than that
00:54:35.940 | But I think it goes show how good
00:54:39.100 | these models or this training strategy can actually be so
00:54:44.140 | That's it for this video. I hope this has been useful and I hope this
00:54:51.700 | helps a few of you kind of
00:54:55.620 | Overcome the sometimes lack of data that we find and I think a lot of our
00:55:01.940 | Particular use cases. Thank you very much for watching and I will see you in the next one