
Making The Most of Data: Augmented SBERT


Chapters

0:00
7:01 Language
7:28 Data Augmentation Techniques
9:28 Contextual Word Embeddings
12:14 Cross Encoder
15:16 Data Augmentation
28:23 Create that Unlabeled Data Set
34:52 Remove any Duplicates
35:53 Predicting the Labels of Our Cross Encoder
45:11 Pooling Layer

Whisper Transcript | Transcript Only Page

00:00:00.000 | In this video, we're going to have a look at how we can make the most of the limited data
00:00:05.360 | using
00:00:07.280 | language data augmentation strategies and training approaches
00:00:12.080 | More specifically, we're going to focus on something called Augmented SBERT. So
00:00:18.000 | You may or may not be aware that the past decade has been sort of a renaissance or explosion in the field
00:00:26.400 | of machine learning and data science and
00:00:29.520 | a lot of
00:00:31.200 | that, especially early progress with things like the
00:00:34.840 | perceptron and recurrent neural networks, a lot of that was researched and discovered
00:00:41.080 | back in the 50s and 60s and 70s, but we
00:00:46.400 | didn't see that really applied in industry or
00:00:50.640 | anywhere really until
00:00:53.440 | the past decade and
00:00:56.360 | There are two main reasons for this. So the first is that we didn't have enough compute power back in the
00:01:04.280 | 50s 60s 70s to train the models that we needed to train and we also didn't have the data to actually
00:01:11.320 | train those models now
00:01:13.720 | compute power is
00:01:16.120 | Not really a problem anymore. We sort of look at this graph
00:01:20.960 | It depends on what model you're training, of course; if you are OpenAI and you're training GPT-4 or 5 or whatever,
00:01:28.360 | yeah, maybe compute power is pretty relevant, but for most of us we can get access to
00:01:36.280 | Cloud machinery not personal machines and we can wait a few hours or a couple of days and
00:01:45.120 | fine-tune or pre-train a transformer model
00:01:49.520 | that has
00:01:51.320 | good enough performance for what we need.
00:01:53.520 | Now that obviously wasn't always the case until very recently back in
00:02:00.060 | the 1960s. You see on this graph here we have the
00:02:05.780 | IBM 704, and you can see on the y-axis we have floating-point operations per second,
00:02:14.480 | and that's a logarithmic scale, so on a linear scale it basically looks like a flat line until a few years ago and then shoots up,
00:02:24.680 | pretty impressive how much progress is made in terms of
00:02:29.680 | computing power now
00:02:32.480 | Like I said, that's you know, not really an issue for us anymore
00:02:36.880 | we you know, we have the compute in most cases to do what we need to do and
00:02:43.200 | Data is not as much of a problem anymore, but we'll talk about that in a moment
00:02:48.800 | so data again, we have a very
00:02:52.200 | Big increase in data not quite as big as the computing power and this this graph here doesn't go quite as far back
00:03:00.880 | It's only 2010
00:03:02.960 | where I believe it was at 2 zettabytes, and it is now
00:03:09.360 | far larger in 2021 or so, so
00:03:13.800 | there's a
00:03:15.760 | Fairly big increase not quite as much as compute power over time, but still pretty massive now the
00:03:23.160 | thing with data is
00:03:26.160 | yes, there's a lot of data out there, but is there that much data out there for
00:03:32.280 | What we need to train models to do and in a lot of cases
00:03:37.640 | Yes, there is but it really depends on what you're doing if you are
00:03:41.760 | Focusing on
00:03:45.640 | The more niche domains. So what I have here on the left over here are a couple of niche domains
00:03:53.360 | There's not that much data out there on
00:03:57.120 | sentence pairs for climate evidence and claims for example, so where you have a
00:04:04.240 | piece of evidence and a claim, and whether the evidence supports the claim or not.
00:04:08.960 | There is a very small data set called the Climate FEVER data set, but it's not big.
00:04:13.520 | For agriculture, I assume within that industry. There's not that much data, although I
00:04:19.680 | Have never worked in that industry. So I
00:04:22.800 | I'm not fully aware, I just assume there's probably not that much, and then also niche finance,
00:04:30.160 | Which I do at least have a bit more experience with and I imagine this is probably something though
00:04:35.360 | A lot of you will find useful as well
00:04:37.360 | Because a finance is a big industry. There's a lot of finance data out there, but there's a lot of niche
00:04:44.720 | Little projects and problems in finance where you find much less data
00:04:50.200 | So, yes, we have a lot more data nowadays, but we don't have
00:04:56.000 | Enough for a lot of use cases on the right here. We have a couple of examples of low resource data sets
00:05:01.920 | so we have debates from the Maldives and also low-resource
00:05:05.520 | languages as well. So with these
00:05:09.280 | we kind of need to find a different approach. Now we can
00:05:14.440 | investigate, depending on your use case, unsupervised learning with TSDAE, which we have covered in a previous video and article, and
00:05:23.320 | That does work when you're trying to build a model that recognizes
00:05:28.760 | Generic similarity and it works very well as well. But for example with the climate claims data
00:05:36.000 | We are not necessarily trying to match sentence a and B based on their semantic similarity
00:05:43.160 | but we're trying to match sentence A, which is a claim, to
00:05:46.680 | Sentence B, which is evidence as to whether that evidence supports the claim or not
00:05:53.080 | so in that case
00:05:55.080 | an unsupervised approach like TSDAE doesn't really work.
00:06:01.440 | what we have is very little data and
00:06:03.720 | We there aren't really any
00:06:06.720 | Alternative training approaches that we can use. So basically what we need to do is
00:06:12.520 | Create more data now
00:06:16.520 | augmentation is
00:06:18.800 | Difficult particularly for language so data augmentation is not specific to NLP
00:06:25.560 | it's used across ML and it's more established in the field of computer vision and
00:06:32.320 | That makes sense because computer vision you say you have an image you can modify that image using a few
00:06:40.080 | different approaches and
00:06:42.240 | The person can still look at that image and think okay, that is the same image
00:06:46.040 | It's just maybe it's rotated a little bit. We've changed a color grading
00:06:49.640 | brightness
00:06:51.920 | Or something along those lines just modified it slightly, but it's still in essence the same image
00:07:01.120 | for language, it's a bit difficult because language is very
00:07:04.840 | abstract and
00:07:07.360 | nuanced, so if you start randomly changing
00:07:10.960 | certain words
00:07:13.160 | the chances are you're going to produce something that doesn't make any sense and
00:07:17.040 | We when we're augmenting our data, we don't want to just throw rubbish into our model. We want
00:07:22.960 | Something that makes sense. So
00:07:26.560 | there are
00:07:28.680 | some data augmentation techniques and
00:07:31.320 | We'll have a look at a couple of the simpler ones now, so
00:07:36.360 | There is a library called nlpaug which I think is very good for this sort of thing. It's
00:07:43.240 | essentially a library that allows us to do data augmentation for NLP and
00:07:48.520 | What you can see here is two methods using word2vec
00:07:55.640 | Vectors and similarity and what we're doing is taking this original
00:08:01.520 | Sentence so the quick brown fox jumps over the lazy dog
00:08:07.120 | And we're just inserting some words
00:08:11.240 | using word2vec. So we're trying to find which words word2vec thinks could go in here,
00:08:16.400 | which words are the most similar to the surrounding words and
00:08:20.160 | We have this
00:08:23.280 | 'Alessiari', which I don't know, I think it seems like a name to me, but I'm not sure,
00:08:28.680 | and I don't think it really fits there, so it's not great, it's not perfect.
00:08:37.280 | Lazy superintendents dog that does kind of make sense. I feel like a
00:08:41.480 | lazy superintendents dog is
00:08:44.440 | Maybe a stereotype or I'm sure it's been in The Simpsons or something before
00:08:50.580 | So, okay, fair enough, I can see how that got in there, but again, it's a bit weird. It's not great.
00:08:58.040 | Substitution for me seems to work better
00:09:02.080 | So rather than the quick brown fox, we have the easy brown fox and rather than jumping over the lazy dog
00:09:08.840 | Jumps around the lazy dog which changes the meaning slightly
00:09:13.160 | Easy is a bit weird there to be fair
00:09:17.720 | We still have a sentence that kind of makes sense
00:09:21.000 | So that's good. I think now we don't
00:09:26.000 | have to use word2vec, you can also use contextual word embeddings like with BERT, and
00:09:31.560 | For me, I think these results look better. So for insertion
00:09:36.520 | We get even the quick brown fox usually jumps over the lazy dog. So we're adding some words that make sense
00:09:44.080 | that's, I think, good. For substitution we are changing one word here, and
00:09:51.520 | We're changing that to a little quick brown fox instead of just quick brown fox
00:09:56.880 | So I think that makes sense. And this is a good way of augmenting your data, creating more data from less.
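For reference, a minimal sketch of this kind of augmentation using the nlpaug library; the model name and sentence are just illustrative choices, and newer nlpaug versions return a list of strings rather than a single string:

```python
import nlpaug.augmenter.word as naw

text = "The quick brown fox jumps over the lazy dog"

# contextual (BERT-based) augmentation, as described above
insert_aug = naw.ContextualWordEmbsAug(model_path="bert-base-uncased", action="insert")
subst_aug = naw.ContextualWordEmbsAug(model_path="bert-base-uncased", action="substitute")

print(insert_aug.augment(text))  # inserts words that BERT thinks fit the context
print(subst_aug.augment(text))   # swaps words for contextually similar ones
```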
00:10:05.320 | but for us because we are using
00:10:10.360 | sentence pairs, we can basically just take all of the data from, say, we have
00:10:21.760 | a table over here, imagine this is a data frame, and we have all of these
00:10:27.400 | Sentence A's and we have all these sentence B's now
00:10:31.920 | if we take
00:10:34.040 | One sentence a it's already matched up to one sentence B and what we can do is say
00:10:39.360 | okay, I want to randomly sample some other sentence B's and
00:10:43.480 | match them up to
00:10:47.200 | Sentence a so, you know, we have
00:10:49.880 | Three more pairs now. Okay, so if we if we did this if we took
00:10:55.560 | three sentence A's three sentence B's and
00:10:59.140 | we made new pairs from all of them, so not really random sampling, just taking all the possible pairs, then we end up with
00:11:10.680 | nine pairs in total, which is much better. If you extend that a bit further,
00:11:17.760 | so from just a thousand pairs
00:11:20.160 | We can end up with one million pairs
00:11:24.600 | so you can see quite quickly you can you can take a small data set and very quickly create a
00:11:31.160 | Big data set with it. Now. This is just one part of the the problem though because our
00:11:38.120 | Smaller data set will have similarity scores or
00:11:43.000 | natural language inference labels
00:11:45.880 | But the the new data set that we've just created the augmented data set doesn't have any of those
00:11:53.000 | We just randomly sampled new sentence pairs, so there are no scores or labels, and we need those to actually train a model.
00:12:02.480 | What we can do is take a slightly different approach or add another step
00:12:10.380 | Now that other step is using something called a cross encoder so in
00:12:17.740 | semantic similarity
00:12:20.780 | We can use two different types of models. We can use a cross encoder over here
00:12:27.340 | or we can use a bi-encoder, or what I would usually call a sentence transformer. Now a
00:12:36.940 | cross encoder is
00:12:38.860 | the old way of doing it and
00:12:41.380 | it works by
00:12:44.540 | Simply putting sentence a and sentence B into a BERT model together at once
00:12:50.700 | so we have sentence A, a separator token, sentence B, feed that into a BERT model, and
00:12:55.220 | From that BERT model. We will get all of our
00:12:58.420 | Embeddings output embeddings over here and they all get fed into a linear layer
00:13:04.420 | Which converts all of those into a similarity score up here?
00:13:08.900 | now that similarity score is typically going to be more accurate than a
00:13:14.940 | similarity score that you get from a bi-encoder or sentence transformer.
00:13:22.100 | The problem here is
00:13:24.580 | from our sentence transformer, we are outputting sentence vectors and
00:13:31.660 | If we have two sentence vectors, we can perform a cosine similarity or Euclidean distance
00:13:39.420 | calculation to get the similarity of those two vectors and
00:13:45.220 | cosine similarity
00:13:48.460 | Calculation or operation is
00:13:51.060 | much quicker than a full
00:13:55.420 | BERT inference step, which is what we need with a cross encoder. So I
00:14:01.340 | Think it is something like
00:14:03.340 | for clustering maybe 10,000
00:14:08.260 | sentences, using a BERT cross encoder would take you something like 65 hours,
00:14:15.140 | whereas with a bi-encoder it's going to take you about five seconds. So it's
00:14:21.700 | much much quicker
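To make the speed and accuracy trade-off concrete, here is a small sketch comparing the two approaches with the sentence-transformers library; the two pretrained model names are just common examples, not the models from the video:

```python
from sentence_transformers import SentenceTransformer, util
from sentence_transformers.cross_encoder import CrossEncoder

a = "The quick brown fox jumps over the lazy dog"
b = "A fast brown fox leaps over a sleepy dog"

# bi-encoder: encode each sentence once, then compare vectors with a cheap cosine similarity
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
vec_a, vec_b = bi_encoder.encode([a, b])
print(util.cos_sim(vec_a, vec_b))

# cross encoder: a full BERT forward pass for every pair, slower but usually more accurate
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-base")
print(cross_encoder.predict([(a, b)]))
```

The bi-encoder's vectors can be cached and reused, which is why comparing thousands of sentences stays fast.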
00:14:26.900 | That's why we use bi-encoders or sentence transformers. Now
00:14:31.220 | The reason I'm talking about cross encoders is because we get this more
00:14:35.300 | accurate similarity score
00:14:38.340 | which we can use as a label and
00:14:41.020 | Another very key thing here is that we need less data
00:14:45.460 | to train a cross encoder than a bi-encoder. I think the SBERT model
00:14:52.140 | Itself was trained on something like 1 million
00:14:57.500 | sentence pairs, and some newer models are trained on a billion or more,
00:15:01.940 | Whereas a cross encoder we can we can train a reasonable cross encoder on like 5k or maybe even less
00:15:10.620 | Sentence pairs. So we need much less data and
00:15:14.700 | That works quite well. We've been talking about the data augmentation; we can take a small data set,
00:15:19.740 | we can augment it to create more sentence pairs and
00:15:24.060 | Then what we do is train on the original data set, which we call the gold data set
00:15:30.440 | we train our cross encoder using that and
00:15:35.300 | Then we use that fine-tune cross encoder to label
00:15:40.540 | the augmented
00:15:42.740 | data set without labels, and that creates an augmented, labeled data set that we call the silver data set.
00:15:54.980 | This sort of strategy of creating a silver data set, which we would then use to
00:16:01.340 | fine-tune our bi-encoder model, is
00:16:05.360 | What we refer to as the in domain
00:16:10.220 | Augmented
00:16:19.140 | SBERT
00:16:21.140 | Training strategy
00:16:23.460 | Okay, and
00:16:27.100 | this, so what you can see in this flow diagram, is
00:16:30.940 | basically every step that we need to do to
00:16:34.660 | create an in-domain augmented SBERT
00:16:38.080 | training process.
00:16:41.020 | So we we've already described most of this so we get our gold data set the original data set
00:16:48.420 | That's gonna be quite small. Let's say one to five
00:16:51.520 | thousand sentence pairs that are labeled. From there, we're going to use something like random sampling,
00:16:58.140 | Which I'll just call ran
00:17:03.940 | We're going to use that to create a larger data set. Let's say we create something like a hundred thousand
00:17:10.940 | sentence pairs
00:17:13.300 | But these are not labeled. We don't have any
00:17:17.180 | Similarity scores or natural language inference
00:17:20.260 | labels for these.
00:17:25.420 | What we do is we take that gold data set and we take it down here and we fine-tune a cross encoder
00:17:32.420 | Using that gold data because we need less data to train a reasonably good
00:17:37.140 | cross encoder
00:17:39.540 | So we take that and we fine-tune cross encoder and then we use that cross encoder
00:17:45.860 | Alongside our unlabeled data set to create a new
00:17:51.260 | silver data set
00:17:53.980 | Now the cross encoder is going to predict the similarity scores or NLI labels for every pair.
00:18:01.120 | So with that we have our silver data
00:18:07.860 | we also have the gold data, which is up here and
00:18:13.060 | we actually take both those together and
00:18:16.140 | we fine-tune
00:18:19.060 | the by encoder or the sentence transformer on
00:18:22.380 | both the gold data and the silver data now one thing I would say here is it's useful to
00:18:30.260 | Separate some of your gold data at the very start. So don't even train your cross encoder on those
00:18:38.780 | it's good to separate them as your evaluation or test set, and
00:18:43.260 | then evaluate both the cross encoder performance and also your bi-encoder performance on that separate set.
00:18:52.060 | So don't include that in your training data for any of your models
00:18:55.300 | Keep that separate, and then you can use it to figure out: is this working or is it not working?
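A minimal sketch of holding out that evaluation split, using the STS-B data from this video as the example gold set (if your own gold data has no ready-made validation split):

```python
from datasets import load_dataset

gold = load_dataset("glue", "stsb", split="train")      # the gold data set
splits = gold.train_test_split(test_size=0.1, seed=42)  # hold out 10% for evaluation
gold_train, gold_eval = splits["train"], splits["test"]
# gold_eval is never shown to the cross encoder or the bi-encoder during training
```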
00:19:03.740 | that is
00:19:05.580 | in-domain augmented SBERT, and
00:19:08.860 | as you can see, this is the same as what we saw before, just another view of the training approach. So we have the gold-data
00:19:20.020 | trained cross encoder
00:19:22.860 | we have our unlabeled pairs, which come from random sampling on the gold data,
00:19:27.380 | we process those with the cross encoder to create a silver data set, and then the silver and
00:19:32.660 | the gold
00:19:34.660 | come over here to fine-tune a bi-encoder.
00:19:40.380 | that's it for the
00:19:42.380 | theory and the concepts and
00:19:44.780 | Now what I want to do is actually go through the code and and work to an example of how we can actually do this
00:19:53.060 | okay, so we
00:19:55.860 | downloaded both the training and the validation set of the STSB data, and
00:20:03.060 | Let's have a look at what some of that data looks like. So
00:20:10.500 | So we have sentence pair sentence one sentence two
00:20:14.060 | Just a simple sentence and we have a label which is our similarity score now
00:20:19.700 | that similarity score varies from between 0 up to 5 where 0 is
00:20:25.540 | no similarity no
00:20:28.380 | relation between the two sentence pairs and
00:20:31.860 | 5 is they mean the same thing.
00:20:39.260 | Now we can see here that these two mean the same thing, as we would expect.
00:20:51.260 | first want to modify that
00:20:53.260 | Score a little bit because we are going to be training
00:20:57.620 | using cosine similarity loss and we would expect our
00:21:01.020 | Label to not go up to a value of 5 but instead go to value 1. So
00:21:07.980 | All I'm doing here is
00:21:10.500 | changing that score so that we are dividing everything by 5, normalizing everything,
00:21:18.580 | so we do that and no problem and
00:21:22.820 | Now what we can do is load our training data into a data loader. So to do that we first
00:21:30.180 | form everything into an InputExample and then load that into our PyTorch data loader.
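A minimal sketch of that data preparation, assuming the GLUE STS-B training split and an illustrative batch size:

```python
from datasets import load_dataset
from torch.utils.data import DataLoader
from sentence_transformers import InputExample

stsb = load_dataset("glue", "stsb", split="train")
stsb = stsb.map(lambda x: {"label": x["label"] / 5.0})  # normalize the 0-5 score to 0-1

train_examples = [
    InputExample(texts=[row["sentence1"], row["sentence2"]], label=float(row["label"]))
    for row in stsb
]
loader = DataLoader(train_examples, shuffle=True, batch_size=16)
```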
00:21:38.380 | So I run that and then at the same time during training I also want to
00:21:48.140 | output an evaluation score, so how did the cross encoder do on the evaluation data?
00:21:54.940 | so to do that I
00:21:58.140 | import, so here we're importing from sentence transformers cross encoder
00:22:04.860 | evaluation, I'm importing the CE
00:22:08.580 | correlation evaluator (CECorrelationEvaluator).
00:22:11.340 | Again, I'm using input examples when working with the sentence transformers library,
00:22:18.460 | And I
00:22:20.460 | importing both text and the labels and
00:22:25.100 | Here I
00:22:29.260 | putting all that development or validation data into that
00:22:35.860 | evaluator, okay, and I can run that and
00:22:39.420 | Then we can move on to initializing a cross encoder and training it and also evaluating it. So
00:22:47.020 | to do that, we're going to
00:22:49.020 | import from sentence transformers, so from
00:22:52.820 | sentence transformers, and I'll make sure I'm working in Python,
00:23:00.060 | I'm going to import from cross encoder a
00:23:03.900 | Cross encoder. Okay, and
00:23:08.380 | To initialize that cross encoder model, I'll call it ce,
00:23:13.380 | all I
00:23:15.940 | need to do is write CrossEncoder, very similar to when we write SentenceTransformer to initialize a model,
00:23:22.180 | we specify
00:23:27.220 | a model from Hugging Face transformers that we'd like to
00:23:30.140 | initialize a cross encoder from, so bert-base-uncased, and also the number of labels that we'd like to use.
00:23:37.740 | so in this case, we are just targeting a
00:23:42.940 | Similarity score between zero and one. So we just want a
00:23:46.540 | Single label that if we were doing for example, NLI
00:23:50.980 | labels where we have
00:23:53.820 | entailment contradiction and
00:23:55.980 | Neutral labels or some other labels and we would change this to for example three, but in this case one
00:24:03.700 | We can initialize our cross encoder
00:24:07.220 | and then from now we move on to actually training so we call model or see dot fit and
00:24:13.460 | We want to specify
00:24:16.220 | the data loader, so it's slightly different to the fit function we usually use with sentence transformers.
00:24:22.460 | So we want train data loader
00:24:24.820 | we specify our loader that we
00:24:28.180 | Initialize just up here the data loader
00:24:33.700 | Don't need to do this. But if you are going to evaluate your
00:24:36.860 | Model during training you also want to add in evaluator as well
00:24:42.020 | So this is from the CE correlation evaluator, making sure here we're using a cross encoder
00:24:48.500 | evaluation class
00:24:50.620 | we would like to
00:24:52.620 | run for
00:24:55.220 | Say one epoch and we should define this because I would also like to
00:25:01.500 | While we're training I would also like to include some warm-up steps as well.
00:25:06.780 | I'm going to include a lot of warm-up steps actually, and I'll talk about it in a moment. So I
00:25:13.540 | Would say number of epochs
00:25:16.380 | Is equal to one and for the warm-up I
00:25:22.740 | would like to take integer so the length of loader so the number of
00:25:28.780 | batches that we have in our data set, and I'm going to multiply this by
00:25:33.540 | 0.4. So I'm going to
00:25:36.700 | do warm-up steps for 40% of our
00:25:42.700 | total data set size, or 40% of our total number of batches, and
00:25:48.980 | We also need to multiply that by number of epochs. Let's say for training two epochs
00:25:54.260 | We do multiply that in this case just one so not necessary, but it's there
00:25:58.900 | so we're actually
00:26:01.820 | forming warm-up for 40% of the
00:26:05.100 | Training steps and I found this works better than something like 10% 15% 20%
00:26:11.020 | However that being said I
00:26:14.740 | Think you could also achieve a similar result by just decreasing the learning rate of your model. So
00:26:23.940 | By default, so if I write the epochs here and we'll define the warm-up steps
00:26:29.940 | with warmup_steps, by default we use optimizer_params
00:26:38.540 | with a learning rate of
00:26:42.620 | 2e-5.
00:26:48.180 | okay, so if you
00:26:50.860 | Say want to decrease that a little bit you could go. Let's say
00:26:56.340 | 2e-6 or 5e-6, and this would probably have a similar effect to having
00:27:03.300 | such a significant number of warm-up steps, and then in this case you could decrease this to 10% or so.
00:27:09.340 | But for me, the way I've tested this, I've ended up going with 40% warm-up steps, and that works quite well.
00:27:18.980 | so the final step here is
00:27:21.220 | Where do we want to save our model? So I'm going to say I want to save it into the base
00:27:28.100 | cross encoder
00:27:30.860 | Or let's say
00:27:34.500 | STSB cross encoder and
00:27:36.500 | We can run that and that will
00:27:40.460 | Run everything for us. I'll just make sure it's actually yep. There we go
00:27:45.780 | So see it's running, but I'm not gonna run it because I've already done it
00:27:49.020 | so let me pause that and
00:27:54.380 | Move on to the next step
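Pulling the cross encoder steps together, a minimal sketch of the training code described above; it reuses the `loader` built from the gold data earlier, and the output path is just an illustrative name:

```python
from datasets import load_dataset
from sentence_transformers import InputExample
from sentence_transformers.cross_encoder import CrossEncoder
from sentence_transformers.cross_encoder.evaluation import CECorrelationEvaluator

# evaluator built from the validation split, which we never train on
dev = load_dataset("glue", "stsb", split="validation")
dev_examples = [
    InputExample(texts=[r["sentence1"], r["sentence2"]], label=float(r["label"]) / 5.0)
    for r in dev
]
evaluator = CECorrelationEvaluator.from_input_examples(dev_examples, name="stsb-dev")

ce = CrossEncoder("bert-base-uncased", num_labels=1)  # one output: a 0-1 similarity score

num_epochs = 1
warmup = int(len(loader) * num_epochs * 0.4)  # 40% warm-up steps, as discussed above

ce.fit(
    train_dataloader=loader,
    evaluator=evaluator,
    epochs=num_epochs,
    warmup_steps=warmup,
    output_path="bert-stsb-cross-encoder",
)
```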
00:27:57.260 | okay, so we now have our gold data set, which we have pulled from Hugging Face datasets, and
00:28:06.260 | We've just fine-tuned a cross encoder. So
00:28:09.620 | Let's cross both of those off of here
00:28:13.660 | this and this and
00:28:17.300 | Now so before we actually go on to predicting labels with the with the cross encoder
00:28:22.540 | We need to actually create that
00:28:25.060 | unlabeled data set so
00:28:27.380 | let's do that through random sampling using the gold data set you already have and
00:28:32.860 | Then we can move on to the next steps
00:28:38.980 | okay, so I'll just
00:28:41.460 | add a little bit of separation in here. So now we're going to go ahead and create the
00:28:47.900 | augmented data set.
00:28:51.340 | So as I said, we're going to be using random sampling for that, and I find that
00:28:57.580 | the easiest way to do that is to actually go ahead and use a pandas dataframe rather than using the
00:29:05.780 | Data set object that we currently have so I'm gonna go ahead and initialize that so we have our gold data
00:29:13.380 | that will be
00:29:17.340 | pd.DataFrame, and
00:29:19.820 | in here we're going to have sentence one and sentence two. Sentence one
00:29:30.340 | Is going to be equal to
00:29:35.740 | Sentence one
00:29:37.660 | Okay, and as well as that we also have sentence two
00:29:43.740 | going to be STSB
00:29:45.780 | sentence two now we may also want to include our
00:29:52.900 | Label in there, although I wouldn't say this is really necessary
00:29:57.980 | Add it in
00:30:01.380 | So our label is just like this,
00:30:04.340 | and if I have a look here, so we have...
00:30:12.700 | am I going to overwrite anything called gold? It's okay.
00:30:15.500 | So, okay, I'm gonna have a look at that as well so you can see a few examples of what we're actually working with
00:30:25.200 | I'll just go ahead and actually rerun these as well
00:30:30.860 | Okay, so there we have we have our gold data and
00:30:38.100 | now what we can do, because we've
00:30:42.580 | reformatted that into a data frame, we can use the sample method to randomly sample
00:30:48.780 | different sentences so
00:30:51.620 | To do that what I will want to do is create a new data frame
00:30:56.620 | So this is going to be our unlabeled silver data.
00:31:00.460 | So it's not it's not a silver data set yet because we don't have the labels or scores yet
00:31:04.460 | But this is going to be where we we will put them and in here. We again will have sentence one
00:31:14.260 | also sentence
00:31:16.060 | Two but at the moment that they're empty. It's nothing nothing in there yet
00:31:20.280 | so what we need to do is actually iterate through all of the
00:31:24.460 | rows in here, so
00:31:27.060 | Before that I'm just going to do an import,
00:31:31.380 | tqdm.auto,
00:31:34.260 | so from tqdm.auto import tqdm.
00:31:41.060 | That's just a progress bar so you can see you know where we are I don't I don't really like to wait and
00:31:46.580 | have no idea how long this is taking to process and
00:31:50.700 | For sentence one
00:31:56.900 | tqdm so we have the progress bar and I'll take a list of a set
00:32:02.060 | So we're taking all the unique values in the gold data frame for sentence one
00:32:08.260 | Okay, so that will just loop through every single unique sentence of one
00:32:13.820 | Item in there and I'm gonna use that and I'm going to randomly sample
00:32:21.820 | sentences from the other column, sentence two, to be paired with that sentence one. And, yeah, the
00:32:28.780 | sentence two
00:32:33.100 | Phrases that we're going to sample are going to come from the gold data, of course, and we only want to
00:32:40.060 | sample from rows where sentence one is not equal to the current sentence one because otherwise we
00:32:48.680 | Are possibly going to introduce duplicates and we're going to remove duplicates anyway
00:32:53.980 | But let's just remove them from the sampling in the in the first place. So
00:32:58.460 | we're going to
00:33:00.580 | take all of the gold data set where sentence one is
00:33:05.880 | not equal to the current sentence one, and what I'm going to do is just sample five of those rows,
00:33:12.100 | Like that now from that. I'm just going to extract
00:33:17.780 | the sentence two, so the five
00:33:21.220 | sentence two phrases that we have there and I'm going to convert them into a list and
00:33:27.460 | now for
00:33:29.900 | sentence two
00:33:31.580 | In the sampled list that we've just created. I'm going to take my pairs
00:33:37.000 | I'm going to append new pairs, so pairs.append,
00:33:41.040 | and I want sentence one to be sentence one and
00:33:47.060 | Also sentence two is going to be equal to sentence two
00:33:53.040 | now this
00:33:55.580 | Will take a little while
00:33:57.940 | So what I'm going to do is actually
00:33:59.940 | Maybe not include the the full
00:34:03.420 | data set here
00:34:06.380 | So let me possibly just go maybe the first 500
00:34:13.300 | Yeah, let's go to first 500 see how long that takes and
00:34:19.700 | I will also want to just have a look at what we what we get from that
00:34:25.820 | Okay, so yes, it's much quicker
00:34:27.820 | Okay, so we have sentence one. Let me
00:34:33.060 | Remove that from there
00:34:40.060 | Let's just say that top ten, right?
00:34:42.740 | so because we we're taking five of sentence one every time and random sampling it we can see that we have a few of those and
00:34:50.380 | Another thing that we might do is remove any duplicates now.
00:34:55.580 | There probably aren't any duplicates here, but we can check, so pairs equals
00:35:01.780 | pairs.drop_duplicates(), and
00:35:06.820 | Then we'll check the length of pairs again
00:35:12.700 | also prints
00:35:14.700 | Let me run this again and print
00:35:22.580 | So there weren't any duplicates anyway, but it's a good idea to add that in just in case, and
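A minimal sketch of the random-sampling loop just described, assuming the `stsb` gold data loaded earlier and the sentence1/sentence2 column names used in the video:

```python
import pandas as pd
from tqdm.auto import tqdm

gold = pd.DataFrame({
    "sentence1": stsb["sentence1"],
    "sentence2": stsb["sentence2"],
})

new_pairs = []
for s1 in tqdm(list(set(gold["sentence1"]))):
    # sample sentence2 values from rows with a *different* sentence1
    sampled = gold.loc[gold["sentence1"] != s1, "sentence2"].sample(5).tolist()
    for s2 in sampled:
        new_pairs.append({"sentence1": s1, "sentence2": s2})

pairs = pd.DataFrame(new_pairs).drop_duplicates()
```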
00:35:32.420 | Now I want to do is
00:35:36.220 | Actually take the cross encoder. In fact, actually let's go back to our little flow charts. So we have
00:35:44.740 | Now created a larger unlabeled data set
00:35:50.780 | So it's good, and now we go on to predicting the labels with our cross encoder. So down here
00:35:58.140 | What I'm gonna do is take
00:36:01.180 | the cross encoder code here and what I've done is
00:36:06.020 | I've trained this already and
00:36:08.860 | I've uploaded it to
00:36:13.620 | Hugging Face models. So what you can do, and what I can do, is
00:36:18.380 | this, so I'm gonna write jamescalam, and it is called bert-stsb-cross-encoder.
00:36:34.580 | Okay, so that's our cross encoder and
00:36:37.580 | Now what I want to do is
00:36:41.860 | Use that cross encoder to create our labels. So that will create our silver data set now to do that. I
00:36:49.580 | I'm gonna call it silver for now. I mean it is this isn't really the silver data set, but it's fine and
00:36:57.340 | I'm going to create a list and I'm going to zip both of the columns from our pairs,
00:37:03.820 | so pairs sentence
00:37:09.980 | one, and
00:37:13.740 | pairs
00:37:15.700 | sentence two, okay, so
00:37:18.620 | That will give us
00:37:23.180 | all of our pairs; again, you can look at those,
00:37:28.980 | okay, so it's just like this and
00:37:33.900 | What we want to do now is actually create our scores. So just take the cross encoder,
00:37:38.300 | which we loaded as ce, so ce.predict, and
00:37:42.380 | We just pass in that silver data
00:37:47.340 | Do that
00:37:49.100 | Let's run it. It might take a moment
00:37:51.340 | Okay, so it's definitely taking a moment so
00:37:55.620 | Let me pause it. I'm going to just do
00:37:59.500 | Let's say 10 because I already have the full data set so I can I can show you that
00:38:04.300 | Somewhere else and let's have a look at what you have in those scores. So three of them
00:38:11.500 | So we have an array and we we have these these scores. Okay, so
00:38:19.460 | They are our predictions, our similarity predictions, for the first three. Now, because they're randomly sampled, a lot of these are
00:38:26.980 | negative,
00:38:28.780 | so if we go to silver...
00:38:30.900 | when I say negative, I mean more that
00:38:34.820 | they're not relevant. So
00:38:37.420 | Yeah, we can see
00:38:39.900 | Not particularly relevant. Let's just one must first issue with this and you can you can
00:38:46.340 | change you can try and modify that by after creating your scores if you have if you
00:38:54.180 | oversample and
00:38:56.540 | For a lot of values or a lot of records and then just go ahead and remove
00:39:04.020 | most of the low scoring samples and keep all of your high scoring samples that will help you deal with that
00:39:13.300 | imbalance in your data
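A minimal sketch of that rebalancing step, run once the cross encoder scores have been added to a label column as described next; the 0.5 threshold and the fraction kept are hypothetical values to tune for your data:

```python
import pandas as pd

high = pairs[pairs["label"] >= 0.5]                                  # keep all high scorers
low = pairs[pairs["label"] < 0.5].sample(frac=0.2, random_state=42)  # keep only a fraction of low scorers
pairs_balanced = pd.concat([high, low], ignore_index=True)
```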
00:39:16.980 | What I'm going to do is I'm going to add
00:39:19.340 | to the labels column those scores
00:39:23.380 | which
00:39:25.500 | Will not actually cover all of them because we only have ten in here. So
00:39:30.420 | Let me maybe multiply that
00:39:34.140 | So this isn't you shouldn't do this obviously it's just so they fit
00:39:39.740 | Okay, and
00:39:44.020 | Let's have a look. Okay, so we now have sentence one, sentence two and some labels, and
00:39:50.780 | what you would do, but I'm not going to run this, is you would write pairs to CSV.
00:39:56.380 | You don't need to do this if you're running everything in the same notebook,
00:40:00.580 | but it's probably a good idea. So with to_csv I'm going to say the silver data is a tab separated file, and
00:40:09.900 | obviously the separator for that type of file is a tab character, and I don't want to include the index in there.
00:40:20.780 | Okay, and that will create the the silver data file that we can train with
00:40:27.060 | Which I do already have so
00:40:30.820 | We come over here
00:40:36.260 | we can we can see that I have this file and
00:40:40.260 | We have all of these
00:40:43.180 | different
00:40:46.580 | Sentence pairs and the scores that our encoder has assigned to them
00:40:52.100 | So I'm going to close that I'm going to go back to the demo
00:40:59.580 | What I'm now going to do is actually
00:41:01.580 | First go back to the flow chart that we had. I'm going to cross off predict labels
00:41:12.620 | We're going to go ahead and fine-tune the bi-encoder on both gold and silver data.
00:41:17.980 | So we have the gold data
00:41:21.460 | Let's have a look at we have
00:41:24.500 | Yes, and the silver I'm going to load that from file. So
00:41:28.620 | pd.read_csv,
00:41:32.540 | silver.tsv,
00:41:36.500 | the separator is a tab
00:41:42.660 | character, and
00:41:44.660 | Let's have a look what we have make sure it's all loaded correctly looks good
00:41:49.420 | now I'm going to do is
00:41:52.140 | Put both those together. So all data is equal to gold and
00:41:57.380 | silver and
00:41:59.180 | We ignore the index so they get an index error
00:42:02.580 | Sorry true and
00:42:07.020 | All data is got head
00:42:11.460 | Okay, we can see that we hopefully now have all of the all of the data in this check the length
00:42:18.500 | So it's definitely bigger a bigger data set now before which is gold
00:42:26.940 | okay, so we now have a larger data set and we can go ahead and use that to fine-tune the
00:42:36.740 | bi-encoder or sentence transformer. So what I'm going to do is take the code from up here,
00:42:43.340 | so we have this train and data and
00:42:47.020 | I think I've already run this for so I don't need to import the input example here
00:42:54.060 | But what I want to do here is for row in
00:42:58.060 | All data and
00:43:02.540 | What we actually want to do here is for i, row in all_data.iterrows(), because this is a data frame;
00:43:08.340 | iterrows
00:43:10.780 | iterates through each row. We have row sentence one, sentence two and also a label.
00:43:20.100 | We load them into our train data, and we can have a look at that train data,
00:43:26.540 | See what it looks like
00:43:31.780 | Okay, we see that we get all these
00:43:33.780 | InputExample objects. If you want to see what one of those has inside, you can access the text
00:43:41.420 | like this
00:43:44.020 | Should probably do that on a
00:43:46.100 | New cell. So let me pull this down here and you can also access a label
00:43:51.900 | To see what we what we have in there
00:43:57.780 | So that looks good and we can now take that like we did before and load it into a data loader
00:44:04.700 | So, let me go up again and we'll we'll copy that
00:44:08.180 | Where are you?
00:44:11.380 | Take this
00:44:16.860 | Bring it down here and
00:44:22.180 | Run this, it creates our data loader, and we can move on to actually initializing the sentence transformer or bi-encoder
00:44:30.140 | And actually training it. So once we run from sentence transformers, we're going to import
00:44:38.100 | Models and also going to import sentence transformer
00:44:42.740 | now to initialize our sentence transformer, if you've been following along with the series of videos and articles,
00:44:52.780 | you'll know that we do something that looks like this.
00:44:56.060 | So we're going to have bert, and that is going to be models.Transformer;
00:45:04.700 | here we're just loading a model from Hugging Face transformers, so bert-base-uncased, and
00:45:09.380 | We also have our pooling layer. So models again, and we have pooling and
00:45:15.100 | in here we want to include the
00:45:19.300 | dimensionality of the
00:45:21.300 | the vectors that the pooling
00:45:24.280 | layer should expect, which is just going to be bert.get_word_embedding_dimension(), and
00:45:32.580 | also it needs to know what type of pooling we're going to use, whether we're going to use
00:45:39.040 | CLS pooling, mean pooling, max pooling or so on. Now we are going to use
00:45:47.180 | mean pooling, so we're going to set
00:45:49.180 | pooling_mode_mean_tokens, let me set that to true.
00:45:53.500 | so they're the two
00:45:56.340 | Let's say components in our in our sentence transformer and we need to now put those together
00:46:02.440 | So we're gonna call model equals
00:46:05.100 | sentence
00:46:06.980 | transformer and
00:46:08.860 | we write modules and
00:46:10.860 | then we just pass, as a list, bert and also pooling.
00:46:18.140 | So we run that, and we can also have a look at what the model looks like.
00:46:23.160 | Okay, and you see we have a SentenceTransformer object, and inside there we have two
00:46:29.900 | layers or components: the first one is a transformer,
00:46:33.180 | It's a BERT model and the second one is our pooling and we can see here
00:46:37.540 | the only pooling method that is set to true is the
00:46:41.820 | Mode mean tokens, which means we're going to take the mean across all the word
00:46:45.740 | Embeddings output by BERT and use that to create our sentence embedding or vector
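A minimal sketch of that bi-encoder definition with the sentence-transformers library:

```python
from sentence_transformers import SentenceTransformer, models

bert = models.Transformer("bert-base-uncased")
pooling = models.Pooling(
    bert.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,  # mean pooling over the output token embeddings
)
model = SentenceTransformer(modules=[bert, pooling])
```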
00:46:53.940 | with that model now defined we can
00:46:56.640 | Initialize our loss function. So we do want to write from sentence
00:47:03.180 | transformers
00:47:06.300 | losses import CosineSimilarityLoss,
00:47:11.300 | so cosine
00:47:13.300 | similarity loss, and in here we need to pass the model, so it understands which parameters to actually optimize.
00:47:19.900 | Initialize that, and then we call our
00:47:24.060 | training function or the fit function and
00:47:27.660 | That's similar to before the cross encoder although slightly different. So let me let me take that. It's a little further up
00:47:39.420 | From here
00:47:41.420 | Then take that and we're just gonna modify it so
00:47:45.340 | Warm-up: I'm going to warm up for 15% of the number of steps now that we're going to run through,
00:47:53.060 | we change this to model, it's not ce anymore, and,
00:47:58.220 | like I said, there are some differences here. So we have train_objectives, that's different, and
00:48:06.460 | this is just a list of all the training objectives we have; we are only using one, and we just pass
00:48:12.660 | loader and
00:48:14.780 | loss into that
00:48:16.780 | Evaluator we could use an evaluator. I'm not going to
00:48:21.500 | For this one, I'm going to evaluate everything afterwards the
00:48:26.420 | epochs and warmup steps are the same. The only thing that's different is the output path, which is going to be bert-stsb.
00:48:35.180 | That's it, so go ahead and run that; it should run, let's check that it does.
00:48:42.040 | Okay, so I've got this error here so it's lucky that we we checked and
00:48:49.780 | we have this runtime error: found dtype Long but expected Float, and
00:48:54.300 | If we come up here, it's going to be in the data load or in the data that we've
00:48:59.460 | Initialized so I
00:49:03.140 | Yeah, so here I've put int for some reason; I'm not sure why I did that.
00:49:08.420 | So this should be a float, the label in your training data,
00:49:12.180 | And that should be the same
00:49:14.780 | Up here as well
00:49:17.300 | Okay, so here as well, for the cross encoder, we would expect a float value.
00:49:29.980 | So just be aware of that; I'll make sure there's a note earlier on in the video for that.
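Putting the bi-encoder training together, a minimal sketch assuming `model` and the gold-plus-silver `loader` from above, with the labels already cast to Python floats to avoid the dtype error just mentioned; the output path is just an illustrative name:

```python
from sentence_transformers import losses

loss = losses.CosineSimilarityLoss(model)

epochs = 1
warmup = int(len(loader) * epochs * 0.15)  # 15% warm-up this time

model.fit(
    train_objectives=[(loader, loss)],  # loader built from gold + silver InputExamples
    epochs=epochs,
    warmup_steps=warmup,
    output_path="bert-stsb-aug",
)
```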
00:49:36.520 | okay, and
00:49:39.780 | Okay, let's continue through that and try and rerun it should be okay now. Oh
00:49:47.140 | I need to actually rerun everything else as well
00:49:51.500 | So rerun this
00:49:56.300 | Okay label 1.0
00:49:58.620 | This is
00:50:03.260 | This for a moment just to be sure that is actually running this time, but it does
00:50:11.020 | look good, so
00:50:20.460 | Looks good when for some reason in the notebook I'm actually seeing the number of iterations, but okay
00:50:27.580 | Yeah, pause it now and we can see that. Yes. It did run through two iterations. So it is running
00:50:32.760 | Correctly now, that's good
00:50:37.300 | That's great
00:50:38.860 | What I want to do now is actually show you the evaluation of these models. So,
00:50:45.380 | back to our flow chart quickly. Okay, so fine-tune bi-encoder, we've just done it.
00:50:50.880 | So we've now finished with the in-domain
00:50:54.340 | augmented SBERT
00:50:57.200 | training strategy and
00:50:59.260 | Yeah, let's move on to the evaluation. Okay, so my evaluation script here
00:51:08.380 | Maybe not the easiest to to read
00:51:14.460 | But basically all we're doing is we're importing the
00:51:18.420 | embedding similarity evaluator down here,
00:51:21.580 | I'm loading the GLUE data,
00:51:24.980 | STSB again, and we're taking the validation split, which we didn't train on,
00:51:29.420 | we convert it into input examples, feed it into our embedding similarity evaluator, and
00:51:35.540 | load the model; the model name is passed through command line
00:51:43.980 | arguments
00:51:45.340 | up here and
00:51:47.340 | Then it just prints out the score.
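The core of that evaluation script looks roughly like this sketch, with the model name being whatever is passed on the command line:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, InputExample
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

dev = load_dataset("glue", "stsb", split="validation")
dev_examples = [
    InputExample(texts=[r["sentence1"], r["sentence2"]], label=r["label"] / 5.0)
    for r in dev
]
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_examples, name="stsb-dev")

model = SentenceTransformer("bert-stsb-aug")  # model name/path from the command line argument
print(evaluator(model))  # correlation between predicted and true similarity scores
```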
00:51:51.200 | So let me switch across to the command line, and we can see how that actually performs.
00:51:58.340 | Okay, so just switch across to my other desktop because this is much faster so I can I can actually
00:52:06.060 | Run this quickly. So Python and zero three, so we're gonna run that valuation script
00:52:14.460 | we're going to pass here is
00:52:16.460 | we have all three: the cross encoder, the
00:52:21.180 | sentence transformer trained using
00:52:24.460 | augmented SBERT, and also a sentence transformer trained purely on the gold data set. So first let's have a look at the
00:52:32.140 | bert-stsb gold data set trained
00:52:36.860 | model, so
00:52:40.740 | Run this might take a moment to download it
00:52:43.860 | Okay, so everything downloaded, and then we've got a score of 0.506. So the
00:52:52.220 | predictions
00:52:54.420 | of the model correlate to the actual scores
00:52:58.580 | with a sort of 50%
00:53:01.660 | correlation. So they do correlate; it's not bad, it's not great either. Let's have a look at the cross encoder.
00:53:11.940 | Again
00:53:14.460 | Encoder
00:53:16.800 | Okay, and we get a score of
00:53:19.020 | 0.58, so, as we'd expect, training on just the gold data,
00:53:24.420 | the cross encoder does outperform the bi-encoder or sentence transformer. And
00:53:30.260 | The final one would be okay
00:53:33.200 | with the augmented data
00:53:36.420 | How does the sentence transformer perform?
00:53:40.980 | Let's run that, again we wait for it to download, and
00:53:44.020 | we get a much better score of
00:53:48.660 | zero point six nine, so
00:53:52.300 | yeah, I mean the correlation here is much higher with the augmented data
00:53:58.340 | than if we had just used the gold data set, so it really has improved the performance a lot. Now,
00:54:05.660 | this is maybe an atypical performance increase, it's something like an 18 or 19
00:54:11.400 | point increase in
00:54:14.180 | performance, and that's good. But if you look at the original paper, from Reimers and co., they
00:54:22.420 | Found a sort of expected performance increase of I believe seven or nine
00:54:30.220 | Points. So this is definitely pretty significant. This is definitely a bit more than that
00:54:35.940 | But I think it goes show how good
00:54:39.100 | these models or this training strategy can actually be so
00:54:44.140 | That's it for this video. I hope this has been useful and I hope this
00:54:51.700 | helps a few of you kind of
00:54:55.620 | Overcome the sometimes lack of data that we find and I think a lot of our
00:55:01.940 | Particular use cases. Thank you very much for watching and I will see you in the next one