All You Need to Know on Multilingual Sentence Vectors (1 Model, 50+ Languages)
Chapters
0:00 Intro
1:19 Multilingual Vectors
5:55 Multi-task Training (mUSE)
9:36 Multilingual Knowledge Distillation
11:13 Knowledge Distillation Training
13:43 Visual Walkthrough
14:53 Parallel Data Prep
20:23 Choosing a Student Model
24:55 Initializing the Models
30:05 ParallelSentencesDataset
33:54 Loss and Fine-tuning
36:59 Model Evaluation
39:23 Outro
00:00:00.000 |
Today we're going to be having a look at multilingual sentence transformers. We're 00:00:04.480 |
going to look at how they work, how they're trained, and why they're so useful. We're going 00:00:11.200 |
to be focusing on one specific training method, which I think is quite useful because all it 00:00:19.280 |
really needs is a reasonably small data set of parallel data, which is simply translation pairs 00:00:28.800 |
from a source language like English to whichever other language you're using. So obviously, if you 00:00:34.800 |
are wanting to train a sentence transformer in a language that doesn't really have that much data, 00:00:42.640 |
it's particularly sentence similarity data, this can be really useful for actually taking a 00:00:50.400 |
high-performing, for example, English sentence transformer and transferring that knowledge or 00:00:57.680 |
distilling that knowledge into a sentence transformer for your own language. So I think 00:01:05.440 |
this will be pretty useful for a lot of you. And let's jump straight into it. 00:01:12.000 |
Before we really get into the whole multilingual sentence transformer part of the video, 00:01:25.200 |
I just want to give an impression of what these multilingual sentence transformers are actually 00:01:30.000 |
doing. So on here, we can see a single English sentence or brief phrase down at the bottom, 00:01:39.120 |
"isle of plants," and the rest of these are all in Italian. So what we have here are vector 00:01:47.360 |
representations or dense vector representations of these phrases. And a monolingual sentence 00:01:55.680 |
transformer, which is most of the sentence transformers, will only cope with one language. So 00:02:02.160 |
we would hope that phrases that have a similar meaning end up within the same sort of vector 00:02:08.880 |
space. So like we have for "amo le piante" here and "I love plants," these are kind of in the same 00:02:20.240 |
space. A monolingual sentence transformer would do that for similar sentences. So in English, 00:02:29.120 |
we might have "I love plants" and "I like plants," which is actually what we have up here. So this 00:02:35.760 |
here is Italian for "I like plants." And we would hope that they're in a similar area, 00:02:41.040 |
whereas irrelevant or almost contradictory sentences we would hope would be far off 00:02:49.920 |
somewhere else, like our vector over here. So that's how, obviously, a monolingual sentence 00:02:58.160 |
transformer works, and it's exactly the same for a multilingual sentence transformer. 00:03:03.680 |
The only difference is that rather than having a single language, it will comprehend multiple 00:03:10.080 |
languages. And that's what you can see in this visual. So in this example, I have "I love plants" 00:03:18.240 |
and "amo lupiante." They have the same meaning, just in different languages. So that means that 00:03:25.760 |
they should be as close together as possible in this vector space. So here we're just visualizing 00:03:34.400 |
three dimensions. In reality, it'd be a lot more. I think most transformer models go with 768 00:03:42.320 |
dimensions. But obviously, we can't visualize that, so we have 3D here. So we want different 00:03:49.600 |
languages or similar sentences from different languages to end up in the same area. And we 00:03:55.280 |
also want to be able to represent relationships between different sentences that are similar. 00:04:02.640 |
And we can kind of see that relationship here. So we have "mi piacciono le piante" and "amo le piante" 00:04:08.800 |
and "I love plants" are all kind of in the same sort of area. "Mi piacciono le piante," so "I like 00:04:15.760 |
plants," is obviously separated somewhat, but it's still within the same area. And then in the bottom 00:04:24.320 |
left down there, we have "ho un cane arancione," which means "I have an orange dog." So obviously, 00:04:33.760 |
you know, that's really nothing to do with "I love plants." Although I suppose you could say 00:04:38.160 |
you're talking about yourself, so maybe it's a little bit similar. But otherwise, 00:04:42.800 |
they're completely different topics. So that's kind of what we want to build, 00:04:51.280 |
something that takes sentences from different languages and maps them into a vector space, 00:04:57.520 |
which has some sort of numerical structure to represent the semantic meaning of those sentences. 00:05:04.720 |
And it should be language agnostic. So obviously, we can't -- well, maybe we can train on every 00:05:11.120 |
language. I don't know any models that are trained on every single language, 00:05:14.400 |
but we want it to be able to comprehend different languages and not be biased towards 00:05:23.120 |
different phrases in different languages, but just have a very balanced comprehension of all of them. 00:05:29.680 |
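As a concrete aside, here is a minimal sketch of that behaviour using an off-the-shelf multilingual sentence transformer from the sentence-transformers library; the model name and example sentences are illustrative, not the model trained later in this video.

```python
from sentence_transformers import SentenceTransformer, util

# illustrative pretrained multilingual model (assumption, not the model trained in this video)
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

sentences = [
    'I love plants',           # English
    'Amo le piante',           # Italian: "I love plants"
    'Mi piacciono le piante',  # Italian: "I like plants"
    'Ho un cane arancione',    # Italian: "I have an orange dog"
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# cosine similarity of "I love plants" against the three Italian sentences;
# the two plant sentences should score much higher than the dog sentence
print(util.cos_sim(embeddings[0], embeddings[1:]))
```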
Okay? So that's how the vectors should look. And then, okay, so what would the training data for 00:05:42.640 |
this look like, and what are the training approaches? So like I said before, there's 00:05:48.400 |
two training approaches that I'm going to just briefly touch upon, but we're going to focus on 00:05:52.640 |
the latter of those. So the first one that I want to mention is what the MUSE, or Multilingual 00:06:04.000 |
Universal Sentence Encoder Model, was trained on, which is a multitask 00:06:15.520 |
translation bridging approach to training. So what I mean by that is it uses two or uses a 00:06:24.640 |
dual encoder structure, and those encoders deal with two different tasks. So on one end, you have 00:06:36.000 |
the parallel data training. So when we say parallel data, these are sentence pairs in 00:06:44.960 |
different languages. So like we had before, we had "amo le piante" and "I love plants," 00:06:52.160 |
which are just the Italian and English phrases for "I love plants." So we would have our source 00:07:00.240 |
language and also the translation, or the target language is probably a better way, 00:07:07.280 |
but I'll put translation for now. So we have the source and translation. That's our parallel data 00:07:12.720 |
set. And what we're doing is optimizing to get those two vectors or the two sentence vectors 00:07:18.560 |
produced by either one of those sentences as close as possible. And then there is also the source 00:07:27.520 |
data. So we basically have sentence similarity or NLI data, but we have it just for the source 00:07:37.840 |
language. So we have source, sentence A, and source, sentence B. And we train on both of these. 00:07:47.760 |
Now, it works, and that's good. But obviously, we're training on a multi-task architecture here, 00:07:55.680 |
and training on a single task in machine learning is already hard enough. Training on two and getting 00:08:03.040 |
them to balance and train well is harder. And the amount of data, at least for Muse, and I believe 00:08:12.000 |
for if you're training using this approach, you're going to need to use a similar amount of data, 00:08:16.560 |
is pretty significant. I think Muse is something like a billion pairs, so it's pretty high. 00:08:23.760 |
And another thing is that we also need something called hard negatives in the training data in 00:08:31.680 |
order for this model to perform well. So what I mean by hard negative is, say we have our source 00:08:40.560 |
sentence A here, and we have this source B, which is like a similar sentence, a high similarity 00:08:47.920 |
sentence. They mean basically the same thing. We'd also have to add a source C. And this source C 00:08:59.200 |
will have to be similar in the words it uses to source A, but actually means something different. 00:09:05.040 |
So it's harder for the model to differentiate between them. And the model would have to figure 00:09:10.640 |
out these two sentences are not similar, even though they seem similar at first, but they're 00:09:16.320 |
not. So it makes the training task harder for the model, which, of course, makes the model better. 00:09:23.760 |
So that is training approach number one. And we've mentioned the parallel data there. That's 00:09:31.840 |
the data set we're going to be using for the second training approach. And that second training 00:09:37.600 |
approach is called multi-lingual knowledge distillation. So that is a mouthful. And it 00:09:53.840 |
takes me a while to write that. I'm sorry. So multi-lingual knowledge distillation. 00:10:01.520 |
So this was introduced in 2020 by, who we've mentioned before, the Sentence Transformers 00:10:08.720 |
people, Nils Reimers and Iryna Gurevych. And the sort of advantage of using this approach is that 00:10:17.600 |
we only need the parallel data set. So we only need those translation pairs. And the amount of 00:10:22.480 |
training data you need is a lot smaller. And using this approach, the Sentence Transformers 00:10:31.440 |
people have actually trained Sentence Transformers that can use more than 50 00:10:37.920 |
languages at once. And the performance is good. It's not just that they managed to 00:10:43.920 |
get a few phrases correct. The performance is actually quite good. 00:10:48.000 |
So I think it's pretty impressive. And the training time for these is super quick, 00:10:57.280 |
as we'll see. And like I said, it's using just translation data, parallel data, 00:11:03.920 |
which is reasonably easy to get for almost every language. So I think that's pretty useful. 00:11:12.000 |
Now, well, let's have a look at what that multi-lingual knowledge distillation training 00:11:19.360 |
process actually looks like. So it's what we have here. So same example as before. I've got 00:11:25.120 |
"I like plants" this time and "Mi piacere non li piante," which is, again, the same thing 00:11:30.960 |
in Italian. Now, we have both of those. We have a teacher model and a student model. Now, 00:11:37.040 |
when we say knowledge distillation, that means where you basically take one model 00:11:44.080 |
and you distill the knowledge from that one model into another model here. The model that already 00:11:51.520 |
knows some of the stuff that we want, that we want to distill knowledge from, 00:11:55.920 |
is called the teacher model. Now, the teacher model, in this case, is going to be a monolingual 00:12:03.840 |
model. So it's probably going to be a sentence transformer that's very good at English tasks 00:12:09.040 |
only. And what we do is we take the student model, which is going to be-- it doesn't have 00:12:17.120 |
to be a sentence transformer. It's just a pre-trained transformer model. We'll be using 00:12:22.720 |
XLM-RoBERTa later on. And it needs to be capable of understanding multiple languages. 00:12:30.000 |
So in this case, we feed the English sentence into both our teacher model and student model. 00:12:37.360 |
And then we optimize the student model to reduce the difference between the two vectors output 00:12:44.640 |
by those two models. And that makes the student model almost mimic the monolingual aspect of the 00:12:51.600 |
teacher model. But then we take it a little further, and we process the Italian, or the 00:12:57.360 |
target language, through the student model. And then we do the same thing. So we try to reduce 00:13:03.680 |
the difference between the Italian vector and the teacher's English vector. And what we're doing 00:13:09.440 |
there is making the student model mimic the teacher for a different language. So through 00:13:16.400 |
that process, you can add more and more languages to a student model, which mimics your teacher 00:13:23.920 |
model. I mean, it seems at least really simple just to think of it like that, in my opinion, 00:13:34.320 |
anyway. But it works really well. So it's a very cool technique, in my opinion. I do like it. 00:13:42.880 |
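Written out, the training signal is just two mean-squared-error terms against the teacher's English vector. The sketch below illustrates that objective only; it uses encode() for readability, so it will not backpropagate, and the actual fine-tuning later relies on sentence-transformers' MSELoss and fit().

```python
import torch
from torch.nn.functional import mse_loss

def distillation_objective(teacher, student, source_texts, translations):
    # the teacher embeds the English source sentences and stays frozen
    with torch.no_grad():
        target = teacher.encode(source_texts, convert_to_tensor=True)
    # the student embeds both the English sources and their translations
    student_src = student.encode(source_texts, convert_to_tensor=True)
    student_tgt = student.encode(translations, convert_to_tensor=True)
    # push both student outputs towards the teacher's English vectors
    return mse_loss(student_src, target) + mse_loss(student_tgt, target)
```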
So just a more visual way of going through that. We have these different circles. They represent 00:13:51.600 |
different language tasks, or different languages, but pretty similar, or the same task in each one 00:13:57.440 |
of those. We have our monolingual teacher model. And that can perform on one of these languages. 00:14:04.240 |
But fails on the others. We take that monolingual model, or our teacher model, and then we also 00:14:10.800 |
take a pre-trained multilingual model. So the important thing here is that it can handle new 00:14:16.080 |
languages, like I said with XLM-RoBERTa. This is our student. We perform multilingual 00:14:22.640 |
knowledge distillation, meaning the student learns how the teacher performs well on the single task 00:14:28.000 |
by mimicking its sentence vector outputs. The student then performs this mimicry across multiple 00:14:34.240 |
languages. And then hopefully, the student model can now perform across all of the languages that 00:14:42.240 |
we are wanting to train on. That's how the multilingual knowledge distillation works. 00:14:49.280 |
Let's have a look at that in code. Okay, so we're in our code here. And the first thing I'm going 00:14:55.920 |
to do is actually get our data. So in the paper that introduced the multilingual knowledge 00:15:04.320 |
distillation, Reimers and Gurevych focus partly on this TED subtitles data. So yeah, we 00:15:14.960 |
know TED Talks, they're just short talks where people present on a particular topic, usually 00:15:20.640 |
pretty interesting. And those TED Talks have subtitles in loads of different languages. 00:15:28.720 |
So they scraped that subtitle data and use that as sentence pairs for the different languages. 00:15:37.200 |
Okay, so that's the parallel data. Now, what I'm going to do is use Hugging Face Transformers 00:15:43.680 |
to download that. So we just import datasets here. So I said Hugging Face Transformers, 00:15:50.640 |
actually Hugging Face Datasets here. So import datasets, and I'm going to load that dataset. 00:15:57.440 |
And just have a look at what the structure of that dataset is. So it's the TED multi, 00:16:04.400 |
and I'm just getting the training data here. You see in here, we have this Features, 00:16:08.480 |
Translations, and Talk Name. Now, it's not really very clear, but inside the translations data, 00:16:16.560 |
we have the language tag. So these are language codes, ISO language codes. If you type that into 00:16:23.440 |
Google, they'll pop up, if you don't know which is which. And below, we also have 00:16:33.200 |
in here, it's not very clear again. So if I come here, we have translations, and each one of those 00:16:38.400 |
corresponds to the language code up here. Okay, so if we came here, we see EN, it's English, 00:16:45.200 |
and we find it here. Okay, and then we also have Talk Name. It's not really important for us. 00:16:53.520 |
So we can get the index of our English text, because we need to extract that for our source 00:17:01.440 |
language. So we extract that, we get number four, so we're going into those language pairs, 00:17:07.200 |
finding EN. And then we use that index to get the corresponding translation, 00:17:12.320 |
which is here. And then we'd use that to create all of our pairs. Now, here, I've just created 00:17:19.840 |
loads of pairs. This is the first one, so this is English to Arabic. But if we have a look, 00:17:25.520 |
there's actually loads of pairs here. So we have 27 in total, which is obviously quite a lot. 00:17:30.000 |
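As a rough sketch, building those English-to-target pairs from the dataset could look like the following. The ted_multi schema assumed here (a translations field holding parallel language and translation lists per record) matches what is described above, but treat the exact field names as an assumption.

```python
from datasets import load_dataset

ted = load_dataset('ted_multi', split='train')

pairs = {}  # e.g. pairs['it'] -> list of (english, italian) tuples
for row in ted:
    langs = row['translations']['language']
    texts = row['translations']['translation']
    if 'en' not in langs:
        continue  # skip records with no English subtitles
    source = texts[langs.index('en')]  # English is our source language
    for lang, text in zip(langs, texts):
        if lang != 'en':
            pairs.setdefault(lang, []).append((source, text))

print(len(pairs))  # roughly the number of language pairs mentioned above
```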
Probably not going to use all of those. I mean, you could do if you want to. It depends on what 00:17:34.320 |
you're trying to build. But I think most of us are probably not going to be trying to build some 00:17:39.200 |
model that crosses all these different languages. So what I'm going to do is just initialize a list 00:17:47.120 |
of languages that we would like to train on. So we're going to be feeding all of this into 00:17:56.960 |
a sentence transformer class called ParallelSentencesDataset. And that requires that 00:18:04.400 |
we, one, separate out our pairs using a tab character, and two, keep all those pairs separated 00:18:12.160 |
in different gzip files. So that's why I'm using this particular structure. So data preprocessing 00:18:20.400 |
steps here, I'm just running through them quickly because I want to focus more on the actual 00:18:24.560 |
sentence transformer training part. So run that, and we can-- well, it's actually going to take a 00:18:31.040 |
moment, so let me skip forward. And then we'll see how many pairs-- well, I just want to see. 00:18:38.880 |
We don't have to do this. But I want to see how many pairs we have for each language. 00:18:42.640 |
And you see here, we have about 200,000 for each of them. The German one is slightly less. 00:18:50.720 |
And then let's have a look at what those source and translations look like. So here, 00:18:55.360 |
we have "applause" and "applausi." Now, I think that's Italian. It seems so. But here, we can see, OK, 00:19:04.960 |
the end of the talk ends in applause. So obviously, the subtitles say applause. Well, 00:19:11.520 |
hopefully, it ends in applause. And then we just have the tab character, and that separates the 00:19:17.520 |
source language, English in this case, from the translated language. Now what we want to do is 00:19:26.960 |
save that data. So we sort all that in these dictionaries. So initialize dictionary here, 00:19:33.280 |
and access them here. So we have ENIT, ES, AR, FR, and DE. And now I'm just going to save them. 00:19:42.240 |
So run this. That will save. And what I'll do is just write os.listdir, so we can see what is in 00:19:52.080 |
there. Where is it? It's data. Just data. Is that right? OK. And then we have these five files. OK. 00:20:03.920 |
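A minimal sketch of that save step, assuming the pairs dictionary from the earlier sketch; the ./data directory and file names are placeholders, not necessarily what the video uses.

```python
import gzip
import os

os.makedirs('./data', exist_ok=True)
target_langs = ['it', 'es', 'ar', 'fr', 'de']  # the five languages kept in the video

for lang in target_langs:
    path = f'./data/ted-train-en-{lang}.tsv.gz'  # hypothetical file naming
    with gzip.open(path, 'wt', encoding='utf-8') as fp:
        for source, translation in pairs[lang]:
            # one tab-separated pair per line, the format ParallelSentencesDataset expects
            fp.write(f'{source}\t{translation}\n')

print(os.listdir('./data'))  # should list the five .tsv.gz files
```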
Now let's continue. So now what we want to do is, OK, we have-- that's our training data. It's 00:20:12.000 |
ready, or mostly ready, before we feed it into the Sentence Transformer's parallel sentences 00:20:19.600 |
data set object later on. So OK, let's leave that for now and move on to the next step, which is 00:20:26.320 |
choosing our teacher and student models. So I already mentioned before, we want our student 00:20:33.920 |
model to be capable of multilingual comprehension. So what I mean by that-- or not just what I mean, 00:20:42.320 |
but one big component of that is, can the Transformer Tokenizer deal with different 00:20:48.080 |
languages? In some cases, they really can't. So let me show you what the BERT Tokenizer does with 00:20:55.920 |
these four different sentences. So we'll just loop through each one. So four texts in sentences. 00:21:03.200 |
And what I'm going to do is just print. I'm going to print the output of the BERT Tokenizer. 00:21:09.600 |
And if I tokenize that text, what does it give me? OK. So what we have here-- OK, English, 00:21:20.080 |
of course. BERT is fine. The Tokenizer, or the vocabulary of the Tokenizer of BERT is, I think, 00:21:27.920 |
roughly 30,000 tokens. And most of those are English-based. You can see here that it has 00:21:38.400 |
picked up some Chinese characters, because it does-- other languages do feed into it a little 00:21:42.880 |
bit, because it's just-- all the data is pulled from the internet. Other bits do get in there. 00:21:49.200 |
But it's mostly English. So that's why we see, OK, we have these unknown characters. 00:21:54.720 |
Now, as soon as we have an unknown character in our sentence, the Tokenizer-- or no, sorry, 00:22:00.960 |
the Transformer is really going to struggle to understand what is in that position. What is that unknown 00:22:07.920 |
token supposed to represent? In the case of-- I think of it as it's like when you're a kid 00:22:16.400 |
in school, and they had those-- had a paragraph, and you had to fill in the blanks. So you had a 00:22:23.600 |
paragraph, and occasionally, in a couple of sentences, there'll be a couple of blank lines 00:22:28.880 |
where you need to guess what the correct word should be. If you only have a couple of those 00:22:34.160 |
blanks, as a person, you can probably guess accurately. And the same for BERT. BERT can 00:22:41.760 |
probably guess accurately what the occasional unknown token is. But if in school, they gave 00:22:50.320 |
you a sheet, and they said, OK, fill out these blanks, and it was literally just a paragraph 00:22:55.680 |
of blank, and you had to guess it correctly, you probably-- I don't know. I think your chances are 00:23:01.360 |
pretty slim of getting that correct. So the same is true for BERT. BERT, for example, 00:23:09.440 |
in our Georgian example down here, how can BERT know what that means? It will not know. 00:23:14.960 |
So the tokenizer from BERT is not suitable for non-Latin character languages whatsoever. 00:23:23.360 |
And then it does know some Greek characters here. And maybe it knows all of them, 00:23:28.000 |
because I suppose Greek feeds into Latin languages a bit more than Georgian or Chinese. 00:23:35.760 |
But it doesn't know what to do with them. They're all single-character tokens. And the issue with 00:23:40.560 |
single-character tokens is that you can't really encode that much information into a single 00:23:46.240 |
character. Because if you have 24 characters in your alphabet, that means you have 24 encodings 00:23:54.160 |
to represent your entire language, which is not going to happen. So that's also not good. 00:24:01.440 |
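As a quick check you can run yourself, the sketch below tokenizes a few sentences with the BERT tokenizer and with the XLM-R tokenizer discussed next. The non-English sentences are illustrative stand-ins (roughly "I love plants" in Chinese, Georgian, and Greek), and the exact tokens you get back depend on the tokenizer versions.

```python
from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained('bert-base-uncased')
xlmr_tok = AutoTokenizer.from_pretrained('xlm-roberta-base')

sentences = [
    'I love plants',          # English
    '我爱植物',                # Chinese
    'მე მიყვარს მცენარეები',    # Georgian
    'Μου αρέσουν τα φυτά',    # Greek
]

for text in sentences:
    print('BERT :', bert_tok.tokenize(text))   # [UNK]s or lone characters for non-Latin scripts
    print('XLM-R:', xlmr_tok.tokenize(text))   # SentencePiece subwords, no unknown tokens
```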
So basically, don't use a BERT tokenizer. It's not a good idea. What you can do is, 00:24:08.560 |
OK, check how this XLM-R tokenizer does. Now, XLM-R is trained for multilingual comprehension. 00:24:20.160 |
It uses a SentencePiece tokenizer, which uses subword logic to split up the sentence or the 00:24:27.360 |
words. So it can deal with tokens it's never seen before, which is pretty nice. And the vocabulary 00:24:35.840 |
size for this is not 30k. I think it's 250k. It could be off a few k there, but it's around that 00:24:44.000 |
mark. And it's been trained on many languages. So it's obviously a much better option for our 00:24:53.280 |
student model. So let's have a look at how we initialize that. So this xlmr model is just coming 00:25:01.520 |
from Transformers. So I need to convert that model from just a Transformer model into an-- 00:25:10.320 |
or initialize it as a Sentence Transformer model using the Sentence Transformers library. 00:25:16.160 |
So from Sentence Transformers, I'm going to import models and also Sentence Transformer. 00:25:23.680 |
So xlmr, so this is going to be our actual Transformer model. We're going to write 00:25:30.480 |
models.Transformer. And Sentence Transformers under the hood uses HuggingFace Transformers as well. 00:25:39.200 |
So we would access this as the normal model identifier that we would with normal HuggingFace 00:25:47.200 |
Transformers, which is xlm-roberta-base. As well as that, we need a pooling layer. 00:25:57.040 |
So we write models.pooling. And in here, we need to pass the output embeddings dimensions. So it's 00:26:08.240 |
this get word embedding dimension for our model. And also what type of pooling we'd like to do. We 00:26:15.520 |
have max pooling, CLS token pooling. And what we want is mean pooling. So that's pooling 00:26:15.520 |
mode mean tokens equals true. Okay. So those are the two components of our Sentence Transformer. 00:26:40.240 |
And then from there, we can initialize our students. So student equals Sentence Transformer. 00:26:48.000 |
And we're initializing that using the modules, which is just a list of our two components there. 00:26:57.120 |
So xlmr followed by pooling. And that's it. So let's have a look at what we have there. 00:27:06.240 |
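Putting those spoken steps together, a sketch of the student initialization:

```python
from sentence_transformers import models, SentenceTransformer

xlmr = models.Transformer('xlm-roberta-base')   # the multilingual transformer backbone
pooling = models.Pooling(
    xlmr.get_word_embedding_dimension(),        # 768 for xlm-roberta-base
    pooling_mode_mean_tokens=True,              # mean pooling over token embeddings
)
student = SentenceTransformer(modules=[xlmr, pooling])
print(student)
```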
Okay. We can just ignore this top bit here. We just want to focus on this. 00:27:10.880 |
So you see we have our transformer model followed by the pooling here. And we also see that we're 00:27:17.840 |
using the mean tokens pooling set to true, rest of them are false. Okay. So that's our student 00:27:24.240 |
model initialized. And now what we want to do is initialize our teacher model. Now the teacher model, 00:27:32.080 |
let me show you. You just have to be a little bit careful with this. So Sentence Transformer. 00:27:38.000 |
So maybe you'd like to use one of the top-performing ones, a lot of which are the "all" models. 00:27:48.400 |
So these are monolingual models, like all-mpnet-base-v2. And okay, let's initialize this and let's see 00:28:02.160 |
what is inside it. Okay. So we have the transformer, we have the pooling as we had before, 00:28:08.400 |
but then we also have this normalization layer. So the outputs from this model are normalized. 00:28:16.080 |
And obviously, if you're trying to make another model mimic that normalization layer outputs, 00:28:23.440 |
well, it's not ideal because the model is going to be trying to normalize its own vectors. So 00:28:32.480 |
you don't really want to do that. You want to choose a model. You either want to remove the 00:28:36.720 |
normalization layer or just choose a model that doesn't have a normalization layer, which I think 00:28:43.840 |
is probably the better option. So that's what I'm going to do. So for the teacher, I'm going to use 00:28:49.840 |
a Sentence Transformer. I'm going to use one of the paraphrase models because these 00:28:55.840 |
don't use normalization layers: paraphrase-distilroberta-base-v2. Okay. Let's have a look. 00:29:10.640 |
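So initializing the teacher is a one-liner, and printing it is an easy way to confirm there is no Normalize module on the end:

```python
from sentence_transformers import SentenceTransformer

# paraphrase models output unnormalized vectors, so there is no Normalize layer to mimic
teacher = SentenceTransformer('paraphrase-distilroberta-base-v2')
print(teacher)  # should show Transformer -> Pooling only
```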
Okay. So now you can see we have the transformer followed directly by the pooling. 00:29:16.240 |
Now, another thing that you probably should just be aware of here is that we have this max 00:29:20.800 |
sequence length here is 512, which doesn't align with our paraphrase model here. But that's fine 00:29:28.560 |
because I'm going to limit the maximum sequence length anyway to 256. So it's not really an issue, 00:29:38.560 |
but just look out for that if you're training your own models. This one is 384. So none of 00:29:44.480 |
those align. But yeah, just be aware that the sequence lengths might not align there. 00:29:53.280 |
So we've formatted our training data. We have our two models, the teacher and the student. 00:30:04.880 |
So now what we can do is prepare that data for loading into our training process, 00:30:11.680 |
our fine tuning process. So I said before, we're going to be using the parallel sentences, 00:30:19.360 |
sorry, from Sentence Transformers import ParallelSentencesDataset. 00:30:25.360 |
And first thing we need to do here is actually initialize the object. And that requires that 00:30:34.080 |
we pass the two models that we're training with because this kind of handles the interaction 00:30:39.760 |
between those two models as well. So obviously we have our student model, which is our student. 00:30:47.040 |
And we have the teacher model, which is our teacher. Alongside this, we want batch size. 00:30:58.800 |
I'm going to use 32, but I think actually you can probably use higher batches here, 00:31:05.280 |
or you probably should use higher batches. I think 64 is one that I see used a lot in these training 00:31:13.280 |
codes. And we also set use embedding cache equal to true. Okay. So that initializes the parallel 00:31:27.600 |
sentences dataset object. And now what we want to do is add our data to it. So we need our training 00:31:35.360 |
files. So training files equal to the os.listdir that we did before. I think it's in the data file, 00:31:44.400 |
in the data directory. Yeah. So that's what we want. And what I'll do is just 00:31:56.800 |
for F in those train files, I'm going to load each one of those into the dataset object. 00:32:04.160 |
Print f and data.load_data. I need to make sure I include the path there, 00:32:12.560 |
followed by the actual file name. You need to pass your max sentences, 00:32:22.240 |
which is the maximum number of sentences that you're going to take from that load data batch. 00:32:27.120 |
So basically the maximum number of sentences we're going to use from each language there. 00:32:33.920 |
Now, I'm just going to set this to 250,000, which is higher than the number of pairs we have for any language. 00:32:42.080 |
That's fine. I don't think, I mean, if you want to try and balance it out, that's fine. You can 00:32:47.920 |
do that here. And then the other option is where we set the maximum length of the sentences that 00:32:59.120 |
we're going to be processing. So that is max sentence length. And I said before, look, 00:33:07.040 |
the maximum we have here is 256 or 512. So let's just trim all of those down to 256. 00:33:17.920 |
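Gathering that up, the data loading looks roughly like this; train_files is the os.listdir output from earlier, the ./data path matches the placeholder naming used in the earlier sketch, and the import path shown matches older sentence-transformers releases, so check your installed version.

```python
import os
from sentence_transformers import ParallelSentencesDataset

data = ParallelSentencesDataset(
    student_model=student,
    teacher_model=teacher,
    batch_size=32,
    use_embedding_cache=True,
)

train_files = os.listdir('./data')
for f in train_files:
    print(f)
    data.load_data(
        os.path.join('./data', f),
        max_sentences=250_000,    # more than any language pair has, so nothing is dropped
        max_sentence_length=256,  # trim very long sentences
    )
```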
Okay. That will load our data. And now we just need to initialize a data loader. So we're just 00:33:26.960 |
using PyTorch here. So run from torch.utils.data import DataLoader. Loader is equal to 00:33:39.920 |
DataLoader. That's our data. We want to shuffle that data. And we also want to set the batch size, 00:33:49.440 |
which is same as before, 32. Okay. So models are ready. Data is ready. Now we initialize our 00:33:59.760 |
loss function. So from sentence transformers again, dot losses, import MSE loss. 00:34:09.280 |
And then loss is equal to MSE loss. And then here we have model equals student model. Okay. So we're 00:34:22.640 |
only optimizing our student model, not the teacher model. The teacher model is there to teach our 00:34:29.360 |
student, not the other way around. Okay. So that's everything we need ready for training. So 00:34:37.920 |
let's move on to the actual training function. So we can train. I'm going to train for one epoch, 00:34:44.640 |
but you can do more. I think in the actual, so in the other codes I've seen that do this, 00:34:54.240 |
they will train for like five epochs. But even just training on one epoch, 00:34:59.280 |
you actually get a pretty good model. So I think you don't need to train on too many. 00:35:07.040 |
But obviously, if you want better performance, I would go with the five that I've seen in the 00:35:14.080 |
other codes. So we need to pass our train objectives here. So we have the data loader 00:35:23.040 |
and then the loss function. Now we want to say, okay, how many epochs? I've said before, 00:35:29.200 |
I'm going to go with one, a number of warmup steps. So before you jump straight up to the 00:35:35.680 |
learning rate that we select in a moment, do we want to warm up first? Yes, we do. I'm going to 00:35:43.440 |
warm up for 10% of the training data, which is just the length of the loader and multiply by 0.1. 00:35:55.120 |
Okay, and from there, where do you want to save the model? I'm going to try, 00:36:03.200 |
I'm going to save it in xlm-ted. Then, our optimizer parameters. 00:36:11.200 |
So we're going to set a learning rate of 2e to the minus 5, epsilon of 1e to the minus 6. 00:36:26.480 |
I'm also going to set correct bias equal to false. 00:36:32.160 |
Okay, there are the optimizer parameters, and then we can also save the best model. 00:36:40.000 |
Save the best model equal to true. And then we run it. Okay, so run that. It's going to 00:36:50.640 |
take a long time, so I'm actually going to stop it because I've already run it. 00:36:55.520 |
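For reference, the whole fine-tuning call is roughly the following; the output path is a placeholder, and the correct_bias flag follows the video but only applies if sentence-transformers is using the transformers AdamW optimizer.

```python
from torch.utils.data import DataLoader
from sentence_transformers.losses import MSELoss

loader = DataLoader(data, shuffle=True, batch_size=32)
loss = MSELoss(model=student)  # only the student's weights are optimized

student.fit(
    train_objectives=[(loader, loss)],
    epochs=1,                             # more epochs (e.g. 5) for better performance
    warmup_steps=int(len(loader) * 0.1),  # warm up over 10% of the training steps
    output_path='./xlm-ted',              # placeholder save directory
    optimizer_params={'lr': 2e-5, 'eps': 1e-6, 'correct_bias': False},
    save_best_model=True,
)
```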
And let's have a look at actually evaluating that and have a look at the results. 00:36:59.520 |
Okay, so I just have this notebook where I've evaluated the model. So I'm using this STS 00:37:07.760 |
(semantic textual similarity) benchmark dataset, which is multilingual. I'm getting the English 00:37:14.960 |
data and also the Italian. And you can see they are similar. So each row in the English data set 00:37:26.240 |
corresponds to the other language data sets as well. So in here, sentence 1 in the English means 00:37:31.600 |
the same thing as sentence 1 in the Italian. Okay, same for sentence 2, and also the same similarity score. 00:37:39.680 |
So the first thing we do is normalize that similarity score, and then we go down a little 00:37:46.320 |
bit. So we reformat the data using Sentence Transformer's InputExample class. And through 00:37:54.960 |
this, I've created three different evaluation sets. So we have the English to English, 00:38:00.480 |
Italian to Italian, and then English to Italian. And then what we do here is we initialize 00:38:09.600 |
a similarity evaluator for each of these data sets. Again, we're using Sentence Transformers, 00:38:15.680 |
just makes life a lot easier. We initialize those, and then we can just pass our model 00:38:21.280 |
to each one of those evaluators to get its performance. So here, 81.6 on the English set, 00:38:28.640 |
74.3 and 71 here. Now, I just trained on one epoch. If you want better performance, 00:38:38.320 |
you can train on more epochs, and you should be able to get more towards 80% or maybe a little 00:38:44.240 |
bit higher. So pretty straightforward and incredibly easy. And then here, I wanted to 00:38:53.040 |
compare that to the student before we trained it. So I initialized a new student and had a look, 00:38:58.800 |
and you can see the evaluation is pretty low. So for English, 47.5. Italian, actually 50%, 00:39:07.680 |
surprisingly. Although it's already a multilingual model, so it does make sense that it can handle 00:39:15.120 |
Italian. And then from English to Italian, it really struggles, drops down to 23. 00:39:22.800 |
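A sketch of that evaluation setup is below. The stsb_multi_mt dataset id and its column names (sentence1, sentence2, similarity_score on a 0-5 scale) are my assumption for the multilingual STS data described here, so adjust them to whatever the benchmark you load actually exposes.

```python
from datasets import load_dataset
from sentence_transformers import InputExample
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

en = load_dataset('stsb_multi_mt', 'en', split='test')
it = load_dataset('stsb_multi_mt', 'it', split='test')

def build_examples(sents_a, sents_b, scores):
    # normalize the 0-5 similarity scores to the 0-1 range the evaluator expects
    return [InputExample(texts=[a, b], label=s / 5.0)
            for a, b, s in zip(sents_a, sents_b, scores)]

en_en = build_examples(en['sentence1'], en['sentence2'], en['similarity_score'])
it_it = build_examples(it['sentence1'], it['sentence2'], it['similarity_score'])
en_it = build_examples(en['sentence1'], it['sentence2'], en['similarity_score'])

# each evaluator returns a correlation between predicted and labeled similarity
for name, examples in [('en-en', en_en), ('it-it', it_it), ('en-it', en_it)]:
    evaluator = EmbeddingSimilarityEvaluator.from_input_examples(examples, name=name)
    print(name, evaluator(student))
```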
So that's it for this video. I think it's been pretty useful, at least for me. I can kind of 00:39:32.000 |
see where you can build a Sentence Transformer in a lot of different languages using this, 00:39:39.040 |
which is, I think, really cool and will probably be useful for a lot of people. 00:39:43.680 |
So I hope you enjoyed the video. Thank you very much for watching, and I'll see you again in the next one. Bye.