All You Need to Know on Multilingual Sentence Vectors (1 Model, 50+ Languages)


Chapters

0:00 Intro
1:19 Multilingual Vectors
5:55 Multi-task Training (mUSE)
9:36 Multilingual Knowledge Distillation
11:13 Knowledge Distillation Training
13:43 Visual Walkthrough
14:53 Parallel Data Prep
20:23 Choosing a Student Model
24:55 Initializing the Models
30:05 ParallelSentencesDataset
33:54 Loss and Fine-tuning
36:59 Model Evaluation
39:23 Outro

Whisper Transcript

00:00:00.000 | Today we're going to be having a look at multilingual sentence transformers. We're
00:00:04.480 | going to look at how they work, how they're trained, and why they're so useful. We're going
00:00:11.200 | to be focusing on one specific training method, which I think is quite useful because all it
00:00:19.280 | really needs is a reasonably small data set of parallel data, which is simply translation pairs
00:00:28.800 | from a source language like English to whichever other language you're using. So obviously, if you
00:00:34.800 | are wanting to train a sentence transformer in a language that doesn't really have that much data,
00:00:42.640 | particularly sentence similarity data, this can be really useful for actually taking a
00:00:50.400 | high-performing, for example, English sentence transformer and transferring that knowledge or
00:00:57.680 | distilling that knowledge into a sentence transformer for your own language. So I think
00:01:05.440 | this will be pretty useful for a lot of you. And let's jump straight into it.
00:01:12.000 | Before we really get into the whole multilingual sentence transformer part of the video,
00:01:25.200 | I just want to give an impression of what these multilingual sentence transformers are actually
00:01:30.000 | doing. So on here, we can see a single English sentence or brief phrase down at the bottom,
00:01:39.120 | "I love plants," and the rest of these are all in Italian. So what we have here are vector
00:01:47.360 | representations or dense vector representations of these phrases. And a monolingual sentence
00:01:55.680 | transformer, which is most of the sentence transformers, will only cope with one language. So
00:02:02.160 | we would hope that phrases that have a similar meaning end up within the same sort of vector
00:02:08.880 | space. So like we have for "amo le piante" here and "I love plants," these are kind of in the same
00:02:20.240 | space. A monolingual sentence transformer would do that for similar sentences. So in English,
00:02:29.120 | we might have "I love plants" and "I like plants," which is actually what we have up here. So this
00:02:35.760 | here is Italian for "I like plants." And we would hope that they're in a similar area,
00:02:41.040 | whereas irrelevant or almost contradictory sentences we would hope would be far off
00:02:49.920 | somewhere else, like our vector over here. So that's how, obviously, a monolingual sentence
00:02:58.160 | transformer works, and it's exactly the same for a multilingual sentence transformer.
00:03:03.680 | The only difference is that rather than having a single language, it will comprehend multiple
00:03:10.080 | languages. And that's what you can see in this visual. So in this example, I have "I love plants"
00:03:18.240 | and "amo le piante." They have the same meaning, just in different languages. So that means that
00:03:25.760 | they should be as close together as possible in this vector space. So here we're just visualizing
00:03:34.400 | three dimensions. In reality, it'd be a lot more. I think most transformer models go with 768
00:03:42.320 | dimensions. But obviously, we can't visualize that, so we have 3D here. So we want different
00:03:49.600 | languages or similar sentences from different languages to end up in the same area. And we
00:03:55.280 | also want to be able to represent relationships between different sentences that are similar.
00:04:02.640 | And we can kind of see that relationship here. So we have "mi piacciono le piante" and "amo le piante"
00:04:08.800 | and "I love plants" are all kind of in the same sort of area. "Mi piacciono le piante," so "I like
00:04:15.760 | plants," is obviously separated somewhat, but it's still within the same area. And then in the bottom
00:04:24.320 | left down there, we have "ho un cane arancione," which means "I have an orange dog." So obviously,
00:04:33.760 | you know, that's really nothing to do with "I love plants." Although I suppose you could say
00:04:38.160 | you're talking about yourself, so maybe it's a little bit similar. But otherwise,
00:04:42.800 | they're completely different topics. So that's kind of what we want to build,
00:04:51.280 | something that takes sentences from different languages and maps them into a vector space,
00:04:57.520 | which has some sort of numerical structure to represent the semantic meaning of those sentences.
00:05:04.720 | And it should be language agnostic. So obviously, we can't -- well, maybe we can train on every
00:05:11.120 | language. I don't know any models that are trained on every single language,
00:05:14.400 | but we want it to be able to comprehend different languages and not be biased towards
00:05:23.120 | different phrases in different languages, but just have a very balanced comprehension of all of them.
00:05:29.680 | Okay? So that's how the vectors should look. And then, okay, so what would the training data for
00:05:42.640 | this look like, and what are the training approaches? So like I said before, there's
00:05:48.400 | two training approaches that I'm going to just briefly touch upon, but we're going to focus on
00:05:52.640 | the latter of those. So the first one that I want to mention is what mUSE, or the Multilingual
00:06:04.000 | Universal Sentence Encoder Model, was trained on, which is a multitask
00:06:15.520 | translation bridging approach to training. So what I mean by that is it uses two or uses a
00:06:24.640 | dual encoder structure, and those encoders deal with two different tasks. So on one end, you have
00:06:36.000 | the parallel data training. So when we say parallel data, these are sentence pairs in
00:06:44.960 | different languages. So like we had before, we had "amo le piante" and "I love plants,"
00:06:52.160 | which is just the Italian and English phrases for "I love plants." So we would have our source
00:07:00.240 | language and also the translation, or the target language is probably a better way,
00:07:07.280 | but I'll put translation for now. So we have the source and translation. That's our parallel data
00:07:12.720 | set. And what we're doing is optimizing to get those two vectors or the two sentence vectors
00:07:18.560 | produced by either one of those sentences as close as possible. And then there is also the source
00:07:27.520 | data. So we basically have sentence similarity or NLI data, but we have it just for the source
00:07:37.840 | language. So we have source, sentence A, and source, sentence B. And we train on both of these.
00:07:47.760 | Now, it works, and that's good. But obviously, we're training on a multi-task architecture here,
00:07:55.680 | and training on a single task in machine learning is already hard enough. Training on two and getting
00:08:03.040 | them to balance and train well is harder. And the amount of data, at least for mUSE, and I believe
00:08:12.000 | if you're training using this approach, you're going to need to use a similar amount of data,
00:08:16.560 | is pretty significant. I think mUSE is something like a billion pairs, so it's pretty high.
00:08:23.760 | And another thing is that we also need something called hard negatives in the training data in
00:08:31.680 | order for this model to perform well. So what I mean by hard negative is, say we have our source
00:08:40.560 | sentence A here, and we have this source B, which is like a similar sentence, a high similarity
00:08:47.920 | sentence. They mean basically the same thing. We'd also have to add a source C. And this source C
00:08:59.200 | will have to be similar in the words it uses to source A, but actually mean something different.
00:09:05.040 | So it's harder for the model to differentiate between them. And the model would have to figure
00:09:10.640 | out these two sentences are not similar, even though they seem similar at first, but they're
00:09:16.320 | not. So it makes the training task harder for the model, which, of course, makes the model better.
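As an aside, here is a minimal PyTorch sketch of the ranking-with-hard-negatives idea just described. This is an illustration only, not the exact mUSE objective; the function name and scale factor are my own.

```python
import torch
import torch.nn.functional as F

def ranking_loss_with_hard_negatives(anchor, positive, hard_negative, scale=20.0):
    """Each anchor (e.g. source sentence A) should score highest against its own
    positive (source B), with all other in-batch positives plus an explicit hard
    negative (source C) acting as negatives. Inputs: (batch, dim) embeddings."""
    candidates = torch.cat([positive, hard_negative], dim=0)            # (2B, dim)
    scores = scale * F.cosine_similarity(
        anchor.unsqueeze(1), candidates.unsqueeze(0), dim=-1)           # (B, 2B)
    labels = torch.arange(anchor.size(0), device=anchor.device)         # column i is correct
    return F.cross_entropy(scores, labels)
```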
00:09:23.760 | So that is training approach number one. And we've mentioned the parallel data there. That's
00:09:31.840 | the data set we're going to be using for the second training approach. And that second training
00:09:37.600 | approach is called multi-lingual knowledge distillation. So that is a mouthful. And it
00:09:53.840 | takes me a while to write that. I'm sorry. So multi-lingual knowledge distillation.
00:10:01.520 | So this was introduced in 2020 by, who we've mentioned before, the Sentence Transformers
00:10:08.720 | people, Nils Reimers and Iryna Gurevych. And the sort of advantage of using this approach is that
00:10:17.600 | we only need the parallel data set. So we only need those translation pairs. And the amount of
00:10:22.480 | training data you need is a lot smaller. And using this approach, the Sentence Transformers
00:10:31.440 | people have actually trained Sentence Transformers that can use more than 50
00:10:37.920 | languages at once. And the performance is good. It's not just that they managed to
00:10:43.920 | get a few phrases correct. The performance is actually quite good.
00:10:48.000 | So I think it's pretty impressive. And the training time for these is super quick,
00:10:57.280 | as we'll see. And like I said, it's using just translation data, parallel data,
00:11:03.920 | which is reasonably easy to get for almost every language. So I think that's pretty useful.
00:11:12.000 | Now, well, let's have a look at what that multi-lingual knowledge distillation training
00:11:19.360 | process actually looks like. So it's what we have here. So same example as before. I've got
00:11:25.120 | "I like plants" this time and "Mi piacciono le piante," which is, again, the same thing
00:11:30.960 | in Italian. Now, we have both of those. We have a teacher model and a student model. Now,
00:11:37.040 | when we say knowledge distillation, that means where you basically take one model
00:11:44.080 | and you distill the knowledge from that one model into another model here. The model that already
00:11:51.520 | knows some of the stuff that we want, that we want to distill knowledge from,
00:11:55.920 | is called the teacher model. Now, the teacher model, in this case, is going to be a monolingual
00:12:03.840 | model. So it's probably going to be a sentence transformer that's very good at English text
00:12:09.040 | only. And what we do is we take the student model, which is going to be-- it doesn't have
00:12:17.120 | to be a sentence transformer. It's just a pre-trained transformer model. We'll be using
00:12:22.720 | XLM-RoBERTa later on. And it needs to be capable of understanding multiple languages.
00:12:30.000 | So in this case, we feed the English sentence into both our teacher model and student model.
00:12:37.360 | And then we optimize the student model to reduce the difference between the two vectors output
00:12:44.640 | by those two models. And that makes the student model almost mimic the monolingual aspect of the
00:12:51.600 | teacher model. But then we take it a little further, and we process the Italian, or the
00:12:57.360 | target language, through the student model. And then we do the same thing. So we try to reduce
00:13:03.680 | the difference between the Italian vector and the teacher's English vector. And what we're doing
00:13:09.440 | there is making the student model mimic the teacher for a different language. So through
00:13:16.400 | that process, you can add more and more languages to a student model, which mimics your teacher
00:13:23.920 | model. I mean, it seems at least really simple just to think of it like that, in my opinion,
00:13:34.320 | anyway. But it works really well. So it's a very cool technique, in my opinion. I do like it.
00:13:42.880 | So just a more visual way of going through that. We have these different circles. They represent
00:13:51.600 | different language tasks, or different languages, but pretty similar, or the same task in each one
00:13:57.440 | of those. We have our monolingual teacher model. And that can perform on one of these languages.
00:14:04.240 | But fails on the others. We take that monolingual model, or our teacher model, and then we also
00:14:10.800 | take a pre-trained multilingual model. So the important thing here is that it can handle new
00:14:16.080 | languages, like I said with XLM-RoBERTa. This is our student. We perform multilingual
00:14:22.640 | knowledge distillation, meaning the student learns how the teacher performs well on the single task
00:14:28.000 | by mimicking its sentence vector outputs. The student then performs this mimicry across multiple
00:14:34.240 | languages. And then hopefully, the student model can now perform across all of the languages that
00:14:42.240 | we are wanting to train on. That's how the multilingual knowledge distillation works.
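Conceptually, the objective just described boils down to two mean-squared-error terms. A minimal PyTorch sketch, assuming `teacher` and `student` each map a batch of sentences to fixed-size vectors (names are illustrative):

```python
import torch
import torch.nn.functional as F

def distillation_loss(teacher, student, source_sentences, target_sentences):
    """Multilingual knowledge distillation: the student should reproduce the
    teacher's source-language vectors for BOTH the source sentences and their
    translations, so translated pairs land on (roughly) the same vector."""
    with torch.no_grad():
        t_src = teacher(source_sentences)      # teacher only ever sees the source language
    s_src = student(source_sentences)          # student on e.g. the English sentence
    s_tgt = student(target_sentences)          # student on e.g. the Italian translation
    return F.mse_loss(s_src, t_src) + F.mse_loss(s_tgt, t_src)
```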
00:14:49.280 | Let's have a look at that in code. Okay, so we're in our code here. And the first thing I'm going
00:14:55.920 | to do is actually get our data. So in the paper that introduced the multilingual knowledge
00:15:04.320 | distillation, Reimers and Gurevych focus partly on this TED subtitles data. So yeah, we
00:15:14.960 | know TED Talks, they're just short talks where people present on a particular topic, usually
00:15:20.640 | pretty interesting. And those TED Talks have subtitles in loads of different languages.
00:15:28.720 | So they scraped that subtitle data and use that as sentence pairs for the different languages.
00:15:37.200 | Okay, so that's the parallel data. Now, what I'm going to do is use Hugging Face Transformers
00:15:43.680 | to download that. So we just import datasets here. So I said Hugging Face Transformers,
00:15:50.640 | actually Hugging Face Datasets here. So import datasets, and I'm going to load that dataset.
00:15:57.440 | And just have a look at what the structure of that dataset is. So it's the ted_multi dataset,
00:16:04.400 | and I'm just getting the training data here. You see in here, we have this Features,
00:16:08.480 | Translations, and Talk Name. Now, it's not really very clear, but inside the translations data,
00:16:16.560 | we have the language tag. So these are language codes, ISO language codes. If you type that into
00:16:23.440 | Google, they'll pop up, if you don't know which are which. And below, we also have
00:16:33.200 | in here, it's not very clear again. So if I come here, we have translations, and each one of those
00:16:38.400 | corresponds to the language code up here. Okay, so if we came here, we see EN, it's English,
00:16:45.200 | and we find it here. Okay, and then we also have Talk Name. It's not really important for us.
00:16:53.520 | So we can get the index of our English text, because we need to extract that for our source
00:17:01.440 | language. So we extract that, we get number four, so we're going into those language pairs,
00:17:07.200 | finding EN. And then we use that index to get the corresponding translation,
00:17:12.320 | which is here. And then we'd use that to create all of our pairs. Now, here, I've just created
00:17:19.840 | loads of pairs. This is the first one, so this is English to Arabic. But if we have a look,
00:17:25.520 | there's actually loads of pairs here. So we have 27 in total, which is obviously quite a lot.
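As a rough sketch of that step (the exact notebook code isn't shown here, so the variable names are illustrative), loading the dataset with Hugging Face Datasets and pulling out English-Italian pairs might look like this:

```python
from datasets import load_dataset

# TED subtitles in many languages, training split only.
ted = load_dataset("ted_multi", split="train")

pairs = []  # (english_sentence, italian_sentence) tuples
for row in ted:
    langs = row["translations"]["language"]
    texts = row["translations"]["translation"]
    if "en" in langs and "it" in langs:
        pairs.append((texts[langs.index("en")], texts[langs.index("it")]))

print(len(pairs), pairs[0])
```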
00:17:30.000 | Probably not going to use all of those. I mean, you could do if you want to. It depends on what
00:17:34.320 | you're trying to build. But I think most of us are probably not going to be trying to build some
00:17:39.200 | model that crosses all these different languages. So what I'm going to do is just initialize a list
00:17:47.120 | of languages that we would like to train on. So we're going to be feeding all of this into
00:17:56.960 | a Sentence Transformers class called ParallelSentencesDataset. And that requires that
00:18:04.400 | we, one, separate out our pairs using a tab character, and two, keep all those pairs separated
00:18:12.160 | in different gzip files. So that's why I'm using this particular structure. So data preprocessing
00:18:20.400 | steps here, I'm just running through them quickly because I want to focus more on the actual
00:18:24.560 | sentence transformer training part. So run that, and we can-- well, it's actually going to take a
00:18:31.040 | moment, so let me skip forward. And then we'll see how many pairs-- well, I just want to see.
00:18:38.880 | We don't have to do this. But I want to see how many pairs we have for each language.
00:18:42.640 | And you see here, we have about 200,000 for each of them. The German one is slightly less.
00:18:50.720 | And then let's have a look at what those source and translations look like. So here,
00:18:55.360 | we have applause and applause. Now, I think that's Italian. It seems so. But here, we can see, OK,
00:19:04.960 | the end of the talk ends in applause. So obviously, the subtitles say applause. Well,
00:19:11.520 | hopefully, it ends in applause. And then we just have the tab character, and that separates the
00:19:17.520 | source language, English in this case, from the translated language. Now what we want to do is
00:19:26.960 | save that data. So we store all that in these dictionaries. So initialize dictionary here,
00:19:33.280 | and access them here. So we have ENIT, ES, AR, FR, and DE. And now I'm just going to save them.
00:19:42.240 | So run this. That will save. And what I'll do is just write os.listdir, so we can see what is in
00:19:52.080 | there. Where is it? It's data. Just data. Is that right? OK. And then we have these five files. OK.
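A sketch of that save step, assuming `pairs_by_lang` maps a language code (e.g. "it") to the (source, translation) tuples built earlier; the file names are my own placeholders, but the format (one tab-separated pair per line, one gzip file per language) is what ParallelSentencesDataset expects:

```python
import gzip
import os

os.makedirs("data", exist_ok=True)

for lang, pairs in pairs_by_lang.items():
    path = os.path.join("data", f"ted-train-en-{lang}.tsv.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for source, translation in pairs:
            # Source (English) and target separated by a tab, one pair per line.
            f.write(f"{source}\t{translation}\n")

print(os.listdir("data"))
```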
00:20:03.920 | Now let's continue. So now what we want to do is, OK, we have-- that's our training data. It's
00:20:12.000 | ready, or mostly ready, before we feed it into the Sentence Transformers ParallelSentencesDataset
00:20:19.600 | object later on. So OK, let's leave that for now and move on to the next step, which is
00:20:26.320 | choosing our teacher and student models. So I already mentioned before, we want our student
00:20:33.920 | model to be capable of multilingual comprehension. So what I mean by that-- or not just what I mean,
00:20:42.320 | but one big component of that is, can the Transformer Tokenizer deal with different
00:20:48.080 | languages? In some cases, they really can't. So let me show you what the BERT Tokenizer does with
00:20:55.920 | these four different sentences. So we'll just loop through each one, so for text in sentences.
00:21:03.200 | And what I'm going to do is just print. I'm going to print the output of the BERT Tokenizer.
00:21:09.600 | And if I tokenize that text, what does it give me? OK. So what we have here-- OK, English,
00:21:20.080 | of course. BERT is fine. The Tokenizer, or the vocabulary of the Tokenizer of BERT is, I think,
00:21:27.920 | roughly 30,000 tokens. And most of those are English-based. You can see here that it has
00:21:38.400 | picked up some Chinese characters, because it does-- other languages do feed into it a little
00:21:42.880 | bit, because it's just-- all the data is pulled from the internet. Other bits do get in there.
00:21:49.200 | But it's mostly English. So that's why we see, OK, we have these unknown characters.
00:21:54.720 | Now, as soon as we have an unknown character in our sentence, the Tokenizer-- or no, sorry,
00:22:00.960 | the Transformer is really going to struggle to understand what is in that position. What is that unknown
00:22:07.920 | token supposed to represent? In the case of-- I think of it as it's like when you're a kid
00:22:16.400 | in school, and they had those-- had a paragraph, and you had to fill in the blanks. So you had a
00:22:23.600 | paragraph, and occasionally, in a couple of sentences, there'll be a couple of blank lines
00:22:28.880 | where you need to guess what the correct word should be. If you only have a couple of those
00:22:34.160 | blanks, as a person, you can probably guess accurately. And the same for BERT. BERT can
00:22:41.760 | probably guess accurately what the occasional unknown token is. But if in school, they gave
00:22:50.320 | you a sheet, and they said, OK, fill out these blanks, and it was literally just a paragraph
00:22:55.680 | of blank, and you had to guess it correctly, you probably-- I don't know. I think your chances are
00:23:01.360 | pretty slim of getting that correct. So the same is true for BERT. BERT, for example,
00:23:09.440 | in our Georgian example down here, how can BERT know what that means? It will not know.
00:23:14.960 | So the tokenizer from BERT is not suitable for non-Latin character languages whatsoever.
00:23:23.360 | And then it does know some Greek characters here. And maybe it knows all of them,
00:23:28.000 | because I suppose Greek feeds into Latin languages a bit more than Georgian or Chinese.
00:23:35.760 | But it doesn't know what to do with them. They're all single-character tokens. And the issue with
00:23:40.560 | single-character tokens is that you can't really encode that much information into a single
00:23:46.240 | character. Because if you have 24 characters in your alphabet, that means you have 24 encodings
00:23:54.160 | to represent your entire language, which is not going to happen. So that's also not good.
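To make that concrete, here is a small comparison you could run yourself (the sentences are stand-ins for the ones in the notebook, and the XLM-R tokenizer shown alongside is the one discussed next):

```python
from transformers import AutoTokenizer

sentences = [
    "I love plants",         # English
    "我喜欢植物",              # Chinese: "I like plants"
    "Μου αρέσουν τα φυτά",   # Greek: "I like plants"
]

bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
xlmr_tok = AutoTokenizer.from_pretrained("xlm-roberta-base")

for text in sentences:
    # BERT's ~30k mostly-English vocab falls back to [UNK] or single characters
    # for non-Latin scripts; XLM-R's ~250k SentencePiece vocab gives real subwords.
    print(bert_tok.tokenize(text))
    print(xlmr_tok.tokenize(text))
```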
00:24:01.440 | So basically, don't use a BERT tokenizer. It's not a good idea. What you can do is,
00:24:08.560 | OK, what about this XLM-R tokenizer? Now, XLM-R is trained for multilingual comprehension.
00:24:20.160 | It uses a SentencePiece tokenizer, which uses subword logic to split up the sentence or the
00:24:27.360 | words. So it can deal with tokens it's never seen before, which is pretty nice. And the vocabulary
00:24:35.840 | size for this is not 30k. I think it's 250k. It could be off a few k there, but it's around that
00:24:44.000 | mark. And it's been trained on many languages. So it's obviously a much better option for our
00:24:53.280 | student model. So let's have a look at how we initialize that. So this xlmr model is just coming
00:25:01.520 | from Transformers. So I need to convert that model from just a Transformer model into an--
00:25:10.320 | or initialize it as a Sentence Transformer model using the Sentence Transformers library.
00:25:16.160 | So from Sentence Transformers, I'm going to import models and also Sentence Transformer.
00:25:23.680 | So xlmr, so this is going to be our actual Transformer model. We're going to write
00:25:30.480 | models.Transformer. And Sentence Transformers under the hood uses Hugging Face Transformers as well.
00:25:39.200 | So we would access this with the normal model identifier that we would use with Hugging Face
00:25:47.200 | Transformers, which is xlm-roberta-base. As well as that, we need a pooling layer.
00:25:57.040 | So we write models.Pooling. And in here, we need to pass the output embedding dimension, so it's
00:26:08.240 | this get_word_embedding_dimension() from our model, and also what type of pooling we'd like to do. We
00:26:15.520 | have max pooling, CLS token pooling, and what we want is mean pooling. So it's
00:26:29.440 | pooling_mode_mean_tokens equals True. Okay. So those are the two components of our Sentence Transformer.
00:26:40.240 | And then from there, we can initialize our student. So student equals SentenceTransformer.
00:26:48.000 | And we're initializing that using the modules, which is just a list of our two components there.
00:26:57.120 | So xlmr followed by pooling. And that's it. So let's have a look at what we have there.
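Put together, the student initialization just described looks roughly like this:

```python
from sentence_transformers import SentenceTransformer, models

# Transformer backbone: a pre-trained multilingual model from Hugging Face.
xlmr = models.Transformer("xlm-roberta-base")

# Mean pooling over token embeddings to get a single fixed-size sentence vector.
pooling = models.Pooling(
    xlmr.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
)

student = SentenceTransformer(modules=[xlmr, pooling])
print(student)
```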
00:27:06.240 | Okay. We can just ignore this top bit here. We just want to focus on this.
00:27:10.880 | So you see we have our transformer model followed by the pooling here. And we also see that we're
00:27:17.840 | using the mean tokens pooling set to true, rest of them are false. Okay. So that's our student
00:27:24.240 | model initialized. And now what we want to do is initialize our teacher model. Now the teacher model,
00:27:32.080 | let me show you. You just have to be a little bit careful with this. So Sentence Transformer.
00:27:38.000 | So maybe you'd like to use one of the top performing ones, a lot of which are the 'all' models.
00:27:48.400 | So these are monolingual models, like all-mpnet-base-v2. And okay, let's initialize this and let's see
00:28:02.160 | what is inside it. Okay. So we have the transformer, we have the pooling as we had before,
00:28:08.400 | but then we also have this normalization layer. So the outputs from this model are normalized.
00:28:16.080 | And obviously, if you're trying to make another model mimic that normalization layer outputs,
00:28:23.440 | well, it's not ideal because the model is going to be trying to normalize its own vectors. So
00:28:32.480 | you don't really want to do that. You want to choose a model. You either want to remove the
00:28:36.720 | normalization layer or just choose a model that doesn't have a normalization layer, which I think
00:28:43.840 | is probably the better option. So that's what I'm going to do. So for the teacher, I'm going to use
00:28:49.840 | a Sentence Transformer. I'm going to use one of the paraphrase models because these
00:28:55.840 | don't use normalization layers: paraphrase-distilroberta-base-v2. Okay. Let's have a look.
00:29:10.640 | Okay. So now you can see we have the transformer followed directly by the pooling.
00:29:16.240 | Now, another thing that you probably should just be aware of here is that we have this max
00:29:20.800 | sequence length here of 512, which doesn't align with our paraphrase model here. But that's fine
00:29:28.560 | because I'm going to limit the maximum sequence length anyway to 250. So it's not really an issue,
00:29:38.560 | but just look out for that if you're training your own models. This one's 384. So none of
00:29:44.480 | those align. But yeah, just be aware that the sequence lengths might not align there.
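And the teacher is loaded directly; a short sketch of the check described above:

```python
from sentence_transformers import SentenceTransformer

# A monolingual English model WITHOUT a Normalize layer, so the student
# learns to mimic raw (un-normalized) sentence vectors.
teacher = SentenceTransformer("paraphrase-distilroberta-base-v2")
print(teacher)  # expect Transformer -> Pooling, with no Normalize module
```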
00:29:53.280 | So we've formatted our training data. We have our two models, the teacher and the student.
00:30:04.880 | So now what we can do is prepare that data for loading into our training process,
00:30:11.680 | our fine-tuning process. So I said before, we're going to be using the ParallelSentencesDataset,
00:30:19.360 | so from sentence_transformers.datasets we import ParallelSentencesDataset.
00:30:25.360 | And first thing we need to do here is actually initialize the object. And that requires that
00:30:34.080 | we pass the two models that we're training with because this kind of handles the interaction
00:30:39.760 | between those two models as well. So obviously we have our student model, which is our student.
00:30:47.040 | And we have the teacher model, which is our teacher. Alongside this, we want batch size.
00:30:58.800 | I'm going to use 32, but I think actually you can probably use higher batches here,
00:31:05.280 | or you probably should use higher batches. I think 64 is one that I see used a lot in these training
00:31:13.280 | codes. And you also use embedding cache equal to true. Okay. So that initializes the parallel
00:31:27.600 | sentences dataset object. And now what we want to do is add our data to it. So we need our training
00:31:35.360 | files. So training files equal to OS list that we did before. I think it's in the data file,
00:31:44.400 | in the data directory. Yeah. So that's what we want. And what I'll do is just
00:31:56.800 | for F in those train files, I'm going to load each one of those into the dataset object.
00:32:04.160 | Print F and data.loaddata. I need to make sure I include the path there,
00:32:12.560 | followed by the actual file name. You need to pass your max sentences,
00:32:22.240 | which is the maximum number of sentences that you're going to take from that load data batch.
00:32:27.120 | So basically the maximum number of sentences we're going to use from each language there.
00:32:33.920 | Now, I'm just going to set this to 250,000, which is higher than any of the batches we have.
00:32:42.080 | That's fine. I don't think, I mean, if you want to try and balance it out, that's fine. You can
00:32:47.920 | do that here. And then the other option is where we set the maximum length of the sentences that
00:32:59.120 | we're going to be processing. So that is max_sentence_length. And I said before, look,
00:33:07.040 | the maximum we have here is 256 or 512. So let's just trim all of those down to 256.
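A sketch of those two steps together (the "data" directory and the 250,000 / 256 limits follow the walkthrough above):

```python
import os
from sentence_transformers.datasets import ParallelSentencesDataset

data = ParallelSentencesDataset(
    student_model=student,
    teacher_model=teacher,
    batch_size=32,
    use_embedding_cache=True,
)

# Add each tab-separated, gzipped parallel file prepared earlier.
for f in os.listdir("data"):
    print(f)
    data.load_data(
        os.path.join("data", f),
        max_sentences=250_000,     # per-language cap (higher than any file here)
        max_sentence_length=256,   # skip overly long sentences
    )
```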
00:33:17.920 | Okay. That will load our data. And now we just need to initialize a data loader. So we're just
00:33:26.960 | using PyTorch here. So we run from torch.utils.data import DataLoader. The loader is equal to DataLoader
00:33:39.920 | with our data. We want to shuffle that data. And we also want to set the batch size,
00:33:49.440 | which is same as before, 32. Okay. So models are ready. Data is ready. Now we initialize our
00:33:59.760 | loss function. So from sentence transformers again, dot losses, import MSE loss.
00:34:09.280 | And then loss is equal to MSE loss. And then here we have model equals student model. Okay. So we're
00:34:22.640 | only optimizing our student model, not the teacher model. The teacher model is there to teach our
00:34:29.360 | student, not the other way around. Okay. So that's everything we need ready for training. So
00:34:37.920 | let's move on to the actual training function. So we can train. I'm going to train for one epoch,
00:34:44.640 | but you can do more. I think in the actual, so in the other codes I've seen that do this,
00:34:54.240 | they will train for like five epochs. But even just training on one epoch,
00:34:59.280 | you actually get a pretty good model. So I think you don't need to train on too many.
00:35:07.040 | But obviously, if you want better performance, I would go with the five that I've seen in the
00:35:14.080 | other codes. So we need to pass our train objectives here. So we have the data loader
00:35:23.040 | and then the loss function. Now we want to say, okay, how many epochs? I've said before,
00:35:29.200 | I'm going to go with one, a number of warmup steps. So before you jump straight up to the
00:35:35.680 | learning rate that we select in a moment, do we want to warm up first? Yes, we do. I'm going to
00:35:43.440 | warm up for 10% of the training data, which is just the length of the loader and multiply by 0.1.
00:35:55.120 | Okay, and from there, where do you want to save the model? I'm going to try,
00:36:03.200 | I'm going to save it in 'xlm-ted'. Then our optimizer parameters:
00:36:11.200 | so we're going to set a learning rate of 2e-5 and an epsilon of 1e-6.
00:36:26.480 | I'm also going to set correct_bias equal to False.
00:36:32.160 | Okay, there are the optimizer parameters, and then we can also save the best model.
00:36:40.000 | So save_best_model equal to True. And then we run it. Okay, so run that. It's going to
00:36:50.640 | take a long time, so I'm actually going to stop it because I've already run it.
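For reference, a sketch of the loader, loss, and fit call as described; the output path is a placeholder, and correct_bias applies to the transformers AdamW optimizer used by the sentence-transformers version in the video (newer releases use torch's AdamW, which doesn't take it):

```python
from torch.utils.data import DataLoader
from sentence_transformers.losses import MSELoss

loader = DataLoader(data, shuffle=True, batch_size=32)

# Only the student is optimized; the teacher's vectors are the regression target.
loss = MSELoss(model=student)

epochs = 1
warmup_steps = int(len(loader) * 0.1)  # warm up over 10% of the steps

student.fit(
    train_objectives=[(loader, loss)],
    epochs=epochs,
    warmup_steps=warmup_steps,
    output_path="./xlm-ted",  # placeholder save directory
    optimizer_params={"lr": 2e-5, "eps": 1e-6, "correct_bias": False},
    save_best_model=True,
)
```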
00:36:55.520 | And let's have a look at actually evaluating that and have a look at the results.
00:36:59.520 | Okay, so I just have this notebook where I've evaluated the model. So I'm using this STS
00:37:07.760 | semantic textual similarity benchmark dataset, which is multilingual. I'm getting the English
00:37:14.960 | data and also the Italian. And you can see they are similar. So each row in the English data set
00:37:26.240 | corresponds to the other language data sets as well. So in here, sentence 1 in the English means
00:37:31.600 | the same thing as sentence 0 in the Italian. Okay, same sentence 2, also the same similarity score.
00:37:39.680 | So the first thing we do is normalize that similarity score, and then we go down a little
00:37:46.320 | bit. So we reformat the data using Sentence Transformer's InputExample class. And through
00:37:54.960 | this, I've created three different evaluation sets. So we have the English to English,
00:38:00.480 | Italian to Italian, and then English to Italian. And then what we do here is we initialize
00:38:09.600 | a similarity evaluator for each of these data sets. Again, we're using Sentence Transformers,
00:38:15.680 | just makes life a lot easier. We initialize those, and then we can just pass our model
00:38:21.280 | to each one of those evaluators to get its performance. So here, 81.6 on the English set,
00:38:28.640 | 74.3 and 71 here. Now, I just trained on one epoch. If you want better performance,
00:38:38.320 | you can train on what epochs, and you should be able to get more towards 80% or maybe a little
00:38:44.240 | bit higher. So pretty straightforward and incredibly easy. And then here, I wanted to
00:38:53.040 | compare that to the student before we trained it. So I initialized a new student and had a look,
00:38:58.800 | and you can see the evaluation is pretty low. So for English, 47.5. Italian, actually 50%,
00:39:07.680 | surprisingly. Although it's already a multilingual model, so it does make sense that it can understand some
00:39:15.120 | Italian. And then from English to Italian, it really struggles, drops down to 23.
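A sketch of that evaluation, assuming `en_en`, `it_it`, and `en_it` are lists of InputExample objects built from the STS data with scores normalized to 0-1 (the variable names are illustrative):

```python
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# One evaluator per split: EN-EN, IT-IT, and cross-lingual EN-IT.
evaluators = {
    "en-en": EmbeddingSimilarityEvaluator.from_input_examples(en_en, name="en-en"),
    "it-it": EmbeddingSimilarityEvaluator.from_input_examples(it_it, name="it-it"),
    "en-it": EmbeddingSimilarityEvaluator.from_input_examples(en_it, name="en-it"),
}

for name, evaluator in evaluators.items():
    # Returns the correlation between cosine similarities and the gold scores.
    print(name, evaluator(student))
```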
00:39:22.800 | So that's it for this video. I think it's been pretty useful, at least for me. I can kind of
00:39:32.000 | see how you can build a Sentence Transformer in a lot of different languages using this,
00:39:39.040 | which is, I think, really cool and will probably be useful for a lot of people.
00:39:43.680 | So I hope you enjoyed the video. Thank you very much for watching,
00:39:49.680 | and I'll see you again in the next one.