All You Need to Know on Multilingual Sentence Vectors (1 Model, 50+ Languages)
Chapters
0:00 Intro
1:19 Multilingual Vectors
5:55 Multi-task Training (mUSE)
9:36 Multilingual Knowledge Distillation
11:13 Knowledge Distillation Training
13:43 Visual Walkthrough
14:53 Parallel Data Prep
20:23 Choosing a Student Model
24:55 Initializing the Models
30:05 ParallelSentencesDataset
33:54 Loss and Fine-tuning
36:59 Model Evaluation
39:23 Outro
00:00:00.000 |
Today we're going to be having a look at multilingual sentence transformers. We're 00:00:04.480 |
going to look at how they work, how they're trained, and why they're so useful. We're going 00:00:11.200 |
to be focusing on one specific training method, which I think is quite useful because all it 00:00:19.280 |
really needs is a reasonably small data set of parallel data, which is simply translation pairs 00:00:28.800 |
from a source language like English to whichever other language you're using. So obviously, if you 00:00:34.800 |
are wanting to train a sentence transformer in a language that doesn't really have that much data, 00:00:42.640 |
it's particularly sentence similarity data, this can be really useful for actually taking a 00:00:50.400 |
high-performing, for example, English sentence transformer and transferring that knowledge or 00:00:57.680 |
distilling that knowledge into a sentence transformer for your own language. So I think 00:01:05.440 |
this will be pretty useful for a lot of you. And let's jump straight into it. 00:01:12.000 |
Before we really get into the whole multilingual sentence transformer part of the video, 00:01:25.200 |
I just want to give an impression of what these multilingual sentence transformers are actually 00:01:30.000 |
doing. So on here, we can see a single English sentence or brief phrase down at the bottom, 00:01:39.120 |
"isle of plants," and the rest of these are all in Italian. So what we have here are vector 00:01:47.360 |
representations or dense vector representations of these phrases. And a monolingual sentence 00:01:55.680 |
transformer, which is most of the sentence transformers, will only cope with one language. So 00:02:02.160 |
we would hope that phrases that have a similar meaning end up within the same sort of vector 00:02:08.880 |
space. So like we have for "amo le piante" here and "I love plants," these are kind of in the same 00:02:20.240 |
space. A monolingual sentence transformer would do that for similar sentences. So in English, 00:02:29.120 |
we might have "I love plants" and "I like plants," which is actually what we have up here. So this 00:02:35.760 |
here is Italian for "I like plants." And we would hope that they're in a similar area, 00:02:41.040 |
whereas irrelevant or almost contradictory sentences we would hope would be far off 00:02:49.920 |
somewhere else, like our vector over here. So that's how, obviously, a monolingual sentence 00:02:58.160 |
transformer works, and it's exactly the same for a multilingual sentence transformer. 00:03:03.680 |
The only difference is that rather than having a single language, it will comprehend multiple 00:03:10.080 |
languages. And that's what you can see in this visual. So in this example, I have "I love plants" 00:03:18.240 |
and "amo lupiante." They have the same meaning, just in different languages. So that means that 00:03:25.760 |
they should be as close together as possible in this vector space. So here we're just visualizing 00:03:34.400 |
three dimensions. In reality, it'd be a lot more. I think most transformer models go with 768 00:03:42.320 |
dimensions. But obviously, we can't visualize that, so we have 3D here. So we want different 00:03:49.600 |
languages or similar sentences from different languages to end up in the same area. And we 00:03:55.280 |
also want to be able to represent relationships between different sentences that are similar. 00:04:02.640 |
And we can kind of see that relationship here. So we have "mi piacciono le piante" and "amo le piante" 00:04:08.800 |
and "I love plants" are all kind of in the same sort of area. "Mi piacciono le piante," so "I like 00:04:15.760 |
plants," is obviously separated somewhat, but it's still within the same area. And then in the bottom 00:04:24.320 |
left down there, we have "ho un cane arancione," which means "I have an orange dog." So obviously, 00:04:33.760 |
you know, that's really nothing to do with "I love plants." Although I suppose you could say 00:04:38.160 |
you're talking about yourself, so maybe it's a little bit similar. But otherwise, 00:04:42.800 |
they're completely different topics. So that's kind of what we want to build, 00:04:51.280 |
something that takes sentences from different languages and maps them into a vector space, 00:04:57.520 |
which has some sort of numerical structure to represent the semantic meaning of those sentences. 00:05:04.720 |
And it should be language agnostic. So obviously, we can't -- well, maybe we can train on every 00:05:11.120 |
language. I don't know any models that are trained on every single language, 00:05:14.400 |
but we want it to be able to comprehend different languages and not be biased towards 00:05:23.120 |
different phrases in different languages, but just have a very balanced comprehension of all of them. 00:05:29.680 |
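As a concrete aside, here is a minimal sketch of that behaviour using an off-the-shelf multilingual sentence transformer from the sentence-transformers library; the model name and example sentences are illustrative, not the model trained later in this video.

```python
from sentence_transformers import SentenceTransformer, util

# illustrative pretrained multilingual model (assumption, not the model trained in this video)
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

sentences = [
    'I love plants',           # English
    'Amo le piante',           # Italian: "I love plants"
    'Mi piacciono le piante',  # Italian: "I like plants"
    'Ho un cane arancione',    # Italian: "I have an orange dog"
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# cosine similarity of "I love plants" against the three Italian sentences;
# the two plant sentences should score much higher than the dog sentence
print(util.cos_sim(embeddings[0], embeddings[1:]))
```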
Okay? So that's how the vectors should look. And then, okay, so what would the training data for 00:05:42.640 |
this look like, and what are the training approaches? So like I said before, there's 00:05:48.400 |
two training approaches that I'm going to just briefly touch upon, but we're going to focus on 00:05:52.640 |
the latter of those. So the first one that I want to mention is what the MUSE, or Multilingual 00:06:04.000 |
Universal Sentence Encoder Model, was trained on, which is a multitask 00:06:15.520 |
translation bridging approach to training. So what I mean by that is it uses two or uses a 00:06:24.640 |
dual encoder structure, and those encoders deal with two different tasks. So on one end, you have 00:06:36.000 |
the parallel data training. So when we say parallel data, these are sentence pairs in 00:06:44.960 |
different languages. So like we had before, we had "amo le piante" and "I love plants," 00:06:52.160 |
which are just the Italian and English phrases for "I love plants." So we would have our source 00:07:00.240 |
language and also the translation, or the target language is probably a better way, 00:07:07.280 |
but I'll put translation for now. So we have the source and translation. That's our parallel data 00:07:12.720 |
set. And what we're doing is optimizing to get those two vectors or the two sentence vectors 00:07:18.560 |
produced by either one of those sentences as close as possible. And then there is also the source 00:07:27.520 |
data. So we basically have sentence similarity or NLI data, but we have it just for the source 00:07:37.840 |
language. So we have source, sentence A, and source, sentence B. And we train on both of these. 00:07:47.760 |
Now, it works, and that's good. But obviously, we're training on a multi-task architecture here, 00:07:55.680 |
and training on a single task in machine learning is already hard enough. Training on two and getting 00:08:03.040 |
them to balance and train well is harder. And the amount of data, at least for Muse, and I believe 00:08:12.000 |
for if you're training using this approach, you're going to need to use a similar amount of data, 00:08:16.560 |
is pretty significant. I think Muse is something like a billion pairs, so it's pretty high. 00:08:23.760 |
And another thing is that we also need something called hard negatives in the training data in 00:08:31.680 |
order for this model to perform well. So what I mean by hard negative is, say we have our source 00:08:40.560 |
sentence A here, and we have this source B, which is like a similar sentence, a high similarity 00:08:47.920 |
sentence. They mean basically the same thing. We'd also have to add a source C. And this source C 00:08:59.200 |
will have to be similar in the words it uses to source A, but actually means something different. 00:09:05.040 |
So it's harder for the model to differentiate between them. And the model would have to figure 00:09:10.640 |
out these two sentences are not similar, even though they seem similar at first, but they're 00:09:16.320 |
not. So it makes the training task harder for the model, which, of course, makes the model better. 00:09:23.760 |
So that is training approach number one. And we've mentioned the parallel data there. That's 00:09:31.840 |
the data set we're going to be using for the second training approach. And that second training 00:09:37.600 |
approach is called multi-lingual knowledge distillation. So that is a mouthful. And it 00:09:53.840 |
takes me a while to write that. I'm sorry. So multi-lingual knowledge distillation. 00:10:01.520 |
So this was introduced in 2020 by, who we've mentioned before, the Sentence Transformers 00:10:08.720 |
people, Nils Reimers and Iryna Gurevych. And the sort of advantage of using this approach is that 00:10:17.600 |
we only need the parallel data set. So we only need those translation pairs. And the amount of 00:10:22.480 |
training data you need is a lot smaller. And using this approach, the Sentence Transformers 00:10:31.440 |
people have actually trained Sentence Transformers that can use more than 50 00:10:37.920 |
languages at once. And the performance is good. It's not just that they managed to 00:10:43.920 |
get a few phrases correct. The performance is actually quite good. 00:10:48.000 |
So I think it's pretty impressive. And the training time for these is super quick, 00:10:57.280 |
as we'll see. And like I said, it's using just translation data, parallel data, 00:11:03.920 |
which is reasonably easy to get for almost every language. So I think that's pretty useful. 00:11:12.000 |
Now, well, let's have a look at what that multi-lingual knowledge distillation training 00:11:19.360 |
process actually looks like. So it's what we have here. So same example as before. I've got 00:11:25.120 |
"I like plants" this time and "Mi piacere non li piante," which is, again, the same thing 00:11:30.960 |
in Italian. Now, we have both of those. We have a teacher model and a student model. Now, 00:11:37.040 |
when we say knowledge distillation, that means where you basically take one model 00:11:44.080 |
and you distill the knowledge from that one model into another model here. The model that already 00:11:51.520 |
knows some of the stuff that we want, that we want to distill knowledge from, 00:11:55.920 |
is called the teacher model. Now, the teacher model, in this case, is going to be a monolingual 00:12:03.840 |
model. So it's probably going to be a sentence transformer that's very good at English tasks 00:12:09.040 |
only. And what we do is we take the student model, which is going to be-- it doesn't have 00:12:17.120 |
to be a sentence transformer. It's just a pre-trained transformer model. We'll be using 00:12:22.720 |
XLM-RoBERTa later on. And it needs to be capable of understanding multiple languages. 00:12:30.000 |
So in this case, we feed the English sentence into both our teacher model and student model. 00:12:37.360 |
And then we optimize the student model to reduce the difference between the two vectors output 00:12:44.640 |
by those two models. And that makes the student model almost mimic the monolingual aspect of the 00:12:51.600 |
teacher model. But then we take it a little further, and we process the Italian, or the 00:12:57.360 |
target language, through the student model. And then we do the same thing. So we try to reduce 00:13:03.680 |
the difference between the Italian vector and the teacher's English vector. And what we're doing 00:13:09.440 |
there is making the student model mimic the teacher for a different language. So through 00:13:16.400 |
that process, you can add more and more languages to a student model, which mimics your teacher 00:13:23.920 |
model. I mean, it seems at least really simple just to think of it like that, in my opinion, 00:13:34.320 |
anyway. But it works really well. So it's a very cool technique, in my opinion. I do like it. 00:13:42.880 |
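Written out, the training signal is just two mean-squared-error terms against the teacher's English vector. The sketch below illustrates that objective only; it uses encode() for readability, so it will not backpropagate, and the actual fine-tuning later relies on sentence-transformers' MSELoss and fit().

```python
import torch
from torch.nn.functional import mse_loss

def distillation_objective(teacher, student, source_texts, translations):
    # the teacher embeds the English source sentences and stays frozen
    with torch.no_grad():
        target = teacher.encode(source_texts, convert_to_tensor=True)
    # the student embeds both the English sources and their translations
    student_src = student.encode(source_texts, convert_to_tensor=True)
    student_tgt = student.encode(translations, convert_to_tensor=True)
    # push both student outputs towards the teacher's English vectors
    return mse_loss(student_src, target) + mse_loss(student_tgt, target)
```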
So just a more visual way of going through that. We have these different circles. They represent 00:13:51.600 |
different language tasks, or different languages, but pretty similar, or the same task in each one 00:13:57.440 |
of those. We have our monolingual teacher model. And that can perform on one of these languages. 00:14:04.240 |
But fails on the others. We take that monolingual model, or our teacher model, and then we also 00:14:10.800 |
take a pre-trained multilingual model. So the important thing here is that it can handle new 00:14:16.080 |
languages, like I said with XLM-RoBERTa. This is our student. We perform multilingual 00:14:22.640 |
knowledge distillation, meaning the student learns how the teacher performs well on the single task 00:14:28.000 |
by mimicking its sentence vector outputs. The student then performs this mimicry across multiple 00:14:34.240 |
languages. And then hopefully, the student model can now perform across all of the languages that 00:14:42.240 |
we are wanting to train on. That's how the multilingual knowledge distillation works. 00:14:49.280 |
Let's have a look at that in code. Okay, so we're in our code here. And the first thing I'm going 00:14:55.920 |
to do is actually get our data. So in the paper that introduced the multilingual knowledge 00:15:04.320 |
distillation, Reimers and Gurevych focus partly on this TED subtitles data. So yeah, we 00:15:14.960 |
know TED Talks, they're just short talks where people present on a particular topic, usually 00:15:20.640 |
pretty interesting. And those TED Talks have subtitles in loads of different languages. 00:15:28.720 |
So they scraped that subtitle data and use that as sentence pairs for the different languages. 00:15:37.200 |
Okay, so that's the parallel data. Now, what I'm going to do is use Hugging Face Transformers 00:15:43.680 |
to download that. So we just import datasets here. So I said Hugging Face Transformers, 00:15:50.640 |
actually Hugging Face Datasets here. So import datasets, and I'm going to load that dataset. 00:15:57.440 |
And just have a look at what the structure of that dataset is. So it's the TED multi, 00:16:04.400 |
and I'm just getting the training data here. You see in here, we have this Features, 00:16:08.480 |
Translations, and Talk Name. Now, it's not really very clear, but inside the translations data, 00:16:16.560 |
we have the language tag. So these are language codes, ISO language codes. If you type that into 00:16:23.440 |
Google, they'll pop up, if you don't know which is which. And below, we also have 00:16:33.200 |
in here, it's not very clear again. So if I come here, we have translations, and each one of those 00:16:38.400 |
corresponds to the language code up here. Okay, so if we came here, we see EN, it's English, 00:16:45.200 |
and we find it here. Okay, and then we also have Talk Name. It's not really important for us. 00:16:53.520 |
So we can get the index of our English text, because we need to extract that for our source 00:17:01.440 |
language. So we extract that, we get number four, so we're going into those language pairs, 00:17:07.200 |
finding EN. And then we use that index to get the corresponding translation, 00:17:12.320 |
which is here. And then we'd use that to create all of our pairs. Now, here, I've just created 00:17:19.840 |
loads of pairs. This is the first one, so this is English to Arabic. But if we have a look, 00:17:25.520 |
there's actually loads of pairs here. So we have 27 in total, which is obviously quite a lot. 00:17:30.000 |
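As a rough sketch, building those English-to-target pairs from the dataset could look like the following. The ted_multi schema assumed here (a translations field holding parallel language and translation lists per record) matches what is described above, but treat the exact field names as an assumption.

```python
from datasets import load_dataset

ted = load_dataset('ted_multi', split='train')

pairs = {}  # e.g. pairs['it'] -> list of (english, italian) tuples
for row in ted:
    langs = row['translations']['language']
    texts = row['translations']['translation']
    if 'en' not in langs:
        continue  # skip records with no English subtitles
    source = texts[langs.index('en')]  # English is our source language
    for lang, text in zip(langs, texts):
        if lang != 'en':
            pairs.setdefault(lang, []).append((source, text))

print(len(pairs))  # roughly the number of language pairs mentioned above
```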
Probably not going to use all of those. I mean, you could do if you want to. It depends on what 00:17:34.320 |
you're trying to build. But I think most of us are probably not going to be trying to build some 00:17:39.200 |
model that crosses all these different languages. So what I'm going to do is just initialize a list 00:17:47.120 |
of languages that we would like to train on. So we're going to be feeding all of this into 00:17:56.960 |
a sentence transformer class called ParallelSentencesDataset. And that requires that 00:18:04.400 |
we, one, separate out our pairs using a tab character, and two, keep all those pairs separated 00:18:12.160 |
in different gzip files. So that's why I'm using this particular structure. So data preprocessing 00:18:20.400 |
steps here, I'm just running through them quickly because I want to focus more on the actual 00:18:24.560 |
sentence transformer training part. So run that, and we can-- well, it's actually going to take a 00:18:31.040 |
moment, so let me skip forward. And then we'll see how many pairs-- well, I just want to see. 00:18:38.880 |
We don't have to do this. But I want to see how many pairs we have for each language. 00:18:42.640 |
And you see here, we have about 200,000 for each of them. The German one is slightly less. 00:18:50.720 |
And then let's have a look at what those source and translations look like. So here, 00:18:55.360 |
we have "applause" and "applausi." Now, I think that's Italian. It seems so. But here, we can see, OK, 00:19:04.960 |
the end of the talk ends in applause. So obviously, the subtitles say applause. Well, 00:19:11.520 |
hopefully, it ends in applause. And then we just have the tab character, and that separates the 00:19:17.520 |
source language, English in this case, from the translated language. Now what we want to do is 00:19:26.960 |
save that data. So we sort all that in these dictionaries. So initialize dictionary here, 00:19:33.280 |
and access them here. So we have ENIT, ES, AR, FR, and DE. And now I'm just going to save them. 00:19:42.240 |
So run this. That will save. And what I'll do is just write os.listdir, so we can see what is in 00:19:52.080 |
there. Where is it? It's data. Just data. Is that right? OK. And then we have these five files. OK. 00:20:03.920 |
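A minimal sketch of that save step, assuming the pairs dictionary from the earlier sketch; the ./data directory and file names are placeholders, not necessarily what the video uses.

```python
import gzip
import os

os.makedirs('./data', exist_ok=True)
target_langs = ['it', 'es', 'ar', 'fr', 'de']  # the five languages kept in the video

for lang in target_langs:
    path = f'./data/ted-train-en-{lang}.tsv.gz'  # hypothetical file naming
    with gzip.open(path, 'wt', encoding='utf-8') as fp:
        for source, translation in pairs[lang]:
            # one tab-separated pair per line, the format ParallelSentencesDataset expects
            fp.write(f'{source}\t{translation}\n')

print(os.listdir('./data'))  # should list the five .tsv.gz files
```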
Now let's continue. So now what we want to do is, OK, we have-- that's our training data. It's 00:20:12.000 |
ready, or mostly ready, before we feed it into the Sentence Transformer's parallel sentences 00:20:19.600 |
data set object later on. So OK, let's leave that for now and move on to the next step, which is 00:20:26.320 |
choosing our teacher and student models. So I already mentioned before, we want our student 00:20:33.920 |
model to be capable of multilingual comprehension. So what I mean by that-- or not just what I mean, 00:20:42.320 |
but one big component of that is, can the Transformer Tokenizer deal with different 00:20:48.080 |
languages? In some cases, they really can't. So let me show you what the BERT Tokenizer does with 00:20:55.920 |
these four different sentences. So we'll just loop through each one. So four texts in sentences. 00:21:03.200 |
And what I'm going to do is just print. I'm going to print the output of the BERT Tokenizer. 00:21:09.600 |
And if I tokenize that text, what does it give me? OK. So what we have here-- OK, English, 00:21:20.080 |
of course. BERT is fine. The Tokenizer, or the vocabulary of the Tokenizer of BERT is, I think, 00:21:27.920 |
roughly 30,000 tokens. And most of those are English-based. You can see here that it has 00:21:38.400 |
picked up some Chinese characters, because it does-- other languages do feed into it a little 00:21:42.880 |
bit, because it's just-- all the data is pulled from the internet. Other bits do get in there. 00:21:49.200 |
But it's mostly English. So that's why we see, OK, we have these unknown characters. 00:21:54.720 |
Now, as soon as we have an unknown character in our sentence, the Tokenizer-- or no, sorry, 00:22:00.960 |
the Transformer is really going to struggle to understand what is in that position. What is that unknown 00:22:07.920 |
token supposed to represent? In the case of-- I think of it as it's like when you're a kid 00:22:16.400 |
in school, and they had those-- had a paragraph, and you had to fill in the blanks. So you had a 00:22:23.600 |
paragraph, and occasionally, in a couple of sentences, there'll be a couple of blank lines 00:22:28.880 |
where you need to guess what the correct word should be. If you only have a couple of those 00:22:34.160 |
blanks, as a person, you can probably guess accurately. And the same for BERT. BERT can 00:22:41.760 |
probably guess accurately what the occasional unknown token is. But if in school, they gave 00:22:50.320 |
you a sheet, and they said, OK, fill out these blanks, and it was literally just a paragraph 00:22:55.680 |
of blank, and you had to guess it correctly, you probably-- I don't know. I think your chances are 00:23:01.360 |
pretty slim of getting that correct. So the same is true for BERT. BERT, for example, 00:23:09.440 |
in our Georgian example down here, how can BERT know what that means? It will not know. 00:23:14.960 |
So the tokenizer from BERT is not suitable for non-Latin character languages whatsoever. 00:23:23.360 |
And then it does know some Greek characters here. And maybe it knows all of them, 00:23:28.000 |
because I suppose Greek feeds into Latin languages a bit more than Georgian or Chinese. 00:23:35.760 |
But it doesn't know what to do with them. They're all single-character tokens. And the issue with 00:23:40.560 |
single-character tokens is that you can't really encode that much information into a single 00:23:46.240 |
character. Because if you have 24 characters in your alphabet, that means you have 24 encodings 00:23:54.160 |
to represent your entire language, which is not going to happen. So that's also not good. 00:24:01.440 |
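As a quick check you can run yourself, the sketch below tokenizes a few sentences with the BERT tokenizer and with the XLM-R tokenizer discussed next. The non-English sentences are illustrative stand-ins (roughly "I love plants" in Chinese, Georgian, and Greek), and the exact tokens you get back depend on the tokenizer versions.

```python
from transformers import AutoTokenizer

bert_tok = AutoTokenizer.from_pretrained('bert-base-uncased')
xlmr_tok = AutoTokenizer.from_pretrained('xlm-roberta-base')

sentences = [
    'I love plants',          # English
    '我爱植物',                # Chinese
    'მე მიყვარს მცენარეები',    # Georgian
    'Μου αρέσουν τα φυτά',    # Greek
]

for text in sentences:
    print('BERT :', bert_tok.tokenize(text))   # [UNK]s or lone characters for non-Latin scripts
    print('XLM-R:', xlmr_tok.tokenize(text))   # SentencePiece subwords, no unknown tokens
```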
So basically, don't use a BERT tokenizer. It's not a good idea. What you can do is, 00:24:08.560 |
OK, check how this XLM-R tokenizer does. Now, XLM-R is trained for multilingual comprehension. 00:24:20.160 |
It uses a SentencePiece tokenizer, which uses subword logic to split up the sentence or the 00:24:27.360 |
words. So it can deal with tokens it's never seen before, which is pretty nice. And the vocabulary 00:24:35.840 |
size for this is not 30k. I think it's 250k. It could be off a few k there, but it's around that 00:24:44.000 |
mark. And it's been trained on many languages. So it's obviously a much better option for our 00:24:53.280 |
student model. So let's have a look at how we initialize that. So this xlmr model is just coming 00:25:01.520 |
from Transformers. So I need to convert that model from just a Transformer model into an-- 00:25:10.320 |
or initialize it as a Sentence Transformer model using the Sentence Transformers library. 00:25:16.160 |
So from Sentence Transformers, I'm going to import models and also Sentence Transformer. 00:25:23.680 |
So xlmr, so this is going to be our actual Transformer model. We're going to write 00:25:30.480 |
models.Transformer. And Sentence Transformers under the hood uses HuggingFace Transformers as well. 00:25:39.200 |
So we would access this as the normal model identifier that we would with normal HuggingFace 00:25:47.200 |
Transformers, which is xlm-roberta-base. As well as that, we need a pooling layer. 00:25:57.040 |
So we write models.pooling. And in here, we need to pass the output embeddings dimensions. So it's 00:26:08.240 |
this get word embedding dimension for our model. And also what type of pooling we'd like to do. We 00:26:15.520 |
have max pooling, CLS token pooling. And what we want is mean pooling. So that's pooling 00:26:15.520 |
mode mean tokens equals true. Okay. So those are the two components of our Sentence Transformer. 00:26:40.240 |
And then from there, we can initialize our students. So student equals Sentence Transformer. 00:26:48.000 |
And we're initializing that using the modules, which is just a list of our two components there. 00:26:57.120 |
So xlmr followed by pooling. And that's it. So let's have a look at what we have there. 00:27:06.240 |
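Putting those spoken steps together, a sketch of the student initialization:

```python
from sentence_transformers import models, SentenceTransformer

xlmr = models.Transformer('xlm-roberta-base')   # the multilingual transformer backbone
pooling = models.Pooling(
    xlmr.get_word_embedding_dimension(),        # 768 for xlm-roberta-base
    pooling_mode_mean_tokens=True,              # mean pooling over token embeddings
)
student = SentenceTransformer(modules=[xlmr, pooling])
print(student)
```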
Okay. We can just ignore this top bit here. We just want to focus on this. 00:27:10.880 |
So you see we have our transformer model followed by the pooling here. And we also see that we're 00:27:17.840 |
using the mean tokens pooling set to true, rest of them are false. Okay. So that's our student 00:27:24.240 |
model initialized. And now what we want to do is initialize our teacher model. Now the teacher model, 00:27:32.080 |
let me show you. You just have to be a little bit careful with this. So Sentence Transformer. 00:27:38.000 |
So maybe you'd like to use one of the top-performing ones, a lot of which are the "all" models. 00:27:48.400 |
So these are monolingual models, like all-mpnet-base-v2. And okay, let's initialize this and let's see 00:28:02.160 |
what is inside it. Okay. So we have the transformer, we have the pooling as we had before, 00:28:08.400 |
but then we also have this normalization layer. So the outputs from this model are normalized. 00:28:16.080 |
And obviously, if you're trying to make another model mimic that normalization layer outputs, 00:28:23.440 |
well, it's not ideal because the model is going to be trying to normalize its own vectors. So 00:28:32.480 |
you don't really want to do that. You want to choose a model. You either want to remove the 00:28:36.720 |
normalization layer or just choose a model that doesn't have a normalization layer, which I think 00:28:43.840 |
is probably the better option. So that's what I'm going to do. So for the teacher, I'm going to use 00:28:49.840 |
a Sentence Transformer. I'm going to use one of the paraphrase models because these 00:28:55.840 |
don't use normalization layers: paraphrase-distilroberta-base-v2. Okay. Let's have a look. 00:29:10.640 |
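So initializing the teacher is a one-liner, and printing it is an easy way to confirm there is no Normalize module on the end:

```python
from sentence_transformers import SentenceTransformer

# paraphrase models output unnormalized vectors, so there is no Normalize layer to mimic
teacher = SentenceTransformer('paraphrase-distilroberta-base-v2')
print(teacher)  # should show Transformer -> Pooling only
```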
Okay. So now you can see we have the transformer followed directly by the pooling. 00:29:16.240 |
Now, another thing that you probably should just be aware of here is that we have this max 00:29:20.800 |
sequence length here is 512, which doesn't align with our paraphrase model here. But that's fine 00:29:28.560 |
because I'm going to limit the maximum sequence length anyway to 256. So it's not really an issue, 00:29:38.560 |
but just look out for that if you're training your own models. This one is 384. So none of 00:29:44.480 |
those align. But yeah, just be aware that the sequence lengths might not align there. 00:29:53.280 |
So we've formatted our training data. We have our two models, the teacher and the student. 00:30:04.880 |
So now what we can do is prepare that data for loading into our training process, 00:30:11.680 |
our fine tuning process. So I said before, we're going to be using the parallel sentences, 00:30:19.360 |
sorry, from Sentence Transformers import ParallelSentencesDataset. 00:30:25.360 |
And first thing we need to do here is actually initialize the object. And that requires that 00:30:34.080 |
we pass the two models that we're training with because this kind of handles the interaction 00:30:39.760 |
between those two models as well. So obviously we have our student model, which is our student. 00:30:47.040 |
And we have the teacher model, which is our teacher. Alongside this, we want batch size. 00:30:58.800 |
I'm going to use 32, but I think actually you can probably use higher batches here, 00:31:05.280 |
or you probably should use higher batches. I think 64 is one that I see used a lot in these training 00:31:13.280 |
codes. And we also set use embedding cache equal to true. Okay. So that initializes the parallel 00:31:27.600 |
sentences dataset object. And now what we want to do is add our data to it. So we need our training 00:31:35.360 |
files. So training files equal to the os.listdir that we did before. I think it's in the data file, 00:31:44.400 |
in the data directory. Yeah. So that's what we want. And what I'll do is just 00:31:56.800 |
for F in those train files, I'm going to load each one of those into the dataset object. 00:32:04.160 |
Print f and data.load_data. I need to make sure I include the path there, 00:32:12.560 |
followed by the actual file name. You need to pass your max sentences, 00:32:22.240 |
which is the maximum number of sentences that you're going to take from that load data batch. 00:32:27.120 |
So basically the maximum number of sentences we're going to use from each language there. 00:32:33.920 |
Now, I'm just going to set this to 250,000, which is higher than the number of pairs we have for any language. 00:32:42.080 |
That's fine. I don't think, I mean, if you want to try and balance it out, that's fine. You can 00:32:47.920 |
do that here. And then the other option is where we set the maximum length of the sentences that 00:32:59.120 |
we're going to be processing. So that is max sentence length. And I said before, look, 00:33:07.040 |
the maximum we have here is 256 or 512. So let's just trim all of those down to 256. 00:33:17.920 |
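Gathering that up, the data loading looks roughly like this; train_files is the os.listdir output from earlier, the ./data path matches the placeholder naming used in the earlier sketch, and the import path shown matches older sentence-transformers releases, so check your installed version.

```python
import os
from sentence_transformers import ParallelSentencesDataset

data = ParallelSentencesDataset(
    student_model=student,
    teacher_model=teacher,
    batch_size=32,
    use_embedding_cache=True,
)

train_files = os.listdir('./data')
for f in train_files:
    print(f)
    data.load_data(
        os.path.join('./data', f),
        max_sentences=250_000,    # more than any language pair has, so nothing is dropped
        max_sentence_length=256,  # trim very long sentences
    )
```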
Okay. That will load our data. And now we just need to initialize a data loader. So we're just 00:33:26.960 |
using PyTorch here. So run from torch.utils.data import DataLoader. Loader is equal to 00:33:39.920 |
DataLoader. That's our data. We want to shuffle that data. And we also want to set the batch size, 00:33:49.440 |
which is same as before, 32. Okay. So models are ready. Data is ready. Now we initialize our 00:33:59.760 |
loss function. So from sentence transformers again, dot losses, import MSE loss. 00:34:09.280 |
And then loss is equal to MSE loss. And then here we have model equals student model. Okay. So we're 00:34:22.640 |
only optimizing our student model, not the teacher model. The teacher model is there to teach our 00:34:29.360 |
student, not the other way around. Okay. So that's everything we need ready for training. So 00:34:37.920 |
let's move on to the actual training function. So we can train. I'm going to train for one epoch, 00:34:44.640 |
but you can do more. I think in the actual, so in the other codes I've seen that do this, 00:34:54.240 |
they will train for like five epochs. But even just training on one epoch, 00:34:59.280 |
you actually get a pretty good model. So I think you don't need to train on too many. 00:35:07.040 |
But obviously, if you want better performance, I would go with the five that I've seen in the 00:35:14.080 |
other codes. So we need to pass our train objectives here. So we have the data loader 00:35:23.040 |
and then the loss function. Now we want to say, okay, how many epochs? I've said before, 00:35:29.200 |
I'm going to go with one, a number of warmup steps. So before you jump straight up to the 00:35:35.680 |
learning rate that we select in a moment, do we want to warm up first? Yes, we do. I'm going to 00:35:43.440 |
warm up for 10% of the training data, which is just the length of the loader and multiply by 0.1. 00:35:55.120 |
Okay, and from there, where do you want to save the model? I'm going to try, 00:36:03.200 |
I'm going to save it in xlm-ted. Then, our optimizer parameters. 00:36:11.200 |
So we're going to set a learning rate of 2e to the minus 5, epsilon of 1e to the minus 6. 00:36:26.480 |
I'm also going to set correct bias equal to false. 00:36:32.160 |
Okay, there are the optimizer parameters, and then we can also save the best model. 00:36:40.000 |
Save the best model equal to true. And then we run it. Okay, so run that. It's going to 00:36:50.640 |
take a long time, so I'm actually going to stop it because I've already run it. 00:36:55.520 |
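For reference, the whole fine-tuning call is roughly the following; the output path is a placeholder, and the correct_bias flag follows the video but only applies if sentence-transformers is using the transformers AdamW optimizer.

```python
from torch.utils.data import DataLoader
from sentence_transformers.losses import MSELoss

loader = DataLoader(data, shuffle=True, batch_size=32)
loss = MSELoss(model=student)  # only the student's weights are optimized

student.fit(
    train_objectives=[(loader, loss)],
    epochs=1,                             # more epochs (e.g. 5) for better performance
    warmup_steps=int(len(loader) * 0.1),  # warm up over 10% of the training steps
    output_path='./xlm-ted',              # placeholder save directory
    optimizer_params={'lr': 2e-5, 'eps': 1e-6, 'correct_bias': False},
    save_best_model=True,
)
```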
And let's have a look at actually evaluating that and have a look at the results. 00:36:59.520 |
Okay, so I just have this notebook where I've evaluated the model. So I'm using this STS 00:37:07.760 |
(semantic textual similarity) benchmark dataset, which is multilingual. I'm getting the English 00:37:14.960 |
data and also the Italian. And you can see they are similar. So each row in the English data set 00:37:26.240 |
corresponds to the other language data sets as well. So in here, sentence 1 in the English means 00:37:31.600 |
the same thing as sentence 1 in the Italian. Okay, same for sentence 2, and also the same similarity score. 00:37:39.680 |
So the first thing we do is normalize that similarity score, and then we go down a little 00:37:46.320 |
bit. So we reformat the data using Sentence Transformer's InputExample class. And through 00:37:54.960 |
this, I've created three different evaluation sets. So we have the English to English, 00:38:00.480 |
Italian to Italian, and then English to Italian. And then what we do here is we initialize 00:38:09.600 |
a similarity evaluator for each of these data sets. Again, we're using Sentence Transformers, 00:38:15.680 |
just makes life a lot easier. We initialize those, and then we can just pass our model 00:38:21.280 |
to each one of those evaluators to get its performance. So here, 81.6 on the English set, 00:38:28.640 |
74.3 and 71 here. Now, I just trained on one epoch. If you want better performance, 00:38:38.320 |
you can train on more epochs, and you should be able to get more towards 80% or maybe a little 00:38:44.240 |
bit higher. So pretty straightforward and incredibly easy. And then here, I wanted to 00:38:53.040 |
compare that to the student before we trained it. So I initialized a new student and had a look, 00:38:58.800 |
and you can see the evaluation is pretty low. So for English, 47.5. Italian, actually 50%, 00:39:07.680 |
surprisingly. Although it's already a multilingual model, so it does make sense that it can handle 00:39:15.120 |
Italian. And then from English to Italian, it really struggles, drops down to 23. 00:39:22.800 |
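A sketch of that evaluation setup is below. The stsb_multi_mt dataset id and its column names (sentence1, sentence2, similarity_score on a 0-5 scale) are my assumption for the multilingual STS data described here, so adjust them to whatever the benchmark you load actually exposes.

```python
from datasets import load_dataset
from sentence_transformers import InputExample
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

en = load_dataset('stsb_multi_mt', 'en', split='test')
it = load_dataset('stsb_multi_mt', 'it', split='test')

def build_examples(sents_a, sents_b, scores):
    # normalize the 0-5 similarity scores to the 0-1 range the evaluator expects
    return [InputExample(texts=[a, b], label=s / 5.0)
            for a, b, s in zip(sents_a, sents_b, scores)]

en_en = build_examples(en['sentence1'], en['sentence2'], en['similarity_score'])
it_it = build_examples(it['sentence1'], it['sentence2'], it['similarity_score'])
en_it = build_examples(en['sentence1'], it['sentence2'], en['similarity_score'])

# each evaluator returns a correlation between predicted and labeled similarity
for name, examples in [('en-en', en_en), ('it-it', it_it), ('en-it', en_it)]:
    evaluator = EmbeddingSimilarityEvaluator.from_input_examples(examples, name=name)
    print(name, evaluator(student))
```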
So that's it for this video. I think it's been pretty useful, at least for me. I can kind of 00:39:32.000 |
see where you can build a Sentence Transformer in a lot of different languages using this, 00:39:39.040 |
which is, I think, really cool and will probably be useful for a lot of people. 00:39:43.680 |
So I hope you enjoyed the video. Thank you very much for watching, and I'll see you again in the next one. Bye.