
All You Need to Know on Multilingual Sentence Vectors (1 Model, 50+ Languages)


Chapters

0:00 Intro
1:19 Multilingual Vectors
5:55 Multi-task Training (mUSE)
9:36 Multilingual Knowledge Distillation
11:13 Knowledge Distillation Training
13:43 Visual Walkthrough
14:53 Parallel Data Prep
20:23 Choosing a Student Model
24:55 Initializing the Models
30:05 ParallelSentencesDataset
33:54 Loss and Fine-tuning
36:59 Model Evaluation
39:23 Outro

Transcript

Today we're going to be having a look at multilingual sentence transformers. We're going to look at how they work, how they're trained, and why they're so useful. We're going to be focusing on one specific training method, which I think is quite useful because all it really needs is a reasonably small data set of parallel data, which is simply translation pairs from a source language like English to whichever other language you're using.

So obviously, if you want to train a sentence transformer in a language that doesn't really have that much data, particularly sentence similarity data, this can be really useful for taking a high-performing English sentence transformer, for example, and transferring or distilling that knowledge into a sentence transformer for your own language.

So I think this will be pretty useful for a lot of you. Let's jump straight into it. Before we really get into the whole multilingual sentence transformer part of the video, I just want to give an impression of what these multilingual sentence transformers are actually doing. So on here, we can see a single English sentence or brief phrase down at the bottom, "I love plants," and the rest of these are all in Italian.

So what we have here are vector representations or dense vector representations of these phrases. And a monolingual sentence transformer, which is most of the sentence transformers, will only cope with one language. So we would hope that phrases that have a similar meaning end up within the same sort of vector space.

So like we have for "amo le piante" here and "I love plants," these are kind of in the same space. A monolingual sentence transformer would do that for similar sentences. So in English, we might have "I love plants" and "I like plants," which is actually what we have up here.

So this here is Italian for "I like plants." And we would hope that they're in a similar area, whereas irrelevant or almost contradictory sentences we would hope would be far off somewhere else, like our vector over here. So that's how, obviously, a monolingual sentence transformer works, and it's exactly the same for a multilingual sentence transformer.

The only difference is that rather than having a single language, it will comprehend multiple languages. And that's what you can see in this visual. So in this example, I have "I love plants" and "amo le piante." They have the same meaning, just in different languages. So that means that they should be as close together as possible in this vector space.

So here we're just visualizing three dimensions. In reality, it'd be a lot more. I think most transformer models go with 768 dimensions. But obviously, we can't visualize that, so we have 3D here. So we want different languages or similar sentences from different languages to end up in the same area.

And we also want to be able to represent relationships between different sentences that are similar. And we can kind of see that relationship here. So we have "mi piacciono le piante" and "amo le piante" and "I love plants" all kind of in the same sort of area. "Mi piacciono le piante," so "I like plants," is obviously separated somewhat, but it's still within the same area.

And then in the bottom left down there, we have "ho un cane arancione," which means "I have an orange dog." So obviously, you know, that's really nothing to do with "I love plants." Although I suppose you could say you're talking about yourself, so maybe it's a little bit similar.

But otherwise, they're completely different topics. So that's kind of what we want to build, something that takes sentences from different languages and maps them into a vector space, which has some sort of numerical structure to represent the semantic meaning of those sentences. And it should be language agnostic. So obviously, we can't -- well, maybe we can train on every language.

I don't know any models that are trained on every single language, but we want it to be able to comprehend different languages and not be biased towards different phrases in different languages, but just have a very balanced comprehension of all of them. Okay? So that's how the vectors should look.

And then, okay, so what would the training data for this look like, and what are the training approaches? So like I said before, there are two training approaches that I'm going to briefly touch upon, but we're going to focus on the latter of those. So the first one that I want to mention is what mUSE, or the Multilingual Universal Sentence Encoder model, was trained with, which is a multi-task translation bridging approach to training.

So what I mean by that is it uses a dual-encoder structure, and those encoders deal with two different tasks. On one end, you have the parallel data training. When we say parallel data, these are sentence pairs in different languages. So like we had before, we had "amo le piante" and "I love plants," which are just the Italian and English phrases for "I love plants."

So we would have our source language and also the translation, or the target language is probably a better way, but I'll put translation for now. So we have the source and translation. That's our parallel data set. And what we're doing is optimizing to get those two vectors or the two sentence vectors produced by either one of those sentences as close as possible.
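To make that concrete, a single row of parallel data is just a source sentence and its translation. A minimal sketch, purely illustrative, using the example phrases from this video:

```python
# Purely illustrative: one row of a parallel (translation-pair) dataset.
# A full dataset is simply many of these (source, translation) pairs.
pair = ("I love plants", "Amo le piante")  # (English source, Italian target)
```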

And then there is also the source data. So we basically have sentence similarity or NLI data, but we have it just for the source language. So we have source, sentence A, and source, sentence B. And we train on both of these. Now, it works, and that's good. But obviously, we're training on a multi-task architecture here, and training on a single task in machine learning is already hard enough.

Training on two and getting them to balance and train well is harder. And the amount of data is pretty significant, at least for mUSE, and I believe if you're training with this approach you're going to need a similar amount. I think mUSE used something like a billion pairs, so it's pretty high.

And another thing is that we also need something called hard negatives in the training data in order for this model to perform well. So what I mean by hard negative is, say we have our source sentence A here, and we have this source B, which is like a similar sentence, a high similarity sentence.

They mean basically the same thing. We'd also have to add a source C. And this source C will have to be similar in the words it uses to source A, but actually mean something different, so it's harder for the model to differentiate between them. The model has to figure out that these two sentences are not similar, even though they seem similar at first.
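To illustrate, here is a hypothetical triple of my own (not from the video's dataset); the hard negative reuses surface vocabulary from the anchor but carries a different meaning:

```python
# Hypothetical NLI-style triple including a hard negative.
# The hard negative reuses words from the anchor ("phone", "train", "morning")
# but does not mean the same thing.
example = {
    "anchor": "I left my phone on the train this morning.",
    "positive": "I forgot my mobile on the train earlier today.",
    "hard_negative": "I phoned the train company this morning to book a ticket.",
}
```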

So it makes the training task harder for the model, which, of course, makes the model better. So that is training approach number one. And we've mentioned the parallel data there. That's the data set we're going to be using for the second training approach. And that second training approach is called multi-lingual knowledge distillation.

So that is a mouthful, and it takes me a while to write it out, sorry. So, multilingual knowledge distillation. This was introduced in 2020 by the Sentence Transformers people, who we've mentioned before, Nils Reimers and Iryna Gurevych. And the advantage of using this approach is that we only need the parallel dataset.

So we only need those translation pairs. And the amount of training data you need is a lot smaller. And using this approach, the Sentence Transformers people have actually trained Sentence Transformers that can use more than 50 languages at once. And the performance is good. It's not just that they managed to get a few phrases correct.

The performance is actually quite good. So I think it's pretty impressive. And the training time for these is super quick, as we'll see. And like I said, it's using just translation data, parallel data, which is reasonably easy to get for almost every language. So I think that's pretty useful.

Now, let's have a look at what that multilingual knowledge distillation training process actually looks like. So it's what we have here. Same example as before: I've got "I like plants" this time and "Mi piacciono le piante," which is, again, the same thing in Italian. Now, we have both of those.

We have a teacher model and a student model. When we say knowledge distillation, that means you basically take one model and distill the knowledge from it into another model. The model that already knows the stuff we want, the one we distill knowledge from, is called the teacher model.

Now, the teacher model, in this case, is going to be a monolingual model. So it's probably going to be a sentence transformer that's very good at English tasks only. And then we take the student model, which doesn't have to be a sentence transformer.

It's just a pre-trained transformer model. We'll be using XLM-RoBERTa later on. And it needs to be capable of understanding multiple languages. So in this case, we feed the English sentence into both our teacher model and student model. And then we optimize the student model to reduce the difference between the two vectors output by those two models.

And that makes the student model almost mimic the monolingual aspect of the teacher model. But then we take it a little further, and we process the Italian, or the target language, through the student model. And then we do the same thing. So we try to reduce the difference between the Italian vector and the teacher's English vector.

And what we're doing there is making the student model mimic the teacher for a different language. So through that process, you can add more and more languages to a student model, which mimics your teacher model. I mean, it seems at least really simple just to think of it like that, in my opinion, anyway.
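As a rough sketch of that objective (not the actual sentence-transformers implementation; `teacher` and `student` are assumed here to be callables that map a list of sentences to a tensor of embeddings of the same dimension):

```python
import torch
import torch.nn.functional as F

def distillation_loss(teacher, student, source_sents, target_sents):
    """MSE between teacher embeddings of the source language and student
    embeddings of both the source and the target language."""
    with torch.no_grad():                     # the teacher is frozen
        target_vecs = teacher(source_sents)   # e.g. English sentences
    loss_src = F.mse_loss(student(source_sents), target_vecs)
    loss_tgt = F.mse_loss(student(target_sents), target_vecs)  # e.g. Italian
    return loss_src + loss_tgt                # student mimics the teacher in both languages
```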

But it works really well, so it's a very cool technique, in my opinion. I do like it. Now, for a more visual way of going through that: we have these different circles, which represent the same task across different languages.

We have our monolingual teacher model, and that can perform on one of these languages but fails on the others. We take that monolingual model, our teacher model, and we also take a pre-trained multilingual model. The important thing here is that it can handle new languages, like I said, with XLM-RoBERTa.

This is our student. We perform multilingual knowledge distillation, meaning the student learns how the teacher performs well on the single task by mimicking its sentence vector outputs. The student then performs this mimicry across multiple languages. And then hopefully, the student model can now perform across all of the languages that we are wanting to train on.

That's how multilingual knowledge distillation works. Let's have a look at it in code. Okay, so we're in our code here, and the first thing I'm going to do is actually get our data. In the paper that introduced multilingual knowledge distillation, Reimers and Gurevych focus partly on this TED subtitles data.

So yeah, we know TED Talks; they're just short talks where people present on a particular topic, usually pretty interesting. And those TED Talks have subtitles in loads of different languages. So they scraped that subtitle data and used it as sentence pairs for the different languages. Okay, so that's the parallel data.

Now, what I'm going to do is use Hugging Face to download that. So we just import datasets here. I said Hugging Face Transformers, but it's actually Hugging Face Datasets here. So import datasets, and I'm going to load that dataset and have a look at its structure.

So it's ted_multi, and I'm just getting the training data here. You see in here we have these features: translations and talk_name. Now, it's not very clear, but inside the translations data we have the language tags. These are ISO language codes; if you don't know which is which, you can type them into Google and they'll pop up.

Below that, we also have the translations themselves, again not very clear, and each one of those corresponds to a language code up here. So if we come here, we see EN is English, and we find the English text here.

Okay, and then we also have talk_name, which isn't really important for us. Now we can get the index of our English text, because we need to extract that as our source language. So we extract that and get index four; we're going into that language list and finding EN.

And then we use that index to get the corresponding translation, which is here. And then we'd use that to create all of our pairs. Now, here, I've just created loads of pairs. This is the first one, so this is English to Arabic. But if we have a look, there's actually loads of pairs here.

So we have 27 in total, which is obviously quite a lot. We're probably not going to use all of those. I mean, you could if you want to; it depends on what you're trying to build. But I think most of us are probably not trying to build a model that crosses all of these different languages.

So what I'm going to do is just initialize a list of languages that we would like to train on. We're going to be feeding all of this into a Sentence Transformers class called ParallelSentencesDataset, and that requires that we, one, separate out our pairs using a tab character and, two, keep the pairs for each language in separate gzip files.
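A sketch of what that pair-building might look like (the variable names and the target-language list here are my own choices, not necessarily the exact code from the video):

```python
from datasets import load_dataset

dataset = load_dataset("ted_multi", split="train")

# Target languages we want to pair with the English source.
pairs = {lang: [] for lang in ("it", "es", "ar", "fr", "de")}

for row in dataset:
    langs = row["translations"]["language"]      # e.g. ['ar', ..., 'en', 'it', ...]
    texts = row["translations"]["translation"]   # aligned list of translations
    if "en" not in langs:
        continue
    source = texts[langs.index("en")]            # English is the source language
    for lang in pairs:
        if lang in langs:
            pairs[lang].append((source, texts[langs.index(lang)]))
```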

So that's why I'm using this particular structure. So data preprocessing steps here, I'm just running through them quickly because I want to focus more on the actual sentence transformer training part. So run that, and we can-- well, it's actually going to take a moment, so let me skip forward.

And then we'll see how many pairs-- well, I just want to see. We don't have to do this. But I want to see how many pairs we have for each language. And you see here, we have about 200,000 for each of them. The German one is slightly less. And then let's have a look at what those source and translations look like.

So here we have "applause" and the translated "applause"; I think that pair is Italian. It seems so. The end of the talk ends in applause, so obviously the subtitles say applause. Well, hopefully it ends in applause. And then we just have the tab character, which separates the source language, English in this case, from the translated language.

Now what we want to do is save that data. So we store all of that in these dictionaries: initialize the dictionary here and access them here. So we have EN-IT, ES, AR, FR, and DE. And now I'm just going to save them. So run this, that will save, and what I'll do is just write os.listdir.
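The save step might look something like this, continuing from the `pairs` dict built above (the file-naming pattern is my own assumption, but the tab-separated, gzip-compressed format is what ParallelSentencesDataset expects):

```python
import gzip
import os

os.makedirs("data", exist_ok=True)

# One gzip file per language pair, one "source<TAB>translation" pair per line.
for lang, rows in pairs.items():
    with gzip.open(f"data/ted-train-en-{lang}.tsv.gz", "wt", encoding="utf-8") as f:
        for source, translation in rows:
            f.write(f"{source}\t{translation}\n")

print(os.listdir("data"))  # should show the five files
```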

So we can see what is in there. Where is it? It's in data, just data. And then we have these five files. Okay, now let's continue. So that's our training data. It's ready, or mostly ready, to feed into the Sentence Transformers ParallelSentencesDataset object later on.

So let's leave that for now and move on to the next step, which is choosing our teacher and student models. I already mentioned that we want our student model to be capable of multilingual comprehension. One big component of that is: can the transformer's tokenizer deal with different languages?

In some cases, they really can't. So let me show you what the BERT tokenizer does with these four different sentences. We'll just loop through each one, so for text in sentences, and what I'm going to do is just print the output of the BERT tokenizer.

If I tokenize that text, what does it give me? So what we have here: English, of course, BERT is fine. The vocabulary of the BERT tokenizer is, I think, roughly 30,000 tokens, and most of those are English-based. You can see here that it has picked up some Chinese characters, because other languages do feed into it a little bit; all the data is pulled from the internet.

Other bits do get in there, but it's mostly English. So that's why we see these unknown tokens. Now, as soon as we have an unknown token in our sentence, the transformer is really going to struggle to understand what is in that position.

What is that unknown token supposed to represent? I think of it as being like when you were a kid in school and had one of those fill-in-the-blank exercises. You had a paragraph, and occasionally, in a couple of sentences, there would be blank lines where you needed to guess the correct word.

If you only have a couple of those blanks, as a person, you can probably guess accurately, and the same goes for BERT: it can probably guess what the occasional unknown token is. But if in school they gave you a sheet that was literally just a paragraph of blanks and said, okay, fill these out correctly, your chances of getting it right would be pretty slim.

The same is true for BERT. In our Georgian example down here, how can BERT know what that means? It will not know. So the BERT tokenizer is not suitable for non-Latin-script languages whatsoever.

And then it does know some Greek characters here, maybe all of them, because I suppose Greek feeds into Latin languages a bit more than Georgian or Chinese. But it doesn't really know what to do with them; they all become single-character tokens. And the issue with single-character tokens is that you can't really encode that much information into a single character.

Because if you have 24 characters in your alphabet, that means you have 24 encodings to represent your entire language, which is not going to work. So that's also not good. Basically, don't use the BERT tokenizer here; it's not a good idea. So what about the XLM-R tokenizer?

Now, XLM-R is trained for multilingual comprehension. It uses a SentencePiece tokenizer, which uses byte-level logic to split up sentences or words, so it can deal with text it has never seen before, which is pretty nice. And the vocabulary size for this is not 30k; I think it's 250k.
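A quick way to see the difference is to tokenize the same sentences with both tokenizers. A sketch; the example sentences below are stand-ins for the ones shown in the video:

```python
from transformers import AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
xlmr_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

sentences = [
    "I love plants",          # English
    "我喜欢植物",               # Chinese
    "მიყვარს მცენარეები",      # Georgian
]

for text in sentences:
    print(bert_tokenizer.tokenize(text))  # non-Latin scripts: [UNK]s / single characters
    print(xlmr_tokenizer.tokenize(text))  # SentencePiece subwords for all of them
```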

I could be off by a few k on that vocabulary size, but it's around that mark, and it's been trained on many languages. So it's obviously a much better option for our student model. Let's have a look at how we initialize it. This XLM-R model is just coming from Transformers, so I need to initialize it as a Sentence Transformer model using the Sentence Transformers library.

So from sentence_transformers, I'm going to import models and also SentenceTransformer. So, xlmr: this is going to be our actual transformer model, and we write models.Transformer. Sentence Transformers uses Hugging Face Transformers under the hood, so we access this with the normal model identifier we would use with Hugging Face Transformers, which is xlm-roberta-base.

As well as that, we need a pooling layer, so we write models.Pooling. In here, we need to pass the output embedding dimension, which is get_word_embedding_dimension on our transformer model, and also what type of pooling we'd like to do. There's max pooling, CLS token pooling, and so on.

What we want is mean pooling, so pooling_mode_mean_tokens=True. Okay, so those are the two components of our Sentence Transformer. And from there, we can initialize our student: student = SentenceTransformer, initialized using modules, which is just a list of our two components.

So xlmr followed by pooling, and that's it. Let's have a look at what we have there. We can just ignore this top bit here and focus on this. You see we have our transformer model followed by the pooling, and we also see that mean tokens pooling is set to true and the rest are false.
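Putting that together, the student initialization described above looks roughly like this:

```python
from sentence_transformers import models, SentenceTransformer

# Transformer backbone: XLM-R, loaded by its Hugging Face model ID.
xlmr = models.Transformer("xlm-roberta-base")

# Mean pooling over token embeddings to get one sentence vector.
pooling = models.Pooling(
    xlmr.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
)

# Wrap both modules as a single SentenceTransformer: this is our student.
student = SentenceTransformer(modules=[xlmr, pooling])
print(student)
```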

Okay, so that's our student model initialized. Now what we want to do is initialize our teacher model. With the teacher model, let me show you, you just have to be a little bit careful. So, SentenceTransformer: maybe you'd like to use one of the top-performing ones, and a lot of those are the "all" models.

These are monolingual models, like all-mpnet-base-v2. Okay, let's initialize this one and see what's inside it. So we have the transformer and the pooling as before, but then we also have this normalization layer, so the outputs from this model are normalized.

And if you're trying to make another model mimic those normalized outputs, it's not ideal, because the student is going to be trying to normalize its own vectors. So you don't really want to do that. You either want to remove the normalization layer or just choose a model that doesn't have one, which I think is probably the better option.

So that's what I'm going to do. For the teacher, I'm going to use one of the paraphrase models, because these don't use normalization layers: paraphrase-distilroberta-base-v2. Okay, let's have a look. Now you can see we have the transformer followed directly by the pooling.
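And the teacher is just loaded by name (a sketch of what was shown):

```python
from sentence_transformers import SentenceTransformer

# A monolingual paraphrase model with no Normalize module, used as the teacher.
teacher = SentenceTransformer("paraphrase-distilroberta-base-v2")
print(teacher)  # Transformer followed directly by Pooling
```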

Now, another thing to be aware of here is that the max sequence length here is 512, which doesn't align with our paraphrase model. But that's fine, because I'm going to limit the maximum sequence length to 256 anyway, so it's not really an issue. Just look out for it if you're training your own models.

This one is 384, so none of those align. But yeah, just be aware that the sequence lengths might not match up. So, we've formatted our training data and we have our two models, the teacher and the student. Now what we can do is prepare that data for loading into our training, or fine-tuning, process.

So as I said before, we're going to be using ParallelSentencesDataset, which we import from Sentence Transformers. The first thing we need to do is initialize the object, and that requires that we pass the two models we're training with, because this class handles the interaction between those two models as well.

So obviously we have student_model, which is our student, and teacher_model, which is our teacher. Alongside this, we want a batch size. I'm going to use 32, but you can probably use a higher batch size here, and you probably should; 64 is one that I see used a lot in these training scripts.

And we also set use_embedding_cache=True. Okay, that initializes the ParallelSentencesDataset object. Now what we want to do is add our data to it, so we need our training files: train_files equals the os.listdir we did before; I think it's in the data directory.

Yeah, so that's what we want. What I'll do is, for f in those train files, load each one into the dataset object: print f and then data.load_data. I need to make sure I include the path there, followed by the actual file name. You also need to pass max_sentences, which is the maximum number of sentences to take from that file.

So basically the maximum number of sentences we're going to use from each language. Now, I'm just going to set this to 250,000, which is higher than any of the files we have, and that's fine. If you want to try and balance things out, that's fine too.

You can do that here. The other option is where we set the maximum length of the sentences we're going to be processing, which is max_sentence_length. As I said before, the maximum we have here is 256 or 512, so let's just trim all of those down to 256.
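So the dataset setup looks roughly like this, continuing from the student, teacher, and data directory defined above (a sketch):

```python
import os
from sentence_transformers.datasets import ParallelSentencesDataset

data = ParallelSentencesDataset(
    student_model=student,
    teacher_model=teacher,
    batch_size=32,
    use_embedding_cache=True,
)

for f in os.listdir("data"):
    print(f)
    data.load_data(
        os.path.join("data", f),
        max_sentences=250_000,     # cap per language file
        max_sentence_length=256,   # skip longer sentence pairs
    )
```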

Okay, that will load our data. Now we just need to initialize a data loader; we're just using PyTorch here. So from torch.utils.data import DataLoader, and loader = DataLoader with our data. We want to shuffle the data, and we also set the batch size, which is the same as before, 32.

Okay, so the models are ready and the data is ready. Now we initialize our loss function: from sentence_transformers.losses import MSELoss, and then loss = MSELoss with model=student. So we're only optimizing our student model, not the teacher model.
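In code, that's just (a sketch, continuing from the objects above):

```python
from torch.utils.data import DataLoader
from sentence_transformers.losses import MSELoss

# Shuffled batches drawn from the ParallelSentencesDataset.
loader = DataLoader(data, shuffle=True, batch_size=32)

# MSE between the student's vectors and the teacher's vectors;
# only the student (the wrapped model) gets updated.
loss = MSELoss(model=student)
```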

The teacher model is there to teach our student, not the other way around. Okay, so that's everything we need ready for training. Let's move on to the actual training function. We can now train; I'm going to train for one epoch, but you can do more. In the other training scripts I've seen that do this, they train for something like five epochs.

But even just training for one epoch, you actually get a pretty good model, so I don't think you need to train for too many. Obviously, if you want better performance, I would go with the five epochs I've seen in the other scripts. So we need to pass our train objectives here.

So we have the data loader and then the loss function. Now we want to say how many epochs; I said before, I'm going to go with one. Then a number of warmup steps: before jumping straight up to the learning rate that we select in a moment, do we want to warm up first?

Yes, we do. I'm going to warm up for 10% of the training steps, which is just the length of the loader multiplied by 0.1. From there, where do we want to save the model? I'm going to save it in xlm-ted. Then our optimizer parameters.

We're going to set a learning rate of 2e-5 and an epsilon of 1e-6, and I'm also going to set correct_bias to False. Those are the optimizer parameters, and then we can also save the best model with save_best_model=True.
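Put together, the fit call described here looks roughly like this. Note that this assumes an older sentence-transformers release, where fit defaults to the transformers AdamW optimizer (which accepts correct_bias); the output path is just an example name:

```python
# Fine-tune the student for one epoch with 10% warmup.
student.fit(
    train_objectives=[(loader, loss)],
    epochs=1,
    warmup_steps=int(len(loader) * 0.1),
    output_path="./xlm-ted",   # example path, pick your own
    optimizer_params={"lr": 2e-5, "eps": 1e-6, "correct_bias": False},
    save_best_model=True,
)
```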

And then we run it. Okay, so run that. It's going to take a long time, so I'm actually going to stop it because I've already run it. And let's have a look at actually evaluating that and have a look at the results. Okay, so I just have this notebook where I've evaluated the model.

So I'm using this STS (semantic textual similarity) benchmark dataset, which is multilingual. I'm getting the English data and also the Italian, and you can see they are aligned: each row in the English dataset corresponds to the same row in the other language datasets. So in here, sentence 1 in the English row means the same thing as sentence 1 in the Italian row.

The same goes for sentence 2, and they share the same similarity score. So the first thing we do is normalize that similarity score, and then, going down a little bit, we reformat the data using the Sentence Transformers InputExample class. Through this, I've created three different evaluation sets: English to English, Italian to Italian, and English to Italian.

Then what we do here is initialize a similarity evaluator for each of these datasets. Again, we're using Sentence Transformers, which just makes life a lot easier. We initialize those, and then we can just pass our model to each one of the evaluators to get its performance. So here, 81.6 on the English set, 74.3 and 71 on the other two.
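One of those evaluators can be built like this (a sketch; the sentence pairs and scores below are toy stand-ins for the English STS data, with scores already normalized to the 0 to 1 range):

```python
from sentence_transformers import InputExample
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Toy stand-ins for the English STS data (sentence pairs + normalized scores).
en_sents1 = ["I love plants", "A man is playing guitar"]
en_sents2 = ["I like plants", "Someone is cooking dinner"]
scores = [0.8, 0.1]

# Build InputExamples from aligned sentence pairs and their similarity scores.
en_examples = [
    InputExample(texts=[s1, s2], label=score)
    for s1, s2, score in zip(en_sents1, en_sents2, scores)
]

en_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(
    en_examples, name="sts-en-en"
)
print(en_evaluator(student))  # correlation between embedding similarities and labels
```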

Now, I just trained for one epoch. If you want better performance, you can train for more epochs, and you should be able to get more towards 80% or maybe a little bit higher. So pretty straightforward and incredibly easy. And then here, I wanted to compare that to the student before we trained it.

So I initialized a new student and had a look, and you can see the evaluation is pretty low. For English, 47.5; for Italian, actually 50%, surprisingly, although it's already a multilingual model, so it does make sense that it can handle some Italian. And then from English to Italian, it really struggles and drops down to 23.

So that's it for this video. I think it's been pretty useful, at least for me; I can see how you could build a sentence transformer for a lot of different languages using this, which is, I think, really cool and will probably be useful for a lot of people.

So I hope you enjoyed the video. Thank you very much for watching, and I'll see you again in the next one.