
Stanford CS25: V3 | No Language Left Behind: Scaling Human-Centered Machine Translation


Transcript

We're glad to have Angela Fan with us here today. She's a research scientist at Meta AI Research in New York, focusing mainly on research in text generation. Currently she's working on language modeling and on developing AI agents for Meta products. Her recent research projects include No Language Left Behind, which she'll be talking about today, universal speech translation for unwritten languages, as well as Llama 2.

So give it up for Angela, I guess. All right, thank you all so much. So yeah, when I got this email, I was like, oh, I should probably talk about Llama 2. But then I noticed you have Sharon, who is, like, a 10x better speaker than me.

So I was like, okay, maybe not Llama 2. But then I thought I would maybe cover this project that we did called No Language Left Behind, which is also very relevant to this class. And so when you think about a lot of text generation technology, most of it until fairly recently has been really focused on English.

But there are actually more than 3,000 written languages worldwide. And for me, this is extremely personally meaningful, because English is actually my third language. And when you think about some of the multilingual technology that's already out there, it's not like we've never worked on multilingual, right?

Actually, when speaking about generative AI, I actually think translation is one of the most commercially successful and widespread applications of generative AI. I mean, ultimately, translation models, they are, you know, like conditional language models. And so when you think about like traveling or something like that, or my sister is taking Spanish, so like, just like doing her Spanish homework, we have a lot of tools that exist today.

So things like Google Translate cover around 130 languages, Microsoft Translate about 110. This might be a little bit outdated, since I pulled the statistics a little bit ago. But the project for No Language Left Behind, it started from like a very simple ask, like, okay, there's 3000 languages worldwide, maybe it'll be like, pretty hard to get to all 3000.

Since some of them are pretty rare, and not spoken by many, but there are still like hundreds of languages spoken by millions and millions of people. And so we were like, okay, no big deal. Like, let's just start from the 100 ish that we have today, and just go for like a doubling, like what would it take to actually be able to double this kind of coverage?

And of course, you know, just saying that you support a bunch of languages is not the goal, right? We actually want to create high quality, safe translations that would be usable by people, just like if you're going on vacation today, your kind of instinct is to whip out your phone and get on the Google Translate app.

And so kind of the backdrop to this project was that there was actually a lot of progress in translation. So historically, there's been a lot of focus on what we call higher resource languages. And these are not necessarily languages that are spoken by the most people in the world.

But when we say higher resource, it means the most amount of data. And so you can think about things like Europarl, or, you know, translations from the European Parliament, and those served as the foundation for a lot, a lot of translation development. And more recently, there's been a great focus on low resource languages.

And it's been driven across the research community with groups like GhanaNLP, Masakhane, and AmericasNLP. And these are all really exciting developments. And so these have led to the development of a lot of new data sets, as well as criticisms of existing data sets, and also work on new languages, usually languages that people actually speak and care a lot about.

And we found this like really, really exciting. And so looking at a lot of this, a bunch of us got together at FAIR, and started thinking like, okay, we actually speak some pretty low resource languages from like Catalan, to Assamese, and so on. And so we started this as kind of like a big passionate research project.

And so today, I want to cover a little bit about our high level approach to this problem, which is a little bit interdisciplinary. I want to talk about how we actually created the data sets to be able to support this kind of work. Of course, I want to talk about the models, since this is a class about transformers.

One note here that I think is actually very interesting, in terms of translation as a research direction, is that a lot of innovations have actually been done in translation. The original transformer paper, I think, is one of them, which always makes translation quite an interesting area to work on, because I feel like it's a very mature research area as well.

So it kind of is like, okay, if your architecture works in translation, it probably works very generally. So that's also one of the things that excites me about translation research. Then I want to talk about evaluation, like how we're actually measuring and ensuring that these translations are good and safe for people.

And then I want to end with a little bit of like, you know, high level thoughts about future directions and things that I hope that we can work on in the future. So I want to start with our approach. I think the most important thing in research is to know that we're working on a real problem, especially when it's really close to people like translation.

And I think in many areas, like when I was working on on-device AI, for example, I feel like I had like a research problem in mind, but it was like very, very disconnected from the practical problem of actually putting models on phones. And so this was something that was really important to us.

And so we actually started the project by kind of like focusing on a social sciences type approach or sociology type approach. And we actually did a lot of interviews with low resource speakers. And so we met with about 44 different native speakers that spoke 36 different languages across North America.

I will say that a lot of them are like immigrants to the US, since that was kind of like the easiest kind of cohort to recruit. And we learned a lot of different things about how they approach low resource languages, but also the kind of technological need that they have, because I think it's easy to be like, hey, I have this cool background, like I have this cool problem, and I want to solve it.

But I think it's very important to actually talk to the people for whom this is a problem that needs to be solved. And so we learned that there's a great fear in general that low resource languages might be undergoing a state of decline, partially because a lot of education is shifting to languages like Hindi, or English, or Mandarin Chinese, for example, and there's a lot of excitement to be included in existing translation systems.

And people said they have tried to use Google Translate or Microsoft Translator in their languages. But ultimately, they found that the quality is really insufficient for reliable usage. So if you think about, like, well, I was going to say when I was in high school, but you all are probably substantially younger than me.

So maybe, like, you know, 10 or so years ago, if you tried to use Google Translate for your Spanish homework, your Spanish teacher could always identify that it was not a human-written translation, and so you would get marks off. But that's not really the case for some of the high resource languages today.

And so I think, as with all things in machine learning, it really starts from a data perspective. Like, why can't we just train translation models in hundreds of languages, or large language models in hundreds of languages? It's because we don't have the data to support it. And so I want to talk first about evaluation data sets, because I think it's extremely important to nail evaluation.

And then I'll talk about training. So for an evaluation data set for this work, we started this FLORES effort. It stands for Facebook Low Resource. I guess we're called Meta now, but renaming it didn't seem like a great idea, so we're still calling it FLORES. So this was something we originally started for just two languages in a first paper at EMNLP many years ago, so it was just for Nepali and Sinhala, and we later extended it to incorporate two more languages in a release.

Afterwards, we thought a lot about, okay, like, FLORES was really useful for the community. How can we extend it to 100 languages? And so that was this follow up work that we did, I think we had at ACL, or WMT. And then in this project, we were like, okay, how can we go from FLORES 101 to FLORES 200 to really go for the doubling effect?

And so what is FLORES? Well, it's in the name, it's a focus on low resource languages. So we do include some higher resource languages like German or Hindi or so on, almost for calibration effect as well. But the majority of the focus is on these lower and mid resource languages.

It's the first large scale many to many machine translation evaluation data set, which means that we take all of the sentences in English, and then we translate them to all of the languages, which means that you would be able to evaluate any cross pair of languages. So for example, like Chinese to French, I lived in France for many years.

So it's like very personally relevant to me. Of course, 200 languages also in the name, there's a broad diversity of different domains and topics. I think this is important when designing an evaluation data set, which is like very top of mind for anybody interested in language modeling research. Because like the way people train machine translation models, and the way people use them are often like very different.

And so if you only benchmark your data set on news, for example, which is very common in translation research, then you don't really pick up the fact that people talk about such a wide variety of things, and have different casual conversations that they need translated, official documents, and so on.

It's also a document-level data set. This is not something that I think the community is broadly leveraging right now. But the way it's translated is that you can have document-level context. And so translators are provided the entire document to translate from, and we also provide the entire document for evaluation.

And we translate multiple sentences from the same paragraph. And so this was a potential research direction that we wanted to make sure we covered: models that potentially need more context, because a lot of translation work is done at the sentence level. So how do we actually ensure that this data set was high quality?

So the first step is that we take a document. Well, actually, first step is like alignment on language standards. So this is very important, because when you're translating French, or Chinese, I think most people have a strong understanding of like, what it means to produce like a good French or good Chinese.

And there are a lot of professional translators hired in these languages. But when you go to lower resource languages, it's not necessarily the case that there's, you know, a thriving translation industry around translating a lower resource language. And so one of the first things is actually to align on what a high quality translation is.

And so there are actually a lot of challenges here. There are certain low resource languages where there are different competing language standards, or there's very high variance across regions in how the language is spoken. And so this step is a pretty critical one. So then what we do is we take the document, we send it to one group of translators, and they do the first translation step, then we do some automatic checking, you know, like if the input sentence was 10 words and the output sentence is 300 words, most likely something went wrong, something like the rough check sketched below.
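For illustration, here is a minimal sketch of that kind of length-ratio sanity check. The ratio bounds are made-up assumptions for the example, not the actual thresholds used in the FLORES pipeline.

```python
# Minimal sketch of a length-ratio sanity check on candidate translations.
# The ratio bounds here are illustrative assumptions, not the exact thresholds
# used in the real pipeline.

def length_ratio_ok(source: str, translation: str,
                    min_ratio: float = 0.3, max_ratio: float = 3.0) -> bool:
    """Flag translations whose word count is wildly different from the source."""
    src_len = max(len(source.split()), 1)
    tgt_len = max(len(translation.split()), 1)
    ratio = tgt_len / src_len
    return min_ratio <= ratio <= max_ratio

# Example: a short sentence that comes back as 300 words gets sent back.
assert length_ratio_ok("This is a short example sentence for the check .",
                       "Ceci est une phrase d'exemple courte pour le test .")
assert not length_ratio_ok("short input", "word " * 300)
```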

And so we send it back. Otherwise, we'll send it onwards to a separate, completely independent set of translators that do review. And so they try to rate the quality of this. And if the quality doesn't pass a sufficient bar, it gets sent back to the original set of translators to edit, and they kind of go through and address all of the feedback.

And then if it's good enough, then it enters our data set. And so there are many challenges here. The first one, of course, is just finding translators, and also finding more translators. For example, we ran into an issue where, in a certain country, the internet was not available.

And so, you know, it's a lot of recruitment. The other one, of course, is language standardization. I think I briefly mentioned this before, but there are a lot of different challenges in just understanding what a high quality translation is. For example, for the low resource language Breton, there are two competing groups on how you write Breton.

So it's like very difficult to resolve some of those things. And the final thing is that there's actually a lot of variation, even in languages like Arabic, like the Arabic, like Moroccan Arabic is very different from, you know, Jordanian Arabic and so on. And there are also certain regions that they speak the same language, but due to historical reasons they write in different scripts.

And so one of the things we actually did was like, if there are languages written in multiple scripts, we actually supported the collection of a multiple script evaluation. And I think this is really important because if you're building an underlying technology and you only choose one, then I think you risk like just kind of like naturally supporting one over the other when we really should be like kind of a more neutral technology provider.

And so this is something that we explored a lot, as well as exploring different variants of Arabic. This is also open source; if you just go to this link, you can download all of the text files for this. With evaluation done, I want to talk a little bit about how we collected some of these training data sets.

The first thing I want to talk about is this data set we created called NLLB Seed. And the idea of this is like, it's a really seed data set of high quality translations and languages that really don't have anything. Why? Because, well, you can't start from nothing, you know, you got to bootstrap from somewhere.

A lot of people have been using the Bible as a way to bootstrap. But it's very limited domain, obviously very religious text. And so we created this data set, NLLB Seed for languages that really don't have anything to get started from. It's only about 5000 sentences, so it's nothing crazy.

But it supports a lot of different use cases, like training language identification models, or sentence encoders, n-gram language models, all of these things that I'm about to talk about in our data set pipeline. So it covers 43 languages, about 6,000 sentences. And the way we decided to sample is, it's focused on really general content.

So Wikipedia has this article of like, hey, if you're going to start a new Wikipedia in your new language, I think there are like 309-ish Wikipedias, last I checked, here's a list of articles that every Wikipedia in a new language should have. And so that's where we sampled this original content from.

And of course, it's also open source if you want to download it. So what we ended up doing to get large scale training data is using mining. So this is not something we pioneered in this project, we have a bunch of different previous work. So we started from WikiMatrix, where we were like, hey, there are a lot of different sentences on Wikipedia in different languages that we should be able to match up.

And so we tried to do that with Wikipedia to get machine translation training data. We extended that to the web in the CCMatrix project. And then we extended it to very, very large scale mining on all cross pairs. And in this project on beyond-English-centric multilingual machine translation, we really tried to ditch English as the central pivot language.

And so the way this whole data mining thing works is that it focuses on sentence alignment. So everyone is probably super familiar with this, because it's how language models are built now. But it's like you take Common Crawl, or any other open source dump of the web, I don't know, like RedPajama or CCNet, whatever you want to use these days.

And you take all of the data, you extract all of the text, you know, a lot of HTML parsing, and so on goes into it. And the idea is that we want to try to find matching text that could be a translation. So we shatter it all into sentences, we embed them with different sentence encoder models.

And then we do a match to try to understand, in a multilingual space, if the sentences match. And so one of the biggest challenges to this is that the quality of the sentence encoding is very important. So if your sentence encoding is not very accurate, then it's impossible to match, in this multidimensional space, on the idea of the meaning being the same.
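To make the matching step concrete, here is a minimal sketch of pairing up sentences by similarity in a shared embedding space, assuming the embeddings come from some multilingual sentence encoder. Real pipelines use FAISS indexes and margin-based scoring at much larger scale; this just shows the idea.

```python
# Minimal sketch of the matching step in bitext mining: given sentence
# embeddings from a multilingual encoder (LASER-style), pair up sentences
# across two languages by similarity in the shared space.
import numpy as np

def mine_pairs(src_emb: np.ndarray, tgt_emb: np.ndarray, threshold: float = 0.8):
    """Return (src_idx, tgt_idx, score) for candidate translation pairs."""
    # Normalize so the dot product is cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = src @ tgt.T                      # (n_src, n_tgt) similarity matrix
    best_tgt = sims.argmax(axis=1)          # nearest target for every source sentence
    pairs = []
    for i, j in enumerate(best_tgt):
        if sims[i, j] >= threshold:         # keep only confident matches
            pairs.append((i, int(j), float(sims[i, j])))
    return pairs
```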

And so one of the big things we tried to do in this project was to improve the quality of the sentence encoders. And so one of the big things that we did was train sentence encoders with masked language modeling, which you see on the left, but we also used multilingual distillation, which you see on the right.

And so previous approaches to sentence encoders, and the trend in the research community for a while, was to really try to embed all languages in the same sentence encoder model. So projects like XLM-R, for example, are in that direction, and I think it's pretty widely used. The challenge with this, when you're training a low resource model, is that a lot of your high resource data just overwhelms your low resource data.

And so you don't end up with a very high quality sentence encoder for those languages. So what we ended up doing is we had a multilingual teacher model. And we distilled a bunch of student models that are specialized to different language families that are low resource. And so this enables the quality to be pretty high.

And so the way that distillation works is that the teacher and the student model both see the same data, and then we try to minimize the cosine loss between the sentence embeddings that they produce. I think an important question that you can ask here is, why do you need to do multilingual distillation?

Why can't you just train a bunch of different student models, one per language family? Why even care about distillation? And the reason is because if you're going to use a bunch of sentence encoders for mining, the important thing is that they all exist in the same embedding space. If you train one separate model and another separate model, there's nothing constraining them so that you can mine all of the data against each other.

And so one of the things we found is that by starting everything from the same teacher model and trying to use this cosine loss to minimize the distance between embeddings, you are able to have this constrained space where you can mine every language against every other, even if you have different student models.
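Putting the cosine objective described above into a minimal sketch: the teacher and student here are placeholders for any sentence encoders that produce fixed-size vectors, and the training loop is illustrative rather than the actual recipe.

```python
# Minimal sketch of the multilingual distillation objective: teacher and
# student encode the same sentences, and we minimize a cosine loss between
# their sentence embeddings. `teacher` and `student` are placeholders.
import torch
import torch.nn.functional as F

def distillation_loss(teacher_emb: torch.Tensor, student_emb: torch.Tensor) -> torch.Tensor:
    """1 - cosine similarity, averaged over the batch."""
    return (1.0 - F.cosine_similarity(teacher_emb, student_emb, dim=-1)).mean()

def train_step(teacher, student, batch, optimizer):
    with torch.no_grad():                 # the multilingual teacher is frozen
        t_emb = teacher(batch)
    s_emb = student(batch)                # student specialized to a language family
    loss = distillation_loss(t_emb, s_emb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```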

And so this graph on the y-axis, it shows the error rates of mining. And so lower is better. And on the x-axis, it shows a bunch of different low-resource languages. So for example, the first one is Urdu, the second one is Telugu, third one is Tagalog, and so on.

And so the gray bar here is the original LASER paper. So this is a paper we put out maybe in 2018-ish, and we had all of these languages, so we counted them as included. But as you can see, the error rate is extremely, extremely high for these languages. So even though they were included, they couldn't really be used with high quality.

And the blue bar is the LASER model that we trained based on the technique I just described on the previous slide. And you can see, I think the most important point is that you can barely see the blue bars. So it was very effective, even for these languages that we had previously thought were already covered.

And then so now how does this kind of thing fit into a whole data pipeline around this approach? So one of the most important things is when you download the data from the web, you don't really know what language it's in. And so this is part of all of the large-scale data cleaning that goes into training large language models today.

And so the way we identify different languages is through simple classification models called language identification models. And because it's just a classification model, people think it's easier than it actually is. But I think some of the major challenges are that there are so many different languages, and they're written in many different ways.

And web text is very casual. And so it can be very difficult to actually train a good classification model that can generalize to that. And so what we did is, you know, we had our LID training data, and we produced a language identification model, LID. And then we actually did human evaluation to label errors coming from the LID system, to iteratively improve this specific model on web text itself.
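For a concrete picture, here is a rough sketch of training and querying a fastText-style LID classifier. The file names, hyperparameters, and label codes are illustrative assumptions, not the actual NLLB LID setup.

```python
# Minimal sketch of a fastText-style language identification classifier,
# the kind of simple supervised model described here.
import fasttext

# Training data: one example per line, e.g.
#   __label__eng_Latn This is an English sentence .
#   __label__wol_Latn  ... (Wolof text) ...
model = fasttext.train_supervised(input="lid_train.txt", epoch=5, lr=0.5, wordNgrams=2)

# Predict the most likely language of a piece of web text.
labels, probs = model.predict("Ceci est une phrase en français .", k=1)
print(labels[0], probs[0])   # e.g. ('__label__fra_Latn', 0.98)

model.save_model("lid.bin")
```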

Then after we produce this LID model, we feed in all of our Common Crawl data, where the web arrow is coming in, and we do a ton of filtering and cleaning. And this produces a huge corpus of monolingual data that you can then use for training anything. Afterwards, we train our encoder, which I described on the previous slide.

And then we convert this monolingual data into what we call mined bitexts. So these are a huge data set of things that we think are translations of each other. And then finally, what we do is we actually try to validate that these are real mined bitexts by training very small bilingual translation models in order to see, you know, what the quality is like.

And I think this is important because you don't want to completely separate the data development cycle from the end task that it's being used for. An analogy to large language model training today is that when you're doing your pre-training, you don't want someone to just hand you the data; the data mix of your different data sets is very important.

And it's pretty similar here. And I think one of the highlights that we did here is really, like, focused on the human evaluation of the language identification model, because that actually improves the quality of, like, all of the underlying data if you just, like, more accurately know what language it's in.

And this entire data pipeline is actually open source in this library. And we had an EMNLP paper describing it. The reason why I thought this was important is because I think data cleaning is actually such a fundamental underlying thing that drives model quality, and people's data pipelines.

You know, it's like, oh, I have this script and this other thing and this other thing. And so, it's actually, I think, very important to be able to recreate it and rerun it as part of, you know, almost like your research that you would do as follow-up work. And so, that's why we open sourced it.

A few reflection things. For low-resource languages, even though we did a large-scale mining, I think monolingual data is the limiting factor. Like, there are many languages that do not have, like, a huge amount of text written online, and so, it can be very challenging to get a large amount.

Further, I think languages in unique scripts can be extremely hard to get good representations of if you don't have very much data. There are certain languages, as well, that were historically written in one script, but now the government would like to write them in a totally new one, like the Ol Chiki script, for example.

And so, there's not a lot of content to represent these scripts, so it's hard to learn representations. And then further, a lot of the content, even after mining, is from a fairly limited domain, often religious content. Okay, so, with data discussed, I want to segue a little bit into some of the modelling work.

Just to kind of start with, like, a high-level picture, I think there's, like, three major challenges when you talk about, like, large-scale multilingual modelling, and these pretty much apply to language models, as well. The first one is effective data augmentation for low-resource languages, like, how can you prevent the low-resource language data from just being completely drowned out by the time you've seen, like, all of your words of German or Russian?

I think there's also a question of the scalability of the model, so even if you train very large-scale models, how do you prevent the representations of different languages from interfering with each other? And that leads to the last point as well: if you give the model very limited capacity, then, of course, it may not have the capacity to model all of these different languages, and so you also need to scale up the model.

And so, as a preliminary, for those who may not have seen a translation system before, I don't know how many of you that practically is: we use standard sequence-to-sequence models. So the input text, the coral-colored thing here, is what you want to translate; it enters a transformer encoder model that then, with an attention mechanism, connects to a transformer decoder model, which decodes the actual translation autoregressively, which you can see here in yellow.
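As a concrete illustration, here is a small usage sketch of the released NLLB-200 checkpoints through the Hugging Face transformers interface; the model name and language codes come from the public release, and the generation settings are illustrative.

```python
# Sketch: translate English to French with a public NLLB-200 checkpoint.
# The encoder reads the source text and the decoder generates the translation
# autoregressively, forced to start in the target language.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M",
                                          src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

inputs = tokenizer("Life is like a box of chocolates.", return_tensors="pt")
out = model.generate(
    **inputs,
    # Force the decoder to begin in the target language (French here).
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"),
    max_new_tokens=64,
)
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
```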

And so I want to talk a little bit about, like, how the data looks as we feed it into the models, so there's a few different ways that you might want to think about data, so you want to be, like, okay, did a human look at it and decide that, like, these two sentences are translations, or are they noisy?

So is it limited in size? Another thing you can think about is whether the data quality is dependent on some other factor, and so that's the model-dependent thing, in which case the data quality may be capped by the quality of that dependency. And so you can think of it a little bit like this: the ideal data set would be one where humans have reviewed every bit of it, it's not noisy at all, we have an infinite amount, and it doesn't have any dependencies on any other models, it's just pure quality. But in reality, what we have is closer to these.

So we have a bunch of different data sources. We have the seed data that I discussed way back in the talk, where it's a small amount of really high-quality human-aligned data, but the only problem is that it's limited in size, it's like 6,000 sentences per language. We have the public bitext, so this is data that people have created over many years of working in translation, you know, you can download it from the OPUS corpus, for example. It mostly has not been reviewed by humans, so it's pretty extremely noisy, and in many languages it's just coming from the Bible, so the size is quite limited.

You have our mined data, so this is not human-aligned either, and it does have a model dependency, you know, it's dependent on the quality of the sentence encoders. And we have two other sources of data from back translation. So the idea of back translation, it's a data augmentation technique heavily used in machine translation, where you use a model to produce pseudo-translations, like silver data, and we use two different techniques to produce these back translations, which also are dependent on the underlying model used to make the translations.
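As a small sketch of the back-translation idea just described: a reverse-direction model turns monolingual target-language text into synthetic ("silver") source sentences, which gives extra source-to-target training pairs. The reverse_translate function below is a placeholder for whatever target-to-source model is used.

```python
# Minimal sketch of back-translation as a data augmentation technique.

def back_translate(monolingual_target: list[str], reverse_translate) -> list[tuple[str, str]]:
    """Return synthetic (source, target) pairs for training the forward model."""
    pairs = []
    for tgt_sentence in monolingual_target:
        synthetic_src = reverse_translate(tgt_sentence)  # target -> source model
        pairs.append((synthetic_src, tgt_sentence))      # train source -> target
    return pairs
```

The key design point is that the target side of each synthetic pair is real human text, so the forward model still learns to produce fluent target-language output even though the source side is model-generated.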

So this is a picture of, like, our high-level different data sources, and, like, how you want to think about the quality and the different axes, and so if we put them all together, what do we get? The y-axis here is the number of training pairs, and the x-axis here is the languages sorted by resource, so you can see, like, on the left-hand side, you have your low-resource languages like Wolof, and on your right-hand side, you've got your high-resource languages like French, the peak is English, of course.

And so if you just look at what's available publicly, this is the distribution you get, and you'll see, like, a huge, huge fall-off pretty quickly, and then if you add in the data that we have created for mining and back translation, our goal is basically to, like, make the distribution a little bit more uniform.

It's very hard on the extremely low-resource side, of course, but to make it a little bit more uniform so that you don't just immediately, you know, overfit on your low-resource languages before you've even seen, like, three shards of your German data. With that kind of data strategy in mind, I want to talk a little bit about mixture of experts.

So this is something that we explored quite aggressively in the translation space for a number of years. You know, we could have a whole conversation about some of the debates going on, like, do you want sparse or dense architectures for large language models? But essentially, mixture of experts enables massive scale because you don't have to just scale your dense trunk model, you can have a bunch of different separate experts that you activate per token.

It also allows you to avoid language interference because the idea is that the different experts could specialize to specific languages. Unfortunately, it adds a ton of capacity, so it becomes pretty easy to overfit. So I'm going to talk a little bit about this overfitting phenomenon. So the top set of graphs that we're going to talk about is for the language Kongo, and then the bottom set of graphs is for French.

So you really want to compare, like, a low-resource language on top with a high-resource language on bottom. So if you just take your dense model, traditional transformer sequence-to-sequence architecture, that's this graph that you're showing, right? So there's a little bit of overfitting on the low-resource language, but you can pretty much regularize this with standard dropout techniques, right?

So there's not a big problem, and on French, you know, you basically have no real problem. However, the minute you switch from, like, a dense architecture to a token-level MOE architecture, you just experience massive overfitting on the low-resource language. So the green line here is, like, just demonstrating without dropout the overfitting.

And then if you add dropout, you know, you get a little bit better performance, but it's still overfitting quite a bit. Like, essentially, by, like, you know, 12K updates, you know, there's no real point in continuing training. Like, you're burning GPU, basically. And so one of the things we actually worked on quite a bit was, like, trying to figure out how to properly regularize these MOE architectures with this specific masking technique on the gating function that decides, like, which expert to route to in your MOE architecture to just try to pull back some of this overfitting effect.
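To make the routing and the idea of masking expert outputs concrete, here is a toy top-1 mixture-of-experts layer with a dropout-style mask on the routed outputs. This is loosely in the spirit of the regularization described above, not the exact NLLB recipe.

```python
# Toy top-1 MoE feed-forward layer: each token is routed to one expert, and
# during training some routed expert outputs are zeroed out so the model
# cannot lean too heavily on any single expert (an illustrative regularizer).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int, expert_drop: float = 0.2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        self.expert_drop = expert_drop

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)           # routing probabilities
        top_expert = gate.argmax(dim=-1)                   # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_expert == e
            if mask.any():
                out[mask] = expert(x[mask]) * gate[mask, e].unsqueeze(-1)
        if self.training and self.expert_drop > 0:
            # Randomly mask whole expert outputs per token as regularization.
            keep = (torch.rand(x.size(0), device=x.device) > self.expert_drop).float()
            out = out * keep.unsqueeze(-1)
        return out
```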

So if you look in the top right graph, the purple line, you know, you still see some successful regularization. Another thing that we did to control the overfitting effect, and it's actually being used quite a bit in language models today as well, is curriculum learning. And the idea of this is, like, how are we going to stage when languages are introduced?

And so what we did was we tried to train a vanilla model, and then we started to measure when the languages begin to overfit. And then we basically bucketed them into different sections. And so for high resource languages like French, you want to start it early, and it needs to be trained the entire way.

But for a lower resource language like Wolof, you know, after maybe like 100K updates, it's done. So the rest of the time is just overfitting. And so it actually gets worse the more you train it. So what we did is we moved some of those lower resource languages, and we inserted them much later into the training schedule.

So you start training your high resource, then you start training your mid resource, and then your low resource, and then your very low resource. And so by the end, you know, everything in theory has trained and is not as overfit as it would be without this kind of technique.
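A minimal sketch of this kind of staged curriculum is below; the bucket boundaries and language lists are made up for illustration, not the ones used in NLLB.

```python
# Sketch of a staged language curriculum: each bucket of languages only joins
# the sampling pool after a certain number of training updates, so the
# low-resource languages are introduced late and overfit less.
import random

CURRICULUM = [
    (0,       ["fra", "deu", "rus"]),       # high resource: train from the start
    (100_000, ["hin", "vie"]),              # mid resource: join later
    (170_000, ["wol", "kon"]),              # very low resource: join near the end
]

def active_languages(step: int) -> list[str]:
    """Languages allowed in the data sampler at a given training step."""
    langs = []
    for start_step, bucket in CURRICULUM:
        if step >= start_step:
            langs.extend(bucket)
    return langs

def sample_language(step: int) -> str:
    return random.choice(active_languages(step))

print(active_languages(50_000))    # only the high-resource bucket
print(active_languages(180_000))   # everything is in the mix by the end
```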

So I want to show some results. So first, I want to show results on existing data sets. So before we get to 200 languages, let's just talk about 100 languages. And so this is the FLORES-101 devtest. It's important to compare on this because this is where existing benchmarks in the community lie.

Whereas on 200, of course, we can put up anything, you know, because it's the first work on that. So the first column is translating out of English. So English to Chinese, English to Icelandic, anything like that. The second column is translating into English, so Chinese to English. The third column, XX-YY, is translating any cross pair not involving English.

And the last column is the average. So if you look at the first set of rows, this is a comparison on models that cover 87 different languages. So there was this paper M2M-100. There was also this DeepNet paper. So you can see the average BLEU score. BLEU is a standard translation metric, essentially a metric of word overlap.

So we're looking at BLEU score here. And so you can see the last row, NLLB-200: even though we cover 200 languages, the BLEU score is substantially above some of the existing work. Now if we look at 101 languages, only the DeltaLM paper from Microsoft at the time covered that number of languages.

And so if you compare on all of the different cross sets, similarly, you see that this No Language Left Behind model is much stronger in terms of BLEU. One thing really quick on the variance of these BLEU numbers: I think it's important to understand whether something is statistically significant or not.

I think about 0.5 BLEU is kind of the general plus or minus that you'll see. And so if it's above that, it's usually a statistically significant metric improvement. Okay. So now I want to talk a little bit about FLORES-200 results. So here it's similar: the first chunk of columns is translating out of English, the next chunk is translating into English, then you have your cross pairs, and then you have your average.

So we have this BLEU metric as well. We also have a character level metric based on chrF++ that's commonly used in the translation community. So I think looking at these numbers, of course, there's no baseline work to compare to, unlike on the previous slide. And so, you know, when we get to human evaluation in a little bit, it'll be more concrete.

But I think generally, one of the kind of rules of thumb I have for these types of numbers is that around 30 is where it pretty reasonably becomes usable. And I think another thing, if you compare these supervised pairs to zero shot pairs, is that we don't see a huge drop off on zero shot, which indicates the model has some sort of generalization, even if it didn't see that translation pair directly during training.
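For reference, this is roughly how those automatic metrics can be computed with the sacrebleu library; the hypotheses and references here are made up, and word_order=2 is what turns chrF into chrF++.

```python
# Sketch: corpus-level BLEU and chrF++ with sacrebleu.
import sacrebleu

hypotheses = ["The cat sits on the mat .", "He reads a book ."]
references = [["The cat is sitting on the mat .", "He is reading a book ."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)  # chrF++

print(f"BLEU   = {bleu.score:.1f}")
print(f"chrF++ = {chrf.score:.1f}")
```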

Another way to calibrate some of this is to compare to Google Translate. And so if you compare to Google Translate, no language left behind is quite a bit better at translating into English, and not as good as translating out of English, although if you like average across everything, it's a little bit better.

I want to talk a little bit about human evaluation as well to complement some of our discussion on automatic evaluation. And so I think, you know, automatic metrics are fast, really good for research iteration, impossible to move forward without. But human evaluation is really the real deal here. And so we had this paper at AMTA on how to make this human evaluation very consistent and scalable across different language pairs.

I think this goes back to the kind of evaluation data set point that I was making at the beginning of the talk, where, you know, if you're a professional German translator, you're really good at evaluating the quality of your German translation. But beyond that, you know, there's not a lot of consistency.

And if you evaluate translation on, like, a five point scale, you know, a five when translating between two languages and a three when translating between two other languages, are those really comparable? And so we had this entire experiment methodology on how we might want to make this a little bit more comparable.

So I want to show some results now on this. The metric here is called XSTS; that's the metric for how we're doing this human evaluation. The y-axis here is actually the delta. So, you know, it's a five point scale, and it's a delta, not the raw score.

The x-axis here is a bunch of different translation directions that we evaluated. So the gray set is translating into English. The green set is translating non-English directions. So like French to Wolof. And then the blue set is translating out of English. And so what you're looking for is like a positive delta indicates that our modeling architecture is much better.

So what the delta is between is like a baseline transformer model just trained on all of our data versus like the final no language left behind model that we created. So the data is actually the same for both of them. That's how we get all 200 languages. So we're just measuring here the human eval of the modeling improvements.

And so you can see most of the deltas are pretty noticeable. Some of them not so much, like, I don't know, Zulu to English, we didn't seem to improve very much, but in general, it's an improvement detectable by human evaluation. You might also ask, okay, what is a statistically significant difference here: about plus or minus 0.2 to 0.3 is something that's pretty noticeable, and above 0.5 is very noticeable.

One of the things that I also want to get at in evaluation is that there's many different facets of model evaluation. And I think if you look at like all of the different like LLM leaderboards or like the transparency reports or whatever, you like begin to internalize this pretty quickly.

But what we just looked at are just like very high level summary numbers. And they don't really tell you like what exactly are the errors and like is it ultimately usable by people? Is it like a safe thing that people can rely on? And so one of the things we really focused on is user safety.

And some of that manifests in some of the toxicity work that we did. And the driving thing here is that not all errors in translation are made equal. So during COVID, there was this one example that really went viral, circulating around, where the message was, like, you've got to wash your hands.

But the translation produced was, like, you've got to hold hands, which I think is exactly the opposite of what you want to do. And other types of measurement errors are really important as well. So if you're telling someone how far they need to go, and you're like, hey, you want to travel five kilometers, and then your translation is, like, travel 500 kilometers.

It's a completely different type of issue. And so what we did for toxicity, which is a big focus for this work, is that we collected different toxicity lists for all 200 languages. And so why do I care so much about toxicity? I think it's a user safety thing. So if you input like some perfectly benign text, and then the output is profanity, I think it's just like really unexpected.

And it breaks a lot of trust in the system. And it's an extremely poor experience for people. That being said, it's also a very, very challenging thing, because it's extremely culturally specific. So things that are slurs, or insults in certain languages, they don't really generalize across cultures, which means that things like this are very challenging to create.
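As a rough sketch of the kind of "added toxicity" check these wordlists enable: flag a translation if it contains terms from the target-language toxicity list even though the source text triggers nothing on the source-language list. The naive whitespace tokenization here is an assumption for illustration; the real lists are the per-language files from the release.

```python
# Minimal sketch of an added-toxicity check using per-language wordlists.

def contains_toxic(text: str, toxicity_list: set[str]) -> bool:
    tokens = {tok.lower().strip(".,!?") for tok in text.split()}
    return bool(tokens & toxicity_list)

def added_toxicity(source: str, translation: str,
                   src_list: set[str], tgt_list: set[str]) -> bool:
    """True if the model introduced toxicity that was not in the source."""
    return contains_toxic(translation, tgt_list) and not contains_toxic(source, src_list)
```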

And I also was very interested in this direction, because I think it's broadly useful for all sorts of different types of detection that you need to do, and also mitigation. And so even though we developed this in the context of translation, it can be used very broadly in other types of NLP applications.

This is also open source. You can download it. You have to type in a little password that's in the GitHub repo, just so that you don't accidentally like download and realize you have files of like curse words all over your computer. Okay, so I want to end a little bit with some thoughts about future directions.

And before I get there, like, you know, there's a 190 page paper that writes up all of this in far greater detail, in case you're curious. So a few future directions that I think I'm really interested in, and some of these are also very applicable to things like speech: I think one of them is more explicit multilinguality.

So I think a lot of approaches to multilingual have been like, hey, you know, we have this thing that's working well for one language, like, let's try to scale it to a bunch of different languages, and then we're going to put them all in the same modeling bucket and just kind of like hope that the model learns a lot of these different representations.

But I think there's a lot of potential room for explicitly bringing the fact that it's multilingual into the architecture more. And so, you know, it's possible to capture more nuances between languages or different relationships between languages. The other one is continued support for everyone. Something I reflect on from this project is that, you know, going from 100 to 200 was already pretty challenging, but going beyond that, a lot of the techniques that we developed here are not necessarily that scalable.

This is actually what inspired some of our work on speech translation as well. So if you recently saw the Seamless M4T release, or the work on unwritten languages, where we did a lot of modeling of Hokkien, I think that goes in this direction really well, because many of the languages that people want to use are spoken-first languages and not necessarily primarily written.

And then I think the last thing that I'm still really passionate about is like continued increased ease of use and training of these models and like democratization for the community. So one of the things that we tried to do in this work is just like really, really clearly write down everything that we did, and like open source, like even the data pipeline and things like that.

And so that's where you get all of the repos that I linked and, you know, a huge write-up. But I think if someone were to try to reproduce this for their own language, and many people have, like, I'm not saying that it hasn't been done, but if you wanted to do this, it would be extremely, extremely hard, because there are so many different things going on.

So I think most of what we've seen is that people have downloaded the base model and fine-tuned it for their own language. But it's pretty hard to just add on many, many more languages to the system, because of how complicated all the moving parts are. And so I feel like something for the translation community overall is, how do we simplify a lot of these things?

And I think that's where a lot of fundamental modeling innovation could help us get to. And so, yeah, I got a chance to give this talk. But of course, the work was done by a huge team of people that I've cited here. But yeah, if you want to use any of this, or read more about it, everything is linked from this main GitHub repo here in fairseq, and you can click on everything else afterwards.

But yeah, maybe I'll hand it back to Steven, if we have any questions or anything else like that. All right. No, thanks for the great talk. Yeah. If anybody has any questions, feel free to unmute and ask. Did you consult with a lot of, like, native speakers for, like, you know, profanities and this type of stuff?

Like, how are you able to get access to the, you know, low quality languages or low resource languages and make sure the translations are correct? Yeah, yeah, that's a really good question. I mean, I think it's the most important to consult like a bunch of native speakers across the entire development process.

So part of our original thing was just like interviewing a bunch of people to understand like what they're looking for in a high quality translation. And then we have like an entire professional translation team hired, which took quite a long time to find, to consult with along the process.

And then right now, like we also have some of the things like toxicity lists are open to the community. So if you make like a pull request, we try to like, you know, validate that it's like a useful addition and then like try to merge it in as well.

We have a question in the room. Let's see if that comes over soon. So I'll speak, so, like, did you spend most of your time in the data pipeline stage? Yeah. Yeah. Yeah. A good question. I think the question is, did you spend most of your time in the data pipeline stage?

It ended up being about 50/50: roughly 50 on the data work, and then 50 on the other side, the modeling and evaluation work, because once the data is set, there is a lot and a lot of iteration on the modeling side to figure out, okay, how much of the data should we use?

Like how should we portion the data? How do we prevent overfitting? What is the right architecture? But a lot of work goes into the data because I think if you don't have high quality data, you know, you just can't get a good model. And for data mining, how do you mine the data?

Do you use, like, Selenium, or how do you mine the web? Yeah. So for the web, we start with Common Crawl. So we downloaded all of the different dumps of Common Crawl, and then we use an HTML parser. I think now, you know, if you download, for example, the RedPajama data set, they've done a lot of this parsing and stuff.

And then we have like large scale pipelines that are set up, like you can use Spark, for example, to process these things to like split all of the different sentences out, run your language identification, you know, you can do different heuristic cleaning. There are certain languages where it's like very, actually very challenging to identify what is a sentence.

Like, I think in Thai, there is no, like, period. So you have to use different models to identify what is a sentence and parse some of those things out. And then we end up with, you know, our monolingual data dump. What is Common Crawl? Is it software that you use for data sets?

Oh, yeah. Yeah. Common Crawl is kind of like an open source version of the web that runs, I think, maybe quarterly, I would have to check. But yeah, if you go to like commoncrawl.org, you can download it. But warning, it's like, very large. Hey, I have a question. You might have mentioned this briefly, but I'm wondering how ChatGPT and GPT-4 does on this.

Like, does just more scale and pre-training data help as well for low resource machine translation? Yeah, yeah. Good question. Actually, there have been some studies done on like how, you know, these systems work. I think for high resource languages, it's actually quite beneficial to scale up. I think part of that is because the models have some innate generalization.

And so one of the things is that people talk about different things in different languages, so seeing that knowledge in another language can actually help the generalization. But on low resource languages, yeah, the performance is pretty difficult, especially on some of these translation benchmarks. I also think that language models, not being trained on a translation objective, tend to score worse on translation benchmarks, because language models are only approximately capturing the same thing, whereas with translation models, you really try to align the meaning a little bit more.

But yeah, so I think for low resource, it's still pretty challenging. But one thing that's interesting is, most English language models can actually do a reasonable job at producing other languages, because it's impossible to get rid of other languages in your English-specific data. So things like French or German will work reasonably.

So just to clarify, you said language models trained with a translation objective do better, right? Because they tend to do better, like if you fine tune for the translation task, it will tend to do better. Yeah, that makes sense compared to like, for example, some few shot in context examples.

Right, right, exactly, exactly. One other question is, do you see this being similar to, for example, fine tuning on particular expert domains, which might also have less data and low resource and as well as domain specific jargon and so forth? Yeah, I mean, I think if we were to restart this project, now, I think that would be one of the first things we also explored, or at least like an extremely strong baseline, where if you like take some of the data and you try to fine tune, or, you know, try to do domain adaptation, I think that's also where like some of the like retrieval type approaches go in for, you know, translation, but also large language modeling work, where you try to have like a separate domain that you can like retrieve some text in for adaptation.

I think all of those approaches are pretty promising. All right, great, any other questions? One quick one on the point of generalizability: on one of the slides, I think you showed some results with zero shot that were higher than just the base model. Do you think that's because there might still be some overfitting on those low resource languages?

Yeah, good question. So for our large scale mining, we don't mine every single possible cross pair. So, like, Icelandic to Wolof is probably not the most in-demand translation direction. And so we did not mine all 200 times 200, because that really produces a combinatorial explosion.

And so that's where the zero shot results come from, where, you know, you don't need, you don't have training data directionally in that pair, but the model has seen both the input and the output. And so I think those results are pretty good, well, they're good for certain languages, which I think goes to show like the generalization capability.

And it's not like as critical to have like every single pair covered, but many of them are not as good. And so you see overall, the performance is lower, even though on certain languages, it can perform better than you expect. But that's because it has seen the input and the output.

It's not zero shot on, like, a completely unseen language. I have a question: I wanted to know, did you also do something related to transcription of audio information? Yeah, good question. So in this project, no, not so much transcription. But we had a follow-up work that we released actually just a month or so ago, called Seamless M4T, which is a joint model for both speech and text translation.

And that's where we do leverage a lot of audio transcription, because that also has like, it helps us bridge, you know, like the spoken data and the text data to leverage both of them together. Right, just to clarify, the supervised fine tuning, it worked better, right, compared to other methods.

So actually, in this work, it was a couple years ago now. So supervised fine tuning wasn't as common as it was now. But I think in the literature, if you want to use like a large language model to do translation, it's currently best, yeah, if you do some supervised fine tuning.

I'm just wondering about that, because the way as humans, right, we don't just learn by looking at pairs of the same thing in different languages, and kind of memorizing how to map from one to the other, we kind of learn in a more unsupervised way, where if we know both languages, then we can kind of naturally translate between them.

And I guess it makes sense for an LLM, why having supervised examples would help. Yeah. Yeah, I mean, I think it's like the base foundation model continues to improve in quality. I think that's where the quality will probably improve, and you don't need less and less fine tuning. I mean, do you think that's like the open AI approach, like if you have the best foundation model, then you don't need as much like domain specific fine tuning.

I think like, you know, like at the start, when I started working on text generation, there was like translation researchers, and like summarization researchers, and like question answering researchers, and they like work very differently. But now it's like, it's all driven by the same underlying thing, and you're not like a specialized summarization researcher anymore.

Right. I think that makes a lot of sense. Do we have any other questions? Ron, any more in-person questions? I don't think so. Okay, great. All right. Well, thank you, Angela, for the very interesting and great talk again, and for taking the time. We hope that you can keep in touch, and if anybody has any other questions, feel free to get in touch with Angela.

All right. Thanks so much for having me today. Bye, everyone. Transcribed by https://otter.ai