Back to Index

Stanford XCS224U: NLU | Contextual Word Representations, Part 6: RoBERTa | Spring 2023


Chapters

0:00 Intro
0:15 Addressing the known limitations with BERT
1:28 Robustly optimized BERT approach
4:23 RoBERTa results informing final system design
7:47 RoBERTa: Core model releases

Transcript

Welcome back everyone. This is part six in our series on contextual representation. We're going to focus on RoBERTa. RoBERTa stands for robustly optimized BERT approach. You might recall that I finished the BERT screencast by listing out some key known limitations of the BERT model. The top item on that list was just an observation that the BERT team originally did an admirably detailed, but still very partial set of ablation studies and optimization studies.

That gave us some glimpses of how to best optimize BERT models, but it was hardly a thorough exploration. That's where the RoBERTa team is going to take over and try to do a more thorough exploration of this design space. I think this is a really interesting development because at a meta level, it points to a shift in methodologies.

The RoBERTa team does do a much fuller exploration of the design space, but it's nowhere near the exhaustive exploration of hyperparameters that we used to see especially in the pre-deep learning era. I think what we're seeing with RoBERTa is that it is simply too expensive in terms of money or compute or time to be completely thorough.

Even RoBERTa is a very heuristic and partial exploration of the design space. But nonetheless, I think it was extremely instructive. For this slide, I'm going to list out key differences between BERT and RoBERTa, and then we'll explore some of the evidence in favor of these decisions just after that.

First item on the list, BERT used a static masking approach. What that means is that they copied their training data some number of times and applied different masks to each copy. But then that set of copies of the dataset with its masking was used repeatedly during epochs of training.

What that means is that the same masking was seen repeatedly by the model. You might have an intuition that we'll get more and better diversity into this training regime if we dynamically mask examples, which would just mean that as we load individual batches, we apply some random dynamic masking to those so that subsequent batches containing the same examples have different masking applied to them.
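
To make that concrete, here is a minimal sketch of dynamic masking applied at batch-loading time. This is just my own illustration, not the RoBERTa team's code: the 15% masking rate and the 80/10/10 replacement scheme follow BERT's published recipe, and the toy vocabulary and helper name are hypothetical.

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # toy vocabulary

def dynamically_mask(tokens, mask_rate=0.15):
    """Apply fresh random masking to one example (BERT-style 80/10/10 scheme)."""
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            labels.append(tok)                       # predict the original token here
            r = random.random()
            if r < 0.8:
                inputs.append(MASK)                  # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(random.choice(VOCAB))  # 10%: replace with a random token
            else:
                inputs.append(tok)                   # 10%: keep the token unchanged
        else:
            inputs.append(tok)
            labels.append(None)                      # not a prediction target
    return inputs, labels

# Because masking happens at batch-loading time, the same sentence gets a
# different mask pattern every time it is sampled:
example = "the cat sat on the mat".split()
for _ in range(2):
    print(dynamically_mask(example))
```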

Clearly, that's going to introduce some diversity into the training regime and that could be useful. For BERT, the inputs to the model were two concatenated document segments, and that's actually crucial to their next sentence prediction task. Whereas for RoBERTa, inputs are sentence sequences that may even span document boundaries.

Obviously, that's going to be disruptive to the next sentence prediction objective, but correspondingly, whereas BERT had that NSP objective, RoBERTa simply dropped it on the grounds that it was not earning its keep. For BERT, the training batches contained 256 examples. RoBERTa upped that to 2,000 examples per batch, a substantial increase.

BERT used a WordPiece tokenizer, whereas RoBERTa used a byte-level byte-pair encoding (BPE) algorithm. BERT was trained on a lot of data: BooksCorpus and English Wikipedia. RoBERTa leveled up on the amount of data by training on BooksCorpus, Wikipedia, CC-News, OpenWebText, and Stories, and the result is a substantial increase in the amount of data that the model saw.
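
If you want to see the tokenizer difference for yourself, a quick comparison using the Hugging Face transformers library (assuming you have it installed along with the two pretrained checkpoints) looks roughly like this:

```python
from transformers import AutoTokenizer

# BERT: WordPiece vocabulary (~30K types; "##" marks word-internal pieces).
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
# RoBERTa: byte-level BPE (~50K types; "Ġ" marks a token preceded by a space).
roberta_tok = AutoTokenizer.from_pretrained("roberta-base")

text = "RoBERTa robustly optimizes BERT pretraining."
print(bert_tok.tokenize(text))
print(roberta_tok.tokenize(text))
```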

BERT was trained for one million steps, whereas RoBERTa was trained for 500,000 steps. Pause there. You might think that means RoBERTa was trained for less time, but remember the batch sizes are substantially larger and so the net effect of these two choices is that RoBERTa was trained for a lot more instances.
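
As a quick sanity check on that claim, here is the back-of-the-envelope arithmetic, using the batch sizes and step counts quoted in this lecture:

```python
# Total training instances ≈ steps × batch size
# (ignoring differences in sequence length per instance).
bert_instances = 1_000_000 * 256      # ≈ 256M instances
roberta_instances = 500_000 * 2_000   # ≈ 1B instances

print(bert_instances, roberta_instances)
print(roberta_instances / bert_instances)  # ≈ 3.9x more instances for RoBERTa
```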

Then finally, for the BERT team, there was an intuition that it would be useful for optimization to train on short sequences first. The RoBERTa team simply dropped that and trained on full-length sequences throughout the training regime. I think those are the high-level changes between BERT and RoBERTa. There are some additional differences and I refer to Section 3.1 of the paper for the details on those.

Let's dive into some of the evidence that they used for these choices, beginning with that first shift from static masking to dynamic masking. This table summarizes their evidence for this choice. They're using SQuAD, Multi-NLI, and Binary Stanford Sentiment Treebank as their benchmarks to make this decision. You can see that for SQuAD and SST, there's a pretty clear win, dynamic masking is better.

For Multi-NLI, it looks like there was a small regression, but on average, the results look better for dynamic masking. I will say that to augment these results, there is a clear intuition that dynamic masking is going to be useful. Even if it's not reflected in these benchmarks, we might still think that it's a wise choice if we can afford to train in that way.

We talked briefly about how examples are presented to these models. I would say the two competitors that the RoBERTa team thoroughly evaluated were FULL-SENTENCES and DOC-SENTENCES. DOC-SENTENCES is where we limit training instances to sentence sequences that come from the same document, which you would think would give us something like a guarantee of discourse coherence for those instances.

We can also compare that against FULL-SENTENCES, in which we present examples even though they might span document boundaries. We have less of a guarantee of discourse coherence. Although DOC-SENTENCES comes out a little bit ahead in this benchmark that they have set up across SQuAD, Multi-NLI, SST-2, and RACE, they chose FULL-SENTENCES on the grounds that there is more at play here than just accuracy.

We should also think about the efficiency of the training regime. Since FULL-SENTENCES makes it much easier to create efficient batches of examples, they opted for that instead. That's also very welcome to my mind because it's showing, again, that there's more at stake in this new era than just accuracy.
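
Here is a rough sketch of what FULL-SENTENCES-style packing involves. This is my own simplification with hypothetical helper names: sentences are appended in order, crossing document boundaries where necessary, until the maximum sequence length is reached.

```python
def pack_full_sentences(documents, tokenizer, max_len=512, sep="</s>"):
    """Greedily pack sentences into fixed-length sequences, crossing doc boundaries."""
    sequences, current = [], []
    for doc in documents:                  # each doc is a list of sentence strings
        for sent in doc:
            tokens = tokenizer(sent) + [sep]
            if current and len(current) + len(tokens) > max_len:
                sequences.append(current)  # emit a full training sequence
                current = []
            current.extend(tokens)
    if current:
        sequences.append(current)
    return sequences

# Toy usage with a whitespace "tokenizer"; note that doc A and doc B end up
# packed into the same training sequence:
docs = [["Sentence one of doc A.", "Sentence two of doc A."],
        ["Doc B starts here.", "And continues."]]
print(pack_full_sentences(docs, str.split, max_len=20))
```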

We should also consider our resources. This table summarizes their evidence for the larger batch sizes. They're using various metrics here: perplexity, which is really a pseudo-perplexity given that BERT uses bidirectional context, and they're also benchmarking against Multi-NLI and SST-2. What they find is that there's clearly a win for having this very large batch size of 2,000 examples.
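
To make the pseudo-perplexity idea concrete: mask each position in turn, score the true token under the model, and exponentiate the average negative log-probability. The sketch below is my own illustration using the Hugging Face transformers API, not the evaluation code from the paper:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base").eval()

def pseudo_perplexity(sentence):
    ids = tok(sentence, return_tensors="pt")["input_ids"][0]
    nlls = []
    for i in range(1, len(ids) - 1):               # skip <s> and </s>
        masked = ids.clone()
        masked[i] = tok.mask_token_id              # mask one position at a time
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs = torch.log_softmax(logits, dim=-1)
        nlls.append(-log_probs[ids[i]].item())     # NLL of the true token
    return float(torch.exp(torch.tensor(nlls).mean()))

print(pseudo_perplexity("RoBERTa drops the next sentence prediction objective."))
```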

Then finally, just the raw amount of data that these models are trained on is interesting, and also the amount of training time that they get. What they found is that they got the best results for RoBERTa by training for as long as they could possibly afford to on as much data as they could include.

You can see the amount of data going up to 160 gigabytes here, versus the largest BERT model at 13, a substantial increase. The number of steps goes all the way up to 500,000, whereas for BERT it was a million. But remember, overall, there are many more examples being presented as a result of the batch size being so much larger for the RoBERTa models.

Again, another familiar lesson from the deep learning era: more is better in terms of data and training time, especially when our goal is to create these pre-trained artifacts that are useful for fine-tuning. To round this out, I thought I'd mention that the RoBERTa team released two models, BASE and LARGE, which are directly comparable to the corresponding BERT artifacts.

The BASE model has 12 layers, a model dimensionality of 768, and a feed-forward layer of 3,072, for a total of 125 million parameters, which is more or less the same as BERT BASE. Then, correspondingly, the LARGE model has all the same basic settings as BERT LARGE and essentially the same number of parameters, at 355 million.
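
If you'd like to verify those parameter counts, a quick check with the Hugging Face transformers library (assuming the roberta-base and roberta-large checkpoints) looks like this:

```python
from transformers import AutoModel

for name in ["roberta-base", "roberta-large"]:
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```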

As I said at the start of this screencast, RoBERTa was thorough, but even that is only a very partial exploration of the full design space suggested by the BERT model. For many more results, I highly recommend this paper, "A Primer in BERTology" from Rogers et al. It's a little bit of an old paper at this point, so lots has happened since it was released, but nonetheless, it's very thorough and contains lots of insights about how best to set up these BERT-style models for doing various things in NLP.

So highly recommended as a companion to this little screencast.