
Stanford XCS224U: NLU | Contextual Word Representations, Part 6: RoBERTa | Spring 2023


Chapters

0:00 Intro
0:15 Addressing the known limitations with BERT
1:28 Robustly optimized BERT approach
4:23 RoBERTa results informing final system design
7:47 RoBERTa: Core model releases


00:00:00.000 | Welcome back everyone.
00:00:06.080 | This is part six in our series on contextual representation.
00:00:09.560 | We're going to focus on RoBERTa.
00:00:11.080 | RoBERTa stands for robustly optimized BERT approach.
00:00:14.680 | You might recall that I finished
00:00:16.600 | the BERT screencast by listing out
00:00:18.360 | some key known limitations of the BERT model.
00:00:21.680 | The top item on that list was just an observation that
00:00:25.280 | the BERT team originally did an admirably detailed,
00:00:28.800 | but still very partial set of
00:00:30.840 | ablation studies and optimization studies.
00:00:33.880 | That gave us some glimpses of how to best optimize BERT models,
00:00:38.960 | but it was hardly a thorough exploration.
00:00:41.480 | That's where the RoBERTa team is going to take over and try to
00:00:45.120 | do a more thorough exploration of this design space.
00:00:48.600 | I think this is a really interesting development
00:00:51.160 | because at a meta level,
00:00:53.240 | it points to a shift in methodologies.
00:00:55.640 | The RoBERTa team does do
00:00:57.640 | a much fuller exploration of the design space,
00:01:00.320 | but it's nowhere near the exhaustive exploration of
00:01:04.120 | hyperparameters that we used to
00:01:05.760 | see especially in the pre-deep learning era.
00:01:08.880 | I think what we're seeing with RoBERTa is that it is simply too
00:01:12.440 | expensive in terms of money or compute
00:01:15.040 | or time to be completely thorough.
00:01:17.960 | Even RoBERTa is a very heuristic and
00:01:21.160 | partial exploration of the design space.
00:01:24.000 | But nonetheless, I think it was extremely instructive.
00:01:27.520 | For this slide, I'm going to list out
00:01:29.920 | key differences between BERT and RoBERTa,
00:01:32.560 | and then we'll explore some of the evidence in favor of
00:01:35.560 | these decisions just after that.
00:01:38.560 | First item on the list,
00:01:40.080 | BERT used a static masking approach.
00:01:42.920 | What that means is that they copied
00:01:44.960 | their training data some number of times and
00:01:47.560 | applied different masks to each copy.
00:01:50.560 | But then that set of copies of
00:01:53.280 | the dataset with its masking was used
00:01:55.280 | repeatedly during epochs of training.
00:01:58.000 | What that means is that the same masking
00:02:00.280 | was seen repeatedly by the model.
00:02:02.560 | You might have an intuition that we'll get
00:02:04.880 | more and better diversity into
00:02:07.320 | this training regime if we dynamically mask examples,
00:02:10.200 | which would just mean that as we load individual batches,
00:02:13.320 | we apply some random dynamic masking to those so that
00:02:17.680 | subsequent batches containing the same examples
00:02:20.960 | have different masking applied to them.
00:02:23.360 | Clearly, that's going to introduce some diversity
00:02:25.720 | into the training regime and that could be useful.
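
To make the contrast concrete, here is a minimal PyTorch sketch of what per-batch dynamic masking could look like. The 80/10/10 corruption scheme is the one BERT's masked language modeling objective uses; the function name is illustrative, and handling of special and padding tokens is left out for brevity.

```python
import torch

def dynamic_mask(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Apply a fresh random mask to a batch of token ids.

    Because this runs each time a batch is loaded, the same sentence can
    receive a different mask on every epoch (dynamic masking), unlike a
    static scheme where masks are fixed when masked copies of the corpus
    are created up front.
    """
    labels = input_ids.clone()
    # Choose roughly mlm_prob of the positions as prediction targets.
    # (A real implementation would also exclude special and padding tokens.)
    masked = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~masked] = -100  # ignore unmasked positions in the loss

    corrupted = input_ids.clone()
    # Of the chosen positions: 80% become [MASK], 10% a random token, 10% stay unchanged.
    use_mask_tok = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked
    corrupted[use_mask_tok] = mask_token_id
    use_random = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & masked & ~use_mask_tok
    corrupted[use_random] = torch.randint(vocab_size, input_ids.shape)[use_random]
    return corrupted, labels
```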
00:02:28.760 | For BERT, the inputs to the model
00:02:32.840 | were two concatenated document segments,
00:02:35.760 | and that's actually crucial to
00:02:37.000 | their next sentence prediction task.
00:02:39.240 | Whereas for RoBERTa, inputs are
00:02:41.240 | sentence sequences that may even span document boundaries.
00:02:45.520 | Obviously, that's going to be disruptive to
00:02:47.800 | the next sentence prediction objective,
00:02:50.000 | but correspondingly, whereas BERT had that NSP objective,
00:02:53.960 | RoBERTa simply dropped it on
00:02:55.520 | the grounds that it was not earning its keep.
00:02:58.480 | For BERT, the training batches contained 256 examples.
00:03:03.600 | RoBERTa upped that to 2,000 examples per batch,
00:03:07.160 | a substantial increase.
00:03:09.280 | BERT used a WordPiece tokenizer,
00:03:12.000 | whereas RoBERTa used
00:03:13.240 | a byte-level byte-pair encoding (BPE) algorithm.
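
If you want to see the tokenizer difference for yourself, a quick way is through the Hugging Face transformers library, which ships both tokenizers with the released checkpoints. This is just an illustrative sketch; the exact subword splits depend on the vocabularies (roughly 30K WordPiece entries for BERT, roughly 50K byte-level BPE entries for RoBERTa).

```python
from transformers import BertTokenizer, RobertaTokenizer

# BERT: WordPiece vocabulary; continuation subwords are marked with "##".
bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")

# RoBERTa: byte-level BPE; any string can be decomposed into bytes,
# so there is no need for an unknown-token fallback.
roberta_tok = RobertaTokenizer.from_pretrained("roberta-base")

text = "Robustly optimized BERT approach"
print(bert_tok.tokenize(text))
print(roberta_tok.tokenize(text))
```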
00:03:17.080 | BERT was trained on a lot of data:
00:03:19.640 | BooksCorpus and English Wikipedia.
00:03:21.960 | RoBERTa leveled up on the amount of data by training on
00:03:25.840 | BooksCorpus, Wikipedia, CC-News,
00:03:28.080 | OpenWebText, and Stories,
00:03:29.800 | and the result of that is
00:03:31.400 | a substantial increase in
00:03:32.840 | the amount of data that the model saw.
00:03:35.440 | BERT was trained for one million steps,
00:03:38.280 | whereas RoBERTa was trained for 500,000 steps.
00:03:41.840 | Pause there. You might think that means
00:03:44.200 | RoBERTa was trained for less time,
00:03:46.760 | but remember the batch sizes are
00:03:48.720 | substantially larger and so the net effect of
00:03:51.240 | these two choices is that RoBERTa was
00:03:53.400 | trained for a lot more instances.
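
To make that concrete with the numbers just mentioned: 1,000,000 steps at 256 sequences per batch comes to roughly 256 million sequence presentations for BERT, whereas 500,000 steps at 2,000 sequences per batch comes to roughly 1 billion for RoBERTa, about four times as many despite the halved step count.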
00:03:57.120 | Then finally, for the BERT team,
00:03:59.440 | there was an intuition that it would be useful for
00:04:01.480 | optimization to train on short sequences first.
00:04:05.080 | The RoBERTa team simply dropped that and trained on
00:04:07.880 | full-length sequences throughout the training regime.
00:04:11.560 | I think those are the high-level changes
00:04:14.160 | between BERT and RoBERTa.
00:04:15.560 | There are some additional differences and I
00:04:17.720 | refer to Section 3.1 of
00:04:20.000 | the paper for the details on those.
00:04:22.920 | Let's dive into some of
00:04:25.040 | the evidence that they used for these choices,
00:04:27.520 | beginning with that first shift from
00:04:30.160 | static masking to dynamic masking.
00:04:33.000 | This table summarizes their evidence for this choice.
00:04:36.240 | They're using SQuAD, Multi-NLI,
00:04:38.600 | and Binary Stanford Sentiment Treebank
00:04:41.440 | as their benchmarks to make this decision.
00:04:44.400 | You can see that for SQuAD and SST,
00:04:46.960 | there's a pretty clear win,
00:04:48.200 | dynamic masking is better.
00:04:49.840 | For Multi-NLI, it looks like there was a small regression,
00:04:53.000 | but on average, the results look better for dynamic masking.
00:04:56.880 | I will say that to augment these results,
00:04:59.800 | there is a clear intuition that
00:05:01.480 | dynamic masking is going to be useful.
00:05:03.600 | Even if it's not reflected in these benchmarks,
00:05:06.200 | we might still think that it's
00:05:08.080 | a wise choice if we can afford to train in that way.
00:05:12.120 | We talked briefly about how examples are presented to
00:05:16.560 | these models. I would say the two competitors that
00:05:19.800 | RoBERTa thoroughly evaluated were
00:05:22.320 | full sentences and doc sentences.
00:05:25.000 | Doc sentences will be where we limit
00:05:27.480 | training instances to pairs of
00:05:29.000 | sentences that come from the same document,
00:05:31.400 | which you would think would give us a clear intuition about
00:05:34.360 | something like discourse coherence for those instances.
00:05:38.080 | We can also compare that against full sentences in which we
00:05:41.320 | present examples even though they
00:05:43.880 | might span document boundaries.
00:05:46.480 | We have less of a guarantee of discourse coherence.
00:05:49.760 | Although doc sentences comes out a little bit ahead in
00:05:53.160 | this benchmark that they have set up across SQuAD,
00:05:55.800 | Multi-NLI, SST2, and RACE,
00:05:58.800 | they chose full sentences on the grounds that there is
00:06:01.720 | more at play here than just accuracy.
00:06:04.960 | We should also think about
00:06:06.880 | the efficiency of the training regime.
00:06:09.200 | Since full sentences makes it much easier to
00:06:11.760 | create efficient batches of examples,
00:06:14.320 | they opted for that instead.
00:06:16.120 | That's also very welcome to my mind because it's showing,
00:06:19.360 | again, that there's more at stake in
00:06:21.000 | this new era than just accuracy.
00:06:23.640 | We should also consider our resources.
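
As a rough illustration of the full-sentences idea, here is a sketch of how tokenized sentences might be packed greedily into fixed-length training sequences, crossing document boundaries when necessary. The helper assumes a Hugging Face-style tokenizer with an encode method, and the 512-token limit and separator id are placeholder choices rather than details taken from the RoBERTa code.

```python
def pack_full_sentences(documents, tokenizer, max_len=512, sep_id=2):
    """Greedily pack tokenized sentences into sequences of at most max_len ids.

    FULL-SENTENCES packing: when a document ends mid-sequence, we add a
    separator and keep filling from the next document. Every sequence ends
    up close to max_len, which is what makes uniform, efficient batches easy
    to build -- the consideration that led the RoBERTa team to prefer this
    packing over DOC-SENTENCES despite a small accuracy cost.
    """
    packed, current = [], []

    def flush():
        nonlocal current
        if current:
            packed.append(current)
            current = []

    for doc in documents:                  # each doc is a list of sentence strings
        for sent in doc:
            ids = tokenizer.encode(sent, add_special_tokens=False)[:max_len]
            if len(current) + len(ids) > max_len:
                flush()                    # current sequence is full; start a new one
            current.extend(ids)
        if len(current) + 1 > max_len:
            flush()
        current.append(sep_id)             # mark the document boundary, keep filling
    flush()
    return packed
```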
00:06:26.880 | This table summarizes
00:06:29.760 | their evidence for the larger batch sizes.
00:06:32.120 | They're using various metrics here: perplexity,
00:06:34.680 | which is really a pseudo-perplexity given
00:06:36.680 | that BERT uses bidirectional context.
00:06:39.600 | They're also benchmarking against Multi-NLI and SST2.
00:06:43.320 | What they find is that clearly,
00:06:45.560 | there's a win for having
00:06:47.040 | this very large batch size at 2,000 examples.
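
Since perplexity is not well defined for a bidirectional model in the usual left-to-right sense, one common way to compute the pseudo-perplexity mentioned above is to mask each position in turn and score the true token under the masked language model. Here is a sketch of that idea using the Hugging Face transformers API; it illustrates the metric rather than reproducing the exact procedure from the RoBERTa paper.

```python
import math
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base").eval()

def pseudo_perplexity(sentence):
    """Mask each token in turn, average the negative log-probability
    assigned to the true token, then exponentiate."""
    ids = tok(sentence, return_tensors="pt")["input_ids"][0]
    nlls = []
    with torch.no_grad():
        for i in range(1, len(ids) - 1):          # skip <s> and </s>
            masked = ids.clone()
            masked[i] = tok.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, i]
            log_probs = torch.log_softmax(logits, dim=-1)
            nlls.append(-log_probs[ids[i]].item())
    return math.exp(sum(nlls) / len(nlls))

print(pseudo_perplexity("RoBERTa is a robustly optimized BERT approach."))
```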
00:06:51.640 | Then finally, just the raw amount of data that
00:06:55.240 | these models are trained on is
00:06:56.640 | interesting and also the amount
00:06:58.080 | of training time that they get.
00:06:59.680 | What they found is that they got
00:07:01.640 | the best results for RoBERTa by training
00:07:04.240 | for as long as they could possibly afford
00:07:06.400 | to on as much data as they could include.
00:07:10.200 | You can see the amount of data going up to
00:07:12.240 | 160 gigabytes here versus
00:07:14.680 | the largest BERT model at 13,
00:07:16.800 | a substantial increase.
00:07:18.560 | The step size going all the way up to 500,000,
00:07:21.640 | whereas for BERT, it was a million.
00:07:23.280 | But remember, overall, there are many more examples
00:07:26.040 | being presented as a result of
00:07:27.600 | the batch size being so much larger for the RoBERTa models.
00:07:32.240 | Again, another familiar lesson
00:07:34.880 | from the deep learning era,
00:07:36.280 | more is better in terms of data and training time,
00:07:39.880 | especially when our goal is to create
00:07:42.120 | these pre-trained artifacts
00:07:44.520 | that are useful for fine-tuning.
00:07:47.280 | To round this out, I thought I'd mention that
00:07:50.120 | the RoBERTa team released two models,
00:07:52.280 | BASE and LARGE, which are directly
00:07:54.600 | comparable to the corresponding BERT artifacts.
00:07:57.520 | The BASE model has 12 layers,
00:08:00.280 | dimensionality of 768,
00:08:02.320 | and a feed-forward layer of 3072 for a total of
00:08:05.480 | 125 million parameters which is
00:08:07.960 | more or less the same as BERT BASE.
00:08:10.080 | Then RoBERTa LARGE has
00:08:12.240 | all the same basic settings as BERT LARGE,
00:08:15.680 | and correspondingly, essentially,
00:08:17.360 | the same number of parameters at 355 million.
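
If you would like to check those parameter counts yourself, loading the released checkpoints and counting parameters takes only a few lines; this assumes the standard Hugging Face hub names for the two RoBERTa releases.

```python
from transformers import AutoModel

for name in ["roberta-base", "roberta-large"]:
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```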
00:08:21.680 | As I said at the start of this screencast,
00:08:24.600 | RoBERTa was thorough, but even that is only
00:08:26.760 | a very partial exploration of
00:08:28.440 | the full design space suggested by the BERT model.
00:08:31.360 | For many more results,
00:08:33.320 | I highly recommend this paper,
00:08:34.960 | "A Primer in BERTology" by Rogers et al.
00:08:37.720 | It's a little bit of an old paper at this point,
00:08:40.280 | so lots has happened since it was released,
00:08:42.240 | but nonetheless, it's very thorough and contains
00:08:45.000 | lots of insights about how best to set up
00:08:47.680 | these BERT style models for doing various things in NLP.
00:08:51.040 | So highly recommended as a companion
00:08:53.200 | to this little screencast.