Stanford XCS224U: NLU | Contextual Word Representations, Part 6: RoBERTa | Spring 2023
Chapters
0:00 Intro
0:15 Addressing the known limitations with BERT
1:28 Robustly optimized BERT approach
4:23 RoBERTa results informing final system design
7:47 RoBERTa: Core model releases
This is part six in our series on contextual representation. RoBERTa stands for "robustly optimized BERT approach." We have already reviewed some key known limitations of the BERT model. The top item on that list was the observation that the BERT team originally did an admirably detailed, but still only partial, exploration of its own design choices. That gave us some glimpses of how to best optimize BERT models, but only glimpses. That's where the RoBERTa team is going to take over and try to do a more thorough exploration of this design space.
I think this is a really interesting development. The RoBERTa team offers a much fuller exploration of the design space, but it's nowhere near an exhaustive exploration of everything we might try. I think what we're seeing with RoBERTa is that it is simply too expensive to search that space completely. But nonetheless, I think it was extremely instructive.
Let me walk through the major design changes, and then we'll explore some of the evidence in favor of those choices. First, masking. BERT fixed its masks once during data preprocessing, so the model saw the same masked version of each example throughout training. We can introduce more diversity into this training regime if we dynamically mask examples, which would just mean that as we load individual batches, we apply some random dynamic masking to those, so that subsequent batches containing the same examples have different tokens masked. Clearly, that's going to introduce some diversity into the training regime, and that could be useful.
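Here is a minimal sketch of dynamic masking, assuming PyTorch; the function name and details are illustrative, not the RoBERTa team's actual implementation. The key point is that the mask is sampled fresh every time a batch is built, rather than once during preprocessing.

```python
import torch

def dynamic_mask(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Sample a fresh MLM mask for this batch (illustrative sketch)."""
    labels = input_ids.clone()
    # Choose which positions to mask anew on every call, i.e., every batch.
    masked = torch.rand(input_ids.shape) < mlm_prob
    labels[~masked] = -100  # unmasked positions are ignored by the loss
    corrupted = input_ids.clone()
    # Standard 80/10/10 recipe: mask token, random token, or left unchanged.
    roll = torch.rand(input_ids.shape)
    corrupted[masked & (roll < 0.8)] = mask_token_id
    random_ids = torch.randint(vocab_size, input_ids.shape)
    replace = masked & (roll >= 0.8) & (roll < 0.9)
    corrupted[replace] = random_ids[replace]
    return corrupted, labels
```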
For how examples are presented to the model, RoBERTa trains on full sentence sequences that may even span document boundaries. Correspondingly, whereas BERT had that NSP objective, RoBERTa dropped it, on the grounds that it was not earning its keep.
For BERT, the training batches contained 256 examples. RoBERTa upped that to 2,000 examples per batch, a very substantial increase.
Tokenization also changed: BERT's original tokenizer is essentially a character-level byte-pair encoding algorithm, and RoBERTa moved to a byte-level BPE with a larger vocabulary.
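As a quick illustration (assuming the Hugging Face transformers library, which is not part of the lecture), you can inspect how the released RoBERTa tokenizer segments text; byte-level BPE never needs an unknown-token fallback, since any string can be represented at the byte level.

```python
from transformers import AutoTokenizer

# Load the byte-level BPE tokenizer shipped with the roberta-base checkpoint.
tok = AutoTokenizer.from_pretrained("roberta-base")

# Rare or novel words are split into subword pieces rather than mapped to <unk>.
print(tok.tokenize("RoBERTa robustly optimizes BERT pretraining."))
```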
RoBERTa leveled up on the amount of data by training on BooksCorpus and Wikipedia, as BERT did, plus a much larger collection of web text. For the number of training steps, BERT was trained for 1,000,000 steps, whereas RoBERTa was trained for 500,000 steps. That might sound like less training, but the batches are substantially larger, and so the net effect of the RoBERTa regime is that the model sees far more data overall. Finally, for BERT, there was an intuition that it would be useful for optimization to train on short sequences first. The RoBERTa team simply dropped that and trained on full-length sequences throughout the training regime.
Let's dig into some of the evidence that they used for these choices, starting with dynamic masking. This table summarizes their evidence for this choice. For Multi-NLI, it looks like there was a small regression, but on average, the results look better for dynamic masking. Even if it's not decisively reflected in these benchmarks, the added diversity makes dynamic masking seem like a wise choice if we can afford to train in that way.
We talked briefly about how examples are presented to these models. I would say the two competitors that emerged are doc-sentences and full sentences. With doc-sentences, all of the sentences in an example come from a single document, which you would think would give us a clear intuition about something like discourse coherence for those instances. We can also compare that against full sentences, in which we pack in sentences that may span document boundaries; there, we have less of a guarantee of discourse coherence. Although doc-sentences comes out a little bit ahead in this benchmark that they have set up across SQuAD and the other tasks, they chose full sentences on the grounds that doc-sentences results in variable batch sizes, which complicates training.
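Here is a minimal sketch of the full-sentences idea, assuming pre-tokenized input; the function and its details are illustrative, not the actual RoBERTa preprocessing code. Sentences are packed greedily into fixed-length sequences, continuing into the next document when one ends early (the real setup also inserts a separator token at document boundaries, omitted here).

```python
def pack_full_sentences(documents, max_len=512):
    """Greedily pack sentences into sequences of at most max_len tokens.

    documents: list of documents, each a list of sentences,
               each sentence a list of token ids.
    (Sentences longer than max_len are kept whole in this simplified sketch.)
    """
    sequences, current = [], []
    for doc in documents:
        for sent in doc:
            # Start a new sequence if this sentence would overflow the budget.
            if current and len(current) + len(sent) > max_len:
                sequences.append(current)
                current = []
            current.extend(sent)
    if current:
        sequences.append(current)
    return sequences
```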
That kind of transparency about the practical trade-offs is also very welcome to my mind.
For batch sizes, they're using various metrics here, perplexity among them, and they're also benchmarking against Multi-NLI and SST-2. On balance, the evidence favored this very large batch size of 2,000 examples.
Then finally, there is just the raw amount of data these models are trained on, together with the amount of training. The number of steps goes all the way up to 500,000, which is fewer steps than BERT's 1,000,000. But remember, overall, there are many more examples being processed, the batch size being so much larger for the RoBERTa models; the quick calculation below makes this concrete. The lesson seems to be that more is better in terms of data and training time.
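A back-of-the-envelope calculation, using the batch sizes and step counts quoted in this lecture plus BERT's published 1,000,000 steps, makes the point:

```python
# Sequence presentations = batch size * number of steps.
bert_presentations    = 256   * 1_000_000   # 256M
roberta_presentations = 2_000 * 500_000     # 1B

print(roberta_presentations / bert_presentations)  # roughly 3.9x more
```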
To round this out, I thought I'd mention that the RoBERTa team released models that are directly comparable to the corresponding BERT artifacts. RoBERTa-base matches BERT-base in its basic design, with 12 layers, a model dimensionality of 768, and a feed-forward layer of 3072, for a total of roughly 125 million parameters. RoBERTa-large similarly matches BERT-large, with a comparable number of parameters at 355 million.
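If you want to double-check those sizes yourself, a quick way (assuming the Hugging Face transformers library and its roberta-base and roberta-large checkpoints, which is not something covered in the lecture) is to count parameters directly:

```python
from transformers import AutoModel

for name in ["roberta-base", "roberta-large"]:
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```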
So that is RoBERTa: a more thorough, though still not exhaustive, exploration of the full design space suggested by the BERT model. It's a little bit of an old paper at this point, but nonetheless, it's very thorough and contains a lot of useful lessons about how to optimize these BERT-style models for doing various things in NLP.