
Stanford XCS224U: NLU | Contextual Word Representations, Part 6: RoBERTa | Spring 2023


Chapters

0:00 Intro
0:15 Addressing the known limitations with BERT
1:28 Robustly optimized BERT approach
4:23 RoBERTa results informing final system design
7:47 RoBERTa: Core model releases


00:00:00.000 | Welcome back everyone.
00:00:06.080 | This is part six in our series on contextual representation.
00:00:09.560 | We're going to focus on RoBERTa.
00:00:11.080 | RoBERTa stands for robustly optimized BERT approach.
00:00:14.680 | You might recall that I finished
00:00:16.600 | the BERT screencast by listing out
00:00:18.360 | some key known limitations of the BERT model.
00:00:21.680 | The top item on that list was just an observation that
00:00:25.280 | the BERT team originally did an admirably detailed,
00:00:28.800 | but still very partial set of
00:00:30.840 | ablation studies and optimization studies.
00:00:33.880 | That gave us some glimpses of how to best optimize BERT models,
00:00:38.960 | but it was hardly a thorough exploration.
00:00:41.480 | That's where the RoBERTa team is going to take over and try to
00:00:45.120 | do a more thorough exploration of this design space.
00:00:48.600 | I think this is a really interesting development
00:00:51.160 | because at a meta level,
00:00:53.240 | it points to a shift in methodologies.
00:00:55.640 | The RoBERTa team does do
00:00:57.640 | a much fuller exploration of the design space,
00:01:00.320 | but it's nowhere near the exhaustive exploration of
00:01:04.120 | hyperparameters that we used to
00:01:05.760 | see especially in the pre-deep learning era.
00:01:08.880 | I think what we're seeing with RoBERTa is that it is simply too
00:01:12.440 | expensive in terms of money or compute
00:01:15.040 | or time to be completely thorough.
00:01:17.960 | Even RoBERTa is a very heuristic and
00:01:21.160 | partial exploration of the design space.
00:01:24.000 | But nonetheless, I think it was extremely instructive.
00:01:27.520 | For this slide, I'm going to list out
00:01:29.920 | key differences between BERT and RoBERTa,
00:01:32.560 | and then we'll explore some of the evidence in favor of
00:01:35.560 | these decisions just after that.
00:01:38.560 | First item on the list,
00:01:40.080 | BERT used a static masking approach.
00:01:42.920 | What that means is that they copied
00:01:44.960 | their training data some number of times and
00:01:47.560 | applied different masks to each copy.
00:01:50.560 | But then that set of copies of
00:01:53.280 | the dataset with its masking was used
00:01:55.280 | repeatedly during epochs of training.
00:01:58.000 | What that means is that the same masking
00:02:00.280 | was seen repeatedly by the model.
00:02:02.560 | You might have an intuition that we'll get
00:02:04.880 | more and better diversity into
00:02:07.320 | this training regime if we dynamically mask examples,
00:02:10.200 | which would just mean that as we load individual batches,
00:02:13.320 | we apply some random dynamic masking to those so that
00:02:17.680 | subsequent batches containing the same examples
00:02:20.960 | have different masking applied to them.
00:02:23.360 | Clearly, that's going to introduce some diversity
00:02:25.720 | into the training regime and that could be useful.
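
To make the contrast concrete, here is a minimal PyTorch sketch of what per-batch dynamic masking could look like. The 80/10/10 corruption scheme is the one BERT's masked language modeling objective uses; the function name is illustrative, and handling of special and padding tokens is left out for brevity.

```python
import torch

def dynamic_mask(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Apply a fresh random mask to a batch of token ids.

    Because this runs each time a batch is loaded, the same sentence can
    receive a different mask on every epoch (dynamic masking), unlike a
    static scheme where masks are fixed when masked copies of the corpus
    are created up front.
    """
    labels = input_ids.clone()
    # Choose roughly mlm_prob of the positions as prediction targets.
    # (A real implementation would also exclude special and padding tokens.)
    masked = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~masked] = -100  # ignore unmasked positions in the loss

    corrupted = input_ids.clone()
    # Of the chosen positions: 80% become [MASK], 10% a random token, 10% stay unchanged.
    use_mask_tok = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked
    corrupted[use_mask_tok] = mask_token_id
    use_random = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & masked & ~use_mask_tok
    corrupted[use_random] = torch.randint(vocab_size, input_ids.shape)[use_random]
    return corrupted, labels
```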
00:02:28.760 | For BERT, the inputs to the model
00:02:32.840 | were two concatenated document segments,
00:02:35.760 | and that's actually crucial to
00:02:37.000 | their next sentence prediction task.
00:02:39.240 | Whereas for RoBERTa, inputs are
00:02:41.240 | sentence sequences that may even span document boundaries.
00:02:45.520 | Obviously, that's going to be disruptive to
00:02:47.800 | the next sentence prediction objective,
00:02:50.000 | but correspondingly, whereas BERT had that NSP objective,
00:02:53.960 | RoBERTa simply dropped it on
00:02:55.520 | the grounds that it was not earning its keep.
00:02:58.480 | For BERT, the training batches contained 256 examples.
00:03:03.600 | RoBERTa upped that to 2,000 examples per batch,
00:03:07.160 | a substantial increase.
00:03:09.280 | BERT used a WordPiece tokenizer,
00:03:12.000 | whereas RoBERTa used
00:03:13.240 | a byte-level byte-pair encoding (BPE) algorithm.
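
If you want to see the tokenizer difference for yourself, a quick way is through the Hugging Face transformers library, which ships both tokenizers with the released checkpoints. This is just an illustrative sketch; the exact subword splits depend on the vocabularies (roughly 30K WordPiece entries for BERT, roughly 50K byte-level BPE entries for RoBERTa).

```python
from transformers import BertTokenizer, RobertaTokenizer

# BERT: WordPiece vocabulary; continuation subwords are marked with "##".
bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")

# RoBERTa: byte-level BPE; any string can be decomposed into bytes,
# so there is no need for an unknown-token fallback.
roberta_tok = RobertaTokenizer.from_pretrained("roberta-base")

text = "Robustly optimized BERT approach"
print(bert_tok.tokenize(text))
print(roberta_tok.tokenize(text))
```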
00:03:17.080 | BERT was trained on a lot of data:
00:03:19.640 | BooksCorpus and English Wikipedia.
00:03:21.960 | RoBERTa leveled up on the amount of data by training on
00:03:25.840 | BooksCorpus, Wikipedia, CC-News,
00:03:28.080 | OpenWebText, and Stories,
00:03:29.800 | and the result of that is
00:03:31.400 | a substantial increase in
00:03:32.840 | the amount of data that the model saw.
00:03:35.440 | BERT was trained for one million steps,
00:03:38.280 | whereas RoBERTa was trained for 500,000 steps.
00:03:41.840 | Pause there. You might think that means
00:03:44.200 | RoBERTa was trained for less time,
00:03:46.760 | but remember the batch sizes are
00:03:48.720 | substantially larger and so the net effect of
00:03:51.240 | these two choices is that RoBERTa was
00:03:53.400 | trained for a lot more instances.
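
To make that concrete with the numbers just mentioned: 1,000,000 steps at 256 sequences per batch comes to roughly 256 million sequence presentations for BERT, whereas 500,000 steps at 2,000 sequences per batch comes to roughly 1 billion for RoBERTa, about four times as many despite the halved step count.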
00:03:57.120 | Then finally, for the BERT team,
00:03:59.440 | there was an intuition that it would be useful for
00:04:01.480 | optimization to train on short sequences first.
00:04:05.080 | The RoBERTa team simply dropped that and trained on
00:04:07.880 | full-length sequences throughout the training regime.
00:04:11.560 | I think those are the high-level changes
00:04:14.160 | between BERT and RoBERTa.
00:04:15.560 | There are some additional differences and I
00:04:17.720 | refer to Section 3.1 of
00:04:20.000 | the paper for the details on those.
00:04:22.920 | Let's dive into some of
00:04:25.040 | the evidence that they used for these choices,
00:04:27.520 | beginning with that first shift from
00:04:30.160 | static masking to dynamic masking.
00:04:33.000 | This table summarizes their evidence for this choice.
00:04:36.240 | They're using SQuAD, Multi-NLI,
00:04:38.600 | and Binary Stanford Sentiment Treebank
00:04:41.440 | as their benchmarks to make this decision.
00:04:44.400 | You can see that for SQuAD and SST,
00:04:46.960 | there's a pretty clear win,
00:04:48.200 | dynamic masking is better.
00:04:49.840 | For Multi-NLI, it looks like there was a small regression,
00:04:53.000 | but on average, the results look better for dynamic masking.
00:04:56.880 | I will say that to augment these results,
00:04:59.800 | there is a clear intuition that
00:05:01.480 | dynamic masking is going to be useful.
00:05:03.600 | Even if it's not reflected in these benchmarks,
00:05:06.200 | we might still think that it's
00:05:08.080 | a wise choice if we can afford to train in that way.
00:05:12.120 | We talked briefly about how examples are presented to
00:05:16.560 | these models. I would say the two competitors that
00:05:19.800 | RoBERTa thoroughly evaluated were
00:05:22.320 | full sentences and doc sentences.
00:05:25.000 | Doc sentences will be where we limit
00:05:27.480 | training instances to pairs of
00:05:29.000 | sentences that come from the same document,
00:05:31.400 | which you would think would give us a clear intuition about
00:05:34.360 | something like discourse coherence for those instances.
00:05:38.080 | We can also compare that against full sentences in which we
00:05:41.320 | present examples even though they
00:05:43.880 | might span document boundaries.
00:05:46.480 | We have less of a guarantee of discourse coherence.
00:05:49.760 | Although doc sentences comes out a little bit ahead in
00:05:53.160 | this benchmark that they have set up across SQuAD,
00:05:55.800 | Multi-NLI, SST2, and RACE,
00:05:58.800 | they chose full sentences on the grounds that there is
00:06:01.720 | more at play here than just accuracy.
00:06:04.960 | We should also think about
00:06:06.880 | the efficiency of the training regime.
00:06:09.200 | Since full sentences makes it much easier to
00:06:11.760 | create efficient batches of examples,
00:06:14.320 | they opted for that instead.
00:06:16.120 | That's also very welcome to my mind because it's showing,
00:06:19.360 | again, that there's more at stake in
00:06:21.000 | this new era than just accuracy.
00:06:23.640 | We should also consider our resources.
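
As a rough illustration of the full-sentences idea, here is a sketch of how tokenized sentences might be packed greedily into fixed-length training sequences, crossing document boundaries when necessary. The helper assumes a Hugging Face-style tokenizer with an encode method, and the 512-token limit and separator id are placeholder choices rather than details taken from the RoBERTa code.

```python
def pack_full_sentences(documents, tokenizer, max_len=512, sep_id=2):
    """Greedily pack tokenized sentences into sequences of at most max_len ids.

    FULL-SENTENCES packing: when a document ends mid-sequence, we add a
    separator and keep filling from the next document. Every sequence ends
    up close to max_len, which is what makes uniform, efficient batches easy
    to build -- the consideration that led the RoBERTa team to prefer this
    packing over DOC-SENTENCES despite a small accuracy cost.
    """
    packed, current = [], []

    def flush():
        nonlocal current
        if current:
            packed.append(current)
            current = []

    for doc in documents:                  # each doc is a list of sentence strings
        for sent in doc:
            ids = tokenizer.encode(sent, add_special_tokens=False)[:max_len]
            if len(current) + len(ids) > max_len:
                flush()                    # current sequence is full; start a new one
            current.extend(ids)
        if len(current) + 1 > max_len:
            flush()
        current.append(sep_id)             # mark the document boundary, keep filling
    flush()
    return packed
```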
00:06:26.880 | This table summarizes
00:06:29.760 | their evidence for the larger batch sizes.
00:06:32.120 | They're using various metrics here: perplexity,
00:06:34.680 | which is really a pseudo-perplexity given
00:06:36.680 | that BERT uses bidirectional context.
00:06:39.600 | They're also benchmarking against Multi-NLI and SST2.
00:06:43.320 | What they find is that clearly,
00:06:45.560 | there's a win for having
00:06:47.040 | this very large batch size at 2,000 examples.
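
Since perplexity is not well defined for a bidirectional model in the usual left-to-right sense, one common way to compute the pseudo-perplexity mentioned above is to mask each position in turn and score the true token under the masked language model. Here is a sketch of that idea using the Hugging Face transformers API; it illustrates the metric rather than reproducing the exact procedure from the RoBERTa paper.

```python
import math
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base").eval()

def pseudo_perplexity(sentence):
    """Mask each token in turn, average the negative log-probability
    assigned to the true token, then exponentiate."""
    ids = tok(sentence, return_tensors="pt")["input_ids"][0]
    nlls = []
    with torch.no_grad():
        for i in range(1, len(ids) - 1):          # skip <s> and </s>
            masked = ids.clone()
            masked[i] = tok.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, i]
            log_probs = torch.log_softmax(logits, dim=-1)
            nlls.append(-log_probs[ids[i]].item())
    return math.exp(sum(nlls) / len(nlls))

print(pseudo_perplexity("RoBERTa is a robustly optimized BERT approach."))
```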
00:06:51.640 | Then finally, just the raw amount of data that
00:06:55.240 | these models are trained on is
00:06:56.640 | interesting and also the amount
00:06:58.080 | of training time that they get.
00:06:59.680 | What they found is that they got
00:07:01.640 | the best results for RoBERTa by training
00:07:04.240 | for as long as they could possibly afford
00:07:06.400 | to on as much data as they could include.
00:07:10.200 | You can see the amount of data going up to
00:07:12.240 | 160 gigabytes here versus
00:07:14.680 | the largest BERT model at 13,
00:07:16.800 | a substantial increase.
00:07:18.560 | The step size going all the way up to 500,000,
00:07:21.640 | whereas for BERT, it was a million.
00:07:23.280 | But remember, overall, there are many more examples
00:07:26.040 | being presented as a result of
00:07:27.600 | the batch size being so much larger for the RoBERTa models.
00:07:32.240 | Again, another familiar lesson
00:07:34.880 | from the deep learning era,
00:07:36.280 | more is better in terms of data and training time,
00:07:39.880 | especially when our goal is to create
00:07:42.120 | these pre-trained artifacts
00:07:44.520 | that are useful for fine-tuning.
00:07:47.280 | To round this out, I thought I'd mention that
00:07:50.120 | the RoBERTa team released two models,
00:07:52.280 | BASE and LARGE, which are directly
00:07:54.600 | comparable to the corresponding BERT artifacts.
00:07:57.520 | The BASE model has 12 layers,
00:08:00.280 | dimensionality of 768,
00:08:02.320 | and a feed-forward layer of 3072 for a total of
00:08:05.480 | 125 million parameters which is
00:08:07.960 | more or less the same as BERT BASE.
00:08:10.080 | Then RoBERTa LARGE has
00:08:12.240 | all the same basic settings as BERT LARGE,
00:08:15.680 | and correspondingly, essentially,
00:08:17.360 | the same number of parameters at 355 million.
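
If you would like to check those parameter counts yourself, loading the released checkpoints and counting parameters takes only a few lines; this assumes the standard Hugging Face hub names for the two RoBERTa releases.

```python
from transformers import AutoModel

for name in ["roberta-base", "roberta-large"]:
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```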
00:08:21.680 | As I said at the start of this screencast,
00:08:24.600 | RoBERTa was thorough, but even that is only
00:08:26.760 | a very partial exploration of
00:08:28.440 | the full design space suggested by the BERT model.
00:08:31.360 | For many more results,
00:08:33.320 | I highly recommend this paper,
00:08:34.960 | "A Primer in BERTology" by Rogers et al.
00:08:37.720 | It's a little bit of an old paper at this point,
00:08:40.280 | so lots has happened since it was released,
00:08:42.240 | but nonetheless, it's very thorough and contains
00:08:45.000 | lots of insights about how best to set up
00:08:47.680 | these BERT style models for doing various things in NLP.
00:08:51.040 | So highly recommended as a companion
00:08:53.200 | to this little screencast.