
Stanford XCS224U: Natural Language Understanding | In-context Learning, Pt 1: Origins | Spring 2023


Chapters

0:00
1:22 Early precedents
2:45 Beginnings: Radford et al. 2019 (GPT-2)
5:09 Cultural moment: Brown et al. 2020 (GPT-3)


00:00:00.000 | Welcome everyone.
00:00:05.920 | This is the first screencast in our series on in-context learning.
00:00:08.960 | This series is a kind of companion to the one that we did on information retrieval.
00:00:12.840 | The two series come together to help you with homework two and bake-off two,
00:00:16.560 | which is focused on few-shot open domain question answering with
00:00:20.520 | frozen retrievers and frozen large language models.
00:00:24.240 | To start this series,
00:00:25.480 | I thought we would just reflect a bit on the origins of
00:00:28.360 | the idea of in-context learning,
00:00:30.400 | which is really a story of how NLP got to
00:00:33.480 | this strange and exciting and chaotic moment for the field,
00:00:38.160 | and maybe also for society more broadly.
00:00:41.200 | All credit to the Chomsky bot for bringing us to this moment.
00:00:45.420 | I'm only joking.
00:00:46.840 | The Chomsky bot is a very simple pattern-based language model.
00:00:52.040 | It's been around since the '90s, I believe.
00:00:55.640 | With very simple mechanisms,
00:00:57.560 | it produces prose that is roughly in the style of
00:01:00.600 | the political philosopher and sometime linguist Noam Chomsky.
00:01:04.480 | It produces prose that delights and maybe informs us,
00:01:08.880 | and the underlying mechanisms are very simple.
00:01:11.160 | I think that's a nice reminder about what
00:01:13.920 | all of these large language models might be doing even in the present day.
00:01:17.880 | But I'm only joking,
00:01:19.840 | although it's only partly a joke.
00:01:21.520 | I think when we think about precedents for in-context learning,
00:01:24.680 | it is worth mentioning that in the pre-deep learning era,
00:01:28.560 | N-gram-based language models,
00:01:30.320 | very sparse large language models,
00:01:32.720 | were often truly massive.
00:01:34.560 | For example, Brants et al.
00:01:35.880 | 2007 use a 300 billion parameter language model
00:01:40.840 | trained on two trillion tokens of
00:01:42.800 | text to help with machine translation.
00:01:45.680 | That is a very large and very powerful mechanism with
00:01:48.780 | a different character from the large language models of today.
00:01:52.120 | But it is nonetheless worth noting that they played
00:01:54.680 | an important role in a lot of different fields way back when.
00:01:58.920 | I think for in-context learning as we know it now,
00:02:03.160 | the earliest paper, as far as I know, is the decaNLP paper.
00:02:07.720 | This is McCann et al. 2018.
00:02:09.680 | They do multitask training with
00:02:11.880 | task instructions that are natural language questions.
00:02:15.360 | That does seem like the origin of the idea that with free-form natural language
00:02:19.640 | instructions we could essentially end up with artifacts that could do
00:02:23.240 | multiple things guided solely by text.
00:02:27.920 | Then it's worth noting also that in the GPT paper,
00:02:31.600 | Radford et al. 2018,
00:02:33.420 | you can find buried in there some tentative proposals to do
00:02:36.640 | prompt-based experiments with that model.
00:02:40.860 | But the real origins of the ideas, again,
00:02:44.420 | as far as I know,
00:02:45.680 | are Radford et al. 2019.
00:02:47.920 | This is the GPT-2 paper.
00:02:50.520 | Let me just show you some snippets from this paper.
00:02:53.080 | It's really inspiring how much they did.
00:02:55.280 | They say at the start,
00:02:56.440 | "We demonstrate language models can perform
00:02:58.520 | downstream tasks in a zero-shot setting
00:03:01.160 | without any parameter or architecture modification."
00:03:04.120 | There you see this idea of using frozen models,
00:03:07.440 | prompting them, and seeing if they will produce interesting behaviors.
00:03:11.480 | They looked at a bunch of different tasks.
00:03:14.160 | For summarization, they say,
00:03:15.640 | "To induce summarization behavior,
00:03:17.480 | we add the text TL;DR: after the article and generate 100 tokens."
00:03:22.080 | This is mind-blowing.
00:03:23.280 | I remember when I first heard about this idea,
00:03:26.640 | I had such a cognitive bias against in-context learning of this sort being
00:03:30.920 | successful that I assumed what they were trying to say to us is that they had
00:03:35.280 | trained that token in a task-specific way to
00:03:39.560 | do summarization and then just given it a colorful name.
00:03:43.120 | But no, they really meant it.
00:03:45.040 | They simply prompt the model with this token,
00:03:47.360 | and look at what comes out.
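
To make that concrete, here is a minimal sketch of a TL;DR-style zero-shot summarization prompt. It uses the Hugging Face transformers library and the public gpt2 checkpoint as stand-ins (an assumption for illustration; the original experiments used OpenAI's own model and decoding setup, and the article text and generation settings here are invented):

```python
# Minimal sketch of zero-shot summarization via a "TL;DR:" prompt.
# Assumes the Hugging Face transformers library and its public "gpt2" checkpoint;
# the article text and decoding settings are illustrative, not the paper's.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

article = (
    "The city council voted on Tuesday to expand the bike lane network, "
    "citing a sharp rise in commuter cycling over the past two years."
)
prompt = article + "\nTL;DR:"

# The frozen model simply continues the prompt; the continuation is read off
# as the summary. No parameters are updated anywhere.
output = generator(prompt, max_new_tokens=100, do_sample=False)
summary = output[0]["generated_text"][len(prompt):].strip()
print(summary)
```

The only task specification in this setup is the literal string "TL;DR:" appended to the input; everything else is ordinary left-to-right generation from a frozen model.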
00:03:49.680 | For translation, they say,
00:03:51.920 | "We test whether GPT-2 has begun to
00:03:54.040 | learn how to translate from one language to another.
00:03:56.800 | In order to help it infer that this is the desired task,
00:03:59.720 | we condition the language model on a context of
00:04:02.040 | example pairs of the format English sentence equals French sentence.
00:04:06.200 | Then after a final prompt of English sentence equals,
00:04:09.240 | we sample from the model with greedy decoding and
00:04:11.680 | use the first generated sentence as the translation."
00:04:14.560 | Incredible, and what you see emerging there,
00:04:17.080 | is this idea of demonstrations,
00:04:19.120 | including in the prompt some examples of the behavior that you
00:04:21.960 | want as a way of coaxing the model to do what you would like it to do.
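
A rough sketch of that demonstration-based prompt might look like the following. Again, the transformers library, the public gpt2 checkpoint, and the example pairs are illustrative assumptions; the point is only the prompt format: demonstration pairs in an "english sentence = french sentence" layout, a final source sentence followed by "=", greedy decoding, and the first generated sentence taken as the translation.

```python
# Minimal sketch of few-shot translation via in-context demonstrations,
# following the "english sentence = french sentence" prompt format described above.
# Library, checkpoint, and example pairs are illustrative assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

demonstrations = [
    ("good morning", "bonjour"),
    ("thank you very much", "merci beaucoup"),
    ("where is the train station?", "où est la gare ?"),
]

# Concatenate the demonstration pairs, then end with the sentence to translate.
context = "\n".join(f"{en} = {fr}" for en, fr in demonstrations)
prompt = context + "\nI would like a coffee ="

# Greedy decoding; take the first generated line as the translation.
output = generator(prompt, max_new_tokens=40, do_sample=False)
continuation = output[0]["generated_text"][len(prompt):]
translation = continuation.strip().split("\n")[0]
print(translation)
```

The same construction, with question-answer pairs in place of translation pairs, is the seeding strategy described next for QA.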
00:04:26.000 | Here's a similar example.
00:04:27.680 | They say, "Similar to translation,
00:04:29.360 | the context of the language model is seeded with example question-answer pairs,
00:04:33.480 | which helps the model infer the short answer style of the dataset."
00:04:37.240 | That's for QA, and again,
00:04:38.800 | they started to see that demonstrations could help
00:04:41.400 | the model see what the implicit task instruction was.
00:04:45.360 | In the paper, they also
00:04:47.360 | evaluate a bunch of other things:
00:04:48.880 | text completion, Winograd schemas,
00:04:51.440 | and reading comprehension, and maybe others.
00:04:53.800 | It's a very thorough exploration,
00:04:55.840 | very open about the benefits and limitations of the methods,
00:04:59.640 | a very impressive and creative paper.
00:05:03.120 | That was the beginning of the idea in terms of research.
00:05:08.880 | The cultural moment certainly arrives with the GPT-3 paper,
00:05:13.480 | Brown et al. 2020,
00:05:15.280 | which is also impressive in its own ways.
00:05:17.560 | Here I'm just going to quote from the abstract and we
00:05:19.560 | can linger a bit over what it says.
00:05:21.920 | They start, "We show that scaling up language models
00:05:24.960 | greatly improves task agnostic few-shot performance,
00:05:28.320 | sometimes even reaching competitiveness with
00:05:30.920 | prior state-of-the-art fine-tuning approaches."
00:05:33.840 | We could quibble with whether or not they actually saw
00:05:36.440 | competitiveness in that sense,
00:05:38.160 | but it is absolutely true that they got
00:05:40.720 | very impressive behaviors out of their model in
00:05:43.160 | this task agnostic few-shot setting.
00:05:46.280 | "Specifically, we train GPT-3,
00:05:49.440 | an autoregressive language model with 175 billion parameters,
00:05:53.480 | 10x more than any previous non-sparse language model,
00:05:57.000 | and test its performance in the few-shot setting."
00:05:59.520 | There are two things I really like about this part.
00:06:01.720 | First, 175 billion parameters is
00:06:04.920 | indeed incredibly ambitious and impressive,
00:06:07.520 | even today, to say nothing of back in 2020.
00:06:10.840 | I also really love that they mentioned non-sparse language model,
00:06:15.160 | a nod to those N-gram-based models that I mentioned before,
00:06:18.480 | which were often truly massive.
00:06:21.200 | "For all tasks, GPT-3 is applied without
00:06:24.520 | any gradient updates or fine-tuning,
00:06:26.740 | with tasks and few-shot demonstrations
00:06:28.960 | specified purely via text interaction with the model."
00:06:32.120 | That's nice. You might think in
00:06:34.560 | retrospect that they're repeating themselves here.
00:06:37.440 | They've already established that these are going to be frozen models,
00:06:40.000 | but I think it's necessary for them to do
00:06:42.320 | that because this was such an unfamiliar idea.
00:06:44.640 | I can imagine, again,
00:06:46.240 | being a reader of this paper and assuming that they can't
00:06:49.160 | really mean they're just using frozen models for all these tasks.
00:06:52.080 | Surely, there is some fine-tuning somewhere,
00:06:54.040 | and so they're emphasizing that in fact,
00:06:56.320 | the model is entirely frozen.
00:06:59.000 | "GPT-3 achieves strong performance on many NLP datasets,
00:07:03.200 | including translation, question answering,
00:07:05.440 | and cloze tasks, as well as several tasks that
00:07:08.320 | require on-the-fly reasoning or domain adaptation,
00:07:11.280 | such as unscrambling words,
00:07:13.000 | using a novel word in a sentence,
00:07:14.680 | or performing three-digit arithmetic."
00:07:16.720 | I love this. A real diversity of tasks,
00:07:19.160 | and what I think you can see them doing is really trying to
00:07:22.080 | push the limits of what would be possible in this mode.
00:07:25.840 | "At the same time, we also identify some datasets where
00:07:29.040 | GPT-3's few-shot learning still struggles,
00:07:32.000 | as well as some datasets where GPT-3
00:07:34.040 | faces methodological issues related to training on large web corpora."
00:07:38.360 | I also love this sentence.
00:07:39.760 | It's again, very open about what they
00:07:41.560 | achieved and where the limitations are.
00:07:43.560 | They're acknowledging that they found
00:07:45.080 | some tasks that are still hard for the model,
00:07:47.240 | and they also acknowledge in the paper that they had
00:07:49.680 | some minor slip-ups where they intended to make sure they
00:07:53.280 | hadn't trained on data that was relevant for
00:07:55.800 | the test task that they were performing,
00:07:58.000 | and in fact, they had not quite gotten that right.
00:08:00.760 | They're being very open about that and exploring
00:08:04.000 | how hard it is to get that right at the scale that they're operating at.
00:08:07.880 | Just like the GPT-2 paper,
00:08:10.280 | a wonderfully open and thorough exploration of the ideas.