
Stanford XCS224U: Natural Language Understanding | In-context Learning, Pt 1: Origins | Spring 2023


Chapters

0:00
1:22 Early precedents
2:45 Beginnings: Radford et al. 2019 (GPT-2)
5:09 Cultural moment: Brown et al. 2020 (GPT-3)


00:00:00.000 | Welcome everyone.
00:00:05.920 | This is the first screencast in our series on in-context learning.
00:00:08.960 | This series is a kind of companion to the one that we did on information retrieval.
00:00:12.840 | The two series come together to help you with homework two and bake-off two,
00:00:16.560 | which is focused on few-shot open domain question answering with
00:00:20.520 | frozen retrievers and frozen large language models.
00:00:24.240 | To start this series,
00:00:25.480 | I thought we would just reflect a bit on the origins of
00:00:28.360 | the idea of in-context learning,
00:00:30.400 | which is really a story of how NLP got to
00:00:33.480 | this strange and exciting and chaotic moment for the field,
00:00:38.160 | and maybe also for society more broadly.
00:00:41.200 | All credit to the Chomsky bot for bringing us to this moment.
00:00:45.420 | I'm only joking.
00:00:46.840 | The Chomsky bot is a very simple pattern-based language model.
00:00:52.040 | It's been around since the '90s, I believe.
00:00:55.640 | With very simple mechanisms,
00:00:57.560 | it produces prose that is roughly in the style of
00:01:00.600 | the political philosopher and sometime linguist Noam Chomsky.
00:01:04.480 | It produces prose that delights and maybe informs us,
00:01:08.880 | and the underlying mechanisms are very simple.
00:01:11.160 | I think that's a nice reminder about what
00:01:13.920 | all of these large language models might be doing even in the present day.
00:01:17.880 | But I'm only joking,
00:01:19.840 | although it's only partly a joke.
00:01:21.520 | I think when we think about precedents for in-context learning,
00:01:24.680 | it is worth mentioning that in the pre-deep learning era,
00:01:28.560 | N-gram-based language models,
00:01:30.320 | very sparse large language models,
00:01:32.720 | were often truly massive.
00:01:34.560 | For example, Brants et al.
00:01:35.880 | 2007 use a 300 billion parameter language model
00:01:40.840 | trained on two trillion tokens of
00:01:42.800 | text to help with machine translation.
00:01:45.680 | That is a very large and very powerful mechanism with
00:01:48.780 | a different character from the large language models of today.
00:01:52.120 | But it is nonetheless worth noting that they played
00:01:54.680 | an important role in a lot of different fields way back when.
00:01:58.920 | I think for in-context learning as we know it now,
00:02:03.160 | the earliest paper, as far as I know, is the decaNLP paper.
00:02:07.720 | This is McCann et al. 2018.
00:02:09.680 | They do multitask training with
00:02:11.880 | task instructions that are natural language questions.
00:02:15.360 | That does seem like the origin of the idea that with free-form natural language
00:02:19.640 | instructions we could essentially end up with artifacts that could do
00:02:23.240 | multiple things guided solely by text.
00:02:27.920 | Then it's worth noting also that in the GPT paper,
00:02:31.600 | Radford et al. 2018,
00:02:33.420 | you can find buried in there some tentative proposals to do
00:02:36.640 | prompt-based experiments with that model.
00:02:40.860 | But the real origins of the ideas, again,
00:02:44.420 | as far as I know,
00:02:45.680 | are Radford et al. 2019.
00:02:47.920 | This is the GPT-2 paper.
00:02:50.520 | Let me just show you some snippets from this paper.
00:02:53.080 | It's really inspiring how much they did.
00:02:55.280 | They say at the start,
00:02:56.440 | "We demonstrate language models can perform
00:02:58.520 | downstream tasks in a zero-shot setting
00:03:01.160 | without any parameter or architecture modification."
00:03:04.120 | There you see this idea of using frozen models,
00:03:07.440 | prompting them, and seeing if they will produce interesting behaviors.
00:03:11.480 | They looked at a bunch of different tasks.
00:03:14.160 | For summarization, they say,
00:03:15.640 | "To induce summarization behavior,
00:03:17.480 | we add the text TL;DR: after the article and generate 100 tokens."
00:03:22.080 | This is mind-blowing.
00:03:23.280 | I remember when I first heard about this idea,
00:03:26.640 | I had such a cognitive bias against in-context learning of this sort being
00:03:30.920 | successful that I assumed what they were trying to say to us is that they had
00:03:35.280 | trained that token in a task-specific way to
00:03:39.560 | do summarization and then just given it a colorful name.
00:03:43.120 | But no, they really meant it.
00:03:45.040 | They simply prompt the model with this token,
00:03:47.360 | and look at what comes out.
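
To make that concrete, here is a minimal sketch of a TL;DR-style zero-shot summarization prompt. It uses the Hugging Face transformers library and the public gpt2 checkpoint as stand-ins (an assumption for illustration; the original experiments used OpenAI's own model and decoding setup, and the article text and generation settings here are invented):

```python
# Minimal sketch of zero-shot summarization via a "TL;DR:" prompt.
# Assumes the Hugging Face transformers library and its public "gpt2" checkpoint;
# the article text and decoding settings are illustrative, not the paper's.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

article = (
    "The city council voted on Tuesday to expand the bike lane network, "
    "citing a sharp rise in commuter cycling over the past two years."
)
prompt = article + "\nTL;DR:"

# The frozen model simply continues the prompt; the continuation is read off
# as the summary. No parameters are updated anywhere.
output = generator(prompt, max_new_tokens=100, do_sample=False)
summary = output[0]["generated_text"][len(prompt):].strip()
print(summary)
```

The only task specification in this setup is the literal string "TL;DR:" appended to the input; everything else is ordinary left-to-right generation from a frozen model.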
00:03:49.680 | For translation, they say,
00:03:51.920 | "We test whether GPT-2 has begun to
00:03:54.040 | learn how to translate from one language to another.
00:03:56.800 | In order to help it infer that this is the desired task,
00:03:59.720 | we condition the language model on a context of
00:04:02.040 | example pairs of the format English sentence equals French sentence.
00:04:06.200 | Then after a final prompt of English sentence equals,
00:04:09.240 | we sample from the model with greedy decoding and
00:04:11.680 | use the first generated sentence as the translation."
00:04:14.560 | Incredible, and what you see emerging there,
00:04:17.080 | is this idea of demonstrations,
00:04:19.120 | including in the prompt some examples of the behavior that you
00:04:21.960 | want as a way of coaxing the model to do what you would like it to do.
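
A rough sketch of that demonstration-based prompt might look like the following. Again, the transformers library, the public gpt2 checkpoint, and the example pairs are illustrative assumptions; the point is only the prompt format: demonstration pairs in an "english sentence = french sentence" layout, a final source sentence followed by "=", greedy decoding, and the first generated sentence taken as the translation.

```python
# Minimal sketch of few-shot translation via in-context demonstrations,
# following the "english sentence = french sentence" prompt format described above.
# Library, checkpoint, and example pairs are illustrative assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

demonstrations = [
    ("good morning", "bonjour"),
    ("thank you very much", "merci beaucoup"),
    ("where is the train station?", "où est la gare ?"),
]

# Concatenate the demonstration pairs, then end with the sentence to translate.
context = "\n".join(f"{en} = {fr}" for en, fr in demonstrations)
prompt = context + "\nI would like a coffee ="

# Greedy decoding; take the first generated line as the translation.
output = generator(prompt, max_new_tokens=40, do_sample=False)
continuation = output[0]["generated_text"][len(prompt):]
translation = continuation.strip().split("\n")[0]
print(translation)
```

The same construction, with question-answer pairs in place of translation pairs, is the seeding strategy described next for QA.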
00:04:26.000 | Here's a similar example.
00:04:27.680 | They say, "Similar to translation,
00:04:29.360 | the context of the language model is seeded with example question-answer pairs,
00:04:33.480 | which helps the model infer the short answer style of the dataset."
00:04:37.240 | That's for QA, and again,
00:04:38.800 | they started to see that demonstrations could help
00:04:41.400 | the model see what the implicit task instruction was.
00:04:45.360 | In the paper, they also
00:04:47.360 | evaluate a bunch of other things:
00:04:48.880 | text completion, Winograd schemas,
00:04:51.440 | and reading comprehension, and maybe others.
00:04:53.800 | It's a very thorough exploration,
00:04:55.840 | very open about the benefits and limitations of the methods,
00:04:59.640 | a very impressive and creative paper.
00:05:03.120 | That was the beginning of the idea in terms of research.
00:05:08.880 | The cultural moment certainly arrives with the GPT-3 paper,
00:05:13.480 | Brown et al. 2020,
00:05:15.280 | which is also impressive in its own ways.
00:05:17.560 | Here I'm just going to quote from the abstract and we
00:05:19.560 | can linger a bit over what it says.
00:05:21.920 | They start, "We show that scaling up language models
00:05:24.960 | greatly improves task agnostic few-shot performance,
00:05:28.320 | sometimes even reaching competitiveness with
00:05:30.920 | prior state-of-the-art fine-tuning approaches."
00:05:33.840 | We could quibble with whether or not they actually saw
00:05:36.440 | competitiveness in that sense,
00:05:38.160 | but it is absolutely true that they got
00:05:40.720 | very impressive behaviors out of their model in
00:05:43.160 | this task agnostic few-shot setting.
00:05:46.280 | "Specifically, we train GPT-3,
00:05:49.440 | an autoregressive language model with 175 billion parameters,
00:05:53.480 | 10x more than any previous non-sparse language model,
00:05:57.000 | and test its performance in the few-shot setting."
00:05:59.520 | There are two things I really like about this part.
00:06:01.720 | First, 175 billion parameters is
00:06:04.920 | indeed incredibly ambitious and impressive,
00:06:07.520 | even today, to say nothing of back in 2020.
00:06:10.840 | I also really love that they mentioned non-sparse language model,
00:06:15.160 | a nod to those N-gram-based models that I mentioned before,
00:06:18.480 | which were often truly massive.
00:06:21.200 | "For all tasks, GPT-3 is applied without
00:06:24.520 | any gradient updates or fine-tuning,
00:06:26.740 | with tasks and few-shot demonstrations
00:06:28.960 | specified purely via text interaction with the model."
00:06:32.120 | That's nice. You might think in
00:06:34.560 | retrospect that they're repeating themselves here.
00:06:37.440 | They've already established that these are going to be frozen models,
00:06:40.000 | but I think it's necessary for them to do
00:06:42.320 | that because this was such an unfamiliar idea.
00:06:44.640 | I can imagine, again,
00:06:46.240 | being a reader of this paper and assuming that they can't
00:06:49.160 | really mean they're just using frozen models for all these tasks.
00:06:52.080 | Surely, there is some fine-tuning somewhere,
00:06:54.040 | and so they're emphasizing that in fact,
00:06:56.320 | the model is entirely frozen.
00:06:59.000 | "GPT-3 achieves strong performance on many NLP datasets,
00:07:03.200 | including translation, question answering,
00:07:05.440 | and cloze tasks, as well as several tasks that
00:07:08.320 | require on-the-fly reasoning or domain adaptation,
00:07:11.280 | such as unscrambling words,
00:07:13.000 | using a novel word in a sentence,
00:07:14.680 | or performing three-digit arithmetic."
00:07:16.720 | I love this. A real diversity of tasks,
00:07:19.160 | and what I think you can see them doing is really trying to
00:07:22.080 | push the limits of what would be possible in this mode.
00:07:25.840 | "At the same time, we also identify some datasets where
00:07:29.040 | GPT-3's few-shot learning still struggles,
00:07:32.000 | as well as some datasets where GPT-3
00:07:34.040 | faces methodological issues related to training on large web corpora."
00:07:38.360 | I also love this sentence.
00:07:39.760 | It's again, very open about what they
00:07:41.560 | achieved and where the limitations are.
00:07:43.560 | They're acknowledging that they found
00:07:45.080 | some tasks that are still hard for the model,
00:07:47.240 | and they also acknowledge in the paper that they had
00:07:49.680 | some minor slip-ups where they intended to make sure they
00:07:53.280 | hadn't trained on data that was relevant for
00:07:55.800 | the test task that they were performing,
00:07:58.000 | and in fact, they had not quite gotten that right.
00:08:00.760 | They're being very open about that and exploring
00:08:04.000 | how hard it is to get that right at the scale that they're operating at.
00:08:07.880 | Just like the GPT-2 paper,
00:08:10.280 | a wonderfully open and thorough exploration of the ideas.