Stanford XCS224U: Natural Language Understanding | In-context Learning, Pt 1: Origins | Spring 2023
Chapters
0:00
1:22 Early precedents
2:45 Beginnings: Radford et al. 2019 (GPT-2)
5:09 Cultural moment: Brown et al. 2020 (GPT-3)
This is the first screencast in our series on in-context learning. This series is a kind of companion to the one that we did on information retrieval. The two series come together to help you with homework two and bake-off two, which is focused on few-shot, open-domain question answering with frozen retrievers and frozen large language models. I thought we would just reflect a bit on the origins of this strange and exciting and chaotic moment for the field. All credit to the Chomsky bot for bringing us to this moment.
The Chomsky bot is a very simple pattern-based language model; it produces prose that is roughly in the style of the political philosopher and sometime linguist Noam Chomsky. The prose delights and maybe informs us, and the underlying mechanisms are very simple. That is worth keeping in mind when we ask what all of these large language models might be doing even in the present day.
I think when we think about precedents for in-context learning, it is worth mentioning that in the pre-deep-learning era we already had very large language models. For example, Brants et al. 2007 used a 300 billion parameter n-gram language model for machine translation. That is a very large and very powerful mechanism with a different character from the large language models of today. But it is nonetheless worth noting that such models played an important role in a lot of different fields way back when.
I think for in-context learning as we know it now, the earliest paper as far as I know is the decaNLP paper, which casts a wide range of tasks as question answering, with task instructions that are natural language questions. That does seem like the origin of the idea that with free-form natural language instructions we could essentially end up with artifacts that could do any task we pose to them.
Then it's worth noting also that in the GPT-2 paper, Radford et al. 2019, you can find buried in there some tentative proposals to do in-context learning with frozen language models. Let me just show you some snippets from this paper. They suggest that language models can perform downstream tasks in a zero-shot setting, "without any parameter or architecture modification." There you see this idea of using frozen models, prompting them, and seeing if they will produce interesting behaviors. For summarization, they write that "we add the text TL;DR: after the article and generate 100 tokens."
I remember when I first heard about this idea, I had such a cognitive bias against in-context learning of this sort being successful that I assumed what they were trying to say to us is that they had trained some dedicated mechanism to do summarization and then just given it a colorful name. But that is not what happens: they simply prompt the frozen model with this token.
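To make that concrete, here is a minimal sketch of the TL;DR: trick. It assumes the Hugging Face transformers library and the small public gpt2 checkpoint as stand-ins; it is not the original OpenAI code.

    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    article = "Long news article text goes here ..."

    # Append the induction token after the article. The model stays
    # frozen; the prompt alone is meant to elicit summarization.
    # (A real article may need truncating to GPT-2's 1024-token context.)
    prompt = article + "\nTL;DR:"
    inputs = tokenizer(prompt, return_tensors="pt")

    # The GPT-2 paper generates 100 tokens with top-k sampling (k = 2).
    output = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        top_k=2,
        pad_token_id=tokenizer.eos_token_id,
    )
    summary = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:])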
They try something similar for machine translation, testing whether the model can learn how to translate from one language to another.
"In order to help it infer that this is the desired task, we condition the language model on a context of example pairs of the format english sentence = french sentence. Then after a final prompt of english sentence =, we sample from the model with greedy decoding and use the first generated sentence as the translation."
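A sketch of that recipe, reusing the frozen model and tokenizer from the snippet above; the demonstration pairs here are invented for illustration:

    # Context of example pairs in the "english sentence = french sentence"
    # format, followed by a final prompt that ends with "=".
    pairs = [
        ("the house is blue", "la maison est bleue"),
        ("I like coffee", "j'aime le cafe"),
    ]
    source = "where is the station?"

    context = "\n".join(f"{en} = {fr}" for en, fr in pairs)
    prompt = f"{context}\n{source} ="

    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(
        **inputs,
        max_new_tokens=40,
        do_sample=False,  # greedy decoding, as in the paper
        pad_token_id=tokenizer.eos_token_id,
    )
    generated = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:])
    # "use the first generated sentence as the translation"
    translation = generated.strip().split("\n")[0]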
That is few-shot prompting as we now know it: including in the prompt some examples of the behavior that you want as a way of coaxing the model to do what you would like it to do.
They do the same thing for question answering: "the context of the language model is seeded with example question-answer pairs, which helps the model infer the short answer style of the dataset." So already in this paper, they started to see that demonstrations could help the model see what the implicit task instruction was.
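The same seeding pattern for question answering might look like this; the demonstration pairs and the Q:/A: formatting are invented for illustration:

    qa_pairs = [
        ("What is the capital of France?", "Paris"),
        ("Who wrote Hamlet?", "William Shakespeare"),
    ]
    question = "What is the largest planet in the solar system?"

    # Seed the context with question-answer demonstrations; the model is
    # expected to infer both the task and the short-answer style, then
    # the prompt is decoded exactly as in the snippets above.
    demos = "\n".join(f"Q: {q}\nA: {a}" for q, a in qa_pairs)
    prompt = f"{demos}\nQ: {question}\nA:"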
It's a very impressive and thorough exploration, very open about the benefits and limitations of the methods. That was the beginning of the idea in terms of research.
The cultural moment certainly arrives with the GPT-3 paper, Brown et al. 2020. Here I'm just going to quote from the abstract. They start, "We show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches."
We could quibble with whether or not they actually reached the level of fine-tuned models, but they certainly saw very impressive behaviors out of their model in the few-shot setting. The abstract continues, "Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting."
There are a couple of things I really like about this part. In particular, I really love that they mention "non-sparse language model," a nod to those n-gram based models that I mentioned before. They continue, "For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model."
It might seem in retrospect that they're repeating themselves here. They've already established that these are going to be frozen models, but I think they needed to stress that because this was such an unfamiliar idea. I remember being a reader of this paper and assuming that they can't really mean they're just using frozen models for all these tasks.
They continue, "GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation." Throughout the paper, what I think you can see them doing is really trying to push the limits of what would be possible in this mode.
Finally, they write, "At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora." So they are open about the fact that there are some tasks that are still hard for the model, and they also acknowledge in the paper that they had some minor slip-ups where they intended to make sure they were not training on the data they would later test on, and in fact, they had not quite gotten that right. They're being very open about that and exploring how hard it is to get that right at the scale that they're operating at. Overall, it is a wonderfully open and thorough exploration of the ideas.