Stanford XCS224U: NLU | In-Context Learning, Part 3: The Current Moment | Spring 2023
This is part three in our series on in-context learning. The current moment is sure to change very fast as the field evolves, but I think I can say that the lessons here will be useful no matter what direction the field takes next. Let's begin with pre-training data: this is an incredibly important ingredient when it comes to understanding the behaviors of our large language models.
This is a slide that I used in a previous screencast, but I have augmented it with the Colossal Clean Crawled Corpus, C4. This is a dataset that was created as part of the T5 modeling effort, and it was recently the subject of a Washington Post article, essentially about the dataset and the auditing work that Dodge et al. did. They called that article "Inside the secret list of websites that make AI like ChatGPT sound smart." I'm not sure "secret" is appropriate here, because it seems like everyone is being pretty open about what is in C4, but nonetheless, the article is very useful in terms of helping people like us audit what was in datasets like that.
These corpora used for unsupervised pre-training are an incredibly important ingredient when it comes to understanding what our models can do and where they're limited. But as I mentioned at the end of the previous screencast, today's top models are no longer trained with simply unsupervised language model pre-training. We have now entered into the era of instruct fine-tuning.
Unfortunately, we know much less about what is happening with instruct fine-tuning. We don't really know what the large industrial labs are doing at this phase. We can infer that they are paying lots of people to generate instruct data, and that very often these people are doing quite sophisticated things.
For example, I think people might be prompted with a text that says, "Write a Python program that performs such-and-such a task," and then a human actually writes that Python program. That's just one instance of the many domains and areas of expertise where they have recruited people to exemplify the desired behavior.
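To make that concrete, here is a minimal sketch of what a single human-written instruct example might look like. The record structure and field names are my own illustration, not any lab's actual format:

# A hypothetical instruct fine-tuning record: a human expert reads the
# instruction and writes the output (here, a working Python program).
example = {
    "instruction": "Write a Python function that reverses a string.",
    "output": "def reverse_string(s: str) -> str:\n    return s[::-1]",
}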
Again, a reminder that the really sophisticated things that we're seeing from language models these days are not emerging in some magical way from unsupervised pre-training, but rather emerging very directly from standard, good old-fashioned supervised learning. I think we can also infer that these large industrial labs are using their own models to generate examples and to adjudicate between examples.
In fact, we're going to review a method along those lines, self-instruct, in just a moment. If you would like to get a feel for what instruct fine-tuning is like, I would encourage you to check out the Stanford Human Preferences dataset, an instruct fine-tuning dataset that was derived from Reddit posts.
You could use that, maybe using subparts of it or different protocols for fine-tuning, to get a feel for what this phase of training is like; a minimal loading sketch follows below.
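If you want to poke around in that dataset, here is a minimal loading sketch using the Hugging Face datasets library. I am assuming the Hub identifier "stanfordnlp/SHP"; check the dataset page for the exact name, splits, and fields:

from datasets import load_dataset

# Load the Stanford Human Preferences dataset from the Hugging Face Hub.
# The identifier below is an assumption; verify it on the Hub.
shp = load_dataset("stanfordnlp/SHP", split="train")

# Each record pairs a Reddit post with two candidate replies and a
# preference label, which you can repurpose for fine-tuning experiments.
print(shp[0])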
Let's turn now to the self-instruct method. I think this is a powerful method that points to lots of new ways in which we could use models to make models better. At a high level, the process begins with a language model generating some new instructions via in-context learning.
Each generated instruction is then fed back into that same language model with a new kind of prompt that helps the model decide whether the instruction denotes a classification task or some other kind of task. Depending on the generated response at step 2, we feed the generated output into one or another of these two prompts, and that step gives us new input-output pairs that we can use for subsequent supervised language model fine-tuning.
There is some filtering to ensure quality and to make sure the dataset stays diverse (a rough sketch of the diversity check follows below), but then those generated instructions go back into the task pool and can participate in parts of these prompts to generate more data.
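As one hedged illustration of that diversity filter, here is a way to drop near-duplicate instructions. My recollection is that the self-instruct work uses ROUGE-L overlap for this step; the sketch below substitutes Python's standard-library difflib so it stays dependency-free:

from difflib import SequenceMatcher

def is_diverse(candidate, task_pool, threshold=0.7):
    # Keep a generated instruction only if it is not too similar to any
    # instruction already in the task pool.
    return all(
        SequenceMatcher(None, candidate.lower(), task.lower()).ratio() < threshold
        for task in task_pool
    )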
In that way, we can use a language model to bootstrap a new dataset, and then we can update that very same language model with the new dataset, in the hopes that that will lead it to better instruction-following behavior. In more detail, the essence of self-instruct happens at the level of the prompts that they use.
This is the step 1 prompt: you can see that the model is given eight demonstrations and is then asked to generate a new instruction. The majority of these demonstrations were human-created, but as the process unfolds, some of them are actually model-generated instructions.
At step 2, we have classification task identification. The generated instruction from step 1 is fed into this prompt, and the model is asked to predict whether or not it denotes a classification task. At step 3, depending on that prediction, the instruction is fed into either a classification task prompt or a non-classification task prompt. The results give us new input-output pairs that we can use to augment our self-instruct dataset.
Then, as I said, we do subsequent supervised language model fine-tuning, using the new dataset to update the model that was used for this generation process.
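Here is a compact sketch of that whole loop in Python. The prompt wordings and the generate() callable are placeholders of mine, not the actual self-instruct templates or API:

import random

def self_instruct_round(generate, task_pool, num_demos=8):
    # Step 1: show the model demonstrations and ask for a new instruction.
    demos = random.sample(task_pool, num_demos)
    prompt = "Come up with a new task.\n" + "\n".join(
        f"Task {i + 1}: {d}" for i, d in enumerate(demos)
    )
    instruction = generate(prompt)

    # Step 2: ask the same model whether this is a classification task.
    answer = generate(
        f"Is the following task a classification task? Yes or No.\n{instruction}"
    )
    is_classification = answer.strip().lower().startswith("yes")

    # Step 3: route to the matching prompt to elicit input-output pairs.
    if is_classification:
        pair_prompt = (
            f"Task: {instruction}\nList the possible output labels, "
            "then write an example input for each label."
        )
    else:
        pair_prompt = (
            f"Task: {instruction}\nWrite an example input and the correct output."
        )
    pairs = generate(pair_prompt)

    # Filtering (e.g., the diversity check above) would gate what survives;
    # surviving instructions rejoin the task pool for later rounds.
    task_pool.append(instruction)
    return instruction, pairs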
Self-instruct was a major mechanism behind Alpaca. Alpaca was an important recent moment for the field, because it started to show people that, via self-instruct methods, we could take relatively small models, like a 7 billion parameter model, and turn them into capable instruction-following models.
In more detail, the way Alpaca works is that we begin with a LLaMA model. LLaMA is a class of models that was released recently by Meta AI. The team started from the 175 seed tasks that were written by humans for the self-instruct paper. They then followed self-instruct, with some minor simplifications, using text-davinci-003 as the engine to create the new input-output pairs. That gave them a dataset of 52,000 examples, and those examples were used to update the LLaMA model to create Alpaca. It turns out that those 52,000 examples are actually quite powerful in terms of imbuing Alpaca with new instruction-following behaviors.
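To give a feel for what that update step consumes, here is a sketch of formatting one generated example into a training string. The template is paraphrased from Alpaca's public release, so treat the exact wording as an assumption:

# Alpaca-style formatting (approximate): each (instruction, output) pair
# becomes one training string for supervised fine-tuning.
TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def format_example(instruction, output):
    # The model is trained to continue the prompt with the reference output.
    return TEMPLATE.format(instruction=instruction) + output

print(format_example("Name three primary colors.", "Red, blue, and yellow."))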
Again, there is a major lesson there about the technology, and I think this is an exciting new direction for the field as we think about making these relatively small models ever more performant. There is also a lesson for you about what is going to be effective for in-context learning, because, obviously, it is going to work best to design your own prompts to align with the instruction fine-tuning data, and that lesson generalizes to all of these large language models.
For some, we have visibility into the instruct fine-tuning data, as with Alpaca; for others, we have essentially no visibility. People have to organically discover which prompting techniques work, which is really a process of uncovering, I believe, what the instruct fine-tuning phase for those models was like.
Finally, Alpaca was exciting because it bucked the trend of model sizes going up, up, up. This is a slide that I used in the intro lecture for the course: we got all the way up to PaLM at 540 billion parameters, and it may be that GPT-4 is substantially larger even than that. But with results like Alpaca, we are starting to see that model sizes might come down and nonetheless be very performant. That is incredibly exciting, and I believe it is going to happen: there are lots of incentives, intellectual, technological, and financial, for us to find a way to have smaller models be performant. I think that will be an important step toward truly democratizing access to large language models and the capabilities that they can enable.