
Stanford XCS224U: NLU | In-context Learning, Part 3: Current Moment | Spring 2023



00:00:00.000 | Welcome back everyone.
00:00:06.080 | This is part three in our series on in-context learning.
00:00:08.880 | I've called this part the current moment.
00:00:10.720 | That is either foolish or foolhardy or both.
00:00:13.760 | The current moment is surely going to change very fast as the field changes.
00:00:17.840 | I think I can say that the lessons here will be
00:00:20.480 | useful no matter what direction the field takes next.
00:00:23.720 | As always, I want to start with data.
00:00:26.360 | Data used for self-supervision.
00:00:28.000 | This is an incredibly important ingredient when it comes to
00:00:30.480 | understanding the behaviors of our large language models.
00:00:34.520 | This is a slide that I used in a previous screencast,
00:00:37.200 | but I augmented it with the Colossal Clean Crawled Corpus, C4.
00:00:41.240 | This is a dataset that was created as part of the T5 modeling effort,
00:00:45.400 | and it was audited by Dodge et al. 2021.
00:00:49.120 | Interesting side note,
00:00:51.060 | the Washington Post did an article that is
00:00:52.880 | essentially about the dataset and the auditing work that Dodge et al did.
00:00:56.680 | They called that article "Inside the secret list of
00:00:59.320 | websites that make AI like ChatGPT sound smart."
00:01:02.920 | I'm not sure secret is appropriate here because it seems like
00:01:05.920 | everyone is being pretty open about what is in C4.
00:01:09.040 | But nonetheless, the article is very useful in terms of helping
00:01:12.640 | people like us audit what was in datasets like that.
00:01:16.520 | Undoubtedly, the data that are used for
00:01:20.040 | unsupervised pre-training are an incredibly important ingredient when it comes to
00:01:23.760 | understanding what our models can do and where they're limited.
00:01:27.760 | But as I mentioned at the end of the previous screencast,
00:01:31.440 | this is no longer the only ingredient.
00:01:33.440 | We have left the era when all of
00:01:36.200 | the language model pre-training was
00:01:37.960 | simply unsupervised language model pre-training.
00:01:40.740 | We have now entered into the era of instruct fine-tuning.
00:01:45.100 | Unfortunately, we know much less about what is happening with instruct fine-tuning.
00:01:49.760 | We don't really know what the large industrial labs are
00:01:53.400 | doing in terms of data and protocols here.
00:01:56.460 | We can infer that they are paying lots of people to generate instruct data.
00:02:01.440 | That means that very often these people are doing quite sophisticated things.
00:02:05.300 | For example, I think people might be prompted with a text that says,
00:02:09.100 | write a certain Python program,
00:02:11.200 | and then a human actually writes that Python program.
00:02:14.320 | That's just one instance of many domains and areas of expertise where
00:02:19.120 | they have recruited people to exemplify the desired behavior.
00:02:22.860 | Again, a reminder that the really sophisticated things that we're
00:02:26.600 | seeing from language models these days are not
00:02:29.640 | emerging in some magical way from unsupervised pre-training,
00:02:33.300 | but rather emerging very directly from standard,
00:02:36.880 | good old-fashioned supervision.
00:02:39.960 | I think we can also infer that these large industrial labs are using
00:02:43.840 | their own models to generate examples and adjudicate between examples.
00:02:48.000 | In fact, we're going to review a method along those lines,
00:02:50.800 | self-instruct, in just a second.
00:02:53.900 | If you would like to get a feel for what instruct fine-tuning is like,
00:02:58.120 | I would encourage you to check out the Stanford Human Preferences dataset.
00:03:01.700 | This is a newly released
00:03:03.420 | instruct fine-tuning dataset that was derived from Reddit posts.
00:03:08.580 | You could use that, maybe using subparts of it or
00:03:11.500 | different fine-tuning protocols, to get a feel for
00:03:14.060 | how instruct data affects model behaviors,
00:03:17.980 | and that could be quite illuminating.
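If it helps to make that concrete, here is a minimal sketch of pulling SHP into (prompt, preferred response) pairs that you could fine-tune on. The Hugging Face Hub identifier and the field names used below are assumptions about the release, so check the dataset card before relying on them.

```python
from datasets import load_dataset

# Sketch: turn SHP preference data into (prompt, preferred response) pairs for
# instruct-style fine-tuning experiments. The dataset identifier and the field
# names ("history", "human_ref_A", "human_ref_B", "labels") are assumptions
# about the release; verify them against the dataset card.
shp = load_dataset("stanfordnlp/SHP", split="train")

def to_instruct_pair(example):
    # labels == 1 is assumed to mean that response A was the preferred one.
    preferred = example["human_ref_A"] if example["labels"] == 1 else example["human_ref_B"]
    return {"prompt": example["history"], "response": preferred}

pairs = shp.map(to_instruct_pair)
print(pairs[0]["prompt"][:200], "->", pairs[0]["response"][:200])
```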
00:03:20.460 | I mentioned self-instruct before.
00:03:22.640 | I think this is a powerful method that points to lots of
00:03:25.260 | new ways in which we could use models to make models better.
00:03:29.020 | For self-instruct, we begin from
00:03:31.440 | 175 tasks that were written by humans.
00:03:35.280 | Those go into a task pool,
00:03:37.460 | and then we have a language model generate
00:03:39.680 | some new instructions via in-context learning.
00:03:43.120 | The generated instruction is then fed back into
00:03:46.120 | that same language model with a new kind of prompt that helps the model
00:03:49.620 | decide whether the instruction is
00:03:51.880 | a classification task or some other kind of task.
00:03:54.980 | Depending on the generated response at step 2,
00:03:57.720 | we feed the generated output into one or another of these two prompts down here,
00:04:02.900 | and that step gives us new input-output pairs that we can use
00:04:08.060 | for subsequent supervised language model pre-training.
00:04:12.600 | There's some filtering there for
00:04:14.260 | quality and to make sure the dataset is diverse,
00:04:16.660 | but then those generated instructions go back into the task pool and can
00:04:20.660 | participate in parts of these prompts to generate more data.
00:04:25.500 | In that way, we can use a language model to bootstrap a new dataset,
00:04:30.820 | and then we can update that very same language model with
00:04:34.260 | the new dataset in the hopes that that will lead it to
00:04:36.660 | have more and more diverse abilities.
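Concretely, the bootstrapping loop might look something like the following minimal sketch. The `generate(prompt)` function is a stand-in for a call to whatever language model you are using, and the prompt texts only paraphrase the spirit of the paper's templates rather than quoting them.

```python
import random

def make_instruction_prompt(demos):
    # Step 1 prompt sketch: list numbered demonstration tasks, then ask the
    # model to continue the list with a new instruction.
    lines = ["Come up with a series of tasks:"]
    lines += [f"Task {i + 1}: {d}" for i, d in enumerate(demos)]
    lines.append(f"Task {len(demos) + 1}:")
    return "\n".join(lines)

def make_classification_prompt(instruction):
    # Step 2 prompt sketch: ask whether the new instruction is a classification task.
    return ("Can the following task be regarded as a classification task "
            f"with finite output labels?\nTask: {instruction}\nAnswer (yes/no):")

def self_instruct(seed_tasks, generate, num_rounds=100):
    """Sketch of the self-instruct bootstrapping loop. `generate(prompt)` is a
    stand-in for the underlying language model call."""
    task_pool = list(seed_tasks)  # begins with the 175 human-written tasks
    new_examples = []
    for _ in range(num_rounds):
        demos = random.sample(task_pool, k=min(8, len(task_pool)))
        instruction = generate(make_instruction_prompt(demos))      # step 1
        is_clf = generate(make_classification_prompt(instruction))  # step 2
        # Step 3: route to an output-first prompt for classification tasks,
        # otherwise to an input-first prompt, to get new input-output pairs.
        if is_clf.strip().lower().startswith("yes"):
            instances = generate(f"Task: {instruction}\nFirst write a class label, "
                                 "then an input that fits it.\n")
        else:
            instances = generate(f"Task: {instruction}\nGenerate an example input, "
                                 "then the corresponding output.\n")
        # Filtering for quality and diversity would happen here before the new
        # instruction rejoins the pool and can appear in later prompts.
        task_pool.append(instruction)
        new_examples.append({"instruction": instruction, "instances": instances})
    return new_examples  # used to fine-tune the same model with standard supervision
```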
00:04:40.100 | That was abstract, so let's walk through how
00:04:42.620 | self-instruct happens at the level of the prompts that they use.
00:04:45.980 | At step 1, we have instruction generation.
00:04:48.300 | This is the prompt. You can see the model is given
00:04:50.740 | eight demonstrations and then asked to generate a new instruction.
00:04:54.740 | The majority of these demonstrations were human-created,
00:04:57.900 | but in subsequent rounds of self-instruct,
00:05:00.300 | some of them were actually model-generated instructions.
00:05:04.340 | At step 2, we have classification task identification.
00:05:07.780 | The generated response from step 1 is fed into this prompt,
00:05:11.100 | and the model learns in context to
00:05:13.420 | predict whether or not it was a classification task.
00:05:16.540 | Depending on the generated response there,
00:05:18.860 | we either feed it into
00:05:20.020 | a classification task prompt or a non-classification task prompt.
00:05:24.620 | The results here give us new input-output pairs
00:05:28.340 | that we can use to augment our self-instruct dataset.
00:05:31.500 | Then as I said, we do subsequent supervised language model
00:05:35.620 | pre-training in the standard way,
00:05:37.580 | or maybe with some other techniques to
00:05:39.700 | update the model that was used for this generation process.
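To give a feel for what that standard supervised update could look like, here is a minimal sketch using a generic causal language modeling setup with the Hugging Face Trainer. The tiny base model, the prompt format, and the single stand-in example are placeholders for illustration, not what the self-instruct authors actually used.

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Placeholder base model; real pipelines use much larger models.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

def format_example(ex):
    # Concatenate instruction, optional input, and output into one training string.
    text = (f"Instruction: {ex['instruction']}\n"
            f"Input: {ex.get('input', '')}\n"
            f"Output: {ex['output']}{tokenizer.eos_token}")
    return tokenizer(text, truncation=True, max_length=512)

generated = [  # stand-in for the self-instruct generated data
    {"instruction": "Classify the sentiment of the sentence.",
     "input": "I loved this movie.", "output": "positive"},
]
train_data = Dataset.from_list(generated).map(format_example)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="instruct-tuned", num_train_epochs=1),
    train_dataset=train_data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```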
00:05:43.220 | Self-instruct was a major mechanism behind Alpaca.
00:05:48.220 | Alpaca was an important recent moment for the field because it
00:05:51.420 | started to show people that we could, via self-instruct methods,
00:05:55.380 | take relatively small models, like a seven-billion-parameter model,
00:05:59.820 | do some instruct fine-tuning,
00:06:01.500 | and get a very capable result as the output.
00:06:04.660 | In more detail, the way Alpaca works is we begin with a LLaMA model.
00:06:09.100 | LLaMA is a class of models that was released recently by Meta AI.
00:06:13.900 | The Alpaca team began from
00:06:15.660 | the 175 tasks that were written by humans for the self-instruct paper.
00:06:20.860 | Then they followed self-instruct with some minor simplifications, using
00:06:25.020 | text-davinci-003 as the engine to create the new input-output pairs.
00:06:29.420 | That gives them a dataset ultimately of 52,000 examples,
00:06:33.700 | and those examples were used to update the LLaMA model to create Alpaca.
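As a rough sketch of the data-generation half of that recipe (not the Alpaca team's actual code): seed demonstrations go into a prompt, text-davinci-003 completes it through the legacy OpenAI completions interface that was available at the time, and the raw completions are later parsed and filtered into instruction/input/output triples. The prompt wording below only paraphrases the intent of the real template.

```python
import random
import openai  # legacy completions interface, as it existed in spring 2023

GENERATION_HEADER = (
    "You are asked to come up with a set of diverse task instructions, "
    "each with an input (which may be empty) and an output.\n"
)  # paraphrase of the generation prompt's intent, not the exact text

def generate_raw_batches(seed_tasks, n_batches=10, demos_per_prompt=3):
    """Prompt text-davinci-003 with a few seed demonstrations per call and
    collect the raw completions; parsing and filtering into clean
    instruction/input/output triples is omitted here."""
    raw = []
    for _ in range(n_batches):
        demos = random.sample(seed_tasks, k=min(demos_per_prompt, len(seed_tasks)))
        prompt = GENERATION_HEADER + "\n".join(
            f"{j + 1}. {d}" for j, d in enumerate(demos)) + f"\n{len(demos) + 1}."
        response = openai.Completion.create(
            model="text-davinci-003", prompt=prompt,
            max_tokens=1024, temperature=1.0)
        raw.append(response.choices[0].text)
    return raw  # Alpaca's full pipeline yields roughly 52,000 cleaned examples
```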
00:06:38.540 | The observation is that the results of
00:06:40.740 | this relatively small-scale effort to update
00:06:46.620 | the LLaMA model are actually quite powerful in terms of
00:06:50.780 | imbuing Alpaca with new instruction-following behaviors.
00:06:50.780 | Again, there's a major lesson there about technology,
00:06:54.340 | and I think this is an exciting new direction for the field as we think about
00:06:57.580 | making these relatively small models ever more performant.
00:07:01.380 | There is also a lesson for you about what's going to be
00:07:03.940 | effective for in-context learning because obviously,
00:07:06.620 | to the extent that you can tune
00:07:08.980 | your own prompts to align with the instruction fine-tuning data
00:07:12.300 | that models like Alpaca have seen,
00:07:14.340 | you will be more successful,
00:07:15.780 | and that lesson generalizes to all of these large language models.
00:07:19.180 | For some, we have visibility into the instruct fine-tuning data as with Alpaca,
00:07:23.660 | but for the largest ones, we don't.
00:07:25.540 | People have to organically discover which prompting techniques work,
00:07:29.980 | which is really a process of uncovering, I believe,
00:07:32.580 | what their instruct fine-tuning phase was like at this point.
00:07:36.580 | Alpaca, as I said,
00:07:38.660 | was exciting because it bucked the trend of model sizes going up, up, up.
00:07:43.460 | This is a slide that I used in the intro lecture for the course.
00:07:46.860 | We got all the way up to PaLM at 540 billion parameters.
00:07:50.380 | It may be that GPT-4 is substantially larger even than that.
00:07:54.860 | As a result of this instruct fine-tuning,
00:07:58.540 | we're starting to see that model sizes might come down and nonetheless be very performant.
00:08:04.660 | That is incredibly exciting because it's going to happen.
00:08:07.980 | There are lots of incentives, intellectual, technological,
00:08:10.820 | financial, for us to find a way to have smaller models be performant.
00:08:14.940 | I think that will be an important step toward actually truly democratizing
00:08:19.540 | access to large language models and the capabilities that they can enable.