Stanford XCS224U: NLU | Intro & Evolution of Natural Language Understanding, Pt. 1 | Spring 2023
Uh, it is a weird and wonderful and maybe worrying moment to be doing natural language understanding. 00:00:16.440 |
My goal for today is just to kind of immerse us in this moment and think about how we got here and what it's like to be doing research now. 00:00:25.440 |
And I think that'll set us up well to think about what we're gonna do in the course and how that's gonna set you up to participate in this moment in AI, 00:00:35.040 |
uh, in many ways, in whichever ways you choose. 00:00:38.240 |
And it's an especially impactful moment to be doing that. 00:00:43.200 |
And I feel like we can get you all to the point where you are doing meaningful things that contribute to this ongoing moment in ways that are gonna be exciting and impactful. 00:00:57.560 |
This is always a moment of reflection for me. 00:01:00.240 |
I started teaching this course in 2012, um, which I guess is ages ago now. 00:01:06.480 |
It feels recent in my lived experience, but it does feel like ages ago in terms of the content. 00:01:11.480 |
In 2012, on the first day, I had a slide that looked like this. 00:01:15.240 |
I said, "It was an exciting time to be doing natural language understanding research." 00:01:20.440 |
I noted that there was a resurgence of interest in the area after a long period of people mainly focused on syntax and things like that. 00:01:28.760 |
But there was a widespread perception that NLU was poised for a breakthrough and to have huge impact, including on the business side, 00:01:37.120 |
and that there was a white-hot job market for Stanford grads. 00:01:40.200 |
A lot of this language is coming from the fact that we were in this moment when Siri had just launched, 00:01:48.440 |
and we had all of these in-home devices and all the tech giants kind of competing on what was emerging as the field of natural language understanding. 00:01:59.120 |
I did feel like I should update that in 2022 by saying this is the most exciting moment ever as opposed to it just being an exciting time. 00:02:09.280 |
We were again in this feeling that we had experienced a resurgence of interest in the area. 00:02:18.560 |
The industry interest at this point makes the stuff from 2012 look like small potatoes. 00:02:31.940 |
And the core things about NLU remain far from solved. 00:02:40.440 |
It has felt like there has been an acceleration, 00:02:42.960 |
and some problems that we used to focus on feel kind of like they're less pressing. 00:02:48.480 |
I won't say solved, but they feel like we've made a lot of progress on them as a result of models getting better. 00:02:54.480 |
But all that means for me is that there are more exciting things in the future, and that we can tackle even more ambitious things. 00:03:01.640 |
And you'll see that I've tried to overhaul the course to be ever more ambitious about the kind of problems that we might take on. 00:03:09.440 |
But we do kind of live in a golden age for all of this stuff. 00:03:14.720 |
I'm not sure what I would have predicted, to say nothing of back in 2012, 00:03:18.000 |
that we would have these incredible models like DALL-E 2, 00:03:21.120 |
which can take you from text into these incredible images. 00:03:24.720 |
Language models, which will more or less be the star of the quarter for us. 00:03:29.120 |
But also models that can take you from natural language to code. 00:03:33.080 |
And of course, we are all seeing right now as we speak, 00:03:36.480 |
that the entire industry related to web search is being reshaped around NLU technologies. 00:03:43.600 |
So whereas this felt like a kind of niche area of NLP when we started this course in 2012, 00:03:56.120 |
now all of AI is focused on these questions of natural language understanding. 00:04:06.840 |
Throughout the years, we have used simple examples to highlight the weaknesses of current models. 00:04:12.600 |
And so a classic one for us was simply this question: which US states border no US states? 00:04:19.440 |
The idea here is that it's a simple question, 00:04:22.400 |
but it can be hard for our language technologies because of that negation, the no there. 00:04:28.400 |
In 1980, there was a famous system called CHAT80. 00:04:33.080 |
It was a symbolic system representing the first major phase of research in NLP. 00:04:41.280 |
And CHAT80 was an incredible system in that it could answer questions like, 00:04:45.600 |
which country bordering the Mediterranean borders a country that is 00:04:49.040 |
bordered by a country whose population exceeds the population of India? 00:04:54.760 |
Turkey, at least according to 1980s geography. 00:04:58.920 |
But if you asked CHAT80 a simple question like that one, it would fall down. 00:05:06.680 |
It was an incredibly expressive system, but rigid. 00:05:14.880 |
Things that fell outside of its capacity, it simply could not handle. 00:05:26.960 |
And this was meant to be a kind of revolutionary language technology. 00:05:36.560 |
If you search for which US states border no US states, 00:05:40.120 |
it kind of just gives you a list of the US states. 00:05:45.120 |
It has no capacity to understand the question posed. 00:06:17.560 |
And then it really went off the deep end from there, 00:06:37.200 |
It had a problem that it started listing things, 00:06:39.520 |
but it did say Alaska, Hawaii, and Puerto Rico, 00:06:42.840 |
which is an interestingly more impressive answer 00:06:50.640 |
but it's looking like we're seeing some signal. 00:07:11.400 |
making everything before then sort of pale in comparison. 00:07:17.360 |
you know, one of the new best-in-class models, 00:07:29.600 |
And if you just think about the little history I've given, 00:07:32.840 |
a kind of microcosm of what is happening in the field, 00:07:49.760 |
but these examples multiply, and we can quantify this. 00:07:56.320 |
in which year was Stanford University founded? 00:08:05.800 |
and it gave a fluent and factually correct answer 00:08:13.080 |
which was best-in-class until a few weeks ago, 00:08:38.080 |
the idea is that we should come up with examples 00:08:40.960 |
that will test whether models deeply understand, 00:08:46.080 |
as opposed to relying on simple memorization of statistics and other things, 00:08:50.640 |
and really probe to see whether they understand. 00:08:54.200 |
And Levesque and Winograd's technique for doing this 00:09:09.040 |
Maybe it's a question you've never thought about before, 00:09:11.400 |
but you probably have a pretty consistent answer 00:09:17.560 |
Here, I asked another one of Levesque's questions. 00:09:28.080 |
"There is no rule against it, but it is not common." 00:09:30.960 |
And that seemed like a very good answer to me at the time. 00:09:39.120 |
No, professional baseball players are not allowed 00:09:45.560 |
about the appearance of players' uniforms and caps. 00:09:49.000 |
And any modifications to the caps are not allowed. 00:09:52.520 |
Okay, I thought I was feeling good about this, 00:09:55.200 |
but now I don't even myself know what the answer is. 00:10:01.520 |
We have two confident answers that are contradictory 00:10:05.640 |
across two models that are very closely related. 00:10:09.080 |
It's starting to worry us a little bit, I hope. 00:10:22.040 |
Let me show you the responses I got a bit later. 00:10:31.760 |
whether an agent we were interacting with was human or AI, 00:10:43.960 |
now we're into the mode of trying to figure out 00:10:46.680 |
exactly what kind of agents we're interacting with 00:10:51.920 |
about the kinds of things that we do with them. 00:11:00.760 |
is also supported by what's happening in the field. 00:11:06.880 |
And the headline here is that our benchmarks, 00:11:09.400 |
the tasks, the datasets we use to probe our models 00:11:24.760 |
And along the y-axis, I have a normalized measure 00:11:28.520 |
of distance from what we call human performance. 00:11:34.000 |
Each one of these benchmarks has, in its own particular way, 00:11:37.080 |
set a so-called estimate of human performance. 00:11:48.640 |
This is like digit recognition, famous task in AI. 00:11:53.400 |
and it took about 20 years for us to see a system 00:11:56.960 |
that surpassed human performance in this very loose sense. 00:12:01.800 |
The switchboard corpus, this is going from speech to text. 00:12:05.520 |
It's a very similar story, launched in the '90s, 00:12:13.480 |
ImageNet, this was launched, I believe, in 2009, 00:12:17.120 |
and it took less than 10 years for us to see a system 00:12:23.120 |
And now progress is gonna pick up really fast. 00:12:25.280 |
SQuAD 1.1, the Stanford Question Answering Dataset, 00:12:29.040 |
was launched in 2016, and it took about three years 00:12:51.480 |
in natural language understanding, a multitask benchmark. 00:12:58.040 |
that GLUE would be too difficult for present-day systems. 00:13:10.720 |
The response was SuperGLUE, but it was saturated, 00:13:20.880 |
and I think we should dwell on whether or not 00:13:22.920 |
it's fair to call it that, but even setting that aside, 00:13:26.440 |
this looks like undeniably a story of progress. 00:13:33.640 |
would not even have been able to enter the GLUE benchmark 00:13:37.160 |
to say nothing of achieving scores like this. 00:13:46.400 |
Here's a post from Jason Wei where he evaluated 00:13:48.800 |
our latest and greatest large language models 00:13:56.480 |
this new class of very large language models. 00:13:59.920 |
Jason's observation is that we see emergent abilities 00:14:10.040 |
thought these tasks would stand for a very long time, 00:14:13.280 |
and what we're seeing instead is that one by one, 00:14:17.960 |
and in some cases, performing at the standard 00:14:22.880 |
Again, an incredible story of progress there. 00:14:26.520 |
So I hope that is energizing, maybe a little intimidating, 00:14:31.960 |
but I hope fundamentally energizing for you all. 00:14:43.560 |
Let's get a feel for that, and that'll kind of serve 00:14:49.120 |
Before I do that, though, are there questions or comments, 00:15:02.960 |
- We should reflect, though, maybe as a group 00:15:08.240 |
My question for you, when you say it did well, 00:15:23.480 |
Bard found that rule and gave me that number. 00:15:34.400 |
will offer me links, but the links go nowhere. 00:15:43.840 |
These models are offering us what looks like evidence, 00:15:47.280 |
but a lot of the evidence is just fabricated, 00:15:49.840 |
and this is worse than offering no evidence at all. 00:15:56.240 |
what is the rule about players and their caps? 00:16:18.560 |
Again, first, a little bit of historical context. 00:16:26.200 |
This is more or less the start of the field itself. 00:16:37.840 |
In fact, that was kind of pioneered here at Stanford 00:16:40.800 |
by people who were pioneering the very field of AI. 00:16:44.040 |
And that paradigm of essentially programming these systems 00:16:58.800 |
and then in turn in natural language processing. 00:17:03.520 |
instead of programming systems with all these rules, 00:17:11.240 |
there was still a lot of programming involved 00:17:13.200 |
because we would write a lot of feature functions 00:17:19.360 |
And we would hope that our machine learning systems 00:17:21.360 |
could learn from the output of those feature functions. 00:17:25.880 |
this was the rise of the fully data-driven learning systems. 00:17:29.680 |
And we just hope that some process of optimization 00:17:34.960 |
The next big phase of this was the deep learning revolution. 00:17:42.480 |
Again, Stanford was at the forefront of this to be sure. 00:17:49.660 |
this is kind of not so different from this mode here. 00:17:52.280 |
It's just that we now replace that simple model 00:17:58.280 |
really deep models that have a tremendous capacity 00:18:03.400 |
We started also to see a shift even further away from that kind of feature engineering, 00:18:14.360 |
hoping that the data and the optimization process could do all the work for us. 00:18:17.980 |
Then the next big thing that happened, 00:18:21.560 |
which takes us, I suppose, until about 2018, 00:18:26.120 |
is the era where we have a lot of pre-trained parameters. 00:18:28.360 |
These are pictures of maybe big language models 00:18:42.960 |
and we do some learning on some task-specific data, 00:18:56.440 |
and the latest phase is this mode where we're gonna replace everything 00:18:59.060 |
with maybe one ginormous language model of some kind 00:19:03.500 |
and hope that that thing, that enormous black box, can do everything for us. 00:19:03.500 |
There are real questions about whether that's really the path forward, 00:19:09.900 |
but it certainly feels like the zeitgeist to be sure. 00:19:23.520 |
a more rounded example of what that all means? 00:19:28.600 |
The point for now though is really this shift from here 00:19:32.920 |
where we're mostly learning from scratch for our task. 00:19:44.680 |
Now we start with pre-trained components that give us a leg up on the problem we're trying to solve. 00:19:56.440 |
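To make that earlier pretrain-then-fine-tune mode concrete, here is a minimal sketch, assuming the Hugging Face transformers library and an invented two-example dataset; it illustrates the general pattern, not the exact setup used in the course.

```python
# Minimal sketch of pretrain-then-fine-tune: start from pretrained parameters
# and do a little task-specific learning on labeled examples.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # pretrained encoder plus a fresh task head
)

texts = ["what a wonderful movie", "that was a waste of time"]  # invented examples
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few gradient steps of task-specific learning
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```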
In the earlier eras, there was no talk of releasing model parameters; 00:20:01.520 |
models were just good for the task that their creators had set. 00:20:04.220 |
As we move into this era, and then certainly this one, 00:20:11.680 |
people release artifacts with general-purpose language capabilities, or maybe general-purpose computer vision capabilities, 00:20:17.040 |
that can do more than any previous system could do. 00:20:30.240 |
The architecture behind all of this, certainly beginning in this final phase here, is the transformer. 00:20:35.240 |
Just let me take the temperature of the room. 00:20:36.840 |
How many people have encountered the transformer before? 00:20:45.160 |
but I'm not gonna go through this diagram now 00:20:56.040 |
All I can say for you now is that I expect you 00:20:59.220 |
to go on the following journey, which all of us go on. 00:21:07.480 |
I hope I can get you to the point where you feel, 00:21:09.960 |
oh, this is actually pretty simple components 00:21:12.900 |
that have been combined in a pretty straightforward way. 00:21:17.600 |
The true enlightenment comes from asking, wait a second, 00:21:28.080 |
why components this simple, brought together in this way, have proved so powerful. 00:21:35.820 |
The next key idea is self-supervision, which is kind of latent going all the way back 00:21:38.220 |
to the start of AI, especially as it relates to linguistics, 00:21:48.800 |
just learning from the world in the most general sense. 00:22:03.260 |
The symbol streams are usually language, but they could be language plus sensor readings, 00:22:13.360 |
and the goal is to learn from the distributional patterns that they contain, 00:22:16.720 |
or for many of these models, to assign high probability 00:22:20.140 |
to the attested sequences in whatever data that you pour in. 00:22:24.160 |
For this kind of learning, we don't need to do any labeling. 00:22:27.840 |
All we need to do is have lots and lots of symbol streams. 00:22:35.600 |
When we use these models, we're sampling from them, and that's what we all think of 00:22:37.960 |
when we think of prompting and getting a response back. 00:22:40.240 |
But the underlying mechanism is, at least in part, just this distributional learning. 00:22:46.960 |
And this is really important for why these models 00:22:48.820 |
are so powerful: the symbols do not need to be just language. 00:23:18.280 |
All we need is lots of data in unstructured format. 00:23:22.400 |
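Schematically, and in my own notation rather than any particular paper's, that self-supervised objective for an autoregressive language model over a corpus D of symbol streams is just:

```latex
% Make the attested sequences probable: for every stream x in the corpus D,
% predict each symbol x_t from the symbols that precede it.
\max_{\theta} \; \sum_{x \in D} \sum_{t=1}^{|x|} \log p_{\theta}\!\left(x_t \mid x_{<t}\right)
```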
This really begins in the era of static word representations 00:23:30.560 |
The GloVe team especially were really visionary 00:23:33.340 |
in the sense that they not only released a paper and code, 00:23:46.360 |
but also model artifacts, and people started using them 00:23:50.260 |
as the inputs to recurrent neural networks and other things. 00:23:57.580 |
as an important component to doing really well 00:24:06.220 |
but the really big moment for contextual representations 00:24:14.380 |
I can remember being at the North American ACL meeting 00:24:18.340 |
in New Orleans in 2018 at the best paper session. 00:24:22.420 |
They had not announced which of the best papers 00:24:27.320 |
but we all knew it was gonna be the ELMo paper 00:24:35.500 |
on hard tasks for the field were just mind-blowing, 00:24:38.740 |
the sort of thing that you really only see once 00:24:48.980 |
same thing, I think same best paper award thing. 00:25:14.220 |
and then fast forward a little bit, we get GPT-3, 00:25:34.900 |
and what we started to see is emergent capabilities. 00:25:45.500 |
I think I can lift your spirits a little bit, 00:26:10.720 |
I remember when this came out, I probably laughed. 00:26:17.260 |
because I couldn't imagine that it was actually billions of parameters. 00:26:23.220 |
But now, that's, you know, we take that for granted. 00:26:30.300 |
Then we get GPT-3, reportedly at 175 billion parameters. 00:26:45.780 |
And I guess there are rumors that we have gone upward 00:26:59.200 |
One thing I wanna say is there's a noteworthy pattern 00:27:06.380 |
in this race for very large models. 00:27:09.540 |
We've got like Google, NVIDIA, Meta, and OpenAI, right? 00:27:14.540 |
And that was actually a real cause for concern. 00:27:30.560 |
these other large tech companies kind of caught up. 00:27:33.760 |
But it was still for a while looking like a story 00:27:46.560 |
on Foundation Models, led this incredibly ambitious project 00:27:54.540 |
is that we have a more healthy ecosystem now. 00:28:00.460 |
are both kind of fully open source groups of researchers. 00:28:04.140 |
We've got, well, one academic institution represented. 00:28:07.700 |
This could be a little bit embarrassing for Stanford. 00:28:12.780 |
is that we have lots of startups represented. 00:28:14.960 |
So these are well-funded, but relatively small outfits 00:28:18.500 |
that are producing outstanding language models. 00:28:24.860 |
and then we'll worry less about centralization of power. 00:28:28.540 |
There's plenty of other things to worry about, 00:28:39.020 |
One is this scary rise in model size, 00:28:51.920 |
but there is also a countervailing trend toward models that are in the range of like 10 billion parameters. 00:29:00.660 |
And then here at Stanford, they released the Alpaca model, 00:29:03.460 |
and then Databricks released the Dolly model. 00:29:16.900 |
is that this is relatively small, but so it goes. 00:29:20.140 |
And the point is that a 10 billion parameter model 00:29:23.220 |
is one that could be run on regular old commercial hardware, 00:29:38.100 |
and it won't be long before we've got the ability 00:29:58.020 |
As a result of these models being so powerful, 00:30:07.540 |
that you can get a lot of mileage out of them 00:30:12.340 |
When you prompt one of these very large models, 00:30:14.580 |
you put it in a temporary state by inputting some text, 00:30:18.180 |
and then you generate a sample from the model 00:30:20.260 |
using some technique, and you see what comes out, right? 00:30:24.620 |
If you put in 'better late than', it's probably gonna spit out 'never'. 00:30:28.300 |
If you put in 'Every day, I eat breakfast, lunch,' 00:30:34.420 |
And you might have an intuition that the reason is that the first phrase is very frequent in text, 00:30:40.400 |
so that it could just learn from co-occurrence patterns. 00:30:44.760 |
For the second one, we as humans kind of interpret it as world knowledge, 00:30:52.900 |
but the mechanism is the same as in the first case. 00:30:56.040 |
This was just a bunch of co-occurrence patterns. 00:30:58.240 |
A lot of people described their routines in text, 00:31:04.820 |
and the same thing is happening as you think about things like 'the president of the US is'. 00:31:11.740 |
it might look like it is offering us factual knowledge, 00:31:16.540 |
but it's the same mechanism as for those first two examples. 00:31:19.880 |
It is just learning from the fact that a lot of people 00:31:30.040 |
And so definitely, if you ask a model something like this, 00:31:36.460 |
remember that this is just the aggregate of a lot of data. 00:31:40.220 |
It has no particular wisdom to offer you necessarily 00:31:43.680 |
beyond what was encoded latently in that giant sea of text. 00:32:05.580 |
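Here is a minimal sketch of that prompt-and-sample loop, assuming the Hugging Face transformers library and a small public model (GPT-2) standing in for the much larger models under discussion:

```python
# Prompting is just conditioning on some text and then sampling a continuation.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Every day, I eat breakfast, lunch, and"
inputs = tokenizer(prompt, return_tensors="pt")

# The "temporary state" is the prompt; generation samples one token at a time.
output_ids = model.generate(
    **inputs,
    max_new_tokens=5,
    do_sample=True,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```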
We just imagine like a very factually incorrect corpus. 00:32:12.160 |
how do we inject like truth into like these corpuses? 00:32:25.020 |
but also what would that mean and how would we achieve it? 00:32:28.960 |
And even if we did back off to something like, 00:32:31.580 |
how would we ensure self-consistency for a model? 00:32:37.860 |
even those questions which seem easier to pose 00:32:40.980 |
are incredibly difficult questions in the current moment 00:32:46.540 |
that self-supervision thing that I described, 00:32:49.040 |
and then a little bit of what I'll talk about next. 00:32:51.940 |
But none of the structure that we used to have 00:33:02.160 |
The prompting thing, we take this a step forward, right? 00:33:09.480 |
remember that's that 175 billion parameter monster. 00:33:18.140 |
which was just the notion that for these very large models, you can specify the task in the prompt itself. 00:33:35.400 |
And what you're doing here is, with your context passages and demonstrations, showing the model how 00:33:41.840 |
to find an answer to its question in the context passage, 00:33:52.280 |
and hoping that carries over for the actual target question at the bottom here. 00:34:07.460 |
whether this was a viable path forward for a class project, 00:34:16.940 |
because I never would have guessed that this would work. 00:34:29.900 |
on the basis of this simple in-context learning mechanism, 00:34:33.780 |
transformatively different from anything that we saw before. 00:34:37.140 |
In fact, let me just emphasize this a little bit. 00:34:44.580 |
For those of you who have been in the field a little while, 00:34:48.020 |
just contrast what I described, in-context learning, with how we used to operate. 00:35:01.900 |
Suppose you want to detect nervous anticipation; I chose it because this is a very particular human emotion. 00:35:06.460 |
In the old mode, we would need an entire dedicated model for this, right? 00:35:11.860 |
We would need a big labeled dataset of positive and negative instances of nervous anticipation, 00:35:19.040 |
and we would train a classifier on feature representations of these examples over here, 00:35:30.900 |
In this new mode, few-shot in-context learning, 00:35:37.140 |
"Hey, model, here's an example of nervous anticipation." 00:35:48.220 |
And it learns from all those symbols that you put in 00:36:03.860 |
On the left, I've structured the model around the binary distinction, 00:36:12.220 |
whereas on the right, nervous anticipation is just more symbols in the prompt. 00:36:24.140 |
The remarkable thing is that models can learn, be put in a temporary state, just from a prompt, 00:36:39.940 |
but it is increasingly clear that it is not the only thing 00:36:43.620 |
that is driving learning in the best models in this class. 00:36:54.740 |
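As a rough sketch of the few-shot, in-context version of that nervous anticipation task: the demonstrations below are invented, and `generate` is a stand-in for whatever large language model API you have access to.

```python
# Few-shot in-context classification: no task-specific training, just a prompt
# that shows the model a couple of labeled examples and then the target text.
def build_prompt(demonstrations, target_text):
    lines = [f"Text: {text}\nNervous anticipation: {label}"
             for text, label in demonstrations]
    lines.append(f"Text: {target_text}\nNervous anticipation:")
    return "\n\n".join(lines)

demonstrations = [
    ("I can't stop pacing while I wait for the results.", "yes"),
    ("The afternoon was calm and entirely uneventful.", "no"),
]

prompt = build_prompt(demonstrations,
                      "My stomach is in knots before tomorrow's interview.")
# label = generate(prompt)  # the model's next tokens serve as the prediction
```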
This is a diagram from the ChatGPT blog post. 00:36:54.740 |
but really two of them are important for us for right now. 00:37:03.160 |
The first is that in a phase of training these models, humans generate examples of the desired behavior. 00:37:15.460 |
So you might be asked to do a little Python program, 00:37:19.420 |
and a human might write that Python program, for example. 00:37:31.460 |
And that is so important because that takes us way beyond pure self-supervision. 00:37:38.440 |
It is now back to a very familiar story from all of AI. 00:37:45.220 |
What is happening is that a lot of human intelligence is being poured into these systems. 00:38:05.460 |
So we should remember, we had that brief moment 00:38:08.200 |
where it looked like it was all unstructured, unlabeled data, 00:38:11.380 |
and that was important to unlocking these capacities, 00:38:14.460 |
but now we are back at a very labor-intensive 00:38:17.660 |
human capacity here, driving what looked like 00:38:21.340 |
the really important behaviors for these models. 00:38:24.020 |
Final step, which I think actually intimately relates 00:38:29.820 |
to that instruct tuning that I just described. 00:38:33.500 |
That step is this reinforcement learning with human feedback. 00:38:43.520 |
So suppose we asked ourselves a question like, is it true 00:38:50.000 |
that if the customer doesn't have any loans, 00:38:54.500 |
then the customer doesn't have any auto loans? 00:38:58.340 |
It's the sort of reasoning that you might have to do 00:39:00.180 |
if you're thinking about a contract or something like that, 00:39:07.780 |
our old friend from the start of the lecture. 00:39:18.340 |
is it true that if the customer doesn't have any loans, 00:39:20.700 |
then the customer doesn't have any auto loans 00:39:26.360 |
And here it says, no, this is not necessarily true. 00:39:31.400 |
which is the reverse of the question that I asked. 00:39:34.900 |
Again, kind of showing it doesn't deeply understand the question. 00:39:38.920 |
It just kind of does an act that looks like it did. 00:39:47.680 |
Now we do what's called step-by-step prompting. 00:39:51.740 |
You would just tell the model that it was in some kind of careful reasoning mode, 00:40:00.680 |
and then you could give an example in your prompts 00:40:05.680 |
And then finally you could prompt it with your premise, 00:40:13.140 |
Here, I won't bother going through the details, 00:40:18.260 |
the model now not only answers correctly but reasons correctly. 00:40:31.300 |
The capacity was presumably there before, but the more sophisticated prompting mode elicited it. 00:40:38.460 |
And I think that is partly a consequence of the fact that this model was instruct tuned 00:40:45.380 |
to know how it's supposed to think about prompts like this. 00:40:47.980 |
So the combination of all that human intelligence 00:40:50.340 |
and the capacity of the model led to this really interesting behavior. 00:41:04.060 |
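A rough sketch of what such a step-by-step prompt might look like; the wording is illustrative rather than the exact prompt from the slides, and `generate` again stands in for a large language model API.

```python
# Step-by-step (chain-of-thought) prompting for the entailment question above:
# one worked demonstration, then the target premise and hypothesis.
demonstration = (
    "Premise: The customer does not own any pets.\n"
    "Hypothesis: The customer does not own any dogs.\n"
    "Reasoning: Dogs are a kind of pet, so owning no pets at all rules out owning any dogs.\n"
    "Answer: Yes, the hypothesis follows from the premise."
)

target = (
    "Premise: The customer doesn't have any loans.\n"
    "Hypothesis: The customer doesn't have any auto loans.\n"
    "Reasoning:"
)

prompt = "Let's reason step by step.\n\n" + demonstration + "\n\n" + target
# answer = generate(prompt)  # the model is nudged to spell out its reasoning before answering
```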
Of course, we're gonna unpack all of that stuff 00:41:14.780 |
- The human brain has about 100 billion neurons, 00:41:19.020 |
And I'm not sure how many parameters that might be, 00:41:22.340 |
maybe like 10 trillion parameters or something like that. 00:41:26.060 |
Are we approaching a point where these machines 00:41:30.300 |
or is there something to the language instinct, 00:41:52.380 |
was that these models remain smaller than the human brain. 00:41:59.340 |
On the one hand, they obviously have superhuman capabilities. 00:42:02.560 |
On the other hand, they fall down in ways that humans don't. 00:42:07.060 |
It's very interesting to ask why that difference exists. 00:42:12.700 |
about the limitations of learning from scratch 00:42:36.700 |
And in fact, the increased ability of these models 00:42:40.660 |
to learn from data has been really illuminating 00:42:49.700 |
You have to be careful because they're so different from us, 00:42:53.300 |
On the other hand, I think they are helping us understand 00:42:57.160 |
how to differentiate different theories of cognition. 00:43:07.740 |
that were focused on those cognitive questions in here. 00:43:09.980 |
This is a wonderful space in which to explore 00:43:28.900 |
I mean, partially following up on the brain thing, 00:43:38.300 |
And then also thinking about the previous phase 00:43:41.220 |
that you talked about, about breaking up the models 00:43:46.100 |
that decides which domain our question falls into, 00:43:54.260 |
whether we're gonna touch on an architecture like that. 00:44:05.180 |
It feels like combining big models and logic trees 00:44:11.700 |
Yeah, like one quick summary of what you said 00:44:15.340 |
The modularity of mind is an important old question 00:44:29.060 |
which have a capacity to do lots of different things 00:44:31.740 |
if they have the right pre-training and the right structure, 00:44:34.140 |
we could ask, does modularity emerge naturally? 00:44:40.260 |
Both of those seem like they could be indirect evidence 00:44:46.140 |
'cause these models are so different from us. 00:44:48.060 |
But as a kind of existence proof, for example, 00:44:58.660 |
Yeah, I don't know whether there are results for that. 00:45:02.180 |
No, just kind of a follow-up question on that as well. 00:45:06.020 |
So given how closed all these big models are, 00:45:09.780 |
how could we interact with the model in such a way 00:45:15.700 |
'Cause we literally can only interact with it. 00:45:22.740 |
the closed-off nature of a lot of these models 00:45:30.380 |
We don't get to look at their internal representations. 00:45:35.100 |
But I mentioned the rise of these 10 billion parameter models 00:45:41.620 |
And those are models that, with the right hardware, 00:45:46.060 |
And I think that's just gonna get better and better. 00:46:00.580 |
And I think it's an increasingly important area 00:46:15.940 |
that was as big as eight or 10 billion parameters. 00:46:26.220 |
this baseball cap prompt that we were discussing. 00:46:42.980 |
And so, like, the idea is that there's, like, 00:46:50.300 |
And so that's, like, the primary form of evaluation. 00:46:54.580 |
And so I guess, like, how does that play into, then, 00:46:58.540 |
like, is there some form of encoded or understanding, 00:47:01.740 |
understood deeper value system that's encoded into them? 00:47:11.980 |
find out that a model had a particular belief system 00:47:30.180 |
these models purport to offer evidence from a rule book, 00:47:51.060 |
- Can we just hook up these models to a large database 00:48:13.100 |
I wanna give you a feel for how the course will work, 00:48:18.500 |
So high-level overview, we've got these topics, 00:48:20.580 |
contextual representations, transformers and stuff, 00:48:25.300 |
that will be the topic of the first homework, 00:48:27.540 |
and it's gonna build on the first unit there. 00:48:34.620 |
and get some guarantees about how these models will behave. 00:48:40.300 |
In case you were worried that all the tasks were solved, 00:48:44.340 |
a seemingly simple task about semantic interpretation 00:48:47.580 |
that you will, well, I think it will not be solved. 00:48:51.860 |
'cause who knows what you all are capable of, 00:48:57.020 |
We'll talk about benchmarking and adversarial training 00:49:01.780 |
as we move into this mode where everyone is interacting 00:49:08.140 |
we need to take a step back and rigorously assess 00:49:11.020 |
whether they actually are behaving in good ways, 00:49:13.340 |
or whether we're just biased toward remembering 00:49:19.700 |
that's the explainability stuff that I mentioned, 00:49:23.260 |
And as you can see for the, like, five, six, and seven, 00:49:28.540 |
where you're fo- you're focused on final projects, 00:49:47.540 |
which is an informal competition around data and modeling. 00:50:03.940 |
and the team is gonna look at all your submissions, 00:50:06.940 |
and give out some prizes for top-performing systems, 00:50:12.380 |
or interesting, or ambitious, or something like that. 00:50:23.860 |
and then as a group, we can reflect on what worked, 00:50:26.860 |
and what didn't, and look at the really ambitious things 00:50:34.580 |
and this is just as a way to make sure you have incentives 00:50:37.900 |
to really immerse yourself in the course material. 00:50:45.660 |
which I'll talk a little bit about probably next time, 00:50:48.020 |
that is just making sure you understand the course policies. 00:50:55.340 |
but the idea is that you will have some incentive 00:50:58.420 |
to learn about policies like due dates, and so forth. 00:51:02.220 |
And then the real action is in the final project, 00:51:09.580 |
Those three components, you'll probably do those in teams, 00:51:14.060 |
you'll be mentored by someone from the teaching team. 00:51:16.820 |
And as I said before, we have this incredibly expert 00:51:28.380 |
with someone who's really aligned with your project goals, 00:51:31.820 |
and then I think you can go really, really far. 00:51:38.940 |
and all Stanford kids get obsessed about this stuff. 00:51:42.740 |
On the final project, is this more of an academic paper, 00:51:56.860 |
It is easy to get obsessed with your Bake-off entry. 00:52:11.660 |
I mean, one of them is on retrieval augmented 00:52:14.380 |
which is one of my core research focuses right now, 00:52:18.820 |
If you do something really interesting for a Bake-off, 00:52:34.820 |
I've got links at the website to people who have gone on 00:52:37.620 |
to publish their final paper as an NLP paper. 00:52:42.900 |
They didn't literally publish the final paper 00:52:46.540 |
almost no one can produce a publishable paper. 00:52:50.500 |
but you could form the basis for then working 00:52:55.060 |
and then getting a really outstanding publication out of it. 00:52:57.820 |
And I would say that that's the default goal. 00:52:59.620 |
The nature of the contribution though is highly varied. 00:53:07.900 |
but there are a lot of ways to satisfy that requirement, 00:53:14.180 |
for some expansive notion of the field as well. 00:53:28.620 |
we are presupposing CS224N or CS224S as prerequisites for the course. 00:53:34.820 |
And what that means is that I'm gonna skip a lot of 00:53:37.980 |
the fundamentals that we have covered in past years. 00:53:43.340 |
check out the background page of the course site. 00:53:45.980 |
It covers fundamentals of scientific computing, 00:53:54.780 |
And I'm hoping that that's enough of a refresher. 00:53:57.580 |
If you look at that material and find that it too is kind of unfamiliar, 00:54:03.540 |
then contact us on the teaching team and we can figure out a plan. 00:54:08.620 |
But officially, this is a course that presupposes CS224N. 00:54:14.620 |
Then the core goals. This kind of relates to that previous question. 00:54:18.900 |
Hands-on experience with a wide range of problems. 00:54:22.060 |
Mentorship from the teaching team to guide you through projects and assignments. 00:54:27.380 |
And then really the central goal here is to make you the best, 00:54:33.240 |
most flexible NLU researcher and practitioner that you can be for whatever you decide to do next. 00:54:40.020 |
And we're assuming that you have lots of diverse goals that somehow connect with NLU. 00:54:45.500 |
All right. Let's do some course themes unless there are questions. 00:54:54.140 |
I have a whole final section of this slideshow that's about the course, 00:55:01.960 |
Might save that for next time and you can check it out at 00:55:04.400 |
the website and you'll be forced to engage with it for quiz zero. 00:55:11.240 |
So let's dive into the content part of this, unless there are questions or comments. 00:55:25.920 |
We want to talk about core concepts and goals. 00:55:28.480 |
Give you a sense for what these models are like, 00:55:30.940 |
why they work, what they're supposed to do, all of that stuff. 00:55:34.440 |
We'll talk about a bunch of different architectures. 00:55:39.560 |
but I hope that I have picked the right selection of them to give you 00:55:44.120 |
a feel for how people are thinking about these models and the kind of 00:55:50.080 |
real meaningful advancement just at the level of architectures. 00:55:55.840 |
which I think maybe a lot of us have been surprised to see just how 00:55:59.040 |
important that is as a differentiator for different approaches in this space. 00:56:06.120 |
We'll also talk about taking really large models and making them smaller. 00:56:09.400 |
It's an important goal for lots of reasons and an exciting area of research. 00:56:15.320 |
One member of the teaching team is going to do a little lecture for us on diffusion objectives for these models, 00:56:19.760 |
and then is going to talk about practical pre-training and fine-tuning. 00:56:24.120 |
I'm going to enlist the entire teaching team to do guest lectures, 00:56:28.140 |
and these are the two that I've lined up so far. 00:56:31.120 |
That will culminate or be aligned with this first homework and bake-off, 00:56:37.920 |
I'm going to give you a bunch of different sentiment datasets, 00:56:40.760 |
and you're going to have to design one system that can succeed on all of them. 00:56:50.520 |
That has data that's like what you developed on, 00:56:54.160 |
and then some mystery examples that you will not really be able to anticipate. 00:56:58.560 |
We're going to see how well you do at handling 00:57:01.000 |
all of these different domains with one system. 00:57:07.840 |
It's meant as a refresher on core concepts and supervised learning, 00:57:11.320 |
and really getting you to think about transformers. 00:57:13.720 |
Although we're not going to constrain the solution that you develop. 00:57:19.600 |
Our second major theme will be retrieval-augmented in-context learning. 00:57:27.360 |
A topic that I would not even have dreamt of five years ago, 00:57:33.440 |
and seemed kind of infeasible three years ago, 00:57:36.120 |
and that we first did, what, one year ago? 00:57:38.960 |
Oh goodness. I think this is only the second time, 00:57:41.720 |
but I had to redo it entirely because things have changed so much. 00:57:48.760 |
We have two characters so far in our kind of emerging narrative for NLU. 00:57:53.520 |
On the one hand, we have this approach that I'm going to call LLMs for everything, 00:58:02.560 |
Here I've chosen a very complicated question. 00:58:04.600 |
Which MVP of a game Red Flaherty umpired was elected to the Baseball Hall of Fame? 00:58:10.600 |
Hats off to you if you know that the answer is Sandy Koufax. 00:58:15.120 |
Um, the LLMs-for-everything approach is that you just type that question in and the model just generates the answer. 00:58:26.320 |
The other character that I'm going to introduce 00:58:28.960 |
here is what I'm going to call retrieval augmented. 00:58:34.180 |
except now this is going to proceed differently. 00:58:36.040 |
The first thing that we will do is take some large language model and 00:58:39.840 |
encode that query into some numerical representation. 00:58:46.600 |
The new piece is that we're going to also have a knowledge store, 00:58:50.520 |
which you could think of as an old-fashioned web index, right? 00:58:58.520 |
the modern twist that now all of the documents 00:59:01.280 |
are also represented by large language models. 00:59:04.120 |
But fundamentally, this is an index of a sort that drives all web search right now. 00:59:12.040 |
We score documents against queries on the basis of these numerical representations, 00:59:16.400 |
and we can reproduce the classic search experience. 00:59:19.200 |
Here I've got a ranked list of documents that came back from my query, 00:59:23.920 |
just like when you do Google as of the last time I googled. 00:59:33.080 |
The new step is that a language model takes those retrieved documents and synthesizes them into an answer. 00:59:39.160 |
It's kind of small on the slide, but it's the same answer over here. 00:59:41.400 |
Although notably, this answer is now decorated with 00:59:44.600 |
links that would allow you the user to track back to 00:59:48.320 |
what documents actually provided that evidence. 00:59:56.400 |
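Here is a minimal sketch of that retrieve-then-synthesize pattern, assuming the sentence-transformers library for the encoder; the toy documents paraphrase the example above and are purely illustrative, and `generate` stands in for any large language model API.

```python
# Retrieval augmented QA in miniature: encode documents and the query, rank by
# similarity, and hand the top passages to a language model for synthesis.
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "Sandy Koufax was named MVP of the 1965 World Series.",
    "Red Flaherty was one of the umpires of the 1965 World Series.",
    "Sandy Koufax was elected to the Baseball Hall of Fame in 1972.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = encoder.encode(documents, normalize_embeddings=True)  # the "index"

query = "Which MVP of a game Red Flaherty umpired was elected to the Baseball Hall of Fame?"
query_vec = encoder.encode([query], normalize_embeddings=True)[0]

scores = doc_vecs @ query_vec     # cosine similarity (vectors are normalized)
top_k = np.argsort(-scores)[:2]   # ranked list of retrieved documents
evidence = "\n".join(f"[{i + 1}] {documents[j]}" for i, j in enumerate(top_k))

prompt = (f"Context:\n{evidence}\n\n"
          f"Question: {query}\n"
          "Answer, citing the numbered context passages:")
# answer = generate(prompt)  # synthesis step, with provenance via the [n] citations
```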
And that's kind of what we were already grappling with. 00:59:59.720 |
This is an important societal need because this is taking over web search. 01:00:04.280 |
What are our goals for this kind of model here? 01:00:12.160 |
We need it to pull information from multiple documents and synthesize it down into a single answer. 01:00:17.080 |
Both of the approaches that I just showed you are going to do really well on that. 01:00:23.000 |
We also need it to be updatable because the world is changing all the time. 01:00:27.480 |
We need it to track provenance and maybe invoke something like factuality. 01:00:32.360 |
But certainly provenance, we need to know where the information came from. 01:00:37.640 |
We need to know that the model won't produce private information. 01:00:40.840 |
And we might need to restrict access to parts of 01:00:43.440 |
the model's knowledge to different groups like 01:00:45.760 |
different customers or different people with different privileges and so forth. 01:00:49.640 |
That's what we're going to need if we're really going to 01:00:55.080 |
As I said, I think both of the approaches that I sketched do well on 01:00:58.400 |
the synthesis part because they both use a language model and those are really good. 01:01:13.040 |
And I pointed out models like Alpaca that are smaller. 01:01:17.200 |
But I strongly suspect that if we are going to continue to ask 01:01:21.280 |
these models to be both a knowledge store and a language capability, 01:01:26.600 |
we're going to be dealing with these really large models. 01:01:30.160 |
The hope of the retrieval augmented approach is that we can use much smaller language models. 01:01:36.720 |
And the reason we could do that is that we're going to factor out 01:01:40.120 |
the knowledge store into that index, and the language capability into the model. 01:01:45.960 |
The only thing we're going to be asking the language model 01:01:48.840 |
is to be good at that kind of in-context learning. 01:01:51.880 |
It doesn't need to also store a full model of the world. 01:01:55.520 |
And I think that means that these models could be smaller. 01:01:58.720 |
So overall, a big gain in efficiency if we go retrieval augmented. 01:02:09.440 |
Again, this is a problem that people are working on 01:02:11.520 |
very concertedly for the LLMs for everything approach. 01:02:14.720 |
But these models persist in giving outdated answers to questions. 01:02:19.640 |
And one pattern you see is that there's a lot of progress, where you could, like, get 01:02:24.380 |
the correct answer to who is the president of the US. 01:02:27.280 |
But then you ask it about something related to 01:02:29.720 |
the family of the president and it reveals that it has 01:02:34.080 |
outdated information stored in its parameters and that's 01:02:37.520 |
because all of this information is interconnected and we don't at 01:02:41.280 |
the present moment know how to reliably do that kind of systematic editing. 01:02:53.960 |
For the retrieval augmented approach, we assume that the knowledge store changed, like somebody updated a Wikipedia page. 01:02:58.280 |
So we represent all the documents again or at least just the ones that changed. 01:03:02.720 |
And now we have a lot of guarantees that as that propagates forward into 01:03:06.560 |
the retrieved results which are consumed by the language model, 01:03:09.920 |
it will reflect the changes we made to the underlying database in 01:03:14.000 |
exactly the same way that a web search index is updated now. 01:03:19.000 |
Right. One forward pass of the large language model 01:03:22.960 |
compared to maybe training from scratch over here on 01:03:26.640 |
new data to get an absolute guarantee that the change will propagate. 01:03:42.400 |
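A small sketch of that update story, under the same assumptions as the retrieval sketch earlier (a frozen encoder and an in-memory vector index):

```python
# When a document changes, re-encode just that document and overwrite its row in
# the index: one forward pass of the frozen encoder, no retraining anywhere.
def update_document(doc_vecs, documents, doc_id, new_text, encoder):
    documents[doc_id] = new_text
    doc_vecs[doc_id] = encoder.encode([new_text], normalize_embeddings=True)[0]
    return doc_vecs, documents
```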
my question, are professional baseball players 01:03:46.600 |
But I kind of cut it off but at the top there I said, 01:04:02.480 |
And I think that this is worse than providing no links at all because I'm primed 01:04:10.680 |
to see links and think they're probably evidence, 01:04:17.680 |
I see it found the relevant MLB pages and that's it." 01:04:27.320 |
Whereas over here on the retrieval augmented side, we have a search phase where we're actually linked back to documents. 01:04:30.560 |
And then we just need to solve the interesting non-trivial question 01:04:34.120 |
of how to link those documents into the synthesized answer. 01:04:37.520 |
But all of the information we need is right there on the screen for us. 01:04:41.560 |
And so this feels like a relatively tractable problem 01:04:44.320 |
compared to what we are faced with on the left. 01:04:47.280 |
I will say, I've been just amazed at the rollout, 01:04:55.000 |
which now incorporates OpenAI models at some level. 01:04:57.960 |
Because it is clear that it is doing web search, right? 01:05:01.800 |
Because it's got information that comes from documents that 01:05:04.640 |
only appeared on the web days before your query. 01:05:08.200 |
But what it's doing with that information seems completely chaotic to me. 01:05:13.280 |
So that it's kind of just getting mushed in with whatever else the model is doing, 01:05:17.440 |
and you get this unpredictable combination of things that are grounded in documents, 01:05:26.320 |
And again, I maintain this is worse than just giving no evidence at all. 01:05:33.120 |
I don't know why these companies are not simply doing the retrieval augmented thing, 01:05:39.640 |
and maybe your research could help them wise up a little bit about this. 01:05:48.680 |
we have a pressing problem, privacy challenges. 01:05:51.920 |
We know that those models can memorize long strings in their training data, 01:05:55.560 |
and that could include some very particular information about one of us, 01:06:01.160 |
We have no known way with a language model to compartmentalize LLM capabilities, 01:06:06.080 |
and say like, you can see this kind of result and you cannot. 01:06:09.640 |
And similarly, we have no known way to restrict access to part of an LLM's capabilities. 01:06:15.680 |
They just produce things based on their prompts, 01:06:18.400 |
and you could try to have some prompt tuning that would tell them for 01:06:21.240 |
this kind of person or setting do this and not that, 01:06:24.040 |
but nobody could guarantee that that would succeed. 01:06:26.800 |
Whereas, for the retrieval augmented approach, again, 01:06:31.120 |
we're thinking about accessing information from an index, 01:06:34.680 |
and access restrictions on an index are an old problem by now, 01:06:41.440 |
but something that a lot of people have tackled for decades now, 01:06:45.600 |
and so we can offer something like guarantees, 01:06:48.160 |
just from the fact that we have a separated knowledge store. 01:07:00.200 |
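Here is a rough sketch of what index-level access control can look like; the data structures are hypothetical placeholders rather than any particular system's API.

```python
# Access control on a separated knowledge store: every document carries the set
# of groups allowed to see it, and retrieval only scores documents visible to
# the current user, so restricted content never reaches the language model.
def retrieve_with_permissions(query_vec, index, user_groups, k=3):
    # index: iterable of (doc_id, doc_vec, allowed_groups) triples, with
    # doc_vec a normalized numpy vector and allowed_groups a set of group names.
    visible = [(doc_id, vec) for doc_id, vec, allowed in index
               if allowed & user_groups]
    ranked = sorted(visible, key=lambda pair: -float(pair[1] @ query_vec))
    return [doc_id for doc_id, _ in ranked[:k]]
```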
On the LLMs-for-everything side, people are working on these problems and it's very exciting, 01:07:07.340 |
But over here on the retrieval augmented side, the problems aren't all solved either; 01:07:13.760 |
it's just that we can see the path to solving them, 01:07:16.360 |
and this feels very urgent to me because of how 01:07:19.600 |
suddenly this kind of technology is being deployed in 01:07:24.760 |
one of the core things we do in society, which is web search. 01:07:28.200 |
So it's an urgent thing that we get good at this. 01:07:37.720 |
Up until now, the way you would do even the retrieval augmented thing would be that you would 01:07:44.560 |
train a custom purpose model to do the question answering part, 01:07:48.040 |
and it could extract things from the text that you produced, 01:07:50.840 |
or maybe even generate some new things from the text that you produced. 01:07:54.400 |
That's the mode that I mentioned before where you'd have some language models, 01:07:58.880 |
maybe a few of them, and you'd have an index, 01:08:00.880 |
and you would stitch them together into a question answering system 01:08:04.480 |
that you would probably train on question answering data, 01:08:07.960 |
and you would hope that this whole big monster, maybe 01:08:10.120 |
fine-tuned on SQuAD or Natural Questions or one of those datasets, 01:08:14.520 |
gave you a general purpose question answering capability. 01:08:19.440 |
That's the present, but I think it might actually be the recent past. 01:08:24.320 |
In fact, the way that you all will probably work when we do this unit, 01:08:33.880 |
This starts from the observation that the retriever model is really just a model that takes in text and produces text with scores, 01:08:42.920 |
and a language model is also a device for taking in text and producing text with scores. 01:08:51.880 |
So you can think of them as just black box devices that do this input-output thing, 01:08:55.960 |
and then you get into the intriguing mode of asking, 01:08:58.680 |
but what if we had them just talk to each other? 01:09:01.600 |
That is what you will do for the homework and bake-off. 01:09:04.760 |
You will have a frozen retriever and a frozen large language model, 01:09:11.760 |
and you will get them to work together to solve a very difficult open-domain question answering problem. 01:09:16.080 |
That's pushing us into a new mode for even thinking about how we design AI systems, 01:09:24.080 |
where it's much more about getting them to communicate with each other 01:09:27.360 |
effectively to design a system from frozen components. 01:09:31.880 |
Again, unanticipated at least by me as of a few years ago, 01:09:41.920 |
I think what I'll do since we're near the end of the- of class here, 01:09:46.840 |
and then we'll use some of our time next time to introduce a few other of 01:09:50.200 |
these course themes and that'll set us up well for diving into transformers. 01:09:57.840 |
few-shot open QA is kind of the task that you will tackle for homework two. 01:10:08.280 |
The most standard thing we could do is just prompt the language model with that question, 01:10:12.920 |
'what is the course to take?', down here, and see what answer it gave back, right? 01:10:17.320 |
But the retrieval augmented insight is that we 01:10:20.680 |
might also retrieve some kind of passage from a knowledge store. 01:10:25.440 |
The course to take is natural language understanding, 01:10:28.100 |
and that could be done with a retrieval mechanism. 01:10:33.180 |
It might help the model as we saw going back to the GPT-3 paper to 01:10:37.480 |
have some examples of the kind of behavior that I'm hoping to get from the model. 01:10:41.960 |
And so here I have retrieved from some dataset, 01:10:44.800 |
question-answer pairs that will kind of give it a sense for what I want it to do in the end. 01:10:51.360 |
We could also pick questions that were based very closely on the question that we posed. 01:10:57.840 |
That would be like k-nearest neighbors approach where we use 01:11:01.160 |
our retrieval mechanism to find similar questions to the one that we care about. 01:11:06.220 |
I could also add in some context passages and I could do that by retrieval. 01:11:11.200 |
So now we've used the retrieval model twice potentially, 01:11:14.500 |
once to get good demonstrations and once to provide context for each one of them. 01:11:19.460 |
But I could also use my retrieval mechanism with the questions and answers from 01:11:23.860 |
the demonstration to get even richer connections 01:11:29.440 |
I could even use a language model to rewrite aspects of those demonstrations to put them 01:11:34.500 |
in a format that might help me with the final question that I want to pose. 01:11:42.120 |
So we are using both the retrieval mechanism and the large language model to build up this prompt. 01:11:50.900 |
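A rough sketch of that kind of prompt construction; `retrieve_passages` and `retrieve_demonstrations` are hypothetical stand-ins for a frozen retriever over a knowledge store and over a question-answer dataset.

```python
# Build a few-shot open QA prompt: retrieved demonstrations (similar questions
# with answers and their own supporting passages) followed by the target
# question with freshly retrieved context and the answer left blank.
def build_open_qa_prompt(question, retrieve_passages, retrieve_demonstrations, k=2):
    parts = []
    for demo_q, demo_a in retrieve_demonstrations(question, k):
        demo_context = "\n".join(retrieve_passages(demo_q, 1))
        parts.append(f"Context: {demo_context}\nQuestion: {demo_q}\nAnswer: {demo_a}")
    target_context = "\n".join(retrieve_passages(question, k))
    parts.append(f"Context: {target_context}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(parts)
```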
And then when you think about the model generation, again, 01:11:54.140 |
we could just take the top response from the model, 01:11:57.100 |
but we can do very sophisticated things on up to this full retrieval augmented generation model, 01:12:04.140 |
which essentially marginalizes out the evidence passage and gives us 01:12:08.460 |
a really powerful look at a good answer conditional 01:12:11.640 |
on that very complicated prompt that we constructed. 01:12:15.540 |
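Schematically, in my own notation, that marginalization over retrieved evidence passages z looks like:

```latex
% Score an answer y for question q by summing over the top-k retrieved passages,
% weighting each passage's answer probability by the retriever's score for it.
p(y \mid q) \;=\; \sum_{z \in \mathrm{top\text{-}k}(q)} p_{\eta}(z \mid q)\, p_{\theta}(y \mid q, z)
```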
I think what you're seeing on the left here is that we are going to move from an era where 01:12:20.780 |
we just type in prompts into these models and hope for the best, 01:12:25.060 |
into an era where prompt construction is a kind of new programming mode, 01:12:37.740 |
but also drawing on very powerful pre-trained components to assemble 01:12:43.640 |
this kind of instruction kit for your large language model to do whatever task you have set for it. 01:12:50.340 |
And so instead of designing these AI systems with 01:12:55.740 |
we might actually be moving back into a mode that's like 01:12:58.980 |
that symbolic mode from the '80s where you type in a computer program. 01:13:03.260 |
It's just that now the program that you type in is 01:13:06.900 |
connected to these very powerful modern AI components. 01:13:14.700 |
opening doors to all kinds of new capabilities for these systems. 01:13:18.540 |
And this first homework and bake-off is going to give you a glimpse of that. 01:13:23.140 |
And you're going to use a programming model we've 01:13:25.420 |
developed called demonstrate-search-predict that I 01:13:28.460 |
hope will give you a glimpse of just how powerful this can be. 01:13:39.980 |
So next time I'll show you a few more units from the course,