
Stanford XCS224U: NLU | Intro & Evolution of Natural Language Understanding, Pt. 1 | Spring 2023


Whisper Transcript

00:00:00.000 | Welcome everyone.
00:00:06.880 | Uh, this is natural language understanding.
00:00:09.680 | Uh, it is a weird and wonderful and maybe worrying moment to be doing natural language understanding.
00:00:16.440 | My goal for today is just to kind of immerse us in this moment and think about how we got here and what it's like to be doing research now.
00:00:25.440 | And I think that'll set us up well to think about what we're gonna do in the course and how that's gonna set you up to participate in this moment in AI,
00:00:35.040 | uh, in many ways, in whichever ways you choose.
00:00:38.240 | And it's an especially impactful moment to be doing that.
00:00:40.920 | And this is a project-oriented course.
00:00:43.200 | And I feel like we can get you all to the point where you are doing meaningful things that contribute to this ongoing moment in ways that are gonna be exciting and impactful.
00:00:53.000 | That is the fundamental goal of the course.
00:00:55.960 | Let's now think about the current moment.
00:00:57.560 | This is always a moment of reflection for me.
00:01:00.240 | I started teaching this course in 2012, um, which I guess is ages ago now.
00:01:06.480 | It feels recent in my lived experience, but it does feel like ages ago in terms of the content.
00:01:11.480 | In 2012, on the first day, I had a slide that looked like this.
00:01:15.240 | I said, "It was an exciting time to be doing natural language understanding research."
00:01:20.440 | I noted that there was a resurgence of interest in the area after a long period of people mainly focused on syntax and things like that.
00:01:28.760 | But there was a widespread perception that NLU was poised for a breakthrough and to have huge impact on business applications,
00:01:37.120 | and that there was a white-hot job market for Stanford grads.
00:01:40.200 | A lot of this language is coming from the fact that we were in this moment when Siri had just launched,
00:01:45.600 | Watson had just won on Jeopardy,
00:01:48.440 | and we had all of these in-home devices and all the tech giants kind of competing on what was emerging as the field of natural language understanding.
00:01:56.840 | Let's fast forward to 2022.
00:01:59.120 | I did feel like I should update that in 2022 by saying this is the most exciting moment ever as opposed to it just being an exciting time.
00:02:06.720 | But I emphasize the same things, right?
00:02:09.280 | We were still in this feeling that we had experienced a resurgence of interest in the area,
00:02:14.640 | although now it was hyper-intensified.
00:02:17.120 | Same thing with industry.
00:02:18.560 | The industry interest at this point makes the stuff from 2012 look like small potatoes.
00:02:24.560 | Systems were getting very impressive,
00:02:27.400 | but, and I maintain this,
00:02:29.580 | they show their weaknesses very quickly,
00:02:31.940 | and the core things about NLU remain far from solved.
00:02:35.640 | So the big breakthroughs lie in the future.
00:02:37.920 | I will say that even since 2022,
00:02:40.440 | it has felt like there has been an acceleration,
00:02:42.960 | and some problems that we used to focus on feel kind of like they're less pressing.
00:02:48.480 | I won't say solved, but they feel like we've made a lot of progress on them as a result of models getting better.
00:02:54.480 | But all that means for me is that there are more exciting things in the future, and that we can tackle even more ambitious things.
00:03:01.640 | And you'll see that I've tried to overhaul the course to be ever more ambitious about the kind of problems that we might take on.
00:03:09.440 | But we do kind of live in a golden age for all of this stuff.
00:03:13.400 | And even in 2022,
00:03:14.720 | I'm not sure what I would have predicted to say nothing of 2012,
00:03:18.000 | that we would have these incredible models like DALL-E 2,
00:03:21.120 | which can take you from text into these incredible images.
00:03:24.720 | Language models, which will more or less be the star of the quarter for us.
00:03:29.120 | But also models that can take you from natural language to code.
00:03:33.080 | And of course, we are all seeing right now as we speak,
00:03:36.480 | that the entire industry related to web search is being reshaped around NLU technologies.
00:03:43.600 | So whereas this felt like a kind of niche area of NLP when we started this course in 2012,
00:03:50.920 | now it feels like the entire field of NLP,
00:03:54.120 | certainly in some aspects,
00:03:56.120 | all of AI is focused on these questions of natural language understanding,
00:04:00.320 | which is exciting for us.
00:04:02.800 | One more moment of reflection here.
00:04:05.000 | You know, in this course,
00:04:06.840 | throughout the years, we have used simple examples to kind of highlight the weaknesses of current models.
00:04:12.600 | And so a classic one for us was simply this question,
00:04:16.320 | which US states border no US states?
00:04:19.440 | The idea here is that it's a simple question,
00:04:22.400 | but it can be hard for our language technologies because of that negation, the no there.
00:04:28.400 | In 1980, there was a famous system called Chat-80.
00:04:33.080 | It was a symbolic system representing the first major phase of research in NLP.
00:04:38.760 | You can see the fragment of the system here.
00:04:41.280 | And Chat-80 was an incredible system in that it could answer questions like,
00:04:45.600 | which country bordering the Mediterranean borders a country that is
00:04:49.040 | bordered by a country whose population exceeds the population of India?
00:04:52.680 | I've given you the answer here,
00:04:54.760 | Turkey, at least according to 1980s geography.
00:04:58.920 | But if you asked Chat-80 a simple question like,
00:05:01.920 | which US states border no US states?
00:05:04.000 | It would just say, I don't understand.
00:05:06.680 | It was an incredibly expressive system, but rigid.
00:05:10.880 | It could do some things very deeply,
00:05:13.200 | as you see from the first question,
00:05:14.880 | but things that fell outside of its capacity,
00:05:17.440 | it would just fall down flat.
00:05:19.720 | That was the 1980s.
00:05:21.520 | Let's fast forward.
00:05:22.440 | 2009, around the time this course launched,
00:05:25.040 | Wolfram Alpha hit the scene.
00:05:26.960 | And this was meant to be a kind of revolutionary language technology.
00:05:30.880 | The website is still up,
00:05:32.320 | and to my amazement,
00:05:34.000 | it still gives the following behavior.
00:05:36.560 | If you search for which US states border no US states,
00:05:40.120 | it kind of just gives you a list of the US states.
00:05:43.160 | Revealing, I would say,
00:05:45.120 | that it has no capacity to understand the question posed.
00:05:49.400 | That was 2009.
00:05:50.960 | So we've gone from 1980 to 2009.
00:05:54.000 | Okay, let's go to 2020.
00:05:55.600 | This is the first of the OpenAI models, Ada.
00:05:59.320 | Which US states border no US states?
00:06:02.120 | The answer is no.
00:06:03.880 | And then it sort of starts to babble,
00:06:05.680 | the US border is not a state border.
00:06:07.640 | It did that for a very long time.
00:06:10.400 | What about Babbage?
00:06:11.680 | This is still 2020.
00:06:13.880 | The US states border no US states.
00:06:16.040 | What is the name of the US state?
00:06:17.560 | And then it really went off the deep end from there,
00:06:19.880 | again, for a very long time.
00:06:21.360 | That was Babbage.
00:06:22.240 | If you had seen this output,
00:06:24.600 | well, at least for me,
00:06:25.720 | it might have shaken my faith
00:06:27.800 | that this was a viable approach, right?
00:06:30.600 | But the team persisted, I guess.
00:06:32.120 | 2021, this is the Curie model.
00:06:34.280 | Which US states border no US states?
00:06:37.200 | It had a problem that it started listing things,
00:06:39.520 | but it did say Alaska, Hawaii, and Puerto Rico,
00:06:42.840 | which is an interestingly more impressive answer
00:06:45.880 | than the first answer, right?
00:06:47.680 | It still has some problem understanding
00:06:49.480 | what it means to respond,
00:06:50.640 | but it's looking like we're seeing some signal.
00:06:54.240 | Da Vinci Instruct Beta, this is 2022.
00:06:56.840 | It's important, I think,
00:06:58.080 | that this is the first of the models
00:06:59.600 | that have Instruct in the name.
00:07:01.160 | We'll talk about that in a minute.
00:07:02.840 | Which US states border no US states?
00:07:04.800 | Alaska and Hawaii.
00:07:06.760 | From 2020 to 2022,
00:07:08.800 | we have seen this astounding leap forward,
00:07:11.400 | making everything before then sort of pale in comparison.
00:07:14.680 | And then finally, Text Da Vinci One,
00:07:17.360 | you know, one of the new best-in-class models,
00:07:19.400 | at least until two months ago.
00:07:21.160 | Which US states border no US states?
00:07:23.200 | Alaska and Hawaii are the only US states
00:07:25.440 | that border no other US states.
00:07:26.920 | A very impressive answer indeed.
00:07:29.600 | And if you just think about the little history I've given,
00:07:32.840 | a kind of microcosm of what is happening in the field,
00:07:37.040 | a lot of time without much progress,
00:07:40.560 | with some hype attached,
00:07:42.160 | and now in the last few years,
00:07:43.920 | this kind of rapid progress forward.
00:07:46.160 | And, you know, that's just one example,
00:07:49.760 | but these examples multiply, and we can quantify this.
00:07:52.400 | Here's another impressive case.
00:07:54.080 | I asked the Da Vinci Two model,
00:07:56.320 | in which year was Stanford University founded?
00:07:59.080 | When did it enroll its first students?
00:08:01.000 | Who is its current president,
00:08:02.280 | and what is its mascot?
00:08:03.760 | A complicated question indeed,
00:08:05.800 | and it gave a fluent and factually correct answer
00:08:09.440 | on all counts.
00:08:10.800 | This is the Da Vinci Three model,
00:08:13.080 | which was best-in-class until a few weeks ago,
00:08:16.200 | and it gave exactly the same answer.
00:08:18.440 | Very impressive.
00:08:20.800 | Now, in this course,
00:08:22.760 | as you'll see at the website,
00:08:23.800 | one of the readings we've suggested
00:08:25.560 | for the start of the course
00:08:27.000 | is this classic paper by Hector Levesque
00:08:29.320 | called "On Our Best Behaviour."
00:08:31.400 | And the thrust of this article,
00:08:33.120 | essentially channeling Terry Winograd,
00:08:35.200 | and the Winograd schema,
00:08:38.080 | the idea is that we should come up with examples
00:08:40.960 | that will test whether models deeply understand,
00:08:44.120 | and in particular, get past the kind of
00:08:46.080 | simple memorization of statistics and other things
00:08:49.040 | about the data they're trained on,
00:08:50.640 | and really probe to see whether they understand
00:08:52.720 | what the world is like.
00:08:54.200 | And Levesque and Winograd's technique for doing this
00:08:57.680 | is to pose very unlikely questions
00:09:01.200 | where humans have very natural answers,
00:09:03.520 | like one of the ones Levesque poses is,
00:09:05.640 | could a crocodile run the steeplechase?
00:09:09.040 | Maybe it's a question you've never thought about before,
00:09:11.400 | but you probably have a pretty consistent answer
00:09:13.920 | across this group.
00:09:15.000 | Could a crocodile run the steeplechase?
00:09:17.560 | Here, I asked another one of Levesque's questions.
00:09:19.600 | Are professional baseball players allowed
00:09:21.800 | to glue small wings onto their caps?
00:09:24.440 | You could think about that for a second.
00:09:26.280 | The Da Vinci 2 model said,
00:09:28.080 | "There is no rule against it, but it is not common."
00:09:30.960 | And that seemed like a very good answer to me at the time.
00:09:35.120 | When the Da Vinci 3 engine came out,
00:09:37.360 | though, this started to worry me.
00:09:39.120 | No, professional baseball players are not allowed
00:09:41.760 | to glue small wings onto their caps.
00:09:43.720 | Major League Baseball has strict rules
00:09:45.560 | about the appearance of players' uniforms and caps.
00:09:49.000 | And any modifications to the caps are not allowed.
00:09:52.520 | Okay, I thought I was feeling good about this,
00:09:55.200 | but now I don't even myself know what the answer is.
00:09:58.240 | Are professional baseball players allowed
00:10:00.080 | to glue small wings onto their caps?
00:10:01.520 | We have two confident answers that are contradictory
00:10:05.640 | across two models that are very closely related.
00:10:09.080 | It's starting to worry us a little bit, I hope.
00:10:11.760 | But still, it's impressive.
00:10:14.200 | What's that?
00:10:15.040 | - You want me to ask Bard?
00:10:16.760 | - You could check.
00:10:17.600 | Yes, I have a few cases,
00:10:19.200 | and this is an interesting experiment
00:10:20.880 | for us to run for sure.
00:10:22.040 | Let me show you the responses I got a bit later.
00:10:24.560 | The point, though, I guess,
00:10:27.800 | if you've seen the movie "Blade Runner,"
00:10:29.120 | this is starting to feel like to figure out
00:10:31.760 | whether an agent we were interacting with was human or AI,
00:10:36.120 | we would need to get very sophisticated
00:10:39.480 | interview techniques indeed.
00:10:41.200 | The Turing test, long forgotten here,
00:10:43.960 | now we're into the mode of trying to figure out
00:10:46.680 | exactly what kind of agents we're interacting with
00:10:49.520 | by having to be extremely clever
00:10:51.920 | about the kinds of things that we do with them.
00:10:54.400 | Now, that's kind of anecdotal evidence,
00:10:58.640 | but I think that the picture of progress
00:11:00.760 | is also supported by what's happening in the field.
00:11:03.920 | Let me start this story with our benchmarks.
00:11:06.880 | And the headline here is that our benchmarks,
00:11:09.400 | the tasks, the datasets we use to probe our models
00:11:13.080 | are saturating faster than ever before.
00:11:15.360 | And I'll articulate what I mean by saturate.
00:11:17.880 | So we have a little framework.
00:11:19.840 | Along the x-axis, I have time
00:11:22.080 | stretching back into like the 1990s.
00:11:24.760 | And along the y-axis, I have a normalized measure
00:11:28.520 | of distance from what we call human performance.
00:11:31.400 | That's the red line set at zero.
00:11:34.000 | Each one of these benchmarks has, in its own particular way,
00:11:37.080 | set a so-called estimate of human performance.
00:11:39.840 | I think we should be cynical about that,
00:11:41.880 | but nonetheless, this'll be a kind of marker
00:11:44.400 | of progress for us.
00:11:45.720 | First dataset, MNIST.
00:11:48.640 | This is like digit recognition, famous task in AI.
00:11:51.600 | It was launched in the 1990s,
00:11:53.400 | and it took about 20 years for us to see a system
00:11:56.960 | that surpassed human performance in this very loose sense.
00:12:01.800 | The switchboard corpus, this is going from speech to text.
00:12:05.520 | It's a very similar story, launched in the '90s,
00:12:08.000 | and it took about 20 years
00:12:09.960 | for us to see a superhuman system.
00:12:13.480 | ImageNet, this was launched, I believe, in 2009,
00:12:17.120 | and it took less than 10 years for us to see a system
00:12:20.320 | that surpassed that red line.
00:12:23.120 | And now progress is gonna pick up really fast.
00:12:25.280 | SQuAD 1.1, the Stanford Question Answering Dataset,
00:12:29.040 | was launched in 2016, and it took about three years
00:12:32.440 | for it to be saturated in this sense.
00:12:35.240 | SQuAD 2.0 was the team's attempt
00:12:38.480 | to pose an even harder problem,
00:12:40.120 | one where there were unanswerable questions,
00:12:42.960 | but it took even less time for systems
00:12:45.160 | to get past that red line.
00:12:47.000 | Then we get the GLUE benchmark.
00:12:49.600 | This is a famous benchmark
00:12:51.480 | in natural language understanding, a multitask benchmark.
00:12:55.200 | When this was launched, a lot of us thought
00:12:58.040 | that GLUE would be too difficult for present-day systems.
00:13:01.520 | It looked like this might be a challenge
00:13:03.200 | that would stand for a very long time,
00:13:05.440 | but it took like less than a year
00:13:07.800 | for systems to pass human performance.
00:13:10.720 | The response was SuperGLUE, but it was saturated,
00:13:14.080 | if anything, even more quickly.
00:13:16.200 | Now, we can be as cynical as we want
00:13:19.080 | about this notion of human performance,
00:13:20.880 | and I think we should dwell on whether or not
00:13:22.920 | it's fair to call it that, but even setting that aside,
00:13:26.440 | this looks like undeniably a story of progress.
00:13:31.040 | The systems that we had in 2012
00:13:33.640 | would not even have been able to enter the GLUE benchmark
00:13:37.160 | to say nothing of achieving scores like this.
00:13:39.840 | So something meaningful has happened.
00:13:42.000 | Now, you might think by the standards of AI,
00:13:44.560 | these datasets are kind of old.
00:13:46.400 | Here's a post from Jason Wei where he evaluated
00:13:48.800 | our latest and greatest large language models
00:13:51.640 | on a bunch of mostly new tasks
00:13:53.760 | that were actually designed to stress test
00:13:56.480 | this new class of very large language models.
00:13:59.920 | Jason's observation is that we see emergent abilities
00:14:03.360 | across more than 100 tasks for these models,
00:14:05.880 | especially for our largest models.
00:14:08.320 | The point, though, is that we, again,
00:14:10.040 | thought these tasks would stand for a very long time,
00:14:13.280 | and what we're seeing instead is that one by one,
00:14:16.120 | systems are certainly getting traction,
00:14:17.960 | and in some cases, performing at the standard
00:14:20.880 | we had set for humans.
00:14:22.880 | Again, an incredible story of progress there.
00:14:26.520 | So I hope that is energizing, maybe a little intimidating,
00:14:31.960 | but I hope fundamentally energizing for you all.
00:14:34.800 | The next question that I wanna ask for you
00:14:38.280 | is just what is going on?
00:14:40.320 | What is driving all of this sudden progress?
00:14:43.560 | Let's get a feel for that, and that'll kind of serve
00:14:46.120 | as the foundation for the course itself.
00:14:49.120 | Before I do that, though, are there questions or comments,
00:14:52.960 | things I could resolve, or things I left out
00:14:55.080 | about the current moment?
00:14:56.520 | - I asked Bard, and I think it did very well.
00:15:02.960 | - We should reflect, though, maybe as a group
00:15:06.280 | about what it means to do very well.
00:15:08.240 | My question for you, when you say it did well,
00:15:11.320 | what is the Major League Baseball rule
00:15:13.320 | about players gluing things onto their caps?
00:15:15.520 | - Rule 3.06.
00:15:17.360 | - You found the actual rule?
00:15:18.760 | - No, this is what Bard, well, I don't-
00:15:21.800 | - Did you find the rule?
00:15:22.640 | - I didn't find the rule.
00:15:23.480 | Bard found that rule and gave me that number.
00:15:24.960 | - Okay. - Is it accurate?
00:15:26.360 | - Yes, that is gonna be the question for us.
00:15:28.720 | I can get- - It's a direct quote, too,
00:15:30.320 | which is right for hallucination.
00:15:32.560 | - Well, I'm gonna show you the OpenAI models
00:15:34.400 | will offer me links, but the links go nowhere.
00:15:37.320 | (audience laughing)
00:15:40.160 | What you're pointing out, I think,
00:15:41.760 | is an increasing societal problem.
00:15:43.840 | These models are offering us what looks like evidence,
00:15:47.280 | but a lot of the evidence is just fabricated,
00:15:49.840 | and this is worse than offering no evidence at all.
00:15:52.640 | What I really need is someone who knows
00:15:54.480 | Major League Baseball to tell me,
00:15:56.240 | what is the rule about players and their caps?
00:15:59.920 | I want it from an expert human,
00:16:01.560 | not an expert language model.
00:16:04.640 | - Can we- - What's that?
00:16:06.480 | - Can we Google?
00:16:08.000 | - Be careful how you Google, though.
00:16:09.560 | I guess that's the lesson of 2023.
00:16:11.800 | All right, what's going on?
00:16:16.560 | Let's start to make some progress on this.
00:16:18.560 | Again, first, a little bit of historical context.
00:16:22.040 | I've got a timeline going back to the 1960s
00:16:25.200 | along the x-axis.
00:16:26.200 | This is more or less the start of the field itself.
00:16:29.080 | And in that early era,
00:16:31.120 | essentially all of the approaches
00:16:33.800 | were based in symbolic algorithms
00:17:36.000 | like the Chat-80 system that I showed you.
00:16:37.840 | In fact, that was kind of pioneered here at Stanford
00:16:40.800 | by people who were pioneering the very field of AI.
00:16:44.040 | And that paradigm of essentially programming these systems
00:16:47.760 | lasted well into the 1980s.
00:16:50.280 | In the '90s, early 2000s,
00:16:54.840 | we get the statistical revolution
00:16:57.120 | throughout artificial intelligence,
00:16:58.800 | and then in turn in natural language processing.
00:17:01.600 | And the big change there is that
00:17:03.520 | instead of programming systems with all these rules,
00:17:06.280 | we're gonna design machine learning systems
00:17:08.240 | that are gonna try to learn from data.
00:17:10.400 | Under the hood,
00:17:11.240 | there was still a lot of programming involved
00:17:13.200 | because we would write a lot of feature functions
00:17:15.840 | that were little programs
00:17:16.960 | that would help us detect things about data.
00:17:19.360 | And we would hope that our machine learning systems
00:17:21.360 | could learn from the output of those feature functions.
00:17:24.720 | But in the end,
00:17:25.880 | this was the rise of the fully data-driven learning systems.
00:17:29.680 | And we just hope that some process of optimization
00:17:32.800 | leads us to new capabilities.
00:17:34.960 | The next big phase of this was the deep learning revolution.
00:17:39.400 | This happened starting around 2009, 2010.
00:17:42.480 | Again, Stanford was at the forefront of this to be sure.
00:17:46.360 | It felt like a big change at the time,
00:17:48.240 | but in retrospect,
00:17:49.660 | this is kind of not so different from this mode here.
00:17:52.280 | It's just that we now replace that simple model
00:17:56.280 | with really big models,
00:17:58.280 | really deep models that have a tremendous capacity
00:18:01.480 | to learn things from data.
00:18:03.400 | We started also to see a shift even further away
00:18:07.100 | from those feature functions,
00:18:08.640 | from writing little programs,
00:18:10.320 | and more toward a mode
00:18:12.000 | where we would just hope that the data
00:18:14.360 | and the optimization process could do all the work for us.
00:18:17.980 | Then the next big thing that happened,
00:18:21.560 | which could take us, I suppose, until about 2018,
00:18:25.280 | would be this mode
00:18:26.120 | where we have a lot of pre-trained parameters.
00:18:28.360 | These are pictures of maybe big language models
00:18:30.840 | or computer vision models or something.
00:18:32.920 | And when we build systems,
00:18:34.260 | we build on those pre-trained components
00:18:36.940 | and stitch them together
00:18:38.440 | with these task-specific parameters.
00:18:41.040 | And we hope that when they're all combined
00:18:42.960 | and we do some learning on some task-specific data,
00:18:46.320 | we have something that's benefiting
00:18:48.000 | from all these pre-trained components.
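Here is a minimal sketch of that pattern in code, assuming the Hugging Face transformers library and a BERT-style checkpoint; the checkpoint name, label count, and tiny batch are illustrative placeholders, not the specific systems from the lecture.

```python
# Sketch of "pre-trained components + task-specific parameters."
# Assumes the Hugging Face transformers library; checkpoint and labels are
# illustrative, not the lecture's actual systems.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The encoder weights are pre-trained; the small classification head on top
# is the task-specific part, randomly initialized.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# A single supervised step on task-specific data updates head and encoder jointly.
batch = tokenizer(["a great movie", "a dull movie"],
                  return_tensors="pt", padding=True)
labels = torch.tensor([1, 0])
loss = model(**batch, labels=labels).loss
loss.backward()  # gradients reach both the new head and the pre-trained encoder
```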
00:18:51.200 | And then the mode that we seem to be in now
00:18:53.980 | that I want us to reflect critically on
00:18:56.440 | is this mode where we're gonna replace everything
00:18:59.060 | with maybe one ginormous language model of some kind
00:19:03.500 | and hope that that thing, that enormous black box,
00:19:06.720 | will do all the work for us.
00:19:08.720 | We should think critically
00:19:09.900 | about whether that's really the path forward,
00:19:12.040 | but it certainly feels like the zeitgeist to be sure.
00:19:16.240 | Question, yeah.
00:19:17.080 | - If you think it's worth it,
00:19:18.840 | could you go back to the last slide
00:19:21.040 | and maybe explain a little bit,
00:19:23.520 | a more grounded example of what that all means?
00:19:25.480 | I couldn't quite follow.
00:19:27.040 | - Let's do that later.
00:19:28.600 | The point for now though is really this shift from here
00:19:32.920 | where we're mostly learning from scratch for our task.
00:19:36.840 | Here, we've got things like BERT in the mix.
00:19:39.720 | We've got pre-trained components,
00:19:42.200 | models that we hope begin in a state
00:19:44.680 | that gives us a leg up on the problem we're trying to solve.
00:19:47.680 | That's the big thing that happened.
00:19:49.160 | And you get this emphasis
00:19:51.040 | on people releasing model parameters.
00:19:53.840 | In this earlier phase like here,
00:19:56.440 | there was no talk of releasing model parameters
00:19:58.880 | because mostly the models people trained
00:20:01.520 | were just good for the task that they had set.
00:20:04.220 | As we move into this era, and then certainly this one,
00:20:07.640 | these things are meant to be like
00:20:09.480 | general purpose language capabilities
00:20:11.680 | or maybe general purpose computer vision capabilities
00:20:14.920 | that we stitch together into a system
00:20:17.040 | that can do more than any previous system could do.
00:20:20.600 | Right, so then we have this big thing here.
00:20:26.720 | So that's the feeling now.
00:20:28.680 | Behind all of this,
00:20:30.240 | certainly beginning in this final phase here,
00:20:32.920 | is the transformer architecture.
00:20:35.240 | Just let me take the temperature of the room.
00:20:36.840 | How many people have encountered the transformer before?
00:20:39.640 | Right, yeah, it's sort of unavoidable
00:20:42.480 | if you're doing this research.
00:20:43.720 | Here's a diagram of it,
00:20:45.160 | but I'm not gonna go through this diagram now
00:20:47.480 | because starting on Wednesday,
00:20:49.920 | we are gonna have an entire lecture
00:20:52.200 | essentially devoted to unpacking this thing
00:20:54.700 | and understanding it.
00:20:56.040 | All I can say for you now is that I expect you
00:20:59.220 | to go on the following journey, which all of us go on.
00:21:02.560 | How on earth does the transformer work?
00:21:04.880 | It looks very, very complicated.
00:21:07.480 | I hope I can get you to the point where you feel,
00:21:09.960 | oh, this is actually pretty simple components
00:21:12.900 | that have been combined in a pretty straightforward way.
00:21:16.000 | That's your second step on the journey.
00:21:17.600 | The true enlightenment comes from, wait a second,
00:21:21.280 | why does this work at all?
00:21:23.440 | And then you're with the entire field
00:21:25.600 | trying to understand why these simple things,
00:21:28.080 | brought together in this way, have proved so powerful.
00:21:30.980 | The other major thing that happened,
00:21:35.820 | which is kind of latent going all the way back
00:21:38.220 | to the start of AI, especially as it relates to linguistics,
00:21:42.120 | is this notion of self-supervision,
00:21:44.360 | of distributional learning,
00:21:46.280 | because this is gonna unlock the door to us
00:21:48.800 | just learning from the world in the most general sense.
00:21:52.980 | In self-supervision, your model's only goal
00:21:56.880 | is to learn from co-occurrence patterns
00:21:59.640 | in the sequences that it's trained on.
00:22:01.600 | And the sequences can be language,
00:22:03.260 | but they could be language plus sensor readings,
00:22:06.000 | computer code, maybe even images
00:22:08.480 | that you embed in this space, just symbols.
00:22:11.320 | And the model's only goal is to learn
00:22:13.360 | from the distributional patterns that they contain,
00:22:16.720 | or for many of these models, to assign high probability
00:22:20.140 | to the attested sequences in whatever data that you pour in.
00:22:24.160 | For this kind of learning, we don't need to do any labeling.
00:22:27.840 | All we need to do is have lots and lots of symbol streams.
00:22:32.600 | And then when we generate from these models,
00:22:35.600 | we're sampling from them, and that's what we all think of
00:22:37.960 | when we think of prompting and getting a response back.
00:22:40.240 | But the underlying mechanism is, at least in part,
00:22:43.720 | this notion of self-supervision.
00:22:45.360 | And I'll emphasize again, 'cause I think
00:22:46.960 | this is really important for why these models
00:22:48.820 | are so powerful, the symbols do not need to be just language.
00:22:52.800 | They can include lots of other things
00:22:54.920 | that might help a model piece together
00:22:57.960 | a full picture of the world we live in,
00:23:00.400 | and also the connections between language
00:23:02.600 | and those pieces of the world,
00:23:04.480 | just from this distributional learning.
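To make that objective concrete, here is a minimal sketch of the self-supervised, next-token training signal; the toy vocabulary and the small LSTM stand-in are assumptions for illustration, not the models discussed in the lecture.

```python
# Sketch of the self-supervised objective: assign high probability to attested
# sequences by predicting each next token from the ones before it.
# Tiny vocabulary and model are illustrative; no labels, just the symbol stream.
import torch
import torch.nn as nn

vocab = {"<bos>": 0, "better": 1, "late": 2, "than": 3, "never": 4}
ids = torch.tensor([[0, 1, 2, 3, 4]])  # "<bos> better late than never"

embed = nn.Embedding(len(vocab), 16)
lstm = nn.LSTM(16, 16, batch_first=True)  # stand-in for any sequence model
head = nn.Linear(16, len(vocab))

hidden, _ = lstm(embed(ids[:, :-1]))      # predict token t from tokens before t
logits = head(hidden)

# Cross-entropy against the attested next tokens is the negative log-likelihood
# of the sequence under the model. Minimizing it is the whole training signal.
loss = nn.functional.cross_entropy(
    logits.reshape(-1, len(vocab)), ids[:, 1:].reshape(-1)
)
loss.backward()
```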
00:23:07.800 | The result of this proving so powerful
00:23:10.800 | is the advent of large-scale pre-training,
00:23:13.600 | because now we're not held back anymore
00:23:16.480 | by the need for labeled data.
00:23:18.280 | All we need is lots of data in unstructured format.
00:23:22.400 | This really begins in the era of static word representations
00:23:26.200 | like Word2Vec and GloVe.
00:23:28.800 | And in fact, those teams, and I would say
00:23:30.560 | especially the GloVe team, they were really visionary
00:23:33.340 | in the sense that they not only released a paper and code,
00:23:38.340 | but pre-trained parameters.
00:23:41.740 | This was really brand new for the field,
00:23:44.060 | this idea that you would empower people
00:23:46.360 | with model artifacts, and people started using them
00:23:50.260 | as the inputs to recurrent neural networks and other things.
00:23:54.580 | And you started to see pre-training
00:23:57.580 | as an important component to doing really well
00:24:00.180 | at hard things.
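A minimal sketch of that setup, with randomly generated vectors standing in for released GloVe parameters (loading an actual GloVe file is assumed, not shown):

```python
# Sketch of static pre-training: released word vectors initialize an embedding
# layer that feeds a downstream recurrent model. The random vectors here are
# stand-ins for real GloVe parameters, an assumption for illustration.
import torch
import torch.nn as nn

vocab = ["the", "movie", "was", "great"]
pretrained = torch.randn(len(vocab), 50)       # stand-in for 50-d GloVe vectors

embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)
rnn = nn.LSTM(50, 64, batch_first=True)        # task model built on top

ids = torch.tensor([[0, 1, 2, 3]])             # "the movie was great"
outputs, _ = rnn(embedding(ids))               # downstream task consumes these states
```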
00:24:02.500 | There were some predecessors
00:24:04.460 | that I'll talk about next time,
00:24:06.220 | but the really big moment for contextual representations
00:24:10.020 | is the ELMo model.
00:24:11.360 | This is the paper,
00:24:12.200 | Deep Contextualized Word Representations.
00:24:14.380 | I can remember being at the North American ACL meeting
00:24:18.340 | in New Orleans in 2018 at the best paper session.
00:24:22.420 | They had not announced which of the best papers
00:24:24.820 | was gonna win the outstanding paper award,
00:24:27.320 | but we all knew it was gonna be the ELMo paper
00:24:30.260 | because the gains that they had reported
00:24:33.440 | from fine-tuning their ELMo parameters
00:24:35.500 | on hard tasks for the field were just mind-blowing,
00:24:38.740 | the sort of thing that you really only see once
00:24:41.360 | in a kind of generation of this research,
00:24:43.640 | or so we thought.
00:24:44.860 | Because the next year, BERT came out,
00:24:48.980 | same thing, I think same best paper award thing.
00:24:51.820 | The paper already had had huge impact
00:24:54.500 | by the time it was even published,
00:24:56.300 | and they too released their model parameters.
00:25:00.220 | ELMo is not transformer-based.
00:25:02.260 | BERT is the first of the sequence of things
00:25:04.340 | that's based in the transformer,
00:25:05.740 | and again, lifting all boats even above
00:25:08.660 | where ELMo had brought us.
00:25:10.820 | Then we get GPT.
00:25:12.420 | This is the first GPT paper,
00:25:14.220 | and then fast forward a little bit, we get GPT-3,
00:25:17.220 | and that was pre-training at a scale
00:25:21.320 | that was previously kind of unimaginable
00:25:24.300 | 'cause this, now we're talking about,
00:25:26.540 | for the BERT model, 100 million parameters,
00:25:28.820 | and for GPT-3, well north of 100 billion.
00:25:33.020 | Different order of magnitude,
00:25:34.900 | and what we started to see is emergent capabilities.
00:25:38.240 | That model size thing is important.
00:25:41.860 | Again, this is a sort of feeling of progress
00:25:43.900 | and maybe also despair.
00:25:45.500 | I think I can lift your spirits a little bit,
00:25:47.600 | but we should think about model size.
00:25:50.260 | So I have years along the x-axis again,
00:25:53.420 | and I have model size going from 100 million
00:25:55.880 | to one trillion here on a logarithmic scale.
00:25:59.600 | So 2018, GPT, that's like 100 million. BERT,
00:26:03.720 | I think it's 300 million for the large one.
00:26:06.360 | Okay, GPT-2, even larger.
00:26:09.000 | Megatron, 8.3 billion.
00:26:10.720 | I remember when this came out, I probably laughed.
00:26:13.960 | Maybe I thought it was a joke.
00:26:15.240 | I certainly thought it was some kind of typo
00:26:17.260 | because I couldn't imagine that it was actually billion,
00:26:20.340 | like with a B there.
00:26:23.220 | But now, that's, you know, we take that for granted.
00:26:26.260 | Megatron, 11 billion.
00:26:27.700 | This is 2021 or so.
00:26:30.300 | Then we get GPT-3, reportedly at 175 billion parameters.
00:26:34.940 | And then we get this thing where it seems
00:26:36.660 | like we're doing typos again.
00:26:38.180 | Megatron-Turing NLG was like 530 billion,
00:26:41.700 | and then PaLM is 540 billion parameters.
00:26:45.780 | And I guess there are rumors that we have gone upward
00:26:48.340 | all the way to a trillion, right?
00:26:51.500 | There's an undeniable trend here.
00:26:54.280 | I think there is something to this trend,
00:26:57.260 | but we should reflect on it a little bit.
00:26:59.200 | One thing I wanna say is there's a noteworthy pattern
00:27:03.080 | here: very few entities have participated
00:27:06.380 | in this race for very large models.
00:27:09.540 | We've got like Google, NVIDIA, Meta, and OpenAI, right?
00:27:14.540 | And that was actually a real cause for concern.
00:27:16.780 | I remember being at a workshop
00:27:18.340 | between Stanford and OpenAI,
00:27:20.820 | where the number one source of consternation
00:27:23.220 | was really that only OpenAI at that point
00:27:27.020 | had trained these really large models.
00:27:29.060 | And after that, predictably,
00:27:30.560 | these other large tech companies kind of caught up.
00:27:33.760 | But it was still for a while looking like a story
00:27:36.440 | of real centralization of power.
00:27:38.740 | That might still be happening,
00:27:40.900 | but I think there's reason to be optimistic.
00:27:42.500 | So here at Stanford, the HELM group,
00:27:44.780 | which is part of the Center for Research
00:27:46.560 | on Foundation Models, led this incredibly ambitious project
00:27:50.400 | of evaluating lots of language models.
00:27:52.700 | And one thing that emerges from that
00:27:54.540 | is that we have a more healthy ecosystem now.
00:27:57.440 | So we have these like loose collectives,
00:27:59.280 | BigScience and EleutherAI
00:28:00.460 | are both kind of fully open source groups of researchers.
00:28:04.140 | We've got, well, one academic institution represented.
00:28:07.700 | This could be a little bit embarrassing for Stanford.
00:28:09.720 | Maybe we'll correct that.
00:28:11.380 | And then maybe the more important thing
00:28:12.780 | is that we have lots of startups represented.
00:28:14.960 | So these are well-funded, but relatively small outfits
00:28:18.500 | that are producing outstanding language models.
00:28:21.720 | And so the result,
00:28:22.800 | I think we're gonna see much more of this,
00:28:24.860 | and then we'll worry less about centralization of power.
00:28:28.540 | There's plenty of other things to worry about,
00:28:30.380 | so we shouldn't get sanguine about this,
00:28:31.900 | but this particular point, I think,
00:28:34.160 | is being alleviated by current trends.
00:28:36.820 | And there's another aspect of this too,
00:28:39.020 | which is you have this scary rise in model size,
00:28:42.220 | but what is happening right now as we speak
00:28:45.500 | in a very quick way is we're seeing
00:28:47.740 | a push towards smaller models.
00:28:49.980 | And in particular, we're seeing that models
00:28:51.920 | that are in the range of like 10 billion parameters
00:28:55.140 | can be highly performant, right?
00:28:56.940 | So we have the Flan models, we have LLaMA,
00:29:00.660 | and then here at Stanford, they released the Alpaca thing,
00:29:03.460 | and then Databricks released the Hello Dolly model.
00:29:06.580 | These are all models that are like
00:29:07.920 | eight to 10 billion parameters,
00:29:09.780 | which I know this sounds funny
00:29:11.220 | because I laughed a few years ago
00:29:13.120 | when the Megatron model had 8.3 billion,
00:29:15.620 | and now what I'm saying to you
00:29:16.900 | is that this is relatively small, but so it goes.
00:29:20.140 | And the point is that a 10 billion parameter model
00:29:23.220 | is one that could be run on regular old commercial hardware,
00:29:27.620 | whereas these monsters up here,
00:29:29.540 | really you have lots of pressures
00:29:31.240 | towards centralization of power there
00:29:32.800 | because almost no one can work with them.
00:29:35.700 | But anyone essentially can work with alpaca,
00:29:38.100 | and it won't be long before we've got the ability
00:29:40.660 | to kind of work with it on small devices
00:29:42.780 | and things like that.
00:29:43.940 | And that too is really gonna open the door
00:29:47.180 | to lots of innovation.
00:29:48.480 | I think that will bring some good,
00:29:50.100 | and I think it will bring some bad,
00:29:51.780 | but it is certainly a meaningful change
00:29:53.780 | from this scary trend that we were seeing
00:29:55.820 | until four months ago.
00:29:58.020 | As a result of these models being so powerful,
00:30:05.540 | people started to realize
00:30:07.540 | that you can get a lot of mileage out of them
00:30:10.140 | simply by prompting them.
00:30:12.340 | When you prompt one of these very large models,
00:30:14.580 | you put it in a temporary state by inputting some text,
00:30:18.180 | and then you generate a sample from the model
00:30:20.260 | using some technique, and you see what comes out, right?
00:30:22.660 | So if you type into one of these models,
00:30:24.620 | better late than, it's gonna probably spit out never.
00:30:28.300 | If you put in every day, I eat breakfast, lunch,
00:30:31.940 | and it will probably say dinner.
00:30:34.420 | And you might have an intuition that the reasons,
00:30:36.620 | the causes for that are kind of different.
00:30:38.340 | The first one is a sort of idiom,
00:30:40.400 | so that it could just learn from co-occurrence patterns
00:30:42.900 | in text transparently.
00:30:44.760 | For the second one, we kind of interpreted as humans
00:30:47.960 | as reflecting something about routines,
00:30:50.740 | but you should remind yourself
00:30:52.900 | that the mechanism is the same as in the first case.
00:30:56.040 | This was just a bunch of co-occurrence patterns.
00:30:58.240 | A lot of people described their routines in text,
00:31:01.180 | and the model picked up on that.
00:31:03.280 | And carry that thought forward
00:31:04.820 | as you think about things like the president of the US is.
00:31:08.100 | When it fills that in with Biden or whoever,
00:31:11.740 | it might look like it is offering us factual knowledge,
00:31:14.380 | and maybe in some sense it is,
00:31:16.540 | but it's the same mechanism as for those first two examples.
00:31:19.880 | It is just learning from the fact that a lot of people
00:31:23.120 | have expressed a lot of texts that look like
00:31:25.360 | the president of the US is Joe Biden,
00:31:27.600 | and it is repeating that back to us.
00:31:30.040 | And so definitely, if you ask a model something like
00:31:33.220 | the key to happiness is, you should remember
00:31:36.460 | that this is just the aggregate of a lot of data
00:31:39.300 | that it was trained on.
00:31:40.220 | It has no particular wisdom to offer you necessarily
00:31:43.680 | beyond what was encoded latently in that giant sea
00:31:48.680 | of mostly unaudited, unstructured text.
00:31:53.380 | Yeah, question.
00:31:55.120 | - I guess it would be kind of hard
00:31:58.260 | to get something like this,
00:31:59.180 | but if we had a corpus of just like,
00:32:01.900 | all the languages, right,
00:32:03.220 | but literally all of the facts were wrong.
00:32:05.580 | We just imagine like a very factually incorrect corpus.
00:32:08.740 | Like, I guess I'm getting at like,
00:32:12.160 | how do we inject like truth into like these corpuses?
00:32:15.660 | - It's a question that bears repeating.
00:32:19.060 | How do we inject truth?
00:32:20.740 | It's a question you all could think about.
00:32:23.420 | What is truth, of course,
00:32:25.020 | but also what would that mean and how would we achieve it?
00:32:28.960 | And even if we did back off to something like,
00:32:31.580 | how would we ensure self-consistency for a model?
00:32:34.660 | Or, you know, at the level of a worldview
00:32:36.620 | or a set of facts,
00:32:37.860 | even those questions which seem easier to pose
00:32:40.980 | are incredibly difficult questions in the current moment
00:32:44.340 | where our only mechanisms are basically
00:32:46.540 | that self-supervision thing that I described,
00:32:49.040 | and then a little bit of what I'll talk about next.
00:32:51.940 | But none of the structure that we used to have
00:32:55.000 | where we would have a database of knowledge
00:32:56.840 | and things like that,
00:32:58.300 | that is posing problems.
00:33:00.080 | (laughs)
00:33:02.160 | The prompting thing, we take this a step forward, right?
00:33:07.760 | So the GPT-3 paper,
00:33:09.480 | remember that's that 175 billion parameter monster.
00:33:13.400 | The eye-opening thing about that
00:33:15.200 | is what we now call in-context learning,
00:33:18.140 | which was just the notion that for these very large,
00:33:20.920 | very capable models,
00:33:22.080 | you could input a bunch of texts,
00:33:24.380 | like here's a passage,
00:33:26.200 | and maybe an example of the kind of behavior
00:33:28.280 | that you wanted,
00:33:29.500 | and then your actual question,
00:33:31.520 | and the model would do a pretty good job
00:33:33.480 | at answering the question.
00:33:35.400 | And what you're doing here is with your context passage
00:33:38.280 | and your demonstration,
00:33:39.400 | pushing the model to be extractive,
00:33:41.840 | to find an answer to the question in the context passage.
00:33:46.200 | And then the observation of this paper
00:33:48.400 | is that they do a pretty good job
00:33:50.100 | at following that same behavior
00:33:52.280 | for the actual target question at the bottom here.
00:33:55.080 | Remember, this is all just prompting,
00:33:57.400 | putting the model in a temporary state
00:33:59.700 | and seeing what comes out.
00:34:00.860 | You don't change the model,
00:34:02.440 | you just prompt it.
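Here is a minimal sketch of assembling such an in-context-learning prompt; the passage, demonstration, and target question are invented for illustration, and nothing about the underlying model is changed by this.

```python
# Sketch of in-context learning: the "training" is just text in the prompt.
# The passage and questions are illustrative assumptions.
def build_prompt(passage, demo_q, demo_a, target_q):
    return (
        f"Passage: {passage}\n"
        f"Q: {demo_q}\n"
        f"A: {demo_a}\n"   # one demonstration pushes the model to be extractive
        f"Q: {target_q}\n"
        f"A:"
    )

prompt = build_prompt(
    passage="Stanford University was founded in 1885 and opened in 1891.",
    demo_q="When was Stanford University founded?",
    demo_a="1885",
    target_q="When did Stanford University open?",
)
print(prompt)
# A capable model, given this prompt, would likely complete with "1891",
# mirroring the demonstrated behavior for the new question.
```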
00:34:03.920 | This, in 2012, if you had asked me
00:34:07.460 | whether this was a viable path forward for a class project,
00:34:10.320 | I want to prompt an RNN or something,
00:34:13.140 | I would have advised you as best I could
00:34:15.500 | to choose some other topic
00:34:16.940 | because I never would have guessed that this would work.
00:34:19.880 | So the mind-blowing thing about this paper
00:34:23.760 | and everything that's followed
00:34:25.300 | is that we might be nearing the point
00:34:27.040 | where we can design entire AI systems
00:34:29.900 | on the basis of this simple in-context learning mechanism,
00:34:33.780 | transformatively different from anything that we saw before.
00:34:37.140 | In fact, let me just emphasize this a little bit.
00:34:41.220 | It is worth dwelling on how strange this is.
00:34:44.580 | For those of you who have been in the field a little while,
00:34:48.020 | just contrast what I described in-context learning
00:34:51.220 | with the standard mode of supervision.
00:34:55.060 | Let's imagine for a case here
00:34:56.780 | that we want to train a model
00:34:58.580 | to detect nervous anticipation.
00:35:00.940 | And I have picked this
00:35:01.900 | because this is a very particular human emotion.
00:35:05.380 | And in the old mode,
00:35:06.460 | we would need an entire dedicated model to this, right?
00:35:09.700 | We would collect a little dataset
00:35:11.860 | of positive and negative instances of nervous anticipation,
00:35:16.040 | and we would train a supervised classifier
00:35:19.040 | on feature representations of these examples over here,
00:35:22.340 | learning from this binary distinction.
00:35:25.500 | We would need custom data and a custom model
00:35:28.420 | for this particular task in all likelihood.
00:35:30.900 | In this new mode, few-shot in-context learning,
00:35:35.580 | we essentially just prompt the model,
00:35:37.140 | "Hey, model, here's an example of nervous anticipation."
00:35:40.620 | My palms started to sweat
00:35:41.940 | as the lotto numbers were read off.
00:35:43.740 | "Hey, model, here's an example
00:35:45.380 | without nervous anticipation," and so forth.
00:35:48.220 | And it learns from all those symbols that you put in
00:35:52.700 | and their co-occurrences,
00:35:54.440 | something about nervous anticipation.
00:35:58.180 | On the left for this model here,
00:35:59.960 | I've written out nervous anticipation,
00:36:01.660 | but remember, that has no special status.
00:36:03.860 | I've structured the model around the binary distinction,
00:36:07.340 | the one and the zero.
00:36:08.940 | And everything about the model
00:36:10.320 | is geared toward my learning goal.
00:36:12.220 | On the right, nervous anticipation is just more
00:36:16.200 | of the symbols that I've put into the model.
00:36:19.220 | And the eye-opening thing, again,
00:36:21.120 | about the GPT-3 paper and what's followed
00:36:24.140 | is that models can learn, be put in a temporary state,
00:36:28.280 | and do well at tasks like this.
00:36:30.280 | Now, I talked about self-supervision before,
00:36:36.020 | and I think that is a major component
00:36:38.180 | to the success of these models,
00:36:39.940 | but it is increasingly clear that it is not the only thing
00:36:43.620 | that is driving learning in the best models in this class.
00:36:47.840 | The other thing that we should think about
00:36:51.020 | is what's called reinforcement learning
00:36:52.940 | with human feedback.
00:36:54.740 | This is a diagram from the chat GPT blog post.
00:36:58.400 | There are a lot of details here,
00:36:59.780 | but really two of them are important for us for right now.
00:37:03.160 | The first is that in a phase of training these models,
00:37:08.160 | people are given inputs and ask themselves
00:37:12.260 | to produce good outputs for those inputs.
00:37:15.460 | So you might be asked to do a little Python program,
00:37:17.940 | and you yourself as an annotator
00:37:19.420 | might write that Python program, for example.
00:37:22.340 | So that's highly skilled work
00:37:24.020 | that depends on a lot of human intelligence.
00:37:26.740 | And those examples, those pairs,
00:37:28.960 | are part of how the model is trained.
00:37:31.460 | And that is so important because that takes us way beyond
00:37:34.560 | just learning from co-occurrence patterns
00:37:36.900 | of symbols and text.
00:37:38.440 | It is now back to a very familiar story from all of AI,
00:37:43.060 | which is that it's not magic.
00:37:45.220 | What is happening is that a lot of human intelligence
00:37:48.400 | is driving the behavior of these systems.
00:37:51.940 | And that happens again at step two here.
00:37:54.220 | So now the model produces different outputs,
00:37:56.740 | and humans come in and rank those outputs,
00:37:59.220 | again, expressing direct human preferences
00:38:02.660 | that take us well beyond self-supervision.
00:38:05.460 | So we should remember, we had that brief moment
00:38:08.200 | where it looked like it was all unstructured, unlabeled data,
00:38:11.380 | and that was important to unlocking these capacities,
00:38:14.460 | but now we are back at a very labor-intensive
00:38:17.660 | human capacity here, driving what looked like
00:38:21.340 | the really important behaviors for these models.
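Here is a minimal sketch of how that step-2 ranking signal is commonly turned into a training objective, a pairwise reward-model loss; the encoder features and dimensions are stand-ins, and this is a generic formulation, not any particular lab's implementation.

```python
# Human rankings are turned into pairs (preferred, dispreferred), and a reward
# model is trained so the preferred output scores higher.
import torch
import torch.nn as nn

reward_model = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 1))

# Stand-ins for encoded (prompt, response) pairs; in practice these would come
# from a large pre-trained encoder, which is an assumption here.
chosen = torch.randn(8, 768)    # outputs humans ranked higher
rejected = torch.randn(8, 768)  # outputs humans ranked lower

margin = reward_model(chosen) - reward_model(rejected)
loss = -nn.functional.logsigmoid(margin).mean()  # push preferred above dispreferred
loss.backward()

# The learned reward model then provides the training signal for the policy
# in the reinforcement-learning stage.
```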
00:38:24.020 | Final step, which I think actually intimately relates
00:38:29.820 | to that instruct tuning that I just described.
00:38:32.100 | That's a kind of way of summarizing
00:38:33.500 | this reinforcement learning with human feedback.
00:38:36.340 | And this is what's called step-by-step
00:38:38.380 | or chain-of-thought reasoning.
00:38:39.720 | Now we're thinking about the prompts
00:38:41.300 | that we use for these models.
00:38:43.520 | So suppose we asked ourselves a question like,
00:38:45.500 | can models reason about negation?
00:38:48.020 | To give an example, does the model know
00:38:50.000 | that if the customer doesn't have any loans,
00:38:54.500 | then the customer doesn't have any auto loans?
00:38:57.140 | It's a simple example.
00:38:58.340 | It's the sort of reasoning that you might have to do
00:39:00.180 | if you're thinking about a contract or something like that,
00:39:03.240 | whether a rule has been followed.
00:39:05.300 | And it just involves negation,
00:39:07.780 | our old friend from the start of the lecture.
00:39:10.860 | Now in the old school prompting style,
00:39:13.580 | all the way back in 2021,
00:39:15.680 | we would kind of naively just input,
00:39:18.340 | is it true that if the customer doesn't have any loans,
00:39:20.700 | then the customer doesn't have any auto loans
00:39:22.840 | into one of these models?
00:39:24.280 | And we would see what came back.
00:39:26.360 | And here it says, no, this is not necessarily true.
00:39:28.580 | A customer can have auto loans
00:39:30.000 | without having any other loans,
00:39:31.400 | which is the reverse of the question that I asked.
00:39:34.900 | Again, kind of showing it doesn't deeply understand
00:39:37.840 | what we put in here.
00:39:38.920 | It just kind of does an act that looks like it did.
00:39:42.060 | And that's worrisome.
00:39:44.280 | But we're learning how to communicate
00:39:46.320 | with these very alien creatures.
00:39:47.680 | Now we do what's called step-by-step prompting.
00:39:50.160 | This is the cutting edge thing.
00:39:51.740 | You would just tell the model that it was in some kind
00:39:53.780 | of logical or common sense reasoning exam.
00:39:56.620 | That matters to the model.
00:39:58.540 | Then you could give some instructions,
00:40:00.680 | and then you could give an example in your prompts
00:40:03.100 | of the kind of thing it was gonna see.
00:40:05.680 | And then finally you could prompt it with your premise,
00:40:08.300 | and then your question.
00:40:09.780 | And the model would spit out something
00:40:11.780 | that looked really good.
00:40:13.140 | Here, I won't bother going through the details,
00:40:15.400 | but with that kind of prompt,
00:40:18.260 | the model now not only answers and reasons correctly,
00:40:21.440 | but also offers a really nice explanation
00:40:23.900 | of its own reasoning.
00:40:25.740 | The capacity was there.
00:40:27.440 | It was latent, and we didn't see it
00:40:29.500 | in the simple prompting mode,
00:40:31.300 | but the more sophisticated prompting mode elicited it.
00:40:35.080 | And I think this is in large part the result
00:40:38.460 | of the fact that this model was instruct tuned.
00:40:40.820 | And so people actually taught it
00:40:42.700 | about how that markup is supposed to work,
00:40:45.380 | and how it's supposed to think about prompts like this.
00:40:47.980 | So the combination of all that human intelligence
00:40:50.340 | and the capacity of the model led to this really interesting
00:40:53.500 | and much better behavior.
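Here is a minimal sketch of the two prompting styles just contrasted; the exam framing, instructions, and worked example are illustrative wording, not the exact prompts from the lecture.

```python
# Naive prompt vs. step-by-step prompt for the negation example above.
naive_prompt = (
    "Is it true that if the customer doesn't have any loans, "
    "then the customer doesn't have any auto loans?"
)

step_by_step_prompt = (
    "You are taking a logical and commonsense reasoning exam.\n"
    "Carefully read the premise, reason step by step, and answer "
    "'yes' or 'no' with a brief explanation.\n\n"
    "Example:\n"
    "Premise: The customer doesn't own any vehicles.\n"
    "Question: Is it true that the customer doesn't own any cars?\n"
    "Reasoning: Cars are vehicles, so owning no vehicles entails owning no cars.\n"
    "Answer: yes\n\n"
    "Premise: The customer doesn't have any loans.\n"
    "Question: Is it true that the customer doesn't have any auto loans?\n"
    "Reasoning:"
)
# With the richer prompt, an instruct-tuned model is much more likely to reason
# correctly (auto loans are loans, so no loans entails no auto loans) and to
# explain itself, eliciting a capacity the naive prompt missed.
```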
00:40:54.980 | That is a glimpse of the foundations
00:41:02.540 | of all of this, I would say.
00:41:04.060 | Of course, we're gonna unpack all of that stuff
00:41:06.220 | as we go through the quarter,
00:41:07.940 | but I hope you're getting a sense for it.
00:41:10.020 | Are there questions I can answer about it?
00:41:11.920 | Things I could circle back on?
00:41:14.780 | - The human brain has about 100 billion neurons,
00:41:17.940 | is my understanding.
00:41:19.020 | And I'm not sure how many parameters that might be,
00:41:22.340 | maybe like 10 trillion parameters or something like that.
00:41:26.060 | Are we approaching a point where these machines
00:41:28.060 | can start emulating the human brain,
00:41:30.300 | or is there something to the language instinct,
00:41:33.140 | or, you know, instincts of all kinds
00:41:35.340 | that maybe take into the human brain?
00:41:37.180 | - Oh, it's nothing but big questions today.
00:41:41.440 | Right, so the question is kind of like,
00:41:43.900 | what is the relationship between the models
00:41:46.020 | we're talking about and the human brain?
00:41:47.700 | And you raised that in terms of the size,
00:41:49.980 | and I guess the upshot of your description
00:41:52.380 | was that these models remain smaller than the human brain.
00:41:55.820 | I think that's reasonable.
00:41:57.180 | It's tricky though.
00:41:59.340 | On the one hand, they obviously have superhuman capabilities.
00:42:02.560 | On the other hand, they fall down in ways that humans don't.
00:42:07.060 | It's very interesting to ask why that difference exists.
00:42:11.180 | And maybe that would tell us something
00:42:12.700 | about the limitations of learning from scratch
00:42:16.820 | versus being initialized by evolution,
00:42:19.180 | the way all of us were.
00:42:20.420 | I don't know, but I would say that
00:42:23.900 | underlying your whole line of questioning
00:42:26.100 | is the question, can we use these models
00:42:28.980 | to illuminate questions of neuroscience
00:42:31.180 | and cognitive science?
00:42:32.740 | And I think we should be careful,
00:42:34.700 | but that the answer is absolutely yes.
00:42:36.700 | And in fact, the increased ability of these models
00:42:40.660 | to learn from data has been really illuminating
00:42:44.340 | about certain recalcitrant questions
00:42:47.260 | from cognitive science in particular.
00:42:49.700 | You have to be careful because they're so different from us,
00:42:52.180 | these models.
00:42:53.300 | On the other hand, I think they are helping us understand
00:42:57.160 | how to differentiate different theories of cognition.
00:42:59.700 | And ultimately, I think they will help us
00:43:01.620 | understand cognition itself.
00:43:03.300 | And I would, of course, welcome projects
00:43:07.740 | that were focused on those cognitive questions in here.
00:43:09.980 | This is a wonderful space in which to explore
00:43:13.260 | this kind of more speculative angle,
00:43:16.140 | connecting AI to the cognitive sciences.
00:43:19.300 | Other questions, comments?
00:43:25.060 | Yes, in the back.
00:43:26.180 | - I would be curious to understand whether,
00:43:28.900 | I mean, partially following up on the brain thing,
00:43:31.300 | just to use a metaphor of our brain
00:43:33.340 | not being just one huge lump of neurons,
00:43:35.620 | but being separated into different areas.
00:43:38.300 | And then also thinking about the previous phase
00:43:41.220 | that you talked about, about breaking up the models
00:43:43.980 | and potentially having a model in the front
00:43:46.100 | that decides which domain our question falls into,
00:43:49.460 | and then having different sub-models.
00:43:52.460 | And I'm wondering whether that's arising,
00:43:54.260 | whether we're gonna touch on an architecture like that.
00:43:57.700 | Because it just seems natural to me
00:43:59.180 | because prompting a huge model
00:44:01.340 | is just very expensive computationally.
00:44:05.180 | It feels like combining big models and logic trees
00:44:08.820 | could be a cool approach.
00:44:10.820 | - I love it.
00:44:11.700 | Yeah, like one quick summary of what you said
00:44:13.820 | would relate directly to your question.
00:44:15.340 | The modularity of mind is an important old question
00:44:19.420 | about human cognition.
00:44:21.260 | To what extent are our abilities
00:44:23.500 | modularized in the mind-brain?
00:44:25.660 | With these current models,
00:44:29.060 | which have a capacity to do lots of different things
00:44:31.740 | if they have the right pre-training and the right structure,
00:44:34.140 | we could ask, does modularity emerge naturally?
00:44:37.340 | Or do they learn non-modular solutions?
00:44:40.260 | Both of those seem like they could be indirect evidence
00:44:43.500 | for how people work.
00:44:44.860 | Again, we have to be careful
00:44:46.140 | 'cause these models are so different from us.
00:44:48.060 | But as a kind of existence proof, for example,
00:44:50.220 | that modularity was emergent
00:44:51.820 | from otherwise unstructured learning,
00:44:54.220 | that would be certainly eye-opening, right?
00:44:56.820 | I have no idea.
00:44:58.660 | Yeah, I don't know whether there are results for that.
00:45:00.700 | Are there results?
00:45:02.180 | - No, just kind of a follow-up question on that as well.
00:45:06.020 | So given how closed all these big models are,
00:45:09.780 | how could we interact with the model in such a way
00:45:12.820 | that helps us learn if there is modularity?
00:45:15.700 | 'Cause we literally can only interact with it.
00:45:18.020 | So how do we go about studying that?
00:45:20.980 | Right, so the question is, you know,
00:45:22.740 | the closed-off nature of a lot of these models
00:45:26.300 | has been a problem.
00:45:27.260 | We can access the OpenAI models,
00:45:29.180 | but only through an API.
00:45:30.380 | We don't get to look at their internal representations.
00:45:33.220 | And that has been a blocker.
00:45:35.100 | But I mentioned the rise of these 10 billion parameter models
00:45:39.420 | as being performant and interesting.
00:45:41.620 | And those are models that, with the right hardware,
00:45:44.260 | you can dissect a little bit.
00:45:46.060 | And I think that's just gonna get better and better.
00:45:48.060 | And so we'll be able to, you know,
00:45:49.780 | peer inside them in ways
00:45:51.100 | that we haven't been able to until recently.
00:45:53.380 | Yeah.
00:45:54.220 | And in fact, like,
00:45:56.820 | we're gonna talk a lot about explainability.
00:45:59.020 | That's a major unit of this course.
00:46:00.580 | And I think it's an increasingly important area
00:46:03.340 | of the whole field that we have techniques
00:46:05.580 | for understanding these models
00:46:07.780 | so that we know how they're gonna behave
00:46:09.300 | when we deploy them.
00:46:10.620 | And it would be wonderfully exciting
00:46:12.140 | if you all wanted to try to scale
00:46:13.900 | the methods we talk about to a model
00:46:15.940 | that was as big as eight or 10 billion parameters.
00:46:18.700 | Ambitious just to do that,
00:46:20.500 | but then maybe a meaningful step forward.
00:46:22.580 | Yeah.
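Just to make "peering inside" concrete, here is a minimal sketch, using the Hugging Face transformers library, of pulling out a model's layer-by-layer hidden states. The checkpoint name is only an illustrative assumption, and real interpretability work goes far beyond printing shapes.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "EleutherAI/pythia-1.4b"   # assumed example of a small open checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)

inputs = tokenizer("Are pitchers allowed to wear white gloves?", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One tensor per layer (plus the embedding layer), each of shape
# (batch, sequence_length, hidden_size): raw material for probing,
# intervention, and other analysis methods.
for layer_idx, layer_states in enumerate(outputs.hidden_states):
    print(layer_idx, tuple(layer_states.shape))
```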
00:46:24.700 | - I have a question back to, like,
00:46:26.220 | this baseball cap prompt that we were discussing.
00:46:28.900 | So I suppose, like,
00:46:29.740 | a part of the way that we discuss rules
00:46:32.420 | is, like, there is a little bit of ambiguity
00:46:34.460 | for, like, human interpretation.
00:46:35.940 | Like, for example, in the Honor Code
00:46:37.860 | and the Fundamental Standard,
00:46:38.780 | like, it's intentionally ambiguous
00:46:40.580 | so that it's context dependent.
00:46:42.980 | And so, like, the idea is that there's, like,
00:46:44.700 | this inherent underlying value system
00:46:46.900 | that, like, affords whatever the rules
00:46:49.380 | that are written out are.
00:46:50.300 | And so that's, like, the primary form of evaluation.
00:46:54.580 | And so I guess, like, how does that play into, then,
00:46:56.980 | how these language models are understanding,
00:46:58.540 | like, is there some form of deeper
00:47:01.740 | value system that's encoded into them?
00:47:04.540 | - You could certainly ask.
00:47:07.220 | I mean, the essence of your question is,
00:47:08.780 | could we, with analysis techniques, say,
00:47:11.980 | find out that a model had a particular belief system
00:47:15.020 | that was guiding its behavior?
00:47:17.100 | I think we can ask that question now.
00:47:19.260 | It sounds fantastically difficult,
00:47:20.980 | but maybe piecemeal we could get,
00:47:22.500 | make some progress on it for sure.
00:47:24.580 | Yeah, I wanna return to the MLB one, though,
00:47:26.620 | because, well, as you'll see,
00:47:28.700 | and as I think we already saw,
00:47:30.180 | these models purport to offer evidence from a rule book,
00:47:33.580 | and that's where I feel stuck.
00:47:35.340 | - You're keeping score, Tom.
00:47:39.260 | I posted the answer and some other stuff
00:47:42.100 | in the class discussion.
00:47:44.340 | - Wonderful, thank you.
00:47:51.060 | - Can we just hook up these models to a large database
00:47:54.220 | of actual, verified information,
00:47:56.220 | say an encyclopedia, and allow it to,
00:47:58.700 | you know, look things up?
00:47:59.860 | - Well, kind of, yes.
00:48:04.180 | Actually, this is the sort of solution
00:48:05.540 | that I wanna advocate for.
00:48:06.860 | I'm gonna do this in a minute.
00:48:08.140 | Yeah.
00:48:08.980 | Here, let's, so we'll do this overview.
00:48:13.100 | I wanna give you a feel for how the course will work,
00:48:15.260 | and then dive into some of our major themes.
00:48:18.500 | So high-level overview, we've got these topics,
00:48:20.580 | contextual representations, transformers and stuff,
00:48:23.700 | multi-domain sentiment analysis,
00:48:25.300 | that will be the topic of the first homework,
00:48:27.540 | and it's gonna build on the first unit there.
00:48:30.620 | Retrieval augmented in-context learning,
00:48:32.700 | this is where we might hook up to a database
00:48:34.620 | and get some guarantees about how these models will behave.
00:48:38.300 | Compositional generalization.
00:48:40.300 | In case you were worried that all the tasks were solved,
00:48:42.540 | I'm gonna confront you with a task,
00:48:44.340 | a seemingly simple task about semantic interpretation
00:48:47.580 | that you will, well, I think it will not be solved.
00:48:50.180 | I mean, those could be famous last words,
00:48:51.860 | 'cause who knows what you all are capable of,
00:48:54.260 | but it's a very hard task that we will pose.
00:48:57.020 | We'll talk about benchmarking and adversarial training
00:48:59.340 | and testing, increasingly important topics
00:49:01.780 | as we move into this mode where everyone is interacting
00:49:04.220 | with these large language models,
00:49:05.900 | and feeling impressed by their behavior.
00:49:08.140 | We need to take a step back and rigorously assess
00:49:11.020 | whether they actually are behaving in good ways,
00:49:13.340 | or whether we're just biased toward remembering
00:49:15.420 | the good things and forgetting the bad ones.
00:49:17.660 | We'll do model introspection,
00:49:19.700 | that's the explainability stuff that I mentioned,
00:49:21.740 | and finally methods and metrics.
00:49:23.260 | And as you can see, for, like, topics five, six, and seven,
00:49:26.820 | that's gonna be in the phase of the course
00:49:28.540 | where you're focused on final projects,
00:49:31.260 | and I'm hoping that that gives you tools
00:49:33.060 | to write really rich final papers
00:49:35.140 | that have great analysis in them,
00:49:37.620 | and really excellent assessments.
00:49:40.460 | And then for the work that you'll do,
00:49:41.900 | we're gonna have three assignments,
00:49:44.460 | and each one of the assignments is paired
00:49:46.340 | with what we call a bake-off,
00:49:47.540 | which is an informal competition around data and modeling.
00:49:51.180 | Essentially, the homework problems ask you
00:49:53.380 | to set up some baseline systems,
00:49:55.500 | and get a feel for a problem,
00:49:57.500 | and then you write your own original system,
00:50:00.060 | and you enter that into the bake-off.
00:50:01.700 | And we have a leaderboard on Gradescope,
00:50:03.940 | and the team is gonna look at all your submissions,
00:50:06.940 | and give out some prizes for top-performing systems,
00:50:09.940 | but also systems that are really creative,
00:50:12.380 | or interesting, or ambitious, or something like that.
00:50:15.380 | And that has always been a lot of fun,
00:50:17.900 | and also really illuminating,
00:50:19.500 | 'cause it's like crowdsourcing a whole lot
00:50:21.820 | of different approaches to a problem,
00:50:23.860 | and then as a group, we can reflect on what worked,
00:50:26.860 | and what didn't, and look at the really ambitious things
00:50:29.260 | that you all try.
00:50:30.620 | So that's my favorite part.
00:50:31.820 | We have three offline quizzes,
00:50:34.580 | and this is just as a way to make sure you have incentives
00:50:37.900 | to really immerse yourself in the course material.
00:50:42.540 | Those are done on Canvas.
00:50:44.340 | There's actually a fourth quiz,
00:50:45.660 | which I'll talk a little bit about probably next time,
00:50:48.020 | that is just making sure you understand the course policies.
00:50:52.060 | That's quiz zero.
00:50:53.420 | You can take it as many times as you want,
00:50:55.340 | but the idea is that you will have some incentive
00:50:58.420 | to learn about policies like due dates, and so forth.
00:51:02.220 | And then the real action is in the final project,
00:51:04.620 | and that will have a lit review phase,
00:51:06.620 | an experiment protocol, and a final paper.
00:51:09.580 | Those three components, you'll probably do those in Teams,
00:51:12.300 | and throughout all of that work,
00:51:14.060 | you'll be mentored by someone from the teaching team.
00:51:16.820 | And as I said before, we have this incredibly expert
00:51:20.220 | teaching team, lots of varied expertise,
00:51:23.060 | a lot of experience in the field,
00:51:25.220 | and so we hope to align you
00:51:28.380 | with someone who's really aligned with your project goals,
00:51:31.820 | and then I think you can go really, really far.
00:51:34.740 | Yeah.
00:51:35.580 | - It looks like we're about quarter.
00:51:37.340 | Already looking forward to the bake-offs,
00:51:38.940 | and all Stanford kids get obsessed with this stuff.
00:51:42.740 | On the final project, is this more of an academic paper,
00:51:47.100 | or rather about building working code,
00:51:51.140 | and showing state-of-the-art results?
00:51:54.700 | - Great question.
00:51:55.540 | For the first one, the Bake-offs, yes.
00:51:56.860 | It is easy to get obsessed with your Bake-off entry.
00:52:00.060 | I would say that if you get obsessed,
00:52:02.300 | and you do really well,
00:52:03.660 | just make that into your final project.
00:52:05.860 | All three of them
00:52:08.140 | are really important problems.
00:52:10.140 | They are not idle work.
00:52:11.660 | I mean, one of them is on retrieval augmented
00:52:13.540 | in-context learning,
00:52:14.380 | which is one of my core research focuses right now,
00:52:16.780 | so is compositional generalization.
00:52:18.820 | If you do something really interesting for a Bake-off,
00:52:20.900 | make it your final paper,
00:52:22.540 | and then go on to publish it.
00:52:24.500 | For the second part of your question,
00:52:25.900 | I would say that the core goal is to get you
00:52:28.260 | to produce something that could be
00:52:30.220 | a research contribution in the field,
00:52:32.460 | and we have lots of success stories.
00:52:34.820 | I've got links at the website to people who have gone on
00:52:37.620 | to publish their final paper as an NLP paper.
00:52:41.140 | I'm careful the way I say that.
00:52:42.900 | They didn't literally publish the final paper
00:52:45.060 | because in 10 weeks,
00:52:46.540 | almost no one can produce a publishable paper.
00:52:48.740 | It's just not enough time,
00:52:50.500 | but you could form the basis for then working
00:52:52.660 | a little bit more or a lot more,
00:52:55.060 | and then getting a really outstanding publication out of it.
00:52:57.820 | And I would say that that's the default goal.
00:52:59.620 | The nature of the contribution though is highly varied.
00:53:02.540 | We have one requirement,
00:53:03.700 | which is that the final paper have
00:53:04.980 | some quantitative evaluation in it,
00:53:07.900 | but there are a lot of ways to satisfy that requirement,
00:53:10.420 | and then you could be serving
00:53:11.820 | many different questions in the field
00:53:14.180 | for some expansive notion of the field as well.
00:53:18.500 | Background materials.
00:53:25.940 | So I should say that officially,
00:53:28.620 | we are presupposing CS224N or CS224S as prerequisites for the course.
00:53:34.820 | And what that means is that I'm gonna skip a lot of
00:53:37.980 | the fundamentals that we have covered in past years.
00:53:41.220 | If you need a refresher,
00:53:43.340 | check out the background page of the course site.
00:53:45.980 | It covers fundamentals of scientific computing,
00:53:49.260 | static vector representations like
00:53:51.740 | word2vec and GloVe, and supervised learning.
00:53:54.780 | And I'm hoping that that's enough of a refresher.
00:53:57.580 | If you look at that material and find that it too is kind of
00:54:01.460 | beyond where you're at right now,
00:54:03.540 | then contact us on the teaching team and we can
00:54:06.100 | think about how to manage that.
00:54:08.620 | But officially, this is a course that presupposes CS224N.
00:54:14.620 | Then the core goals. This kind of relates to that previous question.
00:54:18.900 | Hands-on experience with a wide range of problems.
00:54:22.060 | Mentorship from the teaching team to guide you through projects and assignments.
00:54:27.380 | And then really the central goal here is to make you the best,
00:54:30.660 | that is most insightful, most responsible,
00:54:33.240 | most flexible NLU researcher and practitioner that you can be for whatever you decide to do next.
00:54:40.020 | And we're assuming that you have lots of diverse goals that somehow connect with NLU.
00:54:45.500 | All right. Let's do some course themes unless there are questions.
00:54:54.140 | I have a whole final section of this slideshow that's about the course:
00:54:59.340 | materials and requirements and stuff.
00:55:01.960 | Might save that for next time and you can check it out at
00:55:04.400 | the website and you'll be forced to engage with it for quiz zero.
00:55:08.600 | I thought instead I would dive back into
00:55:11.240 | the content part of this unless there are questions or comments.
00:55:15.240 | All right. First course theme,
00:55:21.080 | transformer-based pre-training.
00:55:23.640 | So starting with the transformer,
00:55:25.920 | we want to talk about core concepts and goals.
00:55:28.480 | Give you a sense for what these models are like,
00:55:30.940 | why they work, what they're supposed to do, all of that stuff.
00:55:34.440 | We'll talk about a bunch of different architectures.
00:55:37.520 | There are dozens and dozens of them,
00:55:39.560 | but I hope that I have picked the right selection of them to give you
00:55:44.120 | a feel for how people are thinking about these models and the kinds of
00:55:47.760 | innovations they've brought in that have led to
00:55:50.080 | real, meaningful advancement just at the level of architectures.
00:55:53.900 | We'll also talk about positional encoding,
00:55:55.840 | which I think maybe a lot of us have been surprised to see just how
00:55:59.040 | important that is as a differentiator for different approaches in this space.
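To make "positional encoding" a bit more concrete, here's a minimal sketch of one common scheme, the sinusoidal encoding from the original transformer paper; there are many alternatives, and this is just to fix the idea.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Classic sinusoidal positional encodings, one row per position."""
    positions = np.arange(seq_len)[:, None]                      # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                           # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])                  # even dimensions: sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])                  # odd dimensions: cosine
    return encoding

print(sinusoidal_positions(4, 8).round(2))
```

Each position gets a unique pattern of sines and cosines, which is what lets an order-insensitive attention mechanism recover word order.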
00:56:03.960 | We'll talk about distillation,
00:56:06.120 | taking really large models and making them smaller.
00:56:09.400 | It's an important goal for lots of reasons and an exciting area of research.
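As a rough picture of the core distillation idea, here's a minimal sketch: train a small student to match a large teacher's softened output distribution. Everything below is a stand-in; real distillation pipelines add the usual task loss, lots of data, and many refinements.

```python
import torch
import torch.nn.functional as F

# Stand-ins: in practice the teacher is a big frozen model and the student
# is a much smaller network trained on real data.
teacher_logits = torch.randn(8, 2)
student = torch.nn.Linear(16, 2)
inputs = torch.randn(8, 16)

T = 2.0                                   # temperature softens both distributions
student_logits = student(inputs)
loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * (T ** 2)                              # conventional T^2 rescaling
loss.backward()
print(float(loss))
```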
00:56:13.920 | Then, as I mentioned,
00:56:15.320 | one member of the teaching team is going to do a little lecture for us on diffusion objectives for these models,
00:56:19.760 | and then another is going to talk about practical pre-training and fine-tuning.
00:56:24.120 | I'm going to enlist the entire teaching team to do guest lectures,
00:56:28.140 | and these are the two that I've lined up so far.
00:56:31.120 | That will culminate in, or be aligned with, this first homework and bake-off,
00:56:35.800 | which is on multi-domain sentiment analysis.
00:56:37.920 | I'm going to give you a bunch of different sentiment datasets,
00:56:40.760 | and you're going to have to design one system that can succeed on all of them.
00:56:45.000 | Then for the Bake-off,
00:56:46.600 | we have an unlabeled dataset for you.
00:56:48.600 | We have the labels, but you won't.
00:56:50.520 | That has data that's like what you developed on,
00:56:54.160 | and then some mystery examples that you will not really be able to anticipate.
00:56:58.560 | We're going to see how well you do at handling
00:57:01.000 | all of these different domains with one system.
00:57:04.160 | This is by way of again,
00:57:07.840 | a refresher on core concepts and supervised learning,
00:57:11.320 | and really getting you to think about transformers.
00:57:13.720 | Although we're not going to constrain the solution that you
00:57:16.280 | offer for your original system.
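Just to fix ideas about what a simple multi-domain baseline could look like, here is a minimal sketch that pools a few toy domains, embeds everything with one frozen sentence encoder, and fits a single classifier on the mixture. The datasets, encoder choice, and labels are all illustrative assumptions, not the actual homework setup.

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Toy examples from three made-up domains; label 1 = positive, 0 = negative.
domains = {
    "restaurants": [("The tacos were incredible.", 1), ("Cold food, rude staff.", 0)],
    "movies":      [("A moving, beautifully shot film.", 1), ("Two hours I will never get back.", 0)],
    "products":    [("Battery lasts all week.", 1), ("Broke after one use.", 0)],
}

texts, labels = [], []
for examples in domains.values():
    for text, label in examples:
        texts.append(text)
        labels.append(label)

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # one frozen encoder for every domain
features = encoder.encode(texts)                    # shape: (n_examples, embedding_dim)

# A single classifier trained on the pooled domains.
clf = LogisticRegression(max_iter=1000).fit(features, labels)
print(clf.predict(encoder.encode(["Surprisingly good for the price."])))
```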
00:57:19.600 | Our second major theme will be retrieval augmented in context learning.
00:57:27.360 | A topic that I would not even have dreamt of five years ago,
00:57:33.440 | and seemed kind of infeasible three years ago,
00:57:36.120 | and that we first did two years- one year ago?
00:57:38.960 | Oh goodness. I think this is only the second time,
00:57:41.720 | but I had to redo it entirely because things have changed so much.
00:57:46.520 | Here's the idea.
00:57:48.760 | We have two characters so far in our kind of emerging narrative for NLU.
00:57:53.520 | On the one hand, we have this approach that I'm going to call LLMs for everything,
00:57:57.840 | large language models for everything.
00:58:00.120 | You input some kind of question.
00:58:02.560 | Here I've chosen a very complicated question.
00:58:04.600 | Which MVP of a game Red Flaherty umpired was elected to the Baseball Hall of Fame?
00:58:10.600 | Hats off to you if you know that the answer is Sandy Koufax.
00:58:15.120 | Um, the LLMs for everything approach is that you just type that question in,
00:58:20.840 | and the model gives you an answer.
00:58:23.000 | And hopefully you're happy with the answer.
00:58:26.320 | The other character that I'm going to introduce
00:58:28.960 | here is what I'm going to call retrieval augmented.
00:58:31.720 | So I have the same question at the top here,
00:58:34.180 | except now this is going to proceed differently.
00:58:36.040 | The first thing that we will do is take some large language model and
00:58:39.840 | encode that query into some numerical representation.
00:58:44.880 | That's sort of familiar.
00:58:46.600 | The new piece is that we're going to also have a knowledge store,
00:58:50.520 | which you could think of as an old-fashioned web index, right?
00:58:55.520 | Just a knowledge store of documents with
00:58:58.520 | the modern twist that now all of the documents
00:59:01.280 | are also represented by large language models.
00:59:04.120 | But fundamentally, this is an index of a sort that drives all web search right now.
00:59:09.280 | We can score documents with respect to
00:59:12.040 | queries on the basis of these numerical representations.
00:59:15.080 | And if we want to,
00:59:16.400 | we can reproduce the classic search experience.
00:59:19.200 | Here I've got a ranked list of documents that came back from my query,
00:59:23.920 | just like when you do Google as of the last time I googled.
00:59:28.400 | But in this mode, we can continue, right?
00:59:30.760 | We can have another language model slurp up
00:59:33.080 | those retrieved documents and synthesize them into an answer.
00:59:37.520 | And so here at the bottom I've got,
00:59:39.160 | it's kind of small, but it's the same answer over here.
00:59:41.400 | Although notably, this answer is now decorated with
00:59:44.600 | links that would allow you the user to track back to
00:59:48.320 | what documents actually provided that evidence.
00:59:52.280 | Whereas on the left,
00:59:53.920 | who knows where that information came from?
00:59:56.400 | And that's kind of what we were already grappling with.
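Here is a minimal sketch of that retrieval step: encode the query and the documents with one frozen encoder and rank the documents by similarity. The encoder name and the tiny toy knowledge store are illustrative assumptions.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative frozen encoder

documents = [   # toy knowledge store
    "Sandy Koufax was the MVP of the 1965 World Series.",
    "Red Flaherty was one of the umpires for the 1965 World Series.",
    "Sandy Koufax was elected to the Baseball Hall of Fame in 1972.",
]
query = "Which MVP of a game Red Flaherty umpired was elected to the Baseball Hall of Fame?"

# Both the query and the documents live in the same embedding space.
doc_embs = encoder.encode(documents, convert_to_tensor=True)
query_emb = encoder.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_emb, doc_embs)[0]        # one relevance score per document
for score, doc in sorted(zip(scores.tolist(), documents), reverse=True):
    print(f"{score:.3f}  {doc}")

# A second, frozen language model would then read the top passages and
# synthesize an answer that can cite exactly these documents.
```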
00:59:59.720 | This is an important societal need because this is taking over web search.
01:00:04.280 | What are our goals for this kind of model here?
01:00:06.680 | So first, we want synthesis fluency, right?
01:00:09.360 | We want to be able to take information from
01:00:12.160 | multiple documents and synthesize it down into a single answer.
01:00:15.840 | And I think both of
01:00:17.080 | the approaches that I just showed you are going to do really well on that.
01:00:20.480 | We also need these models to be efficient,
01:00:23.000 | to be updatable because the world is changing all the time.
01:00:27.480 | We need it to track provenance and maybe invoke something like factuality.
01:00:32.360 | But certainly provenance, we need to know where the information came from.
01:00:35.840 | And we need some safety and security.
01:00:37.640 | We need to know that the model won't produce private information.
01:00:40.840 | And we might need to restrict access to parts of
01:00:43.440 | the model's knowledge to different groups like
01:00:45.760 | different customers or different people with different privileges and so forth.
01:00:49.640 | That's what we're going to need if we're really going to
01:00:51.920 | deploy these models out into the world.
01:00:55.080 | As I said, I think both of the approaches that I sketched do well on
01:00:58.400 | the synthesis part because they both use a language model and those are really good.
01:01:02.200 | They all have the gift of gab, so to speak.
01:01:04.800 | What about efficiency?
01:01:06.560 | On the LLM for everything approach,
01:01:09.200 | we had this undeniable rise in model size.
01:01:13.040 | And I pointed out models like Alpaca that are smaller.
01:01:17.200 | But I strongly suspect that if we are going to continue to ask
01:01:21.280 | these models to be both a knowledge store and a language capability,
01:01:26.600 | we're going to be dealing with these really large models.
01:01:30.160 | The hope of the retrieval augmented approach is that we
01:01:34.520 | could get by with the smaller models.
01:01:36.720 | And the reason we could do that is that we're going to factor out
01:01:40.120 | the knowledge store into that index and the language capability,
01:01:44.520 | which is going to be the language model.
01:01:45.960 | The only thing we're going to be asking the language model
01:01:48.840 | is to be good at that kind of in-context learning.
01:01:51.880 | It doesn't need to also store a full model of the world.
01:01:55.520 | And I think that means that these models could be smaller.
01:01:58.720 | So overall, a big gain in efficiency if we go retrieval augmented.
01:02:03.320 | People will make progress,
01:02:04.840 | but I think it's going to be tense.
01:02:07.280 | What about updatability?
01:02:09.440 | Again, this is a problem that people are working on
01:02:11.520 | very concertedly for the LLMs for everything approach.
01:02:14.720 | But these models persist in giving outdated answers to questions.
01:02:19.640 | And one pattern you see is that there's a lot of progress where you could like
01:02:22.920 | edit a model so that it gives
01:02:24.380 | the correct answer to who is the president of the US.
01:02:27.280 | But then you ask it about something related to
01:02:29.720 | the family of the president and it reveals that it has
01:02:34.080 | outdated information stored in its parameters and that's
01:02:37.520 | because all of this information is interconnected and we don't at
01:02:41.280 | the present moment know how to reliably do that kind of systematic editing.
01:02:46.720 | Okay. On the retrieval augmented approach,
01:02:49.720 | we just re-index our data.
01:02:52.040 | If the world changes,
01:02:53.960 | we assume that the knowledge store changed like somebody updated a Wikipedia page.
01:02:58.280 | So we represent all the documents again or at least just the ones that changed.
01:03:02.720 | And now we have a lot of guarantees that as that propagates forward into
01:03:06.560 | the retrieved results which are consumed by the language model,
01:03:09.920 | it will reflect the changes we made to the underlying database in
01:03:14.000 | exactly the same way that a web search index is updated now.
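A minimal sketch of that re-indexing step, assuming a simple in-memory index keyed by document id: only the documents that changed get re-encoded, and the next retrieval pass reflects the edit.

```python
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")    # illustrative frozen encoder

index = {}   # doc_id -> (text, embedding)

def upsert(doc_id, text):
    """Add or refresh one document; nothing else needs to be touched."""
    index[doc_id] = (text, encoder.encode(text))

upsert("head_of_state", "The current head of state is ...")
# The world changes: someone edits the page, so we re-embed just that entry,
# and every later retrieval pass is guaranteed to see the new text.
upsert("head_of_state", "Following the latest election, the current head of state is ...")
```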
01:03:19.000 | Right. One forward pass of the large language model
01:03:22.960 | compared to maybe training from scratch over here on
01:03:26.640 | new data to get an absolute guarantee that the change will propagate.
01:03:31.480 | What about provenance?
01:03:33.280 | Okay. We have seen this already,
01:03:35.320 | this problem here. LLMs for everything.
01:03:37.760 | I asked GPT-3, the DaVinci 3 model,
01:03:42.400 | my question, are professional baseball players
01:03:44.400 | allowed to glue small wings onto their caps?
01:03:46.600 | But I kind of cut it off but at the top there I said,
01:03:49.160 | provide me some links to the evidence.
01:03:53.320 | And it dutifully provided the links,
01:03:56.080 | but none of the links are real.
01:03:57.960 | If you copy them out and follow them,
01:04:00.360 | they all go to 404 pages.
01:04:02.480 | And I think that this is worse than providing no links at all because I'm
01:04:07.680 | attuned as a human in the current moment
01:04:10.680 | to see links and think they're probably evidence,
01:04:12.880 | and I don't follow all the links.
01:04:15.240 | And here you might look and say, "Oh yeah,
01:04:17.680 | I see it found the relevant MLB pages and that's it."
01:04:21.440 | Right. Over here, the kind of the point of
01:04:25.360 | this is that we are first doing
01:04:27.320 | a search phase where we're actually linked back to documents.
01:04:30.560 | And then we just need to solve the interesting non-trivial question
01:04:34.120 | of how to link those documents into the synthesized answer.
01:04:37.520 | But all of the information we need is right there on the screen for us.
01:04:41.560 | And so this feels like a relatively tractable problem
01:04:44.320 | compared to what we are faced with on the left.
01:04:47.280 | I will say, I've been just amazed at the rollout,
01:04:52.440 | especially of the Bing search engine,
01:04:55.000 | which now incorporates OpenAI models at some level.
01:04:57.960 | Because it is clear that it is doing web search, right?
01:05:01.800 | Because it's got information that comes from documents that
01:05:04.640 | only appeared on the web days before your query.
01:05:08.200 | But what it's doing with that information seems completely chaotic to me.
01:05:13.280 | So that it's kind of just getting mushed in with whatever else the model is doing,
01:05:17.440 | and you get this unpredictable combination of things that are grounded in documents,
01:05:24.080 | and things that are completely fabricated.
01:05:26.320 | And again, I maintain this is worse than just giving
01:05:28.920 | an answer with no evidence attached to it.
01:05:33.120 | I don't know why these companies are not simply doing the retrieval augmented thing,
01:05:37.920 | but I'm sure they are going to wise up,
01:05:39.640 | and maybe your research could help them wise up a little bit about this.
01:05:43.800 | Finally, safety and security.
01:05:45.880 | This is relatively straightforward.
01:05:47.240 | On the LLMs for everything approach,
01:05:48.680 | we have a pressing problem, privacy challenges.
01:05:51.920 | We know that those models can memorize long strings in their training data,
01:05:55.560 | and that could include some very particular information about one of us,
01:05:59.600 | and that should be worrying us.
01:06:01.160 | We have no known way with a language model to compartmentalize LLM capabilities,
01:06:06.080 | and say like, you can see this kind of result and you cannot.
01:06:09.640 | And similarly, we have no known way to restrict access to part of an LLM's capabilities.
01:06:15.680 | They just produce things based on their prompts,
01:06:18.400 | and you could try to have some prompt tuning that would tell them for
01:06:21.240 | this kind of person or setting do this and not that,
01:06:24.040 | but nobody could guarantee that that would succeed.
01:06:26.800 | Whereas, for the retrieval augmented approach, again,
01:06:31.120 | we're thinking about accessing information from an index,
01:06:34.680 | and access restrictions on an index are an old problem by now.
01:06:39.840 | Again, I don't want to say solved,
01:06:41.440 | but something that a lot of people have tackled for decades now,
01:06:45.600 | and so we can offer something like guarantees,
01:06:48.160 | just from the fact that we have a separated knowledge store.
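A minimal sketch of what that separation buys you, with made-up permission labels: documents the current user cannot see are filtered out before retrieval ever scores them.

```python
# Toy documents with made-up access-control labels.
documents = [
    {"text": "Public product FAQ.",          "acl": {"public"}},
    {"text": "Customer A's contract terms.", "acl": {"customer_a"}},
    {"text": "Internal incident report.",    "acl": {"employees"}},
]

def visible_docs(user_groups):
    """Only documents whose ACL overlaps the user's groups are even candidates."""
    return [d for d in documents if d["acl"] & user_groups]

# The retriever scores only this filtered set, so the language model never
# sees passages this user is not allowed to access.
print([d["text"] for d in visible_docs({"public", "customer_a"})])
```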
01:06:52.640 | Again, my smiley face.
01:06:56.040 | You can see where my feelings are.
01:06:58.000 | For the LLMs for everything approach,
01:07:00.200 | people are working on these problems and it's very exciting,
01:07:02.960 | and if you want a challenge,
01:07:04.520 | take up one of these challenges here.
01:07:07.340 | But over here on the retrieval augmented side,
01:07:09.640 | I think we have lots of reasons to be optimistic.
01:07:11.800 | It's not that they're completely solved,
01:07:13.760 | it's just that we can see the path to solving them,
01:07:16.360 | and this feels very urgent to me because of how
01:07:19.600 | suddenly this kind of technology is being deployed in
01:07:22.960 | a very user-facing way for one of
01:07:24.760 | the core things we do in society, which is web search.
01:07:28.200 | So it's an urgent thing that we get good at this.
01:07:32.280 | Final things I want to say about this.
01:07:35.480 | So until recently,
01:07:37.720 | the way you would do even the retrieval augmented thing would be that you would
01:07:41.040 | have your index and then you might
01:07:44.560 | train a custom purpose model to do the question answering part,
01:07:48.040 | and it could extract things from the text that you produced,
01:07:50.840 | or maybe even generate some new things from the text that you produced.
01:07:54.400 | That's the mode that I mentioned before where you'd have some language models,
01:07:58.880 | maybe a few of them, and you'd have an index,
01:08:00.880 | and you would stitch them together into a question answering system
01:08:04.480 | that you would probably train on question answering data,
01:08:07.960 | and you would hope that this whole big monster, maybe
01:08:10.120 | fine-tuned on SQuAD or Natural Questions or one of those datasets,
01:08:14.520 | gave you a general purpose question answering capability.
01:08:19.440 | That's the present, but I think it might actually be the recent past.
01:08:24.320 | In fact, the way that you all will probably work when we do this unit,
01:08:28.960 | and certainly for the homework,
01:08:30.740 | is that we will just have frozen components.
01:08:33.880 | This starts from the observation that the retriever model is really just a model that
01:08:39.120 | takes in text and produces text with scores,
01:08:42.920 | and a language model is also a device for taking in text and producing text with scores.
01:08:49.880 | These are when these are frozen components,
01:08:51.880 | you can think of them as just black box devices that do this input-output thing,
01:08:55.960 | and then you get into the intriguing mode of asking,
01:08:58.680 | but what if we had them just talk to each other?
01:09:01.600 | That is what you will do for the homework and bake-off.
01:09:04.760 | You will have frozen retriever and a frozen large language model,
01:09:08.640 | and you will get them to work together to
01:09:11.760 | solve a very difficult open domain question answering problem.
01:09:16.080 | That's pushing us into a new mode for even thinking about how we design AI systems,
01:09:21.640 | where it's not so much about fine-tuning,
01:09:24.080 | it's much more about getting them to communicate with each other
01:09:27.360 | effectively to design a system from frozen components.
01:09:31.880 | Again, unanticipated at least by me as of a few years ago,
01:09:36.760 | and now an exciting new direction.
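One way to picture frozen components talking to each other: the retriever and the language model are both text-in, text-out boxes, and the program in the middle is ordinary code. Here is a minimal sketch; the encoder name is an illustrative assumption, and the language-model call is left as a stub since it depends on whatever frozen model or API you use.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # frozen retriever encoder (illustrative)

def retrieve(query, passages, k=2):
    """Score passages against the query and return the top k."""
    scores = util.cos_sim(encoder.encode(query, convert_to_tensor=True),
                          encoder.encode(passages, convert_to_tensor=True))[0]
    top = scores.topk(min(k, len(passages))).indices.tolist()
    return [passages[i] for i in top]

def generate(prompt):
    """Stub standing in for a call to whatever frozen language model you use."""
    return "(model answer would go here)"

def answer(question, passages):
    """Ordinary code stitches the two frozen components together."""
    context = "\n".join(retrieve(question, passages))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return generate(prompt)
```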
01:09:39.920 | So just to wrap up,
01:09:41.920 | I think what I'll do since we're near the end of the- of class here,
01:09:45.000 | I'll just finish up this one unit,
01:09:46.840 | and then we'll use some of our time next time to introduce a few other of
01:09:50.200 | these course themes and that'll set us up well for diving into transformers.
01:09:55.520 | Final piece here just to inspire you,
01:09:57.840 | few-shot open QA is kind of the task that you will tackle for homework two.
01:10:02.640 | And here's how you could think about this.
01:10:04.400 | Imagine that the question has come in,
01:10:06.200 | what is the course to take?
01:10:08.280 | The most standard thing we could do is just prompt the language model with that question,
01:10:12.920 | "What is the course to take?", down here, and see what answer it gives back, right?
01:10:17.320 | But the retrieval augmented insight is that we
01:10:20.680 | might also retrieve some kind of passage from a knowledge store.
01:10:23.880 | Here I have a very short passage:
01:10:25.440 | "The course to take is natural language understanding,"
01:10:28.100 | and that could be done with a retrieval mechanism.
01:10:31.160 | But why stop there?
01:10:33.180 | It might help the model, as we saw going back to the GPT-3 paper, to
01:10:37.480 | have some examples of the kind of behavior that I'm hoping to get from the model.
01:10:41.960 | And so here I have retrieved from some dataset,
01:10:44.800 | question-answer pairs that will kind of give it a sense for what I want it to do in the end.
01:10:49.400 | But again, why stop there?
01:10:51.360 | We could also pick questions that were based very closely on the question that we posed.
01:10:57.840 | That would be like a k-nearest neighbors approach where we use
01:11:01.160 | our retrieval mechanism to find similar questions to the one that we care about.
01:11:06.220 | I could also add in some context passages and I could do that by retrieval.
01:11:11.200 | So now we've used the retrieval model twice potentially,
01:11:14.500 | once to get good demonstrations and once to provide context for each one of them.
01:11:19.460 | But I could also use my retrieval mechanism with the questions and answers from
01:11:23.860 | the demonstration to get even richer connections
01:11:26.460 | between my demonstrations and the passages.
01:11:29.440 | I could even use a language model to rewrite aspects of those demonstrations to put them
01:11:34.500 | in a format that might help me with the final question that I want to pose.
01:11:39.020 | So now I have an interwoven use of
01:11:42.120 | the retrieval mechanism and the large language model to build up this prompt.
01:11:47.540 | Down at the retrieval thing,
01:11:49.380 | I could do the same thing.
01:11:50.900 | And then when you think about the model generation, again,
01:11:54.140 | we could just take the top response from the model,
01:11:57.100 | but we can do very sophisticated things on up to this full retrieval augmented generation model,
01:12:04.140 | which essentially marginalizes out the evidence passage and gives us
01:12:08.460 | a really powerful look at a good answer conditional
01:12:11.640 | on that very complicated prompt that we constructed.
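To make that prompt-assembly recipe a bit more concrete, here is a minimal sketch: pick training question-answer pairs similar to the test question as demonstrations, retrieve a context passage for each, and stack it all into one prompt. The data and encoder choice are toy illustrations, and the homework itself builds on a dedicated programming framework rather than hand-rolled code like this.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative frozen encoder

train_qa = [   # toy training question-answer pairs
    ("What course covers retrieval augmented NLP?", "Natural language understanding."),
    ("What is the capital of France?", "Paris."),
]
passages = [   # toy knowledge store
    "The course to take is natural language understanding.",
    "Paris is the capital and largest city of France.",
]

def nearest(query, candidates, k=1):
    """Return the k candidates most similar to the query under the frozen encoder."""
    scores = util.cos_sim(encoder.encode(query, convert_to_tensor=True),
                          encoder.encode(candidates, convert_to_tensor=True))[0]
    top = scores.topk(min(k, len(candidates))).indices.tolist()
    return [candidates[i] for i in top]

def build_prompt(question, k_demos=1):
    answers = dict(train_qa)
    blocks = []
    # 1. Demonstrations: training questions most similar to the test question.
    for q in nearest(question, list(answers), k=k_demos):
        context = nearest(q, passages, k=1)[0]      # 2. A retrieved passage per demo.
        blocks.append(f"Context: {context}\nQuestion: {q}\nAnswer: {answers[q]}")
    # 3. The test question with its own retrieved context, left for the LM to complete.
    blocks.append(f"Context: {nearest(question, passages, k=1)[0]}\n"
                  f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)

print(build_prompt("Which course should I take?"))
```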
01:12:15.540 | I think what you're seeing on the left here is that we are going to move from an era where
01:12:20.780 | we just type in prompts into these models and hope for the best,
01:12:25.060 | into an era where prompt construction is a kind of new programming mode,
01:12:31.060 | where you're writing down computer code,
01:12:33.740 | could be Python code,
01:12:35.200 | that is doing traditional computing things,
01:12:37.740 | but also drawing on very powerful pre-trained components to assemble
01:12:43.640 | this kind of instruction kit for your large language model to do whatever task you have set for it.
01:12:50.340 | And so instead of designing these AI systems with
01:12:53.340 | all that fine-tuning I described before,
01:12:55.740 | we might actually be moving back into a mode that's like
01:12:58.980 | that symbolic mode from the '80s where you type in a computer program.
01:13:03.260 | It's just that now the program that you type in is
01:13:06.900 | connected to these very powerful modern AI components.
01:13:11.580 | And we're seeing right now that that is
01:13:14.700 | opening doors to all kinds of new capabilities for these systems.
01:13:18.540 | And this first homework and bake-off is going to give you a glimpse of that.
01:13:23.140 | And you're going to use a programming model we've
01:13:25.420 | developed called demonstrate-search-predict that I
01:13:28.460 | hope will give you a glimpse of just how powerful this can be.
01:13:32.340 | All right. We are out of time, right? 4:20?
01:13:39.980 | So next time I'll show you a few more units from the course,
01:13:43.660 | and then we'll dive into transformers.