
Stanford XCS224U: NLU | Intro & Evolution of Natural Language Understanding, Pt. 1 | Spring 2023


Whisper Transcript

00:00:00.000 | Welcome everyone.
00:00:06.880 | Uh, this is natural language understanding.
00:00:09.680 | Uh, it is a weird and wonderful and maybe worrying moment to be doing natural language understanding.
00:00:16.440 | My goal for today is just to kind of immerse us in this moment and think about how we got here and what it's like to be doing research now.
00:00:25.440 | And I think that'll set us up well to think about what we're gonna do in the course and how that's gonna set you up to participate in this moment in AI,
00:00:35.040 | uh, in many ways, in whichever ways you choose.
00:00:38.240 | And it's an especially impactful moment to be doing that.
00:00:40.920 | And this is a project-oriented course.
00:00:43.200 | And I feel like we can get you all to the point where you are doing meaningful things that contribute to this ongoing moment in ways that are gonna be exciting and impactful.
00:00:53.000 | That is the fundamental goal of the course.
00:00:55.960 | Let's now think about the current moment.
00:00:57.560 | This is always a moment of reflection for me.
00:01:00.240 | I started teaching this course in 2012, um, which I guess is ages ago now.
00:01:06.480 | It feels recent in my lived experience, but it does feel like ages ago in terms of the content.
00:01:11.480 | In 2012, on the first day, I had a slide that looked like this.
00:01:15.240 | I said, "It was an exciting time to be doing natural language understanding research."
00:01:20.440 | I noted that there was a resurgence of interest in the area after a long period of people mainly focused on syntax and things like that.
00:01:28.760 | But there was a widespread perception that NLU was poised for a breakthrough and to have huge impact on business applications,
00:01:37.120 | and that there was a white-hot job market for Stanford grads.
00:01:40.200 | A lot of this language is coming from the fact that we were in this moment when Siri had just launched,
00:01:45.600 | Watson had just won on Jeopardy,
00:01:48.440 | and we had all of these in-home devices and all the tech giants kind of competing on what was emerging as the field of natural language understanding.
00:01:56.840 | Let's fast forward to 2022.
00:01:59.120 | I did feel like I should update that in 2022 by saying this is the most exciting moment ever as opposed to it just being an exciting time.
00:02:06.720 | But I emphasize the same things, right?
00:02:09.280 | We were still in this feeling that we had experienced a resurgence of interest in the area,
00:02:14.640 | although now it was hyper-intensified.
00:02:17.120 | Same thing with industry.
00:02:18.560 | The industry interest at this point makes the stuff from 2012 look like small potatoes.
00:02:24.560 | Systems were getting very impressive,
00:02:27.400 | but, and I maintain this,
00:02:29.580 | they show their weaknesses very quickly,
00:02:31.940 | and the core things about NLU remain far from solved.
00:02:35.640 | So the big breakthroughs lie in the future.
00:02:37.920 | I will say that even since 2022,
00:02:40.440 | it has felt like there has been an acceleration,
00:02:42.960 | and some problems that we used to focus on feel kind of like they're less pressing.
00:02:48.480 | I won't say solved, but they feel like we've made a lot of progress on them as a result of models getting better.
00:02:54.480 | But all that means for me is that there are more exciting things in the future, and that we can tackle even more ambitious things.
00:03:01.640 | And you'll see that I've tried to overhaul the course to be ever more ambitious about the kind of problems that we might take on.
00:03:09.440 | But we do kind of live in a golden age for all of this stuff.
00:03:13.400 | And even in 2022,
00:03:14.720 | I'm not sure what I would have predicted to say nothing of 2012,
00:03:18.000 | that we would have these incredible models like DALL-E 2,
00:03:21.120 | which can take you from text into these incredible images.
00:03:24.720 | Language models, which will more or less be the star of the quarter for us.
00:03:29.120 | But also models that can take you from natural language to code.
00:03:33.080 | And of course, we are all seeing right now as we speak,
00:03:36.480 | that the entire industry related to web search is being reshaped around NLU technologies.
00:03:43.600 | So whereas this felt like a kind of niche area of NLP when we started this course in 2012,
00:03:50.920 | now it feels like the entire field of NLP,
00:03:54.120 | certainly in some aspects,
00:03:56.120 | all of AI is focused on these questions of natural language understanding,
00:04:00.320 | which is exciting for us.
00:04:02.800 | One more moment of reflection here.
00:04:05.000 | You know, in this course,
00:04:06.840 | throughout the years, we have used simple examples to kind of highlight the weaknesses of current models.
00:04:12.600 | And so a classic one for us was simply this question,
00:04:16.320 | which US states border no US states?
00:04:19.440 | The idea here is that it's a simple question,
00:04:22.400 | but it can be hard for our language technologies because of that negation, the no there.
00:04:28.400 | In 1980, there was a famous system called Chat-80.
00:04:33.080 | It was a symbolic system representing the first major phase of research in NLP.
00:04:38.760 | You can see the fragment of the system here.
00:04:41.280 | And Chat-80 was an incredible system in that it could answer questions like,
00:04:45.600 | which country bordering the Mediterranean borders a country that is
00:04:49.040 | bordered by a country whose population exceeds the population of India?
00:04:52.680 | I've given you the answer here,
00:04:54.760 | Turkey, at least according to 1980s geography.
00:04:58.920 | But if you asked Chat-80 a simple question like,
00:05:01.920 | which US states border no US states?
00:05:04.000 | It would just say, I don't understand.
00:05:06.680 | It was an incredibly expressive system, but rigid.
00:05:10.880 | It could do some things very deeply,
00:05:13.200 | as you see from the first question,
00:05:14.880 | but things that fell outside of its capacity,
00:05:17.440 | it would just fall down flat.
00:05:19.720 | That was the 1980s.
00:05:21.520 | Let's fast forward.
00:05:22.440 | 2009, around the time this course launched,
00:05:25.040 | Wolfram Alpha hit the scene.
00:05:26.960 | And this was meant to be a kind of revolutionary language technology.
00:05:30.880 | The website is still up,
00:05:32.320 | and to my amazement,
00:05:34.000 | it still gives the following behavior.
00:05:36.560 | If you search for which US states border no US states,
00:05:40.120 | it kind of just gives you a list of the US states.
00:05:43.160 | Revealing, I would say,
00:05:45.120 | that it has no capacity to understand the question posed.
00:05:49.400 | That was 2009.
00:05:50.960 | So we've gone from 1980 to 2009.
00:05:54.000 | Okay, let's go to 2020.
00:05:55.600 | This is the first of the OpenAI models, Ada.
00:05:59.320 | Which US states border no US states?
00:06:02.120 | The answer is no.
00:06:03.880 | And then it sort of starts to babble,
00:06:05.680 | the US border is not a state border.
00:06:07.640 | It did that for a very long time.
00:06:10.400 | What about Babbage?
00:06:11.680 | This is still 2020.
00:06:13.880 | The US states border no US states.
00:06:16.040 | What is the name of the US state?
00:06:17.560 | And then it really went off the deep end from there,
00:06:19.880 | again, for a very long time.
00:06:21.360 | That was Babbage.
00:06:22.240 | If you had seen this output,
00:06:24.600 | well, at least for me,
00:06:25.720 | it might have shaken my faith
00:06:27.800 | that this was a viable approach, right?
00:06:30.600 | But the team persisted, I guess.
00:06:32.120 | 2021, this is the Curie model.
00:06:34.280 | Which US states border no US states?
00:06:37.200 | It had a problem that it started listing things,
00:06:39.520 | but it did say Alaska, Hawaii, and Puerto Rico,
00:06:42.840 | which is an interestingly more impressive answer
00:06:45.880 | than the first answer, right?
00:06:47.680 | It still has some problem understanding
00:06:49.480 | what it means to respond,
00:06:50.640 | but it's looking like we're seeing some signal.
00:06:54.240 | Da Vinci Instruct Beta, this is 2022.
00:06:56.840 | It's important, I think,
00:06:58.080 | that this is the first of the models
00:06:59.600 | that have Instruct in the name.
00:07:01.160 | We'll talk about that in a minute.
00:07:02.840 | Which US states border no US states?
00:07:04.800 | Alaska and Hawaii.
00:07:06.760 | From 2020 to 2022,
00:07:08.800 | we have seen this astounding leap forward,
00:07:11.400 | making everything before then sort of pale in comparison.
00:07:14.680 | And then finally, Text Da Vinci One,
00:07:17.360 | you know, one of the new best-in-class models,
00:07:19.400 | at least until two months ago.
00:07:21.160 | Which US states border no US states?
00:07:23.200 | Alaska and Hawaii are the only US states
00:07:25.440 | that border no other US states.
00:07:26.920 | A very impressive answer indeed.
00:07:29.600 | And if you just think about the little history I've given,
00:07:32.840 | a kind of microcosm of what is happening in the field,
00:07:37.040 | a lot of time without much progress,
00:07:40.560 | with some hype attached,
00:07:42.160 | and now in the last few years,
00:07:43.920 | this kind of rapid progress forward.
00:07:46.160 | And, you know, that's just one example,
00:07:49.760 | but these examples multiply, and we can quantify this.
00:07:52.400 | Here's another impressive case.
00:07:54.080 | I asked the Da Vinci Two model,
00:07:56.320 | in which year was Stanford University founded?
00:07:59.080 | When did it enroll its first students?
00:08:01.000 | Who is its current president,
00:08:02.280 | and what is its mascot?
00:08:03.760 | A complicated question indeed,
00:08:05.800 | and it gave a fluent and factually correct answer
00:08:09.440 | on all counts.
00:08:10.800 | This is the Da Vinci Three model,
00:08:13.080 | which was best-in-class until a few weeks ago,
00:08:16.200 | and it gave exactly the same answer.
00:08:18.440 | Very impressive.
00:08:20.800 | Now, in this course,
00:08:22.760 | as you'll see at the website,
00:08:23.800 | one of the readings we've suggested
00:08:25.560 | for the start of the course
00:08:27.000 | is this classic paper by Hector Levesque
00:08:29.320 | called "On Our Best Behaviour."
00:08:31.400 | And the thrust of this article,
00:08:33.120 | essentially channeling Terry Winograd,
00:08:35.200 | and the Winograd schema,
00:08:38.080 | the idea is that we should come up with examples
00:08:40.960 | that will test whether models deeply understand,
00:08:44.120 | and in particular, get past the kind of
00:08:46.080 | simple memorization of statistics and other things
00:08:49.040 | about the data they're trained on,
00:08:50.640 | and really probe to see whether they understand
00:08:52.720 | what the world is like.
00:08:54.200 | And Levesque and Winograd's technique for doing this
00:08:57.680 | is to pose very unlikely questions
00:09:01.200 | where humans have very natural answers,
00:09:03.520 | like one of the ones Levesque poses is,
00:09:05.640 | could a crocodile run the steeplechase?
00:09:09.040 | Maybe it's a question you've never thought about before,
00:09:11.400 | but you probably have a pretty consistent answer
00:09:13.920 | across this group.
00:09:15.000 | Could a crocodile run the steeplechase?
00:09:17.560 | Here, I asked another one of Levesque's questions.
00:09:19.600 | Are professional baseball players allowed
00:09:21.800 | to glue small wings onto their caps?
00:09:24.440 | You could think about that for a second.
00:09:26.280 | The Da Vinci 2 model said,
00:09:28.080 | "There is no rule against it, but it is not common."
00:09:30.960 | And that seemed like a very good answer to me at the time.
00:09:35.120 | When the Da Vinci 3 engine came out,
00:09:37.360 | though, this started to worry me.
00:09:39.120 | No, professional baseball players are not allowed
00:09:41.760 | to glue small wings onto their caps.
00:09:43.720 | Major League Baseball has strict rules
00:09:45.560 | about the appearance of players' uniforms and caps.
00:09:49.000 | And any modifications to the caps are not allowed.
00:09:52.520 | Okay, I thought I was feeling good about this,
00:09:55.200 | but now I don't even myself know what the answer is.
00:09:58.240 | Are professional baseball players allowed
00:10:00.080 | to glue small wings onto their caps?
00:10:01.520 | We have two confident answers that are contradictory
00:10:05.640 | across two models that are very closely related.
00:10:09.080 | It's starting to worry us a little bit, I hope.
00:10:11.760 | But still, it's impressive.
00:10:14.200 | What's that?
00:10:15.040 | - You want me to ask Bard?
00:10:16.760 | - You could check.
00:10:17.600 | Yes, I have a few cases,
00:10:19.200 | and this is an interesting experiment
00:10:20.880 | for us to run for sure.
00:10:22.040 | Let me show you the responses I got a bit later.
00:10:24.560 | The point, though, I guess,
00:10:27.800 | if you've seen the movie "Blade Runner,"
00:10:29.120 | this is starting to feel like to figure out
00:10:31.760 | whether an agent we were interacting with was human or AI,
00:10:36.120 | we would need to get very sophisticated
00:10:39.480 | interview techniques indeed.
00:10:41.200 | The Turing test, long forgotten here,
00:10:43.960 | now we're into the mode of trying to figure out
00:10:46.680 | exactly what kind of agents we're interacting with
00:10:49.520 | by having to be extremely clever
00:10:51.920 | about the kinds of things that we do with them.
00:10:54.400 | Now, that's kind of anecdotal evidence,
00:10:58.640 | but I think that the picture of progress
00:11:00.760 | is also supported by what's happening in the field.
00:11:03.920 | Let me start this story with our benchmarks.
00:11:06.880 | And the headline here is that our benchmarks,
00:11:09.400 | the tasks, the datasets we use to probe our models
00:11:13.080 | are saturating faster than ever before.
00:11:15.360 | And I'll articulate what I mean by saturate.
00:11:17.880 | So we have a little framework.
00:11:19.840 | Along the x-axis, I have time
00:11:22.080 | stretching back into like the 1990s.
00:11:24.760 | And along the y-axis, I have a normalized measure
00:11:28.520 | of distance from what we call human performance.
00:11:31.400 | That's the red line set at zero.
00:11:34.000 | Each one of these benchmarks has, in its own particular way,
00:11:37.080 | set a so-called estimate of human performance.
00:11:39.840 | I think we should be cynical about that,
00:11:41.880 | but nonetheless, this'll be a kind of marker
00:11:44.400 | of progress for us.
00:11:45.720 | First dataset, MNIST.
00:11:48.640 | This is like digit recognition, famous task in AI.
00:11:51.600 | It was launched in the 1990s,
00:11:53.400 | and it took about 20 years for us to see a system
00:11:56.960 | that surpassed human performance in this very loose sense.
00:12:01.800 | The switchboard corpus, this is going from speech to text.
00:12:05.520 | It's a very similar story, launched in the '90s,
00:12:08.000 | and it took about 20 years
00:12:09.960 | for us to see a superhuman system.
00:12:13.480 | ImageNet, this was launched, I believe, in 2009,
00:12:17.120 | and it took less than 10 years for us to see a system
00:12:20.320 | that surpassed that red line.
00:12:23.120 | And now progress is gonna pick up really fast.
00:12:25.280 | SQuAD 1.1, the Stanford Question Answering Dataset,
00:12:29.040 | was launched in 2016, and it took about three years
00:12:32.440 | for it to be saturated in this sense.
00:12:35.240 | SQuAD 2.0 was the team's attempt
00:12:38.480 | to pose an even harder problem,
00:12:40.120 | one where there were unanswerable questions,
00:12:42.960 | but it took even less time for systems
00:12:45.160 | to get past that red line.
00:12:47.000 | Then we get the GLUE benchmark.
00:12:49.600 | This is a famous benchmark
00:12:51.480 | in natural language understanding, a multitask benchmark.
00:12:55.200 | When this was launched, a lot of us thought
00:12:58.040 | that GLUE would be too difficult for present-day systems.
00:13:01.520 | It looked like this might be a challenge
00:13:03.200 | that would stand for a very long time,
00:13:05.440 | but it took like less than a year
00:13:07.800 | for systems to pass human performance.
00:13:10.720 | The response was SuperGLUE, but it was saturated,
00:13:14.080 | if anything, even more quickly.
00:13:16.200 | Now, we can be as cynical as we want
00:13:19.080 | about this notion of human performance,
00:13:20.880 | and I think we should dwell on whether or not
00:13:22.920 | it's fair to call it that, but even setting that aside,
00:13:26.440 | this looks like undeniably a story of progress.
00:13:31.040 | The systems that we had in 2012
00:13:33.640 | would not even have been able to enter the GLUE benchmark
00:13:37.160 | to say nothing of achieving scores like this.
00:13:39.840 | So something meaningful has happened.
00:13:42.000 | Now, you might think by the standards of AI,
00:13:44.560 | these datasets are kind of old.
00:13:46.400 | Here's a post from Jason Wei where he evaluated
00:13:48.800 | our latest and greatest large language models
00:13:51.640 | on a bunch of mostly new tasks
00:13:53.760 | that were actually designed to stress test
00:13:56.480 | this new class of very large language models.
00:13:59.920 | Jason's observation is that we see emergent abilities
00:14:03.360 | across more than 100 tasks for these models,
00:14:05.880 | especially for our largest models.
00:14:08.320 | The point, though, is that we, again,
00:14:10.040 | thought these tasks would stand for a very long time,
00:14:13.280 | and what we're seeing instead is that one by one,
00:14:16.120 | systems are certainly getting traction,
00:14:17.960 | and in some cases, performing at the standard
00:14:20.880 | we had set for humans.
00:14:22.880 | Again, an incredible story of progress there.
00:14:26.520 | So I hope that is energizing, maybe a little intimidating,
00:14:31.960 | but I hope fundamentally energizing for you all.
00:14:34.800 | The next question that I wanna ask for you
00:14:38.280 | is just what is going on?
00:14:40.320 | What is driving all of this sudden progress?
00:14:43.560 | Let's get a feel for that, and that'll kind of serve
00:14:46.120 | as the foundation for the course itself.
00:14:49.120 | Before I do that, though, are there questions or comments,
00:14:52.960 | things I could resolve, or things I left out
00:14:55.080 | about the current moment?
00:14:56.520 | - I asked Bard, and I think it did very well.
00:15:02.960 | - We should reflect, though, maybe as a group
00:15:06.280 | about what it means to do very well.
00:15:08.240 | My question for you, when you say it did well,
00:15:11.320 | what is the Major League Baseball rule
00:15:13.320 | about players gluing things onto their caps?
00:15:15.520 | - Rule 3.06.
00:15:17.360 | - You found the actual rule?
00:15:18.760 | - No, this is what Bard, well, I don't-
00:15:21.800 | - Did you find the rule?
00:15:22.640 | - I didn't find the rule.
00:15:23.480 | Bard found that rule and gave me that number.
00:15:24.960 | - Okay. - Is it accurate?
00:15:26.360 | - Yes, that is gonna be the question for us.
00:15:28.720 | I can get- - It's a direct quote, too,
00:15:30.320 | which is right for hallucination.
00:15:32.560 | - Well, I'm gonna show you the OpenAI models
00:15:34.400 | will offer me links, but the links go nowhere.
00:15:37.320 | (audience laughing)
00:15:40.160 | What you're pointing out, I think,
00:15:41.760 | is an increasing societal problem.
00:15:43.840 | These models are offering us what looks like evidence,
00:15:47.280 | but a lot of the evidence is just fabricated,
00:15:49.840 | and this is worse than offering no evidence at all.
00:15:52.640 | What I really need is someone who knows
00:15:54.480 | Major League Baseball to tell me,
00:15:56.240 | what is the rule about players and their caps?
00:15:59.920 | I want it from an expert human,
00:16:01.560 | not an expert language model.
00:16:04.640 | - Can we- - What's that?
00:16:06.480 | - Can we Google?
00:16:08.000 | - Be careful how you Google, though.
00:16:09.560 | I guess that's the lesson of 2023.
00:16:11.800 | All right, what's going on?
00:16:16.560 | Let's start to make some progress on this.
00:16:18.560 | Again, first, a little bit of historical context.
00:16:22.040 | I've got a timeline going back to the 1960s
00:16:25.200 | along the x-axis.
00:16:26.200 | This is more or less the start of the field itself.
00:16:29.080 | And in that early era,
00:16:31.120 | essentially all of the approaches
00:16:33.800 | were based in symbolic algorithms
00:17:36.000 | like the Chat-80 system that I showed you.
00:16:37.840 | In fact, that was kind of pioneered here at Stanford
00:16:40.800 | by people who were pioneering the very field of AI.
00:16:44.040 | And that paradigm of essentially programming these systems
00:16:47.760 | lasted well into the 1980s.
00:16:50.280 | In the '90s, early 2000s,
00:16:54.840 | we get the statistical revolution
00:16:57.120 | throughout artificial intelligence,
00:16:58.800 | and then in turn in natural language processing.
00:17:01.600 | And the big change there is that
00:17:03.520 | instead of programming systems with all these rules,
00:17:06.280 | we're gonna design machine learning systems
00:17:08.240 | that are gonna try to learn from data.
00:17:10.400 | Under the hood,
00:17:11.240 | there was still a lot of programming involved
00:17:13.200 | because we would write a lot of feature functions
00:17:15.840 | that were little programs
00:17:16.960 | that would help us detect things about data.
00:17:19.360 | And we would hope that our machine learning systems
00:17:21.360 | could learn from the output of those feature functions.
00:17:24.720 | But in the end,
00:17:25.880 | this was the rise of the fully data-driven learning systems.
00:17:29.680 | And we just hope that some process of optimization
00:17:32.800 | leads us to new capabilities.
00:17:34.960 | The next big phase of this was the deep learning revolution.
00:17:39.400 | This happened starting around 2009, 2010.
00:17:42.480 | Again, Stanford was at the forefront of this to be sure.
00:17:46.360 | It felt like a big change at the time,
00:17:48.240 | but in retrospect,
00:17:49.660 | this is kind of not so different from this mode here.
00:17:52.280 | It's just that we now replace that simple model
00:17:56.280 | with really big models,
00:17:58.280 | really deep models that have a tremendous capacity
00:18:01.480 | to learn things from data.
00:18:03.400 | We started also to see a shift even further away
00:18:07.100 | from those feature functions,
00:18:08.640 | from writing little programs,
00:18:10.320 | and more toward a mode
00:18:12.000 | where we would just hope that the data
00:18:14.360 | and the optimization process could do all the work for us.
00:18:17.980 | Then the next big thing that happened,
00:18:21.560 | which could take us, I suppose, until about 2018,
00:18:25.280 | would be this mode
00:18:26.120 | where we have a lot of pre-trained parameters.
00:18:28.360 | These are pictures of maybe big language models
00:18:30.840 | or computer vision models or something.
00:18:32.920 | And when we build systems,
00:18:34.260 | we build on those pre-trained components
00:18:36.940 | and stitch them together
00:18:38.440 | with these task-specific parameters.
00:18:41.040 | And we hope that when they're all combined
00:18:42.960 | and we do some learning on some task-specific data,
00:18:46.320 | we have something that's benefiting
00:18:48.000 | from all these pre-trained components.
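Here is a minimal sketch of that pattern in code, assuming the Hugging Face transformers library and a BERT-style checkpoint; the checkpoint name, label count, and tiny batch are illustrative placeholders, not the specific systems from the lecture.

```python
# Sketch of "pre-trained components + task-specific parameters."
# Assumes the Hugging Face transformers library; checkpoint and labels are
# illustrative, not the lecture's actual systems.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The encoder weights are pre-trained; the small classification head on top
# is the task-specific part, randomly initialized.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# A single supervised step on task-specific data updates head and encoder jointly.
batch = tokenizer(["a great movie", "a dull movie"],
                  return_tensors="pt", padding=True)
labels = torch.tensor([1, 0])
loss = model(**batch, labels=labels).loss
loss.backward()  # gradients reach both the new head and the pre-trained encoder
```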
00:18:51.200 | And then the mode that we seem to be in now
00:18:53.980 | that I want us to reflect critically on
00:18:56.440 | is this mode where we're gonna replace everything
00:18:59.060 | with maybe one ginormous language model of some kind
00:19:03.500 | and hope that that thing, that enormous black box,
00:19:06.720 | will do all the work for us.
00:19:08.720 | We should think critically
00:19:09.900 | about whether that's really the path forward,
00:19:12.040 | but it certainly feels like the zeitgeist to be sure.
00:19:16.240 | Question, yeah.
00:19:17.080 | - If you think it's worth it,
00:19:18.840 | could you go back to the last slide
00:19:21.040 | and maybe explain a little bit,
00:19:23.520 | a more grounded example of what that all means?
00:19:25.480 | I couldn't quite follow.
00:19:27.040 | - Let's do that later.
00:19:28.600 | The point for now though is really this shift from here
00:19:32.920 | where we're mostly learning from scratch for our task.
00:19:36.840 | Here, we've got things like BERT in the mix.
00:19:39.720 | We've got pre-trained components,
00:19:42.200 | models that we hope begin in a state
00:19:44.680 | that gives us a leg up on the problem we're trying to solve.
00:19:47.680 | That's the big thing that happened.
00:19:49.160 | And you get this emphasis
00:19:51.040 | on people releasing model parameters.
00:19:53.840 | In this earlier phase like here,
00:19:56.440 | there was no talk of releasing model parameters
00:19:58.880 | because mostly the models people trained
00:20:01.520 | were just good for the task that they had set.
00:20:04.220 | As we move into this era, and then certainly this one,
00:20:07.640 | these things are meant to be like
00:20:09.480 | general purpose language capabilities
00:20:11.680 | or maybe general purpose computer vision capabilities
00:20:14.920 | that we stitch together into a system
00:20:17.040 | that can do more than any previous system could do.
00:20:20.600 | Right, so then we have this big thing here.
00:20:26.720 | So that's the feeling now.
00:20:28.680 | Behind all of this,
00:20:30.240 | certainly beginning in this final phase here,
00:20:32.920 | is the transformer architecture.
00:20:35.240 | Just let me take the temperature of the room.
00:20:36.840 | How many people have encountered the transformer before?
00:20:39.640 | Right, yeah, it's sort of unavoidable
00:20:42.480 | if you're doing this research.
00:20:43.720 | Here's a diagram of it,
00:20:45.160 | but I'm not gonna go through this diagram now
00:20:47.480 | because starting on Wednesday,
00:20:49.920 | we are gonna have an entire lecture
00:20:52.200 | essentially devoted to unpacking this thing
00:20:54.700 | and understanding it.
00:20:56.040 | All I can say for you now is that I expect you
00:20:59.220 | to go on the following journey, which all of us go on.
00:21:02.560 | How on earth does the transformer work?
00:21:04.880 | It looks very, very complicated.
00:21:07.480 | I hope I can get you to the point where you feel,
00:21:09.960 | oh, this is actually pretty simple components
00:21:12.900 | that have been combined in a pretty straightforward way.
00:21:16.000 | That's your second step on the journey.
00:21:17.600 | The true enlightenment comes from, wait a second,
00:21:21.280 | why does this work at all?
00:21:23.440 | And then you're with the entire field
00:21:25.600 | trying to understand why these simple things,
00:21:28.080 | brought together in this way, have proved so powerful.
00:21:30.980 | The other major thing that happened,
00:21:35.820 | which is kind of latent going all the way back
00:21:38.220 | to the start of AI, especially as it relates to linguistics,
00:21:42.120 | is this notion of self-supervision,
00:21:44.360 | of distributional learning,
00:21:46.280 | because this is gonna unlock the door to us
00:21:48.800 | just learning from the world in the most general sense.
00:21:52.980 | In self-supervision, your model's only goal
00:21:56.880 | is to learn from co-occurrence patterns
00:21:59.640 | in the sequences that it's trained on.
00:22:01.600 | And the sequences can be language,
00:22:03.260 | but they could be language plus sensor readings,
00:22:06.000 | computer code, maybe even images
00:22:08.480 | that you embed in this space, just symbols.
00:22:11.320 | And the model's only goal is to learn
00:22:13.360 | from the distributional patterns that they contain,
00:22:16.720 | or for many of these models, to assign high probability
00:22:20.140 | to the attested sequences in whatever data that you pour in.
00:22:24.160 | For this kind of learning, we don't need to do any labeling.
00:22:27.840 | All we need to do is have lots and lots of symbol streams.
00:22:32.600 | And then when we generate from these models,
00:22:35.600 | we're sampling from them, and that's what we all think of
00:22:37.960 | when we think of prompting and getting a response back.
00:22:40.240 | But the underlying mechanism is, at least in part,
00:22:43.720 | this notion of self-supervision.
00:22:45.360 | And I'll emphasize again, 'cause I think
00:22:46.960 | this is really important for why these models
00:22:48.820 | are so powerful, the symbols do not need to be just language.
00:22:52.800 | They can include lots of other things
00:22:54.920 | that might help a model piece together
00:22:57.960 | a full picture of the world we live in,
00:23:00.400 | and also the connections between language
00:23:02.600 | and those pieces of the world,
00:23:04.480 | just from this distributional learning.
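To make that objective concrete, here is a minimal sketch of the self-supervised, next-token training signal; the toy vocabulary and the small LSTM stand-in are assumptions for illustration, not the models discussed in the lecture.

```python
# Sketch of the self-supervised objective: assign high probability to attested
# sequences by predicting each next token from the ones before it.
# Tiny vocabulary and model are illustrative; no labels, just the symbol stream.
import torch
import torch.nn as nn

vocab = {"<bos>": 0, "better": 1, "late": 2, "than": 3, "never": 4}
ids = torch.tensor([[0, 1, 2, 3, 4]])  # "<bos> better late than never"

embed = nn.Embedding(len(vocab), 16)
lstm = nn.LSTM(16, 16, batch_first=True)  # stand-in for any sequence model
head = nn.Linear(16, len(vocab))

hidden, _ = lstm(embed(ids[:, :-1]))      # predict token t from tokens before t
logits = head(hidden)

# Cross-entropy against the attested next tokens is the negative log-likelihood
# of the sequence under the model. Minimizing it is the whole training signal.
loss = nn.functional.cross_entropy(
    logits.reshape(-1, len(vocab)), ids[:, 1:].reshape(-1)
)
loss.backward()
```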
00:23:07.800 | The result of this proving so powerful
00:23:10.800 | is the advent of large-scale pre-training,
00:23:13.600 | because now we're not held back anymore
00:23:16.480 | by the need for labeled data.
00:23:18.280 | All we need is lots of data in unstructured format.
00:23:22.400 | This really begins in the era of static word representations
00:23:26.200 | like Word2Vec and GloVe.
00:23:28.800 | And in fact, those teams, and I would say
00:23:30.560 | especially the GloVe team, they were really visionary
00:23:33.340 | in the sense that they not only released a paper and code,
00:23:38.340 | but pre-trained parameters.
00:23:41.740 | This was really brand new for the field,
00:23:44.060 | this idea that you would empower people
00:23:46.360 | with model artifacts, and people started using them
00:23:50.260 | as the inputs to recurrent neural networks and other things.
00:23:54.580 | And you started to see pre-training
00:23:57.580 | as an important component to doing really well
00:24:00.180 | at hard things.
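A minimal sketch of that setup, with randomly generated vectors standing in for released GloVe parameters (loading an actual GloVe file is assumed, not shown):

```python
# Sketch of static pre-training: released word vectors initialize an embedding
# layer that feeds a downstream recurrent model. The random vectors here are
# stand-ins for real GloVe parameters, an assumption for illustration.
import torch
import torch.nn as nn

vocab = ["the", "movie", "was", "great"]
pretrained = torch.randn(len(vocab), 50)       # stand-in for 50-d GloVe vectors

embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)
rnn = nn.LSTM(50, 64, batch_first=True)        # task model built on top

ids = torch.tensor([[0, 1, 2, 3]])             # "the movie was great"
outputs, _ = rnn(embedding(ids))               # downstream task consumes these states
```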
00:24:02.500 | There were some predecessors
00:24:04.460 | that I'll talk about next time,
00:24:06.220 | but the really big moment for contextual representations
00:24:10.020 | is the ELMo model.
00:24:11.360 | This is the paper,
00:24:12.200 | Deep Contextualized Word Representations.
00:24:14.380 | I can remember being at the North American ACL meeting
00:24:18.340 | in New Orleans in 2018 at the best paper session.
00:24:22.420 | They had not announced which of the best papers
00:24:24.820 | was gonna win the outstanding paper award,
00:24:27.320 | but we all knew it was gonna be the ELMo paper
00:24:30.260 | because the gains that they had reported
00:24:33.440 | from fine-tuning their ELMo parameters
00:24:35.500 | on hard tasks for the field were just mind-blowing,
00:24:38.740 | the sort of thing that you really only see once
00:24:41.360 | in a kind of generation of this research,
00:24:43.640 | or so we thought.
00:24:44.860 | Because the next year, BERT came out,
00:24:48.980 | same thing, I think same best paper award thing.
00:24:51.820 | The paper already had had huge impact
00:24:54.500 | by the time it was even published,
00:24:56.300 | and they too released their model parameters.
00:25:00.220 | ELMo is not transformer-based.
00:25:02.260 | BERT is the first of the sequence of things
00:25:04.340 | that's based in the transformer,
00:25:05.740 | and again, lifting all boats even above
00:25:08.660 | where ELMo had brought us.
00:25:10.820 | Then we get GPT.
00:25:12.420 | This is the first GPT paper,
00:25:14.220 | and then fast forward a little bit, we get GPT-3,
00:25:17.220 | and that was pre-training at a scale
00:25:21.320 | that was previously kind of unimaginable
00:25:24.300 | 'cause this, now we're talking about,
00:25:26.540 | for the BERT model, 100 million parameters,
00:25:28.820 | and for GPT-3, well north of 100 billion.
00:25:33.020 | Different order of magnitude,
00:25:34.900 | and what we started to see is emergent capabilities.
00:25:38.240 | That model size thing is important.
00:25:41.860 | Again, this is a sort of feeling of progress
00:25:43.900 | and maybe also despair.
00:25:45.500 | I think I can lift your spirits a little bit,
00:25:47.600 | but we should think about model size.
00:25:50.260 | So I have years along the x-axis again,
00:25:53.420 | and I have model size going from 100 million
00:25:55.880 | to one trillion here on a logarithmic scale.
00:25:59.600 | So 2018, GPT, that's like 100 million. BERT,
00:26:03.720 | I think it's 300 million for the large one.
00:26:06.360 | Okay, GPT-2, even larger.
00:26:09.000 | Megatron, 8.3 billion.
00:26:10.720 | I remember when this came out, I probably laughed.
00:26:13.960 | Maybe I thought it was a joke.
00:26:15.240 | I certainly thought it was some kind of typo
00:26:17.260 | because I couldn't imagine that it was actually billion,
00:26:20.340 | like with a B there.
00:26:23.220 | But now, that's, you know, we take that for granted.
00:26:26.260 | Megatron, 11 billion.
00:26:27.700 | This is 2021 or so.
00:26:30.300 | Then we get GPT-3, reportedly at 175 billion parameters.
00:26:34.940 | And then we get this thing where it seems
00:26:36.660 | like we're doing typos again.
00:26:38.180 | Megatron-Turing NLG was like 530 billion,
00:26:41.700 | and then PaLM is 540 billion parameters.
00:26:45.780 | And I guess there are rumors that we have gone upward
00:26:48.340 | all the way to a trillion, right?
00:26:51.500 | There's an undeniable trend here.
00:26:54.280 | I think there is something to this trend,
00:26:57.260 | but we should reflect on it a little bit.
00:26:59.200 | One thing I wanna say is there's a noteworthy pattern
00:27:03.080 | here: very few entities have participated
00:27:06.380 | in this race for very large models.
00:27:09.540 | We've got like Google, NVIDIA, Meta, and OpenAI, right?
00:27:14.540 | And that was actually a real cause for concern.
00:27:16.780 | I remember being at a workshop
00:27:18.340 | between Stanford and OpenAI,
00:27:20.820 | where the number one source of consternation
00:27:23.220 | was really that only OpenAI at that point
00:27:27.020 | had trained these really large models.
00:27:29.060 | And after that, predictably,
00:27:30.560 | these other large tech companies kind of caught up.
00:27:33.760 | But it was still for a while looking like a story
00:27:36.440 | of real centralization of power.
00:27:38.740 | That might still be happening,
00:27:40.900 | but I think there's reason to be optimistic.
00:27:42.500 | So here at Stanford, the HELM group,
00:27:44.780 | which is part of the Center for Research
00:27:46.560 | on Foundation Models, led this incredibly ambitious project
00:27:50.400 | of evaluating lots of language models.
00:27:52.700 | And one thing that emerges from that
00:27:54.540 | is that we have a more healthy ecosystem now.
00:27:57.440 | So we have these like loose collectives,
00:27:59.280 | BigScience and EleutherAI
00:28:00.460 | are both kind of fully open source groups of researchers.
00:28:04.140 | We've got, well, one academic institution represented.
00:28:07.700 | This could be a little bit embarrassing for Stanford.
00:28:09.720 | Maybe we'll correct that.
00:28:11.380 | And then maybe the more important thing
00:28:12.780 | is that we have lots of startups represented.
00:28:14.960 | So these are well-funded, but relatively small outfits
00:28:18.500 | that are producing outstanding language models.
00:28:21.720 | And so the result,
00:28:22.800 | I think we're gonna see much more of this,
00:28:24.860 | and then we'll worry less about centralization of power.
00:28:28.540 | There's plenty of other things to worry about,
00:28:30.380 | so we shouldn't get sanguine about this,
00:28:31.900 | but this particular point, I think,
00:28:34.160 | is being alleviated by current trends.
00:28:36.820 | And there's another aspect of this too,
00:28:39.020 | which is you have this scary rise in model size,
00:28:42.220 | but what is happening right now as we speak
00:28:45.500 | in a very quick way is we're seeing
00:28:47.740 | a push towards smaller models.
00:28:49.980 | And in particular, we're seeing that models
00:28:51.920 | that are in the range of like 10 billion parameters
00:28:55.140 | can be highly performant, right?
00:28:56.940 | So we have the Flan models, we have LLaMA,
00:29:00.660 | and then here at Stanford, they released the Alpaca thing,
00:29:03.460 | and then Databricks released the Hello Dolly model.
00:29:06.580 | These are all models that are like
00:29:07.920 | eight to 10 billion parameters,
00:29:09.780 | which I know this sounds funny
00:29:11.220 | because I laughed a few years ago
00:29:13.120 | when the Megatron model had 8.3 billion,
00:29:15.620 | and now what I'm saying to you
00:29:16.900 | is that this is relatively small, but so it goes.
00:29:20.140 | And the point is that a 10 billion parameter model
00:29:23.220 | is one that could be run on regular old commercial hardware,
00:29:27.620 | whereas these monsters up here,
00:29:29.540 | really you have lots of pressures
00:29:31.240 | towards centralization of power there
00:29:32.800 | because almost no one can work with them.
00:29:35.700 | But anyone essentially can work with alpaca,
00:29:38.100 | and it won't be long before we've got the ability
00:29:40.660 | to kind of work with it on small devices
00:29:42.780 | and things like that.
00:29:43.940 | And that too is really gonna open the door
00:29:47.180 | to lots of innovation.
00:29:48.480 | I think that will bring some good,
00:29:50.100 | and I think it will bring some bad,
00:29:51.780 | but it is certainly a meaningful change
00:29:53.780 | from this scary trend that we were seeing
00:29:55.820 | until four months ago.
00:29:58.020 | As a result of these models being so powerful,
00:30:05.540 | people started to realize
00:30:07.540 | that you can get a lot of mileage out of them
00:30:10.140 | simply by prompting them.
00:30:12.340 | When you prompt one of these very large models,
00:30:14.580 | you put it in a temporary state by inputting some text,
00:30:18.180 | and then you generate a sample from the model
00:30:20.260 | using some technique, and you see what comes out, right?
00:30:22.660 | So if you type into one of these models,
00:30:24.620 | better late than, it's gonna probably spit out never.
00:30:28.300 | If you put in every day, I eat breakfast, lunch,
00:30:31.940 | and it will probably say dinner.
00:30:34.420 | And you might have an intuition that the reasons,
00:30:36.620 | the causes for that are kind of different.
00:30:38.340 | The first one is a sort of idiom,
00:30:40.400 | so that it could just learn from co-occurrence patterns
00:30:42.900 | in text transparently.
00:30:44.760 | For the second one, we kind of interpreted as humans
00:30:47.960 | as reflecting something about routines,
00:30:50.740 | but you should remind yourself
00:30:52.900 | that the mechanism is the same as in the first case.
00:30:56.040 | This was just a bunch of co-occurrence patterns.
00:30:58.240 | A lot of people described their routines in text,
00:31:01.180 | and the model picked up on that.
00:31:03.280 | And carry that thought forward
00:31:04.820 | as you think about things like the president of the US is.
00:31:08.100 | When it fills that in with Biden or whoever,
00:31:11.740 | it might look like it is offering us factual knowledge,
00:31:14.380 | and maybe in some sense it is,
00:31:16.540 | but it's the same mechanism as for those first two examples.
00:31:19.880 | It is just learning from the fact that a lot of people
00:31:23.120 | have expressed a lot of texts that look like
00:31:25.360 | the president of the US is Joe Biden,
00:31:27.600 | and it is repeating that back to us.
00:31:30.040 | And so definitely, if you ask a model something like
00:31:33.220 | the key to happiness is, you should remember
00:31:36.460 | that this is just the aggregate of a lot of data
00:31:39.300 | that it was trained on.
00:31:40.220 | It has no particular wisdom to offer you necessarily
00:31:43.680 | beyond what was encoded latently in that giant sea
00:31:48.680 | of mostly unaudited, unstructured text.
00:31:53.380 | Yeah, question.
00:31:55.120 | - I guess it would be kind of hard
00:31:58.260 | to get something like this,
00:31:59.180 | but if we had a corpus of just like,
00:32:01.900 | all the languages, right,
00:32:03.220 | but literally all of the facts were wrong.
00:32:05.580 | We just imagine like a very factually incorrect corpus.
00:32:08.740 | Like, I guess I'm getting at like,
00:32:12.160 | how do we inject like truth into like these corpuses?
00:32:15.660 | - It's a question that bears repeating.
00:32:19.060 | How do we inject truth?
00:32:20.740 | It's a question you all could think about.
00:32:23.420 | What is truth, of course,
00:32:25.020 | but also what would that mean and how would we achieve it?
00:32:28.960 | And even if we did back off to something like,
00:32:31.580 | how would we ensure self-consistency for a model?
00:32:34.660 | Or, you know, at the level of a worldview
00:32:36.620 | or a set of facts,
00:32:37.860 | even those questions which seem easier to pose
00:32:40.980 | are incredibly difficult questions in the current moment
00:32:44.340 | where our only mechanisms are basically
00:32:46.540 | that self-supervision thing that I described,
00:32:49.040 | and then a little bit of what I'll talk about next.
00:32:51.940 | But none of the structure that we used to have
00:32:55.000 | where we would have a database of knowledge
00:32:56.840 | and things like that,
00:32:58.300 | that is posing problems.
00:33:00.080 | (laughs)
00:33:02.160 | The prompting thing, we take this a step forward, right?
00:33:07.760 | So the GPT-3 paper,
00:33:09.480 | remember that's that 175 billion parameter monster.
00:33:13.400 | The eye-opening thing about that
00:33:15.200 | is what we now call in-context learning,
00:33:18.140 | which was just the notion that for these very large,
00:33:20.920 | very capable models,
00:33:22.080 | you could input a bunch of texts,
00:33:24.380 | like here's a passage,
00:33:26.200 | and maybe an example of the kind of behavior
00:33:28.280 | that you wanted,
00:33:29.500 | and then your actual question,
00:33:31.520 | and the model would do a pretty good job
00:33:33.480 | at answering the question.
00:33:35.400 | And what you're doing here is with your context passage
00:33:38.280 | and your demonstration,
00:33:39.400 | pushing the model to be extractive,
00:33:41.840 | to find an answer to the question in the context passage.
00:33:46.200 | And then the observation of this paper
00:33:48.400 | is that they do a pretty good job
00:33:50.100 | at following that same behavior
00:33:52.280 | for the actual target question at the bottom here.
00:33:55.080 | Remember, this is all just prompting,
00:33:57.400 | putting the model in a temporary state
00:33:59.700 | and seeing what comes out.
00:34:00.860 | You don't change the model,
00:34:02.440 | you just prompt it.
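Here is a minimal sketch of assembling such an in-context-learning prompt; the passage, demonstration, and target question are invented for illustration, and nothing about the underlying model is changed by this.

```python
# Sketch of in-context learning: the "training" is just text in the prompt.
# The passage and questions are illustrative assumptions.
def build_prompt(passage, demo_q, demo_a, target_q):
    return (
        f"Passage: {passage}\n"
        f"Q: {demo_q}\n"
        f"A: {demo_a}\n"   # one demonstration pushes the model to be extractive
        f"Q: {target_q}\n"
        f"A:"
    )

prompt = build_prompt(
    passage="Stanford University was founded in 1885 and opened in 1891.",
    demo_q="When was Stanford University founded?",
    demo_a="1885",
    target_q="When did Stanford University open?",
)
print(prompt)
# A capable model, given this prompt, would likely complete with "1891",
# mirroring the demonstrated behavior for the new question.
```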
00:34:03.920 | This, in 2012, if you had asked me
00:34:07.460 | whether this was a viable path forward for a class project,
00:34:10.320 | I want to prompt an RNN or something,
00:34:13.140 | I would have advised you as best I could
00:34:15.500 | to choose some other topic
00:34:16.940 | because I never would have guessed that this would work.
00:34:19.880 | So the mind-blowing thing about this paper
00:34:23.760 | and everything that's followed
00:34:25.300 | is that we might be nearing the point
00:34:27.040 | where we can design entire AI systems
00:34:29.900 | on the basis of this simple in-context learning mechanism,
00:34:33.780 | transformatively different from anything that we saw before.
00:34:37.140 | In fact, let me just emphasize this a little bit.
00:34:41.220 | It is worth dwelling on how strange this is.
00:34:44.580 | For those of you who have been in the field a little while,
00:34:48.020 | just contrast what I described in-context learning
00:34:51.220 | with the standard mode of supervision.
00:34:55.060 | Let's imagine for a case here
00:34:56.780 | that we want to train a model
00:34:58.580 | to detect nervous anticipation.
00:35:00.940 | And I have picked this
00:35:01.900 | because this is a very particular human emotion.
00:35:05.380 | And in the old mode,
00:35:06.460 | we would need an entire dedicated model to this, right?
00:35:09.700 | We would collect a little dataset
00:35:11.860 | of positive and negative instances of nervous anticipation,
00:35:16.040 | and we would train a supervised classifier
00:35:19.040 | on feature representations of these examples over here,
00:35:22.340 | learning from this binary distinction.
00:35:25.500 | We would need custom data and a custom model
00:35:28.420 | for this particular task in all likelihood.
00:35:30.900 | In this new mode, few-shot in-context learning,
00:35:35.580 | we essentially just prompt the model,
00:35:37.140 | "Hey, model, here's an example of nervous anticipation."
00:35:40.620 | My palms started to sweat
00:35:41.940 | as the lotto numbers were read off.
00:35:43.740 | "Hey, model, here's an example
00:35:45.380 | without nervous anticipation," and so forth.
00:35:48.220 | And it learns from all those symbols that you put in
00:35:52.700 | and their co-occurrences,
00:35:54.440 | something about nervous anticipation.
00:35:58.180 | On the left for this model here,
00:35:59.960 | I've written out nervous anticipation,
00:36:01.660 | but remember, that has no special status.
00:36:03.860 | I've structured the model around the binary distinction,
00:36:07.340 | the one and the zero.
00:36:08.940 | And everything about the model
00:36:10.320 | is geared toward my learning goal.
00:36:12.220 | On the right, nervous anticipation is just more
00:36:16.200 | of the symbols that I've put into the model.
00:36:19.220 | And the eye-opening thing, again,
00:36:21.120 | about the GPT-3 paper and what's followed
00:36:24.140 | is that models can learn, be put in a temporary state,
00:36:28.280 | and do well at tasks like this.
00:36:30.280 | Now, I talked about self-supervision before,
00:36:36.020 | and I think that is a major component
00:36:38.180 | to the success of these models,
00:36:39.940 | but it is increasingly clear that it is not the only thing
00:36:43.620 | that is driving learning in the best models in this class.
00:36:47.840 | The other thing that we should think about
00:36:51.020 | is what's called reinforcement learning
00:36:52.940 | with human feedback.
00:36:54.740 | This is a diagram from the chat GPT blog post.
00:36:58.400 | There are a lot of details here,
00:36:59.780 | but really two of them are important for us for right now.
00:37:03.160 | The first is that in a phase of training these models,
00:37:08.160 | people are given inputs and ask themselves
00:37:12.260 | to produce good outputs for those inputs.
00:37:15.460 | So you might be asked to do a little Python program,
00:37:17.940 | and you yourself as an annotator
00:37:19.420 | might write that Python program, for example.
00:37:22.340 | So that's highly skilled work
00:37:24.020 | that depends on a lot of human intelligence.
00:37:26.740 | And those examples, those pairs,
00:37:28.960 | are part of how the model is trained.
00:37:31.460 | And that is so important because that takes us way beyond
00:37:34.560 | just learning from co-occurrence patterns
00:37:36.900 | of symbols and text.
00:37:38.440 | It is now back to a very familiar story from all of AI,
00:37:43.060 | which is that it's not magic.
00:37:45.220 | What is happening is that a lot of human intelligence
00:37:48.400 | is driving the behavior of these systems.
00:37:51.940 | And that happens again at step two here.
00:37:54.220 | So now the model produces different outputs,
00:37:56.740 | and humans come in and rank those outputs,
00:37:59.220 | again, expressing direct human preferences
00:38:02.660 | that take us well beyond self-supervision.
00:38:05.460 | So we should remember, we had that brief moment
00:38:08.200 | where it looked like it was all unstructured, unlabeled data,
00:38:11.380 | and that was important to unlocking these capacities,
00:38:14.460 | but now we are back at a very labor-intensive
00:38:17.660 | human capacity here, driving what looked like
00:38:21.340 | the really important behaviors for these models.
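Here is a minimal sketch of how that step-2 ranking signal is commonly turned into a training objective, a pairwise reward-model loss; the encoder features and dimensions are stand-ins, and this is a generic formulation, not any particular lab's implementation.

```python
# Human rankings are turned into pairs (preferred, dispreferred), and a reward
# model is trained so the preferred output scores higher.
import torch
import torch.nn as nn

reward_model = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 1))

# Stand-ins for encoded (prompt, response) pairs; in practice these would come
# from a large pre-trained encoder, which is an assumption here.
chosen = torch.randn(8, 768)    # outputs humans ranked higher
rejected = torch.randn(8, 768)  # outputs humans ranked lower

margin = reward_model(chosen) - reward_model(rejected)
loss = -nn.functional.logsigmoid(margin).mean()  # push preferred above dispreferred
loss.backward()

# The learned reward model then provides the training signal for the policy
# in the reinforcement-learning stage.
```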
00:38:24.020 | Final step, which I think actually intimately relates
00:38:29.820 | to that instruct tuning that I just described.
00:38:32.100 | That's a kind of way of summarizing
00:38:33.500 | this reinforcement learning with human feedback.
00:38:36.340 | And this is what's called step-by-step
00:38:38.380 | or chain-of-thought reasoning.
00:38:39.720 | Now we're thinking about the prompts
00:38:41.300 | that we use for these models.
00:38:43.520 | So suppose we asked ourselves a question like,
00:38:45.500 | can models reason about negation?
00:38:48.020 | To give an example, does the model know
00:38:50.000 | that if the customer doesn't have any loans,
00:38:54.500 | then the customer doesn't have any auto loans?
00:38:57.140 | It's a simple example.
00:38:58.340 | It's the sort of reasoning that you might have to do
00:39:00.180 | if you're thinking about a contract or something like that,
00:39:03.240 | whether a rule has been followed.
00:39:05.300 | And it just involves negation,
00:39:07.780 | our old friend from the start of the lecture.
00:39:10.860 | Now in the old school prompting style,
00:39:13.580 | all the way back in 2021,
00:39:15.680 | we would kind of naively just input,
00:39:18.340 | is it true that if the customer doesn't have any loans,
00:39:20.700 | then the customer doesn't have any auto loans
00:39:22.840 | into one of these models?
00:39:24.280 | And we would see what came back.
00:39:26.360 | And here it says, no, this is not necessarily true.
00:39:28.580 | A customer can have auto loans
00:39:30.000 | without having any other loans,
00:39:31.400 | which is the reverse of the question that I asked.
00:39:34.900 | Again, kind of showing it doesn't deeply understand
00:39:37.840 | what we put in here.
00:39:38.920 | It just kind of does an act that looks like it did.
00:39:42.060 | And that's worrisome.
00:39:44.280 | But we're learning how to communicate
00:39:46.320 | with these very alien creatures.
00:39:47.680 | Now we do what's called step-by-step prompting.
00:39:50.160 | This is the cutting edge thing.
00:39:51.740 | You would just tell the model that it was in some kind
00:39:53.780 | of logical or common sense reasoning exam.
00:39:56.620 | That matters to the model.
00:39:58.540 | Then you could give some instructions,
00:40:00.680 | and then you could give an example in your prompts
00:40:03.100 | of the kind of thing it was gonna see.
00:40:05.680 | And then finally you could prompt it with your premise,
00:40:08.300 | and then your question.
00:40:09.780 | And the model would spit out something
00:40:11.780 | that looked really good.
00:40:13.140 | Here, I won't bother going through the details,
00:40:15.400 | but with that kind of prompt,
00:40:18.260 | the model now not only answers and reasons correctly,
00:40:21.440 | but also offers a really nice explanation
00:40:23.900 | of its own reasoning.
00:40:25.740 | The capacity was there.
00:40:27.440 | It was latent, and we didn't see it
00:40:29.500 | in the simple prompting mode,
00:40:31.300 | but the more sophisticated prompting mode elicited it.
00:40:35.080 | And I think this is in large part the result
00:40:38.460 | of the fact that this model was instruct tuned.
00:40:40.820 | And so people actually taught it
00:40:42.700 | about how that markup is supposed to work,
00:40:45.380 | and how it's supposed to think about prompts like this.
00:40:47.980 | So the combination of all that human intelligence
00:40:50.340 | and the capacity of the model led to this really interesting
00:40:53.500 | and much better behavior.
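Here is a minimal sketch of the two prompting styles just contrasted; the exam framing, instructions, and worked example are illustrative wording, not the exact prompts from the lecture.

```python
# Naive prompt vs. step-by-step prompt for the negation example above.
naive_prompt = (
    "Is it true that if the customer doesn't have any loans, "
    "then the customer doesn't have any auto loans?"
)

step_by_step_prompt = (
    "You are taking a logical and commonsense reasoning exam.\n"
    "Carefully read the premise, reason step by step, and answer "
    "'yes' or 'no' with a brief explanation.\n\n"
    "Example:\n"
    "Premise: The customer doesn't own any vehicles.\n"
    "Question: Is it true that the customer doesn't own any cars?\n"
    "Reasoning: Cars are vehicles, so owning no vehicles entails owning no cars.\n"
    "Answer: yes\n\n"
    "Premise: The customer doesn't have any loans.\n"
    "Question: Is it true that the customer doesn't have any auto loans?\n"
    "Reasoning:"
)
# With the richer prompt, an instruct-tuned model is much more likely to reason
# correctly (auto loans are loans, so no loans entails no auto loans) and to
# explain itself, eliciting a capacity the naive prompt missed.
```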
00:40:54.980 | That is a glimpse of the foundations
00:41:02.540 | of all of this, I would say.
00:41:04.060 | Of course, we're gonna unpack all of that stuff
00:41:06.220 | as we go through the quarter,
00:41:07.940 | but I hope you're getting a sense for it.
00:41:10.020 | Are there questions I can answer about it?
00:41:11.920 | Things I could circle back on?
00:41:14.780 | - The human brain has about 100 billion neurons,
00:41:17.940 | is my understanding.
00:41:19.020 | And I'm not sure how many parameters that might be,
00:41:22.340 | maybe like 10 trillion parameters or something like that.
00:41:26.060 | Are we approaching a point where these machines
00:41:28.060 | can start emulating the human brain,
00:41:30.300 | or is there something to the language instinct,
00:41:33.140 | or, you know, instincts of all kinds
00:41:35.340 | that maybe take into the human brain?
00:41:37.180 | - Oh, it's nothing but big questions today.
00:41:41.440 | Right, so the question is kind of like,
00:41:43.900 | what is the relationship between the models
00:41:46.020 | we're talking about and the human brain?
00:41:47.700 | And you raised that in terms of the size,
00:41:49.980 | and I guess the upshot of your description
00:41:52.380 | was that these models remain smaller than the human brain.
00:41:55.820 | I think that's reasonable.
00:41:57.180 | It's tricky though.
00:41:59.340 | On the one hand, they obviously have superhuman capabilities.
00:42:02.560 | On the other hand, they fall down in ways that humans don't.
00:42:07.060 | It's very interesting to ask why that difference exists.
00:42:11.180 | And maybe that would tell us something
00:42:12.700 | about the limitations of learning from scratch
00:42:16.820 | versus being initialized by evolution,
00:42:19.180 | the way all of us were.
00:42:20.420 | I don't know, but I would say that
00:42:23.900 | underlying your whole line of questioning
00:42:26.100 | is the question, can we use these models
00:42:28.980 | to illuminate questions of neuroscience
00:42:31.180 | and cognitive science?
00:42:32.740 | And I think we should be careful,
00:42:34.700 | but that the answer is absolutely yes.
00:42:36.700 | And in fact, the increased ability of these models
00:42:40.660 | to learn from data has been really illuminating
00:42:44.340 | about certain recalcitrant questions
00:42:47.260 | from cognitive science in particular.
00:42:49.700 | You have to be careful because they're so different from us,
00:42:52.180 | these models.
00:42:53.300 | On the other hand, I think they are helping us understand
00:42:57.160 | how to differentiate different theories of cognition.
00:42:59.700 | And ultimately, I think they will help us
00:43:01.620 | understand cognition itself.
00:43:03.300 | And I would, of course, welcome projects
00:43:07.740 | that were focused on those cognitive questions in here.
00:43:09.980 | This is a wonderful space in which to explore
00:43:13.260 | this kind of more speculative angle,
00:43:16.140 | connecting AI to the cognitive sciences.
00:43:19.300 | Other questions, comments?
00:43:25.060 | Yes, in the back.
00:43:26.180 | - I would be curious to understand whether,
00:43:28.900 | I mean, partially following up on the brain thing,
00:43:31.300 | just to use a metaphor of our brain
00:43:33.340 | not being just one huge lump of neurons,
00:43:35.620 | but being separated into different areas.
00:43:38.300 | And then also thinking about the previous phase
00:43:41.220 | that you talked about, about breaking up the models
00:43:43.980 | and potentially having a model in the front
00:43:46.100 | that decides which domain our question falls into,
00:43:49.460 | and then having different sub-models.
00:43:52.460 | And I'm wondering whether that's arising,
00:43:54.260 | whether we're gonna touch on an architecture like that.
00:43:57.700 | Because it just seems natural to me
00:43:59.180 | because prompting a huge model
00:44:01.340 | is just very expensive computationally.
00:44:05.180 | It feels like combining big models and logic trees
00:44:08.820 | could be a cool approach.
00:44:10.820 | - I love it.
00:44:11.700 | Yeah, like one quick summary of what you said
00:44:13.820 | would relate directly to your question.
00:44:15.340 | The modularity of mind is an important old question
00:44:19.420 | about human cognition.
00:44:21.260 | To what extent are our abilities
00:44:23.500 | modularized in the mind-brain?
00:44:25.660 | With these current models,
00:44:29.060 | which have a capacity to do lots of different things
00:44:31.740 | if they have the right pre-training and the right structure,
00:44:34.140 | we could ask, does modularity emerge naturally?
00:44:37.340 | Or do they learn non-modular solutions?
00:44:40.260 | Both of those seem like they could be indirect evidence
00:44:43.500 | for how people work.
00:44:44.860 | Again, we have to be careful
00:44:46.140 | 'cause these models are so different from us.
00:44:48.060 | But as a kind of existence proof, for example,
00:44:50.220 | that modularity was emergent
00:44:51.820 | from otherwise unstructured learning,
00:44:54.220 | that would be certainly eye-opening, right?
00:44:56.820 | I have no idea.
00:44:58.660 | Yeah, I don't know whether there are results for that.
00:45:00.700 | Are there results?
00:45:02.180 | - No, just kind of a follow-up question on that as well.
00:45:06.020 | So given how closed all these big models are,
00:45:09.780 | how could we interact with the model in such a way
00:45:12.820 | that helps us learn if there is modularity?
00:45:15.700 | 'Cause we literally can only interact with it.
00:45:18.020 | So how do we go about studying that?
00:45:20.980 | Right, so the question is, you know,
00:45:22.740 | the closed-off nature of a lot of these models
00:45:26.300 | has been a problem.
00:45:27.260 | We can access the OpenAI models,
00:45:29.180 | but only through an API.
00:45:30.380 | We don't get to look at their internal representations.
00:45:33.220 | And that has been a blocker.
00:45:35.100 | But I mentioned the rise of these 10 billion parameter models
00:45:39.420 | as being performant and interesting.
00:45:41.620 | And those are models that, with the right hardware,
00:45:44.260 | you can dissect a little bit.
00:45:46.060 | And I think that's just gonna get better and better.
00:45:48.060 | And so we'll be able to, you know,
00:45:49.780 | peer inside them in ways
00:45:51.100 | that we haven't been able to until recently.
00:45:53.380 | Yeah.
00:45:54.220 | And in fact, like,
00:45:56.820 | we're gonna talk a lot about explainability.
00:45:59.020 | That's a major unit of this course.
00:46:00.580 | And I think it's an increasingly important area
00:46:03.340 | of the whole field that we have techniques
00:46:05.580 | for understanding these models
00:46:07.780 | so that we know how they're gonna behave
00:46:09.300 | when we deploy them.
00:46:10.620 | And it would be wonderfully exciting
00:46:12.140 | if you all wanted to try to scale
00:46:13.900 | the methods we talk about to a model
00:46:15.940 | that was as big as eight or 10 billion parameters.
00:46:18.700 | Ambitious just to do that,
00:46:20.500 | but then maybe a meaningful step forward.
00:46:22.580 | Yeah.
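Just to make "peering inside" concrete, here is a minimal sketch, using the Hugging Face transformers library, of pulling out a model's layer-by-layer hidden states. The checkpoint name is only an illustrative assumption, and real interpretability work goes far beyond printing shapes.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "EleutherAI/pythia-1.4b"   # assumed example of a small open checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)

inputs = tokenizer("Are pitchers allowed to wear white gloves?", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One tensor per layer (plus the embedding layer), each of shape
# (batch, sequence_length, hidden_size): raw material for probing,
# intervention, and other analysis methods.
for layer_idx, layer_states in enumerate(outputs.hidden_states):
    print(layer_idx, tuple(layer_states.shape))
```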
00:46:24.700 | - I have a question back to, like,
00:46:26.220 | this baseball cap prompt that we were discussing.
00:46:28.900 | So I suppose, like,
00:46:29.740 | a part of the way that we discuss rules
00:46:32.420 | is, like, there is a little bit of ambiguity
00:46:34.460 | for, like, human interpretation.
00:46:35.940 | Like, for example, in the Honor Code
00:46:37.860 | and the Fundamental Standard,
00:46:38.780 | like, it's intentionally ambiguous
00:46:40.580 | so that it's context dependent.
00:46:42.980 | And so, like, the idea is that there's, like,
00:46:44.700 | this inherent underlying value system
00:46:46.900 | that, like, affords whatever the rules
00:46:49.380 | that are written out are.
00:46:50.300 | And so that's, like, the primary form of evaluation.
00:46:54.580 | And so I guess, like, how does that play into, then,
00:46:56.980 | how these language models are understanding,
00:46:58.540 | like, is there some form of deeper
00:47:01.740 | value system that's encoded into them?
00:47:04.540 | - You could certainly ask.
00:47:07.220 | I mean, the essence of your question is,
00:47:08.780 | could we, with analysis techniques, say,
00:47:11.980 | find out that a model had a particular belief system
00:47:15.020 | that was guiding its behavior?
00:47:17.100 | I think we can ask that question now.
00:47:19.260 | It sounds fantastically difficult,
00:47:20.980 | but maybe piecemeal we could get,
00:47:22.500 | make some progress on it for sure.
00:47:24.580 | Yeah, I wanna return to the MLB one, though,
00:47:26.620 | because, well, as you'll see,
00:47:28.700 | and as I think we already saw,
00:47:30.180 | these models purport to offer evidence from a rule book,
00:47:33.580 | and that's where I feel stuck.
00:47:35.340 | - You're keeping score, Tom.
00:47:39.260 | I posted the answer and some other stuff
00:47:42.100 | in the class discussion.
00:47:44.340 | - Wonderful, thank you.
00:47:51.060 | - Can we just hook up these models to a large database
00:47:54.220 | of actual, verified information,
00:47:56.220 | say an encyclopedia, and allow it to,
00:47:58.700 | you know, look things up?
00:47:59.860 | - Well, kind of, yes.
00:48:04.180 | Actually, this is the sort of solution
00:48:05.540 | that I wanna advocate for.
00:48:06.860 | I'm gonna do this in a minute.
00:48:08.140 | Yeah.
00:48:08.980 | Here, let's, so we'll do this overview.
00:48:13.100 | I wanna give you a feel for how the course will work,
00:48:15.260 | and then dive into some of our major themes.
00:48:18.500 | So high-level overview, we've got these topics,
00:48:20.580 | contextual representations, transformers and stuff,
00:48:23.700 | multi-domain sentiment analysis,
00:48:25.300 | that will be the topic of the first homework,
00:48:27.540 | and it's gonna build on the first unit there.
00:48:30.620 | Retrieval augmented in-context learning,
00:48:32.700 | this is where we might hook up to a database
00:48:34.620 | and get some guarantees about how these models will behave.
00:48:38.300 | Compositional generalization.
00:48:40.300 | In case you were worried that all the tasks were solved,
00:48:42.540 | I'm gonna confront you with a task,
00:48:44.340 | a seemingly simple task about semantic interpretation
00:48:47.580 | that you will, well, I think it will not be solved.
00:48:50.180 | I mean, those could be famous last words,
00:48:51.860 | 'cause who knows what you all are capable of,
00:48:54.260 | but it's a very hard task that we will pose.
00:48:57.020 | We'll talk about benchmarking and adversarial training
00:48:59.340 | and testing, increasingly important topics
00:49:01.780 | as we move into this mode where everyone is interacting
00:49:04.220 | with these large language models,
00:49:05.900 | and feeling impressed by their behavior.
00:49:08.140 | We need to take a step back and rigorously assess
00:49:11.020 | whether they actually are behaving in good ways,
00:49:13.340 | or whether we're just biased toward remembering
00:49:15.420 | the good things and forgetting the bad ones.
00:49:17.660 | We'll do model introspection,
00:49:19.700 | that's the explainability stuff that I mentioned,
00:49:21.740 | and finally methods and metrics.
00:49:23.260 | And as you can see, for, like, topics five, six, and seven,
00:49:26.820 | that's gonna be in the phase of the course
00:49:28.540 | where you're focused on final projects,
00:49:31.260 | and I'm hoping that that gives you tools
00:49:33.060 | to write really rich final papers
00:49:35.140 | that have great analysis in them,
00:49:37.620 | and really excellent assessments.
00:49:40.460 | And then for the work that you'll do,
00:49:41.900 | we're gonna have three assignments,
00:49:44.460 | and each one of the assignments is paired
00:49:46.340 | with what we call a bake-off,
00:49:47.540 | which is an informal competition around data and modeling.
00:49:51.180 | Essentially, the homework problems ask you
00:49:53.380 | to set up some baseline systems,
00:49:55.500 | and get a feel for a problem,
00:49:57.500 | and then you write your own original system,
00:50:00.060 | and you enter that into the bake-off.
00:50:01.700 | And we have a leaderboard on Gradescope,
00:50:03.940 | and the team is gonna look at all your submissions,
00:50:06.940 | and give out some prizes for top-performing systems,
00:50:09.940 | but also systems that are really creative,
00:50:12.380 | or interesting, or ambitious, or something like that.
00:50:15.380 | And that has always been a lot of fun,
00:50:17.900 | and also really illuminating,
00:50:19.500 | 'cause it's like crowdsourcing a whole lot
00:50:21.820 | of different approaches to a problem,
00:50:23.860 | and then as a group, we can reflect on what worked,
00:50:26.860 | and what didn't, and look at the really ambitious things
00:50:29.260 | that you all try.
00:50:30.620 | So that's my favorite part.
00:50:31.820 | We have three offline quizzes,
00:50:34.580 | and this is just as a way to make sure you have incentives
00:50:37.900 | to really immerse yourself in the course material.
00:50:42.540 | Those are done on Canvas.
00:50:44.340 | There's actually a fourth quiz,
00:50:45.660 | which I'll talk a little bit about probably next time,
00:50:48.020 | that is just making sure you understand the course policies.
00:50:52.060 | That's quiz zero.
00:50:53.420 | You can take it as many times as you want,
00:50:55.340 | but the idea is that you will have some incentive
00:50:58.420 | to learn about policies like due dates, and so forth.
00:51:02.220 | And then the real action is in the final project,
00:51:04.620 | and that will have a lit review phase,
00:51:06.620 | an experiment protocol, and a final paper.
00:51:09.580 | Those three components, you'll probably do those in Teams,
00:51:12.300 | and throughout all of that work,
00:51:14.060 | you'll be mentored by someone from the teaching team.
00:51:16.820 | And as I said before, we have this incredibly expert
00:51:20.220 | teaching team, lots of varied expertise,
00:51:23.060 | a lot of experience in the field,
00:51:25.220 | and so we hope to align you
00:51:28.380 | with someone who's really aligned with your project goals,
00:51:31.820 | and then I think you can go really, really far.
00:51:34.740 | Yeah.
00:51:35.580 | - It looks like we're about quarter.
00:51:37.340 | Already looking forward to the bake-offs,
00:51:38.940 | and all Stanford kids get obsessed with this stuff.
00:51:42.740 | On the final project, is this more of an academic paper,
00:51:47.100 | or rather about building working code,
00:51:51.140 | and showing state-of-the-art results?
00:51:54.700 | - Great question.
00:51:55.540 | For the first one, the Bake-offs, yes.
00:51:56.860 | It is easy to get obsessed with your Bake-off entry.
00:52:00.060 | I would say that if you get obsessed,
00:52:02.300 | and you do really well,
00:52:03.660 | just make that into your final project.
00:52:05.860 | All three of them
00:52:08.140 | are really important problems.
00:52:10.140 | They are not idle work.
00:52:11.660 | I mean, one of them is on retrieval augmented
00:52:13.540 | in-context learning,
00:52:14.380 | which is one of my core research focuses right now,
00:52:16.780 | so is compositional generalization.
00:52:18.820 | If you do something really interesting for a Bake-off,
00:52:20.900 | make it your final paper,
00:52:22.540 | and then go on to publish it.
00:52:24.500 | For the second part of your question,
00:52:25.900 | I would say that the core goal is to get you
00:52:28.260 | to produce something that could be
00:52:30.220 | a research contribution in the field,
00:52:32.460 | and we have lots of success stories.
00:52:34.820 | I've got links at the website to people who have gone on
00:52:37.620 | to publish their final paper as an NLP paper.
00:52:41.140 | I'm careful the way I say that.
00:52:42.900 | They didn't literally publish the final paper
00:52:45.060 | because in 10 weeks,
00:52:46.540 | almost no one can produce a publishable paper.
00:52:48.740 | It's just not enough time,
00:52:50.500 | but you could form the basis for then working
00:52:52.660 | a little bit more or a lot more,
00:52:55.060 | and then getting a really outstanding publication out of it.
00:52:57.820 | And I would say that that's the default goal.
00:52:59.620 | The nature of the contribution though is highly varied.
00:53:02.540 | We have one requirement,
00:53:03.700 | which is that the final paper have
00:53:04.980 | some quantitative evaluation in it,
00:53:07.900 | but there are a lot of ways to satisfy that requirement,
00:53:10.420 | and then you could be serving
00:53:11.820 | many different questions in the field
00:53:14.180 | for some expansive notion of the field as well.
00:53:18.500 | Background materials.
00:53:25.940 | So I should say that officially,
00:53:28.620 | we are presupposing CS224N or CS224S as prerequisites for the course.
00:53:34.820 | And what that means is that I'm gonna skip a lot of
00:53:37.980 | the fundamentals that we have covered in past years.
00:53:41.220 | If you need a refresher,
00:53:43.340 | check out the background page of the course site.
00:53:45.980 | It covers fundamentals of scientific computing,
00:53:49.260 | static vector representations like
00:53:51.740 | word2vec and GloVe, and supervised learning.
00:53:54.780 | And I'm hoping that that's enough of a refresher.
00:53:57.580 | If you look at that material and find that it too is kind of
00:54:01.460 | beyond where you're at right now,
00:54:03.540 | then contact us on the teaching team and we can
00:54:06.100 | think about how to manage that.
00:54:08.620 | But officially, this is a course that presupposes CS224N.
00:54:14.620 | Then the core goals. This kind of relates to that previous question.
00:54:18.900 | Hands-on experience with a wide range of problems.
00:54:22.060 | Mentorship from the teaching team to guide you through projects and assignments.
00:54:27.380 | And then really the central goal here is to make you the best,
00:54:30.660 | that is most insightful, most responsible,
00:54:33.240 | most flexible NLU researcher and practitioner that you can be for whatever you decide to do next.
00:54:40.020 | And we're assuming that you have lots of diverse goals that somehow connect with NLU.
00:54:45.500 | All right. Let's do some course themes unless there are questions.
00:54:54.140 | I have a whole final section of this slideshow that's about the course:
00:54:59.340 | materials and requirements and stuff.
00:55:01.960 | Might save that for next time and you can check it out at
00:55:04.400 | the website and you'll be forced to engage with it for quiz zero.
00:55:08.600 | I thought instead I would dive back into
00:55:11.240 | the content part of this unless there are questions or comments.
00:55:15.240 | All right. First course theme,
00:55:21.080 | transformer-based pre-training.
00:55:23.640 | So starting with the transformer,
00:55:25.920 | we want to talk about core concepts and goals.
00:55:28.480 | Give you a sense for what these models are like,
00:55:30.940 | why they work, what they're supposed to do, all of that stuff.
00:55:34.440 | We'll talk about a bunch of different architectures.
00:55:37.520 | There are dozens and dozens of them,
00:55:39.560 | but I hope that I have picked the right selection of them to give you
00:55:44.120 | a feel for how people are thinking about these models and the kinds of
00:55:47.760 | innovations they've brought in that have led to
00:55:50.080 | real, meaningful advancement just at the level of architectures.
00:55:53.900 | We'll also talk about positional encoding,
00:55:55.840 | which I think maybe a lot of us have been surprised to see just how
00:55:59.040 | important that is as a differentiator for different approaches in this space.
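To make "positional encoding" a bit more concrete, here's a minimal sketch of one common scheme, the sinusoidal encoding from the original transformer paper; there are many alternatives, and this is just to fix the idea.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Classic sinusoidal positional encodings, one row per position."""
    positions = np.arange(seq_len)[:, None]                      # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                           # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])                  # even dimensions: sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])                  # odd dimensions: cosine
    return encoding

print(sinusoidal_positions(4, 8).round(2))
```

Each position gets a unique pattern of sines and cosines, which is what lets an order-insensitive attention mechanism recover word order.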
00:56:03.960 | We'll talk about distillation,
00:56:06.120 | taking really large models and making them smaller.
00:56:09.400 | It's an important goal for lots of reasons and an exciting area of research.
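As a rough picture of the core distillation idea, here's a minimal sketch: train a small student to match a large teacher's softened output distribution. Everything below is a stand-in; real distillation pipelines add the usual task loss, lots of data, and many refinements.

```python
import torch
import torch.nn.functional as F

# Stand-ins: in practice the teacher is a big frozen model and the student
# is a much smaller network trained on real data.
teacher_logits = torch.randn(8, 2)
student = torch.nn.Linear(16, 2)
inputs = torch.randn(8, 16)

T = 2.0                                   # temperature softens both distributions
student_logits = student(inputs)
loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * (T ** 2)                              # conventional T^2 rescaling
loss.backward()
print(float(loss))
```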
00:56:13.920 | Then, as I mentioned,
00:56:15.320 | one member of the teaching team is going to do a little lecture for us on diffusion objectives for these models,
00:56:19.760 | and then another is going to talk about practical pre-training and fine-tuning.
00:56:24.120 | I'm going to enlist the entire teaching team to do guest lectures,
00:56:28.140 | and these are the two that I've lined up so far.
00:56:31.120 | That will culminate in, or be aligned with, this first homework and bake-off,
00:56:35.800 | which is on multi-domain sentiment analysis.
00:56:37.920 | I'm going to give you a bunch of different sentiment datasets,
00:56:40.760 | and you're going to have to design one system that can succeed on all of them.
00:56:45.000 | Then for the Bake-off,
00:56:46.600 | we have an unlabeled dataset for you.
00:56:48.600 | We have the labels, but you won't.
00:56:50.520 | That has data that's like what you developed on,
00:56:54.160 | and then some mystery examples that you will not really be able to anticipate.
00:56:58.560 | We're going to see how well you do at handling
00:57:01.000 | all of these different domains with one system.
00:57:04.160 | This is by way of again,
00:57:07.840 | a refresher on core concepts and supervised learning,
00:57:11.320 | and really getting you to think about transformers.
00:57:13.720 | Although we're not going to constrain the solution that you
00:57:16.280 | offer for your original system.
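Just to fix ideas about what a simple multi-domain baseline could look like, here is a minimal sketch that pools a few toy domains, embeds everything with one frozen sentence encoder, and fits a single classifier on the mixture. The datasets, encoder choice, and labels are all illustrative assumptions, not the actual homework setup.

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Toy examples from three made-up domains; label 1 = positive, 0 = negative.
domains = {
    "restaurants": [("The tacos were incredible.", 1), ("Cold food, rude staff.", 0)],
    "movies":      [("A moving, beautifully shot film.", 1), ("Two hours I will never get back.", 0)],
    "products":    [("Battery lasts all week.", 1), ("Broke after one use.", 0)],
}

texts, labels = [], []
for examples in domains.values():
    for text, label in examples:
        texts.append(text)
        labels.append(label)

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # one frozen encoder for every domain
features = encoder.encode(texts)                    # shape: (n_examples, embedding_dim)

# A single classifier trained on the pooled domains.
clf = LogisticRegression(max_iter=1000).fit(features, labels)
print(clf.predict(encoder.encode(["Surprisingly good for the price."])))
```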
00:57:19.600 | Our second major theme will be retrieval augmented in context learning.
00:57:27.360 | A topic that I would not even have dreamt of five years ago,
00:57:33.440 | and seemed kind of infeasible three years ago,
00:57:36.120 | and that we first did two years- one year ago?
00:57:38.960 | Oh goodness. I think this is only the second time,
00:57:41.720 | but I had to redo it entirely because things have changed so much.
00:57:46.520 | Here's the idea.
00:57:48.760 | We have two characters so far in our kind of emerging narrative for NLU.
00:57:53.520 | On the one hand, we have this approach that I'm going to call LLMs for everything,
00:57:57.840 | large language models for everything.
00:58:00.120 | You input some kind of question.
00:58:02.560 | Here I've chosen a very complicated question.
00:58:04.600 | Which MVP of a game Red Flaherty umpired was elected to the Baseball Hall of Fame?
00:58:10.600 | Hats off to you if you know that the answer is Sandy Koufax.
00:58:15.120 | Um, the LLMs for everything approach is that you just type that question in,
00:58:20.840 | and the model gives you an answer.
00:58:23.000 | And hopefully you're happy with the answer.
00:58:26.320 | The other character that I'm going to introduce
00:58:28.960 | here is what I'm going to call retrieval augmented.
00:58:31.720 | So I have the same question at the top here,
00:58:34.180 | except now this is going to proceed differently.
00:58:36.040 | The first thing that we will do is take some large language model and
00:58:39.840 | encode that query into some numerical representation.
00:58:44.880 | That's sort of familiar.
00:58:46.600 | The new piece is that we're going to also have a knowledge store,
00:58:50.520 | which you could think of as an old-fashioned web index, right?
00:58:55.520 | Just a knowledge store of documents with
00:58:58.520 | the modern twist that now all of the documents
00:59:01.280 | are also represented by large language models.
00:59:04.120 | But fundamentally, this is an index of a sort that drives all web search right now.
00:59:09.280 | We can score documents with respect to
00:59:12.040 | queries on the basis of these numerical representations.
00:59:15.080 | And if we want to,
00:59:16.400 | we can reproduce the classic search experience.
00:59:19.200 | Here I've got a ranked list of documents that came back from my query,
00:59:23.920 | just like when you do Google as of the last time I googled.
00:59:28.400 | But in this mode, we can continue, right?
00:59:30.760 | We can have another language model slurp up
00:59:33.080 | those retrieved documents and synthesize them into an answer.
00:59:37.520 | And so here at the bottom I've got,
00:59:39.160 | it's kind of small, but it's the same answer over here.
00:59:41.400 | Although notably, this answer is now decorated with
00:59:44.600 | links that would allow you the user to track back to
00:59:48.320 | what documents actually provided that evidence.
00:59:52.280 | Whereas on the left,
00:59:53.920 | who knows where that information came from?
00:59:56.400 | And that's kind of what we were already grappling with.
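Here is a minimal sketch of that retrieval step: encode the query and the documents with one frozen encoder and rank the documents by similarity. The encoder name and the tiny toy knowledge store are illustrative assumptions.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative frozen encoder

documents = [   # toy knowledge store
    "Sandy Koufax was the MVP of the 1965 World Series.",
    "Red Flaherty was one of the umpires for the 1965 World Series.",
    "Sandy Koufax was elected to the Baseball Hall of Fame in 1972.",
]
query = "Which MVP of a game Red Flaherty umpired was elected to the Baseball Hall of Fame?"

# Both the query and the documents live in the same embedding space.
doc_embs = encoder.encode(documents, convert_to_tensor=True)
query_emb = encoder.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_emb, doc_embs)[0]        # one relevance score per document
for score, doc in sorted(zip(scores.tolist(), documents), reverse=True):
    print(f"{score:.3f}  {doc}")

# A second, frozen language model would then read the top passages and
# synthesize an answer that can cite exactly these documents.
```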
00:59:59.720 | This is an important societal need because this is taking over web search.
01:00:04.280 | What are our goals for this kind of model here?
01:00:06.680 | So first, we want synthesis fluency, right?
01:00:09.360 | We want to be able to take information from
01:00:12.160 | multiple documents and synthesize it down into a single answer.
01:00:15.840 | And I think both of
01:00:17.080 | the approaches that I just showed you are going to do really well on that.
01:00:20.480 | We also need these models to be efficient,
01:00:23.000 | to be updatable because the world is changing all the time.
01:00:27.480 | We need it to track provenance and maybe invoke something like factuality.
01:00:32.360 | But certainly provenance, we need to know where the information came from.
01:00:35.840 | And we need some safety and security.
01:00:37.640 | We need to know that the model won't produce private information.
01:00:40.840 | And we might need to restrict access to parts of
01:00:43.440 | the model's knowledge to different groups like
01:00:45.760 | different customers or different people with different privileges and so forth.
01:00:49.640 | That's what we're going to need if we're really going to
01:00:51.920 | deploy these models out into the world.
01:00:55.080 | As I said, I think both of the approaches that I sketched do well on
01:00:58.400 | the synthesis part because they both use a language model and those are really good.
01:01:02.200 | They all have the gift of gab, so to speak.
01:01:04.800 | What about efficiency?
01:01:06.560 | On the LLM for everything approach,
01:01:09.200 | we had this undeniable rise in model size.
01:01:13.040 | And I pointed out models like Alpaca that are smaller.
01:01:17.200 | But I strongly suspect that if we are going to continue to ask
01:01:21.280 | these models to be both a knowledge store and a language capability,
01:01:26.600 | we're going to be dealing with these really large models.
01:01:30.160 | The hope of the retrieval augmented approach is that we
01:01:34.520 | could get by with the smaller models.
01:01:36.720 | And the reason we could do that is that we're going to factor out
01:01:40.120 | the knowledge store into that index and the language capability,
01:01:44.520 | which is going to be the language model.
01:01:45.960 | The only thing we're going to be asking the language model
01:01:48.840 | is to be good at that kind of in-context learning.
01:01:51.880 | It doesn't need to also store a full model of the world.
01:01:55.520 | And I think that means that these models could be smaller.
01:01:58.720 | So overall, a big gain in efficiency if we go retrieval augmented.
01:02:03.320 | People will make progress,
01:02:04.840 | but I think it's going to be tense.
01:02:07.280 | What about updatability?
01:02:09.440 | Again, this is a problem that people are working on
01:02:11.520 | very concertedly for the LLMs for everything approach.
01:02:14.720 | But these models persist in giving outdated answers to questions.
01:02:19.640 | And one pattern you see is that there's a lot of progress where you could like
01:02:22.920 | edit a model so that it gives
01:02:24.380 | the correct answer to who is the president of the US.
01:02:27.280 | But then you ask it about something related to
01:02:29.720 | the family of the president and it reveals that it has
01:02:34.080 | outdated information stored in its parameters and that's
01:02:37.520 | because all of this information is interconnected and we don't at
01:02:41.280 | the present moment know how to reliably do that kind of systematic editing.
01:02:46.720 | Okay. On the retrieval augmented approach,
01:02:49.720 | we just re-index our data.
01:02:52.040 | If the world changes,
01:02:53.960 | we assume that the knowledge store changed like somebody updated a Wikipedia page.
01:02:58.280 | So we represent all the documents again or at least just the ones that changed.
01:03:02.720 | And now we have a lot of guarantees that as that propagates forward into
01:03:06.560 | the retrieved results which are consumed by the language model,
01:03:09.920 | it will reflect the changes we made to the underlying database in
01:03:14.000 | exactly the same way that a web search index is updated now.
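A minimal sketch of that re-indexing step, assuming a simple in-memory index keyed by document id: only the documents that changed get re-encoded, and the next retrieval pass reflects the edit.

```python
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")    # illustrative frozen encoder

index = {}   # doc_id -> (text, embedding)

def upsert(doc_id, text):
    """Add or refresh one document; nothing else needs to be touched."""
    index[doc_id] = (text, encoder.encode(text))

upsert("head_of_state", "The current head of state is ...")
# The world changes: someone edits the page, so we re-embed just that entry,
# and every later retrieval pass is guaranteed to see the new text.
upsert("head_of_state", "Following the latest election, the current head of state is ...")
```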
01:03:19.000 | Right. One forward pass of the large language model
01:03:22.960 | compared to maybe training from scratch over here on
01:03:26.640 | new data to get an absolute guarantee that the change will propagate.
01:03:31.480 | What about provenance?
01:03:33.280 | Okay. We have seen this already,
01:03:35.320 | this problem here. LLMs for everything.
01:03:37.760 | I asked GPT-3, the DaVinci 3 model,
01:03:42.400 | my question, are professional baseball players
01:03:44.400 | allowed to glue small wings onto their caps?
01:03:46.600 | But I kind of cut it off but at the top there I said,
01:03:49.160 | provide me some links to the evidence.
01:03:53.320 | And it dutifully provided the links,
01:03:56.080 | but none of the links are real.
01:03:57.960 | If you copy them out and follow them,
01:04:00.360 | they all go to 404 pages.
01:04:02.480 | And I think that this is worse than providing no links at all because I'm
01:04:07.680 | attuned as a human in the current moment
01:04:10.680 | to see links and think they're probably evidence,
01:04:12.880 | and I don't follow all the links.
01:04:15.240 | And here you might look and say, "Oh yeah,
01:04:17.680 | I see it found the relevant MLB pages and that's it."
01:04:21.440 | Right. Over here, the kind of the point of
01:04:25.360 | this is that we are first doing
01:04:27.320 | a search phase where we're actually linked back to documents.
01:04:30.560 | And then we just need to solve the interesting non-trivial question
01:04:34.120 | of how to link those documents into the synthesized answer.
01:04:37.520 | But all of the information we need is right there on the screen for us.
01:04:41.560 | And so this feels like a relatively tractable problem
01:04:44.320 | compared to what we are faced with on the left.
01:04:47.280 | I will say, I've been just amazed at the rollout,
01:04:52.440 | especially of the Bing search engine,
01:04:55.000 | which now incorporates OpenAI models at some level.
01:04:57.960 | Because it is clear that it is doing web search, right?
01:05:01.800 | Because it's got information that comes from documents that
01:05:04.640 | only appeared on the web days before your query.
01:05:08.200 | But what it's doing with that information seems completely chaotic to me.
01:05:13.280 | So that it's kind of just getting mushed in with whatever else the model is doing,
01:05:17.440 | and you get this unpredictable combination of things that are grounded in documents,
01:05:24.080 | and things that are completely fabricated.
01:05:26.320 | And again, I maintain this is worse than just giving
01:05:28.920 | an answer with no evidence attached to it.
01:05:33.120 | I don't know why these companies are not simply doing the retrieval augmented thing,
01:05:37.920 | but I'm sure they are going to wise up,
01:05:39.640 | and maybe your research could help them wise up a little bit about this.
01:05:43.800 | Finally, safety and security.
01:05:45.880 | This is relatively straightforward.
01:05:47.240 | On the LLMs for everything approach,
01:05:48.680 | we have a pressing problem, privacy challenges.
01:05:51.920 | We know that those models can memorize long strings in their training data,
01:05:55.560 | and that could include some very particular information about one of us,
01:05:59.600 | and that should be worrying us.
01:06:01.160 | We have no known way with a language model to compartmentalize LLM capabilities,
01:06:06.080 | and say like, you can see this kind of result and you cannot.
01:06:09.640 | And similarly, we have no known way to restrict access to part of an LLM's capabilities.
01:06:15.680 | They just produce things based on their prompts,
01:06:18.400 | and you could try to have some prompt tuning that would tell them for
01:06:21.240 | this kind of person or setting do this and not that,
01:06:24.040 | but nobody could guarantee that that would succeed.
01:06:26.800 | Whereas, for the retrieval augmented approach, again,
01:06:31.120 | we're thinking about accessing information from an index,
01:06:34.680 | and access restrictions on an index are an old problem by now.
01:06:39.840 | Again, I don't want to say solved,
01:06:41.440 | but something that a lot of people have tackled for decades now,
01:06:45.600 | and so we can offer something like guarantees,
01:06:48.160 | just from the fact that we have a separated knowledge store.
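A minimal sketch of what that separation buys you, with made-up permission labels: documents the current user cannot see are filtered out before retrieval ever scores them.

```python
# Toy documents with made-up access-control labels.
documents = [
    {"text": "Public product FAQ.",          "acl": {"public"}},
    {"text": "Customer A's contract terms.", "acl": {"customer_a"}},
    {"text": "Internal incident report.",    "acl": {"employees"}},
]

def visible_docs(user_groups):
    """Only documents whose ACL overlaps the user's groups are even candidates."""
    return [d for d in documents if d["acl"] & user_groups]

# The retriever scores only this filtered set, so the language model never
# sees passages this user is not allowed to access.
print([d["text"] for d in visible_docs({"public", "customer_a"})])
```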
01:06:52.640 | Again, my smiley face.
01:06:56.040 | You can see where my feelings are.
01:06:58.000 | For the LLMs for everything approach,
01:07:00.200 | people are working on these problems and it's very exciting,
01:07:02.960 | and if you want a challenge,
01:07:04.520 | take up one of these challenges here.
01:07:07.340 | But over here on the retrieval augmented side,
01:07:09.640 | I think we have lots of reasons to be optimistic.
01:07:11.800 | It's not that they're completely solved,
01:07:13.760 | it's just that we can see the path to solving them,
01:07:16.360 | and this feels very urgent to me because of how
01:07:19.600 | suddenly this kind of technology is being deployed in
01:07:22.960 | a very user-facing way for one of
01:07:24.760 | the core things we do in society, which is web search.
01:07:28.200 | So it's an urgent thing that we get good at this.
01:07:32.280 | Final things I want to say about this.
01:07:35.480 | So until recently,
01:07:37.720 | the way you would do even the retrieval augmented thing would be that you would
01:07:41.040 | have your index and then you might
01:07:44.560 | train a custom purpose model to do the question answering part,
01:07:48.040 | and it could extract things from the text that you produced,
01:07:50.840 | or maybe even generate some new things from the text that you produced.
01:07:54.400 | That's the mode that I mentioned before where you'd have some language models,
01:07:58.880 | maybe a few of them, and you'd have an index,
01:08:00.880 | and you would stitch them together into a question answering system
01:08:04.480 | that you would probably train on question answering data,
01:08:07.960 | and you would hope that this whole big monster, maybe
01:08:10.120 | fine-tuned on SQuAD or Natural Questions or one of those datasets,
01:08:14.520 | gave you a general purpose question answering capability.
01:08:19.440 | That's the present, but I think it might actually be the recent past.
01:08:24.320 | In fact, the way that you all will probably work when we do this unit,
01:08:28.960 | and certainly for the homework,
01:08:30.740 | is that we will just have frozen components.
01:08:33.880 | This starts from the observation that the retriever model is really just a model that
01:08:39.120 | takes in text and produces text with scores,
01:08:42.920 | and a language model is also a device for taking in text and producing text with scores.
01:08:49.880 | These are when these are frozen components,
01:08:51.880 | you can think of them as just black box devices that do this input-output thing,
01:08:55.960 | and then you get into the intriguing mode of asking,
01:08:58.680 | but what if we had them just talk to each other?
01:09:01.600 | That is what you will do for the homework and bake-off.
01:09:04.760 | You will have frozen retriever and a frozen large language model,
01:09:08.640 | and you will get them to work together to
01:09:11.760 | solve a very difficult open domain question answering problem.
01:09:16.080 | That's pushing us into a new mode for even thinking about how we design AI systems,
01:09:21.640 | where it's not so much about fine-tuning,
01:09:24.080 | it's much more about getting them to communicate with each other
01:09:27.360 | effectively to design a system from frozen components.
01:09:31.880 | Again, unanticipated at least by me as of a few years ago,
01:09:36.760 | and now an exciting new direction.
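One way to picture frozen components talking to each other: the retriever and the language model are both text-in, text-out boxes, and the program in the middle is ordinary code. Here is a minimal sketch; the encoder name is an illustrative assumption, and the language-model call is left as a stub since it depends on whatever frozen model or API you use.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # frozen retriever encoder (illustrative)

def retrieve(query, passages, k=2):
    """Score passages against the query and return the top k."""
    scores = util.cos_sim(encoder.encode(query, convert_to_tensor=True),
                          encoder.encode(passages, convert_to_tensor=True))[0]
    top = scores.topk(min(k, len(passages))).indices.tolist()
    return [passages[i] for i in top]

def generate(prompt):
    """Stub standing in for a call to whatever frozen language model you use."""
    return "(model answer would go here)"

def answer(question, passages):
    """Ordinary code stitches the two frozen components together."""
    context = "\n".join(retrieve(question, passages))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return generate(prompt)
```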
01:09:39.920 | So just to wrap up,
01:09:41.920 | I think what I'll do since we're near the end of the- of class here,
01:09:45.000 | I'll just finish up this one unit,
01:09:46.840 | and then we'll use some of our time next time to introduce a few other of
01:09:50.200 | these course themes and that'll set us up well for diving into transformers.
01:09:55.520 | Final piece here just to inspire you,
01:09:57.840 | few-shot open QA is kind of the task that you will tackle for homework two.
01:10:02.640 | And here's how you could think about this.
01:10:04.400 | Imagine that the question has come in,
01:10:06.200 | what is the course to take?
01:10:08.280 | The most standard thing we could do is just prompt the language model with that question,
01:10:12.920 | "What is the course to take?", down here, and see what answer it gives back, right?
01:10:17.320 | But the retrieval augmented insight is that we
01:10:20.680 | might also retrieve some kind of passage from a knowledge store.
01:10:23.880 | Here I have a very short passage:
01:10:25.440 | "The course to take is natural language understanding,"
01:10:28.100 | and that could be done with a retrieval mechanism.
01:10:31.160 | But why stop there?
01:10:33.180 | It might help the model, as we saw going back to the GPT-3 paper, to
01:10:37.480 | have some examples of the kind of behavior that I'm hoping to get from the model.
01:10:41.960 | And so here I have retrieved from some dataset,
01:10:44.800 | question-answer pairs that will kind of give it a sense for what I want it to do in the end.
01:10:49.400 | But again, why stop there?
01:10:51.360 | We could also pick questions that were based very closely on the question that we posed.
01:10:57.840 | That would be like a k-nearest neighbors approach where we use
01:11:01.160 | our retrieval mechanism to find similar questions to the one that we care about.
01:11:06.220 | I could also add in some context passages and I could do that by retrieval.
01:11:11.200 | So now we've used the retrieval model twice potentially,
01:11:14.500 | once to get good demonstrations and once to provide context for each one of them.
01:11:19.460 | But I could also use my retrieval mechanism with the questions and answers from
01:11:23.860 | the demonstration to get even richer connections
01:11:26.460 | between my demonstrations and the passages.
01:11:29.440 | I could even use a language model to rewrite aspects of those demonstrations to put them
01:11:34.500 | in a format that might help me with the final question that I want to pose.
01:11:39.020 | So now I have an interwoven use of
01:11:42.120 | the retrieval mechanism and the large language model to build up this prompt.
01:11:47.540 | Down at the retrieval thing,
01:11:49.380 | I could do the same thing.
01:11:50.900 | And then when you think about the model generation, again,
01:11:54.140 | we could just take the top response from the model,
01:11:57.100 | but we can do very sophisticated things on up to this full retrieval augmented generation model,
01:12:04.140 | which essentially marginalizes out the evidence passage and gives us
01:12:08.460 | a really powerful look at a good answer conditional
01:12:11.640 | on that very complicated prompt that we constructed.
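To make that prompt-assembly recipe a bit more concrete, here is a minimal sketch: pick training question-answer pairs similar to the test question as demonstrations, retrieve a context passage for each, and stack it all into one prompt. The data and encoder choice are toy illustrations, and the homework itself builds on a dedicated programming framework rather than hand-rolled code like this.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative frozen encoder

train_qa = [   # toy training question-answer pairs
    ("What course covers retrieval augmented NLP?", "Natural language understanding."),
    ("What is the capital of France?", "Paris."),
]
passages = [   # toy knowledge store
    "The course to take is natural language understanding.",
    "Paris is the capital and largest city of France.",
]

def nearest(query, candidates, k=1):
    """Return the k candidates most similar to the query under the frozen encoder."""
    scores = util.cos_sim(encoder.encode(query, convert_to_tensor=True),
                          encoder.encode(candidates, convert_to_tensor=True))[0]
    top = scores.topk(min(k, len(candidates))).indices.tolist()
    return [candidates[i] for i in top]

def build_prompt(question, k_demos=1):
    answers = dict(train_qa)
    blocks = []
    # 1. Demonstrations: training questions most similar to the test question.
    for q in nearest(question, list(answers), k=k_demos):
        context = nearest(q, passages, k=1)[0]      # 2. A retrieved passage per demo.
        blocks.append(f"Context: {context}\nQuestion: {q}\nAnswer: {answers[q]}")
    # 3. The test question with its own retrieved context, left for the LM to complete.
    blocks.append(f"Context: {nearest(question, passages, k=1)[0]}\n"
                  f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)

print(build_prompt("Which course should I take?"))
```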
01:12:15.540 | I think what you're seeing on the left here is that we are going to move from an era where
01:12:20.780 | we just type in prompts into these models and hope for the best,
01:12:25.060 | into an era where prompt construction is a kind of new programming mode,
01:12:31.060 | where you're writing down computer code,
01:12:33.740 | could be Python code,
01:12:35.200 | that is doing traditional computing things,
01:12:37.740 | but also drawing on very powerful pre-trained components to assemble
01:12:43.640 | this kind of instruction kit for your large language model to do whatever task you have set for it.
01:12:50.340 | And so instead of designing these AI systems with
01:12:53.340 | all that fine-tuning I described before,
01:12:55.740 | we might actually be moving back into a mode that's like
01:12:58.980 | that symbolic mode from the '80s where you type in a computer program.
01:13:03.260 | It's just that now the program that you type in is
01:13:06.900 | connected to these very powerful modern AI components.
01:13:11.580 | And we're seeing right now that that is
01:13:14.700 | opening doors to all kinds of new capabilities for these systems.
01:13:18.540 | And this first homework and bake-off is going to give you a glimpse of that.
01:13:23.140 | And you're going to use a programming model we've
01:13:25.420 | developed called demonstrate-search-predict that I
01:13:28.460 | hope will give you a glimpse of just how powerful this can be.
01:13:32.340 | All right. We are out of time, right? 4:20?
01:13:39.980 | So next time I'll show you a few more units from the course,
01:13:43.660 | and then we'll dive into transformers.