Stanford XCS224U: Natural Language Understanding | Course Overview, Part 2 | Spring 2023
Day two. Got a lot we want to accomplish today. 00:00:12.460 |
What I have on the screen right now is the home base for the course. 00:00:17.660 |
This is our public website and you could think of it as 00:00:20.560 |
kind of a hub for everything that you'll need in the course. 00:00:24.400 |
You can see along the top here we've got some policy pages. 00:00:31.560 |
There's a page that provides an index of background materials, 00:00:39.880 |
hands-on materials in case you need to fill in some background stuff. 00:00:43.580 |
Notice also I do a podcast that actually began in this course last year, 00:00:48.940 |
and I found it so rewarding that I just continued doing it all year. 00:00:55.240 |
If you have ideas for guests for this podcast, 00:00:59.960 |
I'm always looking for exciting people to interview, 00:01:02.920 |
and I think the back episodes are also really illuminating. 00:01:09.320 |
you've got one-stop shopping for the various systems that we have to deal with. 00:01:19.520 |
Canvas is your home for the screencasts and also the quizzes, 00:01:26.920 |
Gradescope is where you'll submit the main assignments, 00:01:36.280 |
and that is the course code that we'll be depending on for the assignments, 00:01:40.040 |
and that I hope you can build on for the original work that you do. 00:01:46.360 |
but we also have this staff email address that is 00:01:49.280 |
vastly preferred to writing to us individually. 00:01:53.800 |
It really helps us manage the workload and know what's happening if you either ping us 00:02:01.680 |
with public posts, whatever, or use that staff address. 00:02:12.360 |
The first column is slides and stuff like that, and also notebooks. 00:02:16.980 |
The middle column is mostly core readings. 00:02:20.360 |
I'm not presupposing that you will manage to do all of this reading because there is a lot of it, 00:02:26.760 |
but these are important and rewarding papers, 00:02:31.720 |
and you might want to immerse yourselves in them. 00:02:33.640 |
But I'm hoping that I can be your trusted guide through that literature. 00:02:44.560 |
Questions, comments, anything I could clear up? 00:02:47.520 |
I have a time budgeted later to review the policies and required work in a bit more detail. 00:02:57.120 |
For the quizzes, are the quizzes doable on the day that they become available, 00:03:01.360 |
or do we need the course material all the way up through the due date? 00:03:14.380 |
The quizzes cover a whole unit, so you might not feel like you can confidently finish the quiz until that final lecture in the unit. 00:03:23.360 |
All of the answers are embedded in this handout here. 00:03:44.040 |
I think I've got an index of past projects behind a protected link. 00:03:50.700 |
If you're not enrolled, we can get you past that little hurdle. 00:03:53.680 |
But I did get permission to release some of them. 00:04:02.740 |
There's another list at the GitHub projects.md page 00:04:09.280 |
of published work, and that stuff you could download. 00:04:11.760 |
The private link gives you the actual course submission. 00:04:16.840 |
So you could compare the paper they did in here with the thing they actually published. 00:04:20.920 |
I'll emphasize again that that will be interesting because of how much work it 00:04:24.900 |
typically takes to go from a class project to something that makes it onto the ACL Anthology. 00:04:41.040 |
You get your computing environment set up to use our code. 00:04:44.480 |
Actually, this is a sign of the changing times. 00:04:47.120 |
I also exhort you to sign up for a bunch of services, 00:04:50.160 |
Colab, and maybe consider getting a pro account for $30. 00:04:56.040 |
That way, you could get a lot more compute on Colab, including GPUs. 00:05:04.920 |
In addition, sign up for an OpenAI account and a Cohere account. 00:05:12.200 |
For Cohere, you get really rate limited and for OpenAI, they give you $5. 00:05:16.600 |
You could consider spending a little bit more. 00:05:18.760 |
I do think you could do all our coursework for under those amounts. 00:05:23.720 |
you could still have lots of accounts if you wanted to. 00:05:34.760 |
Also, I'll say, I'm pretty confident that we'll get a bunch of 00:05:38.640 |
credits from AWS Educate for you to use on EC2 machines. 00:05:54.440 |
What we did last time is I tried to immerse us in 00:05:57.520 |
this weird and wonderful moment for AI and give you a sense for how we got here. 00:06:06.520 |
We focused on transformers and retrieval-augmented in-context learning. 00:06:13.600 |
I expect you all to do lots of creative and cool things in that space. 00:06:17.440 |
But it's important for me to continue this slideshow because there is 00:06:20.960 |
more to our field than just those big language models and prompting. 00:06:25.000 |
There are lots of important ways to contribute beyond that. 00:06:28.040 |
So let me take a moment and just give you an indication of what I have in mind there. 00:06:43.080 |
One example is COGS, which is a relatively recent synthetic dataset that is designed to stress test 00:06:49.600 |
whether models have really learned systematic solutions to language problems. 00:06:55.200 |
So the way COGS works is we have essentially a semantic parsing task. 00:06:59.720 |
The input is a sentence like Lena gave the bottle to John, 00:07:03.440 |
and the task is to learn how to map those sentences to their logical form, 00:07:08.160 |
which are these logical representations down here. 00:07:11.600 |
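To make the input-output format concrete, here is a rough sketch in Python of the kind of (sentence, logical form) pair involved. The logical form shown is a simplified stand-in for illustration, not the exact annotation format of the official COGS or ReCOGS releases.

```python
# Illustrative sketch of a COGS-style semantic parsing pair.
# The logical form string below is a simplification, not the exact
# format used in the official COGS/ReCOGS data.
example = {
    "input": "Lena gave the bottle to John",
    "logical_form": (
        "give.agent(e, Lena) AND give.theme(e, bottle) "
        "AND give.recipient(e, John)"
    ),
}

# A semantic parser for this task maps the input string to the logical
# form string, typically scored by exact match over a test set.
def exact_match_accuracy(predict, examples):
    correct = sum(predict(ex["input"]) == ex["logical_form"] for ex in examples)
    return correct / len(examples)
```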
The interesting thing about COGS is that they've posed hard generalization tasks. 00:07:19.960 |
At training time, you might get to see examples where Lena here is in subject position, 00:07:30.800 |
and you might see Paula as a name but only in isolation, 00:07:34.040 |
and the task is to have the system learn how to deal with 00:07:36.960 |
Paula as the subject of a sentence like "Paula painted a cake." 00:07:48.680 |
Another split involves examples where "on the table" is inside the direct object of the sentence. 00:07:58.240 |
These seem like dead simple generalization tasks, 00:08:06.520 |
This is a kind of accumulated leaderboard of a lot of entries for COGS. 00:08:17.020 |
Overall, systems do quite well at mapping these free-form sentences into those very ornate logical forms. 00:08:28.040 |
But look at this column: that's just the task of learning from "Emma ate the cake on 00:08:31.120 |
the table" and predicting "The cake on the table burned." 00:08:33.920 |
Why are all these brand new systems getting zero on this split? 00:08:39.520 |
That shows first of all that this is a hard problem. 00:08:42.620 |
Now, we are going to work with a variant that we created of COGS called reCOGS. 00:08:47.760 |
This was done with my student Zen Wu and with Chris Manning. 00:08:53.480 |
It turns out that a lot of those zeros derive from there being some artifacts in COGS. 00:08:57.320 |
So it was made kind of artificially hard and also artificially easy in some ways. 00:09:02.840 |
So in this class, we're going to work with reCOGS, 00:09:05.440 |
which applies some systematic meaning-preserving transformations to 00:09:09.760 |
the original to create a new dataset that we think is fairer. 00:09:16.980 |
Systems can get traction where before they were getting zero, 00:09:22.200 |
and we have more confidence that this is testing something about semantics. 00:09:28.400 |
This is incredibly hard for our systems, even our best systems. 00:09:32.960 |
There needs to be some kind of breakthrough here for us to get 00:09:37.000 |
our systems to do well even on these incredibly simple sentences. 00:09:41.720 |
So I am eager to see what you all do with this problem. 00:09:45.720 |
You're seeing a picture here of the kind of best we could do, 00:09:49.000 |
which is a little bit better than what was in the literature previously, 00:09:55.840 |
Right. So that will culminate in this homework and bake-off, our third one. 00:10:02.280 |
From there, the course work opens up into your projects. 00:10:07.760 |
We're done with the regular assignments and we go through the rhythm of lit review, 00:10:11.760 |
experiment protocol, which is a special document that kind of lays down 00:10:15.680 |
the nuts and bolts of what you're going to do for your paper, 00:10:25.320 |
and then the final paper itself. In parallel, we'll be talking about topics that will supercharge your own final project papers. 00:10:29.840 |
The first topic that comes to mind for me there is better and more diverse benchmarks. 00:10:39.360 |
We need reliable estimates of how well our systems are doing, 00:10:47.280 |
and that brings to mind this famous quotation from the explorer Jacques Cousteau 00:10:52.920 |
about "the two essential fluids on which all life depends": 00:10:58.520 |
you can see here that Cousteau continued, "have become global garbage cans." 00:11:03.040 |
That might concern us about what's happening with our datasets. 00:11:07.960 |
But still, you could have that in the back of your mind that we need 00:11:11.480 |
these datasets we create to be reliable, high-quality instruments. 00:11:17.120 |
The reason for that is that we ask so much of our datasets. 00:11:21.000 |
We use them to optimize models when we train on them, 00:11:25.400 |
and, increasingly important, to evaluate our models, 00:11:28.520 |
our biggest language models that are getting all the headlines. 00:11:36.720 |
We use them to enable new capabilities via training and testing, 00:11:44.960 |
and of course for basic scientific inquiry into language and the world. 00:11:51.840 |
All of that shows you that datasets are really central to what we're doing. 00:11:57.120 |
So I'm exhorting you, as you can tell, to think about datasets and 00:12:02.800 |
evaluation tools in the context of this course. 00:12:05.560 |
I am genuinely worried about the new dynamic where we are 00:12:10.280 |
evaluating these big language models essentially on Twitter, 00:12:14.280 |
where people have screenshots of some fun cases that they saw, 00:12:20.220 |
but we're not seeing a full representative sample of the inputs, 00:12:24.740 |
and it's impossible to piece together a scientific picture from that. 00:12:30.540 |
Someone recently observed, and I think this is very wise, 00:12:39.360 |
that figuring out whether a system was a good system is going to get harder and harder, 00:12:43.460 |
and for that we need lots of evaluation datasets. 00:12:48.060 |
You could think about this slide that I showed you from before. 00:12:52.460 |
We have this benchmark saturation with all of these systems now 00:12:56.100 |
increasingly quickly getting above our estimate of human performance, 00:12:59.640 |
but I asked you to be cynical about that as a measure of human performance. 00:13:04.180 |
Another perspective on this slide could be that our benchmarks are simply too easy, 00:13:09.460 |
because it is not as though, if you interacted with one of these systems, you would come away thinking it had surpassed humans. 00:13:17.620 |
Partly what we're seeing here is a remnant of the fact that until very recently, 00:13:23.780 |
our evaluations had to be essentially machine tasks, 00:13:28.100 |
and we had humans do machine tasks to get a measure of human performance. 00:13:32.620 |
Maybe we're moving into a new and more exciting era. 00:13:35.940 |
We're going to talk about adversarial testing. 00:13:38.420 |
I've been involved with the Dynabench effort. 00:13:40.700 |
This is a kind of open-source effort to develop datasets 00:13:44.520 |
that are going to be really hard for the best of our models, 00:13:47.740 |
and I think that's a wonderful dynamic as well. 00:13:50.860 |
That leads into this related topic of us having more meaningful evaluations. 00:13:57.620 |
Here's a fundamental thing that you might worry about: 00:14:02.580 |
all we care about is performance for the system. 00:14:07.620 |
I've put this under the heading of Strathern's law: when a measure becomes a target, it ceases to be a good measure. 00:14:13.580 |
If we have this consensus that all we care about is accuracy, 00:14:18.380 |
everyone in the field will climb on accuracy. 00:14:21.300 |
We know from Strathern's law that that will distort the actual rate of 00:14:26.180 |
progress by diminishing everything else that could be 00:14:30.620 |
important to thinking about these AI systems. 00:14:34.140 |
Relatedly, this is a wonderful study from Birhane et al. 00:14:38.340 |
I've selected a few of the values encoded in ML research, 00:14:42.260 |
which they did via a very extensive literature survey. 00:14:51.700 |
At the top, you have an obsession with performance, as I said. 00:14:56.980 |
In second place, you have efficiency, and then things like explainability, 00:15:06.980 |
well, they actually should be in the tiniest of type. 00:15:09.700 |
Because if you think about the field's actual values as reflected in the literature, 00:15:14.620 |
you find that these things are getting almost no play. 00:15:20.060 |
but it's still the case that it's wildly skewed toward performance. 00:15:24.220 |
But those things that I have down there in purple and orange, 00:15:31.820 |
those are incredibly important things and more and more 00:15:34.460 |
important as these systems are being deployed more widely. 00:15:38.740 |
So we have to, via our practices and what we hold to be valuable, start to elevate these other dimensions. 00:15:46.180 |
You all could start to do that by thinking about 00:15:48.700 |
proposing evaluations that would elevate them. 00:15:54.260 |
The final point here is that we could also have 00:15:57.740 |
a move toward leaderboards that embrace more aspects of this. 00:16:02.180 |
Again, to help us move away from the obsession on performance, 00:16:05.700 |
we should have leaderboards that score us along many dimensions. 00:16:09.780 |
In this context, I've really been inspired by work that tries to 00:16:17.820 |
synthesize across a lot of different measurement dimensions. 00:16:23.500 |
Here I have a table where the rows are question answering systems, 00:16:27.740 |
and the columns are different things we could measure. 00:16:38.340 |
With the current Dyna-scoring that you're seeing here, 00:16:40.980 |
where most of the weight is put on performance, 00:16:43.380 |
that DeBERTa system is the winner in this leaderboard competition. 00:16:47.900 |
But that's standard. But what if we decided that we 00:16:50.540 |
cared much more about fairness for these systems? 00:16:53.060 |
So we adjusted the Dyna-scoring here to put a weight of five on fairness. 00:17:01.140 |
Well, now the Electra Large system is in first place. 00:17:06.740 |
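As a rough sketch of the idea behind this kind of weighted, multi-dimensional scoring (this is not the actual Dynaboard data or formula; the systems, metric values, and weights below are made up):

```python
# Illustrative weighted leaderboard scoring in the spirit of Dynascore.
# All numbers are invented; metric values are assumed already normalized.
systems = {
    "System A": {"performance": 0.82, "fairness": 0.70, "robustness": 0.75, "throughput": 0.60},
    "System B": {"performance": 0.70, "fairness": 0.88, "robustness": 0.72, "throughput": 0.80},
}

def weighted_score(scores, weights):
    """Weighted average of metric values under a chosen set of weights."""
    total = sum(weights.values())
    return sum(weights[m] * scores[m] for m in weights) / total

# Mostly-performance weighting vs. fairness-heavy weighting.
w_performance = {"performance": 5, "fairness": 1, "robustness": 1, "throughput": 1}
w_fairness = {"performance": 1, "fairness": 5, "robustness": 1, "throughput": 1}

for weights in (w_performance, w_fairness):
    ranking = sorted(systems, key=lambda s: weighted_score(systems[s], weights), reverse=True)
    print(weights, ranking)  # the ranking flips when the weights change
```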
I think the answer is that there is no true winner here. 00:17:09.620 |
What this shows is that all of our leaderboards are 00:17:18.060 |
value-laden: we are instilling a particular set of values on the whole enterprise. 00:17:24.860 |
This is, I think, part of Kawin's vision for Dyna-scoring: 00:17:30.380 |
the scoring can be tuned to the things that we want to do in the world, 00:17:32.860 |
via the weighting and the columns that we choose. 00:17:40.260 |
>> What does fairness mean to you in this field? 00:17:47.540 |
So if we're going to start to measure these dimensions, 00:17:49.620 |
like we're going to have a column for fairness, 00:17:51.340 |
we better be sure that we know what's behind that. 00:17:54.140 |
I can tell you there needs to be a lot more work on 00:17:57.660 |
our measurement devices, our benchmarks, for assessing fairness. 00:18:01.740 |
Because all of those things are incredibly nuanced. 00:18:09.220 |
Yeah. Throughput, memory, maybe those are straightforward, 00:18:12.900 |
but fairness is going to be a challenging one. 00:18:15.180 |
But that's not to say that it's not incredibly important. 00:18:25.900 |
I do feel like this is the first time I could say this in this course: 00:18:30.860 |
our evaluations can be much more meaningful than they ever were before. 00:18:34.620 |
Assessment today, or yesterday, is really one-dimensional: 00:18:45.500 |
we pick one metric and treat it as the only thing, regardless of what we're trying to accomplish in the world. 00:18:52.020 |
We say it's F1 and everyone follows suit because we're 00:18:55.580 |
supposed to be the experts on this, and it's often very, 00:19:05.420 |
very different from what you would think that phrase means. 00:19:08.620 |
In this new future that we could start right now, 00:19:11.700 |
our assessments could certainly be high dimensional and fluid. 00:19:14.420 |
I showed you a glimpse of that with the Dyna scoring. 00:19:18.260 |
They could also in turn be highly sensitive to the context that we're in. 00:19:22.180 |
If we care about fairness and we care about efficiency, 00:19:27.140 |
we're going to get a very different prioritization of systems. 00:19:32.780 |
Then in turn, the terms of these evaluations could be set not by us researchers, 00:19:40.020 |
but rather the people who are trying to get value out of these systems, 00:19:46.060 |
Then the judgments could ultimately be made by the users. 00:19:49.980 |
They could decide which system they want to choose. 00:19:57.900 |
We could also have our evaluations be much more at the level of human tasks. 00:20:06.580 |
Right now, the standard thing is to have annotators choose a particular label for an ambiguous example, 00:20:09.540 |
and then we assess how much agreement they have. 00:20:13.640 |
Whereas the human thing is to discuss and debate, 00:20:16.940 |
to have a dialogue about what the right label is in 00:20:19.700 |
the face of ambiguity and context dependence. 00:20:22.540 |
Well, now we could have that kind of evaluation, right? 00:20:26.060 |
Maybe we evaluate systems on their ability to 00:20:28.740 |
adjudicate in a human-like way on what the label should be. 00:20:35.180 |
That used to be out of reach, but now it's probably something that you could toy around with a little bit with one of 00:20:38.900 |
these large language model APIs right now if you wanted. 00:20:45.700 |
I have a couple more topics, but let me pause there. 00:20:58.060 |
I hope you're seeing that it's a wide open area for final projects. Yeah. 00:21:05.180 |
Is there more of a move to get specialists in other fields, 00:21:09.220 |
for example linguistics or related areas, to help make benchmarks? 00:21:16.460 |
You asked, is there a move to have more experts participate in evaluation? 00:21:25.020 |
Because what we want is to provide the world with tools and concepts that would allow 00:21:30.380 |
domain experts, people who actually know what's going on in the domain, to do this themselves. 00:21:34.740 |
We're trying to use this AI technology to make 00:21:37.340 |
these decisions and make adjustments and so forth based on what's working and what isn't. 00:21:45.380 |
Then what we as researchers can do is provide things like what Kawin provided with 00:21:49.380 |
Dynascoring, which is the intellectual infrastructure to allow them to do that. 00:21:53.820 |
Yeah. Then you all probably have lots of domain expertise that intersects with what we're doing, 00:22:04.060 |
You could participate as an NLU researcher and as a 00:22:08.220 |
domain expert to do a paper that embraces both aspects of this. 00:22:13.100 |
Maybe you propose a kind of metric that you think really works well for 00:22:16.700 |
your field of economics or sociology or whatever you're studying, right? 00:22:21.820 |
Yeah, health, medicine, all these things, incredibly important. 00:22:28.620 |
I think one of the challenges we're going to face is 00:22:31.020 |
that it's really expensive to collect human or more sophisticated labels. 00:22:35.140 |
As an example, there's the Med-PaLM paper that came out recently, 00:22:37.980 |
where they trained, or actually really just tuned, an LLM to respond to 00:22:46.460 |
medical questions from the USMLE and other medicine-related exams. 00:22:53.740 |
They also had a section for long-form answers. 00:22:57.060 |
For the short-form answers, it's multiple choice, so they can figure it out automatically. 00:23:00.340 |
The long-form answers, they actually had doctors evaluate them. 00:23:05.100 |
That's really expensive. They could only collect so many labels. 00:23:13.700 |
Some evaluations you can put through a super easy pipeline; it's just counting. 00:23:15.620 |
But evaluating how valuable a search result is, 00:23:19.020 |
that requires a human, that's a little more expensive. 00:23:23.940 |
Yeah. The issue of cost is going to be unavoidable for us. 00:23:31.300 |
This research has just gotten more expensive and that's 00:23:33.780 |
obviously distorting who can participate and what we value. 00:23:37.180 |
It's another thing I could discuss under this rubric. 00:23:42.540 |
I remain optimistic because I think we are in an era now in which you could do 00:23:47.380 |
a meaningful evaluation of a system with no training data and rather 00:23:51.900 |
just a few dozen, let's say 100, examples for assessment. 00:23:59.540 |
That is, if you don't develop your system on it and so forth. 00:24:08.940 |
You might say, "I'll hold out enough data to really get a sense for how my system performs on new data." 00:24:12.260 |
Maybe that's only 200 examples, and I feel like that's manageable. 00:24:20.860 |
The point would be that that might be money well spent. 00:24:23.900 |
It might be worth it to get some experts to provide the 200 cases. 00:24:31.820 |
I could never have said this 10 years ago because 10 years ago, 00:24:35.740 |
the norm was to have 50,000 training instances and 5,000 test instances. 00:24:47.180 |
Now, I feel like a few meaningful cases could be worth a lot. 00:25:03.100 |
Maybe you work in the life sciences or something and you want a dataset; creating one could be a great contribution. 00:25:26.300 |
If we're going to deploy these models out in the world, 00:25:29.380 |
it is really important that we understand them. 00:25:32.060 |
Right now, we do a lot of behavioral testing. 00:25:35.260 |
That is, we come up with these test cases and we see how well the model does. 00:25:40.180 |
But the problem, which is a deep problem of scientific induction, 00:25:44.220 |
is that you can never come up with enough cases. 00:25:49.580 |
Once a system is deployed in the world, no matter how many things you dreamed up when you were doing the research, 00:25:54.580 |
it will encounter things that you never anticipated. 00:26:01.100 |
If all you have is behavioral testing, you might feel very nervous about this because you might have 00:26:03.820 |
essentially no idea what it's going to do on new cases. 00:26:07.580 |
The mission of explainability research should be to go one layer 00:26:11.660 |
deeper and understand what is happening inside 00:26:14.620 |
these models so that we have a sense for how they'll generalize to new cases. 00:26:19.100 |
It's a very challenging thing because of the scale and complexity of the models we're thinking about. 00:26:26.260 |
You can even find people saying in the literature that they're 00:26:29.060 |
skeptical that we can ever understand what's happening with these systems. 00:26:44.220 |
The importance of this is really that we have these broader societal goals. 00:26:56.860 |
It seems to me that all of these questions depend on us 00:27:01.300 |
having some true analytic guarantees about model behaviors. 00:27:08.300 |
Right now, I can't say to you, "Trust me, my model is not biased along some dimension." 00:27:14.600 |
The best I could say is that it wasn't biased in some evaluations that I ran, 00:27:18.620 |
but I just emphasize for you that that's very different from being 00:27:22.540 |
evaluated by the world where a lot of things could happen that you didn't anticipate. 00:27:27.700 |
We'll talk about a lot of different explanation methods. 00:27:31.860 |
I think that these methods should be human interpretable. 00:27:34.920 |
That is, we don't want low-level mathematical explanations of how the models work. 00:27:38.780 |
We want this expressed in human-level concepts so that we can reason about these systems. 00:27:44.980 |
We also want them to be faithful to the underlying model. 00:27:48.660 |
We don't want to fabricate human interpretable but inaccurate explanations of the model. 00:27:53.780 |
We want them to be true to the underlying systems. 00:27:57.080 |
These are two very difficult standards to meet together. 00:28:01.060 |
I can make them human interpretable if I offer you no guarantees of faithfulness, 00:28:09.020 |
and I can make them faithful by making them very technical and low-level. 00:28:12.660 |
We could just talk about all the matrix multiplications we want, 00:28:15.740 |
but that's not going to provide a human-level insight into how the models are working. 00:28:20.640 |
So together though, we need to get methods that are good for both of these, right? 00:28:24.900 |
Concept-level understanding of the causal dynamics of these systems. 00:28:30.100 |
We'll talk about a lot of different explanation methods. 00:28:40.660 |
We'll start with probing, which was an early and very influential and very ambitious attempt 00:28:44.380 |
to understand the hidden representations of our models. 00:28:50.220 |
Then there are attribution methods, which are ways to assign importance to different parts of the representations of these models. 00:28:59.860 |
Then we're going to talk about methods that depend on 00:29:03.220 |
active manipulations of model internal states. 00:29:09.640 |
I'm partial to the active manipulation approach because I think that that's 00:29:12.940 |
the kind of approach that can give us causal insights, 00:29:16.140 |
and also richly characterize what the models are doing, 00:29:19.340 |
and that's more or less the two desiderata that I just mentioned for these methods. 00:29:29.340 |
We'll cover all of these methods, and all of them can be wonderful for your analysis sections of your final papers. 00:29:35.340 |
We might even talk about interchange intervention training, 00:29:39.220 |
which is when you use explainability methods to actually 00:29:41.940 |
push the models to become better, more systematic, 00:29:46.020 |
more reliable, maybe less biased along dimensions that you care about. 00:29:58.380 |
I have a few more kind of more low-level things about the course to do now. Yeah. 00:30:03.420 |
I know we're going to get into all of the explanation methods in a lot of detail later on, 00:30:08.180 |
but can you give a quick example just so that we have a sense of what they are? 00:30:13.460 |
Probing is training supervised classifiers on internal representations. 00:30:25.460 |
You take an internal representation and ask: does it encode information about animacy or part of speech? 00:30:33.700 |
I think that was really eye-opening that even if your task was sentiment analysis, 00:30:39.300 |
you might have learned latent structure about animacy. 00:30:42.980 |
That's getting closer to the human level concept stuff. 00:30:46.460 |
Problem with probing is that you have no guarantee that 00:30:49.180 |
that information about animacy here has any causal impact on the model behavior. 00:30:53.860 |
It could be just kind of something that the model learned by the by. 00:30:57.860 |
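As a rough sketch of what a probe looks like in practice (assuming the Hugging Face transformers library and scikit-learn; the sentences and labels below are made up, and this is not the course's own probing code):

```python
# Minimal probing sketch: train a supervised classifier on frozen BERT
# hidden states to predict a linguistic property (a toy POS label here).
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["The dog barked loudly", "A cat slept quietly"]
# Hypothetical word-level labels aligned to whitespace-separated words:
pos_labels = [["DET", "NOUN", "VERB", "ADV"], ["DET", "NOUN", "VERB", "ADV"]]

features, targets = [], []
for sent, labels in zip(sentences, pos_labels):
    enc = tokenizer(sent, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, hidden_size)
    # Map each word to its first subword piece (a simplification).
    seen = set()
    for idx, wid in enumerate(enc.word_ids()):
        if wid is not None and wid not in seen:
            seen.add(wid)
            features.append(hidden[idx].numpy())
            targets.append(labels[wid])

# The probe itself: a simple classifier on the frozen representations.
probe = LogisticRegression(max_iter=1000).fit(features, targets)
print(probe.score(features, targets))
```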
Attribution methods have the kind of reverse problem. 00:31:01.060 |
They can give you some causal guarantees that this neuron 00:31:04.420 |
plays this particular role in the input-output behavior, 00:31:08.260 |
but it's usually just a kind of scalar value. 00:31:13.820 |
And you say, "It means that it's that much important." 00:31:20.940 |
And then I think the active manipulation thing, 00:31:23.220 |
which is like doing lots of brain surgeries on your model, 00:31:26.540 |
can provide the benefits of both probing and attribution. 00:31:30.820 |
Causal insights, but also a deep understanding of what the representations are. 00:31:40.500 |
It's a very exciting part of the literature. Yeah. 00:31:46.060 |
So I guess, why would we want to use the COGS dataset if we're testing for generalization? 00:31:51.180 |
Like, why can't we just prompt a language model with a word that we've never seen before, 00:31:55.380 |
and try to induce some format: if you see it in the subject position, 00:31:59.540 |
get it to put it in the object position, and see how well it does that? 00:32:06.980 |
Yeah, it could be that you try to prompt a language model. 00:32:09.860 |
Zen did a bunch of that as part of the research. 00:32:14.860 |
But maybe there's a version of that where you 00:32:17.460 |
prompt in the right way with the right kind of instructions, and it works. 00:32:21.780 |
That would be wonderful because that would suggest to me that those models 00:32:27.780 |
have internal representations that are systematic enough to have kind of 00:32:32.060 |
a notion of subject and a notion of object. 00:32:39.220 |
Yeah. The cool thing about COGS is that I think it's a pretty 00:32:42.140 |
reliable measurement device for making such claims. Yeah. 00:32:48.740 |
How transferable is this discussion to languages other than English? 00:32:54.580 |
Like, I wonder if we should be concerned about 00:32:57.620 |
the very tight coupling between the properties of English as a language and these methods. 00:33:06.660 |
Well, I mean, I hope that a lot of you do projects on multilingual NLP, 00:33:17.780 |
We live in a golden age for that research as well. 00:33:20.740 |
There's more research on more languages than there was 10 years ago. 00:33:27.860 |
The downside is that it's all done with multilingual representations, 00:33:33.780 |
and they tend to do much better on English tasks than every other task. 00:33:45.900 |
So that's one of the mixed feelings that I have about a lot of these topics. 00:33:49.940 |
In the interest of time, let's press on a little bit. 00:33:54.980 |
I think I just wanted to skip to the course mechanics. 00:34:03.140 |
You can see that it has a kind of strong emphasis 00:34:06.100 |
toward the three parts that are related to the final project. 00:34:09.460 |
But the homeworks are also really important and the quizzes less so. 00:34:13.380 |
But I think they're important enough that you'll take them seriously. 00:34:22.660 |
and I am eager to interact with you here if possible, 00:34:29.500 |
Please attend office hours if you just want to chat. 00:34:32.020 |
One of my favorite games to play in office hours is when a group comes with 00:34:38.940 |
a bunch of project ideas and we rank them from least to most or most to least viable for the course, 00:34:44.060 |
and I think it always illuminates some things about the problems. 00:34:56.340 |
We just want you to be focused on the final project at that point. 00:35:02.580 |
We can talk about the grading of the original systems a bit later. 00:35:08.780 |
There are some links here: exceptional final projects, and some guidance. 00:35:12.860 |
These are the two documents I mentioned before. 00:35:15.900 |
I'll just say again that this is the most important part of 00:35:19.180 |
the course to me and the thing that's special. 00:35:22.380 |
We have this incredibly accomplished teaching team this year, 00:35:25.540 |
with diverse interests, and they have all done incredible research on their own. 00:35:30.420 |
I've learned a ton from them and from their work, 00:35:39.180 |
and I hope you take advantage of their mentorship for the work you do. 00:36:02.540 |
We want you to be connected with the kind of ongoing discourse for the class. 00:36:07.860 |
so that you know your rights and responsibilities. 00:36:10.460 |
And then I think right now we should check out the homework, 00:36:16.740 |
and get oriented around that before we dive into transformers. 00:36:25.260 |
but I've kind of evoked it for you in case it raised any issues. 00:36:29.540 |
All right. Let's look briefly at the first homework. 00:36:41.140 |
and it is maybe an unusual mode for homeworks. 00:36:46.340 |
This is kind of cool. So this link will take you to the GitHub, 00:36:50.340 |
uh, which I think you're probably all set up with on your home computers. 00:36:54.140 |
But you might want to work with this in the Cloud. 00:36:58.980 |
So you could just click "Open in Colab." 00:37:02.340 |
And I think I've done a pretty good job of getting you so that it will set 00:37:08.900 |
itself up with the installs that you need and the course repo and so forth. 00:37:14.020 |
I would actually be curious to know whether there are bumps along 00:37:16.580 |
the road to getting this to just work out of the box in Colab. 00:37:19.580 |
I do encourage this because if you're ambitious, 00:37:23.980 |
you'll want more compute resources, and this is a good inexpensive way to get them. 00:37:27.140 |
It's also a pretty nice environment to do the work in. Zoom in here. 00:37:35.540 |
And that's actually kind of a good place to start. 00:37:42.500 |
you're encouraged to work with three datasets: 00:37:47.700 |
DynaSent round one, DynaSent round two, and the Stanford Sentiment Treebank. 00:37:57.500 |
But I'm not guaranteeing you that those labels are aligned in the semantic sense. 00:38:02.860 |
In fact, I think that the SST labels are a bit different from the DynaSent ones. 00:38:09.220 |
But certainly, the underlying data are different because DynaSent is 00:38:12.980 |
like product reviews and the Stanford Sentiment Treebank is movie reviews. 00:38:27.460 |
Whereas DynaSent round two is actually annotators working on the Dynabench platform, 00:38:35.980 |
trying to fool a really good sentiment model. 00:38:39.900 |
So the DynaSent round two examples are hard. 00:38:43.060 |
They involve like non-literal language use and sarcasm, 00:38:46.620 |
and other things that we know challenge current day models. 00:38:57.620 |
I'm just pushing you to develop a simple linear model with sparse feature representations. 00:39:04.580 |
This is a kind of more traditional mode background. 00:39:12.860 |
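As a rough sketch of that kind of baseline (using scikit-learn here; the course repo has its own utilities, so treat this as illustrative only, with made-up toy examples):

```python
# Illustrative sparse-features + linear-model sentiment baseline.
# In the homework you would load the actual dataset splits instead.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["great phone, love it", "terrible battery, broke fast", "it works fine"]
train_labels = ["positive", "negative", "neutral"]

# Unigram count features (sparse) feeding a logistic regression classifier.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 1)),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)

print(model.predict(["the battery is great"]))
```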
I think we should talk about how to get you up to speed for the course. 00:39:20.860 |
this should be a pretty straightforward question. 00:39:34.120 |
what you do is complete a function that I started. 00:39:40.980 |
This is mainly about starting to build your own original system. 00:39:55.540 |
The advantage of the tests for me is that if there was a problem, it would surface right away. 00:40:02.580 |
You also get a guarantee that if your code passes the test, 00:40:07.220 |
it should pass when you submit, because more or less the same tests run on Gradescope. 00:40:23.100 |
Those things always feel like they're just barely functioning. 00:40:26.780 |
But the idea is that this is really not about me evaluating you. 00:40:36.980 |
It's about giving you concepts that will let you develop your own systems. 00:40:40.620 |
I'm just trying to be a trusted guide for you on that. 00:40:44.340 |
So you do some coding and you have these three questions here. 00:40:47.820 |
The result of doing those three questions is that you 00:40:50.780 |
have something that could be the basis for your original system. 00:40:54.180 |
It'd be pretty cool, by the way, if some people 00:40:58.820 |
used these simpler models for their original systems, to show the transformers that there's still competition out there. 00:41:08.340 |
The next question works the same way, except now we're focused on transformer fine-tuning. 00:41:23.580 |
You'll learn some Hugging Face code and you'll also learn some concepts, 00:41:30.220 |
so that you understand what the representations are like. 00:41:39.020 |
You'll end up writing a PyTorch module where you fine-tune BERT. 00:41:43.780 |
That is step one to a really incredible system I'm sure. 00:41:54.260 |
So given the course code and everything else, 00:41:57.180 |
the interfaces for these things are pretty straightforward to write. 00:42:03.820 |
For the homework questions themselves, you don't actually need heavy-duty computing at 00:42:05.820 |
all because you don't do anything heavy-duty. 00:42:11.900 |
For your original system, that might be where you want to train a big monster model and figure out how to 00:42:16.140 |
work with the computational resources that you have to get that done. 00:42:26.780 |
So you'll still want to be on Colab or something like that. 00:42:30.140 |
Then I don't know how ambitious you're going to get for your original system. 00:42:34.460 |
You can tell that I'm trying to lead you toward using 00:42:37.340 |
question one and question two for your original system. 00:42:41.060 |
If you want to do something where you just prompt GPT-4, 00:42:51.260 |
you can, but the system still has to be your own work, so you can't just download somebody else's code. 00:42:53.660 |
If all you did was a very boring prompt structure, 00:42:56.780 |
you wouldn't get a high grade on your original system. 00:42:59.460 |
We're trying to encourage you to think creatively and explore. 00:43:03.260 |
Then the final thing is you just enter this in a bake-off, 00:43:06.380 |
and really that just means grabbing an unlabeled dataset 00:43:09.620 |
from the web and adding a column with predictions in it. 00:43:13.260 |
Then you upload that when you submit your work to Gradescope. 00:43:20.020 |
we'll reveal the scores and there'll be some winners, 00:43:24.860 |
I'm optimistic that we're going to have EC2 codes as prizes. 00:43:29.140 |
That's always been fun because if you win a bake-off, 00:43:31.100 |
you get a little bit more compute resources for your next big thing. 00:43:35.340 |
They don't want to hand out these codes anymore like they used to, 00:43:42.940 |
but I think I have an arrangement in place to get some. 00:43:48.980 |
We give out prizes for the best systems and the most creative systems, 00:43:51.780 |
and we have even given out prizes for the lowest scoring system. 00:43:55.660 |
Because if that was a cool thing that should have worked and didn't, 00:43:58.660 |
I feel like you did a service to all of us by going down that route. 00:44:04.900 |
So it's a way of trying to have a multi-dimensional leaderboard here, 00:44:08.700 |
even as we rank all of you according to the performance of your systems. 00:44:13.260 |
That's my overview. Questions or comments or anything? 00:44:58.540 |
You can see that this is a kind of outline of this unit. 00:45:01.500 |
Then there's also a good table of contents with good labels. 00:45:04.860 |
So if you need to find things in what I admit is a very large deck, that should help. 00:45:10.940 |
You can also track our progress as we move through these things. 00:45:25.940 |
What is happening with these contextual representations? 00:45:30.100 |
Okay. This one slide here used to take two weeks for this course. 00:45:38.620 |
The background materials are still at the website, 00:45:49.180 |
from back before natural language understanding was all the rage. 00:45:56.740 |
We're going to move quickly past things like Word2Vec, and we're going to dive right into transformers. 00:46:06.140 |
The dawning of the statistical revolution in NLP 00:46:13.840 |
was with feature-based sparse representations. 00:46:21.060 |
To represent a word, you might write a feature function that says, okay, 00:46:23.740 |
yes or no on it referring to an animate thing, 00:46:27.140 |
yes or no on it ending in the characters "ing," and so on. 00:46:36.980 |
And so all these little feature functions would end up giving 00:46:39.660 |
you really long vectors of essentially ones and zeros that were kind of 00:46:45.000 |
hand-designed and that would give you a perspective on 00:46:48.500 |
a bunch of the dimensions of the word you were trying to represent. 00:46:55.400 |
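Here's a rough sketch of what those hand-designed feature functions might look like in code (illustrative only; the particular features and the little lexicon are made up):

```python
# Illustrative hand-built sparse feature functions for a single word.
ANIMATE_WORDS = {"dog", "teacher", "bird", "child"}  # hypothetical lexicon

def word_features(word):
    w = word.lower()
    return {
        "is_animate": int(w in ANIMATE_WORDS),
        "ends_in_ing": int(w.endswith("ing")),
        "is_capitalized": int(word[0].isupper()),
        "length_gt_6": int(len(w) > 6),
    }

# Each word becomes a long, mostly-zero vector once you stack up
# many such indicator features.
print(word_features("Running"))  # {'is_animate': 0, 'ends_in_ing': 1, ...}
```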
And then it kind of started to get replaced, pre-Word2Vec and GloVe, 00:47:01.240 |
with methods like pointwise mutual information or TF-IDF. 00:47:09.940 |
These methods are fundamental in the field of information retrieval, 00:47:13.140 |
especially TF-IDF as a main representation technique. 00:47:20.300 |
It took a while for NLP people to realize that they would be valuable. 00:47:25.140 |
But what you start to see here is that instead of writing all those feature functions, 00:47:30.340 |
I'll just keep track of co-occurrence patterns in large collections of text. 00:47:35.820 |
And PMI and TF-IDF do this essentially just by counting. 00:47:42.560 |
But really it is the rawest form of distributional representation. 00:47:50.520 |
Then, or this is sort of simultaneous in an interesting way, 00:47:52.800 |
you have, paired with PMI and TF-IDF, methods like principal components analysis 00:47:59.000 |
and SVD, which is sometimes called latent semantic analysis. 00:48:07.800 |
So a whole family of these things that are essentially taking 00:48:10.640 |
count data and giving you reduced dimensional versions of that count data. 00:48:16.160 |
And the power of doing that is really that you can 00:48:19.480 |
capture higher order notions of co-occurrence. 00:48:28.640 |
Maybe you don't co-occur with me directly, but you co-occur with words that co-occur with the things I co-occur with. 00:48:32.600 |
You're kind of second order neighbors and you can imagine 00:48:35.360 |
traveling out into this representational neighborhood here. 00:48:39.200 |
And that turns out to be very powerful because a lot of 00:48:42.200 |
semantic affinities come not from just being neighbors with something, 00:48:45.640 |
but rather from that whole network of things co-occurring with each other. 00:48:50.520 |
And what these methods do is take all that count data and compress it in a way that 00:48:54.960 |
loses some information but also captures those notions of similarity. 00:49:00.440 |
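A rough sketch of that pipeline, going from raw co-occurrence counts to a PPMI reweighting to a reduced-dimensional space via truncated SVD (illustrative, with a tiny made-up count matrix):

```python
# Counts -> PPMI -> truncated SVD (LSA-style), as an illustrative sketch.
import numpy as np

words = ["gnarly", "wicked", "awesome", "terrible"]
counts = np.array([
    [10.0, 7.0, 4.0, 1.0],
    [ 7.0, 9.0, 5.0, 1.0],
    [ 4.0, 5.0, 8.0, 1.0],
    [ 1.0, 1.0, 1.0, 6.0],
])

def ppmi(X):
    """Positive pointwise mutual information reweighting of a count matrix."""
    total = X.sum()
    row = X.sum(axis=1, keepdims=True) / total
    col = X.sum(axis=0, keepdims=True) / total
    joint = X / total
    with np.errstate(divide="ignore"):
        pmi = np.log(joint / (row * col))
    return np.maximum(pmi, 0.0)

M = ppmi(counts)

# Truncated SVD: keep the top-k singular dimensions as dense vectors.
U, S, Vt = np.linalg.svd(M)
k = 2
embeddings = U[:, :k] * S[:k]  # one k-dimensional vector per word
print(dict(zip(words, embeddings.round(2))))
```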
And then the final step, which might actually be the kind of 00:49:04.520 |
final step in this literature, was learned dimensionality reduction methods like word2vec and GloVe. 00:49:12.800 |
And this is where you might start with some count data, 00:49:16.400 |
but you have some machine learning algorithm that learns how to 00:49:20.360 |
compute dense learned representations from that count data. 00:49:26.600 |
Um, so kind of like step three infused with more of what we know of as machine learning now. 00:49:34.120 |
And I say it might be the end because I think now, 00:49:38.280 |
for anything that you would do with this mode, 00:49:41.160 |
you would probably just use contextual representations. 00:49:47.280 |
And then here's the review if you want, right? 00:49:51.200 |
And I think it is important to understand this both the history but also 00:49:54.560 |
the technical details to really deeply understand what I'm about to dive into. 00:49:58.720 |
So you might want to circle back if that was too fast. Yeah. 00:50:03.240 |
Is there any option to just like one-hot encode your entire vocabulary? 00:50:08.040 |
I think this is my understanding of what modern transformer-based models do. 00:50:17.520 |
So, well, just say a bit more about what you have in mind. 00:50:21.200 |
Like, my understanding of how large language models encode individual words 00:50:29.160 |
is they have a list of all of their possible tokens. 00:50:35.960 |
You have a vector of the length of that vocabulary, 00:50:42.120 |
and then you just one-hot encode which token that is. 00:50:48.280 |
So why don't if- I'll show you how they represent things. 00:50:51.520 |
And let's see if it connects with your question. 00:50:59.480 |
And the notion of token and the notion of type is about to get sort of complicated. 00:51:07.880 |
Just a little bit of context here about why I think this is so exciting. 00:51:12.920 |
I was excited by the static vector representations of words, 00:51:16.880 |
but they were also very annoying to me because they give you one vector per word. 00:51:23.160 |
Whereas my experience of language is that words have 00:51:27.480 |
multiple senses and it is hard to delimit where the senses begin and end. 00:51:33.800 |
My favorite example is the verb "break," which I've worked on with my PhD student, Erika Petersen. 00:51:51.680 |
You have "The news broke," with this being something more like "was published" or "appeared." 00:52:02.840 |
"Sandy broke the law" is a different sense yet again. 00:52:15.840 |
"The newscaster broke into the movie broadcast" is another one. 00:52:23.120 |
And "We broke even" means, I don't know, we ended up back at the same amount we started with. 00:52:30.120 |
If I was in the old mode of static representation, 00:52:33.440 |
would I survive with one break vector for all of these examples? 00:52:41.760 |
But then what about all the ones that I didn't list here? 00:52:44.400 |
The number of senses for break starts to feel impossible to enumerate 00:52:49.960 |
if you just think about all the ways in which you encounter this verb. 00:52:53.200 |
And there is some metaphorical core that seems to run through them. 00:53:01.720 |
And this tells me that the sense of a word like break is being modulated by the context it is appearing in. 00:53:10.240 |
And the idea that we would have one fixed representation for it starts to seem hopeless. 00:53:29.440 |
Or think about "flat": the examples here feel like at least two to four different senses for flat. 00:53:41.160 |
It's tragic to think we would have one "throw" vector that was meant to cover all of these examples, right? 00:53:52.960 |
That might feel like a standard sort of lexical ambiguity. 00:53:56.480 |
And so maybe you can imagine that we have one vector for crane as a bird and another for crane as a machine. 00:54:03.760 |
But is that going to work for the entire vocabulary? 00:54:09.520 |
We wouldn't even know what vector we were dealing with there, right? 00:54:13.960 |
And now we have another problem on our hands, 00:54:15.720 |
which is selecting the static vector based on contexts, right? 00:54:19.840 |
How are we going to do that? And this is a really deep thing. 00:54:22.560 |
It's not just about the local kind of morphosyntactic context here. 00:54:30.680 |
So the sense of "any" there is like "any typos," right? 00:54:37.840 |
Now the sense of "any" and the kind of elliptical stuff that comes after it is "any bookstores." 00:54:44.560 |
And now I hope you can see that the sense that words can have is modulated by context in the most extended sense. 00:54:53.040 |
And having fixed static representations was never going to work in the face of all of this diversity. 00:55:00.240 |
We were never going to figure out how to cut up the senses in just the right way to get all of this data handled correctly. 00:55:09.720 |
And the vision of contextual representation models is that you're not even going to try to do all that hard and boring stuff. 00:55:17.040 |
Instead, you are just going to embrace the fact that every word could take on a different sense, 00:55:22.640 |
that is, have a different representation depending on everything that is happening around it. 00:55:28.480 |
And we won't have to decide then which sense is in 1A and whether it's different from 1B and 1C and so forth. 00:55:33.840 |
We will just have all of these token level representations. 00:55:40.120 |
It will be entirely a theory that is based in words as they appear in context. 00:55:47.920 |
For me as a linguist, it is not surprising at all that this turns out to lead to lots of engineering successes 00:55:54.400 |
because it feels so deeply right to me about how language works. 00:56:01.120 |
Uh, brief history here. I just want to be dutiful about this. Make sure people get credit where it's due. 00:56:06.680 |
November 2015, Dai and Le, that's a foundational paper where they really did what is probably the first example of language model pre-training. 00:56:17.000 |
It's a cool paper to look at. It's complicated in some ways that are surprising to us now, and it is certainly a visionary paper. 00:56:24.920 |
And then McCann et al., this is a paper from Salesforce Research, which at the time was led by Richard Socher, 00:56:31.040 |
who is a distinguished alum of this class. Proud of that. 00:56:34.960 |
They developed the CoVe model, where what they did is train machine translation models. 00:56:40.320 |
And then the inspired idea was that the translation representations might be useful for other tasks. 00:56:48.120 |
And again, that feels like the dawn of the notion of pre-training contextual representations. 00:56:54.200 |
And then ELMo came. I mentioned ELMo last time. Huge breakthrough. Massive bidirectional LSTMs. 00:57:01.320 |
And they really showed that that could lead to rich multipurpose representations. 00:57:06.600 |
And that's where you really feel everyone reorienting their research toward these kind of models. 00:57:13.000 |
That's not a transformer-based one, though. That's built on bidirectional LSTMs. 00:57:17.320 |
And then we get, um, GPT in June 2018 and BERT in October 2018. 00:57:25.320 |
Um, the BERT paper was published a long time after that, but as I said before, 00:57:30.040 |
it had already achieved massive impact by the time it was published in 2019 or whatever. 00:57:35.400 |
So that's why I've been giving the months here, because you can see it's really- there was this 00:57:39.560 |
sudden uptake in the amount of- of interest in these things that happened around this time. 00:57:48.440 |
Another kind of interesting thing to think about if you step back for the context here, 00:57:54.840 |
is that we as a field have been traveling from high bias models, 00:58:00.280 |
where we decide a lot about how the data should look and be processed, 00:58:03.960 |
toward models that impose essentially nothing on the world. 00:58:09.800 |
I'm just imagining a model that's kind of in the old mode, 00:58:12.280 |
where you have like your GloVe representations of these three words. 00:58:16.280 |
And to get a representation for the sentence, you just add up those representations. 00:58:21.320 |
In doing that, you have decided ahead of time, 00:58:24.760 |
a lot of stuff about how those pieces are gonna come together. 00:58:28.200 |
I mean, you just said it was gonna be addition, 00:58:31.960 |
which is almost certainly not correct about how the world works. 00:58:34.920 |
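A tiny sketch of that high-bias baseline (illustrative; it assumes you have some pretrained static vectors, e.g., GloVe, loaded into a dictionary called `glove`):

```python
# High-bias sentence representation: just sum the static word vectors.
# `glove` is assumed to be a dict from word to a fixed NumPy vector.
import numpy as np

def sentence_vector(sentence, glove, dim=50):
    vecs = [glove[w] for w in sentence.lower().split() if w in glove]
    if not vecs:
        return np.zeros(dim)
    # The modeling decision is baked in here: composition is addition.
    return np.sum(vecs, axis=0)
```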
So that's a prototypical case of a high-bias decision. 00:58:38.520 |
If you move over to the right here, that's a kind of recurrent neural network. 00:58:43.720 |
And here, I've kind of decided that my data will be processed left to right. 00:58:49.640 |
I could learn a lot of different functions in that space, 00:58:53.400 |
so it's much more expressive, much less biased in this machine learning sense, 00:58:59.160 |
But I have still decided ahead of time that I'm gonna go left to right. 00:59:08.520 |
Richard Socher, who I mentioned, was truly a pioneer in tree-structured recursive neural networks. 00:59:13.320 |
Here, I make a lot of decisions about how the pieces can possibly come together. 00:59:18.120 |
"The rock" is a unit constituent, separate from "rules," which comes in later. 00:59:23.080 |
And I'm just saying I know ahead of time that the data will work that way. 00:59:28.360 |
If I'm right, it's gonna give me a huge boost, 00:59:33.400 |
If I'm wrong though, I might be wrong forever. 00:59:36.520 |
And I think it's actually that feeling that you might be wrong forever that pushed the field away from these high-bias designs. 00:59:42.280 |
So here, I've got kind of a bidirectional recurrent model. 00:59:46.280 |
So now, you can go left to right and right to left, 00:59:49.640 |
and you can add all these attention mechanisms that are gonna connect the states to each other. 00:59:56.680 |
And this is a true progression with what happened with recurrent neural networks. 01:00:02.760 |
Go both directions and add a bunch of attention connections. 01:00:06.040 |
And that is kind of the thing that caused everyone to realize, 01:00:11.160 |
"Oh, we should just connect everything to everything else 01:00:14.280 |
and go to the maximally low-biased version of this, where 01:00:27.480 |
I have no idea what the world is gonna be like. 01:00:30.120 |
I just trust in my data and my optimization." 01:00:33.160 |
The attention piece is really interesting to me. 01:00:37.480 |
You know, we used to talk a lot about this in the course. 01:00:46.760 |
The standard thing was that I might fit a classifier on this final representation here. 01:00:53.960 |
But people went on that journey I just described, 01:01:00.120 |
asking, "If I only use that final state, won't I lose a lot of information about the earlier words? 01:01:04.680 |
I should have some way of, like, connecting back." 01:01:11.080 |
So you add these attention connections, essentially, between the thing that you're using for your classifier 01:01:17.960 |
and the earlier states, just as a kind of way of scoring this final thing with respect to the previous things, 01:01:25.720 |
and then form what was called a context vector. 01:01:31.240 |
And what I've done here is build these links back to all these previous states. 01:01:36.200 |
And that turned out to be incredibly powerful. 01:01:39.640 |
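Here's a rough numerical sketch of that idea: score the final hidden state against the earlier states, softmax the scores, and form the context vector as a weighted sum (illustrative, with made-up values; not any particular paper's exact formulation):

```python
# Illustrative dot-product attention over earlier recurrent states.
import numpy as np

rng = np.random.default_rng(0)
states = rng.normal(size=(5, 8))   # T=5 time steps, hidden size 8 (made up)
query = states[-1]                 # the final state, used by the classifier

scores = states @ query                          # score each state vs. the final one
weights = np.exp(scores) / np.exp(scores).sum()  # softmax over time steps
context = weights @ states                       # weighted sum: the context vector

# `context` can then be combined with `query` before the classifier.
print(weights.round(3), context.shape)
```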
And when you read the title of the paper, "Attention is all you need," 01:01:46.360 |
the claim is, "You don't need LSTM connections and recurrent connections and stuff." 01:01:52.840 |
That is the sense in which they're saying all you needed 01:01:55.320 |
was these connections that you were adding onto the top of your earlier models. 01:02:03.960 |
We can debate that, but certainly it has taken over the field. 01:02:07.880 |
Another important idea here that might often be overlooked 01:02:12.440 |
is just this notion that we should model the sub-parts of words. 01:02:16.200 |
And again, I can't resist a historical note here. 01:02:21.960 |
What ELMo did to embrace this insight is incredible. 01:02:28.680 |
They start with character-level representations, and then they fit all these convolutions on top of all those character-level representations, 01:02:34.760 |
which is essentially like ways of pooling together sub-parts of the word. 01:02:38.760 |
And then they form a representation at the top that's like the average 01:02:43.560 |
plus the concatenation of all of these different convolutional layers. 01:02:48.360 |
And the result of this is a vocabulary that does latently have information about characters 01:02:54.760 |
and sub-parts of words as well as words in it. 01:03:00.680 |
And this is like a space in which you could capture lots of things like 01:03:04.200 |
how talk is similar to talking and is similar to talked. 01:03:07.880 |
And you know, all that stuff that a simple unigram parsing would miss, you capture latently here. 01:03:16.280 |
But the vocabulary for ELMo is like 100,000 words. 01:03:21.640 |
So that's a 100,000-word embedding space that I need to have. 01:03:27.560 |
And it's still the case that if you process real data, 01:03:32.440 |
you have to unk out, that is, mark as unknown most of the words you encounter. 01:03:37.400 |
Because the language is incredibly complicated, 01:03:40.280 |
and 100,000 doesn't even come close to covering 01:03:43.800 |
all the tokens that you encounter in the world. 01:03:46.200 |
And so again, we have this kind of galaxy brain moment where I guess the field says: 01:03:52.760 |
what you do instead is tokenize your data 01:03:56.600 |
so that you just split apart words into their sub-word tokens if you need to. 01:04:03.640 |
So here I've got an example with the BERT tokenizer. 01:04:12.600 |
Notice that the word "encode" has been split into two tokens. 01:04:12.600 |
The tokenizer didn't keep the word whole, but rather split it apart into a bunch of different pieces. 01:04:31.400 |
That's how it can be that BERT-like models have only 30,000 words in their vocabulary. 01:04:37.720 |
But they're words in the sense that they're these sub-word tokens. 01:04:50.920 |
In the realm where we were doing static word representations, 01:04:57.640 |
each of those pieces would just get its own vector and have no sense in which it was participating in the larger word "snuffleupagus." 01:05:10.840 |
With contextual models, we can hope that the model reconstructs something like the word from its pieces. 01:05:17.640 |
Certainly, we could hope that for something like encode. 01:05:25.880 |
That's a real change in perspective, and incredibly freeing in terms of the engineering resources that you need. 01:05:30.200 |
But it does depend on rich contextual representations. 01:05:35.800 |
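If you want to see this for yourself, here's a quick sketch with the Hugging Face tokenizer (the exact splits depend on the pretrained vocabulary, so treat the outputs in the comments as indicative, not guaranteed):

```python
# Quick look at BERT's subword tokenization.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("encode"))         # e.g., ['en', '##code']
print(tokenizer.tokenize("snuffleupagus"))  # several '##'-prefixed pieces
print(tokenizer.vocab_size)                 # roughly 30,000 entries
```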
And then another notion, positional encoding. 01:05:39.080 |
So we have all these tokens or maybe, you know, subparts of words. 01:05:46.280 |
using a traditional static embedding space like a GloVe one, 01:05:50.760 |
that's what I put with these light gray boxes here, 01:05:53.640 |
we'll also represent sequences with positional encodings, 01:05:57.640 |
which will just keep track of where the token appeared in the sequence I'm processing. 01:06:03.720 |
And what that means is that the word "rock" here 01:06:11.960 |
gets a different input representation depending on its position, say, if "rock" appears in position 47 in the string. 01:06:21.160 |
And that's another way in which you're embracing the fact 01:06:25.160 |
that all of these representations are going to be contextual. 01:06:31.640 |
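A minimal sketch of the learned-positional-embedding version of this idea in PyTorch (illustrative; the sizes and token ids below are made up, and real transformer implementations differ in details):

```python
# Token embeddings plus learned positional embeddings, BERT-style.
import torch
import torch.nn as nn

vocab_size, max_len, dim = 30000, 512, 64  # made-up sizes

tok_emb = nn.Embedding(vocab_size, dim)
pos_emb = nn.Embedding(max_len, dim)

token_ids = torch.tensor([[101, 7592, 2088, 102]])        # (batch=1, seq_len=4)
positions = torch.arange(token_ids.size(1)).unsqueeze(0)  # [[0, 1, 2, 3]]

# The same token id gets a different input vector at different positions.
x = tok_emb(token_ids) + pos_emb(positions)
print(x.shape)  # torch.Size([1, 4, 64])
```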
I mention this because I've been slow to realize what maybe the whole field already knew: 01:06:37.320 |
how you do this positional encoding really matters for how models perform. 01:06:42.040 |
And that's why, in fact, it's like one of these early topics here. 01:06:46.600 |
And then of course, another guiding idea here 01:06:53.480 |
is that we are going to do massive scale pre-training. 01:07:00.120 |
We'll take huge sequences of text, with all these tiny little parts of words in them, 01:07:06.280 |
and we are going to train at an incredible scale. 01:07:08.840 |
That's that same story from word2vec and GloVe through GPT and BERT. 01:07:16.760 |
And some magic happens as you do this on more and more data. 01:07:29.720 |
This insight that instead of starting from scratch for my machine learning models, 01:07:35.240 |
I should start with a pre-trained one and fine-tune it for particular tasks. 01:07:41.000 |
We saw this a little bit in the pre-transformer era. 01:07:46.360 |
The standard mode was to take GloVe or word2vec representations 01:07:50.600 |
and have them be the inputs to something like an RNN. 01:07:55.640 |
And instead of having to learn the embedding space from scratch, 01:07:59.400 |
it would start in this really interesting space. 01:08:02.280 |
And that is actually a kind of learning of contextual representations. 01:08:07.880 |
Because what happens if the GloVe representations are updated 01:08:11.160 |
is that they all shift around and the network kind of pushes them around 01:08:15.080 |
so that you get different senses for them in context. 01:08:18.840 |
And then again, the transformer thing just takes that to the limit. 01:08:22.760 |
And that is the mode that you'll operate in for the first homework. 01:08:29.320 |
We have this thing where, I hope you can make it out at the bottom, 01:08:32.440 |
you load in BERT and you just put a classifier on top of it. 01:08:35.960 |
And you learn that classifier for your sentiment task, say. 01:08:40.520 |
And that actually updates the BERT parameters. 01:08:43.720 |
And the BERT parameters help you do better at your task. 01:08:47.240 |
And in particular, they might help you generalize 01:08:50.680 |
to things that are sparsely represented in your task-specific training data. 01:08:56.280 |
Because they've learned so much about the world in their pre-training phase. 01:09:05.880 |
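A minimal sketch of that setup, loading BERT and putting a classifier on top for fine-tuning (assumes the Hugging Face transformers library; the class name, label ids, and examples are made up, and the course code has its own, more complete version):

```python
# Minimal BERT fine-tuning sketch: a pretrained encoder with a small
# classifier head; all parameters are updated by backprop.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class BertSentimentClassifier(nn.Module):
    def __init__(self, n_classes=3, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.bert.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] token representation
        return self.classifier(cls)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertSentimentClassifier()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()

batch = tokenizer(["great movie", "awful battery"], padding=True, return_tensors="pt")
labels = torch.tensor([2, 0])  # made-up label ids for positive / negative

logits = model(batch["input_ids"], batch["attention_mask"])
loss = loss_fn(logits, labels)
loss.backward()    # gradients flow into BERT's parameters too
optimizer.step()
```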
I'm a little worried that we're moving into a future 01:09:08.280 |
in which fine-tuning is just again using an OpenAI API. 01:09:12.600 |
But you all will definitely learn how to do much more than this, 01:09:16.520 |
even if you fall back on doing this at some point. 01:09:18.760 |
There, what you're doing is some lightweight version of fine-tuning. 01:09:28.360 |
It's just that the starting point knows so much about language and the world. 01:09:47.180 |
Going back to the sub-word splitting of the longer word, 01:09:52.600 |
is there a bias we are imposing by splitting that particular way, 01:09:56.600 |
or is that also driven by the model itself? 01:10:02.120 |
So what gets imposed as a modeling bias in that sense comes from the algorithms 01:10:14.120 |
that you can run for doing that sub-word tokenization. 01:10:16.760 |
You'll see this as you read papers and as I talk. 01:10:24.680 |
All of them are attempts to learn a kind of optimal way to tokenize the data 01:10:30.600 |
based on things that tend to co-occur a lot together. 01:10:39.240 |
As someone who's interested in the morphology of languages, 01:10:46.600 |
I find this especially interesting if you think about languages with very rich morphology. 01:10:50.120 |
You might have an intuition that you want a tokenization scheme 01:10:52.680 |
that reproduces the morphology of that language, 01:10:55.080 |
that splits a big word with all its suffixes say, 01:10:58.360 |
down into things that look like the actual pieces, 01:11:04.040 |
well, the best of these schemes should come close to that, right? 01:11:07.800 |
And that could be an important and useful bias that you impose. 01:11:16.760 |
Can you elaborate on what happens when we do fine-tuning to the original model? 01:11:35.640 |
There's like an easy answer and a hard answer. 01:11:37.480 |
So the easy answer is that you are simply back-propagating the error signal, 01:11:43.720 |
you know, from the output comparison with the true label, 01:11:46.280 |
back through all the parameters of the model. 01:11:51.160 |
So as you fine-tune on your sentiment task, all of the BERT parameters can get updated. 01:11:57.320 |
And then of course, there are variants of that 01:11:59.000 |
where you update just some of the BERT parameters and freeze the rest. 01:12:03.560 |
But the idea is you have a smart initialization. 01:12:19.240 |
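As a rough sketch of that lighter-weight variant, freezing the pretrained encoder and training only the classifier head (this reuses the hypothetical `BertSentimentClassifier` sketched above):

```python
# Lightweight variant: freeze BERT and train only the classifier head.
# `BertSentimentClassifier` is the illustrative class from the earlier sketch.
import torch

model = BertSentimentClassifier()

for param in model.bert.parameters():
    param.requires_grad = False  # keep the pretrained weights fixed

# Only the classifier head's parameters receive gradient updates.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)
```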
That could actually connect with the explainability stuff, 01:12:21.640 |
like what adjustments are happening to the network, 01:12:30.680 |
Are there lightweight versions of the fine-tuning 01:12:35.960 |
that get a better balance from the pre-training? 01:13:01.800 |
oh, just change the model a little bit, 01:13:12.200 |
that you could control a kind of out-of-control car. 01:13:30.280 |
But it's an art and a science at the same time. 01:13:34.760 |
My hope is that Sid, who's going to do a hands-on session, 01:13:38.360 |
imparts some of his own hard-won lessons to you 01:13:49.080 |
and your optimizer and other things you can fiddle with, 01:13:52.040 |
and hope that it steers in the direction you want. 01:14:03.240 |
why don't they work super well for language models? 01:14:06.840 |
And like, I guess the sentiment that you had was like, 01:14:09.720 |
oh, kind of just like put attention towards anything 01:14:27.560 |
we were actually making it harder for our models, 01:14:27.560 |
because we were putting them in this bad initial state. 01:14:30.200 |
Now we can look inside these models to see if we can see what tree structures they've induced. 01:14:41.640 |
And another aspect of this is that I feel like 01:14:59.560 |
And so we want those all simultaneously represented. 01:15:02.920 |
these powerful models we're talking about could do that. 01:15:08.200 |
for helping us learn what the right structures are, 01:15:23.960 |
Is it in the same domain, or is it something otherwise? 01:15:23.960 |
This is a great example of something that sounds small, 01:15:45.800 |
Do you do it based on a strong bias that you have 01:15:57.480 |
that that low-level tokenization choice will influence 01:16:06.200 |
And so a paper that evaluated a bunch of schemes 01:17:01.240 |
We'll resolve the questions that we got back there. 01:17:03.320 |
You'll see much more of these attention connections.