
Stanford XCS224U: Natural Language Understanding | Course Overview, Part 2 | Spring 2023



00:00:00.000 | All right.
00:00:06.120 | Welcome back everyone.
00:00:08.380 | Day two. Got a lot we want to accomplish today.
00:00:12.460 | What I have on the screen right now is the home base for the course.
00:00:17.660 | This is our public website and you could think of it as
00:00:20.560 | kind of a hub for everything that you'll need in the course.
00:00:24.400 | You can see along the top here we've got some policy pages.
00:00:29.060 | There's a whole page on projects.
00:00:31.560 | There's a page that provides an index of background materials,
00:00:36.200 | YouTube screencasts, slides,
00:00:39.880 | hands-on materials in case you need to fill in some background stuff.
00:00:43.580 | Notice also I do a podcast that actually began in this course last year,
00:00:48.940 | and I found it so rewarding that I just continued doing it all year.
00:00:53.120 | So new episodes continue to appear.
00:00:55.240 | If you have ideas for guests for this podcast,
00:00:58.640 | feel free to suggest them.
00:00:59.960 | I'm always looking for exciting people to interview,
00:01:02.920 | and I think the back episodes are also really illuminating.
00:01:06.320 | That's along the top.
00:01:07.840 | Then over here on the left,
00:01:09.320 | you've got one-stop shopping for the various systems that we have to deal with.
00:01:13.960 | You've got our Ed forum for discussion.
00:01:16.380 | If you're not in there, let us know.
00:01:18.040 | We can get you signed up.
00:01:19.520 | Canvas is your home for the screencasts and also the quizzes,
00:01:24.960 | and I guess there's some other stuff there.
00:01:26.920 | Gradescope is where you'll submit the main assignments,
00:01:30.560 | including your project work,
00:01:32.100 | and also enter the bake-offs.
00:01:34.480 | Then we have our course GitHub,
00:01:36.280 | and that is the course code that we'll be depending on for the assignments,
00:01:40.040 | and that I hope you can build on for the original work that you do.
00:01:43.600 | If you need to reach us,
00:01:45.000 | you can use the discussion forum,
00:01:46.360 | but we also have this staff email address that is
00:01:49.280 | vastly preferred to writing to us individually.
00:01:53.800 | It really helps us manage the workload and know what's happening if you either ping us
00:01:59.400 | on the discussion forum, private posts,
00:02:01.680 | public posts, whatever, or use that staff address.
00:02:06.240 | In the middle of this page here,
00:02:09.520 | we've got links to all the materials.
00:02:12.360 | The first column is slides and stuff like that, and also notebooks.
00:02:16.980 | The middle column, it's core readings mostly.
00:02:20.360 | I'm not presupposing that you will manage to do all of this reading because there is a lot of it,
00:02:26.760 | but these are important and rewarding papers,
00:02:30.160 | and so at some point in your life,
00:02:31.720 | you might want to immerse yourselves in them.
00:02:33.640 | But I'm hoping that I can be your trusted guide through that literature.
00:02:38.080 | Then on the right, you have the assignments.
00:02:41.800 | That's the website.
00:02:44.560 | Questions, comments, anything I could clear up?
00:02:47.520 | I have a time budgeted later to review the policies and required work in a bit more detail.
00:02:53.040 | But if there are questions now,
00:02:54.400 | I'm happy to take them. Yes.
00:02:57.120 | For the quizzes, are the quizzes doable on the day that they become available,
00:03:01.360 | or do we need the course material all the way up through the due date?
00:03:06.600 | That is a good question.
00:03:08.580 | It's going to depend on your background.
00:03:10.440 | But in the worst case,
00:03:12.520 | if this is all brand new to you,
00:03:14.380 | you might not feel like you can confidently finish the quiz until that final lecture in the unit.
00:03:21.160 | Like this one is all about transformers.
00:03:23.360 | All of the answers are embedded in this handout here.
00:03:26.640 | But if you want to hear it from me,
00:03:29.540 | you might not hear that until next Thursday,
00:03:32.360 | but that gives you another five days.
00:03:33.960 | Perfect. Thank you.
00:03:37.160 | You mentioned past projects are available.
00:03:40.160 | Where can we learn?
00:03:41.880 | Right. That must be here.
00:03:44.040 | I think I've got an index of past projects behind a protected link,
00:03:48.840 | which will depend on you being enrolled.
00:03:50.700 | If you're not enrolled, we can get you past that little hurdle.
00:03:53.680 | But I did get permission to release some of them.
00:03:56.120 | So somewhere on this page is a link to X,
00:03:59.120 | oh, there it is, exemplary projects.
00:04:02.740 | There's another list at the GitHub projects.md page,
00:04:07.600 | which is also linked somewhere in here,
00:04:09.280 | of published work, and that stuff you could download.
00:04:11.760 | The private link gives you the actual course submission.
00:04:15.000 | That could be an interesting exercise to
00:04:16.840 | compare the paper they did in here with the thing they actually published.
00:04:20.920 | I'll emphasize again that that will be interesting because of how much work it
00:04:24.900 | typically takes to go from a class project to something that makes it onto the ACL anthology.
00:04:30.720 | But that's of course an exciting journey.
00:04:34.040 | Oh yeah, and if you haven't already,
00:04:38.640 | do this course setup.
00:04:39.660 | It's very lightweight.
00:04:41.040 | You get your computing environment set up to use our code.
00:04:44.480 | Actually, this is a sign of the changing times.
00:04:47.120 | I also exhort you to sign up for a bunch of services,
00:04:50.160 | Colab, and maybe consider getting a pro account for $30.
00:04:54.480 | Over the course of the entire quarter,
00:04:56.040 | you could get a lot more compute on Colab, including GPUs.
00:04:59.440 | Also, the Amazon versions, SageMaker Studio.
00:05:04.920 | In addition, OpenAI account and Cohere account.
00:05:09.520 | Both of those have free tiers.
00:05:12.200 | For Cohere, you get really rate limited and for OpenAI, they give you $5.
00:05:16.600 | You could consider spending a little bit more.
00:05:18.760 | I do think you could do all our coursework for under those amounts.
00:05:22.080 | I think that for OpenAI,
00:05:23.720 | you could still have lots of accounts if you wanted to.
00:05:26.840 | Each one getting $5.
00:05:28.640 | It used to be 18 and now it's five,
00:05:30.760 | so we know what's coming.
00:05:32.440 | But embrace it while you can.
00:05:34.760 | Also, I'll say, I'm pretty well confident that we'll get a bunch of
00:05:38.640 | credits from AWS Educate for you to use EC2 machines.
00:05:43.160 | So more details about that in a little bit.
00:05:46.680 | If you want to follow along,
00:05:49.680 | let's head to this one.
00:05:50.720 | This is our slideshow from last time.
00:05:52.600 | I do just want to review some things.
00:05:54.440 | What we did last time is I tried to immerse us in
00:05:57.520 | this weird and wonderful moment for AI and give you a sense for how we got here.
00:06:04.120 | Then we talked about the first two units,
00:06:06.520 | transformers and retrieval augmented in-context learning.
00:06:11.400 | I think that is all wonderful stuff.
00:06:13.600 | I expect you all to do lots of creative and cool things in that space.
00:06:17.440 | But it's important for me to continue this slideshow because there is
00:06:20.960 | more to our field than just those big language models and prompting.
00:06:25.000 | There are lots of important ways to contribute beyond that.
00:06:28.040 | So let me take a moment and just give you an indication of what I have in mind there.
00:06:32.880 | Our third main course unit,
00:06:36.040 | I've called compositional generalization.
00:06:38.560 | This is brand new.
00:06:40.080 | We're going to focus on the COGS benchmark,
00:06:43.080 | which is a relatively recent synthetic dataset that is designed to stress test
00:06:49.600 | whether models have really learned systematic solutions to language problems.
00:06:55.200 | So the way COGS works is we have essentially a semantic parsing task.
00:06:59.720 | The input is a sentence like Lena gave the bottle to John,
00:07:03.440 | and the task is to learn how to map those sentences to their logical form,
00:07:08.160 | which are these logical representations down here.
00:07:11.600 | The interesting thing about COGS is that they've posed hard generalization tasks.
00:07:17.920 | For example, in training,
00:07:19.960 | you might get to see examples where Lena here is in subject position,
00:07:24.780 | and then at test time,
00:07:26.440 | you see Lena in object position.
00:07:29.040 | Or at train time,
00:07:30.800 | you might see Paula as a name but in isolation,
00:07:34.040 | and the task is to have the system learn how to deal with
00:07:36.960 | Paula as a subject of a sentence like Paula painted a cake.
00:07:41.480 | Or object PP to subject PP.
00:07:44.960 | So at train time,
00:07:46.340 | you see Emma ate the cake on the table,
00:07:48.680 | where on the table is inside the direct object of the sentence.
00:07:52.200 | Then at test time,
00:07:53.520 | you see the cake on the table burned,
00:07:55.360 | where on the table is now a subject.
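To make the task format concrete, here is a hypothetical illustration of the "object PP to subject PP" split. The logical-form strings are written in the general spirit of COGS (definite NPs flagged with an asterisk, variables indexed by token position), but the dataset's literal notation differs in its details, so treat this as a sketch rather than real COGS data.

```python
# Hypothetical COGS-style train/test pair for the "object PP to subject PP" split.
# The logical forms approximate the COGS style; they are not verbatim dataset entries.
train_example = {
    "input": "Emma ate the cake on the table",   # the PP modifies the direct object
    "output": "* cake ( x _ 3 ) ; * table ( x _ 6 ) ; "
              "eat . agent ( x _ 1 , Emma ) AND eat . theme ( x _ 1 , x _ 3 ) "
              "AND cake . nmod . on ( x _ 3 , x _ 6 )",
}
test_example = {
    "input": "The cake on the table burned",     # the PP-modified NP is now the subject
    "output": "* cake ( x _ 1 ) ; * table ( x _ 4 ) ; "
              "burn . theme ( x _ 5 , x _ 1 ) AND cake . nmod . on ( x _ 1 , x _ 4 )",
}
```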
00:07:58.240 | These seem like dead simple generalization tasks,
00:08:02.240 | and the sentences are very simple.
00:08:04.520 | But here's the punchline.
00:08:06.520 | This is a kind of accumulated leaderboard of a lot of entries for COGS.
00:08:11.440 | If you look all the way on the right,
00:08:13.120 | you can see systems are doing pretty well.
00:08:15.180 | It is impressive that they can go from
00:08:17.020 | these free-form sentences into those very ornate logical forms.
00:08:20.440 | Okay, but look at this column.
00:08:22.320 | This is a column of zeros,
00:08:24.300 | object PP to subject PP.
00:08:26.520 | It looked really simple,
00:08:28.040 | and that's just the task of learning from Emma ate the cake on
00:08:31.120 | the table and predicting the cake on the table burned.
00:08:33.920 | Why are all these brand new systems getting zero on this split?
00:08:39.520 | That shows first of all that this is a hard problem.
00:08:42.620 | Now, we are going to work with a variant that we created of COGS called reCOGS.
00:08:47.760 | This was done with my students Zen Wu and Chris Manning.
00:08:50.280 | It's brand new work.
00:08:51.780 | We think that in part,
00:08:53.480 | all those zeros derive from there being some artifacts in COGS.
00:08:57.320 | So it was made kind of artificially hard and also artificially easy in some ways.
00:09:02.840 | So in this class, we're going to work with reCOGS,
00:09:05.440 | which has done some systematic meaning-preserving transformations to
00:09:09.760 | the original to create a new dataset that we think is fairer.
00:09:13.680 | But it still remains incredibly hard.
00:09:16.980 | Systems can get traction where before they were getting zero,
00:09:20.660 | so we know there's signal.
00:09:22.200 | And we have more confidence that this is testing something about semantics.
00:09:26.160 | And then the punchline remains the same.
00:09:28.400 | This is incredibly hard for our systems, even our best systems.
00:09:32.960 | There needs to be some kind of breakthrough here for us to get
00:09:37.000 | our systems to do well even on these incredibly simple sentences.
00:09:41.720 | So I am eager to see what you all do with this problem.
00:09:45.720 | You're seeing a picture here of the kind of best we could do,
00:09:49.000 | which is a little bit better than what was in the literature previously,
00:09:53.200 | but certainly not a solved task.
00:09:55.840 | Right. So that will culminate in this homework and bake-off, our third one.
00:10:02.280 | From there, the course work opens up into your projects.
00:10:07.760 | We're done with the regular assignments and we go through the rhythm of lit review,
00:10:11.760 | experiment protocol, which is a special document that kind of lays down
00:10:15.680 | the nuts and bolts of what you're going to do for your paper,
00:10:18.400 | and then the final paper itself.
00:10:20.560 | In the spirit of that,
00:10:22.200 | what we do in our course together is think
00:10:25.320 | about topics that will supercharge your own final project papers.
00:10:29.840 | The first topic that comes to mind for me there is better and more diverse benchmarks.
00:10:36.200 | We need measurement instruments to get
00:10:39.360 | reliable estimates of how well our systems are doing,
00:10:42.520 | and that implies having good benchmarks.
00:10:45.160 | In this context, I really like to invoke
00:10:47.280 | this famous quotation from the explorer Jacques Cousteau.
00:10:50.280 | He said, "Water and air,
00:10:52.920 | the two essential fluids on which all life depends."
00:10:56.160 | That's datasets for our field.
00:10:58.520 | You can see here that Cousteau did continue: they "have become global garbage cans."
00:11:03.040 | That might concern us about what's happening with our datasets.
00:11:06.400 | I don't think it's that bad,
00:11:07.960 | but still you could have that in the back of your mind that we need
00:11:11.480 | these datasets we create to be reliable high-quality instruments.
00:11:17.120 | The reason for that is that we ask so much of our datasets.
00:11:21.000 | We use them to optimize models when we train on them.
00:11:24.080 | We use them crucially,
00:11:25.400 | and this is increasingly important to evaluate our models,
00:11:28.520 | our biggest language models that are getting all the headlines.
00:11:31.160 | How well are they actually doing?
00:11:33.240 | We need datasets for that.
00:11:34.920 | We use it to compare models,
00:11:36.720 | to enable new capabilities via training and testing,
00:11:40.160 | to measure progress as a field.
00:11:42.560 | It's our fundamental barometer for this,
00:11:44.960 | and of course for basic scientific inquiry into language and the world.
00:11:49.540 | This is a long and important list,
00:11:51.840 | and it shows you that datasets are really central to what we're doing.
00:11:57.120 | So I'm exhorting you as you can tell to think about datasets,
00:12:01.080 | especially ones that would be powerful as
00:12:02.800 | evaluation tools in the context of this course.
00:12:05.560 | I am genuinely worried about the new dynamic where we are
00:12:10.280 | evaluating these big language models essentially on Twitter,
00:12:14.280 | where people have screenshots of some fun cases that they saw,
00:12:18.080 | and we all know that we're not
00:12:20.220 | seeing a full representative sample of the inputs.
00:12:22.760 | We're seeing the worst and the best,
00:12:24.740 | and it's impossible to piece together a scientific picture from that.
00:12:28.420 | My student, Omar Khattab,
00:12:30.540 | recently observed, I think this is very wise,
00:12:33.100 | that we have moved into this era in which
00:12:35.260 | designing systems might be really easy.
00:12:37.380 | It might be a matter of writing a prompt,
00:12:39.360 | but figuring out whether it was a good system is going to get harder and harder,
00:12:43.460 | and for that we need lots of evaluation datasets.
00:12:48.060 | You could think about this slide that I showed you from before.
00:12:52.460 | We have this benchmark saturation with all of these systems now
00:12:56.100 | increasingly quickly getting above our estimate of human performance,
00:12:59.640 | but I asked you to be cynical about that as a measure of human performance.
00:13:04.180 | Another perspective on this slide could be that our benchmarks are simply too easy,
00:13:09.460 | because it is not like if you interacted with one of these systems,
00:13:13.300 | even the most recent ones,
00:13:14.680 | it would feel superhuman to you.
00:13:17.620 | Partly what we're seeing here is a remnant of the fact that until very recently,
00:13:23.780 | our evaluations had to be essentially machine tasks,
00:13:26.780 | not human tasks,
00:13:28.100 | and we had humans do machine tasks to get a measure of human performance.
00:13:32.620 | Maybe we're moving into a new and more exciting era.
00:13:35.940 | We're going to talk about adversarial testing.
00:13:38.420 | I've been involved with the Dynabench effort.
00:13:40.700 | This is a kind of open-source effort to develop datasets
00:13:44.520 | that are going to be really hard for the best of our models,
00:13:47.740 | and I think that's a wonderful dynamic as well.
00:13:50.860 | That leads into this related topic of us having more meaningful evaluations.
00:13:57.620 | Here's a fundamental thing that you might worry
00:13:59.780 | about throughout artificial intelligence.
00:14:02.580 | All we care about is performance for the system,
00:14:05.700 | some notion of accuracy.
00:14:07.620 | I've put this under the heading of Strathern's law.
00:14:10.140 | When a measure becomes a target,
00:14:11.740 | it ceases to be a good measure.
00:14:13.580 | If we have this consensus that all we care about is accuracy,
00:14:16.980 | we know what will happen.
00:14:18.380 | Everyone in the field will climb on accuracy.
00:14:21.300 | We know from Strathern's law that that will distort the actual rate of
00:14:26.180 | progress by diminishing everything else that could be
00:14:30.620 | important to thinking about these AI systems.
00:14:34.140 | Relatedly, this is a wonderful study from Birhane et al.
00:14:38.340 | I've selected a few of the values encoded in ML research,
00:14:42.260 | which they did via a very extensive literature survey.
00:14:46.060 | Impressionistically, here's the list.
00:14:48.340 | At the top, dominating everything else,
00:14:51.700 | you have an obsession with performance, as I said.
00:14:54.540 | Then way down on the list though,
00:14:56.980 | in second place, you have efficiency and things like explainability,
00:15:01.420 | applicability in the real world, robustness.
00:15:04.060 | As I go farther down on the list here,
00:15:05.940 | the ones that are colored there,
00:15:06.980 | they actually should be in the tiniest of type.
00:15:09.700 | Because if you think about the field's actual values as reflected in the literature,
00:15:14.620 | you find that these things are getting almost no play.
00:15:17.620 | I think things are looking up,
00:15:20.060 | but it's still the case that it's wildly skewed toward performance.
00:15:24.220 | But those things that I have down there in purple and orange,
00:15:27.620 | beneficence, privacy, fairness, and justice,
00:15:31.820 | those are incredibly important things and more and more
00:15:34.460 | important as these systems are being deployed more widely.
00:15:38.740 | So we have to, via our practices and what we hold to be valuable,
00:15:43.820 | elevate these other principles.
00:15:46.180 | You all could start to do that by thinking about
00:15:48.700 | proposing evaluations that would elevate them.
00:15:51.780 | That could be tremendously exciting.
00:15:54.260 | The final point here is that we could also have
00:15:57.740 | a move toward leaderboards that embrace more aspects of this.
00:16:02.180 | Again, to help us move away from the obsession on performance,
00:16:05.700 | we should have leaderboards that score us along many dimensions.
00:16:09.780 | In this context, I've really been inspired by work that
00:16:12.460 | Cowen did on what he calls Dyna-scoring,
00:16:15.620 | which is a proposal for how to
00:16:17.820 | synthesize across a lot of different measurement dimensions.
00:16:21.500 | To give you a quick illustration,
00:16:23.500 | here I have a table where the rows are question answering systems,
00:16:27.740 | and the columns are different things we could measure.
00:16:30.420 | Just a sample of them, performance,
00:16:32.620 | throughput, memory, fairness, robustness,
00:16:36.340 | and we could add other dimensions.
00:16:38.340 | With the current Dyna-scoring that you're seeing here,
00:16:40.980 | where most of the weight is put on performance,
00:16:43.380 | that DeBERTa system is the winner in this leaderboard competition.
00:16:47.900 | But that's standard. But what if we decided that we
00:16:50.540 | cared much more about fairness for these systems?
00:16:53.060 | So we adjusted the Dyna-scoring here to put five on fairness,
00:16:57.380 | keep a lot on performance,
00:16:58.780 | but diminish the other measures there.
00:17:01.140 | Well, now the Electra Large system is in first place.
00:17:04.780 | Which one was the true winner?
00:17:06.740 | I think the answer is that there is no true winner here.
00:17:09.620 | What this shows is that all of our leaderboards are
00:17:13.260 | reflecting some ordering of our preferences,
00:17:16.700 | and when we pick one,
00:17:18.060 | we are instilling a particular set of values on the whole enterprise.
00:17:22.300 | But this is also creating space for us.
00:17:24.860 | This is I think part of Kawin's vision for Dyna-scoring,
00:17:28.380 | that we could design leaderboards that were
00:17:30.380 | tuned to the things that we want to do in the world,
00:17:32.860 | via the weighting and the columns that we chose,
00:17:35.580 | and evaluate systems on that basis. Yeah.
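To make the weighting idea concrete, here is a minimal sketch with made-up metric scores for two hypothetical systems. The real Dynascore uses a more careful, utility-theoretic aggregation, so the simple weighted mean below is only meant to show how the ranking can flip when the weights change.

```python
# Toy weight-sensitive leaderboard (made-up scores; not the actual Dynascore formula).
systems = {
    "model_a": {"performance": 0.92, "throughput": 0.60, "memory": 0.50, "fairness": 0.40, "robustness": 0.70},
    "model_b": {"performance": 0.80, "throughput": 0.55, "memory": 0.50, "fairness": 0.90, "robustness": 0.65},
}

def rank(systems, weights):
    """Score each system as a weighted average of its (already normalized) metrics."""
    total = sum(weights.values())
    scores = {name: sum(weights[m] * vals[m] for m in weights) / total
              for name, vals in systems.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Mostly-performance weighting: model_a comes out on top.
print(rank(systems, {"performance": 5, "throughput": 1, "memory": 1, "fairness": 1, "robustness": 1}))
# Fairness-heavy weighting: model_b now wins.
print(rank(systems, {"performance": 2, "throughput": 0.5, "memory": 0.5, "fairness": 5, "robustness": 1}))
```

The point is exactly the one above: neither ranking is the true one; each weighting encodes a set of values.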
00:17:40.260 | >> What does fairness to you in this field mean?
00:17:42.980 | How do you measure something like that?
00:17:44.580 | >> What is fairness? Yeah. Well,
00:17:45.900 | that's a whole other aspect to this.
00:17:47.540 | So if we're going to start to measure these dimensions,
00:17:49.620 | like we're going to have a column for fairness,
00:17:51.340 | we better be sure that we know what's behind that.
00:17:54.140 | I can tell you there needs to be a lot more work on
00:17:57.660 | our measurement devices, our benchmarks, for assessing fairness.
00:18:01.740 | Because all of those things are incredibly nuanced,
00:18:04.780 | multi-dimensional concepts, and so
00:18:06.940 | the idea would be to bring that in as well.
00:18:09.220 | Yeah. Throughput and memory, maybe those are straightforward,
00:18:12.900 | but fairness is going to be a challenging one.
00:18:15.180 | But that's not to say that it's not incredibly important.
00:18:18.860 | Then finally, to really inspire you,
00:18:25.900 | I do feel like this is the first time I could say this in this course.
00:18:28.860 | I think we're moving into an era in which
00:18:30.860 | our evaluations can be much more meaningful than they ever were before.
00:18:34.620 | Assessment today or yesterday is really one-dimensional,
00:18:38.820 | that's the performance thing I mentioned.
00:18:40.780 | It's largely insensitive to the context.
00:18:43.500 | We always pick F1 or something as
00:18:45.500 | the only thing regardless of what we're trying to accomplish in the world.
00:18:49.100 | The terms are largely set by us researchers.
00:18:52.020 | We say it's F1 and everyone follows suit because we're
00:18:55.580 | supposed to be the experts on this and it's often very
00:18:58.580 | opaque and tailored to machine tasks.
00:19:01.780 | I've already complained about that.
00:19:03.300 | Our estimates of human performance are
00:19:05.420 | very different from what you would think that phrase means.
00:19:08.620 | In this new future that we could start right now,
00:19:11.700 | our assessments could certainly be high dimensional and fluid.
00:19:14.420 | I showed you a glimpse of that with the Dyna scoring.
00:19:16.820 | I think that's incredible.
00:19:18.260 | They could also in turn be highly sensitive to the context that we're in.
00:19:22.180 | If we care about fairness and we care about efficiency,
00:19:25.140 | and we put those above performance,
00:19:27.140 | we're going to get a very different prioritization of
00:19:29.740 | the systems and so forth and so on.
00:19:32.780 | Then in turn, the terms of these evaluations could be set not by us researchers,
00:19:38.340 | who are doing our very abstract thing,
00:19:40.020 | but rather the people who are trying to get value out of these systems,
00:19:43.740 | the people who have to interact with them.
00:19:46.060 | Then the judgments could ultimately be made by the users.
00:19:49.980 | They could decide which system they want to choose
00:19:52.060 | based on their own expressed preferences.
00:19:54.900 | Then in turn, maybe we could have
00:19:57.900 | our evaluations be much more at the level of human tasks.
00:20:02.580 | Right now, for example,
00:20:04.540 | we might insist that some human labelers
00:20:06.580 | choose a particular label for an ambiguous example,
00:20:09.540 | and then we assess how much agreement they have.
00:20:13.640 | Whereas the human thing is to discuss and debate,
00:20:16.940 | to have a dialogue about what the right label is in
00:20:19.700 | the face of ambiguity and context dependence.
00:20:22.540 | Well, now we could have that kind of evaluation, right?
00:20:26.060 | Maybe we evaluate systems on their ability to
00:20:28.740 | adjudicate in a human-like way on what the label should be.
00:20:32.820 | Hard to imagine before,
00:20:35.180 | but now probably something that you could toy around with a little bit with one of
00:20:38.900 | these large language model APIs right now if you wanted.
00:20:42.460 | I think we could really embrace that.
00:20:45.700 | I have a couple more topics, but let me pause there.
00:20:50.620 | Do people have thoughts, questions,
00:20:53.100 | insights about benchmarks and evaluation?
00:20:58.060 | I hope you're seeing that it's a wide open area for final projects. Yeah.
00:21:05.180 | Is there more of a move to like get like specialists in other fields,
00:21:09.220 | like for example, like linguistics or like related things to like help make benchmarks?
00:21:14.660 | What a wonderful question.
00:21:16.460 | You asked, is there a move to have more experts participate in evaluation?
00:21:20.980 | I hope the answer is yes,
00:21:22.420 | but let's make the answer yes.
00:21:23.860 | That would be the point, right?
00:21:25.020 | Because what we want is to provide the world with tools and concepts that would allow
00:21:30.380 | domain experts, people who actually know what's going on in the domain
00:21:34.740 | where we're trying to use this AI technology, to make
00:21:37.340 | these decisions and make adjustments and so forth based on what's working and what isn't.
00:21:41.900 | Yeah, that should be our goal.
00:21:45.380 | Then what we as researchers can do is provide things like what Kawin provided with
00:21:49.380 | Dynascoring, which is the intellectual infrastructure to allow them to do that.
00:21:53.820 | Yeah. Then you all probably have lots of domain expertise that intersects with what we're doing,
00:22:01.300 | but maybe comes from other fields.
00:22:04.060 | You could participate as an NLU researcher and as
00:22:08.220 | a domain expert to do a paper that embraces both aspects of this.
00:22:13.100 | Maybe you propose a kind of metric that you think really works well for
00:22:16.700 | your field of economics or sociology or whatever you're studying, right?
00:22:21.820 | Yeah, health, medicine, all these things, incredibly important.
00:22:25.740 | So another hand go up. Yeah.
00:22:28.620 | I think one of the challenges we're going to face is
00:22:31.020 | this really expensive to collect human or more sophisticated labels.
00:22:35.140 | As an example, there's a paper that came out recently,
00:22:37.980 | Med-PaLM, where they trained, or actually really just tuned, an LLM to respond to
00:22:46.460 | medical questions from the USMLE and other medicine-related exams.
00:22:53.740 | They also had a section for long-form answers.
00:22:57.060 | The short-form answers, it's a multiple choice, they can figure it out.
00:23:00.340 | The long-form answers, they actually had doctors evaluate them.
00:23:05.100 | That's really expensive. They could only collect so many labels.
00:23:07.900 | Even with a large staff of doctors.
00:23:10.100 | So I think there's a balance: calculating
00:23:13.700 | throughput is super easy, it's just counting.
00:23:15.620 | But evaluating how valuable a search result is,
00:23:19.020 | that requires a human, and that's a little more expensive.
00:23:21.180 | I'm curious how we can balance the cost.
00:23:23.940 | Yeah. The issue of cost is going to be unavoidable for us.
00:23:29.260 | I think we should confront it as a group.
00:23:31.300 | This research has just gotten more expensive and that's
00:23:33.780 | obviously distorting who can participate and what we value.
00:23:37.180 | It's another thing I could discuss under this rubric.
00:23:39.700 | For your particular question though,
00:23:42.540 | I remain optimistic because I think we are in an era now in which you could do
00:23:47.380 | a meaningful evaluation of a system with no training data and rather
00:23:51.900 | just a few dozen let's say 100 examples for assessment.
00:23:57.820 | If you're careful about how you use it,
00:23:59.540 | that is if you don't develop your system on it and so forth.
00:24:03.300 | But even if you say, "Okay,
00:24:04.660 | I'm going to have a 100 for development,
00:24:06.140 | a 100 that I keep for a final evaluation to
00:24:08.940 | really get a sense for how my system performs on new data."
00:24:12.260 | That's only 200 examples and I feel like that's manageable,
00:24:18.100 | even if we need experts.
00:24:20.860 | The point would be that that might be money well spent.
00:24:23.900 | It might be that if we can get some experts to provide the 200 cases,
00:24:28.180 | we have a really reliable measurement tool.
00:24:31.820 | I could never have said this 10 years ago because 10 years ago,
00:24:35.740 | the norm was to have 50,000 training instances and 5,000 test instances,
00:24:42.220 | and now your cost concern really kicks in.
00:24:45.300 | But for the present moment,
00:24:47.180 | I feel like a few meaningful cases could be worth a lot.
00:24:50.980 | You all could construct those datasets.
00:24:53.220 | Again, before I used to give the advice,
00:24:55.020 | don't create your own dataset in this class,
00:24:56.900 | you'll run out of time.
00:24:58.220 | But now I can give the advice,
00:25:00.020 | no, if you have some domain expertise in
00:25:03.100 | the life sciences or something and you want a dataset,
00:25:06.060 | create one to use for assessment.
00:25:09.060 | It'll shape your system design,
00:25:10.620 | but that could be healthy as well.
00:25:12.940 | Another big theme, explainability.
00:25:23.180 | This also relates to our increased impact.
00:25:26.300 | If we're going to deploy these models out in the world,
00:25:29.380 | it is really important that we understand them.
00:25:32.060 | Right now, we do a lot of behavioral testing.
00:25:35.260 | That is, we come up with these test cases and we see how well the model does.
00:25:40.180 | But the problem, which is a deep problem of scientific induction,
00:25:44.220 | is that you can never come up with enough cases.
00:25:46.820 | The world is a diverse and complex place,
00:25:49.580 | and no matter how many things you dreamed up when you were doing the research,
00:25:53.140 | if you deploy your system,
00:25:54.580 | it will encounter things that you never anticipated.
00:25:58.420 | If all you've done is behavioral testing,
00:26:01.100 | you might feel very nervous about this because you might have
00:26:03.820 | essentially no idea what it's going to do on new cases.
00:26:07.580 | The mission of explainability research should be to go one layer
00:26:11.660 | deeper and understand what is happening inside
00:26:14.620 | these models so that we have a sense for how they'll generalize to new cases.
00:26:19.100 | It's a very challenging thing because we're thinking about
00:26:21.980 | these enormous and incredibly opaque models.
00:26:26.260 | You can even find people saying in the literature that they're
00:26:29.060 | skeptical that we can ever understand what's happening with these systems,
00:26:32.420 | but I am optimistic.
00:26:34.000 | They are closed, deterministic systems.
00:26:37.220 | They may be complex,
00:26:38.840 | but we're smart.
00:26:40.080 | We can figure out what they have learned.
00:26:42.440 | I really have confidence in this.
00:26:44.220 | The importance of this is really that we have these broader societal goals.
00:26:48.460 | We want systems that are reliable,
00:26:50.740 | and safe, and trustworthy,
00:26:53.020 | and we want to know where we can use them,
00:26:55.040 | and we want them to be free from bias.
00:26:56.860 | It seems to me that all of these questions depend on us
00:27:01.300 | having some true analytic guarantees about model behaviors.
00:27:05.640 | It seems very hard for me to say,
00:27:08.300 | "Trust me, my model is not biased along some dimension,"
00:27:11.820 | if I don't even have any idea how it works.
00:27:14.600 | The best I could say is that it wasn't biased in some evaluations that I ran,
00:27:18.620 | but I just emphasize for you that that's very different from being
00:27:22.540 | evaluated by the world where a lot of things could happen that you didn't anticipate.
00:27:27.700 | We'll talk about a lot of different explanation methods.
00:27:31.860 | I think that these methods should be human interpretable.
00:27:34.920 | That is, we don't want low-level mathematical explanations of how the models work.
00:27:38.780 | We want this expressed in human-level concepts so that we can reason about these systems.
00:27:44.980 | We also want them to be faithful to the underlying model.
00:27:48.660 | We don't want to fabricate human interpretable but inaccurate explanations of the model.
00:27:53.780 | We want them to be true to the underlying systems.
00:27:57.080 | These are two very difficult standards to meet together.
00:28:01.060 | I can make them human interpretable if I offer you no guarantees of faithfulness,
00:28:06.100 | but then I'm just tricking myself and you.
00:28:09.020 | I can make them faithful by making them very technical and low-level.
00:28:12.660 | We could just talk about all the matrix multiplications we want,
00:28:15.740 | but that's not going to provide a human-level insight into how the models are working.
00:28:20.640 | So together though, we need to get methods that are good for both of these, right?
00:28:24.900 | Concept-level understanding of the causal dynamics of these systems.
00:28:30.100 | We'll talk about a lot of different explanation methods.
00:28:33.300 | I'll just do this quickly.
00:28:34.500 | Train tests, that is the behavioral thing,
00:28:36.700 | remains very important for the field.
00:28:39.020 | We'll talk about probing,
00:28:40.660 | which was an early and very influential and very ambitious attempt
00:28:44.380 | to understand the hidden representations of our models.
00:28:47.940 | We'll talk about attribution methods.
00:28:50.220 | These are ways to assign importance to different parts of the representations of these models,
00:28:55.700 | both input and output,
00:28:56.840 | and also their internal representations.
00:28:59.860 | Then we're going to talk about methods that depend on
00:29:03.220 | active manipulations of model internal states.
00:29:06.540 | You'll be able to tell that I strongly favor
00:29:09.640 | the active manipulation approach because I think that that's
00:29:12.940 | the kind of approach that can give us causal insights,
00:29:16.140 | and also richly characterize what the models are doing,
00:29:19.340 | and that's more or less the two desiderata that I just mentioned for these methods.
00:29:24.200 | But there's value to all of these things,
00:29:26.340 | and we'll talk about all of them,
00:29:27.660 | and you'll get hands-on with all of them,
00:29:29.340 | and all of them can be wonderful for your analysis sections of your final papers.
00:29:35.340 | We might even talk about interchange intervention training,
00:29:39.220 | which is when you use explainability methods to actually
00:29:41.940 | push the models to become better, more systematic,
00:29:46.020 | more reliable, maybe less biased along dimensions that you care about.
00:29:50.580 | That's my review of the core things.
00:29:56.900 | Questions or comments?
00:29:58.380 | I have a few more kind of more low-level things about the course to do now. Yeah.
00:30:03.420 | I know we're going to get into all of the explanation methods in a lot of detail later on,
00:30:08.180 | but can you give a quick example just so that we have
00:30:10.620 | any imagination of what they are?
00:30:13.460 | Probing is training supervised classifiers on internal representations.
00:30:19.540 | This was just the cool thing to say, "Hey,
00:30:21.500 | I'll just look at layer three,
00:30:23.060 | column four of my BERT representation.
00:30:25.460 | Does it encode information about animacy or part of speech?"
00:30:30.020 | The answer seems to be yes.
00:30:33.700 | I think that was really eye-opening that even if your task was sentiment analysis,
00:30:39.300 | you might have learned latent structure about animacy.
00:30:42.980 | That's getting closer to the human level concept stuff.
00:30:46.460 | Problem with probing is that you have no guarantee that
00:30:49.180 | that information about animacy here has any causal impact on the model behavior.
00:30:53.860 | It could be just kind of something that the model learned by the by.
00:30:57.860 | Attribution methods have the kind of reverse problem.
00:31:01.060 | They can give you some causal guarantees that this neuron
00:31:04.420 | plays this particular role in the input-output behavior,
00:31:08.260 | but it's usually just a kind of scalar value.
00:31:11.140 | It's like 0.3 and you say, "Well,
00:31:12.780 | what does the 0.3 mean?"
00:31:13.820 | And you say, "It means that it's that much important."
00:31:17.020 | But nothing like, "Oh, is it animent?"
00:31:19.060 | Or none of those human level things.
00:31:20.940 | And then I think the active manipulation thing,
00:31:23.220 | which is like doing lots of brain surgeries on your model,
00:31:26.540 | can provide the benefits of both probing and attribution.
00:31:30.820 | Causal insights, but also a deep understanding of what the representations are.
00:31:37.860 | And there's a whole family of those.
00:31:40.500 | It's a very exciting part of the literature. Yeah.
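As a rough illustration of the probing idea just described, the sketch below freezes a pretrained encoder, extracts a hidden layer, and fits a small supervised classifier on those vectors. The layer choice, the tiny word list, and the animacy labels are all placeholder assumptions for illustration, not a claim about what BERT actually encodes.

```python
# Minimal probing sketch: a logistic-regression probe on frozen BERT hidden states.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

words = ["dog", "teacher", "rock", "idea"]   # toy probe dataset
animacy = [1, 1, 0, 0]                       # hypothetical gold labels

reps = []
with torch.no_grad():
    for w in words:
        enc = tokenizer(w, return_tensors="pt")
        out = model(**enc)
        # Layer-3 hidden state for the first wordpiece after [CLS].
        reps.append(out.hidden_states[3][0, 1].numpy())

probe = LogisticRegression(max_iter=1000).fit(reps, animacy)
print("Probe training accuracy:", probe.score(reps, animacy))
# High probe accuracy shows the information is present in the representation,
# not that the model causally uses it, which is the limitation noted above.
```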
00:31:43.620 | I have a question going back to COGS.
00:31:46.060 | So I guess, why would we want to use the COGS dataset if we're testing for generalization?
00:31:51.180 | Like, why can't we just prompt a language model with a word that it's never seen before,
00:31:55.380 | and try to induce the format: if you see it in the subject position,
00:31:59.540 | get it to produce it in the object position, and see how well it does that?
00:32:03.020 | Oh, yeah. No. So for COGS,
00:32:05.340 | for your original system,
00:32:06.980 | it could be that you try to prompt a language model.
00:32:09.860 | Zen did a bunch of that as part of the research.
00:32:13.100 | It was only okay.
00:32:14.860 | But maybe there's a version of that where you
00:32:17.460 | prompt in the right way with the right kind of instructions,
00:32:20.260 | and then it does solve COGS.
00:32:21.780 | That would be wonderful because that would suggest to me that those models,
00:32:26.340 | whatever model you prompted,
00:32:27.780 | has internal representations that are systematic enough to have kind of
00:32:32.060 | a notion of subject and a notion of object and
00:32:34.860 | verb and all of that linguistic stuff,
00:32:37.340 | and that would be very exciting.
00:32:39.220 | Yeah. The cool thing about COGS is that I think it's a pretty
00:32:42.140 | reliable measurement device for making such claims. Yeah.
00:32:48.740 | How transferable is this discussion to languages other than English?
00:32:54.580 | Like, I wonder if there- if we should be concerned about
00:32:57.620 | the very tight coupling between the properties of English as a language,
00:33:02.660 | and all our advancement in NLP?
00:33:06.660 | Well, I mean, I hope that a lot of you do projects on multilingual NLP,
00:33:12.860 | low resource settings, and so forth.
00:33:15.700 | I think in a way,
00:33:17.780 | we live in a golden age for that research as well.
00:33:20.740 | There's more research on more languages than there were 10 years ago,
00:33:25.220 | and that's certainly a positive development.
00:33:27.860 | The downside is that it's all done with multilingual representations,
00:33:32.020 | multilingual BERT, and so forth,
00:33:33.780 | and they tend to do much better on English tasks than every other task.
00:33:37.620 | So that obviously feels like suboptimal.
00:33:41.180 | But again, that's the same story of like
00:33:44.260 | sudden progress with a lot of
00:33:45.900 | mixed feelings that I have about a lot of these topics.
00:33:49.940 | In the interest of time, let's press on a little bit.
00:33:54.980 | I think I just wanted to skip to the course mechanics.
00:33:58.420 | This is at the website, but there it is.
00:34:00.180 | That's the breakdown of required pieces.
00:34:03.140 | You can see that it has a kind of strong emphasis
00:34:06.100 | toward the three parts that are related to the final project.
00:34:09.460 | But the homeworks are also really important and the quizzes less so.
00:34:13.380 | But I think they're important enough that you'll take them seriously.
00:34:17.900 | It's fully asynchronous.
00:34:20.380 | It's wonderful to see so many of you here,
00:34:22.660 | and I am eager to interact with you here if possible,
00:34:27.060 | but also on Zoom for office hours and stuff.
00:34:29.500 | Please attend office hours if you just want to chat.
00:34:32.020 | One of my favorite games to play in office hours is a group comes with
00:34:35.940 | three project ideas and I rank them from
00:34:38.940 | least to most or most to least viable for the course.
00:34:42.660 | It's a fun game for me,
00:34:44.060 | and I think it always illuminates some things about the problems.
00:34:48.620 | Then we have continuous evaluation.
00:34:51.220 | So you have the three assignments,
00:34:52.580 | the quizzes, and then the project work.
00:34:55.060 | There's no final exam.
00:34:56.340 | Just we want you to be focused on the final project at that point.
00:35:00.660 | I think I'll leave this aside.
00:35:02.580 | We can talk about the grading of the original systems a bit later.
00:35:06.340 | Then you have the project work,
00:35:08.780 | some links here, exceptional final projects, and some guidance.
00:35:12.860 | These are the two documents I mentioned before.
00:35:15.900 | I'll just say again that this is the most important part of
00:35:19.180 | the course to me and the thing that's special.
00:35:21.140 | I'll say again also,
00:35:22.380 | we have this incredibly accomplished teaching team this year,
00:35:25.540 | diverse interests, and they all have done incredible research on their own.
00:35:30.420 | I've learned a ton from them and from their work,
00:35:33.140 | and I encourage you to do the same.
00:35:35.220 | So seek them out in office hours and,
00:35:37.740 | um, and you know,
00:35:39.180 | take advantage of their mentorship for the work you do.
00:35:42.620 | Then here are some crucial course links,
00:35:45.260 | kind of covered that before.
00:35:46.940 | The quizzes I think I've covered as well,
00:35:50.980 | and these policies are all at the website.
00:35:54.700 | Um, right.
00:35:57.660 | And so the setup,
00:35:58.860 | do that if you haven't already.
00:36:00.780 | Make sure you're in the discussion forum.
00:36:02.540 | We want you to be connected with the kind of ongoing discourse for the class.
00:36:05.820 | Do quiz zero as soon as you can,
00:36:07.860 | so that you know your rights and responsibilities.
00:36:10.460 | And then I think right now we should check out the homework,
00:36:14.060 | the sentiment homework to make sure you're
00:36:16.740 | oriented around that before we dive into transformers.
00:36:20.580 | Questions about that stuff?
00:36:23.780 | It's all at the website,
00:36:25.260 | but I've kind of evoked it for you in case it raised any issues.
00:36:29.540 | All right. Let's look briefly at the first homework.
00:36:38.900 | I feel like we should kick it off somehow,
00:36:41.140 | and it is maybe an unusual mode for homeworks.
00:36:44.500 | So feel free to ask questions.
00:36:46.340 | This is kind of cool. So this link will take you to the GitHub,
00:36:50.340 | uh, which I think you're probably all set up with on your home computers.
00:36:54.140 | But you might want to work with this in the Cloud.
00:36:57.460 | And this works well.
00:36:58.980 | So you could just click "Open in Colab."
00:37:02.340 | And I think I've done a pretty good job of getting you so that it will set
00:37:08.900 | itself up with the installs that you need and the course repo and so forth.
00:37:14.020 | I would actually be curious to know whether there are bumps along
00:37:16.580 | the road to getting this to just work out of the box in Colab.
00:37:19.580 | I do encourage this because if you're ambitious,
00:37:22.140 | you'll probably want GPUs,
00:37:23.980 | and this is a good inexpensive way to get them.
00:37:27.140 | It's also a pretty nice environment to do the work in. Zoom in here.
00:37:32.540 | Along the left, you can see the outline.
00:37:35.540 | And that's actually kind of a good place to start.
00:37:37.620 | So we're doing multi-domain sentiment.
00:37:40.240 | And what I mean by that is,
00:37:42.500 | you're encouraged to work with three datasets,
00:37:45.820 | DynaSent Round 1,
00:37:47.700 | DynaSent Round 2, and the Stanford Sentiment Treebank.
00:37:51.140 | These are all sentiment tasks,
00:37:53.460 | and they all have ternary labels,
00:37:55.540 | positive, negative, and neutral.
00:37:57.500 | But I'm not guaranteeing you that those labels are aligned in the semantic sense.
00:38:02.860 | In fact, I think that the SST labels are a bit different from the DynaSent ones.
00:38:09.220 | But certainly, the underlying data are different because DynaSent is
00:38:12.980 | like product reviews and Stanford Sentiment Treebank is movie reviews.
00:38:18.500 | But there are further things.
00:38:19.940 | So DynaSent Round 1 is hard examples
00:38:22.920 | that we harvested from the world,
00:38:25.260 | from the Yelp academic dataset.
00:38:27.460 | Whereas DynaSent Round 2 is actually annotators working on the Dynabench platform,
00:38:34.260 | which I mentioned just a minute ago,
00:38:35.980 | trying to fool a really good sentiment model.
00:38:39.900 | So the DynaSent Round 2 examples are hard.
00:38:43.060 | They involve like non-literal language use and sarcasm,
00:38:46.620 | and other things that we know challenge current day models.
00:38:50.300 | So you have these three datasets.
00:38:52.780 | Then there are really two main questions.
00:38:56.100 | For the first question,
00:38:57.620 | I'm just pushing you to develop a simple linear model with sparse feature representations.
00:39:04.580 | This is a kind of more traditional mode background.
00:39:08.060 | If you need a refresher on this,
00:39:09.700 | this is a chance to get it.
00:39:10.860 | If you feel stuck on this question,
00:39:12.860 | I think we should talk about how to get you up to speed for the course.
00:39:17.660 | But for a lot of you,
00:39:18.960 | especially if you've been immersed in NLP,
00:39:20.860 | this should be a pretty straightforward question.
00:39:23.300 | It leads to a pretty good system.
00:39:25.060 | So you write a feature function,
00:39:26.820 | a function for training models,
00:39:29.380 | and a function for assessing models.
00:39:31.900 | For each one of these questions,
00:39:34.120 | what you do is complete a function that I started.
00:39:38.900 | There is not a lot of coding.
00:39:40.980 | This is mainly about starting to build your own original system.
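For orientation, here is a rough sketch of the shape of that Question 1 pipeline: a sparse feature function, a training function, and an assessment function. The function names and toy data are illustrative, not the notebook's actual interface.

```python
# Illustrative sparse-features + linear-classifier pipeline (not the notebook's exact functions).
from collections import Counter
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

def unigram_features(text):
    """Map a raw example to a sparse bag-of-words feature dict."""
    return Counter(text.lower().split())

def train_model(texts, labels):
    vec = DictVectorizer()
    X = vec.fit_transform([unigram_features(t) for t in texts])
    return vec, LogisticRegression(max_iter=1000).fit(X, labels)

def assess_model(vec, clf, texts, labels):
    X = vec.transform([unigram_features(t) for t in texts])
    print(classification_report(labels, clf.predict(X), zero_division=0))

vec, clf = train_model(
    ["great phone", "terrible battery", "it works"],
    ["positive", "negative", "neutral"])
assess_model(vec, clf, ["great phone overall"], ["positive"])
```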
00:39:45.940 | For every single one of these questions,
00:39:48.540 | there's a test that you can run.
00:39:50.900 | I like unit tests a lot.
00:39:53.420 | I think we should all write more unit tests.
00:39:55.540 | The advantage of the test for me is that if there was
00:39:57.940 | any unclarity about my own instructions,
00:40:00.420 | the test probably clears them up.
00:40:02.580 | You also get a guarantee that if your code passes the test,
00:40:05.940 | you're in good shape.
00:40:07.220 | More or less the same tests run on Gradescope.
00:40:10.500 | So when you upload the notebook,
00:40:12.300 | if you got a clean bill of health at home,
00:40:14.820 | you'll probably do fine on Gradescope.
00:40:17.900 | If you don't, it might be because
00:40:19.380 | the Gradescope autograder has a bug.
00:40:21.580 | Let me know about that.
00:40:23.100 | Those things always feel like they're just barely functioning.
00:40:26.780 | But the idea is that this is really not about me evaluating you.
00:40:31.900 | This is about you exercising
00:40:34.900 | the relevant muscles and building up
00:40:36.980 | concepts that will let you develop your own systems.
00:40:40.620 | I'm just trying to be a trusted guide for you on that.
00:40:44.340 | So you do some coding and you have these three questions here.
00:40:47.820 | The result of doing those three questions is that you
00:40:50.780 | have something that could be the basis for your original system.
00:40:54.180 | It'd be pretty cool by the way if some people
00:40:56.260 | competed using just sparse linear models
00:40:58.820 | to show the transformers that there's still competition out there.
00:41:03.500 | So that's the first question.
00:41:05.620 | Then the second one is the same way,
00:41:08.340 | except now we're focused on transformer fine-tuning,
00:41:11.180 | which is our main focus for this unit.
00:41:14.380 | I have a question here that pushes you to
00:41:17.940 | understand how these models tokenize data.
00:41:20.860 | It's really different from the old mode.
00:41:23.580 | You'll learn some hugging face code and you'll also learn some concepts.
00:41:27.860 | Then I have a question that pushes you to
00:41:30.220 | understand what the representations are like.
00:41:32.460 | We're going to talk about them abstractly.
00:41:34.420 | Here you'll be hands-on with them.
00:41:36.500 | Then finally, you're going to finish up
00:41:39.020 | writing a PyTorch module where you fine-tune BERT.
00:41:43.780 | That is step one to a really incredible system I'm sure.
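As a generic sketch of what such a module can look like, the code below pairs a pretrained encoder with a small classification head using Hugging Face transformers. The course repo supplies its own interface, so this is not that code, and the tiny checkpoint named here is just one small BERT variant chosen for illustration.

```python
# Generic BERT fine-tuning module: pretrained encoder + classification head.
# Not the course's actual interface; "prajjwal1/bert-tiny" is just an example checkpoint.
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertClassifierModule(nn.Module):
    def __init__(self, n_classes=3, weights_name="prajjwal1/bert-tiny"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(weights_name)
        hidden = self.bert.config.hidden_size
        self.classifier = nn.Sequential(
            nn.Linear(hidden, hidden), nn.Tanh(), nn.Linear(hidden, n_classes))

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_rep = out.last_hidden_state[:, 0]   # [CLS] token representation
        return self.classifier(cls_rep)         # logits, one row per example

tokenizer = AutoTokenizer.from_pretrained("prajjwal1/bert-tiny")
batch = tokenizer(["great phone", "terrible battery"], padding=True, return_tensors="pt")
logits = BertClassifierModule()(batch["input_ids"], batch["attention_mask"])
print(logits.shape)   # torch.Size([2, 3])
```

Training would then be the usual cross-entropy loop over batches, with the encoder parameters updated along with the head.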
00:41:50.660 | I've actually written the interface for you.
00:41:54.260 | So that given the course code and everything else,
00:41:57.180 | the interfaces for these things are pretty straightforward to write.
00:41:59.940 | All you have to do is write the module,
00:42:01.980 | and for completing the homework questions,
00:42:03.820 | you don't actually need heavy-duty computing at
00:42:05.820 | all because you don't do anything heavy-duty.
00:42:08.740 | But when you get to the original system,
00:42:11.900 | that might be where you want to train a big monster model and figure out how to
00:42:16.140 | work with the computational resources that you have to get that done.
00:42:20.020 | This notebook is using TinyBERT,
00:42:22.580 | which is small,
00:42:24.540 | but you still need a GPU to do the work.
00:42:26.780 | So you'll still want to be on Colab or something like that.
00:42:30.140 | Then I don't know how ambitious you're going to get for your original system.
00:42:34.460 | You can tell that I'm trying to lead you toward using
00:42:37.340 | question one and question two for your original system,
00:42:39.700 | but it's not required.
00:42:41.060 | If you want to do something where you just prompt GPT-4,
00:42:44.620 | maybe you'll win, I don't know.
00:42:46.900 | I'm up for anything.
00:42:49.060 | It does need to be an original system,
00:42:51.260 | so you can't just download somebody else's code.
00:42:53.660 | If all you did was a very boring prompt structure,
00:42:56.780 | you wouldn't get a high grade on your original system.
00:42:59.460 | We're trying to encourage you to think creatively and explore.
00:43:03.260 | Then the final thing is you just enter this in a bake-off,
00:43:06.380 | and really that just means grabbing an unlabeled dataset
00:43:09.620 | from the web and adding a column with predictions in it.
00:43:13.260 | Then you upload that when you submit your work to Gradescope.
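Concretely, the submission step amounts to something like the sketch below; the file and column names are placeholders, so follow whatever the assignment actually specifies.

```python
# Sketch of building a bake-off submission file (hypothetical file and column names).
import pandas as pd

def predict(text):
    """Placeholder for your trained system's prediction function."""
    return "neutral"

bakeoff = pd.read_csv("bakeoff-unlabeled.csv")                      # hypothetical filename
bakeoff["prediction"] = [predict(t) for t in bakeoff["sentence"]]   # hypothetical column name
bakeoff.to_csv("bakeoff-with-predictions.csv", index=False)
```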
00:43:17.300 | Then when everyone's submissions are in,
00:43:20.020 | we'll reveal the scores and there'll be some winners,
00:43:23.060 | and we'll give out prizes.
00:43:24.860 | I'm optimistic that we're going to have EC2 codes as prizes.
00:43:29.140 | That's always been fun because if you win a bake-off,
00:43:31.100 | you get a little bit more compute resources for your next big thing.
00:43:35.340 | They don't want to hand out these codes anymore like they used to,
00:43:41.060 | because Cloud Compute is so important now,
00:43:42.940 | but I think I have an arrangement in place to get some.
00:43:46.580 | By the way, we give out prizes for
00:43:48.980 | the best systems and the most creative systems,
00:43:51.780 | and we have even given out prizes for the lowest scoring system.
00:43:55.660 | Because if that was a cool thing that should have worked and didn't,
00:43:58.660 | I feel like you did a service to all of us by going down that route,
00:44:02.860 | and that deserves a prize.
00:44:04.900 | That's us trying to have a multi-dimensional leaderboard here,
00:44:08.700 | even as we rank all of you according to the performance of your systems.
00:44:13.260 | That's my overview. Questions or comments or anything?
00:44:20.580 | All right.
00:44:31.820 | I propose then that we go to Transformers.
00:44:37.580 | So download the handout.
00:44:41.300 | By the way, these should be really good.
00:44:43.300 | So these slides, you'll get a version with
00:44:45.100 | fewer overlays to make it more browsable.
00:44:47.260 | All of these things up here are links.
00:44:50.060 | So if you click on these bubbles,
00:44:56.780 | you can go directly to that part,
00:44:58.540 | and you can see that this is a kind of outline of this unit.
00:45:01.500 | Then there's also a good table of contents with good labels.
00:45:04.860 | So if you need to find things in what I admit is a very large deck,
00:45:09.100 | that should make it easier to do that.
00:45:10.940 | You can also track our progress as we move through these things.
00:45:15.100 | So we dive in.
00:45:23.980 | Guiding ideas.
00:45:25.940 | What is happening with these contextual representations?
00:45:30.100 | Okay. This one slide here used to take two weeks for this course.
00:45:35.060 | And I've been trying to convey this.
00:45:37.300 | We have stopped doing that.
00:45:38.620 | The background materials are still at the website.
00:45:40.900 | It was also the first two weeks of CS224n.
00:45:44.660 | We did them before they did them in CS224n,
00:45:49.180 | back before natural language understanding was all the rage.
00:45:52.500 | But they get there first now,
00:45:54.060 | and it is a more basic course.
00:45:55.460 | I'm saying they do GloVe,
00:45:56.740 | Word2Vec, and we're going to dive right into transformers.
00:46:00.340 | Here is my one slide summary of this.
00:46:03.540 | Back in the old, old days,
00:46:06.140 | the dawning of the statistical revolution in NLP,
00:46:09.940 | the way we represented examples,
00:46:12.220 | let's say words for this case,
00:46:13.840 | was with feature-based sparse representations.
00:46:17.020 | And what I mean by that is that if you
00:46:18.860 | wanted to represent a word of a language,
00:46:21.060 | you might write a feature function that says, okay,
00:46:23.740 | yes or no on it being referring to an animate thing,
00:46:27.140 | yes or no on it ending in the characters ing,
00:46:31.740 | yes or no on it mostly being used as a verb,
00:46:35.820 | and so forth and so on.
00:46:36.980 | And so all these little feature functions would end up giving
00:46:39.660 | you really long vectors of essentially ones and zeros that were kind of
00:46:45.000 | hand-designed and that would give you a perspective on
00:46:48.500 | a bunch of the dimensions of the word you were trying to represent.
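Here is a toy version of that kind of hand-designed feature function, just to fix the idea; the particular features and the way the lexical information is passed in are made up for illustration.

```python
# Toy hand-designed, sparse feature function of the kind described above.
def word_features(word, animate=False, mostly_verb=False):
    """Map a word to a dict of mostly binary features (a sparse vector in disguise)."""
    return {
        "is_animate": int(animate),                  # yes/no: refers to an animate thing
        "ends_in_ing": int(word.endswith("ing")),    # yes/no: ends in the characters "ing"
        "mostly_used_as_verb": int(mostly_verb),     # yes/no: mostly used as a verb
        f"word={word.lower()}": 1,                   # identity feature: one dimension per word
    }

print(word_features("running", mostly_verb=True))
print(word_features("teacher", animate=True))
```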
00:46:53.240 | That lasted for a while,
00:46:55.400 | and then it started to get replaced, pre-Word2Vec and GloVe,
00:47:01.240 | with methods like pointwise mutual information (PMI) or TF-IDF.
00:47:06.720 | These methods had long been recognized as
00:47:09.940 | fundamental in the field of information retrieval,
00:47:13.140 | especially TF-IDF as a main representation technique
00:47:16.700 | for finding relevant documents for queries.
00:47:20.300 | Took a while for NLP people to realize that they would be valuable.
00:47:25.140 | But what you start to see here is that instead of writing all those feature functions,
00:47:30.340 | I'll just keep track of co-occurrence patterns in large collections of text.
00:47:35.820 | And PMI and TF-IDF do this essentially just by counting,
00:47:40.560 | and then re-weighting some of the counts.
00:47:42.560 | But really it is the rawest form of distributional representation.
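Here is a small sketch of the PMI reweighting idea, assuming NumPy and a made-up word-by-context count matrix; real systems build the counts from large corpora.

```python
# PMI reweighting over a tiny, made-up word-by-context count matrix.
# The math: PMI(w, c) = log P(w, c) / (P(w) P(c)).
import numpy as np

counts = np.array([
    [10.0, 2.0, 0.0],   # word 1
    [ 3.0, 8.0, 1.0],   # word 2
    [ 0.0, 1.0, 9.0],   # word 3
])

total = counts.sum()
p_wc = counts / total                   # joint probabilities
p_w = p_wc.sum(axis=1, keepdims=True)   # word marginals
p_c = p_wc.sum(axis=0, keepdims=True)   # context marginals

with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log(p_wc / (p_w * p_c))
ppmi = np.maximum(pmi, 0.0)             # positive PMI: clip negatives and zero counts
print(ppmi.round(2))
```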
00:47:47.800 | That kind of got replaced,
00:47:50.520 | or this is sort of simultaneous in an interesting way,
00:47:52.800 | but you have, paired with PMI and TF-IDF, methods like principal components analysis,
00:47:59.000 | SVD, which in this setting is sometimes called latent semantic analysis,
00:48:03.160 | and LDA, latent Dirichlet allocation,
00:48:05.960 | a topic modeling technique.
00:48:07.800 | So a whole family of these things that are essentially taking
00:48:10.640 | count data and giving you reduced dimensional versions of that count data.
00:48:16.160 | And the power of doing that is really that you can
00:48:19.480 | capture higher order notions of co-occurrence.
00:48:22.360 | Not just what I as a word co-occurred with,
00:48:25.520 | but also the sense in which I might co-occur
00:48:28.640 | with words that co-occur with the things I co-occur with.
00:48:32.600 | These are your second-order neighbors, and you can imagine
00:48:35.360 | traveling out into this representational neighborhood here.
00:48:39.200 | And that turns out to be very powerful because a lot of
00:48:42.200 | semantic affinities come not from just being neighbors with something,
00:48:45.640 | but rather from that whole network of things co-occurring with each other.
00:48:50.520 | And what these methods do is take all that count data and compress it in a way that
00:48:54.960 | loses some information but also captures those notions of similarity.
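A sketch of the LSA-style reduction, assuming NumPy: truncated SVD over a (reweighted) count matrix yields dense vectors that capture those higher-order co-occurrence patterns. The input matrix here is random filler standing in for real count data.

```python
# Truncated SVD over a (reweighted) count matrix gives dense,
# lower-dimensional word vectors.
import numpy as np

rng = np.random.default_rng(0)
ppmi = rng.random((6, 5))          # stand-in for a reweighted count matrix

U, s, Vt = np.linalg.svd(ppmi, full_matrices=False)
k = 2                              # keep the top-k dimensions
word_vecs = U[:, :k] * s[:k]       # reduced word representations

print(word_vecs.shape)             # (6, 2): one dense vector per word
```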
00:49:00.440 | And then the final step, which might actually be the
00:49:04.520 | last step in this literature, were learned dimensionality reduction methods:
00:49:09.880 | autoencoders, Word2Vec, and GloVe.
00:49:12.800 | And this is where you might start with some count data,
00:49:16.400 | but you have some machine learning algorithm that learns how to
00:49:20.360 | compute dense learned representations from that count data.
00:49:26.600 | Um, so kind of like step three infused with more of what we know of as machine learning now.
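As a hedged illustration of that learned step, here is what training Word2Vec looks like with the gensim library (a tooling assumption, not something the course requires); the tiny corpus is only there to make the code self-contained.

```python
# Learning dense word vectors directly from text with Word2Vec (skip-gram).
from gensim.models import Word2Vec

corpus = [
    ["the", "vase", "broke"],
    ["the", "news", "broke"],
    ["sandy", "broke", "the", "record"],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1)
print(model.wv["broke"].shape)                 # one dense 50-d vector per word type
print(model.wv.most_similar("broke", topn=2))  # nearest neighbors in the learned space
```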
00:49:34.120 | And I say it might be the end because I think now,
00:49:38.280 | for anything that you would do with this mode,
00:49:41.160 | you would probably just use contextual representations.
00:49:44.040 | So this is the full story perhaps.
00:49:47.280 | And then here's the review if you want, right?
00:49:51.200 | And I think it is important to understand this both the history but also
00:49:54.560 | the technical details to really deeply understand what I'm about to dive into.
00:49:58.720 | So you might want to circle back if that was too fast. Yeah.
00:50:03.240 | Is there any option to just like one-hot encode your entire vocabulary?
00:50:08.040 | I think this is my understanding of what modern transformer-based models do.
00:50:12.280 | To one-hot encode the whole vocabulary?
00:50:17.520 | So well, just say a bit more like what are you, what are you doing?
00:50:21.200 | Like my understanding of how large language models encode individual words,
00:50:29.160 | is they have a list of all of their possible tokens.
00:50:32.080 | They break it down into tokens.
00:50:35.960 | And then if you're token 337,
00:50:35.960 | you're just like, you have a vector of length,
00:50:38.680 | the number of tokens you have,
00:50:39.920 | like a vocabulary of like 50,000.
00:50:42.120 | And then you just one-hot encode which token that is.
00:50:45.520 | Hmm. Well, we're about to do this.
00:50:48.280 | So why don't if- I'll show you how they represent things.
00:50:51.520 | And let's see if it connects with your question.
00:50:54.840 | Because it is different.
00:50:57.600 | Yeah, it is going to be very different.
00:50:59.480 | And the notion of token and the notion of type are about to get sort of complicated.
00:51:04.520 | Right. Before we do the technical part,
00:51:07.880 | just a little bit of context here about why I think this is so exciting.
00:51:11.720 | I'm a linguist, right?
00:51:12.920 | And I was excited by the static vector representations of words,
00:51:16.880 | but it was also very annoying to me because they give you one vector per word.
00:51:23.160 | Whereas my experience of language is that words have
00:51:27.480 | multiple senses and it is hard to delimit where the senses begin and end.
00:51:31.720 | Consider a verb like break,
00:51:33.800 | which I've worked on with my PhD student, Erica Peterson.
00:51:36.960 | The vase broke.
00:51:38.200 | That's one sense maybe.
00:51:39.800 | Dawn broke.
00:51:41.160 | That's the same form, broke.
00:51:44.360 | But that means something different.
00:51:46.360 | Entirely different.
00:51:47.840 | The news broke.
00:51:49.360 | Again, broke as the form there.
00:51:51.680 | But this being something more like was published or appeared.
00:51:55.360 | Sandy broke the world record.
00:51:57.480 | It's very unlike the vase broke, right?
00:52:00.480 | Now it's like surpassing the limit.
00:52:02.840 | Sandy broke the law is a different sense yet again.
00:52:06.520 | That's some kind of transgression.
00:52:08.560 | The burglar broke into the house.
00:52:11.040 | That's break again, but now with a particle.
00:52:13.680 | And that means something different still.
00:52:15.840 | The newscaster broke into the movie broadcast.
00:52:18.240 | That means it was interrupted.
00:52:20.520 | Very different again. We broke even,
00:52:23.120 | means I don't know, we ended up back at the same amount we started with.
00:52:27.280 | So how many senses of break are here?
00:52:30.120 | If I was in the old mode of static representation,
00:52:33.440 | would I survive with one break vector for all of these examples?
00:52:38.200 | Or would I have one per example type?
00:52:41.760 | But then what about all the ones that I didn't list here?
00:52:44.400 | The number of senses for break starts to feel impossible to enumerate.
00:52:49.960 | If you just think about all the ways in which you encounter this verb.
00:52:53.200 | And there is some metaphorical core that seems to run through them.
00:52:57.560 | But in the details,
00:52:58.920 | these senses are all very different.
00:53:01.720 | And this tells me that the sense of a word like break is being modulated by the context it is appearing in.
00:53:10.240 | And the idea that we would have one fixed representation for it,
00:53:14.000 | even if it's learned from data,
00:53:15.640 | is just kind of wrong from the outset.
00:53:19.160 | Here's another example.
00:53:21.120 | We have a flat tire.
00:53:22.840 | But what about flat beer,
00:53:24.680 | flat note, flat surface?
00:53:27.160 | Maybe they have some metaphorical core,
00:53:29.440 | but those feel like at least two to four different senses for flat.
00:53:34.840 | Throw a party, throw a fight,
00:53:37.080 | throw a ball, throw a fit.
00:53:39.040 | All very different senses.
00:53:41.160 | It's tragic to think we would have one throw that was meant to cover all of these examples, right?
00:53:48.120 | A crane caught a fish.
00:53:50.280 | A crane picked up the steel beam.
00:53:52.960 | That might feel like a standard sort of lexical ambiguity.
00:53:56.480 | And so maybe you can imagine that we have one vector for crane as a bird,
00:54:01.160 | and one for crane as a machine.
00:54:03.760 | But is that going to work for the entire vocabulary?
00:54:07.080 | I suspect not.
00:54:08.480 | I saw a crane.
00:54:09.520 | We wouldn't even know what vector we were dealing with there, right?
00:54:12.680 | Which one would we pick?
00:54:13.960 | And now we have another problem on our hands,
00:54:15.720 | which is selecting the static vector based on contexts, right?
00:54:19.840 | How are we going to do that? And this is a really deep thing.
00:54:22.560 | It's not just about the local kind of morphosyntactic context here.
00:54:27.200 | What about, are there typos?
00:54:29.160 | I didn't see any.
00:54:30.680 | So the sense of any there is like any typos, right?
00:54:34.720 | Versus are there bookstores downtown?
00:54:36.920 | I didn't see any.
00:54:37.840 | Now the sense of any and the kind of elliptical stuff that comes after it is any bookstores.
00:54:44.560 | And now I hope you can see that the sense that words can have is modulated by context in the most extended sense.
00:54:53.040 | And having fixed static representations was never going to work in the face of all of this diversity.
00:55:00.240 | We were never going to figure out how to cut up the senses in just the right way to get all of this data handled correctly.
00:55:09.720 | And the vision of contextual representation models is that you're not even going to try to do all that hard and boring stuff.
00:55:17.040 | Instead, you are just going to embrace the fact that every word could take on a different sense,
00:55:22.640 | that is, have a different representation depending on everything that is happening around it.
00:55:28.480 | And we won't have to decide then which sense is in 1A and whether it's different from 1B and 1C and so forth.
00:55:33.840 | We will just have all of these token level representations.
00:55:40.120 | It will be entirely a theory that is based in words as they appear in context.
00:55:47.920 | For me as a linguist, it is not surprising at all that this turns out to lead to lots of engineering successes
00:55:54.400 | because it feels so deeply right to me about how language works.
00:56:01.120 | Uh, brief history here. I just want to be dutiful about this. Make sure people get credit where it's due.
00:56:06.680 | November 2015, Dai and Le, that's a foundational paper where they really did what is probably the first example of language model pre-training.
00:56:17.000 | It's a cool paper to look at. It's complicated in some ways that are surprising to us now, and it is certainly a visionary paper.
00:56:24.920 | And then McCann et al., this is a paper from Salesforce Research, which at the time was led by Richard Socher,
00:56:31.040 | who is a distinguished alum of this class. Proud of that.
00:56:34.960 | They developed the CoVe model, where what they did is train machine translation models.
00:56:40.320 | And then the inspired idea was that the translation representations might be useful for other tasks.
00:56:48.120 | And again, that feels like the dawn of the notion of pre-training contextual representations.
00:56:54.200 | And then ELMo came. I mentioned ELMo last time. Huge breakthrough. Massive bidirectional LSTMs.
00:57:01.320 | And they really showed that that could lead to rich multipurpose representations.
00:57:06.600 | And that's where you really feel everyone reorienting their research toward these kind of models.
00:57:13.000 | That's not a transformer-based one, though. That's built on bidirectional LSTMs.
00:57:17.320 | And then we get, um, GPT in June 2018 and BERT in October 2018.
00:57:25.320 | Um, the BERT paper was published a long time after that, but as I said before,
00:57:30.040 | it had already achieved massive impact by the time it was published in 2019 or whatever.
00:57:35.400 | So that's why I've been giving the months here, because you can see it's really- there was this
00:57:39.560 | sudden uptake in the amount of- of interest in these things that happened around this time.
00:57:45.560 | And that led to where we are now.
00:57:48.440 | Another kind of interesting thing to think about if you step back for the context here,
00:57:54.840 | is that we as a field have been traveling from high bias models,
00:58:00.280 | where we decide a lot about how the data should look and be processed,
00:58:03.960 | toward models that impose essentially nothing on the world.
00:58:07.800 | So if you go up into the upper left here,
00:58:09.800 | I'm just imagining a model that's kind of in the old mode,
00:58:12.280 | where you have like your glove representations of these three words.
00:58:16.280 | And to get a representation for the sentence, you just add up those representations.
00:58:21.320 | In doing that, you have decided ahead of time,
00:58:24.760 | a lot of stuff about how those pieces are gonna come together.
00:58:28.200 | I mean, you just said it was gonna be addition,
00:58:31.960 | which is almost certainly not correct about how the world works.
00:58:34.920 | So that's a prototypical case of a high-bias decision.
00:58:38.520 | If you move over to the right here, that's a kind of recurrent neural network.
00:58:43.720 | And here, I've kind of decided that my data will be processed left to right.
00:58:49.640 | I could learn a lot of different functions in that space,
00:58:53.400 | so it's much more expressive, much less biased in this machine learning sense,
00:58:57.720 | than this solution here.
00:58:59.160 | But I have still decided ahead of time that I'm gonna go left to right.
00:59:03.160 | And this is another example.
00:59:05.400 | These are tree-structured networks.
00:59:07.000 | Richard Socher, who I just mentioned,
00:59:08.520 | was truly a pioneer in tree-structured recursive neural networks.
00:59:13.320 | Here, I make a lot of decisions about how the pieces can possibly come together.
00:59:18.120 | The rock is a unit constituent, separate from rules, which comes in later.
00:59:23.080 | And I'm just saying I know ahead of time that the data will work that way.
00:59:28.360 | If I'm right, it's gonna give me a huge boost,
00:59:31.320 | because I don't have to learn those details.
00:59:33.400 | If I'm wrong though, I might be wrong forever.
00:59:36.520 | And I think that's actually that feeling that you're wrong forever
00:59:40.440 | is what led to this kind of thing happening.
00:59:42.280 | So here, I've got kind of a bidirectional recurrent model.
00:59:46.280 | So now, you can go left to right and right to left,
00:59:49.640 | and all these attention mechanisms that are gonna,
00:59:53.080 | like, jump around in the linear string.
00:59:56.680 | And this is the actual progression of what happened with recurrent neural networks.
01:00:02.760 | Go both directions and add a bunch of attention connections.
01:00:06.040 | And that is kind of the thing that caused everyone to realize,
01:00:11.160 | "Oh, we should just connect everything to everything else
01:00:14.280 | and go to the maximally low-biased version of this,
01:00:18.040 | and just assume that the data will teach us
01:00:20.760 | about what's important to connect to what.
01:00:23.000 | We won't decide anything ahead of time."
01:00:25.240 | So a triumph of saying,
01:00:27.480 | "I have no idea what the world is gonna be like.
01:00:30.120 | I just trust in my data and my optimization."
01:00:33.160 | The attention piece is really interesting to me.
01:00:37.480 | You know, we used to talk a lot about this in the course.
01:00:40.120 | Here, I have a sequence, "really not so good."
01:00:43.240 | And a common mode, still common today,
01:00:46.760 | is that I might fit a classifier on this final representation here
01:00:51.240 | to make a sentiment decision.
01:00:53.960 | But people went on that journey I just described,
01:00:57.160 | where you think, "Wait a second.
01:00:58.600 | If I'm just gonna use this,
01:01:00.120 | won't I lose a lot of information about the earlier words?"
01:01:04.680 | I should have some way of, like, connecting back.
01:01:08.200 | And so what they did is dot products,
01:01:11.080 | essentially, between the thing that you're using for your classifier
01:01:14.600 | and the previous states.
01:01:16.200 | That's what I've depicted here,
01:01:17.960 | just as a kind of way of scoring this final thing with respect to the previous thing.
01:01:23.800 | You might normalize them a little bit,
01:01:25.720 | and then form what was called a context vector.
01:01:28.840 | This is like the attention representation.
01:01:31.240 | And what I've done here is build these links back to all these previous states.
01:01:36.200 | And that turned out to be incredibly powerful.
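Here is a minimal NumPy sketch of that attention pattern: dot-product scores of the final state against the previous states, softmax normalization, and a weighted context vector. Shapes and values are illustrative only.

```python
# Dot-product attention from a final hidden state back over previous states.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

hidden = np.random.randn(5, 8)        # 5 time steps, hidden size 8
query = hidden[-1]                    # the final state used for classification

scores = hidden[:-1] @ query          # dot products against previous states
weights = softmax(scores)             # normalize into attention weights
context = weights @ hidden[:-1]       # weighted sum: the context vector

print(weights.round(3), context.shape)
```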
01:01:39.640 | And when you read the title of the paper, "Attention is all you need,"
01:01:43.400 | what they are doing is saying,
01:01:46.360 | "You don't need LSTM connections and recurrent connections and stuff.
01:01:50.760 | The sense in which attention is all you need
01:01:52.840 | is the sense in which they're saying all you needed
01:01:55.320 | was these connections that you were adding onto the top of your earlier models."
01:01:59.080 | And maybe they were right,
01:02:03.960 | but certainly, uh, it has taken over the field.
01:02:07.880 | Another important idea here that might often be overlooked
01:02:12.440 | is just this notion that we should model the sub-parts of words.
01:02:16.200 | And again, I can't resist a historical note here.
01:02:19.480 | If you look back at the ELMo paper,
01:02:21.960 | what they did to embrace this insight is incredible.
01:02:26.200 | They had character-level representations,
01:02:28.680 | and then they fit all these convolutions on top of all those character-level representations,
01:02:34.760 | which is essentially like ways of pooling together sub-parts of the word.
01:02:38.760 | And then they form a representation of- at the top that's like the average
01:02:43.560 | plus the concatenation of all of these different convolutional layers.
01:02:48.360 | And the result of this is a vocabulary that does latently have information about characters
01:02:54.760 | and sub-parts of words as well as words in it.
01:02:57.640 | And I feel that that's deeply right, right?
01:03:00.680 | And this is like a space in which you could capture lots of things like
01:03:04.200 | how talk is similar to talking and is similar to talked.
01:03:07.880 | And you know, all that stuff that a simple unigram parsing would miss
01:03:12.120 | is latently represented in this space.
01:03:16.280 | But the vocabulary for ELMo is like 100,000 words.
01:03:21.640 | So that's 100,000 embedding space that I need to have.
01:03:25.960 | It's actually gargantuan.
01:03:27.560 | And it's still the case that if you process real data,
01:03:32.440 | you have to unk out, that is, mark as unknown most of the words you encounter.
01:03:37.400 | Because the language is incredibly complicated,
01:03:40.280 | and 100,000 doesn't even come close to covering
01:03:43.800 | all the tokens that you encounter in the world.
01:03:46.200 | And so again, we have this kind of galaxy brain moment where I guess the field says,
01:03:51.240 | forget all that.
01:03:52.760 | And what you do instead is tokenize your data
01:03:56.600 | so that you just split apart words into their sub-word tokens if you need to.
01:04:03.640 | So here I've got an example with the BERT tokenizer.
01:04:07.000 | This isn't too surprising.
01:04:08.360 | That comes out looking kind of normal.
01:04:11.000 | But if you do encode me,
01:04:12.600 | notice that the word encode has been split into two tokens.
01:04:17.080 | And if you do snuffleupagus,
01:04:20.040 | you get 1, 2, 3, 4, 5, 6, 7 tokens,
01:04:25.000 | 6 or 7 from that,
01:04:26.600 | because it doesn't know what the word is.
01:04:28.440 | And so what it does is not unk it out,
01:04:31.400 | but rather split it apart into a bunch of different pieces.
01:04:34.280 | And the result is the really startling thing
01:04:37.720 | that BERT-like models have only 30,000 words in their vocabulary.
01:04:41.640 | But they're words in the sense that they're these sub-word tokens.
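You can reproduce the behavior described here with the Hugging Face tokenizer; the exact splits depend on the learned vocabulary, so the commented outputs are illustrative rather than guaranteed.

```python
# Inspecting BERT's subword tokenization.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tok.tokenize("this isn't too surprising"))
print(tok.tokenize("encode me"))        # "encode" may split into pieces like "en", "##code"
print(tok.tokenize("snuffleupagus"))    # several "##" continuation pieces
print(tok.vocab_size)                   # roughly 30,000 subword types
```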
01:04:46.440 | Now, this was going to be tragically bad
01:04:50.920 | in the realm where we were doing static word representations
01:04:55.640 | because I'm going to have a vector for NU
01:04:57.640 | and have no sense in which it was participating in the larger part of snuffleupagus.
01:05:04.040 | But we're talking about contextual models.
01:05:06.920 | So even if these are the tokens,
01:05:08.520 | the model is going to see the full sequence.
01:05:10.840 | And we can hope that it reconstructs something like the word
01:05:15.240 | from all the pieces that it encountered.
01:05:17.640 | Certainly, we could hope that for something like encode.
01:05:19.960 | And we take this for granted now,
01:05:23.080 | but it's deeply insightful to me
01:05:25.880 | and incredibly freeing in terms of the engineering resources that you need.
01:05:30.200 | But it does depend on rich contextual representations.
01:05:35.800 | And then another notion, positional encoding.
01:05:39.080 | So we have all these tokens or maybe, you know, subparts of words.
01:05:43.240 | In addition to representing things
01:05:46.280 | using a traditional static embedding space like a GloVe one,
01:05:50.760 | that's what I put with these light gray boxes here,
01:05:53.640 | we'll also represent sequences with positional encodings,
01:05:57.640 | which will just keep track of where the token appeared in the sequence I'm processing.
01:06:03.720 | And what that means is that the word rock here,
01:06:06.520 | occurring in position two,
01:06:08.760 | will have a different representation
01:06:11.960 | if rock appears in position 47 in the string.
01:06:15.800 | It'll be kind of the same word,
01:06:18.680 | but also partly a different word.
01:06:21.160 | And that's another way in which you're embracing the fact
01:06:25.160 | that all of these representations are going to be contextual.
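A minimal sketch of learned positional encodings in PyTorch, with illustrative sizes and names (not the course's code): the input to the model is the word embedding plus an embedding of the token's position, so the same word at two different positions enters the model differently.

```python
# Word embedding + learned positional embedding.
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 30000, 512, 64
word_emb = nn.Embedding(vocab_size, d_model)
pos_emb = nn.Embedding(max_len, d_model)

token_ids = torch.tensor([[101, 2600, 102]])              # a tiny fake sequence
positions = torch.arange(token_ids.size(1)).unsqueeze(0)  # [[0, 1, 2]]

inputs = word_emb(token_ids) + pos_emb(positions)
print(inputs.shape)                                        # (1, 3, 64)
```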
01:06:28.920 | This is an interesting story for me
01:06:31.640 | because I've been slow to realize what maybe the whole field already knew,
01:06:35.560 | that this is incredibly important.
01:06:37.320 | How you do this positional encoding really matters for how models perform.
01:06:42.040 | And that's why, in fact, it's like one of these early topics here
01:06:45.320 | that we'll talk about next time.
01:06:46.600 | And then of course, another guiding idea here
01:06:53.480 | is that we are going to do massive scale pre-training.
01:06:56.760 | I mentioned this before.
01:06:57.880 | We're going to have these contextual models
01:07:00.120 | with all these tiny little parts of words in them,
01:07:03.560 | all in sequences with positional encodings.
01:07:06.280 | And we are going to train at an incredible scale.
01:07:08.840 | That's that same story of word2vec, GloVe, through GPT,
01:07:13.080 | and then on up to GPT-3.
01:07:15.080 | I mentioned this before.
01:07:16.760 | And some magic happens as you do this on more and more data
01:07:22.280 | with larger and larger models.
01:07:24.040 | And then finally, this is related.
01:07:29.720 | This insight that instead of starting from scratch for my machine learning models,
01:07:35.240 | I should start with a pre-trained one and fine-tune it for particular tasks.
01:07:41.000 | We saw this a little bit in the pre-transformer era.
01:07:46.360 | The standard mode was to take GloVe or word2vec representations
01:07:50.600 | and have them be the inputs to something like an RNN.
01:07:53.640 | And then the RNN would learn.
01:07:55.640 | And instead of having to learn the embedding space from scratch,
01:07:59.400 | it would start in this really interesting space.
01:08:02.280 | And that is actually a kind of learning of contextual representations.
01:08:07.880 | Because what happens if the GloVe representations are updated
01:08:11.160 | is that they all shift around and the network kind of pushes them around
01:08:15.080 | so that you get different senses for them in context.
01:08:18.840 | And then again, the transformer thing just takes that to the limit.
01:08:22.760 | And that is the mode that you'll operate in for the first homework.
01:08:27.000 | I've put this from 2018 onwards.
01:08:29.320 | We have this thing where, I hope you can make it out at the bottom,
01:08:32.440 | you load in BERT and you just put a classifier on top of it.
01:08:35.960 | And you learn that classifier for your sentiment task, say.
01:08:40.520 | And that actually updates the BERT parameters.
01:08:43.720 | And the BERT parameters help you do better at your task.
01:08:47.240 | And in particular, they might help you generalize
01:08:50.680 | to things that are sparsely represented in your task-specific training data.
01:08:56.280 | Because they've learned so much about the world in their pre-training phase.
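A stripped-down sketch of that fine-tuning setup with Hugging Face Transformers and PyTorch; the dataset and training loop are reduced to a single toy batch.

```python
# Load pre-trained BERT, put a classifier head on top, and fine-tune end to end.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)   # pre-trained encoder + fresh classifier

batch = tok(["really not so good", "what a great movie"],
            padding=True, return_tensors="pt")
labels = torch.tensor([0, 1])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)  # loss comes from the classifier head
outputs.loss.backward()                  # gradients flow into the BERT parameters too
optimizer.step()
```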
01:09:01.160 | I put that for 2018 onwards.
01:09:05.880 | I'm a little worried that we're moving into a future
01:09:08.280 | in which fine-tuning is just again using an OpenAI API.
01:09:12.600 | But you all will definitely learn how to do much more than this,
01:09:16.520 | even if you fall back on doing this at some point.
01:09:18.760 | Where now what you're doing is some lightweight version of fine-tuning
01:09:23.800 | a massive model like GPT-3.
01:09:26.840 | Same mental model there.
01:09:28.360 | It's just that the starting point knows so much about language in the world,
01:09:32.840 | compared even to the BERT model up here.
01:09:34.840 | Those are the guiding ideas.
01:09:41.560 | I'll pause.
01:09:42.840 | Questions, comments?
01:09:45.160 | What's on your minds?
01:09:46.680 | Yeah.
01:09:47.180 | Going back to the sub-word splitting of the longer word,
01:09:52.600 | is there an infusion we are imposing on splitting that particular way,
01:09:56.600 | or is that also part of- driven by the model itself?
01:10:01.000 | Oh, yeah.
01:10:02.120 | So what gets imposed as a modeling bias in that sense
01:10:05.880 | when you do the tokenization?
01:10:07.080 | Is that the question?
01:10:08.280 | Yeah, potentially a lot.
01:10:09.880 | I left this out for reasons of time,
01:10:11.880 | but there are a bunch of different schemes
01:10:14.120 | that you can run for doing that sub-word tokenization.
01:10:16.760 | You'll see this as you read papers and as I talk.
01:10:19.560 | There's WordPiece, there's byte-pair encoding,
01:10:22.520 | there's the unigram language model.
01:10:24.680 | All of them are attempts to learn a kind of optimal way to tokenize the data
01:10:30.600 | based on things that tend to co-occur a lot together.
01:10:34.040 | But it's definitely a meaningful step.
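If you want to see one of these schemes learned end to end, here is a sketch using the Hugging Face tokenizers library (a tooling assumption) to train a tiny byte-pair-encoding vocabulary from an in-memory corpus.

```python
# Train a small BPE subword vocabulary and inspect how it splits a rare word.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

corpus = ["the vase broke", "the news broke", "encode me", "snuffleupagus says hi"]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer)

print(tokenizer.encode("snuffleupagus").tokens)  # learned subword pieces
```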
01:10:37.000 | Yeah, and so like for example,
01:10:39.240 | as someone who's interested in the morphology of languages,
01:10:42.760 | word forms, and you all might-
01:10:44.600 | this could be a cool multilingual angle
01:10:46.600 | if you think about languages with very rich morphology.
01:10:50.120 | You might have an intuition that you want a tokenization scheme
01:10:52.680 | that reproduces the morphology of that language,
01:10:55.080 | that splits a big word with all its suffixes say,
01:10:58.360 | down into things that look like the actual pieces,
01:11:00.600 | as you recognize them.
01:11:02.840 | And then you could think,
01:11:04.040 | well, the best of these schemes should come close to that, right?
01:11:07.800 | And that could be an important and useful bias that you impose.
01:11:10.840 | Yeah.
01:11:14.300 | Go, yeah.
01:11:15.300 | Yeah.
01:11:16.760 | Can you elaborate on what happens when we do fine-tuning to the original model?
01:11:22.040 | Like, does it change or just-
01:11:25.880 | it adds additional layers to it or like,
01:11:29.160 | what actually happens with fine-tuning?
01:11:31.320 | What happens when you fine-tune?
01:11:34.440 | As usual with these questions,
01:11:35.640 | there's like an easy answer and a hard answer.
01:11:37.480 | So the easy answer is that you are simply back-propagating
01:11:41.400 | whatever error signal you got from your,
01:11:43.720 | you know, output comparison with the true label,
01:11:46.280 | back through all the parameters of the model.
01:11:49.240 | And in principle, that could mean that,
01:11:51.160 | you know, as you fine-tune on your sentiment task,
01:11:53.000 | you are updating all of the parameters,
01:11:55.000 | even the pre-trained BERT ones.
01:11:57.320 | And then of course, there are variants of that
01:11:59.000 | where you update just some of the BERT parameters
01:12:01.240 | while leaving others frozen and so forth.
01:12:03.560 | But the idea is you have a smart initialization,
01:12:07.240 | that's the BERT initialization,
01:12:09.320 | and then you kind of adjust the whole thing
01:12:11.800 | to be really good at your task.
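One concrete version of the "freeze some parameters" variant, sketched with Hugging Face Transformers; the parameter-name check matches BERT's classification models, but verify the names for whatever model you actually use.

```python
# Freeze the pre-trained encoder and update only the classifier head.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

for name, param in model.named_parameters():
    param.requires_grad = name.startswith("classifier")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")  # just the classifier head
```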
01:12:13.480 | What really happens there?
01:12:17.000 | That's a deep question, right?
01:12:19.240 | That could actually connect with the explainability stuff,
01:12:21.640 | like what adjustments are happening to the network,
01:12:24.760 | and which ones are useful,
01:12:26.520 | which ones could even be detrimental,
01:12:28.520 | which ones are causing you to overfit.
01:12:30.680 | Are there lightweight versions of the fine-tuning
01:12:34.200 | that would be better and more robust
01:12:35.960 | and get a better balance from the pre-training
01:12:38.120 | and the task-specific thing?
01:12:39.800 | And that just shows you fine-tuning
01:12:42.840 | is an art and a science all at once.
01:12:44.760 | Yeah.
01:12:47.260 | Sure.
01:12:49.820 | So do we also control, like,
01:12:53.720 | the influence of the fine-tuning dataset
01:12:57.880 | over the original model?
01:12:59.000 | Can we control it in a way that,
01:13:01.800 | oh, change- just change the model a little bit,
01:13:04.680 | or change the original model completely?
01:13:06.520 | Let's see, what's the right metaphor?
01:13:09.480 | You could control it the same way
01:13:12.200 | that you could control a kind of out-of-control car.
01:13:15.960 | I mean, you have a steering wheel
01:13:17.320 | and you have an accelerator and a brake,
01:13:20.200 | but if they're all kind of unfamiliar,
01:13:21.320 | you're not sure how they work.
01:13:22.360 | Yeah, you can try.
01:13:25.000 | And as you get better at the task,
01:13:26.760 | as you get better at driving the vehicle,
01:13:28.440 | you have more fine control.
01:13:30.280 | But it's an art and a science at the same time.
01:13:33.320 | I'm actually hoping, you know,
01:13:34.760 | that Sid, he's going to do a hands-on session
01:13:37.160 | with us next week,
01:13:38.360 | and that he imparts some of his own hard-won lessons to you
01:13:41.720 | about how to drive these things,
01:13:43.240 | because he's really been in the trenches
01:13:45.400 | doing this with large models.
01:13:46.840 | But you know, you have your learning rate
01:13:49.080 | and your optimizer and other things you can fiddle with,
01:13:52.040 | and hope that it steers in the direction you want.
01:13:54.520 | Yeah, if we use tree structures right now
01:14:00.520 | to represent like syntax,
01:14:01.960 | I guess my question is,
01:14:03.240 | why don't they work super well for language models?
01:14:06.840 | And like, I guess the sentiment that you had was like,
01:14:09.720 | oh, kind of just like put attention towards anything
01:14:12.120 | and see what works.
01:14:13.400 | I guess, why is that the sentiment
01:14:15.160 | in linguistics then as well?
01:14:16.680 | Great question.
01:14:18.680 | My personal perspective is that probably
01:14:22.040 | all the trees that we have come up with
01:14:24.200 | are kind of wrong.
01:14:25.000 | And as a result,
01:14:27.560 | we were actually making it harder for our models,
01:14:30.200 | because we were putting them in this bad initial state.
01:14:32.680 | And so the mode I've moved into is to think,
01:14:35.880 | let's use the transform or something
01:14:37.560 | that's much more like this,
01:14:38.600 | totally free form,
01:14:39.800 | and then use explainability methods
01:14:41.640 | to see if we can see what tree structures they've induced.
01:14:44.360 | Because those might be closer
01:14:46.440 | to the true tree structures of language.
01:14:48.920 | And another aspect of this is that I feel like
01:14:51.400 | there's not even for a given sentence,
01:14:53.880 | one structure.
01:14:54.840 | There could be one for semantics,
01:14:57.000 | one for syntax,
01:14:58.120 | one for other things.
01:14:59.560 | And so we want those all simultaneously represented.
01:15:02.440 | And again,
01:15:02.920 | these powerful models we're talking about could do that.
01:15:05.960 | And so then they become devices
01:15:08.200 | for helping us learn what the right structures are,
01:15:10.280 | as opposed to us imposing them.
01:15:11.800 | Yeah.
01:15:13.820 | Numbers are an important part of language.
01:15:17.400 | How does tokenization work for numbers?
01:15:19.720 | Because there are digits,
01:15:21.320 | and there are words,
01:15:22.440 | with the same meaning
01:15:23.240 | or different meanings.
01:15:23.960 | Are they in the same domain, or is it something else?
01:15:29.320 | Oh, I love this.
01:15:31.080 | This is a great example of something that sounds small,
01:15:33.560 | but could be a wonderful final paper
01:15:35.480 | and turns out to be hard and deep.
01:15:37.160 | How do you represent numbers
01:15:38.840 | if you've got a word piece tokenizer?
01:15:41.400 | Do you get it down to all the digits?
01:15:43.480 | Do you leave them as big chunks?
01:15:45.800 | Do you do it based on a strong bias that you have
01:15:48.360 | about like this is the tens place,
01:15:50.200 | this is the hundreds place?
01:15:51.800 | What do you do?
01:15:52.680 | Yeah, wonderful question to ask.
01:15:55.240 | And I am absolutely positive
01:15:57.480 | that that low-level tokenization choice will influence
01:16:00.280 | whether or not your model can learn to do
01:16:02.120 | basic arithmetic, say.
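A quick, hedged way to probe this yourself: tokenize some numbers with a couple of off-the-shelf tokenizers and inspect the splits; the outputs vary by tokenizer and are not guaranteed to align with digits or place value.

```python
# Compare how different tokenizers split numbers.
from transformers import AutoTokenizer

for name in ["bert-base-uncased", "gpt2"]:
    tok = AutoTokenizer.from_pretrained(name)
    for text in ["7", "42", "1234567", "3.14159"]:
        print(name, repr(text), tok.tokenize(text))
```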
01:16:04.360 | Yeah.
01:16:06.200 | And so a paper that evaluated a bunch of schemes
01:16:08.680 | in a way that was insightful and important,
01:16:10.520 | you know, on real mathematical abilities
01:16:13.000 | could really help us understand
01:16:14.760 | which models will be intrinsically limited
01:16:17.080 | and in turn how to develop better ones.
01:16:19.240 | I love it.
01:16:19.740 | Yeah.
01:16:23.260 | On the slide that you have,
01:16:25.000 | the- that's titled attention,
01:16:28.200 | you've shown your source as dot products
01:16:30.840 | with the final word
01:16:32.760 | against all the other previous ones.
01:16:35.640 | Well, I picked the final one,
01:16:36.760 | not- not the first one.
01:16:37.960 | Or all of them?
01:16:40.120 | Attention is all you need.
01:16:42.120 | This is a perfect transition.
01:16:44.280 | So yeah, you should do all of them.
01:16:46.520 | That's like the- what they mean
01:16:48.520 | by the title of the paper.
01:16:50.120 | Yes, do it all.
01:16:51.160 | Self-attention means attending everything
01:16:53.240 | to everything else.
01:16:54.600 | So this is a perfect transition.
01:16:56.200 | We're out of time; it's 4:20.
01:16:58.200 | Next time we will dive into the transformer.
01:17:01.240 | We'll resolve the questions that we got back there.
01:17:03.320 | You'll see much more of these attention connections.
01:17:06.360 | Yeah, we're really queued up now
01:17:07.880 | to dive into the technical parts.