Stanford XCS224U: Natural Language Understanding | Course Overview, Part 2 | Spring 2023
Day two. Got a lot we want to accomplish today. 00:00:12.460 |
What I have on the screen right now is the home base for the course. 00:00:17.660 |
This is our public website and you could think of it as 00:00:20.560 |
kind of a hub for everything that you'll need in the course. 00:00:24.400 |
You can see along the top here we've got some policy pages. 00:00:31.560 |
There's a page that provides an index of background materials, 00:00:39.880 |
hands-on materials in case you need to fill in some background stuff. 00:00:43.580 |
Notice also I do a podcast that actually began in this course last year, 00:00:48.940 |
and I found it so rewarding that I just continued doing it all year. 00:00:55.240 |
If you have ideas for guests for this podcast, 00:00:59.960 |
I'm always looking for exciting people to interview, 00:01:02.920 |
and I think the back episodes are also really illuminating. 00:01:09.320 |
you've got one-stop shopping for the various systems that we have to deal with. 00:01:19.520 |
Canvas is your home for the screencasts and also the quizzes, 00:01:26.920 |
Gradescope is where you'll submit the main assignments, 00:01:36.280 |
and that is the course code that we'll be depending on for the assignments, 00:01:40.040 |
and that I hope you can build on for the original work that you do. 00:01:46.360 |
but we also have this staff email address that is 00:01:49.280 |
vastly preferred to writing to us individually. 00:01:53.800 |
It really helps us manage the workload and know what's happening if you either ping us 00:02:01.680 |
with public posts, whatever, or use that staff address. 00:02:12.360 |
The first column is slides and stuff like that, and also notebooks. 00:02:16.980 |
The middle column is mostly core readings. 00:02:20.360 |
I'm not presupposing that you will manage to do all of this reading because there is a lot of it, 00:02:26.760 |
but these are important and rewarding papers, 00:02:31.720 |
and you might want to immerse yourselves in them. 00:02:33.640 |
But I'm hoping that I can be your trusted guide through that literature. 00:02:44.560 |
Questions, comments, anything I could clear up? 00:02:47.520 |
I have a time budgeted later to review the policies and required work in a bit more detail. 00:02:57.120 |
For the quizzes, are the quizzes doable on the day that they become available, 00:03:01.360 |
or do we need the course material all the way up through the due date? 00:03:14.380 |
The quizzes cover a whole unit, so you might not feel like you can confidently finish the quiz until that final lecture in the unit. 00:03:23.360 |
All of the answers are embedded in this handout here. 00:03:44.040 |
I think I've got an index of past projects behind a protected link. 00:03:50.700 |
If you're not enrolled, we can get you past that little hurdle. 00:03:53.680 |
But I did get permission to release some of them. 00:04:02.740 |
There's another list at the GitHub projects.md page 00:04:09.280 |
of published work, and that stuff you could download. 00:04:11.760 |
The private link gives you the actual course submission. 00:04:16.840 |
So you could compare the paper they did in here with the thing they actually published. 00:04:20.920 |
I'll emphasize again that that will be interesting because of how much work it 00:04:24.900 |
typically takes to go from a class project to something that makes it onto the ACL Anthology. 00:04:41.040 |
You get your computing environment set up to use our code. 00:04:44.480 |
Actually, this is a sign of the changing times. 00:04:47.120 |
I also exhort you to sign up for a bunch of services, 00:04:50.160 |
Colab, and maybe consider getting a pro account for $30. 00:04:56.040 |
That way, you could get a lot more compute on Colab, including GPUs. 00:05:04.920 |
In addition, sign up for an OpenAI account and a Cohere account. 00:05:12.200 |
For Cohere, you get really rate limited and for OpenAI, they give you $5. 00:05:16.600 |
You could consider spending a little bit more. 00:05:18.760 |
I do think you could do all our coursework for under those amounts. 00:05:23.720 |
you could still have lots of accounts if you wanted to. 00:05:34.760 |
Also, I'll say, I'm pretty confident that we'll get a bunch of 00:05:38.640 |
credits from AWS Educate for you to use on EC2 machines. 00:05:54.440 |
What we did last time is I tried to immerse us in 00:05:57.520 |
this weird and wonderful moment for AI and give you a sense for how we got here. 00:06:06.520 |
We focused on transformers and retrieval-augmented in-context learning. 00:06:13.600 |
I expect you all to do lots of creative and cool things in that space. 00:06:17.440 |
But it's important for me to continue this slideshow because there is 00:06:20.960 |
more to our field than just those big language models and prompting. 00:06:25.000 |
There are lots of important ways to contribute beyond that. 00:06:28.040 |
So let me take a moment and just give you an indication of what I have in mind there. 00:06:43.080 |
One example is COGS, which is a relatively recent synthetic dataset that is designed to stress test 00:06:49.600 |
whether models have really learned systematic solutions to language problems. 00:06:55.200 |
So the way COGS works is we have essentially a semantic parsing task. 00:06:59.720 |
The input is a sentence like Lena gave the bottle to John, 00:07:03.440 |
and the task is to learn how to map those sentences to their logical form, 00:07:08.160 |
which are these logical representations down here. 00:07:11.600 |
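To make the input-output format concrete, here is a rough sketch in Python of the kind of (sentence, logical form) pair involved. The logical form shown is a simplified stand-in for illustration, not the exact annotation format of the official COGS or ReCOGS releases.

```python
# Illustrative sketch of a COGS-style semantic parsing pair.
# The logical form string below is a simplification, not the exact
# format used in the official COGS/ReCOGS data.
example = {
    "input": "Lena gave the bottle to John",
    "logical_form": (
        "give.agent(e, Lena) AND give.theme(e, bottle) "
        "AND give.recipient(e, John)"
    ),
}

# A semantic parser for this task maps the input string to the logical
# form string, typically scored by exact match over a test set.
def exact_match_accuracy(predict, examples):
    correct = sum(predict(ex["input"]) == ex["logical_form"] for ex in examples)
    return correct / len(examples)
```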
The interesting thing about COGS is that they've posed hard generalization tasks. 00:07:19.960 |
At training time, you might get to see examples where Lena here is in subject position, 00:07:30.800 |
and you might see Paula as a name but only in isolation, 00:07:34.040 |
and the task is to have the system learn how to deal with 00:07:36.960 |
Paula as the subject of a sentence like "Paula painted a cake." 00:07:48.680 |
Another split involves examples where "on the table" is inside the direct object of the sentence. 00:07:58.240 |
These seem like dead simple generalization tasks, 00:08:06.520 |
This is a kind of accumulated leaderboard of a lot of entries for COGS. 00:08:17.020 |
Overall, systems do quite well at mapping these free-form sentences into those very ornate logical forms. 00:08:28.040 |
But look at this column: that's just the task of learning from "Emma ate the cake on 00:08:31.120 |
the table" and predicting "The cake on the table burned." 00:08:33.920 |
Why are all these brand new systems getting zero on this split? 00:08:39.520 |
That shows first of all that this is a hard problem. 00:08:42.620 |
Now, we are going to work with a variant that we created of COGS called reCOGS. 00:08:47.760 |
This was done with my student Zen Wu and with Chris Manning. 00:08:53.480 |
It turns out that a lot of those zeros derive from there being some artifacts in COGS. 00:08:57.320 |
So it was made kind of artificially hard and also artificially easy in some ways. 00:09:02.840 |
So in this class, we're going to work with reCOGS, 00:09:05.440 |
which applies some systematic meaning-preserving transformations to 00:09:09.760 |
the original to create a new dataset that we think is fairer. 00:09:16.980 |
Systems can get traction where before they were getting zero, 00:09:22.200 |
and we have more confidence that this is testing something about semantics. 00:09:28.400 |
This is incredibly hard for our systems, even our best systems. 00:09:32.960 |
There needs to be some kind of breakthrough here for us to get 00:09:37.000 |
our systems to do well even on these incredibly simple sentences. 00:09:41.720 |
So I am eager to see what you all do with this problem. 00:09:45.720 |
You're seeing a picture here of the kind of best we could do, 00:09:49.000 |
which is a little bit better than what was in the literature previously, 00:09:55.840 |
Right. So that will culminate in this homework and bake-off, our third one. 00:10:02.280 |
From there, the course work opens up into your projects. 00:10:07.760 |
We're done with the regular assignments and we go through the rhythm of lit review, 00:10:11.760 |
experiment protocol, which is a special document that kind of lays down 00:10:15.680 |
the nuts and bolts of what you're going to do for your paper, 00:10:25.320 |
and then the final paper itself. In parallel, we'll be talking about topics that will supercharge your own final project papers. 00:10:29.840 |
The first topic that comes to mind for me there is better and more diverse benchmarks. 00:10:39.360 |
We need reliable estimates of how well our systems are doing, 00:10:47.280 |
and that brings to mind this famous quotation from the explorer Jacques Cousteau 00:10:52.920 |
about "the two essential fluids on which all life depends": 00:10:58.520 |
you can see here that Cousteau continued, "have become global garbage cans." 00:11:03.040 |
That might concern us about what's happening with our datasets. 00:11:07.960 |
But still, you could have that in the back of your mind that we need 00:11:11.480 |
these datasets we create to be reliable, high-quality instruments. 00:11:17.120 |
The reason for that is that we ask so much of our datasets. 00:11:21.000 |
We use them to optimize models when we train on them, 00:11:25.400 |
and, increasingly important, to evaluate our models, 00:11:28.520 |
our biggest language models that are getting all the headlines. 00:11:36.720 |
We use them to enable new capabilities via training and testing, 00:11:44.960 |
and of course for basic scientific inquiry into language and the world. 00:11:51.840 |
All of that shows you that datasets are really central to what we're doing. 00:11:57.120 |
So I'm exhorting you, as you can tell, to think about datasets and 00:12:02.800 |
evaluation tools in the context of this course. 00:12:05.560 |
I am genuinely worried about the new dynamic where we are 00:12:10.280 |
evaluating these big language models essentially on Twitter, 00:12:14.280 |
where people have screenshots of some fun cases that they saw, 00:12:20.220 |
but we're not seeing a full representative sample of the inputs, 00:12:24.740 |
and it's impossible to piece together a scientific picture from that. 00:12:30.540 |
Someone recently observed, and I think this is very wise, 00:12:39.360 |
that figuring out whether a system was a good system is going to get harder and harder, 00:12:43.460 |
and for that we need lots of evaluation datasets. 00:12:48.060 |
You could think about this slide that I showed you from before. 00:12:52.460 |
We have this benchmark saturation with all of these systems now 00:12:56.100 |
increasingly quickly getting above our estimate of human performance, 00:12:59.640 |
but I asked you to be cynical about that as a measure of human performance. 00:13:04.180 |
Another perspective on this slide could be that our benchmarks are simply too easy, 00:13:09.460 |
because it is not as though, if you interacted with one of these systems, you would come away thinking it had surpassed humans. 00:13:17.620 |
Partly what we're seeing here is a remnant of the fact that until very recently, 00:13:23.780 |
our evaluations had to be essentially machine tasks, 00:13:28.100 |
and we had humans do machine tasks to get a measure of human performance. 00:13:32.620 |
Maybe we're moving into a new and more exciting era. 00:13:35.940 |
We're going to talk about adversarial testing. 00:13:38.420 |
I've been involved with the Dynabench effort. 00:13:40.700 |
This is a kind of open-source effort to develop datasets 00:13:44.520 |
that are going to be really hard for the best of our models, 00:13:47.740 |
and I think that's a wonderful dynamic as well. 00:13:50.860 |
That leads into this related topic of us having more meaningful evaluations. 00:13:57.620 |
Here's a fundamental thing that you might worry about: 00:14:02.580 |
all we care about is performance for the system. 00:14:07.620 |
I've put this under the heading of Strathern's law: when a measure becomes a target, it ceases to be a good measure. 00:14:13.580 |
If we have this consensus that all we care about is accuracy, 00:14:18.380 |
everyone in the field will climb on accuracy. 00:14:21.300 |
We know from Strathern's law that that will distort the actual rate of 00:14:26.180 |
progress by diminishing everything else that could be 00:14:30.620 |
important to thinking about these AI systems. 00:14:34.140 |
Relatedly, this is a wonderful study from Birhane et al. 00:14:38.340 |
I've selected a few of the values encoded in ML research, 00:14:42.260 |
which they did via a very extensive literature survey. 00:14:51.700 |
At the top, you have an obsession with performance, as I said. 00:14:56.980 |
In second place, you have efficiency, and then things like explainability, 00:15:06.980 |
well, they actually should be in the tiniest of type. 00:15:09.700 |
Because if you think about the field's actual values as reflected in the literature, 00:15:14.620 |
you find that these things are getting almost no play. 00:15:20.060 |
but it's still the case that it's wildly skewed toward performance. 00:15:24.220 |
But those things that I have down there in purple and orange, 00:15:31.820 |
those are incredibly important things and more and more 00:15:34.460 |
important as these systems are being deployed more widely. 00:15:38.740 |
So we have to, via our practices and what we hold to be valuable, start to elevate these other dimensions. 00:15:46.180 |
You all could start to do that by thinking about 00:15:48.700 |
proposing evaluations that would elevate them. 00:15:54.260 |
The final point here is that we could also have 00:15:57.740 |
a move toward leaderboards that embrace more aspects of this. 00:16:02.180 |
Again, to help us move away from the obsession on performance, 00:16:05.700 |
we should have leaderboards that score us along many dimensions. 00:16:09.780 |
In this context, I've really been inspired by work that tries to 00:16:17.820 |
synthesize across a lot of different measurement dimensions. 00:16:23.500 |
Here I have a table where the rows are question answering systems, 00:16:27.740 |
and the columns are different things we could measure. 00:16:38.340 |
With the current Dyna-scoring that you're seeing here, 00:16:40.980 |
where most of the weight is put on performance, 00:16:43.380 |
that DeBERTa system is the winner in this leaderboard competition. 00:16:47.900 |
But that's standard. But what if we decided that we 00:16:50.540 |
cared much more about fairness for these systems? 00:16:53.060 |
So we adjusted the Dyna-scoring here to put a weight of five on fairness. 00:17:01.140 |
Well, now the Electra Large system is in first place. 00:17:06.740 |
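As a rough sketch of the idea behind this kind of weighted, multi-dimensional scoring (this is not the actual Dynaboard data or formula; the systems, metric values, and weights below are made up):

```python
# Illustrative weighted leaderboard scoring in the spirit of Dynascore.
# All numbers are invented; metric values are assumed already normalized.
systems = {
    "System A": {"performance": 0.82, "fairness": 0.70, "robustness": 0.75, "throughput": 0.60},
    "System B": {"performance": 0.70, "fairness": 0.88, "robustness": 0.72, "throughput": 0.80},
}

def weighted_score(scores, weights):
    """Weighted average of metric values under a chosen set of weights."""
    total = sum(weights.values())
    return sum(weights[m] * scores[m] for m in weights) / total

# Mostly-performance weighting vs. fairness-heavy weighting.
w_performance = {"performance": 5, "fairness": 1, "robustness": 1, "throughput": 1}
w_fairness = {"performance": 1, "fairness": 5, "robustness": 1, "throughput": 1}

for weights in (w_performance, w_fairness):
    ranking = sorted(systems, key=lambda s: weighted_score(systems[s], weights), reverse=True)
    print(weights, ranking)  # the ranking flips when the weights change
```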
I think the answer is that there is no true winner here. 00:17:09.620 |
What this shows is that all of our leaderboards are 00:17:18.060 |
value-laden: we are instilling a particular set of values on the whole enterprise. 00:17:24.860 |
This is, I think, part of Kawin's vision for Dyna-scoring: 00:17:30.380 |
the scoring can be tuned to the things that we want to do in the world, 00:17:32.860 |
via the weighting and the columns that we choose. 00:17:40.260 |
>> What does fairness mean to you in this field? 00:17:47.540 |
So if we're going to start to measure these dimensions, 00:17:49.620 |
like we're going to have a column for fairness, 00:17:51.340 |
we better be sure that we know what's behind that. 00:17:54.140 |
I can tell you there needs to be a lot more work on 00:17:57.660 |
our measurement devices, our benchmarks, for assessing fairness. 00:18:01.740 |
Because all of those things are incredibly nuanced. 00:18:09.220 |
Yeah. Throughput, memory, maybe those are straightforward, 00:18:12.900 |
but fairness is going to be a challenging one. 00:18:15.180 |
But that's not to say that it's not incredibly important. 00:18:25.900 |
I do feel like this is the first time I could say this in this course: 00:18:30.860 |
our evaluations can be much more meaningful than they ever were before. 00:18:34.620 |
Assessment today, or yesterday, is really one-dimensional: 00:18:45.500 |
we pick one metric and treat it as the only thing, regardless of what we're trying to accomplish in the world. 00:18:52.020 |
We say it's F1 and everyone follows suit because we're 00:18:55.580 |
supposed to be the experts on this, and it's often very, 00:19:05.420 |
very different from what you would think that phrase means. 00:19:08.620 |
In this new future that we could start right now, 00:19:11.700 |
our assessments could certainly be high dimensional and fluid. 00:19:14.420 |
I showed you a glimpse of that with the Dyna scoring. 00:19:18.260 |
They could also in turn be highly sensitive to the context that we're in. 00:19:22.180 |
If we care about fairness and we care about efficiency, 00:19:27.140 |
we're going to get a very different prioritization of systems. 00:19:32.780 |
Then in turn, the terms of these evaluations could be set not by us researchers, 00:19:40.020 |
but rather the people who are trying to get value out of these systems, 00:19:46.060 |
Then the judgments could ultimately be made by the users. 00:19:49.980 |
They could decide which system they want to choose. 00:19:57.900 |
We could also have our evaluations be much more at the level of human tasks. 00:20:06.580 |
Right now, the standard thing is to have annotators choose a particular label for an ambiguous example, 00:20:09.540 |
and then we assess how much agreement they have. 00:20:13.640 |
Whereas the human thing is to discuss and debate, 00:20:16.940 |
to have a dialogue about what the right label is in 00:20:19.700 |
the face of ambiguity and context dependence. 00:20:22.540 |
Well, now we could have that kind of evaluation, right? 00:20:26.060 |
Maybe we evaluate systems on their ability to 00:20:28.740 |
adjudicate in a human-like way on what the label should be. 00:20:35.180 |
That used to be out of reach, but now it's probably something that you could toy around with a little bit with one of 00:20:38.900 |
these large language model APIs right now if you wanted. 00:20:45.700 |
I have a couple more topics, but let me pause there. 00:20:58.060 |
I hope you're seeing that it's a wide open area for final projects. Yeah. 00:21:05.180 |
Is there more of a move to get specialists in other fields, 00:21:09.220 |
for example linguistics or related areas, to help make benchmarks? 00:21:16.460 |
You asked, is there a move to have more experts participate in evaluation? 00:21:25.020 |
Because what we want is to provide the world with tools and concepts that would allow 00:21:30.380 |
domain experts, people who actually know what's going on in the domain, to do this themselves. 00:21:34.740 |
We're trying to use this AI technology to make 00:21:37.340 |
these decisions and make adjustments and so forth based on what's working and what isn't. 00:21:45.380 |
Then what we as researchers can do is provide things like what Kawin provided with 00:21:49.380 |
Dynascoring, which is the intellectual infrastructure to allow them to do that. 00:21:53.820 |
Yeah. Then you all probably have lots of domain expertise that intersects with what we're doing, 00:22:04.060 |
You could participate as an NLU researcher and as a 00:22:08.220 |
domain expert to do a paper that embraces both aspects of this. 00:22:13.100 |
Maybe you propose a kind of metric that you think really works well for 00:22:16.700 |
your field of economics or sociology or whatever you're studying, right? 00:22:21.820 |
Yeah, health, medicine, all these things, incredibly important. 00:22:28.620 |
I think one of the challenges we're going to face is 00:22:31.020 |
that it's really expensive to collect human or more sophisticated labels. 00:22:35.140 |
As an example, there's the Med-PaLM paper that came out recently, 00:22:37.980 |
where they trained, or actually really just tuned, an LLM to respond to 00:22:46.460 |
medical questions from the USMLE and other medicine-related exams. 00:22:53.740 |
They also had a section for long-form answers. 00:22:57.060 |
For the short-form answers, it's multiple choice, so they can figure it out automatically. 00:23:00.340 |
The long-form answers, they actually had doctors evaluate them. 00:23:05.100 |
That's really expensive. They could only collect so many labels. 00:23:13.700 |
Some evaluations you can put through a super easy pipeline; it's just counting. 00:23:15.620 |
But evaluating how valuable a search result is, 00:23:19.020 |
that requires a human, that's a little more expensive. 00:23:23.940 |
Yeah. The issue of cost is going to be unavoidable for us. 00:23:31.300 |
This research has just gotten more expensive and that's 00:23:33.780 |
obviously distorting who can participate and what we value. 00:23:37.180 |
It's another thing I could discuss under this rubric. 00:23:42.540 |
I remain optimistic because I think we are in an era now in which you could do 00:23:47.380 |
a meaningful evaluation of a system with no training data and rather 00:23:51.900 |
just a few dozen, let's say 100, examples for assessment. 00:23:59.540 |
That is, if you don't develop your system on it and so forth. 00:24:08.940 |
You might say, "I'll hold out enough data to really get a sense for how my system performs on new data." 00:24:12.260 |
Maybe that's only 200 examples, and I feel like that's manageable. 00:24:20.860 |
The point would be that that might be money well spent. 00:24:23.900 |
It might be worth it to get some experts to provide the 200 cases. 00:24:31.820 |
I could never have said this 10 years ago because 10 years ago, 00:24:35.740 |
the norm was to have 50,000 training instances and 5,000 test instances. 00:24:47.180 |
Now, I feel like a few meaningful cases could be worth a lot. 00:25:03.100 |
Maybe you work in the life sciences or something and you want a dataset; creating one could be a great contribution. 00:25:26.300 |
If we're going to deploy these models out in the world, 00:25:29.380 |
it is really important that we understand them. 00:25:32.060 |
Right now, we do a lot of behavioral testing. 00:25:35.260 |
That is, we come up with these test cases and we see how well the model does. 00:25:40.180 |
But the problem, which is a deep problem of scientific induction, 00:25:44.220 |
is that you can never come up with enough cases. 00:25:49.580 |
Once a system is deployed in the world, no matter how many things you dreamed up when you were doing the research, 00:25:54.580 |
it will encounter things that you never anticipated. 00:26:01.100 |
If all you have is behavioral testing, you might feel very nervous about this because you might have 00:26:03.820 |
essentially no idea what it's going to do on new cases. 00:26:07.580 |
The mission of explainability research should be to go one layer 00:26:11.660 |
deeper and understand what is happening inside 00:26:14.620 |
these models so that we have a sense for how they'll generalize to new cases. 00:26:19.100 |
It's a very challenging thing because of the scale and complexity of the models we're thinking about. 00:26:26.260 |
You can even find people saying in the literature that they're 00:26:29.060 |
skeptical that we can ever understand what's happening with these systems. 00:26:44.220 |
The importance of this is really that we have these broader societal goals. 00:26:56.860 |
It seems to me that all of these questions depend on us 00:27:01.300 |
having some true analytic guarantees about model behaviors. 00:27:08.300 |
Right now, I can't say to you, "Trust me, my model is not biased along some dimension." 00:27:14.600 |
The best I could say is that it wasn't biased in some evaluations that I ran, 00:27:18.620 |
but I just emphasize for you that that's very different from being 00:27:22.540 |
evaluated by the world where a lot of things could happen that you didn't anticipate. 00:27:27.700 |
We'll talk about a lot of different explanation methods. 00:27:31.860 |
I think that these methods should be human interpretable. 00:27:34.920 |
That is, we don't want low-level mathematical explanations of how the models work. 00:27:38.780 |
We want this expressed in human-level concepts so that we can reason about these systems. 00:27:44.980 |
We also want them to be faithful to the underlying model. 00:27:48.660 |
We don't want to fabricate human interpretable but inaccurate explanations of the model. 00:27:53.780 |
We want them to be true to the underlying systems. 00:27:57.080 |
These are two very difficult standards to meet together. 00:28:01.060 |
I can make them human interpretable if I offer you no guarantees of faithfulness, 00:28:09.020 |
and I can make them faithful by making them very technical and low-level. 00:28:12.660 |
We could just talk about all the matrix multiplications we want, 00:28:15.740 |
but that's not going to provide a human-level insight into how the models are working. 00:28:20.640 |
So together though, we need to get methods that are good for both of these, right? 00:28:24.900 |
Concept-level understanding of the causal dynamics of these systems. 00:28:30.100 |
We'll talk about a lot of different explanation methods. 00:28:40.660 |
We'll start with probing, which was an early and very influential and very ambitious attempt 00:28:44.380 |
to understand the hidden representations of our models. 00:28:50.220 |
Then there are attribution methods, which are ways to assign importance to different parts of the representations of these models. 00:28:59.860 |
Then we're going to talk about methods that depend on 00:29:03.220 |
active manipulations of model internal states. 00:29:09.640 |
I'm partial to the active manipulation approach because I think that that's 00:29:12.940 |
the kind of approach that can give us causal insights, 00:29:16.140 |
and also richly characterize what the models are doing, 00:29:19.340 |
and that's more or less the two desiderata that I just mentioned for these methods. 00:29:29.340 |
We'll cover all of these methods, and all of them can be wonderful for your analysis sections of your final papers. 00:29:35.340 |
We might even talk about interchange intervention training, 00:29:39.220 |
which is when you use explainability methods to actually 00:29:41.940 |
push the models to become better, more systematic, 00:29:46.020 |
more reliable, maybe less biased along dimensions that you care about. 00:29:58.380 |
I have a few more kind of more low-level things about the course to do now. Yeah. 00:30:03.420 |
I know we're going to get into all of the explanation methods in a lot of detail later on, 00:30:08.180 |
but can you give a quick example just so that we have a sense of what they are? 00:30:13.460 |
Probing is training supervised classifiers on internal representations. 00:30:25.460 |
You take an internal representation and ask: does it encode information about animacy or part of speech? 00:30:33.700 |
I think that was really eye-opening that even if your task was sentiment analysis, 00:30:39.300 |
you might have learned latent structure about animacy. 00:30:42.980 |
That's getting closer to the human level concept stuff. 00:30:46.460 |
Problem with probing is that you have no guarantee that 00:30:49.180 |
that information about animacy here has any causal impact on the model behavior. 00:30:53.860 |
It could be just kind of something that the model learned by the by. 00:30:57.860 |
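As a rough sketch of what a probe looks like in practice (assuming the Hugging Face transformers library and scikit-learn; the sentences and labels below are made up, and this is not the course's own probing code):

```python
# Minimal probing sketch: train a supervised classifier on frozen BERT
# hidden states to predict a linguistic property (a toy POS label here).
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["The dog barked loudly", "A cat slept quietly"]
# Hypothetical word-level labels aligned to whitespace-separated words:
pos_labels = [["DET", "NOUN", "VERB", "ADV"], ["DET", "NOUN", "VERB", "ADV"]]

features, targets = [], []
for sent, labels in zip(sentences, pos_labels):
    enc = tokenizer(sent, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, hidden_size)
    # Map each word to its first subword piece (a simplification).
    seen = set()
    for idx, wid in enumerate(enc.word_ids()):
        if wid is not None and wid not in seen:
            seen.add(wid)
            features.append(hidden[idx].numpy())
            targets.append(labels[wid])

# The probe itself: a simple classifier on the frozen representations.
probe = LogisticRegression(max_iter=1000).fit(features, targets)
print(probe.score(features, targets))
```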
Attribution methods have the kind of reverse problem. 00:31:01.060 |
They can give you some causal guarantees that this neuron 00:31:04.420 |
plays this particular role in the input-output behavior, 00:31:08.260 |
but it's usually just a kind of scalar value. 00:31:13.820 |
And you say, "It means that it's that much important." 00:31:20.940 |
And then I think the active manipulation thing, 00:31:23.220 |
which is like doing lots of brain surgeries on your model, 00:31:26.540 |
can provide the benefits of both probing and attribution. 00:31:30.820 |
Causal insights, but also a deep understanding of what the representations are. 00:31:40.500 |
It's a very exciting part of the literature. Yeah. 00:31:46.060 |
So I guess, why would we want to use the COGS dataset if we're testing for generalization? 00:31:51.180 |
Like, why can't we just prompt a language model with a word that we've never seen before, 00:31:55.380 |
and try to induce some format: if you see it in the subject position, 00:31:59.540 |
get it to put it in the object position, and see how well it does that? 00:32:06.980 |
Yeah, it could be that you try to prompt a language model. 00:32:09.860 |
Zen did a bunch of that as part of the research. 00:32:14.860 |
But maybe there's a version of that where you 00:32:17.460 |
prompt in the right way with the right kind of instructions, and it works. 00:32:21.780 |
That would be wonderful because that would suggest to me that those models 00:32:27.780 |
have internal representations that are systematic enough to have kind of 00:32:32.060 |
a notion of subject and a notion of object. 00:32:39.220 |
Yeah. The cool thing about COGS is that I think it's a pretty 00:32:42.140 |
reliable measurement device for making such claims. Yeah. 00:32:48.740 |
How transferable is this discussion to languages other than English? 00:32:54.580 |
Like, I wonder if we should be concerned about 00:32:57.620 |
the very tight coupling between the properties of English as a language and these methods. 00:33:06.660 |
Well, I mean, I hope that a lot of you do projects on multilingual NLP, 00:33:17.780 |
We live in a golden age for that research as well. 00:33:20.740 |
There's more research on more languages than there was 10 years ago. 00:33:27.860 |
The downside is that it's all done with multilingual representations, 00:33:33.780 |
and they tend to do much better on English tasks than every other task. 00:33:45.900 |
So that's one of the mixed feelings that I have about a lot of these topics. 00:33:49.940 |
In the interest of time, let's press on a little bit. 00:33:54.980 |
I think I just wanted to skip to the course mechanics. 00:34:03.140 |
You can see that it has a kind of strong emphasis 00:34:06.100 |
toward the three parts that are related to the final project. 00:34:09.460 |
But the homeworks are also really important and the quizzes less so. 00:34:13.380 |
But I think they're important enough that you'll take them seriously. 00:34:22.660 |
and I am eager to interact with you here if possible, 00:34:29.500 |
Please attend office hours if you just want to chat. 00:34:32.020 |
One of my favorite games to play in office hours is when a group comes with 00:34:38.940 |
a bunch of project ideas and we rank them from least to most or most to least viable for the course, 00:34:44.060 |
and I think it always illuminates some things about the problems. 00:34:56.340 |
We just want you to be focused on the final project at that point. 00:35:02.580 |
We can talk about the grading of the original systems a bit later. 00:35:08.780 |
There are some links here: exceptional final projects, and some guidance. 00:35:12.860 |
These are the two documents I mentioned before. 00:35:15.900 |
I'll just say again that this is the most important part of 00:35:19.180 |
the course to me and the thing that's special. 00:35:22.380 |
We have this incredibly accomplished teaching team this year, 00:35:25.540 |
with diverse interests, and they have all done incredible research on their own. 00:35:30.420 |
I've learned a ton from them and from their work, 00:35:39.180 |
and I hope you take advantage of their mentorship for the work you do. 00:36:02.540 |
We want you to be connected with the kind of ongoing discourse for the class. 00:36:07.860 |
so that you know your rights and responsibilities. 00:36:10.460 |
And then I think right now we should check out the homework, 00:36:16.740 |
and get oriented around that before we dive into transformers. 00:36:25.260 |
but I've kind of evoked it for you in case it raised any issues. 00:36:29.540 |
All right. Let's look briefly at the first homework. 00:36:41.140 |
and it is maybe an unusual mode for homeworks. 00:36:46.340 |
This is kind of cool. So this link will take you to the GitHub, 00:36:50.340 |
uh, which I think you're probably all set up with on your home computers. 00:36:54.140 |
But you might want to work with this in the Cloud. 00:36:58.980 |
So you could just click "Open in Colab." 00:37:02.340 |
And I think I've done a pretty good job of getting you so that it will set 00:37:08.900 |
itself up with the installs that you need and the course repo and so forth. 00:37:14.020 |
I would actually be curious to know whether there are bumps along 00:37:16.580 |
the road to getting this to just work out of the box in Colab. 00:37:19.580 |
I do encourage this because if you're ambitious, 00:37:23.980 |
you'll want more compute resources, and this is a good inexpensive way to get them. 00:37:27.140 |
It's also a pretty nice environment to do the work in. Zoom in here. 00:37:35.540 |
And that's actually kind of a good place to start. 00:37:42.500 |
you're encouraged to work with three datasets: 00:37:47.700 |
DynaSent round one, DynaSent round two, and the Stanford Sentiment Treebank. 00:37:57.500 |
But I'm not guaranteeing you that those labels are aligned in the semantic sense. 00:38:02.860 |
In fact, I think that the SST labels are a bit different from the DynaSent ones. 00:38:09.220 |
But certainly, the underlying data are different because DynaSent is 00:38:12.980 |
like product reviews and the Stanford Sentiment Treebank is movie reviews. 00:38:27.460 |
Whereas DynaSent round two is actually annotators working on the Dynabench platform, 00:38:35.980 |
trying to fool a really good sentiment model. 00:38:39.900 |
So the DynaSent round two examples are hard. 00:38:43.060 |
They involve like non-literal language use and sarcasm, 00:38:46.620 |
and other things that we know challenge current day models. 00:38:57.620 |
I'm just pushing you to develop a simple linear model with sparse feature representations. 00:39:04.580 |
This is a kind of more traditional mode background. 00:39:12.860 |
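As a rough sketch of that kind of baseline (using scikit-learn here; the course repo has its own utilities, so treat this as illustrative only, with made-up toy examples):

```python
# Illustrative sparse-features + linear-model sentiment baseline.
# In the homework you would load the actual dataset splits instead.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["great phone, love it", "terrible battery, broke fast", "it works fine"]
train_labels = ["positive", "negative", "neutral"]

# Unigram count features (sparse) feeding a logistic regression classifier.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 1)),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)

print(model.predict(["the battery is great"]))
```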
I think we should talk about how to get you up to speed for the course. 00:39:20.860 |
this should be a pretty straightforward question. 00:39:34.120 |
what you do is complete a function that I started. 00:39:40.980 |
This is mainly about starting to build your own original system. 00:39:55.540 |
The advantage of the tests for me is that if there was a problem, it would surface right away. 00:40:02.580 |
You also get a guarantee that if your code passes the test, 00:40:07.220 |
it should pass when you submit, because more or less the same tests run on Gradescope. 00:40:23.100 |
Those things always feel like they're just barely functioning. 00:40:26.780 |
But the idea is that this is really not about me evaluating you. 00:40:36.980 |
It's about giving you concepts that will let you develop your own systems. 00:40:40.620 |
I'm just trying to be a trusted guide for you on that. 00:40:44.340 |
So you do some coding and you have these three questions here. 00:40:47.820 |
The result of doing those three questions is that you 00:40:50.780 |
have something that could be the basis for your original system. 00:40:54.180 |
It'd be pretty cool, by the way, if some people 00:40:58.820 |
used these simpler models for their original systems, to show the transformers that there's still competition out there. 00:41:08.340 |
The next question works the same way, except now we're focused on transformer fine-tuning. 00:41:23.580 |
You'll learn some Hugging Face code and you'll also learn some concepts, 00:41:30.220 |
so that you understand what the representations are like. 00:41:39.020 |
You'll end up writing a PyTorch module where you fine-tune BERT. 00:41:43.780 |
That is step one to a really incredible system I'm sure. 00:41:54.260 |
So given the course code and everything else, 00:41:57.180 |
the interfaces for these things are pretty straightforward to write. 00:42:03.820 |
For the homework questions themselves, you don't actually need heavy-duty computing at 00:42:05.820 |
all because you don't do anything heavy-duty. 00:42:11.900 |
For your original system, that might be where you want to train a big monster model and figure out how to 00:42:16.140 |
work with the computational resources that you have to get that done. 00:42:26.780 |
So you'll still want to be on Colab or something like that. 00:42:30.140 |
Then I don't know how ambitious you're going to get for your original system. 00:42:34.460 |
You can tell that I'm trying to lead you toward using 00:42:37.340 |
question one and question two for your original system. 00:42:41.060 |
If you want to do something where you just prompt GPT-4, 00:42:51.260 |
you can, but the system still has to be your own work, so you can't just download somebody else's code. 00:42:53.660 |
If all you did was a very boring prompt structure, 00:42:56.780 |
you wouldn't get a high grade on your original system. 00:42:59.460 |
We're trying to encourage you to think creatively and explore. 00:43:03.260 |
Then the final thing is you just enter this in a bake-off, 00:43:06.380 |
and really that just means grabbing an unlabeled dataset 00:43:09.620 |
from the web and adding a column with predictions in it. 00:43:13.260 |
Then you upload that when you submit your work to Gradescope. 00:43:20.020 |
we'll reveal the scores and there'll be some winners, 00:43:24.860 |
I'm optimistic that we're going to have EC2 codes as prizes. 00:43:29.140 |
That's always been fun because if you win a bake-off, 00:43:31.100 |
you get a little bit more compute resources for your next big thing. 00:43:35.340 |
They don't want to hand out these codes anymore like they used to, 00:43:42.940 |
but I think I have an arrangement in place to get some. 00:43:48.980 |
We give out prizes for the best systems and the most creative systems, 00:43:51.780 |
and we have even given out prizes for the lowest scoring system. 00:43:55.660 |
Because if that was a cool thing that should have worked and didn't, 00:43:58.660 |
I feel like you did a service to all of us by going down that route. 00:44:04.900 |
So it's a way of trying to have a multi-dimensional leaderboard here, 00:44:08.700 |
even as we rank all of you according to the performance of your systems. 00:44:13.260 |
That's my overview. Questions or comments or anything? 00:44:58.540 |
You can see that this is a kind of outline of this unit. 00:45:01.500 |
Then there's also a good table of contents with good labels. 00:45:04.860 |
So if you need to find things in what I admit is a very large deck, that should help. 00:45:10.940 |
You can also track our progress as we move through these things. 00:45:25.940 |
What is happening with these contextual representations? 00:45:30.100 |
Okay. This one slide here used to take two weeks for this course. 00:45:38.620 |
The background materials are still at the website, 00:45:49.180 |
from back before natural language understanding was all the rage. 00:45:56.740 |
We're going to move quickly past things like Word2Vec, and we're going to dive right into transformers. 00:46:06.140 |
The dawning of the statistical revolution in NLP 00:46:13.840 |
was with feature-based sparse representations. 00:46:21.060 |
To represent a word, you might write a feature function that says, okay, 00:46:23.740 |
yes or no on it referring to an animate thing, 00:46:27.140 |
yes or no on it ending in the characters "ing," and so on. 00:46:36.980 |
And so all these little feature functions would end up giving 00:46:39.660 |
you really long vectors of essentially ones and zeros that were kind of 00:46:45.000 |
hand-designed and that would give you a perspective on 00:46:48.500 |
a bunch of the dimensions of the word you were trying to represent. 00:46:55.400 |
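Here's a rough sketch of what those hand-designed feature functions might look like in code (illustrative only; the particular features and the little lexicon are made up):

```python
# Illustrative hand-built sparse feature functions for a single word.
ANIMATE_WORDS = {"dog", "teacher", "bird", "child"}  # hypothetical lexicon

def word_features(word):
    w = word.lower()
    return {
        "is_animate": int(w in ANIMATE_WORDS),
        "ends_in_ing": int(w.endswith("ing")),
        "is_capitalized": int(word[0].isupper()),
        "length_gt_6": int(len(w) > 6),
    }

# Each word becomes a long, mostly-zero vector once you stack up
# many such indicator features.
print(word_features("Running"))  # {'is_animate': 0, 'ends_in_ing': 1, ...}
```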
And then it kind of started to get replaced, pre-Word2Vec and GloVe, 00:47:01.240 |
with methods like pointwise mutual information or TF-IDF. 00:47:09.940 |
These methods are fundamental in the field of information retrieval, 00:47:13.140 |
especially TF-IDF as a main representation technique. 00:47:20.300 |
It took a while for NLP people to realize that they would be valuable. 00:47:25.140 |
But what you start to see here is that instead of writing all those feature functions, 00:47:30.340 |
I'll just keep track of co-occurrence patterns in large collections of text. 00:47:35.820 |
And PMI and TF-IDF do this essentially just by counting. 00:47:42.560 |
But really it is the rawest form of distributional representation. 00:47:50.520 |
Then, or this is sort of simultaneous in an interesting way, 00:47:52.800 |
you have, paired with PMI and TF-IDF, methods like principal components analysis 00:47:59.000 |
and SVD, which is sometimes called latent semantic analysis. 00:48:07.800 |
So a whole family of these things that are essentially taking 00:48:10.640 |
count data and giving you reduced dimensional versions of that count data. 00:48:16.160 |
And the power of doing that is really that you can 00:48:19.480 |
capture higher order notions of co-occurrence. 00:48:28.640 |
Maybe you don't co-occur with me directly, but you co-occur with words that co-occur with the things I co-occur with. 00:48:32.600 |
You're kind of second order neighbors and you can imagine 00:48:35.360 |
traveling out into this representational neighborhood here. 00:48:39.200 |
And that turns out to be very powerful because a lot of 00:48:42.200 |
semantic affinities come not from just being neighbors with something, 00:48:45.640 |
but rather from that whole network of things co-occurring with each other. 00:48:50.520 |
And what these methods do is take all that count data and compress it in a way that 00:48:54.960 |
loses some information but also captures those notions of similarity. 00:49:00.440 |
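A rough sketch of that pipeline, going from raw co-occurrence counts to a PPMI reweighting to a reduced-dimensional space via truncated SVD (illustrative, with a tiny made-up count matrix):

```python
# Counts -> PPMI -> truncated SVD (LSA-style), as an illustrative sketch.
import numpy as np

words = ["gnarly", "wicked", "awesome", "terrible"]
counts = np.array([
    [10.0, 7.0, 4.0, 1.0],
    [ 7.0, 9.0, 5.0, 1.0],
    [ 4.0, 5.0, 8.0, 1.0],
    [ 1.0, 1.0, 1.0, 6.0],
])

def ppmi(X):
    """Positive pointwise mutual information reweighting of a count matrix."""
    total = X.sum()
    row = X.sum(axis=1, keepdims=True) / total
    col = X.sum(axis=0, keepdims=True) / total
    joint = X / total
    with np.errstate(divide="ignore"):
        pmi = np.log(joint / (row * col))
    return np.maximum(pmi, 0.0)

M = ppmi(counts)

# Truncated SVD: keep the top-k singular dimensions as dense vectors.
U, S, Vt = np.linalg.svd(M)
k = 2
embeddings = U[:, :k] * S[:k]  # one k-dimensional vector per word
print(dict(zip(words, embeddings.round(2))))
```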
And then the final step, which might actually be the kind of 00:49:04.520 |
final step in this literature, was learned dimensionality reduction methods like word2vec and GloVe. 00:49:12.800 |
And this is where you might start with some count data, 00:49:16.400 |
but you have some machine learning algorithm that learns how to 00:49:20.360 |
compute dense learned representations from that count data. 00:49:26.600 |
Um, so kind of like step three infused with more of what we know of as machine learning now. 00:49:34.120 |
And I say it might be the end because I think now, 00:49:38.280 |
for anything that you would do with this mode, 00:49:41.160 |
you would probably just use contextual representations. 00:49:47.280 |
And then here's the review if you want, right? 00:49:51.200 |
And I think it is important to understand this both the history but also 00:49:54.560 |
the technical details to really deeply understand what I'm about to dive into. 00:49:58.720 |
So you might want to circle back if that was too fast. Yeah. 00:50:03.240 |
Is there any option to just like one-hot encode your entire vocabulary? 00:50:08.040 |
I think this is my understanding of what modern transformer-based models do. 00:50:17.520 |
So, well, just say a bit more about what you have in mind. 00:50:21.200 |
Like, my understanding of how large language models encode individual words 00:50:29.160 |
is they have a list of all of their possible tokens. 00:50:35.960 |
You have a vector of the length of that vocabulary, 00:50:42.120 |
and then you just one-hot encode which token that is. 00:50:48.280 |
So why don't if- I'll show you how they represent things. 00:50:51.520 |
And let's see if it connects with your question. 00:50:59.480 |
And the notion of token and the notion of type is about to get sort of complicated. 00:51:07.880 |
Just a little bit of context here about why I think this is so exciting. 00:51:12.920 |
I was excited by the static vector representations of words, 00:51:16.880 |
but they were also very annoying to me because they give you one vector per word. 00:51:23.160 |
Whereas my experience of language is that words have 00:51:27.480 |
multiple senses and it is hard to delimit where the senses begin and end. 00:51:33.800 |
My favorite example is the verb "break," which I've worked on with my PhD student, Erika Petersen. 00:51:51.680 |
You have "The news broke," with this being something more like "was published" or "appeared." 00:52:02.840 |
"Sandy broke the law" is a different sense yet again. 00:52:15.840 |
"The newscaster broke into the movie broadcast" is another one. 00:52:23.120 |
And "We broke even" means, I don't know, we ended up back at the same amount we started with. 00:52:30.120 |
If I was in the old mode of static representation, 00:52:33.440 |
would I survive with one break vector for all of these examples? 00:52:41.760 |
But then what about all the ones that I didn't list here? 00:52:44.400 |
The number of senses for break starts to feel impossible to enumerate 00:52:49.960 |
if you just think about all the ways in which you encounter this verb. 00:52:53.200 |
And there is some metaphorical core that seems to run through them. 00:53:01.720 |
And this tells me that the sense of a word like break is being modulated by the context it is appearing in. 00:53:10.240 |
And the idea that we would have one fixed representation for it starts to seem hopeless. 00:53:29.440 |
Or think about "flat": the examples here feel like at least two to four different senses for flat. 00:53:41.160 |
It's tragic to think we would have one "throw" vector that was meant to cover all of these examples, right? 00:53:52.960 |
That might feel like a standard sort of lexical ambiguity. 00:53:56.480 |
And so maybe you can imagine that we have one vector for crane as a bird and another for crane as a machine. 00:54:03.760 |
But is that going to work for the entire vocabulary? 00:54:09.520 |
We wouldn't even know what vector we were dealing with there, right? 00:54:13.960 |
And now we have another problem on our hands, 00:54:15.720 |
which is selecting the static vector based on contexts, right? 00:54:19.840 |
How are we going to do that? And this is a really deep thing. 00:54:22.560 |
It's not just about the local kind of morphosyntactic context here. 00:54:30.680 |
So the sense of "any" there is like "any typos," right? 00:54:37.840 |
Now the sense of "any" and the kind of elliptical stuff that comes after it is "any bookstores." 00:54:44.560 |
And now I hope you can see that the sense that words can have is modulated by context in the most extended sense. 00:54:53.040 |
And having fixed static representations was never going to work in the face of all of this diversity. 00:55:00.240 |
We were never going to figure out how to cut up the senses in just the right way to get all of this data handled correctly. 00:55:09.720 |
And the vision of contextual representation models is that you're not even going to try to do all that hard and boring stuff. 00:55:17.040 |
Instead, you are just going to embrace the fact that every word could take on a different sense, 00:55:22.640 |
that is, have a different representation depending on everything that is happening around it. 00:55:28.480 |
And we won't have to decide then which sense is in 1A and whether it's different from 1B and 1C and so forth. 00:55:33.840 |
We will just have all of these token level representations. 00:55:40.120 |
It will be entirely a theory that is based in words as they appear in context. 00:55:47.920 |
For me as a linguist, it is not surprising at all that this turns out to lead to lots of engineering successes 00:55:54.400 |
because it feels so deeply right to me about how language works. 00:56:01.120 |
Uh, brief history here. I just want to be dutiful about this. Make sure people get credit where it's due. 00:56:06.680 |
November 2015, Dai and Le, that's a foundational paper where they really did what is probably the first example of language model pre-training. 00:56:17.000 |
It's a cool paper to look at. It's complicated in some ways that are surprising to us now, and it is certainly a visionary paper. 00:56:24.920 |
And then McCann et al., this is a paper from Salesforce Research, which at the time was led by Richard Socher, 00:56:31.040 |
who is a distinguished alum of this class. Proud of that. 00:56:34.960 |
They developed the CoVe model, where what they did is train machine translation models. 00:56:40.320 |
And then the inspired idea was that the translation representations might be useful for other tasks. 00:56:48.120 |
And again, that feels like the dawn of the notion of pre-training contextual representations. 00:56:54.200 |
And then ELMo came. I mentioned ELMo last time. Huge breakthrough. Massive bidirectional LSTMs. 00:57:01.320 |
And they really showed that that could lead to rich multipurpose representations. 00:57:06.600 |
And that's where you really feel everyone reorienting their research toward these kind of models. 00:57:13.000 |
That's not a transformer-based one, though. That's built on bidirectional LSTMs. 00:57:17.320 |
And then we get, um, GPT in June 2018 and BERT in October 2018. 00:57:25.320 |
Um, the BERT paper was published a long time after that, but as I said before, 00:57:30.040 |
it had already achieved massive impact by the time it was published in 2019 or whatever. 00:57:35.400 |
So that's why I've been giving the months here, because you can see it's really- there was this 00:57:39.560 |
sudden uptake in the amount of- of interest in these things that happened around this time. 00:57:48.440 |
Another kind of interesting thing to think about if you step back for the context here, 00:57:54.840 |
is that we as a field have been traveling from high bias models, 00:58:00.280 |
where we decide a lot about how the data should look and be processed, 00:58:03.960 |
toward models that impose essentially nothing on the world. 00:58:09.800 |
I'm just imagining a model that's kind of in the old mode, 00:58:12.280 |
where you have like your GloVe representations of these three words. 00:58:16.280 |
And to get a representation for the sentence, you just add up those representations. 00:58:21.320 |
In doing that, you have decided ahead of time, 00:58:24.760 |
a lot of stuff about how those pieces are gonna come together. 00:58:28.200 |
I mean, you just said it was gonna be addition, 00:58:31.960 |
which is almost certainly not correct about how the world works. 00:58:34.920 |
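A tiny sketch of that high-bias baseline (illustrative; it assumes you have some pretrained static vectors, e.g., GloVe, loaded into a dictionary called `glove`):

```python
# High-bias sentence representation: just sum the static word vectors.
# `glove` is assumed to be a dict from word to a fixed NumPy vector.
import numpy as np

def sentence_vector(sentence, glove, dim=50):
    vecs = [glove[w] for w in sentence.lower().split() if w in glove]
    if not vecs:
        return np.zeros(dim)
    # The modeling decision is baked in here: composition is addition.
    return np.sum(vecs, axis=0)
```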
So that's a prototypical case of a high-bias decision. 00:58:38.520 |
If you move over to the right here, that's a kind of recurrent neural network. 00:58:43.720 |
And here, I've kind of decided that my data will be processed left to right. 00:58:49.640 |
I could learn a lot of different functions in that space, 00:58:53.400 |
so it's much more expressive, much less biased in this machine learning sense, 00:58:59.160 |
But I have still decided ahead of time that I'm gonna go left to right. 00:59:08.520 |
Richard Socher, who I mentioned, was truly a pioneer in tree-structured recursive neural networks. 00:59:13.320 |
Here, I make a lot of decisions about how the pieces can possibly come together. 00:59:18.120 |
"The rock" is a unit constituent, separate from "rules," which comes in later. 00:59:23.080 |
And I'm just saying I know ahead of time that the data will work that way. 00:59:28.360 |
If I'm right, it's gonna give me a huge boost, 00:59:33.400 |
If I'm wrong though, I might be wrong forever. 00:59:36.520 |
And I think it's actually that feeling that you might be wrong forever that pushed the field away from these high-bias designs. 00:59:42.280 |
So here, I've got kind of a bidirectional recurrent model. 00:59:46.280 |
So now, you can go left to right and right to left, 00:59:49.640 |
and you can add all these attention mechanisms that are gonna connect the states to each other. 00:59:56.680 |
And this is a true progression with what happened with recurrent neural networks. 01:00:02.760 |
Go both directions and add a bunch of attention connections. 01:00:06.040 |
And that is kind of the thing that caused everyone to realize, 01:00:11.160 |
"Oh, we should just connect everything to everything else 01:00:14.280 |
and go to the maximally low-biased version of this, where 01:00:27.480 |
I have no idea what the world is gonna be like. 01:00:30.120 |
I just trust in my data and my optimization." 01:00:33.160 |
The attention piece is really interesting to me. 01:00:37.480 |
You know, we used to talk a lot about this in the course. 01:00:46.760 |
The standard thing was that I might fit a classifier on this final representation here. 01:00:53.960 |
But people went on that journey I just described, 01:01:00.120 |
asking, "If I only use that final state, won't I lose a lot of information about the earlier words? 01:01:04.680 |
I should have some way of, like, connecting back." 01:01:11.080 |
So you add these attention connections, essentially, between the thing that you're using for your classifier 01:01:17.960 |
and the earlier states, just as a kind of way of scoring this final thing with respect to the previous things, 01:01:25.720 |
and then form what was called a context vector. 01:01:31.240 |
And what I've done here is build these links back to all these previous states. 01:01:36.200 |
And that turned out to be incredibly powerful. 01:01:39.640 |
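Here's a rough numerical sketch of that idea: score the final hidden state against the earlier states, softmax the scores, and form the context vector as a weighted sum (illustrative, with made-up values; not any particular paper's exact formulation):

```python
# Illustrative dot-product attention over earlier recurrent states.
import numpy as np

rng = np.random.default_rng(0)
states = rng.normal(size=(5, 8))   # T=5 time steps, hidden size 8 (made up)
query = states[-1]                 # the final state, used by the classifier

scores = states @ query                          # score each state vs. the final one
weights = np.exp(scores) / np.exp(scores).sum()  # softmax over time steps
context = weights @ states                       # weighted sum: the context vector

# `context` can then be combined with `query` before the classifier.
print(weights.round(3), context.shape)
```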
And when you read the title of the paper, "Attention is all you need," 01:01:46.360 |
the claim is, "You don't need LSTM connections and recurrent connections and stuff." 01:01:52.840 |
That is the sense in which they're saying all you needed 01:01:55.320 |
was these connections that you were adding onto the top of your earlier models. 01:02:03.960 |
We can debate that, but certainly it has taken over the field. 01:02:07.880 |
Another important idea here that might often be overlooked 01:02:12.440 |
is just this notion that we should model the sub-parts of words. 01:02:16.200 |
And again, I can't resist a historical note here. 01:02:21.960 |
What ELMo did to embrace this insight is incredible. 01:02:28.680 |
They start with character-level representations, and then they fit all these convolutions on top of all those character-level representations, 01:02:34.760 |
which is essentially like ways of pooling together sub-parts of the word. 01:02:38.760 |
And then they form a representation at the top that's like the average 01:02:43.560 |
plus the concatenation of all of these different convolutional layers. 01:02:48.360 |
And the result of this is a vocabulary that does latently have information about characters 01:02:54.760 |
and sub-parts of words as well as words in it. 01:03:00.680 |
And this is like a space in which you could capture lots of things like 01:03:04.200 |
how talk is similar to talking and is similar to talked. 01:03:07.880 |
And you know, all that stuff that a simple unigram parsing would miss, you capture latently here. 01:03:16.280 |
But the vocabulary for ELMo is like 100,000 words. 01:03:21.640 |
So that's a 100,000-word embedding space that I need to have. 01:03:27.560 |
And it's still the case that if you process real data, 01:03:32.440 |
you have to unk out, that is, mark as unknown most of the words you encounter. 01:03:37.400 |
Because the language is incredibly complicated, 01:03:40.280 |
and 100,000 doesn't even come close to covering 01:03:43.800 |
all the tokens that you encounter in the world. 01:03:46.200 |
And so again, we have this kind of galaxy brain moment where I guess the field says: 01:03:52.760 |
what you do instead is tokenize your data 01:03:56.600 |
so that you just split apart words into their sub-word tokens if you need to. 01:04:03.640 |
So here I've got an example with the BERT tokenizer. 01:04:12.600 |
Notice that the word "encode" has been split into two tokens. 01:04:12.600 |
The tokenizer didn't keep the word whole, but rather split it apart into a bunch of different pieces. 01:04:31.400 |
That's how it can be that BERT-like models have only 30,000 words in their vocabulary. 01:04:37.720 |
But they're words in the sense that they're these sub-word tokens. 01:04:50.920 |
In the realm where we were doing static word representations, 01:04:57.640 |
each of those pieces would just get its own vector and have no sense in which it was participating in the larger word "snuffleupagus." 01:05:10.840 |
With contextual models, we can hope that the model reconstructs something like the word from its pieces. 01:05:17.640 |
Certainly, we could hope that for something like encode. 01:05:25.880 |
That's a real change in perspective, and incredibly freeing in terms of the engineering resources that you need. 01:05:30.200 |
But it does depend on rich contextual representations. 01:05:35.800 |
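If you want to see this for yourself, here's a quick sketch with the Hugging Face tokenizer (the exact splits depend on the pretrained vocabulary, so treat the outputs in the comments as indicative, not guaranteed):

```python
# Quick look at BERT's subword tokenization.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("encode"))         # e.g., ['en', '##code']
print(tokenizer.tokenize("snuffleupagus"))  # several '##'-prefixed pieces
print(tokenizer.vocab_size)                 # roughly 30,000 entries
```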
And then another notion, positional encoding. 01:05:39.080 |
So we have all these tokens or maybe, you know, subparts of words. 01:05:46.280 |
using a traditional static embedding space like a GloVe one, 01:05:50.760 |
that's what I put with these light gray boxes here, 01:05:53.640 |
we'll also represent sequences with positional encodings, 01:05:57.640 |
which will just keep track of where the token appeared in the sequence I'm processing. 01:06:03.720 |
And what that means is that the word "rock" here 01:06:11.960 |
gets a different input representation depending on its position, say, if "rock" appears in position 47 in the string. 01:06:21.160 |
And that's another way in which you're embracing the fact 01:06:25.160 |
that all of these representations are going to be contextual. 01:06:31.640 |
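A minimal sketch of the learned-positional-embedding version of this idea in PyTorch (illustrative; the sizes and token ids below are made up, and real transformer implementations differ in details):

```python
# Token embeddings plus learned positional embeddings, BERT-style.
import torch
import torch.nn as nn

vocab_size, max_len, dim = 30000, 512, 64  # made-up sizes

tok_emb = nn.Embedding(vocab_size, dim)
pos_emb = nn.Embedding(max_len, dim)

token_ids = torch.tensor([[101, 7592, 2088, 102]])        # (batch=1, seq_len=4)
positions = torch.arange(token_ids.size(1)).unsqueeze(0)  # [[0, 1, 2, 3]]

# The same token id gets a different input vector at different positions.
x = tok_emb(token_ids) + pos_emb(positions)
print(x.shape)  # torch.Size([1, 4, 64])
```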
I mention this because I've been slow to realize what maybe the whole field already knew: 01:06:37.320 |
how you do this positional encoding really matters for how models perform. 01:06:42.040 |
And that's why, in fact, it's like one of these early topics here. 01:06:46.600 |
And then of course, another guiding idea here 01:06:53.480 |
is that we are going to do massive scale pre-training. 01:07:00.120 |
We'll take huge sequences of text, with all these tiny little parts of words in them, 01:07:06.280 |
and we are going to train at an incredible scale. 01:07:08.840 |
That's that same story from word2vec and GloVe through GPT and BERT. 01:07:16.760 |
And some magic happens as you do this on more and more data. 01:07:29.720 |
This insight that instead of starting from scratch for my machine learning models, 01:07:35.240 |
I should start with a pre-trained one and fine-tune it for particular tasks. 01:07:41.000 |
We saw this a little bit in the pre-transformer era. 01:07:46.360 |
The standard mode was to take GloVe or word2vec representations 01:07:50.600 |
and have them be the inputs to something like an RNN. 01:07:55.640 |
And instead of having to learn the embedding space from scratch, 01:07:59.400 |
it would start in this really interesting space. 01:08:02.280 |
And that is actually a kind of learning of contextual representations. 01:08:07.880 |
Because what happens if the GloVe representations are updated 01:08:11.160 |
is that they all shift around and the network kind of pushes them around 01:08:15.080 |
so that you get different senses for them in context. 01:08:18.840 |
And then again, the transformer thing just takes that to the limit. 01:08:22.760 |
And that is the mode that you'll operate in for the first homework. 01:08:29.320 |
We have this thing where, I hope you can make it out at the bottom, 01:08:32.440 |
you load in BERT and you just put a classifier on top of it. 01:08:35.960 |
And you learn that classifier for your sentiment task, say. 01:08:40.520 |
And that actually updates the BERT parameters. 01:08:43.720 |
And the BERT parameters help you do better at your task. 01:08:47.240 |
And in particular, they might help you generalize 01:08:50.680 |
to things that are sparsely represented in your task-specific training data. 01:08:56.280 |
Because they've learned so much about the world in their pre-training phase. 01:09:05.880 |
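A minimal sketch of that setup, loading BERT and putting a classifier on top for fine-tuning (assumes the Hugging Face transformers library; the class name, label ids, and examples are made up, and the course code has its own, more complete version):

```python
# Minimal BERT fine-tuning sketch: a pretrained encoder with a small
# classifier head; all parameters are updated by backprop.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class BertSentimentClassifier(nn.Module):
    def __init__(self, n_classes=3, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.bert.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] token representation
        return self.classifier(cls)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertSentimentClassifier()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()

batch = tokenizer(["great movie", "awful battery"], padding=True, return_tensors="pt")
labels = torch.tensor([2, 0])  # made-up label ids for positive / negative

logits = model(batch["input_ids"], batch["attention_mask"])
loss = loss_fn(logits, labels)
loss.backward()    # gradients flow into BERT's parameters too
optimizer.step()
```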
I'm a little worried that we're moving into a future 01:09:08.280 |
in which fine-tuning is just again using an OpenAI API. 01:09:12.600 |
But you all will definitely learn how to do much more than this, 01:09:16.520 |
even if you fall back on doing this at some point. 01:09:18.760 |
There, what you're doing is some lightweight version of fine-tuning. 01:09:28.360 |
It's just that the starting point knows so much about language and the world. 01:09:47.180 |
Going back to the sub-word splitting of the longer word, 01:09:52.600 |
is there a bias we are imposing by splitting that particular way, 01:09:56.600 |
or is that also driven by the model itself? 01:10:02.120 |
So what gets imposed as a modeling bias in that sense comes from the algorithms 01:10:14.120 |
that you can run for doing that sub-word tokenization. 01:10:16.760 |
You'll see this as you read papers and as I talk. 01:10:24.680 |
All of them are attempts to learn a kind of optimal way to tokenize the data 01:10:30.600 |
based on things that tend to co-occur a lot together. 01:10:39.240 |
As someone who's interested in the morphology of languages, 01:10:46.600 |
I find this especially interesting if you think about languages with very rich morphology. 01:10:50.120 |
You might have an intuition that you want a tokenization scheme 01:10:52.680 |
that reproduces the morphology of that language, 01:10:55.080 |
that splits a big word with all its suffixes say, 01:10:58.360 |
down into things that look like the actual pieces, 01:11:04.040 |
well, the best of these schemes should come close to that, right? 01:11:07.800 |
And that could be an important and useful bias that you impose. 01:11:16.760 |
Can you elaborate on what happens when we do fine-tuning to the original model? 01:11:35.640 |
There's like an easy answer and a hard answer. 01:11:37.480 |
So the easy answer is that you are simply back-propagating the error signal, 01:11:43.720 |
you know, from the output comparison with the true label, 01:11:46.280 |
back through all the parameters of the model. 01:11:51.160 |
So as you fine-tune on your sentiment task, all of the BERT parameters can get updated. 01:11:57.320 |
And then of course, there are variants of that 01:11:59.000 |
where you update just some of the BERT parameters and freeze the rest. 01:12:03.560 |
But the idea is you have a smart initialization. 01:12:19.240 |
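As a rough sketch of that lighter-weight variant, freezing the pretrained encoder and training only the classifier head (this reuses the hypothetical `BertSentimentClassifier` sketched above):

```python
# Lightweight variant: freeze BERT and train only the classifier head.
# `BertSentimentClassifier` is the illustrative class from the earlier sketch.
import torch

model = BertSentimentClassifier()

for param in model.bert.parameters():
    param.requires_grad = False  # keep the pretrained weights fixed

# Only the classifier head's parameters receive gradient updates.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)
```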
That could actually connect with the explainability stuff, 01:12:21.640 |
like what adjustments are happening to the network, 01:12:30.680 |
Are there lightweight versions of the fine-tuning 01:12:35.960 |
that get a better balance from the pre-training? 01:13:01.800 |
oh, just change the model a little bit, 01:13:12.200 |
that you could control a kind of out-of-control car. 01:13:30.280 |
But it's an art and a science at the same time. 01:13:34.760 |
My hope is that Sid, who's going to do a hands-on session, 01:13:38.360 |
imparts some of his own hard-won lessons to you 01:13:49.080 |
and your optimizer and other things you can fiddle with, 01:13:52.040 |
and hope that it steers in the direction you want. 01:14:03.240 |
why don't they work super well for language models? 01:14:06.840 |
And like, I guess the sentiment that you had was like, 01:14:09.720 |
oh, kind of just like put attention towards anything 01:14:27.560 |
we were actually making it harder for our models, 01:14:27.560 |
because we were putting them in this bad initial state. 01:14:30.200 |
Now we can look inside these models to see if we can see what tree structures they've induced. 01:14:41.640 |
And another aspect of this is that I feel like 01:14:59.560 |
And so we want those all simultaneously represented. 01:15:02.920 |
these powerful models we're talking about could do that. 01:15:08.200 |
for helping us learn what the right structures are, 01:15:23.960 |
Is it in the same domain, or is it something otherwise? 01:15:23.960 |
This is a great example of something that sounds small, 01:15:45.800 |
Do you do it based on a strong bias that you have 01:15:57.480 |
that that low-level tokenization choice will influence 01:16:06.200 |
And so a paper that evaluated a bunch of schemes 01:17:01.240 |
We'll resolve the questions that we got back there. 01:17:03.320 |
You'll see much more of these attention connections.