Back to Index

Stanford XCS224U: Natural Language Understanding | Course Overview, Part 2 | Spring 2023


Transcript

All right. Welcome back everyone. Day two. Got a lot we want to accomplish today. What I have on the screen right now is the home base for the course. This is our public website and you could think of it as kind of a hub for everything that you'll need in the course.

You can see along the top here we've got some policy pages. There's a whole page on projects. There's a page that provides an index of background materials, YouTube screencasts, slides, hands-on materials in case you need to fill in some background stuff. Notice also I do a podcast that actually began in this course last year, and I found it so rewarding that I just continued doing it all year.

So new episodes continue to appear. If you have ideas for guests for this podcast, feel free to suggest them. I'm always looking for exciting people to interview, and I think the back episodes are also really illuminating. That's along the top. Then over here on the left, you've got one-stop shopping for the various systems that we have to deal with.

You've got our Ed forum for discussion. If you're not in there, let us know. We can get you signed up. Canvas is your home for the screencasts and also the quizzes, and I guess there's some other stuff there. Gradescope is where you'll submit the main assignments, including your project work, and also enter the bake-offs.

Then we have our course GitHub, and that is the course code that we'll be depending on for the assignments, and that I hope you can build on for the original work that you do. If you need to reach us, you can use the discussion forum, but we also have this staff email address that is vastly preferred to writing to us individually.

It really helps us manage the workload and know what's happening if you either ping us on the discussion forum, private posts, public posts, whatever, or use that staff address. In the middle of this page here, we've got links to all the materials. The first column is slides and stuff like that, and also notebooks.

The middle column, it's core readings mostly. I'm not presupposing that you will manage to do all of this reading because there is a lot of it, but these are important and rewarding papers, and so at some point in your life, you might want to immerse yourselves in them. But I'm hoping that I can be your trusted guide through that literature.

Then on the right, you have the assignments. That's the website. Questions, comments, anything I could clear up? I have time budgeted later to review the policies and required work in a bit more detail. But if there are questions now, I'm happy to take them. Yes. For the quizzes, are the quizzes doable on the day that they become available, or do we need the course material all the way up through the due date?

That is a good question. It's going to depend on your background. But in the worst case, if this is all brand new to you, you might not feel like you can confidently finish the quiz until that final lecture in the unit. Like this one is all about transformers. All of the answers are embedded in this handout here.

But if you want to hear it from me, you might not hear that until next Thursday, but that gives you another five days. Perfect. Thank you. Yes. You mentioned past projects are available. Where can we find them? Right. That must be here. I think I've got an index of past projects behind a protected link, which will depend on you being enrolled.

If you're not enrolled, we can get you past that little hurdle. But I did get permission to release some of them. So somewhere on this page is a link to X, oh, there it is, exemplary projects. There's another list at the GitHub projects.md page, which is also linked somewhere in here, of published work, and that stuff you could download.

The private link gives you the actual course submission. That could be an interesting exercise to compare the paper they did in here with the thing they actually published. I'll emphasize again that that will be interesting because of how much work it typically takes to go from a class project to something that makes it onto the ACL Anthology.

But that's of course an exciting journey. Oh yeah, and if you haven't already, do this course setup. It's very lightweight. You get your computing environment set up to use our code. Actually, this is a sign of the changing times: I also exhort you to sign up for a bunch of services. Colab, where you might consider getting a Pro account; for about $30 over the course of the entire quarter, you could get a lot more compute on Colab, including GPUs. Also the Amazon equivalent, SageMaker Studio. In addition, an OpenAI account and a Cohere account. Both of those have free tiers. For Cohere, you get really rate limited, and for OpenAI, they give you $5.

You could consider spending a little bit more. I do think you could do all our coursework for under those amounts. I think that for OpenAI, you could still have lots of accounts if you wanted to, each one getting $5. It used to be $18 and now it's $5, so we know what's coming.

But embrace it while you can. Also, I'll say, I'm pretty well confident that we'll get a bunch of credits from AWS Educate for you to use EC2 machines. So more details about that in a little bit. If you want to follow along, let's head to this one. This is our slideshow from last time.

I do just want to review some things. What we did last time is I tried to immerse us in this weird and wonderful moment for AI and give you a sense for how we got here. Then we talked about the first two units, transformers and retrieval augmented in-context learning.

I think that is all wonderful stuff. I expect you all to do lots of creative and cool things in that space. But it's important for me to continue this slideshow because there is more to our field than just those big language models and prompting. There are lots of important ways to contribute beyond that.

So let me take a moment and just give you an indication of what I have in mind there. Our third main course unit, I've called compositional generalization. This is brand new. We're going to focus on the COGS benchmark, which is a relatively recent synthetic dataset that is designed to stress test whether models have really learned systematic solutions to language problems.

So the way COGS works is we have essentially a semantic parsing task. The input is a sentence like Lena gave the bottle to John, and the task is to learn how to map those sentences to their logical form, which are these logical representations down here. The interesting thing about COGS is that they've posed hard generalization tasks.

For example, in training, you might get to see examples where Lena here is in subject position, and then at test time, you see Lena in object position. Or at train time, you might see Paula as a name but in isolation, and the task is to have the system learn how to deal with Paula as a subject of a sentence like Paula painted a cake.

Or object PP to subject PP. So at train time, you see Emma ate the cake on the table, where on the table is inside the direct object of the sentence. Then at test time, you see the cake on the table burned, where on the table is now a subject.
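To make the setup concrete, here is a hypothetical, much-simplified sketch of COGS-style items in Python. The predicate notation is illustrative only; it is not the dataset's actual logical-form syntax, and the variable names are mine.

```python
# Hypothetical, simplified COGS-style items: each pairs a sentence with a
# logical-form string, and the generalization split pairs a structure seen
# in training with a structurally novel variant seen only at test time.
train_examples = [
    ("Lena gave the bottle to John",
     "give(agent=Lena, theme=the_bottle, recipient=John)"),
    ("Emma ate the cake on the table",          # PP modifying the direct object
     "eat(agent=Emma, theme=the_cake_on_the_table)"),
]

generalization_examples = [
    ("The cake on the table burned",            # object PP -> subject PP
     "burn(theme=the_cake_on_the_table)"),
]
```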

These seem like dead simple generalization tasks, and the sentences are very simple. But here's the punchline. This is a kind of accumulated leaderboard of a lot of entries for COGS. If you look all the way on the right, you can see systems are doing pretty well. It is impressive that they can go from these free-form sentences into those very ornate logical forms.

Okay, but look at this column. This is a column of zeros, object PP to subject PP. It looked really simple, and that's just the task of learning from Emma ate the cake on the table and predicting the cake on the table burned. Why are all these brand new systems getting zero on this split?

That shows first of all that this is a hard problem. Now, we are going to work with a variant that we created of COGS called ReCOGS. This was done with my student Zen Wu and with Chris Manning. It's brand new work. We think that in part, all those zeros derive from there being some artifacts in COGS.

So it was made kind of artificially hard and also artificially easy in some ways. So in this class, we're going to work with ReCOGS, which applies some systematic meaning-preserving transformations to the original to create a new dataset that we think is fairer. But it still remains incredibly hard.

Systems can get traction where before they were getting zero, so we know there's signal. And we have more confidence that this is testing something about semantics. And then the punchline remains the same. This is incredibly hard for our systems, even our best systems. There needs to be some kind of breakthrough here for us to get our systems to do well even on these incredibly simple sentences.

So I am eager to see what you all do with this problem. You're seeing a picture here of the kind of best we could do, which is a little bit better than what was in the literature previously, but certainly not a solved task. Right. So that will culminate in this homework and bake-off, our third one.

From there, the course work opens up into your projects. We're done with the regular assignments and we go through the rhythm of lit review, experiment protocol, which is a special document that kind of lays down the nuts and bolts of what you're going to do for your paper, and then the final paper itself.

In the spirit of that, what we do in our course together is think about topics that will supercharge your own final project papers. The first topic that comes to mind for me there is better and more diverse benchmarks. We need measurement instruments to get reliable estimates of how well our systems are doing, and that implies having good benchmarks.

In this context, I really like to invoke this famous quotation from the explorer Jacques Cousteau. He said, "Water and air, the two essential fluids on which all life depends": that's datasets for our field. And you can see that Cousteau continued, "have become global garbage cans." That might concern us about what's happening with our datasets.

I don't think it's that bad, but still you could have that in the back of your mind that we need these datasets we create to be reliable high-quality instruments. The reason for that is that we ask so much of our datasets. We use them to optimize models when we train on them.

We use them, crucially, and this is increasingly important, to evaluate our models, our biggest language models that are getting all the headlines. How well are they actually doing? We need datasets for that. We use them to compare models, to enable new capabilities via training and testing, and to measure progress as a field.

It's our fundamental barometer for this, and of course for basic scientific inquiry into language and the world. This is a long and important list, and it shows you that datasets are really central to what we're doing. So I'm exhorting you as you can tell to think about datasets, especially ones that would be powerful as evaluation tools in the context of this course.

I am genuinely worried about the new dynamic where we are evaluating these big language models essentially on Twitter, where people have screenshots of some fun cases that they saw, and we all know that we're not seeing a full representative sample of the inputs. We're seeing the worst and the best, and it's impossible to piece together a scientific picture from that.

My student, Omar Khattab, recently observed, I think this is very wise, that we have moved into this era in which designing systems might be really easy. It might be a matter of writing a prompt, but figuring out whether it was a good system is going to get harder and harder, and for that we need lots of evaluation datasets.

You could think about this slide that I showed you from before. We have this benchmark saturation with all of these systems now increasingly quickly getting above our estimate of human performance, but I asked you to be cynical about that as a measure of human performance. Another perspective on this slide could be that our benchmarks are simply too easy, because it is not like if you interacted with one of these systems, even the most recent ones, it would feel superhuman to you.

Partly what we're seeing here is a remnant of the fact that until very recently, our evaluations had to be essentially machine tasks, not human tasks, and we had humans do machine tasks to get a measure of human performance. Maybe we're moving into a new and more exciting era. We're going to talk about adversarial testing.

I've been involved with the Dynabench effort. This is a kind of open-source effort to develop datasets that are going to be really hard for the best of our models, and I think that's a wonderful dynamic as well. That leads into this related topic of us having more meaningful evaluations.

Here's a fundamental thing that you might worry about throughout artificial intelligence. All we care about is performance for the system, some notion of accuracy. I've put this under the heading of Strathern's law: when a measure becomes a target, it ceases to be a good measure. If we have this consensus that all we care about is accuracy, we know what will happen.

Everyone in the field will climb on accuracy. We know from Strathern's law that that will distort the actual rate of progress by diminishing everything else that could be important to thinking about these AI systems. Relatedly, this is a wonderful study from Birhane et al. I've selected a few of the values encoded in ML research, which they identified via a very extensive literature survey.

Impressionistically, here's the list. At the top, dominating everything else, you have an obsession with performance, as I said. Then, way down on the list though technically in second place, you have efficiency, and things like explainability, applicability in the real world, and robustness. As I go farther down the list here, the ones that are colored there, they actually should be in the tiniest of type.

Because if you think about the field's actual values as reflected in the literature, you find that these things are getting almost no play. I think things are looking up, but it's still the case that it's wildly skewed toward performance. But those things that I have down there in purple and orange, beneficence, privacy, fairness, and justice, those are incredibly important things and more and more important as these systems are being deployed more widely.

So we have to, via our practices and what we hold to be valuable, elevate these other principles. You all could start to do that by thinking about proposing evaluations that would elevate them. That could be tremendously exciting. The final point here is that we could also have a move toward leaderboards that embrace more aspects of this.

Again, to help us move away from the obsession with performance, we should have leaderboards that score us along many dimensions. In this context, I've really been inspired by work that Kawin did on what he calls Dynascoring, which is a proposal for how to synthesize across a lot of different measurement dimensions.

To give you a quick illustration, here I have a table where the rows are question answering systems, and the columns are different things we could measure. Just a sample of them: performance, throughput, memory, fairness, robustness, and we could add other dimensions. With the current Dynascoring that you're seeing here, where most of the weight is put on performance, the DeBERTa system is the winner in this leaderboard competition.

But that's standard. But what if we decided that we cared much more about fairness for these systems? So we adjusted the Dynascoring here to put five on fairness, keep a lot on performance, but diminish the other measures there. Well, now the ELECTRA-Large system is in first place. Which one was the true winner?
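As a rough illustration of how a weighted aggregate like this behaves, here is a small Python sketch. The per-dimension scores and the weightings are made up for the example; they are not taken from the actual Dynaboard leaderboard.

```python
# Made-up per-dimension scores for two QA systems (illustrative only).
systems = {
    "DeBERTa": {"performance": 0.90, "throughput": 0.60, "memory": 0.55,
                "fairness": 0.50, "robustness": 0.70},
    "ELECTRA-Large": {"performance": 0.86, "throughput": 0.50, "memory": 0.45,
                      "fairness": 0.85, "robustness": 0.65},
}

def dynascore(scores, weights):
    """Weighted average of per-dimension scores (a Dynascore-style aggregate)."""
    total = sum(weights.values())
    return sum(weights[d] * scores[d] for d in weights) / total

performance_heavy = {"performance": 5, "throughput": 1, "memory": 1,
                     "fairness": 1, "robustness": 1}
fairness_heavy = {"performance": 2, "throughput": 0.5, "memory": 0.5,
                  "fairness": 5, "robustness": 1}

for name, scores in systems.items():
    print(name,
          round(dynascore(scores, performance_heavy), 3),  # first system ahead here
          round(dynascore(scores, fairness_heavy), 3))     # second system ahead here
```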

I think the answer is that there is no true winner here. What this shows is that all of our leaderboards are reflecting some ordering of our preferences, and when we pick one, we are instilling a particular set of values on the whole enterprise. But this is also creating space for us.

This is I think part of Kawin's vision for Dynascoring, that we could design leaderboards that were tuned to the things that we want to do in the world, via the weighting and the columns that we chose, and evaluate systems on that basis. Yeah. >> What does fairness mean to you in this field?

How do you measure something like that? >> What is fairness? Yeah. Well, that's a whole other aspect to this. So if we're going to start to measure these dimensions, like we're going to have a column for fairness, we better be sure that we know what's behind that. I can tell you there needs to be a lot more work on our measurement devices, our benchmarks, for assessing fairness.

Because all of those things are incredibly nuanced, multi-dimensional concepts, and so the idea would be to bring that in as well. Yeah. Throughput and memory, maybe those are straightforward, but fairness is going to be a challenging one. But that's not to say that it's not incredibly important. Then finally, to really inspire you, I do feel like this is the first time I could say this in this course.

I think we're moving into an era in which our evaluations can be much more meaningful than they ever were before. Assessment today or yesterday is really one-dimensional, that's the performance thing I mentioned. It's largely insensitive to the context. We always pick F1 or something as the only thing regardless of what we're trying to accomplish in the world.

The terms are largely set by us researchers. We say it's F1 and everyone follows suit because we're supposed to be the experts on this, and it's often very opaque and tailored to machine tasks. I've already complained about that. Our estimates of human performance end up being very different from what you would think that phrase means.

In this new future that we could start right now, our assessments could certainly be high dimensional and fluid. I showed you a glimpse of that with Dynascoring. I think that's incredible. They could also in turn be highly sensitive to the context that we're in. If we care about fairness and we care about efficiency, and we put those above performance, we're going to get a very different prioritization of the systems and so forth and so on.

Then in turn, the terms of these evaluations could be set not by us researchers, who are doing our very abstract thing, but rather the people who are trying to get value out of these systems, the people who have to interact with them. Then the judgments could ultimately be made by the users.

They could decide which system they want to choose based on their own expressed preferences. Then in turn, maybe we could have our evaluations be much more at the level of human tasks. Right now, for example, we might insist that some human labelers choose a particular label for an ambiguous example, and then we assess how much agreement they have.

Whereas the human thing is to discuss and debate, to have a dialogue about what the right label is in the face of ambiguity and context dependence. Well, now we could have that kind of evaluation, right? Maybe we evaluate systems on their ability to adjudicate in a human-like way on what the label should be.

Hard to imagine before, but now probably something that you could toy around with a little bit with one of these large language model APIs right now if you wanted. I think we could really embrace that. I have a couple more topics, but let me pause there. Do people have thoughts, questions, insights about benchmarks and evaluation?

I hope you're seeing that it's a wide open area for final projects. Yeah. Is there more of a move to get specialists in other fields, for example linguistics or related fields, to help make benchmarks? What a wonderful question. You asked, is there a move to have more experts participate in evaluation?

I hope the answer is yes, but let's make the answer yes. That would be the point, right? Because what we want is to provide the world with tools and concepts that would allow domain experts, people who actually know what's going on in the domain where we're trying to use this AI technology, to make these decisions and make adjustments and so forth based on what's working and what isn't.

Yeah, that should be our goal. Then what we as researchers can do is provide things like what Kawin provided with Dynascoring, which is the intellectual infrastructure to allow them to do that. Yeah. Then you all probably have lots of domain expertise that intersects with what we're doing, but maybe comes from other fields.

You could participate as an NLU researcher and as a domain expert to do a paper that embraces both aspects of this. Maybe you propose a kind of metric that you think really works well for your field of economics or sociology or whatever you're studying, right? Yeah, health, medicine, all these things, incredibly important.

I saw another hand go up. Yeah. I think one of the challenges we're going to face is that it's really expensive to collect human or more sophisticated labels. As an example, there's a paper that came out recently, Med-PaLM, where they trained, or actually really just tuned, an LLM to respond to medical questions from the USMLE and other medicine-related exams.

They also had a section for long-form answers. The short-form answers are multiple choice, so those can be scored automatically. The long-form answers, they actually had doctors evaluate, and that's really expensive. They could only collect so many labels, even with a large staff of doctors. So I think there's a balance: calculating something like throughput is super easy, it's just counting.

But evaluating how valuable a search result is, that requires a human, that's a little more expensive. I'm curious how we can balance the cost. Yeah. The issue of cost is going to be unavoidable for us. I think we should confront it as a group. This research has just gotten more expensive and that's obviously distorting who can participate and what we value.

It's another thing I could discuss under this rubric. For your particular question though, I remain optimistic, because I think we are in an era now in which you could do a meaningful evaluation of a system with no training data and rather just a few dozen, let's say 100, examples for assessment.

If you're careful about how you use it, that is, if you don't develop your system on it and so forth. But even if you say, "Okay, I'm going to have 100 for development, 100 that I keep for a final evaluation to really get a sense for how my system performs on new data," that's only 200 examples, and I feel like that's manageable, even if we need experts.

The point would be that that might be money well spent. It might be that if we can get some experts to provide the 200 cases, we have a really reliable measurement tool. I could never have said this 10 years ago because 10 years ago, the norm was to have 50,000 training instances and 5,000 test instances, and now your cost concern really kicks in.

But for the present moment, I feel like a few meaningful cases could be worth a lot. You all could construct those datasets. Again, before I used to give the advice, don't create your own dataset in this class, you'll run out of time. But now I can give the advice, no, if you have some domain expertise in the life sciences or something and you want a dataset, create one to use for assessment.

It'll shape your system design, but that could be healthy as well. Another big theme, explainability. This also relates to our increased impact. If we're going to deploy these models out in the world, it is really important that we understand them. Right now, we do a lot of behavioral testing.

That is, we come up with these test cases and we see how well the model does. But the problem, which is a deep problem of scientific induction, is that you can never come up with enough cases. The world is a diverse and complex place, and no matter how many things you dreamed up when you were doing the research, if you deploy your system, it will encounter things that you never anticipated.

If all you've done is behavioral testing, you might feel very nervous about this because you might have essentially no idea what it's going to do on new cases. The mission of explainability research should be to go one layer deeper and understand what is happening inside these models so that we have a sense for how they'll generalize to new cases.

It's a very challenging thing because we're thinking about these enormous and incredibly opaque models. You can even find people saying in the literature that they're skeptical that we can ever understand what's happening with these systems, but I am optimistic. They are closed, deterministic systems. They may be complex, but we're smart.

We can figure out what they have learned. I really have confidence in this. The importance of this is really that we have these broader societal goals. We want systems that are reliable, and safe, and trustworthy, and we want to know where we can use them, and we want them to be free from bias.

It seems to me that all of these questions depend on us having some true analytic guarantees about model behaviors. It seems very hard for me to say, "Trust me, my model is not biased along some dimension," if I don't even have any idea how it works. The best I could say is that it wasn't biased in some evaluations that I ran, but I just emphasize for you that that's very different from being evaluated by the world where a lot of things could happen that you didn't anticipate.

We'll talk about a lot of different explanation methods. I think that these methods should be human interpretable. That is, we don't want low-level mathematical explanations of how the models work. We want this expressed in human-level concepts so that we can reason about these systems. We also want them to be faithful to the underlying model.

We don't want to fabricate human interpretable but inaccurate explanations of the model. We want them to be true to the underlying systems. These are two very difficult standards to meet together. I can make them human interpretable if I offer you no guarantees of faithfulness, but then I'm just tricking myself and you.

I can make them faithful by making them very technical and low-level. We could just talk about all the matrix multiplications we want, but that's not going to provide a human-level insight into how the models are working. So together though, we need to get methods that are good for both of these, right?

Concept-level understanding of the causal dynamics of these systems. We'll talk about a lot of different explanation methods. I'll just do this quickly. Train/test evaluation, that is, the behavioral thing, remains very important for the field. We'll talk about probing, which was an early and very influential and very ambitious attempt to understand the hidden representations of our models.

We'll talk about attribution methods. These are ways to assign importance to different parts of the representations of these models, both input and output, and also their internal representations. Then we're going to talk about methods that depend on active manipulations of model internal states. You'll be able to tell that I strongly favor the active manipulation approach because I think that that's the kind of approach that can give us causal insights, and also richly characterize what the models are doing, and that's more or less the two desiderata that I just mentioned for these methods.

But there's value to all of these things, and we'll talk about all of them, and you'll get hands-on with all of them, and all of them can be wonderful for your analysis sections of your final papers. We might even talk about interchange intervention training, which is when you use explainability methods to actually push the models to become better, more systematic, more reliable, maybe less biased along dimensions that you care about.

That's my review of the core things. Questions or comments? I have a few more low-level things about the course to do now. Yeah. I know we're going to get into all of the explanation methods in a lot of detail later on, but can you give a quick example just so that we have some sense of what they are?

Probing is training supervised classifiers on internal representations. This was just the cool thing to say, "Hey, I'll just look at layer three, column four of my BERT representation. Does it encode information about animacy or part of speech?" The answer seems to be yes. I think that was really eye-opening that even if your task was sentiment analysis, you might have learned latent structure about animacy.
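As a minimal sketch of the probing idea, assuming a Hugging Face BERT checkpoint and a tiny made-up animacy dataset, you could fit a simple classifier on the token representations from one layer:

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

# Tiny made-up probe dataset: (sentence, target word, animacy label).
examples = [("the dog slept", "dog", 1), ("the rock fell", "rock", 0),
            ("the cat ran", "cat", 1), ("the table broke", "table", 0)]

def layer_rep(sentence, word, layer=3):
    """Hidden representation of `word` at the given layer."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc, output_hidden_states=True).hidden_states[layer][0]
    tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
    return hidden[tokens.index(word)].numpy()

X = [layer_rep(s, w) for s, w, _ in examples]
y = [label for _, _, label in examples]

probe = LogisticRegression(max_iter=1000).fit(X, y)  # the probe itself
print(probe.score(X, y))  # in practice, report accuracy on held-out words
```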

That's getting closer to the human level concept stuff. Problem with probing is that you have no guarantee that that information about animacy here has any causal impact on the model behavior. It could be just kind of something that the model learned by the by. Attribution methods have the kind of reverse problem.

They can give you some causal guarantees that this neuron plays this particular role in the input-output behavior, but it's usually just a kind of scalar value. It's like 0.3 and you say, "Well, what does the 0.3 mean?" And you say, "It means that it's that much important." But nothing like, "Oh, is it animate?" Or none of those human-level things.

And then I think the active manipulation thing, which is like doing lots of brain surgeries on your model, can provide the benefits of both probing and attribution. Causal insights, but also a deep understanding of what the representations are. And there's a whole family of those. It's a very exciting part of the literature.
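Here is a minimal sketch of one such active-manipulation step, an interchange-style intervention done with a forward hook, assuming the small Hugging Face checkpoint prajjwal1/bert-tiny. The layer, the token position, and the two sentences are arbitrary choices just to show the mechanics, not a recipe from the course.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("prajjwal1/bert-tiny")
model = AutoModel.from_pretrained("prajjwal1/bert-tiny").eval()

LAYER, POSITION = 1, 2   # arbitrary layer and token position to intervene on

# 1. Run a "source" input and record that layer's hidden state.
src = tok("Lena gave the bottle to John", return_tensors="pt")
with torch.no_grad():
    # hidden_states[0] is the embedding layer, so layer L is at index L + 1.
    src_hidden = model(**src, output_hidden_states=True).hidden_states[LAYER + 1]

# 2. During a second "base" run, overwrite one position of that layer's output
#    with the source representation, and look at the downstream effect.
def interchange(module, inputs, output):
    hidden = output[0].clone()
    hidden[:, POSITION, :] = src_hidden[:, POSITION, :]
    return (hidden,) + output[1:]

handle = model.encoder.layer[LAYER].register_forward_hook(interchange)
base = tok("Paula painted a cake", return_tensors="pt")
with torch.no_grad():
    counterfactual = model(**base).last_hidden_state
handle.remove()
print(counterfactual.shape)
```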

Yeah. I have a question going back to COGS. So I guess, why would we want to use the COGS dataset if we're testing for generalization? Why can't we just prompt a language model with a word that it's never seen before, try to induce some format where it sees the word in subject position, then get it to put the word in object position, and see how well it does?

Oh, yeah. No. So for COGS, for your original system, it could be that you try to prompt a language model. Zen did a bunch of that as part of the research. It was only okay. But maybe there's a version of that where you prompt in the right way with the right kind of instructions, and then it does solve COGS.

That would be wonderful because that would suggest to me that those models, whatever model you prompted, have internal representations that are systematic enough to have kind of a notion of subject and a notion of object and verb and all of that linguistic stuff, and that would be very exciting.

Yeah. The cool thing about COGS is that I think it's a pretty reliable measurement device for making such claims. Yeah. How transferable is this discussion to languages other than English? Like, I wonder if we should be concerned about the very tight coupling between the properties of English as a language and all our advancement in NLP?

Well, I mean, I hope that a lot of you do projects on multilingual NLP, low-resource settings, and so forth. I think in a way, we live in a golden age for that research as well. There's more research on more languages than there was 10 years ago, and that's certainly a positive development.

The downside is that it's all done with multilingual representations, multilingual BERT, and so forth, and they tend to do much better on English tasks than every other task. So that obviously feels suboptimal. But again, that's the same story of sudden progress with a lot of mixed feelings that I have about a lot of these topics.

In the interest of time, let's press on a little bit. I think I just wanted to skip to the course mechanics. This is at the website, but there it is. That's the breakdown of required pieces. You can see that it has a kind of strong emphasis toward the three parts that are related to the final project.

But the homeworks are also really important and the quizzes less so. But I think they're important enough that you'll take them seriously. It's fully asynchronous. It's wonderful to see so many of you here, and I am eager to interact with you here if possible, but also on Zoom for office hours and stuff.

Please attend office hours if you just want to chat. One of my favorite games to play in office hours is a group comes with three project ideas and I rank them from least to most or most to least viable for the course. It's a fun game for me, and I think it always illuminates some things about the problems.

Then we have continuous evaluation. So you have the three assignments, the quizzes, and then the project work. There's no final exam; we just want you to be focused on the final project at that point. I think I'll leave this aside. We can talk about the grading of the original systems a bit later.

Then you have the project work, some links here, exceptional final projects, and some guidance. These are the two documents I mentioned before. I'll just say again that this is the most important part of the course to me and the thing that's special. I'll say again also, we have this incredibly accomplished teaching team this year, diverse interests, and they all have done incredible research on their own.

I've learned a ton from them and from their work, and I encourage you to do the same. So seek them out in office hours and take advantage of their mentorship for the work you do. Then here are some crucial course links; I kind of covered that before.

The quizzes I think I've covered as well, and these policies are all at the website. Right. And so the setup, do that if you haven't already. Make sure you're in the discussion forum. We want you to be connected with the kind of ongoing discourse for the class. Do quiz zero as soon as you can, so that you know your rights and responsibilities.

And then I think right now we should check out the homework, the sentiment homework to make sure you're oriented around that before we dive into transformers. Questions about that stuff? It's all at the website, but I've kind of evoked it for you in case it raised any issues. All right.

Let's look briefly at the first homework. I feel like we should kick it off somehow, and it is maybe an unusual mode for homeworks. So feel free to ask questions. This is kind of cool. So this link will take you to the GitHub, which I think you're probably all set up with on your home computers.

But you might want to work with this in the cloud, and this works well. You could just click "Open in Colab." And I think I've done a pretty good job of setting it up so that it will set itself up with the installs that you need and the course repo and so forth.

I would actually be curious to know whether there are bumps along the road to getting this to just work out of the box in Colab. I do encourage this, because if you're ambitious, you'll probably want GPUs, and this is a good, inexpensive way to get them. It's also a pretty nice environment to do the work in.

Zoom in here. Along the left, you can see the outline. And that's actually kind of a good place to start. So we're doing multi-domain sentiment. And what I mean by that is, you're encouraged to work with three datasets: DynaSent round one, DynaSent round two, and the Stanford Sentiment Treebank.

These are all sentiment tasks, and they all have ternary labels: positive, negative, and neutral. But I'm not guaranteeing you that those labels are aligned in the semantic sense. In fact, I think that the SST labels are a bit different from the DynaSent ones. But certainly, the underlying data are different because DynaSent is like product reviews and the Stanford Sentiment Treebank is movie reviews.

But there are further things. So DynaSent round one is hard examples that we harvested from the world, from the Yelp academic dataset. Whereas DynaSent round two is actually annotators working on the Dynabench platform, which I mentioned just a minute ago, trying to fool a really good sentiment model. So the DynaSent round two examples are hard.

They involve things like non-literal language use and sarcasm, and other things that we know challenge current-day models. So you have these three datasets. Then there are really two main questions. For the first question, I'm just pushing you to develop a simple linear model with sparse feature representations. This is a kind of more traditional mode.

If you need a refresher on this, this is a chance to get it. If you feel stuck on this question, I think we should talk about how to get you up to speed for the course. But for a lot of you, especially if you've been immersed in NLP, this should be a pretty straightforward question.

It leads to a pretty good system. So you do a feature function, you write a function for training models, and a function for assessing models. For each one of these questions, what you do is complete a function that I started. There is not a lot of coding. This is mainly about starting to build your own original system.

For every single one of these questions, there's a test that you can run. I like unit tests a lot. I think we should all write more unit tests. The advantage of the test for me is that if there was any unclarity about my own instructions, the test probably clears them up.

You also get a guarantee that if your code passes the test, you're in good shape. More or less the same tests run on Gradescope. So when you upload the notebook, if you got a clean bill of health at home, you'll probably do fine on Gradescope. If you don't, it might be because the Gradescope autograder has a bug.

Let me know about that. Those things always feel like they're just barely functioning. But the idea is that this is really not about me evaluating you. This is about you exercising the relevant muscles and building up concepts that will let you develop your own systems. I'm just trying to be a trusted guide for you on that.

So you do some coding and you have these three questions here. The result of doing those three questions is that you have something that could be the basis for your original system. It'd be pretty cool by the way if some people competed using just sparse linear models to show the transformers that there's still competition out there.

So that's the first question. Then the second one is the same way, except now we're focused on transformer fine-tuning, which is our main focus for this unit. I have a question here that pushes you to understand how these models tokenize data. It's really different from the old mode. You'll learn some Hugging Face code and you'll also learn some concepts.

Then I have a question that pushes you to understand what the representations are like. We're going to talk about them abstractly. Here you'll be hands-on with them. Then finally, you're going to finish up writing a PyTorch module where you fine-tune BERT. That is step one to a really incredible system I'm sure.

I've actually written the interface for you. So that given the course code and everything else, the interfaces for these things are pretty straightforward to write. All you have to do is write the module, and for completing the homework questions, you don't actually need heavy-duty computing at all because you don't do anything heavy-duty.
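For a rough sense of what such a module can look like, here is a minimal sketch, not the course's actual interface, assuming a small Hugging Face BERT checkpoint and a three-way sentiment label set:

```python
import torch.nn as nn
from transformers import AutoModel

class BertSentimentClassifier(nn.Module):
    """Hypothetical fine-tuning module: classify from BERT's [CLS] state."""

    def __init__(self, model_name="prajjwal1/bert-tiny", n_classes=3):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.bert.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_state = outputs.last_hidden_state[:, 0]   # the [CLS] token
        return self.classifier(cls_state)
```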

But when you get to the original system, that might be where you want to train a big monster model and figure out how to work with the computational resources that you have to get that done. This notebook is using TinyBERT, which is small, but you still need a GPU to do the work.

So you'll still want to be on Colab or something like that. Then I don't know how ambitious you're going to get for your original system. You can tell that I'm trying to lead you toward using question one and question two for your original system, but it's not required. If you want to do something where you just prompt GPT-4, maybe you'll win, I don't know.

I'm up for anything. It does need to be an original system, so you can't just download somebody else's code. If all you did was a very boring prompt structure, you wouldn't get a high grade on your original system. We're trying to encourage you to think creatively and explore. Then the final thing is you just enter this in a bake-off, and really that just means grabbing an unlabeled dataset from the web and adding a column with predictions in it.

Then you upload that when you submit your work to Gradescope. Then when everyone's submissions are in, we'll reveal the scores and there'll be some winners, and we'll give out prizes. I'm optimistic that we're going to have EC2 codes as prizes. That's always been fun because if you win a bake-off, you get a little bit more compute resources for your next big thing.

They don't want to hand out these codes anymore like they used to, because Cloud Compute is so important now, but I think I have an arrangement in place to get some. By the way, we give out prizes for the best systems and the most creative systems, and we have even given out prizes for the lowest scoring system.

Because if that was a cool thing that should have worked and didn't, I feel like you did a service to all of us by going down that route, and that deserves a prize. That's me trying to have a multi-dimensional leaderboard here, even as we rank all of you according to the performance of your systems.

That's my overview. Questions or comments or anything? All right. I propose then that we go to Transformers. So download the handout. By the way, these should be really good. So these slides, you'll get a version with fewer overlays to make it more browsable. All of these things up here are links.

So if you click on these bubbles, you can go directly to that part, and you can see that this is a kind of outline of this unit. Then there's also a good table of contents with good labels. So if you need to find things in what I admit is a very large deck, that should make it easier to do that.

You can also track our progress as we move through these things. So we dive in. Guiding ideas. What is happening with these contextual representations? Okay. This one slide here used to take two weeks for this course. And I've been trying to convey this. We have stopped doing that. The background materials are still at the website.

It was also the first two weeks of CS224n. We did them before they did them in CS224n, back before natural language understanding was all the rage. But they get there first now, and it is a more basic course. What I'm saying is, they do GloVe and Word2Vec, and we're going to dive right into transformers.

Here is my one-slide summary of this. Back in the old, old days, the dawning of the statistical revolution in NLP, the way we represented examples, let's say words in this case, was with feature-based sparse representations. And what I mean by that is that if you wanted to represent a word of a language, you might write a feature function that says, okay, yes or no on it referring to an animate thing, yes or no on it ending in the characters "ing", yes or no on it mostly being used as a verb, and so forth and so on.

And so all these little feature functions would end up giving you really long vectors of essentially ones and zeros that were kind of hand-designed and that would give you a perspective on a bunch of the dimensions of the word you were trying to represent. That lasted for a while, and then it kind of started to get replaced pre-Word2Vec and GloVe, with methods like pointwise mutual information or TF-IDF.

These methods had long been recognized as fundamental in the field of information retrieval, especially TF-IDF as a main representation technique for finding relevant documents for queries. It took a while for NLP people to realize that they would be valuable. But what you start to see here is that instead of writing all those feature functions, I'll just keep track of co-occurrence patterns in large collections of text.

And PMI and TF-IDF do this essentially just by counting, and then re-weighting some of the counts. But really it is the rawest form of distributional representation. That kind of got replaced, or this is sort of simultaneous in an interesting way, but paired with PMI and TF-IDF you have methods like principal components analysis, SVD, which is sometimes called latent semantic analysis, and LDA, latent Dirichlet allocation, a topic modeling technique.

So a whole family of these things that are essentially taking count data and giving you reduced dimensional versions of that count data. And the power of doing that is really that you can capture higher order notions of co-occurrence. Not just what I as a word co-occurred with, but also the sense in which I might co-occur with words that co-occur with the things I co-occur with.

These are your second-order neighbors, and you can imagine traveling out into this representational neighborhood here. And that turns out to be very powerful, because a lot of semantic affinities come not from just being neighbors with something, but rather from that whole network of things co-occurring with each other.

And what these methods do is take all that count data and compress it in a way that loses some information but also captures those notions of similarity. And then the final step which might actually be the kind of final step in this literature were learned dimensionality reduction things, autoencoders, Word2Vec, and GloVe.

And this is where you might start with some count data, but you have some machine learning algorithm that learns how to compute dense learned representations from that count data. So it's kind of like step three infused with more of what we know of as machine learning now. And I say it might be the end because I think now, for anything that you would do with this mode, you would probably just use contextual representations.

So this is the full story perhaps. And then here's the review if you want, right? And I think it is important to understand both the history and the technical details to really deeply understand what I'm about to dive into. So you might want to circle back if that was too fast.
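To connect the counting and dimensionality-reduction steps above, here is a tiny NumPy sketch with a made-up co-occurrence matrix: reweight the counts with positive PMI, then take a truncated SVD to get dense, reduced-dimensional word vectors.

```python
import numpy as np

# Made-up word-by-word co-occurrence counts (rows and columns are words).
counts = np.array([
    [10.0, 2.0, 0.0],
    [ 2.0, 8.0, 3.0],
    [ 0.0, 3.0, 6.0],
])

# Positive PMI reweighting of the raw counts.
total = counts.sum()
p_xy = counts / total
p_x = counts.sum(axis=1, keepdims=True) / total
p_y = counts.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore"):
    pmi = np.log(p_xy / (p_x * p_y))
ppmi = np.maximum(pmi, 0.0)

# Truncated SVD (the LSA-style step): dense, low-dimensional word vectors.
U, S, Vt = np.linalg.svd(ppmi)
k = 2
word_vectors = U[:, :k] * S[:k]   # each row is now a k-dimensional representation
print(word_vectors)
```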

Yeah. Is there any option to just one-hot encode your entire vocabulary? I think this is my understanding of what modern transformer-based models do. To one-hot encode the whole vocabulary? Yes. Well, just say a bit more about what you're doing. My understanding of how large language models encode individual words is that they have a list of all of their possible tokens.

They break it down into tokens. And then if it's token 337, you have a vector whose length is the number of tokens you have, like a vocabulary of 50,000, and you just one-hot encode which token that is. Hmm. Well, we're about to do this, so why don't I show you how they represent things.

And let's see if it connects with your question. Because it is different. Yeah, it is going to be very different. And the notion of token and the notion of type is about to get sort of complicated. Right. Before we do the technical part, just a little bit of context here about why I think this is so exciting.

I'm a linguist, right? And I was excited by the static vector representations of words, but it was also very annoying to me because they give you one vector per word. Whereas my experience of language is that words have multiple senses and it is hard to delimit where the senses begin and end.

Consider a verb like break, which I've worked on with my PhD student, Erika Petersen. The vase broke. That's one sense maybe. Dawn broke. That's the same form, broke. But that means something different. Entirely different. The news broke. Again, broke as the form there. But this being something more like was published or appeared.

Sandy broke the world record. It's very unlike the vase broke, right? Now it's like surpassing the limit. Sandy broke the law is a different sense yet again. That's some kind of transgression. The burglar broke into the house. That's break again, but now with a particle. And that means something different still.

The newscaster broke into the movie broadcast. That means it was interrupted. Very different again. We broke even means, I don't know, we ended up back at the same amount we started with. So how many senses of break are here? If I was in the old mode of static representation, would I survive with one break vector for all of these examples?

Or would I have one per example type? But then what about all the ones that I didn't list here? The number of senses for break starts to feel impossible to enumerate, if you just think about all the ways in which you encounter this verb. And there is some metaphorical core that seems to run through them.

But in the details, these senses are all very different. And this tells me that the sense of a word like break is being modulated by the context it is appearing in. And the idea that we would have one fixed representation for it, even if it's learned from data, is just kind of wrong from the outset.

Here's another example. We have a flat tire. But what about flat beer, flat note, flat surface? Maybe they have some metaphorical core, but those feel like at least two to four different senses for flat. Throw a party, throw a fight, throw a ball, throw a fit. All very different senses.

It's tragic to think we would have one throw that was meant to cover all of these examples, right? A crane caught a fish. A crane picked up the steel beam. That might feel like a standard sort of lexical ambiguity. And so maybe you can imagine that we have one vector for crane as a bird, and one for crane as a machine.

But is that going to work for the entire vocabulary? I suspect not. I saw a crane. We wouldn't even know what vector we were dealing with there, right? Which one would we pick? And now we have another problem on our hands, which is selecting the static vector based on contexts, right?

How are we going to do that? And this is a really deep thing. It's not just about the local kind of morphosyntactic context here. What about, are there typos? I didn't see any. So the sense of any there is like any typos, right? Versus are there bookstores downtown? I didn't see any.

Now the sense of any and the kind of elliptical stuff that comes after it is any bookstores. And now I hope you can see that the sense that words can have is modulated by context in the most extended sense. And having fixed static representations was never going to work in the face of all of this diversity.

We were never going to figure out how to cut up the senses in just the right way to get all of this data handled correctly. And the vision of contextual representation models is that you're not even going to try to do all that hard and boring stuff. Instead, you are just going to embrace the fact that every word could take on a different sense, that is, have a different representation depending on everything that is happening around it.

And we won't have to decide then which sense is in 1A and whether it's different from 1B and 1C and so forth. We will just have all of these token level representations. It will be entirely a theory that is based in words as they appear in context. For me as a linguist, it is not surprising at all that this turns out to lead to lots of engineering successes because it feels so deeply right to me about how language works.
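A quick way to see this in code, assuming a Hugging Face BERT checkpoint, is to pull out the token-level vector for broke in different sentences; the same word form gets a different representation in each context:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

def contextual_vector(sentence, word):
    """Top-layer vector for the first occurrence of `word` in `sentence`."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]
    tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
    return hidden[tokens.index(word)]

a = contextual_vector("the vase broke", "broke")
b = contextual_vector("dawn broke", "broke")
c = contextual_vector("sandy broke the world record", "broke")

cos = torch.nn.functional.cosine_similarity
print(cos(a, b, dim=0), cos(a, c, dim=0))   # same form, different vectors
```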

A brief history here. I just want to be dutiful about this. Make sure people get credit where it's due. November 2015, Dai and Le, that's a foundational paper where they really did what is probably the first example of language model pre-training. It's a cool paper to look at. It's complicated in some ways that are surprising to us now, and it is certainly a visionary paper.

And then McCann et al., this is a paper from Salesforce Research, which at the time was led by Richard Socher, who is a distinguished alum of this class. Proud of that. They developed the CoVe model, where what they did is train machine translation models. And then the inspired idea was that the translation representations might be useful for other tasks.

And again, that feels like the dawn of the notion of pre-training contextual representations. And then ELMo came. I mentioned ELMo last time. Huge breakthrough. Massive bidirectional LSTMs. And they really showed that that could lead to rich multipurpose representations. And that's where you really feel everyone reorienting their research toward these kind of models.

That's not a transformer-based one, though; that's built on LSTMs. And then we get GPT in June 2018 and BERT in October 2018. The BERT paper was published a long time after that, but as I said before, it had already achieved massive impact by the time it was published in 2019.

So that's why I've been giving the months here, because you can see there was this sudden uptick in the amount of interest in these things that happened around this time. And that led to where we are now. Another kind of interesting thing to think about, if you step back for the context here, is that we as a field have been traveling from high-bias models, where we decide a lot about how the data should look and be processed, toward models that impose essentially nothing on the world.

So if you go up into the upper left here, I'm just imagining a model that's kind of in the old mode, where you have your GloVe representations of these three words. And to get a representation for the sentence, you just add up those representations. In doing that, you have decided ahead of time a lot of stuff about how those pieces are gonna come together.

I mean, you just said it was gonna be addition, which is almost certainly not correct about how the world works. So that's a prototypical case of a high-bias decision. If you move over to the right here, that's a kind of recurrent neural network. And here, I've kind of decided that my data will be processed left to right.

I could learn a lot of different functions in that space, so it's much more expressive, much less biased in this machine learning sense, than this solution here. But I have still decided ahead of time that I'm gonna go left to right. And this is another example. These are tree-structured networks.

Richard Socher, who I just mentioned, was truly a pioneer in tree-structured recursive neural networks. Here, I make a lot of decisions about how the pieces can possibly come together. "The rock" is a unit, a constituent, separate from "rules", which comes in later. And I'm just saying I know ahead of time that the data will work that way.

If I'm right, it's gonna give me a huge boost, because I don't have to learn those details. If I'm wrong though, I might be wrong forever. And I think that's actually that feeling that you're wrong forever is what led to this kind of thing happening. So here, I've got kind of a bidirectional recurrent model.

So now, you can go left to right and right to left, and you have all these attention mechanisms that are gonna, like, jump around in the linear string. And this is the true progression of what happened with recurrent neural networks: go both directions and add a bunch of attention connections. And that is kind of the thing that caused everyone to realize, "Oh, we should just connect everything to everything else and go to the maximally low-bias version of this, and just assume that the data will teach us about what's important to connect to what.

We won't decide anything ahead of time." So a triumph of saying, "I have no idea what the world is gonna be like. I just trust in my data and my optimization." The attention piece is really interesting to me. You know, we used to talk a lot about this in the course.

Here, I have a sequence, "really not so good." And a common mode, still common today, is that I might fit a classifier on this final representation here to make a sentiment decision. But people went on that journey I just described, where you think, "Wait a second. If I'm just gonna use this, won't I lose a lot of information about the earlier words?" I should have some way of, like, connecting back.

And so what they did is dot products, essentially, between the thing that you're using for your classifier and the previous states. That's what I've depicted here, just as a kind of way of scoring this final thing with respect to the previous thing. You might normalize them a little bit, and then form what was called a context vector.
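As a minimal sketch of that scoring-and-summing step, with random stand-in hidden states rather than a real recurrent encoder:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
hidden = torch.randn(4, 8)          # stand-in states h_1 ... h_4 for a length-4 sequence
query = hidden[-1]                  # the final state we'd hand to the classifier

scores = hidden @ query             # dot product of the final state with every state
weights = F.softmax(scores, dim=0)  # "normalize them a little bit"
context = weights @ hidden          # weighted sum of the states: the context vector

features = torch.cat([query, context])   # classifier input: final state plus context
```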

That context vector is like the attention representation. And what I've done here is build these links back to all these previous states. And that turned out to be incredibly powerful. And when you read the title of the paper, "Attention Is All You Need," what they are doing is saying, "You don't need LSTM connections and recurrent connections and stuff.

The sense in which attention is all you need is the sense in which they're saying all you needed was these connections that you were adding onto the top of your earlier models." And maybe they were right, but certainly it has taken over the field. Another important idea here that might often be overlooked is just this notion that we should model the sub-parts of words.

And again, I can't resist a historical note here. If you look back at the ELMo paper, what they did to embrace this insight is incredible. They had character-level representations, and then they fit all these convolutions on top of all those character-level representations, which are essentially ways of pooling together sub-parts of the word.

And then they form a representation at the top that's like the average plus the concatenation of all of these different convolutional layers. And the result of this is a vocabulary that does latently have information about characters and sub-parts of words, as well as words, in it. And I feel that that's deeply right, right?
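To make that concrete, here is a loose sketch of a character-convolution word encoder in PyTorch. The dimensions, filter widths, and max-pooling choice are illustrative assumptions in the spirit of that idea, not ELMo's actual architecture:

```python
import torch
import torch.nn as nn

class CharCNNWordEncoder(nn.Module):
    """Embed characters, convolve at several widths, pool, and concatenate."""
    def __init__(self, n_chars=262, char_dim=16, widths=(2, 3, 4), n_filters=32):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(char_dim, n_filters, kernel_size=w) for w in widths]
        )

    def forward(self, char_ids):
        # char_ids: (batch, max_word_len) integer character ids
        x = self.char_emb(char_ids).transpose(1, 2)            # (batch, char_dim, len)
        pooled = [conv(x).max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=1)                        # (batch, n_filters * len(widths))

encoder = CharCNNWordEncoder()
fake_word = torch.randint(0, 262, (1, 10))  # a made-up 10-character word
print(encoder(fake_word).shape)             # torch.Size([1, 96])
```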

And the resulting space is one in which you could capture lots of things, like how "talk" is similar to "talking" and to "talked." And you know, all that stuff that a simple word-level, unigram treatment would miss is latently represented in this space. But the vocabulary for ELMo is like 100,000 words.

So that's a 100,000-entry embedding space that I need to have. It's actually gargantuan. And it's still the case that if you process real data, you have to unk out, that is, mark as unknown, most of the words you encounter, because language is incredibly complicated, and 100,000 doesn't even come close to covering all the tokens that you encounter in the world.

And so again, we have this kind of galaxy brain moment where I guess the field says, forget all that. And what you do instead is tokenize your data so that you just split apart words into their sub-word tokens if you need to. So here I've got an example with the BERT tokenizer.

This isn't too surprising. That comes out looking kind of normal. But if you do "encode me," notice that the word "encode" has been split into two tokens. And if you do "snuffleupagus," you get, count them, six or seven tokens from that, because it doesn't know what the word is.
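You can reproduce this kind of example with the Hugging Face transformers library. The exact splits are what one would expect from bert-base-uncased, but treat them as illustrative rather than guaranteed:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("this isn't too surprising"))  # mostly ordinary-looking word tokens
print(tokenizer.tokenize("encode me"))                  # "encode" splits into sub-word pieces
print(tokenizer.tokenize("snuffleupagus"))              # an out-of-vocabulary word shatters into several pieces
```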

And so what it does is not unk it out, but rather split it apart into a bunch of different pieces. And the result is the really startling thing that BERT-like models have only about 30,000 words in their vocabulary. But they're words in the sense that they're these sub-word tokens. Now, this would have been tragically bad in the realm where we were doing static word representations, because I'm going to have a vector for a little piece like "nu" and have no sense in which it was participating in the larger word snuffleupagus.

But we're talking about contextual models. So even if these are the tokens, the model is going to see the full sequence. And we can hope that it reconstructs something like the word from all the pieces that it encountered. Certainly, we could hope that for something like "encode." And we take this for granted now, but it's deeply insightful to me and incredibly freeing in terms of the engineering resources that you need.

But it does depend on rich contextual representations. And then another notion: positional encoding. So we have all these tokens, or maybe, you know, sub-parts of words. In addition to representing things using a traditional static embedding space like a GloVe one (that's what I've put with these light gray boxes here), we'll also represent sequences with positional encodings, which will just keep track of where the token appeared in the sequence I'm processing.

And what that means is that the word "rock" here, occurring in position two, will have a different representation than it would if "rock" appeared in position 47 in the string. It'll be kind of the same word, but also partly a different word. And that's another way in which you're embracing the fact that all of these representations are going to be contextual.
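Here is a minimal sketch of one version of that idea, learned positional embeddings added to the token embeddings; the toy vocabulary, dimensions, and token ids are made up for illustration:

```python
import torch
import torch.nn as nn

vocab_size, max_len, dim = 1000, 512, 64

word_emb = nn.Embedding(vocab_size, dim)   # one vector per (sub-)word type
pos_emb = nn.Embedding(max_len, dim)       # one vector per position in the sequence

token_ids = torch.tensor([[7, 42, 42]])    # the same word id (42) appears at two positions
positions = torch.arange(token_ids.size(1)).unsqueeze(0)

x = word_emb(token_ids) + pos_emb(positions)

# The two occurrences of word id 42 now get different input representations,
# because their positional embeddings differ. Expect False here:
print(torch.allclose(x[0, 1], x[0, 2]))
```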

This is an interesting story for me because I've been slow to realize what maybe the whole field already knew, that this is incredibly important. How you do this positional encoding really matters for how models perform. And that's why, in fact, it's like one of these early topics here that we'll talk about next time.

And then of course, another guiding idea here is that we are going to do massive scale pre-training. I mentioned this before. We're going to have these contextual models with all these tiny little parts of words in them, all in sequences with positional encodings. And we are going to train at an incredible scale.

That's that same story of word2vec, GloVe, through GPT, and then on up to GPT-3. I mentioned this before. And some magic happens as you do this on more and more data with larger and larger models. And then finally, and this is related, there's this insight that instead of starting from scratch with my machine learning models, I should start with a pre-trained one and fine-tune it for particular tasks.

We saw this a little bit in the pre-transformer era. The standard mode was to take GloVe or word2vec representations and have them be the inputs to something like an RNN. And then the RNN would learn. And instead of having to learn the embedding space from scratch, it would start in this really interesting space.

And that is actually a kind of learning of contextual representations. Because what happens if the GloVe representations are updated is that they all shift around and the network kind of pushes them around so that you get different senses for them in context. And then again, the transformer thing just takes that to the limit.

And that is the mode that you'll operate in for the first homework. I've put this from 2018 onwards. We have this thing where, I hope you can make it out at the bottom, you load in BERT and you just put a classifier on top of it. And you learn that classifier for your sentiment task, say.

And that actually updates the BERT parameters. And the BERT parameters help you do better at your task. And in particular, they might help you generalize to things that are sparsely represented in your task-specific training data, because they've learned so much about the world in their pre-training phase. I put that for 2018 onwards.
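Here is a minimal sketch of that load-BERT-and-add-a-classifier workflow, using the Hugging Face transformers library. The texts, labels, and training details are placeholders meant to show the shape of the setup, not a recipe for the homework:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2   # a freshly initialized classifier head on top of BERT
)

texts = ["really not so good", "surprisingly great"]   # made-up sentiment examples
labels = torch.tensor([0, 1])

batch = tokenizer(texts, padding=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One gradient step: the error signal flows back through the classifier head
# and through the pre-trained BERT parameters underneath it.
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()
```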

I'm a little worried that we're moving into a future in which fine-tuning is just, again, using an OpenAI API. But you all will definitely learn how to do much more than this, even if you fall back on doing this at some point, where now what you're doing is some lightweight version of fine-tuning a massive model like GPT-3.

Same mental model there. It's just that the starting point knows so much about language and the world, compared even to the BERT model up here. Those are the guiding ideas. I'll pause. Questions, comments? What's on your minds? Yeah. Going back to the sub-word splitting of the longer word, is there a bias we are imposing by splitting it that particular way, or is that also driven by the model itself?

Oh, yeah. So what gets imposed as a modeling bias in that sense when you do the tokenization? Is that the question? Yeah, potentially a lot. I left this out for reasons of time, but there are a bunch of different schemes that you can run for doing that sub-word tokenization.

You'll see this as you read papers and as I talk. There's WordPiece, there's byte-pair encoding, there's the unigram language model approach. All of them are attempts to learn a kind of optimal way to tokenize the data based on things that tend to co-occur a lot together. But it's definitely a meaningful step.
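One way to get a feel for the differences is to compare off-the-shelf tokenizers that happen to use roughly those three schemes: WordPiece for BERT, byte-pair encoding for GPT-2, and a SentencePiece unigram model for XLNet. This assumes the transformers library (plus sentencepiece for the XLNet tokenizer), and the exact splits are illustrative:

```python
from transformers import AutoTokenizer

word = "unbelievably"
for name in ["bert-base-uncased", "gpt2", "xlnet-base-cased"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    # Each scheme carves the same word into different sub-word pieces.
    print(name, tokenizer.tokenize(word))
```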

Yeah, and so, for example, as someone who's interested in the morphology of languages, in word forms, and you all might be too, this could be a cool multilingual angle if you think about languages with very rich morphology. You might have an intuition that you want a tokenization scheme that reproduces the morphology of that language, that splits a big word, with all its suffixes say, down into things that look like the actual pieces as you recognize them.

And then you could think, well, the best of these schemes should come close to that, right? And that could be an important and useful bias that you impose. Yeah. Go ahead. Can you elaborate on what happens to the original model when we do fine-tuning? Like, does it change, or does it just add additional layers to it? What actually happens with fine-tuning?

What happens when you fine-tune? As usual with these questions, there's like an easy answer and a hard answer. So the easy answer is that you are simply back-propagating whatever error signal you got from your, you know, output comparison with the true label, back through all the parameters of the model.

And in principle, that could mean that, you know, as you fine-tune on your sentiment task, you are updating all of the parameters, even the pre-trained BERT ones. And then of course, there are variants of that where you update just some of the BERT parameters while leaving others frozen and so forth.
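Here is a minimal sketch of one such variant, freezing the pre-trained encoder and training only the classifier head. Which parameters to freeze is a design choice, and this particular split is just for illustration:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Turn off gradients for everything in the pre-trained BERT encoder...
for param in model.bert.parameters():
    param.requires_grad = False

# ...so only the classifier head will be updated during training.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)   # expect just the classifier weights and bias
```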

But the idea is you have a smart initialization, that's the BERT initialization, and then you kind of adjust the whole thing to be really good at your task. What really happens there? That's a deep question, right? That could actually connect with the explainability stuff, like what adjustments are happening to the network, and which ones are useful, which ones could even be detrimental, which ones are causing you to overfit.

Are there lightweight versions of the fine-tuning that would be better and more robust and get a better balance between the pre-training and the task-specific thing? And that just shows you fine-tuning is an art and a science all at once. Yeah. Sure. So do we also control, like, the influence of the fine-tuning dataset over the original model?

Can we control it in a way that, say, just changes the model a little bit, or changes the original model completely? Let's see, what's the right metaphor? You could control it the same way that you could control a kind of out-of-control car. I mean, you have a steering wheel and an accelerator and a brake, but you're not entirely sure how they work.

Yeah, you can try. And as you get better at the task, as you get better at driving the vehicle, you have more fine control. But it's an art and a science at the same time. I'm actually hoping, you know, that Sid, who's going to do a hands-on session with us next week, imparts some of his own hard-won lessons to you about how to drive these things, because he's really been in the trenches doing this with large models.

But you know, you have your learning rate and your optimizer and other things you can fiddle with, and hope that it steers in the direction you want. Yeah, if we use tree structures right now to represent like syntax, I guess my question is, why don't they work super well for language models?

And, like, I guess the sentiment that you had was, oh, kind of just put attention on everything and see what works. I guess, why is that the sentiment in linguistics then as well? Great question. My personal perspective is that probably all the trees that we have come up with are kind of wrong.

And as a result, we were actually making it harder for our models, because we were putting them in this bad initial state. And so the mode I've moved into is to think, let's use the transformer, or something that's much more like this, totally free-form, and then use explainability methods to see if we can tell what tree structures they've induced.

Because those might be closer to the true tree structures of language. And another aspect of this is that I feel like there isn't even one structure for a given sentence. There could be one for semantics, one for syntax, one for other things. And so we want those all simultaneously represented.

And again, these powerful models we're talking about could do that. And so then they become devices for helping us learn what the right structures are, as opposed to us imposing them. Yeah. Numbers are an important part of language. How does tokenization work for numbers? Because there are digits and there are words, with the same meaning or different meanings.

Is it in the same domain, or is it something else? Oh, I love this. This is a great example of something that sounds small but could be a wonderful final paper, and turns out to be hard and deep. How do you represent numbers if you've got a WordPiece tokenizer?

Do you get it down to all the digits? Do you leave them as big chunks? Do you do it based on a strong bias that you have about, like, this is the tens place, this is the hundreds place? What do you do? Yeah, wonderful question to ask. And I am absolutely positive that that low-level tokenization choice will influence whether or not your model can learn to do basic arithmetic, say.
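As a quick illustration of how much this varies across off-the-shelf tokenizers, here is a small comparison (again assuming the transformers library). The exact splits are not guaranteed; the point is the contrast in how the digits get chunked:

```python
from transformers import AutoTokenizer

sentence = "In 2023, 12345 plus 67890 is 80235."
for name in ["bert-base-uncased", "gpt2"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    # Different schemes break the same numbers into very different pieces.
    print(name, tokenizer.tokenize(sentence))
```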

Yeah. And so a paper that evaluated a bunch of schemes in a way that was insightful and important, you know, on real mathematical abilities, could really help us understand which models will be intrinsically limited and, in turn, how to develop better ones. I love it. Yeah. On the slide that you have, the one that's titled attention, you've shown your scores as dot products of the final word against all the other previous ones.

Well, I picked the final one, not the first one. Or all of them? Attention is all you need. This is a perfect transition. So yeah, you should do all of them. That's, like, what they mean by the title of the paper. Yes, do it all. Self-attention means attending everything to everything else.

So this is a perfect transition. We're out of time, 4:20. Next time we will dive into the transformer. We'll resolve the questions that we got back there. You'll see much more of these attention connections. Yeah, we're really queued up now to dive into the technical parts.