
Stanford CS224N NLP with Deep Learning | 2023 | Lecture 14 - Insights between NLP and Linguistics


Whisper Transcript | Transcript Only Page

00:00:00.000 | Cool. Hi, everyone. I'm Isabel. I'm a PhD student in the NLP group. This lecture is about
00:00:15.920 | connecting insights between NLP and linguistics. Yeah, so hopefully we're going to learn some
00:00:21.040 | linguistics and think about some cool things about language. Some logistics. We're in the
00:00:27.760 | project part of the class, which is cool. We're so excited to see everything you guys
00:00:31.760 | do. You should have a mentor grader assigned through your project proposal. Whoever
00:00:39.360 | graded your project proposal is your mentor, and especially if you're doing a custom project, we recommend
00:00:45.080 | that you go to your grader's office hours. They'll know the most and be most into your
00:00:51.120 | project. And project milestones are due next Thursday. So that's in one week from now.
00:00:57.800 | So hopefully you guys are all getting warmed up doing some things for the project. And
00:01:04.160 | we'd love to hear where you are next week. Cool. So the main thing that I'm going to
00:01:10.960 | talk about today is that there's been kind of a paradigm shift for the role of linguistics
00:01:16.760 | in NLP due to large language models. So it used to be that there was just human language.
00:01:23.880 | We create it all the time. We're literally constantly creating it. And then we would analyze it
00:01:28.440 | in all these ways. Maybe we want to make trees out of it. Maybe we want to make different
00:01:31.520 | types of trees out of it. And then all that would kind of go into making some kind of
00:01:38.240 | computer system that can use language. And now we've cut out this middle part. So we
00:01:46.280 | have human language. And we can just immediately train a system that's very competent in human
00:01:53.120 | language. And so now we have all this analysis stuff from before. And we're still producing
00:02:01.400 | more and more of it. There's still all this structure, all this knowledge that we know
00:02:04.420 | about language. And the question is, is this relevant at all to NLP? And I'm going to show
00:02:10.200 | how it's useful for looking at these models, understanding these models, understanding
00:02:17.280 | how things work, what we can expect, what we can't expect from large language models.
00:02:25.760 | So in this lecture, we'll learn some linguistics, hopefully. Language is an amazing thing. It's
00:02:32.120 | like so fun to think about language. And hopefully, we can instill some of that in you. Maybe
00:02:36.800 | you'll go take Ling 1 or something after this. And we'll discuss some questions about NLP
00:02:44.160 | and linguistics. Where does linguistics fit in for today's NLP? And what does NLP have
00:02:49.240 | to gain from knowing and analyzing human language? What does a 224N student have to gain from
00:02:55.680 | knowing all this stuff about human language?
00:02:59.680 | So for the lecture today, we're going to start off talking about structure in human language,
00:03:07.640 | thinking about the linguistics of syntax and how structure works in language. We're going
00:03:11.280 | to then move on to looking at linguistic structure in NLP, in language models, the kind of analysis
00:03:18.640 | that people have done for understanding structure in NLP. And then we're going to think of going
00:03:26.660 | beyond pure structure, so beyond thinking about syntax, thinking about how meaning and
00:03:36.080 | discourse and all of that play into making language language and how we can think of
00:03:41.640 | this both from a linguistic side and from a deep learning side.
00:03:46.200 | And then lastly, we're going to look at multilinguality and language diversity in NLP. Cool. So starting
00:03:56.260 | off with structure in human language, just like a small primer in language in general,
00:04:04.400 | if you've taken any intro to linguistics class, you'll know all of this. But I think it's
00:04:08.080 | fun to get situated in the amazingness of this stuff. So all humans have language. And
00:04:14.760 | no other animal communication is similar. It's this thing which is incredibly easy for
00:04:20.000 | any baby to pick up in any situation. And it's just this remarkably complex system.
00:04:26.940 | Very famously linguists like to talk about the case of Nicaraguan sign language because
00:04:33.840 | it kind of emerged while people were watching, which is great. After the Sandinista
00:04:42.840 | Revolution, there was a big expansion of public education in Nicaragua, and they
00:04:51.160 | made a school for deaf children. And there was no central Nicaraguan sign language; people
00:04:58.980 | had isolated ways of signing. And then you see this full language emerge in the school
00:05:03.440 | very autonomously, very naturally. I hope this is common knowledge. Maybe it's not.
00:05:09.140 | Sign languages are like full languages with like morphology and things like pronouns and
00:05:14.800 | tenses and like all the things. It's not like how I would talk to you across the room. Yeah.
00:05:20.480 | And so, and what's cool about language is that it can be manipulated to say infinite
00:05:24.000 | things. And the brain is finite. So it's either we have some kind of set of rules that we
00:05:29.720 | like tend to be able to pick up from hearing them as a baby and then be able to say infinite
00:05:34.840 | things. And we can manipulate these rules to really say anything, right? We can talk
00:05:38.960 | about things that don't exist, things that can't exist. This is very different from like
00:05:42.560 | the kind of animal communication we see like a squirrel, like alarm call or something,
00:05:47.160 | you know, it's like, watch out, there's a cat. Things that are like totally abstract,
00:05:53.520 | you know, that have like no grounding in anything. We can express like subtle differences between
00:05:58.640 | similar things. When I'm thinking about this point, about these features of language,
00:06:03.880 | I always think of the Stack Exchange worldbuilding
00:06:07.080 | site. I don't know if you've ever looked at the sidebar there, where there's this
00:06:11.360 | thing where science fiction authors pitch their ideas for their science
00:06:15.200 | fiction worlds. And it's the wackiest stuff; you can really create any world
00:06:19.480 | with English, with the language that we're given. It's amazing. And so there's
00:06:25.600 | structure underlying language, right? This is, I said recap here, cause we've done like
00:06:28.800 | the dependency parsing lectures. We thought about this, right? But you know, if we have
00:06:33.480 | some, some sentence like, you know, Isabel broke the window, the window was broken by
00:06:37.040 | Isabel, right? We have these two sentences or some kind of relation between them. And
00:06:40.760 | then, and then we have another two sentences and they have like the similar relation between
00:06:44.800 | them, right? This kind of passive alternation, it's kind of something which exists for both
00:06:47.680 | of these sentences, you know, and then we can even use like made up words and it's still,
00:06:53.360 | you can still see that it's a passive alternation, right? And so it seems that we have some knowledge
00:06:57.060 | of structure that's separate from, from the words we use and the things we say that's
00:07:00.640 | kind of above it. And then what's interesting about structure is that it dictates how we
00:07:05.960 | can use language, right? So if I have a sentence like the cat sat on the
00:07:10.720 | mat, and then someone tells you, well,
00:07:16.280 | if you make a tree for it, it's going to look like this according to
00:07:18.800 | my kind of tree theory, you might say, well, why should I care about that? And the reason
00:07:24.600 | that this stuff is relevant is because it kind of influences what you could do, right?
00:07:30.720 | So any subtree, or in this specific case any subtree, in other
00:07:34.800 | cases many subtrees, can kind of be replaced with one item, right? So
00:07:39.280 | it's like, he sat on the mat, or he sat on it, or he sat there, right? Or he did so; did
00:07:44.520 | so is two words, but there's a lot of ink spilled over "do" in English, especially
00:07:50.240 | in early linguistics teaching, so we're not going to spill any ink; it's kind of like
00:07:53.400 | one word. But then when something is not a subtree, you can't really replace it
00:07:58.200 | with one thing, right? So you can't express "the cat sat" as one unit and kind of
00:08:02.080 | have the mat as a different thing, right? You could be like, he did so on the
00:08:06.000 | mat, right? But you'd have to do two things. And one way you could think
00:08:09.880 | about this is that, well, it's not a subtree, right? You kind of have
00:08:13.640 | to go up a level to do this. And so you can't really separate the
00:08:18.960 | cat from on the mat in this way. And so we implicitly know so many complex rules
00:08:25.220 | about structure, right? We're like processing the, these like streams of sound or like streams
00:08:29.560 | of letters all the time. And yet we like have these, like the ways that we use them show
00:08:34.600 | that we have all these like complex ideas, like the tree I just showed, or like for,
00:08:38.960 | for example, these are like, I'm just going to give some examples for like a, a taste
00:08:42.960 | of like the kinds of things people are thinking about now, but there's like so many, right?
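To make the subtree test from the cat-sat-on-the-mat example concrete, here is a minimal sketch in Python. It assumes the nltk package is available, and the particular tree shape is just one illustrative analysis, not the only possible one.

```python
# A minimal sketch of the constituency (subtree) test described above: a span of words
# can be replaced by a single pro-form ("it", "there", "did so") only if it is the yield
# of some subtree. Assumes nltk is installed; the tree is one illustrative analysis.
from nltk import Tree

sentence = Tree.fromstring(
    "(S (NP (D the) (N cat)) (VP (V sat) (PP (P on) (NP (D the) (N mat)))))"
)

def is_constituent(tree, words):
    """True if `words` is exactly the yield of some subtree of `tree`."""
    return any(sub.leaves() == words for sub in tree.subtrees())

print(is_constituent(sentence, ["on", "the", "mat"]))   # True:  "he sat there"
print(is_constituent(sentence, ["the", "mat"]))         # True:  "he sat on it"
print(is_constituent(sentence, ["the", "cat", "sat"]))  # False: no single replacement
```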
00:08:48.120 | So, what can we pull out to make a question, right? When we form a question,
00:08:53.280 | we form it by kind of referring to some part of, you know, there might
00:08:58.280 | be another sentence which is the statement version, right? And we've kind of
00:09:03.280 | pulled out some part to make the question. They're not necessarily
00:09:06.840 | fully related, but, you know, if we say Leon is a doctor, we can kind of pull that
00:09:10.440 | out to make a question, right? Like, what is Leon? And if we have, my cat likes tuna,
00:09:14.880 | we could pull that out: what does my cat like? Again, ignore the do. If we have something
00:09:19.680 | like Leon is a doctor and an activist, we actually can't pull out this last thing,
00:09:24.800 | right? So if something is conjoined with an and, it
00:09:28.560 | can't be taken out of that and, right? You could only say,
00:09:32.800 | what is Leon? You could be like, oh, a doctor and an activist, but you can't really say,
00:09:35.880 | what is Leon a doctor and? This is just not how question formation works. And, you know,
00:09:39.480 | this is something that we all know. I don't think it's something that any of
00:09:42.360 | us have been taught, right? Even people who've been taught English as a second language,
00:09:45.200 | I don't think this is something which is ever really taught explicitly,
00:09:50.520 | right? But most of us probably know this very well. Another such rule, right,
00:09:58.760 | is, when can we kind of shuffle things around, right? So if
00:10:04.400 | we have something like I dictated the letter to my secretary, right? We can make like a
00:10:10.040 | longer version of that, right? I dictated the letter that I had been procrastinating
00:10:13.040 | writing for weeks and weeks to my secretary. This character is like both a grad student
00:10:18.160 | and like a high ranking executive. And, and then we can, we can move the, we can move
00:10:25.840 | that, that long thing to the end, right? So it's like, I dictated to my secretary the
00:10:28.760 | letter that I'd been procrastinating writing for weeks and weeks. And that's like fine.
00:10:31.600 | You know, maybe it's like slightly awkwardly phrased, but it's not like, I think this,
00:10:37.200 | for me, at least everyone varies, right? Could, could appear in like natural productive speech,
00:10:42.000 | but then something like this is like much worse, right? So somehow the fact that it
00:10:46.280 | becomes weighty is good and we can move it to the end. But when it doesn't become weighty,
00:10:50.800 | we can't, right? And we like, this sounds kind of more like Yoda-y than like real language.
00:10:56.080 | And so, and so like, and we have this rule, like this one's not that easy to explain,
00:11:01.840 | actually. Like people have tried many ways, like to like make sense of this in linguistics.
00:11:06.280 | And it's just like, but it's a thing we all know, right? And, and so when I say rules
00:11:10.440 | of grammar, these are not the kind of rules that were usually taught as rules of grammar,
00:11:13.760 | right? So a community of speakers, you know, for example, like standard American English
00:11:19.400 | speakers, they share this rough consensus of like the implicit rules they all have.
00:11:23.480 | These are not the same, you know, like people have like gradations and disagree on things,
00:11:27.880 | but you know, and then kind of like a grammar is an attempt to describe all, all these rules,
00:11:32.960 | right? And you can like, kind of linguists might write out like a big thing called like,
00:11:36.600 | you know, the like grammar of the English language where they're trying to describe
00:11:40.440 | all of them. It's really never going to be complete. Like, this is
00:11:45.400 | a really hefty book and it's still not describing all of them, right? Like language
00:11:49.320 | is so complex. But so what, so what we were told as rules of grammar, you know, these
00:11:54.080 | kind of like prescriptive rules where they tell us what we can and can't do, you know,
00:11:57.880 | they often have other purposes than describing the English language, right? So for example,
00:12:02.040 | when they've told us things like, oh, you should never start a sentence with and, you
00:12:05.800 | know, that's like not true. You know, we start sentences with and all the time in English
00:12:08.720 | and it's fine. You know, what they probably mean, you know, there's some probably like
00:12:15.840 | reason that they're saying this, right? Like, especially if you're like trying to teach
00:12:18.160 | a high schooler to like, write, you know, you probably, when you want them to focus
00:12:21.160 | their thoughts, you probably don't want them to be like, oh, and this, oh, and this again,
00:12:23.600 | or, you know, like you want them to like, and so you tell them like, oh, rule of writing,
00:12:26.960 | you know, is like, you can never start a sentence with and, right? And when they say something
00:12:30.640 | like, oh, it's incorrect to say, I don't want nothing. This is like bad grammar, you know,
00:12:35.480 | well, this is, you know, in, in, in standard American English, you probably wouldn't have
00:12:39.720 | nothing there, right? Cause you, you would have anything, right? But, but in many dialects
00:12:45.880 | of English, you know, in many languages across the world, when you have a negation, right?
00:12:49.240 | Like the not and don't, then like everything, it kind of scopes over also has to be negated
00:12:55.360 | or has to agree. And many dialects of English are like this. And so what they're really
00:12:58.940 | telling you is, you know, the dialect with the most power in the United States doesn't
00:13:02.200 | do negation this way. And so you shouldn't either in school. Right. And, and, and so,
00:13:08.640 | you know, and so the way that we can maybe define grammaticality, right. Rather than
00:13:12.000 | like what they tell us is wrong or right is that, you know, if we choose a community of
00:13:15.160 | speakers to look into, they share this rough consensus of their implicit rules. And so
00:13:19.480 | like the utterances that we can generate from these rules, you know, are grammatical, roughly,
00:13:25.480 | you know, everyone has these like gradations of what they can accept. And if we can't produce
00:13:29.640 | an utterance using these rules, you know, it's ungrammatical. And that's where like, this
00:13:33.120 | is like the descriptive way of thinking about grammar, where we're, where we're thinking
00:13:36.820 | about what people actually say and what people actually like and don't like. And so for an
00:13:41.480 | example, you know, in, in English, large, largely, we have a pretty strict rule that
00:13:45.720 | like the subject, the verb and the object appear in this like SVO order. There's exceptions
00:13:49.760 | to this, like there's exceptions to everything, right? Especially things like says I, in some
00:13:52.520 | dialects, but you know, it is like, largely if something is before the verb, it's a subject,
00:13:56.760 | something is after the verb, it's an object, and you can't move that around too much. And,
00:14:01.880 | you know, we also have these subject pronouns, you know, like I, I, she, he, they, that have
00:14:05.840 | to be the subject and these object pronouns, you know, me, me, her, him, them, that have
00:14:09.600 | to be the object. And, and, you know, and so if we follow the, these rules, we get a
00:14:15.080 | sentence that we think is good, right? Like, I love her. And if we don't, then we get a
00:14:18.720 | sentence that we think is, is ungrammatical, right? Something like me love she, it's like,
00:14:21.960 | we don't know who is who, you know, who is doing the loving and, and, and who is being
00:14:25.760 | loved in, in this one, right? And it's, it doesn't exactly parse. And this is like also
00:14:31.280 | true, you know, like even when there's no ambiguity, this continues to be true, right?
00:14:36.280 | So for a sentence like me, a cupcake ate, which is like, the meaning is perfectly clear.
00:14:41.460 | Our rules of grammaticality don't seem to cut as much slack, right? We're like,
00:14:44.520 | oh, this is wrong. I understand what you mean, but in my head, I know it's not
00:14:48.760 | correct, and not by the prescriptive notion of what I think is correct, you know,
00:14:52.680 | but by the descriptive notion: I just don't like it. Right. And
00:14:59.320 | you can also, you know, have sentences that are grammatical without any meaning. So you
00:15:02.880 | can have meaning without grammaticality, right, like me a cupcake ate, and you could
00:15:06.600 | also have the reverse. It's the classic example from Chomsky in 1957, which I introduced earlier,
00:15:16.480 | but yeah, classically from 1957: colorless green ideas sleep furiously,
00:15:21.080 | right? This has no meaning, because you can't really make any sense out of this
00:15:25.560 | sentence as a whole, but you know it's grammatical. And you know it's grammatical,
00:15:28.520 | right, because you can make an ungrammatical version of it, like colorless green
00:15:31.840 | ideas sleeps furious, which doesn't work because there's no agreement, even though
00:15:35.400 | you don't have any meaning for any of this. And then lastly, you know, people don't fully
00:15:42.520 | agree. You know, everyone has their own idiolect, right? People like usually speak like more
00:15:46.440 | than one dialect and they kind of move between them and they have a mixture and those have
00:15:49.240 | like their own way of thinking of things. They also have these, like, those have different
00:15:52.600 | opinions at the margins. People like, like some things more, others don't, right? So
00:15:57.080 | an example of this is like, not everyone is as strict for some WH constraints, right?
00:16:01.400 | So if you're trying to pull out something like, I saw who Emma doubted reports that we had
00:16:05.320 | captured in the nationwide FBI manhunt - this is from a paper by Hofmeister and Ivan Sag
00:16:10.200 | from Stanford - some people like it, some people don't, you know. Some
00:16:14.480 | people can clearly see it as like, oh, it's the who that we had captured,
00:16:17.800 | and Emma doubted the reports that we had captured. And some people are like, this is
00:16:21.560 | as bad as what is Leon a doctor and, and I don't like it. Right. So yeah, so that's
00:16:30.040 | grammaticality. And the question is like, why do you even need this? Right. It's like, we,
00:16:35.000 | we like, we like accept these useless utterances and we block out these perfectly communicative
00:16:39.400 | utterances. Right. And, and this is like, I started off saying that this is like a fundamental
00:16:43.520 | facet of human intelligence. Like it seems kind of, you know, a strange thing to have.
00:16:49.040 | And so I think one thing I keep returning to when I think about linguistics is that
00:16:53.840 | a basic fact about language is that we can say anything, right. Really,
00:16:58.040 | every language can express anything, and if there's no word
00:17:01.080 | for something, people will develop one if they want to talk about it. Right. And so if we
00:17:05.760 | ignore the rules because we know what's probably intended, right, you know, then we
00:17:10.520 | would be limiting possibilities. Right. So in my kitchen horror novel, where the ingredients
00:17:13.800 | become sentient, I want to say the onion chopped the chef. And if people, if people just assumed
00:17:18.800 | I meant the chef chopped the onion because like SVO order doesn't really matter, then
00:17:24.040 | I can't, I can't say that. So then, yeah, to, to like, to conclude, you know, a fact
00:17:33.360 | about language that that's like very cool is that it's compositional, right. We have
00:17:38.400 | the set of rules that defines grammaticality and then this lexicon,
00:17:42.960 | right, this dictionary of words that relate to the world we want to talk about.
00:17:46.080 | And we kind of combine them in these limitless ways to say anything we want to say. Cool.
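As a toy illustration of that "rules plus lexicon" idea, here is a small sketch that accepts and rejects strings with a couple of discrete rules (strict SVO order, subject versus object pronoun case). The word lists and rules are deliberately tiny stand-ins, not a real grammar of English.

```python
# A toy sketch of "grammar = discrete rules + lexicon": strict SVO order plus
# subject/object pronoun case. Deliberately tiny; not a real grammar of English.
SUBJECT_PRONOUNS = {"I", "she", "he", "they"}
OBJECT_PRONOUNS = {"me", "her", "him", "them"}
VERBS = {"love", "saw", "ate"}

def grammatical(sentence):
    """Accept only Subject Verb Object with correctly cased pronouns."""
    tokens = sentence.split()
    if len(tokens) != 3:
        return False
    subj, verb, obj = tokens
    return subj in SUBJECT_PRONOUNS and verb in VERBS and obj in OBJECT_PRONOUNS

print(grammatical("I love her"))    # True
print(grammatical("me love she"))   # False: same word order, wrong pronoun case
print(grammatical("she saw him"))   # True
```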
00:17:50.720 | Any questions about all this? I've like tried to bring a lot of like linguistic fun facts,
00:17:54.360 | like top of mind for this lecture. So hopefully, hopefully have answers for things you want
00:18:01.760 | to know. Cool. So now, that was a nice foray into
00:18:11.000 | a lot of sixties linguistics. How does that relate to us
00:18:16.280 | today, in NLP? And so we said that in humans, you know, we can think about
00:18:23.800 | languages, it's like there's a system for producing language, you know, that can be
00:18:27.320 | described by these discrete rules, you know, so it's smaller than all the
00:18:32.000 | things that we can say. There's this set of rules that we can put together
00:18:34.680 | to say things. And so do NLP systems work like that? And one answer is, well,
00:18:40.160 | they definitely used to, right? Because as we said in the beginning, before self-supervised
00:18:44.320 | learning, the way to approach doing NLP was through understanding the human language system,
00:18:50.760 | right? And then trying to imitate it, trying to see, you know, if you think really, really
00:18:53.520 | hard about how humans do something, then you kind of like code up a computer to do it.
00:18:58.400 | Right. And so for, for one example, like, you know, parsing used to be like super important
00:19:03.160 | in NLP, right. And this is because, as an example, if I want
00:19:09.120 | my sentiment analysis system to classify a movie review correctly, right, something like
00:19:14.120 | my uncultured roommate hated this movie, but I absolutely loved it, right, how
00:19:19.160 | would we do this before we had ChatGPT? Well, we might have
00:19:24.560 | some semantic representation of words like hate and uncultured - it's not looking
00:19:27.720 | good for the movie - but, you know, how does everything relate? Well,
00:19:33.120 | we might ask how a human would structure this sentence. You know, there are many
00:19:36.680 | linguists and many theories of how syntax might work, but they
00:19:40.920 | would tell you something like this. So it's like, okay, now I'm interested
00:19:44.320 | in the I, right, because that's probably what the review relates to. There's
00:19:48.240 | worrying stuff about uncultured and hated, but it seems like those are related syntactically
00:19:52.720 | together, right? It's the roommate that hated, and that can't really connect to the I, right?
00:19:57.120 | The I can't really be related to the hated, because they're kind of separated.
00:20:03.200 | They're in separate subtrees, separated by this conjunction, by this but relation.
00:20:10.280 | And so it seems that the I goes with loved, which is looking good for the movie,
00:20:15.280 | you know, we have loved it. And so then we have to move beyond the rules of syntax,
00:20:19.520 | right, to the rules of discourse: what could
00:20:24.400 | it mean? And there's a bunch of rules of discourse. If you say it,
00:20:27.120 | you're probably referring to the latest kind of salient thing that matches,
00:20:31.760 | and, you know, it is probably non-sentient, right. And so in this case it would
00:20:36.240 | be the movie, right. So then, you know, linguistic theory helped
00:20:44.680 | NLP reverse engineer language. So you had something like: input,
00:20:49.600 | then you get syntax, then you get semantics from the syntax, right. So you would take
00:20:55.040 | the tree and then from the tree kind of build up all these little,
00:21:00.400 | you know, you can build up these little functions of how things
00:21:04.840 | relate to each other. And then you'd go to discourse, right: what
00:21:10.320 | refers to what, what nouns are being talked about, what things are being talked
00:21:14.760 | about, and then whatever else was interesting for your specific use
00:21:24.880 | case. Now we don't need all that, right. Language models just seem to catch on to a lot of these
00:21:29.640 | things, right. So this whole thing that I did with the tree is something ChatGPT
00:21:34.840 | does, and it does much harder things than this, right. This wasn't even
00:21:38.160 | slightly prompt engineered. I just woke up one morning and was like, oh, there's another
00:21:41.200 | lecture, I'm going to put that into ChatGPT. And it got this exactly. You know, I didn't even get
00:21:44.920 | any of that "yeah, stop" kind of thing - well, I guess I got a bit of moralizing - but it just immediately
00:21:51.600 | told me who likes it, who doesn't
00:21:55.040 | like it, and why I'm doing something slightly wrong, which is how it ends everything, right.
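For concreteness, here is a rough sketch of the older parse-then-rules approach to that movie-review example. It assumes spaCy and its small English model are installed, and the two-entry sentiment lexicon is obviously a toy stand-in; the point is the flavor of the pipeline, not any particular system.

```python
# A rough sketch of the pre-LLM, parse-then-rules approach to the review example:
# dependency-parse the sentence, find the sentiment verb whose subject is "I",
# and read the speaker's sentiment off that verb. Toy lexicon, illustrative only.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed
SENTIMENT = {"hate": -1, "love": +1}

doc = nlp("My uncultured roommate hated this movie, but I absolutely loved it.")
for token in doc:
    if token.lemma_ in SENTIMENT:
        subjects = [child.text for child in token.children if child.dep_ == "nsubj"]
        if "I" in subjects:
            print("speaker's sentiment:", SENTIMENT[token.lemma_])  # +1
```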
00:22:04.360 | And so, and so, you know, NLP systems definitely used to, this is where we were, work in this
00:22:11.720 | kind of structured, discrete way. But now NLP works better than it ever has before.
00:22:16.320 | And we're not constraining our systems to know any syntax, right. So what, what about
00:22:21.120 | structure in modern language models? And so the question, which
00:22:29.440 | a lot of analysis work has been focused on - you know, I think we'll
00:22:33.520 | have more analysis lectures later also, so this is going to be looked at in
00:22:37.520 | more detail - is how could you get from training data, which is just kind
00:22:41.680 | of a loose set of things that have appeared on the internet, or sometimes, rarely, not
00:22:45.560 | on the internet, right, to rules about language, right, to the idea that
00:22:51.200 | there's this like structure underlying language that we all seem to know, even though we do
00:22:54.220 | just talk in streams of things that then sometimes appear on the internet. And one way to think
00:23:00.120 | about this is like testing, you know, is testing how novel words and old structures work, right.
00:23:07.480 | So humans can easily integrate new words into our old syntactic structures. I remember like
00:23:12.520 | I had lived in Greece for a few years for middle school, not speaking English
00:23:16.240 | too much. And I came back for high school, and this was in Berkeley,
00:23:24.360 | in the East Bay. And there was like, there was literally like 10 new vocabulary words
00:23:28.240 | I'd like never heard of before. And they all had like a very similar role to like dank
00:23:31.800 | or like sick, you know, but they were like the ones that were being tested out and did
00:23:34.160 | not pass. And within like one, you know, one day I immediately knew how to use all of them,
00:23:39.680 | right. It was not, it was not like a hard thing for me. I didn't have to like get a
00:23:43.200 | bunch of training data about how, how to use, you know, all these words. Right. And so this
00:23:49.000 | kind of is one way of arguing the thing I was arguing for
00:23:53.680 | the whole first part of the lecture: that syntactic structures exist independently
00:23:58.120 | of the words that they have appeared with. Right. A famous example of this is Lewis
00:24:04.200 | Carroll's poem, Jabberwocky. Right. I was going to quote from it, but I can't
00:24:07.400 | actually see it there. Right. Where they, where they, you know, where he just like made
00:24:12.080 | up a bunch of new words and he just made this poem, which is all new open class words, open
00:24:16.440 | class words, what we call, you know, kind of like nouns, verbs, adjectives, adverbs,
00:24:21.760 | classes of words that like we add new things to all the time while, while things like conjunctions,
00:24:27.880 | you know, like and or but are closed class. Oh, there's been a new conjunction added late,
00:24:32.240 | added recently - I just remembered after I said that. Does anyone know of a conjunction
00:24:36.720 | from kind of the past 30 years or something, maybe 40? Spoken slash: now we say slash,
00:24:42.080 | and it kind of has a meaning that's not quite and or but or or, but it's
00:24:46.160 | a new one. But the class is generally closed; this happens rarely. Anyway. And
00:24:50.440 | so, you know, you have 'twas brillig, and the slithy toves did gyre and
00:24:55.000 | gimble in the wabe, right? Toves is a noun. We all know that, and we've never heard it before.
00:24:59.000 | And in fact, you know, one word from Jabberwocky, chortle, actually entered
00:25:03.560 | the English vocabulary, right? It kind of means like a, like a little chuckle that's
00:25:07.120 | maybe slightly suppressed or something. Right? So, so, so it shows like, you know, there
00:25:11.080 | was one, literally like one example of this word and then people picked it up and started
00:25:15.160 | using it as if it was a real word. Right? So, and so one, one way of asking do language
00:25:22.600 | models have structures, like do they have this ability? And, you know, and I was thinking
00:25:27.760 | it would be cool to go over like a benchmark about this. Right? So like the kind of things,
00:25:31.080 | so people like make things where you could test your language models to, to, to see if
00:25:34.560 | it does this. Yeah. Are there any questions until now? I go into just like this new benchmark.
00:25:44.920 | Cool. So yeah, the COGS benchmark, the Compositional Generalization Challenge based on Semantic Interpretation
00:25:52.560 | benchmark, or something like that. Right. It kind of checks whether language models can
00:26:00.560 | do new word structure combinations. Right? So, so the, the task at hand is semantic interpretation.
00:26:07.160 | This is, I kind of glossed over it before, but it's like, if you have a sentence
00:26:11.440 | like the girl saw the hedgehog, you have this idea that
00:26:15.480 | saw is a function that takes in two arguments, and it says the first
00:26:20.240 | one saw the second one. You know, this is like
00:26:24.400 | one way of thinking about semantics - there are many more, as we'll see - but this
00:26:27.400 | is one. And so you can make a little kind of lambda expression
00:26:33.080 | about what the sentence means, and to get that you
00:26:39.640 | kind of have to use the tree to get it correct. But anyway, the specific
00:26:46.640 | mechanism of this is not very important; it's just semantic interpretation,
00:26:49.400 | where you take the girl saw the hedgehog and you output this function:
00:26:52.080 | see takes two arguments, the first is the girl, the second is the
00:26:56.440 | hedgehog. And then the training and test sets have distinct words and
00:27:01.460 | structures in different roles, right. So, for example, you have
00:27:07.200 | things like Paula, right. Or the hedgehog is like always an object in the, in the training
00:27:13.240 | data. So when you're fine tuning to do this task, but then in the test data, it's a subject,
00:27:17.880 | right. So it's like, can you use this word that you've
00:27:24.680 | seen, you know, in a new place? Because in English, anything
00:27:28.880 | that's an object can be a subject, you know, with some subtlety around
00:27:34.000 | how some things are more likely to be subjects, but yeah. And
00:27:37.960 | then similarly, if you have something like the cat on the mat,
00:27:42.160 | this idea that a noun can go with a prepositional
00:27:48.440 | phrase, right, but in training it always appears in object position, right,
00:27:51.920 | like Emma saw the cat on the mat - then can you do something like the cat on the mat
00:27:56.320 | saw Mary, right? So it's like, move that kind of structure to subject position, which is
00:27:59.880 | something that in English we can do, right. Like any type of noun phrase that can be in
00:28:04.560 | an object position can be in subject position. And so that, and so that's the, the, the Cogs
00:28:08.960 | benchmark, you know, large language models haven't aced this yet. I wrote this and like
00:28:13.120 | I was looking over this slide and I was like, well, they haven't checked the largest ones.
00:28:16.440 | You know, they never do check the largest ones because it's really hard to like do this
00:28:20.400 | kind of more, more like analysis work, you know, and things move so fast, fast, like
00:28:24.680 | the really large ones. But you know, T5, 3 billion, you know, 3 billion is like a large
00:28:29.000 | number. It's maybe not a large language model anymore. But, you know, they don't ace this,
00:28:33.920 | right. They're getting like 80%, while when they don't have to do the structural
00:28:37.720 | generalization, when they can just do a test set where
00:28:42.000 | things appear in the same roles as in the training set, they get like 100%, easy. It's not a very
00:28:45.560 | hard task. But, you know, 80% is still pretty good, and it's
00:28:50.880 | probably like, if a human had never ever seen something in subject position, I'm not sure
00:28:55.440 | that it would be like 100% as easy as if they had, you know. Like, I think
00:29:00.000 | we don't want to fully idealize how things work in humans, right.
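To make the semantic-interpretation task format described a moment ago concrete, here is a toy sketch of the kind of input/output pair involved. The real COGS logical forms are more elaborate (variables, indices, and so on); this only shows the shape of the mapping.

```python
# A toy version of the semantic-interpretation mapping discussed above: turn a simple
# SVO sentence into a small logical form. Real COGS outputs are richer than this.
def interpret(sentence):
    content = [w for w in sentence.lower().rstrip(".").split() if w not in ("the", "a")]
    subj, verb, obj = content  # assumes a plain "SUBJ VERB OBJ" sentence
    return f"{verb}(agent={subj}, theme={obj})"

print(interpret("The girl saw the hedgehog"))
# -> "saw(agent=girl, theme=hedgehog)"
```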
00:29:05.400 | So similarly, you can take literal Jabberwocky sentences, right. So, so, so build, building
00:29:12.280 | on some, some work that John did that I'm sure you'll talk about later. So I'm not going
00:29:15.160 | to go in, but maybe I'm wrong on that assumption, right. We can like kind of test the models
00:29:19.800 | like embedding space, right. So if we go high up in the layers and test the embedding space,
00:29:24.180 | we can test it to see if it encodes structural information, right. And, and so we can test
00:29:28.760 | to see like, okay, is there like a, a rough representation of like syntactic tree relations
00:29:35.920 | in this latent space. And then a recent paper asked: does
00:29:44.560 | this work when we introduce new words? So if we take
00:29:49.320 | Jabberwocky-style sentences and then ask, can the model find the
00:29:54.400 | trees in its latent space, does it encode them? And the
00:29:59.760 | answer is, you know, it's kind of worse. In this graph, the
00:30:03.680 | hatched bars, the ones on the right, are the Jabberwocky sentences, and the
00:30:09.040 | clear ones, the not hatched ones, are
00:30:13.840 | the normal sentences. And we see performance is worse - this is
00:30:17.680 | unlabeled attachment score on the y-axis. It probably
00:30:21.400 | performs worse than humans do, right? It's easier to read a normal poem than to read Jabberwocky. So,
00:30:24.840 | you know, the extent to which this is like damning or something, you know, is I, I think
00:30:28.600 | very, very small. I think the paper is, I have linked it there, but, you know, I think
00:30:32.200 | the paper is maybe a bit more, um, um, uh, sure about this being a big deal maybe than
00:30:37.720 | it is. But yeah, you know, it, it, it does show that, that, that this kind of process
00:30:43.400 | isn't, um, trivial. Yeah?
00:30:45.400 | What are the words that, like, applies for Jabberwocky substitutions?
00:30:50.400 | Oh, so this is, um, this is, uh, something called like phonotactics, right? So, so in,
00:30:56.400 | uh, I think like this is probably around, kind of what you're asking that it's like,
00:31:00.400 | you want a word which sounds like it could be in English, right? Like pro- like provocated,
00:31:05.400 | right? It sounds like it could be in English. You know, a classic example is, you know,
00:31:09.400 | blick could be an English word, you know, bnick can't, right? We can't start a word with bn in English.
00:31:12.400 | And that's not an impossibility of the mouth.
00:31:15.660 | It's similar things like pterodactyl, pneumonia.
00:31:19.620 | These come from Greek words like pneumonas and ptero.
00:31:24.380 | It's like I can say them.
00:31:25.460 | I'm a Greek native speaker.
00:31:26.980 | Like PN and PT, I can put them at the beginning of a syllable.
00:31:30.620 | But in English they don't go.
00:31:33.280 | And so if you follow these rules and also add the correct suffixes and stuff.
00:31:39.520 | So like provocated we know is like past tense and stuff.
00:31:43.460 | Then you can make kind of words that don't exist but could exist.
00:31:50.740 | And so they don't throw people off.
00:31:52.580 | This is important for the tokenizers.
00:31:54.080 | You don't want to do something totally wacky to test the models.
00:31:59.740 | But yeah.
00:32:01.540 | So when you generate this test set with these Jabberwocky substitutions,
00:32:06.860 | are these words generated by a computer or is there a human coming up with words that sound like English?
00:32:13.820 | There's some databases.
00:32:16.860 | People have thought of these.
00:32:17.820 | And I think they get theirs from some list of them.
00:32:21.420 | Because if you have 200, that's enough to run this test.
00:32:23.820 | Because it's like a test.
00:32:25.460 | But yeah.
00:32:28.380 | I mean, I think that the phonotactic rules of English can be actually laid out kind of simply.
00:32:33.940 | It's like, you can't really have two stops together.
00:32:39.340 | Like p and t, they're both the same kind of sound.
00:32:40.900 | You can't really put them together.
00:32:43.100 | You can probably make a short program or a long-ish program, but not a very super complex one
00:32:48.140 | to make good jabberwocky words in English.
00:32:51.460 | Yeah.
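Here is one version of that "short program" idea, as a hedged sketch: it samples from hand-picked lists of legal English onsets, vowels, and codas, so the outputs sound possible (blick-like) rather than impossible (bnick-like). The lists are a small illustrative subset, not a full statement of English phonotactics.

```python
# A minimal nonce-word ("Jabberwocky word") generator: legal onsets + vowels + codas.
# The lists are a small hand-picked subset of English phonotactics, for illustration.
import random

ONSETS = ["bl", "br", "gr", "sl", "st", "pr", "tr", "fl", "m", "t", "w"]
VOWELS = ["a", "e", "i", "o", "oo", "ou"]
CODAS = ["ck", "ll", "mp", "nd", "rt", "sh", "ve", "b", "g"]

def nonce_word(syllables=2):
    return "".join(
        random.choice(ONSETS) + random.choice(VOWELS) + random.choice(CODAS)
        for _ in range(syllables)
    )

random.seed(0)
print([nonce_word() for _ in range(5)])  # pronounceable nonsense, blick-style
```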
00:32:52.500 | Yeah?
00:32:53.000 | So I'm wondering how the model would tokenize these jabberwocky sentences.
00:32:56.740 | Would it not just map all these words like provocated just to the unknown?
00:33:01.180 | So these are largely models that have word piece tokenizers.
00:33:08.260 | So if they don't know a word, they're like, OK, what's the largest bit of it that I know?
00:33:13.420 | And then that's like a subtoken.
00:33:15.020 | And this is how most models work now.
00:33:17.100 | It's like back in the day-- and this is like back in the day meaning until maybe like six or seven years ago,
00:33:22.060 | it was very normal to have UNK tokens, like unknown tokens.
00:33:24.460 | But now generally, there is no such thing as an unknown.
00:33:27.700 | You put like kind of at a bare minimum, you have like the alphabet in your vocabulary.
00:33:33.380 | So at a bare minimum, you're splitting everything up into like letter by letter tokens, character by character tokens.
00:33:38.820 | But if you're not, then yeah, it should--
00:33:45.460 | yeah, it should find kind of like-- and this is why the phonotactic stuff is kind of important for this, right?
00:33:53.100 | That it's tokenized like hopefully in like slightly bigger chunks that have some meaning.
00:33:57.620 | And because of how attention works and how contextualization works, you can like--
00:34:01.500 | even if you have like a little bit of a word, you can give the correct kind of attention to it
00:34:08.540 | once it figures out what's going on a few layers in for like a real unknown word.
00:34:12.700 | For like a fake unknown word, then yeah.
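A quick way to see this behavior is to run a nonce word through a subword tokenizer. This sketch assumes the Hugging Face transformers library is installed and can download the tokenizer; the exact split depends on the vocabulary, so the pieces in the comments are only indicative.

```python
# A quick sketch: a word-piece/BPE tokenizer has no UNK for a nonce word; it just
# splits it into known subword pieces. The exact split depends on the vocabulary.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.tokenize("provocated"))        # something like ['pro', '##vo', '##cated']
print(tok.tokenize("the slithy toves"))  # nonce words come out as several pieces
```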
00:34:15.660 | Cool.
00:34:16.180 | I went back, but I want to go forward.
00:34:22.980 | Cool.
00:34:23.380 | Any more questions about anything?
00:34:25.460 | Yeah.
00:34:27.220 | A few slides back, there was like 80% scores that you were saying these are not--
00:34:34.740 | like this isn't a solved problem yet.
00:34:36.940 | I'm just trying to get a sense of what 80% means in that context.
00:34:39.900 | Is it like 80% of exact--
00:34:42.300 | Yeah, it was exact.
00:34:43.940 | I think the relevant comparison is that when you don't have this kind of structural difference,
00:34:48.620 | you know, where something that was
00:34:51.020 | only ever, say, a subject is then used differently,
00:34:53.820 | where something which was never an object is then an object,
00:34:56.620 | you know, the accuracy on that test set is like 100%, easy.
00:35:03.540 | And so it kind of-- there was no good graph which showed these next to each other.
00:35:07.500 | They kind of mentioned it.
00:35:08.300 | But yeah.
00:35:09.180 | And so I think like that's like the relevant piece of information that like somehow this like swapping
00:35:14.500 | around of roles like slightly trips it up.
00:35:16.860 | That being said, you're right, exact match of semantic parses is kind of a hard metric.
00:35:20.820 | You know, and so it's not--
00:35:22.940 | this is-- yeah, none of this stuff, and I think this is important.
00:35:25.100 | None of this stuff is damning.
00:35:25.900 | None of this stuff is like they do not have the kind of rules human have.
00:35:28.420 | This is also like, well, there's a bit of confusion.
00:35:30.700 | There's a bit of confusion in humans.
00:35:31.620 | It actually gets quite a bit--
00:35:33.180 | it gets quite subtle with humans.
00:35:34.820 | And I'm going to go into that in the next section too.
00:35:38.420 | Yeah.
00:35:39.460 | Overall-- sorry, what is it?
00:35:42.540 | Yeah.
00:35:42.780 | Overall, like I think the results are like surprisingly not damning, I would say.
00:35:45.460 | Yeah, this is the--
00:35:47.220 | there's like clearly like, you know, maybe not the fully like programmed discrete kind of rules.
00:35:52.420 | But yeah.
00:35:54.580 | I would say-- cool.
00:35:56.420 | Another thing we could do, yeah, is test how syntactic structure kind of maps onto like meaning and role, right?
00:36:02.100 | And so like as we said before, right, like in English, the syntax of word order gives us the who did what to whom meaning.
00:36:09.460 | And so, you know, if we have any combination like A verbed B,
00:36:14.700 | you know, A is the doer, B is the patient.
00:36:19.220 | And so we ask like, is this kind of relationship, you know, strictly represented in English language models as it is like in the English language?
00:36:29.420 | And so what we could do is that we could take a bunch of things which like, you know, appear in subject position,
00:36:35.340 | a bunch of things which appear in object position and take their latent space representation and kind of learn, you know,
00:36:48.860 | learn like a little classifier, you know, this should be like a pretty clear distinction in latent space.
00:36:52.860 | In any like good model, right, like which like these models are good, this should be a pretty clear distinction.
00:36:57.020 | We could just fit a linear classifier to kind of separate them, right?
00:36:59.740 | And the more on the one side you are, you're more subject, the more on the other side you are, you're more object, right?
00:37:05.860 | And so then we can test, you know, does the model know the difference, you know,
00:37:14.260 | be between when something is a subject and when something is an object, you know,
00:37:18.140 | does it know that like you're going to go on opposite sides of this dividing line, you know,
00:37:26.340 | even if like everything else stays the same and all the clues point to something else, right?
00:37:30.580 | So it's like does syntax map onto role in this way?
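Here is a hedged sketch of that probing setup, under a lot of simplifying assumptions: the model (GPT-2 small), the layer, the crude token-to-word alignment, and the tiny hand-made training set are all placeholders, so this shows the shape of the experiment rather than the actual paper's setup.

```python
# A sketch of the probing idea: collect hidden states for nouns in subject vs. object
# position, fit a linear classifier, then see which side of the boundary a noun lands
# on when it shows up in the other role. Model, layer, and data are placeholder choices.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)

def word_vector(sentence, word, layer=8):
    """Hidden state of the (first) token of `word` at the given layer. Crude alignment."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states[layer][0]
    tokens = tok.tokenize(sentence)
    idx = next(i for i, t in enumerate(tokens) if t.lstrip("\u0120").lower() == word)
    return hidden[idx].numpy()

# Tiny illustrative training set: label 1 = subject position, 0 = object position.
subj = [word_vector("The dog chased the ball", "dog"),
        word_vector("The girl saw the cat", "girl")]
obj = [word_vector("The dog chased the ball", "ball"),
       word_vector("The girl saw the cat", "cat")]
probe = LogisticRegression().fit(np.array(subj + obj), [1, 1, 0, 0])

# A noun the probe only ever saw as an object, now appearing as a subject:
test = word_vector("The ball chased the dog", "ball")
print(probe.predict_proba(test.reshape(1, -1)))  # how "subject-like" is it here?
```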
00:37:33.300 | You might think like, well, I could just check if it's like second or like fifth, right?
00:37:37.100 | But, you know, we actually - yeah, this is a paper that I wrote - we did compare, you know,
00:37:42.380 | we like try to control for like position stuff in various ways.
00:37:46.540 | And these are like, yeah.
00:37:48.180 | And so it's hopefully we claim we're kind of showing like the like syntax to role mapping.
00:37:55.180 | And what we see is that it does, right?
00:37:56.500 | So if we kind of graph the distance from that dividing line, you know, on the y-axis,
00:38:04.620 | we see like the original subjects when we swap them and put them in object position,
00:38:11.260 | they do like diverge as we go up layers in that dimension.
00:38:15.700 | And we tried this again, you know, all this analysis experiment with some kind of small models,
00:38:18.580 | with some BERT, with some GPT-2, you know, with some like a bigger version of GPT-2 and it worked out.
00:38:22.380 | But it's like, you know, none of this is like, you know, none of this is like the big, big stuff.
00:38:30.420 | I think now we're starting to see more analysis on the big, big stuff.
00:38:32.740 | I think it's really cool.
00:38:35.380 | Yeah.
00:38:36.780 | So then where are we with like structure and language models, right?
00:38:39.300 | We know that language models are not, they're not engineered around discrete linguistic rules.
00:38:45.100 | But the pre-training process, you know, it isn't just a bunch of surface level memorization, right?
00:38:48.740 | We have seen this.
00:38:49.700 | There is some kind of like discrete rule-based system kind of coming out of this.
00:38:56.420 | You know, maybe it's not the perfect kind of thing you would like write down in a syntax class,
00:39:00.500 | but, you know, there is some syntactic knowledge, you know, and it's complicated in various ways.
00:39:04.980 | And humans are also complicated.
00:39:06.260 | And that's what we're going to get to next, right?
00:39:09.140 | There's no ground truth for how language works yet, right?
00:39:11.700 | Like if we knew how to fully describe English, right, with a bunch of good discrete rules,
00:39:17.020 | we would just like make an old pipeline system and it would be amazing, right?
00:39:21.500 | If we could like take the Cambridge grammar of English, but like it was truly, truly complete.
00:39:26.980 | If we just knew how English worked, we would do that.
00:39:29.540 | And so we're working in this case where there's really no ground truth.
00:39:33.740 | Cool. Any questions about this?
00:39:35.220 | Try and move beyond syntactic structure.
00:39:41.220 | Cool.
00:39:42.260 | So moving beyond this kind of like very structure-based idea of language,
00:39:48.460 | I think it's very cool to learn about structure in this way.
00:39:51.140 | And like at least how I was taught linguistics, it was like a lot of it,
00:39:53.940 | the first like many semesters was like this kind of stuff.
00:39:59.740 | But then, but I think there's like so much more.
00:40:02.580 | And like very important, I think that meaning plays a role in linguistic structure, right?
00:40:10.020 | Like there's a lot of rich information in words that affects like the final way that like the syntax works.
00:40:16.260 | And of course, what like you end up meaning and like what like the words influence each other to mean, right?
00:40:21.740 | And so like the semantics of words, right, the meaning,
00:40:24.380 | it's like always playing a role in forming and applying the rules of language, right?
00:40:28.580 | And so, you know, for example, like a classic example is like, you know, verbs,
00:40:31.900 | they like have kind of like selectional restriction, right?
00:40:34.260 | So like ate can like take kind of any food and it can also take nothing.
00:40:37.420 | It's like I ate, it means that I've just like I've eaten, right?
00:40:40.220 | I've devoured, right?
00:40:41.860 | The word devoured actually can't be used intransitively, right?
00:40:45.220 | It sounds weird.
00:40:45.900 | You need to devour something, right?
00:40:47.860 | There's verbs like elapsed that only take like, you know, a very certain type of noun, right?
00:40:52.820 | Like elapsed only takes nouns that refer to time, you know,
00:40:59.140 | so maybe harvest can refer to time, moon can refer to time, summer, you know,
00:41:02.500 | but it cannot take a noun like trees, right?
00:41:04.820 | There's even verbs that only ever take one specific noun as their argument, right?
00:41:08.380 | It's like classic example.
00:41:09.580 | I think, yeah, my advisor Dan Jurafsky told me this one to put in.
00:41:15.100 | [LAUGHTER]
00:41:16.980 | And- and- and what's cool is that like that- that's how we train models these days.
00:41:20.460 | If you see this- this diagram I screenshotted from John's Transformers lecture, right?
00:41:26.740 | We start with a rich semantic input, right?
00:41:28.740 | We start with these embeddings with like a thousand, on the order of a thousand, dimensions,
00:41:32.500 | depending on the model size, right?
00:41:35.580 | Which it's like, think of how much information you can express like on a plane, right?
00:41:39.260 | On two dimensions, it's like the kind of richness that you can fit into a thousand dimensions, you know,
00:41:43.300 | it's huge and we start with these word- word- word embeddings and then move on, right?
00:41:48.460 | It's like the attention block and- and everything.
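If you want to see those "on the order of a thousand dimensions" concretely, you can inspect a model's input embedding matrix. This sketch assumes the transformers library and uses GPT-2 small just as an example; its hidden size is 768, and larger models go well past a thousand.

```python
# Peek at the "rich semantic input": the token embedding matrix of a small model.
# GPT-2 small is just an example here; bigger models have larger hidden sizes.
from transformers import AutoModel

model = AutoModel.from_pretrained("gpt2")
emb = model.get_input_embeddings().weight
print(emb.shape)  # (vocab_size, hidden_size), roughly (50257, 768) for GPT-2 small
```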
00:41:51.780 | And so, yeah, I'm just gonna go through some examples of the ways that- that languages,
00:41:57.060 | you know, the ways that like meaning kind of plays a role in forming syntax,
00:42:01.980 | hopefully it's like fun, a tour through like the cool things that happen in language, right?
00:42:06.900 | So, as we said, you know, anything can be an object, anything can be a subject,
00:42:11.260 | we want to be able to say anything, language can like express anything,
00:42:14.180 | this is like kind of a basic part of language.
00:42:16.300 | But, you know, many languages they have a special syntactic way of- of dealing with this, right?
00:42:20.620 | So, they want to tell you like if there's an object that you wouldn't expect, right?
00:42:24.220 | Like in this case, I want to tell you, hey, watch out, you know, the- be careful,
00:42:28.300 | we're- we're dealing with a weird object here, right?
00:42:30.980 | So, this is like kind of in the syntax of languages, you know, if you're- if you're-
00:42:34.740 | if you're a native speaker or- or you've learned Spanish, right?
00:42:38.660 | You- you know, this like a constraint, right?
00:42:40.700 | So, if you say like, you know, so if something is a- is an object but it's inanimate,
00:42:47.260 | you don't need the a because you're like, yeah, I found a problem.
00:42:49.700 | But then if you're putting something animate in the object position,
00:42:52.620 | you need to kind of mark it and you'd be like, hey, watch out, you know, there- there's an object here.
00:42:56.180 | And that's like a rule of the grammar, right?
00:42:57.260 | Like if you don't do this, it's wrong.
00:42:58.420 | And they tell you this in Spanish class.
00:43:00.980 | Similarly, like Hindi has a kind of a more subtle one, but I think it's cool, right?
00:43:07.340 | So, you- to- if- if you put an object that is definite,
00:43:12.740 | you have to mark it with a little like- this is an object marker, right?
00:43:16.020 | Like a little accusative marker, right?
00:43:18.220 | And like, you might ask, okay, I understand why like animacy is- is- is- is- is a big deal, right?
00:43:26.180 | Like, you know, maybe animate things more often do things and have things done to them.
00:43:30.220 | But like, why- why- why definiteness, right?
00:43:33.980 | Like, why would you need this little ko marker, this like the goat versus a goat?
00:43:39.460 | And it's like, well, probably something is definite.
00:43:41.020 | It means that it's like- it means that- that it's like in the kind of in-
00:43:45.540 | we've like kind of probably been talking about it or we're all thinking about it, you know.
00:43:48.820 | For example, it's like, oh, I ate the apple, right?
00:43:51.140 | This means that either like we had one apple left and I ate it or like it was like really rotten or something.
00:43:55.140 | You can't believe I ate it, right? Or something like that.
00:43:57.220 | And so like, then things that we're already talking about,
00:43:59.940 | they're probably more likely to be subjects, right?
00:44:02.340 | Like if we're all, you know, you know, if- if I was like, oh, Rosa, you know, like,
00:44:08.060 | yeah, I feel like Rosa did this and Rosa did- did- did that and Rosa that.
00:44:12.340 | And then- and then- and- and- and then like Leon kissed Rosa.
00:44:15.260 | You'd be like, no, you probably want to be like Rosa kissed Leon, right?
00:44:17.100 | You probably want to put, you know, it's not strict,
00:44:18.860 | but if you're talking about something, you're probably-
00:44:20.860 | it's probably going to be the subject of the next sentence.
00:44:22.580 | So then if it's the goat, you- you have to put a little accusative marker on it.
00:44:26.540 | So this is like how like the marking in the language works,
00:44:30.540 | and it's kind of all influenced by this like interesting semantic relationship.
00:44:36.100 | And language models are also aware of these gradations.
00:44:39.100 | In a similar subject-and-object classification experiment that we did, we see that language models also have these gradations.
00:44:50.940 | Again, if you map the probability of being a subject, from that classifier, on the y-axis, we see that accuracy is high.
00:44:57.620 | This is over many languages. In all of them, on the left we have the subjects, classified above 50, and on the right the objects, classified below 50.
00:45:03.940 | But animacy influences this grammatical distinction:
00:45:08.100 | if you're animate and a subject, the model is very sure. If you're inanimate and an object, it's very sure.
00:45:13.500 | Anything else, and it's close to 50.
00:45:16.100 | So this kind of relation, where the meaning plays into the structure, is reflected in language models.
00:45:30.980 | And that's not bad. It's good, because it's how humans are.
00:45:33.780 | Or at least it means we should temper our expectations away from the fully syntactic picture we've been talking about.
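To make the probing setup being described here concrete, here is a minimal sketch: fit a linear probe that predicts grammatical role from contextual embeddings, then report its average P(subject) separately for animate and inanimate subjects and objects. This is an illustration under assumptions, not the code from the paper; the embeddings, labels, and the random arrays standing in for them are hypothetical placeholders.

```python
# Sketch of a subject/object probe evaluated by animacy, assuming you already
# have one contextual embedding per noun argument (e.g. mean-pooled vectors
# from a multilingual encoder) plus gold role and animacy labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_role_probe(train_vecs, train_roles):
    """Fit a linear probe that predicts grammatical role from an embedding."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(train_vecs, train_roles)  # roles are "subj" / "obj"
    return probe

def prob_subject_by_group(probe, vecs, roles, animacy):
    """Print the average P(subject) for each (role, animacy) cell."""
    subj_col = list(probe.classes_).index("subj")
    p_subj = probe.predict_proba(vecs)[:, subj_col]
    for role in ("subj", "obj"):
        for anim in ("anim", "inan"):
            mask = (roles == role) & (animacy == anim)
            if mask.any():
                print(f"{role:4s} {anim:4s}  mean P(subj) = {p_subj[mask].mean():.2f}")

if __name__ == "__main__":
    # Placeholder random data standing in for real embeddings and labels.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(400, 768))
    roles = rng.choice(["subj", "obj"], size=400)
    animacy = rng.choice(["anim", "inan"], size=400)
    probe = train_role_probe(X[:300], roles[:300])
    prob_subject_by_group(probe, X[300:], roles[300:], animacy[300:])
```

On real data, the pattern described on the slide would show up here as high P(subject) for animate subjects, low P(subject) for inanimate objects, and values closer to 0.5 for the mixed cells.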
00:45:44.580 | Another cool example of how meaning can influence what we can say.
00:45:51.700 | I've said from the beginning, many times, that all combinations of structures and words are possible, but that's not strictly true.
00:46:00.100 | In many cases, if something is too outlandish, we often just assume the more plausible interpretation.
00:46:05.740 | There are psycholinguistics experiments that test this with these giving verbs.
00:46:13.460 | You can say "the mother gave the daughter the candle," and you can switch that around (this is the dative alternation) to "the mother gave the candle to the daughter."
00:46:24.580 | But what if you switch around who's actually being given?
00:46:28.580 | If you say "the mother gave the daughter to the candle," people don't interpret that in its literal sense.
00:46:38.780 | They usually interpret it as the mother giving the daughter the candle.
00:46:41.600 | Of course, outlandish meanings are never impossible to express, because nothing is.
00:46:47.380 | You can spell it out: you could say, well, the mother, she picked up her daughter and she handed her to the candle, who is sentient.
00:46:55.420 | You can say that, but you can't do it simply with the "give" construction; people tend to interpret it the other way.
00:47:03.940 | So marking these less plausible things more prominently is a pervasive feature that we see across languages, in all these ways.
00:47:13.500 | And all these ways are deeply embedded in the grammar, as we saw earlier for Spanish and Hindi.
00:47:20.260 | Cool.
00:47:21.260 | Another way we see meaning play in, and break apart this fully compositional picture of syntax,
00:47:37.100 | is that meaning can't always be composed from individual words.
00:47:41.060 | Language is just full of idioms. When you talk about idioms, you might think, okay, there are maybe 20 of them, things my grandfather would say, things about chickens and donkeys.
00:47:50.780 | In Greece, they're all donkeys.
00:47:53.500 | But we're actually constantly using constructions that are idiomatic in their own little way, whose meaning we couldn't get from composing the words.
00:48:05.220 | Things like: I wouldn't put it past him, he's getting to me these days, that won't go down well with the boss.
00:48:09.860 | There are so many of these, and it's a basic part of communication to use these little canned idiomatic phrases.
00:48:15.700 | Linguists love saying that any string of words you say is totally novel,
00:48:27.780 | and it's probably true: I've been speaking for about 50 minutes, and probably no one has said this exact thing ever before; I just used the computational rules of English to make it.
00:48:36.340 | But actually, most of my real utterances are things like "oh, yeah, no, totally," which people say all the time.
00:48:44.220 | We have these little canned things that we love reusing, and we reuse them so much that they stop making sense if you break them apart into individual words.
00:48:55.900 | And we even have constructions that can take arguments, so they're not canned strings of words; they're a canned way of saying something, and they don't really work if you build them up from the syntax.
00:49:07.940 | For example: "he won't eat shrimp, let alone oysters."
00:49:13.140 | What does that mean? Well, it means I'm defining some axis of more-ness.
00:49:19.580 | In this case, probably how weird a shellfish is: shrimp is less weird, so oysters are more weird.
00:49:27.860 | And if I say "he won't eat shrimp, let alone beef," the axis is vegetarianism.
00:49:31.220 | So this construction does a kind of complex thing: you're saying he won't do one thing, let alone the one that's further along that dimension.
00:49:38.340 | Or: "she slept the afternoon away," "he knitted the night away," "they drank the night away."
00:49:44.820 | This "time away" construction isn't something you could really derive otherwise.
00:49:48.700 | Or this "-er, -er" construction: "the bigger they are, the more expensive they are," or, man, I forgot how it goes, "the bigger they come, the harder they fall."
00:50:00.900 | Or "that travesty of a theory," the "N of an N" construction. There are so many of these.
00:50:08.460 | So much of how we speak, if you actually try to build the tree and compose the semantic parts up from it, won't really make sense.
00:50:16.460 | And there's been some work on this that has come to light more recently, which I've been really excited by: testing constructions in large language models.
00:50:26.860 | Just this year there was a paper by Kyle Mahowald, who is a postdoc here, testing the "a beautiful five days in Austin" construction.
00:50:35.500 | It's the article + adjective + numeral + noun construction, which shouldn't really work, because you have "a" with "days," a plural.
00:50:48.860 | And anything even slightly different from it doesn't work: "a five beautiful days," that doesn't work.
00:50:56.220 | So somehow this specific construction is grammatically correct to us,
00:50:59.580 | but you can't say "a five days in Austin," you can't say "a five beautiful days in Austin"; you have to say it exactly like this.
00:51:06.260 | And GPT-3 actually largely concurs with humans on these things.
00:51:13.380 | On the left here we have the orderings that are acceptable to humans: "a beautiful five days in Austin" and "five beautiful days in Austin."
00:51:27.460 | Those are both acceptable to humans.
00:51:28.460 | They do this over many, many instances of this construction, not just Austin, obviously.
00:51:33.220 | GPT-3 accepts these, those are the gray bars, and humans also accept these, those are the green triangles.
00:51:41.900 | For every other variant, the human triangles are very low, and GPT-3 is lower too, but it does get tricked by some things.
00:51:48.560 | So it seems to have knowledge of this construction, but not as starkly as humans do.
00:51:53.580 | Especially if you look at that third one over there, "the five beautiful days": humans don't accept it as much.
00:52:01.020 | It's funny to me, it sounds almost better than the rest of them, but I guess these green triangles were computed very robustly.
00:52:12.220 | So I'm an outlier.
00:52:13.900 | Yeah.
00:52:14.900 | GPT-3 thinks those are better than humans do, but there is still a significant difference between the gray bars and the orange bars.
00:52:25.440 | Similarly, some people tested "the X-er, the Y-er" construction.
00:52:29.200 | They took example sentences that were instances of "the X-er, the Y-er,"
00:52:34.660 | and then example sentences which had an "-er" followed by an "-er" but weren't actually the construction,
00:52:44.620 | like "the older guys help out the younger guys." That's not an instance of "the X-er, the Y-er."
00:52:49.780 | Then they asked: if we mark the ones that are instances as positive and the ones that aren't as negative, does the latent space of models encode this difference?
00:52:58.340 | Do all the instances of this construction cluster together in some way?
00:53:01.540 | And they find that it does.
00:53:04.940 | The last thing I want to talk about in this semantic space, after constructions and all that,
00:53:09.700 | is that the meaning of words is actually very subtle and sensitive, and it's influenced by context in all these ways.
00:53:17.780 | Erika Petersen and Chris Potts from the linguistics department here did a great investigation of the verb "break."
00:53:30.300 | "Break" can have all these meanings.
00:53:32.540 | We tend to think, yeah, "break" is a word, and words are things like "table" and "dog" and "break" that have one sense.
00:53:39.020 | But these aren't even senses that you can enumerate, like river bank versus financial bank.
00:53:43.380 | "Break the horse" means tame it, while "break a $10 bill" means split it into smaller bills.
00:53:50.660 | And there are just so many of these: "break free," "break even."
00:53:55.020 | There are so many ways in which the meaning of "break" is subtle and influenced by context.
00:54:00.300 | And that's true for pretty much every word, or at least many words. Maybe for "table" and "dog" there's a set of all things that are tables or dogs,
00:54:07.140 | and the word kind of describes that set, though there's maybe a more philosophical way of going about it.
00:54:12.380 | But take "pocket": there's a pocket, but then you can pocket something,
00:54:15.140 | and that often means steal, not just literally put something in your pocket.
00:54:20.900 | So there are all these ways in which the meaning of words is shaped by everything around it.
00:54:31.260 | What they do here, and don't worry about everything that's going on in this diagram, is map each sense to a color.
00:54:38.540 | When you start off in layer one, they're all clustered, I think, just by position embedding; I think that's what it is.
00:54:45.940 | Then you take all these occurrences and pass them through a big model, like RoBERTa-large.
00:54:53.820 | At first they're all jumbled up, because they're all just "break," just in different positions.
00:54:59.220 | And then by the end, they've all split up: the colors are clustering together, each color being one of these meanings.
00:55:07.060 | So they cluster together, and, whether it's constructions again or just context, the model ends up isolating these really subtle aspects of meaning.
00:55:16.420 | Yeah.
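To give a concrete sense of how a plot like this can be made, here is a minimal sketch: grab the hidden state of the "break" token at an early and a late layer and project each set down to two dimensions. The model choice, the tiny sentence list, and the sense labels are illustrative stand-ins; the actual study used a much larger hand-annotated corpus and its own dimensionality-reduction choices.

```python
# Sketch: plot contextual embeddings of "break" at an early vs. late layer,
# colored by a hand-assigned sense label. The sentences and senses are toys.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

MODEL = "roberta-large"          # any masked LM exposing hidden states works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL, output_hidden_states=True).eval()

examples = [  # (sentence, sense label) -- purely illustrative
    ("Can you break a ten dollar bill?", "money"),
    ("They had to break the wild horse.", "tame"),
    ("Don't break the vase!", "shatter"),
    ("She finally managed to break even.", "even"),
]

def break_vector(sentence, layer):
    """Hidden state of the subword containing 'break' at the given layer."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states[layer][0]   # (seq_len, dim)
    toks = tok.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    idx = next(i for i, t in enumerate(toks) if "break" in t.lower())
    return hidden[idx].numpy()

for layer in (1, 24):                       # early vs. final layer
    vecs = [break_vector(s, layer) for s, _ in examples]
    xy = PCA(n_components=2).fit_transform(vecs)
    plt.figure()
    for (x, y), (_, sense) in zip(xy, examples):
        plt.scatter(x, y)
        plt.annotate(sense, (x, y))
    plt.title(f"'break' embeddings, layer {layer}")
plt.show()
```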
00:55:18.740 | So I think a big question in NLP is: how do we strike the balance between syntax and the ways that meaning influences things?
00:55:26.620 | I pulled up this quote from a book by Joan Bybee, which I enjoy, and I think it brings to light a question that we should be asking in NLP.
00:55:39.740 | This book is just a linguistics book; it's not about NLP at all.
00:55:41.740 | But: while language is full of both broad generalizations and items with particular properties, linguists have been dazzled by the quest for general patterns.
00:55:49.300 | That was the first part of this talk.
00:55:50.300 | And of course, the abstract structures and categories of language are fascinating.
00:55:55.300 | But, I would submit, or rather she would submit, that what is even more fascinating is the way that the general structures arise from and interact with the more specific items
00:56:02.100 | of language, producing a highly conventional set of general and specific structures that allow the expression of both conventional and novel ideas.
00:56:10.100 | It's this middle ground between abstraction and specificity that humans probably exhibit, and that we would want our models to exhibit.
00:56:20.900 | Yeah.
00:56:21.900 | I was wondering if you could go back one slide and just unpack this diagram a little more, because I'm fairly new to NLP.
00:56:31.180 | I've never seen a diagram like this.
00:56:32.900 | Oh, sorry. Yeah.
00:56:34.900 | What does this mean? How should I interpret it?
00:56:36.900 | Oh, so this is about the way that words are represented as you're passing through a transformer, through many layers.
00:56:48.980 | The main thing I want to say is: look at how the colors cluster.
00:56:52.460 | But yeah, you're passing through a transformer with many layers, and at any one point in that transformer
00:56:57.140 | you can ask: OK, how are the words organized now? And you think, well, I'm going to project that down to two dimensions from about a thousand.
00:57:05.340 | And that's maybe a good idea, maybe a bad idea, there's a lot of debate,
00:57:07.820 | but I wouldn't be able to show them here in a thousand dimensions, so let's assume it's an OK thing to be doing.
00:57:15.660 | This is what they've done for layer one and then for layer twenty-four.
00:57:21.540 | And we can see that they start off with the colors totally jumbled. Before layer one, you add in the position embedding,
00:57:31.220 | so I think that's what all those clusters are:
00:57:34.620 | it's clustering by position because you don't have anything else to go off of.
00:57:36.340 | It's like: this is "break" and it's in position five, so OK, I guess I'll cluster all the breaks in position five.
00:57:41.380 | But then as you go up the model, and all this meaning is being formed, you see these senses come out in how it organizes things.
00:57:57.340 | All these breaks become very specific, very subtle versions of "break."
00:58:04.260 | This work is different from a lot of NLP work because it has a lot of labor put into the labeling.
00:58:13.540 | The person who did this is a linguistics student,
00:58:19.940 | and going through a corpus and labeling every "break" by which of these senses it has is a lot of work.
00:58:23.300 | So I think it's the kind of thing that you wouldn't be able to show otherwise, and it's often not really shown.
00:58:29.900 | Yeah.
00:58:31.260 | Cool.
00:58:33.100 | So yeah, language is characterized by the fact that it's this amazingly abstract system. I started off raving about that, and we want our models to capture it.
00:58:40.700 | That's why we do all these compositionality and syntax tests.
00:58:43.420 | But meaning is so rich and multifaceted, and high-dimensional spaces are much better at capturing these subtleties.
00:58:52.020 | We started off talking about word embeddings in this class.
00:58:53.020 | High-dimensional spaces are so much better at this than any rules we would come up with, like: OK, maybe we could have break-subscript-money, and we're going to put that into our system.
00:59:02.420 | So where do deep learning models stand now, between surface-level memorization and abstraction?
00:59:08.940 | This is what a lot of analysis and interpretability work is trying to understand.
00:59:13.540 | And I think what's important to keep in mind when reading and doing this analysis and interpretability work is that this is not even a solved question for humans.
00:59:22.100 | We don't know exactly where humans stand between having an abstract grammar and having these very construction-specific and meaning-specific ways that things work.
00:59:31.620 | Cool.
00:59:32.620 | Any questions overall on the importance of semantics and the richness of human language?
00:59:38.420 | Yeah.
00:59:39.420 | This is probably a question from quite a bit before, but you were showing a chart from your research
00:59:45.860 | where the model is really well able to distinguish inanimate from animate given its knowledge of subject or object.
00:59:58.980 | I was just trying to interpret that graph and understand the links between the categories.
01:00:06.060 | Yeah, let me flip back to it.
01:00:07.060 | Sorry, I know it's a long way back.
01:00:08.060 | No, no, it's not.
01:00:09.060 | I think it's here.
01:00:10.060 | Right.
01:00:11.060 | Yeah.
01:00:12.060 | Yeah.
01:00:13.060 | So the main thing is that this is similar to the other graph, where what it's trying to distinguish is subjects from objects.
01:00:19.540 | But we've just split the test set four ways: subject inanimate, subject animate, and so on.
01:00:28.100 | What the two panels on the x-axis are showing are these different splits.
01:00:34.380 | So for things that are subjects, basically the ground truth is that things on the left should be above 50 and things on the right should be below 50.
01:00:40.700 | And that's what's happening.
01:00:41.700 | But if we further split by animate and inanimate, we see that there's this influence of animacy on the probability.
01:00:51.300 | Yeah. Sorry, I rushed over these graphs a bit; I wanted to give a taste of the things that happen,
01:00:55.740 | but yeah, it's good to also understand fully what's going on.
01:01:00.260 | Cool.
01:01:01.260 | Yeah.
01:01:02.260 | So this is also from a while back.
01:01:03.260 | You don't have to go to the slide.
01:01:04.260 | Yeah.
01:01:05.260 | So you were talking about acceptability.
01:01:09.180 | For judging acceptability, with something like GPT-3, how do you determine if it finds a sentence acceptable?
01:01:18.740 | I think you can just take the logits.
01:01:20.460 | I think that's what Kyle Mahowald did in this paper: you just take the probabilities it outputs, and for GPT-3 that's going left to right.
01:01:32.580 | I think there are other things that people do sometimes,
01:01:36.380 | but especially for these models, you don't have access to much apart from the generations and the probability of each generation,
01:01:41.260 | so I think that's what you'd want to do.
01:01:44.780 | And you don't want to just multiply every token probability together,
01:01:48.420 | because if you're multiplying many probabilities, longer sentences become very unlikely,
01:01:54.540 | which is not true in that way for humans.
01:01:58.740 | So there are things you should do, ways to control for that, when you're running an experiment like this.
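For anyone curious what "just take the logits" looks like in practice, here is a rough sketch of scoring sentences with a left-to-right language model: sum the token log-probabilities, and also report a per-token average as one simple way of controlling for length. The choice of GPT-2 and this particular normalization are assumptions for illustration; the construction papers discussed above use more careful controls, such as comparing minimally different sentences.

```python
# Sketch: acceptability scoring with a causal LM. Raw summed log-probability
# penalizes long sentences, so a per-token average (or a comparison between
# minimally different sentences) is the more usual move.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def log_prob(sentence):
    """Return (total log-prob, per-token log-prob) of a sentence."""
    ids = tok(sentence, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood over the predicted tokens.
    n_predicted = ids.shape[1] - 1
    total = -out.loss.item() * n_predicted
    return total, -out.loss.item()

for s in ["a beautiful five days in Austin",
          "a five beautiful days in Austin"]:
    total, per_tok = log_prob(s)
    print(f"{s!r:40s} total={total:7.2f}  per-token={per_tok:6.2f}")
```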
01:02:03.940 | Yeah.
01:02:05.940 | Cool.
01:02:07.940 | Okay.
01:02:09.940 | So moving on to multilinguality in NLP.
01:02:16.820 | So far we've been talking about English.
01:02:19.020 | I haven't been saying it explicitly every time, but most things I've said, apart from maybe some differential object marking examples,
01:02:24.860 | have been about English and about English models. But there are so many languages:
01:02:29.500 | there are around 7,000 languages in the world,
01:02:35.140 | though it's hard to define what a language is.
01:02:40.780 | It's difficult even in the case of English.
01:02:43.460 | There's Scots, the language spoken in Scotland: is that English?
01:02:47.420 | Or something like Jamaican English, maybe that's a different language;
01:02:51.380 | it has different structures, but it's still clearly much more closely related to English than anything else,
01:02:55.500 | more than to German or something.
01:02:57.340 | So how do you make a multilingual model?
01:03:03.100 | Well, so far a big approach has been: you take a bunch of languages, this slide shows all of them, though maybe you're not going to take all of them, maybe you
01:03:11.340 | take a hundred or something, and you just funnel them into one transformer language model.
01:03:16.340 | There are things you might do, like upsampling languages you don't have much data for, or downsampling ones you have too much data for,
01:03:23.220 | but this is the general approach: what if we just make one transformer language model, usually a
01:03:32.860 | BERT-type model, because it's hard to get good generation for too many languages,
01:03:36.420 | one transformer language model for all of these languages?
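Concretely, the upsampling and downsampling mentioned here is often done by exponentially smoothing the per-language corpus sizes, roughly in the spirit of the sampling recipes used for multilingual BERT and XLM-R style training, though the exact exponent is a design choice. A small sketch with made-up corpus sizes:

```python
# Sketch: temperature-style smoothing of language sampling probabilities.
# With alpha = 1.0 you sample proportionally to corpus size (high-resource
# languages dominate); smaller alpha upsamples low-resource languages.
def sampling_probs(sizes, alpha=0.3):
    smoothed = {lang: n ** alpha for lang, n in sizes.items()}
    total = sum(smoothed.values())
    return {lang: p / total for lang, p in smoothed.items()}

# Made-up sentence counts, just for illustration.
corpus_sizes = {"en": 300_000_000, "hi": 20_000_000, "sw": 500_000}

for alpha in (1.0, 0.3):
    probs = sampling_probs(corpus_sizes, alpha)
    print(alpha, {k: round(v, 3) for k, v in probs.items()})
```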
01:03:42.240 | What's cool about this is that multilingual language models let us share parameters between high-resource languages and low-resource languages.
01:03:51.300 | There are a lot of languages in the world, really most languages in the world, for which you could not train even a BERT-sized model;
01:03:58.860 | there's just not enough data. And there's a lot of work being done on this.
01:04:03.380 | And one way to think about this is: pre-training and transfer learning brought us so much unexpected success,
01:04:12.860 | and we get this great linguistic capability and generality, which we weren't even asking for, when we pre-train something on English.
01:04:21.820 | So will this self-supervised learning paradigm also deliver across languages?
01:04:25.300 | Maybe I can get a lot of the linguistic knowledge, the more general stuff, from all the high-resource languages and then apply it to the low-resource languages.
01:04:35.620 | A bilingual person doesn't have two totally separate parts of themselves that have learned the two languages;
01:04:39.420 | there's probably some sharing, some way that things live in the same space, since languages are broadly the same.
01:04:48.620 | And so we have this attempt to bring NLP to some, still very small, subset of the 7,000 languages of the world.
01:05:00.300 | We can look at it through two lenses, right?
01:05:03.020 | On the one hand, you know, languages are remarkably diverse.
01:05:05.860 | So we'll go over some of the cool ways that languages in the world vary, you know, and
01:05:10.580 | so does multilingual NLP capture the specific differences of different languages?
01:05:15.020 | On the other hand, you know, languages are similar to each other in many ways.
01:05:18.740 | And so does multilingual NLP capture the parallel structure between languages?
01:05:23.940 | Just to go over some ways of really understanding how diverse languages can be:
01:05:33.140 | in around a quarter of the world's languages, and this is a quote from a book, every statement, every time you use a verb, must specify the type of source on which it is based.
01:05:42.940 | It's a bit like how we have tense in English, where everything you say is either in the past, the present, or the future tense.
01:05:51.100 | So here's an example in Tariana; these are again from the book, it's not a language I know.
01:05:58.320 | You have this marker in bold at the end.
01:06:02.920 | When you say something like "José has played football," if you put the -ka marker, that means that we saw it;
01:06:10.620 | it's the visual evidential marker.
01:06:12.540 | And there's a non-visual marker that roughly means we heard it.
01:06:17.340 | There's one for inferring it from visual evidence:
01:06:24.140 | if his cleats are gone, and he is also gone, and we see people going to play football,
01:06:32.740 | or coming back from playing football, I guess, since this is in the past,
01:06:35.300 | then we can infer it, and you would use that marker.
01:06:37.940 | Or if he plays football every Saturday and it's Saturday, you would use a different marker.
01:06:47.340 | Or if someone has told you, if it's hearsay, you would use yet another marker.
01:06:50.900 | So this is a part of the grammar.
01:06:55.300 | To me, at least, since I don't speak any language that has this, it seems very cool
01:07:03.100 | and very different from anything I would ever think would be a part of the grammar, especially a compulsory part of the grammar.
01:07:12.940 | But it is.
01:07:14.180 | And you can map this out. I wanted to include some maps from WALS, the World Atlas of Language Structures,
01:07:19.380 | which is always so fun. You can map out all the languages.
01:07:23.820 | I only speak white-dot languages, which have no grammatical evidentials:
01:07:27.820 | if you want to say whether you heard something or saw it, you have to say it in words.
01:07:31.860 | But there are many languages that do have them, especially in the Americas.
01:07:39.820 | Tariana is, I think, a Brazilian language from up by the border, yeah.
01:07:46.460 | While we're looking at language typology maps, these language categorization maps,
01:07:59.220 | the classic one is subject, object, and verb order.
01:08:04.460 | As we said, English has SVO order, but there are just so many orders; almost all the possible ones are attested.
01:08:14.340 | Some languages have no dominant order, like Greek: a language that I speak natively has no dominant order,
01:08:20.220 | and you move things around for emphasis or whatever.
01:08:24.020 | And here we're seeing some diversity, we're seeing typology, and we're also seeing some tendencies:
01:08:30.900 | some orders are just so much more common than others.
01:08:33.580 | This is again something people talk about so much; it's a very big part,
01:08:40.700 | a huge part, of linguistics.
01:08:41.900 | Why are some orders more common than others? It's a basic fact of language.
01:08:45.100 | Is it just a fact of how discourse works, maybe, that one order is preferred by many people for saying things?
01:08:52.940 | There are a lot of opinions on this.
01:08:55.740 | Another way that languages vary is the number of morphemes they have per word.
01:09:00.180 | Some languages, like Vietnamese classically, are very isolating: each thing you want to express, like tense or something, is going to be in a different word.
01:09:08.020 | In English, we actually combine things: we have tenses, we have things like "-able," as in "throwable" or something.
01:09:21.660 | And in some languages, really so much is expressed in morphemes.
01:09:28.740 | There are languages, especially in Alaska and Canada and Greenland, and these are all one language family,
01:09:36.460 | where you can have whole sentences expressed with things that get tacked onto the verb.
01:09:43.620 | So you have things like the object, or I guess in this case you start with the object, then the verb,
01:09:55.340 | and whether it's happening or not happening, and who said it, and whether it's in the future,
01:09:58.540 | all put into these quote-unquote "sentence words."
01:10:03.380 | It's a very different way for a language to work than how English works at all.
01:10:07.620 | Yeah, you have a question?
01:10:08.620 | Yeah, this is from two slides ago, the one with the map.
01:10:09.620 | I just want to know like what these dots mean, because in the US, the top right is gray,
01:10:18.140 | like in the Northeast, but in the Pacific Northwest, it's yellow.
01:10:21.500 | Is that different dialects for like the same American English?
01:10:24.500 | Oh, no, these are all indigenous languages.
01:10:26.580 | Oh, I see.
01:10:27.580 | Yeah, yeah.
01:10:28.580 | So English is just this one dot in here, spread in amongst all the Cornish and Irish and so on.
01:10:35.940 | Yeah, on this map English is just in Great Britain.
01:10:42.940 | And that's why all this evidential stuff is happening in the Americas:
01:10:50.740 | very often the indigenous languages of the Americas are the classic, heavily evidential-marking ones, which are the pink dots.
01:10:58.180 | Yeah.
01:10:59.180 | You said that normally we use a BERT-style model for multilingual models because it's difficult to get natural language generation across languages.
01:11:07.420 | Yeah.
01:11:08.420 | I guess intuitively that makes sense, because of the subtleties and the nuance between different languages when you're producing text.
01:11:15.300 | But is there a particular reason that that's been so much harder to make progress on?
01:11:20.580 | I think it's just hard. Good generation is just harder:
01:11:25.420 | to get something like GPT-3 you really need a lot of data.
01:11:32.340 | I think, can I think of any, is it GShard? GShard's encoder-only.
01:11:36.900 | Yeah, I can't really think of any big multilingual encoder-decoder models, as you said.
01:11:43.540 | Of course, GPT-3 has this thing where if you ask, how do you say this in French, it will tell you, you say it like this.
01:11:47.340 | So if you've seen all of the data, it's going to include a lot of languages.
01:11:50.780 | But the kind of multilingual model that would be as good as GPT-3 in some other language, I think you just need a lot more data to get that kind of coherence.
01:12:01.140 | As opposed to something like text infilling, which is how the BERT-style models work:
01:12:05.580 | even if the text-infilling performance isn't great for every language, you can actually get very, very good embeddings to work with for a lot of those languages.
01:12:18.660 | Yeah.
01:12:20.180 | Cool.
01:12:21.580 | Now, for one last language-diversity thing, I think this is interesting: motion events.
01:12:26.620 | This involves languages that many of us know, I'm going to talk about Spanish, and it's something
01:12:34.220 | you might not have thought about, but once you see it, you realize it actually affects how everything works.
01:12:40.660 | In English, the manner of motion is usually expressed on the verb.
01:12:43.420 | You see something like "the bottle floated into the cave":
01:12:46.140 | the fact that it's floating is on the verb, and the fact that it's going in is on this satellite.
01:12:51.660 | In Spanish, the direction of motion is usually expressed on the verb; Greek is like this too.
01:12:58.180 | I feel like most Indo-European languages are not like this; they're actually like English.
01:13:01.300 | Most languages from Europe to North India tend not to be like this.
01:13:07.380 | In Spanish you would say "La botella entró a la cueva flotando":
01:13:10.860 | the floating is not usually put on the main verb.
01:13:16.140 | In English, you could actually say "the bottle entered the cave floating," it's just maybe not what you would say,
01:13:23.460 | and similarly in Spanish, you can say it the other way.
01:13:26.580 | This satellite-framed versus verb-framed language distinction really affects how you would say most things, how everything works.
01:13:34.500 | It's a division that's pretty well attested, though of course it's not a clean split;
01:13:38.900 | it's not an exclusive categorization.
01:13:43.180 | Chinese, I think, often has structures where there are two verb slots,
01:13:47.540 | where you can have both a manner of motion and a direction of motion together in the verb position,
01:13:52.860 | and neither has to go off somewhere else playing some different role.
01:13:58.620 | So there are all these ways in which languages are just different,
01:14:01.580 | with things that maybe we didn't even think could be in a language, or things that we do
01:14:09.020 | without realizing that languages are just so different in these subtle ways.
01:14:16.100 | And going to the other side: while languages are so different, they're also very alike.
01:14:21.740 | There's this idea: is there a universal grammar, some abstract structure that unites all languages?
01:14:29.860 | This is a huge question in linguistics,
01:14:31.940 | and the question is, can we define an abstraction where we can say all languages are some version of it?
01:14:36.780 | There are other ways of thinking about universals too, like: all languages tend to be one way, or languages that tend to be one way also tend to be some other way.
01:14:44.660 | And there's a third way of thinking about universals: languages all deal in similar types of relations, like subject, object, types of modifiers.
01:14:55.300 | The Universal Dependencies project was a way of saying: maybe we can make dependencies for all languages in a way that doesn't shoehorn them into each other.
01:15:06.620 | And, what was it called, RRG, Role and Reference Grammar, was also this idea that maybe one way to think about all languages together is through the kinds of relations they define.
01:15:17.300 | And ask me about the Chomskyan and the Greenbergian stuff if you want, and how it relates to NLP.
01:15:23.540 | I think there's a lot to say there; it's just slightly more difficult.
01:15:29.380 | So maybe it's easier to think about this third one in terms of NLP.
01:15:33.820 | Back to the subject/object relation stuff: if we look at it across languages,
01:15:38.100 | we see that they're encoded in parallel, because those classifiers we train are about as accurate in their own language as they are in other languages,
01:15:47.020 | their own language being the red dots and other languages the black dots.
01:15:50.660 | It's not like, wow, if I take a multilingual model and train one classifier in one language, it's going to be so good at itself and so bad at everything else;
01:15:58.660 | they're kind of interspersed, though the red dots are clearly toward the top end.
01:16:03.860 | And UD relations, universal dependencies, the dependency relations, are also encoded in parallel ways.
01:16:10.460 | This is work that John has done.
01:16:13.620 | Again, the main thing to take from this example is that the colors cluster together:
01:16:18.900 | if you train a parser, or parse classification, in one language and transfer it to another, you see these clusters form for the other language.
01:16:29.100 | So these ideas of how things relate together, noun modifiers and all that kind of stuff,
01:16:35.620 | do cluster together in these parallel ways across languages.
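As a sketch of what "train a classifier in one language and evaluate it on others" means here: fit a linear probe on one language's embeddings and score it, unchanged, on other languages. The load_role_data helper, the language codes, and the random arrays below are hypothetical placeholders for real treebank data embedded with a shared multilingual encoder.

```python
# Sketch: zero-shot cross-lingual transfer of a linear probe.
import numpy as np
from sklearn.linear_model import LogisticRegression

def load_role_data(lang):
    # Placeholder: stands in for reading a treebank for `lang` and embedding
    # its noun arguments with the same multilingual encoder for every language.
    rng = np.random.default_rng(abs(hash(lang)) % 2**32)
    return rng.normal(size=(200, 768)), rng.choice(["subj", "obj"], size=200)

train_lang, eval_langs = "en", ["es", "hi", "el"]

X_train, y_train = load_role_data(train_lang)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Evaluate the single English-trained probe on every language, unchanged.
for lang in [train_lang] + eval_langs:
    X, y = load_role_data(lang)
    print(f"{train_lang} -> {lang}: accuracy = {probe.score(X, y):.2f}")
```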
01:16:42.060 | And so language specificity is also important. I might skip over this.
01:16:48.420 | But it seems like maybe sometimes some languages are shoehorned into others in various ways.
01:16:54.820 | And maybe part of this is that data quality is very variable in multilingual corpora.
01:16:59.940 | There was an audit of all these multilingual corpora,
01:17:04.460 | and for several of them, around 20% of the languages were less than 50% correct,
01:17:08.620 | meaning half of the data was often just links or something random,
01:17:13.380 | labeled as some language but not actually that language at all.
01:17:19.340 | And maybe we don't want too much parameter sharing.
01:17:23.140 | AfriBERTa is a fairly recent BERT model trained only on African languages;
01:17:29.460 | maybe having too much high-resource data is harming the low-resource languages,
01:17:33.900 | and there's work here at Stanford being done in the same direction.
01:17:37.980 | Another recent cross-lingual model, XLM-V, came out, which asks: why should we be doing vocabulary sharing?
01:17:47.060 | Just have a big vocabulary where each language gets its own words; it's probably going to be better.
01:17:52.140 | And it is: it beats similar models with smaller, shared vocabularies,
01:17:56.620 | where, say, "computer" is the same in English and French, so it gets shared.
01:18:00.420 | Maybe it's better to separate things.
01:18:01.700 | It's hard to find this balance. Let's skip over this paper too;
01:18:05.860 | it's very cool and there's a link there, so you should look at it.
01:18:08.940 | But yeah, we want language generality, but we also want to preserve diversity.
01:18:13.980 | And so how is multilingual NLP doing, especially with things like dialects?
01:18:17.860 | There are so many complex issues for multilingual NLP to be dealing with.
01:18:22.780 | How can deep learning work for low-resource languages?
01:18:25.620 | What are the ethics of working in NLP for low-resource languages?
01:18:28.700 | Who wants their language in big models? Who wants their language to be translated?
01:18:33.100 | These are all very important ethical issues in multilingual NLP.
01:18:38.580 | And so after looking at structure, beyond structure, and multilinguality in models, I hope
01:18:47.100 | you see that linguistics is a way of investigating what's going on in black-box models.
01:18:52.220 | The subtleties of linguistic analysis can help us understand what we want or expect from the models that we work with.
01:18:58.100 | And even though we're not reverse engineering human language, linguistic insights, I hope
01:19:02.260 | I've convinced you, still have a place in understanding the models that we're working with, the models that we're dealing with.
01:19:07.740 | And in so many more ways beyond what we've discussed here:
01:19:12.220 | language acquisition, language and vision, instructions and music, discourse,
01:19:16.860 | conversation and communication, and so many others.
01:19:20.380 | Cool.
01:19:21.380 | Thank you.
01:19:22.380 | If there's any more questions, you can come ask me.
01:19:24.500 | Time's up.
01:19:24.740 | Thank you.
01:19:25.740 | [ Applause ]