
Stanford CS224N NLP with Deep Learning | 2023 | Lecture 14 - Insights between NLP and Linguistics


Transcript

Cool. Hi, everyone. Hi, I'm Isabel. I'm a PhD student in the NLP group. It's about connecting insights between NLP and linguistics. Yeah, so hopefully we're going to learn some linguistics and think about some cool things about language. Some logistics. We're in the project part of the class, which is cool.

We're so excited to see everything you guys do. You should have a mentor grader assigned through your project proposal: whoever graded your project proposal. Especially if you're in a custom project, we recommend that you go to your grader's office hours. They'll know the most and be most into your project.

And project milestones are due next Thursday. So that's in one week from now. So hopefully you guys are all getting warmed up doing some things for the project. And we'd love to hear where you are next week. Cool. So the main thing that I'm going to talk about today is that there's been kind of a paradigm shift for the role of linguistics in NLP due to large language models.

So it used to be that there was just human language. We create it all the time. We're literally constantly creating it. And then we would analyze it in all these ways. Maybe we want to make trees out of it. Maybe we want to make different types of trees out of it.

And then all that would kind of go into making some kind of computer system that can use language. And now we've cut out this middle part. So we have human language. And we can just immediately train a system that's very competent in human language. And so now we have all this analysis stuff from before.

And we're still producing more and more of it. There's still all this structure, all this knowledge that we know about language. And the question is, is this relevant at all to NLP? And I'm going to show how it's useful for looking at these models, understanding these models, understanding how things work, what we can expect, what we can't expect from large language models.

So in this lecture, we'll learn some linguistics, hopefully. Language is an amazing thing. It's like so fun to think about language. And hopefully, we can instill some of that in you. Maybe you'll go take Ling 1 or something after this. And we'll discuss some questions about NLP and linguistics.

Where does linguistics fit in for today's NLP? And what does NLP have to gain from knowing and analyzing human language? What does a 224N student have to gain from knowing all this stuff about human language? So for the lecture today, we're going to start off talking about structure in human language, thinking about the linguistics of syntax and how structure works in language.

We're going to then move on to looking at linguistic structure in NLP, in language models, the kind of analysis that people have done for understanding structure in NLP. And then we're going to think of going beyond pure structure, so beyond thinking about syntax, thinking about how meaning and discourse and all of that play into making language language and how we can think of this both from a linguistic side and from a deep learning side.

And then lastly, we're going to look at multilinguality and language diversity in NLP. Cool. So starting off with structure in human language, just like a small primer in language in general, if you've taken any intro to linguistics class, you'll know all of this. But I think it's fun to get situated in the amazingness of this stuff.

So all humans have language. And no other animal communication is similar. It's this thing which is incredibly easy for any baby to pick up in any situation. And it's just this remarkably complex system. Very famously linguists like to talk about the case of Nicaraguan sign language because it kind of emerged while people were watching in a great way.

It's like after the Sandinista Revolution, they started, there's a kind of large public education in Nicaragua and they made a school for deaf children. And there was no central Nicaraguan sign language. People had like isolated language. And then you see this like full language emerge in the school very autonomously, very naturally.

I hope this is common knowledge. Maybe it's not. Sign languages are like full languages with like morphology and things like pronouns and tenses and like all the things. It's not like how I would talk to you across the room. Yeah. And so, and what's cool about language is that it can be manipulated to say infinite things.

And the brain is finite. So it must be that we have some kind of set of rules that we tend to be able to pick up from hearing them as a baby, and then we're able to say infinite things. And we can manipulate these rules to really say anything, right?

We can talk about things that don't exist, things that can't exist. This is very different from like the kind of animal communication we see like a squirrel, like alarm call or something, you know, it's like, watch out, there's a cat. Things that are like totally abstract, you know, that have like no grounding in anything.

We can express subtle differences between similar things. When I'm thinking about this point, this feature of language, I always think of the Worldbuilding Stack Exchange. I don't know if you've ever looked at the sidebar, where there's this thing where science fiction authors kind of pitch their ideas for their science fiction world.

And it's like the wackiest, like you can really create any world with like with English, with the language that we're given. It's like amazing. And so there's structure underlying language, right? This is, I said recap here, cause we've done like the dependency parsing lectures. We thought about this, right?

But you know, if we have some sentences like, you know, Isabel broke the window, the window was broken by Isabel, right? We have these two sentences and there's some kind of relation between them. And then we have another two sentences and they have a similar relation between them, right?

This kind of passive alternation, it's kind of something which exists for both of these sentences, you know, and then we can even use like made up words and it's still, you can still see that it's a passive alternation, right? And so it seems that we have some knowledge of structure that's separate from, from the words we use and the things we say that's kind of above it.

And then what's interesting about structure is that it dictates how we can use language, right? So, you know, if, if I have a sentence like the cat sat on the mat and it's, and it looks, you know, and, and then someone tells, tells you like, well, this is, you know, if you make a tree for it's going to look like this, according to my type of tree theory, you would say, well, why should I care about that?

And the reason that this stuff is relevant is because it kind of influences what you could do, right? So like any subtree or like, you know, in this specific case, any subtree, in other cases, like many subtrees, it can kind of be replaced with like one item, right? So it's like, he sat on the mat or he sat on it or he sat there, right?

Or he did so, did so, it's two words, but you know, there's a lot of ink spilled over do in English, especially in like early linguistic teaching. So we're not going to spill any ink. It's kind of like one word. But then when something is not a subtree, you like, can't really replace it with one thing, right?

So you, so you can't express like the cat sat and it kind of like have the mat as a different thing, right? And one, you could be like, he did so on the mat, right? You'd have to kind of do two things. And like, and, and one way you could think about this is that, well, it's not a subtree, right?

It's kind of like, you kind of have, have, have to go up a level to, to, to do this. And so you can't really separate the cat from on the mat in this way. And so, and we implicitly know like so many complex rules about structure, right? We're like processing the, these like streams of sound or like streams of letters all the time.
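
The substitution test described here can be sketched in a few lines. This is only an illustration under assumptions: the bracketing of "the cat sat on the mat" is written by hand, and the names `TREE`, `constituents`, and `can_substitute` are made up for this sketch, not taken from the lecture slides.

```python
# Toy constituency tree for "the cat sat on the mat".
# Assumed format: (label, children...) tuples; leaves are plain strings.
TREE = ("S",
        ("NP", "the", "cat"),
        ("VP", "sat",
               ("PP", "on", ("NP", "the", "mat"))))

def words(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    out = []
    for child in node[1:]:
        out.extend(words(child))
    return out

def constituents(node):
    """Collect the word span of every subtree (every constituent)."""
    if isinstance(node, str):
        return set()
    spans = {" ".join(words(node))}
    for child in node[1:]:
        spans |= constituents(child)
    return spans

def can_substitute(phrase):
    """A span can be swapped for a single pro-form only if it is a subtree."""
    return phrase in constituents(TREE)

print(can_substitute("the cat"))         # subtree: "he sat on the mat"
print(can_substitute("on the mat"))      # subtree: "he sat there"
print(can_substitute("sat on the mat"))  # subtree: "he did so"
print(can_substitute("the cat sat"))     # not a subtree
```

The first three calls print `True` and the last prints `False`, matching the intuition that "the cat sat" cannot be replaced by one item while leaving "on the mat" behind.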

And yet we like have these, like the ways that we use them show that we have all these like complex ideas, like the tree I just showed, or like for, for example, these are like, I'm just going to give some examples for like a, a taste of like the kinds of things people are thinking about now, but there's like so many, right?

So like, what can we pull out to make a question, right? So like if we form, form a question, we, we form it by like, we were kind of referring to some part of like, you know, there might be another sentence, which like is the statement version, right? And we've kind of pulled, pulled, pulled out some, some part to make the question.

They're not necessarily like fully related, but you know, so if say Leon is a doctor, we can kind of pull, pull that out to make a question, right? Like what is Leon? And if we have like, my cat likes tuna, we could pull that out. What does my cat like?

Again, do, ignore the do. If we have something like Leon is a doctor and an activist, we actually can't pull out this, this last thing, right? So if something's like in this, if something's like conjoined with an and, we, it can't like be, be taken out of that and, right?

You, you, you could only say like, what is Leon? You could be like, oh, a doctor and an activist, but you can't really say what is Leon a doctor and this is like not how question formation works. And you know, this is like some, something that we all know.

It's not, I think, something that any of us have been taught, right? Even people who've been taught English as a second language, I don't think this is something which you're ever really taught explicitly, right? But most of us probably know this very well. Another such rule, right?

Is like, when can we kind of shuffle things around, right? So if we have something like I dictated the letter to my secretary, right? We can make a longer version of that, right? I dictated the letter that I had been procrastinating writing for weeks and weeks to my secretary.

This character is like both a grad student and like a high ranking executive. And, and then we can, we can move the, we can move that, that long thing to the end, right? So it's like, I dictated to my secretary the letter that I'd been procrastinating writing for weeks and weeks.

And that's like fine. You know, maybe it's like slightly awkwardly phrased, but it's not like, I think this, for me, at least everyone varies, right? Could, could appear in like natural productive speech, but then something like this is like much worse, right? So somehow the fact that it becomes weighty is good and we can move it to the end.

But when it doesn't become weighty, we can't, right? And we like, this sounds kind of more like Yoda-y than like real language. And so, and so like, and we have this rule, like this one's not that easy to explain, actually. Like people have tried many ways, like to like make sense of this in linguistics.

And it's just, but it's a thing we all know, right? And so when I say rules of grammar, these are not the kind of rules that we're usually taught as rules of grammar, right? So a community of speakers, you know, for example, standard American English speakers, they share this rough consensus of the implicit rules they all have.

These are not the same, you know, people have gradations and disagree on things. And then a grammar is an attempt to describe all these rules, right? And linguists might write out a big thing called, you know, the grammar of the English language, where they're trying to describe all of them.

It's like really not going to be large enough ever. They're like, this is a really hefty book and it's like not still not describing all of them, right? Like language is so complex. But so what, so what we were told as rules of grammar, you know, these kind of like prescriptive rules where they tell us what we can and can't do, you know, they often have other purposes than describing the English language, right?

So for example, when they've told us things like, oh, you should never start a sentence with and, you know, that's like not true. You know, we start sentences with and all the time in English and it's fine. You know, what they probably mean, you know, there's some probably like reason that they're saying this, right?

Like, especially if you're like trying to teach a high schooler to like, write, you know, you probably, when you want them to focus their thoughts, you probably don't want them to be like, oh, and this, oh, and this again, or, you know, like you want them to like, and so you tell them like, oh, rule of writing, you know, is like, you can never start a sentence with and, right?

And when they say something like, oh, it's incorrect to say, I don't want nothing. This is like bad grammar, you know, well, this is, you know, in, in, in standard American English, you probably wouldn't have nothing there, right? Cause you, you would have anything, right? But, but in many dialects of English, you know, in many languages across the world, when you have a negation, right?

Like the not and don't, then like everything, it kind of scopes over also has to be negated or has to agree. And many dialects of English are like this. And so what they're really telling you is, you know, the dialect with the most power in the United States doesn't do negation this way.

And so you shouldn't either in school. Right. And, and, and so, you know, and so the way that we can maybe define grammaticality, right. Rather than like what they tell us is wrong or right is that, you know, if we choose a community of speakers to look into, they share this rough consensus of their implicit rules.

And so the utterances that we can generate from these rules, you know, are grammatical, roughly. You know, everyone has these gradations of what they can accept. And if we can't produce an utterance using these rules, you know, it's ungrammatical. And this is the descriptive way of thinking about grammar, where we're thinking about what people actually say and what people actually like and don't like.

And so for an example, you know, in, in English, large, largely, we have a pretty strict rule that like the subject, the verb and the object appear in this like SVO order. There's exceptions to this, like there's exceptions to everything, right? Especially things like says I, in some dialects, but you know, it is like, largely if something is before the verb, it's a subject, something is after the verb, it's an object, and you can't move that around too much.

And, you know, we also have these subject pronouns, you know, like I, I, she, he, they, that have to be the subject and these object pronouns, you know, me, me, her, him, them, that have to be the object. And, and, you know, and so if we follow the, these rules, we get a sentence that we think is good, right?

Like, I love her. And if we don't, then we get a sentence that we think is, is ungrammatical, right? Something like me love she, it's like, we don't know who is who, you know, who is doing the loving and, and, and who is being loved in, in this one, right?

And it's, it doesn't exactly parse. And this is like also true, you know, like even when there's no ambiguity, this continues to be true, right? So for a sentence like me, a cupcake ate, which is like, the meaning is perfectly clear. Our rules of grammaticality don't seem to cut, to cut as much slack, right?

We're like, oh, this is wrong. I understand what you mean, but in my head, I know it's like not, you know, correct, even not, not by the like prescriptive notion of what I think is correct, you know, by the descriptive notion, like my, I just don't, don't like it.
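
As a rough sketch of the two rules just mentioned (SVO order and pronoun case), here is a toy checker. The word lists are a tiny hand-picked assumption and the function name `is_grammatical` is hypothetical; real grammaticality judgments are far richer and more gradient than this.

```python
# Toy checker for subject-verb-object order and pronoun case.
# The word lists below are illustrative assumptions, not a grammar of English.
SUBJECT_PRONOUNS = {"i", "she", "he", "they"}
OBJECT_PRONOUNS = {"me", "her", "him", "them"}
VERBS = {"love", "ate", "saw"}

def is_grammatical(sentence):
    tokens = sentence.lower().split()
    verb_positions = [i for i, t in enumerate(tokens) if t in VERBS]
    if len(verb_positions) != 1:
        return False
    v = verb_positions[0]
    subject, obj = tokens[:v], tokens[v + 1:]
    # SVO order: something must come before the verb and something after it.
    if not subject or not obj:
        return False
    # Case: no object pronoun before the verb, no subject pronoun after it.
    if any(t in OBJECT_PRONOUNS for t in subject):
        return False
    if any(t in SUBJECT_PRONOUNS for t in obj):
        return False
    return True

print(is_grammatical("I love her"))        # accepted
print(is_grammatical("me love she"))       # rejected: wrong pronoun case
print(is_grammatical("me a cupcake ate"))  # rejected, even though the meaning is clear
```

Note that the checker rejects "me a cupcake ate" purely on form; nothing in it looks at whether the meaning is recoverable.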

Right. And also, you know, sentences can be grammatical without any meaning. So you can have meaning without grammaticality, right? Like me, a cupcake ate. And you could also have the reverse. It's the classic example from Chomsky in 1957. I introduced it earlier, but yeah, classically from 1957, you know, like colorless green ideas sleep furiously, right?

This has no meaning, cause you can't really make any sense out of this sentence as a whole, but you know it's grammatical. And you know it's grammatical, right, cause you can make an ungrammatical version of it, right? Like colorless green ideas sleeps furious, right? Which doesn't work, cause there's no agreement, even though you don't have any meaning for any of this.

Cause there's no agreement, even though you don't have any meaning for any of this. And then lastly, you know, people don't fully agree. You know, everyone has their own idiolect, right? People like usually speak like more than one dialect and they kind of move between them and they have a mixture and those have like their own way of thinking of things.

They also have different opinions at the margins. People like some things more, others don't, right? So an example of this is that not everyone is as strict about some wh-constraints, right? So if you're trying to pull something out, like: I saw who Emma doubted reports that we had captured in the nationwide FBI manhunt. This is from a paper by Hofmeister and Ivan Sag from Stanford.

Some people like it, some people don't, you know. Some people can clearly see it as like, oh, it's the who that we had captured, and Emma doubted the reports that we had captured. You know, and some people are like, this is as bad as like, what is Leon a doctor and, and I don't like it.

Right. So yeah, so that's grammaticality. And the question is like, why do you even need this? Right. It's like, we, we like, we like accept these useless utterances and we block out these perfectly communicative utterances. Right. And, and this is like, I started off saying that this is like a fundamental facet of human intelligence.

Like it seems kind of, you know, a strange thing to have. And so I think one thing I keep returning to when I think about linguistics is that a basic fact about language is that we can say anything, right. Really every language can express anything, you know, and if there's no word for something, people will develop it if they want to talk about it.

Right. And so if we ignored the rules because we know what's probably intended, right, then we would be limiting possibilities. Right. So in my kitchen horror novel, where the ingredients become sentient, I want to say the onion chopped the chef. And if people just assumed I meant the chef chopped the onion because SVO order doesn't really matter, then I can't say that.

So then, yeah, to conclude, you know, a fact about language that's very cool is that it's compositional, right. We have this set of rules that defines grammaticality, and then this lexicon, right, this dictionary of words that relate to the world we want to talk about.

And we kind of combine them in these limitless ways to say anything we want to say. Cool. Any questions about all this? I've like tried to bring a lot of like linguistic fun facts, like top of mind for this lecture. So hopefully, hopefully have answers for things you want to know.

Cool. Cool. Yeah. Cool. So now, you know, that was a nice foray into a lot of like sixties linguistics. You know, how does that relate to us today, right, in NLP? And so we said that in humans, you know, we can think about language like there's a system for producing it that can be described by these discrete rules. So it's smaller than all the things that we can say.

There are these rules that we can kind of put together to say things. And so do NLP systems work like that? And one answer is, well, they definitely used to, right? Because as I said in the beginning, before self-supervised learning, the way to approach doing NLP was through understanding the human language system, right?

And then trying to imitate it, trying to see, you know, if you think really, really hard about how humans do something, then you kind of like code up a computer to do it. Right. And so for, for one example, like, you know, parsing used to be like super important in, in, in, in NLP.

Right. And this is because, you know, as an example, if I want my sentiment analysis system to classify a movie review correctly, right, something like: my uncultured roommate hated this movie, but I absolutely loved it. Right. How would we do this before we had like ChatGPT?

We might have some semantic representation of words like hated and uncultured, you know, it's not looking good for the movie, but how does everything relate? Well, we might ask how a human would structure this sentence. You know, there are many theories of how syntax might work, but many linguists would tell you something like this.

So it's like, okay, now I'm interested in the I, right. Cause that's probably what the review relates to. There's this worrying stuff about uncultured and hated, but it seems like those are related syntactically together, right? It's like the roommate hated, and that can't really connect to the I, right.

So the I can't really be related to the hated, right. Cause they're kind of separated. They're separate subtrees, separated by this conjunction, by this but relation. And so it seems that I goes with loved, which is looking good for the movie, that, you know, we have loved it.

And so then we have to move beyond the rules of syntax, right, to the rules of discourse. You know, like what could it mean? And there's a bunch of rules of discourse. If you say it, you're probably referring to the latest salient thing that matches, you know, it is probably non-sentient, right.

And so, you know, in this case it would be movie, right. So then, you know, linguistic theory helped NLP reverse engineer language. So you had something like input, you'd get syntax, you'd get semantics from the syntax, right.

So you would take the tree and then from the tree kind of build up all these like little, you know, like you, you, you, you can build up these little functions of like how, how, how things, how things like relate to each other. And then, and then you, you'd go to discourse, right.

So, so, so what refers to what, what, what nouns are being talked about, what things are being talked about and, you know, and, and then whatever else was interesting for your specific use case. Now we don't need all that, right. Language models just seem to catch on to a lot of these things, right.
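
The older pipeline described above (syntax, then semantics, then discourse) can be sketched roughly as follows for the movie review example. Everything here is a hand-written stand-in under assumptions: the clause splitter replaces a real parser, the lexicons contain only the words needed, and names like `reviewer_opinion` are hypothetical.

```python
# Toy version of the pre-LLM pipeline: syntax -> semantics -> discourse.
# Lexicons and the clause splitter are illustrative assumptions, not a real system.
VERB_SENTIMENT = {"hated": -1, "loved": +1}
NON_SENTIENT_NOUNS = {"movie"}

def split_clauses(tokens):
    """'Syntax': split at the conjunction 'but' into separate subtrees."""
    clauses, current = [], []
    for tok in tokens:
        if tok == "but":
            clauses.append(current)
            current = []
        else:
            current.append(tok)
    clauses.append(current)
    return clauses

def interpret(clause):
    """'Semantics': (subject words, verb, object words) around the sentiment verb."""
    v = next((i for i, tok in enumerate(clause) if tok in VERB_SENTIMENT), None)
    if v is None:
        return None
    return clause[:v], clause[v], clause[v + 1:]

def reviewer_opinion(sentence):
    """Return (thing being judged, sentiment) for the clause whose subject is 'I'."""
    tokens = sentence.lower().replace(",", "").split()
    last_thing = None  # 'Discourse': most recent salient non-sentient noun
    for clause in split_clauses(tokens):
        parsed = interpret(clause)
        if parsed is None:
            continue
        subject, verb, obj = parsed
        if "it" in obj:
            target = last_thing          # resolve the pronoun to the salient noun
        else:
            target = next((t for t in obj if t in NON_SENTIENT_NOUNS), None)
            last_thing = target or last_thing
        if "i" in subject:
            return target, VERB_SENTIMENT[verb]
    return None, 0

review = "My uncultured roommate hated this movie, but I absolutely loved it"
print(reviewer_opinion(review))  # ('movie', 1)
```

The point of the sketch is the shape of the reasoning: "hated" never attaches to "I" because they sit in different clauses, and "it" is resolved to "movie" by a discourse rule rather than by the syntax.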

So this whole thing that I did with the tree is something ChatGPT does, and it does much harder things than this, right. This isn't even slightly prompt engineered. I just woke up one morning, I was like, oh, for the lecture, I'm going to put that into ChatGPT.

And this exactly, you know, I didn't even get some like, yeah, stop. Well, I guess I got a bit of moralizing, but I just like immediately, immediately just told, told, told me, you know, who, who likes it, who, who, who doesn't like it and why I'm doing something like slightly wrong, which is how it ends everything, right.

And so, and so, you know, NLP systems definitely used to, this is where we were, work in this kind of structured, discrete way. But now NLP works better than it ever has before. And we're not constraining our systems to know any syntax, right. So what, what about structure in modern language models?

And so the question that a lot of analysis work has been focused on, you know, I think we'll have more analysis lectures later also, so this is going to be looked at in more detail, right, is how could you get from training data, you know, which is just kind of a loose set of things that have appeared on the internet, or sometimes, rarely, not on the internet, right.

To rules about language, right. To the idea that there's this structure underlying language that we all seem to know, even though we do just talk in streams of things that then sometimes appear on the internet. And one way to think about this is testing how novel words in old structures work, right.

So humans can easily integrate new words into our old syntactic structures. I remember like I had lived in Greece for a few years for middle school, just speak, not speaking English too much. And I came back for high school and, and yeah, and, and this was like in Berkeley in the East Bay.

And there was like, there was literally like 10 new vocabulary words I'd like never heard of before. And they all had like a very similar role to like dank or like sick, you know, but they were like the ones that were being tested out and did not pass. And within like one, you know, one day I immediately knew how to use all of them, right.

It was not a hard thing for me. I didn't have to get a bunch of training data about how to use all these words. Right. And so this is one way of arguing, you know, the thing I was arguing for the whole first part of the lecture, that syntactic structures exist independently of the words that they have appeared with.

Right. A famous example of this is, is Lewis, Lewis Carroll's poem, Jabberwocky. Right. I was going to quote from it, but I can't actually see it there. Right. Where they, where they, you know, where he just like made up a bunch of new words and he just made this poem, which is all new open class words, open class words, what we call, you know, kind of like nouns, verbs, adjectives, adverbs, classes of words that like we add new things to all the time while, while things like conjunctions, you know, like and or but are closed class.

Oh, there's been a new conjunction added recently. I just remembered after I said that. Does anyone know of a conjunction that's kind of from the past like 30 years or something, maybe 40? Spoken slash. Like, now we say slash, and it kind of has a meaning that's not and, or, or but. It's a new one, but it's a closed class generally.

This happens rarely. Anyway. And so, you know, you have like, 'Twas brillig, and the slithy toves did gyre and gimble in the wabe, right? Toves is a noun. We all know that, and we've never heard it before. And in fact, you know, one word from Jabberwocky, chortle, actually entered the English vocabulary, right?

It kind of means like a, like a little chuckle that's maybe slightly suppressed or something. Right? So, so, so it shows like, you know, there was one, literally like one example of this word and then people picked it up and started using it as if it was a real word.

Right? So, and so one, one way of asking do language models have structures, like do they have this ability? And, you know, and I was thinking it would be cool to go over like a benchmark about this. Right? So like the kind of things, so people like make things where you could test your language models to, to, to see if it does this.

Yeah. Are there any questions until now, before I go into this new benchmark? Cool. So yeah, the COGS benchmark, the compositional generalization from semantics benchmark or something. Right. It kind of checks if language models can do new word-structure combinations. Right? So the task at hand is semantic interpretation.

This is, I kind of glossed over it before, but it's like if you have a sentence, right, like the girl saw the hedgehog, you have this idea that saw is a function that takes in two arguments, and it outputs that the first one saw the second one. You know, this is one way of thinking about semantics.

There's many more as we'll see, but you know, this is one. And so like, and so, and so you can make a little like kind of Lambda expression about you know, about how, how, how, you know, what the sentence means and to get that you kind of have to use the, the, the tree to get it correct.

But anyway, the specific mechanism of this is not very important. It's just the semantic interpretation where you take the girl saw the hedgehog and you output this function of like, you know, see takes two arguments, first is the girl, second is the hedgehog.
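
A minimal sketch of that kind of semantic interpretation, assuming a flattened (subject, verb, object) tree and a tiny lemma table. The output string format here is an illustration only and is not the actual logical-form syntax used by the benchmark; `LEMMA`, `head_noun`, and `interpret` are names made up for this sketch.

```python
# Toy semantic interpretation: tree -> predicate(argument, argument).
# The tree format and the lemma table are illustrative assumptions.
LEMMA = {"saw": "see", "ate": "eat"}

def head_noun(np):
    """Take the last word of a noun phrase as its head ('the girl' -> 'girl')."""
    return np[-1]

def interpret(tree):
    """tree = (subject NP, verb, object NP); the verb acts as a two-argument function."""
    subject, verb, obj = tree
    predicate = LEMMA.get(verb, verb)
    return f"{predicate}({head_noun(subject)}, {head_noun(obj)})"

tree = (["the", "girl"], "saw", ["the", "hedgehog"])
print(interpret(tree))  # see(girl, hedgehog)
```

Which noun phrase fills which argument slot comes from the tree, not from word order alone, which is why the structure matters for getting the interpretation right.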

And then the training and test sets have distinct words and structures in different roles. Right. So for example, you know, you have things like Paula, right, or the hedgehog, that is always an object in the training data, so when you're fine-tuning to do this task. But then in the test data, it's a subject, right.

So it's like, can, can, can you like, can you, can you use this word that you've seen, you know, in, in a new kind of, in, in, in a new place. Cause in English, anything that, that, that, that, that's an object can be a subject, you know, with like some, there's some subtlety around like some things are more likely to be subjects, but yeah.

And then similarly, you know, if you have something like the cat on the mat, so this idea that a noun can go with a prepositional phrase, right, but that always appears in the object, right. Like Emma saw the cat on the mat.

And then like, can, can you do something like, you know, the cat on the mat saw Mary, right. So it's like move that kind of structure to subject position, which is something that in English we can do, right. Like any type of noun phrase that can be in an object position can be in subject position.
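
The train/test setup described here can be sketched like this. It is a made-up miniature, assuming a three-noun vocabulary and one verb; the names `NOUNS`, `HELD_OUT`, and `make_split` are hypothetical and the real benchmark is much larger and more varied.

```python
# Sketch of a COGS-style split: a noun seen only as an object in training
# shows up as a subject at test time. The vocabulary is invented for illustration.
import itertools

NOUNS = ["girl", "cat", "hedgehog"]
HELD_OUT = "hedgehog"   # never a subject during training
VERB = "saw"

def make_split():
    train, test = [], []
    for subj, obj in itertools.permutations(NOUNS, 2):
        example = (f"the {subj} {VERB} the {obj}", f"see({subj}, {obj})")
        # Any example with the held-out noun in subject position goes to test.
        (test if subj == HELD_OUT else train).append(example)
    return train, test

train, test = make_split()
print(len(train), len(test))  # 4 2
for sentence, meaning in test:
    print(sentence, "->", meaning)
```

A model fine-tuned on `train` has seen "hedgehog" and has seen subjects, but never the two together, so doing well on `test` requires combining a known word with a known structure in a new way.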

And so that's the COGS benchmark. You know, large language models haven't aced this yet. I wrote this, and I was looking over this slide, and I was like, well, they haven't checked the largest ones. You know, they never do check the largest ones, because it's really hard to do this kind of analysis work, you know, and things move so fast with the really large ones.

But you know, T5 3 billion, you know, 3 billion is a large number. It's maybe not a large language model anymore. But, you know, they don't ace this, right. They're getting like 80%, while when they don't have to do the structural generalization, when they can just do a test set in which things appear in the same role as in the training set, they get like 100%, easy.

It's not a very hard task. And so, you know, this is still pretty good. And it's probably the case that if a human had never ever seen something in subject position, I'm not sure that it would be 100% as easy as if they had, you know. I think that we don't want to fully idealize how things work in humans, right.

So similarly, you can take literal Jabberwocky sentences, right. So building on some work that John did, which I'm sure you'll talk about later, so I'm not going to go into it, but maybe I'm wrong on that assumption, right. We can kind of test the model's embedding space, right.

So if we go high up in the layers and test the embedding space, we can test it to see if it encodes structural information, right. And so we can test to see, okay, is there a rough representation of syntactic tree relations in this latent space.

And then a recent paper asked, does this work when we introduce new words, right. So if we take Jabberwocky-style sentences and then ask, can the model find the trees for these in its latent space, does it encode them?

And the answer is, you know, it's kind of worse. In this graph, the hatched bars, so the ones on the right, are the Jabberwocky sentences, and the clear ones, or the not hatched ones, I guess, are the normal sentences.

And we see performance is worse, you know. This is unlabeled attachment score on the y-axis. But humans probably do worse on these too, right? It's easier to read a normal poem than to read Jabberwocky. So the extent to which this is damning or something is, I think, very, very small.
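For readers who have not met the metric on that y-axis, here is a minimal sketch of unlabeled attachment score: the fraction of words whose predicted syntactic head matches the gold head. The function name, the head-index convention, and the toy sentence are assumptions made for illustration; they are not taken from the paper discussed in the lecture.

```python
def unlabeled_attachment_score(gold_heads, predicted_heads):
    """Fraction of words whose predicted head index equals the gold head index."""
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("gold and predicted head lists must have the same length")
    if not gold_heads:
        raise ValueError("cannot score an empty sentence")
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return correct / len(gold_heads)


# Toy sentence: "the cat saw Mary".
# Heads are 1-based word positions; 0 marks the root of the tree.
gold = [2, 3, 0, 3]
predicted = [2, 3, 0, 2]  # hypothetical probe output with one wrong attachment

score = unlabeled_attachment_score(gold, predicted)
print(score)
```

Because the score ignores dependency labels, it only asks whether the tree shape was recovered, which is the part a structural probe on the latent space is trying to read out.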

I think the paper, I have linked it there, is maybe a bit more sure about this being a big deal than it is. But yeah, it does show that this kind of process isn't trivial.

Yeah? What are the words that, like, apply for Jabberwocky substitutions? Oh, so this is something called phonotactics, right? I think this is probably around what you're asking: you want a word which sounds like it could be in English, right?

Like provocated, right? It sounds like it could be in English. A classic example is blick, it could be an English word, and bnick can't, right? We can't start a word with bn. And that's not an impossibility of the mouth. It's similar with things like pterodactyl, pneumonia.

These come from Greek words like pneumonas and ptero. I can say them. I'm a Greek native speaker. PN and PT, I can put them at the beginning of a syllable. But in English they don't go. And so if you follow these rules and also add the correct suffixes and stuff.

So like provocated, we know it's past tense and stuff. Then you can make words that don't exist but could exist. And so they don't throw people off. This is important for the tokenizers. You don't want to do something totally wacky to test the models. But yeah.

So when you generate this test set with these Jabberwocky substitutions, are these words generated by a computer or is there a human coming up with words that sound like English? There's some databases. People have thought of these. And I think they get theirs from some list of them. Because if you have 200, that's enough to run this test.

Because it's just a test. But yeah. I mean, I think that the phonotactic rules of English can actually be laid out kind of simply. It's like you can't really have two stops together. It's like p and t, they're both stops. You can't really put them together. You can probably make a short program or a long-ish program, but not a super complex one, to make good Jabberwocky words in English.
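As a rough illustration of the kind of short program mentioned here, the sketch below builds one-syllable nonce verbs from hand-picked onsets, vowels, and codas, then adds a past-tense suffix. The inventories, the `REAL_WORDS` filter, and the function names are all assumptions chosen for the example; this is a toy, not a real phonotactic grammar of English and not the procedure used in the paper.

```python
import random

# Simplified, hand-picked inventories (illustrative assumption only).
ONSETS = ["b", "bl", "br", "d", "dr", "f", "fl", "g", "gl", "m", "n",
          "p", "pl", "pr", "s", "sl", "sn", "st", "t", "tr"]
VOWELS = ["a", "e", "i", "o", "u"]
CODAS = ["b", "ck", "d", "g", "m", "n", "nd", "nt", "p", "st", "t"]

# Onsets English does not allow at the start of a word, as in the lecture's examples.
BANNED_ONSETS = ("bn", "pn", "pt")

# Hypothetical stand-in for a dictionary lookup that filters out real words.
REAL_WORDS = frozenset({"bed", "bad", "bat", "pat", "pet", "sad", "sat", "nod", "mad"})


def make_stem(rng):
    """Build one syllable: onset + vowel + coda."""
    return rng.choice(ONSETS) + rng.choice(VOWELS) + rng.choice(CODAS)


def make_nonce_verb(rng):
    """Return a past-tense nonce verb whose stem is not in REAL_WORDS."""
    while True:
        stem = make_stem(rng)
        if stem not in REAL_WORDS and not stem.startswith(BANNED_ONSETS):
            return stem + "ed"


rng = random.Random(0)  # fixed seed so runs are repeatable
print([make_nonce_verb(rng) for _ in range(5)])
```

The banned-onset check is redundant with the onset list in this toy version, but it makes explicit the constraint a larger generator would need to enforce.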

Yeah. Yeah? So I'm wondering how the model would tokenize these Jabberwocky sentences. Would it not just map all these words like provocated to the unknown token? So these are largely models that have WordPiece tokenizers. So if they don't know a word, they're like, OK, what's the largest bit of it that I know?

And then that's like a subtoken. And this is how most models work now. It's like back in the day-- and this is like back in the day meaning until maybe like six or seven years ago, it was very normal to have UNK tokens, like unknown tokens. But now generally, there is no such thing as an unknown.

At a bare minimum, you have the alphabet in your vocabulary. So at a bare minimum, you're splitting everything up into letter by letter tokens, character by character tokens. But if you're not, then yeah, it should find kind of, and this is why the phonotactic stuff is kind of important for this, right?

That it's tokenized hopefully in slightly bigger chunks that have some meaning. And because of how attention works and how contextualization works, even if you have just a little bit of a word, the model can give the correct kind of attention to it once it figures out what's going on a few layers in, for a real unknown word.
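The "largest bit of it that I know" idea can be sketched as greedy longest-match-first splitting, which is how WordPiece-style tokenizers segment a word at inference time. The vocabulary below is a made-up toy (every single letter plus a few hypothetical pieces), and the `##` prefix for word-internal pieces follows the BERT convention; treat the whole thing as an illustration rather than any particular library's implementation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split a word into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word-internal pieces
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # Only reachable if some character is missing from the vocabulary.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


letters = "abcdefghijklmnopqrstuvwxyz"
# Hypothetical toy vocabulary: the alphabet plus a handful of larger pieces.
TOY_VOCAB = (set(letters)
             | {"##" + c for c in letters}
             | {"pro", "##voc", "##ated", "##ed", "cat"})

print(wordpiece_tokenize("provocated", TOY_VOCAB))
print(wordpiece_tokenize("bnick", TOY_VOCAB))
```

With this toy vocabulary the English-like nonce word splits into a few multi-letter pieces, while the phonotactically odd one falls back to single characters, which is the contrast the answer above is pointing at.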

For like a fake unknown word, then yeah. Cool. I went back, but I want to go forward. Cool. Any more questions about anything? Yeah. A few slides back, there was like 80% scores that you were saying these are not-- like this isn't a solved problem yet. I'm just trying to get a sense of what 80% means in that context.

Is it like 80% of exact-- Yeah, it was exact. I think the relevant comparison is that when you didn't have this kind of structural difference, where something that was only ever seen in one role then shows up in the other role.

You know, the accuracy on that test set is like 100%, easy. There was no good graph which showed these next to each other. They kind of mentioned it. But yeah. And so I think that's the relevant piece of information, that somehow this swapping around of roles slightly trips it up.

That being said, you're right, exact match of semantic parses is kind of a hard metric. You know, and so, yeah, none of this stuff, and I think this is important, none of this stuff is damning. None of this stuff is like, they do not have the kind of rules humans have.
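To make concrete why exact match is a hard metric, here is a minimal sketch: a predicted parse only counts if the whole string equals the gold parse after whitespace normalization, so a single swapped role makes the example wrong. The bracketed parse format and the example strings are invented for illustration and are not the benchmark's actual output format.

```python
def exact_match_accuracy(gold_parses, predicted_parses):
    """Share of examples whose predicted parse equals the gold parse exactly."""
    if len(gold_parses) != len(predicted_parses):
        raise ValueError("need one prediction per gold parse")
    if not gold_parses:
        raise ValueError("cannot score an empty test set")

    def normalize(parse):
        # Collapse runs of whitespace so spacing differences are not penalized.
        return " ".join(parse.split())

    hits = sum(normalize(g) == normalize(p)
               for g, p in zip(gold_parses, predicted_parses))
    return hits / len(gold_parses)


# Hypothetical parse strings in a made-up format.
gold = [
    "see ( agent = Emma , theme = cat )",
    "see ( agent = cat , theme = Mary )",
]
predicted = [
    "see ( agent = Emma ,  theme = cat )",   # extra space only: still a match
    "see ( agent = Mary , theme = cat )",    # roles swapped: counts as fully wrong
]

accuracy = exact_match_accuracy(gold, predicted)
print(accuracy)
```

The second prediction gets every token right except for the role assignment, yet it scores zero, which is why an 80% exact-match score can understate how close the outputs are.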

This is also like, well, there's a bit of confusion. There's a bit of confusion in humans. It actually gets quite a bit-- it gets quite subtle with humans. And I'm going to go into that in the next section too. Yeah. Overall-- sorry, what is it? Yeah. Overall, like I think the results are like surprisingly not damning, I would say.

Yeah, there's clearly, you know, maybe not the fully programmed discrete kind of rules. But yeah. Cool. Another thing we could do, yeah, is test how syntactic structure kind of maps onto meaning and role, right? And so as we said before, in English, the syntax of word order gives us the who did what to whom meaning.

And so, you know, for any combination like A verb B, if I have something like A verb B, A is the doer, B is the patient. And so we ask, is this kind of relationship strictly represented in English language models as it is in the English language?

And so what we could do is take a bunch of things which appear in subject position, a bunch of things which appear in object position, and take their latent space representation and learn a little classifier, you know. This should be a pretty clear distinction in latent space.

In any good model, right, and these models are good, this should be a pretty clear distinction. We could just use a linear classifier to kind of separate them, right? And the more on the one side you are, you're more subject, the more on the other side you are, you're more object, right?

And so then we can test, you know, does the model know the difference between when something is a subject and when something is an object, does it know that you're going to go on opposite sides of this dividing line, even if everything else stays the same and all the clues point to something else, right?
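The probing setup described here can be sketched as a logistic-regression classifier whose decision boundary is the dividing line, with signed distance from that line read as "how subject-like" a vector is. Everything below is an assumption for illustration: the vectors are synthetic stand-ins for hidden states (only one dimension carries the role signal), and the training loop is a plain hand-written one rather than whatever the actual experiments used.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_linear_probe(vectors, labels, epochs=50, lr=0.1):
    """Fit logistic regression with plain stochastic gradient descent."""
    w = [0.0] * len(vectors[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            z = max(-30.0, min(30.0, dot(w, x) + b))  # clamp so exp() stays well behaved
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def signed_distance(x, w, b):
    """Positive means the subject side of the hyperplane, negative the object side."""
    return (dot(w, x) + b) / math.sqrt(dot(w, w))


def make_toy_vectors(rng, center, n, dim=8):
    # Synthetic stand-ins for hidden states: dimension 0 carries the role signal.
    return [[rng.gauss(center, 0.5)] + [rng.gauss(0.0, 0.5) for _ in range(dim - 1)]
            for _ in range(n)]


rng = random.Random(0)
subjects = make_toy_vectors(rng, +1.0, 100)
objects = make_toy_vectors(rng, -1.0, 100)

data = [(x, 1) for x in subjects] + [(x, 0) for x in objects]
rng.shuffle(data)
w, b = train_linear_probe([x for x, _ in data], [y for _, y in data])

mean_subject = sum(signed_distance(x, w, b) for x in subjects) / len(subjects)
mean_object = sum(signed_distance(x, w, b) for x in objects) / len(objects)
accuracy = sum((signed_distance(x, w, b) > 0) == (y == 1) for x, y in data) / len(data)
print(mean_subject, mean_object, accuracy)
```

In a real experiment the vectors would come from a chosen layer of the model, and the interesting test is to feed in a noun that was swapped into the other role and see which side of the line it lands on.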

So it's like, does syntax map onto role in this way? You might think, well, I could just check if it's second or fifth, right? But, yeah, this is a paper that I wrote, and we did try to control for position stuff in various ways.

And so hopefully, we claim, we're showing the syntax to role mapping. And what we see is that it does, right? So if we graph the distance from that dividing line on the y-axis, we see the original subjects, when we swap them and put them in object position, they do diverge as we go up layers in that dimension.

And we tried this, again, all these analysis experiments, with some small models, with some BERT, with some GPT-2, with a bigger version of GPT-2, and it worked out. But, you know, none of this is the big, big stuff.

I think now we're starting to see more analysis on the big, big stuff. I think it's really cool. Yeah. So then where are we with like structure and language models, right? We know that language models are not, they're not engineered around discrete linguistic rules. But the pre-training process, you know, it isn't just a bunch of surface level memorization, right?

We have seen this. There is some kind of like discrete rule-based system kind of coming out of this. You know, maybe it's not the perfect kind of thing you would like write down in a syntax class, but, you know, there is some syntactic knowledge, you know, and it's complicated in various ways.

And humans are also complicated. And that's what we're going to get to next, right? There's no ground truth for how language works yet, right? Like if we knew how to fully describe English, right, with a bunch of good discrete rules, we would just like make an old pipeline system and it would be amazing, right?

If we could take the Cambridge Grammar of English, but it was truly, truly complete. If we just knew how English worked, we would do that. And so we're working on this case where there's really no ground truth. Cool. Any questions about this? Then let's try and move beyond syntactic structure.

Cool. So moving beyond this kind of like very structure-based idea of language, I think it's very cool to learn about structure in this way. And like at least how I was taught linguistics, it was like a lot of it, the first like many semesters was like this kind of stuff.

But I think there's so much more. And very importantly, I think that meaning plays a role in linguistic structure, right? There's a lot of rich information in words that affects the final way that the syntax works. And of course, what you end up meaning and what the words influence each other to mean, right?

And so the semantics of words, right, the meaning, it's always playing a role in forming and applying the rules of language, right? And so, you know, a classic example is verbs, they have kind of selectional restrictions, right? So ate can take kind of any food and it can also take nothing.

It's like I ate, it means that I've just eaten, right? Devoured, right? The word devoured actually can't be used intransitively, right? It sounds weird. You need to devour something, right? There's verbs like elapsed that only take, you know, a very certain type of noun, right?

Like elapsed only takes nouns that refer to time, you know, so maybe harvest can refer to time, moon can refer to time, but, you know, trees, it cannot take a noun like trees, right? There's even verbs that only ever take one specific noun as their argument, right?

It's like a classic example. I think, yeah, my advisor Dan Jurafsky told me this one to put it in. And what's cool is that that's how we train models these days. If you see this diagram I screenshotted from John's Transformers lecture, right? We start with a rich semantic input, right?

We start with these embeddings on the order of a thousand dimensions, you know, depending on the model size, right? Think of how much information you can express on a plane, right? On two dimensions. The kind of richness that you can fit into a thousand dimensions, you know, it's huge, and we start with these word embeddings and then move on, right?

To the attention block and everything. And so, yeah, I'm just gonna go through some examples of the ways that meaning kind of plays a role in forming syntax. Hopefully it's fun, a tour through the cool things that happen in language, right?

So, as we said, you know, anything can be an object, anything can be a subject, we want to be able to say anything, language can like express anything, this is like kind of a basic part of language. But, you know, many languages they have a special syntactic way of- of dealing with this, right?

So, they want to tell you if there's an object that you wouldn't expect, right? Like in this case, I want to tell you, hey, watch out, be careful, we're dealing with a weird object here, right? So, this is kind of in the syntax of languages, you know, if you're a native speaker or you've learned Spanish, right?

You know this a constraint, right? So, if something is an object but it's inanimate, you don't need the a, because you're like, yeah, I found a problem. But then if you're putting something animate in the object position, you need to kind of mark it and you'd be like, hey, watch out, there's an object here.

And that's a rule of the grammar, right? If you don't do this, it's wrong. And they tell you this in Spanish class. Similarly, Hindi has a kind of more subtle one, but I think it's cool, right? So, if you put an object that is definite, you have to mark it with a little object marker, right?

Like a little accusative marker, right? And you might ask, okay, I understand why animacy is a big deal, right? Like, you know, maybe animate things more often do things than have things done to them. But why definiteness, right? Why would you need this little ko marker, this the goat versus a goat?

And it's like, well, if something is definite, it probably means that we've kind of been talking about it or we're all thinking about it, you know. For example, it's like, oh, I ate the apple, right?

This means that either we had one apple left and I ate it, or it was really rotten or something and you can't believe I ate it, right? Or something like that. And so then things that we're already talking about, they're probably more likely to be subjects, right?

Like if I was like, oh, Rosa, you know, yeah, I feel like Rosa did this and Rosa did that and Rosa that. And then, Leon kissed Rosa. You'd be like, no, you probably want to be like Rosa kissed Leon, right?

You probably want to put, you know, it's not strict, but if you're talking about something, it's probably going to be the subject of the next sentence. So then if it's the goat, you have to put a little accusative marker on it. So this is how the marking in the language works, and it's kind of all influenced by this interesting semantic relationship.

And language models are also aware of these gradations. In a similar classifying subjects and objects paper that we wrote, we see that language models also have these gradations, right? So again, if you plot the probability of being a subject from that classifier on the y-axis, right, we see that there's a high accuracy, right?

This is over many languages. And all of them, you know, on the left, we have the subjects, they're classified above. On the right, we have the object, they're classified below. But, you know, animacy kind of influences this grammatical distinction, right? So like if you're animate and a subject, you're very sure.

If you're inanimate and an object, you're very sure. Anything else, you're kind of close to 50, you know? And so it's like this- this kind of a- this kind of relation where the meaning plays into the structure is- is reflected in language models, you know? And that's not bad.

It's good because it's how humans are. Or, you know, it kind of- we should like, you know, temper our expectations maybe away from the like fully- fully syntactic things that we're talking about. Another kind of cool- cool example of like- of how meaning can influence, you know, what we can say.

I've said from the beginning many times that all kinds of combinations of structures and words are possible, but that's not strictly true, right? So in many cases, if something is too outlandish, we often do just assume the more plausible interpretation, right? So there's these psycholinguistics experiments where they test this with these kind of giving verbs.

Verbs like, you know, the mother gave the daughter the candle, and you could actually switch that around, this is the dative alternation, to make the mother gave the candle to the daughter. And then if you switch around who's actually being given, right?

So if you're actually saying the mother gave the daughter to the candle, people don't interpret that in its literal sense. They usually interpret it as the mother gave the daughter the candle. And of course, outlandish meanings, you know, they're never impossible to express, right?

Because nothing is, right? And so you can kind of spell it out, you know, you could be like, well, the mother, she picked up her daughter and she handed her to the candle, you know, who is sentient. And then you could say this, but you can't do it simply with the give word, people tend to interpret it the other way.

And so marking these less plausible things more prominently is a pervasive feature that we see across languages in all these ways. And all these ways are also very embedded in the grammar, as we saw earlier in Spanish and Hindi.

Cool. So another way that we see meaning kind of play in, and kind of break apart this full compositionality syntax picture, right, is that meaning can't always be composed from individual words, right? Language is just full of idioms. Sometimes when you talk about idioms, you might think, okay, there's maybe like 20 of them, you know, things my grandfather would say, things about chickens and donkeys.

In Greece, they're all donkeys. But we're actually constantly using constructions that are kind of idiomatic in their own little sense, right, that we couldn't actually get from composing the words, right? Things like, I wouldn't put it past him, he's getting to me these days, that won't go down well with the boss, you know.

There's so, so many of these, and it's kind of a basic part of communication to use these little canned idiomatic phrases, you know. And linguists love saying that, oh, any string of words you say is totally novel, and it's probably true, you know. I've been speaking for like 50 minutes, and probably no one has said this exact thing ever before. I just use the compositional rules of English to make it.

But actually, most of my real utterances are like, oh, yeah, no, totally, right, something like that, which people actually say all the time, right? We have these little canned things that we love reusing, and we reuse them so much that they stop making sense if you break them apart into individual words, right?

And we even have these constructions that can take arguments, so they're not canned words, they're kind of a canned way of saying something that doesn't really work if you build up from the syntax, right? So like, oh, he won't eat shrimp, let alone oysters, right?

And what does that mean? Well, it means I'm defining some axis of, you know, of moreness, right? In this case, probably shellfish and weird or something, you know, and so it's like, well, shrimp is less weird, so oysters are more, you know. And if I say, oh, he won't eat shrimp, let alone beef, right?

The axis is vegetarianism, right? So this construction, it does kind of a complex thing, right? Where you're saying, he won't do one thing, let alone the one that's worse in that dimension. Or it's like, oh, she slept the afternoon away, he knitted the night away, they drank the night away, right?

This time away thing, you can't really derive it otherwise, you know. Or this -er, -er construction, like the bigger they are, the more expensive they are, right? Like the, man, I forgot how it goes, the bigger they come, the harder they fall, right?

And it's like, you know, that travesty of a theory, right? That of a construction. There's so many of these, right? So much of how we speak, if you actually try to do the tree parse and build the semantic parse up from it, it won't really make sense.

And so there's been this work that's more recently been coming to light, and I've been really excited by it, testing constructions in large language models. There was just this year a paper by Kyle Mahowald, who is a postdoc here, testing the a beautiful five days in Austin construction, right?

So it's the a, adjective, numeral, noun construction, where it seems like it wouldn't really work, right? Because you have a with days, right? And anything kind of similar to it, like a five beautiful days, that doesn't work, right?

So somehow this specific construction is grammatically correct to us. But you can't say a five days in Austin, right? You can't say a five beautiful days in Austin, you know, you have to say it like this. And GPT-3 actually largely concurs with humans on these things, right?

So on the left here, the gray bars, we have the things that are acceptable to humans, right? So those are a beautiful five days in Austin and five beautiful days in Austin, right? Those are both acceptable to humans. They do this over many, many instances of this construction, not just Austin, obviously.

But yeah, we see GPT-3 accepts these, those are the gray bars, and humans also accept these, those are the green triangles. And for every other variation, the human triangles are very low. And GPT-3 is lower, but does get tricked by some things, right?

So it seems to have this knowledge of this construction, but not as starkly as humans do, right? Especially if you see that third one over there, right? The five beautiful days, humans don't accept it as much. It's funny, to me it sounds almost better than the rest of them, but I guess these green triangles were computed very robustly.

So I'm an outlier. Yeah. And GPT-3 thinks those are better than maybe humans do, but there is this significant difference between the gray bars and the orange bars. And then similarly, some people tested the X-er, the Y-er construction, right?

And so they took examples of sentences that were the X-er, the Y-er construction. And then they took example sentences which had an -er followed by an -er, but they weren't actually the X-er, the Y-er, right?

It's like, oh, the older guys help out the younger guys, right? So that's not an X-er, Y-er construction. And then they were like, right, if we mark the ones that are as positive and the ones that aren't as negative, does the latent space of models kind of encode this difference, right?

That all of this construction kind of clusters together in a way. And they find that it does. And then the last thing I want to talk about in this semantic space, after constructions and all that, is that the meaning of words is actually very subtle and sensitive, and it's influenced by context in all these crazy ways, right?

And Erika Petersen and Chris Potts from the linguistics department here did this great investigation on the verb break, you know. And break can have all these meanings, right? We think, yeah, break is a word, and words are things like table and dog and break that have one sense.

But actually it's not even senses that you can enumerate, like river bank and financial bank. It's just like, yeah, break the horse means tame, while break a $10 bill means split into smaller bits of money, right? And there's just so many ways, right?

Right. Like break free and break even. There's just so, so many ways in which break, you know, its meaning is just so subtle and influenced by context. It's kind of true for every word, you know, or many words. Maybe with table and dog, it's like, yeah, there's a set of all things that are tables or dogs.

And it's like kind of describes that set. You know, there's maybe some more philosophical way of going about it, but, you know, so like pocket, you know, it's like a pocket, but then like you can pocket something. Then like it kind of means steal in many cases, doesn't just mean put something in your pocket literally.

Right. So yeah, there's all these ways in which the meaning of words is shaped by everything around it. And what they do is, don't worry about what's actually going on here, but they've mapped each sense to a color.

Right. And when you start off in layer one, I think the clustering is just by position embedding. And if you take all the words and pass them through a big model, like RoBERTa-large.

Right. Then at first they're all jumbled up. Right. Because they're all just break. Right. They're just in different positions. And then, by the end, they've all kind of split up. All the colors are kind of clustering together. Each color is one of these meanings.

Right. And so they kind of cluster together, and is it constructions again, or is it just, you know, the way in which they kind of isolate these really subtle aspects of meaning. Yeah. So then I think a big question in NLP, right, is how do we strike the balance between syntax and the ways that meaning influences things?

Right. So, well, I pulled up this quote from a book by Joan Bybee, which I enjoy. And I think it kind of brings to light a question that we should be asking in NLP. Right. This book is just a linguistics book. It's not about NLP at all.

But, you know, while language is full of both broad generalizations and item-specific properties, linguists have been dazzled by the quest for general patterns. Right. That was the first part of this talk. And of course, the abstract structures and categories of language are fascinating.

But, you know, I would submit or she would submit that what is even more fascinating is the way that the general structures arise from and interact with the more specific items of language, producing a highly conventional set of general and specific structures that allow the expression of both conventional and novel ideas.

Right. It's kind of like this like middle ground between abstraction and like specificity that like we would want, you know, that like humans probably exhibit that we would want our models to to exhibit. Yeah. I was wondering if you could go back one slide and just unpack this diagram a little more because I'm fairly new to NLP.

I've never seen a diagram like this. Oh, sorry. Yeah. What does this mean? How should I, you know, interpret it? Oh, so this is all, you know, as you're passing words through a transformer through many layers. I just wanted to be like, look at how the colors cluster.

But yeah, you're passing through a transformer, many layers. At any one point in that transformer, you could say, OK, how are the words organized now, you know, and you think, well, I'm going to project that to two dimensions from like a thousand. And that's maybe a good idea, maybe a bad idea.

There's a lot to that question, but, you know, I wouldn't be able to show them here if they were a thousand. So let's assume that it's an OK thing to be doing. Then, so this is what they've done for layer one and then for layer twenty-four.

And so we can see that they start off where the colors are totally jumbled, and, you know, before layer one, you add in the position embedding. So I think that's what all those clusters are. Right. So it's kind of clustering by position because you don't have anything else to go off of.

You know, it's like, this is break and it's in position five. It's like, OK, I guess I'll cluster all the breaks in position five. But then as you go up the model, right, and all this meaning is being formed, you see these senses kind of come out in how it organizes things.

Right. So all these breaks become very specific, you know, very subtle versions of break. And I think this work is different from a lot of NLP work because it has a lot of labor put into this labeling.
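One simple way to put a number on "the colors cluster" is to compare how far apart points with different sense labels are against how far apart points with the same label are. The sketch below does that for hand-written 2-D points; the coordinates, the sense labels, and the `separation_ratio` name are all made up for illustration, and this is not the analysis from the paper being described.

```python
import math
from itertools import combinations


def separation_ratio(points, senses):
    """Mean between-sense distance divided by mean within-sense distance.

    Values above 1 mean same-sense points sit closer together than
    different-sense points, i.e. the senses form clusters.
    """
    within, between = [], []
    for (p, s), (q, t) in combinations(list(zip(points, senses)), 2):
        (within if s == t else between).append(math.dist(p, q))
    if not within or not between:
        raise ValueError("need at least two senses and a repeated sense")
    mean_within = sum(within) / len(within)
    if mean_within == 0:
        raise ValueError("all same-sense points coincide")
    return (sum(between) / len(between)) / mean_within


# Hypothetical sense labels for four occurrences of "break".
senses = ["tame", "tame", "split money", "split money"]

# Toy 2-D projections: jumbled early on, separated by sense later.
early_layer = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
late_layer = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]

early_ratio = separation_ratio(early_layer, senses)
late_ratio = separation_ratio(late_layer, senses)
print(early_ratio, late_ratio)
```

A ratio like this still depends on hand-labeled senses for every occurrence, which is the labor-intensive part mentioned above.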

Right. Because, you know, the person who did this is a linguistics student. Right. And if you go through a corpus and label every break by which one of these it means, it's a lot of work. And so I think it's the kind of thing that you wouldn't be able to show otherwise.

So it's often not really shown. Yeah. Cool. So yeah, language is characterized by the fact that it's this amazingly abstract system. Right. I started off raving about that. And, you know, and we want our models to capture that. That's why we do all this compositionality kind of syntax tests.

You know, but meaning is so rich and multifaceted. Right. High dimensional spaces are much better at capturing these subtleties. Right. We started off talking about word embeddings in this class. Right. You know, high dimensional spaces are so much better at this than any rules that we would come up with being like, OK, maybe we could have like break subscript, like break money, you know, and we're going to put that into our system.

And so where do deep learning models where do they stand now? Right. Between surface level memorization and abstraction. You know, and this is what like a lot of analysis and interpretability work is trying to understand, you know. And I think that what's important to keep in mind when we're reading and kind of doing this analysis and interpretability work is that this is not even a solved question for humans.

Right. Like we don't know exactly where humans stand between like having an abstract grammar and having these like these like very like construction specific and meaning specific ways that that like things work by. Cool. Any questions overall on the importance of semantics and the richness of human language? Yeah.

This is probably a question from quite a bit before, but you're showing a chart from your research where the model is really well able to distinguish inanimate from animate given its knowledge of subject or object. I was just trying to interpret that graph and understand what the sort of links between the characters.

Yeah, I did a flashback yesterday. Sorry, I know it's long way back. No, no, it's not. I think it's here. Right. Yeah. Yeah. So so the main so this is similar to the other graph where it was you know where what it's trying to distinguish is a subject from object.

But we've just split the test set into these four ways. We're split into like subject inanimate, subject animate, you know, so we just split the test set. Right. And so like what the what like the two panels in the x axis are showing are like these different splits. Right.

So like OK so things that are subjects, and basically the ground truth is that things on the left should be above 50, things on the right should be below 50. And that's what's happening. But if we further split it by animate and inanimate, we see that there's just like influence of animacy on the probability.

That was. Yeah. Sorry, I rushed over these graphs like kind of I want to get like a taste of things that happen. But yeah, it's good to also understand fully what's going on. Cool. Yeah. So this is also from a while back. You don't have to go to the slide.

Yeah. So you were talking about acceptability. So I'm assuming for judging acceptability, you just ask that for like GPT-3, how do you determine if it finds a sentence acceptable? I think you could just take logits. I think that's what Kyle Mahowald did in this paper. Right. You could just like take the probabilities it outputs, you know, for GPT-3, it's like going left to right.

I think there's like other things that people do sometimes. But like, yeah, especially for these models, you don't have too much access to them apart from like the generation and like the probability of each generation. I think that you could. Yeah, I think that you might want to do that.

And there's like, you know, you don't want to multiply every logit together, right? Because then like if you're multiplying many probabilities, longer, longer sentences, you know, become like very unlikely, right? Which is like not true exactly for humans or, you know, it's not true in that way for humans.
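To make the length issue concrete, here is a small sketch comparing a summed log-probability score with a per-token average. The per-token probabilities are invented for illustration, and averaging is only one common way to control for length, not necessarily what the paper mentioned in the answer did.

```python
import math

def sentence_logprob(token_probs):
    """Log of the product of per-token probabilities (left to right)."""
    return sum(math.log(p) for p in token_probs)

def mean_logprob(token_probs):
    """Length-normalized score: average log-probability per token."""
    return sentence_logprob(token_probs) / len(token_probs)

# Hypothetical per-token probabilities from a left-to-right language model.
short_sentence = [0.2, 0.2, 0.2]   # 3 tokens, each fairly unlikely
long_sentence = [0.3] * 8          # 8 tokens, each somewhat more likely

total_short = sentence_logprob(short_sentence)
total_long = sentence_logprob(long_sentence)
avg_short = mean_logprob(short_sentence)
avg_long = mean_logprob(long_sentence)
```

With the raw product the long sentence scores lower simply because it has more factors below one. With the per-token average it scores higher, since each of its tokens is individually more probable.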

So, you know, I think there's like things you should do, like ways to control it and stuff like when you're running an experiment like this. Yeah. Cool. Okay. So moving on to multilinguality in NLP. So so far we've been talking about English, right? All this I haven't been saying it explicitly all the times, but most things I've said, you know, apart from some, maybe some differential object marking examples, right?

They've been kind of about English, about English models, but there's so many languages, right? There's like 7,000 languages in the world, maybe not over, there's around 7,000 languages in the world, right? Like it's hard to define, right? Like what a language is, right? It's kind of difficult, you know, like even in the case of English, right?

We have things like, it's like Scots, right? The language spoken in Scotland, is that English? It's like, you know, something like Jamaican English, you know, like maybe that's a different language, right? There's like different structures, but it's still like clearly like much more related than anything else, right? Than like German or something, right?

And so, you know, how do you make a kind of a multilingual model? Well, so far a big approach has been, you know, you take a bunch of languages, this is like all of them, and maybe you're not going to take all of them, you know, maybe you can take a hundred or something, and you just funnel them into just like one transformer language model.

And there's maybe things you could do like upsampling some that you don't have too much data of, you know, or like downsampling some that you have too much data of, you know, but like this is the general approach, you know, what if we just make one, you know, like one transformer language model, you know, like something like a BERT, it's usually like a BERT type model, because it's hard to get good generation for like too many languages, you know, but yeah, how about just get one transformer language model for all of these languages, right?
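One common way to do this kind of upsampling and downsampling is to raise each language's data size to a power below one before normalizing. The sketch below assumes that scheme; the lecture does not name a specific method, and the corpus sizes, language codes, and exponent are hypothetical.

```python
def sampling_probs(sizes, alpha):
    """Sampling probability per language, proportional to size ** alpha.

    alpha = 1.0 samples in proportion to the raw data size;
    alpha < 1.0 shifts probability from large languages to small ones.
    """
    weights = {lang: n ** alpha for lang, n in sizes.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

# Hypothetical sentence counts for a high-, mid- and low-resource language.
sizes = {"en": 1_000_000, "el": 100_000, "yo": 1_000}

raw = sampling_probs(sizes, alpha=1.0)
smoothed = sampling_probs(sizes, alpha=0.5)
```

Here `smoothed["yo"]` is much larger than `raw["yo"]`, and `smoothed["en"]` is smaller than `raw["en"]`, so the low-resource language is seen more often during training than its raw share would allow.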

And so what's cool about this is that multilingual language models, right, they let us share parameters between high resource languages and low resource languages, right? There's a lot of languages in the world, really just most languages in the world, which you could not train like even like a BERT size model for, right, there's just like not enough data and there's, yeah, and there's a lot of work being done on this.

And one way to do this is say like, well, you know, like, you know, pre-training and transfer learning, they brought us so much unexpected success, right? And so like, you know, and we get this great linguistic capability in generality, right, if we pre-train something in English that we weren't asking for, so, you know, so will this self-supervised learning paradigm, you know, can it like deliver between languages?

So it's like, maybe I can get a lot of the, a lot of the like linguistic knowledge, like the more general stuff from like just all the high resource languages and then kind of apply it to the low resource languages, right? Like a bilingual person doesn't have like two totally separate parts of their self, right, that like have learned languages, there's probably some sharing, some way that like things are like in the same space, like the linguistics are broadly the same, right?

And so, and so, and so, you know, we have this like attempt to like bring NLP to like some still very small subset of the 7,000 languages in the world. We can look at it through two lenses, right? On the one hand, you know, languages are remarkably diverse. So we'll go over some of the cool ways that languages in the world vary, you know, and so does multilingual NLP capture the specific differences of different languages?

On the other hand, you know, languages are similar to each other in many ways. And so does multilingual NLP capture the parallel structure between languages? So you know, just, just, just to go over some ways, like, you know, really understanding like how like diverse languages can be, you know, in around a quarter, this is a quote from a book, but you know, in around a quarter of the world's languages, every statement, right, like every time you use a verb must specify the type of source on which it is based, right?

So it's like a part, you know, how we have like tense in English, where we like, you know, kind of everything you say is like kind of either in the past or the present or the future tense, right? And so like an example in Tariana, these are again from, from the book, right?

It's not a language I know, right? But it's, you know, you, you have this like marker in bold at the end, right? And so, and so when you say something like, José has played football, right? If you put like the -ka marker, that means that we saw it, right?

It's kind of like the visual evidential marker, right? And there's, and there's kind of a non-visual marker that kind of means we heard it, right? So if you say, you know, so if you say a statement, you could say we heard it, right? There's a like, we infer it from visual evidence, right?

So if it's like, oh, his like cleats are gone, and he is also gone, but like, and people, you know, and we see people going to play football, right? Or we see people coming back, I guess, from playing football because in the past, right? That means like, you know, so, so we can infer it.

And so you could put this, right? There's like, you know, or like, if he plays football every Saturday, you know, and it's Saturday, you would use a different marker, right? Or like, if someone has told you, if it's hearsay, you would use a different marker, right? So this is like, this is like a part of the grammar, right?

That like, to me, at least, right? Like, I don't speak any language that has this, it seems like it's, it seems like very cool and like different from like anything I would ever think would be like a part of the grammar, right? But it is. Or like, especially like a compulsory part of the grammar, right?

But, but it is, right? And you can like map out, I wanted to include some maps from WALS, the World Atlas of Language Structures. That's always like, so fun. You know, you could like map out all the languages, right? Like I only speak white dot languages, which are like no grammatical evidentials.

You know, if you want to say whether you heard something or saw it, you have to say it like in words, right? But there's many languages, you know, that vary, yeah, especially in the Americas, right? Tariana is, I think, a Brazilian language from like up by the border with, yeah.

But yeah, the, you know, while we're looking at like language typology maps, right? And so like this, this like language organization and categorization maps, the most like, the classic one, right, is again, like the subject object and verb order, right? So as you said, English has SVO order, but there's just so, so many orders that, you know, kind of like almost all the possible ones are attested, you know, some languages have no dominant order, like Greek, so like a language that I speak natively has no dominant order.

You would say you would move things around for emphasis or whatever. Yeah. And you see like, and here, you know, we're seeing some, some like diversity, we're seeing typology, we're also seeing some tendencies, right? Like some are just so much more common than others, right? And this is like, again, something which like people talk about so much, right?

It's, it's, it's like a very big part. Yeah. It's, it's like a huge part of linguistics. Why are some more common than others, it's like a basic fact of language, it's something which happened, you know, is this like just the fact of like how discourse works, maybe, you know, like that's, that's more preferred for many people to say something, you know, and there's a lot of opinions on this.

Another way though that languages vary, you know, is like the number of morphemes they have per word, right? Like some languages are like, you know, like Vietnamese classically, just like very isolating, like kind of like each, you know, like each kind of thing you want to express like tense or something is going to be in, in a different word, you know, in English, we actually combine kind of tenses, we have things like "-able," right?

Like, you know, like, like throwable or something, right? And then like in, in, in some languages, they're just like really so much stuff is expressed in morphemes, right? And so you can have languages, especially in like Alaska and Canada, a lot of languages there and like Greenland, where you have like, and these are all like one, one language family, you can have like kind of whole sentences expressed with just like things, things that get tacked on to, to, to the verb, right?

So you have to have things, things like the, you know, like the object and the, or I guess in this case, you start with the object, again, you have kind of like the verb and the like, whether it's happening or not happening and who said it and like, or like whether it's said in the future and all that just kind of all put in, you know, these like, quote unquote, like sentence words, right?

It's like a very different way of a language working than English works like at all, right? Yeah, you have a question? Yeah, this is from two slides ago, the one with the map. I just want to know like what these dots mean, because in the US, the top right is gray, like in the Northeast, but in the Pacific Northwest, it's yellow.

Is that different dialects for like the same American English? Oh, no, these are all indigenous languages. Oh, I see. Yeah, yeah. So, so English is just this one dot in here spread in amongst all the like Cornish and Irish and stuff. Yeah, so English is just like in Great Britain.

Yeah. Yeah. Yeah. Yeah, and that's why, yeah, and that's why like all this like really and that's why like all this like evidential stuff is happening in, uh, in like the Americas, right? Because there's like a lot of, you know, very often the indigenous languages of the Americas are like the classic, like very evidentially marking ones, which are the pink ones.

Yeah. You said that normally we use like a BERT-style model for multilingual models because it's difficult for natural language generation across languages. Yeah. I mean, I guess intuitively that makes sense, right? Because of the subtleties and the nuance between different languages when you're producing it. But is there like a reason that, um, like a particular reason that that's been so much harder to make developments on?

I think it's just hard. I think a good generation is just like harder, right? Like to get something like, you know, like GPT-3 or something, you need like really like a lot of data and maybe like it's kind of like, I think there are, can I think of any, are there any, is it GShard?

GShard's encoder-only. Yeah, I can't really think of any like, you know, like encoder-decoder, as you said, you know, like kind of big multilingual models. Of course, like GPT-3 has this thing where if you're like, how do you say this in French? It'll be like, you say it like this, you know?

So it's like, if you've seen all of the data, it's going to include a lot of languages, but this kind of like multilingual model where you'd be like, right, you know, be as good as GPT-3, but in this other language, you know, I think it's just, it's just, you need a lot more data to get that kind of coherence, right?

As opposed to like, yeah, as opposed to something if you do like text infilling or something, which is like how the BERT-style models are, then you get like very good, even if the text infilling, you know, performance isn't great for every language, you can actually get very, very good embeddings to work with for a lot of those languages.

Yeah. Cool. Now, for just like a one last language diversity thing, I think this is interesting, the motion event stuff, because it's like, this is actually, you know, it's not, it's like languages that, you know, many of us know, I'm going to talk about Spanish, but it's actually something which you might not have thought about, but then once you see, you're like, oh, actually, that's like actually affects how like everything works.

So in English, right, the manner of motion is usually expressed on the verb, right? So you can say something like the bottle floated into the cave, right? And so like, the fact that it's floating is on the verb, and the fact that it's going in is kind of on this satellite.

While like in Spanish, the direction of motion is usually expressed on the verb. Greek is like this too. I feel like most Indo-European languages are not like this, they're actually like English. So like most languages from like Europe to like North India tend to not be like this, right? So you would say like, "La botella entró a la cueva flotando," right?

So you'd have like, so the floating is not usually put on the main verb. And like, in English, you could actually say like, right, like the bottle entered the cave floating, it's just like maybe not what you would say, right? And similar, like in Spanish, you can say the other way, right?

This is called like satellite-framed languages and verb-framed languages, and it like really affects how you would kind of like say most, you know, like kind of how everything works, right? It's kind of like a division that's like, you know, pretty attested, of course, it's not a full division, right?

It's not like this exclusive categorization. Chinese I think often has these structures where there's like two verb slots, right? Where you could have both a manner of motion and a direction of motion kind of in the like the one verb slot, neither of them has to go kind of like after, playing some different role, right?

So these are like, there's all these ways in which like languages are just different, you know, from like things that maybe we didn't even think could like be in a language, to like things that like we do, right? But we don't realize that, in some, sometimes they're just like so different in these like subtle ways.

And so, you know, and so going to the other angle, languages are so different, but they're also very alike, right? So like, you know, there's this idea like, is there like a universal grammar, some like abstract structure that unites all languages, right? This is like a huge question in linguistics.

And you know, the question is, can we define an abstraction where we can say like all languages are some version of it? There's like other ways of thinking about universals, like all languages like tend to be one way, or like languages that tend to be one way also tend to be some other way.

And there's like a third way of thinking about universals, that's like languages all deal in similar types of relations, you know, like subject, object, you know, like types of modifiers, right? So the universal dependencies project was like a way of kind of saying like, maybe we can make dependencies kind of for all languages in a way that doesn't shoehorn them into each other, you know?

And yeah, I guess like what was it called? RRG, like relational something grammar, you know, was also kind of this idea that maybe one way to think about all languages together is like the kind of relations they define, you know? And, you know, ask me about kind of like the Chomskyan and the Greenbergian stuff if you want and how it relates to NLP.

I think like there's a lot to say there. It's kind of, yeah, it's slightly more difficult. So maybe it's easier to think of this third one in terms of NLP, right? And like back to the subject object relation stuff, if we look at it across languages, right, we see that they're kind of encoded in parallel because classifiers, right, those classifiers that we're training, they're like as accurate in their own language as they are in other languages, right?

Their own language being red and other languages being black, right? It's not like, wow, if I take a multilingual model and I train one classifier in one language, it's going to be so good at itself and so bad at everything else, right? They're kind of interspersed. They're clearly like on the top end, the red dots.
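The evaluation protocol described here, training a subject/object classifier on one language's embeddings and testing it on another's, can be sketched as follows. Everything in the sketch is a stand-in: the vectors are synthetic, the nearest-centroid probe replaces whatever classifier the actual work trained, and the data is deliberately built so that the subject/object direction is shared across the two toy "languages". It shows the procedure, not the finding.

```python
import random

def make_tokens(rng, n, lang_offset):
    """Synthetic 4-d 'embeddings': dim 0 carries role, dim 1 carries language."""
    data = []
    for _ in range(n):
        for label, role in (("subject", 1.0), ("object", -1.0)):
            vec = [role + rng.gauss(0, 0.3),
                   lang_offset + rng.gauss(0, 0.3),
                   rng.gauss(0, 0.3),
                   rng.gauss(0, 0.3)]
            data.append((vec, label))
    return data

def fit_centroids(data):
    """Nearest-centroid probe: one mean vector per label."""
    sums, counts = {}, {}
    for vec, label in data:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for j, x in enumerate(vec):
            acc[j] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc]
            for label, acc in sums.items()}

def predict(centroids, vec):
    """Return the label whose centroid is closest to vec."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(vec, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def accuracy(centroids, data):
    return sum(predict(centroids, v) == y for v, y in data) / len(data)

rng = random.Random(0)
lang_a = make_tokens(rng, 100, lang_offset=2.0)    # hypothetical training language
lang_b = make_tokens(rng, 100, lang_offset=-2.0)   # hypothetical transfer language

probe = fit_centroids(lang_a)          # trained on language A only
acc_same = accuracy(probe, lang_a)     # the "red dot" condition
acc_transfer = accuracy(probe, lang_b) # the "black dot" condition
```

If the role direction were not shared between languages, `acc_transfer` would fall toward chance while `acc_same` stayed high. The result described in the lecture is that, in real multilingual models, the two are interspersed.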

And UD relations, right, so universal dependencies, right, like the kind of like dependency relations, they're also encoded in parallel ways. This is work that John has done, right? Again, main thing to take from this example is that like the colors cluster together, right? So if you train kind of like a parser on or like, you know, parse classification in one language and kind of transfer it to another, you see these clusters form for the other language, right?

So it's like these ideas of how like things relate together, right? Like a kind of noun modifier, you know, all that kind of stuff. They do cluster together in these parallel ways across languages, you know? And so language specificity is also important. I might skip over this. But you know, it seems like maybe sometimes some languages are shoehorned into others in various ways.

And maybe part of this is that data quality, it's very variable in multilingual corpora, right? So if you take like all these multilingual corpora, there was like an audit of them. And like for like all these various like multilingual corpora, like 20% of languages, they're less than 50% correct, meaning like 50% of it was often like just links or like just something random.

So it might be labeled as like some language, but it was not that at all. And like maybe we don't want too much parameter sharing, right? Like AfriBERTa is a kind of recent BERT model trained like only on African languages, you know, maybe like having too many high-resource languages is like harming, you know, and there's work here at Stanford being done in the same direction, you know.

Another, yeah, another recent cross-lingual model, XLM-V, came out, which is like, why should we be doing vocabulary sharing? You know, like you just have like a big vocabulary. Each language gets like its own words. It's probably going to be better. And it is. It kind of like knocks out similar models with smaller vocabularies, which are like, maybe, you know, computer is the same in English and French.

It should be shared, you know. Maybe it's better to separate things, you know. It's like hard to like kind of find this balance between, let's skip over this paper too. It's very cool and there's a link there, so you should look at it. But yeah, we want language generality, but we also want to preserve diversity.

And so how is multilingual NLP doing, you know, especially with things like dialects? You know, there's so many complex issues for multilingual NLP to be dealing with. How can deep learning work for low resource languages? You know, what are the ethics of working in NLP for low resource languages?

Who like wants their language in big models? Who like wants their language to be translated? You know, these are all like very important ethical issues in multilingual NLP. And so after looking at structure, beyond structure, multilinguality in models, I hope you know that linguistics is a way of, you know, investigating what's going on in black box models.

The subtleties of linguistic analysis, they can help us understand what we want or expect from the models that we work with. And like even though we're not reverse engineering human language, linguistic insights, I hope I've convinced you they still have a place in understanding, you know, the models that we're working with, the models that we're dealing with.

And you know, and in so many more ways beyond what we've discussed here, you know, like language acquisition, language and vision, and like instructions and music, discourse, conversation and communication, and like so many other ways. Cool. Thank you. If there's any more questions, you can come ask me. Time's up.

Thank you.