Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 15 - Add Knowledge to Language Models
Chapters
0:00 Introduction
0:17 Reminders
2:24 Language Models
4:06 What Types of Facts a Language Model Might Know
6:36 Why Researchers Are Interested in Building Language Models That Can More Reliably Recall Knowledge
7:48 What is a Knowledge Base
9:57 Advantages of Using Language Models
12:41 Add Knowledge to Language Models
13:30 Pretrained Entity Embeddings
15:00 Entity Linking
17:49 Entity Embeddings
20:55 ERNIE
22:14 Architecture Diagram
23:26 Training
24:59 Ablation
25:40 Challenges
29:11 KnowBert
31:37 External Memory
33:45 KGLM
37:50 Related Entity Case
41:26 New Entity Diagram
44:20 Local Knowledge Graph
47:10 kNN-LM
52:31 Implicit Knowledge
53:17 WKLM
57:42 Masking
58:48 Comparing masking techniques
59:10 Results of the original paper
So I'm Megan and I'm one of the CAs in this course, and I'm also a PhD student working 00:00:12.200 |
And today I'll be talking about integrating knowledge into language models. 00:00:18.920 |
So some quick reminders, your project milestones were due today, so hopefully you turned those 00:00:22.240 |
in already or will be turning them in in the next couple of days, and we'll try to get 00:00:29.160 |
So something to be aware of is a change of grading basis and course withdrawal deadline 00:00:34.640 |
So if you want to make any change to your grade, make sure to do that by then. 00:00:37.920 |
And we'll be getting you the grades back on assignment five by then as well, in case that's 00:00:44.320 |
And finally, your final projects are due in two weeks. 00:00:49.380 |
So the topic of the day is integrating knowledge into language models. 00:00:52.440 |
You've seen a bit about this idea in assignment five, and also in Colin Raffel's lecture last week. 00:00:57.280 |
So in assignment five, the task was to train a model to predict the birthplace of a person 00:01:02.720 |
And you saw that by pre-training on a larger data set, you're actually able to do better 00:01:06.480 |
on this task, since you could encode some real knowledge into the language model. 00:01:11.680 |
And then last lecture, Colin Raffel presented how T5 could actually be fine-tuned for a 00:01:16.560 |
closed-book question answering task, such that you can give T5 a natural language question and it will generate the answer. 00:01:24.040 |
So today we'll be building on these threads and looking at techniques that researchers 00:01:27.000 |
have recently been developing to increase the amount of knowledge in language models. 00:01:32.760 |
So we're going to start with a quick recap of language models, just to make sure we're 00:01:36.280 |
Then we're going to talk about what types of knowledge language models can already encode 00:01:41.880 |
We'll also motivate why researchers are interested in increasing the amount of knowledge in language 00:01:46.680 |
models, and what this could enable for future AI systems if we have language models that 00:01:54.720 |
We'll talk about three broad classes of techniques that researchers have been using to add knowledge 00:01:59.720 |
These include adding pre-trained entity embeddings, using external memory or key value store, 00:02:07.280 |
And for each of these techniques, we'll talk about at least one recent work that used the 00:02:11.160 |
technique, so hopefully it's clear to see how to actually employ it in practice. 00:02:15.920 |
And then finally, we'll wrap up by talking about how to evaluate the knowledge in language 00:02:19.560 |
models and the challenges that come up in trying to do this. 00:02:25.960 |
We're going to start by talking about standard language models. 00:02:28.760 |
You learned about these at the beginning of the course. 00:02:31.040 |
And the task is to predict the next word in a sequence of text and to compute the probability 00:02:36.080 |
So you may remember the example that students opened their blank. 00:02:39.160 |
And we talked about it could be minds, exams, we're going to go with books here. 00:02:43.880 |
And the task of the standard language model is to predict the most likely next word in 00:02:49.000 |
A couple of lectures ago, John also introduced the notion of masked language models. 00:02:52.680 |
Instead of predicting the next word in a sequence of text, the task is to predict the masked token. 00:02:58.080 |
And this is done using bidirectional context. 00:02:59.960 |
So you may remember the example, "I [MASK] the [MASK]." 00:03:03.920 |
And the goal of the masked language model is to predict the most likely token for each of the masked positions. 00:03:11.980 |
So while there's some differences in these two types of language models, whether you're 00:03:15.040 |
predicting the next word, or whether you're predicting the masked out token, they're similar 00:03:19.280 |
in that they can both be trained over large amounts of unlabeled text. 00:03:23.560 |
And this is one of the reasons why they've been so widely adopted. 00:03:30.560 |
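To make the two training objectives concrete, here is a toy sketch in PyTorch; the model, vocabulary size, and hidden states are made-up stand-ins rather than anything from an actual system.

```python
import torch
import torch.nn.functional as F

# Toy setup: a vocabulary of 10 tokens and a 5-token sequence.
vocab_size, seq_len, hidden = 10, 5, 16
hidden_states = torch.randn(seq_len, hidden)       # stand-in for LSTM/Transformer outputs
output_proj = torch.nn.Linear(hidden, vocab_size)  # maps hidden states to vocabulary logits
logits = output_proj(hidden_states)                # (seq_len, vocab_size)

# Standard (left-to-right) LM: predict the next word from the last position's state.
next_word_dist = F.softmax(logits[-1], dim=-1)     # P(w_t | w_1, ..., w_{t-1})

# Masked LM: predict the token at a masked position; here the hidden states would
# come from a bidirectional encoder like BERT, so both sides of the mask are used.
masked_position = 2
masked_token_dist = F.softmax(logits[masked_position], dim=-1)
print(next_word_dist.shape, masked_token_dist.shape)
```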
So you've seen that language models can be used for a variety of tasks, from summarization 00:03:34.640 |
to dialogue to fluency evaluation, tasks that involve either generating text or evaluating the probability of text. 00:03:43.280 |
And more recently, we've seen that language models can also be used to generate pre-trained 00:03:46.800 |
representations of text that encode some notion of language understanding, and these have been shown 00:03:51.480 |
to be widely useful for different downstream NLP tasks. 00:03:56.320 |
And then finally, today we're going to touch on this idea that if language models are trained 00:04:00.560 |
over massive amounts of text, can they even be used as a knowledge base? 00:04:07.240 |
So we're going to start by looking at what types of factual knowledge a language model 00:04:11.920 |
And these examples are taken from a paper by Petroni et al. in EMNLP a couple years ago. 00:04:17.400 |
And the goal is to test the factual or common sense knowledge in existing language models using fill-in-the-blank statements like the following: 00:04:26.600 |
iPod Touch is produced by Apple, London Jazz Festival is located in London, Dani Alves 00:04:33.440 |
plays with Santos, Carl III used to communicate in German, and ravens can fly. 00:04:40.680 |
So here we have the correct predictions in green and the incorrect predictions in red. 00:04:44.040 |
And if you know anything about sports, you may know that Dani Alves is a soccer player, 00:04:50.640 |
Here they were hoping that it would predict Barcelona, because at least at the time of 00:04:53.840 |
this data set, apparently he played for Barcelona. 00:04:56.560 |
And Carl III actually used to communicate in Swedish, not German. 00:05:01.080 |
So what's good about these examples is the predictions are generally reasonable. 00:05:05.160 |
If you didn't know the ground truth, they all make sense. 00:05:08.240 |
When you want to predict a language, you do in fact predict the language. 00:05:13.880 |
But of course, they're not all factually correct. 00:05:18.840 |
So why might this be? Well, for one, the fact might not have been seen in training. 00:05:21.880 |
And you can't expect the language model to do more than recall facts that it has seen during training. 00:05:26.600 |
It can't make up facts about the world, for instance. 00:05:29.640 |
It's also possible the fact is just really rare. 00:05:31.940 |
So maybe the language model has seen the fact during training, but it hasn't seen it enough times to actually memorize it. 00:05:39.000 |
And the last issue is a little more subtle, which the model might just be very sensitive 00:05:42.640 |
to the phrasing of the fill in the blank statement. 00:05:45.960 |
And so for example, you might have statements like X was created in blank that the model can't complete correctly. 00:05:51.080 |
But if you change it to X was made in blank, suddenly it can predict it correctly. 00:05:56.080 |
And we'll come back to this and how to actually evaluate the knowledge in these language models. 00:06:02.600 |
So this inability to reliably recall knowledge is a key challenge facing language models 00:06:10.040 |
Recent works have found that language models can recover some knowledge, including the 00:06:18.560 |
But there's still a way to go, as we saw with the fill in the blank statements and with 00:06:21.880 |
these challenges that we just discussed above. 00:06:24.880 |
So as a result, the past couple of years have had a ton of rapid progress in this area of 00:06:28.880 |
research in terms of trying to figure out how do you actually encode more knowledge into language models. 00:06:37.720 |
So I also want to motivate why researchers are interested in building language models that can more reliably recall knowledge. 00:06:45.040 |
And one of these reasons is that the pre-trained representations are used in a variety of downstream tasks. 00:06:50.240 |
And some of these downstream tasks are knowledge intensive. 00:06:53.920 |
So for instance, you might have a downstream task to extract the relations between two entities in a sentence. 00:07:00.000 |
And this is commonly known as relation extraction. 00:07:02.440 |
And this is much easier if you have some knowledge of the entities, which could be potentially 00:07:06.920 |
provided by this pre-trained language model representation. 00:07:11.920 |
And when we talk about evaluation, we'll talk about what types of tasks are most likely 00:07:15.280 |
to benefit from these knowledge rich pre-trained representations. 00:07:20.840 |
And then as a stretch goal, some researchers are starting to propose the idea that maybe 00:07:25.000 |
language models could actually ultimately be used to replace traditional knowledge bases. 00:07:30.560 |
So instead of querying a knowledge base for a fact, like you might right now with SQL, 00:07:34.200 |
you'd query a language model with a natural language prompt. 00:07:37.040 |
And of course, this does require the language model to have high quality on recalling facts. 00:07:43.040 |
So we might not be there yet, but it's an interesting direction for us to be moving 00:07:49.320 |
So I want to make it super clear what I mean by a knowledge base. 00:07:52.360 |
Here we're just talking about a knowledge graph where the nodes in the graph would be 00:07:55.960 |
entities and the edges are going to be relations between the entities. 00:08:00.600 |
So for example, here we have a subset of a knowledge graph for Franklin D. Roosevelt, 00:08:04.880 |
and you see the information about his spouse, his place of birth, his date of birth, and so on. 00:08:09.920 |
An important thing to note is this is a structured way of storing the knowledge, since it's just entities connected by typed relations. 00:08:15.840 |
And you can actually describe these graphs with knowledge graph triples, which will be 00:08:19.880 |
an important vocabulary word throughout this talk. 00:08:22.840 |
So a knowledge graph triple would consist of a subject entity, a relation, and then an object entity. 00:08:30.160 |
So for instance, here we might have Franklin D. Roosevelt, date of birth, January 30th, 1882. 00:08:36.200 |
And that would form a knowledge graph triple. 00:08:37.600 |
We'll also refer to this as a parent entity, a relation, and a tail entity. 00:08:43.840 |
So Wikidata is one very popular knowledge base you might come across if you're working in this area. 00:08:48.100 |
It's a free knowledge base that's actually populated by humans, so they're filling in the facts themselves. 00:08:57.080 |
So if you want information from this knowledge base, what you do is you would write a SQL query. 00:09:02.340 |
This is a simplified one, but the idea is you'd want to figure out the date of birth 00:09:07.040 |
of Franklin Roosevelt, so you would write a query like the following. 00:09:12.320 |
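As a rough illustration of what such a structured query looks like, here is an in-memory SQLite table of triples used as a stand-in for a real knowledge base (Wikidata itself is actually queried with SPARQL rather than SQL; everything below is simplified for illustration).

```python
import sqlite3

# Toy triples table standing in for a structured knowledge base.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, relation TEXT, object TEXT)")
conn.execute("INSERT INTO triples VALUES (?, ?, ?)",
             ("Franklin D. Roosevelt", "date_of_birth", "1882-01-30"))

# Structured query: exact match on subject and relation, return the object.
row = conn.execute(
    "SELECT object FROM triples WHERE subject = ? AND relation = ?",
    ("Franklin D. Roosevelt", "date_of_birth"),
).fetchone()
print(row[0])  # 1882-01-30
```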
Now if instead you want to create a language model as a knowledge base, you'll have something 00:09:16.240 |
like this diagram that you've actually probably seen in several lectures now. 00:09:20.540 |
And the idea is you'll train a language model over this unstructured text, and then you'll 00:09:25.240 |
use a language model to just answer these natural language query statements. 00:09:29.980 |
So here, this is the work on T5, where they're training T5 over natural language or just 00:09:36.060 |
unstructured text with a span corruption task. 00:09:39.020 |
And then they're asking T5, when was Franklin D. Roosevelt born? 00:09:42.900 |
And the idea is T5 will produce a textual answer. 00:09:46.440 |
So you can see this contrast very much with the old approach of using a traditional knowledge 00:09:50.080 |
base, where the knowledge base is structured, and you have these SQL statements to query 00:09:58.160 |
So what are the advantages of using language models over traditional knowledge bases, and 00:10:01.760 |
why might people think this could be a good idea? 00:10:04.000 |
Well, for one, the language models are pre-trained over large amounts of unstructured and unlabeled text. 00:10:10.280 |
Whereas traditional knowledge bases require manual annotation, like with wiki data, people 00:10:14.320 |
actually are populating it, or complex NLP pipelines to extract from unstructured text 00:10:20.080 |
into a structured form that forms a knowledge base. 00:10:24.720 |
Language models can also support more flexible natural language queries. 00:10:28.780 |
So if we take the example, what does the final F in the song UFOF stand for? 00:10:34.280 |
A knowledge base probably won't have a field for final F, so it won't be able to answer this query. 00:10:39.160 |
But there's a chance that a language model could actually learn and have a response for it. 00:10:46.400 |
They also had a less extreme example in this paper by Petroni and others, where maybe your 00:10:50.800 |
relation would be "works for" in your knowledge base, and then you ask for "is working for". 00:10:56.760 |
And the knowledge base doesn't have an exact match in the field, and so it returns an empty result. 00:11:01.480 |
And it's reasonable to believe that your language model could figure out that 00:11:06.040 |
these relations are similar, so if I know the answer to one of them, I probably know the answer to the other. 00:11:15.200 |
There's also many open challenges to using language models as knowledge bases. 00:11:21.560 |
When a traditional knowledge base produces an answer, there's actually provenance information 00:11:24.960 |
associated with why it returned that particular answer. 00:11:28.440 |
But with a language model, it's really not clear why it might produce a prediction. 00:11:34.320 |
The knowledge is just encoded in the parameters of the model. 00:11:39.380 |
So you saw this in assignment 5, where the language model could produce realistic predictions even when they weren't factually correct. 00:11:46.480 |
So it's not easy to know when the language model actually knows the fact, versus it's 00:11:52.880 |
just making a plausible-sounding guess. And in the case of the traditional knowledge base, if it doesn't know a fact, it will just return an empty result. 00:12:00.240 |
And then finally, language models are harder to modify. 00:12:05.320 |
So in a knowledge base, if you want to update a fact, you just change the fact directly in the knowledge base. 00:12:11.960 |
But in a language model, it's not quite clear how you would do this. 00:12:14.940 |
You could fine-tune the model longer on the updated data, but how do you know if it still remembers the old, outdated fact? 00:12:23.440 |
So there are a lot of open challenges to this goal of actually using language models as knowledge bases. 00:12:29.360 |
But hopefully you see why some people think this could actually be a good idea, and why 00:12:33.840 |
researchers are interested in training language models that can actually integrate more knowledge. 00:12:43.720 |
So I want to pause here just in case there's any questions. 00:12:53.720 |
So now we're going to be talking about what techniques researchers are using to actually add knowledge to language models. 00:13:03.900 |
So we're going to talk about three broad classes of techniques. 00:13:06.600 |
This is by no means exhaustive, but hopefully it gives you a good overview so that if you 00:13:13.700 |
So we'll start by talking about adding pre-trained entity embeddings. 00:13:17.120 |
And for each section, we'll kind of focus on the first work that you see in the bullets. 00:13:20.760 |
But we'll also talk about briefly some of the variants so you see how the works within 00:13:25.860 |
each class can differ and what knobs you can turn. 00:13:31.680 |
So for adding pre-trained embeddings, we first need to figure out what pre-trained embeddings 00:13:35.940 |
would actually be the most useful to add knowledge to language models. 00:13:39.820 |
And this can start with an observation that facts about the world are usually in terms of entities. 00:13:45.320 |
So if we have a fact like Washington was the first president of the United States, we have the entities Washington and the United States. 00:13:53.120 |
But pre-trained word embeddings don't have this notion of entities. 00:13:57.120 |
So we'd have different word embeddings for USA, United States of America, and America, 00:14:02.120 |
even though these all refer to the same entity. 00:14:05.140 |
And this makes it challenging for the language model to actually learn any representations 00:14:08.760 |
over these entities, since they may be referred to many ways in the text. 00:14:15.700 |
So what if instead we have a single embedding per entity, and we'll refer to these as entity embeddings. 00:14:22.640 |
So now you'd have a single entity embedding for USA, United States of America, and America. 00:14:28.880 |
And whenever you see a phrase in text referring to this entity, you would use the same entity 00:14:34.440 |
And these entity embeddings can actually be pre-trained to encode this factual knowledge about the world. 00:14:41.380 |
And this first class of techniques we'll be looking at will be how do you actually best use these 00:14:45.000 |
pre-trained entity embeddings in a language model. 00:14:50.440 |
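As a tiny illustration of that lookup, here is a toy sketch where all surface forms of an entity share one embedding; the alias table and entity vocabulary are invented, not from any real knowledge base.

```python
import torch
import torch.nn as nn

# One embedding per ENTITY, shared across all the ways that entity is written in text.
entity_to_id = {"United States": 0, "George Washington": 1}
alias_to_entity = {"USA": "United States",
                   "United States of America": "United States",
                   "America": "United States"}
entity_embeddings = nn.Embedding(len(entity_to_id), 100)

def embed_mention(mention):
    entity = alias_to_entity.get(mention, mention)  # every alias maps to the same entity
    return entity_embeddings(torch.tensor(entity_to_id[entity]))

# Different surface forms, identical entity embedding.
assert torch.equal(embed_mention("USA"), embed_mention("America"))
```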
So I need to make a quick note that these entity embeddings are only useful to language 00:14:54.480 |
models though, if you can do another NLP task called entity linking well. 00:15:00.600 |
So I'm going to take a quick aside and explain what is entity linking. 00:15:05.280 |
So a definition of entity linking is to link mentions in text to entities in a knowledge base. 00:15:10.420 |
I like to think about this in terms of how you use word embeddings. 00:15:14.260 |
So if you want to use word embeddings and you have a sentence, you're going to first tokenize the sentence. 00:15:19.440 |
And then for each word, you're going to look up their corresponding ID in some word embedding dictionary. 00:15:25.040 |
Well, for entity embeddings, the dictionary lookup isn't so easy. 00:15:30.080 |
You might have sentences like Washington is the first president of the United States. 00:15:34.080 |
Well, Washington has two different candidates: it could link to George Washington or to Washington State. 00:15:40.560 |
And these are different entities that have different entity embeddings. 00:15:44.240 |
And the QIDs here would just be their identifiers in Wikidata. 00:15:49.480 |
And then United States just has a single entity. 00:15:52.560 |
So task of entity linking is to figure out correctly these ambiguous mentions, what entities 00:15:57.240 |
do they actually link to in a knowledge base? 00:16:00.360 |
And there's many different ways you can do this entity linking. 00:16:03.620 |
So one way you might be able to do this is to figure out that, oh, I see the word president in the context. 00:16:07.880 |
So Washington probably links to George Washington. 00:16:11.940 |
Just some more definitions: we're going to refer to Washington as a mention, and United States is also a mention. 00:16:16.940 |
And then the things that the mention could link to, so the two options for Washington, are called the candidates. 00:16:25.840 |
And I encourage you to check out the resources at the bottom if you're interested in learning more about entity linking. 00:16:29.920 |
But right now, the most important thing to understand is that entity linking is what 00:16:33.360 |
is going to tell us which entity embeddings are actually relevant to the text and which 00:16:36.960 |
ones you want to use as you iterate through a sequence. 00:16:40.520 |
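As a very rough sketch of what an entity linker has to do, here is a toy heuristic linker; the candidate table and the per-entity context words are invented for illustration, and real linkers (see the resources mentioned above) use learned models instead.

```python
# Toy entity linker: map each mention to candidate entities, then score candidates
# by how well words associated with the entity overlap the sentence context.
candidates = {
    "Washington": ["George Washington", "Washington State"],
    "United States": ["United States"],
}
# Hypothetical context words we associate with each candidate entity.
entity_context_words = {
    "George Washington": {"president", "general", "founding"},
    "Washington State": {"state", "seattle", "pacific"},
    "United States": {"country", "president", "america"},
}

def link(mention, sentence):
    context = set(sentence.lower().split())
    # Pick the candidate whose associated words overlap the sentence the most.
    return max(candidates[mention],
               key=lambda e: len(entity_context_words[e] & context))

sentence = "Washington was the first president of the United States"
print(link("Washington", sentence))  # George Washington
```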
And Megan, there are a few questions around here. 00:16:46.040 |
One of them is, so that's entity linking, but what about the relations? 00:16:51.280 |
Yeah, so some of the works we'll talk about will only use the entity embeddings. 00:16:57.300 |
So some of these have been pre-trained with relation information, but in the end, you just use the entity embeddings. 00:17:04.000 |
So relation extraction is yet another NLP task that you could also do. 00:17:06.680 |
But yeah, here we're just talking about entity linking. 00:17:09.600 |
And if you have the knowledge graph you showed earlier, it had relations in it, right? 00:17:14.360 |
Do you get any connection between that and the text? 00:17:20.400 |
I mean, that's the goal of relation extraction, right? 00:17:23.080 |
It's to figure out, like, given the entities, what is the relation between them, which would 00:17:27.160 |
then form the full triple of head entity, tail entity, and relation. 00:17:35.240 |
Okay, then I think people want to know more about how this is going to be used, but maybe 00:17:49.760 |
So entity embeddings, just to summarize, they're like word embeddings, but they're for entities in a knowledge base. 00:17:54.760 |
So you'll have some vector associated with George Washington, and it should be meaningful 00:17:58.560 |
in embedding space such that maybe the George Washington vector is close to the vectors of other related entities. 00:18:05.760 |
So we're going to briefly talk about some methods for training entity embeddings. 00:18:11.160 |
You might have heard of the TransE embedding method. 00:18:13.200 |
So this starts from the idea of having these knowledge graph triples, and you want to learn 00:18:17.320 |
pre-trained entity and pre-trained relation embeddings. 00:18:20.280 |
And you want it to be the case that the subject embedding and the relation embedding, the 00:18:24.040 |
sum of those two, is close to the object embedding in vector space. 00:18:28.120 |
So it's an algorithm to learn that constraint. 00:18:31.480 |
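A minimal sketch of that TransE constraint as a margin-based training loss, with arbitrary dimensions, toy IDs, and randomly corrupted tails (this is an illustration of the idea, not the original implementation):

```python
import torch
import torch.nn as nn

num_entities, num_relations, dim = 1000, 50, 64
entity_emb = nn.Embedding(num_entities, dim)
relation_emb = nn.Embedding(num_relations, dim)

def transe_loss(head, rel, tail, neg_tail, margin=1.0):
    # TransE constraint: head + relation should be close to tail in vector space.
    h, r = entity_emb(head), relation_emb(rel)
    pos_dist = (h + r - entity_emb(tail)).norm(p=2, dim=-1)
    neg_dist = (h + r - entity_emb(neg_tail)).norm(p=2, dim=-1)
    # Margin ranking: push true triples closer than corrupted ones.
    return torch.relu(margin + pos_dist - neg_dist).mean()

# One toy batch of (head, relation, tail) ids plus a corrupted tail.
head, rel = torch.tensor([3]), torch.tensor([7])
tail, neg_tail = torch.tensor([42]), torch.tensor([99])
loss = transe_loss(head, rel, tail, neg_tail)
loss.backward()  # gradients flow into the entity and relation embeddings
```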
There's also word entity co-occurrence methods. 00:18:37.480 |
And the idea is given an entity, you want to figure out what words are most likely to co-occur around it. 00:18:44.360 |
And then the last method, or one of the other methods that is common now, is actually just 00:18:48.760 |
using the transformer to learn representations of an entity by encoding the entity description. 00:18:54.260 |
And so BLINK from Facebook is an approach that does this. 00:18:58.000 |
So the methods we'll talk about today are actually agnostic to how you train your pre-trained 00:19:02.800 |
But I think it's important to know that there's actually a wide variety of methods to train 00:19:08.760 |
And it's actually not clear which method is best for using them downstream in language 00:19:16.600 |
So one of the key challenges of using pre-trained entity embeddings in language models is figuring 00:19:20.600 |
out how to incorporate them when they're from a different embedding space than the language model's. 00:19:25.720 |
And so the approach we'll look at today is to learn a fusion layer 00:19:29.960 |
to combine this context and entity information. 00:19:32.960 |
So we have entity embeddings and we have the contextualized word embeddings from our language model. 00:19:39.280 |
So if we take a sequence of text and we imagine that j indicates the jth element in a sequence, 00:19:45.240 |
then the challenge here is you want to figure out how do we combine some word embedding with its aligned entity embedding. 00:19:52.640 |
So here an alignment could be like in the example where we had Washington was the first president of the United States. 00:19:58.120 |
Washington would be your word embedding and George Washington would be the aligned entity 00:20:02.720 |
So you could imagine in this case, let's say your wj is Washington and your ek is your George Washington entity embedding. 00:20:11.800 |
So what you can do is learn a weight matrix wt for the text and we for the entity to project 00:20:18.080 |
these embeddings to the same dimension before you sum them and finally take an activation function over the sum. 00:20:25.140 |
So the idea is that by having some fusion layer mechanism like this, you can actually 00:20:30.300 |
use these entity embeddings and these contextual word embeddings that are in different embedding 00:20:34.620 |
spaces and fuse them together to have this single hidden representation for the element in the sequence. 00:20:44.080 |
So the approaches we'll talk about today all have some mechanism either very similar to 00:20:48.080 |
this or some variation of this to do this combination of the context and entity information. 00:20:55.840 |
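Here is a minimal sketch of such a fusion layer; the dimensions, activation, and the handling of tokens without a linked entity are illustrative choices rather than any specific paper's exact parameterization.

```python
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    """Combine a contextual word embedding w_j with an aligned entity embedding e_k:
    h_j = activation(W_t @ w_j + W_e @ e_k + b)."""
    def __init__(self, word_dim=768, entity_dim=100, hidden_dim=768):
        super().__init__()
        self.w_t = nn.Linear(word_dim, hidden_dim)                 # projects the word embedding
        self.w_e = nn.Linear(entity_dim, hidden_dim, bias=False)   # projects the entity embedding
        self.act = nn.GELU()

    def forward(self, w_j, e_k=None):
        out = self.w_t(w_j)
        if e_k is not None:  # tokens without a linked entity just keep the word term + bias
            out = out + self.w_e(e_k)
        return self.act(out)

fusion = FusionLayer()
h_j = fusion(torch.randn(768), torch.randn(100))  # e.g. "Washington" + George Washington embedding
```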
So the first approach we're going to talk about is called ERNIE, Enhanced Language Representation with Informative Entities. 00:21:01.600 |
And so this just builds on what we've already talked about. 00:21:03.760 |
It uses pre-trained entity embeddings and it also uses this notion of a fusion layer. 00:21:09.560 |
So the first block in ERNIE is a text encoder, which is a multilayer bidirectional transformer encoder. 00:21:15.840 |
For their experiments, they use BERT, but it doesn't have to be BERT. 00:21:20.400 |
And this is followed by a knowledge encoder, which has stacked blocks composed of two multi-headed attentions. 00:21:26.340 |
One is over the entity embeddings and one is over your token or subword embeddings. 00:21:31.920 |
And then the output of these contextualized entity and token embeddings from the multi-headed 00:21:35.760 |
attentions are passed to a fusion layer, which looks very similar to what we just looked at. 00:21:42.160 |
But now you also have new word and entity embeddings that you're producing as output of the fusion layer. 00:21:47.880 |
So you see this WJ and this EK, which are produced as the next layer of word and entity embeddings. 00:21:55.920 |
So the I here indicates that it's the Ith block in the knowledge encoder. 00:22:00.480 |
So you'll actually have multiple stacks of these knowledge encoders and you'll be doing 00:22:04.400 |
a fusion of the word and entity embedding, producing new word and entity embeddings, 00:22:08.800 |
and then passing this to the next block of the knowledge encoder. 00:22:14.960 |
So this is what the architecture diagram looks like. 00:22:17.200 |
On the left side, we have the T encoder or the text encoder, followed by the K encoder or the knowledge encoder. 00:22:23.120 |
And then on the right side, we have a zoomed in version of your knowledge encoder. 00:22:27.680 |
So you see the multi-headed attentions over the tokens in orange, and then over the entities 00:22:32.640 |
And then you have this alignment between the word and entities with the dashed lines. 00:22:37.600 |
So they have this example as Bob Dylan wrote "blowing in the wind" in 1962. 00:22:42.360 |
The entities here are Bob Dylan and "blowing in the wind." 00:22:45.880 |
And they have a simple alignment rule where you want to align the entity to the first token of its mention. 00:22:50.680 |
So you want to align Bob Dylan to Bob, that's what the dashed lines try to indicate, and 00:22:55.720 |
you want to align "blowing in the wind" to "blow." 00:22:59.200 |
So here, this already assumes that entity linking has been done, and you know your entities in advance. 00:23:03.640 |
So you can see that the entities are actually input into the model. 00:23:08.300 |
So after you have your word and entity alignment, this goes through the information fusion layer 00:23:14.600 |
And then finally, it produces these new word entity embeddings as output. 00:23:18.800 |
And then remember that you have multiple blocks of these, so those will be passed into the next block of the knowledge encoder. 00:23:29.600 |
So for training, you have a masked language model loss and a next sentence prediction loss, just like BERT. 00:23:34.000 |
And then they also introduce a knowledge pre-training task, which they refer to as the dEA task. 00:23:39.240 |
It's named after a denoising entity autoencoder from an ICML paper in 2008. 00:23:45.880 |
And the idea is they're going to randomly mask these token entity alignments. 00:23:49.240 |
So the idea that Bob goes to Bob Dylan, they're going to mask that out with some random percentage. 00:23:54.840 |
And then they're going to predict the corresponding entity for a token out of the entities in the sequence. 00:24:02.480 |
The summation is over m entities in the sequence. 00:24:05.180 |
So this would be over Bob Dylan and blowing in the wind in the previous example. 00:24:09.780 |
And given a particular word, they want to figure out what entity is it most likely to align to. 00:24:15.600 |
So does Bob align to Bob Dylan, or does Bob align to blowing in the wind? 00:24:21.240 |
And their motivation for doing this is that if you don't have this task, all you're ever 00:24:24.960 |
going to be predicting is a token with the masked language model loss. 00:24:28.940 |
And you really, to encode knowledge, should also probably be predicting over entities. 00:24:33.180 |
So by adding this task, they have some kind of task that is actually predicting the entity. 00:24:38.480 |
And they also suggest that this might better fuse the knowledge or the entity and the word 00:24:43.280 |
representations than just using the fusion layer. 00:24:48.200 |
Their final loss is then the summation of the masked language model loss, the next sentence 00:24:52.360 |
prediction loss, and this dEA knowledge pre-training task loss. 00:24:59.940 |
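As a rough sketch of how the dEA term and the combined objective could look, with made-up shapes and assuming the token representations and entity embeddings have already been projected to the same dimension (the real ERNIE implementation differs in its details):

```python
import torch
import torch.nn.functional as F

def dea_loss(token_hidden, entity_embs, alignment):
    """token_hidden: (seq_len, d) token representations from the knowledge encoder.
    entity_embs: (m, d) embeddings of the m entities mentioned in this sequence.
    alignment: (seq_len,) long tensor with the aligned entity index per token, -100 if none."""
    # Score each token against the entities in this sequence and predict the aligned one.
    logits = token_hidden @ entity_embs.T            # (seq_len, m)
    return F.cross_entropy(logits, alignment, ignore_index=-100)

def ernie_loss(mlm_loss, nsp_loss, token_hidden, entity_embs, alignment):
    # Total pre-training objective: masked LM + next sentence prediction + dEA.
    return mlm_loss + nsp_loss + dea_loss(token_hidden, entity_embs, alignment)

# Toy usage: 6 tokens, 2 entities in the sequence, hidden size 8.
tok, ents = torch.randn(6, 8), torch.randn(2, 8)
align = torch.tensor([0, -100, -100, -100, 1, -100])  # token 0 -> entity 0, token 4 -> entity 1
loss = ernie_loss(torch.tensor(2.1), torch.tensor(0.3), tok, ents, align)
```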
So they show in an ablation experiment that it's actually very important to have this knowledge pre-training task. 00:25:04.840 |
So this has BERT on the leftmost bar, ERNIE as the second bar from the left. 00:25:10.480 |
And so that's with all the features of ERNIE. 00:25:12.480 |
And then they try removing the pre-trained entity embeddings and removing this knowledge pre-training task. 00:25:19.860 |
This isn't very surprising, in that ERNIE performs the best. 00:25:22.920 |
But what's interesting is that if you remove the entity embeddings or you remove the pre-training 00:25:26.640 |
task, they only do a little better than BERT. 00:25:30.920 |
And so it's really necessary to actually use this pre-training task to get the most use out of the pre-trained entity embeddings. 00:25:41.040 |
So some strengths of this work were that they introduced some way to combine entity and 00:25:44.320 |
context information through this fusion layer and this knowledge pre-training task. 00:25:49.200 |
And then they also show improved performance on downstream tasks, which we'll come back to later. 00:25:55.720 |
But of course, there's also some limitations. 00:25:58.320 |
So it needs text data with the entities annotated as input. 00:26:03.080 |
So if you remember on the architecture diagram, we had the entity information actually input to the model. 00:26:10.240 |
But it's not very realistic that you're necessarily going to have a good entity linker for any 00:26:14.140 |
downstream tasks that you want to use ERNIE on. 00:26:18.280 |
And the next challenge is this requires more pre-training of your language model. 00:26:21.660 |
So now you don't just need to pre-train BERT, but you also need to pre-train your knowledge encoder. 00:26:27.320 |
For the first challenge, we're going to actually talk about a work that presents a solution to it. 00:26:31.240 |
For the second challenge, I encourage you to check out the footnote on the bottom. 00:26:35.420 |
This introduces a work that actually uses pre-trained entity embeddings, uses them in 00:26:39.920 |
a language model, and doesn't require any more pre-training. 00:26:55.400 |
So on the fusion layer, it observed that passing the entity embedding into a fusion layer to 00:27:01.640 |
combine with word embedding is more powerful than just concatenating the entity embedding 00:27:06.500 |
onto the end of the word embedding? 00:27:08.880 |
Yeah, so I guess people are still a little bit confused as to the motivation for that 00:27:15.700 |
And so I guess here it's this, the simplest strategy would be, since you've got the entity 00:27:20.680 |
linking, you could just concatenate entity embeddings onto the end of word embeddings 00:27:25.700 |
and do regular BERT, and wouldn't that work just as well? 00:27:33.080 |
I think the idea is that it wouldn't, because if you imagine that, let's say your magnitudes 00:27:37.720 |
are very different, you need some way to, I guess, align the spaces so that anything 00:27:43.760 |
meaningful in the entity embedding space is still meaningful in the word embedding space. 00:27:47.360 |
So if you're close in the word embedding space, you also would be, you'd want to be close 00:28:01.560 |
I mean, I think the question isn't, you know, it's a good question as people say. 00:28:05.640 |
I mean, it's not completely obvious that it wouldn't work to do that. 00:28:10.120 |
It seems like one of the potential problems is some words have entity links to them and others don't. 00:28:18.100 |
And so you, then you'd sort of have zero vectors for the ones that don't have anything linked. 00:28:27.760 |
In this case, when they don't have entities linked, which is a great point. 00:28:33.720 |
The first equation just simplifies to the first term plus the bias. 00:28:37.640 |
So like there's an obvious solution in that case when you're not concatenating that you 00:29:11.700 |
So the next approach we're going to look at is called KnowBert. And this is from the same folks that introduced the ELMo work. 00:29:15.240 |
And the idea here is that they're going to pre-train an integrated entity linker as an extension of BERT. 00:29:23.580 |
And so their loss function will now be the summation of the next sentence prediction, 00:29:28.140 |
the masked language model loss, and this entity linking loss. 00:29:30.480 |
So instead of the knowledge pre-training dEA task from ERNIE, we'll have an entity linking loss. 00:29:35.660 |
And the idea of the entity linker is you'll now have just a normal sequence as input, 00:29:41.200 |
and the integrated entity linker will figure out what are the entities in the sentence 00:29:45.020 |
and or what are the mentions in the sentence, what are the candidates of those mentions, 00:29:49.940 |
and then what should be the scores of those entities or the candidates given the context of the sentence. 00:29:55.620 |
And so this is all done now as part of the model rather than requiring it as some external 00:29:59.980 |
pipeline stage before you could even use ERNIE, for instance. 00:30:03.960 |
So now for downstream tasks, you no longer need these entity annotations. 00:30:07.020 |
Your integrated entity linker will figure out what the correct entity is and be able to use its embedding. 00:30:14.520 |
So there's also this idea that learning this entity linking may actually better encode 00:30:17.780 |
knowledge than this dEA pre-training task, because they show that KnowBert actually outperforms ERNIE on downstream tasks. 00:30:25.200 |
So one reason this may occur is that if you think about the dEA task, it's actually a relatively easy task. 00:30:32.140 |
So you're trying to predict, for instance, what Bob linked to out of Bob Dylan and Blowing 00:30:36.740 |
in the Wind, and it's much easier even as a human to see that Bob will more likely link 00:30:41.340 |
to Bob Dylan than that Bob will link to Blowing in the Wind. 00:30:46.580 |
And in the entity linking task, you actually have a much harder set of candidates to predict over. 00:30:50.820 |
You're not just looking at the ones in the sentence. 00:30:52.580 |
So deciding whether Washington links to George Washington or Washington State actually requires you to use more context and knowledge. 00:30:59.900 |
So given it's a harder task, it's not too surprising that it might perform better than 00:31:04.800 |
just this easier knowledge pre-training task that ERNIE introduced. 00:31:10.260 |
So otherwise, KnowBert has a lot of similarities to ERNIE. 00:31:12.860 |
It uses a fusion layer that combines this context and entity information, and it introduces an additional knowledge pre-training task. 00:31:19.840 |
So I'd say a high-level takeaway is if you want to use pre-trained entity embeddings 00:31:22.640 |
in a language model, you'll probably at least want to consider both of these components 00:31:27.140 |
in terms of how to actually integrate the pre-trained entity embeddings and take the 00:31:31.660 |
most advantage of the knowledge in them as possible. 00:31:37.500 |
So that brings us to the next class of techniques, which is using an external memory. 00:31:43.100 |
And here we'll mainly focus on this work called KGLM, and then we'll also briefly talk about the kNN-LM. 00:31:49.940 |
So the previous methods that we've talked about have relied on pre-trained entity embeddings 00:31:53.500 |
to encode the factual knowledge from knowledge bases. 00:31:57.220 |
And the one problem with this, or one of the problems with this, is if you want to, let's 00:32:01.100 |
say, modify your knowledge base, you now need to retrain your entity embeddings and then 00:32:05.220 |
retrain your language model on top of those entity embeddings. 00:32:08.880 |
So this begs the question, are there more direct ways than pre-trained entity embeddings to provide the model factual knowledge? 00:32:17.140 |
And so what we're going to talk about is how you can actually use an external memory or 00:32:20.260 |
a key value store to give the model access to either knowledge graph triples or context information. 00:32:26.220 |
And a key thing about this external memory is that it's independent of the learned model parameters. 00:32:33.100 |
So this means you can actually support injecting and updating factual knowledge. 00:32:37.080 |
You can do this directly to the symbolic external memory by, let's say, changing the value for 00:32:41.220 |
a particular key or maybe adding another key. 00:32:44.740 |
And you don't have to pre-train or retrain your entity embeddings when you make this 00:32:50.020 |
And the approaches we'll talk about today can actually even have these updates to the 00:32:54.300 |
external memory without more pre-training of the language model. 00:33:01.060 |
And then another benefit of using external memory over these pre-trained entity embedding 00:33:04.740 |
approaches is it can also be more interpretable. 00:33:07.980 |
So if you have an error in your model where it's not predicting a correct fact, it's 00:33:14.700 |
very challenging to figure out with pre-trained entity embeddings what the problem might be. 00:33:20.820 |
Was it the encoding in the entity embeddings? 00:33:22.500 |
Is it how the language model is using the entity embeddings? 00:33:25.420 |
And here you have a little more information with an external memory in that you can look 00:33:29.260 |
in the external memory and see, was the fact in the external memory? 00:33:35.900 |
So it adds a little bit more interpretability than just using these pre-trained entity embeddings 00:33:40.380 |
as an indirect way to encode the knowledge base. 00:33:45.940 |
So the first work we're going to talk about is called KGLM. 00:33:48.660 |
And unlike the other approaches we've talked about so far, this actually uses LSTMs rather than transformers. 00:33:55.820 |
So the key idea here is to condition the language model on a knowledge graph. 00:34:00.940 |
So recall with the standard language model, we want to predict the next word given the previous words in the sequence. 00:34:07.420 |
So now we also want to predict the next entity given the previous words in the sequence and 00:34:11.540 |
given the previous entities in the sentence, or the entities that are relevant to the sentence, 00:34:17.540 |
So KGLM will be building a local knowledge graph as it iterates over the sequence. 00:34:24.500 |
And a local knowledge graph is just a subset of a full knowledge graph that only has the 00:34:28.260 |
entities that are actually relevant to the sequence. 00:34:32.240 |
So if we have this example here, a simplified example from the paper, that Super Mario Land is a game developed by Nintendo. 00:34:43.160 |
You'd want a local knowledge graph as follows, where you see that Super Mario Land is in 00:34:47.040 |
the local knowledge graph, but we also have the relations to Super Mario Land to other 00:34:51.240 |
entities that are copied from the full knowledge graph into this local knowledge graph. 00:34:56.440 |
And you would build up this local knowledge graph as you iterate over the sentence. 00:34:59.560 |
So whenever you see an entity, you would add it to the local knowledge graph as well as its relations to other entities. 00:35:06.500 |
So obviously this is a much smaller example than what would really have all the relations 00:35:10.920 |
to Super Mario Land, just for the purpose of the example. 00:35:14.080 |
But hopefully it's clear that all of these are relevant to the sequence. 00:35:20.000 |
Something important to note here is that this does assume that the entities are known during 00:35:23.240 |
training so that you do have this entity annotated data for training, and therefore your local 00:35:27.800 |
knowledge graph is always the ground truth local knowledge graph as you iterate over the sequence. 00:35:35.960 |
So why is this local knowledge graph useful? Well, here, the next word you want to predict is Nintendo. 00:35:39.640 |
And you may notice that Nintendo is in your local knowledge graph. 00:35:43.120 |
So sometimes this local knowledge graph can actually serve as a very strong signal for predicting the next word. 00:35:49.400 |
Now, you may be thinking, well, this wouldn't always be helpful. 00:35:55.640 |
So if you look at just, like, the third word in the sequence and you want to predict that 00:35:58.640 |
word, so "is" or "a" or "game", for instance, well, if this isn't in the local knowledge graph, the graph isn't going to help you. 00:36:06.840 |
You would just do a standard language model prediction. 00:36:10.320 |
Or if you're at the beginning of the sequence, your local knowledge graph is empty. 00:36:13.980 |
So of course, you're not going to get any signal from it. 00:36:16.900 |
So the first question they ask in KGLM is how can a language model know when to use 00:36:21.400 |
a local knowledge graph and when it might actually be useful for predicting the next word. 00:36:26.560 |
So we're going to keep the same example as a running example. 00:36:34.200 |
We now have an LSTM that looks similar to the representations you've seen throughout the course. 00:36:38.600 |
And normally, you've seen the LSTM predicts the next word. 00:36:41.320 |
Well, now we're also going to use the LSTM to predict the next type of the word. 00:36:46.920 |
So is the next word going to be a related entity, meaning it's in the local knowledge graph? 00:36:51.680 |
Is it going to be a new entity, meaning it's not in the local knowledge graph? 00:36:56.040 |
Or is it going to be not an entity, in which case you just revert to a normal LSTM prediction? 00:37:02.080 |
And they're going to use the LSTM hidden state to do this prediction of the type of the next 00:37:05.600 |
word over these three different classes that they might want to consider. 00:37:11.640 |
So in the case of Super Mario Land as a game developed by Nintendo, we saw that this would 00:37:15.880 |
be a related entity case because we saw that Nintendo was in the local knowledge graph. 00:37:20.680 |
For the other cases, Super Mario Land would be a new entity case, since the local knowledge graph is still empty at that point. 00:37:27.960 |
And then any of the words between Super Mario Land and Nintendo would be non-entity, as they're 00:37:33.240 |
just a standard LSTM language model prediction that doesn't involve any entities. 00:37:40.360 |
So now we need to talk about what the language model actually does in these three different 00:37:43.800 |
scenarios to predict the next entity and the next word. 00:37:51.200 |
So we're going to keep the example up at the top in case you want to refer back to it. 00:37:54.680 |
And we're going to start with the related entity case. 00:37:59.200 |
So here we assume that the next word or entity is actually in your local knowledge graph. 00:38:04.040 |
And remember that we can describe a knowledge graph in terms of triples, so in terms of 00:38:08.160 |
pairs of parent entities, relations, and tail entities. 00:38:11.640 |
And in the case of predicting the next word as Nintendo, there's only one possible parent 00:38:17.320 |
entity in the local knowledge graph, which is Super Mario Land. 00:38:21.320 |
And the goal is you want to figure out what is the most relevant triple that will be useful for predicting the next word. 00:38:28.280 |
So in this case, you could have the triple Super Mario Land publisher Nintendo. 00:38:32.420 |
You might have the triple Super Mario Land genre platform game. 00:38:35.680 |
Which of these is actually helpful in predicting that Nintendo should be the next word? 00:38:40.840 |
So here, what you would want KGLM to do is predict that the top scoring parent entity 00:38:45.440 |
is Super Mario Land, and the top scoring relation is publisher. 00:38:49.080 |
And you can see there are actually contextual cues in the sentence that could help you figure this out. 00:38:56.720 |
And then given that your top scoring parent entity is Super Mario Land, and your top scoring 00:39:00.480 |
relation is publisher, you can figure out, using knowledge graph triples, that the tail entity is Nintendo. 00:39:07.680 |
And therefore, this gives you a strong signal that the next word will be Nintendo. 00:39:15.260 |
So the goal is you're going to find the top scoring parent entity and the top scoring 00:39:18.160 |
relation using the nodes in your local knowledge graph. 00:39:20.800 |
And you can do this by using the LSTM hidden state combined with pre-trained entity and relation embeddings. 00:39:26.080 |
So I do admit I cheated here a little bit in that this does use pre-trained embeddings. 00:39:31.200 |
But hopefully you'll see by the end of this discussion, why I think it fits a bit better 00:39:39.040 |
So what they're going to do is they're going to take a softmax using LSTM hidden state 00:39:42.080 |
and the entity embeddings for each of the potential parent entities. 00:39:45.680 |
And they'll take this top scoring one as a parent entity. 00:39:48.680 |
And they'll do the same thing for the relation embeddings. 00:39:52.240 |
The next entity is then just this tail entity from the knowledge graph triple. 00:39:56.240 |
So it's relatively trivial to figure out what the next entity should be once you've figured 00:40:00.200 |
out the top scoring parent entity and your top scoring relation. 00:40:04.920 |
And then finally, to predict the next word, they take the vocabulary and they expand it 00:40:09.800 |
to include different aliases that could refer to that entity. 00:40:14.040 |
So what we mean by aliases here are phrases that could refer to the entity in text. 00:40:23.680 |
And you want any of these to be possible words that you could predict as the next word. 00:40:28.940 |
So the goal of this vocabulary expansion is to increase the probability that the next 00:40:33.480 |
word you predict will actually be related to this next entity. 00:40:42.400 |
So now let's look at the new entity case. This means that the entity that you're predicting is not in the local knowledge graph. 00:40:45.280 |
So you're not getting any signal from this local knowledge graph that you've been building 00:40:50.360 |
And all you want to do is find the top scoring entity in the full knowledge graph. 00:40:54.160 |
And you can do this using the LSTM hidden state and pre-trained entity embeddings, similar 00:40:57.920 |
to how we found the score for the top parent entity. 00:41:02.080 |
Your next entity will just be the top scoring entity out of the full knowledge graph. 00:41:06.360 |
And then your next word is once again predicted over this vocabulary expanded to include aliases of that entity. 00:41:19.680 |
And then finally, in the not-an-entity case, there's no entity to predict, and your next word is just the most likely next token over your normal vocabulary. 00:41:27.120 |
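Putting the three cases together, here is a heavily simplified sketch of a single KGLM decision step; the type head, the embeddings, and the toy local graph are all stand-ins for the learned model, and the alias-based vocabulary expansion is only indicated in a comment.

```python
import torch

torch.manual_seed(0)
hidden = 32
# Toy parameters standing in for KGLM's learned components.
type_head = torch.nn.Linear(hidden, 3)        # related entity / new entity / not an entity
parent_embs = torch.randn(4, hidden)          # entities currently in the LOCAL graph
relation_embs = torch.randn(5, hidden)        # pre-trained relation embeddings
all_entity_embs = torch.randn(100, hidden)    # entities in the FULL knowledge graph
# Toy local graph: (parent_id, relation_id) -> tail entity id in the full graph.
local_triples = {(0, 2): 17, (0, 4): 55}

def kglm_step(h):
    # 1) Predict the type of the next token from the LSTM hidden state.
    token_type = ["related", "new", "none"][type_head(h).argmax().item()]

    if token_type == "related":
        # 2a) Pick the top-scoring parent entity and relation from the local graph ...
        parent = (parent_embs @ h).argmax().item()
        relation = (relation_embs @ h).argmax().item()
        # ... and read off the tail entity of that triple (None if no such triple).
        next_entity = local_triples.get((parent, relation))
    elif token_type == "new":
        # 2b) Pick the top-scoring entity out of the full knowledge graph.
        next_entity = (all_entity_embs @ h).argmax().item()
    else:
        # 2c) Not an entity: fall back to a plain LSTM next-word prediction.
        next_entity = None

    # 3) The next word is then predicted over the vocabulary, expanded with
    #    aliases of `next_entity` when there is one (alias expansion omitted here).
    return token_type, next_entity

print(kglm_step(torch.randn(hidden)))
```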
So here's a diagram from the paper that hopefully summarizes and makes even clearer what I just 00:41:33.600 |
So they have a longer example than the one we are looking at, but it's the same prediction process. 00:41:43.200 |
The three different cases are shown in the horizontal rows. 00:41:45.800 |
And we see that here you're in the related entity case, since Nintendo is in your local 00:41:52.560 |
So they want KGLM to predict that Nintendo should be a related entity type of word, that 00:41:57.680 |
Super Mario Land should be its parent entity, that publisher should be the relevant relation. 00:42:02.880 |
And as a result, the next entity is Nintendo. 00:42:08.000 |
You see the aliases of Nintendo at the bottom. 00:42:11.240 |
And then finally, they actually predict Nintendo as the next word. 00:42:14.800 |
And the other cases just summarize what we also already went over. 00:42:20.280 |
So they find that KGLM actually outperforms GPT-2 and AWD-LSTM, which is a strong LSTM 00:42:26.920 |
language model, on a fact completion task similar to the fill-in-the-blank examples 00:42:31.240 |
that we looked at at the beginning of the talk. 00:42:34.400 |
They also find qualitatively that compared to GPT-2, KGLM tends to predict more specific 00:42:39.360 |
tokens since it can predict these tokens from just copying from the local knowledge graph. 00:42:44.360 |
Whereas GPT-2 will tend to predict more generic tokens. 00:42:47.960 |
So if you want to predict the birthplace of someone, GPT-2 is more likely to predict New 00:42:51.440 |
York, for example, and KGLM might predict some obscure place. 00:42:57.200 |
And then they have this really cool set of experiments where they show that KGLM actually supports updating and modifying facts. 00:43:03.860 |
So they made a direct change in the knowledge graph, and then they saw what is the change in the model's predictions. 00:43:10.280 |
So they have this example where the sequence was Barack Obama is born on blank. 00:43:15.760 |
They had their knowledge graph triple as Barack Obama's original birth date, and then their 00:43:19.440 |
most likely next tokens were as expected, August 4, 1961. 00:43:24.200 |
And then they just changed the birth date in their knowledge graph. 00:43:30.820 |
And they looked to see what the next predictions were for KGLM, and it changed its predictions 00:43:35.580 |
to match what was in the local knowledge graph. 00:43:38.600 |
So this is something that's pretty cool and that really only external memory approaches 00:43:43.040 |
can do compared to the original pre-trained entity embedding approaches we talked about. 00:43:47.660 |
And I think it's one of the reasons that KGLM, at least in my opinion, fits better in this class of external memory approaches. 00:43:58.920 |
So I guess I'll take questions on KGLM if there are any. 00:44:04.480 |
It's a pretty complex method, so feel free to have questions. 00:44:10.600 |
Yeah, could you one more time explain what the definition of the local knowledge graph 00:44:15.520 |
is in relationship to the global knowledge graph? 00:44:19.360 |
So a local knowledge graph is supposed to be a subset of the full knowledge graph, and 00:44:24.760 |
it's only supposed to consist of entities that have actually been seen in the sequence so far. 00:44:39.200 |
So here you see that Super Mario Land is in the local knowledge graph because Super Mario 00:44:43.440 |
Land is an entity that is seen in the sequence. 00:44:45.920 |
And then you also want to copy over all the edges from Super Mario Land that would be in the full knowledge graph. 00:44:52.400 |
So this is just a subset of them for the purpose of the example. 00:44:54.920 |
But you see that Super Mario Land has an edge to Nintendo, to Game Boy, to platform game. 00:44:59.440 |
And so you would copy all edges that Super Mario Land has to another node in the full 00:45:04.160 |
And they know in advance, like they have the labels here for what the entities are during training. 00:45:10.080 |
So that's how they can actually create this ground truth knowledge graph. 00:45:13.400 |
And then briefly, a student asked why we can't just use the whole knowledge graph. 00:45:19.720 |
And I gave an answer, but maybe you know better. 00:45:22.640 |
Yeah, I think the idea is the signal will be much stronger if you just use a local knowledge 00:45:28.480 |
So in the Softmax for the related entity case, you would just be predicting over the potential 00:45:36.080 |
parent entities in your local knowledge graph, which is a much smaller set than what's in the full knowledge graph. 00:45:41.480 |
So I guess it's more likely that you're going to predict something that is correct in that 00:45:44.920 |
case than when you have like 5 million or so entities in your full knowledge graph. 00:45:51.640 |
In this case, there's only a single parent entity, but you could have multiple parent 00:45:54.520 |
entities that you're trying to compute which one's most likely over. 00:46:09.360 |
What about queries that require more than one step in the knowledge graph, such as the 00:46:16.760 |
location of the publisher of Super Mario Land? 00:46:25.560 |
So the idea is like, can it support those types? 00:46:27.760 |
Like does it support multi-hop kind of building of the knowledge graph? 00:46:38.880 |
They built up the knowledge graph so that it's just single hop as far as I know. 00:46:43.120 |
But like if you saw the other entities, if you were to see the entities along the hops, 00:46:47.640 |
it would have them in the local knowledge graph. 00:47:03.880 |
Okay, so the next piece of work we're going to talk about, you guys have actually briefly 00:47:13.680 |
seen in the natural language generation lecture. 00:47:16.440 |
But I'm going to go over it again quickly here. 00:47:20.120 |
So unlike the other works that we've talked about that have used knowledge graph triples, 00:47:23.440 |
this is actually going to take kind of a looser notion of knowledge in that the knowledge 00:47:27.400 |
will just be encoded in the text in the training data set. 00:47:33.020 |
And the idea is that, or it's building on the idea, that language models not only learn to 00:47:37.240 |
predict the next word in text, but they also learn these representations of text. 00:47:42.160 |
And the authors suggest that it might actually be easier to learn similarities between text 00:47:46.360 |
sequences than it is to predict the next word in the text. 00:47:49.640 |
So you have this example that Dickens is the author of blank and Dickens wrote blank. 00:47:55.320 |
And they argue that it's easier to tell for a human, but also for a model, that these 00:47:59.640 |
sequences are similar and they should probably have the same next word, even if you don't know what that next word actually is. 00:48:06.360 |
So that's suggesting that it's easier to learn these similarities than it is to actually predict the next word. 00:48:11.120 |
And they argue that this is even more true for long tail patterns, where it's very challenging 00:48:15.920 |
for the model to predict that the next word is some rarely seen token or rare entity than 00:48:21.080 |
it is to find another similar sequence that it's already seen and just copy the next word from that sequence. 00:48:28.640 |
So what they propose to do is store all representations of text sequences in a nearest neighbor data store. 00:48:34.040 |
And then at inference, what you'll want to do is you find the k most similar sequences 00:48:37.800 |
of text, you then retrieve their corresponding values. 00:48:40.800 |
So you just peek at those sequences and see what were their next words. 00:48:45.200 |
And then you combine the probability from this nearest neighbor data store with the probability from the language model. 00:48:52.080 |
And so they call this an interpolation step in that they're weighting how much to pay 00:48:55.800 |
attention to the probability from this kNN approach, and how much to pay attention to the language model. 00:49:02.600 |
And the lambda here is just a hyperparameter that they tune. 00:49:08.040 |
So they have this diagram from their paper where they want to predict the next word in a test sequence. 00:49:13.520 |
So what they do is they have all the training contexts already encoded in their data store. 00:49:18.400 |
So they have representations of all of the training contexts. 00:49:21.840 |
And then they compute a representation of their text context, and they want to figure 00:49:25.200 |
out which representations in the training context are most similar to this test context 00:49:32.880 |
And so here in the external memory view of things, the keys would be the representations 00:49:37.720 |
of the training context, and the values would be the next words. 00:49:42.840 |
So they get the k nearest training representations. 00:49:47.800 |
So that's what you see with this Macbeth, Hamlet, Macbeth example. 00:49:51.600 |
They have a normalization step where they convert this to probability space. 00:49:55.760 |
And then finally, they have an aggregation step. 00:49:58.160 |
So if a word is seen as the next word in several of these k nearest neighbors, then they want to aggregate the probability mass for that word. 00:50:10.300 |
And then finally, they have this interpolation step where they try to balance between the 00:50:14.400 |
classification probabilities from the language model and from the kNN approach. 00:50:20.960 |
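Here is a small sketch of that retrieve-and-interpolate computation; the datastore, the distance-to-probability conversion, and the interpolation weight are illustrative choices rather than the exact kNN-LM implementation.

```python
import torch
import torch.nn.functional as F

def knn_lm_probs(context_vec, lm_logits, keys, values, vocab_size, k=3, lam=0.25):
    """context_vec: (d,) representation of the test context.
    lm_logits: (vocab_size,) the base language model's logits.
    keys: (N, d) stored representations of training contexts.
    values: (N,) the next-word id observed after each stored training context."""
    # 1) Retrieve the k nearest training contexts (smallest squared L2 distance).
    dists = ((keys - context_vec) ** 2).sum(dim=-1)
    knn_dist, knn_idx = dists.topk(k, largest=False)

    # 2) Normalize over the neighbors, then aggregate probability mass onto each
    #    neighbor's stored next word.
    neighbor_probs = F.softmax(-knn_dist, dim=-1)
    p_knn = torch.zeros(vocab_size)
    p_knn.index_add_(0, values[knn_idx], neighbor_probs)

    # 3) Interpolate with the base LM distribution.
    p_lm = F.softmax(lm_logits, dim=-1)
    return lam * p_knn + (1 - lam) * p_lm

# Toy usage with random stand-ins for the datastore and the language model.
d, N, V = 16, 50, 100
probs = knn_lm_probs(torch.randn(d), torch.randn(V),
                     keys=torch.randn(N, d),
                     values=torch.randint(0, V, (N,)),
                     vocab_size=V)
print(probs.sum())  # ~1.0
```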
So some immediate observation you might have is this seems really expensive. 00:50:25.620 |
They do propose ways to try to minimize the expense of actually having to store all the 00:50:30.840 |
training contexts in this data store, because they actually store it for every single window of text in the training set. 00:50:38.560 |
And you can do quantization and approximate nearest neighbor search to try to make this less expensive. 00:50:44.040 |
But I imagine this would still be pretty expensive for really large training data sets. 00:50:47.800 |
They also have some cool experiments that show that this is very good for domain adaptation. 00:50:53.040 |
So if you take your language model and you have a new domain that you want to apply your 00:50:56.680 |
language model to, you could just create a nearest neighbor data store of your new domain. 00:51:02.420 |
So you encode all the representations of that new domain. 00:51:07.380 |
And then you can just use your language model with these kNN probabilities as well, 00:51:12.880 |
just immediately on this new domain without actually having to further train your language 00:51:18.240 |
So I thought that was a pretty cool use case of this external memory approach. 00:51:23.560 |
So while it doesn't leverage knowledge bases directly, it does capture this looser idea 00:51:27.520 |
of encoding knowledge that is in a textual representation form into some 00:51:33.120 |
external memory that the model can then take advantage of. 00:51:45.360 |
Well, so one person is asking, how does the kNN make predictions for the next word? 00:51:56.380 |
The k neighbors are for the context instead of the next word. 00:52:02.520 |
So the keys are the representations of the context. 00:52:05.860 |
The values in your external memory are the next words. 00:52:09.060 |
So when you figure out-- you figure out your nearest neighbors using your keys, and then 00:52:14.460 |
So it does actually know what the next words are for each of those representations. 00:52:25.340 |
So finally, we're going to talk about how you can just modify the training data to better incorporate knowledge. 00:52:32.300 |
So approaches we've talked about so far are actually incorporating knowledge explicitly 00:52:36.980 |
by using either pre-trained embeddings or an external memory. 00:52:40.820 |
We also want to talk about how you can just incorporate knowledge implicitly through the training data itself. 00:52:48.300 |
So what we're going to do is either mask or corrupt the data to introduce additional training 00:52:51.940 |
tasks that require factual knowledge to figure out what data was masked, for instance. 00:52:59.780 |
It doesn't have any additional memory or computation requirements. 00:53:04.420 |
You don't have extra knowledge encoder layers to train. 00:53:08.580 |
And you don't have to modify your architecture either. 00:53:11.620 |
So you can continue using your favorite BERT model and just make these changes to the training data. 00:53:18.580 |
So the first work we're going to look at is called WKLM, the Weakly Supervised Knowledge-Pretrained 00:53:22.940 |
Language Model. 00:53:25.620 |
And the key idea here is to train the model to distinguish between true and false knowledge. 00:53:31.300 |
So they're going to corrupt the data by replacing mentions in the text with mentions that refer 00:53:35.060 |
to different entities of the same type, to create what they refer to as negative knowledge statements. 00:53:40.700 |
And then the model will just predict, has the entity been replaced or corrupted? 00:53:47.700 |
This type constraint is there to encourage the model to actually 00:53:52.140 |
use factual knowledge to figure out if this corruption is taking place. 00:53:54.940 |
So you could imagine that if you replaced it with something that's not realistic at all, the 00:53:58.580 |
model could just be basing its prediction on whether the sentence is linguistically plausible. 00:54:04.700 |
So as an example, we have a true knowledge statement, JK Rowling is the author of Harry Potter. 00:54:10.900 |
And then we want to modify this to replace it with another author. 00:54:19.820 |
So you can see that this requires some amount of knowledge, background knowledge, to actually 00:54:24.020 |
be able to figure out which statement's true and which statement is false. 00:54:27.140 |
And the idea is that the model will be able to predict for each of these mentions whether it has been replaced. 00:54:36.900 |
So this diagram here is from the paper and hopefully explains this a bit better. 00:54:40.380 |
They have their original article on the left, and then they have their replaced article on the right. 00:54:47.540 |
So what they do is for a given entity, they first look up its type. 00:54:53.820 |
And then they randomly sample another entity of that type and get an alias of it to swap into the text. 00:54:59.420 |
So they're going to replace Stan Lee, for instance, with Brian Johnson, and similarly swap out the Marvel Comics mention. 00:55:04.940 |
And the replacements are in red on the right. 00:55:08.380 |
And then the idea is that the model will be able to predict for each of these mentions whether it has been replaced. 00:55:14.060 |
So in the case of Brian Johnson, they have the red X for this is a false mention. 00:55:18.300 |
And in the case of the true mentions, they have the checkmark. 00:55:22.420 |
So it's a pretty simple approach, but they actually show that it can help the model increase 00:55:27.380 |
the amount of knowledge that's encoded in its parameters. 00:55:36.380 |
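A rough sketch of this corruption step is below. It is illustrative only and assumes some hypothetical data structures: `mentions` holds linked entity spans, `entity_type` maps entity ids to types, and `entities_by_type` maps each type to other entities with their aliases; the actual WKLM pipeline builds these from an entity-linked corpus rather than from these toy inputs.

```python
import random

def corrupt_mentions(tokens, mentions, entity_type, entities_by_type,
                     replace_prob=0.5):
    """Return corrupted tokens plus one 0/1 label per mention
    (1 = original mention kept, 0 = replaced with a same-type entity)."""
    corrupted = list(tokens)
    labels = []
    # Replace right-to-left so token offsets of untouched mentions stay valid.
    for start, end, ent_id in sorted(mentions, key=lambda m: -m[0]):
        if random.random() < replace_prob:
            # Sample a *different* entity of the same type and swap in one of
            # its aliases, so the sentence stays linguistically plausible.
            candidates = [e for e in entities_by_type[entity_type[ent_id]]
                          if e["id"] != ent_id]
            alias = random.choice(random.choice(candidates)["aliases"])
            corrupted[start:end] = alias.split()
            labels.append(0)
        else:
            labels.append(1)
    return corrupted, labels[::-1]  # labels back in left-to-right order
```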
So WKLM uses an entity replacement loss to train the model to distinguish between these true and replaced mentions. 00:55:42.640 |
And this just looks like a binary classification loss, where your true mentions are on one side and your replaced mentions are on the other. 00:55:49.520 |
And you want to increase the probability that this P of E given C, so the probability of 00:55:54.620 |
the entity given the context, you want to increase that for the true mentions and decrease it for the replaced mentions. 00:56:01.540 |
The total loss is then just a combination of the masked language model loss and this entity replacement loss. 00:56:08.140 |
The masked language model loss is defined at the token level. 00:56:13.180 |
And the entity replacement loss is defined at the entity level, meaning it's not just over subword tokens. 00:56:18.900 |
It's potentially over multiple words if you have multi-word entities, phrases, for instance. 00:56:25.220 |
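Written out, this objective is just a binary cross-entropy over mentions plus the usual masked language modeling loss. This is a reconstruction from the description above, so treat the exact form (and any relative weighting between the two terms) as an approximation rather than the paper's exact equation:

\mathcal{L}_{\text{entRep}} = -\sum_{e \in \mathcal{E}^{+}} \log P(e \mid C) \;-\; \sum_{e \in \mathcal{E}^{-}} \log\bigl(1 - P(e \mid C)\bigr), \qquad \mathcal{L}_{\text{total}} = \mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{entRep}}

where \mathcal{E}^{+} are the true mentions, \mathcal{E}^{-} the replaced ones, and P(e \mid C) is the model's probability that mention e is the original entity given its context C.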
And this is an important theme that we really see recurring throughout 00:56:29.720 |
these works, in that modifying the data at the entity level seems to be an 00:56:34.580 |
important component of actually increasing the amount of knowledge that a language model encodes. 00:56:39.500 |
So they find that WKLM improves over BERT and GPT-2 on fact completion tasks, like 00:56:47.620 |
the fill in the blank statements that we looked at at the beginning. 00:56:50.840 |
They also find that it improves over the Ernie paper that we talked about on a downstream task. 00:56:55.860 |
And they had a set of ablation experiments where they looked at, can you just remove the masked language model loss? 00:57:02.940 |
And if you just train BERT for longer, do you really need this entity replacement loss? 00:57:09.820 |
The second row is looking at, if we remove the masked language model loss, what happens? 00:57:14.260 |
We see that it performs much worse without the masked language model loss. 00:57:19.420 |
Their intuition there was that the masked language model loss helps to encode just general language understanding. 00:57:26.940 |
And then training BERT for longer performs much worse than using this entity replacement loss. 00:57:32.020 |
So this further suggests that the entity replacement loss is 00:57:36.700 |
actually really helping encode more knowledge in these language models. 00:57:43.420 |
So in addition to corrupting the data, we're also going to look at, can we just mask the data differently? 00:57:48.060 |
Can we be more clever about how we do the masking? 00:57:50.820 |
And this is a thread in several recent works. 00:57:53.540 |
So there's actually another paper called Ernie. 00:57:55.700 |
So this is different than the one we talked about before. 00:57:57.920 |
And this is enhanced representation through knowledge integration. 00:58:01.420 |
And what they do is show improvements on downstream Chinese NLP tasks by doing phrase-level and entity-level masking. 00:58:08.580 |
So instead of just masking out subwords, they're going to mask out phrases of multiple 00:58:12.780 |
words, and entities, meaning the full phrase of an entity mention in 00:58:18.060 |
the text, which they might find with NER techniques, for example. 00:58:23.720 |
And then the second work is actually something you heard about in the last lecture, which 00:58:27.460 |
is the idea of using salient span masking to mask out salient spans. 00:58:32.460 |
And a salient span is just a named entity or a date. 00:58:34.900 |
So you can see this is pretty similar to what Ernie is doing. 00:58:38.280 |
And they found that using salient span masking actually significantly helped T5 performance 00:58:43.180 |
on these closed domain question answering tasks. 00:58:48.420 |
So just to make sure we're all on the same page with the different masking techniques, 00:58:52.020 |
this diagram from the Ernie paper is comparing what BERT does versus what Ernie does. 00:58:56.620 |
The top row shows that BERT masks out subword 00:59:01.060 |
tokens, whereas Ernie masks out phrases like "a series of", as well as entities like JK Rowling. 00:59:08.300 |
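As a small illustration of the difference, here is a sketch contrasting random subword masking with masking whole salient spans. The span boundaries would come from an NER or date tagger in practice; everything here (the function names, the toy sentence, the hard-coded spans) is just for illustration.

```python
import random

MASK = "[MASK]"

def random_subword_masking(tokens, mask_prob=0.15):
    # BERT-style: each token is masked independently at random.
    return [MASK if random.random() < mask_prob else t for t in tokens]

def salient_span_masking(tokens, salient_spans, num_spans=1):
    # salient_spans: (start, end) token indices of named entities or dates,
    # e.g. produced by an off-the-shelf NER tagger (assumption).
    masked = list(tokens)
    for start, end in random.sample(salient_spans,
                                    min(num_spans, len(salient_spans))):
        masked[start:end] = [MASK] * (end - start)  # mask the whole span
    return masked

tokens = "JK Rowling is the author of Harry Potter".split()
print(salient_span_masking(tokens, salient_spans=[(0, 2), (6, 8)]))
```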
There are some interesting results showing that salient span masking is helping encode knowledge. 00:59:18.740 |
So on the left, we're looking at the results of the original paper that proposed salient span masking. 00:59:27.320 |
And the idea here was that they were training a knowledge retriever. 00:59:30.760 |
So it's actually more of an external memory class of techniques. 00:59:34.460 |
But they find that by using the salient span masking technique, they could actually train this knowledge retriever much more effectively. 00:59:41.080 |
So it's a good example of how these techniques are really complementary. 00:59:45.860 |
So while I presented three classes of techniques, you can definitely get benefits by doing multiple of them. 00:59:52.260 |
And they found that compared to using masking from BERT, which 00:59:56.320 |
would be the random uniform masks, or doing random masking of spans from a paper called 01:00:01.720 |
SpanBERT, it performs much better to do salient span masking. 01:00:06.480 |
So you see a 38 exact match score versus a 32 exact match score, for instance. 01:00:13.760 |
And on the right, we have results from fine tuning T5 with either salient span masking 01:00:19.840 |
or the span corruption task that you saw in assignment 5. 01:00:23.080 |
And you can see that on these different QA data sets, salient span masking does significantly 01:00:27.240 |
better than just using the span corruption technique. 01:00:31.920 |
So this really suggests that doing the salient span masking and masking out these salient 01:00:36.800 |
spans of entities is, in fact, helping to encode more knowledge in these language models. 01:00:46.520 |
So to recap, we talked about three different classes of techniques to add knowledge to language models. 01:00:51.940 |
We talked about using pre-trained entity embeddings. 01:00:54.360 |
These weren't too difficult to apply to existing architectures, and they're a way to leverage this pretrained entity embedding work. 01:01:01.080 |
But it was a rather indirect way of incorporating knowledge, and it could be hard to interpret. 01:01:06.360 |
We also talked about approaches to add an external memory. 01:01:10.120 |
This could support modifying the knowledge base. 01:01:15.520 |
But they tended to be more complex in implementation, like we saw with KGLM. 01:01:19.600 |
And they also required more memory, like we saw with the kNN-LM approach. 01:01:24.720 |
And then finally, we talked about modifying the training data. 01:01:28.040 |
So this requires no model changes or additional computation. 01:01:31.480 |
It also might be the easiest to theoretically analyze. 01:01:34.080 |
So it's actually an active area of research right now. 01:01:37.680 |
But it's still an open question whether modifying the training data is always as effective as model 01:01:41.880 |
changes, and what the trade-offs are in terms of the amount of data required versus doing 01:01:46.560 |
one of these other knowledge enhancement approaches. 01:02:06.880 |
So section three is about how researchers are actually going about evaluating the knowledge in language models, 01:02:12.680 |
and I guess how some of the techniques we just talked about stand up in these evaluations. 01:02:17.960 |
So first, we're going to talk about probes, which don't require any fine-tuning of the language model. 01:02:23.320 |
And then we're going to talk about downstream tasks, which look at how well do these pre-trained 01:02:27.320 |
representations actually transfer their knowledge to other tasks. 01:02:32.800 |
So one of the initial works in this area was called LAMA. 01:02:35.800 |
And this really started a series of works to look into how much knowledge is already encoded in pretrained language models. 01:02:43.660 |
So their question was, how much relational, common sense, and factual knowledge is in off-the-shelf pretrained language models? 01:02:49.320 |
So this is just taking pre-trained language models and evaluating the knowledge in them. 01:02:54.360 |
And this is without any additional training or fine-tuning. 01:02:57.740 |
So they mainly constructed a set of what they refer to as cloze statements. 01:03:01.080 |
And these are just the fill-in-the-blank statements that we actually drew from at the beginning of the lecture. 01:03:10.900 |
And they manually created these templates of cloze statements using knowledge graph 01:03:14.300 |
triples and question-answering pairs from existing data sets. 01:03:19.260 |
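For a sense of what this looks like in practice, here is one convenient way to poke at a pretrained masked language model with a cloze statement, sketched with the Hugging Face transformers library. The LAMA repo has its own tooling, so treat this as an illustrative stand-in; the template below is instantiated from a knowledge graph triple like (Dante, born-in, Florence).

```python
from transformers import pipeline

# Load an off-the-shelf pretrained model, with no fine-tuning.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# A cloze statement built from a manually written relation template.
for prediction in fill_mask("Dante was born in [MASK].", top_k=10):
    print(f'{prediction["token_str"]:>12}  {prediction["score"]:.3f}')
```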
They wanted to compare pre-trained language models to supervised relation extraction and 01:03:23.820 |
question-answering systems to see how do these language models that were trained in an unsupervised 01:03:28.740 |
fashion compare to these baseline systems that are not only supervised but really targeted at these tasks. 01:03:37.620 |
And their goal was to evaluate the knowledge in existing pre-trained language models. 01:03:41.860 |
And a key point about this is they're just using the language models as they are available off the shelf. 01:03:47.600 |
So this means there could be differences in the pre-trained corpora, for example. 01:03:51.520 |
So when you look at the following table and you're comparing language models, also keep 01:03:54.540 |
in mind that these don't account for the differences in the pre-trained corpora. 01:04:00.860 |
So a lot of these language models probably look familiar to you, either from previous lectures or from the assignments. 01:04:07.500 |
And what we see is that overall, the BERT-base and BERT-large pre-trained models are performing 01:04:12.580 |
much better than the other language models here. 01:04:16.900 |
I guess I forgot to mention what mean precision at 1 is. 01:04:21.940 |
The idea is, if you look at the blank and you take the model's top 01:04:26.100 |
prediction for the blank, is it correct or not? 01:04:30.180 |
Precision at 10 would be: let's look at the top 10 predictions instead. 01:04:37.620 |
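In code, the metric is just an average of hit-or-miss checks over the probe's cloze statements; this is a generic sketch, not the LAMA evaluation script.

```python
def mean_precision_at_k(ranked_predictions, gold_answers, k=1):
    # ranked_predictions: for each cloze statement, the model's predictions
    # sorted from most to least likely; gold_answers: the correct fillers.
    hits = [int(gold in preds[:k])
            for preds, gold in zip(ranked_predictions, gold_answers)]
    return sum(hits) / len(hits)

# k=1 checks only the single top prediction; k=10 checks the top ten.
```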
So in addition to BERT-large and BERT-base performing well overall, we do see that in 01:04:43.020 |
the T-REx data set, the relation extraction baseline is performing a bit better than BERT. 01:04:48.820 |
One thing to notice here that's pretty interesting is that this data set has a lot of different relation types. 01:04:54.420 |
And relations can be classified in terms of, are they a one-to-one relation, are they an 01:04:58.460 |
N-to-1 relation, are they an N-to-M relation? 01:05:02.060 |
An example of a one-to-one relation would be your student ID relation. 01:05:08.620 |
An example of an N-to-M relation would be the enrolled-in relation. 01:05:13.180 |
So there's lots of students enrolled in lots of classes. 01:05:17.740 |
And they find that BERT really struggles on these N-to-M relations. 01:05:21.920 |
So while it performs better than relation extraction baseline on some types of relations, 01:05:26.460 |
overall it does pretty terribly on these N-to-M relations. 01:05:29.060 |
So overall it does a bit worse than the baseline on this T-REx data set. 01:05:36.860 |
They also compare to the DrQA question answering baseline, and they find that BERT does a fair amount worse there. 01:05:39.780 |
They note that the language model is not fine-tuned here and also has no access to an information retrieval system. 01:05:45.740 |
And then when they look at the precision at 10, they find that this gap between DrQA's 01:05:49.300 |
performance and BERT actually closes quite a bit, which suggests that these language 01:05:54.420 |
models do have some amount of knowledge encoded in them and that they're even competitive 01:05:59.740 |
with these knowledge extraction supervised baselines. 01:06:03.900 |
So you can also try out examples from their GitHub repo for the LAMA probe. 01:06:10.700 |
We have an example from their repo, which is "The cat is on the [MASK]." 01:06:15.060 |
You can see what the top 10 predictions are to fill in the cloze statement. 01:06:22.540 |
So this can be a fun way just to figure out what factual and common sense knowledge is encoded in these models. 01:06:28.580 |
And it's pretty easy to use with this interactive prompt. 01:06:33.620 |
So some limitations of the LAMA probe are that it can be hard to understand why the model makes a particular prediction. 01:06:40.480 |
So for instance, BERT might just be predicting the most popular token. 01:06:44.740 |
Maybe it's just memorizing co-occurrence patterns and doesn't really understand the knowledge 01:06:49.980 |
statement and doesn't understand what the fact is. 01:06:54.660 |
It might also just be identifying similarities between surface forms of the subject and object. 01:06:59.500 |
So for instance, in this example, Pope Clement VII has a position of blank. 01:07:03.460 |
Even if you don't know anything about Pope Clement VII, you might be able to figure out 01:07:08.060 |
that Pope is a likely next word for this triple or for this template. 01:07:15.220 |
So the problem with this is if the model is just making these predictions based on these 01:07:18.860 |
surface forms or co-occurrence patterns, it's difficult to know if we're actually evaluating its knowledge. 01:07:25.260 |
Maybe it's just making correct predictions for other reasons. 01:07:29.860 |
And the more subtle issue here is that language models might just be sensitive to the phrasing of the statement. 01:07:35.500 |
So for each relation in their data set, they just have a single manually defined template. 01:07:42.380 |
And qualitatively, they found that if you just make small changes to the template, it could actually 01:07:46.260 |
change whether or not the model could recall the correct prediction. 01:07:51.500 |
And so this means that the probe results are really a lower bound on the knowledge that's encoded in the language model. 01:07:58.060 |
So if you change the phrasing, it's possible that the model might show that it actually does know the fact. 01:08:04.620 |
So the next lines of work we'll talk about are really building on these two limitations of the LAMA probe. 01:08:12.620 |
So the first one is called LAMA-UHN, or LAMA Unhelpful Names. 01:08:16.340 |
And the key idea is to remove the examples from LAMA that can be answered without actually knowing the fact. 01:08:21.560 |
So this is kind of addressing the first limitation on the last slide. 01:08:25.700 |
So they observed that BERT relies on just the surface forms of entities, and might not be using knowledge to make its predictions. 01:08:31.480 |
This includes the string match situation that we talked about with the pope. 01:08:35.620 |
It also deals with the revealing person name issue that you saw in assignment five. 01:08:40.900 |
So this is where the name could be an incorrect prior for the native language of someone, for example. 01:08:47.940 |
They have this example from the paper where they look at different people names or person's 01:08:52.980 |
names and then they look at BERT's prediction for their native language. 01:08:58.720 |
And BERT just predicts very biased and stereotypical languages for these particular names. 01:09:06.460 |
It can lead BERT to make incorrect predictions in some cases. 01:09:10.340 |
But it could also let BERT make correct predictions even if it has no knowledge of the fact. 01:09:10.340 |
So the issue they're trying to get at here is, do we know that BERT actually knows 01:09:16.460 |
this fact, or is it just using some bias to make its prediction? 01:09:19.980 |
So what they do is introduce a couple of heuristics to basically just filter out the 01:09:24.660 |
examples from the LAMA probe that can be solved either by the string match situation or by the person name prior. 01:09:27.800 |
So they make a harder subset of the LAMA data set essentially. 01:09:39.660 |
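As a flavor of the string-match heuristic, the sketch below drops examples where the answer is literally contained in the subject's surface form; this is an illustrative simplification of the paper's filters (the person-name filter, which uses the model itself to detect name-based guessing, is not shown).

```python
def is_unhelpful_string_match(subject, gold_answer):
    # e.g. subject "Pope Clement VII", answer "pope": the answer can be read
    # off the subject string, so the example doesn't test factual knowledge.
    return gold_answer.lower() in subject.lower()

examples = [("Pope Clement VII", "pope"), ("JK Rowling", "United Kingdom")]
harder_subset = [ex for ex in examples if not is_unhelpful_string_match(*ex)]
print(harder_subset)  # only the JK Rowling example survives the filter
```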
They find that when they test BERT on this harder subset, its performance drops quite a bit. 01:09:44.500 |
But when they test their knowledge-enhanced model, which they call E-BERT, the score only drops slightly. 01:09:49.460 |
So it's possible that as you make harder knowledge probes, we'll actually see even bigger differences 01:09:54.860 |
in the performance of knowledge enhanced models to models without these knowledge enhancements. 01:10:02.940 |
The next piece of work we'll talk about is actually getting at this issue that the phrasing 01:10:08.980 |
of the prompt might trigger different responses from the language model. 01:10:14.060 |
So the language model might know the fact, but it might fail on the task due to the phrasing. 01:10:19.460 |
One reason this might happen is that the pre-training is on different contexts and sentence structures than the query. 01:10:24.260 |
So for example, you might have in your pre-training corpus, "The birthplace of Barack Obama is Honolulu, Hawaii." 01:10:30.380 |
And this might be something you see in Wikipedia, for instance, that's a common training data 01:10:34.380 |
And then as a researcher, you write Barack Obama was born in blank. 01:10:38.340 |
And you can see that these sentence structures are pretty different. 01:10:40.900 |
So the model might've seen the first fact, but the difference in sentence structure is actually what causes it to fail on the task. 01:10:49.500 |
So what they do is generate a lot more of these prompts by mining templates from Wikipedia. 01:10:54.140 |
One of the techniques actually uses dependency parsing and also generating paraphrase prompts 01:10:58.900 |
by taking inspiration from the machine translation literature and using back translation. 01:11:05.180 |
So they generate a lot more prompts to try to query the language models and figure out 01:11:08.980 |
whether small variations in the prompt trigger the correct prediction from the language model. 01:11:14.860 |
They also experiment with ensembling prompts. 01:11:16.860 |
So if we give the model multiple prompts and then take some probability averaged over these 01:11:21.380 |
different prompts, can we improve the chance of the model returning the correct prediction? 01:11:26.740 |
So we give it a higher chance of seeing a context that it might've actually seen during pre-training. 01:11:31.020 |
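A bare-bones version of the ensembling idea is sketched below: query the model once per paraphrased prompt and average the resulting distributions. The paper also explores learned, non-uniform weights; the function and variable names here are mine.

```python
import numpy as np

def ensemble_prompt_probs(per_prompt_probs, weights=None):
    # per_prompt_probs: shape (num_prompts, vocab_size); each row is the
    # model's distribution over answers for one paraphrase of the query,
    # e.g. "X plays in Y position." / "X plays at Y position."
    per_prompt_probs = np.asarray(per_prompt_probs)
    if weights is None:
        weights = np.full(len(per_prompt_probs), 1.0 / len(per_prompt_probs))
    return weights @ per_prompt_probs  # weighted average distribution
```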
They find that the performance on LAMA increases when they either use a top-performing prompt or ensemble multiple prompts. 01:11:39.940 |
So this suggests that the original LAMA really was a lower bound on the amount of knowledge in these language models. 01:11:45.900 |
And changing the phrasing can actually help the model recall the correct answer. 01:11:52.980 |
This table is a bit frightening, but they find that small changes in the query can lead to large differences in accuracy. 01:11:58.900 |
So if you just have a query like X plays in Y position, and then you change that to X 01:12:03.700 |
plays at Y position, this can actually lead to like a 23% accuracy gain on this particular 01:12:08.340 |
relation in terms of the model actually being able to recall the correct answer. 01:12:13.540 |
Or even just X was created in Y to X is created in Y, 10% accuracy gain. 01:12:19.740 |
So I think this motivates the need to not only develop better ways to query these models, 01:12:23.820 |
but probably also build language models that are actually more robust to the query itself. 01:12:28.420 |
So in addition to probes, another way to evaluate these language models is by looking at how 01:12:36.180 |
well they transfer from the pre-trained representation to downstream tasks. 01:12:42.380 |
And so the idea here is you're actually going to fine tune the pre-trained representation 01:12:45.540 |
on different downstream tasks, similar to how you would evaluate BERT on glue tasks. 01:12:51.700 |
Some common tasks that are used for this are relation extraction, entity typing, and question answering. 01:12:57.940 |
So relation extraction is where you want to predict the relation between two entities. 01:13:01.780 |
So this is getting back at one of the questions earlier in the talk, in terms of, well, how 01:13:05.300 |
do you get the relations that form the edges in these knowledge bases? 01:13:08.340 |
So given two entities, you learn a model to predict what is the relation between them. 01:13:13.420 |
Entity typing is the task of, given an entity, predicting the type of that entity. 01:13:20.100 |
And then you guys are very familiar with question answering. 01:13:23.580 |
So the idea of these tasks is that they're knowledge intensive. 01:13:27.660 |
So they're good candidates to see how well do these pre-trained representations actually 01:13:31.340 |
transfer the knowledge to these downstream tasks. 01:13:36.580 |
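To ground what fine-tuning the pre-trained representation means for, say, relation extraction, here is a rough sketch: encode the sentence with the two entity mentions marked, pool, and classify the relation. The marker tokens, pooling choice, and label count are illustrative defaults, not any particular paper's architecture.

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RelationClassifier(nn.Module):
    def __init__(self, encoder_name="bert-base-uncased", num_relations=42):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.classifier = nn.Linear(self.encoder.config.hidden_size,
                                    num_relations)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]   # [CLS] as a sentence summary
        return self.classifier(cls)

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
# Entity markers like [E1]/[E2] would normally be added as special tokens.
batch = tok(["[E1] Bill Gates [/E1] founded [E2] Microsoft [/E2] ."],
            return_tensors="pt")
logits = RelationClassifier()(batch["input_ids"], batch["attention_mask"])
```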
Here we're looking at the performance on a relation extraction benchmark called TACRED. 01:13:40.740 |
And all the models that we show here were at one point state-of-the-art on TACRED. 01:13:45.340 |
So this C-GCN is a graph convolutional neural network over dependency trees. 01:13:50.740 |
The BERT LSTM base is one of the first works that showed that you could actually get state-of-the-art 01:13:56.020 |
performance with BERT on relation extraction. 01:13:58.060 |
And this is just putting an LSTM layer over BERT's output. 01:14:01.860 |
Ernie is a work that we talked about with the pre-trained entity embeddings. 01:14:04.740 |
Matching the blanks we didn't get to today, but it's a really interesting work about learning relation representations. 01:14:11.540 |
And it falls more into the training data modification approaches, in that they are actually masking out the entity mentions. 01:14:22.180 |
The W+W here means that this KnowBert variant actually encodes two knowledge bases. 01:14:26.140 |
So they're encoding WordNet and they're also encoding Wikipedia. 01:14:30.380 |
And the high-level takeaway from this table is that you can see that the recent knowledge-enhanced 01:14:34.300 |
models have achieved state-of-the-art over the original models that once performed very well on this benchmark. 01:14:44.020 |
Another interesting takeaway from this table is there seems to be a trade-off in the size 01:14:47.380 |
of the language model that's necessary to get a certain performance. 01:14:50.980 |
So if you just consider the size of the language model, then KnowBert performs the best. 01:14:55.340 |
But if you don't consider that, then it ties with matching the blanks. 01:15:00.900 |
So overall, this is pretty good evidence that these knowledge-enhanced methods are in fact 01:15:05.380 |
transferring to these knowledge-intensive downstream tasks that can really take advantage of this knowledge. 01:15:16.180 |
So for entity typing, we're comparing a slightly different set of models. 01:15:18.900 |
Some of the baselines are LSTM models that were designed for entity typing. 01:15:23.180 |
And we have Ernie and KnowBert leading the, I guess, leaderboard here on the entity typing task. 01:15:30.820 |
And we see gains of about 15 F1 points with Ernie and KnowBert. 01:15:34.660 |
So once again, we really do see that these knowledge-rich pre-trained representations 01:15:39.020 |
are transferring and helping on these knowledge-intensive downstream tasks. 01:15:45.980 |
So just to recap, we talked about probes, which evaluate the knowledge already present in pretrained language models. 01:15:52.900 |
But it can be challenging to construct benchmarks that actually make sure you're testing the knowledge and not something else. 01:15:58.340 |
It can also be challenging to construct the queries used in the probe. 01:16:05.380 |
We also talked about downstream tasks, which are a bit of an indirect way to evaluate knowledge, in that they have this extra component of fine-tuning on the task. 01:16:09.580 |
But it's a good way to evaluate how useful this knowledge-rich pre-trained representation is for downstream applications. 01:16:18.980 |
So I just touched on the exciting work in this area. 01:16:22.300 |
But there's many other directions if you want to dive more into this. 01:16:25.800 |
So there's retrieval-augmented language models, which learn knowledge retrievers to figure 01:16:30.180 |
out what documents might be relevant for predicting the next word. 01:16:34.020 |
There's work in modifying the knowledge in language models. 01:16:36.980 |
So I talked about how this is one of the obstacles and challenges to using language models as knowledge bases. 01:16:45.300 |
We also saw how important the knowledge pre-training task was. 01:16:48.900 |
Well, there's many papers that are proposing different tasks to do the knowledge pre-training. 01:16:53.420 |
So it's still an open question in terms of what tasks are best to add to encode more knowledge. 01:16:59.260 |
There's also been work on more efficient knowledge systems. 01:17:02.340 |
So at NeurIPS, there's now an efficient QA challenge, which aims at building the smallest QA system. 01:17:07.100 |
And then finally, there's been work on building better knowledge benchmarks that build on the probes we talked about today. 01:17:16.140 |
So that's all I have for today, and I hope your final projects are going well.