Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 15 - Add Knowledge to Language Models


Chapters

0:00 Introduction
0:17 Reminders
2:24 Language Models
4:06 What Types of Facts a Language Model Might Know
6:36 Why Researchers Are Interested in Building Language Models That Can More Reliably Recall Knowledge
7:48 What Is a Knowledge Base
9:57 Advantages of Using Language Models
12:41 Add Knowledge to Language Models
13:30 Pretrained Entity Embeddings
15:00 Entity Linking
17:49 Entity Embeddings
20:55 ERNIE
22:14 Architecture Diagram
23:26 Training
24:59 Ablation
25:40 Challenges
29:11 KnowBert
31:37 External Memory
33:45 KGLM
37:50 Related Entity Case
41:26 New Entity Diagram
44:20 Local Knowledge Graph
47:10 kNN-LM
52:31 Implicit Knowledge
53:17 WKLM
57:42 Masking
58:48 Comparing Masking Techniques
59:10 Results of the Original Paper

Transcript

Welcome to CS224N lecture 15. So I'm Megan and I'm one of the CAs in this course, and I'm also a PhD student working with Chris Ré. And today I'll be talking about integrating knowledge in language models. So some quick reminders, your project milestones were due today, so hopefully you turned those in already or will be turning them in in the next couple of days, and we'll try to get feedback on those as fast as possible.

So something to be aware of is a change of grading basis and course withdrawal deadline is this Friday. So if you want to make any change to your grade, make sure to do that by then. And we'll be getting you the grades back on assignment five by then as well, in case that's helpful in making your decision.

And finally, your final projects are due in two weeks. So hopefully those are going smoothly. So the topic of the day is integrating knowledge in language models. You've seen a bit about this idea in assignment five, and also in Colin Raffel's lecture last class. So in assignment five, the task was to train a model to predict the birthplace of a person given their name.

And you saw that by pre-training on a larger data set, you're actually able to do better on this task, since you could encode some real knowledge into the language model. And then last lecture, Colin Raffel presented how T5 could actually be fine-tuned for a closed-book question answering task, such that you can give T5 a natural language question and it'll return an answer.

So today we'll be building on these threads and looking at techniques that researchers have recently been developing to increase the amount of knowledge in language models. So we're going to start with a quick recap of language models, just to make sure we're all on the same page. Then we're going to talk about what types of knowledge language models can already encode and what they might struggle on.

We'll also motivate why researchers are interested in increasing the amount of knowledge in language models, and what this could enable for future AI systems if we have language models that can actually reliably recall knowledge. We'll talk about three broad classes of techniques that researchers have been using to add knowledge to language models.

These include adding pre-trained entity embeddings, using external memory or key value store, or even just modifying the training data. And for each of these techniques, we'll talk about at least one recent work that used the technique, so hopefully it's clear to see how to actually employ it in practice.

And then finally, we'll wrap up by talking about how to evaluate the knowledge in language models and the challenges that come up in trying to do this. So let's dive right in. We're going to start by talking about standard language models. You learned about these at the beginning of the course.

And the task is to predict the next word in a sequence of text and to compute the probability of a sequence. So you may remember the example that students opened their blank. And we talked about it could be minds, exams, we're going to go with books here. And the task of the standard language model is to predict the most likely next word in the sequence.

A couple of lectures ago, John also introduced the notion of masked language models. Instead of predicting the next word in a sequence of text, the task is to predict the masked tokens. And this is done using bidirectional context. So you may remember the example, "I [MASK] to the [MASK]." And the goal of the masked language model is to predict the most likely token for each of the masked out words.

So maybe I went to the store. So while there's some differences in these two types of language models, whether you're predicting the next word, or whether you're predicting the masked out token, they're similar in that they can both be trained over large amounts of unlabeled text. And this is one of the reasons why they've been so widely adopted.
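As a quick reference, the two objectives just described can be written as follows. The notation here is mine, not from the lecture slides:

```latex
% Standard (left-to-right) language model: predict the next word
P(x_1, \dots, x_T) \;=\; \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1})

% Masked language model: predict each masked-out token from bidirectional context,
% where M is the set of masked positions and \tilde{x} is the corrupted input
\mathcal{L}_{\mathrm{MLM}} \;=\; -\sum_{i \in M} \log P(x_i \mid \tilde{x})
```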

They don't require any human annotated data. So you've seen that language models can be used for a variety of tasks, from summarization to dialogue to fluency evaluation, tasks that involve either generating text or evaluating the probability of text. And more recently, we've seen that language models can also be used to generate pre-trained representations of text that encode some notion of language understanding, and has been shown to be widely useful for different downstream NLP tasks.

And then finally, today we're going to touch on this idea that if language models are trained over massive amounts of text, can they even be used as a knowledge base? So we're going to start by looking at what types of factual knowledge a language model might already know. And these examples are taken from a paper by Petroni et al.

in EMNLP a couple years ago. And the goal is to test the factual or common sense knowledge in existing language models such as BERT-Large. So let's check out what BERT-Large predicts. iPod Touch is produced by Apple, London Jazz Festival is located in London, Dani Alves plays with Santos, Carl III used to communicate in German, and ravens can fly.

So here we have the correct predictions in green and the incorrect predictions in red. And if you know anything about sports, you may know that Dani Alves is a soccer player, Santos is a soccer team. Here they were hoping that it would predict Barcelona, because at least at the time of this data set, apparently he played for Barcelona.

And Carl III actually used to communicate in Swedish, not German. So what's good about these examples is the predictions are generally reasonable. If you didn't know the ground truth, they all make sense. When you want to predict a language, you do in fact predict the language. But of course, they're not all factually correct.

So why might this happen? Well, for one, the fact might not have been seen in training. And you can't expect the language model to do more than recall facts that it has seen in training. It can't make up facts about the world, for instance. It's also possible the fact is just really rare.

So maybe the language model has seen the fact during training, but it hasn't seen it enough times to actually memorize the fact. And the last issue is a little more subtle, which is that the model might just be very sensitive to the phrasing of the fill-in-the-blank statement. And so for example, you might have statements like X was created in blank that the model can't predict correctly.

But if you change it to X was made in blank, suddenly it can predict it correctly. And we'll come back to this and how to actually evaluate the knowledge in these language models. So this inability to reliably recall knowledge is a key challenge facing language models today. And it'll be the focus of this talk.

Recent works have found that language models can recover some knowledge, including the work that Colin presented last class. They've had very encouraging results. But there's still a way to go, as we saw with the fill in the blank statements and with these challenges that we just discussed above. So as a result, the past couple of years have had a ton of rapid progress in this area of research in terms of trying to figure out how do you actually encode more knowledge in language models.

So I also want to motivate why researchers are interested in building language models that can more reliably recall knowledge. And one of these reasons is that the pre-trained representations are used in a variety of downstream tasks. And some of these downstream tasks are knowledge intensive. So for instance, you might have a downstream task to extract the relations between two entities in a sentence.

And this is commonly known as relation extraction. And this is much easier if you have some knowledge of the entities, which could be potentially provided by this pre-trained language model representation. And when we talk about evaluation, we'll talk about what types of tasks are most likely to benefit from these knowledge rich pre-trained representations.

And then as a stretch goal, some researchers are starting to propose the idea that language models could actually ultimately be used to replace traditional knowledge bases. So instead of querying a knowledge base for a fact, like you might right now with SQL, you'd query a language model with a natural language prompt.

And of course, this does require the language model to have high quality on recalling facts. So we might not be there yet, but it's an interesting direction for us to be moving towards. So I want to make it super clear what I mean by a knowledge base. Here we're just talking about a knowledge graph where the nodes in the graph would be entities and the edges are going to be relations between the entities.

So for example, here we have a subset of a knowledge graph for Franklin D. Roosevelt, and you see the information about his spouse, his place of birth, his date of birth, and so on. An important thing to note is this is a structured way of storing the knowledge, since it's just in a graph form.

And you can actually describe these graphs with knowledge graph triples, which will be an important vocabulary word throughout this talk. So knowledge graph triple would be consisting of a subject entity, a relation, and then an object entity. So for instance, here we might have Franklin D. Roosevelt, date of birth, January 30th, 1882.

And that would form a knowledge graph triple. We'll also refer to this as a parent entity, a relation, and a tail entity. So Wikidata is one very popular knowledge base you might come across if you're working in this area. It's a free knowledge base that's actually populated by humans, so they're filling in these relations and entities.

And it's also multilingual. So if you want information from this knowledge base, what you do is you would write a SQL query. This is a simplified one, but the idea is you'd want to figure out the date of birth of Franklin Roosevelt, so you would write a query as follows.

Now if instead you want to create a language model as a knowledge base, you'll have something like this diagram that you've actually probably seen in several lectures now. And the idea is you'll train a language model over this unstructured text, and then you'll use a language model to just answer these natural language query statements.

So here, this is the work on T5, where they're training T5 over natural language or just unstructured text with a span corruption task. And then they're asking T5, when was Franklin D. Roosevelt born? And the idea is T5 will produce a textual answer. So you can see this contrast very much with the old approach of using a traditional knowledge base, where the knowledge base is structured, and you have these SQL statements to query it.

So what are the advantages of using language models over traditional knowledge bases, and why might people think this could be a good idea? Well, for one, the language models are pre-trained over large amounts of unstructured and unlabeled text. Whereas traditional knowledge bases require manual annotation, like with wiki data, people actually are populating it, or complex NLP pipelines to extract from unstructured text into a structured form that forms a knowledge base.

Language models can also support more flexible natural language queries. So if we take the example, what does the final F in the song UFOF stand for? A knowledge base probably won't have a field for final F, so it won't be able to answer your query. But there's a chance that a language model could actually learn and have a response for this natural language query.

They also had a less extreme example in this paper by Petroni and others, where maybe your relation would be "works for" in your knowledge base, and then you ask for "is working for". And the knowledge base doesn't have an exact match for that field, and so it returns an empty response.

And it's reasonable to believe that your language model could figure out that these relations are similar, so if I know the answer to one of them, I probably know the answer to the other. Of course, it's not all advantages. There's also many open challenges to using language models as knowledge bases.

So for one, it's harder to interpret. When a traditional knowledge base produces an answer, there's actually provenance information associated with why it returned that particular answer. But with a language model, it's really not clear why it might produce a prediction. The knowledge is just encoded in the parameters of the model.

It's also harder to trust. So you saw this in assignment 5, where the language model could produce realistic predictions, but they are incorrect. So it's not easy to know when the language model actually knows the fact, versus it's using some biases to make its prediction. And in the case of the traditional knowledge base, if it doesn't know a fact, it's just going to have an empty response.

And then finally, language models are harder to modify. So in a knowledge base, if you want to update a fact, you just change the fact directly in the structured data. But in a language model, it's not quite clear how you would do this. You could fine tune the model longer on the updated data, but how do you know if it still has some memorization of the old fact?

So there are a lot of open challenges to this goal of actually using language models as traditional knowledge bases. But hopefully you see why some people think this could actually be a good idea, and why researchers are interested in training language models that can actually integrate more knowledge. So that brings us to section 2 of the talk.

So I want to pause here just in case there's any questions. OK. I think that's OK, yeah. OK, awesome. So now we're going to be talking about what techniques researchers are using to actually add more knowledge to language models. So we're going to talk about three broad classes of techniques.

This is by no means exhaustive, but hopefully it gives you a good overview so that if you want to dive deeper, you can. So we'll start by talking about adding pre-trained entity embeddings. And for each section, we'll kind of focus on the first work that you see in the bullets.

But we'll also talk about briefly some of the variants so you see how the works within each class can differ and what knobs you can turn. So for adding pre-trained embeddings, we first need to figure out what pre-trained embeddings would actually be the most useful to add knowledge to language models.

And this can start with an observation that facts about the world are usually in terms of entities. So if we have a fact like Washington was the first president of the United States, we have the entities Washington, United States. But pre-trained word embeddings don't have this notion of entities.

So we'd have different word embeddings for USA, United States of America, and America, even though these all refer to the same entity. And this makes it challenging for the language model to actually learn any representations over these entities, since they may be referred to many ways in the text.

So what if instead we have a single embedding per entity, and we'll refer to these as entity embeddings. So now you'd have a single entity embedding for USA, United States of America, and America. And whenever you see a phrase in text referring to this entity, you would use the same entity embedding.

And these entity embeddings can actually be pre-trained to encode this factual knowledge about the world. And the first class of techniques we'll be looking at will be how do you actually best use these pre-trained entity embeddings in a language model. So I need to make a quick note that these entity embeddings are only useful to language models though, if you can do another NLP task called entity linking well.

So I'm going to take a quick aside and explain what is entity linking. So a definition of entity linking is to link mentions in text to entities in a knowledge base. I like to think about this in terms of how you use word embeddings. So if you want to use word embeddings and you have a sentence, you're going to first tokenize that sentence into words.

And then for each word, you're going to look up their corresponding ID in some word embedding matrix. And now you have your word embedding. Well, for entity embeddings, the dictionary lookup isn't so easy. You might have sentences like Washington is the first president of the United States. Well, Washington has two different candidates.

Are we talking about George Washington? Or are we talking about Washington State? And these are different entities that have different entity embeddings. And the QIDs here would just be their identifiers in Wikidata. And then United States just has a single entity. So the task of entity linking is to figure out, for these ambiguous mentions, which entities they actually link to in a knowledge base.

And there's many different ways you can do this entity linking. So one way you might be able to do this is to figure out that, oh, I see the context word of president. So Washington probably links to George Washington. Just some more definitions, we're going to refer to Washington as a mention, United States as a mention.

And then the things that the mention could link to, so the two options for Washington are going to be candidates. So this is a whole research area of its own. And I encourage you to check out the resources at the bottom if you're interested in learning more. But right now, the most important thing to understand is that entity linking is what is going to tell us which entity embeddings are actually relevant to the text and which ones you want to use as you iterate through a sequence.
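To make the lookup-and-disambiguate flow concrete, here is a toy sketch in Python. The candidate table, the context words, and the overlap-based score are made up purely for illustration; real entity linkers are learned models, and the QIDs are the Wikidata identifiers as best I recall them.

```python
# Toy entity linker: map ambiguous mentions to knowledge-base entities (QIDs).
# The candidate tables and the context-overlap score are hypothetical illustrations.

CANDIDATES = {
    "Washington": ["Q23",     # George Washington
                   "Q1223"],  # Washington State
    "United States": ["Q30"],
}

# A few context words we (hypothetically) associate with each candidate entity.
ENTITY_CONTEXT = {
    "Q23": {"president", "general", "founding"},
    "Q1223": {"state", "seattle", "pacific"},
    "Q30": {"country", "president", "states"},
}

def link(mention: str, sentence: str) -> str:
    """Pick the candidate whose context words overlap most with the sentence."""
    words = set(sentence.lower().split())
    candidates = CANDIDATES.get(mention, [])
    return max(candidates, key=lambda qid: len(ENTITY_CONTEXT[qid] & words))

sentence = "Washington was the first president of the United States"
print(link("Washington", sentence))      # -> "Q23" (George Washington)
print(link("United States", sentence))   # -> "Q30"
```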

And Megan, there are a few questions around here. One of them is, so that's entity linking, but what about the relations? Yeah, so some of the works we'll talk about will only use the entity embeddings. So some of these have been pre-trained with relation information, but in the end, you only have an entity embedding.

So relation extraction is yet another NLP task that you could also do. But yeah, here we're just talking about entity linking. And if you have the knowledge graph you showed earlier, it had relations in it, right? Do you get any connection between that and the text? I mean, that's the goal of relation extraction, right?

It's to figure out, like, given the entities, what is the relation between them, which would then form the full triple of head entity, tail entity, and relation. Okay, then I think people want to know more about how this is going to be used, but maybe you should go on and show some examples.

Yeah, I will, for sure. Okay, right. So entity embeddings, just to summarize, they're like word embeddings, but they're for entities in a knowledge base. So you'll have some vector associated with George Washington, and it should be meaningful in embedding space such that maybe the George Washington vector is close to the vectors for other founding fathers.

So we're going to briefly talk about some methods for training entity embeddings. There's knowledge graph embedding methods. You might have heard of the TransE embedding method. So this starts from the idea of having these knowledge graph triples, and you want to learn pre-trained entity and pre-trained relation embeddings. And you want it to be the case that the subject embedding and the relation embedding, the sum of those two, is close to the object embedding in vector space.

So it's an algorithm to learn that constraint (sketched below). There's also word-entity co-occurrence methods. So these build off of word2vec. One of them is even called Wikipedia2Vec. And the idea is given an entity, you want to figure out what words are most likely to co-occur around it. And then the last method, or one of the other methods that is common now, is actually just using the transformer to learn representations of an entity by encoding the entity description.

And so BLINK from Facebook is an approach that does this. So the methods we'll talk about today are actually agnostic to how you train your pre-trained entity embeddings. But I think it's important to know that there's actually a wide variety of methods to train these pre-trained entity embeddings. And it's actually not clear which method is best for using them downstream in language models.
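For reference, the TransE constraint mentioned above can be written as follows. The notation is mine; the original TransE paper trains this with a margin-based ranking loss over corrupted triples:

```latex
% For a true triple (s, r, o): subject embedding plus relation embedding
% should land near the object embedding
\mathbf{e}_s + \mathbf{r} \;\approx\; \mathbf{e}_o,
\qquad
f(s, r, o) \;=\; -\,\lVert \mathbf{e}_s + \mathbf{r} - \mathbf{e}_o \rVert
```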

So one of the key challenges of using pre-trained entity embeddings in language models is figuring out how to incorporate them when they're from a different embedding space than the language model. And so what we'll do, or the approach we'll look at today, we'll learn a fusion layer to combine this context and entity information.

So we have entity embeddings and we have the contextualized word embeddings from our language model. So if we take a sequence of text and we imagine that j indicates the j-th element in a sequence, then the challenge here is you want to figure out how do we combine some word embedding w_j with some aligned entity embedding e_k.

So here an alignment could be like in the example where we had Washington was the first president. Washington would be your word embedding and George Washington would be the aligned entity embedding there. So you could imagine in this case, let's say your w_j is Washington and your e_k is your entity embedding for George Washington.

And you want to align them together. So what you can do is learn a weight matrix W_t for the text and W_e for the entity to project these embeddings to the same dimension before you sum them and finally take an activation function over them. So the idea is that by having some fusion layer mechanism like this, you can actually use these entity embeddings and these contextual word embeddings that are in different embedding spaces and fuse them together to have this single hidden representation for the element in the sequence.
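Written out, the fusion layer just described is roughly the following (notation assumed from the description; F is some nonlinearity):

```latex
h_j \;=\; F\!\left( W_t\, w_j \;+\; W_e\, e_k \;+\; b \right)
% w_j: contextualized word embedding at position j
% e_k: pre-trained embedding of the entity aligned to position j
%      (when no entity is aligned, the entity term drops out and only W_t w_j + b remains)
% W_t, W_e: learned projections into a shared space; b: bias
```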

So the approaches we'll talk about today all have some mechanism either very similar to this or some variation of this to do this combination of the context and entity information. So the first approach we're going to talk about is called ERNIE: Enhanced Language Representation with Informative Entities. And so this just builds on what we've already talked about.

It uses pre-trained entity embeddings and it also uses this notion of a fusion layer. So the first block in ERNIE is a text encoder, which is a multilayer bidirectional transformer encoder. For their experiments, they use BERT, but it doesn't have to be BERT. And this is followed by a knowledge encoder, which has stacked blocks composed of two multi-headed attentions.

One is over the entity embeddings and one is over your token or subword embeddings. And then the output of these contextualized entity and token embeddings from the multi-headed attentions are passed to a fusion layer, which looks very similar to what we just looked at. But now you also have new word and entity embeddings that you're producing as output of your fusion layer.

So you see this w_j and this e_k, which are produced as the next layer of word and entity embeddings. So the i here indicates that it's the i-th block in the knowledge encoder. So you'll actually have multiple stacks of these knowledge encoders and you'll be doing a fusion of the word and entity embedding, producing new word and entity embeddings, and then passing this to the next block of the knowledge encoder.

So this is what the architecture diagram looks like. On the left side, we have the T encoder or the text encoder, followed by the K encoder or the knowledge encoder. And then on the right side, we have a zoomed in version of your knowledge encoder. So you see the multi-headed attentions over the tokens in orange, and then over the entities in yellow.

And then you have this alignment between the words and entities with the dashed lines. So they have this example, Bob Dylan wrote "Blowin' in the Wind" in 1962. The entities here are Bob Dylan and "Blowin' in the Wind." And they have a simple alignment rule where you want to align the entity to the first word in the entity phrase.

So you want to align Bob Dylan to Bob, that's what the dashed lines try to indicate, and you want to align "Blowin' in the Wind" to "blow." So here, this already assumes that entity linking has been done, and you know your entities in advance. So you can see that the entities are actually input into the model.

So after you have your word and entity alignment, this goes through the information fusion layer in this light purple-gray color. And then finally, it produces these new word and entity embeddings as output. And then remember that you have multiple blocks of these, so those will be passed into the next block of your knowledge encoder.

So how do you actually train this? It's pretty similar to BERT. You have a masked language model loss, and you have a next sentence prediction loss. And then they also introduce a knowledge pre-training task, which they refer to as the dEA task. It's short for denoising entity autoencoder, named by analogy to the denoising autoencoder from an ICML paper in 2008.

And the idea is they're going to randomly mask these token entity alignments. So the idea that Bob goes to Bob Dylan, they're going to mask that out with some random percentage. And then they're going to predict the corresponding entity for a token out of the entities in the sequence.

So this looks as follows. The summation is over the m entities in the sequence. So this would be over Bob Dylan and Blowin' in the Wind in the previous example. And given a particular word, they want to figure out what entity is it most likely to align to in that sequence.

So does Bob align to Bob Dylan, or does Bob align to Blowin' in the Wind? And their motivation for doing this is that if you don't have this task, all you're ever going to be predicting is a token with the masked language model loss. And you really, to encode knowledge, should also probably be predicting over entities.

So by adding this task, they have some kind of task that is actually predicting the entity. And they also suggest that this might better fuse the knowledge, or the entity and the word representations, than just using the fusion layer. Their final loss is then the summation of the masked language model loss, the next sentence prediction loss, and this dEA knowledge pre-training task loss.
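Putting the pieces together, the pre-training objective is roughly the following. The dEA term is written here as a softmax over the m entities in the sequence, which is my reading of the paper's description rather than a verbatim copy:

```latex
\mathcal{L}_{\mathrm{ERNIE}} \;=\; \mathcal{L}_{\mathrm{MLM}} \;+\; \mathcal{L}_{\mathrm{NSP}} \;+\; \mathcal{L}_{\mathrm{dEA}}

% dEA: for a token w_i, predict which of the m entities in the sequence it aligns to
p(e_j \mid w_i) \;=\;
  \frac{\exp\big(\mathrm{linear}(w_i) \cdot e_j\big)}
       {\sum_{k=1}^{m} \exp\big(\mathrm{linear}(w_i) \cdot e_k\big)}
```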

So they show in an ablation experiment that it's actually very important to have this knowledge pre-training task. So this has BERT on the leftmost bar, ERNIE as the second bar from the left. And so that's with all the features of ERNIE. And then they try removing the pre-trained entity embeddings and removing this knowledge pre-training task.

So you see that BERT performs the worst. This isn't very surprising, and that ERNIE performs the best. But what's interesting is that if you remove the entity embeddings or you remove the pre-training task, they only do a little better than BERT. And so it's really necessary to actually use this pre-training task to get the most use out of your pre-trained entity embeddings.

So some strengths of this work were that they introduced some way to combine entity and context information through this fusion layer and this knowledge pre-training task. And then they also show improved performance on downstream tasks, which we'll come back to when we talk about evaluation. But of course, there's also some limitations.

So it needs text data with the entities annotated as input. And this is even true for downstream tasks. So if you remember on the architecture diagram, we had the entity information actually input into the architecture. But it's not very realistic that you're necessarily going to have a good entity linker for any downstream tasks that you want to use ERNIE on.

And the next challenge is this requires more pre-training of your language model. So now you don't just need to pre-train BERT, but you also need to pre-train your knowledge encoder on top. For the first challenge, we're going to actually talk about a work that presents a solution to address this.

For the second challenge, I encourage you to check out the footnote on the bottom. This introduces a work that actually uses pre-trained entity embeddings, uses them in a language model, and doesn't require any more pre-training. So it's pretty cool. I guess that's all I have for ERNIE. So I want to pause here for questions.

Well here's one that's up here. So on the fusion layer: is it observed that passing the entity embedding into a fusion layer to combine with the word embedding is more powerful than just concatenating the entity embedding onto the end of the word embedding? Yeah, so I guess people are still a little bit confused as to the motivation for that fusion layer.

And so I guess here it's this, the simplest strategy would be, since you've got the entity linking, you could just concatenate entity embeddings onto the end of word embeddings and do regular BERT, and wouldn't that work just as well? I think the idea is that it wouldn't, because if you imagine that, let's say your magnitudes are very different, you need some way to, I guess, align the spaces so that anything meaningful in the entity embedding space is still meaningful in the word embedding space.

So if you're close in the word embedding space, you also would be, you'd want to be close in entity embedding space. So I guess that's one argument. I mean, I think the question isn't, you know, it's a good question as people say. I mean, it's not completely obvious that it wouldn't work to do that.

It seems like one of the potential problems is some words have entity links to them and some words don't. And so you, then you'd sort of have zero vectors for the ones that don't have anything linked. And that might act a bit weirdly, but. Yeah. In this case, when they don't have entities linked, which is a great point.

Yeah. The first equation just simplifies to the first term plus the bias. So like there's an obvious solution in that case when you're not concatenating that you just don't add on the term. Yeah, that could be one reason too. Okay. Are there any other questions? I think you can go on.

Okay, cool. Right. So now we're talking about KnowBert. And this is from the same folks that introduced the ELMo work. And the idea here is that they're going to pre-train an integrated entity linker as an extension to BERT. And so their loss function will now be the summation of the next sentence prediction loss, the masked language model loss, and this entity linking loss.

So instead of the knowledge pre-training dEA task from ERNIE, we'll have an entity linking loss. And the idea of the entity linker is you'll now have just a normal sequence as input, and the integrated entity linker will figure out what the mentions in the sentence are, what the candidates for those mentions are, and then what the scores of those candidates should be given the context of the sentence.

And so this is all done now as part of the model rather than requiring it as some external pipeline stage before you could even use ERNIE, for instance. So now for downstream tasks, you no longer need these entity annotations. Your integrated entity linker will figure out what the correct entity is and be able to use the correct entity embedding.
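Schematically, the integrated linker scores each candidate entity for each detected mention span against the span's contextual representation. The sketch below is a simplified paraphrase of that idea, not the actual KnowBert architecture, which additionally uses candidate priors and re-contextualizes the entity information back into the hidden states.

```python
import torch
import torch.nn as nn

class ToyIntegratedLinker(nn.Module):
    """Toy sketch: score candidate entities for each mention span in context."""

    def __init__(self, hidden_dim: int, entity_dim: int):
        super().__init__()
        # Project the span representation into the entity embedding space.
        self.span_proj = nn.Linear(hidden_dim, entity_dim)

    def forward(self, span_repr: torch.Tensor, candidate_embs: torch.Tensor) -> torch.Tensor:
        # span_repr:      (num_mentions, hidden_dim)  pooled representation of each mention
        # candidate_embs: (num_mentions, num_candidates, entity_dim)  pre-trained entity embeddings
        query = self.span_proj(span_repr)                        # (num_mentions, entity_dim)
        scores = torch.einsum("md,mcd->mc", query, candidate_embs)
        return scores.softmax(dim=-1)                            # distribution over candidates

# Example: 2 mentions, 3 candidates each, with made-up dimensions.
linker = ToyIntegratedLinker(hidden_dim=768, entity_dim=300)
probs = linker(torch.randn(2, 768), torch.randn(2, 3, 300))
print(probs.shape)  # torch.Size([2, 3])
```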

So there's also this idea that learning this entity linking may actually better encode knowledge than the dEA pre-training task, because they show that KnowBert actually outperforms ERNIE on downstream tasks. So one reason this may occur is that if you think about the dEA task, it's actually a bit simpler than just entity linking.

So you're trying to predict, for instance, what Bob links to out of Bob Dylan and Blowin' in the Wind, and it's much easier even as a human to see that Bob will more likely link to Bob Dylan than that Bob will link to Blowin' in the Wind. And in the entity linking task, you actually have a much harder set of candidates to predict over.

You're not just looking at the ones in the sentence. So deciding whether Washington links to George Washington or Washington State actually requires using more information about the entity. So given it's a harder task, it's not too surprising that it might perform better than just this easier knowledge pre-training task that ERNIE introduced.

So otherwise, KnowBert has a lot of similarities to ERNIE. It uses a fusion layer that combines this context and entity information, and it introduces some knowledge pre-training task. So I'd say a high-level takeaway is if you want to use pre-trained entity embeddings in a language model, you'll probably at least want to consider both of these components in terms of how you actually integrate the pre-trained entity embeddings and take the most advantage of the knowledge in them as possible.

So that brings us to the next class of techniques, which is using an external memory. And here we'll mainly focus on this work called KGLM, and then we'll also briefly talk about kNN-LM. So the previous methods that we've talked about have relied on pre-trained entity embeddings to encode the factual knowledge from knowledge bases.

And the one problem with this, or one of the problems with this, is if you want to, let's say, modify your knowledge base, you now need to retrain your entity embeddings and then retrain your language model on top of those entity embeddings. So this begs the question, are there more direct ways than pre-trained entity embeddings to provide the model with factual knowledge?

And so what we're going to talk about is how you can actually use an external memory or a key value store to give the model access to either knowledge graph triples or context information. And a key thing about this external memory is that it's independent of the learned model parameters.

So this means you can actually support injecting and updating factual knowledge. You can do this directly to the symbolic external memory by, let's say, changing the value for a particular key or maybe adding another key. And you don't have to pre-train or retrain your entity embeddings when you make this change.

And the approaches we'll talk about today can actually even have these updates to the external memory without more pre-training of the language model. So that's pretty neat. And then another benefit of using external memory over these pre-trained entity embedding approaches is it can also be more interpretable. So if you have an error in your model where it's not predicting a correct fact, it's very challenging to figure out with pre-trained entity embeddings what the problem might be.

Was it the original knowledge base? Was it the encoding in the entity embeddings? Is it how the language model is using the entity embeddings? And here you have a little more information with an external memory in that you can look in the external memory and see, was the fact in the external memory?

Was it not in the external memory? And so on. So it adds a little bit more interpretability than just using these pre-trained entity embeddings as an indirect way to encode the knowledge base. So the first work we're going to talk about is called KGLM. And unlike the other approaches we've talked about so far, this actually uses LSTMs and not transformers.

So the key idea here is to condition the language model on a knowledge graph. So recall with the standard language model, we want to predict the next word given the previous words in the sequence. So now we also want to predict the next entity given the previous words in the sequence and given the previous entities in the sentence, or the entities that are relevant to the sentence, I should say.

So KGLM will be building a local knowledge graph as it iterates over the sequence. And a local knowledge graph is just a subset of a full knowledge graph that only has the entities that are actually relevant to the sequence. So if we have this example here, a simplified example from the paper, that Super Mario Land is a game developed by blank.

And Super Mario Land here is an entity. You'd want a local knowledge graph as follows, where you see that Super Mario Land is in the local knowledge graph, but we also have the relations to Super Mario Land to other entities that are copied from the full knowledge graph into this local knowledge graph.

And you would build up this local knowledge graph as you iterate over the sentence. So whenever you see an entity, you would add it to the local knowledge graph as well as its relations to other entities. So obviously this is a much smaller example than what would really have all the relations to Super Mario Land, just for the purpose of the example.

But hopefully it's clear that all of these are relevant to the sequence. Something important to note here is that this does assume that the entities are known during training so that you do have this entity annotated data for training, and therefore your local knowledge graph is always the ground truth local knowledge graph as you iterate over the sequence.
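A minimal sketch of how such a local knowledge graph might be maintained while iterating over the sequence; the triple format and helper function are made up for illustration:

```python
# Full knowledge graph stored as (subject, relation, object) triples (toy subset).
FULL_KG = {
    ("Super Mario Land", "publisher", "Nintendo"),
    ("Super Mario Land", "platform", "Game Boy"),
    ("Super Mario Land", "genre", "platform game"),
}

def add_entity(local_kg: set, entity: str) -> None:
    """When an entity is mentioned, copy all of its edges from the full KG."""
    for triple in FULL_KG:
        subject, _, obj = triple
        if entity in (subject, obj):
            local_kg.add(triple)

local_kg = set()                             # empty at the start of the sequence
add_entity(local_kg, "Super Mario Land")     # after seeing the first entity mention
# local_kg now contains every triple involving Super Mario Land, so "Nintendo"
# is already sitting there as a tail entity when we get to predicting the next word.
```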

So why might this be a good idea to do this? Well, here, the next word you want to predict is Nintendo. And you may notice that Nintendo is in your local knowledge graph. So sometimes this local knowledge graph can actually serve as a very strong signal for what you want to predict for your next word.

Now, you may be thinking, well, this wouldn't always be helpful. And that's true as well. So if you look at just like the third word in the sequence and you want to predict that word, so is a game, for instance, well, if this isn't in the local knowledge graph, this wouldn't be necessarily that helpful.

You would just do a standard language model prediction. Or if you're at the beginning of the sequence, your local knowledge graph is empty. So of course, you're not going to get any signal from it. So the first question they ask in KGLM is how can a language model know when to use a local knowledge graph and when it might actually be useful for predicting the next word?

So we're going to keep the same example as a running example. And we have our local knowledge graph here. We now have an LSTM that looks similar to the representations you've seen throughout this class. And normally, you've seen the LSTM predicts the next word. Well, now we're also going to use the LSTM to predict the next type of the word.

So is the next word going to be a related entity, meaning it's in the local knowledge graph already? Is it going to be a new entity, meaning it's not in the local knowledge graph? Or is it going to be not an entity, in which case you just revert to a normal LSTM prediction?

And they're going to use the LSTM hidden state to do this prediction of the type of the next word over these three different classes that they might want to consider. So in the case of Super Mario Land is a game developed by Nintendo, we saw that this would be a related entity case because we saw that Nintendo was in the local knowledge graph.

For the other cases, Super Mario Land would be a new entity case since the local knowledge graph is empty at that point. And then any of the words between Super Mario Land and Nintendo would be non-entity, as they're just a standard LSTM language model prediction that doesn't involve any entities.

So now we need to talk about what the language model actually does in these three different scenarios to predict the next entity and the next word. So we're going to keep the example up at the top in case you want to refer back to the three different cases. And we're going to start with the related entity case.

So here we assume that the next word or entity is actually in your local knowledge graph. And remember that we can describe a knowledge graph in terms of triples, so in terms of pairs of parent entities, relations, and tail entities. And in the case of predicting the next word as Nintendo, there's only one possible parent entity in the local knowledge graph, which is Super Mario Land.

And the goal is you want to figure out what is the most relevant triple that will be useful in helping to predict the next word. So in this case, you could have the triple Super Mario Land publisher Nintendo. You might have the triple Super Mario Land genre platform game.

Which of these is actually helpful in predicting that Nintendo should be the next word? So here, what you would want KGLM to do is predict that the top scoring parent entity is Super Mario Land, and the top scoring relation is publisher. And you can see there are actually contextual cues in the sentence that could help you figure out which triple you're talking about.

And then given that your top scoring parent entity is Super Mario Land, and your top scoring relation is publisher, you can figure out that using knowledge graph triples, the tail entity has to be Nintendo. And therefore, this gives you a strong signal that the next word will be Nintendo.

So the goal is you're going to find the top scoring parent entity and the top scoring relation using the nodes in your local knowledge graph. And you can do this by using the LSTM hidden state combined with pre-trained entity and relation embeddings. So I do admit I cheated here a little bit in that this does use pre-trained embeddings.

But hopefully you'll see by the end of this discussion, why I think it fits a bit better in this external memory use case as well. So what they're going to do is they're going to take a softmax using LSTM hidden state and the entity embeddings for each of the potential parent entities.

And they'll take this top scoring one as a parent entity. And they'll do the same thing for the relation embeddings. The next entity is then just this tail entity from the knowledge graph triple. So it's relatively trivial to figure out what the next entity should be once you've figured out the top scoring parent entity and your top scoring relation.
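In code, the related-entity case boils down to softmaxes of the LSTM hidden state against the pre-trained embeddings, roughly as in the sketch below. Shapes, variable names, and the toy local knowledge graph are assumptions for illustration, not the authors' implementation.

```python
import torch

def score(hidden: torch.Tensor, embeddings: torch.Tensor) -> torch.Tensor:
    """Softmax of the LSTM hidden state against a set of candidate embeddings."""
    # hidden: (d,)   embeddings: (n, d)   returns: (n,) probabilities
    return torch.softmax(embeddings @ hidden, dim=0)

# Toy setup: in KGLM these come from the LSTM and pre-trained entity/relation embeddings.
d = 8
h_t = torch.randn(d)                                   # LSTM hidden state at this step
parents = ["Super Mario Land"]
relations = ["publisher", "genre", "platform"]
parent_embs = torch.randn(len(parents), d)
relation_embs = torch.randn(len(relations), d)
local_kg = {("Super Mario Land", "publisher"): "Nintendo",
            ("Super Mario Land", "genre"): "platform game",
            ("Super Mario Land", "platform"): "Game Boy"}

# Related-entity case: pick the top-scoring parent entity and relation, then read the
# tail entity straight off the corresponding knowledge graph triple.
parent = parents[score(h_t, parent_embs).argmax().item()]
relation = relations[score(h_t, relation_embs).argmax().item()]
next_entity = local_kg[(parent, relation)]             # "Nintendo" if "publisher" wins
print(parent, relation, next_entity)
```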

And then finally, to predict the next word, they take the vocabulary and they expand it to include different aliases that could refer to that entity. So what we mean by aliases here are phrases that could refer to the entity in text. So you might not just call it Nintendo.

You might also say Nintendo Company, or some other alias of the entity. And you want any of these to be possible words that you could predict as the next word. So the goal of this vocabulary expansion is to increase the probability that the next word you predict will actually be related to this next entity.

So a new entity case is a bit simpler. This means that the entity that you're predicting is not in the local knowledge graph. So you're not getting any signal from this local knowledge graph that you've been building up. And all you want to do is find the top scoring entity in the full knowledge graph.

And you can do this using the LSTM hidden state and pre-trained entity embeddings, similar to how we found the score for the top parent entity. Your next entity will just be the top scoring entity out of the full knowledge graph. And then your next word is once again this vocabulary expanded to include aliases of that entity.

The not-an-entity case is the simplest. You just revert to the normal LSTM. You don't have a next entity to predict. And your next word is just the most likely next token over your normal vocabulary. So here's a diagram from the paper that hopefully summarizes and makes even clearer what I just went over.

So they have a longer example than the one we are looking at, but it's the same prediction of Nintendo as the next word. And they have their predictions in red. So this is what they want KGLM to predict. The three different cases are in the horizontal rows. And we see that here you're in the related entity case, since Nintendo is in your local knowledge graph.

So they want KGLM to predict that Nintendo should be a related entity type of word, that Super Mario Land should be its parent entity, that publisher should be the relevant relation. And as a result, the next entity is Nintendo. And then they expand the vocabulary. You see the aliases of Nintendo at the bottom.

And then finally, they actually predict Nintendo as the next word. And the other cases just summarize what we also already went over. So they find that KGLM actually outperforms GPT-2 and AWD-LSTM, which is a strong LSTM language model, on a fact completion task similar to the fill-in-the-blank examples that we looked at at the beginning of the talk.

They also find qualitatively that compared to GPT-2, KGLM tends to predict more specific tokens since it can predict these tokens from just copying from the local knowledge graph. Whereas GPT-2 will tend to predict more generic tokens. So if you want to predict the birthplace of someone, GPT-2 is more likely to predict New York, for example, and KGLM might predict some obscure place.

And then they have these really cool set of experiments where they show that KGLM actually supports modifying or updating facts. So they made a direct change in the knowledge graph, and then they saw what is the change in KGLM's predictions. So they have this example where the sequence was Barack Obama is born on blank.

They had their knowledge graph triple as Barack Obama's original birth date, and then their most likely next tokens were as expected, August 4, 1961. And then they just changed their knowledge graph. So they changed the birth date of Obama. They said, OK, he's now born 2013. And they looked to see what the next predictions were for KGLM, and it changed its predictions to match what was in the local knowledge graph.

So this is something that's pretty cool and that really only external memory approaches can do compared to the original pre-trained entity embedding approaches we talked about. And I think it's one of the reasons that KGLM, at least in my opinion, fits better in these external memory use cases. Right.

So the next slide is a different paper. So I guess I'll take questions on KGLM if there are any. It's a pretty complex method, so feel free to have questions. Yeah, could you one more time explain what the definition of the local knowledge graph is in relationship to the global knowledge graph?

Yep. So a local knowledge graph is supposed to be a subset of the full knowledge graph, and it's only supposed to consist of entities that have actually been seen in the sequence as well as their relevant entities. OK. Oops. All right. So here you see that Super Mario Land is in the local knowledge graph because Super Mario Land is an entity that is seen in the sequence.

And then you also want to copy over all the edges from Super Mario Land that would be in the full knowledge graph. So this is just a subset of them for the purpose of the example. But you see that Super Mario Land has an edge to Nintendo, to Game Boy, to platform game.

And so you would copy all edges that Super Mario Land has to another node in the full knowledge graph. And they know in advance, like they have the labels here for what the entities are during training. So that's how they can actually create this ground truth knowledge graph. And then briefly, a student asked why we can't just use the whole knowledge graph.

And I gave an answer, but maybe you know better. Yeah, I think the idea is the signal will be much stronger if you just use a local knowledge graph. So in the Softmax for the related entity case, you would just be predicting over the potential parent entities in your local knowledge graph, which is a much smaller set than what's in your full knowledge graph.

So I guess it's more likely that you're going to predict something that is correct in that case than when you have like 5 million or so entities in your full knowledge graph. It's also much cheaper to compute. In this case, there's only a single parent entity, but you could have multiple parent entities that you're trying to compute which one's most likely over.

Is that what you were also thinking, John? Yeah, I mainly just said efficiency. So the signal thing is cool too. Here's an exciting question. What about queries that require more than one step in the knowledge graph, such as the location of the publisher of Super Mario Land? Yeah, that's a good question.

So the idea is like, can it support those types? Like does it support multi-hop kind of building of the knowledge graph? Yeah, yeah. How does KGLM perform in those cases? Yeah, I don't know. That's a very good question. They built up the knowledge graph so that it's just single hop as far as I know.

But like if you saw the other entities, if you were to see the entities along the hops, it would have them in the local knowledge graph. Yeah, that's a good question. I don't know if they explored that. Great. Okay, let's move along then. Okay, so the next piece of work we're going to talk about, you guys have actually briefly seen in the natural language generation lecture.

But I'm going to go over it again quickly here. So unlike the other works that we've talked about that have used knowledge graph triples, this is actually going to take kind of a looser notion of knowledge in that the knowledge will just be encoded in the text in the training data set.

So this is called kNN-LM. And the idea is that, or it's building on the idea that, language models not only learn to predict the next word in text, but they also learn these representations of text. And the authors suggest that it might actually be easier to learn similarities between text sequences than it is to predict the next word in the text.

So you have this example that Dickens is the author of blank and Dickens wrote blank. And they argue that it's easier to tell for a human, but also for a model, that these sequences are similar and they should probably have the same next word, even if you don't know what the next word is.

So that's suggesting that it's easier to learn these similarities than it is to actually predict the next word. And they argue that this is even more true for long tail patterns, where it's very challenging for the model to predict that the next word is some rarely seen token or rare entity than it is to find another similar sequence that it's already seen and just copy the next word from that sequence.

So what they propose to do is store all representations of text sequences in a nearest neighbor datastore. And then at inference, what you'll want to do is you find the k most similar sequences of text, you then retrieve their corresponding values. So you just peek at those sequences and see what their next words were.

And then you combine the probability from this nearest neighbor datastore with just a typical language model prediction. And so they call this an interpolation step in that they're weighting how much to pay attention to the probability from this kNN approach, and how much to pay attention to the language model approach.
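Here is a toy NumPy sketch of that retrieve-and-interpolate step. The datastore contents, the distance-to-weight conversion, and the value of lambda are all illustrative; the real kNN-LM builds the datastore over every training context and uses FAISS for the search.

```python
import numpy as np

# Datastore: key = representation of a training context, value = its observed next word.
rng = np.random.default_rng(0)
keys = rng.standard_normal((1000, 64))                    # hypothetical context representations
values = rng.choice(["macbeth", "hamlet", "othello"], size=1000)

def knn_lm_probs(query, lm_probs, k=8, lam=0.25):
    """Interpolate the LM distribution with a distribution read off the k nearest contexts."""
    dists = np.linalg.norm(keys - query, axis=1)          # distance to every stored context
    nearest = np.argsort(dists)[:k]                       # indices of the k nearest neighbors
    weights = np.exp(-dists[nearest])                     # closer neighbors count more
    weights /= weights.sum()

    knn_probs = {}                                        # aggregate weight per next word
    for idx, w in zip(nearest, weights):
        knn_probs[values[idx]] = knn_probs.get(values[idx], 0.0) + w

    # p(w) = lambda * p_kNN(w) + (1 - lambda) * p_LM(w)
    vocab = set(lm_probs) | set(knn_probs)
    return {w: lam * knn_probs.get(w, 0.0) + (1 - lam) * lm_probs.get(w, 0.0) for w in vocab}

query = rng.standard_normal(64)                           # representation of the test context
print(knn_lm_probs(query, {"macbeth": 0.4, "hamlet": 0.3, "othello": 0.3}))
```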

And the lambda here is just a hyperparameter that they tune. So they have this diagram from their paper where they want to predict the next word in the sequence, Shakespeare's play blank. So what they do is they have all the training contexts already encoded in their data store. So they have representations of all of the training contexts.

And then they compute a representation of their test context, and they want to figure out which representations in the training contexts are most similar to this test context representation. And so here in the external memory view of things, the keys would be the representations of the training contexts, and the values would be the next words.

So they get the k nearest training representations. They then copy over their values. So that's what you see with this Macbeth, Hamlet, Macbeth example. They have a normalization step where they convert this to probability space. And then finally, they have an aggregation step. So if a word is seen as the next word in several of these k nearest neighbors, then they want to count more for that.

So that's why they aggregate. So they see Macbeth twice. It means Macbeth is more likely. And then finally, they have this interpolation step where they try to balance between the probabilities from the language model and from the kNN approach. So some immediate observation you might have is this seems really expensive.

They do propose ways to try to minimize the expense of actually having to store all the training contexts in this datastore, because they actually store an entry for every single next-word window in the training data. And you can use quantization and approximate nearest neighbor search to try to make this less expensive.

But I imagine this would still be pretty expensive for really large training data sets. They also have some cool experiments that show that this is very good for domain adaptation. So if you take your language model and you have a new domain that you want to apply your language model to, you could just create a nearest neighbor data store of your new domain.

So you encode all the representations of that new domain. You stick it in a datastore. And then you can just use your language model with these kNN probabilities as well, just immediately on this new domain without actually having to further train your language model. So I thought that was a pretty cool use case of this external memory approach.

So while it doesn't leverage knowledge bases directly, it does have this loose knowledge of-- or loose idea of encoding knowledge that is in a textual representation form into some external memory that the model can then take advantage of. That's all I have for this approach. Are there any questions on this approach?

Well, so only one person is asking, how does the kNN make predictions for the next word? The k neighbors are for the context instead of the next word. Oh, OK. That wasn't clear. So the keys are the representations of the context. The values in your external memory are the next words.

So when you figure out your nearest neighbors using your keys, you then copy over their values. So it does actually know what the next words are for each of those representations. So finally, we're going to talk about how you can just modify the training data to better encode knowledge in language models.

So the approaches we've talked about so far are actually incorporating knowledge explicitly by using either pre-trained embeddings or an external memory. We also want to talk about how you can just incorporate knowledge implicitly through the unstructured text. So what we're going to do is either mask or corrupt the data to introduce additional training tasks that require factual knowledge to figure out what data was masked, for instance.

So this has some clear advantages. It doesn't have any additional memory or computation requirements. You don't have a data store to deal with. You don't have extra knowledge encoder layers to train. All you do is modify the training data. And you don't have to modify your architecture either. So you can continue using your favorite BERT model and just make these changes to the training data.

So the first work we're going to look at is called WKLM, the Weakly Supervised Knowledge-Pretrained Language Model. And the key idea here is to train the model to distinguish between true and false knowledge. So they're going to corrupt the data by replacing mentions in the text with mentions that refer to different entities of the same type, to create what they refer to as negative knowledge statements.

And then the model will just predict, has the entity been replaced or corrupted? This type constraint is necessary to make sure that-- or to encourage the model to actually use factual knowledge to figure out if this corruption is taking place. So you could imagine if you replace it with something that's not realistic at all, the model could just be basing its prediction based on, is this sentence linguistically correct?

So as an example, we have a true knowledge statement, J.K. Rowling is the author of Harry Potter. And then we want to modify this by replacing the author, so let's say we change this to J.R.R. Tolkien is the author of Harry Potter. So you can see that this requires some amount of background knowledge to actually be able to figure out which statement is true and which statement is false.

And the idea is that the model will be able to predict for each of these mentions whether it's a true or false mention. So this diagram here is from the paper and hopefully explains this a bit better. They have their original article on the left, and then they have their replaced article with the corruptions on the right.

And the entities are in blue. So what they do is for a given entity, they first look up its type. They find other entities of that type. And then they randomly sample the entity and get an alias of it to replace in the text. So they're going to replace Stan Lee, for instance, with Brian Johnson and Marvel Comics with DC Comics.

And the replacements are in red on the right. And then the idea is that the model will be able to predict, for each of these mentions, was it replaced or not. So in the case of Brian Johnson, they have the red X for this is a false mention. And in the case of the true mentions, they have the checkmark.
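Here is a rough sketch of what that type-constrained replacement could look like. The data format, the entity IDs, and the 50% replacement rate are made-up assumptions to illustrate the idea, not WKLM's actual implementation.

```python
import random

def corrupt_mentions(mentions, entities_by_type, replace_prob=0.5):
    """Toy sketch of type-constrained entity replacement (data format is hypothetical).

    mentions: list of dicts like {"surface": "Stan Lee", "entity": "E1", "type": "person"}
    entities_by_type: e.g. {"person": [("E2", "Brian Johnson"), ...], ...}
    Returns mentions with a `replaced` flag that the model must learn to predict.
    """
    corrupted = []
    for m in mentions:
        if random.random() < replace_prob:
            # Sample a *different* entity of the same type and use one of its aliases,
            # so the sentence stays plausible and factual knowledge is needed to spot it.
            candidates = [e for e in entities_by_type[m["type"]] if e[0] != m["entity"]]
            ent_id, alias = random.choice(candidates)
            corrupted.append({**m, "surface": alias, "entity": ent_id, "replaced": True})
        else:
            corrupted.append({**m, "replaced": False})
    return corrupted
```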

So it's a pretty simple approach, but they actually show that it can help the model increase the amount of knowledge that's encoded in its parameters. So WKLM uses an entity replacement loss to train the model to distinguish between these true and false mentions. And this just looks like a binary classification loss, where your true mentions are on the left and your false mentions are on the right.

And you want to increase this P of E given C, the probability of the entity given the context, for the true mentions, and decrease it for the false mentions. The total loss is then just a combination of the masked language model loss and this entity replacement loss.

The masked language model loss is defined at the token level, and the entity replacement loss is defined at the entity level, meaning it's not just over subwords; it's potentially over multi-word phrases if you have multi-word entities. And this is an important theme that we really see recurring throughout the works we'll look at: modifying the data at the entity level seems to be an important component of actually increasing the amount of knowledge that a language model can encode.
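In code, the combined objective might look roughly like this PyTorch sketch. How WKLM actually scores each mention and weights the two losses differs in detail, so treat the mention-scoring head and the `ent_weight` term as assumptions.

```python
import torch.nn.functional as F

def wklm_style_loss(mlm_logits, mlm_labels, entity_scores, entity_is_true, ent_weight=1.0):
    """Sketch of a combined MLM + entity replacement objective (details are assumptions).

    mlm_logits:     (batch, seq_len, vocab) token-level masked LM predictions
    mlm_labels:     (batch, seq_len) gold token ids, with -100 at unmasked positions
    entity_scores:  (num_mentions,) logit that each mention is the original, unreplaced entity
    entity_is_true: (num_mentions,) 1.0 for true mentions, 0.0 for corrupted ones
    """
    # Token-level masked language modeling loss (the standard BERT objective).
    mlm_loss = F.cross_entropy(mlm_logits.view(-1, mlm_logits.size(-1)),
                               mlm_labels.view(-1), ignore_index=-100)

    # Entity-level replacement loss: binary classification per mention, pushing
    # P(entity | context) up for true mentions and down for replaced ones.
    ent_loss = F.binary_cross_entropy_with_logits(entity_scores, entity_is_true)

    return mlm_loss + ent_weight * ent_loss
```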

So they find that WKLM improves over BERT and GPT-2 on fact completion tasks, like the fill-in-the-blank statements that we looked at at the beginning. They also find that it improves over the ERNIE paper that we talked about on a downstream task. And they had a set of ablation experiments where they looked at, can you just remove this masked language model loss now?

And if you just train BERT for longer, do you really need this entity replacement loss? So that's what the table here is looking at. The second row is looking at, if we remove the masked language model loss, what happens? We see that it performs much worse without the masked language model loss.

So you really need both losses. Their intuition there was that the masked language model loss helps to encode just general language understanding. And then training BERT for longer performs much worse than using the entity replacement loss. So this further motivates that the entity replacement loss really is helping encode more knowledge in these language models.

So in addition to corrupting the data, we're also going to look at, can we just mask the data differently? Can we be more clever about how we do the masking? And this is a thread in several recent works. So there's actually another paper called ERNIE, and this is different from the one we talked about before.

And this one is Enhanced Representation through Knowledge Integration. What they do is show improvements on downstream Chinese NLP tasks by doing phrase-level and entity-level masking. So instead of just masking out subwords, they're going to mask out multi-word phrases, as well as the full span of an entity mention in the text, which they might find with NER techniques, for example.

And then the second work is actually something you heard about in the last lecture, which is the idea of using salient span masking to mask out salient spans. And a salient span is just a named entity or a date. So you can see this is pretty similar to what ERNIE is doing.
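As a rough picture of what salient span masking looks like, here is a small sketch that uses spaCy's NER purely as one convenient way to find named entities and dates; the actual implementations use their own taggers and typically mask one salient span per example.

```python
import random
import spacy  # used here only as one possible way to find salient spans

nlp = spacy.load("en_core_web_sm")  # assumes this small English model is installed

def salient_span_mask(text, mask_token="[MASK]"):
    """Mask a named entity or date span rather than a random subword."""
    doc = nlp(text)
    if not doc.ents:
        return text  # nothing salient found; a real pipeline would fall back to ordinary masking
    ent = random.choice(doc.ents)  # pick one salient span to hide
    return text[:ent.start_char] + mask_token + text[ent.end_char:]

# e.g. "J.K. Rowling published Harry Potter in 1997."
#  ->  "J.K. Rowling published Harry Potter in [MASK]."  (or another entity gets masked instead)
```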

And they found that using salient span masking actually significantly helped T5 performance on these closed domain question answering tasks. So just to make sure we're all on the same page with the different masking techniques, this diagram from the ERNIE paper compares what BERT does versus what ERNIE does.

The top shows that BERT masks out subword tokens, whereas ERNIE masks out phrases like "a series of", as well as entire entities like "J.K. Rowling". There are some interesting results showing that salient span masking is helping encode more knowledge in these representations.

So on the left, we're looking at the results of the original paper that proposed salient span masking, which is the REALM work. And the idea there was that they were training a knowledge retriever, so it actually falls more into the external memory class of techniques. But they find that by using the salient span masking technique, they could actually train a much better knowledge retriever.

So it's a good example of how these techniques are really complementary. So while I presented three classes of techniques, you can definitely get benefits by combining multiple techniques. And they found that compared to using BERT's masking, which is random uniform masking, or doing random masking of spans from a paper called SpanBERT, it performs much better to do salient span masking.

So you see a 38 exact match score versus a 32 exact match score, for instance. And on the right, we have results from fine tuning T5 with either salient span masking or the span corruption task that you saw in assignment 5. And you can see that on these different QA data sets, salient span masking does significantly better than just using the span corruption technique.

So this really suggests that doing the salient span masking and masking out these salient spans of these entities is, in fact, helping to encode more knowledge in these language models. So to recap, we talked about three different classes of techniques to add knowledge to language models. We talked about using pre-trained entity embeddings.

These weren't too difficult to apply to existing architectures, and they're a way to leverage knowledge graph pre-training. But it was a rather indirect way of incorporating knowledge, and it could be hard to interpret. We also talked about approaches that add an external memory. This could support modifying the knowledge base.

It was also easier to interpret. But they tended to be more complex in implementation, like we saw with KGLM. And they also required more memory, like we saw with the KNNLM approach. And then finally, we talked about modifying the training data. So this requires no model changes or additional computation.

It also might be the easiest to analyze theoretically, so it's actually an active area of research right now. But it's still an open question whether modifying the training data is always as effective as model changes, and what the trade-offs are in terms of the amount of data required versus doing one of these other knowledge enhancement approaches.

So that leads us to section three. So I guess I'll pause again for questions. I think we may be good. Awesome. OK. So section three is about how researchers are actually going about evaluating the knowledge in language models, and how some of the techniques we just talked about stand up in this evaluation.

So first, we're going to talk about probes, which don't require any fine-tuning of the language model. And then we're going to talk about downstream tasks, which look at how well these pre-trained representations actually transfer their knowledge to other tasks. So one of the initial works in this area was called LAMA.

And this really started a series of works to look into how much knowledge is already encoded in these language models. So their question was, how much relational, common sense, and factual knowledge is in off-the-shelf language models? So this is just taking pre-trained language models and evaluating the knowledge in them.

And this is without any additional training or fine-tuning. So they mainly constructed a set of what they refer to as cloze statements. These are just the fill-in-the-blank statements that we actually drew from at the beginning of the talk, and I'll show you some more examples here. And they manually created these templates of cloze statements using knowledge graph triples and question-answering pairs from existing data sets.

They wanted to compare pre-trained language models to supervised relation extraction and question-answering systems, to see how these language models that were trained in an unsupervised fashion compare to baseline systems that are not only supervised but really targeted for this task of knowledge extraction. And their goal was to evaluate the knowledge in existing pre-trained language models.

And a key point about this is they're just using the language models as they are available to researchers. So this means there could be differences in the pre-training corpora, for example. So when you look at the following table and you're comparing language models, also keep in mind that these results don't account for the differences in the pre-training corpora.

So a lot of these language models probably look familiar to you, either from previous lectures or maybe your final projects. And what we see is that overall, the BERT-base and BERT-large pre-trained models are performing much better than the other language models here. I guess I forgot to mention what mean precision at 1 is.

This is a pretty simple metric. The idea is you look at the blank and you look at the top prediction for the blank: is it correct or not? So that's what precision at 1 means. Precision at 10 would be, let's look at the top 10 predictions.

Is the correct prediction in the top 10? So in addition to BERT-large and BERT-base performing well overall, we do see that on the T-REx data set, the relation extraction baseline is performing a bit better than BERT. One thing to notice here that's pretty interesting is that this data set has a lot of different types of relations.

And relations can be classified in terms of, are they a 1-to-1 relation, are they an N-to-1 relation, are they an N-to-M relation? An example of a 1-to-1 relation would be the student ID relation, so you have a unique student ID. An example of an N-to-M relation would be the enrolled-in relation.

So there are lots of students enrolled in lots of classes, so this would be an N-to-M relation. And they find that BERT really struggles on these N-to-M relations. So while it performs better than the relation extraction baseline on some types of relations, overall it does pretty terribly on these N-to-M relations.

So overall it does a bit worse than the baseline on this T-REx data set. They also compare to DrQA on SQuAD, and they find that BERT does a fair amount worse. They note that the language model is not fine-tuned here and also has no access to an information retrieval system.

And then when they look at the precision at 10, they find that this gap between DrQA's performance and BERT's actually closes quite a bit, which suggests that these language models do have some amount of knowledge encoded in them, and that they're even competitive with these supervised knowledge extraction baselines.

So you can also try out examples on their GitHub repo for the LAMA probe. We have an example from their repo, "The cat is on the [MASK]." You can see what the top 10 predictions are to fill in the cloze statement. Here they have "the cat is on the phone."
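If you don't want to set up their codebase, you can get a similar feel with the Hugging Face fill-mask pipeline; the snippet below is just an informal stand-in for the LAMA probe, not the official evaluation.

```python
from transformers import pipeline

# Poke at the factual and common-sense knowledge in an off-the-shelf model.
fill = pipeline("fill-mask", model="bert-base-uncased")

for pred in fill("The cat is on the [MASK].", top_k=10):
    print(f"{pred['token_str']:<10} {pred['score']:.3f}")

# A precision@1-style check for a factual cloze statement:
preds = fill("The birthplace of Barack Obama is [MASK].", top_k=10)
print("top-1 prediction:", preds[0]["token_str"])                                 # precision at 1
print("correct in top 10:", any(p["token_str"].strip().lower() == "honolulu" for p in preds))
```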

So this can be a fun way just to figure out what factual and common sense knowledge is in existing language models, and it's pretty easy to use with this interactive prompt. So some limitations of the LAMA probe are that it can be hard to understand why the models perform well when they do.

So for instance, BERT might just be predicting the most popular token, and this happens to be right. Maybe it's just memorizing co-occurrence patterns and doesn't really understand the statement or the underlying fact. It might also just be identifying similarities between surface forms of the subject and object.

So for instance, in this example, Pope Clement VII has a position of blank. Even if you don't know anything about Pope Clement VII, you might be able to figure out that Pope is a likely next word for this triple or for this template. So the problem with this is if the model is just making these predictions based on these surface forms or co-occurrence patterns, it's difficult to know if we're actually evaluating the knowledge in the model.

Maybe it's just making correct predictions for other reasons. And a more subtle issue is that language models might just be sensitive to the phrasing of the statement. So for each relation in their data set, they just had one manually defined template.

And qualitatively, they found that if they just make small changes to the template, it can actually change whether or not the model recalls the correct prediction. And so this means that the probe results are really a lower bound on the knowledge that's encoded in the language model.

So if you change the phrasing, it's possible that the model might show that it actually does have the knowledge encoded in it. So the next lines of work we'll talk about are really building on these two limitations of the original LAMA probe. The first one is called LAMA-UHN, or LAMA Unhelpful Names.

And the key idea is to remove the examples from LAMA that can be answered without relational knowledge. So this is addressing the first limitation on the last slide. They observed that BERT relies on just the surface forms of entities and might not be using knowledge to make these predictions.

This includes the string match situation that we talked about with the pope. It also deals with the revealing person name issue that you saw in assignment five, where the name could be an incorrect prior for someone's native language, place of birth, or nationality.
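The person name heuristic needs the model itself in the loop, but the string match filter is simple to picture. Here is a rough sketch, where the example statements and field names are just illustrations rather than the paper's actual data format:

```python
lama_examples = [
    {"subject": "Apple Watch",  "answer": "Apple",
     "template": "Apple Watch is developed by [MASK]."},
    {"subject": "Barack Obama", "answer": "Honolulu",
     "template": "Barack Obama was born in [MASK]."},
]

def answer_leaks_from_surface_form(example):
    """Flag examples whose answer can be guessed from surface form alone,
    e.g. the object being a substring of the subject ("Apple Watch" -> "Apple")."""
    return example["answer"].lower() in example["subject"].lower()

# Keep only the examples that actually require relational knowledge.
harder_subset = [ex for ex in lama_examples if not answer_leaks_from_surface_form(ex)]
```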

They have this example from the paper where they look at different person names and then look at BERT's prediction for each person's native language. These are all French-speaking actors, and BERT just predicts very biased and stereotypical languages for these particular names. So this can really work both ways.

It can lead BERT to make incorrect predictions in some cases. But it could also let BERT make correct predictions even if it has no factual knowledge of those people. So that's the issue they're trying to get at here: do we know that BERT actually knows this fact, or is it just using some bias to make its prediction?

So what they do is introduce a couple of heuristics to filter out the examples from the LAMA probe that can be solved either by the string match setting or by the revealing person name setting. So they make a harder subset of the LAMA data set, essentially. They find that when they test BERT on this harder subset, its performance drops by about 8%.

But when they test their knowledge-enhanced model, which they call E-BERT, the score only drops by about 1%. So it's possible that as you make harder knowledge probes, we'll actually see even bigger differences in performance between knowledge-enhanced models and models without these knowledge enhancements. The next piece of work we'll talk about is actually getting at this issue that the phrasing of the prompt might trigger different responses from the language model.

So the language model might know the fact, but it might fail on the task due to the phrasing. One reason this might happen is that the pre-training data has different contexts and sentence structures than the query. So for example, you might have in your pre-training corpus, "The birthplace of Barack Obama is Honolulu, Hawaii."

And this might be something you see in Wikipedia, for instance, that's a common training data set. And then as a researcher, you write Barack Obama was born in blank. And you can see that these sentence structures are pretty different. So the model might've seen the first fact, but the sentence structure difference is actually enough to confuse it.

So it can't answer this query. So what they do is generate a lot more of these prompts, by mining templates from Wikipedia, one technique for which uses dependency parsing, and also by generating paraphrased prompts, taking inspiration from the machine translation literature and using back-translation. So they generate a lot more prompts to query the language models with, and figure out whether small variations in the prompt trigger the correct prediction from the language model.

They also experiment with ensembling prompts. So if we give the model multiple prompts and then average the probabilities over these different prompts, can we improve the chance of the model returning the correct prediction? This gives it a higher chance of seeing a context similar to one it might have actually seen during pre-training.
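Here is a small sketch of uniform prompt ensembling on top of the fill-mask pipeline from earlier; the prompts, the way a candidate answer is scored, and the plain average are illustrative assumptions (the paper also explores weighting the prompts).

```python
import numpy as np
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

def ensembled_answer_score(prompts, candidate, top_k=100):
    """Average a candidate answer's probability over several paraphrased prompts.

    prompts:   e.g. ["Barack Obama was born in [MASK].",
                     "The birthplace of Barack Obama is [MASK]."]
    candidate: the answer string being scored, e.g. "honolulu"
    """
    scores = []
    for prompt in prompts:
        preds = fill(prompt, top_k=top_k)
        # Probability assigned to the candidate under this phrasing (0 if not in the top k).
        score = next((p["score"] for p in preds
                      if p["token_str"].strip().lower() == candidate.lower()), 0.0)
        scores.append(score)
    return float(np.mean(scores))  # simple uniform ensemble over prompt phrasings
```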

They find that the performance on LAMA increases when they either use a top-performing prompt or when they use this ensembling approach. So this suggests that the original LAMA probe really was a lower bound on the amount of knowledge encoded in these language models, and changing the phrasing can actually help the model recall the correct answer.

This table is a bit frightening, but they find that small changes in the query can lead to really large gains in performance. So if you just have a query like "X plays in Y position", and you change that to "X plays at Y position", this can actually lead to a 23% accuracy gain on this particular relation in terms of the model being able to recall the correct answer.

Or even just X was created in Y to X is created in Y, 10% accuracy gain. So I think this motivates the need to not only develop better ways to query these models, but probably also build language models that are actually more robust to the query itself. So in addition to probes, another way to evaluate these language models is by looking at how well they transfer from the pre-trained representation to downstream tasks.

And so the idea here is you're actually going to fine-tune the pre-trained representation on different downstream tasks, similar to how you would evaluate BERT on GLUE tasks. Some common tasks that are used for this are relation extraction, entity typing, and question answering. So relation extraction is where you want to predict the relation between two entities.

So this is getting back at one of the questions earlier in the talk in terms of, well, how do you get the relations that form the edges in these knowledge bases? So given two entities, you learn a model to predict the relation between them. Entity typing is the task of, given an entity mention, predicting the type of the entity.

So here, in "Alice robbed the bank," you want to predict that Alice is a criminal. And then you guys are very familiar with question answering. The idea is that these tasks are knowledge-intensive, so they're good candidates to see how well these pre-trained representations actually transfer their knowledge to downstream tasks.
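Just to make the shape of these tasks concrete, here are illustrative, made-up examples of the kinds of inputs and labels involved; the exact formats differ across benchmarks like TACRED and OpenEntity.

```python
# Relation extraction: given a sentence and two marked entities, predict their relation.
relation_extraction_example = {
    "tokens": ["Alice", "was", "hired", "by", "Acme", "Corp", "."],
    "subject_span": (0, 1),          # token span for "Alice"
    "object_span": (4, 6),           # token span for "Acme Corp"
    "label": "per:employee_of",      # the relation to predict between the two entities
}

# Entity typing: given a mention in context, predict its (possibly fine-grained) types.
entity_typing_example = {
    "text": "Alice robbed the bank.",
    "mention_span": (0, 5),          # character span for "Alice"
    "labels": ["person", "criminal"],
}
```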

Here we're looking at performance on a relation extraction benchmark called TACRED. And all the models that we show here were at one point state-of-the-art on TACRED. So this C-GCN is a graph convolutional neural network over dependency trees. The BERT-LSTM-base model is one of the first works that showed that you could actually get state-of-the-art performance with BERT on relation extraction.

And this is just putting an LSTM layer over BERT's output. ERNIE is the work that we talked about with the pre-trained entity embeddings. Matching the Blanks we didn't get to today, but it's a really interesting work about learning meaningful relation representations, and it falls more into the training data modification approaches, in that they are actually masking out entities again.

And then KnowBert is what we talked about. The W+W here means they actually encode two knowledge bases in KnowBert: they're encoding WordNet and they're also encoding Wikipedia. And the high-level takeaway from this table is that you can see that the recent knowledge-enhanced models have achieved state-of-the-art over the original models that once performed very well on TACRED.

And we see gains of about 5 F1 points here. Another interesting takeaway from this table is that there seems to be a trade-off in the size of the language model that's necessary to get a certain performance. So if you consider the size of the language model, then KnowBert performs the best.

But if you don't consider that, then it ties with Matching the Blanks. So overall, this is pretty good evidence that these knowledge-enhanced methods are in fact transferring to these knowledge-intensive downstream tasks, which can really take advantage of these pre-trained representations. We also have results on entity typing. So here we're comparing a slightly different set of models.

Some of the baselines are LSTM models that were designed for entity typing. And we have ERNIE and KnowBert leading the leaderboard here on the entity typing task of OpenEntity, with gains of about 15 F1 points. So once again, we really do see that these knowledge-rich pre-trained representations are transferring and helping on these knowledge-intensive downstream tasks.

So just to recap, we talked about probes which evaluate the knowledge already present in models. These don't require any more training. But it can be challenging to construct benchmarks to actually make sure you're testing the knowledge in these language models. It can also be challenging to construct the queries used in the probe.

We then talked about downstream tasks. These are a bit of an indirect way to evaluate knowledge, in that they have this extra component of fine-tuning. But it's a good way to evaluate how useful this knowledge-rich pre-trained representation is in actual applications. So I've just touched on some of the exciting work in this area.

But there's many other directions if you want to dive more into this. So there's retrieval-augmented language models, which learn knowledge retrievers to figure out what documents might be relevant for predicting the next word. There's work in modifying the knowledge in language models. So I talked about how this is one of the obstacles and challenges to using language models as knowledge bases.

So there's been recent work in this area. We also saw how important the knowledge pre-training task was. Well, there's many papers that are proposing different tasks to do the knowledge pre-training. So it's still an open question in terms of what tasks are best to add to encode more knowledge.

There's also been work on more efficient knowledge systems. So at NeurIPS, there's now an EfficientQA challenge, which aims at building the smallest QA system. And then finally, there's been work on building better knowledge benchmarks that build on the benchmarks we saw today. So that's all I have for today, and I hope your final projects are going well.

Thank you.