
Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 2 - Neural Classifiers


Chapters

0:00 Intro
2:05 Parameters
4:03 Learning Phase
7:59 Gradient Descent
8:39 Stochastic Gradient Descent
11:29 Note on Word Vectors
13:46 Word2Vec
16:53 Negative sampling
18:06 Logistic function
20:15 Negative log likelihood
23:27 Co-occurrence matrix
25:19 Questions
29:49 Co-occurrence matrices
32:19 Singular value decomposition
36:17 COALS model
37:53 GloVe algorithm
42:32 Log-bilinear model
43:06 Explicit loss function
44:45 Question

Transcript

Okay, so what are we going to do for today? So the main content for today is to go through sort of more stuff about word vectors, including touching on word senses, and then introducing the notion of neural network classifiers. So our biggest goal is that by the end of today's class, you should feel like you could confidently look at one of the word embeddings papers, such as the Google Word2Vec paper, or the GloVe paper, or Sanjeev Arora's paper that we'll come to later, and feel like, yeah, I can understand this, I know what they're doing, and it makes sense.

So let's go back to where we were. So this was sort of introducing this model of Word2Vec. And the idea was that we start with random word vectors, and then we have a big corpus of text, and we're going to iterate through each word in the whole corpus.

And for each position, we're going to try and predict what words surround our center word. And we're going to do that with a probability distribution that's defined in terms of the dot product between the word vectors for the center word and the context words. And so that will give a probability estimate of a word appearing in the context of "into." Well, these are the actual words that did occur in the context of "into" on this occasion.

So what we're going to want to do is sort of make it more likely that "turning," "problems," "banking," and "crises" will turn up in the context of "into." And so that's learning, updating the word vectors so that they can predict actual surrounding words better. And then the thing that's almost magical is that doing no more than this simple algorithm allows us to learn word vectors that capture well word similarity and meaningful directions in a word space.

So more precisely, right, for this model, the only parameters of this model are the word vectors. So we have outside word vectors and center word vectors for each word. And then we're taking their dot product to get a probability. Well, we're taking a dot product to get a score of how likely a particular outside word is to occur with the center word.

And then we're using the softmax transformation to convert those scores into probabilities, as I discussed last time, and I kind of come back to at the end this time. A couple of things to note. This model is what we call in NLP a bag of words model. So bag of words models are models that don't actually pay any attention to word order or position.

It doesn't matter if you're next to the center word or a bit further away on the left or right. The probability estimate would be the same. And that seems like a very crude model of language that will offend any linguist. And it is a very crude model of language, and we'll move on to better models of language as we go on.

But even that crude model of language is enough to learn quite a lot about the properties of words. And then the second note is, well, with this model, we wanted to give reasonably high probabilities to the words that do occur in the context of the center word, at least if they do so at all often.

But obviously lots of different words can occur. So we're not talking about probabilities like .3 and .5. We're more likely going to be talking about probabilities like .01 and numbers like that. Well, how do we achieve that? And well, the way that the word2vec model achieves this, and this is the learning phase of the model, is to place words that are similar in meaning close to each other in this high dimensional vector space.

So again, you can't read this one, but if we scroll into this one, we see lots of words that are similar in meaning grouped close together in the space. So here are days of the week like Tuesday, Thursday, Sunday, and also Christmas. Over, what else do we have? We have Samsung and Nokia.

This is a diagram I made quite a few years ago. So that's when Nokia was still an important maker of cell phones. We have various sort of fields like mathematics and economics over here. So we group words that are similar in meaning. Actually one more note I wanted to make on this.

I mean, again, this is a two dimensional picture, which is all I can show you on a slide. And it's done with a principal components projection that you'll also use in the assignment. Something important to remember, but hard to remember, is that high dimensional spaces have very different properties to the two dimensional spaces that we can look at.

And so in particular, a word, a vector, can be close to many other things in a high dimensional space, but close to them on different dimensions. Okay, so I've mentioned doing learning. So the next question is, well, how do we learn good word vectors? And this was the bit that I didn't quite hook up at the end of last class.

So at the end of the last class, I said, oh, you do calculus, and we work out the gradient of the loss function with respect to the parameters, and that will allow us to make progress. But I didn't altogether put that together. So what we're going to do is we start off with random word vectors.

We initialize them to small numbers near zero in each dimension. We've defined our loss function J, which we looked at last time. And then we're going to use a gradient descent algorithm, which is an iterative algorithm that minimizes J of theta by changing theta. And so the idea of this algorithm is that from the current values of theta, you calculate the gradient of J of theta.

And then what you're going to do is make a small step in the direction of the negative gradient. So the gradient is pointing upwards, and we're taking a small step in the direction of the negative of the gradient to gradually move down towards the minimum. And so one of the parameters of neural nets that you can fiddle in your software package is what is the step size.

So if you take a really, really itsy bitsy step, it might take you a long time to minimize the function. You do a lot of wasted computation. On the other hand, if your step size is much too big, well, then you can actually diverge and start going to worse places.

Or even if you are going downhill a little bit, what's going to happen is that you end up bouncing back and forth, and it'll take you much longer to get to the minimum. OK, in this picture, I have a beautiful quadratic, and it's easy to minimize it.

Something that you might know about neural networks is that in general, they're not convex. So you could think that this is just all going to go awry. But the truth is in practice, life works out to be OK. But I think I won't get into that more right now and come back to that in a later class.

So this is our gradient descent. So we have the current values of the parameters theta. We then walk a little bit in the negative direction of the gradient using our learning rate or step size alpha. And that gives us new parameter values, where that means that these are vectors.

But for each individual parameter, we are updating it a little bit by working out the partial derivative of J with respect to that parameter. So that's the simple gradient descent algorithm. Nobody uses it, and you shouldn't use it. The problem is that our J is a function of all windows in the corpus.

Remember, we're doing this sum over every center word in the entire corpus. And we'll often have billions of words in the corpus. So actually working out J of theta or the gradient of J of theta would be extremely, extremely expensive because we have to iterate over our entire corpus.

So you'd wait a very long time before you made a single gradient update. And so optimization would be extremely slow. And so basically 100% of the time in neural network land, we don't use gradient descent. We instead use what's called stochastic gradient descent. And stochastic gradient descent is a very simple modification of this.

So rather than working out an estimate of the gradient based on the entire corpus, you simply take one center word or a small batch like 32 center words, and you work out an estimate of the gradient based on them. Now that estimate of the gradient will be noisy and bad because you've only looked at a small fraction of the corpus rather than the whole corpus.

But nevertheless, you can use that estimate of the gradient to update your theta parameters in exactly the same way. And so this is the algorithm that we use. So if we have a billion-word corpus and we do an update for each center word, we can make a billion updates to the parameters as we pass through the corpus once, rather than making only one more accurate update to the parameters once you've been through the whole corpus.
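To make that concrete, here is a minimal sketch of a stochastic gradient descent loop; the function and variable names here are illustrative assumptions, not the assignment's actual code, and `grad_fn` stands in for whatever computes the Word2Vec gradient from a batch of windows.

```python
import numpy as np

def sgd(theta, corpus_windows, grad_fn, lr=0.05, batch_size=32, epochs=1):
    """Minimal SGD sketch: update theta from small batches of windows.

    grad_fn(theta, batch) is assumed to return an estimate of dJ/dtheta
    computed only from the windows in `batch`. Plain (full-batch) gradient
    descent would instead call grad_fn(theta, corpus_windows) once per update,
    which is far too slow for a billion-word corpus.
    """
    for _ in range(epochs):
        np.random.shuffle(corpus_windows)
        for start in range(0, len(corpus_windows), batch_size):
            batch = corpus_windows[start:start + batch_size]
            grad = grad_fn(theta, batch)   # noisy estimate of the true gradient
            theta = theta - lr * grad      # small step against the gradient
    return theta
```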

So overall, we can learn several orders of magnitude more quickly. And so this is the algorithm that you'll be using everywhere, including, you know, right from the beginning in our assignments. Again, just an extra comment on more complicated stuff we'll come back to: I presented this stochastic gradient descent as sort of a performance hack.

It lets you learn much more quickly. It turns out it's not only a performance hack. Neural nets have some quite counterintuitive properties. And actually, the fact that stochastic gradient descent is kind of noisy and bounces around as it does its thing, it actually means that in complex networks, it learns better solutions than if you were to run plain gradient descent very slowly.

So you can both compute much more quickly and do a better job. OK, one final note on running stochastic gradients with word vectors, this is kind of an aside, but something to note is that if we're doing a stochastic gradient update based on one window, then actually in that window, we'll have seen almost none of our parameters.

Because if we have a window of something like five words to either side of the center word, we've seen at most 11 distinct word types. So we will have gradient information for those 11 words. But the other 100,000 odd words in our vocabulary will have no gradient update information.

So this will be a very, very sparse gradient update. So if you're only thinking math, you can just have your entire gradient and use the equation that I showed before. But if you're thinking systems optimization, then you'd want to think, well, actually, I only want to update the parameters for a few words.
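As a rough illustration of that systems point, here's a sketch of a sparse update that only touches the rows for the words seen in one window; the matrix names, shapes, and gradient arguments are assumptions made for the example.

```python
import numpy as np

# Hypothetical sizes: V words in the vocabulary, d-dimensional vectors.
V, d = 100_000, 100
center_vecs  = np.random.randn(V, d) * 0.01   # center word vectors
outside_vecs = np.random.randn(V, d) * 0.01   # outside/context word vectors

def sparse_window_update(center_idx, context_idxs, grad_center, grad_outside, lr=0.05):
    """Update only the <=11 rows seen in this window; the other ~100,000 rows are untouched."""
    center_vecs[center_idx] -= lr * grad_center       # one row
    # with repeated context indices you'd want np.add.at to accumulate correctly
    outside_vecs[context_idxs] -= lr * grad_outside   # a handful of rows
```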

That kind of sparse update is much more efficient than touching the whole parameter vector. And here's another aside that will be useful for the assignment. Up until now, when I presented word vectors, I presented them as column vectors.

And that makes the most sense if you think about it as a piece of math. Whereas actually, in all common deep learning packages, including PyTorch that we're using, word vectors are represented as row vectors. And if you remember back to the representation of matrices in CS107 or something like that, you'll know that that's obviously efficient for representing words, because then you can access an entire word vector as a contiguous range of memory.

Different if you're in Fortran. But anyway, our word vectors will be row vectors when you look at them inside PyTorch. OK, now I wanted to say a bit more about the Word2Vec algorithm family and also what you're going to do in homework two. You're still meant to be working on homework one, which remember is due next Tuesday, but really with today's content, we're starting into homework two.

And I'll kind of go through the first part of homework two today and the other stuff you need to know for homework two. So I mentioned briefly the idea that we have two separate vectors for each word type, the center vector and the outside vectors. And we just average them both at the end.

They're similar, but not identical for multiple reasons, including the random initialization and the stochastic gradient descent. You can implement a word to vector algorithm with just one vector per word. And actually, if you do, it works slightly better, but it makes the algorithm much more complicated. And the reason for that is sometimes you will have the same word type as the center word and the context word.

And that means that when you're doing your calculus, you've then got this messy case that just for that word, you're getting an x dot x term, a dot product of the word vector with itself, which makes it much messier to work out.

And so that's why we use this simple optimization of having two vectors per word. Okay. So the Word2Vec model, as introduced in the Mikolov et al. paper in 2013, wasn't really just one algorithm. It was a family of algorithms. So there were two basic model variants.

One was called the skip-gram model, which is the one that I've explained to you: it predicts outside words, position independent, given the center word, in a bag-of-words style model. The other one was called the continuous bag of words model, CBOW. And in this one, you predict the center word from a bag of context words.

Both of these give similar results. The skip-gram one is more natural in various ways. So it's normally the one that people have gravitated to in subsequent work. But then as to how you train this model, what I've presented so far is the naive softmax equation, which is a simple but relatively expensive training method.

And so that isn't really what they suggest using in the paper. They suggest using a method that's called negative sampling. So an acronym you'll see sometimes is SGNS, which means skip-gram with negative sampling. So let me just say a little bit about what this is. Actually, doing the skip-gram model with negative sampling is part of homework two.

So you'll get to know this model well. So the point is that if you use this naive softmax, you know, even though people commonly do use this naive softmax in various neural net models, that working out the denominator is pretty expensive. And that's because you have to iterate over every word in the vocabulary and work out these dot products.

So if you have a hundred thousand word vocabulary, you have to do a hundred thousand dot products to work out the denominator. And that seems a little bit of a shame. And so instead of that, the idea of negative sampling is that instead of using this softmax, we're going to train binary logistic regression models for the true pair of center word and context word versus noise pairs, where we keep the true center word and we just randomly sample words from the vocabulary.

So as presented in the paper, the idea is like this. So overall, what we want to optimize is still an average of the loss for each particular center word. But when we're working out the loss, we do it for each particular center word and each particular window.

We're going to take the dot product as before of the center word and the outside word. And that's the main quantity. But now instead of using that inside the softmax, we're going to put it through the logistic function, which is often also called the sigmoid function.

The name logistic is more precise. So that's this function here. So the logistic function is a handy function that will map any real number to a probability in the open interval between zero and one. So basically, if the dot product is large, the logistic of the dot product will be virtually one.
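Written out, the logistic (sigmoid) function, and the symmetry property that gets used in a moment, are:

$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma(-x) = 1 - \sigma(x).$$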

Okay, so we want this to be large. And then what we'd like is on average, we'd like the dot product between the center word and words that we just chose randomly, i.e. they most likely didn't actually occur in the context of the center word to be small. And there's just one little trick of how this is done, which is this sigmoid function is symmetric.

And so if we want this probability to be small, we can take the negative of the dot product. So we want it to be over here: the dot product of the random word and the center word should be a negative number. And so then we're going to take the negation of that.

And then again, once we put that through the sigmoid, we'd like a big number. Okay, so the way they're presenting things, they're actually maximizing this quantity. But if I go back to making it a bit more similar to the way we had written things, we'd worked with minimizing the negative log likelihood.

So it looks like this. So we're taking the negative log likelihood of the sigmoid of the dot product. And again, for the negative samples, we take the negative log likelihood of the sigmoid of the negated dot product. And then we're going to work out this quantity for a handful of random words. We take negative samples, and how likely a word is to be sampled depends on its probability.

And this loss function is going to be minimized, given this negation, by making these dot products large and those dot products small, where small means negative. So there was just one other trick that they use. Actually, there's more than one other trick used in the Word2Vec paper to get it to perform well, but I'll only mention one of them here.

When they sample the words, they don't simply just sample the words based on their probability of occurrence in the corpus or uniformly. What they do is they start with what we call the unigram distribution of words. So that is how often words actually occur in our big corpus. So if you have a billion word corpus and a particular word occurred 90 times in it, you're taking 90 divided by a billion, and so that's the unigram probability of the word.

But what they then do is take that to the three-quarters power, which is then renormalized with Z to make a probability distribution, kind of like we saw last time with the softmax. Taking the three-quarters power has the effect of dampening the difference between common and rare words, so that less frequent words are sampled somewhat more often, but still not nearly as much as they would be if you just used something like a uniform distribution over the vocabulary.
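Putting those pieces together, here is a minimal NumPy sketch of the negative-sampling loss for one (center, outside) pair, together with the unigram-to-the-three-quarters sampling distribution. The variable names and the toy counts are assumptions for illustration, not the homework's notation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_c, u_o, U_neg):
    """J = -log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c)

    v_c:   center word vector, shape (d,)
    u_o:   true outside (context) word vector, shape (d,)
    U_neg: K sampled negative word vectors, shape (K, d)
    """
    pos = -np.log(sigmoid(u_o @ v_c))
    neg = -np.sum(np.log(sigmoid(-(U_neg @ v_c))))
    return pos + neg

def negative_sampling_dist(unigram_counts, power=0.75):
    """P(w) proportional to count(w)^(3/4), renormalized by Z."""
    probs = np.asarray(unigram_counts, dtype=float) ** power
    return probs / probs.sum()

# Example: sample K=15 negative word indices from the dampened distribution.
counts = [90, 5_000_000, 12, 300]          # toy unigram counts
P = negative_sampling_dist(counts)
neg_idx = np.random.choice(len(counts), size=15, p=P)
```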

Okay, so that's basically everything to say about the basics of how we have this very simple neural network algorithm, Word2Vec, and how we can train it and learn word vectors. So for the next bit, what I want to do is step back a bit and say, well, here's an algorithm that I've shown you that works great.

What else could we have done and what can we say about that? The first thing that you might think about is, well, here's this funny iterative algorithm to give you word vectors. You know, if we have a lot of words in a corpus, it seems like a more obvious thing that we could do is just look at the counts of how words occur with each other and build a matrix of counts, a co-occurrence matrix.

So here's the idea of a co-occurrence matrix. So I've got a teeny little corpus: I like deep learning. I like NLP. I enjoy flying. And I can define a window size. I made my window simply size one to make it easy to fill in my matrix, and it's symmetric, just like our Word2Vec algorithm.

And so then the counts in these cells are simply how often things co-occur in the window of size one. So "I like" occurs twice, so we get twos in those cells, because it's symmetric. "Deep learning" occurs once, so we get one here, and lots of other things occur zero.
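For instance, a window-size-1 co-occurrence count table for this toy corpus can be built with a few lines like the following sketch (treating the period as a token, as on the slide, is an assumption of this example):

```python
from collections import defaultdict

corpus = ["I like deep learning .",
          "I like NLP .",
          "I enjoy flying ."]
window = 1

counts = defaultdict(int)
for sent in corpus:
    toks = sent.split()
    for i, w in enumerate(toks):
        for j in range(max(0, i - window), min(len(toks), i + window + 1)):
            if i != j:
                counts[(w, toks[j])] += 1   # symmetric: both (w, context) and (context, w)

print(counts[("I", "like")])         # -> 2
print(counts[("deep", "learning")])  # -> 1
```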

So we can build up a co-occurrence matrix like this. And well, these actually give us a representation of words as co-occurrence vectors. So I can take the word I with either a row or a column vector since it's symmetric and say okay, my representation of the word I is this row vector.

And that is a representation of the word "I." And I think you can maybe convince yourself that to the extent that words have similar meaning and usage, you'd sort of expect them to have somewhat similar vectors, right? So if I had the word "you" as well in a larger corpus, you might expect "I" and "you" to have similar vectors, because with "I like," "you like," "I enjoy," "you enjoy."

You'd see the same kinds of possibilities. Hey, Chris, can we pause here to answer some questions? Sure. All right. So we got some questions on the negative sampling slides. In particular, can you give some intuition for negative sampling? What is the negative sampling doing?

And why do we only take one positive example? Those are two questions to be answered in tandem. Okay, that's a good question. I'll try and give more intuition. So the idea is to work out something like what the softmax did, in a much more efficient way. So in the softmax, you wanted to give high probability, in predicting the context, to a context word that actually did appear with the center word.

And the way you do that is by having the dot product between those two words be as big as possible. But it's more than that, because in the denominator, you're also working out the dot product with every other word in the vocabulary.

So as well as wanting the dot product with the actual word that you see in the context to be big, you maximize your likelihood by making the dot products of other words that weren't in the context smaller, because that's shrinking your denominator. And therefore, you've got a bigger number coming out, and you're maximizing the likelihood.

So even for the softmax, the general thing that you want to do to maximize it is have the dot product with words actually in the context be big, and the dot product with words not in the context be small, to the extent possible. And obviously, you have to average this as best you can over all kinds of different contexts, because different words appear in different contexts, obviously.

So negative sampling is a way of trying to maximize the same objective. Now, you only have one positive term because you want to use the actual data. So you're not wanting to invent data. So for working out the entire J, we do work this quantity out for every center word and every context word.

So we are iterating over the different words in the context window, and then we're moving through positions in the corpus. So we're doing different v_c's. So gradually we do this, but for one particular center word and one particular context word, we only have one real piece of data that's positive.

So that's all we use, because we don't know what other words should be counted as positive words. Now for the negative words, you could just sample one negative word, and that would probably work. But you might want a slightly better, more stable sense of, OK, we'd like to, in general, have other words have low probability.

It seems like you might be able to get better, more stable results if you instead say, let's have 10 or 15 sample negative words. And indeed that's been found to be true. And for the negative words, well, it's easy to sample any number of random words you want. And at that point, it's kind of a probabilistic argument.

The words that you're sampling might not be actually bad words to appear in the context. They might actually be other words that are in the context, but 99.9% of the time, they will be unlikely words to occur in the context. And so they're good ones to use. And yes, you only sample 10 or 15 of them, but that's enough to make progress because the center word is going to turn up on other occasions.

And when it does, you'll sample different words over here, so that you gradually sample different parts of the space and start to learn. OK, so back to where we were: we had this co-occurrence matrix, and it gives a representation of words as co-occurrence vectors. And just one more note on that. I mean, there are actually two ways that people have commonly made these co-occurrence matrices.

One corresponds to what we've seen already, that you use a window around a word, which is similar to word2vec. And that allows you to capture some locality and some of the sort of syntactic and semantic proximity that's more fine grained. The other way these co-occurrence matrices have often been made is that normally documents have some structure, whether it's paragraphs or just actual web pages, sort of size documents.

So you can just make your window size a paragraph or a whole web page and count co-occurrence in those. And this is the kind of method that's often being used in information retrieval in methods like latent semantic analysis. OK, so the question then is, are these kind of count word vectors good things to use?

Well, people have used them. They're not terrible, but they have certain problems. Firstly, they're huge, though very sparse. So this is back to what I said before: if we have a vocabulary of half a million words, then we have a half-million-dimensional vector for each word, which is much, much bigger than the word vectors that we typically use.

And it also means that because we have these very high dimensional vectors, that we have a lot of sparsity and a lot of randomness. So the results that you get tend to be noisier and less robust, depending on what particular stuff was in the corpus. And so in general, people have found that you can get much better results by working with low dimensional vectors.

So then the idea is that we can store most of the important information about the distribution of words in the context of other words in a fixed small number of dimensions, giving a dense vector. And in practice, the dimensionality of the vectors that are used is normally somewhere between 25 and 1000.

And so at that point, we need some way to reduce the dimensionality of our count co-occurrence vectors. So if you have a good memory from a linear algebra class, you hopefully saw singular value decomposition. And it has various mathematical properties that I'm not going to talk about here, giving you an optimal way, under a certain definition of optimality, of producing a reduced-dimensionality pair of matrices that lets you recover the original matrix as well as possible.

But the idea of the singular value decomposition is that you can take any matrix, such as our count matrix, and decompose it into three matrices: U, a diagonal matrix sigma, and a V transpose matrix. And this works for any shape. Now, in these matrices, some parts are never used, because since this matrix is rectangular, there's nothing over here, and so this part of the V transpose matrix gets ignored.

But if you're wanting to get smaller dimensional representations, what you do is take advantage of the fact that the singular values inside the diagonal sigma matrix are ordered from largest down to smallest. So what we can do is just delete out some of the smallest singular values, which effectively means that in this product, part of U and part of V is also not used.

And so then as a result of that, we're getting lower dimensional representations for our words if we're wanting to have word vectors, which still do as good as possible a job within the given dimensionality of enabling you to recover the original co-occurrence matrix. So from a linear algebra background, this is the obvious thing to use.
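As a sketch of that reduction, here's what a truncated SVD of a count matrix might look like in NumPy; the matrix is a random stand-in and the choice of k is arbitrary for the example.

```python
import numpy as np

X = np.random.rand(8, 8)           # stand-in for a word-word co-occurrence count matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)   # s is sorted largest to smallest

k = 2                              # keep only the k largest singular values
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation of X
word_vectors = U[:, :k] * s[:k]    # one common choice of reduced word vectors
```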

So how does that work? Well, if you just build a raw count co-occurrence matrix and run SVD on that and try and use those as word vectors, it actually works poorly. And it works poorly because if you get into the mathematical assumptions of SVD, you're expecting to have these normally distributed errors.

And what you're getting with word counts looks not at all like something normally distributed, because you have exceedingly common words like "a", "the", and "and", and you have a very large number of rare words. So that doesn't work very well. But you can actually get something that works a lot better if you scale the counts in the cells.

So to deal with this problem of extremely frequent words, there are some things we can do. We could just take the log of the raw counts. We could cap the maximum count. We could throw away the function words. And any of these kinds of ideas then give you a co-occurrence matrix from which you get more useful word vectors when you run something like SVD.
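In the same spirit, a couple of those count-scaling tricks might look like this before running the SVD; the cap value here is just an illustrative choice, not a recommendation from the lecture.

```python
import numpy as np

def scale_counts(X, cap=100):
    """Dampen the effect of extremely frequent words before SVD."""
    X = np.minimum(X, cap)   # cap the maximum count
    return np.log1p(X)       # log(1 + count) also squashes large counts
```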

And indeed, these kinds of models were explored in the 1990s and in the 2000s. And in particular, Doug Rohde explored a number of these ideas as to how to improve the co-occurrence matrix in a model that he built that was called COALS. And actually, in his COALS model, he observed the fact that you could get the same kind of linear components with semantic meaning that we saw previously when talking about analogies.

So for example, this is a figure from his paper. And you can see that we seem to have a meaning component going from a verb to the person who does the verb. So drive to driver, swim to swimmer, teach to teacher, marry to priest. And these vector components are not perfectly, but roughly, parallel and roughly the same size.

And so we have a meaning component there that we could add on to another word, just like we did previously for analogies. We could say drive is to driver as marry is to what. And we'd add on this green vector component, which is roughly the same as this one.

And we'd say, oh, priest. So this space could actually get some word vectors analogies right as well. And so that seemed really interesting to us around the time Word2Vec came out of wanting to understand better what the iterative updating algorithm of Word2Vec did and how it related to these more linear algebra based methods that had been explored in the couple of decades previously.

And so for the next bit, I want to tell you a little bit about the GloVe algorithm, which was an algorithm for word vectors that was made by Jeffrey Pennington, Richard Socher, and me in 2014. And so the starting point of this was to try to connect together the linear algebra based methods on co-occurrence matrices, like LSA and COALS, with models like skip-gram, CBOW, and their other friends, which were iterative neural updating algorithms.

So on the one hand, you know, the linear algebra methods actually seemed like they had advantages for fast training and efficient usage of statistics. But although there had been work on capturing word similarities with them, by and large, the results weren't as good, perhaps because of disproportionate importance given to large counts in the main.

Conversely, for the neural models, it seems like if you're just doing these gradient updates on windows, you're somehow inefficiently using statistics versus a co-occurrence matrix. But on the other hand, it's actually easier to scale to a very large corpus by trading time for space. And at that time, it seemed like the neural methods just worked better for people, that they generated improved performance on many tasks, not just on word similarity, and that they could capture complex patterns such as the analogies that went beyond word similarity.

And so what we wanted to do was understand a bit more as to what properties you need to have these analogies work out, as I showed last time. And so what we realized was that if you'd like to have these sorts of vector subtractions and additions work for an analogy, the property that you want has to do with meaning components.

So a meaning component is something like going from male to female, queen to king, or going from a vehicle to the person who operates it, truck to driver. Those meaning components should be represented as ratios of co-occurrence probabilities. So here's an example that shows that. So suppose the meaning component that we want to get out is the spectrum from solid to gas, as in physics.

Well, you'd think that you can get at the solid part of it, perhaps by saying, does the word co-occur with ice? And the word solid occurs with ice, so that looks hopeful. And gas doesn't occur with ice much, so that looks hopeful. But the problem is the word water will also occur a lot with ice.

And if you just take some other random word, like the word random, it probably doesn't occur with ice much. In contrast, if you look at words co-occurring with steam, solid won't occur with steam much, but gas will. But water will again, and random will be small. So to get out the meaning component we want of going from gas to solid, what's actually really useful is to look at the ratio of these co-occurrence probabilities.

Because then we get a spectrum from large to small between solid and gas, whereas for water and a random word, it basically cancels out and gives you one. I just wrote these numbers in, but if you count them up in a large corpus, it is basically what you get.

So here are actual co-occurrence probabilities, and for water and my random word, which was fashion here, these ratios are approximately one. Whereas the ratio of the probability of co-occurrence of solid with ice versus steam is about 10, and for gas it's about a tenth. So how can we capture these ratios of co-occurrence probabilities as linear meaning components, so that in our word vector space we can just add and subtract linear meaning components?

Well, it seems like the way we can achieve that is if we build a log bilinear model so that the dot product between two word vectors attempts to approximate the log of the probability of co-occurrence. So if you do that, you then get this property that the difference between two vectors, its similarity to another word corresponds to the log of the probability ratio shown on the previous slide.
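In symbols, the property being described is that the dot product approximates the log of the co-occurrence probability,

$$w_i \cdot w_j \approx \log P(i \mid j),$$

so that a vector difference captures a log probability ratio:

$$w_x \cdot (w_a - w_b) \approx \log \frac{P(x \mid a)}{P(x \mid b)}.$$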

So the GloVe model wanted to try and unify the thinking between the co-occurrence matrix models and the neural models by being in some way similar to a neural model, but actually calculated on top of co-occurrence matrix counts. So we had an explicit loss function, and our explicit loss function is that we wanted the dot product to be similar to the log of the co-occurrence count.

We actually added in some bias terms here, but I'll ignore those for the moment. And we wanted to not have very common words dominate, and so we capped the effect of high word counts using this f function that's shown here. And then we could optimize this j function directly on the co-occurrence count matrix.
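Written out as in the GloVe paper, that objective is

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,$$

where $X_{ij}$ is the co-occurrence count, $b_i$ and $\tilde{b}_j$ are the bias terms just mentioned, and $f$ is the weighting function that grows with the count but is capped for very frequent co-occurrences (the paper uses $f(x) = (x/x_{\max})^{3/4}$ for $x < x_{\max}$ and $1$ otherwise).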

So that gave us fast training scalable to huge corpora. And so this algorithm worked very well. So if you run this algorithm and ask what are the nearest words to frog, you get frogs, toad, and then you get some complicated words, but it turns out they are all frogs until you get down to lizards.

So Litoria is that lovely tree frog there. And so this actually seemed to work out pretty well. How well did it work out? To discuss that a bit more, I now want to say something about how we evaluate word vectors. Are we good up to there for questions?

We've got some questions. What do you mean by an inefficient use of statistics as a con for skip-gram? Well, what I mean is that, you know, for Word2Vec, you're just looking at one center word at a time and generating a few negative samples. And so it sort of seems like you're doing something imprecise there.

Whereas if you're doing optimization algorithm on the whole matrix at once, well, you actually know everything about the matrix at once. You're not just looking at what words, what other words occurred in this one context of the center word. You've got the entire vector of co-occurrence accounts for the center word and another word.

And so therefore you can much more efficiently and less noisily work out how to minimize your loss. Okay. I'll go on. Okay. So I've sort of said, look at these word vectors. They're great. And I sort of showed you a few things at the end of the last class, which argued, hey, these are great.

You know, they work out these analogies. They show similarity and things like this. We want to make this a bit more precise. And indeed for natural language processing, as in other areas of machine learning, a big part of what people are doing is working out good ways to evaluate knowledge that things have.

So how can we really evaluate word vectors? So in general for NLP evaluation, people talk about two ways of evaluation, intrinsic and extrinsic. So an intrinsic evaluation means that you evaluate directly on the specific or intermediate subtasks that you've been working on. So I want a measure where I can directly score how good my word vectors are.

And normally intrinsic evaluations are fast to compute. They help you to understand the component you've been working on. But often simply trying to optimize that component may or may not have a very big effect on the overall system that you're trying to build. So people have also been very interested in extrinsic evaluations.

So an extrinsic evaluation is that you take some real task of interest to human beings, whether that's web search or machine translation or something like that, and you say your goal is to actually improve performance on that task. Well that's a real proof that this is doing something useful.

So in some ways it's just clearly better. But on the other hand, it also has some disadvantages. It takes a lot longer to evaluate on an extrinsic task because it's a much bigger system. And sometimes, you know, when you change things, it's unclear whether the fact that the numbers went down was because you now have worse word vectors or whether it's just somehow the other components of the system interacted better with your old word vectors.

And if you change the other components as well, things would get better again. So in some ways it can sometimes be muddier to see if you're making progress. But I'll touch on both of these methods here. So for intrinsic evaluation of word vectors, one way which we mentioned last time was this word vector analogy.

So we could simply give our models a big collection of word vector analogy problems. So we could say man is the woman as king is the what, and ask the model to find the word that is closest using that sort of word analogy computation and hope that what comes out there is queen.

And so that's something people have done, working out an accuracy score of how often you are right. At this point, I should just mention one little trick of these word vector analogies that everyone uses but not everyone talks about. It's a little trick which you can find in the Gensim code if you look at it: when it does man is to woman as king is to what, something that can often happen is that once you do your pluses and your minuses, the word that is actually closest is still king.

So the way people always do this is that they don't allow one of the three input words in the selection process. So you're choosing the nearest word that isn't one of the input words. Here I'm showing results from the GloVe vectors. So the GloVe vectors have this strong linear component property, just like I showed before for COALS.

So this is for the male-female dimension. And so because of this, you'd expect in a lot of cases that word analogies would work, because I can take the vector difference of man and woman. And then if I add that vector difference on to brother, I expect to get to sister, and king to queen, and so on for many of these examples.

But of course, they may not always work, right? Because if I start from emperor, it's sort of on a more of a lean. And so it might turn out that I get countess or duchess coming out instead. You can do this for various different relations, so different semantic relations.

So these sorts of word vectors actually learn quite a bit of just world knowledge. So here's company to CEO, or rather, this is company to CEO around 2010 to 2014, when the data the word vectors were trained on was taken. And as well as semantic or pragmatic things like this, they also learn syntactic things.

So here are vectors for positive, comparative, and superlative forms of adjectives. And you can see those also move in roughly linear components. So the word2vec people built a data set of analogies so you could evaluate different models on the accuracy of their analogies. And so here's how you can do this.
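A minimal sketch of that analogy evaluation, including the exclude-the-input-words trick mentioned above, might look like this; the embedding matrix E and the word2idx vocabulary mapping are assumptions of the example, not part of the actual evaluation scripts.

```python
import numpy as np

def analogy(a, b, c, word2idx, E):
    """Return the word d maximizing cos(E[d], E[b] - E[a] + E[c]),
    excluding the three input words, as in 'a : b :: c : ?'."""
    E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)
    target = E_norm[word2idx[b]] - E_norm[word2idx[a]] + E_norm[word2idx[c]]
    sims = E_norm @ target
    for w in (a, b, c):                 # the trick: never return an input word
        sims[word2idx[w]] = -np.inf
    best = int(np.argmax(sims))
    idx2word = {i: w for w, i in word2idx.items()}
    return idx2word[best]

# e.g. analogy("man", "woman", "king", word2idx, E) -> hopefully "queen"
```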

And this gives some numbers. So there are semantic and syntactic analogies. I'll just look at the totals. OK, so what I said before is if you just use unscaled co-occurrence counts and pass them through an SVD, things work terribly. And you see that there, you only get 7.3. But then, as I also pointed out, if you do some scaling, you can actually get SVD of a scaled count matrix to work reasonably well.

So this SVD-L is similar to the COALS model. And now we're getting up to 60.1, which actually isn't a bad score, right? So you can actually do a decent job without a neural network. And then here are the two variants of the word2vec model. And here are our results from the GloVe model.

And of course, at the time, 2014, we took this as absolute proof that our model was better and our more efficient use of statistics was really working in our favor. With seven years of retrospect, I think that's kind of not really true, it turns out. I think the main part of why we scored better is that we actually had better data.

And so there's a bit of evidence about that on this next slide here. So this looks at the semantic, syntactic, and overall performance on word analogies of GloVe models that were trained on different subsets of data. So in particular, the two on the left are trained on Wikipedia. And you can see that training on Wikipedia makes you do really well on semantic analogies, which maybe makes sense because Wikipedia just tells you a lot of semantic facts.

I mean, that's kind of what encyclopedias do. And so one of the big advantages we actually had was that the GloVe model was partly trained on Wikipedia as well as other text, whereas the word2vec model that was released was trained exclusively on Google News, so newswire data.

And if you only train on a smallish amount of newswire data, you can see that for the semantics, it's just not as good as even one quarter the amount of Wikipedia data. Though if you get a lot of data, you can compensate for that. So here on the right-hand side, we then have Common Crawl web data.

And so once there's a lot of web data, so now 42 billion words, you'll then start to get good scores again from the semantic side. The graph on the right then shows how well do you do as you increase the vector dimension. And so what you can see there is, you know, 25 dimensional vectors aren't very good.

They go up to sort of 50 and then 100. And 100-dimensional vectors already work reasonably well. So that's why I used 100-dimensional vectors when I showed my example in class: they're quick to load and work reasonably well. But you still get significant gains going to 200 and somewhat to 300.

So at least back around 2013 to 2015, everyone sort of gravitated to the idea that 300-dimensional vectors are the sweet spot. So most frequently, if you look through the best-known sets of word vectors, which include the Word2Vec vectors and the GloVe vectors, what you usually get is 300-dimensional word vectors.

That's not the only intrinsic evaluation you can do. Another intrinsic evaluation you can do is see how these models model human judgments of word similarity. So psychologists for several decades have actually taken human judgments of word similarity, where literally you're asking people for pairs of words like professor and doctor to give them a similarity score that's sort of being measured as some continuous quantity, giving you a score between, say, zero and 10.

And so there are human judgments which are then averaged over multiple human judgments as to how similar different words are. So tiger and cat is pretty similar. Computer and Internet is pretty similar. Machine and car is less similar. Stock and CD aren't very similar at all, but stock and Jaguar are even less similar.

So we could then say for our models, do they have the same similarity judgments? And in particular, we can measure a correlation coefficient of whether they give the same ordering of similarity judgments. And so then we can get data for that. And so there are various different data sets of word similarities, and we can score different models as to how well they do on similarities.
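A sketch of that intrinsic evaluation, scoring model cosine similarities against human judgments with a Spearman rank correlation (the input triples here are a made-up stand-in for datasets of the kind just described):

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def similarity_eval(pairs_with_human_scores, word2idx, E):
    """pairs_with_human_scores: list of (word1, word2, human_score) triples."""
    model_scores, human_scores = [], []
    for w1, w2, h in pairs_with_human_scores:
        model_scores.append(cosine(E[word2idx[w1]], E[word2idx[w2]]))
        human_scores.append(h)
    rho, _ = spearmanr(model_scores, human_scores)
    return rho   # higher = the model's ordering agrees better with humans
```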

And again, you see here that plain SVD works comparatively better here for similarities than it did for analogies. You know, it's not great, but it's now not completely terrible, because we no longer need that linear property. But again, scaled SVDs work a lot better. Word2Vec works a bit better than that.

And we got some of the same kind of minor advantages from the GloVe model. Hey, Chris, sorry to interrupt. A lot of the students were asking if you could re-explain the objective function for the GloVe model and also what log bilinear means. OK. Sure. OK, here is my objective function.

So one slide before that. So the property that we want is that we want the dot product to represent the log probability of co-occurrence. So that then gives me the tricky term log-bilinear. So the "bi" is that there's the w_i and the w_j, so there are two linear things, and it's linear in each one of them.

So rather than having something like ax, where you just have something that's linear in x and a is a constant, it's bilinear because we have the w_i and w_j, and it's linear in both of them. And that's then related to the log of a probability.

And so that gives us the log bilinear model. And so since we'd like these things to be equal, what we're doing here, if you ignore these two center terms, is that we're wanting to say the difference between these two is as small as possible. So we're taking this difference and we're squaring it.

So it's always positive. And we want that squared term to be as small as possible. And that's 90% of that. And you can basically stop there. But the other bit that's in here is a lot of the time when you're building models, rather than simply having sort of an ax model, it seems useful to have a bias term, which can move things up and down for the word in general.

And so we added into the model bias term so that there's a bias term for both words. So if in general probabilities are high for a certain word, this bias term can model that. And for the other word, this bias term can model it. So now I'll pop back.

Oh, actually, I just saw someone asked why we multiply by the f of... sorry, I did skip that last term. So why modify by this f of x_ij? So this last bit was to scale things depending on the frequency of a word, because you want to pay more attention to words, or word pairs, that are more common.

Because if you think about it in word2vec terms, if things have a co-occurrence count of 50 versus 3, you want to do a better job at modeling the co-occurrence of the things that occurred together 50 times. And so you want to take into account the count of co-occurrence.

But then the argument is that that actually leads you astray when you have extremely common words like function words. And so effectively, you paid more attention to words that co-occurred together up until a certain point. And then the curve just went flat. So it didn't matter if it was an extremely, extremely common word.

So then for extrinsic word vector evaluation. So at this point, you're now wanting to sort of say, well, can we embed our word vectors in some end user task? And do they help? And do different word vectors work better or worse than other word vectors? So this is something that we'll see a lot of later in the class.

I mean, in particular, when you get on to doing assignment 3, that assignment 3, you get to build dependency parsers. And you can then use word vectors in the dependency parser and see how much they help. We don't actually make you test out different sets of word vectors, but you could.

Here's just one example of this to give you a sense. So the task of named entity recognition is going through a piece of text and identifying mentions of a person name or an organization name like a company or a location. And so if you have good word vectors, do they help you do named entity recognition better?

And the answer to that is yes. So if one starts off with a model that simply has discrete features, so it uses word identity as features, you can build a pretty good named entity model doing that. But if you add word vectors into it, you get a better representation of the meaning of words, and so the numbers go up quite a bit.

And then you can compare different models to see how much gain they give you in terms of this extrinsic task. So skipping ahead, this was a question that I was asked after class, which was word sensors. So far, we've had just one word. Sorry, for one particular string, we've got some string house, and we're going to say for each of those strings, there's a word vector.

And if you think about it a bit more, that seems like it's very weird, because actually, most words, especially common words, and especially words that have existed for a long time, actually have many meanings, which are very different. So how could that be captured if you only have one word vector for the word, because you can't actually capture the fact that you've got different meanings for the word, because your meaning for the word is just one point in space, one vector.

And so as an example of that, here's the word pike. It's an old Germanic word. Well, what kinds of meanings does the word pike have? So you can maybe just think for a minute about what meanings the word pike has. And it actually turns out, you know, it has a lot of different meanings.

So perhaps the most basic meaning is, if you did fantasy games or something, medieval weapons, a sharp pointed staff is a pike. But there's a kind of a fish that has a similar elongated shape that's a pike. It was used for railroad lines. Maybe that usage isn't used much anymore, but it certainly still survives in referring to roads.

So this is like when you have turnpikes, we have expressions where pike means the future, like coming down the pike. It's a position in diving, that divers do a pike. Those are all noun uses. They're also verbal uses. So you can pike somebody with your pike. You know, different usages might have different currency.

In Australia, you can also use pike to mean that you pull out of doing something like, I reckon he's going to pike. I don't think that usage is used in America, but lots of meanings. And actually, for words that are common, or if you start thinking of words like line or field, I mean, they just have even more meanings than this.

So what are we actually doing with just one vector for a word? Well, one way you could go is to say, okay, up until now, what we've done is crazy. Pike, and other words, have all of these different meanings. So maybe what we should do is have different word vectors for the different meanings of pike.

So we'd have one word vector for the medieval pointy weapon, another word vector for the kind of fish, another word vector for the kind of road. So that they'd then be word sense vectors. And you can do that. I mean, actually, we were working on that in the early 2010s, actually, even before Word2Vec came out.

So this picture is a little bit small to see, but what we were doing was, for words, clustering instances of a word, so clustering the word tokens, hoping that the clusters that were similar represented senses. And then for the clusters of word tokens, we were sort of treating them like they were separate words and learning a word vector for each.

And basically that actually works. So in green, we have two sensors for the word bank. And so there's one sense for the word bank that's over here, where it's close to words like banking, finance, transaction, and laundering. And then we have another sense for the word bank over here, where it's close to words like plateau, boundary, gap, territory, which is the riverbank sense of the word bank.

And for the word jaguar, that's in purple. Well, jaguar has a number of senses. And so we have those as well. So this sense down here is close to hunter. So that's the sort of big game animal sense of jaguar. Up the top here, it's shown close to luxury and convertible.

So this is the jaguar car sense. Then jaguar here is near string, keyboard, and words like that. So jaguar is the name of a kind of keyboard. And then this final jaguar over here is close to software and Microsoft. And then if you're old enough, you'll remember that there was an old version of Mac OS that was called jaguar.

So that's then the computer sense. So basically, this does work. And we can learn word vectors for different senses of a word. But actually, this isn't the majority way that things have then gone in practice. And there are kind of a couple of reasons for that. I mean, one is just simplicity.

If you do this, it's kind of complex, because you first of all have to learn word senses and then start learning word vectors in terms of the word senses. But the other reason is, although this model of having word senses is traditional, it's what you see in dictionaries, and it's commonly what's been used in natural language processing.

I mean, it tends to be imperfect in its own way, because we're trying to take all the uses of the word pike and cut them up into key different senses, where the different senses kind of overlap, and it's often not clear which ones to count as distinct.

So for example, here, right, a railroad line and a type of road, well, sort of that's the same sense of pike, it's just that they're different forms of transportation. And so you know that this could be, you know, a type of transportation line and cover both of them. So it's always sort of very unclear how you cut word meaning into different senses.

And indeed, if you look at different dictionaries, everyone does it differently. So it actually turns out that in practice, you can do rather well by simply having one word vector per word type. And what happens if you do that? Well, what you find is that what you learn as a word vector is what gets referred to in fancy talk as a super superposition of the word vectors for the different senses of a word, where the word superposition means no more or less than a weighted sum.

So the vector that we learned for pike will be a weighted average of the vectors that you would have learned for the medieval weapon sense, plus the fish sense, plus the road sense, plus whatever other senses that you have, where the weighting that's given to these different sense vectors corresponds to the frequencies of use of the different senses.
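In symbols, for a word like pike with senses $1, \dots, k$:

$$v_{\text{pike}} \approx \alpha_1 v_{\text{pike}_1} + \alpha_2 v_{\text{pike}_2} + \dots + \alpha_k v_{\text{pike}_k}, \qquad \alpha_i = \frac{f_i}{\sum_j f_j},$$

where $f_i$ is the frequency of use of sense $i$.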

So we end up with the vector for pike being a kind of average vector. And so if you say, okay, you've just added up several different vectors into an average, you might think that that's kind of useless, because, you know, you've lost the real meanings of the word, and you've just got some kind of funny average vector that's in between them.

But actually, it turns out that if you use this average vector in applications, it tends to sort of self disambiguate. Because if you say, is the word pike similar to the word for fish? Well, part of this vector represents fish, the fish sense of pike. And so in those components, it'll be kind of similar to the fish vector.

And so yes, you'll say the substantial, there's substantial similarity. Whereas if in another piece of text that says, you know, the men were armed with pikes and lances or pikes and maces or whatever other medieval weapons you remember, well, actually, some of that meaning is in the pike vector as well.

And so it'll say, yeah, there's good similarity with mace and staff and words like that as well. And in fact, we can work out which sense of pike is intended by just sort of seeing which components are similar to other words that are used in the same context. And indeed, there's actually a much more surprising result than that.

And this is a result that's due to Sanjeev Arora, Tengyu Ma, who's now on our Stanford faculty, and others in 2018. And that's the following result, which I'm not actually going to explain. So if you think that the vector for pike is just a sum of the vectors for the different senses, you'd think that it's just completely impossible to reconstruct the sense vectors from the vector for the word type.

Because normally, if I say I've got two numbers, the sum of them is 17, you just have no information as to what my two numbers are, right? You can't resolve it. And even worse, if I tell you I've got three numbers, and they sum to 17. But it turns out that when we have these high dimensional vector spaces, that things are so sparse in those high dimensional vector spaces, that you can use ideas from sparse coding to actually separate out the different senses, providing they're relatively common.

So they show in their paper that you can start with the vector of a word and actually separate out components of that vector that correspond to its different senses. And so here's an example at the bottom of this slide, which, for the word tie, separates out that vector into five different senses.

And so there's one sense is close to the words trousers, blouse, waistcoats, and this is the sort of clothing sense of tie. Another sense is close to wires, cables, wiring, electrical. So that's the sort of the tie sense of a tie used in electrical stuff. Then we have sort of scoreline, goal is equalizer, so this is the sporting game sense of tie.

This one also seems to, in a different way, evoke the sporting game sense of tie. Then there's finally this one here; maybe my music knowledge is just really bad, but I guess it's because you get ties in music when you tie notes together. So you get these different senses out of it.

Thank you.