
Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 2 - Neural Classifiers


Chapters

0:00 Intro
2:05 Parameters
4:03 Learning Phase
7:59 Gradient Descent
8:39 Stochastic Gradient Descent
11:29 Note on Word Vectors
13:46 Word2Vec
16:53 Negative sampling
18:06 Logistic function
20:15 Negative log likelihood
23:27 Co-occurrence matrix
25:19 Questions
29:49 Co-occurrence matrices
32:19 Singular value decomposition
36:17 COALS model
37:53 GloVe algorithm
42:32 Log bilinear model
43:06 Explicit loss function
44:45 Question


00:00:00.000 | Okay, so what are we going to do for today?
00:00:07.520 | So the main content for today is to, um, go through sort of more stuff about word vectors,
00:00:17.040 | including touching on word sensors, and then introducing the notion of neural network classifiers.
00:00:23.240 | So our biggest goal is that by the end of today's class, you should feel like you could
00:00:28.040 | confidently look at one of the word embeddings papers, such as the Google Word2Vec paper,
00:00:34.280 | or the GloVe paper, or Sanjeev Arora's paper that we'll come to later, and feel like, yeah,
00:00:39.360 | I can understand this, I know what they're doing, and it makes sense.
00:00:43.480 | So let's go back to where we were.
00:00:45.920 | So this was sort of introducing this model of Word2Vec.
00:00:49.760 | And the idea was that we start with random word vectors, and then we're going
00:00:57.200 | to sort of, we have a big corpus of text, and we're going to iterate through each word
00:01:01.440 | in the whole corpus.
00:01:03.120 | And for each position, we're going to try and predict what words surround our center
00:01:09.880 | word.
00:01:10.880 | And we're going to do that with a probability distribution that's defined in terms of the
00:01:15.760 | dot product between the word vectors for the center word and the context words.
00:01:23.080 | And so that will give a probability estimate of a word appearing in the context of "into."
00:01:27.760 | Well, these actual words did occur in the context of "into" on this occasion.
00:01:32.600 | So what we're going to want to do is sort of make it more likely that problems, turning,
00:01:37.880 | banking, and crises will turn up in the context of "into."
00:01:42.040 | And so that's learning, updating the word vectors so that they can predict actual surrounding
00:01:47.080 | words better.
00:01:49.800 | And then the thing that's almost magical is that doing no more than this simple algorithm,
00:01:56.040 | this allows us to learn word vectors that capture well word similarity and meaningful
00:02:01.360 | directions in a word space.
00:02:05.100 | So more precisely, right, for this model, the only parameters of this model are the
00:02:11.560 | word vectors.
00:02:12.740 | So we have outside word vectors and center word vectors for each word.
00:02:17.440 | And then we're taking their dot product to get a probability.
00:02:21.640 | Well, we're taking a dot product to get a score of how likely a particular outside word
00:02:27.560 | is to occur with the center word.
00:02:29.600 | And then we're using the softmax transformation to convert those scores into probabilities,
00:02:34.880 | as I discussed last time, and I kind of come back to at the end this time.
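
As a minimal sketch of that score-to-probability step (with made-up toy vectors rather than anything from the course code), the naive softmax probability of an outside word o given a center word c can be computed like this:

```python
import numpy as np

def prob_outside_given_center(U, v_c, o):
    """P(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c).
    U is the (V x d) matrix of outside vectors, v_c the center vector, o a word index."""
    scores = U @ v_c                 # one dot-product score per vocabulary word
    scores -= scores.max()           # shift for numerical stability (doesn't change the softmax)
    exp_scores = np.exp(scores)
    return exp_scores[o] / exp_scores.sum()

# toy example: 5-word vocabulary, 4-dimensional vectors
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(5, 4))   # outside vectors
v_c = rng.normal(scale=0.1, size=4)      # center vector
print(prob_outside_given_center(U, v_c, o=2))
```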
00:02:40.320 | A couple of things to note.
00:02:42.620 | This model is what we call in NLP a bag of words model.
00:02:47.720 | So bag of words models are models that don't actually pay any attention to word order or
00:02:53.320 | position.
00:02:54.320 | It doesn't matter if you're next to the center word or a bit further away on the left or
00:02:58.080 | right.
00:02:59.080 | The probability estimate would be the same.
00:03:02.160 | And that seems like a very crude model of language that will offend any linguist.
00:03:07.880 | And it is a very crude model of language, and we'll move on to better models of language
00:03:11.840 | as we go on.
00:03:12.840 | But even that crude model of language is enough to learn quite a lot of the probability.
00:03:18.280 | Sorry, quite a lot about the properties of words.
00:03:23.840 | And then the second note is, well, with this model, we wanted to give reasonably high probabilities
00:03:32.760 | to the words that do occur in the context of the center word, at least if they do so
00:03:38.440 | at all often.
00:03:39.920 | But obviously lots of different words can occur.
00:03:43.080 | So we're not talking about probabilities like .3 and .5.
00:03:47.000 | We're more likely going to be talking about probabilities like .01 and numbers like that.
00:03:52.720 | Well, how do we achieve that?
00:03:55.800 | And well, the way that the word2vec model achieves this, and this is the learning phase
00:04:01.280 | of the model, is to place words that are similar in meaning close to each other in this high
00:04:09.200 | dimensional vector space.
00:04:11.180 | So again, you can't read this one, but if we scroll into this one, we see lots of words
00:04:18.600 | that are similar in meaning grouped close together in the space.
00:04:21.760 | So here are days of the week like Tuesday, Thursday, Sunday, and also Christmas.
00:04:27.080 | Over, what else do we have?
00:04:31.960 | We have Samsung and Nokia.
00:04:35.040 | This is a diagram I made quite a few years ago.
00:04:38.480 | So that's when Nokia was still an important maker of cell phones.
00:04:42.680 | We have various sort of fields like mathematics and economics over here.
00:04:47.520 | So we group words that are similar in meaning.
00:04:52.320 | Actually one more note I wanted to make on this.
00:04:54.280 | I mean, again, this is a two dimensional picture, which is all I can show you on a slide.
00:05:01.360 | And it's done with a principal components projection that you'll also use in the assignment.
00:05:08.400 | Something important to remember, but hard to remember, is that high dimensional spaces
00:05:14.160 | have very different properties to the two dimensional spaces that we can look at.
00:05:19.160 | And so in particular, a word, a vector, can be close to many other things in a high dimensional
00:05:27.000 | space, but close to them on different dimensions.
00:05:30.760 | Okay, so I've mentioned doing learning.
00:05:36.280 | So the next question is, well, how do we learn good word vectors?
00:05:42.880 | And this was the bit that I didn't quite hook up at the end of last class.
00:05:48.120 | So for a while in the last class, I said, oh, we do calculus, and we have to work out the gradient of the
00:05:55.080 | loss function with respect to the parameters, and that will allow us to make progress.
00:06:00.000 | But I didn't sort of altogether put that together.
00:06:02.260 | So what we're going to do is we start off with random word vectors.
00:06:08.800 | We initialize them to small numbers near zero in each dimension.
00:06:13.160 | We've defined our loss function J, which we looked at last time.
00:06:18.800 | And then we're going to use a gradient descent algorithm, which is an iterative algorithm
00:06:24.800 | that gradually minimizes J of theta by changing theta.
00:06:29.340 | And so the idea of this algorithm is that from the current values of theta, you calculate
00:06:38.520 | the gradient of J of theta.
00:06:38.520 | And then what you're going to do is make a small step in the direction of the negative
00:06:43.400 | gradient.
00:06:44.400 | So the gradient is pointing upwards, and we're taking a small step in the direction of the
00:06:49.560 | negative of the gradient to gradually move down towards the minimum.
00:06:55.320 | And so one of the parameters of neural nets that you can fiddle in your software package
00:07:00.600 | is what is the step size.
00:07:02.900 | So if you take a really, really itsy bitsy step, it might take you a long time to minimize
00:07:08.900 | the function.
00:07:09.900 | You do a lot of wasted computation.
00:07:13.220 | On the other hand, if your step size is much too big, well, then you can actually diverge
00:07:20.940 | and start going to worse places.
00:07:23.360 | Or even if you are going downhill a little bit, that what's going to happen is you're
00:07:27.580 | then going to end up bouncing back and forth, and it'll take you much longer to get to the
00:07:31.820 | minimum.
00:07:32.820 | OK, in this picture, I have a beautiful quadratic, and it's easy to minimize it.
00:07:40.980 | Something that you might know about neural networks is that in general, they're not convex.
00:07:45.780 | So you could think that this is just all going to go awry.
00:07:49.920 | But the truth is in practice, life works out to be OK.
00:07:52.940 | But I think I won't get into that more right now and come back to that in a later class.
00:07:59.380 | So this is our gradient descent.
00:08:01.420 | So we have the current values of the parameters theta.
00:08:04.740 | We then walk a little bit in the negative direction of the gradient using our learning
00:08:12.260 | rate or step size alpha.
00:08:14.400 | And that gives us new parameter values, where that means that these are vectors.
00:08:19.980 | But for each individual parameter, we are updating it a little bit by working out the
00:08:25.720 | partial derivative of J with respect to that parameter.
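
A minimal sketch of that update rule, run on a toy quadratic loss rather than the word2vec objective (the loss, starting point, and learning rate here are made up for illustration):

```python
import numpy as np

def gradient_descent_step(theta, grad_J, alpha=0.01):
    """theta_new = theta_old - alpha * dJ/dtheta, where alpha is the learning rate / step size."""
    return theta - alpha * grad_J(theta)

# toy example: minimize J(theta) = ||theta||^2, whose gradient is 2 * theta
theta = np.array([3.0, -2.0])
for _ in range(100):
    theta = gradient_descent_step(theta, grad_J=lambda t: 2 * t, alpha=0.1)
print(theta)   # close to the minimum at [0, 0]
```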
00:08:32.220 | So that's the simple gradient descent algorithm.
00:08:35.360 | Nobody uses it, and you shouldn't use it.
00:08:39.100 | The problem is that our J is a function of all windows in the corpus.
00:08:45.300 | Remember, we're doing this sum over every center word in the entire corpus.
00:08:51.760 | And we'll often have billions of words in the corpus.
00:08:54.600 | So actually working out J of theta or the gradient of J of theta would be extremely,
00:09:01.120 | extremely expensive because we have to iterate over our entire corpus.
00:09:04.640 | So you'd wait a very long time before you made a single gradient update.
00:09:09.120 | And so optimization would be extremely slow.
00:09:12.020 | And so basically 100% of the time in neural network land, we don't use gradient descent.
00:09:19.300 | We instead use what's called stochastic gradient descent.
00:09:22.680 | And stochastic gradient descent is a very simple modification of this.
00:09:27.400 | So rather than working out an estimate of the gradient based on the entire corpus, you
00:09:34.680 | simply take one center word or a small batch like 32 center words, and you work out an
00:09:41.440 | estimate of the gradient based on them.
00:09:44.780 | Now that estimate of the gradient will be noisy and bad because you've only looked at
00:09:50.720 | a small fraction of the corpus rather than the whole corpus.
00:09:54.160 | But nevertheless, you can use that estimate of the gradient to update your theta parameters
00:09:59.720 | in exactly the same way.
00:10:01.880 | And so this is the algorithm that we can do.
00:10:04.580 | And so then if we have a billion word corpus and we do an update on each center word, we
00:10:12.440 | can make a billion updates to the parameters as we pass through the corpus once, rather
00:10:17.560 | than only making one more accurate update to the parameters once you've been through
00:10:24.120 | the corpus.
00:10:25.280 | So overall, we can learn several orders of magnitude more quickly.
00:10:31.040 | And so this is the algorithm that you'll be using everywhere, including, you know, right
00:10:37.560 | from the beginning from our assignments.
00:10:41.200 | Again, just an extra comment on more complicated stuff we'll come back to.
00:10:47.400 | I presented stochastic gradient descent as if it's sort of a performance hack.
00:10:56.480 | It lets you learn much more quickly.
00:10:58.440 | It turns out it's not only a performance hack.
00:11:01.800 | Neural nets have some quite counterintuitive properties.
00:11:05.880 | And actually, the fact that stochastic gradient descent is kind of noisy and bounces around
00:11:12.120 | as it does its thing, it actually means that in complex networks, it learns better solutions
00:11:19.880 | than if you were to run plain gradient descent very slowly.
00:11:24.560 | So you can both compute much more quickly and do a better job.
00:11:29.240 | OK, one final note on running stochastic gradients with word vectors, this is kind of an aside,
00:11:37.180 | but something to note is that if we're doing a stochastic gradient update based on one
00:11:42.360 | window, then actually in that window, we'll have seen almost none of our parameters.
00:11:49.120 | Because if we have a window of something like five words to either side of the center word,
00:11:54.440 | we've seen at most 11 distinct word types.
00:11:58.560 | So we will have gradient information for those 11 words.
00:12:02.900 | But the other 100,000 odd words in our vocabulary will have no gradient update information.
00:12:09.440 | So this will be a very, very sparse gradient update.
00:12:13.900 | So if you're only thinking math, you can just have your entire gradient and use the equation
00:12:23.120 | that I showed before.
00:12:25.040 | But if you're thinking systems optimization, then you'd want to think, well, actually,
00:12:31.380 | I only want to update the parameters for a few words.
00:12:36.960 | And there have to be, and there are, much more efficient ways that I could do that.
00:12:43.580 | And so here's another aside that will be useful for the assignment.
00:12:49.080 | So I will say it up until now, when I presented word vectors, I presented them as column vectors.
00:12:57.860 | And that makes the most sense if you think about it as a piece of math.
00:13:03.840 | Whereas actually, in all common deep learning packages, including PyTorch that we're using,
00:13:12.120 | word vectors are actually represented as row vectors.
00:13:16.220 | And if you remember back to the representation of matrices in CS107 or something like that,
00:13:23.140 | you'll know that that's then obviously efficient for representing words, because then you can
00:13:29.800 | access an entire word vector as a continuous range of memory.
00:13:35.080 | Different if you're in Fortran.
00:13:36.080 | But anyway, so actually, our word vectors will be row vectors when you look at those
00:13:43.500 | inside PyTorch.
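
A small illustration of both points, assuming PyTorch's standard nn.Embedding API (the vocabulary size, word ids, and stand-in loss here are hypothetical):

```python
import torch
import torch.nn as nn

vocab_size, dim = 100_000, 100
# each row of the weight matrix is one word's vector (rows are contiguous in memory)
center_vecs = nn.Embedding(vocab_size, dim, sparse=True)

# one window only touches a handful of word ids, so only those rows get gradients
word_ids = torch.tensor([17, 42, 42, 99])        # hypothetical ids from one window
vectors = center_vecs(word_ids)                  # shape (4, 100): rows pulled from the table
loss = vectors.sum()                             # stand-in for the real word2vec loss
loss.backward()
print(center_vecs.weight.grad.is_sparse)         # True: gradients only for the touched rows

# plain SGD (no momentum or weight decay) can apply this sparse update directly
torch.optim.SGD(center_vecs.parameters(), lr=0.025).step()
```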
00:13:45.500 | OK, now I wanted to say a bit more about the word2vec algorithm family and also what
00:13:55.460 | you're going to do in homework two.
00:13:58.800 | So you're still meant to be working on homework one, which, remember, is due next Tuesday,
00:14:04.060 | but really actually with today's content, we're starting into homework two.
00:14:08.900 | And I'll kind of go through the first part of homework two today and the other stuff
00:14:13.560 | you need to know for homework two.
00:14:15.880 | So I mentioned briefly the idea that we have two separate vectors for each word type, the
00:14:21.940 | center vector and the outside vectors.
00:14:25.300 | And we just average them both at the end.
00:14:27.220 | They're similar, but not identical for multiple reasons, including the random initialization
00:14:32.580 | and the stochastic gradient descent.
00:14:36.040 | You can implement a word2vec algorithm with just one vector per word.
00:14:42.540 | And actually, if you do, it works slightly better, but it makes the algorithm much more
00:14:48.340 | complicated.
00:14:49.500 | And the reason for that is sometimes you will have the same word type as the center word
00:14:56.780 | and the context word.
00:14:58.740 | And that means that when you're doing your calculus at that point, you've then got this
00:15:04.060 | sort of messy case that, just for that word, you're getting an x squared, sorry, dot product.
00:15:11.260 | You're getting a dot product of x dot x term, which makes it sort of much messier to work out.
00:15:16.620 | And so that's why we use this sort of simple optimization of having two vectors per word.
00:15:21.700 | Okay.
00:15:22.700 | So for the word2vec model as introduced in the Mikolov et al. paper in 2013, it wasn't
00:15:32.500 | really just one algorithm.
00:15:36.160 | It was a family of algorithms.
00:15:38.480 | So there were two basic model variants.
00:15:42.000 | One was called the skip gram model, which is the one that I've explained to you, that
00:15:46.540 | predicts outside words, position independent, given the center word, in a bag of words style
00:15:56.900 | model.
00:15:57.900 | The other one was called the continuous bag of words model, CBOW.
00:16:01.340 | And in this one, you predict the center word for a bag of context words.
00:16:07.580 | Both of these give similar results.
00:16:09.780 | The skip gram one is more natural in various ways.
00:16:13.300 | So it's sort of normally the one that people have gravitated to in subsequent work.
00:16:19.620 | But then as to how you train this model, what I've presented so far is the naive softmax
00:16:26.900 | equation, which is a simple but relatively expensive training method.
00:16:33.980 | And so that isn't really what they suggest using in the paper.
00:16:39.120 | They suggest using a method that's called negative sampling.
00:16:42.240 | So an acronym you'll see sometimes is SGNS, which means skip-gram negative sampling.
00:16:49.200 | So let me just say a little bit about what this is.
00:16:53.900 | But actually doing the skip gram model with negative sampling is part of homework two.
00:17:00.340 | So you'll get to know this model well.
00:17:01.740 | So the point is that if you use this naive softmax, you know, even though people commonly
00:17:06.900 | do use this naive softmax in various neural net models, that working out the denominator
00:17:13.660 | is pretty expensive.
00:17:15.460 | And that's because you have to iterate over every word in the vocabulary and work out
00:17:21.820 | these dot products.
00:17:22.940 | So if you have a hundred thousand word vocabulary, you have to do a hundred thousand dot products
00:17:29.980 | to work out the denominator.
00:17:32.300 | And that seems a little bit of a shame.
00:17:34.860 | And so instead of that, the idea of negative sampling is where instead of using this softmax,
00:17:42.780 | we're going to train binary logistic regression models for both the true pair of
00:17:52.020 | center word and the context word versus noise pairs where we keep the true center word and
00:18:01.820 | we just randomly sample words from the vocabulary.
00:18:06.780 | So as presented in the paper, the idea is like this.
00:18:10.820 | So overall, what we want to optimize is still an average of the loss for each particular
00:18:19.780 | center word.
00:18:21.340 | But for when we're working out the loss for each particular center word, we're going to
00:18:25.980 | work out, sorry, the loss for each particular center word and each particular window.
00:18:31.740 | We're going to take the dot product as before of the center word and the outside word.
00:18:39.980 | And that's sort of the main quantity.
00:18:42.220 | But now instead of using that inside the softmax, we're going to put it through the logistic
00:18:47.780 | function, which is also often called the sigmoid function.
00:18:52.380 | The name logistic is more precise.
00:18:54.040 | So that's this function here.
00:18:55.660 | So the logistic function is a handy function that will map any real number to a probability
00:19:02.500 | in the open interval between zero and one.
00:19:05.340 | So basically, if the dot product is large, the logistic of the dot product will be virtually one.
00:19:13.460 | Okay, so we want this to be large.
00:19:16.660 | And then what we'd like is on average, we'd like the dot product between the center word
00:19:22.900 | and words that we just chose randomly, i.e. they most likely didn't actually occur in
00:19:29.060 | the context of the center word to be small.
00:19:33.100 | And there's just one little trick of how this is done, which is this sigmoid function is
00:19:40.580 | symmetric.
00:19:41.780 | And so if we want this probability to be small, we can take the negative of the dot product.
00:19:51.460 | So we're wanting it to be over here, that the product, the dot product of random word
00:19:57.420 | in the center word is a negative number.
00:20:01.020 | And so then we're going to take the negation of that.
00:20:04.500 | And then again, once we put that through the sigmoid, we'd like a big number.
00:20:08.860 | Okay, so the way they're presenting things, they're actually maximizing this quantity.
00:20:14.060 | But if I go back to making it a bit more similar to the way we had written things, we'd worked
00:20:21.060 | with minimizing the negative log likelihood.
00:20:25.700 | So it looks like this.
00:20:28.420 | So we're taking the negative log likelihood of this, the sigmoid of the dot product.
00:20:35.340 | Again, negative log likelihood, we're using the same negated dot product through the sigmoid.
00:20:43.020 | And then we're going to work out this quantity for a handful of random words.
00:20:50.660 | We take negative samples, and how likely we are to sample a word depends on its probability.
00:20:57.740 | And this loss function is going to be minimized, given this negation, by making these
00:21:05.060 | dot products large, and these dot products small, which means negative.
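
Written out as code, a minimal sketch of that per-window negative sampling loss (random toy vectors stand in for the real parameters):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_c, u_o, U_neg):
    """J = -log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c)

    v_c:   center word vector
    u_o:   outside vector of the word actually observed in the context
    U_neg: (K x d) matrix of outside vectors for K sampled negative words
    """
    pos = -np.log(sigmoid(u_o @ v_c))                # want the true pair's dot product large
    neg = -np.sum(np.log(sigmoid(-(U_neg @ v_c))))   # want the noise pairs' dot products small (negative)
    return pos + neg

rng = np.random.default_rng(0)
d, K = 50, 10
print(neg_sampling_loss(rng.normal(size=d), rng.normal(size=d), rng.normal(size=(K, d))))
```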
00:21:14.260 | So there was just then one other trick that they use.
00:21:19.980 | Actually there's more than one other trick that's used in the Word2Vec paper to get it
00:21:23.620 | to perform well, but I'll only mention one of their other tricks here.
00:21:28.300 | When they sample the words, they don't simply just sample the words based on their probability
00:21:36.300 | of occurrence in the corpus or uniformly.
00:21:39.740 | What they do is they start with what we call the unigram distribution of words.
00:21:44.140 | So that is how often words actually occur in our big corpus.
00:21:49.660 | So if you have a billion word corpus and a particular word occurred 90 times in it, you're
00:21:55.820 | taking 90 divided by a billion, and so that's the unigram probability of the word.
00:22:01.340 | But what they then do is that they take that to the three quarters power.
00:22:06.340 | And the effect of that three quarters power, which is then renormalized to make a probability
00:22:10.620 | distribution with Z, kind of like we saw last time with the softmax.
00:22:15.260 | By taking the three quarters power, that has the effect of dampening the difference between
00:22:21.780 | common and rare words so that less frequent words are sampled somewhat more often, but
00:22:28.260 | still not nearly as much as they would be if you just use something like a uniform distribution
00:22:34.260 | over the vocabulary.
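
A tiny numeric sketch of that three-quarters-power trick, with made-up counts:

```python
import numpy as np

# hypothetical raw counts of each word in the corpus (the unigram distribution)
counts = np.array([900_000, 100_000, 10_000, 1_000, 100], dtype=float)

unigram = counts / counts.sum()
damped = unigram ** 0.75
damped /= damped.sum()          # renormalize (the "Z" mentioned above)

print(np.round(unigram, 4))     # rare words are almost never sampled
print(np.round(damped, 4))      # rare words are sampled somewhat more often

# draw 10 negative samples from the damped distribution
rng = np.random.default_rng(0)
print(rng.choice(len(counts), size=10, p=damped))
```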
00:22:36.300 | Okay, so that's basically everything to say about the basics of how we have this very
00:22:47.660 | simple neural network algorithm, Word2Vec, and how we can train it and learn word vectors.
00:22:56.700 | So for the next bit, what I want to do is step back a bit and say, well, here's an algorithm
00:23:02.180 | that I've shown you that works great.
00:23:06.260 | What else could we have done and what can we say about that?
00:23:10.740 | The first thing that you might think about is, well, here's this funny iterative algorithm
00:23:18.340 | to give you word vectors.
00:23:22.580 | You know, if we have a lot of words in a corpus, it seems like a more obvious thing that we
00:23:28.540 | could do is just look at the counts of how words occur with each other and build a matrix
00:23:37.100 | of counts, a co-occurrence matrix.
00:23:40.780 | So here's the idea of a co-occurrence matrix.
00:23:44.340 | So I've got a teeny little corpus.
00:23:46.180 | I like deep learning.
00:23:47.420 | I like NLP.
00:23:48.540 | I enjoy flying.
00:23:50.820 | And I can define a window size.
00:23:52.820 | I made my window simply size one to make it easy to fill in my matrix, symmetric just
00:24:00.380 | like our Word2Vec algorithm.
00:24:02.940 | And so then the counts in these cells are simply how often things co-occur in the window
00:24:11.300 | of size one.
00:24:12.800 | So "I like" occurs twice.
00:24:16.100 | So we get twos in these cells because it's symmetric.
00:24:19.980 | "Deep learning" occurs once.
00:24:21.940 | So we get a one here, and lots of other things occur zero times.
00:24:26.260 | So we can build up a co-occurrence matrix like this.
00:24:31.460 | And well, these actually give us a representation of words as co-occurrence vectors.
00:24:38.700 | So I can take the word I with either a row or a column vector since it's symmetric and
00:24:44.060 | say okay, my representation of the word I is this row vector.
00:24:50.840 | And that is a representation of the word I.
00:24:53.780 | And I think you can maybe convince yourself that to the extent that words have similar
00:24:59.860 | meaning and usage, you'd sort of expect them to have somewhat similar vectors, right?
00:25:05.980 | So if I had the word you as well on a larger corpus, you might expect I and you to have
00:25:11.220 | similar vectors because of "I like", "you like", "I enjoy", "you enjoy".
00:25:15.820 | You'd see the same kinds of possibilities.
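
A small sketch that builds exactly this window-1 co-occurrence matrix for the three-sentence corpus above:

```python
corpus = ["I like deep learning", "I like NLP", "I enjoy flying"]
window = 1  # symmetric window of size 1, as on the slide

vocab = sorted({w for sent in corpus for w in sent.split()})
idx = {w: i for i, w in enumerate(vocab)}
counts = [[0] * len(vocab) for _ in vocab]

for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[idx[w]][idx[words[j]]] += 1

for w in vocab:
    print(w, counts[idx[w]])   # the row for "I" has a 2 in the "like" column
```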
00:25:18.140 | Hey, Chris, can we pause here to answer some questions?
00:25:21.820 | Sure.
00:25:22.820 | All right.
00:25:23.820 | So we got some questions from negative, sort of the negative sampling slides.
00:25:29.840 | In particular, what's like, can you give some intuition for negative sampling?
00:25:34.940 | What is the negative sampling doing?
00:25:36.820 | And why do we only take one positive example?
00:25:39.820 | Those are two questions to be answered in tandem.
00:25:43.060 | Okay, that's a good question.
00:25:44.660 | Okay, I'll try and give more intuition.
00:25:46.900 | So the idea is to work out something like what the softmax did in a much more efficient way.
00:25:59.300 | So, in the softmax, well, you wanted to give high probability, in predicting the
00:26:08.860 | context, to a context word that actually did appear with the center word.
00:26:14.500 | And well, the way you do that is by having the dot product between those two words be
00:26:20.300 | as big as possible.
00:26:22.340 | And that's part of how you do that, but you know, it's more than that, because
00:26:28.380 | in the denominator, you're also working out the dot product with every other word in the
00:26:33.220 | vocabulary.
00:26:34.480 | So as well as wanting the dot product with the actual word that you see in the context
00:26:39.380 | to be big, you maximize your likelihood by making the dot products of other words that
00:26:47.600 | weren't in the context smaller, because that's shrinking your denominator.
00:26:52.740 | And therefore, you've got a bigger number coming out and you're maximizing the likelihood.
00:26:58.920 | So even for the softmax, the general thing that you want to do to maximize it is have
00:27:04.540 | dot product with words actually in the context big dot product with words and not in the
00:27:10.020 | context be small to the extent possible.
00:27:14.100 | And obviously, you have to average this as best you can over all kinds of different contexts,
00:27:18.620 | because sometimes different words appear in different contexts, obviously.
00:27:25.700 | So negative sampling is a way of therefore trying to maximize the same objective.
00:27:32.940 | Now, you know, you only have one positive term because you're actually
00:27:40.140 | wanting to use the actual data.
00:27:42.580 | So you're not wanting to invent data.
00:27:45.280 | So for working out the entire J, we do work this quantity out for every center word
00:27:52.340 | and every context word.
00:27:54.500 | So you know, we are iterating over the different words in the context window, and then we're
00:27:59.460 | moving through positions in the corpus.
00:28:01.660 | So we're doing this for different v_c, for different center words.
00:28:03.220 | So, you know, gradually we do this, but for one particular center word and one particular
00:28:08.200 | context word, we only have one real piece of data that's positive.
00:28:12.820 | So that's all we use because we don't know what other words should be counted as positive
00:28:19.540 | words.
00:28:20.540 | Now for the negative words, you could just sample one negative word and that would probably
00:28:28.900 | work.
00:28:29.900 | But if you want to sort of a slightly better, more stable sense of, OK, we'd like to, in
00:28:36.860 | general, have other words have low probability.
00:28:39.840 | It seems like you might be able to get better, more stable results if you instead say, let's
00:28:44.800 | sample 10 or 15 negative words.
00:28:48.440 | And indeed that's been found to be true.
00:28:52.200 | And for the negative words, well, it's easy to sample any number of random words you want.
00:28:57.160 | And at that point, it's kind of a probabilistic argument.
00:29:00.040 | The words that you're sampling might not be actually bad words to appear in the context.
00:29:06.440 | They might actually be other words that are in the context, but 99.9% of the time, they
00:29:11.840 | will be unlikely words to occur in the context.
00:29:15.520 | And so they're good ones to use.
00:29:17.880 | And yes, you only sample 10 or 15 of them, but that's enough to make progress because
00:29:25.160 | the center word is going to turn up on other occasions.
00:29:28.920 | And when it does, you'll sample different words over here so that you gradually sample
00:29:33.240 | different parts of the space and start to learn.
00:29:36.260 | We had this co-occurrence matrix and it gives a representation of words as co-occurrence
00:29:44.400 | vectors.
00:29:46.600 | And just one more note on that.
00:29:49.280 | I mean, there are actually two ways that people have commonly made these co-occurrence matrices.
00:29:54.720 | One corresponds to what we've seen already, that you use a window around a word, which
00:30:00.160 | is similar to word2vec.
00:30:02.800 | And that allows you to capture some locality and some of the sort of syntactic and semantic
00:30:07.640 | proximity that's more fine grained.
00:30:10.720 | The other way these co-occurrence matrices have often been made is that normally documents
00:30:18.620 | have some structure, whether it's paragraphs or just actual web pages, sort of size documents.
00:30:25.120 | So you can just make your window size a paragraph or a whole web page and count co-occurrence
00:30:31.180 | in those.
00:30:32.180 | And this is the kind of method that's often being used in information retrieval in methods
00:30:36.960 | like latent semantic analysis.
00:30:39.440 | OK, so the question then is, are these kind of count word vectors good things to use?
00:30:50.960 | Well, people have used them.
00:30:53.520 | They're not terrible, but they have certain problems.
00:30:57.280 | But the kind of problems that they have, well, firstly, they're huge, though very sparse.
00:31:03.500 | So this is back where I said before, if we had a vocabulary of half a million words,
00:31:08.360 | when then we have a half a million dimensional vector for each word, which is much, much
00:31:14.600 | bigger than the word vectors that we typically use.
00:31:20.100 | And it also means that because we have these very high dimensional vectors, that we have
00:31:27.220 | a lot of sparsity and a lot of randomness.
00:31:30.620 | So the results that you get tend to be noisier and less robust, depending on what particular
00:31:36.440 | stuff was in the corpus.
00:31:38.860 | And so in general, people have found that you can get much better results by working
00:31:44.280 | with low dimensional vectors.
00:31:46.520 | So then the idea is we can store the most of the important information about the distribution
00:31:53.420 | of words in the context of other words in a fixed small number of dimensions, giving
00:31:59.400 | a dense vector.
00:32:01.180 | And in practice, the dimensionality of the vectors that are used are normally somewhere
00:32:05.800 | between 25 and 1000.
00:32:08.420 | And so at that point, we need to use some way to reduce the dimensionality of our count
00:32:16.040 | co-occurrence vectors.
00:32:19.880 | So if you have a good memory from a linear algebra class, you hopefully saw singular
00:32:28.180 | value decomposition.
00:32:30.380 | And it has various mathematical properties that I'm not going to talk about here, with
00:32:37.720 | singular value decomposition giving you an optimal way, under a certain definition of optimality,
00:32:43.840 | of producing a reduced dimensionality matrix, or sorry, pair of matrices,
00:32:52.460 | that maximally well lets you recover the original matrix.
00:32:56.680 | But the idea of the singular value decomposition is you can take any matrix such as our count
00:33:03.120 | matrix and you can decompose that into three matrices, U, a diagonal matrix, sigma and
00:33:13.840 | a V transpose matrix.
00:33:17.920 | And this works for any shape.
00:33:20.160 | Now, in these matrices, some parts of it are never used because since this matrix is rectangular,
00:33:28.680 | there's nothing over here, and so this part of the V transpose matrix gets ignored.
00:33:35.520 | But if you're wanting to get smaller dimensional representations, what you do is take advantage
00:33:42.840 | of the fact that the singular values inside the diagonal sigma matrix are ordered from
00:33:50.320 | largest down to smallest.
00:33:52.880 | So what we can do is just delete out more of the matrix by deleting out the smallest singular
00:34:00.520 | values, which effectively means that in this product, some of U and some of V is also not
00:34:08.480 | used.
00:34:09.680 | And so then as a result of that, we're getting lower dimensional representations for our
00:34:17.480 | words if we're wanting to have word vectors, which still do as good as possible a job within
00:34:24.360 | the given dimensionality of enabling you to recover the original co-occurrence matrix.
00:34:34.080 | So from a linear algebra background, this is the obvious thing to use.
00:34:40.420 | So how does that work?
00:34:42.720 | Well, if you just build a raw count co-occurrence matrix and run SVD on that and try and use
00:34:52.120 | those as word vectors, it actually works poorly.
00:34:56.620 | And it works poorly because if you get into the mathematical assumptions of SVD, you're
00:35:02.600 | expecting to have these normally distributed errors.
00:35:07.680 | And what you're getting with word counts looks not at all like something that's normally distributed
00:35:16.840 | because you have exceedingly common words like "a", "the", and "and", and you have a very
00:35:22.180 | large number of rare words.
00:35:24.260 | So that doesn't work very well.
00:35:25.920 | But you can actually get something that works a lot better if you scale the counts in the
00:35:31.640 | cells.
00:35:32.680 | So to deal with this problem of extremely frequent words, there are some things we can do.
00:35:38.400 | We could just take the log of the raw counts.
00:35:41.420 | We could kind of cap the maximum count.
00:35:44.720 | We could throw away the function words.
00:35:47.280 | And any of these kinds of ideas let you then have a co-occurrence matrix from which you
00:35:53.480 | get more useful word vectors by running something like SVD.
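
A minimal sketch of that pipeline with NumPy, using a random toy count matrix and log scaling (one of the scalings mentioned above) before a truncated SVD:

```python
import numpy as np

# X: a toy co-occurrence count matrix (vocab_size x vocab_size); real counts would come from a corpus
rng = np.random.default_rng(0)
X = rng.poisson(lam=2.0, size=(1000, 1000)).astype(float)

X_scaled = np.log1p(X)                 # log(1 + count) dampens the very frequent words

U, s, Vt = np.linalg.svd(X_scaled, full_matrices=False)
k = 100                                # keep only the k largest singular values
word_vectors = U[:, :k] * s[:k]        # k-dimensional dense vectors, one row per word

print(word_vectors.shape)              # (1000, 100)
```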
00:35:58.680 | And indeed, these kind of models were explored in the 1990s and in the 2000s.
00:36:05.840 | And in particular, Doug Rohde explored a number of these ideas as to how to improve the co-occurrence
00:36:13.320 | matrix in a model that he built that was called COALS.
00:36:18.200 | And actually, in his COALS model, he observed the fact that you could get the same kind
00:36:27.800 | of linear components that have semantic components that we saw yesterday when talking about analogies.
00:36:37.400 | So for example, this is a figure from his paper.
00:36:41.800 | And you can see that we seem to have a meaning component going from a verb to the person
00:36:48.840 | who does the verb.
00:36:50.200 | So drive to driver, swim to swimmer, teach to teacher, marry to priest.
00:36:55.480 | And that these vector components are not perfectly, but are roughly parallel and roughly the same
00:37:03.800 | size.
00:37:04.840 | And so we have a meaning component there that we could add on to another word, just like
00:37:10.640 | we did previously for analogies.
00:37:13.120 | We could say drive is to driver as marry is to what.
00:37:18.080 | And we'd add on this green vector component, which is roughly the same as this one.
00:37:23.240 | And we'd say, oh, priest.
00:37:24.920 | So this space could actually get some word vectors analogies right as well.
00:37:33.160 | And so that seemed really interesting to us around the time Word2Vec came out of wanting
00:37:40.160 | to understand better what the iterative updating algorithm of Word2Vec did and how it related
00:37:46.320 | to these more linear algebra based methods that had been explored in the couple of decades
00:37:51.760 | previously.
00:37:53.000 | And so for the next bit, I want to tell you a little bit about the GloVe algorithm, which
00:37:58.040 | was an algorithm for word vectors that was made by Jeffrey Pennington, Richard Socher
00:38:04.080 | and me in 2014.
00:38:07.400 | And so the starting point of this was to try to connect together the linear algebra based
00:38:14.200 | methods on co-occurrence matrices like LSA and COALS with the models like skip-gram,
00:38:20.520 | CBOW and their other friends, which were iterative neural updating algorithms.
00:38:26.400 | So on the one hand, you know, the linear algebra methods actually seemed like they had advantages
00:38:32.560 | for fast training and efficient usage of statistics.
00:38:36.760 | But although there had been work on capturing word similarities with them, by and large,
00:38:44.120 | the results weren't as good, perhaps because of disproportionate importance given to large
00:38:48.400 | counts in the main.
00:38:50.120 | Conversely, the models, the neural models, it seems like if you're just doing these gradient
00:38:58.480 | updates on windows, you're somehow inefficiently using statistics versus a co-occurrence matrix.
00:39:05.820 | But on the other hand, it's actually easier to scale to a very large corpus by trading
00:39:12.160 | time for space.
00:39:14.320 | And at that time, it seemed like the neural methods just worked better for people, that
00:39:20.400 | they generated improved performance on many tasks, not just on word similarity, and that
00:39:25.840 | they could capture complex patterns such as the analogies that went beyond word similarity.
00:39:33.560 | And so what we wanted to do was understand a bit more as to what properties do you need
00:39:39.720 | to have this analogies work out, as I showed last time.
00:39:44.980 | And so what we realized was that if you'd like to have these sort of vector subtractions
00:39:54.520 | and additions work for an analogy, the property that you want is for meaning components.
00:40:04.440 | So a meaning component is something like going from male to female, queen to king, or going
00:40:12.320 | from a verb to its agent, truck to driver, that those meaning components should be represented
00:40:23.020 | as ratios of co-occurrence probabilities.
00:40:26.160 | So here's an example that shows that.
00:40:29.800 | So suppose the meaning component that we want to get out is the spectrum from solid to gas,
00:40:38.300 | as in physics.
00:40:39.800 | Well, you'd think that you can get at the solid part of it, perhaps by saying, does
00:40:46.080 | the word co-occur with ice?
00:40:48.440 | And the word solid occurs with ice, so that looks hopeful.
00:40:52.200 | And gas doesn't occur with ice much, so that looks hopeful.
00:40:55.760 | But the problem is the word water will also occur a lot with ice.
00:41:00.640 | And if you just take some other random word, like the word random, it probably doesn't
00:41:04.960 | occur with ice much.
00:41:08.040 | In contrast, if you look at words co-occurring with steam, solid won't occur with steam much,
00:41:15.920 | but gas will.
00:41:16.920 | But water will again, and random will be small.
00:41:20.940 | So to get out the meaning component we want of going from gas to solid, what's actually
00:41:26.660 | really useful is to look at the ratio of these co-occurrence probabilities.
00:41:32.940 | Because then we get a spectrum from large to small between solid and gas, whereas for
00:41:39.660 | water and a random word, it basically cancels out and gives you one.
00:41:45.880 | I just wrote these numbers in, but if you count them up in a large corpus, it is basically
00:41:52.180 | what you get.
00:41:53.180 | So here are actual co-occurrence probabilities, and that for water and my random word, which
00:42:00.060 | was fashion here, these are approximately one.
00:42:03.820 | Whereas for the ratio of probability of co-occurrence of solid with ice or steam is about 10, and
00:42:13.180 | for gas it's about a 10th.
00:42:16.580 | So how can we capture these ratios of co-occurrence probabilities as linear meaning components
00:42:26.220 | so that in our word vector space we can just add and subtract linear meaning components?
00:42:32.420 | Well, it seems like the way we can achieve that is if we build a log bilinear model so
00:42:39.900 | that the dot product between two word vectors attempts to approximate the log of the probability
00:42:47.180 | of co-occurrence.
00:42:48.860 | So if you do that, you then get this property that the difference between two vectors, its
00:42:59.060 | similarity to another word corresponds to the log of the probability ratio shown on
00:43:05.060 | the previous slide.
00:43:07.100 | So the glove model wanted to try and unify the thinking between the co-occurrence matrix
00:43:15.940 | models and the neural models by being in some way similar to a neural model, but actually
00:43:23.340 | calculated on top of a co-occurrence matrix count.
00:43:29.420 | So we had an explicit loss function, and our explicit loss function is that we wanted the
00:43:36.540 | dot product to be similar to the log of the co-occurrence.
00:43:42.020 | We actually added in some bias terms here, but I'll ignore those for the moment.
00:43:46.420 | And we wanted to not have very common words dominate, and so we capped the effect of high
00:43:53.140 | word counts using this f function that's shown here.
00:43:57.860 | And then we could optimize this j function directly on the co-occurrence count matrix.
00:44:05.940 | So that gave us fast training scalable to huge corpora.
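
As a rough sketch, the loss just described can be written as follows (random toy parameters and counts; the weighting function's x_max = 100 and alpha = 3/4 follow the GloVe paper):

```python
import numpy as np

def glove_loss(W, W_tilde, b, b_tilde, X, x_max=100.0, alpha=0.75):
    """J = sum over nonzero X_ij of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
    J = 0.0
    for i, j in zip(*np.nonzero(X)):
        f = min((X[i, j] / x_max) ** alpha, 1.0)        # cap the effect of very large counts
        diff = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
        J += f * diff ** 2
    return J

rng = np.random.default_rng(0)
V, d = 50, 10
X = rng.poisson(lam=1.0, size=(V, V)).astype(float)     # toy co-occurrence counts
print(glove_loss(rng.normal(size=(V, d)), rng.normal(size=(V, d)),
                 rng.normal(size=V), rng.normal(size=V), X))
```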
00:44:11.340 | And so this algorithm worked very well.
00:44:15.260 | So if you run this algorithm and ask what are the nearest words to frog, you get frogs,
00:44:21.540 | toad, and then you get some complicated words, but it turns out they are all frogs until
00:44:27.460 | you get down to lizards.
00:44:28.740 | So Litoria is that lovely tree frog there.
00:44:32.340 | And so this actually seemed to work out pretty well.
00:44:36.060 | How well did it work out?
00:44:37.940 | To discuss that a bit more, I now want to say something about how do we evaluate word
00:44:43.260 | vectors.
00:44:44.260 | Are we good for up to there for questions?
00:44:48.020 | We've got some questions.
00:44:51.940 | What do you mean by an inefficient use of statistics as a con for skip gram?
00:44:56.260 | Well, what I mean is that, you know, for word2vec, you're just, you know, looking at one center
00:45:06.700 | word at a time and generating a few negative samples.
00:45:11.580 | And so it sort of seems like you're doing something imprecise there.
00:45:17.380 | Whereas if you're doing an optimization algorithm on the whole matrix at once, well, you actually
00:45:25.620 | know everything about the matrix at once.
00:45:27.660 | You're not just looking at what words, what other words occurred in this one context of
00:45:34.060 | the center word.
00:45:35.380 | You've got the entire vector of co-occurrence accounts for the center word and another word.
00:45:41.260 | And so therefore you can much more efficiently and less noisily work out how to minimize
00:45:48.200 | your loss.
00:45:49.700 | Okay.
00:45:50.700 | I'll go on.
00:45:53.620 | Okay.
00:45:54.820 | So I've sort of said, look at these word vectors.
00:45:58.380 | They're great.
00:45:59.380 | And I sort of showed you a few things at the end of the last class, which argued, hey,
00:46:04.020 | these are great.
00:46:05.020 | You know, they work out these analogies.
00:46:08.580 | They show similarity and things like this.
00:46:11.820 | We want to make this a bit more precise.
00:46:14.740 | And indeed for natural language processing, as in other areas of machine learning, a big
00:46:19.860 | part of what people are doing is working out good ways to evaluate knowledge that things
00:46:25.700 | have.
00:46:27.300 | So how can we really evaluate word vectors?
00:46:30.180 | So in general for NLP evaluation, people talk about two ways of evaluation, intrinsic and
00:46:36.780 | extrinsic.
00:46:38.060 | So an intrinsic evaluation means that you evaluate directly on the specific or intermediate
00:46:47.140 | subtasks that you've been working on.
00:46:49.260 | So I want a measure where I can directly score how good my word vectors are.
00:46:54.580 | And normally intrinsic evaluations are fast to compute.
00:46:59.620 | They helped you to understand the component you've been working on.
00:47:03.740 | But often simply trying to optimize that component may or may not have a very big good effect
00:47:12.060 | on the overall system that you're trying to build.
00:47:17.000 | So people have also been very interested in extrinsic evaluations.
00:47:22.260 | So an extrinsic evaluation is that you take some real task of interest to human beings,
00:47:29.180 | whether that's web search or machine translation or something like that, and you say your goal
00:47:35.420 | is to actually improve performance on that task.
00:47:39.420 | Well that's a real proof that this is doing something useful.
00:47:44.340 | So in some ways it's just clearly better.
00:47:47.980 | But on the other hand, it also has some disadvantages.
00:47:52.180 | It takes a lot longer to evaluate on an extrinsic task because it's a much bigger system.
00:47:59.820 | And sometimes, you know, when you change things, it's unclear whether the fact that the numbers
00:48:06.620 | went down was because you now have worse word vectors or whether it's just somehow the other
00:48:13.600 | components of the system interacted better with your old word vectors.
00:48:18.820 | And if you change the other components as well, things would get better again.
00:48:23.380 | So in some ways it can sometimes be muddier to see if you're making progress.
00:48:29.340 | But I'll touch on both of these methods here.
00:48:34.140 | So for intrinsic evaluation of word vectors, one way which we mentioned last time was this
00:48:42.900 | word vector analogy.
00:48:44.440 | So we could simply give our models a big collection of word vector analogy problems.
00:48:50.080 | So we could say man is to woman as king is to what, and ask the model to find the word
00:48:56.800 | that is closest using that sort of word analogy computation and hope that what comes out there
00:49:03.840 | is queen.
00:49:05.960 | And so that's something people have done and have worked out an accuracy score of how often
00:49:10.980 | that you are right.
00:49:14.120 | At this point, I should just mention one little trick of these word vector analogies that
00:49:20.320 | everyone uses but not everyone talks about in the first instance.
00:49:25.640 | I mean, there's a little trick which you can find in the Gensim code if you look at it,
00:49:31.160 | but when it does man is the woman as king is to what, something that could often happen
00:49:43.180 | is that actually the word, once you do your pluses and your minuses, that the word that
00:49:49.340 | will actually be closest is still king.
00:49:54.660 | So the way people always do this is that they don't allow one of the three input words in
00:50:01.620 | the selection process.
00:50:03.500 | So you're choosing the nearest word that isn't one of the input words.
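
A minimal sketch of that analogy evaluation, including the exclude-the-inputs trick (the tiny vocabulary and random vectors are just to show the call):

```python
import numpy as np

def analogy(a, b, c, vectors, words):
    """Return the word d such that a : b :: c : d, i.e. the word whose vector is
    closest (by cosine) to v_b - v_a + v_c, excluding the three input words."""
    idx = {w: i for i, w in enumerate(words)}
    target = vectors[idx[b]] - vectors[idx[a]] + vectors[idx[c]]
    target /= np.linalg.norm(target)
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = normed @ target
    for w in (a, b, c):               # the trick: never return one of the input words
        sims[idx[w]] = -np.inf
    return words[int(np.argmax(sims))]

# hypothetical tiny vector set just to show the call
words = ["man", "woman", "king", "queen"]
rng = np.random.default_rng(0)
vectors = rng.normal(size=(4, 8))
print(analogy("man", "woman", "king", vectors, words))
```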
00:50:11.420 | So this here is showing results from the GloVe vectors.
00:50:16.220 | So the glove vectors have this strong linear component property, just like I showed before
00:50:24.420 | for coals.
00:50:26.500 | So this is for the male female dimension.
00:50:30.760 | And so because of this, you'd expect in a lot of cases that word analogies would work
00:50:36.260 | because I can take the vector difference of man and woman.
00:50:40.160 | And then if I add that vector difference on to brother, I expect to get to sister and
00:50:45.740 | king, queen, and for many of these examples.
00:50:48.340 | But of course, they may not always work, right?
00:50:56.040 | Because if I start from emperor, it's sort of on more of a lean.
00:50:56.040 | And so it might turn out that I get countess or duchess coming out instead.
00:51:02.440 | You can do this for various different relations, so different semantic relations.
00:51:06.920 | So these sort of word vectors actually learn quite a bit of just world knowledge.
00:51:12.200 | So here's the company CEO, or this is the company CEO around 2010 to 2014, when the
00:51:19.380 | data for the word vectors was taken.
00:51:23.680 | And they, as well as semantic things or pragmatic things like this, they also learn syntactic
00:51:28.800 | things.
00:51:29.800 | So here are vectors for positive, comparative, and superlative forms of adjectives.
00:51:35.500 | And you can see those also move in roughly linear components.
00:51:40.840 | So the word2vec people built a data set of analogies so you could evaluate different
00:51:47.200 | models on the accuracy of their analogies.
00:51:50.840 | And so here's how you can do this.
00:51:54.040 | And this gives some numbers.
00:51:55.640 | So there are semantic and syntactic analogies.
00:51:58.200 | I'll just look at the totals.
00:52:00.320 | OK, so what I said before is if you just use unscaled co-occurrence counts and pass them
00:52:09.040 | through an SVD, things work terribly.
00:52:11.740 | And you see that there, you only get 7.3.
00:52:14.780 | But then, as I also pointed out, if you do some scaling, you can actually get SVD of
00:52:20.240 | a scaled count matrix to work reasonably well.
00:52:23.560 | So this SVD-L is similar to the COALS model.
00:52:28.980 | And now we're getting up to 60.1, which actually isn't a bad score, right?
00:52:32.740 | So you can actually do a decent job without a neural network.
00:52:36.880 | And then here are the two variants of the word2vec model.
00:52:43.440 | And here are our results from the GloVe model.
00:52:46.400 | And of course, at the time, 2014, we took this as absolute proof that our model was
00:52:53.240 | better and our more efficient use of statistics was really working in our favor.
00:53:00.040 | With seven years of retrospect, I think that's kind of not really true, it turns out.
00:53:05.000 | I think the main part of why we scored better is that we actually had better data.
00:53:10.840 | And so there's a bit of evidence about that on this next slide here.
00:53:15.960 | So this looks at the semantic, syntactic, and overall performance on word analogies
00:53:23.360 | of GloVe models that were trained on different subsets of data.
00:53:28.920 | So in particular, the two on the left are trained on Wikipedia.
00:53:34.640 | And you can see that training on Wikipedia makes you do really well on semantic analogies,
00:53:40.600 | which maybe makes sense because Wikipedia just tells you a lot of semantic facts.
00:53:44.720 | I mean, that's kind of what encyclopedias do.
00:53:48.240 | And so one of the big advantages we actually had was that Wikipedia, that the GloVe model
00:53:55.680 | was partly trained on Wikipedia as well as other text, whereas the word2vec model that
00:54:01.080 | was released was trained exclusively on Google News, so Newswire data.
00:54:06.240 | And if you only train on a smallish amount of Newswire data, you can see that for the
00:54:13.280 | semantics, it's just not as good as even one quarter of the size amount of Wikipedia data.
00:54:22.080 | Though if you get a lot of data, you can compensate for that.
00:54:25.160 | So here on the right hand side, we then have Common Crawl web data.
00:54:29.800 | And so once there's a lot of web data, so now 42 billion words, you'll then start to
00:54:35.240 | get good scores again from the semantic side.
00:54:39.840 | The graph on the right then shows how well do you do as you increase the vector dimension.
00:54:47.360 | And so what you can see there is, you know, 25 dimensional vectors aren't very good.
00:54:53.360 | They go up to sort of 50 and then 100.
00:54:56.560 | And so 100 dimensional vectors already work reasonably well.
00:55:00.080 | So that's why I used 100 dimensional vectors when I showed my example in class.
00:55:06.800 | They're quick to load and work reasonably well.
00:55:11.840 | But you still get significant gains for 200 and it's somewhat to 300.
00:55:16.200 | So at least back around sort of 2013 to 15, everyone sort of gravitated to the fact that
00:55:22.200 | 300 dimensional vectors is the sweet spot.
00:55:25.400 | So most frequently, if you look through the best known sets of word vectors, including
00:55:30.600 | the word2vec vectors and the GloVe vectors, usually what you get is 300 dimensional
00:55:36.480 | word vectors.
00:55:39.960 | That's not the only intrinsic evaluation you can do.
00:55:43.760 | Another intrinsic evaluation you can do is see how these models model human judgments
00:55:50.600 | of word similarity.
00:55:52.840 | So psychologists for several decades have actually taken human judgments of word similarity,
00:56:00.800 | where literally you're asking people for pairs of words like professor and doctor to give
00:56:07.200 | them a similarity score that's sort of being measured as some continuous quantity, giving
00:56:12.600 | you a score between, say, zero and 10.
00:56:16.600 | And so there are human judgments which are then averaged over multiple human judgments
00:56:21.400 | as to how similar different words are.
00:56:24.080 | So tiger and cat is pretty similar.
00:56:27.160 | Computer and Internet is pretty similar.
00:56:29.320 | Machine and car is less similar.
00:56:31.560 | Stock and CD aren't very similar at all, but stock and Jaguar are even less similar.
00:56:38.560 | So we could then say for our models, do they have the same similarity judgments?
00:56:46.240 | And in particular, we can measure a correlation coefficient of whether they give the same
00:56:51.160 | ordering of similarity judgments.
00:56:54.280 | And so then we can get data for that.
00:56:57.040 | And so there are various different data sets of word similarities, and we can score different
00:57:02.360 | models as to how well they do on similarities.
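
A sketch of that intrinsic evaluation: compute the model's cosine similarities for each pair and check the rank correlation against the human scores (the pairs, scores, and vectors below are made up for illustration):

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# hypothetical data: human similarity scores for word pairs, plus toy random vectors
pairs = [("tiger", "cat", 7.4), ("computer", "internet", 7.6),
         ("machine", "car", 5.7), ("stock", "cd", 1.3), ("stock", "jaguar", 0.9)]
rng = np.random.default_rng(0)
vec = {w: rng.normal(size=50) for pair in pairs for w in pair[:2]}

human = [score for _, _, score in pairs]
model = [cosine(vec[a], vec[b]) for a, b, _ in pairs]
rho, _ = spearmanr(human, model)       # rank correlation between model and human orderings
print(rho)
```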
00:57:05.920 | And again, you see here that plain SVDs works comparatively better here for similarities
00:57:14.840 | than it did for analogies.
00:57:15.840 | You know, it's not great, but it's now not completely terrible because we no longer need
00:57:20.440 | that linear property.
00:57:22.000 | But again, scaled SVDs work a lot better.
00:57:26.080 | Word2vec works a bit better than that.
00:57:28.720 | And we got some of the same kind of minor advantages from the GloVe model.
00:57:32.520 | Hey, Chris, sorry to interrupt.
00:57:35.240 | A lot of the students were asking if you could re-explain the objective function for the
00:57:39.880 | GloVe model and also what log bilinear means.
00:57:47.080 | Sure.
00:57:49.080 | OK, here is my objective function.
00:57:56.680 | So one slide before that.
00:58:00.440 | So the property that we want is that we want the dot product to represent the log probability
00:58:12.560 | of co-occurrence.
00:58:14.520 | So that then gives me my tricky log bilinear.
00:58:20.220 | So the bi is that there's sort of the wi and the wj, so that there are sort of two linear
00:58:27.680 | things and it's linear in each one of them.
00:58:30.800 | So rather than having a sort of ax, where you just have
00:58:37.800 | something that's linear in x and a is a constant,
00:58:41.800 | it's bilinear because we have the wi and wj, and it's linear in both of them.
00:58:47.760 | And that's then related to the log of a probability.
00:58:51.320 | And so that gives us the log bilinear model.
00:58:57.080 | And so since we'd like these things to be equal, what we're doing here, if you ignore
00:59:05.520 | these two center terms, is that we're wanting to say the difference between these two is
00:59:13.280 | as small as possible.
00:59:14.920 | So we're taking this difference and we're squaring it.
00:59:17.600 | So it's always positive.
00:59:19.320 | And we want that squared term to be as small as possible.
00:59:25.040 | And that's 90% of that.
00:59:28.040 | And you can basically stop there.
00:59:30.160 | But the other bit that's in here is a lot of the time when you're building models, rather
00:59:38.000 | than simply having sort of an ax model, it seems useful to have a bias term, which can
00:59:45.840 | move things up and down for the word in general.
00:59:51.200 | And so we added into the model bias term so that there's a bias term for both words.
00:59:57.000 | So if in general probabilities are high for a certain word, this bias term can model that.
01:00:03.600 | And for the other word, this bias term can model it.
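For reference, putting the pieces just described into symbols gives the standard GloVe objective; the f(X_ij) factor out front is the frequency weighting that comes up in the next question:

```latex
% GloVe objective: for every word pair (i, j), the squared difference between
% the dot product plus the two bias terms and the log co-occurrence count,
% scaled by the weighting function f.
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^{2}
```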
01:00:09.000 | So now I'll pop back.
01:00:10.800 | And after-- oh, actually, I just saw someone said, why multiplying by the f of-- sorry,
01:00:20.000 | I did skip that last term.
01:00:24.440 | So why multiply by this f of Xij?
01:00:27.880 | So this last bit was to scale things depending on the frequency of a word.
01:00:38.640 | Because you want to pay more attention to words that are more common or word pairs that
01:00:48.040 | are more common.
01:00:49.560 | Because if you think about it in word2vec terms, if things have a co-occurrence
01:00:56.800 | count of 50 versus 3, you want to do a better job at modeling the co-occurrence of the things
01:01:05.320 | that occurred together 50 times.
01:01:11.480 | And so you want to take the count of co-occurrence into account.
01:01:16.360 | But then the argument is that that actually leads you astray when you have extremely common
01:01:22.000 | words like function words.
01:01:24.200 | And so effectively, you paid more attention to words that co-occurred together up until
01:01:31.800 | a certain point.
01:01:33.320 | And then the curve just went flat.
01:01:35.440 | So it didn't matter if it was an extremely, extremely common word.
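As a sketch, the weighting function from the GloVe paper and the resulting per-pair loss term look roughly like this (x_max = 100 and alpha = 0.75 are the paper's defaults; the function and variable names are just illustrative):

```python
# Sketch of GloVe's frequency weighting and per-pair loss term. The weight
# grows with the co-occurrence count but is capped at 1, so extremely common
# pairs (e.g. with function words) stop gaining extra influence.
import numpy as np

def glove_weight(x_ij, x_max=100.0, alpha=0.75):
    return (x_ij / x_max) ** alpha if x_ij < x_max else 1.0

def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij):
    # Squared difference between (dot product + biases) and log co-occurrence,
    # scaled by the weighting above.
    diff = np.dot(w_i, w_j) + b_i + b_j - np.log(x_ij)
    return glove_weight(x_ij) * diff ** 2
```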
01:01:39.080 | So then for extrinsic word vector evaluation.
01:01:45.560 | So at this point, you're now wanting to sort of say, well, can we embed our word vectors
01:01:52.040 | in some end user task?
01:01:55.000 | And do they help?
01:01:58.000 | And do different word vectors work better or worse than other word vectors?
01:02:03.480 | So this is something that we'll see a lot of later in the class.
01:02:07.400 | I mean, in particular, when you get on to doing assignment 3, you
01:02:13.720 | get to build dependency parsers.
01:02:16.200 | And you can then use word vectors in the dependency parser and see how much they help.
01:02:22.640 | We don't actually make you test out different sets of word vectors, but you could.
01:02:28.640 | Here's just one example of this to give you a sense.
01:02:31.560 | So the task of named entity recognition is going through a piece of text and identifying
01:02:37.560 | mentions of a person name or an organization name like a company or a location.
01:02:44.000 | And so if you have good word vectors, do they help you do named entity recognition better?
01:02:53.080 | And the answer to that is yes.
01:02:55.040 | So if one starts off with a model that simply has discrete features, so it uses word identity
01:03:02.040 | as features, you can build a pretty good named entity model doing that.
01:03:06.360 | But if you add word vectors into it, you get a better representation of the meaning of
01:03:11.640 | words, and the numbers go up quite a bit.
01:03:16.600 | And then you can compare different models to see how much gain they give you in terms
01:03:21.860 | of this extrinsic task.
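A rough sketch of the kind of setup being described: pretrained word vectors used as features for a simple window-based named entity classifier. The window size, label set, and logistic-regression classifier are illustrative assumptions, not the exact system from the slide.

```python
# Sketch: word vectors as features for window-based named entity recognition.
import numpy as np
from sklearn.linear_model import LogisticRegression

def window_features(tokens, i, word_vectors, dim, window=2):
    # Concatenate vectors of the words in a window around position i;
    # out-of-range positions and unknown words fall back to zero vectors.
    feats = []
    for j in range(i - window, i + window + 1):
        if 0 <= j < len(tokens):
            feats.append(word_vectors.get(tokens[j], np.zeros(dim)))
        else:
            feats.append(np.zeros(dim))
    return np.concatenate(feats)

def train_ner(sentences, tag_sequences, word_vectors, dim):
    # sentences: list of token lists; tag_sequences: matching lists of labels
    # such as PER, ORG, LOC, or O.
    X, y = [], []
    for tokens, tags in zip(sentences, tag_sequences):
        for i, tag in enumerate(tags):
            X.append(window_features(tokens, i, word_vectors, dim))
            y.append(tag)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(np.array(X), y)
    return clf
```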
01:03:24.760 | So skipping ahead, this was a question that I was asked after class, which was word senses.
01:03:33.160 | So far, we've had just one word vector.
01:03:38.440 | Sorry, for one particular string, say the string 'house', we're going to say
01:03:45.400 | for each of those strings, there's a word vector.
01:03:49.760 | And if you think about it a bit more, that seems like it's very weird, because actually,
01:03:58.200 | most words, especially common words, and especially words that have existed for a long time, actually
01:04:04.880 | have many meanings, which are very different.
01:04:08.360 | So how could that be captured if you only have one word vector for the word, because
01:04:12.720 | you can't actually capture the fact that you've got different meanings for the word, because
01:04:17.680 | your meaning for the word is just one point in space, one vector.
01:04:22.200 | And so as an example of that, here's the word pike.
01:04:30.600 | It's an old Germanic word.
01:04:32.240 | Well, what kind of meanings does the word pike have?
01:04:36.600 | So you can maybe just think for a minute and think what meanings the word pike has.
01:04:46.800 | And it actually turns out, you know, it has a lot of different meanings.
01:04:50.960 | So perhaps the most basic meaning is, if you did fantasy games or something, medieval weapons,
01:04:59.240 | a sharp pointed staff is a pike.
01:05:02.280 | But there's a kind of a fish that has a similar elongated shape that's a pike.
01:05:08.140 | It was used for railroad lines.
01:05:12.360 | Maybe that usage isn't used much anymore, but it certainly still survives in referring
01:05:16.720 | to roads.
01:05:17.720 | So this is like when you have turnpikes, we have expressions where pike means the future,
01:05:23.280 | like coming down the pike.
01:05:25.760 | It's a position in diving, that divers do a pike.
01:05:30.400 | Those are all noun uses.
01:05:32.540 | They're also verbal uses.
01:05:34.140 | So you can pike somebody with your pike.
01:05:37.880 | You know, different usages might have different currency.
01:05:40.920 | In Australia, you can also use pike to mean that you pull out of doing something like,
01:05:46.320 | I reckon he's going to pike.
01:05:49.440 | I don't think that usage is used in America, but lots of meanings.
01:05:52.720 | And actually, for words that are common, or if you start thinking of words like line or
01:05:56.760 | field, I mean, they just have even more meanings than this.
01:06:00.600 | So what are we actually doing with just one vector for a word?
01:06:05.960 | And well, one way you could go is to say, okay, up until now, what we've done is crazy.
01:06:12.840 | Pike, and other words too, have all of these different meanings.
01:06:16.680 | So maybe what we should do is have different word vectors for the different meanings of
01:06:23.000 | pike.
01:06:24.000 | So we'd have one word vector for the medieval pointy weapon, another word vector for the
01:06:30.600 | kind of fish, another word vector for the kind of road.
01:06:34.400 | So that they'd then be word sense vectors.
01:06:39.560 | And you can do that.
01:06:40.560 | I mean, actually, we were working on that in the early 2010s, actually, even before
01:06:48.280 | Word2Vec came out.
01:06:50.760 | So this picture is a little bit small to see, but what we were doing was, for words, we were
01:06:59.480 | clustering instances of a word, so clustering the word tokens,
01:07:07.040 | hoping that those clusters of similar tokens represented senses.
01:07:10.800 | And then for the clusters of word tokens, we were sort of treating them like they were
01:07:15.280 | separate words and learning a word vector for each.
01:07:19.080 | And basically that actually works.
01:07:21.880 | So in green, we have two senses for the word bank.
01:07:26.200 | And so there's one sense for the word bank that's over here, where it's close to words
01:07:30.400 | like banking, finance, transaction, and laundering.
01:07:34.040 | And then we have another sense for the word bank over here, where it's close to words
01:07:38.180 | like plateau, boundary, gap, territory, which is the riverbank sense of the word bank.
01:07:44.860 | And for the word jaguar, that's in purple.
01:07:48.200 | Well, jaguar has a number of senses.
01:07:51.400 | And so we have those as well.
01:07:53.600 | So this sense down here is close to hunter.
01:07:57.140 | So that's the sort of big game animal sense of jaguar.
01:08:01.400 | Up the top here, it's being shown close to luxury and convertible.
01:08:05.240 | So this is the jaguar car sense.
01:08:08.600 | Then jaguar here is near string, keyboard, and words like that.
01:08:13.600 | So jaguar is the name of a kind of keyboard.
01:08:18.000 | And then this final jaguar over here is close to software and Microsoft.
01:08:23.360 | And then if you're old enough, you'll remember that there was an old version of Mac OS that
01:08:27.680 | was called jaguar.
01:08:29.200 | So that's then the computer sense.
01:08:31.160 | So basically, this does work.
01:08:33.240 | And we can learn word vectors for different senses of a word.
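A rough sketch of that clustering recipe: represent each occurrence of the target word by the average of its context-word vectors, cluster those token representations, and treat each cluster centroid as a sense vector. The fixed number of senses and the use of plain k-means are simplifying assumptions; the original work was more refined.

```python
# Sketch: derive sense vectors for one word by clustering its token contexts.
import numpy as np
from sklearn.cluster import KMeans

def sense_vectors(occurrences, word_vectors, dim, n_senses=3):
    # occurrences: one list of context words per token of the target word.
    token_reps = []
    for context in occurrences:
        vecs = [word_vectors[w] for w in context if w in word_vectors]
        token_reps.append(np.mean(vecs, axis=0) if vecs else np.zeros(dim))
    km = KMeans(n_clusters=n_senses, n_init=10).fit(np.array(token_reps))
    return km.cluster_centers_  # one centroid per hypothesized sense
```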
01:08:38.520 | But actually, this isn't the majority way that things have then gone in practice.
01:08:45.120 | And there are kind of a couple of reasons for that.
01:08:48.800 | I mean, one is just simplicity.
01:08:51.480 | If you do this, it's kind of complex, because you first of all have to learn word senses
01:08:58.360 | and then start learning word vectors in terms of the word senses.
01:09:01.760 | But the other reason is, although this model of having word senses is traditional, it's
01:09:10.120 | what you see in dictionaries, it's commonly what's being used in natural language processing.
01:09:15.760 | I mean, it tends to be imperfect in its own way, because we're trying to take all the
01:09:20.440 | uses of the word pike and sort of cut them up into different senses, where the differences
01:09:30.360 | are kind of overlapping, and it's often not clear which ones to count as distinct.
01:09:35.040 | So for example, here, right, a railroad line and a type of road, well, sort of that's the
01:09:39.720 | same sense of pike, it's just that they're different forms of transportation.
01:09:44.080 | And so you know that this could be, you know, a type of transportation line and cover both
01:09:48.320 | of them.
01:09:49.320 | So it's always sort of very unclear how you cut word meaning into different senses.
01:09:55.960 | And indeed, if you look at different dictionaries, everyone does it differently.
01:10:01.760 | So it actually turns out that in practice, you can do rather well by simply having one
01:10:11.160 | word vector per word type.
01:10:14.580 | And what happens if you do that?
01:10:17.600 | Well, what you find is that what you learn as a word vector is what gets referred to
01:10:28.600 | in fancy talk as a superposition of the word vectors for the different senses
01:10:37.400 | of a word, where the word superposition means no more or less than a weighted sum.
01:10:44.400 | So the vector that we learned for pike will be a weighted average of the vectors that
01:10:50.480 | you would have learned for the medieval weapon sense, plus the fish sense, plus the road
01:10:56.520 | sense, plus whatever other senses that you have, where the weighting that's given to
01:11:01.360 | these different sense vectors corresponds to the frequencies of use of the different
01:11:06.720 | senses.
01:11:08.080 | So we end up with the vector for pike being a kind of an average vector.
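In symbols, roughly, with f_i the frequency of sense i of pike:

```latex
v_{\text{pike}} \approx \alpha_1 v_{\text{pike}_1} + \alpha_2 v_{\text{pike}_2} + \alpha_3 v_{\text{pike}_3} + \cdots,
\qquad \alpha_i = \frac{f_i}{f_1 + f_2 + f_3 + \cdots}
```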
01:11:16.840 | And so if you say, okay, you've just added up several different vectors
01:11:24.720 | into an average, you might think that that's kind of useless, because, you know, you've
01:11:30.920 | lost the real meanings of the word, and you've just got some kind of funny average vector
01:11:36.680 | that's in between them.
01:11:39.400 | But actually, it turns out that if you use this average vector in applications, it tends
01:11:47.080 | to sort of self disambiguate.
01:11:50.360 | Because if you say, is the word pike similar to the word for fish?
01:11:57.560 | Well, part of this vector represents fish, the fish sense of pike.
01:12:02.980 | And so in those components, it'll be kind of similar to the fish vector.
01:12:07.680 | And so yes, you'll say there's substantial similarity.
01:12:13.680 | Whereas if another piece of text says, you know, the men were armed with pikes and
01:12:21.480 | lances or pikes and maces or whatever other medieval weapons you remember, well, actually,
01:12:28.380 | some of that meaning is in the pike vector as well.
01:12:31.460 | And so it'll say, yeah, there's good similarity with mace and staff and words like that as
01:12:37.640 | well.
01:12:38.780 | And in fact, we can work out which sense of pike is intended by just sort of seeing which
01:12:46.060 | components are similar to other words that are used in the same context.
01:12:51.840 | And indeed, there's actually a much more surprising result than that.
01:12:56.220 | And this is a result that's due to Sanjeev Arora, Tengyu Ma, who's now on our Stanford faculty,
01:13:03.420 | and others in 2018.
01:13:05.900 | And that's the following result, which I'm not actually going to explain.
01:13:09.380 | But so if you think that the vector for pike is just a sum of the vectors for the different
01:13:19.180 | senses, well, you'd think that it's just completely impossible to reconstruct
01:13:27.540 | the sense vectors from the vector for the word type.
01:13:34.260 | Because normally, if I say I've got two numbers, the sum of them is 17, you just have no information
01:13:41.520 | as to what my two numbers are, right?
01:13:43.460 | You can't resolve it.
01:13:45.300 | And even worse, if I tell you I've got three numbers, and they sum to 17.
01:13:50.140 | But it turns out that when we have these high dimensional vector spaces, that things are
01:13:57.500 | so sparse in those high dimensional vector spaces, that you can use ideas from sparse
01:14:03.420 | coding to actually separate out the different senses, providing they're relatively common.
01:14:11.540 | So they show in their paper that you can start with the vector of say, pike, and actually
01:14:17.280 | separate out components of that vector that correspond to different senses of the word
01:14:22.480 | pike.
01:14:23.480 | And so here's an example at the bottom of this slide, which, for the word tie, separates
01:14:30.300 | out that vector into five different senses.
01:14:33.320 | And so there's one sense that's close to the words trousers, blouse, waistcoats, and this is
01:14:38.500 | the sort of clothing sense of tie.
01:14:40.980 | Another sense is close to wires, cables, wiring, electrical.
01:14:45.380 | So that's the sort of tie that's used in electrical stuff.
01:14:50.500 | Then we have sort of scoreline, goal, equalizer, so this is the sporting game sense of tie.
01:14:56.420 | This one also seems, in a different way, to evoke the sporting game sense of tie.
01:15:02.380 | Then there's finally this one here; maybe my music knowledge is just really bad.
01:15:06.860 | Maybe it's because you get ties in music when you tie notes together, I guess.
01:15:10.620 | So you get these different senses out of it.
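A rough sketch of the sparse-coding idea from that paper: learn an over-complete dictionary of "atoms" from the whole matrix of word vectors, then express a single word's vector as a sparse combination of a few atoms and read those off as its senses. The dictionary size and sparsity level here are illustrative, not the paper's exact settings.

```python
# Sketch: recover sense components of a word vector via sparse coding.
import numpy as np
from sklearn.decomposition import DictionaryLearning, sparse_encode

def learn_atoms(word_matrix, n_atoms=2000, n_nonzero=5):
    # word_matrix: (vocab_size, dim) array of all word vectors.
    dl = DictionaryLearning(n_components=n_atoms,
                            transform_algorithm="omp",
                            transform_n_nonzero_coefs=n_nonzero)
    dl.fit(word_matrix)
    return dl.components_  # (n_atoms, dim) dictionary of candidate sense atoms

def sense_components(vector, atoms, n_nonzero=5):
    # Sparse-code one word vector against the atoms; the few nonzero
    # coefficients pick out the atoms (senses) it is approximately built from.
    codes = sparse_encode(vector.reshape(1, -1), atoms,
                          algorithm="omp", n_nonzero_coefs=n_nonzero)[0]
    active = np.nonzero(codes)[0]
    return active, codes[active]
```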
01:15:12.380 | Thank you.