Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 2 - Neural Classifiers
Chapters
0:00 Intro
2:05 Parameters
4:03 Learning Phase
7:59 Gradient Descent
8:39 Stochastic Gradient Descent
11:29 Note on Word Vectors
13:46 Word2Vec
16:53 Negative sampling
18:06 Logistic function
20:15 Negative log likelihood
23:27 Co-occurrence matrix
25:19 Questions
29:49 Co-occurrence matrices
32:19 Singular value decomposition
36:17 COALS model
37:53 GloVe algorithm
42:32 Log bilinear model
43:06 Explicit loss function
44:45 Question
00:00:07.520 |
So the main content for today is to, um, go through sort of more stuff about word vectors, 00:00:17.040 |
including touching on word senses, and then introducing the notion of neural network classifiers. 00:00:23.240 |
So our biggest goal is that by the end of today's class, you should feel like you could 00:00:28.040 |
confidently look at one of the word embeddings papers, such as the Google Word2Vec paper, 00:00:34.280 |
or the GloVe paper, or Sanjeev Arora's paper that we'll come to later, and feel like, yeah, 00:00:39.360 |
I can understand this, I know what they're doing, and it makes sense. 00:00:45.920 |
So this was sort of introducing this model of Word2Vec. 00:00:49.760 |
And the idea was that we start with random word vectors, and then we have 00:00:57.200 |
a big corpus of text, and we're going to iterate through each word position in the text. 00:01:03.120 |
And for each position, we're going to try and predict what words surround our center word. 00:01:10.880 |
And we're going to do that with a probability distribution that's defined in terms of the 00:01:15.760 |
dot product between the word vectors for the center word and the context words. 00:01:23.080 |
And so that will give a probability estimate of a word appearing in the context of "into." 00:01:27.760 |
Well, actual words did occur in the context of "into" on this occasion. 00:01:32.600 |
So what we're going to want to do is sort of make it more likely that turning, problems, 00:01:37.880 |
banking, and crises will turn up in the context of "into." 00:01:42.040 |
And so that's learning: updating the word vectors so that they can better predict the actual surrounding words. 00:01:49.800 |
And then the thing that's almost magical is that doing no more than this simple algorithm, 00:01:56.040 |
this allows us to learn word vectors that capture well word similarity and meaningful directions in the word space. 00:02:05.100 |
So more precisely, right, for this model, the only parameters of this model are the word vectors. 00:02:12.740 |
So we have outside word vectors and center word vectors for each word. 00:02:17.440 |
And then we're taking their dot product to get a probability. 00:02:21.640 |
Well, we're taking a dot product to get a score of how likely a particular outside word is to occur with the center word. 00:02:29.600 |
And then we're using the softmax transformation to convert those scores into probabilities, 00:02:34.880 |
as I discussed last time, and I kind of come back to at the end this time. 00:02:42.620 |
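Concretely, that scoring-plus-softmax step looks roughly like the sketch below, with made-up vocabulary size and dimension (this is an illustration, not the assignment's actual code):

```python
import numpy as np

# Toy sizes: a 5-word vocabulary and 4-dimensional vectors (illustrative only).
V, d = 5, 4
np.random.seed(0)
U = np.random.randn(V, d) * 0.01   # outside ("context") vectors, one row per word
v_c = np.random.randn(d) * 0.01    # center word vector for the current position

scores = U @ v_c                               # dot-product score for every outside word
probs = np.exp(scores) / np.exp(scores).sum()  # softmax: scores -> P(o | c)
print(probs.sum())                             # 1.0, a probability distribution over the vocabulary
```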
This model is what we call in NLP a bag of words model. 00:02:47.720 |
So bag of words models are models that don't actually pay any attention to word order or position. 00:02:54.320 |
It doesn't matter if you're next to the center word or a bit further away on the left or right. 00:03:02.160 |
And that seems like a very crude model of language that will offend any linguist. 00:03:07.880 |
And it is a very crude model of language, and we'll move on to better models of language later in the class. 00:03:12.840 |
But even that crude model of language is enough to learn quite a lot 00:03:18.280 |
about the properties of words. 00:03:23.840 |
And then the second note is, well, with this model, we wanted to give reasonably high probabilities 00:03:32.760 |
to the words that do occur in the context of the center word, at least if they do so at all often. 00:03:39.920 |
But obviously lots of different words can occur. 00:03:43.080 |
So we're not talking about probabilities like .3 and .5. 00:03:47.000 |
We're more likely going to be talking about probabilities like .01 and numbers like that. 00:03:55.800 |
And well, the way that the word2vec model achieves this, and this is the learning phase 00:04:01.280 |
of the model, is to place words that are similar in meaning close to each other in this high-dimensional vector space. 00:04:11.180 |
So again, you can't read this one, but if we scroll into this one, we see lots of words 00:04:18.600 |
that are similar in meaning grouped close together in the space. 00:04:21.760 |
So here are days of the week like Tuesday, Thursday, Sunday, and also Christmas. 00:04:35.040 |
This is a diagram I made quite a few years ago. 00:04:38.480 |
So that's when Nokia was still an important maker of cell phones. 00:04:42.680 |
We have various sort of fields like mathematics and economics over here. 00:04:47.520 |
So we group words that are similar in meaning. 00:04:52.320 |
Actually one more note I wanted to make on this. 00:04:54.280 |
I mean, again, this is a two dimensional picture, which is all I can show you on a slide. 00:05:01.360 |
And it's done with a principal components projection that you'll also use in the assignment. 00:05:08.400 |
Something important to remember, but hard to remember, is that high dimensional spaces 00:05:14.160 |
have very different properties to the two dimensional spaces that we can look at. 00:05:19.160 |
And so in particular, a word, a vector, can be close to many other things in a high dimensional 00:05:27.000 |
space, but close to them on different dimensions. 00:05:36.280 |
So the next question is, well, how do we learn good word vectors? 00:05:42.880 |
And this was the bit that I didn't quite hook up at the end of last class. 00:05:48.120 |
So for a while at the end of the last class, I said, oh, we do calculus, and we have to work out the gradient of the 00:05:55.080 |
loss function with respect to the parameters, and that will allow us to make progress. 00:06:00.000 |
But I didn't sort of altogether put that together. 00:06:02.260 |
So what we're going to do is we start off with random word vectors. 00:06:08.800 |
We initialize them to small numbers near zero in each dimension. 00:06:13.160 |
We've defined our loss function J, which we looked at last time. 00:06:18.800 |
And then we're going to use a gradient descent algorithm, which is an iterative algorithm 00:06:24.800 |
that learns to minimize J of theta by changing theta. 00:06:29.340 |
And so the idea of this algorithm is that from the current values of theta, you calculate the gradient of J of theta. 00:06:38.520 |
And then what you're going to do is make a small step in the direction of the negative gradient. 00:06:44.400 |
So the gradient is pointing upwards, and we're taking a small step in the direction of the 00:06:49.560 |
negative of the gradient to gradually move down towards the minimum. 00:06:55.320 |
And so one of the parameters of neural nets that you can fiddle in your software package is the step size, or learning rate. 00:07:02.900 |
So if you take a really, really itsy bitsy step, it might take you a long time to minimize the function. 00:07:13.220 |
On the other hand, if your step size is much too big, well, then you can actually diverge and start heading to worse places. 00:07:23.360 |
Or even if you are going downhill a little bit, that what's going to happen is you're 00:07:27.580 |
then going to end up bouncing back and forth, and it'll take you much longer to get to the minimum. 00:07:32.820 |
OK, in this picture, I have a beautiful quadratic, and it's easy to minimize it. 00:07:40.980 |
Something that you might know about neural networks is that in general, they're not convex. 00:07:45.780 |
So you could think that this is just all going to go awry. 00:07:49.920 |
But the truth is in practice, life works out to be OK. 00:07:52.940 |
But I think I won't get into that more right now and come back to that in a later class. 00:08:01.420 |
So we have the current values of the parameters theta. 00:08:04.740 |
We then walk a little bit in the negative direction of the gradient, as scaled by our learning rate. 00:08:14.400 |
And that gives us new parameter values, where theta here is a vector of all the parameters. 00:08:19.980 |
But for each individual parameter, we are updating it a little bit by working out the 00:08:25.720 |
partial derivative of J with respect to that parameter. 00:08:32.220 |
So that's the simple gradient descent algorithm. 00:08:39.100 |
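In code, the update rule is just the following sketch, where grad_J is a stand-in for a function that returns the full-corpus gradient (not something defined in the lecture):

```python
import numpy as np

def gradient_descent(theta, grad_J, lr=0.05, steps=100):
    """Plain (batch) gradient descent: theta_new = theta_old - lr * dJ/dtheta."""
    for _ in range(steps):
        theta = theta - lr * grad_J(theta)   # small step against the gradient
    return theta

# Example on a toy quadratic J(theta) = ||theta||^2, whose gradient is 2 * theta.
theta0 = np.array([3.0, -2.0])
print(gradient_descent(theta0, lambda t: 2 * t))   # converges toward [0, 0]
```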
The problem is that our J is a function of all windows in the corpus. 00:08:45.300 |
Remember, we're doing this sum over every center word in the entire corpus. 00:08:51.760 |
And we'll often have billions of words in the corpus. 00:08:54.600 |
So actually working out J of theta or the gradient of J of theta would be extremely, 00:09:01.120 |
extremely expensive because we have to iterate over our entire corpus. 00:09:04.640 |
So you'd wait a very long time before you made a single gradient update. 00:09:12.020 |
And so basically 100% of the time in neural network land, we don't use gradient descent. 00:09:19.300 |
We instead use what's called stochastic gradient descent. 00:09:22.680 |
And stochastic gradient descent is a very simple modification of this. 00:09:27.400 |
So rather than working out an estimate of the gradient based on the entire corpus, you 00:09:34.680 |
simply take one center word or a small batch like 32 center words, and you work out an estimate of the gradient based on them. 00:09:44.780 |
Now that estimate of the gradient will be noisy and bad because you've only looked at 00:09:50.720 |
a small fraction of the corpus rather than the whole corpus. 00:09:54.160 |
But nevertheless, you can use that estimate of the gradient to update your theta parameters in just the same way. 00:10:04.580 |
And so then if we have a billion word corpus, if we do it on each center word, we 00:10:12.440 |
can make a billion updates to the parameters as we pass through the corpus once rather 00:10:17.560 |
than only making one more accurate update to the parameters once you've been through the whole corpus. 00:10:25.280 |
So overall, we can learn several orders of magnitude more quickly. 00:10:31.040 |
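A minimal sketch of that loop is shown here; `windows` and `grad_J_batch` are hypothetical stand-ins for the corpus windows and a function that returns the gradient on a small batch, and theta is assumed to be a NumPy array:

```python
import random

def sgd(theta, windows, grad_J_batch, lr=0.05, batch_size=32, epochs=1):
    """Stochastic gradient descent: noisy gradient estimates from small batches."""
    for _ in range(epochs):
        random.shuffle(windows)
        for i in range(0, len(windows), batch_size):
            batch = windows[i:i + batch_size]
            theta = theta - lr * grad_J_batch(theta, batch)  # one cheap, noisy update
    return theta
```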
And so this is the algorithm that you'll be using everywhere, including right here in this class. 00:10:41.200 |
Again, just an extra comment of more complicated stuff we'll come back to. 00:10:47.400 |
I presented this stochastic gradient descent as being sort of a performance hack. 00:10:58.440 |
It turns out it's not only a performance hack. 00:11:01.800 |
Neural nets have some quite counterintuitive properties. 00:11:05.880 |
And actually, the fact that stochastic gradient descent is kind of noisy and bounces around 00:11:12.120 |
as it does its thing, it actually means that in complex networks, it learns better solutions 00:11:19.880 |
than if you were to run plain gradient descent very slowly. 00:11:24.560 |
So you can both compute much more quickly and do a better job. 00:11:29.240 |
OK, one final note on running stochastic gradients with word vectors, this is kind of an aside, 00:11:37.180 |
but something to note is that if we're doing a stochastic gradient update based on one 00:11:42.360 |
window, then actually in that window, we'll have seen almost none of our parameters. 00:11:49.120 |
Because if we have a window of something like five words to either side of the center word, that's at most 11 words. 00:11:58.560 |
So we will have gradient information for those 11 words. 00:12:02.900 |
But the other 100,000 odd words in our vocabulary will have no gradient update information. 00:12:09.440 |
So this will be a very, very sparse gradient update. 00:12:13.900 |
So if you're only thinking math, you can just have your entire gradient and use the equation 00:12:25.040 |
But if you're thinking systems optimization, then you'd want to think, well, actually, 00:12:31.380 |
I only want to update the parameters for a few words. 00:12:36.960 |
And there have to be and there are much more efficient ways that I could do that. 00:12:43.580 |
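A sketch of what that looks like is below: only the handful of embedding rows touched by one window get updated. The parameter and gradient names here are illustrative, not the assignment's interface:

```python
import numpy as np

def sparse_window_update(V_vecs, U_vecs, center_id, outside_ids,
                         grad_center, grad_outside, lr=0.05):
    """Update only the embedding rows that actually appear in one window."""
    V_vecs[center_id] -= lr * grad_center          # one center-vector row
    for k, o in enumerate(outside_ids):            # roughly 2 * window_size outside rows
        U_vecs[o] -= lr * grad_outside[k]
    # All other rows of the (vocab_size x d) matrices are left untouched,
    # so there is no need to materialize a full dense gradient.
    return V_vecs, U_vecs
```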
And so here's sort of another aside that will be useful for the assignment. 00:12:49.080 |
So I will say it up until now, when I presented word vectors, I presented them as column vectors. 00:12:57.860 |
And that makes the most sense if you think about it as a piece of math. 00:13:03.840 |
Whereas actually, in all common deep learning packages, including PyTorch that we're using, 00:13:12.120 |
word vectors are actually represented as row vectors. 00:13:16.220 |
And if you remember back to the representation of matrices in CS107 or something like that, 00:13:23.140 |
you'll know that that's then obviously efficient for representing words, because then you can 00:13:29.800 |
access an entire word vector as a continuous range of memory. 00:13:36.080 |
But anyway, so actually, our word vectors will be row vectors when you look at them in the code. 00:13:45.500 |
OK, now I wanted to say a bit more about the Word2Vec family of algorithms and also what else you could do. 00:13:58.800 |
So you're still meant to be working on homework one, which remember is due next Tuesday, 00:14:04.060 |
but really actually with today's content, we're starting into homework two. 00:14:08.900 |
And I'll kind of go through the first part of homework two today and the rest of it next time. 00:14:15.880 |
So I mentioned briefly the idea that we have two separate vectors for each word type, the center vector and the outside vector. 00:14:27.220 |
They're similar, but not identical for multiple reasons, including the random initialization 00:14:36.040 |
You can implement a Word2Vec algorithm with just one vector per word. 00:14:42.540 |
And actually, if you do, it works slightly better, but it makes the algorithm much more complicated. 00:14:49.500 |
And the reason for that is sometimes you will have the same word type as the center word and as a context word. 00:14:58.740 |
And that means that when you're doing your calculus at that point, you've then got this 00:15:04.060 |
sort of messy case where, just for that word, you're getting a dot product of x with itself. 00:15:11.260 |
You're getting an x·x term, which makes it sort of much messier to work out the derivatives. 00:15:16.620 |
And so that's why we use this sort of simple optimization of having two vectors per word. 00:15:22.700 |
So for the Word2Vec model as introduced in the Mikolov et al. paper in 2013, it wasn't just one algorithm; there were two variants. 00:15:42.000 |
One was called the skip-gram model, which is the one that I've explained to you, that 00:15:46.540 |
predicts outside words, position independent, given the center word, in a bag of words style. 00:15:57.900 |
The other one was called the continuous bag of words model, CBOW. 00:16:01.340 |
And in this one, you predict the center word from a bag of context words. 00:16:09.780 |
The skip gram one is more natural in various ways. 00:16:13.300 |
So it's sort of normally the one that people have gravitated to in subsequent work. 00:16:19.620 |
But then as to how you train this model, what I've presented so far is the naive softmax 00:16:26.900 |
equation, which is a simple but relatively expensive training method. 00:16:33.980 |
And so that isn't really what they suggest using in the paper. 00:16:39.120 |
They suggest using a method that's called negative sampling. 00:16:42.240 |
So an acronym you'll see sometimes is SGNS, which means skip-gram negative sampling. 00:16:49.200 |
So let me just say a little bit about what this is. 00:16:53.900 |
But actually doing the skip gram model with negative sampling is part of homework two. 00:17:01.740 |
So the point is that if you use this naive softmax, you know, even though people commonly 00:17:06.900 |
do use this naive softmax in various neural net models, that working out the denominator is pretty expensive. 00:17:15.460 |
And that's because you have to iterate over every word in the vocabulary and work out these dot products. 00:17:22.940 |
So if you have a hundred thousand word vocabulary, you have to do a hundred thousand dot products to work out the denominator. 00:17:34.860 |
And so instead of that, the idea of negative sampling is where instead of using this softmax, 00:17:42.780 |
we're going to train binary logistic regression models for the true pair of 00:17:52.020 |
center word and the context word versus noise pairs where we keep the true center word and 00:18:01.820 |
we just randomly sample words from the vocabulary. 00:18:06.780 |
So as presented in the paper, the idea is like this. 00:18:10.820 |
So overall, what we want to optimize is still an average over the corpus of the loss at each position. 00:18:21.340 |
But when we're working out the loss for each particular center word 00:18:25.980 |
and each particular window: 00:18:31.740 |
We're going to take the dot product as before of the center word and the outside word. 00:18:42.220 |
But now instead of using that inside the softmax, we're going to put it through the logistic 00:18:47.780 |
function, which is often also called the sigmoid function. 00:18:55.660 |
So the logistic function is a handy function that will map any real number to a probability between zero and one. 00:19:05.340 |
So basically, if the dot product is large, the logistic of the dot product will be virtually one. 00:19:16.660 |
And then what we'd like is on average, we'd like the dot product between the center word 00:19:22.900 |
and words that we just chose randomly, i.e. they most likely didn't actually occur in the context, to be small. 00:19:33.100 |
And there's just one little trick of how this is done, which is that this sigmoid function is symmetric. 00:19:41.780 |
And so if we want this probability to be small, we can take the negative of the dot product. 00:19:51.460 |
So we're wanting it to be over here, where the dot product of the random word with the center word is a negative number. 00:20:01.020 |
And so then we're going to take the negation of that. 00:20:04.500 |
And then again, once we put that through the sigmoid, we'd like a big number. 00:20:08.860 |
Okay, so the way they're presenting things, they're actually maximizing this quantity. 00:20:14.060 |
But if I go back to making it a bit more similar to the way we had written things before, we'd work in terms of a loss that we minimize. 00:20:28.420 |
So we're taking the negative log likelihood of this, the sigmoid of the dot product. 00:20:35.340 |
Again, negative log likelihood, we're using the same negated dot product through the sigmoid. 00:20:43.020 |
And then we're going to work out this quantity for a handful of random words. 00:20:50.660 |
We take negative samples, and how likely a word is to be sampled depends on its probability. 00:20:57.740 |
And this loss function is going to be minimized, given this negation, by making these 00:21:05.060 |
dot products large and these dot products small, where small means negative. 00:21:14.260 |
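Put as code, the negative-log-likelihood form just described looks roughly like this sketch for a single (center, outside) pair, where the vectors are assumed to be NumPy arrays:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_c, u_o, U_neg):
    """Skip-gram negative-sampling loss for one (center, outside) pair.

    v_c: center word vector; u_o: true outside word vector;
    U_neg: (K x d) matrix of K randomly sampled negative word vectors.
    """
    pos = -np.log(sigmoid(u_o @ v_c))              # true pair: want sigmoid(dot) near 1
    neg = -np.log(sigmoid(-(U_neg @ v_c))).sum()   # noise pairs: want sigmoid(-dot) near 1
    return pos + neg
```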
So there was just then one other trick that they use. 00:21:19.980 |
Actually there's more than one other trick that's used in the Word2Vec paper to get it 00:21:23.620 |
to perform well, but I'll only mention one of their other tricks here. 00:21:28.300 |
When they sample the words, they don't simply just sample the words based on their probability of occurrence in the corpus. 00:21:39.740 |
What they do is they start with what we call the unigram distribution of words. 00:21:44.140 |
So that is how often words actually occur in our big corpus. 00:21:49.660 |
So if you have a billion word corpus and a particular word occurred 90 times in it, you're 00:21:55.820 |
taking 90 divided by a billion, and so that's the unigram probability of the word. 00:22:01.340 |
But what they then do is that they take that to the three quarters power. 00:22:06.340 |
And the effect of that three quarters power, which is then renormalized to make a probability 00:22:10.620 |
distribution with Z, kind of like we saw last time with the softmax. 00:22:15.260 |
By taking the three quarters power, that has the effect of dampening the difference between 00:22:21.780 |
common and rare words so that less frequent words are sampled somewhat more often, but 00:22:28.260 |
still not nearly as much as they would be if you just used something like a uniform distribution. 00:22:36.300 |
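A tiny numerical sketch of that 3/4-power trick, with made-up counts for a three-word vocabulary:

```python
import numpy as np

# Toy unigram counts (made-up numbers): one very common word, one medium, one rare.
counts = np.array([900_000.0, 50_000.0, 100.0])

unigram = counts / counts.sum()     # raw unigram distribution
damped = counts ** 0.75
damped = damped / damped.sum()      # renormalize (the "Z") after the 3/4 power

print(unigram)   # the rare word gets a tiny share of the probability mass
print(damped)    # its share goes up, but common words still dominate
```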
Okay, so that's basically everything to say about the basics of how we have this very 00:22:47.660 |
simple neural network algorithm, Word2Vec, and how we can train it and learn word vectors. 00:22:56.700 |
So for the next bit, what I want to do is step back a bit and say, well, here's an algorithm for learning word vectors. 00:23:06.260 |
What else could we have done and what can we say about that? 00:23:10.740 |
The first thing that you might think about is, well, why do we have this funny iterative algorithm at all? 00:23:22.580 |
You know, if we have a lot of words in a corpus, it seems like a more obvious thing that we 00:23:28.540 |
could do is just look at the counts of how words occur with each other and build a matrix of those counts. 00:23:40.780 |
So here's the idea of a co-occurrence matrix. 00:23:52.820 |
I made my window simply size one to make it easy to fill in my matrix, and symmetric, so it doesn't matter whether a word is to the left or the right. 00:24:02.940 |
And so then the counts in these cells are simply how often things co-occur within that window. 00:24:16.100 |
So we get twos in these cells because it's symmetric. 00:24:21.940 |
So we get one here and lots of other things occur zero. 00:24:26.260 |
So we can build up a co-occurrence matrix like this. 00:24:31.460 |
And well, these actually give us a representation of words as co-occurrence vectors. 00:24:38.700 |
So I can take the word I with either a row or a column vector since it's symmetric and 00:24:44.060 |
say okay, my representation of the word I is this row vector. 00:24:53.780 |
And I think you can maybe convince yourself that to the extent that words have similar 00:24:59.860 |
meaning and usage, you'd sort of expect them to have somewhat similar vectors, right? 00:25:05.980 |
So if I had the word you as well on a larger corpus, you might expect I and you to have 00:25:11.220 |
similar vectors, because "I like", "you like", "I enjoy", "you enjoy". 00:25:18.140 |
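As a rough sketch, building such a window-size-one, symmetric co-occurrence count table on a toy corpus (not the exact corpus from the slide) looks like this:

```python
from collections import defaultdict

# A toy corpus and a window of size one on each side.
corpus = ["i like deep learning", "i like nlp", "i enjoy flying"]

counts = defaultdict(int)
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        for j in (i - 1, i + 1):               # one word to the left, one to the right
            if 0 <= j < len(words):
                counts[(w, words[j])] += 1     # symmetric by construction

print(counts[("i", "like")], counts[("like", "i")])   # 2 2
```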
Hey, Chris, can we pause here to answer some questions? 00:25:23.820 |
So we got some questions from, sort of, the negative sampling slides. 00:25:29.840 |
In particular, can you give some intuition for negative sampling? 00:25:36.820 |
And why do we only take one positive example? 00:25:39.820 |
Those are two questions to be answered in tandem. 00:25:46.900 |
So the goal is to work out something like what the softmax did in a much more efficient way. 00:25:59.300 |
So, in the softmax, well, you wanted to give high probability, in predicting the 00:26:08.860 |
context, to a context word that actually did appear with the center word. 00:26:14.500 |
And well, the way you do that is by having the dot product between those two words be big. 00:26:22.340 |
But, you know, it's more than that, because 00:26:28.380 |
in the denominator, you're also working out the dot product with every other word in the vocabulary. 00:26:34.480 |
So as well as wanting the dot product with the actual word that you see in the context 00:26:39.380 |
to be big, you maximize your likelihood by making the dot products of other words that 00:26:47.600 |
weren't in the context smaller, because that's shrinking your denominator. 00:26:52.740 |
And therefore, you've got a bigger number coming out and you're maximizing the likelihood. 00:26:58.920 |
So even for the softmax, the general thing that you want to do to maximize it is to have the 00:27:04.540 |
dot product with words actually in the context be big and the dot product with words not in the context be small. 00:27:14.100 |
And obviously, you have to average this as best you can over all kinds of different contexts, 00:27:18.620 |
because sometimes different words appear in different contexts, obviously. 00:27:25.700 |
So negative sampling is therefore a way of trying to maximize the same sort of objective. 00:27:32.940 |
Now, you know, you only have one positive term because that's the one piece of real data you actually observed. 00:27:45.280 |
So for working out the entire J, we do work this quantity out for every center word 00:27:54.500 |
and every context word: we are iterating over the different words in the context window, and then we're iterating through positions in the corpus. 00:28:03.220 |
So, you know, gradually we do this, but for one particular center word and one particular 00:28:08.200 |
context word, we only have one real piece of data that's positive. 00:28:12.820 |
So that's all we use because we don't know what other words should be counted as positive 00:28:20.540 |
Now for the negative words, you could just sample one negative word and that would probably be okay. 00:28:29.900 |
But if you want sort of a slightly better, more stable sense of, OK, we'd like to, in 00:28:36.860 |
general, have other words have low probability. 00:28:39.840 |
It seems like you might be able to get better, more stable results if you instead say, let's sample a handful of negative words. 00:28:52.200 |
And for the negative words, well, it's easy to sample any number of random words you want. 00:28:57.160 |
And at that point, it's kind of a probabilistic argument. 00:29:00.040 |
The words that you're sampling might not be actually bad words to appear in the context. 00:29:06.440 |
They might actually be other words that are in the context, but 99.9% of the time, they 00:29:11.840 |
will be unlikely words to occur in the context. 00:29:17.880 |
And yes, you only sample 10 or 15 of them, but that's enough to make progress because 00:29:25.160 |
the center word is going to turn up on other occasions. 00:29:28.920 |
And when it does, you'll sample different words over here so that you gradually sample 00:29:33.240 |
different parts of the space and start to learn. 00:29:36.260 |
We had this co-occurrence matrix and it gives a representation of words as co-occurrence vectors. 00:29:49.280 |
I mean, there are actually two ways that people have commonly made these co-occurrence matrices. 00:29:54.720 |
One corresponds to what we've seen already, that you use a window around a word, which is similar to what Word2Vec does. 00:30:02.800 |
And that allows you to capture some locality and some of the sort of syntactic and semantic information around a word. 00:30:10.720 |
The other way these co-occurrence matrices have often been made is that normally documents 00:30:18.620 |
have some structure, whether it's paragraphs or just actual web pages, sort of size documents. 00:30:25.120 |
So you can just make your window size a paragraph or a whole web page and count co-occurrences at that granularity. 00:30:32.180 |
And this is the kind of method that's often been used in information retrieval, in methods like latent semantic analysis. 00:30:39.440 |
OK, so the question then is, are these kind of count word vectors good things to use? 00:30:53.520 |
They're not terrible, but they have certain problems. 00:30:57.280 |
But the kind of problems that they have, well, firstly, they're huge, though very sparse. 00:31:03.500 |
So this is back where I said before, if we had a vocabulary of half a million words, 00:31:08.360 |
when then we have a half a million dimensional vector for each word, which is much, much 00:31:14.600 |
bigger than the word vectors that we typically use. 00:31:20.100 |
And it also means that because we have these very high dimensional vectors, we have a lot of sparsity. 00:31:30.620 |
So the results that you get tend to be noisier and less robust, depending on what particular data was used. 00:31:38.860 |
And so in general, people have found that you can get much better results by working with low-dimensional vectors. 00:31:46.520 |
So then the idea is we can store most of the important information about the distribution 00:31:53.420 |
of words in the context of other words in a fixed small number of dimensions, giving a dense vector. 00:32:01.180 |
And in practice, the dimensionality of the vectors that are used is normally somewhere between 25 and 1,000. 00:32:08.420 |
And so at that point, we need to use some way to reduce the dimensionality of our count co-occurrence matrix. 00:32:19.880 |
So if you have a good memory from a linear algebra class, you hopefully saw singular value decomposition. 00:32:30.380 |
And singular value decomposition has various mathematical properties that I'm not going to talk about here, 00:32:37.720 |
giving you an optimal way, under a certain definition of optimality, 00:32:43.840 |
of producing a pair of reduced dimensionality matrices 00:32:52.460 |
that maximally well lets you recover the original matrix. 00:32:56.680 |
But the idea of the singular value decomposition is you can take any matrix such as our count 00:33:03.120 |
matrix and you can decompose that into three matrices: U, a diagonal matrix sigma, and V transpose. 00:33:20.160 |
Now, in these matrices, some parts of it are never used because since this matrix is rectangular, 00:33:28.680 |
there's nothing over here, and so this part of the V transpose matrix gets ignored. 00:33:35.520 |
But if you're wanting to get smaller dimensional representations, what you do is take advantage 00:33:42.840 |
of the fact that the singular values inside the diagonal sigma matrix are ordered from largest to smallest. 00:33:52.880 |
So what we can do is just delete out some of the smallest singular 00:34:00.520 |
values, which effectively means that in this product, some of U and some of V is also not used. 00:34:09.680 |
And so then as a result of that, we're getting lower dimensional representations for our 00:34:17.480 |
words if we're wanting to have word vectors, which still do as good as possible a job within 00:34:24.360 |
the given dimensionality of enabling you to recover the original co-occurrence matrix. 00:34:34.080 |
So from a linear algebra background, this is the obvious thing to use. 00:34:42.720 |
Well, if you just build a raw count co-occurrence matrix and run SVD on that and try and use 00:34:52.120 |
those as word vectors, it actually works poorly. 00:34:56.620 |
And it works poorly because if you get into the mathematical assumptions of SVD, you're 00:35:02.600 |
expecting to have these normally distributed errors. 00:35:07.680 |
And what you're getting with word counts looks not at all like something normally distributed, 00:35:16.840 |
because you have exceedingly common words like "a", "the", and "and", and you have a very 00:35:25.920 |
large number of rare words. But you can actually get something that works a lot better if you scale the counts in the cells. 00:35:32.680 |
So to deal with this problem of extremely frequent words, there are some things we can 00:35:38.400 |
We could just take the log of the raw counts. 00:35:47.280 |
And any of these kinds of ideas let you build a co-occurrence matrix from which you 00:35:53.480 |
get more useful word vectors by running something like SVD. 00:35:58.680 |
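A minimal sketch of that pipeline, assuming a dense vocab-by-vocab count matrix and using log(1 + count) as one of the scalings just mentioned:

```python
import numpy as np

def svd_word_vectors(X, k=2):
    """Reduce a (vocab x vocab) co-occurrence count matrix to k-dimensional vectors."""
    X_scaled = np.log1p(X)                           # log(1 + count) damps very frequent words
    U, S, Vt = np.linalg.svd(X_scaled, full_matrices=False)
    return U[:, :k] * S[:k]                          # keep only the k largest singular values

# Stand-in counts for an 8-word vocabulary.
X = np.random.poisson(1.0, size=(8, 8)).astype(float)
print(svd_word_vectors(X, k=2).shape)                # (8, 2)
```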
And indeed, these kind of models were explored in the 1990s and in the 2000s. 00:36:05.840 |
And in particular, Doug Rohde explored a number of these ideas as to how to improve the co-occurrence 00:36:13.320 |
matrix in a model that he built that was called COALS. 00:36:18.200 |
And actually, in his COALS model, he observed the fact that you could get the same kind 00:36:27.800 |
of linear semantic components that we saw last time when talking about analogies. 00:36:37.400 |
So for example, this is a figure from his paper. 00:36:41.800 |
And you can see that we seem to have a meaning component going from a verb to the person who performs that action. 00:36:50.200 |
So drive to driver, swim to swimmer, teach to teacher, marry to priest. 00:36:55.480 |
And these vector components are not perfectly, but are roughly parallel and roughly the same size. 00:37:04.840 |
And so we have a meaning component there that we could add on to another word, just like with the analogies. 00:37:13.120 |
We could say drive is to driver as marry is to what. 00:37:18.080 |
And we'd add on this green vector component, which is roughly the same as this one. 00:37:24.920 |
So this space could actually get some word vector analogies right as well. 00:37:33.160 |
And so that seemed really interesting to us around the time Word2Vec came out: wanting 00:37:40.160 |
to understand better what the iterative updating algorithm of Word2Vec did and how it related 00:37:46.320 |
to these more linear algebra based methods that had been explored in the couple of decades before. 00:37:53.000 |
And so for the next bit, I want to tell you a little bit about the GloVe algorithm, which 00:37:58.040 |
was an algorithm for word vectors that was made by Jeffrey Pennington, Richard Socher, and me. 00:38:07.400 |
And so the starting point of this was to try to connect together the linear algebra based 00:38:14.200 |
methods on co-occurrence matrices like LSA and COALS with the models like skip-gram, 00:38:20.520 |
CBOW, and their other friends, which were iterative neural updating algorithms. 00:38:26.400 |
So on the one hand, you know, the linear algebra methods actually seemed like they had advantages 00:38:32.560 |
for fast training and efficient usage of statistics. 00:38:36.760 |
But although there had been work on capturing word similarities with them, by and large, 00:38:44.120 |
the results weren't as good, perhaps because of disproportionate importance given to large counts. 00:38:50.120 |
Conversely, for the neural models, it seems like if you're just doing these gradient 00:38:58.480 |
updates on windows, you're somehow inefficiently using statistics versus a co-occurrence matrix. 00:39:05.820 |
But on the other hand, it's actually easier to scale to a very large corpus by trading 00:39:14.320 |
And at that time, it seemed like the neural methods just worked better for people, that 00:39:20.400 |
they generated improved performance on many tasks, not just on word similarity, and that 00:39:25.840 |
they could capture complex patterns such as the analogies that went beyond word similarity. 00:39:33.560 |
And so what we wanted to do was understand a bit more as to what properties do you need 00:39:39.720 |
to have these analogies work out, as I showed last time. 00:39:44.980 |
And so what we realized was that if you'd like to have these sort of vector subtractions 00:39:54.520 |
and additions work for an analogy, the property that you want concerns meaning components. 00:40:04.440 |
So a meaning component is something like going from male to female, queen to king, or going 00:40:12.320 |
from a vehicle to its agent, truck to driver. And those meaning components should be represented as linear components of the vector space. 00:40:29.800 |
So suppose the meaning component that we want to get out is the spectrum from solid to gas, as with ice and steam. 00:40:39.800 |
Well, you'd think that you can get at the solid part of it, perhaps by saying, does the word solid co-occur with ice? 00:40:48.440 |
And the word solid occurs with ice, so that looks hopeful. 00:40:52.200 |
And gas doesn't occur with ice much, so that looks hopeful. 00:40:55.760 |
But the problem is the word water will also occur a lot with ice. 00:41:00.640 |
And if you just take some other random word, like the word random, it probably doesn't occur with ice much either. 00:41:08.040 |
In contrast, if you look at words co-occurring with steam, solid won't occur with steam much, but gas will. 00:41:16.920 |
But water will again, and random will be small. 00:41:20.940 |
So to get out the meaning component we want of going from gas to solid, what's actually 00:41:26.660 |
really useful is to look at the ratio of these co-occurrence probabilities. 00:41:32.940 |
Because then we get a spectrum from large to small between solid and gas, whereas for 00:41:39.660 |
water and a random word, it basically cancels out and gives you one. 00:41:45.880 |
I just wrote these numbers in, but if you count them up in a large corpus, it basically works out like this. 00:41:53.180 |
So here are actual co-occurrence probabilities, and that for water and my random word, which 00:42:00.060 |
was fashion here, these are approximately one. 00:42:03.820 |
Whereas the ratio of the probability of co-occurrence of solid with ice versus steam is about 10, and for gas it's about a tenth. 00:42:16.580 |
So how can we capture these ratios of co-occurrence probabilities as linear meaning components 00:42:26.220 |
so that in our word vector space we can just add and subtract linear meaning components? 00:42:32.420 |
Well, it seems like the way we can achieve that is if we build a log bilinear model so 00:42:39.900 |
that the dot product between two word vectors attempts to approximate the log of the probability of co-occurrence. 00:42:48.860 |
So if you do that, you then get this property that the difference between two vectors, its 00:42:59.060 |
similarity to another word corresponds to the log of the probability ratio shown on the previous slide. 00:43:07.100 |
So the glove model wanted to try and unify the thinking between the co-occurrence matrix 00:43:15.940 |
models and the neural models by being in some way similar to a neural model, but actually 00:43:23.340 |
calculated on top of a co-occurrence matrix count. 00:43:29.420 |
So we had an explicit loss function, and our explicit loss function is that we wanted the 00:43:36.540 |
dot product to be similar to the log of the co-occurrence count. 00:43:42.020 |
We actually added in some bias terms here, but I'll ignore those for the moment. 00:43:46.420 |
And we wanted to not have very common words dominate, and so we capped the effect of high 00:43:53.140 |
word counts using this f function that's shown here. 00:43:57.860 |
And then we could optimize this j function directly on the co-occurrence count matrix. 00:44:05.940 |
So that gave us fast training scalable to huge corpora. 00:44:15.260 |
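Written out, that objective looks roughly like the sketch below; W, W_tilde, b, b_tilde are the word/context vectors and biases (illustrative names), and x_max = 100, alpha = 3/4 are the defaults from the GloVe paper:

```python
import numpy as np

def f_weight(x, x_max=100.0, alpha=0.75):
    """GloVe's weighting: grows with the count but is capped at 1 once x >= x_max."""
    return np.minimum((x / x_max) ** alpha, 1.0)

def glove_loss(X, W, W_tilde, b, b_tilde):
    """J = sum over nonzero cells of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
    J = 0.0
    rows, cols = np.nonzero(X)
    for i, j in zip(rows, cols):
        diff = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
        J += f_weight(X[i, j]) * diff ** 2
    return J
```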
So if you run this algorithm and ask what are the nearest words to frog, you get frogs, 00:44:21.540 |
toad, and then you get some complicated words, but it turns out they are all kinds of frogs until you get further down the list. 00:44:32.340 |
And so this actually seemed to work out pretty well. 00:44:37.940 |
To discuss that a bit more, I now want to say something about how do we evaluate word vectors. 00:44:51.940 |
What do you mean by an inefficient use of statistics as a con for skip gram? 00:44:56.260 |
Well, what I mean is that, you know, for word2vec, you're just, you know, looking at one center 00:45:06.700 |
word at a time and generating a few negative samples. 00:45:11.580 |
And so it sort of seems like doing something precise there. 00:45:17.380 |
Whereas if you're doing an optimization algorithm on the whole matrix at once, well, you actually have much more information. 00:45:27.660 |
You're not just looking at what other words occurred in this one context of this one word. 00:45:35.380 |
You've got the entire vector of co-occurrence counts for the center word with other words. 00:45:41.260 |
And so therefore you can much more efficiently and less noisily work out how to minimize your loss. 00:45:54.820 |
So I've sort of said, look at these word vectors. 00:45:59.380 |
And I sort of showed you a few things at the end of the last class, which argued, hey, 00:46:14.740 |
And indeed for natural language processing, as in other areas of machine learning, a big 00:46:19.860 |
part of what people are doing is working out good ways to evaluate the knowledge that systems have. 00:46:30.180 |
So in general for NLP evaluation, people talk about two ways of evaluation, intrinsic and extrinsic. 00:46:38.060 |
So an intrinsic evaluation means that you evaluate directly on the specific or intermediate subtask that you've been working on. 00:46:49.260 |
So I want a measure where I can directly score how good my word vectors are. 00:46:54.580 |
And normally intrinsic evaluations are fast to compute. 00:46:59.620 |
They helped you to understand the component you've been working on. 00:47:03.740 |
But often simply trying to optimize that component may or may not have a very big effect 00:47:12.060 |
on the overall system that you're trying to build. 00:47:17.000 |
So people have also been very interested in extrinsic evaluations. 00:47:22.260 |
So an extrinsic evaluation is that you take some real task of interest to human beings, 00:47:29.180 |
whether that's web search or machine translation or something like that, and you say your goal 00:47:35.420 |
is to actually improve performance on that task. 00:47:39.420 |
Well that's a real proof that this is doing something useful. 00:47:47.980 |
But on the other hand, it also has some disadvantages. 00:47:52.180 |
It takes a lot longer to evaluate on an extrinsic task because it's a much bigger system. 00:47:59.820 |
And sometimes, you know, when you change things, it's unclear whether the fact that the numbers 00:48:06.620 |
went down was because you now have worse word vectors or whether it's just somehow the other 00:48:13.600 |
components of the system interacted better with your old word vectors. 00:48:18.820 |
And if you change the other components as well, things would get better again. 00:48:23.380 |
So in some ways it can sometimes be muddier to see if you're making progress. 00:48:29.340 |
But I'll touch on both of these methods here. 00:48:34.140 |
So for intrinsic evaluation of word vectors, one way which we mentioned last time was this word vector analogy task. 00:48:44.440 |
So we could simply give our models a big collection of word vector analogy problems. 00:48:50.080 |
So we could say man is to woman as king is to what, and ask the model to find the word 00:48:56.800 |
that is closest using that sort of word analogy computation, and hope that what comes out there is queen. 00:49:05.960 |
And so that's something people have done and have worked out an accuracy score of how often the model gets it right. 00:49:14.120 |
At this point, I should just mention one little trick of these word vector analogies that 00:49:20.320 |
everyone uses but not everyone talks about, at least in the first instance. 00:49:25.640 |
I mean, there's a little trick which you can find in the Gensim code if you look at it, 00:49:31.160 |
but when it does man is to woman as king is to what, something that could often happen 00:49:43.180 |
is that, once you do your pluses and your minuses, the word that comes out closest is actually one of the input words, like king itself. 00:49:54.660 |
So the way people always do this is that they don't allow one of the three input words to be returned as the answer. 00:50:03.500 |
So you're choosing the nearest word that isn't one of the input words. 00:50:11.420 |
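A minimal sketch of that evaluation step (not the Gensim implementation; `vecs` here is an assumed dict mapping each word to a unit-length NumPy vector):

```python
import numpy as np

def analogy(a, b, c, vecs):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b, c."""
    target = vecs[b] - vecs[a] + vecs[c]       # e.g. man : woman :: king : ?
    target = target / np.linalg.norm(target)
    best, best_sim = None, -np.inf
    for w, v in vecs.items():
        if w in (a, b, c):                     # the trick: never return an input word
            continue
        sim = float(v @ target)                # cosine similarity (vectors are unit length)
        if sim > best_sim:
            best, best_sim = w, sim
    return best
```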
So this slide here is showing results from the GloVe vectors. 00:50:16.220 |
So the glove vectors have this strong linear component property, just like I showed before 00:50:30.760 |
And so because of this, you'd expect in a lot of cases that word analogies would work 00:50:36.260 |
because I can take the vector difference of man and woman. 00:50:40.160 |
And then if I add that vector difference on to brother, I expect to get to sister, and indeed that works. 00:50:48.340 |
But of course, they may not always work, right? 00:50:51.980 |
Because if I start from emperor, it's sort of on a more of a lean. 00:50:56.040 |
And so it might turn out that I get countess or duchess coming out instead. 00:51:02.440 |
You can do this for various different relations, so different semantic relations. 00:51:06.920 |
So these sort of word vectors actually learn quite a bit of just world knowledge. 00:51:12.200 |
So here's the company CEO, or this is the company CEO around 2010 to 2014, when the data these vectors were trained on comes from. 00:51:23.680 |
And as well as semantic things or pragmatic things like this, they also learn syntactic things. 00:51:29.800 |
So here are vectors for positive, comparative, and superlative forms of adjectives. 00:51:35.500 |
And you can see those also move in roughly linear components. 00:51:40.840 |
So the word2vec people built a data set of analogies so you could evaluate different sets of word vectors on them. 00:51:55.640 |
So there are semantic and syntactic analogies. 00:52:00.320 |
OK, so what I said before is if you just use unscaled co-occurrence counts and pass them through an SVD, it works poorly. 00:52:14.780 |
But then, as I also pointed out, if you do some scaling, you can actually get SVD of 00:52:20.240 |
a scaled count matrix to work reasonably well. 00:52:28.980 |
And now we're getting up to 60.1, which actually isn't a bad score, right? 00:52:32.740 |
So you can actually do a decent job without a neural network. 00:52:36.880 |
And then here are the two variants of the word2vec model. 00:52:43.440 |
And here are our results from the GloVe model. 00:52:46.400 |
And of course, at the time, 2014, we took this as absolute proof that our model was 00:52:53.240 |
better and our more efficient use of statistics was really working in our favor. 00:53:00.040 |
With seven years of retrospect, I think that's kind of not really true, it turns out. 00:53:05.000 |
I think the main part of why we scored better is that we actually had better data. 00:53:10.840 |
And so there's a bit of evidence about that on this next slide here. 00:53:15.960 |
So this looks at the semantic, syntactic, and overall performance on word analogies 00:53:23.360 |
of GloVe models that were trained on different subsets of data. 00:53:28.920 |
So in particular, the two on the left are trained on Wikipedia. 00:53:34.640 |
And you can see that training on Wikipedia makes you do really well on semantic analogies, 00:53:40.600 |
which maybe makes sense because Wikipedia just tells you a lot of semantic facts. 00:53:44.720 |
I mean, that's kind of what encyclopedias do. 00:53:48.240 |
And so one of the big advantages we actually had was that Wikipedia, that the GloVe model 00:53:55.680 |
was partly trained on Wikipedia as well as other text, whereas the word2vec model that 00:54:01.080 |
was released was trained exclusively on Google News, so Newswire data. 00:54:06.240 |
And if you only train on a smallish amount of Newswire data, you can see that for the 00:54:13.280 |
semantics, it's just not as good as even one quarter of the size amount of Wikipedia data. 00:54:22.080 |
Though if you get a lot of data, you can compensate for that. 00:54:25.160 |
So here on the right hand side, they then have Common Crawl web data. 00:54:29.800 |
And so once there's a lot of web data, so now 42 billion words, you'll then start to 00:54:35.240 |
get good scores again from the semantic side. 00:54:39.840 |
The graph on the right then shows how well do you do as you increase the vector dimension. 00:54:47.360 |
And so what you can see there is, you know, 25 dimensional vectors aren't very good. 00:54:56.560 |
And so 100 dimensional vectors already work reasonably well. 00:55:00.080 |
So that's why I used 100 dimensional vectors when I showed my example in class. 00:55:06.800 |
That was sort of the sweet spot, not too long to load while working reasonably well. 00:55:11.840 |
But you still get significant gains going to 200 dimensions and somewhat to 300. 00:55:16.200 |
So at least back around sort of 2013 to 15, everyone sort of gravitated to the view that 300 dimensions was a good size. 00:55:25.400 |
So most frequently, if you look through the best known sets of word vectors, that include 00:55:30.600 |
the word2vec vectors and the GloVe vectors, usually what you get is 300 dimensional vectors. 00:55:39.960 |
That's not the only intrinsic evaluation you can do. 00:55:43.760 |
Another intrinsic evaluation you can do is see how well these models model human judgments of word similarity. 00:55:52.840 |
So psychologists for several decades have actually taken human judgments of word similarity, 00:56:00.800 |
where literally you're asking people for pairs of words like professor and doctor to give 00:56:07.200 |
them a similarity score that's sort of being measured as some continuous quantity on a fixed scale. 00:56:16.600 |
And so there are human judgments which are then averaged over multiple human judges. 00:56:31.560 |
Stock and CD aren't very similar at all, but stock and Jaguar are even less similar. 00:56:38.560 |
So we could then say for our models, do they have the same similarity judgments? 00:56:46.240 |
And in particular, we can measure a correlation coefficient of whether they give the same ordering of similarities. 00:56:57.040 |
And so there are various different data sets of word similarities, and we can score different 00:57:02.360 |
models as to how well they do on similarities. 00:57:05.920 |
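A rough sketch of how such a score can be computed, assuming `pairs` is a list of word pairs, `human_scores` their averaged human ratings, and `vecs` a word-to-vector dict (all assumed data formats):

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def similarity_correlation(pairs, human_scores, vecs):
    """Rank correlation between model cosine similarities and human similarity ratings."""
    model_scores = [cosine(vecs[w1], vecs[w2]) for w1, w2 in pairs]
    rho, _ = spearmanr(model_scores, human_scores)
    return rho    # higher means the model orders similarities more like people do
```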
And again, you see here that plain SVD works comparatively better here for similarities 00:57:15.840 |
You know, it's not great, but it's now not completely terrible, because we no longer need the linear analogy properties. 00:57:28.720 |
And we got some of the same kind of minor advantages from the GloVe model. 00:57:35.240 |
A lot of the students were asking if you could re-explain the objective function for the 00:57:39.880 |
GloVe model and also what log bilinear means. 00:58:00.440 |
So the property that we want is that we want the dot product to represent the log probability of co-occurrence. 00:58:14.520 |
So that then gives me this term, log bilinear. 00:58:20.220 |
So the "bi" is that there's sort of the wi and the wj, so that there are sort of two linear terms. 00:58:30.800 |
So rather than having a sort of an a·x, where you just have 00:58:37.800 |
something that's linear in x and a is a constant, 00:58:41.800 |
it's bilinear because we have the wi and wj and it's linear in both of them. 00:58:47.760 |
And that's then related to the log of a probability. 00:58:57.080 |
And so since we'd like these things to be equal, what we're doing here, if you ignore 00:59:05.520 |
these two bias terms in the middle, is that we're wanting to say the difference between these two should be small. 00:59:14.920 |
So we're taking this difference and we're squaring it. 00:59:19.320 |
And we want that squared term to be as small as possible. 00:59:30.160 |
But the other bit that's in here is a lot of the time when you're building models, rather 00:59:38.000 |
than simply having sort of an ax model, it seems useful to have a bias term, which can 00:59:45.840 |
move things up and down for the word in general. 00:59:51.200 |
And so we added into the model bias term so that there's a bias term for both words. 00:59:57.000 |
So if in general probabilities are high for a certain word, this bias term can model that. 01:00:03.600 |
And for the other word, this bias term can model it. 01:00:10.800 |
And -- oh, actually, I just saw someone asked why we're multiplying by the f of X term. 01:00:27.880 |
So this last bit was to scale things depending on the frequency of a word. 01:00:38.640 |
Because you want to pay more attention to words that are more common or word pairs that co-occur more often. 01:00:49.560 |
Because if you think about it in word2vec terms, you're seeing if things have a co-occurrence 01:00:56.800 |
count of 50 versus 3, you want to do a better job at modeling the co-occurrence of the things that occurred together 50 times. 01:01:11.480 |
And so you want to take into account the count of co-occurrence. 01:01:16.360 |
But then the argument is that that actually leads you astray when you have extremely common words. 01:01:24.200 |
And so effectively, you paid more attention to words that co-occurred together up until a certain count, and beyond that it made no difference. 01:01:35.440 |
So it didn't matter if it was an extremely, extremely common word. 01:01:39.080 |
So then for extrinsic word vector evaluation. 01:01:45.560 |
So at this point, you're now wanting to sort of say, well, can we embed our word vectors into a system that performs some real task? 01:01:58.000 |
And do different word vectors work better or worse than other word vectors? 01:02:03.480 |
So this is something that we'll see a lot of later in the class. 01:02:07.400 |
I mean, in particular, when you get on to doing assignment 3, in assignment 3 you build a dependency parser. 01:02:16.200 |
And you can then use word vectors in the dependency parser and see how much they help. 01:02:22.640 |
We don't actually make you test out different sets of word vectors, but you could. 01:02:28.640 |
Here's just one example of this to give you a sense. 01:02:31.560 |
So the task of named entity recognition is going through a piece of text and identifying 01:02:37.560 |
mentions of a person name or an organization name like a company or a location. 01:02:44.000 |
And so if you have good word vectors, do they help you do named entity recognition better? 01:02:55.040 |
So if one starts off with a model that simply has discrete features, so it uses word identity 01:03:02.040 |
as features, you can build a pretty good named entity model doing that. 01:03:06.360 |
But if you add into it word vectors, you get a better representation of the meaning of 01:03:11.640 |
words and so that you can have the numbers go up quite a bit. 01:03:16.600 |
And then you can compare different models to see how much gain they give you in terms of performance. 01:03:24.760 |
So skipping ahead, this was a question that I was asked after class, which was about word senses. 01:03:38.440 |
So far, for one particular word string, say "house", we're going to say 01:03:45.400 |
for each of those strings, there's a word vector. 01:03:49.760 |
And if you think about it a bit more, that seems like it's very weird, because actually, 01:03:58.200 |
most words, especially common words, and especially words that have existed for a long time, actually 01:04:04.880 |
have many meanings, which are very different. 01:04:08.360 |
So how could that be captured if you only have one word vector for the word, because 01:04:12.720 |
you can't actually capture the fact that you've got different meanings for the word, because 01:04:17.680 |
your meaning for the word is just one point in space, one vector. 01:04:22.200 |
And so as an example of that, here's the word pike. 01:04:32.240 |
Well, what kind of meanings does the word pike have? 01:04:36.600 |
So you can maybe just think for a minute and think what meanings the word pike has. 01:04:46.800 |
And it actually turns out, you know, it has a lot of different meanings. 01:04:50.960 |
So perhaps the most basic meaning is, if you did fantasy games or something, the medieval weapon, a sort of long spear. 01:05:02.280 |
But there's a kind of a fish that has a similar elongated shape that's a pike. 01:05:12.360 |
Maybe that usage isn't used much anymore, but it certainly still survives in referring to kinds of roads. 01:05:17.720 |
So this is like when you have turnpikes; we also have expressions where pike means the future, as in coming down the pike. 01:05:25.760 |
It's a position in diving, that divers do a pike. 01:05:37.880 |
You know, different usages might have different currency. 01:05:40.920 |
In Australia, you can also use pike to mean that you pull out of doing something, to pike out. 01:05:49.440 |
I don't think that usage is used in America, but lots of meanings. 01:05:52.720 |
And actually, for words that are common, or if you start thinking of words like line or 01:05:56.760 |
field, I mean, they just have even more meanings than this. 01:06:00.600 |
So what are we actually doing with just one vector for a word? 01:06:05.960 |
And well, one way you could go is to say, okay, up until now, what we've done is crazy. 01:06:12.840 |
Pike, and other words, have all of these different meanings. 01:06:16.680 |
So maybe what we should do is have different word vectors for the different meanings of 01:06:24.000 |
So we'd have one word vector for the medieval pointy weapon, another word vector for the 01:06:30.600 |
kind of fish, another word vector for the kind of road. 01:06:40.560 |
I mean, actually, we were working on that in the early 2010s, even before Word2Vec. 01:06:50.760 |
So this picture is a little bit small to see, but what we were doing was for words, we were 01:06:59.480 |
clustering instances of a word, so clustering the word tokens, 01:07:07.040 |
hoping that clusters of similar tokens represented senses. 01:07:10.800 |
And then for the clusters of word tokens, we were sort of treating them like they were 01:07:15.280 |
separate words and learning a word vector for each. 01:07:21.880 |
So in green, we have two senses for the word bank. 01:07:26.200 |
And so there's one sense for the word bank that's over here, where it's close to words 01:07:30.400 |
like banking, finance, transaction, and laundering. 01:07:34.040 |
And then we have another sense for the word bank over here, where it's close to words 01:07:38.180 |
like plateau, boundary, gap, territory, which is the riverbank sense of the word bank. 01:07:57.140 |
So that's the sort of big game animal sense of jaguar. 01:08:01.400 |
Up the top here, it's being shown close to luxury and convertible. 01:08:08.600 |
Then jaguar here is near string, keyboard, and words like that. 01:08:18.000 |
And then this final jaguar over here is close to software and Microsoft. 01:08:23.360 |
And then if you're old enough, you'll remember that there was an old version of Mac OS that was called Jaguar. 01:08:33.240 |
And we can learn word vectors for different senses of a word. 01:08:38.520 |
But actually, this isn't the majority way that things have then gone in practice. 01:08:45.120 |
And there are kind of a couple of reasons for that. 01:08:51.480 |
If you do this, it's kind of complex, because you first of all have to learn word sensors 01:08:58.360 |
and then start learning word vectors in terms of the word sensors. 01:09:01.760 |
But the other reason is, although this model of having word senses is traditional, it's 01:09:10.120 |
what you see in dictionaries, it's commonly what's being used in natural language processing. 01:09:15.760 |
I mean, it tends to be imperfect in its own way, because we're trying to take all the 01:09:20.440 |
uses of the word pike and sort of cut them up into different senses, where those senses 01:09:30.360 |
are kind of overlapping, and it's often not clear which ones to count as distinct. 01:09:35.040 |
So for example, here, right, a railroad line and a type of road, well, sort of that's the 01:09:39.720 |
same sense of pike, it's just that they're different forms of transportation. 01:09:44.080 |
And so you know that this could be, you know, a type of transportation line and cover both 01:09:49.320 |
So it's always sort of very unclear how you cut word meaning into different senses. 01:09:55.960 |
And indeed, if you look at different dictionaries, everyone does it differently. 01:10:01.760 |
So it actually turns out that in practice, you can do rather well by simply having one word vector per word type. 01:10:17.600 |
Well, what you find is that what you learn as a word vector is what gets referred to 01:10:28.600 |
in fancy talk as a superposition of the word vectors for the different senses 01:10:37.400 |
of a word, where the word superposition means no more or less than a weighted sum. 01:10:44.400 |
So the vector that we learned for pike will be a weighted average of the vectors that 01:10:50.480 |
you would have learned for the medieval weapon sense, plus the fish sense, plus the road 01:10:56.520 |
sense, plus whatever other senses that you have, where the weighting that's given to 01:11:01.360 |
these different sense vectors corresponds to the frequencies of use of the different senses. 01:11:08.080 |
So we end up with the vector for pike being a kind of an average vector. 01:11:16.840 |
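As a tiny illustration of that superposition idea, with invented sense vectors and frequencies (nothing here is learned from real data):

```python
import numpy as np

# Illustrative sense vectors and usage frequencies for "pike" (all made up).
sense_vecs = np.random.randn(3, 100)      # weapon sense, fish sense, road sense
freqs = np.array([0.2, 0.3, 0.5])         # relative frequency of each sense in the corpus

# The single learned vector behaves like this frequency-weighted sum (superposition).
pike_vec = (freqs[:, None] * sense_vecs).sum(axis=0)
print(pike_vec.shape)                     # (100,)
```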
And so if you're, if you're, say, okay, you've just added up several different vectors 01:11:24.720 |
into an average, you might think that that's kind of useless, because, you know, you've 01:11:30.920 |
lost the real meanings of the word, and you've just got some kind of funny average vector 01:11:39.400 |
But actually, it turns out that if you use this average vector in applications, it tends to work pretty well. 01:11:50.360 |
Because if you say, is the word pike similar to the word for fish? 01:11:57.560 |
Well, part of this vector represents fish, the fish sense of pike. 01:12:02.980 |
And so in those components, it'll be kind of similar to the fish vector. 01:12:07.680 |
And so yes, it'll say that there's substantial similarity. 01:12:13.680 |
Whereas if in another piece of text that says, you know, the men were armed with pikes and 01:12:21.480 |
lances or pikes and maces or whatever other medieval weapons you remember, well, actually, 01:12:28.380 |
some of that meaning is in the pike vector as well. 01:12:31.460 |
And so it'll say, yeah, there's good similarity with mace and staff and words like that as well. 01:12:38.780 |
And in fact, we can work out which sense of pike is intended by just sort of seeing which 01:12:46.060 |
components are similar to other words that are used in the same context. 01:12:51.840 |
And indeed, there's actually a much more surprising result than that. 01:12:56.220 |
And this is a result that's due to Sanjeev Arora, Tengyu Ma, who's now on our Stanford faculty, and colleagues. 01:13:05.900 |
And that's the following result, which I'm not actually going to explain. 01:13:09.380 |
But so if you think that the vector for pike is just a sum of the vectors for the different 01:13:19.180 |
senses, well, you'd think that it's just completely impossible to reconstruct 01:13:27.540 |
the sense vectors from the vector for the word type. 01:13:34.260 |
Because normally, if I say I've got two numbers and the sum of them is 17, you just have no information about what they are. 01:13:45.300 |
And even worse, if I tell you I've got three numbers, and they sum to 17. 01:13:50.140 |
But it turns out that when we have these high dimensional vector spaces, that things are 01:13:57.500 |
so sparse in those high dimensional vector spaces, that you can use ideas from sparse 01:14:03.420 |
coding to actually separate out the different senses, providing they're relatively common. 01:14:11.540 |
So they show in their paper that you can start with the vector of say, pike, and actually 01:14:17.280 |
separate out components of that vector that correspond to different senses of the word 01:14:23.480 |
And so here's an example at the bottom of this slide, which is for the word tie, separating out its different senses. 01:14:33.320 |
And so there's one sense that is close to the words trousers, blouse, waistcoats, and this is the clothing sense of tie. 01:14:40.980 |
Another sense is close to wires, cables, wiring, electrical. 01:14:45.380 |
So that's the sort of the tie sense of a tie used in electrical stuff. 01:14:50.500 |
Then we have sort of scoreline, goal, equalizer, so this is the sporting game sense of tie. 01:14:56.420 |
This one also seems to in a different way, evoke sporting game sense of tie. 01:15:02.380 |
Then there's finally this one here, maybe my music is just really bad. 01:15:06.860 |
Maybe it's because you get ties in music when you tie notes together, I guess.