
Stanford CS224N: NLP with Deep Learning | Winter 2021 | Lecture 1 - Intro & Word Vectors


Chapters

0:00 Introduction
1:43 Goals
3:10 Human Language
10:07 Google Translate
10:43 GPT
14:13 Meaning
16:19 Wordnet
19:11 Word Relationships
20:27 Distributional Semantics
23:33 Word Embeddings
27:31 Word2Vec
37:55 How to minimize loss
39:55 Interactive whiteboard
41:10 Gradient
48:50 Chain Rule

Transcript

Hi, everybody. Welcome to Stanford CS224N, also known as Ling284, Natural Language Processing with Deep Learning. I'm Christopher Manning and I'm the main instructor for this class. So what we hope to do today is to dive right in. So I'm going to spend about 10 minutes talking about the course, and then we're going to get straight into content for reasons I'll explain in a minute.

So we'll talk about human language and word meaning. I'll then introduce the ideas of the Word2Vec algorithm for learning word meaning. And then going from there, we'll kind of concretely work through how you can work out objective function gradients with respect to the Word2Vec algorithm and say a teeny bit about how optimization works.

And then right at the end of the class, I then want to spend a little bit of time giving you a sense of how these word vectors work and what you can do with them. So really, the key learning for today is I want to give you a sense of how amazing deep learning word vectors are.

So we have this really surprising result that word meaning can be represented not perfectly, but really rather well by a large vector of real numbers. And, you know, that's sort of in a way a commonplace of the last decade of deep learning, but it flies in the face of thousands of years of tradition and it's really rather an unexpected result to start focusing on.

OK, so quickly, what do we hope to teach in this course? So we've got three primary goals. The first is to teach you the foundations, i.e. a good, deep understanding of the effective modern methods for deep learning applied to NLP. So we are going to start with and go through the basics, and then go on to key methods that are used in NLP: recurrent networks, attention, transformers and things like that.

We want to do something more than just that. We'd also like to give you some sense of a big picture understanding of human languages and what are the reasons for why they're actually quite difficult to understand and produce, even though humans seem to do it easily. Now, obviously, if you really want to learn a lot about this topic, you should enroll in and go and start doing some classes in the linguistics department.

But nevertheless, for a lot of you, this is the only human language content you'll see during your master's degree or whatever. And so we do hope to spend a bit of time on that starting today. And then finally, we want to give you an understanding of an ability to build systems in PyTorch for some of the major problems in NLP.

So we'll look at learning word meanings, dependency parsing, machine translation, question answering. Let's dive in to human language. Once upon a time, I had a much longer introduction that gave lots of examples about how human languages can be misunderstood and complex. I'll show a few of those examples in later lectures.

But since for today we're going to be focused on word meaning, I thought I'd just give one example, which comes from a very nice XKCD cartoon. And it isn't about some of the sort of syntactic ambiguities of sentences; instead it's really emphasizing the important point that language is a social system, constructed and interpreted by people.

And that's part of how it changes as people decide to adapt its construction. And that's part of the reason why human languages are great as an adaptive system for human beings, but difficult as a system for our computers to understand to this day. So in this conversation between the two women, one says, anyway, I could care less.

And the other says, I think you mean you couldn't care less. Saying you could care less implies you care at least some amount. And the other one says, I don't know. We're these unbelievably complicated brains drifting through a void, trying in vain to connect with one another by blindly flinging words out into the darkness.

Every choice of phrasing, spelling and tone and timing carries countless signals and contexts and subtexts and more. And every listener interprets those signals in their own way. Language isn't a formal system. Language is glorious chaos. You can never know for sure what any words will mean to anyone. All you can do is try to get better at guessing how your words affect people so you can have a chance of finding the ones that will make them feel something like what you want them to feel.

Everything else is pointless. I assume you're giving me tips on how you interpret words because you want me to feel less alone. If so, then thank you. That means a lot. But if you're just running my sentences past some mental checklist so you can show off how well you know them, then I could care less.

OK, so that's ultimately what our goal is: to do a better job at building computational systems that try to get better at guessing how their words will affect other people and what other people are meaning by the words that they choose to say. So an interesting thing about human language is it is a system that was constructed by human beings.

And it's a system that was constructed, you know, relatively recently in some sense. So in discussions of artificial intelligence, a lot of the time people focus a lot on human brains and the neurons buzzing by. And this intelligence that's meant to be inside people's heads. But I just wanted to focus for a moment on the role of language.

There's actually, you know, this is kind of controversial, but, you know, it's not necessarily the case that humans are much more intelligent than some of the higher apes like chimpanzees or bonobos. Right. So chimpanzees and bonobos have been shown to be able to use tools to make plans. And in fact, chimps have much better short term memory than human beings do.

So relative to that, if you look through the history of life on Earth, human beings developed language really recently. How recently? We kind of actually don't know because, you know, there's no fossils that say, OK, here's a language speaker. But, you know, most people estimate that language arose for human beings sort of, you know, somewhere in the range of one hundred thousand to a million years ago.

OK, that's a while ago. But compared to the process of evolution of life on Earth, that's kind of blinking an eyelid. But that power of communication between human beings quickly set off our ascendancy over other creatures. So it's kind of interesting that the ultimate power turned out not to be having poisonous fangs or being super fast or super big, but having the ability to communicate with other members of your tribe.

It was much more recently, again, that humans developed writing, which allowed knowledge to be communicated across distances of time and space. And so that's only about five thousand years old, the power of writing. So in just a few thousand years, the ability to preserve and share knowledge took us from the Bronze Age to the smartphones and tablets of today.

So a key question for artificial intelligence and human-computer interaction is how to get computers to be able to understand the information conveyed in human languages. Simultaneously, artificial intelligence requires computers to have the knowledge of people. Fortunately, now our AI systems might be able to benefit from a virtuous cycle.

We need knowledge to understand language and people well, but it's also the case that a lot of that knowledge is contained in language, spread out across the books and Web pages of the world. And that's one of the things we're going to look at in this course: how we can sort of build on that virtuous cycle.

A lot of progress has already been made, and I just want to very quickly give a sense of that. So in the last decade or so, and especially in the last few years with neural methods of machine translation, we're now in a space where machine translation really works moderately well.

So, again, from the history of the world, this is just amazing, right? For thousands of years, learning other people's languages was a human task which required a lot of effort and concentration. But now we're in a world where you could just hop on your Web browser and think, oh, I wonder what the news is in Kenya today.

And you can head off over to a Kenyan website and you can see something like this and you can go, huh, and you can then ask Google to translate it for you from Swahili. And, you know, the translation isn't quite perfect, but it's, you know, it's reasonably good. So the newspaper Tuco has been informed that local government minister, Linsan Belakanyama, and his transport counterpart, Siddig Meir, died within two separate hours.

So, you know, within two separate hours is kind of awkward, but essentially we're doing pretty well at getting the information out of this page. And so that's quite amazing. The single biggest development in NLP for the last year, certainly in the popular media, was GPT-3, which was a huge new model that was released by OpenAI.

What GPT-3 is about and why it's great is actually a bit subtle. And so I can't really go through all the details of this here, but it's exciting because it seems like it's the first step on the path to what we might call universal models, where you can train up one extremely large model on something like that library picture I showed before.

And it just has knowledge of the world, knowledge of human languages, knowledge of how to do tasks, and then you can apply it to do all sorts of things. So no longer are we building a model to detect spam and then a model to detect pornography and then a model to detect whatever foreign language content and just building all these separate supervised classifiers for every different task.

We've now just built up a model that understands. So exactly what it does is it just predicts following words. On the left, it's being told to write about Elon Musk in the style of Dr. Seuss, and it started off with some text and then it's generating more text. And the way it generates more text is literally by just predicting one word at a time, following words to complete its text.

But this has a very powerful facility, because what you can do with GPT-3 is you can give it a couple of examples of what you'd like it to do. So I can give it some text and say, I broke the window, change it into a question, what did I break?

I gracefully save the day, I change it into a question, what did I gracefully save? And then that text tells GPT-3 what I'm wanting it to do. And so then if I give it another statement, like I gave John flowers, I can then say GPT-3 predict what words come next, and it'll follow my prompt and produce who did I give flowers to.

I can say I gave her a rose and a guitar, and it will follow the idea of the pattern and do 'who did I give a rose and a guitar to'. And actually this one model can then do an amazing range of things, including many that it's quite surprising it can do at all.

So that's one example of that. Another thing that you can do is get it to translate human language sentences into SQL. So this can make it much easier to do CS145. So having given it a couple of examples of SQL translation of human language text, which I'm this time not showing because it won't fit on my slide, I can then give it a sentence like 'how many users have signed up since the start of 2020', and it turns it into SQL. Or I can give it another query, 'what is the average number of influencers each user is subscribed to?'

And it then converts that into SQL. So GPT-3 knows a lot about the meaning of language and the meaning of other things like SQL and can fluently manipulate it. Okay, so that leads us straight into this topic of meaning, and how do we represent the meaning of a word?

Well, what is meaning? Well, we can look up something like the Webster dictionary and say, okay, the idea that is represented by a word, the idea that a person wants to express by using words, signs, etc. So Webster's dictionary definition is really focused on the word idea somehow, but this is pretty close to the commonest way that linguists think about meaning.

So that they think of word meaning as being a pairing between a word, which is a signifier or symbol, and the thing that it signifies, the signified thing, which is an idea or thing, so that the meaning of the word chair is a set of things that are chairs.

This is referred to as denotational semantics, a term that's also used and similarly applied for the semantics of programming languages. This model isn't very deeply implementable, like, how do I go from the idea that, okay, chair means the set of chairs in the world, to something I can manipulate meaning with in my computers.

So, traditionally, the way that meaning has normally been handled in natural language processing systems is to make use of resources like dictionaries and thesauri. In particular, a popular one is WordNet, which organizes words and terms into both synonym sets, words that can mean the same thing, and hypernyms, which correspond to 'is a' relationships.

And so for the 'is a' relationships, you know, we can kind of look at the hypernyms of panda: a panda is a kind of procyonid, whatever those are, I guess that's probably together with red pandas, which is a kind of carnivore, which is a kind of placental, which is a kind of mammal, and you sort of head up this hypernym hierarchy.

So WordNet has been a great resource for NLP, but it's also been highly deficient. So, it lacks a lot of nuance. So for example, in WordNet, proficient is listed as a synonym for good, but, you know, maybe that's sometimes true, but it seems like in a lot of contexts it's not true, and you mean something rather different when you say proficient versus good.

So, it's limited as a human-constructed thesaurus. So, in particular, there's lots of words and lots of uses of words that just aren't there, including, you know, anything that is sort of more current terminology: like, wicked is there for the wicked witch, but not for more modern colloquial uses.

So, the word ninja certainly isn't there for the kind of description some people make of programmers, and it's impossible to keep up to date. So it requires a lot of human labor, but even when you have that, you know, it has a sense of synonyms, but doesn't really have a good sense of words that mean something similar.

Fantastic and great mean something similar without really being synonyms. And so this idea of meaning similarity is something that it'd be really useful to make progress on, and where deep learning models excel. Okay, so what's the problem with a lot of traditional NLP? Well, the problem with a lot of traditional NLP is that words are regarded as discrete symbols. So we have symbols like hotel, conference, motel, our words, which in deep learning speak we refer to as a localist representation.

And that's because, if you want to represent these symbols in statistical or machine learning systems, each of them is a separate thing. So the standard way of representing them, and this is what you do in something like a statistical model if you're building a logistic regression model with words as features, is that you represent them as one-hot vectors, so you have a dimension for each different word.

So maybe, like my example here, these are my representations as vectors for motel and hotel. And so that means that we have to have huge vectors, corresponding to the number of words in our vocabulary. So if you had a high school English dictionary, it probably had about 250,000 words in it.

But there are many, many more words in the language really, so maybe we at least want to have a 500,000-dimensional vector to be able to cope with that. But the bigger, even bigger problem with discrete symbols is that we don't have this notion of word relationships and similarity.

So for example, in web search, if a user searches for Seattle motel, we'd also like to match on documents containing Seattle hotel. But our problem is we've got these one-hot vectors for the different words. And so, in a formal mathematical sense, these two vectors are orthogonal, so there's no natural notion of similarity between them whatsoever.
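To make that orthogonality concrete, here's a minimal sketch (not from the lecture; the tiny six-word vocabulary is made up) showing that the one-hot vectors for any two different words always have dot product zero:

```python
# Minimal sketch: one-hot (localist) vectors for two words in a toy vocabulary.
import numpy as np

vocab = ["the", "a", "seattle", "motel", "hotel", "zoo"]  # hypothetical tiny vocabulary
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a localist one-hot vector with a 1 in the word's dimension."""
    v = np.zeros(len(vocab))
    v[word_to_index[word]] = 1.0
    return v

motel, hotel = one_hot("motel"), one_hot("hotel")
print(motel)                 # [0. 0. 0. 1. 0. 0.]
print(hotel)                 # [0. 0. 0. 0. 1. 0.]
print(np.dot(motel, hotel))  # 0.0 -> orthogonal, so no notion of similarity
```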

Well, there are some things that we could try and do about that, and people did do things about that before 2010. We could say, hey, we could use WordNet synonyms and count things that are listed as synonyms as similar anyway. Or, hey, maybe we could build up representations of words that have meaning overlap. And people did all of those things, but they tended to fail badly from incompleteness.

So instead, what I want to introduce today is the modern deep learning method of doing that, where we encode similarity in the real-valued vectors themselves. So how do we go about doing that? The way we do that is by exploiting this idea called distributional semantics. So the idea of distributional semantics is, again, something that when you first see it, maybe feels a little bit crazy.

Because rather than having something like denotational semantics, what we're now going to do is say that a word's meaning is going to be given by the words that frequently appear close to it. J.R. Firth was a British linguist from the middle of last century, and one of his pithy slogans that everyone quotes at this moment is: you shall know a word by the company it keeps.

And so this idea that you can represent a sense of a word's meaning as a notion of what contexts it appears in has been a very successful idea, one of the most successful ideas that's used throughout statistical and deep learning NLP. It's actually an interesting idea, more philosophically, so that there are kind of interesting connections; for example, in Wittgenstein's later writings, he became enamored of a use theory of meaning.

And this is, in some sense, a use theory of meaning. But whether, you know, it's the ultimate theory of semantics is actually still pretty controversial. But it proves to be an extremely useful, computational sense of semantics, which has just led to it being used everywhere very successfully in deep learning systems.

So when a word appears in a text, it has a context, which is the set of words that appear nearby. So for a particular word, my example here is banking, we'll find a bunch of places where banking occurs in texts, and we'll collect the sort of nearby words as context words, and we'll say that those words appearing in that kind of muddy brown color around the word banking, those context words, will in some sense represent the meaning of the word banking.
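As a small illustration of what "context" means here, this sketch (my own toy example, loosely based on the slide's sentence) collects the words within a window of size 2 around each occurrence of a center word:

```python
# Toy sketch: collect context words within a fixed window around a center word.
def context_words(tokens, center_word, window=2):
    contexts = []
    for i, tok in enumerate(tokens):
        if tok == center_word:
            left = tokens[max(0, i - window):i]    # words just before the center
            right = tokens[i + 1:i + 1 + window]   # words just after the center
            contexts.append(left + right)
    return contexts

sentence = "government debt problems turning into banking crises as happened in 2009".split()
print(context_words(sentence, "banking"))
# [['turning', 'into', 'crises', 'as']]
```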

While I'm here, let me just mention one distinction that will come up regularly when we're talking about a word in our natural language processing class. There are two senses of 'word', which are referred to as types and tokens. So there's a particular instance of a word: in the first example, 'government debt problems turning into banking crises', there's banking there.

That's a token of the word banking. But then when I've collected a bunch of instances of, quote unquote, the word banking, and I say 'the word banking' and show a bunch of examples of it, I'm treating banking as a type, which refers to, you know, the uses and meaning the word banking has across instances.

So, what are we going to do with these distributional models of language? Well, based on looking at the words that occur in context, what we want to do is build up a dense real-valued vector for each word that in some sense represents the meaning of that word. And the way it will represent the meaning of that word is that this vector will be useful for predicting other words that occur in the context.

So, in this example, to keep it manageable on the slide, the vectors are only eight-dimensional. But in reality we use considerably bigger vectors, so a very common size is actually 300-dimensional vectors. Okay, so for each word, that's a word type, we're going to have a word vector. These also go by other names.

They're referred to as neural word representations, or, for a reason that'll become clear on the next slide, they're referred to as word embeddings. So these are now a distributed representation, not a localist representation, because the meaning of the word banking is spread over all 300 dimensions of the vector. And these are called word embeddings because, effectively, when we have a whole bunch of words,

these representations place them all in a high-dimensional vector space, and so they're embedded into that space. Now, unfortunately, human beings are very bad at looking at 300-dimensional vector spaces, or even eight-dimensional vector spaces. The only thing that I can really display to you here is a two-dimensional projection of that space.

Now, even that's useful. But it's also important to realize that when you're making a two-dimensional projection of a 300-dimensional space, you're losing almost all the information in that space, and a lot of things will be crushed together that don't actually deserve to be close together. So here's my word embeddings.

Of course you can't see any of those at all. But if I zoom in, and then I zoom in further, what you'll already see is that the representations we've learned distributionally do a good job at grouping together words that are similar in meaning. So in this sort of overall picture, I can zoom into one part of the space, which is actually the part that's up here in this view of it.

And it's got words for countries. So not only are countries generally grouped together; even the sort of particular sub-groupings of countries make a certain amount of sense, and down here we then have nationality words. And then in another part of the space we can see different kinds of words. So here are verbs, and we have ones like come and go being very similar, and saying and thinking words, say, think, expect, being kind of similar and nearby.

And then down here in the bottom right, we have sort of verbal auxiliaries and copulas, so have, had, has, forms of the verb to be. And certain contentful verbs are similar to copula verbs because they describe states, you know, he remained angry, he became angry, and they're actually then grouped close together with the verb to be. So there's a lot of interesting structure in this space that then represents the meaning of words. The algorithm I'm going to introduce now is one that's called Word2Vec, which was introduced by Mikolov and colleagues in 2013 as a framework for learning word vectors, and it's sort of a simple and easy-to-understand place to start.

So the idea is we have a lot of text from somewhere, which we commonly refer to as a corpus of text. Corpus is just the Latin word for body, so it's a body of text. And we choose a fixed vocabulary, which will typically be large, but nevertheless truncated, so we get rid of some of the really rare words.

So we might say a vocabulary size of 400,000, and we then create for ourselves a vector for each word. Okay, so then what we do is we want to work out what's a good vector for each word. And the really interesting thing is that we can learn these word vectors from just a big pile of text, by doing this distributional similarity task of being able to predict what words occur in the context of other words.

So in particular, we're going to iterate through the text. And so at any moment we have a center word, C and context words outside of it, which we'll call O. And then, based on the current word vectors, we're going to calculate the probability of a context word occurring, given the center word, according to our current model.

And we know that certain words did actually occur in the context of that center word. And so what we want to do is then keep adjusting the word vectors to maximize the probability that's assigned to words that actually occur in the context of the center word, as we proceed through these texts.

So to start to make that a bit more concrete, this is what we're doing. So we have a piece of text, we choose our center word, which is here 'into'. And then we say, well, we have a model for predicting the probability of context words given the center word, and we'll come to this model in a minute, but it's defined in terms of our word vectors.

And so let's see what probability it gives to the words that actually occurred in the context of this word. Huh. It gives them some probability, but maybe it'd be nice if the probability assigned was higher. So how can we change our word vectors to raise those probabilities? And so we'll do some calculations with 'into' being the center word, and then we'll just go on to the next word, and then we'll do the same kind of calculations and keep on chugging along.

So the big question then is, well, what are we doing for working out the probability of a word occurring in the context of the center word? And so that's the central part of what we develop as the Word2Vec objective. And then we have an overall model that we want to use.

So, for each position in our corpus, our body of text, we want to predict context words within a window of fixed size, given the center word w_t. And we want to become good at doing that, so we want to give high probability to the words that actually occur in the context.

And so what we're going to do is we're going to work out what's formally the data likelihood, as to how good a job we do at predicting words in the context of other words. And so formally, that likelihood is going to be defined in terms of our word vectors, so they're the parameters of our model, and it's going to be calculated as taking the product over using each word as the center word, and then the product over each word in a window around that, of the probability of predicting that context word given the center word.

And so in this model, we're going to have an objective function, sometimes also called a cost or a loss, that we want to optimize. And essentially what we want to do is we want to maximize the likelihood of the context we see around center words. But following standard practice, we fiddle with that a little, because rather than dealing with products, it's easier to deal with sums.

And so we work with log likelihood. And once we take log likelihood all of our products turn into sums. And then we also work with the average log likelihood so we've got a one on T term here for the number of words in the corpus. And finally, for no particular reason, we like to minimize our objective function, rather than maximizing it so we stick a minus sign in there.
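Written out, the data likelihood and the objective just described (the standard skip-gram formulation, with window size m and text length T) are:

```latex
% Data likelihood: for each position t, predict context words within a window of size m
L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t; \theta)

% Objective: average negative log likelihood (minimizing J maximizes the likelihood)
J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta)
```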

And then by minimizing this objective function J of theta, we're maximizing our predictive accuracy. Okay, so that's the setup, but we still haven't made any progress on how we calculate the probability of a word occurring in the context, given the center word. So, the way we're actually going to do that is we have vector representations for each word, and we're going to work out the probability simply in terms of the word vectors.

Now at this point there's a little technical point, we're actually going to give to each word, two word vectors, one word vector for when it's used as the center word, and a different word vector when it's used as a context word. And this is done because it just simplifies the math and the optimization.

So it seems a little bit ugly, but actually makes building word vectors a lot easier, and really, we can come back to that and discuss it later. But that's what it is. And so then once we have these word vectors, the equation that we're going to use for giving the probability of a context word appearing given the center word is that we're going to calculate it using the expression in the middle bottom of my slide.
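The expression on the slide being referred to is the softmax over dot products, where u_o is the outside (context) vector of word o and v_c is the center vector of word c:

```latex
P(o \mid c) = \frac{\exp\!\left(u_o^{\top} v_c\right)}{\sum_{w=1}^{V} \exp\!\left(u_w^{\top} v_c\right)}
```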

So, let's sort of pull that apart just a little bit more. So what we have here with this expression is: for a particular center word c and a particular context word o, we're going to look up the vector representation of each word, so they're u_o and v_c.

And so then we're simply going to take the dot product of those two vectors. So, dot product is a natural measure of similarity between words, because if in a particular dimension both components are positive, you'll get a contribution that adds to the dot product sum. If both are negative, it'll also add to the dot product sum.

If one's positive and one's negative, it'll subtract from the similarity measure. If both of them are zero, it won't change the similarity. So, it sort of seems a sort of plausible idea to just take a dot product and thinking, well, if two words have a larger dot product, that means they're more similar.

So, then after that, we're sort of really doing nothing more than: okay, we want to use dot products to represent word similarity, and now let's do the dumbest thing that we know how to turn this into a probability distribution. So, what do we do? Well, firstly, taking a dot product of two vectors might come out as positive or negative, but we want to have probabilities, and we can't have negative probabilities.

So, one way to avoid negative probabilities is to exponentiate them, because then we know everything is positive. And so, then we are always getting a positive number in the numerator. But for probabilities, we also want to have the numbers add up to one. So, we have a probability distribution.

So, we're just normalizing in the obvious way where we divide through by the sum of the numerator quantity for each different word in the vocabulary. So, then necessarily that gives us a probability distribution. So, all the rest of that that I was just talking through, what we're using there is what's called the softmax function.

So, the softmax function will take any vector in R^n and turn it into things between zero and one. And so, we can take numbers and put them through this softmax and turn them into a probability distribution, right? So, the name comes from the fact that it's sort of like a max.

So, because of the fact that we exponentiate, that really emphasizes the biggest values among the different similarities we calculate. So, most of the probability goes to the most similar things. And it's called soft because, well, it doesn't do that absolutely. It'll still give some probability to everything that's in the slightest bit similar.

I mean, on the other hand, it's a slightly weird name because, you know, max normally takes a set of things and just returns one, the biggest of them, whereas the softmax is taking a set of numbers and is scaling them, but is returning the whole probability distribution. Okay, so now we have all the pieces of our model.
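For concreteness, here's a minimal NumPy version of that softmax piece (my own sketch; the max subtraction is just the usual numerical-stability trick, not something discussed in the lecture):

```python
import numpy as np

def softmax(scores):
    """Turn an arbitrary real-valued vector into a probability distribution."""
    shifted = scores - np.max(scores)  # stability: exp of large numbers overflows
    exps = np.exp(shifted)             # exponentiate -> everything positive
    return exps / exps.sum()           # normalize -> sums to one

print(softmax(np.array([2.0, 1.0, 0.1])))  # most of the mass goes to the largest score
```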

And so, how do we make our word vectors? Well, the idea of what we want to do is we want to fiddle our word vectors in such a way that we minimize our loss, i.e. that we maximize the probability of the words that we actually saw in the context of the center word.

And so, the theta represents all of our model parameters in one very long vector. So, for our model here, the only parameters are our word vectors. So, we have for each word, two vectors, its context vector and its center vector. And each of those is a d dimensional vector where d might be 300.

And we have V-many words. So, we end up with this big huge vector, which is 2dV long, which, if you have a 500,000-word vocab times the 300-dimensional vectors times two, is more math than I can do in my head, but it's got millions and millions of parameters. So, we've got millions and millions of parameters.
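Doing that head math on paper, using the numbers just mentioned:

```latex
|\theta| = 2 \times d \times V = 2 \times 300 \times 500{,}000 = 3 \times 10^{8}
```

So on the order of 300 million parameters.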

And we somehow want to fiddle them all to maximize the prediction of context words. And so, the way we're going to do that then is we use calculus. So, what we want to do is take that math that we've seen previously and say, huh, well, with this objective function, we can work out derivatives.

And so, we can work out where the gradient is. So, how we can walk downhill to minimize loss. So, we're at some point and we can figure out what is downhill and we can then progressively walk downhill and improve our model. And so, what our job is going to be is to compute all of those vector gradients.

Okay. So, at this point, I then want to kind of show a little bit more as to how we can actually do that. A couple more slides here, but maybe I'll just try and jigger things again and move to my interactive whiteboard. What we wanted to do, right, so we had our overall J of theta that we were wanting to minimize, our average negative log likelihood.

So, that was the minus one on T of the sum over t equals one to big T, which was our text length. And then we were going through the words in each context, so we were summing over j between minus m and m words on each side, except the center itself. And then what we wanted to do inside there was work out the log probability of the context word at that position, given the word that's in the center position t.

And so, then we converted that into our word vectors by saying that the probability of o given c is going to be expressed as the softmax of the dot product. Okay. So, now what we want to do is work out the gradient, the direction of downhill, for this loss.

And so, the way we're doing that is we're working out the partial derivative of this expression with respect to every parameter in the model. And all the parameters in the model are the components, the dimensions of the word vectors of every word. And so, we have the center word vectors and the outside word vectors.

So, here, I'm just going to do the center word vectors. But on a future homework, assignment two, the outside word vectors will show up and they're kind of similar. So, what we're doing is we're working out the partial derivative with respect to our center word vector, which is, you know, maybe a 300 dimensional word vector of this probability of O given C.

And since we're using log probabilities, that's the partial derivative of the log of this probability of o given c, of this exp of u_o transpose v_c over (my writing will get worse and worse, sorry, I've already made a mistake, haven't I?) the sum over w equals one to the vocabulary size of the exp of u_w transpose v_c.

Okay. Well, at this point, things start off pretty easy. So, what we have here is something that's log of A over B, so that's easy. We can turn this into log A minus log B. But before I go further, I'll just make a comment at this point. You know, so at this point, my audience divides on in two, right?

There are some people in the audience for which maybe a lot of people in the audience, this is really elementary math. I've seen this a million times before and he isn't even explaining it very well. So if you're in that group, well, feel free to look at your email or the newspaper or whatever else is best suited to you.

But I think there are also other people in the class who, oh, the last time I saw calculus was when I was in high school, for which that's not the case. So I wanted to spend a few minutes going through this a bit concretely so that to try and get over the idea that, you know, even though most of deep learning and even word vector learning seems like magic, that it's not really magic.

It's really just doing math. And one of the things we hope is that you do actually understand this math that's being done, so that you can keep along and do a bit more of it. Okay, so then what we have is this way of writing the log. We can say that the expression above equals the partial derivative with respect to v_c of the log of the numerator, log exp u_o transpose v_c, minus the partial derivative of the log of the denominator.

So that's then the sum over w equals one to V of the exp of u_w transpose v_c. Okay, so at that point, I have my numerator here, and my former denominator there. And the first part, the numerator part, is really, really easy.

So, we have here log and exp, which are just inverses of each other, so they just go away. So that becomes the derivative with respect to v_c of just what's left behind, which is u_o dot-producted with v_c. Okay. And so the thing to be aware of is, you know, we're still doing multivariate calculus here.

So what we have here is calculus with respect to a vector, like hopefully you saw some of in Math 51 or some other place, not high school single-variable calculus. On the other hand, you know, to the extent you half remember some of this stuff, most of the time you can do perfectly well by thinking about what happens with one dimension at a time, and it generalizes to the multivariable calculus.

So if about all that you remember of calculus is that d/dx of ax equals a, really, it's the same thing that we're going to be using here. Here we have the outside word vector dot-producted with v_c. Well, at the end of the day, that's going to have terms of, sort of, u_o component one times the center word component one, plus u_o component two times the center word component two, and so on.

And so we're sort of using this bit over here. And so what we're going to be getting out are the components of u_o, the u_o1 and the u_o2. So u_o1 will be all that is left when we take the derivative with respect to v_c1, and u_o2 will be everything that's left when we take the derivative with respect to the variable v_c2.

So the end result of taking the vector derivative of u_o dot-producted with v_c is simply going to be u_o. Okay, great. So that's progress. So then at that point, we go on and we say, oh damn, we still have the denominator, and that's slightly more complex, but not so bad.

So then we want to take the partial derivatives with respect to VC of the log of the denominator. Okay. And so then at this point, the one tool that we need to know and remember is how to use the chain rule. So the chain rule is when you're wanting to work out a way of having derivatives of compositions of functions.

So we have f of g of whatever x, but here it's going to be VC. And so we want to say, okay, what we have here is we're working out a composition of functions. So here's our f, and here is our x, which is g of VC. Actually, maybe I shouldn't call it x.

Oops. Maybe I should probably better to call it z or something. Okay, so when we then want to work out the chain rule, well, what do we do? We take the derivative of f at the point z. And so at that point, we have to actually remember something. We have to remember that the derivative of log is the one on x function.

So this is going to be equal to the one-on-x function evaluated at z. So that's then going to be one over the sum over w equals one to V of exp of u_w transpose v_c, multiplied by the derivative of the inner function, the part that is remaining. I hope I'm getting this right. Oh, and there's one trick here.

At this point, we do want to have a change of index. So we want to say the sum over x equals one to V of exp of u_x transpose v_c, since we can get into trouble if we don't change that variable to be a different one.

Okay, so at that point, we're making some progress, but we still want to work out the derivative of this. And so what we want to do is apply the chain rule once more. So now here's our f, and in here is our new z equals g of v, c.

And so we then sort of repeat over. So we can always move the derivative inside a sum. So we're then taking the derivative of this. And so then the derivative of exp is itself. So we're going to just have, summing over x equals one to V, exp of u_x transpose v_c times the derivative of u_x transpose v_c.

And so then this is what we've worked out before; we can just rewrite it as u_x. Okay, so we're now making progress. So if we start putting all of that together, what we have is the partial derivative with respect to v_c of this log probability. Right, we have the numerator part, which was just u_o, minus: we then had, from the denominator, the sum over x equals one to V of exp of u_x transpose v_c times u_x.

And then that was multiplied by our first term that came from the one-on-x, which gives you one over the sum over w equals one to V of the exp of u_w transpose v_c. And the fact that we changed the variables became important. And so by just sort of rewriting that a little, we can get that that equals u_o minus the sum over, oh sorry, x equals one to V of this exp of u_x transpose v_c, over the sum over w equals one to V of exp of u_w transpose v_c, times u_x.

And so at that point, this sort of interesting thing has happened: we've ended up getting straight back exactly the softmax-formula probability that we saw when we started. We can just rewrite that more conveniently as saying this equals u_o minus the sum over x equals one to V of the probability of x given c, times u_x.

And so what we have at that moment is this thing here is an expectation. And so this is an average over all the context vectors weighted by their probability according to the model. And so it's always the case with these softmax style models that what you get out for the derivatives is you get the observed minus the expected.
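Collecting the whole whiteboard calculation in one place, the derivation just walked through is:

```latex
\frac{\partial}{\partial v_c} \log P(o \mid c)
  = \frac{\partial}{\partial v_c} \log \frac{\exp(u_o^{\top} v_c)}{\sum_{w=1}^{V} \exp(u_w^{\top} v_c)}
  = \underbrace{\frac{\partial}{\partial v_c}\, u_o^{\top} v_c}_{=\; u_o}
    \;-\; \frac{\partial}{\partial v_c} \log \sum_{w=1}^{V} \exp(u_w^{\top} v_c)

% Chain rule on the denominator term:
\frac{\partial}{\partial v_c} \log \sum_{w=1}^{V} \exp(u_w^{\top} v_c)
  = \frac{\sum_{x=1}^{V} \exp(u_x^{\top} v_c)\, u_x}{\sum_{w=1}^{V} \exp(u_w^{\top} v_c)}
  = \sum_{x=1}^{V} P(x \mid c)\, u_x

% So the gradient is "observed minus expected":
\frac{\partial}{\partial v_c} \log P(o \mid c) = u_o - \sum_{x=1}^{V} P(x \mid c)\, u_x
```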

So our model is good if our model on average predicts exactly the word vector that we actually see. And so we're going to try and adjust the parameters of our model so it does that as much as possible. Now, I mean, we try and make it do it as much as possible.

I mean, of course, as you'll find, you can never get close, right? You know, if I just say to you, okay, the word is croissant, which words are going to occur in the context of croissant? I mean, you can't answer that. There are all sorts of sentences that you could say that involve the word croissant.

So actually, our particular probability estimates are going to be kind of small, but nevertheless, we want to sort of fiddle our word vectors to try and make those estimates as high as we possibly can. So I've gone on about this stuff a bit, but haven't actually sort of shown you any of what actually happens.

So I just want to quickly show you a bit of that, as to what actually happens with word vectors. So here's a simple little iPython notebook, which is also what you'll be using for assignment one. So in the first cell, I import a bunch of stuff. So we've got NumPy for our vectors, matplotlib for plotting, scikit-learn, kind of your machine learning Swiss Army knife.

GenSim is a package that you may well not have seen before. It's a package that's often used for word vectors. It's not really used for deep learning. So this is the only time you'll see it in the class. But if you just want a good package for working with word vectors and some other application, it's a good one to know about.

Okay, so then in my second cell here, I'm loading a particular set of word vectors. So these are our GloVe word vectors that we made at Stanford in 2014. And I'm loading 100 dimensional word vectors so that things are a little bit quicker for me while I'm doing things here.
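For reference, one way to load a comparable set of 100-dimensional GloVe vectors with Gensim looks roughly like this (a sketch using gensim's downloader; the exact loading in the lecture notebook may differ):

```python
# Rough sketch of loading 100-dimensional GloVe vectors as Gensim KeyedVectors.
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-100")   # pre-trained GloVe vectors

print(model["bread"][:5])        # first few components of the 'bread' vector
print(model["croissant"][:5])
print(model.most_similar("croissant", topn=5))
print(model.most_similar("banana", topn=5))
```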

So I can sort of do model of 'bread' and model of 'croissant'. Well, what I've just got here are word vectors. So I just wanted to sort of show you that there are word vectors. Well, maybe I should have loaded those word vectors in advance. Let's see. Oh, okay. Well, I'm in business.

Okay, so right. So here are my word vectors for bread and croissant. And while I'm seeing that maybe these two words are a bit similar, so both of them are negative in the first dimension, positive in the second, negative in the third, positive in the fourth, negative in the fifth.

So it sort of looks like they might have a fair bit of dot product, which is kind of what we want because bread and croissant are kind of similar. But what we can do is actually ask the model, and these are Gensim functions now, you know, what are the most similar words so I can ask for croissant.

What are the most similar words to that, and it will tell me it's things like brioche, baguette, focaccia. So that's pretty good. Pudding is perhaps a little bit more questionable. We can say, most similar to the USA and it says Canada, America, USA with periods, United States, that's pretty good.

Most similar to banana. I get out coconut, mangoes, bananas, sort of fairly tropical fruit. Great. Before finishing though, I want to show you something slightly more than just similarity, which is one of the amazing things that people observed with these word vectors, and that was to say, you can actually sort of do arithmetic in this vector space that makes sense.

And so in particular people suggested this analogy task. And so the idea of the analogy task is you should be able to start with a word like king, and you should be able to subtract out a male component from it, add back in a woman component, and then you should be able to ask, well what word is over here, and what you'd like is that the word over there is queen.

And so we're going to do that with this same most_similar function, which actually does a bit more: as well as giving it positive words, you can ask for most similar negative words. And you might wonder what's most negatively similar to a banana, and you might be thinking, oh, it's, I don't know, some kind of meat or something.

Actually, that by itself isn't very useful, because when you just ask for what's most negatively similar to things, you tend to get crazy strings that were found in the data set, and you don't know what they mean, if anything. But if we put the two together, we can use the most similar function with positives and negatives to do analogies.

So, we're going to say we want positively king, we want to subtract out negatively man, we want to then add in positively woman, and find out what's most similar to this point in the space. So my analogy function does precisely that, by giving most_similar a couple of positive words and then subtracting out the negative one.
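The analogy function being described is roughly a thin wrapper around most_similar like this (a sketch of what the course notebook does, reusing the model object loaded above; the exact helper may differ):

```python
def analogy(x1, x2, y1, model):
    """x1 : x2 :: y1 : ?   e.g. analogy('man', 'king', 'woman') -> 'queen'."""
    result = model.most_similar(positive=[y1, x2], negative=[x1], topn=1)
    return result[0][0]

print(analogy("man", "king", "woman", model))         # expected: queen
print(analogy("australia", "beer", "france", model))  # expected: champagne
```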

And so we can try out this analogy function. So I can do the analogy I show in the picture, with man is to king as woman is to... I'm not saying this right. Yeah, man is to king as woman is to... Sorry, I haven't run my cells. Okay, man is to king as woman is to queen, so that's great.

And that works well. And you can do it the sort of other way around: king is to man as queen is to woman. If this only worked for that one freakish example, you maybe wouldn't be very impressed, but you know, it actually turns out, like, it's not perfect, but you can do all sorts of fun analogies with this, and they actually work. So, you know, I could ask for something like an analogy.

Oh, here's a good one. Australia is to beer as France is to what, and you can think about what you think the answer to that one should be, and it comes out as champagne, which is pretty good. Or I could ask for something like analogy pencil is to sketching as camera is to what, and it says photographing.

You can also do the analogies with people. At this point I have to point out that this data, and the model, was built in 2014, so you can't ask anything about Donald Trump in it (well, Trump is in there, but not as president), but I could ask something like an analogy.

Obama is to Clinton as Reagan is to what, and you can think of what you think is the right analogy there. The analogy it returns is Nixon. So I guess that depends on what you think of Bill Clinton, as to whether you think that was a good analogy or not.

You can also do sort of linguistic analogies with it, so you can do something like: analogy tall is to tallest as long is to what, and it says longest. So it really just sort of knows a lot about the meaning and behavior of words. And you know, I think when these methods were first developed, and hopefully still for you, people were just gobsmacked about how well this actually worked at capturing the meaning of words.

And so these word vectors then went everywhere as a new representation that was so powerful for working out word meaning. And so that's our starting point for this class, and we'll say a bit more about them next time. And they're also the basis of what you're looking at for the first assignment.

Can I ask a quick question about the distinction between the two vectors per word? Yes. My understanding is that there can be several context words per word in the vocabulary, but then there's only two vectors. I kind of thought the distinction between the two is that one is like the actual word and one's like the context word, but there are multiple context words.

Right. How do you, how do you pick just two then? Well, so we're doing every one of them, right. So, like, maybe I won't turn back on the screen share, but you know what we were doing: in the objective function there was a sum. So you've got, you know, this big corpus of text, right, so you're taking a sum over every word, with it appearing as the center word, and then inside that there's a second sum, which is over each word in its context, so you are going to count each word as a context word.

And so then, for one particular term of that objective function, you've got a particular context word and a particular center word, but you're then sort of summing over different context words for each word, and then you're summing over all of the choices of different center words. And to say just a sentence more about having two vectors.

I mean, you know, in some sense it's an ugly detail, but it was done to make things sort of simple and fast. So, you know, if you look at the math carefully, if you sort of treated these two vectors as the same, so if you use the same vector for center and context.

You can say, okay, let's work out the derivatives. Things get uglier, and the reason that they get uglier is: it's okay, but when I'm iterating over all the choices of context word, oh my god, sometimes the context word is going to be the same as the center word, and so that messes up my derivatives, whereas by taking them as separate vectors that never happens, so it's easy.

But the kind of interesting thing is, you know, saying that you have these two different representations sort of just ends up really doing no harm. And my wave-my-hands argument for that is, you know, since we're kind of moving through each position of the corpus one by one, a word that is the center word at one moment is going to be a context word at the next moment, and the word that was the context word is going to have become the center word.

So you're sort of doing the computation both ways in each case. And so you should be able to convince yourself that the two representations for the word end up being very similar. And they do: not identical, for technical reasons of the ends of documents and things like that, but very, very similar.

So we tend to get two very similar representations for each word, and we just average them and call that the word vector. And so when we use word vectors we just have one vector for each word. That makes sense. Thank you. I have a question purely out of curiosity.

So we saw when we projected the vectors, the word vectors onto the 2D surface we saw like little clusters of words that are similar to each other and then later on we saw that with the analogies thing, we kind of see that there's these directional vectors that sort of indicate like the ruler of or the CEO of something like that.

And so I'm wondering, is there, are there relationships between those relational vectors themselves such as like, is the ruler of vector sort of similar to the CEO of vector which is very different from like is makes a good sandwich with vector. Is there any research on that. That's a good question.

Wow, you've stumped me already in the first lecture. I mean, I can't actually think of a piece of research, and so I'm not sure I have a confident answer. I mean, it seems like that's a really easy thing to check.

Once you have one of these sets of word vectors, it seems like for any relationship that is represented well enough by words, you should be able to see if it comes out kind of similar. I mean, I'm not sure; we can look and see. That's totally okay, just curious.

I want to go back a little bit to your answer to the first question. So when you want to collapse the two vectors for the same word, do you usually take the average? Different people have done different things, but as for the most common practice: you know, there's still a bit left to cover about running Word2Vec that we didn't really get through today, so I've still got a bit more work to do on Thursday. But, you know, once you run your Word2Vec algorithm, your output is two vectors for each word, kind of like when it's the center and when it's the context.

And so, typically, people just average those two vectors and say, okay, that's the representation of the word croissant, and that's what appears in the sort of word vectors file like the one I loaded. So my question is, if a word has two different meanings or multiple different meanings, can we still represent it as a single vector?

Yes, that's a very good question. Actually, there is some content on that in Thursday's lecture, so I can say more about that then. But yes, lots of words do have lots of meanings. So if you have a word like star, that can be an astronomical object, or it can be, you know, a film star, a Hollywood star, or it can be something like the gold stars that you got in elementary school, and we're just taking all those uses of the word star and collapsing them together into one word vector.

That sounds really crazy and bad, but actually turns out to work rather well. Maybe I won't go through all of that right now because there is actually stuff on that on Thursday's lecture. Oh, I see. I'm going to save my slides for next time. My next question is, do we look at how to implement or do we look at like the stack of like something like Alexa or something for like speech to context actions in this course, or is it just primarily understanding.

Yeah, so this is an unusual quarter. But for this quarter, there's a very clear answer, which is: this quarter there's a new class being taught, which is CS224S, a speech class being taught by Andrew Maas. And, you know, this is a class that hasn't been offered that regularly.

Sometimes it's only been offered every third year, but it's being offered right now. So, if what you want to do is learn about speech recognition and learn about sort of methods for building dialogue systems, do CS224S. So, you know, for this class in general, the vast bulk of this class is working with text and doing various kinds of text analysis and understanding. So we do tasks like some of the ones I've mentioned: we do machine translation.

We do question answering. We look at how to parse the structure of sentences and things like that. You know, in other years I sometimes say a little bit about speech, but since this quarter there's a whole different class that's focused on speech, that seemed a little bit silly. So that sounds good.

I'm now getting a bad echo I'm not sure if that's my fault or your fault but anyway. Yeah, so, yeah, so the speech class does a mix of stuff so I mean the sort of pure speech problems classically have been doing speech recognition so going from a speech signal to text and doing text to speech going from text to a speech signal, and both of those which are now normally done, including by the cell phone that sits in your pocket, using neural networks and so it covers, both of those, but then between that the class covers quite a bit.

And in particular, it starts off with looking at building dialogue systems, so this is sort of something like Alexa, Google Assistant, Siri: assuming you have a speech recognition and a text-to-speech system, so you do have text in and text out, what are the kinds of ways that people go about building dialogue systems like the ones that I just mentioned?

I have a question. So, I think there was some people in the chat noticing that the like opposites were really near to each other, which was kind of odd, but I was also wondering, what about like positive and negative valence or like affect is that captured in this type of model, or is it like not captured well like with the opposites how those weren't really.

So the short answer, for both of those (and this is a good question and a good observation), the short answer is no, both of those are captured really, really badly. I mean, let me define what I mean when I say really, really badly. What I mean is, if that's what you want to focus on,

you've got problems. I mean, it's not that the algorithm doesn't work. So precisely what you find is that, you know, antonyms generally occur in very similar topics, because whether it's saying, you know, 'John is really tall', 'John is really short', or 'that movie was fantastic' or 'that movie was terrible', right, you get antonyms occurring in the same contexts. So because of that, their vectors are very similar. And similarly for sort of affect- and sentiment-based words: well, like my great and terrible example, their contexts are similar.

Therefore, if you're just learning this kind of predict-words-in-context model, no, that's not captured. Now, that's not the end of the story. You know, absolutely people wanted to use neural networks for sentiment and other kinds of sort of connotation and affect, and there are very good ways of doing that, but somehow you have to do something more than simply predicting words in context, because that's not sufficient to capture that dimension.

More on that later. Yeah, I mean, I was also wondering about, like, very basic adjectives, like 'so' and like 'not', because those would, like, appear in, like, similar contexts. What was your first example before 'not'? Like 'so', as in 'this is so cool'. So that's actually a good question as well.

So, yeah, so there are these very common words that are commonly referred to as function words by linguists, which includes ones like 'so' and 'not', other ones like 'and', and prepositions like, you know, 'to' and 'on'. You sort of might suspect that the word vectors for those don't work out very well, because they occur in all kinds of different contexts and they're not very distinct from each other in many cases, and to a first approximation I think that's true, and that's part of why I didn't use those examples in my slides.

Yeah. But, you know, at the end of the day, we do build up vector representations of those words too. And you'll see in a few lectures' time, when we start building what we call language models, that actually they do do a great job on those words as well. So, I mean, I think the meaning is there.

I mean, you know, another feature of the Word2Vec model is that it actually ignores the position of words, right? So it said, I'm going to predict every word around the center word, but, you know, I'm predicting it in the same way. I'm predicting identically the word before me versus the word after me, or the word two away in either direction, right? They're all just predicted the same by that one probability function.

So if that's all you've got, that sort of destroys your ability to do a good job at capturing these sort of common, more grammatical words like 'so', 'not' and 'and'. But we build slightly different models that are more sensitive to the structure of sentences, and then we start doing a good job on those too.

Okay, thank you. I have a question about the characterization of Word2Vec, because I read one of the readings, and it seems to characterize the architecture a bit differently from how it's presented here. So, Word2Vec is kind of a framework for building word vectors, and there are sort of several variant precise algorithms within the framework.

And, you know, one of them is whether you're predicting the context words or whether you're predicting the center word. So the model I showed was predicting the context words, so it was the skip-gram model. But then there's sort of a detail of how in particular you do the optimization, and what I presented was the sort of easiest way to do it, which is naive optimization with the softmax equation for word vectors.

And it turns out that that naive optimization is sort of needlessly expensive, and people have come up with faster ways of doing it. In particular, the commonest thing you see is what's called skip-gram with negative sampling, and negative sampling is then a much more efficient way to estimate things, and I'll mention that on Thursday.

Right. Okay. So, someone's asking for more information about how word vectors are constructed, beyond the summary of random initialization and then gradient-based iterative optimization. Yeah. So, I will do a bit more connecting this together in the Thursday lecture; I guess only so much can fit in the first class.

But the picture, the picture is essentially the picture I showed the pieces of. So, to learn word vectors, you start off by having a vector for each word type, both for when it's a center word and when it's an outside word, and those vectors you initialize randomly. So you just put small little numbers that are randomly generated in each vector component, and that's just your starting point.

And so from there on, you're using an iterative algorithm, where you're progressively updating those word vectors so they do a better job at predicting which words appear in the context of other words. And the way that we're going to do that is by using the gradients that I was starting to show how to calculate. And then, you know, once you have a gradient, you can walk in the opposite direction of the gradient, and you're then walking downhill, i.e. you're minimizing your loss, and we're going to do lots of that until our word vectors get as good as possible.
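Putting all of today's pieces together, here's a deliberately naive sketch of that whole loop on a toy corpus: random initialization, the softmax probability, the "expected minus observed" gradient, and gradient descent. It's written for readability rather than efficiency, and it updates both center and outside vectors, even though only the center-vector gradient was derived today; treat the outside-vector part as a preview of assignment 2, and the corpus and hyperparameters as made-up examples.

```python
import numpy as np

def train_skipgram(tokens, dim=10, window=2, lr=0.05, epochs=50, seed=0):
    """Naive skip-gram with the full softmax, on a toy corpus."""
    rng = np.random.default_rng(seed)
    vocab = sorted(set(tokens))
    idx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    # Two small random vectors per word: center (v) and outside/context (u).
    v_vecs = rng.normal(scale=0.1, size=(V, dim))
    u_vecs = rng.normal(scale=0.1, size=(V, dim))

    def softmax(scores):
        exps = np.exp(scores - scores.max())
        return exps / exps.sum()

    for _ in range(epochs):
        for t, word in enumerate(tokens):
            c = idx[word]
            for j in range(max(0, t - window), min(len(tokens), t + window + 1)):
                if j == t:
                    continue                              # skip the center position itself
                o = idx[tokens[j]]
                probs = softmax(u_vecs @ v_vecs[c])       # P(x | c) for every word x
                # d/dv_c of -log P(o | c): "expected minus observed" outside vector.
                grad_vc = u_vecs.T @ probs - u_vecs[o]
                # d/du_x of -log P(o | c), for all x (preview of assignment 2).
                grad_u = np.outer(probs, v_vecs[c])
                grad_u[o] -= v_vecs[c]
                v_vecs[c] -= lr * grad_vc                 # walk downhill
                u_vecs -= lr * grad_u
    return vocab, v_vecs, u_vecs

corpus = ("the quick brown fox jumps over the lazy dog "
          "the quick brown cat sleeps near the lazy dog").split()
vocab, v_vecs, u_vecs = train_skipgram(corpus)
print(v_vecs[vocab.index("quick")][:5])   # a learned (toy) center word vector
```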

So, you know, it's really all math, but in some sense, you know, word vector learning is sort of miraculous since you do literally just start off with completely random word vectors and run this algorithm of predicting words for a long time, and out of nothing emerges a set of word vectors that represent meaning well.