
Deep Learning for Natural Language Processing (Richard Socher, Salesforce)


Chapters

0:00
0:51 What is Natural Language Processing?
1:56 NLP Levels
5:27 (A tiny sample of) NLP Applications
9:01 Outline
18:34 Combining the best of both worlds: GloVe (Pennington et al. 2014)
20:30 GloVe results
21:24 Intrinsic word vector evaluation: Word Vector Analogies
23:37 GloVe Visualizations: Superlatives
23:54 Analogy evaluation and hyperparameters
25:47 Recurrent Neural Networks (!)
29:18 RNN language model
35:41 Attempt at a clean illustration
39:47 Pointer sentinel mixture models
41:33 Language Model Evaluation
43:55 Current Research
46:33 First Major Obstacle
47:25 Second Major Obstacle
49:56 High level idea for harder questions
50:25 Basic lego block: GRU (defined before)
50:33 Dynamic Memory Network
54:23 The Modules: Input
54:42 The Modules: Question
54:57 The Modules: Episodic Memory

Transcript

Thank you, everybody, and thanks for coming back very soon after lunch. I'll try to make it entertaining to avoid some post-food coma. I actually owe a lot for being here to Andrew and Chris and my PhD here at Stanford. It's always fun to be back.

I figured there's going to be a broad range of capabilities in the room. So I'm sorry, I will probably bore some of you for the first two-thirds of the talk, because I'll go over the basics of what's NLP, what's natural language processing, what's deep learning, and what's really at the intersection of the two.

And then the last third, I'll talk a little bit about some exciting new research that's happening right now. So let's get started with what is natural language processing? It's really a field at the intersection of computer science, AI, and linguistics. And you could define a lot of goals, and a lot of these statements here we could really talk and philosophize a lot about.

But I'll move through them pretty quickly. For me, the goal of natural language processing is for computers to process or, in scare quotes, "understand" natural language in order to perform tasks that are actually useful for people, such as question answering. The caveat here is that really fully understanding and representing the meaning of language, or even defining it, is quite an elusive goal.

So whenever I say the model understands, I'm sorry, I shouldn't say that. Really, these models don't understand anything in the sense that we understand language. So whenever somebody says they can read or represent the full meaning in its entire glory, it's usually not quite true. Really, perfect language understanding is in some sense AI-complete, in the sense that you need to understand all of visual input and thought and a lot of other complex things.

So a little more concretely, as we try to tackle this overall problem of understanding language, what are the different levels that we often look at? For many people, it often starts at speech. And then once you have speech, you might say, all right, now I know the phonemes, the smaller parts that make up words.

I understand how words form, that's morphology or morphological analysis. Once I know what the meaning of words are, I might try to understand how they're put together in grammatical ways such that the sentences are understandable or at least grammatically correct to a lot of speakers of the language. Once we go and we understand the structure, we actually want to get to the meaning.

And that's really where I think most of my interest lies, in semantic interpretation, actually trying to get to the meaning in some useful capacity. And then after that, we might say, well, if we understand now the meaning of a whole sentence, how do we actually interact? What's the discourse?

How do we have spoken dialogue systems, things like that? Where deep learning has really improved the state of the art significantly is in speech recognition, syntax, and semantics. And the interesting thing is that we're actually kind of skipping some of these levels. Deep learning often doesn't require morphological analysis to create very useful systems.

And in some cases, actually skips syntactic analysis entirely as well. It doesn't have to know about the grammar. It doesn't have to be taught about what noun phrases are, prepositional phrases. It can actually get straight to some semantically useful tasks right away. And that's going to be one of the sort of advantages that we don't have to actually be as inspired by linguistics as traditional natural language processing had to be.

So why is NLP hard? Well, there's a lot of complexity in representing and learning, and especially using, linguistic, situational, world, and visual knowledge. Really, all of these are connected when it gets to the meaning of language. To really understand what a word like "read" means, can you do that without visual understanding, for instance?

If you have, for instance, this sentence here, "Jane hit June and then she fell," or "and then she ran." Depending on which verb comes after "she," the meaning of "she" actually changes. And this is one subtask you might look at, so-called anaphora resolution, or coreference resolution in general, where you try to understand: who does "she" actually refer to?

And it really depends on the meaning, again, somewhat scare quotes here, of the verb that follows this pronoun. Similarly, there's a lot of ambiguity. So here we have a very simple sentence, four words, I made her duck. Now that simple sentence can actually have at least four different meanings, if you can think about it for a little bit, right?

You made her a duck that she loves, say for Christmas dinner. You made her duck, like I just did now, and so on. There are actually four different meanings. And knowing which one is meant requires, in some sense, situational awareness or knowledge to really disambiguate what is meant here. So that's sort of the high level of NLP.

Now, where does it actually become useful in terms of applications? Well, they actually range from very simple things that we kind of assume or are given now, we use them all the time every day, to more and more complex, and then also more in the realm of research. The simple ones are things like spell checking or keyword search and finding synonyms and a thesaurus.

Then the medium-difficulty ones are to extract information from websites, trying to extract product prices or dates and locations, people or company names, so-called named entity recognition. You can go a little bit above that and try to classify reading levels for school texts, for instance, or do sentiment analysis, which can be helpful if you have a lot of customer emails coming in and you want to prioritize the ones from customers who are really, really annoyed with you right now.

And then the really hard ones, and I think in some sense, the most interesting ones are machine translation, trying to actually be able to translate between all the different languages in the world, question answering, clearly something that is a very exciting and useful piece of technology, especially over very large, complex domains.

Can be used for automated email replies. I know pretty much everybody here would love to have some simple automated email reply system, and then spoken dialogue systems, bots are very hip right now. These are all sort of complex things that are still in the realm of research to do them really well.

We're making huge progress, especially with deep learning on these three, but they're still nowhere near human accuracy. So let's look at the representations. I mentioned we have morphology and words and syntax and semantics and so on. We can look at one example, namely machine translation, and look at how did people try to solve this problem of machine translation.

Well, it turns out they actually tried all these different levels with varying degrees of success. You can try to have a direct translation of words to other words. The problem is that is often a very tricky mapping. One word in English might map to three different words in German, and vice versa.

You can have three different words in English all meaning the same single word in German, for instance. So then people said, well, let's maybe try to do some syntactic transfer, where we have whole phrases, like "to kick the bucket," which maps to a completely different idiom in German. Okay, not a fun example.

And then semantic transfer might be, well, let's try to find a logical representation of the whole sentence, the actual meaning in some human understandable form, and then try to just find another surface representation of that. Now, of course, that will also get rid of a lot of the subtleties of language.

And so there are tricky problems in all these kinds of representations. Now, the question is, what does deep learning do? You've already seen at least two methods: standard neural networks before, and convolutional neural networks for vision. And in some sense, there's going to be a huge similarity here to these methods.

Because just like images, which are essentially a long list of numbers, a vector, the hidden state of a standard neural network is also just a vector, a list of numbers. That is also going to be the main representation that we will use throughout: for characters, for words, for short phrases, for sentences, and in some cases for entire documents.

They will all be vectors. And with that, we are sort of finishing up the whirlwind of what's NLP. Of course, you could give an entire lecture on almost every single slide I just gave, so we're very, very high level. But we'll continue at that speed to try to squeeze this complex deep learning for NLP subject area into an hour and a half.

I think these are the two most important basic Lego blocks that you nowadays want to know in order to be able to creatively play around with more complex models, and those are going to be word vectors and sequence models, namely recurrent neural networks. And I kind of split this into words, sentences, and multiple sentences.

But really, you could use recurrent neural networks for shorter phrases as well as multiple sentences. But in many cases, we'll see that they have some limitations as you move to longer and longer sequences and just use the default neural network sequence models. All right, so let's start with words.

And maybe one last blast from the past here: to represent the meaning of words, we actually used to use taxonomies like WordNet that define each word in relationship to lots of other ones. So you can, for instance, define hypernyms and "is-a" relationships. You might say the word panda, for instance, in its first meaning as a noun, basically goes through this complex hierarchy, this directed acyclic graph.

Most of it is roughly just a tree. And in the end, like everything, it is an entity, but it's actually a physical entity, a type of object. It's a whole object, it's a living thing, it's an organism, animal, and so on. So you basically can define a word like this.

And another way to look at it: at each node of this tree, you actually have so-called synsets, or synonym sets. And here's an example for the synonym set of the word good. Good can have a lot of different meanings; it can actually be an adjective, as well as an adverb, as well as a noun.

Now, what are the problems with this kind of discrete representation? Well, they can be great as a resource if you're a human and you want to find synonyms. But they're never going to be quite sufficient to capture all the nuances that we have in language. So for instance, the synonyms here for good were adept, expert, practiced, proficient, and skillful.

But of course, you would use these words in slightly different contexts. You would not use the word expert in exactly all the same contexts as you would use the meaning of good, or the word good. Likewise, it will be missing a lot of new words. Language is this interesting living organism, we change it all the time.

You might have some kids, they say YOLO, and all of a sudden, you need to update your dictionary. Likewise, maybe in Silicon Valley, you might see ninja a lot, and now you need to update your dictionary again. And that is basically going to be a Sisyphean job, right? Nobody will ever be able to really capture all the meanings in this living, breathing organism that language is.

So it's also very subjective. Some people might think ninja should just be deleted from the dictionary and say we don't want to include it, or just think nifty or badass is kind of a silly word that should not be included in a proper dictionary, but it's being used in real language and so on.

It requires human labor. As soon as you change your domain, you have to ask people to update it. And it's also hard to compute accurate word similarities. Some of these words are subtly different, and it's really a continuum in which we can measure their similarities. So instead, what we're going to use and what is also the first step for deep learning, we'll actually realize it's not quite deep learning in many cases.

But it is sort of the first step to use deep learning in NLP, is we will use distributional similarities. So what does that mean? Basically, the idea is that we'll use the neighbors of a word to represent that word itself. It's a pretty old concept. And here's an example, for instance, for the word banking.

We might actually represent banking in terms of all these other words that are around it. So let's do a very simple example where we look at a window around each word. And so here, the window length, that's just for simplicity, say it's one. We represent each word only with the words one to the left and one to the right of it.

We'll just use the symmetric context around each word. And here's a simple example corpus with three sentences. Of course, we would always want to use corpora with billions of words instead of just a couple, but just to give you an idea of what's being captured in these word vectors: I like deep learning, I like NLP, and I enjoy flying.

And now, this is a very simple so-called co-occurrence statistic. You simply see here that for "I," for instance, with a window size of one, the word "like" appears twice in its context and the word "enjoy" appears once. And for "like," you have "I" twice to its left, and "deep" once and "NLP" once.
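To make that concrete, here is a minimal sketch of building exactly this window-1 co-occurrence matrix; the variable names are my own, and in practice you would of course stream over a corpus with billions of tokens rather than three sentences:

```python
import numpy as np

# Toy corpus from the slide; in practice you would use billions of tokens.
corpus = [
    "I like deep learning".split(),
    "I like NLP".split(),
    "I enjoy flying".split(),
]

# Vocabulary in a fixed order so rows and columns are comparable.
vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Symmetric window of size 1: count the word immediately left and right.
window = 1
counts = np.zeros((len(vocab), len(vocab)), dtype=np.int64)
for sent in corpus:
    for pos, word in enumerate(sent):
        for offset in range(-window, window + 1):
            ctx = pos + offset
            if offset != 0 and 0 <= ctx < len(sent):
                counts[index[word], index[sent[ctx]]] += 1

print(vocab)
print(counts)  # e.g. the row for "I" has a 2 in the "like" column and a 1 in "enjoy"
```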

It turns out, if you just take those vectors, now this could be a vector representation, just each row could be a vector representation for words. Unfortunately, as soon as your vocabulary increases, that vector dimensionality would change. And hence, you'd have to retrain your whole model. It's also very sparse, and really, it's going to be somewhat noisy if you use that vector.

Now, another, better thing to do might be to run SVD or something similar like PCA dimensionality reduction on such a co-occurrence matrix. And that actually gives you a reasonable first approximation to word vectors. Very old method, works reasonably well. Now, what works even better than simple PCA is actually a model introduced by Tomas Mikolov in 2013 called word2vec.

So instead of capturing co-occurrence counts directly out of a matrix like that, you'll actually go through each window in a large corpus, take the word that's in the center of each window, and use it to predict the words around it. That way, you can train very quickly, you can train almost online, though few people do this, and add words to your vocabulary very quickly in this streaming fashion.

So now let's look a little bit at this model word2vec, because, one, it's a very simple NLP model, and two, it's very instructive. We won't go into too many details, but we'll at least look at a couple of equations. So again, the main goal is to predict the surrounding words in a window of some length m, a hyperparameter, around every word.

Now, the objective function will essentially try to maximize the log probability of any of these context words given the center word. So we go through our entire corpus, a very long sequence of length T, and at each time step t, we look at all the words within the window around the current word, and basically try to maximize the probability of predicting each word that is around the current word t. And theta is all the parameters, namely all the word vectors, that we want to optimize.

So now, how do we actually define this probability P here? The simplest way to do this, and this is not the actual way, but it's the simplest and first to understand and derive this model, is with this very simple inner product here, and that's why we can't quite call it deep.

There's not going to be many layers of nonlinearities like we see in deep neural networks, it's really just a simple inner product. And the higher that inner product is, the more likely these two will be predicting one another. So here, c is the center word and o is the outside, or context, word.

And basically, this inner product, the larger it is, the more likely we were going to predict this. And these are both just standard n-dimensional vectors. And now, in order to get a real probability, we'll essentially apply softmax to all the potential inner products that you might have in your vocabulary.
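Written out, in the standard word2vec notation where v denotes center-word vectors and u denotes outside-word vectors (the exact symbols on the slides may differ slightly), the objective and this softmax look roughly like this:

```latex
J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \;\sum_{\substack{-m \le j \le m \\ j \ne 0}} \log p\left(w_{t+j} \mid w_t\right),
\qquad
p(o \mid c) = \frac{\exp\left(u_o^{\top} v_c\right)}{\sum_{w=1}^{V} \exp\left(u_w^{\top} v_c\right)}
```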

And one thing you will notice here is, well, this denominator is actually going to be a very large sum, right? We'll want to sum here over all potential inner products for every single window, that would be too slow. So now, the real methods that we would use are going to approximate the sum in a variety of clever ways.

Now, I could literally talk the next hour and a half just about how to optimize the details of this equation, but then we'd all deplete our mental energy for the rest of the day. And so, I'm just going to point you to the class I taught earlier this year, CS224d, where we have lots of different slides that go into all the details of this equation, how to approximate it, and then how to optimize it.

It's going to be very similar to the way we optimize any other neural network. We're going to use stochastic gradient descent. We're going to look at mini-batches of a couple of hundred windows at a time, and then update those word vectors. And we're just going to take simple gradients of each of these vectors as we go through windows in a large corpus.

All right, now, we briefly mentioned PCA-like methods, often based on singular value decomposition, or standard simple PCA. Now, we also had this word2vec model. There's actually one model that combines the best of both worlds, namely GloVe, or global vectors, introduced by Jeffrey Pennington and collaborators in 2014. And it has a very similar idea, and you'll notice here, there's some similarity.

You have this inner product again for different pairs. But this model will actually go over the co-occurrence matrix. Once you have this co-occurrence matrix, it's much more efficient to try to predict once how often two words appear next to each other, rather than do it 50 times each time that pair appears in an actual corpus.

So in some sense, you can go through all the co-occurrence statistics more efficiently, and you're going to basically try to minimize this difference here. And what that basically means is that each inner product will try to approximate the log of how often these two words actually co-occur. Now, you have this function here, which essentially allows us to not overly weight certain pairs that occur very, very frequently.
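For reference, the GloVe objective as given in Pennington et al. (2014) has exactly this shape, with the inner product plus bias terms fitting the log co-occurrence count X_ij, down-weighted by the function f:

```latex
J = \sum_{i,j=1}^{V} f\left(X_{ij}\right)\left(w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2
```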

"The," for instance, co-occurs with lots of different words, and you want to basically lower the importance of all the words that co-occur with "the." So you can train this very fast. It scales to gigantic corpora. In fact, we trained this on Common Crawl, which is a really great data set of most of the internet.

It's many billions of tokens. And it also gets very good performance on small corpora because it makes very efficient use of these co-occurrence statistics. And that's essentially what word vectors are always capturing. So if there's one sentence you want to remember every time you hear "word vectors" in deep learning: one, they're not quite deep, even though we call them sort of step one of deep learning.

And two, they're really just capturing co-occurrence counts: how often does a word appear in the context of other words? So let's look at some interesting results of these GloVe vectors. Here, the first thing we do is look at nearest neighbors. So now we have these n-dimensional vectors; usually n is somewhere between 50 and at most 500, and good general numbers are 100 or 200 dimensions.

Each word is now represented as a single vector. And so we can look in this vector space for words that appear close by. We started and looked for the nearest neighbors of frog. And well, it turned out these are the nearest neighbors, which was a little confusing since we're not biologists.

But fortunately, when you actually look up in Google what those mean, you'll see that they are actually all indeed different kinds of frogs. Some appear very rarely in the corpus and others like toad are much more frequent. Now, one of the most exciting results that came out of word vectors are actually these word analogies.

So the idea here is: can there be relationships between different word vectors that simply fall out of very simple, linear addition and subtraction? The question is, man is to woman as king is to what? As in, what is the right analogy when I try to fill in the last missing word here?

Now, the way we're going to do this is with very simple cosine similarity. We basically take, let's take an example here, the vector of woman, we subtract the word vector we learned for man, and we add the word vector of king. And the word whose vector is closest to the result, the argmax of this cosine similarity, turns out to be queen for a lot of these different models.
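As a sketch, and assuming you already have a dictionary of pretrained word vectors loaded (the `word_vectors` argument below is a hypothetical name), the analogy search is just this argmax over cosine similarities:

```python
import numpy as np

def analogy(a, b, c, word_vectors, exclude=True):
    """Return the word d maximizing cos(d, b - a + c), e.g. analogy('man', 'woman', 'king') -> 'queen'.

    word_vectors is assumed to be a dict mapping words to 1-D NumPy arrays
    (for example, loaded from pretrained GloVe vectors)."""
    target = word_vectors[b] - word_vectors[a] + word_vectors[c]
    target /= np.linalg.norm(target)
    best_word, best_sim = None, -np.inf
    for word, vec in word_vectors.items():
        if exclude and word in (a, b, c):  # the query words themselves are usually skipped
            continue
        sim = vec @ target / np.linalg.norm(vec)  # cosine similarity
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word
```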

And that was very surprising. Again, we're capturing co-occurrence statistics. So man might, in its context, often have things like running and fighting and other silly things that men do. And then you subtract those kinds of words from the context and you add them again. And in some sense, it's intuitive, though surprising that it works out that well for so many different examples.

So here are some other examples, similar to the king and queen example, where we basically took these 200-dimensional vectors and we projected them down to two dimensions. Again, with a very simple method like PCA. And what we find, quite interestingly, is that even in just the first two principal components of this space, we have some very interesting female-male relationships.

So man to woman is similar to uncle and aunt, brother and sister, sir and madam, and so on. So this is an interesting semantic relationship that falls out of essentially co-occurrence counts in specific windows around each word in a large corpus. Here's another one that's more of a syntactic relationship.

We actually have here superlatives, like slow, slower, and slowest is in a similar vector relationship to short, shorter, and shortest, or strong, stronger, and strongest. So this was very exciting, and of course, when you see an interesting qualitative result, you want to try to quantify who can do better in trying to understand these analogies and what are the different modes and hyperparameters that modify the performance.

Now, this is something that you will notice in pretty much every deep learning project ever, which is more data will give you better performance. It's probably the single most useful thing you can do to a machine learning or deep learning system is to train it with more data, and we found that too.

Now, there are different vector sizes too, which is a common hyperparameter. Like I said, usually between 50 and at most 500. Here, 300 dimensions essentially gave us the best performance for these different kinds of semantic and syntactic relationships. Now, in many ways, having a single vector for words can be oversimplifying, right?

Some words have multiple meanings, maybe they should have multiple vectors. Sometimes the word meaning changes over time, and so on. So there's a lot of simplifying assumptions here, but again, our final goal for deep NLP is going to be to create useful systems. And it turns out this is a useful first step to create such systems that mimic some human language behavior in order to create useful applications for us.

All right, but words, word vectors are very useful, but words of course never appear in isolation. And what we really want to do is understand words in their context. And so this leads us to the second section here on recurrent neural networks. So we already went over the basic definition of standard neural networks.

Really, the main difference between a standard neural network and a recurrent neural network, which I'll abbreviate as RNN now, is that we will tie the weights at each time step. And that will allow us to essentially condition the neural network on all the previous words, in theory. In practice, given how we can optimize it, it won't really be all the previous words.

It will be more like at most the last 30 words, but in theory, this is what a powerful model can do. So let's look at the definition of a recurrent neural network. And this is going to be a very important definition, so we'll go into a little bit of detail here.

So let's assume for now we have our word vectors as given, and we'll represent each sequence in the beginning as just a list of these word vectors. Now what we're going to do is we're computing a hidden state, ht, at each time step, and the way we're going to do this is with a simple neural network architecture.

In fact, you can think of this summation here as really just a single-layer neural network, if you were to concatenate the two matrices and these two vectors. But intuitively, we basically map our current word vector at that time step t; sometimes I use these square brackets to denote that we're taking the word vector from that time step.

We map that with a linear layer, a simple matrix-vector product, and we add that matrix-vector product to another matrix-vector product with the hidden state from the previous time step. We sum those two, and we apply, in one case, a simple sigmoid function to define this standard neural network layer.

That will be ht, and now at each time step we want to predict some kind of class, probability over a set of potential events, classes, words, and so on. And we use the standard softmax classifier, some other communities call it the logistic regression classifier. So here we have a simple matrix, Ws for the softmax weights.

It has basically as many rows as we have classes, and the number of columns is the same as the hidden dimension. Sometimes we want to predict the next word in a sequence in order to be able to identify the most likely sequence.

So for instance, if I ask a speech recognition system, "what is the price of wood?" Now, in isolation, if you hear "wood," you would probably assume it's W-O-U-L-D, the auxiliary verb "would," but in this particular context, "the price of," it wouldn't make sense to have that verb following.

And so it's more likely the W-O-O-D wood, the price of wood. So language modeling is a very useful task, and it's also very instructive to use as an example of where recurrent neural networks really shine. So in our case here, this softmax is going to be quite a large matrix that goes over the entire vocabulary of all the possible words that we have.

So each word is going to be our class; the classes for language models are the words in our vocabulary. And so we can define here this y hat at time t, whose jth entry denotes the probability that the jth word in the vocabulary will come next after all the previous words.
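Collecting the pieces just described, the recurrence and the softmax are, in the notation used in the class:

```latex
h_t = \sigma\left(W^{(hh)} h_{t-1} + W^{(hx)} x_{[t]}\right), \qquad
\hat{y}_t = \mathrm{softmax}\left(W^{(S)} h_t\right), \qquad
\hat{P}\left(x_{t+1} = v_j \mid x_1,\dots,x_t\right) = \hat{y}_{t,j}
```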

It's a very useful model, again, for speech recognition, for machine translation, for just finding a prior for language in general. All right, again, the main difference to standard neural networks is that we have the same set of weights at all the different time steps. Everything else is pretty much a standard neural network.

We often initialize the first hidden state h0 here either randomly or with all zeros. And again, in language modeling in particular, the next word is our class for the softmax. Now, we can measure the performance of language models in terms of so-called perplexity, which is essentially based on the average log likelihood of being able to predict the next word.
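Concretely, over a held-out sequence of T words, perplexity is usually computed as:

```latex
\mathrm{Perplexity} = \exp\left(-\frac{1}{T}\sum_{t=1}^{T} \log \hat{P}\left(w_t \mid w_1,\dots,w_{t-1}\right)\right)
```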

So you want to really give the highest probability to the word that actually will appear next in a long sequence. And then the higher that probability is, the lower your perplexity, and hence the model is less perplexed to see the next word. In some sense, you can think of language modeling as almost NLP complete, in some silly sense that if you can actually predict every single word that follows after any arbitrary sequence of words in a perfect way, you would have disambiguated a lot of things.

You can say, for instance, what is the answer to the following question? Ask the question, and then the next couple of words would be the predicted answer. So there's no way we can actually ever do a perfect job in language modeling. But there's certain contexts where we can give a very high probability to the right next couple of words.

Now, this is the standard recurrent neural network. And one problem with this is that we will modify the hidden state here at every time step. So even if I have words like the, and a, and a sentence period, and things like that, it will significantly modify my hidden state.

Now, that can be problematic. Let's say, for instance, I want to train a sentiment analysis algorithm. And I talk about movies, and I talk about the plot for a very long time. Then I say, man, this movie was really wonderful. It's great to watch. And then especially the ending, and you talk again for like 50 time steps, or 50 words, or 100 words about the plot.

Now, all these plot words will essentially modify my hidden state. So if at the end of that whole sequence I want to classify the sentiment, the word wonderful and great that I mentioned somewhere in the middle might be completely gone. Because I keep updating my hidden state with all these content words that talk about the plot.

Now, the way to improve this is by using better kinds of recurrent units. And I'll introduce here a particular kind, so-called gated recurrent units, introduced by Cho et al. We'll learn more about the LSTM tomorrow when Quoc gives his lecture, but GRUs are in some sense a special case of LSTMs.

The main idea is that we want to have the ability to keep certain memories around without having the current input modify them at all. So again, this example of sentiment analysis. I say something's great, that should somehow be captured in my hidden state. And I don't want all the content words that talk about the plot in the movie review to modify that it's actually overall was a great movie.

And then we also want to allow error messages to flow at different strengths depending on the input. So if I say, great, I want that to modify a lot of things in the past. So let's define a GRU. Fortunately, since you already know the basic Lego block of a standard neural network, there's only really one or two subtleties here that are different.

There are a couple of different steps that we'll need to compute at every time step. So in the standard RNN, what we did was just have this one single neural network that we hope would capture all this complexity of the sequence. Instead now, we'll first compute a couple of gates at that time step.

So the first thing we'll compute is the so-called update gate. It's just yet another neural network layer based on the current input word vector and again the past hidden state. So these look quite familiar, but this will just be an intermediate value and we'll call it the update gate.

Then we'll also compute a reset gate, which is yet another standard neural network layer. Again, just a matrix-vector product summed with another matrix-vector product, and some kind of nonlinearity here, namely a sigmoid. It's actually important in this case that it is a sigmoid: basically, both of these will be vectors with numbers that are between 0 and 1.

Now, we'll compute a new memory content, an intermediate h-tilde here, with yet another neural network, but then we have this little funky symbol in here. Basically, this will be an element-wise multiplication. So basically, what this will allow us to do is if that reset gate is 0, we can essentially ignore all the previous memory elements and only store the new word information.

So for instance, if I talked for a long time about the plot, now I say this was an awesome movie. Now you want to basically be able to ignore if your whole goal of this sequence classification model is to capture sentiment, you want to be able to ignore past content.

This is, of course, if this was entirely a zero vector. Now, this will be more subtle. This is a long vector of maybe 100 or 200 dimensions, so maybe some dimensions should be reset, but others maybe not. And then here we'll have our final memory, and it essentially combines these two states, the previous hidden state and this intermediate one at our current time step.

And what this will allow us to do is essentially also say, well, maybe you want to ignore everything that's currently happening and only update the last time step. We basically copy over the previous time step and the hidden state of that and ignore the current thing. Again, simple example, in sentiment, maybe there's a lot of talk about the plot when the movie was released.

You want to basically have the ability to ignore that and just copy that in the beginning and may have said, it was an awesome movie. So here's an attempt at a clean illustration. I have to say, personally, I, in the end, find the equations a little more intuitive than the visualizations that we tried to do, but some people are more visual here.

So this is, in some ways, basically here we have our word vector and it goes through different layers. And then some of these layers will essentially modify other outputs of previous time steps. So this is a pretty nifty model and it's really the second most important basic Lego block that we're going to learn about today.

And so just want to make sure we take a little bit of time, I'll repeat this here. Again, if the reset gate, this R value, is close to zero, those kinds of hidden dimensions are basically allowed to be dropped. And if the update gate Z basically is one, then we can copy information of that unit through many, many different time steps.

And if you think about optimization a lot, what this will also mean is that the gradient can flow through the recurrent neural network through multiple time steps until it actually matters and you want to update a specific word, for instance, and go all the way through many different time steps.

So then what this also allows us to do is to actually have some units that have different update frequencies. Some you might want to reset every other word; other ones you might really want to keep, because they carry some long-term context and they stay around for much longer. All right, this is the GRU.
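As a compact summary in code, here is a minimal NumPy sketch of one GRU step following the equations just described; the parameter names are my own, bias terms are left out for brevity, and the sign convention follows the slides, where an update gate near one copies the previous state through:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU time step. params is assumed to be a dict of NumPy weight matrices:
    Wz, Uz (update gate), Wr, Ur (reset gate), W, U (candidate memory)."""
    z = sigmoid(params["Wz"] @ x_t + params["Uz"] @ h_prev)            # update gate in [0, 1]
    r = sigmoid(params["Wr"] @ x_t + params["Ur"] @ h_prev)            # reset gate in [0, 1]
    h_tilde = np.tanh(params["W"] @ x_t + params["U"] @ (r * h_prev))  # candidate memory
    # z close to 1 copies the old state through; z close to 0 takes the new content.
    return z * h_prev + (1.0 - z) * h_tilde
```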

It's the second most important building block for today. There are, like I said, a lot of other variants of recurrent neural networks. There's lots of amazing work in that space right now, and tomorrow Quoc will talk a lot about some more advanced methods. So now that you understand word vectors and neural network sequence models, you really have the two most important concepts for deep NLP.

And that's pretty awesome, so congrats. We can now, in some ways, really play around with those two Lego blocks, plus some slight modifications of them, very creatively, and build a lot of really cool models. A lot of the models that I'll show you, and that you can see in the latest papers that are now coming out almost every week on arXiv, will use really these two components in a major way.

Now, this is one of the few slides now with something really, really new, because I want to keep it exciting for the people who already knew all this stuff and took the class and everything. This is tackling an important problem, which is, in all these models that you'll see in pretty much most of these papers, we have in the end one final softmax here, right?

And that softmax is basically our default way of classifying what we can see next, what kinds of classes we can predict. The problem with that is, of course, that that will only ever predict accurately frequently seen classes that we had at training time. But in the case of language modeling, for instance, where our classes are the words, we may see at test time some completely new words.

Maybe I'm just going to introduce to you a new name, Srini, for instance, and nobody may have seen that word at training time. But now that I mentioned him, and I will introduce him to you, you should be able to predict the word Srini and that person in a new context.

And so the solution that we're literally going to release only next week in the new paper is to essentially combine the standard softmax that we can train with a pointer component. And that pointer component will allow us to point to previous contexts and then predict based on that to see that word.

So let's, for instance, take the example here of language modeling again. We may read a long article about the Fed Chair, Janet Yellen. And maybe the word Yellen had not appeared at training time before, so we could never predict it, even though we just learned about it. And now, a couple of sentences later, interest rates were raised, and then "Mrs.", and now we want to predict that next word.

Now, if that hadn't appeared in our softmax standard training procedure at training time, we would never be able to predict it. What this model will do, and we're calling it a pointer sentinel mixture model, is it will essentially first try to see would any of these previous words maybe be the right candidate.

So we can really take into consideration the previous context of, say, the last 100 words. And if we see that word and that word makes sense after we train it, of course, then we might give a lot of probability mass to just that word at this current position in our previous immediate context at test time.

And then we have also the sentinel, which is basically going to be the rest of the probability if we cannot refer to some of the words that we just saw. And that one will go directly to our standard softmax. And then what we'll essentially have is a mixture model that allows us to say either we have or we have a combination of both of essentially words that just appeared in this context and words that we saw in our standard softmax language modeling system.
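Schematically, and glossing over how the gate is computed, the mixture just described has this form, where g is the probability mass the sentinel routes to the standard vocabulary softmax and p_ptr is the attention distribution over words in the recent context:

```latex
p\left(w \mid \text{context}\right) = g\, p_{\mathrm{vocab}}(w) + (1-g)\, p_{\mathrm{ptr}}(w), \qquad 0 \le g \le 1
```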

So I think this is a pretty important next step because it will allow us to predict things we've never seen at training time. And that's something that's clearly a human capability that most, or pretty much none of these language models had before. And so to look at how much it actually helps, it'll be interesting to look at some of the performance before.

So again, what we're measuring here is perplexity. And the lower the better, because it's essentially the inverse here of the actual probability that we assign to the correct next word. And in 2010, so six years ago, there was some great, early work by Tomas Mikolov, where he compared to a lot of standard natural language processing methods, syntactic models that essentially tried to predict the next word and had a perplexity of 107.

And he was able to use the standard recurrent neural networks, and actually an ensemble of eight of them, to really significantly push down the perplexity, especially when you combine it with standard count-based methods for language modeling. So in 2010, he made great progress by pushing it down to 87.

And now this is one of the great examples of how much progress is being made in the field thanks to deep learning, where two years ago, Wojciech Zaremba and his collaborators were able to push that down even further to 78 with a very large LSTM, similar to a GRU-like model, but even more advanced.

Quoc will teach you the basics of LSTMs tomorrow. Then last year, the performance was pushed down even further by Yarin Gal. And then this one actually came out just a couple weeks ago, variational recurrent highway networks, which pushed it down even further. But this pointer sentinel model is able to get it down to 70.

So in just a short amount of time, we pushed it down by more than 10 perplexity points in two years. And that is really an increased speed in performance that we're seeing now, that deep learning is changing a lot of areas of natural language processing. All right, now we have our basic Lego blocks, the word vectors and the GRU sequence models.

And now we can talk a little bit about some of the ongoing research that we're working on. And I'll start that with maybe a controversial question, which is, could we possibly reduce all NLP tasks to essentially question answering tasks over some kind of input? And in some ways, that's a trivial observation that you could do that, but it actually might help us to think of models that could take any kind of input, a question about that input, and try to produce an output sequence.

So let me give you a couple of examples of what I mean by this. So here we have, the first one is a task that we would standardly associate with question answering. I'll give you a couple of facts. Mary walked to the bathroom, Sandra went to the garden, Daniel went back to the garden, Sandra took the milk there, where's the milk?

And now you might have to logically reason, try to find the sentence about milk. Maybe "Sandra took the milk there." Now you might have to do anaphora resolution, find out what "there" refers to, and then you try to find the previous sentence that mentions Sandra, see that it's the garden, and then give the answer: garden.

So this is a simple logical reasoning question answering task. And that's what most people in the QA field sort of associated with some kinds of question answers. But we can also say, everybody's happy and the question is, what's the sentiment? And the answer is positive. All right, so this is a different subfield of NLP that tackles sentiment analysis.

We can go further and ask, what are the named entities of a sentence like, Jane has a baby in Dresden, and you want to find out that Jane is a person and Dresden is a location, and this is an example of sequence tagging. You can even go as far and say, I think this model is incredible, and the question is, what's the translation into French?

And you get the French sentence out as the answer. And that, in some ways, would be phenomenal, if we're able to actually tackle all these different kinds of tasks with the same kind of model. So maybe it would be an interesting new goal for NLP to try to develop a single joint model for general question answering.

I think it would push us to think about new kinds of sequence models and new kinds of reasoning capabilities in an interesting way. Now, there are two major obstacles to actually achieving the single joint model for arbitrary QA tasks. The first one is that we don't even have a single model architecture that gets consistent state of the art results across a variety of different tasks.

So for instance, for question answering, there is a dataset called bAbI that Facebook published last year, where strongly supervised memory networks get the state of the art. For sentiment analysis, you had tree-LSTM models developed by Kai Sheng Tai here at Stanford last year. And for part-of-speech tagging, you might have bidirectional LSTM conditional random fields.

One thing you do notice is all the current state of the art methods are deep learning. Sometimes they still connect to other traditional methods like conditional random fields and undirected graphical models. But there's always some kind of deep learning component in them. So that is the first obstacle. The second one is that really fully joint multi-task learning is very, very hard.

Usually when we do do it, we restrict it to lower layers. So for instance, in natural language processing, all we're currently able to share in some principled way are word vectors. We take the same word vectors we train, for instance, with GloVe or Word2Vec, and we initialize our deep neural network sequence models with those word vectors.

In computer vision, we're actually a little further ahead, and you're able to use multiple of the different layers. And you initialize a lot of your CNN models with a first pre-trained CNN that was pre-trained on ImageNet, for instance. Now, usually people evaluate multi-task learning with only two tasks. They train on a first task, and then they evaluate the model that they initialized from the first on the second task, but they often ignore how much the performance degrades on the original task.

So when somebody takes an ImageNet CNN and applies it to a new problem, they rarely ever go back and say how much did my accuracy actually decrease on the original data set? And furthermore, we usually only look at tasks that are actually related, and then we find, look, there's some amazing transfer learning capability going on.

What we don't often look at in the literature and in most people's work is that when the tasks aren't related to one another, they actually hurt each other. And this is so-called catastrophic forgetting. There's not too much work around that right now. Now, I also would like to say that right now, almost nobody uses the exact same decoder or classifier for a variety of different kinds of outputs, right?

We at least replace the softmax to try to predict different kinds of problems. All right, so this is the second obstacle now. For now, we'll only tackle the first obstacle, and this is basically what motivated us to come up with dynamic memory networks. They're essentially an architecture to try to tackle arbitrary question answering tasks.

When I'll talk about dynamic memory networks, it's important to note here that for each of the different tasks I'll talk about, it'll be a different dynamic memory network. It won't have the exact same weights. It'll just be the same general architecture. So the high level idea for DMNs is as follows.

Imagine you had to read a bunch of facts like these here. They're all very simple in and of themselves. But if I showed you these and now asked you a question, like, where's Sandra? It'd be kind of hard to remember, even if you had read all of them.

And so the idea here is that for complex questions, we might actually want to allow you to have multiple glances at the input. And just like I promised, one of our most important basic Lego blocks will be this GRU we just introduced in the previous section. Now, here's this whole model in all its gory details.

And we'll dive into all of that in the next couple of slides, so don't worry. It's a big model. A couple of observations. So the first one is, I think we're moving in deep learning now to try to use more proper software engineering principles. Basically to modularize, encapsulate certain capabilities, and then take those as basic Lego blocks and build more complex models on top of them.

A lot of times nowadays you just have a CNN, that's like one little block in a complex paper, and then other things happen on top. Here we'll have the GRU or word vectors basically as one module, a sub-module in these different ones here. And I'm not even mentioning word vectors anymore, but word vectors still play a crucial role.

And each of these words is essentially represented as this word vector, but we just kind of assume that it's there. Okay, so let's walk on a very high level through this model. There are essentially four different modules. There's the input module, which will be a neural network sequence model, a GRU.

There's a question module, an episodic memory module, and an answering module. And sometimes we also have these semantic memory modules here, but for now these are really just our word vectors, and we'll ignore that for now. So let's go through this. Here is our corpus, and our question is, where is the football?

And this is our input that should allow us to answer this question. Now if I ask this question, I will essentially use the final representation of this question to learn to pay attention to the right kinds of inputs that seem relevant for given what I know to answer this question.

So where's the football? Well, it would make sense to basically pay attention to all the sentences that mention football, and maybe especially the last ones if the football moves around a lot. So what we'll observe here is that this last sentence will get a lot of attention. So John put down the football.

And now what we'll basically do is give this hidden state of this recurrent neural network model as input to another recurrent neural network, because it seemed relevant to answer the current question at hand. We basically agglomerate all these different facts that seem relevant, in this other GRU, into this final vector m.

And now this vector m together with the question will be used to go over the inputs again if the model deems that it doesn't have enough information yet to answer the question. So if I ask you where's the football and it so far only found that John put down the football, you don't know enough.

You still don't know where it is, but you now have a new fact, namely John seems relevant to answer the question. And that fact is now represented in this vector m, which is also just the last hidden state of another recurrent neural network. Now we'll go over the inputs again.

Now that we know that John and the football are relevant, we'll learn to pay attention to John move to the bedroom. And John went to the hallway. Again, those are going to get agglomerated here in this recurrent neural network. And now the model thinks that it actually knows enough because it basically intrinsically captured things about the football.

John found a location and so on. Of course, we didn't have to tell it anything about people or locations, or that if x moves to y and y is in the set of locations, then this happens; none of that. You just give it a lot of stories like that, and in its hidden states it will capture these kinds of patterns.

So then we have the final vector m and we'll give that to an answer module, which produces in our standard softmax way the answer. All right, now let's zoom into the different modules of this overall dynamic memory network architecture. The input, fortunately, is just a standard GRU, the way we defined it before.

So simple word vectors, hidden states, reset gates, update gates, and so on. The question module is also just a GRU, a separate one with its own weights. And the final vector q here is just going to be the last hidden state of that recurrent neural network sequence model. Now, the interesting stuff happens in the episodic memory module, which is essentially a sort of meta-gated GRU, where the gate is defined and computed by an attention mechanism.

And it will basically say that this current sentence, s_i here, seems to matter. And the superscript t is the episode that we're in. So each episode basically means we're going over the entire input one time. So it starts at g^1 here. And what this basically allows us to do is to say, well, if g is 0, then we'll basically just copy over the past state from the input.

Nothing will happen. And unlike before in all these GRU equations, this G is just a single scalar number. It will basically say, if G is 0, then this sentence is completely irrelevant to my current question at hand. I can completely skip it, right? And there are lots of examples, like married travel to the hallway, that are just completely irrelevant to answering the current question.

In those cases, this G will be 0, and we're just copying the previous hidden state of this recurrent neural network over. Otherwise, we'll have a standard GRU model. So now, of course, the big question is, how do we compute this G? And this might look a little ugly, but it's quite simple.

Basically, we're going to compute two kinds of vector similarities, a multiplicative one and an additive one with absolute values of element-wise differences, between the sentence vector that we currently have, the question vector, and the memory state from the previous pass over the input. On the first pass over the input, the memory state is initialized to be just the question, and afterwards it agglomerates relevant facts.

So intuitively here, if the sentence mentions John, for instance, or mentions football, and the question is, where's the football, then you'd hope that the question vector q has some units that are more active because football was mentioned, and the sentence vector also has some units that are more active because football is mentioned.

And hence, some of these inner products or absolute values of subtractions are going to be large. And then what we're going to do is just plug that into a standard, through standard single layer neural network, and then a standard linear layer here, and then we apply a softmax to essentially weight all of these different potential sentences that we might have to compute the final gate.

So this will basically be a soft attention mechanism that sums to one and will pay most attention to the facts that seem most relevant, given what I know so far in the question. Then when the end of the input is reached, all these relevant facts here are summarized in another GRU that basically moves up here.

And you can train a classifier also, if you have the right kind of supervision, to basically train that the model knows enough to actually answer the question and stop iterating over the inputs. If you don't have that kind of supervision, you can also just say, I will go over the inputs a fixed number of times, and that works reasonably well too.
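To make the gating concrete, here is a simplified sketch of the attention gates and the gated update over one pass of the input. It follows the description above rather than the exact equations in the paper, and all parameter names (W1, b1, W2, b2) as well as the reuse of the earlier gru_step function are my own assumptions:

```python
import numpy as np

def attention_gates(sentences, q, m_prev, params):
    """Scalar attention gates g_i for each sentence vector s_i, given the question q
    and the previous episode's memory m_prev (a simplified version of the DMN gating;
    W1, b1, W2, b2 are hypothetical parameter names)."""
    scores = []
    for s in sentences:
        # Multiplicative and additive (absolute-difference) similarity features.
        z = np.concatenate([s * q, s * m_prev, np.abs(s - q), np.abs(s - m_prev)])
        hidden = np.tanh(params["W1"] @ z + params["b1"])    # single hidden layer
        scores.append(params["W2"] @ hidden + params["b2"])  # scalar relevance score
    scores = np.array(scores).ravel()
    e = np.exp(scores - scores.max())
    return e / e.sum()                                       # softmax over sentences

def episode(sentences, gates, gru_step, gru_params, hidden_size):
    """One pass over the input: gate the GRU update so irrelevant sentences are skipped."""
    h = np.zeros(hidden_size)
    for s, g in zip(sentences, gates):
        # g near 0 copies the previous state (sentence ignored); g near 1 applies the GRU.
        h = g * gru_step(s, h, gru_params) + (1.0 - g) * h
    return h  # summary of the facts attended to in this pass
```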

All right, there's a lot to sink in, so I'll give you a couple seconds. Basically, we pay attention to different facts given a certain question. We iterate over the input multiple times, and we agglomerate the facts that seem relevant given the current knowledge and the question. Now, I don't usually talk about neuroscience.

I'm not a neuroscientist, but there is a very interesting relationship here that a friend of mine, Sam Gershman, pointed out, which is that the episodic memory, in general for humans, is actually the memory of autobiographical events. So it's the time when we remember the first time we went to school or something like that.

And it's essentially a collection of our past personal experiences that occurred at a particular time in a particular place. And just like our episodic memory that can be triggered with a variety of different inputs, this episodic memory is also triggered with a specific question at hand. And what's also interesting is the hippocampus, which is the seat of the episodic memory in humans, is actually active during transitive inference.

So transitive inference is going from A to B to C to have some connection from A to C. Or in this case here, with this football, for instance, you first had to find facts about John and the football, and then finding where John was, and then finding the location of John.

So those are examples of transitive inference. And it turns out that you also need, in the DMN, these multiple passes to enable the capability to do transitive inference. Now, the final module, again, is a very simple GRU and softmax to produce the final answers. The main difference here is that instead of just having the previous hidden state from time t minus 1 as input, we'll also include the question at every time step.

And we will include the answer that was generated at the previous time step. But other than that, it's our standard softmax. We use standard cross-entropy errors to minimize it. And now, the beautiful thing of this whole model is that it's end-to-end trainable. These four different modules will actually all train, based on the cross-entropy error of that final softmax.

All these different modules communicate with vectors, and we'll just have delta messages and backpropagation to train them. Now, there's been a lot of work in the last two years on models like this. In fact, Quoc will cover a lot of these really interesting models tomorrow, different types of memory structures, and so on.

And the dynamic memory network is, in some sense, one of those models. One particular model makes for a proper comparison, because there are a lot of similarities: memory networks from Jason Weston. Those basically also have input, scoring, attention, and response mechanisms. The main difference is that they use different kinds of basic Lego blocks for these different mechanisms.

For input, they use bag-of-words representations, or nonlinear and linear embeddings. For the attention and responses, they have different kinds of iteratively run functions. The main interesting sort of difference to the DMN is that the DMN really uses recurrent neural network type sequence models for all of these different modules and capabilities.

And in some sense, that helps us have a broader range of applications that includes things like sequence tagging. So let me go over a couple of results and experiments with this model. The first one is on this bAbI dataset that Facebook published. It basically has a lot of these kinds of simple, logical reasoning type questions.

In fact, all of these, like where's the football, were examples from the Facebook bAbI dataset. And it also includes things like yes/no questions, simple counting, negation, some indefinite knowledge where the answer might be maybe, basic coreference, where you have to figure out who she or he refers to, and reasoning over time.

If this happened before that, and so on. And basically, this dynamic memory network, I think, is currently the state of the art on this simple logical reasoning dataset. Now, the problem with this dataset is that it's synthetic: it was created by a certain set of human-defined generative functions that produce certain patterns.

And in that sense, solving it, sometimes with 100% accuracy, is only a necessary and not a sufficient condition for real question answering. So there's still a lot of complexity left. The main interesting bit to point out here is that there are different numbers of training examples for each of these different subtasks.

And so you have basically 1,000 examples of simple negation, for instance. It's always a similar kind of pattern, and hence you're able to classify it very well. In real language, you will never have that many examples of each type of pattern you want to learn. So general question answering is still an open problem and non-trivial.

Now, what's cool is that this same architecture of allowing the model to go over the inputs multiple times also got state-of-the-art results in sentiment analysis, a very different kind of task. And we actually analyzed whether it's really helpful to have multiple passes over the input, and it turns out it is.

There are certain things, like reasoning over three facts or counting, where you really need this episodic memory module, and it goes over the input maybe five times. For sentiment, it actually turns out that going over the input more than two times hurts. And that's one of the things we're now working on: can we find a single model that does the same thing for every single input, with the same weights, and still learn these different tasks?

We can actually look at a couple of fun examples of what this model does with tough sentiment sentences. Generally, to be honest, with sentiment you can probably get to 75% accuracy with some very simple models. You basically just find words like great and wonderful and awesome, and you'll get something that's roughly right.

Here are the kinds of examples that you now need to get right to really push the state of the art further in sentiment analysis. So here, the sentence is: in its ragged, cheap, and unassuming way, the movie works. This sentence is classified incorrectly if you have this whole DMN architecture but only allow one pass over the input.

Once you have two passes over the input, it actually learns to pay attention not just to these very strong adjectives, but in the end, to the movie working. The shaded fields here are essentially the gating function g that we defined, which pays attention to specific words. The darker the shading, the larger and more open that gate is, and the more that word affects the hidden state in the episodic memory module.

So it goes over the input the first time, pays attention to cheap and unassuming and way and a little bit of works too. But the second time, it basically figured out, it agglomerated sort of the facts of that sentence, and then learned to pay attention more to specific words that seem more important.
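
If you want to reproduce this kind of figure yourself, a crude text-only rendering of per-word gates is enough; the gate values below are made up purely for illustration of what a second pass might look like on this sentence.

```python
def show_gates(words, gates, width=20):
    """Crude text rendering of per-word attention: darker shading in the
    talk's figures corresponds to a longer bar here."""
    for w, g in zip(words, gates):
        bar = "#" * int(round(g * width))
        print(f"{w:>12s} {bar} {g:.2f}")

# Hypothetical gate values for the example sentence from the talk:
words = "in its ragged , cheap and unassuming way , the movie works".split()
pass2 = [0.02, 0.01, 0.05, 0.0, 0.08, 0.01, 0.07, 0.03, 0.0, 0.02, 0.21, 0.50]
show_gates(words, pass2)
```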

Just one more example here: my response to the film is best described as lukewarm. In general, in sentiment analysis, when you look at unigram scores, the word best is basically one of the most positive words you could possibly use in a sentence. And the first time the model passes over the sentence, it also pays most attention to this incredibly positive word, namely best.

But then, once it has agglomerated the context, it actually realizes that best here is not used as an adjective; it's an adverb that best describes something, and what it describes is lukewarm, and hence it's actually a negative sentence. So those are the kinds of examples that you need to get right now to appreciate improvements in sentiment analysis. On this particular dataset, the neural network type models started at around 82% accuracy.

Before that, the same dataset had existed for around eight years, and none of the standard NLP models had reached above 80% accuracy. Now we're basically in the high 80s. And those are the kinds of improvements that you see across a variety of different NLP tasks now that deep learning techniques are being used in NLP.

And now, the last task in NLP that this model turned out to also work incredibly well on is part of speech tagging. Now, part of speech tagging is less exciting of a task. It's more of an intermediate task. But it's still fascinating to see that after this data set has been around for over 20 years, you can still improve the state of the art with the same kind of architecture that also did well on fuzzy reasoning of sentiment and discrete logical reasoning for question answering.

Now, we had a new person join the group, Caiming. And he thought, well, that's cool, but he was more of a computer vision researcher. So he thought, could I use this question answering module to do visual question answering, to combine some of what was going on in the group in NLP and apply it to computer vision?

And he did not have to know all of the different aspects of the code. All he had to do was change the input module from one that gives you hidden states at each word over a long sequence of words and sentences to an input module that would give him vectors for sequences of regions in an image.

And he literally did not touch some of the other parts of the code. He only had to look carefully at this input module, where, again, our basic Lego block is the convolutional neural network that Andrej introduced really well. The convolutional neural network essentially gives us 14 by 14 many vectors in one of its top layers, one representing each region of the image.

And then what we'll do is basically take those vectors and now replace the word vectors we used to have with CNN vectors, and then plug them into a GRU. Now again, the GRU, we know as our basic Lego block, we already defined it. One addition here is that it'll actually be a bidirectional GRU.

We'll go once from left to right in this snake-like fashion, and another one goes from right to left backwards. Now both of these will basically have a hidden state, and you can just concatenate the hidden states of both of these to compute the final hidden state for each block of the image.
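
Here is a minimal sketch of that visual input module, assuming the 14 by 14 CNN feature map has been flattened into 196 region vectors in row-major order; the snake-like traversal and the hypothetical `gru_fwd`/`gru_bwd` callables follow the description above, not any particular codebase.

```python
import numpy as np

def snake_order(grid_h, grid_w):
    """Indices that traverse a grid_h x grid_w grid of image regions in the
    snake-like fashion described in the talk: left to right on one row,
    right to left on the next, and so on."""
    order = []
    for r in range(grid_h):
        cols = range(grid_w) if r % 2 == 0 else reversed(range(grid_w))
        order.extend(r * grid_w + c for c in cols)
    return order

def visual_facts(region_vecs, gru_fwd, gru_bwd, hidden_dim):
    """Replace word vectors with CNN region vectors, run a bidirectional GRU
    over the snake-ordered sequence, and concatenate the forward and
    backward hidden states for each region."""
    order = snake_order(14, 14)
    seq = [region_vecs[i] for i in order]   # region_vecs: 196 CNN vectors
    h_f, h_b = np.zeros(hidden_dim), np.zeros(hidden_dim)
    fwd, bwd = [], []
    for x in seq:                           # left-to-right pass
        h_f = gru_fwd(x, h_f)
        fwd.append(h_f)
    for x in reversed(seq):                 # right-to-left pass
        h_b = gru_bwd(x, h_b)
        bwd.append(h_b)
    bwd.reverse()
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```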

And that model, too, actually achieved state-of-the-art results. This dataset was only released last year, so everybody now works on deep learning techniques to try to solve it. And I was at first a little skeptical; it seemed too good to be true that this model we developed for NLP would work so well.

So we really dug into looking at the attention, these g values that, again, we computed with this equation. Now, instead of paying attention to words, it paid attention to different regions in the image. And we started analyzing, going through a bunch of examples on the dev set, what it is actually paying attention to.

Again, it's being trained only with the image, the question, and the final answer. That's what you get at training time. You do not get this sort of latent representation of where you should actually pay attention to in the image in order to answer that question correctly. So when the question was, what is the main color on the bus?

It learned to actually pay attention to that bus. I'm like, well, okay, maybe that's not that impressive; it's just the main object in the center of the image. And what type of trees are in the background? Well, maybe it just connects tree with anything that's green and pays attention to that.

So it was neat, but not super impressive yet. Is this in the wild is kind of more interesting: it actually pays attention to a man-made structure in the background, but incorrectly answers no. Then this one is kind of interesting. Who is on both photos? The answer is girl. Now, to be honest, I don't think the model actually knows that there are two people and tries to match them, and so on.

It just finds the main person or main object in this scene. The main object is a little baby girl, so it says girl. This one's also relatively trivial. What time of day was this picture taken? The answer is night, because it's a very dark picture, at least in the sky.

This one is getting a little more interesting. What is the boy holding? The answer is surfboard, and it actually does pay attention to both of the arms, and then what's just below that arm. So that's a little more interesting kind of attention visualization. And then for a while, we're also worried, well, what if in the data set, it just learns really well from language alone?

Yes, it pays attention to things, but maybe it'll just say things that it often sees in the text. So if I ask you what color are the bananas, you don't really have to look at an image. In 95% of the cases, you're right just saying yellow without seeing an image.

So this one I was kind of excited about, because it actually paid attention to the bananas in the middle, and then did say green, and kind of overruled the prior that it would get from language alone. What is the pattern on the cat's fur on its tail? Pays attention mostly to the tail and says stripes.

Now, this one here was interesting. Did the player hit the ball? The answer is yes. Though I have to say that we later had a journalist, John Markoff from the New York Times, who wanted to ask his own question, and we had just put together this demo the night before.

And he's like, well, I want to ask my own question. I'm like, okay. And he asked, is the girl wearing a hat? It wasn't made for production, so it was kind of slow, and while the system was cranking, I'm trying to come up with excuses: it's kind of a black background and a black hat, so it might be hard to see.

And fortunately, it got it right and said yes. Then after the interview, I thought, let me just ask it a bunch of questions myself, in a less stressful situation. And these are the first eight questions that I could come up with.

And somewhat to my surprise, it actually got them all right. What is the girl holding? A tennis racket. What is she playing, or what is she doing? Playing tennis. Is the girl wearing shorts? What is the color of the ground? Brown. Then I thought, okay, let's try to break it by asking about the color of one of the smallest objects, the ball.

It actually got that right too. What is the color of her skirt? White. That's also kind of interesting: you can ask the model whether she's wearing shorts, but when you ask about the skirt, it still sort of captures that you might call this different things. And then this one was interesting: what did the girl just hit?

A tennis ball. And then I thought, well, what if I ask, is the girl about to hit the tennis ball? It said yes. And then, did the girl just hit the tennis ball? And it said yes again. So I finally found a way to break it: it doesn't have enough of the co-occurrence statistics to understand.

And, again, to understand which angle the arm has to be at in order to tell whether the ball was just hit or is about to be hit. But what it basically does show us is that once it has seen a lot of examples in a specific domain, it really can capture quite a lot of different things.

And now, let's see if we can get the demo up. I have to be on a VPN to make it work. But so here's one example. The best way to hope for any chance of enjoying this film is by lowering your expectations. Again, one of those kinds of sentences that you have to now get correct in order to get improved performance on sentiment.

And actually correctly says that this is negative. Now, we can also actually ask that question in Chinese. And this is one of the beautiful things of the DMN and in general really of most deep learning techniques. We don't have to be experts in a domain or even in a language to create a very, very accurate model for that language or that domain.

There's no more feature engineering. I'm not gonna make a fool of myself trying to read that one out loud, but it's an interesting example. You can also ask what parts of speech there are, and you can have other things like named entities and other sequence labeling problems. You can also ask, what are the men wearing on their heads?

The answer is helmets. And then a slightly more interesting question, why are the men wearing helmets? The answer is safety. Especially since we're close to the Circle of Death here at Stanford, where a lot of bikes crash, that's a good answer. All right, with that, I wanna leave a couple of minutes for questions.

So basically the summary is: word vectors and recurrent neural networks are super useful building blocks. Once you really appreciate and understand those two building blocks, you're kind of ready to have some fun and build more complex models. In the end, the DMN is really a way to combine them in a variety of new ways into a larger, more complex model.

And that's also, I think, where the state of deep learning for natural language processing is. We've tackled a lot of these smaller sub-problems and intermediate tasks, and now we can work on more interesting, complex problems like dialogue, question answering, machine translation, and things like that. All right, thank you.

>> >> Five minutes? All right, cool, yeah, five minutes. >> A quick question: in the dynamic memory network, you have the RNN. And you also mentioned that you could have better assumptions about the structure of the input, right? You also worked on the tree LSTM, right? So if you change the RNN into a tree structure, would that help?

>> It's a good question. I actually love tree structures; I did my whole PhD on tree structures. And somewhat surprisingly, in the last couple of weeks there have actually been some new results on SNLI, the Stanford Natural Language Inference dataset, where tree structures are again the state of the art.

Though I have to say that I think the dynamic memory network, by having this ability in the episodic memory to keep track of different sub phrases and pay attention to those and then combine them over multiple passes, I think you can kind of get away with not having a tree structure.

So yes, you might have a slight improvement representing sentences as trees in your input module. But I think they're only going to be slight. And I think the episodic memory module that has this capability to go over the input multiple times, pay attention to certain sub phrases, will capture a lot of the kinds of complexities that you might want to capture in tree structures.

So my short answer is, I don't think you necessarily need it. >> So have you tried it? >> We have not, no. >> Thanks. >> Hi, my question is about question answering. So if we want to apply question answering to some specific domains like healthcare, but we don't really have the data, we don't have question answer pairs.

And what should we do? Are there any general principles here? >> That's a great question. What do you do if you want to do question answering on a complex domain, you don't have the data? I think, and this feels maybe like a cop out, but I think it's very true both in practice and in theory, create the data.

Like if you cannot possibly create more than a thousand examples of anything, then maybe automating that process is not that important. So clearly, you should be able to create some data. And in many cases, that is the best use of your time, is just to sit down or ask the domain expert to create a lot of questions and then have people find the answers.

And then measure how they actually get to those answers; try to have them in a constrained environment, and so on. For most companies, for instance, when you try to do automated email replies, which is in some ways a little bit similar to question answering, that's a nice domain, because the emails have already been answered before, so you can use past behavior.

Now, if you have a search engine where people ask a lot of questions, then you can also use that to bootstrap and see where it actually failed. Then take all those really tough queries where it failed, have some humans sit there, and collect the data. So that's the simplest answer.

Now, the other answer is, let's work together for the next many years on research for smaller training data set sizes and complex reasoning. The fact of the matter for that line of research will still be, if a system has never seen a certain type of reasoning, it'll be hard for the system to pick up that type of reasoning.

I think we're going to get with these kinds of architectures to the space where at least if it has seen this type of reasoning, a specific type of transitive reasoning or temporal reasoning or sort of cause and effect type reasoning, at least a couple hundred times, then you should be able to train a system with these kinds of models to do it.

>> Are these QA systems currently robust to false input or questions? For the woman playing tennis, if you asked, what's the man holding? Would it reply, there's no man? >> It would not. And largely because at training time, you never try to mess with it like that. I'm pretty sure if you added a lot of training samples where you had those, it would probably eventually pick it up.

>> Those would be important for real world implementations and- >> So real-world implementations of this in security settings are actually kind of tricky. Whenever you train a system, we know we can, for instance, steal certain classifiers by querying them a lot, and we know we can fool them into classifying certain images as others.

We have folks in the audience who worked on that exact line of work. So I would be careful using it in security environments right now. >> Yeah, hi, I have a question. >> Wow, up there, hi there. >> Yeah, I have a question actually. There was a slide where you had the input module and then there were a bunch of sentences.

So were those sentences themselves RNNs? Because the sequence is basically made up of those individual words and, say, GloVe representations. So were those also RNNs that were stitched together? >> So the answer there is a little complex, because we have two papers with the DMN, and the answer is different for each.

In the simplest form of that, it is actually a single GRU that goes from the first word through all the sentences as if they're one gigantic sequence. But it has access to the period at the end of each sentence, so it can pay special attention to sentence endings. And so yes, in the simplest form, it is just a GRU that goes over all the words.

>> So is it a normal process to basically just concatenate all the sentences into one gigantic- >> So the answer there, and this is kind of why I split the talk into three different parts, from words to single sentences and then multiple sentences: I think if you just had a single GRU that goes over everything and you then try to reason over that entire sequence, it would not work very well.

You really need to have an additional structure, such as an attention mechanism or a pointer mechanism that has the ability to pay attention to specific parts of your input to do that very accurately. But yeah, in general, that's fine, as long as you have this additional mechanism. Thank you.
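
For concreteness, a minimal sketch of the simplest input module described in this answer, with a hypothetical `gru_step` and a word-vector lookup, might look like this: one GRU runs over all the concatenated words, and the hidden state at each sentence-final period is kept as that sentence's fact vector.

```python
import numpy as np

def sentence_facts(word_vectors, tokens, gru_step, hidden_dim):
    """Run a single GRU over all sentences concatenated into one sequence;
    keep the hidden state at each sentence-final period as a fact vector.
    word_vectors maps a token to its vector (e.g. a GloVe lookup)."""
    h = np.zeros(hidden_dim)
    facts = []
    for tok in tokens:
        h = gru_step(word_vectors[tok], h)
        if tok == ".":                 # end of a sentence
            facts.append(h)
    return facts
```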

>> Thank you, great question. >> So in recurrent neural nets, you're using sigmoids. In visual recognition, I guess, rectified linear units were the more popular non-linearity. >> That's right, so ReLUs are great. Now, when you look at the GRU equations here, you have these reset gates. And you want these reset gates to essentially be between 0 and 1, so that the model can either ignore an input entirely or let it be part of the computation of the candidate state h-tilde.

So in some cases, you really do want to have sigmoids there. But in other places, for instance for some simpler things where you don't have that much recurrence, such as going from one memory state to another, ReLUs were actually good activation functions too, in the second iteration of this model.
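
For reference, a standard GRU step with sigmoid gates looks roughly like this in numpy; the update convention (which gate mixes the old state and the candidate) varies between papers, so treat this as one common variant rather than the exact equations on the slide.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """Standard GRU step. The update gate z and reset gate r are squashed
    with sigmoids so they lie in (0, 1) and can smoothly keep or ignore
    parts of the previous state; the candidate h~ uses a tanh."""
    z = sigmoid(Wz @ x + Uz @ h_prev)             # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev)             # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev)) # candidate state
    return (1 - z) * h_prev + z * h_tilde
```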

>> Did you guys try to, after training this network, try to take these weights for the images and do object detection again? So these weights would be augmented with the text vectors. Did you try to use- >> That is a very cool idea that we did not explore. No, there you go.

You gotta do it fast. >> >> This field is moving fast, you just let the cat out of the bag. >> >> So those attention models are pretty powerful when you have enough training data, and then you can learn to make good use of the data. But even though some of these tasks are, I guess, pretty trivial for a human, they're hard for a model to learn.

So what do you think of, I guess, even right now, we have a lot of knowledge bases on the web, right, like Wikipedia. We know a lot of common sense. What do you think about incorporating those knowledge bases into these models? >> I actually love that line of research too, and that was kind of what we started out with.

This semantic memory module in the simplest form is just word vectors. I think one next iteration would actually be to have knowledge bases also influence the reasoning. There's very little work on combining text and knowledge bases to do overall complex question answering that requires reasoning. I think it's a phenomenally interesting area of research.

>> So are there any hints or any starting point about how to encode those? >> There are some papers that do reasoning over knowledge bases alone. So we had a paper on recursive neural tensor networks that basically takes a triplet. One part is a vector for an entity, which might be in Freebase or WordNet.

Another is a vector for a relationship, and a third is a vector for another entity. Then you basically pipe them into a neural network and ask: yes or no, are these two entities actually in that relationship? And you can have a variety of different architectures. I think Samy worked on that as well.

Wait, that's a different brother, a different Bengio. >> >> Over there, all right. >> >> And- >> And I did too. >> That's true, that's true, yeah. Antoine Bordes, right? That's right, that's right. So I think you can also reason over knowledge graphs, and you could then try to combine that with reasoning over fuzzy text.
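
As a rough illustration of the triplet-scoring idea mentioned above, here is a minimal sketch of a neural-tensor-network style scoring function; the parameter shapes are assumptions for illustration, with one set of parameters per relation.

```python
import numpy as np

def ntn_score(e1, e2, W_tensor, V, b, u):
    """Neural tensor network sketch: score how plausible it is that two
    entity vectors e1, e2 (each of dimension d) stand in a given relation.
    W_tensor has shape (k, d, d), V has shape (k, 2d), b and u have shape (k,)."""
    bilinear = np.array([e1 @ W_tensor[i] @ e2 for i in range(W_tensor.shape[0])])
    linear = V @ np.concatenate([e1, e2])
    return u @ np.tanh(bilinear + linear + b)   # scalar plausibility score
```

A higher score would mean the triplet (entity, relation, entity) is judged more likely to hold; training would push scores of true triplets above scores of corrupted ones.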

It has all been done in pieces, but I think nobody has yet really combined it in a principled way. Great question. Yeah, one last question. >> I have a question. When the model answers my questions correctly, how do I check that the model actually understood my question? And what's the logic, what's the model's logic behind that?

>> It's a good question. In some ways, it's a common question for neural network interpretability. >> So in computer vision, sometimes we can at least visualize the features. Right, so how about the- >> That's right. And so I think the best thing that we could do right now is to show these attention scores where for sentiment, we're like, how did it come up with the sentiment?

It paid attention to the movie working. And likewise for question answering, we can see which sentences it actually paid attention to in order to answer the overall question. So that is, I think, the best answer that we could come up with right now. But beyond that, there are certain other complexities that are still an area of open research.

>> Thanks. >> Thank you. All right, thank you, everybody. >> >> So thank you, Richard. We'll take another coffee break for 30 minutes, so please come back at 2.45 for our presentation by Sherry Moore.