Deep Learning for Natural Language Processing (Richard Socher, Salesforce)
Chapters
0:51 What is Natural Language Processing?
1:56 NLP Levels
5:27 (A tiny sample of) NLP Applications
9:01 Outline
18:34 Combining the best of both worlds: GloVe (Pennington et al. 2014)
20:30 GloVe results
21:24 Intrinsic word vector evaluation: Word Vector Analogies
23:37 GloVe Visualizations: Superlatives
23:54 Analogy evaluation and hyperparameters
25:47 Recurrent Neural Networks (!)
29:18 RNN language model
35:41 Attempt at a clean illustration
39:47 Pointer sentinel mixture models
41:33 Language Model Evaluation
43:55 Current Research
46:33 First Major Obstacle
47:25 Second Major Obstacle
49:56 High level idea for harder questions
50:25 Basic lego block: GRU (defined before)
50:33 Dynamic Memory Network
54:23 The Modules: Input
54:42 The Modules: Question
54:57 The Modules: Episodic Memory
00:00:00.000 |
Thank you, everybody, and thanks for coming back very soon after lunch. 00:00:04.640 |
I'll try to make it entertaining to avoid some post-food coma. 00:00:09.120 |
So I actually owe a lot for being here to Andrew and Chris and 00:00:19.080 |
I figured there's going to be a broad range of capabilities in the room. 00:00:24.080 |
So I'm sorry I will probably bore some of you for 00:00:28.160 |
the first two-thirds of the talk, cuz I'll go over the basics of what's NLP, 00:00:34.080 |
what's natural language processing, what's deep learning, and 00:00:36.440 |
what's really at the intersection of the two. 00:00:38.560 |
And then the last third, I'll talk a little bit about some exciting new 00:00:45.880 |
So let's get started with what is natural language processing? 00:00:49.440 |
It's really a field at the intersection of computer science, AI, and linguistics. 00:00:56.440 |
a lot of these statements here we could really talk and philosophize a lot about. 00:01:02.640 |
For me, the goal of natural language processing is for computers to process or 00:01:08.360 |
in scare quotes, "understand" natural language in order to perform tasks that are 00:01:11.760 |
actually useful for people, such as question answering. 00:01:14.840 |
The caveat here is that really fully understanding and 00:01:19.600 |
representing the meaning of language, or even defining it, is quite an elusive goal. 00:01:25.200 |
So whenever I say the model understands, I'm sorry, I shouldn't say that. 00:01:30.720 |
Really, these models don't understand in the sense that we understand language 00:01:38.120 |
or represent the full meaning in its entire glory; it's usually not quite true. 00:01:43.240 |
Really, perfect language understanding is in some sense AI-complete, in the sense 00:01:48.200 |
that you need to understand all of visual inputs and 00:01:54.160 |
So a little more concretely, as we try to tackle this overall problem of 00:01:59.120 |
understanding language, what are sort of the different levels that we often look at? 00:02:04.560 |
Often, and for many people, it starts with speech. 00:02:07.680 |
And then once you have speech, you might say, all right, 00:02:09.520 |
now I know what the phonemes are, the smaller parts of words. 00:02:12.480 |
I understand how words form, that's morphology or morphological analysis. 00:02:17.000 |
Once I know what the meanings of words are, I might try to understand how they're put 00:02:21.800 |
together in grammatical ways such that the sentences are understandable or 00:02:26.840 |
at least grammatically correct to a lot of speakers of the language. 00:02:35.360 |
And that's really where I think most of my interest lies, 00:02:40.840 |
in semantic interpretation, actually trying to get to the meaning in some useful 00:02:46.040 |
And then after that, we might say, well, if we understand now the meaning of 00:02:48.560 |
a whole sentence, how do we actually interact? 00:02:53.000 |
How do we have spoken dialogue systems, things like that? 00:02:55.880 |
Where deep learning has really improved the state of the art significantly 00:03:01.200 |
is really in speech recognition and syntax and semantics. 00:03:05.640 |
And the interesting thing is that we're kind of actually skipping some of these 00:03:10.400 |
Deep learning often doesn't require morphological analysis to create 00:03:15.680 |
And in some cases, actually skips syntactic analysis entirely as well. 00:03:21.320 |
It doesn't have to be taught about what noun phrases are, 00:03:24.840 |
It can actually get straight to some semantically useful tasks right away. 00:03:29.240 |
And that's going to be one of the sort of advantages that we don't have to 00:03:34.360 |
actually be as inspired by linguistics as traditional natural language 00:03:42.120 |
Well, there's a lot of complexity in representing and learning, and 00:03:46.600 |
especially using linguistic situational world and visual knowledge. 00:03:49.840 |
Really all of these are connected when it gets to the meaning of language. 00:03:55.480 |
can you do that without visual understanding, for instance? 00:03:59.440 |
If you have, for instance, this sentence here, Jane hit June and 00:04:07.160 |
Depending on which verb comes after she, the definition, 00:04:14.280 |
And this is one subtask you might look at, so-called anaphora resolution or 00:04:18.880 |
coreference resolution in general, where you try to understand, 00:04:23.200 |
And it really depends on the meaning, again, somewhat scare quotes here, 00:04:35.800 |
So here we have a very simple sentence, four words, I made her duck. 00:04:38.960 |
Now that simple sentence can actually have at least four different meanings, 00:04:45.720 |
if you can think about it for a little bit, right? 00:04:48.080 |
You made her a duck that she loves for Christmas as her dinner. 00:04:52.480 |
You made her duck like me just now, and so on. 00:04:57.800 |
And to know which one requires, in some sense, situational awareness or 00:05:02.320 |
knowledge to really disambiguate what is meant here. 00:05:10.560 |
Now, where does it actually become useful in terms of applications? 00:05:14.000 |
Well, they actually range from very simple things that we kind of assume or 00:05:17.680 |
are given now, we use them all the time every day, to more and 00:05:20.400 |
more complex, and then also more in the realm of research. 00:05:24.360 |
The simple ones are things like spell checking or keyword search and 00:05:30.600 |
Then the medium sort of difficulty ones are to extract information from websites, 00:05:36.880 |
trying to extract sort of product prices or dates and locations, people or 00:05:40.720 |
company names, so-called named entity recognition. 00:05:43.720 |
You can go a little bit above that and try to classify sort of reading levels for 00:05:48.680 |
school text, for instance, or do sentiment analysis that can be helpful if you 00:05:53.480 |
have a lot of customer emails that come in and you want to prioritize highly the ones 00:05:58.120 |
of customers who are really, really annoyed with you right now. 00:06:01.200 |
And then the really hard ones, and I think in some sense, 00:06:03.920 |
the most interesting ones are machine translation, 00:06:07.440 |
trying to actually be able to translate between all the different languages in 00:06:10.280 |
the world, question answering, clearly something that is a very exciting and 00:06:16.600 |
useful piece of technology, especially over very large, complex domains. 00:06:24.880 |
I know pretty much everybody here would love to have some simple automated email 00:06:29.560 |
reply system, and then spoken dialogue systems, bots are very hip right now. 00:06:34.000 |
These are all sort of complex things that are still in the realm of research to do 00:06:39.160 |
We're making huge progress, especially with deep learning on these three, but 00:06:52.040 |
I mentioned we have morphology and words and syntax and semantics and so on. 00:06:58.720 |
We can look at one example, namely machine translation, and 00:07:03.280 |
look at how did people try to solve this problem of machine translation. 00:07:08.440 |
Well, it turns out they actually tried all these different levels 00:07:13.440 |
You can try to have a direct translation of words to other words. 00:07:16.680 |
The problem is that is often a very tricky mapping. 00:07:19.040 |
The meaning of one word in English might have three different words in German and 00:07:24.760 |
You can have three of the same words in English, 00:07:27.360 |
meaning all the single same word in German, for instance. 00:07:30.520 |
So then people said, well, let's try to maybe do some syntactic transfer where we 00:07:34.000 |
have whole phrases, like to kick the bucket, which just means [FOREIGN] in German. 00:07:41.680 |
let's try to find a logical representation of the whole sentence, 00:07:44.440 |
the actual meaning in some human understandable form, and 00:07:48.280 |
then try to just find another surface representation of that. 00:07:51.560 |
Now, of course, that will also get rid of a lot of the subtleties of language. 00:07:56.200 |
And so, the tricky problems in all these kinds of representations. 00:08:01.200 |
Now, the question is, what does deep learning do? 00:08:03.600 |
You've already seen at least two methods, standard neural networks before and 00:08:11.680 |
And in some sense, there's going to be a huge similarity here to these methods. 00:08:17.000 |
Because just like images that are essentially a long list of numbers, 00:08:22.760 |
a vector, and standard neural networks with a hidden state 00:08:30.760 |
That is also going to be the main representation that we will use throughout 00:08:34.960 |
for characters, for words, for short phrases, for sentences, and 00:08:43.960 |
And with that, we are sort of finishing up the whirlwind of what's NLP. 00:08:49.200 |
Of course, you could give an entire lecture on almost every single slide I 00:08:57.000 |
But we'll continue at that speed to try to squeeze this complex deep learning for 00:09:05.440 |
I think they're two of the most important basic Lego blocks that you 00:09:10.560 |
nowadays want to know in order to be able to creatively play around with more 00:09:15.120 |
complex models, and those are going to be word vectors and 00:09:19.960 |
sequence models, namely recurrent neural networks. 00:09:22.480 |
And I kind of split this into words, sentences, and multiple sentences. 00:09:28.640 |
But really, you could use recurrent neural networks for 00:09:31.920 |
shorter phrases as well as multiple sentences. 00:09:34.160 |
But in many cases, we'll see that they have some limitations as you move to longer and 00:09:38.800 |
longer sequences and just use the default neural network sequence models. 00:09:47.360 |
And maybe one last blast from the past here, to represent the meaning of words, 00:09:53.080 |
we actually used to use taxonomies like WordNet that kind of defines 00:09:58.440 |
each word in relationship to lots of other ones. 00:10:01.400 |
So you can, for instance, define hypernyms and is-a relationships. 00:10:05.200 |
You might say the word panda, for instance, in its first meaning as a noun, 00:10:10.760 |
basically goes through this complex stack, this directed acyclic graph. 00:10:17.880 |
And in the end, like everything, it is an entity, but 00:10:20.520 |
it's actually a physical entity, a type of object. 00:10:22.640 |
It's a whole object, it's a living thing, it's an organism, animal, and so on. 00:10:26.160 |
So you basically can define a word like this. 00:10:31.680 |
you actually have so-called synsets, or synonym sets. 00:10:34.480 |
And here's an example for the synonym set of the word good. 00:10:38.320 |
Good can have a lot of different meanings, can actually be both an adjective, 00:10:47.360 |
Now, what are the problems with this kind of discrete representation? 00:10:51.960 |
Well, they can be great as a resource if you're a human, you wanna find synonyms. 00:10:57.320 |
But they're never going to be quite sufficient to capture all the nuances 00:11:05.520 |
So for instance, the synonyms here for good were adept, expert, 00:11:13.240 |
But of course, you would use these words in slightly different contexts. 00:11:17.120 |
You would not use the word expert in exactly all the same contexts 00:11:23.320 |
as you would use the meaning of good, or the word good. 00:11:27.120 |
Likewise, it will be missing a lot of new words. 00:11:29.680 |
Language is this interesting living organism, we change it all the time. 00:11:33.600 |
You might have some kids, they say YOLO, and all of a sudden, 00:11:39.360 |
Likewise, maybe in Silicon Valley, you might see ninja a lot, and 00:11:43.120 |
now you need to update your dictionary again. 00:11:44.960 |
And that is basically going to be a Sisyphean job, right? 00:11:47.560 |
Nobody will ever be able to really capture all the meanings and 00:11:52.520 |
this living, breathing organism that language is. 00:11:58.080 |
Some people might think ninja should just be deleted from the dictionary and 00:12:03.120 |
I just think nifty or badass is kind of a silly word and 00:12:06.560 |
should not be included in a proper dictionary, but 00:12:11.880 |
As soon as you change your domain, you have to ask people to update it. 00:12:16.120 |
And it's also hard to compute accurate word similarities. 00:12:18.920 |
Some of these words are subtly different, and 00:12:21.080 |
it's really a continuum in which we can measure their similarities. 00:12:28.360 |
what is also the first step for deep learning, 00:12:31.640 |
we'll actually realize it's not quite deep learning in many cases. 00:12:35.280 |
But it is sort of the first step to use deep learning in NLP, 00:12:42.080 |
Basically, the idea is that we'll use the neighbors of a word 00:12:50.960 |
And here's an example, for instance, for the word banking. 00:12:53.920 |
We might actually represent banking in terms of all these other words 00:12:58.760 |
So let's do a very simple example where we look at a window around each word. 00:13:06.960 |
And so here, the window length, that's just for simplicity, say it's one. 00:13:11.120 |
We represent each word only with the words one to the left and 00:13:15.160 |
We'll just use the symmetric context around each word. 00:13:22.040 |
So say the three sentences in my corpus, of course, 00:13:24.680 |
we would always wanna use corpora with billions of words instead of just a couple, 00:13:29.360 |
but just to give you an idea of what's being captured in these word vectors, 00:13:33.480 |
are: I like deep learning, I like NLP, and I enjoy flying. 00:13:37.200 |
And now, this is a very simple so-called co-occurrence statistic. 00:13:42.520 |
You'll just simply see here, i, for instance, appears twice in its window 00:13:47.280 |
size of one here, the word like is in its window and its context, and 00:13:54.640 |
And for like, you have I twice to its left, and once deep, and once NLP. 00:14:01.600 |
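As a minimal sketch of this window-based co-occurrence counting (illustrative code, not from the talk; the tokenization and variable names are my own):

```python
import numpy as np

# The three toy sentences from the example, with a symmetric window of size 1.
corpus = ["I like deep learning", "I like NLP", "I enjoy flying"]
window = 1

tokens = [sentence.lower().split() for sentence in corpus]
vocab = sorted({word for sentence in tokens for word in sentence})
index = {word: i for i, word in enumerate(vocab)}

# X[i, j] counts how often word j appears within the window around word i.
X = np.zeros((len(vocab), len(vocab)))
for sentence in tokens:
    for i, word in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                X[index[word], index[sentence[j]]] += 1

# Each row of X is a (sparse, high-dimensional) vector for one word; for example,
# the row for "i" has a count of 2 in the "like" column and 1 in the "enjoy" column.
```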
It turns out, if you just take those vectors, now this could be a vector 00:14:06.960 |
representation, just each row could be a vector representation for words. 00:14:11.360 |
Unfortunately, as soon as your vocabulary increases, 00:14:15.960 |
And hence, you'd have to retrain your whole model. 00:14:20.760 |
really, it's going to be somewhat noisy if you use that vector. 00:14:24.480 |
Now, another better thing to do might be to run SVD or 00:14:28.760 |
something similar like PCA dimensionality reduction on such a co-occurrence matrix. 00:14:34.240 |
And that actually gives you a reasonable first approximation to word vectors. 00:14:41.360 |
Now, what works even better than simple PCA is actually a model introduced 00:14:49.880 |
So instead of capturing co-occurrence counts directly out of a matrix like that, 00:14:54.600 |
you'll actually go through each window in a large corpus and 00:14:57.960 |
try to predict a word that's in the center of each window and 00:15:03.960 |
That way, you can very quickly train, you can train almost online, 00:15:12.160 |
add words to your vocabulary very quickly in this streaming fashion. 00:15:16.440 |
So now let's look a little bit at this model Word2Vec, 00:15:20.160 |
because, one, it's a very simple NLP model, and two, 00:15:27.920 |
We won't go into too many details, but at least look at a couple of equations. 00:15:31.800 |
So again, the main goal is to predict the surrounding words in a window 00:15:36.240 |
of some length m, a hyperparameter that we define, around every word. 00:15:40.680 |
Now, the objective function will essentially try to maximize here 00:15:43.560 |
the log probability of any of these context words given the center word. 00:15:47.720 |
So we go through our entire corpus of length T, a very long sequence, and 00:15:52.640 |
at each time step t, we will basically look at all the words j in the context 00:15:58.960 |
of the current word t, and basically try to maximize here 00:16:05.440 |
this probability of trying to be able to predict that word that is around 00:16:09.560 |
the current word t, and theta denotes all the parameters, 00:16:15.080 |
namely all the word vectors, that we want to optimize. 00:16:17.360 |
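Written out, the objective being described is essentially the standard skip-gram objective (my rendering, not a slide reproduced from the talk), maximized over theta, the collection of all word vectors:

```latex
J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \;\sum_{\substack{-m \le j \le m \\ j \neq 0}} \log p\left(w_{t+j} \mid w_t\right)
```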
So now, how do we actually define this probability P here? 00:16:21.120 |
The simplest way to do this, and this is not the actual way, but 00:16:25.720 |
it's the simplest and first way to understand and derive this model, 00:16:29.720 |
is with this very simple inner product here, and 00:16:34.960 |
There are not going to be many layers of nonlinearities like we see in 00:16:38.480 |
deep neural networks, it's really just a simple inner product. 00:16:43.040 |
The larger the inner product between the two word vectors, the more likely these two will be predicting one another. 00:16:47.640 |
So here, c is the center word and o is the outside, or context, word. 00:16:55.160 |
And basically, this inner product, the larger it is, 00:16:57.760 |
the more likely we were going to predict this. 00:17:00.320 |
And these are both just standard n-dimensional vectors. 00:17:04.080 |
And now, in order to get a real probability, we'll essentially apply 00:17:08.360 |
softmax to all the potential inner products that you might have in your vocabulary. 00:17:14.600 |
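In symbols, this simplest version of the probability is a softmax over inner products (my notation: o is the outside/context word, c the center word, u and v their outside and center vectors, V the vocabulary size):

```latex
p(o \mid c) = \frac{\exp\left(u_o^{\top} v_c\right)}{\sum_{w=1}^{V} \exp\left(u_w^{\top} v_c\right)}
```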
this denominator is actually going to be a very large sum, right? 00:17:19.840 |
We'll want to sum here over all potential inner products for 00:17:25.000 |
So now, the real methods that we would use are going to 00:17:29.840 |
approximate the sum in a variety of clever ways. 00:17:34.320 |
Now, I could literally talk for the next hour and a half just about how to optimize this, but 00:17:39.120 |
then we'll all deplete our mental energy for the rest of the day. 00:17:42.880 |
And so, I'm just going to point you to the class I taught earlier this year, 00:17:47.520 |
CS224d, where we will have lots of different slides that go into all 00:17:52.840 |
the details of this equation, how to approximate it, and then how to optimize it. 00:17:56.320 |
It's going to be very similar to the way we optimize any other neural network. 00:18:00.520 |
We're going to use stochastic gradient descent. 00:18:03.080 |
We're going to look at mini-batches of a couple of hundred windows at a time, and 00:18:09.240 |
And we're just going to take simple gradients of each of these vectors 00:18:16.360 |
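To make the kind of approximation and stochastic gradient step meant here concrete, this is a rough sketch of one skip-gram update with negative sampling, one of the common ways to avoid the full softmax sum; the function name, matrix layout, and learning rate are hypothetical, not taken from the lecture:

```python
import numpy as np

def sgns_step(center_id, context_id, negative_ids, V_center, U_outside, lr=0.025):
    """One SGD update for skip-gram with negative sampling (illustrative sketch).

    V_center:  (vocab_size, d) array of center-word vectors
    U_outside: (vocab_size, d) array of outside-word vectors
    """
    v_c = V_center[center_id]
    ids = np.concatenate(([context_id], np.asarray(negative_ids)))
    labels = np.zeros(len(ids))
    labels[0] = 1.0                           # 1 for the observed context word, 0 for negatives

    u = U_outside[ids]                        # (k + 1, d)
    scores = u @ v_c                          # the same inner products as in the softmax numerator
    probs = 1.0 / (1.0 + np.exp(-scores))     # sigmoids instead of one huge softmax
    grad_scores = probs - labels              # gradient of the negative log likelihood w.r.t. the scores

    U_outside[ids] -= lr * np.outer(grad_scores, v_c)    # update the sampled outside vectors
    V_center[center_id] -= lr * (grad_scores @ u)        # update the center vector
```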
All right, now, we briefly mentioned PCA-like methods, 00:18:23.080 |
often based on singular value decomposition, or standard simple PCA. 00:18:30.680 |
There's actually one model that combines the best of both worlds, namely GloVe, or 00:18:35.920 |
global vectors, introduced by Jeffrey Pennington in 2014. 00:18:39.440 |
And it has a very similar idea, and you'll notice here, 00:18:44.360 |
You have this inner product again for different pairs. 00:18:47.760 |
But this model will actually go over the co-occurrence matrix. 00:18:51.120 |
Once you have this co-occurrence matrix, it's much more efficient to try to predict 00:18:54.320 |
once how often two words appear next to each other, rather than do it 50 times 00:19:00.080 |
each time that pair appears in an actual corpus. 00:19:04.960 |
So in some sense, you can be more efficiently going through all the co-occurrence 00:19:08.760 |
statistics, and you're going to basically try to minimize this subtraction here. 00:19:16.600 |
And what that basically means is that each inner product will try to approximate 00:19:21.600 |
the log probability of these two words actually co-occurring. 00:19:25.920 |
Now, you have this function here, which essentially will allow us to not overly 00:19:32.320 |
weight certain pairs that occur very, very frequently. 00:19:36.800 |
"The", for instance, co-occurs with lots of different words, and you want to basically 00:19:40.880 |
lower the importance of all the words that co-occur with "the". 00:19:50.480 |
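For reference, the published GloVe objective has the following form, where X_ij is the co-occurrence count of words i and j and f is the weighting function just mentioned that caps the influence of very frequent pairs such as those involving "the" (my transcription of the formula, not a slide from the talk):

```latex
J = \sum_{i,j=1}^{V} f\left(X_{ij}\right)\left(w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^{2}
```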
In fact, we trained this on Common Crawl, which is a really great data set of most 00:20:00.880 |
And it gets also very good performance on small corpora because it makes use very 00:20:05.640 |
efficiently of these co-occurrence statistics. 00:20:08.040 |
And that's essentially what words, well, word vectors are always capturing. 00:20:12.240 |
So if in one sentence, you just want to remember every time you hear word vectors 00:20:17.160 |
in deep learning, one, they're not quite deep, even though we call them sort of 00:20:22.320 |
And two, they're really just capturing co-occurrence counts. 00:20:25.040 |
How often does a word appear in the context of other words? 00:20:28.440 |
So let's look at some interesting results of these GloVe vectors. 00:20:34.960 |
Here, the first thing we do is look at nearest neighbors. 00:20:38.080 |
So now that we have these n-dimensional vectors, usually we say n is between 50 and at 00:20:43.680 |
most 500; good general numbers are 100 or 200 dimensions. 00:20:47.680 |
Each word is now represented as a single vector. 00:20:53.360 |
And so we can look in this vector space for words that appear close by. 00:20:57.720 |
We started and looked for the nearest neighbors of frog. 00:21:01.640 |
And well, it turned out these are the nearest neighbors, 00:21:05.840 |
which was a little confusing since we're not biologists. 00:21:08.360 |
But fortunately, when you actually look up in Google what those mean, 00:21:12.680 |
you'll see that they are actually all indeed different kinds of frogs. 00:21:17.040 |
Some appear very rarely in the corpus and others like toad are much more frequent. 00:21:22.680 |
Now, one of the most exciting results that came out of word vectors 00:21:29.560 |
So the idea here is: can there be relationships between 00:21:36.840 |
different word vectors that simply fall out of very simple linear and 00:21:42.960 |
So the idea here is: man is to woman as king is to what? 00:21:54.120 |
try to basically fill in here the last missing word? 00:22:00.040 |
Now, the way we're going to do this is very simple cosine similarity. 00:22:04.880 |
We basically just take, let's take an example here, 00:22:08.560 |
the vector of woman, we subtract the word vector we learned of man, 00:22:21.840 |
add the vector of king, and the closest word to this turns out to be queen for a lot of these different models. 00:22:28.160 |
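A minimal sketch of this analogy computation in code, assuming a hypothetical dictionary `vectors` that maps each word to its learned vector (not code from the talk):

```python
import numpy as np

def analogy(a, b, c, vectors, topn=1):
    """Answer 'a is to b as c is to ?' by cosine similarity, e.g. man : woman :: king : ?"""
    target = vectors[b] - vectors[a] + vectors[c]     # e.g. woman - man + king
    target = target / np.linalg.norm(target)
    scored = []
    for word, vec in vectors.items():
        if word in (a, b, c):                         # exclude the query words themselves
            continue
        cosine = float(vec @ target) / np.linalg.norm(vec)
        scored.append((cosine, word))
    return [word for _, word in sorted(scored, reverse=True)[:topn]]

# analogy("man", "woman", "king", vectors)  ->  typically ["queen"] with good vectors
```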
Again, we're capturing co-occurrence statistics. 00:22:30.520 |
So man might, in its context, often have things like running and 00:22:38.760 |
And then you subtract those kinds of words from the context and 00:22:44.120 |
And in some sense, it's intuitive, though surprising that it works out that well for 00:22:50.480 |
So here are some other examples similar to the king and 00:22:56.560 |
queen example where we basically took these 200 dimensional vectors and 00:23:05.800 |
And what we find is actually quite interestingly, 00:23:09.160 |
even in just the first two principal components of this space, 00:23:12.840 |
we have some very interesting sort of female-male relationships. 00:23:17.240 |
So man to woman is similar to uncle and aunt, brother and sister, 00:23:23.320 |
So this is an interesting semantic relationship that falls out of 00:23:29.960 |
essentially co-occurrence counts in specific windows around each word in 00:23:36.000 |
Here's another one that's more of a syntactic relationship. 00:23:40.160 |
We actually have here superlatives, like slow, slower, and slowest is in a similar 00:23:44.960 |
vector relationship to short, shorter, and shortest, or strong, stronger, and strongest. 00:23:53.960 |
when you see an interesting qualitative result, you want to try to quantify 00:23:59.480 |
who can do better in trying to understand these analogies and 00:24:02.760 |
what are the different models and hyperparameters that modify the performance. 00:24:07.960 |
Now, this is something that you will notice in pretty much every deep learning 00:24:11.320 |
project ever, which is more data will give you better performance. 00:24:14.880 |
It's probably the single most useful thing you can do to a machine learning or 00:24:18.720 |
deep learning system is to train it with more data, and we found that too. 00:24:22.200 |
Now, there are different vector sizes too, which is a common hyperparameter. 00:24:27.200 |
Like I said, usually between 50 to at most 500. 00:24:30.280 |
Here we have 300 dimensions, which essentially gave us the best performance 00:24:36.000 |
for these different kinds of semantic and syntactic relationships. 00:24:40.760 |
Now, in many ways, having a single vector for words can be oversimplifying, right? 00:24:45.880 |
Some words have multiple meanings, maybe they should have multiple vectors. 00:24:50.080 |
Sometimes the word meaning changes over time, and so on. 00:24:56.760 |
So there's a lot of simplifying assumptions here, but again, 00:24:59.720 |
our final goal for deep NLP is going to be to create useful systems. 00:25:04.680 |
And it turns out this is a useful first step to create such systems 00:25:09.400 |
that mimic some human language behavior in order to create useful applications for us. 00:25:17.160 |
All right, but words, word vectors are very useful, but 00:25:22.160 |
And what we really want to do is understand words in their context. 00:25:25.720 |
And so this leads us to the second section here on recurrent neural networks. 00:25:31.880 |
So we already went over the basic definition of standard neural networks. 00:25:37.520 |
Really the main difference between a standard neural network and 00:25:41.560 |
a recurrent neural network, which I'll abbreviate as RNN now, 00:25:45.200 |
is that we will tie the weights at each time step. 00:25:48.880 |
And that will allow us to essentially condition the neural network on all 00:25:53.640 |
In practice, because of how we can optimize it, it won't really be all the previous words. 00:26:01.520 |
in theory, this is what a powerful model can do. 00:26:04.280 |
So let's look at the definition of a recurrent neural network. 00:26:08.240 |
And this is going to be a very important definition, so 00:26:12.760 |
So let's assume for now we have our word vectors as given, and 00:26:17.240 |
we'll represent each sequence in the beginning as just a list of these word vectors. 00:26:22.040 |
Now what we're going to do is we're computing a hidden state, ht, at each time 00:26:27.720 |
step, and the way we're going to do this is with a simple neural network architecture. 00:26:33.600 |
In fact, you can think of this summation here as really just a single 00:26:39.600 |
layer neural network, if you were to concatenate the two matrices and these two 00:27:43.320 |
vectors, but intuitively, we basically will map our current word vector at that 00:26:49.120 |
time step t, sometimes I use these square brackets to denote that we're taking 00:26:54.480 |
the word vector from that time step in there. 00:26:58.800 |
We map that with a linear layer, a simple matrix vector product, and 00:27:02.520 |
we sum up, sum that matrix vector product to another matrix vector product 00:27:07.560 |
of the previous hidden state at the previous time step. 00:27:11.800 |
We sum those two, and we apply in one case a simple sigmoid function 00:27:16.640 |
to define this standard neural network layer. 00:27:20.240 |
That will be ht, and now at each time step we want to predict some kind of class, 00:27:25.440 |
probability over a set of potential events, classes, words, and so on. 00:27:33.280 |
some other communities call it the logistic regression classifier. 00:27:40.520 |
So here we have a simple matrix, Ws for the softmax weights. 00:27:47.680 |
We have basically a number of rows equal to the number of classes 00:27:50.480 |
that we have, and the number of columns is the same as the hidden dimension. 00:27:55.480 |
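Putting the pieces just described into symbols (my notation, following the description rather than reproducing a slide; x_[t] is the word vector at time step t):

```latex
h_t = \sigma\left(W^{(hh)} h_{t-1} + W^{(hx)} x_{[t]}\right), \qquad
\hat{y}_t = \operatorname{softmax}\left(W^{(S)} h_t\right)
```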
Sometimes we want to predict the next word in a sequence in order to be able to 00:28:06.040 |
So for instance, if I ask for a speech recognition system, 00:28:11.520 |
Now in isolation, if you hear wood, you would probably assume it's the W-O-U-L-D, 00:28:17.720 |
auxiliary verb would, but in this particular context, the price of, 00:28:21.680 |
it wouldn't make sense to have a verb following that. 00:28:23.960 |
And so it's more like the W-O-O-D to find the price of wood. 00:28:28.880 |
So language modeling is a very useful task, and it's also very instructive 00:28:33.440 |
to use as an example for where recurrent neural networks really shine. 00:28:38.520 |
So in our case here, this softmax is going to be quite a large matrix that goes over 00:28:43.680 |
the entire vocabulary of all the possible words that we have. 00:28:50.800 |
The classes for language models are the words in our vocabulary. 00:29:00.160 |
the jth one is basically denoting here the probability that the jth word, 00:29:05.280 |
that the jth index will come next after all the previous words. 00:29:09.680 |
It's a very useful model again for speech recognition, for machine translation, 00:29:13.960 |
for just finding a prior for language in general. 00:29:16.720 |
All right, again, main difference to standard neural networks, 00:29:22.920 |
we just have the same set of weights W at all the different time steps. 00:29:26.280 |
Everything else is pretty much a standard neural network. 00:29:31.240 |
We often initialize the first H0 here just either randomly or all zeros. 00:29:38.080 |
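To make the weight tying concrete, here is a minimal numpy sketch of the forward pass of such a language model; the function name, matrix names, and shapes are illustrative only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_lm_forward(word_ids, E, W_hh, W_hx, W_s):
    """E: (vocab, d) word vectors; W_hh: (h, h); W_hx: (h, d); W_s: (vocab, h)."""
    h = np.zeros(W_hh.shape[0])            # h_0 initialized to all zeros
    predictions = []
    for t in word_ids:
        x = E[t]                           # word vector for the word at this time step
        h = sigmoid(W_hh @ h + W_hx @ x)   # the same weights are reused at every time step
        y_hat = softmax(W_s @ h)           # distribution over the next word (whole vocabulary)
        predictions.append(y_hat)
    return predictions
```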
And again, in language modeling in particular, 00:29:46.400 |
Now we can measure basically the performance of language models 00:29:50.680 |
with a metric called perplexity, which really is here based on the average 00:29:57.000 |
log likelihood of being able to predict the next word. 00:30:02.120 |
So you want to really give the highest probability to the word that 00:30:05.840 |
actually will appear next in a long sequence. 00:30:08.200 |
And then the higher that probability is, the lower your perplexity, and 00:30:14.040 |
hence the model is less perplexed to see the next word. 00:30:17.960 |
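Concretely, perplexity over a held-out sequence of T words is usually computed as (my formula, consistent with the description above):

```latex
\text{Perplexity} = \exp\left(-\frac{1}{T}\sum_{t=1}^{T}\log p\left(w_t \mid w_1,\dots,w_{t-1}\right)\right)
```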
In some sense, you can think of language modeling as almost NLP complete, 00:30:23.360 |
in some silly sense that if you can actually predict every single 00:30:28.320 |
word that follows after any arbitrary sequence of words in a perfect way, 00:30:33.520 |
you would have disambiguated a lot of things. 00:30:36.280 |
You can say, for instance, what is the answer to the following question? 00:30:40.120 |
Ask the question, and then the next couple of words would be the predicted answer. 00:30:43.720 |
So there's no way we can actually ever do a perfect job in language modeling. 00:30:48.080 |
But there's certain contexts where we can give a very high probability to 00:30:53.720 |
Now, this is the standard recurrent neural network. 00:30:58.480 |
And one problem with this is that we will modify the hidden state here 00:31:04.240 |
So even if I have words like the, and a, and a sentence period, and 00:31:08.880 |
things like that, it will significantly modify my hidden state. 00:31:15.600 |
Let's say, for instance, I want to train a sentiment analysis algorithm. 00:31:20.840 |
And I talk about movies, and I talk about the plot for a very long time. 00:31:25.240 |
Then I say, man, this movie was really wonderful. 00:31:29.440 |
And then especially the ending, and you talk again for 00:31:31.480 |
like 50 time steps, or 50 words, or 100 words about the plot. 00:31:35.840 |
Now, all these plot words will essentially modify my hidden state. 00:31:39.080 |
So if at the end of that whole sequence I want to classify the sentiment, 00:31:43.600 |
the "great" that I mentioned somewhere in the middle might be completely gone. 00:31:47.720 |
Because I keep updating my hidden state with all these content words that talk 00:31:52.760 |
Now, the way to improve this is by using better kinds of recurrent units. 00:32:04.360 |
so-called gated recurrent units, introduced by Cho et al. 00:32:08.680 |
In some sense, and we'll learn more about the LSTM tomorrow when 00:32:13.760 |
Quoc gives his lecture, but GRUs are in some sense a special case of LSTMs. 00:32:19.240 |
The main idea is that we want to have the ability to keep certain memories 00:32:23.640 |
around without having the current input modify them at all. 00:32:28.760 |
So again, this example of sentiment analysis. 00:32:30.680 |
I say something's great, that should somehow be captured in my hidden state. 00:32:34.560 |
And I don't want all the content words that talk about the plot in the movie 00:32:37.320 |
review to modify that it's actually overall was a great movie. 00:32:42.440 |
And then we also want to allow error messages to flow 00:32:45.480 |
at different strengths depending on the input. 00:32:47.600 |
So if I say, great, I want that to modify a lot of things in the past. 00:32:56.120 |
Fortunately, since you already know the basic Lego block of a standard neural 00:32:59.960 |
network, there's only really one or two subtleties here that are different. 00:33:03.920 |
There are a couple of different steps that we'll need to compute 00:33:10.920 |
So in the standard RNN, what we did was just have this one single neural network 00:33:15.560 |
that we hope would capture all this complexity of the sequence. 00:33:18.680 |
Instead now, we'll first compute a couple of gates at that time step. 00:33:23.280 |
So the first thing we'll compute is the so-called update gate. 00:33:26.880 |
It's just yet another neural network layer based on the current input word vector and 00:33:32.920 |
So these look quite familiar, but this will just be an intermediate value and 00:33:39.480 |
Then we'll also compute a reset gate, which is yet another standard neural network layer. 00:33:45.160 |
Again, just matrix vector product, summation matrix vector product, 00:33:48.960 |
some kind of non-linearity here, namely a sigmoid. 00:33:51.840 |
It's actually important in this case that it is a sigmoid. 00:33:54.360 |
Just basically, both of these will be vectors with numbers that are between 0 and 1. 00:33:59.360 |
Now, we'll compute a new memory content, an intermediate h-tilde here, 00:34:06.640 |
with yet another neural network, but then we have this little funky symbol in here. 00:34:12.800 |
Basically, this will be an element-wise multiplication. 00:34:15.520 |
So basically, what this will allow us to do is if that reset gate is 0, 00:34:22.640 |
we can essentially ignore all the previous memory elements and 00:34:30.640 |
So for instance, if I talked for a long time about the plot, 00:34:37.880 |
Now you want to basically be able to ignore if your whole goal of this sequence 00:34:42.360 |
classification model is to capture sentiment, you want to be able to ignore 00:34:47.240 |
This is, of course, if this was entirely a zero vector. 00:34:52.560 |
This is a long vector of maybe 100 or 200 dimensions, so 00:34:56.320 |
maybe some dimensions should be reset, but others maybe not. 00:34:59.520 |
And then here we'll have our final memory, and 00:35:08.160 |
the update gate will combine the previous hidden state and this intermediate one at our current time step. 00:35:13.160 |
And what this will allow us to do is essentially also say, well, 00:35:16.000 |
maybe you want to ignore everything that's currently happening and 00:35:21.960 |
We basically copy over the previous time step and 00:35:25.000 |
the hidden state of that and ignore the current thing. 00:35:30.560 |
maybe there's a lot of talk about the plot when the movie was released. 00:35:34.320 |
You want to basically have the ability to ignore that and 00:35:36.400 |
just copy over what the review may have said in the beginning, that it was an awesome movie. 00:35:39.760 |
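Collecting the steps just described into equations (one common convention, consistent with the copying behavior described above; some papers swap the roles of z and 1 - z; the circle is an element-wise product):

```latex
\begin{aligned}
z_t &= \sigma\left(W^{(z)} x_t + U^{(z)} h_{t-1}\right) && \text{(update gate)}\\
r_t &= \sigma\left(W^{(r)} x_t + U^{(r)} h_{t-1}\right) && \text{(reset gate)}\\
\tilde{h}_t &= \tanh\left(W x_t + r_t \circ U h_{t-1}\right) && \text{(new memory content)}\\
h_t &= z_t \circ h_{t-1} + \left(1 - z_t\right) \circ \tilde{h}_t && \text{(final memory)}
\end{aligned}
```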
So here's an attempt at a clean illustration. 00:35:42.600 |
I have to say, personally, I, in the end, find the equations a little more intuitive 00:35:46.840 |
than the visualizations that we tried to do, but some people are more visual here. 00:35:51.280 |
So this is, in some ways, basically here we have our word vector and 00:35:57.920 |
And then some of these layers will essentially modify other 00:36:06.920 |
So this is a pretty nifty model and it's really the second most important 00:36:12.160 |
basic Lego block that we're going to learn about today. 00:36:18.320 |
And so just want to make sure we take a little bit of time, 00:36:22.560 |
Again, if the reset gate, this R value, is close to zero, 00:36:27.800 |
those kinds of hidden dimensions are basically allowed to be dropped. 00:36:38.560 |
And if the update gate is close to one, then we can copy information of that unit through many, many different time steps. 00:36:44.560 |
And if you think about optimization a lot, what this will also mean is that 00:36:48.680 |
the gradient can flow through the recurrent neural network through multiple 00:36:52.520 |
time steps until it actually matters and you want to update a specific word, 00:36:56.680 |
for instance, and go all the way through many different time steps. 00:37:01.000 |
So then what this also allows us is to actually have some units 00:37:12.280 |
Some you might want to reset every other word, other ones you might really keep 00:37:16.720 |
around, like they have some long term context and they stay around for much longer. 00:37:24.640 |
It's the second most important building block for today. 00:37:28.120 |
There are, like I said, a lot of other variants of recurrent neural networks. 00:37:33.440 |
Lots of amazing work in that space right now, and 00:37:36.400 |
tomorrow Quoc will talk a lot about some more advanced methods. 00:37:47.080 |
neural network sequence models, you really have the two most important concepts for 00:37:55.240 |
We can now, in some ways, really play around with those two Lego blocks, 00:38:00.640 |
plus some slight modifications of them, very creatively, and 00:38:06.880 |
A lot of the models that I'll show you and that you can read and see and 00:38:10.440 |
read the latest papers that are now coming out almost every week on arXiv, 00:38:17.880 |
will use really these two components in a major way. 00:38:21.400 |
Now, this is one of the few slides now with something really, really new, 00:38:29.640 |
the people who already knew all this stuff and took the class and everything. 00:38:33.160 |
This is tackling an important problem, which is, in all these models 00:38:38.400 |
that you'll see in pretty much most of these papers, 00:38:42.160 |
we have in the end one final softmax here, right? 00:38:46.680 |
And that softmax is basically our default way of classifying what we can see next, 00:38:53.480 |
The problem with that is, of course, that that will only ever predict accurately 00:38:58.200 |
frequently seen classes that we had at training time. 00:39:01.840 |
But in the case of language modeling, for instance, where our classes are the words, 00:39:06.360 |
we may see at test time some completely new words. 00:39:09.080 |
Maybe I'm just going to introduce to you a new name, Srini, for instance, and 00:39:14.520 |
nobody may have seen that word at training time. 00:39:19.520 |
But now that I mentioned him, and I will introduce him to you, 00:39:22.680 |
you should be able to predict the word Srini and that person in a new context. 00:39:28.360 |
And so the solution that we're literally going to release only next week in 00:39:32.600 |
the new paper is to essentially combine the standard softmax that we can train 00:39:39.280 |
And that pointer component will allow us to point to previous contexts and 00:39:46.840 |
So let's, for instance, take the example here of language modeling again. 00:39:50.320 |
We may read a long article about the Fed Chair, Janet Yellen. 00:39:55.720 |
And maybe the word Yellen had not appeared in training time before, so 00:40:00.200 |
we couldn't ever predict it, even though we just learned about it. 00:40:03.640 |
And now a couple of sentences later, interest rates were raised, and 00:40:07.000 |
then "Ms.", and now we want to predict that next word. 00:40:11.200 |
Now, if that hadn't appeared in our softmax standard training procedure at 00:40:15.680 |
training time, we would never be able to predict it. 00:40:18.960 |
What this model will do, and we're calling it a pointer sentinel mixture model, 00:40:23.080 |
is it will essentially first try to see would any of these previous words 00:40:29.880 |
So we can really take into consideration the previous context of, say, 00:40:34.440 |
And if we see that word and that word makes sense after we train it, of course, 00:40:38.720 |
then we might give a lot of probability mass to just that word at this current 00:40:43.120 |
position in our previous immediate context at test time. 00:40:50.680 |
And then we have the sentinel, which is basically going to be the rest of the probability if we cannot refer to 00:40:57.360 |
And that one will go directly to our standard softmax. 00:41:02.440 |
And then what we'll essentially have is a mixture model that allows us to use 00:41:06.840 |
either one or a combination of both: essentially words that just 00:41:12.160 |
appeared in this context and words that we saw in our standard softmax at training time. 00:41:18.720 |
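Schematically, the resulting distribution over the next word is a mixture of the two components, with a gate g derived from the sentinel (my paraphrase of the model just described): here p_vocab is the standard softmax over the training vocabulary and p_ptr is the attention distribution over words in the immediately preceding context.

```latex
p\left(w \mid \text{context}\right) = g\, p_{\text{vocab}}\left(w\right) + \left(1 - g\right) p_{\text{ptr}}\left(w\right)
```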
So I think this is a pretty important next step because it will allow us 00:41:23.680 |
to predict things we've never seen at training time. 00:41:25.800 |
And that's something that's clearly a human capability that most, or 00:41:29.520 |
pretty much none of these language models had before. 00:41:32.600 |
And so to look at how much it actually helps, 00:41:35.520 |
it'll be interesting to look at some of the performance before. 00:41:39.200 |
So again, what we're measuring here is perplexity. 00:41:41.800 |
And the lower the better, because it's essentially inverse here 00:41:47.720 |
of the actual probability that we assign to the correct next word. 00:41:51.800 |
And in just 2010, so six years ago, this was some great work, 00:41:58.480 |
early work by Tomas Mikolov, where he compared to a lot of standard 00:42:03.280 |
natural language processing methods, syntactic models 00:42:09.360 |
that essentially tried to predict the next word and had a perplexity of 107. 00:42:13.440 |
And he was able to use the standard recurrent neural networks, and 00:42:17.760 |
actually an ensemble of eight of them, to really significantly push down 00:42:22.080 |
the perplexity, especially when you combine it with standard 00:42:28.960 |
So in 2010, he made great progress by pushing it down to 87. 00:42:34.800 |
And now this is one of the great examples of how much progress 00:42:39.120 |
is being made in the field, thanks to deep learning, where two years ago, 00:42:44.680 |
Wojciech Zaremba and his collaborators were able to push that down 00:42:49.960 |
even further to 78 with a very large LSTM, similar to a GRU-like model, 00:42:57.480 |
Quoc will teach you the basics of LSTMs tomorrow. 00:43:02.240 |
Then last year, the performance was pushed down even further by Yarin Gal. 00:43:07.800 |
And then this one actually came out just a couple weeks ago, 00:43:13.840 |
variational recurrent highway networks, pushed it down even further. 00:43:17.280 |
But this Pointer Sentinel model is able to get it down to 70. 00:43:23.800 |
we pushed it down by more than 10 perplexity points in two years. 00:43:28.840 |
And that is really the increased speed of progress that we're seeing now 00:43:34.040 |
that deep learning is changing a lot of areas of natural language processing. 00:43:38.240 |
All right, now we have our basic Lego blocks, 00:43:43.080 |
the word vectors and the GRU sequence models. 00:43:47.240 |
And now we can talk a little bit about some of the ongoing research that we're 00:43:53.520 |
And I'll start that with maybe a controversial question, 00:43:56.640 |
which is, could we possibly reduce all NLP tasks to essentially 00:44:03.400 |
question answering tasks over some kind of input? 00:44:06.600 |
And in some ways, that's a trivial observation that you could do that, but 00:44:11.600 |
it actually might help us to think of models that could take any kind of input, 00:44:16.960 |
a question about that input, and try to produce an output sequence. 00:44:21.440 |
So let me give you a couple of examples of what I mean by this. 00:44:26.000 |
So here we have, the first one is a task that we would 00:44:30.200 |
standardly associate with question answering. 00:44:34.280 |
Mary walked to the bathroom, Sandra went to the garden, Daniel went back to 00:44:38.160 |
the garden, Sandra took the milk there, where's the milk? 00:44:51.160 |
Now I'd have to maybe do anaphora resolution, find out what "there" refers 00:44:56.400 |
to, and then you try to find the previous sentence that mentions Sandra, 00:45:01.480 |
see that it's garden, and then give the answer garden. 00:45:04.040 |
So this is a simple logical reasoning question answering task. 00:45:08.280 |
And that's what most people in the QA field sort of associated with 00:45:21.120 |
All right, so this is a different subfield of NLP that tackles sentiment analysis. 00:45:26.720 |
We can go further and ask, what are the named entities of a sentence like, 00:45:31.480 |
Jane has a baby in Dresden, and you want to find out that Jane is a person and 00:45:34.840 |
Dresden is a location, and this is an example of sequence tagging. 00:45:38.320 |
You can even go as far as to say, I think this model is incredible, and 00:45:43.840 |
the question is, what's the translation into French? 00:45:52.760 |
that in some ways would be phenomenal if we're able to actually 00:45:57.600 |
tackle all these different kinds of tasks with the same kind of model. 00:46:03.720 |
So maybe it would be an interesting new goal for NLP to try to 00:46:08.480 |
develop a single joint model for general question answering. 00:46:13.200 |
I think it would push us to think about new kinds of sequence models and 00:46:20.400 |
new kinds of reasoning capabilities in an interesting way. 00:46:23.960 |
Now, there are two major obstacles to actually achieving 00:46:27.040 |
the single joint model for arbitrary QA tasks. 00:46:30.240 |
The first one is that we don't even have a single model architecture that gets 00:46:34.480 |
consistent state of the art results across a variety of different tasks. 00:46:39.520 |
So for instance, for question answering, this is a data set called bAbI, where 00:46:46.600 |
strongly supervised memory networks get the state of the art. 00:46:49.800 |
For sentiment analysis, you had tree LSTM models 00:46:53.880 |
developed by Kai-Sheng Tai here at Stanford last year. 00:47:00.800 |
you might have bidirectional LSTM conditional random fields. 00:47:04.840 |
One thing you do notice is all the current state of the art 00:47:09.640 |
Sometimes they still connect to other traditional methods like conditional 00:47:14.240 |
random fields and undirected graphical models. 00:47:16.280 |
But there's always some kind of deep learning component in them. 00:47:24.280 |
The second one is that really fully joint multi-task learning 00:47:30.440 |
Usually when we do do it, we restrict it to lower layers. 00:47:35.440 |
So for instance, in natural language processing, 00:47:37.720 |
all we're currently able to share in some principled way are word vectors. 00:47:42.560 |
We take the same word vectors we train, for instance, with GloVe or Word2Vec, and 00:47:46.120 |
we initialize our deep neural network sequence models with those word vectors. 00:47:51.920 |
In computer vision, we're actually a little further ahead, and 00:47:56.280 |
you're able to use multiple of the different layers. 00:47:59.680 |
And you initialize a lot of your CNN models with a first pre-trained CNN that 00:48:07.040 |
Now, usually people evaluate multi-task learning with only two tasks. 00:48:14.640 |
then they evaluate the model that they initialized from the first on the second 00:48:18.880 |
task, but they often ignore how much the performance degrades on the original task. 00:48:23.880 |
So when somebody takes an ImageNet CNN and applies it to a new problem, 00:48:29.160 |
say how much did my accuracy actually decrease on the original data set? 00:48:32.800 |
And furthermore, we usually only look at tasks that are actually related, and 00:48:38.320 |
then we find, look, there's some amazing transfer learning capability going on. 00:48:42.920 |
What we don't look at often in the literature and 00:48:46.480 |
most people's work is that when the tasks aren't related to one another, 00:48:51.920 |
And this is so-called catastrophic forgetting. 00:48:55.040 |
There's not too much work around that right now. 00:48:58.960 |
Now, I also would like to say that right now, 00:49:07.920 |
classifier for a variety of different kinds of outputs, right? 00:49:12.800 |
We at least replace the softmax to try to predict different kinds of problems. 00:49:18.440 |
All right, so this is the second obstacle now. 00:49:21.920 |
For now, we'll only tackle the first obstacle, and 00:49:24.760 |
this is basically what motivated us to come up with dynamic memory networks. 00:49:29.480 |
They're essentially an architecture to try to tackle arbitrary question answering 00:49:33.980 |
When I'll talk about dynamic memory networks, it's important to note here 00:49:38.960 |
that for each of the different tasks I'll talk about, 00:49:50.800 |
So the high level idea for DMNs is as follows. 00:49:55.040 |
Imagine you had to read a bunch of facts like these here. 00:49:59.160 |
They're all very simple in and of themselves. 00:50:02.200 |
But if I now ask you a question, I showed you these, and I ask, where's Sandra? 00:50:08.040 |
It'd be very hard, even if you read them, all of them, 00:50:13.080 |
And so the idea here is that for complex questions, 00:50:16.520 |
we might actually want to allow you to have multiple glances at the input. 00:50:22.600 |
And just like I promised, one of our most important basic Lego blocks will be this 00:50:29.240 |
GRU we just introduced in the previous section. 00:50:31.520 |
Now, here's this whole model in all its gory details. 00:50:37.080 |
And we'll dive into all of that in the next couple of slides, so don't worry. 00:50:45.240 |
So the first one is, I think we're moving in deep learning now to try to use more 00:50:53.360 |
Basically to modularize, encapsulate certain capabilities, and 00:50:57.800 |
then take those as basic Lego blocks and build more complex models on top of them. 00:51:06.320 |
that's like one little block in a complex paper, and then other things happen on top. 00:51:10.360 |
Here we'll have the GRU or word vectors basically as one module, 00:51:19.200 |
And I'm not even mentioning word vectors anymore, but 00:51:23.640 |
And each of these words is essentially represented as this word vector, but 00:51:29.280 |
Okay, so let's walk on a very high level through this model. 00:51:32.600 |
There are essentially four different modules. 00:51:35.040 |
There's the input module, which will be a neural network sequence model, a GRU. 00:51:40.320 |
There's a question module, an episodic memory module, and an answering module. 00:51:45.320 |
And sometimes we also have these semantic memory modules here, but for 00:51:49.320 |
now these are really just our word vectors, and we'll ignore that for now. 00:51:54.720 |
Here is our corpus, and our question is, where is the football? 00:51:58.160 |
And this is our input that should allow us to answer this question. 00:52:02.760 |
Now if I ask this question, I will essentially use the final representation 00:52:08.880 |
of this question to learn to pay attention to the right kinds of inputs that seem 00:52:13.520 |
relevant for given what I know to answer this question. 00:52:18.680 |
Well, it would make sense to basically pay attention to all the sentences that 00:52:22.720 |
mention football, and maybe especially the last ones if the football moves around a lot. 00:52:27.360 |
So what we'll observe here is that this last sentence will get a lot of attention. 00:52:34.320 |
And now what we'll basically do is that this hidden state of this 00:52:38.960 |
recurrent neural network model will be given as input to another recurrent neural 00:52:44.400 |
network because it seemed relevant to answer this current question at hand. 00:52:48.440 |
Now we'll basically agglomerate all these different facts 00:52:54.080 |
that seem relevant at the time in this another GRU, in this final vector m. 00:52:58.560 |
And now this vector m together with the question will be used to go over 00:53:02.360 |
the inputs again if the model deems that it doesn't have enough information yet to 00:53:07.640 |
So if I ask you where's the football and it so 00:53:09.720 |
far only found that John put down the football, you don't know enough. 00:53:13.280 |
You still don't know where it is, but you now have a new fact, 00:53:15.800 |
namely John seems relevant to answer the question. 00:53:18.680 |
And that fact is now represented in this vector m, 00:53:21.960 |
which is also just the last hidden state of another recurrent neural network. 00:53:29.840 |
the football are relevant, we'll learn to pay attention to "John moved to the bedroom." 00:53:39.040 |
Again, those are going to get agglomerated here in this recurrent neural network. 00:53:44.920 |
And now the model thinks that it actually knows enough because it 00:53:50.440 |
basically intrinsically captured things about the football. 00:53:56.760 |
Of course, we didn't have to tell it anything about people or 00:54:00.120 |
their locations, or that if X moves to Y and Y is in the set of locations, 00:54:05.760 |
You just give it a lot of stories like that and 00:54:07.800 |
in its hidden states it will capture these kinds of patterns. 00:54:11.520 |
So then we have the final vector m and we'll give that to an answer module, 00:54:16.920 |
which produces in our standard softmax way the answer. 00:54:21.880 |
All right, now let's zoom into the different modules of this overall dynamic 00:54:28.840 |
The input, fortunately, is just a standard GRU, the way we defined it before. 00:54:34.480 |
So simple word vectors, hidden states, reset gates, update gates, and so on. 00:54:39.840 |
The question module is also just a GRU, a separate one with its own weights. 00:54:49.160 |
And the final vector q here is just going to be the last hidden state of that 00:54:56.200 |
Now, the interesting stuff happens in the episodic memory module, 00:54:59.160 |
which is essentially a sort of meta-gated GRU, 00:55:05.960 |
where this gate is basically defined such that 00:55:13.840 |
it will say whether this current sentence s_i here seems to matter. 00:55:19.880 |
And the superscript T is the episode that we have. 00:55:23.960 |
So each episode basically means we're going over the input entirely one time. 00:55:33.240 |
And what this basically will allow us to do is to say, well, if G is 0, 00:55:38.640 |
then what we'll do is basically just copy over the past states from the input. 00:55:47.480 |
And unlike before in all these GRU equations, 00:55:56.000 |
if this gate is 0, it means this sentence is completely irrelevant to my current question at hand. 00:56:03.040 |
And there are lots of examples, like married travel to the hallway, 00:56:06.920 |
that are just completely irrelevant to answering the current question. 00:56:13.720 |
we're just copying the previous hidden state of this recurrent neural network over. 00:56:22.080 |
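In equation form, the gated update just described is roughly the following, where s_i is the representation of sentence i, h the episodic module's hidden state, and the superscript t the episode (my notation):

```latex
h_i^{t} = g_i^{t}\,\mathrm{GRU}\left(s_i, h_{i-1}^{t}\right) + \left(1 - g_i^{t}\right) h_{i-1}^{t}
```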
So now, of course, the big question is, how do we compute this G? 00:56:26.720 |
And this might look a little ugly, but it's quite simple. 00:56:30.080 |
Basically, we're going to compute two kinds of vector similarities, a multiplicative one and 00:56:35.480 |
an additive one with absolute values, between all the single values of the sentence vector 00:56:41.480 |
that we currently have, the question vector, and 00:56:44.200 |
the memory state of the previous pass over the input. 00:56:47.840 |
On the first pass over the input, the memory state is initialized to be just 00:56:52.920 |
the question, and then afterwards, it agglomerates relevant facts. 00:56:56.320 |
So intuitively here, if the sentence mentions football, for instance, and 00:57:02.320 |
the question is, 00:57:04.960 |
where's the football, then you'd hope that the question vector q 00:57:09.160 |
has some units that are more active because football was mentioned. 00:57:13.000 |
And the sentence vector mentions football, so 00:57:14.880 |
there are some units that are more active because football is mentioned. 00:57:20.720 |
absolute values of subtractions are going to be large. 00:57:25.000 |
And then what we're going to do is just plug that 00:57:28.640 |
through a standard single-layer neural network, and then a standard linear layer 00:57:32.600 |
here, and then we apply a softmax to essentially weight all of these 00:57:37.600 |
different potential sentences that we might have to compute the final gates. 00:57:41.880 |
So this will basically be a soft attention mechanism that sums to one and 00:57:46.920 |
will pay most attention to the facts that seem most relevant given the question and the current memory. 00:57:53.520 |
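A rough sketch of that gate computation, assuming the sentence vectors, question q, and previous memory m share one dimension; the exact feature set and layer sizes are illustrative rather than the paper's precise configuration.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Scores each candidate sentence against the question and current memory,
    then softmaxes the scores into gates that sum to one."""
    def __init__(self, hidden_dim, attn_dim=128):
        super().__init__()
        self.layer1 = nn.Linear(4 * hidden_dim, attn_dim)  # single hidden layer
        self.layer2 = nn.Linear(attn_dim, 1)                # linear scoring layer

    def forward(self, sentences, q, m):
        # sentences: (num_sents, hidden_dim); q, m: (hidden_dim,)
        q = q.expand_as(sentences)
        m = m.expand_as(sentences)
        feats = torch.cat([sentences * q,             # multiplicative similarity to question
                           sentences * m,             # multiplicative similarity to memory
                           torch.abs(sentences - q),  # additive (difference) similarity
                           torch.abs(sentences - m)], dim=-1)
        scores = self.layer2(torch.tanh(self.layer1(feats))).squeeze(-1)
        return torch.softmax(scores, dim=0)           # gates over the candidate sentences
```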
Then when the end of the input is reached, all these relevant facts here 00:57:59.520 |
are summarized in another GRU that basically moves up here. 00:58:03.080 |
And you can train a classifier also, if you have the right kind of supervision, 00:58:09.680 |
to decide whether the model knows enough to actually answer the question and 00:58:15.760 |
can stop iterating. If you don't have that kind of supervision, you can also just say, 00:58:19.520 |
I will go over the inputs a fixed number of times, and then stop. 00:58:27.200 |
All right, there's a lot to sink in, so I'll give you a couple seconds. 00:58:31.720 |
Basically, we pay attention to different facts given a certain question. 00:58:36.200 |
We iterate over the input multiple times, and we agglomerate the facts that seem 00:58:41.200 |
relevant given the current knowledge and the question. 00:58:44.240 |
Now, I don't usually talk about neuroscience. 00:58:47.480 |
I'm not a neuroscientist, but there is a very interesting relationship here that 00:58:51.920 |
a friend of mine, Sam Gershman, pointed out, which is that the episodic memory, 00:58:56.480 |
in general for humans, is actually the memory of autobiographical events. 00:59:01.280 |
So it's the kind of memory we use when we remember the first time we went to school. 00:59:06.040 |
And it's essentially a collection of our past personal experiences that occurred at a particular time and place. 00:59:12.400 |
And just like our episodic memory that can be triggered with a variety of 00:59:16.560 |
different inputs, this episodic memory module is also triggered by its inputs, namely the question and the facts it reads. 00:59:24.080 |
And what's also interesting is the hippocampus, 00:59:26.080 |
which is the seat of the episodic memory in humans, 00:59:28.760 |
is actually active during transitive inference. 00:59:31.040 |
So transitive inference is going from A to B and B to C to get some connection from A to C. 00:59:36.400 |
Or in this case here, with this football, for instance, you first had to find facts 00:59:40.920 |
about John and the football, then find where John was, and combine the two. 00:59:45.560 |
So those are examples of transitive inference. 00:59:48.000 |
And it turns out that you also need, in the DMN, 00:59:53.160 |
these multiple passes to enable the capability to do transitive inference. 00:59:58.440 |
Now, the final answer module, again, is a very simple GRU with a softmax on top. 01:00:07.280 |
The main difference here is that instead of just having 01:00:11.120 |
the previous hidden state, a_{t-1}, as input, 01:00:14.520 |
we'll also include the question at every time step. 01:00:17.600 |
And we will include the answer that was generated at the previous time step. 01:00:22.840 |
But other than that, it's our standard softmax. 01:00:24.800 |
We use the standard cross-entropy error to train it. 01:00:27.320 |
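Here is a small sketch of such an answer decoder, feeding the question and the previously predicted answer back in at each step; the class and argument names are mine, and details like feeding back the softmax distribution rather than a sampled token are simplifications.

```python
import torch
import torch.nn as nn

class AnswerModule(nn.Module):
    """GRU decoder whose input at each step is the previous answer token's
    representation concatenated with the question vector; a softmax over the
    vocabulary produces each answer word."""
    def __init__(self, hidden_dim, vocab_size):
        super().__init__()
        self.cell = nn.GRUCell(vocab_size + hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, memory, q, num_steps):
        a = memory                                             # initialize with final episodic memory
        prev = torch.zeros(q.size(0), self.out.out_features)   # "previous answer" is empty at t = 0
        outputs = []
        for _ in range(num_steps):
            a = self.cell(torch.cat([prev, q], dim=-1), a)
            logits = self.out(a)
            outputs.append(logits)
            prev = torch.softmax(logits, dim=-1)               # feed the prediction back in
        return torch.stack(outputs, dim=1)                     # train with cross-entropy against the answer
```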
And now, the beautiful thing of this whole model is that it's end-to-end trainable. 01:00:31.640 |
These four different modules will actually all train, 01:00:35.760 |
based on the cross-entropy error of that final softmax. 01:00:38.680 |
All these different modules communicate with vectors, and 01:00:42.480 |
we'll just have delta messages and back propagation to train them. 01:00:46.120 |
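A minimal sketch of what that end-to-end training could look like, assuming a hypothetical DMN wrapper around the modules sketched above; the point is only that a single cross-entropy loss on the final softmax backpropagates through all four modules.

```python
import torch.nn as nn
import torch.optim as optim

def train(dmn, data_loader, num_epochs=10, lr=1e-3):
    """dmn(story, question) is assumed to return answer logits over the vocabulary.
    One cross-entropy loss on those logits trains the input, question,
    episodic memory, and answer modules jointly via backpropagation."""
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(dmn.parameters(), lr=lr)
    for _ in range(num_epochs):
        for story, question, answer in data_loader:
            logits = dmn(story, question)        # forward pass through all four modules
            loss = criterion(logits, answer)     # error signal only at the final softmax
            optimizer.zero_grad()
            loss.backward()                      # delta messages flow back into every module
            optimizer.step()
```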
Now, there's been a lot of work in the last two years on models like this. 01:00:52.360 |
In fact, Quoc will cover a lot of these really interesting models tomorrow, 01:00:57.400 |
different types of memory structures, and so on. 01:00:59.880 |
And the dynamic memory network is, in some sense, one of those models. 01:01:09.040 |
It's worth comparing it to memory networks from Jason Weston in particular, because there are a lot of similarities. 01:01:14.200 |
Those also have input, scoring, attention, and response mechanisms. 01:01:21.400 |
The main difference is that they use different kinds of basic Lego blocks for 01:01:29.920 |
each of these functions. For input, they use bag-of-words representations or nonlinear and 01:01:37.920 |
linear embeddings of the sentences, and they have different kinds of iteratively run functions. 01:01:40.800 |
The main interesting sort of difference to the DMN is that the DMN really uses 01:01:47.160 |
recurrent neural network type sequence models for 01:01:49.920 |
all of these different modules and capabilities. 01:01:53.960 |
And in some sense, that helps us to have a broader range of applications, including the ones we'll see next. 01:02:00.240 |
And so let me go over a couple of results and experiments of this model. 01:02:04.920 |
So the first one is on this bAbI dataset that Facebook published. 01:02:13.160 |
It basically has a lot of these kinds of simple, logical reasoning type questions. 01:02:18.600 |
In fact, all these, like where's the football? 01:02:20.680 |
Those were examples from the Facebook bAbI dataset. 01:02:24.480 |
And it also includes things like yes/no questions, simple counting, 01:02:29.560 |
negation, some indefinite knowledge where the answer might be maybe. 01:02:33.040 |
Basic coreference, where you have to realize who she refers to, and so on. 01:02:45.040 |
And basically, this dynamic memory network, I think, is currently the state of 01:02:48.920 |
the art on this dataset of the simple logical reasoning. 01:02:54.000 |
Now, the problem with this dataset is that it's a synthetic dataset. 01:02:58.480 |
And so it had only a certain set of human-defined 01:03:05.720 |
generative functions that created certain patterns. 01:03:09.320 |
And in that sense, solving it, sometimes with 100% accuracy, is only a necessary and 01:03:14.320 |
not a sufficient condition for real question answering. 01:03:20.920 |
The main interesting bit to point out here is that there are different 01:03:24.880 |
numbers of training examples for each of these different subtasks. 01:03:30.400 |
And so you have basically 1,000 examples of simple negation, for instance. 01:03:35.000 |
And it's always a similar kind of pattern, so the model can pick that pattern up. 01:03:39.440 |
Now, in real language, you will never have that many examples for 01:03:44.400 |
exactly the same kind of pattern. So general question answering is still an open problem and an exciting direction. 01:03:49.240 |
Now, what's cool is this same architecture of allowing the model to go over inputs 01:03:55.120 |
multiple times also got state of the art in sentiment analysis. 01:04:02.720 |
And we actually analyzed whether it's really helpful to have 01:04:07.120 |
multiple passes over the input, and it turns out it is. 01:04:10.440 |
So there are certain things like reasoning over three facts or counting, 01:04:15.040 |
where you really have to have this dynamic, this episodic memory module, and multiple passes over the input. 01:04:23.400 |
For sentiment, it actually turns out it hurts 01:04:26.520 |
after going over the input more than two times. 01:04:30.320 |
And that's actually one of the things we're now working on is, 01:04:32.680 |
can we find a single model that does the same thing for 01:04:35.640 |
every single input with the same weights to try to learn these different tasks? 01:04:40.240 |
We can actually look at a couple of fun examples of this model on sentiment. 01:04:52.840 |
In sentiment, you can probably get to 75% accuracy with some very simple models. 01:04:58.240 |
You basically just find positive words like great and wonderful and awesome, and 01:05:02.480 |
you'll get to something that's roughly right. 01:05:04.840 |
Here are some of the examples, the kinds of examples that you now 01:05:09.440 |
need to get right to really push the state of the art further in sentiment analysis. 01:05:14.360 |
So here, the sentence is: in its ragged, cheap, and unassuming way, the movie works. 01:05:19.880 |
So this sentence is classified incorrectly if you take the DMN, 01:05:24.400 |
with this whole architecture, but only allow one pass over the input. 01:05:28.600 |
Once you allow two passes over the input, it actually learns to pay attention, 01:05:37.600 |
in the end, to the movie working. 01:05:41.600 |
So here, these highlighted fields are essentially the gating function g 01:05:47.200 |
that we defined, which pays attention to specific words. 01:05:51.160 |
And the darker it is, the larger that gate is, the more open it is, and 01:05:56.200 |
the more that word affects the hidden state in the episodic memory module. 01:06:00.960 |
So it goes over the input the first time, pays attention to cheap and 01:06:07.520 |
unassuming and way, and a little bit to works too. 01:06:11.280 |
But the second time, it basically agglomerated the facts of 01:06:15.480 |
that sentence, and then learned to pay attention more to the specific words that matter. 01:06:25.800 |
Here's another example: my response to the film is best described as lukewarm. 01:06:29.280 |
So in general, in sentiment analysis, when you look at unigram scores, 01:06:36.440 |
the word best is basically one of the most positive words you could use. 01:06:42.960 |
And the first time the model passes over the sentence, 01:06:46.080 |
it also pays most attention to this incredibly positive word, namely best. 01:06:51.000 |
But then, once it agglomerated the context, it actually realizes, well, 01:06:55.160 |
best here is actually not used as an adjective, 01:07:00.800 |
but as an adverb modifying described, and 01:07:04.360 |
what is being described is actually lukewarm, and hence it's actually a negative sentence. 01:07:09.120 |
So those are the kinds of examples that you need to get right now to appreciate 01:07:13.120 |
improvements in sentiment analysis. On this particular dataset, 01:07:18.720 |
the neural network type models started at around 82%. 01:07:24.000 |
Before then, that same dataset had existed for around eight years, and 01:07:28.680 |
none of the standard NLP models had reached above 80% accuracy. 01:07:36.200 |
And those are the kinds of improvements that you see across a variety of 01:07:40.760 |
different NLP tasks now that deep learning has come and 01:07:45.600 |
deep learning techniques are being used in NLP. 01:07:49.480 |
And now, the last task in NLP that this model turned out to also work incredibly well on is part-of-speech tagging. 01:07:54.800 |
Now, part-of-speech tagging is a less exciting task. 01:07:58.520 |
But it's still fascinating to see that after this data set has been around for 01:08:04.560 |
over 20 years, you can still improve the state of the art with the same kind of 01:08:09.320 |
architecture that also did well on fuzzy reasoning of sentiment and 01:08:12.640 |
discrete logical reasoning for question answering. 01:08:16.080 |
Now, we had a new person join the group, Caiming. 01:08:25.040 |
But he was more of a computer vision researcher. 01:08:28.480 |
And so he thought, well, could I use this great question answering module for visual question answering, 01:08:35.560 |
to combine some of what was going on in the group in NLP with computer vision? 01:08:40.880 |
And he did not have to know all of the different aspects of the code. 01:08:47.400 |
All he had to do was change the input module from one that gives you 01:08:52.080 |
hidden states at each word over a long sequence of words and 01:08:57.920 |
sentences, to an input module that would give him vectors for regions of an image. 01:09:04.880 |
And he literally did not touch some of the other parts of the code. 01:09:09.040 |
He only had to look carefully at this input module where, again, 01:09:14.880 |
we have our basic Lego block that Andrej introduced really well, a convolutional 01:09:21.080 |
neural network, and the convolutional neural network will essentially give us 01:09:25.240 |
14 by 14 many vectors in one of its top layers, one for each image region. 01:09:34.520 |
And then what we'll do is basically take those vectors and 01:09:37.200 |
replace the word vectors we used to have with these CNN vectors, and feed them in. 01:09:43.600 |
Now again, the GRU, we know as our basic Lego block, we already defined it. 01:09:48.960 |
One addition here is that it'll actually be a bidirectional GRU. 01:09:53.360 |
We'll go once from left to right in this snake-like fashion, and 01:09:57.840 |
another one goes from right to left backwards. 01:10:01.440 |
Now both of these will basically have a hidden state, and you can just concatenate 01:10:05.440 |
the hidden states of both of these to compute the final hidden state for each image region. 01:10:13.200 |
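A rough sketch of that visual input module, assuming a CNN layer that exposes a 14-by-14 grid of feature vectors and a bidirectional GRU over the patches; the scan order and names are illustrative (a true snake-like traversal would reverse every other row), not the exact published code.

```python
import torch
import torch.nn as nn

class VisualInputModule(nn.Module):
    """Projects 14x14 CNN feature vectors into the word-vector space and runs a
    bidirectional GRU over them, so each image patch gets a context-aware state."""
    def __init__(self, cnn_feat_dim, hidden_dim):
        super().__init__()
        self.project = nn.Linear(cnn_feat_dim, hidden_dim)   # CNN features -> "word vector" space
        self.bigru = nn.GRU(hidden_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, cnn_features):
        # cnn_features: (batch, 14, 14, cnn_feat_dim) from a top convolutional layer
        b, h, w, d = cnn_features.shape
        patches = cnn_features.reshape(b, h * w, d)          # flatten the grid into a patch sequence
        patches = torch.tanh(self.project(patches))
        states, _ = self.bigru(patches)                      # (batch, 196, 2 * hidden_dim)
        return states                                        # concatenated forward/backward states
```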
And that model, too, actually achieved state of the art results. 01:10:17.560 |
This dataset was only released last year, so 01:10:22.040 |
everybody now works on deep learning techniques to try to solve it. 01:10:28.560 |
It seemed almost too good to be true that this model we developed for text would also work here. 01:10:32.760 |
So we really dug into looking at the attention, visualizing, for each question, where 01:10:50.440 |
it paid attention to different regions in the image. 01:10:54.480 |
And we started basically analyzing, going through a bunch of those on the dev set, 01:11:00.320 |
and analyzing what is it actually paying attention to. 01:11:02.960 |
Again, it's being trained only with the image, the question, and the answer. 01:11:10.240 |
You do not get this sort of latent representation of where you should 01:11:14.360 |
actually pay attention to in the image in order to answer that question correctly. 01:11:18.600 |
So when the question was, what is the main color on the bus? 01:11:22.040 |
It learned to actually pay attention here to that bus. 01:11:25.400 |
I'm like, well, okay, maybe that's not that impressive. 01:11:27.960 |
It's just the main object in the center of the image. 01:11:30.960 |
And, what type of trees are in the background? 01:11:35.600 |
Well, maybe it just connects tree with anything that's green in the background. 01:11:41.040 |
So it was neat, but not super impressive yet. 01:11:45.760 |
This next one, is this in the wild, is kind of more interesting, and 01:11:48.120 |
it actually pays attention to a man-made structure in the background. 01:12:01.440 |
Now, to be honest, I don't think the model actually knows that there are two people, 01:12:08.720 |
It just finds the main person or main object in this scene. 01:12:13.920 |
The main object is a little baby girl, so it says girl. 01:12:22.280 |
The answer is night, because it's a very dark picture, at least in the sky. 01:12:26.720 |
This one is getting a little more interesting. 01:12:30.080 |
The answer is surfboard, and it actually does pay attention to both of the arms. 01:12:37.160 |
So that's a little more interesting kind of attention visualization. 01:12:41.200 |
And then for a while, we were also worried, well, what if, on this dataset, 01:12:45.640 |
it just learns really well from language alone? 01:12:48.440 |
Yes, it pays attention to things, but maybe it'll just say things based on the language prior. 01:12:58.120 |
For a question like what color are the bananas, in 95% of the cases you're right just saying yellow without seeing an image. 01:13:03.040 |
So this one I was kind of excited about, because it actually 01:13:07.160 |
paid attention to the bananas in the middle, and then did say green, 01:13:10.400 |
and kind of overruled the prior that it would get from language alone. 01:13:17.640 |
What is the pattern on the cat's fur on its tail? 01:13:20.440 |
Pays attention mostly to the tail and says stripes. 01:13:28.200 |
The answer is yes, though I have to say that we later 01:13:32.560 |
had a journalist who wanted to ask his own question. 01:13:36.040 |
It was John Markoff from the New York Times, and 01:13:40.920 |
we had just put together this demo the night before. 01:13:44.840 |
And he's like, well, I want to ask my own question. 01:13:51.720 |
And it wasn't made for production, so it was kind of slow, and 01:13:56.200 |
I'm trying to come up with excuses: 01:13:59.280 |
it's kind of a black background and a black hat, and it might be kind of hard to see. 01:14:03.560 |
And fortunately, it got it right and said yes. 01:14:06.680 |
And then after the interview, I said, well, maybe let's look and 01:14:09.920 |
see; I would just ask it questions myself, in a less stressful situation. 01:14:16.640 |
And these are all the questions, the first eight questions that I could come up with. 01:14:20.640 |
And somewhat to my surprise, it actually got them all right. 01:14:23.560 |
So, what is the girl holding? A tennis racket. 01:14:33.000 |
Then I was like, well, okay, let's try to break it by asking just, what's the color of the skirt? 01:14:39.920 |
It actually got that right too: her skirt is white. 01:14:42.920 |
Also kind of interesting, when you ask the model what she's wearing, it says shorts, but 01:14:46.440 |
when you ask about the skirt, it still sort of captures the right thing. 01:14:54.480 |
And then this one was interesting, what did the girl just hit? 01:14:59.160 |
And then I was like, well, what if I ask, is the girl about to hit the tennis ball? 01:15:04.960 |
And then, did the girl just hit the tennis ball? 01:15:08.000 |
So then I finally found a way to break it, so 01:15:09.880 |
it doesn't have enough of the co-occurrence statistics to understand, 01:15:13.560 |
and, again, understand sort of which angle the arm has to be at in order to 01:15:18.640 |
assume that the ball was just hit or is about to be hit. 01:15:21.200 |
But what it basically does show us is that once it saw a lot of examples 01:15:27.720 |
on a specific domain, it really can capture quite a lot of different things. 01:15:31.800 |
And now, let's see if we can get the demo up. 01:15:43.320 |
The best way to hope for any chance of enjoying this film is by lowering your expectations. 01:15:47.760 |
Again, one of those kinds of sentences that you have to now get 01:15:53.720 |
correct in order to get improved performance on sentiment. 01:15:58.880 |
And it actually correctly says that this is negative. 01:16:03.600 |
Now, we can also actually ask that question in Chinese. 01:16:09.280 |
And this is one of the beautiful things of the DMN and 01:16:14.600 |
in general really of most deep learning techniques. 01:16:17.160 |
We don't have to be experts in a domain or even in a language to create a very, 01:16:21.600 |
very accurate model for that language or that domain. 01:16:28.360 |
I'm not gonna make a fool of myself trying to read that one out loud. 01:16:32.720 |
You can also ask, what parts of speech are there? 01:16:38.040 |
You can have other things like named entities and other sequence problems. 01:16:43.480 |
You can also ask, what are the men wearing on the head? 01:16:48.680 |
And then maybe a slightly more interesting question, 01:16:56.440 |
So, especially since we're close to the Circle of Death here at Stanford, 01:17:01.200 |
where a lot of bikes crash, it's a good answer. 01:17:04.400 |
All right, with that, I wanna leave a couple of minutes for questions. 01:17:10.160 |
So basically the summary is, word vectors and 01:17:13.320 |
recurrent neural networks are super useful building blocks. 01:17:17.080 |
Once you really appreciate and understand those two building blocks, 01:17:20.760 |
you're kind of ready to have some fun and build more complex models. 01:17:24.840 |
Really, in the end, this DMN is a way to combine those building blocks in a variety of 01:17:31.760 |
different modules. And that's also where I think the state of deep learning for NLP is. 01:17:36.400 |
We've tackled a lot of these smaller sub-problems, intermediate tasks, and 01:17:40.920 |
now we can work on more interesting, complex problems like dialogue and 01:17:45.640 |
question answering, machine translation, and things like that. 01:18:01.880 |
>> A quick question, in the dynamic memory network, you have the RNN. 01:18:07.880 |
And you also mentioned that you could have better assumptions about the structure of the input, right? 01:18:18.080 |
So if you change the RNN into a tree structure, would that help? 01:18:28.840 |
Somewhat surprisingly, in the last couple of weeks, 01:18:32.680 |
there are actually some new results on SNLI, the Stanford Natural Language 01:18:37.720 |
Inference data set, where tree structures are again the state of the art. 01:18:42.080 |
Though I have to say that I think the dynamic memory network, 01:18:48.600 |
by having this ability in the episodic memory to keep track of different sub 01:18:53.280 |
phrases and pay attention to those and then combine them over multiple passes, 01:18:58.520 |
I think you can kind of get away with not having a tree structure. 01:19:04.640 |
That is, get away with not representing sentences as trees in your input module. 01:19:11.920 |
And I think the episodic memory module that has this capability to go over the 01:19:15.000 |
input multiple times, pay attention to certain sub phrases, 01:19:17.880 |
will capture a lot of the kinds of complexities that you might otherwise want a tree structure for. 01:19:22.400 |
So my short answer is, I don't think you necessarily need it. 01:19:28.700 |
>> Hi, my question is about question answering. 01:19:35.200 |
So if we want to apply question answering to some specific domains like 01:19:40.760 |
healthcare, but we don't really have the data, we don't have question answer pairs. 01:19:50.680 |
What do you do if you want to do question answering on a complex domain like that? 01:19:54.760 |
I think, and this feels maybe like a cop out, but 01:19:58.600 |
I think it's very true both in practice and in theory, create the data. 01:20:02.800 |
Like if you cannot possibly create more than a thousand examples of anything, 01:20:07.360 |
then maybe automating that process is not that important. 01:20:10.040 |
So clearly, you should be able to create some data. 01:20:12.680 |
And in many cases, that is the best use of your time, is just to sit down or 01:20:16.560 |
ask the domain expert to create a lot of questions and answers. 01:20:21.320 |
And then measure how they actually get to those answers. 01:20:24.960 |
Try to have them in a constrained environment and so on. 01:20:29.760 |
For instance, when you try to do automated email replies, which is in some ways a little bit 01:20:33.640 |
similar to question answering, that is a nice domain because 01:20:39.080 |
everybody had already emailed, and those emails were already answered before, so you have the data. 01:20:44.440 |
Now, if you had a search engine where people asked a lot of questions, 01:20:48.000 |
then you can also use that to bootstrap and see where did they actually fail? 01:20:52.200 |
And then take all those really tough queries where they failed, 01:20:55.640 |
have some humans sit there and collect the data. 01:20:59.560 |
Now, the other answer is, let's work together for the next many years on 01:21:04.120 |
research for smaller training data set sizes and complex reasoning. 01:21:08.560 |
The fact of the matter for that line of research will still be, 01:21:12.880 |
if a system has never seen a certain type of reasoning, 01:21:17.000 |
it'll be hard for the system to pick up that type of reasoning. 01:21:21.200 |
I think we're going to get with these kinds of architectures to the space where 01:21:24.840 |
at least if it has seen this type of reasoning, a specific type of 01:21:28.720 |
transitive reasoning or temporal reasoning or sort of cause and 01:21:33.160 |
effect type reasoning, at least a couple hundred times, 01:21:36.520 |
then you should be able to train a system with these kinds of models to do it. 01:21:40.200 |
>> Are these QA systems currently robust to false input or questions? 01:21:49.600 |
For the woman playing tennis, if you asked, what's the man holding? 01:21:57.560 |
Probably not, and largely because at training time, you never try to mess with it like that. 01:22:02.880 |
I'm pretty sure if you added a lot of training samples with those kinds of questions, it could handle them. 01:22:09.320 |
>> Those would be important for real world implementations and- 01:22:16.920 |
I think whenever you train a system, we know we can, for instance, 01:22:20.440 |
both steal certain classifiers by using them a lot. 01:22:23.440 |
We know we can fool them into classifying certain images, for instance, as others. 01:22:28.280 |
We have folks in the audience who worked on that exact line of work. 01:22:32.680 |
So I would be careful using it in security environments right now. 01:22:46.040 |
>> There was a slide where you had the input module and 01:22:54.360 |
the sequence of sentences. Because the sequence is basically made up of those individual words and 01:23:01.080 |
sentences, were those also RNNs that were stitched together? 01:23:07.960 |
It depends, because we have two papers with the DMN, and the answer is different for each. 01:23:12.960 |
In the simplest form of that, it is actually a single GRU that goes from 01:23:18.760 |
the first word through all the sentences as if they're one gigantic sequence. 01:23:23.560 |
But it has access to the period at the end of each sentence, so it can pay special attention to those positions. 01:23:30.640 |
And so yes, in the simplest form, it is just a GRU that goes over all the words. 01:23:35.240 |
>> So is this a normal process to basically just concatenate all the sentences 01:23:39.840 |
into one gigantic- >> So the answer there, and 01:23:44.240 |
this is kind of why I split the talk into three different ones from words, 01:23:48.800 |
single sentences, and then multiple sentences. 01:23:51.160 |
I think if you just had a single GRU that goes over everything and 01:23:54.600 |
now you try to reason over that entire sequence, it would not work very well. 01:23:58.240 |
You really need to have an additional structure, such as an attention mechanism 01:24:02.400 |
or a pointer mechanism that has the ability to pay attention to specific parts 01:24:09.400 |
But yeah, in general, that's fine, as long as you have this additional mechanism. 01:24:14.920 |
>> So in recurrent neural nets, you're using sigmoids. 01:24:22.560 |
I thought rectified linear units were the more popular non-linearity. 01:24:30.000 |
Now, when you look at the GRU equations here, you have these reset gates. 01:24:34.400 |
And so these reset gates here, you want them to essentially be between 0 and 1, 01:24:38.960 |
so that it can either ignore this input entirely or 01:24:41.720 |
have it normally be part of the computation of h tilde. 01:24:45.480 |
So in some cases, you really do want to have sigmoids there. 01:24:50.480 |
But for other ones, for instance, some simpler things where you actually don't have 01:24:56.360 |
that much recurrence, such as going from one memory state to another, 01:25:01.720 |
ReLUs were actually good activation functions too. 01:25:08.440 |
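To connect that answer to the equations, here is a sketch of one GRU step, marking which nonlinearities are gates and therefore want to stay in [0, 1]; the weight names are generic and not tied to any particular implementation.

```python
import torch

def gru_step(x, h_prev, Wz, Uz, Wr, Ur, W, U):
    """One GRU update. z and r are gates, so they use sigmoids to stay in [0, 1];
    only the candidate state's nonlinearity (tanh here) is the kind of place
    where a ReLU might be considered instead."""
    z = torch.sigmoid(x @ Wz + h_prev @ Uz)         # update gate
    r = torch.sigmoid(x @ Wr + h_prev @ Ur)         # reset gate
    h_tilde = torch.tanh(x @ W + (r * h_prev) @ U)  # candidate hidden state
    return z * h_tilde + (1.0 - z) * h_prev         # interpolate new and old state
```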
>> Did you guys try to, after training this network, 01:25:12.520 |
try to take these weights for the images and do object detection again? 01:25:16.600 |
So these weights would be augmented with the text vectors. 01:25:21.640 |
Did you try to use- >> That is a very cool idea that we did not try. 01:25:41.240 |
>> So those attention models are pretty powerful when you have enough training data and 01:25:45.800 |
then you can learn to make good use of the data. 01:25:50.880 |
But even though some of these tasks are pretty trivial to humans, they still need a lot of data. 01:25:58.400 |
So what do you think of, I guess, even right now, 01:26:01.800 |
we have a lot of knowledge bases on the web, right, like Wikipedia. 01:26:09.760 |
What do you think about incorporating those knowledge bases into those models? 01:26:14.480 |
>> I actually love that line of research too, and the DMN actually has a semantic memory module. 01:26:21.000 |
This semantic memory module in the simplest form is just word vectors. 01:26:24.640 |
I think one next iteration would actually be to have knowledge bases also influence that module. 01:26:30.000 |
There's very little work on combining text and 01:26:34.480 |
knowledge bases to do overall complex question answering that requires reasoning. 01:26:39.320 |
I think it's a phenomenally interesting area of research. 01:26:42.760 |
>> So are there any hints or any starting points about how to encode those? 01:26:47.520 |
>> There are some papers that do reasoning over knowledge bases alone. 01:26:52.600 |
So we had a paper on recursive neural tensor networks that basically takes 01:26:56.680 |
a triplet, a word vector for an entity, might be in Freebase, might be in WordNet. 01:27:01.920 |
A relation, a vector for a relationship, and a vector for another entity. 01:27:09.040 |
And then basically pipe them into a neural network and say, yes, no, 01:27:11.840 |
are these two entities actually in that relationship? 01:27:13.920 |
And you can have a variety of different architectures. 01:27:20.080 |
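A rough sketch in the spirit of that triple-scoring idea: a per-relation bilinear tensor term plus a standard single-layer term, as in the neural tensor network line of work; the dimensions and names here are illustrative, not the paper's exact ones.

```python
import torch
import torch.nn as nn

class TripleScorer(nn.Module):
    """Scores whether entity e1 stands in relation rel to entity e2, using a
    per-relation bilinear tensor term plus a standard single-layer term."""
    def __init__(self, entity_dim, num_relations, num_slices=4):
        super().__init__()
        self.W = nn.Parameter(torch.randn(num_relations, num_slices, entity_dim, entity_dim) * 0.01)
        self.V = nn.Parameter(torch.randn(num_relations, num_slices, 2 * entity_dim) * 0.01)
        self.u = nn.Parameter(torch.randn(num_relations, num_slices) * 0.01)

    def forward(self, e1, e2, rel):
        # e1, e2: (entity_dim,); rel: integer relation index
        bilinear = torch.einsum('i,kij,j->k', e1, self.W[rel], e2)  # tensor term, one value per slice
        standard = self.V[rel] @ torch.cat([e1, e2])                # ordinary single-layer term
        return self.u[rel] @ torch.tanh(bilinear + standard)        # scalar plausibility score
```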
Wait, that's a different brother, a different Bengio. 01:27:31.400 |
So I think you can also reason over knowledge graphs. 01:27:37.000 |
And you could then try to combine that with reasoning over fuzzy text. 01:27:43.240 |
I think nobody has yet really combined it in a principled way. 01:27:50.320 |
>> So while the model answers my questions correctly, 01:27:55.480 |
how do I check that the model actually understood my question? 01:27:59.920 |
And what's the logic, what's the model's logic behind that? 01:28:05.000 |
In some ways, it's a common question for neural network interpretability. 01:28:10.800 |
>> So in computer vision, sometimes we can at least visualize the features. 01:28:17.960 |
And so I think the best thing that we could do right now is to show these 01:28:21.960 |
attention scores, where for sentiment we ask, how did it come up with the sentiment? 01:28:29.800 |
And likewise for question answering, we can see which sentences did it actually 01:28:34.600 |
pay attention to in order to answer that overall question. 01:28:37.880 |
So that is, I think, the best answer that we could come up with right now. 01:28:41.800 |
But yeah, there are certain other complexities there; it's still an open area of research. 01:28:56.800 |
We'll take another coffee break for 30 minutes, so 01:28:59.480 |
please come back at 2:45 for our presentation by Sherry Moore.