
Deep Learning for Natural Language Processing (Richard Socher, Salesforce)


Chapters

0:00
0:51 What is Natural Language Processing?
1:56 NLP Levels
5:27 (A tiny sample of) NLP Applications
9:01 Outline
18:34 Combining the best of both worlds: Glove (Pennington et al. 2014)
20:30 Glove results
21:24 Intrinsic word vector evaluation Word Vector Analogies
23:37 Glove Visualizations: Superlatives
23:54 Analogy evaluation and hyperparameters
25:47 Recurrent Neural Networks (!)
29:18 RNN language model
35:41 Attempt at a clean illustration
39:47 Pointer sentinel mixture models
41:33 Language Model Evaluation
43:55 Current Research
46:33 First Major Obstacle
47:25 Second Major Obstacle
49:56 High level idea for harder questions
50:25 Basic lego block: GRU (defined before)
50:33 Dynamic Memory Network
54:23 The Modules: Input
54:42 The Modules: Question
54:57 The Modules: Episodic Memory

Whisper Transcript

00:00:00.000 | Thank you, everybody, and thanks for coming back very soon after lunch.
00:00:04.640 | I'll try to make it entertaining to avoid some post-food coma.
00:00:09.120 | So I actually owe a lot for being here to Andrew and Chris and
00:00:13.920 | my PhD at Stanford.
00:00:16.360 | It's really, it's always fun to be back.
00:00:19.080 | I figured there's going to be a broad range of capabilities in the room.
00:00:24.080 | So I'm sorry I will probably bore some of you for
00:00:28.160 | the first two-thirds of the talk, cuz I'll go over the basics of what's NLP,
00:00:34.080 | what's natural language processing, what's deep learning, and
00:00:36.440 | what's really at the intersection of the two.
00:00:38.560 | And then the last third, I'll talk a little bit about some exciting new
00:00:43.040 | research that's happening right now.
00:00:45.880 | So let's get started with what is natural language processing?
00:00:49.440 | It's really a field at the intersection of computer science, AI, and linguistics.
00:00:54.040 | And you could define a lot of goals, and
00:00:56.440 | a lot of these statements here we could really talk and philosophize a lot about.
00:01:00.200 | But I'll move through them pretty quickly.
00:01:02.640 | For me, the goal of natural language processing is for computers to process or,
00:01:08.360 | in scare quotes, "understand" natural language in order to perform tasks that are
00:01:11.760 | actually useful for people, such as question answering.
00:01:14.840 | The caveat here is that really fully understanding and
00:01:19.600 | representing the meaning of language, or even defining it, is quite an elusive goal.
00:01:25.200 | So whenever I say the model understands, I'm sorry, I shouldn't say that.
00:01:30.720 | Really, these models don't understand anything in the sense that we
00:01:34.520 | understand language.
00:01:35.400 | So whenever somebody says they can read or
00:01:38.120 | represent the full meaning in its entire glory, it's usually not quite true.
00:01:43.240 | Really, perfect language understanding is in some sense AI complete in the sense
00:01:48.200 | that you need to understand all of visual inputs and
00:01:51.600 | thought and a lot of other complex things.
00:01:54.160 | So a little more concretely, as we try to tackle this overall problem of
00:01:59.120 | understanding language, what are sort of the different levels that we often look at?
00:02:04.560 | Often, and for many people, it starts at speech.
00:02:07.680 | And then once you have speech, you might say, all right,
00:02:09.520 | now I know what the phonemes are, the smaller parts of words.
00:02:12.480 | I understand how words form, that's morphology or morphological analysis.
00:02:17.000 | Once I know what the meaning of words are, I might try to understand how they're put
00:02:21.800 | together in grammatical ways such that the sentences are understandable or
00:02:26.840 | at least grammatically correct to a lot of speakers of the language.
00:02:31.520 | Once we go and we understand the structure,
00:02:33.880 | we actually want to get to the meaning.
00:02:35.360 | And that's really where I think most of my interest lies,
00:02:40.840 | in semantic interpretation, actually trying to get to the meaning in some useful
00:02:45.040 | capacity.
00:02:46.040 | And then after that, we might say, well, if we understand now the meaning of
00:02:48.560 | a whole sentence, how do we actually interact?
00:02:51.720 | What's the discourse?
00:02:53.000 | How do we have spoken dialogue system, things like that?
00:02:55.880 | Where deep learning has really improved the state of the art significantly
00:03:01.200 | is in speech recognition and syntax and semantics.
00:03:05.640 | And the interesting thing is that we're kind of actually skipping some of these
00:03:09.320 | levels.
00:03:10.400 | Deep learning often doesn't require morphological analysis to create
00:03:14.200 | very useful systems.
00:03:15.680 | And in some cases, actually skips syntactic analysis entirely as well.
00:03:19.800 | It doesn't have to know about the grammar.
00:03:21.320 | It doesn't have to be taught about what noun phrases are,
00:03:23.920 | prepositional phrases.
00:03:24.840 | It can actually get straight to some semantically useful tasks right away.
00:03:29.240 | And that's going to be one of the sort of advantages that we don't have to
00:03:34.360 | actually be as inspired by linguistics as traditional natural language
00:03:38.320 | processing had to be.
00:03:40.200 | So why is NLP hard?
00:03:42.120 | Well, there's a lot of complexity in representing and learning, and
00:03:46.600 | especially using linguistic situational world and visual knowledge.
00:03:49.840 | Really all of these are connected when it gets to the meaning of language.
00:03:53.680 | To really understand what read means,
00:03:55.480 | can you do that without visual understanding, for instance?
00:03:59.440 | If you have, for instance, this sentence here, Jane hit June and
00:04:03.280 | then she fell, or and then she ran.
00:04:07.160 | Depending on which verb comes after she, the definition,
00:04:12.240 | the meaning of she actually changes.
00:04:14.280 | And this is one subtask you might look at, so-called anaphora resolution or
00:04:18.880 | coreference resolution in general, where you try to understand,
00:04:22.000 | who does she actually refer to?
00:04:23.200 | And it really depends on the meaning, again, somewhat scare quotes here,
00:04:27.320 | of the verb that follows this pronoun.
00:04:31.560 | Similarly, there's a lot of ambiguity.
00:04:35.800 | So here we have a very simple sentence, four words, I made her duck.
00:04:38.960 | Now that simple sentence can actually have at least four different meanings,
00:04:45.720 | if you can think about it for a little bit, right?
00:04:48.080 | You made her a duck that she loves, for Christmas dinner.
00:04:52.480 | You made her duck, like I just did, and so on.
00:04:55.920 | There are actually four different meanings.
00:04:57.800 | And to know which one requires, in some sense, situational awareness or
00:05:02.320 | knowledge to really disambiguate what is meant here.
00:05:06.120 | So that's sort of the high level of NLP.
00:05:10.560 | Now, where does it actually become useful in terms of applications?
00:05:14.000 | Well, they actually range from very simple things that we kind of assume or
00:05:17.680 | are given now, we use them all the time every day, to more and
00:05:20.400 | more complex, and then also more in the realm of research.
00:05:24.360 | The simple ones are things like spell checking or keyword search and
00:05:27.800 | finding synonyms and a thesaurus.
00:05:30.600 | Then the medium sort of difficulty ones are to extract information from websites,
00:05:36.880 | trying to extract sort of product prices or dates and locations, people or
00:05:40.720 | company names, so-called named entity recognition.
00:05:43.720 | You can go a little bit above that and try to classify sort of reading levels for
00:05:48.680 | school text, for instance, or do sentiment analysis that can be helpful if you
00:05:53.480 | have a lot of customer emails that come in and you want to prioritize highly the ones
00:05:58.120 | of customers who are really, really annoyed with you right now.
00:06:01.200 | And then the really hard ones, and I think in some sense,
00:06:03.920 | the most interesting ones are machine translation,
00:06:07.440 | trying to actually be able to translate between all the different languages in
00:06:10.280 | the world, question answering, clearly something that is a very exciting and
00:06:16.600 | useful piece of technology, especially over very large, complex domains.
00:06:21.840 | Can be used for automated email replies.
00:06:24.880 | I know pretty much everybody here would love to have some simple automated email
00:06:29.560 | reply system, and then spoken dialogue systems, bots are very hip right now.
00:06:34.000 | These are all sort of complex things that are still in the realm of research to do
00:06:37.840 | them really well.
00:06:39.160 | We're making huge progress, especially with deep learning on these three, but
00:06:42.880 | they're still nowhere near human accuracy.
00:06:45.560 | So let's look at the representations.
00:06:52.040 | I mentioned we have morphology and words and syntax and semantics and so on.
00:06:58.720 | We can look at one example, namely machine translation, and
00:07:03.280 | look at how did people try to solve this problem of machine translation.
00:07:08.440 | Well, it turns out they actually tried all these different levels
00:07:11.320 | with varying degrees of success.
00:07:13.440 | You can try to have a direct translation of words to other words.
00:07:16.680 | The problem is that is often a very tricky mapping.
00:07:19.040 | The meaning of one word in English might have three different words in German and
00:07:23.400 | vice versa.
00:07:24.760 | You can have three different words in English
00:07:27.360 | all meaning the same single word in German, for instance.
00:07:30.520 | So then people said, well, let's try to maybe do some syntactic transfer where we
00:07:34.000 | have whole phrases like to kick the bucket, which just means [FOREIGN] in German.
00:07:38.120 | Okay, not a fun example.
00:07:39.480 | And then semantic transfer might be, well,
00:07:41.680 | let's try to find a logical representation of the whole sentence,
00:07:44.440 | the actual meaning in some human understandable form, and
00:07:48.280 | then try to just find another surface representation of that.
00:07:51.560 | Now, of course, that will also get rid of a lot of the subtleties of language.
00:07:56.200 | And so, the tricky problems in all these kinds of representations.
00:08:01.200 | Now, the question is, what does deep learning do?
00:08:03.600 | You've already seen at least two methods, standard neural networks before and
00:08:09.480 | convolutional neural networks for vision.
00:08:11.680 | And in some sense, there's going to be a huge similarity here to these methods.
00:08:17.000 | Because just like images, which are essentially a long list of numbers,
00:08:22.760 | a vector, the hidden state of a standard neural network
00:08:27.720 | is also just a vector, a list of numbers.
00:08:30.760 | That is also going to be the main representation that we will use throughout
00:08:34.960 | for characters, for words, for short phrases, for sentences, and
00:08:39.600 | in some cases for entire documents.
00:08:41.600 | They will all be vectors.
00:08:43.960 | And with that, we are sort of finishing up the whirlwind of what's NLP.
00:08:49.200 | Of course, you could give an entire lecture on almost every single slide I
00:08:53.920 | just gave, so we're very, very high level.
00:08:57.000 | But we'll continue at that speed to try to squeeze this complex deep learning for
00:09:02.720 | NLP subject area into an hour and a half.
00:09:05.440 | I think these are the two most important basic Lego blocks that you
00:09:10.560 | nowadays want to know in order to be able to creatively play around with more
00:09:15.120 | complex models, and those are going to be word vectors and
00:09:19.960 | sequence models, namely recurrent neural networks.
00:09:22.480 | And I kind of split this into words, sentences, and multiple sentences.
00:09:28.640 | But really, you could use recurrent neural networks for
00:09:31.920 | shorter phrases as well as multiple sentences.
00:09:34.160 | But in many cases, we'll see that they have some limitations as you move to longer and
00:09:38.800 | longer sequences and just use the default neural network sequence models.
00:09:43.320 | All right, so let's start with words.
00:09:47.360 | And maybe one last blast from the past here, to represent the meaning of words,
00:09:53.080 | we actually used to use taxonomies like WordNet that kind of defines
00:09:58.440 | each word in relationship to lots of other ones.
00:10:01.400 | So you can, for instance, define hypernyms and is-a relationships.
00:10:05.200 | You might say the word panda, for instance, in its first meaning as a noun,
00:10:10.760 | basically goes through this complex stack, this directed acyclic graph.
00:10:15.760 | Most of it is roughly just a tree.
00:10:17.880 | And in the end, like everything, it is an entity, but
00:10:20.520 | it's actually a physical entity, a type of object.
00:10:22.640 | It's a whole object, it's a living thing, it's an organism, animal, and so on.
00:10:26.160 | So you basically can define a word like this.
00:10:28.640 | And another way, at each node of this tree,
00:10:31.680 | you actually have so-called synsets, or synonym sets.
00:10:34.480 | And here's an example for the synonym set of the word good.
00:10:38.320 | Good can have a lot of different meanings, can actually be both an adjective,
00:10:44.280 | and as well as an adverb, as well as a noun.
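For readers who want to poke at this themselves, here is a small sketch using NLTK's WordNet interface (not part of the talk; it assumes NLTK is installed and the WordNet data has been downloaded):

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet') once

# Hypernym ("is-a") path for the first noun sense of "panda":
# it walks up through organism, living thing, whole, object, ..., entity.
panda = wn.synsets("panda")[0]
print([s.name() for s in panda.hypernym_paths()[0]])

# Synonym sets ("synsets") for the word "good", across parts of speech.
for synset in wn.synsets("good")[:5]:
    print(synset.pos(), synset.lemma_names())
```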
00:10:47.360 | Now, what are the problems with this kind of discrete representation?
00:10:51.960 | Well, they can be great as a resource if you're a human, you wanna find synonyms.
00:10:57.320 | But they're never going to be quite sufficient to capture all the nuances
00:11:03.920 | that we have in language.
00:11:05.520 | So for instance, the synonyms here for good were adept, expert,
00:11:11.200 | practiced, proficient, and skillful.
00:11:13.240 | But of course, you would use these words in slightly different contexts.
00:11:17.120 | You would not use the word expert in exactly all the same contexts
00:11:23.320 | as you would use the meaning of good, or the word good.
00:11:27.120 | Likewise, it will be missing a lot of new words.
00:11:29.680 | Language is this interesting living organism, we change it all the time.
00:11:33.600 | You might have some kids, they say YOLO, and all of a sudden,
00:11:36.280 | you need to update your dictionary.
00:11:39.360 | Likewise, maybe in Silicon Valley, you might see ninja a lot, and
00:11:43.120 | now you need to update your dictionary again.
00:11:44.960 | And that is basically going to be a Sisyphean job, right?
00:11:47.560 | Nobody will ever be able to really capture all the meanings and
00:11:52.520 | this living, breathing organism that language is.
00:11:55.800 | So it's also very subjective.
00:11:58.080 | Some people might think ninja should just be deleted from the dictionary and
00:12:01.640 | say we don't want to include it.
00:12:03.120 | I just think nifty or badass is kind of a silly word and
00:12:06.560 | should not be included in a proper dictionary, but
00:12:08.720 | it's being used in real language and so on.
00:12:10.800 | It requires human labor.
00:12:11.880 | As soon as you change your domain, you have to ask people to update it.
00:12:16.120 | And it's also hard to compute accurate word similarities.
00:12:18.920 | Some of these words are subtly different, and
00:12:21.080 | it's really a continuum in which we can measure their similarities.
00:12:25.720 | So instead, what we're going to use and
00:12:28.360 | what is also the first step for deep learning,
00:12:31.640 | we'll actually realize it's not quite deep learning in many cases.
00:12:35.280 | But it is sort of the first step to use deep learning in NLP,
00:12:38.280 | is we will use distributional similarities.
00:12:41.280 | So what does that mean?
00:12:42.080 | Basically, the idea is that we'll use the neighbors of a word
00:12:46.040 | to represent that word itself.
00:12:48.080 | It's a pretty old concept.
00:12:50.960 | And here's an example, for instance, for the word banking.
00:12:53.920 | We might actually represent banking in terms of all these other words
00:12:57.800 | that are around it.
00:12:58.760 | So let's do a very simple example where we look at a window around each word.
00:13:06.960 | And so here, the window length, that's just for simplicity, say it's one.
00:13:11.120 | We represent each word only with the words one to the left and
00:13:13.800 | one to the right of it.
00:13:15.160 | We'll just use the symmetric context around each word.
00:13:19.040 | And here's a simple example corpus.
00:13:22.040 | So these are the three sentences in my corpus; of course,
00:13:24.680 | we would always want to use corpora with billions of words instead of just a couple.
00:13:29.360 | But just to give you an idea of what's being captured in these word vectors,
00:13:33.480 | is I like deep learning, I like NLP, and I enjoy flying.
00:13:37.200 | And now, this is a very simple so-called co-occurrence statistic.
00:13:42.520 | You'll simply see here that for the word I, for instance, with its window
00:13:47.280 | size of one, the word like appears twice in its window, its context, and
00:13:51.840 | the word enjoy appears once in its context.
00:13:54.640 | And for like, you have I twice to its left, and once deep, and once NLP.
00:14:01.600 | It turns out, if you just take those vectors, now this could be a vector
00:14:06.960 | representation, just each row could be a vector representation for words.
00:14:11.360 | Unfortunately, as soon as your vocabulary increases,
00:14:13.680 | that vector dimensionality would change.
00:14:15.960 | And hence, you'd have to retrain your whole model.
00:14:17.720 | It's also very sparse, and
00:14:20.760 | really, it's going to be somewhat noisy if you use that vector.
00:14:24.480 | Now, another better thing to do might be to run SVD or
00:14:28.760 | something similar like PCA dimensionality reduction on such a co-occurrence matrix.
00:14:34.240 | And that actually gives you a reasonable first approximation to word vectors.
00:14:38.120 | Very old method, works reasonably well.
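As a rough sketch of what that looks like in code (my own toy illustration of the window-1 counts and SVD step described above, not code from the lecture):

```python
import numpy as np

corpus = ["i like deep learning", "i like nlp", "i enjoy flying"]
tokens = [sentence.split() for sentence in corpus]
vocab = sorted({w for sent in tokens for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts with window size 1.
X = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, word in enumerate(sent):
        for j in (i - 1, i + 1):
            if 0 <= j < len(sent):
                X[index[word], index[sent[j]]] += 1

# SVD gives a dense low-dimensional approximation; each row is a crude word vector.
U, S, Vt = np.linalg.svd(X)
k = 2
word_vectors = U[:, :k] * S[:k]
print(dict(zip(vocab, np.round(word_vectors, 2))))
```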
00:14:41.360 | Now, what works even better than simple PCA is actually a model introduced
00:14:45.880 | by Tomas Mikolov in 2013 called word2vec.
00:14:49.880 | So instead of capturing co-occurrence counts directly out of a matrix like that,
00:14:54.600 | you'll actually go through each window in a large corpus,
00:14:57.960 | take the word that's in the center of each window, and
00:15:02.120 | use that to predict the words around it.
00:15:03.960 | That way, you can very quickly train, you can train almost online,
00:15:09.160 | though few people do this, and
00:15:12.160 | add words to your vocabulary very quickly in this streaming fashion.
00:15:16.440 | So now let's look a little bit at this model Word2Vec,
00:15:20.160 | because it's first a very simple NLP model, and two,
00:15:24.480 | it's very instructive.
00:15:27.920 | We won't go into too many details, but at least look at a couple of equations.
00:15:31.800 | So again, the main goal is to predict the surrounding words in a window
00:15:36.240 | of some length m that we define, it's a hyperparameter, around every word.
00:15:40.680 | Now, the objective function will essentially try to maximize here
00:15:43.560 | the log probability of any of these context words given the center word.
00:15:47.720 | So we go through our entire corpus of length T, a very long sequence, and
00:15:52.640 | at each time step t, we will basically look at all the words in the context
00:15:58.960 | of the current word, and basically try to maximize here
00:16:05.440 | the probability of being able to predict each word that is around
00:16:09.560 | the current word, and theta is all the parameters,
00:16:15.080 | namely all the word vectors, that we want to optimize.
00:16:17.360 | So now, how do we actually define this probability P here?
00:16:21.120 | The simplest way to do this, and this is not the actual way, but
00:16:25.720 | it's the simplest and first to understand and derive this model,
00:16:29.720 | is with this very simple inner product here, and
00:16:33.160 | that's why we can't quite call it deep.
00:16:34.960 | There's not going to be many layers of nonlinearities like we see in
00:16:38.480 | deep neural networks, it's really just a simple inner product.
00:16:41.640 | And the higher that inner product is,
00:16:43.040 | the more likely these two will be predicting one another.
00:16:47.640 | So here C is the center word, and O is the outside, context word.
00:16:55.160 | And basically, this inner product, the larger it is,
00:16:57.760 | the more likely we were going to predict this.
00:17:00.320 | And these are both just standard n-dimensional vectors.
00:17:04.080 | And now, in order to get a real probability, we'll essentially apply
00:17:08.360 | softmax to all the potential inner products that you might have in your vocabulary.
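Written out, the objective and the softmax probability being described look roughly like this (standard skip-gram notation; v_c is the vector of the center word, u_o the vector of an outside word, V the vocabulary size):

```latex
J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log p(w_{t+j} \mid w_t),
\qquad
p(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w=1}^{V} \exp(u_w^{\top} v_c)}
```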
00:17:12.840 | And one thing you will notice here is, well,
00:17:14.600 | this denominator is actually going to be a very large sum, right?
00:17:19.840 | We'll want to sum here over all potential inner products for
00:17:22.760 | every single window, that would be too slow.
00:17:25.000 | So now, the real methods that we would use are going to
00:17:29.840 | approximate the sum in a variety of clever ways.
00:17:34.320 | Now, I could literally talk the next hour and a half just about how to optimize
00:17:37.840 | the details of this equation, but
00:17:39.120 | then we'll all deplete our mental energy for the rest of the day.
00:17:42.880 | And so, I'm just going to point you to the class I taught earlier this year,
00:17:47.520 | CS224d, where we have lots of different slides that go into all
00:17:52.840 | the details of this equation, how to approximate it, and then how to optimize it.
00:17:56.320 | It's going to be very similar to the way we optimize any other neural network.
00:18:00.520 | We're going to use stochastic gradient descent.
00:18:03.080 | We're going to look at mini-batches of a couple of hundred windows at a time, and
00:18:07.720 | then update those word vectors.
00:18:09.240 | And we're just going to take simple gradients of each of these vectors
00:18:14.160 | as we go through windows in a large corpus.
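Concretely, each mini-batch of windows just produces a standard stochastic gradient step on all the word vectors involved (a generic SGD update, nothing specific to this model; alpha is the learning rate):

```latex
\theta^{\text{new}} = \theta^{\text{old}} - \alpha \, \nabla_{\theta} J_t\!\left(\theta^{\text{old}}\right)
```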
00:18:16.360 | All right, now, we briefly mentioned PCA-like methods,
00:18:23.080 | often based on singular value decomposition, or standard simple PCA.
00:18:27.880 | Now, we also had this word2vec model.
00:18:30.680 | There's actually one model that combines the best of both worlds, namely GloVe, or
00:18:35.920 | global vectors, introduced by Jeffrey Pennington in 2014.
00:18:39.440 | And it has a very similar idea, and you'll notice here,
00:18:43.360 | there's some similarity.
00:18:44.360 | You have this inner product again for different pairs.
00:18:47.760 | But this model will actually go over the co-occurrence matrix.
00:18:51.120 | Once you have this co-occurrence matrix, it's much more efficient to try to predict
00:18:54.320 | once how often two words appear next to each other, rather than do it 50 times
00:19:00.080 | each time that pair appears in an actual corpus.
00:19:04.960 | So in some sense, you can more efficiently go through all the co-occurrence
00:19:08.760 | statistics, and you're going to basically try to minimize this difference here.
00:19:16.600 | And what that basically means is that each inner product will try to approximate
00:19:21.600 | the log probability of these two words actually co-occurring.
00:19:25.920 | Now, you have this function here, which essentially will allow us to not overly
00:19:32.320 | weight certain pairs that occur very, very frequently.
00:19:36.800 | The word "the", for instance, co-occurs with lots of different words, and you want to basically
00:19:40.880 | lower the importance of all the words that co-occur with "the".
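The objective being described is, roughly, the weighted least-squares loss from the GloVe paper, where X_ij is the co-occurrence count of words i and j and f is the weighting function that caps very frequent pairs (bias terms included as in the paper):

```latex
J = \sum_{i,j=1}^{V} f\!\left(X_{ij}\right) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^{2}
```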
00:19:46.280 | So you can train this very fast.
00:19:48.600 | It scales to gigantic corpora.
00:19:50.480 | In fact, we trained this on Common Crawl, which is a really great data set of most
00:19:57.440 | of the internet.
00:19:58.480 | It's many billions of tokens.
00:20:00.880 | And it gets also very good performance on small corpora because it makes use very
00:20:05.640 | efficiently of these co-occurrence statistics.
00:20:08.040 | And that's essentially what words, well, word vectors are always capturing.
00:20:12.240 | So if in one sentence, you just want to remember every time you hear word vectors
00:20:17.160 | in deep learning, one, they're not quite deep, even though we call them sort of
00:20:21.040 | step one of deep learning.
00:20:22.320 | And two, they're really just capturing co-occurrence counts.
00:20:25.040 | How often does a word appear in the context of other words?
00:20:28.440 | So let's look at some interesting results of these GloVe vectors.
00:20:34.960 | Here, the first thing we do is look at nearest neighbors.
00:20:38.080 | So now that we have these n-dimensional vectors, usually we say n between 50 to at
00:20:43.680 | most 500, good general numbers, 100 or 200 dimensional.
00:20:47.680 | Each word is now represented as a single vector.
00:20:53.360 | And so we can look in this vector space for words that appear close by.
00:20:57.720 | We started and looked for the nearest neighbors of frog.
00:21:01.640 | And well, it turned out these are the nearest neighbors,
00:21:05.840 | which was a little confusing since we're not biologists.
00:21:08.360 | But fortunately, when you actually look up in Google what those mean,
00:21:12.680 | you'll see that they are actually all indeed different kinds of frogs.
00:21:17.040 | Some appear very rarely in the corpus and others like toad are much more frequent.
00:21:22.680 | Now, one of the most exciting results that came out of word vectors
00:21:28.040 | are actually these word analogies.
00:21:29.560 | So the idea here is, can there be linear relationships between
00:21:36.840 | different word vectors that simply fall out of very simple
00:21:40.480 | addition and subtraction?
00:21:42.960 | So the idea here is, man is to woman as king is to what?
00:21:49.240 | As in, what is the right analogy when I
00:21:54.120 | try to basically fill in here the last missing word?
00:22:00.040 | Now, the way we're going to do this is very simple cosine similarity.
00:22:04.880 | We basically just take, let's take an example here,
00:22:08.560 | the vector of woman, we subtract the word vector we learned of man,
00:22:15.680 | and we add the word vector of king.
00:22:18.720 | And the resulting vector, the argmax for
00:22:21.840 | this cosine similarity, turns out to be queen for a lot of these different models.
00:22:26.960 | And that was very surprising.
00:22:28.160 | Again, we're capturing co-occurrence statistics.
00:22:30.520 | So man might, in its context, often have things like running and
00:22:36.240 | fighting and other silly things that men do.
00:22:38.760 | And then you subtract those kinds of words from the context and
00:22:43.440 | you add them again.
00:22:44.120 | And in some sense, it's intuitive, though surprising that it works out that well for
00:22:49.240 | so many different examples.
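A minimal sketch of that analogy computation (my own illustration; `vectors` is assumed to be a dict mapping words to numpy arrays, e.g. loaded from pre-trained GloVe vectors):

```python
import numpy as np

def analogy(a, b, c, vectors):
    """Return d such that a : b :: c : d, by cosine similarity (e.g. man : woman :: king : queen)."""
    target = vectors[b] - vectors[a] + vectors[c]
    target = target / np.linalg.norm(target)
    best_word, best_sim = None, -1.0
    for word, vec in vectors.items():
        if word in (a, b, c):               # exclude the query words themselves
            continue
        sim = float(np.dot(vec, target) / np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# analogy("man", "woman", "king", vectors) should return "queen" for well-trained vectors.
```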
00:22:50.480 | So here are some other examples similar to the king and
00:22:56.560 | queen example where we basically took these 200 dimensional vectors and
00:23:00.440 | we projected them down to two dimensions.
00:23:02.920 | Again, with a very simple method like PCA.
00:23:05.800 | And what we find is actually quite interestingly,
00:23:09.160 | even in just the first two principal components of this space,
00:23:12.840 | we have some very interesting sort of female-male relationships.
00:23:17.240 | So man to woman is similar to uncle and aunt, brother and sister,
00:23:21.480 | sir and madam, and so on.
00:23:23.320 | So this is an interesting semantic relationship that falls out of
00:23:29.960 | essentially co-occurrence counts in specific windows around each word in
00:23:34.280 | a large corpus.
00:23:36.000 | Here's another one that's more of a syntactic relationship.
00:23:40.160 | We actually have here superlatives, like slow, slower, and slowest is in a similar
00:23:44.960 | vector relationship to short, shorter, and shortest, or strong, stronger, and strongest.
00:23:50.440 | So this was very exciting, and of course,
00:23:53.960 | when you see an interesting qualitative result, you want to try to quantify
00:23:59.480 | who can do better in trying to understand these analogies and
00:24:02.760 | what are the different modes and hyperparameters that modify the performance.
00:24:07.960 | Now, this is something that you will notice in pretty much every deep learning
00:24:11.320 | project ever, which is more data will give you better performance.
00:24:14.880 | It's probably the single most useful thing you can do to a machine learning or
00:24:18.720 | deep learning system is to train it with more data, and we found that too.
00:24:22.200 | Now, there are different vector sizes too, which is a common hyperparameter.
00:24:27.200 | Like I said, usually between 50 to at most 500.
00:24:30.280 | Here we have 300 dimensional that essentially gave us the best performance
00:24:36.000 | for these different kinds of semantic and syntactic relationships.
00:24:40.760 | Now, in many ways, having a single vector for words can be oversimplifying, right?
00:24:45.880 | Some words have multiple meanings, maybe they should have multiple vectors.
00:24:50.080 | Sometimes the word meaning changes over time, and so on.
00:24:56.760 | So there's a lot of simplifying assumptions here, but again,
00:24:59.720 | our final goal for deep NLP is going to be to create useful systems.
00:25:04.680 | And it turns out this is a useful first step to create such systems
00:25:09.400 | that mimic some human language behavior in order to create useful applications for us.
00:25:17.160 | All right, but words, word vectors are very useful, but
00:25:19.880 | words of course never appear in isolation.
00:25:22.160 | And what we really want to do is understand words in their context.
00:25:25.720 | And so this leads us to the second section here on recurrent neural networks.
00:25:31.880 | So we already went over the basic definition of standard neural networks.
00:25:37.520 | Really the main difference between a standard neural network and
00:25:41.560 | a recurrent neural network, which I'll abbreviate as RNN now,
00:25:45.200 | is that we will tie the weights at each time step.
00:25:48.880 | And that will allow us to essentially condition the neural network on all
00:25:52.000 | the previous words, in theory.
00:25:53.640 | In practice, how we can optimize it, it won't be really all the previous words.
00:25:58.600 | It'll be more like at most the last 30 words, but
00:26:01.520 | in theory, this is what a powerful model can do.
00:26:04.280 | So let's look at the definition of a recurrent neural network.
00:26:08.240 | And this is going to be a very important definition, so
00:26:10.760 | we'll go into a little bit of details here.
00:26:12.760 | So let's assume for now we have our word vectors as given, and
00:26:17.240 | we'll represent each sequence in the beginning as just a list of these word vectors.
00:26:22.040 | Now what we're going to do is we're computing a hidden state, ht, at each time
00:26:27.720 | step, and the way we're going to do this is with a simple neural network architecture.
00:26:33.600 | In fact, you can think of this summation here as really just a single
00:26:39.600 | layer neural network, if you were to concatenate the two matrices in these two
00:26:43.320 | vectors, but intuitively, we basically will map our current word vector at that
00:26:49.120 | time step t, sometimes I use these square brackets to denote that we're taking
00:26:54.480 | the word vector from that time step in there.
00:26:58.800 | We map that with a linear layer, a simple matrix vector product, and
00:27:02.520 | we sum up, sum that matrix vector product to another matrix vector product
00:27:07.560 | of the previous hidden state at the previous time step.
00:27:11.800 | We sum those two, and we apply in one case a simple sigmoid function
00:27:16.640 | to define this standard neural network layer.
00:27:20.240 | That will be ht, and now at each time step we want to predict some kind of class,
00:27:25.440 | probability over a set of potential events, classes, words, and so on.
00:27:30.800 | And we use the standard softmax classifier,
00:27:33.280 | some other communities call it the logistic regression classifier.
00:27:40.520 | So here we have a simple matrix, Ws for the softmax weights.
00:27:47.680 | The number of rows is basically the number of classes
00:27:50.480 | that we have, and the number of columns is the same as the hidden dimension.
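Putting those two pieces into code, one time step of this RNN language model looks roughly like this (a sketch with made-up parameter names, not the lecture's exact notation):

```python
import numpy as np

def rnn_lm_step(x_t, h_prev, W_hx, W_hh, W_s):
    """One time step: x_t is the current word vector (d,), h_prev the previous
    hidden state (H,); W_hx is (H, d), W_hh is (H, H), W_s is (V, H)."""
    h_t = 1.0 / (1.0 + np.exp(-(W_hh @ h_prev + W_hx @ x_t)))   # sigmoid hidden layer
    scores = W_s @ h_t                                           # one score per vocabulary word
    y_hat = np.exp(scores - scores.max())
    y_hat /= y_hat.sum()                                         # softmax over the vocabulary
    return h_t, y_hat
```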
00:27:55.480 | Sometimes we want to predict the next word in a sequence in order to be able to
00:28:02.600 | identify the most likely sequence.
00:28:06.040 | So for instance, if I ask for a speech recognition system,
00:28:09.480 | what is the price of wood?
00:28:11.520 | Now in isolation, if you hear wood, you would probably assume it's the W-O-U-L-D,
00:28:17.720 | auxiliary verb wood, but in this particular context, the price of,
00:28:21.680 | it wouldn't make sense to have a verb following that.
00:28:23.960 | And so it's more like the W-O-O-D to find the price of wood.
00:28:28.880 | So language modeling is a very useful task, and it's also very instructive
00:28:33.440 | to use as an example for where recurrent neural networks really shine.
00:28:38.520 | So in our case here, this softmax is going to be quite a large matrix that goes over
00:28:43.680 | the entire vocabulary of all the possible words that we have.
00:28:47.240 | So each word is going to be our class.
00:28:50.800 | The classes for language models are the words in our vocabulary.
00:28:54.960 | And so we can define here this y hat t,
00:29:00.160 | whose jth entry is basically denoting the probability that the jth word,
00:29:05.280 | the jth index in the vocabulary, will come next after all the previous words.
00:29:09.680 | It's a very useful model again for speech recognition, for machine translation,
00:29:13.960 | for just finding a prior for language in general.
00:29:16.720 | All right, again, main difference to standard neural networks,
00:29:22.920 | we just have the same set of weights W at all the different time steps.
00:29:26.280 | Everything else is pretty much a standard neural network.
00:29:31.240 | We often initialize the first H0 here just either randomly or all zeros.
00:29:38.080 | And again, in language modeling in particular,
00:29:42.840 | the next word is our class of the softmax.
00:29:46.400 | Now we can measure basically the performance of language models
00:29:50.680 | with a measure called perplexity, which is really based here on the average
00:29:57.000 | log likelihood of basically the probabilities of being able to predict the next word.
00:30:02.120 | So you want to really give the highest probability to the word that
00:30:05.840 | actually will appear next in a long sequence.
00:30:08.200 | And then the higher that probability is, the lower your perplexity, and
00:30:14.040 | hence the model is less perplexed to see the next word.
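One common way to write the perplexity of a model over a sequence of T words is the exponentiated negative average log likelihood, which is what this passage is describing (lower is better):

```latex
\mathrm{PP} = \exp\!\left( -\frac{1}{T} \sum_{t=1}^{T} \log p\!\left( w_t \mid w_1, \ldots, w_{t-1} \right) \right)
```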
00:30:17.960 | In some sense, you can think of language modeling as almost NLP complete,
00:30:23.360 | in some silly sense that if you can actually predict every single
00:30:28.320 | word that follows after any arbitrary sequence of words in a perfect way,
00:30:33.520 | you would have disambiguated a lot of things.
00:30:36.280 | You can say, for instance, what is the answer to the following question?
00:30:40.120 | Ask the question, and then the next couple of words would be the predicted answer.
00:30:43.720 | So there's no way we can actually ever do a perfect job in language modeling.
00:30:48.080 | But there's certain contexts where we can give a very high probability to
00:30:52.000 | the right next couple of words.
00:30:53.720 | Now, this is the standard recurrent neural network.
00:30:58.480 | And one problem with this is that we will modify the hidden state here
00:31:03.040 | at every time step.
00:31:04.240 | So even if I have words like the, and a, and a sentence period, and
00:31:08.880 | things like that, it will significantly modify my hidden state.
00:31:12.640 | Now, that can be problematic.
00:31:15.600 | Let's say, for instance, I want to train a sentiment analysis algorithm.
00:31:20.840 | And I talk about movies, and I talk about the plot for a very long time.
00:31:25.240 | Then I say, man, this movie was really wonderful.
00:31:28.120 | It's great to watch.
00:31:29.440 | And then especially the ending, and you talk again for
00:31:31.480 | like 50 time steps, or 50 words, or 100 words about the plot.
00:31:35.840 | Now, all these plot words will essentially modify my hidden state.
00:31:39.080 | So if at the end of that whole sequence I want to classify the sentiment,
00:31:42.560 | the word wonderful and
00:31:43.600 | great that I mentioned somewhere in the middle might be completely gone.
00:31:47.720 | Because I keep updating my hidden state with all these content words that talk
00:31:52.240 | about the plot.
00:31:52.760 | Now, the way to improve this is by use better kinds of recurrent units.
00:31:59.560 | And I'll introduce here a particular kind,
00:32:04.360 | so-called gated recurrent units, introduced by Cho et al.
00:32:08.680 | In some sense, and we'll learn more about the LSTM tomorrow when
00:32:13.760 | Quoc gives his lecture, but GRUs are in some sense a special case of LSTMs.
00:32:19.240 | The main idea is that we want to have the ability to keep certain memories
00:32:23.640 | around without having the current input modify them at all.
00:32:28.760 | So again, this example of sentiment analysis.
00:32:30.680 | I say something's great, that should somehow be captured in my hidden state.
00:32:34.560 | And I don't want all the content words that talk about the plot in the movie
00:32:37.320 | review to modify that it's actually overall was a great movie.
00:32:42.440 | And then we also want to allow error messages to flow
00:32:45.480 | at different strengths depending on the input.
00:32:47.600 | So if I say, great, I want that to modify a lot of things in the past.
00:32:53.160 | So let's define a GRU.
00:32:56.120 | Fortunately, since you already know the basic Lego block of a standard neural
00:32:59.960 | network, there's only really one or two subtleties here that are different.
00:33:03.920 | There are a couple of different steps that we'll need to compute
00:33:10.120 | at every time step.
00:33:10.920 | So in the standard RNN, what we did was just have this one single neural network
00:33:15.560 | that we hope would capture all this complexity of the sequence.
00:33:18.680 | Instead now, we'll first compute a couple of gates at that time step.
00:33:23.280 | So the first thing we'll compute is the so-called update gate.
00:33:26.880 | It's just yet another neural network layer based on the current input word vector and
00:33:31.320 | again the past hidden state.
00:33:32.920 | So these look quite familiar, but this will just be an intermediate value and
00:33:37.240 | we'll call it the update gate.
00:33:39.480 | Then we'll also compute a reset gate, which is yet another standard neural network layer.
00:33:45.160 | Again, just matrix vector product, summation matrix vector product,
00:33:48.960 | some kind of non-linearity here, namely a sigmoid.
00:33:51.840 | It's actually important in this case that it is a sigmoid.
00:33:54.360 | Just basically, both of these will be vectors with numbers that are between 0 and 1.
00:33:59.360 | Now, we'll compute a new memory content, an intermediate h-tilde here,
00:34:06.640 | with yet another neural network, but then we have this little funky symbol in here.
00:34:12.800 | Basically, this will be an element-wise multiplication.
00:34:15.520 | So basically, what this will allow us to do is if that reset gate is 0,
00:34:22.640 | we can essentially ignore all the previous memory elements and
00:34:27.680 | only store the new word information.
00:34:30.640 | So for instance, if I talked for a long time about the plot,
00:34:35.760 | now I say this was an awesome movie.
00:34:37.880 | Now you want to basically be able to ignore if your whole goal of this sequence
00:34:42.360 | classification model is to capture sentiment, you want to be able to ignore
00:34:45.880 | past content.
00:34:47.240 | This is, of course, if this was entirely a zero vector.
00:34:50.880 | Now, this will be more subtle.
00:34:52.560 | This is a long vector of maybe 100 or 200 dimensions, so
00:34:56.320 | maybe some dimensions should be reset, but others maybe not.
00:34:59.520 | And then here we'll have our final memory, and
00:35:04.760 | it essentially combines these two states,
00:35:08.160 | the previous hidden state and this intermediate one at our current time step.
00:35:13.160 | And what this will allow us to do is essentially also say, well,
00:35:16.000 | maybe you want to ignore everything that's currently happening and
00:35:18.800 | only keep the last time step.
00:35:21.960 | We basically copy over the previous time step and
00:35:25.000 | the hidden state of that and ignore the current thing.
00:35:28.680 | Again, simple example, in sentiment,
00:35:30.560 | maybe there's a lot of talk about the plot when the movie was released.
00:35:34.320 | You want to basically have the ability to ignore that and
00:35:36.400 | just copy over what may have been said in the beginning: it was an awesome movie.
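Here is one GRU time step in code, following the gates just described (a sketch; the six weight matrices are assumed given and bias terms are omitted for brevity):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_z, U_z, W_r, U_r, W, U):
    z = sigmoid(W_z @ x_t + U_z @ h_prev)              # update gate, values in (0, 1)
    r = sigmoid(W_r @ x_t + U_r @ h_prev)              # reset gate, values in (0, 1)
    h_tilde = np.tanh(W @ x_t + U @ (r * h_prev))      # new memory; r near 0 drops the past
    h_t = z * h_prev + (1.0 - z) * h_tilde             # z near 1 copies the old state forward
    return h_t
```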
00:35:39.760 | So here's an attempt at a clean illustration.
00:35:42.600 | I have to say, personally, I, in the end, find the equations a little more intuitive
00:35:46.840 | than the visualizations that we tried to do, but some people are more visual here.
00:35:51.280 | So this is, in some ways, basically here we have our word vector and
00:35:55.680 | it goes through different layers.
00:35:57.920 | And then some of these layers will essentially modify other
00:36:03.000 | outputs of previous time steps.
00:36:06.920 | So this is a pretty nifty model and it's really the second most important
00:36:12.160 | basic Lego block that we're going to learn about today.
00:36:18.320 | And so just want to make sure we take a little bit of time,
00:36:21.360 | I'll repeat this here.
00:36:22.560 | Again, if the reset gate, this R value, is close to zero,
00:36:27.800 | those kinds of hidden dimensions are basically allowed to be dropped.
00:36:32.720 | And if the update gate Z basically is one,
00:36:38.560 | then we can copy information of that unit through many, many different time steps.
00:36:44.560 | And if you think about optimization a lot, what this will also mean is that
00:36:48.680 | the gradient can flow through the recurrent neural network through multiple
00:36:52.520 | time steps until it actually matters and you want to update a specific word,
00:36:56.680 | for instance, and go all the way through many different time steps.
00:37:01.000 | So then what this also allows us is to actually have some units
00:37:07.720 | that have different update frequencies.
00:37:12.280 | Some you might want to reset every other word, other ones you might really keep,
00:37:16.720 | like they have some long-term context and they stay around for much longer.
00:37:19.760 | All right, this is the GRU.
00:37:24.640 | It's the second most important building block for today.
00:37:28.120 | There are, like I said, a lot of other variants of recurrent neural networks.
00:37:33.440 | Lots of amazing work in that space right now, and
00:37:36.400 | tomorrow Quoc will talk a lot about some more advanced methods.
00:37:41.400 | So now that you understand word vectors and
00:37:47.080 | neural network sequence models, you really have the two most important concepts for
00:37:51.440 | deep NLP.
00:37:53.040 | And that's pretty awesome, so congrats.
00:37:55.240 | We can now, in some ways, really play around with those two Lego blocks,
00:38:00.640 | plus some slight modifications of them, very creatively, and
00:38:05.080 | built a lot of really cool models.
00:38:06.880 | A lot of the models that I'll show you and that you can read and see and
00:38:10.440 | read the latest papers that are now coming out almost every week on archive,
00:38:15.280 | will have some kind of component of these,
00:38:17.880 | will use really these two components in a major way.
00:38:21.400 | Now, this is one of the few slides now with something really, really new,
00:38:25.920 | because I want to keep it exciting for
00:38:29.640 | the people who already knew all this stuff and took the class and everything.
00:38:33.160 | This is tackling an important problem, which is, in all these models
00:38:38.400 | that you'll see in pretty much most of these papers,
00:38:42.160 | we have in the end one final softmax here, right?
00:38:46.680 | And that softmax is basically our default way of classifying what we can see next,
00:38:51.640 | what kinds of classes we can predict.
00:38:53.480 | The problem with that is, of course, that that will only ever predict accurately
00:38:58.200 | frequently seen classes that we had at training time.
00:39:01.840 | But in the case of language modeling, for instance, where our classes are the words,
00:39:06.360 | we may see at test time some completely new words.
00:39:09.080 | Maybe I'm just going to introduce to you a new name, Srini, for instance, and
00:39:14.520 | nobody may have seen that word at training time.
00:39:19.520 | But now that I mentioned him, and I will introduce him to you,
00:39:22.680 | you should be able to predict the word Srini and that person in a new context.
00:39:28.360 | And so the solution that we're literally going to release only next week in
00:39:32.600 | the new paper is to essentially combine the standard softmax that we can train
00:39:37.240 | with a pointer component.
00:39:39.280 | And that pointer component will allow us to point to previous contexts and
00:39:43.520 | then predict based on that to see that word.
00:39:46.840 | So let's, for instance, take the example here of language modeling again.
00:39:50.320 | We may read a long article about the Fed Chair, Janet Yellen.
00:39:55.720 | And maybe the word Yellen had not appeared in training time before, so
00:40:00.200 | we couldn't ever predict it, even though we just learned about it.
00:40:03.640 | And now a couple of sentences later, interest rates were raised, and
00:40:07.000 | then "Ms.", and now we want to predict that next word.
00:40:11.200 | Now, if that hadn't appeared in our softmax standard training procedure at
00:40:15.680 | training time, we would never be able to predict it.
00:40:18.960 | What this model will do, and we're calling it a pointer sentinel mixture model,
00:40:23.080 | is it will essentially first try to see would any of these previous words
00:40:27.880 | maybe be the right candidate.
00:40:29.880 | So we can really take into consideration the previous context of, say,
00:40:32.800 | the last 100 words.
00:40:34.440 | And if we see that word and that word makes sense after we train it, of course,
00:40:38.720 | then we might give a lot of probability mass to just that word at this current
00:40:43.120 | position in our previous immediate context at test time.
00:40:46.680 | And then we have also the sentinel,
00:40:50.680 | which is basically going to be the rest of the probability if we cannot refer to
00:40:55.680 | some of the words that we just saw.
00:40:57.360 | And that one will go directly to our standard softmax.
00:41:02.440 | And then what we'll essentially have is a mixture model that allows us to say
00:41:06.840 | either one or a combination of both, of essentially words that just
00:41:12.160 | appeared in this context and words that we saw in our standard softmax
00:41:16.760 | language modeling system.
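At a high level, the mixture being described has the form below, where p_ptr puts probability mass on words in the recent context via attention, p_vocab is the standard softmax over the vocabulary, and g is the sentinel gate deciding how much to trust each (a simplified rendering, not the paper's full notation):

```latex
p(w \mid \text{context}) = g \, p_{\text{vocab}}(w) + (1 - g) \, p_{\text{ptr}}(w)
```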
00:41:18.720 | So I think this is a pretty important next step because it will allow us
00:41:23.680 | to predict things we've never seen at training time.
00:41:25.800 | And that's something that's clearly a human capability that most, or
00:41:29.520 | pretty much none of these language models had before.
00:41:32.600 | And so to look at how much it actually helps,
00:41:35.520 | it'll be interesting to look at some of the performance before.
00:41:39.200 | So again, what we're measuring here is perplexity.
00:41:41.800 | And the lower the better, because it's essentially inverse here
00:41:47.720 | of the actual probability that we assign to the correct next word.
00:41:51.800 | And in just 2010, so six years ago, this was some great work,
00:41:58.480 | early work by Tomas Mikolov, where he compared to a lot of standard
00:42:03.280 | natural language processing methods, syntactic models
00:42:09.360 | that essentially tried to predict the next word and had a perplexity of 107.
00:42:13.440 | And he was able to use the standard recurrent neural networks, and
00:42:17.760 | actually an ensemble of eight of them, to really significantly push down
00:42:22.080 | the perplexity, especially when you combine it with standard
00:42:26.000 | count-based methods for language modeling.
00:42:28.960 | So in 2010, he made great progress by pushing it down to 87.
00:42:34.800 | And now this is one of the great examples of how much progress
00:42:39.120 | is being made in the field, thanks to deep learning, where two years ago,
00:42:44.680 | Wojciech Zaremba and his collaborators were able to push that down
00:42:49.960 | even further to 78 with a very large LSTM, similar to a GRU-like model,
00:42:56.440 | but even more advanced.
00:42:57.480 | Quoc will teach you the basics of LSTMs tomorrow.
00:43:02.240 | Then last year, the performance was pushed down even further by Yarin Gal.
00:43:07.800 | And then this one actually came out just a couple weeks ago,
00:43:13.840 | variational recurrent highway networks, pushed it down even further.
00:43:17.280 | But this Pointer Sentinel model is able to get it down to 70.
00:43:20.480 | So in just a short amount of time,
00:43:23.800 | we pushed it down by more than 10 perplexity points in two years.
00:43:28.840 | And that is really an increased speed in performance that we're seeing now,
00:43:34.040 | that deep learning is changing a lot of areas of natural language processing.
00:43:38.240 | All right, now we have our basic Lego blocks,
00:43:43.080 | the word vectors and the GRU sequence models.
00:43:47.240 | And now we can talk a little bit about some of the ongoing research that we're
00:43:52.120 | working on.
00:43:53.520 | And I'll start that with maybe a controversial question,
00:43:56.640 | which is, could we possibly reduce all NLP tasks to essentially
00:44:03.400 | question answering tasks over some kind of input?
00:44:06.600 | And in some ways, that's a trivial observation that you could do that, but
00:44:11.600 | it actually might help us to think of models that could take any kind of input,
00:44:16.960 | a question about that input, and try to produce an output sequence.
00:44:21.440 | So let me give you a couple of examples of what I mean by this.
00:44:26.000 | So here we have, the first one is a task that we would
00:44:30.200 | standardly associate with question answering.
00:44:32.760 | I'll give you a couple of facts.
00:44:34.280 | Mary walked to the bathroom, Sandra went to the garden, Daniel went back to
00:44:38.160 | the garden, Sandra took the milk there, where's the milk?
00:44:41.560 | And now you might have to logically reason,
00:44:44.680 | try to find the sentence about milk.
00:44:48.280 | Maybe Sandra took the milk there.
00:44:51.160 | Now I'd have to maybe do anaphora resolution, find out what "there" refers
00:44:56.400 | to, and then you try to find the previous sentence that mentions Sandra,
00:45:01.480 | see that it's garden, and then give an answer garden.
00:45:04.040 | So this is a simple logical reasoning question answering task.
00:45:08.280 | And that's what most people in the QA field sort of associated with
00:45:12.760 | some kinds of question answers.
00:45:14.600 | But we can also say, everybody's happy and
00:45:17.160 | the question is, what's the sentiment?
00:45:19.160 | And the answer is positive.
00:45:21.120 | All right, so this is a different subfield of NLP that tackles sentiment analysis.
00:45:26.720 | We can go further and ask, what are the named entities of a sentence like,
00:45:31.480 | Jane has a baby in Dresden, and you want to find out that Jane is a person and
00:45:34.840 | Dresden is a location, and this is an example of sequence tagging.
00:45:38.320 | You can even go as far and say, I think this model is incredible, and
00:45:43.840 | the question is, what's the translation into French?
00:45:46.880 | And you get, [FOREIGN] and
00:45:52.760 | that in some ways would be phenomenal if we're able to actually
00:45:57.600 | tackle all these different kinds of tasks with the same kind of model.
00:46:03.720 | So maybe it would be an interesting new goal for NLP to try to
00:46:08.480 | develop a single joint model for general question answering.
00:46:13.200 | I think it would push us to think about new kinds of sequence models and
00:46:20.400 | new kinds of reasoning capabilities in an interesting way.
00:46:23.960 | Now, there are two major obstacles to actually achieving
00:46:27.040 | the single joint model for arbitrary QA tasks.
00:46:30.240 | The first one is that we don't even have a single model architecture that gets
00:46:34.480 | consistent state of the art results across a variety of different tasks.
00:46:39.520 | So for instance, for question answering, this is a data set called bAbI that
00:46:43.680 | Facebook published last year,
00:46:46.600 | strongly supervised memory networks get the state of the art.
00:46:49.800 | For sentiment analysis, you had Tree-LSTM models
00:46:53.880 | developed by Kai Sheng Tai here at Stanford last year.
00:46:59.640 | And for part of speech tagging,
00:47:00.800 | you might have bidirectional LSTM conditional random fields.
00:47:04.840 | One thing you do notice is all the current state of the art
00:47:07.600 | methods are deep learning.
00:47:09.640 | Sometimes they still connect to other traditional methods like conditional
00:47:14.240 | random fields and undirected graphical models.
00:47:16.280 | But there's always some kind of deep learning component in them.
00:47:20.520 | So that is the first obstacle.
00:47:24.280 | The second one is that really fully joint multi-task learning
00:47:29.440 | is very, very hard.
00:47:30.440 | Usually when we do do it, we restrict it to lower layers.
00:47:35.440 | So for instance, in natural language processing,
00:47:37.720 | all we're currently able to share in some principled way are word vectors.
00:47:42.560 | We take the same word vectors we train, for instance, with GloVe or Word2Vec, and
00:47:46.120 | we initialize our deep neural network sequence models with those word vectors.
00:47:51.920 | In computer vision, we're actually a little further ahead, and
00:47:56.280 | you're able to use multiple of the different layers.
00:47:59.680 | And you initialize a lot of your CNN models with a first pre-trained CNN that
00:48:05.600 | was pre-trained on ImageNet, for instance.
00:48:07.040 | Now, usually people evaluate multi-task learning with only two tasks.
00:48:12.760 | They train on a first task, and
00:48:14.640 | then they evaluate the model that they initialized from the first on the second
00:48:18.880 | task, but they often ignore how much the performance degrades on the original task.
00:48:23.880 | So when somebody takes an ImageNet CNN and applies it to a new problem,
00:48:28.120 | they rarely ever go back and
00:48:29.160 | say how much did my accuracy actually decrease on the original data set?
00:48:32.800 | And furthermore, we usually only look at tasks that are actually related, and
00:48:38.320 | then we find, look, there's some amazing transfer learning capability going on.
00:48:42.920 | What we don't look at often in the literature and
00:48:46.480 | most people's work is that when the tasks aren't related to one another,
00:48:50.140 | they actually hurt each other.
00:48:51.920 | This is so-called catastrophic forgetting.
00:48:55.040 | There's not too much work around that right now.
00:48:58.960 | Now, I also would like to say that right now,
00:49:04.760 | almost nobody uses the exact same decoder or
00:49:07.920 | classifier for a variety of different kinds of outputs, right?
00:49:12.800 | We at least replace the softmax to try to predict different kinds of problems.
00:49:18.440 | All right, so this is the second obstacle now.
00:49:21.920 | For now, we'll only tackle the first obstacle, and
00:49:24.760 | this is basically what motivated us to come up with dynamic memory networks.
00:49:29.480 | They're essentially an architecture to try to tackle arbitrary question answering
00:49:33.480 | tasks.
00:49:33.980 | When I'll talk about dynamic memory networks, it's important to note here
00:49:38.960 | that for each of the different tasks I'll talk about,
00:49:41.560 | it'll be a different dynamic memory network.
00:49:44.520 | It won't have the exact same weights.
00:49:46.680 | It'll just be the same general architecture.
00:49:50.800 | So the high level idea for DMNs is as follows.
00:49:55.040 | Imagine you had to read a bunch of facts like these here.
00:49:59.160 | They're all very simple in and of themselves.
00:50:02.200 | But if I now ask you a question, I showed you these, and I ask, where's Sandra?
00:50:08.040 | It'd be very hard, even if you read them, all of them,
00:50:10.920 | it'd be kind of hard to remember.
00:50:13.080 | And so the idea here is that for complex questions,
00:50:16.520 | we might actually want to allow you to have multiple glances at the input.
00:50:22.600 | And just like I promised, one of our most important basic Lego blocks will be this
00:50:29.240 | GRU we just introduced in the previous section.
00:50:31.520 | Now, here's this whole model in all its gory details.
00:50:37.080 | And we'll dive into all of that in the next couple of slides, so don't worry.
00:50:41.520 | It's a big model.
00:50:44.080 | A couple of observations.
00:50:45.240 | So the first one is, I think we're moving in deep learning now to try to use more
00:50:50.120 | proper software engineering principles.
00:50:53.360 | Basically to modularize, encapsulate certain capabilities, and
00:50:57.800 | then take those as basic Lego blocks and build more complex models on top of them.
00:51:03.160 | A lot of times nowadays you just have a CNN,
00:51:06.320 | that's like one little block in a complex paper, and then other things happen on top.
00:51:10.360 | Here we'll have the GRU or word vectors basically as one module,
00:51:16.240 | a sub-module in these different ones here.
00:51:19.200 | And I'm not even mentioning word vectors anymore, but
00:51:21.440 | word vectors still play a crucial role.
00:51:23.640 | And each of these words is essentially represented as this word vector, but
00:51:27.160 | we just kind of assume that it's there.
00:51:29.280 | Okay, so let's walk on a very high level through this model.
00:51:32.600 | There are essentially four different modules.
00:51:35.040 | There's the input module, which will be a neural network sequence model, a GRU.
00:51:40.320 | There's a question module, an episodic memory module, and an answering module.
00:51:45.320 | And sometimes we also have these semantic memory modules here, but for
00:51:49.320 | now these are really just our word vectors, and we'll ignore that for now.
00:51:53.400 | So let's go through this.
00:51:54.720 | Here is our corpus, and our question is, where is the football?
00:51:58.160 | And this is our input that should allow us to answer this question.
00:52:02.760 | Now if I ask this question, I will essentially use the final representation
00:52:08.880 | of this question to learn to pay attention to the right kinds of inputs that seem
00:52:13.520 | relevant, given what I know, to answer this question.
00:52:17.640 | So where's the football?
00:52:18.680 | Well, it would make sense to basically pay attention to all the sentences that
00:52:22.720 | mention football, and maybe especially the last ones if the football moves around a lot.
00:52:27.360 | So what we'll observe here is that this last sentence will get a lot of attention.
00:52:31.920 | So John put down the football.
00:52:34.320 | And now what we'll basically do is that this hidden state of this
00:52:38.960 | recurrent neural network model will be given as input to another recurrent neural
00:52:44.400 | network because it seemed relevant to answer this current question at hand.
00:52:48.440 | Now we'll basically agglomerate all these different facts
00:52:54.080 | that seem relevant at the time in another GRU, into this final vector m.
00:52:58.560 | And now this vector m together with the question will be used to go over
00:53:02.360 | the inputs again if the model deems that it doesn't have enough information yet to
00:53:06.320 | answer the question.
00:53:07.640 | So if I ask you where's the football and it so
00:53:09.720 | far only found that John put down the football, you don't know enough.
00:53:13.280 | You still don't know where it is, but you now have a new fact,
00:53:15.800 | namely that John seems relevant to answering the question.
00:53:18.680 | And that fact is now represented in this vector m,
00:53:21.960 | which is also just the last hidden state of another recurrent neural network.
00:53:25.200 | Now we'll go over the inputs again.
00:53:28.480 | Now that we know that John and
00:53:29.840 | the football are relevant, we'll learn to pay attention to John move to the bedroom.
00:53:35.440 | And John went to the hallway.
00:53:39.040 | Again, those are going to get agglomerated here in this recurrent neural network.
00:53:44.920 | And now the model thinks that it actually knows enough because it
00:53:50.440 | basically intrinsically captured things about the football,
00:53:55.120 | about John, found a location, and so on.
00:53:56.760 | Of course, we didn't have to tell it anything about people or
00:54:00.120 | locations, or rules like if x moves to y and y is in the set of locations,
00:54:04.320 | then this happens, none of that.
00:54:05.760 | You just give it a lot of stories like that and
00:54:07.800 | in its hidden states it will capture these kinds of patterns.
00:54:11.520 | So then we have the final vector m and we'll give that to an answer module,
00:54:16.920 | which produces in our standard softmax way the answer.
00:54:21.880 | All right, now let's zoom into the different modules of this overall dynamic
00:54:26.880 | memory network architecture.
00:54:28.840 | The input, fortunately, is just a standard GRU, the way we defined it before.
00:54:34.480 | So simple word vectors, hidden states, reset gates, update gates, and so on.
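As a reminder of that basic Lego block, here is a minimal sketch of one GRU step with reset and update gates; the gate convention and shapes are one common choice, not necessarily the exact equations on the talk's slides.

```python
import torch

def gru_cell(x, h_prev, W, U, b):
    """One GRU step; W, U, b are dicts of per-gate weight matrices and biases."""
    z = torch.sigmoid(x @ W["z"] + h_prev @ U["z"] + b["z"])           # update gate
    r = torch.sigmoid(x @ W["r"] + h_prev @ U["r"] + b["r"])           # reset gate
    h_tilde = torch.tanh(x @ W["h"] + (r * h_prev) @ U["h"] + b["h"])  # candidate state
    return z * h_tilde + (1.0 - z) * h_prev                            # new hidden state
```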
00:54:39.840 | The question module is also just a GRU, a separate one with its own weights.
00:54:49.160 | And the final vector q here is just going to be the last hidden state of that
00:54:53.880 | recurrent neural network sequence model.
00:54:56.200 | Now, the interesting stuff happens in the episodic memory module,
00:54:59.160 | which is essentially a sort of meta-gated GRU,
00:55:05.960 | where this gate is defined and
00:55:10.840 | computed by the attention mechanism.
00:55:13.840 | And it will basically say that this current sentence vector s_i here seems to matter.
00:55:19.880 | And the superscript t is the episode that we're in.
00:55:23.960 | So each episode basically means we're going over the input entirely one time.
00:55:29.000 | So it starts at G1 here.
00:55:33.240 | And what this basically will allow us to do is to say, well, if G is 0,
00:55:38.640 | then what we'll do is basically just copy over the past states from the input.
00:55:46.120 | Nothing will happen.
00:55:47.480 | And unlike before in all these GRU equations,
00:55:50.080 | this G is just a single scalar number.
00:55:52.000 | It will basically say, if G is 0,
00:55:56.000 | then this sentence is completely irrelevant to my current question at hand.
00:56:00.720 | I can completely skip it, right?
00:56:03.040 | And there are lots of examples, like Mary travelled to the hallway,
00:56:06.920 | that are just completely irrelevant to answering the current question.
00:56:10.480 | In those cases, this G will be 0, and
00:56:13.720 | we're just copying the previous hidden state of this recurrent neural network over.
00:56:17.920 | Otherwise, we'll have a standard GRU model.
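A minimal sketch of that gated update, assuming a scalar gate g_i per sentence: when the gate is 0 the previous hidden state is simply copied over, otherwise the sentence vector is folded in through a standard GRU step.

```python
import torch
import torch.nn as nn

gru = nn.GRUCell(input_size=100, hidden_size=100)   # shared GRU for the episode (assumed size)

def episode_step(s_i, h_prev, g_i):
    """s_i, h_prev: (1, 100) tensors; g_i: scalar attention gate in [0, 1]."""
    return g_i * gru(s_i, h_prev) + (1.0 - g_i) * h_prev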
00:56:22.080 | So now, of course, the big question is, how do we compute this G?
00:56:26.720 | And this might look a little ugly, but it's quite simple.
00:56:30.080 | Basically, we're going to compute two kinds of vector similarities, a multiplicative one and
00:56:35.480 | an additive one with absolute values over all the single values, between the sentence vector
00:56:41.480 | that we currently have, the question vector, and
00:56:44.200 | the memory state from the previous pass over the input.
00:56:47.840 | On the first pass over the input, the memory state is initialized to be just
00:56:52.920 | the question, and then afterwards, it agglomerates the relevant facts.
00:56:56.320 | So intuitively here, if the sentence mentions John, for instance, or
00:57:02.320 | mentions football, and the question is,
00:57:04.960 | where's the football, then you'd hope that the question vector q
00:57:09.160 | has some units that are more active because football was mentioned.
00:57:13.000 | And the sentence vector mentions football, so
00:57:14.880 | there are some units that are more active because football is mentioned.
00:57:17.920 | And hence, some of these inner products or
00:57:20.720 | absolute values of subtractions are going to be large.
00:57:25.000 | And then what we're going to do is just plug that into a
00:57:28.640 | standard single-layer neural network, and then a standard linear layer
00:57:32.600 | here, and then we apply a softmax to essentially weight all of these
00:57:37.600 | different potential sentences that we might have to compute the final gate.
00:57:41.880 | So this will basically be a soft attention mechanism that sums to one and
00:57:46.920 | will pay most attention to the facts that seem most relevant,
00:57:51.000 | given what I know so far in the question.
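A hedged sketch of that gate computation: element-wise similarity features between each sentence vector, the question, and the previous memory, scored by a small two-layer network and normalized with a softmax over sentences. The feature set here follows the description in the talk; the published model uses a somewhat larger set, so treat the details as illustrative.

```python
import torch

def attention_gates(S, q, m, W1, b1, W2, b2):
    """S: (num_sentences, d) sentence vectors; q, m: (d,) question and memory vectors."""
    feats = torch.cat([S * q, S * m, (S - q).abs(), (S - m).abs()], dim=1)
    scores = torch.tanh(feats @ W1 + b1) @ W2 + b2    # two-layer scoring network
    return torch.softmax(scores.squeeze(-1), dim=0)   # soft attention that sums to one
```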
00:57:53.520 | Then when the end of the input is reached, all these relevant facts here
00:57:59.520 | are summarized in another GRU that basically moves up here.
00:58:03.080 | And you can train a classifier also, if you have the right kind of supervision,
00:58:09.680 | to basically decide when the model knows enough to actually answer the question and
00:58:13.760 | stop iterating over the inputs.
00:58:15.760 | If you don't have that kind of supervision, you can also just say,
00:58:19.520 | I will go over the inputs a fixed number of times, and
00:58:23.600 | that works reasonably well too.
00:58:27.200 | All right, there's a lot to sink in, so I'll give you a couple seconds.
00:58:31.720 | Basically, we pay attention to different facts given a certain question.
00:58:36.200 | We iterate over the input multiple times, and we agglomerate the facts that seem
00:58:41.200 | relevant given the current knowledge and the question.
00:58:44.240 | Now, I don't usually talk about neuroscience.
00:58:47.480 | I'm not a neuroscientist, but there is a very interesting relationship here that
00:58:51.920 | a friend of mine, Sam Gershman, pointed out, which is that the episodic memory,
00:58:56.480 | in general for humans, is actually the memory of autobiographical events.
00:59:01.280 | So it's the time when we remember the first time we went to school or
00:59:04.680 | something like that.
00:59:06.040 | And it's essentially a collection of our past personal experiences that occurred at
00:59:09.760 | a particular time in a particular place.
00:59:12.400 | And just like our episodic memory that can be triggered with a variety of
00:59:16.560 | different inputs, this episodic memory is also triggered
00:59:21.360 | with a specific question at hand.
00:59:24.080 | And what's also interesting is the hippocampus,
00:59:26.080 | which is the seat of the episodic memory in humans,
00:59:28.760 | is actually active during transitive inference.
00:59:31.040 | So transitive inference is going from A to B to C to have some connection from A to C.
00:59:36.400 | Or in this case here, with this football, for instance, you first had to find facts
00:59:40.920 | about John and the football, and then finding where John was, and
00:59:44.040 | then finding the location of John.
00:59:45.560 | So those are examples of transitive inference.
00:59:48.000 | And it turns out that you also need, in the DMN,
00:59:53.160 | these multiple passes to enable the capability to do transitive inference.
00:59:58.440 | Now, the final module, again, is a very simple GRU and
01:00:03.640 | softmax to produce the final answers.
01:00:07.280 | The main difference here is that instead of just having
01:00:11.120 | the previous hidden state, a_{t-1}, as input,
01:00:14.520 | we'll also include the question at every time step.
01:00:17.600 | And we will include the answer that was generated at the previous time step.
01:00:22.840 | But other than that, it's our standard softmax.
01:00:24.800 | We use standard cross-entropy errors to minimize it.
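A minimal sketch of that answer module, assuming the decoder is a GRU whose input concatenates the previously generated output with the question vector, followed by a softmax over candidate answers; the sizes are illustrative.

```python
import torch
import torch.nn as nn

d, num_answers = 100, 50                                   # assumed sizes
answer_gru = nn.GRUCell(input_size=num_answers + d, hidden_size=d)
W_out = nn.Linear(d, num_answers)

def answer_step(a_prev, y_prev, q):
    """a_prev: (1, d) decoder state; y_prev: (1, num_answers) previous output; q: (1, d) question."""
    a = answer_gru(torch.cat([y_prev, q], dim=1), a_prev)  # condition on the question at every step
    y = torch.softmax(W_out(a), dim=1)                     # standard softmax over answers
    return a, y
```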
01:00:27.320 | And now, the beautiful thing of this whole model is that it's end-to-end trainable.
01:00:31.640 | These four different modules will actually all train,
01:00:35.760 | based on the cross-entropy error of that final softmax.
01:00:38.680 | All these different modules communicate with vectors, and
01:00:42.480 | we'll just have delta messages and back propagation to train them.
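The end-to-end training itself then looks like any other supervised model: a single cross-entropy loss on the final softmax, with gradients flowing back through the answer, episodic memory, question, and input modules. In this sketch, `dmn` and `train_loader` are assumed wrappers around the modules and the data, not code from the talk.

```python
import torch

loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(dmn.parameters(), lr=1e-3)    # `dmn` wraps all four modules (assumed)

for story, question, answer in train_loader:                # assumed (story, question, answer) batches
    logits = dmn(story, question)                           # forward pass through every module
    loss = loss_fn(logits, answer)                          # cross-entropy on the final softmax
    optimizer.zero_grad()
    loss.backward()                                         # deltas propagate into all modules
    optimizer.step()
```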
01:00:46.120 | Now, there's been a lot of work in the last two years on models like this.
01:00:52.360 | In fact, Quoc will cover a lot of these really interesting models tomorrow,
01:00:57.400 | different types of memory, structures, and so on.
01:00:59.880 | And the dynamic memory network is, in some sense, one of those models.
01:01:04.920 | One particular model warrants a proper comparison,
01:01:09.040 | because there are a lot of similarities, namely memory networks from Jason Weston.
01:01:14.200 | Those basically also have input, scoring, attention, and response mechanisms.
01:01:21.400 | The main difference is that they use different kinds of basic Lego blocks for
01:01:27.280 | these different kinds of mechanisms.
01:01:29.920 | For input, they use bag-of-words representations, or nonlinear and
01:01:33.000 | linear embeddings.
01:01:33.840 | For the attention and responses,
01:01:37.920 | they have different kinds of iteratively run functions.
01:01:40.800 | The main interesting sort of difference to the DMN is that the DMN really uses
01:01:47.160 | recurrent neural network type sequence models for
01:01:49.920 | all of these different modules and capabilities.
01:01:53.960 | And in some sense, that helps us to have a broader range of applications that include
01:01:58.520 | things like sequence tagging.
01:02:00.240 | And so let me go over a couple of results and experiments of this model.
01:02:04.920 | So the first one is on this bAbI dataset that Facebook published.
01:02:13.160 | It basically has a lot of these kinds of simple, logical reasoning type questions.
01:02:18.600 | In fact, all these, like where's the football?
01:02:20.680 | Those were examples from the Facebook bAbI dataset.
01:02:24.480 | And it also includes things like yes/no questions, simple counting,
01:02:29.560 | negation, some indefinite knowledge where the answer might be maybe.
01:02:33.040 | Basic coreference, where you have to realize who does she refer to or
01:02:38.960 | he, reasoning over time.
01:02:41.840 | If this happened before that, and so on.
01:02:45.040 | And basically, this dynamic memory network, I think, is currently the state of
01:02:48.920 | the art on this dataset of simple logical reasoning.
01:02:54.000 | Now, the problem with this dataset is that it's a synthetic dataset.
01:02:58.480 | And so it had only a certain set of human-defined
01:03:05.720 | generative functions that created certain patterns.
01:03:09.320 | And in that sense, solving it, sometimes with 100% accuracy, is only a necessary and
01:03:14.320 | not a sufficient condition for real question answering.
01:03:17.880 | So there's still a lot of complexity.
01:03:20.920 | The main interesting bit to point out here is that there are different
01:03:24.880 | numbers of training examples for each of these different subtasks.
01:03:30.400 | And so you have basically 1,000 examples of simple negation, for instance.
01:03:35.000 | And it's always a similar kind of pattern, and
01:03:37.560 | hence you're able to classify it very well.
01:03:39.440 | Now, in real language, you will never have that many examples for
01:03:42.640 | each type of pattern you want to learn.
01:03:44.400 | And so general question answering is still an open problem and
01:03:48.720 | non-trivial.
01:03:49.240 | Now, what's cool is this same architecture of allowing the model to go over inputs
01:03:55.120 | multiple times also got state of the art in sentiment analysis.
01:03:59.160 | Very different kind of task.
01:04:02.720 | And we actually analyzed whether it's really helpful to have
01:04:07.120 | multiple passes over the input, and it turns out it is.
01:04:10.440 | So there's certain things like reasoning over three facts or counting,
01:04:15.040 | where you really have to have this dynamic, this episodic memory module,
01:04:20.200 | and it goes over the input maybe five times.
01:04:23.400 | For sentiment, it actually turns out it hurts
01:04:26.520 | after going over the input more than two times.
01:04:30.320 | And that's actually one of the things we're now working on is,
01:04:32.680 | can we find a single model that does the same thing for
01:04:35.640 | every single input with the same weights to try to learn these different tasks?
01:04:40.240 | We can actually look at a couple of fun examples of this model and
01:04:47.560 | what happens with tough sentiment sentences.
01:04:50.480 | Generally, to be honest, sentiment,
01:04:52.840 | you can probably get to 75% accuracy with some very simple models.
01:04:58.240 | It's basically just finding words like great and wonderful and awesome, and
01:05:02.480 | you'll get to something that's roughly right.
01:05:04.840 | Here are some of the examples that, those are the kinds of examples that you now
01:05:09.440 | need to get right to really try to push the state of the art further in sentiment
01:05:13.200 | analysis.
01:05:14.360 | So here, the sentence is, in its ragged, cheap, and unassuming way,
01:05:18.440 | the movie works.
01:05:19.880 | So this sentence is classified incorrectly if you have the DMN,
01:05:24.400 | this whole architecture, but only allow one pass over the input.
01:05:28.600 | Once you have two passes over the input, it actually learns to pay attention,
01:05:32.640 | not just to these very strong adjectives,
01:05:37.600 | but in the end, actually to the movie working.
01:05:41.600 | So here, these fields are essentially the gating function G
01:05:47.200 | that we defined that pays attention to specific words.
01:05:51.160 | And the darker it is, the larger that gate is, and the more open it is, and
01:05:56.200 | the more that word affects the hidden state in the episodic memory module.
01:06:00.960 | So it goes over the input the first time, pays attention to cheap and
01:06:07.520 | unassuming and way, and a little bit to works too.
01:06:11.280 | But the second time, it basically figured out, it agglomerated sort of the facts of
01:06:15.480 | that sentence, and then learned to pay attention more to specific
01:06:19.960 | words that seem more important.
01:06:22.320 | Just one more example here,
01:06:25.800 | my response to the film is best described as lukewarm.
01:06:29.280 | So in general, in sentiment analysis, when you look at sort of unigram scores,
01:06:36.440 | like the word best is basically one of the most positive words you could
01:06:41.280 | possibly use in a sentence.
01:06:42.960 | And the first time the model passes over the sentence,
01:06:46.080 | it also pays most attention to this incredibly positive word, namely best.
01:06:51.000 | But then, once it agglomerated the context, it actually realizes, well,
01:06:55.160 | best here is actually not used as an adjective,
01:07:00.800 | but it's actually an adverb that best describes something, and
01:07:04.360 | what it describes is actually lukewarm, and hence, it's actually a negative sentence.
01:07:09.120 | So those are the kinds of examples that you need to get to now to appreciate
01:07:13.120 | improvements in sentiment analysis, where we basically also went through,
01:07:18.720 | on this particular data set, a series of neural network type models that started at 82%.
01:07:24.000 | Until then, that same data set had existed for around eight years, and
01:07:28.680 | none of the standard NLP models had reached above 80% accuracy.
01:07:32.800 | And now, we're basically in the high 80s.
01:07:36.200 | And those are the kinds of improvements that you see across a variety of
01:07:40.760 | different NLP tasks now that deep learning has come and
01:07:45.600 | deep learning techniques are being used in NLP.
01:07:49.480 | And now, the last task in NLP that this model turned out to also work incredibly
01:07:53.640 | well on is part of speech tagging.
01:07:54.800 | Now, part of speech tagging is less exciting of a task.
01:07:57.160 | It's more of an intermediate task.
01:07:58.520 | But it's still fascinating to see that after this data set has been around for
01:08:04.560 | over 20 years, you can still improve the state of the art with the same kind of
01:08:09.320 | architecture that also did well on fuzzy reasoning of sentiment and
01:08:12.640 | discrete logical reasoning for question answering.
01:08:16.080 | Now, we had a new person join the group, Caiming.
01:08:20.640 | And he thought, well, that's cool.
01:08:25.040 | But he was more of a computer vision researcher.
01:08:28.480 | And so he thought, well, could I use this great question answering module now
01:08:33.000 | to do visual question answering?
01:08:35.560 | To combine sort of some of what was going on in the group in NLP and
01:08:39.320 | apply it to computer vision.
01:08:40.880 | And he did not have to know all of the different aspects of the code.
01:08:47.400 | All he had to do was change the input module from one that gives you
01:08:52.080 | hidden states at each word over a long sequence of words and
01:08:57.920 | sentences to an input module that would give him vectors for
01:09:02.280 | sequences of regions in an image.
01:09:04.880 | And he literally did not touch some of the other parts of the code.
01:09:09.040 | He only had to look carefully at this input module, where, again,
01:09:14.880 | our basic Lego block is the convolutional
01:09:21.080 | neural network that Andrej introduced really well, and the convolutional neural network will essentially give us
01:09:25.240 | 14 by 14 many vectors in one of its top layers,
01:09:31.320 | one representing each region of an image.
01:09:34.520 | And then what we'll do is basically take those vectors and
01:09:37.200 | now replace the word vectors we used to have with CNN vectors, and
01:09:41.560 | then plug them into a GRU.
01:09:43.600 | Now again, the GRU, we know as our basic Lego block, we already defined it.
01:09:48.960 | One addition here is that it'll actually be a bidirectional GRU.
01:09:53.360 | We'll go once from left to right in this snake-like fashion, and
01:09:57.840 | another one goes from right to left backwards.
01:10:01.440 | Now both of these will basically have a hidden state, and you can just concatenate
01:10:05.440 | the hidden states of both of these to compute the final hidden state for
01:10:10.560 | each block of the image.
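A hedged sketch of that visual input module: take the grid of region vectors from a CNN feature map, traverse it in snake-like order, and run a bidirectional GRU over the resulting sequence, concatenating the two directions per region. The dimensions and the traversal detail are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VisualInputModule(nn.Module):
    def __init__(self, cnn_dim=512, hidden=256):
        super().__init__()
        self.bigru = nn.GRU(cnn_dim, hidden, bidirectional=True, batch_first=True)

    def forward(self, feature_map):                 # feature_map: (14, 14, cnn_dim) region vectors
        rows = []
        for i, row in enumerate(feature_map):
            rows.append(row if i % 2 == 0 else row.flip(0))   # snake-like left-right ordering
        seq = torch.cat(rows, dim=0).unsqueeze(0)             # (1, 196, cnn_dim)
        out, _ = self.bigru(seq)                              # (1, 196, 2 * hidden)
        return out.squeeze(0)                                 # one vector per image region
```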
01:10:13.200 | And that model, too, actually achieved state of the art results.
01:10:17.560 | This data set was only released last year, so
01:10:22.040 | everybody now works on deep learning techniques to try to solve it.
01:10:25.160 | And I was at first a little skeptical.
01:10:28.560 | It was just too good to be true that this model we developed for
01:10:31.280 | NLP would work so well.
01:10:32.760 | So we really dug into looking at the attention.
01:10:36.840 | So what I showed you here, these g values,
01:10:41.480 | again, that we computed with this equation.
01:10:46.800 | Now, instead of paying attention to words,
01:10:50.440 | it paid attention to different regions in the image.
01:10:54.480 | And we started basically analyzing, going through a bunch of those on the dev set,
01:11:00.320 | and analyzing what is it actually paying attention to.
01:11:02.960 | Again, it's being trained only with the image, the question, and
01:11:07.240 | the final answer.
01:11:08.320 | That's what you get at training time.
01:11:10.240 | You do not get this sort of latent representation of where you should
01:11:14.360 | actually pay attention to in the image in order to answer that question correctly.
01:11:18.600 | So when the question was, what is the main color on the bus?
01:11:22.040 | It learned to actually pay attention here to that bus.
01:11:25.400 | I'm like, well, okay, maybe that's not that impressive.
01:11:27.960 | It's just the main object in the center of the image.
01:11:30.960 | And what type of trees are in the background?
01:11:35.600 | Well, maybe it just connects tree with anything that's green and
01:11:39.600 | pays attention to that.
01:11:41.040 | So it was neat, but not super impressive yet.
01:11:45.760 | So, is this in the wild, is kind of more interesting, and it
01:11:48.120 | actually pays attention to a man-made structure in the background
01:11:51.600 | and incorrectly answers no.
01:11:55.160 | Then this one is kind of interesting.
01:11:57.560 | Who is on both photos?
01:12:00.160 | The answer is girl.
01:12:01.440 | Now, to be honest, I don't think the model actually knows that there are two people,
01:12:06.920 | tries to match them, and so on.
01:12:08.720 | It just finds the main person or main object in this scene.
01:12:13.920 | The main object is a little baby girl, so it says girl.
01:12:18.520 | This one's also relatively trivial.
01:12:20.360 | What time of day was this picture taken?
01:12:22.280 | The answer is night, because it's a very dark picture, at least in the sky.
01:12:26.720 | This one is getting a little more interesting.
01:12:28.520 | What is the boy holding?
01:12:30.080 | The answer is surfboard, and it actually does pay attention to both of the arms,
01:12:35.160 | and then what's just below that arm.
01:12:37.160 | So that's a little more interesting kind of attention visualization.
01:12:41.200 | And then for a while, we're also worried, well, what if in the data set,
01:12:45.640 | it just learns really well from language alone?
01:12:48.440 | Yes, it pays attention to things, but maybe it'll just say things
01:12:51.600 | that it often sees in the text.
01:12:53.240 | So if I ask you what color are the bananas,
01:12:56.360 | you don't really have to look at an image.
01:12:58.120 | In 95% of the cases, you're right just saying yellow without seeing an image.
01:13:03.040 | So this one I was kind of excited about, because it actually
01:13:07.160 | paid attention to the bananas in the middle, and then did say green,
01:13:10.400 | and kind of overruled the prior that it would get from language alone.
01:13:17.640 | What is the pattern on the cat's fur on its tail?
01:13:20.440 | Pays attention mostly to the tail and says stripes.
01:13:24.760 | Now, this one here was interesting.
01:13:26.600 | Did the player hit the ball?
01:13:28.200 | The answer is yes, though I have to say that we later
01:13:32.560 | had a journalist who wanted to ask his own question.
01:13:36.040 | This was John Markoff from The New York Times, and
01:13:40.920 | we had just put together this demo the night before.
01:13:44.840 | And he's like, well, I want to ask my own question.
01:13:46.520 | I'm like, okay.
01:13:48.040 | And he asked, is the girl wearing a hat?
01:13:51.720 | And it wasn't made for production, so it was kind of slow, and
01:13:55.040 | the system was cranking.
01:13:56.200 | I'm like, well, and I'm trying to come up with excuses.
01:13:59.280 | It's kind of black background and a black hat, and it might be kind of hard to see.
01:14:03.560 | And fortunately, it got it right and said yes.
01:14:06.680 | And then after the interview, I said, well, maybe let's look and
01:14:09.920 | see; I would just ask it myself, in a less stressful situation,
01:14:14.840 | a bunch of questions on my own.
01:14:16.640 | And these are all the questions, the first eight questions that I could come up with.
01:14:20.640 | And somewhat to my surprise, it actually got them all right.
01:14:23.560 | So what is the girl holding, a tennis racket?
01:14:26.080 | What is she playing, playing tennis?
01:14:27.840 | Or what is she doing?
01:14:29.600 | Is the girl wearing shorts?
01:14:31.400 | What is the color of the ground, brown?
01:14:33.000 | Then I was like, well, okay, let's try to break it by asking just what's the color
01:14:36.520 | of one of the smallest objects, the ball.
01:14:39.920 | Actually got that right too. The color of her skirt, white.
01:14:42.920 | Also kind of interesting, you can ask the model whether she's wearing shorts, but
01:14:46.440 | also ask about the skirt, and it still sort of is capturing
01:14:51.520 | that you might call this different things.
01:14:54.480 | And then this one was interesting, what did the girl just hit?
01:14:58.160 | Tennis ball.
01:14:59.160 | And then I was like, well, what if I ask, is the girl about to hit the tennis ball?
01:15:04.200 | It said yes.
01:15:04.960 | And then, did the girl just hit the tennis ball?
01:15:06.840 | And it said yes again.
01:15:08.000 | So then I finally found a way to break it, so
01:15:09.880 | it doesn't have enough of the co-occurrence statistics to understand,
01:15:13.560 | and again, understand in scare quotes, sort of which angle the arm has to be in order to
01:15:18.640 | assume that the ball was just hit or was about to be hit.
01:15:21.200 | But what it basically does show us is that once it saw a lot of examples
01:15:27.720 | on a specific domain, it really can capture quite a lot of different things.
01:15:31.800 | And now, let's see if we can get the demo up.
01:15:35.400 | I have to be on a VPN to make it work.
01:15:38.960 | But so here's one example.
01:15:43.320 | The best way to hope for any chance of enjoying this film is by lowering your
01:15:46.840 | expectations.
01:15:47.760 | Again, one of those kinds of sentences that you have to now get
01:15:53.720 | correct in order to get improved performance on sentiment.
01:15:58.880 | And it actually correctly says that this is negative.
01:16:03.600 | Now, we can also actually ask that question in Chinese.
01:16:09.280 | And this is one of the beautiful things of the DMN and
01:16:14.600 | in general really of most deep learning techniques.
01:16:17.160 | We don't have to be experts in a domain or even in a language to create a very,
01:16:21.600 | very accurate model for that language or that domain.
01:16:25.920 | There's no more feature engineering.
01:16:28.360 | I'm not gonna make a fool of myself trying to read that one out loud, but
01:16:30.640 | it's an interesting example.
01:16:32.720 | You can also do this, showing what parts of speech there are.
01:16:38.040 | You can have other things like named entities and other sequence problems.
01:16:43.480 | You can also ask, what are the men wearing on the head?
01:16:46.360 | Answer is helmets.
01:16:48.680 | And then maybe a slightly more interesting question,
01:16:51.760 | why are the men wearing helmets?
01:16:53.520 | And the answer is safety.
01:16:56.440 | So, especially since we're close to the Circle of Death here at Stanford,
01:17:01.200 | where a lot of bikes crash, it's a good answer.
01:17:04.400 | All right, with that, I wanna leave a couple of minutes for questions.
01:17:10.160 | So basically the summary is, word vectors and
01:17:13.320 | recurrent neural networks are super useful building blocks.
01:17:17.080 | Once you really appreciate and understand those two building blocks,
01:17:20.760 | you're kind of ready to have some fun and build more complex models.
01:17:24.840 | Really in the end, this DMN is a way to combine that in just a variety of
01:17:29.000 | new ways to a larger, more complex model.
01:17:31.760 | And that's also where I think the state of deep learning is for
01:17:35.280 | natural language processing.
01:17:36.400 | We've tackled a lot of these smaller sub-problems, intermediate tasks, and
01:17:40.920 | now we can work on more interesting, complex problems like dialogue and
01:17:45.640 | question answering, machine translation, and things like that.
01:17:49.240 | All right, thank you.
01:17:49.880 | >> [APPLAUSE]
01:17:59.200 | >> Five minutes?
01:18:00.480 | All right, cool, yeah, five minutes.
01:18:01.880 | >> A quick question, in the dynamic memory network, you have the RNN.
01:18:07.880 | And you also mentioned that you could have better assumptions about the input, right?
01:18:14.680 | So you also worked on the tree LSTM, right?
01:18:18.080 | So if you change the RNN into a tree structure, would that help?
01:18:22.480 | >> It's a good question.
01:18:24.360 | I actually love tree structures.
01:18:26.840 | I did my whole PhD about tree structures.
01:18:28.840 | And somewhat surprising in the last couple weeks,
01:18:32.680 | there are actually some new results on SNLI, the Stanford Natural Language
01:18:37.720 | Inference data set, where tree structures are again the state of the art.
01:18:42.080 | Though I have to say that I think the dynamic memory network,
01:18:48.600 | by having this ability in the episodic memory to keep track of different sub
01:18:53.280 | phrases and pay attention to those and then combine them over multiple passes,
01:18:58.520 | I think you can kind of get away with not having a tree structure.
01:19:01.080 | So yes, you might have a slight improvement
01:19:04.640 | representing sentences as trees in your input module.
01:19:10.320 | But I think the improvements are only going to be slight.
01:19:11.920 | And I think the episodic memory module that has this capability to go over the
01:19:15.000 | input multiple times, pay attention to certain sub phrases,
01:19:17.880 | will capture a lot of the kinds of complexities that you might
01:19:21.120 | want to capture in tree structures.
01:19:22.400 | So my short answer is, I don't think you necessarily need it.
01:19:25.760 | >> So have you tried it?
01:19:27.200 | >> We have not, no.
01:19:28.200 | >> Thanks.
01:19:28.700 | >> Hi, my question is about question answering.
01:19:35.200 | So if we want to apply question answering to some specific domains like
01:19:40.760 | healthcare, but we don't really have the data, we don't have question answer pairs.
01:19:44.880 | And what should we do?
01:19:46.880 | Are there any general principles here?
01:19:49.880 | >> That's a great question.
01:19:50.680 | What do you do if you want to do question answering on a complex domain,
01:19:53.800 | you don't have the data?
01:19:54.760 | I think, and this feels maybe like a cop out, but
01:19:58.600 | I think it's very true both in practice and in theory, create the data.
01:20:02.800 | Like if you cannot possibly create more than a thousand examples of anything,
01:20:07.360 | then maybe automating that process is not that important.
01:20:10.040 | So clearly, you should be able to create some data.
01:20:12.680 | And in many cases, that is the best use of your time, is just to sit down or
01:20:16.560 | ask the domain expert to create a lot of questions and
01:20:19.240 | then have people find the answers.
01:20:21.320 | And then measure how they actually get to those answers.
01:20:24.960 | Try to have them in a constrained environment and so on.
01:20:27.720 | I think most companies, for instance,
01:20:29.760 | when you try to do automated email replies, which is in some ways a little bit
01:20:33.640 | similar to question answering, well, that is a nice domain because
01:20:39.080 | people have already emailed and those emails were already answered before, so
01:20:42.440 | you can use sort of past behavior.
01:20:44.440 | Now, if you had a search engine where people asked a lot of questions,
01:20:48.000 | then you can also use that to bootstrap and see where did they actually fail?
01:20:52.200 | And then take all those really tough queries where they failed,
01:20:55.640 | have some humans sit there and collect the data.
01:20:57.440 | So that's the simplest answer.
01:20:59.560 | Now, the other answer is, let's work together for the next many years on
01:21:04.120 | research for smaller training data set sizes and complex reasoning.
01:21:08.560 | The fact of the matter for that line of research will still be,
01:21:12.880 | if a system has never seen a certain type of reasoning,
01:21:17.000 | it'll be hard for the system to pick up that type of reasoning.
01:21:21.200 | I think we're going to get with these kinds of architectures to the space where
01:21:24.840 | at least if it has seen this type of reasoning, a specific type of
01:21:28.720 | transitive reasoning or temporal reasoning or sort of cause and
01:21:33.160 | effect type reasoning, at least a couple hundred times,
01:21:36.520 | then you should be able to train a system with these kinds of models to do it.
01:21:40.200 | >> Are these QA systems currently robust to false input or questions?
01:21:49.600 | For the woman playing tennis, if you asked, what's the man holding?
01:21:53.720 | Would it reply, there's no man?
01:21:56.080 | >> It would not.
01:21:57.560 | And largely because at training time, you never try to mess with it like that.
01:22:02.880 | I'm pretty sure if you added a lot of training samples where you had those,
01:22:07.240 | it would probably eventually pick it up.
01:22:09.320 | >> Those would be important for real world implementations and-
01:22:12.520 | >> So real world implementations of
01:22:15.040 | this in security are actually kind of tricky.
01:22:16.920 | I think whenever you train a system, we know we can, for instance,
01:22:20.440 | both steal certain classifiers by using them a lot.
01:22:23.440 | We know we can fool them into classifying certain images, for instance, as others.
01:22:28.280 | We have folks in the audience who worked on that exact line of work.
01:22:32.680 | So I would be careful using it in security environments right now.
01:22:36.720 | >> Yeah, hi, I have a question.
01:22:39.800 | >> Wow, up there, hi there.
01:22:42.840 | >> [LAUGH] Yeah, I have a question actually.
01:22:46.040 | There was a slide where you had the input module and
01:22:48.400 | then there were a bunch of sentences.
01:22:51.160 | So were those sentences themselves RNNs?
01:22:54.360 | Because the sequence is basically made up of those individual words and,
01:22:58.560 | say, GloVe representations.
01:23:01.080 | So were those also RNNs that were stitched together?
01:23:06.320 | >> So the answer there is a little complex,
01:23:07.960 | because we have two papers with the DMN, and the answer is different for each.
01:23:12.960 | In the simplest form of that, it is actually a single GRU that goes from
01:23:18.760 | the first word through all the sentences as if they're one gigantic sequence.
01:23:23.560 | But it has access to each sentence period at the end to pay special attention to
01:23:28.600 | the end of sentences.
01:23:30.640 | And so yes, in the simplest form, it is just a GRU that goes over all the words.
01:23:35.240 | >> So is this a normal process to basically just concatenate all the sentences
01:23:39.840 | into one gigantic- >> So the answer there, and
01:23:44.240 | this is kind of why I split the talk into three different ones from words,
01:23:48.800 | single sentences, and then multiple sentences.
01:23:51.160 | I think if you just had a single GRU that goes over everything and
01:23:54.600 | now you try to reason over that entire sequence, it would not work very well.
01:23:58.240 | You really need to have an additional structure, such as an attention mechanism
01:24:02.400 | or a pointer mechanism that has the ability to pay attention to specific parts
01:24:06.280 | of your input to do that very accurately.
01:24:09.400 | But yeah, in general, that's fine, as long as you have this additional mechanism.
01:24:13.360 | Thank you. >> Thank you, great question.
01:24:14.920 | >> So in recurrent neural nets, you're using sigmoids.
01:24:19.080 | In visual recognition, I guess,
01:24:22.560 | rectified linear units were the more popular non-linearity.
01:24:27.760 | >> That's right, so ReLUs are great.
01:24:30.000 | Now, when you look at the GRU equations here, you have these reset gates.
01:24:34.400 | And so these reset gates here, you want them to essentially be between 0 and 1,
01:24:38.960 | so that it can either ignore this input entirely or
01:24:41.720 | you have it normally be part of the computation of h tilde.
01:24:45.480 | So in some cases, you really do want to have sigmoids there.
01:24:50.480 | But for other ones, for instance, some simpler things where you actually don't have
01:24:56.360 | that much recurrence, such as going from one memory state to another,
01:24:59.080 | in the second iteration of this model,
01:25:01.720 | ReLUs were actually good activation functions too.
01:25:08.440 | >> Did you guys try to, after training this network,
01:25:12.520 | try to take these weights for the images and do object detection again?
01:25:16.600 | So these weights would be augmented with the text vectors.
01:25:21.640 | Did you try to use- >> That is a very cool idea that we did
01:25:24.680 | not explore.
01:25:25.560 | No, there you go.
01:25:26.720 | You gotta do it fast.
01:25:30.360 | >> [LAUGH] >> This field is moving fast,
01:25:33.080 | you just let the cat out of the box.
01:25:34.360 | >> [LAUGH] >> So
01:25:41.240 | those attention models are pretty powerful when you have enough training data and
01:25:45.800 | then you can learn to make good use of the data.
01:25:50.880 | But even though some of the tasks are pretty, I guess, trivial to human, but
01:25:55.680 | it's hard for a model to learn.
01:25:58.400 | So what do you think of, I guess, even right now,
01:26:01.800 | we have a lot of knowledge base on the web, right, like Wikipedia.
01:26:06.000 | We know a lot about common sense, but
01:26:09.760 | what do you think about incorporating those knowledge bases into those models?
01:26:14.480 | >> I actually love that line of research too, and
01:26:18.760 | that was kind of what we started out with.
01:26:21.000 | This semantic memory module in the simplest form is just word vectors.
01:26:24.640 | I think one next iteration would actually be to have knowledge bases also influence
01:26:28.600 | the reasoning.
01:26:30.000 | There's very little work on combining text and
01:26:34.480 | knowledge bases to do overall complex question answering that requires reasoning.
01:26:39.320 | I think it's a phenomenally interesting area of research.
01:26:42.760 | >> So are there any hints or any starting points about how to encode those?
01:26:47.520 | >> There are some papers that do reasoning over knowledge bases alone.
01:26:52.600 | So we had a paper on recursive neural tensor networks that basically takes
01:26:56.680 | a triplet, a word vector for an entity, might be in Freebase, might be in WordNet.
01:27:01.920 | A vector for a relationship, and a vector for another entity.
01:27:09.040 | And then basically pipe them into a neural network and say, yes, no,
01:27:11.840 | are these two entities actually in that relationship?
01:27:13.920 | And you can have a variety of different architectures.
01:27:17.280 | I think Samy worked on that as well.
01:27:20.080 | Wait, that's a different brother, different Bengio.
01:27:22.120 | >> [LAUGH] >> Over there, all right.
01:27:24.440 | >> [LAUGH] >> And-
01:27:26.720 | >> And I did too.
01:27:27.600 | >> That's true, that's true, yeah.
01:27:29.040 | Antoine Bordes, right?
01:27:30.480 | That's right, that's right.
01:27:31.400 | So I think you can also reason over knowledge graphs.
01:27:37.000 | And you could then try to combine that with reasoning over fuzzy text.
01:27:41.240 | It all has been done.
01:27:43.240 | I think nobody has yet really combined it in a principled way.
01:27:45.520 | Great question.
01:27:48.720 | Yeah, one last question.
01:27:49.800 | >> I have a question.
01:27:50.320 | So while the model answers my questions correctly,
01:27:55.480 | how do I check that the model actually understood my question?
01:27:59.920 | And what's the logic, what's the model's logic behind that?
01:28:03.960 | >> It's a good question.
01:28:05.000 | In some ways, it's a common question for neural network interpretability.
01:28:10.800 | >> So in computer vision, sometimes we can at least visualize the features.
01:28:16.400 | Right, so how about the- >> That's right.
01:28:17.960 | And so I think the best thing that we could do right now is to show these
01:28:21.960 | attention scores where for sentiment, we're like, how did it come up with the sentiment?
01:28:26.920 | It paid attention to the movie working.
01:28:29.800 | And likewise for question answering, we can see which sentences did it actually
01:28:34.600 | pay attention to in order to answer that overall question.
01:28:37.880 | So that is, I think, the best answer that we could come up with right now.
01:28:41.800 | But yeah, there are certain other complexities that are still an area
01:28:45.880 | of open research.
01:28:46.440 | >> Thanks.
01:28:47.280 | >> Thank you.
01:28:47.800 | All right, thank you, everybody.
01:28:48.640 | >> [APPLAUSE]
01:28:55.680 | >> So thank you, Richard.
01:28:56.800 | We'll take another coffee break for 30 minutes, so
01:28:59.480 | please come back at 2:45 for our presentation by Sherry Moore.