Sequence to Sequence Deep Learning (Quoc Le, Google)
Chapters
0:00
2:10 Preprocessing
2:34 Feature Representation
4:56 Training with stochastic gradient descent
7:28 Information Loss
7:42 Recurrent Neural Network
9:34 Training RNN with stochastic gradient descent
17:10 Better Formulation
19:40 Sequence to Sequence Training with SGD
22:19 Sequence to Sequence Prediction
30:31 Scheduled Sampling
37:41 The big picture so far
41:14 Model Understandability with Attention Mechanism
48:08 LSTMCell vs. RNNCell
49:39 Applications
52:18 Sequence to Sequence With Attention for Speech
And then in the second part, I will place sequence to sequence in a broader context 00:00:27.200 |
So a week ago, I came back from vacation, and 00:00:30.560 |
in my inbox, I had 508 unreplied emails. 00:00:36.800 |
And a lot of those emails basically just require a yes or no answer. 00:00:42.280 |
So let's try to see whether we can build a system that can answer them automatically. 00:00:53.480 |
And for example, some of the email would be from my friend, Ann. 00:01:02.800 |
are you visiting Vietnam for the New Year, Quoc? 00:01:10.080 |
So you can gather a data set like this, and then you have some input content. 00:01:16.520 |
So for now, let's ignore the author of the email and the subject. 00:01:25.400 |
So let's suppose that you gather some emails, and some input would be something 00:01:28.560 |
like, are you visiting Vietnam for the New Year, Quoc? 00:01:33.000 |
And then another email would be, are you hanging out with us tonight? 00:01:52.880 |
Now let's do a little bit of processing, where basically in 00:01:59.640 |
the previous slide, we have year and then a comma, and 00:02:06.160 |
then Quoc, and then a question mark, and so on. 00:02:09.080 |
So let's do a little bit of processing and put 00:02:14.600 |
a space between year and the comma, and between Quoc and the question mark, and so on. 00:02:20.600 |
A lot of people call this step tokenization and normalization. 00:02:26.360 |
Now, the second step I would do is feature representation. 00:02:34.520 |
So in this step, what I'm gonna do is the following. 00:02:36.440 |
I'm gonna construct a 20,000-dimensional vector. 00:02:40.200 |
20,000 represents the size of the English vocabulary. 00:02:45.600 |
I'm gonna count how many times a particular word occurs in my email. 00:02:50.160 |
For example, the word "are" occurs once in my email, so I increase that counter. 00:02:59.080 |
And then "you" occurs once, so I increase another counter, and so on. 00:03:04.760 |
And then I will reserve a token at the end to 00:03:08.200 |
count all the words that are out of vocabulary. 00:03:19.240 |
So you're gonna convert all of your emails into input/output pairs, 00:03:23.440 |
where the input is a fixed-length representation, a 20,000-dimensional vector. 00:03:36.460 |
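To make that step concrete, here is a minimal sketch in numpy, with a toy vocabulary standing in for the 20,000-word English vocabulary plus one extra slot that counts out-of-vocabulary words.

```python
import numpy as np

def bag_of_words(tokens, vocab):
    """Count how often each vocabulary word occurs; the last slot counts out-of-vocabulary words."""
    index = {word: i for i, word in enumerate(vocab)}
    vec = np.zeros(len(vocab) + 1)                      # +1 for the OOV counter
    for tok in tokens:
        vec[index.get(tok, len(vocab))] += 1
    return vec

# toy vocabulary; the real one would have 20,000 words
vocab = ["are", "you", "visiting", "vietnam", "for", "the", "new", "year", ",", "?"]
email = "are you visiting vietnam for the new year , quoc ?".split()
x = bag_of_words(email, vocab)                          # "quoc" falls into the OOV slot
```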
Okay, so as somebody in the audience said, 00:03:43.200 |
the order of the words doesn't matter here, and the answer is yes. 00:03:57.600 |
My job now is to try to find some w such that w times x can approximate y. 00:04:09.440 |
So because this problem has two categories, 00:04:13.800 |
you can think of it as a logistic regression problem. 00:04:16.960 |
Now, if anybody followed the great CS229 class by Andrew Ng, you already know this. 00:04:24.520 |
But in short, the algorithm goes as follows. 00:04:30.040 |
You kind of try to come up with a vector for every email. 00:04:40.440 |
The first column of w will give the probability 00:04:43.160 |
that the email should be answered yes. 00:04:50.400 |
And then you basically take the dot product between the first column and x. 00:04:58.720 |
So you run for iterations one to, like, a million; you run for a long, long time. 00:05:05.160 |
You sample a random email x and its reply. 00:05:08.120 |
And then if the reply is yes, then you wanna update your w1 and 00:05:14.640 |
w2 such that you increase the probability that the answer is yes. 00:05:22.520 |
Now if the correct reply is no, then you're gonna update w1 and 00:05:29.600 |
w2 so that you can increase the probability of 00:05:34.880 |
the email to be answered as no, so the second probability, okay? 00:05:44.520 |
Now, when I say update to increase the probability, 00:05:52.480 |
what that means is that you find the partial derivative 00:05:57.400 |
of the objective function with respect to some parameter. 00:06:01.600 |
So now you have to pick some alpha, which is the learning rate. 00:06:06.120 |
And then you say w1 is equal to w1 plus alpha times 00:06:11.040 |
the partial derivative of log p1 with respect to w1, okay? 00:06:19.040 |
Now, I cheated a little bit here because I used the log function. 00:06:22.840 |
It turns out because the log function is a monotonic increasing function. 00:06:27.360 |
So increasing p1 is equivalent to increasing the log of p1, okay? 00:06:32.200 |
And usually, with this formulation, stochastic gradient descent works better. 00:06:38.360 |
And then you also update w2 when the correct reply is yes. 00:06:48.960 |
And you have a different way to update if the reply is no. 00:06:55.960 |
So then, if you have a new email coming in, 00:07:00.960 |
you take x and then you convert it into the vector. 00:07:04.760 |
Then you compute the first probability, okay, the exponential of w1 times x divided by the sum of the two exponentials, 00:07:15.000 |
And if that probability is larger than 0.5, then you say yes. 00:07:18.880 |
And if that probability is less than 0.5, then you say no. 00:07:24.160 |
Okay, so that's how you do prediction with this. 00:07:27.120 |
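Here is a minimal numpy sketch of that loop: a two-class softmax trained by stochastic gradient ascent on the log-probability of the correct reply, followed by the 0.5-threshold prediction rule. The toy dataset (`emails`, `labels`) is made up for illustration; in practice x would be the 20,000-dimensional count vector from above.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, alpha = 20_001, 0.1                       # vocab counts + OOV slot, learning rate
w1, w2 = np.zeros(dim), np.zeros(dim)          # w1 scores "yes", w2 scores "no"

# made-up stand-in data so the sketch runs; real x's come from the bag-of-words step
emails = [rng.poisson(0.01, dim).astype(float) for _ in range(100)]
labels = [bool(rng.integers(2)) for _ in range(100)]

def probs(x):
    z = np.array([w1 @ x, w2 @ x])
    z -= z.max()                               # for numerical stability
    e = np.exp(z)
    return e / e.sum()                         # [p_yes, p_no]

for step in range(10_000):                     # in the talk: run for, like, a million steps
    i = rng.integers(len(emails))              # sample a random email
    x, reply_is_yes = emails[i], labels[i]
    p = probs(x)
    # gradient of log p(correct class): d log p1 / d w1 = (1 - p1) x, d log p1 / d w2 = -p2 x
    if reply_is_yes:
        w1 += alpha * (1 - p[0]) * x
        w2 -= alpha * p[1] * x
    else:
        w2 += alpha * (1 - p[1]) * x
        w1 -= alpha * p[0] * x

def predict(x):
    return "yes" if probs(x)[0] > 0.5 else "no"
```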
Now, there's a problem with this representation, 00:07:35.040 |
which somebody in the audience already pointed out: the order of the words doesn't matter. 00:07:40.840 |
Now, let's fix this problem by using something called the Recurrent Network. 00:07:47.560 |
And I think Richard Socher already talked about Recurrent Networks and 00:07:53.840 |
some part of this yesterday, and Andrej as well. 00:07:57.840 |
Now, the idea of a Recurrent Network is basically you also have 00:08:02.200 |
fixed length representation for your input, but 00:08:05.840 |
it actually preserves some sort of ordering information. 00:08:08.880 |
And the way that you compute the hidden units is the following. 00:08:15.680 |
So the hidden state h of 0 is basically a hyperbolic tangent of U times the word vector of the first word. 00:08:31.080 |
Okay, so Richard also talked about word vectors yesterday. 00:08:35.800 |
So you can take word vectors coming out of word2vec, or 00:08:39.600 |
you can just actually randomly initialize them if you want to. 00:08:53.200 |
And then when you see the next word, "you", you compute h of 1, which is the hyperbolic tangent of A times h of 0 plus U times the word vector for "you". 00:09:04.240 |
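A minimal sketch of that recurrence in numpy, with made-up sizes: shared matrices A and U, randomly initialized word vectors, and h_t = tanh(A h_{t-1} + U v_word).

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"are": 0, "you": 1, "visiting": 2, "vietnam": 3}    # toy vocabulary
emb_dim, hidden_dim = 8, 16                                  # assumed sizes

V = rng.normal(0, 0.1, (len(vocab), emb_dim))    # word vectors (word2vec or random init)
U = rng.normal(0, 0.1, (hidden_dim, emb_dim))    # input-to-hidden matrix, shared across steps
A = rng.normal(0, 0.1, (hidden_dim, hidden_dim)) # hidden-to-hidden matrix, shared across steps

def encode(tokens):
    h = np.zeros(hidden_dim)                     # no history before the first word
    for tok in tokens:
        h = np.tanh(A @ h + U @ V[vocab[tok]])   # h_t = tanh(A h_{t-1} + U v_word)
    return h                                     # fixed-length summary fed to the classifier W

h_final = encode(["are", "you", "visiting", "vietnam"])
```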
This is one of my three most complicated slides, so 00:09:43.660 |
And basically, you make update on the w matrix, 00:09:47.300 |
which is the classifier at the top, like what I said earlier. 00:09:50.540 |
Now, but you also have to update all the relevant matrices, 00:09:56.940 |
which is the matrix u, the matrix a, and some word vectors, right? 00:10:03.780 |
So this is basically, you have to compute the partial derivative 00:10:07.660 |
of the loss function with respect to those parameters. 00:10:14.820 |
And usually, when I do that myself, I get that wrong. 00:10:19.460 |
But there's a lot of toolkits out there that you can use. 00:10:31.700 |
you can call Theano to actually compute the derivatives. 00:10:35.940 |
And once you have the derivatives, you can just make the updates. 00:10:46.020 |
>> So the matrix U, I share, so I'm gonna go back one slide. 00:10:52.020 |
So the matrix U is shared across all the time steps, right? 00:11:01.540 |
And the size, you have to determine ahead of time. 00:11:04.420 |
For example, the number of columns would be the size of the word vectors. 00:11:11.580 |
But the number of rows could be, like, 1,000 if you want, or maybe 255 if you want. 00:11:18.580 |
So this is model selection, and it depends on whether you're underfitting or 00:11:22.540 |
overfitting to choose a bigger model or a smaller model. 00:11:26.980 |
And it depends on your compute power whether you can train a larger model or a smaller model. 00:11:30.980 |
>> So what would you consider the number of words in the dictionary? 00:11:39.860 |
Yeah, so the word vectors, the number of word vectors that you use, depends on your vocabulary size. 00:11:52.380 |
So you're gonna tend to end up with 20,000 word vectors, right? 00:11:57.500 |
So that means, in that matrix, 00:12:03.940 |
sorry, the number of columns is 20,000, but 00:12:08.060 |
the number of rows, you have to determine yourself. 00:12:23.500 |
So the big picture is I started with bag-of-words representations. 00:12:27.460 |
And then I talked about the RNN as a new way to represent 00:12:33.860 |
variable-size input that can capture some sort of ordering information. 00:12:41.580 |
Then I said that you can compute the partial derivatives automatically. 00:12:43.900 |
And for these, you can find autodiff in TensorFlow or Theano or Torch. 00:12:50.220 |
Now then I talk about stochastic gradient descent as a way to train 00:13:03.260 |
>> How long does training take if you're not using a distributed system? 00:13:07.500 |
>> So that also depends on how big your training set is. 00:13:15.660 |
But usually if you use an RNN and a hidden state of 100, it's manageable. 00:13:22.620 |
Yeah, but it largely depends on the size of the training data, 00:13:29.260 |
because you want to iterate for a lot of, you sample a lot of emails, right? 00:13:33.500 |
And you want your algorithm to see as many emails as possible. 00:13:38.240 |
So, okay, so if you use such algorithm to just say yes, no, and yes, no, 00:13:45.820 |
then you might end up losing a lot of friends. 00:13:55.820 |
Because when you say, for example, my friend asked me, 00:14:01.460 |
are you visiting Vietnam for the New Year, Quoc? 00:14:03.340 |
Then maybe the better answer would be, yes, see you soon, right? 00:14:10.660 |
And then if my friends ask me, are you hanging out with us tonight? 00:14:19.900 |
All right, or did you read the cool [INAUDIBLE] right? 00:14:26.980 |
So before I'm going to tell you the solution, 00:14:31.220 |
I would say this problem basically requires you 00:14:36.220 |
to map between variable size input to some variable size output. 00:14:43.220 |
And if you can do something like this, then there's a lot of applications. 00:14:49.180 |
Because you can do auto reply, which is what we've been working on so far. 00:14:53.260 |
But you can also use it to do translation, 00:15:00.220 |
or image captioning: the input would be a fixed-length vector, a representation coming from a ConvNet. 00:15:06.180 |
And then output would be the cat sat on the mat. 00:15:12.420 |
The input will be a document, and output will be some summary of it. 00:15:19.660 |
you can have input would be speech frames, and output would be words. 00:15:27.060 |
So basically, the input would be the conversation so far, 00:15:51.820 |
that you can configure neural networks to do things. 00:16:05.460 |
The green would be the hidden state, and the output 00:16:13.140 |
Now, one to one is not what we want, because we have many to many. 00:16:17.500 |
So it's probably more like the last two to the right. 00:16:23.340 |
But we arrived at the solution that I said in the red box. 00:16:23.340 |
The reason is because the size of the input and the size of the output can be different. 00:16:37.940 |
Sometimes you have smaller input, but larger output. 00:16:41.900 |
But sometimes you have larger input and smaller output. 00:16:54.940 |
then maybe the output has to be smaller, or at least 00:17:05.420 |
So let's construct a solution that looks like that. 00:17:10.340 |
So the input would be something like, hi, how are you? 00:17:19.660 |
And then you're going to predict the first token, which 00:17:26.940 |
is "am". Then you predict "fine", and then you predict the third token, "thanks". 00:17:29.860 |
And then you keep going on until you predict the word end. 00:17:36.700 |
Now, I want to mention that in the previous set of slides, 00:17:45.340 |
with yes and no, you had only two choices. 00:17:59.140 |
And you can extend it to cover more than two choices, here the whole vocabulary. 00:18:06.860 |
And then the algorithm will just follow in the same way. 00:18:09.460 |
So this is my first solution when I worked with 626. 00:18:19.660 |
The issue is that the model never knows what it actually predicted. 00:18:39.220 |
So instead, you feed what the model predicts in the previous step as input to the next step. 00:18:46.620 |
So for example, in this case, I'm going to take "am". 00:18:52.020 |
And then I'm going to predict the second word, which is "fine", 00:18:58.420 |
So a lot of people call this concept autoregressive. 00:19:02.220 |
So you eat your own output and use it as your input. 00:19:28.820 |
People also call the encoder the recurrent network that reads the input. 00:19:34.980 |
And the decoder would be the recurrent network that produces the output. 00:19:41.460 |
So again, so you basically run for a million steps. 00:19:48.780 |
And for each iteration, you sample an email x and a reply y. 00:20:01.380 |
And then you update the RNN encoder and decoder parameters 00:20:06.100 |
so that you can increase the probability that y of t 00:20:10.900 |
is correct, given all of what you've seen before, 00:20:15.300 |
which is your y of t minus 1, y of t minus 2, et cetera, and the input x. 00:20:22.500 |
And then you have to compute the partial derivatives. 00:20:25.660 |
Computing the partial derivatives by hand is very difficult, so you use something 00:20:31.240 |
like autodifferentiation in TensorFlow or Torch or Theano. 00:20:45.860 |
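As a minimal numpy sketch of what one training example's loss looks like: an encoder RNN reads x, a decoder RNN is fed the true previous token at each step (teacher forcing), and the loss sums -log p(y_t | y_<t, x). All names and sizes here are made up, and in practice you would let TensorFlow, Theano, or Torch compute the gradients of this loss by autodiff.

```python
import numpy as np

rng = np.random.default_rng(0)
V_SIZE, EMB, HID = 20_000, 64, 128                     # assumed sizes
E = rng.normal(0, 0.1, (V_SIZE, EMB))                  # word embeddings
A_e, U_e = rng.normal(0, 0.1, (HID, HID)), rng.normal(0, 0.1, (HID, EMB))   # encoder RNN
A_d, U_d = rng.normal(0, 0.1, (HID, HID)), rng.normal(0, 0.1, (HID, EMB))   # decoder RNN
W = rng.normal(0, 0.1, (V_SIZE, HID))                  # output softmax matrix

def rnn_step(h, token_id, A, U):
    return np.tanh(A @ h + U @ E[token_id])

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def seq2seq_loss(x_ids, y_ids, start_id=0):
    h = np.zeros(HID)
    for t in x_ids:                                    # encoder reads the whole email
        h = rnn_step(h, t, A_e, U_e)
    loss, prev = 0.0, start_id
    for target in y_ids:                               # decoder, one token at a time
        h = rnn_step(h, prev, A_d, U_d)
        loss -= log_softmax(W @ h)[target]             # -log p(y_t | y_<t, x)
        prev = target                                  # feed the true token during training
    return loss

loss = seq2seq_loss([5, 17, 42], [7, 99, 3])           # toy token ids
# an SGD step would then move every parameter along -d loss / d parameter (via autodiff)
```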
It didn't change, because u, v, and a are fixed. 00:20:56.860 |
The thing is that if the RNN were different for different examples, 00:21:15.180 |
OK, yeah, I'm going to get to that in the next slide. 00:21:32.660 |
I would say you usually stop at 400 steps or something, 00:21:40.180 |
because otherwise it's going to be too long to make the update. 00:22:05.180 |
So I'm going to talk about the prediction next. 00:22:09.100 |
So let me go to the prediction, and then you can ask questions. 00:22:13.580 |
So the first algorithm that you can use is called greedy decoding. 00:22:19.180 |
In greedy decoding, for any incoming email x, you predict the most likely first word, feed it back in, and predict the most likely next word. 00:22:39.380 |
You keep going until you see the end token, and then stop. 00:23:03.760 |
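A minimal sketch of greedy decoding. The `decoder_step` here is a made-up stand-in for one step of a trained decoder: it takes the hidden state and the previously emitted token, and returns the new hidden state plus log-probabilities over the vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
V_SIZE, HID, END_ID = 20_000, 128, 1                   # assumed sizes; token 1 plays the role of <end>
A_d = rng.normal(0, 0.1, (HID, HID))
E = rng.normal(0, 0.1, (V_SIZE, HID))
W = rng.normal(0, 0.1, (V_SIZE, HID))

def decoder_step(h, prev_id):
    """One decoder step: new hidden state and log-probabilities over the vocabulary."""
    h = np.tanh(A_d @ h + E[prev_id])
    z = W @ h
    z -= z.max()
    return h, z - np.log(np.exp(z).sum())

def greedy_decode(h_enc, start_id=0, max_len=100):
    out, prev, h = [], start_id, h_enc
    for _ in range(max_len):
        h, logp = decoder_step(h, prev)
        prev = int(np.argmax(logp))                    # take the single most likely token
        if prev == END_ID:
            break                                      # stop when the model emits <end>
        out.append(prev)
    return out

reply = greedy_decode(np.zeros(HID))                   # h_enc would come from the encoder
```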
In beam search, instead of taking only the best word, you take, say, the top three candidates, feed them in the next step, and then you arrive at three hypotheses. 00:23:06.700 |
So at the next step, you're going to have nine candidates. 00:23:10.540 |
And then you're going to keep going that way. 00:23:14.740 |
So given input x, I'm going to predict the first token, the top three candidates, 00:23:24.580 |
and the network will produce another three for each, et cetera. 00:23:27.620 |
So you're going to end up with a lot of candidates. 00:23:35.060 |
You keep the best few, and you compute the joint probability at each step. 00:23:41.060 |
At the end, you pick the sequence that has the highest probability to be the sequence of choice. 00:23:52.700 |
This is the most complicated slide in my talk. 00:24:12.180 |
Now, it turns out in this algorithm, what you do 00:24:14.180 |
is that for any word that is out of vocabulary, you just map it to the unknown token. 00:24:31.300 |
There's a bunch of algorithms to address these issues. 00:24:33.820 |
For example, they break it into characters and things 00:24:37.260 |
like that, and then you could fix this problem. 00:24:54.580 |
So the cost function is that you sample a random word, yt. 00:24:58.780 |
Let's suppose that here, this is my input so far. 00:25:07.660 |
Let's say t is equal to 2, which means the word "fine." 00:25:14.980 |
I want to increase the probability of the model predicting "fine" at that position. 00:25:20.420 |
So every time, the model will make a lot of predictions. 00:25:49.980 |
Yeah, but we don't care if there's no i, m, et cetera. 00:25:56.980 |
So when I'm at "fine," my input would be "hi," "how," "are you," 00:26:13.700 |
And if I'm at the word "thanks," my input would be "hi," "how," 00:26:20.620 |
And I've got to get my "thanks" probability right. 00:26:43.740 |
So let's suppose that you have a lot of candidate words, say k of them. 00:26:53.300 |
So basically, if all the sequences were the same length n, 00:26:56.260 |
the number of paths down the tree is k to the n, right? 00:27:11.180 |
Then you go from 10, to like 100, and then 1,000, 00:27:21.340 |
and you end up with k to the n or something like that. 00:27:45.580 |
So that way, you don't end up with a huge beam. 00:27:48.540 |
And usually, in practice, using a beam size of 3 or 10 works well. 00:28:07.300 |
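A small sketch of beam search along those lines: each surviving hypothesis is extended by its top-k next tokens, hypotheses are scored by summed log-probability, and the list is pruned back to the beam size at every step. `step_fn` is the same kind of made-up decoder step as in the greedy sketch above; with beam_size=1 this degenerates to greedy decoding.

```python
import heapq
import numpy as np

def beam_search(h_enc, step_fn, start_id, end_id, beam_size=3, max_len=100):
    """step_fn(h, token_id) -> (new_h, log-probability vector over the vocabulary)."""
    beam = [(0.0, [start_id], h_enc, False)]           # (score, tokens, hidden, finished)
    for _ in range(max_len):
        candidates = []
        for score, toks, h, done in beam:
            if done:                                   # finished hypotheses are carried over
                candidates.append((score, toks, h, True))
                continue
            h2, logp = step_fn(h, toks[-1])
            for tok in np.argsort(logp)[-beam_size:]:  # top-k extensions of this hypothesis
                candidates.append((score + logp[tok], toks + [int(tok)], h2, int(tok) == end_id))
        beam = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])   # prune back to k
        if all(done for _, _, _, done in beam):
            break
    best = max(beam, key=lambda c: c[0])
    return best[1][1:]                                 # drop the start token
```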
Now, to be fast, sometimes we have to pad the input, 00:28:10.860 |
because we want to make sure that batch processing works 00:28:23.400 |
Did you change the graph from batch to batch? 00:28:28.020 |
Yeah, so let's suppose that you have a sequence of 10, 00:28:51.060 |
where you would connect that embedding to this memory? 00:29:08.140 |
So the question is, how do you insert the user embedding into the model? 00:29:20.100 |
So you'd have a vector for Quoc with an ID, say 1, 2, 3, 4, 5. 00:29:25.300 |
And then if it's Peter, then the vector would be 5, 6, 7, 8. 00:29:30.060 |
So it would be the initial vector for the encoder, 00:29:39.140 |
You can do it at the end, or you can do it at the beginning, 00:29:41.980 |
or you can insert it at every prediction steps. 00:29:46.020 |
But my proposal is just put it at the beginning. 00:29:54.020 |
OK, well, I'm thinking that because your prediction is fed back in-- 00:30:12.820 |
if you make a prediction, and then that's a bad prediction, 00:30:16.180 |
and then your model never saw that during training, then it keeps derailing, 00:30:30.220 |
So there's an algorithm called scheduled sampling. 00:30:33.020 |
So in scheduled sampling, what you do is you-- 00:30:38.020 |
instead of always feeding the truth during training, 00:30:46.060 |
you sometimes sample from the model's own predictions and feed that in as input, so that the model learns to cope with its own mistakes. 00:31:26.580 |
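A minimal sketch of that idea, assuming the same kind of made-up `decoder_step` as before: with some probability you feed the ground-truth token, otherwise you feed a sample from the model's own distribution, and the "schedule" is that this probability is decayed over training.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_with_scheduled_sampling(h_enc, y_ids, decoder_step, start_id=0, truth_prob=0.75):
    """Teacher forcing with probability truth_prob; otherwise feed the model's own sample."""
    h, prev, loss = h_enc, start_id, 0.0
    for target in y_ids:
        h, logp = decoder_step(h, prev)
        loss -= logp[target]                                   # still scored against the truth
        if rng.random() < truth_prob:
            prev = target                                      # feed the ground-truth token
        else:
            prev = int(rng.choice(len(logp), p=np.exp(logp)))  # feed the model's own sample
    return loss
# truth_prob is gradually lowered during training -- that decay is the "schedule"
```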
Well, my answer is that try to be as large as possible. 00:31:38.220 |
The issue is that you use a fixed-length embedding to represent very, very 00:31:43.780 |
long-range dependencies, like a huge input, right? 00:31:49.380 |
So I'm going to come back to that issue with the attention mechanism. 00:32:01.620 |
doesn't that make you learn away from synonyms? 00:32:13.700 |
Like, aren't you biasing it against synonyms? 00:32:27.020 |
It turns out that if you learn it, the model handles this well. 00:32:30.060 |
And if you visualize the embeddings, "good" and "fine" and so on 00:32:34.740 |
are mapped very closely in the embedding space. 00:32:39.100 |
But in the output, we don't know what else to do. 00:32:44.460 |
The other approach is basically to train the word embeddings first, and then 00:32:49.500 |
ask the model to regress to the word embeddings. 00:33:03.740 |
But anyway, the algorithm that you've seen so far is used in production. 00:33:10.860 |
So if you use the Smart Reply feature in Inbox, that's what's behind it. 00:33:20.020 |
Now, for example, in this email, my colleague Greg Corrado 00:33:23.380 |
got an email from his friend saying that, hey, 00:33:27.500 |
we wanted to invite you to join us for an early Thanksgiving 00:33:34.780 |
Please bring your favorite dish and reserve by next week. 00:33:40.700 |
For example, the first answer would be, count us in. 00:33:46.820 |
And the third answer is, sorry, we won't be able to make it. 00:33:54.540 |
Now, there's an algorithm to actually figure out 00:33:58.940 |
so that you don't end up with very similar answers. 00:34:05.140 |
that make these beams a little bit more diverse. 00:34:08.660 |
And then they pick the best three to present to you. 00:34:18.580 |
How do you make sure that the beam terminates? 00:34:24.300 |
So the question is, how do I guarantee that the beam will terminate? 00:34:33.900 |
Indeed, there are certain cases like that if you don't 00:34:37.380 |
But if you train the model well with very good accuracy, 00:34:43.540 |
I've hardly seen any cases that it doesn't terminate. 00:34:55.420 |
But you can stop the model after 1,000 or 100 steps. 00:35:12.740 |
but the reply seems all good with us meeting. 00:35:18.820 |
Yeah, it just comes out because there's a lot of emails. 00:35:21.500 |
And if you invite someone, there's more than one person. 00:35:25.980 |
It just means inviting the whole family, things like that. 00:36:15.520 |
So I would say we don't have any user embedding in this. 00:36:23.600 |
and the output would be the prediction, the reply. 00:36:30.320 |
So it sees the context, which is the thread so far. 00:36:38.120 |
OK, yeah, you can catch me up after the talk. 00:37:04.120 |
So the question is, there are some emails that you don't want to reply to at all. 00:37:14.480 |
So one algorithm says yes or no to whether to reply. 00:37:19.680 |
And then if it passes the threshold, 00:37:22.360 |
there's an algorithm that runs to produce the reply. 00:37:32.240 |
I have to get going, but you can get back to the question. 00:37:35.440 |
So there's a lot more interesting stuff coming along. 00:37:40.440 |
So the big picture is that we have an RNN encoder that reads the input, 00:37:48.760 |
and an RNN decoder trying to predict one token at a time in the output. 00:38:07.160 |
And then you should be able to find a good beam during decoding. 00:38:12.360 |
Now, someone in the audience brought up the issue of the bottleneck. 00:38:20.600 |
So just before you make a prediction, the h of n, 00:38:24.600 |
the white thing right before you go to the decoder, 00:38:31.160 |
you can think of it as a vector that captures all the information in the input. 00:38:39.200 |
It could be 1,000 words, or it could be five words. 00:38:44.920 |
So you have a fixed-length vector for a variable-length input, which is kind of not so nice. 00:38:55.400 |
The fix is the attention mechanism, and it was actually invented at the University of Montreal. 00:39:08.240 |
So in principle, what you want is something like this. 00:39:15.840 |
Every time you make a prediction in the decoder, you kind of want to look again at all the hidden states 00:39:20.640 |
of the input. You want to look at all of what you've seen in the input so far. 00:39:28.960 |
So at every decoding step, the decoder predicts a vector c. 00:39:41.680 |
Let's say that vector has the same dimension as all the h's. 00:39:54.560 |
And then you take c, and then you do a dot product with each h of i. 00:39:59.360 |
And then you have coefficients a0, a1, blah, blah, blah. 00:40:12.800 |
From those, you compute something called the betas, which is basically a softmax over the a's. 00:40:30.800 |
And then you take those b of i and multiply by h of i, and sum them up. 00:40:40.960 |
And then you send that in as an additional signal 00:40:48.720 |
to the decoder. So in the next step, you also predict another c. 00:40:51.160 |
And then you take that c to compute the dot products. 00:40:53.760 |
You compute the a's, and then you can compute the b's. 00:41:10.080 |
And this algorithm is implemented in TensorFlow. 00:41:14.640 |
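Here is what that computation looks like as a small numpy sketch: dot products between the query c and every encoder state h_i give the a's, a softmax turns them into the b's, and the weighted sum of the h_i is the extra signal sent to the decoder. Sizes are made up.

```python
import numpy as np

def attention(c, H):
    """c: query vector from the decoder; H: rows are the encoder states h_0 ... h_n."""
    a = H @ c                               # a_i = dot(c, h_i)
    a = a - a.max()                         # for numerical stability
    b = np.exp(a) / np.exp(a).sum()         # b_i = softmax over the a_i
    context = b @ H                         # sum_i b_i * h_i, the additional signal
    return context, b

H = np.random.randn(5, 16)                  # 5 input positions, hidden size 16 (assumed)
c = np.random.randn(16)
context, weights = attention(c, H)          # weights is what gets visualized as the alignment
```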
OK, so how intuitively-- what is going on here? 00:41:18.600 |
So let's suppose that you want to use this for translation. 00:41:26.200 |
for example, the input would be, hi, how are you? 00:41:28.680 |
And the output is, hola, como estas, or something like that. 00:41:49.560 |
the betas that you learn will put a strong weight 00:41:56.800 |
And then it has a smaller weight for all the stuff. 00:41:59.320 |
And then if you keep going, then when you say, como, 00:42:08.600 |
It puts a strong emphasis on the relevant word. 00:42:13.200 |
extremely useful because you know the one-to-one mapping 00:42:39.120 |
how do I deal with languages where the order doesn't just reverse, 00:42:45.240 |
where some of the verbs get moved and things like that? 00:42:48.080 |
Well, I did not hard-code the a's or the b's. They are learned. 00:42:54.160 |
So by virtue of learning, the model will figure out the alignment. 00:43:03.000 |
And those are basically computed by gradient descent. 00:43:30.640 |
I think some people have explored something like that. 00:43:40.200 |
So with the capitalization of your first words in the-- 00:43:45.700 |
--line, does that imply that you have to have your own [INAUDIBLE] 00:43:52.800 |
So the question is, let's suppose-- because right now, 00:43:55.520 |
the word "hi" is capitalized at the first character. 00:44:00.320 |
Does it mean I'm using 2n or n vocabulary size? 00:44:04.400 |
So in practice, we should do some normalization. 00:44:07.240 |
If you have a small data set, what you should do is normalize the case. 00:44:14.960 |
Now, if you have a huge data set, it doesn't matter. 00:44:24.000 |
change the positional aspect of the words, right? 00:44:30.840 |
captures the positional information in the input. 00:44:45.400 |
So the question is, what do I do with punctuation? 00:44:48.160 |
Well, right now, I just presented the algorithm on tokenized input. 00:45:03.120 |
Before you train the algorithm, you put a space between the word and the punctuation. 00:45:10.560 |
That step is called tokenization or normalization. 00:45:15.320 |
So you can use any Stanford NLP package or something 00:45:19.480 |
like that to normalize your text so that it's easy to train. 00:45:23.480 |
Now, if you have infinite data, then it will just learn itself. 00:45:30.840 |
So I should get going because there's a lot more to cover. 00:45:34.140 |
So it turns out that that's the basic implementation. 00:45:40.280 |
And if you have big data sets, one thing that you can do is make the model deeper. 00:45:44.920 |
And one way to make it deep is the following. 00:45:48.400 |
You stack your recurrent networks on top of each other. 00:45:53.120 |
So in the first sequence-to-sequence paper, 00:45:56.080 |
we used a network of four layers, but people are gradually 00:45:59.040 |
increasing to six, eight, and so on right now. 00:46:01.960 |
And they're getting better and better results. 00:46:04.120 |
Like in ImageNet, if you make a network deeper, 00:46:19.720 |
when we-- like many labs working on this problem 00:46:25.440 |
But right now, in translation, many translation tasks 00:46:40.160 |
So to train this model, tip number one is that, as I said, rare words will be out of vocabulary. 00:46:53.160 |
So Barack Obama will be just an unknown, right? 00:46:58.280 |
Now, you might use something like word segments, right? 00:47:03.960 |
For example, Barack Obama would be "ba" and "rack" and et cetera. 00:47:15.000 |
Or you can split words that are unknown into characters, 00:47:25.080 |
Now, tip number two is that when you train this algorithm, you should clip the gradients-- 00:47:29.880 |
because when you do backpropagation or forward 00:47:32.640 |
propagation, you essentially multiply by the same matrix over and over, so the gradient can blow up. 00:47:51.440 |
So you say that if the magnitude of the gradient is larger than some threshold, you scale it back down to that threshold. 00:48:00.000 |
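The clipping rule itself is only a couple of lines; here is a sketch with an assumed threshold of 5.

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """If the gradient's magnitude exceeds the threshold, rescale it down to the threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = clip_gradient(np.random.randn(1000) * 10)   # now ||g|| <= 5, so one bad batch can't blow up the weights
```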
Then tip number three is to use GRU or, in our work, LSTM cells. 00:48:07.160 |
So I want to revisit this long short-term memory cell. 00:48:24.120 |
In a basic RNN cell, you multiply the input and h by some matrix theta, and then you apply some activation function. 00:48:35.400 |
Now, in an LSTM, you basically multiply the input and h by a bigger theta. 00:48:45.560 |
That theta is four times bigger than the theta in the RNN cell. 00:48:51.280 |
And then you're going to take that z that's coming out and split it into gates. 00:49:01.240 |
And then you use the value of something called the cell, 00:49:06.320 |
and then you keep adding the newly computed values to it. 00:49:11.880 |
So there's a part here that I call the integral of c. 00:49:17.480 |
The LSTM keeps a cell state where it keeps adding information to it. 00:49:26.420 |
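A minimal numpy sketch of an LSTM cell along those lines: one matrix theta, four times the size of the vanilla RNN's, produces the input, forget, and output gates plus a candidate, and the cell state c accumulates information additively. The exact gate ordering and sizes here are just illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h, c, theta, bias):
    """theta maps [x; h] to 4 * hidden_size values -- the 'four times bigger' matrix."""
    hid = h.shape[0]
    z = theta @ np.concatenate([x, h]) + bias
    i, f, o = sigmoid(z[:hid]), sigmoid(z[hid:2*hid]), sigmoid(z[2*hid:3*hid])
    g = np.tanh(z[3*hid:])                 # candidate values
    c = f * c + i * g                      # the cell keeps adding newly computed values
    h = o * np.tanh(c)
    return h, c

emb, hid = 8, 16                           # assumed sizes
theta = np.random.randn(4 * hid, emb + hid) * 0.1
bias = np.zeros(4 * hid)
h, c = np.zeros(hid), np.zeros(hid)
h, c = lstm_cell(np.random.randn(emb), h, c, theta, bias)
```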
You can pretty much just apply the LSTM, because it's already implemented in the standard toolkits. 00:49:35.360 |
So in terms of applications, you can use this thing for many tasks. 00:49:44.280 |
So I've started seeing work in summarization, pretty exciting. 00:49:53.280 |
For image captioning, the input would be the representation of an image coming out of VGGNet. 00:50:06.560 |
Or you can use it for speech recognition or transcription. 00:50:28.480 |
And then the output could be some words, "hi, how's it going." 00:50:46.080 |
And then you convert the audio into MFCCs before you send it to the RNN. 00:50:52.680 |
And then you use the algorithm that I said earlier 00:51:03.400 |
You predict one word at a time in the output. 00:51:15.000 |
You can end up with thousands and thousands of steps. 00:51:15.000 |
So backpropagating through time, even with attention, is expensive. 00:51:17.760 |
So you do some kind of a pyramid to shrink the input. 00:51:24.760 |
So if you stack enough layers, you can divide your input 00:51:28.840 |
length by a factor of 8 or 16. 00:51:32.440 |
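A toy sketch of the pyramid idea: each layer concatenates adjacent frames, so every layer halves the number of time steps, and three layers give a factor of 8. In a real model an RNN layer would run between the reductions.

```python
import numpy as np

def pyramid_layer(frames):
    """Concatenate each pair of adjacent frames, halving the number of time steps."""
    T, D = frames.shape
    T -= T % 2                               # drop a trailing odd frame for simplicity
    return frames[:T].reshape(T // 2, 2 * D)

x = np.random.randn(1000, 40)                # 1000 MFCC frames of dimension 40 (assumed)
for _ in range(3):                           # three pyramid layers -> 8x fewer time steps
    x = pyramid_layer(x)
print(x.shape)                               # (125, 320)
```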
Or you can use other approaches, like in Baidu's work where they have CTC. 00:51:49.800 |
Now, I have to say that the strength of this algorithm is that when it predicts the next word, 00:51:54.200 |
it's actually conditioned on "hi" and all the previous steps. 00:52:11.920 |
So there's an implicit language model already. 00:52:14.600 |
But the problem with this is that actually you 00:52:19.560 |
have to wait until the end of the input to do decoding. 00:52:39.480 |
that can use this and do it in an online fashion, 00:52:45.080 |
Now, also, I have to mention that in translation, 00:52:56.040 |
But when it comes to speech, it doesn't work as well yet. 00:53:03.800 |
We're not as good as CTC, which is what Adam talked about earlier, and which 00:53:13.120 |
is the most widely used speech system currently. 00:53:19.640 |
So I want to pause there, and then I can take questions. 00:53:25.720 |
So on the machine translation, you were mentioning attention. 00:53:34.120 |
there's one word that basically is having meaning 00:53:44.040 |
Because attention will be focusing on [INAUDIBLE] 00:53:47.520 |
sense, and you know that one is the sending sense. 00:54:00.760 |
Well, in translation, what we do is basically 00:54:12.680 |
And then we have pairs of sentences like this, 00:54:23.840 |
But before we make a prediction, the model has the attention. 00:54:36.440 |
What is the issue with the model again, please? 00:54:38.720 |
When you look at, say, i equals 1 to the [INAUDIBLE] 00:54:57.520 |
So there's no basically one, two, three coming afterward. 00:55:17.160 |
For example, what if you actually, in the inbox, 00:55:31.400 |
So the model-- the inbox thing that I presented, 00:55:36.200 |
But there's no limitation in the model in terms of language. 00:55:43.360 |
sometimes you write in English, and sometimes you 00:55:52.400 |
Then I would say that it will just learn your behavior. 00:55:55.040 |
And then it will basically predict the word that you want. 00:55:58.680 |
But make sure that your output vocabulary is large enough 00:56:03.120 |
so that it covers not only the English words, 00:56:10.120 |
So your vocabulary is not going to be 20,000. 00:56:12.920 |
It's going to be like 100,000, because you have more choices. 00:56:16.840 |
And then you have to train your model on those examples. 00:56:33.340 |
Is it possible to change your model a little bit 00:56:36.320 |
so that it doesn't have to start predicting at the end 00:56:40.520 |
So the question is that in the case of voice search, 00:56:42.520 |
right now you have to wait to the end to make a prediction. 00:56:50.080 |
So you can actually figure out an algorithm, a simple algorithm 00:56:52.920 |
to actually segment the speech, and then make a prediction, 00:57:01.340 |
So in theory, you can actually do online decoding. 00:57:06.280 |
But I'm saying that you can do online decoding, 00:57:17.800 |
So the question is regarding the Google's order of life. 00:57:29.520 |
know for this question, this is the answer you're creating? 00:57:32.900 |
So we have some input email and then some output email 00:57:55.220 |
not much of a constraint on how the output aligns 00:58:03.260 |
Is there any way to add this constraint to sequence-to-sequence models? 00:58:54.980 |
So the question is that, because right now we optimize one step at a time, is there 00:59:00.380 |
any way to actually look globally at the output 00:59:03.860 |
and maybe use some kind of reinforcement learning? 00:59:08.420 |
So there's a recent paper at Facebook called, I think, 00:59:13.180 |
sequence level training or something like that, 00:59:15.700 |
where they don't optimize for one step at a time, 00:59:19.540 |
but they predict-- they look at the globally, 00:59:30.100 |
And it seems to be making some improvement in the metrics 00:59:39.940 |
people still prefer the output from this model. 00:59:44.660 |
So some of the metrics that we use in translation and so on 00:59:47.660 |
might not be the metrics that we actually want to optimize. 00:59:56.280 |
Yeah, so the question is, can we add the GAN loss? 01:00:23.140 |
based on the human input to influence the encoder? 01:00:29.060 |
So let's suppose that you type the first word, "Hola," 01:00:33.180 |
then you can actually start the beam from there. 01:00:35.780 |
So the question is, is there any way to incorporate user input? 01:00:41.020 |
Let's suppose that you say, "Hola," sorry, "Hi, how are you?" 01:00:51.900 |
So you can actually condition your beam on the first word, 01:01:17.340 |
And the WMT corpora usually have tens of millions 01:01:24.700 |
of sentence pairs. And every sentence has like 20 words on average, 01:01:31.980 |
I can't remember, but that's something like that, 01:01:45.900 |
So how is it compared to Google Search auto-completion? 01:01:51.340 |
Honestly, I don't know what's used underneath Google Search 01:01:56.060 |
But I think they should use something like this, 01:02:00.480 |
OK, I have still lots of interesting stuff coming along. 01:02:12.100 |
So the big picture is so far, I talk about sequence 01:02:20.540 |
about most of the big trends in deep learning. 01:02:25.580 |
And he was talking about the second trend was basically 01:02:29.860 |
So you can characterize sequence to sequence learning 01:02:39.060 |
So it should work for a lot of NLP-related tasks, 01:02:43.220 |
because for a lot of them, you have an input sequence and an output sequence. 01:02:56.380 |
But it works great when you have a lot of data. 01:03:01.260 |
then maybe you want to consider dividing your problem 01:03:04.480 |
into smaller components, and then train your sequence-to-sequence models on those. 01:03:16.100 |
Then it's also possible to actually merge all these tasks 01:03:19.980 |
by combining the data, and then have an indicator bit to say, 01:03:26.900 |
this is email reply, and then train it jointly. 01:03:45.980 |
picture of the active ongoing work in neural nets for NLP. 01:03:57.180 |
So if you have any questions, you can ask now. 01:04:11.940 |
So the question is, does the model handle emoji? 01:04:16.820 |
I don't know, but emoji is like a piece of text, too, right? 01:04:19.820 |
So you can just feed it in as another extra token. 01:04:27.220 |
then you should be able to cover emoji as well. 01:04:36.780 |
do you have to retrain the model, or do you do anything? 01:04:45.720 |
I think towards the end, we lower the learning rate. 01:04:51.680 |
So if you add new data, it will not make a lot of good updates. 01:04:59.360 |
increase the learning rate, and then continue to train. 01:05:11.300 |
is very exciting, which is in the area of automatic Q&A. 01:05:16.180 |
So you can think that maybe the setup would be, 01:05:20.180 |
can you read a Wikipedia page and then answer a question? 01:05:23.980 |
Or can you read a book and answer a question? 01:05:26.860 |
Now, in theory, you can use sequence-to-sequence for this. 01:05:35.260 |
You're going to read the book, one token at a time, and then the question, 01:05:47.300 |
and then you make a prediction of the answer tokens. 01:05:56.980 |
The problem with doing it that way is that we don't store knowledge about the facts; 01:06:00.100 |
the model actually has to read the book again to answer each question. 01:06:06.820 |
is Barack Obama the President of the United States? 01:06:09.860 |
I would say, yes, because it's already in my memory. 01:06:12.620 |
So maybe it's better to actually augment the RNN 01:06:17.780 |
with some kind of memory, so that it will not have to re-read everything. 01:06:27.820 |
I'm not a definite expert, but I'm very aware, 01:06:32.820 |
so I can place you in the right context here. 01:06:35.180 |
So work in this area would be memory networks at Facebook. 01:06:40.980 |
There would be neural Turing machines at DeepMind. 01:06:43.780 |
Dynamic memory networks would be what Richard Socher presented 01:06:47.420 |
yesterday, and then stack-augmented algorithms. 01:06:51.140 |
And then stack-augmented RNNs by Facebook again, and et cetera. 01:07:10.500 |
In the encoder, you're going to look at some input. 01:07:12.820 |
And then you have a controller, which is your h variable. 01:07:19.900 |
But along the side, you're going to write down 01:07:31.740 |
is you're going to continue producing some output. 01:07:39.700 |
but you're going to read from memory your h again. 01:07:49.380 |
And then in the output, you read from memory. 01:07:53.180 |
Now let's try to be a little bit more general. 01:07:58.060 |
And the general would be at any point in time, 01:08:02.140 |
You have a controller, and you can read and write all the time. 01:08:07.060 |
Now to do that, you have the following architecture. 01:08:10.340 |
So you have some memory bank, a big memory bank. 01:08:18.820 |
You can decide to write some information into it 01:08:23.220 |
by a combination of the memory bank in the previous step 01:08:27.620 |
and the hidden variable in the previous step. 01:08:30.500 |
And then you also read into the hidden state, too. 01:08:35.900 |
and then you can keep going forever like that. 01:08:38.500 |
So this concept is called RNN with augmented memory. 01:09:02.900 |
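As a very rough toy sketch of that read/write loop (the details below are made up for illustration, not the actual memory-network or Neural Turing Machine equations): at every step the controller h softly writes into a bank of memory slots using the previous memory and hidden state, reads an attention-weighted summary back out, and folds that into its update.

```python
import numpy as np

rng = np.random.default_rng(0)
SLOTS, HID = 8, 16                                     # assumed memory-bank and hidden sizes
W_write = rng.normal(0, 0.1, (HID, HID))
W_read = rng.normal(0, 0.1, (HID, HID))
A, U = rng.normal(0, 0.1, (HID, HID)), rng.normal(0, 0.1, (HID, HID))

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

def step(x, h, M):
    """One controller step with a soft write to and a soft read from the memory bank M."""
    w = softmax(M @ (W_write @ h))                     # where to write, from the previous h and M
    M = M + np.outer(w, h)                             # write: add h into the chosen slots
    r = softmax(M @ (W_read @ h)) @ M                  # read: attention-weighted sum of the slots
    h = np.tanh(A @ h + U @ x + r)                     # the controller update uses what it read
    return h, M

h, M = np.zeros(HID), np.zeros((SLOTS, HID))
for x in rng.normal(size=(5, HID)):                    # a few toy inputs
    h, M = step(x, h, M)
```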
A lot of these algorithms actually use soft attention. 01:09:11.260 |
With hard attention, you can actually predict where to look, right? 01:09:16.020 |
Now the problem with that is you end up with very, 01:09:28.980 |
very discrete actions, but you can use REINFORCE and so on to train it. 01:09:36.620 |
that actually does something like this, right? 01:09:39.820 |
Not exactly, but it will deal with discrete actions. 01:09:50.740 |
Okay, so another extension that a lot of people talk about 01:10:19.420 |
The building was constructed in the year 2000. 01:10:44.140 |
Now, neural nets, if you can train with a lot of examples, 01:10:48.860 |
It can learn to subtract numbers and things like that, 01:10:54.120 |
All right, so maybe it's better to augment them 01:10:58.260 |
with functions, like addition and subtraction. 01:11:05.080 |
the neural network will read all the tokens so far, 01:11:11.380 |
And then you'll get, the neural net is augmented 01:11:36.380 |
of the values coming out of these two function. 01:11:41.260 |
and you push it into the stack in the next step. 01:11:44.860 |
you will call the addition and subtraction again, 01:11:49.160 |
That's the principle of something called neural programmers 01:11:56.340 |
from Google Brain and DeepMind was talking about this. 01:12:23.920 |
So it's one of the big trends happening in natural language. 01:12:32.540 |
if you have a lot and a lot of supervised data, 01:12:37.420 |
So if you have a lot of data, it should work great. 01:12:40.720 |
But if you don't have enough supervised data, 01:12:47.940 |
Or you can train jointly in a multitask settings. 01:12:51.940 |
And people also train it jointly with autoencoder, 01:13:04.740 |
If you go home and then you wanna make impact 01:13:10.620 |
at your work tomorrow, then so far, that's so far so good. 01:13:25.420 |
But it seems like it's still work in progress. 01:13:40.980 |
He talked about attention and augmented recurrent networks. 01:13:46.820 |
The sequence-to-sequence with attention for translation 01:14:02.020 |
Now, there's a lot of work going on in this area. 01:14:09.560 |
So as you can see, you can't even read the words. 01:14:12.860 |
That means how many papers come along in this area. 01:14:16.940 |
So I can pause there, and I have five minutes 01:15:20.720 |
- There could be many books that have a character Harry. 01:15:25.320 |
which is which Harry should I answer the question for. 01:15:30.800 |
when you're training this kind of Q&A type network? 01:15:35.360 |
So I think one thing is that you can always personalize. 01:15:43.740 |
when he talk about, you can have a representation 01:15:47.200 |
for the user, and then you know that when he say Harry, 01:15:50.960 |
because he actually been reading a lot of books 01:15:52.960 |
about Harry Potter, so it's more likely to be Harry Potter. 01:15:59.280 |
I just want to make sure that it's as simple as possible. 01:16:10.160 |
But I'm saying if you represent user vectors, 01:16:13.760 |
and then you inject more additional knowledge 01:16:19.560 |
into as additional token in the input of the net, 01:16:37.620 |
- Do you have an idea what the state of the art 01:16:39.020 |
in generalizing word2vec is to more than one word? 01:16:44.400 |
I think skip-thoughts is an interesting direction here. 01:17:02.580 |
And the idea is basically using sequence to sequence to predict the surrounding sentences. 01:17:20.340 |
And I've heard a lot of good things about skip-thoughts 01:17:29.180 |
and things like that, and it works very well. 01:17:30.900 |
So that's probably one place that you can go. 01:17:48.740 |
- So what are your thoughts on how to solve common sense? 01:17:55.380 |
- Oh, common sense, I'm deeply interested in common sense. 01:18:08.020 |
first of all, there's a lot of knowledge about the world 01:18:14.580 |
Like for example, gravity and things like that. 01:18:39.260 |
- Is there a good way to represent all these rules 01:18:46.260 |
So the question is, how do you represent rules? 01:19:33.460 |
- Are there any tips to reduce how many pairs you need 01:19:44.460 |
but the corpus that people train this on for translation 01:19:50.540 |
has only about three to five million pairs of sentences. 01:19:54.060 |
So that's kind of small, three million, right? 01:20:04.580 |
then I would say things like pre-train your word vectors 01:20:13.380 |
That's one area that you have a lot of parameters. 01:20:24.380 |
That's another area that you have a lot of parameters. 01:20:29.660 |
or dropout some random word in the input sentence. 01:20:32.620 |
So those things can improve the regularization