
Sequence to Sequence Deep Learning (Quoc Le, Google)


Chapters

0:00
2:10 Preprocessing
2:34 Feature Representation
4:56 Training with stochastic gradient descent
7:28 Information Loss
7:42 Recurrent Neural Network
9:34 Training RNN with stochastic gradient descent
17:10 Better Formulation
19:40 Sequence to Sequence Training with SGD
22:19 Sequence to Sequence Prediction
30:31 Scheduled Sampling
37:41 The big picture so far
41:14 Model Understandability with Attention Mechanism
48:08 LSTMCell vs. RNNCell
49:39 Applications
52:18 Sequence to Sequence With Attention for Speech

Whisper Transcript

00:00:00.000 | I will divide it in two parts.
00:00:03.160 | So number one, I will work with you and
00:00:05.720 | develop the sequence to sequence learning.
00:00:08.680 | And then in the second part, I will place sequence to sequence in a broader context
00:00:14.920 | and talk about a lot of exciting work in this area.
00:00:18.680 | Now, so let's motivate this by an example.
00:00:27.200 | So a week ago, I came back from vacation, and
00:00:30.560 | in my inbox, I had 508 unreplied emails.
00:00:36.800 | And a lot of emails basically just require just yes and no answer.
00:00:42.280 | So let's try to see whether we can build a system that can automatically
00:00:49.120 | reply to these emails with yes or no.
00:00:53.480 | And for example, some of the email would be from my friend, Ann.
00:00:59.960 | She said, hi, in the subject, and she said,
00:01:02.800 | are you visiting Vietnam for the New Year, Quoc?
00:01:05.800 | That would be her content.
00:01:07.720 | And then my probable reply would be yes.
00:01:10.080 | So you can gather a data set like this, and then you have some input content.
00:01:16.520 | So for now, let's ignore the author of the email and the subject.
00:01:23.520 | But let's focus on the content.
00:01:25.400 | So let's suppose that you gather some email, and some input would be something
00:01:28.560 | like, are you visiting Vietnam for the New Year, Quoc?
00:01:32.080 | And the answer would be yes.
00:01:33.000 | And then another email would be, are you hanging out with us tonight?
00:01:40.200 | The answer is no, because I'm quite busy.
00:01:42.800 | >> [LAUGH]
00:01:46.040 | >> So the third email would be,
00:01:47.880 | did you read the cool new paper on ResNet?
00:01:50.760 | The answer is yes, because I like it.
00:01:52.880 | Now let's do a little bit of processing, where basically in
00:01:59.640 | the previous slide, we have "year" and then a comma, and
00:02:06.160 | then "Quoc," and then a question mark, and so on.
00:02:09.080 | So let's do a little bit of processing, and then put
00:02:14.600 | a space between "year" and the comma, and between "Quoc" and the question mark, and so on.
00:02:20.600 | So this step, a lot of people call tokenization and normalization.
00:02:24.360 | So let's do that with our emails.
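
As a rough illustration of the tokenization and normalization step just described, here is a minimal Python sketch; the exact rules (lowercasing, which punctuation gets split off) are assumptions for illustration, not the preprocessing actually used in the talk.

```python
import re

def tokenize(text):
    # Lowercase and put spaces around punctuation so that
    # "New Year, Quoc?" becomes ["new", "year", ",", "quoc", "?"].
    text = text.lower()
    text = re.sub(r"([.,!?])", r" \1 ", text)
    return text.split()

print(tokenize("Are you visiting Vietnam for the New Year, Quoc?"))
```
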
00:02:26.360 | Now, the second step I would do would be feature representation.
00:02:34.520 | So in this step, what I'm gonna do is the following.
00:02:36.440 | I'm gonna construct a 20,000 dimensional vector.
00:02:40.200 | 20,000 represents the size of the English vocabulary.
00:02:44.120 | And then I'm gonna go through the email.
00:02:45.600 | I'm gonna count how many times a particular word occurs in my email.
00:02:50.160 | For example, the word "are" occurs once in my email, so I increase that counter.
00:02:59.080 | And then "you" occurs once, so I increase another counter, and so on.
00:03:04.760 | And then I will reserve at the end a token to
00:03:08.200 | just count all the words that are just out of vocabulary.
00:03:11.560 | Okay?
00:03:12.800 | And then, now, if you do this process,
00:03:19.240 | you're gonna convert all of your emails into input-output pairs,
00:03:23.440 | where the input would be a fixed length representation, a 20,000 dimensional vector.
00:03:28.880 | And the output would be either 0 or 1.
00:03:31.680 | Okay?
00:03:33.280 | Any questions so far?
00:03:34.120 | Okay, good.
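
A small sketch of the bag-of-words counting just described, with a toy vocabulary standing in for the 20,000-word one and the last slot reserved for out-of-vocabulary words; the variable names and sizes are illustrative assumptions.

```python
import numpy as np

VOCAB = ["are", "you", "visiting", "vietnam", "for", "the", "new", "year"]  # toy stand-in for 20,000 words
WORD_TO_ID = {w: i for i, w in enumerate(VOCAB)}
UNK_ID = len(VOCAB)                      # reserved slot that counts out-of-vocabulary words

def bag_of_words(tokens):
    x = np.zeros(len(VOCAB) + 1)         # one counter per vocabulary word, plus the unknown slot
    for tok in tokens:
        x[WORD_TO_ID.get(tok, UNK_ID)] += 1
    return x

print(bag_of_words(["are", "you", "visiting", "vietnam", "hanoi"]))  # "hanoi" lands in the unknown slot
```
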
00:03:36.460 | Okay, so, as somebody in the audience said,
00:03:43.200 | the order of the words doesn't matter here, and the answer is yes.
00:03:47.640 | So I'm gonna get back to that issue later.
00:03:50.880 | Now, so that's x and y, and now,
00:03:57.600 | my job is to try to find some w such that w times x can approximate y.
00:04:04.760 | Y is the output, right?
00:04:06.800 | And y here is yes and no.
00:04:09.440 | So because this problem has two categories,
00:04:13.800 | you can think of it as a logistic regression problem.
00:04:16.960 | Now, anybody who has followed the great CS229 class by Andrew
00:04:22.000 | can probably formulate this very quickly.
00:04:24.520 | But in very short, the algorithm comes as follows.
00:04:30.040 | You kind of try to come up with a vector for every email.
00:04:34.000 | Your w is a two column matrix, okay?
00:04:40.440 | The first column will determine the probability
00:04:43.160 | that the email should be answered as yes.
00:04:47.720 | The second column, that it should be answered as no.
00:04:50.400 | And then you basically take the dot product with the first column.
00:04:55.680 | [INAUDIBLE]
00:04:57.480 | Then stochastic gradient descent.
00:04:58.720 | So you run for iterations one to, like, a million; you run for a long, long time.
00:05:05.160 | You sample a random email x, and there's some reply.
00:05:08.120 | And then if the reply is yes, then you wanna update your w1 and
00:05:14.640 | w2 such that you increase the probability that the answer is yes.
00:05:20.960 | So you increase the first probability.
00:05:22.520 | Now if the correct reply is no, then you're gonna update w1 and
00:05:29.600 | w2 so that you can increase the probability of
00:05:34.880 | the email to be answered as no, so the second probability, okay?
00:05:40.960 | So let's call those p1 and p2.
00:05:44.520 | Now, so because I said to update to increase the probability,
00:05:51.560 | what does that mean?
00:05:52.480 | What that means is that you find the partial gradient
00:05:57.400 | of the objective function with respect to some parameters.
00:06:01.600 | So now you have to pick some alpha, which is the learning rate.
00:06:06.120 | And then you say w1 is equal to w1 plus alpha times
00:06:11.040 | the partial derivative of log p1 with respect to w1, okay?
00:06:19.040 | Now, I cheated a little bit here because I used the log function.
00:06:22.840 | It turns out because the log function is a monotonic increasing function.
00:06:27.360 | So increasing p1 is equivalent to increasing the log of p1, okay?
00:06:32.200 | And it usually, with this formulation, stochastic gradient descent works better.
00:06:36.160 | Any questions so far?
00:06:38.360 | And then you can also update w2 if the reply to the email is yes.
00:06:48.960 | And you have a different way to update if the reply is no.
00:06:55.960 | And then if you have a new email coming in,
00:07:00.960 | then you take x and then you convert into the vector.
00:07:04.760 | Then you compute the first probability, okay: exp(w1 times x) divided by
00:07:11.600 | exp(w1 times x) plus exp(w2 times x).
00:07:15.000 | And if that probability is larger than 0.5, then you say yes.
00:07:18.880 | And if that probability is less than 0.5, then you say no.
00:07:24.160 | Okay, so that's how you do prediction with this.
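
Here is a minimal NumPy sketch of the two-column logistic regression described above: a softmax over w1 · x and w2 · x, one stochastic gradient ascent step on log p of the correct reply, and the 0.5-threshold prediction rule. The sizes, learning rate, and random data are toy values, not anything from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 9                                   # toy input size (vocabulary + unknown slot)
W = rng.normal(scale=0.01, size=(dim, 2)) # column 0 scores "yes", column 1 scores "no"
alpha = 0.1                               # learning rate

def probs(W, x):
    logits = x @ W                        # [score_yes, score_no]
    e = np.exp(logits - logits.max())     # stabilized softmax
    return e / e.sum()                    # [p1, p2]

def sgd_step(W, x, label):                # label: 0 = "yes", 1 = "no"
    p = probs(W, x)
    grad = np.outer(x, -p)                # gradient of log p[label] with respect to W
    grad[:, label] += x
    return W + alpha * grad               # ascend the log-probability of the correct reply

def predict(W, x):
    return "yes" if probs(W, x)[0] > 0.5 else "no"

x = rng.random(dim)                       # a made-up email vector
W = sgd_step(W, x, label=0)
print(predict(W, x))
```
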
00:07:27.120 | Now, there's a problem with this representation,
00:07:33.120 | which is that there's some information loss.
00:07:35.040 | So somebody in the audience just said that the order of the words don't matter.
00:07:39.560 | And that's true.
00:07:40.840 | Now, let's fix this problem by using something called the Recurrent Network.
00:07:47.560 | And I think Richard Socher already talked about Recurrent Networks and
00:07:53.840 | some part of it yesterday, and Andrej as well.
00:07:57.840 | Now, the idea of a Recurrent Network is basically you also have
00:08:02.200 | fixed length representation for your input, but
00:08:05.840 | it actually preserves some sort of ordering information.
00:08:08.880 | And the way that you compute the hidden units are the following.
00:08:15.680 | So the function h of 0 is basically a hyperbolic
00:08:21.840 | tangent of some matrix u times
00:08:27.680 | the word vector for the word "are."
00:08:31.080 | Okay, so Richard also talked about word vectors yesterday.
00:08:35.800 | So you can take word vectors coming out of word2vec, or
00:08:39.600 | you can just actually randomly initialize them if you want to.
00:08:43.360 | Okay, so let's suppose that that's h of 0.
00:08:46.680 | Now, h of 1 would be a function of h of 0 and
00:08:53.200 | the vector for "you," which is the hyperbolic tangent of a times h of 0 plus u times the vector for "you."
00:09:01.520 | And then you can keep going with that.
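
A tiny sketch of that recurrence, h_t = tanh(A h_{t-1} + U v_t), with randomly initialized word vectors; the dimensions are arbitrary toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, embed = 4, 3                       # tiny sizes for illustration
U = rng.normal(size=(hidden, embed))       # input-to-hidden matrix, shared across steps
A = rng.normal(size=(hidden, hidden))      # hidden-to-hidden matrix, shared across steps
vec = {w: rng.normal(size=embed) for w in ["are", "you", "visiting"]}  # randomly initialized word vectors

h = np.tanh(U @ vec["are"])                # h_0
for word in ["you", "visiting"]:
    h = np.tanh(A @ h + U @ vec[word])     # h_t = tanh(A h_{t-1} + U v_t)
print(h)                                   # final hidden state summarizing the ordered input
```
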
00:09:04.240 | This is one of my three most complicated slides, so
00:09:11.880 | you should ask questions.
00:09:13.280 | No questions?
00:09:16.240 | So everybody familiar with Recurrent Nets?
00:09:20.860 | >> [LAUGH]
00:09:22.900 | >> Okay, so to make prediction with this,
00:09:25.860 | you tack on the label at the last step.
00:09:29.580 | And then you say, try to predict y for me.
00:09:32.380 | Now, how do you do that?
00:09:33.740 | Now, here, basically,
00:09:37.140 | you train it the way you did before.
00:09:43.660 | And basically, you make update on the w matrix,
00:09:47.300 | which is the classifier at the top, like what I said earlier.
00:09:50.540 | Now, but you also have to update all the relevant matrices,
00:09:56.940 | which is the matrix u, the matrix a, and some word vectors, right?
00:10:03.780 | So this is basically, you have to compute the partial derivative
00:10:07.660 | of the loss function with respect to those parameters.
00:10:12.780 | Now, that's gonna be very complicated.
00:10:14.820 | And usually, when I do that myself, I get that wrong.
00:10:19.460 | But there's a lot of toolkits out there that you can use.
00:10:22.820 | That is, you can use auto differentiation in
00:10:27.220 | TensorFlow, or you can call Torch, or
00:10:31.700 | you can call Theano to actually compute the derivatives.
00:10:35.940 | And once you have the derivatives, you can just make the updates.
00:10:38.100 | All right?
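
As a sketch of letting auto-differentiation do that work, here is the toy RNN classifier written with TensorFlow's GradientTape. This eager-mode API is newer than the tooling available at the time of the talk, so treat it as one possible way to get the partial derivatives, not the speaker's setup.

```python
import tensorflow as tf

U = tf.Variable(tf.random.normal([4, 3]))            # input-to-hidden
A = tf.Variable(tf.random.normal([4, 4]))            # hidden-to-hidden
W = tf.Variable(tf.random.normal([2, 4]))            # classifier on top
inputs = [tf.random.normal([3]) for _ in range(5)]   # a made-up 5-word email
label = 0                                            # 0 = "yes"

with tf.GradientTape() as tape:
    h = tf.zeros([4])
    for v in inputs:
        h = tf.tanh(tf.linalg.matvec(A, h) + tf.linalg.matvec(U, v))
    logits = tf.linalg.matvec(W, h)
    loss = -tf.nn.log_softmax(logits)[label]         # negative log-probability of the label

grads = tape.gradient(loss, [U, A, W])               # partial derivatives, no hand derivation needed
```
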
00:10:41.100 | Yeah?
00:10:41.900 | >> Do you use the same u?
00:10:43.740 | >> Yes.
00:10:44.460 | >> What size?
00:10:46.020 | >> So u, the matrix u, I share, so I'm gonna go back one slide.
00:10:52.020 | So the matrix u, I share across all the steps, right?
00:11:01.540 | And the size, you have to determine ahead of time.
00:11:04.420 | For example, the number of columns would be the size of the word vectors.
00:11:11.580 | But the number of rows could be like 1,000 if you want, or maybe 256 if you want.
00:11:18.580 | So this is model selection, and it depends on whether you're underfitting or
00:11:22.540 | overfitting to choose a bigger model or a smaller model.
00:11:26.980 | And also on your compute power, so that you can train a larger model or a smaller model.
00:11:30.980 | >> So what would you consider the number of words in the dictionary?
00:11:37.140 | >> The matrix u?
00:11:37.940 | >> Yeah.
00:11:39.860 | Yeah, so the word vectors, the number of word vectors that you use,
00:11:46.700 | are the size of vocabulary, right?
00:11:52.380 | So you're gonna tend to end up with 20,000 word vectors, right?
00:11:57.500 | So that means you have 20,000 rows in matrix u --
00:12:03.940 | sorry, the number of columns is 20,000, but
00:12:08.060 | the number of rows, you have to determine that yourself.
00:12:11.500 | Okay?
00:12:13.100 | Any other questions?
00:12:15.580 | Now, okay, so what's the big picture?
00:12:23.500 | So the big picture is I started with bag of words representations.
00:12:27.460 | And then I talked about the RNN as a new way to represent
00:12:33.860 | variable size input that can capture some sort of ordering information.
00:12:38.620 | Then I talk about auto differentiation so
00:12:41.580 | that you can compute the partial derivatives.
00:12:43.900 | And you can find auto diff in TensorFlow or Theano or Torch.
00:12:50.220 | Now then I talk about stochastic gradient descent as a way to train
00:12:54.860 | the neural networks.
00:12:57.460 | Any questions so far?
00:13:00.700 | Okay, you have a question.
00:13:03.260 | >> How long does training take if you're not using a distributed system?
00:13:07.500 | >> So that also depends on how big your training set and
00:13:11.220 | how big is your computer and so on, right?
00:13:15.660 | But usually if you use RNN and if you use a hidden state of 100,
00:13:20.700 | it should take a couple of hours.
00:13:22.620 | Yeah, but it largely depends on size of training data,
00:13:29.260 | because you want to iterate for a lot of, you sample a lot of emails, right?
00:13:33.500 | And you want your algorithm to see as many emails as possible.
00:13:37.740 | Right?
00:13:38.240 | So, okay, so if you use such an algorithm to just say yes, no, and yes, no,
00:13:45.820 | then you might end up losing a lot of friends.
00:13:49.220 | Because- >> [LAUGH]
00:13:52.180 | >> Because we don't just say yes, no.
00:13:55.820 | Because when you say, for example, my friend asked me,
00:14:01.460 | are you visiting Vietnam for the New Year, Quoc?
00:14:03.340 | Then maybe the better answer would be, yes, see you soon, right?
00:14:07.180 | That's a better, nicer way to approach this.
00:14:10.660 | And then if my friends ask me, are you hanging out with us tonight?
00:14:16.220 | So instead of just saying no,
00:14:17.980 | I would say, no, I'm too busy.
00:14:19.900 | All right, or did you read the cool [INAUDIBLE], right?
00:14:24.420 | So let's see how we're going to fix this.
00:14:26.980 | So before I'm going to tell you the solution,
00:14:31.220 | I would say this problem basically requires you
00:14:36.220 | to map between variable size input to some variable size output.
00:14:43.220 | And if you can do something like this, then there's a lot of applications.
00:14:49.180 | Because you can do auto reply, which is what we've been working on so far.
00:14:53.260 | But you can also use it to do translation,
00:14:55.980 | to translate between English and French.
00:14:58.860 | You can do image captioning.
00:15:00.220 | So the input would be a fixed length vector, a representation coming from a ConvNet.
00:15:06.180 | And then output would be the cat sat on the mat.
00:15:10.500 | You can do summarization.
00:15:12.420 | The input will be a document, and output will be some summary of it.
00:15:16.900 | Or you can do speech transcription, where
00:15:19.660 | you can have input would be speech frames, and output would be words.
00:15:25.580 | Or you can do conversation.
00:15:27.060 | So basically, the input would be the conversation so far,
00:15:29.940 | and the output would be my reply.
00:15:32.660 | Or you can do Q&A, et cetera, et cetera.
00:15:35.060 | So we can keep going on.
00:15:36.100 | So how do we solve this problem?
00:15:41.660 | So this is hard.
00:15:43.380 | So let's check out what Andrej Karpathy has
00:15:45.820 | to say about recurrent networks.
00:15:49.460 | So Andrej say that there's more than one way
00:15:51.820 | that you can configure neural networks to do things.
00:15:54.420 | So you can use neural network to map--
00:15:58.100 | recurrent networks to map one to one.
00:16:01.620 | So at the bottom, that's the input.
00:16:05.460 | The green would be the hidden state, and the output
00:16:08.900 | would be what you want to predict.
00:16:13.140 | Now, one to one is not what we want, because we have many to many.
00:16:17.500 | So it's probably more like the last two to the right.
00:16:23.340 | But we arrived at the solution that I said in the red box.
00:16:28.820 | And the reason why that's a better solution
00:16:32.060 | is because the size of the input and the size of output
00:16:36.100 | can vary a lot.
00:16:37.940 | Sometimes you have smaller input, but larger output.
00:16:41.900 | But sometimes you have larger input and smaller output.
00:16:46.220 | So if you do the one in the red circle,
00:16:49.860 | you can be very flexible.
00:16:52.460 | If you do the one to the extreme right,
00:16:54.940 | then maybe the output has to be smaller, or at least
00:16:58.980 | the same with the input.
00:17:03.700 | That's what we don't want.
00:17:05.420 | So let's construct a solution that look like that.
00:17:08.340 | So here's the solution.
00:17:10.340 | So the input would be something like, hi, how are you?
00:17:14.260 | And then let's put a special token.
00:17:17.420 | Let's say the token is end.
00:17:19.660 | And then you're going to predict the first token, which
00:17:22.220 | is am.
00:17:24.460 | And then you predict the second token, fine.
00:17:26.940 | And then you predict the third token, thanks.
00:17:29.860 | And then you keep going on until you predict the word end.
00:17:33.660 | And then you stop.
00:17:36.700 | Now, I want to mention that in the previous set of slides,
00:17:43.220 | I was just talking about yes and no.
00:17:45.340 | And in yes and no, you have only two choices.
00:17:48.860 | Now you have more than two choices.
00:17:50.980 | You have actually 20,000 choices.
00:17:54.300 | And you can actually use the algorithm
00:17:57.060 | that are the logistic regression.
00:17:59.140 | And you can extend it to cover more than two choices.
00:18:03.820 | You can have a lot of choices.
00:18:06.860 | And then the algorithm will just follow the same way.
00:18:09.460 | So this was my first solution when I worked on this.
00:18:15.980 | But it turns out it didn't work very well.
00:18:17.860 | And the reason why it didn't work very well
00:18:19.660 | is because the model never knows what it actually
00:18:21.980 | predicted in the last step.
00:18:25.340 | So it keeps going.
00:18:26.740 | And it will keep synthesizing output.
00:18:28.500 | But it didn't know what it said.
00:18:30.540 | It didn't know what decision it committed
00:18:32.740 | in the previous step.
00:18:33.980 | So a better solution would look like this.
00:18:36.860 | A better solution is basically you
00:18:39.220 | feed what the model predicts in the previous step
00:18:42.940 | as input to the next step.
00:18:46.620 | So for example, in this case, I'm going to take am.
00:18:48.940 | I feed it in to the next step so that I'm
00:18:52.020 | going to predict the second word, which is fine,
00:18:57.020 | and et cetera.
00:18:58.420 | So a lot of people call this concept autoregressive.
00:19:02.220 | So you eat your own output and make it as your input.
00:19:08.900 | Any questions so far?
00:19:09.820 | How do you know when to stop?
00:19:15.580 | Oh, whenever it produces end, then you stop.
00:19:19.300 | There's a special token end.
00:19:20.740 | So relevant architecture here would be--
00:19:28.820 | people also call the encoder as the recurrent network
00:19:34.140 | in the input.
00:19:34.980 | And the decoder would be the recurrent network
00:19:37.100 | in the output.
00:19:40.020 | OK, so how do you train this?
00:19:41.460 | So again, so you basically run for a million steps.
00:19:45.220 | You see all of your emails.
00:19:46.780 | And then you sample.
00:19:48.780 | And for each iteration, you sample an email x and a reply
00:19:54.620 | y, which would be, I'm fine, thanks.
00:19:57.460 | And then you sample a random word yt in y.
00:20:01.380 | And then you update the RNN encoder and decoder parameters
00:20:06.100 | so that you can increase the probability that y of t
00:20:10.900 | is correct, given all what you've seen before,
00:20:15.300 | which is your yt minus 1, yt minus 2, et cetera,
00:20:18.900 | and also all the x's.
00:20:22.500 | And then you have to compute the partial derivatives
00:20:24.700 | to make it work.
00:20:25.660 | So computing the partial derivatives is very difficult.
00:20:29.060 | So again, I recommend you to use something
00:20:31.240 | like autodifferentiation in TensorFlow or Torch or Theano.
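
A compact sketch of that training objective: encode the input with one RNN, then run the decoder with teacher forcing and accumulate -log p(y_t | y_<t, x), which stochastic gradient descent minimizes. The sizes, and the assumption that the last input id plays the role of the end token, are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H, E = 12, 8, 6                         # toy vocabulary, hidden, and embedding sizes
emb = rng.normal(size=(V, E))              # shared token embeddings
A_e, U_e = rng.normal(size=(H, H)), rng.normal(size=(H, E))   # encoder RNN
A_d, U_d = rng.normal(size=(H, H)), rng.normal(size=(H, E))   # decoder RNN
W_o = rng.normal(size=(V, H))              # output projection over the vocabulary

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def seq2seq_loss(x_ids, y_ids):
    h = np.zeros(H)
    for i in x_ids:                        # encode the input email token by token
        h = np.tanh(A_e @ h + U_e @ emb[i])
    loss, prev = 0.0, x_ids[-1]            # assume the last input id is the <end> token
    for t in y_ids:                        # teacher forcing: feed the true previous word
        h = np.tanh(A_d @ h + U_d @ emb[prev])
        loss -= np.log(softmax(W_o @ h)[t])   # -log p(y_t | y_<t, x)
        prev = t
    return loss

print(seq2seq_loss([1, 2, 3, 0], [4, 5, 0]))   # id 0 plays the role of <end>
```
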
00:20:35.420 | OK, you have a question.
00:20:39.660 | [INAUDIBLE]
00:20:42.500 | Yeah, but the number of parameters
00:20:45.860 | doesn't change, because u, v, and a are fixed.
00:20:50.660 | OK, so the question from the audience
00:20:56.860 | is whether the RNN is different for different examples,
00:21:03.780 | and the answer is yes.
00:21:05.860 | So the number of steps are different.
00:21:08.260 | [INAUDIBLE]
00:21:10.740 | I have a question there.
00:21:12.220 | [INAUDIBLE]
00:21:15.180 | OK, yeah, I'm going to get to that in the next slide.
00:21:22.180 | Yeah, OK.
00:21:24.180 | [INAUDIBLE]
00:21:28.300 | So the question is, in practice, how long
00:21:31.020 | would I go with the RNN?
00:21:32.660 | I would say you usually stop at 400 steps or something
00:21:38.500 | like that, because outside of that,
00:21:40.180 | it's going to be too long to make the update.
00:21:43.860 | And compute, it's very expensive to compute.
00:21:49.140 | But you can go more if you want to.
00:21:51.900 | Yeah.
00:21:53.900 | I have a question.
00:21:55.740 | Yeah.
00:21:56.240 | [INAUDIBLE]
00:21:59.220 | [INAUDIBLE]
00:22:03.460 | Yeah, yeah, so that's a problem.
00:22:05.180 | So I'm going to talk about the prediction next.
00:22:09.100 | So let me go to the prediction, and then you can ask questions.
00:22:12.300 | So how do you do prediction?
00:22:13.580 | So the first algorithm that you can do is called greedy
00:22:17.580 | decoding.
00:22:19.180 | In greedy decoding, for any incoming email x,
00:22:24.220 | I'm going to predict the first word.
00:22:27.900 | And then you find the most likely word,
00:22:30.220 | and then you feed back in.
00:22:32.460 | And then you find the next most likely word,
00:22:35.140 | and then you feed back in, and et cetera.
00:22:38.340 | So you keep going.
00:22:39.380 | You keep going until you see the word end, and then stop.
00:22:42.500 | Or if it exceeds a certain length, you stop.
00:22:46.780 | Now, that's just too greedy.
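
Here is a small sketch of that greedy decoding loop against a toy decoder step: take the most likely word, feed it back in, and stop at the end token or at a length limit. The step function is a stand-in for the trained decoder, not the real model.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H, END = 12, 8, 0                        # toy vocabulary size, hidden size, <end> token id
A = rng.normal(size=(H, H)); U = rng.normal(size=(H, V)); W = rng.normal(size=(V, H))

def step(h, prev_id):
    # One decoder step: new hidden state and a distribution over the next word.
    h = np.tanh(A @ h + U @ np.eye(V)[prev_id])
    z = W @ h
    p = np.exp(z - z.max()); p /= p.sum()
    return h, p

def greedy_decode(h_enc, max_len=20):
    h, prev, out = h_enc, END, []
    for _ in range(max_len):                # also stop if the reply gets too long
        h, p = step(h, prev)
        prev = int(np.argmax(p))            # pick the single most likely word...
        if prev == END:
            break
        out.append(prev)                    # ...and feed it back in at the next step
    return out

print(greedy_decode(rng.normal(size=H)))
```
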
00:22:49.500 | So let's do a little bit less greedy.
00:22:51.580 | So it turns out that, so given x,
00:22:53.980 | you can predict more than one candidate.
00:22:55.820 | So let's say you can predict k candidates.
00:22:58.220 | Let's say three.
00:23:00.260 | So you take three candidates.
00:23:02.260 | And then for each candidate, you're
00:23:03.760 | going to feed in the next step, and then you arrive at three.
00:23:06.700 | So the next step, you're going to have nine candidates.
00:23:10.540 | And then you're going to end up going that way.
00:23:13.060 | So here's a picture.
00:23:14.740 | So given input x, I'm going to predict the first token.
00:23:19.180 | That would be hi, yes, and please.
00:23:21.140 | And given every first token like this,
00:23:22.940 | I'm going to feed back into the network,
00:23:24.580 | and the network will produce another three, and et cetera.
00:23:27.620 | So you're going to end up with a lot of candidates.
00:23:30.580 | So how do you select the best candidate?
00:23:32.460 | Well, you can traverse each beam,
00:23:35.060 | and then you compute the joint probability at each step.
00:23:38.700 | And then you find the sequence that
00:23:41.060 | has the highest probability to be the sequence of choice.
00:23:47.460 | That is your reply.
00:23:48.260 | Any question?
00:23:52.700 | This is the most complicated slide in my talk.
00:23:55.020 | Oh, yeah.
00:24:01.220 | So there will be no out-of-vocabulary words
00:24:04.700 | in your algorithm, right?
00:24:07.220 | So the question is, what do you do
00:24:10.380 | with the out-of-vocabulary words?
00:24:12.180 | Now, it turns out in this algorithm, what you do
00:24:14.180 | is that for any word that is out-of-vocabulary,
00:24:17.340 | you create a token called unknown.
00:24:20.220 | And you map everything,
00:24:22.460 | anything that is out-of-vocabulary,
00:24:24.860 | to unknown.
00:24:26.500 | So it doesn't seem a very nice thing,
00:24:28.820 | but usually it works well.
00:24:31.300 | There's a bunch of algorithms to address these issues.
00:24:33.820 | For example, they break it into characters and things
00:24:37.260 | like that, and then you could fix this problem.
00:24:40.620 | Yeah.
00:24:42.340 | Yeah.
00:24:42.840 | What is the cost function in training?
00:24:46.580 | The cost function is that--
00:24:48.500 | so I go back one slide.
00:24:51.060 | So the cost function--
00:24:53.260 | one more slide.
00:24:54.580 | So the cost function is that you sample a random word, yt.
00:24:58.780 | Let's suppose that here, this is my input so far.
00:25:06.300 | And I'm sampling yt.
00:25:07.660 | Let's say t is equal to 2, which means the word "fine."
00:25:13.540 | I'm at the word "fine."
00:25:14.980 | I want to increase the probability of the model
00:25:18.020 | to predict word "fine."
00:25:20.420 | So every time, the model will make a lot of predictions.
00:25:24.500 | A lot of them will be incorrect.
00:25:26.900 | So you have a lot of probabilities.
00:25:28.620 | You have a probability for the word "a,"
00:25:31.060 | and a probability for "aa," and et cetera,
00:25:33.060 | and then a probability for "zzz."
00:25:35.820 | And you have a lot of probabilities.
00:25:38.020 | You want the probability for the word "fine"
00:25:43.820 | to be as high as possible.
00:25:45.500 | You increase that probability.
00:25:48.500 | Does that make sense?
00:25:49.980 | Yeah, but we don't care if there's no "I am," et cetera.
00:25:53.980 | Oh, you condition on "I am."
00:25:56.980 | So when I'm at "fine," my input would be "hi," "how," "are you,"
00:26:04.020 | "end," and "am."
00:26:08.100 | That's all I see.
00:26:09.420 | And then I need to make a prediction.
00:26:10.960 | I have to make that prediction right.
00:26:13.700 | And if I'm at the word "thanks," my input would be "hi," "how,"
00:26:17.700 | "are you," "end," and "am fine."
00:26:20.620 | And I've got to get my "thanks" probability right.
00:26:27.700 | Yeah, I have a question here.
00:26:28.980 | So how do you personalize the same model
00:26:31.420 | for each user?
00:26:33.180 | Oh, I haven't thought about it yet.
00:26:37.140 | So the question is, how do you personalize?
00:26:39.460 | So well, one way to do it is basically
00:26:41.820 | embed a user as a vector.
00:26:43.740 | So let's suppose that you have a lot of users,
00:26:45.660 | and you embed a user as a vector.
00:26:47.500 | That's one way to do it.
00:26:49.140 | Yeah.
00:26:51.540 | I have a question here.
00:26:53.300 | So basically, if all the sequences were the same length,
00:26:56.260 | the number of paths down the tree is k to the n, right?
00:27:03.340 | Yeah.
00:27:03.840 | So that's a lot.
00:27:05.500 | Yeah.
00:27:06.540 | So the question is, let's suppose
00:27:08.260 | that my beam search is 10.
00:27:11.180 | Then you go from 10, like 100, and then 1,000,
00:27:15.660 | and suddenly it grows very quickly, right?
00:27:17.500 | It goes to-- if your sequence is long,
00:27:21.340 | then you end up with k to the n or something like that.
00:27:23.780 | Well, one way to do it is basically
00:27:25.860 | you do truncated beam search, where
00:27:29.300 | any sequence with very low probability,
00:27:31.140 | you just kick it out.
00:27:32.140 | You don't use it anymore.
00:27:33.220 | So you go-- so you can do this.
00:27:35.100 | You can do 3, 9, and then you're 27,
00:27:40.580 | and then you go back up to 9, right?
00:27:43.500 | And then you keep going.
00:27:45.580 | So that way, you don't end up with a huge beam.
00:27:48.540 | And usually, in practice, using a beam size of 3 or 10
00:27:53.380 | would work just fine.
00:27:54.500 | It works great.
00:27:55.100 | Yeah.
00:27:56.460 | Yeah, I have a question here.
00:27:57.660 | [INAUDIBLE]
00:28:02.700 | OK, so because it's an RNN, we
00:28:05.580 | don't have to pad the input.
00:28:07.300 | Now, to be fast, sometimes we have to pad the input,
00:28:10.860 | because we want to make sure that batch processing works
00:28:15.380 | very well, so we pad.
00:28:17.620 | But we pad with only like 0 tokens.
00:28:23.400 | Did you change the graph from batch to batch?
00:28:28.020 | Yeah, so let's suppose that you have a sequence of 10,
00:28:30.740 | then you have a graph for 10.
00:28:32.420 | You have a sequence-- a batch of all 20,
00:28:34.700 | you have a graph of 20, et cetera.
00:28:38.700 | Yeah, that will make the GPU very happy.
00:28:44.500 | I have a question there.
00:28:46.100 | If you would use the user embedding
00:28:48.580 | to customize the RAM to be applied,
00:28:51.060 | where you would connect that embedding to this memory?
00:28:55.020 | As an initial state, or--
00:28:58.980 | Oh, so--
00:28:59.500 | [INAUDIBLE]
00:29:01.740 | So are you asking--
00:29:03.540 | so my interpretation of your question
00:29:08.140 | is, how do you insert the word embedding into the model?
00:29:12.340 | Is that correct?
00:29:13.020 | No, the user embedding.
00:29:14.020 | User embedding.
00:29:14.740 | Oh, if you want to personalize the thing,
00:29:17.500 | then at the beginning, you have a vector.
00:29:20.100 | And that's a vector for Quoc with an ID 1, 2, 3, 4, 5.
00:29:25.300 | And then if it's Peter, then the vector would be 5, 6, 7, 8.
00:29:30.060 | So it would be the initial vector for the encoder,
00:29:33.180 | like the initial state for that.
00:29:34.620 | Yeah, yeah.
00:29:36.060 | That's one way to do it.
00:29:37.860 | Well, there's more than one way.
00:29:39.140 | You can do it at the end, or you can do it at the beginning,
00:29:41.980 | or you can insert it at every prediction steps.
00:29:46.020 | But my proposal is just put it at the beginning.
00:29:48.700 | That's simpler.
00:29:50.980 | I have a question there.
00:29:52.620 | Yeah, you.
00:29:54.020 | OK, well, I'm thinking that because your prediction is
00:29:56.980 | using the prediction as a nice [INAUDIBLE]
00:30:01.220 | Yeah.
00:30:01.720 | [INAUDIBLE]
00:30:02.220 | That's a very good question.
00:30:09.420 | The question is, what if the model derails?
00:30:12.820 | If you make a prediction, and then that's a bad prediction,
00:30:16.180 | and then your model never sees, and then it keeps derailing,
00:30:18.740 | and it will produce garbage.
00:30:20.540 | Yeah, that's a good question.
00:30:22.020 | So I'm going to get to that.
00:30:26.420 | So well, so this is a slide.
00:30:30.220 | So there's an algorithm called scheduled sampling.
00:30:33.020 | So in scheduled sampling, what you do is you--
00:30:38.020 | instead of feeding the truth during training,
00:30:40.700 | you can feed what is sampled from the softmax.
00:30:44.220 | So what's generated by the model
00:30:46.060 | is then fed in as input, so that the model understands
00:30:50.100 | that if it produced something bad,
00:30:51.940 | it can actually recover from it.
00:30:55.540 | So that's one way to address this issue.
00:31:00.620 | Did that make sense?
00:31:04.820 | Yeah.
00:31:05.320 | Any question?
00:31:09.500 | There's a question here.
00:31:11.040 | [INAUDIBLE]
00:31:11.540 | Yeah.
00:31:17.200 | [INAUDIBLE]
00:31:17.700 | Yeah, yeah.
00:31:19.420 | So in this algorithm, the question
00:31:21.580 | is, how large is the size of the decoder?
00:31:26.580 | Well, my answer is that try to be as large as possible.
00:31:30.420 | But it's going to be very slow.
00:31:31.760 | And in this algorithm, what happens
00:31:33.340 | is that you use the same--
00:31:38.220 | you use a fixed-length embedding to represent very, very
00:31:43.780 | long-term dependencies, like a huge input, right?
00:31:48.000 | And that's going to be a problem.
00:31:49.380 | So I'm going to come back to that issue with the attention
00:31:53.100 | model in a second.
00:31:56.460 | Any question?
00:31:57.140 | OK, here's a question.
00:31:58.140 | So if you're scoring based on a single word,
00:32:01.620 | doesn't that make you learn away from synonyms?
00:32:03.580 | Because you're unlikely to use them
00:32:05.060 | in the same sentence as the answer.
00:32:09.100 | So does the model learn synonyms?
00:32:11.900 | Is that the question?
00:32:12.700 | Or what's the question?
00:32:13.700 | Like, aren't you biasing it against synonyms?
00:32:15.660 | So it's like your answer is fine,
00:32:17.140 | and you're scoring based on that answer.
00:32:20.100 | Does it become an unlikely answer?
00:32:22.820 | Oh, I see.
00:32:24.820 | Well, yeah.
00:32:27.020 | It turns out that if you learn, the model maps "good" nearby.
00:32:30.060 | And if you visualize the embeddings, "good" and "fine" and so on
00:32:34.740 | are mapped very close together in the embedding space.
00:32:39.100 | But in the output, we don't know what else to do.
00:32:44.460 | The other approach is basically to train the word
00:32:47.500 | embeddings using Word2Vec and then
00:32:49.500 | try to ask the model to regress to the word embeddings.
00:32:53.940 | So that's one way to address the issue.
00:32:55.580 | We tried something like that.
00:32:56.780 | It did not work very well.
00:32:58.260 | So whatever we had in here was pretty good.
00:33:02.900 | I have to keep going.
00:33:03.740 | But anyway, the algorithm that you've seen so far
00:33:08.140 | turns out actually answers some emails.
00:33:10.860 | So if you use the smart reply feature in Inbox,
00:33:15.700 | it's already used this system in production.
00:33:20.020 | Now, for example, in this email, my colleague Greg Corrado
00:33:23.380 | got an email from his friend saying that, hey,
00:33:27.500 | we wanted to invite you to join us for an early Thanksgiving
00:33:31.540 | on November 22nd, beginning around 2 PM.
00:33:34.780 | Please bring your favorite dish and reserve by next week.
00:33:38.420 | And then it would propose three answers.
00:33:40.700 | For example, the first answer would be, count us in.
00:33:44.460 | Second answer would be, we'll be there.
00:33:46.820 | And the third answer is, sorry, we won't be able to make it.
00:33:50.380 | Now, where do these three answers come from?
00:33:52.860 | Those are the beams.
00:33:54.540 | Now, there's an algorithm to actually figure out
00:33:57.340 | the diversity as well of the beams
00:33:58.940 | so that you don't end up with very similar answers.
00:34:02.100 | So there's an algorithm, like a heuristic,
00:34:05.140 | that make these beams a little bit more diverse.
00:34:08.660 | And then they pick the best three to present to you.
00:34:11.580 | OK, any question?
00:34:16.780 | Yeah, I have a question here.
00:34:18.580 | How do you make sure that the beam terminates?
00:34:22.940 | Yeah, there's no guarantees.
00:34:24.300 | So the question is, how do I guarantee that the beam will
00:34:28.340 | terminate and end?
00:34:29.620 | Now, there's no guarantee.
00:34:32.460 | It can go on forever.
00:34:33.900 | Indeed, there are certain cases like that if you don't
00:34:36.180 | train the model very well.
00:34:37.380 | But if you train the model well with very good accuracy,
00:34:41.180 | then the model usually terminates.
00:34:43.540 | I've hardly seen any cases that it doesn't terminate.
00:34:50.020 | But there are certain corner cases
00:34:52.180 | that it will do funny things.
00:34:55.420 | But you can stop the model after 1,000 or 100
00:34:58.700 | or something like that so that you make sure
00:35:00.780 | that the model doesn't go on crazy.
00:35:06.620 | I have a question here.
00:35:07.820 | So there's nothing in the email which
00:35:10.300 | says they're inviting multiple people,
00:35:12.740 | but the reply seems all good with us meeting.
00:35:17.820 | That's very interesting.
00:35:18.820 | Yeah, it just comes out because there's a lot of emails.
00:35:21.500 | And if you invite someone, there's more than one person.
00:35:24.100 | And it learns about Thanksgiving.
00:35:25.980 | It just means inviting the whole family, things like that.
00:35:29.020 | Yeah, it just learns from statistics.
00:35:30.700 | [INAUDIBLE]
00:35:33.700 | Yeah, or maybe there's something like that.
00:35:35.500 | Yeah.
00:35:38.500 | [INAUDIBLE]
00:35:45.500 | Oh, in this algorithm-- so the question
00:35:48.060 | is, do I do any post-processing to correct
00:35:51.180 | the grammar of the beams?
00:35:52.820 | In this algorithm, we did not have to do it.
00:35:54.700 | [INAUDIBLE]
00:35:55.900 | Yeah.
00:36:00.000 | I have another question.
00:36:01.800 | So how contextual these multiple are?
00:36:05.280 | Are they very basic to a specific email,
00:36:09.240 | or do they tend to be [INAUDIBLE]??
00:36:12.760 | So OK, so the question is, how contextual?
00:36:15.520 | So I would say we don't have any user embedding in this.
00:36:18.320 | So it's pretty general.
00:36:21.040 | The input would be the previous emails,
00:36:23.600 | and the output would be the prediction, the reply.
00:36:29.480 | That's all we have.
00:36:30.320 | So it sees the context, which is the thread so far.
00:36:37.000 | Did I answer your question?
00:36:38.120 | OK, yeah, you can catch me up after the talk.
00:36:45.240 | Yeah?
00:36:46.240 | Is it running on the phone or on the server?
00:36:50.360 | It runs on the server.
00:36:51.840 | Yeah, slow.
00:36:54.960 | Question?
00:36:55.960 | I guess there are definitely some emails
00:36:57.960 | which are not suitable for auto-smart reply.
00:37:00.440 | How do you detect which email to reply?
00:37:02.840 | Oh, I see.
00:37:04.120 | So the question is, there are some emails that
00:37:06.240 | are not relevant for a smart reply.
00:37:08.200 | Maybe they're too long, or you should not
00:37:10.720 | reply or something like that.
00:37:12.200 | So in fact, we have two algorithms.
00:37:14.480 | So one algorithm is to say yes or no to whether to reply.
00:37:19.680 | And then after it passes the threshold,
00:37:22.360 | there's an algorithm that runs to produce the reply.
00:37:25.080 | So it's a combination of the two algorithms
00:37:26.760 | that I presented earlier.
00:37:31.040 | Yeah.
00:37:32.240 | I have to get going, but you can get back to the question.
00:37:35.440 | So there's a lot more interesting stuff coming along.
00:37:37.960 | OK, so what's the big picture so far?
00:37:40.440 | So the big picture is that we have an RNN encoder that
00:37:45.000 | eats all the input.
00:37:46.080 | And then we have an RNN decoder that's
00:37:48.760 | trying to predict one token at a time in the output.
00:37:52.160 | Now, everything else follows the same way.
00:37:55.560 | So you can use stochastic gradient descent
00:37:58.320 | to train the algorithm.
00:38:01.160 | And then you do beam search decoding.
00:38:04.320 | Usually, you do a beam search of three.
00:38:07.160 | And then you should be able to find a good beam
00:38:10.440 | with the highest probability.
00:38:12.360 | Now, someone in the audience brought up the issue
00:38:17.880 | that we use fixed length representation.
00:38:20.600 | So just before you make a prediction, the h of n,
00:38:24.600 | the white thing right before you go to the decoder,
00:38:29.160 | that is a fixed length representation.
00:38:31.160 | And you can think of it as a vector that captures
00:38:35.160 | everything in the input.
00:38:39.200 | It could be 1,000 words or it could be five words.
00:38:43.040 | And you use a fixed length representation
00:38:44.920 | for a variable length input, which is kind of not so nice.
00:38:49.960 | So we want to fix that issue.
00:38:53.040 | So there's an algorithm coming along.
00:38:55.400 | And it was actually invented at the University of Montreal.
00:39:01.280 | I'm not sure if he's here.
00:39:03.640 | So the idea is to use an attention.
00:39:06.840 | So how does an attention work?
00:39:08.240 | So in principle, what you want is something like this.
00:39:11.480 | Every time before you make a prediction--
00:39:13.600 | let's say you predict the word m--
00:39:15.840 | you kind of want to look again at all the hidden states
00:39:19.400 | so far.
00:39:20.640 | You want to look at all what you see in the input so far.
00:39:24.360 | Now, the same when you do "fine": you also
00:39:28.960 | want to see all the hidden states of the input so far, and so on.
00:39:33.560 | Now, how do you do that as a program?
00:39:36.680 | So well, you can do this.
00:39:37.960 | So at h of m, you predict a vector c.
00:39:41.680 | Let's say that vector is the same dimension with all the h.
00:39:47.520 | So if your h of 1 is a dimension of 100,
00:39:51.080 | then c also has a dimension of 100.
00:39:54.560 | And then you take c, and then you do dot product
00:39:57.280 | with all the h.
00:39:59.360 | And then you have coefficients a_0, a_1, blah, blah, blah,
00:40:05.080 | up to a_n.
00:40:07.880 | And those are scalars.
00:40:11.200 | And then after you have those scalars,
00:40:12.800 | you compute something called the betas, which is basically
00:40:16.440 | a softmax of all the a's.
00:40:20.280 | So to compute that, you take the--
00:40:22.480 | bi is an exponential of ai divided
00:40:26.360 | by the sum of exponentials.
00:40:30.800 | And then you take those bi and then multiply by h of i.
00:40:36.680 | And then you take the weighted average.
00:40:39.720 | And then you take the sum.
00:40:40.960 | And then you send it to add additional signal
00:40:44.760 | to predict the word m.
00:40:46.480 | And then you keep going with that.
00:40:48.720 | So in the next step, you also predict another c.
00:40:51.160 | And then you take that c to compute the dot product.
00:40:53.760 | You compute the a, and then you can compute the b.
00:40:57.000 | You can take the b.
00:40:57.760 | You do the weighted average.
00:40:59.440 | And then you send it to the next step
00:41:01.560 | to extend it to the prediction.
00:41:03.080 | And then you use stochastic gradient descent
00:41:05.160 | to train everything.
00:41:10.080 | And this algorithm is implemented in TensorFlow.
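
Putting that attention computation into a short sketch: dot the predicted vector c against every encoder hidden state to get the scalars a_i, softmax them into the weights (the betas), and take the weighted average of the hidden states as the extra signal for the next prediction. Dimensions are toy values.

```python
import numpy as np

def attend(c, H):
    # c: vector predicted at the current decoder step, shape (d,)
    # H: encoder hidden states stacked as rows, shape (n, d)
    a = H @ c                               # dot-product scores a_0 ... a_n
    b = np.exp(a - a.max()); b /= b.sum()   # softmax weights (the betas)
    context = b @ H                         # weighted average of the hidden states
    return context, b

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 100))               # five input positions, 100-dimensional hidden states
c = rng.normal(size=100)
context, weights = attend(c, H)
print(weights)                              # one weight per input position
```
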
00:41:14.640 | OK, so how intuitively-- what is going on here?
00:41:18.600 | So let's suppose that you want to use this for translation.
00:41:22.480 | So in translation, you want to--
00:41:26.200 | for example, the input would be, hi, how are you?
00:41:28.680 | And the output is, hola, como estas, or something like that.
00:41:34.320 | And then when you predict the first word,
00:41:38.000 | you want hola to correspond to the word hi.
00:41:42.920 | Because there's a one-to-one mapping
00:41:44.920 | between the word hi and hola.
00:41:47.040 | So if you use the attention model,
00:41:49.560 | the betas that you learn will put a strong weight
00:41:53.720 | for the word hola--
00:41:55.200 | for the word hi.
00:41:56.800 | And then it has a smaller weight for all the stuff.
00:41:59.320 | And then if you keep going, then when you say, como,
00:42:02.440 | then it will focus on how, and et cetera.
00:42:06.040 | So it moves that coefficient.
00:42:08.600 | It puts a strong emphasis on the relevant word.
00:42:11.640 | And especially for translation, it's
00:42:13.200 | extremely useful because you know the one-to-one mapping
00:42:17.160 | between the input and output.
00:42:20.160 | Any questions so far?
00:42:21.040 | This is definitely very complicated.
00:42:25.240 | Yeah, I have a question.
00:42:26.640 | So how do you deal with languages
00:42:29.480 | where word orders are different?
00:42:31.760 | Oh, right now, the A and B are learned.
00:42:36.280 | So I don't-- and so the question is,
00:42:39.120 | how do I deal with languages where the order don't reverse,
00:42:42.800 | for example, English to Japanese?
00:42:45.240 | So some of the verbs get moved and things like that.
00:42:48.080 | Well, I did not hard code A or B. They are learned.
00:42:54.160 | So by virtue of learning, they will figure out
00:42:57.120 | what beta to put to weight the input.
00:43:03.000 | And those are basically computed by gradient set.
00:43:06.360 | So they just keep on learning.
00:43:07.640 | OK, I have a question here.
00:43:12.840 | [INAUDIBLE]
00:43:18.000 | Yeah, so the question is, are there
00:43:20.160 | any work on putting attention in the output?
00:43:22.840 | Yeah, I think you can do that.
00:43:26.080 | I'm not too familiar with any work in here,
00:43:27.960 | but I think it's possible to do it.
00:43:30.640 | I think some people have explored something like that.
00:43:32.960 | Yeah.
00:43:35.240 | Any question?
00:43:35.760 | Oh, I have a question.
00:43:39.520 | Another question.
00:43:40.200 | So with the capitalization of your first words in the--
00:43:45.200 | Yeah.
00:43:45.700 | --line, does that imply that you have to have your own [INAUDIBLE]
00:43:49.480 | Yeah.
00:43:49.980 | [INAUDIBLE]
00:43:52.040 | Yeah, yeah, yeah.
00:43:52.800 | So the question is, let's suppose-- because right now,
00:43:55.520 | the word "hi" is capitalized at the first character.
00:44:00.320 | Does it mean I'm using 2n or n vocabulary size?
00:44:04.400 | So in practice, we should do some normalization.
00:44:07.240 | If you have a small data set, what you should do
00:44:09.720 | is you normalize the text.
00:44:11.080 | So "hi" will be lowercase and et cetera.
00:44:14.960 | Now, if you have a huge data set, it doesn't matter.
00:44:17.440 | We just learn.
00:44:20.480 | Yeah.
00:44:21.120 | I have a question there.
00:44:22.200 | So in essence, actually, this can
00:44:24.000 | capture the positional aspect of the words, right?
00:44:27.720 | Right, yeah.
00:44:28.400 | So the question is, in a sense, it
00:44:30.840 | captures the positional information in the input.
00:44:34.320 | Yeah, I agree.
00:44:35.320 | I have a question there.
00:44:39.080 | What about punctuation?
00:44:41.800 | Pardon?
00:44:42.300 | What about punctuation?
00:44:44.040 | Punctuation.
00:44:45.400 | So the question is, what do I do with punctuation?
00:44:48.160 | Well, right now, I just present the algorithm
00:44:52.320 | as if it's a very simple implementation,
00:44:56.720 | like the very basic.
00:44:57.840 | But one thing that you can do is before you
00:45:03.120 | train the algorithm, you put a space between the word
00:45:07.440 | and the punctuation so that you do some--
00:45:10.560 | that step is called tokenization or normalization
00:45:13.800 | in language processing.
00:45:15.320 | So you can use any Stanford NLP package or something
00:45:19.480 | like that to normalize your text so that it's easy to train.
00:45:23.480 | Now, if you have infinite data, then it will just learn itself.
00:45:30.840 | So I should get going because there's a lot of-- all
00:45:32.720 | the interesting stuff.
00:45:34.140 | So it turns out that that's the basic implementation.
00:45:38.080 | But if you want to get good results
00:45:40.280 | and if you have big data sets, so one thing that you can do
00:45:43.360 | is to make the network deep.
00:45:44.920 | And one way to make deep is in the following way.
00:45:48.400 | So you stack your recurrent network on top of each other.
00:45:53.120 | So you-- in the first sequence-to-sequence paper,
00:45:56.080 | we used a network of four layers, but people are gradually
00:45:59.040 | increasing to six, eight, and so on right now.
00:46:01.960 | And they're getting better and better results.
00:46:04.120 | Like in ImageNet, if you make a network deeper,
00:46:07.440 | you also get better results.
00:46:11.260 | So if you want to train sequence-to-sequence
00:46:14.440 | with attention, then a couple of years ago,
00:46:19.720 | we, like many labs working on this problem,
00:46:23.400 | were behind the state of the art.
00:46:25.440 | But right now, on many translation tasks,
00:46:31.440 | basically,
00:46:33.760 | this model has already achieved state of the art
00:46:36.160 | results in a lot of these WMT data sets.
00:46:40.160 | So to train this model, so number one is that, as I said,
00:46:44.920 | you might end up with a lot of vocabulary--
00:46:48.800 | out of vocabulary issues.
00:46:53.160 | So Barack Obama will be just an unknown, right?
00:46:56.360 | Hillary Clinton is an unknown.
00:46:58.280 | Now, you might use something like word segments, right?
00:47:02.440 | So you segment the words out.
00:47:03.960 | For example, Barack Obama would be "ba" and "rak" and et cetera.
00:47:10.120 | Or you can use all the smart algorithms.
00:47:12.720 | For example, word character split.
00:47:15.000 | You can split words that are unknown into characters,
00:47:18.960 | and then you treat them as characters.
00:47:20.560 | There's some work at Stanford, and they
00:47:21.760 | prove that it works very well.
00:47:23.240 | So that's one way to do it.
00:47:25.080 | Now, tip number two is that when you train this algorithm--
00:47:29.880 | because when you do back propagation or forward
00:47:32.640 | propagation, you essentially multiply a matrix
00:47:36.120 | many, many times.
00:47:37.040 | So you have explosion of function value
00:47:42.200 | or the gradient or implosion as well.
00:47:47.360 | Now, one thing that you can do is you clip
00:47:49.120 | the gradient at a certain value.
00:47:51.440 | So you say that if the magnitude of the gradient
00:47:55.600 | is larger than 10, set it to 10.
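
A one-function sketch of that clipping rule, here read as clipping by the gradient's norm; clipping each entry at 10 element-wise is another common interpretation.

```python
import numpy as np

def clip_gradient(g, threshold=10.0):
    # If the gradient's magnitude exceeds the threshold, rescale it so its
    # norm equals the threshold; the direction is preserved.
    norm = np.linalg.norm(g)
    return g if norm <= threshold else g * (threshold / norm)

g = np.array([30.0, 40.0])                  # norm 50
print(clip_gradient(g))                     # rescaled to norm 10
```
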
00:48:00.000 | Then tip number three is to use GRU, or in our work,
00:48:04.280 | we use a long short-term memory.
00:48:07.160 | So I want to revisit this long short-term memory
00:48:09.400 | business a little bit.
00:48:11.040 | So what's long short-term memory?
00:48:12.640 | So if you use an RNN cell, basically, you
00:48:16.600 | concatenate your input and the hidden state,
00:48:22.080 | and then you multiply by some theta,
00:48:24.120 | and then you apply with some activation function.
00:48:27.280 | Let's say that's a hyperbolic tangent.
00:48:31.080 | Now, that's the simple function for the RNN.
00:48:35.400 | Now, in LSTM, you basically multiply the input and H
00:48:40.960 | by a big matrix.
00:48:44.040 | Let's call that theta.
00:48:45.560 | That theta is four times bigger than the theta
00:48:49.480 | I said in the RNN cell.
00:48:51.280 | And then you're going to take that Z that's coming out.
00:48:55.880 | You split it into four blocks.
00:48:58.520 | Each block, you can compute the gates.
00:49:01.240 | And then you use the value of something called the cell,
00:49:06.320 | and then you keep adding the newly computed values
00:49:10.120 | to the cell.
00:49:11.880 | So there's a part here that I say the integral of C.
00:49:15.560 | It's that what it does, it basically
00:49:17.480 | keeps a hidden state where it keeps adding information to it.
00:49:21.520 | So it doesn't multiply information,
00:49:23.240 | but it keeps adding information.
00:49:24.880 | You don't need to know a lot of this
00:49:26.420 | if you want to just apply LSTM, because it's already
00:49:29.800 | implemented in TensorFlow.
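
For reference, here is a NumPy sketch of the LSTM cell as described: one big theta (four times the size of the plain RNN cell's matrix) applied to the concatenation of the input and hidden state, split into four blocks for the gates and the candidate update, with the cell state updated by addition. The gate ordering and initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, h, c, theta):
    # theta maps [x; h] to a vector four times the hidden size, split into
    # the input gate, forget gate, output gate, and candidate update.
    z = theta @ np.concatenate([x, h])
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # the cell keeps *adding* information
    h = sigmoid(o) * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
H, E = 4, 3
theta = rng.normal(size=(4 * H, E + H))            # four times bigger than a plain RNN cell's matrix
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_cell(rng.normal(size=E), h, c, theta)
print(h)
```
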
00:49:31.200 | Any questions so far?
00:49:35.360 | So in terms of applications, you can use this thing
00:49:42.760 | to do summarization.
00:49:44.280 | So I've started seeing work in summarization, pretty exciting.
00:49:48.680 | You can do image captioning.
00:49:51.640 | And the input in that case would just
00:49:53.280 | be the representation of an image coming out from VGG
00:49:59.640 | or coming out from GoogLeNet, et cetera.
00:50:01.960 | And then you send it to the RNN.
00:50:03.600 | The RNN will do the decoding for you.
00:50:06.560 | Or you can use it for speech recognition or transcription.
00:50:11.680 | Or you can use it for Q&A.
00:50:14.280 | So the next part of the talk, I will
00:50:18.400 | talk a little bit about speech recognition.
00:50:23.240 | So in speech recognition, the input
00:50:26.720 | could be maybe some waveforms.
00:50:28.480 | And then the output could be some words, hi, how's it.
00:50:34.240 | Well, one thing that you can do is
00:50:36.760 | you crop your input into Windows.
00:50:39.480 | That's the green boxes there.
00:50:41.720 | And then you crop a lot of them.
00:50:43.200 | And then you send a lot of them to an RNN.
00:50:46.080 | And then you convert them into MFCCs before you send them to the RNN.
00:50:49.920 | MFCC or spectrogram or something like that.
00:50:52.680 | And then you use the algorithm that I said earlier
00:50:58.120 | and then with attention.
00:51:00.840 | And then you do the transcription.
00:51:03.400 | You predict one word at a time in the output.
00:51:06.360 | Now, the problem with this algorithm
00:51:09.680 | is that when it comes to speech, you
00:51:12.760 | end up with a lot of input.
00:51:15.000 | You can end up with thousands and thousands of steps.
00:51:17.760 | So backpropagating in time, even with attention,
00:51:20.280 | can be difficult.
00:51:22.240 | Now, one thing that you can do is basically
00:51:24.760 | you do some kind of a pyramid to map the input.
00:51:28.840 | So if you do enough layers, you can reduce your input
00:51:32.440 | by a factor of 8 or 16.
00:51:39.560 | And then you produce the output.
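
A minimal sketch of that pyramid idea: each layer concatenates adjacent frames, halving the number of time steps, so three layers cut a thousand-frame input down by a factor of 8. A real model would run an RNN over each shortened sequence; this sketch only shows the length reduction.

```python
import numpy as np

def pyramid_layer(frames):
    # Concatenate each pair of adjacent frames, halving the number of time steps.
    if len(frames) % 2 == 1:
        frames = frames[:-1]                # drop a leftover frame for simplicity
    return [np.concatenate([frames[t], frames[t + 1]]) for t in range(0, len(frames), 2)]

frames = [np.zeros(40) for _ in range(1000)]   # e.g. 1,000 spectrogram frames
for _ in range(3):                             # three pyramid layers
    frames = pyramid_layer(frames)
print(len(frames))                             # 125 steps: a factor of 8 fewer
```
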
00:51:43.880 | So we're working on an implementation
00:51:46.560 | where the output is actually characters,
00:51:49.800 | like in the Baidu's work where they have the CTC.
00:51:54.200 | Now, I have to say that the strength of this algorithm
00:51:58.840 | is that it actually has an implicit language
00:52:01.440 | model in the output.
00:52:03.840 | So when I say I have the word "how,"
00:52:06.720 | it's actually conditioned on "hi" and the steps before,
00:52:10.800 | and including the input.
00:52:11.920 | So there's an implicit language model already.
00:52:14.600 | But the problem with this is that actually you
00:52:19.560 | have to wait until the end of the input to do decoding.
00:52:24.000 | So the decoding has to be done offline.
00:52:27.520 | So if you use this for voice search,
00:52:29.880 | it might not be too nice because people want
00:52:33.880 | to see some output right away.
00:52:37.640 | So in that case, there's an algorithm
00:52:39.480 | that can use this and do it in an online fashion,
00:52:42.120 | block by block.
00:52:45.080 | Now, also, I have to mention that in translation,
00:52:51.120 | the sequence with attention works great.
00:52:54.080 | It's among the state of the art.
00:52:56.040 | But when it comes to speech, it doesn't work as well
00:52:59.080 | as the CTC, at least in published results.
00:53:03.800 | We're not as good as CTC, which is what Adam talked earlier,
00:53:08.040 | or some of the HMM/DNN hybrid, which
00:53:13.120 | is the most widely used speech system currently.
00:53:19.640 | So I want to pause there, and then I can take questions.
00:53:22.800 | Any questions?
00:53:23.400 | I have a question at the back.
00:53:24.840 | Yeah?
00:53:25.720 | So on the machine translation, you were mentioning attention.
00:53:30.120 | So according to, say, English, German,
00:53:34.120 | there's one word that basically has the meaning
00:53:39.560 | of multiple words in English.
00:53:42.560 | How does attention work there?
00:53:44.040 | Because attention will be focusing on [INAUDIBLE]
00:53:47.520 | sense, and you know that one is the sending sense.
00:53:51.000 | But when you're predicting, it's only
00:53:53.000 | going to predict one or two words.
00:53:58.760 | So how does it work in translation?
00:54:00.760 | Well, in translation, what we do is basically
00:54:04.080 | we have pairs of sentences.
00:54:06.400 | So for example, hi, how are you?
00:54:08.840 | And then, hola, como estas?
00:54:12.680 | And then we have pairs of sentences like this,
00:54:14.920 | and then we just feed it into the sequence
00:54:18.880 | to sequence of attention.
00:54:20.320 | At every step, again, we're going
00:54:22.080 | to predict one word at a time.
00:54:23.840 | But before we make a prediction, the model has the attention.
00:54:27.280 | So it actually sees the input once more
00:54:31.520 | before it makes a prediction.
00:54:32.760 | That's how it works.
00:54:33.960 | Now, what is-- can you repeat?
00:54:36.440 | What is the issue with the model again, please?
00:54:38.720 | When you look at, say, i equals 1 to the [INAUDIBLE]
00:54:44.040 | Yeah.
00:54:45.040 | When you're predicting, the first one
00:54:48.520 | you send in the sequence, the prediction
00:54:52.040 | is actually just one big word.
00:54:54.520 | So there's no [INAUDIBLE] position in that.
00:54:57.520 | So there's no basically one, two, three coming afterward.
00:55:00.520 | I see.
00:55:01.960 | Well, I can't quite follow the question.
00:55:04.480 | But let's take it offline.
00:55:05.840 | Is that OK?
00:55:06.760 | Yeah, yeah.
00:55:07.320 | And then we can do some paper together.
00:55:10.000 | [LAUGHTER]
00:55:14.000 | I have a question.
00:55:14.680 | Yeah.
00:55:15.680 | So I have a question.
00:55:17.160 | For example, what if you actually, in the inbox,
00:55:20.160 | you compose email in different languages,
00:55:22.640 | like Vietnamese and English?
00:55:24.640 | Yeah.
00:55:25.140 | Do you actually separate the different emails
00:55:27.140 | and send them to different models?
00:55:28.640 | Yeah.
00:55:29.140 | Or just a single model?
00:55:31.400 | So the model-- the inbox thing that I presented,
00:55:34.160 | it was all in English.
00:55:36.200 | But there's no limitation in the model in terms of language.
00:55:39.280 | So let's suppose that in your inbox,
00:55:43.360 | sometimes you write in English, and sometimes you
00:55:45.600 | write in Vietnamese, or sometimes you
00:55:47.440 | write it in Spanish, whatever.
00:55:49.560 | And you personalize by user embedding.
00:55:52.400 | Then I would say that it will just learn your behavior.
00:55:55.040 | And then it will basically predict the word that you want.
00:55:58.680 | But make sure that your output vocabulary is large enough
00:56:03.120 | so that it covers not only the English words,
00:56:05.720 | but also the Spanish words, et cetera,
00:56:08.920 | like Vietnamese and so on.
00:56:10.120 | So your vocabulary is not going to be 20,000.
00:56:12.920 | It's going to be like 100,000, because you have more choices.
00:56:16.840 | And then you have to train your model on those examples.
00:56:20.400 | Yeah.
00:56:23.160 | It's a matter of training data.
00:56:25.160 | That's all.
00:56:26.400 | I have a question here.
00:56:27.400 | You mentioned that in the voice search,
00:56:29.880 | it cannot do the decoding online.
00:56:32.840 | Yeah?
00:56:33.340 | Is it possible to change your model a little bit
00:56:36.320 | so that it doesn't have to wait until the end
00:56:39.320 | of the voice search to start predicting?
00:56:40.120 | Yeah, yeah, yeah.
00:56:40.520 | So the question is that in the case of voice search,
00:56:42.520 | right now you have to wait to the end to make a prediction.
00:56:45.040 | Is there any other ways?
00:56:46.000 | Yeah, yeah.
00:56:47.000 | The answer is yes.
00:56:47.840 | You can make a prediction block by block.
00:56:50.080 | So you can actually figure out an algorithm, a simple algorithm
00:56:52.920 | to actually segment the speech, and then make a prediction,
00:56:56.760 | and then take the prediction and feed it
00:56:58.560 | as input at the next block.
00:57:00.000 | So you can keep going like that.
00:57:01.340 | So in theory, you can actually do online decoding.
00:57:06.280 | But I'm saying that you can do online decoding,
00:57:10.880 | but that work is currently work in progress.
00:57:14.400 | How about that?
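A minimal sketch of that block-by-block idea, where `encode_block` and `decode_with_prefix` are hypothetical placeholders for the seq2seq encoder and for an attention decoder conditioned on a text prefix:

```python
def online_decode(audio, block_size, encode_block, decode_with_prefix):
    """Decode speech block by block instead of waiting for the whole utterance.

    encode_block and decode_with_prefix are hypothetical placeholders for the
    seq2seq encoder and for an attention decoder conditioned on a text prefix.
    """
    transcript = []
    for start in range(0, len(audio), block_size):
        block = audio[start:start + block_size]
        encoder_states = encode_block(block)
        # Condition the decoder on everything emitted so far, so context
        # carries across block boundaries.
        new_words = decode_with_prefix(encoder_states, prefix=transcript)
        transcript.extend(new_words)
        yield new_words  # emit partial results right away
```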
00:57:16.760 | I have a question there.
00:57:17.800 | So the question is regarding Google's auto-reply.
00:57:21.800 | Yeah.
00:57:22.640 | So if I had to build this myself,
00:57:24.320 | I was wondering how you guys set up
00:57:25.920 | the training data set?
00:57:27.000 | Because it's difficult to-- how do you
00:57:29.520 | know, for this question, that this is the answer you should generate?
00:57:32.400 | Oh, yeah.
00:57:32.900 | So we have some input emails and then some output emails,
00:57:37.020 | where experts have written the email replies.
00:57:40.340 | And then you can just train it that way.
00:57:42.540 | So you have a training data set that you
00:57:44.980 | have created for this purpose?
00:57:46.900 | Yeah.
00:57:49.900 | I have a couple of questions.
00:57:52.820 | So it seems like with sequence to sequence, there's
00:57:55.220 | not much of a constraint on how the output aligns
00:57:58.460 | with the input, while CTC
00:58:00.660 | does sort of constrain that.
00:58:02.260 | Yeah.
00:58:02.760 | [INAUDIBLE]
00:58:03.260 | Is there any way to add this constraint to sequence
00:58:06.260 | to sequence so that it works better
00:58:07.740 | for something like speech recognition?
00:58:09.540 | Yeah.
00:58:10.180 | The question is that in speech recognition,
00:58:12.100 | the CTC seems to be a very nice framework,
00:58:14.260 | because it matches--
00:58:15.860 | there's more of a monotonic alignment
00:58:18.140 | between the output and the input.
00:58:20.500 | But CTC makes this independence assumption.
00:58:23.380 | It doesn't have a language model in it.
00:58:25.380 | Maybe the sequence to sequence--
00:58:29.260 | can address this?
00:58:30.180 | Yeah, I think that's a great idea.
00:58:31.780 | Maybe we should write a paper together.
00:58:33.420 | [LAUGHTER]
00:58:36.780 | I think-- I haven't seen it, but I think
00:58:39.380 | that's a very good idea.
00:58:41.300 | Question?
00:58:42.300 | [INAUDIBLE]
00:58:42.800 | I see.
00:58:54.340 | Great.
00:58:54.980 | So the question is that, because right now we
00:58:58.660 | predict one step at a time, is there
00:59:00.380 | any way to actually look globally at the output
00:59:03.860 | and maybe use some kind of reinforcement learning
00:59:05.900 | to adjust the output?
00:59:07.380 | And the answer is yes.
00:59:08.420 | So there's a recent paper at Facebook called, I think,
00:59:13.180 | sequence level training or something like that,
00:59:15.700 | where they don't optimize for one step at a time,
00:59:19.540 | but they look at the output globally,
00:59:21.460 | and then they try to improve word error rate,
00:59:24.820 | or they try to improve BLEU score or things
00:59:28.620 | like that for translation.
00:59:30.100 | And it seems to be making some improvement in the metrics
00:59:35.600 | that they care about.
00:59:36.940 | Now, if you show it to humans, though,
00:59:39.940 | people still prefer the output from the next-step-prediction model.
00:59:44.660 | So some of the metrics that we use in translation and so on
00:59:47.660 | might not be the metrics that we should optimize.
00:59:51.100 | And next-step prediction seems
00:59:53.100 | to be what people like a lot in translation.
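A rough sketch of the sequence-level idea mentioned here (not the exact recipe from that paper): sample a whole output sequence, score it with a sequence-level metric such as BLEU, and scale the log-likelihood term by the reward relative to a baseline. `log_prob_fn` and `reward_fn` are hypothetical placeholders.

```python
def sequence_level_loss(sampled_sequence, log_prob_fn, reward_fn, baseline=0.0):
    """REINFORCE-style surrogate loss for one sampled output sequence.

    log_prob_fn returns the model log-probability of the whole sequence and
    reward_fn scores it with a sequence-level metric such as BLEU; both are
    hypothetical placeholders for real model and metric code.
    """
    reward = reward_fn(sampled_sequence)
    advantage = reward - baseline  # a baseline reduces gradient variance
    # Minimizing this surrogate pushes up the probability of sequences
    # whose reward is above the baseline.
    return -advantage * log_prob_fn(sampled_sequence)
```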
00:59:55.780 | [INAUDIBLE]
00:59:56.280 | Yeah, so the question is, can we add the GAN loss?
01:00:04.140 | Yeah, I think that's a great idea.
01:00:06.100 | Yeah.
01:00:07.700 | I have a question here.
01:00:08.620 | Yeah.
01:00:09.120 | [INAUDIBLE]
01:00:09.620 | Yeah.
01:00:14.120 | [INAUDIBLE]
01:00:14.620 | Change?
01:00:19.720 | [INAUDIBLE]
01:00:20.220 | Once you have the model, is there a way
01:00:23.140 | based on the human input to influence the encoder?
01:00:28.460 | Yeah, yeah.
01:00:29.060 | So let's suppose that you type the first word, "Hola,"
01:00:33.180 | then you can actually start the beam from there.
01:00:35.780 | So the question is, is there any way to incorporate user input?
01:00:39.660 | So I say, yeah.
01:00:41.020 | Let's suppose that you say, "Hola," sorry, "Hi, how are you?"
01:00:46.020 | And then as soon as the person type "Hola,"
01:00:49.660 | that actually restrict your beam.
01:00:51.900 | So you can actually condition your beam on the first word,
01:00:55.220 | "Hola," and your beam will be better.
01:00:57.420 | Yeah, I think that's a good idea.
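A minimal sketch of conditioning the beam on what the user has already typed: every hypothesis is seeded with the typed prefix and beam search continues from there. `step_fn` is a hypothetical placeholder that returns next-token log-probabilities given the tokens decoded so far.

```python
def beam_search_with_prefix(step_fn, prefix, beam_size=4, max_len=20, eos_id=0):
    """Beam search forced to start with a user-typed prefix.

    step_fn(tokens) is a hypothetical placeholder that returns a dict of
    {next_token: log_prob} given the tokens decoded so far.
    """
    beams = [(list(prefix), 0.0)]  # seed every hypothesis with the prefix
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos_id:
                candidates.append((tokens, score))  # finished hypothesis
                continue
            for tok, logp in step_fn(tokens).items():
                candidates.append((tokens + [tok], score + logp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]  # best-scoring completion, starting with the prefix
```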
01:00:59.700 | I have a question.
01:01:00.380 | [INAUDIBLE]
01:01:00.880 | Oh, so how much data did we use?
01:01:10.740 | So in translation, for example, we
01:01:12.860 | use several WMT corpora.
01:01:17.340 | And the WMT corpora usually have tens of millions
01:01:22.260 | of pairs of sentences, something like that.
01:01:24.700 | And every sentence has like 20 words on average,
01:01:30.220 | 20, 30 words on average.
01:01:31.980 | I can't remember exactly, but it's something like that,
01:01:34.020 | order of magnitude.
01:01:35.740 | Yeah, I have a question there.
01:01:37.620 | [INAUDIBLE]
01:01:38.120 | I can't really hear.
01:01:42.860 | [INAUDIBLE]
01:01:45.900 | So how is it compared to Google Search auto-completion?
01:01:51.340 | Honestly, I don't know what's used underneath Google Search
01:01:55.020 | auto-completion.
01:01:56.060 | But I think they should use something like this,
01:01:59.380 | because it's--
01:01:59.980 | [LAUGHTER]
01:02:00.480 | OK, I have still lots of interesting stuff coming along.
01:02:07.820 | So OK, so what's the big picture?
01:02:12.100 | So the big picture is so far, I talk about sequence
01:02:15.980 | to sequence learning.
01:02:17.700 | And yesterday, Andrew was talking
01:02:20.540 | about most of the big trends in deep learning.
01:02:25.580 | And he was saying that the second trend was basically
01:02:28.140 | doing end-to-end deep learning.
01:02:29.860 | So you can characterize sequence to sequence learning
01:02:32.980 | as end-to-end deep learning as well.
01:02:36.460 | So the framework is very general.
01:02:39.060 | So it should work for a lot of NLP-related tasks,
01:02:43.220 | because a lot of them, you would have input sequence and output
01:02:47.060 | sequence in NLP.
01:02:48.420 | The input could be some text,
01:02:50.780 | and the output could be some parse trees.
01:02:53.420 | That's also possible.
01:02:56.380 | But it works great when you have a lot of data.
01:02:59.740 | Now, when you don't have enough data,
01:03:01.260 | then maybe you want to consider dividing your problems
01:03:04.480 | into smaller components, and then train your sequence
01:03:07.180 | to sequence models on the subcomponents,
01:03:08.820 | and then merge them.
01:03:11.260 | Now, if you don't have a lot of data,
01:03:13.740 | but you have a lot of related tasks,
01:03:16.100 | then it's also possible to actually merge all these tasks
01:03:19.980 | by combining the data, and then have an indicator bit to say,
01:03:24.420 | this is translation, this is summarization,
01:03:26.900 | this is email reply, and then train it jointly.
01:03:31.780 | And that should improve your output, too.
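A small sketch of that multi-task setup: combine the datasets and prepend a task indicator token to every input sequence so one model can be trained jointly. The `<2task>` tag format and the toy data are illustrative assumptions.

```python
def merge_tasks(datasets):
    """Combine several (source, target) datasets into one training set by
    prepending a task indicator token to every source sequence."""
    merged = []
    for task, pairs in datasets.items():
        tag = f"<2{task}>"  # e.g. "<2translate>" as the indicator token
        for src, tgt in pairs:
            merged.append(([tag] + src, tgt))
    return merged

# Toy usage with made-up data:
data = {
    "reply":     [(["are", "you", "free", "?"], ["yes"])],
    "translate": [(["hi", "how", "are", "you"], ["hola", "como", "estas"])],
}
combined = merge_tasks(data)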
01:03:36.620 | Now, this basically concludes the parts
01:03:40.580 | about sequence to sequence.
01:03:41.660 | And then the next part, I'm going
01:03:43.260 | to place sequence to sequence in the bigger
01:03:45.980 | picture of the active ongoing work in neural nets for NLP.
01:03:57.180 | So if you have any questions, you can ask now.
01:03:59.780 | I take maybe two questions, because I
01:04:01.500 | think I'm running out of time.
01:04:03.820 | So I have a question.
01:04:04.700 | Yeah?
01:04:05.200 | I have a question.
01:04:06.160 | So does your model handle emoji in NLP?
01:04:11.940 | So the question is, does the model handle emoji?
01:04:16.820 | I don't know, but emoji is like a piece of text, too, right?
01:04:19.820 | So you can just feed it in as another extra token.
01:04:23.260 | If you make your vocabulary 200,000,
01:04:27.220 | then you should be able to cover emoji as well.
01:04:33.060 | Yeah, I have a question.
01:04:34.800 | As new emails and other documents come in,
01:04:36.780 | do you have to retrain the model, or do you do anything?
01:04:41.760 | Oh, so if you have new data coming in,
01:04:43.920 | so should I retrain the model?
01:04:45.720 | I think towards the end, we lower the learning rate.
01:04:51.680 | So if you add new data, it will not make a lot of good updates.
01:04:56.480 | So usually, you can add new data,
01:04:59.360 | increase the learning rate, and then continue to train.
01:05:02.300 | Yeah, that should work.
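A minimal sketch of that recipe under assumed placeholder functions (`model` and `sgd_update` are hypothetical): keep the trained parameters, raise the learning rate back up since it was decayed near the end of the original run, and make a few passes over the new examples.

```python
def continue_training(model, new_examples, sgd_update, lr=0.01, epochs=2):
    """Resume training an already-trained model on newly arrived data.

    model and sgd_update(model, example, lr) are hypothetical placeholders.
    The trained parameters are kept, the learning rate is raised back up,
    and we make a few passes over the new examples only.
    """
    for _ in range(epochs):
        for example in new_examples:
            sgd_update(model, example, lr)
    return model
```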
01:05:04.300 | OK, so I already took two questions.
01:05:05.860 | Let's keep going.
01:05:06.660 | So there's an active area that actually
01:05:11.300 | is very exciting, which is in the area of automatic Q&A.
01:05:16.180 | So you can think that maybe the setup would be,
01:05:20.180 | can you read a Wikipedia page and then answer a question?
01:05:23.980 | Or can you read a book and answer a question?
01:05:26.860 | Now, in theory, you can use sequence-to-sequence
01:05:29.700 | with attention to do this task.
01:05:33.780 | So it's going to look like this.
01:05:35.260 | You're going to read the book, one token at a time,
01:05:38.740 | until the end of the book.
01:05:40.140 | Then read the question.
01:05:42.060 | And then you're going to use the attention
01:05:44.300 | to look at all the pages.
01:05:47.300 | And then you make a prediction of the tokens.
01:05:49.540 | So that's kind of how
01:05:55.180 | we sometimes do answer questions:
01:05:56.980 | sometimes we don't have knowledge of the fact,
01:06:00.100 | so we actually read the book again to find the answer.
01:06:03.460 | But a lot of the time, if you ask me,
01:06:06.820 | is Barack Obama the President of the United States?
01:06:09.860 | I would say, yes, because it's already in my memory.
01:06:12.620 | So maybe it's better to actually augment the RNN
01:06:17.780 | with some kind of memory, so that it will not
01:06:21.620 | have to do this look-back again.
01:06:23.780 | It's kind of annoying to look back again.
01:06:25.860 | So there's an active area of this research.
01:06:27.820 | I'm not an expert on this, but I'm very aware of it,
01:06:32.820 | so I can place you in the right context here.
01:06:35.180 | So work in this area would be memory networks
01:06:38.260 | by Weston and folks at Facebook.
01:06:40.980 | There are neural Turing machines at DeepMind.
01:06:43.780 | Dynamic memory networks, which Richard Socher presented
01:06:47.420 | yesterday.
01:06:51.140 | And then stack-augmented RNNs by Facebook again, and et cetera.
01:06:55.460 | So I want to show you, at a high level, what
01:07:01.460 | this augmented memory means.
01:07:04.340 | Let's think about the attention.
01:07:08.580 | So the attention looks like this.
01:07:10.500 | In the encoder, you're going to look at some input.
01:07:12.820 | And then you have a controller, which is your h variable.
01:07:16.500 | And then you keep updating your h variable.
01:07:19.900 | But alongside, you're going to write down
01:07:21.820 | into memory your h1, h2, h3, et cetera.
01:07:25.060 | You store it into a memory.
01:07:27.260 | Clear, right?
01:07:29.060 | And in the decoder, what you're going to do
01:07:31.740 | is you're going to continue producing some output.
01:07:36.220 | You're going to update your controller g,
01:07:39.700 | but you're going to read from memory your h again.
01:07:46.740 | So again, in the input, you write to memory.
01:07:49.380 | And then in the output, you read from memory.
01:07:53.180 | Now let's try to be a little bit more general.
01:07:58.060 | And the general would be at any point in time,
01:08:01.020 | you can read and write.
01:08:02.140 | You have a controller, and you can read and write all the time.
01:08:07.060 | Now to do that, you have the following architecture.
01:08:10.340 | So you have some memory bank, a big memory bank.
01:08:14.260 | And then you can use the write operation.
01:08:18.820 | You can decide to write some information into it
01:08:23.220 | by a combination of the memory bank in the previous step
01:08:27.620 | and the hidden variable in the previous step.
01:08:30.500 | And then you also read from it into the hidden state, too.
01:08:34.540 | And then you could make an update,
01:08:35.900 | and then you can keep going forever like that.
01:08:38.500 | So this concept is called RNN with augmented memory.
01:08:41.780 | Okay?
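A rough numpy sketch of one such controller step, with made-up parameter names and a very simple soft read/write scheme; real models such as memory networks or neural Turing machines use more elaborate addressing.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_rnn_step(x, h, memory, W):
    """One step of an RNN controller with a soft read and a soft write
    over a memory bank of shape (num_slots, mem_dim)."""
    # Soft read: attend over memory rows using the previous hidden state.
    read_weights = softmax(memory @ (W["read"] @ h))       # (num_slots,)
    read_vector = read_weights @ memory                     # (mem_dim,)
    # Controller update uses the input, previous state, and what was read.
    h_new = np.tanh(W["xh"] @ x + W["hh"] @ h + W["rh"] @ read_vector)
    # Soft write: blend new content into every slot, weighted per slot.
    write_weights = softmax(memory @ (W["write"] @ h_new))  # (num_slots,)
    write_content = W["hm"] @ h_new                         # (mem_dim,)
    memory = memory + np.outer(write_weights, write_content)
    return h_new, memory

# Toy shapes: input 4, hidden 6, 3 memory slots of width 5.
rng = np.random.default_rng(0)
W = {"xh": rng.normal(size=(6, 4)), "hh": rng.normal(size=(6, 6)),
     "rh": rng.normal(size=(6, 5)), "hm": rng.normal(size=(5, 6)),
     "read": rng.normal(size=(5, 6)), "write": rng.normal(size=(5, 6))}
h, memory = np.zeros(6), np.zeros((3, 5))
h, memory = memory_rnn_step(rng.normal(size=4), h, memory, W)
```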
01:08:44.780 | Is that somewhat clear?
01:08:47.420 | Any question?
01:08:51.540 | You have a question.
01:08:53.660 | The question is, when you read,
01:09:00.580 | do you read the entire memory bank?
01:09:02.900 | A lot of these algorithms are actually soft attention.
01:09:06.820 | So yes, it will look at the entire memory.
01:09:11.260 | You can actually predict where to look, right?
01:09:13.660 | And then read only that block.
01:09:16.020 | Now the problem with that is
01:09:20.380 | it's not differentiable anymore, right?
01:09:23.020 | Because the things that you don't read
01:09:25.380 | don't contribute to the gradient.
01:09:27.700 | So it's gonna be hard to train,
01:09:28.980 | but you can use REINFORCE and so on to train it.
01:09:31.500 | So there's a recent paper,
01:09:34.140 | Reinforcement Learning Neural Turing Machines,
01:09:36.620 | that actually does something like this, right?
01:09:39.820 | Not exactly, but it will deal with discrete actions.
01:09:44.180 | Okay?
01:09:45.020 | Any question?
01:09:47.580 | No question.
01:09:50.740 | Okay, so another extension that a lot of people talk about
01:09:57.020 | is using RNN with augmented operations.
01:10:01.180 | So you wanna augment the neural network
01:10:03.980 | with some kind of operations,
01:10:06.460 | like addition, subtraction, multiplication,
01:10:10.260 | the sine function, et cetera.
01:10:11.540 | A lot of functions.
01:10:13.260 | So to motivate you, you can think about,
01:10:15.940 | Q and A can fall into this.
01:10:17.700 | So for example, here's a context.
01:10:19.420 | The building was constructed in the year 2000.
01:10:23.460 | And then later on, people say,
01:10:27.220 | oh, it was then destroyed in the year 2010.
01:10:30.900 | And then the question would be,
01:10:32.580 | how long did the building survive?
01:10:35.140 | And the answer would be 10 years.
01:10:37.380 | Now, how would you answer this question,
01:10:40.100 | where you say, 2010 subtract 2000 is 10 years?
01:10:44.140 | Now, neural nets, if you can train with a lot of examples,
01:10:47.900 | it can do that too.
01:10:48.860 | It can learn to subtract numbers and things like that,
01:10:51.460 | but it requires a lot of data to do so.
01:10:54.120 | All right, so maybe it's better to augment them
01:10:58.260 | with functions, like addition and subtraction.
01:11:01.160 | So the way you can do it is that
01:11:05.080 | the neural network will read all the tokens so far,
01:11:08.380 | and it will push the numbers into a stack.
01:11:11.380 | And then the neural net is augmented
01:11:15.800 | with a subtraction and an addition function.
01:11:20.460 | And then you assign this probability
01:11:23.700 | for these two functions.
01:11:25.500 | So for the green, the darker it is,
01:11:29.420 | the higher the probability, okay?
01:11:31.340 | So you assign these two probabilities,
01:11:33.520 | and you compute the weighted average
01:11:36.380 | of the values coming out of these two functions.
01:11:38.660 | And then you take that, and then you pop it,
01:11:41.260 | and you push it into the stack in the next step.
01:11:43.820 | And then in the next step,
01:11:44.860 | you will call the addition and subtraction again,
01:11:48.140 | and et cetera.
01:11:49.160 | That's the principle of something called neural programmers
01:11:51.540 | or neural programmer interpreters.
01:11:54.860 | So there are two papers last year
01:11:56.340 | from Google Brain and DeepMind talking about this.
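A toy sketch of that soft selection over operations (illustrative, not the exact published model): pop two numbers off the stack, apply both addition and subtraction, and push back the probability-weighted average of the results.

```python
import numpy as np

def soft_op_step(stack, op_logits):
    """Pop two numbers, apply every operation, and push back the
    probability-weighted average of the results (soft selection)."""
    a, b = stack.pop(), stack.pop()
    results = np.array([a + b, a - b])        # addition, subtraction
    probs = np.exp(op_logits - np.max(op_logits))
    probs /= probs.sum()                      # softmax over the operations
    stack.append(float(probs @ results))      # soft-selected result
    return stack

# Toy run for "how long did the building survive?":
stack = [2000.0, 2010.0]
soft_op_step(stack, op_logits=np.array([-6.0, 6.0]))  # mostly subtraction
# stack is now approximately [10.0]
```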
01:11:59.860 | So that's some of the related work
01:12:03.040 | in the area of augmenting recurrent networks
01:12:06.640 | with operations, with memory, et cetera.
01:12:09.960 | Now, what's the big picture?
01:12:11.160 | Okay, so the big picture, I wanna revisit,
01:12:13.200 | and I say, so what I've talked about today
01:12:18.200 | is sequence-to-sequence learning.
01:12:20.960 | And it's an end-to-end deep learning task.
01:12:23.920 | So it's one of the big trends happening in natural language.
01:12:29.620 | It's very general.
01:12:32.540 | It's a supervised learning algorithm,
01:12:34.860 | so you can use it if you have a lot and a lot of supervised data.
01:12:37.420 | So if you have a lot of data, it should work great.
01:12:40.720 | But if you don't have enough supervised data,
01:12:43.240 | then you consider dividing your problem
01:12:45.280 | and then training in different components.
01:12:47.940 | Or you can train jointly in a multitask settings.
01:12:51.940 | And people also train it jointly with autoencoder,
01:12:54.900 | namely you read the input sentence
01:12:57.420 | and then predict the same sentence again as the output.
01:12:59.220 | And then you train jointly
01:13:02.400 | with all the tasks, and that works as well.
01:13:04.740 | If you go home and you want to make an impact
01:13:10.620 | at your work tomorrow, then what I've covered so far
01:13:13.960 | can make some impact.
01:13:15.340 | Now, if you want to do some research,
01:13:16.940 | I think things with memory and operation
01:13:20.740 | augmentation are some of the exciting areas.
01:13:25.420 | But it seems like it's still work in progress.
01:13:28.740 | But I expect a lot of advances in this area
01:13:31.620 | in the near future.
01:13:32.540 | So if you want to know more,
01:13:37.740 | you can take a look at Chris Olah's blog.
01:13:40.980 | He talks about attention and augmented recurrent networks.
01:13:44.520 | I also wrote some tutorials, pretty simple.
01:13:46.820 | The sequence-to-sequence with attention for translation
01:13:52.340 | is implemented in TensorFlow.
01:13:54.060 | So you can actually download TensorFlow
01:13:58.580 | and train what I described today.
01:14:02.020 | Now, there's a lot of work going on in this area.
01:14:06.660 | Many of these are not mine.
01:14:09.560 | So as you can see, you can't even read the words.
01:14:12.860 | That shows how many papers have come out in this area.
01:14:16.940 | So I can pause there, and I have five minutes
01:14:22.140 | to answer questions.
01:14:24.520 | (audience member speaking)
01:14:26.720 | I have a question there, yeah.
01:14:28.440 | (audience member speaking)
01:14:31.940 | Yeah.
01:14:33.640 | (audience member speaking)
01:14:37.140 | Yeah.
01:14:44.540 | (audience member speaking)
01:14:49.080 | Yeah.
01:14:49.920 | (audience member speaking)
01:14:53.420 | I see.
01:14:56.600 | Can you speak to the microphone?
01:14:59.560 | Because I can't hear very well.
01:15:00.800 | The microphone, and then I think people
01:15:02.560 | can hear that as well.
01:15:03.760 | - When you're training a Q&A network,
01:15:10.560 | so you're taking the example of training
01:15:12.940 | from a book to answer questions.
01:15:14.640 | - Yeah.
01:15:15.840 | - So if, let's say, the question is,
01:15:18.360 | who was Harry Potter's father?
01:15:19.880 | - Yeah.
01:15:20.720 | - There could be many books that have a character Harry.
01:15:23.160 | - Yeah.
01:15:24.000 | - So there's a context resolution issue,
01:15:25.320 | which is which Harry should I answer the question for.
01:15:27.400 | - Yeah.
01:15:28.240 | - How do you solve the context problem
01:15:30.800 | when you're training this kind of Q&A type network?
01:15:34.000 | - I think that's a great question.
01:15:35.360 | So I think one thing is that you can always personalize.
01:15:40.360 | For example, you know that the guy,
01:15:43.740 | when he talk about, you can have a representation
01:15:47.200 | for the user, and then you know that when he say Harry,
01:15:50.960 | because he actually been reading a lot of books
01:15:52.960 | about Harry Potter, so it's more likely to be Harry Potter.
01:15:57.720 | But I think with the algorithm I said,
01:15:59.280 | I just want to make sure that it's as simple as possible.
01:16:02.040 | So the user has to ask the question,
01:16:06.920 | Harry Potter, rather than Harry.
01:16:10.160 | But I'm saying if you represent user vectors,
01:16:13.760 | and then you inject more additional knowledge
01:16:16.700 | about the users, about the context,
01:16:19.560 | into as additional token in the input of the net,
01:16:24.560 | the net can figure it out by itself.
01:16:27.500 | Yeah, so that's one way to do it, yeah.
01:16:30.780 | Okay, I have a question there, yeah.
01:16:33.340 | - You did some work on doc2vec.
01:16:35.860 | - Yeah.
01:16:37.620 | - Do you have an idea what the state of the art is
01:16:39.020 | in generalizing word2vec to more than one word?
01:16:43.500 | - Oh, I see.
01:16:44.400 | I think skip thoughts are interesting directions here.
01:16:51.580 | So doc2vec is one way, but skip thoughts,
01:16:56.620 | so the idea of skip thoughts was actually--
01:16:59.060 | Ruslan Salakhutdinov was an author on this.
01:17:02.580 | And his idea is basically using sequence to sequence
01:17:05.620 | to predict the next sentence.
01:17:08.020 | So the input would be the current sentence,
01:17:11.740 | the output would be the previous sentence
01:17:15.060 | or the next sentence.
01:17:16.500 | And then you could train a model like that.
01:17:18.180 | The model is called skip thought.
01:17:20.340 | And I've heard a lot of good things about skip thoughts,
01:17:23.260 | where you can take the embedding at the end,
01:17:25.500 | and then you can do document classification
01:17:29.180 | and things like that, and it works very well.
01:17:30.900 | So that's probably one place that you can go.
01:17:33.860 | My colleague at Google is also working
01:17:35.740 | on something called autoencoder.
01:17:37.700 | So instead of predicting the next sentence,
01:17:40.300 | he predicts the current sentence.
01:17:42.260 | So trying to repeat the current sentence,
01:17:45.180 | and that's kind of worked well too.
01:17:47.460 | Yeah, yeah.
01:17:48.740 | - So, what are your thoughts on how to solve
01:17:53.100 | the common sense reasoning problem?
01:17:55.380 | - Oh, common sense, I'm deeply interested in common sense,
01:17:58.740 | but I gotta say, I have no idea.
01:18:01.380 | I think maybe you can do something like,
01:18:05.420 | I think common sense is about a lot of,
01:18:08.020 | first of all, there's a lot of knowledge about the world
01:18:10.980 | that is not captured in text.
01:18:14.580 | Like for example, gravity and things like that.
01:18:17.180 | So maybe you really need to actually combine
01:18:20.660 | a lot of modalities.
01:18:21.820 | That's one way to think about it.
01:18:23.740 | Or the other thing is, can you make sure
01:18:26.180 | that unsupervised learning works?
01:18:28.260 | That's another approach.
01:18:31.180 | But I think this research area,
01:18:34.260 | I think, I'm just making guesses right now.
01:18:39.260 | - Is there a good way to represent all these rules
01:18:42.020 | and using some soft--
01:18:45.420 | - Yes, yes.
01:18:46.260 | So the question is, how do you represent rules?
01:18:50.020 | So if you think about this network,
01:18:52.660 | the neural programmer network,
01:18:54.380 | which is actually augmented with addition
01:18:57.260 | and subtraction, then these are rules.
01:19:02.260 | - Right.
01:19:03.580 | - You can augment it with a table of rules
01:19:06.140 | and then ask the network to actually attend
01:19:09.660 | into that rule table.
01:19:11.540 | People have looked into this direction.
01:19:13.420 | So that's one way to do it.
01:19:15.260 | - Okay, you're saying basically,
01:19:16.340 | it'll go ahead and do some logical reasoning?
01:19:19.100 | - Yeah, yeah.
01:19:20.380 | - Hey, great talk.
01:19:23.940 | - Yeah, thank you.
01:19:25.460 | - Is there a practical rule of thumb
01:19:27.740 | for how many sequence pairs you need
01:19:30.300 | to train such a model successfully?
01:19:32.620 | - I see.
01:19:33.460 | - Are there any tips to reduce how many pairs you need
01:19:37.940 | if you don't have--
01:19:38.780 | - I see, okay.
01:19:39.700 | So usually, the bigger data set, the better,
01:19:44.460 | but the corpus that people train this on translation,
01:19:48.340 | for example, English to German,
01:19:50.540 | it has only about three, five million pairs of sentences
01:19:53.220 | or something like that.
01:19:54.060 | So that's kind of small, three million, right?
01:19:57.100 | And still, people are able to make it
01:19:59.060 | to the state of the art.
01:20:00.700 | So that's pretty encouraging.
01:20:02.580 | Now, if you don't have a lot of data,
01:20:04.580 | then I would say things like pre-train your word vectors
01:20:08.380 | with language models or Word2Vec, right?
01:20:13.380 | That's one area where you have a lot of parameters.
01:20:16.420 | You can pre-train your model
01:20:20.420 | with some kind of language model,
01:20:22.580 | and then you reuse the softmax.
01:20:24.380 | That's another area where you have a lot of parameters.
01:20:26.900 | Or use dropout on the input embeddings,
01:20:29.660 | or drop out some random words in the input sentence.
01:20:32.620 | So those things can improve the regularization
01:20:35.420 | when you don't have a lot of data.
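A minimal sketch of one of those tricks, dropping random words from the input sentence as a regularizer; the drop probability and the `<unk>` token are illustrative choices.

```python
import random

def word_dropout(tokens, drop_prob=0.1, unk="<unk>"):
    """Randomly replace input words with an unknown token as a regularizer."""
    return [unk if random.random() < drop_prob else tok for tok in tokens]

word_dropout(["are", "you", "visiting", "vietnam", "for", "new", "year", "?"])
```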
01:20:37.100 | Okay.
01:20:39.660 | Yeah, thank you.
01:20:40.500 | Okay.
01:20:44.460 | Yeah, thank you.
01:20:45.300 | (audience applauding)
01:20:48.460 | - Thank you, Quoc.
01:20:53.100 | So we'll reconvene at six o'clock
01:20:55.940 | for Yoshua Bengio's closing keynote.