Sequence to Sequence Deep Learning (Quoc Le, Google)

00:00:00.000 | I will divide it in two parts.

00:00:03.160 | So number one, I will work with you and

00:00:05.720 | develop the sequence to sequence learning.

00:00:08.680 | And then the second part, I will play sequence to sequence in a broader context

00:00:14.920 | on a lot of exciting work in this area.

00:00:18.680 | Now, so let's motivate this by an example.

00:00:27.200 | So a week ago, I came back from vacation, and

00:00:30.560 | in my inbox, I have 508 emails, un-replied emails.

00:00:36.800 | And a lot of emails basically just require just yes and no answer.

00:00:42.280 | So let's try to see whether we can build a system that can automatically

00:00:49.120 | reply these emails to say yes and no.

00:00:53.480 | And for example, some of the email would be from my friend, Ann.

00:00:59.960 | She said, hi, in the subject, and she said,

00:01:02.800 | are you visiting Vietnam for the New Year walk?

00:01:05.800 | That would be her content.

00:01:07.720 | And then my probable reply would be yes.

00:01:10.080 | So you can gather a data set like this, and then you have some input content.

00:01:16.520 | So for now, let's ignore the author of the email and the subject.

00:01:23.520 | But let's focus on the content.

00:01:25.400 | So let's suppose that you gather some email, and some input would be something

00:01:28.560 | like, are you visiting Vietnam for the New Year walk?

00:01:32.080 | And the answer would be yes.

00:01:33.000 | And then another email would be, are you hanging out with us tonight?

00:01:40.200 | The answer is no, because I'm quite busy.

00:01:42.800 | >> [LAUGH]

00:01:46.040 | >> So the third email would be,

00:01:47.880 | did you read the coolness paper on ResNet?

00:01:50.760 | The answer is yes, because I like it.

00:01:52.880 | Now let's do a little bit of processing, where basically in

00:01:59.640 | the previous slide, we have year and comma, and

00:02:06.160 | then quark, and then question mark, and so on.

00:02:09.080 | So let's do a little bit of processing, and then put the comma,

00:02:14.600 | a space between year and comma, and then quark and question mark, and so on.

00:02:20.600 | So this step, a lot of people call tokenization and normalization.

00:02:24.360 | So let's do that with our emails.

00:02:26.360 | Now, so and then the second step I would do would be to do feature representation.

00:02:34.520 | So in this step, what I'm gonna do is the following.

00:02:36.440 | I'm gonna construct a 2,000 dimensional vector.

00:02:40.200 | 2,000 represent the size of English vocabulary.

00:02:44.120 | And then I'm gonna go through email.

00:02:45.600 | I'm gonna count how many times a particular word occur in my email.

00:02:50.160 | For example, the word r occur one in my email, so I increase the counter.

00:02:59.080 | And then u occur once, I increase another counter, and s, etc.

00:03:04.760 | And then I will reserve at the end a token to

00:03:08.200 | just count all the words that are just out of vocabulary.

00:03:11.560 | Okay?

00:03:12.800 | And then, now, if you do this process,

00:03:19.240 | you're gonna convert all of your email from input to output pairs,

00:03:23.440 | where the input would be fixed length representation of 20,000 dimensional vector.

00:03:28.880 | And output would be either 0 or 1.

00:03:31.680 | Okay?

00:03:33.280 | Any questions so far?

00:03:34.120 | Okay, good.

00:03:36.460 | Okay, so I will get, so as you said, somebody in the audience said,

00:03:43.200 | the order of the words don't matter, and the answer is yes.

00:03:47.640 | So I'm gonna get back to that issue later.

00:03:50.880 | Now, so that x and y, and now, your job,

00:03:57.600 | my job now is to try to find some w such that w times x can approximate y.

00:04:04.760 | Y is the output, right?

00:04:06.800 | And y here is yes and no.

00:04:09.440 | So because of this problem has two categories,

00:04:13.800 | you can think of it as a logistic regression problem.

00:04:16.960 | Now, if anybody follow the great CS229 class by Andrew,

00:04:22.000 | probably can formulate this very quickly.

00:04:24.520 | But in very short, the algorithm comes as follows.

00:04:30.040 | You kind of try to come up with a vector for every email.

00:04:34.000 | Your w is a two column matrix, okay?

00:04:40.440 | The first column will find the probability for

00:04:43.160 | the whether the email have to be answered as yes.

00:04:47.720 | Second column will be answered as no.

00:04:50.400 | And then you basically take the dot product between the first column.

00:04:55.680 | [INAUDIBLE]

00:04:57.480 | The stochastic gradient descent.

00:04:58.720 | So you run for iteration one to like a million, you run for a long, long time.

00:05:05.160 | You sample a random email x, and there's some reply.

00:05:08.120 | And then if the reply is yes, then you wanna update your w1 and

00:05:14.640 | w2 such that you increase the probability that the answer is yes.

00:05:20.960 | So you increase the first probability.

00:05:22.520 | Now if the correct reply is no, then you're gonna update w1 and

00:05:29.600 | w2 so that you can increase the probability of

00:05:34.880 | the email to be answered as no, so the second probability, okay?

00:05:40.960 | So let's call those p1 and p2.

00:05:44.520 | Now, so because I said to update the increase of probability,

00:05:51.560 | what does that mean?

00:05:52.480 | What that means is that you find the gradient of the partial gradient

00:05:57.400 | of the objective function with respect to some parameter.

00:06:01.600 | So now you have to pick some alpha, which is the learning rate.

00:06:06.120 | And then you say w1 is equal to w1 plus some alpha,

00:06:11.040 | the partial derivative of log p1 with respect to d of w1, okay?

00:06:19.040 | Now, I cheated a little bit here because I used the log function.

00:06:22.840 | It turns out because the log function is a monotonic increasing function.

00:06:27.360 | So increasing p1 is equivalent to increasing the log of p1, okay?

00:06:32.200 | And it usually, with this formulation, stochastic gradient descent works better.

00:06:36.160 | Any questions so far?

00:06:38.360 | And then you can also update w2 if the email to be replied is yes.

00:06:48.960 | And you can have different way to update if the reply is no.

00:06:55.960 | So what's, and then if you have a new email coming in,

00:07:00.960 | then you take x and then you convert into the vector.

00:07:04.760 | Then you compute the first probability, okay, w1 times x divided by exponential,

00:07:11.600 | w1 times x plus exponential w2 times x.

00:07:15.000 | And if that probability is larger than 0.5, then you say yes.

00:07:18.880 | And if that probability is less than 0.5, then you say no.

00:07:24.160 | Okay, so that's how you do prediction with this.

00:07:27.120 | Now, there's a problem with this representation,

00:07:33.120 | is that there's some information loss.

00:07:35.040 | So somebody in the audience just said that the order of the words don't matter.

00:07:39.560 | And that's true.

00:07:40.840 | Now, let's fix this problem by using something called the Recurrent Network.

00:07:47.560 | And I think Richard Solcher already talked about Recurrent Networks and

00:07:53.840 | some part of it yesterday, and Andre as well.

00:07:57.840 | Now, the idea of a Recurrent Network is basically you also have

00:08:02.200 | fixed length representation for your input, but

00:08:05.840 | it actually preserves some sort of ordering information.

00:08:08.880 | And the way that you compute the hidden units are the following.

00:08:15.680 | So the function h of 0 is basically a hyperbolic

00:08:21.840 | tangent of some matrix u times

00:08:27.680 | the word vector for the word r.

00:08:31.080 | Okay, so Richard also talked about word vectors yesterday.

00:08:35.800 | So you can take word vectors coming out of word2vec, or

00:08:39.600 | you can just actually randomly initialize them if you want to.

00:08:43.360 | Okay, so let's suppose that that's h of 0.

00:08:46.680 | Now, h of 1 would be a function of h 0 and

00:08:53.200 | the vector for u, which is a times h of 0 plus u times v of vector u.

00:09:01.520 | And then you can keep going with that.

00:09:04.240 | This is one of my three most complicated slides, so

00:09:11.880 | you should ask questions.

00:09:13.280 | No questions?

00:09:16.240 | So everybody familiar with Recurrent Nets?

00:09:19.840 | Wow.

00:09:20.860 | >> [LAUGH]

00:09:22.900 | >> Okay, so to make prediction with this,

00:09:25.860 | you tack on the label at the last step.

00:09:29.580 | And then you say, try to predict y for me.

00:09:32.380 | Now, how do you do that?

00:09:33.740 | Now, here, basically,

00:09:37.140 | you went the way you did before.

00:09:43.660 | And basically, you make update on the w matrix,

00:09:47.300 | which is the classifier at the top, like what I said earlier.

00:09:50.540 | Now, but you also have to update all the relevant matrices,

00:09:56.940 | which is the matrix u, the matrix a, and some word vectors, right?

00:10:03.780 | So this is basically, you have to compute the partial derivative

00:10:07.660 | of the loss function with respect to those parameters.

00:10:12.780 | Now, that's gonna be very complicated.

00:10:14.820 | And usually, when I do that myself, I get that wrong.

00:10:19.460 | But there's a lot of toolkits out there that you can use.

00:10:22.820 | What is, you can use auto differentiation in

00:10:27.220 | TensorFlow, or you can call Torch, or

00:10:31.700 | you can call Tiano to actually compute the derivatives.

00:10:35.940 | And once you have the derivatives, you can just make the updates.

00:10:38.100 | All right?

00:10:41.100 | Yeah?

00:10:41.900 | >> Do you use the same rule?

00:10:43.740 | >> Yes.

00:10:44.460 | >> What size?

00:10:46.020 | >> So u, the matrix u, I share, so I'm gonna go back to one side.

00:10:52.020 | So the matrix u, I share for all vertical matrices, right?

00:11:01.540 | And the size, you have to determine ahead of time.

00:11:04.420 | For example, the number of column would be the size of the word vectors.

00:11:11.580 | But the number of rows must be like 1,000 if you want, or maybe 255 if you want.

00:11:18.580 | So this is model selection, and it depends on whether you're underfitting or

00:11:22.540 | overfitting to choose a bigger model or a smaller model.

00:11:26.980 | And you'll compute power so that you can train a larger model or smaller model.

00:11:30.980 | >> So what would you consider the number of words in the dictionary?

00:11:37.140 | >> The matrix u?

00:11:37.940 | >> Yeah.

00:11:39.860 | Yeah, so the word vectors, the number of word vectors that you use,

00:11:46.700 | are the size of vocabulary, right?

00:11:52.380 | So you're gonna tend to end up with 20,000 word vectors, right?

00:11:57.500 | But the size of, so that means you have 20,000 rows in matrix u,

00:12:03.940 | but the number of column you can, sorry, the number of column is 20,000, but

00:12:08.060 | the number of row would be, you have to determine that yourself.

00:12:11.500 | Okay?

00:12:13.100 | Any other questions?

00:12:15.580 | Now, okay, so what's the big picture?

00:12:23.500 | So the big picture is I started with bag of words representations.

00:12:27.460 | And then I talk about INN as a new way to represent

00:12:33.860 | variable size input that can capture some sort of ordering information.

00:12:38.620 | Then I talk about auto differentiation so

00:12:41.580 | that you can compute the partial derivatives.

00:12:43.900 | And these, you can find auto diff in TensorFlow or Tiano or Torch.

00:12:50.220 | Now then I talk about stochastic gradient descent as a way to train

00:12:54.860 | the neural networks.

00:12:57.460 | Any questions so far?

00:13:00.700 | Okay, you have a question.

00:13:03.260 | >> How long is the data size if you're not using the data system?

00:13:07.500 | >> So that also depends on how big your training set and

00:13:11.220 | how big is your computer and so on, right?

00:13:15.660 | But usually if you use RNN and if you use a hidden state of 100,

00:13:20.700 | it should take a couple of hours.

00:13:22.620 | Yeah, but it largely depends on size of training data,

00:13:29.260 | because you want to iterate for a lot of, you sample a lot of emails, right?

00:13:33.500 | And you want your algorithm to see as many emails as possible.

00:13:37.740 | Right?

00:13:38.240 | So, okay, so if you use such algorithm to just say yes, no, and yes, no,

00:13:45.820 | then you might end up losing a lot of friends.

00:13:49.220 | Because- >> [LAUGH]

00:13:52.180 | >> Because we don't just say yes, no.

00:13:55.820 | Because when you say, for example, my friend asked me,

00:14:01.460 | are you visiting Vietnam for the New Year walk?

00:14:03.340 | Then maybe the better answer would be, yes, see you soon, right?

00:14:07.180 | That's a better, nicer way to approach this.

00:14:10.660 | And then if my friends ask me, are you hanging out with us tonight?

00:14:16.220 | So instance, just say, no.

00:14:17.980 | I would say, no, I'm too busy.

00:14:19.900 | All right, or did you read the cool [INAUDIBLE] right?

00:14:24.420 | So let's see how we're going to fix this.

00:14:26.980 | So before I'm going to tell you the solution,

00:14:31.220 | I would say this problem basically requires you

00:14:36.220 | to map between variable size input to some variable size output.

00:14:43.220 | And if you can do something like this, then there's a lot of applications.

00:14:49.180 | Because you can do auto reply, which is what we've been working on so far.

00:14:53.260 | But we can also work on user to do translation,

00:14:55.980 | translate between English, French.

00:14:58.860 | You can do image captioning.

00:15:00.220 | So input would be a fixed length vector, a representation coming from ConfNet.

00:15:06.180 | And then output would be the cat sat on the mat.

00:15:10.500 | You can do summarization.

00:15:12.420 | The input will be a document, and output will be some summary of it.

00:15:16.900 | Or you can do speech transcription, where

00:15:19.660 | you can have input would be speech frames, and output would be words.

00:15:25.580 | Or you can do conversation.

00:15:27.060 | So basically, the input would be the conversation so far,

00:15:29.940 | and the output would be my reply.

00:15:32.660 | Or you can do Q&A, et cetera, et cetera.

00:15:35.060 | So we can keep going on.

00:15:36.100 | So how do we solve this problem?

00:15:41.660 | So this is hard.

00:15:43.380 | So let's check out what Andrej Karpathy has

00:15:45.820 | to say about recurrent networks.

00:15:49.460 | So Andrej say that there's more than one way

00:15:51.820 | that you can configure neural networks to do things.

00:15:54.420 | So you can use neural network to map--

00:15:58.100 | recurrent networks to map one to one.

00:16:01.620 | So at the bottom, that's the input.

00:16:05.460 | The green would be the hidden state, and the output

00:16:08.900 | would be what you want to predict.

00:16:13.140 | Now, one to one is not what we want, because we have many to many.

00:16:17.500 | So it's probably more like the last two to the right.

00:16:23.340 | But we arrived at the solution that I said in the red box.

00:16:28.820 | And the reason why that's a better solution

00:16:32.060 | is because the size of the input and the size of output

00:16:36.100 | can vary a lot.

00:16:37.940 | Sometimes you have smaller input, but larger output.

00:16:41.900 | But sometimes you have larger input and smaller output.

00:16:46.220 | So if you do the one in the red circle,

00:16:49.860 | you can be very flexible.

00:16:52.460 | If you do the one to the extreme right,

00:16:54.940 | then maybe the output has to be smaller, or at least

00:16:58.980 | the same with the input.

00:17:03.700 | That's what we don't want.

00:17:05.420 | So let's construct a solution that look like that.

00:17:08.340 | So here's the solution.

00:17:10.340 | So the input would be something like, hi, how are you?

00:17:14.260 | And then let's put a special token.

00:17:17.420 | Let's say the token is end.

00:17:19.660 | And then you're going to predict the first token, which

00:17:22.220 | is am.

00:17:24.460 | And then you predict the second token, fine.

00:17:26.940 | And then you predict the third token, thanks.

00:17:29.860 | And then you keep going on until you predict the word end.

00:17:33.660 | And then you stop.

00:17:36.700 | Now, I want to mention that in the previous set of slides,

00:17:43.220 | I was just talking about yes and no.

00:17:45.340 | And in yes and no, you have only two choices.

00:17:48.860 | Now you have more than two choices.

00:17:50.980 | You have actually 20,000 choices.

00:17:54.300 | And you can actually use the algorithm

00:17:57.060 | that are the logistic regression.

00:17:59.140 | And you can extend it to cover more than two choices.

00:18:03.820 | You can have a lot of choices.

00:18:06.860 | And then the algorithm will just follow the same way.

00:18:09.460 | So this is my first solution when I worked with 626.

00:18:15.980 | But it turns out it didn't work very well.

00:18:17.860 | And the reason why it didn't work very well

00:18:19.660 | is because the model never know what it actually

00:18:21.980 | predicted in the last step.

00:18:25.340 | So it keep going.

00:18:26.740 | And it will keep synthesizing output.

00:18:28.500 | But it didn't know what it said.

00:18:30.540 | It didn't know what decision it committed

00:18:32.740 | in the previous step.

00:18:33.980 | So a better solution would look like this.

00:18:36.860 | A better solution is basically you

00:18:39.220 | feed what the model predicts in the previous step

00:18:42.940 | as input to the next step.

00:18:46.620 | So for example, in this case, I'm going to take am.

00:18:48.940 | I feed it in to the next step so that I'm

00:18:52.020 | going to predict the second word, which is fine,

00:18:57.020 | and et cetera.

00:18:58.420 | So a lot of people call this concept autoregressive.

00:19:02.220 | So you eat your own output and make it as your input.

00:19:08.900 | Any questions so far?

00:19:09.820 | How do you know when to stop?

00:19:15.580 | Oh, whenever it produces end, then you stop.

00:19:19.300 | There's a special token end.

00:19:20.740 | So relevant architecture here would be--

00:19:28.820 | people also call the encoder as the recurrent network

00:19:34.140 | in the input.

00:19:34.980 | And the decoder would be the recurrent network

00:19:37.100 | in the output.

00:19:40.020 | OK, so how do you train this?

00:19:41.460 | So again, so you basically run for a million steps.

00:19:45.220 | You see all of your emails.

00:19:46.780 | And then you say you sample.

00:19:48.780 | And for each iteration, you sample an email x and a reply

00:19:53.140 | y.

00:19:54.620 | Y would be, I'm fine, thanks.

00:19:57.460 | And then you sample a random word yt in y.

00:20:01.380 | And then you update the INN encoder and decoder parameters

00:20:06.100 | so that you can increase the probability that y of t

00:20:10.900 | is correct, given all what you've seen before,

00:20:15.300 | which is your yt minus 1, yt minus 2, et cetera,

00:20:18.900 | and also all the x's.

00:20:22.500 | And then you have to compute the partial derivatives

00:20:24.700 | to make it work.

00:20:25.660 | So the computing part of partial derivative is very difficult.

00:20:29.060 | So again, I recommend you to use something

00:20:31.240 | like autodifferentiation in TensorFlow or Torch or Tiana.

00:20:35.420 | OK, you have a question.

00:20:39.660 | [INAUDIBLE]

00:20:42.500 | Yeah, but the number of parameters

00:20:45.860 | didn't change because you have u, v, and a are fixed.

00:20:50.660 | OK, so the question in the audience

00:20:56.860 | is that if the INN are different for different example,

00:21:03.780 | and the answer is yes.

00:21:05.860 | So the number of steps are different.

00:21:08.260 | [INAUDIBLE]

00:21:10.740 | I have a question there.

00:21:12.220 | [INAUDIBLE]

00:21:15.180 | OK, yeah, I'm going to get to that in the next slide.

00:21:22.180 | Yeah, OK.

00:21:24.180 | [INAUDIBLE]

00:21:28.300 | So the question is, in practice, how long

00:21:31.020 | would I go for the INN?

00:21:32.660 | I would say you usually stop at 400 steps or something

00:21:38.500 | like that, because outside of that,

00:21:40.180 | it's going to be too long to make the update.

00:21:43.860 | And compute, it's very expensive to compute.

00:21:49.140 | But you can go more if you want to.

00:21:51.900 | Yeah.

00:21:53.900 | I have a question.

00:21:55.740 | Yeah.

00:21:56.240 | [INAUDIBLE]

00:21:59.220 | [INAUDIBLE]

00:22:03.460 | Yeah, yeah, so that's a problem.

00:22:05.180 | So I'm going to talk about the prediction next.

00:22:09.100 | So let me go to the prediction, and then you can ask questions.

00:22:12.300 | So how do you do prediction?

00:22:13.580 | So the first algorithm that you can do is called greedy

00:22:17.580 | decoding.

00:22:19.180 | In greedy decoding, for any incoming email x,

00:22:24.220 | I'm going to predict the first word.

00:22:27.900 | And then you find the most likely word,

00:22:30.220 | and then you feed back in.

00:22:32.460 | And then you find the next most likely word,

00:22:35.140 | and then you feed back in, and et cetera.

00:22:38.340 | So you keep going.

00:22:39.380 | You keep going until you see the word end, and then stop.

00:22:42.500 | Or if it exceeds a certain length, you stop.

00:22:46.780 | Now, that's just too greedy.

00:22:49.500 | So let's do a little bit less greedy.

00:22:51.580 | So it turns out that, so given x,

00:22:53.980 | you can predict more than one candidate.

00:22:55.820 | So let's say you can predict k candidates.

00:22:58.220 | Let's say three.

00:23:00.260 | So you take three candidates.

00:23:02.260 | And then for each candidate, you're

00:23:03.760 | going to feed in the next step, and then you arrive at three.

00:23:06.700 | So the next step, you're going to have nine candidates.

00:23:10.540 | And then you're going to end up going that way.

00:23:13.060 | So here's a picture.

00:23:14.740 | So given input x, I'm going to predict the first token.

00:23:19.180 | That would be hi, yes, and please.

00:23:21.140 | And given every first token like this,

00:23:22.940 | I'm going to feed back into the network,

00:23:24.580 | and the network will produce another three, and et cetera.

00:23:27.620 | So you're going to end up with a lot of candidates.

00:23:30.580 | So how do you select the best candidate?

00:23:32.460 | Well, you can traverse each beam,

00:23:35.060 | and then you compute the joint probability at each step.

00:23:38.700 | And then you find the sequence that

00:23:41.060 | have the highest probability to be the sequence of choice.

00:23:47.460 | What is your reply?

00:23:48.260 | Any question?

00:23:52.700 | This is the most complicated slide in my talk.

00:23:55.020 | Oh, yeah.

00:24:01.220 | So there will be no out-of-vocabulary words

00:24:04.700 | in your algorithm, right?

00:24:06.700 | Yes.

00:24:07.220 | So the question is, what do you do

00:24:10.380 | with the out-of-vocabulary words?

00:24:12.180 | Now, it turns out in this algorithm, what you do

00:24:14.180 | is that for any word that is out-of-vocabulary,

00:24:17.340 | you create a token called unknown.

00:24:20.220 | And you map everything to unknown,

00:24:22.460 | or anything that out-of-vocabulary

00:24:24.860 | to be unknown.

00:24:26.500 | So it doesn't seem a very nice thing,

00:24:28.820 | but usually it works well.

00:24:31.300 | There's a bunch of algorithms to address these issues.

00:24:33.820 | For example, they break it into characters and things

00:24:37.260 | like that, and then you could fix this problem.

00:24:40.620 | Yeah.

00:24:42.340 | Yeah.

00:24:42.840 | What is the cost function in training?

00:24:46.580 | The cost function is that--

00:24:48.500 | so I go back one slide.

00:24:51.060 | So the cost function--

00:24:53.260 | one more slide.

00:24:54.580 | So the cost function is that you sample a random word, yt.

00:24:58.780 | Let's suppose that here, this is my input so far.

00:25:06.300 | And I'm sampling yt.

00:25:07.660 | Let's say t is equal to 2, which means the word "fine."

00:25:13.540 | I'm at the word "fine."

00:25:14.980 | I want to increase the probability of the model

00:25:18.020 | to predict word "fine."

00:25:20.420 | So every time, the model will make a lot of predictions.

00:25:24.500 | A lot of them will be incorrect.

00:25:26.900 | So you have a lot of probabilities.

00:25:28.620 | You have probability for the word "er,"

00:25:31.060 | and the probability "er, er," and et cetera,

00:25:33.060 | and then probability for "z, z, z, z, z."

00:25:35.820 | And you have a lot of probabilities.

00:25:38.020 | You want the probability for the word "fine"

00:25:43.820 | to be as high as possible.

00:25:45.500 | You increase that probability.

00:25:48.500 | Does that make sense?

00:25:49.980 | Yeah, but we don't care if there's no i, m, et cetera.

00:25:53.980 | Oh, you condition on i, m.

00:25:56.980 | So when I'm at "fine," my input would be "hi," "how," "are you,"

00:26:04.020 | "and," and "am."

00:26:07.540 | OK?

00:26:08.100 | That's all I see.

00:26:09.420 | And then I need to make a prediction.

00:26:10.960 | I have to make that prediction right.

00:26:13.700 | And if I'm at the word "thanks," my input would be "hi," "how,"

00:26:17.700 | "are you," and "am fine."

00:26:20.620 | And I've got to get my "thanks" probability right.

00:26:25.580 | OK?

00:26:27.700 | Yeah, I have a question here.

00:26:28.980 | So how do you personalize the extreme model

00:26:31.420 | for each time?

00:26:33.180 | Oh, I haven't thought about it yet.

00:26:37.140 | So the question is, how do you personalize?

00:26:39.460 | So well, one way to do it is basically

00:26:41.820 | embed a user as a vector.

00:26:43.740 | So let's suppose that you have a lot of users,

00:26:45.660 | and you embed a user as a vector.

00:26:47.500 | That's one way to do it.

00:26:49.140 | Yeah.

00:26:51.540 | I have a question here.

00:26:53.300 | So basically, if all the sequences were the same length,

00:26:56.260 | the number of paths down the tree is k to the n, right?

00:27:03.340 | Yeah.

00:27:03.840 | So that's a lot.

00:27:05.500 | Yeah.

00:27:06.540 | So the question is, let's suppose

00:27:08.260 | that my beam search is 10.

00:27:11.180 | Then you go from 10, like 100, and then 1,000,

00:27:15.660 | and suddenly it grows very quickly, right?

00:27:17.500 | It goes to-- if your sequence is long,

00:27:21.340 | then you end up with k to the n or something like that.

00:27:23.780 | Well, one way to do it is basically

00:27:25.860 | you do truncated beam search, where

00:27:29.300 | any sequence with very low probability,

00:27:31.140 | you just kick it out.

00:27:32.140 | You don't use it anymore.

00:27:33.220 | So you go-- so you can do this.

00:27:35.100 | You can do 3, 9, and then you're 27,

00:27:40.580 | and then you go back up to 9, right?

00:27:43.500 | And then you keep going.

00:27:45.580 | So that way, you don't end up with a huge beam.

00:27:48.540 | And usually, in practice, using a beam size of 3 or 10

00:27:53.380 | would work just fine.

00:27:54.500 | It works great.

00:27:55.100 | Yeah.

00:27:56.460 | Yeah, I have a question here.

00:27:57.660 | [INAUDIBLE]

00:28:02.700 | OK, so because it's an i and n, we

00:28:05.580 | don't have to pad the input.

00:28:07.300 | Now, to be fast, sometimes we have to pad the input,

00:28:10.860 | because we want to make sure that batch processing works

00:28:15.380 | very well, so we pad.

00:28:17.620 | But we pad with only like 0 tokens.

00:28:22.900 | OK.

00:28:23.400 | Did you change the graph from batch to batch?

00:28:28.020 | Yeah, so let's suppose that you have a sequence of 10,

00:28:30.740 | then you have a graph for 10.

00:28:32.420 | You have a sequence-- a batch of all 20,

00:28:34.700 | you have a graph of 20, et cetera.

00:28:38.700 | Yeah, that will make the GPU very happy.

00:28:44.500 | I have a question there.

00:28:46.100 | If you would use the user embedding

00:28:48.580 | to customize the RAM to be applied,

00:28:51.060 | where you would connect that embedding to this memory?

00:28:55.020 | As an initial state, or--

00:28:58.980 | Oh, so--

00:28:59.500 | [INAUDIBLE]

00:29:01.740 | So are you asking--

00:29:03.540 | so my interpretation of your question

00:29:08.140 | is, how do you insert the world embedding into the model?

00:29:12.340 | Is that correct?

00:29:13.020 | No, the user embedding.

00:29:14.020 | User embedding.

00:29:14.740 | Oh, if you want to personalize the thing,

00:29:17.500 | then at the beginning, you have a vector.

00:29:20.100 | And that's a vector for quark with an ID 1, 2, 3, 4, 5.

00:29:25.300 | And then if it's Peter, then the vector would be 5, 6, 7, 8.

00:29:30.060 | So it would be the initial vector for the encoder,

00:29:33.180 | like the initial state for that.

00:29:34.620 | Yeah, yeah.

00:29:36.060 | That's one way to do it.

00:29:37.860 | Well, there's more than one way.

00:29:39.140 | You can do it at the end, or you can do it at the beginning,

00:29:41.980 | or you can insert it at every prediction steps.

00:29:46.020 | But my proposal is just put it at the beginning.

00:29:48.700 | That's simpler.

00:29:49.740 | OK.

00:29:50.980 | I have a question there.

00:29:52.620 | Yeah, you.

00:29:54.020 | OK, well, I'm thinking that because your prediction is

00:29:56.980 | using the prediction as a nice [INAUDIBLE]

00:30:01.220 | Yeah.

00:30:01.720 | [INAUDIBLE]

00:30:02.220 | That's a very good question.

00:30:09.420 | The question is, what if the model derails?

00:30:12.820 | If you make a prediction, and then that's a bad prediction,

00:30:16.180 | and then your model never sees, and then it keeps derailing,

00:30:18.740 | and it will produce garbage.

00:30:20.540 | Yeah, that's a good question.

00:30:22.020 | So I'm going to get to that.

00:30:26.420 | So well, so this is a slide.

00:30:30.220 | So there's an algorithm called scheduled sampling.

00:30:33.020 | So in scheduled sampling, what you do is you--

00:30:38.020 | instead of feeding the truth during training,

00:30:40.700 | you can feed what sample from the softmax.

00:30:44.220 | So what's generated by the model,

00:30:46.060 | and then feed in as input so that the model understands

00:30:50.100 | that if it produced something bad,

00:30:51.940 | it would suck, actually can recover from it.

00:30:55.540 | So that's one way to address this issue.

00:31:00.620 | Did that make sense?

00:31:04.820 | Yeah.

00:31:05.320 | Any question?

00:31:09.500 | There's a question here.

00:31:10.540 | OK.

00:31:11.040 | [INAUDIBLE]

00:31:11.540 | Yeah.

00:31:17.200 | [INAUDIBLE]

00:31:17.700 | Yeah, yeah.

00:31:19.420 | So in this algorithm, the question

00:31:21.580 | is, how large is the size of the decoder?

00:31:26.580 | Well, my answer is that try to be as large as possible.

00:31:30.420 | But it's going to be very slow.

00:31:31.760 | And in this algorithm, what happens

00:31:33.340 | is that you use the same--

00:31:38.220 | you use fixed-length embedding to represent very, very

00:31:43.780 | much the long-term dependency, like a huge input, right?

00:31:48.000 | And that's going to be a problem.

00:31:49.380 | So I'm going to come back to that issue with the attention

00:31:53.100 | model in a second.

00:31:54.940 | OK?

00:31:56.460 | Any question?

00:31:57.140 | OK, here's a question.

00:31:58.140 | So if you're scoring based on a single word,

00:32:01.620 | doesn't that make you learn away from synonyms?

00:32:03.580 | Because you're unlikely to use them

00:32:05.060 | in the same sentence as the answer.

00:32:09.100 | So does the model learn synonyms?

00:32:11.900 | Is that the question?

00:32:12.700 | Or what's the question?

00:32:13.700 | Like, aren't you biasing it against synonyms?

00:32:15.660 | So it's like your answer is fine,

00:32:17.140 | and you're scoring based on that answer.

00:32:20.100 | Does it become an unlikely answer?

00:32:22.820 | Oh, I see.

00:32:24.820 | Well, yeah.

00:32:27.020 | It turns out that if you learn, it turns out that it map good.

00:32:30.060 | And if you visualize embedding, the good and fine and so on

00:32:34.740 | are mapped very closely to the embedding space.

00:32:39.100 | But in the output, we don't know what else to do.

00:32:44.460 | The other approach is basically to train the word

00:32:47.500 | embeddings using Word2Vec and then

00:32:49.500 | try to ask the model to regress to the word embeddings.

00:32:53.940 | So that's one way to address the issue.

00:32:55.580 | We tried something like that.

00:32:56.780 | It did not work very well.

00:32:58.260 | So whatever we had in here was pretty good.

00:33:01.980 | OK.

00:33:02.900 | I have to keep going.

00:33:03.740 | But anyway, the algorithm that you've seen so far

00:33:08.140 | turns out actually answers some emails.

00:33:10.860 | So if you use the smart reply feature in Inbox,

00:33:15.700 | it's already used this system in production.

00:33:20.020 | Now, for example, in this email, my colleague Greg Corrado

00:33:23.380 | got an email from his friend saying that, hey,

00:33:27.500 | we wanted to invite you to join us for an early Thanksgiving

00:33:31.540 | on November 22nd, beginning around 2 PM.

00:33:34.780 | Please bring your favorite dish and reserve by next week.

00:33:38.420 | And then it would propose three answers.

00:33:40.700 | For example, the first answer would be, tell us in.

00:33:44.460 | Second answer would be, we'll be there.

00:33:46.820 | And the third answer is, sorry, we won't be able to make it.

00:33:50.380 | Now, where do these three answers come from?

00:33:52.860 | Those are the beams.

00:33:54.540 | Now, there's an algorithm to actually figure out

00:33:57.340 | the diversity as well of the beams

00:33:58.940 | so that you don't end up with very similar answers.

00:34:02.100 | So there's an algorithm, like a heuristic,

00:34:05.140 | that make these beams a little bit more diverse.

00:34:08.660 | And then they pick the best three to present to you.

00:34:11.580 | OK, any question?

00:34:16.780 | Yeah, I have a question here.

00:34:18.580 | How do you make sure that the beam terminates?

00:34:22.940 | Yeah, there's no guarantees.

00:34:24.300 | So the question is, how do I guarantee that the beam will

00:34:28.340 | terminate and end?

00:34:29.620 | Now, there's no guarantee.

00:34:32.460 | It can go on forever.

00:34:33.900 | Indeed, there are certain cases like that if you don't

00:34:36.180 | train the model very well.

00:34:37.380 | But if you train the model well with very good accuracy,

00:34:41.180 | then the model usually terminates.

00:34:43.540 | I've hardly seen any cases that it doesn't terminate.

00:34:50.020 | But there are certain corner cases

00:34:52.180 | that it will do funny things.

00:34:55.420 | But you can stop the model after 1,000 or 100

00:34:58.700 | or something like that so that you make sure

00:35:00.780 | that the model doesn't go on crazy.

00:35:06.620 | I have a question here.

00:35:07.820 | So there's nothing in the email which

00:35:10.300 | says they're inviting multiple people,

00:35:12.740 | but the reply seems all good with us meeting.

00:35:17.820 | That's very interesting.

00:35:18.820 | Yeah, it just comes out because there's a lot of emails.

00:35:21.500 | And if you invite someone, there's more than one person.

00:35:24.100 | And it learns about Thanksgiving.

00:35:25.980 | It just means inviting the whole family, things like that.

00:35:29.020 | Yeah, it just learns from statistics.

00:35:30.700 | [INAUDIBLE]

00:35:33.700 | Yeah, or maybe there's something like that.

00:35:35.500 | Yeah.

00:35:37.500 | OK.

00:35:38.500 | [INAUDIBLE]

00:35:45.500 | Oh, in this algorithm-- so the question

00:35:48.060 | is, do I do any post-processing to correct

00:35:51.180 | the grammar of the beams?

00:35:52.820 | In this algorithm, we did not have to do it.

00:35:54.700 | [INAUDIBLE]

00:35:55.900 | Yeah.

00:35:56.400 | Yes?

00:35:58.880 | OK.

00:36:00.000 | I have another question.

00:36:01.800 | So how contextual these multiple are?

00:36:05.280 | Are they very basic to a specific email,

00:36:09.240 | or do they tend to be [INAUDIBLE]??

00:36:12.760 | So OK, so the question is, how contextual?

00:36:15.520 | So I would say we don't have any user embedding in this.

00:36:18.320 | So it's pretty general.

00:36:21.040 | The input would be the previous emails,

00:36:23.600 | and the output would be the prediction, the reply.

00:36:29.480 | That's all we have.

00:36:30.320 | So it sees the context, which is the threat so far.

00:36:35.920 | OK.

00:36:37.000 | Did I answer your question?

00:36:38.120 | OK, yeah, you can catch me up after the talk.

00:36:45.240 | Yeah?

00:36:46.240 | Is it running on the phone or on the server?

00:36:50.360 | It runs on the server.

00:36:51.840 | Yeah, slow.

00:36:54.960 | Question?

00:36:55.960 | I guess there are definitely some emails

00:36:57.960 | which are not suitable for auto-smart reply.

00:37:00.440 | How do you detect which email to reply?

00:37:02.840 | Oh, I see.

00:37:04.120 | So the question is, there are some emails that

00:37:06.240 | are not relevant for a smart reply.

00:37:08.200 | Maybe they're too long, or you should not

00:37:10.720 | reply or something like that.

00:37:12.200 | So in fact, we have two algorithms.

00:37:14.480 | So one algorithm is to say yes or no to reply.

00:37:19.680 | And then after it passes, sees the threshold,

00:37:22.360 | there's an algorithm to run to produce the threshold.

00:37:25.080 | So it's a combined of two algorithms

00:37:26.760 | that I presented earlier.

00:37:31.040 | Yeah.

00:37:32.240 | I have to get going, but you can get back to the question.

00:37:35.440 | So there's a lot of more interesting stuff coming along.

00:37:37.960 | OK, so what's the big picture so far?

00:37:40.440 | So the big picture is that we have an INN encoder that

00:37:45.000 | eats all the input.

00:37:46.080 | And then we have an INN decoder that's

00:37:48.760 | trying to predict one token at a time in the output.

00:37:52.160 | Now, everything else follows the same way.

00:37:55.560 | So you can use stochastic gradient descent

00:37:58.320 | to train the algorithm.

00:38:01.160 | And then you do beam search decoding.

00:38:04.320 | Usually, you do a beam search of three.

00:38:07.160 | And then you should be able to find a good beam

00:38:10.440 | with the highest probability.

00:38:12.360 | Now, someone in the audience brought up the issue

00:38:17.880 | that we use fixed length representation.

00:38:20.600 | So just before you make a prediction, the h of n,

00:38:24.600 | the white thing right before you go to the decoder,

00:38:29.160 | that is a fixed length representation.

00:38:31.160 | And you can think of it as a vector that capture

00:38:35.160 | all everything in the input.

00:38:39.200 | It could be 1,000 words or it could be five words.

00:38:43.040 | And you use a fixed length representation

00:38:44.920 | for a variable length input, which is kind of not so nice.

00:38:49.960 | So we want to fix that issue.

00:38:53.040 | So there's an algorithm coming along.

00:38:55.400 | And it's actually invented at University of Montreal.

00:39:01.280 | You're sure he's here?

00:39:03.640 | So the idea is to use an attention.

00:39:06.840 | So how does an attention work?

00:39:08.240 | So in principle, what you want is something like this.

00:39:11.480 | Every time before you make a prediction--

00:39:13.600 | let's say you predict the word m--

00:39:15.840 | you kind of want to look again at all the hidden states

00:39:19.400 | so far.

00:39:20.640 | You want to look at all what you see in the input so far.

00:39:24.360 | Now, same, when you do find, you also

00:39:28.960 | want to see all the hidden state of the input so far and on.

00:39:33.560 | Now, how do you do that as a program?

00:39:36.680 | So well, you can do this.

00:39:37.960 | So at h of m, you predict a vector c.

00:39:41.680 | Let's say that vector is the same dimension with all the h.

00:39:47.520 | So if your h of 1 is a dimension of 100,

00:39:51.080 | then c also has a dimension of 100.

00:39:54.560 | And then you take c, and then you do dot product

00:39:57.280 | with all the h.

00:39:59.360 | And then you have coefficients a, a0, a1, blah, blah, blah,

00:40:05.080 | to a to the n.

00:40:07.880 | And those are scalars.

00:40:11.200 | And then after you have those scalars,

00:40:12.800 | you compute something called the beta, which is basically

00:40:16.440 | a softmax of all the alpha.

00:40:20.280 | So to compute that, you take the--

00:40:22.480 | bi is an exponential of ai divided

00:40:26.360 | by the sum of exponentials.

00:40:30.800 | And then you take those bi and then multiply by h of i.

00:40:36.680 | And then you take the weighted average.

00:40:39.720 | And then you take the sum.

00:40:40.960 | And then you send it to add additional signal

00:40:44.760 | to predict the word m.

00:40:46.480 | And then you keep going with that.

00:40:48.720 | So in the next step, you also predict another c.

00:40:51.160 | And then you take that c to compute the dot product.

00:40:53.760 | You compute the a, and then you can compute the b.

00:40:57.000 | You can take the b.

00:40:57.760 | You do the weighted average.

00:40:59.440 | And then you send it to the next step

00:41:01.560 | to extend it to the prediction.

00:41:03.080 | And then you use stochastic gradient descent

00:41:05.160 | to train everything.

00:41:06.680 | OK?

00:41:08.680 | OK?

00:41:10.080 | And this algorithm is implemented in TensorFlow.

00:41:14.640 | OK, so how intuitively-- what is going on here?

00:41:18.600 | So let's suppose that you want to use this for translation.

00:41:22.480 | So in translation, you want to--

00:41:26.200 | for example, the input would be, hi, how are you?

00:41:28.680 | And the output is, hola, como estas, or something like that.

00:41:33.720 | OK?

00:41:34.320 | And then when you predict the first word,

00:41:38.000 | you want hola to correspond to the word hi.

00:41:42.440 | OK?

00:41:42.920 | Because there's a one-to-one mapping

00:41:44.920 | between the word hi and hola.

00:41:47.040 | So if you use the attention model,

00:41:49.560 | the betas that you learn will put a strong weight

00:41:53.720 | for the word hola--

00:41:55.200 | for the word hi.

00:41:56.800 | And then it has a smaller weight for all the stuff.

00:41:59.320 | And then if you keep going, then when you say, como,

00:42:02.440 | then it will focus on how, and et cetera.

00:42:05.440 | OK?

00:42:06.040 | So it moves that coefficient.

00:42:08.600 | It puts a strong emphasis on the relevant word.

00:42:11.640 | And especially for translation, it's

00:42:13.200 | extremely useful because you know the one-to-one mapping

00:42:17.160 | between the input and output.

00:42:20.160 | Any questions so far?

00:42:21.040 | This is definitely very complicated.

00:42:25.240 | Yeah, I have a question.

00:42:26.640 | So how do you deal with languages

00:42:29.480 | where word orders are different?

00:42:31.760 | Oh, right now, the A and B are learned.

00:42:36.280 | So I don't-- and so the question is,

00:42:39.120 | how do I deal with languages where the order don't reverse,

00:42:42.800 | for example, English to Japanese?

00:42:45.240 | So some of the verbs get moved and things like that.

00:42:48.080 | Well, I did not hard code A or B. They are learned.

00:42:54.160 | So by virtue of learning, they will figure out

00:42:57.120 | what beta to put to weight the input.

00:43:03.000 | And those are basically computed by gradient set.

00:43:06.360 | So they just keep on learning.

00:43:07.640 | OK, I have a question here.

00:43:12.840 | [INAUDIBLE]

00:43:13.320 | OK.

00:43:18.000 | Yeah, so the question is, are there

00:43:20.160 | any work on putting attention in the output?

00:43:22.840 | Yeah, I think you can do that.

00:43:26.080 | I'm not too familiar with any work in here,

00:43:27.960 | but I think it's possible to do it.

00:43:30.640 | I think some people have explored something like that.

00:43:32.960 | Yeah.

00:43:35.240 | Any question?

00:43:35.760 | Oh, I have a question.

00:43:39.520 | Another question.

00:43:40.200 | So with the capitalization of your first words in the--

00:43:45.200 | Yeah.

00:43:45.700 | --line, does that imply that you have to have your own [INAUDIBLE]

00:43:49.480 | Yeah.

00:43:49.980 | [INAUDIBLE]

00:43:52.040 | Yeah, yeah, yeah.

00:43:52.800 | So the question is, let's suppose-- because right now,

00:43:55.520 | the word "high" is capitalized at the first character.

00:44:00.320 | Does it mean I'm using 2n or n vocabulary size?

00:44:04.400 | So in practice, we should do some normalization.

00:44:07.240 | If you have a small data set, what you should do

00:44:09.720 | is you normalize the text.

00:44:11.080 | So "high" will be lowercase and et cetera.

00:44:14.960 | Now, if you have a huge data set, it doesn't matter.

00:44:17.440 | We just learn.

00:44:19.360 | OK?

00:44:20.480 | Yeah.

00:44:21.120 | I have a question there.

00:44:22.200 | So in essence, actually, this can

00:44:24.000 | change the positional aspect of the words, right?

00:44:27.720 | Right, yeah.

00:44:28.400 | So the question is, in a sense, it

00:44:30.840 | captures the positional information in the input.

00:44:34.320 | Yeah, I agree.

00:44:35.320 | I have a question there.

00:44:39.080 | What about punctuation?

00:44:41.800 | Pardon?

00:44:42.300 | What about punctuation?

00:44:44.040 | Punctuation.

00:44:45.400 | So the question is, what do I do with punctuation?

00:44:48.160 | Well, right now, I just present the algorithm

00:44:52.320 | as if it's a very simple implementation,

00:44:56.720 | like the very basic.

00:44:57.840 | But one thing that you can do is before you

00:45:03.120 | train the algorithm, you put a space between the word

00:45:07.440 | and the punctuation so that you do some--

00:45:10.560 | that step is called tokenization or normalization

00:45:13.800 | in language processing.

00:45:15.320 | So you can use any Stanford NLP package or something

00:45:19.480 | like that to normalize your text so that it's easy to train.

00:45:23.480 | Now, if you have infinite data, then it will just learn itself.

00:45:26.240 | OK.

00:45:30.840 | So I should get going because there's a lot of-- all

00:45:32.720 | the interesting stuff.

00:45:33.640 | OK.

00:45:34.140 | So it turns out that that's the basic implementation.

00:45:38.080 | But if you want to get good results

00:45:40.280 | and if you have big data sets, so one thing that you can do

00:45:43.360 | is to make the network deep.

00:45:44.920 | And one way to make deep is in the following way.

00:45:48.400 | So you stack your recurrent network on top of each other.

00:45:53.120 | So you-- in the first sequence-to-sequence paper,

00:45:56.080 | we use a network of four, but people are gradually

00:45:59.040 | increasing to six, eight, and so on right now.

00:46:01.960 | And they're getting better and better results.

00:46:04.120 | Like in ImageNet, if you make a network deeper,

00:46:07.440 | you also get better results.

00:46:10.760 | OK.

00:46:11.260 | So if you want to train sequence-to-sequence

00:46:14.440 | with attention, then a couple of years ago,

00:46:19.720 | when we-- like many labs working on this problem

00:46:23.400 | were behind the state of the art.

00:46:25.440 | But right now, in translation, many translation tasks

00:46:31.440 | basically--

00:46:33.760 | this model already achieved state of the art

00:46:36.160 | results in a lot of these WMT data sets.

00:46:40.160 | So to train this model, so number one is that, as I said,

00:46:44.920 | you might end up with a lot of vocabulary--

00:46:48.800 | out of vocabulary issues.

00:46:53.160 | So Barack Obama will be just an unknown, right?

00:46:56.360 | Hillary Clinton is an unknown.

00:46:58.280 | Now, you might use something like word segments, right?

00:47:02.440 | So you segment the words out.

00:47:03.960 | For example, Barack Obama would be "ba" and "rak" and et cetera.

00:47:10.120 | Or you can use all the smart algorithms.

00:47:12.720 | For example, word character split.

00:47:15.000 | You can split words that are unknown to be into characters,

00:47:18.960 | and then you treat them as a character.

00:47:20.560 | There's some work at Stanford, and they

00:47:21.760 | prove that it works very well.

00:47:23.240 | So that's one way to do it.

00:47:25.080 | Now, tip number two is that when you train this algorithm--

00:47:29.880 | because when you do back propagation or forward

00:47:32.640 | propagation, you essentially multiply a matrix

00:47:36.120 | many, many times.

00:47:37.040 | So you have explosion of function value

00:47:42.200 | or the gradient or implosion as well.

00:47:47.360 | Now, one thing that you can do is you clip

00:47:49.120 | the gradient at a certain value.

00:47:51.440 | So you say that if the magnitude of the gradient

00:47:55.600 | is larger than 10, set it to 10.

00:48:00.000 | Then tip number three is to use GRU, or in our work,

00:48:04.280 | we use a long short-term memory.

00:48:07.160 | So I want to revisit this long short-term memory

00:48:09.400 | business a little bit.

00:48:11.040 | So what's long short-term memory?

00:48:12.640 | So if you use an INN cell, basically, you

00:48:16.600 | concatenate your input and the hidden state,

00:48:22.080 | and then you multiply by some theta,

00:48:24.120 | and then you apply with some activation function.

00:48:27.280 | Let's say that's a hyperbolic tangent.

00:48:31.080 | Now, that's the simple function for INN.

00:48:35.400 | Now, in LSTM, you basically multiply the input and H

00:48:40.960 | by a big matrix.

00:48:44.040 | Let's call that theta.

00:48:45.560 | That theta is four times bigger than the theta

00:48:49.480 | I said in the INN cell.

00:48:51.280 | And then you're going to take that Z that's coming out.

00:48:55.880 | You split it into four blocks.

00:48:58.520 | Each block, you can compute the gates.

00:49:01.240 | And then you use the value of something called the cell,

00:49:06.320 | and then you keep adding the newly computed values

00:49:10.120 | to the cell.

00:49:11.880 | So there's a part here that I say the integral of C.

00:49:15.560 | It's that what it does, it basically

00:49:17.480 | keeps a hidden state where it keeps adding information to it.

00:49:21.520 | So it doesn't multiply information,

00:49:23.240 | but it keeps adding information.

00:49:24.880 | You don't need to know a lot of this

00:49:26.420 | if you want to just apply LSTM, because it's already

00:49:29.800 | implemented in TensorFlow.

00:49:31.200 | Any questions so far?

00:49:35.360 | So in terms of applications, you can use this thing

00:49:42.760 | to do summarization.

00:49:44.280 | So I've started seeing work in summarization, pretty exciting.

00:49:48.680 | You can do image captioning.

00:49:51.640 | And the input in that case would just

00:49:53.280 | be the representation of an image coming out from VGG

00:49:59.640 | or coming out from Google Net and et cetera.

00:50:01.960 | And then you send it to the INN.

00:50:03.600 | The INN will do the decoding for you.

00:50:06.560 | Or you can use it for speech recognition or transcription.

00:50:11.680 | Or you can use it for Q&A.

00:50:14.280 | So the next part of the talk, I will

00:50:18.400 | talk a little bit about speech recognition.

00:50:23.240 | So in speech recognition, the input

00:50:26.720 | could be maybe some waveforms.

00:50:28.480 | And then the output could be some words, hi, how's it.

00:50:34.240 | Well, one thing that you can do is

00:50:36.760 | you crop your input into Windows.

00:50:39.480 | That's the green boxes there.

00:50:41.720 | And then you crop a lot of them.

00:50:43.200 | And then you send a lot of them to an INN.

00:50:46.080 | And then you convert it into MFCC before you send to INN.

00:50:49.920 | MFCC or spectrogram or something like that.

00:50:52.680 | And then you use the algorithm that I said earlier

00:50:58.120 | and then with attention.

00:51:00.840 | And then you do the transcription.

00:51:03.400 | You predict one word at a time in the output.

00:51:06.360 | Now, the problem with this algorithm

00:51:09.680 | is that when it comes to speech, you

00:51:12.760 | end up with a lot of input.

00:51:15.000 | You can end up with thousands and thousands steps.

00:51:17.760 | So backpropagating in time, even with attention,

00:51:20.280 | can be difficult.

00:51:22.240 | Now, one thing that you can do is basically

00:51:24.760 | you do some kind of a pyramid to map the input.

00:51:28.840 | So if you do enough layers, you can divide your input

00:51:32.440 | into a factor of 8 or 16 if you do enough layers.

00:51:39.560 | And then you produce the output.

00:51:43.880 | So we're working on an implementation

00:51:46.560 | where the output is actually characters,

00:51:49.800 | like in the Baidu's work where they have the CTC.

00:51:54.200 | Now, I have to say that the strength of this algorithm

00:51:58.840 | is that it actually has an implicit language

00:52:01.440 | model in the output.

00:52:03.840 | So when I say I have the word "how,"

00:52:06.720 | it's actually conditioned on "hi" and "step 4,"

00:52:10.800 | and including the input.

00:52:11.920 | So there's an implicit language model already.

00:52:14.600 | But the problem with this is that actually you

00:52:19.560 | have to wait until the end of the input to do decoding.

00:52:24.000 | So the decoding has to be done offline.

00:52:27.520 | So if you use this for voice search,

00:52:29.880 | it might not be too nice because people want

00:52:33.880 | to see some output right away.

00:52:37.640 | So in that case, there's an algorithm

00:52:39.480 | that can use this and do it in an online fashion,

00:52:42.120 | block by block.

00:52:45.080 | Now, also, I have to mention that in translation,

00:52:51.120 | the sequence with attention works great.

00:52:54.080 | It's among the state of the art.

00:52:56.040 | But when it comes to speech, it doesn't work as well

00:52:59.080 | as the CTC, at least in published results.

00:53:03.800 | We're not as good as CTC, which is what Adam talked earlier,

00:53:08.040 | or some of the HMM/DNN hybrid, which

00:53:13.120 | is the most widely-spoken speech system currently.

00:53:19.640 | So I want to pause there, and then I can take questions.

00:53:22.800 | Any questions?

00:53:23.400 | I have a question at the back.

00:53:24.840 | Yeah?

00:53:25.720 | So on the machine translation, you were mentioning attention.

00:53:30.120 | So according to, say, English, German,

00:53:34.120 | there's one word that basically is having meaning

00:53:39.560 | of using multiple words in English.

00:53:42.560 | How does attention work there?

00:53:44.040 | Because attention will be focusing on [INAUDIBLE]

00:53:47.520 | sense, and you know that one is the sending sense.

00:53:51.000 | But when you're predicting, it's only

00:53:53.000 | going to predict one or two words.

00:53:57.480 | Oh.

00:53:58.760 | So how does it work in translation?

00:54:00.760 | Well, in translation, what we do is basically

00:54:04.080 | we have pairs of sentences.

00:54:06.400 | So for example, hi, how are you?

00:54:08.840 | And then, hola, como estas?

00:54:12.680 | And then we have pairs of sentences like this,

00:54:14.920 | and then we just feed it into the sequence

00:54:18.880 | to sequence of attention.

00:54:20.320 | At every step, again, we're going

00:54:22.080 | to predict one word at a time.

00:54:23.840 | But before we make a prediction, the model has the attention.

00:54:27.280 | So it actually sees the input once more

00:54:31.520 | before it makes a prediction.

00:54:32.760 | That's how it works.

00:54:33.960 | Now, what is-- can you repeat?

00:54:36.440 | What is the issue with the model again, please?

00:54:38.720 | When you look at, say, i equals 1 to the [INAUDIBLE]

00:54:44.040 | Yeah.

00:54:45.040 | When you're predicting, the first one

00:54:48.520 | you send in the sequence, the prediction

00:54:52.040 | is actually just one big word.

00:54:54.520 | So there's no [INAUDIBLE] position in that.

00:54:57.520 | So there's no basically one, two, three coming afterward.

00:55:00.520 | I see.

00:55:01.960 | Well, I can't quite follow the question.

00:55:04.480 | But let's take it offline.

00:55:05.840 | Is that OK?

00:55:06.760 | Yeah, yeah.

00:55:07.320 | And then we can do some paper together.

00:55:10.000 | [LAUGHTER]

00:55:14.000 | I have a question.

00:55:14.680 | Yeah.

00:55:15.680 | So I have a question.

00:55:17.160 | For example, what if you actually, in the inbox,

00:55:20.160 | you compose email in different languages,

00:55:22.640 | like Vietnamese and English?

00:55:24.640 | Yeah.

00:55:25.140 | And you actually separate different email

00:55:27.140 | and you send it to a different model?

00:55:28.640 | Yeah.

00:55:29.140 | Or just a single model?

00:55:30.640 | OK.

00:55:31.400 | So the model-- the inbox thing that I presented,

00:55:34.160 | it was all in English.

00:55:36.200 | But there's no limitation in the model in terms of language.

00:55:39.280 | So let's suppose that in your inbox,

00:55:43.360 | sometimes you write in English, and sometimes you

00:55:45.600 | write in Vietnamese, or sometimes you

00:55:47.440 | write it in Spanish, whatever.

00:55:49.560 | And you personalize by user embedding.

00:55:52.400 | Then I would say that it will just learn your behavior.

00:55:55.040 | And then it will basically predict the word that you want.

00:55:58.680 | But make sure that your output vocabulary is large enough

00:56:03.120 | so that it covers not only the English words,

00:56:05.720 | but also the Spanish word, et cetera,

00:56:08.920 | like Vietnamese and so on.

00:56:10.120 | So your vocabulary is not going to be 20,000.

00:56:12.920 | It's going to be like 100,000, because you have more choices.

00:56:16.840 | And then you have to train your model on those examples.

00:56:20.400 | Yeah.

00:56:23.160 | It's a matter of training data.

00:56:25.160 | That's all.

00:56:25.920 | OK.

00:56:26.400 | I have a question here.

00:56:27.400 | You mentioned that in the voice search,

00:56:29.880 | it cannot do translation online.

00:56:32.840 | Yeah?

00:56:33.340 | Is it possible to change your model a little bit

00:56:36.320 | so that it doesn't have to start predicting at the end

00:56:39.320 | of the voice search?

00:56:40.120 | Yeah, yeah, yeah.

00:56:40.520 | So the question is that in the case of voice search,

00:56:42.520 | right now you have to wait to the end to make a prediction.

00:56:45.040 | Is there any other ways?

00:56:46.000 | Yeah, yeah.

00:56:47.000 | The answer is yes.

00:56:47.840 | You can make a prediction block by block.

00:56:50.080 | So you can actually figure out an algorithm, a simple algorithm

00:56:52.920 | to actually segment the speech, and then make a prediction,

00:56:56.760 | and then take the prediction and feed it

00:56:58.560 | as input at the next block.

00:57:00.000 | So you can keep going like that.

00:57:01.340 | So in theory, you can actually do online decoding.

00:57:06.280 | But I'm saying that you can do online decoding,

00:57:10.880 | but that work is currently work in progress.

00:57:14.400 | How about that?

00:57:15.440 | OK.

00:57:16.760 | I have a question there.

00:57:17.800 | So the question is regarding the Google's order of life.

00:57:21.800 | Yeah.

00:57:22.640 | So if I have to put this on my gallery,

00:57:24.320 | I was wondering how do you guys set up

00:57:25.920 | with a training data set?

00:57:27.000 | Because it's difficult to-- how do you

00:57:29.520 | know for this question, this is the answer you're creating?

00:57:32.400 | Oh, yeah.

00:57:32.900 | So we have some input email and then some output email

00:57:37.020 | where expert written emails reply.

00:57:40.340 | And then you can just train it that way.

00:57:42.540 | So you have a trained data set that you

00:57:44.980 | have created for these purposes?

00:57:46.900 | Yeah.

00:57:49.140 | OK.

00:57:49.900 | I have a couple of questions.

00:57:52.820 | So it seems like with sequence, there's

00:57:55.220 | not much of a constraint on how the output aligns

00:57:58.460 | with the input, while the CTC there

00:58:00.660 | is sort of constrain that.

00:58:02.260 | Yeah.

00:58:02.760 | [INAUDIBLE]

00:58:03.260 | Is there any way to add this constraint to sequence

00:58:06.260 | to sequence that it works better?

00:58:07.740 | Or is it something like speech recognition?

00:58:09.540 | Yeah.

00:58:10.180 | The question is that in speech recognition,

00:58:12.100 | the CTC seems to be a very nice framework,

00:58:14.260 | because it matches--

00:58:15.860 | it's less like a monotonic increasement

00:58:18.140 | in the output and the input.

00:58:20.500 | But CTC makes this independent assumption.

00:58:23.380 | It doesn't have a language model in it.

00:58:25.380 | Maybe the sequence to sequence--

00:58:29.260 | can address this?

00:58:30.180 | Yeah, I think that's a great idea.

00:58:31.780 | Maybe we should write a paper together.

00:58:33.420 | [LAUGHTER]

00:58:35.340 | OK.

00:58:36.780 | I think-- I haven't seen it, but I think

00:58:39.380 | that's a very good idea.

00:58:41.300 | Question?

00:58:42.300 | [INAUDIBLE]

00:58:42.800 | I see.

00:58:52.820 | OK.

00:58:54.340 | Great.

00:58:54.980 | So the question is that, because right now we

00:58:58.660 | predict one step at a time, is there

00:59:00.380 | any way to actually look globally at the output

00:59:03.860 | and maybe use some kind of reinforcement learning

00:59:05.900 | to adjust the output?

00:59:07.380 | And the answer is yes.

00:59:08.420 | So there's a recent paper at Facebook called, I think,

00:59:13.180 | sequence level training or something like that,

00:59:15.700 | where they don't optimize for one step at a time,

00:59:19.540 | but they predict-- they look at the globally,

00:59:21.460 | and then they try to improve work at a rate,

00:59:24.820 | or they try to improve blue score or things

00:59:28.620 | like that for translation.

00:59:30.100 | And it seems to be making some improvement in the metrics

00:59:35.600 | that they care about.

00:59:36.940 | Now, if you show it to humans, though,

00:59:39.940 | people still prefer the output from this model.

00:59:44.660 | So some of the metrics that we use in translation and so on

00:59:47.660 | might not be what the metrics that we optimize.

00:59:51.100 | And the next step prediction seem

00:59:53.100 | to be what people like a lot in translation.

00:59:55.780 | [INAUDIBLE]

00:59:56.280 | Yeah, so the question is, can we add the GAN loss?

01:00:04.140 | Yeah, I think that's a great idea.

01:00:06.100 | Yeah.

01:00:07.700 | I have a question here.

01:00:08.620 | Yeah.

01:00:09.120 | [INAUDIBLE]

01:00:09.620 | Yeah.

01:00:14.120 | [INAUDIBLE]

01:00:14.620 | Change?

01:00:19.720 | [INAUDIBLE]

01:00:20.220 | Once you have the model, is there a way

01:00:23.140 | based on the human input to influence the encoder?

01:00:28.460 | Yeah, yeah.

01:00:29.060 | So let's suppose that you type the first word, "Hola,"

01:00:33.180 | then you can actually start the beam from there.

01:00:35.780 | So the question is, is there any way to incorporate user input?

01:00:39.660 | So I say, yeah.

01:00:41.020 | Let's suppose that you say, "Hola," sorry, "Hi, how are you?"

01:00:46.020 | And then as soon as the person type "Hola,"

01:00:49.660 | that actually restrict your beam.

01:00:51.900 | So you can actually condition your beam on the first word,

01:00:55.220 | "Hola," and your beam will be better.

01:00:57.420 | Yeah, I think that's a good idea.

01:00:59.700 | I have a question.

01:01:00.380 | [INAUDIBLE]

01:01:00.880 | Oh, so how much data did we use?

01:01:10.740 | So in translation, for example, we

01:01:12.860 | use several WMT corpora.

01:01:17.340 | And the WMT corpora usually have tens of millions

01:01:22.260 | of pairs of sentences, something like that.

01:01:24.700 | And every sentence have like 20 words on average,

01:01:30.220 | 20, 30 words on average.

01:01:31.980 | I can't remember, but that's something like that,

01:01:34.020 | order of magnitude.

01:01:35.740 | Yeah, I have a question there.

01:01:37.620 | [INAUDIBLE]

01:01:38.120 | I can't really hear.

01:01:42.860 | [INAUDIBLE]

01:01:45.900 | So how is it compared to Google Search auto-completion?

01:01:51.340 | Honestly, I don't know what's used underneath Google Search

01:01:55.020 | auto-completion.

01:01:56.060 | But I think they should use something like this,

01:01:59.380 | because it's--

01:01:59.980 | [LAUGHTER]

01:02:00.480 | OK, I have still lots of interesting stuff coming along.

01:02:07.820 | So OK, so what's the big picture?

01:02:12.100 | So the big picture is so far, I talk about sequence

01:02:15.980 | to sequence learning.

01:02:17.700 | And yesterday, Andrew was talking

01:02:20.540 | about most of the big trends in deep learning.

01:02:25.580 | And he was talking about the second trend was basically

01:02:28.140 | doing end-to-end deep learning.

01:02:29.860 | So you can characterize sequence to sequence learning

01:02:32.980 | as end-to-end deep learning as well.

01:02:36.460 | So the framework is very general.

01:02:39.060 | So it should work for a lot of NLP-related tasks,

01:02:43.220 | because a lot of them, you would have input sequence and output

01:02:47.060 | sequence in NLP.

01:02:48.420 | It could be input would be some text,

01:02:50.780 | and output would be some passing trees.

01:02:53.420 | That's also possible.

01:02:56.380 | But it works great when you have a lot of data.

01:02:59.740 | Now, when you don't have enough data,

01:03:01.260 | then maybe you want to consider dividing your problems

01:03:04.480 | into smaller components, and then train your sequence

01:03:07.180 | to sequence in the subcomponents,

01:03:08.820 | and then merge them.

01:03:11.260 | Now, if you don't have a lot of data,

01:03:13.740 | but you have a lot of related tasks,

01:03:16.100 | then it's also possible to actually merge all these tasks

01:03:19.980 | by combining the data, and then have an indicator bit to say,

01:03:24.420 | this is translation, this is summarization,

01:03:26.900 | this is email reply, and then change it jointly.

01:03:31.780 | And that should improve your output, too.

01:03:36.620 | Now, this basically concludes the parts

01:03:40.580 | about sequence to sequence.

01:03:41.660 | And then the next part, I'm going

01:03:43.260 | to play sequence to sequence in a bigger

01:03:45.980 | picture of the active ongoing work in neural nets for NLP.

01:03:57.180 | So if you have any questions, you can ask now.

01:03:59.780 | I take maybe two questions, because I

01:04:01.500 | think I'm running out of time.

01:04:03.820 | So I have a question.

01:04:04.700 | Yeah?

01:04:05.200 | I have a question.

01:04:06.160 | So does your model handle emoji in NLP?

01:04:11.940 | So the question is, does the model handle emoji?

01:04:16.820 | I don't know, but emoji is like a piece of text, too, right?

01:04:19.820 | So you can just feed it in as another extra token.

01:04:23.260 | If you make your vocabulary 200,000,

01:04:27.220 | then you should be able to cover emoji as well.

01:04:33.060 | Yeah, I have a question.

01:04:34.800 | As new emails and other documents come in,

01:04:36.780 | do you have to retrain the model, or do you do anything?

01:04:41.760 | Oh, so if you have new data coming in,

01:04:43.920 | so should I retrain the model?

01:04:45.720 | I think towards the end, we lower the learning rate.

01:04:51.680 | So if you add new data, it will not make a lot of good updates.

01:04:56.480 | So usually, you can add new data,

01:04:59.360 | increase the learning rate, and then continue to train.

01:05:02.300 | Yeah, that should work.

01:05:04.300 | OK, so I already took two questions.

01:05:05.860 | Let's keep going.

01:05:06.660 | So there's an active area that actually

01:05:11.300 | is very exciting, which is in the area of automatic Q&A.

01:05:16.180 | So you can think that maybe the setup would be,

01:05:20.180 | can you read a Wikipedia page and then answer a question?

01:05:23.980 | Or can you read a book and answer a question?

01:05:26.860 | Now, in theory, you can use sequence-to-sequence

01:05:29.700 | with attention to do this task.

01:05:33.780 | So it's going to look like this.

01:05:35.260 | You're going to read the book, one token at a time,

01:05:38.740 | and with the book.

01:05:40.140 | Then read the question.

01:05:42.060 | And then you're going to use the attention

01:05:44.300 | to look at all the pages.

01:05:47.300 | And then you make a prediction of the tokens.

01:05:49.540 | So that's kind of--

01:05:55.180 | sometimes we do answer questions,

01:05:56.980 | that way, sometimes we don't have knowledge about the fact.

01:06:00.100 | So we actually read the book again to answer the fact.

01:06:03.460 | But a lot of the time, if you ask me,

01:06:06.820 | is Barack Obama the President of the United States?

01:06:09.860 | I would say, yes, because it's already in my memory.

01:06:12.620 | So maybe it's better to actually augment the RNN

01:06:17.780 | with some kind of memory, so that it will not

01:06:21.620 | do this look back again.

01:06:23.780 | It's kind of annoying, look back again.

01:06:25.860 | So there's an active area of this research.

01:06:27.820 | I'm not a definite expert, but I'm very aware,

01:06:32.820 | so I can place you in the right context here.

01:06:35.180 | So a work in this area would be memory networks

01:06:38.260 | by Western and folks at Facebook.

01:06:40.980 | There will be neural-tutoring machines at DeepMind.

01:06:43.780 | Dynamic memory networks would be Richard Solcher presented

01:06:47.420 | yesterday, and then stack-augmented algorithms.

01:06:51.140 | And then stack-augmented RNNs by Facebook again, and et cetera.

01:06:55.460 | So I want to show you a high level, what

01:07:01.460 | does this augmented memory mean.

01:07:04.340 | Let's think about the attention.

01:07:08.580 | So the attention looks like this.

01:07:10.500 | In the encoder, you're going to look at some input.

01:07:12.820 | And then you have a controller, which is your h variable.

01:07:16.500 | And then you keep updating your h variable.

01:07:19.900 | But along the side, you're going to write down

01:07:21.820 | into memory your h1, h2, h3, et cetera.

01:07:25.060 | You store it into a memory.

01:07:27.260 | Clear, right?

01:07:29.060 | And in the decoder, what you're going to do

01:07:31.740 | is you're going to continue producing some output.

01:07:36.220 | You're going to update your controller g,

01:07:39.700 | but you're going to read from memory your h again.

01:07:46.740 | So again, in the input, you write to memory.

01:07:49.380 | And then in the output, you read from memory.

01:07:53.180 | Now let's try to be a little bit more general.

01:07:58.060 | And the general would be at any point in time,

01:08:01.020 | you can read and write.

01:08:02.140 | You have a controller, and you can read and write all the time.

01:08:07.060 | Now to do that, you have the following architecture.

01:08:10.340 | So you have some memory bank, a big memory bank.

01:08:14.260 | And then you can use the write.

01:08:18.820 | You can decide to write some information into it

01:08:23.220 | by a combination of the memory bank in the previous step

01:08:27.620 | and the hidden variable in the previous step.

01:08:30.500 | And then you also read into the hidden state, too.

01:08:34.540 | And then you could make an update,

01:08:35.900 | and then you can keep going forever like that.

01:08:38.500 | So this concept is called RNN with augmented memory.

01:08:41.780 | Okay?

01:08:44.780 | Is that somewhat clear?

01:08:47.420 | Any question?

01:08:51.540 | You have a question.

01:08:53.660 | The question is, when you read,

01:09:00.580 | do you read the entire memory bank?

01:09:02.900 | A lot of these algorithms are actually soft attention.

01:09:06.820 | So yes, it will look the entire memory.

01:09:11.260 | You can actually predict where to look, right?

01:09:13.660 | And then read only that block.

01:09:16.020 | Now the problem with that is you end up with very,

01:09:20.380 | it's not differentiable anymore, right?

01:09:23.020 | Because the thing that you don't read

01:09:25.380 | don't contribute to the gradient.

01:09:27.700 | So it's gonna be hard to train,

01:09:28.980 | but you can use reinforce and so on to train it.

01:09:31.500 | So there's a recent paper,

01:09:34.140 | Reinforcement Learning Neuroturing Machines,

01:09:36.620 | that actually does something like this, right?

01:09:39.820 | Not exactly, but it will deal with discrete actions.

01:09:44.180 | Okay?

01:09:45.020 | Any question?

01:09:47.580 | No question.

01:09:50.740 | Okay, so another extension that a lot of people talk about

01:09:57.020 | is using RNN with augmented operations.

01:10:01.180 | So you wanna augment the neural network

01:10:03.980 | with some kind of operations,

01:10:06.460 | like addition, subtraction, multiplication,

01:10:10.260 | the sine function, et cetera.

01:10:11.540 | A lot of functions.

01:10:13.260 | So to motivate you, you can think about,

01:10:15.940 | Q and A can fall into this.

01:10:17.700 | So for example, here's a context.

01:10:19.420 | The building was constructed in the year 2000.

01:10:23.460 | And then it was then, later on, people say,

01:10:27.220 | oh, it was then destroyed in the year 2010.

01:10:30.900 | And then the question would be,

01:10:32.580 | how long did the building survive?

01:10:35.140 | And the answer would be 10 years.

01:10:37.380 | Now, how would you answer this question,

01:10:40.100 | where you say, 2010, subtract 2010 years?

01:10:44.140 | Now, neural nets, if you can train with a lot of examples,

01:10:47.900 | it can do that too.

01:10:48.860 | It can learn to subtract numbers and things like that,

01:10:51.460 | but it requires a lot of data to do so.

01:10:54.120 | All right, so maybe it's better to augment them

01:10:58.260 | with functions, like addition and subtraction.

01:11:01.160 | So the way you can do it is that

01:11:05.080 | the neural network will read all the tokens so far,

01:11:08.380 | and it will push the numbers into a stack.

01:11:11.380 | And then you'll get, the neural net is augmented

01:11:15.800 | by a subtraction and a addition function.

01:11:20.460 | And then you assign this probability

01:11:23.700 | for these two functions.

01:11:25.500 | So green, the more dark,

01:11:29.420 | this means the higher probability, okay?

01:11:31.340 | So you assign two probability,

01:11:33.520 | and you compute the weighted average

01:11:36.380 | of the values coming out of these two function.

01:11:38.660 | And then you take that, and then you pop it,

01:11:41.260 | and you push it into the stack in the next step.

01:11:43.820 | And then in the next step,

01:11:44.860 | you will call the addition and subtraction again,

01:11:48.140 | and et cetera.

01:11:49.160 | That's the principle of something called neural programmers

01:11:51.540 | or neural programmer interpreters.

01:11:54.860 | So there are two papers last year

01:11:56.340 | from Google Brain and DeepMind was talking about this.

01:11:59.860 | So that's some of the related work

01:12:03.040 | in the area of augmenting recurrent networks

01:12:06.640 | with operations, with memory, et cetera.

01:12:09.960 | Now, what's the big picture?

01:12:11.160 | Okay, so the big picture, I wanna revisit,

01:12:13.200 | and I say, so what I've talked about today

01:12:18.200 | is sequence-to-sequence learning.

01:12:20.960 | And it's an end-to-end deep learning task.

01:12:23.920 | So it's one of the big trends happening in natural language.

01:12:29.620 | It's very general, so you can use,

01:12:32.540 | if you have a lot and a lot of supervised data,

01:12:34.860 | it's a supervised learning algorithm.

01:12:37.420 | So if you have a lot of data, it should work great.

01:12:40.720 | But if you don't have enough supervised data,

01:12:43.240 | then you consider dividing your problem

01:12:45.280 | and then training in different components.

01:12:47.940 | Or you can train jointly in a multitask settings.

01:12:51.940 | And people also train it jointly with autoencoder,

01:12:54.900 | namely you read the input sentence

01:12:57.420 | and then predict the output sentence again.

01:12:59.220 | And that's also, and then you train jointly

01:13:02.400 | with all the tasks, and that works as well.

01:13:04.740 | If you go home and then you wanna make impact

01:13:10.620 | at your work tomorrow, then so far, that's so far so good.

01:13:13.960 | That can make some impact.

01:13:15.340 | Now, if you wanna do some research,

01:13:16.940 | and I think things with memory, operation,

01:13:20.740 | augmentation are some of the exciting areas.

01:13:25.420 | But it seems like it's still work in progress.

01:13:28.740 | But I expect a lot of advances in this area

01:13:31.620 | in the near future.

01:13:32.540 | So if you want to know more,

01:13:37.740 | you can take a look at Chris Ola Block.

01:13:40.980 | He talk about attention and augmented recorder networks.

01:13:44.520 | I also wrote some tutorials, pretty simple.

01:13:46.820 | The sequence-to-sequence with attention for translation

01:13:52.340 | is implemented in TensorFlow.

01:13:54.060 | So you can actually download TensorFlow

01:13:58.580 | and train it, what I said today.

01:14:02.020 | Now, there's a lot of work going on in this area.

01:14:06.660 | Many of these are not mine.

01:14:09.560 | So as you can see, you can't even read the words.

01:14:12.860 | That means how many papers come along in this area.

01:14:16.940 | So I can pause there, and I have five minutes

01:14:22.140 | to answer questions.

01:14:24.520 | (audience member speaking)

01:14:26.720 | I have a question there, yeah.

01:14:28.440 | (audience member speaking)

01:14:31.940 | Yeah.

01:14:33.640 | (audience member speaking)

01:14:37.140 | Yeah.

01:14:44.540 | (audience member speaking)

01:14:49.080 | Yeah.

01:14:49.920 | (audience member speaking)

01:14:53.420 | I see.

01:14:56.600 | Can you speak to the microphone?

01:14:59.560 | Because I can't hear very well.

01:15:00.800 | The microphone, and then I think people

01:15:02.560 | can hear that as well.

01:15:03.760 | - When you're training a Q&A network,

01:15:10.560 | so you're taking the example of training

01:15:12.940 | from a book to answer questions.

01:15:14.640 | - Yeah.

01:15:15.840 | - So if, let's say, Harry Potter,

01:15:18.360 | who was Harry Potter's father.

01:15:19.880 | - Yeah.

01:15:20.720 | - There could be many books that have a character Harry.

01:15:23.160 | - Yeah.

01:15:24.000 | - So there's a context resolution issue,

01:15:25.320 | which is which Harry should I answer the question for.

01:15:27.400 | - Yeah.

01:15:28.240 | - How do you solve the context problem

01:15:30.800 | when you're training this kind of Q&A type network?

01:15:34.000 | - I think that's a great question.

01:15:35.360 | So I think one thing is that you can always personalize.

01:15:40.360 | For example, you know that the guy,

01:15:43.740 | when he talk about, you can have a representation

01:15:47.200 | for the user, and then you know that when he say Harry,

01:15:50.960 | because he actually been reading a lot of books

01:15:52.960 | about Harry Potter, so it's more likely to be Harry Potter.

01:15:57.720 | But I think with the algorithm I said,

01:15:59.280 | I just want to make sure that it's as simple as possible.

01:16:02.040 | So the user has to ask the question,

01:16:06.920 | Harry Potter, rather than Harry.

01:16:10.160 | But I'm saying if you represent user vectors,

01:16:13.760 | and then you inject more additional knowledge

01:16:16.700 | about the users, about the context,

01:16:19.560 | into as additional token in the input of the net,

01:16:24.560 | the net can figure it out by itself.

01:16:27.500 | Yeah, so that's one way to do it, yeah.

01:16:30.780 | Okay, I have a question there, yeah.

01:16:33.340 | - You did some work on doc2vec.

01:16:35.860 | - Yeah.

01:16:37.620 | - Do you have an idea what the state of the art

01:16:39.020 | in generalizing word2vec is to more than one word?

01:16:43.500 | - Oh, I see.

01:16:44.400 | I think skip thoughts are interesting directions here.

01:16:51.580 | So doc2vec is one way, but skip thoughts,

01:16:56.620 | so the idea of skip thoughts was actually

01:16:59.060 | Ruslan Salakhovsdinov, was author on this.

01:17:02.580 | And his idea is basically using sequence to sequence

01:17:05.620 | to predict the next sentence.

01:17:08.020 | So the input would be the current sentence,

01:17:11.740 | the output would be the previous sentence

01:17:15.060 | or the next sentence.

01:17:16.500 | And then you could train a model like that.

01:17:18.180 | The model is called skip thought.

01:17:20.340 | And I've heard a lot of good things about skip thoughts,

01:17:23.260 | where you can take the embedding at the end,

01:17:25.500 | and then you can do document classification

01:17:29.180 | and things like that, and it works very well.

01:17:30.900 | So that's probably one place that you can go.

01:17:33.860 | My colleague at Google is also working

01:17:35.740 | on something called autoencoder.

01:17:37.700 | So instead of predicting the next sentence,

01:17:40.300 | he predicts the current sentence.

01:17:42.260 | So trying to repeat the current sentence,

01:17:45.180 | and that's kind of worked well too.

01:17:47.460 | Yeah, yeah.

01:17:48.740 | - See, what was your thoughts on how to solve

01:17:53.100 | the common sense reasoning problem?

01:17:55.380 | - Oh, common sense, I'm deeply interested in common sense,

01:17:58.740 | but I gotta say, I have no idea.

01:18:01.380 | I think maybe you can do something like,

01:18:05.420 | I think common sense is about a lot of,

01:18:08.020 | first of all, there's a lot of knowledge about the world

01:18:10.980 | that is not captured in text.

01:18:14.580 | Like for example, gravity and things like that.

01:18:17.180 | So maybe you really need to actually combine

01:18:20.660 | a lot of modality.

01:18:21.820 | That's one way to think about it.

01:18:23.740 | Or the other thing is, do you make sure

01:18:26.180 | that unsupervised learning work?

01:18:28.260 | That's another approach.

01:18:31.180 | But I think this research area,

01:18:34.260 | I think, I'm just making guesses right now.

01:18:39.260 | - Is there a good way to represent all these rules

01:18:42.020 | and using some soft--

01:18:45.420 | - Yes, yes.

01:18:46.260 | So the question is, how do you represent rules?

01:18:50.020 | So if you think about this network,

01:18:52.660 | the neural programmer network,

01:18:54.380 | that it actually augmented by addition

01:18:57.260 | and subtraction, then these are rules.

01:19:02.260 | - Right.

01:19:03.580 | - You can augment it with a table of rules

01:19:06.140 | and then ask the network to actually attend

01:19:09.660 | into that rule table.

01:19:11.540 | People have looked into this direction.

01:19:13.420 | So that's one way to do it.

01:19:15.260 | - Okay, you're saying basically,

01:19:16.340 | I'll go ahead and do some logical reasoning?

01:19:19.100 | - Yeah, yeah.

01:19:20.380 | - Hey, great talk.

01:19:23.940 | - Yeah, thank you.

01:19:25.460 | - Is there a practical rule of thumb

01:19:27.740 | for how many sequence pairs you need

01:19:30.300 | to train such a model successfully?

01:19:32.620 | - I see.

01:19:33.460 | - Are there any tips to reduce how many pairs you need

01:19:37.940 | if you don't have--

01:19:38.780 | - I see, okay.

01:19:39.700 | So usually, the bigger data set, the better,

01:19:44.460 | but the corpus that people train this on translation,

01:19:48.340 | for example, English to German,

01:19:50.540 | it has only about three, five million pairs of sentences

01:19:53.220 | or something like that.

01:19:54.060 | So that's kind of small, three million, right?

01:19:57.100 | And still, people are able to make it

01:19:59.060 | to the state of the art.

01:20:00.700 | So that's pretty encouraging.

01:20:02.580 | Now, if you don't have a lot of data,

01:20:04.580 | then I would say things like pre-train your word vectors

01:20:08.380 | with language models or Word2Vec, right?

01:20:13.380 | That's one area that you have a lot of parameters.

01:20:16.420 | You can pre-train your model

01:20:20.420 | with some kind of language model,

01:20:22.580 | and then you reuse the softmax.

01:20:24.380 | That's another area that you have a lot of parameters.

01:20:26.900 | Or use dropout in the input embedding

01:20:29.660 | or dropout some random word in the input sentence.

01:20:32.620 | So those things can improve the regularization

01:20:35.420 | when you don't have a lot of data.

01:20:37.100 | Okay.

01:20:39.660 | Yeah, thank you.

01:20:40.500 | Okay.

01:20:44.460 | Yeah, thank you.

01:20:45.300 | (audience applauding)

01:20:48.460 | - Thank you, Kuo.

01:20:53.100 | So we'll reconvene at six o'clock

01:20:55.940 | for Yoshua Bengio's closing keynote.

Sequence to Sequence Deep Learning (Quoc Le, Google)

Chapters