
Sequence to Sequence Deep Learning (Quoc Le, Google)


Chapters

0:00
2:10 Preprocessing
2:34 Feature Representation
4:56 Training with stochastic gradient descent
7:28 Information Loss
7:42 Recurrent Neural Network
9:34 Training RNN with stochastic gradient descent
17:10 Better Formulation
19:40 Sequence to Sequence Training with SGD
22:19 Sequence to Sequence Prediction
30:31 Scheduled Sampling
37:41 The big picture so far
41:14 Model Understandability with Attention Mechanism
48:08 LSTMCell vs. RNNCell
49:39 Applications
52:18 Sequence to Sequence With Attention for Speech

Transcript

I will divide it in two parts. So number one, I will work with you and develop sequence to sequence learning. And then in the second part, I will place sequence to sequence in a broader context of a lot of exciting work in this area. Now, so let's motivate this by an example.

So a week ago, I came back from vacation, and in my inbox, I have 508 emails, un-replied emails. And a lot of emails basically just require just yes and no answer. So let's try to see whether we can build a system that can automatically reply these emails to say yes and no.

And for example, some of the emails would be from my friend, Ann. She said, hi, in the subject, and she said, are you visiting Vietnam for the New Year, Quoc? That would be her content. And then my probable reply would be yes. So you can gather a data set like this, and then you have some input content.

So for now, let's ignore the author of the email and the subject. But let's focus on the content. So let's suppose that you gather some emails, and some input would be something like, are you visiting Vietnam for the New Year, Quoc? And the answer would be yes. And then another email would be, are you hanging out with us tonight?

The answer is no, because I'm quite busy. So the third email would be, did you read the cool paper on ResNet? The answer is yes, because I like it. Now let's do a little bit of processing, where basically in the previous slide, we have "Year" and a comma, and then "Quoc" and a question mark, and so on.

So let's do a little bit of processing, and put a space between "Year" and the comma, and between "Quoc" and the question mark, and so on. This step, a lot of people call tokenization and normalization. So let's do that with our emails. And then the second step I would do is feature representation.

So in this step, what I'm gonna do is the following. I'm gonna construct a 20,000 dimensional vector. 20,000 represents the size of the English vocabulary. And then I'm gonna go through the email. I'm gonna count how many times a particular word occurs in my email. For example, the word "are" occurs once in my email, so I increase that counter.

And then "you" occurs once, so I increase another counter, and so on. And then I will reserve a token at the end to just count all the words that are out of vocabulary. Okay? And now, if you do this process, you're gonna convert all of your emails into input-output pairs, where the input would be a fixed length representation, a 20,000 dimensional vector.
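To make the counting concrete, here is a minimal NumPy sketch of this bag-of-words featurization. It is an illustration, not the talk's actual code, and it uses a tiny stand-in vocabulary instead of the 20,000-word one:

```python
# A minimal sketch of the bag-of-words featurization described above.
# The vocabulary here is a tiny stand-in; in the talk it is the 20,000
# most frequent English words, with one extra slot for out-of-vocabulary words.
import numpy as np

vocab = ["are", "you", "visiting", "vietnam", "for", "the", "new", "year", "?"]
word_to_id = {w: i for i, w in enumerate(vocab)}
UNK = len(vocab)                      # last slot counts out-of-vocabulary words

def bag_of_words(tokens):
    x = np.zeros(len(vocab) + 1)      # fixed-length vector: vocabulary size + 1
    for tok in tokens:
        x[word_to_id.get(tok.lower(), UNK)] += 1.0   # count each word occurrence
    return x

email = "are you visiting vietnam for the new year , quoc ?".split()
x = bag_of_words(email)               # "quoc" and "," fall into the UNK bucket
print(x)
```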

And the output would be either 0 or 1. Okay? Any questions so far? Okay, good. Okay, so as somebody in the audience said, the order of the words doesn't matter, and the answer is yes, that's true. I'm gonna get back to that issue later. Now, so that's x and y, and our job now is to try to find some w such that w times x can approximate y.

Y is the output, right? And y here is yes or no. So because this problem has two categories, you can think of it as a logistic regression problem. Now, if anybody has followed the great CS229 class by Andrew, you can probably formulate this very quickly. But in very short, the algorithm goes as follows.

You come up with a vector for every email. Your w is a two column matrix, okay? The first column will give the probability that the email should be answered yes, and the second column that it should be answered no. And then you basically take the dot product between each column and x.

Then you train with stochastic gradient descent. So you run for iterations one to, like, a million; you run for a long, long time. You sample a random email x and its reply. And then if the reply is yes, then you wanna update your w1 and w2 such that you increase the probability that the answer is yes.

So you increase the first probability. Now if the correct reply is no, then you're gonna update w1 and w2 so that you can increase the probability of the email to be answered as no, so the second probability, okay? So let's call those p1 and p2. Now, so because I said to update the increase of probability, what does that mean?

What that means is that you find the partial gradient of the objective function with respect to some parameter. So now you have to pick some alpha, which is the learning rate. And then you say w1 is equal to w1 plus alpha times the partial derivative of log p1 with respect to w1, okay?

Now, I cheated a little bit here because I used the log function. It turns out because the log function is a monotonic increasing function. So increasing p1 is equivalent to increasing the log of p1, okay? And it usually, with this formulation, stochastic gradient descent works better. Any questions so far?

And then you can also update w2 if the reply is yes. And you have a different way to update if the reply is no. And then, if you have a new email coming in, you take x and convert it into the vector.

Then you compute the first probability, okay: exponential of w1 times x, divided by exponential of w1 times x plus exponential of w2 times x. And if that probability is larger than 0.5, then you say yes. And if that probability is less than 0.5, then you say no. Okay, so that's how you do prediction with this.
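As a rough illustration (not the talk's actual code), here is a minimal NumPy sketch of this two-class softmax classifier, with the SGD update on log p1 and the prediction rule described above; the data here is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 10                                   # stand-in for the 20,000-dim bag of words
W = rng.normal(scale=0.01, size=(2, dim))  # two rows: one score for yes, one for no

def probs(x):
    # softmax over the two scores: p1 = exp(w1.x) / (exp(w1.x) + exp(w2.x))
    z = W @ x
    z = z - z.max()                        # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

alpha = 0.1                                # learning rate
for step in range(1000):
    x = rng.random(dim)                    # sample a random "email" vector
    y = int(x[0] > 0.5)                    # toy label: 0 = yes, 1 = no
    p = probs(x)
    # gradient of log p[y] with respect to W is (one_hot(y) - p) outer x
    grad = (np.eye(2)[y] - p)[:, None] * x[None, :]
    W += alpha * grad                      # ascend on the log-probability of the correct class

# prediction: answer "yes" if the first probability is larger than 0.5
x_new = rng.random(dim)
print("yes" if probs(x_new)[0] > 0.5 else "no")
```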

Now, there's a problem with this representation, which is that there's some information loss. So somebody in the audience just said that the order of the words doesn't matter. And that's true. Now, let's fix this problem by using something called the Recurrent Neural Network. And I think Richard Socher already talked about Recurrent Networks and some part of it yesterday, and Andrej as well.

Now, the idea of a Recurrent Network is basically that you also have a fixed length representation for your input, but it actually preserves some sort of ordering information. And the way that you compute the hidden units is the following. The function h of 0 is basically a hyperbolic tangent of some matrix u times the word vector for the word "are."

Okay, so Richard also talked about word vectors yesterday. So you can take word vectors coming out of word2vec, or you can just actually randomly initialize them if you want to. Okay, so let's suppose that that's h of 0. Now, h of 1 would be a function of h of 0 and the vector for "you," which is the hyperbolic tangent of a times h of 0 plus u times the word vector for "you."

And then you can keep going with that. This is one of my three most complicated slides, so you should ask questions. No questions? So everybody is familiar with Recurrent Nets? Wow. Okay, so to make a prediction with this, you tack on the label at the last step. And then you say, try to predict y for me.
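Here is a minimal sketch of that recurrent forward pass and the classifier tacked on at the last step, with a toy vocabulary and randomly initialized matrices; all the names and sizes are illustrative, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["are", "you", "visiting", "vietnam", "<unk>"]
emb_dim, hid_dim = 8, 16

word_vecs = rng.normal(scale=0.1, size=(len(vocab), emb_dim))  # word vectors (random or word2vec)
U = rng.normal(scale=0.1, size=(hid_dim, emb_dim))             # input-to-hidden matrix, shared
A = rng.normal(scale=0.1, size=(hid_dim, hid_dim))             # hidden-to-hidden matrix, shared
W = rng.normal(scale=0.1, size=(2, hid_dim))                   # yes/no classifier at the last step

def rnn_forward(tokens):
    h = np.zeros(hid_dim)
    for i, tok in enumerate(tokens):
        v = word_vecs[vocab.index(tok) if tok in vocab else vocab.index("<unk>")]
        if i == 0:
            h = np.tanh(U @ v)            # h0 = tanh(U * v("are"))
        else:
            h = np.tanh(A @ h + U @ v)    # h_t = tanh(A * h_{t-1} + U * v(word_t))
    z = W @ h                             # logits for yes/no from the final hidden state
    e = np.exp(z - z.max())
    return e / e.sum()

print(rnn_forward("are you visiting vietnam".split()))
```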

Now, how do you do that? Now, here, basically, you went the way you did before. And basically, you make update on the w matrix, which is the classifier at the top, like what I said earlier. Now, but you also have to update all the relevant matrices, which is the matrix u, the matrix a, and some word vectors, right?

So this is basically, you have to compute the partial derivative of the loss function with respect to those parameters. Now, that's gonna be very complicated. And usually, when I do that myself, I get it wrong. But there are a lot of toolkits out there that you can use. You can use auto differentiation in TensorFlow, or you can call Torch, or you can call Theano to actually compute the derivatives.

And once you have the derivatives, you can just make the updates. All right? Yeah? >> Do you use the same u? >> Yes. >> What size? >> So the matrix u, I share, so I'm gonna go back one slide. So the matrix u, I share across all the time steps, right?

And the size, you have to determine ahead of time. For example, the number of column would be the size of the word vectors. But the number of rows must be like 1,000 if you want, or maybe 255 if you want. So this is model selection, and it depends on whether you're underfitting or overfitting to choose a bigger model or a smaller model.

And your compute power, so that you can train a larger model or a smaller model. >> So what would you consider for the number of words in the dictionary? >> The matrix u? >> Yeah. Yeah, so the number of word vectors that you use is the size of the vocabulary, right?

So you're gonna end up with 20,000 word vectors, right? So that means, sorry, the number of columns in matrix u is 20,000, but the number of rows, you have to determine that yourself.

Okay? Any other questions? Now, okay, so what's the big picture? So the big picture is I started with bag of words representations. And then I talked about the RNN as a new way to represent variable size input that can capture some sort of ordering information. Then I talked about auto differentiation so that you can compute the partial derivatives.

And you can find auto diff in TensorFlow or Theano or Torch. Then I talked about stochastic gradient descent as a way to train the neural networks. Any questions so far? Okay, you have a question. >> How long does training take, depending on the data size, if you're not using a distributed system?

>> So that also depends on how big your training set and how big is your computer and so on, right? But usually if you use RNN and if you use a hidden state of 100, it should take a couple of hours. Yeah, but it largely depends on size of training data, because you want to iterate for a lot of, you sample a lot of emails, right?

And you want your algorithm to see as many emails as possible. Right? So, okay, so if you use such an algorithm to just say yes, no, and yes, no, then you might end up losing a lot of friends. Because we don't just say yes, no. Because when, for example, my friend asks me, are you visiting Vietnam for the New Year, Quoc?

Then maybe the better answer would be, yes, see you soon, right? That's a better, nicer way to approach this. And then if my friends ask me, are you hanging out with us tonight? Instead of just saying no, I would say, no, I'm too busy. All right? Or, did you read the cool paper? Right?

So let's see how we're going to fix this. So before I'm going to tell you the solution, I would say this problem basically requires you to map between variable size input to some variable size output. And if you can do something like this, then there's a lot of applications.

Because you can do auto reply, which is what we've been working on so far. But you can also use it to do translation, translating between English and French. You can do image captioning. So the input would be a fixed length vector, a representation coming from a ConvNet. And then the output would be, the cat sat on the mat.

You can do summarization. The input will be a document, and output will be some summary of it. Or you can do speech transcription, where you can have input would be speech frames, and output would be words. Or you can do conversation. So basically, the input would be the conversation so far, and the output would be my reply.

Or you can do Q&A, et cetera, et cetera. So we can keep going on. So how do we solve this problem? So this is hard. So let's check out what Andrej Karpathy has to say about recurrent networks. So Andrej say that there's more than one way that you can configure neural networks to do things.

So you can use neural network to map-- recurrent networks to map one to one. So at the bottom, that's the input. The green would be the hidden state, and the output would be what you want to predict. Now, one to one is not what we want, because we have many to many.

So it's probably more like the last two to the right. But we arrived at the solution that I said in the red box. And the reason why that's a better solution is because the size of the input and the size of output can vary a lot. Sometimes you have smaller input, but larger output.

But sometimes you have larger input and smaller output. So if you do the one in the red circle, you can be very flexible. If you do the one to the extreme right, then maybe the output has to be smaller, or at least the same with the input. That's what we don't want.

So let's construct a solution that look like that. So here's the solution. So the input would be something like, hi, how are you? And then let's put a special token. Let's say the token is end. And then you're going to predict the first token, which is am. And then you predict the second token, fine.

And then you predict the third token, thanks. And then you keep going on until you predict the word end. And then you stop. Now, I want to mention that in the previous set of slides, I was just talking about yes and no. And in yes and no, you have only two choices.

Now you have more than two choices. You have actually 20,000 choices. And you can actually use the logistic regression algorithm and extend it to cover more than two choices. You can have a lot of choices. And then the algorithm will just follow the same way.

So this was my first solution when I worked on sequence to sequence. But it turns out it didn't work very well. And the reason why it didn't work very well is because the model never knows what it actually predicted in the last step. So it keeps going. And it will keep synthesizing output.

But it didn't know what it said. It didn't know what decision it committed in the previous step. So a better solution would look like this. A better solution is basically you feed what the model predicts in the previous step as input to the next step. So for example, in this case, I'm going to take am.

I feed it in to the next step so that I'm going to predict the second word, which is fine, and et cetera. So a lot of people call this concept autoregressive. So you eat your own output and make it as your input. Any questions so far? How do you know when to stop?

Oh, whenever it produces end, then you stop. There's a special token end. So relevant architecture here would be-- people also call the encoder as the recurrent network in the input. And the decoder would be the recurrent network in the output. OK, so how do you train this? So again, so you basically run for a million steps.

You see all of your emails. And for each iteration, you sample an email x and a reply y. Y would be, I'm fine, thanks. And then you sample a random word yt in y. And then you update the RNN encoder and decoder parameters so that you can increase the probability that y of t is correct, given all what you've seen before, which is your yt minus 1, yt minus 2, et cetera, and also all the x's.
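Here is a rough sketch of that objective with a toy encoder/decoder. The talk samples one output position at a time; this sketch just sums the log-probabilities over all positions, which is the same objective, and all names, sizes, and initializations here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<end>", "hi", "how", "are", "you", "am", "fine", "thanks"]
V = len(vocab)
emb, hid = 8, 16
E = rng.normal(scale=0.1, size=(V, emb))       # shared word embeddings
Ue, Ae = rng.normal(scale=0.1, size=(hid, emb)), rng.normal(scale=0.1, size=(hid, hid))
Ud, Ad = rng.normal(scale=0.1, size=(hid, emb)), rng.normal(scale=0.1, size=(hid, hid))
Wo = rng.normal(scale=0.1, size=(V, hid))      # decoder softmax over the whole vocabulary

def step(A, U, h, word_id):
    return np.tanh(A @ h + U @ E[word_id])     # one recurrent step

def log_prob_of_reply(x_ids, y_ids):
    h = np.zeros(hid)
    for w in x_ids:                            # encoder eats the whole input email
        h = step(Ae, Ue, h, w)
    total = 0.0
    prev = vocab.index("<end>")                # decoder starts from the end token
    for w in y_ids:                            # teacher forcing: feed the true previous word
        h = step(Ad, Ud, h, prev)
        z = Wo @ h
        p = np.exp(z - z.max()); p /= p.sum()
        total += np.log(p[w])                  # log p(y_t | y_<t, x); SGD pushes this up
        prev = w
    return total

x = [vocab.index(w) for w in ["hi", "how", "are", "you", "<end>"]]
y = [vocab.index(w) for w in ["am", "fine", "thanks", "<end>"]]
print(log_prob_of_reply(x, y))   # in practice autodiff (TensorFlow/Torch/Theano) gives the gradients
```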

And then you have to compute the partial derivatives to make it work. So computing the partial derivatives is very difficult. So again, I recommend you to use something like autodifferentiation in TensorFlow or Torch or Theano. OK, you have a question. Yeah, but the number of parameters doesn't change, because u, v, and a are fixed.

OK, so the question in the audience is whether the RNN is different for different examples, and the answer is yes. The number of steps is different. I have a question there. OK, yeah, I'm going to get to that in the next slide. Yeah, OK. So the question is, in practice, how long would I unroll the RNN?

I would say you usually stop at 400 steps or something like that, because outside of that, it's going to be too long to make the update. And compute, it's very expensive to compute. But you can go more if you want to. Yeah. I have a question. Yeah. Yeah, yeah, so that's a problem.

So I'm going to talk about the prediction next. So let me go to the prediction, and then you can ask questions. So how do you do prediction? So the first algorithm that you can do is called greedy decoding. In greedy decoding, for any incoming email x, I'm going to predict the first word.

And then you find the most likely word, and then you feed back in. And then you find the next most likely word, and then you feed back in, and et cetera. So you keep going. You keep going until you see the word end, and then stop. Or if it exceeds a certain length, you stop.

Now, that's just too greedy. So let's do a little bit less greedy. So it turns out that, so given x, you can predict more than one candidate. So let's say you can predict k candidates. Let's say three. So you take three candidates. And then for each candidate, you're going to feed in the next step, and then you arrive at three.

So the next step, you're going to have nine candidates. And then you're going to end up going that way. So here's a picture. So given input x, I'm going to predict the first token. That would be hi, yes, and please. And given every first token like this, I'm going to feed back into the network, and the network will produce another three, and et cetera.

So you're going to end up with a lot of candidates. So how do you select the best candidate? Well, you can traverse each beam, and then you compute the joint probability at each step. And then you find the sequence that have the highest probability to be the sequence of choice.
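Again continuing the same toy model, a truncated beam search of width k could be sketched as follows; the width and length limits are arbitrary choices for illustration:

```python
# Truncated beam search on the toy model defined in the training sketch above.
def beam_search(x_ids, k=3, max_len=10):
    h = np.zeros(hid)
    for w in x_ids:                        # encode the input as before
        h = step(Ae, Ue, h, w)
    end = vocab.index("<end>")
    beams = [([], 0.0, h, end)]            # (words, joint log-prob, decoder state, previous word)
    finished = []
    for _ in range(max_len):
        candidates = []
        for words, score, state, prev in beams:
            state2 = step(Ad, Ud, state, prev)
            z = Wo @ state2
            p = np.exp(z - z.max()); p /= p.sum()
            for w in np.argsort(p)[-k:]:   # expand each beam with its k most likely next words
                w = int(w)
                cand = (words + [vocab[w]], score + np.log(p[w]), state2, w)
                (finished if w == end else candidates).append(cand)
        # truncate: keep only the k candidates with the highest joint probability
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if not beams:
            break
    finished.extend(beams)                 # in case nothing produced <end> within max_len
    best = max(finished, key=lambda c: c[1])
    return best[0]

print(beam_search(x))
```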

And that is your reply. Any questions? This is the most complicated slide in my talk. Oh, yeah. So there will be no out-of-vocabulary words in your algorithm, right? Yes. So the question is, what do you do with the out-of-vocabulary words? Now, it turns out in this algorithm, what you do is that for any word that is out-of-vocabulary, you create a token called unknown.

And you map everything that is out-of-vocabulary to unknown. It doesn't seem like a very nice thing, but usually it works well. There's a bunch of algorithms to address these issues. For example, you break words into characters and things like that, and then you can fix this problem.

Yeah. Yeah. What is the cost function in training? The cost function is that-- so I go back one slide. So the cost function-- one more slide. So the cost function is that you sample a random word, yt. Let's suppose that here, this is my input so far. And I'm sampling yt.

Let's say t is equal to 2, which means the word "fine." I'm at the word "fine." I want to increase the probability of the model to predict word "fine." So every time, the model will make a lot of predictions. A lot of them will be incorrect. So you have a lot of probabilities.

You have probability for the word "er," and the probability "er, er," and et cetera, and then probability for "z, z, z, z, z." And you have a lot of probabilities. You want the probability for the word "fine" to be as high as possible. You increase that probability. Does that make sense?

Yeah, but don't we care about "hi," "am," et cetera? Oh, you condition on those. So when I'm at "fine," my input would be "hi," "how," "are," "you," the end token, and "am." OK? That's all I see. And then I need to make a prediction. I have to make that prediction right.

And if I'm at the word "thanks," my input would be "hi," "how," "are," "you," the end token, and "am fine." And I've got to get my "thanks" probability right. OK? Yeah, I have a question here. So how do you personalize the model for each user? Oh, I haven't thought about it yet.

So the question is, how do you personalize? So well, one way to do it is basically embed a user as a vector. So let's suppose that you have a lot of users, and you embed a user as a vector. That's one way to do it. Yeah. I have a question here.

So basically, if all the sequences were the same length, the number of paths down the tree is k to the n, right? Yeah. So that's a lot. Yeah. So the question is, let's suppose that my beam search is 10. Then you go from 10, like 100, and then 1,000, and suddenly it grows very quickly, right?

It goes to-- if your sequence is long, then you end up with k to the n or something like that. Well, one way to do it is basically you do truncated beam search, where any sequence with very low probability, you just kick it out. You don't use it anymore.

So you go-- so you can do this. You can do 3, 9, and then 27, and then you go back down to 9, right? And then you keep going. So that way, you don't end up with a huge beam. And usually, in practice, using a beam size of 3 or 10 works just fine.

It works great. Yeah. Yeah, I have a question here. OK, so because it's an RNN, we don't have to pad the input. Now, to be fast, sometimes we have to pad the input, because we want to make sure that batch processing works very well, so we pad.

But we pad with only like 0 tokens. OK. Did you change the graph from batch to batch? Yeah, so let's suppose that you have a sequence of 10, then you have a graph for 10. You have a sequence-- a batch of all 20, you have a graph of 20, et cetera.

Yeah, that will make the GPU very happy. I have a question there. If you were to use the user embedding to customize the RNN for the reply, where would you connect that embedding to the network? As an initial state, or-- Oh, so-- So are you asking-- so my interpretation of your question is, how do you insert the word embedding into the model?

Is that correct? No, the user embedding. User embedding. Oh, if you want to personalize the thing, then at the beginning, you have a vector. And that's a vector for Quoc, with an ID 1, 2, 3, 4, 5. And then if it's Peter, then the vector would be 5, 6, 7, 8.

So it would be the initial vector for the encoder, like the initial state for that. Yeah, yeah. That's one way to do it. Well, there's more than one way. You can do it at the end, or you can do it at the beginning, or you can insert it at every prediction steps.

But my proposal is just to put it at the beginning. That's simpler. OK. I have a question there. Yeah, you. OK, well, I'm thinking that because you're using the prediction as input to the next step-- Yeah. That's a very good question. The question is, what if the model derails? If you make a prediction, and that's a bad prediction, and your model has never seen that, then it keeps derailing, and it will produce garbage.

Yeah, that's a good question. So I'm going to get to that. So this is the slide. So there's an algorithm called scheduled sampling. In scheduled sampling, what you do is, instead of feeding the truth during training, you can feed what is sampled from the softmax, so what's generated by the model, and feed it in as input, so that the model learns that if it produces something bad, it can actually recover from it.
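A rough sketch of scheduled sampling on the same toy model from before: with some probability the decoder is fed a word sampled from its own softmax instead of the ground truth. The sampling probability here is an arbitrary illustrative choice:

```python
# Scheduled sampling, reusing the toy model (step, rng, V, vocab, hid, Ae, Ue, Ad, Ud, Wo, x, y) above.
def log_prob_scheduled(x_ids, y_ids, sampling_prob=0.25):
    h = np.zeros(hid)
    for w in x_ids:
        h = step(Ae, Ue, h, w)
    total = 0.0
    prev = vocab.index("<end>")
    for w in y_ids:
        h = step(Ad, Ud, h, prev)
        z = Wo @ h
        p = np.exp(z - z.max()); p /= p.sum()
        total += np.log(p[w])                       # the loss still targets the true word
        if rng.random() < sampling_prob:
            prev = int(rng.choice(V, p=p))          # feed a word sampled from the softmax
        else:
            prev = w                                # otherwise feed the ground truth
    return total

print(log_prob_scheduled(x, y))
```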

So that's one way to address this issue. Did that make sense? Yeah. Any question? There's a question here. OK. Yeah. Yeah, yeah. So in this algorithm, the question is, how large is the size of the decoder? Well, my answer is that try to be as large as possible. But it's going to be very slow.

And in this algorithm, what happens is that you use the same-- you use fixed-length embedding to represent very, very much the long-term dependency, like a huge input, right? And that's going to be a problem. So I'm going to come back to that issue with the attention model in a second.

OK? Any question? OK, here's a question. So if you're scoring based on a single word, doesn't that make you learn away from synonyms? Because you're unlikely to use them in the same sentence as the answer. So does the model learn synonyms? Is that the question? Or what's the question?

Like, aren't you biasing it against synonyms? So it's like your answer is fine, and you're scoring based on that answer. Does it become an unlikely answer? Oh, I see. Well, yeah. It turns out that if you learn it and you visualize the embeddings, "good" and "fine" and so on are mapped very closely in the embedding space.

But in the output, we don't know what else to do. The other approach is basically to train the word embeddings using Word2Vec and then try to ask the model to regress to the word embeddings. So that's one way to address the issue. We tried something like that. It did not work very well.

So whatever we had in here was pretty good. OK. I have to keep going. But anyway, the algorithm that you've seen so far turns out actually answers some emails. So if you use the smart reply feature in Inbox, it's already used this system in production. Now, for example, in this email, my colleague Greg Corrado got an email from his friend saying that, hey, we wanted to invite you to join us for an early Thanksgiving on November 22nd, beginning around 2 PM.

Please bring your favorite dish and RSVP by next week. And then it would propose three answers. For example, the first answer would be, count us in. The second answer would be, we'll be there. And the third answer is, sorry, we won't be able to make it. Now, where do these three answers come from?

Those are the beams. Now, there's an algorithm to actually figure out the diversity as well of the beams so that you don't end up with very similar answers. So there's an algorithm, like a heuristic, that make these beams a little bit more diverse. And then they pick the best three to present to you.

OK, any question? Yeah, I have a question here. How do you make sure that the beam terminates? Yeah, there's no guarantees. So the question is, how do I guarantee that the beam will terminate and end? Now, there's no guarantee. It can go on forever. Indeed, there are certain cases like that if you don't train the model very well.

But if you train the model well with very good accuracy, then the model usually terminates. I've hardly seen any cases that it doesn't terminate. But there are certain corner cases that it will do funny things. But you can stop the model after 1,000 or 100 or something like that so that you make sure that the model doesn't go on crazy.

I have a question here. So there's nothing in the email which says they're inviting multiple people, but the reply seems all good with us meeting. That's very interesting. Yeah, it just comes out because there's a lot of emails. And if you invite someone, there's more than one person. And it learns about Thanksgiving.

It just means inviting the whole family, things like that. Yeah, it just learns from statistics. Yeah, or maybe there's something like that. Yeah. OK. Oh, in this algorithm-- so the question is, do I do any post-processing to correct the grammar of the beams? In this algorithm, we did not have to do it.

Yeah. Yes? OK. I have another question. So how contextual these multiple are? Are they very basic to a specific email, or do they tend to be ?? So OK, so the question is, how contextual? So I would say we don't have any user embedding in this. So it's pretty general.

The input would be the previous emails, and the output would be the prediction, the reply. That's all we have. So it sees the context, which is the thread so far. OK. Did I answer your question? OK, yeah, you can catch me after the talk. Yeah? Is it running on the phone or on the server?

It runs on the server. Yeah, slow. Question? I guess there are definitely some emails which are not suitable for auto-smart reply. How do you detect which email to reply? Oh, I see. So the question is, there are some emails that are not relevant for a smart reply. Maybe they're too long, or you should not reply or something like that.

So in fact, we have two algorithms. So one algorithm is to say yes or no, whether to reply at all. And then after it passes the threshold, there's an algorithm that runs to produce the reply. So it's a combination of the two algorithms that I presented earlier. Yeah. I have to get going, but you can get back to the question.

So there's a lot more interesting stuff coming along. OK, so what's the big picture so far? So the big picture is that we have an RNN encoder that eats all the input. And then we have an RNN decoder that's trying to predict one token at a time in the output.

Now, everything else follows the same way. So you can use stochastic gradient descent to train the algorithm. And then you do beam search decoding. Usually, you do a beam search of three. And then you should be able to find a good beam with the highest probability. Now, someone in the audience brought up the issue that we use fixed length representation.

So just before you make a prediction, the h of n, the white thing right before you go to the decoder, that is a fixed length representation. And you can think of it as a vector that capture all everything in the input. It could be 1,000 words or it could be five words.

And you use a fixed length representation for a variable length input, which is kind of not so nice. So we want to fix that issue. So there's an algorithm coming along. And it was actually invented at the University of Montreal; Yoshua is here. So the idea is to use attention.

So how does attention work? So in principle, what you want is something like this. Every time before you make a prediction-- let's say you predict the word "am"-- you kind of want to look again at all the hidden states so far. You want to look at everything you have seen in the input so far.

Now, the same when you predict "fine": you also want to see all the hidden states of the input so far, and so on. Now, how do you do that as a program? Well, you can do this. So at the step where you predict "am," you predict a vector c. Let's say that vector has the same dimension as all the h's.

So if your h of 1 is a dimension of 100, then c also has a dimension of 100. And then you take c, and then you do dot product with all the h. And then you have coefficients a, a0, a1, blah, blah, blah, to a to the n. And those are scalars.

And then after you have those scalars, you compute something called the beta, which is basically a softmax of all the alphas. So to compute that, b of i is the exponential of a of i divided by the sum of the exponentials. And then you take those b of i and multiply by h of i.

And then you take the weighted average; you take the sum. And then you send it in as an additional signal to predict the word "am." And then you keep going with that. So in the next step, you also predict another c. And then you take that c to compute the dot products.

You compute the a, and then you can compute the b. You can take the b. You do the weighted average. And then you send it to the next step to extend it to the prediction. And then you use stochastic gradient descent to train everything. OK? OK? And this algorithm is implemented in TensorFlow.
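Here is a minimal NumPy sketch of that attention computation -- dot products, a softmax, and a weighted average -- with random vectors standing in for the encoder states and for the decoder's query; it is illustrative, not TensorFlow's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
hid = 100
encoder_states = [rng.normal(size=hid) for _ in range(5)]   # h_1..h_n from the encoder

def attention_context(c, states):
    # c is the query vector predicted by the decoder at this step, same dimension as the h's
    a = np.array([c @ h for h in states])        # dot product with every encoder state
    b = np.exp(a - a.max()); b /= b.sum()        # b_i = exp(a_i) / sum_j exp(a_j), a softmax
    context = sum(bi * hi for bi, hi in zip(b, states))      # weighted average of the h's
    return context, b

c = rng.normal(size=hid)                         # stand-in for the decoder's query at one step
context, weights = attention_context(c, encoder_states)
print(weights)       # the weights show which input positions the model is "looking at"
# `context` is then fed in as an extra signal when predicting the next output word.
```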

OK, so how intuitively-- what is going on here? So let's suppose that you want to use this for translation. So in translation, you want to-- for example, the input would be, hi, how are you? And the output is, hola, como estas, or something like that. OK? And then when you predict the first word, you want hola to correspond to the word hi.

OK? Because there's a one-to-one mapping between the word hi and hola. So if you use the attention model, the betas that you learn will put a strong weight for the word hola-- for the word hi. And then it has a smaller weight for all the stuff. And then if you keep going, then when you say, como, then it will focus on how, and et cetera.

OK? So it moves that coefficient. It puts a strong emphasis on the relevant word. And especially for translation, it's extremely useful because you know the one-to-one mapping between the input and output. Any questions so far? This is definitely very complicated. Yeah, I have a question. So how do you deal with languages where word orders are different?

Oh, right now, the A and B are learned. So I don't-- and so the question is, how do I deal with languages where the order don't reverse, for example, English to Japanese? So some of the verbs get moved and things like that. Well, I did not hard code A or B.

They are learned. So by virtue of learning, they will figure out what beta to put to weight the input. And those are basically computed by gradient descent. So they just keep on learning. OK, I have a question here. OK. Yeah, so the question is, is there any work on putting attention in the output?

Yeah, I think you can do that. I'm not too familiar with any work in here, but I think it's possible to do it. I think some people have explored something like that. Yeah. Any question? Oh, I have a question. Another question. So with the capitalization of your first words in the-- Yeah.

--line, does that imply that you have to have your own-- Yeah. Yeah, yeah, yeah. So the question is, let's suppose-- because right now, the word "hi" is capitalized at the first character. Does it mean I'm using a 2n or n vocabulary size? So in practice, we should do some normalization.

If you have a small data set, what you should do is you normalize the text. So "Hi" will be lowercased, et cetera. Now, if you have a huge data set, it doesn't matter. It will just learn. OK? Yeah. I have a question there. So in essence, this captures the positional aspect of the words, right?

Right, yeah. So the question is, in a sense, it captures the positional information in the input. Yeah, I agree. I have a question there. What about punctuation? Pardon? What about punctuation? Punctuation. So the question is, what do I do with punctuation? Well, right now, I just present the algorithm as if it's a very simple implementation, like the very basic.

But one thing that you can do is before you train the algorithm, you put a space between the word and the punctuation so that you do some-- that step is called tokenization or normalization in language processing. So you can use any Stanford NLP package or something like that to normalize your text so that it's easy to train.

Now, if you have infinite data, then it will just learn itself. OK. So I should get going because there's a lot of-- all the interesting stuff. OK. So it turns out that that's the basic implementation. But if you want to get good results and if you have big data sets, so one thing that you can do is to make the network deep.

And one way to make deep is in the following way. So you stack your recurrent network on top of each other. So you-- in the first sequence-to-sequence paper, we use a network of four, but people are gradually increasing to six, eight, and so on right now. And they're getting better and better results.

Like in ImageNet, if you make a network deeper, you also get better results. OK. So if you want to train sequence-to-sequence with attention: a couple of years ago, we, like many labs working on this problem, were behind the state of the art. But right now, on many translation tasks, this model has already achieved state of the art results on a lot of these WMT data sets.

So to train this model, so number one is that, as I said, you might end up with a lot of vocabulary-- out of vocabulary issues. So Barack Obama will be just an unknown, right? Hillary Clinton is an unknown. Now, you might use something like word segments, right? So you segment the words out.

For example, Barack Obama would be "ba" and "rak" and et cetera. Or you can use all the smart algorithms. For example, word character split. You can split words that are unknown to be into characters, and then you treat them as a character. There's some work at Stanford, and they prove that it works very well.

So that's one way to do it. Now, tip number two is that when you train this algorithm-- because when you do back propagation or forward propagation, you essentially multiply a matrix many, many times. So you can have explosion of the function value or the gradient, or vanishing as well. Now, one thing that you can do is you clip the gradient at a certain value.

So you say that if the magnitude of the gradient is larger than 10, set it to 10. Then tip number three is to use GRU, or in our work, we use a long short-term memory. So I want to revisit this long short-term memory business a little bit. So what's long short-term memory?

So if you use an RNN cell, basically, you concatenate your input and the hidden state, and then you multiply by some theta, and then you apply some activation function. Let's say that's a hyperbolic tangent. Now, that's the simple function for the RNN. Now, in the LSTM, you basically multiply the input and h by a big matrix.

Let's call that theta. That theta is four times bigger than the theta I said in the INN cell. And then you're going to take that Z that's coming out. You split it into four blocks. Each block, you can compute the gates. And then you use the value of something called the cell, and then you keep adding the newly computed values to the cell.

So there's a part here that I call the integral of C. What it does is it basically keeps a hidden state where it keeps adding information to it. So it doesn't multiply information, but it keeps adding information. You don't need to know a lot of this if you want to just apply the LSTM, because it's already implemented in TensorFlow.
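For comparison, here is a minimal sketch of the two cells as just described -- not TensorFlow's actual RNNCell and LSTMCell implementations -- with illustrative sizes and random parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
inp, hid = 8, 16

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Plain RNN cell: concatenate [x, h], multiply by theta, apply tanh.
theta_rnn = rng.normal(scale=0.1, size=(hid, inp + hid))
def rnn_cell(x, h):
    return np.tanh(theta_rnn @ np.concatenate([x, h]))

# LSTM cell: theta is four times bigger; the output is split into four blocks
# (input gate, forget gate, output gate, candidate), and the cell state c is
# updated additively instead of being overwritten.
theta_lstm = rng.normal(scale=0.1, size=(4 * hid, inp + hid))
def lstm_cell(x, h, c):
    z = theta_lstm @ np.concatenate([x, h])
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    c = f * c + i * np.tanh(g)        # keep adding new information to the cell
    h = o * np.tanh(c)
    return h, c

x = rng.normal(size=inp)
h = c = np.zeros(hid)
print(rnn_cell(x, h)[:3])
print(lstm_cell(x, h, c)[0][:3])
```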

Any questions so far? So in terms of applications, you can use this thing to do summarization. So I've started seeing work in summarization, pretty exciting. You can do image captioning. And the input in that case would just be the representation of an image coming out from VGG or coming out from Google Net and et cetera.

And then you send it to the RNN. The RNN will do the decoding for you. Or you can use it for speech recognition or transcription. Or you can use it for Q&A. So in the next part of the talk, I will talk a little bit about speech recognition. So in speech recognition, the input could be maybe some waveforms.

And then the output could be some words, hi, how's it. Well, one thing that you can do is you crop your input into windows. That's the green boxes there. And then you crop a lot of them, and you send them to an RNN. Well, you convert them into MFCCs before you send them to the RNN.

MFCC or spectrogram or something like that. And then you use the algorithm that I said earlier and then with attention. And then you do the transcription. You predict one word at a time in the output. Now, the problem with this algorithm is that when it comes to speech, you end up with a lot of input.

You can end up with thousands and thousands steps. So backpropagating in time, even with attention, can be difficult. Now, one thing that you can do is basically you do some kind of a pyramid to map the input. So if you do enough layers, you can divide your input into a factor of 8 or 16 if you do enough layers.
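A minimal sketch of that pyramid idea, assuming we simply concatenate adjacent pairs of frames at each level; a real model would interleave recurrent layers between the reductions:

```python
import numpy as np

# At each level, concatenate every pair of adjacent frames, so the sequence length is halved.
# Three such levels reduce a thousands-of-steps input by a factor of 8.
def halve_time(frames):
    if len(frames) % 2 == 1:                       # pad so the length is even
        frames = np.vstack([frames, np.zeros_like(frames[-1:])])
    return frames.reshape(len(frames) // 2, -1)    # each new "frame" is a pair of old ones

frames = np.random.randn(1000, 40)                 # e.g. 1000 spectrogram/MFCC frames
x = frames
for _ in range(3):
    x = halve_time(x)                              # a real model would run an RNN layer here
print(frames.shape, "->", x.shape)                 # (1000, 40) -> (125, 320)
```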

And then you produce the output. So we're working on an implementation where the output is actually characters, like in the Baidu's work where they have the CTC. Now, I have to say that the strength of this algorithm is that it actually has an implicit language model in the output.

So when I say the word "how," it's actually conditioned on "hi" and the previous steps, including the input. So there's an implicit language model already. But the problem with this is that actually you have to wait until the end of the input to do decoding. So the decoding has to be done offline.

So if you use this for voice search, it might not be too nice because people want to see some output right away. So in that case, there's an algorithm that can use this and do it in an online fashion, block by block. Now, also, I have to mention that in translation, the sequence with attention works great.

It's among the state of the art. But when it comes to speech, it doesn't work as well as CTC, at least in published results. We're not as good as CTC, which is what Adam talked about earlier, or some of the HMM/DNN hybrids, which are the most widely used speech systems currently.

So I want to pause there, and then I can take questions. Any questions? I have a question at the back. Yeah? So on the machine translation, you were mentioning attention. So in, say, English to German, there's one word that basically has the meaning of multiple words in English.

How does attention work there? Because attention will be focusing on sense, and you know that one is the sending sense. But when you're predicting, it's only going to predict one or two words. Oh. So how does it work in translation? Well, in translation, what we do is basically we have pairs of sentences.

So for example, hi, how are you? And then, hola, como estas? And then we have pairs of sentences like this, and then we just feed it into the sequence to sequence of attention. At every step, again, we're going to predict one word at a time. But before we make a prediction, the model has the attention.

So it actually sees the input once more before it makes a prediction. That's how it works. Now, what is-- can you repeat? What is the issue with the model again, please? When you look at, say, i equals 1 to the Yeah. When you're predicting, the first one you send in the sequence, the prediction is actually just one big word.

So there's no position in that. So there's no basically one, two, three coming afterward. I see. Well, I can't quite follow the question. But let's take it offline. Is that OK? Yeah, yeah. And then we can do some paper together. I have a question. Yeah. So I have a question.

For example, what if you actually, in the inbox, you compose email in different languages, like Vietnamese and English? Yeah. And you actually separate different email and you send it to a different model? Yeah. Or just a single model? OK. So the model-- the inbox thing that I presented, it was all in English.

But there's no limitation in the model in terms of language. So let's suppose that in your inbox, sometimes you write in English, and sometimes you write in Vietnamese, or sometimes you write it in Spanish, whatever. And you personalize by user embedding. Then I would say that it will just learn your behavior.

And then it will basically predict the word that you want. But make sure that your output vocabulary is large enough so that it covers not only the English words, but also the Spanish word, et cetera, like Vietnamese and so on. So your vocabulary is not going to be 20,000.

It's going to be like 100,000, because you have more choices. And then you have to train your model on those examples. Yeah. It's a matter of training data. That's all. OK. I have a question here. You mentioned that in the voice search, it cannot do translation online. Yeah? Is it possible to change your model a little bit so that it doesn't have to start predicting at the end of the voice search?

Yeah, yeah, yeah. So the question is that in the case of voice search, right now you have to wait to the end to make a prediction. Is there any other ways? Yeah, yeah. The answer is yes. You can make a prediction block by block. So you can actually figure out an algorithm, a simple algorithm to actually segment the speech, and then make a prediction, and then take the prediction and feed it as input at the next block.

So you can keep going like that. So in theory, you can actually do online decoding. But I'm saying that you can do online decoding, but that work is currently work in progress. How about that? OK. I have a question there. So the question is regarding Google's auto reply.

Yeah. So if I had to build this myself, I was wondering how do you guys set up the training data set? Because it's difficult to-- how do you know, for this question, that this is the answer you should create? Oh, yeah. So we have some input email and then some output email, where experts have written the reply.

And then you can just train it that way. So you have a training data set that you have created for these purposes? Yeah. OK. I have a couple of questions. So it seems like with sequence to sequence, there's not much of a constraint on how the output aligns with the input, while with CTC there is sort of a constraint on that.

Yeah. Is there any way to add this constraint to sequence to sequence so that it works better for something like speech recognition? Yeah. The question is that in speech recognition, CTC seems to be a very nice framework, because it enforces a monotonic alignment between the output and the input.

But CTC makes this independent assumption. It doesn't have a language model in it. Maybe the sequence to sequence-- can address this? Yeah, I think that's a great idea. Maybe we should write a paper together. OK. I think-- I haven't seen it, but I think that's a very good idea.

Question? I see. OK. Great. So the question is that, because right now we predict one step at a time, is there any way to actually look globally at the output and maybe use some kind of reinforcement learning to adjust the output? And the answer is yes. So there's a recent paper at Facebook called, I think, sequence level training or something like that, where they don't optimize for one step at a time, but they look at the output globally, and then they try to improve word error rate, or they try to improve BLEU score or things like that for translation.

And it seems to be making some improvement in the metrics that they care about. Now, if you show it to humans, though, people still prefer the output from this model. So some of the metrics that we use in translation and so on might not be what the metrics that we optimize.

And the next step prediction seem to be what people like a lot in translation. Yeah, so the question is, can we add the GAN loss? Yeah, I think that's a great idea. Yeah. I have a question here. Yeah. Yeah. Change? Once you have the model, is there a way based on the human input to influence the encoder?

Yeah, yeah. So let's suppose that you type the first word, "Hola," then you can actually start the beam from there. So the question is, is there any way to incorporate user input? So I say, yeah. Let's suppose that you say, "Hola," sorry, "Hi, how are you?" And then as soon as the person type "Hola," that actually restrict your beam.

So you can actually condition your beam on the first word, "Hola," and your beam will be better. Yeah, I think that's a good idea. I have a question. Oh, so how much data did we use? So in translation, for example, we use several WMT corpora. And the WMT corpora usually have tens of millions of pairs of sentences, something like that.

And every sentence have like 20 words on average, 20, 30 words on average. I can't remember, but that's something like that, order of magnitude. Yeah, I have a question there. I can't really hear. So how is it compared to Google Search auto-completion? Honestly, I don't know what's used underneath Google Search auto-completion.

But I think they should use something like this, because it's-- OK, I have still lots of interesting stuff coming along. So OK, so what's the big picture? So the big picture is so far, I talk about sequence to sequence learning. And yesterday, Andrew was talking about most of the big trends in deep learning.

And he was talking about the second trend was basically doing end-to-end deep learning. So you can characterize sequence to sequence learning as end-to-end deep learning as well. So the framework is very general. So it should work for a lot of NLP-related tasks, because a lot of them, you would have input sequence and output sequence in NLP.

It could be that the input would be some text, and the output would be some parse trees. That's also possible. But it works great when you have a lot of data. Now, when you don't have enough data, then maybe you want to consider dividing your problems into smaller components, and then train your sequence to sequence on the subcomponents, and then merge them.

Now, if you don't have a lot of data, but you have a lot of related tasks, then it's also possible to actually merge all these tasks by combining the data, and then have an indicator bit to say, this is translation, this is summarization, this is email reply, and then train it jointly.

And that should improve your output, too. Now, this basically concludes the part about sequence to sequence. And then in the next part, I'm going to place sequence to sequence in the bigger picture of the active ongoing work in neural nets for NLP. So if you have any questions, you can ask now.

I take maybe two questions, because I think I'm running out of time. So I have a question. Yeah? I have a question. So does your model handle emoji in NLP? So the question is, does the model handle emoji? I don't know, but emoji is like a piece of text, too, right?

So you can just feed it in as another extra token. If you make your vocabulary 200,000, then you should be able to cover emoji as well. Yeah, I have a question. As new emails and other documents come in, do you have to retrain the model, or do you do anything?

Oh, so if you have new data coming in, so should I retrain the model? I think towards the end, we lower the learning rate. So if you add new data, it will not make a lot of good updates. So usually, you can add new data, increase the learning rate, and then continue to train.

Yeah, that should work. OK, so I already took two questions. Let's keep going. So there's an active area that actually is very exciting, which is in the area of automatic Q&A. So you can think that maybe the setup would be, can you read a Wikipedia page and then answer a question?

Or can you read a book and answer a question? Now, in theory, you can use sequence-to-sequence with attention to do this task. So it's going to look like this. You're going to read the book, one token at a time, and with the book. Then read the question. And then you're going to use the attention to look at all the pages.

And then you make a prediction of the tokens. So that's kind of-- sometimes we do answer questions, that way, sometimes we don't have knowledge about the fact. So we actually read the book again to answer the fact. But a lot of the time, if you ask me, is Barack Obama the President of the United States?

I would say, yes, because it's already in my memory. So maybe it's better to actually augment the RNN with some kind of memory, so that it will not do this look back again. It's kind of annoying, look back again. So there's an active area of this research. I'm not a definite expert, but I'm very aware, so I can place you in the right context here.

So work in this area would be memory networks by Weston and folks at Facebook. There are Neural Turing Machines at DeepMind. Dynamic memory networks, which Richard Socher presented yesterday. And then stack-augmented RNNs by Facebook again, et cetera. So I want to show you at a high level what this augmented memory means.

Let's think about the attention. So the attention looks like this. In the encoder, you're going to look at some input. And then you have a controller, which is your h variable. And then you keep updating your h variable. But along the side, you're going to write down into memory your h1, h2, h3, et cetera.

You store it into a memory. Clear, right? And in the decoder, what you're going to do is you're going to continue producing some output. You're going to update your controller g, but you're going to read from memory your h again. So again, in the input, you write to memory.

And then in the output, you read from memory. Now let's try to be a little bit more general. And the general would be at any point in time, you can read and write. You have a controller, and you can read and write all the time. Now to do that, you have the following architecture.

So you have some memory bank, a big memory bank. And then you can use the write. You can decide to write some information into it by a combination of the memory bank in the previous step and the hidden variable in the previous step. And then you also read into the hidden state, too.

And then you could make an update, and then you can keep going forever like that. So this concept is called RNN with augmented memory. Okay? Is that somewhat clear? Any question? You have a question. The question is, when you read, do you read the entire memory bank? A lot of these algorithms are actually soft attention.

So yes, it will look the entire memory. You can actually predict where to look, right? And then read only that block. Now the problem with that is you end up with very, it's not differentiable anymore, right? Because the thing that you don't read don't contribute to the gradient. So it's gonna be hard to train, but you can use reinforce and so on to train it.

So there's a recent paper, Reinforcement Learning Neural Turing Machines, that actually does something like this, right? Not exactly, but it will deal with discrete actions. Okay? Any question? No question. Okay, so another extension that a lot of people talk about is using RNNs with augmented operations. So you wanna augment the neural network with some kind of operations, like addition, subtraction, multiplication, the sine function, et cetera.

A lot of functions. So to motivate you, you can think about, Q and A can fall into this. So for example, here's a context. The building was constructed in the year 2000. And then it was then, later on, people say, oh, it was then destroyed in the year 2010.

And then the question would be, how long did the building survive? And the answer would be 10 years. Now, how would you answer this question? You would say 2010 minus 2000, 10 years. Now, neural nets, if you can train with a lot of examples, can do that too. They can learn to subtract numbers and things like that, but it requires a lot of data to do so.

All right, so maybe it's better to augment them with functions, like addition and subtraction. So the way you can do it is that the neural network will read all the tokens so far, and it will push the numbers into a stack. And then you'll get, the neural net is augmented by a subtraction and a addition function.

And then you assign this probability for these two functions. So green, the more dark, this means the higher probability, okay? So you assign two probability, and you compute the weighted average of the values coming out of these two function. And then you take that, and then you pop it, and you push it into the stack in the next step.

And then in the next step, you will call the addition and subtraction again, and et cetera. That's the principle of something called neural programmers or neural programmer interpreters. So there are two papers last year from Google Brain and DeepMind was talking about this. So that's some of the related work in the area of augmenting recurrent networks with operations, with memory, et cetera.
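A toy sketch of that soft selection over operations -- illustrative only, not the actual neural programmer -- where the result is a probability-weighted average of the operations' outputs, so everything stays differentiable:

```python
import numpy as np

# The controller assigns a probability to each built-in operation; the value pushed back
# on the stack is the weighted average of what the operations would produce.
ops = {"add": lambda a, b: a + b, "subtract": lambda a, b: a - b}

def soft_op(op_logits, a, b):
    p = np.exp(op_logits - np.max(op_logits))
    p /= p.sum()                                  # softmax over the available operations
    values = np.array([f(a, b) for f in ops.values()])
    return float(p @ values)                      # weighted average of the operation outputs

# Pretend the controller read "constructed in 2000 ... destroyed in 2010", pushed 2010 and
# 2000 on the stack, and strongly prefers "subtract".
logits = np.array([0.0, 10.0])                    # [add, subtract]
print(soft_op(logits, 2010.0, 2000.0))            # close to 10
```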

Now, what's the big picture? Okay, so the big picture, I wanna revisit, and I say, so what I've talked about today is sequence-to-sequence learning. And it's an end-to-end deep learning task. So it's one of the big trends happening in natural language. It's very general, so you can use, if you have a lot and a lot of supervised data, it's a supervised learning algorithm.

So if you have a lot of data, it should work great. But if you don't have enough supervised data, then you consider dividing your problem and then training the different components. Or you can train jointly in a multitask setting. And people also train it jointly with an autoencoder, namely you read the input sentence and then predict the input sentence again.

And that's also, and then you train jointly with all the tasks, and that works as well. If you go home and then you wanna make impact at your work tomorrow, then so far, that's so far so good. That can make some impact. Now, if you wanna do some research, and I think things with memory, operation, augmentation are some of the exciting areas.

But it seems like it's still work in progress. But I expect a lot of advances in this area in the near future. So if you want to know more, you can take a look at Chris Olah's blog. He talks about attention and augmented recurrent networks. I also wrote some tutorials, pretty simple.

The sequence-to-sequence with attention for translation is implemented in TensorFlow. So you can actually download TensorFlow and train it, what I said today. Now, there's a lot of work going on in this area. Many of these are not mine. So as you can see, you can't even read the words.

That means how many papers come along in this area. So I can pause there, and I have five minutes to answer questions. (audience member speaking) I have a question there, yeah. (audience member speaking) Yeah. (audience member speaking) Yeah. (audience member speaking) Yeah. (audience member speaking) I see. Can you speak to the microphone?

Because I can't hear very well. The microphone, and then I think people can hear that as well. - When you're training a Q&A network, so you're taking the example of training from a book to answer questions. - Yeah. - So if, let's say, Harry Potter, who was Harry Potter's father.

- Yeah. - There could be many books that have a character Harry. - Yeah. - So there's a context resolution issue, which is which Harry should I answer the question for. - Yeah. - How do you solve the context problem when you're training this kind of Q&A type network?

- I think that's a great question. So I think one thing is that you can always personalize. For example, you know that the guy, when he talk about, you can have a representation for the user, and then you know that when he say Harry, because he actually been reading a lot of books about Harry Potter, so it's more likely to be Harry Potter.

But I think with the algorithm I said, I just want to make sure that it's as simple as possible. So the user has to ask the question, Harry Potter, rather than Harry. But I'm saying if you represent user vectors, and then you inject more additional knowledge about the users, about the context, into as additional token in the input of the net, the net can figure it out by itself.

Yeah, so that's one way to do it, yeah. Okay, I have a question there, yeah. - You did some work on doc2vec. - Yeah. - Do you have an idea what the state of the art in generalizing word2vec is to more than one word? - Oh, I see. I think skip thoughts are interesting directions here.

So doc2vec is one way, but skip thoughts, so the idea of skip thoughts was actually, Ruslan Salakhutdinov was an author on this. And the idea is basically using sequence to sequence to predict the next sentence. So the input would be the current sentence, and the output would be the previous sentence or the next sentence.

And then you could train a model like that. The model is called skip thought. And I've heard a lot of good things about skip thoughts, where you can take the embedding at the end, and then you can do document classification and things like that, and it works very well.

So that's probably one place that you can go. My colleague at Google is also working on something called autoencoder. So instead of predicting the next sentence, he predicts the current sentence. So trying to repeat the current sentence, and that's kind of worked well too. Yeah, yeah. - See, what was your thoughts on how to solve the common sense reasoning problem?

- Oh, common sense, I'm deeply interested in common sense, but I gotta say, I have no idea. I think maybe you can do something like, I think common sense is about a lot of, first of all, there's a lot of knowledge about the world that is not captured in text.

Like for example, gravity and things like that. So maybe you really need to actually combine a lot of modality. That's one way to think about it. Or the other thing is, do you make sure that unsupervised learning work? That's another approach. But I think this research area, I think, I'm just making guesses right now.

- Is there a good way to represent all these rules and using some soft-- - Yes, yes. So the question is, how do you represent rules? So if you think about this network, the neural programmer network, that it actually augmented by addition and subtraction, then these are rules. - Right.

- You can augment it with a table of rules and then ask the network to actually attend into that rule table. People have looked into this direction. So that's one way to do it. - Okay, you're saying basically, I'll go ahead and do some logical reasoning? - Yeah, yeah.

- Hey, great talk. - Yeah, thank you. - Is there a practical rule of thumb for how many sequence pairs you need to train such a model successfully? - I see. - Are there any tips to reduce how many pairs you need if you don't have-- - I see, okay.

So usually, the bigger data set, the better, but the corpus that people train this on translation, for example, English to German, it has only about three, five million pairs of sentences or something like that. So that's kind of small, three million, right? And still, people are able to make it to the state of the art.

So that's pretty encouraging. Now, if you don't have a lot of data, then I would say things like pre-train your word vectors with language models or Word2Vec, right? That's one area that you have a lot of parameters. You can pre-train your model with some kind of language model, and then you reuse the softmax.

That's another area where you have a lot of parameters. Or use dropout on the input embedding, or drop out some random words in the input sentence. So those things can improve the regularization when you don't have a lot of data. Okay. Yeah, thank you. Okay. Yeah, thank you. (audience applauding) - Thank you, Quoc.

So we'll reconvene at six o'clock for Yoshua Bengio's closing keynote.