Back to Index

MIT 6.S094: Recurrent Neural Networks for Steering Through Time


Chapters

0:00 Intro
0:44 Administrative
1:50 Flavors of Neural Networks
6:01 Back to Basics: Backpropagation
8:34 Backpropagation: Forward Pass
10:14 Backpropagation: By Example
11:58 Backpropagation: Backward Pass
13:41 Modular Magic: Chain Rule
14:55 Interpreting Gradients
18:27 Modularity Expanded: Sigmoid Activation Function
20:33 Learning with Backpropagation
25:27 Optimization is Hard: Dying ReLUs
26:13 Optimization is Hard: Saddle Point
27:39 Learning is an Optimization Problem
30:59 Optimization is Hard: Vanishing Gradients
32:52 Reflections on Backpropagation
36:07 Unrolling a Recurrent Neural Network
37:39 RNN Observations
38:39 Backpropagation Through Time (BPTT)
40:34 Gradients Can Explode or Vanish: Geometric Interpretation
41:08 RNN Variants: Bidirectional RNNs
42:08 Long-Term Dependency
43:49 Long Short-Term Memory (LSTM) Networks
45:43 LSTM: Gates Regulate
45:48 LSTM: Pick What to Forget and What To Remember
48:08 LSTM Conveyor Belt
50:12 Application: Machine Translation
52:19 Application: Handwriting Generation from Text
52:59 Application: Character-Level Text Generation
54:06 Application: Image Question Answering
55:21 Application: Image Caption Generation
55:56 Application: Video Description Generation
56:50 Application: Modeling Attention Steering
57:33 Application: Drawing with Selective Attention Writing
58:03 Application: Adding Audio to Silent Film
59:00 Application: Medical Diagnosis

Transcript

All right, so we've talked about regular neural networks, fully connected neural networks. We've talked about convolutional neural networks that work with images. We've talked about deep reinforcement learning, where we plug a neural network into a reinforcement learning algorithm when an agent, a system, has to not only perceive the world but also act in it and collect reward.

And today we'll talk about perhaps the least understood but the most exciting flavor of neural network out there: recurrent neural networks. So first, administrative stuff. There's a website, I don't know if you heard, cars.mit.edu, where you should create an account if you're a registered student. That's one of the requirements.

You need to have an account if you want to get credit for this. You need to submit code for DeepTraffic.js and DeepTesla.js. And for DeepTraffic, you have to have a neural network that drives faster than 65 miles an hour. If you need help to achieve that speed, please email us.

We can give you some hints. For those of you who are old school SNL fans, there's a deep thoughts section now in the profile page, where we encourage you to talk about the kinds of things you tried in DeepTraffic or DeepTesla or any of the other work you've done as part of this class for deep learning.

We've talked about the vanilla neural networks on the left. The vanilla neural network is the one that's approximating a function that maps from one input to one output. An example is mapping images to the number that's shown in the image. For ImageNet, it's mapping an image to the object in the image.

It could be anything. In fact, convolutional neural networks can operate on audio. You could give it a chunk of audio, five second audio clip. That still counts as one input because it's fixed size. As long as the size of the input is fixed, that's one chunk of input. And as long as you have ground truth that maps that chunk of input to some output ground truth, that's a vanilla neural network.

Whether that's a fully connected neural network or a convolutional neural network. Today, we'll talk about the amazing, the mysterious recurrent neural networks. They compute functions from one to many, from many to one, from many to many. Also bidirectional. What does that mean? They take as input sequences: time series, audio, video.

Whenever there's a sequence of data, and the temporal dynamics that connects the data is more important than the spatial content of each individual frame, when there's a lot of information being conveyed in the sequence, in the temporal change of whatever that type of data is, that's when you want to use recurrent neural networks.

Like speech, natural language, audio. Where recurrent neural networks really shine is when the size of the input is variable. You don't have a fixed chunk of data that you're putting in; it's variable input. The same goes for the output.

You can give it a sequence of speech, several seconds of speech, and then the output is a single label of whether the speaker is male or female. That's many to one. You can also do many to many translation. You can have natural language put into the network in Spanish and the output is in English.

Machine translation, that's many to many. And that many to many doesn't have to be mapped directly into same size sequences. So for video, the sequence size might be the same. You're labeling every single frame. You put in a five-second clip of somebody playing basketball. And you can label every single frame counting the number of people in every single frame.

That's many to many when the size of the input and size of the output is the same. Yes, question. The question was, are there any models where there's feedback from output and input? And that's exactly what recurrent neural networks are. It produces output and it copies that output and loops it back in.

That's almost exactly the definition of a recurrent neural network. There's a loop in there that produces the output and also takes that output as input once again. And there's also many to many where the sequences don't align. Like machine translation: the size of the output sequence might be totally different from the input sequence.

We'll look at a lot of cool applications. For example, you can take a song, train on the audio of that particular song, and have the recurrent neural network continue the song after a certain point in time. So you can learn to generate sequences of audio, of natural language, of video.

I know I promised not many equations, but this is so beautifully simple that we have to cover backpropagation. It's also the thing that, if you're a little bit lazy and you go on the internet and start using the basic tutorials for TensorFlow, you ignore how backpropagation works at your peril.

You kind of assume it just works. I give it some inputs, some outputs, and it's like Lego pieces: I can assemble them, like you might have done with DeepTraffic. A bunch of layers, put them together and then just press train. And backpropagation is the mechanism, the best mechanism we currently know of, that is used for training neural networks.

So we need to understand the simple power of backpropagation but also the dangers. Summary: say up at the top of the slide there's an input to the network, say an image. There's a bunch of neurons, all with differentiable, smooth activation functions. And then as you pass the input through those activation functions, through this net of differentiable compute nodes, you produce an output.

And for that output, you also have a ground truth, the correct output that you hope, that you expect, the network to produce. And then you can look at the difference between what the network actually produced and what you hoped it would produce, and that's an error. And then you backpropagate that error, punishing or rewarding the weights, the parameters of the network that resulted in that output.

Let's start with a really simple example. There's a function that takes as input, up on top, three variables X, Y and Z. The function does two things. It adds X and Y, and then it multiplies that sum by Z. And then we can formulate that as a circuit of gates, where there's a plus gate and a multiplication gate.

And let's take some inputs shown in blue. Let's say X is -2, Y is 5, Z is -4. And let's do a forward pass through this circuit to produce the output. So -2 + 5 = 3. Q is that intermediate value, the output of the add gate: 3.

This is so simple and so important to understand that I just want to take my time through this, because everything else about neural networks just builds on these concepts. Okay, so the add gate produces Q. In this case it's 3. And then 3 * -4 is -12. That's the output.

The output of this circuit, of this network if you think of it as such, is -12. And so the forward pass is shown in blue. The backward pass will be shown in red in a second here. So what we want to do, what would make us happy, what would make F happy, is for the output to be as high as possible.

-12 is so-so, we could do better. So how do we teach it? How do we adjust X, Y and Z such that it produces a higher F? Makes us happier. Okay, let's start backward, the backward pass. We make the gradient on the output 1. Meaning we want this to increase.

We want F to increase. That's how we encode our happiness. We want it to go up by 1. And in order to then propagate that fact that we want the F to go up by 1, we have to look at the gradient on each one of the gates. Now what's a gradient?

It's a partial derivative with respect to its inputs. The partial derivative of the output of a gate with respect to its inputs. If you don't know what that means, it's just the how much does the output change when I change the inputs a little bit. What is the slope of that change?

If I increase X, for the first function, addition, F(X, Y) = X + Y: if I increase X by a little bit, what happens to F? If I increase Y by a little bit, what happens to F? So taking the partial derivatives with respect to X and Y, you just get a slope of 1.

So when you increase X, F increases linearly. Same with Y. Multiplication is a little trickier. When you increase X, F increases by Y. So the partial derivative of F with respect to X is Y. The partial derivative of F with respect to Y is X. So if you think about it, what happens is the gradients, when you change X, the gradient of change doesn't care about X.

It cares about Y. So it's flipped. So we can back propagate that 1, the indication of what makes us happy, backwards. And that's done by computing the local gradient. For Q, so the partial derivative of F with respect to Q, that intermediate value, that gradient will be -4. It will take the value of Z, as I said, it's a multiplication gate.

It will take the value of Z and assign it to the gradient. And the same for the partial derivative of F with respect to Z: it gets assigned Q, the value from the forward pass. So there's a 3 and a -4 in the forward pass, in blue, and that's flipped to -4 and 3 on the backward pass.

That's the gradient. And then we continue in the same exact process, but wait: what makes all of this work is the chain rule. It's magical. What it allows us to do is compute the gradient on F with respect to the inputs, X, Y, Z.

We don't need to construct the giant function that is the partial derivative of F with respect to X and Y and Z analytically. We can do it step by step, backpropagating the gradients. We can multiply the gradients together as opposed to doing partial derivative of F with respect to X.

We have just the intermediate, local gradients, of F with respect to Q and of Q with respect to X, and multiply them together. So instead of computing the gradient of that giant function, (X + Y) * Z, in this case it's not that giant, but it gets pretty giant in neural networks, we just go step by step.

We look at the first function, simple addition, Q = X + Y, and the second function, multiplication, F = Q * Z. So the gradient on X and Y, the partial derivative of F with respect to X and Y, is computed by multiplying the gradient on the output, -4, times the local gradient on the inputs, which, as we talked about, is just 1 when the operation is addition.

So it's -4 * 1. That means, what does that mean? Let's interpret those numbers. You now have gradients on X, Y and Z, the gradient of, the partial derivative of F with respect to X, Y, Z. That means, so for X and Y it's -4, for Z it's 3.

That means in order to make F happy, we have to decrease the inputs that have a negative gradient and increase the inputs that have a positive gradient. The negative ones are X and Y, the positive is Z. Hopefully I don't say the word beautiful too many times in this presentation, but this is very simple, beautifully simple.

Because this gradient is a local worker. It propagates for you, it has no knowledge of the broader happiness of F. It just propagates, it computes the gradient between the output and the input. And you can propagate this gradient based on, in this case, F, a gradient of 1, but also just the error.
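
In code, the whole worked example fits in a few lines. Here's a minimal sketch of this exact circuit; the numbers mirror the slide, and the variable names are just illustrative:

```python
# Forward pass through the circuit f = (x + y) * z
x, y, z = -2.0, 5.0, -4.0
q = x + y   # add gate: q = 3
f = q * z   # multiply gate: f = -12

# Backward pass: put a gradient of 1 on the output ("we want f to go up")
grad_f = 1.0
# Multiply gate: swap the forward-pass values
grad_q = z * grad_f    # df/dq = z, so -4
grad_z = q * grad_f    # df/dz = q, so 3
# Add gate, via the chain rule: dq/dx = dq/dy = 1
grad_x = 1.0 * grad_q  # -4
grad_y = 1.0 * grad_q  # -4

print(grad_x, grad_y, grad_z)  # -4.0 -4.0 3.0
```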

Instead of 1, we could have the error on the output as the measure of happiness, and then we could propagate that error backwards. These gates are important because you can break down almost every operation we work with in neural networks into one of several gates like this.

And the most popular are three: addition, multiplication and the max operation. So for addition, what you do is... okay, the process is you take a forward pass through the network, so we have a value on every single gate. And then you take a backward pass.

And through the backward pass you compute those gradients. For an add gate, you equally distribute the gradient on the output to the inputs. So when the gradient on the output is -4, each input gets -4. And you ignore the forward pass value. So that 3 is ignored when you backpropagate.

On the multiplication gate, on the multiply gate, it's trickier. You switch the forward pass values. So if you look at F, that's a multiply gate. The forward pass values are switched and multiplied by the value of the gradient in the output. If it's confusing, go through the slides slowly.

It'll make a lot more sense, hopefully. One more gate. There's the max gate, which takes the inputs and produces as output the value that is larger. And when computing the gradient of the max gate, it distributes the gradient similar to the add gate, but to only one. To only one of the inputs.

The largest one. So it, unlike the add gate, pays attention to the input values on the forward pass. Alright. Lots of numbers, but the whole point here is it's really simple. A neural network is just a simple collection of these gates. And you take a forward pass, you calculate some kind of function on the end, a gradient at the very end, and you propagate that back.
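
Those three rules can be written as tiny backward functions. A sketch; the function names are illustrative, not from any particular library:

```python
def add_backward(x, y, grad_out):
    # Add gate: distribute the output gradient equally to both inputs;
    # the forward-pass values are ignored.
    return grad_out, grad_out

def mul_backward(x, y, grad_out):
    # Multiply gate: swap the forward-pass values.
    return y * grad_out, x * grad_out

def max_backward(x, y, grad_out):
    # Max gate: route the whole gradient to the larger input; unlike
    # the add gate, it pays attention to the forward-pass values.
    return (grad_out, 0.0) if x >= y else (0.0, grad_out)
```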

So usually for neural networks, that's an error function. A loss function, objective function, cost function, all the same word. So that's the sigmoid neuron there. When you have three weights, w0, w1, w2, and two inputs, x0, x1, the output is the sigmoid function applied to the weighted sum, sigmoid(w0*x0 + w1*x1 + w2). That's how you compute the output of the neuron.

But then you can decompose that neuron, you can separate it all into just a set of gates like this. Addition, multiplication, there's an exponential in there, and division. They're all very similar. And you repeat the exact same process. There are five inputs to this circuit: three weights and two inputs, x0, x1.

You take a forward pass through this circuit. In this case, again, you want it to increase so that the gradient on the output is one. You back propagate that gradient of one to the inputs. Now with neural networks, there's a bunch of parameters that you're trying to, through this process, modify.

And you don't get to modify the inputs. You get to modify the weights along the way and the biases. The inputs are fixed, and the outputs that you hope the network will produce are fixed. What you're modifying is the weights. So you try to adjust those weights in the direction of the gradient.
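
Here's a minimal sketch of backpropagation through that decomposed sigmoid neuron, with made-up forward-pass values. It uses the convenient fact that the whole chain of exponential, addition and division gates collapses into one local gradient, sigmoid(s) * (1 - sigmoid(s)):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Made-up values: two inputs, three weights (w2 acts as the bias),
# matching f(w, x) = sigmoid(w0*x0 + w1*x1 + w2).
w0, w1, w2 = 2.0, -3.0, -3.0
x0, x1 = -1.0, -2.0

s = w0 * x0 + w1 * x1 + w2  # weighted sum: 1.0
out = sigmoid(s)            # ~0.73

# Backward pass with a gradient of 1.0 on the output.
grad_s = out * (1.0 - out) * 1.0  # sigmoid's local gradient
grad_w0 = x0 * grad_s  # multiply gate: swap values
grad_w1 = x1 * grad_s
grad_w2 = 1.0 * grad_s  # add gate: pass through
```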

That's the task of backpropagation. The main way that neural networks learn is we update the weights and the biases to decrease the loss function. The lower the loss function, the better. So in this case, you have the simple network on the top left: three inputs, with a weight on each of the inputs.

There's a bias on the node, b, and it produces an output, a. And that little symbol is indicating a sigmoid function. And the loss is computed as (y - a)^2 / 2, where y is the ground truth, the output that you want the network to produce. And that loss function is backpropagated in exactly the same way that we described before.

So the subtasks that are involved in this update of weights and biases are: the forward pass computes the network output at every neuron, and then finally the output layer computes the error, the difference between a and y. And then you backward propagate the gradients. Instead of one on the output, it'll be the error on the output, and you backpropagate it.

And then once you know the gradient, you adjust the weights and the biases in the direction of the gradient. Or actually the opposite of the direction of the gradient because you want the loss to decrease. And the amount by which you make that adjustment is called the learning rate.
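
The update itself is one line per weight. A toy sketch with made-up numbers:

```python
# Toy weights and the gradients that backpropagation produced for them.
weights = [0.5, -1.2, 0.3]
grads = [0.1, -0.4, 0.05]
learning_rate = 0.01  # one rate shared across the whole network here

# Step opposite the gradient, because we want the loss to decrease.
weights = [w - learning_rate * g for w, g in zip(weights, grads)]
```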

The learning rate could be the same across the entire network, or it could be individual to every weight. (deep breath) And the process of adjusting the weights and biases is just optimization. Learning is an optimization problem. You have an objective function and you're trying to minimize it. And your variables are the parameters, the weights and biases.

And neural networks just happen to have tens, hundreds of thousands, millions of those parameters. So the space, the function that you're trying to minimize, is highly nonlinear. But it boils down to something like this. Here are two plots; you have one weight.

And then as you adjust it, you adjust in such a way that minimizes the output cost. And there's a bunch of optimization methods for doing this. This is a convex function, so you can find the minimum, the local minimum, if you know about these kinds of terminology.

The local minimum is the same as the global minimum. So there's not... It's not a weirdly hilly terrain where you can get stuck in... So your goal is to get to the bottom of this thing. And if it's really complex terrain, it'll be hard to get to the bottom of it.

The general approach is gradient descent, and there are a lot of different ways to do gradient descent, some adding randomness into the process in various ways so you don't get stuck in the weird crevices of the terrain. All right, but it's messy.

You have to be really careful. This is the part you have to be aware of. When you're designing a network for DeepTraffic and nothing is happening, this might be what's happening: vanishing gradients or exploding gradients. When the partial derivative is small, so if you take the sigmoid function, for a while the most popular activation function, the derivative is zero at the tails.

So when the input to the sigmoid function is really high or really low, that derivative is going to be zero. So the gradient that you compute, and the gradient tells you how much you want to adjust the weights, might be zero. And so you backpropagate that zero, a very low number, and it gets smaller and smaller as you backpropagate.

And so the result is that you don't... You think that you don't need to adjust the weights at all. And when a large fraction of the network thinks that weights don't need to be adjusted, then they don't adjust the weights and you're not doing any learning. So the learning is slow.
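
You can see that saturation numerically. A tiny sketch: near zero the sigmoid's local gradient is healthy, and at the tails it vanishes:

```python
import math

def sigmoid_grad(a):
    s = 1.0 / (1.0 + math.exp(-a))
    return s * (1.0 - s)

for a in [0.0, 2.0, 5.0, 10.0]:
    print(a, sigmoid_grad(a))
# 0.0  -> 0.25
# 2.0  -> ~0.105
# 5.0  -> ~0.0066
# 10.0 -> ~0.000045
```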

There are some fixes to this. There are different types of functions. There's a piecewise linear one, the ReLU function, which is now the most popular activation function. But again, it suffers: if the neurons are initialized poorly, this function might not fire, it might have zero gradient, for the entire dataset. Nothing that you produce as input...

You run all your thousands of images of cats and none of them fire at all. So that's the danger here. So you have to pick both the optimization engine, the solver that you use, and the activation functions carefully. You can't just plug and play like they're Legos. You have to be aware of the function.

SGD, stochastic gradient descent, that's the vanilla optimization algorithm for gradient descent, for optimizing the loss function over the gradients. And what's visualized here, again, if you've done any numerical or nonlinear optimization, is the famous saddle point that's tricky for these algorithms to deal with. What happens is it's easy for them to get stuck in that saddle and oscillate back and forth.

As opposed to what they want to do, which is go down. You get so happy that you found this low point that you forget that there's a much lower point. And so you get stuck: the momentum of the gradient keeps rocking you back and forth, while right nearby there's a much deeper global minimum.

And there's a bunch of clever ways of solving that. The Adam optimizer is one of those. But in this case, as long as the gradients don't vanish, SGD, stochastic gradient descent, or one of these algorithms will get you there. It might take a little while, but they'll get you there.

And that's the main question. The question was: you're dealing with a function that's non-convex, so how do we ensure anything about it converging to anything reasonably good, about the local optimum it converges to? And the answer is, you can't. This isn't only a nonlinear function, it's a highly nonlinear function.

The power and the beauty of neural networks is that it can represent these arbitrarily complex functions. It's incredible, right? And you can learn those functions from data. But the reason people refer to neural networks training as art is you're trying to play with parameters that don't get stuck in these local optima for stupid reasons and for clever reasons.

Yes, question. So the question, yeah. So continue on the same thread. So the thing is, we're dealing with functions where we don't know what the global optimal is. That's sort of the crux of it. Everything we talk about, interpreting text, interpreting video, even driving, what is the optimal for driving?

Never crashing? It sounds easy to say that, but you actually have to formulate the world under which you define all of those things and that becomes really nonlinear objective function for which you don't know what the optimal is. It's just... That's why you just keep trying and get impressed every time it gets better.

It's essentially the process. And you can also compare to human-level performance. So for ImageNet, we can tell the difference between cats and dogs, in top-five categories, with roughly 96% accuracy. And then you get impressed when a machine can do better than that.

But you don't know what the best is. These videos can be watched for hours. I won't play it until I explain the slide. So let's pause to reflect on backpropagation before I go on to recurrent neural networks. Yes, question. In a practical manner, how can you tell when you're actually training a net whether you're facing the vanishing gradient problem or you need to change your optimizer or you need to, I mean, like you've reached some local minimum?

The question was, how do you practically know when you've hit the vanishing gradient problem? So the vanishing gradient, the derivative being zero, happens when the activations saturate, at really high values and really low values. The really high values are easy to spot, because your network is just going crazy, producing very large values.

And you can fix a lot of those things by just capping the activations. The values being really low resulting in a vanishing gradient are really hard to detect. This is, I mean, there's a lot of research in trying to figure out how to detect these things, but if you're not careful, it's oftentimes you can find that, and this isn't hard to do, where like 40, 50% of the network, of the neurons are dead.

For ReLU, we call them dead ReLU nodes. They're not firing at all. How do you detect that? That's part of learning. If they never fire, you can detect that by running the entire training set through the network. I mean, there are a lot of tricks, but the problem is you try to learn, and then you look at the loss function, and it's not converging to anything reasonable.

It's either going all over the place or just converging very slowly, and that's an indication that something is wrong. That something could be the loss function is bad, that something could be that you've already found the optimal, or that something could be the vanishing gradient. And again, that's why it's an art.
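
One concrete version of the never-fires check: run the whole training set through and count units that never activate. A NumPy sketch with made-up shapes:

```python
import numpy as np

def dead_relu_fraction(activations):
    # activations: (num_examples, num_units) array of post-ReLU values
    # collected over the entire training set. A unit is "dead" if it
    # never fires on any example.
    fired_ever = (activations > 0).any(axis=0)
    return 1.0 - fired_ever.mean()

# Made-up example: 10,000 examples, 512 hidden units, with a bias
# shift so that poorly initialized units almost never activate.
acts = np.maximum(0.0, np.random.randn(10000, 512) - 5.0)
print(dead_relu_fraction(acts))  # close to 1.0: almost all units dead
```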

But certainly, at least some fraction of the neurons need to be firing. Otherwise, the initialization is really poorly done. Okay, so to reflect on the simplicity of backpropagation, and the power of it: this kind of step, backpropagating the loss function through the gradients locally, is the way neural networks learn.

It's really the only way that we've effectively been able to train a neural network to learn a function. So adjusting the weights and biases, the huge number of weights and biases, the parameters, happens just through this optimization. It's backpropagating the error where you have the supervised ground truth.

So the question is whether this process, just fitting, adjusting the parameters of a highly nonlinear function to minimize a single objective, is the way you achieve intelligence, human-level intelligence. And that's something to think about. You have to think about, for driving purposes, what is the limitation of this approach?

So what's not happening? The neural network design, the architecture, is not being adjusted. You're not evolving any of the edges, the layers; nothing is being evolved. And so there are other optimization approaches that I think are more interesting and inspiring than effective. So for example, this is work using soft cubes, coming out of the field of evolutionary robotics, where you evolve the dynamics of a robot using genetic algorithms.

So these robots are being taught, in simulation obviously, to walk and to swim. So that one is swimming. The nice thing here is that the dynamics, that highly nonlinear space that controls this weird-shaped robot with a lot of degrees of freedom, is the same kind of thing as a neural network.

And in fact, people have applied genetic algorithms and ant colony optimization, all kinds of nature-inspired algorithms, to optimizing the weights and the biases. But they don't seem to currently work that well. But it's a cool idea, to be using nature-type evolutionary algorithms to evolve something that's already nature-inspired, which is neural networks.

But something to think about is, you know, that backpropagation, while really simple, is kind of dumb. And the question is whether general intelligence and reasoning could be achieved with this process. All right, recurrent neural networks. So on the left there, there's an input x, with weights u on the input.

There's a hidden state, a hidden layer s, with weights w on the edges connecting the hidden states to each other. And then more weights v on the output o. It's a really simple network. There are inputs, there are hidden states, the memory of this network, and there are outputs. But the fact that there is this loop, where the hidden states are connected to each other, means that as opposed to taking a single input, the network takes an arbitrary number of inputs.

It just keeps taking x's, one at a time, processing a sequence of x's through time. And so depending on the duration of the sequences you're interested in, you can think of this network in its unrolled state. So you can unroll this neural network, where the inputs are on the bottom: x(t-1), x(t), x(t+1).

And same with the outputs: o(t-1), o(t), o(t+1). And it becomes like a regular neural network, unrolled some arbitrary number of times. The parameters, again: there are weights, there are biases. It's similar to CNNs, convolutional neural networks, in that convolutional neural networks make certain spatial consistency assumptions.

Recurrent neural networks, likewise, assume temporal consistency amongst the parameters. So they share the parameters. That w, that u, that v is the same for every single time step. So you're learning the same parameters no matter the duration of the sequence. And that allows you to look at arbitrarily long sequences without having an explosion of parameters.
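
A minimal sketch of that shared-parameter recurrence, following the u, w, v naming from the slide; the sizes are made up:

```python
import numpy as np

# Made-up sizes: input dim 8, hidden dim 16, output dim 4.
U = np.random.randn(16, 8) * 0.01   # input -> hidden
W = np.random.randn(16, 16) * 0.01  # hidden -> hidden (the loop)
V = np.random.randn(4, 16) * 0.01   # hidden -> output

def rnn_forward(xs):
    # xs: a list of input vectors, one per time step, of any length.
    s = np.zeros(16)  # hidden state, the network's memory
    outputs = []
    for x in xs:
        # The same U, W, V are reused at every single time step.
        s = np.tanh(U @ x + W @ s)
        outputs.append(V @ s)
    return outputs

outs = rnn_forward([np.random.randn(8) for _ in range(10)])  # 10 steps
```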

And this process is the same exact process that's repeated based on the different variants that we talked about before in terms of inputs and outputs. One to many, many to one, many to many. And the backpropagation process is exactly the same as for regular neural networks. It has a fancy name of backpropagation through time, BPTT.

But it's just backpropagation through an unrolled recurrent neural network, where the errors are computed on the outputs and the gradients are backpropagated to the inputs. Again, suffering from the same exact problem of vanishing gradients. Now the problem is that the depth of these networks can be arbitrarily long, right?

So if at any point the gradient hits a low number, 0, that neuron becomes, let's call it, saturated. That gradient drives all the earlier layers to 0. So it's easy to run into a problem where you're really ignoring the majority of the sequence. This is just another way, a Python pseudo-code way, to look at it.

You have the same W. Remember, you're sharing the weights and all the parameters from time step to time step. So if the weights W_hh are such that they produce a gradient that goes toward 0, that propagates through the rest.

So that's the pseudo-code for backpropagation, the backward pass through the RNN. That W_hh propagates back. And so you get these things with exploding and vanishing gradients. Here, for example, is an error surface for a single hidden unit RNN, visualizing the value of the weight, the value of the bias, and the error.
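
To see the effect numerically, here's a sketch of what repeated multiplication by W_hh does to a gradient over 50 time steps; the tanh local gradients are left out for clarity:

```python
import numpy as np

np.random.seed(0)
hidden, steps = 16, 50

for scale in [0.5, 1.5]:
    # Random recurrent weights whose spectral radius is roughly `scale`.
    W_hh = np.random.randn(hidden, hidden) * scale / np.sqrt(hidden)
    grad = np.ones(hidden)
    for _ in range(steps):
        grad = W_hh.T @ grad  # one step of backpropagation through time
    print(scale, np.linalg.norm(grad))
# scale 0.5 -> the norm collapses toward 0 (vanishing gradient)
# scale 1.5 -> the norm blows up (exploding gradient)
```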

So the error could be really flat or could explode. And both are going to lead to you making steps that are either too gradual or too big. That's the geometric interpretation. Okay, what other variants are there for RNNs that we'll look at a little bit? It doesn't have to be only one way.

It can be bidirectional. So there could be edges going forward and edges going back. What that's needed for is things like filling in missing elements of the data, whether that's images or words or audio. And generally, as is always the case in neural networks, the deeper you go, the better.

Deep here refers to the number of layers at a single temporal instance. So on the right of the slide, we're stacking layers, not in the temporal domain. Each of those layers has its own set of weights and its own set of biases. These things are awesome, but they need a lot of data when you add extra layers in this way.

Okay, so the problem is, while recurrent neural networks, in theory, are supposed to be able to learn any kind of sequence, the reality is they're not really good at remembering what happened a while ago, the long-term dependency. So here's a silly example. Let's think of a story about Bob.

Bob is eating an apple. So the apple part is generated by the recurrent neural network. The recurrent neural networks can learn to generate apple because they've seen a lot of sentences with Bob and eating, and they can generate the word apple. For a longer sentence, like Bob likes apples, he's hungry and decided to have a snack, so now he's eating an apple.

You have to maintain the state, that we're talking about Bob and we're talking about apples, through several discrete semantic sentences. And that kind of long-term memory is hard, in large part because of vanishing gradients. It's difficult to propagate the important stuff that happened a while ago in order to maintain that context when generating "apple" or classifying some concept that happened way down the line.

So when people talk about recurrent neural networks these days, they're talking about LSTMs, long short-term memory networks. So all the impressive results on time series, on audio, on video, all of that requires LSTMs. And so again, vanilla RNNs are up on top of the slide; each cell is simple.

There are some hidden units, there's an input, and there's an output. Here we'll use tanh as the activation function; it's just another popular sigmoid-type activation function. LSTMs are more complicated, or they look more complicated, but in some ways they're more intuitive for us to understand. There's a bunch of gates in each cell.

We'll go through those. In yellow are different neural network layers, with sigma and tanh being different types of activation functions. Tanh is an activation function that squishes the input to the range of -1 to 1. A sigmoid function squishes it between 0 and 1, and that serves different purposes.

There are some pointwise operations, addition, multiplication, and there are connections, data being passed from layer to layer, shown by the arrows. There's concatenation, and there's a copy operation on the output. So the output of each cell is copied to the next cell and to the output.

Let me try to clarify it a little bit. There's this conveyor belt going through inside each individual cell. There are really three steps on the conveyor belt. The first is a sigmoid function that's responsible for deciding what to forget and what to ignore. It's responsible for taking in the new input, x_t, taking in the state of the previous cell, the output of the previous time step, and deciding: do I want to keep that in my memory or not, and do I want to integrate the new input into my memory or not?

So this allows you to be selective about the information which you learn. So for example, the sentences: "Bob and Alice are having lunch. Bob likes apples. Alice likes oranges. She's eating an orange." Now, say you had a hidden state keeping track of the gender of the person we're talking about.

You might say that there's both genders in the first sentence, there's male in the second sentence, female in the third sentence. That way, when you have to generate a sentence about who's eating what, you'll keep the gender information in order to make an accurate generation of text corresponding to the proper person.

So you have to forget certain things, like forget that Bob existed at that moment, and you have to forget Bob likes apples, but you have to remember that Alice likes oranges. So you have to selectively remember and forget certain things. That's LSTM in a nutshell. So you decide what to forget, decide what to remember, and decide what to output at that cell.

All right, so let's zoom in a little bit, because this is pretty cool. There is a state running through the cell. It carries information from the previous state, like the gender that we're currently talking about; that's the state you're keeping track of, and it's running through the cell. And then there are three sigmoid layers outputting a number between 0 and 1: 1 when you want that information to pass through the conveyor belt that maintains the state, and 0 when you don't want it to go through.

And so the first sigmoid function is where we decide what to forget and what to ignore. That's the first one. You take the output from the previous time step and the input to the network from the current time step, and decide: do I want to forget, do I want to ignore those?

Then we decide which part of the state to update, what part of our memory we update with this information, and what values to insert in that update. The third step is we perform the actual update and the actual forgetting. That's where the sigmoid function comes in: you just multiply by it.

When it's 0, it's forgetting. When it's 1, that information passes through. And finally, we produce an output from the cell. So if it's translation, it's producing an output in the English language where the input was in the Spanish language. And then that same output is copied to the next cell.
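
Putting those three steps together: here's a minimal NumPy sketch of a single LSTM step, using the standard gate equations; the weight shapes are made up and biases are omitted:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, Wf, Wi, Wg, Wo):
    # Concatenate the previous output and the new input.
    z = np.concatenate([h_prev, x])

    f = sigmoid(Wf @ z)  # forget gate: 0 = erase, 1 = keep
    i = sigmoid(Wi @ z)  # input gate: which state entries to update
    g = np.tanh(Wg @ z)  # candidate values to write, in [-1, 1]
    o = sigmoid(Wo @ z)  # output gate: what to reveal from the state

    c = f * c_prev + i * g  # the conveyor belt: forget, then write
    h = o * np.tanh(c)      # cell output, copied to the next cell
    return h, c

# Made-up sizes: input dim 8, hidden/state dim 16.
n_in, n_h = 8, 16
Wf, Wi, Wg, Wo = [np.random.randn(n_h, n_h + n_in) * 0.1 for _ in range(4)]
h, c = np.zeros(n_h), np.zeros(n_h)
h, c = lstm_step(np.random.randn(n_in), h, c, Wf, Wi, Wg, Wo)
```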

Okay, so what can we get done with this kind of approach? We can look at machine translation. I guess what I'm trying to... Question. What is the representation of the state? Is it like a floating point or is it like a vector? What is it exactly? The state is the activations multiplied by the weights.

So it's the outputs of the sigmoid or the tanh activations. So there's a bunch of neurons, and they're firing a number between -1 and 1 or between 0 and 1. And that holds the state. Calling it state is sort of simplifying. But the point is that there's a bunch of numbers being constantly modified by the weights and the biases.

And those numbers hold the state. And the modification of those numbers is controlled by the weights. And then once all that is done, the resulting output of the recurrent neural network is compared to the desired output and the errors are back-propagated through the weights. Hopefully that made sense. So machine translation is one popular application.

And all of it is the same. All of these networks that I'll talk about, they're really similar constructs. You have some inputs, whatever language that is again. German maybe. I think everything is German. And the output, so the inputs are in one language, a set of characters that compose a word in one language.

There's a state being propagated. And once that sentence is over, you start, as opposed to collecting inputs, you start producing outputs. And you can output in the English language. There's a ton of great work on machine translation. It's what Google is mostly using for their translator. Same thing, I showed this previously, but now you know how it works.

Same exact thing: LSTMs generating handwritten characters, handwriting in arbitrary styles. So controlling the drawing, where the input is text and the output is handwriting. And it's again the same kind of network, with some depth here. The input is the text, the output is the control of the writing.

Character level text generation. This is the thing that told us about life. The meaning of life, literary recognition and the tradition of ancient human reproduction. That's again the same process. Input one character at a time, where you see there's an encoding of the characters on the input layer. There's a hidden state, hidden layer, that's keeping track of those activations, the outputs of the activation functions.

And every single time, it's outputting its best prediction of the next character that follows. Now in a lot of these applications, you want to ignore the output until the input sentence is over. And then you start listening to the output. But the point is it just keeps generating text, whether it's given input or not.

So providing input is just steering the recurrent neural network. You can answer questions about an image. You can almost arbitrarily stack things together. So you take an image as an input, bottom left there, put it into a convolutional neural network, and take the question.

There's something called word embeddings. It's to broaden the representative meaning of the words. So how many books is the question? So you want to take the word embeddings and the image and produce your best estimate of the answer. So for a question of what color is the cat, it could be gray or black.

There's the different LSTM flavors producing that answer. Same with counting chairs. You can give an image of a chair and ask the question how many chairs are there and it can produce an answer of three. So I should say that this is really hard, right? And it's an arbitrary question, ask of an arbitrary image.

So you're both interpreting, you do natural language processing and you're doing computer vision, all in one network. Same thing with image caption generation. You can detect the different objects in the scene, generate those words, stitch them together in syntactically correct sentences and re-rank the sentences. All of those are LSTMs, the second and the third step.

The first is computer vision detecting the objects, segmenting the image and detecting the objects. And that way you can generate a caption that says a man is sitting in a chair with a dog in his lap. Again, LSTMs for video. Caption generation for video. The input at every frame is an image that goes into the LSTM.

The input is an image and the output is a set of characters. First you load in the video, in this case the output is on top. You encode the video into a representation inside the network and then you start generating words about that video. First comes the input, the encoding stage, then the decoding stage.

Take in the video, say a man is taking, talking, whatever. And because the input and the output is arbitrary, there also has to be indicators of the beginnings and the ends of a sentence. So in this case, end of sentence. So you want to know when you stop. In order to generate syntactically correct sentences, you want to be able to generate a period that indicates the end of a sentence.

You can also use recurrent neural networks, LSTMs here, to control the steering of a sliding window on an image that's used to classify what's contained in that image. So here, a CNN is being steered by a recurrent neural network in order to convert this image into the number that's associated with the house number.

It's called visual attention. And that visual attention can be used to steer for the perception side and it can be used to steer a network for the generation. On the right, we can generate an image. So the output of the LSTM where the output at every time step is visual.

In this way, you can draw numbers. Here, I mentioned this before, is taking in as input silent video, sequence of images, and producing audio. So this is an LSTM that has convolutional layers for every single frame. It takes images as input and produces a spectrogram, audio as output. The training set is a person hitting an object with a drumstick and your task is to generate, given a silent video, generate the sound that a drumstick would make when in contact with that object.

Okay, medical diagnosis. That's actually, so I've listed some places where it's been really successful and pretty cool. But it's also beginning to be applied in places where it can actually really help civilization, right, in medical applications. So for medical diagnosis, there is a highly sparse and a variable length sequence of information in the form of, for example, patient electronic health records.

So every time you visit a doctor, there's some test being done and that information is there and you can look at it as a sequence over a period of time. And then given that data, that's the input, the output is a diagnosis, a medical diagnosis. So in this case, we can look at predicting diabetes, scoliosis, asthma, so on, with pretty good accuracy.

There's something that all of us wish we could do is stock market prediction. So you can input, for example, well, first of all, you can input the raw stock data, right, the order books and so on, financial data. But you can also look at news articles from all over the web and take those as input, as shown here, on the x-axis is time, so articles from different days, LSTM, once again, and produce an output of your prediction, binary prediction, whether the stock will go up or down.

And nobody's been able to really successfully do this, but there are a bunch of results trying to perform above random, which is how you make money, right, significantly above random on the prediction of whether it's going up or down, so you can buy or sell. And especially in the cases when there were crashes, it's easier to predict.

So you can predict an encroaching crash. These are shown in the table, the error rates for different stocks, automotive stocks. You can also generate audio. Through this exact same process you generate language, you generate audio. Here it's trained on a single speaker, a few hours of them speaking, and you just learn on the raw audio of the speaker.

And it's learning slowly to generate. (audience mumbling) Obviously, they were reading numbers, but that might... This is incredible. This is trained on a compressed spectrogram of the raw audio. And over just a few epochs, it's producing something that sounds like words. It could do this lecture for me, I wish.

(audience mumbling) Since... I don't know. This is amazing. This is raw input, raw output, all again LSTMs. And there's a lot of work in voice recognition and audio recognition. You're mapping... Let me turn it up. You're mapping any kind of audio to a classification. (audience mumbling) So, you can take the audio of the road, (audience mumbling) and that's a spectrogram on the bottom there being shown, and you could detect whether the road is wet or the road is dry.

(audience mumbling) And you could do the same thing for recognizing the gender of the speaker, or recognizing a many-to-many mapping of the actual words being spoken, speech recognition. But this is about driving, so let's see where recurrent neural networks apply in driving. We talked about the NVIDIA approach; the thing that actually powers DeepTesla JS is a simple convolutional neural network.

There are five convolutional layers in their approach, three fully connected layers. You can add as many layers as you want in DeepTesla. So, that's a quarter million parameters to optimize. And all you're taking is a single image, no temporal information, a single image, and producing a steering angle. That's the approach, that's the DeepTesla way.

So, taking a single image and learning a regression of a steering angle. Now, so one of the prizes for the competition is the Udacity self-driving car engineer nanodegree for free. This thing is awesome, I encourage everyone to check it out. But they did a competition that's very similar to ours.

But they have a very large group of obsessed people, so they were very clever. They went beyond just convolutional neural networks predicting steering, taking a sequence of images and predicting steering. I'll talk tomorrow about the second-place winner, who used 3D convolutional neural networks.

But the first and the third place winners used RNNs, used LSTMs, recurrent neural networks. And mapped a sequence of images to a sequence of steering angles. For anyone, statistically speaking, anybody here who's not a computer vision person, most likely what you want to use for whatever application you're interested in is RNNs.

It's just the world is full of time series data. Very few of us are working on data that's not time series data. In fact, whenever it's just snapshots, you're really just reducing the problem to the size that you can handle. But most data, the world is time series data.

So this is the approach you will end up using if you want to apply it in your own research. So this is, RNNs is the way to go. So, again, what are they doing? How do you put images into a recurrent neural network? It's the same thing. You take, you have to convert an image into numbers in some kind of way.

A powerful way of doing that is convolutional neural networks. So you can take either 3D convolutional neural networks or 2D convolutional neural networks, one that takes time into consideration and one that doesn't. So you process that image to extract a representation of it, and that becomes the input to the LSTM.

And the output at every single cell, at every single time step, is the predicted steering angle, the speed of the vehicle and the torque. That's what the first place winner did. They didn't just do steering angle, they also did the speed and the torque. And the sequence length that they were using for training and for testing for the input and the output is a sequence length of 10.
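
For the shape of that pipeline, here's a Keras-style sketch: a per-frame CNN feature extractor feeding an LSTM that emits steering, speed and torque at every step. This is not the winners' actual code; the layer sizes and frame size are made up:

```python
from tensorflow.keras import layers, models

seq_len, h, w = 10, 66, 200  # sequence length 10; made-up frame size

# Per-frame CNN feature extractor (sizes are illustrative).
cnn = models.Sequential([
    layers.Conv2D(24, 5, strides=2, activation='relu',
                  input_shape=(h, w, 3)),
    layers.Conv2D(36, 5, strides=2, activation='relu'),
    layers.Conv2D(48, 5, strides=2, activation='relu'),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
])

model = models.Sequential([
    # Apply the same CNN to each of the 10 frames.
    layers.TimeDistributed(cnn, input_shape=(seq_len, h, w, 3)),
    # LSTM over the per-frame features.
    layers.LSTM(64, return_sequences=True),
    # At every time step: steering angle, speed, torque.
    layers.TimeDistributed(layers.Dense(3)),
])
model.compile(optimizer='adam', loss='mse')
```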

(audience member asking question) The question was, do they use supervised learning? Yep. So they were given the same thing as in DeepTesla. A sequence of frames, whether you have a sequence of steering angle, speed and torque, I think there's other information too, available. So yeah, there's no reinforcement learning here.

Question? (audience member asking question) So the question was, how many LSTM gates are there in this problem? (audience member asking question) So this network, (audience member asking question) it's true that these diagrams kind of hide the number of parameters here. But it's arbitrary, just like convolutional neural networks are arbitrary.

So the size of the input is arbitrary, the size of the sigmoid and tanh layers is arbitrary. So you can make it as large as you want, as deep as you want. And the deeper and larger, the better. What these folks actually use, so the way these competitions work, I encourage you if you're interested in machine learning to participate in Kaggle, I don't know how to pronounce it, competitions, where basically everyone's doing the same thing.

You're using LSTMs, or if it's a one-to-one mapping, convolutional neural networks, fully connected networks, with some clever pre-processing. And the whole job, which takes months, and if you're a researcher that's what you'll be doing in your own research, is playing with parameters: playing with the pre-processing of the data, playing with the different parameters that control the size of the network, the learning rate.

I mentioned the type of optimizer; all these kinds of things, that's what you're playing with, using your own human intuition and whatever probing you can do in monitoring the performance of the network through time. Yes? Right. The question was: you said that there is a memory of 10 in this LSTM, and I thought RNNs are supposed to be arbitrary length.

So, it has to do with the training, how the network is trained. So it's trained with sequences of 10. The structure is still the same, you only have one cell that's looping onto each other. But the question is, in what chunks, what is the size of the sequence in which you're doing the training and then the testing?

So, you don't have to, it can be arbitrary length, it's just usually better to be consistent and have a fixed length. But you're not stacking 10 cells together, it's just a single cell still. So, the third place winner, Team Chauffeur, used something called transfer learning, and it's something I don't think I mentioned, but it's kind of implied.

The amazing power of neural networks is this. First, you need a lot of data to do anything; that's a cost, that's a limitation of neural networks. But what you can do is, there are neural networks that have been trained on very large datasets, on ImageNet. There's VGGNet, AlexNet, ResNet, all these networks trained on huge amounts of data.

But those networks are trained to tell the difference between a cat and a dog, or whatever, or to do specific object recognition in single images. How do I then take that network and apply it to my problem, say, of driving, of lane detection, or of classifying a medical diagnosis of cancer or not?

The beauty of neural networks is you don't, I mean, it depends, but the promise of transfer learning is that you can just take that network, chop off the final layer, the fully connected layer that maps from all those cool high-dimensional features that you've learned about the visual space, and as opposed to predicting cat versus dog, you teach it to predict cancer or no cancer.

You teach it to predict lane or no lane, truck or no truck. And so, as long as the visual space under which the networks operate is similar, or the data space, if it's audio or whatever, as long as the features you learned in studying the problem of cat versus dog deeply are useful, you have actually learned how to see the world.

And so you can apply that visual knowledge, you can transfer that learning to another domain. And that's the beautiful power of neural networks: they're transferable. And so what they did here is, I didn't spend enough time looking through the code, so I'm not sure which of the giant networks they took, but they took a giant convolutional neural network, they pruned it down, they chopped off the end layer, which produced 3,000 features. And they took those 3,000 features at every single image frame, that's the x_t, and they gave that as the input to an LSTM, and the sequence length in that case was 50.
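
As a sketch of that head-chopping idea, not Team Chauffeur's actual pipeline: take a pretrained network, freeze its features, and attach a new head for the new task. Keras-style, with VGG16 standing in for whichever giant network they used:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Take a network pretrained on ImageNet and chop off the final
# classification layers by asking for the convolutional base only.
base = VGG16(weights='imagenet', include_top=False,
             input_shape=(224, 224, 3))
base.trainable = False  # keep the learned visual features frozen

# Attach a new head, e.g. for lane / no lane.
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.Dense(1, activation='sigmoid'),  # binary: lane or no lane
])
model.compile(optimizer='adam', loss='binary_crossentropy')
```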

So this process is pretty similar across domains, that's the beauty of it. And the art of neural networks is in the, that's a good sign, I guess I should wrap it up. Anyway, I don't think I need much time. But the art of neural networks is in the hyperparameter tuning, and that's the tricky part, and that's the part you can't be taught, that's experience, sadly enough.

That's why, I talked about stochastic gradient descent, SGD, and that's why Geoffrey Hinton refers to it as stochastic graduate student descent. Meaning you just keep hiring graduate students to play with the hyperparameters until the problem is solved. So I have about 100-plus slides on driver state, which is the thing that I'm most passionate about, and I think we'll save the best for last, right?

And I'll talk about that tomorrow. We have a guest speaker from the White House, who'll talk about the future of artificial intelligence from the perspective of policy. And what I'd like you to do, first of all, if you're a registered student, is submit the two tutorial assignments. And pick up, can we just set up boxes right here or something?

Yeah, just stop by, pick up a shirt, and give us a card on the way. All right, thanks guys. (applause)