
MIT 6.S094: Recurrent Neural Networks for Steering Through Time


Chapters

0:00 Intro
0:44 Administrative
1:50 Flavors of Neural Networks
6:01 Back to Basics: Backpropagation
8:34 Backpropagation: Forward Pass
10:14 Backpropagation: By Example
11:58 Backpropagation: Backward Pass
13:41 Modular Magic: Chain Rule
14:55 Interpreting Gradients
18:27 Modularity Expanded: Sigmoid Activation Function
20:33 Learning with Backpropagation
25:27 Optimization is Hard: Dying ReLUs
26:13 Optimization is Hard: Saddle Point
27:39 Learning is an Optimization Problem
30:59 Optimization is Hard: Vanishing Gradients
32:52 Reflections on Backpropagation
36:07 Unrolling a Recurrent Neural Network
37:39 RNN Observations
38:39 Backpropagation Through Time (BPTT)
40:34 Gradients Can Explode or Vanish Geometric Interpretation
41:08 RNN Variants: Bidirectional RNNs
42:08 Long-Term Dependency
43:49 Long Short-Term Memory (LSTM) Networks
45:43 LSTM: Gates Regulate
45:48 LSTM: Pick What to Forget and What To Remember
48:08 LSTM Conveyor Belt
50:12 Application: Machine Translation
52:19 Application: Handwriting Generation from Text
52:59 Application: Character-Level Text Generation
54:6 Application: Image Question Answering
55:21 Application: Image Caption Generation
55:56 Application: Video Description Generation
56:50 Application: Modeling Attention Steering
57:33 Application: Drawing with Selective Attention Writing
58:03 Application: Adding Audio to Silent Film
59:00 Application: Medical Diagnosis


00:00:00.000 | All right, so we've talked about regular neural networks,
00:00:06.000 | fully connected neural networks.
00:00:07.800 | We've talked about convolutional neural networks that work with images.
00:00:12.200 | We've talked about reinforcement, deep reinforcement learning,
00:00:16.000 | where we plug a neural network into a reinforcement learning algorithm
00:00:21.300 | for when an agent, a system, has to not only perceive the world
00:00:26.200 | but also act in it and collect reward.
00:00:29.400 | And today we'll talk about perhaps the least understood
00:00:34.400 | but most exciting flavor of neural network out there:
00:00:37.400 | recurrent neural networks.
00:00:41.400 | So first, administrative stuff.
00:00:46.000 | There's a website, I don't know if you heard,
00:00:50.200 | cars.mit.edu, where you should create an account
00:00:53.700 | if you're a registered student.
00:00:55.000 | That's one of the requirements.
00:00:57.000 | You need to have an account if you want to get credit for this.
00:00:59.800 | You need to submit code for DeepTraffic.js and DeepTesla.js.
00:01:05.700 | And for DeepTraffic, you have to have a neural network
00:01:09.600 | that drives faster than 65 miles an hour.
00:01:11.800 | If you need help to achieve that speed, please email us.
00:01:15.800 | We can give you some hints.
00:01:19.000 | For those of you who are old school SNL fans,
00:01:23.000 | there's a deep thoughts section now in the profile page
00:01:28.600 | where we encourage you to talk about the kinds of things
00:01:32.300 | you tried in DeepTraffic or any of the other DeepTesla
00:01:36.400 | or any of the work you've done as part of this class for deep learning.
00:01:41.300 | We've talked about the vanilla neural networks on the left.
00:01:49.500 | The vanilla neural network is the one that's computing,
00:01:53.800 | approximating, a function that maps from one input to one output.
00:01:59.100 | An example is mapping images to the number that's shown in the image.
00:02:03.400 | For ImageNet, it's mapping an image to what's the object in the image.
00:02:07.900 | It could be anything.
00:02:09.500 | In fact, convolutional neural networks can operate on audio.
00:02:13.600 | You could give it a chunk of audio, five second audio clip.
00:02:17.500 | That still counts as one input because it's fixed size.
00:02:20.900 | As long as the size of the input is fixed, that's one chunk of input.
00:02:26.900 | And as long as you have ground truth that maps that chunk of input
00:02:31.000 | to some output ground truth, that's a vanilla neural network.
00:02:35.800 | Whether that's a fully connected neural network
00:02:38.200 | or a convolutional neural network.
00:02:40.200 | Today, we'll talk about the amazing,
00:02:43.800 | the mysterious recurrent neural networks.
00:02:48.200 | They compute functions from one to many,
00:02:51.600 | from many to one, from many to many.
00:02:54.800 | Also, bidirectional.
00:03:00.500 | What does that mean?
00:03:02.300 | They take as input sequences, time series, audio, video.
00:03:10.000 | Whenever there's a sequence of data
00:03:12.400 | and that temporal dynamics that connects the data
00:03:15.900 | is more important than the spatial content of each individual frame.
00:03:22.000 | When there's a lot of information being conveyed in the sequence,
00:03:26.300 | in the temporal change of whatever that type of data is,
00:03:30.500 | that's when you want to use recurrent neural networks.
00:03:32.600 | Like speech, natural language, audio.
00:03:38.100 | The power of recurrent neural networks,
00:03:41.500 | where they really shine,
00:03:44.900 | is when the size of the input is variable.
00:03:48.800 | You don't have a fixed chunk of data that you're putting in,
00:03:51.800 | it's variable input.
00:03:53.300 | The same goes for the output.
00:03:55.800 | You can give it a sequence of speech,
00:04:00.300 | several seconds of speech,
00:04:02.700 | and then the output is a single label
00:04:07.600 | of whether the speaker is male or female.
00:04:10.100 | That's many to one.
00:04:14.300 | You can also do many to many translation.
00:04:20.100 | You can have natural language put into the network
00:04:23.400 | in Spanish and the output is in English.
00:04:29.100 | Machine translation, that's many to many.
00:04:32.500 | And that many to many doesn't have to be
00:04:35.600 | mapped directly into same size sequences.
00:04:39.300 | So for video, the sequence size might be the same.
00:04:42.500 | You're labeling every single frame.
00:04:43.900 | You put in a five-second clip
00:04:47.100 | of somebody playing basketball.
00:04:51.100 | And you can label every single frame
00:04:52.800 | counting the number of people in every single frame.
00:04:55.000 | That's many to many when the size of the input
00:04:57.200 | and size of the output is the same.
00:04:58.600 | Yes, question.
00:05:00.100 | The question was, are there any models
00:05:02.800 | where there's feedback from output and input?
00:05:05.000 | And that's exactly what recurrent neural networks are.
00:05:08.900 | It produces output and it copies that output
00:05:13.600 | and loops it back in.
00:05:15.100 | That's exactly almost the definition
00:05:20.300 | of a recurrent neural network.
00:05:21.500 | There's a loop in there that produces the output
00:05:24.200 | and also takes that output as input once again.
00:05:27.000 | And so there's also many to many
00:05:31.200 | where the sequences don't align.
00:05:32.700 | Like machine translation,
00:05:34.600 | the size of the output sequence
00:05:37.000 | might be totally different than the input sequence.
00:05:38.900 | We'll look at a lot of cool applications
00:05:40.900 | like you can start a song.
00:05:45.400 | So learn on the audio of a particular song
00:05:48.700 | and have the recurrent neural network
00:05:51.200 | continue that song after a certain period of time.
00:05:55.000 | So you can learn to generate sequences
00:05:57.700 | of audio, of natural language, of video.
00:05:59.800 | I know I promised not many equations
00:06:06.800 | but this is so beautifully simple
00:06:10.800 | that we have to cover backpropagation.
00:06:13.000 | It's also the thing that if you're a little bit lazy
00:06:17.900 | and you go on the internet
00:06:19.300 | and start using the basic tutorials for TensorFlow,
00:06:22.500 | you ignore how backpropagation works at your peril.
00:06:26.400 | You kind of assume it just works.
00:06:29.100 | I give it some inputs, some outputs
00:06:30.800 | and it's like Lego pieces.
00:06:32.300 | I can assemble them like you might have done with DeepTraffic.
00:06:34.600 | A bunch of layers, put them together
00:06:36.500 | and then just press train.
00:06:38.900 | And backpropagation is the mechanism
00:06:41.200 | that neural networks currently,
00:06:42.700 | the best mechanism we know of,
00:06:44.500 | that is used for training.
00:06:45.800 | So we need to understand
00:06:47.300 | the simple power of backpropagation
00:06:52.100 | but also the dangers.
00:06:53.900 | Summary.
00:06:58.200 | So up at the top of the slide
00:07:01.300 | there's an input to the network, say an image.
00:07:04.300 | There's a bunch of neurons
00:07:06.900 | all with differentiable smooth activation functions
00:07:10.900 | on each neuron.
00:07:11.700 | And then as you pass through those activation functions,
00:07:17.500 | taking the input, pass it through
00:07:20.100 | this net of differentiable compute nodes,
00:07:25.000 | you produce an output.
00:07:26.500 | And for that output,
00:07:27.900 | you also have a ground truth,
00:07:30.400 | the correct output
00:07:32.800 | that you hope, that you expect the network to produce.
00:07:35.800 | And then you can look at the difference between
00:07:38.000 | what the network actually produced
00:07:39.500 | and what you hope it will produce
00:07:42.000 | and that's an error.
00:07:43.000 | And then you backpropagate that error
00:07:45.800 | punishing or rewarding the weights,
00:07:50.800 | the parameters of the network
00:07:52.100 | that resulted in that output.
00:07:53.800 | Let's start with a really simple example.
00:08:00.500 | There's a function
00:08:01.500 | that takes as input up on top
00:08:05.800 | three variables X, Y and Z.
00:08:08.800 | The function does two things.
00:08:11.200 | It adds X and Y
00:08:12.800 | and then it multiplies that sum by Z.
00:08:17.000 | And then we can formulate that as a circuit,
00:08:20.100 | circuit of gates
00:08:21.800 | where there's a plus gate
00:08:24.300 | and a multiplication gate.
00:08:27.300 | And let's take some inputs shown in blue.
00:08:31.100 | Let's say X is -2,
00:08:33.600 | Y is 5, Z is -4.
00:08:36.300 | And let's do a forward pass through this circuit
00:08:40.600 | to produce the output.
00:08:42.300 | So -2 + 5 = 3.
00:08:47.100 | Q is that intermediate value,
00:08:49.900 | the value that we're looking for.
00:08:51.900 | And then we continue the forward pass
00:08:53.900 | through the circuit
00:08:55.900 | with Q as that intermediate value, 3.
00:08:59.200 | This is so simple
00:09:02.900 | and so important to understand
00:09:05.100 | that I just want to take my time through this
00:09:07.100 | because everything else about neural networks
00:09:09.100 | just builds on these concepts.
00:09:10.500 | Okay, so the add gate produces Q.
00:09:15.700 | In this case it's 3.
00:09:17.100 | And then 3 * -4 is -12.
00:09:19.500 | That's the output.
00:09:20.500 | The output of the circuit of this network
00:09:24.900 | if you think of it as such
00:09:26.100 | is -12.
00:09:28.100 | And so the forward pass is shown in blue.
00:09:31.500 | The backward pass will be shown in red
00:09:33.500 | in a second here.
00:09:34.300 | So what we want to do is
00:09:35.900 | what would make us happy,
00:09:37.500 | what would make F happy
00:09:38.900 | is for the output to be as high as possible.
00:09:41.300 | -12 is so-so, we could do better.
00:09:43.900 | So how do we teach it?
00:09:45.500 | How do we adjust X, Y and Z
00:09:48.500 | such that it produces a higher output,
00:09:54.900 | one that makes us happier.
00:09:56.300 | Okay, let's start backward,
00:09:58.900 | the backward pass.
00:10:01.700 | We make the gradient on the output 1.
00:10:05.300 | Meaning we want this to increase.
00:10:07.500 | We want F to increase.
00:10:08.900 | That's how we encode our happiness.
00:10:10.500 | We want it to go up by 1.
00:10:13.900 | And in order to then propagate
00:10:19.900 | that fact that we want the F to go up by 1,
00:10:24.700 | we have to look at the gradient on each one of the gates.
00:10:30.100 | Now what's a gradient?
00:10:31.700 | It's a partial derivative
00:10:38.300 | with respect to its inputs.
00:10:42.100 | The partial derivative of the output of a gate
00:10:45.100 | with respect to its inputs.
00:10:46.500 | If you don't know what that means,
00:10:49.100 | it's just
00:10:51.900 | how much does the output change
00:10:55.900 | when I change the inputs a little bit.
00:10:58.300 | What is the slope of that change?
00:11:00.300 | If I increase X for the first function of addition,
00:11:03.500 | F(X, Y) = X + Y.
00:11:06.500 | If I increase X by a little bit,
00:11:08.500 | what happens to F?
00:11:09.700 | If I increase Y by a little bit,
00:11:11.300 | what happens to F?
00:11:12.300 | So taking a partial derivative of those
00:11:15.100 | with respect to X and Y,
00:11:16.500 | you just get a slope of 1.
00:11:19.100 | So when you increase X,
00:11:20.300 | F increases linearly.
00:11:22.500 | Same with Y.
00:11:23.900 | Multiplication is a little trickier.
00:11:26.300 | When you increase X,
00:11:30.100 | F increases by Y.
00:11:33.300 | So the partial derivative of F
00:11:35.900 | with respect to X is Y.
00:11:38.300 | The partial derivative of F with respect to Y is X.
00:11:41.100 | So if you think about it,
00:11:45.700 | what happens is
00:11:47.700 | the gradients, when you change X,
00:11:50.100 | the gradient of change
00:11:52.500 | doesn't care about X.
00:11:54.300 | It cares about Y.
00:11:57.100 | So it's flipped.
00:11:58.900 | So we can back propagate that 1,
00:12:01.500 | the indication of what makes us happy,
00:12:03.700 | backwards.
00:12:05.300 | And that's done by
00:12:09.100 | computing the local gradient.
00:12:11.100 | For Q,
00:12:17.100 | so the partial derivative of F
00:12:20.100 | with respect to Q, that intermediate value,
00:12:22.100 | that gradient will be -4.
00:12:26.100 | It will take the value of Z,
00:12:28.100 | as I said, it's a multiplication gate.
00:12:30.100 | It will take the value of Z
00:12:32.100 | and assign it to the gradient.
00:12:36.500 | And the same for the partial derivative of F
00:12:40.100 | with respect to Z,
00:12:41.100 | it will be assigned the value of Q,
00:12:43.100 | the value of Q from the forward pass.
00:12:45.100 | So there's a 3 and a -4 in the forward pass,
00:12:49.100 | in blue,
00:12:50.100 | and then that's flipped, -4 and 3,
00:12:53.100 | on the backward pass.
00:12:54.100 | That's the gradient.
00:12:55.100 | And then we continue in the same exact process,
00:12:59.100 | but wait.
00:13:01.100 | So what makes all of this work
00:13:06.100 | is the chain rule.
00:13:09.100 | It's magical.
00:13:10.100 | So what it allows us to do
00:13:14.100 | is to compute the gradient,
00:13:16.100 | the gradient on F with respect to the inputs, X, Y, Z.
00:13:23.100 | We don't need to construct
00:13:25.100 | the giant function that is
00:13:29.100 | the partial derivative of F
00:13:33.100 | with respect to X and Y and Z
00:13:35.100 | analytically.
00:13:37.100 | We can do it step by step,
00:13:38.100 | backpropagating the gradients.
00:13:40.100 | We can multiply the gradients together
00:13:42.100 | as opposed to doing partial derivative of F
00:13:44.100 | with respect to X.
00:13:46.100 | We have just the intermediate,
00:13:48.100 | the local gradient of F with respect to Q
00:13:51.100 | and of Q with respect to X
00:13:53.100 | and multiply them together.
00:13:55.100 | So instead of computing
00:14:00.100 | the gradient of that giant function,
00:14:04.100 | F = (X + Y) * Z,
00:14:06.100 | in this case it's not that giant,
00:14:08.100 | but it gets pretty giant in neural networks,
00:14:10.100 | we just go step by step.
00:14:12.100 | We look at the first function,
00:14:14.100 | simple addition,
00:14:15.100 | Q = X + Y
00:14:17.100 | and the second function, multiplication,
00:14:20.100 | F = Q * Z.
00:14:24.100 | So the gradient on X and Y,
00:14:29.100 | the partial derivative of F
00:14:33.100 | with respect to X and Y
00:14:36.100 | is computed by multiplying
00:14:38.100 | the gradient on the output, -4,
00:14:41.100 | times the gradient on the inputs,
00:14:44.100 | which as we talked about
00:14:46.100 | when the operation is addition,
00:14:48.100 | that's just 1.
00:14:49.100 | So it's -4 * 1.
00:14:51.100 | That means,
00:14:54.100 | what does that mean?
00:14:57.100 | Let's interpret those numbers.
00:15:00.100 | You now have gradients on X, Y and Z,
00:15:04.100 | the gradient of,
00:15:05.100 | the partial derivative of F
00:15:07.100 | with respect to X, Y, Z.
00:15:08.100 | That means,
00:15:10.100 | so for X and Y it's -4,
00:15:12.100 | for Z it's 3.
00:15:14.100 | That means in order to make F happy,
00:15:17.100 | we have to decrease
00:15:19.100 | the inputs that have a negative gradient
00:15:25.100 | and increase the inputs
00:15:27.100 | that have a positive gradient.
00:15:28.100 | The negative ones are X and Y,
00:15:30.100 | the positive is Z.
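
To make the numbers above concrete, here is a minimal sketch of that exact circuit in Python, computing the forward pass and then the gradients by hand with the chain rule (the values are the ones from the slide).

```python
# Minimal sketch of the example circuit: f(x, y, z) = (x + y) * z,
# with the values from the slide. Forward pass, then backward pass by hand.
x, y, z = -2.0, 5.0, -4.0

# Forward pass
q = x + y            # add gate: 3
f = q * z            # multiply gate: -12

# Backward pass: gradient of 1 on the output, then the chain rule
df_dq = z * 1.0      # multiply gate swaps forward values: -4
df_dz = q * 1.0      # 3
df_dx = 1.0 * df_dq  # add gate passes the gradient through: -4
df_dy = 1.0 * df_dq  # -4

print(q, f)                 # 3.0 -12.0
print(df_dx, df_dy, df_dz)  # -4.0 -4.0 3.0
```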
00:15:36.100 | Hopefully I don't say the word beautiful
00:15:37.100 | too many times in this presentation,
00:15:39.100 | but this is very simple,
00:15:41.100 | beautifully simple.
00:15:43.100 | Because this gradient is a local worker.
00:15:48.100 | It propagates for you,
00:15:50.100 | it has no knowledge of the broader
00:15:53.100 | happiness of F.
00:15:56.100 | It just propagates,
00:15:58.100 | it computes the gradient
00:15:59.100 | between the output and the input.
00:16:01.100 | And you can propagate this gradient
00:16:04.100 | based on, in this case,
00:16:06.100 | F, a gradient of 1,
00:16:08.100 | but also just the error.
00:16:10.100 | Instead of 1, we could have on the output
00:16:12.100 | the error is the measure of happiness.
00:16:14.100 | And then we could propagate that error backwards.
00:16:17.100 | These gates are important
00:16:18.100 | because you can break down
00:16:19.100 | almost every operation we could think of
00:16:21.100 | that we work with in neural networks
00:16:23.100 | into one of several gates like this.
00:16:27.100 | And the most popular are 3,
00:16:30.100 | which is addition, multiplication
00:16:31.100 | and the max operation.
00:16:33.100 | So for addition,
00:16:34.100 | what you do is you ignore the...
00:16:38.100 | Okay, the process is
00:16:39.100 | you take a forward pass to the network.
00:16:42.100 | So we have a value on every single gate.
00:16:46.100 | And then you take a backward pass.
00:16:49.100 | And through the backward pass
00:16:50.100 | you compute those gradients.
00:16:53.100 | For an add gate,
00:16:55.100 | you equally distribute the gradients
00:16:57.100 | on the output to the input.
00:16:58.100 | So when the gradient on the output is -4,
00:17:00.100 | you equally distribute -4 to both inputs.
00:17:06.100 | And you ignore the forward pass value.
00:17:09.100 | So that 3 is ignored when you back propagate it.
00:17:14.100 | On the multiplication gate,
00:17:16.100 | on the multiply gate,
00:17:18.100 | it's trickier.
00:17:19.100 | You switch the forward pass values.
00:17:23.100 | So if you look at F,
00:17:24.100 | that's a multiply gate.
00:17:28.100 | The forward pass values are switched
00:17:32.100 | and multiplied by the value of the gradient in the output.
00:17:37.100 | If it's confusing, go through the slides slowly.
00:17:41.100 | It'll make a lot more sense, hopefully.
00:17:45.100 | One more gate.
00:17:46.100 | There's the max gate,
00:17:47.100 | which takes the inputs and produces as output
00:17:53.100 | the value that is larger.
00:17:56.100 | And when computing the gradient of the max gate,
00:18:00.100 | it distributes the gradient
00:18:04.100 | similar to the add gate,
00:18:06.100 | but to only one.
00:18:10.100 | To only one of the inputs.
00:18:13.100 | The largest one.
00:18:15.100 | So it, unlike the add gate,
00:18:17.100 | pays attention to the input values on the forward pass.
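
A minimal sketch of these three gate rules, written as small backward functions; the function names are mine, not from the slides.

```python
# Backward rules for the three gates: add distributes the upstream gradient,
# multiply swaps the forward values, max routes everything to the larger input.

def add_backward(a, b, grad_out):
    return grad_out, grad_out

def mul_backward(a, b, grad_out):
    return b * grad_out, a * grad_out

def max_backward(a, b, grad_out):
    return (grad_out, 0.0) if a >= b else (0.0, grad_out)

# With the numbers from the example: the add gate receives -4 from above,
# the multiply gate receives 1 and has forward values 3 and -4.
print(add_backward(-2.0, 5.0, -4.0))  # (-4.0, -4.0)
print(mul_backward(3.0, -4.0, 1.0))   # (-4.0, 3.0)
```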
00:18:22.100 | Alright.
00:18:25.100 | Lots of numbers, but
00:18:27.100 | the whole point here is it's really simple.
00:18:32.100 | A neural network is just a simple collection of these gates.
00:18:37.100 | And you take a forward pass,
00:18:40.100 | you calculate some kind of function on the end,
00:18:42.100 | a gradient at the very end,
00:18:44.100 | and you propagate that back.
00:18:46.100 | So usually for neural networks, that's an error function.
00:18:49.100 | A loss function, objective function,
00:18:52.100 | cost function, all the same word.
00:18:56.100 | So that's the sigmoid function there.
00:18:59.100 | When you have three weights,
00:19:01.100 | w0, w1, w2,
00:19:05.100 | and x, two inputs, x0, x1,
00:19:09.100 | that's going to be the sigmoid function.
00:19:11.100 | That's how you compute the output
00:19:14.100 | of the neuron.
00:19:20.100 | But then you can decompose that neuron,
00:19:22.100 | you can separate it all into
00:19:24.100 | just a set of gates like this.
00:19:26.100 | Addition, multiplication,
00:19:28.100 | there's exponential in there, and division.
00:19:31.100 | They're all very similar.
00:19:33.100 | And you repeat the exact same process.
00:19:35.100 | So here,
00:19:38.100 | there's five inputs to the circuit:
00:19:40.100 | three weights,
00:19:46.100 | and two inputs, x0 and x1.
00:19:46.100 | You take a forward pass
00:19:49.100 | through this circuit.
00:19:52.100 | In this case, again,
00:19:54.100 | you want it to increase so that
00:19:56.100 | the gradient on the output is one.
00:19:59.100 | You back propagate that gradient
00:20:01.100 | of one to the inputs.
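
Here is a sketch of that decomposed sigmoid neuron, f = sigmoid(w0*x0 + w1*x1 + w2), with the forward pass and the backward pass done gate by gate; the specific numbers are illustrative, not taken from the slide.

```python
import math

# Sigmoid neuron decomposed into multiply, add, and sigmoid steps.
# Illustrative values for the five inputs: three weights and two inputs.
w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

# Forward pass
dot = w0 * x0 + w1 * x1 + w2       # 1.0
f = 1.0 / (1.0 + math.exp(-dot))   # ~0.73

# Backward pass with a gradient of 1 on the output.
# Useful identity: d(sigmoid)/d(dot) = f * (1 - f).
ddot = f * (1.0 - f) * 1.0         # ~0.20
dw0, dx0 = x0 * ddot, w0 * ddot    # multiply gate: swap and scale
dw1, dx1 = x1 * ddot, w1 * ddot
dw2 = 1.0 * ddot                   # add gate: pass the gradient through

print(f, dw0, dw1, dw2)
```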
00:20:04.100 | Now with neural networks,
00:20:06.100 | there's a bunch of parameters that you're trying to,
00:20:08.100 | through this process, modify.
00:20:10.100 | And you don't get to modify the inputs.
00:20:12.100 | You get to modify the weights along the way
00:20:15.100 | and the biases.
00:20:16.100 | The inputs are fixed,
00:20:17.100 | the outputs are fixed,
00:20:19.100 | the outputs that you hope
00:20:22.100 | the network will produce.
00:20:23.100 | What you're modifying is the weights.
00:20:25.100 | So you get to try to adjust those weights
00:20:30.100 | in the direction of the gradient.
00:20:35.100 | That's the task of back propagation.
00:20:38.100 | The main way that neural networks learn
00:20:41.100 | is we update the weights and the biases
00:20:44.100 | to decrease the loss function.
00:20:47.100 | The lower the loss function, the better.
00:20:50.100 | So in this case, you have
00:20:52.100 | three inputs on top left,
00:20:55.100 | the simple network, three inputs,
00:20:59.100 | three weights on each of the inputs.
00:21:01.100 | There's a bias on the node, b,
00:21:04.100 | and it produces an output, a.
00:21:08.100 | And that little symbol
00:21:10.100 | is indicating a sigmoid function.
00:21:15.100 | And loss is computed as y minus a squared
00:21:22.100 | divided by two.
00:21:25.100 | Where y is the ground truth,
00:21:28.100 | the output that you want the network to produce.
00:21:32.100 | And that loss function is back propagated
00:21:34.100 | in exactly the same way that we described before.
00:21:37.100 | So the subtasks that are involved in this update
00:21:40.100 | of weights and biases
00:21:42.100 | is that the forward pass computes
00:21:44.100 | the network output at every neuron,
00:21:47.100 | and then finally the output layer
00:21:50.100 | computes the error, the difference between a and y.
00:21:54.100 | And then backward propagates the gradients.
00:21:58.100 | Instead of one on the output,
00:22:00.100 | it'll be the error on the output
00:22:01.100 | and you back propagate it.
00:22:03.100 | And then once you know the gradient,
00:22:05.100 | you adjust the weights and the biases
00:22:07.100 | in the direction of the gradient.
00:22:09.100 | Or actually the opposite of the direction of the gradient
00:22:12.100 | because you want the loss to decrease.
00:22:14.100 | And the amount by which you make that adjustment
00:22:18.100 | is called the learning rate.
00:22:20.100 | The learning rate could be the same
00:22:21.100 | across the entire network,
00:22:23.100 | or it could be individual to every weight.
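
Putting the pieces together, here is a hedged sketch of one full update step for the single-neuron network described above, with the loss (y - a)^2 / 2 and a learning-rate step in the opposite direction of the gradient; variable names and values are mine.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])   # three inputs
w = np.array([0.1, 0.2, -0.3])   # three weights
b = 0.0                          # bias
y = 1.0                          # ground truth
learning_rate = 0.1

# Forward pass
a = sigmoid(np.dot(w, x) + b)

# Loss and backward pass
loss = 0.5 * (y - a) ** 2
delta = -(y - a) * a * (1.0 - a)  # dLoss/da times da/dz

# Update weights and bias opposite to the gradient, scaled by the learning rate
w -= learning_rate * delta * x
b -= learning_rate * delta
```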
00:22:27.100 | (deep breath)
00:22:30.100 | And the process of adjusting the weights and biases
00:22:36.100 | is just optimization.
00:22:38.100 | Learning is an optimization problem.
00:22:41.100 | You have an objective function
00:22:42.100 | and you're trying to minimize it.
00:22:44.100 | And your variables are the parameters,
00:22:46.100 | the weights and biases.
00:22:48.100 | And neural networks just happen to have
00:22:51.100 | tens, hundreds of thousands, millions of those parameters.
00:22:55.100 | So the space, the function that you're trying to minimize
00:22:58.100 | is highly non-linear.
00:23:00.100 | But it boils down to something like this.
00:23:02.100 | You have two weights, here are two plots.
00:23:06.100 | Or actually one weight, sorry, one weight.
00:23:09.100 | And then as you adjust it, the cost...
00:23:12.100 | You adjust in such a way that minimizes the output cost.
00:23:18.100 | And there's a bunch of optimization methods for doing this.
00:23:24.100 | You can...
00:23:26.100 | This is a convex function,
00:23:28.100 | so you can find the minimum,
00:23:31.100 | the local minimum,
00:23:32.100 | if you know about these kind of terminologies.
00:23:34.100 | The local minimum is the same as the global minimum.
00:23:36.100 | So there's not...
00:23:37.100 | It's not a weirdly hilly terrain
00:23:39.100 | where you can get stuck in...
00:23:42.100 | So your goal is to get to the bottom of this thing.
00:23:44.100 | And if it's really complex terrain,
00:23:46.100 | it'll be hard to get to the bottom of it.
00:23:48.100 | So there is a lot of different...
00:23:53.100 | The general approach is gradient descent.
00:23:56.100 | And there's a lot of different ways to do gradient descent.
00:23:58.100 | Some adding...
00:24:00.100 | In various ways of adding randomness into the process,
00:24:03.100 | so you don't get stuck into the weird crevices of the terrain.
00:24:09.100 | All right, but it's messy.
00:24:12.100 | You have to be really careful.
00:24:13.100 | This is the part you have to be aware of.
00:24:15.100 | When you're designing a network for deep traffic
00:24:18.100 | and nothing is happening,
00:24:20.100 | this might be what's happening.
00:24:23.100 | Vanishing gradients or exploding gradients.
00:24:28.100 | When the partial derivative is small,
00:24:32.100 | so if you take the sigmoid function,
00:24:35.100 | for a while the most popular activation function,
00:24:39.100 | the derivative is zero at the tails.
00:24:43.100 | So when the input to the sigmoid function
00:24:46.100 | is really high or really low,
00:24:48.100 | that derivative is going to be zero.
00:24:51.100 | So the gradient that you compute...
00:24:54.100 | Gradient tells you how much I want to adjust the weights.
00:24:57.100 | The gradient might be zero.
00:25:00.100 | And so you back propagate that zero,
00:25:02.100 | a very low number,
00:25:04.100 | and it gets less and less as you back propagate.
00:25:07.100 | And so the result is that you don't...
00:25:11.100 | You think that you don't need to adjust the weights at all.
00:25:15.100 | And when a large fraction of the network
00:25:17.100 | thinks that weights don't need to be adjusted,
00:25:20.100 | then they don't adjust the weights
00:25:21.100 | and you're not doing any learning.
00:25:23.100 | So the learning is slow.
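
The saturation described here is easy to see numerically: the sigmoid derivative sigma'(z) = sigma(z)(1 - sigma(z)) peaks at 0.25 and collapses toward zero for large positive or negative inputs. A small illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The derivative is largest at z = 0 (0.25) and nearly zero at the tails,
# which is where the vanishing gradient comes from.
for z in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    s = sigmoid(z)
    print(z, s * (1.0 - s))   # ~4.5e-05, 0.105, 0.25, 0.105, ~4.5e-05
```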
00:25:26.100 | There's some fixes to this.
00:25:31.100 | There's different types of functions.
00:25:33.100 | There's a piecewise, the Rayleigh function,
00:25:37.100 | which is the most popular activation function.
00:25:40.100 | But again, it suffers...
00:25:43.100 | If the neurons are initialized poorly,
00:25:48.100 | it might not...
00:25:49.100 | This function might not fire...
00:25:51.100 | It might be zero gradient for the entire dataset.
00:25:56.100 | Nothing that you produce as input...
00:26:01.100 | You run all your thousands of images of cats
00:26:04.100 | and none of them fire at all.
00:26:07.100 | So that's the danger here.
00:26:10.100 | So you have to pick these...
00:26:13.100 | Both the optimization engine,
00:26:17.100 | the solver that you use,
00:26:19.100 | and the activation functions carefully.
00:26:21.100 | You can't just plug and play like they're Legos.
00:26:25.100 | You have to be aware of the function.
00:26:28.100 | SGD, stochastic gradient descent,
00:26:35.100 | that's the vanilla optimization algorithm
00:26:41.100 | for gradient descent,
00:26:44.100 | for optimizing the loss function over the gradients.
00:26:48.100 | And so what's visualized here is,
00:26:50.100 | again, if you've done any numerical optimization,
00:26:54.100 | nonlinear optimization,
00:26:56.100 | there's the famous saddle point
00:26:58.100 | that's tricky for these algorithms to deal with.
00:27:02.100 | What happens is it's easy for them to oscillate,
00:27:06.100 | get stuck in that saddle and oscillate back and forth.
00:27:09.100 | As opposed to what they want to do,
00:27:11.100 | which is go down into...
00:27:14.100 | You get so happy that you found this low point,
00:27:20.100 | that you forget that there's a much lower point.
00:27:23.100 | And so you get stuck with the gradient,
00:27:25.100 | the momentum of the gradient keeps rocking you back and forth
00:27:28.100 | instead of moving on to a much better global minimum.
00:27:32.100 | And there's a bunch of clever ways of solving that.
00:27:35.100 | The Adam optimizer is one of those.
00:27:39.100 | But in this case, as long as the gradients don't vanish,
00:27:46.100 | SGD, the stochastic gradient descent,
00:27:49.100 | one of these algorithms will get you there.
00:27:51.100 | That might take a little while, but they'll get you there.
00:27:53.100 | And that's the main question.
00:27:57.100 | The question was,
00:28:00.100 | you're dealing with a function that's non-convex,
00:28:03.100 | and how do we ensure anything about it converging
00:28:07.100 | to anything that's reasonably good,
00:28:09.100 | the local optimum it converges to.
00:28:12.100 | And the answer is, you can't.
00:28:16.100 | This isn't only a nonlinear function,
00:28:19.100 | it's a highly nonlinear function.
00:28:22.100 | The power and the beauty of neural networks
00:28:24.100 | is that it can represent these arbitrarily complex functions.
00:28:32.100 | It's incredible, right?
00:28:34.100 | And you can learn those functions from data.
00:28:36.100 | But the reason people refer to neural networks training as art
00:28:42.100 | is you're trying to play with parameters
00:28:46.100 | that don't get stuck in these local optima
00:28:48.100 | for stupid reasons and for clever reasons.
00:28:50.100 | Yes, question.
00:28:53.100 | So the question, yeah.
00:28:55.100 | So continue on the same thread.
00:28:58.100 | So the thing is, we're dealing with functions
00:29:03.100 | where we don't know what the global optimal is.
00:29:06.100 | That's sort of the crux of it.
00:29:08.100 | Everything we talk about,
00:29:12.100 | interpreting text,
00:29:14.100 | interpreting video,
00:29:16.100 | even driving,
00:29:18.100 | what is the optimal for driving?
00:29:21.100 | Never crashing?
00:29:23.100 | It sounds easy to say that,
00:29:27.100 | but you actually have to formulate the world
00:29:29.100 | under which you define all of those things
00:29:31.100 | and that becomes really nonlinear objective function
00:29:34.100 | for which you don't know what the optimal is.
00:29:37.100 | It's just...
00:29:39.100 | That's why you just keep trying
00:29:42.100 | and get impressed every time it gets better.
00:29:44.100 | It's essentially the process.
00:29:47.100 | And you can also compare,
00:29:50.100 | you can compare the human level performance.
00:29:52.100 | So for ImageNet,
00:29:53.100 | we can tell the difference in cats and dogs
00:29:55.100 | in top five categories
00:29:57.100 | in 90,
00:29:59.100 | shoot,
00:30:00.100 | 96% of the time, whatever, accuracy.
00:30:03.100 | And then you get impressed
00:30:04.100 | when a machine can do better than that.
00:30:06.100 | But you don't know what the best is.
00:30:08.100 | These videos can be watched for hours.
00:30:17.100 | I won't play it until I explain the slide.
00:30:20.100 | So let's pause to reflect on backpropagation
00:30:23.100 | before I go on to recurrent neural networks.
00:30:25.100 | Yes, question.
00:30:26.100 | In a practical manner,
00:30:27.100 | how can you tell when you're actually training a net
00:30:30.100 | whether you're facing the vanishing gradient problem
00:30:33.100 | or you need to change your optimizer
00:30:37.100 | or you need to, I mean,
00:30:39.100 | like you've reached some local minimum?
00:30:42.100 | The question was,
00:30:45.100 | how do you practically know
00:30:47.100 | when you've hit the vanishing gradient problem?
00:30:51.100 | So the vanishing gradient could be,
00:30:53.100 | the derivative being zero on the gradient
00:31:03.100 | happens when the activation is exploding,
00:31:07.100 | so like really high values
00:31:09.100 | and really low values.
00:31:10.100 | The really high values is easy
00:31:12.100 | because they're like, your network is just going crazy,
00:31:14.100 | producing very large values.
00:31:17.100 | And you can fix a lot of those things
00:31:19.100 | by just capping the activations.
00:31:23.100 | The values being really low
00:31:28.100 | resulting in a vanishing gradient
00:31:30.100 | are really hard to detect.
00:31:32.100 | This is, I mean,
00:31:34.100 | there's a lot of research in trying to figure out
00:31:36.100 | how to detect these things,
00:31:39.100 | but if you're not careful,
00:31:41.100 | it's oftentimes you can find that,
00:31:46.100 | and this isn't hard to do,
00:31:50.100 | where like 40, 50% of the network,
00:31:53.100 | of the neurons are dead.
00:31:57.100 | We're going to call it like for ReLU,
00:31:59.100 | they're dead ReLU nodes.
00:32:00.100 | They're not firing at all.
00:32:02.100 | How do you detect that?
00:32:05.100 | That's part of learning.
00:32:07.100 | So if they never fire, you can detect that
00:32:08.100 | by running it through the entire training set.
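
A hedged sketch of that check: run the training set through the network, collect the post-ReLU activations of a layer, and count the units that never fire; the counting helper is mine.

```python
import numpy as np

def dead_relu_fraction(activations):
    # activations: array of shape (num_examples, num_units), post-ReLU outputs
    fired_at_least_once = (activations > 0).any(axis=0)
    return 1.0 - fired_at_least_once.mean()

# Toy example: the third unit never fires, so a third of the layer is dead.
acts = np.array([[0.5, 0.0, 0.0],
                 [1.2, 0.3, 0.0],
                 [0.0, 0.1, 0.0]])
print(dead_relu_fraction(acts))  # 0.333...
```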
00:32:10.100 | I mean, there's a lot of tricks,
00:32:12.100 | but that's the problem is
00:32:14.100 | you try to learn,
00:32:16.100 | and then you look at the loss function,
00:32:19.100 | and it's not converging to anything reasonable.
00:32:23.100 | It's either going all over the place
00:32:24.100 | or just converging very slowly,
00:32:26.100 | and that's an indication that something is wrong.
00:32:28.100 | That something could be the loss function is bad,
00:32:31.100 | that something could be that you've already found the optimal,
00:32:33.100 | or that something could be the vanishing gradient.
00:32:36.100 | And again, that's why it's an art.
00:32:41.100 | But certainly,
00:32:45.100 | at least some fraction of the neurons need to be firing.
00:32:49.100 | Otherwise, the initialization is really poorly done.
00:32:52.100 | Okay, so to reflect on the simplicity of backpropagation,
00:32:58.100 | and the power of it.
00:33:00.100 | So this is,
00:33:02.100 | this kind of step of backpropagating the loss function
00:33:04.100 | through the gradients locally
00:33:08.100 | is the way neural networks learn.
00:33:10.100 | We don't have,
00:33:12.100 | it's really the only way that we've effectively been able
00:33:16.100 | to train a neural network to learn a function.
00:33:21.100 | So adjusting the weights and biases,
00:33:23.100 | the huge number of weights and biases, the parameters,
00:33:26.100 | is just through this optimization.
00:33:28.100 | It's backpropagating the error
00:33:31.100 | where you have the supervised ground truth.
00:33:34.100 | So the question is whether this process of just fitting,
00:33:41.100 | adjusting the parameters of a highly nonlinear function
00:33:47.100 | to minimize a single objective,
00:33:50.100 | is the way you achieve intelligence,
00:33:55.100 | human level intelligence.
00:33:56.100 | And that's something to think about.
00:33:58.100 | You have to think about the, for driving purposes,
00:34:00.100 | what is the limitation of this approach?
00:34:04.100 | So what's not happening?
00:34:06.100 | The neural network design, the architecture,
00:34:09.100 | is not being adjusted.
00:34:10.100 | You're not evolving any of the edges, the layers,
00:34:14.100 | nothing is being evolved.
00:34:18.100 | And so there are other optimization approaches
00:34:22.100 | that I think are more
00:34:27.100 | interesting and inspiring than effective.
00:34:30.100 | So for example, this is using soft cubes to,
00:34:37.100 | so this is falling out of the field of evolutionary robotics,
00:34:43.100 | where you evolve the dynamics of a robot
00:34:47.100 | using genetic algorithms.
00:34:49.100 | And that's,
00:34:54.100 | so you can think of,
00:34:59.100 | so these robots are being taught to,
00:35:03.100 | in simulation obviously, to walk and to swim.
00:35:08.100 | So that one is swimming.
00:35:12.100 | But you could, the nice thing here is the dynamics,
00:35:17.100 | that highly nonlinear space as well,
00:35:19.100 | that controls the dynamics of this weird shaped robot,
00:35:24.100 | with a lot of degrees of freedom,
00:35:26.100 | is the same kind of thing as the neural network.
00:35:28.100 | And in fact, people have applied genetic algorithms
00:35:31.100 | and ant colony optimization,
00:35:33.100 | all kinds of sort of nature-inspired algorithms
00:35:36.100 | for optimizing the weights and the biases.
00:35:38.100 | But they don't seem to currently work that well.
00:35:40.100 | But it's kind of, it's a cool idea to be using
00:35:43.100 | nature-type evolutionary algorithms
00:35:45.100 | to evolve something that's already nature-inspired,
00:35:48.100 | which is neural networks.
00:35:50.100 | But something to think about is, you know,
00:35:55.100 | that backpropagation, while really simple,
00:35:57.100 | is kind of dumb.
00:35:59.100 | And the question is whether general intelligence reasoning
00:36:02.100 | could be achieved with this process.
00:36:04.100 | All right, recurrent neural networks.
00:36:07.100 | So on the left there, there's an input x
00:36:11.100 | with weights on the input u.
00:36:14.100 | There's a hidden state, a hidden layer s,
00:36:18.100 | with weights on the edge
00:36:26.100 | connecting the hidden states to each other.
00:36:29.100 | And then more weights v on the output o.
00:36:33.100 | It's a really simple network.
00:36:35.100 | There's inputs, there is hidden states,
00:36:39.100 | the memory of this network,
00:36:42.100 | and there's outputs.
00:36:44.100 | But the fact that there is this loop
00:36:50.100 | where the hidden states are connected to each other
00:36:53.100 | means that as opposed to taking a single input,
00:36:57.100 | the network takes an arbitrary number of inputs.
00:37:00.100 | It just keeps taking x one at a time,
00:37:03.100 | processing a sequence of x's through time.
00:37:08.100 | And so depending on the duration of the sequences you're interested in,
00:37:14.100 | you can think of this network in its unrolled state.
00:37:18.100 | So you can unroll this neural network
00:37:20.100 | where the inputs are on the bottom,
00:37:22.100 | x t minus one, x t, x t plus one.
00:37:25.100 | And same with the outputs, zero, sorry,
00:37:28.100 | o t minus one, o t, o t plus one.
00:37:32.100 | And it becomes like a regular neural network,
00:37:35.100 | unrolled some arbitrary number of times.
00:37:40.100 | The parameters, again, there's weights, there's biases.
00:37:44.100 | It's similar to CNNs, convolutional neural networks,
00:37:48.100 | in that it's just like convolutional neural networks
00:37:51.100 | make certain spatial consistency assumptions.
00:37:55.100 | The recurrent neural networks assume temporal consistency
00:37:59.100 | amongst the parameters.
00:38:00.100 | So it shares the parameters.
00:38:02.100 | That w, that u, that v is the same for every single time step.
00:38:08.100 | So you're learning the same parameter
00:38:11.100 | no matter the duration of the sequence.
00:38:14.100 | And that allows you to look at arbitrarily long sequences
00:38:19.100 | without having an explosion of parameters.
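
A minimal sketch of that recurrence with shared parameters, assuming the update s_t = tanh(U x_t + W s_{t-1}) and output o_t = V s_t; sizes and initialization are arbitrary.

```python
import numpy as np

input_size, hidden_size, output_size = 4, 8, 3
U = np.random.randn(hidden_size, input_size) * 0.1
W = np.random.randn(hidden_size, hidden_size) * 0.1
V = np.random.randn(output_size, hidden_size) * 0.1

def rnn_forward(xs):
    # The same U, W, V are reused at every time step, for any sequence length.
    s = np.zeros(hidden_size)
    outputs, states = [], []
    for x in xs:
        s = np.tanh(U @ x + W @ s)   # hidden state carries the memory forward
        outputs.append(V @ s)
        states.append(s)
    return outputs, states

sequence = [np.random.randn(input_size) for _ in range(10)]
outputs, states = rnn_forward(sequence)
```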
00:38:29.100 | And this process is the same exact process that's repeated
00:38:32.100 | based on the different variants that we talked about before
00:38:35.100 | in terms of inputs and outputs.
00:38:36.100 | One to many, many to one, many to many.
00:38:40.100 | And the backpropagation process is exactly the same
00:38:43.100 | as for regular neural networks.
00:38:45.100 | It has a fancy name of backpropagation through time, BPTT.
00:38:50.100 | But it's just backpropagation through an unrolled,
00:38:57.100 | unrolled recurring neural network
00:39:00.100 | where the errors are computed on the outputs,
00:39:04.100 | the gradients are computed,
00:39:07.100 | backpropagated and computed on the inputs.
00:39:12.100 | Again, suffering from the same exact problem
00:39:15.100 | of vanishing gradients.
00:39:18.100 | Now the problem is that the depth of these networks
00:39:21.100 | can be arbitrarily long, right?
00:39:22.100 | So if at any point the gradient hits a low number, 0,
00:39:29.100 | that neuron becomes saturated.
00:39:32.100 | That gradient, let's call it saturated,
00:39:34.100 | that gradient drives all the earlier layers to 0.
00:39:41.100 | So it's easy to run into a problem where
00:39:43.100 | you're really ignoring majority of the sequence.
00:39:47.100 | This is just another Python way,
00:39:51.100 | pseudo-code way to look at it.
00:39:54.100 | You have the same W.
00:39:55.100 | Remember, you're sharing the weights
00:39:58.100 | and all the parameters from time to time.
00:40:01.100 | So if the weights WHH are such that they produce
00:40:14.100 | a value that results
00:40:18.100 | in a gradient that goes to 0,
00:40:22.100 | that zero propagates through the rest of the sequence.
00:40:24.100 | So that's the pseudo-code for backpropagation,
00:40:26.100 | the backward pass through the RNN.
00:40:29.100 | That WHH propagates back.
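
The slide's pseudo-code isn't reproduced in the transcript, but here is a hedged sketch of the backward pass through time for the recurrence above; the point is the repeated multiplication by W (and by the tanh derivative) at every step back, which is exactly what makes the gradient vanish or explode.

```python
import numpy as np

def bptt_hidden_grads(states, W, d_last):
    # states: hidden states from the forward pass (list of vectors)
    # d_last: gradient of the loss with respect to the final hidden state
    d_s = d_last
    grads = []
    for s in reversed(states):
        d_s = d_s * (1.0 - s ** 2)  # backprop through tanh
        grads.append(d_s)
        d_s = W.T @ d_s             # repeated multiplication by W^T each step back
    return list(reversed(grads))
```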
00:40:35.100 | And so you get these things with exploding and vanishing gradients
00:40:39.100 | where this, for example,
00:40:41.100 | an error surface for a single hidden unit RNN.
00:40:45.100 | So this visualizes
00:40:47.100 | the value of the weight, the value of the bias,
00:40:51.100 | and the error.
00:40:52.100 | So the error could be really flat or could explode.
00:40:55.100 | And both are going to lead to you
00:41:02.100 | making steps that are either too gradual or too big.
00:41:06.100 | That's the geometric interpretation.
00:41:08.100 | Okay, what other variants that we'll look at a little bit
00:41:12.100 | are there for RNNs?
00:41:13.100 | It doesn't have to be only one way.
00:41:15.100 | It can be bidirectional.
00:41:16.100 | So there could be edges going forward and edges going back.
00:41:20.100 | What that's needed for is things like
00:41:25.100 | filling in missing, whatever the data is,
00:41:27.100 | filling in missing elements of that data,
00:41:29.100 | whether that's images or words or audio.
00:41:34.100 | And generally, as always is the case in neural networks,
00:41:37.100 | the deeper you go, the better.
00:41:38.100 | So this is deep referring to the number of layers
00:41:45.100 | at a single temporal instance.
00:41:48.100 | So on the right of the slide,
00:41:51.100 | we're stacking layers,
00:41:58.100 | not in the temporal domain.
00:41:58.100 | Each of those layers has its own set of weights
00:42:02.100 | and its own sets of biases.
00:42:05.100 | These things are awesome,
00:42:06.100 | but they need a lot of data
00:42:08.100 | when you add extra layers in this way.
00:42:16.100 | Okay, so the problem is,
00:42:18.100 | while recurrent neural networks,
00:42:20.100 | in theory, are supposed to be able to learn
00:42:22.100 | any kind of sequence,
00:42:25.100 | the reality is they're not really good at remembering
00:42:28.100 | what happened a while ago,
00:42:29.100 | the long-term dependency.
00:42:31.100 | So here's a silly example.
00:42:35.100 | Let's think of a story about Bob.
00:42:40.100 | Bob is eating an apple.
00:42:42.100 | So the apple part is generated
00:42:45.100 | by the recurrent neural network.
00:42:50.100 | The recurrent neural networks can learn to generate apple
00:42:53.100 | because they've seen a lot of sentences with Bob and eating,
00:42:56.100 | and they can generate the word apple.
00:42:59.100 | For a longer sentence, like Bob likes apples,
00:43:03.100 | he's hungry and decided to have a snack,
00:43:05.100 | so now he's eating an apple.
00:43:07.100 | You have to maintain the state
00:43:09.100 | that we're talking about Bob,
00:43:11.100 | and we're talking about apples,
00:43:13.100 | through several discrete semantic sentences.
00:43:20.100 | And that kind of long-term memory
00:43:23.100 | is hard to maintain, not because of different effects,
00:43:28.100 | but because of vanishing gradients.
00:43:30.100 | It's difficult to propagate the important stuff
00:43:34.100 | that happened a while ago,
00:43:35.100 | in order to maintain that context in generating apple
00:43:39.100 | or classifying some concept that happened way down the line.
00:43:44.100 | So when people talk about recurrent neural networks,
00:43:51.100 | these days, they're talking about LSTMs,
00:43:55.100 | long short-term memory networks.
00:44:00.100 | So all the impressive results on time series,
00:44:03.100 | on audio, on video, all of that,
00:44:05.100 | that requires LSTMs.
00:44:07.100 | And so again, vanilla RNNs up on top of the slide,
00:44:12.100 | each cell is simple.
00:44:16.100 | There's some hidden units,
00:44:18.100 | there's an input, and there's an output.
00:44:21.100 | Here we'll use tanh as the activation function.
00:44:27.100 | It's just another popular sigmoid type activation function.
00:44:35.100 | LSTMs are more complicated,
00:44:38.100 | or they look more complicated,
00:44:40.100 | but in some ways they're more intuitive for us to understand.
00:44:46.100 | There's a bunch of gates in each cell.
00:44:49.100 | We'll go through those.
00:44:51.100 | In yellow are different neural network layers.
00:44:55.100 | Sigma and tanh are different types of activation functions.
00:45:00.100 | Tanh is an activation function that squishes the input
00:45:05.100 | to the range of -1 to 1.
00:45:08.100 | A sigmoid function squishes it between 0 and 1,
00:45:13.100 | and that serves different purposes.
00:45:16.100 | There is some pointwise operations, addition, multiplication,
00:45:21.100 | and there is connections,
00:45:25.100 | so data being passed from layer to layer,
00:45:28.100 | shown by the arrows.
00:45:31.100 | There's concatenation and there's a copy operation on the output.
00:45:35.100 | So we copy, the output of each cell is copied to the next cell
00:45:40.100 | and to the output.
00:45:43.100 | Let me try to clarify it a little bit.
00:45:54.100 | There's this conveyor belt going through inside each individual cell.
00:46:00.100 | There's really three steps in the conveyor belt.
00:46:05.100 | The first is there is a sigmoid function
00:46:10.100 | that's responsible for deciding
00:46:15.100 | what to forget and what to ignore.
00:46:18.100 | It's responsible for taking in the input,
00:46:24.100 | the new input, XT,
00:46:26.100 | taking in the state of the previous,
00:46:31.100 | the output of the previous cell, previous time step,
00:46:35.100 | and deciding do I want to keep that in my memory or not,
00:46:39.100 | and do I want to integrate the new input into my memory or not.
00:46:44.100 | So this allows you to be selective about the information which you learn.
00:46:49.100 | So for example, the sentence "Bob and Alice are having lunch."
00:46:53.100 | Bob likes apples, Alice likes oranges, she's eating an orange.
00:46:59.100 | So Bob and Alice are having lunch.
00:47:05.100 | Bob likes apples.
00:47:06.100 | Right now, if you say you had a hidden state,
00:47:10.100 | keeping track of the gender of the person we're talking about.
00:47:16.100 | You might say that there's both genders in the first sentence,
00:47:19.100 | there's male in the second sentence, female in the third sentence.
00:47:23.100 | That way, when you have to generate a sentence about who's eating what,
00:47:27.100 | you'll keep the gender information
00:47:32.100 | in order to make an accurate generation of text
00:47:36.100 | corresponding to the proper person.
00:47:40.100 | So you have to forget certain things,
00:47:42.100 | like forget that Bob existed at that moment,
00:47:45.100 | and you have to forget Bob likes apples,
00:47:49.100 | but you have to remember that Alice likes oranges.
00:47:54.100 | So you have to selectively remember and forget certain things.
00:47:57.100 | That's LSTM in a nutshell.
00:48:00.100 | So you decide what to forget, decide what to remember,
00:48:03.100 | and decide what to output at that cell.
00:48:09.100 | All right, so zoom in a little bit, because this is pretty cool.
00:48:15.100 | There is a state running through the cell.
00:48:20.100 | This can carry information about the previous state,
00:48:23.100 | like the gender that we're currently talking about,
00:48:28.100 | that's the state that you're keeping track of,
00:48:31.100 | and that's running through the cell.
00:48:34.100 | And then there are three sigmoid layers
00:48:37.100 | outputting a number between 0 and 1,
00:48:42.100 | where it's 1 when you want that information to go through,
00:48:46.100 | and 0 when you don't want it to go through
00:48:51.100 | the conveyor belt that maintains the state.
00:48:55.100 | And so the first sigmoid function is where
00:48:58.100 | we decide what to forget and what to ignore.
00:49:01.100 | That's the first one.
00:49:03.100 | You take the inputs from the previous time step,
00:49:06.100 | the input to the network from the current time step,
00:49:10.100 | and decide do I want to forget, do I want to ignore those.
00:49:15.100 | Then we decide which part of the state to update.
00:49:21.100 | What part of our memory do we update with this information,
00:49:24.100 | and what values to insert in that update.
00:49:30.100 | Third step is we perform the actual update
00:49:34.100 | and perform the actual forgetting.
00:49:37.100 | So that's where you take the sigmoid output
00:49:40.100 | and just multiply by it.
00:49:42.100 | When it's 0, it's forgetting.
00:49:44.100 | When it's 1, that information passes through.
00:49:49.100 | And finally, we produce an output from the cell.
00:49:55.100 | So if it's translation,
00:49:58.100 | it's producing an output in the English language
00:50:01.100 | where the input was in the Spanish language.
00:50:03.100 | And then that same output is copied to the next cell.
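
A sketch of one LSTM cell step in the standard formulation that matches the three stages described above (forget, update, output); the weight and bias names are mine, and they're assumed to be defined elsewhere.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wf, bf, Wi, bi, Wc, bc, Wo, bo):
    z = np.concatenate([h_prev, x_t])   # previous output plus new input
    f = sigmoid(Wf @ z + bf)            # forget gate: what to keep of the old state
    i = sigmoid(Wi @ z + bi)            # input gate: what to write into the state
    c_tilde = np.tanh(Wc @ z + bc)      # candidate values to write
    c = f * c_prev + i * c_tilde        # the conveyor belt: forget, then update
    o = sigmoid(Wo @ z + bo)            # output gate: what to expose
    h = o * np.tanh(c)                  # cell output, copied to the next cell
    return h, c
```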
00:50:13.100 | Okay, so what can we get done with this kind of approach?
00:50:18.100 | We can look at machine translation.
00:50:20.100 | I guess what I'm trying to...
00:50:23.100 | Question.
00:50:25.100 | What is the representation of the state?
00:50:27.100 | Is it like a floating point or is it like a vector?
00:50:30.100 | What is it exactly?
00:50:33.100 | The state is the activation multiplied by the weight.
00:50:41.100 | So it's the outputs of the sigmoid or the tanh activations.
00:50:46.100 | So there's a bunch of neurons and they're firing a number
00:50:50.100 | between -1 and 1 or between 0 and 1.
00:50:53.100 | And that holds the state.
00:50:55.100 | It's just calling it state is sort of simplifying.
00:50:58.100 | But the point is that there's a bunch of numbers
00:51:00.100 | being constantly modified by the weights and the biases.
00:51:05.100 | And those numbers hold the state.
00:51:09.100 | And the modification of those numbers is controlled by the weights.
00:51:14.100 | And then once all that is done,
00:51:16.100 | the resulting output of the recurrent neural network
00:51:20.100 | is compared to the desired output
00:51:22.100 | and the errors are back-propagated through the weights.
00:51:27.100 | Hopefully that made sense.
00:51:30.100 | So machine translation is one popular application.
00:51:35.100 | And all of it is the same.
00:51:40.100 | All of these networks that I'll talk about,
00:51:42.100 | they're really similar constructs.
00:51:46.100 | You have some inputs, whatever language that is again.
00:51:52.100 | German maybe. I think everything is German.
00:51:58.100 | And the output, so the inputs are in one language,
00:52:03.100 | a set of characters that compose a word in one language.
00:52:08.100 | There's a state being propagated.
00:52:10.100 | And once that sentence is over,
00:52:12.100 | you start, as opposed to collecting inputs,
00:52:14.100 | you start producing outputs.
00:52:16.100 | And you can output in the English language.
00:52:19.100 | There's a ton of great work on machine translation.
00:52:23.100 | It's what Google is mostly using for their translator.
00:52:26.100 | Same thing, I showed this previously,
00:52:30.100 | but now you know how it works.
00:52:32.100 | Same exact thing, LSTMs, generating handwritten characters,
00:52:37.100 | so handwriting in arbitrary styles.
00:52:39.100 | So controlling the drawing,
00:52:43.100 | where the input is text and the output is handwriting.
00:52:46.100 | And it's again the same kind of network
00:52:51.100 | with some depth here.
00:52:53.100 | The inputs is the text,
00:52:55.100 | the output is the control of the writing.
00:52:59.100 | Character level text generation.
00:53:02.100 | This is the thing that told us about life.
00:53:06.100 | The meaning of life, literary recognition
00:53:09.100 | and the tradition of ancient human reproduction.
00:53:12.100 | That's again the same process.
00:53:16.100 | Input one character at a time,
00:53:18.100 | where you see there's an encoding of the characters
00:53:21.100 | on the input layer.
00:53:23.100 | There's a hidden state, hidden layer,
00:53:26.100 | that's keeping track of those activations,
00:53:28.100 | the outputs of the activation functions.
00:53:32.100 | And every single time,
00:53:38.100 | it's outputting its best prediction
00:53:42.100 | of the next character that follows.
00:53:44.100 | Now in a lot of these applications,
00:53:46.100 | you want to ignore the output
00:53:49.100 | until the input sentence is over.
00:53:52.100 | And then you start listening to the output.
00:53:55.100 | But the point is it just keeps generating text,
00:53:58.100 | whether it's given input or not.
00:54:00.100 | So you producing input is just adding,
00:54:03.100 | steering the recurrent neural network.
00:54:07.100 | You can answer questions
00:54:11.100 | about an image.
00:54:13.100 | So with the inputs here,
00:54:15.100 | you could almost arbitrarily stack things together.
00:54:18.100 | So you take an image as an input, bottom left there,
00:54:21.100 | put it into convolutional neural network
00:54:26.100 | and take the question.
00:54:30.100 | There's something called word embeddings.
00:54:33.100 | It's used to represent the broader meaning of the words.
00:54:37.100 | So how many books is the question?
00:54:40.100 | So you want to take the word embeddings and the image
00:54:43.100 | and produce your best estimate of the answer.
00:54:46.100 | So for a question of what color is the cat,
00:54:49.100 | it could be gray or black.
00:54:51.100 | There's the different LSTM flavors
00:54:54.100 | producing that answer.
00:54:56.100 | Same with counting chairs.
00:54:58.100 | You can give an image of a chair
00:55:00.100 | and ask the question how many chairs are there
00:55:03.100 | and it can produce an answer of three.
00:55:07.100 | So I should say that this is really hard, right?
00:55:11.100 | And it's an arbitrary question,
00:55:12.100 | ask of an arbitrary image.
00:55:14.100 | So you're both interpreting,
00:55:15.100 | you do natural language processing
00:55:17.100 | and you're doing computer vision,
00:55:19.100 | all in one network.
00:55:22.100 | Same thing with image caption generation.
00:55:26.100 | You can detect the different objects in the scene,
00:55:31.100 | generate those words,
00:55:33.100 | stitch them together in syntactically correct sentences
00:55:37.100 | and re-rank the sentences.
00:55:39.100 | All of those are LSTMs,
00:55:41.100 | the second and the third step.
00:55:43.100 | The first is computer vision detecting the objects,
00:55:46.100 | segmenting the image and detecting the objects.
00:55:48.100 | And that way you can generate a caption
00:55:50.100 | that says a man is sitting in a chair
00:55:52.100 | with a dog in his lap.
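A hedged sketch of the generation step only, assuming the vision stage has already produced a feature vector for the image; the start/end token indices and all sizes are made up for illustration, and the model here is untrained, so it only shows the mechanics of seeding an LSTM with image features and generating words until an end-of-sentence token.

```python
# Hypothetical caption-generation sketch (PyTorch): CNN features seed the
# LSTM state, then words are generated one at a time until an end token.
import torch
import torch.nn as nn

VOCAB, EMB, HID, IMG_FEAT = 3000, 64, 256, 256
END_TOKEN = 1                                     # assumed index of <end>

class Captioner(nn.Module):
    def __init__(self):
        super().__init__()
        self.img_to_h = nn.Linear(IMG_FEAT, HID)  # image features -> initial hidden state
        self.emb = nn.Embedding(VOCAB, EMB)
        self.lstm = nn.LSTM(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def generate(self, img_feat, max_len=15, start_token=0):
        h = torch.tanh(self.img_to_h(img_feat)).unsqueeze(0)   # (1, batch, HID)
        state = (h, torch.zeros_like(h))
        word = torch.tensor([[start_token]])
        caption = []
        for _ in range(max_len):
            out, state = self.lstm(self.emb(word), state)
            word = self.out(out[:, -1]).argmax(dim=-1, keepdim=True)  # greedy pick
            if word.item() == END_TOKEN:          # stop at end-of-sentence
                break
            caption.append(word.item())
        return caption

cnn_features = torch.randn(1, IMG_FEAT)           # stand-in for the vision stage's output
print(Captioner().generate(cnn_features))         # list of (arbitrary) word ids
```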
00:55:56.100 | Again, LSTMs for video.
00:56:00.100 | Caption generation for video.
00:56:03.100 | The input at every frame is an image
00:56:06.100 | that goes into the LSTM.
00:56:08.100 | The input is an image
00:56:11.100 | and the output is a set of characters.
00:56:13.100 | First you load in the video,
00:56:15.100 | in this case the output is on top.
00:56:17.100 | You encode the video
00:56:21.100 | into a representation inside the network
00:56:24.100 | and then you start generating words about that video.
00:56:27.100 | First comes the input, the encoding stage,
00:56:29.100 | then the decoding stage.
00:56:32.100 | Take in the video and output, say,
00:56:35.100 | "a man is talking," whatever.
00:56:37.100 | And because the input and the output is arbitrary,
00:56:41.100 | there also have to be indicators of the beginnings
00:56:43.100 | and the ends of a sentence.
00:56:46.100 | So in this case, end of sentence.
00:56:48.100 | So you want to know when you stop.
00:56:51.100 | In order to generate syntactically correct sentences,
00:56:54.100 | you want to be able to generate a period
00:56:57.100 | that indicates the end of a sentence.
00:57:01.100 | So you can also, again, use recurrent neural networks,
00:57:04.100 | LSTMs here, to control the steering
00:57:11.100 | of a sliding window over an image
00:57:15.100 | that's used to classify what's contained in that image.
00:57:19.100 | So here, a CNN being steered by a recurrent neural network
00:57:24.100 | in order to convert this image
00:57:28.100 | into the number that's associated with the house number.
00:57:33.100 | It's called visual attention.
00:57:35.100 | And that visual attention can be used to steer
00:57:37.100 | for the perception side
00:57:39.100 | and it can be used to steer a network for the generation.
00:57:43.100 | On the right, we can generate an image.
00:57:49.100 | So this is the output of the LSTM,
00:57:54.100 | where the output at every time step is visual.
00:57:59.100 | In this way, you can draw numbers.
00:58:06.100 | Here, I mentioned this before,
00:58:12.100 | is a network taking in as input silent video,
00:58:15.100 | a sequence of images,
00:58:19.100 | and producing audio.
00:58:22.100 | So this is an LSTM
00:58:26.100 | that has convolutional layers for every single frame.
00:58:32.100 | It takes images as input
00:58:35.100 | and produces a spectrogram, audio as output.
00:58:41.100 | The training set is a person hitting an object with a drumstick
00:58:49.100 | and your task is to generate, given a silent video,
00:58:53.100 | generate the sound that a drumstick would make
00:58:57.100 | when in contact with that object.
00:59:01.100 | Okay, medical diagnosis.
00:59:06.100 | That's actually, so I've listed some places
00:59:08.100 | where it's been really successful and pretty cool.
00:59:11.100 | But it's also beginning to be applied in places
00:59:14.100 | where it can actually really help civilization,
00:59:23.100 | right, in medical applications.
00:59:25.100 | So for medical diagnosis,
00:59:28.100 | there is a highly sparse
00:59:32.100 | and variable-length sequence of information
00:59:39.100 | in the form of, for example,
00:59:41.100 | patient electronic health records.
00:59:43.100 | So every time you visit a doctor,
00:59:45.100 | there's some test being done
00:59:46.100 | and that information is there
00:59:48.100 | and you can look at it as a sequence over a period of time.
00:59:51.100 | And then given that data, that's the input,
00:59:54.100 | the output is a diagnosis, a medical diagnosis.
01:00:00.100 | So in this case, we can look at predicting diabetes,
01:00:04.100 | scoliosis, asthma, and so on,
01:00:09.100 | with pretty good accuracy.
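A sketch of how such sparse, variable-length visit sequences could be fed to an LSTM, assuming made-up feature and label sizes; this is a generic padding-and-packing pattern, not the model from the cited work.

```python
# Hypothetical sketch: variable-length sequences of patient visits, each visit
# a feature vector (tests, measurements), mapped to multi-label diagnoses.
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

N_FEATURES, HID, N_DIAGNOSES = 40, 128, 10        # made-up sizes

class DiagnosisLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(N_FEATURES, HID, batch_first=True)
        self.head = nn.Linear(HID, N_DIAGNOSES)

    def forward(self, visits, lengths):
        # Pack so the LSTM ignores the padded time steps.
        packed = pack_padded_sequence(visits, lengths,
                                      batch_first=True, enforce_sorted=False)
        _, (h, _) = self.lstm(packed)
        return self.head(h[-1])                    # one logit per diagnosis (multi-label)

# Two patients with different numbers of visits (3 and 5).
patients = [torch.randn(3, N_FEATURES), torch.randn(5, N_FEATURES)]
lengths = torch.tensor([len(p) for p in patients])
batch = pad_sequence(patients, batch_first=True)   # pad to the longest sequence
logits = DiagnosisLSTM()(batch, lengths)
probs = torch.sigmoid(logits)                      # probability of each diagnosis
print(probs.shape)                                 # torch.Size([2, 10])
```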
01:00:13.100 | Something that all of us wish we could do
01:00:20.100 | is stock market prediction.
01:00:25.100 | So you can input, for example,
01:00:27.100 | well, first of all, you can input the raw stock data,
01:00:30.100 | right, the order books and so on, financial data.
01:00:33.100 | But you can also look at news articles
01:00:35.100 | from all over the web
01:00:37.100 | and take those as input, as shown here,
01:00:40.100 | on the x-axis is time,
01:00:42.100 | so articles from different days,
01:00:45.100 | LSTM, once again,
01:00:48.100 | and produce an output of your prediction,
01:00:51.100 | binary prediction, whether the stock will go up or down.
01:00:56.100 | And nobody's been able to really successfully do this,
01:00:59.100 | but there are a bunch of results
01:01:02.100 | trying to perform above random,
01:01:06.100 | which is how you make money, right,
01:01:10.100 | significantly above random,
01:01:12.100 | on the prediction of whether it's going up or down,
01:01:14.100 | so you can buy or sell.
01:01:16.100 | And especially,
01:01:19.100 | in the cases when there were crashes,
01:01:21.100 | it's easier to predict.
01:01:23.100 | So you can predict an approaching crash.
01:01:25.100 | These are shown in the table,
01:01:27.100 | the error rates for different stocks,
01:01:31.100 | automotive stocks.
01:01:35.100 | You can also generate audio.
01:01:38.100 | This exact same process you generate language,
01:01:40.100 | you generate audio.
01:01:42.100 | Here it's trained on a single speaker,
01:01:47.100 | a few hours of them speaking,
01:01:52.100 | and it just learns from the raw audio of the speaker.
01:01:58.100 | And it's learning slowly to generate.
01:02:02.100 | (audience mumbling)
01:02:19.100 | Obviously, they were reading numbers,
01:02:21.100 | but that might...
01:02:26.100 | This is incredible.
01:02:27.100 | This is trained on a compressed spectrogram
01:02:31.100 | of the audio, raw audio.
01:02:35.100 | And it's producing something that,
01:02:38.100 | over just a few epochs,
01:02:40.100 | it's producing something that sounds like words.
01:02:43.100 | It could do this lecture for me, I wish.
01:02:47.100 | (audience mumbling)
01:02:56.100 | Since...
01:02:59.100 | I don't know. This is amazing.
01:03:02.100 | This is raw input, raw output,
01:03:06.100 | all again LSTMs.
01:03:10.100 | And there's a lot of work in voice recognition
01:03:13.100 | and audio recognition.
01:03:14.100 | You're mapping...
01:03:17.100 | Let me turn it up.
01:03:22.100 | You're mapping any kind of audio to a classification.
01:03:25.100 | (audience mumbling)
01:03:29.100 | So, you can take the audio of the road,
01:03:32.100 | (audience mumbling)
01:03:35.100 | and that's a spectrogram on the bottom there being shown,
01:03:39.100 | and you could detect whether the road is wet
01:03:42.100 | or the road is dry.
01:03:44.100 | (audience mumbling)
01:03:47.100 | And you could do the same thing for
01:03:51.100 | recognizing the gender of the speaker
01:03:54.100 | or recognizing, in a many-to-many mapping,
01:03:57.100 | the actual words being spoken,
01:04:00.100 | speech recognition.
01:04:02.100 | But this is about driving,
01:04:04.100 | so let's see where recurrent neural networks apply in driving.
01:04:08.100 | We talked about the NVIDIA approach,
01:04:12.100 | the thing that actually powers DeepTesla JS,
01:04:16.100 | is a simple convolutional neural network.
01:04:18.100 | There are five convolutional layers in their approach,
01:04:22.100 | three fully connected layers.
01:04:24.100 | You can add as many layers as you want in DeepTesla.
01:04:29.100 | So, that's a quarter million parameters to optimize.
01:04:34.100 | And all you're taking is a single image,
01:04:37.100 | no temporal information, single image,
01:04:39.100 | and producing a steering angle.
01:04:40.100 | That's the approach, that's the DeepTesla way.
01:04:44.100 | So, taking a single image
01:04:49.100 | and learning a regression of a steering angle.
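A rough sketch of that single-image steering regression; the layer sizes below are loosely in the spirit of the published NVIDIA network but are not claimed to match it exactly.

```python
# Hypothetical single-frame steering regression sketch (PyTorch):
# one image in, one number (steering angle) out, no temporal information.
import torch
import torch.nn as nn

steering_net = nn.Sequential(
    # 5 convolutional layers (channel counts are illustrative)
    nn.Conv2d(3, 24, 5, stride=2), nn.ReLU(),
    nn.Conv2d(24, 36, 5, stride=2), nn.ReLU(),
    nn.Conv2d(36, 48, 5, stride=2), nn.ReLU(),
    nn.Conv2d(48, 64, 3), nn.ReLU(),
    nn.Conv2d(64, 64, 3), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    # 3 fully connected layers down to a single regressed steering angle
    nn.Linear(64, 100), nn.ReLU(),
    nn.Linear(100, 50), nn.ReLU(),
    nn.Linear(50, 1),
)

frame = torch.randn(1, 3, 66, 200)                  # a single camera frame
angle = steering_net(frame)                         # predicted steering angle
loss = nn.MSELoss()(angle, torch.tensor([[0.1]]))   # regression against the human angle
print(angle.shape, loss.item())
```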
01:04:53.100 | Now, so one of the prizes for the competition
01:04:59.100 | is the Udacity self-driving car engineer nanodegree for free.
01:05:06.100 | This thing is awesome,
01:05:07.100 | I encourage everyone to check it out.
01:05:09.100 | But they did a competition
01:05:12.100 | that's very similar to ours.
01:05:17.100 | But they have a very large group of obsessed people,
01:05:22.100 | so they were very clever.
01:05:25.100 | They went beyond just convolutional neural networks
01:05:27.100 | of predicting steering.
01:05:28.100 | So, taking a sequence of images and predicting steering.
01:05:31.100 | What they did is, the winners,
01:05:34.100 | at least the first, and I'll talk about the second place winner tomorrow,
01:05:38.100 | who used 3D convolutional neural networks.
01:05:43.100 | But the first and the third place winners used RNNs,
01:05:46.100 | used LSTMs, recurrent neural networks.
01:05:49.100 | And mapped a sequence of images
01:05:53.100 | to a sequence of steering angles.
01:05:55.100 | For anyone, statistically speaking,
01:06:00.100 | anybody here who's not a computer vision person,
01:06:03.100 | most likely what you want to use
01:06:05.100 | for whatever application you're interested in
01:06:07.100 | is RNNs.
01:06:09.100 | It's just the world is full of time series data.
01:06:12.100 | Very few of us are working on data
01:06:16.100 | that's not time series data.
01:06:18.100 | In fact, whenever it's just snapshots,
01:06:21.100 | you're really just reducing the problem
01:06:24.100 | to the size that you can handle.
01:06:26.100 | But most data, the world is time series data.
01:06:29.100 | So this is the approach you will end up using
01:06:32.100 | if you want to apply it in your own research.
01:06:36.100 | So this is it,
01:06:40.100 | RNNs are the way to go.
01:06:43.100 | So, again, what are they doing?
01:06:49.100 | How do you put images
01:06:53.100 | into a recurrent neural network?
01:06:55.100 | It's the same thing.
01:06:58.100 | You take,
01:07:00.100 | you have to convert an image into numbers
01:07:02.100 | in some kind of way.
01:07:04.100 | A powerful way of doing that is convolutional neural networks.
01:07:07.100 | So you can take
01:07:09.100 | either 3D convolutional neural networks
01:07:12.100 | or 2D convolutional neural networks,
01:07:15.100 | one that takes time into consideration and one that does not.
01:07:18.100 | So process that image
01:07:20.100 | to extract the representation of that image.
01:07:23.100 | And that becomes the input to the LSTM.
01:07:26.100 | And the output at every single cell,
01:07:29.100 | at every single time step,
01:07:31.100 | is the predicted steering angle,
01:07:33.100 | the speed of the vehicle and the torque.
01:07:35.100 | That's what the first place winner did.
01:07:37.100 | They didn't just do steering angle,
01:07:39.100 | they also did the speed and the torque.
01:07:42.100 | And the sequence length that they were using
01:07:45.100 | for training and for testing
01:07:48.100 | for the input and the output is a sequence length of 10.
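A hedged sketch of that setup: only the sequence length of 10 and the three per-step outputs (steering angle, speed, torque) come from the description above, while the feature extractor and all dimensions are stand-ins, not the winning team's actual code.

```python
# Hypothetical sketch: per-frame CNN features -> LSTM -> per-step prediction
# of steering angle, speed, and torque, over sequences of length 10.
import torch
import torch.nn as nn

SEQ_LEN, FEAT, HID = 10, 256, 128    # sequence length from the lecture; sizes assumed

frame_encoder = nn.Sequential(        # stand-in for a (2D or 3D) conv feature extractor
    nn.Conv2d(3, 16, 5, stride=4), nn.ReLU(),
    nn.Conv2d(16, 32, 5, stride=4), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, FEAT))

temporal = nn.LSTM(FEAT, HID, batch_first=True)
head = nn.Linear(HID, 3)              # steering angle, speed, torque at every step

frames = torch.randn(1, SEQ_LEN, 3, 120, 160)       # 10 consecutive camera frames
feats = frame_encoder(frames.flatten(0, 1))         # encode each frame: (10, FEAT)
feats = feats.view(1, SEQ_LEN, FEAT)                # back to (batch, time, FEAT)
out, _ = temporal(feats)                            # one hidden state per time step
predictions = head(out)                             # (1, 10, 3)
print(predictions.shape)
```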
01:07:52.100 | (audience member asking question)
01:07:57.100 | The question was, do they use supervised learning?
01:08:00.100 | Yep. So they were given the same thing as in DeepTesla.
01:08:03.100 | A sequence of frames, where you have a sequence of
01:08:06.100 | steering angles, speed and torque.
01:08:08.100 | I think there's other information available too.
01:08:11.100 | So yeah, there's no reinforcement learning here.
01:08:14.100 | Question?
01:08:15.100 | (audience member asking question)
01:08:26.100 | So the question was,
01:08:28.100 | how many LSTM gates are there in this problem?
01:08:31.100 | (audience member asking question)
01:08:34.100 | So this network,
01:08:36.100 | (audience member asking question)
01:08:41.100 | it's true that these diagrams kind of hide
01:08:45.100 | the number of parameters here.
01:08:47.100 | But it's arbitrary, just like convolutional neural networks are arbitrary.
01:08:50.100 | So the size of the input is arbitrary,
01:08:54.100 | the size of the sigmoid and tanh functions is arbitrary.
01:08:58.100 | So you can make it as large as you want, as deep as you want.
01:09:02.100 | And the deeper and larger, the better.
01:09:05.100 | What these folks actually use,
01:09:07.100 | so the way these competitions work,
01:09:11.100 | I encourage you if you're interested in machine learning
01:09:14.100 | to participate in Kaggle,
01:09:16.100 | I don't know how to pronounce it, competitions,
01:09:19.100 | where basically everyone's doing the same thing.
01:09:21.100 | You're using LSTMs, or if it's one-to-one mapping,
01:09:25.100 | using convolutional neural networks, fully connected networks,
01:09:28.100 | with some clever pre-processing.
01:09:30.100 | And the whole job, which takes months,
01:09:32.100 | and probably, if you're a researcher,
01:09:34.100 | what you'll be doing in your own research,
01:09:36.100 | is playing with parameters.
01:09:37.100 | Playing with pre-processing of the data,
01:09:39.100 | playing with the different parameters that control
01:09:41.100 | the size of the network, the learning rate.
01:09:44.100 | I mentioned this type of optimizer,
01:09:46.100 | all these kinds of things, that's what you're playing with.
01:09:49.100 | Using your own human intuition,
01:09:51.100 | and you're using your...
01:09:54.100 | whatever probing you can do in monitoring
01:09:59.100 | the performance of the network through time.
01:10:05.100 | Right.
01:10:16.100 | The question was,
01:10:18.100 | you said that there is a memory of 10 in this LSTM,
01:10:26.100 | and I thought RNNs are supposed to be arbitrary, or whatever.
01:10:30.100 | So, it has to do with the training,
01:10:36.100 | how the network is trained.
01:10:39.100 | So it's trained with sequences of 10.
01:10:41.100 | The structure is still the same,
01:10:42.100 | you only have one cell that's looping onto itself.
01:10:46.100 | But the question is, in what chunks,
01:10:51.100 | what is the size of the sequence
01:10:53.100 | in which you're doing the training and then the testing?
01:10:56.100 | So, you don't have to, it can be arbitrary length,
01:10:59.100 | it's just usually better to be consistent
01:11:02.100 | and have a fixed length.
01:11:04.100 | But you're not stacking 10 cells together,
01:11:10.100 | it's just a single cell still.
01:11:14.100 | So, the third place winner,
01:11:18.100 | Team Chauffeur,
01:11:20.100 | used something called transfer learning,
01:11:23.100 | and it's something I don't think I mentioned,
01:11:26.100 | but it's kind of implied.
01:11:29.100 | The amazing power of neural networks,
01:11:34.100 | so, first, you need a lot of data to do anything.
01:11:37.100 | So that's a cost, that's a limitation of neural networks.
01:11:40.100 | But what you could do is,
01:11:43.100 | so there's neural networks that have been trained
01:11:48.100 | on very large datasets, on ImageNet.
01:11:51.100 | There's VGGNet, AlexNet, ResNet,
01:11:56.100 | all these networks that train on huge amounts of data.
01:11:59.100 | But those networks are trained to tell the difference
01:12:03.100 | between a cat and a dog, or whatever,
01:12:05.100 | or to do specific object recognition in single images.
01:12:08.100 | How do I then take that network and apply it to my problem,
01:12:12.100 | say, of driving, of lane detection,
01:12:14.100 | or classifying medical diagnosis of cancer or not.
01:12:18.100 | The beauty of neural networks is you don't,
01:12:22.100 | I mean, it depends,
01:12:25.100 | but the promise of transfer learning
01:12:28.100 | is that you can just take that network,
01:12:30.100 | chop off the final layer,
01:12:32.100 | the fully connected layer that maps from all those
01:12:36.100 | cool high-dimensional features that you've learned about the visual space,
01:12:41.100 | and as opposed to predicting cat versus dog,
01:12:44.100 | you teach it to predict cancer or no cancer.
01:12:47.100 | You teach it to predict lane or no lane,
01:12:50.100 | truck or no truck.
01:12:52.100 | And so, as long as the visual space
01:12:54.100 | under which the networks operate is similar,
01:12:57.100 | or the data space, like if it's audio or whatever,
01:13:00.100 | if it's similar, if the features are useful that you learned,
01:13:04.100 | in studying the problem of cat versus dog deeply,
01:13:07.100 | you have learned actually how to see the world.
01:13:10.100 | And so you can apply that visual knowledge,
01:13:13.100 | you can transfer that learning to another domain.
01:13:17.100 | And that's the beautiful power of neural networks,
01:13:20.100 | is they're transferable.
01:13:22.100 | And so what they did here is they took,
01:13:27.100 | I didn't spend enough time looking through the code,
01:13:31.100 | so I'm not sure which of the giant networks they took,
01:13:34.100 | but they took a giant convolutional neural network,
01:13:38.100 | they pruned it down to, they chopped off the end layer,
01:13:43.100 | which produced 3,000 features,
01:13:45.100 | and they took those 3,000 features
01:13:47.100 | that every single image frame,
01:13:49.100 | and that's the XT,
01:13:51.100 | and they gave that as the input to LSTM,
01:13:54.100 | and the sequence length in that case was 50.
01:13:57.100 | So this process is pretty similar across domains,
01:14:05.100 | that's the beauty of it.
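A sketch of that transfer-learning recipe, with a torchvision ResNet standing in for whichever pretrained network the team actually used; only the sequence length of 50 comes from the description above, and the feature size here is ResNet-18's 512 rather than the roughly 3,000 features mentioned for their network.

```python
# Hypothetical transfer-learning sketch: take a pretrained image classifier,
# chop off its final classification layer, and feed the remaining features
# per frame into an LSTM that predicts steering over sequences of length 50.
import torch
import torch.nn as nn
import torchvision

backbone = torchvision.models.resnet18()    # in practice: load ImageNet-pretrained weights
backbone.fc = nn.Identity()                 # chop off the cat-vs-dog style classifier head
FEAT = 512                                  # ResNet-18 feature size (team's network: ~3,000)

SEQ_LEN = 50                                # sequence length mentioned in the lecture
temporal = nn.LSTM(FEAT, 128, batch_first=True)
steer_head = nn.Linear(128, 1)

frames = torch.randn(1, SEQ_LEN, 3, 224, 224)
with torch.no_grad():                       # the pretrained backbone can stay frozen
    feats = backbone(frames.flatten(0, 1)).view(1, SEQ_LEN, FEAT)
out, _ = temporal(feats)
steering = steer_head(out)                  # one steering prediction per frame
print(steering.shape)                       # torch.Size([1, 50, 1])
```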
01:14:07.100 | And the art of neural networks is in the,
01:14:13.100 | that's a good sign,
01:14:15.100 | I guess I should wrap it up.
01:14:20.100 | Anyway, I don't think I need much time.
01:14:24.100 | But the art of neural networks is in the hyperparameter tuning,
01:14:30.100 | and that's the tricky part,
01:14:32.100 | and that's the part you can't be taught,
01:14:34.100 | that's experience, sadly enough.
01:14:38.100 | That's why, I talked about
01:14:41.100 | stochastic gradient descent, SGD,
01:14:44.100 | that's why Geoffrey Hinton, I believe it was,
01:14:47.100 | refers to it as stochastic graduate student descent.
01:14:52.100 | Meaning you just keep hiring graduate students
01:14:55.100 | to play with the hyperparameters
01:14:57.100 | until the problem is solved.
01:15:02.100 | So I have about 100 plus slides on driver state,
01:15:11.100 | which is the thing that I'm most passionate about,
01:15:15.100 | and I think we'll save the best for last, right?
01:15:19.100 | And I'll talk about that tomorrow.
01:15:21.100 | We have a guest speaker from the White House,
01:15:25.100 | who'll talk about the future of artificial intelligence
01:15:27.100 | from the perspective of policy.
01:15:30.100 | And what I'd like you to do, first of all,
01:15:34.100 | if you're a registered student,
01:15:35.100 | submit the two tutorial assignments,
01:15:37.100 | and pick up,
01:15:40.100 | can we just set up boxes right here or something?
01:15:42.100 | Yeah, just stop by, pick up a shirt,
01:15:46.100 | and give us a card on the way.
01:15:48.100 | All right, thanks guys.
01:15:52.100 | (applause)