
MIT 6.S094: Recurrent Neural Networks for Steering Through Time


Chapters

0:00 Intro
0:44 Administrative
1:50 Flavors of Neural Networks
6:01 Back to Basics: Backpropagation
8:34 Backpropagation: Forward Pass
10:14 Backpropagation: By Example
11:58 Backpropagation: Backward Pass
13:41 Modular Magic: Chain Rule
14:55 Interpreting Gradients
18:27 Modularity Expanded: Sigmoid Activation Function
20:33 Learning with Backpropagation
25:27 Optimization is Hard: Dying ReLUs
26:13 Optimization is Hard: Saddle Point
27:39 Learning is an Optimization Problem
30:59 Optimization is Hard: Vanishing Gradients
32:52 Reflections on Backpropagation
36:07 Unrolling a Recurrent Neural Network
37:39 RNN Observations
38:39 Backpropagation Through Time (BPTT)
40:34 Gradients Can Explode or Vanish Geometric Interpretation
41:08 RNN Variants: Bidirectional RNNs
42:08 Long-Term Dependency
43:49 Long Short-Term Memory (LSTM) Networks
45:43 LSTM: Gates Regulate
45:48 LSTM: Pick What to Forget and What To Remember
48:08 LSTM Conveyor Belt
50:12 Application: Machine Translation
52:19 Application: Handwriting Generation from Text
52:59 Application: Character-Level Text Generation
54:6 Application: Image Question Answering
55:21 Application: Image Caption Generation
55:56 Application: Video Description Generation
56:50 Application: Modeling Attention Steering
57:33 Application: Drawing with Selective Attention Writing
58:03 Application: Adding Audio to Silent Film
59:00 Application: Medical Diagnosis


00:00:00.000 | All right, so we've talked about regular neural networks,
00:00:06.000 | fully connected neural networks.
00:00:07.800 | We've talked about convolutional neural networks that work with images.
00:00:12.200 | We've talked about reinforcement, deep reinforcement learning,
00:00:16.000 | where we plug a neural network into a reinforcement learning algorithm
00:00:21.300 | for when an agent, a system, has to not only perceive the world
00:00:26.200 | but also act in it and collect reward.
00:00:29.400 | And today we'll talk about perhaps the least understood
00:00:34.400 | but most exciting flavor of neural network out there:
00:00:37.400 | recurrent neural networks.
00:00:41.400 | So first, administrative stuff.
00:00:46.000 | There's a website, I don't know if you heard,
00:00:50.200 | cars.mit.edu, where you should create an account
00:00:53.700 | if you're a registered student.
00:00:55.000 | That's one of the requirements.
00:00:57.000 | You need to have an account if you want to get credit for this.
00:00:59.800 | You need to submit code for DeepTraffic.js and DeepTesla.js.
00:01:05.700 | And for DeepTraffic, you have to have a neural network
00:01:09.600 | that drives faster than 65 miles an hour.
00:01:11.800 | If you need help to achieve that speed, please email us.
00:01:15.800 | We can give you some hints.
00:01:19.000 | For those of you who are old school SNL fans,
00:01:23.000 | there's a deep thoughts section now in the profile page
00:01:28.600 | where we encourage you to talk about the kinds of things
00:01:32.300 | you tried in DeepTraffic or any of the other DeepTesla
00:01:36.400 | or any of the work you've done as part of this class for deep learning.
00:01:41.300 | We've talked about the vanilla neural networks on the left.
00:01:49.500 | The vanilla neural network is the one that's computing,
00:01:53.800 | approximating, a function that maps from one input to one output.
00:01:59.100 | An example is mapping images to the number that's shown in the image.
00:02:03.400 | For ImageNet, it's mapping an image to what's the object in the image.
00:02:07.900 | It could be anything.
00:02:09.500 | In fact, convolutional neural networks can operate on audio.
00:02:13.600 | You could give it a chunk of audio, five second audio clip.
00:02:17.500 | That still counts as one input because it's fixed size.
00:02:20.900 | As long as the size of the input is fixed, that's one chunk of input.
00:02:26.900 | And as long as you have ground truth that maps that chunk of input
00:02:31.000 | to some output ground truth, that's a vanilla neural network.
00:02:35.800 | Whether that's a fully connected neural network
00:02:38.200 | or a convolutional neural network.
00:02:40.200 | Today, we'll talk about the amazing,
00:02:43.800 | the mysterious recurrent neural networks.
00:02:48.200 | They compute functions from one to many,
00:02:51.600 | from many to one, from many to many.
00:02:54.800 | Also, bidirectional.
00:03:00.500 | What does that mean?
00:03:02.300 | They take as input sequences, time series, audio, video.
00:03:10.000 | Whenever there's a sequence of data
00:03:12.400 | and that temporal dynamics that connects the data
00:03:15.900 | is more important than the spatial content of each individual frame.
00:03:22.000 | When there's a lot of information being conveyed in the sequence,
00:03:26.300 | in the temporal change of whatever that type of data is,
00:03:30.500 | that's when you want to use recurrent neural networks.
00:03:32.600 | Like speech, natural language, audio.
00:03:38.100 | The power of recurrent neural networks,
00:03:41.500 | where they really shine,
00:03:44.900 | is when the size of the input is variable.
00:03:48.800 | You don't have a fixed chunk of data that you're putting in,
00:03:51.800 | it's variable input.
00:03:53.300 | The same goes for the output.
00:03:55.800 | You can give it a sequence of speech,
00:04:00.300 | several seconds of speech,
00:04:02.700 | and then the output is a single label
00:04:07.600 | of whether the speaker is male or female.
00:04:10.100 | That's many to one.
00:04:14.300 | You can also do many to many translation.
00:04:20.100 | You can have natural language put into the network
00:04:23.400 | in Spanish and the output is in English.
00:04:29.100 | Machine translation, that's many to many.
00:04:32.500 | And that many to many doesn't have to be
00:04:35.600 | mapped directly into same size sequences.
00:04:39.300 | So for video, the sequence size might be the same.
00:04:42.500 | You're labeling every single frame.
00:04:43.900 | You put in a five-second clip
00:04:47.100 | of somebody playing basketball.
00:04:51.100 | And you can label every single frame
00:04:52.800 | counting the number of people in every single frame.
00:04:55.000 | That's many to many when the size of the input
00:04:57.200 | and size of the output is the same.
00:04:58.600 | Yes, question.
00:05:00.100 | The question was, are there any models
00:05:02.800 | where there's feedback from output and input?
00:05:05.000 | And that's exactly what recurrent neural networks are.
00:05:08.900 | It produces output and it copies that output
00:05:13.600 | and loops it back in.
00:05:15.100 | That's exactly almost the definition
00:05:20.300 | of a recurrent neural network.
00:05:21.500 | There's a loop in there that produces the output
00:05:24.200 | and also takes that output as input once again.
00:05:27.000 | And so there's also many to many
00:05:31.200 | where the sequences don't align.
00:05:32.700 | Like machine translation,
00:05:34.600 | the size of the output sequence
00:05:37.000 | might be totally different than the input sequence.
00:05:38.900 | We'll look at a lot of cool applications
00:05:40.900 | like you can start a song.
00:05:45.400 | So learn on the audio of a particular song
00:05:48.700 | and have the recurrent neural network
00:05:51.200 | continue that song after a certain period of time.
00:05:55.000 | So you can learn to generate sequences
00:05:57.700 | of audio, of natural language, of video.
00:05:59.800 | I know I promised not many equations
00:06:06.800 | but this is so beautifully simple
00:06:10.800 | that we have to cover backpropagation.
00:06:13.000 | It's also the thing that if you're a little bit lazy
00:06:17.900 | and you go on the internet
00:06:19.300 | and start using the basic tutorials for TensorFlow,
00:06:22.500 | you ignore how backpropagation works at your peril.
00:06:26.400 | You kind of assume it just works.
00:06:29.100 | I give it some inputs, some outputs
00:06:30.800 | and it's like Lego pieces.
00:06:32.300 | I can assemble them like you might have done with DeepTraffic.
00:06:34.600 | A bunch of layers, put them together
00:06:36.500 | and then just press train.
00:06:38.900 | And backpropagation is the mechanism
00:06:41.200 | that neural networks currently,
00:06:42.700 | the best mechanism we know of,
00:06:44.500 | that is used for training.
00:06:45.800 | So we need to understand
00:06:47.300 | the simple power of backpropagation
00:06:52.100 | but also the dangers.
00:06:53.900 | Summary.
00:06:58.200 | So up at the top of the slide
00:07:01.300 | there's an input to the network, say an image.
00:07:04.300 | There's a bunch of neurons
00:07:06.900 | all with differentiable smooth activation functions
00:07:10.900 | on each neuron.
00:07:11.700 | And then as you pass through those activation functions,
00:07:17.500 | taking the input, pass it through
00:07:20.100 | this net of differentiable compute nodes,
00:07:25.000 | you produce an output.
00:07:26.500 | And for that output,
00:07:27.900 | you also have a ground truth,
00:07:30.400 | the correct output
00:07:32.800 | that you hope, that you expect the network to produce.
00:07:35.800 | And then you can look at the difference between
00:07:38.000 | what the network actually produced
00:07:39.500 | and what you hope it will produce
00:07:42.000 | and that's an error.
00:07:43.000 | And then you backpropagate that error
00:07:45.800 | punishing or rewarding the weights,
00:07:50.800 | the parameters of the network
00:07:52.100 | that resulted in that output.
00:07:53.800 | Let's start with a really simple example.
00:08:00.500 | There's a function
00:08:01.500 | that takes as input up on top
00:08:05.800 | three variables X, Y and Z.
00:08:08.800 | The function does two things.
00:08:11.200 | It adds X and Y
00:08:12.800 | and then it multiplies that sum by Z.
00:08:17.000 | And then we can formulate that as a circuit,
00:08:20.100 | circuit of gates
00:08:21.800 | where there's a plus gate
00:08:24.300 | and a multiplication gate.
00:08:27.300 | And let's take some inputs shown in blue.
00:08:31.100 | Let's say X is -2,
00:08:33.600 | Y is 5, Z is -4.
00:08:36.300 | And let's do a forward pass through this circuit
00:08:40.600 | to produce the output.
00:08:42.300 | So -2 + 5 = 3.
00:08:47.100 | Q is that intermediate value,
00:08:49.900 | the value that we're looking for.
00:08:51.900 | And then we continue the forward pass
00:08:53.900 | through the circuit
00:08:55.900 | with Q as that intermediate value, 3.
00:08:59.200 | This is so simple
00:09:02.900 | and so important to understand
00:09:05.100 | that I just want to take my time through this
00:09:07.100 | because everything else about neural networks
00:09:09.100 | just builds on these concepts.
00:09:10.500 | Okay, so the add gate produces Q.
00:09:15.700 | In this case it's 3.
00:09:17.100 | And then 3 * -4 is -12.
00:09:19.500 | That's the output.
00:09:20.500 | The output of the circuit of this network
00:09:24.900 | if you think of it as such
00:09:26.100 | is -12.
00:09:28.100 | And so the forward pass is shown in blue.
00:09:31.500 | The backward pass will be shown in red
00:09:33.500 | in a second here.
00:09:34.300 | So what we want to do is
00:09:35.900 | what would make us happy,
00:09:37.500 | what would make F happy
00:09:38.900 | is for the output to be as high as possible.
00:09:41.300 | -12 is so-so, we could do better.
00:09:43.900 | So how do we teach it?
00:09:45.500 | How do we adjust X, Y and Z
00:09:48.500 | such that it produces a higher output,
00:09:54.900 | one that makes us happier.
00:09:56.300 | Okay, let's start backward,
00:09:58.900 | the backward pass.
00:10:01.700 | We make the gradient on the output 1.
00:10:05.300 | Meaning we want this to increase.
00:10:07.500 | We want F to increase.
00:10:08.900 | That's how we encode our happiness.
00:10:10.500 | We want it to go up by 1.
00:10:13.900 | And in order to then propagate
00:10:19.900 | that fact that we want the F to go up by 1,
00:10:24.700 | we have to look at the gradient on each one of the gates.
00:10:30.100 | Now what's a gradient?
00:10:31.700 | It's a partial derivative
00:10:38.300 | with respect to its inputs.
00:10:42.100 | The partial derivative of the output of a gate
00:10:45.100 | with respect to its inputs.
00:10:46.500 | If you don't know what that means,
00:10:49.100 | it's just
00:10:51.900 | how much does the output change
00:10:55.900 | when I change the inputs a little bit.
00:10:58.300 | What is the slope of that change?
00:11:00.300 | If I increase X for the first function of addition,
00:11:03.500 | F(X, Y) = X + Y.
00:11:06.500 | If I increase X by a little bit,
00:11:08.500 | what happens to F?
00:11:09.700 | If I increase Y by a little bit,
00:11:11.300 | what happens to F?
00:11:12.300 | So taking a partial derivative of those
00:11:15.100 | with respect to X and Y,
00:11:16.500 | you just get a slope of 1.
00:11:19.100 | So when you increase X,
00:11:20.300 | F increases linearly.
00:11:22.500 | Same with Y.
00:11:23.900 | Multiplication is a little trickier.
00:11:26.300 | When you increase X,
00:11:30.100 | F increases by Y.
00:11:33.300 | So the partial derivative of F
00:11:35.900 | with respect to X is Y.
00:11:38.300 | The partial derivative of F with respect to Y is X.
00:11:41.100 | So if you think about it,
00:11:45.700 | what happens is
00:11:47.700 | the gradients, when you change X,
00:11:50.100 | the gradient of change
00:11:52.500 | doesn't care about X.
00:11:54.300 | It cares about Y.
00:11:57.100 | So it's flipped.
00:11:58.900 | So we can back propagate that 1,
00:12:01.500 | the indication of what makes us happy,
00:12:03.700 | backwards.
00:12:05.300 | And that's done by
00:12:09.100 | computing the local gradient.
00:12:11.100 | For Q,
00:12:17.100 | so the partial derivative of F
00:12:20.100 | with respect to Q, that intermediate value,
00:12:22.100 | that gradient will be -4.
00:12:26.100 | It will take the value of Z,
00:12:28.100 | as I said, it's a multiplication gate.
00:12:30.100 | It will take the value of Z
00:12:32.100 | and assign it to the gradient.
00:12:36.500 | And the same for the partial derivative of F
00:12:40.100 | with respect to Z,
00:12:41.100 | it will be assigned the value of Q,
00:12:43.100 | the value of Q from the forward pass.
00:12:45.100 | So there's a 3 and a -4 in the forward pass,
00:12:49.100 | in blue,
00:12:50.100 | and then that's flipped, -4 and 3,
00:12:53.100 | on the backward pass.
00:12:54.100 | That's the gradient.
00:12:55.100 | And then we continue in the same exact process,
00:12:59.100 | but wait.
00:13:01.100 | So what makes all of this work
00:13:06.100 | is the chain rule.
00:13:09.100 | It's magical.
00:13:10.100 | So what it allows us to do
00:13:14.100 | is to compute the gradient,
00:13:16.100 | the gradient on F with respect to the inputs, X, Y, Z.
00:13:23.100 | We don't need to construct
00:13:25.100 | the giant function that is
00:13:29.100 | the partial derivative of F
00:13:33.100 | with respect to X and Y and Z
00:13:35.100 | analytically.
00:13:37.100 | We can do it step by step,
00:13:38.100 | backpropagating the gradients.
00:13:40.100 | We can multiply the gradients together
00:13:42.100 | as opposed to doing partial derivative of F
00:13:44.100 | with respect to X.
00:13:46.100 | We have just the intermediate,
00:13:48.100 | the local gradient of F with respect to Q
00:13:51.100 | and of Q with respect to X
00:13:53.100 | and multiply them together.
00:13:55.100 | So instead of computing
00:14:00.100 | the gradient of that giant function,
00:14:04.100 | F = (X + Y) * Z,
00:14:06.100 | in this case it's not that giant,
00:14:08.100 | but it gets pretty giant in neural networks,
00:14:10.100 | we just go step by step.
00:14:12.100 | We look at the first function,
00:14:14.100 | simple addition,
00:14:15.100 | Q = X + Y
00:14:17.100 | and the second function, multiplication,
00:14:20.100 | F = Q * Z.
00:14:24.100 | So the gradient on X and Y,
00:14:29.100 | the partial derivative of F
00:14:33.100 | with respect to X and Y
00:14:36.100 | is computed by multiplying
00:14:38.100 | the gradient on the output, -4,
00:14:41.100 | times the gradient on the inputs,
00:14:44.100 | which as we talked about
00:14:46.100 | when the operation is addition,
00:14:48.100 | that's just 1.
00:14:49.100 | So it's -4 * 1.
00:14:51.100 | That means,
00:14:54.100 | what does that mean?
00:14:57.100 | Let's interpret those numbers.
00:15:00.100 | You now have gradients on X, Y and Z,
00:15:04.100 | the gradient of,
00:15:05.100 | the partial derivative of F
00:15:07.100 | with respect to X, Y, Z.
00:15:08.100 | That means,
00:15:10.100 | so for X and Y it's -4,
00:15:12.100 | for Z it's 3.
00:15:14.100 | That means in order to make F happy,
00:15:17.100 | we have to decrease
00:15:19.100 | the inputs that have a negative gradient
00:15:25.100 | and increase the inputs
00:15:27.100 | that have a positive gradient.
00:15:28.100 | The negative ones are X and Y,
00:15:30.100 | the positive is Z.
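
To make the numbers above concrete, here is a minimal sketch of that exact circuit in Python, computing the forward pass and then the gradients by hand with the chain rule (the values are the ones from the slide).

```python
# Minimal sketch of the example circuit: f(x, y, z) = (x + y) * z,
# with the values from the slide. Forward pass, then backward pass by hand.
x, y, z = -2.0, 5.0, -4.0

# Forward pass
q = x + y            # add gate: 3
f = q * z            # multiply gate: -12

# Backward pass: gradient of 1 on the output, then the chain rule
df_dq = z * 1.0      # multiply gate swaps forward values: -4
df_dz = q * 1.0      # 3
df_dx = 1.0 * df_dq  # add gate passes the gradient through: -4
df_dy = 1.0 * df_dq  # -4

print(q, f)                 # 3.0 -12.0
print(df_dx, df_dy, df_dz)  # -4.0 -4.0 3.0
```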
00:15:36.100 | Hopefully I don't say the word beautiful
00:15:37.100 | too many times in this presentation,
00:15:39.100 | but this is very simple,
00:15:41.100 | beautifully simple.
00:15:43.100 | Because this gradient is a local worker.
00:15:48.100 | It propagates for you,
00:15:50.100 | it has no knowledge of the broader
00:15:53.100 | happiness of F.
00:15:56.100 | It just propagates,
00:15:58.100 | it computes the gradient
00:15:59.100 | between the output and the input.
00:16:01.100 | And you can propagate this gradient
00:16:04.100 | based on, in this case,
00:16:06.100 | F, a gradient of 1,
00:16:08.100 | but also just the error.
00:16:10.100 | Instead of 1, we could have on the output
00:16:12.100 | the error is the measure of happiness.
00:16:14.100 | And then we could propagate that error backwards.
00:16:17.100 | These gates are important
00:16:18.100 | because you can break down
00:16:19.100 | almost every operation we could think of
00:16:21.100 | that we work with in neural networks
00:16:23.100 | into one of several gates like this.
00:16:27.100 | And the most popular are 3,
00:16:30.100 | which is addition, multiplication
00:16:31.100 | and the max operation.
00:16:33.100 | So for addition,
00:16:34.100 | what you do is you ignore the...
00:16:38.100 | Okay, the process is
00:16:39.100 | you take a forward pass to the network.
00:16:42.100 | So we have a value on every single gate.
00:16:46.100 | And then you take a backward pass.
00:16:49.100 | And through the backward pass
00:16:50.100 | you compute those gradients.
00:16:53.100 | For an add gate,
00:16:55.100 | you equally distribute the gradients
00:16:57.100 | on the output to the input.
00:16:58.100 | So when the gradient on the output is -4,
00:17:00.100 | you equally distribute -4 to both inputs.
00:17:06.100 | And you ignore the forward pass value.
00:17:09.100 | So that 3 is ignored when you back propagate it.
00:17:14.100 | On the multiplication gate,
00:17:16.100 | on the multiply gate,
00:17:18.100 | it's trickier.
00:17:19.100 | You switch the forward pass values.
00:17:23.100 | So if you look at F,
00:17:24.100 | that's a multiply gate.
00:17:28.100 | The forward pass values are switched
00:17:32.100 | and multiplied by the value of the gradient in the output.
00:17:37.100 | If it's confusing, go through the slides slowly.
00:17:41.100 | It'll make a lot more sense, hopefully.
00:17:45.100 | One more gate.
00:17:46.100 | There's the max gate,
00:17:47.100 | which takes the inputs and produces as output
00:17:53.100 | the value that is larger.
00:17:56.100 | And when computing the gradient of the max gate,
00:18:00.100 | it distributes the gradient
00:18:04.100 | similar to the add gate,
00:18:06.100 | but to only one.
00:18:10.100 | To only one of the inputs.
00:18:13.100 | The largest one.
00:18:15.100 | So it, unlike the add gate,
00:18:17.100 | pays attention to the input values on the forward pass.
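
A minimal sketch of these three gate rules, written as small backward functions; the function names are mine, not from the slides.

```python
# Backward rules for the three gates: add distributes the upstream gradient,
# multiply swaps the forward values, max routes everything to the larger input.

def add_backward(a, b, grad_out):
    return grad_out, grad_out

def mul_backward(a, b, grad_out):
    return b * grad_out, a * grad_out

def max_backward(a, b, grad_out):
    return (grad_out, 0.0) if a >= b else (0.0, grad_out)

# With the numbers from the example: the add gate receives -4 from above,
# the multiply gate receives 1 and has forward values 3 and -4.
print(add_backward(-2.0, 5.0, -4.0))  # (-4.0, -4.0)
print(mul_backward(3.0, -4.0, 1.0))   # (-4.0, 3.0)
```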
00:18:22.100 | Alright.
00:18:25.100 | Lots of numbers, but
00:18:27.100 | the whole point here is it's really simple.
00:18:32.100 | A neural network is just a simple collection of these gates.
00:18:37.100 | And you take a forward pass,
00:18:40.100 | you calculate some kind of function on the end,
00:18:42.100 | a gradient at the very end,
00:18:44.100 | and you propagate that back.
00:18:46.100 | So usually for neural networks, that's an error function.
00:18:49.100 | A loss function, objective function,
00:18:52.100 | cost function, all the same word.
00:18:56.100 | So that's the sigmoid function there.
00:18:59.100 | When you have three weights,
00:19:01.100 | w0, w1, w2,
00:19:05.100 | and x, two inputs, x0, x1,
00:19:09.100 | that's going to be the sigmoid function.
00:19:11.100 | That's how you compute the output
00:19:14.100 | of the neuron.
00:19:20.100 | But then you can decompose that neuron,
00:19:22.100 | you can separate it all into
00:19:24.100 | just a set of gates like this.
00:19:26.100 | Addition, multiplication,
00:19:28.100 | there's exponential in there, and division.
00:19:31.100 | They're all very similar.
00:19:33.100 | And you repeat the exact same process.
00:19:35.100 | So here,
00:19:38.100 | there's five inputs to the circuit:
00:19:40.100 | three weights,
00:19:46.100 | and two inputs, x0 and x1.
00:19:46.100 | You take a forward pass
00:19:49.100 | through this circuit.
00:19:52.100 | In this case, again,
00:19:54.100 | you want it to increase so that
00:19:56.100 | the gradient on the output is one.
00:19:59.100 | You back propagate that gradient
00:20:01.100 | of one to the inputs.
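
Here is a sketch of that decomposed sigmoid neuron, f = sigmoid(w0*x0 + w1*x1 + w2), with the forward pass and the backward pass done gate by gate; the specific numbers are illustrative, not taken from the slide.

```python
import math

# Sigmoid neuron decomposed into multiply, add, and sigmoid steps.
# Illustrative values for the five inputs: three weights and two inputs.
w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

# Forward pass
dot = w0 * x0 + w1 * x1 + w2       # 1.0
f = 1.0 / (1.0 + math.exp(-dot))   # ~0.73

# Backward pass with a gradient of 1 on the output.
# Useful identity: d(sigmoid)/d(dot) = f * (1 - f).
ddot = f * (1.0 - f) * 1.0         # ~0.20
dw0, dx0 = x0 * ddot, w0 * ddot    # multiply gate: swap and scale
dw1, dx1 = x1 * ddot, w1 * ddot
dw2 = 1.0 * ddot                   # add gate: pass the gradient through

print(f, dw0, dw1, dw2)
```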
00:20:04.100 | Now with neural networks,
00:20:06.100 | there's a bunch of parameters that you're trying to,
00:20:08.100 | through this process, modify.
00:20:10.100 | And you don't get to modify the inputs.
00:20:12.100 | You get to modify the weights along the way
00:20:15.100 | and the biases.
00:20:16.100 | The inputs are fixed,
00:20:17.100 | the outputs are fixed,
00:20:19.100 | the outputs that you hope
00:20:22.100 | the network will produce.
00:20:23.100 | What you're modifying is the weights.
00:20:25.100 | So you get to try to adjust those weights
00:20:30.100 | in the direction of the gradient.
00:20:35.100 | That's the task of back propagation.
00:20:38.100 | The main way that neural networks learn
00:20:41.100 | is we update the weights and the biases
00:20:44.100 | to decrease the loss function.
00:20:47.100 | The lower the loss function, the better.
00:20:50.100 | So in this case, you have
00:20:52.100 | three inputs on top left,
00:20:55.100 | the simple network, three inputs,
00:20:59.100 | three weights on each of the inputs.
00:21:01.100 | There's a bias on the node, b,
00:21:04.100 | and it produces an output, a.
00:21:08.100 | And that little symbol
00:21:10.100 | is indicating a sigmoid function.
00:21:15.100 | And loss is computed as y minus a squared
00:21:22.100 | divided by two.
00:21:25.100 | Where y is the ground truth,
00:21:28.100 | the output that you want the network to produce.
00:21:32.100 | And that loss function is back propagated
00:21:34.100 | in exactly the same way that we described before.
00:21:37.100 | So the subtasks that are involved in this update
00:21:40.100 | of weights and biases
00:21:42.100 | is that the forward pass computes
00:21:44.100 | the network output at every neuron,
00:21:47.100 | and then finally the output layer
00:21:50.100 | computes the error, the difference between a and y.
00:21:54.100 | And then backward propagates the gradients.
00:21:58.100 | Instead of one on the output,
00:22:00.100 | it'll be the error on the output
00:22:01.100 | and you back propagate it.
00:22:03.100 | And then once you know the gradient,
00:22:05.100 | you adjust the weights and the biases
00:22:07.100 | in the direction of the gradient.
00:22:09.100 | Or actually the opposite of the direction of the gradient
00:22:12.100 | because you want the loss to decrease.
00:22:14.100 | And the amount by which you make that adjustment
00:22:18.100 | is called the learning rate.
00:22:20.100 | The learning rate could be the same
00:22:21.100 | across the entire network,
00:22:23.100 | or it could be individual to every weight.
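
Putting the pieces together, here is a hedged sketch of one full update step for the single-neuron network described above, with the loss (y - a)^2 / 2 and a learning-rate step in the opposite direction of the gradient; variable names and values are mine.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])   # three inputs
w = np.array([0.1, 0.2, -0.3])   # three weights
b = 0.0                          # bias
y = 1.0                          # ground truth
learning_rate = 0.1

# Forward pass
a = sigmoid(np.dot(w, x) + b)

# Loss and backward pass
loss = 0.5 * (y - a) ** 2
delta = -(y - a) * a * (1.0 - a)  # dLoss/da times da/dz

# Update weights and bias opposite to the gradient, scaled by the learning rate
w -= learning_rate * delta * x
b -= learning_rate * delta
```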
00:22:27.100 | (deep breath)
00:22:30.100 | And the process of adjusting the weights and biases
00:22:36.100 | is just optimization.
00:22:38.100 | Learning is an optimization problem.
00:22:41.100 | You have an objective function
00:22:42.100 | and you're trying to minimize it.
00:22:44.100 | And your variables are the parameters,
00:22:46.100 | the weights and biases.
00:22:48.100 | And neural networks just happen to have
00:22:51.100 | tens, hundreds of thousands, millions of those parameters.
00:22:55.100 | So the space, the function that you're trying to minimize
00:22:58.100 | is highly non-linear.
00:23:00.100 | But it boils down to something like this.
00:23:02.100 | You have two weights, here are two plots.
00:23:06.100 | Or actually one weight, sorry, one weight.
00:23:09.100 | And then as you adjust it, the cost...
00:23:12.100 | You adjust in such a way that minimizes the output cost.
00:23:18.100 | And there's a bunch of optimization methods for doing this.
00:23:24.100 | You can...
00:23:26.100 | This is a convex function,
00:23:28.100 | so you can find the minimum,
00:23:31.100 | the local minimum,
00:23:32.100 | if you know about these kind of terminologies.
00:23:34.100 | The local minimum is the same as the global minimum.
00:23:36.100 | So there's not...
00:23:37.100 | It's not a weirdly hilly terrain
00:23:39.100 | where you can get stuck in...
00:23:42.100 | So your goal is to get to the bottom of this thing.
00:23:44.100 | And if it's really complex terrain,
00:23:46.100 | it'll be hard to get to the bottom of it.
00:23:48.100 | So there is a lot of different...
00:23:53.100 | The general approach is gradient descent.
00:23:56.100 | And there's a lot of different ways to do gradient descent.
00:23:58.100 | Some adding...
00:24:00.100 | In various ways of adding randomness into the process,
00:24:03.100 | so you don't get stuck into the weird crevices of the terrain.
00:24:09.100 | All right, but it's messy.
00:24:12.100 | You have to be really careful.
00:24:13.100 | This is the part you have to be aware of.
00:24:15.100 | When you're designing a network for deep traffic
00:24:18.100 | and nothing is happening,
00:24:20.100 | this might be what's happening.
00:24:23.100 | Vanishing gradients or exploding gradients.
00:24:28.100 | When the partial derivative is small,
00:24:32.100 | so if you take the sigmoid function,
00:24:35.100 | for a while the most popular activation function,
00:24:39.100 | the derivative is zero at the tails.
00:24:43.100 | So when the input to the sigmoid function
00:24:46.100 | is really high or really low,
00:24:48.100 | that derivative is going to be zero.
00:24:51.100 | So the gradient that you compute...
00:24:54.100 | Gradient tells you how much I want to adjust the weights.
00:24:57.100 | The gradient might be zero.
00:25:00.100 | And so you back propagate that zero,
00:25:02.100 | a very low number,
00:25:04.100 | and it gets less and less as you back propagate.
00:25:07.100 | And so the result is that you don't...
00:25:11.100 | You think that you don't need to adjust the weights at all.
00:25:15.100 | And when a large fraction of the network
00:25:17.100 | thinks that weights don't need to be adjusted,
00:25:20.100 | then they don't adjust the weights
00:25:21.100 | and you're not doing any learning.
00:25:23.100 | So the learning is slow.
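
The saturation described here is easy to see numerically: the sigmoid derivative sigma'(z) = sigma(z)(1 - sigma(z)) peaks at 0.25 and collapses toward zero for large positive or negative inputs. A small illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The derivative is largest at z = 0 (0.25) and nearly zero at the tails,
# which is where the vanishing gradient comes from.
for z in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    s = sigmoid(z)
    print(z, s * (1.0 - s))   # ~4.5e-05, 0.105, 0.25, 0.105, ~4.5e-05
```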
00:25:26.100 | There's some fixes to this.
00:25:31.100 | There's different types of functions.
00:25:33.100 | There's a piecewise, the Rayleigh function,
00:25:37.100 | which is the most popular activation function.
00:25:40.100 | But again, it suffers...
00:25:43.100 | If the neurons are initialized poorly,
00:25:48.100 | it might not...
00:25:49.100 | This function might not fire...
00:25:51.100 | It might be zero gradient for the entire dataset.
00:25:56.100 | Nothing that you produce as input...
00:26:01.100 | You run all your thousands of images of cats
00:26:04.100 | and none of them fire at all.
00:26:07.100 | So that's the danger here.
00:26:10.100 | So you have to pick these...
00:26:13.100 | Both the optimization engine,
00:26:17.100 | the solver that you use,
00:26:19.100 | and the activation functions carefully.
00:26:21.100 | You can't just plug and play like they're Legos.
00:26:25.100 | You have to be aware of the function.
00:26:28.100 | SGD, stochastic gradient descent,
00:26:35.100 | that's the vanilla optimization algorithm
00:26:41.100 | for gradient descent,
00:26:44.100 | for optimizing the loss function over the gradients.
00:26:48.100 | And so what's visualized here is,
00:26:50.100 | again, if you've done any numerical optimization,
00:26:54.100 | nonlinear optimization,
00:26:56.100 | there's the famous saddle point
00:26:58.100 | that's tricky for these algorithms to deal with.
00:27:02.100 | What happens is it's easy for them to oscillate,
00:27:06.100 | get stuck in that saddle and oscillate back and forth.
00:27:09.100 | As opposed to what they want to do,
00:27:11.100 | which is go down into...
00:27:14.100 | You get so happy that you found this low point,
00:27:20.100 | that you forget that there's a much lower point.
00:27:23.100 | And so you get stuck with the gradient,
00:27:25.100 | the momentum of the gradient keeps rocking you back and forth
00:27:28.100 | instead of moving on to a much better global minimum.
00:27:32.100 | And there's a bunch of clever ways of solving that.
00:27:35.100 | The Adam optimizer is one of those.
00:27:39.100 | But in this case, as long as the gradients don't vanish,
00:27:46.100 | SGD, the stochastic gradient descent,
00:27:49.100 | one of these algorithms will get you there.
00:27:51.100 | That might take a little while, but they'll get you there.
00:27:53.100 | And that's the main question.
00:27:57.100 | The question was,
00:28:00.100 | you're dealing with a function that's non-convex,
00:28:03.100 | and how do we ensure anything about it converging
00:28:07.100 | to anything that's reasonably good,
00:28:09.100 | the local optimum it converges to.
00:28:12.100 | And the answer is, you can't.
00:28:16.100 | This isn't only a nonlinear function,
00:28:19.100 | it's a highly nonlinear function.
00:28:22.100 | The power and the beauty of neural networks
00:28:24.100 | is that it can represent these arbitrarily complex functions.
00:28:32.100 | It's incredible, right?
00:28:34.100 | And you can learn those functions from data.
00:28:36.100 | But the reason people refer to neural networks training as art
00:28:42.100 | is you're trying to play with parameters
00:28:46.100 | that don't get stuck in these local optima
00:28:48.100 | for stupid reasons and for clever reasons.
00:28:50.100 | Yes, question.
00:28:53.100 | So the question, yeah.
00:28:55.100 | So continue on the same thread.
00:28:58.100 | So the thing is, we're dealing with functions
00:29:03.100 | where we don't know what the global optimal is.
00:29:06.100 | That's sort of the crux of it.
00:29:08.100 | Everything we talk about,
00:29:12.100 | interpreting text,
00:29:14.100 | interpreting video,
00:29:16.100 | even driving,
00:29:18.100 | what is the optimal for driving?
00:29:21.100 | Never crashing?
00:29:23.100 | It sounds easy to say that,
00:29:27.100 | but you actually have to formulate the world
00:29:29.100 | under which you define all of those things
00:29:31.100 | and that becomes really nonlinear objective function
00:29:34.100 | for which you don't know what the optimal is.
00:29:37.100 | It's just...
00:29:39.100 | That's why you just keep trying
00:29:42.100 | and get impressed every time it gets better.
00:29:44.100 | It's essentially the process.
00:29:47.100 | And you can also compare,
00:29:50.100 | you can compare the human level performance.
00:29:52.100 | So for ImageNet,
00:29:53.100 | we can tell the difference in cats and dogs
00:29:55.100 | in top five categories
00:29:57.100 | in 90,
00:29:59.100 | shoot,
00:30:00.100 | 96% of the time, whatever, accuracy.
00:30:03.100 | And then you get impressed
00:30:04.100 | when a machine can do better than that.
00:30:06.100 | But you don't know what the best is.
00:30:08.100 | These videos can be watched for hours.
00:30:17.100 | I won't play it until I explain the slide.
00:30:20.100 | So let's pause to reflect on backpropagation
00:30:23.100 | before I go on to recurrent neural networks.
00:30:25.100 | Yes, question.
00:30:26.100 | In a practical manner,
00:30:27.100 | how can you tell when you're actually training a net
00:30:30.100 | whether you're facing the vanishing gradient problem
00:30:33.100 | or you need to change your optimizer
00:30:37.100 | or you need to, I mean,
00:30:39.100 | like you've reached some local minimum?
00:30:42.100 | The question was,
00:30:45.100 | how do you practically know
00:30:47.100 | when you've hit the vanishing gradient problem?
00:30:51.100 | So the vanishing gradient could be,
00:30:53.100 | the derivative being zero on the gradient
00:31:03.100 | happens when the activation is exploding,
00:31:07.100 | so like really high values
00:31:09.100 | and really low values.
00:31:10.100 | The really high values is easy
00:31:12.100 | because they're like, your network is just going crazy,
00:31:14.100 | producing very large values.
00:31:17.100 | And you can fix a lot of those things
00:31:19.100 | by just capping the activations.
00:31:23.100 | The values being really low
00:31:28.100 | resulting in a vanishing gradient
00:31:30.100 | are really hard to detect.
00:31:32.100 | This is, I mean,
00:31:34.100 | there's a lot of research in trying to figure out
00:31:36.100 | how to detect these things,
00:31:39.100 | but if you're not careful,
00:31:41.100 | it's oftentimes you can find that,
00:31:46.100 | and this isn't hard to do,
00:31:50.100 | where like 40, 50% of the network,
00:31:53.100 | of the neurons are dead.
00:31:57.100 | We're going to call it like for ReLU,
00:31:59.100 | they're dead ReLU nodes.
00:32:00.100 | They're not firing at all.
00:32:02.100 | How do you detect that?
00:32:05.100 | That's part of learning.
00:32:07.100 | So if they never fire, you can detect that
00:32:08.100 | by running it through the entire training set.
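
A hedged sketch of that check: run the training set through the network, collect the post-ReLU activations of a layer, and count the units that never fire; the counting helper is mine.

```python
import numpy as np

def dead_relu_fraction(activations):
    # activations: array of shape (num_examples, num_units), post-ReLU outputs
    fired_at_least_once = (activations > 0).any(axis=0)
    return 1.0 - fired_at_least_once.mean()

# Toy example: the third unit never fires, so a third of the layer is dead.
acts = np.array([[0.5, 0.0, 0.0],
                 [1.2, 0.3, 0.0],
                 [0.0, 0.1, 0.0]])
print(dead_relu_fraction(acts))  # 0.333...
```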
00:32:10.100 | I mean, there's a lot of tricks,
00:32:12.100 | but that's the problem is
00:32:14.100 | you try to learn,
00:32:16.100 | and then you look at the loss function,
00:32:19.100 | and it's not converging to anything reasonable.
00:32:23.100 | It's either going all over the place
00:32:24.100 | or just converging very slowly,
00:32:26.100 | and that's an indication that something is wrong.
00:32:28.100 | That something could be the loss function is bad,
00:32:31.100 | that something could be that you've already found the optimal,
00:32:33.100 | or that something could be the vanishing gradient.
00:32:36.100 | And again, that's why it's an art.
00:32:41.100 | But certainly,
00:32:45.100 | at least some fraction of the neurons need to be firing.
00:32:49.100 | Otherwise, the initialization is really poorly done.
00:32:52.100 | Okay, so to reflect on the simplicity of backpropagation,
00:32:58.100 | and the power of it.
00:33:00.100 | So this is,
00:33:02.100 | this kind of step of backpropagating the loss function
00:33:04.100 | through the gradients locally
00:33:08.100 | is the way neural networks learn.
00:33:10.100 | We don't have,
00:33:12.100 | it's really the only way that we've effectively been able
00:33:16.100 | to train a neural network to learn a function.
00:33:21.100 | So adjusting the weights and biases,
00:33:23.100 | the huge number of weights and biases, the parameters,
00:33:26.100 | is just through this optimization.
00:33:28.100 | It's backpropagating the error
00:33:31.100 | where you have the supervised ground truth.
00:33:34.100 | So the question is whether this process of just fitting,
00:33:41.100 | adjusting the parameters of a highly nonlinear function
00:33:47.100 | to minimize a single objective,
00:33:50.100 | is the way you achieve intelligence,
00:33:55.100 | human level intelligence.
00:33:56.100 | And that's something to think about.
00:33:58.100 | You have to think about the, for driving purposes,
00:34:00.100 | what is the limitation of this approach?
00:34:04.100 | So what's not happening?
00:34:06.100 | The neural network design, the architecture,
00:34:09.100 | is not being adjusted.
00:34:10.100 | You're not evolving any of the edges, the layers,
00:34:14.100 | nothing is being evolved.
00:34:18.100 | And so there are other optimization approaches
00:34:22.100 | that I think are more
00:34:27.100 | interesting and inspiring than effective.
00:34:30.100 | So for example, this is using soft cubes to,
00:34:37.100 | so this is falling out of the field of evolutionary robotics,
00:34:43.100 | where you evolve the dynamics of a robot
00:34:47.100 | using genetic algorithms.
00:34:49.100 | And that's,
00:34:54.100 | so you can think of,
00:34:59.100 | so these robots are being taught to,
00:35:03.100 | in simulation obviously, to walk and to swim.
00:35:08.100 | So that one is swimming.
00:35:12.100 | But you could, the nice thing here is the dynamics,
00:35:17.100 | that highly nonlinear space as well,
00:35:19.100 | that controls the dynamics of this weird shaped robot,
00:35:24.100 | with a lot of degrees of freedom,
00:35:26.100 | is the same kind of thing as the neural network.
00:35:28.100 | And in fact, people have applied genetic algorithms
00:35:31.100 | and ant colony optimization,
00:35:33.100 | all kinds of sort of nature-inspired algorithms
00:35:36.100 | for optimizing the weights and the biases.
00:35:38.100 | But they don't seem to currently work that well.
00:35:40.100 | But it's kind of, it's a cool idea to be using
00:35:43.100 | nature-type evolutionary algorithms
00:35:45.100 | to evolve something that's already nature-inspired,
00:35:48.100 | which is neural networks.
00:35:50.100 | But something to think about is, you know,
00:35:55.100 | that backpropagation, while really simple,
00:35:57.100 | is kind of dumb.
00:35:59.100 | And the question is whether general intelligence reasoning
00:36:02.100 | could be achieved with this process.
00:36:04.100 | All right, recurrent neural networks.
00:36:07.100 | So on the left there, there's an input x
00:36:11.100 | with weights on the input u.
00:36:14.100 | There's a hidden state, a hidden layer s,
00:36:18.100 | with weights on the edge
00:36:26.100 | connecting the hidden states to each other.
00:36:29.100 | And then more weights v on the output o.
00:36:33.100 | It's a really simple network.
00:36:35.100 | There's inputs, there is hidden states,
00:36:39.100 | the memory of this network,
00:36:42.100 | and there's outputs.
00:36:44.100 | But the fact that there is this loop
00:36:50.100 | where the hidden states are connected to each other
00:36:53.100 | means that as opposed to taking a single input,
00:36:57.100 | the network takes an arbitrary number of inputs.
00:37:00.100 | It just keeps taking x one at a time,
00:37:03.100 | processing a sequence of x's through time.
00:37:08.100 | And so depending on the duration of the sequences you're interested in,
00:37:14.100 | you can think of this network in its unrolled state.
00:37:18.100 | So you can unroll this neural network
00:37:20.100 | where the inputs are on the bottom,
00:37:22.100 | x t minus one, x t, x t plus one.
00:37:25.100 | And same with the outputs, zero, sorry,
00:37:28.100 | o t minus one, o t, o t plus one.
00:37:32.100 | And it becomes like a regular neural network,
00:37:35.100 | unrolled some arbitrary number of times.
00:37:40.100 | The parameters, again, there's weights, there's biases.
00:37:44.100 | It's similar to CNNs, convolutional neural networks,
00:37:48.100 | in that it's just like convolutional neural networks
00:37:51.100 | make certain spatial consistency assumptions.
00:37:55.100 | The recurrent neural networks assume temporal consistency
00:37:59.100 | amongst the parameters.
00:38:00.100 | So it shares the parameters.
00:38:02.100 | That w, that u, that v is the same for every single time step.
00:38:08.100 | So you're learning the same parameter
00:38:11.100 | no matter the duration of the sequence.
00:38:14.100 | And that allows you to look at arbitrarily long sequences
00:38:19.100 | without having an explosion of parameters.
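
A minimal sketch of that recurrence with shared parameters, assuming the update s_t = tanh(U x_t + W s_{t-1}) and output o_t = V s_t; sizes and initialization are arbitrary.

```python
import numpy as np

input_size, hidden_size, output_size = 4, 8, 3
U = np.random.randn(hidden_size, input_size) * 0.1
W = np.random.randn(hidden_size, hidden_size) * 0.1
V = np.random.randn(output_size, hidden_size) * 0.1

def rnn_forward(xs):
    # The same U, W, V are reused at every time step, for any sequence length.
    s = np.zeros(hidden_size)
    outputs, states = [], []
    for x in xs:
        s = np.tanh(U @ x + W @ s)   # hidden state carries the memory forward
        outputs.append(V @ s)
        states.append(s)
    return outputs, states

sequence = [np.random.randn(input_size) for _ in range(10)]
outputs, states = rnn_forward(sequence)
```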
00:38:29.100 | And this process is the same exact process that's repeated
00:38:32.100 | based on the different variants that we talked about before
00:38:35.100 | in terms of inputs and outputs.
00:38:36.100 | One to many, many to one, many to many.
00:38:40.100 | And the backpropagation process is exactly the same
00:38:43.100 | as for regular neural networks.
00:38:45.100 | It has a fancy name of backpropagation through time, BPTT.
00:38:50.100 | But it's just backpropagation through an unrolled,
00:38:57.100 | unrolled recurring neural network
00:39:00.100 | where the errors are computed on the outputs,
00:39:04.100 | the gradients are computed,
00:39:07.100 | backpropagated and computed on the inputs.
00:39:12.100 | Again, suffering from the same exact problem
00:39:15.100 | of vanishing gradients.
00:39:18.100 | Now the problem is that the depth of these networks
00:39:21.100 | can be arbitrarily long, right?
00:39:22.100 | So if at any point the gradient hits a low number, 0,
00:39:29.100 | that neuron becomes saturated.
00:39:32.100 | That gradient, let's call it saturated,
00:39:34.100 | that gradient drives all the earlier layers to 0.
00:39:41.100 | So it's easy to run into a problem where
00:39:43.100 | you're really ignoring majority of the sequence.
00:39:47.100 | This is just another Python way,
00:39:51.100 | pseudo-code way to look at it.
00:39:54.100 | You have the same W.
00:39:55.100 | Remember, you're sharing the weights
00:39:58.100 | and all the parameters from time to time.
00:40:01.100 | So if the weights WHH are such that they produce
00:40:14.100 | a value that results
00:40:18.100 | in a gradient that goes to 0,
00:40:22.100 | that zero propagates through the rest of the sequence.
00:40:24.100 | So that's the pseudo-code for backpropagation,
00:40:26.100 | the backward pass through the RNN.
00:40:29.100 | That WHH propagates back.
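
The slide's pseudo-code isn't reproduced in the transcript, but here is a hedged sketch of the backward pass through time for the recurrence above; the point is the repeated multiplication by W (and by the tanh derivative) at every step back, which is exactly what makes the gradient vanish or explode.

```python
import numpy as np

def bptt_hidden_grads(states, W, d_last):
    # states: hidden states from the forward pass (list of vectors)
    # d_last: gradient of the loss with respect to the final hidden state
    d_s = d_last
    grads = []
    for s in reversed(states):
        d_s = d_s * (1.0 - s ** 2)  # backprop through tanh
        grads.append(d_s)
        d_s = W.T @ d_s             # repeated multiplication by W^T each step back
    return list(reversed(grads))
```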
00:40:35.100 | And so you get these things with exploding and vanishing gradients
00:40:39.100 | where this, for example,
00:40:41.100 | an error surface for a single hidden unit RNN.
00:40:45.100 | So this visualizes
00:40:47.100 | the value of the weight, the value of the bias,
00:40:51.100 | and the error.
00:40:52.100 | So the error could be really flat or could explode.
00:40:55.100 | And both are going to lead to you
00:41:02.100 | making steps that are either too gradual or too big.
00:41:06.100 | That's the geometric interpretation.
00:41:08.100 | Okay, what other variants that we'll look at a little bit
00:41:12.100 | are there for RNNs?
00:41:13.100 | It doesn't have to be only one way.
00:41:15.100 | It can be bidirectional.
00:41:16.100 | So there could be edges going forward and edges going back.
00:41:20.100 | What that's needed for is things like
00:41:25.100 | filling in missing, whatever the data is,
00:41:27.100 | filling in missing elements of that data,
00:41:29.100 | whether that's images or words or audio.
00:41:34.100 | And generally, as always is the case in neural networks,
00:41:37.100 | the deeper you go, the better.
00:41:38.100 | So this is deep referring to the number of layers
00:41:45.100 | at a single temporal instance.
00:41:48.100 | So on the right of the slide,
00:41:51.100 | we're stacking layers,
00:41:58.100 | not in the temporal domain.
00:41:58.100 | Each of those layers has its own set of weights
00:42:02.100 | and its own sets of biases.
00:42:05.100 | These things are awesome,
00:42:06.100 | but they need a lot of data
00:42:08.100 | when you add extra layers in this way.
00:42:16.100 | Okay, so the problem is,
00:42:18.100 | while recurrent neural networks,
00:42:20.100 | in theory, are supposed to be able to learn
00:42:22.100 | any kind of sequence,
00:42:25.100 | the reality is they're not really good at remembering
00:42:28.100 | what happened a while ago,
00:42:29.100 | the long-term dependency.
00:42:31.100 | So here's a silly example.
00:42:35.100 | Let's think of a story about Bob.
00:42:40.100 | Bob is eating an apple.
00:42:42.100 | So the apple part is generated
00:42:45.100 | by the recurrent neural network.
00:42:50.100 | The recurrent neural networks can learn to generate apple
00:42:53.100 | because they've seen a lot of sentences with Bob and eating,
00:42:56.100 | and they can generate the word apple.
00:42:59.100 | For a longer sentence, like Bob likes apples,
00:43:03.100 | he's hungry and decided to have a snack,
00:43:05.100 | so now he's eating an apple.
00:43:07.100 | You have to maintain the state
00:43:09.100 | that we're talking about Bob,
00:43:11.100 | and we're talking about apples,
00:43:13.100 | through several discrete semantic sentences.
00:43:20.100 | And that kind of long-term memory
00:43:23.100 | is hard to maintain, not because of different effects,
00:43:28.100 | but because of vanishing gradients.
00:43:30.100 | It's difficult to propagate the important stuff
00:43:34.100 | that happened a while ago,
00:43:35.100 | in order to maintain that context in generating apple
00:43:39.100 | or classifying some concept that happened way down the line.
00:43:44.100 | So when people talk about recurrent neural networks,
00:43:51.100 | these days, they're talking about LSTMs,
00:43:55.100 | long short-term memory networks.
00:44:00.100 | So all the impressive results on time series,
00:44:03.100 | on audio, on video, all of that,
00:44:05.100 | that requires LSTMs.
00:44:07.100 | And so again, vanilla RNNs up on top of the slide,
00:44:12.100 | each cell is simple.
00:44:16.100 | There's some hidden units,
00:44:18.100 | there's an input, and there's an output.
00:44:21.100 | Here we'll use tanh as the activation function.
00:44:27.100 | It's just another popular sigmoid type activation function.
00:44:35.100 | LSTMs are more complicated,
00:44:38.100 | or they look more complicated,
00:44:40.100 | but in some ways they're more intuitive for us to understand.
00:44:46.100 | There's a bunch of gates in each cell.
00:44:49.100 | We'll go through those.
00:44:51.100 | In yellow are different neural network layers.
00:44:55.100 | Sigma and tanh are different types of activation functions.
00:45:00.100 | Tanh is an activation function that squishes the input
00:45:05.100 | to the range of -1 to 1.
00:45:08.100 | A sigmoid function squishes it between 0 and 1,
00:45:13.100 | and that serves different purposes.
00:45:16.100 | There is some pointwise operations, addition, multiplication,
00:45:21.100 | and there is connections,
00:45:25.100 | so data being passed from layer to layer,
00:45:28.100 | shown by the arrows.
00:45:31.100 | There's concatenation and there's a copy operation on the output.
00:45:35.100 | So we copy, the output of each cell is copied to the next cell
00:45:40.100 | and to the output.
00:45:43.100 | Let me try to clarify it a little bit.
00:45:54.100 | There's this conveyor belt going through inside each individual cell.
00:46:00.100 | There's really three steps in the conveyor belt.
00:46:05.100 | The first is there is a sigmoid function
00:46:10.100 | that's responsible for deciding
00:46:15.100 | what to forget and what to ignore.
00:46:18.100 | It's responsible for taking in the input,
00:46:24.100 | the new input, XT,
00:46:26.100 | taking in the state of the previous,
00:46:31.100 | the output of the previous cell, previous time step,
00:46:35.100 | and deciding do I want to keep that in my memory or not,
00:46:39.100 | and do I want to integrate the new input into my memory or not.
00:46:44.100 | So this allows you to be selective about the information which you learn.
00:46:49.100 | So for example, the sentence "Bob and Alice are having lunch."
00:46:53.100 | Bob likes apples, Alice likes oranges, she's eating an orange.
00:46:59.100 | So Bob and Alice are having lunch.
00:47:05.100 | Bob likes apples.
00:47:06.100 | Right now, if you say you had a hidden state,
00:47:10.100 | keeping track of the gender of the person we're talking about.
00:47:16.100 | You might say that there's both genders in the first sentence,
00:47:19.100 | there's male in the second sentence, female in the third sentence.
00:47:23.100 | That way, when you have to generate a sentence about who's eating what,
00:47:27.100 | you'll keep the gender information
00:47:32.100 | in order to make an accurate generation of text
00:47:36.100 | corresponding to the proper person.
00:47:40.100 | So you have to forget certain things,
00:47:42.100 | like forget that Bob existed at that moment,
00:47:45.100 | and you have to forget Bob likes apples,
00:47:49.100 | but you have to remember that Alice likes oranges.
00:47:54.100 | So you have to selectively remember and forget certain things.
00:47:57.100 | That's LSTM in a nutshell.
00:48:00.100 | So you decide what to forget, decide what to remember,
00:48:03.100 | and decide what to output at that cell.
00:48:09.100 | All right, so zoom in a little bit, because this is pretty cool.
00:48:15.100 | There is a state running through the cell.
00:48:20.100 | This can carry information about the previous state,
00:48:23.100 | like the gender that we're currently talking about,
00:48:28.100 | that's the state that you're keeping track of,
00:48:31.100 | and that's running through the cell.
00:48:34.100 | And then there are three sigmoid layers
00:48:37.100 | outputting a number between 0 and 1,
00:48:42.100 | where it's 1 when you want that information to go through,
00:48:46.100 | and 0 when you don't want it to go through
00:48:51.100 | the conveyor belt that maintains the state.
00:48:55.100 | And so the first sigmoid function is where
00:48:58.100 | we decide what to forget and what to ignore.
00:49:01.100 | That's the first one.
00:49:03.100 | You take the inputs from the previous time step,
00:49:06.100 | the input to the network from the current time step,
00:49:10.100 | and decide do I want to forget, do I want to ignore those.
00:49:15.100 | Then we decide which part of the state to update.
00:49:21.100 | What part of our memory do we update with this information,
00:49:24.100 | and what values to insert in that update.
00:49:30.100 | Third step is we perform the actual update
00:49:34.100 | and perform the actual forgetting.
00:49:37.100 | So that's where you take the sigmoid output
00:49:40.100 | and just multiply by it.
00:49:42.100 | When it's 0, it's forgetting.
00:49:44.100 | When it's 1, that information passes through.
00:49:49.100 | And finally, we produce an output from the cell.
00:49:55.100 | So if it's translation,
00:49:58.100 | it's producing an output in the English language
00:50:01.100 | where the input was in the Spanish language.
00:50:03.100 | And then that same output is copied to the next cell.
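
A sketch of one LSTM cell step in the standard formulation that matches the three stages described above (forget, update, output); the weight and bias names are mine, and they're assumed to be defined elsewhere.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wf, bf, Wi, bi, Wc, bc, Wo, bo):
    z = np.concatenate([h_prev, x_t])   # previous output plus new input
    f = sigmoid(Wf @ z + bf)            # forget gate: what to keep of the old state
    i = sigmoid(Wi @ z + bi)            # input gate: what to write into the state
    c_tilde = np.tanh(Wc @ z + bc)      # candidate values to write
    c = f * c_prev + i * c_tilde        # the conveyor belt: forget, then update
    o = sigmoid(Wo @ z + bo)            # output gate: what to expose
    h = o * np.tanh(c)                  # cell output, copied to the next cell
    return h, c
```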
00:50:13.100 | Okay, so what can we get done with this kind of approach?
00:50:18.100 | We can look at machine translation.
00:50:20.100 | I guess what I'm trying to...
00:50:23.100 | Question.
00:50:25.100 | What is the representation of the state?
00:50:27.100 | Is it like a floating point or is it like a vector?
00:50:30.100 | What is it exactly?
00:50:33.100 | The state is the activation multiplied by the weight.
00:50:41.100 | So it's the outputs of the sigmoid or the tanh activations.
00:50:46.100 | So there's a bunch of neurons and they're firing a number
00:50:50.100 | between -1 and 1 or between 0 and 1.
00:50:53.100 | And that holds the state.
00:50:55.100 | It's just calling it state is sort of simplifying.
00:50:58.100 | But the point is that there's a bunch of numbers
00:51:00.100 | being constantly modified by the weights and the biases.
00:51:05.100 | And those numbers hold the state.
00:51:09.100 | And the modification of those numbers is controlled by the weights.
00:51:14.100 | And then once all that is done,
00:51:16.100 | the resulting output of the recurrent neural network
00:51:20.100 | is compared to the desired output
00:51:22.100 | and the errors are back-propagated through the weights.
00:51:27.100 | Hopefully that made sense.
00:51:30.100 | So machine translation is one popular application.
00:51:35.100 | And all of it is the same.
00:51:40.100 | All of these networks that I'll talk about,
00:51:42.100 | they're really similar constructs.
00:51:46.100 | You have some inputs, whatever language that is again.
00:51:52.100 | German maybe. I think everything is German.
00:51:58.100 | And the output, so the inputs are in one language,
00:52:03.100 | a set of characters that compose a word in one language.
00:52:08.100 | There's a state being propagated.
00:52:10.100 | And once that sentence is over,
00:52:12.100 | you start, as opposed to collecting inputs,
00:52:14.100 | you start producing outputs.
00:52:16.100 | And you can output in the English language.
00:52:19.100 | There's a ton of great work on machine translation.
00:52:23.100 | It's what Google is mostly using for their translator.
00:52:26.100 | Same thing, I showed this previously,
00:52:30.100 | but now you know how it works.
00:52:32.100 | Same exact thing, LSTMs, generating handwritten characters,
00:52:37.100 | so handwriting in arbitrary styles.
00:52:39.100 | So controlling the drawing,
00:52:43.100 | where the input is text and the output is handwriting.
00:52:46.100 | And it's again the same kind of network
00:52:51.100 | with some depth here.
00:52:53.100 | The inputs is the text,
00:52:55.100 | the output is the control of the writing.
00:52:59.100 | Character level text generation.
00:53:02.100 | This is the thing that told us about life.
00:53:06.100 | The meaning of life, literary recognition
00:53:09.100 | and the tradition of ancient human reproduction.
00:53:12.100 | That's again the same process.
00:53:16.100 | Input one character at a time,
00:53:18.100 | where you see there's an encoding of the characters
00:53:21.100 | on the input layer.
00:53:23.100 | There's a hidden state, hidden layer,
00:53:26.100 | that's keeping track of those activations,
00:53:28.100 | the outputs of the activation functions.
00:53:32.100 | And every single time,
00:53:38.100 | it's outputting its best prediction
00:53:42.100 | of the next character that follows.
00:53:44.100 | Now in a lot of these applications,
00:53:46.100 | you want to ignore the output
00:53:49.100 | until the input sentence is over.
00:53:52.100 | And then you start listening to the output.
00:53:55.100 | But the point is it just keeps generating text,
00:53:58.100 | whether it's given input or not.
00:54:00.100 | So you producing input is just adding,
00:54:03.100 | steering the recurrent neural network.
00:54:07.100 | You can answer questions
00:54:11.100 | about an image.
00:54:13.100 | So with the inputs here,
00:54:15.100 | you could almost arbitrarily stack things together.
00:54:18.100 | So you take an image as an input, bottom left there,
00:54:21.100 | put it into convolutional neural network
00:54:26.100 | and take the question.
00:54:30.100 | There's something called word embeddings.
00:54:33.100 | It's used to represent the broader meaning of the words.
00:54:37.100 | So how many books is the question?
00:54:40.100 | So you want to take the word embeddings and the image
00:54:43.100 | and produce your best estimate of the answer.
00:54:46.100 | So for a question of what color is the cat,
00:54:49.100 | it could be gray or black.
00:54:51.100 | There's the different LSTM flavors
00:54:54.100 | producing that answer.
00:54:56.100 | Same with counting chairs.
00:54:58.100 | You can give an image of a chair
00:55:00.100 | and ask the question how many chairs are there
00:55:03.100 | and it can produce an answer of three.
00:55:07.100 | So I should say that this is really hard, right?
00:55:11.100 | And it's an arbitrary question,
00:55:12.100 | ask of an arbitrary image.
00:55:14.100 | So you're both interpreting,
00:55:15.100 | you do natural language processing
00:55:17.100 | and you're doing computer vision,
00:55:19.100 | all in one network.
00:55:22.100 | Same thing with image caption generation.
00:55:26.100 | You can detect the different objects in the scene,
00:55:31.100 | generate those words,
00:55:33.100 | stitch them together in syntactically correct sentences
00:55:37.100 | and re-rank the sentences.
00:55:39.100 | All of those are LSTMs,
00:55:41.100 | the second and the third step.
00:55:43.100 | The first is computer vision detecting the objects,
00:55:46.100 | segmenting the image and detecting the objects.
00:55:48.100 | And that way you can generate a caption
00:55:50.100 | that says a man is sitting in a chair
00:55:52.100 | with a dog in his lap.
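A hedged sketch of the generation step only, assuming the vision stage has already produced a feature vector for the image; the start/end token indices and all sizes are made up for illustration, and the model here is untrained, so it only shows the mechanics of seeding an LSTM with image features and generating words until an end-of-sentence token.

```python
# Hypothetical caption-generation sketch (PyTorch): CNN features seed the
# LSTM state, then words are generated one at a time until an end token.
import torch
import torch.nn as nn

VOCAB, EMB, HID, IMG_FEAT = 3000, 64, 256, 256
END_TOKEN = 1                                     # assumed index of <end>

class Captioner(nn.Module):
    def __init__(self):
        super().__init__()
        self.img_to_h = nn.Linear(IMG_FEAT, HID)  # image features -> initial hidden state
        self.emb = nn.Embedding(VOCAB, EMB)
        self.lstm = nn.LSTM(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def generate(self, img_feat, max_len=15, start_token=0):
        h = torch.tanh(self.img_to_h(img_feat)).unsqueeze(0)   # (1, batch, HID)
        state = (h, torch.zeros_like(h))
        word = torch.tensor([[start_token]])
        caption = []
        for _ in range(max_len):
            out, state = self.lstm(self.emb(word), state)
            word = self.out(out[:, -1]).argmax(dim=-1, keepdim=True)  # greedy pick
            if word.item() == END_TOKEN:          # stop at end-of-sentence
                break
            caption.append(word.item())
        return caption

cnn_features = torch.randn(1, IMG_FEAT)           # stand-in for the vision stage's output
print(Captioner().generate(cnn_features))         # list of (arbitrary) word ids
```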
00:55:56.100 | Again, LSTMs for video.
00:56:00.100 | Caption generation for video.
00:56:03.100 | The input at every frame is an image
00:56:06.100 | that goes into the LSTM.
00:56:08.100 | The input is an image
00:56:11.100 | and the output is a set of characters.
00:56:13.100 | First you load in the video,
00:56:15.100 | in this case the output is on top.
00:56:17.100 | You encode the video
00:56:21.100 | into a representation inside the network
00:56:24.100 | and then you start generating words about that video.
00:56:27.100 | First comes the input, the encoding stage,
00:56:29.100 | then the decoding stage.
00:56:32.100 | Take in the video and output, say,
00:56:35.100 | "a man is talking," whatever.
00:56:37.100 | And because the input and the output is arbitrary,
00:56:41.100 | there also have to be indicators of the beginnings
00:56:43.100 | and the ends of a sentence.
00:56:46.100 | So in this case, end of sentence.
00:56:48.100 | So you want to know when you stop.
00:56:51.100 | In order to generate syntactically correct sentences,
00:56:54.100 | you want to be able to generate a period
00:56:57.100 | that indicates the end of a sentence.
00:57:01.100 | So you can also, again, use recurrent neural networks,
00:57:04.100 | LSTMs here, to control the steering
00:57:11.100 | of a sliding window over an image
00:57:15.100 | that's used to classify what's contained in that image.
00:57:19.100 | So here, a CNN being steered by a recurrent neural network
00:57:24.100 | in order to convert this image
00:57:28.100 | into the number that's associated with the house number.
00:57:33.100 | It's called visual attention.
00:57:35.100 | And that visual attention can be used to steer
00:57:37.100 | for the perception side
00:57:39.100 | and it can be used to steer a network for the generation.
00:57:43.100 | On the right, we can generate an image.
00:57:49.100 | So this is the output of the LSTM,
00:57:54.100 | where the output at every time step is visual.
00:57:59.100 | In this way, you can draw numbers.
00:58:06.100 | Here, I mentioned this before,
00:58:12.100 | is a network taking in as input silent video,
00:58:15.100 | a sequence of images,
00:58:19.100 | and producing audio.
00:58:22.100 | So this is an LSTM
00:58:26.100 | that has convolutional layers for every single frame.
00:58:32.100 | It takes images as input
00:58:35.100 | and produces a spectrogram, audio as output.
00:58:41.100 | The training set is a person hitting an object with a drumstick
00:58:49.100 | and your task is to generate, given a silent video,
00:58:53.100 | generate the sound that a drumstick would make
00:58:57.100 | when in contact with that object.
00:59:01.100 | Okay, medical diagnosis.
00:59:06.100 | That's actually, so I've listed some places
00:59:08.100 | where it's been really successful and pretty cool.
00:59:11.100 | But it's also beginning to be applied in places
00:59:14.100 | where it can actually really help civilization,
00:59:23.100 | right, in medical applications.
00:59:25.100 | So for medical diagnosis,
00:59:28.100 | there is a highly sparse
00:59:32.100 | and variable-length sequence of information
00:59:39.100 | in the form of, for example,
00:59:41.100 | patient electronic health records.
00:59:43.100 | So every time you visit a doctor,
00:59:45.100 | there's some test being done
00:59:46.100 | and that information is there
00:59:48.100 | and you can look at it as a sequence over a period of time.
00:59:51.100 | And then given that data, that's the input,
00:59:54.100 | the output is a diagnosis, a medical diagnosis.
01:00:00.100 | So in this case, we can look at predicting diabetes,
01:00:04.100 | scoliosis, asthma, and so on,
01:00:09.100 | with pretty good accuracy.
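A sketch of how such sparse, variable-length visit sequences could be fed to an LSTM, assuming made-up feature and label sizes; this is a generic padding-and-packing pattern, not the model from the cited work.

```python
# Hypothetical sketch: variable-length sequences of patient visits, each visit
# a feature vector (tests, measurements), mapped to multi-label diagnoses.
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

N_FEATURES, HID, N_DIAGNOSES = 40, 128, 10        # made-up sizes

class DiagnosisLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(N_FEATURES, HID, batch_first=True)
        self.head = nn.Linear(HID, N_DIAGNOSES)

    def forward(self, visits, lengths):
        # Pack so the LSTM ignores the padded time steps.
        packed = pack_padded_sequence(visits, lengths,
                                      batch_first=True, enforce_sorted=False)
        _, (h, _) = self.lstm(packed)
        return self.head(h[-1])                    # one logit per diagnosis (multi-label)

# Two patients with different numbers of visits (3 and 5).
patients = [torch.randn(3, N_FEATURES), torch.randn(5, N_FEATURES)]
lengths = torch.tensor([len(p) for p in patients])
batch = pad_sequence(patients, batch_first=True)   # pad to the longest sequence
logits = DiagnosisLSTM()(batch, lengths)
probs = torch.sigmoid(logits)                      # probability of each diagnosis
print(probs.shape)                                 # torch.Size([2, 10])
```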
01:00:13.100 | Something that all of us wish we could do
01:00:20.100 | is stock market prediction.
01:00:25.100 | So you can input, for example,
01:00:27.100 | well, first of all, you can input the raw stock data,
01:00:30.100 | right, the order books and so on, financial data.
01:00:33.100 | But you can also look at news articles
01:00:35.100 | from all over the web
01:00:37.100 | and take those as input, as shown here,
01:00:40.100 | on the x-axis is time,
01:00:42.100 | so articles from different days,
01:00:45.100 | LSTM, once again,
01:00:48.100 | and produce an output of your prediction,
01:00:51.100 | binary prediction, whether the stock will go up or down.
01:00:56.100 | And nobody's been able to really successfully do this,
01:00:59.100 | but there are a bunch of results
01:01:02.100 | trying to perform above random,
01:01:06.100 | which is how you make money, right,
01:01:10.100 | significantly above random,
01:01:12.100 | on the prediction of whether it's going up or down,
01:01:14.100 | so you can buy or sell.
01:01:16.100 | And especially,
01:01:19.100 | in the cases when there were crashes,
01:01:21.100 | it's easier to predict.
01:01:23.100 | So you can predict an approaching crash.
01:01:25.100 | These are shown in the table,
01:01:27.100 | the error rates for different stocks,
01:01:31.100 | automotive stocks.
01:01:35.100 | You can also generate audio.
01:01:38.100 | This exact same process you generate language,
01:01:40.100 | you generate audio.
01:01:42.100 | Here it's trained on a single speaker,
01:01:47.100 | a few hours of them speaking,
01:01:52.100 | and it just learns from the raw audio of the speaker.
01:01:58.100 | And it's learning slowly to generate.
01:02:02.100 | (audience mumbling)
01:02:19.100 | Obviously, they were reading numbers,
01:02:21.100 | but that might...
01:02:26.100 | This is incredible.
01:02:27.100 | This is trained on a compressed spectrogram
01:02:31.100 | of the audio, raw audio.
01:02:35.100 | And it's producing something that,
01:02:38.100 | over just a few epochs,
01:02:40.100 | it's producing something that sounds like words.
01:02:43.100 | It could do this lecture for me, I wish.
01:02:47.100 | (audience mumbling)
01:02:56.100 | Since...
01:02:59.100 | I don't know. This is amazing.
01:03:02.100 | This is raw input, raw output,
01:03:06.100 | all again LSTMs.
01:03:10.100 | And there's a lot of work in voice recognition
01:03:13.100 | and audio recognition.
01:03:14.100 | You're mapping...
01:03:17.100 | Let me turn it up.
01:03:22.100 | You're mapping any kind of audio to a classification.
01:03:25.100 | (audience mumbling)
01:03:29.100 | So, you can take the audio of the road,
01:03:32.100 | (audience mumbling)
01:03:35.100 | and that's a spectrogram on the bottom there being shown,
01:03:39.100 | and you could detect whether the road is wet
01:03:42.100 | or the road is dry.
01:03:44.100 | (audience mumbling)
01:03:47.100 | And you could do the same thing for
01:03:51.100 | recognizing the gender of the speaker
01:03:54.100 | or recognizing, in a many-to-many mapping,
01:03:57.100 | the actual words being spoken,
01:04:00.100 | speech recognition.
01:04:02.100 | But this is about driving,
01:04:04.100 | so let's see where recurrent neural networks apply in driving.
01:04:08.100 | We talked about the NVIDIA approach,
01:04:12.100 | the thing that actually powers DeepTesla JS,
01:04:16.100 | is a simple convolutional neural network.
01:04:18.100 | There are five convolutional layers in their approach,
01:04:22.100 | three fully connected layers.
01:04:24.100 | You can add as many layers as you want in DeepTesla.
01:04:29.100 | So, that's a quarter million parameters to optimize.
01:04:34.100 | And all you're taking is a single image,
01:04:37.100 | no temporal information, single image,
01:04:39.100 | and producing a steering angle.
01:04:40.100 | That's the approach, that's the DeepTesla way.
01:04:44.100 | So, taking a single image
01:04:49.100 | and learning a regression of a steering angle.
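A rough sketch of that single-image steering regression; the layer sizes below are loosely in the spirit of the published NVIDIA network but are not claimed to match it exactly.

```python
# Hypothetical single-frame steering regression sketch (PyTorch):
# one image in, one number (steering angle) out, no temporal information.
import torch
import torch.nn as nn

steering_net = nn.Sequential(
    # 5 convolutional layers (channel counts are illustrative)
    nn.Conv2d(3, 24, 5, stride=2), nn.ReLU(),
    nn.Conv2d(24, 36, 5, stride=2), nn.ReLU(),
    nn.Conv2d(36, 48, 5, stride=2), nn.ReLU(),
    nn.Conv2d(48, 64, 3), nn.ReLU(),
    nn.Conv2d(64, 64, 3), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    # 3 fully connected layers down to a single regressed steering angle
    nn.Linear(64, 100), nn.ReLU(),
    nn.Linear(100, 50), nn.ReLU(),
    nn.Linear(50, 1),
)

frame = torch.randn(1, 3, 66, 200)                  # a single camera frame
angle = steering_net(frame)                         # predicted steering angle
loss = nn.MSELoss()(angle, torch.tensor([[0.1]]))   # regression against the human angle
print(angle.shape, loss.item())
```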
01:04:53.100 | Now, so one of the prizes for the competition
01:04:59.100 | is the Udacity self-driving car engineer nanodegree for free.
01:05:06.100 | This thing is awesome,
01:05:07.100 | I encourage everyone to check it out.
01:05:09.100 | But they did a competition
01:05:12.100 | that's very similar to ours.
01:05:17.100 | But they have a very large group of obsessed people,
01:05:22.100 | so they were very clever.
01:05:25.100 | They went beyond just convolutional neural networks
01:05:27.100 | of predicting steering.
01:05:28.100 | So, taking a sequence of images and predicting steering.
01:05:31.100 | What they did is, the winners,
01:05:34.100 | at least the first, and I'll talk about the second place winner tomorrow,
01:05:38.100 | who used 3D convolutional neural networks.
01:05:43.100 | But the first and the third place winners used RNNs,
01:05:46.100 | used LSTMs, recurrent neural networks.
01:05:49.100 | And mapped a sequence of images
01:05:53.100 | to a sequence of steering angles.
01:05:55.100 | For anyone, statistically speaking,
01:06:00.100 | anybody here who's not a computer vision person,
01:06:03.100 | most likely what you want to use
01:06:05.100 | for whatever application you're interested in
01:06:07.100 | is RNNs.
01:06:09.100 | It's just the world is full of time series data.
01:06:12.100 | Very few of us are working on data
01:06:16.100 | that's not time series data.
01:06:18.100 | In fact, whenever it's just snapshots,
01:06:21.100 | you're really just reducing the problem
01:06:24.100 | to the size that you can handle.
01:06:26.100 | But most data, the world is time series data.
01:06:29.100 | So this is the approach you will end up using
01:06:32.100 | if you want to apply it in your own research.
01:06:36.100 | So this is it,
01:06:40.100 | RNNs are the way to go.
01:06:43.100 | So, again, what are they doing?
01:06:49.100 | How do you put images
01:06:53.100 | into a recurrent neural network?
01:06:55.100 | It's the same thing.
01:06:58.100 | You take,
01:07:00.100 | you have to convert an image into numbers
01:07:02.100 | in some kind of way.
01:07:04.100 | A powerful way of doing that is convolutional neural networks.
01:07:07.100 | So you can take
01:07:09.100 | either 3D convolutional neural networks
01:07:12.100 | or 2D convolutional neural networks,
01:07:15.100 | one that takes time into consideration and one that does not.
01:07:18.100 | So process that image
01:07:20.100 | to extract the representation of that image.
01:07:23.100 | And that becomes the input to the LSTM.
01:07:26.100 | And the output at every single cell,
01:07:29.100 | at every single time step,
01:07:31.100 | is the predicted steering angle,
01:07:33.100 | the speed of the vehicle and the torque.
01:07:35.100 | That's what the first place winner did.
01:07:37.100 | They didn't just do steering angle,
01:07:39.100 | they also did the speed and the torque.
01:07:42.100 | And the sequence length that they were using
01:07:45.100 | for training and for testing
01:07:48.100 | for the input and the output is a sequence length of 10.
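A hedged sketch of that setup: only the sequence length of 10 and the three per-step outputs (steering angle, speed, torque) come from the description above, while the feature extractor and all dimensions are stand-ins, not the winning team's actual code.

```python
# Hypothetical sketch: per-frame CNN features -> LSTM -> per-step prediction
# of steering angle, speed, and torque, over sequences of length 10.
import torch
import torch.nn as nn

SEQ_LEN, FEAT, HID = 10, 256, 128    # sequence length from the lecture; sizes assumed

frame_encoder = nn.Sequential(        # stand-in for a (2D or 3D) conv feature extractor
    nn.Conv2d(3, 16, 5, stride=4), nn.ReLU(),
    nn.Conv2d(16, 32, 5, stride=4), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, FEAT))

temporal = nn.LSTM(FEAT, HID, batch_first=True)
head = nn.Linear(HID, 3)              # steering angle, speed, torque at every step

frames = torch.randn(1, SEQ_LEN, 3, 120, 160)       # 10 consecutive camera frames
feats = frame_encoder(frames.flatten(0, 1))         # encode each frame: (10, FEAT)
feats = feats.view(1, SEQ_LEN, FEAT)                # back to (batch, time, FEAT)
out, _ = temporal(feats)                            # one hidden state per time step
predictions = head(out)                             # (1, 10, 3)
print(predictions.shape)
```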
01:07:52.100 | (audience member asking question)
01:07:57.100 | The question was, do they use supervised learning?
01:08:00.100 | Yep. So they were given the same thing as in DeepTesla.
01:08:03.100 | A sequence of frames, where you have a sequence of
01:08:06.100 | steering angles, speed and torque.
01:08:08.100 | I think there's other information available too.
01:08:11.100 | So yeah, there's no reinforcement learning here.
01:08:14.100 | Question?
01:08:15.100 | (audience member asking question)
01:08:26.100 | So the question was,
01:08:28.100 | how many LSTM gates are there in this problem?
01:08:31.100 | (audience member asking question)
01:08:34.100 | So this network,
01:08:36.100 | (audience member asking question)
01:08:41.100 | it's true that these diagrams kind of hide
01:08:45.100 | the number of parameters here.
01:08:47.100 | But it's arbitrary, just like convolutional neural networks are arbitrary.
01:08:50.100 | So the size of the input is arbitrary,
01:08:54.100 | the size of the sigmoid and tanh functions is arbitrary.
01:08:58.100 | So you can make it as large as you want, as deep as you want.
01:09:02.100 | And the deeper and larger, the better.
01:09:05.100 | What these folks actually use,
01:09:07.100 | so the way these competitions work,
01:09:11.100 | I encourage you if you're interested in machine learning
01:09:14.100 | to participate in Kaggle,
01:09:16.100 | I don't know how to pronounce it, competitions,
01:09:19.100 | where basically everyone's doing the same thing.
01:09:21.100 | You're using LSTMs, or if it's one-to-one mapping,
01:09:25.100 | using convolutional neural networks, fully connected networks,
01:09:28.100 | with some clever pre-processing.
01:09:30.100 | And the whole job, which takes months,
01:09:32.100 | and probably, if you're a researcher,
01:09:34.100 | what you'll be doing in your own research,
01:09:36.100 | is playing with parameters.
01:09:37.100 | Playing with pre-processing of the data,
01:09:39.100 | playing with the different parameters that control
01:09:41.100 | the size of the network, the learning rate.
01:09:44.100 | I mentioned this type of optimizer,
01:09:46.100 | all these kinds of things, that's what you're playing with.
01:09:49.100 | Using your own human intuition,
01:09:51.100 | and you're using your...
01:09:54.100 | whatever probing you can do in monitoring
01:09:59.100 | the performance of the network through time.
01:10:05.100 | Right.
01:10:16.100 | The question was,
01:10:18.100 | you said that there is a memory of 10 in this LSTM,
01:10:26.100 | and I thought RNNs are supposed to be arbitrary, or whatever.
01:10:30.100 | So, it has to do with the training,
01:10:36.100 | how the network is trained.
01:10:39.100 | So it's trained with sequences of 10.
01:10:41.100 | The structure is still the same,
01:10:42.100 | you only have one cell that's looping onto itself.
01:10:46.100 | But the question is, in what chunks,
01:10:51.100 | what is the size of the sequence
01:10:53.100 | in which you're doing the training and then the testing?
01:10:56.100 | So, you don't have to, it can be arbitrary length,
01:10:59.100 | it's just usually better to be consistent
01:11:02.100 | and have a fixed length.
01:11:04.100 | But you're not stacking 10 cells together,
01:11:10.100 | it's just a single cell still.
01:11:14.100 | So, the third place winner,
01:11:18.100 | Team Chauffeur,
01:11:20.100 | used something called transfer learning,
01:11:23.100 | and it's something I don't think I mentioned,
01:11:26.100 | but it's kind of implied.
01:11:29.100 | The amazing power of neural networks,
01:11:34.100 | so, first, you need a lot of data to do anything.
01:11:37.100 | So that's a cost, that's a limitation of neural networks.
01:11:40.100 | But what you could do is,
01:11:43.100 | so there's neural networks that have been trained
01:11:48.100 | on very large datasets, on ImageNet.
01:11:51.100 | There's VGGNet, AlexNet, ResNet,
01:11:56.100 | all these networks that train on huge amounts of data.
01:11:59.100 | But those networks are trained to tell the difference
01:12:03.100 | between a cat and a dog, or whatever,
01:12:05.100 | or to do specific object recognition in single images.
01:12:08.100 | How do I then take that network and apply it to my problem,
01:12:12.100 | say, of driving, of lane detection,
01:12:14.100 | or classifying medical diagnosis of cancer or not.
01:12:18.100 | The beauty of neural networks is you don't,
01:12:22.100 | I mean, it depends,
01:12:25.100 | but the promise of transfer learning
01:12:28.100 | is that you can just take that network,
01:12:30.100 | chop off the final layer,
01:12:32.100 | the fully connected layer that maps from all those
01:12:36.100 | cool high-dimensional features that you've learned about the visual space,
01:12:41.100 | and as opposed to predicting cat versus dog,
01:12:44.100 | you teach it to predict cancer or no cancer.
01:12:47.100 | You teach it to predict lane or no lane,
01:12:50.100 | truck or no truck.
01:12:52.100 | And so, as long as the visual space
01:12:54.100 | under which the networks operate is similar,
01:12:57.100 | or the data space, like if it's audio or whatever,
01:13:00.100 | if it's similar, if the features are useful that you learned,
01:13:04.100 | in studying the problem of cat versus dog deeply,
01:13:07.100 | you have learned actually how to see the world.
01:13:10.100 | And so you can apply that visual knowledge,
01:13:13.100 | you can transfer that learning to another domain.
01:13:17.100 | And that's the beautiful power of neural networks,
01:13:20.100 | is they're transferable.
01:13:22.100 | And so what they did here is they took,
01:13:27.100 | I didn't spend enough time looking through the code,
01:13:31.100 | so I'm not sure which of the giant networks they took,
01:13:34.100 | but they took a giant convolutional neural network,
01:13:38.100 | they pruned it down to, they chopped off the end layer,
01:13:43.100 | which produced 3,000 features,
01:13:45.100 | and they took those 3,000 features
01:13:47.100 | that every single image frame,
01:13:49.100 | and that's the XT,
01:13:51.100 | and they gave that as the input to LSTM,
01:13:54.100 | and the sequence length in that case was 50.
01:13:57.100 | So this process is pretty similar across domains,
01:14:05.100 | that's the beauty of it.
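A sketch of that transfer-learning recipe, with a torchvision ResNet standing in for whichever pretrained network the team actually used; only the sequence length of 50 comes from the description above, and the feature size here is ResNet-18's 512 rather than the roughly 3,000 features mentioned for their network.

```python
# Hypothetical transfer-learning sketch: take a pretrained image classifier,
# chop off its final classification layer, and feed the remaining features
# per frame into an LSTM that predicts steering over sequences of length 50.
import torch
import torch.nn as nn
import torchvision

backbone = torchvision.models.resnet18()    # in practice: load ImageNet-pretrained weights
backbone.fc = nn.Identity()                 # chop off the cat-vs-dog style classifier head
FEAT = 512                                  # ResNet-18 feature size (team's network: ~3,000)

SEQ_LEN = 50                                # sequence length mentioned in the lecture
temporal = nn.LSTM(FEAT, 128, batch_first=True)
steer_head = nn.Linear(128, 1)

frames = torch.randn(1, SEQ_LEN, 3, 224, 224)
with torch.no_grad():                       # the pretrained backbone can stay frozen
    feats = backbone(frames.flatten(0, 1)).view(1, SEQ_LEN, FEAT)
out, _ = temporal(feats)
steering = steer_head(out)                  # one steering prediction per frame
print(steering.shape)                       # torch.Size([1, 50, 1])
```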
01:14:07.100 | And the art of neural networks is in the,
01:14:13.100 | that's a good sign,
01:14:15.100 | I guess I should wrap it up.
01:14:20.100 | Anyway, I don't think I need much time.
01:14:24.100 | But the art of neural networks is in the hyperparameter tuning,
01:14:30.100 | and that's the tricky part,
01:14:32.100 | and that's the part you can't be taught,
01:14:34.100 | that's experience, sadly enough.
01:14:38.100 | That's why, I talked about
01:14:41.100 | stochastic gradient descent, SGD,
01:14:44.100 | that's why Geoffrey Hinton, I believe it was,
01:14:47.100 | refers to it as stochastic graduate student descent.
01:14:52.100 | Meaning you just keep hiring graduate students
01:14:55.100 | to play with the hyperparameters
01:14:57.100 | until the problem is solved.
01:15:02.100 | So I have about 100 plus slides on driver state,
01:15:11.100 | which is the thing that I'm most passionate about,
01:15:15.100 | and I think we'll save the best for last, right?
01:15:19.100 | And I'll talk about that tomorrow.
01:15:21.100 | We have a guest speaker from the White House,
01:15:25.100 | who'll talk about the future of artificial intelligence
01:15:27.100 | from the perspective of policy.
01:15:30.100 | And what I'd like you to do, first of all,
01:15:34.100 | if you're a registered student,
01:15:35.100 | submit the two tutorial assignments,
01:15:37.100 | and pick up,
01:15:40.100 | can we just set up boxes right here or something?
01:15:42.100 | Yeah, just stop by, pick up a shirt,
01:15:46.100 | and give us a card on the way.
01:15:48.100 | All right, thanks guys.
01:15:52.100 | (applause)