MIT 6.S094: Recurrent Neural Networks for Steering Through Time
Chapters
0:00 Intro
0:44 Administrative
1:50 Flavors of Neural Networks
6:01 Back to Basics: Backpropagation
8:34 Backpropagation: Forward Pass
10:14 Backpropagation: By Example
11:58 Backpropagation: Backward Pass
13:41 Modular Magic: Chain Rule
14:55 Interpreting Gradients
18:27 Modularity Expanded: Sigmoid Activation Function
20:33 Learning with Backpropagation
25:27 Optimization is Hard: Dying ReLUs
26:13 Optimization is Hard: Saddle Point
27:39 Learning is an Optimization Problem
30:59 Optimization is Hard: Vanishing Gradients
32:52 Reflections on Backpropagation
36:07 Unrolling a Recurrent Neural Network
37:39 RNN Observations
38:39 Backpropagation Through Time (BPTT)
40:34 Gradients Can Explode or Vanish: Geometric Interpretation
41:08 RNN Variants: Bidirectional RNNs
42:08 Long-Term Dependency
43:49 Long Short-Term Memory (LSTM) Networks
45:43 LSTM: Gates Regulate
45:48 LSTM: Pick What to Forget and What To Remember
48:08 LSTM Conveyor Belt
50:12 Application: Machine Translation
52:19 Application: Handwriting Generation from Text
52:59 Application: Character-Level Text Generation
54:06 Application: Image Question Answering
55:21 Application: Image Caption Generation
55:56 Application: Video Description Generation
56:50 Application: Modeling Attention Steering
57:33 Application: Drawing with Selective Attention Writing
58:03 Application: Adding Audio to Silent Film
59:00 Application: Medical Diagnosis
All right, so we've talked about regular neural networks, 00:00:07.800 |
We've talked about convolutional neural networks that work with images. 00:00:12.200 |
We've talked about reinforcement, deep reinforcement learning, 00:00:16.000 |
where we plug in a neural network into a reinforcement learning algorithm 00:00:21.300 |
when an agent, when a system has to not only perceive the world 00:00:29.400 |
And today we'll talk about perhaps the least understood 00:00:34.400 |
but the most exciting flavor of neural network out there: 00:00:37.400 |
recurrent neural networks. 00:00:46.000 |
There's a website, I don't know if you heard, 00:00:50.200 |
cars.mit.edu, where you should create an account 00:00:57.000 |
You need to have an account if you want to get credit for this. 00:00:59.800 |
You need to submit code for DeepTraffic.js and DeepTesla.js. 00:01:05.700 |
And for DeepTraffic, you have to have a neural network 00:01:11.800 |
If you need help to achieve that speed, please email us. 00:01:19.000 |
For those of you who are old school SNL fans, 00:01:23.000 |
there's a deep thoughts section now in the profile page 00:01:28.600 |
where we encourage you to talk about the kinds of things 00:01:32.300 |
you tried in DeepTraffic or any of the other DeepTesla 00:01:36.400 |
or any of the work you've done as part of this class for deep learning. 00:01:41.300 |
We've talked about the vanilla neural networks on the left. 00:01:49.500 |
The vanilla neural network is the one where it's computing, 00:01:53.800 |
it's approximating a function that maps from one input to one output. 00:01:59.100 |
An example is mapping images to the number that's shown in the image. 00:02:03.400 |
For ImageNet, it's mapping an image to what object is in the image. 00:02:09.500 |
In fact, convolutional neural networks can operate on audio. 00:02:13.600 |
You could give it a chunk of audio, a five-second audio clip. 00:02:17.500 |
That still counts as one input because it's fixed size. 00:02:20.900 |
As long as the size of the input is fixed, that's one chunk of input. 00:02:26.900 |
And as long as you have ground truth that maps that chunk of input 00:02:31.000 |
to some output ground truth, that's a vanilla neural network. 00:02:35.800 |
Whether that's a fully connected neural network 00:03:02.300 |
They take as input sequences, time series, audio, video. 00:03:12.400 |
and that temporal dynamics that connects the data 00:03:15.900 |
is more important than the spatial content of each individual frame. 00:03:22.000 |
When there's a lot of information being conveyed in the sequence, 00:03:26.300 |
in the temporal change of whatever that type of data is, 00:03:30.500 |
that's when you want to use recurrent neural networks. 00:03:41.500 |
of a recurrent neural network where they really shine 00:03:48.800 |
You don't have a fixed chunk of data that you're putting in, 00:04:20.100 |
You can have natural language put into the network 00:04:39.300 |
So for video, the sequence size might be the same. 00:04:52.800 |
counting the number of people in every single frame. 00:04:55.000 |
That's many to many when the size of the input 00:05:02.800 |
where there's feedback from output and input? 00:05:05.000 |
And that's exactly what recurrent neural networks are. 00:05:21.500 |
There's a loop in there that produces the output 00:05:24.200 |
and also takes that output as input once again. 00:05:37.000 |
might be totally different than the input sequence. 00:05:51.200 |
continue that song after a certain period of time. 00:06:13.000 |
It's also the thing that if you're a little bit lazy 00:06:19.300 |
and start using the basic tutorials for TensorFlow, 00:06:22.500 |
you ignore how backpropagation works at your peril. 00:06:32.300 |
You can assemble them like you might have done with DeepTraffic. 00:07:01.300 |
there's an input to the network, so an image. 00:07:06.900 |
all with differentiable smooth activation functions 00:07:11.700 |
And then as you pass through those activation functions, 00:07:32.800 |
that you hope, you expect the network to produce. 00:07:35.800 |
And then you can look at the difference between 00:08:36.300 |
And let's do a forward pass through this circuit 00:09:05.100 |
that I just want to take my time through this 00:09:07.100 |
because everything else about neural networks 00:10:24.700 |
we have to look at the gradient on each one of the gates. 00:10:42.100 |
The partial derivative of the output of a gate 00:11:00.300 |
If I increase X for the first function of addition, 00:11:38.300 |
The partial derivative of F with respect to Y is X. 00:12:55.100 |
And then we continue in the same exact process, 00:13:16.100 |
the gradient on F with respect to the inputs, X, Y, Z. 00:16:14.100 |
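As a concrete sketch of the kind of circuit being walked through here, this is a forward and backward pass for f(x, y, z) = (x + y) * z in a few lines of Python; the particular values are illustrative, not necessarily the ones on the slide.

```python
# Forward pass through a tiny two-gate circuit: f(x, y, z) = (x + y) * z
x, y, z = -2.0, 5.0, -4.0    # illustrative values
q = x + y                    # add gate:      q = 3
f = q * z                    # multiply gate: f = -12

# Backward pass (chain rule), starting from df/df = 1
df_df = 1.0
df_dq = z * df_df            # multiply gate hands each input the other's value: -4
df_dz = q * df_df            #                                                    3
df_dx = 1.0 * df_dq          # add gate just distributes the gradient:           -4
df_dy = 1.0 * df_dq          #                                                   -4

print(f, df_dx, df_dy, df_dz)   # -12.0 -4.0 -4.0 3.0
```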
And then we could propagate that error backwards. 00:17:09.100 |
So that 3 is ignored when you back propagate it. 00:17:32.100 |
and multiplied by the value of the gradient in the output. 00:17:37.100 |
If it's confusing, go through the slides slowly. 00:17:47.100 |
which takes the inputs and produces as output 00:17:56.100 |
And when computing the gradient of the max gate, 00:18:17.100 |
pays attention to the input values on the forward pass. 00:18:32.100 |
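The max gate's gradient routing can be sketched the same way: on the backward pass all of the gradient goes to whichever input was larger on the forward pass, and the other input gets zero (values again illustrative).

```python
# Forward pass through a max gate with illustrative inputs
x, y = 3.0, 7.0
m = max(x, y)                        # forward: 7.0 -- y "won"

# Backward pass: the incoming gradient is routed only to the winning input
dm = 2.0                             # whatever gradient arrives from above
dx = dm if x > y else 0.0            # 0.0 -- the 3 is ignored
dy = dm if y > x else 0.0            # 2.0
print(m, dx, dy)
```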
A neural network is just a simple collection of these gates. 00:18:40.100 |
you calculate some kind of function on the end, 00:18:46.100 |
So usually for neural networks, that's an error function. 00:20:06.100 |
there's a bunch of parameters that you're trying to, 00:21:28.100 |
the output that you want the network to produce. 00:21:34.100 |
in exactly the same way that we described before. 00:21:37.100 |
So the subtasks that are involved in this update 00:21:50.100 |
computes the error, the difference between a and b. 00:22:09.100 |
Or actually the opposite of the direction of the gradient 00:22:14.100 |
And the amount by which you make that adjustment 00:22:30.100 |
And the process of adjusting the weights and biases 00:22:51.100 |
tens, hundreds of thousands, millions of those parameters. 00:22:55.100 |
So the space, the function that you're trying to minimize 00:23:12.100 |
You adjust in such a way that minimizes the output cost. 00:23:18.100 |
And there's a bunch of optimization methods for doing this. 00:23:32.100 |
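As a sketch of that update, not the lecture's code: one gradient-descent step on a single weight and bias, where the adjustment is the gradient scaled by a learning rate (the data and learning rate here are made up).

```python
# One gradient-descent step for a single linear neuron with squared error.
# The data, parameters, and learning rate are all illustrative.
w, b = 0.5, 0.0                  # parameters to learn
x, target = 2.0, 3.0             # one training example
learning_rate = 0.1

prediction = w * x + b           # forward pass
error = prediction - target      # difference between what we got and what we want
loss = 0.5 * error ** 2

# Backward pass: gradient of the loss with respect to each parameter
dloss_dw = error * x
dloss_db = error

# Adjust in the opposite direction of the gradient, scaled by the learning rate
w -= learning_rate * dloss_dw
b -= learning_rate * dloss_db
print(loss, w, b)
```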
if you know about these kind of terminologies. 00:23:34.100 |
The local minimum is the same as the global minimum. 00:23:42.100 |
So your goal is to get to the bottom of this thing. 00:23:56.100 |
And there's a lot of different ways to do gradient descent. 00:24:00.100 |
with various ways of adding randomness into the process, 00:24:03.100 |
so you don't get stuck in the weird crevices of the terrain. 00:24:15.100 |
When you're designing a network for DeepTraffic 00:24:35.100 |
what was for a while the most popular activation function, 00:24:54.100 |
The gradient tells you how much you want to adjust the weights. 00:25:04.100 |
and it gets less and less as you back propagate. 00:25:11.100 |
You think that you don't need to adjust the weights at all. 00:25:17.100 |
thinks that weights don't need to be adjusted, 00:25:37.100 |
which is the most popular activation function. 00:25:51.100 |
It might be zero gradient for the entire dataset. 00:26:21.100 |
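Here is a quick numpy sketch of the dying-ReLU situation being described: if a bad update leaves a neuron's pre-activation below zero for every input in the dataset, its gradient is zero everywhere and it never recovers (the numbers are made up for illustration).

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)

# A "dead" ReLU neuron: a large negative bias keeps the pre-activation
# below zero for the whole (illustrative) dataset.
X = np.random.randn(1000, 3)          # fake dataset
w = np.array([0.1, -0.2, 0.05])
b = -10.0                             # pushed far negative, e.g. by one bad update

z = X @ w + b                         # pre-activations: all < 0
a = relu(z)                           # outputs: all exactly 0
local_grad = (z > 0).astype(float)    # d(relu)/dz: 0 wherever z <= 0

print(a.max(), local_grad.sum())      # 0.0 0.0 -> no signal flows back, ever
```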
You can't just plug and play like they're Legos. 00:26:44.100 |
for optimizing the loss function over the gradients. 00:26:50.100 |
again, if you've done any numerical optimization, 00:26:58.100 |
that's tricky for these algorithms to deal with. 00:27:02.100 |
What happens is it's easy for them to oscillate, 00:27:06.100 |
get stuck in that saddle and oscillate back and forth. 00:27:14.100 |
You get so happy that you found this low point, 00:27:20.100 |
that you forget that there's a much lower point. 00:27:25.100 |
the momentum of the gradient keeps rocking you back and forth 00:27:28.100 |
when you could be descending to a much better global minimum. 00:27:32.100 |
And there's a bunch of clever ways of solving that. 00:27:39.100 |
But in this case, as long as the gradients don't vanish, 00:27:51.100 |
That might take a little while, but they'll get you there. 00:28:00.100 |
you're dealing with a function that's non-convex, 00:28:03.100 |
and how do we ensure anything about it converging 00:28:24.100 |
is that it can represent these arbitrarily complex functions. 00:28:36.100 |
But the reason people refer to neural networks training as art 00:28:58.100 |
So the thing is, we're dealing with functions 00:29:03.100 |
where we don't know what the global optimal is. 00:29:31.100 |
and that becomes a really nonlinear objective function 00:29:34.100 |
for which you don't know what the optimal is. 00:30:27.100 |
how can you tell when you're actually training a net 00:30:30.100 |
whether you're facing the vanishing gradient problem 00:30:47.100 |
when you've hit the vanishing gradient problem? 00:31:12.100 |
because they're like, your network is just going crazy, 00:31:34.100 |
there's a lot of research in trying to figure out 00:32:08.100 |
by running it through the entire training set. 00:32:19.100 |
and it's not converging to anything reasonable. 00:32:26.100 |
and that's an indication that something is wrong. 00:32:28.100 |
That something could be the loss function is bad, 00:32:31.100 |
that something could be that you've already found the optimal, 00:32:33.100 |
or that something could be the vanishing gradient. 00:32:45.100 |
at least some fraction of the neurons need to be firing. 00:32:49.100 |
Otherwise, the initialization is really poorly done. 00:32:52.100 |
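One way you might implement that kind of sanity check, sketched as an assumption rather than anything from the class: run a batch through a layer and look at the fraction of ReLU units that actually fire.

```python
import numpy as np

def fraction_active(hidden_activations):
    """Fraction of ReLU units that fired (output > 0) over a batch.

    hidden_activations: array of shape (batch_size, num_units),
    taken after the ReLU of whichever layer you want to inspect.
    """
    return float((hidden_activations > 0).mean())

# Illustrative usage with fake activations:
batch = np.maximum(0.0, np.random.randn(128, 256) - 0.1)
frac = fraction_active(batch)
if frac < 0.05:   # threshold is an arbitrary illustrative choice
    print(f"only {frac:.1%} of units firing -- check initialization / learning rate")
else:
    print(f"{frac:.1%} of units firing")
```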
Okay, so to reflect on the simplicity of backpropagation, 00:33:02.100 |
this kind of step of backpropagating the loss function 00:33:12.100 |
it's really the only way that we've effectively been able 00:33:16.100 |
to train a neural network to learn a function. 00:33:23.100 |
the huge number of weights and biases, the parameters, 00:33:34.100 |
So the question is whether this process is just like fitting, 00:33:41.100 |
adjusting the parameters of a highly nonlinear function 00:33:58.100 |
You have to think about the, for driving purposes, 00:34:10.100 |
You're not evolving any of the edges, the layers, 00:34:18.100 |
And so there are other optimization approaches 00:34:37.100 |
so this is falling out of the field of evolutionary robotics, 00:35:03.100 |
in simulation obviously, to walk and to swim. 00:35:12.100 |
But you could, the nice thing here is the dynamics, 00:35:19.100 |
that controls the dynamics of this weird shaped robot, 00:35:26.100 |
is the same kind of thing as the neural network. 00:35:28.100 |
And in fact, people have applied genetic algorithms 00:35:33.100 |
all kinds of sort of nature-inspired algorithms 00:35:38.100 |
But they don't seem to currently work that well. 00:35:40.100 |
But it's kind of, it's a cool idea to be using 00:35:45.100 |
to evolve something that's already nature-inspired, 00:35:59.100 |
And the question is whether general intelligence reasoning 00:36:50.100 |
where the hidden states are connected to each other 00:36:53.100 |
means that as opposed to taking a single input, 00:36:57.100 |
the network takes an arbitrary number of inputs. 00:37:08.100 |
And so depending on the duration of the sequences you're interested in, 00:37:14.100 |
you can think of this network in its unrolled state. 00:37:32.100 |
And it becomes like a regular neural network, 00:37:40.100 |
The parameters, again, there's weights, there's biases. 00:37:44.100 |
It's similar to CNNs, convolutional neural networks, 00:37:48.100 |
in that it's just like convolutional neural networks 00:37:51.100 |
make certain spatial consistency assumptions. 00:37:55.100 |
The recurrent neural networks assume temporal consistency 00:37:55.100 |
That W, that U, that V is the same for every single time step. 00:38:02.100 |
And that allows you to look at arbitrarily long sequences 00:38:29.100 |
And this process is the same exact process that's repeated 00:38:32.100 |
based on the different variants that we talked about before 00:38:40.100 |
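Concretely, the unrolled computation can be sketched in a few lines of numpy: one W for hidden-to-hidden, one U for input-to-hidden, one V for hidden-to-output, reused at every time step. The shapes and data here are illustrative assumptions.

```python
import numpy as np

hidden_size, input_size, output_size = 8, 4, 3
rng = np.random.default_rng(0)

# One set of parameters, reused at every time step
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden
U = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input  -> hidden
V = rng.normal(scale=0.1, size=(output_size, hidden_size))  # hidden -> output
b = np.zeros(hidden_size)

def rnn_forward(inputs):
    """Unroll a vanilla RNN over a list of input vectors of any length."""
    h = np.zeros(hidden_size)
    outputs = []
    for x in inputs:                       # arbitrary sequence length
        h = np.tanh(W @ h + U @ x + b)     # same W, U at every step
        outputs.append(V @ h)              # same V at every step
    return outputs, h

sequence = [rng.normal(size=input_size) for _ in range(10)]
outputs, final_hidden = rnn_forward(sequence)
print(len(outputs), outputs[-1].shape)     # 10 (3,)
```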
And the backpropagation process is exactly the same 00:38:45.100 |
It has a fancy name of backpropagation through time, BPTT. 00:38:50.100 |
But it's just backpropagation through an unrolled, 00:39:00.100 |
where the errors are computed on the outputs, 00:39:18.100 |
Now the problem is that the depth of these networks 00:39:22.100 |
So if at any point the gradient hits a low number, 0, 00:39:34.100 |
that gradient drives all the earlier layers to 0. 00:39:43.100 |
you're really ignoring the majority of the sequence. 00:39:43.100 |
W_hh, if the weights are such that they produce, 00:40:05.100 |
So that's the pseudo-code for backpropagation, 00:40:35.100 |
And so you get these things with exploding and vanishing gradients 00:40:41.100 |
an error surface for a single hidden unit RNN. 00:40:47.100 |
the value of the weight, the value of the bias, 00:40:52.100 |
So the error could be really flat or could explode. 00:41:02.100 |
either making steps that are too gradual or too big. 00:41:08.100 |
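A back-of-the-envelope numpy sketch of why that happens: backpropagation through time keeps multiplying the gradient by the same recurrent matrix, so its norm shrinks or grows geometrically depending on the scale of those weights (the matrices here are illustrative).

```python
import numpy as np

# Repeatedly multiplying the gradient by the same recurrent matrix W_hh
# (ignoring the activation's derivative, for simplicity) either shrinks it
# toward zero or blows it up, depending on the scale of the weights.
rng = np.random.default_rng(0)
grad = rng.normal(size=8)

for scale, label in [(0.05, "small recurrent weights"), (0.9, "large recurrent weights")]:
    W_hh = rng.normal(scale=scale, size=(8, 8))
    g = grad.copy()
    for _ in range(50):                  # 50 steps of backprop through time
        g = W_hh.T @ g
    print(label, np.linalg.norm(grad), "->", np.linalg.norm(g))
```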
Okay, what other variants that we'll look at a little bit 00:41:16.100 |
So there could be edges going forward and edges going back. 00:41:34.100 |
And generally, as always is the case in neural networks, 00:41:38.100 |
So this is that deep referring to the number of layers 00:41:58.100 |
Each of those layers has its own set of weights 00:42:25.100 |
the reality is they're not really good at remembering 00:42:50.100 |
The recurrent neural networks can learn to generate apple 00:42:53.100 |
because they've seen a lot of sentences with Bob and eating, 00:42:59.100 |
For a longer sentence, like Bob likes apples, 00:43:30.100 |
It's difficult to propagate the important stuff 00:43:35.100 |
in order to maintain that context in generating apple 00:43:39.100 |
or classifying some concept that happened way down the line. 00:43:44.100 |
So when people talk about recurrent neural networks, 00:44:00.100 |
So all the impressive results on time series, 00:44:07.100 |
And so again, vanilla RNNs up on top of the slide, 00:44:21.100 |
Here we'll use tanh as the activation function. 00:44:27.100 |
It's just another popular sigmoid-type activation function. 00:44:40.100 |
but in some ways they're more intuitive for us to understand. 00:44:51.100 |
In yellow are different neural network layers. 00:44:55.100 |
Sigma and tanh are different types of activation functions. 00:45:00.100 |
Tanh is an activation function that squishes the input 00:45:08.100 |
A sigmoid function squishes it between 0 and 1, 00:45:16.100 |
There are some pointwise operations, addition, multiplication, 00:45:31.100 |
There's concatenation and there's a copy operation on the output. 00:45:35.100 |
So we copy, the output of each cell is copied to the next cell 00:45:54.100 |
There's this conveyor belt going through inside each individual cell. 00:46:00.100 |
There's really three steps in the conveyor belt. 00:46:31.100 |
the output of the previous cell, previous time step, 00:46:35.100 |
and deciding do I want to keep that in my memory or not, 00:46:39.100 |
and do I want to integrate the new input into my memory or not. 00:46:44.100 |
So this allows you to be selective about the information which you learn. 00:46:49.100 |
So for example, the sentence "Bob and Alice are having lunch." 00:46:53.100 |
Bob likes apples, Alice likes oranges, she's eating an orange. 00:47:06.100 |
Right now, if you say you had a hidden state, 00:47:10.100 |
keeping track of the gender of the person we're talking about. 00:47:16.100 |
You might say that there's both genders in the first sentence, 00:47:19.100 |
there's male in the second sentence, female in the third sentence. 00:47:23.100 |
That way, when you have to generate a sentence about who's eating what, 00:47:32.100 |
in order to make an accurate generation of text 00:47:49.100 |
but you have to remember that Alice likes oranges. 00:47:54.100 |
So you have to selectively remember and forget certain things. 00:48:00.100 |
So you decide what to forget, decide what to remember, 00:48:09.100 |
All right, so zoom in a little bit, because this is pretty cool. 00:48:23.100 |
like what the gender that we're currently talking about, 00:48:28.100 |
that's the state that you're keeping track of, 00:48:42.100 |
but it's 1 when you want that information to go through, 00:49:03.100 |
You take the inputs from the previous time step, 00:49:06.100 |
the input to the network from the current time step, 00:49:10.100 |
and decide do I want to forget, do I want to ignore those. 00:49:15.100 |
Then we decide which part of the state to update. 00:49:21.100 |
What part of our memory do we update with this information, 00:49:37.100 |
So that's where you have the sigmoid function is 00:49:44.100 |
When it's 1, that information passes through. 00:49:49.100 |
And finally, we produce an output from the cell. 00:49:58.100 |
it's producing an output in the English language 00:50:03.100 |
And then that same output is copied to the next cell. 00:50:13.100 |
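Putting those steps together, here is a minimal numpy sketch of one LSTM cell step in the standard formulation, forget gate, input gate, cell-state update, output gate; the shapes and random parameters are illustrative, not the lecture's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 8, 4
rng = np.random.default_rng(0)

# One weight matrix and bias per gate, acting on [previous hidden, current input]
def make_params():
    return (rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size)),
            np.zeros(hidden_size))

(Wf, bf), (Wi, bi), (Wc, bc), (Wo, bo) = (make_params() for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    concat = np.concatenate([h_prev, x])
    f = sigmoid(Wf @ concat + bf)          # forget gate: what to drop from memory
    i = sigmoid(Wi @ concat + bi)          # input gate: what new info to let in
    c_tilde = np.tanh(Wc @ concat + bc)    # candidate values for the cell state
    c = f * c_prev + i * c_tilde           # the "conveyor belt" update
    o = sigmoid(Wo @ concat + bo)          # output gate: what to expose
    h = o * np.tanh(c)                     # hidden state passed to the next cell
    return h, c

h = c = np.zeros(hidden_size)
for x in rng.normal(size=(5, input_size)): # a 5-step illustrative sequence
    h, c = lstm_step(x, h, c)
print(h.shape, c.shape)                    # (8,) (8,)
```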
Okay, so what can we get done with this kind of approach? 00:50:27.100 |
Is it like a floating point or is it like a vector? 00:50:33.100 |
The state is the activation multiplied by the weight. 00:50:41.100 |
So it's the outputs of the sigmoid or the tanh activations. 00:50:46.100 |
So there's a bunch of neurons and they're firing a number 00:50:55.100 |
It's just calling it state is sort of simplifying. 00:50:58.100 |
But the point is that there's a bunch of numbers 00:51:00.100 |
being constantly modified by the weights and the biases. 00:51:09.100 |
And the modification of those numbers is controlled by the weights. 00:51:16.100 |
the resulting output of the recurrent neural network 00:51:22.100 |
and the errors are back-propagated through the weights. 00:51:30.100 |
So machine translation is one popular application. 00:51:46.100 |
You have some inputs, whatever language that is again. 00:51:58.100 |
And the output, so the inputs are in one language, 00:52:03.100 |
a set of characters that compose a word in one language. 00:52:19.100 |
There's a ton of great work on machine translation. 00:52:23.100 |
It's what Google is mostly using for their translator. 00:52:32.100 |
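A minimal sketch of that encoder-decoder (sequence-to-sequence) structure, using a plain tanh RNN cell instead of the LSTMs described here just to keep it short; the vocabularies, sizes, and parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, src_vocab, tgt_vocab = 16, 20, 25

W_enc = rng.normal(scale=0.1, size=(hidden, hidden + src_vocab))
W_dec = rng.normal(scale=0.1, size=(hidden, hidden + tgt_vocab))
W_out = rng.normal(scale=0.1, size=(tgt_vocab, hidden))

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

def translate(src_ids, max_len=10, start_id=0):
    # Encoder: compress the whole source sentence into one hidden vector
    h = np.zeros(hidden)
    for i in src_ids:
        h = np.tanh(W_enc @ np.concatenate([h, one_hot(i, src_vocab)]))
    # Decoder: generate target tokens one at a time, feeding each back in
    out_ids, prev = [], start_id
    for _ in range(max_len):
        h = np.tanh(W_dec @ np.concatenate([h, one_hot(prev, tgt_vocab)]))
        prev = int(np.argmax(W_out @ h))   # greedy pick of the next token
        out_ids.append(prev)
    return out_ids

print(translate([3, 7, 1, 12]))            # untrained, so the output is arbitrary
```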
Same exact thing, LSTMs, generating handwritten characters, 00:52:43.100 |
where the input is text and the output is handwriting. 00:53:09.100 |
and the tradition of ancient human reproduction. 00:53:18.100 |
where you see there's an encoding of the characters 00:53:55.100 |
But the point is it just keeps generating text, 00:54:15.100 |
so you could almost arbitrarily stack things together. 00:54:18.100 |
So you take an image as an input, bottom left there, 00:54:33.100 |
It's to broaden the representative meaning of the words. 00:54:40.100 |
So you want to take the word embeddings and the image 00:54:43.100 |
and produce your best estimate of the answer. 00:55:00.100 |
and ask the question how many chairs are there 00:55:07.100 |
So I should say that this is really hard, right? 00:55:26.100 |
You can detect the different objects in the scene, 00:55:33.100 |
stitch them together in syntactically correct sentences 00:55:43.100 |
The first is computer vision detecting the objects, 00:55:46.100 |
segmenting the image and detecting the objects. 00:56:24.100 |
and then you start generating words about that video. 00:56:37.100 |
And because the input and the output is arbitrary, 00:56:41.100 |
there also has to be indicators of the beginnings 00:56:51.100 |
In order to generate syntactically correct sentences, 00:57:01.100 |
So you can also, again, recurrent neural networks, 00:57:15.100 |
that's used to classify what's contained in that image. 00:57:19.100 |
So here, a CNN being steered by a recurrent neural network 00:57:28.100 |
into the number that's associated with the house number. 00:57:35.100 |
And that visual attention can be used to steer 00:57:39.100 |
and it can be used to steer a network for the generation. 00:57:54.100 |
where the output at every time step is visual. 00:58:26.100 |
that has convolutional layers for every single frame. 00:58:41.100 |
The training set is a person hitting an object with a drumstick 00:58:49.100 |
and your task is to generate, given a silent video, 00:58:53.100 |
generate the sound that a drumstick would make 00:59:08.100 |
where it's been really successful and pretty cool. 00:59:11.100 |
But it's also beginning to be applied in places 00:59:14.100 |
where it can actually really help civilization, 00:59:32.100 |
and a variable length sequence of information 00:59:48.100 |
and you can look at it as a sequence over a period of time. 00:59:54.100 |
the output is a diagnosis, a medical diagnosis. 01:00:00.100 |
So in this case, we can look at predicting diabetes, 01:00:13.100 |
There's something that all of us wish we could do 01:00:27.100 |
well, first of all, you can input the raw stock data, 01:00:30.100 |
right, the order books and so on, financial data. 01:00:51.100 |
binary prediction, whether the stock will go up or down. 01:00:56.100 |
And nobody's been able to really successfully do this, 01:01:38.100 |
This exact same process you generate language, 01:01:52.100 |
and you just learn that's raw audio of the speaker. 01:02:40.100 |
it's producing something that sounds like words. 01:03:10.100 |
And there's a lot of work in voice recognition 01:03:22.100 |
You're mapping any kind of audio to a classification. 01:03:35.100 |
and that's a spectrogram on the bottom there being shown, 01:04:04.100 |
so let's see where recurrent neural networks apply in driving. 01:04:04.100 |
There's five convolutional layers in their approach, 01:04:24.100 |
You can add as many layers as you want in DeepTesla. 01:04:29.100 |
So, that's a quarter million parameters to optimize. 01:04:40.100 |
That's the approach, that's the DeepTesla way. 01:04:49.100 |
and learning a regression of a steering angle. 01:04:53.100 |
Now, so one of the prizes for the competition 01:04:59.100 |
is the Udacity self-driving car engineer nanodegree for free. 01:05:17.100 |
But they have a very large group of obsessed people, 01:05:25.100 |
They went beyond just convolutional neural networks 01:05:28.100 |
So, taking a sequence of images and predicting steering. 01:05:34.100 |
at least the first, and I'll talk about the second place winner tomorrow, 01:05:43.100 |
But the first and the third place winners used RNNs, 01:06:00.100 |
anybody here who's not a computer vision person, 01:06:05.100 |
for whatever application you're interested in 01:06:09.100 |
It's just the world is full of time series data. 01:06:26.100 |
But most data, the world is time series data. 01:06:29.100 |
So this is the approach you will end up using 01:06:32.100 |
if you want to apply it in your own research. 01:07:04.100 |
A powerful way of doing that is convolutional neural networks. 01:07:15.100 |
ones that take time into consideration and ones that don't. 01:07:15.100 |
for the input and the output is a sequence length of 10. 01:07:57.100 |
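A hedged sketch of that kind of architecture, a per-frame convolutional encoder feeding an LSTM that regresses steering over a 10-frame window, written with the Keras API; the layer sizes are guesses for illustration, not the winners' actual models.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

SEQ_LEN, H, W, C = 10, 66, 200, 3   # 10-frame window; image size is illustrative

# Per-frame convolutional feature extractor (sizes are made up for the sketch)
frame_encoder = models.Sequential([
    layers.Conv2D(24, 5, strides=2, activation="relu", input_shape=(H, W, C)),
    layers.Conv2D(36, 5, strides=2, activation="relu"),
    layers.Conv2D(48, 5, strides=2, activation="relu"),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
])

model = models.Sequential([
    # Apply the same CNN to each of the 10 frames in the input sequence
    layers.TimeDistributed(frame_encoder, input_shape=(SEQ_LEN, H, W, C)),
    layers.LSTM(64),                 # temporal dynamics across the window
    layers.Dense(1),                 # regression output: steering angle
])

model.compile(optimizer="adam", loss="mse")
model.summary()
```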
The question was, do they use supervised learning? 01:08:00.100 |
Yep. So they were given the same thing as in DeepTesla. 01:08:03.100 |
A sequence of frames, whether you have a sequence of 01:08:08.100 |
I think there's other information too, available. 01:08:11.100 |
So yeah, there's no reinforcement learning here. 01:08:28.100 |
how many LSTM gates are there in this problem? 01:08:47.100 |
But it's arbitrary, just like convolutional neural networks are arbitrary. 01:08:54.100 |
the size of the sigmoid and tanh functions is arbitrary. 01:08:54.100 |
So you can make it as large as you want, as deep as you want. 01:09:11.100 |
I encourage you if you're interested in machine learning 01:09:16.100 |
I don't know how to pronounce it, competitions, 01:09:19.100 |
where basically everyone's doing the same thing. 01:09:21.100 |
You're using LSTMs, or if it's one-to-one mapping, 01:09:25.100 |
using convolutional neural networks, fully connected networks, 01:09:34.100 |
that's what you'll be doing, your own research, 01:09:39.100 |
playing with the different parameters that control 01:09:46.100 |
all these kinds of things, that's what you're playing with. 01:10:18.100 |
you said that there is a memory of 10 in this LSTM, 01:10:26.100 |
and I thought RNNs are supposed to be arbitrary, or whatever. 01:10:42.100 |
you only have one cell that's looping onto itself. 01:10:42.100 |
in which you're doing the training and then the testing? 01:10:56.100 |
So, you don't have to, it can be arbitrary length, 01:11:23.100 |
and it's something I don't think I mentioned, 01:11:34.100 |
so, first, you need a lot of data to do anything. 01:11:37.100 |
So that's a cost, that's a limitation of neural networks. 01:11:43.100 |
so there's neural networks that have been trained 01:11:56.100 |
all these networks that train on huge amounts of data. 01:11:59.100 |
But those networks are trained to tell the difference 01:12:05.100 |
or the specific object recognition in single images. 01:12:08.100 |
How do I then take that network and apply it to my problem, 01:12:14.100 |
or classifying medical diagnosis of cancer or not. 01:12:32.100 |
the fully connected layer that maps from all those 01:12:36.100 |
cool high-dimensional features that you've learned about the visual space, 01:12:57.100 |
or the data space, like if it's audio or whatever, 01:13:00.100 |
if it's similar, if the features are useful that you learned, 01:13:04.100 |
in studying the problem of cat versus dog deeply, 01:13:07.100 |
you have learned actually how to see the world. 01:13:13.100 |
you can transfer that learning to another domain. 01:13:17.100 |
And that's the beautiful power of neural networks, 01:13:27.100 |
I didn't spend enough time looking through the code, 01:13:31.100 |
so I'm not sure which of the giant networks they took, 01:13:34.100 |
but they took a giant convolutional neural network, 01:13:38.100 |
they pruned it down to, they chopped off the end layer, 01:13:57.100 |
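As a hedged sketch of that chop-off-the-last-layers idea, here using the Keras API with VGG16 as a stand-in pretrained network (not necessarily the one they used) and a made-up binary task:

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Load a network pretrained on ImageNet and drop its final classification layers
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False               # freeze the learned visual features

# Attach a new head for your own problem (binary example, e.g. cancer / not)
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(my_images, my_labels, epochs=5)   # train only the new head on your data
```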
So this process is pretty similar across domains, 01:14:24.100 |
But the art of neural networks is in the hyperparameter tuning, 01:14:47.100 |
refers to it as stochastic graduate student descent. 01:14:52.100 |
Meaning you just keep hiring graduate students 01:15:02.100 |
So I have about 100 plus slides on driver state, 01:15:11.100 |
which is the thing that I'm most passionate about, 01:15:15.100 |
and I think we'll save the best for last, right? 01:15:21.100 |
We have a guest speaker from the White House, 01:15:25.100 |
who'll talk about the future of artificial intelligence 01:15:40.100 |
can we just set up boxes right here or something?