Deep Learning State of the Art (2019) - MIT
Chapters
0:00 Introduction
2:00 BERT and Natural Language Processing
14:0 Tesla Autopilot Hardware v2+: Neural Networks at Scale
16:25 AdaNet: AutoML with Ensembles
18:32 AutoAugment: Deep RL Data Augmentation
22:53 Training Deep Networks with Synthetic Data
24:37 Segmentation Annotation with Polygon-RNN
26:39 DAWNBench: Training Fast and Cheap
29:06 BigGAN: State of the Art in Image Synthesis
30:14 Video-to-Video Synthesis
32:12 Semantic Segmentation
36:03 AlphaZero & OpenAI Five
43:34 Deep Learning Frameworks
44:40 2019 and beyond
00:00:00.000 |
The thing I would very much like to talk about today is 00:00:08.400 |
really at the height of some of the great accomplishments that have happened 00:00:17.500 |
where this incredible data-driven technology takes us. 00:00:23.200 |
the breakthroughs that happened in 2017 and 2018 that take us to this point. 00:00:34.600 |
the state of the art results on main machine learning benchmarks. 00:00:38.600 |
So the various image classification, object detection, 00:00:43.100 |
or the NLP benchmarks, or the GAN benchmarks. 00:00:51.100 |
the code that's available on GitHub that performs best on a particular benchmark. 00:00:58.800 |
Ideas and developments that are at the cutting edge 00:01:02.800 |
of what defines this exciting field of deep learning. 00:01:06.100 |
And so I'd like to go through a bunch of different areas 00:01:11.200 |
Now of course this is also not a lecture that's complete. 00:01:15.200 |
There are other things that I may be totally missing 00:01:20.200 |
that are particularly exciting to people here, people beyond. 00:01:24.200 |
For example, medical applications of deep learning, 00:01:30.400 |
and protein folding, and all kinds of applications 00:01:34.200 |
where there have been some exciting developments. 00:01:39.400 |
So forgive me if your favorite developments are missing, 00:01:43.200 |
but hopefully this encompasses some of the really exciting developments, 00:01:48.600 |
both on the theory side, on the application side, 00:01:51.600 |
and on the community side of all of us being able to work together 00:02:03.600 |
Many have described this year as the ImageNet moment for natural language processing. 00:02:12.400 |
In 2012, AlexNet was the first neural network that really gave that big jump 00:02:20.400 |
with deep learning, with purely learning-based methods. 00:02:23.200 |
In the same way, there's been a series of developments 00:02:31.400 |
culminating with the development of BERT, which has made, on benchmarks 00:02:39.200 |
and in our ability to solve various 00:02:44.200 |
natural language processing tasks, a total leap. 00:02:48.000 |
So let's tell the story of what takes us there. 00:02:54.600 |
The story starts with encoder-decoder recurrent neural networks, 00:03:04.000 |
which encode sequences of data and output something: 00:03:09.000 |
either a single prediction or another sequence. 00:03:13.000 |
When the input sequence and the output sequence are of different lengths, as when 00:03:21.200 |
we have to translate from one language to another. 00:03:42.400 |
The encoder takes the input sentence and encodes that sentence into a single vector, 00:03:50.600 |
a compressed summary of what it represents, a representation of that sentence. 00:04:16.000 |
So you encode by taking the input sentence and mapping it to a fixed-size vector representation, 00:04:20.400 |
and then you decode by taking that fixed-size vector representation 00:04:26.400 |
and generating an output sentence that can be of a different length than the input sentence. 00:04:32.800 |
It's been very effective for machine translation 00:04:36.400 |
and dealing with arbitrary length input sequences, 00:04:49.600 |
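To make that concrete, here's a minimal sketch of the encoder-decoder idea in PyTorch; the model, names, and sizes are my own illustration, not code from the lecture.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, hidden=256):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, hidden)
        self.tgt_embed = nn.Embedding(tgt_vocab, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_tokens, tgt_tokens):
        # Encode: squeeze the whole source sentence into one
        # fixed-size vector (the encoder's final hidden state).
        _, context = self.encoder(self.src_embed(src_tokens))
        # Decode: generate the target sequence conditioned only on that
        # single context vector; its length may differ from the input's.
        dec_out, _ = self.decoder(self.tgt_embed(tgt_tokens), context)
        return self.out(dec_out)  # per-step vocabulary logits
```

That single fixed-size `context` vector is exactly the bottleneck the lecture goes on to address with attention.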
Attention is an improvement on the encoder-decoder architecture 00:05:00.200 |
that allows the decoder to look back at the input sequence. 00:05:04.600 |
So you have a sequence that's the input sentence, 00:05:11.600 |
and you're allowed to look back at the particular samples from it. 00:06:07.200 |
Each step produces a hidden state that captures a representation of the input, 00:06:22.600 |
and those states are then used by the decoder to translate, 00:09:01.000 |
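Here's a sketch of that attention step, assuming simple dot-product scoring (one common choice; the lecture doesn't pin down a particular formulation):

```python
import torch
import torch.nn.functional as F

def attend(decoder_state, encoder_states):
    # decoder_state: (batch, hidden); encoder_states: (batch, src_len, hidden)
    scores = torch.bmm(encoder_states, decoder_state.unsqueeze(2))  # (batch, src_len, 1)
    weights = F.softmax(scores.squeeze(2), dim=1)    # how much to look back at each input token
    context = torch.bmm(weights.unsqueeze(1), encoder_states)  # weighted sum of hidden states
    return context.squeeze(1), weights               # (batch, hidden), (batch, src_len)
```

At every decoding step the decoder gets a fresh weighted view of all the encoder's hidden states, rather than one fixed vector.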
that's meaningful for that kind of understanding. 00:09:07.600 |
And so the traditional Word2Vec process of embedding 00:10:05.200 |
is to train a network on a prediction task and just take the hidden representation formed in the middle; 00:10:08.000 |
that's how you form this compressed embedding, 00:10:23.000 |
where words that have nothing to do with each other are far away. 00:10:34.400 |
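As a rough illustration of that process, here's a toy skip-gram-style setup where the hidden embedding layer in the middle is the representation you keep; the vocabulary size, dimensions, and single-context simplification are mine:

```python
import torch
import torch.nn as nn

vocab, dim = 10_000, 300
embed = nn.Embedding(vocab, dim)   # the hidden representation in the middle
predict = nn.Linear(dim, vocab)    # predicts a surrounding context word

loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(list(embed.parameters()) + list(predict.parameters()), lr=0.05)

def train_step(center_ids, context_ids):
    # After training, rows of embed.weight for related words end up
    # close together, and unrelated words end up far apart.
    loss = loss_fn(predict(embed(center_ids)), context_ids)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```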
so looking not just at the sequence that led up to the word 00:10:38.200 |
but also the sequence that followed, not only the sequence before. 00:10:48.600 |
In learning the rich, full context of the word, 00:11:18.000 |
for sentence classification, sentence comparison, and so on, 00:11:20.400 |
and translation, that representation is much more effective. 00:11:39.400 |
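A sketch of that bidirectional idea, in the spirit of ELMo-style contextual embeddings, using a BiLSTM (a toy illustration, not the actual ELMo architecture):

```python
import torch
import torch.nn as nn

vocab, dim, hidden = 10_000, 128, 128
embed = nn.Embedding(vocab, dim)
bilstm = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)

token_ids = torch.randint(0, vocab, (1, 12))  # one 12-token sentence
contextual, _ = bilstm(embed(token_ids))      # (1, 12, 2*hidden)
# contextual[:, i] encodes word i together with the words before AND after it,
# so the same word gets different vectors in different sentences.
```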
decoder with attention looking back at the input sequence 00:12:08.200 |
With the transformer formulation, 00:12:23.600 |
it takes in the full sequence of the sentence, 00:12:36.400 |
masks out 15% of the samples, the tokens from the sequence, and learns to predict them. 00:12:52.400 |
That's the construct, and then you stack a ton of them together, 00:13:02.000 |
and that allows you to learn the rich context of the language 00:13:21.400 |
in a space that's very efficient to reason with. 00:13:37.200 |
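A sketch of that masked-language-model construct; the mask id is a placeholder and -100 is just PyTorch's default ignore label for cross-entropy, not the real BERT tokenizer's conventions:

```python
import torch

MASK_ID = 103  # placeholder id for a [MASK] token

def mask_tokens(token_ids, mask_prob=0.15):
    labels = token_ids.clone()
    hidden = torch.rand(token_ids.shape) < mask_prob  # pick ~15% of tokens
    labels[~hidden] = -100       # only score the hidden positions
    inputs = token_ids.clone()
    inputs[hidden] = MASK_ID     # the network must fill these back in
    return inputs, labels

# `inputs` feeds the stack of transformer blocks; cross-entropy against
# `labels` forces the model to use both left and right context of each word.
```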
Okay, I lingered on that one a little bit too long 00:13:41.200 |
but it is also the one I'm really excited about 00:13:45.800 |
and really if there's a breakthrough this year, this is it. 00:14:03.200 |
those kinds of academic developments in deep learning 00:14:19.200 |
is an implementation of the NVIDIA Drive PX2 system 00:15:25.600 |
and the control decision based on those perceptions 00:15:46.600 |
over one billion miles have been driven in autopilot. 00:16:01.600 |
as far as we know, that was not using a neural network 00:16:05.600 |
that was learning, at least not learning online, in the Teslas. 00:16:14.600 |
The hardware version two has a neural network 00:22:39.600 |
meaningful, complex, rich representations from 00:23:50.600 |
do things that can't possibly happen in reality 00:27:27.600 |
and the training at the least cost in dollars 00:28:15.600 |
based on the error the neural network observes, the weights are adjusted 00:28:30.600 |
according to the learning rate, which is a parameter of the optimization process. 00:30:37.600 |
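As one concrete example of tuning that parameter, here's a sketch of a cyclical learning-rate schedule, the kind of trick used in fast DAWNBench entries (the model and numbers here are arbitrary):

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
sched = torch.optim.lr_scheduler.CyclicLR(
    opt, base_lr=1e-3, max_lr=1e-1, step_size_up=200)

for step in range(1000):
    x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
    loss = F.cross_entropy(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
    sched.step()  # the learning rate rises and falls over each cycle
```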
Video-to-Video Synthesis, which a few people have been asking about, 00:30:49.600 |
tackles the temporal consistency, the temporal dynamics 00:32:23.600 |
Think of basic image classification, where the input is an image 00:32:25.600 |
and the output is a classification of what's going on in it. 00:32:47.600 |
Inside, the network forms a representation that can then be used for all kinds of tasks, 00:35:27.600 |
and that's what's producing the state of the art performances 00:35:55.600 |
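A minimal sketch of the per-pixel idea behind semantic segmentation: downsample to build features, then upsample back so every pixel gets a class score. This toy network is mine, not any specific published architecture:

```python
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    def __init__(self, num_classes=21):
        super().__init__()
        self.encode = nn.Sequential(  # downsample, build features
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decode = nn.Sequential(  # upsample back to full resolution
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, num_classes, 4, stride=2, padding=1),
        )

    def forward(self, image):                    # (B, 3, H, W)
        return self.decode(self.encode(image))   # (B, classes, H, W)
```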
Games are one of the most commonly used testbeds for the task of training 00:36:35.600 |
learning methods that are taking in just the raw 00:36:55.600 |
pixels and have to understand the sort of physics of the game sufficiently to be able to win. 00:37:15.600 |
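A sketch of that raw-pixels-in, actions-out setup, in the spirit of DQN-style agents (a toy network with made-up sizes, not the actual systems discussed):

```python
import torch
import torch.nn as nn

q_net = nn.Sequential(
    nn.Conv2d(4, 32, 8, stride=4), nn.ReLU(),  # 4 stacked grayscale frames
    nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
    nn.Flatten(),
    nn.LazyLinear(512), nn.ReLU(),
    nn.Linear(512, 6),                         # one value per game action
)

frames = torch.rand(1, 4, 84, 84)              # raw screen pixels
action = q_net(frames).argmax(dim=1)           # pick the highest-value action
```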
AlphaGo was able to beat the top world champion. 00:37:43.600 |
more and more and more which is why AlphaZero 00:38:33.600 |
was able to beat it with just 4 hours of training 00:38:45.600 |
an undergraduate student sitting in their dorm room 00:38:53.600 |
could very quickly learn to beat the state of the art. 00:39:11.600 |
possibly make, and so the farther along you look down the tree, 00:39:25.600 |
the better you can determine which action is the most optimal. 00:39:33.600 |
it doesn't feel like they're looking down a tree; 00:39:37.600 |
it's creative intuition, there's something there that you could 00:39:51.600 |
Stockfish, the state-of-the-art chess engine, 00:39:57.600 |
closer and closer and closer towards the human 00:40:05.600 |
estimating the quality of the move and the quality of the position 00:40:19.600 |
from less information, just like when a grandmaster looks at a board. 00:40:33.600 |
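A sketch of that look-ahead idea: expand the game tree a few moves deep and use a learned value estimate at the leaves instead of searching to the end. This is a simplified depth-limited search, not AlphaZero's actual Monte Carlo tree search, and the `state` interface (`is_terminal`, `legal_actions`, `apply`) is hypothetical:

```python
def best_action(state, value_fn, depth=2):
    """Return (action, value) for the current player via negamax search."""
    if depth == 0 or state.is_terminal():
        return None, value_fn(state)  # learned quality of the position
    best = (None, float("-inf"))
    for action in state.legal_actions():
        _, v = best_action(state.apply(action), value_fn, depth - 1)
        if -v > best[1]:              # negamax: opponent's loss is our gain
            best = (action, -v)
    return best
```

The deeper the search, the better the estimate of which action is optimal; the learned `value_fn` is what lets the search stop early, the way a grandmaster evaluates a position at a glance.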
this very structured, formal, constrained world 00:42:23.600 |
so you better believe that they're coming back 00:43:01.600 |
what's completely current, well, not completely, 00:43:19.600 |
incredibly difficult one and some people think 00:44:27.600 |
about today and Monday and we'll keep talking 00:45:09.600 |
who's deeply suspicious of everything I've said