
Deep Learning State of the Art (2019) - MIT


Chapters

0:00 Introduction
2:00 BERT and Natural Language Processing
14:00 Tesla Autopilot Hardware v2+: Neural Networks at Scale
16:25 AdaNet: AutoML with Ensembles
18:32 AutoAugment: Deep RL Data Augmentation
22:53 Training Deep Networks with Synthetic Data
24:37 Segmentation Annotation with Polygon-RNN
26:39 DAWNBench: Training Fast and Cheap
29:06 BigGAN: State of the Art in Image Synthesis
30:14 Video-to-Video Synthesis
32:12 Semantic Segmentation
36:03 AlphaZero & OpenAI Five
43:34 Deep Learning Frameworks
44:40 2019 and beyond

Whisper Transcript

00:00:00.000 | The thing I would very much like to talk about today is
00:00:02.860 | the state of the art in deep learning.
00:00:05.600 | Here we stand in 2019
00:00:08.400 | really at the height of some of the great accomplishments that have happened
00:00:12.900 | but also stand at the beginning.
00:00:14.900 | And it's up to us to define where
00:00:17.500 | this incredible data-driven technology takes us.
00:00:20.500 | And so I'd like to talk a little bit about
00:00:23.200 | the breakthroughs that happened in 2017 and 2018 that take us to this point.
00:00:29.600 | So this lecture is not on
00:00:34.600 | the state of the art results on main machine learning benchmarks.
00:00:38.600 | So the various image classification, object detection,
00:00:43.100 | or the NLP benchmarks, or the GAN benchmarks.
00:00:47.400 | This isn't about the cutting edge algorithm
00:00:51.100 | that's available on GitHub that performs best on a particular benchmark.
00:00:56.800 | This is about ideas.
00:00:58.800 | Ideas and developments that are at the cutting edge
00:01:02.800 | of what defines this exciting field of deep learning.
00:01:06.100 | And so I'd like to go through a bunch of different areas
00:01:09.600 | that I think are really exciting.
00:01:11.200 | Now of course this is also not a lecture that's complete.
00:01:15.200 | There's other things that I may be totally missing
00:01:17.700 | that happened in 2017 and 2018 that are
00:01:20.200 | particularly exciting to people here, people beyond.
00:01:24.200 | For example, medical applications of deep learning
00:01:27.400 | is something I totally don't touch on.
00:01:30.400 | And protein folding and all kinds of applications
00:01:34.200 | where there have been some exciting developments
00:01:36.400 | from DeepMind and so on that I don't touch on.
00:01:39.400 | So forgive me if your favorite developments are missing,
00:01:43.200 | but hopefully this encompasses some of the really
00:01:46.600 | fundamental things that have happened,
00:01:48.600 | both on the theory side, on the application side,
00:01:51.600 | and on the community side of all of us being able to work together
00:01:54.600 | on these kinds of technologies.
00:01:57.600 | I think 2018, in terms of deep learning,
00:02:01.200 | is the year of natural language processing.
00:02:03.600 | Many have described this year as the ImageNet moment,
00:02:08.400 | analogous to 2012 for computer vision when AlexNet
00:02:12.400 | was the first neural network that really gave that big jump
00:02:16.200 | in performance on computer vision.
00:02:17.800 | It started to inspire people what's possible
00:02:20.400 | with deep learning, with purely learning-based methods.
00:02:23.200 | In the same way, there's been a series of developments
00:02:26.600 | from 2016 and '17 that led up to '18
00:02:31.400 | and the development of BERT, which has made a total leap,
00:02:39.200 | both on benchmarks and in our ability to apply NLP
00:02:44.200 | to solve various natural language processing tasks.
00:02:48.000 | So let's tell the story of what takes us there.
00:02:50.400 | There's a few developments.
00:02:52.200 | I've mentioned a little bit on Monday
00:02:54.600 | about the encoder-decoder recurrent neural networks.
00:02:59.000 | So this idea of recurrent neural networks
00:03:04.000 | encode sequences of data and output something.
00:03:09.000 | Output either a single prediction or another sequence.
00:03:13.000 | When the input sequence and the output sequence
00:03:16.000 | are not necessarily the same size,
00:03:19.200 | they're like in machine translation.
00:03:21.200 | We have to translate from one language to another.
00:03:24.600 | The encoder-decoder architecture
00:03:29.000 | takes the following process.
00:03:31.000 | It takes in the sequence of words
00:03:33.400 | or the sequence of samples as the input
00:03:36.600 | and uses the recurrent units,
00:03:38.800 | whether it's LSTM or GRUs or beyond,
00:03:42.400 | and encodes that sentence into a single vector.
00:03:47.200 | So forms an embedding of that sentence
00:03:50.600 | of what it represents, a representation of that sentence.
00:03:54.600 | And then feeds that representation
00:03:58.600 | into the decoder recurrent neural network
00:04:01.600 | that then generates the sequence of words
00:04:07.600 | that form the sentence in the language
00:04:10.800 | that's being translated to.
00:04:13.000 | So first you encode by taking a sequence
00:04:16.000 | and mapping it to a fixed size vector representation
00:04:20.400 | and then you decode by taking that fixed size vector representation
00:04:23.600 | and unrolling it into the sentence
00:04:26.400 | that can be of different length than the input sentence.
00:04:29.000 | Okay, that's the encoder-decoder structure
00:04:31.400 | for recurrent neural networks.
00:04:32.800 | It's been very effective for machine translation
00:04:36.400 | and dealing with arbitrary length input sequences,
00:04:40.000 | arbitrary length output sequences.
00:04:43.000 | Next step, attention.
00:04:46.200 | What is attention?
00:04:47.200 | Well, it's the next step beyond.
00:04:49.600 | It's an improvement on the encoder-decoder architecture.
00:04:56.400 | It provides a mechanism
00:05:00.200 | that allows the decoder to look back at the input sequence.
00:05:02.400 | So as opposed to saying
00:05:04.600 | that you have a sequence that's the input sentence
00:05:08.000 | and that all gets collapsed
00:05:09.600 | into a single vector representation,
00:05:11.600 | you're allowed to look back at the particular samples
00:05:14.400 | from the input sequence
00:05:17.000 | as part of the decoding process.
00:05:19.000 | That's attention.
00:05:20.400 | And you can also learn
00:05:23.000 | which aspects are important
00:05:26.000 | for which aspects of the decoding process,
00:05:28.600 | which aspects of the input sequence
00:05:30.600 | are important to the output sequence.
00:05:35.000 | Visualized another way:
00:05:37.600 | And there's a few visualizations here
00:05:40.600 | that are quite incredible
00:05:42.000 | that are done by Jay Alammar.
00:05:45.600 | I highly recommend you follow the links
00:05:49.800 | and look at the further details
00:05:52.200 | of these visualizations of attention.
00:05:54.600 | So if we look at neural machine translation,
00:05:57.200 | the encoder RNN takes a sequence of words
00:06:00.200 | and throughout, after every sequence,
00:06:03.200 | forms a set of hidden representations,
00:06:07.200 | a hidden state that captures the representation
00:06:10.200 | of the words that followed.
00:06:13.200 | And those sets of hidden representations
00:06:16.200 | as opposed to being collapsed
00:06:17.200 | to a single fixed size vector
00:06:18.800 | are then all pushed forward to the decoder
00:06:22.600 | that are then used by the decoder to translate,
00:06:25.800 | but in a selective way
00:06:27.800 | where the decoder,
00:06:29.400 | here visualized on the y-axis,
00:06:32.200 | the input language and on the x-axis,
00:06:34.000 | the output language.
00:06:37.600 | The decoder weighs the different parts
00:06:40.800 | of the input sequence
00:06:43.000 | differently in order to determine
00:06:45.800 | how to best translate,
00:06:47.400 | to generate each word that forms the translation
00:06:50.000 | in the full output sentence.
00:06:52.000 | Okay, that's attention.
00:06:54.200 | It expands the
00:06:56.000 | encoder-decoder architecture
00:06:58.600 | to allow for selective attention
00:07:04.000 | to the input sequence
00:07:05.200 | as opposed to collapsing everything down
00:07:06.800 | into a fixed representation.
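
A minimal sketch of that looking-back step, with simple dot-product scoring between the decoder state and each encoder hidden state (the scoring function and sizes are assumptions for illustration, not the exact mechanism from the papers shown):

```python
import torch
import torch.nn.functional as F

def attend(decoder_state, encoder_states):
    """decoder_state: (batch, hidden); encoder_states: (batch, src_len, hidden)."""
    # score every input position against the current decoder state
    scores = torch.bmm(encoder_states, decoder_state.unsqueeze(-1)).squeeze(-1)
    weights = F.softmax(scores, dim=-1)  # how much to look back at each input word
    # weighted sum of encoder states: the context used to generate the next word
    context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)
    return context, weights

# toy usage: 2 sentences, 5 source words, hidden size 128
context, weights = attend(torch.randn(2, 128), torch.randn(2, 5, 128))
```
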
00:07:08.800 | Okay, next step, self-attention.
00:07:12.600 | In the encoding process,
00:07:15.200 | allowing the encoder,
00:07:18.400 | in forming the hidden representations,
00:07:22.000 | to also selectively look
00:07:24.200 | at other parts of the input sequence
00:07:27.600 | in order to form those representations.
00:07:30.000 | It allows you to determine
00:07:33.600 | for certain words
00:07:36.000 | what are the important relevant aspects
00:07:38.200 | of the input sequence that can help you
00:07:40.000 | encode that word the best.
00:07:43.200 | So it improves the encoder process
00:07:45.200 | by allowing it to look
00:07:46.600 | at the entirety of the context.
00:07:48.200 | That's self-attention.
00:07:52.600 | Building on that, transformer.
00:07:55.600 | It's using the self-attention mechanism
00:07:59.000 | in the encoder
00:08:01.000 | to form these sets of representations
00:08:03.200 | on the input sequence.
00:08:04.800 | And then as part of the decoding process,
00:08:07.200 | follow the same but in reverse
00:08:09.400 | with a bunch of self-attention
00:08:11.400 | that's able to look back again.
00:08:13.400 | So it's self-attention on the encoder,
00:08:15.400 | attention on the decoder,
00:08:17.200 | and that's where the magic,
00:08:18.800 | that's where the entirety of the magic is
00:08:21.200 | that's able to capture the rich context
00:08:24.400 | of the input sequence
00:08:26.000 | in order to generate
00:08:27.800 | in a contextual way the output sequence.
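
As a rough sketch of that self-attention computation (single head, no masking, made-up dimensions; the full Transformer adds multiple heads, feed-forward layers, and positional encodings):

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (batch, seq_len, d_model); every token attends to every other token."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)  # (batch, seq, seq)
    weights = F.softmax(scores, dim=-1)  # each word's attention over the sequence
    return weights @ v                   # contextualized token representations

d_model = 64
x = torch.randn(2, 10, d_model)  # 2 sentences, 10 tokens each
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)  # (2, 10, 64)
```
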
00:08:30.600 | So let's take a step back then
00:08:32.800 | and look at what is critical
00:08:35.200 | to natural language
00:08:38.000 | in order to be able to reason about words,
00:08:40.600 | construct a language model
00:08:42.600 | and be able to reason about the words
00:08:45.200 | in order to classify a sentence,
00:08:46.800 | to translate a sentence
00:08:48.600 | or compare two sentences and so on.
00:08:52.000 | The sentences are
00:08:54.600 | collections of words or characters
00:08:57.000 | and those characters and words have to
00:08:59.200 | have an efficient representation
00:09:01.000 | that's meaningful for that kind of understanding.
00:09:03.200 | And that's what the process of embedding is.
00:09:05.600 | We talked a little bit about it on Monday
00:09:07.600 | and so the traditional Word2Vec process of embedding
00:09:11.400 | is you use some kind of trick
00:09:13.600 | in an unsupervised way to map words
00:09:16.600 | into a compressed representation.
00:09:22.200 | So language modeling is the process
00:09:26.200 | of determining which words
00:09:28.000 | follow each other usually.
00:09:29.400 | So one way you can use it
00:09:31.600 | in a skip-gram model
00:09:33.600 | taking huge data sets of words,
00:09:36.800 | you know there's writing all over the place,
00:09:38.600 | taking those data sets
00:09:40.200 | and feeding a neural network
00:09:42.200 | that in a supervised way
00:09:45.000 | looks at which words
00:09:47.200 | usually follow the input.
00:09:50.000 | So the input is a word,
00:09:51.600 | the output is which words
00:09:53.200 | are statistically likely to follow that word,
00:09:55.400 | and the same with the preceding word.
00:09:57.200 | And doing this kind of unsupervised learning
00:10:00.600 | which is what Word2Vec does,
00:10:02.800 | if you throw away the output and the input
00:10:05.200 | and just take the hidden representation formed in the middle,
00:10:08.000 | that's how you form this compressed embedding
00:10:11.000 | a meaningful representation
00:10:13.600 | such that when two words are related
00:10:15.400 | in a language modeling sense,
00:10:17.400 | they're going to be
00:10:19.800 | close to each other in that representation,
00:10:21.600 | and when they're totally unrelated,
00:10:23.000 | have nothing to do with each other, they're far away.
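
A minimal sketch of that skip-gram trick (toy vocabulary and dimensions; real Word2Vec training adds tricks like negative sampling):

```python
import torch
import torch.nn as nn

vocab_size, emb_dim = 5000, 100

# input word -> hidden embedding -> scores over which words appear nearby
skip_gram = nn.Sequential(
    nn.Embedding(vocab_size, emb_dim),  # the hidden representation we keep
    nn.Linear(emb_dim, vocab_size),     # predicts likely context words
)

center_words = torch.tensor([12, 87, 412])   # hypothetical word ids
context_words = torch.tensor([13, 90, 410])  # words observed next to them
loss = nn.CrossEntropyLoss()(skip_gram(center_words), context_words)
loss.backward()

# after training, throw away the output layer and keep only the embedding table
embeddings = skip_gram[0].weight.detach()  # (vocab_size, emb_dim)
```
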
00:10:25.600 | ELMo is the approach of using
00:10:28.200 | bi-directional LSTMs
00:10:30.400 | to learn that representation.
00:10:32.200 | And bi-directional, bi-directionally, means
00:10:34.400 | looking not just at the sequence that led up to the word
00:10:36.800 | but in both directions:
00:10:38.200 | the sequence that follows and the sequence that came before.
00:10:41.200 | And that allows you to learn
00:10:44.600 | the rich full context of the word.
00:10:48.600 | In learning the rich full context of the word
00:10:51.800 | you're forming representations
00:10:53.200 | that are much better able to represent
00:10:55.200 | the statistical language model
00:10:58.800 | behind the kind of corpus of language
00:11:02.200 | that you're looking at.
00:11:04.200 | And this has enabled a big leap
00:11:08.400 | in the ability of further algorithms
00:11:13.600 | that then use the language model
00:11:15.600 | to reason about and do things like
00:11:18.000 | sentence classification, sentence comparison, translation, and so on;
00:11:20.400 | that representation is much more effective
00:11:24.400 | for working with language.
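
A rough sketch of the bi-directional idea: a bidirectional LSTM whose per-token hidden states mix the words before and after each position (the real ELMo stacks character convolutions and multiple biLM layers; sizes here are made up):

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 5000, 128, 256
embed = nn.Embedding(vocab_size, emb_dim)
bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

tokens = torch.randint(0, vocab_size, (2, 12))  # 2 sentences, 12 tokens each
states, _ = bilstm(embed(tokens))               # (2, 12, 2 * hidden_dim)

# each token's vector now combines the forward pass (words before it)
# and the backward pass (words after it): a context-dependent embedding
```
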
00:11:26.000 | The idea of the OpenAI Transformer,
00:11:30.000 | the next step forward,
00:11:33.200 | is taking the same transformer
00:11:36.000 | that I mentioned previously,
00:11:37.600 | the encoder with self-attention,
00:11:39.400 | the decoder with attention looking back at the input sequence,
00:11:43.400 | and using it:
00:11:46.600 | taking the language learned by the decoder
00:11:50.600 | and using that as a language model,
00:11:54.600 | and then chopping off layers
00:11:56.000 | and training on a specific language task
00:11:59.800 | like sentence classification.
00:12:02.200 | Now BERT is the thing
00:12:05.200 | that made the big leap in performance.
00:12:08.200 | With the transformer formulation
00:12:11.400 | there's no bi-directional element;
00:12:13.400 | it's always moving forward
00:12:15.800 | through the encoding step and the decoding step.
00:12:18.600 | BERT, in contrast, is richly bi-directional:
00:12:23.600 | it takes in the full sequence of the sentence
00:12:29.400 | and masks out
00:12:33.400 | some percentage of the words
00:12:35.000 | 15% of the words
00:12:36.400 | 15% of the tokens from the sequence,
00:12:40.400 | and tasks the entire encoding
00:12:46.400 | self-attention mechanism to predict
00:12:50.200 | the words that are missing.
00:12:52.400 | That's the construct, and then you stack a ton of them together,
00:12:56.600 | a ton of those encoders
00:12:58.600 | self-attention feed-forward network
00:13:00.000 | self-attention feed-forward network together
00:13:02.000 | and that allows you to learn the rich context
00:13:05.200 | of the language to then at the end
00:13:07.600 | perform all kinds of tasks.
00:13:09.800 | You can, first of all, like ELMo
00:13:13.400 | and like Word2Vec,
00:13:14.800 | create rich contextual embeddings,
00:13:17.600 | taking a set of words and representing them
00:13:21.400 | in a space that's very efficient to reason with.
00:13:24.200 | You can do language classification
00:13:26.200 | you can do sentence pair classification
00:13:29.400 | you could do the similarity of two sentences
00:13:31.600 | multiple choice question answering
00:13:33.200 | general question answering
00:13:34.800 | tagging of sentences.
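
A simplified sketch of the masking step itself, just to show the shape of the training signal (the real BERT recipe also sometimes keeps or randomizes the selected tokens, and adds a next-sentence objective):

```python
import torch

def mask_tokens(token_ids, mask_id, mask_prob=0.15):
    """Hide ~15% of tokens so the model must predict them from full context."""
    inputs, labels = token_ids.clone(), token_ids.clone()
    masked = torch.rand(token_ids.shape) < mask_prob  # pick positions to hide
    inputs[masked] = mask_id    # replace them with the mask token
    labels[~masked] = -100      # standard "ignore" index: only score hidden positions
    return inputs, labels

token_ids = torch.randint(5, 30000, (2, 16))          # hypothetical token ids
inputs, labels = mask_tokens(token_ids, mask_id=103)  # 103 is [MASK] in the standard BERT vocab
```
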
00:13:37.200 | Okay, I lingered on that one a little bit too long
00:13:41.200 | but it is also the one I'm really excited about
00:13:45.800 | and really if there's a breakthrough this year
00:13:48.400 | it's thanks to BERT.
00:13:51.200 | The other thing I'm very excited about
00:13:54.200 | is totally jumping away from
00:13:58.200 | NeurIPS, the theory,
00:14:03.200 | those kinds of academic developments in deep learning
00:14:06.400 | and into the world of applied deep learning.
00:14:10.200 | So Tesla has a system called Autopilot
00:14:15.200 | where the hardware version 2 of that system
00:14:19.200 | is an implementation of the NVIDIA Drive PX2 system
00:14:29.200 | which runs a ton of neural networks.
00:14:32.200 | There's eight cameras on the car
00:14:35.600 | and a variant of the Inception network
00:14:40.600 | is now taking in all eight cameras
00:14:45.600 | at different resolutions as input
00:14:47.600 | and performing various tasks
00:14:50.600 | like drivable area segmentation
00:14:53.600 | like object detection
00:14:55.600 | and some basic localization tasks.
00:14:58.600 | So you have now a huge fleet of vehicles
00:15:03.600 | where it's not engineers,
00:15:05.600 | some I'm sure are engineers,
00:15:07.600 | but it's really regular consumers,
00:15:09.600 | people that have purchased the car
00:15:12.600 | and have no understanding in many cases
00:15:14.600 | of what a neural network's limitations
00:15:16.000 | and capabilities are, and so on.
00:15:17.600 | Now a neural network
00:15:19.600 | is controlling the well-being of a person:
00:15:21.600 | its decisions, its perceptions,
00:15:25.600 | and the control decisions based on those perceptions
00:15:27.600 | are controlling the life of a human being.
00:15:30.600 | And that to me is one of the great
00:15:32.600 | sort of breakthroughs of 17 and 18
00:15:35.600 | in terms of the development
00:15:39.600 | of what AI can do in a practical sense
00:15:42.600 | in impacting the world.
00:15:43.600 | And so one billion miles
00:15:46.600 | over one billion miles have been driven in autopilot.
00:15:49.600 | Now there's two types of systems
00:15:51.600 | currently operating in Teslas:
00:15:53.600 | there's hardware version one
00:15:55.600 | hardware version two.
00:15:56.600 | Hardware version one was the Intel Mobileye
00:15:59.600 | monocular camera perception system;
00:16:01.600 | as far as we know that was not using a neural network,
00:16:04.600 | and it was a fixed system
00:16:05.600 | that wasn't learning, at least not online learning, in the Teslas.
00:16:08.600 | The other is hardware version two
00:16:10.600 | and it's about half and half now
00:16:12.600 | in terms of the miles driven.
00:16:14.600 | The hardware version two has a neural network
00:16:16.600 | that's always learning.
00:16:17.600 | There's weekly updates.
00:16:18.600 | It's always improving the model
00:16:20.600 | shipping new weights and so on.
00:16:22.600 | That's the exciting set of breakthroughs.
00:16:25.600 | In terms of AutoML
00:16:27.600 | the dream of automating
00:16:30.600 | some aspects or all aspects
00:16:32.600 | or as many aspects as possible
00:16:33.600 | of the machine learning process
00:16:35.600 | where you can just
00:16:37.600 | drop in a data set that you're working on
00:16:41.600 | and the system will automatically determine
00:16:45.600 | all the parameters
00:16:47.600 | from the details of the architectures
00:16:49.600 | the size of the architecture
00:16:51.600 | the different modules in that architecture
00:16:53.600 | the hyper parameters
00:16:55.600 | used for training the architecture
00:16:58.600 | running that, doing inference, everything.
00:17:00.600 | All is done for you
00:17:01.600 | all you feed it is data.
00:17:03.600 | So that's been the success
00:17:06.600 | of the neural architecture search
00:17:08.600 | in '16 and '17
00:17:10.600 | and there's been a few ideas
00:17:12.600 | with Google AutoML
00:17:13.600 | that's really trying to almost create an API
00:17:15.600 | where you just drop in your data set
00:17:17.600 | and it's using reinforcement learning
00:17:19.600 | and recurrent neural networks
00:17:21.600 | to given a few modules
00:17:24.600 | stitch them together in such a way
00:17:26.600 | where the objective function is optimizing
00:17:28.600 | the performance of the overall system
00:17:30.600 | and they showed a lot of exciting results
00:17:32.600 | Google showed and others
00:17:34.600 | that outperform state of the art systems
00:17:36.600 | both in terms of efficiency
00:17:37.600 | and in terms of accuracy.
00:17:39.600 | Now in '18
00:17:41.600 | there have been a few improvements
00:17:43.600 | on this direction
00:17:44.600 | and one of them is AdaNet
00:17:46.600 | where it's now using
00:17:49.600 | the same reinforcement learning
00:17:51.600 | AutoML formulation
00:17:52.600 | to build ensembles of neural networks
00:17:54.600 | so in many cases
00:17:56.600 | state of the art performance
00:17:57.600 | can be achieved
00:17:58.600 | not by taking
00:18:00.600 | a single architecture
00:18:02.600 | but by building up a multitude,
00:18:04.600 | an ensemble, a collection of architectures,
00:18:06.600 | and that's what AdaNet is doing here:
00:18:08.600 | given candidate architectures,
00:18:11.600 | stitching them together
00:18:12.600 | to form an ensemble
00:18:13.600 | to get state-of-the-art performance.
00:18:15.600 | now that state of the art performance
00:18:17.600 | is not a leap
00:18:20.600 | a breakthrough leap forward
00:18:22.600 | but it's nevertheless a step forward
00:18:24.600 | and it's a very exciting field
00:18:27.600 | that's going to be receiving
00:18:28.600 | more and more attention
00:18:30.600 | there's an area of machine learning
00:18:33.600 | that's heavily understudied
00:18:35.600 | and I think is an extremely exciting area,
00:18:37.600 | and if you look at 2012
00:18:43.600 | with AlexNet
00:18:45.600 | achieving the breakthrough performance
00:18:48.600 | of showing that
00:18:49.600 | what deep learning networks are capable of
00:18:52.600 | from that point on
00:18:54.600 | from 2012 to today
00:18:56.600 | there's been non-stop
00:18:57.600 | extremely active developments
00:18:59.600 | of different architectures
00:19:00.600 | that even on just ImageNet alone
00:19:02.600 | on doing the image classification task
00:19:04.600 | have improved performance
00:19:08.600 | over and over and over
00:19:09.600 | with totally new ideas
00:19:11.600 | now on the other side
00:19:12.600 | on the data side
00:19:14.600 | there's been very few ideas
00:19:17.600 | about how to do data augmentation
00:19:19.600 | so data augmentation
00:19:22.600 | is the process of
00:19:25.600 | you know it's what
00:19:27.600 | kids always do
00:19:28.600 | when you learn about an object right
00:19:30.600 | is you look at an object
00:19:31.600 | and you kind of like twist it around
00:19:34.600 | is taking the raw data
00:19:39.600 | and messing with it in such a way
00:19:41.600 | that it can give you
00:19:42.600 | a much richer representation
00:19:44.600 | of what this data can look like
00:19:47.600 | in other forms
00:19:48.600 | in other contexts in the real world
00:19:51.600 | there's been very few developments
00:19:54.600 | I think still
00:19:55.600 | and there's this auto-augment
00:19:57.600 | is just a step
00:19:58.600 | a tiny step into that direction
00:20:00.600 | that I hope that we as a community
00:20:03.600 | invest a lot of effort in
00:20:05.600 | so what auto-augment does
00:20:07.600 | is it says
00:20:08.600 | okay so there's these
00:20:11.600 | data augmentation methods
00:20:13.600 | like translating the image
00:20:14.600 | shearing the image,
00:20:16.600 | doing color manipulation
00:20:17.600 | like color inversion
00:20:19.600 | let's take those as basic actions
00:20:21.600 | you can take
00:20:22.600 | and then use reinforcement learning
00:20:23.600 | and again an RNN construct
00:20:26.600 | to stitch those actions together
00:20:28.600 | in such a way
00:20:29.600 | that can augment data
00:20:32.600 | like on ImageNet
00:20:34.600 | so that when you train on that data
00:20:37.600 | it gets state-of-the-art performance
00:20:40.600 | so mess with the data
00:20:41.600 | in a way that optimizes
00:20:44.600 | the way you mess with the data
00:20:46.600 | And then they've also shown
00:20:49.600 | that given that
00:20:51.600 | the set of data augmentation policies
00:20:55.600 | that are learned to optimize
00:20:57.600 | for example for ImageNet
00:20:59.600 | given some kind of architecture
00:21:01.600 | you can take that
00:21:03.600 | learned set of policies
00:21:05.600 | for data augmentation
00:21:06.600 | and apply it to
00:21:08.600 | a totally different data set
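
A hand-written stand-in for such a policy, just to show what the basic actions look like in code; in AutoAugment the choice of operations, magnitudes, and probabilities is what the controller learns, and the values below are purely illustrative:

```python
import torchvision.transforms as T

policy = T.Compose([
    T.RandomApply([T.RandomAffine(degrees=0, translate=(0.1, 0.1))], p=0.6),  # translate
    T.RandomApply([T.RandomAffine(degrees=0, shear=10)], p=0.3),              # shear
    T.RandomApply([T.ColorJitter(brightness=0.4, contrast=0.4)], p=0.5),      # color ops
])

# recent torchvision releases also ship the learned ImageNet policy directly,
# which is one way to "transfer" it to your own data set:
#   policy = T.AutoAugment(T.AutoAugmentPolicy.IMAGENET)
```
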
00:21:10.600 | so there's the process of transfer learning
00:21:14.600 | so what is transfer learning?
00:21:16.600 | So we talked about
00:21:17.600 | transfer learning
00:21:18.600 | you have a neural network
00:21:19.600 | that learns to do
00:21:20.600 | cat versus dog
00:21:21.600 | or no
00:21:22.600 | learns to do a thousand class
00:21:24.600 | classification problem on ImageNet
00:21:26.600 | and then you transfer
00:21:28.600 | you chop off a few layers
00:21:29.600 | and you transfer on the task
00:21:30.600 | of your own data set
00:21:31.600 | of cat versus dog
00:21:33.600 | what you're transferring
00:21:34.600 | is the weights
00:21:36.600 | that are learned
00:21:38.600 | on the ImageNet
00:21:39.600 | classification task
00:21:40.600 | and now you're then
00:21:42.600 | fine-tuning those weights
00:21:44.600 | on the specific
00:21:48.600 | personal cat versus dog data set
00:21:50.600 | you have
00:21:52.600 | now you could do the same thing here
00:21:55.600 | you can transfer
00:21:57.600 | as part of the transfer learning process
00:21:59.600 | take the data augmentation policies
00:22:02.600 | learned on ImageNet
00:22:04.600 | and transfer those
00:22:05.600 | you can transfer both the weights
00:22:06.600 | and the policies
00:22:08.600 | that's a really super exciting idea
00:22:12.600 | I think
00:22:13.600 | it wasn't quite demonstrated
00:22:14.600 | extremely well here
00:22:16.600 | in terms of performance
00:22:18.600 | so it got an improvement
00:22:20.600 | in performance and so on
00:22:21.600 | but it kind of inspired an idea
00:22:24.600 | that's something
00:22:25.600 | that we need to really think about
00:22:26.600 | how to augment data
00:22:28.600 | in an interesting way
00:22:29.600 | such that
00:22:31.600 | given just a few samples
00:22:33.600 | of data
00:22:35.600 | we can generate huge data sets
00:22:37.600 | in a way that you can then form
00:22:39.600 | meaningful, complex, rich representations from
00:22:43.600 | I think that's really exciting
00:22:44.600 | and one of the ways that
00:22:46.600 | you break open the problem
00:22:47.600 | of how do we learn a lot from a little
00:22:50.600 | training deep neural networks
00:22:53.600 | with synthetic data
00:22:56.600 | also really an exciting topic
00:22:58.600 | that a few groups
00:23:01.600 | but especially NVIDIA
00:23:02.600 | has invested a lot in
00:23:03.600 | and here's a work
00:23:04.600 | from CVPR 2018,
00:23:06.600 | probably my favorite work on this topic
00:23:09.600 | they really went crazy
00:23:12.600 | and said okay let's mess
00:23:13.600 | with synthetic data
00:23:15.600 | in every way we possibly can
00:23:18.600 | so on the left there
00:23:19.600 | is shown a set of backgrounds,
00:23:21.600 | then there's also a set of artificial objects,
00:23:23.600 | and you have a car
00:23:24.600 | or some kind of object
00:23:25.600 | that you're trying to classify
00:23:27.600 | so let's take that car
00:23:28.600 | and mess with it
00:23:29.600 | in every way possible,
00:23:31.600 | apply lighting variation to it
00:23:33.600 | in every way possible,
00:23:35.600 | rotate it, everything that is crazy.
00:23:38.600 | what NVIDIA
00:23:40.600 | is really good at
00:23:41.600 | is creating realistic scenes
00:23:43.600 | and they said okay
00:23:44.600 | let's create realistic scenes
00:23:46.600 | but let's also go way above board
00:23:48.600 | and not do realistic at all
00:23:50.600 | do things that can't possibly happen in reality
00:23:53.600 | and so generate these huge data sets
00:23:55.600 | to train on,
00:23:56.600 | and again achieve quite interesting
00:23:58.600 | quite good performance
00:24:02.600 | on image classification
00:24:04.600 | of course,
00:24:05.600 | if you try to apply it to ImageNet
00:24:06.600 | and these kinds of tasks,
00:24:08.600 | you're not going to outperform
00:24:10.600 | networks that were trained on ImageNet,
00:24:12.600 | but they show that
00:24:13.600 | with just a small sample
00:24:15.600 | from those real images
00:24:18.600 | they can fine-tune this network, trained
00:24:20.600 | on synthetic images,
00:24:21.600 | totally fake images,
00:24:22.600 | to achieve state-of-the-art performance,
00:24:25.600 | again, another way
00:24:27.600 | to get to learn a lot from very little,
00:24:31.600 | by generating fake worlds
00:24:33.600 | synthetically
00:24:37.600 | the process of annotation
00:24:39.600 | which for supervised learning
00:24:41.600 | is what you need to do
00:24:43.600 | in order to
00:24:45.600 | train the network
00:24:46.600 | you need to be able to provide ground truth
00:24:47.600 | you need to be able to label
00:24:49.600 | whatever the entity that is
00:24:51.600 | being learned
00:24:52.600 | and so for image classification
00:24:54.600 | that's saying what is going on in the image
00:24:56.600 | and part of that was done
00:24:58.600 | on ImageNet
00:24:59.600 | by doing a Google search
00:25:00.600 | for creating candidates
00:25:03.600 | saying what's going on in the image
00:25:05.600 | is a pretty easy task
00:25:06.600 | then there is the
00:25:08.600 | object detection task
00:25:09.600 | of detecting the bounding box,
00:25:11.600 | and so saying,
00:25:12.600 | drawing the actual bounding box,
00:25:14.600 | is a little bit more difficult
00:25:16.600 | but it's a couple of clicks and so on
00:25:18.600 | then if we take the final,
00:25:22.600 | probably one of the highest
00:25:24.600 | complexity tasks
00:25:26.600 | of perception,
00:25:28.600 | of image understanding,
00:25:29.600 | which is segmentation:
00:25:30.600 | actually drawing,
00:25:32.600 | either at the pixel level or with polygons,
00:25:34.600 | the outline of a particular object.
00:25:36.600 | now if you have to annotate that
00:25:38.600 | that's extremely costly
00:25:39.600 | so the work with
00:25:41.600 | Polygon-RNN
00:25:42.600 | is to use recurrent neural networks
00:25:44.600 | to make suggestions for polygons
00:25:46.600 | it's really interesting
00:25:48.600 | there's a few tricks
00:25:50.600 | to form these high resolution polygons
00:25:52.600 | so the idea is
00:25:53.600 | it drops in a single point:
00:25:55.600 | you draw a
00:25:57.600 | bounding box around an object,
00:25:59.600 | you use
00:26:01.600 | convolutional neural networks to drop
00:26:03.600 | the first point,
00:26:04.600 | and then you use recurrent neural networks
00:26:05.600 | to draw around it
00:26:07.600 | and the performance
00:26:09.600 | is really good
00:26:10.600 | there's a few tricks
00:26:11.600 | and this tool is available online
00:26:12.600 | it's a really interesting idea
00:26:14.600 | again
00:26:15.600 | the dream
00:26:16.600 | with AutoML
00:26:17.600 | is to remove the human from the picture
00:26:19.600 | as much as possible
00:26:20.600 | with data augmentation
00:26:22.600 | remove the human from the picture
00:26:23.600 | as much as possible
00:26:24.600 | for menial data
00:26:25.600 | automate the boring stuff
00:26:27.600 | and in this case
00:26:28.600 | the act of drawing a polygon
00:26:30.600 | try to automate it
00:26:31.600 | as much as possible
00:26:35.600 | Another interesting
00:26:37.600 | dimension
00:26:39.600 | along which
00:26:41.600 | deep learning has
00:26:43.600 | recently been optimized:
00:26:47.600 | how do we make
00:26:48.600 | deep learning accessible,
00:26:51.600 | cheap,
00:26:52.600 | accessible?
00:26:53.600 | So the DAWNBench
00:26:57.600 | benchmark
00:26:58.600 | from Stanford
00:27:00.600 | formulated an interesting competition
00:27:02.600 | which got a lot of attention
00:27:04.600 | and a lot of progress
00:27:06.600 | it's saying
00:27:07.600 | if we want to achieve
00:27:09.600 | 93% accuracy on ImageNet
00:27:11.600 | and 94% on CIFAR-10
00:27:13.600 | let's now
00:27:15.600 | compete
00:27:16.600 | that's like the requirement
00:27:17.600 | let's now compete
00:27:19.600 | how you can do it
00:27:20.600 | in the least amount of time
00:27:21.600 | and for the least amount of dollars
00:27:24.600 | do the training in the least amount of time
00:27:27.600 | and the training in the least amount of dollars
00:27:29.600 | like literally dollars
00:27:30.600 | you're allowed to spend
00:27:31.600 | to do this
00:27:33.600 | and Fast.ai
00:27:35.600 | you know
00:27:36.600 | it's an awesome,
00:27:37.600 | renegade
00:27:38.600 | group of deep learning researchers,
00:27:40.600 | have been able to train
00:27:42.600 | on ImageNet in 3 hours
00:27:44.600 | so this is for training process
00:27:46.600 | for 25 bucks
00:27:48.600 | so training a network
00:27:50.600 | that achieves 93% accuracy
00:27:53.600 | for 25 bucks
00:27:55.600 | and 94% accuracy
00:27:57.600 | for 26 cents
00:27:59.600 | on CIFAR-10
00:28:01.600 | so the key idea
00:28:02.600 | that they were playing with
00:28:04.600 | is quite simple
00:28:05.600 | but really it boils down to
00:28:07.600 | messing with the learning rate
00:28:09.600 | throughout the process of training
00:28:11.600 | so the learning rate is
00:28:13.600 | how much, based on the loss,
00:28:15.600 | based on the error the neural network observes,
00:28:17.600 | you adjust the weights.
00:28:19.600 | so they found
00:28:23.600 | that if they
00:28:24.600 | crank up
00:28:25.600 | the learning rate
00:28:27.600 | while decreasing
00:28:28.600 | the momentum
00:28:30.600 | which is a parameter of the optimization process
00:28:32.600 | and when they do that jointly,
00:28:34.600 | they're able
00:28:35.600 | to make the network learn
00:28:37.600 | really fast
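
PyTorch ships a scheduler in this spirit (the one-cycle policy popularized by fast.ai); a rough sketch, with all the numbers below chosen only for illustration:

```python
import torch
from torch import nn, optim

model = nn.Linear(10, 2)  # stand-in for a real network
opt = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
steps_per_epoch, epochs = 100, 5

sched = optim.lr_scheduler.OneCycleLR(
    opt, max_lr=1.0,                        # learning rate ramps up, then back down
    steps_per_epoch=steps_per_epoch, epochs=epochs,
    base_momentum=0.85, max_momentum=0.95,  # momentum cycles in the opposite direction
)

for step in range(steps_per_epoch * epochs):
    # ... forward pass, loss.backward() would go here ...
    opt.step()
    sched.step()
```
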
00:28:39.600 | that's really exciting
00:28:40.600 | and the benchmark itself
00:28:41.600 | is also really exciting
00:28:43.600 | because that's exactly
00:28:44.600 | for people sitting in this room
00:28:46.600 | that opens up the door
00:28:48.600 | to doing all kinds of
00:28:50.600 | fundamental deep learning problems
00:28:52.600 | without the resources
00:28:54.600 | of Google DeepMind
00:28:56.600 | or OpenAI
00:28:57.600 | or Facebook
00:28:58.600 | or so on
00:28:59.600 | without computational resources
00:29:01.600 | that's important for academia
00:29:02.600 | that's important for independent researchers
00:29:04.600 | and so on
00:29:05.600 | so GANs
00:29:07.600 | there's been a lot of work
00:29:09.600 | on generative adversarial networks;
00:29:13.600 | in some ways there's not been
00:29:15.600 | breakthrough
00:29:17.600 | ideas
00:29:19.600 | in GANs for quite a bit
00:29:21.600 | and I think
00:29:23.600 | BigGAN
00:29:25.600 | from Google DeepMind,
00:29:27.600 | the ability to generate
00:29:29.600 | incredibly high resolution
00:29:31.600 | images
00:29:34.600 | it's the same GAN technique,
00:29:36.600 | so no breakthrough innovations,
00:29:38.600 | but scaled:
00:29:40.600 | increased model capacity
00:29:42.600 | and increased batch size,
00:29:44.600 | the number of images that are fed
00:29:48.600 | to the network,
00:29:50.600 | it produces incredible images
00:29:52.600 | I encourage you to go online
00:29:54.600 | and look at them
00:29:55.600 | it's hard to believe that they're
00:29:57.600 | generated
00:29:59.600 | So that was
00:30:01.600 | 2018 for GANs:
00:30:03.600 | a year
00:30:05.600 | of scaling
00:30:07.600 | and parameter tuning
00:30:09.600 | as opposed to breakthrough new ideas
00:30:11.600 | Video to Video Synthesis
00:30:15.600 | this work is from
00:30:17.600 | NVIDIA
00:30:19.600 | is looking at the problem
00:30:21.600 | so there's been a lot of work on
00:30:23.600 | going from image to image
00:30:25.600 | so from a particular
00:30:27.600 | image generating another image
00:30:29.600 | so whether it's colorizing an image
00:30:31.600 | or just traditionally
00:30:33.600 | defined GANs
00:30:35.600 | the idea with
00:30:37.600 | Video to Video Synthesis that a few people have been
00:30:39.600 | working on but NVIDIA took a good
00:30:41.600 | step forward
00:30:43.600 | is to
00:30:47.600 | the video to make
00:30:49.600 | the temporal consistency, the temporal dynamics
00:30:51.600 | part of the optimization process
00:30:53.600 | so make it look not jumpy
00:30:55.600 | so if you look here at the
00:30:57.600 | comparison for this particular
00:30:59.600 | so the input is the labels in the top
00:31:03.600 | left and the output of the
00:31:05.600 | NVIDIA approach is on the bottom right
00:31:11.600 | see it's very
00:31:13.600 | temporally consistent. If you look at
00:31:15.600 | the image to image
00:31:17.600 | mapping that's
00:31:19.600 | state of the art, pix2pix HD
00:31:21.600 | it's very jumpy
00:31:23.600 | it's not temporally consistent at all
00:31:25.600 | and there's some naive
00:31:27.600 | approaches for trying to maintain temporal
00:31:29.600 | consistency
00:31:31.600 | that's in the bottom left
00:31:33.600 | so you can apply this to all
00:31:35.600 | kinds of tasks, all kinds of video to video
00:31:37.600 | mapping. Here is mapping it to
00:31:39.600 | face edges
00:31:41.600 | edge detection on faces, mapping it to
00:31:43.600 | faces
00:31:45.600 | generating faces from just edges
00:31:47.600 | you can look at
00:31:51.600 | body pose
00:31:53.600 | to actual images
00:31:55.600 | as input to the network you can take
00:31:57.600 | the pose of the person
00:31:59.600 | and generate the
00:32:01.600 | video of the person
00:32:03.600 | okay, semantic segmentation
00:32:13.600 | the problem of perception
00:32:17.600 | began with AlexNet and ImageNet
00:32:19.600 | and then further and further developments
00:32:21.600 | where the problem is
00:32:23.600 | basic image classification: the input is an image
00:32:25.600 | and the output is a classification of what's going on
00:32:27.600 | in that image and the fundamental
00:32:29.600 | architecture can be reused for
00:32:31.600 | more complex tasks like detection
00:32:33.600 | like segmentation and so on
00:32:35.600 | interpreting what's going on in the image
00:32:37.600 | so these large networks
00:32:39.600 | from VGGNet, GoogleNet,
00:32:41.600 | ResNet,
00:32:43.600 | SENet, DenseNet,
00:32:45.600 | all these networks are forming rich
00:32:47.600 | representation that can then be used for all kinds
00:32:49.600 | of tasks, whether that task is object
00:32:51.600 | detection, this here
00:32:53.600 | shown is the region based methods
00:32:55.600 | where the neural network,
00:32:57.600 | the convolutional layers, is tasked with generating
00:33:01.600 | region proposals,
00:33:03.600 | so a bunch of candidates to be considered,
00:33:05.600 | and then there's a step that's
00:33:07.600 | determining what's in those
00:33:09.600 | different regions and forming bounding boxes
00:33:11.600 | around them in a for loop way
00:33:13.600 | and then there is the one shot method
00:33:15.600 | single shot method where
00:33:17.600 | in a single pass all of the
00:33:19.600 | bounding boxes in their classes
00:33:21.600 | are generated and there has been
00:33:23.600 | a tremendous amount of work
00:33:25.600 | in the space of object detection
00:33:27.600 | some are single
00:33:29.600 | shot methods
00:33:31.600 | some are region
00:33:33.600 | based methods and there's been a lot
00:33:35.600 | of exciting work
00:33:37.600 | but not
00:33:39.600 | I would say breakthrough
00:33:41.600 | ideas and then
00:33:43.600 | we take it to the highest level of
00:33:45.600 | perception which is semantic segmentation
00:33:47.600 | there's also been
00:33:49.600 | a lot of work there, the state of
00:33:51.600 | the art performance is
00:33:53.600 | at least for the open source systems
00:33:55.600 | is DeepLab
00:33:59.600 | on the PASCAL
00:34:01.600 | VOC challenge.
00:34:03.600 | so semantic segmentation, to catch you
00:34:05.600 | all up, started in 2014 with fully
00:34:07.600 | convolutional neural networks
00:34:09.600 | chopping off the fully connected
00:34:11.600 | layers and then
00:34:13.600 | outputting
00:34:15.600 | the heat map, very grainy
00:34:19.600 | low resolution
00:34:21.600 | then improving that with SegNet,
00:34:23.600 | performing max pooling
00:34:27.600 | a breakthrough idea that's reused in a lot
00:34:29.600 | of cases is dilated convolutions,
00:34:31.600 | atrous convolutions,
00:34:33.600 | having some spacing
00:34:35.600 | which increases the
00:34:37.600 | field of view of the convolutional
00:34:39.600 | filter, the key idea
00:34:41.600 | behind DeepLab v3+,
00:34:43.600 | which
00:34:45.600 | is the state of the art, is
00:34:47.600 | the multi-scale processing
00:34:49.600 | without
00:34:51.600 | increasing the parameters; the
00:34:53.600 | multi-scale is achieved by
00:34:55.600 | the quote-unquote atrous
00:34:57.600 | rate, so taking those atrous convolutions
00:34:59.600 | and increasing the spacing,
00:35:01.600 | and you can think of increasing that
00:35:03.600 | spacing
00:35:05.600 | as enlarging the
00:35:07.600 | model's field of view, so you
00:35:09.600 | can consider all these different
00:35:11.600 | scales of processing
00:35:13.600 | looking at the
00:35:15.600 | layers of features
00:35:19.600 | allowing you to be able
00:35:21.600 | to grasp the greater
00:35:23.600 | context as part of the
00:35:25.600 | upsampling deconvolutional step
00:35:27.600 | and that's what's producing the state of the art performances
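
To see the atrous idea in isolation, here is a small sketch: the dilated 3x3 convolution covers a much wider window with exactly the same number of weights (the rate of 6 is just an example; DeepLab combines several rates in parallel):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 64, 64)  # a hypothetical feature map

# a plain 3x3 convolution versus the same convolution with dilation ("atrous rate") 6:
# the 9 weights are spread out, covering a 13x13 window with no extra parameters
conv_rate1 = nn.Conv2d(256, 256, kernel_size=3, padding=1, dilation=1)
conv_rate6 = nn.Conv2d(256, 256, kernel_size=3, padding=6, dilation=6)

print(conv_rate1(x).shape, conv_rate6(x).shape)                # same spatial size
print(conv_rate1.weight.numel() == conv_rate6.weight.numel())  # True: same parameter count
```
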
00:35:29.600 | and that's where we have the
00:35:31.600 | notebook
00:35:33.600 | tutorial
00:35:35.600 | on github
00:35:37.600 | showing this
00:35:39.600 | DeepLab
00:35:41.600 | architecture trained on
00:35:43.600 | Cityscapes, so Cityscapes
00:35:45.600 | is a driving segmentation
00:35:47.600 | data set
00:35:49.600 | that is
00:35:55.600 | one of the most commonly used for the task of
00:35:57.600 | driving scene segmentation
00:35:59.600 | Okay, on the deep reinforcement
00:36:03.600 | learning front
00:36:05.600 | So this is touching a bit
00:36:07.600 | on 2017,
00:36:09.600 | but I think
00:36:11.600 | the excitement really settled in
00:36:13.600 | in 2018:
00:36:15.600 | the work from Google
00:36:17.600 | DeepMind and from OpenAI.
00:36:19.600 | So it started
00:36:21.600 | with the DQN paper from Google
00:36:23.600 | DeepMind where they beat a bunch
00:36:25.600 | of Atari
00:36:29.600 | games achieving superhuman
00:36:31.600 | performance with
00:36:33.600 | deep reinforcement
00:36:35.600 | learning methods that are taking in just the raw
00:36:37.600 | pixels of the game, so this same
00:36:39.600 | kind of architecture is able to learn
00:36:41.600 | how to beat these games
00:36:43.600 | super exciting idea
00:36:45.600 | that kind of has echoes
00:36:47.600 | of what general intelligence is
00:36:49.600 | taking in the raw
00:36:51.600 | information and being able to
00:36:53.600 | understand the game
00:36:55.600 | the sort of physics of the game sufficiently to be able
00:36:57.600 | to beat it.
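
A rough sketch of the kind of network involved: raw stacked game frames in, one estimated value per action out (the layer sizes follow the commonly cited Atari setup and are illustrative, not a claim about the exact published model):

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),  # 4 stacked 84x84 frames
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),  # expected return of each possible action
        )

    def forward(self, frames):
        return self.net(frames / 255.0)  # raw pixels in, Q-values out

q = DQN(num_actions=6)
frames = torch.randint(0, 256, (1, 4, 84, 84)).float()
action = q(frames).argmax(dim=1)  # act greedily with respect to the Q-values
```
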
00:36:59.600 | Then in 2016, AlphaGo,
00:37:01.600 | with some supervision
00:37:03.600 | and some playing against itself
00:37:05.600 | self play
00:37:07.600 | some supervised learning
00:37:09.600 | on expert world champ players
00:37:11.600 | and some self play
00:37:13.600 | where it plays against itself
00:37:15.600 | was able to beat the top world champion
00:37:17.600 | at Go
00:37:19.600 | and then 2017 AlphaGo Zero
00:37:21.600 | a specialized
00:37:23.600 | version of AlphaZero
00:37:25.600 | was able to beat
00:37:27.600 | the AlphaGo
00:37:29.600 | with just a few days
00:37:31.600 | of training and zero
00:37:33.600 | supervision from expert games
00:37:35.600 | so through the process of self play
00:37:37.600 | again this is kind of
00:37:39.600 | getting
00:37:41.600 | the human out of the picture
00:37:43.600 | more and more and more, which is why AlphaZero,
00:37:45.600 | or this
00:37:47.600 | AlphaGo Zero, was probably
00:37:49.600 | the cleanest
00:37:53.600 | demonstration of all the nice
00:37:55.600 | progress in deep reinforcement learning
00:37:57.600 | and I think if you look at the history of AI
00:37:59.600 | when you're sitting on a
00:38:01.600 | porch 100 years from now
00:38:03.600 | sort of reminiscing back
00:38:05.600 | AlphaZero
00:38:07.600 | will be a thing that people
00:38:09.600 | will remember as an
00:38:11.600 | interesting moment in time
00:38:13.600 | as a key moment in time
00:38:15.600 | And AlphaZero:
00:38:21.600 | the AlphaZero paper
00:38:23.600 | was in 2017 and it was
00:38:25.600 | this year that it played Stockfish
00:38:27.600 | in chess, which is
00:38:29.600 | the best
00:38:31.600 | of the chess-playing engines,
00:38:33.600 | and it was able to beat it with just 4 hours of training
00:38:35.600 | of course the 4 hours
00:38:37.600 | comes with caveats because 4 hours
00:38:39.600 | for Google DeepMind is highly distributed
00:38:41.600 | training so it's not 4 hours
00:38:43.600 | for an
00:38:45.600 | undergraduate student sitting in their dorm room
00:38:47.600 | but meaning
00:38:49.600 | it was
00:38:51.600 | able through self play to
00:38:53.600 | very quickly learn to beat the state of the art
00:38:55.600 | chess engine and
00:38:57.600 | learn to beat the state of the art shogi
00:38:59.600 | engine ELMO
00:39:01.600 | and the interesting
00:39:03.600 | thing here is
00:39:05.600 | with perfect information games
00:39:07.600 | like chess you have a tree
00:39:09.600 | and you have all the decisions you could
00:39:11.600 | possibly make, and so the farther along you look
00:39:13.600 | along that tree, presumably,
00:39:15.600 | the better you do
00:39:17.600 | that's how Deep Blue beat
00:39:19.600 | Kasparov in the
00:39:21.600 | 90's is you just look as far
00:39:23.600 | as possible down the tree
00:39:25.600 | to determine which action is the most optimal.
00:39:27.600 | if you look at the way
00:39:29.600 | human
00:39:31.600 | grandmasters think it certainly
00:39:33.600 | doesn't feel like they're like looking down a tree
00:39:35.600 | there's something like
00:39:37.600 | creative intuition there's something like you could
00:39:39.600 | see the patterns in the board you can do
00:39:41.600 | a few calculations but really
00:39:43.600 | it's on the order of hundreds, it's not
00:39:45.600 | on the order of millions
00:39:47.600 | or billions, which is kind
00:39:49.600 | of the
00:39:51.600 | Stockfish, the state-of-the-art chess engine,
00:39:55.600 | approach and AlphaZero is moving
00:39:57.600 | closer and closer and closer towards the human
00:39:59.600 | grandmaster, considering very few
00:40:01.600 | future moves. It's able, through
00:40:03.600 | the neural network estimator that's
00:40:05.600 | estimating the quality of the move, the quality
00:40:07.600 | of the
00:40:09.600 | current board, and the quality
00:40:11.600 | of the moves that follow,
00:40:13.600 | to do much, much less
00:40:15.600 | look ahead so the
00:40:17.600 | neural network learns the fundamental
00:40:19.600 | information just like when a grandmaster looks
00:40:21.600 | at a board they can tell how
00:40:23.600 | good that is
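
A toy sketch of such an estimator: one network that looks at a board encoding and returns both a score for every candidate move and a single value for the position, which is what lets the search consider far fewer future moves (the board encoding and sizes here are invented for illustration):

```python
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    def __init__(self, board_planes=8, board_size=8, num_moves=512):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(board_planes, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        flat = 64 * board_size * board_size
        self.policy_head = nn.Linear(flat, num_moves)                   # quality of each move
        self.value_head = nn.Sequential(nn.Linear(flat, 1), nn.Tanh())  # quality of the board, in [-1, 1]

    def forward(self, board):
        h = self.trunk(board)
        return self.policy_head(h), self.value_head(h)

net = PolicyValueNet()
policy_logits, value = net(torch.randn(1, 8, 8, 8))  # a hypothetical board encoding
```
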
00:40:25.600 | that's again interesting it's a step
00:40:27.600 | towards
00:40:29.600 | at least echoes
00:40:31.600 | of what human intelligence is in
00:40:33.600 | this very structured formal constrained world
00:40:35.600 | of chess and Go and Shogi
00:40:37.600 | and then there's
00:40:39.600 | the other side of the world that's messy
00:40:41.600 | it's still games
00:40:43.600 | it's still constrained in that way
00:40:45.600 | but OpenAI
00:40:47.600 | has taken on the challenge of
00:40:49.600 | playing games that are much messier
00:40:51.600 | that have this
00:40:53.600 | semblance
00:40:55.600 | of the real world and the fact that
00:40:57.600 | you have to do teamwork you have to
00:40:59.600 | look at long time horizons
00:41:01.600 | with huge amounts of imperfect
00:41:03.600 | information hidden information
00:41:05.600 | uncertainty
00:41:07.600 | so within that world they've
00:41:09.600 | taken on the challenge of a popular game
00:41:11.600 | Dota 2
00:41:13.600 | on the human side of that
00:41:15.600 | there's the competition
00:41:17.600 | the international hosted every year
00:41:19.600 | where in 2018 the winning
00:41:21.600 | team gets 11 million dollars
00:41:23.600 | it's a very popular very active competition
00:41:25.600 | that's been going on for
00:41:27.600 | a few years
00:41:29.600 | they've been
00:41:31.600 | improving and achieved a lot of interesting
00:41:33.600 | milestones in 2017
00:41:35.600 | they were 1v1 bot
00:41:37.600 | beat the top professional Dota 2 player
00:41:39.600 | the way you achieve great
00:41:41.600 | things is
00:41:43.600 | you try
00:41:45.600 | In 2018 they tried to go further:
00:41:49.600 | the OpenAI Five team lost 2 games
00:41:51.600 | against
00:41:53.600 | the top Dota 2 players
00:41:55.600 | at the 2018 International,
00:41:57.600 | and of course their ranking
00:41:59.600 | here the MMR ranking in Dota 2
00:42:01.600 | has been increasing over and over
00:42:03.600 | but there's a lot of challenges
00:42:05.600 | here that make it extremely difficult
00:42:07.600 | to beat the human players
00:42:11.600 | this is you know in every story
00:42:13.600 | Rocky or whatever
00:42:15.600 | you think about losing is
00:42:17.600 | an essential element of a story
00:42:19.600 | that leads to then a movie
00:42:21.600 | and a book and greatness
00:42:23.600 | so you better believe that they're coming back
00:42:25.600 | next year and there's going to be
00:42:27.600 | a lot of exciting developments there
00:42:29.600 | Also, Dota 2,
00:42:31.600 | this particular video game:
00:42:33.600 | currently
00:42:35.600 | there are really 2 games
00:42:37.600 | that have the public eye
00:42:39.600 | in terms of AI taking them on as benchmarks.
00:42:41.600 | So we solved Go,
00:42:43.600 | an incredible accomplishment, but what's next?
00:42:45.600 | so last year
00:42:49.600 | associated with the best paper
00:42:51.600 | at NeurIPS,
00:42:53.600 | was the heads up
00:42:55.600 | Texas no limit hold'em
00:42:57.600 | AI that was able to
00:42:59.600 | beat the top level players
00:43:01.600 | What's completely, well not completely,
00:43:03.600 | but currently out of reach is the
00:43:05.600 | general not heads up 1 vs 1
00:43:07.600 | but the general team
00:43:09.600 | Texas no limit hold'em
00:43:11.600 | and on the gaming
00:43:13.600 | side this dream
00:43:15.600 | of Dota 2 now that's the benchmark
00:43:17.600 | that everybody's targeting and it's actually
00:43:19.600 | an incredibly difficult one, and some people think
00:43:21.600 | it'll be a long time before we can
00:43:23.600 | win and on
00:43:25.600 | the more
00:43:27.600 | practical side of things
00:43:33.600 | starting in 2017 it has been a year
00:43:37.600 | of the frameworks
00:43:39.600 | growing up,
00:43:41.600 | maturing and creating ecosystems
00:43:43.600 | around them, with
00:43:45.600 | TensorFlow,
00:43:47.600 | with its history there
00:43:49.600 | dating back a few years, having really,
00:43:51.600 | with TensorFlow 1.0
00:43:55.600 | come to be sort of a mature
00:43:57.600 | framework
00:43:59.600 | PyTorch 1.0 came out in 2018
00:44:01.600 | and has matured as well, and now
00:44:03.600 | the really exciting developments in
00:44:05.600 | TensorFlow with eager execution
00:44:07.600 | and beyond that's
00:44:09.600 | coming out in TensorFlow 2.0
00:44:11.600 | in 2019 so
00:44:13.600 | really,
00:44:15.600 | those two players have
00:44:17.600 | made incredible
00:44:19.600 | leaps in
00:44:21.600 | standardizing deep
00:44:23.600 | learning and
00:44:25.600 | the fact that a lot of the ideas I talked
00:44:27.600 | about today and Monday and we'll keep talking
00:44:29.600 | about are all
00:44:31.600 | have a GitHub repository
00:44:33.600 | with implementations in TensorFlow
00:44:35.600 | and PyTorch making it extremely
00:44:37.600 | accessible and that's really exciting
00:44:39.600 | it's probably best to
00:44:41.600 | quote Jeff Hinton
00:44:43.600 | the quote unquote godfather of deep learning
00:44:45.600 | one of the key people
00:44:47.600 | behind backpropagation
00:44:49.600 | who said recently of backpropagation:
00:44:51.600 | "My view is throw it all away
00:44:53.600 | and start again." He
00:44:55.600 | believes backpropagation is
00:44:57.600 | totally broken and an idea that
00:44:59.600 | is ancient and
00:45:01.600 | needs to be completely revolutionized
00:45:03.600 | and the practical protocol
00:45:05.600 | for doing that is he said
00:45:07.600 | the future depends on some graduate student
00:45:09.600 | who's deeply suspicious of everything I've said
00:45:11.600 | that's probably a good
00:45:13.600 | way
00:45:15.600 | to end the discussion
00:45:17.600 | about what the state
00:45:19.600 | of the art in deep learning
00:45:21.600 | holds because everything we're doing
00:45:23.600 | is fundamentally based on
00:45:25.600 | ideas
00:45:27.600 | from the 60s and the 80s
00:45:29.600 | and really in terms of new ideas
00:45:31.600 | there's not been many new ideas
00:45:33.600 | especially the
00:45:35.600 | state of the art results that I've mentioned
00:45:37.600 | are all based on
00:45:39.600 | fundamentally
00:45:41.600 | on stochastic gradient
00:45:43.600 | descent and backpropagation
00:45:45.600 | it's ripe for
00:45:47.600 | totally new ideas so it's
00:45:49.600 | up to us to define the
00:45:51.600 | real breakthroughs and the real
00:45:53.600 | state of the art 2019
00:45:55.600 | and beyond. So with that
00:45:57.600 | I'd like to thank
00:45:59.600 | you and the stuff is on the
00:46:01.600 | website deeplearning.mit.edu
00:46:03.600 | [applause]