
Deep Learning Basics: Introduction and Overview


Chapters

0:00 Introduction
0:53 Deep learning in one slide
4:55 History of ideas and tools
9:43 Simple example in TensorFlow
11:36 TensorFlow in one slide
13:32 Deep learning is representation learning
16:02 Why deep learning (and why not)
22:00 Challenges for supervised learning
38:27 Key low-level concepts
46:15 Higher-level methods
66:00 Toward artificial general intelligence


00:00:00.000 | - Welcome everyone to 2019.
00:00:03.160 | It's really good to see everybody here,
00:00:05.280 | making it in the cold.
00:00:06.760 | This is 6.S094, Deep Learning for Self-Driving Cars.
00:00:10.920 | It is part of a series of courses on deep learning
00:00:17.040 | that we're running throughout this month.
00:00:19.880 | The website that you can get all the content,
00:00:22.240 | the videos, the lectures, and the code
00:00:23.940 | is deeplearning.mit.edu.
00:00:26.400 | The videos and slides will be made available there,
00:00:29.320 | along with a GitHub repository
00:00:31.560 | that's accompanying the course.
00:00:33.920 | Assignments for registered students
00:00:35.360 | will be emailed later on in the week.
00:00:39.400 | And you can always contact us with questions,
00:00:41.520 | concerns, comments at hcaihumancenteredai@mit.edu.
00:00:46.520 | So let's start through the basics, the fundamentals.
00:00:52.480 | To summarize in one slide, what is deep learning?
00:00:58.320 | It is a way to extract useful patterns from data
00:01:02.280 | in an automated way,
00:01:03.960 | with as little human effort involved as possible,
00:01:08.960 | hence the automated.
00:01:13.000 | The fundamental aspect that we'll talk about a lot
00:01:16.280 | is the optimization of neural networks.
00:01:18.940 | The practical nature that we'll provide through the code
00:01:22.360 | and so on is that there's libraries
00:01:27.660 | that make it accessible and easy
00:01:30.480 | to do some of the most powerful things in deep learning
00:01:34.620 | using Python, TensorFlow, and friends.
00:01:36.980 | The hard part always with machine learning
00:01:42.840 | and artificial intelligence in general
00:01:45.720 | is asking good questions and getting good data.
00:01:49.620 | A lot of times, the exciting aspects
00:01:51.840 | of what the news covers
00:01:54.320 | and a lot of the exciting aspects of what is published
00:01:58.120 | in the prestigious conferences, on arXiv,
00:02:01.300 | in a blog post is the methodology.
00:02:04.300 | The hard part is applying that methodology
00:02:07.600 | to solve real-world problems,
00:02:09.000 | to solve fascinating, interesting problems,
00:02:10.880 | and that requires data.
00:02:12.580 | That requires asking the right questions of that data,
00:02:15.940 | organizing that data,
00:02:18.100 | and labeling, selecting aspects of that data
00:02:21.520 | that can reveal the answers to the questions you ask.
00:02:24.660 | So why has this breakthrough happened over the past decade
00:02:30.780 | in the application of neural networks?
00:02:33.300 | The ideas of neural networks
00:02:34.620 | have been around since the 1940s,
00:02:36.020 | and ideas have been percolating even before.
00:02:40.120 | So what has happened, what has changed?
00:02:42.540 | The digitization of information, data,
00:02:48.620 | the ability to access data easily
00:02:51.320 | in a distributed fashion across the world,
00:02:53.440 | all kinds of problems have now a digital form
00:02:56.640 | that can be accessed by learning algorithms.
00:02:59.800 | Hardware, compute, both the Moore's Law of CPU
00:03:04.800 | and GPU and ASICs, Google's TPU systems,
00:03:10.000 | hardware that enables the efficient,
00:03:13.120 | effective, large-scale execution of these algorithms.
00:03:18.660 | Community, people here, people all over the world
00:03:22.340 | are being able to work together, to talk to each other,
00:03:25.160 | to feed the fire of excitement behind machine learning.
00:03:28.500 | GitHub and beyond.
00:03:31.240 | The tooling, as we'll talk about TensorFlow,
00:03:35.620 | PyTorch, and everything in between,
00:03:38.500 | that enables a person with an idea
00:03:45.620 | to reach a solution in less and less and less time.
00:03:50.620 | Higher and higher levels of abstraction
00:03:53.000 | empower people to solve problems in less and less time
00:03:57.120 | with less and less knowledge,
00:03:59.020 | where the idea and the data become the central point,
00:04:02.440 | not the effort that takes you from the idea to the solution.
00:04:06.420 | And there's been a lot of exciting progress,
00:04:09.720 | some of which we'll talk about,
00:04:11.000 | from face recognition to the general problem
00:04:13.840 | of scene understanding, image classification to speech,
00:04:17.320 | text, natural language processing, transcription,
00:04:20.520 | translation in medical applications and medical diagnosis.
00:04:25.040 | And cars, being able to solve many aspects of perception
00:04:29.720 | in autonomous vehicles with drivable area lane detection,
00:04:33.020 | object detection, digital assistants,
00:04:36.040 | the ones on your phone and beyond, the ones in your home.
00:04:40.800 | Ads, recommender systems, from Netflix to Search
00:04:44.600 | to Social, Facebook, and of course,
00:04:48.520 | the deep reinforcement learning successes
00:04:50.520 | in the playing of games,
00:04:52.120 | from board games to StarCraft and Dota.
00:04:54.680 | Let's take a step back.
00:05:00.040 | Deep learning is more than a set of tools
00:05:04.520 | to solve practical problems.
00:05:08.480 | Pamela McCorduck said in '79,
00:05:10.960 | "AI began with the ancient wish to forge the gods."
00:05:15.000 | Throughout our history, throughout our civilization,
00:05:18.560 | human civilization, we've dreamed about creating echoes
00:05:22.160 | of whatever is in this mind of ours in the machine
00:05:27.160 | and creating living organisms.
00:05:29.440 | From the popular culture in the 1800s
00:05:33.200 | with Frankenstein to Ex Machina,
00:05:35.400 | this vision, this dream of understanding intelligence
00:05:38.920 | and creating intelligence has captivated all of us.
00:05:41.480 | And deep learning is at the core of that
00:05:45.400 | because there's aspects of it, the learning aspects
00:05:48.920 | that captivate our imagination about what is possible,
00:05:52.080 | given data and methodology, what learning,
00:05:56.660 | learning to learn and beyond, how far that can take us.
00:06:03.000 | And here visualized is just 3% of the neurons
00:06:06.280 | and 1/1,000,000 of the synapses in our own brain.
00:06:10.720 | This incredible structure that's in our mind
00:06:13.280 | and there's only echoes of it,
00:06:15.000 | small shadows of it in our artificial neural networks
00:06:18.840 | that we're able to create, but nevertheless,
00:06:21.040 | those echoes are inspiring to us.
00:06:23.740 | The history of neural networks
00:06:28.160 | on this pale blue dot of ours
00:06:31.640 | started quite a while ago
00:06:34.360 | with summers and winters, with excitements
00:06:39.440 | and periods of pessimism,
00:06:42.160 | starting in the '40s with neural networks
00:06:44.040 | and the implementation of those neural networks
00:06:45.960 | as a perceptron in the '50s,
00:06:47.880 | with ideas of back propagation,
00:06:50.640 | restricted Boltzmann machines,
00:06:53.160 | recurrent neural networks in the '70s and '80s
00:06:55.800 | with convolutional neural networks
00:06:57.520 | and the MNIST dataset,
00:07:00.020 | with datasets beginning to percolate,
00:07:01.920 | with LSTMs, bidirectional RNNs in the '90s,
00:07:05.360 | and the rebranding and the rebirth of neural networks
00:07:09.280 | under the flag of deep learning
00:07:11.600 | and deep belief nets in 2006,
00:07:14.360 | the birth of ImageNet in 2009, the dataset
00:07:17.040 | on which the possibilities of what deep learning
00:07:21.600 | can bring to the world have first been illustrated
00:07:24.320 | in recent years,
00:07:27.920 | and AlexNet, the network that, on ImageNet,
00:07:30.940 | performed exactly that,
00:07:32.520 | with a few ideas like dropout
00:07:34.780 | that improved neural networks over time
00:07:36.340 | every year by year,
00:07:37.700 | improving the performance of neural networks.
00:07:39.740 | In 2014, the idea of GANs
00:07:43.620 | that Yann LeCun called the most exciting idea
00:07:47.020 | of the last 20 years,
00:07:48.540 | the generative adversarial networks,
00:07:50.140 | the ability to, with very little supervision,
00:07:52.840 | generate data, to generate ideas.
00:07:55.700 | After forming representation of those,
00:07:57.760 | from the understanding,
00:08:00.920 | from the high-level abstractions
00:08:02.440 | of what is extracted in the data,
00:08:04.400 | be able to generate new samples, create.
00:08:08.200 | The idea of being able to create
00:08:10.200 | as opposed to memorize is really exciting.
00:08:13.240 | And on the applied side,
00:08:14.980 | in 2014 with DeepFace,
00:08:18.060 | the ability to do face recognition.
00:08:20.120 | There's been a lot of breakthroughs
00:08:21.960 | on the computer vision front,
00:08:23.400 | that being one of them.
00:08:24.960 | The world was inspired,
00:08:29.140 | captivated in 2016 with AlphaGo
00:08:31.780 | and '17 with AlphaZero,
00:08:33.460 | beating with less and less and less effort
00:08:38.000 | the best players in the world at Go.
00:08:41.060 | The problem that, for most of the history
00:08:44.380 | of artificial intelligence,
00:08:45.380 | thought to be unsolvable.
00:08:47.700 | And new ideas with capsule networks.
00:08:49.620 | And this year is the year,
00:08:51.680 | 2018 was the year of natural language processing.
00:08:55.820 | A lot of interesting breakthroughs.
00:08:57.920 | Google's BERT and others that we'll talk about,
00:09:02.620 | breakthroughs on ability to understand language,
00:09:06.060 | understand speech, and everything,
00:09:09.340 | including generation, that's built all around that.
00:09:11.940 | And there's a parallel history of tooling,
00:09:16.520 | starting in the '60s with the Perceptron
00:09:18.860 | and the wiring diagrams.
00:09:20.700 | There, ending with this year,
00:09:23.540 | with PyTorch 1.0 and TensorFlow 2.0.
00:09:26.860 | These really solidified, exciting, powerful ecosystems
00:09:31.580 | of tools that enable you to do very,
00:09:34.900 | to do a lot with very little effort.
00:09:38.340 | The sky is the limit, thanks to the tooling.
00:09:41.040 | So let's then, from the big picture,
00:09:46.940 | take into the smallest.
00:09:49.540 | Everything should be made as simple as possible.
00:09:51.940 | So let's start simple, with a little piece of code,
00:09:57.580 | before we jump into the details
00:10:04.060 | and a big run-through of everything
00:10:06.260 | that is possible in deep learning.
00:10:08.820 | At the very basic level, with just a few lines of code,
00:10:12.420 | really six here, six little pieces of code,
00:10:16.640 | you can train a neural network
00:10:17.900 | to understand what's going on in an image.
00:10:20.580 | The classic, that I will always love, MNIST dataset,
00:10:24.380 | the handwriting digits, where the input
00:10:26.860 | to a neural network, a machine learning system,
00:10:28.800 | is the picture of a handwritten digit,
00:10:31.020 | and the output is the number that's in that digit.
00:10:33.980 | It's as simple as, in the first step,
00:10:38.180 | import the library, TensorFlow.
00:10:41.460 | Second step, import the dataset, MNIST.
00:10:45.940 | Third step, like Lego bricks, stack on top of each other,
00:10:50.580 | the neural network, layer by layer,
00:10:53.500 | with a hidden layer, an input layer, an output layer.
00:10:56.780 | Step four, train the model,
00:11:00.340 | as simple as a single line, model fit.
00:11:02.680 | Evaluate the model in step five, on the testing dataset,
00:11:07.820 | and that's it.
00:11:08.660 | In step six, you're ready to deploy.
00:11:10.420 | You're ready to predict what's in the image.
00:11:13.940 | It's as simple as that.
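
For reference, a minimal sketch of those six steps in TensorFlow/Keras might look like this (the layer sizes and hyperparameters here are illustrative, not necessarily the exact ones from the course repository):

```python
import tensorflow as tf  # step 1: import the library

# step 2: import the dataset (MNIST handwritten digits)
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

# step 3: stack the network layer by layer, like Lego bricks
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),   # input layer
    tf.keras.layers.Dense(128, activation='relu'),   # hidden layer
    tf.keras.layers.Dense(10, activation='softmax')  # output layer (10 digits)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# step 4: train the model
model.fit(x_train, y_train, epochs=5)

# step 5: evaluate the model on the testing dataset
model.evaluate(x_test, y_test)

# step 6: ready to deploy -- predict what's in an image
predictions = model.predict(x_test[:1])
```
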
00:11:15.300 | And much of this code, obviously, much more complicated,
00:11:19.620 | or much more elaborate, and rich, and interesting,
00:11:23.700 | and complex, we'll be making available on GitHub,
00:11:27.540 | on our repository that accompanies these courses.
00:11:30.300 | Today, we've released the first tutorial
00:11:32.180 | on driver-scene segmentation.
00:11:33.540 | I encourage everybody to go through it.
00:11:36.120 | And then, on the tooling side, in one slide,
00:11:41.500 | before we dive into the neural networks,
00:11:43.780 | and deep learning, the tooling side,
00:11:47.300 | amongst many other things,
00:11:48.540 | TensorFlow is a deep learning library,
00:11:50.740 | an open-source library from Google.
00:11:52.900 | The most popular one to date,
00:11:55.940 | the most active with a large ecosystem.
00:11:58.820 | It's not just something you import in Python,
00:12:02.520 | and to solve some basic problems.
00:12:04.520 | There's an entire ecosystem of tooling.
00:12:06.480 | There's different levels of APIs.
00:12:10.420 | Much of what we'll do in this course
00:12:12.460 | will be the highest-level API with Keras.
00:12:15.620 | But there's also the ability to run in the browser
00:12:17.980 | with TensorFlow.js, on the phone with TensorFlow Lite,
00:12:21.740 | in the cloud, without any need to have a computer,
00:12:26.580 | hardware, anything, any of the libraries
00:12:28.260 | set up on your own machine.
00:12:29.140 | You can run all the code that we're providing
00:12:31.720 | in the cloud with Google Colaboratory.
00:12:35.260 | And the optimized ASICs hardware
00:12:38.100 | that Google has optimized for TensorFlow
00:12:41.740 | with their TPU, Tensor Processing Unit,
00:12:44.140 | ability to visualize TensorBoard,
00:12:46.140 | models are provided in TensorFlow Hub.
00:12:48.900 | And there's just an entire ecosystem,
00:12:51.300 | including, most importantly, I think,
00:12:53.500 | documentation and blogs that make it
00:12:57.140 | extremely accessible to understand
00:13:01.280 | the fundamentals of the tooling
00:13:04.220 | that allow you to solve the problems
00:13:05.700 | from natural language processing,
00:13:06.820 | to computer vision, to GANs,
00:13:09.060 | generative adversarial neural networks,
00:13:10.620 | and everything in between,
00:13:13.420 | deep reinforcement learning, and so on.
00:13:15.400 | So that's why we're excited to work
00:13:19.940 | both in the theory in this course,
00:13:21.980 | in this series of lectures,
00:13:25.060 | and in the tooling and the applied side of TensorFlow.
00:13:28.300 | It really makes these ideas
00:13:30.460 | exceptionally accessible.
00:13:32.420 | So deep learning, at the core,
00:13:34.580 | is the ability to form higher and higher level
00:13:36.780 | of abstractions, of representations in data,
00:13:40.420 | in raw patterns, higher and higher levels
00:13:42.940 | of understanding of patterns.
00:13:44.600 | And those representations
00:13:48.740 | are extremely important
00:13:53.140 | and effective for being able to interpret data.
00:14:00.680 | Under certain representations,
00:14:03.980 | data is trivial to understand.
00:14:06.980 | Cat versus dog, blue dot versus green triangle.
00:14:11.300 | Under others, it's much more difficult.
00:14:14.620 | In this task, drawing a line under polar coordinates
00:14:19.060 | is trivial.
00:14:20.020 | Under Cartesian coordinates,
00:14:21.900 | it's very difficult, well, impossible to do accurately.
00:14:25.140 | And that's a trivial example of a representation.
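
A small sketch of that idea: two concentric rings of points cannot be separated by a straight line in Cartesian coordinates, but become separable by a single threshold once re-represented in polar coordinates (the data here is synthetic, purely for illustration):

```python
import numpy as np

# Two concentric rings of points: impossible to separate with a straight
# line in Cartesian (x, y) coordinates.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
r = np.concatenate([np.full(100, 1.0), np.full(100, 3.0)])  # inner vs. outer ring
x, y = r * np.cos(theta), r * np.sin(theta)

# The same data re-represented in polar coordinates: the two classes
# are now separated by a single threshold on the radius.
radius = np.sqrt(x ** 2 + y ** 2)
labels = radius > 2.0  # a trivial "line" in the new representation
```
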
00:14:28.060 | So our task with deep learning,
00:14:29.860 | with machine learning in general,
00:14:31.940 | is forming representations that map the topology,
00:14:35.380 | this, whatever the topology,
00:14:37.700 | the rich space of the problem that you're trying to deal
00:14:40.380 | with of the raw inputs,
00:14:42.580 | map it in such a way
00:14:44.260 | that the final representation is trivial to work with,
00:14:49.920 | trivial to classify,
00:14:51.540 | trivial to perform regression,
00:14:55.380 | trivial to generate new samples of that data.
00:14:58.220 | And that representation of higher and higher levels
00:15:00.460 | of representation is really the dream
00:15:03.820 | of artificial intelligence.
00:15:06.020 | That is what understanding is,
00:15:07.940 | making the complex simple,
00:15:10.700 | like Einstein said a few slides back.
00:15:14.380 | And that, with Juergen Schmidhuber,
00:15:19.140 | and whoever else said it, I don't know,
00:15:21.300 | that's been the dream of all of science in general,
00:15:26.840 | of the history of science
00:15:28.580 | is the history of compression progress,
00:15:30.900 | of forming simpler
00:15:32.940 | and simpler representations of ideas.
00:15:38.740 | The models of the universe of our solar system
00:15:44.980 | with the Earth at the center of it
00:15:47.340 | is much more complex to perform,
00:15:49.860 | to do physics on than a model
00:15:53.500 | where the sun is at the center.
00:15:56.260 | Those higher and higher levels of simple representations
00:16:00.060 | enable us to do extremely powerful things.
00:16:02.140 | That has been the dream of science
00:16:03.740 | and the dream of artificial intelligence.
00:16:05.780 | And why deep learning?
00:16:09.500 | What is so special about deep learning
00:16:12.380 | in the grander world of machine learning
00:16:14.340 | and artificial intelligence?
00:16:15.740 | It's the ability to more and more remove
00:16:21.020 | the input of human experts,
00:16:23.120 | remove the human from the picture,
00:16:25.020 | the costly, inefficient effort
00:16:27.300 | of human beings in the picture.
00:16:29.860 | Deep learning automates much of the extraction and
00:16:33.220 | gets us closer and closer to the raw data
00:16:37.180 | without the need of human involvement,
00:16:39.460 | human expert involvement,
00:16:40.820 | ability to form representations from the raw data
00:16:43.460 | as opposed to having a human being
00:16:45.540 | needing to extract features
00:16:47.720 | as was done in the 80s and 90s
00:16:50.420 | and the early aughts to extract features
00:16:53.580 | with which then the machine learning algorithms
00:16:55.520 | can work with.
00:16:56.360 | The automated extraction of features
00:16:58.420 | enables us to work with larger and larger datasets,
00:17:00.940 | removing the human completely
00:17:02.980 | except from the supervision labeling step at the very end.
00:17:07.020 | It doesn't require the human expert.
00:17:09.060 | But at the same time,
00:17:12.340 | there is limits to our technologies.
00:17:18.340 | There's always a balance between excitement
00:17:22.100 | and disillusionment.
00:17:23.700 | The Gartner hype cycle
00:17:26.460 | as much as we don't like to think about it
00:17:31.460 | applies to almost every single technology.
00:17:33.940 | Of course, the magnitude of the peaks
00:17:35.380 | and the troughs is different.
00:17:36.740 | But I would say we're at the peak
00:17:40.220 | of inflated expectations with deep learning.
00:17:43.700 | And that's something we have to think about
00:17:45.180 | as we talk about some of the ideas
00:17:46.500 | and exciting possibilities of the future.
00:17:48.540 | And with self-driving cars,
00:17:51.540 | which we'll talk about in future lectures in this course,
00:17:54.040 | we're at the same point.
00:17:55.300 | In fact, we're a little bit beyond the peak.
00:17:57.780 | And so it's up to us.
00:17:59.740 | This is MIT and the engineers
00:18:02.300 | and the people working on this in the world
00:18:04.380 | to carry us through the trough,
00:18:07.380 | to carry us through the future
00:18:09.640 | as the ups and downs of the excitement progresses forward
00:18:14.640 | into the plateau of productivity.
00:18:18.040 | Why else not deep learning?
00:18:22.900 | If we look at real world applications,
00:18:25.260 | especially with humanoid robotics,
00:18:29.600 | robotic manipulation,
00:18:31.100 | and even, yes, autonomous vehicles,
00:18:34.740 | majority of the aspects of the autonomous vehicles
00:18:37.440 | do not involve to an extensive amount
00:18:40.600 | machine learning today.
00:18:41.940 | The problems are not formulated as data-driven learning.
00:18:46.260 | Instead, they're model-based optimization methods
00:18:49.500 | that don't learn from data over time.
00:18:51.980 | And then from the speakers these couple of weeks,
00:18:54.980 | we'll get to see how much machine learning
00:18:57.500 | is starting to creep in.
00:18:59.180 | But in the example shown here,
00:19:01.540 | with the amazing humanoid robotics of Boston Dynamics,
00:19:04.560 | to date, almost no machine learning has been used
00:19:09.300 | except for trivial perception.
00:19:11.700 | The same with autonomous vehicles.
00:19:13.580 | Almost no machine learning, deep learning
00:19:15.180 | has been used except with perception.
00:19:18.800 | Some aspect of enhanced perception
00:19:20.940 | from the visual texture information.
00:19:22.740 | Plus, what's becoming, what's starting to be used
00:19:27.260 | a little bit more is the use of recurrent neural networks
00:19:32.260 | to predict the future,
00:19:36.020 | to predict the intent of the different players in the scene
00:19:41.020 | in order to anticipate what the future is.
00:19:43.220 | But these are very early steps.
00:19:44.960 | Most of the success that you see today,
00:19:46.860 | the 10 million miles that Waymo has achieved,
00:19:50.340 | has been attributed mostly to non-machine learning methods.
00:19:54.580 | Why else not deep learning?
00:19:58.580 | Here's a really clean example of unintended consequences.
00:20:03.700 | Of ethical issues we have to really think about.
00:20:11.640 | When an algorithm learns from data
00:20:14.540 | based on an objective function, a loss function,
00:20:17.820 | the power, the consequences of an algorithm
00:20:22.820 | that optimizes that function is not always obvious.
00:20:25.820 | Here's an example of a human player
00:20:28.380 | playing the game of Coast Runners with a,
00:20:31.740 | it's a boat racing game where the task is to go
00:20:34.580 | around the racetrack and try to win the race.
00:20:38.280 | And the objective is to get as many points as possible.
00:20:42.620 | There are three ways to get points.
00:20:44.640 | The finishing time, how long it took you to finish.
00:20:47.340 | The finishing position, where you were in the ranking.
00:20:50.980 | And picking up quote unquote turbos,
00:20:54.220 | those little green things along the way
00:20:56.220 | that give you points.
00:20:57.820 | Okay, simple enough.
00:20:59.180 | So we design an agent, in this case an RL agent,
00:21:02.700 | that optimizes for the rewards.
00:21:06.460 | And what we find on the right here,
00:21:10.220 | the optimal, the agent discovers that the optimal
00:21:13.140 | actually has nothing to do with finishing the race
00:21:15.500 | or the ranking.
00:21:16.840 | That you can get much more points
00:21:19.220 | by just focusing on the turbos
00:21:20.920 | and collecting those little green dots
00:21:23.960 | because they regenerate.
00:21:25.300 | So you go in circles over and over and over,
00:21:27.400 | slamming into the wall, collecting the green turbos.
00:21:32.060 | Now that's a very clear example of a well-reasoned,
00:21:37.060 | well-formulated objective function
00:21:41.280 | that has totally unexpected consequences.
00:21:43.980 | At least without sort of considering
00:21:47.620 | those consequences ahead of time.
00:21:49.260 | And so that shows the need for AI safety
00:21:52.060 | for a human in the loop of machine learning.
00:21:55.740 | That's why not deep learning exclusively.
00:21:57.860 | The challenge of deep learning algorithms,
00:22:05.780 | of deep learning applied,
00:22:07.280 | is to ask the right question
00:22:10.320 | and understand what the answers mean.
00:22:13.100 | You have to take a step back
00:22:15.060 | and look at the difference,
00:22:19.480 | the distinction, the levels,
00:22:23.500 | degrees of what the algorithm is accomplishing.
00:22:25.440 | For example, image classification
00:22:27.580 | is not necessarily scene understanding.
00:22:30.220 | In fact, it's very far from scene understanding.
00:22:33.540 | Classification may be very far from understanding.
00:22:36.760 | And the data sets vary drastically
00:22:41.660 | across the different benchmarks and the data sets used.
00:22:45.120 | The professionally done photographs
00:22:47.140 | versus synthetically generated images
00:22:49.860 | versus real world data.
00:22:52.440 | And the real world data is where the big impact is.
00:22:56.040 | So oftentimes the one doesn't transfer to the other.
00:22:59.660 | That's the challenge of deep learning.
00:23:01.560 | Solving all of these problems
00:23:04.500 | of different lighting variations,
00:23:05.820 | of pose variation, inter-class variation,
00:23:07.940 | all the things that we take for granted as human beings
00:23:10.340 | with our incredible perception system,
00:23:12.220 | all have to be solved in order to gain
00:23:14.320 | greater and greater understanding of a scene.
00:23:16.580 | And all the other things we have to close the gap on
00:23:20.300 | that we're not even close to yet.
00:23:22.420 | Here's an image from the Andrej Karpathy blog
00:23:25.140 | from a few years ago
00:23:26.620 | of former President Obama stepping on a scale.
00:23:30.620 | We can classify, we can do semantic segmentation
00:23:33.580 | of the scene, we can do object detection,
00:23:35.060 | we can do a little bit of 3D reconstruction
00:23:37.500 | from a video version of the scene.
00:23:39.140 | But what we can't do well
00:23:42.180 | is all the things we take for granted.
00:23:44.100 | We can't tell the images in the mirrors
00:23:46.180 | versus in reality as different.
00:23:50.000 | We can't deal with the sparsity of information.
00:23:52.880 | Just a few pixels on President Obama's face,
00:23:55.620 | we can still identify him as the president.
00:23:57.780 | The 3D structure of the scene,
00:24:02.100 | that there's a foot on top of a scale,
00:24:04.100 | that there's human beings behind from a single image,
00:24:08.660 | things we can trivially do using all the common sense
00:24:11.620 | semantic knowledge that we have, machines cannot do.
00:24:14.460 | The physics of the scene, that there's gravity.
00:24:16.900 | And the biggest thing, the hardest thing,
00:24:20.560 | is what's on people's minds.
00:24:22.600 | And what's on people's minds
00:24:23.820 | about what's on other people's minds, and so on.
00:24:27.380 | Mental models of the world,
00:24:29.260 | being able to infer what people are thinking about.
00:24:32.140 | Being able to infer,
00:24:33.900 | there's been a lot of exciting work here at MIT
00:24:35.700 | about what people are looking at.
00:24:38.260 | But we're not even close to solving that problem either.
00:24:40.500 | But what they're thinking about,
00:24:42.100 | we haven't even begun to really think about that problem.
00:24:46.500 | And we do it trivially as human beings.
00:24:48.960 | And I think at the core of that,
00:24:52.600 | I think I'm harboring on the visual perception problem,
00:24:55.860 | because it's one we take really for granted as human beings,
00:24:59.340 | especially when trying to solve real world problems,
00:25:01.220 | especially when trying to solve autonomous driving.
00:25:04.980 | We have 540 million years of data for visual perception,
00:25:08.860 | so we take it for granted.
00:25:10.700 | We don't realize how difficult it is.
00:25:12.740 | And we kind of focus all our attention
00:25:14.340 | on this recent development of 100,000 years
00:25:16.940 | of abstract thought, being able to play chess,
00:25:19.060 | being able to reason.
00:25:21.100 | But the visual perception is nevertheless
00:25:23.460 | extremely difficult.
00:25:25.660 | At every single layer of what's required
00:25:28.940 | to perceive, interpret, and understand
00:25:31.820 | the fundamentals of a scene.
00:25:34.260 | And a trivial way to show that
00:25:36.220 | is just all the ways you can mess
00:25:38.180 | with these image classification systems
00:25:40.200 | by adding a little bit of noise.
00:25:42.020 | The last few years, there's been a lot of papers,
00:25:44.760 | a lot of work to show that you can mess with these systems
00:25:49.060 | by adding noise here with 99% accuracy,
00:25:52.480 | predict a dog, add a little bit of distortion.
00:25:55.500 | Immediately the system predicts with 99% accuracy
00:25:59.100 | that it's an ostrich.
00:26:00.100 | And you can do that kind of manipulation
00:26:02.060 | with just a single pixel.
00:26:03.580 | So that's just a clean way to show
00:26:07.520 | the gap between image classification
00:26:10.020 | on an artificial data set like ImageNet
00:26:12.380 | and real world perception that has to be solved,
00:26:15.300 | especially for life critical situations
00:26:17.260 | like autonomous driving.
00:26:18.460 | I really like this Max Tegmark's visualization
00:26:26.800 | of this rising sea of the landscape of human competence
00:26:32.980 | from Hans Moravec.
00:26:34.580 | And this is the difference as we progress forward
00:26:40.860 | and we discuss some of these machine learning methods
00:26:44.260 | is there is the human intelligence,
00:26:48.020 | the general human intelligence,
00:26:50.140 | let's call Einstein here,
00:26:52.940 | that's able to generalize over all kinds of problems,
00:26:56.460 | over all kinds of from the common sense
00:26:58.860 | to the incredibly complex.
00:27:01.780 | And then there is the way we've been doing,
00:27:04.620 | especially data driven machine learning,
00:27:07.120 | which is savant-like, which is specialized intelligence,
00:27:11.740 | extremely smart at a particular task,
00:27:14.660 | but not being able to transfer
00:27:16.100 | except in the very narrow neighborhood
00:27:17.840 | on this little landscape of different,
00:27:20.420 | of art, cinematography, book writing at the peaks
00:27:23.460 | and chess, arithmetic and theorem proving
00:27:26.180 | and vision at the bottom in the lake.
00:27:29.740 | And there's this rising sea
00:27:31.020 | as we solve problem after problem,
00:27:33.100 | the question can the methodology
00:27:36.300 | and the approach of deep learning
00:27:38.540 | of everything we're doing now keep the sea rising
00:27:42.300 | or do fundamental breakthroughs have to happen
00:27:44.380 | in order to generalize and solve these problems.
00:27:47.780 | And so from the specialized where the successes are,
00:27:51.340 | the systems are essentially boiled down to
00:27:56.340 | given the data set and given the ground truth
00:27:59.360 | for that data set, here's the apartment cost
00:28:02.140 | in the Boston area, be able to input several parameters
00:28:06.460 | and based on those parameters, predict the apartment cost.
00:28:09.820 | That's the basic premise approach
00:28:14.340 | behind the successful supervised
00:28:18.100 | deep learning systems today.
00:28:19.620 | If you have good enough data,
00:28:21.820 | there's good enough ground truth
00:28:22.980 | and can be formalized, we can solve it.
00:28:26.240 | Some of the recent promise, on which we will do
00:28:30.980 | an entire series of lectures in the third week
00:28:33.220 | on deep reinforcement learning, showed
00:28:35.300 | that from raw sensory information
00:28:38.900 | with very little annotation through self-play
00:28:41.740 | where their systems learn without human supervision
00:28:46.740 | are able to perform extremely well
00:28:49.060 | in these constrained contexts.
00:28:50.860 | The question of a video game,
00:28:53.820 | here pong to pixels, being able to perceive
00:28:56.680 | the raw pixels of this pong game
00:28:59.800 | as raw input and learn the fundamental
00:29:04.060 | quote unquote physics of this game,
00:29:06.400 | understand how it is this game behaves
00:29:10.480 | and how to be able to win this game.
00:29:12.300 | That's kind of a step toward general purpose
00:29:14.940 | artificial intelligence, but it is a very small step
00:29:18.280 | because it's in a simulated, very trivial situation.
00:29:23.580 | That's the challenge that's before us.
00:29:26.180 | With less and less human supervision,
00:29:27.800 | will we be able to solve huge real world problems,
00:29:31.860 | from the top supervised learning
00:29:35.620 | where majority of the teaching is done by human beings
00:29:39.340 | throughout the annotation process
00:29:40.740 | through labeling all the data
00:29:42.220 | by showing different examples
00:29:43.860 | and further and further down to semi-supervised learning,
00:29:51.140 | reinforcement learning and unsupervised learning,
00:29:51.140 | removing the teacher from the picture.
00:29:53.260 | And making that teacher extremely efficient
00:29:56.420 | when it is needed.
00:29:57.340 | Of course, data augmentation is one way
00:30:02.460 | as we'll talk about.
00:30:03.980 | So taking a small number of examples
00:30:07.620 | and messing with that set of examples,
00:30:10.980 | augmenting that set of examples
00:30:12.780 | through trivial and through complex methods
00:30:15.580 | of cropping, stretching, shifting and so on
00:30:18.140 | including through generative networks,
00:30:20.140 | modifying those images to grow a small data set
00:30:22.600 | into a large one to minimize,
00:30:25.660 | to decrease further and further the input
00:30:28.380 | that's the input of the human teacher.
00:30:32.740 | But still, that's quite far away
00:30:34.980 | from the incredibly efficient both teaching
00:30:38.900 | and learning that humans do.
00:30:41.340 | This is a video and there's many of them online
00:30:46.340 | for the first time, a human baby walking.
00:30:51.960 | (video playing)
00:30:54.740 | We learned to do this, it's one shot learning.
00:30:59.060 | One day you're on all fours
00:31:04.100 | and the next day you put your two hands up
00:31:07.060 | and then you figure out the rest, one shot.
00:31:10.860 | Well, you can kind of, ish,
00:31:14.580 | you can kind of play around with it.
00:31:16.360 | But the point is you're extremely efficient.
00:31:19.220 | With only a few examples are able to learn
00:31:21.940 | the fundamental aspect of how to solve a particular problem.
00:31:24.940 | Machines in most cases need thousands, millions
00:31:31.060 | and sometimes more examples depending
00:31:32.980 | on the life critical nature of the application.
00:31:35.340 | The data flow of supervised learning systems
00:31:49.200 | is there's input data, there's a learning system
00:31:51.880 | and there is output.
00:31:53.320 | Now in the training stage for the output
00:31:56.280 | we have the ground truth.
00:31:57.880 | And so we use that ground truth to teach the system.
00:32:02.880 | In the testing stage, when it goes out into the wild
00:32:05.320 | there's new input data over which we have to generalize
00:32:07.520 | with the learning system and have to make our best guess.
00:32:10.680 | In the training stage, the processes with neural networks
00:32:15.680 | is given the input data for which we have the ground truth,
00:32:18.360 | pass it through the model, get the prediction
00:32:21.320 | and given that we have the ground truth
00:32:23.000 | we can compare the prediction to the ground truth,
00:32:25.280 | look at the error and based on the error adjust the weights.
00:32:28.480 | The types of predictions we can make
00:32:30.720 | is regression and classification.
00:32:32.680 | Regression is a continuous
00:32:34.240 | and classification is categorical.
00:32:37.000 | Here, if we look at weather, the regression problem says
00:32:42.920 | what is the temperature going to be tomorrow
00:32:46.000 | and the classification formulation of that problem
00:32:48.200 | says is it going to be hot or cold
00:32:50.320 | or some threshold definition of what hot or cold is.
00:32:53.520 | That's regression classification.
00:32:55.280 | On the classification front, it can be multi-class
00:32:58.560 | which is the standard formulation
00:33:01.000 | where you're tasked with saying
00:33:02.600 | a particular entity can only be one thing
00:33:08.320 | and then there's multi-label
00:33:09.840 | where a particular entity can be multiple things.
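
In code, the distinction usually shows up only in the output layer and the loss; a hedged Keras sketch (the class count of 10 is arbitrary):

```python
import tensorflow as tf

# Multi-class: each example belongs to exactly one of N classes,
# so the output layer is a softmax over the classes
# (paired with a categorical cross-entropy loss).
multi_class_head = tf.keras.layers.Dense(10, activation='softmax')

# Multi-label: each example can belong to several classes at once,
# so each output is an independent sigmoid
# (paired with a binary cross-entropy loss).
multi_label_head = tf.keras.layers.Dense(10, activation='sigmoid')
```
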
00:33:12.280 | And overall, the input to the system
00:33:16.480 | can be not just a single sample of the particular dataset
00:33:21.480 | and the output doesn't have to be a particular sample
00:33:24.880 | of the ground truth dataset.
00:33:26.720 | It can be a sequence, sequence to sequence,
00:33:29.680 | a single sample to a sequence,
00:33:31.480 | a sequence to sample and so on.
00:33:33.760 | From video captioning
00:33:37.120 | to translation to natural language generation
00:33:41.960 | to of course the one-to-one general computer vision.
00:33:45.760 | Okay, that's the bigger picture.
00:33:47.200 | Let's step back from the big to the small
00:33:49.760 | to a single neuron inspired by our own brain,
00:33:54.760 | the biological neural networks in our brain
00:33:58.400 | and the computational block
00:34:00.120 | that is behind a lot of the intelligence in our mind.
00:34:03.200 | The artificial neuron has inputs with weights on them
00:34:08.920 | plus a bias and an activation function and an output.
00:34:14.280 | It's inspired by this thing.
00:34:16.320 | As I showed it before,
00:34:17.480 | here visualizes the thalamic cortical system
00:34:20.280 | with three million neurons and 476 million synapses.
00:34:24.000 | The full brain has a hundred billion neurons
00:34:29.000 | and a thousand trillion synapses.
00:34:33.400 | ResNet and some of the other state-of-the-art networks
00:34:36.760 | have in tens, hundreds of millions of edges of synapses.
00:34:42.760 | The human brain has 10 million times more synapses
00:34:47.760 | than artificial neural networks
00:34:50.720 | and there's other differences.
00:34:52.360 | The topology is asynchronous
00:34:57.360 | and not constructed in layers.
00:35:00.840 | The learning algorithm for artificial neural networks
00:35:03.960 | is back propagation for our biological neurons
00:35:09.520 | and our biological networks we don't know.
00:35:12.960 | That's one of the mysteries of the human brain.
00:35:15.160 | There's ideas but we really don't know.
00:35:17.440 | The power consumption,
00:35:18.760 | human brains are much more efficient than neural networks.
00:35:21.200 | That's one of the problems that we're trying to solve
00:35:23.360 | and ASICs are starting to begin
00:35:25.920 | to solve some of these problems.
00:35:28.080 | And the stages of learning.
00:35:30.680 | In the biological neural networks,
00:35:32.040 | you really never stop learning.
00:35:33.840 | You're always learning, always changing
00:35:35.520 | both on the hardware and the software.
00:35:38.640 | In artificial neural networks,
00:35:40.840 | oftentimes there's a training stage,
00:35:42.640 | there's a distinct training stage
00:35:44.160 | and there's a distinct testing stage
00:35:45.840 | when you release the thing in the wild.
00:35:47.560 | Online learning is an exceptionally difficult thing
00:35:50.040 | that we're still in the very early stages of.
00:35:52.920 | This neuron takes a few inputs,
00:35:59.600 | the fundamental computational block behind neural networks,
00:36:02.840 | takes a few inputs, applies weights,
00:36:05.360 | which are the parameters that are learned,
00:36:07.240 | sums them up, puts it into a nonlinear activation function
00:36:10.960 | after adding the bias,
00:36:12.800 | also a learned parameter, and gives an output.
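
A minimal sketch of that computation in plain NumPy, with a sigmoid as the example activation (the input and weight values are arbitrary):

```python
import numpy as np

def neuron(x, w, b):
    """A single artificial neuron: weighted sum of the inputs plus a bias,
    passed through a nonlinear activation (here a sigmoid)."""
    z = np.dot(w, x) + b             # weights and bias are the learned parameters
    return 1.0 / (1.0 + np.exp(-z))  # activation squashes the sum into (0, 1)

output = neuron(x=np.array([0.5, -1.2, 3.0]),
                w=np.array([0.8, 0.1, -0.4]),
                b=0.2)
```
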
00:36:17.600 | And the task of this neuron is to get excited
00:36:20.360 | based on certain aspects of the layers,
00:36:22.720 | features, inputs that followed before.
00:36:25.960 | And in that ability to discriminate,
00:36:29.280 | get excited by certain things
00:36:30.960 | and get not excited by other things,
00:36:33.080 | hold a little piece of information
00:36:35.360 | of whatever level of abstraction it is.
00:36:37.400 | So when you combine many of them together,
00:36:39.640 | you have knowledge.
00:36:43.800 | Different levels of abstractions form a knowledge base
00:36:46.720 | that's able to represent, understand,
00:36:49.720 | or even act on a particular set of raw inputs.
00:36:53.640 | And you stack these neurons together in layers,
00:36:58.240 | both in width and depth, increasing further on,
00:37:02.000 | and there's a lot of different architectural variants,
00:37:05.240 | but they begin at this basic fact
00:37:08.240 | that with just a single hidden layer of a neural network,
00:37:11.680 | the possibilities are endless.
00:37:13.320 | It can approximate any arbitrary function.
00:37:15.720 | A neural network with a single hidden layer
00:37:20.160 | can approximate any function.
00:37:22.040 | That means any other neural network
00:37:23.640 | with multiple layers and so on
00:37:25.600 | is just interesting optimizations
00:37:29.920 | of how we can discover those functions.
00:37:33.840 | The possibilities are endless.
00:37:35.400 | And the other aspect here is the mathematical underpinnings
00:37:42.080 | of neural networks with the weights
00:37:45.480 | and the differentiable activation functions
00:37:47.840 | are such that in a few steps,
00:37:49.680 | from the inputs to the outputs,
00:37:51.560 | are deeply parallelizable.
00:37:57.120 | And that's why the other aspect on the compute,
00:38:00.960 | the parallelizability of neural networks
00:38:03.080 | is what enables some of the exciting advancements
00:38:07.000 | on the graphical processing unit,
00:38:10.040 | the GPUs and with ASICs, TPUs.
00:38:14.080 | The ability to run across machines,
00:38:17.840 | across GPU units,
00:38:19.400 | in a very large distributed scale
00:38:24.440 | to be able to train and perform inference on neural networks.
00:38:27.440 | Activation functions.
00:38:32.040 | These activation functions put together
00:38:34.160 | are tasked with optimizing a loss function.
00:38:38.480 | For regression, that loss function is
00:38:42.000 | mean squared error, usually.
00:38:45.120 | There's a lot of variants.
00:38:46.320 | And for classification, it's cross-entropy loss.
00:38:48.760 | In the cross-entropy loss, the ground truth is zero, one.
00:38:51.600 | In the mean squared error, it's real numbered.
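
As a rough illustration, the two losses written out in NumPy (the example values are made up):

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # regression: ground truth is real-valued
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(y_true, y_pred, eps=1e-12):
    # classification: ground truth is 0/1 (one-hot), prediction is a probability
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.sum(y_true * np.log(y_pred))

mse = mean_squared_error(np.array([21.5]), np.array([19.0]))  # e.g. tomorrow's temperature
ce = cross_entropy(np.array([0, 1]), np.array([0.3, 0.7]))    # e.g. cold vs. hot
```
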
00:39:00.840 | And so with the loss function,
00:39:02.160 | and the weights and the bias and the activation functions
00:39:04.920 | propagating forward through the network
00:39:06.800 | from the input to the output,
00:39:09.120 | using the loss function,
00:39:10.560 | we use the algorithm of backpropagation,
00:39:12.880 | I wish I did an entire lecture last time,
00:39:16.360 | to adjust the weights,
00:39:21.440 | to have the error flow backwards through the network
00:39:24.000 | and adjust the weights such that, once again,
00:39:27.720 | the weights that were responsible
00:39:30.680 | for producing the correct output are increased,
00:39:36.760 | and the weights that were responsible
00:39:39.040 | for producing the incorrect output were decreased.
00:39:42.680 | The forward pass gives you the error.
00:39:47.840 | The backward pass computes the gradients.
00:39:50.000 | And based on the gradients, the optimization algorithm,
00:39:52.960 | combined with a learning rate, adjusts the weights.
00:39:56.800 | The learning rate is how fast the network learns.
00:40:00.040 | And all of this is possible
00:40:01.720 | on the numerical computation side
00:40:04.480 | with automatic differentiation.
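
A hedged sketch of one training step in TensorFlow 2, using automatic differentiation to get the gradients and an optimizer with a learning rate to adjust the weights; `model` is assumed to be a Keras model like the earlier MNIST one:

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)  # the learning rate
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

def train_step(model, x_batch, y_batch):
    with tf.GradientTape() as tape:
        predictions = model(x_batch, training=True)  # forward pass
        loss = loss_fn(y_batch, predictions)         # error vs. ground truth
    # backward pass: gradients of the loss with respect to the weights
    gradients = tape.gradient(loss, model.trainable_variables)
    # adjust the weights based on the gradients and learning rate
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss
```
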
00:40:06.200 | The optimization problem,
00:40:09.560 | given those gradients that are computed
00:40:11.200 | in the backward flow through the network of the gradients,
00:40:16.200 | is stochastic gradient descent.
00:40:18.760 | There's a lot of variants of this optimization algorithm
00:40:21.040 | that solve various problems,
00:40:23.040 | from dying ReLUs to vanishing gradients.
00:40:26.360 | There's a lot of different parameters
00:40:29.080 | on momentum and so on that really just boil down
00:40:33.520 | to all the different problems that are solved
00:40:35.200 | with nonlinear optimization.
00:40:37.080 | Mini-batch size,
00:40:38.560 | what is the right size of a batch,
00:40:43.680 | or really it's called mini-batch
00:40:44.920 | when it's not the entire dataset,
00:40:47.360 | based on which to compute the gradients
00:40:50.800 | to adjust the learning?
00:40:52.760 | Do you do it over a very large amount,
00:40:55.920 | or do you do it with stochastic gradient descent
00:40:58.800 | for every single sample of the data?
00:41:00.640 | If you listen to Yann LeCun
00:41:03.360 | and a lot of recent literature,
00:41:04.680 | is small mini-batch sizes are good.
00:41:08.240 | He says, "Training with large mini-batches
00:41:10.240 | "is bad for your health.
00:41:11.680 | "More importantly, it's bad for your test error.
00:41:14.000 | "Friends don't let friends use mini-batches larger than 32."
00:41:18.480 | Larger batch size means more computational speed,
00:41:23.320 | 'cause you don't have to update the weights as often.
00:41:25.440 | But smaller batch size empirically
00:41:29.400 | produces better generalization.
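
In Keras this is just the `batch_size` argument to `model.fit`; a one-line sketch, reusing the `model` and data from the earlier MNIST example:

```python
# Mini-batch size of 32, in the spirit of the quote above: the weights are
# updated after every 32 samples rather than after the entire dataset.
model.fit(x_train, y_train, epochs=5, batch_size=32)
```
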
00:41:31.080 | The problem we're often on the broader scale of learning
00:41:38.920 | trying to solve is overfitting.
00:41:42.000 | And the way we solve it is through regularization.
00:41:45.480 | We want to train on a dataset
00:41:49.800 | without memorizing to an extent
00:41:52.520 | that you only do well in that trained dataset.
00:41:56.240 | So you want it to be generalizable into future,
00:41:58.880 | into the future things that you haven't seen yet.
00:42:02.800 | So obviously, this is a problem for small datasets
00:42:07.800 | and also for sets of parameters that you choose.
00:42:10.000 | Here shown an example of a sine curve
00:42:15.000 | trying to fit a particular data
00:42:17.280 | versus a ninth degree polynomial
00:42:19.200 | trying to fit a particular set of data with the blue dots.
00:42:22.800 | The ninth degree polynomial is overfitting.
00:42:25.560 | It does very well for that particular set of samples
00:42:28.240 | but does not generalize well in the general case.
00:42:31.560 | And the trade-off here is as you train further and further,
00:42:36.040 | at a certain point, there's a deviation
00:42:40.760 | between the error being decreased to zero
00:42:45.760 | on the training set and going to one on the test set.
00:42:51.040 | And that's the balance we have to strike.
00:42:53.400 | That's done with the validation set.
00:42:55.520 | So you take a piece of the training set
00:43:00.360 | for which you have the ground truth
00:43:02.120 | and you call it the validation set and you set it aside
00:43:04.680 | and you evaluate the performance of your system
00:43:06.920 | on that validation set.
00:43:09.080 | And after you notice that your trained network
00:43:14.080 | is performing poorly on the validation set
00:43:17.120 | for a prolonged period of time, that's when you stop.
00:43:19.600 | That's early stoppage.
00:43:20.960 | Basically it's getting better and better and better
00:43:22.560 | and then there is some period of time,
00:43:24.560 | there's always noise of course,
00:43:26.040 | and after some period of time, it's definitely getting worse.
00:43:29.560 | And that's, we need to stop there.
00:43:31.720 | So that provides an automated way
00:43:33.560 | to discovering when you need to stop.
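
A sketch of that validation-set-based early stopping with a Keras callback (the patience of 5 epochs and the 20% validation split are arbitrary choices):

```python
import tensorflow as tf

# Hold out part of the training set as a validation set and stop once the
# validation loss has not improved for a few epochs.
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                              patience=5,
                                              restore_best_weights=True)
model.fit(x_train, y_train,
          validation_split=0.2,   # the piece of the training set set aside
          epochs=100,
          callbacks=[early_stop])
```
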
00:43:35.600 | And there's a lot of other regularization methodologies.
00:43:38.560 | Of course, as I mentioned,
00:43:40.000 | dropout is a very interesting approach for,
00:43:43.960 | and its variance of simply
00:43:47.960 | with a certain kind of probability,
00:43:50.560 | randomly remove nodes in the network,
00:43:52.680 | both the incoming and outgoing edges,
00:43:56.320 | randomly throughout the training process.
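
In Keras, dropout is a layer you insert between other layers; a minimal sketch (the 0.5 drop probability is just a common default):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.5),  # randomly zero out 50% of activations during training
    tf.keras.layers.Dense(10, activation='softmax')
])
```
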
00:43:58.520 | And there's normalization.
00:44:01.200 | Normalization is obviously always applied at the input.
00:44:09.600 | So whenever you have a dataset
00:44:14.240 | as different lighting conditions, different variations,
00:44:17.440 | different sources and so on,
00:44:19.080 | you have to all kind of put it on the same level ground
00:44:21.960 | so that we're learning the fundamental aspects
00:44:23.960 | of the input data as opposed to
00:44:26.240 | some less relevant semantic information
00:44:30.080 | like lighting variations and so on.
00:44:31.560 | So we should usually always normalize, for example,
00:44:35.920 | if it's computer vision with pixels from zero to 255,
00:44:38.960 | you always normalize to zero to one or negative one to one
00:44:42.080 | or normalize based on the mean and the standard deviation.
00:44:46.280 | That's something you should almost always do.
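
A sketch of those three common input normalizations for pixel data, assuming `x_train` is the raw 0-255 MNIST array from the earlier example:

```python
import numpy as np

# Pixels come in as integers from 0 to 255.
x = x_train.astype(np.float32)

x_unit = x / 255.0                     # scale to [0, 1]
x_signed = x / 127.5 - 1.0             # or scale to [-1, 1]
x_standard = (x - x.mean()) / x.std()  # or zero mean, unit standard deviation
```
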
00:44:48.960 | The thing that enabled
00:44:54.160 | a lot of breakthrough performances in the past few years
00:44:57.760 | is batch normalization.
00:44:59.080 | It's performing this kind of same normalization
00:45:01.040 | later on in the network,
00:45:02.800 | looking at the inputs to the hidden layers
00:45:07.800 | and normalizing based on the batch of data
00:45:10.600 | which you're training,
00:45:12.000 | normalized based on the mean and the standard deviation.
00:45:15.000 | And batch renormalization
00:45:18.920 | fixes a few of the challenges of batch normalization, which is that,
00:45:23.880 | given that you're normalizing during the training
00:45:27.640 | on the mini-batches in the training dataset,
00:45:31.880 | that doesn't directly map to the inference stage
00:45:34.160 | in the testing.
00:45:35.280 | And so it allows by keeping a running average,
00:45:39.320 | it across both training and testing,
00:45:43.600 | you're able to asymptotically approach
00:45:45.760 | a global normalization.
00:45:47.360 | So there's this idea across all the weights,
00:45:49.900 | not just the inputs,
00:45:50.740 | across all the weights,
00:45:51.560 | you normalize the data
00:45:56.240 | at all the levels of abstraction that you're forming.
00:45:58.720 | And batch renorm solves a lot of these problems
00:46:01.120 | during inference.
00:46:01.960 | And there's a lot of other ideas
00:46:03.320 | from layer to weight to instance normalization
00:46:05.800 | to group normalization.
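
A minimal sketch of batch normalization as a Keras layer applied to the inputs of a hidden layer (the layer sizes are illustrative; the layer keeps running statistics for use at inference time):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128),
    tf.keras.layers.BatchNormalization(),  # normalize the hidden-layer inputs per mini-batch
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
```
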
00:46:07.480 | And you can play with a lot of these ideas
00:46:09.120 | in the TensorFlow playground,
00:46:11.480 | on playground.tensorflow.org.
00:46:13.320 | And I highly recommend.
00:46:15.120 | So now let's run through a bunch of different ideas,
00:46:18.880 | some of which we'll cover in future lectures.
00:46:22.920 | Of what is all of this in this world of deep learning,
00:46:25.500 | from computer vision to deep reinforcement learning,
00:46:28.060 | to the different small level techniques
00:46:30.280 | to the large natural language processing.
00:46:33.200 | So convolution neural networks,
00:46:34.700 | the thing that enables image classification.
00:46:37.760 | So these convolutional filters slide over the image
00:46:40.160 | and are able to take advantage
00:46:41.520 | of the spatial invariance of visual information
00:46:44.800 | that a cat in the top left corner
00:46:46.720 | is the same as features associated with cats
00:46:49.480 | in the top right corner and so on.
00:46:51.400 | Images are just a set of numbers
00:46:53.940 | and our task is to take that image
00:46:56.160 | and produce a classification
00:46:58.040 | and use the spatial invariance of visual information
00:47:02.760 | to make that,
00:47:03.800 | to slide a convolution filter across the image
00:47:08.520 | and learn that filter
00:47:09.880 | as opposed to assigning equal value to features
00:47:14.800 | that are present in various regions of the image.
00:47:18.560 | And stacked on top of each other,
00:47:19.840 | these convolution filters can form
00:47:22.440 | high level abstractions of visual information and images.
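
A hedged sketch of such a stack of convolutional filters in Keras, sized for MNIST-like 28x28 grayscale images (the filter counts are illustrative):

```python
import tensorflow as tf

# A small stack of convolutional filters that slide over the image,
# exploiting spatial invariance, followed by a classifier.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation='softmax')
])
```
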
00:47:28.260 | With AlexNet, as I've mentioned,
00:47:30.320 | and the ImageNet data set and challenge,
00:47:33.080 | captivating the world of what is possible
00:47:35.680 | with neural networks,
00:47:36.800 | have been further and further improved,
00:47:38.920 | superseding human performance
00:47:42.640 | with a special note,
00:47:45.000 | GoogLeNet with the inception module,
00:47:46.960 | there's different ideas that came along,
00:47:48.480 | ResNet with the residual blocks,
00:47:50.880 | and SENet most recently.
00:47:55.660 | So the object detection problem
00:47:59.280 | is a step, the next step in the visual recognition.
00:48:02.780 | So the image classification is just taking the entire image
00:48:05.280 | and saying what's in the image.
00:48:07.440 | Object detection localization is saying,
00:48:10.560 | find all the objects of interest in the scene
00:48:13.080 | and classify them.
00:48:14.080 | The region-based methods, like shown here,
00:48:17.480 | Fast R-CNN, takes the image,
00:48:20.000 | uses convolution neural network
00:48:21.400 | to extract features in that image
00:48:23.900 | and generate region proposals.
00:48:25.760 | Here's a bunch of candidates that you should look at.
00:48:27.880 | And within those candidates,
00:48:29.400 | it classifies what they are
00:48:31.000 | and generates four parameters,
00:48:33.360 | the bounding box,
00:48:34.980 | that thing that captures that thing.
00:48:38.660 | So object detection localization
00:48:40.560 | ultimately boils down to a bounding box,
00:48:43.220 | a rectangle with a class
00:48:46.140 | that's the most likely class
00:48:47.720 | that's in that bounding box.
00:48:49.940 | And you can really summarize region-based methods
00:48:53.940 | as you generate the region proposal,
00:48:56.340 | here a little pseudocode,
00:48:57.700 | and do a for loop over the region proposals
00:49:02.140 | and perform detection on that for loop.
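
Roughly, that pseudocode amounts to the loop below; `propose_regions` and `classify_region` are hypothetical stand-ins for the proposal and per-region detection networks, not real library calls:

```python
def region_based_detection(image):
    detections = []
    # Step 1: propose candidate regions worth looking at.
    for region in propose_regions(image):            # hypothetical proposal step
        # Step 2: for each candidate, classify it and refine its bounding box.
        label, box = classify_region(image, region)  # hypothetical detection step
        detections.append((label, box))
    return detections
```
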
00:49:05.660 | The single-shot methods remove the for loop.
00:49:10.600 | There's a single pass through,
00:49:13.020 | you add a bunch of,
00:49:14.140 | take a, for example,
00:49:15.700 | here shown SSD,
00:49:17.140 | take a pre-trained neural network
00:49:20.460 | that's been trained to do image classification,
00:49:22.860 | stack a bunch of convolutional layers on top,
00:49:25.100 | from each layer extract features
00:49:27.260 | that are then able to generate
00:49:28.780 | in a single pass classes,
00:49:31.820 | bounding boxes,
00:49:32.980 | bounding box predictions,
00:49:34.100 | and the classes associated with those bounding boxes.
00:49:36.780 | The trade-off here,
00:49:37.860 | and this is where the popular YOLO v1, v2, v3 come from.
00:49:42.340 | The trade-off here oftentimes
00:49:47.100 | is in performance and accuracy.
00:49:48.760 | So single-shot methods
00:49:52.140 | are often less performant,
00:49:54.700 | especially in terms of accuracy
00:49:56.980 | on objects that are really far away,
00:49:58.420 | or rather objects that are small in the image
00:50:00.260 | or really large.
00:50:02.360 | Then the next step up
00:50:05.520 | in visual perception,
00:50:06.680 | visual understanding,
00:50:07.860 | is semantic segmentation.
00:50:10.700 | That's where the tutorial that we presented here
00:50:12.660 | on GitHub is covering.
00:50:15.200 | Semantic segmentation is the task of now,
00:50:17.800 | as opposed to a bounding box,
00:50:19.200 | or classifying the entire image,
00:50:20.560 | or detecting the object as a bounding box,
00:50:22.880 | assigning at a pixel level
00:50:26.000 | the boundaries of what the object is.
00:50:28.920 | In full scene segmentation,
00:50:32.360 | classifying, for every single pixel,
00:50:35.000 | which class that pixel belongs to.
00:50:37.800 | And the fundamental aspect there,
00:50:39.440 | so we'll cover a little bit,
00:50:41.040 | or a lot more,
00:50:42.560 | on Wednesday,
00:50:43.960 | is taking an image classification network,
00:50:48.960 | chopping it off at some point,
00:50:52.160 | and then having,
00:50:53.400 | which is performing the encoding step
00:50:55.800 | of compressing a representation of the scene,
00:50:58.560 | and taking that representation
00:51:00.400 | with a decoder,
00:51:01.960 | up sampling in a dense way,
00:51:04.480 | so taking that representation
00:51:08.160 | and up sampling the pixel level classification.
00:51:12.200 | For that upsampling,
00:51:13.200 | there are a lot of interesting tricks
00:51:15.040 | that we'll talk through,
00:51:15.860 | but ultimately it boils down to the encoding step,
00:51:18.440 | forming a representation
00:51:19.720 | of what's going on in the scene,
00:51:20.960 | and then the decoding step,
00:51:22.840 | which upsamples that representation into a pixel-level
00:51:25.480 | classification of all the individual pixels.
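A minimal encoder-decoder segmentation sketch, under assumed input sizes and class counts (this is an illustration, not the course's GitHub tutorial): a pretrained classification backbone as the encoder, transposed convolutions as the decoder, and a per-pixel classification at the output.

```python
import tensorflow as tf

num_classes = 19  # assumed number of scene classes, purely for illustration

# Encoder: a pretrained classification network, chopped before its head.
encoder = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")

x = encoder.output                       # compressed 7x7 representation
for filters in [256, 128, 64, 32, 16]:   # decoder: upsample back to 224x224
    x = tf.keras.layers.Conv2DTranspose(
        filters, 3, strides=2, padding="same", activation="relu")(x)
logits = tf.keras.layers.Conv2D(num_classes, 1, padding="same")(x)

segnet = tf.keras.Model(encoder.input, logits)
segnet.compile(optimizer="adam",
               loss=tf.keras.losses.SparseCategoricalCrossentropy(
                   from_logits=True))
```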
00:51:28.280 | And as I mentioned here,
00:51:29.520 | the underlying idea applied most extensively,
00:51:32.280 | most successfully in computer vision,
00:51:34.300 | is transfer learning.
00:51:36.540 | The most commonly applied way of transfer learning
00:51:44.440 | is taking a pre-trained neural network,
00:51:46.400 | like ResNet,
00:51:48.520 | and chopping it off at some point,
00:51:51.000 | usually chopping off the fully connected
00:51:53.560 | layers, or some part of the later layers,
00:51:57.380 | and then taking
00:51:59.700 | a new dataset
00:52:02.880 | and retraining that network.
00:52:04.800 | So what is this useful for?
00:52:06.440 | For every single application
00:52:07.720 | in computer vision in industry,
00:52:09.600 | when you have a specific application,
00:52:11.520 | like you want to build a pedestrian detector.
00:52:16.600 | If you wanna build a pedestrian detector,
00:52:18.560 | and you have a pedestrian dataset,
00:52:20.320 | it's useful to take ResNet trained on ImageNet,
00:52:23.680 | or a network trained on COCO for general visual perception,
00:52:27.080 | take that network,
00:52:28.160 | chop off some of the layers,
00:52:29.560 | and then retrain on your specialized pedestrian dataset.
00:52:33.560 | And depending on how large that dataset is,
00:52:36.120 | some of the earlier layers
00:52:39.960 | from the pre-trained network
00:52:42.360 | should be kept fixed, frozen,
00:52:44.660 | and sometimes not,
00:52:46.060 | again depending on how large the data is.
00:52:48.680 | And this is extremely effective in computer vision,
00:52:52.080 | but also in audio, speech, and NLP.
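A minimal transfer-learning sketch along those lines, assuming a small, hypothetical pedestrian / not-pedestrian dataset: ResNet50 pre-trained on ImageNet with the fully connected head chopped off, earlier layers frozen, and a new head trained on the specialized data.

```python
import tensorflow as tf

# Pre-trained backbone with the classification head removed.
base = tf.keras.applications.ResNet50(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the pre-trained features (small dataset case)

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2, activation="softmax"),  # pedestrian vs. not
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# With more data, one could set base.trainable = True and fine-tune the
# later blocks with a smaller learning rate instead of keeping them frozen.
```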
00:52:55.480 | And so as I mentioned with the pre-trained networks,
00:53:00.480 | they are ultimately forming representations of the data
00:53:07.240 | based on which the classification,
00:53:08.520 | the regression,
00:53:09.880 | or the prediction is made.
00:53:10.820 | But the cleanest example of this
00:53:14.320 | is the autoencoder,
00:53:15.600 | which forms representations in an unsupervised way.
00:53:19.160 | The input is an image,
00:53:21.600 | and the output is that exact same image.
00:53:23.880 | So why do we do that?
00:53:25.320 | Well, if you add a bottleneck in the network,
00:53:29.680 | that is,
00:53:30.960 | if the network is narrower in the middle
00:53:35.960 | than it is on the inputs and the outputs,
00:53:39.640 | it's forced to compress the data down
00:53:41.600 | into a meaningful representation.
00:53:42.960 | That's what the autoencoder does.
00:53:45.040 | You're training it to reproduce the input at the output,
00:53:48.760 | and to reproduce it through a latent representation
00:53:51.600 | that is smaller than the original raw data.
00:53:54.240 | And that's a really powerful way to compress the data.
00:53:56.360 | It's used for removing noise and so on,
00:53:58.720 | but it's also just an effective way
00:54:00.340 | to demonstrate a concept.
00:54:03.180 | It can also be used for embeddings.
00:54:05.280 | When you have a huge amount of data
00:54:06.840 | and want to form a compressed,
00:54:11.840 | efficient representation of that data.
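A minimal bottleneck autoencoder sketch, assuming flattened 28x28 (MNIST-style) images purely for illustration: the target is the input itself, and the narrow middle layer is the compressed representation, i.e. the embedding.

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(784,))                    # flattened 28x28 image
encoded = tf.keras.layers.Dense(128, activation="relu")(inputs)
bottleneck = tf.keras.layers.Dense(32, activation="relu")(encoded)  # narrow middle
decoded = tf.keras.layers.Dense(128, activation="relu")(bottleneck)
outputs = tf.keras.layers.Dense(784, activation="sigmoid")(decoded)

autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
# Training: autoencoder.fit(x, x, ...) -- the target is the input itself.
# The encoder half, tf.keras.Model(inputs, bottleneck), gives the embedding.
```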
00:54:15.960 | Now, this is completely unsupervised.
00:54:18.240 | In practice,
00:54:19.720 | if you want to form an efficient,
00:54:24.240 | useful representation of the data,
00:54:26.520 | you usually want to train it in a supervised way.
00:54:31.960 | You want to train it on a discriminative task,
00:54:34.580 | where you have labeled data,
00:54:36.320 | and the network is trained to identify, say, cat versus dog.
00:54:39.960 | A network that's trained in that discriminative way,
00:54:42.640 | on annotated data, in a supervised learning way,
00:54:47.680 | is able to form better representations.
00:54:49.960 | But nevertheless, the concept stands.
00:54:51.560 | And one way to visualize these concepts
00:54:53.720 | is a tool that I really love,
00:54:56.360 | projector.tensorflow.org,
00:54:58.240 | which lets you visualize these different representations,
00:55:00.360 | these different embeddings.
00:55:01.680 | You should definitely play with it,
00:55:03.880 | and you can insert your own data.
00:55:05.760 | Okay, going further and further
00:55:07.520 | in this direction of unsupervised,
00:55:09.200 | and forming representations,
00:55:10.960 | are generative adversarial networks:
00:55:13.360 | from these representations,
00:55:14.600 | being able to generate new data.
00:55:16.440 | And the fundamental methodology of GANs,
00:55:21.440 | is to have two networks.
00:55:25.240 | One is the generator, one is the discriminator,
00:55:27.240 | and they compete against each other,
00:55:29.200 | in order for the generator,
00:55:31.600 | to get better and better and better,
00:55:34.900 | generating realistic images.
00:55:37.520 | The generator's task is, from noise,
00:55:40.360 | to generate images, based on a certain representation,
00:55:43.300 | that are realistic.
00:55:44.640 | And the discriminator is the critic
00:55:49.320 | that has to discriminate between real images
00:55:52.080 | and those generated by the generator.
00:55:54.280 | And both get better together.
00:55:56.800 | The generator gets better and better
00:55:58.560 | at generating realistic images
00:55:59.900 | to trick the discriminator.
00:56:02.120 | And the discriminator gets better and better,
00:56:04.400 | at telling the difference between real and fake,
00:56:08.840 | until the generator is able to generate,
00:56:12.480 | some incredible things.
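A minimal sketch of one GAN training step, under simplifying assumptions (flattened 28x28 images, a 100-dimensional noise vector, logits-based binary cross-entropy); real GAN training needs considerably more stabilization than shown here.

```python
import tensorflow as tf

generator = tf.keras.Sequential([
    tf.keras.Input(shape=(100,)),                    # noise vector
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(28 * 28, activation="tanh"),
])
discriminator = tf.keras.Sequential([
    tf.keras.Input(shape=(28 * 28,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(1),                        # real-vs-fake logit
])

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
g_opt = tf.keras.optimizers.Adam(1e-4)
d_opt = tf.keras.optimizers.Adam(1e-4)

def train_step(real_images):  # real_images: (batch, 784), scaled to [-1, 1]
    noise = tf.random.normal([tf.shape(real_images)[0], 100])
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake_images = generator(noise, training=True)
        real_logits = discriminator(real_images, training=True)
        fake_logits = discriminator(fake_images, training=True)
        # Discriminator: call real images real and generated images fake.
        d_loss = bce(tf.ones_like(real_logits), real_logits) + \
                 bce(tf.zeros_like(fake_logits), fake_logits)
        # Generator: fool the discriminator into calling fakes real.
        g_loss = bce(tf.ones_like(fake_logits), fake_logits)
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
```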
00:56:13.800 | So, shown here with work from NVIDIA,
00:56:17.040 | the ability to generate realistic faces
00:56:20.200 | has skyrocketed in the past three years.
00:56:25.000 | These are samples of celebrity photos
00:56:28.160 | that GANs have been able to generate;
00:56:29.200 | those are all generated by a GAN.
00:56:32.000 | There's the ability to generate
00:56:34.280 | temporally consistent video over time with GANs.
00:56:38.200 | And then there's the ability,
00:56:40.000 | shown at the bottom right, from NVIDIA,
00:56:41.800 | which I'm sure I'll also talk about,
00:56:44.800 | to go from semantic segmentation at the pixel level,
00:56:47.400 | so from the semantic pixel segmentation on the right,
00:56:52.240 | to generating the complete
00:56:54.800 | scene on the left,
00:56:57.160 | all the raw, rich, high-definition pixels on the left.
00:57:01.600 | The natural language processing world
00:57:07.080 | does the same: forming representations,
00:57:09.440 | forming embeddings, with
00:57:11.600 | word2vec,
00:57:15.080 | the ability to form representations from words
00:57:18.720 | that can then efficiently
00:57:20.520 | be used to reason about those words.
00:57:24.400 | The whole idea of forming representations of the data
00:57:27.520 | is taking a huge
00:57:29.440 | vocabulary of, say, a million words,
00:57:31.560 | and being able to map it into a space
00:57:34.320 | where words that are far apart from each other
00:57:38.120 | in a Euclidean sense,
00:57:41.480 | in Euclidean distance between words,
00:57:43.880 | are semantically far apart from each other as well.
00:57:47.440 | So things that are similar are together in that space.
00:57:50.680 | And one way of doing that, with skip-grams
00:57:53.680 | for example, is looking at a source text
00:57:56.960 | and turning a large body of text
00:58:00.320 | into a supervised learning problem,
00:58:02.240 | by learning to
00:58:04.240 | predict,
00:58:06.000 | from a particular word, all of its neighbors.
00:58:08.680 | So you train a network
00:58:10.400 | on the connections that are commonly seen
00:58:13.880 | in natural language,
00:58:15.120 | and based on those connections,
00:58:16.720 | you're able to know which words are related to each other.
00:58:19.880 | Now, the main thing here,
00:58:23.800 | and I won't get into too many details,
00:58:25.440 | is that you have an input vector
00:58:28.120 | representing the word,
00:58:29.560 | and an output vector representing the probability
00:58:32.920 | that other words are connected to it.
00:58:35.160 | But the main thing is that
00:58:36.320 | both are thrown away in the end;
00:58:37.920 | the main thing is the middle,
00:58:39.360 | the hidden layer.
00:58:40.520 | That low-dimensional representation gives you the embedding
00:58:43.800 | that represents these words in such a way
00:58:46.160 | that, in the Euclidean space,
00:58:47.560 | the ones that are close together
00:58:50.480 | are semantically together,
00:58:51.640 | and the ones that are far apart
00:58:53.200 | are semantically far apart.
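A toy skip-gram sketch along those lines (an assumption-level illustration, not the original word2vec implementation): generate (target, context) pairs from token-id sentences, train the target embedding against a softmax over the vocabulary, and keep only the embedding matrix at the end.

```python
import tensorflow as tf

vocab_size = 10000     # assumed vocabulary size
embedding_dim = 128    # size of the hidden layer / embedding
corpus = [[3, 17, 52, 9, 4], [8, 3, 91, 17]]  # toy sentences of token ids

# Build (target word, neighboring word) pairs within a small window.
pairs = []
for sentence in corpus:
    couples, _ = tf.keras.preprocessing.sequence.skipgrams(
        sentence, vocabulary_size=vocab_size, window_size=2,
        negative_samples=0.0)
    pairs.extend(couples)
targets = tf.constant([[t] for t, c in pairs])
contexts = tf.constant([c for t, c in pairs])

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Embedding(vocab_size, embedding_dim),  # the hidden layer we keep
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(vocab_size, activation="softmax"),  # thrown away later
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(targets, contexts, epochs=1, verbose=0)

word_vectors = model.layers[0].get_weights()[0]  # (vocab_size, embedding_dim)
```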
00:58:59.040 | Natural language
00:59:01.160 | and other sequence data,
00:59:03.360 | text, speech, audio, video,
00:59:05.160 | relies on recurrent neural networks.
00:59:09.320 | Recurrent neural networks are able to learn
00:59:11.800 | the temporal dynamics
00:59:13.200 | in the data,
00:59:16.080 | in sequence data,
00:59:18.680 | and are able to generate sequence data.
00:59:21.520 | The challenge is
00:59:22.640 | that they're not able to learn
00:59:25.200 | long-term context.
00:59:27.400 | Because a recurrent network
00:59:30.520 | is trained by unrolling it in time
00:59:32.960 | and doing backpropagation,
00:59:34.680 | without any tricks
00:59:36.280 | the backpropagated gradient
00:59:38.080 | fades away very quickly.
00:59:39.680 | So you're not able to
00:59:41.160 | memorize the context
00:59:42.440 | over longer sentences
00:59:44.560 | unless you use extensions.
00:59:47.120 | With LSTMs and GRUs,
00:59:50.040 | long-term dependency is captured by
00:59:52.880 | allowing the network to selectively
00:59:55.680 | forget information
00:59:58.120 | and to freely pass information through time.
01:00:02.840 | So it decides what to forget,
01:00:04.040 | what to remember,
01:00:05.360 | and, at every time step, what to output.
01:00:08.200 | And all of those aspects have gates
01:00:10.600 | that are all trainable,
01:00:12.920 | built from sigmoid and tanh functions.
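A minimal LSTM sketch, assuming a toy sentence-level classifier over token ids (e.g. sentiment), showing the gated recurrent layer that carries the longer-term context a vanilla RNN would lose.

```python
import tensorflow as tf

vocab_size, max_len = 10000, 50  # assumed vocabulary and sequence length

model = tf.keras.Sequential([
    tf.keras.Input(shape=(max_len,)),
    tf.keras.layers.Embedding(vocab_size, 64),
    tf.keras.layers.LSTM(128),            # gates decide what to forget/keep/output
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# tf.keras.layers.GRU(128) is a drop-in alternative with fewer gates.
```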
01:00:15.800 | Bidirectional
01:00:17.840 | recurrent neural networks,
01:00:20.480 | from the 90s, are an extension often used
01:00:22.880 | for providing context in both directions.
01:00:26.440 | Recurrent neural networks,
01:00:27.720 | simply defined
01:00:29.000 | in the vanilla way,
01:00:30.680 | learn representations of what happened in the past.
01:00:33.160 | Now, in many cases,
01:00:34.480 | when it's not a real-time operation,
01:00:35.680 | you're able to also
01:00:37.560 | look into the future,
01:00:39.640 | into the data that follows later in the sequence.
01:00:42.200 | So you benefit from doing a forward pass through the sequence
01:00:44.920 | beyond the current point,
01:00:46.400 | and then a backward pass.
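A minimal sketch of the bidirectional extension, under the same toy classifier assumptions: wrapping the recurrent layer so the sequence is processed both forward and backward before the outputs are combined.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(50,)),
    tf.keras.layers.Embedding(10000, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)),  # past + future context
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```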
01:00:47.360 | The encoder-decoder architecture
01:00:54.560 | in recurrent neural networks
01:00:56.680 | is used very much when the sequence on the input
01:00:59.040 | and the sequence on the output
01:01:00.400 | are not required to be of the same length.
01:01:03.360 | The task is to first,
01:01:07.080 | with the encoder network,
01:01:08.360 | encode everything that came in,
01:01:12.000 | everything on the input sequence.
01:01:13.560 | This is useful for machine translation, for example:
01:01:15.840 | encoding all the information
01:01:17.360 | in the input sequence in English,
01:01:18.840 | and then,
01:01:19.920 | for the language you're translating to,
01:01:22.920 | given that representation,
01:01:24.680 | keep feeding it into the decoder
01:01:26.880 | recurrent neural network
01:01:28.080 | to generate the translation.
01:01:29.840 | The input might be much smaller,
01:01:31.520 | or much larger than the output.
01:01:33.080 | That's the encoder decoder architecture.
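A minimal encoder-decoder (sequence-to-sequence) sketch with assumed vocabulary sizes and dimensions, following the standard Keras seq2seq pattern with teacher forcing at training time: the encoder compresses the input sequence into a state, and the decoder generates the output sequence from that state.

```python
import tensorflow as tf

src_vocab, tgt_vocab, latent_dim = 8000, 9000, 256  # assumed sizes

# Encoder: read the source sequence, keep only its final state.
enc_inputs = tf.keras.Input(shape=(None,))
enc_emb = tf.keras.layers.Embedding(src_vocab, latent_dim)(enc_inputs)
_, state_h, state_c = tf.keras.layers.LSTM(
    latent_dim, return_state=True)(enc_emb)

# Decoder: generate the target sequence, starting from the encoder state.
dec_inputs = tf.keras.Input(shape=(None,))
dec_emb = tf.keras.layers.Embedding(tgt_vocab, latent_dim)(dec_inputs)
dec_outputs = tf.keras.layers.LSTM(
    latent_dim, return_sequences=True)(dec_emb,
                                       initial_state=[state_h, state_c])
logits = tf.keras.layers.Dense(tgt_vocab)(dec_outputs)

seq2seq = tf.keras.Model([enc_inputs, dec_inputs], logits)
seq2seq.compile(optimizer="adam",
                loss=tf.keras.losses.SparseCategoricalCrossentropy(
                    from_logits=True))
```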
01:01:37.600 | And then there are improvements.
01:01:39.240 | Attention
01:01:43.760 | is the improvement on this encoder-decoder architecture
01:01:46.840 | that, as opposed to
01:01:48.360 | taking the input sequence,
01:01:51.240 | forming a single representation of it,
01:01:52.560 | and that's it,
01:01:53.520 | allows you to actually look back
01:01:54.920 | at different parts of the input,
01:01:57.160 | so you're not just relying on a
01:02:00.640 | single vector representation of
01:02:03.400 | the entire input.
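A minimal sketch of adding attention to that decoder, under the same assumed dimensions: instead of relying only on the final encoder state, every decoder step attends over all encoder time steps.

```python
import tensorflow as tf

latent_dim, src_vocab, tgt_vocab = 256, 8000, 9000  # assumed sizes

enc_inputs = tf.keras.Input(shape=(None,))
enc_emb = tf.keras.layers.Embedding(src_vocab, latent_dim)(enc_inputs)
enc_outputs, state_h, state_c = tf.keras.layers.LSTM(
    latent_dim, return_sequences=True, return_state=True)(enc_emb)

dec_inputs = tf.keras.Input(shape=(None,))
dec_emb = tf.keras.layers.Embedding(tgt_vocab, latent_dim)(dec_inputs)
dec_outputs = tf.keras.layers.LSTM(
    latent_dim, return_sequences=True)(dec_emb,
                                       initial_state=[state_h, state_c])

# Each decoder step (query) looks back over all encoder steps (values).
context = tf.keras.layers.Attention()([dec_outputs, enc_outputs])
combined = tf.keras.layers.Concatenate()([dec_outputs, context])
logits = tf.keras.layers.Dense(tgt_vocab)(combined)

attn_seq2seq = tf.keras.Model([enc_inputs, dec_inputs], logits)
```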
01:02:10.720 | A lot of excitement
01:02:12.040 | has been around the idea,
01:02:16.760 | as I mentioned,
01:02:17.600 | that some of the dream
01:02:19.160 | of artificial intelligence
01:02:20.240 | and machine learning in general
01:02:21.680 | has been to remove the human more
01:02:23.120 | and more and more from the picture,
01:02:25.160 | being able to automate
01:02:26.400 | some of the difficult tasks.
01:02:28.480 | So AutoML from Google,
01:02:30.320 | and just the general concept,
01:02:31.440 | of neural architecture search, NASNet.
01:02:36.120 | This is the ability to automate
01:02:37.880 | the discovery of
01:02:40.640 | the parameters of a neural network,
01:02:44.040 | and the ability to discover
01:02:47.320 | the actual architecture
01:02:49.080 | that produces the best result.
01:02:51.160 | So with neural architecture search,
01:02:53.840 | you have basic,
01:02:54.800 | basic modules,
01:02:57.200 | similar to the ResNet modules,
01:02:58.840 | and with a recurrent neural network,
01:03:02.240 | you keep assembling a network together
01:03:05.560 | and evaluating it,
01:03:06.400 | assembling it in such a way
01:03:07.920 | that it minimizes the loss
01:03:10.080 | on the overall classification task.
01:03:12.320 | And it's been shown that you can then construct
01:03:15.720 | a neural network that's much more efficient
01:03:18.480 | and much more accurate
01:03:19.640 | than the state of the art
01:03:21.480 | on classification tasks like ImageNet,
01:03:23.440 | here shown with a plot,
01:03:25.480 | or at the very least competitive
01:03:27.880 | with the state of the art,
01:03:29.240 | with SENet.
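As a toy stand-in for the idea of searching over architectures (an assumption-level sketch using plain random search, not Google's AutoML or NASNet, which use a recurrent controller trained with reinforcement learning): sample small candidate architectures from basic modules, train each briefly as a proxy, and keep the best one.

```python
import random
import tensorflow as tf

def build_candidate(num_blocks, filters):
    """Assemble a small candidate network from basic conv/pool modules."""
    model = tf.keras.Sequential([tf.keras.Input(shape=(32, 32, 3))])
    for _ in range(num_blocks):
        model.add(tf.keras.layers.Conv2D(filters, 3, padding="same",
                                         activation="relu"))
        model.add(tf.keras.layers.MaxPooling2D())
        filters *= 2
    model.add(tf.keras.layers.GlobalAveragePooling2D())
    model.add(tf.keras.layers.Dense(10, activation="softmax"))
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

def search(x_train, y_train, x_val, y_val, trials=10):
    best_acc, best_model = 0.0, None
    for _ in range(trials):
        model = build_candidate(num_blocks=random.choice([2, 3, 4]),
                                filters=random.choice([16, 32, 64]))
        model.fit(x_train, y_train, epochs=1, verbose=0)  # short proxy training
        _, acc = model.evaluate(x_val, y_val, verbose=0)
        if acc > best_acc:
            best_acc, best_model = acc, model
    return best_model  # the architecture returned to the user
```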
01:03:30.600 | It's super exciting
01:03:31.880 | that, as opposed to,
01:03:33.040 | like I said,
01:03:33.880 | stacking Lego pieces yourself,
01:03:35.640 | the final result
01:03:36.960 | is essentially that
01:03:38.080 | you step back
01:03:39.400 | and you say:
01:03:40.240 | here,
01:03:41.080 | I have a dataset
01:03:42.400 | with the labels,
01:03:44.080 | with the
01:03:45.040 | ground truth,
01:03:46.160 | which is what the dream
01:03:48.320 | of Google AutoML is:
01:03:49.920 | I have the dataset,
01:03:51.120 | you tell me
01:03:52.040 | what kind of neural network
01:03:53.520 | will do best on this dataset,
01:03:55.400 | and that's it.
01:03:56.240 | So all you bring is the data;
01:03:57.320 | it constructs the network
01:03:59.600 | through this neural architecture search
01:04:01.800 | and returns to you the model,
01:04:03.560 | and that's it.
01:04:04.400 | It makes it possible
01:04:05.400 | to solve,
01:04:07.520 | exceptionally well,
01:04:09.600 | many
01:04:12.640 | of the real-world problems
01:04:14.280 | that essentially boil down to:
01:04:15.640 | I have a few classes
01:04:16.720 | I need to be very accurate on,
01:04:18.480 | here's my dataset.
01:04:20.200 | And then it converts the problem
01:04:22.160 | of a deep learning researcher
01:04:24.000 | into the problem of what's traditionally,
01:04:26.480 | what's more commonly called,
01:04:27.960 | sort of a data science engineer,
01:04:30.880 | where the task,
01:04:32.480 | as I said,
01:04:33.480 | focuses on what is the right question
01:04:35.680 | and what is the right data to answer that question.
01:04:38.200 | And deep reinforcement learning
01:04:42.240 | takes further steps along the path
01:04:44.200 | of decreasing human input.
01:04:47.040 | Deep reinforcement learning is
01:04:49.000 | the task of an agent
01:04:50.840 | acting in the world based on
01:04:53.080 | the observations of the state
01:04:54.920 | and the rewards received in that state,
01:04:57.080 | knowing very little about the world
01:04:59.320 | and learning from the very sparse nature of the reward,
01:05:02.640 | sometimes only,
01:05:04.720 | in the gaming context,
01:05:06.160 | when you win or lose,
01:05:07.680 | or in the robotics context,
01:05:09.760 | when you successfully accomplish a task or not.
01:05:11.880 | With that very sparse reward,
01:05:13.720 | the agent is able to learn how to behave in that world.
01:05:16.360 | Here,
01:05:18.440 | with cats learning how the bell maps to the food,
01:05:21.600 | and a lot of the amazing work at OpenAI and DeepMind
01:05:24.920 | on robotic manipulation and navigation
01:05:29.920 | through self-play in simulated environments,
01:05:32.320 | and of course the best of all,
01:05:33.760 | our own deep reinforcement learning competition,
01:05:36.800 | DeepTraffic,
01:05:38.000 | which all of you can participate in,
01:05:40.280 | and which I encourage you to try to win,
01:05:42.440 | we see that with no supervised knowledge,
01:05:48.120 | no human supervision,
01:05:50.040 | through sparse rewards from the simulation
01:05:53.840 | or through self-play constructs, agents are
01:05:56.680 | able to learn how to operate successfully in this world.
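A minimal deep Q-learning sketch to make that loop concrete, under loud assumptions: `state_dim` and `num_actions` are made-up sizes, the transitions are assumed to come from any environment interaction (e.g. a Gym-style reset/step loop, not shown), and the refinements of real deep RL (replay buffers, target networks) are omitted.

```python
import numpy as np
import tensorflow as tf

state_dim, num_actions = 4, 2   # assumed sizes (CartPole-like toy setting)
gamma, epsilon = 0.99, 0.1      # discount factor and exploration rate

q_net = tf.keras.Sequential([
    tf.keras.Input(shape=(state_dim,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(num_actions),   # one Q-value per action
])
q_net.compile(optimizer="adam", loss="mse")

def act(state):
    """Epsilon-greedy action selection from the learned Q-values."""
    if np.random.rand() < epsilon:                 # explore
        return np.random.randint(num_actions)
    return int(np.argmax(q_net.predict(state[None], verbose=0)[0]))

def train_on_transition(state, action, reward, next_state, done):
    """Regress the chosen action's Q-value toward the one-step TD target."""
    target_q = q_net.predict(state[None], verbose=0)[0]
    future = 0.0 if done else np.max(q_net.predict(next_state[None], verbose=0)[0])
    target_q[action] = reward + gamma * future     # sparse reward drives learning
    q_net.fit(state[None], target_q[None], verbose=0)
```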
01:05:59.840 | And those are the steps we're taking
01:06:03.280 | towards general,
01:06:04.920 | towards artificial general intelligence.
01:06:07.000 | This is the exciting part:
01:06:08.960 | from the breakthrough ideas
01:06:12.280 | that we'll talk about on Wednesday
01:06:13.720 | in natural language processing,
01:06:15.480 | to generative adversarial networks,
01:06:17.600 | able to generate arbitrary data,
01:06:19.800 | high-resolution data,
01:06:21.160 | to really create data from this understanding of the world,
01:06:24.640 | to deep reinforcement learning,
01:06:26.120 | being able to learn how to act in the world
01:06:28.720 | with very little input from human supervision,
01:06:31.760 | the field is taking further and further steps,
01:06:33.720 | and there have been a lot of exciting ideas,
01:06:35.520 | going by different names,
01:06:36.880 | sometimes misused,
01:06:38.400 | sometimes overused,
01:06:40.000 | sometimes
01:06:43.640 | misinterpreted, of transfer learning,
01:06:47.200 | meta learning,
01:06:48.680 | and hyperparameter and architecture search,
01:06:51.000 | basically removing a human as much as possible,
01:06:54.080 | from the menial task,
01:06:55.840 | and involving the human only on the fundamental side,
01:06:58.560 | as I mentioned with the racing boat,
01:07:00.440 | on the ethical side,
01:07:01.920 | on the things that we humans
01:07:04.080 | at least pretend to be quite good at,
01:07:07.360 | which is understanding the fundamental big questions,
01:07:10.040 | understanding the data
01:07:11.720 | that empowers us to solve real-world problems,
01:07:14.520 | and understanding the ethical balance
01:07:16.400 | that needs to be struck in order to solve those problems well.
01:07:19.680 | And as I show on the bottom right,
01:07:23.080 | that's our job here in this room,
01:07:26.160 | the job of all the engineers in the world,
01:07:28.400 | to solve these problems,
01:07:29.960 | and progress forward,
01:07:31.440 | through the current summer,
01:07:32.960 | and through the winter if it ever comes,
01:07:35.360 | so with that I'd like to thank you,
01:07:37.480 | and you can get the videos,
01:07:39.080 | code and so on,
01:07:40.480 | online at deeplearning.mit.edu,
01:07:42.320 | thank you very much guys.
01:07:43.560 | (audience applauding)
01:07:46.720 | (upbeat music)