
Deep Learning Basics: Introduction and Overview


Chapters

0:00 Introduction
0:53 Deep learning in one slide
4:55 History of ideas and tools
9:43 Simple example in TensorFlow
11:36 TensorFlow in one slide
13:32 Deep learning is representation learning
16:02 Why deep learning (and why not)
22:00 Challenges for supervised learning
38:27 Key low-level concepts
46:15 Higher-level methods
66:00 Toward artificial general intelligence


00:00:00.000 | - Welcome everyone to 2019.
00:00:03.160 | It's really good to see everybody here,
00:00:05.280 | making it in the cold.
00:00:06.760 | This is 6.S094, Deep Learning for Self-Driving Cars.
00:00:10.920 | It is part of a series of courses on deep learning
00:00:17.040 | that we're running throughout this month.
00:00:19.880 | The website that you can get all the content,
00:00:22.240 | the videos, the lectures, and the code
00:00:23.940 | is deeplearning.mit.edu.
00:00:26.400 | The videos and slides will be made available there,
00:00:29.320 | along with a GitHub repository
00:00:31.560 | that's accompanying the course.
00:00:33.920 | Assignments for registered students
00:00:35.360 | will be emailed later on in the week.
00:00:39.400 | And you can always contact us with questions,
00:00:41.520 | concerns, comments at hcaihumancenteredai@mit.edu.
00:00:46.520 | So let's start through the basics, the fundamentals.
00:00:52.480 | To summarize in one slide, what is deep learning?
00:00:58.320 | It is a way to extract useful patterns from data
00:01:02.280 | in an automated way,
00:01:03.960 | with as little human effort involved as possible,
00:01:08.960 | hence the automated.
00:01:13.000 | The fundamental aspect that we'll talk about a lot
00:01:16.280 | is the optimization of neural networks.
00:01:18.940 | The practical nature that we'll provide through the code
00:01:22.360 | and so on is that there's libraries
00:01:27.660 | that make it accessible and easy
00:01:30.480 | to do some of the most powerful things in deep learning
00:01:34.620 | using Python, TensorFlow, and friends.
00:01:36.980 | The hard part always with machine learning
00:01:42.840 | and artificial intelligence in general
00:01:45.720 | is asking good questions and getting good data.
00:01:49.620 | A lot of times, the exciting aspects
00:01:51.840 | of what the news covers
00:01:54.320 | and a lot of the exciting aspects of what is published
00:01:58.120 | in the prestigious conferences, on arXiv,
00:02:01.300 | in a blog post is the methodology.
00:02:04.300 | The hard part is applying that methodology
00:02:07.600 | to solve real-world problems,
00:02:09.000 | to solve fascinating, interesting problems,
00:02:10.880 | and that requires data.
00:02:12.580 | That requires asking the right questions of that data,
00:02:15.940 | organizing that data,
00:02:18.100 | and labeling, selecting aspects of that data
00:02:21.520 | that can reveal the answers to the questions you ask.
00:02:24.660 | So why has this breakthrough happened over the past decade
00:02:30.780 | in the application of neural networks?
00:02:33.300 | The ideas of neural networks
00:02:34.620 | have been around since the 1940s,
00:02:36.020 | and ideas have been percolating even before.
00:02:40.120 | So what has happened, what has changed?
00:02:42.540 | The digitization of information, data,
00:02:48.620 | the ability to access data easily
00:02:51.320 | in a distributed fashion across the world,
00:02:53.440 | all kinds of problems have now a digital form
00:02:56.640 | that can be accessed by learning algorithms.
00:02:59.800 | Hardware, compute, both the Moore's Law of CPU
00:03:04.800 | and GPU and ASICs, Google's TPU systems,
00:03:10.000 | hardware that enables the efficient,
00:03:13.120 | effective, large-scale execution of these algorithms.
00:03:18.660 | Community, people here, people all over the world
00:03:22.340 | are being able to work together, to talk to each other,
00:03:25.160 | to feed the fire of excitement behind machine learning.
00:03:28.500 | GitHub and beyond.
00:03:31.240 | The tooling, as we'll talk about TensorFlow,
00:03:35.620 | PyTorch, and everything in between,
00:03:38.500 | that enables a person with an idea
00:03:45.620 | to reach a solution in less and less and less time.
00:03:50.620 | Higher and higher levels of abstraction
00:03:53.000 | empower people to solve problems in less and less time
00:03:57.120 | with less and less knowledge,
00:03:59.020 | where the idea and the data become the central point,
00:04:02.440 | not the effort that takes you from the idea to the solution.
00:04:06.420 | And there's been a lot of exciting progress,
00:04:09.720 | some of which we'll talk about,
00:04:11.000 | from face recognition to the general problem
00:04:13.840 | of scene understanding, image classification to speech,
00:04:17.320 | text, natural language processing, transcription,
00:04:20.520 | translation in medical applications and medical diagnosis.
00:04:25.040 | And cars, being able to solve many aspects of perception
00:04:29.720 | in autonomous vehicles with drivable area lane detection,
00:04:33.020 | object detection, digital assistants,
00:04:36.040 | the ones on your phone and beyond, the ones in your home.
00:04:40.800 | Ads, recommender systems, from Netflix to Search
00:04:44.600 | to Social, Facebook, and of course,
00:04:48.520 | the deep reinforcement learning successes
00:04:50.520 | in the playing of games,
00:04:52.120 | from board games to StarCraft and Dota.
00:04:54.680 | Let's take a step back.
00:05:00.040 | Deep learning is more than a set of tools
00:05:04.520 | to solve practical problems.
00:05:08.480 | Pamela McCorduck said in '79,
00:05:10.960 | "AI began with the ancient wish to forge the gods."
00:05:15.000 | Throughout our history, throughout our civilization,
00:05:18.560 | human civilization, we've dreamed about creating echoes
00:05:22.160 | of whatever is in this mind of ours in the machine
00:05:27.160 | and creating living organisms.
00:05:29.440 | From the popular culture in the 1800s
00:05:33.200 | with Frankenstein to Ex Machina,
00:05:35.400 | this vision, this dream of understanding intelligence
00:05:38.920 | and creating intelligence has captivated all of us.
00:05:41.480 | And deep learning is at the core of that
00:05:45.400 | because there's aspects of it, the learning aspects
00:05:48.920 | that captivate our imagination about what is possible,
00:05:52.080 | given data and methodology, what learning,
00:05:56.660 | learning to learn and beyond, how far that can take us.
00:06:03.000 | And here visualized is just 3% of the neurons
00:06:06.280 | and 1/1,000,000 of the synapses in our own brain.
00:06:10.720 | This incredible structure that's in our mind
00:06:13.280 | and there's only echoes of it,
00:06:15.000 | small shadows of it in our artificial neural networks
00:06:18.840 | that we're able to create, but nevertheless,
00:06:21.040 | those echoes are inspiring to us.
00:06:23.740 | The history of neural networks
00:06:28.160 | on this pale blue dot of ours
00:06:31.640 | started quite a while ago
00:06:34.360 | with summers and winters, with excitements
00:06:39.440 | and periods of pessimism,
00:06:42.160 | starting in the '40s with neural networks
00:06:44.040 | and the implementation of those neural networks
00:06:45.960 | as a perceptron in the '50s,
00:06:47.880 | with ideas of back propagation,
00:06:50.640 | restricted Boltzmann machines,
00:06:53.160 | recurrent neural networks in the '70s and '80s
00:06:55.800 | with convolutional neural networks
00:06:57.520 | and the MNIST dataset,
00:07:00.020 | with datasets beginning to percolate,
00:07:01.920 | with LSTMs, bidirectional RNNs in the '90s,
00:07:05.360 | and the rebranding and the rebirth of neural networks
00:07:09.280 | under the flag of deep learning
00:07:11.600 | and deep belief nets in 2006,
00:07:14.360 | the birth of ImageNet in 2009, the dataset
00:07:17.040 | on which the possibilities of what deep learning
00:07:21.600 | can bring to the world have first been illustrated
00:07:24.320 | in recent years,
00:07:27.920 | and AlexNet, the network that, on ImageNet,
00:07:30.940 | performed exactly that,
00:07:32.520 | with a few ideas like dropout
00:07:34.780 | that improved neural networks over time
00:07:36.340 | every year by year,
00:07:37.700 | improving the performance of neural networks.
00:07:39.740 | In 2014, the idea of GANs
00:07:43.620 | that Yann LeCun called the most exciting idea
00:07:47.020 | of the last 20 years,
00:07:48.540 | the generative adversarial networks,
00:07:50.140 | the ability to, with very little supervision,
00:07:52.840 | generate data, to generate ideas.
00:07:55.700 | After forming representation of those,
00:07:57.760 | from the understanding,
00:08:00.920 | from the high-level abstractions
00:08:02.440 | of what is extracted in the data,
00:08:04.400 | be able to generate new samples, create.
00:08:08.200 | The idea of being able to create
00:08:10.200 | as opposed to memorize is really exciting.
00:08:13.240 | And on the applied side,
00:08:14.980 | in 2014 with DeepFace,
00:08:18.060 | the ability to do face recognition.
00:08:20.120 | There's been a lot of breakthroughs
00:08:21.960 | on the computer vision front,
00:08:23.400 | that being one of them.
00:08:24.960 | The world was inspired,
00:08:29.140 | captivated in 2016 with AlphaGo
00:08:31.780 | and '17 with AlphaZero,
00:08:33.460 | beating with less and less and less effort
00:08:38.000 | the best players in the world at Go.
00:08:41.060 | The problem that, for most of the history
00:08:44.380 | of artificial intelligence,
00:08:45.380 | thought to be unsolvable.
00:08:47.700 | And new ideas with capsule networks.
00:08:49.620 | And this year is the year,
00:08:51.680 | 2018 was the year of natural language processing.
00:08:55.820 | A lot of interesting breakthroughs.
00:08:57.920 | Google's BERT and others that we'll talk about,
00:09:02.620 | breakthroughs on ability to understand language,
00:09:06.060 | understand speech, and everything,
00:09:09.340 | including generation, that's built all around that.
00:09:11.940 | And there's a parallel history of tooling,
00:09:16.520 | starting in the '60s with the Perceptron
00:09:18.860 | and the wiring diagrams.
00:09:20.700 | There, ending with this year,
00:09:23.540 | with PyTorch 1.0 and TensorFlow 2.0.
00:09:26.860 | These really solidified, exciting, powerful ecosystems
00:09:31.580 | of tools that enable you to do very,
00:09:34.900 | to do a lot with very little effort.
00:09:38.340 | The sky is the limit, thanks to the tooling.
00:09:41.040 | So let's then, from the big picture,
00:09:46.940 | take into the smallest.
00:09:49.540 | Everything should be made as simple as possible.
00:09:51.940 | So let's start simple, with a little piece of code,
00:09:57.580 | before we jump into the details
00:10:04.060 | and a big run-through of everything
00:10:06.260 | that is possible in deep learning.
00:10:08.820 | At the very basic level, with just a few lines of code,
00:10:12.420 | really six here, six little pieces of code,
00:10:16.640 | you can train a neural network
00:10:17.900 | to understand what's going on in an image.
00:10:20.580 | The classic, that I will always love, MNIST dataset,
00:10:24.380 | the handwriting digits, where the input
00:10:26.860 | to a neural network, a machine learning system,
00:10:28.800 | is the picture of a handwritten digit,
00:10:31.020 | and the output is the number that's in that digit.
00:10:33.980 | It's as simple as, in the first step,
00:10:38.180 | import the library, TensorFlow.
00:10:41.460 | Second step, import the dataset, MNIST.
00:10:45.940 | Third step, like Lego bricks, stack on top of each other,
00:10:50.580 | the neural network, layer by layer,
00:10:53.500 | with a hidden layer, an input layer, an output layer.
00:10:56.780 | Step four, train the model,
00:11:00.340 | as simple as a single line, model fit.
00:11:02.680 | Evaluate the model in step five, on the testing dataset,
00:11:07.820 | and that's it.
00:11:08.660 | In step six, you're ready to deploy.
00:11:10.420 | You're ready to predict what's in the image.
00:11:13.940 | It's as simple as that.
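
For reference, a minimal sketch of those six steps in TensorFlow/Keras might look like this (the layer sizes and hyperparameters here are illustrative, not necessarily the exact ones from the course repository):

```python
import tensorflow as tf  # step 1: import the library

# step 2: import the dataset (MNIST handwritten digits)
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

# step 3: stack the network layer by layer, like Lego bricks
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),   # input layer
    tf.keras.layers.Dense(128, activation='relu'),   # hidden layer
    tf.keras.layers.Dense(10, activation='softmax')  # output layer (10 digits)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# step 4: train the model
model.fit(x_train, y_train, epochs=5)

# step 5: evaluate the model on the testing dataset
model.evaluate(x_test, y_test)

# step 6: ready to deploy -- predict what's in an image
predictions = model.predict(x_test[:1])
```
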
00:11:15.300 | And much of this code, obviously, much more complicated,
00:11:19.620 | or much more elaborate, and rich, and interesting,
00:11:23.700 | and complex, we'll be making available on GitHub,
00:11:27.540 | on our repository that accompanies these courses.
00:11:30.300 | Today, we've released the first tutorial
00:11:32.180 | on driver-scene segmentation.
00:11:33.540 | I encourage everybody to go through it.
00:11:36.120 | And then, on the tooling side, in one slide,
00:11:41.500 | before we dive into the neural networks,
00:11:43.780 | and deep learning, the tooling side,
00:11:47.300 | amongst many other things,
00:11:48.540 | TensorFlow is a deep learning library,
00:11:50.740 | an open-source library from Google.
00:11:52.900 | The most popular one to date,
00:11:55.940 | the most active with a large ecosystem.
00:11:58.820 | It's not just something you import in Python,
00:12:02.520 | and to solve some basic problems.
00:12:04.520 | There's an entire ecosystem of tooling.
00:12:06.480 | There's different levels of APIs.
00:12:10.420 | Much of what we'll do in this course
00:12:12.460 | will be the highest-level API with Keras.
00:12:15.620 | But there's also the ability to run in the browser
00:12:17.980 | with TensorFlow.js, on the phone with TensorFlow Lite,
00:12:21.740 | in the cloud, without any need to have a computer,
00:12:26.580 | hardware, anything, any of the libraries
00:12:28.260 | set up on your own machine.
00:12:29.140 | You can run all the code that we're providing
00:12:31.720 | in the cloud with Google Colaboratory.
00:12:35.260 | And the optimized ASICs hardware
00:12:38.100 | that Google has optimized for TensorFlow
00:12:41.740 | with their TPU, Tensor Processing Unit,
00:12:44.140 | ability to visualize TensorBoard,
00:12:46.140 | models are provided in TensorFlow Hub.
00:12:48.900 | And there's just an entire ecosystem,
00:12:51.300 | including, most importantly, I think,
00:12:53.500 | documentation and blogs that make it
00:12:57.140 | extremely accessible to understand
00:13:01.280 | the fundamentals of the tooling
00:13:04.220 | that allow you to solve the problems
00:13:05.700 | from natural language processing,
00:13:06.820 | to computer vision, to GANs,
00:13:09.060 | generative adversarial neural networks,
00:13:10.620 | and everything in between,
00:13:13.420 | deep reinforcement learning, and so on.
00:13:15.400 | So that's why we're excited to work
00:13:19.940 | both in the theory in this course,
00:13:21.980 | in this series of lectures,
00:13:25.060 | and in the tooling and the applied side of TensorFlow.
00:13:28.300 | It really makes these ideas
00:13:30.460 | exceptionally accessible.
00:13:32.420 | So deep learning, at the core,
00:13:34.580 | is the ability to form higher and higher level
00:13:36.780 | of abstractions, of representations in data,
00:13:40.420 | in raw patterns, higher and higher levels
00:13:42.940 | of understanding of patterns.
00:13:44.600 | And those representations
00:13:48.740 | are extremely important
00:13:53.140 | and effective for being able to interpret data.
00:14:00.680 | Under certain representations,
00:14:03.980 | data is trivial to understand.
00:14:06.980 | Cat versus dog, blue dot versus green triangle.
00:14:11.300 | Under others, it's much more difficult.
00:14:14.620 | In this task, drawing a line under polar coordinates
00:14:19.060 | is trivial.
00:14:20.020 | Under Cartesian coordinates,
00:14:21.900 | it's very difficult, well, impossible to do accurately.
00:14:25.140 | And that's a trivial example of a representation.
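
A small sketch of that idea: two concentric rings of points cannot be separated by a straight line in Cartesian coordinates, but become separable by a single threshold once re-represented in polar coordinates (the data here is synthetic, purely for illustration):

```python
import numpy as np

# Two concentric rings of points: impossible to separate with a straight
# line in Cartesian (x, y) coordinates.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
r = np.concatenate([np.full(100, 1.0), np.full(100, 3.0)])  # inner vs. outer ring
x, y = r * np.cos(theta), r * np.sin(theta)

# The same data re-represented in polar coordinates: the two classes
# are now separated by a single threshold on the radius.
radius = np.sqrt(x ** 2 + y ** 2)
labels = radius > 2.0  # a trivial "line" in the new representation
```
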
00:14:28.060 | So our task with deep learning,
00:14:29.860 | with machine learning in general,
00:14:31.940 | is forming representations that map the topology,
00:14:35.380 | this, whatever the topology,
00:14:37.700 | the rich space of the problem that you're trying to deal
00:14:40.380 | with of the raw inputs,
00:14:42.580 | map it in such a way
00:14:44.260 | that the final representation is trivial to work with,
00:14:49.920 | trivial to classify,
00:14:51.540 | trivial to perform regression,
00:14:55.380 | trivial to generate new samples of that data.
00:14:58.220 | And that representation of higher and higher levels
00:15:00.460 | of representation is really the dream
00:15:03.820 | of artificial intelligence.
00:15:06.020 | That is what understanding is,
00:15:07.940 | making the complex simple,
00:15:10.700 | like Einstein said a few slides back.
00:15:14.380 | And that, with Juergen Schmidhuber,
00:15:19.140 | and whoever else said it, I don't know,
00:15:21.300 | that's been the dream of all of science in general,
00:15:26.840 | of the history of science
00:15:28.580 | is the history of compression progress,
00:15:30.900 | of forming simpler
00:15:32.940 | and simpler representations of ideas.
00:15:38.740 | The models of the universe of our solar system
00:15:44.980 | with the Earth at the center of it
00:15:47.340 | is much more complex to perform,
00:15:49.860 | to do physics on than a model
00:15:53.500 | where the sun is at the center.
00:15:56.260 | Those higher and higher levels of simple representations
00:16:00.060 | enable us to do extremely powerful things.
00:16:02.140 | That has been the dream of science
00:16:03.740 | and the dream of artificial intelligence.
00:16:05.780 | And why deep learning?
00:16:09.500 | What is so special about deep learning
00:16:12.380 | in the grander world of machine learning
00:16:14.340 | and artificial intelligence?
00:16:15.740 | It's the ability to more and more remove
00:16:21.020 | the input of human experts,
00:16:23.120 | remove the human from the picture,
00:16:25.020 | the costly, inefficient effort
00:16:27.300 | of human beings in the picture.
00:16:29.860 | Deep learning automates much of the extraction and
00:16:33.220 | gets us closer and closer to the raw data
00:16:37.180 | without the need of human involvement,
00:16:39.460 | human expert involvement,
00:16:40.820 | ability to form representations from the raw data
00:16:43.460 | as opposed to having a human being
00:16:45.540 | needing to extract features
00:16:47.720 | as was done in the 80s and 90s
00:16:50.420 | and the early aughts to extract features
00:16:53.580 | with which then the machine learning algorithms
00:16:55.520 | can work with.
00:16:56.360 | The automated extraction of features
00:16:58.420 | enables us to work with larger and larger datasets,
00:17:00.940 | removing the human completely
00:17:02.980 | except from the supervision labeling step at the very end.
00:17:07.020 | It doesn't require the human expert.
00:17:09.060 | But at the same time,
00:17:12.340 | there is limits to our technologies.
00:17:18.340 | There's always a balance between excitement
00:17:22.100 | and disillusionment.
00:17:23.700 | The Gartner hype cycle
00:17:26.460 | as much as we don't like to think about it
00:17:31.460 | applies to almost every single technology.
00:17:33.940 | Of course, the magnitude of the peaks
00:17:35.380 | and the troughs is different.
00:17:36.740 | But I would say we're at the peak
00:17:40.220 | of inflated expectations with deep learning.
00:17:43.700 | And that's something we have to think about
00:17:45.180 | as we talk about some of the ideas
00:17:46.500 | and exciting possibilities of the future.
00:17:48.540 | And with self-driving cars,
00:17:51.540 | which we'll talk about in future lectures in this course,
00:17:54.040 | we're at the same point.
00:17:55.300 | In fact, we're a little bit beyond the peak.
00:17:57.780 | And so it's up to us.
00:17:59.740 | This is MIT and the engineers
00:18:02.300 | and the people working on this in the world
00:18:04.380 | to carry us through the trough,
00:18:07.380 | to carry us through the future
00:18:09.640 | as the ups and downs of the excitement progresses forward
00:18:14.640 | into the plateau of productivity.
00:18:18.040 | Why else not deep learning?
00:18:22.900 | If we look at real world applications,
00:18:25.260 | especially with humanoid robotics,
00:18:29.600 | robotic manipulation,
00:18:31.100 | and even, yes, autonomous vehicles,
00:18:34.740 | majority of the aspects of the autonomous vehicles
00:18:37.440 | do not involve to an extensive amount
00:18:40.600 | machine learning today.
00:18:41.940 | The problems are not formulated as data-driven learning.
00:18:46.260 | Instead, they're model-based optimization methods
00:18:49.500 | that don't learn from data over time.
00:18:51.980 | And then from the speakers these couple of weeks,
00:18:54.980 | we'll get to see how much machine learning
00:18:57.500 | is starting to creep in.
00:18:59.180 | But in the example shown here,
00:19:01.540 | with the amazing humanoid robotics of Boston Dynamics,
00:19:04.560 | to date, almost no machine learning has been used
00:19:09.300 | except for trivial perception.
00:19:11.700 | The same with autonomous vehicles.
00:19:13.580 | Almost no machine learning, deep learning
00:19:15.180 | has been used except with perception.
00:19:18.800 | Some aspect of enhanced perception
00:19:20.940 | from the visual texture information.
00:19:22.740 | Plus, what's becoming, what's starting to be used
00:19:27.260 | a little bit more is the use of recurrent neural networks
00:19:32.260 | to predict the future,
00:19:36.020 | to predict the intent of the different players in the scene
00:19:41.020 | in order to anticipate what the future is.
00:19:43.220 | But these are very early steps.
00:19:44.960 | Most of the success that you see today,
00:19:46.860 | the 10 million miles that Waymo has achieved,
00:19:50.340 | has been attributed mostly to non-machine learning methods.
00:19:54.580 | Why else not deep learning?
00:19:58.580 | Here's a really clean example of unintended consequences.
00:20:03.700 | Of ethical issues we have to really think about.
00:20:11.640 | When an algorithm learns from data
00:20:14.540 | based on an objective function, a loss function,
00:20:17.820 | the power, the consequences of an algorithm
00:20:22.820 | that optimizes that function is not always obvious.
00:20:25.820 | Here's an example of a human player
00:20:28.380 | playing the game of Coast Runners with a,
00:20:31.740 | it's a boat racing game where the task is to go
00:20:34.580 | around the racetrack and try to win the race.
00:20:38.280 | And the objective is to get as many points as possible.
00:20:42.620 | There are three ways to get points.
00:20:44.640 | The finishing time, how long it took you to finish.
00:20:47.340 | The finishing position, where you were in the ranking.
00:20:50.980 | And picking up quote unquote turbos,
00:20:54.220 | those little green things along the way
00:20:56.220 | that give you points.
00:20:57.820 | Okay, simple enough.
00:20:59.180 | So we design an agent, in this case an RL agent,
00:21:02.700 | that optimizes for the rewards.
00:21:06.460 | And what we find on the right here,
00:21:10.220 | the optimal, the agent discovers that the optimal
00:21:13.140 | actually has nothing to do with finishing the race
00:21:15.500 | or the ranking.
00:21:16.840 | That you can get much more points
00:21:19.220 | by just focusing on the turbos
00:21:20.920 | and collecting those little green dots
00:21:23.960 | because they regenerate.
00:21:25.300 | So you go in circles over and over and over,
00:21:27.400 | slamming into the wall, collecting the green turbos.
00:21:32.060 | Now that's a very clear example of a well-reasoned,
00:21:37.060 | well-formulated objective function
00:21:41.280 | that has totally unexpected consequences.
00:21:43.980 | At least without sort of considering
00:21:47.620 | those consequences ahead of time.
00:21:49.260 | And so that shows the need for AI safety
00:21:52.060 | for a human in the loop of machine learning.
00:21:55.740 | That's why not deep learning exclusively.
00:21:57.860 | The challenge of deep learning algorithms,
00:22:05.780 | of deep learning applied,
00:22:07.280 | is to ask the right question
00:22:10.320 | and understand what the answers mean.
00:22:13.100 | You have to take a step back
00:22:15.060 | and look at the difference,
00:22:19.480 | the distinction, the levels,
00:22:23.500 | degrees of what the algorithm is accomplishing.
00:22:25.440 | For example, image classification
00:22:27.580 | is not necessarily scene understanding.
00:22:30.220 | In fact, it's very far from scene understanding.
00:22:33.540 | Classification may be very far from understanding.
00:22:36.760 | And the data sets vary drastically
00:22:41.660 | across the different benchmarks and the data sets used.
00:22:45.120 | The professionally done photographs
00:22:47.140 | versus synthetically generated images
00:22:49.860 | versus real world data.
00:22:52.440 | And the real world data is where the big impact is.
00:22:56.040 | So oftentimes the one doesn't transfer to the other.
00:22:59.660 | That's the challenge of deep learning.
00:23:01.560 | Solving all of these problems
00:23:04.500 | of different lighting variations,
00:23:05.820 | of pose variation, inter-class variation,
00:23:07.940 | all the things that we take for granted as human beings
00:23:10.340 | with our incredible perception system,
00:23:12.220 | all have to be solved in order to gain
00:23:14.320 | greater and greater understanding of a scene.
00:23:16.580 | And all the other things we have to close the gap on
00:23:20.300 | that we're not even close to yet.
00:23:22.420 | Here's an image from the Andrej Karpathy blog
00:23:25.140 | from a few years ago
00:23:26.620 | of former President Obama stepping on a scale.
00:23:30.620 | We can classify, we can do semantic segmentation
00:23:33.580 | of the scene, we can do object detection,
00:23:35.060 | we can do a little bit of 3D reconstruction
00:23:37.500 | from a video version of the scene.
00:23:39.140 | But what we can't do well
00:23:42.180 | is all the things we take for granted.
00:23:44.100 | We can't tell the images in the mirrors
00:23:46.180 | versus in reality as different.
00:23:50.000 | We can't deal with the sparsity of information.
00:23:52.880 | Just a few pixels on President Obama's face,
00:23:55.620 | we can still identify him as the president.
00:23:57.780 | The 3D structure of the scene,
00:24:02.100 | that there's a foot on top of a scale,
00:24:04.100 | that there's human beings behind from a single image,
00:24:08.660 | things we can trivially do using all the common sense
00:24:11.620 | semantic knowledge that we have, machines cannot do.
00:24:14.460 | The physics of the scene, that there's gravity.
00:24:16.900 | And the biggest thing, the hardest thing,
00:24:20.560 | is what's on people's minds.
00:24:22.600 | And what's on people's minds
00:24:23.820 | about what's on other people's minds, and so on.
00:24:27.380 | Mental models of the world,
00:24:29.260 | being able to infer what people are thinking about.
00:24:32.140 | Being able to infer,
00:24:33.900 | there's been a lot of exciting work here at MIT
00:24:35.700 | about what people are looking at.
00:24:38.260 | But we're not even close to solving that problem either.
00:24:40.500 | But what they're thinking about,
00:24:42.100 | we haven't even begun to really think about that problem.
00:24:46.500 | And we do it trivially as human beings.
00:24:48.960 | And I think at the core of that,
00:24:52.600 | I think I'm harboring on the visual perception problem,
00:24:55.860 | because it's one we take really for granted as human beings,
00:24:59.340 | especially when trying to solve real world problems,
00:25:01.220 | especially when trying to solve autonomous driving.
00:25:04.980 | We have 540 million years of data for visual perception,
00:25:08.860 | so we take it for granted.
00:25:10.700 | We don't realize how difficult it is.
00:25:12.740 | And we kind of focus all our attention
00:25:14.340 | on this recent development of 100,000 years
00:25:16.940 | of abstract thought, being able to play chess,
00:25:19.060 | being able to reason.
00:25:21.100 | But the visual perception is nevertheless
00:25:23.460 | extremely difficult.
00:25:25.660 | At every single layer of what's required
00:25:28.940 | to perceive, interpret, and understand
00:25:31.820 | the fundamentals of a scene.
00:25:34.260 | And a trivial way to show that
00:25:36.220 | is just all the ways you can mess
00:25:38.180 | with these image classification systems
00:25:40.200 | by adding a little bit of noise.
00:25:42.020 | The last few years, there's been a lot of papers,
00:25:44.760 | a lot of work to show that you can mess with these systems
00:25:49.060 | by adding noise here with 99% accuracy,
00:25:52.480 | predict a dog, add a little bit of distortion.
00:25:55.500 | Immediately the system predicts with 99% accuracy
00:25:59.100 | that it's an ostrich.
00:26:00.100 | And you can do that kind of manipulation
00:26:02.060 | with just a single pixel.
00:26:03.580 | So that's just a clean way to show
00:26:07.520 | the gap between image classification
00:26:10.020 | on an artificial data set like ImageNet
00:26:12.380 | and real world perception that has to be solved,
00:26:15.300 | especially for life critical situations
00:26:17.260 | like autonomous driving.
00:26:18.460 | I really like this Max Tegmark's visualization
00:26:26.800 | of this rising sea of the landscape of human competence
00:26:32.980 | from Hans Moravec.
00:26:34.580 | And this is the difference as we progress forward
00:26:40.860 | and we discuss some of these machine learning methods
00:26:44.260 | is there is the human intelligence,
00:26:48.020 | the general human intelligence,
00:26:50.140 | let's call Einstein here,
00:26:52.940 | that's able to generalize over all kinds of problems,
00:26:56.460 | over all kinds of from the common sense
00:26:58.860 | to the incredibly complex.
00:27:01.780 | And then there is the way we've been doing,
00:27:04.620 | especially data driven machine learning,
00:27:07.120 | which is savant-like, which is specialized intelligence,
00:27:11.740 | extremely smart at a particular task,
00:27:14.660 | but not being able to transfer
00:27:16.100 | except in the very narrow neighborhood
00:27:17.840 | on this little landscape of different,
00:27:20.420 | of art, cinematography, book writing at the peaks
00:27:23.460 | and chess, arithmetic and theorem proving
00:27:26.180 | and vision at the bottom in the lake.
00:27:29.740 | And there's this rising sea
00:27:31.020 | as we solve problem after problem,
00:27:33.100 | the question can the methodology
00:27:36.300 | and the approach of deep learning
00:27:38.540 | of everything we're doing now keep the sea rising
00:27:42.300 | or do fundamental breakthroughs have to happen
00:27:44.380 | in order to generalize and solve these problems.
00:27:47.780 | And so from the specialized where the successes are,
00:27:51.340 | the systems are essentially boiled down to
00:27:56.340 | given the data set and given the ground truth
00:27:59.360 | for that data set, here's the apartment cost
00:28:02.140 | in the Boston area, be able to input several parameters
00:28:06.460 | and based on those parameters, predict the apartment cost.
00:28:09.820 | That's the basic premise approach
00:28:14.340 | behind the successful supervised
00:28:18.100 | deep learning systems today.
00:28:19.620 | If you have good enough data,
00:28:21.820 | there's good enough ground truth
00:28:22.980 | and can be formalized, we can solve it.
00:28:26.240 | Some of the recent promise, on which we will do
00:28:30.980 | an entire series of lectures in the third week
00:28:33.220 | on deep reinforcement learning, showed
00:28:35.300 | that from raw sensory information
00:28:38.900 | with very little annotation through self-play
00:28:41.740 | where their systems learn without human supervision
00:28:46.740 | are able to perform extremely well
00:28:49.060 | in these constrained contexts.
00:28:50.860 | The question of a video game,
00:28:53.820 | here pong to pixels, being able to perceive
00:28:56.680 | the raw pixels of this pong game
00:28:59.800 | as raw input and learn the fundamental
00:29:04.060 | quote unquote physics of this game,
00:29:06.400 | understand how it is this game behaves
00:29:10.480 | and how to be able to win this game.
00:29:12.300 | That's kind of a step toward general purpose
00:29:14.940 | artificial intelligence, but it is a very small step
00:29:18.280 | because it's in a simulated, very trivial situation.
00:29:23.580 | That's the challenge that's before us.
00:29:26.180 | With less and less human supervision,
00:29:27.800 | will we be able to solve huge real world problems,
00:29:31.860 | from the top supervised learning
00:29:35.620 | where majority of the teaching is done by human beings
00:29:39.340 | throughout the annotation process
00:29:40.740 | through labeling all the data
00:29:42.220 | by showing different examples
00:29:43.860 | and further and further down to semi-supervised learning,
00:29:51.140 | reinforcement learning and unsupervised learning,
00:29:51.140 | removing the teacher from the picture.
00:29:53.260 | And making that teacher extremely efficient
00:29:56.420 | when it is needed.
00:29:57.340 | Of course, data augmentation is one way
00:30:02.460 | as we'll talk about.
00:30:03.980 | So taking a small number of examples
00:30:07.620 | and messing with that set of examples,
00:30:10.980 | augmenting that set of examples
00:30:12.780 | through trivial and through complex methods
00:30:15.580 | of cropping, stretching, shifting and so on
00:30:18.140 | including through generative networks,
00:30:20.140 | modifying those images to grow a small data set
00:30:22.600 | into a large one to minimize,
00:30:25.660 | to decrease further and further the input
00:30:28.380 | that's the input of the human teacher.
00:30:32.740 | But still, that's quite far away
00:30:34.980 | from the incredibly efficient both teaching
00:30:38.900 | and learning that humans do.
00:30:41.340 | This is a video and there's many of them online
00:30:46.340 | for the first time, a human baby walking.
00:30:51.960 | (video playing)
00:30:54.740 | We learned to do this, it's one shot learning.
00:30:59.060 | One day you're on all fours
00:31:04.100 | and the next day you put your two hands up
00:31:07.060 | and then you figure out the rest, one shot.
00:31:10.860 | Well, you can kind of, ish,
00:31:14.580 | you can kind of play around with it.
00:31:16.360 | But the point is you're extremely efficient.
00:31:19.220 | With only a few examples are able to learn
00:31:21.940 | the fundamental aspect of how to solve a particular problem.
00:31:24.940 | Machines in most cases need thousands, millions
00:31:31.060 | and sometimes more examples depending
00:31:32.980 | on the life critical nature of the application.
00:31:35.340 | The data flow of supervised learning systems
00:31:49.200 | is there's input data, there's a learning system
00:31:51.880 | and there is output.
00:31:53.320 | Now in the training stage for the output
00:31:56.280 | we have the ground truth.
00:31:57.880 | And so we use that ground truth to teach the system.
00:32:02.880 | In the testing stage, when it goes out into the wild
00:32:05.320 | there's new input data over which we have to generalize
00:32:07.520 | with the learning system and have to make our best guess.
00:32:10.680 | In the training stage, the processes with neural networks
00:32:15.680 | is given the input data for which we have the ground truth,
00:32:18.360 | pass it through the model, get the prediction
00:32:21.320 | and given that we have the ground truth
00:32:23.000 | we can compare the prediction to the ground truth,
00:32:25.280 | look at the error and based on the error adjust the weights.
00:32:28.480 | The types of predictions we can make
00:32:30.720 | is regression and classification.
00:32:32.680 | Regression is a continuous
00:32:34.240 | and classification is categorical.
00:32:37.000 | Here, if we look at weather, the regression problem says
00:32:42.920 | what is the temperature going to be tomorrow
00:32:46.000 | and the classification formulation of that problem
00:32:48.200 | says is it going to be hot or cold
00:32:50.320 | or some threshold definition of what hot or cold is.
00:32:53.520 | That's regression classification.
00:32:55.280 | On the classification front, it can be multi-class
00:32:58.560 | which is the standard formulation
00:33:01.000 | where you're tasked with saying
00:33:02.600 | a particular entity can only be one thing
00:33:08.320 | and then there's multi-label
00:33:09.840 | where a particular entity can be multiple things.
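
In code, the distinction usually shows up only in the output layer and the loss; a hedged Keras sketch (the class count of 10 is arbitrary):

```python
import tensorflow as tf

# Multi-class: each example belongs to exactly one of N classes,
# so the output layer is a softmax over the classes
# (paired with a categorical cross-entropy loss).
multi_class_head = tf.keras.layers.Dense(10, activation='softmax')

# Multi-label: each example can belong to several classes at once,
# so each output is an independent sigmoid
# (paired with a binary cross-entropy loss).
multi_label_head = tf.keras.layers.Dense(10, activation='sigmoid')
```
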
00:33:12.280 | And overall, the input to the system
00:33:16.480 | can be not just a single sample of the particular dataset
00:33:21.480 | and the output doesn't have to be a particular sample
00:33:24.880 | of the ground truth dataset.
00:33:26.720 | It can be a sequence, sequence to sequence,
00:33:29.680 | a single sample to a sequence,
00:33:31.480 | a sequence to sample and so on.
00:33:33.760 | From video captioning
00:33:37.120 | to translation to natural language generation
00:33:41.960 | to of course the one-to-one general computer vision.
00:33:45.760 | Okay, that's the bigger picture.
00:33:47.200 | Let's step back from the big to the small
00:33:49.760 | to a single neuron inspired by our own brain,
00:33:54.760 | the biological neural networks in our brain
00:33:58.400 | and the computational block
00:34:00.120 | that is behind a lot of the intelligence in our mind.
00:34:03.200 | The artificial neuron has inputs with weights on them
00:34:08.920 | plus a bias and an activation function and an output.
00:34:14.280 | It's inspired by this thing.
00:34:16.320 | As I showed it before,
00:34:17.480 | here visualizes the thalamic cortical system
00:34:20.280 | with three million neurons and 476 million synapses.
00:34:24.000 | The full brain has a hundred billion neurons
00:34:29.000 | and a thousand trillion synapses.
00:34:33.400 | ResNet and some of the other state-of-the-art networks
00:34:36.760 | have in tens, hundreds of millions of edges of synapses.
00:34:42.760 | The human brain has 10 million times more synapses
00:34:47.760 | than artificial neural networks
00:34:50.720 | and there's other differences.
00:34:52.360 | The topology is asynchronous
00:34:57.360 | and not constructed in layers.
00:35:00.840 | The learning algorithm for artificial neural networks
00:35:03.960 | is back propagation for our biological neurons
00:35:09.520 | and our biological networks we don't know.
00:35:12.960 | That's one of the mysteries of the human brain.
00:35:15.160 | There's ideas but we really don't know.
00:35:17.440 | The power consumption,
00:35:18.760 | human brains are much more efficient than neural networks.
00:35:21.200 | That's one of the problems that we're trying to solve
00:35:23.360 | and ASICs are starting to begin
00:35:25.920 | to solve some of these problems.
00:35:28.080 | And the stages of learning.
00:35:30.680 | In the biological neural networks,
00:35:32.040 | you really never stop learning.
00:35:33.840 | You're always learning, always changing
00:35:35.520 | both on the hardware and the software.
00:35:38.640 | In artificial neural networks,
00:35:40.840 | oftentimes there's a training stage,
00:35:42.640 | there's a distinct training stage
00:35:44.160 | and there's a distinct testing stage
00:35:45.840 | when you release the thing in the wild.
00:35:47.560 | Online learning is an exceptionally difficult thing
00:35:50.040 | that we're still in the very early stages of.
00:35:52.920 | This neuron takes a few inputs,
00:35:59.600 | the fundamental computational block behind neural networks,
00:36:02.840 | takes a few inputs, applies weights,
00:36:05.360 | which are the parameters that are learned,
00:36:07.240 | sums them up, puts it into a nonlinear activation function
00:36:10.960 | after adding the bias,
00:36:12.800 | also a learned parameter, and gives an output.
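
A minimal sketch of that computation in plain NumPy, with a sigmoid as the example activation (the input and weight values are arbitrary):

```python
import numpy as np

def neuron(x, w, b):
    """A single artificial neuron: weighted sum of the inputs plus a bias,
    passed through a nonlinear activation (here a sigmoid)."""
    z = np.dot(w, x) + b             # weights and bias are the learned parameters
    return 1.0 / (1.0 + np.exp(-z))  # activation squashes the sum into (0, 1)

output = neuron(x=np.array([0.5, -1.2, 3.0]),
                w=np.array([0.8, 0.1, -0.4]),
                b=0.2)
```
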
00:36:17.600 | And the task of this neuron is to get excited
00:36:20.360 | based on certain aspects of the layers,
00:36:22.720 | features, inputs that followed before.
00:36:25.960 | And in that ability to discriminate,
00:36:29.280 | get excited by certain things
00:36:30.960 | and get not excited by other things,
00:36:33.080 | hold a little piece of information
00:36:35.360 | of whatever level of abstraction it is.
00:36:37.400 | So when you combine many of them together,
00:36:39.640 | you have knowledge.
00:36:43.800 | Different levels of abstractions form a knowledge base
00:36:46.720 | that's able to represent, understand,
00:36:49.720 | or even act on a particular set of raw inputs.
00:36:53.640 | And you stack these neurons together in layers,
00:36:58.240 | both in width and depth, increasing further on,
00:37:02.000 | and there's a lot of different architectural variants,
00:37:05.240 | but they begin at this basic fact
00:37:08.240 | that with just a single hidden layer of a neural network,
00:37:11.680 | the possibilities are endless.
00:37:13.320 | It can approximate any arbitrary function.
00:37:15.720 | A neural network with a single hidden layer
00:37:20.160 | can approximate any function.
00:37:22.040 | That means any other neural network
00:37:23.640 | with multiple layers and so on
00:37:25.600 | is just interesting optimizations
00:37:29.920 | of how we can discover those functions.
00:37:33.840 | The possibilities are endless.
00:37:35.400 | And the other aspect here is the mathematical underpinnings
00:37:42.080 | of neural networks with the weights
00:37:45.480 | and the differentiable activation functions
00:37:47.840 | are such that in a few steps,
00:37:49.680 | from the inputs to the outputs,
00:37:51.560 | are deeply parallelizable.
00:37:57.120 | And that's why the other aspect on the compute,
00:38:00.960 | the parallelizability of neural networks
00:38:03.080 | is what enables some of the exciting advancements
00:38:07.000 | on the graphical processing unit,
00:38:10.040 | the GPUs and with ASICs, TPUs.
00:38:14.080 | The ability to run across machines,
00:38:17.840 | across GPU units,
00:38:19.400 | in a very large distributed scale
00:38:24.440 | to be able to train and perform inference on neural networks.
00:38:27.440 | Activation functions.
00:38:32.040 | These activation functions put together
00:38:34.160 | are tasked with optimizing a loss function.
00:38:38.480 | For regression, that loss function is
00:38:42.000 | mean squared error, usually.
00:38:45.120 | There's a lot of variants.
00:38:46.320 | And for classification, it's cross-entropy loss.
00:38:48.760 | In the cross-entropy loss, the ground truth is zero, one.
00:38:51.600 | In the mean squared error, it's real numbered.
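
As a rough illustration, the two losses written out in NumPy (the example values are made up):

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # regression: ground truth is real-valued
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(y_true, y_pred, eps=1e-12):
    # classification: ground truth is 0/1 (one-hot), prediction is a probability
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.sum(y_true * np.log(y_pred))

mse = mean_squared_error(np.array([21.5]), np.array([19.0]))  # e.g. tomorrow's temperature
ce = cross_entropy(np.array([0, 1]), np.array([0.3, 0.7]))    # e.g. cold vs. hot
```
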
00:39:00.840 | And so with the loss function,
00:39:02.160 | and the weights and the bias and the activation functions
00:39:04.920 | propagating forward through the network
00:39:06.800 | from the input to the output,
00:39:09.120 | using the loss function,
00:39:10.560 | we use the algorithm of backpropagation,
00:39:12.880 | I wish I did an entire lecture last time,
00:39:16.360 | to adjust the weights,
00:39:21.440 | to have the error flow backwards through the network
00:39:24.000 | and adjust the weights such that, once again,
00:39:27.720 | the weights that were responsible
00:39:30.680 | for producing the correct output are increased,
00:39:36.760 | and the weights that were responsible
00:39:39.040 | for producing the incorrect output were decreased.
00:39:42.680 | The forward pass gives you the error.
00:39:47.840 | The backward pass computes the gradients.
00:39:50.000 | And based on the gradients, the optimization algorithm,
00:39:52.960 | combined with a learning rate, adjusts the weights.
00:39:56.800 | The learning rate is how fast the network learns.
00:40:00.040 | And all of this is possible
00:40:01.720 | on the numerical computation side
00:40:04.480 | with automatic differentiation.
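
A hedged sketch of one training step in TensorFlow 2, using automatic differentiation to get the gradients and an optimizer with a learning rate to adjust the weights; `model` is assumed to be a Keras model like the earlier MNIST one:

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)  # the learning rate
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

def train_step(model, x_batch, y_batch):
    with tf.GradientTape() as tape:
        predictions = model(x_batch, training=True)  # forward pass
        loss = loss_fn(y_batch, predictions)         # error vs. ground truth
    # backward pass: gradients of the loss with respect to the weights
    gradients = tape.gradient(loss, model.trainable_variables)
    # adjust the weights based on the gradients and learning rate
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss
```
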
00:40:06.200 | The optimization problem,
00:40:09.560 | given those gradients that are computed
00:40:11.200 | in the backward flow through the network of the gradients,
00:40:16.200 | is stochastic gradient descent.
00:40:18.760 | There's a lot of variants of this optimization algorithm
00:40:21.040 | that solve various problems,
00:40:23.040 | from dying ReLUs to vanishing gradients.
00:40:26.360 | There's a lot of different parameters
00:40:29.080 | on momentum and so on that really just boil down
00:40:33.520 | to all the different problems that are solved
00:40:35.200 | with nonlinear optimization.
00:40:37.080 | Mini-batch size,
00:40:38.560 | what is the right size of a batch,
00:40:43.680 | or really it's called mini-batch
00:40:44.920 | when it's not the entire dataset,
00:40:47.360 | based on which to compute the gradients
00:40:50.800 | to adjust the learning?
00:40:52.760 | Do you do it over a very large amount,
00:40:55.920 | or do you do it with stochastic gradient descent
00:40:58.800 | for every single sample of the data?
00:41:00.640 | If you listen to Yann LeCun
00:41:03.360 | and a lot of recent literature,
00:41:04.680 | is small mini-batch sizes are good.
00:41:08.240 | He says, "Training with large mini-batches
00:41:10.240 | "is bad for your health.
00:41:11.680 | "More importantly, it's bad for your test error.
00:41:14.000 | "Friends don't let friends use mini-batches larger than 32."
00:41:18.480 | Larger batch size means more computational speed,
00:41:23.320 | 'cause you don't have to update the weights as often.
00:41:25.440 | But smaller batch size empirically
00:41:29.400 | produces better generalization.
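
In Keras this is just the `batch_size` argument to `model.fit`; a one-line sketch, reusing the `model` and data from the earlier MNIST example:

```python
# Mini-batch size of 32, in the spirit of the quote above: the weights are
# updated after every 32 samples rather than after the entire dataset.
model.fit(x_train, y_train, epochs=5, batch_size=32)
```
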
00:41:31.080 | The problem we're often on the broader scale of learning
00:41:38.920 | trying to solve is overfitting.
00:41:42.000 | And the way we solve it is through regularization.
00:41:45.480 | We want to train on a dataset
00:41:49.800 | without memorizing to an extent
00:41:52.520 | that you only do well in that trained dataset.
00:41:56.240 | So you want it to be generalizable into future,
00:41:58.880 | into the future things that you haven't seen yet.
00:42:02.800 | So obviously, this is a problem for small datasets
00:42:07.800 | and also for sets of parameters that you choose.
00:42:10.000 | Here shown an example of a sine curve
00:42:15.000 | trying to fit a particular data
00:42:17.280 | versus a ninth degree polynomial
00:42:19.200 | trying to fit a particular set of data with the blue dots.
00:42:22.800 | The ninth degree polynomial is overfitting.
00:42:25.560 | It does very well for that particular set of samples
00:42:28.240 | but does not generalize well in the general case.
00:42:31.560 | And the trade-off here is as you train further and further,
00:42:36.040 | at a certain point, there's a deviation
00:42:40.760 | between the error being decreased to zero
00:42:45.760 | on the training set and going to one on the test set.
00:42:51.040 | And that's the balance we have to strike.
00:42:53.400 | That's done with the validation set.
00:42:55.520 | So you take a piece of the training set
00:43:00.360 | for which you have the ground truth
00:43:02.120 | and you call it the validation set and you set it aside
00:43:04.680 | and you evaluate the performance of your system
00:43:06.920 | on that validation set.
00:43:09.080 | And after you notice that your trained network
00:43:14.080 | is performing poorly on the validation set
00:43:17.120 | for a prolonged period of time, that's when you stop.
00:43:19.600 | That's early stoppage.
00:43:20.960 | Basically it's getting better and better and better
00:43:22.560 | and then there is some period of time,
00:43:24.560 | there's always noise of course,
00:43:26.040 | and after some period of time, it's definitely getting worse.
00:43:29.560 | And that's, we need to stop there.
00:43:31.720 | So that provides an automated way
00:43:33.560 | to discovering when you need to stop.
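
A sketch of that validation-set-based early stopping with a Keras callback (the patience of 5 epochs and the 20% validation split are arbitrary choices):

```python
import tensorflow as tf

# Hold out part of the training set as a validation set and stop once the
# validation loss has not improved for a few epochs.
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                              patience=5,
                                              restore_best_weights=True)
model.fit(x_train, y_train,
          validation_split=0.2,   # the piece of the training set set aside
          epochs=100,
          callbacks=[early_stop])
```
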
00:43:35.600 | And there's a lot of other regularization methodologies.
00:43:38.560 | Of course, as I mentioned,
00:43:40.000 | dropout is a very interesting approach for,
00:43:43.960 | and its variance of simply
00:43:47.960 | with a certain kind of probability,
00:43:50.560 | randomly remove nodes in the network,
00:43:52.680 | both the incoming and outgoing edges,
00:43:56.320 | randomly throughout the training process.
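
In Keras, dropout is a layer you insert between other layers; a minimal sketch (the 0.5 drop probability is just a common default):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.5),  # randomly zero out 50% of activations during training
    tf.keras.layers.Dense(10, activation='softmax')
])
```
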
00:43:58.520 | And there's normalization.
00:44:01.200 | Normalization is obviously always applied at the input.
00:44:09.600 | So whenever you have a dataset
00:44:14.240 | as different lighting conditions, different variations,
00:44:17.440 | different sources and so on,
00:44:19.080 | you have to all kind of put it on the same level ground
00:44:21.960 | so that we're learning the fundamental aspects
00:44:23.960 | of the input data as opposed to
00:44:26.240 | some less relevant semantic information
00:44:30.080 | like lighting variations and so on.
00:44:31.560 | So we should usually always normalize, for example,
00:44:35.920 | if it's computer vision with pixels from zero to 255,
00:44:38.960 | you always normalize to zero to one or negative one to one
00:44:42.080 | or normalize based on the mean and the standard deviation.
00:44:46.280 | That's something you should almost always do.
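
A sketch of those three common input normalizations for pixel data, assuming `x_train` is the raw 0-255 MNIST array from the earlier example:

```python
import numpy as np

# Pixels come in as integers from 0 to 255.
x = x_train.astype(np.float32)

x_unit = x / 255.0                     # scale to [0, 1]
x_signed = x / 127.5 - 1.0             # or scale to [-1, 1]
x_standard = (x - x.mean()) / x.std()  # or zero mean, unit standard deviation
```
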
00:44:48.960 | The thing that enabled
00:44:54.160 | a lot of breakthrough performances in the past few years
00:44:57.760 | is batch normalization.
00:44:59.080 | It's performing this kind of same normalization
00:45:01.040 | later on in the network,
00:45:02.800 | looking at the inputs to the hidden layers
00:45:07.800 | and normalizing based on the batch of data
00:45:10.600 | which you're training,
00:45:12.000 | normalized based on the mean and the standard deviation.
00:45:15.000 | And batch renormalization
00:45:18.920 | fixes a few of the challenges of batch normalization, which is that,
00:45:23.880 | given that you're normalizing during the training
00:45:27.640 | on the mini-batches in the training dataset,
00:45:31.880 | that doesn't directly map to the inference stage
00:45:34.160 | in the testing.
00:45:35.280 | And so it allows by keeping a running average,
00:45:39.320 | it across both training and testing,
00:45:43.600 | you're able to asymptotically approach
00:45:45.760 | a global normalization.
00:45:47.360 | So there's this idea across all the weights,
00:45:49.900 | not just the inputs,
00:45:50.740 | across all the weights,
00:45:51.560 | you normalize the data
00:45:56.240 | at all the levels of abstraction that you're forming.
00:45:58.720 | And batch renorm solves a lot of these problems
00:46:01.120 | during inference.
00:46:01.960 | And there's a lot of other ideas
00:46:03.320 | from layer to weight to instance normalization
00:46:05.800 | to group normalization.
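
A minimal sketch of batch normalization as a Keras layer applied to the inputs of a hidden layer (the layer sizes are illustrative; the layer keeps running statistics for use at inference time):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128),
    tf.keras.layers.BatchNormalization(),  # normalize the hidden-layer inputs per mini-batch
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
```
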
00:46:07.480 | And you can play with a lot of these ideas
00:46:09.120 | in the TensorFlow playground,
00:46:11.480 | on playground.tensorflow.org.
00:46:13.320 | And I highly recommend.
00:46:15.120 | So now let's run through a bunch of different ideas,
00:46:18.880 | some of which we'll cover in future lectures.
00:46:22.920 | Of what is all of this in this world of deep learning,
00:46:25.500 | from computer vision to deep reinforcement learning,
00:46:28.060 | to the different small level techniques
00:46:30.280 | to the large natural language processing.
00:46:33.200 | So convolution neural networks,
00:46:34.700 | the thing that enables image classification.
00:46:37.760 | So these convolutional filters slide over the image
00:46:40.160 | and are able to take advantage
00:46:41.520 | of the spatial invariance of visual information
00:46:44.800 | that a cat in the top left corner
00:46:46.720 | is the same as features associated with cats
00:46:49.480 | in the top right corner and so on.
00:46:51.400 | Images are just a set of numbers
00:46:53.940 | and our task is to take that image
00:46:56.160 | and produce a classification
00:46:58.040 | and use the spatial invariance of visual information
00:47:02.760 | to make that,
00:47:03.800 | to slide a convolution filter across the image
00:47:08.520 | and learn that filter
00:47:09.880 | as opposed to assigning equal value to features
00:47:14.800 | that are present in various regions of the image.
00:47:18.560 | And stacked on top of each other,
00:47:19.840 | these convolution filters can form
00:47:22.440 | high level abstractions of visual information and images.
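
A hedged sketch of such a stack of convolutional filters in Keras, sized for MNIST-like 28x28 grayscale images (the filter counts are illustrative):

```python
import tensorflow as tf

# A small stack of convolutional filters that slide over the image,
# exploiting spatial invariance, followed by a classifier.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation='softmax')
])
```
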
00:47:28.260 | With AlexNet, as I've mentioned,
00:47:30.320 | and the ImageNet data set and challenge,
00:47:33.080 | captivating the world of what is possible
00:47:35.680 | with neural networks,
00:47:36.800 | have been further and further improved,
00:47:38.920 | superseding human performance
00:47:42.640 | with a special note,
00:47:45.000 | GoogLeNet with the inception module,
00:47:46.960 | there's different ideas that came along,
00:47:48.480 | ResNet with the residual blocks,
00:47:50.880 | and SENet most recently.
00:47:55.660 | So the object detection problem
00:47:59.280 | is a step, the next step in the visual recognition.
00:48:02.780 | So the image classification is just taking the entire image
00:48:05.280 | and saying what's in the image.
00:48:07.440 | Object detection localization is saying,
00:48:10.560 | find all the objects of interest in the scene
00:48:13.080 | and classify them.
00:48:14.080 | The region-based methods, like shown here,
00:48:17.480 | Fast R-CNN, takes the image,
00:48:20.000 | uses convolution neural network
00:48:21.400 | to extract features in that image
00:48:23.900 | and generate region proposals.
00:48:25.760 | Here's a bunch of candidates that you should look at.
00:48:27.880 | And within those candidates,
00:48:29.400 | it classifies what they are
00:48:31.000 | and generates four parameters,
00:48:33.360 | the bounding box,
00:48:34.980 | that thing that captures that thing.
00:48:38.660 | So object detection localization
00:48:40.560 | ultimately boils down to a bounding box,
00:48:43.220 | a rectangle with a class
00:48:46.140 | that's the most likely class
00:48:47.720 | that's in that bounding box.
00:48:49.940 | And you can really summarize region-based methods
00:48:53.940 | as you generate the region proposal,
00:48:56.340 | here a little pseudocode,
00:48:57.700 | and do a for loop over the region proposals
00:49:02.140 | and perform detection on that for loop.
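
Roughly, that pseudocode amounts to the loop below; `propose_regions` and `classify_region` are hypothetical stand-ins for the proposal and per-region detection networks, not real library calls:

```python
def region_based_detection(image):
    detections = []
    # Step 1: propose candidate regions worth looking at.
    for region in propose_regions(image):            # hypothetical proposal step
        # Step 2: for each candidate, classify it and refine its bounding box.
        label, box = classify_region(image, region)  # hypothetical detection step
        detections.append((label, box))
    return detections
```
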
00:49:05.660 | The single-shot methods remove the for loop.
00:49:10.600 | There's a single pass through,
00:49:13.020 | you add a bunch of,
00:49:14.140 | take a, for example,
00:49:15.700 | here shown SSD,
00:49:17.140 | take a pre-trained neural network
00:49:20.460 | that's been trained to do image classification,
00:49:22.860 | stack a bunch of convolutional layers on top,
00:49:25.100 | from each layer extract features
00:49:27.260 | that are then able to generate
00:49:28.780 | in a single pass classes,
00:49:31.820 | bounding boxes,
00:49:32.980 | bounding box predictions,
00:49:34.100 | and the classes associated with those bounding boxes.
00:49:36.780 | The trade-off here,
00:49:37.860 | and this is where the popular YOLO v1, v2, v3 come from.
00:49:42.340 | The trade-off here oftentimes
00:49:47.100 | is in performance and accuracy.
00:49:48.760 | So single-shot methods
00:49:52.140 | are often less performant,
00:49:54.700 | especially in terms of accuracy
00:49:56.980 | on objects that are really far away,
00:49:58.420 | or rather objects that are small in the image
00:50:00.260 | or really large.
00:50:02.360 | Then the next step up
00:50:05.520 | in visual perception,
00:50:06.680 | visual understanding,
00:50:07.860 | is semantic segmentation.
00:50:10.700 | That's where the tutorial that we presented here
00:50:12.660 | on GitHub is covering.
00:50:15.200 | Semantic segmentation is the task of now,
00:50:17.800 | as opposed to a bounding box,
00:50:19.200 | or classifying the entire image,
00:50:20.560 | or detecting the object as a bounding box,
00:50:22.880 | assigning at a pixel level
00:50:26.000 | the boundaries of what the object is.
00:50:28.920 | In full scene segmentation,
00:50:32.360 | classifying, for every single pixel,
00:50:35.000 | which class that pixel belongs to.
00:50:37.800 | And the fundamental aspect there,
00:50:39.440 | so we'll cover a little bit,
00:50:41.040 | or a lot more,
00:50:42.560 | on Wednesday,
00:50:43.960 | is taking an image classification network,
00:50:48.960 | chopping it off at some point,
00:50:52.160 | and then having,
00:50:53.400 | which is performing the encoding step
00:50:55.800 | of compressing a representation of the scene,
00:50:58.560 | and taking that representation
00:51:00.400 | with a decoder,
00:51:01.960 | up sampling in a dense way,
00:51:04.480 | so taking that representation
00:51:08.160 | and up sampling the pixel level classification.
00:51:12.200 | For that upsampling,
00:51:13.200 | there are a lot of interesting tricks
00:51:15.040 | that we'll talk through,
00:51:15.860 | but ultimately it boils down to the encoding step,
00:51:18.440 | forming a representation
00:51:19.720 | of what's going on in the scene,
00:51:20.960 | and then the decoding step,
00:51:22.840 | which upsamples that representation into a pixel-level
00:51:25.480 | classification of all the individual pixels.
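A minimal encoder-decoder segmentation sketch, under assumed input sizes and class counts (this is an illustration, not the course's GitHub tutorial): a pretrained classification backbone as the encoder, transposed convolutions as the decoder, and a per-pixel classification at the output.

```python
import tensorflow as tf

num_classes = 19  # assumed number of scene classes, purely for illustration

# Encoder: a pretrained classification network, chopped before its head.
encoder = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")

x = encoder.output                       # compressed 7x7 representation
for filters in [256, 128, 64, 32, 16]:   # decoder: upsample back to 224x224
    x = tf.keras.layers.Conv2DTranspose(
        filters, 3, strides=2, padding="same", activation="relu")(x)
logits = tf.keras.layers.Conv2D(num_classes, 1, padding="same")(x)

segnet = tf.keras.Model(encoder.input, logits)
segnet.compile(optimizer="adam",
               loss=tf.keras.losses.SparseCategoricalCrossentropy(
                   from_logits=True))
```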
00:51:28.280 | And as I mentioned here,
00:51:29.520 | the underlying idea applied most extensively,
00:51:32.280 | most successfully in computer vision,
00:51:34.300 | is transfer learning.
00:51:36.540 | The most commonly applied way of transfer learning
00:51:44.440 | is taking a pre-trained neural network,
00:51:46.400 | like ResNet,
00:51:48.520 | and chopping it off at some point,
00:51:51.000 | usually chopping off the fully connected
00:51:53.560 | layers, or some part of the later layers,
00:51:57.380 | and then taking
00:51:59.700 | a new dataset
00:52:02.880 | and retraining that network.
00:52:04.800 | So what is this useful for?
00:52:06.440 | For every single application
00:52:07.720 | in computer vision in industry,
00:52:09.600 | when you have a specific application,
00:52:11.520 | like you want to build a pedestrian detector.
00:52:16.600 | If you wanna build a pedestrian detector,
00:52:18.560 | and you have a pedestrian dataset,
00:52:20.320 | it's useful to take ResNet trained on ImageNet,
00:52:23.680 | or a network trained on COCO for general visual perception,
00:52:27.080 | take that network,
00:52:28.160 | chop off some of the layers,
00:52:29.560 | and then retrain on your specialized pedestrian dataset.
00:52:33.560 | And depending on how large that dataset is,
00:52:36.120 | some of the earlier layers
00:52:39.960 | from the pre-trained network
00:52:42.360 | should be kept fixed, frozen,
00:52:44.660 | and sometimes not,
00:52:46.060 | again depending on how large the data is.
00:52:48.680 | And this is extremely effective in computer vision,
00:52:52.080 | but also in audio, speech, and NLP.
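A minimal transfer-learning sketch along those lines, assuming a small, hypothetical pedestrian / not-pedestrian dataset: ResNet50 pre-trained on ImageNet with the fully connected head chopped off, earlier layers frozen, and a new head trained on the specialized data.

```python
import tensorflow as tf

# Pre-trained backbone with the classification head removed.
base = tf.keras.applications.ResNet50(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the pre-trained features (small dataset case)

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2, activation="softmax"),  # pedestrian vs. not
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# With more data, one could set base.trainable = True and fine-tune the
# later blocks with a smaller learning rate instead of keeping them frozen.
```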
00:52:55.480 | And so as I mentioned with the pre-trained networks,
00:53:00.480 | they are ultimately forming representations of the data
00:53:07.240 | based on which the classification,
00:53:08.520 | the regression,
00:53:09.880 | or the prediction is made.
00:53:10.820 | But the cleanest example of this
00:53:14.320 | is the autoencoder,
00:53:15.600 | which forms representations in an unsupervised way.
00:53:19.160 | The input is an image,
00:53:21.600 | and the output is that exact same image.
00:53:23.880 | So why do we do that?
00:53:25.320 | Well, if you add a bottleneck in the network,
00:53:29.680 | that is,
00:53:30.960 | if the network is narrower in the middle
00:53:35.960 | than it is on the inputs and the outputs,
00:53:39.640 | it's forced to compress the data down
00:53:41.600 | into a meaningful representation.
00:53:42.960 | That's what the autoencoder does.
00:53:45.040 | You're training it to reproduce the input at the output,
00:53:48.760 | and to reproduce it through a latent representation
00:53:51.600 | that is smaller than the original raw data.
00:53:54.240 | And that's a really powerful way to compress the data.
00:53:56.360 | It's used for removing noise and so on,
00:53:58.720 | but it's also just an effective way
00:54:00.340 | to demonstrate a concept.
00:54:03.180 | It can also be used for embeddings.
00:54:05.280 | When you have a huge amount of data
00:54:06.840 | and want to form a compressed,
00:54:11.840 | efficient representation of that data.
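A minimal bottleneck autoencoder sketch, assuming flattened 28x28 (MNIST-style) images purely for illustration: the target is the input itself, and the narrow middle layer is the compressed representation, i.e. the embedding.

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(784,))                    # flattened 28x28 image
encoded = tf.keras.layers.Dense(128, activation="relu")(inputs)
bottleneck = tf.keras.layers.Dense(32, activation="relu")(encoded)  # narrow middle
decoded = tf.keras.layers.Dense(128, activation="relu")(bottleneck)
outputs = tf.keras.layers.Dense(784, activation="sigmoid")(decoded)

autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
# Training: autoencoder.fit(x, x, ...) -- the target is the input itself.
# The encoder half, tf.keras.Model(inputs, bottleneck), gives the embedding.
```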
00:54:15.960 | Now, this is completely unsupervised.
00:54:18.240 | In practice,
00:54:19.720 | if you want to form an efficient,
00:54:24.240 | useful representation of the data,
00:54:26.520 | you usually want to train it in a supervised way.
00:54:31.960 | You want to train it on a discriminative task,
00:54:34.580 | where you have labeled data,
00:54:36.320 | and the network is trained to identify, say, cat versus dog.
00:54:39.960 | A network that's trained in that discriminative way,
00:54:42.640 | on annotated data, in a supervised learning way,
00:54:47.680 | is able to form better representations.
00:54:49.960 | But nevertheless, the concept stands.
00:54:51.560 | And one way to visualize these concepts
00:54:53.720 | is a tool that I really love,
00:54:56.360 | projector.tensorflow.org,
00:54:58.240 | which lets you visualize these different representations,
00:55:00.360 | these different embeddings.
00:55:01.680 | You should definitely play with it,
00:55:03.880 | and you can insert your own data.
00:55:05.760 | Okay, going further and further
00:55:07.520 | in this direction of unsupervised,
00:55:09.200 | and forming representations,
00:55:10.960 | are generative adversarial networks:
00:55:13.360 | from these representations,
00:55:14.600 | being able to generate new data.
00:55:16.440 | And the fundamental methodology of GANs,
00:55:21.440 | is to have two networks.
00:55:25.240 | One is the generator, one is the discriminator,
00:55:27.240 | and they compete against each other,
00:55:29.200 | in order for the generator,
00:55:31.600 | to get better and better and better,
00:55:34.900 | generating realistic images.
00:55:37.520 | The generator's task is, from noise,
00:55:40.360 | to generate images, based on a certain representation,
00:55:43.300 | that are realistic.
00:55:44.640 | And the discriminator is the critic
00:55:49.320 | that has to discriminate between real images
00:55:52.080 | and those generated by the generator.
00:55:54.280 | And both get better together.
00:55:56.800 | The generator gets better and better
00:55:58.560 | at generating realistic images
00:55:59.900 | to trick the discriminator.
00:56:02.120 | And the discriminator gets better and better,
00:56:04.400 | at telling the difference between real and fake,
00:56:08.840 | until the generator is able to generate,
00:56:12.480 | some incredible things.
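A minimal sketch of one GAN training step, under simplifying assumptions (flattened 28x28 images, a 100-dimensional noise vector, logits-based binary cross-entropy); real GAN training needs considerably more stabilization than shown here.

```python
import tensorflow as tf

generator = tf.keras.Sequential([
    tf.keras.Input(shape=(100,)),                    # noise vector
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(28 * 28, activation="tanh"),
])
discriminator = tf.keras.Sequential([
    tf.keras.Input(shape=(28 * 28,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(1),                        # real-vs-fake logit
])

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
g_opt = tf.keras.optimizers.Adam(1e-4)
d_opt = tf.keras.optimizers.Adam(1e-4)

def train_step(real_images):  # real_images: (batch, 784), scaled to [-1, 1]
    noise = tf.random.normal([tf.shape(real_images)[0], 100])
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake_images = generator(noise, training=True)
        real_logits = discriminator(real_images, training=True)
        fake_logits = discriminator(fake_images, training=True)
        # Discriminator: call real images real and generated images fake.
        d_loss = bce(tf.ones_like(real_logits), real_logits) + \
                 bce(tf.zeros_like(fake_logits), fake_logits)
        # Generator: fool the discriminator into calling fakes real.
        g_loss = bce(tf.ones_like(fake_logits), fake_logits)
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
```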
00:56:13.800 | So, shown here with work from NVIDIA,
00:56:17.040 | the ability to generate realistic faces
00:56:20.200 | has skyrocketed in the past three years.
00:56:25.000 | These are samples of celebrity photos
00:56:28.160 | that GANs have been able to generate;
00:56:29.200 | those are all generated by a GAN.
00:56:32.000 | There's the ability to generate
00:56:34.280 | temporally consistent video over time with GANs.
00:56:38.200 | And then there's the ability,
00:56:40.000 | shown at the bottom right, from NVIDIA,
00:56:41.800 | which I'm sure I'll also talk about,
00:56:44.800 | to go from semantic segmentation at the pixel level,
00:56:47.400 | so from the semantic pixel segmentation on the right,
00:56:52.240 | to generating the complete
00:56:54.800 | scene on the left,
00:56:57.160 | all the raw, rich, high-definition pixels on the left.
00:57:01.600 | The natural language processing world
00:57:07.080 | does the same: forming representations,
00:57:09.440 | forming embeddings, with
00:57:11.600 | word2vec,
00:57:15.080 | the ability to form representations from words
00:57:18.720 | that can then efficiently
00:57:20.520 | be used to reason about those words.
00:57:24.400 | The whole idea of forming representations of the data
00:57:27.520 | is taking a huge
00:57:29.440 | vocabulary of, say, a million words,
00:57:31.560 | and being able to map it into a space
00:57:34.320 | where words that are far apart from each other
00:57:38.120 | in a Euclidean sense,
00:57:41.480 | in Euclidean distance between words,
00:57:43.880 | are semantically far apart from each other as well.
00:57:47.440 | So things that are similar are together in that space.
00:57:50.680 | And one way of doing that, with skip-grams
00:57:53.680 | for example, is looking at a source text
00:57:56.960 | and turning a large body of text
00:58:00.320 | into a supervised learning problem,
00:58:02.240 | by learning to
00:58:04.240 | predict,
00:58:06.000 | from a particular word, all of its neighbors.
00:58:08.680 | So you train a network
00:58:10.400 | on the connections that are commonly seen
00:58:13.880 | in natural language,
00:58:15.120 | and based on those connections,
00:58:16.720 | you're able to know which words are related to each other.
00:58:19.880 | Now, the main thing here,
00:58:23.800 | and I won't get into too many details,
00:58:25.440 | is that you have an input vector
00:58:28.120 | representing the word,
00:58:29.560 | and an output vector representing the probability
00:58:32.920 | that other words are connected to it.
00:58:35.160 | But the main thing is that
00:58:36.320 | both are thrown away in the end;
00:58:37.920 | the main thing is the middle,
00:58:39.360 | the hidden layer.
00:58:40.520 | That low-dimensional representation gives you the embedding
00:58:43.800 | that represents these words in such a way
00:58:46.160 | that, in the Euclidean space,
00:58:47.560 | the ones that are close together
00:58:50.480 | are semantically together,
00:58:51.640 | and the ones that are far apart
00:58:53.200 | are semantically far apart.
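A toy skip-gram sketch along those lines (an assumption-level illustration, not the original word2vec implementation): generate (target, context) pairs from token-id sentences, train the target embedding against a softmax over the vocabulary, and keep only the embedding matrix at the end.

```python
import tensorflow as tf

vocab_size = 10000     # assumed vocabulary size
embedding_dim = 128    # size of the hidden layer / embedding
corpus = [[3, 17, 52, 9, 4], [8, 3, 91, 17]]  # toy sentences of token ids

# Build (target word, neighboring word) pairs within a small window.
pairs = []
for sentence in corpus:
    couples, _ = tf.keras.preprocessing.sequence.skipgrams(
        sentence, vocabulary_size=vocab_size, window_size=2,
        negative_samples=0.0)
    pairs.extend(couples)
targets = tf.constant([[t] for t, c in pairs])
contexts = tf.constant([c for t, c in pairs])

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Embedding(vocab_size, embedding_dim),  # the hidden layer we keep
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(vocab_size, activation="softmax"),  # thrown away later
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(targets, contexts, epochs=1, verbose=0)

word_vectors = model.layers[0].get_weights()[0]  # (vocab_size, embedding_dim)
```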
00:58:59.040 | Natural language
00:59:01.160 | and other sequence data,
00:59:03.360 | text, speech, audio, video,
00:59:05.160 | relies on recurrent neural networks.
00:59:09.320 | Recurrent neural networks are able to learn
00:59:11.800 | the temporal dynamics
00:59:13.200 | in the data,
00:59:16.080 | in sequence data,
00:59:18.680 | and are able to generate sequence data.
00:59:21.520 | The challenge is
00:59:22.640 | that they're not able to learn
00:59:25.200 | long-term context.
00:59:27.400 | Because a recurrent network
00:59:30.520 | is trained by unrolling it in time
00:59:32.960 | and doing backpropagation,
00:59:34.680 | without any tricks
00:59:36.280 | the backpropagated gradient
00:59:38.080 | fades away very quickly.
00:59:39.680 | So you're not able to
00:59:41.160 | memorize the context
00:59:42.440 | over longer sentences
00:59:44.560 | unless you use extensions.
00:59:47.120 | With LSTMs and GRUs,
00:59:50.040 | long-term dependency is captured by
00:59:52.880 | allowing the network to selectively
00:59:55.680 | forget information
00:59:58.120 | and to freely pass information through time.
01:00:02.840 | So it decides what to forget,
01:00:04.040 | what to remember,
01:00:05.360 | and, at every time step, what to output.
01:00:08.200 | And all of those aspects have gates
01:00:10.600 | that are all trainable,
01:00:12.920 | built from sigmoid and tanh functions.
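A minimal LSTM sketch, assuming a toy sentence-level classifier over token ids (e.g. sentiment), showing the gated recurrent layer that carries the longer-term context a vanilla RNN would lose.

```python
import tensorflow as tf

vocab_size, max_len = 10000, 50  # assumed vocabulary and sequence length

model = tf.keras.Sequential([
    tf.keras.Input(shape=(max_len,)),
    tf.keras.layers.Embedding(vocab_size, 64),
    tf.keras.layers.LSTM(128),            # gates decide what to forget/keep/output
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# tf.keras.layers.GRU(128) is a drop-in alternative with fewer gates.
```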
01:00:15.800 | Bidirectional
01:00:17.840 | recurrent neural networks,
01:00:20.480 | from the 90s, are an extension often used
01:00:22.880 | for providing context in both directions.
01:00:26.440 | Recurrent neural networks,
01:00:27.720 | simply defined
01:00:29.000 | in the vanilla way,
01:00:30.680 | learn representations of what happened in the past.
01:00:33.160 | Now, in many cases,
01:00:34.480 | when it's not a real-time operation,
01:00:35.680 | you're able to also
01:00:37.560 | look into the future,
01:00:39.640 | into the data that follows later in the sequence.
01:00:42.200 | So you benefit from doing a forward pass through the sequence
01:00:44.920 | beyond the current point,
01:00:46.400 | and then a backward pass.
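A minimal sketch of the bidirectional extension, under the same toy classifier assumptions: wrapping the recurrent layer so the sequence is processed both forward and backward before the outputs are combined.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(50,)),
    tf.keras.layers.Embedding(10000, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)),  # past + future context
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```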
01:00:47.360 | The encoder-decoder architecture
01:00:54.560 | in recurrent neural networks
01:00:56.680 | is used very much when the sequence on the input
01:00:59.040 | and the sequence on the output
01:01:00.400 | are not required to be of the same length.
01:01:03.360 | The task is to first,
01:01:07.080 | with the encoder network,
01:01:08.360 | encode everything that came in,
01:01:12.000 | everything on the input sequence.
01:01:13.560 | This is useful for machine translation, for example:
01:01:15.840 | encoding all the information
01:01:17.360 | in the input sequence in English,
01:01:18.840 | and then,
01:01:19.920 | for the language you're translating to,
01:01:22.920 | given that representation,
01:01:24.680 | keep feeding it into the decoder
01:01:26.880 | recurrent neural network
01:01:28.080 | to generate the translation.
01:01:29.840 | The input might be much smaller,
01:01:31.520 | or much larger than the output.
01:01:33.080 | That's the encoder decoder architecture.
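A minimal encoder-decoder (sequence-to-sequence) sketch with assumed vocabulary sizes and dimensions, following the standard Keras seq2seq pattern with teacher forcing at training time: the encoder compresses the input sequence into a state, and the decoder generates the output sequence from that state.

```python
import tensorflow as tf

src_vocab, tgt_vocab, latent_dim = 8000, 9000, 256  # assumed sizes

# Encoder: read the source sequence, keep only its final state.
enc_inputs = tf.keras.Input(shape=(None,))
enc_emb = tf.keras.layers.Embedding(src_vocab, latent_dim)(enc_inputs)
_, state_h, state_c = tf.keras.layers.LSTM(
    latent_dim, return_state=True)(enc_emb)

# Decoder: generate the target sequence, starting from the encoder state.
dec_inputs = tf.keras.Input(shape=(None,))
dec_emb = tf.keras.layers.Embedding(tgt_vocab, latent_dim)(dec_inputs)
dec_outputs = tf.keras.layers.LSTM(
    latent_dim, return_sequences=True)(dec_emb,
                                       initial_state=[state_h, state_c])
logits = tf.keras.layers.Dense(tgt_vocab)(dec_outputs)

seq2seq = tf.keras.Model([enc_inputs, dec_inputs], logits)
seq2seq.compile(optimizer="adam",
                loss=tf.keras.losses.SparseCategoricalCrossentropy(
                    from_logits=True))
```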
01:01:37.600 | And then there are improvements.
01:01:39.240 | Attention
01:01:43.760 | is the improvement on this encoder-decoder architecture
01:01:46.840 | that, as opposed to
01:01:48.360 | taking the input sequence,
01:01:51.240 | forming a single representation of it,
01:01:52.560 | and that's it,
01:01:53.520 | allows you to actually look back
01:01:54.920 | at different parts of the input,
01:01:57.160 | so you're not just relying on a
01:02:00.640 | single vector representation of
01:02:03.400 | the entire input.
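A minimal sketch of adding attention to that decoder, under the same assumed dimensions: instead of relying only on the final encoder state, every decoder step attends over all encoder time steps.

```python
import tensorflow as tf

latent_dim, src_vocab, tgt_vocab = 256, 8000, 9000  # assumed sizes

enc_inputs = tf.keras.Input(shape=(None,))
enc_emb = tf.keras.layers.Embedding(src_vocab, latent_dim)(enc_inputs)
enc_outputs, state_h, state_c = tf.keras.layers.LSTM(
    latent_dim, return_sequences=True, return_state=True)(enc_emb)

dec_inputs = tf.keras.Input(shape=(None,))
dec_emb = tf.keras.layers.Embedding(tgt_vocab, latent_dim)(dec_inputs)
dec_outputs = tf.keras.layers.LSTM(
    latent_dim, return_sequences=True)(dec_emb,
                                       initial_state=[state_h, state_c])

# Each decoder step (query) looks back over all encoder steps (values).
context = tf.keras.layers.Attention()([dec_outputs, enc_outputs])
combined = tf.keras.layers.Concatenate()([dec_outputs, context])
logits = tf.keras.layers.Dense(tgt_vocab)(combined)

attn_seq2seq = tf.keras.Model([enc_inputs, dec_inputs], logits)
```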
01:02:10.720 | A lot of excitement
01:02:12.040 | has been around the idea,
01:02:16.760 | as I mentioned,
01:02:17.600 | that some of the dream
01:02:19.160 | of artificial intelligence
01:02:20.240 | and machine learning in general
01:02:21.680 | has been to remove the human more
01:02:23.120 | and more and more from the picture,
01:02:25.160 | being able to automate
01:02:26.400 | some of the difficult tasks.
01:02:28.480 | So AutoML from Google,
01:02:30.320 | and just the general concept,
01:02:31.440 | of neural architecture search, NASNet.
01:02:36.120 | This is the ability to automate
01:02:37.880 | the discovery of
01:02:40.640 | the parameters of a neural network,
01:02:44.040 | and the ability to discover
01:02:47.320 | the actual architecture
01:02:49.080 | that produces the best result.
01:02:51.160 | So with neural architecture search,
01:02:53.840 | you have basic,
01:02:54.800 | basic modules,
01:02:57.200 | similar to the ResNet modules,
01:02:58.840 | and with a recurrent neural network,
01:03:02.240 | you keep assembling a network together
01:03:05.560 | and evaluating it,
01:03:06.400 | assembling it in such a way
01:03:07.920 | that it minimizes the loss
01:03:10.080 | on the overall classification task.
01:03:12.320 | And it's been shown that you can then construct
01:03:15.720 | a neural network that's much more efficient
01:03:18.480 | and much more accurate
01:03:19.640 | than the state of the art
01:03:21.480 | on classification tasks like ImageNet,
01:03:23.440 | here shown with a plot,
01:03:25.480 | or at the very least competitive
01:03:27.880 | with the state of the art,
01:03:29.240 | with SENet.
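As a toy stand-in for the idea of searching over architectures (an assumption-level sketch using plain random search, not Google's AutoML or NASNet, which use a recurrent controller trained with reinforcement learning): sample small candidate architectures from basic modules, train each briefly as a proxy, and keep the best one.

```python
import random
import tensorflow as tf

def build_candidate(num_blocks, filters):
    """Assemble a small candidate network from basic conv/pool modules."""
    model = tf.keras.Sequential([tf.keras.Input(shape=(32, 32, 3))])
    for _ in range(num_blocks):
        model.add(tf.keras.layers.Conv2D(filters, 3, padding="same",
                                         activation="relu"))
        model.add(tf.keras.layers.MaxPooling2D())
        filters *= 2
    model.add(tf.keras.layers.GlobalAveragePooling2D())
    model.add(tf.keras.layers.Dense(10, activation="softmax"))
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

def search(x_train, y_train, x_val, y_val, trials=10):
    best_acc, best_model = 0.0, None
    for _ in range(trials):
        model = build_candidate(num_blocks=random.choice([2, 3, 4]),
                                filters=random.choice([16, 32, 64]))
        model.fit(x_train, y_train, epochs=1, verbose=0)  # short proxy training
        _, acc = model.evaluate(x_val, y_val, verbose=0)
        if acc > best_acc:
            best_acc, best_model = acc, model
    return best_model  # the architecture returned to the user
```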
01:03:30.600 | It's super exciting
01:03:31.880 | that, as opposed to,
01:03:33.040 | like I said,
01:03:33.880 | stacking Lego pieces yourself,
01:03:35.640 | the final result
01:03:36.960 | is essentially that
01:03:38.080 | you step back
01:03:39.400 | and you say:
01:03:40.240 | here,
01:03:41.080 | I have a dataset
01:03:42.400 | with the labels,
01:03:44.080 | with the
01:03:45.040 | ground truth,
01:03:46.160 | which is what the dream
01:03:48.320 | of Google AutoML is:
01:03:49.920 | I have the dataset,
01:03:51.120 | you tell me
01:03:52.040 | what kind of neural network
01:03:53.520 | will do best on this dataset,
01:03:55.400 | and that's it.
01:03:56.240 | So all you bring is the data;
01:03:57.320 | it constructs the network
01:03:59.600 | through this neural architecture search
01:04:01.800 | and returns to you the model,
01:04:03.560 | and that's it.
01:04:04.400 | It makes it possible
01:04:05.400 | to solve,
01:04:07.520 | exceptionally well,
01:04:09.600 | many
01:04:12.640 | of the real-world problems
01:04:14.280 | that essentially boil down to:
01:04:15.640 | I have a few classes
01:04:16.720 | I need to be very accurate on,
01:04:18.480 | here's my dataset.
01:04:20.200 | And then it converts the problem
01:04:22.160 | of a deep learning researcher
01:04:24.000 | into the problem of what's traditionally,
01:04:26.480 | what's more commonly called,
01:04:27.960 | sort of a data science engineer,
01:04:30.880 | where the task,
01:04:32.480 | as I said,
01:04:33.480 | focuses on what is the right question
01:04:35.680 | and what is the right data to answer that question.
01:04:38.200 | And deep reinforcement learning
01:04:42.240 | takes further steps along the path
01:04:44.200 | of decreasing human input.
01:04:47.040 | Deep reinforcement learning is
01:04:49.000 | the task of an agent
01:04:50.840 | acting in the world based on
01:04:53.080 | the observations of the state
01:04:54.920 | and the rewards received in that state,
01:04:57.080 | knowing very little about the world
01:04:59.320 | and learning from the very sparse nature of the reward,
01:05:02.640 | sometimes only,
01:05:04.720 | in the gaming context,
01:05:06.160 | when you win or lose,
01:05:07.680 | or in the robotics context,
01:05:09.760 | when you successfully accomplish a task or not.
01:05:11.880 | With that very sparse reward,
01:05:13.720 | the agent is able to learn how to behave in that world.
01:05:16.360 | Here,
01:05:18.440 | with cats learning how the bell maps to the food,
01:05:21.600 | and a lot of the amazing work at OpenAI and DeepMind
01:05:24.920 | on robotic manipulation and navigation
01:05:29.920 | through self-play in simulated environments,
01:05:32.320 | and of course the best of all,
01:05:33.760 | our own deep reinforcement learning competition,
01:05:36.800 | DeepTraffic,
01:05:38.000 | which all of you can participate in,
01:05:40.280 | and which I encourage you to try to win,
01:05:42.440 | we see that with no supervised knowledge,
01:05:48.120 | no human supervision,
01:05:50.040 | through sparse rewards from the simulation
01:05:53.840 | or through self-play constructs, agents are
01:05:56.680 | able to learn how to operate successfully in this world.
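A minimal deep Q-learning sketch to make that loop concrete, under loud assumptions: `state_dim` and `num_actions` are made-up sizes, the transitions are assumed to come from any environment interaction (e.g. a Gym-style reset/step loop, not shown), and the refinements of real deep RL (replay buffers, target networks) are omitted.

```python
import numpy as np
import tensorflow as tf

state_dim, num_actions = 4, 2   # assumed sizes (CartPole-like toy setting)
gamma, epsilon = 0.99, 0.1      # discount factor and exploration rate

q_net = tf.keras.Sequential([
    tf.keras.Input(shape=(state_dim,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(num_actions),   # one Q-value per action
])
q_net.compile(optimizer="adam", loss="mse")

def act(state):
    """Epsilon-greedy action selection from the learned Q-values."""
    if np.random.rand() < epsilon:                 # explore
        return np.random.randint(num_actions)
    return int(np.argmax(q_net.predict(state[None], verbose=0)[0]))

def train_on_transition(state, action, reward, next_state, done):
    """Regress the chosen action's Q-value toward the one-step TD target."""
    target_q = q_net.predict(state[None], verbose=0)[0]
    future = 0.0 if done else np.max(q_net.predict(next_state[None], verbose=0)[0])
    target_q[action] = reward + gamma * future     # sparse reward drives learning
    q_net.fit(state[None], target_q[None], verbose=0)
```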
01:05:59.840 | And those are the steps we're taking
01:06:03.280 | towards general,
01:06:04.920 | towards artificial general intelligence.
01:06:07.000 | This is the exciting part:
01:06:08.960 | from the breakthrough ideas
01:06:12.280 | that we'll talk about on Wednesday
01:06:13.720 | in natural language processing,
01:06:15.480 | to generative adversarial networks,
01:06:17.600 | able to generate arbitrary data,
01:06:19.800 | high-resolution data,
01:06:21.160 | to really create data from this understanding of the world,
01:06:24.640 | to deep reinforcement learning,
01:06:26.120 | being able to learn how to act in the world
01:06:28.720 | with very little input from human supervision,
01:06:31.760 | the field is taking further and further steps,
01:06:33.720 | and there have been a lot of exciting ideas,
01:06:35.520 | going by different names,
01:06:36.880 | sometimes misused,
01:06:38.400 | sometimes overused,
01:06:40.000 | sometimes
01:06:43.640 | misinterpreted, of transfer learning,
01:06:47.200 | meta learning,
01:06:48.680 | and hyperparameter and architecture search,
01:06:51.000 | basically removing a human as much as possible,
01:06:54.080 | from the menial task,
01:06:55.840 | and involving the human only on the fundamental side,
01:06:58.560 | as I mentioned with the racing boat,
01:07:00.440 | on the ethical side,
01:07:01.920 | on the things that we humans
01:07:04.080 | at least pretend to be quite good at,
01:07:07.360 | which is understanding the fundamental big questions,
01:07:10.040 | understanding the data
01:07:11.720 | that empowers us to solve real-world problems,
01:07:14.520 | and understanding the ethical balance
01:07:16.400 | that needs to be struck in order to solve those problems well.
01:07:19.680 | And as I show on the bottom right,
01:07:23.080 | that's our job here in this room,
01:07:26.160 | the job of all the engineers in the world,
01:07:28.400 | to solve these problems,
01:07:29.960 | and progress forward,
01:07:31.440 | through the current summer,
01:07:32.960 | and through the winter if it ever comes,
01:07:35.360 | so with that I'd like to thank you,
01:07:37.480 | and you can get the videos,
01:07:39.080 | code and so on,
01:07:40.480 | online at deeplearning.mit.edu,
01:07:42.320 | thank you very much guys.
01:07:43.560 | (audience applauding)
01:07:46.720 | (upbeat music)