
MIT 6.S094: Deep Learning


Chapters

0:0 Introduction
8:14 Self-Driving Cars
14:20 Deep Learning


00:00:00.000 | Thank you everyone for braving the cold and the snow to be here.
00:00:06.200 | This is 6S094, Deep Learning for Self-Driving Cars.
00:00:12.600 | And it's a course where we cover the topic of deep learning,
00:00:18.800 | which is a set of techniques that have taken a leap in the last decade
00:00:24.400 | for our understanding of what artificial intelligence systems are capable of doing.
00:00:30.800 | And self-driving cars, which is systems that can take these techniques
00:00:36.600 | and integrate them in a meaningful profound way
00:00:41.200 | into our daily lives in a way that transforms society.
00:00:45.800 | So that's why both of these topics are extremely important and extremely exciting.
00:00:52.400 | My name is Lex Fridman and I'm joined by an amazing team of engineers
00:00:57.600 | in Jack Terwilliger, Julia Kindelsberger, Dan Brown, Michael Glazer,
00:01:04.000 | Li Ding, Spencer Dodd, and Benedikt Jenik, among many others.
00:01:10.800 | We build autonomous vehicles here at MIT.
00:01:14.400 | Not just ones that perceive and move about the environment,
00:01:19.600 | but ones that interact, communicate, and earn the trust and understanding
00:01:26.000 | of human beings inside the car, the drivers and the passengers,
00:01:29.600 | and the human beings outside the car,
00:01:32.600 | the pedestrians and other drivers and cyclists.
00:01:39.200 | The website for this course, selfdrivingcars.mit.edu
00:01:43.600 | If you have questions, email deepcars@mit.edu
00:01:48.000 | Slack deep-mit.
00:01:51.800 | For registered MIT students, you have to register on the website.
00:01:57.200 | And by midnight, Friday, January 19th, build a neural network
00:02:03.800 | and submit it to the competition that achieves the speed of 65 miles per hour
00:02:08.800 | on the new Deep Traffic 2.0.
00:02:11.600 | It's much harder and much more interesting than last year's
00:02:15.200 | for those of you who participated.
00:02:18.000 | There's three competitions in this class.
00:02:20.200 | Deep traffic, SegFuse, Deep Crash.
00:02:24.600 | There are guest speakers coming from Waymo, Google, Tesla,
00:02:31.800 | and from those starting new autonomous vehicle startups:
00:02:36.000 | Voyage, nuTonomy, and Aurora,
00:02:43.600 | which have been in the news a lot today from CES.
00:02:48.200 | And we have shirts.
00:02:50.800 | For those of you who brave the snow and continue to do so,
00:02:54.600 | towards the end of the class, there will be free shirts.
00:02:57.800 | Yes, I said free and shirts in the same sentence.
00:03:00.400 | You should be here.
00:03:03.000 | Okay, first, the Deep Traffic competition.
00:03:07.800 | There's a lot of updates and we'll cover those on Wednesday.
00:03:11.200 | It's a deep reinforcement learning competition.
00:03:13.800 | Last year, we received over 18,000 submissions.
00:03:18.200 | This year, we're going to go bigger.
00:03:23.000 | Not only can you control one car within your network,
00:03:26.000 | you can control up to 10.
00:03:27.800 | This is multi-agent deep reinforcement learning.
00:03:30.600 | This is super cool.
00:03:34.200 | Second, SegFuse, Dynamic Driving Scene Segmentation competition.
00:03:39.400 | Where you're given the raw video,
00:03:44.800 | the kinematics of the vehicle, so the movement of the vehicle,
00:03:49.800 | the state-of-the-art segmentation.
00:03:52.200 | For the training set, you're given ground truth labels,
00:03:55.800 | pixel level labels, scene segmentation,
00:03:58.600 | and optical flow.
00:04:00.400 | And with those pieces of data,
00:04:02.600 | you're tasked to try to perform better
00:04:05.200 | than the state-of-the-art in image-based segmentation.
00:04:10.400 | Why is this critical and fascinating
00:04:14.000 | in an open research problem?
00:04:17.000 | Because robots that act in this world,
00:04:21.800 | in the physical space, not only must interpret,
00:04:24.800 | use these deep learning methods to interpret
00:04:26.600 | the spatial visual characteristics of a scene.
00:04:29.200 | They must also interpret, understand, and track
00:04:32.200 | the temporal dynamics of the scene.
00:04:34.000 | This competition is about temporal propagation of information,
00:04:37.600 | not just scene segmentation.
00:04:40.400 | You must understand the space and time.
00:04:44.800 | And finally, Deep Crash.
00:04:48.800 | Where we use deep reinforcement learning
00:04:50.800 | to slam cars thousands of times,
00:04:53.600 | here at MIT at the gym.
00:04:57.400 | You're given data on a thousand runs,
00:05:01.400 | where a car knowing nothing is using a monocular camera
00:05:05.000 | as a single input, driving over 30 miles an hour,
00:05:08.400 | through a scene where it has very little control,
00:05:11.000 | very little capability to localize itself,
00:05:13.400 | and it must act very quickly.
00:05:15.200 | In that scene, you're given a thousand runs
00:05:17.600 | to learn anything.
00:05:21.400 | We'll discuss this in the coming weeks.
00:05:24.600 | For this competition, we evaluate everyone's submissions
00:05:29.600 | in simulation,
00:05:32.600 | but the top four submissions
00:05:34.400 | we put head to head at the gym.
00:05:36.600 | And until there is a winner declared,
00:05:39.000 | we keep slamming cars at 30 miles an hour.
00:05:44.000 | Deep Crash.
00:05:45.200 | And also on the website from last year,
00:05:48.000 | and on GitHub, there's DeepTesla,
00:05:50.800 | which uses the large-scale naturalistic driving data set
00:05:54.000 | we have to train a neural network to do end-to-end steering:
00:05:57.800 | it takes in monocular video from the forward roadway
00:06:01.000 | and produces steering commands,
00:06:03.000 | steering commands for the car.
00:06:07.000 | Lectures.
00:06:08.000 | Today we'll talk about deep learning.
00:06:10.000 | Tomorrow we'll talk about autonomous vehicles.
00:06:12.800 | Deep RL is on Wednesday.
00:06:15.800 | Driving scene understanding, so segmentation.
00:06:20.000 | That's Thursday.
00:06:22.000 | On Friday, we have Sacha Arnoud,
00:06:25.000 | the director of engineering at Waymo.
00:06:27.600 | Waymo is one of the companies,
00:06:29.400 | that's truly taking huge strides in fully autonomous vehicles.
00:06:33.600 | They're taking the fully L4, L5 autonomous vehicle approach,
00:06:37.400 | and it's fascinating to learn from him.
00:06:39.400 | He's also the head of perception there,
00:06:42.000 | so we get to learn from him
00:06:43.600 | what kind of problems they're facing
00:06:45.800 | and what kind of approach they're taking.
00:06:47.800 | We have Emilio Frazzoli,
00:06:49.600 | who one of last year's speakers,
00:06:51.600 | Sertac Karaman,
00:06:53.000 | said is the smartest person he knows.
00:06:55.800 | So Emilio Frazzoli is the CTO of nuTonomy,
00:06:58.200 | an autonomous vehicle company,
00:07:00.800 | that was just acquired by Delphi,
00:07:03.600 | for a large sum of money.
00:07:05.000 | And they're doing a lot of incredible work,
00:07:06.800 | in Singapore and here in Boston.
00:07:10.200 | Next Wednesday,
00:07:12.000 | we are going to talk about the topic of our research,
00:07:16.000 | and my personal fascination is deep learning,
00:07:19.000 | for driver state sensing,
00:07:20.600 | understanding the human,
00:07:21.800 | perceiving everything about the human being,
00:07:23.600 | inside the car and outside the car.
00:07:25.600 | One talk I'm really excited about,
00:07:29.600 | is Oliver Cameron on Thursday.
00:07:32.600 | He is now the CEO of the autonomous vehicle startup Voyage,
00:07:37.200 | and was previously the director
00:07:39.000 | of the self-driving car program at Udacity.
00:07:41.800 | He will talk about,
00:07:43.200 | how to start a self-driving car company.
00:07:46.400 | He said that for MIT folks
00:07:49.400 | and entrepreneurs,
00:07:50.600 | if you want to start one yourself,
00:07:52.200 | he'll tell you exactly how.
00:07:54.000 | It's super cool.
00:07:55.600 | And then Sterling Anderson,
00:07:57.800 | who was previously the director
00:08:01.200 | of the Tesla Autopilot team,
00:08:03.400 | and is now a co-founder of Aurora,
00:08:06.600 | the self-driving car startup that I mentioned,
00:08:11.200 | which has now partnered with NVIDIA and many others.
00:08:13.800 | So, why self-driving cars?
00:08:16.600 | This class is about applying,
00:08:18.600 | data-driven learning methods,
00:08:20.800 | to the problem of autonomous vehicles.
00:08:23.200 | Why self-driving cars are fascinating,
00:08:26.400 | and an interesting problem space.
00:08:28.400 | Quite possibly, in my opinion,
00:08:33.600 | this is the first wide-reaching,
00:08:35.800 | and profound integration of personal robots,
00:08:38.400 | in society.
00:08:40.200 | Wide-reaching, because there's one billion cars on the road,
00:08:43.400 | even a fraction of that,
00:08:45.000 | will change the face of transportation,
00:08:48.600 | and how we move about this world.
00:08:51.600 | Profound, and this is an important point,
00:08:54.800 | that's not always understood,
00:08:57.600 | is there's an intimate connection,
00:09:01.200 | between a human and a vehicle,
00:09:04.400 | when there's a direct transfer of control.
00:09:07.600 | It's a direct transfer of control
00:09:10.000 | that puts his or her life
00:09:13.000 | into the hands of an artificial intelligence system.
00:09:16.000 | I showed a few quick clips here,
00:09:20.200 | you can Google, first time with Tesla autopilot,
00:09:23.200 | on YouTube, and watch people,
00:09:25.400 | perform that transfer of control.
00:09:27.600 | There's something magical,
00:09:29.400 | about a human and a robot working together,
00:09:33.000 | that will transform,
00:09:35.200 | what artificial intelligence is,
00:09:37.000 | in the 21st century.
00:09:38.800 | And this particular autonomous system,
00:09:41.600 | AI system, self-driving cars,
00:09:44.400 | is at a scale,
00:09:46.600 | and of a life-critical nature,
00:09:49.000 | that is profound in a way that
00:09:51.000 | will truly test the capabilities of AI.
00:09:55.800 | There is a personal connection,
00:09:58.000 | and I will argue throughout these lectures
00:10:00.400 | that we cannot escape considering the human being.
00:10:03.600 | The autonomous vehicle must not only perceive
00:10:06.400 | and control its movement through the environment,
00:10:08.600 | it must also perceive everything,
00:10:10.000 | about the human driver and the passenger,
00:10:12.200 | and interact, communicate, and build trust,
00:10:14.400 | with that driver.
00:10:16.400 | Because, in my view,
00:10:24.400 | as I will argue throughout this course,
00:10:27.000 | an autonomous vehicle is more of a personal robot,
00:10:31.000 | than it is a perfect perception control system.
00:10:34.400 | Because perfect perception and control,
00:10:38.000 | through this world, full of humans,
00:10:41.000 | is extremely difficult,
00:10:43.200 | and could be two, three, four decades away.
00:10:46.800 | Full autonomy,
00:10:49.000 | autonomous vehicles are going to be flawed,
00:10:52.800 | they're going to have flaws,
00:10:54.800 | and we have to design systems
00:10:56.800 | that effectively
00:10:58.800 | transfer control to human beings
00:11:02.200 | when they can't handle the situation.
00:11:04.000 | And that transfer of control,
00:11:06.600 | is a fascinating opportunity for AI.
00:11:10.000 | Because, the obstacle avoidance,
00:11:16.600 | perception of obstacles,
00:11:19.600 | and obstacle avoidance, is the easy problem.
00:11:23.600 | It's the safe problem,
00:11:25.000 | going 30 miles an hour,
00:11:26.200 | navigating through streets of Boston, is easy.
00:11:30.600 | It's when you have to get to work,
00:11:33.400 | and you're late, or you're sick of the person in front of you,
00:11:37.200 | that you want to go in the opposing lane, and speed up.
00:11:41.400 | That's human nature, and we can't escape it.
00:11:44.600 | Our artificial intelligence systems,
00:11:47.400 | can't escape human nature, they must work with it.
00:11:50.800 | What's shown here, is one of the algorithms,
00:11:53.200 | we'll talk about next week, for cognitive load.
00:11:56.400 | Where we take the raw, 3D convolutional neural networks,
00:12:00.600 | take in the eye region, the blinking, and the pupil movement,
00:12:04.600 | to determine the cognitive load of the driver.
00:12:06.800 | We'll see how we can detect everything about the driver,
00:12:09.800 | where they're looking, emotion, cognitive load,
00:12:13.800 | body pose estimation, drowsiness.
00:12:18.600 | The movement towards full autonomy,
00:12:22.400 | is so difficult, I would argue,
00:12:25.600 | that it almost requires human level intelligence.
00:12:30.000 | The, as I said, two, three, four decade out
00:12:34.000 | journey for artificial intelligence researchers
00:12:37.400 | to achieve full autonomy will require
00:12:40.400 | solving some of the fundamental problems
00:12:43.400 | of creating intelligence.
00:12:46.600 | And, that's something we'll discuss in much more depth,
00:12:51.000 | in a broader view, in two weeks,
00:12:53.200 | for the artificial general intelligence course.
00:12:56.200 | Where we have Andrej Karpathy from Tesla,
00:12:58.200 | Ray Kurzweil, and Marc Raibert from Boston Dynamics,
00:13:03.200 | who asked for the dimensions of this room,
00:13:05.200 | because he's bringing robots.
00:13:08.400 | Nothing else was told to me.
00:13:11.400 | It'll be a surprise.
00:13:16.000 | So, that is why I argue for
00:13:17.000 | the human-centered artificial intelligence approach,
00:13:20.200 | which in every algorithmic design considers the human.
00:13:26.200 | For the autonomous vehicle on the left,
00:13:28.200 | the perception, scene understanding,
00:13:31.200 | and the control problem,
00:13:33.000 | as we'll explore through the competitions,
00:13:34.600 | and the assignments of this course,
00:13:37.000 | can handle 90, and increasing percent of the cases.
00:13:42.400 | But it's the 10, 1, 0.1% of the cases,
00:13:46.600 | as we get better and better,
00:13:48.400 | that we're not able to handle
00:13:51.400 | through these methods.
00:13:52.600 | And that's where the human,
00:13:53.800 | perceiving the human is really important.
00:13:55.800 | This is the video from last year,
00:13:58.200 | of Arc de Triomphe.
00:13:59.600 | Thank you, I didn't know it last year, I know now.
00:14:03.200 | That's, is one of millions of cases,
00:14:06.600 | where human to human interaction,
00:14:09.600 | is the dominant driver,
00:14:12.000 | not the basic perception control problem.
00:14:20.000 | So, why deep learning in this space?
00:14:23.600 | Because deep learning, is a set of methods,
00:14:28.200 | that do well from a lot of data.
00:14:31.400 | And to solve these problems,
00:14:33.400 | where human life is at stake,
00:14:35.600 | we have to be able to have techniques,
00:14:38.400 | that learn from data, learn from real world data.
00:14:41.800 | This is the fundamental reality,
00:14:44.000 | of artificial intelligence systems,
00:14:45.400 | that operate in the real world.
00:14:47.000 | They must learn from real world data.
00:14:49.800 | Whether that's on the left,
00:14:51.000 | for the perception, the control side.
00:14:54.600 | Or on the right, for the human,
00:14:57.000 | the perception and the communication,
00:14:59.000 | interaction and collaboration,
00:15:02.000 | with the human, and the human robot interaction.
00:15:06.600 | Okay, so what is deep learning?
00:15:13.000 | It's a set of techniques,
00:15:15.000 | if you allow me the definition,
00:15:16.800 | of intelligence being,
00:15:18.200 | the ability to accomplish complex goals.
00:15:21.400 | Then I would argue,
00:15:23.000 | definition of understanding,
00:15:25.000 | maybe reasoning,
00:15:27.200 | is the ability to turn complex information,
00:15:30.400 | into simple, useful, actionable information.
00:15:34.800 | And that is what deep learning does.
00:15:37.600 | Deep learning is representation learning,
00:15:40.600 | or feature learning if you will.
00:15:43.200 | It's able to take raw information,
00:15:46.200 | raw complicated information,
00:15:48.000 | that's hard to do anything with,
00:15:49.800 | and construct hierarchical representations,
00:15:52.400 | of that information,
00:15:53.800 | to be able to do something interesting with it.
00:15:56.800 | It is the branch of artificial intelligence,
00:15:59.400 | which is most capable and focused on this task.
00:16:04.000 | Forming representations from data,
00:16:06.200 | whether it's supervised or unsupervised,
00:16:08.400 | whether it's with the help of humans or not.
00:16:10.800 | It's able to construct structure,
00:16:14.600 | find structure in the data,
00:16:16.000 | such that you can extract,
00:16:18.800 | simple, useful, actionable information.
00:16:21.600 | On the left,
00:16:23.400 | from Ian Goodfellow's book,
00:16:26.400 | is the basic example of image classification.
00:16:30.800 | The input of the image,
00:16:34.000 | on the bottom with the raw pixels,
00:16:36.600 | and as we go up the stack,
00:16:38.400 | as we go up the layers,
00:16:39.800 | higher and higher order representations are formed.
00:16:42.800 | From edges to contours,
00:16:44.400 | to corners, to object parts,
00:16:46.400 | and then finally the full object,
00:16:48.200 | semantic classification of what's in the image.
00:16:51.600 | This is representation learning.
00:16:54.200 | A favorite example for me,
00:16:57.600 | is one from four centuries ago.
00:17:02.200 | Our place in the universe,
00:17:05.000 | and representing that place in the universe,
00:17:07.400 | whether it's relative to earth,
00:17:09.800 | or relative to the sun.
00:17:13.200 | On the left is our current belief,
00:17:16.800 | on the right is the one that is held widely,
00:17:20.600 | four centuries ago.
00:17:23.200 | Representation matters,
00:17:24.600 | because what's on the right,
00:17:26.600 | is much more complicated,
00:17:28.200 | than what's on the left.
00:17:31.200 | You can think of a simple case here,
00:17:36.600 | where the task is to draw a line
00:17:38.600 | that separates green triangles and blue circles.
00:17:41.000 | In the Cartesian coordinate space
00:17:43.000 | on the left, the task is much more difficult,
00:17:45.600 | impossible to do well.
00:17:47.600 | On the right, it's trivial,
00:17:49.400 | in polar coordinates.
00:17:51.400 | This transformation is exactly
00:17:53.800 | what we need to learn.
00:17:55.200 | This is representation learning.
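As a minimal illustrative sketch of that idea (not from the lecture), here two classes that no straight line separates in Cartesian coordinates become separable by a single threshold once re-represented by their radius, i.e. in polar coordinates:

```python
import numpy as np

# Two synthetic classes: points near the origin vs. points on a surrounding ring.
rng = np.random.default_rng(0)
r = np.concatenate([rng.uniform(0.0, 1.0, 200),    # class 0: small radius
                    rng.uniform(2.0, 3.0, 200)])   # class 1: large radius
theta = rng.uniform(0.0, 2 * np.pi, 400)
x, y = r * np.cos(theta), r * np.sin(theta)        # Cartesian: no straight line separates them
labels = np.concatenate([np.zeros(200), np.ones(200)])

# Re-represent each point by its radius (the polar-coordinate view).
radius = np.sqrt(x**2 + y**2)
# In this representation a single threshold (a "straight line") separates the classes perfectly.
predictions = (radius > 1.5).astype(float)
print("accuracy with the polar representation:", (predictions == labels).mean())  # 1.0
```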
00:17:57.800 | So you can take the same task,
00:17:59.000 | of having to draw a line,
00:18:00.200 | that separates the blue curve,
00:18:01.600 | and the red curve on the left.
00:18:04.000 | If we draw a straight line,
00:18:05.800 | it's going to have a high error;
00:18:07.800 | there's no way to do it
00:18:09.800 | with zero error,
00:18:11.000 | with 100% accuracy.
00:18:13.600 | Shown on the right,
00:18:14.800 | is our best attempt.
00:18:16.800 | But what we can do with deep learning,
00:18:20.200 | with a single hidden layer network,
00:18:22.000 | done here,
00:18:23.400 | is form the topology,
00:18:26.400 | the mapping of the space,
00:18:27.800 | in such a way in the middle,
00:18:30.000 | that allows for a straight line to be drawn,
00:18:31.800 | that separates the blue curve,
00:18:33.200 | and the red curve.
00:18:35.000 | The learning of the function in the middle,
00:18:38.200 | is what we're able to achieve with deep learning.
00:18:42.200 | It's taking raw, complicated information,
00:18:45.800 | and making it simple,
00:18:48.400 | actionable, useful.
00:18:51.600 | And the point is,
00:18:53.000 | that this kind of ability,
00:18:55.600 | to learn from raw sensory information,
00:18:58.200 | means that we can do a lot more,
00:19:00.400 | with a lot more data.
00:19:03.000 | So deep learning,
00:19:04.600 | gets better with more data.
00:19:06.600 | And that's important,
00:19:10.600 | for real-world applications,
00:19:12.600 | where edge cases are everything.
00:19:17.000 | This is us driving,
00:19:19.200 | with two perception control systems.
00:19:21.200 | One is a Tesla vehicle,
00:19:23.400 | with the autopilot,
00:19:25.200 | version one system,
00:19:26.400 | that's using a monocular camera,
00:19:27.800 | to perceive the external environment,
00:19:29.600 | and produce control decisions.
00:19:31.400 | And our own neural network,
00:19:33.600 | running on a Jetson TX2,
00:19:35.000 | that's taking in the same,
00:19:36.800 | with a monocular camera,
00:19:38.000 | and producing control decisions.
00:19:40.600 | And the two systems argue,
00:19:43.000 | and when they disagree,
00:19:44.200 | they raise up a flag,
00:19:45.800 | to say that this is an edge case,
00:19:47.200 | that needs human intervention.
00:19:50.200 | Covering
00:19:51.600 | such edge cases
00:19:53.200 | using machine learning
00:19:55.000 | is the main problem
00:19:57.000 | of artificial intelligence
00:19:59.400 | when applied to the real world.
00:20:01.000 | It is the main problem to solve.
00:20:03.000 | Okay, so what are neural networks?
00:20:07.800 | Inspired, very loosely,
00:20:11.000 | and I'll discuss
00:20:12.000 | the key difference
00:20:13.200 | between our own brains
00:20:14.400 | and artificial brains,
00:20:16.400 | because there's a lot of insight
00:20:18.600 | in that difference.
00:20:20.200 | But inspired, loosely,
00:20:22.000 | by biological neural networks,
00:20:23.800 | here is a simulation,
00:20:25.600 | of a thalamocortical brain network,
00:20:28.400 | which is only three million neurons.
00:20:31.200 | 476 million synapses.
00:20:33.400 | The full human brain,
00:20:34.800 | is a lot more than that.
00:20:36.000 | A hundred billion neurons,
00:20:37.800 | 1,000 trillion synapses.
00:20:41.800 | There's inspirational music,
00:20:47.600 | with this one,
00:20:48.400 | that I didn't realize was here.
00:20:50.400 | It should make you think.
00:20:52.400 | Artificial neural networks,
00:20:55.400 | okay, let's just let it play.
00:21:00.000 | The human neural network,
00:21:02.000 | is a hundred billion neurons, right?
00:21:04.000 | 1,000 trillion synapses.
00:21:06.000 | One of the state of the art,
00:21:09.000 | neural networks is ResNet-152,
00:21:12.000 | which has 60 million synapses.
00:21:15.000 | That's a difference
00:21:19.000 | of about
00:21:20.000 | seven orders of magnitude.
00:21:22.000 | The human brains have,
00:21:24.000 | 10 million times more synapses,
00:21:26.400 | than artificial neural networks.
00:21:29.000 | Plus or minus one order of magnitude,
00:21:31.000 | depending on the network.
00:21:33.000 | So what's the difference,
00:21:35.200 | between a biological neuron,
00:21:37.600 | and an artificial neuron?
00:21:39.000 | The topology of the human brain,
00:21:41.600 | has no layers.
00:21:42.800 | Neural networks,
00:21:44.000 | are stacked in layers.
00:21:46.000 | They're fixed,
00:21:46.800 | for the most part.
00:21:48.000 | There is chaos,
00:21:51.000 | very little structure,
00:21:52.400 | in our human brain,
00:21:53.400 | in terms of how neurons are connected.
00:21:56.000 | They're connected,
00:21:57.400 | often to 10,000 plus other neurons.
00:21:59.800 | The number of synapses,
00:22:01.400 | from individual neurons,
00:22:02.800 | that are input into the neuron,
00:22:05.400 | is huge.
00:22:06.400 | They're asynchronous.
00:22:08.400 | The human brain,
00:22:09.400 | works asynchronously.
00:22:11.200 | Artificial neural networks,
00:22:12.400 | work synchronously.
00:22:13.400 | The learning algorithm,
00:22:16.400 | for artificial neural networks,
00:22:19.400 | the only one,
00:22:20.400 | the best one,
00:22:22.400 | is back propagation.
00:22:24.400 | And we don't know,
00:22:27.800 | how human brains learn.
00:22:29.800 | Processing speed:
00:22:34.800 | this is one of
00:22:36.800 | the only benefits
00:22:38.800 | we have with artificial neural networks,
00:22:40.800 | that artificial neurons
00:22:42.800 | are faster.
00:22:44.800 | But they're also,
00:22:46.800 | extremely power inefficient.
00:22:51.800 | there is a division into stages,
00:22:53.800 | of training and testing,
00:22:54.800 | with neural networks.
00:22:56.800 | With biological neural networks,
00:22:58.800 | as you're sitting here today,
00:23:00.800 | they're always learning.
00:23:01.800 | The only profound similarity,
00:23:04.800 | the inspiring one,
00:23:06.800 | the captivating one,
00:23:08.800 | is that,
00:23:09.800 | both are distributed computation at scale.
00:23:12.800 | There is an emergent,
00:23:15.800 | aspect to neural networks,
00:23:18.800 | where the basic element of computation,
00:23:21.800 | a neuron,
00:23:23.800 | is simple,
00:23:24.800 | is extremely simple.
00:23:26.800 | But when connected together,
00:23:28.800 | beautiful,
00:23:29.800 | amazing,
00:23:31.800 | powerful approximators can be formed.
00:23:33.800 | A neural network is built up,
00:23:36.800 | with these computational units,
00:23:37.800 | where for the inputs
00:23:38.800 | there's a set of edges
00:23:40.800 | with weights on them.
00:23:41.800 | The weights
00:23:43.800 | are multiplied
00:23:44.800 | by the input signal,
00:23:46.800 | a bias is added,
00:23:48.800 | and a nonlinear function
00:23:51.800 | determines whether
00:23:52.800 | the network gets activated or not,
00:23:54.800 | the neuron gets activated or not,
00:23:56.800 | visualized here.
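A minimal sketch of that computational unit, assuming a sigmoid as the nonlinearity (the lecture leaves the choice of activation generic):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(inputs, weights, bias):
    # Each edge's weight is multiplied by its input signal, a bias is added,
    # and a nonlinear function determines how strongly the neuron activates.
    return sigmoid(np.dot(weights, inputs) + bias)

x = np.array([0.5, -1.2, 3.0])   # input signals
w = np.array([0.4, 0.1, -0.6])   # weights on the incoming edges
b = 0.2                          # bias
print(neuron(x, w, b))           # activation, a value in (0, 1)
```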
00:23:58.800 | And these neurons can be combined,
00:24:00.800 | in a number of ways.
00:24:02.800 | They can form a feed-forward neural network,
00:24:04.800 | or they can
00:24:05.800 | feed back into themselves
00:24:08.800 | to form state,
00:24:09.800 | to have memory,
00:24:11.800 | in recurrent neural networks.
00:24:14.800 | The ones on the left,
00:24:16.800 | are the ones that are most successful,
00:24:18.800 | for most applications,
00:24:20.800 | in computer vision.
00:24:22.800 | The ones on the right,
00:24:24.800 | are very popular,
00:24:26.800 | and specific,
00:24:27.800 | when temporal dynamics,
00:24:28.800 | or dynamics time series,
00:24:29.800 | of any kind are used.
00:24:31.800 | In fact, the ones on the right,
00:24:33.800 | are much closer,
00:24:34.800 | to the way our human brains are,
00:24:36.800 | than the ones on the left.
00:24:38.800 | But that's why they're really hard to train.
00:24:41.800 | One beautiful aspect,
00:24:46.800 | of this emergent power,
00:24:48.800 | from multiple neurons being connected together,
00:24:50.800 | is the universal property,
00:24:52.800 | that with a single hidden layer,
00:24:54.800 | these networks,
00:24:56.800 | can learn any function,
00:24:57.800 | learn to approximate any function.
00:24:59.800 | Which is an important property,
00:25:01.800 | to be aware of.
00:25:03.800 | Because,
00:25:04.800 | the limits here,
00:25:07.800 | are not in the power of the networks.
00:25:10.800 | The limits,
00:25:11.800 | is in the methods,
00:25:13.800 | by which we construct them,
00:25:15.800 | and train them.
00:25:16.800 | What kinds of machine learning,
00:25:18.800 | deep learning are there?
00:25:20.800 | We can separate it into two categories.
00:25:22.800 | Memorizers,
00:25:24.800 | the approaches,
00:25:26.800 | that essentially memorize patterns in the data.
00:25:28.800 | And approaches that,
00:25:30.800 | we can loosely say,
00:25:32.800 | are beginning to reason,
00:25:34.800 | to generalize over the data
00:25:46.800 | with minimal human input.
00:25:47.800 | On top,
00:25:48.800 | on the left,
00:25:49.800 | is the "teacher":
00:25:50.800 | how much human input,
00:25:52.800 | in blue, is needed
00:25:53.800 | to make the method successful.
00:25:55.800 | For supervised learning,
00:25:56.800 | which is what most of deep learning,
00:25:58.800 | successes come from,
00:25:59.800 | where most of the data,
00:26:00.800 | is annotated by human beings.
00:26:02.800 | The human,
00:26:04.800 | is at the core,
00:26:05.800 | of the success.
00:26:06.800 | Most of the data,
00:26:07.800 | that's part of the training,
00:26:08.800 | needs to be annotated by human beings.
00:26:10.800 | With some additional successes,
00:26:13.800 | coming from augmentation methods,
00:26:15.800 | that extend that,
00:26:17.800 | extend the data,
00:26:19.800 | based on which,
00:26:20.800 | these networks are trained.
00:26:22.800 | And the semi-supervised,
00:26:27.800 | reinforcement learning,
00:26:28.800 | and unsupervised methods,
00:26:29.800 | that we'll talk about later in the course,
00:26:31.800 | that's where the near-term,
00:26:34.800 | successes we hope are.
00:26:36.800 | And with the unsupervised learning approaches,
00:26:38.800 | that's where,
00:26:39.800 | the true excitement about,
00:26:41.800 | the possibilities of artificial intelligence lie.
00:26:43.800 | Being able to make sense,
00:26:45.800 | of our world,
00:26:46.800 | with minimal input,
00:26:48.800 | from humans.
00:26:50.800 | So we can think of two kinds of,
00:26:55.800 | deep learning,
00:26:57.800 | impact spaces.
00:26:59.800 | One is special-purpose intelligence:
00:27:02.800 | taking a problem,
00:27:04.800 | formalizing it,
00:27:05.800 | collecting enough data on it,
00:27:07.800 | and being able to
00:27:09.800 | solve a particular
00:27:11.800 | case
00:27:13.800 | that provides value.
00:27:15.800 | Of particular interest here,
00:27:17.800 | is a network that estimates apartment costs,
00:27:19.800 | in the Boston area.
00:27:20.800 | So you could take,
00:27:21.800 | the number of bedrooms,
00:27:22.800 | the square feet,
00:27:23.800 | in the neighborhood,
00:27:24.800 | and provide as output,
00:27:26.800 | the estimated cost.
00:27:28.800 | On the right,
00:27:29.800 | is the actual data,
00:27:31.800 | of apartment costs.
00:27:33.800 | We're actually standing
00:27:34.800 | in an
00:27:35.800 | area
00:27:37.800 | that has over $3,000
00:27:39.800 | for a studio apartment.
00:27:41.800 | Some of you may be feeling that pain.
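A minimal sketch of such a special-purpose network in Keras, with made-up toy data (bedrooms, square feet, and a neighborhood index in; an estimated cost out); this is illustrative, not the actual network or data from the lecture:

```python
import numpy as np
import tensorflow as tf

# Toy data: [bedrooms, square_feet, neighborhood_index] -> monthly cost in dollars.
X = np.array([[1, 550, 3], [2, 800, 3], [1, 450, 1], [3, 1200, 2]], dtype=np.float32)
y = np.array([[3100.0], [4200.0], [2100.0], [3900.0]], dtype=np.float32)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(3,)),
    tf.keras.layers.Dense(1),   # a single number out: the estimated cost
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=200, verbose=0)

# Estimate the cost of a new, unseen apartment.
print(model.predict(np.array([[1, 600, 3]], dtype=np.float32)))
```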
00:27:46.800 | And then there's general purpose intelligence,
00:27:50.800 | something,
00:27:51.800 | that feels like,
00:27:53.800 | approaching general purpose intelligence,
00:27:55.800 | which is reinforcement,
00:27:56.800 | and unsupervised learning.
00:27:58.800 | Here,
00:28:00.800 | from Andrej Karpathy's Pong from Pixels,
00:28:02.800 | a system that takes in
00:28:03.800 | an 80 by 80 pixel image,
00:28:05.800 | and with no other information,
00:28:07.800 | is able to
00:28:08.800 | win at this game.
00:28:10.800 | No information except,
00:28:11.800 | a sequence of images,
00:28:12.800 | raw sensory information,
00:28:14.800 | the same way,
00:28:15.800 | the same kind of information,
00:28:16.800 | that human beings take in,
00:28:18.800 | from the visual,
00:28:19.800 | audio,
00:28:20.800 | touch,
00:28:21.800 | sensory data,
00:28:22.800 | the very low level data,
00:28:24.800 | and be able to learn to win.
00:28:26.800 | In this very simplistic,
00:28:27.800 | in this very artificially,
00:28:29.800 | constructed world,
00:28:30.800 | but nevertheless,
00:28:31.800 | a world where no feature learning,
00:28:33.800 | is performed.
00:28:34.800 | Only raw sensory information,
00:28:36.800 | is used to win,
00:28:37.800 | with very sparse,
00:28:39.800 | minimal human input.
00:28:41.800 | We'll talk about that,
00:28:43.800 | on Wednesday,
00:28:45.800 | with deep reinforcement learning.
00:28:48.800 | So, but for now,
00:28:50.800 | we'll focus on supervised learning,
00:28:52.800 | where there is,
00:28:54.800 | input data,
00:28:55.800 | there is a network,
00:28:56.800 | we're trying to train,
00:28:58.800 | a learning system,
00:28:59.800 | and there's a correct output,
00:29:00.800 | that's labeled by human beings.
00:29:03.800 | That's the general training process,
00:29:05.800 | for a neural network.
00:29:06.800 | Input data,
00:29:07.800 | labels,
00:29:08.800 | and the training of that,
00:29:10.800 | network,
00:29:11.800 | that model.
00:29:12.800 | So that in the testing stage,
00:29:14.800 | given new input data
00:29:15.800 | that it has never seen before,
00:29:17.800 | it's tasked with
00:29:18.800 | producing guesses,
00:29:19.800 | and is evaluated based on that.
00:29:22.800 | For autonomous vehicles,
00:29:23.800 | that means being released,
00:29:25.800 | either in simulation,
00:29:26.800 | or in the real world,
00:29:27.800 | to operate.
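A minimal sketch of that train-then-test loop on made-up data: fit on human-labeled examples, then evaluate the guesses the model produces on inputs it has never seen (illustrative only):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Made-up labeled dataset: 500 examples, 10 features, binary labels "from humans".
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Training stage: learn only from the labeled training portion.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)
model.fit(X_train, y_train)

# Testing stage: produce guesses on data never seen during training and evaluate them.
guesses = model.predict(X_test)
print("test accuracy:", accuracy_score(y_test, guesses))
```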
00:29:31.800 | And how they learn,
00:29:33.800 | how neural networks learn:
00:29:34.800 | there is
00:29:35.800 | the forward pass
00:29:37.800 | of taking the input data,
00:29:38.800 | in the training stage,
00:29:43.800 | and
00:29:44.800 | producing a prediction,
00:29:45.800 | and then, given that
00:29:46.800 | there's ground truth
00:29:47.800 | in the training stage,
00:29:48.800 | we can have a measure of error
00:29:51.800 | based on a loss function
00:29:52.800 | that then punishes
00:29:54.800 | the synapses,
00:29:56.800 | the connections,
00:29:57.800 | the parameters
00:29:58.800 | that were
00:30:00.800 | involved with making
00:30:02.800 | that wrong prediction.
00:30:05.800 | And it back propagates the error,
00:30:09.800 | through those weights.
00:30:10.800 | We'll discuss that in a little bit more detail,
00:30:12.800 | in a bit here.
00:30:14.800 | So what can we do with deep learning?
00:30:16.800 | We can do one-to-one mapping.
00:30:18.800 | Really, you can think of the input
00:30:19.800 | as being anything.
00:30:20.800 | It can be a number,
00:30:21.800 | a vector of numbers,
00:30:22.800 | a sequence of numbers,
00:30:23.800 | a sequence of
00:30:24.800 | vectors of numbers.
00:30:25.800 | Anything you can think of,
00:30:26.800 | from images,
00:30:27.800 | to video,
00:30:28.800 | to audio,
00:30:29.800 | can be represented in this way.
00:30:30.800 | And the output
00:30:31.800 | can likewise
00:30:32.800 | be a single number,
00:30:34.800 | or it can be images,
00:30:35.800 | video, text, audio.
00:30:38.800 | One-to-one mapping,
00:30:39.800 | on the bottom,
00:30:40.800 | one to many,
00:30:41.800 | many to one,
00:30:42.800 | many to many,
00:30:43.800 | and many to many,
00:30:45.800 | with different starting points,
00:30:47.800 | for the data.
00:30:48.800 | Asynchronous.
00:30:51.800 | Some quick terms,
00:30:53.800 | that will come up.
00:30:54.800 | Deep learning,
00:30:55.800 | is the same as neural networks.
00:30:58.800 | It's really deep neural networks,
00:31:01.800 | large neural networks.
00:31:03.800 | It's a subset of machine learning,
00:31:05.800 | that has been extremely successful,
00:31:07.800 | in the past decade.
00:31:09.800 | Multi-layer perceptron,
00:31:11.800 | deep neural network,
00:31:12.800 | recurrent neural network,
00:31:14.800 | long short-term memory network,
00:31:15.800 | LSTM,
00:31:16.800 | convolutional neural network,
00:31:18.800 | and deep belief networks.
00:31:19.800 | All of these will come up,
00:31:20.800 | through the slides.
00:31:23.800 | And there is,
00:31:24.800 | specific operations,
00:31:26.800 | layers within these networks,
00:31:27.800 | of convolution,
00:31:28.800 | pooling,
00:31:29.800 | activation,
00:31:30.800 | and back propagation.
00:31:31.800 | This concept,
00:31:32.800 | that we'll discuss,
00:31:34.800 | in this class.
00:31:36.800 | Activation functions,
00:31:37.800 | there's a lot of variants.
00:31:40.800 | On the left,
00:31:41.800 | is the activation function,
00:31:42.800 | the left column.
00:31:43.800 | On the x-axis,
00:31:44.800 | is the input.
00:31:45.800 | On the y-axis,
00:31:46.800 | is the output.
00:31:48.800 | The sigmoid function,
00:31:49.800 | the output,
00:31:51.800 | if the font is too small,
00:31:52.800 | the output is,
00:31:54.800 | not centered at zero.
00:31:58.800 | For the tanh function,
00:32:00.800 | it's centered at zero,
00:32:01.800 | but it still suffers,
00:32:02.800 | from vanishing gradients.
00:32:04.800 | Vanishing gradients
00:32:05.800 | is when the value,
00:32:06.800 | the input, is very low or very high.
00:32:11.800 | Then,
00:32:13.800 | as you see in the right column there,
00:32:15.800 | the derivative of the function
00:32:17.800 | is very low,
00:32:18.800 | so the learning is very slow.
00:32:21.800 | For ReLU,
00:32:24.800 | it's also not zero centered,
00:32:27.800 | but it does not suffer,
00:32:29.800 | from vanishing gradients.
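A minimal numeric sketch of those three activations' derivatives, showing why saturated sigmoid and tanh inputs give near-zero gradients (and therefore slow learning) while ReLU does not:

```python
import numpy as np

def d_sigmoid(z):
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)              # at most 0.25; nearly 0 when |z| is large

def d_tanh(z):
    return 1 - np.tanh(z) ** 2      # tanh is zero-centered but still saturates

def d_relu(z):
    return float(z > 0)             # 1 for positive inputs: no saturation there

for z in (-10.0, 0.0, 10.0):
    print(f"z={z:6.1f}  d_sigmoid={d_sigmoid(z):.5f}  d_tanh={d_tanh(z):.5f}  d_relu={d_relu(z):.0f}")
```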
00:32:31.800 | Back propagation,
00:32:32.800 | is the process of learning.
00:32:34.800 | It's the way we
00:32:35.800 | go from the error,
00:32:36.800 | computed as the loss function
00:32:38.800 | on the bottom right of the slide,
00:32:39.800 | taking the actual output
00:32:42.800 | of the network from the forward pass,
00:32:43.800 | subtracting it
00:32:45.800 | from the ground truth,
00:32:47.800 | squaring, dividing by two,
00:32:49.800 | and using that loss function
00:32:51.800 | to back propagate through,
00:32:53.800 | to construct a gradient,
00:32:54.800 | to back propagate the error
00:32:56.800 | to the weights that were responsible
00:32:58.800 | for making either a correct
00:32:59.800 | or an incorrect decision.
00:33:01.800 | So the sub tasks of that,
00:33:03.800 | there's a forward pass,
00:33:04.800 | there's a backward pass,
00:33:06.800 | and a fraction of the
00:33:08.800 | gradient is subtracted from the weights.
00:33:09.800 | That's it.
00:33:11.800 | That process is modular,
00:33:14.800 | so it's local to each individual neuron,
00:33:16.800 | which is why
00:33:18.800 | we're able to distribute it,
00:33:21.800 | to
00:33:23.800 | parallelize it
00:33:26.800 | across the GPU.
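A minimal sketch of that loop for a single linear neuron, using the loss described here, (prediction - ground truth)^2 / 2, and subtracting a fraction (the learning rate) of the gradient from each weight:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)            # weights (the "synapses")
b = 0.0                           # bias
lr = 0.1                          # the fraction of the gradient subtracted each step

x = np.array([0.5, -1.0, 2.0])    # one training input
t = 1.5                           # ground-truth target

for step in range(100):
    y = np.dot(w, x) + b          # forward pass: produce a prediction
    loss = 0.5 * (y - t) ** 2     # (prediction - ground truth)^2 / 2, as on the slide
    dy = y - t                    # backward pass: gradient of the loss w.r.t. the output
    w -= lr * dy * x              # update the weights responsible for the error
    b -= lr * dy                  # update the bias

print("final loss:", loss)        # approaches zero
```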
00:33:28.800 | So, learning for neural network,
00:33:34.800 | these computational units,
00:33:35.800 | are extremely simple.
00:33:36.800 | They're extremely simple,
00:33:37.800 | to then correct,
00:33:39.800 | when they make an error,
00:33:40.800 | when they're part of a larger network,
00:33:41.800 | that makes an error.
00:33:42.800 | And all that boils down to,
00:33:44.800 | is essentially an optimization problem,
00:33:46.800 | where the objective,
00:33:47.800 | utility function is,
00:33:49.800 | the loss function,
00:33:50.800 | and the goal is to minimize it.
00:33:52.800 | And we have to update the parameters,
00:33:54.800 | the weights and the synapses,
00:33:55.800 | and the biases,
00:33:57.800 | to decrease that loss function.
00:33:59.800 | And that loss function is highly nonlinear.
00:34:03.800 | Depending on the activation functions,
00:34:07.800 | different properties,
00:34:08.800 | different issues arise.
00:34:09.800 | There's vanishing gradients,
00:34:11.800 | for sigmoid,
00:34:14.800 | where the learning can be slow.
00:34:16.800 | There's dying ReLUs,
00:34:19.800 | where the derivative is exactly zero,
00:34:23.800 | for inputs less than zero.
00:34:26.800 | There are solutions to this,
00:34:28.800 | like leaky ReLUs,
00:34:30.800 | and a bunch of details,
00:34:31.800 | that you may discover,
00:34:32.800 | when you try to win,
00:34:33.800 | the deep traffic competition.
00:34:35.800 | But, for the most part,
00:34:37.800 | these are the main activation functions.
00:34:39.800 | And it's the choice of the,
00:34:42.800 | neural network designer,
00:34:44.800 | which one works best.
00:34:46.800 | There are saddle points,
00:34:48.800 | all the problems,
00:34:49.800 | from numerical nonlinear optimization,
00:34:51.800 | that arise,
00:34:52.800 | come up here.
00:34:54.800 | It's hard to break symmetry,
00:34:57.800 | and stochastic gradient descent,
00:35:00.800 | without any kind of tricks to it,
00:35:03.800 | can take a very long time,
00:35:05.800 | to arrive at the minima.
00:35:07.800 | One of the biggest problems,
00:35:10.800 | in all of machine learning,
00:35:11.800 | and certainly deep learning,
00:35:13.800 | is overfitting.
00:35:14.800 | You can think of the blue dots,
00:35:16.800 | in a plot here,
00:35:17.800 | as the data,
00:35:18.800 | to which we want to fit a curve.
00:35:20.800 | We want to design a learning system,
00:35:23.800 | that approximates,
00:35:25.800 | the regression of this data.
00:35:28.800 | So, in green,
00:35:30.800 | is a sine curve,
00:35:32.800 | simple, fits well.
00:35:34.800 | And then there's a ninth degree polynomial,
00:35:37.800 | which fits even better,
00:35:38.800 | in terms of the error.
00:35:40.800 | But it clearly overfits this data.
00:35:42.800 | If there's other data,
00:35:45.800 | that it has not seen yet,
00:35:47.800 | that it has to fit,
00:35:48.800 | it's likely to produce a high error.
00:35:50.800 | So it's overfitting the training set.
00:35:52.800 | This is a big problem,
00:35:54.800 | for small data sets.
00:35:56.800 | And so we have to fix that,
00:35:58.800 | with regularization.
00:35:59.800 | Regularization is a set of methodologies,
00:36:02.800 | that prevent overfitting.
00:36:04.800 | Learning the training set too well,
00:36:07.800 | and then not being able to generalize
00:36:09.800 | to the testing stage.
00:36:11.800 | And overfitting, the main symptom,
00:36:15.800 | is the error decreases in training set,
00:36:18.800 | but increases in test set.
00:36:20.800 | So there's a lot of techniques,
00:36:23.800 | in traditional machine learning,
00:36:24.800 | that deal with this,
00:36:25.800 | and cross validation and so on.
00:36:26.800 | But because of the cost of training,
00:36:28.800 | for neural networks,
00:36:30.800 | it's traditional to use,
00:36:32.800 | what's called a validation set.
00:36:34.800 | So you create a subset of the training,
00:36:37.800 | that you keep away,
00:36:39.800 | for which you have the ground truth.
00:36:41.800 | And use that,
00:36:42.800 | as a representative of the testing set.
00:36:45.800 | So you perform early stopping,
00:36:48.800 | or more realistically,
00:36:49.800 | just save a checkpoint often,
00:36:52.800 | to see how, as the training evolves,
00:36:56.800 | the performance changes
00:36:58.800 | on the validation set.
00:37:00.800 | And so you can stop,
00:37:02.800 | when the performance in the validation set,
00:37:03.800 | is getting a lot worse.
00:37:05.800 | It means you're over training,
00:37:06.800 | on the training set.
00:37:08.800 | In practice, of course,
00:37:12.800 | we run training much longer,
00:37:14.800 | and see when,
00:37:15.800 | what is the best-performing,
00:37:20.800 | snapshot checkpoint of the network.
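A minimal sketch of that recipe in Keras, with a hypothetical model and toy data: hold out a validation split, stop when validation loss stops improving, and keep the best checkpoint.

```python
import numpy as np
import tensorflow as tf

X = np.random.randn(1000, 20).astype("float32")   # toy inputs
y = (X[:, 0] > 0).astype("float32")               # toy labels

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

callbacks = [
    # Early stopping: quit once validation loss has not improved for 5 epochs.
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5),
    # Save a checkpoint whenever validation loss improves, keeping the best snapshot.
    tf.keras.callbacks.ModelCheckpoint("best_model.h5", monitor="val_loss", save_best_only=True),
]

# validation_split keeps 20% of the training data away as the validation set.
model.fit(X, y, epochs=100, validation_split=0.2, callbacks=callbacks, verbose=0)
```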
00:37:23.800 | Dropout,
00:37:24.800 | is another very powerful,
00:37:26.800 | regularization technique.
00:37:27.800 | Where we randomly remove,
00:37:29.800 | part of the network,
00:37:30.800 | randomly remove some of the nodes,
00:37:32.800 | in the network,
00:37:33.800 | along with its incoming,
00:37:36.800 | and outgoing edges.
00:37:37.800 | So what that really looks like,
00:37:39.800 | is a probability of keeping a node.
00:37:41.800 | And in many deep learning frameworks today,
00:37:44.800 | it comes with a dropout layer.
00:37:46.800 | So it's essentially a probability,
00:37:48.800 | that's usually greater than 0.5,
00:37:50.800 | that a node will be kept.
00:37:53.800 | For the input layer,
00:37:55.800 | the probability should be much higher,
00:37:57.800 | or more effectively,
00:37:59.800 | what works well is just adding noise.
00:38:01.800 | What's the point here?
00:38:02.800 | You want to create,
00:38:04.800 | enough diversity,
00:38:05.800 | in the training data,
00:38:07.800 | such that it is generalizable,
00:38:09.800 | to the testing.
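A minimal sketch of dropout in Keras; note that Keras's rate is the probability of dropping a node, so a keep probability above 0.5 corresponds to a rate below 0.5, and at the input one can add noise instead:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # At the input layer, adding noise works better than dropping inputs.
    tf.keras.layers.GaussianNoise(0.1, input_shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    # rate=0.3 drops 30% of nodes each training step, i.e. keep probability 0.7 (> 0.5).
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```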
00:38:11.800 | And as you'll see,
00:38:13.800 | with deep traffic competition,
00:38:14.800 | there's L2 and L1 penalty,
00:38:17.800 | weight decay, weight penalty.
00:38:19.800 | Where,
00:38:20.800 | there's a penalization on the weights,
00:38:22.800 | if they get too large.
00:38:24.800 | The L2 penalty keeps the weight small,
00:38:26.800 | unless the error derivative is huge,
00:38:29.800 | and produces a smoother model,
00:38:31.800 | and it prefers to distribute:
00:38:34.800 | when there are
00:38:35.800 | two similar inputs,
00:38:36.800 | it prefers to put half the weight on each,
00:38:39.800 | to distribute the weights,
00:38:40.800 | as opposed to putting all the weight on one of the edges.
00:38:43.800 | Makes the network more robust.
00:38:46.800 | L1 penalty has the one benefit,
00:38:49.800 | that for really large weights,
00:38:51.800 | they're allowed to stay.
00:38:53.800 | So it allows for a few weights,
00:38:55.800 | to remain very large.
00:38:56.800 | These are the regularization techniques.
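A minimal sketch of those weight penalties in Keras (the coefficients are hypothetical): the regularizer adds lambda * sum(w^2) for L2, or lambda * sum(|w|) for L1, to the loss, so overly large weights are penalized.

```python
import tensorflow as tf

# L2 (weight decay): keeps all weights small and spreads weight across similar inputs.
l2_layer = tf.keras.layers.Dense(
    64, activation="relu",
    kernel_regularizer=tf.keras.regularizers.l2(1e-3),   # adds 1e-3 * sum(w^2) to the loss
)

# L1: drives most weights toward zero but allows a few to remain large.
l1_layer = tf.keras.layers.Dense(
    64, activation="relu",
    kernel_regularizer=tf.keras.regularizers.l1(1e-3),   # adds 1e-3 * sum(|w|) to the loss
)
```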
00:38:58.800 | And I wanted to mention them,
00:39:00.800 | because they're useful to some of the competitions,
00:39:02.800 | here in the course.
00:39:03.800 | And I recommend,
00:39:04.800 | to go to playground,
00:39:05.800 | to TensorFlow playground,
00:39:07.800 | to play around with some of these parameters.
00:39:10.800 | Where you get to,
00:39:11.800 | online in the browser,
00:39:13.800 | play around with different inputs,
00:39:14.800 | different features,
00:39:16.800 | different number of layers,
00:39:17.800 | and regularization techniques.
00:39:19.800 | And to build your intuition,
00:39:21.800 | about classification regression problems,
00:39:23.800 | given different input data sets.
00:39:26.800 | So what changed?
00:39:29.800 | Why, over the past many decades,
00:39:32.800 | neural networks,
00:39:34.800 | that have gone through two winters,
00:39:36.800 | are now again,
00:39:38.800 | dominating the artificial intelligence community.
00:39:40.800 | CPUs, GPUs,
00:39:43.800 | ASICs,
00:39:45.800 | the computational power has skyrocketed.
00:39:47.800 | From Moore's law to GPUs.
00:39:50.800 | There is huge data set,
00:39:53.800 | including ImageNet and others.
00:39:57.800 | There is research,
00:40:00.800 | back propagation,
00:40:02.800 | in the 80s.
00:40:04.800 | The convolutional neural networks,
00:40:07.800 | LSTMs.
00:40:08.800 | There's been a lot of,
00:40:10.800 | interesting breakthroughs,
00:40:11.800 | about how to design these architectures.
00:40:13.800 | How to build them,
00:40:14.800 | such that they're trainable efficiently,
00:40:16.800 | using GPUs.
00:40:18.800 | There is the software infrastructure,
00:40:21.800 | from being able to share the data
00:40:23.800 | on Git,
00:40:24.800 | to being able to train networks,
00:40:26.800 | and share code,
00:40:27.800 | and effectively,
00:40:28.800 | view neural networks as a stack of layers,
00:40:32.800 | as opposed to having to implement stuff from scratch,
00:40:34.800 | with TensorFlow, PyTorch,
00:40:36.800 | and other deep learning frameworks.
00:40:38.800 | And there's huge financial backing,
00:40:40.800 | from Google, Facebook, and so on.
00:40:42.800 | Deep learning,
00:40:50.800 | in order to understand,
00:40:52.800 | why it works so well,
00:40:55.800 | and where its limitations are,
00:40:57.800 | we need to understand,
00:40:58.800 | where our own intuition comes from,
00:40:59.800 | about what is hard,
00:41:00.800 | and what is easy.
00:41:02.800 | The important thing about computer vision,
00:41:04.800 | which is a lot of what this course is about,
00:41:06.800 | even as in deep reinforcement learning formulation,
00:41:09.800 | is that visual perception,
00:41:11.800 | for us human beings,
00:41:12.800 | was formed,
00:41:14.800 | 540 million years ago.
00:41:16.800 | That's 540 million years worth of data.
00:41:21.800 | An abstract thought,
00:41:24.800 | is only formed about 100,000 years ago.
00:41:27.800 | That's several orders of magnitude less data.
00:41:31.800 | So there are,
00:41:32.800 | with neural networks,
00:41:34.800 | predictions
00:41:36.800 | that seem trivial,
00:41:38.800 | trivial to us human beings,
00:41:43.800 | but are completely challenging,
00:41:45.800 | and come out wrong, for neural networks.
00:41:48.800 | Here, on the left,
00:41:49.800 | showing a prediction of a dog,
00:41:51.800 | with a little bit of distortion,
00:41:52.800 | and noise added to the image,
00:41:54.800 | producing the image on the right.
00:41:55.800 | And the neural network is confidently,
00:41:58.800 | with 99%-plus confidence,
00:42:00.800 | predicting that it's an ostrich.
00:42:03.800 | And there's all these problems,
00:42:06.800 | has to deal with,
00:42:07.800 | whether it's in computer vision data,
00:42:09.800 | whether it's in text data,
00:42:10.800 | audio,
00:42:11.800 | all of this variation arises.
00:42:14.800 | In vision,
00:42:15.800 | it's illumination variability,
00:42:17.800 | the set of pixels,
00:42:18.800 | and the numbers look completely different,
00:42:20.800 | depending on the lighting conditions.
00:42:22.800 | It's the biggest problem in driving,
00:42:24.800 | is lighting conditions,
00:42:25.800 | lighting variability.
00:42:27.800 | Pose variation,
00:42:29.800 | objects need to be learned,
00:42:30.800 | from every different perspective.
00:42:32.800 | I'll discuss that,
00:42:33.800 | for when sensing the driver.
00:42:35.800 | Most of the deep learning work,
00:42:38.800 | that's done on the face,
00:42:39.800 | on the human,
00:42:40.800 | is done on the frontal face,
00:42:42.800 | or semi frontal face.
00:42:44.800 | There's very little work done,
00:42:45.800 | on the full 360,
00:42:48.800 | pose variability,
00:42:50.800 | that a human being can take on.
00:42:52.800 | Inter-class variability,
00:42:56.800 | for the classification problem,
00:42:57.800 | for the detection problem,
00:42:59.800 | there is a lot of different kinds of objects,
00:43:01.800 | for cats, dogs, cars, bicyclists, pedestrians.
00:43:05.800 | So that brings us to object classification.
00:43:09.800 | And I'd like to take you through,
00:43:11.800 | where deep learning,
00:43:13.800 | has taken big strides,
00:43:15.800 | for the past several years,
00:43:16.800 | leading up to this year,
00:43:17.800 | to 2018.
00:43:19.800 | So let's start,
00:43:21.800 | at object classification.
00:43:23.800 | It's when you take,
00:43:24.800 | a single image,
00:43:26.800 | and you have to say,
00:43:27.800 | one class,
00:43:29.800 | that's most likely to belong in that image.
00:43:31.800 | The most famous,
00:43:33.800 | variant of that,
00:43:34.800 | is the ImageNet competition,
00:43:35.800 | ImageNet challenge.
00:43:36.800 | ImageNet data set,
00:43:37.800 | is a data set of 14 million images,
00:43:39.800 | with 21,000 categories.
00:43:41.800 | And for say,
00:43:43.800 | the category of fruit,
00:43:45.800 | there's a total of,
00:43:47.800 | 188,000 images of fruit.
00:43:50.800 | And there is,
00:43:51.800 | 1,200 images of Granny Smith apples.
00:43:53.800 | It gives you a sense,
00:43:54.800 | of what we're talking about here.
00:43:56.800 | So this has been,
00:43:59.800 | the source,
00:44:00.800 | of a lot of interesting breakthroughs,
00:44:02.800 | in deep learning,
00:44:03.800 | and a lot of the excitement,
00:44:05.800 | in deep learning.
00:44:06.800 | Is first,
00:44:07.800 | the big successful network,
00:44:09.800 | at least,
00:44:10.800 | one that became famous,
00:44:12.800 | in deep learning,
00:44:14.800 | is AlexNet in 2012,
00:44:16.800 | that took a leap,
00:44:18.800 | a significant leap in performance,
00:44:20.800 | on the ImageNet challenge.
00:44:22.800 | So it was one of the first,
00:44:24.800 | neural networks,
00:44:25.800 | that was successfully trained on the GPU,
00:44:27.800 | and achieved,
00:44:28.800 | an incredible performance boost,
00:44:29.800 | over the previous year,
00:44:31.800 | on the ImageNet challenge.
00:44:32.800 | The challenge,
00:44:34.800 | and I'll talk about some of these networks,
00:44:36.800 | is, given a single image,
00:44:38.800 | to give five guesses,
00:44:40.800 | and you have five guesses
00:44:42.800 | for one of them to be correct.
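A minimal sketch of that top-5 scoring rule: a prediction counts as correct if the ground-truth class appears among the model's five highest-scoring guesses (random scores are used here just to make it runnable).

```python
import numpy as np

def top5_accuracy(scores, labels):
    # scores: (num_images, num_classes) class scores; labels: (num_images,) ground truth.
    top5 = np.argsort(scores, axis=1)[:, -5:]     # indices of the 5 highest-scoring classes
    hits = [label in guesses for label, guesses in zip(labels, top5)]
    return float(np.mean(hits))

rng = np.random.default_rng(0)
scores = rng.random((100, 1000))                  # e.g. 1000 ImageNet classes
labels = rng.integers(0, 1000, size=100)
print("top-5 accuracy:", top5_accuracy(scores, labels))   # ~0.005 for random scores
```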
00:44:44.800 | The human annotation,
00:44:46.800 | is a question often comes up.
00:44:48.800 | So how do you know the ground truth?
00:44:50.800 | Human level of performance,
00:44:51.800 | is 5.1% error,
00:44:53.800 | on this task.
00:44:55.800 | But the way,
00:44:57.800 | the annotation for ImageNet,
00:44:59.800 | is performed,
00:45:01.800 | there's a Google search,
00:45:02.800 | where you pull,
00:45:03.800 | the images,
00:45:04.800 | already labeled for you,
00:45:05.800 | and then the annotation
00:45:07.800 | that other humans perform
00:45:08.800 | on Mechanical Turk
00:45:09.800 | is just binary.
00:45:10.800 | Is this a cat or not a cat?
00:45:12.800 | So they're not tasked,
00:45:13.800 | with performing the,
00:45:14.800 | very high resolution,
00:45:16.800 | semantic,
00:45:17.800 | labeling of the image.
00:45:19.800 | Okay,
00:45:21.800 | so through,
00:45:22.800 | from 2012,
00:45:23.800 | with AlexNet,
00:45:24.800 | to today.
00:45:25.800 | And the big,
00:45:27.800 | transition in 2018,
00:45:28.800 | of the ImageNet challenge,
00:45:30.800 | leaving Stanford,
00:45:31.800 | and going to Kaggle.
00:45:33.800 | It's sort of a monumental step,
00:45:36.800 | because in 2015,
00:45:37.800 | with the ResNet network,
00:45:39.800 | was the first time,
00:45:40.800 | that the human level of performance,
00:45:42.800 | was exceeded.
00:45:43.800 | And I think this is,
00:45:45.800 | a very important,
00:45:51.800 | of where deep learning is.
00:45:53.800 | For a particular,
00:45:54.800 | what I would argue,
00:45:55.800 | is a toy example,
00:45:56.800 | despite the fact,
00:45:57.800 | that it's 14 million images.
00:45:58.800 | So we're developing,
00:46:00.800 | state-of-the-art techniques here,
00:46:02.800 | and the next stage,
00:46:03.800 | as we are now,
00:46:04.800 | exceeding human level performance,
00:46:05.800 | on this task,
00:46:06.800 | is how to take,
00:46:07.800 | these methods,
00:46:08.800 | into the real world,
00:46:09.800 | to perform,
00:46:10.800 | scene perception,
00:46:11.800 | to perform,
00:46:12.800 | driver state perception.
00:46:14.800 | In 2016,
00:46:19.800 | and 2017,
00:46:20.800 | CUImage
00:46:22.800 | and SENet
00:46:23.800 | have a very unique
00:46:24.800 | new addition
00:46:25.800 | to the previous formulations,
00:46:26.800 | that has achieved
00:46:27.800 | an error of
00:46:30.800 | 2.25%
00:46:33.800 | on the ImageNet,
00:46:34.800 | classification challenge.
00:46:35.800 | It's an incredible result.
00:46:37.800 | Okay,
00:46:38.800 | so you have this image,
00:46:39.800 | classification architecture,
00:46:41.800 | that takes in a single image,
00:46:43.800 | and takes it through convolution,
00:46:45.800 | pooling,
00:46:46.800 | convolution,
00:46:47.800 | and at the end,
00:46:48.800 | fully connected layers,
00:46:49.800 | and performs a classification task,
00:46:51.800 | or regression task.
00:46:52.800 | And you can swap out,
00:46:53.800 | that layer,
00:46:54.800 | to perform any kind of,
00:46:56.800 | other task,
00:46:58.800 | including with,
00:46:59.800 | recurrent neural networks,
00:47:00.800 | of image captioning,
00:47:01.800 | and so on,
00:47:02.800 | or localization,
00:47:03.800 | of bounding boxes,
00:47:05.800 | or you can do,
00:47:06.800 | fully convolutional networks,
00:47:08.800 | which we'll talk about,
00:47:10.800 | on Thursday.
00:47:12.800 | Which is when you take a,
00:47:14.800 | image as an input,
00:47:15.800 | and produce an image as an output.
00:47:17.800 | But where the output image,
00:47:18.800 | in this case,
00:47:19.800 | is a segmentation.
00:47:22.800 | where a color indicates,
00:47:23.800 | what the object is,
00:47:25.800 | of the category,
00:47:26.800 | of the object.
00:47:27.800 | So it's pixel level segmentation,
00:47:29.800 | every single pixel in the image,
00:47:30.800 | is assigned a class,
00:47:32.800 | a category,
00:47:33.800 | of where that pixel belongs to.
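A minimal sketch, not the course code, of a fully convolutional head for this: the output tensor has one class distribution per pixel, so every pixel gets assigned a category (the class count is an assumption).

```python
import tensorflow as tf

num_classes = 19                                   # assumed number of road-scene categories
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu",
                           input_shape=(None, None, 3)),           # image in
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    # A 1x1 convolution produces class scores for every pixel; softmax over the channel axis.
    tf.keras.layers.Conv2D(num_classes, 1, padding="same", activation="softmax"),
])

x = tf.random.uniform((1, 128, 256, 3))            # one RGB image
y = model(x)                                       # shape (1, 128, 256, num_classes): image out
print(y.shape)
```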
00:47:36.800 | This is,
00:47:37.800 | the kind of,
00:47:39.800 | task,
00:47:40.800 | that's overlaid on top of
00:47:42.800 | other sensory information
00:47:44.800 | coming from the car,
00:47:45.800 | in order to,
00:47:46.800 | perceive the external environment.
00:47:49.800 | You can continue to extract,
00:47:51.800 | information from images in this way,
00:47:53.800 | to produce image to image mapping,
00:47:55.800 | for example,
00:47:56.800 | to colorize images.
00:47:57.800 | And take from grayscale images,
00:47:59.800 | to color images.
00:48:02.800 | Or you can use that kind of heat map information to localize objects in the image.
00:48:07.800 | So as opposed to just classifying that this is an image of a cow,
00:48:11.800 | R-CNN, Fast R-CNN, Faster R-CNN, and a lot of other localization networks allow you to propose different candidates for where exactly the cow is located in this image,
00:48:24.800 | and thereby perform object detection, not just object classification.
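For a sense of what that looks like in practice, here is a hedged sketch that runs torchvision's pretrained Faster R-CNN on a stand-in image and keeps only confident box proposals; the 0.8 score threshold and the image size are arbitrary choices, not values from the lecture.

```python
# Sketch: run a pretrained Faster R-CNN from torchvision to get candidate
# bounding boxes, rather than a single whole-image label.
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = torch.rand(3, 480, 640)       # stand-in for a real photo, values in [0, 1]
with torch.no_grad():
    detections = model([image])[0]     # dict with boxes, class labels, confidence scores

for box, label, score in zip(detections["boxes"],
                             detections["labels"],
                             detections["scores"]):
    if score > 0.8:                    # keep only confident proposals
        print(label.item(), score.item(), box.tolist())
```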
00:48:30.800 | In 2017, there have been a lot of cool applications of these architectures.
00:48:36.800 | One of which is background removal.
00:48:38.800 | Again, mapping from image to image: the ability to remove the background from selfies of humans, or human-like pictures, or faces.
00:48:52.800 | The references, with some incredible animations, are at the bottom of the slide, and the slides are now available online.
00:49:01.800 | Pix2Pix HD. There's been a lot of work in GANs, generative adversarial networks.
00:49:12.800 | In particular in driving, GANs have been used to generate examples from source data,
00:49:23.800 | whether that's from raw data, or in this case with Pix2Pix HD,
00:49:27.800 | taking coarse, pixel-level semantic labeling of the images and producing photorealistic, high-definition images of the forward roadway.
00:49:40.800 | This is an exciting possibility for being able to generate a variety of cases for self-driving cars, for autonomous vehicles to be able to learn from:
00:49:49.800 | to generate, to augment the data, and to be able to change the way different roads look, the road conditions,
00:49:55.800 | to change the way vehicles, cyclists, and pedestrians look.
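To illustrate the adversarial setup behind this kind of label-map-to-image generation, here is a heavily simplified, toy sketch of a pix2pix-style generator and discriminator in PyTorch. The class count, layer sizes, and random stand-in tensors are all assumptions; the real Pix2Pix HD model is far larger and more involved.

```python
# Toy adversarial setup: generator maps a semantic label map to an RGB image,
# discriminator tries to tell generated images from real ones.
import torch
import torch.nn as nn

NUM_CLASSES = 5                                   # assumed number of semantic classes

generator = nn.Sequential(                        # label map (e.g. one-hot) -> RGB image
    nn.Conv2d(NUM_CLASSES, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1), nn.Tanh(),
)
discriminator = nn.Sequential(                    # RGB image -> realism score per patch
    nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(32, 1, 4, stride=2, padding=1),
)

bce = nn.BCEWithLogitsLoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

label_map = torch.randn(1, NUM_CLASSES, 64, 64)   # stand-in for a semantic layout
real_image = torch.randn(1, 3, 64, 64)            # stand-in for a real photo

# Discriminator step: real images should score high, generated ones low.
fake_image = generator(label_map).detach()
real_score, fake_score = discriminator(real_image), discriminator(fake_image)
d_loss = bce(real_score, torch.ones_like(real_score)) + \
         bce(fake_score, torch.zeros_like(fake_score))
d_opt.zero_grad(); d_loss.backward(); d_opt.step()

# Generator step: try to make the discriminator score generated images as real.
fake_score = discriminator(generator(label_map))
g_loss = bce(fake_score, torch.ones_like(fake_score))
g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```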
00:49:58.800 | Then we can move on to recurrent neural networks.
00:50:01.800 | Everything I've talked about was a one-to-one mapping, from image to image, or image to number.
00:50:08.800 | Recurrent neural networks work with sequences.
00:50:10.800 | We can use sequences to generate handwriting, to generate text captions for an image based on the localizations, the various detections, in that image.
00:50:26.800 | We can perform video description generation:
00:50:31.800 | taking a video and combining convolutional neural networks with recurrent neural networks,
00:50:37.800 | using the convolutional neural networks to extract features frame to frame,
00:50:40.800 | and feeding those extracted features into the RNNs to then generate a labeling, a description of what's going on in the video.
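A minimal sketch of that CNN-plus-RNN wiring, assuming toy sizes: a small CNN produces a feature vector per frame, and an LSTM consumes the frame features and emits a word score per step. A real captioning system would decode the description autoregressively over a learned vocabulary; this only shows how the two pieces connect.

```python
# Illustrative CNN + RNN pattern: per-frame features feed a sequence model.
import torch
import torch.nn as nn

VOCAB_SIZE, FEAT_DIM, HIDDEN = 1000, 64, 128    # toy assumptions

frame_encoder = nn.Sequential(                   # per-frame feature extractor
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, FEAT_DIM),
)
rnn = nn.LSTM(input_size=FEAT_DIM, hidden_size=HIDDEN, batch_first=True)
word_head = nn.Linear(HIDDEN, VOCAB_SIZE)        # predicts a word score per time step

video = torch.randn(1, 8, 3, 64, 64)             # 1 clip, 8 frames of 64x64 RGB
b, t = video.shape[:2]

frame_feats = frame_encoder(video.reshape(b * t, 3, 64, 64)).reshape(b, t, FEAT_DIM)
rnn_out, _ = rnn(frame_feats)                    # one hidden state per frame
word_logits = word_head(rnn_out)                 # (1, 8, VOCAB_SIZE)
print(word_logits.argmax(dim=-1))                # a (toy) predicted word id per step
```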
00:50:53.800 | There are a lot of exciting approaches for autonomous systems, especially in drones, where the time to make a decision is short.
00:51:03.800 | Same with the RC car traveling 30 miles an hour.
00:51:07.800 | Attentional mechanisms for steering the attention of the network have been very popular for the localization task,
00:51:14.800 | and for reducing how much interpretation of the image, how many pixels, needs to be considered in the classification task.
00:51:22.800 | So we can steer, we can model the way a human being looks around an image to interpret it, and use the network to do the same.
00:51:30.800 | And we can use that kind of steering to draw images as well.
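Here is a minimal soft-attention sketch of that idea, assuming a feature map from some backbone and a ten-class output: the network scores each spatial location, and only the attended summary feeds the classifier.

```python
# Soft attention over spatial locations of a CNN feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F

features = torch.randn(1, 32, 16, 16)                 # stand-in feature map from a backbone

attn_scorer = nn.Conv2d(32, 1, kernel_size=1)         # one relevance score per location
classifier = nn.Linear(32, 10)                        # 10 classes, assumed

scores = attn_scorer(features)                        # (1, 1, 16, 16)
weights = F.softmax(scores.flatten(2), dim=-1)        # attention over the 256 locations
attended = (features.flatten(2) * weights).sum(-1)    # (1, 32) weighted summary of the map
logits = classifier(attended)

attn_map = weights.reshape(16, 16)                    # where the model "looked"
print(logits.shape, attn_map.argmax())
```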
00:51:35.800 | Finally, big breakthroughs in 2017 came from this: Pong to pixels, reinforcement learning using raw sensory data,
00:51:51.800 | using deep reinforcement learning methods, which we'll talk about on Wednesday.
00:51:56.800 | The underlying methodology of Deep Traffic and Deep Crash, which I'm really excited about, is using neural networks as the approximators inside reinforcement learning approaches.
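The core idea, a neural network standing in as the value-function approximator, can be sketched in a few lines of PyTorch. This is a toy DQN-style update with assumed state and action sizes, not the Deep Traffic or Deep Crash implementation.

```python
# Toy sketch: a neural network as the Q-value approximator in reinforcement learning.
import random
import torch
import torch.nn as nn

STATE_DIM, NUM_ACTIONS, GAMMA = 8, 5, 0.99        # assumed sizes and discount factor

q_net = nn.Sequential(                            # state -> estimated value per action
    nn.Linear(STATE_DIM, 64), nn.ReLU(),
    nn.Linear(64, NUM_ACTIONS),
)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def select_action(state, epsilon=0.1):
    """Epsilon-greedy: mostly follow the network, sometimes explore."""
    if random.random() < epsilon:
        return random.randrange(NUM_ACTIONS)
    with torch.no_grad():
        return q_net(state).argmax().item()

# One Q-learning update on a single (state, action, reward, next_state) sample.
state, next_state = torch.randn(STATE_DIM), torch.randn(STATE_DIM)
action, reward = select_action(state), 1.0

target = reward + GAMMA * q_net(next_state).max().detach()   # bootstrapped target
loss = (q_net(state)[action] - target) ** 2
optimizer.zero_grad(); loss.backward(); optimizer.step()
```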
00:52:11.800 | So AlphaGo in 2016 achieved a monumental task that, when I first started in artificial intelligence, I was told was impossible for an AI system to accomplish:
00:52:21.800 | winning at the game of Go against the top human player in the world.
00:52:28.800 | However, that method was trained on human expert positions.
00:52:33.800 | The AlphaGo system was trained on previous games played by human experts.
00:52:38.800 | And in an incredible accomplishment, AlphaGo Zero in 2017 was able to beat AlphaGo and many of its variants by playing itself, from zero information.
00:52:57.800 | So no knowledge of human experts, no games, no training data, very little human input.
00:53:07.800 | And what's more, it was able to generate moves that were surprising to human experts.
00:53:14.800 | I think it was Einstein who said that the key mark of intelligence is imagination.
00:53:23.800 | I think it's beautiful to see an artificial intelligence system come up with something that surprises human experts.
00:53:30.800 | Truly surprises.
00:53:35.800 | For the gambling junkies, DeepStack and a few other variants have been used in 2017 to win at heads-up poker.
00:53:44.800 | Again, another incredible result that I was always told would be impossible for any machine learning method to achieve.
00:53:53.800 | It was able to beat a professional player, and several competitors have come along since.
00:53:59.800 | We're yet to be able to win in a tournament setting, so with multiple players.
00:54:04.800 | For those of you familiar, heads-up poker is one-on-one; it's a much, much smaller, easier space to solve.
00:54:11.800 | There are a lot more human-to-human dynamics going on when there are multiple players.
00:54:16.800 | But that's the task for 2018.
00:54:21.800 | And the drawbacks: this is one of my favorite videos, I show it often, of Coast Runners.
00:54:28.800 | For these deep reinforcement learning approaches, the definition of the reward function controls how the actual system behaves.
00:54:42.800 | And this will be extremely important for us with autonomous vehicles.
00:54:47.800 | Here, the boat is tasked with gaining the highest number of points, and it figures out that it does not need to race, which is the whole point of the game, in order to gain points,
00:54:58.800 | but instead picks up green circles that regenerate themselves over and over.
00:55:05.800 | This is the counterintuitive behavior of a system that would not be expected when you first design the reward function.
00:55:16.800 | And even though this is a very formal, simple system, it is nevertheless extremely difficult to come up with a reward function that makes it operate in the way you expect it to operate.
00:55:27.800 | Very applicable for autonomous vehicles.
00:55:30.800 | And of course, on the perception side, as I mentioned with the ostrich and the dog,
00:55:35.800 | with a little bit of noise, we can predict with 99.6% confidence that the noise up top is a robin, a cheetah, an armadillo, a lesser panda.
00:55:45.800 | These are outputs from actual state-of-the-art neural networks, taking in the noise and producing a confident prediction.
00:55:55.800 | It should build our intuition to understand that the visual characteristics, the spatial characteristics of an image, do not necessarily convey the level of hierarchy necessary to function in this world.
00:56:11.800 | In a similar way, with a dog and an ostrich, and everything as an ostrich,
00:56:16.800 | a network, with a little bit of noise, can confidently make the wrong prediction:
00:56:21.800 | thinking a school bus is an ostrich, and a speaker is an ostrich.
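One common way such fooling inputs are constructed is the fast gradient sign method. Here is a hedged sketch with a stand-in linear classifier and an assumed epsilon, just to show the mechanics of nudging pixels along the loss gradient.

```python
# Fast gradient sign method (FGSM) sketch: perturb each pixel slightly in the
# direction that increases the loss for the true class.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))   # stand-in classifier
loss_fn = nn.CrossEntropyLoss()

image = torch.rand(1, 3, 32, 32, requires_grad=True)   # stand-in for a real photo
true_label = torch.tensor([3])                          # its assumed correct class

loss = loss_fn(model(image), true_label)
loss.backward()

epsilon = 0.01                                          # barely visible perturbation
adversarial = (image + epsilon * image.grad.sign()).clamp(0, 1)

# To the eye the two images look identical, yet the predictions can differ sharply.
print(model(image).argmax(dim=1), model(adversarial).argmax(dim=1))
```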
00:56:28.800 | They're easily fooled, but not really, because they perform the tasks that they were trained to do well.
00:56:37.800 | So we have to make sure we keep our intuition optimized to the way machines learn,
00:56:46.800 | not the way humans have learned over the 540 million years of data that we've gained through developing the eye, through evolution.
00:56:55.800 | The current challenges we're taking on: first, transfer learning.
00:57:00.800 | There's a lot of success in transfer learning between domains that are very close to each other, so image classification from one domain to the next.
00:57:09.800 | There's a lot of value in forming representations of the way natural scenes look in order to do scene segmentation, in the driving case for example.
00:57:19.800 | But we're not able to make any bigger leaps in the way we perform transfer learning.
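The close-domain transfer described here often looks like the following sketch: take an ImageNet-pretrained backbone from torchvision, freeze it, and train only a new classification head. The four-class target domain and the single toy training step are assumptions for illustration.

```python
# Transfer learning sketch: reuse pretrained features, retrain only a new head.
import torch
import torch.nn as nn
import torchvision

model = torchvision.models.resnet18(pretrained=True)    # source-domain representation

for param in model.parameters():                         # keep the learned features fixed
    param.requires_grad = False

NUM_TARGET_CLASSES = 4                                   # assumed for the new domain
model.fc = nn.Linear(model.fc.in_features, NUM_TARGET_CLASSES)  # new, trainable head

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One training step on a (toy) batch from the target domain.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, NUM_TARGET_CLASSES, (8,))
loss = loss_fn(model(images), labels)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```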
00:57:27.800 | The biggest challenge for deep learning is to generalize, generalize across domains.
00:57:32.800 | It lacks the ability to reason in the way that we've defined understanding previously,
00:57:37.800 | which is the ability to turn complex information into simple, useful information:
00:57:44.800 | to convert domain-specific, complicated sensory information, even when it doesn't relate to the initial training set.
00:57:53.800 | That's the open challenge for deep learning: train on very little data, and then go and reason and operate in the real world.
00:58:00.800 | Right now,
00:58:01.800 | neural networks are very inefficient.
00:58:03.800 | They require big data.
00:58:05.800 | They require supervised data,
00:58:07.800 | which means they need costly human input.
00:58:11.800 | They're not fully automated, despite the fact that feature learning, incredibly, the big breakthrough, is performed automatically.
00:58:19.800 | You still have to do a lot of design of the actual architecture of the network, and all the different hyperparameter tuning needs to be performed.
00:58:27.800 | Human input, perhaps a little more educated human input in the form of PhD students, postdocs, and faculty, is required to tune these hyperparameters.
00:58:38.800 | But nevertheless, human input is still necessary.
00:58:41.800 | They cannot be left alone, for the most part.
00:58:46.800 | The reward: defining the reward, as we saw with Coast Runners, is extremely difficult for systems that operate in the real world.
00:58:53.800 | Transparency quite possibly is not an important one, but neural networks currently are black boxes, for the most part.
00:59:01.800 | Except through a few successful visualization methods that visualize different aspects of the activations,
00:59:08.800 | they're not able to reveal to us humans why they work or where they fail.
00:59:16.800 | And this is a philosophical question for autonomous vehicles: we may not care, as human beings, if a system works well enough.
00:59:24.800 | But I would argue that it'll be a long time before systems work well enough that we don't care.
00:59:32.800 | We'll care, and we'll have to work together with these systems.
00:59:35.800 | And that's where transparency, communication, and collaboration are critical.
00:59:39.800 | And edge cases: it's all about edge cases.
00:59:42.800 | In robotics, in autonomous vehicles, 99.9% of driving is really boring. It's the same.
00:59:51.800 | Especially highway driving, traffic driving, it's the same.
00:59:55.800 | The obstacle avoidance, the car following, the lane centering, all these problems are trivial.
01:00:00.800 | It's the edge cases. The trillions of edge cases that need to be generalized over on a very small amount of training data.
01:00:10.800 | So again, I return to: why deep learning?
01:00:18.800 | I mentioned a bunch of challenges, and this is an opportunity.
01:00:25.800 | It's an opportunity to come up with techniques that operate successfully in this world.
01:00:35.800 | So I hope the competitions we present in this class, in the autonomous vehicle domain, will give you some insight and an opportunity to apply these methods to what in some cases are open research problems:
01:00:46.800 | with semantic segmentation for external perception, with control of the vehicle in Deep Traffic,
01:00:53.800 | and, with Deep Crash, control of the vehicle in underactuated, high-speed conditions, and driver state perception.
01:01:06.800 | So with that, I wanted to introduce deep learning to you today, before we get to the fun of autonomous vehicles tomorrow.
01:01:16.800 | So I would like to thank NVIDIA, Google, Autoliv, Toyota,
01:01:24.800 | and, at the risk of setting off people's phones, Amazon Alexa Auto.
01:01:32.800 | Truly, I would like to say that I've been humbled over the past year
01:01:39.800 | by the thousands of messages we received, by the attention, by the 18,000 competition entries,
01:01:46.800 | by the many people across the world, not just here at MIT, that are brilliant, that I've gotten a chance to interact with.
01:01:54.800 | And I hope we go bigger and do some impressive stuff in 2018.
01:01:59.800 | Thank you very much,
01:02:00.800 | and tomorrow is self-driving.
01:02:02.800 | [APPLAUSE]