MIT 6.S094: Deep Learning
Chapters
0:00 Introduction
8:14 Self-Driving Cars
14:20 Deep Learning
Thank you everyone for braving the cold and the snow to be here. 00:00:06.200 |
This is 6S094, Deep Learning for Self-Driving Cars. 00:00:12.600 |
And it's a course where we cover the topic of deep learning, 00:00:18.800 |
which is a set of techniques that have taken a leap in the last decade 00:00:24.400 |
for our understanding of what artificial intelligence systems are capable of doing. 00:00:30.800 |
And self-driving cars, which are systems that can take these techniques 00:00:36.600 |
and integrate them in a meaningful, profound way 00:00:41.200 |
into our daily lives, transforming society. 00:00:45.800 |
So that's why both of these topics are extremely important and extremely exciting. 00:00:52.400 |
My name is Lex Fridman and I'm joined by an amazing team of engineers 00:00:57.600 |
in Jack Terwilliger, Julia Kindelsberger, Dan Brown, Michael Glazer, 00:01:04.000 |
Li Ding, Spencer Dodd, and Benedikt Jenik, among many others. 00:01:14.400 |
Not just ones that perceive and move about the environment, 00:01:19.600 |
but ones that interact, communicate, and earn the trust and understanding 00:01:26.000 |
of human beings inside the car, the drivers and the passengers, 00:01:32.600 |
the pedestrians and other drivers and cyclists. 00:01:39.200 |
The website for this course is selfdrivingcars.mit.edu. 00:01:43.600 |
If you have questions, email deepcars@mit.edu. 00:01:51.800 |
If you're a registered MIT student, you have to register on the website. 00:01:57.200 |
And by midnight, Friday, January 19th, build a neural network 00:02:03.800 |
and submit it to the competition; it must achieve a speed of 65 miles per hour. 00:02:11.600 |
It's much harder and much more interesting than last year's. 00:02:24.600 |
There are guest speakers that come from Waymo, Google, Tesla. 00:02:31.800 |
And those are starting new autonomous vehicle startups 00:02:50.800 |
For those of you who brave the snow and continue to do so, 00:02:54.600 |
towards the end of the class, there will be free shirts. 00:02:57.800 |
Yes, I said free and shirts in the same sentence. 00:03:07.800 |
There's a lot of updates and we'll cover those on Wednesday. 00:03:11.200 |
It's a deep reinforcement learning competition. 00:03:13.800 |
Last year, we received over 18,000 submissions. 00:03:23.000 |
Not only can you control one car with your network, 00:03:27.800 |
This is multi-agent deep reinforcement learning. 00:03:34.200 |
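The update rule at the heart of such a deep reinforcement learning agent can be sketched with plain tabular Q-learning. Everything below (the state encoding, the actions, the reward for "staying" in one state) is a hypothetical toy, not the actual competition environment, and a real agent would replace the table with a neural network:

```python
import numpy as np

# Toy sketch of the Q-learning idea behind a deep RL driving agent.
# States, actions, and rewards here are invented stand-ins, not the
# actual competition environment.

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3     # e.g. traffic-pattern buckets x {left, stay, right}
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.5, 0.9        # learning rate, discount factor

def step(state, action):
    """Toy transition: 'staying' (action 1) in state 4 earns reward 1."""
    reward = 1.0 if (state == 4 and action == 1) else 0.0
    return int(rng.integers(n_states)), reward

state = 0
for _ in range(2000):
    # Epsilon-greedy: mostly exploit the table, sometimes explore.
    if rng.random() > 0.2:
        action = int(np.argmax(Q[state]))
    else:
        action = int(rng.integers(n_actions))
    next_state, reward = step(state, action)
    # Bellman update: move Q(s, a) toward reward + gamma * max_a' Q(s', a')
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
    state = next_state

print(int(np.argmax(Q[4])))  # the learned best action in state 4
```

A deep RL agent like those submitted to the competition keeps exactly this update, but approximates the Q table with a network so it can handle states far too numerous to enumerate.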
Second, SegFuse, Dynamic Driving Scene Segmentation competition. 00:03:44.800 |
the kinematics of the vehicle, so the movement of the vehicle, 00:03:52.200 |
For the training set, you're given ground truth labels, 00:04:05.200 |
than the state-of-the-art in image-based segmentation. 00:04:21.800 |
in the physical space, not only must interpret 00:04:26.600 |
the spatial visual characteristics of a scene. 00:04:29.200 |
They must also interpret, understand, and track 00:04:34.000 |
This competition is about temporal propagation of information, 00:05:01.400 |
where a car knowing nothing is using a monocular camera 00:05:05.000 |
as a single input, driving over 30 miles an hour, 00:05:08.400 |
through a scene it has very little control through, 00:05:24.600 |
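The two ingredients described here, per-pixel labeling and temporal propagation, can be sketched in a few lines. The class scores below are random stand-ins for a segmentation network's output, and the frame-averaging scheme is an illustrative assumption, not the SegFuse method itself:

```python
import numpy as np

# Sketch of per-pixel scene segmentation with simple temporal smoothing.
# Scores are random stand-ins for a real network's output.

rng = np.random.default_rng(1)
H, W, n_classes = 4, 6, 3          # tiny "image" and label set (e.g. road/car/person)

# Per-frame class scores a network might emit for two consecutive frames;
# the second frame's scene has barely changed.
scores_t0 = rng.normal(size=(H, W, n_classes))
scores_t1 = scores_t0 + 0.1 * rng.normal(size=(H, W, n_classes))

# Per-frame segmentation: argmax over classes at each pixel.
labels_t1 = scores_t1.argmax(axis=-1)

# Temporal propagation (simplified): average scores across frames before
# the argmax, which suppresses single-frame label flicker.
labels_smoothed = ((scores_t0 + scores_t1) / 2).argmax(axis=-1)

print(labels_smoothed.shape)  # (4, 6): one class label per pixel
```

A real entry would propagate information with motion estimates rather than naive averaging, but the structure is the same: spatial labels per frame, then consistency enforced across time.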
This competition will result in four submissions, 00:05:50.800 |
which uses the large-scale naturalistic driving dataset; 00:05:54.000 |
we have to train a neural network to do end-to-end steering 00:05:57.800 |
that takes in monocular video from the forward roadway, 00:06:10.000 |
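The end-to-end mapping, pixels in and a steering command out, can be sketched as below. A real system uses a convolutional network trained on the driving data; this toy uses a single linear layer with random, untrained weights, and the 66x200 frame size is an assumption borrowed from common end-to-end steering papers:

```python
import numpy as np

# Minimal sketch of end-to-end steering: one camera frame in, one
# normalized steering angle out. Weights are random and untrained,
# so only the input/output structure is meaningful here.

rng = np.random.default_rng(2)

H, W = 66, 200                      # assumed frame size, not from the lecture
frame = rng.random((H, W))          # stand-in for a grayscale camera frame

weights = rng.normal(scale=0.001, size=H * W)
bias = 0.0

def predict_steering(frame):
    """Map a frame to a steering angle in [-1, 1] (normalized by tanh)."""
    return float(np.tanh(frame.ravel() @ weights + bias))

angle = predict_steering(frame)
print(-1.0 <= angle <= 1.0)  # True: the output is always a valid angle
```

Training replaces the random weights by regressing the network's output against the human steering recorded in the naturalistic data.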
Tomorrow we'll talk about autonomous vehicles. 00:06:15.800 |
Driving scene understanding, so segmentation. 00:06:29.400 |
that's truly taking huge strides in fully autonomous vehicles. 00:06:33.600 |
They're taking the fully L4, L5 autonomous vehicle approach, 00:07:12.000 |
we are going to talk about the topic of our research, 00:07:16.000 |
and my personal fascination is deep learning, 00:07:32.600 |
He is now the CEO of autonomous vehicle startup Voyage, 00:08:06.600 |
the self-driving car startup that I mentioned, 00:08:11.200 |
that has now partnered with NVIDIA and many others. 00:08:40.200 |
Wide-reaching, because there are one billion cars on the road, 00:09:13.000 |
into the hands of an artificial intelligence system. 00:09:20.200 |
you can Google "first time with Tesla Autopilot," 00:09:46.600 |
and the profound, the life-critical nature of it, 00:10:00.400 |
that we cannot escape, considering the human being. 00:10:03.600 |
That autonomous vehicle must not only perceive 00:10:06.400 |
and control its movement through the environment, 00:10:27.000 |
an autonomous vehicle is more of a personal robot 00:10:31.000 |
than it is a perfect perception control system. 00:10:58.800 |
that effectively transfer control to human beings, 00:11:26.200 |
navigating through the streets of Boston is easy. 00:11:33.400 |
and you're late, or you're sick of the person in front of you, 00:11:37.200 |
that you want to go into the opposing lane and speed up. 00:11:47.400 |
can't escape human nature; they must work with it. 00:11:53.200 |
we'll talk about next week, for cognitive load. 00:11:56.400 |
Where we take the raw video: 3D convolutional neural networks 00:12:00.600 |
take in the eye region, the blinking, and the pupil movement 00:12:04.600 |
to determine the cognitive load of the driver. 00:12:06.800 |
We'll see how we can detect everything about the driver, 00:12:09.800 |
where they're looking, emotion, cognitive load, 00:12:25.600 |
that it almost requires human level intelligence. 00:12:30.000 |
That, as I said, the two-, three-, four-decade-out 00:12:34.000 |
journey for artificial intelligence researchers 00:12:37.400 |
to achieve full autonomy will require 00:12:40.400 |
solving some of the fundamental problems, 00:12:46.600 |
And, that's something we'll discuss in much more depth, 00:12:53.200 |
for the artificial general intelligence course. 00:12:58.200 |
Ray Kurzweil, Marc Raibert from Boston Dynamics, 00:13:17.000 |
the human-centered artificial intelligence approach, 00:13:20.200 |
in every algorithmic design considers the human. 00:13:37.000 |
can handle 90 percent, and increasing, of the cases. 00:13:59.600 |
Thank you, I didn't know it last year, I know now. 00:14:38.400 |
that learn from data, learn from real world data. 00:15:02.000 |
with the human, and the human robot interaction. 00:15:53.800 |
to be able to do something interesting with it. 00:15:59.400 |
which is most capable and focused on this task. 00:16:26.400 |
is the basic example of image classification. 00:16:39.800 |
higher and higher order representations are formed. 00:16:48.200 |
semantic classification of what's in the image. 00:17:38.600 |
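The idea of layers forming higher and higher order representations, ending in a semantic class label, can be sketched as a bare forward pass. The weights here are random, so the "classification" is meaningless; the sizes (an 8x8 image, 5 classes) are invented for illustration:

```python
import numpy as np

# Sketch of stacked layers turning raw pixels into class probabilities.
# Random weights; only the structure of the computation is the point.

rng = np.random.default_rng(3)

x = rng.random(64)                         # flattened 8x8 "image"
W1 = rng.normal(scale=0.1, size=(32, 64))  # layer 1: low-level features
W2 = rng.normal(scale=0.1, size=(5, 32))   # layer 2: scores for 5 classes

h = np.maximum(0.0, W1 @ x)                # ReLU: keep only activated features
scores = W2 @ h                            # higher-order representation -> scores
probs = np.exp(scores - scores.max())
probs /= probs.sum()                       # softmax: scores -> probabilities

print(probs.sum())  # 1.0: a proper distribution over the 5 classes
```

Each layer's output is the next layer's input, which is exactly how edges become textures, textures become parts, and parts become the semantic label for the whole image.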
that separates green triangles and blue circles, 00:17:43.000 |
on the left, the task is much more difficult, 00:18:38.200 |
is what we're able to achieve with deep learning. 00:24:48.800 |
from multiple neurons being connected together, 00:25:26.800 |
that essentially memorize patterns in the data. 00:26:36.800 |
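A single one of those connected neurons, memorizing a pattern in data, can be shown end to end with the classic perceptron rule. The AND truth table below is a toy stand-in for "patterns in the data," chosen only because one neuron can learn it:

```python
import numpy as np

# One artificial neuron: weighted sum of inputs, then a threshold.
# Trained with the perceptron rule to memorize the AND function.

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)   # AND truth table

w = np.zeros(2)
b = 0.0
for _ in range(10):                        # a few passes over the data
    for xi, target in zip(X, y):
        pred = 1.0 if xi @ w + b > 0 else 0.0
        err = target - pred
        w += err * xi                      # nudge weights toward the target
        b += err

preds = [1.0 if xi @ w + b > 0 else 0.0 for xi in X]
print(preds)  # [0.0, 0.0, 0.0, 1.0]
```

Connecting many such units, and replacing the hard threshold with smooth activations so gradients can flow, is what turns this memorizer into the deep networks the course is about.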
And with the unsupervised learning approaches, 00:26:41.800 |
the possibilities of artificial intelligence lie. 00:27:46.800 |
And then there's general purpose intelligence, 00:30:10.800 |
We'll discuss that in a little bit more detail, 00:36:07.800 |
in order, and then to not be able to generalize, 00:38:40.800 |
as opposed to putting the weight on one of the edges. 00:39:00.800 |
because they're useful to some of the competitions, 00:39:07.800 |
to play around with some of these parameters. 00:39:38.800 |
dominating the artificial intelligence community. 00:40:32.800 |
as opposed to having to implement stuff from scratch, 00:41:06.800 |
even in a deep reinforcement learning formulation, 00:41:27.800 |
That's several orders of magnitude less data. 00:42:59.800 |
there are a lot of different kinds of objects: 00:43:01.800 |
cats, dogs, cars, bicyclists, pedestrians.