MIT 6.S094: Introduction to Deep Learning and Self-Driving Cars

00:00:00.000 | All right, hello everybody. Hopefully you can hear me well. Yes, yes, great.

00:00:07.400 | So welcome to course 6S094, Deep Learning for Self-Driving Cars.

00:00:15.600 | We will introduce to you the methods of deep learning of deep neural networks

00:00:22.500 | using the guiding case study of building self-driving cars.

00:00:28.800 | My name is Lex Friedman.

00:00:31.200 | You get to listen to me for majority of these lectures

00:00:35.600 | and I am part of an amazing team with some brilliant TAs, would you say?

00:00:42.700 | Brilliant?

00:00:43.200 | Dan, Dan Brown, you guys want to stand up?

00:00:49.300 | You're okay?

00:00:49.800 | They're in front row.

00:00:50.800 | Spencer, William Angel, Spencer Dodd, and all the way in the back,

00:00:57.600 | the smartest and the tallest person I know, Benedict Jenek.

00:01:01.500 | So what you see there on the left of the slide is a visualization of one of the two projects that,

00:01:09.600 | one of the two simulations games that we'll get to go through.

00:01:16.800 | We use it as a way to teach you about deep reinforcement learning

00:01:21.300 | but also as a way to excite you by challenging you to compete against others,

00:01:27.900 | if you wish, to win a special prize yet to be announced, super secret prize.

00:01:35.000 | So you can reach me and the TAs at deepcars@mit.edu

00:01:41.000 | if you have any questions about the tutorials, about the lecture, about anything at all.

00:01:45.400 | The website cars.mit.edu has the lecture content.

00:01:51.200 | Code tutorials again, like today, the lecture slides for today are already up in PDF form.

00:01:57.800 | The slides themselves, if you want to see them, just email me,

00:02:02.100 | but they're over a gigabyte in size because they're very heavy in videos.

00:02:06.300 | So I'm just posting the PDFs.

00:02:07.800 | And there will be lecture videos available a few days after the lecture is given.

00:02:15.800 | So speaking of which, there is a camera in the back.

00:02:19.100 | This is being videotaped and recorded, but for the most part,

00:02:23.700 | the camera is just on the speaker.

00:02:26.700 | So you shouldn't have to worry.

00:02:28.600 | If that kind of thing worries you, then you could sit on the periphery of the classroom

00:02:34.300 | or maybe I suggest sunglasses and a fake mustache, that would be a good idea.

00:02:40.100 | There is a competition for the game that you see on the left.

00:02:43.400 | I'll describe exactly what's involved.

00:02:47.100 | In order to get credit for the course, you have to design a neural network

00:02:52.200 | that drives the car just above the speed limit of 65 miles an hour.

00:02:56.200 | But if you want to win, you need to go a little faster than that.

00:02:59.700 | So who this class is for?

00:03:05.500 | You may be new to programming, new to machine learning, new to robotics,

00:03:11.900 | or you're an expert in those fields but want to go back to the basics.

00:03:17.600 | So what you will learn is an overview of deep reinforcement learning,

00:03:21.300 | convolutional neural networks, recurrent neural networks,

00:03:25.900 | and how these methods can help improve each of the components of autonomous driving.

00:03:31.800 | Perception, visual perception, localization, mapping, control planning,

00:03:37.300 | and the detection of driver state.

00:03:42.500 | Okay, two projects. Code name Deep Traffic is the first one.

00:03:46.700 | There is, in this particular formulation of it, there is seven lanes.

00:03:52.100 | It's a top view. It looks like a game but I assure you it's very serious.

00:03:59.500 | It is the agent in red.

00:04:03.300 | The car in red is being controlled by a neural network

00:04:06.800 | and we'll explain how you can control and design the various aspects,

00:04:12.100 | the various parameters of this neural network.

00:04:15.100 | And it learns in the browser.

00:04:19.100 | So this, we're using ConvNetJS,

00:04:21.900 | which is a library that is programmed by Andrej Karpathy in JavaScript.

00:04:27.000 | So amazingly, we live in a world where you can train in a matter of minutes

00:04:33.100 | a neural network in your browser.

00:04:35.200 | And we'll talk about how to do that.

00:04:37.800 | The reason we did this is so that there is very few requirements

00:04:42.800 | for get you up and started with neural networks.

00:04:46.300 | So in order to complete this project for the course,

00:04:50.800 | you don't need any requirements except to have a Chrome browser.

00:04:54.500 | And to win the competition, you don't need anything except a Chrome browser.

00:05:00.500 | The second project, code name Deep Tesla, or Tesla,

00:05:08.600 | is using data from a Tesla vehicle of the forward roadway

00:05:15.600 | and using end-to-end learning, taking the image

00:05:18.800 | and putting it into convolutional neural networks

00:05:22.400 | that directly maps a regressor that maps to a steering angle.

00:05:27.800 | So all it takes is a single image

00:05:30.600 | and it predicts a steering angle for the car.

00:05:33.600 | And we have data for the car itself

00:05:36.900 | and you get to build a neural network that tries to do better,

00:05:41.800 | tries to steer better or at least as good as the car.

00:05:45.000 | Okay, let's get started with the question

00:05:51.600 | with the thing that we understand so poorly at this time

00:05:57.200 | because it's so shrouded in mystery but it fascinates many of us.

00:06:01.800 | And that's the question of what is intelligence?

00:06:06.500 | This is from a 1996, March 1996, Time Magazine.

00:06:14.100 | And the question, can machines think, is answered below

00:06:19.000 | with "they already do, so what if anything is special about the human mind?"

00:06:24.800 | It's a good question for 1996, a good question for 2016,

00:06:29.900 | 17 now and the future.

00:06:33.000 | And there's two ways to ask that question.

00:06:35.700 | One is the special purpose version.

00:06:37.800 | Can an artificial intelligence system achieve a well-defined,

00:06:44.300 | specifically, formally defined finite set of goals?

00:06:49.200 | And this little diagram from a book that got me into artificial intelligence

00:06:55.700 | as a bright-eyed high school student,

00:06:57.300 | the Artificial Intelligence, a Modern Approach.

00:07:02.200 | This is a beautifully simple diagram of a system.

00:07:08.400 | It exists in an environment.

00:07:10.300 | It has a set of sensors that do the perception.

00:07:15.300 | It takes those sensors in, does something magical,

00:07:19.300 | there's a question mark there,

00:07:20.500 | and with a set of effectors, acts in the world,

00:07:23.400 | manipulates objects in that world.

00:07:26.900 | And so special purpose, we can, under this formulation,

00:07:33.300 | as long as the environment is formally defined, well-defined,

00:07:36.500 | as long as a set of goals are well-defined,

00:07:39.100 | as long as a set of actions, sensors,

00:07:42.200 | and the ways that the perception carries itself out is well-defined,

00:07:47.800 | we have good algorithms of which we'll talk about

00:07:51.500 | that can optimize for those goals.

00:07:55.500 | The question is, if we inch along this path,

00:07:58.600 | will we get closer to the general formulation,

00:08:04.200 | to the general purpose version of what artificial intelligence is?

00:08:08.500 | Can it achieve poorly defined, unconstrained set of goals

00:08:14.000 | with an unconstrained, poorly defined set of actions,

00:08:16.900 | and unconstrained, poorly defined utility functions, rewards?

00:08:24.400 | This is what human life is about.

00:08:26.200 | This is what we do pretty well most days,

00:08:28.800 | exist in a undefined, full of uncertainty world.

00:08:37.000 | So, okay, we can separate tasks into three different categories.

00:08:43.000 | Formal tasks, this is the easiest.

00:08:46.000 | It doesn't seem so, it didn't seem so at the birth of artificial intelligence,

00:08:50.400 | but that's in fact true if you think about it.

00:08:52.800 | The easiest is the formal tasks, playing board games, theorem proving,

00:08:56.800 | all the kind of mathematical logic problems that can be formally defined.

00:09:02.400 | Then there's the expert tasks.

00:09:06.200 | So this is where a lot of the exciting breakthroughs have been happening,

00:09:12.100 | where machine learning methods, data-driven methods,

00:09:16.100 | can help aid or improve on the performance of our human experts.

00:09:22.800 | This means medical diagnosis, hardware design, scheduling.

00:09:26.900 | And then there is the thing that we take for granted,

00:09:30.500 | the trivial thing, the thing that we do so easily every day,

00:09:35.500 | when we wake up in the morning,

00:09:37.000 | the mundane tasks of everyday speech, of written language,

00:09:41.600 | of visual perception, of walking,

00:09:46.200 | which we'll talk about in today's lecture,

00:09:49.300 | is a fascinatingly difficult task.

00:09:52.800 | And object manipulation.

00:09:54.300 | So the question is that we're asking here,

00:09:58.700 | before we talk about deep learning,

00:10:00.700 | before we talk about the specific methods,

00:10:02.900 | we really want to dig in and try to see what is it about driving.

00:10:09.300 | How difficult is driving?

00:10:12.900 | Is it more like chess, which you see on the left there,

00:10:17.500 | where we can formally define a set of lanes, a set of actions,

00:10:21.200 | and formulate it as this, you know, there's five set of actions,

00:10:24.800 | you can change a lane, you can avoid obstacles,

00:10:27.500 | you can formally define an obstacle,

00:10:30.000 | you can formally define the rules of the road.

00:10:32.400 | Or is there something about natural language,

00:10:37.200 | something similar to everyday conversation about driving,

00:10:40.300 | that requires a much higher degree of reasoning,

00:10:43.400 | of communication, of learning,

00:10:49.200 | of existing in this underactuated space?

00:10:52.300 | Is it a lot more than just left lane, right lane, speed up, slow down?

00:10:58.400 | So let's look at it as a chess game.

00:11:03.000 | Here's the chess pieces.

00:11:04.400 | What are the sensors we get to work with on a self,

00:11:08.100 | on an autonomous vehicle?

00:11:10.100 | And we'll get a lot more in depth on this,

00:11:13.000 | especially with the guest speakers who built many of these.

00:11:17.300 | There's radar, there's the range sensors, radar, lidar,

00:11:20.400 | that give you information about the obstacles in the environment,

00:11:24.400 | that help localize the obstacles in the environment.

00:11:28.200 | There's the visible light camera, the stereo vision,

00:11:31.700 | that gives you texture information,

00:11:34.700 | that helps you figure out not just where the obstacles are,

00:11:38.200 | but what they are, helps to classify those,

00:11:41.300 | helps to understand their subtle movements.

00:11:47.200 | Then there is the information about the vehicle itself,

00:11:49.700 | about the trajectory and the movement of the vehicle,

00:11:52.800 | that comes from the GPS and IMU sensors.

00:11:55.600 | And there is the state of, the rich state of the vehicle itself.

00:12:00.300 | What is it doing?

00:12:01.400 | What are all the individual systems doing?

00:12:04.100 | That comes from the CAN network.

00:12:06.600 | And there is one of the less studied,

00:12:11.000 | but fascinating to us on the research side,

00:12:13.400 | is audio, the sounds of the road.

00:12:17.900 | That provide the rich context of a wet road,

00:12:22.400 | the sound of a road that when it stopped raining,

00:12:25.300 | but it's still wet, the sound that it makes.

00:12:28.100 | The screeching tire and honking,

00:12:32.800 | these are all fascinating signals as well.

00:12:35.000 | And the focus of the research in our group,

00:12:38.200 | the thing that's really much under investigated,

00:12:44.500 | is the internal facing sensors.

00:12:47.400 | The driver, sensing the state of the driver.

00:12:52.200 | Where are they looking? Are they sleepy?

00:12:55.100 | The emotional state, are they in the seat at all?

00:12:58.500 | And the same with audio.

00:13:01.800 | That comes from the visual information and the audio information.

00:13:06.300 | More than that, here's the tasks.

00:13:11.300 | If you were to break into modules,

00:13:13.100 | the task of what it means to build a self-driving vehicle.

00:13:17.300 | First, you want to know where you are, where am I?

00:13:20.300 | Localization and mapping.

00:13:22.300 | You want to map the external environment,

00:13:24.900 | figure out where all the different obstacles are,

00:13:29.600 | all the entities are, and use that estimate of the environment

00:13:34.100 | to then figure out where I am, where the robot is.

00:13:38.100 | Then there's scene understanding.

00:13:40.500 | It's understanding not just the positional aspects

00:13:44.700 | of the external environment and the dynamics of it,

00:13:48.200 | but also what those entities are.

00:13:51.100 | Is it a car? Is it a pedestrian? Is it a bird?

00:13:54.000 | There's movement planning.

00:13:57.800 | Once you have kind of figured out to the best of your abilities,

00:14:01.700 | your position and the position of other entities in this world,

00:14:06.200 | there's figuring out a trajectory through that world.

00:14:09.000 | And finally, once you've figured out how to move about,

00:14:14.500 | safely and effectively through that world,

00:14:17.200 | it's figuring out what the human that's on board is doing.

00:14:20.600 | Because as I will talk about, the path to a self-driving vehicle,

00:14:25.700 | and that is hence our focus on Tesla,

00:14:28.800 | may go through semi-autonomous vehicles,

00:14:34.600 | where the vehicle must not only drive itself,

00:14:40.900 | but effectively hand over control from the car

00:14:45.200 | to the human and back.

00:14:46.900 | Okay, quick history.

00:14:50.300 | Well, there's a lot of fun stuff from the 80s and 90s, but

00:14:54.000 | the big breakthroughs came in the second DARPA Grand Challenge

00:15:02.700 | with Stanford Stanley when they won the competition,

00:15:06.400 | one of five cars that finished.

00:15:08.500 | This was an incredible accomplishment.

00:15:12.400 | In a desert race, a fully autonomous vehicle was able to complete the race

00:15:18.900 | in record time.

00:15:21.800 | The DARPA Urban Challenge in 2007,

00:15:32.000 | where the task was no longer a race to the desert,

00:15:37.600 | but through a urban environment.

00:15:41.400 | And CMU's boss, with GM, won that race.

00:15:46.700 | And a lot of that work led directly into the acceptance

00:15:55.500 | and large major industry players

00:16:00.600 | taking on the challenge of building these vehicles.

00:16:02.600 | Google, now Waymo, self-driving car.

00:16:09.100 | Tesla, with its Autopilot system and now Autopilot 2 system.

00:16:13.700 | Uber, with its testing in Pittsburgh.

00:16:17.500 | And there's many other companies,

00:16:20.400 | including one of the speakers for this course of Neutronomy,

00:16:24.000 | that are driving the wonderful streets of Boston.

00:16:29.800 | Okay, so let's take a step back.

00:16:35.400 | We have, if we think about the accomplishments in the DARPA challenge

00:16:39.700 | and if we look at the accomplishments of the Google self-driving car,

00:16:45.400 | which essentially boils the world down into a chess game.

00:16:50.000 | It uses incredibly accurate sensors

00:16:56.800 | to build a three-dimensional map of the world,

00:16:59.400 | localize itself effective in that world and move about that world

00:17:04.000 | in a very well-defined way.

00:17:08.500 | Now, what if driving, the open question is,

00:17:16.000 | if driving is more like a conversation,

00:17:18.600 | like a natural language conversation,

00:17:21.200 | how hard is it to pass the Turing test?

00:17:24.200 | The Turing test, as the popular current formulation is,

00:17:28.900 | can a computer be mistaken for a human being

00:17:33.100 | in more than 30% of the time.

00:17:34.600 | When a human is talking behind a veil,

00:17:37.700 | having a conversation with either a computer or a human,

00:17:40.500 | they mistake the other side of that conversation

00:17:43.900 | for being a human when it's in fact a computer.

00:17:47.900 | And the way you would, in a natural language,

00:17:55.600 | build a system that has successfully passed the Turing test

00:17:58.900 | is the natural language processing part

00:18:02.900 | to enable it to communicate successfully.

00:18:05.200 | So generate language and interpret language,

00:18:09.100 | then you represent knowledge, the state of the conversation,

00:18:13.300 | transferred over time.

00:18:14.700 | And the last piece, and this is the hard piece,

00:18:18.600 | is the automated reasoning.

00:18:20.100 | Is reasoning, can we teach machine learning methods to reason?

00:18:30.200 | That is something that will propagate through our discussion

00:18:33.800 | because, as I will talk about, the various methods,

00:18:40.400 | the various deep learning methods, neural networks,

00:18:44.200 | are good at learning from data.

00:18:48.100 | But they're not yet, there's no good mechanism for reasoning.

00:18:52.800 | Now, reasoning could be just something

00:18:56.800 | that we tell ourselves we do to feel special,

00:19:00.200 | better, to feel like we're better than machines.

00:19:03.700 | Reasoning may be simply something as simple as learning from data.

00:19:09.600 | We just need a larger network.

00:19:13.000 | Or there could be a totally different mechanism required

00:19:18.000 | and we'll talk about the possibilities there.

00:19:21.400 | Yes.

00:19:24.000 | Can you go back to the US for example?

00:19:25.800 | Okay, so we talked about the video,

00:19:27.400 | so which state is that?

00:19:29.400 | The top states of the US or other states?

00:19:33.200 | No, it's very difficult to find these kind of situations

00:19:36.300 | in the United States.

00:19:37.300 | So the question was, for this video,

00:19:39.600 | is it in the United States or not?

00:19:42.200 | I believe it's in Tokyo.

00:19:46.600 | So India, a few European countries,

00:19:53.900 | are much more towards the direction of

00:20:00.300 | natural language versus chess.

00:20:04.600 | In the United States, generally speaking,

00:20:08.900 | we follow rules more concretely.

00:20:11.100 | The quality of roads is better,

00:20:13.000 | the marking on the roads is better,

00:20:14.800 | so there's less requirements there.

00:20:18.000 | I'm not sure it's going to be Tokyo,

00:20:19.700 | because they drive on the left side,

00:20:21.300 | but India is going to the right side.

00:20:23.100 | So Japan is less likely to use the game.

00:20:27.100 | These cars are driving on the left side?

00:20:30.800 | No, but they drive on the right side of the road,

00:20:33.700 | just like in the US.

00:20:35.100 | I see.

00:20:36.900 | I just, okay.

00:20:38.600 | Yeah, you're right, it is, because, yep.

00:20:40.900 | Yeah, so, but it's certainly not the United States.

00:20:43.900 | I'm pretty, I'm,

00:20:46.700 | I spent quite a bit of Googling

00:20:48.200 | trying to find the United States,

00:20:50.000 | and it's difficult.

00:20:51.300 | So let's talk about

00:20:57.000 | the recent breakthroughs in machine learning,

00:21:02.000 | and what is at the core of those breakthroughs.

00:21:05.200 | It's neural networks

00:21:07.700 | that have been around for a long time,

00:21:11.900 | and I will talk about what has changed,

00:21:14.000 | what are the cool new things.

00:21:17.200 | And what hasn't changed,

00:21:18.600 | and what are its possibilities.

00:21:20.300 | But first, a neuron,

00:21:22.400 | crudely,

00:21:25.100 | is a computational building block of the brain.

00:21:30.000 | I know there's a few folks here,

00:21:32.700 | neuroscience folks.

00:21:34.000 | This is hardly a model.

00:21:38.500 | It is mostly an inspiration.

00:21:42.300 | And so,

00:21:45.200 | the human neuron

00:21:46.800 | has inspired the artificial neuron,

00:21:50.700 | the computational building block of a neural network,

00:21:54.100 | of an artificial neural network.

00:21:56.200 | Now to give you some context,

00:21:58.800 | these neurons, for both artificial and human brains,

00:22:05.300 | are interconnected.

00:22:06.900 | In the human brain, there's about,

00:22:12.700 | I believe, 10,000 outgoing connections from every neuron,

00:22:16.100 | on average.

00:22:18.400 | And they're interconnected to each other.

00:22:21.500 | The largest current,

00:22:25.600 | as far as I'm aware,

00:22:27.300 | artificial neural network

00:22:29.300 | has 10 billion of those connections, synapses.

00:22:34.300 | Our human brain, to the best estimate,

00:22:37.600 | that I'm aware of,

00:22:39.600 | has

00:22:41.300 | 10,000 times that.

00:22:44.900 | So,

00:22:46.600 | 100 to 1,000 trillion synapses.

00:22:50.700 | Now what is an artificial neuron?

00:22:59.200 | This building block of a neural network.

00:23:02.900 | It takes a set of inputs,

00:23:06.000 | it puts a weight on each of those inputs,

00:23:10.400 | sums them together,

00:23:11.500 | applies a bias value on each,

00:23:16.600 | that sits on each neuron,

00:23:18.900 | and using an activation function that takes as input

00:23:23.600 | that sum plus the bias

00:23:27.600 | and squishes it together

00:23:29.700 | to produce a zero to one signal.

00:23:38.800 | And this allows us, a single neuron,

00:23:41.200 | take a few inputs and produces an output,

00:23:46.400 | a classification, for example, a zero one.

00:23:50.200 | And as we'll talk about,

00:23:53.100 | simply,

00:23:54.300 | it can serve as a linear classifier.

00:23:59.500 | So it can draw a line,

00:24:01.800 | it can learn to draw a line between,

00:24:05.400 | like what's seen here, between the blue dots,

00:24:08.700 | and the yellow dots.

00:24:10.500 | And that's exactly what we'll do

00:24:12.500 | in the IPython notebook that I'll talk about.

00:24:15.300 | But the basic algorithm is,

00:24:19.600 | you initialize the weights on the inputs,

00:24:23.600 | and you compute the output.

00:24:28.800 | You perform this previous operation I talked about, sum up,

00:24:33.000 | compute the output.

00:24:34.800 | And if the output,

00:24:37.900 | does not match the ground truth,

00:24:40.400 | the expected output, the output that it should produce,

00:24:44.600 | the weights are punished accordingly.

00:24:47.400 | And we'll talk through a little bit of the math of that.

00:24:51.800 | And this process is repeated until the perceptron

00:24:57.200 | does not make any more mistakes.

00:25:00.500 | Now here's,

00:25:06.300 | the amazing thing about neural networks.

00:25:09.800 | There's several, I'll talk about them.

00:25:12.000 | One on the mathematical side,

00:25:18.000 | is the universality of neural networks.

00:25:21.500 | With just a single layer, if we stack them together,

00:25:24.300 | a single hidden layer,

00:25:25.600 | the inputs on the left, the outputs on the right,

00:25:29.600 | and in the middle there's a single hidden layer.

00:25:32.800 | It can,

00:25:34.700 | closely approximate any function.

00:25:37.500 | Any function.

00:25:39.100 | So this is an incredible property.

00:25:42.900 | That with a single layer,

00:25:47.100 | any function,

00:25:48.800 | you can think of,

00:25:51.000 | that,

00:25:52.900 | you know, you can think of driving as a function.

00:25:56.500 | It takes an input,

00:25:57.800 | the world outside as output,

00:26:01.400 | the control of the vehicle.

00:26:04.800 | There exists a neural network out there that can drive,

00:26:07.500 | perfectly.

00:26:08.700 | It's a fascinating mathematical fact.

00:26:11.900 | So we can think of this then, these functions as a special purpose,

00:26:20.900 | function, special purpose intelligence.

00:26:23.100 | You can take,

00:26:24.700 | say as input,

00:26:26.000 | the number of bedrooms,

00:26:28.900 | the square feet,

00:26:30.200 | the,

00:26:31.900 | type of neighborhood,

00:26:34.100 | those are the three inputs.

00:26:35.200 | It,

00:26:36.900 | passes that value through to the hidden layer,

00:26:40.900 | and then one more step,

00:26:42.800 | it produces the final price estimate,

00:26:45.200 | for the house,

00:26:46.300 | or for the residence.

00:26:47.900 | And we can teach a network to do this pretty well,

00:26:51.900 | in a supervised way.

00:26:53.500 | This is supervised learning.

00:26:54.900 | You provide,

00:26:56.500 | a lot of examples,

00:26:58.100 | where you know the number of bedrooms, the square feet,

00:27:00.900 | the type of neighborhood,

00:27:02.500 | and then you also know the final price,

00:27:04.600 | of the house,

00:27:05.900 | or the residence.

00:27:07.200 | And then you can,

00:27:08.800 | as I'll talk about through,

00:27:11.000 | a process of back propagation,

00:27:13.000 | teach these networks,

00:27:14.300 | to,

00:27:16.000 | make this prediction,

00:27:17.500 | pretty well.

00:27:18.600 | Now,

00:27:21.100 | some of the exciting breakthroughs recently,

00:27:24.200 | have been in the general purpose intelligence.

00:27:28.800 | This is from our,

00:27:31.200 | this is from Andrej Karpathy,

00:27:33.400 | who is now at OpenAI.

00:27:35.800 | I,

00:27:38.400 | would like,

00:27:39.800 | to,

00:27:41.300 | take a moment here,

00:27:43.100 | to try to explain how amazing this is.

00:27:45.600 | This is a game of Pong.

00:27:47.500 | If you're not familiar with Pong,

00:27:51.400 | there's two paddles,

00:27:54.000 | and you're trying to,

00:27:55.900 | bounce the ball back,

00:27:59.400 | and in such a way that,

00:28:01.100 | prevents the other guy from bouncing the ball back at you.

00:28:05.600 | On the,

00:28:07.000 | the,

00:28:08.800 | the artificial intelligence agents on the right in green,

00:28:13.600 | and up top is the score,

00:28:15.500 | eight to one.

00:28:16.700 | Now this takes,

00:28:18.400 | about three days to train,

00:28:20.400 | on a regular computer,

00:28:22.000 | this network.

00:28:22.900 | What is,

00:28:24.200 | this network doing?

00:28:26.400 | It's called the policy network.

00:28:29.000 | The input,

00:28:29.800 | is the raw,

00:28:31.000 | pixels.

00:28:32.000 | The,

00:28:33.000 | they're, they're slightly,

00:28:34.600 | processed and also you take the difference between,

00:28:38.600 | two frames,

00:28:40.900 | but it's basically the raw pixel information.

00:28:43.600 | That's the input.

00:28:45.300 | There's,

00:28:46.800 | a few hidden layers,

00:28:48.500 | and the output is a single probability of moving up.

00:28:52.200 | That, that's it.

00:28:55.700 | That's,

00:28:56.300 | that's the whole,

00:28:57.600 | that's, that's the whole system.

00:28:59.700 | And what it's doing is,

00:29:02.700 | it learns,

00:29:05.000 | not,

00:29:06.700 | you don't know,

00:29:08.500 | at any one moment,

00:29:10.000 | you don't know what the right thing to do is.

00:29:13.300 | Is it to move up? Is it to move down?

00:29:15.500 | You only know,

00:29:17.100 | what the right thing to do is,

00:29:19.900 | by the fact that eventually you win or lose the game.

00:29:23.000 | So this is the amazing thing here,

00:29:26.800 | is there's no supervised learning about,

00:29:30.200 | there's no like,

00:29:31.600 | universal fact about,

00:29:33.600 | any one state being good or bad,

00:29:35.800 | and any one action being good or bad in any state.

00:29:38.600 | But if you punish,

00:29:40.800 | or reward,

00:29:41.800 | every single action you took,

00:29:44.000 | every single action you took,

00:29:45.800 | for entire game,

00:29:47.800 | based on the result.

00:29:49.900 | So no matter what you did, if you won the game,

00:29:53.200 | the end justifies the means.

00:29:56.600 | If you won the game,

00:29:57.700 | every action you took, and every action,

00:30:00.300 | state pair, gets rewarded.

00:30:03.000 | If you lost the game,

00:30:04.200 | it gets punished.

00:30:05.400 | And this process,

00:30:07.800 | with only 200,000 games,

00:30:10.400 | where the,

00:30:12.200 | system just simulates the games,

00:30:14.200 | it can learn to beat the computer.

00:30:17.200 | This system knows nothing about Pong,

00:30:21.200 | nothing about games.

00:30:23.000 | This is general intelligence.

00:30:27.000 | Except for the fact,

00:30:28.800 | that it's just a game of Pong.

00:30:31.400 | And I will,

00:30:33.400 | talk about,

00:30:36.200 | how this can,

00:30:38.200 | be extended further, why this is so promising,

00:30:41.200 | and why this is also,

00:30:43.400 | we should proceed with caution.

00:30:47.200 | So again,

00:30:49.200 | there's a set of actions you take,

00:30:52.500 | up, down, up, down, based on the output of the network.

00:30:54.800 | There's a threshold,

00:30:56.300 | given the probability of moving up,

00:30:57.800 | you move up or down based on the output of the network.

00:31:00.200 | And you have a set of states.

00:31:04.600 | And every single state action pair is rewarded if there's a win,

00:31:09.200 | and it's punished,

00:31:10.600 | if there's a loss.

00:31:11.800 | When you go home,

00:31:18.000 | think about how amazing that is.

00:31:22.000 | And if you don't understand why that's amazing,

00:31:25.200 | spend some time on it.

00:31:26.600 | It's incredible.

00:31:28.300 | Sure, sure thing.

00:31:36.600 | The question was,

00:31:38.600 | what is supervised learning, what is unsupervised learning,

00:31:41.600 | what's the difference?

00:31:42.400 | So supervised learning,

00:31:44.500 | is, when people talk about machine learning,

00:31:47.100 | they mean supervised learning most of the time.

00:31:49.000 | Supervised learning is,

00:31:55.100 | learning from data.

00:31:56.500 | It's learning from example.

00:31:58.700 | When you have a set of inputs and a set of outputs,

00:32:01.200 | that you know are correct,

00:32:03.300 | what are called ground truth.

00:32:04.700 | So you need those examples,

00:32:08.100 | a large amount of them,

00:32:09.800 | to train any of the machine learning algorithms,

00:32:12.600 | to learn to then generalize that to future examples.

00:32:17.300 | This is,

00:32:23.800 | actually there's a third one called reinforcement learning,

00:32:26.400 | where the ground truth is sparse.

00:32:32.000 | The information about,

00:32:34.400 | when something is good or not,

00:32:37.900 | the ground truth only happens every once in a while,

00:32:40.500 | at the end of the game,

00:32:41.300 | not every single frame.

00:32:42.900 | And unsupervised learning is when you have no information,

00:32:46.900 | about the outputs,

00:32:48.600 | that are correct or incorrect.

00:32:52.000 | And it is the excitement,

00:32:55.000 | of the deep learning community,

00:32:57.400 | is unsupervised learning.

00:32:58.600 | But it has achieved no major breakthroughs at this point.

00:33:03.000 | This is the,

00:33:05.200 | I'll talk about what the future of deep learning is,

00:33:07.400 | and a lot of the people that are working in the field,

00:33:10.100 | are excited by it.

00:33:11.100 | But right now,

00:33:12.500 | any interesting accomplishment,

00:33:14.800 | has to do with supervised learning.

00:33:19.000 | [INAUDIBLE]

00:33:24.300 | And the brown one is just a heuristic solution,

00:33:27.800 | like look at the velocity.

00:33:29.900 | So basically the reinforcement learning here,

00:33:34.100 | is learning from somebody who has certain rules.

00:33:38.400 | And how can that be guaranteed,

00:33:42.600 | that it would generalize to somebody else?

00:33:47.100 | So the question was,

00:33:49.900 | the green paddle learns to play this game successfully,

00:33:56.500 | against this specific one brown paddle,

00:33:58.900 | operating under specific kinds of rules.

00:34:01.100 | How do we know it can generalize to other games,

00:34:04.900 | other things?

00:34:05.700 | And it can't.

00:34:07.000 | But the mechanism by which it learns generalizes.

00:34:11.200 | So the question is,

00:34:13.400 | how do we know that it can generalize to other games,

00:34:16.400 | so as long as you let it play,

00:34:19.200 | as long as you let it play in whatever world you want it to succeed in,

00:34:27.200 | long enough,

00:34:28.500 | it will use the same approach to learn to succeed in that world.

00:34:33.300 | The problem is,

00:34:34.900 | this works for worlds you can simulate well.

00:34:38.700 | Unfortunately, one of the big challenges of neural networks,

00:34:45.900 | is that they're not currently efficient learners.

00:34:48.400 | We need a lot of data to learn anything.

00:34:50.900 | Human beings need one example,

00:34:53.800 | oftentimes, and they learn very efficiently from that one example.

00:34:58.000 | And again, I'll talk about that as well.

00:35:03.300 | It's a good question.

00:35:04.600 | So the drawbacks of neural networks.

00:35:07.900 | So if you think about the way a human being would approach this game,

00:35:12.200 | this game of Pong,

00:35:14.400 | they would only need a simple set of instructions.

00:35:16.700 | You're in control of a paddle,

00:35:19.300 | and you can move it up and down.

00:35:21.800 | And your task is to bounce the ball past the other player,

00:35:25.700 | controlled by AI.

00:35:27.900 | Now, human being would immediately,

00:35:33.700 | they may not win the game,

00:35:34.900 | but they would immediately understand the game.

00:35:36.700 | And will be able to successfully play it well enough

00:35:39.700 | to pretty quickly learn to beat the game.

00:35:44.300 | But they need to have a concept of control.

00:35:46.600 | What it means to control a paddle.

00:35:48.200 | They need to have a concept of a paddle.

00:35:49.900 | They need to have a concept of moving up and down,

00:35:52.700 | and a ball, and bouncing.

00:35:55.700 | They have to know,

00:35:56.400 | they have to have at least a loose concept of real-world physics,

00:36:00.300 | that they can then project that real-world physics

00:36:03.000 | onto the two-dimensional world.

00:36:04.600 | All of these concepts are

00:36:06.700 | are concepts that you come to the table with.

00:36:10.000 | That's knowledge.

00:36:13.400 | And the kind of way you transfer that knowledge from

00:36:16.900 | from your previous experience,

00:36:19.800 | from childhood to now,

00:36:22.200 | when you come to this game,

00:36:23.400 | that is something is called reasoning.

00:36:27.200 | Whatever reasoning means.

00:36:29.700 | And the question is whether through this same kind of process,

00:36:34.200 | you can see the entire world

00:36:37.800 | as a game of pong.

00:36:43.200 | And reasoning is simply ability to simulate

00:36:47.100 | that game in your mind

00:36:50.100 | and learn very efficiently,

00:36:53.100 | much more efficiently than 200,000 iterations.

00:36:55.700 | The other challenge of deep neural networks

00:37:00.900 | and machine learning broadly is you need big data

00:37:03.500 | and efficient learners, as I said.

00:37:05.400 | That data also needs to be supervised data.

00:37:08.900 | You need to have ground truth,

00:37:11.000 | which is very costly for,

00:37:13.100 | so annotation,

00:37:15.100 | a human being looking at a particular image, for example,

00:37:19.700 | and labeling that as something,

00:37:21.300 | as a cat or a dog,

00:37:23.300 | whatever objects is in the image,

00:37:25.000 | that's very costly.

00:37:26.200 | And for particularly for neural networks,

00:37:31.200 | there's a lot of parameters to tune.

00:37:36.200 | There's a lot of hyperparameters.

00:37:38.800 | You need to figure out the network structure first.

00:37:42.000 | How does this network look?

00:37:43.300 | How many layers?

00:37:44.200 | How many hidden nodes?

00:37:45.300 | What type of activation function in each node?

00:37:52.300 | There's a lot of hyperparameters there.

00:37:54.200 | And then once you built your network,

00:37:56.100 | there's parameters for how you teach that network.

00:38:00.400 | There's learning rate, loss function,

00:38:03.300 | mini-batch size, number of training iterations,

00:38:07.200 | gradient update smoothing,

00:38:09.000 | and selecting even the optimizer with which you,

00:38:14.100 | with which you solve the various differential equations involved.

00:38:20.000 | It's a topic of many research papers, certainly.

00:38:28.200 | It's rich enough for research papers,

00:38:30.100 | but it's also really challenging.

00:38:31.900 | It means that you can't just plop a network down

00:38:35.000 | and it will solve the problem generally.

00:38:37.700 | And defining a good loss function,

00:38:42.600 | or in the case of Pong or games,

00:38:45.300 | a good reward function is difficult.

00:38:49.400 | So here's a game.

00:38:51.100 | This is a recent result from OpenAI.

00:38:54.600 | I'm teaching a network to play the game of Coast Runners.

00:39:03.400 | And the goal of Coast Runners is to go,

00:39:09.200 | you're in a boat, the task is to go around a track

00:39:13.300 | and successfully complete a race

00:39:16.700 | against other people you're racing against.

00:39:19.100 | Now, this network is an optimal one.

00:39:23.100 | And what it's figured out that actually in the game,

00:39:26.900 | it gets a lot of points for collecting certain objects along the path.

00:39:33.000 | So what you see is it's figured out to go in a circle

00:39:36.600 | and collect those green turbo things.

00:39:40.700 | And what it's figured out is you don't need to complete the game

00:39:45.700 | to earn the reward.

00:39:47.300 | Now, that more sort of,

00:39:54.200 | and despite being on fire and hitting the wall

00:40:00.000 | and going through this whole process,

00:40:01.900 | it's actually achieved at least a local optima

00:40:05.700 | given the reward function of maximizing the number of points.

00:40:10.200 | And so it's figured out a way to earn a higher reward

00:40:16.800 | while ignoring the implied bigger picture goal of finishing the race,

00:40:20.300 | which us as humans understand much better.

00:40:24.400 | This raises for self-driving cars ethical questions.

00:40:31.400 | Besides other questions, you can watch this for hours

00:40:35.600 | and it will do that for hours.

00:40:37.800 | And that's the point is,

00:40:39.400 | it's hard to teach, it's hard to encode

00:40:47.300 | the formally defined utility function

00:40:53.800 | under which an intelligence system needs to operate.

00:40:56.300 | And that's made obvious even in a simple game.

00:40:59.600 | And so what is, yep, question.

00:41:01.600 | So the question was, what's an example of a local optimum

00:41:12.400 | that an autonomous car, so similar to the coast race,

00:41:15.800 | so what would be the example in the real world for an autonomous vehicle?

00:41:18.700 | And it's a touchy subject,

00:41:22.800 | but it would certainly have to be involved

00:41:27.800 | with the choices we make under near crashes and crashes.

00:41:34.600 | The choices a car makes when to avoid,

00:41:37.900 | for example, if there's a crash imminent

00:41:41.200 | and there's no way you can stop to prevent the crash,

00:41:45.200 | do you keep the driver safe or do you keep the other people safe?

00:41:51.800 | And there has to be some,

00:41:56.200 | even if you don't choose to acknowledge it,

00:42:03.200 | even if it's only in the data and the learning that you do,

00:42:06.600 | there's an implied reward function there.

00:42:08.800 | And we need to be aware of that reward function is

00:42:12.600 | because it may find something.

00:42:14.600 | Until you actually see it, we won't know it.

00:42:17.600 | Once we see it, we'll realize that,

00:42:20.500 | "Oh, that was a bad design."

00:42:23.600 | And that's the scary thing.

00:42:25.000 | It's hard to know ahead of time what that is.

00:42:27.800 | So the recent breakthroughs from deep learning came

00:42:34.900 | of several factors.

00:42:38.600 | First is the compute.

00:42:39.700 | Moore's law.

00:42:41.800 | CPUs are getting faster, 100 times faster every decade.

00:42:45.300 | Then there's GPUs.

00:42:49.100 | Also, the ability to train neural networks and GPUs

00:42:53.100 | and now ASICs has created a lot of capabilities

00:43:00.300 | in terms of energy efficiency

00:43:02.200 | and being able to train larger networks more efficiently.

00:43:08.000 | There is larger, well, first of all,

00:43:13.000 | in the 21st century, there's digitized data.

00:43:15.700 | There's larger data sets of digital data.

00:43:19.300 | And now there is that data is becoming more organized,

00:43:23.200 | not just vaguely available data out there on the internet.

00:43:27.900 | It's actual organized data sets like ImageNet.

00:43:31.000 | Certainly for natural language, there's large data sets.

00:43:35.000 | There is the algorithm innovations.

00:43:38.400 | Backprop, backpropagation, convolutional neural networks, LSTMs,

00:43:43.500 | all these different architectures for dealing with specific

00:43:47.200 | specific types of domains and tasks.

00:43:49.600 | There's the huge one, is infrastructure,

00:43:53.800 | is on the software and the hardware side.

00:43:57.000 | There's Git, ability to share an open source way software.

00:44:01.100 | There is pieces of software that make robotics

00:44:08.600 | and make machine learning easier.

00:44:10.200 | ROS, TensorFlow.

00:44:12.100 | There's Amazon Mechanical Turk,

00:44:16.600 | which allows for efficient, cheap annotation of large scale data sets.

00:44:21.400 | There's AWS in the cloud hosting machine learning,

00:44:26.500 | hosting the data and the compute.

00:44:28.800 | And then there's a financial backing of large companies,

00:44:32.800 | Google, Facebook, Amazon.

00:44:35.000 | But really, nothing has changed.

00:44:38.800 | There really has not been any significant breakthroughs.

00:44:42.300 | We're using these convolutional neural networks

00:44:44.700 | have been around since the 90s.

00:44:46.600 | Neural networks have been around since the 60s.

00:44:48.800 | There's been a few improvements.

00:44:52.000 | But the hope is, that's in terms of methodology.

00:44:57.100 | The compute has really been the workhorse.

00:45:00.200 | The ability to do the hundredfold improvement every decade

00:45:05.500 | holds promise.

00:45:08.800 | And the question is whether that reasoning thing I talked about

00:45:12.100 | is all you need is a larger network.

00:45:15.800 | That is the open question.

00:45:16.900 | So some terms for deep learning.

00:45:22.500 | First of all, deep learning is a PR term for neural networks.

00:45:30.000 | It is a term for deep neural networks,

00:45:38.900 | for neural networks that have many layers.

00:45:40.800 | It is symbolic term for the newly gained capabilities

00:45:45.300 | that compute has brought us,

00:45:46.700 | that training on GPUs has brought us.

00:45:50.200 | So deep learning is a subset of machine learning.

00:45:54.300 | There's many other methods that are still effective.

00:45:56.900 | The terms that will come up in this class is,

00:46:02.000 | first of all, multi-layer perceptron,

00:46:04.600 | deep neural networks, recurring neural networks,

00:46:07.600 | LSTM, long short-term memory networks,

00:46:10.200 | CNN or ConvNets, convolutional neural networks,

00:46:14.400 | deep belief networks.

00:46:15.600 | And the operation that will come up is convolution, pooling,

00:46:18.900 | activation functions and back propagation.

00:46:21.900 | Yep, cool question.

00:46:26.100 | [INAUDIBLE]

00:46:48.100 | So the question was,

00:46:50.300 | what is the purpose of the different layers in a neural network?

00:46:54.100 | What does it mean to have one configuration versus another?

00:46:56.800 | So a neural network having several layers,

00:47:01.500 | it's the only thing you have an understanding of

00:47:05.900 | is the inputs and the outputs.

00:47:08.500 | You don't have a good understanding about what each layer does.

00:47:12.800 | They're mysterious things, neural networks.

00:47:16.900 | So I'll talk about how with every layer it forms a higher level,

00:47:23.300 | a higher order representation of the input.

00:47:26.500 | So it's not like the first layer does localization,

00:47:29.900 | the second layer does path planning,

00:47:31.600 | the third layer does navigation,

00:47:35.500 | how you get from here to Florida.

00:47:36.900 | Or maybe it does, but we don't know.

00:47:40.800 | So we know, we're beginning to visualize neural networks

00:47:45.600 | for simple tasks, like for ImageNet, classifying cats versus dogs.

00:47:51.600 | We can tell what is the thing that the first layer does,

00:47:54.800 | the second layer, the third layer, and we'll look at that.

00:47:57.200 | But for driving, as the input provides just the images and the output, the steering,

00:48:02.600 | it's still unclear what you learn.

00:48:05.200 | Partially because we don't have neural networks that drive successfully yet.

00:48:15.200 | Do neural networks fill layers or does it eventually generate them on its own over time?

00:48:19.200 | So the question was,

00:48:23.400 | does a neural network generate layers over time?

00:48:30.000 | Like does it grow?

00:48:31.200 | That's one of the challenges is that a neural network is predefined.

00:48:38.200 | The architectures, the number of nodes, number of layers, that's all fixed.

00:48:42.500 | Unlike the human brain where neurons die and are born all the time.

00:48:46.000 | Neural network is pre-specified, that's it, that's all you get.

00:48:50.400 | And if you want to change that, you have to change that and then retrain everything.

00:48:54.200 | So it's fixed.

00:48:55.800 | So what I encourage you is to proceed with caution

00:49:00.800 | because there's this feeling when you first teach a network with very little effort,

00:49:06.900 | how to do some amazing tasks, like classify a face,

00:49:12.000 | versus non-face or your face versus other faces or cats versus dogs.

00:49:16.700 | It's an incredible feeling.

00:49:18.100 | And then there's definitely this feeling that I'm an expert.

00:49:23.600 | But what you realize is,

00:49:26.700 | you don't actually understand how it works.

00:49:31.600 | And getting it to perform well for more generalized tasks,

00:49:35.800 | for larger scale datasets, for more useful applications,

00:49:38.700 | requires a lot of hyperparameter tuning.

00:49:41.600 | Figuring out how to tweak little things here and there.

00:49:43.900 | And still in the end, you don't understand why it works so damn well.

00:49:48.000 | So deep learning, these deep neural network architectures is representation learning.

00:49:59.100 | This is the difference between traditional machine learning methods.

00:50:05.900 | Where, for example, for the task of having an image here as the input,

00:50:14.400 | the input to the network here is on the bottom, the output is up at top.

00:50:18.100 | So, and the input is a single image of a person in this case.

00:50:24.700 | And so, the input specifically is all of the pixels in that image, RGB.

00:50:35.200 | The different colors of the pixels in the image.

00:50:37.300 | And over time, what a network does is build

00:50:44.100 | a multi-resolutional representation of this data.

00:50:47.900 | The first layer learns the concept of edges, for example.

00:50:55.300 | The second layer starts to learn composition of those edges, corners, contours.

00:51:02.100 | Then it starts to learn about object parts.

00:51:05.800 | And finally, actually provide a label for the entities that are in the input.

00:51:12.500 | And this is the difference between traditional machine learning methods.

00:51:16.700 | Where the concepts like edges and corners and contours are manually pre-specified by human beings,

00:51:28.400 | human experts for the particular domain.

00:51:31.000 | And representation matters because figuring out a line

00:51:42.900 | for the Cartesian coordinates of this particular dataset,

00:51:46.500 | where you want to design a machine learning system

00:51:49.200 | that tells the difference between green triangles and blue circles is difficult.

00:51:54.600 | There's no line that separates them cleanly.

00:52:00.000 | And if you were to ask a human being, a human expert in the field,

00:52:04.100 | to try to draw that line, they would probably do a PhD on it and still not succeed.

00:52:12.300 | But a neural network can automatically figure out to remap that input into polar coordinates.

00:52:23.000 | Where the representation is such that it's an easily linearly separable dataset.

00:52:28.600 | And so deep learning is a subset of representation learning,

00:52:36.800 | is a subset of machine learning and a key subset of artificial intelligence.

00:52:41.600 | Now, because of this, because of its ability to compute an arbitrary number of features

00:52:53.700 | at the core of the representation.

00:52:56.400 | So you're not, if you were trying to detect a cat in an image,

00:52:59.700 | you're not specifying 215 specific features of cat ears and whiskers and so on

00:53:08.100 | that a human expert would specify.

00:53:10.100 | You allow a neural network to discover tens of thousands of such features.

00:53:14.400 | Which maybe for cats you are an expert, but for a lot of objects

00:53:19.400 | you may never be able to sufficiently provide the features

00:53:24.700 | which successfully would be used for identifying the object.

00:53:27.600 | And so this kind of representation learning,

00:53:30.600 | one is easy in the sense that all you have to provide is inputs and outputs.

00:53:35.600 | All you need to provide is a dataset that you care about without hand engineering features.

00:53:40.800 | And two, because of its ability to construct arbitrarily sized representations,

00:53:49.200 | deep neural networks are hungry for data.

00:53:52.700 | The more data we give them, the more they're able to learn about this particular dataset.

00:53:59.000 | So let's look at some applications.

00:54:06.300 | First, some cool things that deep neural networks have been able to accomplish up to this point.

00:54:13.900 | Let me go through them.

00:54:15.100 | First, the basic one.

00:54:17.400 | AlexNet.

00:54:21.500 | ImageNet is a famous dataset.

00:54:27.600 | It's a competition of classification localization

00:54:31.500 | where the task is given an image,

00:54:34.400 | identify what are the five most likely things in that image

00:54:38.300 | and what is the most likely and you have to do so correctly.

00:54:41.400 | So on the right, there's an image of a leopard

00:54:43.900 | and you have to correctly classify that that is in fact a leopard.

00:54:47.100 | So they're able to do this pretty well.

00:54:50.800 | Given a specific image, determine that it's a leopard.

00:54:55.200 | And we started what's shown here on the x-axis is years,

00:55:01.800 | on the y-axis is error in classification.

00:55:04.900 | So starting from 2012 on the left with AlexNet

00:55:10.000 | and today the errors decreased from 16%

00:55:18.000 | and 40% before then with traditional methods have decreased to below 4%.

00:55:24.200 | So human level performance, if I were to give you this picture of a leopard,

00:55:29.900 | there's a 4% of those pictures of leopards,

00:55:34.100 | you would not say it's a leopard.

00:55:35.700 | That's human level performance.

00:55:37.800 | So for the first time in 2015,

00:55:40.100 | convolutional neural networks outperform human beings.

00:55:43.200 | That in itself is incredible.

00:55:45.500 | That's something that seemed impossible

00:55:47.900 | and now is because it's done, it's not as impressive.

00:55:53.100 | But I just want to get to why that's so impressive

00:55:58.800 | because computer vision is hard.

00:56:02.500 | We as human beings have evolved visual perception over millions of years,

00:56:07.200 | hundreds of millions of years.

00:56:08.300 | So we take it for granted but computer vision is really hard.

00:56:13.700 | Visual perception is really hard.

00:56:15.100 | There is illumination variability.

00:56:17.000 | So it's the same object.

00:56:18.400 | The only way we tell anything is from the shade,

00:56:21.700 | the reflection of light from that surface.

00:56:23.600 | It could be the same object with drastically, in terms of pixels,

00:56:28.000 | drastically different looking shapes

00:56:31.200 | and we still know it's the same object.

00:56:35.500 | There is pose variability and occlusions.

00:56:37.800 | Probably my favorite caption for an image,

00:56:41.200 | for a figure in an academic paper is

00:56:44.900 | "deformable and truncated cat".

00:56:46.800 | These are pictures, you know, cats are famously deformable.

00:56:54.600 | They can take a lot of different shapes.

00:56:56.300 | It's arbitrary poses are possible.

00:57:04.600 | So you have to have computer vision,

00:57:06.200 | you should know it's still the same object,

00:57:07.900 | still the same class of objects

00:57:09.700 | given all the variability in the pose.

00:57:12.300 | And occlusions is a huge problem.

00:57:15.700 | We still know it's an object.

00:57:17.500 | We still know it's a cat even when parts of it are not visible

00:57:21.100 | and sometimes large parts of it are not visible.

00:57:23.400 | And then there's all the interclass variability.

00:57:27.100 | In interclass, all of these on the top two rows are cats.

00:57:32.900 | Many of them look drastically different

00:57:34.600 | and the top bottom two rows are dogs,

00:57:37.800 | also look drastically different.

00:57:39.900 | And yet some of the dogs look like cats,

00:57:43.200 | some of the cats look like dogs

00:57:45.200 | and as human beings are pretty good at telling the difference

00:57:48.700 | and we want computer vision to do better than that.

00:57:52.000 | It's hard.

00:57:53.800 | So how is this done?

00:57:56.400 | This is done with convolutional neural networks.

00:57:58.800 | The input to which is a raw image.

00:58:01.500 | Here's an input on the left of a number three

00:58:05.500 | and I'll talk about through convolutional layers.

00:58:10.500 | That image is processed, passed through.

00:58:14.300 | Convolutional layers maintain spatial information.

00:58:19.200 | On the output, in this case predicts which of the images,

00:58:28.900 | what number is shown in the image, 0, 1, 2 through 9.

00:58:33.000 | And so these networks, this is exactly,

00:58:38.100 | everybody is using the same kind of network

00:58:40.100 | to determine exactly that.

00:58:41.800 | Input is an image, output is a number.

00:58:43.800 | And in the case of probability that it's a leopard,

00:58:48.200 | what is that number?

00:58:49.400 | Then there's segmentation built on top of these

00:58:52.900 | convolutional neural networks where you chop off the end

00:58:58.300 | and convolutionize the network.

00:59:00.600 | You chop off the end where the output is a heat map.

00:59:03.000 | So you can have instead of a detector for a cat,

00:59:08.000 | you can do a cat heat map where it's the part of the image,

00:59:14.000 | the output heat map gets excited,

00:59:16.200 | the neurons on that output get excited

00:59:19.000 | and the spatially excited in the parts of the image

00:59:23.300 | that contain a tabby cat.

00:59:25.800 | And this kind of process can be used to segment the image

00:59:28.400 | into different objects.

00:59:29.900 | A horse, so the original input on the left

00:59:32.500 | is a woman on a horse and the output is a fully segmented image

00:59:36.900 | of knowing where's the woman, where's the horse.

00:59:39.500 | And this kind of process can be used for object detection

00:59:44.700 | which is the task of detecting an object in an image.

00:59:47.600 | Now the traditional method with convolutional neural networks

00:59:53.200 | and in general in computer vision is the sliding window approach.

00:59:56.800 | We have a detector like the leopard detector

00:59:59.500 | that you slide through the image to find where in that image is a leopard.

01:00:03.500 | The segmenting approach, the RCNN approach

01:00:10.500 | is efficiently segment the image in such a way

01:00:14.200 | that it can propose different parts of the image

01:00:16.400 | that are likely to have a leopard or in this case a cowboy.

01:00:22.200 | And that drastically reduces the computational requirements

01:00:25.100 | of the object detection task.

01:00:27.400 | And so these networks, this is the currently

01:00:37.100 | one of the best networks for the ImageNet task of localization

01:00:41.000 | is the deep residual networks.

01:00:44.300 | They're deep.

01:00:49.000 | So VGG19 is one of the famous ones, VGGNet.

01:00:53.000 | You're starting to get above 20 layers in many cases.

01:00:59.100 | 34 layers is the ResNet one.

01:01:02.000 | So the lesson there is the deeper you go,

01:01:06.600 | the more representation power you have, the higher accuracy.

01:01:10.200 | But you need more data.

01:01:17.000 | Other applications, colorization of images.

01:01:19.500 | So this again, input is a single image

01:01:27.000 | and output is a single image.

01:01:29.000 | So you can take a black and white video from a film,

01:01:33.500 | from an old film and recolor it.

01:01:36.700 | And all you need to do to train that network in a supervised way

01:01:41.100 | is provide modern films and convert them to grayscale.

01:01:46.100 | So now you have arbitrarily sized datasets

01:01:48.400 | that are able to, datasets of grayscale to color.

01:01:53.800 | And you're able to, with very little effort on top of it,

01:02:00.900 | to successfully, well, somewhat successfully recolor images.

01:02:05.300 | Again, Google Translate does image translation in this way.

01:02:10.200 | Image to image.

01:02:12.800 | It first perceives, here in German I believe,

01:02:16.400 | famous German, correct me if I'm wrong,

01:02:19.400 | dark chocolate written in German on a box.

01:02:21.900 | So this can take this image, detect the different letters,

01:02:26.300 | convert them to text, translate the text,

01:02:29.100 | and then using the image to image mapping,

01:02:32.100 | map the letters, the translated letters back onto the box.

01:02:37.600 | And you can do this in real time on video.

01:02:42.500 | So what we've talked about up to this point,

01:02:45.000 | on the left are vanilla neural networks,

01:02:47.400 | convolutional neural networks

01:02:49.200 | that map a single input to a single output,

01:02:51.900 | a single image to a number, single image to another image.

01:02:55.800 | Then there is recurrent neural networks that map,

01:02:58.700 | this is the more general formulation,

01:03:00.900 | that map a sequence of images or a sequence of words

01:03:04.400 | or a sequence of any kind to another sequence.

01:03:09.200 | And these networks are able to do incredible things

01:03:12.600 | with natural language, with video,

01:03:15.800 | and any time series data.

01:03:18.800 | For example, we can convert text to handwritten digits,

01:03:24.100 | with handwritten text.

01:03:27.000 | Here we type in, and you could do this online,

01:03:31.600 | type in "deep learning for self-driving cars"

01:03:33.800 | and it will use an arbitrary number of digits.

01:03:37.600 | And it will use an arbitrary handwriting style

01:03:41.700 | to generate the words "deep learning for self-driving cars".

01:03:44.600 | This is done using recurrent neural networks.

01:03:48.000 | We can also take car RNNs, they're called,

01:03:54.200 | this character level recurrent neural networks

01:03:57.300 | that train on a dataset, an arbitrary text dataset,

01:04:04.300 | and learn to generate text one character at a time.

01:04:09.000 | So there is no preconceived syntactical semantic structure

01:04:14.500 | that's provided to the network.

01:04:16.100 | It learns that structure.

01:04:17.700 | So, for example, you can train it on Wikipedia articles,

01:04:24.100 | like in this case, and it's able to generate successfully

01:04:29.800 | not only text that makes some kind of grammatical sense at least,

01:04:35.400 | but also keep perfect syntactic structure for Wikipedia,

01:04:41.400 | for Markdown editing, for LaTeX editing, and so on.

01:04:45.900 | This text says, "Naturalism and decision for the majority of Arab countries,

01:04:52.000 | capitalized, whatever that means, was grounded by the Irish language

01:04:55.900 | by John Clare, and so on.

01:04:58.400 | These are sentences, if you didn't know better, that might sound correct.

01:05:03.100 | And it does so, let me pause, one character at a time.

01:05:08.300 | So, these aren't words being generated.

01:05:12.700 | This is one character.

01:05:14.400 | You start with the beginning three letters, "Nat",

01:05:17.300 | you generate "You" completely without knowing of the word "Naturalism".

01:05:23.900 | This is incredible.

01:05:26.900 | You can do this to start a sentence

01:05:33.600 | and let the neural network complete that sentence.

01:05:35.700 | So, for example, if you start the sentence with "Life is"

01:05:39.000 | or "Life is about", actually, it will complete it with a lot of fun things.

01:05:46.600 | The weather, "Life is about kids",

01:05:49.800 | "Life is about the true love of Mr. Mom",

01:05:55.300 | "Life is about the truth now",

01:05:56.700 | and this is from Geoffrey Hinton, the last two.

01:06:01.200 | If you start with the meaning of life,

01:06:03.000 | it can complete that with the meaning of life is literary recognition,

01:06:07.600 | maybe true for some of us here.

01:06:09.600 | Publish or perish.

01:06:13.500 | And the meaning of life is the tradition of ancient human reproduction.

01:06:18.600 | Also true for some of us here, I'm sure.

01:06:23.500 | Okay, so what else can you do?

01:06:27.100 | This has been very exciting recently is image caption recognition.

01:06:31.600 | No, generation, I'm sorry.

01:06:33.100 | Image caption generation is important for large data sets of images

01:06:41.200 | where we want to be able to determine what's going on inside those images,

01:06:45.000 | especially for search.

01:06:46.900 | If you want to find a man sitting on a couch with a dog,

01:06:50.800 | you type it into Google and it's able to find that.

01:06:53.400 | So here shown in black text,

01:06:59.000 | a man sitting on a couch with a dog is generated by the system.

01:07:02.300 | A man sitting on a chair with a dog in his lap is generated by a human observer.

01:07:07.300 | And again, these annotations are done by detecting the different obstacles,

01:07:12.700 | the different objects in the scene.

01:07:15.100 | So segmenting the scene, detecting on the right there's a woman, a crowd, a cat,

01:07:20.400 | a camera holding purple, all of these words are being detected.

01:07:25.400 | Then a syntactically correct sentence is generated, a lot of them,

01:07:30.700 | and then you order which sentence is the most likely.

01:07:32.900 | And in this way you can generate very accurate labeling of the images,

01:07:38.400 | captions for the images.

01:07:41.800 | And you can do the same kind of process for image question answering.

01:07:49.400 | You can ask how many, so quantity, how many chairs are there?

01:07:53.000 | You can ask about location, where are the right bananas?

01:07:59.500 | You can ask about the type of object, what is the object on the chair?

01:08:04.400 | It's a pillow.

01:08:05.500 | And these are again using the recurrent neural networks.

01:08:15.000 | You can do the same thing with video caption generation,

01:08:20.100 | video caption description generation.

01:08:23.000 | So looking at a sequence of images as opposed to just a single image.

01:08:26.400 | What is the action going on in this situation?

01:08:30.100 | This is the difficult task.

01:08:32.000 | There's a lot of work in this area.

01:08:34.600 | Now on the left is correct descriptions, a man is doing stunts on his bike.

01:08:38.900 | I heard of zebras are walking in the field.

01:08:41.900 | And on the right, there's a small bus running into a building.

01:08:45.200 | You know, it's talking about relevant entities

01:08:51.700 | but just doing an incorrect description.

01:08:53.500 | A man is cutting a piece of, a piece of, a pair of a paper.

01:08:59.400 | He's cutting a piece of a pair of a paper.

01:09:01.900 | So the words are correct, perhaps.

01:09:06.000 | But, so you're close, but no cigar.

01:09:11.600 | So one of the interesting things

01:09:13.200 | you can do with the recurrent neural networks

01:09:18.900 | is if you think about the way we look at images,

01:09:21.700 | human beings look at images,

01:09:22.900 | is we only have a small fovea with which we focus on, in the scene.

01:09:30.300 | So right now your periphery is very distorted.

01:09:33.500 | The only thing, if you're looking at the slides,

01:09:35.900 | or you're looking at me, that's the only thing that's in focus.

01:09:40.300 | Majority of everything else is out of focus.

01:09:42.700 | So we can use the same kind of concept

01:09:44.900 | to try to teach a neural network to steer around the image,

01:09:47.600 | both for perception and generation of those images.

01:09:51.200 | This is important first on the general artificial intelligence point

01:09:55.900 | of it being just fascinating

01:09:58.500 | that we can selectively steer our attention.

01:10:02.900 | But also it's important for things like drones

01:10:05.300 | that have to fly at high speeds in an environment

01:10:08.300 | where at 300 plus frames a second you have to make decisions.

01:10:12.000 | So you can't possibly localize yourself

01:10:14.700 | or perceive the world around yourself successfully

01:10:17.500 | if you have to interpret the entire scene.

01:10:20.400 | So what you can do is you can steer,

01:10:22.900 | for example, here shown is reading house numbers

01:10:28.800 | by steering around an image.

01:10:32.200 | You could do the same task for reading and for writing.

01:10:38.400 | So reading numbers here on the MNIST dataset on the left

01:10:41.600 | is reading numbers.

01:10:42.900 | We can also selectively steer a network around an image

01:10:49.900 | to generate that image.

01:10:51.200 | Starting with a blurred image first

01:10:53.700 | and then getting more and more higher resolution

01:10:57.900 | as the steering goes on.

01:11:02.300 | Work here at MIT is able to map video to audio.

01:11:10.100 | So head stuff with a drumstick,

01:11:13.200 | silent video and able to generate the sound

01:11:18.200 | that would drumstick hitting that particular object makes.

01:11:22.200 | So you can get texture information from that impact.

01:11:29.100 | So here is a video of a human soccer player playing soccer

01:11:38.700 | and a state-of-the-art machine playing soccer.

01:11:44.900 | And well, let me give him some time to build up.

01:11:52.500 | (Laughter)

01:12:03.300 | Okay, so soccer, this is, we take this for granted

01:12:08.300 | but walking is hard.

01:12:10.300 | Object manipulation is hard.

01:12:12.800 | Soccer is harder than chess for us to do, much harder.

01:12:18.100 | On your phone now, you can have a chess engine

01:12:23.800 | that beats the best players in the world.

01:12:26.300 | And you have to internalize that because the question is,

01:12:32.300 | this is a painful video,

01:12:33.800 | the question is, where does driving fall?

01:12:37.400 | Is it closer to chess or is it closer to soccer?

01:12:40.900 | For those incredible brilliant engineers

01:12:44.500 | that worked on the most recent DARPA challenge,

01:12:47.200 | this would be a very painful video to watch, I apologize.

01:12:51.100 | This is a video from the DARPA challenge

01:12:55.700 | of robots struggling with the basic object manipulation

01:13:05.700 | and walking tasks.

01:13:06.900 | So it's mostly a fully autonomous navigation task.

01:13:14.400 | (Laughter)

01:13:24.000 | Maybe I'll just let this play for a few moments

01:13:27.100 | to let it internalize how difficult this task is.

01:13:32.400 | Of balancing, of planning in an under-actuated way

01:13:38.000 | where you don't have full control of everything.

01:13:40.300 | When there is a delta between your perception

01:13:44.300 | of what you think the world is and what the reality is.

01:13:47.600 | So there, a robot was trying to turn an object that wasn't there.

01:13:54.700 | And this is an MIT entry that actually successfully,

01:14:02.300 | I believe, gotten points for this because it got into that area.

01:14:07.800 | (Laughter)

01:14:12.000 | But as a lot of the teams talked about, the hardest part,

01:14:17.200 | so one of the things the robot had to do

01:14:19.200 | is get into a car and drive it and get out of the car.

01:14:23.100 | And there's a few other manipulation tasks

01:14:25.600 | like to walk on unsteady ground,

01:14:28.200 | it had to drill a hole through a wall, all of these tasks.

01:14:32.000 | And what a lot of teams said is the hardest part,

01:14:35.100 | the hardest task of all of them is getting out of the car.

01:14:38.200 | So it's not getting into the car,

01:14:40.200 | it's this very task that you saw now is a robot getting out of the car.

01:14:44.500 | These are things we take for granted.

01:14:46.200 | So in our evaluation of what is difficult about driving,

01:14:50.900 | we have to remember that some of those things

01:14:54.700 | we may take for granted in the same kind of way

01:14:57.100 | that we take walking for granted.

01:14:58.400 | This is Marv X paradox.

01:15:05.600 | With Hans Moravac from CMU.

01:15:08.100 | Let me just quickly read that quote.

01:15:11.300 | "Encoded in the large highly evolved sensory and motor portions of the human brain

01:15:15.100 | is billions of years of experience

01:15:18.300 | about the nature of the world and how to survive in it."

01:15:20.600 | So this is data, this is big data, billions of years.

01:15:25.100 | An abstract thought which is reasoning,

01:15:28.300 | the stuff we think is intelligence

01:15:31.900 | is perhaps less than 100,000 years of data old.

01:15:36.700 | We haven't yet mastered it and so,

01:15:39.700 | sorry I'm inserting my own statements in the middle of a quote but,

01:15:44.100 | it's been very recent that we've learned how to think

01:15:51.600 | and so we respect it perhaps more

01:15:55.200 | than the things we take for granted like walking and visual perception and so on.

01:16:00.500 | But those may be strictly a matter of data,

01:16:03.700 | data and training time and network size.

01:16:08.200 | So walking is hard.

01:16:15.900 | The question is how hard is driving?

01:16:19.800 | And that's an important question because the margin of error is small.

01:16:27.800 | There's one fatality per 100 million miles.

01:16:34.400 | That's the number of people that die in car crashes every year.

01:16:38.200 | One fatality per 100 million miles.

01:16:41.200 | That's a 0.00001% margin of error.

01:16:47.200 | That's through all the time you spend on the road,

01:16:50.200 | that is the error you get.

01:16:52.200 | We're impressed with ImageNet being able to classify a leopard, a cat or a dog

01:16:57.700 | at close to, at above human level performance.

01:17:01.900 | But this is the margin of error we get with driving.

01:17:04.600 | And we have to be able to deal with snow, with heavy rain,

01:17:09.500 | with big open parking lots, with parking garages,

01:17:13.500 | any pedestrians that behaves irresponsibly, as rarely as that happens

01:17:18.300 | or just unpredictably again, especially in Boston.

01:17:23.800 | Reflections, the ones especially, this is one of some of the things you don't think about,

01:17:29.900 | the lighting variations that blind the cameras.

01:17:33.100 | The question was if that number changes, if you look at just crashes.

01:17:46.400 | So fatalities per crash.

01:17:49.700 | Crashes per, yeah, so one of the big things is cars have gotten really good at crashing

01:17:55.400 | and not hurting the, anybody.

01:17:57.400 | So the number of crashes is much, much larger than number of fatalities,

01:18:01.400 | which is a great thing.

01:18:03.100 | We've built safer cars.

01:18:05.000 | But still, you know, even one fatality is too many.

01:18:09.300 | So this is one, Google self-driving car team,

01:18:20.200 | is quite open about their performance since hitting public roads.

01:18:28.700 | This is from a report that shows the number of times the driver disengages,

01:18:35.700 | the car gives up control, that it asks the driver to take control back

01:18:43.200 | or the driver takes control back by force.

01:18:45.600 | Meaning that they're unhappy with the decision that the car was making

01:18:49.900 | or it was putting the car or other pedestrians or other cars in unsafe situations.

01:18:54.500 | And so if you see over time that there's been a total from 2014 to 2015,

01:19:01.800 | there's been a total of 341 times on beautiful San Francisco roads.

01:19:08.600 | And I say that seriously because the weather conditions are great there.

01:19:14.400 | 341 times that the driver had to elect to take control back.

01:19:17.700 | So it's a work in progress.

01:19:20.400 | Let me give you something to think about here.

01:19:24.900 | This with neural networks is a big open question.

01:19:31.500 | The question of robustness.

01:19:33.200 | So this is an amazing paper, I encourage people to read it.

01:19:38.400 | There's a couple of papers around this topic.

01:19:40.800 | Deep neural networks are easily fooled.

01:19:43.300 | So here are eight images where if given to a neural network as input,

01:19:52.800 | a convolutional neural network as input,

01:19:54.800 | the network with higher than 99.6% confidence says that the image,

01:20:01.900 | for example, in the top left is a robin, next to it is a cheetah,

01:20:06.000 | then an armadillo, a panda, an electric guitar, a baseball, a starfish, a king penguin.

01:20:13.100 | All of these things are obviously not in the images.

01:20:16.400 | So networks can be fooled with noise.

01:20:19.100 | More importantly, more practically for the real world,

01:20:25.800 | adding just a little bit of distortion, a little bit of noise distortion to the image

01:20:31.800 | can force the network to produce a totally wrong prediction.

01:20:37.400 | So here's an example.

01:20:39.800 | There's three columns, correct image classification,

01:20:44.300 | the slight addition of distortion,

01:20:47.800 | and the resulting prediction of an ostrich for all three images on the left,

01:20:53.300 | and a prediction of an ostrich for all three images on the right.

01:20:59.000 | This ability to fool networks easily brings up an important point.

01:21:06.500 | And that point is that there has been a lot of excitement

01:21:14.400 | about neural networks throughout their history.

01:21:17.200 | There's been a lot of excitement about artificial intelligence throughout its history.

01:21:21.000 | And not coupling that excitement, not grounding that excitement in the reality,

01:21:28.000 | the real challenges around that has resulted in crashes,

01:21:36.200 | in AI winters, when funding dried out and people became hopeless

01:21:42.300 | in terms of the possibilities of artificial intelligence.

01:21:44.800 | So here's a 1958 New York Times article

01:21:47.800 | that said the Navy revealed the embryo of an electronic computer today.

01:21:52.000 | This is when the first perceptron that I talked about

01:21:55.600 | was implemented in hardware by Frank Rosenblatt.

01:21:59.100 | It took 400 pixel image input and it provided a single output

01:22:06.000 | weights were encoded in hardware potentiometers

01:22:09.900 | and weights were updated with electric motors.

01:22:12.100 | Now New York Times wrote,

01:22:13.900 | "The Navy revealed the embryo of an electronic computer today

01:22:17.100 | that expects we'll be able to walk, talk, see, write, reproduce itself

01:22:24.300 | and be conscious of its existence."

01:22:27.000 | Dr. Frank Rosenblatt, a research psychologist

01:22:31.200 | at the Cornell Aeronautical Laboratory, Buffalo,

01:22:34.700 | said perceptrons might be fired to the planets as mechanical space explorers.

01:22:39.300 | This might seem ridiculous, but this was the general opinion of the time.

01:22:45.500 | And as we know now, perceptrons cannot even separate a nonlinear function.

01:22:53.400 | They're just linear classifiers.

01:22:57.300 | And so this led to two major AI winters in the 70s and the late 80s and early 90s.

01:23:05.600 | The Lighthill Report in 1973 by the UK government said that

01:23:14.000 | "No part of the field of discoveries made so far produced the major impact that was promised."

01:23:19.000 | So if the hype builds beyond the capabilities of our research,

01:23:27.200 | reports like this will come

01:23:31.300 | and they have the possibility of creating another AI winter.

01:23:35.200 | So I want to pair the optimism, some of the cool things we'll talk about in this class

01:23:40.000 | with the reality of the challenges ahead of us.

01:23:43.900 | The focus of the research community.

01:23:51.400 | This is some of the key players in deep learning.

01:23:55.100 | What are the things that are next for deep learning?

01:23:59.900 | The five-year vision.

01:24:01.700 | We want to run on smaller, cheaper mobile devices.

01:24:05.700 | We want to explore more in the space of unsupervised learning,

01:24:10.000 | as I mentioned, and reinforcement learning.

01:24:12.500 | We want to do things that explore the space of videos more.

01:24:20.300 | The recurring neural networks, like being able to summarize videos

01:24:23.700 | or generate short videos.

01:24:27.000 | One of the big efforts, especially in the companies dealing with large data,

01:24:31.800 | is multimodal learning.

01:24:33.400 | Learning from multiple data sets with multiple sources of data.

01:24:37.600 | And lastly, making money from these technologies.

01:24:43.600 | There's a lot of this, despite the excitement,

01:24:48.800 | there has been an inability for the most part to make serious money

01:24:54.800 | from some of the more interesting parts of deep learning.

01:24:59.600 | And while I got made fun of by the TAs for including this slide

01:25:10.600 | because it's shown in so many sort of business type lectures,

01:25:13.600 | but it is true that we're at the peak of a hype cycle

01:25:17.800 | and we have to make sure,

01:25:20.400 | given the large amount of hype and excitement there is, we proceed with caution.

01:25:25.000 | One example of that, let me mention,

01:25:37.000 | is we already talked about spoofing the cameras,

01:25:42.800 | spoofing the cameras with a little bit of noise.

01:25:44.400 | So if you think about it,

01:25:47.600 | So if you think about it,

01:25:48.600 | self-driving vehicles operate with a set of sensors

01:25:53.200 | and they rely on those sensors to convey, to accurately capture that information.

01:25:58.200 | Now what happens, not only when the world itself produces noisy visual information,

01:26:06.600 | but what if somebody actively tries to spoof that data?

01:26:10.000 | One of the fascinating things that have been recently done is spoofing of LiDAR.

01:26:16.000 | So these LiDARs is a range sensor that gives a 3D point cloud of the objects in the external environment

01:26:22.800 | and you're able to successfully do a replay attack

01:26:28.000 | where you have the car see people and other cars around it

01:26:32.800 | when there's actually nothing around it.

01:26:34.600 | In the same way that you can spoof a camera to see things that are not there,

01:26:40.400 | a neural network.

01:26:44.200 | So let me run through some of the libraries that we'll work with

01:26:48.400 | and they're out there that you might work with if you proceed with deep learning.

01:26:53.400 | TensorFlow, that is the most popular one these days.

01:26:58.600 | It's heavily backed and developed by Google.

01:27:01.600 | It has primarily a Python interface

01:27:06.000 | and is very good at operating on multiple GPUs.

01:27:14.200 | There's Keras and also TF Learn and TF Slim

01:27:18.200 | which are libraries that operate on top of TensorFlow

01:27:21.800 | that make it slightly easier, slightly more user-friendly interfaces

01:27:26.800 | to get up and running.

01:27:29.000 | Torch, if you're interested to get in at the lower level

01:27:40.200 | tweaking of the different parameters of neural networks,

01:27:42.800 | creating your own architectures, Torch is excellent for that

01:27:46.200 | with its own Lua interface.

01:27:49.200 | Lua is a programming language and heavily backed by Facebook.

01:27:54.000 | There's the old school Theano, just what I started on,

01:27:58.000 | a lot of people early on in deep learning started on.

01:28:00.600 | It's one of the first libraries that supported, that came with GPU support.

01:28:06.000 | It definitely encourages lower level tinkering

01:28:10.000 | and has a Python interface.

01:28:11.400 | And many of these, if not all, rely on NVIDIA's library

01:28:18.800 | for doing some of the low-level computations

01:28:23.600 | involved with training these neural networks on NVIDIA GPUs.

01:28:29.000 | MXNet, heavily supported by Amazon

01:28:33.000 | and they've officially recently announced that they're going to be,

01:28:39.800 | that their AWS is going to be all in on MXNet.

01:28:43.800 | Neon, recently bought by Intel, started out as a manufacturer

01:28:52.800 | of neural network chips, which is really exciting

01:28:56.600 | and it performs exceptionally well.

01:28:59.800 | Now here are good things.

01:29:01.800 | Caffe, started in Berkeley, also was very popular in Google

01:29:07.400 | before TensorFlow came out.

01:29:10.000 | It's primarily designed for computer vision with ConvNets

01:29:13.800 | but it's now expanded to all other domains.

01:29:18.400 | There is CNTK, as it used to be known and now called

01:29:24.000 | the Microsoft Cognitive Toolkit.

01:29:25.400 | Nobody calls it that still, I'm aware of.

01:29:28.200 | It says multi-GPU support, has its own brain script,

01:29:34.400 | custom language, as well as other interfaces.

01:29:39.200 | And what we'll get to play around in this class

01:29:41.600 | is amazingly deep learning in the browser, right?

01:29:45.800 | Our favorite is ComNetJS, what you'll use, built by Andrej Karpathy

01:29:52.000 | from Stanford, now OpenAI.

01:29:54.000 | It's good for explaining the basic concept of neural networks.

01:29:58.600 | It's fun to play around with.

01:30:00.200 | All you need is a browser, so very few requirements.

01:30:03.400 | It can't leverage GPUs, unfortunately.

01:30:08.000 | But for a lot of things that we're doing, you don't need GPUs.

01:30:10.600 | You'll be able to train a network with very little

01:30:12.600 | and relatively efficiently without the need of GPUs.

01:30:16.600 | It has full support for CNNs, RNNs, and even deep reinforcement learning.

01:30:22.200 | KerasJS, which seems incredible, we tried to use for this class,

01:30:28.200 | didn't happen to. It has GPU support, so it runs in the browser

01:30:34.200 | with GPU support, with OpenGL or however it works magically.

01:30:39.800 | But we're able to accomplish a lot of things we need without the use of GPUs.

01:30:44.000 | So, it's incredible to live in a day and age when it literally,

01:30:52.200 | as I'll show in the tutorials, it takes just a few minutes

01:30:56.000 | to get started with building your own neural network that classifies images.

01:31:00.000 | And a lot of these libraries are friendly in that way.

01:31:05.000 | So, all the references mentioned in this presentation are available at this link

01:31:10.400 | and the slides are available there as well.

01:31:12.400 | So, I think in the interest of time, let me wrap up.

01:31:16.400 | Thank you so much for coming in today.

01:31:19.000 | And tomorrow I'll explain the deep reinforcement learning game

01:31:23.200 | and the actual competition and how you can win it.

01:31:25.600 | Thanks very much guys.

01:31:27.200 | [APPLAUSE]

MIT 6.S094: Introduction to Deep Learning and Self-Driving Cars

Chapters