
MIT 6.S094: Introduction to Deep Learning and Self-Driving Cars


Chapters

0:00 Intro
0:54 Administrative
3:43 Project: Deep Traffic
5:02 Project: DeepTesla
6:50 Defining Artificial Intelligence
10:14 How Hard is Driving?
10:59 Chess Pieces: Self-Driving Car Sensors
13:08 Chess Pieces: Self-Driving Car Tasks
14:48 DARPA Grand Challenge II (2006)
15:23 DARPA Urban Challenge (2007)
15:48 Industry Takes on the Challenge
16:30 How Hard is it to Pass the Turing Test?
20:54 Neuron: Biological Inspiration for Computation
22:54 Perceptron: Forward Pass
23:37 Perceptron Algorithm
25:03 Neural Networks are Amazing
26:15 Special Purpose Intelligence
27:21 General Purpose Intelligence
40:43 Deep Learning Breakthroughs: What Changed?
45:19 Useful Deep Learning Terms
48:58 Neural Networks: Proceed with Caution
49:50 Deep Learning is Representation Learning
51:33 Representation Matters
52:43 Deep Learning: Scalable Machine Learning
54:01 Applications: Object Classification in Images
55:55 Illumination Variability
56:32 Pose Variability and Occlusions
57:54 Pause: Object Recognition / Classification
59:41 Pause: Object Detection

Whisper Transcript

00:00:00.000 | All right, hello everybody. Hopefully you can hear me well. Yes, yes, great.
00:00:07.400 | So welcome to course 6S094, Deep Learning for Self-Driving Cars.
00:00:15.600 | We will introduce to you the methods of deep learning of deep neural networks
00:00:22.500 | using the guiding case study of building self-driving cars.
00:00:28.800 | My name is Lex Fridman.
00:00:31.200 | You get to listen to me for the majority of these lectures
00:00:35.600 | and I am part of an amazing team with some brilliant TAs, would you say?
00:00:42.700 | Brilliant?
00:00:43.200 | Dan, Dan Brown, you guys want to stand up?
00:00:49.300 | You're okay?
00:00:49.800 | They're in front row.
00:00:50.800 | Spencer, William Angell, Spencer Dodd, and all the way in the back,
00:00:57.600 | the smartest and the tallest person I know, Benedikt Jenik.
00:01:01.500 | So what you see there on the left of the slide is a visualization of one of the two projects,
00:01:09.600 | one of the two simulation games, that we'll get to go through.
00:01:16.800 | We use it as a way to teach you about deep reinforcement learning
00:01:21.300 | but also as a way to excite you by challenging you to compete against others,
00:01:27.900 | if you wish, to win a special prize yet to be announced, super secret prize.
00:01:35.000 | So you can reach me and the TAs at deepcars@mit.edu
00:01:41.000 | if you have any questions about the tutorials, about the lecture, about anything at all.
00:01:45.400 | The website cars.mit.edu has the lecture content.
00:01:51.200 | Code tutorials again, like today, the lecture slides for today are already up in PDF form.
00:01:57.800 | The slides themselves, if you want to see them, just email me,
00:02:02.100 | but they're over a gigabyte in size because they're very heavy in videos.
00:02:06.300 | So I'm just posting the PDFs.
00:02:07.800 | And there will be lecture videos available a few days after the lecture is given.
00:02:15.800 | So speaking of which, there is a camera in the back.
00:02:19.100 | This is being videotaped and recorded, but for the most part,
00:02:23.700 | the camera is just on the speaker.
00:02:26.700 | So you shouldn't have to worry.
00:02:28.600 | If that kind of thing worries you, then you could sit on the periphery of the classroom
00:02:34.300 | or maybe I suggest sunglasses and a fake mustache, that would be a good idea.
00:02:40.100 | There is a competition for the game that you see on the left.
00:02:43.400 | I'll describe exactly what's involved.
00:02:47.100 | In order to get credit for the course, you have to design a neural network
00:02:52.200 | that drives the car just above the speed limit of 65 miles an hour.
00:02:56.200 | But if you want to win, you need to go a little faster than that.
00:02:59.700 | So who this class is for?
00:03:05.500 | You may be new to programming, new to machine learning, new to robotics,
00:03:11.900 | or you're an expert in those fields but want to go back to the basics.
00:03:17.600 | So what you will learn is an overview of deep reinforcement learning,
00:03:21.300 | convolutional neural networks, recurrent neural networks,
00:03:25.900 | and how these methods can help improve each of the components of autonomous driving.
00:03:31.800 | Perception, visual perception, localization, mapping, control, planning,
00:03:37.300 | and the detection of driver state.
00:03:42.500 | Okay, two projects. Code name Deep Traffic is the first one.
00:03:46.700 | In this particular formulation of it, there are seven lanes.
00:03:52.100 | It's a top view. It looks like a game but I assure you it's very serious.
00:03:59.500 | The agent, the car in red,
00:04:03.300 | is being controlled by a neural network
00:04:06.800 | and we'll explain how you can control and design the various aspects,
00:04:12.100 | the various parameters of this neural network.
00:04:15.100 | And it learns in the browser.
00:04:19.100 | So this, we're using ConvNetJS,
00:04:21.900 | which is a library that is programmed by Andrej Karpathy in JavaScript.
00:04:27.000 | So amazingly, we live in a world where you can train in a matter of minutes
00:04:33.100 | a neural network in your browser.
00:04:35.200 | And we'll talk about how to do that.
00:04:37.800 | The reason we did this is so that there are very few requirements
00:04:42.800 | to get you up and running with neural networks.
00:04:46.300 | So in order to complete this project for the course,
00:04:50.800 | you don't need any requirements except to have a Chrome browser.
00:04:54.500 | And to win the competition, you don't need anything except a Chrome browser.
00:05:00.500 | The second project, code name Deep Tesla, or Tesla,
00:05:08.600 | uses data of the forward roadway from a Tesla vehicle
00:05:15.600 | and end-to-end learning: taking the image
00:05:18.800 | and putting it into a convolutional neural network
00:05:22.400 | that acts as a regressor mapping directly to a steering angle.
00:05:27.800 | So all it takes is a single image
00:05:30.600 | and it predicts a steering angle for the car.
00:05:33.600 | And we have data for the car itself
00:05:36.900 | and you get to build a neural network that tries to do better,
00:05:41.800 | tries to steer better or at least as good as the car.
00:05:45.000 | Okay, let's get started with the question,
00:05:51.600 | with the thing that we understand so poorly at this time
00:05:57.200 | because it's so shrouded in mystery, but which fascinates many of us.
00:06:01.800 | And that's the question of what is intelligence?
00:06:06.500 | This is from a March 1996 issue of Time magazine.
00:06:14.100 | And the question, can machines think, is answered below
00:06:19.000 | with "they already do, so what if anything is special about the human mind?"
00:06:24.800 | It's a good question for 1996, a good question for 2016,
00:06:29.900 | for 2017 now, and for the future.
00:06:33.000 | And there's two ways to ask that question.
00:06:35.700 | One is the special purpose version.
00:06:37.800 | Can an artificial intelligence system achieve a well-defined,
00:06:44.300 | specifically, formally defined finite set of goals?
00:06:49.200 | And there's this little diagram from a book that got me into artificial intelligence
00:06:55.700 | as a bright-eyed high school student,
00:06:57.300 | Artificial Intelligence: A Modern Approach.
00:07:02.200 | This is a beautifully simple diagram of a system.
00:07:08.400 | It exists in an environment.
00:07:10.300 | It has a set of sensors that do the perception.
00:07:15.300 | It takes those sensors in, does something magical,
00:07:19.300 | there's a question mark there,
00:07:20.500 | and with a set of effectors, acts in the world,
00:07:23.400 | manipulates objects in that world.
00:07:26.900 | And so special purpose, we can, under this formulation,
00:07:33.300 | as long as the environment is formally defined, well-defined,
00:07:36.500 | as long as a set of goals are well-defined,
00:07:39.100 | as long as a set of actions, sensors,
00:07:42.200 | and the ways that the perception carries itself out is well-defined,
00:07:47.800 | we have good algorithms of which we'll talk about
00:07:51.500 | that can optimize for those goals.
00:07:55.500 | The question is, if we inch along this path,
00:07:58.600 | will we get closer to the general formulation,
00:08:04.200 | to the general purpose version of what artificial intelligence is?
00:08:08.500 | Can it achieve poorly defined, unconstrained set of goals
00:08:14.000 | with an unconstrained, poorly defined set of actions,
00:08:16.900 | and unconstrained, poorly defined utility functions, rewards?
00:08:24.400 | This is what human life is about.
00:08:26.200 | This is what we do pretty well most days,
00:08:28.800 | exist in an undefined world, full of uncertainty.
00:08:37.000 | So, okay, we can separate tasks into three different categories.
00:08:43.000 | Formal tasks, this is the easiest.
00:08:46.000 | It doesn't seem so, it didn't seem so at the birth of artificial intelligence,
00:08:50.400 | but that's in fact true if you think about it.
00:08:52.800 | The easiest is the formal tasks, playing board games, theorem proving,
00:08:56.800 | all the kind of mathematical logic problems that can be formally defined.
00:09:02.400 | Then there's the expert tasks.
00:09:06.200 | So this is where a lot of the exciting breakthroughs have been happening,
00:09:12.100 | where machine learning methods, data-driven methods,
00:09:16.100 | can help aid or improve on the performance of our human experts.
00:09:22.800 | This means medical diagnosis, hardware design, scheduling.
00:09:26.900 | And then there is the thing that we take for granted,
00:09:30.500 | the trivial thing, the thing that we do so easily every day,
00:09:35.500 | when we wake up in the morning,
00:09:37.000 | the mundane tasks of everyday speech, of written language,
00:09:41.600 | of visual perception, of walking,
00:09:46.200 | which we'll talk about in today's lecture,
00:09:49.300 | is a fascinatingly difficult task.
00:09:52.800 | And object manipulation.
00:09:54.300 | So the question is that we're asking here,
00:09:58.700 | before we talk about deep learning,
00:10:00.700 | before we talk about the specific methods,
00:10:02.900 | we really want to dig in and try to see what is it about driving.
00:10:09.300 | How difficult is driving?
00:10:12.900 | Is it more like chess, which you see on the left there,
00:10:17.500 | where we can formally define a set of lanes, a set of actions,
00:10:21.200 | and formulate it as this, you know, there's five set of actions,
00:10:24.800 | you can change a lane, you can avoid obstacles,
00:10:27.500 | you can formally define an obstacle,
00:10:30.000 | you can formally define the rules of the road.
00:10:32.400 | Or is there something about natural language,
00:10:37.200 | something similar to everyday conversation about driving,
00:10:40.300 | that requires a much higher degree of reasoning,
00:10:43.400 | of communication, of learning,
00:10:49.200 | of existing in this underactuated space?
00:10:52.300 | Is it a lot more than just left lane, right lane, speed up, slow down?
00:10:58.400 | So let's look at it as a chess game.
00:11:03.000 | Here's the chess pieces.
00:11:04.400 | What are the sensors we get to work with
00:11:08.100 | on an autonomous vehicle?
00:11:10.100 | And we'll get a lot more in depth on this,
00:11:13.000 | especially with the guest speakers who built many of these.
00:11:17.300 | There are the range sensors, radar and lidar,
00:11:20.400 | that give you information about the obstacles in the environment,
00:11:24.400 | that help localize the obstacles in the environment.
00:11:28.200 | There's the visible light camera, the stereo vision,
00:11:31.700 | that gives you texture information,
00:11:34.700 | that helps you figure out not just where the obstacles are,
00:11:38.200 | but what they are, helps to classify those,
00:11:41.300 | helps to understand their subtle movements.
00:11:47.200 | Then there is the information about the vehicle itself,
00:11:49.700 | about the trajectory and the movement of the vehicle,
00:11:52.800 | that comes from the GPS and IMU sensors.
00:11:55.600 | And there is the rich state of the vehicle itself.
00:12:00.300 | What is it doing?
00:12:01.400 | What are all the individual systems doing?
00:12:04.100 | That comes from the CAN network.
00:12:06.600 | And there is one of the less studied,
00:12:11.000 | but fascinating to us on the research side,
00:12:13.400 | is audio, the sounds of the road.
00:12:17.900 | They provide the rich context of a wet road:
00:12:22.400 | the sound a road makes when it has stopped raining
00:12:25.300 | but the road is still wet.
00:12:28.100 | The screeching tire and honking,
00:12:32.800 | these are all fascinating signals as well.
00:12:35.000 | And the focus of the research in our group,
00:12:38.200 | the thing that's really very much under-investigated,
00:12:44.500 | is the internal facing sensors.
00:12:47.400 | The driver, sensing the state of the driver.
00:12:52.200 | Where are they looking? Are they sleepy?
00:12:55.100 | The emotional state, are they in the seat at all?
00:12:58.500 | And the same with audio.
00:13:01.800 | That comes from the visual information and the audio information.
00:13:06.300 | More than that, here's the tasks.
00:13:11.300 | If you were to break into modules,
00:13:13.100 | the task of what it means to build a self-driving vehicle.
00:13:17.300 | First, you want to know where you are, where am I?
00:13:20.300 | Localization and mapping.
00:13:22.300 | You want to map the external environment,
00:13:24.900 | figure out where all the different obstacles are,
00:13:29.600 | all the entities are, and use that estimate of the environment
00:13:34.100 | to then figure out where I am, where the robot is.
00:13:38.100 | Then there's scene understanding.
00:13:40.500 | It's understanding not just the positional aspects
00:13:44.700 | of the external environment and the dynamics of it,
00:13:48.200 | but also what those entities are.
00:13:51.100 | Is it a car? Is it a pedestrian? Is it a bird?
00:13:54.000 | There's movement planning.
00:13:57.800 | Once you have kind of figured out to the best of your abilities,
00:14:01.700 | your position and the position of other entities in this world,
00:14:06.200 | there's figuring out a trajectory through that world.
00:14:09.000 | And finally, once you've figured out how to move about,
00:14:14.500 | safely and effectively through that world,
00:14:17.200 | it's figuring out what the human that's on board is doing.
00:14:20.600 | Because as I will talk about, the path to a self-driving vehicle,
00:14:25.700 | and that is hence our focus on Tesla,
00:14:28.800 | may go through semi-autonomous vehicles,
00:14:34.600 | where the vehicle must not only drive itself,
00:14:40.900 | but effectively hand over control from the car
00:14:45.200 | to the human and back.
00:14:46.900 | Okay, quick history.
00:14:50.300 | Well, there's a lot of fun stuff from the 80s and 90s, but
00:14:54.000 | the big breakthroughs came in the second DARPA Grand Challenge
00:15:02.700 | with Stanford's Stanley, which won the competition,
00:15:06.400 | one of five cars that finished.
00:15:08.500 | This was an incredible accomplishment.
00:15:12.400 | In a desert race, a fully autonomous vehicle was able to complete the race
00:15:18.900 | in record time.
00:15:21.800 | The DARPA Urban Challenge in 2007,
00:15:32.000 | where the task was no longer a race through the desert,
00:15:37.600 | but through an urban environment.
00:15:41.400 | And CMU's Boss, with GM, won that race.
00:15:46.700 | And a lot of that work led directly to
00:15:55.500 | large, major industry players
00:16:00.600 | taking on the challenge of building these vehicles.
00:16:02.600 | Google, now Waymo, self-driving car.
00:16:09.100 | Tesla, with its Autopilot system and now Autopilot 2 system.
00:16:13.700 | Uber, with its testing in Pittsburgh.
00:16:17.500 | And there's many other companies,
00:16:20.400 | including one of the speakers for this course, from nuTonomy,
00:16:24.000 | that are driving the wonderful streets of Boston.
00:16:29.800 | Okay, so let's take a step back.
00:16:35.400 | We have, if we think about the accomplishments in the DARPA challenge
00:16:39.700 | and if we look at the accomplishments of the Google self-driving car,
00:16:45.400 | which essentially boils the world down into a chess game.
00:16:50.000 | It uses incredibly accurate sensors
00:16:56.800 | to build a three-dimensional map of the world,
00:16:59.400 | localize itself effectively in that world and move about that world
00:17:04.000 | in a very well-defined way.
00:17:08.500 | Now, what if driving, the open question is,
00:17:16.000 | if driving is more like a conversation,
00:17:18.600 | like a natural language conversation,
00:17:21.200 | how hard is it to pass the Turing test?
00:17:24.200 | The Turing test, in the popular current formulation, is:
00:17:28.900 | can a computer be mistaken for a human being
00:17:33.100 | more than 30% of the time?
00:17:34.600 | When a human is talking behind a veil,
00:17:37.700 | having a conversation with either a computer or a human,
00:17:40.500 | they mistake the other side of that conversation
00:17:43.900 | for being a human when it's in fact a computer.
00:17:47.900 | And the way you would
00:17:55.600 | build a system that successfully passes the Turing test
00:17:58.900 | is, first, the natural language processing part,
00:18:02.900 | to enable it to communicate successfully.
00:18:05.200 | So generate language and interpret language,
00:18:09.100 | then you represent knowledge, the state of the conversation,
00:18:13.300 | transferred over time.
00:18:14.700 | And the last piece, and this is the hard piece,
00:18:18.600 | is the automated reasoning.
00:18:20.100 | Is reasoning, can we teach machine learning methods to reason?
00:18:30.200 | That is something that will propagate through our discussion
00:18:33.800 | because, as I will talk about, the various methods,
00:18:40.400 | the various deep learning methods, neural networks,
00:18:44.200 | are good at learning from data.
00:18:48.100 | But they're not yet, there's no good mechanism for reasoning.
00:18:52.800 | Now, reasoning could be just something
00:18:56.800 | that we tell ourselves we do to feel special,
00:19:00.200 | better, to feel like we're better than machines.
00:19:03.700 | Reasoning may be something as simple as learning from data.
00:19:09.600 | We just need a larger network.
00:19:13.000 | Or there could be a totally different mechanism required
00:19:18.000 | and we'll talk about the possibilities there.
00:19:24.000 | Can you go back to the US for example?
00:19:25.800 | Okay, so we talked about the video,
00:19:27.400 | so which state is that?
00:19:29.400 | The top states of the US or other states?
00:19:33.200 | No, it's very difficult to find these kind of situations
00:19:36.300 | in the United States.
00:19:37.300 | So the question was, for this video,
00:19:39.600 | is it in the United States or not?
00:19:42.200 | I believe it's in Tokyo.
00:19:46.600 | So India, a few European countries,
00:19:53.900 | are much more towards the direction of
00:20:00.300 | natural language versus chess.
00:20:04.600 | In the United States, generally speaking,
00:20:08.900 | we follow rules more concretely.
00:20:11.100 | The quality of roads is better,
00:20:13.000 | the marking on the roads is better,
00:20:14.800 | so there's less requirements there.
00:20:18.000 | I'm not sure it's going to be Tokyo,
00:20:19.700 | because they drive on the left side,
00:20:21.300 | but India is going to the right side.
00:20:23.100 | So Japan is less likely to be the place.
00:20:27.100 | These cars are driving on the left side?
00:20:30.800 | No, but they drive on the right side of the road,
00:20:33.700 | just like in the US.
00:20:35.100 | I see.
00:20:36.900 | I just, okay.
00:20:38.600 | Yeah, you're right, it is, because, yep.
00:20:40.900 | Yeah, so, but it's certainly not the United States.
00:20:43.900 | I'm pretty sure.
00:20:46.700 | I spent quite a bit of time Googling,
00:20:48.200 | trying to find this kind of situation in the United States,
00:20:50.000 | and it's difficult.
00:20:51.300 | So let's talk about
00:20:57.000 | the recent breakthroughs in machine learning,
00:21:02.000 | and what is at the core of those breakthroughs.
00:21:05.200 | It's neural networks
00:21:07.700 | that have been around for a long time,
00:21:11.900 | and I will talk about what has changed,
00:21:14.000 | what are the cool new things.
00:21:17.200 | And what hasn't changed,
00:21:18.600 | and what are its possibilities.
00:21:20.300 | But first, a neuron,
00:21:22.400 | crudely,
00:21:25.100 | is a computational building block of the brain.
00:21:30.000 | I know there's a few folks here,
00:21:32.700 | neuroscience folks.
00:21:34.000 | This is hardly a model.
00:21:38.500 | It is mostly an inspiration.
00:21:42.300 | And so,
00:21:45.200 | the human neuron
00:21:46.800 | has inspired the artificial neuron,
00:21:50.700 | the computational building block of a neural network,
00:21:54.100 | of an artificial neural network.
00:21:56.200 | Now to give you some context,
00:21:58.800 | these neurons, for both artificial and human brains,
00:22:05.300 | are interconnected.
00:22:06.900 | In the human brain, there's about,
00:22:12.700 | I believe, 10,000 outgoing connections from every neuron,
00:22:16.100 | on average.
00:22:18.400 | And they're interconnected to each other.
00:22:21.500 | The largest current,
00:22:25.600 | as far as I'm aware,
00:22:27.300 | artificial neural network
00:22:29.300 | has 10 billion of those connections, synapses.
00:22:34.300 | Our human brain, to the best estimate,
00:22:37.600 | that I'm aware of,
00:22:41.300 | has 10,000 times that:
00:22:46.600 | 100 to 1,000 trillion synapses.
00:22:50.700 | Now what is an artificial neuron?
00:22:59.200 | This building block of a neural network.
00:23:02.900 | It takes a set of inputs,
00:23:06.000 | it puts a weight on each of those inputs,
00:23:10.400 | sums them together,
00:23:11.500 | applies a bias value on each,
00:23:16.600 | that sits on each neuron,
00:23:18.900 | and using an activation function that takes as input
00:23:23.600 | that sum plus the bias
00:23:27.600 | and squishes it together
00:23:29.700 | to produce a zero to one signal.
00:23:38.800 | And this allows a single neuron to
00:23:41.200 | take a few inputs and produce an output,
00:23:46.400 | a classification, for example, a zero or a one.
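The forward pass just described fits in a few lines of Python. Here is a minimal sketch, using the sigmoid as one common choice of activation that squashes to the zero-to-one range; the example inputs and weights are made up for illustration:

```python
import math

def sigmoid(z):
    # squash any real number into the (0, 1) range
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    # weighted sum of the inputs, plus the bias that sits on the neuron,
    # passed through the activation function
    total = sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(total + bias)

output = neuron([1.0, 2.0], [0.5, -0.25], 0.1)   # ≈ 0.525 for these made-up weights
```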
00:23:50.200 | And as we'll talk about,
00:23:53.100 | simply,
00:23:54.300 | it can serve as a linear classifier.
00:23:59.500 | So it can draw a line,
00:24:01.800 | it can learn to draw a line between,
00:24:05.400 | like what's seen here, between the blue dots,
00:24:08.700 | and the yellow dots.
00:24:10.500 | And that's exactly what we'll do
00:24:12.500 | in the IPython notebook that I'll talk about.
00:24:15.300 | But the basic algorithm is,
00:24:19.600 | you initialize the weights on the inputs,
00:24:23.600 | and you compute the output.
00:24:28.800 | You perform this previous operation I talked about, sum up,
00:24:33.000 | compute the output.
00:24:34.800 | And if the output,
00:24:37.900 | does not match the ground truth,
00:24:40.400 | the expected output, the output that it should produce,
00:24:44.600 | the weights are punished accordingly.
00:24:47.400 | And we'll talk through a little bit of the math of that.
00:24:51.800 | And this process is repeated until the perceptron
00:24:57.200 | does not make any more mistakes.
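That loop, initialize the weights, compute the output, punish the weights whenever the output disagrees with the ground truth, and repeat until there are no more mistakes, can be written out directly. A minimal Python sketch of the classic perceptron rule; the dots and the learning rate are made up for illustration:

```python
def perceptron_train(samples, epochs=100, lr=1.0):
    """samples: list of ((x1, x2), label) pairs with label in {0, 1}."""
    w = [0.0, 0.0]   # initialize the weights on the inputs
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), label in samples:
            # compute the output: threshold the weighted sum
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = label - pred        # 0 if correct, +1 or -1 if wrong
            if err != 0:
                mistakes += 1
                # punish the weights accordingly, nudging the line
                # toward putting this point on the correct side
                w[0] += lr * err * x1
                w[1] += lr * err * x2
                b += lr * err
        if mistakes == 0:             # repeat until no more mistakes
            break
    return w, b

# two linearly separable clusters: the "blue dots" (0) and the "yellow dots" (1)
dots = [((0.0, 0.0), 0), ((1.0, 0.2), 0), ((3.0, 3.0), 1), ((4.0, 2.5), 1)]
w, b = perceptron_train(dots)
```

For linearly separable data like this, the loop is guaranteed to terminate with a line that separates the two clusters.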
00:25:00.500 | Now here's,
00:25:06.300 | the amazing thing about neural networks.
00:25:09.800 | There's several, I'll talk about them.
00:25:12.000 | One on the mathematical side,
00:25:18.000 | is the universality of neural networks.
00:25:21.500 | With just a single layer, if we stack them together,
00:25:24.300 | a single hidden layer,
00:25:25.600 | the inputs on the left, the outputs on the right,
00:25:29.600 | and in the middle there's a single hidden layer.
00:25:32.800 | It can,
00:25:34.700 | closely approximate any function.
00:25:37.500 | Any function.
00:25:39.100 | So this is an incredible property.
00:25:42.900 | That with a single layer,
00:25:47.100 | it can approximate any function
00:25:48.800 | you can think of.
00:25:52.900 | And you can think of driving as a function.
00:25:56.500 | It takes as input
00:25:57.800 | the world outside, and produces as output
00:26:01.400 | the control of the vehicle.
00:26:04.800 | There exists a neural network out there that can drive,
00:26:07.500 | perfectly.
00:26:08.700 | It's a fascinating mathematical fact.
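The standard intuition behind this universality result (stated for continuous functions on a bounded interval) is that a pair of steep sigmoid neurons forms a "bump", and a single hidden layer with enough bumps side by side can trace out any curve. A toy Python sketch, approximating x squared on [0, 1]; the unit count and steepness are arbitrary illustrative choices:

```python
import math

def sigmoid(z):
    # numerically safe sigmoid, stable for large positive or negative z
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def bump(x, lo, hi, steepness=1000.0):
    # two steep sigmoid units make a function that is ~1 on [lo, hi], ~0 outside
    return sigmoid(steepness * (x - lo)) - sigmoid(steepness * (x - hi))

def one_hidden_layer_approx(f, n_bumps=50):
    # a single hidden layer of 2 * n_bumps sigmoid units: each bump holds
    # the target function's value at the center of its little interval
    width = 1.0 / n_bumps
    centers = [(i + 0.5) * width for i in range(n_bumps)]
    def h(x):
        return sum(f(c) * bump(x, c - width / 2, c + width / 2) for c in centers)
    return h

h = one_hidden_layer_approx(lambda x: x * x)
```

Narrowing the bumps (more hidden units) drives the approximation error down, which is the constructive heart of the universality proof.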
00:26:11.900 | So we can think of these functions, then, as special purpose
00:26:20.900 | functions, special purpose intelligence.
00:26:23.100 | You can take,
00:26:24.700 | say as input,
00:26:26.000 | the number of bedrooms,
00:26:28.900 | the square feet,
00:26:31.900 | type of neighborhood,
00:26:34.100 | those are the three inputs.
00:26:36.900 | It passes those values through to the hidden layer,
00:26:40.900 | and then one more step,
00:26:42.800 | it produces the final price estimate,
00:26:45.200 | for the house,
00:26:46.300 | or for the residence.
00:26:47.900 | And we can teach a network to do this pretty well,
00:26:51.900 | in a supervised way.
00:26:53.500 | This is supervised learning.
00:26:54.900 | You provide,
00:26:56.500 | a lot of examples,
00:26:58.100 | where you know the number of bedrooms, the square feet,
00:27:00.900 | the type of neighborhood,
00:27:02.500 | and then you also know the final price,
00:27:04.600 | of the house,
00:27:05.900 | or the residence.
00:27:07.200 | And then you can,
00:27:08.800 | as I'll talk about through,
00:27:11.000 | a process of back propagation,
00:27:13.000 | teach these networks,
00:27:16.000 | make this prediction,
00:27:17.500 | pretty well.
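A minimal sketch of that supervised setup, using a linear model (a single neuron without the squashing, rather than a full network) trained by gradient descent; the pricing rule and numbers below are invented for illustration, not real housing data:

```python
# made-up ground-truth examples:
# (bedrooms, sq ft in thousands, neighborhood code) -> price in $k
examples = [
    ((2.0, 1.0, 0.0), 260.0),
    ((3.0, 1.5, 1.0), 380.0),
    ((4.0, 2.0, 1.0), 480.0),
    ((5.0, 3.0, 2.0), 650.0),
]

w = [0.0, 0.0, 0.0]   # one weight per input
b = 0.0               # bias
lr = 0.01             # learning rate

def predict(x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

for _ in range(20000):
    # gradient of mean squared error over the whole (tiny) dataset
    grad_w = [0.0, 0.0, 0.0]
    grad_b = 0.0
    for x, y in examples:
        err = predict(x) - y
        for i in range(3):
            grad_w[i] += 2.0 * err * x[i] / len(examples)
        grad_b += 2.0 * err / len(examples)
    # step the weights against the gradient
    for i in range(3):
        w[i] -= lr * grad_w[i]
    b -= lr * grad_b
```

After training, the model's predictions on the labeled examples land close to the known prices, which is exactly the supervised-learning loop: examples in, error out, weights adjusted.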
00:27:21.100 | Some of the exciting breakthroughs recently
00:27:24.200 | have been in general purpose intelligence.
00:27:28.800 | This is from
00:27:31.200 | Andrej Karpathy,
00:27:33.400 | who is now at OpenAI.
00:27:38.400 | I would like to
00:27:41.300 | take a moment here
00:27:43.100 | to try to explain how amazing this is.
00:27:45.600 | This is a game of Pong.
00:27:47.500 | If you're not familiar with Pong,
00:27:51.400 | there's two paddles,
00:27:54.000 | and you're trying to,
00:27:55.900 | bounce the ball back,
00:27:59.400 | and in such a way that,
00:28:01.100 | prevents the other guy from bouncing the ball back at you.
00:28:05.600 | On the,
00:28:08.800 | the artificial intelligence agents on the right in green,
00:28:13.600 | and up top is the score,
00:28:15.500 | eight to one.
00:28:16.700 | Now this takes,
00:28:18.400 | about three days to train,
00:28:20.400 | on a regular computer,
00:28:22.000 | this network.
00:28:22.900 | What is,
00:28:24.200 | this network doing?
00:28:26.400 | It's called the policy network.
00:28:29.000 | The input,
00:28:29.800 | is the raw,
00:28:31.000 | pixels.
00:28:33.000 | They're slightly
00:28:34.600 | processed, and also you take the difference between
00:28:38.600 | two frames,
00:28:40.900 | but it's basically the raw pixel information.
00:28:43.600 | That's the input.
00:28:45.300 | There's,
00:28:46.800 | a few hidden layers,
00:28:48.500 | and the output is a single probability of moving up.
00:28:52.200 | That's it.
00:28:57.600 | That's the whole system.
00:28:59.700 | And here is what it's doing.
00:29:02.700 | It learns, even though
00:29:06.700 | you don't know,
00:29:08.500 | at any one moment,
00:29:10.000 | what the right thing to do is.
00:29:13.300 | Is it to move up? Is it to move down?
00:29:15.500 | You only know,
00:29:17.100 | what the right thing to do is,
00:29:19.900 | by the fact that eventually you win or lose the game.
00:29:23.000 | So this is the amazing thing here:
00:29:26.800 | there's no supervised learning,
00:29:30.200 | no
00:29:31.600 | universal fact about
00:29:33.600 | any one state being good or bad,
00:29:35.800 | or any one action being good or bad in any state.
00:29:38.600 | But you punish
00:29:40.800 | or reward
00:29:44.000 | every single action you took,
00:29:47.800 | for the entire game,
00:29:49.900 | based on the result.
00:29:49.900 | So no matter what you did, if you won the game,
00:29:53.200 | the end justifies the means.
00:29:56.600 | If you won the game,
00:29:57.700 | every action you took, and every action,
00:30:00.300 | state pair, gets rewarded.
00:30:03.000 | If you lost the game,
00:30:04.200 | it gets punished.
00:30:05.400 | And this process,
00:30:07.800 | with only 200,000 games,
00:30:10.400 | where the,
00:30:12.200 | system just simulates the games,
00:30:14.200 | it can learn to beat the computer.
00:30:17.200 | This system knows nothing about Pong,
00:30:21.200 | nothing about games.
00:30:23.000 | This is general intelligence.
00:30:27.000 | Except for the fact,
00:30:28.800 | that it's just a game of Pong.
00:30:31.400 | And I will,
00:30:33.400 | talk about,
00:30:36.200 | how this can,
00:30:38.200 | be extended further, why this is so promising,
00:30:41.200 | and why this is also,
00:30:43.400 | we should proceed with caution.
00:30:47.200 | So again,
00:30:49.200 | there's a set of actions you take,
00:30:52.500 | up, down, up, down.
00:30:54.800 | There's a threshold:
00:30:56.300 | given the probability of moving up,
00:30:57.800 | you move up or down based on the output of the network.
00:31:00.200 | And you have a set of states.
00:31:04.600 | And every single state action pair is rewarded if there's a win,
00:31:09.200 | and it's punished,
00:31:10.600 | if there's a loss.
00:31:11.800 | When you go home,
00:31:18.000 | think about how amazing that is.
00:31:22.000 | And if you don't understand why that's amazing,
00:31:25.200 | spend some time on it.
00:31:26.600 | It's incredible.
00:31:28.300 | Sure, sure thing.
00:31:36.600 | The question was,
00:31:38.600 | what is supervised learning, what is unsupervised learning,
00:31:41.600 | what's the difference?
00:31:42.400 | So supervised learning,
00:31:44.500 | is, when people talk about machine learning,
00:31:47.100 | they mean supervised learning most of the time.
00:31:49.000 | Supervised learning is,
00:31:55.100 | learning from data.
00:31:56.500 | It's learning from example.
00:31:58.700 | When you have a set of inputs and a set of outputs,
00:32:01.200 | that you know are correct,
00:32:03.300 | what are called ground truth.
00:32:04.700 | So you need those examples,
00:32:08.100 | a large amount of them,
00:32:09.800 | to train any of the machine learning algorithms,
00:32:12.600 | to learn to then generalize that to future examples.
00:32:23.800 | Actually, there's a third one, called reinforcement learning,
00:32:26.400 | where the ground truth is sparse.
00:32:32.000 | The information about,
00:32:34.400 | when something is good or not,
00:32:37.900 | the ground truth only happens every once in a while,
00:32:40.500 | at the end of the game,
00:32:41.300 | not every single frame.
00:32:42.900 | And unsupervised learning is when you have no information,
00:32:46.900 | about the outputs,
00:32:48.600 | that are correct or incorrect.
00:32:52.000 | And the excitement
00:32:55.000 | of the deep learning community
00:32:57.400 | is unsupervised learning.
00:32:58.600 | But it has achieved no major breakthroughs at this point.
00:33:03.000 | I'll talk about
00:33:05.200 | what the future of deep learning is,
00:33:07.400 | and a lot of the people that are working in the field
00:33:10.100 | are excited by it.
00:33:11.100 | But right now,
00:33:12.500 | any interesting accomplishment,
00:33:14.800 | has to do with supervised learning.
00:33:19.000 | [INAUDIBLE]
00:33:24.300 | And the brown one is just a heuristic solution,
00:33:27.800 | like look at the velocity.
00:33:29.900 | So basically the reinforcement learning here,
00:33:34.100 | is learning from somebody who has certain rules.
00:33:38.400 | And how can that be guaranteed,
00:33:42.600 | that it would generalize to somebody else?
00:33:47.100 | So the question was,
00:33:49.900 | the green paddle learns to play this game successfully,
00:33:56.500 | against this specific one brown paddle,
00:33:58.900 | operating under specific kinds of rules.
00:34:01.100 | How do we know it can generalize to other games,
00:34:04.900 | other things?
00:34:05.700 | And it can't.
00:34:07.000 | But the mechanism by which it learns generalizes.
00:34:11.200 | So,
00:34:16.400 | as long as you let it play
00:34:19.200 | in whatever world you want it to succeed in,
00:34:27.200 | long enough,
00:34:28.500 | it will use the same approach to learn to succeed in that world.
00:34:33.300 | The problem is,
00:34:34.900 | this works for worlds you can simulate well.
00:34:38.700 | Unfortunately, one of the big challenges of neural networks,
00:34:45.900 | is that they're not currently efficient learners.
00:34:48.400 | We need a lot of data to learn anything.
00:34:50.900 | Human beings need one example,
00:34:53.800 | oftentimes, and they learn very efficiently from that one example.
00:34:58.000 | And again, I'll talk about that as well.
00:35:03.300 | It's a good question.
00:35:04.600 | So the drawbacks of neural networks.
00:35:07.900 | So if you think about the way a human being would approach this game,
00:35:12.200 | this game of Pong,
00:35:14.400 | they would only need a simple set of instructions.
00:35:16.700 | You're in control of a paddle,
00:35:19.300 | and you can move it up and down.
00:35:21.800 | And your task is to bounce the ball past the other player,
00:35:25.700 | controlled by AI.
00:35:27.900 | Now, a human being,
00:35:33.700 | they may not win the game,
00:35:34.900 | but they would immediately understand the game,
00:35:36.700 | and would be able to successfully play it well enough
00:35:39.700 | to pretty quickly learn to beat the game.
00:35:44.300 | But they need to have a concept of control.
00:35:46.600 | What it means to control a paddle.
00:35:48.200 | They need to have a concept of a paddle.
00:35:49.900 | They need to have a concept of moving up and down,
00:35:52.700 | and a ball, and bouncing.
00:35:55.700 | They have to have
00:35:56.400 | at least a loose concept of real-world physics
00:36:00.300 | that they can then project
00:36:03.000 | onto the two-dimensional world.
00:36:04.600 | All of these concepts
00:36:06.700 | are concepts that you come to the table with.
00:36:10.000 | That's knowledge.
00:36:13.400 | And the way you transfer that knowledge
00:36:16.900 | from your previous experience,
00:36:19.800 | from childhood to now,
00:36:22.200 | when you come to this game,
00:36:23.400 | that is something called reasoning.
00:36:27.200 | Whatever reasoning means.
00:36:29.700 | And the question is whether through this same kind of process,
00:36:34.200 | you can see the entire world
00:36:37.800 | as a game of pong.
00:36:43.200 | And reasoning is simply the ability to simulate
00:36:47.100 | that game in your mind
00:36:50.100 | and learn very efficiently,
00:36:53.100 | much more efficiently than 200,000 iterations.
00:36:55.700 | The other challenge of deep neural networks
00:37:00.900 | and machine learning broadly is that you need big data
00:37:03.500 | and efficient learners, as I said.
00:37:05.400 | That data also needs to be supervised data.
00:37:08.900 | You need to have ground truth,
00:37:11.000 | which is very costly.
00:37:13.100 | So, annotation:
00:37:15.100 | a human being looking at a particular image, for example,
00:37:19.700 | and labeling that as something,
00:37:21.300 | as a cat or a dog,
00:37:23.300 | whatever object is in the image,
00:37:25.000 | that's very costly.
00:37:26.200 | And particularly for neural networks,
00:37:31.200 | there's a lot of parameters to tune.
00:37:36.200 | There's a lot of hyperparameters.
00:37:38.800 | You need to figure out the network structure first.
00:37:42.000 | How does this network look?
00:37:43.300 | How many layers?
00:37:44.200 | How many hidden nodes?
00:37:45.300 | What type of activation function in each node?
00:37:52.300 | There's a lot of hyperparameters there.
00:37:54.200 | And then once you built your network,
00:37:56.100 | there's parameters for how you teach that network.
00:38:00.400 | There's learning rate, loss function,
00:38:03.300 | mini-batch size, number of training iterations,
00:38:07.200 | gradient update smoothing,
00:38:09.000 | and even selecting the optimizer
00:38:14.100 | with which you solve the various differential equations involved.
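The two families of knobs just listed, architecture hyperparameters fixed before training and training hyperparameters that govern optimization, can be made concrete with a sketch. All the specific values below are hypothetical; the point is how quickly even a coarse grid explodes.

```python
# Hypothetical hyperparameter grid: architecture choices (fixed before
# training) and training choices (how you teach the network). Counting
# the combinations shows why tuning is so laborious.
from itertools import product

architecture = {
    "num_layers": [2, 3, 4],
    "hidden_units": [64, 128, 256],
    "activation": ["relu", "tanh", "sigmoid"],
}
training = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "batch_size": [32, 128],
    "optimizer": ["sgd", "adam"],
}

search_space = {**architecture, **training}
combos = list(product(*search_space.values()))
print(len(combos))  # 3*3*3 * 3*2*2 = 324 configurations from a tiny grid
```

Each of those 324 configurations would in principle need its own training run to evaluate, which is why hyperparameter search is a research topic in its own right.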
00:38:20.000 | It's a topic of many research papers, certainly.
00:38:28.200 | It's rich enough for research papers,
00:38:30.100 | but it's also really challenging.
00:38:31.900 | It means that you can't just plop a network down
00:38:35.000 | and it will solve the problem generally.
00:38:37.700 | And defining a good loss function,
00:38:42.600 | or in the case of Pong or games,
00:38:45.300 | a good reward function is difficult.
00:38:49.400 | So here's a game.
00:38:51.100 | This is a recent result from OpenAI,
00:38:54.600 | teaching a network to play the game of Coast Runners.
00:39:03.400 | And the goal of Coast Runners is to go,
00:39:09.200 | you're in a boat, the task is to go around a track
00:39:13.300 | and successfully complete a race
00:39:16.700 | against other people you're racing against.
00:39:19.100 | Now, this network is an optimal one.
00:39:23.100 | And what it's figured out is that actually, in the game,
00:39:26.900 | it gets a lot of points for collecting certain objects along the path.
00:39:33.000 | So what you see is it's figured out to go in a circle
00:39:36.600 | and collect those green turbo things.
00:39:40.700 | And what it's figured out is you don't need to complete the game
00:39:45.700 | to earn the reward.
00:39:47.300 | Now,
00:39:54.200 | despite being on fire and hitting the wall
00:40:00.000 | and going through this whole process,
00:40:01.900 | it's actually achieved at least a local optimum
00:40:05.700 | given the reward function of maximizing the number of points.
00:40:10.200 | And so it's figured out a way to earn a higher reward
00:40:16.800 | while ignoring the implied bigger picture goal of finishing the race,
00:40:20.300 | which us as humans understand much better.
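The misalignment in the boat example comes down to simple arithmetic, and a toy version makes it obvious. The numbers below are mine, purely illustrative: if looping for pickups pays more points than finishing, a pure point-maximizer never finishes the race.

```python
# Toy reward misalignment: a one-time finish bonus versus an endless
# stream of pickup points. Greedy reward maximization prefers looping,
# exactly the local optimum the Coast Runners boat found.

def total_reward(policy, steps=100, finish_bonus=10, pickup_points=2):
    if policy == "finish":
        return finish_bonus           # race ends, no further reward
    if policy == "loop":
        return steps * pickup_points  # keep circling, keep scoring
    raise ValueError(policy)

best = max(["finish", "loop"], key=total_reward)
print(best, total_reward(best))  # loop 200
```

Nothing in the reward function says "finish the race", so the maximizer has no reason to; the implied goal lived only in the designer's head.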
00:40:24.400 | This raises ethical questions for self-driving cars.
00:40:31.400 | Besides other questions, you can watch this for hours
00:40:35.600 | and it will do that for hours.
00:40:37.800 | And that's the point:
00:40:39.400 | it's hard to teach, it's hard to encode
00:40:47.300 | the formally defined utility function
00:40:53.800 | under which an intelligent system needs to operate.
00:40:56.300 | And that's made obvious even in a simple game.
00:40:59.600 | And so what is, yep, question.
00:41:01.600 | So the question was, what's an example of a local optimum
00:41:12.400 | that an autonomous car could reach, similar to the Coast Runners race,
00:41:15.800 | what would be the example in the real world for an autonomous vehicle?
00:41:18.700 | And it's a touchy subject,
00:41:22.800 | but it would certainly have to be involved
00:41:27.800 | with the choices we make under near crashes and crashes.
00:41:34.600 | The choices a car makes when to avoid,
00:41:37.900 | for example, if there's a crash imminent
00:41:41.200 | and there's no way you can stop to prevent the crash,
00:41:45.200 | do you keep the driver safe or do you keep the other people safe?
00:41:51.800 | And there has to be some,
00:41:56.200 | even if you don't choose to acknowledge it,
00:42:03.200 | even if it's only in the data and the learning that you do,
00:42:06.600 | there's an implied reward function there.
00:42:08.800 | And we need to be aware of what that reward function is,
00:42:12.600 | because it may find something.
00:42:14.600 | Until we actually see it, we won't know it.
00:42:17.600 | Once we see it, we'll realize that,
00:42:20.500 | "Oh, that was a bad design."
00:42:23.600 | And that's the scary thing.
00:42:25.000 | It's hard to know ahead of time what that is.
00:42:27.800 | So the recent breakthroughs from deep learning came
00:42:34.900 | from several factors.
00:42:38.600 | First is the compute.
00:42:39.700 | Moore's law.
00:42:41.800 | CPUs are getting faster, 100 times faster every decade.
00:42:45.300 | Then there's GPUs.
00:42:49.100 | Also, the ability to train neural networks on GPUs
00:42:53.100 | and now ASICs has created a lot of capabilities
00:43:00.300 | in terms of energy efficiency
00:43:02.200 | and being able to train larger networks more efficiently.
00:43:08.000 | Well, first of all,
00:43:13.000 | in the 21st century, there's digitized data.
00:43:15.700 | There's larger data sets of digital data.
00:43:19.300 | And now that data is becoming more organized,
00:43:23.200 | not just vaguely available data out there on the internet.
00:43:27.900 | It's actual organized data sets like ImageNet.
00:43:31.000 | Certainly for natural language, there's large data sets.
00:43:35.000 | There is the algorithm innovations.
00:43:38.400 | Backprop, backpropagation, convolutional neural networks, LSTMs,
00:43:43.500 | all these different architectures for dealing with
00:43:47.200 | specific types of domains and tasks.
00:43:49.600 | Then there's the huge one, infrastructure,
00:43:53.800 | on the software and the hardware side.
00:43:57.000 | There's Git, the ability to share software in an open-source way.
00:44:01.100 | There are pieces of software that make robotics
00:44:08.600 | and machine learning easier:
00:44:10.200 | ROS, TensorFlow.
00:44:12.100 | There's Amazon Mechanical Turk,
00:44:16.600 | which allows for efficient, cheap annotation of large scale data sets.
00:44:21.400 | There's AWS in the cloud hosting machine learning,
00:44:26.500 | hosting the data and the compute.
00:44:28.800 | And then there's a financial backing of large companies,
00:44:32.800 | Google, Facebook, Amazon.
00:44:35.000 | But really, nothing has changed.
00:44:38.800 | There really have not been any significant breakthroughs.
00:44:42.300 | The convolutional neural networks we're using
00:44:44.700 | have been around since the 90s.
00:44:46.600 | Neural networks have been around since the 60s.
00:44:48.800 | There's been a few improvements.
00:44:52.000 | That's in terms of methodology.
00:44:57.100 | The compute has really been the workhorse.
00:45:00.200 | The ability to get a hundredfold improvement every decade
00:45:05.500 | holds promise.
00:45:08.800 | And the question is whether, for that reasoning thing I talked about,
00:45:12.100 | all you need is a larger network.
00:45:15.800 | That is the open question.
00:45:16.900 | So some terms for deep learning.
00:45:22.500 | First of all, deep learning is a PR term for neural networks.
00:45:30.000 | It is a term for deep neural networks,
00:45:38.900 | for neural networks that have many layers.
00:45:40.800 | It is a symbolic term for the newly gained capabilities
00:45:45.300 | that compute has brought us,
00:45:46.700 | that training on GPUs has brought us.
00:45:50.200 | So deep learning is a subset of machine learning.
00:45:54.300 | There's many other methods that are still effective.
00:45:56.900 | The terms that will come up in this class are,
00:46:02.000 | first of all, multi-layer perceptron,
00:46:04.600 | deep neural networks, recurrent neural networks,
00:46:07.600 | LSTM, long short-term memory networks,
00:46:10.200 | CNN or ConvNets, convolutional neural networks,
00:46:14.400 | deep belief networks.
00:46:15.600 | And the operations that will come up are convolution, pooling,
00:46:18.900 | activation functions, and backpropagation.
00:46:21.900 | Yep, cool question.
00:46:26.100 | [INAUDIBLE]
00:46:48.100 | So the question was,
00:46:50.300 | what is the purpose of the different layers in a neural network?
00:46:54.100 | What does it mean to have one configuration versus another?
00:46:56.800 | So for a neural network having several layers,
00:47:01.500 | the only thing you have an understanding of
00:47:05.900 | is the inputs and the outputs.
00:47:08.500 | You don't have a good understanding about what each layer does.
00:47:12.800 | They're mysterious things, neural networks.
00:47:16.900 | So I'll talk about how with every layer it forms a higher level,
00:47:23.300 | a higher order representation of the input.
00:47:26.500 | So it's not like the first layer does localization,
00:47:29.900 | the second layer does path planning,
00:47:31.600 | the third layer does navigation,
00:47:35.500 | how you get from here to Florida.
00:47:36.900 | Or maybe it does, but we don't know.
00:47:40.800 | So we know, we're beginning to visualize neural networks
00:47:45.600 | for simple tasks, like for ImageNet, classifying cats versus dogs.
00:47:51.600 | We can tell what is the thing that the first layer does,
00:47:54.800 | the second layer, the third layer, and we'll look at that.
00:47:57.200 | But for driving, where the input is just the images and the output is the steering,
00:48:02.600 | it's still unclear what you learn.
00:48:05.200 | Partially because we don't have neural networks that drive successfully yet.
00:48:15.200 | Do neural networks fill layers, or do they eventually generate them on their own over time?
00:48:19.200 | So the question was,
00:48:23.400 | does a neural network generate layers over time?
00:48:30.000 | Like does it grow?
00:48:31.200 | That's one of the challenges is that a neural network is predefined.
00:48:38.200 | The architectures, the number of nodes, number of layers, that's all fixed.
00:48:42.500 | Unlike the human brain where neurons die and are born all the time.
00:48:46.000 | Neural network is pre-specified, that's it, that's all you get.
00:48:50.400 | And if you want to change that, you have to change that and then retrain everything.
00:48:54.200 | So it's fixed.
00:48:55.800 | So what I encourage you is to proceed with caution
00:49:00.800 | because there's this feeling when you first teach a network, with very little effort,
00:49:06.900 | how to do some amazing task, like classify a face
00:49:12.000 | versus non-face, or your face versus other faces, or cats versus dogs.
00:49:16.700 | It's an incredible feeling.
00:49:18.100 | And then there's definitely this feeling that I'm an expert.
00:49:23.600 | But what you realize is,
00:49:26.700 | you don't actually understand how it works.
00:49:31.600 | And getting it to perform well for more generalized tasks,
00:49:35.800 | for larger scale datasets, for more useful applications,
00:49:38.700 | requires a lot of hyperparameter tuning.
00:49:41.600 | Figuring out how to tweak little things here and there.
00:49:43.900 | And still in the end, you don't understand why it works so damn well.
00:49:48.000 | So deep learning, these deep neural network architectures, is representation learning.
00:49:59.100 | This is the difference from traditional machine learning methods.
00:50:05.900 | Where, for example, for the task of having an image here as the input,
00:50:14.400 | the input to the network here is on the bottom, the output is up at top.
00:50:18.100 | So, and the input is a single image of a person in this case.
00:50:24.700 | And so, the input specifically is all of the pixels in that image, RGB.
00:50:35.200 | The different colors of the pixels in the image.
00:50:37.300 | And over time, what a network does is build
00:50:44.100 | a multi-resolutional representation of this data.
00:50:47.900 | The first layer learns the concept of edges, for example.
00:50:55.300 | The second layer starts to learn composition of those edges, corners, contours.
00:51:02.100 | Then it starts to learn about object parts.
00:51:05.800 | And finally, it actually provides a label for the entities that are in the input.
00:51:12.500 | And this is the difference from traditional machine learning methods,
00:51:16.700 | where concepts like edges and corners and contours are manually pre-specified by human beings,
00:51:28.400 | human experts for the particular domain.
00:51:31.000 | And representation matters, because figuring out a line
00:51:42.900 | in the Cartesian coordinates of this particular dataset,
00:51:46.500 | where you want to design a machine learning system
00:51:49.200 | that tells the difference between green triangles and blue circles, is difficult.
00:51:54.600 | There's no line that separates them cleanly.
00:52:00.000 | And if you were to ask a human being, a human expert in the field,
00:52:04.100 | to try to draw that line, they would probably do a PhD on it and still not succeed.
00:52:12.300 | But a neural network can automatically figure out to remap that input into polar coordinates.
00:52:23.000 | Where the representation is such that it's an easily linearly separable dataset.
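The remapping idea above can be demonstrated directly. The example data is mine: two classes arranged in concentric rings have no separating line in Cartesian coordinates, but after mapping (x, y) to (r, theta), a single threshold on the radius separates them perfectly.

```python
# Representation matters: concentric rings are not linearly separable
# in (x, y), but become trivially separable after a polar remap,
# because the radius r alone distinguishes the two classes.
import math

def to_polar(x, y):
    return math.hypot(x, y), math.atan2(y, x)

# Inner ring (class A, radius ~1) and outer ring (class B, radius ~3).
points = [(math.cos(t), math.sin(t), "A") for t in range(12)]
points += [(3 * math.cos(t), 3 * math.sin(t), "B") for t in range(12)]

# In polar coordinates, thresholding r classifies every point.
predictions = ["A" if to_polar(x, y)[0] < 2.0 else "B" for x, y, _ in points]
labels = [label for _, _, label in points]
print(predictions == labels)  # True: linearly separable after the remap
```

Here the remap is hand-chosen; the point of representation learning is that a deep network can discover such a remapping on its own from data.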
00:52:28.600 | And so deep learning is a subset of representation learning,
00:52:36.800 | is a subset of machine learning and a key subset of artificial intelligence.
00:52:41.600 | Now, because of this, because of its ability to compute an arbitrary number of features
00:52:53.700 | at the core of the representation.
00:52:56.400 | So you're not, if you were trying to detect a cat in an image,
00:52:59.700 | you're not specifying 215 specific features of cat ears and whiskers and so on
00:53:08.100 | that a human expert would specify.
00:53:10.100 | You allow a neural network to discover tens of thousands of such features.
00:53:14.400 | Which maybe for cats you are an expert, but for a lot of objects
00:53:19.400 | you may never be able to sufficiently provide the features
00:53:24.700 | which would successfully be used for identifying the object.
00:53:27.600 | And so this kind of representation learning,
00:53:30.600 | one is easy in the sense that all you have to provide is inputs and outputs.
00:53:35.600 | All you need to provide is a dataset that you care about without hand engineering features.
00:53:40.800 | And two, because of its ability to construct arbitrarily sized representations,
00:53:49.200 | deep neural networks are hungry for data.
00:53:52.700 | The more data we give them, the more they're able to learn about this particular dataset.
00:53:59.000 | So let's look at some applications.
00:54:06.300 | First, some cool things that deep neural networks have been able to accomplish up to this point.
00:54:13.900 | Let me go through them.
00:54:15.100 | First, the basic one.
00:54:17.400 | AlexNet.
00:54:21.500 | ImageNet is a famous dataset.
00:54:27.600 | It's a competition of classification localization
00:54:31.500 | where the task is given an image,
00:54:34.400 | identify what are the five most likely things in that image
00:54:38.300 | and what is the most likely and you have to do so correctly.
00:54:41.400 | So on the right, there's an image of a leopard
00:54:43.900 | and you have to correctly classify that that is in fact a leopard.
00:54:47.100 | So they're able to do this pretty well.
00:54:50.800 | Given a specific image, determine that it's a leopard.
00:54:55.200 | What's shown here on the x-axis is years,
00:55:01.800 | on the y-axis is error in classification.
00:55:04.900 | So starting from 2012 on the left with AlexNet
00:55:10.000 | and going to today, the error has decreased from 16% then,
00:55:18.000 | and 40% before that with traditional methods, to below 4%.
00:55:24.200 | So human level performance, if I were to give you this picture of a leopard,
00:55:29.900 | for 4% of those pictures of leopards,
00:55:34.100 | you would not say it's a leopard.
00:55:35.700 | That's human level performance.
00:55:37.800 | So for the first time in 2015,
00:55:40.100 | convolutional neural networks outperformed human beings.
00:55:43.200 | That in itself is incredible.
00:55:45.500 | That's something that seemed impossible,
00:55:47.900 | and now, because it's done, it's not as impressive.
00:55:53.100 | But I just want to get to why that's so impressive
00:55:58.800 | because computer vision is hard.
00:56:02.500 | We as human beings have evolved visual perception over millions of years,
00:56:07.200 | hundreds of millions of years.
00:56:08.300 | So we take it for granted but computer vision is really hard.
00:56:13.700 | Visual perception is really hard.
00:56:15.100 | There is illumination variability.
00:56:17.000 | So it's the same object.
00:56:18.400 | The only way we tell anything is from the shade,
00:56:21.700 | the reflection of light from that surface.
00:56:23.600 | It could be the same object with, in terms of pixels,
00:56:28.000 | drastically different looking shapes,
00:56:31.200 | and we still know it's the same object.
00:56:35.500 | There is pose variability and occlusions.
00:56:37.800 | Probably my favorite caption for an image,
00:56:41.200 | for a figure in an academic paper is
00:56:44.900 | "deformable and truncated cat".
00:56:46.800 | These are pictures, you know, cats are famously deformable.
00:56:54.600 | They can take a lot of different shapes.
00:56:56.300 | Arbitrary poses are possible.
00:57:04.600 | So for computer vision,
00:57:06.200 | you should know it's still the same object,
00:57:07.900 | still the same class of objects
00:57:09.700 | given all the variability in the pose.
00:57:12.300 | And occlusions is a huge problem.
00:57:15.700 | We still know it's an object.
00:57:17.500 | We still know it's a cat even when parts of it are not visible
00:57:21.100 | and sometimes large parts of it are not visible.
00:57:23.400 | And then there's all the interclass variability.
00:57:27.100 | All of these on the top two rows are cats.
00:57:32.900 | Many of them look drastically different,
00:57:34.600 | and the bottom two rows are dogs,
00:57:37.800 | which also look drastically different.
00:57:39.900 | And yet some of the dogs look like cats,
00:57:43.200 | some of the cats look like dogs
00:57:45.200 | and as human beings are pretty good at telling the difference
00:57:48.700 | and we want computer vision to do better than that.
00:57:52.000 | It's hard.
00:57:53.800 | So how is this done?
00:57:56.400 | This is done with convolutional neural networks.
00:57:58.800 | The input to which is a raw image.
00:58:01.500 | Here's an input on the left of a number three,
00:58:05.500 | and I'll talk about convolutional layers later.
00:58:10.500 | That image is processed, passed through them.
00:58:14.300 | Convolutional layers maintain spatial information.
00:58:19.200 | The output, in this case, predicts
00:58:28.900 | what number is shown in the image, 0, 1, 2 through 9.
00:58:33.000 | And so these networks, this is exactly,
00:58:38.100 | everybody is using the same kind of network
00:58:40.100 | to determine exactly that.
00:58:41.800 | Input is an image, output is a number.
00:58:43.800 | And in the case of the leopard, the probability that it's a leopard,
00:58:48.200 | what is that number?
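The core operation inside these networks can be shown without any framework. This is a minimal sketch, mine rather than the lecture's code, of the convolution step: a small filter slides across the image, which is why the output preserves the spatial layout of the input.

```python
# Pure-Python 2-D convolution (strictly, cross-correlation, which is
# what most deep learning frameworks compute): a kernel slides over
# the image, producing a spatially arranged map of filter responses.

def conv2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

# A vertical-edge detector applied to a dark/bright boundary.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
kernel = [[-1, 1],
          [-1, 1]]
print(conv2d(image, kernel))  # [[0, 2, 0], [0, 2, 0]]
```

The strongest responses line up with the edge column of the input, which is the sense in which convolutional layers "maintain spatial information"; a real network learns the kernel values instead of hand-specifying them.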
00:58:49.400 | Then there's segmentation built on top of these
00:58:52.900 | convolutional neural networks, where you chop off the end
00:58:58.300 | and convolutionalize the network,
00:59:00.600 | so that the output is a heat map.
00:59:03.000 | So instead of a detector for a cat,
00:59:08.000 | you can do a cat heat map,
00:59:14.000 | where the output heat map gets excited,
00:59:16.200 | the neurons on that output get excited,
00:59:19.000 | spatially, in the parts of the image
00:59:23.300 | that contain a tabby cat.
00:59:25.800 | And this kind of process can be used to segment the image
00:59:28.400 | into different objects.
00:59:29.900 | A horse: so the original input on the left
00:59:32.500 | is a woman on a horse, and the output is a fully segmented image,
00:59:36.900 | knowing where's the woman, where's the horse.
00:59:39.500 | And this kind of process can be used for object detection
00:59:44.700 | which is the task of detecting an object in an image.
00:59:47.600 | Now the traditional method with convolutional neural networks
00:59:53.200 | and in general in computer vision is the sliding window approach.
00:59:56.800 | We have a detector like the leopard detector
00:59:59.500 | that you slide through the image to find where in that image is a leopard.
01:00:03.500 | The segmenting approach, the R-CNN approach,
01:00:10.500 | is to efficiently segment the image in such a way
01:00:14.200 | that it can propose different parts of the image
01:00:16.400 | that are likely to have a leopard or in this case a cowboy.
01:00:22.200 | And that drastically reduces the computational requirements
01:00:25.100 | of the object detection task.
01:00:27.400 | And so, currently
01:00:37.100 | one of the best networks for the ImageNet task of localization
01:00:41.000 | is the deep residual network.
01:00:44.300 | They're deep.
01:00:49.000 | So VGG19 is one of the famous ones, VGGNet.
01:00:53.000 | You're starting to get above 20 layers in many cases.
01:00:59.100 | 34 layers is the ResNet one.
01:01:02.000 | So the lesson there is the deeper you go,
01:01:06.600 | the more representation power you have, the higher accuracy.
01:01:10.200 | But you need more data.
01:01:17.000 | Other applications, colorization of images.
01:01:19.500 | So this again, input is a single image
01:01:27.000 | and output is a single image.
01:01:29.000 | So you can take a black and white video from a film,
01:01:33.500 | from an old film and recolor it.
01:01:36.700 | And all you need to do to train that network in a supervised way
01:01:41.100 | is provide modern films and convert them to grayscale.
01:01:46.100 | So now you have arbitrarily sized datasets,
01:01:48.400 | datasets of grayscale to color.
01:01:53.800 | And you're able, with very little effort on top of it,
01:02:00.900 | to successfully, well, somewhat successfully, recolor images.
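The self-supervision trick just described, any color film yields (grayscale, color) training pairs for free, is a few lines of code. This is my sketch; the conversion uses the standard ITU-R BT.601 luma weights, an assumption about how the grayscale versions are produced.

```python
# Generating colorization training data: the network's input is the
# grayscale version of a frame, and the training target is the original
# color frame, so no human annotation is needed at all.

def to_grayscale(rgb_pixel):
    r, g, b = rgb_pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)  # BT.601 luma

def make_training_pair(color_image):
    """Input for the network is the grayscale image; target is the original."""
    gray = [[to_grayscale(p) for p in row] for row in color_image]
    return gray, color_image

color = [[(255, 0, 0), (0, 255, 0)],
         [(0, 0, 255), (255, 255, 255)]]
gray, target = make_training_pair(color)
print(gray)  # [[76, 150], [29, 255]]
```

Run over every frame of every modern film you can get, this produces the arbitrarily sized supervised dataset the lecture mentions.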
01:02:05.300 | Again, Google Translate does image translation in this way.
01:02:10.200 | Image to image.
01:02:12.800 | It first perceives, here in German I believe,
01:02:16.400 | famous German, correct me if I'm wrong,
01:02:19.400 | dark chocolate written in German on a box.
01:02:21.900 | So this can take this image, detect the different letters,
01:02:26.300 | convert them to text, translate the text,
01:02:29.100 | and then using the image to image mapping,
01:02:32.100 | map the letters, the translated letters back onto the box.
01:02:37.600 | And you can do this in real time on video.
01:02:42.500 | So what we've talked about up to this point,
01:02:45.000 | on the left are vanilla neural networks,
01:02:47.400 | convolutional neural networks
01:02:49.200 | that map a single input to a single output,
01:02:51.900 | a single image to a number, single image to another image.
01:02:55.800 | Then there is recurrent neural networks that map,
01:02:58.700 | this is the more general formulation,
01:03:00.900 | that map a sequence of images or a sequence of words
01:03:04.400 | or a sequence of any kind to another sequence.
01:03:09.200 | And these networks are able to do incredible things
01:03:12.600 | with natural language, with video,
01:03:15.800 | and any time series data.
01:03:18.800 | For example, we can convert text to handwriting,
01:03:24.100 | to handwritten text.
01:03:27.000 | Here we type in, and you could do this online,
01:03:31.600 | type in "deep learning for self-driving cars"
01:03:33.800 | and it will use
01:03:37.600 | an arbitrary handwriting style
01:03:41.700 | to generate the words "deep learning for self-driving cars".
01:03:44.600 | This is done using recurrent neural networks.
01:03:48.000 | We can also take char-RNNs, as they're called,
01:03:54.200 | these character-level recurrent neural networks
01:03:57.300 | that train on a dataset, an arbitrary text dataset,
01:04:04.300 | and learn to generate text one character at a time.
01:04:09.000 | So there is no preconceived syntactical semantic structure
01:04:14.500 | that's provided to the network.
01:04:16.100 | It learns that structure.
01:04:17.700 | So, for example, you can train it on Wikipedia articles,
01:04:24.100 | like in this case, and it's able to generate successfully
01:04:29.800 | not only text that makes some kind of grammatical sense at least,
01:04:35.400 | but also keep perfect syntactic structure for Wikipedia,
01:04:41.400 | for Markdown editing, for LaTeX editing, and so on.
01:04:45.900 | This text says, "Naturalism and decision for the majority of Arab countries,"
01:04:52.000 | capitalized, whatever that means, "was grounded by the Irish language
01:04:55.900 | by John Clare," and so on.
01:04:58.400 | These are sentences, if you didn't know better, that might sound correct.
01:05:03.100 | And it does so, let me pause, one character at a time.
01:05:08.300 | So, these aren't words being generated.
01:05:12.700 | This is one character.
01:05:14.400 | You start with the beginning three letters, "Nat",
01:05:17.300 | you generate "u" completely without knowing of the word "Naturalism".
01:05:23.900 | This is incredible.
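The one-character-at-a-time generation loop can be illustrated with a toy stand-in. This is not an RNN: it is my sketch using a simple next-character count table, where a real char-RNN replaces the table with a recurrent network that carries context; the training text and seed are made up.

```python
# Toy character-level generation: learn next-character statistics from
# text, then emit one character at a time, each step conditioned only
# on the previous character. A char-RNN does the same thing with a
# learned recurrent state instead of a count table.
from collections import Counter, defaultdict

def train_bigrams(text):
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def generate(counts, seed, length):
    out = seed
    for _ in range(length):
        nxt = counts.get(out[-1])
        if not nxt:
            break
        out += nxt.most_common(1)[0][0]  # greedy: most frequent next char
    return out

counts = train_bigrams("banana bandana")
print(generate(counts, "ba", 6))  # bananana
```

No word or syntax structure is ever provided; whatever regularity appears in the output was absorbed from the statistics of the training text, which is the same principle behind the Wikipedia and LaTeX examples above.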
01:05:26.900 | You can do this to start a sentence
01:05:33.600 | and let the neural network complete that sentence.
01:05:35.700 | So, for example, if you start the sentence with "Life is"
01:05:39.000 | or "Life is about", actually, it will complete it with a lot of fun things.
01:05:46.600 | "Life is about the weather", "Life is about kids",
01:05:49.800 | "Life is about the true love of Mr. Mom",
01:05:55.300 | "Life is about the truth now",
01:05:56.700 | and this is from Geoffrey Hinton, the last two.
01:06:01.200 | If you start with the meaning of life,
01:06:03.000 | it can complete that with the meaning of life is literary recognition,
01:06:07.600 | maybe true for some of us here.
01:06:09.600 | Publish or perish.
01:06:13.500 | And the meaning of life is the tradition of ancient human reproduction.
01:06:18.600 | Also true for some of us here, I'm sure.
01:06:23.500 | Okay, so what else can you do?
01:06:27.100 | Something that has been very exciting recently is image caption recognition.
01:06:31.600 | No, generation, I'm sorry.
01:06:33.100 | Image caption generation is important for large data sets of images
01:06:41.200 | where we want to be able to determine what's going on inside those images,
01:06:45.000 | especially for search.
01:06:46.900 | If you want to find a man sitting on a couch with a dog,
01:06:50.800 | you type it into Google and it's able to find that.
01:06:53.400 | So here shown in black text,
01:06:59.000 | a man sitting on a couch with a dog is generated by the system.
01:07:02.300 | A man sitting on a chair with a dog in his lap is generated by a human observer.
01:07:07.300 | And again, these annotations are done by detecting the different objects,
01:07:12.700 | the different obstacles in the scene.
01:07:15.100 | So, segmenting the scene, detecting, on the right, "woman", "crowd", "cat",
01:07:20.400 | "camera", "holding", "purple": all of these words are being detected.
01:07:25.400 | Then a syntactically correct sentence is generated, a lot of them,
01:07:30.700 | and then you order which sentence is the most likely.
01:07:32.900 | And in this way you can generate very accurate labeling of the images,
01:07:38.400 | captions for the images.
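The pipeline described here, detect words in the scene, generate many candidate sentences, then order them by likelihood, can be sketched with a toy bigram language model. The log-probabilities below are made-up numbers for illustration; a real system learns them from a large text corpus:

```python
# Toy bigram log-probabilities (invented values, not from any real model)
LOGP = {
    ("a", "man"): -1.0, ("man", "sitting"): -1.2, ("sitting", "on"): -0.5,
    ("on", "a"): -0.4, ("a", "couch"): -2.0, ("couch", "with"): -1.5,
    ("with", "a"): -0.6, ("a", "dog"): -1.8,
    ("a", "sitting"): -6.0, ("dog", "sitting"): -3.0,
}
UNSEEN = -10.0  # back-off penalty for bigrams we have never seen

def score(sentence):
    """Sum of bigram log-probabilities: higher means more likely English."""
    words = sentence.split()
    return sum(LOGP.get(pair, UNSEEN) for pair in zip(words, words[1:]))

def best_caption(candidates):
    """Rank candidate captions and return the most likely one."""
    return max(candidates, key=score)

candidates = [
    "a man sitting on a couch with a dog",
    "a dog sitting on a man with a couch",
    "a couch man sitting dog",
]
print(best_caption(candidates))
```

The real systems generate candidates from detected objects and rank them with much richer models, but the ranking step has this shape: score each syntactically valid sentence, keep the winner.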
01:07:41.800 | And you can do the same kind of process for image question answering.
01:07:49.400 | You can ask about quantity: how many chairs are there?
01:07:53.000 | You can ask about location: where are the ripe bananas?
01:07:59.500 | You can ask about the type of object: what is the object on the chair?
01:08:04.400 | It's a pillow.
01:08:05.500 | And these are again using the recurrent neural networks.
01:08:15.000 | You can do the same thing with video caption generation,
01:08:20.100 | video caption description generation.
01:08:23.000 | So looking at a sequence of images as opposed to just a single image.
01:08:26.400 | What is the action going on in this situation?
01:08:30.100 | This is the difficult task.
01:08:32.000 | There's a lot of work in this area.
01:08:34.600 | Now, on the left are correct descriptions: "a man is doing stunts on his bike",
01:08:38.900 | "a herd of zebras are walking in the field".
01:08:41.900 | And on the right, there's a small bus running into a building.
01:08:45.200 | You know, it's talking about relevant entities
01:08:51.700 | but producing an incorrect description.
01:08:53.500 | "A man is cutting a piece of, a piece of, a pair of a paper."
01:08:59.400 | It says he's cutting a piece of a pair of a paper.
01:09:01.900 | So the words are correct, perhaps.
01:09:06.000 | So: close, but no cigar.
01:09:11.600 | So one of the interesting things
01:09:13.200 | you can do with recurrent neural networks
01:09:18.900 | comes from thinking about the way we look at images.
01:09:21.700 | When human beings look at images,
01:09:22.900 | we only have a small fovea with which we focus on one part of the scene.
01:09:30.300 | So right now your periphery is very distorted.
01:09:33.500 | The only thing, if you're looking at the slides,
01:09:35.900 | or you're looking at me, that's the only thing that's in focus.
01:09:40.300 | The majority of everything else is out of focus.
01:09:42.700 | So we can use the same kind of concept
01:09:44.900 | to try to teach a neural network to steer around the image,
01:09:47.600 | both for perception and generation of those images.
01:09:51.200 | This is important first on the general artificial intelligence point
01:09:55.900 | of it being just fascinating
01:09:58.500 | that we can selectively steer our attention.
01:10:02.900 | But also it's important for things like drones
01:10:05.300 | that have to fly at high speeds in an environment
01:10:08.300 | where at 300 plus frames a second you have to make decisions.
01:10:12.000 | So you can't possibly localize yourself
01:10:14.700 | or perceive the world around yourself successfully
01:10:17.500 | if you have to interpret the entire scene.
01:10:20.400 | So what you can do is steer around the image;
01:10:22.900 | for example, shown here is reading house numbers
01:10:28.800 | by steering around an image.
01:10:32.200 | You could do the same task for reading and for writing.
01:10:38.400 | On the left here is reading numbers on the MNIST dataset.
01:10:42.900 | We can also selectively steer a network around an image
01:10:49.900 | to generate that image,
01:10:51.200 | starting with a blurred image first
01:10:53.700 | and then getting higher and higher resolution
01:10:57.900 | as the steering goes on.
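One common way to implement this fovea idea, used in recurrent attention models, is to extract multi-resolution "glimpse" patches around a chosen point: sharp in the center, coarse in the periphery. A minimal NumPy sketch, with made-up patch sizes:

```python
import numpy as np

def glimpse(image, center, sizes=(4, 8, 16)):
    """Extract concentric patches around `center`, average-pooling the larger
    ones down to sizes[0] x sizes[0]: a crude fovea, sharp in the middle and
    coarse in the periphery."""
    cy, cx = center
    out = []
    for s in sizes:
        half = s // 2
        # pad with zeros so crops near the border stay in bounds
        padded = np.pad(image, half, mode="constant")
        patch = padded[cy:cy + s, cx:cx + s]
        factor = s // sizes[0]
        # average-pool the s x s patch down to the base resolution
        pooled = patch.reshape(sizes[0], factor, sizes[0], factor).mean(axis=(1, 3))
        out.append(pooled)
    return np.stack(out)

img = np.arange(32 * 32, dtype=float).reshape(32, 32)
patches = glimpse(img, center=(16, 16))
print(patches.shape)  # (3, 4, 4)
```

In a full recurrent attention model, a controller network looks at these stacked patches and decides where to move the glimpse next, so the network only ever processes a small window of the full image at each step.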
01:11:02.300 | Work here at MIT is able to map video to audio.
01:11:10.100 | So, they hit stuff with a drumstick,
01:11:13.200 | take the silent video, and are able to generate the sound
01:11:18.200 | that a drumstick hitting that particular object makes.
01:11:22.200 | So you can get texture information from that impact.
01:11:29.100 | So here is a video of a human soccer player playing soccer
01:11:38.700 | and a state-of-the-art machine playing soccer.
01:11:44.900 | And well, let me give him some time to build up.
01:11:52.500 | (Laughter)
01:12:03.300 | Okay, so soccer, this is, we take this for granted
01:12:08.300 | but walking is hard.
01:12:10.300 | Object manipulation is hard.
01:12:12.800 | Soccer is harder than chess for us to do, much harder.
01:12:18.100 | On your phone now, you can have a chess engine
01:12:23.800 | that beats the best players in the world.
01:12:26.300 | And you have to internalize that because the question is,
01:12:32.300 | this is a painful video,
01:12:33.800 | the question is, where does driving fall?
01:12:37.400 | Is it closer to chess or is it closer to soccer?
01:12:40.900 | For those incredible brilliant engineers
01:12:44.500 | that worked on the most recent DARPA challenge,
01:12:47.200 | this would be a very painful video to watch, I apologize.
01:12:51.100 | This is a video from the DARPA challenge
01:12:55.700 | of robots struggling with the basic object manipulation
01:13:05.700 | and walking tasks.
01:13:06.900 | So it's mostly a fully autonomous navigation task.
01:13:14.400 | (Laughter)
01:13:24.000 | Maybe I'll just let this play for a few moments
01:13:27.100 | to let it sink in just how difficult this task is.
01:13:32.400 | Of balancing, of planning in an under-actuated way
01:13:38.000 | where you don't have full control of everything.
01:13:40.300 | When there is a delta between your perception
01:13:44.300 | of what you think the world is and what the reality is.
01:13:47.600 | So there, a robot was trying to turn an object that wasn't there.
01:13:54.700 | And this is an MIT entry that actually,
01:14:02.300 | I believe, got points for this because it got into that area.
01:14:07.800 | (Laughter)
01:14:12.000 | But as a lot of the teams talked about,
01:14:17.200 | one of the things the robot had to do
01:14:19.200 | is get into a car, drive it, and get out of the car.
01:14:23.100 | And there were a few other manipulation tasks:
01:14:25.600 | it had to walk on unsteady ground,
01:14:28.200 | it had to drill a hole through a wall, all of these tasks.
01:14:32.000 | And what a lot of teams said is the hardest part,
01:14:35.100 | the hardest task of all of them is getting out of the car.
01:14:38.200 | So it's not getting into the car,
01:14:40.200 | it's this very task that you saw now is a robot getting out of the car.
01:14:44.500 | These are things we take for granted.
01:14:46.200 | So in our evaluation of what is difficult about driving,
01:14:50.900 | we have to remember that some of those things
01:14:54.700 | we may take for granted in the same kind of way
01:14:57.100 | that we take walking for granted.
01:14:58.400 | This is Moravec's paradox,
01:15:05.600 | from Hans Moravec at CMU.
01:15:08.100 | Let me just quickly read that quote.
01:15:11.300 | "Encoded in the large highly evolved sensory and motor portions of the human brain
01:15:15.100 | is billions of years of experience
01:15:18.300 | about the nature of the world and how to survive in it."
01:15:20.600 | So this is data, this is big data, billions of years.
01:15:25.100 | And abstract thought, which is reasoning,
01:15:28.300 | the stuff we think of as intelligence,
01:15:31.900 | is perhaps less than 100,000 years old.
01:15:36.700 | We haven't yet mastered it and so,
01:15:39.700 | sorry I'm inserting my own statements in the middle of a quote but,
01:15:44.100 | it's been very recent that we've learned how to think
01:15:51.600 | and so we respect it perhaps more
01:15:55.200 | than the things we take for granted like walking and visual perception and so on.
01:16:00.500 | But those may be strictly a matter of data,
01:16:03.700 | data and training time and network size.
01:16:08.200 | So walking is hard.
01:16:15.900 | The question is how hard is driving?
01:16:19.800 | And that's an important question because the margin of error is small.
01:16:27.800 | In the United States, there's one fatality per 100 million vehicle miles traveled.
01:16:34.400 | That's the rate at which people die in car crashes.
01:16:38.200 | One fatality per 100 million miles:
01:16:41.200 | that's a 0.000001% margin of error per mile.
01:16:47.200 | That's through all the time you spend on the road,
01:16:50.200 | that is the error you get.
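The arithmetic behind that margin is worth checking; the sketch below expresses the rate as a percentage per mile and does a rough sanity check against annual totals, using an approximate round figure of 3 trillion vehicle miles traveled per year in the US:

```python
# One fatality per 100 million vehicle miles, as a per-mile rate
fatalities_per_mile = 1 / 100_000_000
percent_per_mile = fatalities_per_mile * 100  # expressed as a percentage

# Rough sanity check: about 3 trillion vehicle miles traveled per year (approximate)
annual_miles = 3_000_000_000_000
expected_fatalities = fatalities_per_mile * annual_miles

print(percent_per_mile)             # 1e-06, i.e. 0.000001%
print(round(expected_fatalities))   # on the order of tens of thousands per year
```

That expected annual figure is in the right ballpark for US traffic fatalities, which is why one-per-100-million-miles is the commonly cited rate.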
01:16:52.200 | We're impressed with ImageNet classifiers being able to classify a leopard, a cat, or a dog
01:16:57.700 | at close to, or above, human-level performance.
01:17:01.900 | But this is the margin of error we get with driving.
01:17:04.600 | And we have to be able to deal with snow, with heavy rain,
01:17:09.500 | with big open parking lots, with parking garages,
01:17:13.500 | with pedestrians that behave irresponsibly, as rarely as that happens,
01:17:18.300 | or just unpredictably, especially in Boston.
01:17:23.800 | Reflections especially, this is one of the things you don't think about,
01:17:29.900 | and the lighting variations that blind the cameras.
01:17:33.100 | The question was whether that number changes if you look at just crashes
01:17:46.400 | rather than fatalities, so crashes per mile.
01:17:49.700 | Yeah, so one of the big things is cars have gotten really good at crashing
01:17:55.400 | without hurting anybody.
01:17:57.400 | So the number of crashes is much, much larger than number of fatalities,
01:18:01.400 | which is a great thing.
01:18:03.100 | We've built safer cars.
01:18:05.000 | But still, you know, even one fatality is too many.
01:18:09.300 | So this is one, Google self-driving car team,
01:18:20.200 | is quite open about their performance since hitting public roads.
01:18:28.700 | This is from a report that shows the number of disengagements:
01:18:35.700 | the times the car gives up control and asks the driver to take control back,
01:18:43.200 | or the driver takes control back by force.
01:18:45.600 | Meaning that they're unhappy with the decision that the car was making
01:18:49.900 | or it was putting the car or other pedestrians or other cars in unsafe situations.
01:18:54.500 | And so if you look over time, from 2014 to 2015,
01:19:01.800 | there's been a total of 341 times on beautiful San Francisco roads.
01:19:08.600 | And I say that seriously because the weather conditions are great there.
01:19:14.400 | 341 times that the driver had to elect to take control back.
01:19:17.700 | So it's a work in progress.
01:19:20.400 | Let me give you something to think about here.
01:19:24.900 | This, with neural networks, is a big open question:
01:19:31.500 | the question of robustness.
01:19:33.200 | So this is an amazing paper, I encourage people to read it.
01:19:38.400 | There's a couple of papers around this topic.
01:19:40.800 | Deep neural networks are easily fooled.
01:19:43.300 | So here are eight images where if given to a neural network as input,
01:19:52.800 | a convolutional neural network as input,
01:19:54.800 | the network with higher than 99.6% confidence says that the image,
01:20:01.900 | for example, in the top left is a robin, next to it is a cheetah,
01:20:06.000 | then an armadillo, a panda, an electric guitar, a baseball, a starfish, a king penguin.
01:20:13.100 | All of these things are obviously not in the images.
01:20:16.400 | So networks can be fooled with noise.
01:20:19.100 | More importantly, more practically for the real world,
01:20:25.800 | adding just a little bit of distortion, a little bit of noise distortion to the image
01:20:31.800 | can force the network to produce a totally wrong prediction.
01:20:37.400 | So here's an example.
01:20:39.800 | There's three columns: on the left, the correctly classified image;
01:20:44.300 | in the middle, the slight distortion that gets added;
01:20:47.800 | and on the right, the resulting distorted image,
01:20:53.300 | for which the network predicts "ostrich" in all three cases.
01:20:59.000 | This ability to fool networks easily brings up an important point.
01:21:06.500 | And that point is that there has been a lot of excitement
01:21:14.400 | about neural networks throughout their history.
01:21:17.200 | There's been a lot of excitement about artificial intelligence throughout its history.
01:21:21.000 | And not grounding that excitement in reality,
01:21:28.000 | in the real challenges involved, has resulted in crashes,
01:21:36.200 | in AI winters, when funding dried up and people lost hope
01:21:42.300 | in the possibilities of artificial intelligence.
01:21:44.800 | So here's a 1958 New York Times article
01:21:47.800 | that said the Navy revealed the embryo of an electronic computer today.
01:21:52.000 | This is when the first perceptron that I talked about
01:21:55.600 | was implemented in hardware by Frank Rosenblatt.
01:21:59.100 | It took a 400-pixel image as input and provided a single output;
01:22:06.000 | the weights were encoded in hardware potentiometers
01:22:09.900 | and updated with electric motors.
01:22:12.100 | Now New York Times wrote,
01:22:13.900 | "The Navy revealed the embryo of an electronic computer today
01:22:17.100 | that expects we'll be able to walk, talk, see, write, reproduce itself
01:22:24.300 | and be conscious of its existence."
01:22:27.000 | Dr. Frank Rosenblatt, a research psychologist
01:22:31.200 | at the Cornell Aeronautical Laboratory, Buffalo,
01:22:34.700 | said perceptrons might be fired to the planets as mechanical space explorers.
01:22:39.300 | This might seem ridiculous, but this was the general opinion of the time.
01:22:45.500 | And as we know now, perceptrons cannot even represent a function as simple as XOR.
01:22:53.400 | They're just linear classifiers.
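That limitation is easy to demonstrate: run the classic perceptron learning rule on the four points of XOR (the standard example of a function that is not linearly separable) and the classifier never gets all four right. A minimal sketch:

```python
import numpy as np

# XOR truth table: the classic non-linearly-separable dataset
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

w = np.zeros(2)   # weights
b = 0.0           # bias
lr = 0.1          # learning rate

def predict(inputs):
    return (inputs @ w + b > 0).astype(int)

best_accuracy = 0.0
for epoch in range(100):
    for xi, yi in zip(X, y):
        # classic perceptron update rule: move toward misclassified examples
        err = yi - int(xi @ w + b > 0)
        w += lr * err * xi
        b += lr * err
    best_accuracy = max(best_accuracy, float((predict(X) == y).mean()))

print(best_accuracy)  # never reaches 1.0: a line can separate at most 3 of the 4 points
```

A single hidden layer fixes this, which is exactly why multi-layer networks, and the backpropagation needed to train them, mattered so much after the perceptron era.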
01:22:57.300 | And so this led to two major AI winters in the 70s and the late 80s and early 90s.
01:23:05.600 | The Lighthill Report, commissioned by the UK government in 1973, said that
01:23:14.000 | "in no part of the field have the discoveries made so far produced the major impact that was then promised."
01:23:19.000 | So if the hype builds beyond the capabilities of our research,
01:23:27.200 | reports like this will come
01:23:31.300 | and they have the possibility of creating another AI winter.
01:23:35.200 | So I want to pair the optimism, some of the cool things we'll talk about in this class
01:23:40.000 | with the reality of the challenges ahead of us.
01:23:43.900 | The focus of the research community.
01:23:51.400 | This is some of the key players in deep learning.
01:23:55.100 | What are the things that are next for deep learning?
01:23:59.900 | The five-year vision.
01:24:01.700 | We want to run on smaller, cheaper mobile devices.
01:24:05.700 | We want to explore more in the space of unsupervised learning,
01:24:10.000 | as I mentioned, and reinforcement learning.
01:24:12.500 | We want to do things that explore the space of videos more
01:24:20.300 | with recurrent neural networks, like being able to summarize videos
01:24:23.700 | or generate short videos.
01:24:27.000 | One of the big efforts, especially in the companies dealing with large data,
01:24:31.800 | is multimodal learning.
01:24:33.400 | Learning from multiple data sets with multiple sources of data.
01:24:37.600 | And lastly, making money from these technologies.
01:24:43.600 | Despite the excitement,
01:24:48.800 | there has been an inability, for the most part, to make serious money
01:24:54.800 | from some of the more interesting parts of deep learning.
01:24:59.600 | And while I got made fun of by the TAs for including this slide,
01:25:10.600 | because it's shown in so many sort of business-type lectures,
01:25:13.600 | it is true that we're at the peak of a hype cycle,
01:25:17.800 | and we have to make sure,
01:25:20.400 | given the large amount of hype and excitement there is, we proceed with caution.
01:25:25.000 | One example of that, let me mention,
01:25:37.000 | is we already talked about spoofing the cameras,
01:25:42.800 | spoofing the cameras with a little bit of noise.
01:25:47.600 | So if you think about it,
01:25:48.600 | self-driving vehicles operate with a set of sensors,
01:25:53.200 | and they rely on those sensors to accurately capture information about the world.
01:25:58.200 | Now, what happens not only when the world itself produces noisy visual information,
01:26:06.600 | but when somebody actively tries to spoof that data?
01:26:10.000 | One of the fascinating things that has recently been done is spoofing of LiDAR.
01:26:16.000 | LiDAR is a range sensor that gives a 3D point cloud of the objects in the external environment,
01:26:22.800 | and you're able to successfully do a replay attack
01:26:28.000 | where you have the car see people and other cars around it
01:26:32.800 | when there's actually nothing around it.
01:26:34.600 | In the same way that you can spoof a camera, and the neural network behind it,
01:26:40.400 | to see things that are not there.
01:26:44.200 | So let me run through some of the libraries that we'll work with
01:26:48.400 | and they're out there that you might work with if you proceed with deep learning.
01:26:53.400 | TensorFlow, that is the most popular one these days.
01:26:58.600 | It's heavily backed and developed by Google.
01:27:01.600 | It has primarily a Python interface
01:27:06.000 | and is very good at operating on multiple GPUs.
01:27:14.200 | There's Keras and also TF Learn and TF Slim
01:27:18.200 | which are libraries that operate on top of TensorFlow
01:27:21.800 | that make it slightly easier, slightly more user-friendly interfaces
01:27:26.800 | to get up and running.
01:27:29.000 | Torch, if you're interested to get in at the lower level
01:27:40.200 | tweaking of the different parameters of neural networks,
01:27:42.800 | creating your own architectures, Torch is excellent for that
01:27:46.200 | with its own Lua interface.
01:27:49.200 | Lua is a programming language, and Torch is heavily backed by Facebook.
01:27:54.000 | There's the old-school Theano, which is what I started on,
01:27:58.000 | and what a lot of people early on in deep learning started on.
01:28:00.600 | It's one of the first libraries that supported, that came with GPU support.
01:28:06.000 | It definitely encourages lower level tinkering
01:28:10.000 | and has a Python interface.
01:28:11.400 | And many of these, if not all, rely on NVIDIA's cuDNN library
01:28:18.800 | for doing some of the low-level computations
01:28:23.600 | involved with training these neural networks on NVIDIA GPUs.
01:28:29.000 | MXNet, heavily supported by Amazon,
01:28:33.000 | and they've recently officially announced
01:28:39.800 | that AWS is going to be all in on MXNet.
01:28:43.800 | Neon, from Nervana, which was recently bought by Intel,
01:28:52.800 | started out as a manufacturer of neural network chips,
01:28:56.600 | which is really exciting,
01:28:59.800 | and it performs exceptionally well on benchmarks.
01:29:01.800 | Caffe, started in Berkeley, also was very popular in Google
01:29:07.400 | before TensorFlow came out.
01:29:10.000 | It's primarily designed for computer vision with ConvNets
01:29:13.800 | but it's now expanded to all other domains.
01:29:18.400 | There is CNTK, as it used to be known, now called
01:29:24.000 | the Microsoft Cognitive Toolkit.
01:29:25.400 | Nobody calls it that yet, as far as I'm aware.
01:29:28.200 | It has multi-GPU support and its own custom language, BrainScript,
01:29:34.400 | as well as other interfaces.
01:29:39.200 | And what we'll get to play around with in this class
01:29:41.600 | is, amazingly, deep learning in the browser.
01:29:45.800 | Our favorite is ConvNetJS, which is what you'll use, built by Andrej Karpathy
01:29:52.000 | from Stanford, now OpenAI.
01:29:54.000 | It's good for explaining the basic concept of neural networks.
01:29:58.600 | It's fun to play around with.
01:30:00.200 | All you need is a browser, so very few requirements.
01:30:03.400 | It can't leverage GPUs, unfortunately.
01:30:08.000 | But for a lot of things that we're doing, you don't need GPUs.
01:30:10.600 | You'll be able to train a network with very little
01:30:12.600 | and relatively efficiently without the need of GPUs.
01:30:16.600 | It has full support for CNNs, RNNs, and even deep reinforcement learning.
01:30:22.200 | KerasJS, which seems incredible, we tried to use for this class,
01:30:28.200 | but it didn't work out. It runs in the browser
01:30:34.200 | with GPU support, through WebGL, or however it works, magically.
01:30:39.800 | But we're able to accomplish a lot of things we need without the use of GPUs.
01:30:44.000 | So, it's incredible to live in a day and age when, literally,
01:30:52.200 | as I'll show in the tutorials, it takes just a few minutes
01:30:56.000 | to get started with building your own neural network that classifies images.
01:31:00.000 | And a lot of these libraries are friendly in that way.
01:31:05.000 | So, all the references mentioned in this presentation are available at this link
01:31:10.400 | and the slides are available there as well.
01:31:12.400 | So, I think in the interest of time, let me wrap up.
01:31:16.400 | Thank you so much for coming in today.
01:31:19.000 | And tomorrow I'll explain the deep reinforcement learning game
01:31:23.200 | and the actual competition and how you can win it.
01:31:25.600 | Thanks very much guys.
01:31:27.200 | [APPLAUSE]