
MIT 6.S094: Deep Learning


Chapters

0:00 Introduction
8:14 Self-Driving Cars
14:20 Deep Learning

Transcript

Thank you everyone for braving the cold and the snow to be here. This is 6.S094, Deep Learning for Self-Driving Cars. It's a course where we cover the topic of deep learning, which is a set of techniques that have taken a leap in the last decade for our understanding of what artificial intelligence systems are capable of doing, and self-driving cars, which are systems that can take these techniques and integrate them in a meaningful, profound way into our daily lives, in a way that transforms society. That's why both of these topics are extremely important and extremely exciting. My name is Lex Fridman, and I'm joined by an amazing team of engineers in Jack Terwilliger, Julia Kindelsberger, Dan Brown, Michael Glazer, Li Ding, Spencer Dodd, and Benedikt Jenik, among many others. We build autonomous vehicles here at MIT. Not just ones that perceive and move about the environment, but ones that interact, communicate, and earn the trust and understanding of the human beings inside the car, the drivers and the passengers, and the human beings outside the car, the pedestrians, other drivers, and cyclists. The website for this course is selfdrivingcars.mit.edu. If you have questions, email deepcars@mit.edu or use the Slack channel deep-mit. Registered MIT students have to register on the website and, by midnight, Friday, January 19th, build a neural network that achieves a speed of 65 miles per hour on the new Deep Traffic 2.0 and submit it to the competition. It's much harder and much more interesting than last year's, for those of you who participated. There are three competitions in this class: Deep Traffic, SegFuse, and Deep Crash. There are guest speakers coming from Waymo, Google, Tesla, and from folks starting new autonomous vehicle companies: Voyage, nuTonomy, and Aurora, which have been in the news a lot today from CES. And we have shirts. For those of you who brave the snow and continue to do so, towards the end of the class there will be free shirts. Yes, I said free and shirts in the same sentence. You should be here. Okay, first, the Deep Traffic competition. There are a lot of updates, and we'll cover those on Wednesday. It's a deep reinforcement learning competition. Last year, we received over 18,000 submissions. This year, we're going to go bigger. Not only can you control one car with your network, you can control up to 10. This is multi-agent deep reinforcement learning. This is super cool. Second, SegFuse, the dynamic driving scene segmentation competition, where you're given the raw video, the kinematics of the vehicle, so the movement of the vehicle, and state-of-the-art segmentation. For the training set, you're given ground truth labels: pixel-level labels, scene segmentation, and optical flow. With those pieces of data, you're tasked to try to perform better than the state of the art in image-based segmentation. Why is this critical and fascinating as an open research problem? Because robots that act in this world, in the physical space, must not only use these deep learning methods to interpret the spatial, visual characteristics of a scene; they must also interpret, understand, and track the temporal dynamics of the scene. This competition is about temporal propagation of information, not just scene segmentation. You must understand both space and time. And finally, Deep Crash, where we use deep reinforcement learning to slam cars thousands of times, here at MIT, at the gym.
You're given data on a thousand runs, where a car that knows nothing, using a monocular camera as its single input and driving over 30 miles an hour through a scene in which it has very little control and very little capability to localize itself, must act very quickly. In that scene, you're given a thousand runs to learn anything. We'll discuss this in the coming weeks. For this competition, we evaluate everyone's submission in simulation, but the top four submissions we put head to head at the gym, and until there is a winner declared, we keep slamming cars at 30 miles an hour. Deep Crash. And also on the website from last year, and on GitHub, there's Deep Tesla, which uses the large-scale naturalistic driving dataset we have to train a neural network to do end-to-end steering: it takes in monocular video from the forward roadway and produces steering commands for the car. Lectures. Today we'll talk about deep learning. Tomorrow we'll talk about autonomous vehicles. Deep reinforcement learning is on Wednesday. Driving scene understanding, so segmentation, that's Thursday. On Friday, we have Sacha Arnoud, the director of engineering at Waymo. Waymo is one of the companies that's truly taking huge strides in fully autonomous vehicles. They're taking the full L4, L5 autonomous vehicle approach. He also heads perception for them, and it's fascinating to learn from him what kind of problems they're facing and what kind of approach they're taking. We have Emilio Frazzoli, of whom one of last year's speakers, Sertac Karaman, said he is the smartest person he knows. Emilio Frazzoli is the CTO of nuTonomy, an autonomous vehicle company that was just acquired by Delphi for a large sum of money, and they're doing a lot of incredible work in Singapore and here in Boston. Next Wednesday, we are going to talk about the topic of our research and my personal fascination: deep learning for driver state sensing, understanding the human, perceiving everything about the human being inside the car and outside the car. One talk I'm really excited about is Oliver Cameron on Thursday. He is now the CEO of the autonomous vehicle startup Voyage and was previously the director of the self-driving car program at Udacity. He will talk about how to start a self-driving car company. For MIT folks and entrepreneurs, if you want to start one yourself, he'll tell you exactly how. It's super cool. And then Sterling Anderson, who was previously the director of the Tesla Autopilot team and is now a co-founder of Aurora, the self-driving car startup I mentioned, which has now partnered with NVIDIA and many others. So, why self-driving cars? This class is about applying data-driven learning methods to the problem of autonomous vehicles. Why are self-driving cars a fascinating and interesting problem space? Quite possibly, in my opinion, this is the first wide-reaching and profound integration of personal robots into society. Wide-reaching, because there are one billion cars on the road; even a fraction of that will change the face of transportation and how we move about this world. Profound because, and this is an important point that's not always understood, there's an intimate connection between a human and a vehicle when there's a direct transfer of control. It's a transfer of control that places his or her life into the hands of an artificial intelligence system.
I show a few quick clips here; you can Google "first time with Tesla Autopilot" on YouTube and watch people perform that transfer of control. There's something magical about a human and a robot working together that will transform what artificial intelligence is in the 21st century. And this particular AI system, self-driving cars, is at a scale, and of a life-critical nature, that will truly test the capabilities of AI. There is a personal connection, and I will argue throughout these lectures that we cannot escape considering the human being. The autonomous vehicle must not only perceive and control its movement through the environment, it must also perceive everything about the human driver and the passenger, and interact, communicate, and build trust with that driver. Because, in my view, as I will argue throughout this course, an autonomous vehicle is more of a personal robot than it is a perfect perception-control system. Because perfect perception and control through this world full of humans is extremely difficult, and could be two, three, four decades away. Autonomous vehicles are going to be flawed, they're going to have flaws, and we have to design systems that effectively transfer control to human beings when they can't handle the situation. And that transfer of control is a fascinating opportunity for AI. Because the perception of obstacles and obstacle avoidance is the easy problem. It's the safe problem: going 30 miles an hour, navigating through the streets of Boston, is easy. It's when you have to get to work and you're late, or you're sick of the person in front of you, that you want to go into the opposing lane and speed up. That's human nature, and we can't escape it. Our artificial intelligence systems can't escape human nature; they must work with it. What's shown here is one of the algorithms we'll talk about next week, for cognitive load, where 3D convolutional neural networks take in the raw eye region, the blinking, and the pupil movement to determine the cognitive load of the driver. We'll see how we can detect everything about the driver: where they're looking, emotion, cognitive load, body pose estimation, drowsiness. The movement towards full autonomy is so difficult, I would argue, that it almost requires human-level intelligence. That, as I said, two-, three-, four-decade-out journey for artificial intelligence researchers to achieve full autonomy will require solving some of the fundamental problems of creating intelligence. That's something we'll discuss in much more depth, and in a broader view, in two weeks in the artificial general intelligence course, where we have Andrej Karpathy from Tesla, Ray Kurzweil, and Marc Raibert from Boston Dynamics, who asked for the dimensions of this room because he's bringing robots. Nothing else was told to me. It'll be a surprise. So that is why I argue for the human-centered artificial intelligence approach, where every algorithmic design considers the human. For the autonomous vehicle on the left, the perception, scene understanding, and control problem, as we'll explore through the competitions and assignments of this course, can handle 90 percent, and an increasing percentage, of the cases. But it's the 10, 1, 0.1 percent of cases, as we get better and better, that we're not able to handle through these methods.
And that's where the human, perceiving the human, is really important. This is the video from last year of the Arc de Triomphe. Thank you, I didn't know it last year; I know now. That's one of millions of cases where human-to-human interaction is the dominant driver, not the basic perception-control problem. So, why deep learning in this space? Because deep learning is a set of methods that learn well from a lot of data, and to solve these problems, where human life is at stake, we have to have techniques that learn from real-world data. This is the fundamental reality of artificial intelligence systems that operate in the real world: they must learn from real-world data. Whether that's on the left, for the perception and control side, or on the right, for the human side: the perception of, communication with, and interaction and collaboration with the human, the human-robot interaction. Okay, so what is deep learning? If you allow me the definition of intelligence as the ability to accomplish complex goals, then I would argue a definition of understanding, maybe reasoning, is the ability to turn complex information into simple, useful, actionable information. And that is what deep learning does. Deep learning is representation learning, or feature learning if you will. It's able to take raw, complicated information that's hard to do anything with, and construct hierarchical representations of that information, to be able to do something interesting with it. It is the branch of artificial intelligence that is most capable of, and focused on, this task: forming representations from data, whether it's supervised or unsupervised, whether it's with the help of humans or not. It's able to find structure in the data such that you can extract simple, useful, actionable information. On the left, from Ian Goodfellow's book, is the basic example of image classification. The input is the image, on the bottom, with the raw pixels, and as we go up the stack, as we go up the layers, higher and higher order representations are formed: from edges to contours, to corners, to object parts, and then finally the full object, the semantic classification of what's in the image. This is representation learning. A favorite example for me is one from four centuries ago: our place in the universe, and representing that place in the universe, whether it's relative to the earth or relative to the sun. On the left is our current belief; on the right is the one that was held widely four centuries ago. Representation matters, because what's on the right is much more complicated than what's on the left. You can think of a simple case here, where the task is to draw a line that separates green triangles and blue circles. In the Cartesian coordinate space, on the left, the task is very difficult, impossible to do well. On the right, in polar coordinates, it's trivial. This transformation is exactly what we need to learn. This is representation learning. You can take the same kind of task, of having to draw a line that separates the blue curve and the red curve on the left. If we draw a straight line, there's no way to do it with zero error, with 100% accuracy. Shown on the right is our best attempt.
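As a small aside, here is a minimal sketch of that coordinate-transformation idea, with made-up data rather than anything from the lecture slides: points on two concentric rings are not separable by a straight line in Cartesian coordinates, but after converting to polar coordinates a single threshold on the radius separates them perfectly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes: points near radius 1 (class 0) and radius 3 (class 1).
n = 200
theta = rng.uniform(0, 2 * np.pi, n)
radius = np.concatenate([rng.normal(1.0, 0.1, n // 2),
                         rng.normal(3.0, 0.1, n // 2)])
labels = np.concatenate([np.zeros(n // 2), np.ones(n // 2)])

# Cartesian representation: the two rings are concentric, so no straight
# line in (x, y) separates them.
x, y = radius * np.cos(theta), radius * np.sin(theta)

# Polar representation: a single threshold on the radius separates them.
r = np.sqrt(x ** 2 + y ** 2)
predictions = (r > 2.0).astype(float)

accuracy = (predictions == labels).mean()
print(f"Accuracy of a simple threshold in polar coordinates: {accuracy:.2f}")
```

Deep learning automates exactly this step: instead of us hand-picking the polar transform, the hidden layers learn a transformation of the input under which the classes become easy to separate.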
But what we can do with deep learning, with a single hidden layer network as done here, is transform the topology, the mapping of the space, in such a way, in the middle, that a straight line can be drawn that separates the blue curve and the red curve. The learning of that function in the middle is what we're able to achieve with deep learning. It's taking raw, complicated information and making it simple, actionable, useful. And the point is that this kind of ability to learn from raw sensory information means that we can do a lot more with a lot more data. Deep learning gets better with more data, and that's important for real-world applications, where edge cases are everything. This is us driving with two perception-control systems. One is a Tesla vehicle with the Autopilot version one system, which uses a monocular camera to perceive the external environment and produce control decisions. The other is our own neural network, running on a Jetson TX2, that takes in the same monocular camera input and produces control decisions. The two systems argue, and when they disagree, they raise a flag to say that this is an edge case that needs human intervention. Covering such edge cases using machine learning is the main problem of artificial intelligence when applied to the real world. It is the main problem to solve. Okay, so what are neural networks? They're inspired, very loosely, by biological neural networks, and I'll discuss the key differences between our own brains and artificial brains, because there are a lot of insights in that difference. Here is a simulation of a thalamocortical brain network, which is only three million neurons and 476 million synapses. The full human brain is a lot more than that: a hundred billion neurons, 1,000 trillion synapses. There's inspirational music with this one that I didn't realize was here. It should make you think. Artificial neural networks, okay, let's just let it play. The human neural network is a hundred billion neurons, right? 1,000 trillion synapses. One of the state-of-the-art artificial neural networks is ResNet-152, which has 60 million synapses. That's a difference of about seven orders of magnitude. Human brains have about 10 million times more synapses than artificial neural networks, plus or minus an order of magnitude depending on the network. So what's the difference between a biological neuron and an artificial neuron? The topology of the human brain has no layers; artificial neural networks are stacked in layers, and they're fixed, for the most part. There is chaos, very little structure, in our human brain in terms of how neurons are connected. They're often connected to 10,000-plus other neurons; the number of synapses feeding into an individual neuron is huge. They're asynchronous: the human brain works asynchronously, while artificial neural networks work synchronously. The learning algorithm for artificial neural networks, the only one, the best one, is backpropagation, and we don't know how human brains learn. Processing speed is one of the only benefits we have with artificial neural networks: artificial neurons are faster, but they're also extremely power inefficient. And there is a division into stages of training and testing with artificial neural networks; biological neural networks, as you're sitting here today, are always learning.
The only profound similarity, the inspiring one, the captivating one, is that both are distributed computation at scale. There is an emergent aspect to neural networks, where the basic element of computation, a neuron, is extremely simple, but when connected together, beautiful, amazing, powerful approximators can be formed. A neural network is built up from these computational units: the inputs arrive on a set of edges with weights on them, the weights are multiplied by the input signal, a bias is added, and a nonlinear function determines whether the neuron gets activated or not, as visualized here. These neurons can be combined in a number of ways. They can form a feed-forward neural network, or they can feed back into themselves to have state, memory, in recurrent neural networks. The ones on the left are the most successful for most applications in computer vision. The ones on the right are very popular, and specifically used, when temporal dynamics, or time series of any kind, are involved. In fact, the ones on the right are much closer to the way our human brains are than the ones on the left, but that's also why they're really hard to train. One beautiful aspect of this emergent power from multiple neurons being connected together is the universal approximation property: with a single hidden layer, these networks can learn to approximate any function. That's an important property to be aware of, because the limits here are not in the power of the networks; the limits are in the methods by which we construct and train them. What kinds of machine learning, of deep learning, are there? We can separate them into two categories: memorizers, the approaches that essentially memorize patterns in the data, and approaches that, we can loosely say, are beginning to reason, to generalize over the data, with minimal human input. On top, in blue, are the "teachers": how much human input is needed to make each method successful. For supervised learning, which is where most of deep learning's successes come from, most of the data is annotated by human beings; the human is at the core of the success. Most of the data that's part of the training needs to be annotated by human beings, with some additional successes coming from augmentation methods that extend the data on which these networks are trained. The semi-supervised, reinforcement learning, and unsupervised methods that we'll talk about later in the course are where we hope the near-term successes are, and the unsupervised learning approaches are where the true excitement about the possibilities of artificial intelligence lies: being able to make sense of our world with minimal input from humans. So we can think of two kinds of deep learning impact spaces. One is special-purpose intelligence: taking a problem, formalizing it, collecting enough data on it, and being able to solve a particular case in a way that provides value. Of particular interest here is a network that estimates apartment costs in the Boston area. You could take the number of bedrooms, the square feet, and the neighborhood, and provide as output the estimated cost. On the right is the actual data of apartment costs.
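As an illustration of that special-purpose setup, here is a minimal PyTorch sketch of such a rent estimator. The feature encoding and the numbers are made up by me for illustration, not taken from the lecture or from real Boston data; the point is only the shape of the supervised training loop: forward pass, loss against ground truth, backpropagation, weight update.

```python
import torch
import torch.nn as nn

# Hypothetical, made-up data: [bedrooms, square feet, neighborhood index]
# and monthly rents in dollars (in practice you would normalize these features).
features = torch.tensor([[1.0,  500.0, 0.0],
                         [2.0,  800.0, 1.0],
                         [3.0, 1200.0, 2.0],
                         [1.0,  450.0, 2.0]])
rents = torch.tensor([[2100.0], [2900.0], [3800.0], [2600.0]])

# A small feed-forward network: 3 inputs -> hidden layer -> 1 output.
model = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for step in range(1000):
    optimizer.zero_grad()
    loss = loss_fn(model(features), rents)   # forward pass + loss vs. ground truth
    loss.backward()                          # backpropagate the error
    optimizer.step()                         # update the weights

# Estimated rent for a new, unseen apartment.
print(model(torch.tensor([[2.0, 700.0, 0.0]])))
```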
We're actually standing in an area that has over $3,000 for a studio apartment. Some of you may be feeling that pain. And then there's general-purpose intelligence, or something that feels like it's approaching general-purpose intelligence, which is reinforcement learning and unsupervised learning. Here, from Andrej Karpathy's Pong from Pixels, is a system that takes in an 80 by 80 pixel image and, with no other information, is able to win at this game. No information except a sequence of images, raw sensory information, the same kind of information that human beings take in from visual, audio, and touch sensory data, the very low-level data, and it's able to learn to win. This is a very simplistic, artificially constructed world, but nevertheless a world where no feature engineering is performed; only raw sensory information is used to win, with very sparse, minimal human input. We'll talk about that on Wednesday with deep reinforcement learning. For now, we'll focus on supervised learning, where there is input data, there is a network, a learning system we're trying to train, and there's a correct output that's labeled by human beings. That's the general training process for a neural network: input data, labels, and the training of that network, that model, so that in the testing stage, given new input data it has never seen before, it's tasked with producing guesses and is evaluated on those. For autonomous vehicles, that means being released, either in simulation or in the real world, to operate. And how neural networks learn is this: in the training stage there's the forward pass, taking the input data and producing a prediction, and then, because there's ground truth in the training stage, we can have a measure of error based on a loss function, which then punishes the synapses, the connections, the parameters that were involved in making that wrong prediction, and backpropagates the error through those weights. We'll discuss that in a little more detail in a bit. So what can we do with deep learning? We can do one-to-one mapping. Really, you can think of the input as being anything: a number, a vector of numbers, a sequence of numbers, a sequence of vectors of numbers. Anything you can think of, from images to video to audio, can be represented in this way. And the output can be the same: a single number, or images, video, text, audio. One-to-one mapping on the bottom, one-to-many, many-to-one, many-to-many, and many-to-many with different starting points for the data, asynchronous. Some quick terms that will come up. Deep learning is the same as neural networks; it's really deep neural networks, large neural networks. It's a subset of machine learning that has been extremely successful in the past decade. Multi-layer perceptron, deep neural network, recurrent neural network, long short-term memory network (LSTM), convolutional neural network, and deep belief networks: all of these will come up through the slides. And there are specific operations, layers within these networks, of convolution, pooling, activation, and backpropagation, concepts that we'll discuss in this class. Activation functions: there are a lot of variants. On the left is the activation function, the left column. On the x-axis is the input; on the y-axis is the output.
The sigmoid function's output, if the font is too small to read, is not centered at zero. The tanh function is centered at zero, but it still suffers from vanishing gradients. Vanishing gradients occur when the input is very low or very high: as you see in the right column there, the derivative of the function is very low, so learning is very slow. ReLU is also not zero-centered, but it does not suffer from vanishing gradients. Backpropagation is the process of learning. It's the way we go from the error, computed as the loss function on the bottom right of the slide, taking the actual output of the network from the forward pass, subtracting it from the ground truth, squaring, and dividing by two, and then use that loss function to construct a gradient and backpropagate the error to the weights that were responsible for making either a correct or an incorrect decision. So the subtasks of that: there's a forward pass, there's a backward pass, and then a fraction of the gradient is subtracted from the weights. That's it. That process is modular, local to each individual neuron, which is why we're able to distribute it, to parallelize it across the GPU. So, learning for a neural network: these computational units are extremely simple, and they're extremely simple to correct when they're part of a larger network that makes an error. All of that boils down to essentially an optimization problem, where the objective, the utility function, is the loss function, and the goal is to minimize it. We have to update the parameters, the weights on the synapses and the biases, to decrease that loss function. And that loss function is highly nonlinear. Depending on the activation functions, different properties, different issues arise. There are vanishing gradients for sigmoid, where the learning can be slow. There are dying ReLUs, where the derivative is exactly zero for inputs less than zero. There are solutions to this, like leaky ReLUs, and a bunch of details that you may discover when you try to win the Deep Traffic competition. But for the most part, these are the main activation functions, and it's the choice of the neural network designer which one works best. There are saddle points; all the problems from numerical nonlinear optimization arise here. It's hard to break symmetry, and stochastic gradient descent, without any tricks added to it, can take a very long time to arrive at the minimum. One of the biggest problems in all of machine learning, and certainly deep learning, is overfitting. You can think of the blue dots in the plot here as the data to which we want to fit a curve. We want to design a learning system that approximates the regression of this data. In green is a sine curve: simple, fits well. And then there's a ninth-degree polynomial, which fits even better in terms of the error, but it clearly overfits this data. If there's other data it has not seen yet that it has to fit, it's likely to produce a high error. So it's overfitting the training set. This is a big problem for small datasets, and we have to fix it with regularization. Regularization is a set of methodologies that prevent overfitting: learning the training data too well and then not being able to generalize to the testing stage.
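As a quick aside before returning to overfitting, here is a tiny sketch, my own illustration rather than anything from the slides, comparing the derivatives of sigmoid and ReLU at different input values. It shows the vanishing gradient of the sigmoid for large-magnitude inputs and the zero gradient of ReLU for negative inputs, which is the dying-ReLU issue.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)           # derivative of the sigmoid: saturates for large |x|

def relu_grad(x):
    return 1.0 if x > 0 else 0.0   # derivative of ReLU: exactly 0 for negative inputs

for x in [-10.0, -1.0, 0.5, 1.0, 10.0]:
    print(f"x={x:6.1f}   sigmoid gradient={sigmoid_grad(x):.6f}   ReLU gradient={relu_grad(x):.1f}")
```

For x = 10 the sigmoid gradient is roughly 0.000045, so almost no error signal flows back through that neuron, while the ReLU gradient stays at 1 for any positive input.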
Coming back to overfitting: the main symptom is that the error decreases on the training set but increases on the test set. There are a lot of techniques in traditional machine learning that deal with this, cross-validation and so on, but because of the cost of training neural networks, it's traditional to use what's called a validation set. So you create a subset of the training set that you keep away, for which you have the ground truth, and use it as a representative of the testing set. You perform early stopping, or, more realistically, just save a checkpoint often to see how, as the training evolves, the performance changes on the validation set. You can stop when the performance on the validation set is getting a lot worse; it means you're overtraining on the training set. In practice, of course, we run training much longer and see which snapshot, which checkpoint of the network, is the best performing. Dropout is another very powerful regularization technique, where we randomly remove some of the nodes in the network, along with their incoming and outgoing edges. What that really looks like is a probability of keeping a node, and in many deep learning frameworks today it comes as a dropout layer. So it's essentially a probability, usually greater than 0.5, that a node will be kept. For the input layer, the probability should be much higher, or, more effectively, what works well is just adding noise. What's the point here? You want to create enough diversity in the training such that it generalizes to testing. And as you'll see with the Deep Traffic competition, there are the L2 and L1 penalties, weight decay, weight penalty, where there's a penalization on the weights when they get too large. The L2 penalty keeps the weights small unless the error derivative is huge, produces a smoother model, and prefers to distribute the weight: when there are two similar inputs, it prefers to put half the weight on each rather than all the weight on one of the edges. It makes the network more robust. The L1 penalty has the benefit that it allows a few weights to remain very large. These are the regularization techniques, and I wanted to mention them because they're useful for some of the competitions here in the course. And I recommend going to the TensorFlow Playground to play around with some of these parameters, where you get to, online in the browser, play around with different inputs, different features, different numbers of layers, and regularization techniques, and build your intuition about classification and regression problems given different input datasets. So what changed? Why, after the past many decades, are neural networks, which have gone through two winters, now again dominating the artificial intelligence community? CPUs, GPUs, ASICs: the computational power has skyrocketed, from Moore's law to GPUs. There are huge datasets, including ImageNet and others. There is research: backpropagation in the 80s, convolutional neural networks, LSTMs. There have been a lot of interesting breakthroughs in how to design these architectures, how to build them such that they're trainable efficiently using GPUs.
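Circling back to the dropout and weight-penalty ideas from a moment ago, here is a minimal sketch of how they typically appear in code, assuming PyTorch (the lecture itself doesn't prescribe a framework). Note that PyTorch's Dropout parameter is the probability of dropping a unit, the complement of the "keep" probability described above, and the L2 penalty shows up as the optimizer's weight_decay.

```python
import torch
import torch.nn as nn

# A small classifier with a dropout layer between the hidden layers.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),      # randomly zeroes half the hidden units during training
    nn.Linear(64, 2),
)

# L2 penalty (weight decay) is applied through the optimizer.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

x = torch.randn(8, 20)

model.train()               # dropout is active during training
logits_train = model(x)

model.eval()                # dropout is disabled at test time
logits_eval = model(x)
print(logits_train.shape, logits_eval.shape)
```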
There is also the software infrastructure: from being able to share data and code with Git, to being able to train networks and effectively view neural networks as a stack of layers, as opposed to having to implement everything from scratch, with TensorFlow, PyTorch, and other deep learning frameworks. And there's huge financial backing from Google, Facebook, and so on. In order to understand why deep learning works so well, and where its limitations are, we need to understand where our own intuition comes from about what is hard and what is easy. The important thing about computer vision, which is a lot of what this course is about, even in the deep reinforcement learning formulation, is that visual perception, for us human beings, was formed 540 million years ago. That's 540 million years' worth of data. Abstract thought only formed about 100,000 years ago. That's several orders of magnitude less data. So there are predictions that seem trivial to us human beings but are completely challenging for neural networks, which get them wrong. Here, on the left, is a prediction of a dog; with a little bit of distortion and noise added to the image, producing the image on the right, the neural network confidently, with 99-plus percent confidence, predicts that it's an ostrich. And there are all these problems it has to deal with: whether it's computer vision data, text data, or audio, all of this variation arises. In vision, there's illumination variability: the set of pixels, the numbers, look completely different depending on the lighting conditions. The biggest problem in driving is lighting conditions, lighting variability. Pose variation: objects need to be learned from every different perspective. I'll discuss that when it comes to sensing the driver. Most of the deep learning work that's done on the face, on the human, is done on the frontal or semi-frontal face. There's very little work done on the full 360-degree pose variability that a human being can take on. Inter-class variability: for the classification problem, for the detection problem, there are a lot of different kinds of objects, cats, dogs, cars, bicyclists, pedestrians. So that brings us to object classification, and I'd like to take you through where deep learning has taken big strides over the past several years, leading up to this year, to 2018. So let's start at object classification. It's when you take a single image and you have to say the one class that's most likely to be in that image. The most famous variant of that is the ImageNet competition, the ImageNet challenge. The ImageNet dataset is a dataset of 14 million images with 21,000 categories. For, say, the category of fruit, there's a total of 188,000 images of fruit, and there are 1,200 images of Granny Smith apples. That gives you a sense of what we're talking about here. This has been the source of a lot of interesting breakthroughs and a lot of the excitement in deep learning. The first big successful network, at least one that became famous in deep learning, is AlexNet in 2012, which took a significant leap in performance on the ImageNet challenge. It was one of the first neural networks successfully trained on the GPU, and it achieved an incredible performance boost over the previous year on the ImageNet challenge.
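For a concrete sense of what this classification task looks like in practice, here is a minimal sketch of running a pretrained ImageNet classifier and reading off its top five guesses, the evaluation format described next. It assumes torchvision is installed (newer versions prefer a weights= argument over pretrained=True), and the image path is a hypothetical placeholder.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Load a pretrained ImageNet classifier (weights are downloaded on first use).
model = models.resnet50(pretrained=True)
model.eval()

# Standard ImageNet preprocessing.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("some_image.jpg").convert("RGB")   # hypothetical path
batch = preprocess(image).unsqueeze(0)                # add a batch dimension

with torch.no_grad():
    probabilities = torch.softmax(model(batch)[0], dim=0)

# Print the five most likely ImageNet class indices and their probabilities.
top5 = torch.topk(probabilities, k=5)
for prob, class_index in zip(top5.values, top5.indices):
    print(f"class {class_index.item()}: {prob.item():.3f}")
```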
The challenge, and I'll talk about some of these networks, is: given a single image, you give five guesses, and one of them has to be correct. A question that often comes up is about the human annotation: how do you know the ground truth? Human-level performance is 5.1 percent error on this task. But the way the annotation for ImageNet is performed is this: there's a Google search where you pull images that are already labeled for you, and then the annotation that other humans perform on Mechanical Turk is just binary: is this a cat or not a cat? So they're not tasked with performing very high-resolution semantic labeling of the image. Okay, so from 2012, with AlexNet, to today, and the big transition in 2018 of the ImageNet challenge leaving Stanford and going to Kaggle. It's sort of a monumental step, because 2015, with the ResNet network, was the first time that human-level performance was exceeded. And I think this is a very important map of where deep learning is, for what I would argue is a toy example, despite the fact that it's 14 million images. So we're developing state-of-the-art techniques here, and the next stage, as we are now exceeding human-level performance on this task, is how to take these methods into the real world, to perform scene perception, to perform driver state perception. In 2016 and 2017, CUImage and SENet made unique new additions to the previous formulations and achieved 2.25 percent error on the ImageNet classification challenge. It's an incredible result. Okay, so you have this image classification architecture that takes in a single image, passes it through convolution and pooling layers and, at the end, fully connected layers, and performs a classification or regression task. And you can swap out that final layer to perform any other kind of task, including, with recurrent neural networks, image captioning and so on, or localization with bounding boxes, or you can build fully convolutional networks, which we'll talk about on Thursday: that's when you take an image as input and produce an image as output, where the output image, in this case, is a segmentation, where the color indicates the category of the object. So it's pixel-level segmentation: every single pixel in the image is assigned a class, a category, that the pixel belongs to. This is the kind of task that's overlaid on top of other sensory information coming from the car in order to perceive the external environment. You can continue to extract information from images in this way to produce image-to-image mappings, for example to colorize images, taking grayscale images to color images. Or you can use that kind of heat-map information to localize objects in the image. So as opposed to just classifying that this is an image of a cow, R-CNN, Fast R-CNN, Faster R-CNN, and a lot of other localization networks allow you to propose different candidates for where exactly the cow is located in the image, and thereby perform object detection, not just object classification. 2017 has seen a lot of cool applications of these architectures, one of which is background removal: again, mapping from image to image, the ability to remove the background from selfies of humans, or human-like pictures, faces.
The references, with some incredible animations, are at the bottom of the slide, and the slides are now available online. Pix2PixHD: there's been a lot of work on GANs, generative adversarial networks. In particular in driving, GANs have been used to generate examples from source data, whether that's from raw data or, in this case with Pix2PixHD, by taking coarse, pixel-level semantic labeling of the images and producing photorealistic, high-definition images of the forward roadway. This is an exciting possibility for being able to generate a variety of cases for self-driving cars, for autonomous vehicles, to learn from: to augment the data and be able to change the way different roads look, the road conditions, to change the way vehicles, cyclists, and pedestrians look. Then we can move on to recurrent neural networks. Everything I've talked about so far was one-to-one mapping, from image to image, or image to number. Recurrent neural networks work with sequences. We can use sequences to generate handwriting, to generate text, captions from an image based on the localizations, the various detections, in that image. We can do video description generation: taking a video and combining convolutional neural networks with recurrent neural networks, using convolutional neural networks to extract features frame by frame, and using those extracted features as input to the RNNs to then generate a labeling, a description of what's going on in the video. There are a lot of exciting approaches for autonomous systems, especially for drones, where the time to make a decision is short. Same with the RC car traveling 30 miles an hour. Attentional mechanisms, for steering the attention of the network, have been very popular for the localization task, and for simply reducing how much of the image, how many pixels, need to be considered in the classification task. So we can model the way a human being looks around an image to interpret it, and use the network to do the same. And we can use that kind of steering to draw images as well. Finally, the big breakthroughs in 2017 came from this, the Pong from Pixels idea: reinforcement learning using raw sensory data, deep RL methods, which we'll talk about on Wednesday. I'm really excited about this; the underlying methodology of Deep Traffic and Deep Crash is using neural networks as the approximators inside reinforcement learning approaches. So AlphaGo in 2016 achieved a monumental task that, when I first started in artificial intelligence, I was told was impossible for an AI system to accomplish: winning at the game of Go against the top human player in the world. However, that method was trained on human expert positions; the AlphaGo system was trained on previous games played by human experts. And in an incredible accomplishment, AlphaGo Zero in 2017 was able to beat AlphaGo and many of its variants by playing itself, starting from zero information: no knowledge of human experts, no games, no training data, very little human input. What's more, it was able to generate moves that were surprising to human experts. I think it was Einstein who said that the key mark of intelligence is imagination. I think it's beautiful to see an artificial intelligence system come up with something that truly surprises human experts.
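Going back for a moment to the CNN-plus-RNN pattern for video description mentioned above, here is a minimal sketch, my own toy illustration with random stand-in data rather than anything from the lecture: a small convolutional encoder produces one feature vector per frame, and an LSTM consumes that sequence to produce a prediction for the clip.

```python
import torch
import torch.nn as nn

# Toy stand-in for a video batch: 2 clips, 8 frames each, 3x64x64 images.
frames = torch.randn(2, 8, 3, 64, 64)

# A tiny per-frame convolutional encoder (stand-in for a real CNN backbone).
cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),   # 64 -> 32
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 32 -> 16
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                                 # 32-dim feature per frame
    nn.Flatten(),
)

# An LSTM that consumes the sequence of per-frame features.
lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
output_head = nn.Linear(64, 10)   # e.g. 10 hypothetical description tokens/classes

batch, time = frames.shape[:2]
features = cnn(frames.reshape(batch * time, 3, 64, 64)).reshape(batch, time, 32)
sequence_out, _ = lstm(features)
prediction = output_head(sequence_out[:, -1])   # prediction from the last time step
print(prediction.shape)   # torch.Size([2, 10])
```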
For the gambling junkies, DeepStack and a few other variants were used in 2017 to win at heads-up poker. Again, another incredible result, one I was always told would be impossible for any machine learning method to achieve, and it was able to beat a professional player; several competitors have come along since. We're yet to be able to win in a tournament setting, so with multiple players; for those of you familiar, heads-up poker is one-on-one, a much, much smaller, easier space to solve. There are a lot more human-to-human dynamics going on when there are multiple players. But that's the task for 2018. And the drawbacks: this is one of my favorite videos, I show it often, of Coast Runners. For these deep reinforcement learning approaches, the definition of the reward function controls how the actual system behaves, and this will be extremely important for us with autonomous vehicles. Here, the boat is tasked with gaining the highest number of points, and it figures out that it does not need to race, which is the whole point of the game, in order to gain points, but can instead pick up green circles that regenerate themselves over and over. This is the counterintuitive behavior of a system that would not be expected when you first design the reward function. And even though this is a very formal, simple system, it's nevertheless extremely difficult to come up with a reward function that makes it operate in the way you expect it to operate. Very applicable to autonomous vehicles. And of course, on the perception side, as I mentioned with the ostrich and the dog, with a little bit of noise, a network predicts with 99.6% confidence that the noise up top is a robin, a cheetah, an armadillo, a lesser panda. These are outputs from actual state-of-the-art neural networks taking in the noise and producing a confident prediction. It should build our intuition that the visual characteristics, the spatial characteristics of an image, do not necessarily convey the level of hierarchy necessary to function in this world. In a similar way as with the dog and the ostrich, a network, with a little bit of noise, can confidently make the wrong prediction, thinking a school bus is an ostrich, and a speaker is an ostrich. They're easily fooled, but not really, because they perform well the tasks that they were trained to do. So we have to make sure we keep our intuition optimized to the way machines learn, not the way humans have learned over the 540 million years of data that we've gained through developing the eye, through evolution. The current challenges we're taking on: first, transfer learning. There's a lot of success in transfer learning between domains that are very close to each other, so image classification from one domain to the next. There's a lot of value in forming representations of the way natural scenes look in order to do scene segmentation, in the driving case for example. But we're not able to make any bigger leaps in the way we perform transfer learning. The biggest challenge for deep learning is to generalize across domains. It lacks the ability to reason in the way we defined understanding previously, which is the ability to turn complex information into simple, useful information.
To take domain-specific, complicated sensory information that doesn't relate to the initial training set and convert it: that's the open challenge for deep learning. Train on very little data, and then go and reason and operate in the real world. Right now, neural networks are very inefficient. They require big data. They require supervised data, which means they need costly human input. They're not fully automated: even though, incredibly, the big breakthrough is that feature learning is performed automatically, you still have to do a lot of design of the actual architecture of the network, and all the different hyperparameter tuning needs to be performed. Human input, perhaps a little bit more educated human input in the form of PhD students, postdocs, and faculty, is required to tune these hyperparameters. But nevertheless, human input is still necessary; they cannot be left alone, for the most part. Defining the reward, as we saw with Coast Runners, is extremely difficult for systems that operate in the real world. Transparency, quite possibly, is not an important one, but neural networks currently are black boxes, for the most part. Except through a few successful visualization methods that visualize different aspects of the activations, they're not able to reveal to us humans why they work or where they fail. And this is a philosophical question for autonomous vehicles: we may not care, as human beings, if a system works well enough. But I would argue that it'll be a long time before systems work well enough that we don't care. We'll care, and we'll have to work together with these systems, and that's where transparency, communication, and collaboration are critical. And edge cases: it's all about edge cases. In robotics, in autonomous vehicles, 99.9% of driving is really boring. It's the same, especially highway driving, traffic driving, it's the same. The obstacle avoidance, the car following, the lane centering, all these problems are trivial. It's the edge cases, the trillions of edge cases that need to be generalized over on a very small amount of training data. So again, I return to: why deep learning? I mentioned a bunch of challenges, and this is an opportunity. It's an opportunity to come up with techniques that operate successfully in this world. So I hope the competitions we present in this class, in the autonomous vehicle domain, will give you some insight and an opportunity to apply these ideas to what are, in some cases, open research problems: semantic segmentation for external perception, control of the vehicle in Deep Traffic, control of the vehicle in underactuated, high-speed conditions in Deep Crash, and driver state perception. So with that, I wanted to introduce deep learning to you today, before we get to the fun tomorrow of autonomous vehicles. I would like to thank NVIDIA, Google, Autoliv, Toyota, and, at the risk of setting off people's phones, Amazon Alexa Auto. But truly, I would like to say that I've been humbled over the past year by the thousands of messages we received, by the attention, by the 18,000 competition entries, by the many brilliant people across the world, not just here at MIT, that I've gotten a chance to interact with. And I hope we go bigger and do some impressive stuff in 2018. Thank you very much, and tomorrow is self-driving cars.