Back to Index

Efficient Computing for Deep Learning, Robotics, and AI (Vivienne Sze) | MIT Deep Learning Series


Chapters

0:00 Introduction
0:43 Talk overview
1:18 Compute for deep learning
5:48 Power consumption for deep learning, robotics, and AI
9:23 Deep learning in the context of resource use
12:29 Deep learning basics
20:28 Hardware acceleration for deep learning
57:54 Looking beyond the DNN accelerator for acceleration
1:03:45 Beyond deep neural networks

Transcript

- I'm happy to have Vivienne Sze here with us. She's a professor here at MIT, working in the very important and exciting space of developing energy-efficient and high-performance systems for machine learning, computer vision, and other multimedia applications. This involves the joint design of algorithms, architectures, circuits, and systems to enable optimal trade-offs between power, speed, and quality of result.

One of the important differences between the human brain and AI systems is the energy efficiency of the brain. So Vivienne is a world-class researcher at the forefront of discovering how we can close that gap. So please give her a warm welcome. - I'm really happy to be here to share some of the research and an overview of this area, efficient computing.

So actually what I'm gonna be talking about today is gonna be a little bit broader than just deep learning. We'll start with deep learning, but we'll also move to how we might apply this to robotics and other AI tasks, and why it's really important to have efficient computing to enable a lot of these exciting applications.

Also, I just wanna mention that a lot of the work I'm gonna present today is not done by myself, but in collaboration with a lot of folks at MIT over here. And of course, if you want access to the slides, they're available on our website. So given that this is a deep learning lecture series, I wanna first start by talking a little bit about deep neural nets.

So we know that deep neural nets have, you know, generated a lot of interest and have many very compelling applications. But one of the things that has, you know, come to light over the past few years is the increasing need for compute. OpenAI actually showed that there's been a significant increase in the amount of compute that is required to perform deep learning applications and to do the training for deep learning over the past few years.

So it's actually grown exponentially over the past few years; it's grown, in fact, by over 300,000 times in terms of the amount of compute we need to drive increases in the accuracy of a lot of the tasks that we're trying to achieve. At the same time, if we start looking at the environmental implications of all of this processing, they can be quite severe.

So if we look at, for example, the carbon footprint of, you know, training neural nets, and compare it to, you know, the carbon footprint of flying across North America from New York to San Francisco, or the carbon footprint of an average human life, you can see that, you know, neural networks can be orders of magnitude greater than that.

So the environmental or carbon footprint implications of computing for deep neural nets can be quite severe as well. Now, a lot of this has to do with compute in the cloud. Another important area where we wanna do compute is actually moving the compute from the cloud to the edge itself, into the device where a lot of the data is being collected.

So why would we wanna do that? So there's a couple of reasons. First of all, communication. In a lot of places around the world, you might not have a very strong communication infrastructure, right? So you don't necessarily wanna have to rely on a communication network in order to do a lot of these applications.

So again, you know, removing your tether to the cloud is important. Another reason is that a lot of the time we, you know, apply deep learning to applications where the data is very sensitive. So you can think about things like healthcare, where you're collecting very sensitive data.

And so privacy and security again is really critical. And rather than sending the data to the cloud, you'd like to bring the compute to the data itself. Finally, another compelling reason for, you know, bringing the compute into the device or into the robot is latency. So this is particularly true for interactive applications.

So you can think of things like autonomous navigation, robotics, or self-driving vehicles where you need to interact with the real world. You can imagine if you're driving very quickly down the highway and you detect an obstacle, you might not have enough time to send the data to the cloud, wait for it to be processed, and send the instruction back in.

So again, you wanna move the compute into the robot or into the vehicle itself. Okay, so hopefully this is establishing why we wanna move the compute into the edge. But one of the big challenges of doing processing in the robot or in the device actually has to do with power consumption itself.

So if we take the self-driving car as an example, it's been reported that it consumes over 2000 watts of power just for the computation itself, just to process all the sensor data that it's collecting. Right, and this actually generates a lot of heat. It takes up a lot of space.

You can see in this prototype that all the compute is placed in the trunk; it generates a lot of heat and often needs water cooling. So this can be a big cost and logistical challenge for self-driving vehicles. Now you can imagine that this is gonna be much more challenging if we shrink down the form factor of the device itself to something that is perhaps portable in your hands.

You can think about smaller robots or something like your smartphone or cell phone. In these particular cases, when you think about portable devices, you actually have very limited energy capacity, and this is based on the fact that the battery itself is limited in terms of its size, weight, and cost.

Right, so you can't have a very large amount of energy on these particular devices. Secondly, when you take a look at the platforms that are currently used for embedded processing for these particular applications, they tend to consume over 10 watts, which is an order of magnitude higher than the power consumption that you typically would allow for these particular handheld devices.

So in these handheld devices, typically you're limited to under a watt due to the heat dissipation. For example, you don't want your cell phone to get super hot. Okay, so in the past decade or so, or decades, what we would do to address this challenge is that we would wait for transistors to become smaller, faster, and more efficient.

However, this has become a challenge over the past few years, so transistors are not getting more efficient. So for example, Moore's Law, which typically makes transistors smaller and faster, has been slowing down, and Dennard scaling, which has made transistors more efficient, has also slowed down or ended. So you can see here over the past 10 years, this trend has really flattened out.

Okay, so this is a particular challenge because we want more and more compute to drive deep neural network applications, but the transistors are not becoming more efficient. Right? So what we have to turn to in order to address this is specialized hardware, to achieve the significant improvements in speed and energy efficiency that we require for our particular application.

When we talk about designing specialized hardware, this is really about thinking about how we can redesign the hardware from the ground up, particularly targeted at these AI, deep learning, and robotic tasks that we're really excited about. Okay, so this notion is not new. In fact, it's become extremely popular to do this.

Over the past few years, there's been a large number of startups and companies that have focused on building specialized hardware for deep learning. So in fact, New York Times reported, I guess it was two years ago that there's a record number of startups looking at building specialized hardware for AI and for deep learning.

Okay, so we'll talk a little bit about what specialized hardware looks like for these particular applications. Now, if you really care about energy and power efficiency, the first question you should ask is where is the power actually going for these applications? And so as it turns out, power is dominated by data movement.

So it's actually not the computations themselves that are expensive, but moving the data to the computation engine that's expensive. So for example, shown here in blue is a range of energy consumption for a variety of types of computations, for example, multiplications and additions at various different precisions.

So you have, for example, floating point down to fixed point and eight-bit integer, and the same with additions. And as you can see, as you scale down the precision, the energy consumption of each of these operations reduces. But what's really surprising here is if you look lower at the energy consumption of data movement, right?

Again, this is delivering the input data to do the multiplication and then, you know, moving the output of the multiplication somewhere into memory, it can be very expensive. So for example, if you look at the energy consumption of a 32 bit read from an SRAM memory, this is an eight kilobyte SRAM.

So it's a very small memory that you would have on the processor or on the chip itself. This is already gonna consume five picojoules of energy. So equivalent or even more than a 32 bit floating point multiply. And this is from a very small memory. If you need to read this data from off chip, so outside the processor, for example, in DRAM, it's gonna be even more expensive.

So in this particular case, we're showing 640 picojoules in terms of energy. And you can notice that the horizontal axis here is basically a logarithmic axis. So you're talking about orders of magnitude increase in energy for data movement compared to the compute itself, right?

So this is a key takeaway here. So if we really want to address the energy consumption of these particular types of processing, we really wanna look at reducing data movement. Okay, but what's the challenge here? So if we take a look at a popular AI robotics type of application like autonomous navigation, the real challenge here though, is that these applications use a lot of data, right?

So for example, one of the things you need to do in autonomous navigation is what we call semantic understanding. So you need to be able to identify, you know, which pixel belongs to what. So for example, in this scene, you need to know that this pixel represents the ground, this pixel represents the sky, this pixel represents, you know, a person itself.

Okay, so this is an important type of processing. Often if you're traveling quickly, you wanna be able to do this at a very high frame rate. You might need to have large resolution. So for example, typically if you want HD images, you're talking about 2 million pixels per frame.

And then often, if you also wanna be able to detect objects at different scales or see objects that are far away, you need to do what we call data expansion. For example, build a pyramid for this, and this would increase the amount of pixels or amount of data you need to process by, you know, one to two orders of magnitude.

So that's a huge amount of data that you have to process right off the bat there. Another type of processing or understanding that you wanna do for autonomous navigation is what we call geometric understanding, and that's when you're kind of navigating, you wanna build a 3D map of the world that's around you.

And you can imagine the longer you travel for, the larger the map you're gonna build. And again, that's gonna be more data that you're gonna have to process and compute on. Okay, so this is a significant challenge for autonomous navigation in terms of the amount of data. Another aspect of autonomous navigation, and also of other applications like AR, VR, and so on, is understanding your environment, right?

So a typical thing you might need to do is to do depth estimation. So for example, if I give you an image, can you estimate the distance of how far a given pixel is from you? And also semantic segmentation, we just talked about that before. So these are important types of ways to understand your environment when you're trying to navigate.

And it should be no surprise to you that in order to do these types of processing, the state-of-the-art approaches utilize deep neural nets. Right? But the challenge here is that these deep neural nets often require several hundred million operations and weights to do the computation. So when you compare it to something that you would all have on your phone, for example, video compression, you're talking about a two to three orders of magnitude increase in computational complexity.

So this is a significant challenge 'cause if we'd like to have deep neural networks be as ubiquitous as something like video compression, we really have to figure out how to address this computational complexity. We also know that deep neural networks are not just used for understanding the environment or autonomous navigation, but it's really become the cornerstone of many AI applications from computer vision, speech recognition, gameplay, and even medical applications.

And I'm sure a lot of these have been covered through this course. So briefly, I'm just gonna give a quick overview of some of the key components in deep neural nets, not because, you know, I'm sure all of you understand it, but because since this area is very popular, the terminology can vary from discipline to discipline.

So I'll just do a brief overview to align ourselves on the terminology itself. So what are deep neural nets? Basically, you can view it as a way of, for example, understanding the environment. It's a chain of different layers of processing where you can imagine for an input image, at the low level or the earlier parts of the neural net, you're trying to learn different low-level features such as edges of an image.

And as you get deeper into the network, as you chain more of these kind of computational layers together, you start being able to detect higher and higher level features until you can, you know, recognize a vehicle, for example. And, you know, the difference of this particular approach compared to more traditional ways of doing computer vision is that how we extract these features is learned from the data itself, as opposed to having an expert come in and say, "Hey, look for the edges, look for, you know, the wheels," and so on.

The fact that it recognizes these features is a learned approach. Okay, what is it doing at each of these layers? Well, it's actually doing a very simple computation. This is looking at the inference side of things. Basically, effectively, what it's doing is a weighted sum. Right, so you have the input values, and we'll color code the inputs as blue here and try and stay consistent with that throughout the talk.

We apply certain weights to them, and these weights are learned from the training data, and then they would generate an output, which is typically red here, and it's basically a weighted sum, as we can see. We then pass this weighted sum through some form of non-linearity. So, you know, traditionally, it used to be sigmoids.

More recently, we use things like ReLUs, which basically set, you know, negative values to zero. But the key takeaway here is that if you look at this computational kernel, the key operation in a lot of these neural networks is performing this multiply and accumulate to compute the weighted sum.

And this accounts for over 90% of the computation. So if we really want to focus on, you know, accelerating neural nets or making them more efficient, we really want to focus on minimizing the cost of this multiply and accumulate itself.
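To make that kernel concrete, here is a minimal NumPy sketch of one layer of weighted sums (multiply-and-accumulates) followed by a ReLU; the sizes and variable names are illustrative only, not from the talk.

```python
import numpy as np

def fully_connected_layer(inputs, weights, bias):
    """One layer of weighted sums (multiply-accumulates) followed by ReLU.

    inputs:  (num_inputs,)             -- input activations (blue in the talk)
    weights: (num_outputs, num_inputs) -- learned weights (green)
    bias:    (num_outputs,)
    returns: (num_outputs,)            -- output activations (red)
    """
    weighted_sum = weights @ inputs + bias   # the multiply-and-accumulate kernel
    return np.maximum(weighted_sum, 0.0)     # ReLU: negative values set to zero

# Example with arbitrary sizes.
rng = np.random.default_rng(0)
x = rng.standard_normal(128)
W = rng.standard_normal((64, 128))
b = np.zeros(64)
y = fully_connected_layer(x, W, b)
print(y.shape)  # (64,)
```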

There are also various popular types of layers used in deep neural networks, and they often vary in terms of, you know, how you connect up the different layers. So for example, you can have feed-forward layers where the inputs are always connected to the outputs. You can have feedback layers where the outputs are connected back into the inputs. You can have fully-connected layers where basically all the outputs are connected to all the inputs, or sparsely-connected layers.

And you might be familiar with some of these layers. So for example, fully-connected layers, just like what we talked about, all inputs and all outputs are connected. They tend to be feed-forward. When you put them together, they're typically referred to as a multilayer perceptron. You have convolutional layers, which are also feed-forward, but then you have sparsely-connected weight-sharing connections.

And when you put them together, they're often referred to as convolutional networks. And they're typically used for image-based processing. You have recurrent layers where we have this feedback connection, so the output is fed back to the input. When we combine recurrent layers, they're referred to as recurrent neural nets.

And these are typically used to process sequential data, so speech or language-based processing. And then most recently, what has become really popular is attention layers or attention-based mechanisms. They often involve matrix multiplies, which is, again, multiply and accumulate. And when you combine these, they're often referred to as transformers.

Okay, so let's first get an idea as to why convolutional neural nets, or deep learning in general, are so much more computationally complex than other types of processing. So we'll focus on convolutional neural nets as an example, although many of these principles apply to other types of neural nets. And the first place to look to see why it's complicated is the computational kernel.

So how does it actually perform convolution itself? So let's say you have this 2D input image. If it's at the input of the neural net, it would be an image. If it's deeper in the neural net, it would be the input feature map. And it's gonna be composed of activations.

Or you can think from an image, it's gonna be composed of pixels. And we convolve it with, let's say, a 2D filter, which is composed of weights. Right, so typical convolution, what you would do is you would do an element-wise multiplication of the filter weights with the input feature map activations.

You would sum them all together to generate one output value. And we would refer to that as the output activation. Right, and then because it's convolution, we would basically slide the filter across this input feature map and generate all the other output feature map activations. And so this kind of 2D convolution is pretty standard in image processing; we've been doing it for decades.
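As a concrete sketch of that sliding-window operation, here is a minimal, deliberately naive 2D convolution in NumPy (stride 1, no padding); it is only meant to show the element-wise multiply-and-sum behind each output activation, not to be an efficient implementation.

```python
import numpy as np

def conv2d_single_channel(ifmap, filt):
    """Slide a 2D filter over a 2D input feature map (stride 1, no padding).

    ifmap: (H, W) input activations / pixels
    filt:  (R, S) filter weights
    returns: (H - R + 1, W - S + 1) output activations
    """
    H, W = ifmap.shape
    R, S = filt.shape
    ofmap = np.zeros((H - R + 1, W - S + 1))
    for y in range(H - R + 1):
        for x in range(W - S + 1):
            # Element-wise multiply the filter with the window, then sum:
            # one multiply-and-accumulate per weight.
            ofmap[y, x] = np.sum(ifmap[y:y + R, x:x + S] * filt)
    return ofmap

ifmap = np.arange(36, dtype=float).reshape(6, 6)
filt = np.ones((3, 3)) / 9.0          # a simple averaging filter
print(conv2d_single_channel(ifmap, filt).shape)  # (4, 4)
```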

What makes convolutional neural nets much more challenging is the increase in dimensionality. So first of all, rather than doing just this 2D convolution, we often stack multiple channels. So there's this third dimension called channels. And then what we're doing here is that we need to do a 2D convolution on each of the channels and then add it all together, right?

And you can think of these channels for an image, these channels would be kind of the red, green, and blue components, for example. And as you get deeper into the feature map, the number of channels could potentially increase. So if you look at AlexNet, which is a popular neural net, the number of channels ranges from three to 192.

Okay, so that already increases the dimensionality, one dimension of the neural net itself in terms of processing. Another dimension that we increase is that we actually apply multiple filters to the same input feature map. Okay, so for example, you might apply M filters to the same input feature map, and then you would generate an output feature map of M channels, right?

So in the previous slide, we showed that convolving with this 3D filter generates one output channel in the output feature map. If we apply M filters, we're gonna generate M output channels in the output feature map. And again, just to give you an idea in terms of the scale of this, when you talk about things like AlexNet, we're talking about between 96 to 384 filters.

And of course, this is increasing to thousands for other advanced or more modern neural nets itself. And then finally, often you wanna process more than one image at a given time, right? So if you wanna actually do that, we can actually extend it. So N input images become N output images, or N input feature maps becomes N output feature maps.

And we typically refer to this as the batch size, like the number of images you're processing at the same time, and this can range from one to 256. Okay, so these are all the various different dimensions of the neural net. And so really, what someone does when they're trying to define what we call the network architecture of the neural net is select or define the shape of the neural network for each of the different layers.

So it's gonna define all these different dimensions of the neural net itself, and these shapes can vary across the different layers. Just to give you an idea, if you look at MobileNet as an example, which is a very popular neural network, you can see that the filter sizes, so the height and width of the filters, and the number of filters and number of channels will vary across the different blocks or layers.
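Putting the dimensions just described together (batch N, input channels C, filters/output channels M, feature-map size H x W, filter size R x S), a convolutional layer can be sketched as the following naive loop nest; the NumPy code and example sizes are illustrative only, and real implementations are far more optimized.

```python
import numpy as np

def conv_layer(ifmaps, filters):
    """Naive CONV layer: N images, C input channels, M filters.

    ifmaps:  (N, C, H, W)           input feature maps
    filters: (M, C, R, S)           M 3D filters, each spanning all C channels
    returns: (N, M, H-R+1, W-S+1)   output feature maps
    """
    N, C, H, W = ifmaps.shape
    M, _, R, S = filters.shape
    E, F = H - R + 1, W - S + 1               # output feature-map size
    ofmaps = np.zeros((N, M, E, F))
    for n in range(N):                        # batch
        for m in range(M):                    # output channels (filters)
            for e in range(E):                # output rows
                for f in range(F):            # output columns
                    # 2D convolution on each input channel, summed across channels.
                    window = ifmaps[n, :, e:e + R, f:f + S]
                    ofmaps[n, m, e, f] = np.sum(window * filters[m])
    return ofmaps

out = conv_layer(np.ones((2, 3, 8, 8)), np.ones((4, 3, 3, 3)))
print(out.shape)  # (2, 4, 6, 6)
```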

The other thing I just wanna mention is that when we look towards popular DNN models, we can also see important trends. So shown here are the various different models that have been developed over the years that are quite popular. A couple of interesting trends, one is that the networks tend to become deeper, so you can see in the convolutional layers they're getting deeper and deeper.

And then also the number of weights that they're using and the number of MACs are also increasing as well. So this is an important trend, the DNN models are getting larger and deeper, and so again, they're becoming much more computationally demanding. And so we need more sophisticated hardware to be able to process them.

All right, so that's kind of a quick intro or overview into the deep neural network space, I hope we're all aligned. So the first thing I'm gonna talk about is how can we actually build hardware to make the processing of these neural networks more efficient and to run faster.

And often we refer to this as hardware acceleration. All right, so we know these neural networks are very large, there's a lot of compute, but are there types of properties that we can leverage to make computing or processing of these networks more efficient? So the first thing that's really friendly is that they actually exhibit a lot of parallelism.

So all these multiplies and accumulates, you can actually do them all in parallel. Right, so that's great. So what that means is high throughput or high speed is actually possible 'cause I can do a lot of these processing in parallel. What is difficult and what should not be a surprise to you now is that the memory access is the bottleneck.

So delivering the data to the multiply and accumulate engine is what's really challenging. So I'll give you an insight as to why this is the case. So let's take, say we take this multiply and accumulate engine, what we call a MAC. It takes in three inputs for every MAC, so you have the filter weight, you have the input image pixel, or if you're deeper in the network, it would be input feature MAC activation, and it also takes the partial sum, which is like the partially accumulated value from the previous multiply that it did, and then it would generate an updated partial sum.

So for every computation that you do, for every MAC that you're doing, you need to have four memory accesses. So it's a four to one ratio in terms of memory accesses versus compute. The other challenge that you have is, as we mentioned, moving data is gonna be very expensive.

So in the absolute worst case, and you would always try to avoid this, if you read the data from DRAM, it's off-chip memory, every time you access data from DRAM, it's gonna be two orders of magnitude more expensive than the computation of performing a MAC itself. Okay, so that's really, really bad.

So if you can imagine, again, if we look at AlexNet, which has 700 million MACs, we're talking about three billion DRAM accesses to do that computation. Okay, but again, all is not lost. There are some things that we can exploit to help us along with this problem. So one is what we call input data reuse opportunities, which means that a lot of the data that we're reading to perform these multiplies and accumulates is actually used by many multiplies and accumulates.

So if we read the data once, we can reuse it multiple times for many operations, right? So I'll show you some examples of that. First is what we call convolutional reuse. So again, if you remember, we're taking a filter and we're sliding it across this input image. And so as a result, the activations from the feature map and the weights from the filter are gonna be reused in different combinations to compute the different multiply and accumulate values or different MACs itself.

So there's a lot of what we call convolutional reuse opportunities there. Another example is that we're actually, if you recall, gonna apply multiple filters on the same input feature map. So that means that each activation in that input feature map can be reused multiple times across the different filters.

Finally, if we're gonna process many images at the same time, or many feature maps, a given weight in the filter itself can be reused multiple times across these input feature maps. So that's what we call filter reuse. Okay, so there are a lot of these great data reuse opportunities in the neural network itself.

And so what can we do to exploit these reuse opportunities? Well, what we can do is we can build what we call a memory hierarchy that contains very low cost memories that allow us to reduce the overall cost of moving this data. So what do we mean here? We mean that if I build a multiply and accumulate engine, I'm gonna have a very small memory right beside the multiply and accumulate engine.

And by small, I mean something on the order of under a kilobyte of memory locally beside that multiply and accumulate engine. Why do I want that? Because accessing that very small memory can be very cheap. So for example, if performing a multiply and accumulate in the ALU costs 1X in energy, reading from this very small memory beside the multiply and accumulate engine is also gonna cost about the same, 1X.

I could also allow these processing elements, where a processing element is gonna be this multiply and accumulate engine plus the small memory, to share data with each other, okay? And so reading from a neighboring processing element is gonna be 2X the energy. And then finally, you can have a shared larger memory called a global buffer.

And that's gonna be able to be shared across all the different processing elements. This tends to be larger between 100 and 500 Kbytes. And that's gonna be more expensive, about 6X the energy itself. And of course, if you go off chip to DRAM, that's gonna be the most expensive at 200X the energy.
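As a rough illustration, those relative costs (roughly 1X for the local register file, 2X for a neighboring PE, 6X for the global buffer, 200X for DRAM, all normalized to one MAC) can be plugged into a simple back-of-the-envelope energy model. The access counts below are hypothetical, only meant to show how reuse shifts accesses onto the cheap levels of the hierarchy.

```python
# Normalized energy cost per access, relative to one MAC (values from the talk).
COST = {
    "mac": 1.0,            # the multiply-and-accumulate itself
    "local_rf": 1.0,       # small (<1 kB) register file next to the MAC
    "neighbor_pe": 2.0,    # fetching from a neighboring processing element
    "global_buffer": 6.0,  # shared 100-500 kB on-chip buffer
    "dram": 200.0,         # off-chip DRAM
}

def total_energy(access_counts):
    """Sum normalized energy over all compute and data-movement events."""
    return sum(COST[kind] * count for kind, count in access_counts.items())

# Hypothetical access counts for the same workload under two mappings.
naive = {"mac": 1e6, "dram": 4e6}                      # every operand from DRAM
with_reuse = {"mac": 1e6, "local_rf": 3.5e6,
              "global_buffer": 4e5, "dram": 1e5}       # most operands reused locally
print(f"naive:      {total_energy(naive):.3g}")
print(f"with reuse: {total_energy(with_reuse):.3g}")
```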

Right, and so the way that you can think about this is that what you would ideally like to do is to access all of the data from this very small local memory. But the challenge here is that this very small local memory is only one kilobyte.

We're talking about neural networks that are millions of weights in terms of size, right? So how do we go about doing that? So there's many challenges of doing that. Just as an analogy for you guys to kind of think through how this is related, you can imagine that accessing something from like, let's say your backpack is gonna be much cheaper than accessing something from your neighbor, or going back to, let's say, your office here, somewhere on campus to get the data versus going back all the way home, right?

So ideally, you'd like to access all of your data from your backpack, but if you have a lot of work to do, you might not be able to fill it in your backpack. So the question is, how can I break up my large piece of work into smaller chunks so that I can access them all from this small memory itself?

And that's the big challenge that you have. And so there's been a lot of research in this area in terms of what's the best way to break up the data and what should I store in this very small local memory? So one approach is what we call weight stationary.

And the idea here is I'm gonna store the weight information of the neural net into this small local memory, okay? And so as a result, I really minimize the weight energy. But the challenge here is that the other types of data that you have in your system, so for example, your input activations shown in the blue, and then the partial sums that are shown in the red, now those still have to move through the rest of the system itself, so through the network and from the global buffer, okay?

Typical works that use this type of data flow, the weight stationary data flow, which is what we call it 'cause the weight remains stationary, are things like the TPU from Google and the NVDLA accelerator from NVIDIA. Another approach that people take is they say, "Well, the weight, I only ever have to read it.

"But the partial sums, I have to read it and write it "'cause the partial sum I'm gonna read, "accumulate, like update it, "and then write it back to the memory." So there's two memory accesses for that partial sum data type. So what, maybe I should put that partial sum locally into that small memory itself.

So this is what we call output stationary, 'cause the accumulation of the output is gonna be local within that one processing element. That's not gonna move. The trade-off, of course, is that the activations and weights now have to move through the network. And then there are various different works, for example some work from KU Leuven and some work from the Chinese Academy of Sciences, that have been using this approach.

Another piece of work says, "Well, forget about the outputs and the weights themselves. Let's keep the input stationary within this small memory." And it's called input stationary. And some research work, again from NVIDIA, has examined this. But all of these different types of work really focus on not moving one particular type of data.

They either focus on minimizing weight energy, or partial sum energy, or input energy. I think what's important to think about is that maybe you wanna reduce the data movement of all different data types, all types of energy. So another approach, and this is something that we've developed within our own group, is looking at what we call the row stationary data flow.

And within each of the processing elements, you're gonna do one row of convolution. And this row is a mixture of all the different data types. So you have filter information, so the weights of the filter. You have the activations of your input feature map. And then you also have your partial sum information.

So you're really trying to balance the data movement of all the different data types, not just one particular data type. This is just performing one row, but we just talked about the fact that the neural network is much more than a 1D convolution. So you can imagine expanding this to higher dimensions.

So this is just showing how you might expand this 1D convolution into a 2D convolution. And then there's other higher dimensionality that you can map onto this architecture as well. I won't get into the details of this, but the key takeaway here is that you might not wanna focus on one particular data type.

You wanna actually optimize for all the different types of data that you're moving around in your system. Okay? And this can just show you some results in terms of how these different data types, or these different types of data flows would work. So for example, in the weight stationary case, as expected, the weight energy, the energy required to move the weights, shown in green, is gonna be the lowest.

But then the red portion, which is the energy of the partial sums, and the blue part, which is the input feature map or input pixels, are gonna be very high. Output stationary is another approach; as we talked about, you're trying to reduce the data movement of the partial sums, shown here in red.

So the red part is really minimized, but you can see that the green part, which is the weight data movement, is gonna be increased, and the blue, the inputs, is gonna be increased. There's another approach called no local reuse; we don't have time to talk about that, but you can see that row stationary, for example, really aims to balance the data movement of all the different data types.

Right, so the big takeaway here is that, you know, when you're trying to optimize, you know, a given piece of hardware, you don't wanna just optimize one, you know, for one particular type of data, you wanna optimize overall for all the movement in the hardware itself. Okay, another thing that you can also exploit to save a bit of power, is the fact that, you know, some of the data could be zero.

So we know that anything multiplied by zero is gonna be zero, right? So if you know that one of the inputs to your multiply and accumulate is gonna be zero, you might as well skip that multiplication. In fact, you might as well skip, you know, accessing data or accessing the other input to that multiply and accumulate engine.

So by doing that, you can actually reduce the power consumption by almost 50%. Another thing that you can do, is that if you have a bunch of zeros, you can also compress the data. For example, you can use things like run length encoding, which where basically a run of zeros is gonna be represented rather than, you know, zero, zero, zero, zero, zero, you can just say I have a run of five zeros.

And this can actually reduce the amount of data movement by up to 2X in your system itself. And in fact, in, you know, neural nets, there are a lot of ways of actually generating zeros. First of all, if you remember, the ReLU sets negative values to zero, so it naturally generates zeros.
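As a rough sketch of these two ideas, the snippet below skips the multiply when an activation is zero and run-length encodes runs of zeros; the encoding format here is just one simple possibility, not the exact compression scheme used on any particular chip.

```python
import numpy as np

def sparse_dot(activations, weights):
    """Weighted sum that skips the multiply (and the weight read) when
    the activation is zero, e.g. zeros produced by ReLU or pruning."""
    total = 0.0
    for a, w in zip(activations, weights):
        if a != 0.0:          # anything times zero is zero, so skip it
            total += a * w
    return total

def run_length_encode(values):
    """Encode runs of zeros as (0, run_length); nonzeros pass through as (v, 1)."""
    encoded, zero_run = [], 0
    for v in values:
        if v == 0.0:
            zero_run += 1
        else:
            if zero_run:
                encoded.append((0.0, zero_run))
                zero_run = 0
            encoded.append((v, 1))
    if zero_run:
        encoded.append((0.0, zero_run))
    return encoded

acts = np.maximum(np.array([-1.0, 2.0, -3.0, 0.0, 4.0, -5.0, -6.0, 7.0]), 0.0)
print(acts)                               # ReLU has created several zeros
print(run_length_encode(acts))            # fewer symbols to move around
print(sparse_dot(acts, np.linspace(0.1, 0.8, 8)))  # only 3 of 8 MACs actually run
```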

And then there are other techniques, for example what we call pruning, which is setting some of the weights of the neural net to zero as well. And so this can exploit all of that. Okay, so, you know, what is the impact of all these types of things? So we actually looked at building hardware, in particular a customized chip that we called Eyeriss, to demonstrate these particular approaches, in particular the row stationary data flow and exploiting sparsity in the activation data.

So this Eyeriss chip has 14 by 12, so 168, processing elements. You can see that there's a shared buffer that's 100 kilobytes, and it has some compression and decompression before it goes to off-chip DRAM. And again, that's because accessing DRAM is the most expensive. Shown here on the right-hand side is a die photo of the fabricated chip itself, right?

And this is four millimeters by four millimeters in terms of size. And so using that, you know, row stationary data flow, it exploits a lot of data reuse. So it actually reduces the number of times we access this global buffer by 100x. And it also reduces the amount of times we access the off-chip memory by over 1000x.

This is all because, you know, each of these processing elements has, you know, a local memory that is trying to read most of its data from, and it's also sharing with other processing elements. So overall, when you compare it to a mobile GPU, you're talking about an order of magnitude reduction in energy consumption.

If you'd like to learn a little bit more about that, I invite you to visit the Eyeriss project website. Okay, so this is great. We can build custom hardware, but what does this actually mean in terms of, you know, building a system that can efficiently compute neural nets? So let's say we take a step back.

Let's say we don't care anything about the hardware, and we're, you know, a systems provider. We want to build, you know, an overall system. And what we really care about is the trade-off between energy and accuracy, right? That's the key thing that we care about. So shown here is a plot, and let's say this is for an object detection task, right?

So accuracy is on the x-axis, and it's listed in terms of average precision, which is a metric that we use for object detection. It's on a linear scale, and higher, the better. Vertically, we have energy consumption. This is the energy that's being consumed per pixel. So you kind of average it.

You can imagine a higher-resolution image consuming more energy. It's going to be on a logarithmic scale. So let's first start on the accuracy axis. And so if you think back to before neural nets, you know, had their resurgence around 2011, 2012, state-of-the-art approaches actually used features called histograms of oriented gradients, right?

This was a very popular approach that was quite accurate for object detection, and we refer to it as HOG. The reason why neural nets really took off is 'cause they really improved the accuracy. So you can imagine AlexNet here almost doubled the accuracy, and then VGG further increased the accuracy.

So it's super exciting there. But then we want to look also at the vertical axis, which is the energy consumption. And I should mention, you know, you'll see these dots showing the energy consumption for each of these different approaches. These energy numbers are actually measured on specialized hardware that's already been designed for that particular task.

So we have a chip here that's built in a 65-nanometer CMOS process, so using transistors of around the same size, that does object detection using the HOG features. And then here's the Eyeriss chip that we just talked about. I should also note that both of these chips were built in my group.

The students who built these chips, you know, started designing the chips at the same time and taped out at the same time. So it's somewhat of a controlled experiment in terms of optimization. Okay, so what does this tell us when we look on the energy axis? We can see that histogram of oriented gradients, or HOG features, are actually very efficient from an energy point of view.

In fact, if we compare it to something like video compression, again, something that you all have in your phone, HOG features are actually more efficient than video compression, meaning for the same energy that you would spend compressing a pixel, you could actually understand that pixel. So that's pretty impressive.

But if we start looking at AlexNet or VGG, we can see that the energy increases by two to three orders of magnitude, which is quite significant. I'll give you an example. So if I told you on your cell phone, I'm gonna double the accuracy of its recognition, but your phone would die 300 times faster, who here would be interested in that technology?

Right, so exactly, so nobody, right? So in the sense that battery life is so critical to how we actually use these types of technologies. So we should not just look at the accuracy, which is the x-axis point of view, we should really also consider the energy consumption, and we really don't want the energy to be so high.

And we can see that even with specialized hardware, we're still quite far away from making neural nets as efficient as something like video compression that you all have on your phones. So we really have to think of how we can further push the energy consumption down without sacrificing accuracy, of course.

So actually, there's been a huge amount of research in this space, because we know neural nets are popular, and we know that they have a wide range of applications, but energy's really a big challenge. So people have looked at how can we design new hardware that can be more efficient, or how can we design algorithms that are more efficient to enable energy-efficient processing of DNNs.

And so in fact, within our own research group, we spend quite a bit of time kind of surveying the area and understanding what are the various different types of developments that people have been looking at. So if you're interested in this topic, we actually generated various tutorials on this material, as well as overview papers.

This is an overview paper that's about 30 pages and we're currently expanding it into a book. So if you're interested in this topic, I would encourage you to visit these resources. But the main thing that we learned about as we were doing this kind of survey of the area, is that we actually identified various limitations in terms of how people are approaching or how the research is approaching this problem.

So first let's look on the algorithm side. So again, there's a wide range of approaches that people are using to try and make the DNN algorithms or models more efficient. So for example, we've kind of mentioned the idea of pruning. The idea here is you're gonna set some of the weights to become zero, and again, anything times zero is zero, so you can skip those operations.

So there's a wide range of research there. There's also work looking at efficient network architectures, meaning rather than making my neural networks very large, with these big three-dimensional convolutions, can I decompose them into smaller filters? So rather than this 3D filter, can I make it a 2D filter plus a one-by-one filter that goes into the screen, along the channel dimension?

Another very popular thing is reduced precision. So rather than using the default of 32-bit float, can I reduce the number of bits down to eight bits or even binary? We saw before that as we reduce the precision of these operations, you also get energy savings, and you also reduce data movement as well 'cause you have to move less data.
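As a minimal illustration of the reduced-precision idea, here is a sketch of symmetric 8-bit quantization of a weight tensor. Real deployments use more careful schemes (per-channel scales, calibration, quantization-aware training), so treat this only as the basic concept.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric linear quantization of a float tensor to int8."""
    scale = np.max(np.abs(weights)) / 127.0          # map the largest weight to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((64, 128)).astype(np.float32)
q, scale = quantize_int8(w)
err = np.max(np.abs(w - dequantize(q, scale)))
print(q.dtype, f"max abs error = {err:.4f}")
# The int8 tensor is 4x smaller than float32, so there is 4x less data to move,
# and 8-bit multiplies cost far less energy than 32-bit floating-point ones.
```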

A lot of this work really focuses on reducing the number of MACs and the number of weights, and those are primarily because those are easy to count. But the question that we should be asking if we care about the system is does this actually translate into energy savings and reduced latency?

Because from a system's point of view, those are the things that we care about. When you're thinking about something running on your phone, you don't really care about the number of MACs and weights; you care about how much energy it's consuming, 'cause that's gonna affect the battery life, or how quickly it might react.

That's basically a measure of latency. And again, hopefully you haven't forgotten, but basically data movement is expensive. It really depends on how you move the data through the system. So the key takeaway from this slide is that if you remember where the energy comes from, which is the data movement, it's not because of how many weights or how many MACs you have, but really it depends on where the weight comes from.

If it comes from this small memory register file that's nearby, it's gonna be super cheap as opposed to coming from off-chip DRAM. So all weights are basically not created equal, all MACs are not created equal. It really depends on the memory hierarchy and the data flow of the hardware itself.

So we can't just look at the number of weights and the number of MACs and estimate how much energy is gonna be consumed. So this is quite a difficult challenge. So within our group, we've actually looked at developing different tools that allow us to estimate the energy consumption of the neural network itself.

So for example, in this particular tool, which is available on this website, we basically take in the DNN weights and the input data, including its sparsity. We know the different shapes of the different layers of the neural net, and we run an optimization that figures out the memory accesses, how much energy is consumed by the data movement, and the energy consumed by the multiply and accumulate computations, and then the output is gonna be a breakdown of the energy for the different layers of the neural network.
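A heavily simplified caricature of that kind of estimation flow is sketched below: given per-layer access counts for each data type and a per-access energy cost, it reports a per-layer, per-data-type energy breakdown. All the numbers here are made up for illustration; the actual tool models the memory hierarchy and the mapping in much more detail.

```python
# Energy per access/operation in arbitrary normalized units (illustrative only).
ENERGY = {"weights": 6.0, "input_fmap": 6.0, "output_fmap": 6.0, "mac": 1.0}

# Hypothetical access/op counts per layer, e.g. produced by a mapping optimizer.
layers = {
    "conv1": {"weights": 3e4, "input_fmap": 8e6, "output_fmap": 6e6, "mac": 1e8},
    "conv2": {"weights": 6e5, "input_fmap": 4e6, "output_fmap": 3e6, "mac": 2e8},
    "fc1":   {"weights": 4e7, "input_fmap": 1e4, "output_fmap": 4e3, "mac": 4e7},
}

def energy_breakdown(layers):
    """Return {layer: {data_type: energy, 'total': energy}}."""
    report = {}
    for name, counts in layers.items():
        per_type = {k: counts[k] * ENERGY[k] for k in counts}
        per_type["total"] = sum(per_type.values())
        report[name] = per_type
    return report

for name, e in energy_breakdown(layers).items():
    share = {k: f"{v / e['total']:.0%}" for k, v in e.items() if k != "total"}
    print(name, share)   # e.g. the fully-connected layer is dominated by weights
```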

And once you have this, you can kind of figure out, well, where is the energy going so I can target my design to minimize that energy consumption? Okay, and so by doing this, when we take a look, it should be no surprise, one of the key observations for this exercise is that the weights alone are not a good metric for energy consumption.

If you take a look at GoogLeNet, for example, running on kind of the Eyeriss architecture, you can see that the weights only account for 22% of the overall energy. In fact, a lot of the energy goes into moving the input feature maps and the output feature maps as well, right?

And also computation. So in general, this is the same message as before. We shouldn't just look at the data movement in one particular data type. We should look at the energy consumption of all the different data types to give us an overall view of where the energy's actually going.

Okay, and so once we actually know where the energy is going, how can we factor that into the design of the neural networks to make them more efficient? So we talked about the concept of pruning, right? So again, pruning was setting some of the weights of the neural net to zero, or you can think of it as removing some of the weights.

And so what we wanna do here is that now we know that we know where the energy is going, why don't we incorporate the energy into the design of the algorithm, for example, to guide us to figure out where we should actually remove the weights from? You know, so for example, let's say here, this is on AlexNet for the same accuracy across the different approaches.

Traditionally, what happens is that people tend to remove the weights that are small. And we call this magnitude-based pruning, and you can see that you get about a 2x reduction in terms of energy consumption. However, we know that the value of the weight has nothing to do with its energy consumption.

Ideally, what you'd like to do is remove the weights that consume the most energy, right? In particular, we also know that the more weights that we remove, the accuracy is gonna go down. So to get the biggest bang for your buck, you wanna remove the weights that consume the most energy first.

One way you can do this is you can take your neural network and figure out the energy consumption of each of the layers of the neural network. You can then sort the layers from high energy to low energy, and then you prune the high energy layers first.

So this is what we call energy-aware pruning. And by doing this, you actually now get a 3.7x reduction in energy consumption, compared to 2x, for the same accuracy. And again, this is because we factor energy consumption into the design of the neural network itself. All right, and the pruned models are all available on the Eyeriss website.
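A toy version of that layer-ordering step might look like the sketch below, assuming we already have per-layer energy estimates like the ones above; the real energy-aware pruning procedure also fine-tunes and re-checks accuracy after each pruning step, which is omitted here.

```python
import numpy as np

def energy_aware_pruning_order(layer_energy):
    """Sort layers from highest to lowest estimated energy, so the most
    expensive layers are pruned first (biggest energy savings per removed weight)."""
    return sorted(layer_energy, key=layer_energy.get, reverse=True)

def magnitude_prune(weights, fraction):
    """Zero out the smallest-magnitude `fraction` of weights in one layer."""
    flat = np.abs(weights).ravel()
    k = int(fraction * flat.size)
    if k == 0:
        return weights
    threshold = np.partition(flat, k - 1)[k - 1]
    return np.where(np.abs(weights) <= threshold, 0.0, weights)

layer_energy = {"conv1": 1.8e8, "conv2": 2.5e8, "fc1": 2.8e8}   # hypothetical estimates
print("prune in this order:", energy_aware_pruning_order(layer_energy))

w_fc1 = np.random.default_rng(0).standard_normal((1000, 1000))
w_fc1_pruned = magnitude_prune(w_fc1, fraction=0.5)
print(f"nonzero weights left: {np.count_nonzero(w_fc1_pruned) / w_fc1.size:.0%}")
# In the full method you would prune a layer, fine-tune, check accuracy,
# and then move on to the next layer in this energy order.
```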

Another important thing that we care about from a performance point of view is latency, right? So for example, latency has to do with how long it takes when I give it an image, how long will I get the result back? People are very sensitive to latency. But the challenge here is that latency, again, is not directly correlated to things like number of multiplies and accumulates.

And so this is some data that was released by Google's Mobile Vision team. They're showing here on the x-axis the number of multiplies and accumulates, so going towards the left, you're increasing. And then on the y-axis, this is the latency. So this is actually the measured latency, or the delay it takes to get a result.

And what they're showing here is that the number of MACs is not really a good approximation of latency. So in fact, for example, for neural networks that have the same number of MACs, there can be a 2x range or 2x swing in terms of latency. Or looking at it in a different way, for neural nets with the same latency, they can have a 3x swing in terms of the number of MACs.

So the key takeaway here is that you can't just count the number of MACs and say, oh, this is how quickly it's gonna run. It's actually much more challenging than that. And so what we want to ask is, is there a way that we can take latency and use that again to design the neural net directly?

So rather than looking at MACs, use latency. And so together with Google's Mobile Vision team, we developed this approach called NetAdapt. And this is really a way that you can tailor your particular neural network for a given mobile platform, for a latency or an energy budget. So it automatically adapts the neural net for that platform itself.

And really what's driving the design is empirical measurements. So measurements of how that particular network perform on that platform. So measurements for things like latency and energy. And the reason why we want to use empirical measurements is that you can't often generate models for all the different types of hardware out there.

In the case of Google, what they want is that, if they have a new phone, you can automatically tune the network for that particular phone. You don't want to have to model the phone as well. Okay, and so how does this work? I'll walk you through it. So you'll start off with a pre-trained network.

So this is a network that's, let's say, trained in the cloud for very high accuracy. Great, start off with that, but it tends to be very large, let's say. And so what you're gonna do is you're gonna take that into the NetAdapt algorithm. You're gonna take a budget. So a budget will tell you like, oh, I can afford only this type of latency or this amount of latency, this amount of energy.

What NetAdapt will do is gonna generate a bunch of proposals, so different options of how it might modify the network in terms of its dimensions. It's gonna measure these proposals on that target platform that you care about. And then based on these empirical measurements, NetAdapt is gonna then generate a new set of proposals.

And it'll just iterate across this until it gets an adapted network as an output. Okay, and again, all of this is on the NetAdapt website. Just to give you a quick example of how this might work. So let's say you start off with, as your input, a neural network that has the accuracy that you want, but the latency is 100 milliseconds, and you would like for it to be 80 milliseconds.

You want it to be faster. So what it's gonna do is it's gonna generate a bunch of proposals. And what the proposals could involve doing is taking one layer of the neural net and reducing the number of channels until it hits the latency budget of 80 milliseconds. And it can do that for all the different layers.

Then it's gonna tune these different layers and measure the accuracy. Right, so let's say, oh, this one where I just shortened the number of channels in layer one maintains accuracy at 60%. So that means I'm gonna pick that, and that's gonna be the output of this particular iteration.

So the output, at 80 milliseconds and hitting an accuracy of 60%, is gonna be the input to the next iteration. And then I'm gonna tighten the budget. Okay, again, if you're interested, I just invite you to go take a look at the NetAdapt paper.
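In rough pseudocode-style Python, that loop might look like the sketch below. The "network" here is just a list of per-layer channel counts, and the latency and accuracy functions are toy stand-ins for the empirical, on-device measurements and short-term fine-tuning that the real algorithm uses; it is only meant to show the structure of proposal generation, measurement, and selection.

```python
# A toy "network" is just a list of per-layer channel counts.
# These helpers are made-up stand-ins, NOT real measurements or models.
def measure_latency(net):
    return 0.1 * sum(net)                 # pretend latency grows with channel count

def accuracy(net):
    return sum(c ** 0.5 for c in net)     # pretend accuracy grows (slowly) with width

def simplify_layer(net, i, step=8):
    out = list(net)
    out[i] = max(8, out[i] - step)        # shrink the number of channels in layer i
    return out

def netadapt(net, target_latency, budget_step):
    budget = measure_latency(net)
    while budget > target_latency:
        budget -= budget_step             # tighten the budget each iteration
        proposals = []
        for i in range(len(net)):
            cand = list(net)
            # Shrink this one layer until the measured latency meets the budget.
            while measure_latency(cand) > budget and cand[i] > 8:
                cand = simplify_layer(cand, i)
            proposals.append(cand)        # (short-term fine-tuning omitted here)
        net = max(proposals, key=accuracy)  # keep the most accurate proposal
    return net                              # long-term fine-tune afterwards

net = [64, 128, 256, 256]                   # hypothetical pre-trained network widths
print(netadapt(net, target_latency=56.0, budget_step=4.0))
```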

So what is the impact of this particular approach? Well, it actually gives you a very much improved trade-off between latency and accuracy, right? So if you look at this plot, on the x-axis is the latency, so to the left is better, meaning lower latency. And then on the y-axis is the accuracy, so higher is better.

So here, higher and to the left is good. And so we have, first, shown in blue and green, various kind of handcrafted neural network-based approaches. And you can see NetAdapt, which generates the red dots as it's iterating through its optimization. And you can see that, for the same accuracy, it can be up to 1.7x faster than a manually designed approach.

This approach also falls under the umbrella of network architecture search; it's kind of in that flavor. But in general, the takeaway here is that if you're gonna design efficient neural networks that you want to run quickly or be energy efficient, you should really put hardware into the design loop and take, you know, accurate energy or latency measurements into the design of the neural network itself.

This particular, you know, example here is shown for an image classification task, meaning I give you an image and you can classify it, right? You can say what's in the image itself. You can imagine that that type of approach is kind of like reducing information, right? From a 2D image, you reduce it down to a label.

This is very commonly used. But we actually want to see if we can still apply this approach to a more difficult task, something like depth estimation. In this case, you know, I give you a 2D image and the output is also a 2D image, where each output pixel is basically showing the depth of the corresponding pixel at the input.

This is often what we'd refer to as, you know, monocular depth estimation. So I give you just a 2D image input and you can estimate the depth itself. The reason why you want to do this is, you know, 2D cameras, regular cameras, are pretty cheap, right? So it'd be ideal to be able to do this.

You can imagine the way that we would do this is to use an autoencoder. So the front half of the neural net still looks like what we call an encoder; it's a reduction element. So this is very similar to what you would do for classification, but then the back end of the autoencoder is a decoder.

So it's going to expand the information back out, right? And so, as I mentioned, again, this is going to be much more difficult than just classification because now my output has to be also very dense as well. And so we want to see if we could make this really fast with approaches that we just talked about, for example, NetAdapt.

So indeed you can make it pretty fast. So if you apply NetAdapt plus the, you know, compact network design and then do some depth-wise decomposition, you can actually increase the frame rate by an order of magnitude. So again, here I'm going to show the plot. On the x-axis, here is the frame rate on a Jetson TX2 GPU.

This is measured with a batch size of one with 32-bit float. And on the vertical axis, it's the accuracy, the depth estimation in terms of the delta one metric, which means the percentage of pixels that are within 25% of the correct depth. So higher, the better. And so you can see, you know, the various different approaches out there.
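The delta-1 metric just mentioned can be computed roughly as in the sketch below (the fraction of pixels where the ratio between predicted and true depth is under 1.25, i.e. within 25%); check the FastDepth paper for the exact evaluation protocol, as the data here is synthetic.

```python
import numpy as np

def delta1(pred_depth, true_depth):
    """Fraction of pixels whose predicted depth is within 25% of the true depth,
    i.e. max(pred/true, true/pred) < 1.25."""
    ratio = np.maximum(pred_depth / true_depth, true_depth / pred_depth)
    return np.mean(ratio < 1.25)

true = np.full((480, 640), 2.0)                    # a flat scene 2 m away (toy example)
pred = true * (1.0 + 0.1 * np.random.default_rng(0).standard_normal(true.shape))
print(f"delta1 = {delta1(pred, true):.2%}")
```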

This red star is the approach, FastDepth, using all the different efficient network design techniques that we talked about. And you can see you can get an order of magnitude, over a 10x, speedup while maintaining accuracy. And the models and all the code to do this are available on the FastDepth website.

We presented this at ICRA, which is a robotics conference in the middle of last year. And we wanted to show some live footage there. So at ICRA, we actually captured some footage on an iPhone and showed the real-time depth estimation on an iPhone itself. And you can achieve about 40 frames per second on an iPhone using FastDepth.

So again, if you're interested in this particular type of application or efficient networks for depth estimation, I invite you to visit the website for that. OK, so that's the algorithmic side of things. But let's return to the hardware, building specialized hardware that are efficient for neural network processing. So again, we saw that there's many different ways of making the neural network efficient, from network pruning to efficient network architectures to reduce precision.

The challenge for the hardware designer, though, is that there's no guarantee as to which type of approach someone might apply to the algorithm that they're going to run on the hardware. So if you only own the hardware, you don't know what kind of algorithm someone's going to run on your hardware unless you own the whole stack.

So as a result, you really need to have flexible hardware so it can support all of these different approaches and translate them into improvements in energy efficiency and latency. Now, the challenge is that a lot of the specialized DNN hardware that exists out there often relies on certain properties of the DNN in order to achieve high efficiency.

So a very typical structure that you might see is an array of multiply-and-accumulate units, a MAC array. And it's going to reduce memory accesses by amortizing reads across the array. What do I mean by that? If I read a weight from the weight memory once, I'm going to reuse it multiple times across the array.

Send it across the array, so one read, and it can be used multiple times by multiple engines, multiple MACs. Similarly, for the activation memory, I'm going to read an input activation once and reuse it multiple times. The issue here is that the amount of reuse and the array utilization depend on the number of channels you have in your neural net, the size of the feature map, and the batch size.

So this is, again, just showing two different variations: you get reuse based on the number of filters, the number of input channels, the feature map size, and the batch size. And the problem now is that when we start looking at these efficient neural network models, they're not going to have as much reuse, particularly in the compact cases.
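
As a rough illustration of where that reuse comes from, here is a back-of-the-envelope count for a standard convolutional layer. It ignores tiling and the extra reuse from overlapping filter windows, so treat it as an idealized sketch rather than what any particular accelerator achieves.

```python
def conv_reuse(num_filters, out_h, out_w, batch):
    """Idealized reuse factors for a standard convolutional layer (ignores tiling
    and the extra reuse from overlapping filter windows)."""
    weight_reuse = out_h * out_w * batch   # one weight read feeds every output pixel and image
    activation_reuse = num_filters         # one activation read feeds every filter
    return weight_reuse, activation_reuse

# A mid-network layer: 64 filters, 56x56 output, batch size 1.
print(conv_reuse(num_filters=64, out_h=56, out_w=56, batch=1))  # (3136, 64)
```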

So for example, a very typical approach is to use what we call depth-wise layers. We saw you took that 3D filter and then decomposed it into a 2D filter and a one-by-one. And so as a result, you only have one channel. So you're not going to have much reuse across the input channel.

And so rather than filling this array with a lot of computation that you can process, you're only going to be able to utilize a very small subset of the array itself, which I've highlighted here in green, for computation. So even though you put down 1,000 or 10,000 multiply-and-accumulate engines, only a very small subset of them can actually do work.
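
To see how badly utilization can drop, here is a toy model that assumes one common spatial mapping, with input channels across the rows of the array and filters across the columns. That mapping is an assumption made for illustration, not the specific dataflow of any particular chip.

```python
def array_utilization(rows, cols, num_input_channels, num_filters):
    """Toy model: map input channels to array rows and filters to columns.
    Utilization is the fraction of processing elements that get real work."""
    used = min(rows, num_input_channels) * min(cols, num_filters)
    return used / (rows * cols)

# A 16x16 array: a standard layer with 64 channels and 64 filters fills it,
# while a depth-wise layer (1 input channel per filter group) leaves most PEs idle.
print(array_utilization(16, 16, num_input_channels=64, num_filters=64))  # 1.0
print(array_utilization(16, 16, num_input_channels=1,  num_filters=64))  # ~0.06
```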

And that's not great. This is also an issue because as I scale up the array size, it's going to become less efficient. Ideally, what you would like is that if I put more cores or processing elements down, the system should run faster, right? I'm paying for more cores.

But it doesn't, because the data can't reach or be reused by all of these different cores, and it's also going to be difficult to exploit sparsity. So what you need here are two things. One is a very flexible dataflow, meaning that there are many different ways for the data to move through this array, right?

And so you can imagine row stationary is a very flexible way that we can map the neural network onto the array itself. You can see here, in the Eyeriss or row-stationary case, that a lot of the processing elements can be used. The other thing is, how do you actually deliver the data for this varying degree of reuse?

So here's like the spectrum of on-chip networks in terms of basically how can I deliver data from that global buffer to all those parallel processing engines, right? One use case is when I use these huge neural nets that have a lot of reuse. What I want to do is multicast, meaning I read once from the global buffer, and then I reuse that data multiple times in all of my processing elements.

You can think of that as like broadcasting information out. And a type of network that you would do for that is shown here on the right-hand side. So this is low bandwidth, so I'm only reading very little data, but high spatial reuse. Many, many engines are using it. On the other extreme, when I design these very efficient neural networks, I'm not going to have very much reuse.

And so what I want is unicast, meaning I want to send out unique information to each of the processing elements so that they can all work. So that's going to be, as shown here on the left-hand side, a case where you have very high bandwidth, a lot of unique information going out, and low spatial reuse.

You're not sharing data. Now, it's very challenging to cover this entire spectrum. One solution would be what we call an all-to-all network that satisfies all of this, where every input is connected to every output. But that's going to be very expensive and not scalable. The solution that we have for this is what we call a hierarchical mesh.

So you can break this problem into two steps. At the lowest level, you can use an all-to-all connection. And then at the higher level, you can use a mesh connection. And so the mesh will allow you to scale up. But the all-to-all allows you to achieve a lot of different types of reuse.
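
A toy way to see what the two-level organization buys you is to count how many reads have to leave the global buffer to hand one operand to every processing element, as a function of how widely each word can be shared. The 256-PE, 16-cluster arrangement below is just an assumed example, not the actual chip configuration.

```python
def reads_from_global_buffer(num_pes, reuse):
    """Number of unique words the global buffer must supply so every PE gets one operand,
    if each word can be multicast to `reuse` PEs (reuse = num_pes is a full broadcast,
    reuse = 1 is pure unicast)."""
    return -(-num_pes // reuse)  # ceiling division

# Assume 256 PEs organized as 16 clusters of 16: mesh between clusters, all-to-all inside.
print(reads_from_global_buffer(256, reuse=256))  # 1: broadcast, lots of reuse (big conv layers)
print(reads_from_global_buffer(256, reuse=16))   # 16: one multicast per cluster
print(reads_from_global_buffer(256, reuse=1))    # 256: unique data per PE (depth-wise layers)
```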

And with this type of network on chip, you can support a lot of different delivery mechanisms to get data from the global buffer to all the processing elements, so that all your cores, all your compute, can be working at the same time. And at its core, this is one of the key things that enables the second version of Eyeriss to be both flexible and efficient.

So these are some results from the second version of Eyeriss. It supports a wide range of layer shapes, both the very large ones as well as the very compact ones, including convolutional, fully connected, and depth-wise layers. You can see here in this plot that, depending on the shape, you can get up to an order of magnitude speedup.

It also supports a wide range of sparsities, both dense and sparse. So this is really important because some networks can be very sparse because you've done a lot of pruning. But some are not. And so you want to efficiently support all of those. You also want to be scalable.

So as you increase the number of processing elements, the throughput also speeds up. And as a result of this particular type of design, you get an order of magnitude improvement in both speed and energy efficiency. All right, so this is great. And this is one way that you can speed up and make neural networks more efficient.

But it's also important to take a step back and look beyond just building the specialized hardware, the accelerator itself, in terms of both the algorithms and the hardware. So can we look beyond the DNN accelerator for acceleration? One good place to show this as an example is the task of super resolution.

So how many of you are familiar with the task of super resolution? All right, so for those of you who aren't, the idea is as follows. So I want to basically generate a high-resolution image from a small-resolution image. And why do you want to do that? Well, there are a couple of reasons.

One is that it can allow you to basically reduce the transmit bandwidth. So for example, if you have limited communication, I'm going to send a low-res version of a video, let's say, or image to your phone. And then your phone can make it high-res. That's one way. Another reason is that screens in general are getting larger and larger.

So every year at CES, they announce a higher-resolution screen. But if you think about the movies that we watch, a lot of them are still at a fixed resolution like 1080p. So again, you want to generate a high-resolution representation of that low-resolution input. And the idea here is that the high-resolution output is not just interpolation, which can be very blurry; there are ways to kind of hallucinate a high-resolution version of the video or image itself.

And that's basically called super-resolution. But one of the challenges for super-resolution is that it's computationally very expensive. You can imagine that the state-of-the-art approaches for super-res use deep neural nets. A lot of the neural net examples we just talked about use input images of around 200 by 200 pixels.

Now imagine if you extend that to an HD image. It's going to be very, very expensive. So what we want to do is think of different ways that we can speed up the super-resolution process, not just by making DNNs faster, but kind of looking around the other components of the system and seeing if we can make it faster as well.

So one of the approaches we took is this framework called FAST, where we're looking at accelerating any super-resolution algorithm by an order of magnitude. And this is operated on a compressed video. So before I was a faculty here, I worked a lot on video compression. And if you think about the video compression community, they look at video very differently than people who process super-resolution.

So typically, when you're thinking about image processing or super-resolution, when I give you a compressed video, what you basically think of it is as a stack of pixels, a bunch of different images together. But if you asked a video compression person, what does a compressed video look like? Actually, a compressed video is a very structured representation of the redundancy in the video itself.

So why is it that we can compress videos? It's because consecutive frames look very similar. So the compressed video is telling you which pixel in frame 1 is related to, or looks like, which pixel in frame 2. And so, as a result, you don't have to send the pixels in frame 2.

And that's where you get the compression from. So actually, what a compressed video looks like is a description of the structure of the video itself. And so you can use this representation to accelerate super-resolution. So for example, rather than applying super-resolution to every single low-res frame, which is the typical approach-- so you would apply super-resolution to each low-res frame, and you would generate a bunch of high-res frame outputs-- what you can actually do is apply super-resolution to one of the small low-resolution frames.

And then you can use that free information you get in the compressed video, the information that tells you the structure of the video, to transfer that result and generate all the other high-resolution frames from it. So the network only needs to run on a subset of frames. And the complexity of reconstructing all those high-resolution frames once you have that structure information is going to be very low.
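
Here is a rough sketch of that transfer idea. It is not the released FAST code: the motion handling is collapsed to a single global motion vector per frame and the super-resolution step is just a stand-in callable, but it shows how the expensive network only runs on one frame out of every group.

```python
import numpy as np

def transfer_sr(low_res_frames, motion, super_resolve, group=4, scale=4):
    """Sketch of the FAST idea: run the costly super-resolution network on one frame
    out of every `group`, then reuse the compressed video's motion information to
    shift that high-res result onto the following frames (here simplified to one
    global motion vector per frame; the real scheme works block by block)."""
    outputs = []
    for i, frame in enumerate(low_res_frames):
        if i % group == 0:
            anchor = super_resolve(frame)                 # expensive DNN call
        else:
            dy, dx = motion[i]                            # motion from the bitstream
            anchor = np.roll(anchor, (dy * scale, dx * scale), axis=(0, 1))  # cheap transfer
        outputs.append(anchor)
    return outputs

# Toy usage: "super-resolve" by nearest-neighbor upsampling, 8 frames drifting right.
frames = [np.random.rand(32, 32) for _ in range(8)]
motion = [(0, 1)] * 8
ups = lambda f: f.repeat(4, axis=0).repeat(4, axis=1)
high_res = transfer_sr(frames, motion, ups)
```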

So for example, if I transfer to n frames, I'm going to get roughly an n-times speedup. To evaluate this, we showcase it on a range of videos. This range of videos is the data set that we use to develop video standards, so it's quite broad.

And you can see, first, on the left-hand side, that if I transfer to four different frames, you get a 4x acceleration. And the PSNR, which indicates the quality, doesn't change. So it's the same quality, but 4x faster. If I transfer to 16 frames, a 16x acceleration, there's a slight drop in quality.

But still, you get basically a 16x acceleration. So the key idea here is, again, you'd want to look beyond the processing of the neural network itself to around it to see if you can speed it up. Usually with PSNR, you can't really tell too much about the quality. So another way to look at it is actually look at the video itself or subjective quality.

So on the left-hand side here, this is if I applied super resolution on every single frame. So this is the traditional way of doing it. On the right-hand side here, this is if I just did interpolation on every single frame. And so where you can tell the difference is by looking at things like the text, you can see that the text is much sharper on the left video than the right video.

Now, FAST plus SRCNN, the approach using FAST, is shown in between. FAST actually has the same quality as the video on the left-hand side, but it's just as efficient in terms of processing speed as the approach on the right-hand side. So it kind of has the best of both worlds.

And so the key takeaway for this is that if you want to accelerate DNNs for a given process, it's good to look beyond the hardware for the acceleration. We can look at things like the structure of the data that's entering the neural network accelerator. There might be opportunities there.

For example, here, temporal correlation that allows you to further accelerate the processing. Again, if you're interested in this, all the code is on the website. So to end this lecture, I just want to talk about things that are actually beyond deep neural nets. I also-- I know neural nets are great.

They're useful for many applications. But I think there's a lot of exciting problems outside the space of neural nets as well, which also require efficient computing. So the first thing is what we call visual inertial localization or visual odometry. This is something that's widely used for robots to kind of figure out where they are in the real world.

So you can imagine for autonomous navigation, before you navigate the world, you have to know where you actually are in the world. So that's localization. This is also widely used for things like AR and VR as well, right, because you can know where you're actually looking in AR and VR.

What does this actually mean? It means that you can basically take in a sequence of images. So you can imagine like a camera that's mounted on the robot or the person, as well as an IMU. So it has accelerometer and gyroscope information. And then visual inertial odometry, which is a subset of SLAM, basically fuses this information together.

And the outcome of visual inertial odometry is the localization. So you can see here, you're trying to estimate where you are in 3D space, your pose, based on, in this case, the camera feed, but also on the IMU measurements. And if you're in an unknown environment, you can also generate a map of that environment.
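
To give a feel for the IMU side of this fusion, here is a toy 2D dead-reckoning sketch: integrating the gyro and accelerometer between camera frames gives a motion prediction, but it drifts, which is exactly why the visual measurements are fused in. The correction step and the factor graph are not shown; everything here is an illustrative assumption, not the actual pipeline.

```python
import numpy as np

def imu_dead_reckoning(accels, gyros, dt, pose0=(0.0, 0.0, 0.0), vel0=(0.0, 0.0)):
    """Toy 2D illustration of the IMU side of visual-inertial odometry: integrate
    the gyro (heading) and accelerometer (velocity, position) between camera frames.
    On its own this drifts quickly; the visual feature tracks are what a real
    factor-graph backend uses to correct it (not shown here)."""
    x, y, theta = pose0
    vx, vy = vel0
    for a, w in zip(accels, gyros):
        theta += w * dt                                  # gyro rate -> heading
        ax, ay = a * np.cos(theta), a * np.sin(theta)    # body accel -> world frame
        vx, vy = vx + ax * dt, vy + ay * dt              # accel -> velocity
        x, y = x + vx * dt, y + vy * dt                  # velocity -> position
    return (x, y, theta), (vx, vy)

pose, vel = imu_dead_reckoning(accels=[0.1] * 100, gyros=[0.01] * 100, dt=0.01)
```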

So this is a very key task in navigation. And the key question is, can you do it in an energy-efficient way? So we've looked at building specialized hardware to do localization. This is actually the first chip that performs complete visual inertial odometry on chip. We call it Navion.

This is done in collaboration with Sertac Karaman. So you can see here, here's the chip itself. It's 4 millimeters by 5 millimeters; you can see that it's smaller than a quarter. And you can imagine mounting it on a small robot. At the front end, it basically does the processing of the camera information.

It does things like feature detection, tracking, and outlier elimination. It also does pre-integration on the IMU. And then on the back end, it fuses this information together using a factor graph. And when you compare this particular Navion chip design to mobile or desktop CPUs, you're talking about two to three orders of magnitude reduction in energy consumption, because you have a specialized chip to do it.

So what is the key component of this chip that enables us to do it? Well, again, sticking with the theme, the key thing is reduction in data movement. In particular, we reduce the amount of data that needs to be moved on and off chip. So all of the processing is located on the chip itself.

And then furthermore, because we want to reduce the size of the chip and the size of the memories, we do things like apply low-cost compression to the frames and also exploit sparsity, meaning the number of zeros, in the factor graph itself. So all of the compression and sparsity exploitation can actually reduce the storage cost down to under a megabyte of on-chip storage to do this processing.

And that allows us to achieve this really low power consumption of below 25 milliwatts. Another thing that really matters for autonomous navigation is, once you know where you are, where are you going to go next? This is kind of a planning and mapping problem. And in the context of things like robot exploration, where you want to explore an unknown area, you can do this by computing what we call Shannon mutual information.

Basically, you want to figure out where you should go next to discover the most new information compared to what you already know. So you can imagine what's shown here is an occupancy map. The light colors show where there's free space.

It's empty. Nothing's occupied. The dark gray area is unknown. And then the black lines are occupied things, so like walls, for example. And the question is, if I know that this is my current occupancy map, where should I go and scan, let's say, with a depth sensor to figure out more information about the map itself?

So what you can do is you can compute what we call the mutual information of the map itself based on what you already know. And then you go to the location with the most information, and you scan it, and then you get an updated map. So shown here below is a miniature race car that's doing exactly that.
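
To give a flavor of the computation, here is a deliberately crude stand-in: score each candidate scan position by the total entropy of the map cells it could observe and pick the best. The real pipeline computes Shannon mutual information along each sensor beam, which also accounts for occlusion, so treat this purely as an illustration of the "go where the map is most uncertain" idea; all names and numbers are assumptions.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (bits) of an occupancy probability; 0.5 (unknown) is maximal."""
    p = np.clip(p, 1e-6, 1 - 1e-6)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def best_scan_location(occ_map, candidates, sensor_radius):
    """Crude stand-in for the mutual-information computation: score each candidate
    robot position by the total uncertainty (entropy) of the cells a scan from there
    could observe, and pick the highest-scoring position."""
    h = entropy(occ_map)
    ys, xs = np.mgrid[0:occ_map.shape[0], 0:occ_map.shape[1]]
    best, best_score = None, -1.0
    for (cy, cx) in candidates:
        in_range = (ys - cy) ** 2 + (xs - cx) ** 2 <= sensor_radius ** 2
        score = h[in_range].sum()
        if score > best_score:
            best, best_score = (cy, cx), score
    return best

# 0 = free, 1 = occupied, 0.5 = unknown; the candidate nearest the unknown region wins.
occ = np.full((50, 50), 0.5)
occ[:25, :] = 0.05
print(best_scan_location(occ, candidates=[(10, 25), (40, 25)], sensor_radius=8))  # (40, 25)
```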

So over here is the mutual information that's being computed. It's trying to go to those light yellow areas that have the most information. So you can see that it's going to try to back up and come scan this region to figure out more information about it.

So that's great. It's a very principled way of doing this. The reason this kind of computation has been challenging is, again, the computation, in particular the data movement. You can imagine that, at any given position, you're going to do a 3D scan with your LiDAR across a wide range of neighboring regions with your beams.

You can imagine each of these beams with your LiDAR scan can be processed with different cores. So they can all be processed in parallel. So parallelism, again, here, just like the deep learning case, is very easily available. The challenge is data delivery. So what happens is that you're actually storing your occupancy map all in one memory.

But now you have multiple cores that are going to try to process the scans on this occupancy map. Typically, for these types of memories, you're limited to serving about two cores at a time. But if you want to have n cores, 16 cores, 30 cores, it's going to be a challenge in terms of how to read data from this occupancy map and deliver it to the cores themselves.

If we take a closer look at the memory access pattern, you can see here that as you scan it out, the numbers indicate which cycle you would use to read each of the locations on the map itself. And you can see it's kind of a diagonal pattern. So the question is, can I break this map into smaller memories and then access these smaller memories in parallel?

And the question is, if I can break it into smaller memories, how should I decide what part of the map goes into which of these memories? So, shown here on the right-hand side, the different colors basically indicate different memories, or different banks of the memory. They store different parts of the map.

And again, if you think of the numbers as the cycle in which each location is accessed, what you'll notice is that for any given color, at most two numbers are the same, meaning that I'm only going to access at most two locations in any given bank or memory in the same cycle.
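
The banking idea can be demonstrated with a small sketch: given a schedule of reads and a rule assigning map cells to banks, count the worst-case number of reads that hit the same bank in the same cycle. The diagonal access pattern and the column-interleaved banking below are assumptions made for illustration; the actual pattern used on the chip is whatever keeps every bank at or below its available ports.

```python
from collections import Counter

def max_conflicts(accesses, bank_of):
    """accesses: list of (cycle, row, col) reads. Returns the worst-case number of
    reads that land in the same bank in the same cycle; anything above the number
    of ports on that bank forces a stall."""
    per_cycle_bank = Counter((c, bank_of(r, k)) for c, r, k in accesses)
    return max(per_cycle_bank.values())

# Toy diagonal pattern: in cycle t, 8 parallel beams read cells (t, beam_offset + t).
accesses = [(t, t, b + t) for t in range(16) for b in range(8)]

one_big_memory = lambda r, c: 0            # everything in one bank -> 8-way conflict
column_interleaved = lambda r, c: c % 8    # spread the diagonal across 8 banks

print(max_conflicts(accesses, one_big_memory))      # 8
print(max_conflicts(accesses, column_interleaved))  # 1
```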

So there's going to be no conflict, and I can process all of these beams in parallel. And by doing this, you can compute the mutual information of the entire map, and the map can be very large, let's say 200 meters by 200 meters at 0.1-meter resolution, in under a second.

This is very different from before, where you can only compute the mutual information of a subset of locations and just try and pick the best one. Now you can compute on the entire map. So you can know the absolute best location to go to get the most information. This is 100x speed up compared to a CPU at a tenth of the power on an FPGA.

So that's another important example of how data movement is really critical in order to allow you to process things very, very quickly and how having specialized hardware can enable that. All right. So one last thing is looking at-- so we talked about robotics. We talked about deep learning. But actually, what's really important is there's a lot of important applications where you can apply efficient processing that can help a lot of people around the world.

So in particular, looking at monitoring neurodegenerative disorders. We know that things like dementia, Alzheimer's, and Parkinson's affect tens of millions of people around the world, and that number continues to grow. These are very severe diseases, and they come with many challenges.

But one of the challenges is that the neurological assessments for these diseases can be very time consuming and require a trained specialist. So normally, if you are suffering from one of these diseases or you might have this disease, what you need to do is you need to go see a specialist.

And they'll ask you a series of questions. They'll do a mini mental exam: what year is it, where are you now, can you count backwards, and so on. Or you might be familiar with the clock-drawing test, where people are asked to draw a clock. And you can imagine that going to a specialist to do these types of things can be costly and time consuming.

So you don't go very frequently, and as a result, the data that's collected is very sparse. It's also very qualitative: if you go to different specialists, they might come up with different assessments, so repeatability is very much an issue as well. What's been super exciting is that it's been shown in the literature that there's actually a quantitative way of measuring or evaluating these types of diseases, potentially using eye movements.

So eye movements can be used as a quantitative way to evaluate the severity, progression, or regression of these particular types of diseases. You can imagine asking things like: if you're taking a certain drug, is your disease getting better or worse? And these eye movements can give a quantitative evaluation of that.

But the challenge is that to do these eye movement evaluations, you still need to go into a clinical setting. First, you need a very high-speed camera, which can be very expensive. Often, you need to have substantial head support so your head doesn't move and you can really detect the eye movement.

And you might even need IR illumination so you can see the eye more clearly. So again, this still has the challenge that clinical measurements of what we call saccade latency, eye movement latency or eye reaction time, are done in very constrained environments. You still have to go see the specialist.

And they use very specialized and costly equipment. So in the vein of enabling efficient computing and bringing compute to various devices, our question is, can we actually do these eye measurements on a phone itself that we all have? And so indeed, you can. You can develop various algorithms that can detect your eye reaction time on a consumer grade camera like your phone or an iPad.
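
As a sketch of what such an algorithm might look like at its core, here is a minimal velocity-threshold estimate of saccade latency from a gaze trace. The function name, the threshold, and the synthetic trace are all assumptions for illustration; a practical system also has to handle blinks, head motion, and pupil detection.

```python
import numpy as np

def saccade_latency(gaze_x, fps, stimulus_frame, vel_thresh=5.0):
    """Estimate eye reaction time (saccade latency): time from stimulus onset to the
    first frame where horizontal gaze velocity exceeds a threshold. A toy version of
    the idea; a real system must also reject blinks and head motion."""
    velocity = np.abs(np.diff(gaze_x)) * fps          # pixels per second
    moving = np.where(velocity[stimulus_frame:] > vel_thresh)[0]
    if len(moving) == 0:
        return None
    return moving[0] / fps                            # seconds after the stimulus

# 240 fps trace: eyes still for 60 frames after the stimulus, then a rapid shift.
gaze = np.concatenate([np.zeros(100), np.linspace(0, 50, 20), np.full(50, 50.0)])
print(saccade_latency(gaze, fps=240, stimulus_frame=40))  # -> 0.25 s
```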

And we've shown that you can actually replicate the quality of results that you could get with a Phantom high-speed camera. So shown here in red are the eye reaction times measured on a subject with an iPhone 6, which is obviously under $1,000 and way cheaper now, compared to a Phantom camera shown here in blue.

You can see that the distributions of the reaction times are about the same. Why is this exciting? Because it enables us to do low cost in-home measurements. So what you can imagine is a patient could do these measurements at home for many days, not just the day they go in.

And then they can bring in this information. And this can give the physician or the specialist additional information to make the assessment as well. So this can be complementary. But it gives a much more rich set of information to do the diagnosis and evaluation. So we're talking about computing.

But there's also other parts of the system that burn power as well, in particular, when we're talking about things like depth estimation using time of flight. Time of flight is very similar to LIDAR. Basically, what you're doing is you're sending a pulse and waiting for it to come back.

And how long it takes to come back indicates the depth of whatever object you're trying to detect. The challenge is that depth estimation with time-of-flight sensors can be very power hungry: you're emitting a pulse and waiting for it to come back, so you're talking about up to tens of watts of power.

The question is, can we also reduce the sensor power if we can do efficient computing? For example, can I reduce how often I fire the depth sensor and kind of recover the missing information just using a monocular RGB camera? Typically, you have a pair of a depth sensor and an RGB camera.

If at time 0 I turn both of them on, and at times 1 and 2 I turn the depth sensor off but still keep my RGB camera on, can I estimate the depth at times 1 and 2? And the key thing here is to make sure that the algorithms you're running to estimate the depth without turning on the depth sensor are super cheap.
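
Here is a small sketch of that duty-cycling idea: fire the depth sensor only every ninth frame, about 11% of the time, and in between update the last measured depth map using motion estimated from the RGB stream. The `get_depth` and `estimate_motion` callables and the global-shift update are placeholders for illustration, not the actual algorithm.

```python
import numpy as np

def duty_cycled_depth(rgb_frames, get_depth, estimate_motion, period=9):
    """Sketch of duty-cycling the depth sensor: measure depth only every `period`-th
    frame and, in between, propagate the last measured depth map using motion
    estimated from the RGB stream (here a single global shift per frame)."""
    depth = None
    outputs = []
    for i, rgb in enumerate(rgb_frames):
        if i % period == 0:
            depth = get_depth(i)                      # sensor on: measured depth
        else:
            dy, dx = estimate_motion(rgb)             # sensor off: cheap RGB-based update
            depth = np.roll(depth, (dy, dx), axis=(0, 1))
        outputs.append(depth)
    return outputs

# Toy usage: a static scene (no motion), so the propagated depth matches the measurement.
frames = [np.zeros((48, 64)) for _ in range(18)]
out = duty_cycled_depth(frames,
                        get_depth=lambda i: np.ones((48, 64)),
                        estimate_motion=lambda rgb: (0, 0))
```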

So we actually have algorithms that can run at VGA resolution at 30 frames per second on a Cortex-A7, which is a super low-cost embedded processor. And just to give you an idea of what it looks like: on the left is the RGB image, and in the middle is the depth map, the ground truth.

So if I always had the depth sensor on, that's what it would look like. And then on the right-hand side is the estimated depth map. In this particular case, we're turning on the sensor only 11% of the time, so every ninth frame. And the mean relative error is only about 0.7%, so the accuracy or quality is pretty well aligned.

OK, so at a high level, what are the key takeaways I want you guys to get from today's lecture? First is efficient computing is really important. It can extend the reach of AI beyond the cloud itself because it can reduce communication networking costs, enable privacy, and provide low latency.

And so we can use AI for a wide range of applications, ranging from things like robotics to health care. Achieving this energy-efficient computing really requires cross-layer design: specialized hardware plays an important role, but so do the algorithms themselves.

And this is going to be really key to enabling AI for the next decade and beyond. OK, we also covered a lot of points in the lecture, so the slides are all available on our website. Also, since this is a deep learning seminar series, I want to point out some other resources that you might be interested in if you want to learn more about efficient processing of neural nets.

So again, I want to point you first to this survey paper that we've developed. This is with my collaborator Joel Emer. It covers the different techniques that people are looking at and gives some insights into the key design principles. We also have a book coming soon.

It's going to be out within the next few weeks. We also have slides from various tutorials that we've given on this particular topic. In fact, we also teach a course on this here at MIT, 6.825. If you're interested in updates on all these types of materials, I invite you to join the mailing list or the Twitter feed.

The other thing is if you're not an MIT student, but you want to take a two-day course on this particular topic, I also invite you to take a look at the MIT Professional Education option. So we run short courses on MIT campus over the summer. So you can come for two days, and we can talk about the various different approaches that people use to build efficient deep learning systems.

And then finally, if you're interested in video tutorials on this topic, at the end of November during NeurIPS I gave a 90-minute tutorial that goes really in depth on how to build efficient deep learning systems. I invite you to check that out. We also have some talks at the Mars Conference on Efficient Robotics.

And we have a YouTube channel where this is all located. And then finally, I'd be remiss if I didn't acknowledge that a lot of the work here is done by the students, all the students in our group, as well as my collaborators, Joel Emer, Sertac Karaman, and Thomas Heldt, and all of our sponsors that make this research possible.

So that concludes my talk. Thank you very much. Thank you.