
Efficient Computing for Deep Learning, Robotics, and AI (Vivienne Sze) | MIT Deep Learning Series


Chapters

0:00 Introduction
0:43 Talk overview
1:18 Compute for deep learning
5:48 Power consumption for deep learning, robotics, and AI
9:23 Deep learning in the context of resource use
12:29 Deep learning basics
20:28 Hardware acceleration for deep learning
57:54 Looking beyond the DNN accelerator for acceleration
63:45 Beyond deep neural networks

Whisper Transcript

00:00:00.000 | - I'm happy to have Vivienne Sze here with us.
00:00:03.440 | She's a professor here at MIT,
00:00:05.060 | working in the very important and exciting space
00:00:08.300 | of developing energy efficient and high performance systems
00:00:11.400 | for machine learning, computer vision,
00:00:13.240 | and other multimedia applications.
00:00:16.600 | This involves joint design of algorithms,
00:00:18.720 | architecture, circuits, systems,
00:00:21.040 | to enable optimal trade-offs between power,
00:00:23.280 | speed, and quality of result.
00:00:25.340 | One of the important differences between the human brain
00:00:29.720 | and AI systems is the energy efficiency of the brain.
00:00:34.720 | So Vivienne is a world-class researcher at the forefront
00:00:38.160 | of discovering how we can close that gap.
00:00:40.560 | So please give her a warm welcome.
00:00:43.300 | - I'm really happy to be here to share some of the research
00:00:46.520 | and an overview of this area, efficient computing.
00:00:49.560 | So actually what I'm gonna be talking about today
00:00:51.720 | is gonna be a little bit broader than just deep learning.
00:00:53.880 | We'll start with deep learning,
00:00:54.840 | but we'll also move to how we might apply this to robotics
00:00:58.620 | and other AI tasks, and why it's really important
00:01:02.040 | to have efficient computing to enable a lot
00:01:03.720 | of these exciting applications.
00:01:06.100 | Also, I just wanna mention that a lot of the work
00:01:08.340 | I'm gonna present today is not done by myself,
00:01:10.560 | but in collaboration with a lot of folks at MIT over here.
00:01:14.680 | And of course, if you want access to the slides,
00:01:16.520 | they're available on our website.
00:01:18.160 | So given that this is a deep learning lecture series,
00:01:21.920 | I wanna first start talking a little bit
00:01:23.680 | about deep neural nets.
00:01:24.520 | So we know that deep neural nets have, you know,
00:01:27.600 | generated a lot of interest
00:01:29.240 | and have many very compelling applications.
00:01:33.040 | But one of the things that has, you know,
00:01:34.720 | come to light over the past few years
00:01:37.480 | is the increasing need for compute.
00:01:39.240 | OpenAI actually showed over the past few years
00:01:42.060 | that there's been a significant increase
00:01:44.560 | in the amount of compute that is required
00:01:46.960 | to perform deep learning applications
00:01:49.100 | and to do the training for deep learning
00:01:50.720 | over the past few years.
00:01:51.660 | So it's actually grown exponentially
00:01:53.920 | over the past few years.
00:01:54.760 | It's grown in fact by over 300,000 times
00:01:57.680 | in terms of the amount of compute we need to drive
00:02:01.360 | and increase the accuracy of a lot of the tasks
00:02:04.400 | that we're trying to achieve.
00:02:05.800 | At the same time, if we start looking at basically
00:02:09.880 | the environmental implications of all of this processing,
00:02:13.960 | it can be quite severe.
00:02:15.160 | So if we look at, for example,
00:02:17.040 | the carbon footprint of, you know, training neural nets,
00:02:20.080 | if you think of, you know, the amount of carbon footprint
00:02:23.120 | of flying across North America from New York
00:02:26.440 | to San Francisco or the carbon footprint
00:02:29.480 | of an average human life, you can see that, you know,
00:02:33.360 | neural networks are orders of magnitude greater than that.
00:02:36.720 | So the environmental or carbon footprint implications
00:02:40.080 | of computing for deep neural nets
00:02:41.600 | can be quite severe as well.
00:02:43.520 | Now this is a lot having to do with compute in the cloud.
00:02:46.000 | Another important area where we wanna do compute
00:02:48.520 | is actually moving the compute from the cloud
00:02:51.360 | to the edge itself, into the device
00:02:53.680 | where a lot of the data is being collected.
00:02:55.840 | So why would we wanna do that?
00:02:57.120 | So there's a couple of reasons.
00:02:58.600 | First of all, communication.
00:03:01.280 | So in a lot of places around the world
00:03:03.840 | and just even a lot of just places in general,
00:03:05.480 | you might not have a very strong
00:03:07.040 | communication infrastructure, right?
00:03:09.240 | So you don't wanna necessarily have to rely
00:03:10.840 | on a communication network
00:03:12.240 | in order to do a lot of these applications.
00:03:14.960 | So again, you know, removing your tethering
00:03:17.200 | from the cloud is important.
00:03:19.320 | Another reason is a lot of the times that we,
00:03:22.440 | you know, apply deep learning on a lot of applications
00:03:24.240 | where the data is very sensitive.
00:03:26.200 | So you can think about things like healthcare
00:03:28.560 | where you're collecting very sensitive data.
00:03:30.640 | And so privacy and security again is really critical.
00:03:34.160 | And you would, rather than sending the data to the cloud,
00:03:36.360 | you'd like to bring the compute to the data itself.
00:03:39.280 | Finally, another compelling reason for, you know,
00:03:44.200 | bringing the compute into the device
00:03:45.880 | or into the robot is latency.
00:03:47.520 | So this is particularly true for interactive applications.
00:03:51.040 | So you can think of things like autonomous navigation,
00:03:53.800 | robotics, or self-driving vehicles
00:03:55.920 | where you need to interact with the real world.
00:03:58.080 | You can imagine if you're driving very quickly
00:04:00.240 | down the highway and you detect an obstacle,
00:04:02.520 | you might not have enough time to send the data
00:04:04.480 | to the cloud, wait for it to be processed,
00:04:06.560 | and send the instruction back in.
00:04:08.400 | So again, you wanna move the compute into the robot
00:04:11.160 | or into the vehicle itself.
00:04:13.240 | Okay, so hopefully this is establishing
00:04:15.160 | why we wanna move the compute into the Edge.
00:04:17.800 | But one of the big challenges of doing processing
00:04:20.640 | in the robot or in the device actually has to do
00:04:22.760 | with power consumption itself.
00:04:24.120 | So if we take the self-driving car as an example,
00:04:26.800 | it's been reported that it consumes over 2000 watts
00:04:30.920 | of power just for the computation itself,
00:04:33.400 | just to process all the sensor data that it's collecting.
00:04:36.960 | Right, and this actually generates a lot of heat.
00:04:39.520 | It takes up a lot of space.
00:04:40.440 | You can see in this prototype
00:04:43.240 | that all the compute is being placed in the trunk,
00:04:46.320 | where it generates a lot of heat
00:04:47.680 | and often needs water cooling.
00:04:49.880 | So this can be a big cost and logistical challenges
00:04:53.320 | for self-driving vehicles.
00:04:55.160 | Now you can imagine that this is gonna be
00:04:56.520 | much more challenging if we shrink down the form factor
00:05:00.080 | of the device itself to something that is perhaps portable
00:05:02.840 | in your hands.
00:05:03.680 | You can think about smaller robots
00:05:05.360 | or something like your smartphone or cell phone.
00:05:08.280 | In these particular cases,
00:05:09.360 | when you think about portable devices,
00:05:11.360 | you actually have very limited energy capacity,
00:05:13.880 | and this is based on the fact that the battery itself
00:05:16.920 | is limited in terms of the size, weight, and its cost.
00:05:19.760 | Right, so you can't have a very large amount of energy
00:05:22.600 | on these particular devices itself.
00:05:24.680 | Secondly, when you take a look at the embedded platforms
00:05:27.960 | that are currently used for embedded processing
00:05:29.840 | for these particular applications,
00:05:31.760 | they tend to consume over 10 watts,
00:05:34.240 | which is an order of magnitude higher
00:05:35.880 | than the power consumption that you typically
00:05:38.100 | would allow for these particular handheld devices.
00:05:40.640 | So in these handheld devices,
00:05:41.960 | typically you're limited to under a watt
00:05:43.640 | due to the heat dissipation.
00:05:44.720 | For example, you don't want your cell phone
00:05:46.040 | to get super hot.
00:05:47.640 | Okay, so in the past decade or so, or decades,
00:05:51.920 | what we would do to address this challenge
00:05:53.760 | is that we would wait for transistors to become smaller,
00:05:56.880 | faster, and more efficient.
00:05:58.680 | However, this has become a challenge
00:06:01.160 | over the past few years,
00:06:02.400 | so transistors are not getting more efficient.
00:06:05.080 | So for example, Moore's Law,
00:06:07.520 | which typically makes transistors smaller and faster,
00:06:10.180 | has been slowing down,
00:06:11.360 | and Dennard scaling,
00:06:13.040 | which has made transistors more efficient,
00:06:15.400 | has also slowed down or ended.
00:06:17.440 | So you can see here over the past 10 years,
00:06:19.280 | this trend has really flattened out.
00:06:21.080 | Okay, so this is a particular challenge
00:06:23.400 | because we want more and more compute
00:06:25.120 | to drive deep neural network applications,
00:06:27.820 | but the transistors are not becoming more efficient.
00:06:30.720 | Right?
00:06:31.760 | So what we have to turn to in order to address this
00:06:34.880 | is specialized hardware
00:06:37.600 | to achieve the significant speed and energy efficiency
00:06:40.620 | that we require for our particular applications.
00:06:43.220 | When we talk about designing specialized hardware,
00:06:44.940 | this is really about thinking about
00:06:46.180 | how we can redesign the hardware from the ground up,
00:06:49.540 | particularly targeted at these AI, deep learning,
00:06:52.920 | and robotic tasks that we're really excited about.
00:06:55.900 | Okay, so this notion is not new.
00:06:57.500 | In fact, it's become extremely popular to do this.
00:07:00.980 | Over the past few years,
00:07:02.060 | there's been a large number of startups and companies
00:07:04.140 | that have focused on building
00:07:05.260 | specialized hardware for deep learning.
00:07:06.940 | So in fact, the New York Times reported,
00:07:09.060 | I guess it was two years ago
00:07:11.000 | that there's a record number of startups
00:07:12.600 | looking at building specialized hardware
00:07:14.800 | for AI and for deep learning.
00:07:16.920 | Okay, so we'll talk a little bit about
00:07:18.320 | what specialized hardware looks like
00:07:20.180 | for these particular applications.
00:07:22.280 | Now, if you really care about energy and power efficiency,
00:07:25.400 | the first question you should ask is
00:07:26.960 | where is the power actually going for these applications?
00:07:31.080 | And so as it turns out,
00:07:33.000 | power is dominated by data movement.
00:07:35.740 | So it's actually not the computations themselves
00:07:38.360 | that are expensive,
00:07:39.440 | but moving the data to the computation engine
00:07:42.320 | that's expensive.
00:07:43.160 | So for example, shown here in blue is
00:07:46.760 | a range of power consumption, energy consumption
00:07:49.240 | for a variety of types of computations,
00:07:51.820 | for example, multiplications and additions
00:07:54.640 | at various different precision.
00:07:56.100 | So you have, for example, floating point to fixed point
00:07:59.440 | and eight bit integer and same with additions.
00:08:01.640 | And you can see as it makes sense,
00:08:02.880 | as you scale down the precision,
00:08:04.760 | the energy consumption of each of these operations reduces.
00:08:07.860 | But what's really surprising here
00:08:09.500 | is that if you look lower
00:08:11.440 | at the energy consumption of data movement, right?
00:08:14.360 | Again, this is delivering the input data
00:08:16.260 | to do the multiplication and then, you know,
00:08:18.280 | moving the output of the multiplication
00:08:19.880 | somewhere into memory, it can be very expensive.
00:08:22.280 | So for example, if you look at the energy consumption
00:08:25.880 | of a 32 bit read from an SRAM memory,
00:08:28.440 | this is an eight kilobyte SRAM.
00:08:29.840 | So it's a very small memory
00:08:31.520 | that you would have on the processor or on the chip itself.
00:08:35.360 | This is already gonna consume five picojoules of energy.
00:08:38.880 | So equivalent or even more
00:08:40.840 | than a 32 bit floating point multiply.
00:08:43.840 | And this is from a very small memory.
00:08:45.880 | If you need to read this data from off chip,
00:08:48.200 | so outside the processor, for example, in DRAM,
00:08:51.920 | it's gonna be even more expensive.
00:08:54.080 | So in this particular case,
00:08:55.180 | we're showing 640 picojoules in terms of energy.
00:08:58.560 | And so you can notice here on the horizontal axis
00:09:01.500 | that this is basically a logarithmic axis.
00:09:05.160 | So you're talking about orders of magnitude increase
00:09:07.840 | in energy in terms of data movement
00:09:09.420 | compared to the compute itself, right?
00:09:11.580 | So this is a key takeaway here.
00:09:13.080 | So if we really want to address the energy consumption
00:09:17.080 | of these particular types of processing,
00:09:19.660 | we really wanna look at reducing data movement.
00:09:22.560 | Okay, but what's the challenge here?
00:09:24.200 | So if we take a look at a popular AI robotics
00:09:27.240 | type of application like autonomous navigation,
00:09:29.040 | the real challenge here though,
00:09:30.440 | is that these applications use a lot of data, right?
00:09:33.640 | So for example, one of the things you need to do
00:09:35.240 | in autonomous navigation
00:09:36.280 | is what we call semantic understanding.
00:09:38.380 | So you need to be able to identify, you know,
00:09:40.440 | which pixel belongs to what.
00:09:41.840 | So for example, in this scene,
00:09:42.840 | you need to know that this pixel represents the ground,
00:09:45.400 | this pixel represents the sky,
00:09:46.960 | this pixel represents, you know, a person itself.
00:09:49.880 | Okay, so this is an important type of processing.
00:09:51.640 | Often if you're traveling quickly,
00:09:53.440 | you wanna be able to do this at a very high frame rate.
00:09:56.940 | You might need to have large resolution.
00:09:58.560 | So for example, typically if you want HD images,
00:10:00.920 | you're talking about 2 million pixels per frame.
00:10:03.880 | And then often, if you also wanna be able to detect objects
00:10:06.560 | at different scales or see objects that are far away,
00:10:09.220 | you need to do what we call data expansion.
00:10:11.320 | For example, build a pyramid for this,
00:10:13.220 | and this would increase the amount of pixels
00:10:15.080 | or amount of data you need to process
00:10:17.000 | by, you know, one to two orders of magnitude.
00:10:19.640 | So that's a huge amount of data
00:10:20.800 | that you have to process right off the bat there.
00:10:23.360 | Another type of processing
00:10:25.480 | or understanding that you wanna do for autonomous navigation
00:10:27.520 | is what we call geometric understanding,
00:10:29.520 | and that's when you're kind of navigating,
00:10:30.840 | you wanna build a 3D map of the world that's around you.
00:10:34.080 | And you can imagine the longer you travel for,
00:10:37.320 | the larger the map you're gonna build.
00:10:39.300 | And again, that's gonna be more data
00:10:41.520 | that you're gonna have to process and compute on.
00:10:44.280 | Okay, so this is a significant challenge
00:10:46.080 | for autonomous navigation in terms of amount of data.
00:10:48.720 | Another aspect of autonomous navigation,
00:10:51.700 | and also of other applications like AR, VR, and so on,
00:10:54.160 | is understanding your environment, right?
00:10:56.720 | So a typical thing you might need to do
00:10:59.160 | is to do depth estimation.
00:11:00.680 | So for example, if I give you an image,
00:11:02.860 | can you estimate the distance
00:11:04.840 | of how far a given pixel is from you?
00:11:07.600 | And also semantic segmentation,
00:11:09.220 | we just talked about that before.
00:11:10.600 | So these are important types of ways
00:11:12.840 | to understand your environment
00:11:14.200 | when you're trying to navigate.
00:11:16.000 | And it should be no surprise to you
00:11:18.280 | that in order to do these types of processing,
00:11:20.840 | the state-of-the-art approaches utilize deep neural nets.
00:11:25.040 | Right?
00:11:26.280 | But the challenge here is that these deep neural nets
00:11:28.200 | often require several hundred million
00:11:30.640 | of operations and weights to do the computation.
00:11:33.560 | So when you try and compare it to something
00:11:35.720 | like you would all have on your phone,
00:11:37.000 | for example, video compression,
00:11:38.920 | you're talking about two to three orders of magnitude
00:11:41.860 | increase in computational complexity.
00:11:45.040 | So this is a significant challenge
00:11:46.400 | 'cause if we'd like to have deep neural networks
00:11:49.500 | be as ubiquitous as something like video compression,
00:11:52.460 | we really have to figure out
00:11:53.760 | how to address this computational complexity.
00:11:56.640 | We also know that deep neural networks
00:11:58.160 | are not just used for understanding the environment
00:12:00.440 | or autonomous navigation,
00:12:02.000 | but it's really become the cornerstone
00:12:03.360 | of many AI applications from computer vision,
00:12:06.320 | speech recognition, gameplay, and even medical applications.
00:12:09.920 | And I'm sure a lot of these have been covered
00:12:11.800 | through this course.
00:12:13.520 | So briefly, I'm just gonna give a quick overview
00:12:16.640 | of some of the key components in deep neural nets,
00:12:18.680 | not because, you know, I'm sure all of you understand it,
00:12:20.840 | but because since this area is very popular,
00:12:23.520 | the terminology can vary from discipline to discipline.
00:12:26.120 | So I'll just do a brief overview to align ourselves
00:12:28.140 | on the terminology itself.
00:12:30.520 | So what are deep neural nets?
00:12:32.920 | Basically, you can view it as a way of, for example,
00:12:36.360 | understanding the environment.
00:12:37.700 | It's a chain of different layers of processing
00:12:42.060 | where you can imagine for an input image,
00:12:44.200 | at the low level or the earlier parts of the neural net,
00:12:46.760 | you're trying to learn different low-level features
00:12:49.480 | such as edges of an image.
00:12:51.640 | And as you get deeper into the network,
00:12:53.960 | as you chain more of these kind of computational layers
00:12:56.560 | together, you start being able to detect
00:12:58.960 | higher and higher level features
00:13:00.360 | until you can, you know, recognize a vehicle, for example.
00:13:03.880 | And, you know, the difference of this particular approach
00:13:06.240 | compared to more traditional ways of doing computer vision
00:13:09.240 | is that how we extract these features are learned
00:13:12.180 | from the data itself, as opposed to having an expert
00:13:14.160 | come in and say, "Hey, look for the edges,
00:13:16.160 | look for, you know, the wheels," and so on.
00:13:18.160 | The fact that it recognizes these features
00:13:19.800 | is a learned approach.
00:13:22.320 | Okay, what is it doing at each of these layers?
00:13:24.680 | Well, it's actually doing a very simple computation.
00:13:28.200 | This is looking at the inference side of things.
00:13:29.920 | Basically, effectively, what it's doing is a weighted sum.
00:13:32.640 | Right, so you have the input values,
00:13:34.800 | and we'll color code the inputs as blue here
00:13:37.720 | and try and stay consistent with that throughout the talk.
00:13:41.240 | We apply certain weights to them,
00:13:43.440 | and these weights are learned from the training data,
00:13:45.800 | and then they would generate an output,
00:13:47.120 | which is typically red here,
00:13:48.380 | and it's basically a weighted sum, as we can see.
00:13:51.160 | We then pass this weighted sum
00:13:53.160 | through some form of non-linearity.
00:13:55.120 | So, you know, traditionally, it used to be sigmoids.
00:13:57.560 | More recently, we use things like ReLUs,
00:13:59.600 | which basically set negative values to zero.
00:14:05.760 | But the key takeaway here is that if you look
00:14:08.480 | at this computational kernel, the key operation
00:14:11.920 | to a lot of these neural networks
00:14:13.300 | is performing this multiply and accumulate
00:14:15.520 | to compute the weighted sum.
00:14:17.120 | And this accounts for over 90% of the computation.
00:14:20.360 | So if we really want to focus on, you know,
00:14:22.880 | accelerating neural nets or making them more efficient,
00:14:25.080 | we really want to focus on minimizing the cost
00:14:26.960 | of this multiply and accumulate itself.
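As a rough illustration of the kernel described here, the sketch below (plain Python, with made-up input values) computes one neuron output as a weighted sum followed by a ReLU. It is a minimal sketch of the math, not how any real accelerator implements it.

```python
def neuron_output(inputs, weights, bias=0.0):
    # Multiply-and-accumulate (MAC): this weighted sum is where over 90%
    # of the computation in a typical DNN is spent.
    acc = bias
    for x, w in zip(inputs, weights):
        acc += x * w
    # Non-linearity: a ReLU sets negative values to zero.
    return max(acc, 0.0)

# Example with 3 inputs and 3 learned weights (values are made up).
print(neuron_output([1.0, -2.0, 0.5], [0.3, 0.1, -0.4]))  # -0.1 before ReLU -> 0.0
```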
00:14:29.240 | There are also various popular types
00:14:32.960 | of deep neural network layers used for deep neural networks.
00:14:36.740 | They also often vary in terms of, you know,
00:14:38.960 | how you connect up the different layers.
00:14:40.840 | So for example, you can have feed-forward layers
00:14:43.040 | where the inputs are always connected to the outputs.
00:14:45.600 | You can have feed-back where the outputs
00:14:47.320 | are connected back into the inputs.
00:14:49.480 | You can have fully-connected inputs
00:14:51.720 | where basically all the outputs are connected
00:14:53.440 | to all the inputs, or sparsely connected.
00:14:56.800 | And you might be familiar with some of these layers.
00:14:58.360 | So for example, fully-connected layers,
00:15:00.160 | just like what we talked about,
00:15:01.040 | all inputs and all outputs are connected.
00:15:03.840 | They tend to be feed-forward.
00:15:05.680 | When you put them together, they're typically referred
00:15:08.160 | to as a multilayer perceptron.
00:15:10.460 | You have convolutional layers, which are also feed-forward,
00:15:14.520 | but then you have sparsely-connected
00:15:16.880 | weight-sharing connections.
00:15:18.760 | And when you put them together,
00:15:20.360 | they're often referred to as convolutional networks.
00:15:23.320 | And they're typically used for image-based processing.
00:15:25.960 | You have recurrent layers where we have
00:15:29.440 | this feedback connection, so the output
00:15:31.500 | is fed back to the input.
00:15:34.240 | When we combine recurrent layers,
00:15:35.800 | they're referred to as recurrent neural nets.
00:15:37.360 | And these are typically used to process sequential data,
00:15:40.360 | so speech or language-based processing.
00:15:42.960 | And then most recently, which has become really popular,
00:15:46.240 | it's attention layers or attention-based mechanisms.
00:15:49.800 | They often involve matrix multiply,
00:15:51.440 | which is again, multiply and accumulate.
00:15:53.840 | And when you combine these,
00:15:56.040 | they're often referred to as transformers.
00:15:58.760 | Okay, so let's first get an idea as to why
00:16:02.680 | convolutional neural nets, or deep learning in general,
00:16:05.880 | are computationally more complex than other types of processing.
00:16:08.720 | So we'll focus on convolutional neural nets as an example,
00:16:12.120 | although many of these principles apply
00:16:13.520 | to other types of neural nets.
00:16:15.360 | And the first thing to kind of take a look
00:16:17.320 | as to why it's complicated is to look
00:16:18.960 | at the computational kernel.
00:16:20.240 | So how does it actually perform convolution itself?
00:16:23.160 | So let's say you have this 2D input image.
00:16:27.320 | If it's at the input of the neural net, it would be an image.
00:16:29.280 | If it's deeper in the neural net,
00:16:30.600 | it would be the input feature map.
00:16:32.640 | And it's gonna be composed of activations.
00:16:35.640 | Or you can think from an image,
00:16:36.640 | it's gonna be composed of pixels.
00:16:38.320 | And we convolve it with, let's say, a 2D filter,
00:16:41.080 | which is composed of weights.
00:16:42.600 | Right, so typical convolution, what you would do
00:16:45.320 | is you would do an element-wise multiplication
00:16:47.840 | of the filter weights with the input feature map activations.
00:16:52.320 | You would sum them all together to generate one output value.
00:16:55.760 | And we would refer to that as the output activation.
00:16:58.720 | Right, and then because it's convolution,
00:17:00.480 | we would basically slide the filter
00:17:03.480 | across this input feature map
00:17:05.520 | and generate all the other output feature map activation.
00:17:08.840 | And so this kind of 2D convolution
00:17:10.960 | is pretty standard in image processing.
00:17:12.920 | We've been doing this for decades, right?
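Below is a minimal sketch of that sliding-window computation in plain Python, assuming stride 1 and no padding; the names ifmap, filt, and ofmap follow the terminology used in the talk, and the example values are made up.

```python
def conv2d(ifmap, filt):
    """2D convolution of one input feature map with one 2D filter (stride 1, no padding)."""
    H, W = len(ifmap), len(ifmap[0])   # input feature map of activations
    R, S = len(filt), len(filt[0])     # filter of weights
    ofmap = [[0.0] * (W - S + 1) for _ in range(H - R + 1)]
    for y in range(H - R + 1):         # slide the filter vertically
        for x in range(W - S + 1):     # ... and horizontally
            acc = 0.0
            for i in range(R):         # element-wise multiply and sum
                for j in range(S):
                    acc += ifmap[y + i][x + j] * filt[i][j]
            ofmap[y][x] = acc          # one output activation
    return ofmap

print(conv2d([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[1, 0], [0, 1]]))
# [[6, 8], [12, 14]]
```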
00:17:15.600 | What makes convolutional neural nets much more challenging
00:17:20.000 | is the increase in dimensionality.
00:17:21.480 | So first of all, rather than doing just this 2D convolution,
00:17:25.040 | we often stack multiple channels.
00:17:27.000 | So there's this third dimension called channels.
00:17:29.200 | And then what we're doing here is that we need to do
00:17:30.840 | a 2D convolution on each of the channels
00:17:33.560 | and then add it all together, right?
00:17:35.960 | And you can think of these channels for an image,
00:17:38.320 | these channels would be kind of the red, green,
00:17:40.560 | and blue components, for example.
00:17:42.320 | And as you get deeper into the feature map,
00:17:43.920 | the number of channels could potentially increase.
00:17:45.920 | So if you look at AlexNet, which is a popular neural net,
00:17:48.680 | the number of channels ranges from three to 192.
00:17:52.480 | Okay, so that already increases the dimensionality,
00:17:54.520 | one dimension of the neural net itself
00:17:57.320 | in terms of processing.
00:17:58.560 | Another dimension that we increase
00:18:01.200 | is we actually apply multiple filters
00:18:04.000 | to the same input feature map.
00:18:06.760 | Okay, so for example, you might apply M filters
00:18:10.560 | to the same input feature map,
00:18:12.120 | and then you would generate an output feature map
00:18:14.960 | of M channels, right?
00:18:16.720 | So in the previous slide, we showed that convolving
00:18:20.080 | this 3D filter generates one output channel
00:18:22.840 | in the output feature map.
00:18:24.120 | If we apply M filters,
00:18:28.560 | we're gonna generate M output channels
00:18:31.320 | in the output feature map.
00:18:33.080 | And again, just to give you an idea
00:18:34.400 | in terms of the scale of this,
00:18:35.640 | when you talk about things like AlexNet,
00:18:37.120 | we're talking about between 96 to 384 filters.
00:18:41.120 | And of course, this is increasing to thousands
00:18:43.280 | for other advanced or more modern neural nets itself.
00:18:46.880 | And then finally, often you wanna process
00:18:49.080 | more than one image at a given time, right?
00:18:52.280 | So if you wanna actually do that,
00:18:53.520 | we can actually extend it.
00:18:54.720 | So N input images become N output images,
00:18:58.800 | or N input feature maps becomes N output feature maps.
00:19:02.400 | And we typically refer to this as a batch size,
00:19:05.560 | like the number of images you're processing
00:19:07.200 | at the same time, and this can range from one to 256.
00:19:10.280 | Okay, so these are all the various different dimensions
00:19:13.520 | of the neural net.
00:19:14.640 | And so really what someone does
00:19:16.480 | when they're trying to define what we call
00:19:18.440 | the network architecture of the neural net itself
00:19:20.520 | is that they're gonna select the different
00:19:22.200 | or define the shape of the neural network
00:19:24.040 | for each of the different layers.
00:19:25.040 | So it's gonna define all these different dimensions
00:19:27.800 | of the neural net itself, and these shapes can vary
00:19:29.960 | across the different layers.
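The dimensions just described can be summarized as a seven-deep loop nest, sketched below with NumPy. The shapes are illustrative, not taken from any particular network, and the loop variables N, M, C, E, F, R, S are the names commonly used in the DNN-accelerator literature rather than anything specific to this talk.

```python
import numpy as np

N, M, C = 2, 4, 3                 # batch size, number of filters (output channels), input channels
H, W    = 8, 8                    # input feature map height / width
R, S    = 3, 3                    # filter height / width
E, F    = H - R + 1, W - S + 1    # output feature map size (stride 1, no padding)

ifmap   = np.random.rand(N, C, H, W)
filters = np.random.rand(M, C, R, S)
ofmap   = np.zeros((N, M, E, F))

for n in range(N):                 # batch
    for m in range(M):             # output channels (one per filter)
        for e in range(E):         # output rows
            for f in range(F):     # output cols
                for c in range(C):          # input channels
                    for r in range(R):      # filter rows
                        for s in range(S):  # filter cols
                            ofmap[n, m, e, f] += \
                                ifmap[n, c, e + r, f + s] * filters[m, c, r, s]
```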
00:19:31.840 | Just to give you an idea, if you look at
00:19:35.400 | MobileNet as an example, this is a very popular
00:19:37.920 | neural network model, you can see that the filter sizes,
00:19:40.840 | right, so the height and width of the filters
00:19:44.040 | and the number of filters and number of channels
00:19:45.600 | will vary across the different blocks or layers itself.
00:19:48.360 | The other thing I just wanna mention
00:19:51.440 | is that when we look towards popular DNN models,
00:19:55.120 | we can also see important trends.
00:19:56.760 | So shown here are the various different models
00:19:59.200 | that have been developed over the years
00:20:00.240 | that are quite popular.
00:20:02.360 | A couple of interesting trends,
00:20:03.800 | one is that the networks tend to become deeper,
00:20:06.480 | so you can see in the convolutional layers
00:20:08.120 | they're getting deeper and deeper.
00:20:09.800 | And then also the number of weights that they're using
00:20:13.760 | and the number of MACs are also increasing as well.
00:20:16.720 | So this is an important trend,
00:20:17.840 | the DNN models are getting larger and deeper,
00:20:20.200 | and so again, they're becoming much more
00:20:21.920 | computationally demanding.
00:20:23.720 | And so we need more sophisticated hardware
00:20:26.600 | to be able to process them.
00:20:28.440 | All right, so that's kind of a quick intro
00:20:31.280 | or overview into the deep neural network space,
00:20:33.160 | I hope we're all aligned.
00:20:34.040 | So the first thing I'm gonna talk about
00:20:35.880 | is how can we actually build hardware
00:20:38.600 | to make the processing of these neural networks
00:20:41.160 | more efficient and to run faster.
00:20:42.840 | And often we refer to this as hardware acceleration.
00:20:46.120 | All right, so we know these neural networks are very large,
00:20:49.040 | there's a lot of compute,
00:20:50.480 | but are there types of properties
00:20:51.960 | that we can leverage to make computing
00:20:53.840 | or processing of these networks more efficient?
00:20:56.960 | So the first thing that's really friendly
00:20:58.960 | is that they actually exhibit a lot of parallelism.
00:21:02.200 | So all these multiplies and accumulates,
00:21:04.400 | you can actually do them all in parallel.
00:21:06.840 | Right, so that's great.
00:21:07.960 | So what that means is high throughput
00:21:09.600 | or high speed is actually possible
00:21:11.040 | 'cause I can do a lot of these processing in parallel.
00:21:13.960 | What is difficult and what should not be a surprise
00:21:16.120 | to you now is that the memory access is the bottleneck.
00:21:18.920 | So delivering the data to the multiply
00:21:21.680 | and accumulate engine is what's really challenging.
00:21:24.240 | So I'll give you an insight as to why this is the case.
00:21:26.600 | So let's take, say we take this multiply
00:21:29.240 | and accumulate engine, what we call a MAC.
00:21:31.840 | It takes in three inputs for every MAC,
00:21:34.320 | so you have the filter weight,
00:21:37.040 | you have the input image pixel,
00:21:39.360 | or if you're deeper in the network,
00:21:40.520 | it would be an input feature map activation,
00:21:43.360 | and it also takes the partial sum,
00:21:45.160 | which is like the partially accumulated value
00:21:47.160 | from the previous multiply that it did,
00:21:49.320 | and then it would generate an updated partial sum.
00:21:52.800 | So for every computation that you do,
00:21:55.120 | for every MAC that you're doing,
00:21:56.840 | you need to have four memory accesses.
00:21:58.920 | So it's a four to one ratio in terms
00:22:01.560 | of memory accesses versus compute.
00:22:04.120 | The other challenge that you have is, as we mentioned,
00:22:08.840 | moving data is gonna be very expensive.
00:22:12.160 | So in the absolute worst case,
00:22:13.800 | and you would always try to avoid this,
00:22:15.280 | if you read the data from DRAM, it's off-chip memory,
00:22:19.560 | every time you access data from DRAM,
00:22:21.960 | it's gonna be two orders of magnitude more expensive
00:22:26.040 | than the computation of performing a MAC itself.
00:22:29.800 | Okay, so that's really, really bad.
00:22:31.320 | So if you can imagine, again, if we look at AlexNet,
00:22:33.480 | which has 700 million MACs,
00:22:35.400 | we're talking about three billion DRAM accesses
00:22:38.600 | to do that computation.
00:22:40.080 | Okay, but again, all is not lost.
00:22:43.320 | There are some things that we can exploit
00:22:45.280 | to help us along with this problem.
00:22:47.200 | So one is what we call input data reuse opportunities,
00:22:50.520 | which means that a lot of the data that we're reading,
00:22:53.000 | we're using to perform these multiplies and accumulates,
00:22:55.400 | they're actually used for many multiplies and accumulates.
00:22:58.360 | So if we read the data once,
00:23:00.560 | we can reuse it multiple times for many operations, right?
00:23:04.320 | So I'll show you some examples of that.
00:23:07.080 | First is what we call convolutional reuse.
00:23:09.400 | So again, if you remember, we're taking a filter
00:23:11.680 | and we're sliding it across this input image.
00:23:15.400 | And so as a result, the activations from the feature map
00:23:19.800 | and the weights from the filter
00:23:21.200 | are gonna be reused in different combinations
00:23:23.760 | to compute the different multiply and accumulate values
00:23:27.200 | or different MACs itself.
00:23:28.080 | So there's a lot of what we call
00:23:29.160 | convolutional reuse opportunities there.
00:23:32.000 | Another example is that we're actually, if you recall,
00:23:35.680 | gonna apply multiple filters on the same input feature map.
00:23:40.080 | So that means that each activation in that input feature map
00:23:43.960 | can be reused multiple times across the different filters.
00:23:49.040 | Finally, if we're gonna process many images
00:23:52.640 | at the same time or many feature maps,
00:23:55.280 | a given weight in the filter itself
00:23:57.760 | can be reused multiple times across these input feature maps.
00:24:01.800 | So that's what we called filter reuse.
00:24:03.960 | Okay, so there's a lot of these great data
00:24:05.920 | reuse opportunities in the neural network itself.
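To get a feel for the scale of these reuse opportunities, here is a rough count for one layer with illustrative dimensions (the numbers are made up for the example, not taken from a specific network, and edge effects are ignored).

```python
# N batch, M filters, E x F output size, R x S filter size.
N, M, E, F, R, S = 4, 64, 55, 55, 3, 3

weight_reuse     = E * F * N   # convolutional reuse + filter reuse across the batch
activation_reuse = R * S * M   # convolutional reuse + reuse across the M filters

print(weight_reuse)      # 12100: each weight can feed ~3,025 MACs per image, x4 images
print(activation_reuse)  # 576: each input activation can feed up to ~576 MACs
```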
00:24:09.320 | And so what can we do to exploit these reuse opportunities?
00:24:13.120 | Well, what we can do is we can build
00:24:14.320 | what we call a memory hierarchy
00:24:16.400 | that contains very low cost memories
00:24:19.320 | that allow us to reduce the overall cost
00:24:21.560 | of moving this data.
00:24:22.400 | So what do we mean here?
00:24:23.440 | We mean that if I have,
00:24:24.880 | if I build a multiply and accumulate engine,
00:24:27.640 | I'm gonna have a very small memory
00:24:31.400 | right beside the multiply and accumulate engine.
00:24:34.360 | And by small, I mean something on the order
00:24:36.240 | of under a kilobyte of memory
00:24:39.000 | locally beside that multiply and accumulate engine.
00:24:41.520 | Why do I want that?
00:24:42.360 | Because accessing that very small memory
00:24:45.000 | can be very cheap.
00:24:46.200 | So for example, if performing a multiply and accumulate
00:24:50.160 | in the ALU costs 1x, reading from this very small memory
00:24:55.160 | beside the multiply and accumulate engine
00:24:57.000 | is also gonna cost about the same amount of energy.
00:24:59.880 | I could also allow these processing elements
00:25:02.800 | and a processing element is gonna be this multiply
00:25:04.920 | and accumulate plus the small memory.
00:25:06.520 | I can also allow the different processing elements
00:25:08.680 | to also share data, okay?
00:25:11.720 | And so reading from a neighboring processing element
00:25:14.040 | is gonna be 2X the energy.
00:25:16.200 | And then finally, you can have a shared larger memory
00:25:20.200 | called a global buffer.
00:25:22.080 | And that's gonna be able to be shared
00:25:24.120 | across all the different processing elements.
00:25:25.400 | This tends to be larger, between 100 and 500 kilobytes.
00:25:29.440 | And that's gonna be more expensive,
00:25:30.920 | about 6X the energy itself.
00:25:33.240 | And of course, if you go off chip to DRAM,
00:25:35.600 | that's gonna be the most expensive at 200X the energy.
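A minimal sketch of this cost model, using the approximate relative energies quoted above (register file 1x, neighboring PE 2x, global buffer 6x, DRAM 200x); the access counts in the example are illustrative only.

```python
# Relative energy per access, normalized to the cost of one MAC.
COST = {
    "register_file": 1,    # small (<1 kB) memory next to the MAC unit
    "neighbor_pe":   2,    # data passed from an adjacent processing element
    "global_buffer": 6,    # shared 100-500 kB on-chip buffer
    "dram":          200,  # off-chip memory
}

def access_energy(counts):
    """counts: dict mapping memory level -> number of accesses."""
    return sum(COST[level] * n for level, n in counts.items())

# Example: serving most accesses locally vs. all of them from DRAM.
mostly_local = access_energy({"register_file": 900, "global_buffer": 90, "dram": 10})
all_dram     = access_energy({"dram": 1000})
print(mostly_local, all_dram)   # 3440 vs. 200000, in units of MAC energy
```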
00:25:39.760 | Right, and so the big issue here is,
00:25:41.560 | the way that you can think about this
00:25:43.280 | is what you would ideally like to do
00:25:46.120 | is to access all of the data
00:25:48.520 | from this very small local memory.
00:25:51.240 | But the challenge here is that this very small local memory
00:25:54.120 | is only 1Kbyte.
00:25:55.520 | We're talking about neural networks
00:25:56.800 | that are millions of weights in terms of size, right?
00:26:00.600 | So how do we go about doing that?
00:26:02.200 | So there's many challenges of doing that.
00:26:04.120 | Just as an analogy for you guys
00:26:05.480 | to kind of think through how this is related,
00:26:07.040 | you can imagine that accessing something
00:26:10.200 | from like, let's say your backpack
00:26:11.680 | is gonna be much cheaper
00:26:13.320 | than accessing something from your neighbor,
00:26:15.720 | or going back to, let's say, your office here,
00:26:18.480 | somewhere on campus to get the data
00:26:20.160 | versus going back all the way home, right?
00:26:21.960 | So ideally, you'd like to access
00:26:23.560 | all of your data from your backpack,
00:26:24.960 | but if you have a lot of work to do,
00:26:26.240 | you might not be able to fill it in your backpack.
00:26:28.120 | So the question is,
00:26:28.960 | how can I break up my large piece of work
00:26:32.120 | into smaller chunks so that I can access them all
00:26:35.000 | from this small memory itself?
00:26:36.440 | And that's the big challenge that you have.
00:26:38.080 | And so there's been a lot of research in this area
00:26:40.800 | in terms of what's the best way to break up the data
00:26:43.040 | and what should I store in this very small local memory?
00:26:46.800 | So one approach is what we call a weight stationary.
00:26:49.560 | And the idea here is I'm gonna store
00:26:51.080 | the weight information of the neural net
00:26:53.280 | into this small local memory, okay?
00:26:56.480 | And so as a result, I really minimize the weight energy.
00:26:59.960 | But the challenge here is that
00:27:01.920 | the other types of data that you have in your system,
00:27:04.280 | so for example, your input activations shown in the blue,
00:27:07.280 | and then the partial sums that are shown in the red,
00:27:09.480 | now those still have to move
00:27:11.040 | through the rest of the system itself,
00:27:12.360 | so through the network and from the global buffer, okay?
00:27:15.760 | Typical types of work that are popular
00:27:17.840 | that use this type of kind of data flow
00:27:19.840 | or weight stationary data flow,
00:27:21.120 | which is what we call it
00:27:21.960 | 'cause the weight remains stationary,
00:27:23.400 | are things like the TPU from Google
00:27:25.720 | and the NVDA accelerator from NVIDIA.
00:27:28.440 | Another approach that people take,
00:27:31.240 | or they, well, they say,
00:27:32.080 | "Well, so the weight, I only ever have to read it.
00:27:35.320 | "But the partial sums, I have to read it and write it
00:27:38.760 | "'cause the partial sum I'm gonna read,
00:27:40.440 | "accumulate, like update it,
00:27:41.840 | "and then write it back to the memory."
00:27:42.960 | So there's two memory accesses
00:27:44.440 | for that partial sum data type.
00:27:46.480 | So what, maybe I should put that partial sum
00:27:50.280 | locally into that small memory itself.
00:27:52.480 | So this is what we call output stationary
00:27:53.960 | 'cause the accumulation of the output
00:27:55.960 | is gonna be local within that one processing element.
00:27:58.400 | That's not gonna move.
00:27:59.680 | The trade-off, of course, is the activations of weights
00:28:02.960 | now have to move through the network.
00:28:05.080 | And then there's various different works called,
00:28:06.840 | like for example, some work from KU Leuven
00:28:09.560 | and some work from the Chinese Academy of Sciences
00:28:13.000 | that have been using this approach.
00:28:15.240 | Another piece of work is saying,
00:28:16.760 | "Well, forget about the inputs and the,
00:28:19.560 | "or so the outputs and the weights themselves.
00:28:22.680 | "Let's keep the input stationary within this small memory."
00:28:26.680 | And it's called input stationary.
00:28:28.400 | And some of the work, again,
00:28:29.960 | from some research work from NVIDIA has examined this.
00:28:33.160 | But all of these different types of work
00:28:34.560 | really focus on not moving one piece of type of data.
00:28:38.680 | Either focus on minimizing weight energy
00:28:41.680 | or out partial sum energy or input energy.
00:28:44.680 | I think what's important to think about
00:28:46.680 | is that maybe you wanna reduce the data movement
00:28:49.160 | of all different data types, all types of energy.
00:28:51.640 | So another approach,
00:28:52.600 | this is something that we've developed within our own group,
00:28:54.680 | is looking at what we call the row stationary data flow.
00:28:57.360 | And within each of the processing elements,
00:28:59.560 | you're gonna do one row of convolution.
00:29:04.040 | And this row is a mixture
00:29:05.520 | of all the different data types.
00:29:07.040 | So you have filter information,
00:29:08.360 | so the weights of the filter.
00:29:09.880 | You have the activations of your input feature map.
00:29:13.320 | And then you also have your partial sum information.
00:29:15.640 | So you're really trying to balance the data movement
00:29:18.200 | of all the different data types,
00:29:19.760 | not just one particular data type.
00:29:22.360 | This is just performing a one row,
00:29:23.840 | but we just talked about the fact that the neural network
00:29:26.520 | is much more than a 1D convolution.
00:29:28.400 | So you can imagine expanding this to higher dimensions.
00:29:32.440 | So this is just showing how you might expand
00:29:34.520 | this 1D convolution into a 2D convolution.
00:29:37.480 | And then there's other higher dimensionality
00:29:39.520 | that you can map onto this architecture as well.
00:29:42.200 | I won't get into the details of this,
00:29:43.520 | but the key takeaway here is that
00:29:45.480 | you might not wanna focus on one particular data type.
00:29:48.400 | You wanna actually optimize for all the different types
00:29:51.440 | of data that you're moving around in your system.
00:29:53.920 | Okay?
00:29:54.760 | And this can just show you some results
00:29:57.720 | in terms of how these different data types,
00:29:59.560 | or these different types of data flows would work.
00:30:02.520 | So for example, in the weight stationary case,
00:30:04.480 | as expected, the weight energy,
00:30:06.200 | the energy required to move the weights,
00:30:08.000 | shown in green, is gonna be the lowest.
00:30:10.360 | But then the red portion,
00:30:11.520 | which is the energy of the partial sums,
00:30:13.920 | and the green, or sorry, the blue part,
00:30:16.760 | which is the input feature map or input pixels,
00:30:19.400 | that's gonna be very high.
00:30:21.400 | Output stationary is another approach,
00:30:23.440 | as we talked about,
00:30:24.280 | you're trying to reduce the data movement
00:30:25.760 | of the partial sums, shown here in red.
00:30:28.000 | So the red part is really minimized,
00:30:29.600 | but you can see that the green part,
00:30:31.240 | which is the weight stationary data movement,
00:30:33.600 | or weight movement, is gonna be increased,
00:30:35.560 | and the blue is the input's gonna be increased.
00:30:39.640 | There's another approach called no local reuse,
00:30:41.440 | we don't have time to talk about that,
00:30:43.280 | but you can see that row stationary, for example,
00:30:45.160 | really aims to balance the data movement
00:30:47.880 | of all the different data types.
00:30:49.960 | Right, so the big takeaway here is that,
00:30:51.680 | you know, when you're trying to optimize,
00:30:53.720 | you know, a given piece of hardware,
00:30:55.640 | you don't wanna just optimize one,
00:30:57.280 | you know, for one particular type of data,
00:30:59.080 | you wanna optimize overall for all the movement
00:31:01.520 | in the hardware itself.
00:31:03.360 | Okay, another thing that you can also exploit
00:31:06.320 | to save a bit of power,
00:31:08.280 | is the fact that, you know, some of the data could be zero.
00:31:11.120 | So we know that anything multiplied by zero
00:31:14.720 | is gonna be zero, right?
00:31:16.640 | So if you know that one of the inputs
00:31:18.840 | to your multiply and accumulate is gonna be zero,
00:31:21.280 | you might as well skip that multiplication.
00:31:23.440 | In fact, you might as well skip, you know,
00:31:25.240 | accessing data or accessing the other input
00:31:28.000 | to that multiply and accumulate engine.
00:31:29.840 | So by doing that, you can actually
00:31:32.200 | reduce the power consumption by almost 50%.
00:31:36.440 | Another thing that you can do,
00:31:38.000 | is that if you have a bunch of zeros,
00:31:40.560 | you can also compress the data.
00:31:43.040 | For example, you can use things like run length encoding,
00:31:46.080 | which where basically a run of zeros
00:31:48.040 | is gonna be represented rather than, you know,
00:31:49.600 | zero, zero, zero, zero, zero,
00:31:50.880 | you can just say I have a run of five zeros.
00:31:53.000 | And this can actually reduce the amount of data movement
00:31:55.360 | by up to two X in your system itself.
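A minimal sketch of that run-length-encoding idea, using an illustrative (zero-run, value) format rather than the exact encoding used by any particular chip:

```python
def rle_encode(values):
    # Store each run of zeros as a count instead of storing every zero.
    encoded, zeros = [], 0
    for v in values:
        if v == 0:
            zeros += 1
        else:
            encoded.append((zeros, v))  # (number of zeros preceding v, v)
            zeros = 0
    if zeros:
        encoded.append((zeros, None))   # trailing run of zeros
    return encoded

# ReLU outputs tend to contain many zeros, so the encoded form is shorter.
print(rle_encode([0, 0, 0, 5, 0, 0, 7, 0, 0, 0, 0]))
# [(3, 5), (2, 7), (4, None)]
```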
00:31:59.000 | And in fact, in, you know, neural nets,
00:32:00.840 | there's a large number of, you know,
00:32:02.120 | possibilities for actually generating zeros.
00:32:03.920 | First of all, if you remember the ReLU,
00:32:06.040 | it's setting negative values to zero,
00:32:07.880 | so it naturally generates zeros.
00:32:09.440 | And then there's other techniques, for example,
00:32:11.200 | we call pruning, which is setting some of the weights
00:32:13.280 | of the neural net to zero as well.
00:32:14.480 | And so this can exploit all of that.
00:32:16.760 | Okay, so, you know, what is the impact
00:32:19.040 | of all these types of things?
00:32:20.120 | So we actually looked at building hardware
00:32:22.840 | in particular a customized chip that we called Eyeriss
00:32:25.720 | to demonstrate these particular approaches,
00:32:28.080 | in particular the row stationary data flow
00:32:30.840 | and exploiting sparsity in the activation data.
00:32:34.280 | So this Eyeriss chip has 14 by 12,
00:32:37.920 | so 168 processing elements.
00:32:40.360 | You can see that there's a shared buffer
00:32:43.040 | that's 100 kilobytes,
00:32:44.400 | and it has some compression, decompression
00:32:46.160 | before it goes to off-chip DRAM.
00:32:47.960 | And again, that's because accessing DRAM
00:32:49.680 | is the most expensive.
00:32:51.480 | Shown here on the right-hand side
00:32:53.360 | is a die photo of the fabricated chip itself, right?
00:32:56.800 | And this is four millimeters by four millimeters
00:32:59.400 | in terms of size.
00:33:00.600 | And so using that, you know, row stationary data flow,
00:33:04.040 | it exploits a lot of data reuse.
00:33:05.920 | So it actually reduces the number of times
00:33:08.880 | we access this global buffer by 100x.
00:33:12.600 | And it also reduces the amount of times
00:33:15.000 | we access the off-chip memory by over 1000x.
00:33:18.360 | This is all because, you know,
00:33:19.640 | each of these processing elements has, you know,
00:33:21.880 | a local memory that is trying to read
00:33:24.120 | most of its data from,
00:33:25.200 | and it's also sharing with other processing elements.
00:33:27.760 | So overall, when you compare it to a mobile GPU,
00:33:30.080 | you're talking about an order of magnitude reduction
00:33:32.840 | in energy consumption.
00:33:34.080 | If you'd like to learn a little bit more about that,
00:33:36.760 | I invite you to visit the Eyeriss project website.
00:33:40.640 | Okay, so this is great.
00:33:41.560 | We can build custom hardware,
00:33:42.840 | but what does this actually mean
00:33:44.680 | in terms of, you know, building a system
00:33:46.600 | that can efficiently compute neural nets?
00:33:48.840 | So let's say we take a step back.
00:33:50.360 | Let's say we don't care anything about the hardware,
00:33:52.720 | and we're, you know, a systems provider.
00:33:54.440 | We want to build, you know, an overall system.
00:33:56.160 | And what we really care about
00:33:57.880 | is the trade-off between energy and accuracy, right?
00:34:02.200 | That's the key thing that we care about.
00:34:04.680 | So shown here is a plot,
00:34:06.280 | and let's say this is for an object detection task, right?
00:34:08.560 | So accuracy is on the x-axis,
00:34:13.080 | and it's listed in terms of average precision,
00:34:15.680 | which is a metric that we use for object detection.
00:34:18.360 | It's on a linear scale, and higher, the better.
00:34:20.720 | Vertically, we have energy consumption.
00:34:25.280 | This is the energy that's being consumed per pixel.
00:34:27.760 | So you kind of average it.
00:34:28.640 | I can imagine a higher-resolution image
00:34:30.280 | can consume more energy.
00:34:31.560 | It's going to be an exponential scale.
00:34:33.280 | So let's first start on the accuracy axis.
00:34:37.840 | And so if you think before neural nets, you know,
00:34:40.240 | had its resurgence in around 2011, 2012,
00:34:43.400 | actually state-of-the-art approaches
00:34:44.880 | used features called histogram of oriented gradients, right?
00:34:49.320 | This was a very popular approach and quite accurate
00:34:52.280 | in terms of object detection.
00:34:55.560 | And we refer to as HOG.
00:34:57.720 | The reason why neural nets really took off
00:35:00.000 | is 'cause they really improved the accuracy.
00:35:01.480 | So you can imagine AlexNet here almost doubled the accuracy,
00:35:05.160 | and then VGG further increased the accuracy.
00:35:08.480 | So it's super exciting there.
00:35:10.800 | But then we want to look also on the vertical axis,
00:35:14.520 | which is the energy consumption.
00:35:16.280 | And I should mention, you know,
00:35:17.880 | basically you'll see these dots.
00:35:19.280 | We have the energy consumption
00:35:20.400 | for each of these different approaches.
00:35:21.920 | These approaches are actually measured,
00:35:24.360 | or these energy numbers are measured
00:35:25.680 | on specialized hardware already
00:35:27.760 | that's been designed for that particular task.
00:35:30.560 | So we have a chip here that's built
00:35:33.400 | in a 65-nanometer CMOS process.
00:35:35.520 | So they use the same transistors, around the same size,
00:35:37.840 | and it does object detection using the HOG features.
00:35:40.800 | And then here's the Iris chip that we just talked about.
00:35:43.360 | I should also note that both of these chips
00:35:45.040 | were built in my group.
00:35:46.160 | The students who built these chips, you know,
00:35:47.880 | started designing the chips at the same time
00:35:50.360 | and taped out at the same time.
00:35:51.440 | So it's somewhat of a controlled experiment
00:35:53.040 | in terms of optimization.
00:35:55.360 | Okay, so what does this tell us
00:35:56.480 | when we look on the energy axis?
00:35:58.360 | We can see that histogram of oriented gradients,
00:36:01.240 | or HOG features, are actually very efficient
00:36:03.640 | from an energy point of view.
00:36:05.040 | In fact, if we compare it to something like
00:36:07.000 | video compression, again, something that you all have
00:36:09.800 | in your phone, HOG features are actually more efficient
00:36:12.960 | than video compression, meaning for the same energy
00:36:16.000 | that you would spend compressing a pixel,
00:36:18.480 | you could actually understand that pixel.
00:36:20.680 | So that's pretty impressive.
00:36:22.920 | But if we start looking at AlexNet or VGG,
00:36:26.440 | we can see that the energy increases
00:36:28.680 | by two to three orders of magnitude,
00:36:30.920 | which is quite significant.
00:36:32.520 | I'll give you an example.
00:36:33.560 | So if I told you on your cell phone,
00:36:35.720 | I'm gonna double the accuracy of its recognition,
00:36:39.080 | but your phone would die 300 times faster,
00:36:42.240 | who here would be interested in that technology?
00:36:44.640 | Right, so exactly, so nobody, right?
00:36:47.760 | So in the sense that battery life is so critical
00:36:50.200 | to how we actually use these types of technologies.
00:36:53.920 | So we should not just look at the accuracy,
00:36:56.800 | which is the x-axis point of view,
00:36:57.960 | we should really also consider the energy consumption,
00:37:01.000 | and we really don't want the energy to be so high.
00:37:03.480 | And we can see that even with specialized hardware,
00:37:06.180 | we're still quite far away from making neural nets
00:37:10.080 | as efficient as something like video compression
00:37:13.280 | that you all have on your phones.
00:37:14.800 | So we really have to think of how we can further
00:37:17.440 | push the energy consumption down
00:37:20.560 | without sacrificing accuracy, of course.
00:37:23.600 | So actually, there's been a huge amount of research
00:37:25.440 | in this space, because we know neural nets are popular,
00:37:28.120 | and we know that they have a wide range of applications,
00:37:29.920 | but energy's really a big challenge.
00:37:31.640 | So people have looked at how can we design new hardware
00:37:35.240 | that can be more efficient, or how can we design algorithms
00:37:38.240 | that are more efficient to enable energy-efficient
00:37:40.440 | processing of DNNs.
00:37:41.720 | And so in fact, within our own research group,
00:37:43.660 | we spend quite a bit of time kind of surveying the area
00:37:46.440 | and understanding what are the various different types
00:37:48.520 | of developments that people have been looking at.
00:37:50.440 | So if you're interested in this topic,
00:37:51.840 | we actually generated various tutorials on this material,
00:37:55.440 | as well as overview papers.
00:37:57.280 | This is an overview paper that's about 30 pages
00:37:59.840 | and we're currently expanding it into a book.
00:38:01.720 | So if you're interested in this topic,
00:38:02.920 | I would encourage you to visit these resources.
00:38:05.520 | But the main thing that we learned about
00:38:07.120 | as we were doing this kind of survey of the area,
00:38:10.000 | is that we actually identified various limitations
00:38:12.540 | in terms of how people are approaching
00:38:14.920 | or how the research is approaching this problem.
00:38:17.560 | So first let's look on the algorithm side.
00:38:20.680 | So again, there's a wide range of approaches
00:38:22.840 | that people are using to try and make the DNN algorithms
00:38:25.680 | or models more efficient.
00:38:27.240 | So for example, we've kind of mentioned
00:38:28.920 | the idea of pruning.
00:38:30.320 | The idea here is you're gonna set some of the weights
00:38:32.560 | to become zero, and again, anything times zero is zero,
00:38:36.120 | so you can skip those operations.
00:38:38.040 | So there's a wide range of research there.
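As a concrete example of the pruning idea, here is a minimal sketch of magnitude-based pruning; the threshold and weight values are purely illustrative.

```python
import numpy as np

def prune(weights, threshold=0.05):
    # Zero out weights whose magnitude falls below the threshold, so the
    # corresponding MACs (and data accesses) can be skipped.
    mask = np.abs(weights) >= threshold
    return weights * mask, 1.0 - mask.mean()   # pruned weights, fraction removed

w = np.array([0.40, -0.01, 0.03, -0.22, 0.002, 0.18])
pruned, sparsity = prune(w)
print(pruned)    # [ 0.4  -0.    0.   -0.22  0.    0.18]
print(sparsity)  # 0.5 -> half of the weights are now zero
```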
00:38:40.440 | There's also looking at efficient network architectures,
00:38:43.080 | meaning rather than making my neural networks very large,
00:38:45.240 | these high three-dimensional convolutions,
00:38:48.180 | can I decompose them into smaller filters?
00:38:50.920 | So rather than this 3D filter, can I make it a 2D filter
00:38:54.200 | that operates per channel, plus a one-by-one filter
00:38:57.080 | that runs along the channel dimension, into the screen here?
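As a rough illustration of why this decomposition helps, here is a small sketch that counts multiply-accumulates for a standard 3x3 convolution versus a depthwise 3x3 followed by a one-by-one (pointwise) convolution; the layer shape is a made-up example.

```python
def macs_standard_conv(h, w, c_in, c_out, k=3):
    # every output pixel needs a full k x k x c_in dot product per output channel
    return h * w * c_out * k * k * c_in

def macs_depthwise_separable(h, w, c_in, c_out, k=3):
    depthwise = h * w * c_in * k * k       # one k x k filter per input channel
    pointwise = h * w * c_out * c_in       # 1x1 conv mixes the channels
    return depthwise + pointwise

h, w, c_in, c_out = 56, 56, 128, 128      # hypothetical layer shape
std = macs_standard_conv(h, w, c_in, c_out)
sep = macs_depthwise_separable(h, w, c_in, c_out)
print(f"standard: {std:,} MACs, depthwise separable: {sep:,} MACs, ~{std/sep:.1f}x fewer")
```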
00:38:59.000 | Another very popular thing is reduced precision.
00:39:01.240 | So rather than using the default of 32-bit float,
00:39:04.400 | can I reduce the number of bits down to eight bits
00:39:07.320 | or even binary?
00:39:08.160 | We saw before that as we reduce the precision
00:39:10.920 | of these operations, you also get energy savings,
00:39:12.980 | and you also reduce data movement as well
00:39:14.840 | 'cause you have to move less data.
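A minimal sketch of what reduced precision looks like in practice: uniform 8-bit quantization of a float32 weight tensor with a single scale factor. This is a simplified symmetric scheme; real quantizers add per-channel scales, zero points, and often quantization-aware training.

```python
import numpy as np

def quantize_int8(w):
    """Map float32 weights to int8 with one scale factor (simplified symmetric scheme)."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(w - dequantize(q, s)).mean()
print(f"storage: {w.nbytes} -> {q.nbytes} bytes (4x less data to move), mean abs error {err:.4f}")
```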
00:39:16.800 | A lot of this work really focuses on reducing
00:39:19.220 | the number of MACs and the number of weights,
00:39:23.080 | and those are primarily because those are easy to count.
00:39:25.980 | But the question that we should be asking
00:39:27.720 | if we care about the system is does this actually translate
00:39:31.280 | into energy savings and reduced latency?
00:39:33.760 | Because from a system's point of view,
00:39:35.640 | those are the things that we care about.
00:39:37.640 | We don't really, when you're thinking about something
00:39:39.400 | running on your phone, you don't care about the number
00:39:40.800 | of MACs and weights, you care about how much energy
00:39:42.520 | it's consuming 'cause that's gonna affect the battery life,
00:39:44.760 | or how quickly it might react.
00:39:47.840 | That's basically a measure of latency.
00:39:49.880 | And again, hopefully you haven't forgotten,
00:39:51.560 | but basically data movement is expensive.
00:39:53.680 | It really depends on how you move the data
00:39:58.080 | through the system.
00:39:58.900 | So the key takeaway from this slide is that if you remember
00:40:01.520 | where the energy comes from, which is the data movement,
00:40:04.360 | it's not because of how many weights or how many MACs you
00:40:06.960 | have, but really it depends on where the weight comes from.
00:40:10.320 | If it comes from this small memory register file
00:40:14.240 | that's nearby, it's gonna be super cheap as opposed
00:40:16.680 | to coming from off-chip DRAM.
00:40:18.720 | So all weights are basically not created equal,
00:40:21.160 | all MACs are not created equal.
00:40:22.400 | It really depends on the memory hierarchy
00:40:24.240 | and the data flow of the hardware itself.
00:40:26.240 | So we can't just look at the number of weights
00:40:29.720 | and the number of MACs and estimate how much energy
00:40:32.280 | is gonna be consumed.
00:40:33.960 | So this is quite a difficult challenge.
00:40:35.440 | So within our group, we've actually looked
00:40:37.360 | at developing different tools that allow us
00:40:39.320 | to estimate the energy consumption
00:40:41.480 | of the neural network itself.
00:40:42.840 | So for example, in this particular tool,
00:40:44.360 | which is available on this website,
00:40:46.760 | we basically take in the DNN weights and the input data,
00:40:50.320 | including its sparsity.
00:40:51.920 | We know the different shapes of the different layers
00:40:55.480 | of the neural net, and we run an optimization
00:40:57.600 | that figures out the memory access,
00:40:59.440 | how much energy consumed by the data movement,
00:41:01.880 | and then the energy consumed by the multiply
00:41:04.040 | and accumulate computations,
00:41:06.280 | and then the output is gonna be a breakdown
00:41:08.240 | of the energy for the different layers
00:41:10.160 | of the neural network.
00:41:11.360 | And once you have this, you can kind of figure out,
00:41:13.400 | well, where is the energy going so I can target my design
00:41:16.800 | to minimize that energy consumption?
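The actual tool performs a hardware-aware optimization, but the flavor of its output can be sketched with a toy model: charge each data access a cost that depends on which level of the memory hierarchy it comes from, add the MAC energy, and report a per-layer breakdown. The relative costs and access counts below are illustrative placeholders, not measured numbers.

```python
# Illustrative relative energy costs (normalized to one MAC); real values come from the hardware.
COST = {"mac": 1.0, "register_file": 1.0, "global_buffer": 6.0, "dram": 200.0}

def layer_energy(macs, accesses):
    """accesses: dict mapping memory level -> number of reads/writes for this layer."""
    energy = macs * COST["mac"]
    for level, count in accesses.items():
        energy += count * COST[level]
    return energy

layers = {
    "conv1": (1.0e8, {"register_file": 3.0e8, "global_buffer": 2.0e7, "dram": 1.0e6}),
    "conv2": (2.5e8, {"register_file": 7.5e8, "global_buffer": 5.0e7, "dram": 4.0e6}),
}
for name, (macs, accesses) in layers.items():
    print(name, f"{layer_energy(macs, accesses):.3e} (normalized energy units)")
```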
00:41:18.800 | Okay, and so by doing this, when we take a look,
00:41:23.280 | it should be no surprise, one of the key observations
00:41:25.480 | for this exercise is that the weights alone
00:41:28.720 | are not a good metric for energy consumption.
00:41:31.120 | If you take a look at GoogLeNet, for example,
00:41:34.400 | this is running on the Eyeriss architecture,
00:41:36.160 | you can see that the weights only account
00:41:37.960 | for 22% of the overall energy.
00:41:41.320 | In fact, a lot of the energy goes
00:41:43.120 | into moving the input feature maps
00:41:44.720 | and the output feature maps as well, right?
00:41:47.000 | And also computation.
00:41:48.080 | So in general, this is the same message as before.
00:41:50.800 | We shouldn't just look at the data movement
00:41:53.120 | in one particular data type.
00:41:54.440 | We should look at the energy consumption
00:41:55.800 | of all the different data types
00:41:57.080 | to give us an overall view
00:41:58.760 | of where the energy's actually going.
00:42:01.200 | Okay, and so once we actually know
00:42:03.240 | where the energy is going, how can we factor that
00:42:06.240 | into the design of the neural networks
00:42:08.040 | to make them more efficient?
00:42:09.800 | So we talked about the concept of pruning, right?
00:42:13.160 | So again, pruning was setting some of the weights
00:42:15.400 | of the neural net to zero, or you can think of it
00:42:17.280 | as removing some of the weights.
00:42:18.720 | And so what we wanna do here is that now we know
00:42:21.160 | that we know where the energy is going,
00:42:23.080 | why don't we incorporate the energy
00:42:25.320 | into the design of the algorithm,
00:42:27.120 | for example, to guide us to figure out
00:42:29.080 | where we should actually remove the weights from?
00:42:31.560 | You know, so for example,
00:42:33.160 | let's say here, this is on AlexNet
00:42:36.680 | for the same accuracy across the different approaches.
00:42:39.040 | Traditionally, what happens is that people tend
00:42:41.120 | to remove the weights that are small.
00:42:43.120 | And we call this magnitude-based pruning,
00:42:45.720 | and you can see that you get about a 2x reduction
00:42:48.720 | in terms of energy consumption.
00:42:50.680 | However, we know that like the size of the weight
00:42:53.200 | has nothing to do with, or the value of the weight
00:42:54.920 | has nothing to do with the energy consumption.
00:42:56.400 | Ideally, what you'd like to do is remove the weights
00:42:59.680 | that consume the most energy, right?
00:43:02.160 | In particular, we also know that the more weights
00:43:04.000 | that we remove, the accuracy is gonna go down.
00:43:07.480 | So to get the biggest bang for your buck,
00:43:08.760 | you wanna remove the weights
00:43:09.800 | that consume the most energy first.
00:43:11.760 | One way you can do this is you can take your neural network,
00:43:15.560 | figure out the energy consumption
00:43:17.320 | of each of the layers of the neural network.
00:43:19.560 | You can sort, then sort the layers
00:43:21.440 | in terms of high energy layer to low energy layers,
00:43:25.280 | and then you prune the high energy layers first.
00:43:28.320 | So this is what we call energy-aware pruning.
00:43:30.360 | And then by doing this, you actually now get
00:43:32.720 | a 3.7x reduction in energy consumption
00:43:35.640 | compared to 2x for the same accuracy.
00:43:38.240 | And again, this is because we factor in energy consumption
00:43:41.520 | into the design of the neural network itself.
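A sketch of the ordering idea behind energy-aware pruning: estimate each layer's energy (with a model like the one above), sort layers from most to least energy, and prune the high-energy layers first, keeping a change only if accuracy holds. The prune_layer and evaluate helpers are hypothetical placeholders supplied by the training framework.

```python
def energy_aware_prune(model, layer_energies, accuracy_floor, prune_layer, evaluate):
    """Prune layers in order of decreasing estimated energy, preserving a minimum accuracy.

    layer_energies: dict layer_name -> estimated energy
    prune_layer, evaluate: hypothetical helpers (prune some weights; measure accuracy).
    """
    order = sorted(layer_energies, key=layer_energies.get, reverse=True)  # high energy first
    for layer in order:
        candidate = prune_layer(model, layer)          # remove some weights from this layer
        if evaluate(candidate) >= accuracy_floor:      # keep the change only if accuracy holds
            model = candidate
    return model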
00:43:43.880 | All right, and the pruned models
00:43:46.800 | are all available on the Eyeriss website.
00:43:49.760 | Another important thing that we care about
00:43:52.120 | from a performance point of view is latency, right?
00:43:55.240 | So for example, latency has to do with how long it takes
00:43:58.000 | when I give it an image, how long will I get the result back?
00:44:01.720 | People are very sensitive to latency.
00:44:04.400 | But the challenge here is that latency, again,
00:44:06.200 | is not directly correlated to things
00:44:08.040 | like number of multiplies and accumulates.
00:44:10.040 | And so this is some data that was released
00:44:11.680 | by Google's Mobile Vision team,
00:44:13.880 | and they're showing here on the x-axis
00:44:17.320 | the number of multiplies and accumulates.
00:44:19.760 | So going along this axis, the number of MACs is increasing.
00:44:22.520 | And then on the y-axis, this is the latency.
00:44:25.560 | So this is actually the measured latency
00:44:28.080 | or delay it takes to get a result.
00:44:30.440 | And what they're showing here is that the number of MACs
00:44:33.520 | is not really a good approximation of latency.
00:44:35.920 | So in fact, for example, given
00:44:39.560 | neural networks that have the same number of MACs,
00:44:41.880 | there can be a 2x range or 2x swing in terms of latency.
00:44:45.920 | Or looking at it in a different way,
00:44:47.640 | given neural nets of the same latency,
00:44:50.680 | they can have a 3x swing in terms of number of MACs.
00:44:55.240 | So the key takeaway here is that you can't just count
00:44:57.120 | the number of MACs and say,
00:44:58.000 | oh, this is how quickly it's gonna run.
00:44:59.880 | It's actually much more challenging than that.
00:45:04.440 | And so what we want to ask is,
00:45:06.680 | is there a way that we can take latency
00:45:09.360 | and use that again to design the neural net directly?
00:45:12.200 | So rather than looking at MACs, use latency.
00:45:14.720 | And so together with Google's Mobile Vision team,
00:45:17.800 | we developed this approach called NetAdapt.
00:45:20.120 | And this is really a way that you can tailor
00:45:22.000 | your particular neural network for a given mobile platform
00:45:25.720 | for a latency or an energy budget.
00:45:27.760 | So it automatically adapts the neural net
00:45:29.520 | for that platform itself.
00:45:30.760 | And really what's driving the design
00:45:33.320 | is empirical measurements.
00:45:34.720 | So measurements of how that particular network
00:45:37.800 | performs on that platform.
00:45:39.760 | So measurements for things like latency and energy.
00:45:42.560 | And the reason why we want to use empirical measurements
00:45:44.480 | is that you can't often generate models
00:45:46.880 | for all the different types of hardware out there.
00:45:48.960 | In the case of Google, what they want is that,
00:45:51.400 | if they have a new phone, you can automatically tune
00:45:53.880 | the network for that particular phone.
00:45:55.400 | You don't want to have to model the phone as well.
00:45:57.640 | Okay, and so how does this work?
00:45:59.200 | I'll walk you through it.
00:46:00.040 | So you'll start off with a pre-trained network.
00:46:01.960 | So this is a network that's, let's say,
00:46:03.440 | trained in the cloud for very high accuracy.
00:46:07.000 | Great, start off with that,
00:46:08.400 | but it tends to be very large, let's say.
00:46:10.480 | And so what you're gonna do is you're gonna take that
00:46:12.280 | into the NetAdapt algorithm.
00:46:14.200 | You're gonna take a budget.
00:46:15.320 | So a budget will tell you like,
00:46:16.400 | oh, I can afford only this type of latency
00:46:18.800 | or this amount of latency, this amount of energy.
00:46:21.280 | What NetAdapt will do is gonna generate
00:46:23.520 | a bunch of proposals, so different options
00:46:26.000 | of how it might modify the network
00:46:27.720 | in terms of its dimensions.
00:46:29.400 | It's gonna measure these proposals
00:46:31.080 | on that target platform that you care about.
00:46:34.840 | And then based on these empirical measurements,
00:46:36.960 | NetAdapt is gonna then generate a new set of proposals.
00:46:39.840 | And it'll just iterate across this
00:46:41.880 | until it gets an adapted network as an output.
00:46:45.360 | Okay, and again, all of this is on the NetAdapt website.
00:46:48.400 | Just to give you a quick example of how this might work.
00:46:50.480 | So let's say you start off with, as your input,
00:46:53.320 | a neural network that has the accuracy that you want,
00:46:56.400 | but the latency is 100 milliseconds,
00:46:58.600 | and you would like for it to be 80 milliseconds.
00:47:00.800 | You want it to be faster.
00:47:02.440 | So what it's gonna do is it's gonna generate
00:47:04.160 | a bunch of proposals.
00:47:05.440 | And what the proposals could involve doing
00:47:07.480 | is taking one layer of the neural net
00:47:09.840 | and reducing the number of channels
00:47:11.640 | until it hits the latency budget of 80 milliseconds.
00:47:15.760 | And it can do that for all the different layers.
00:47:18.520 | Then it's gonna tune these different layers
00:47:20.160 | and measure the accuracy.
00:47:22.160 | Right, so let's say, oh, this one where I just
00:47:24.480 | shortened the number of channels in layer one
00:47:26.920 | maintains accuracy at 60%.
00:47:28.680 | So that means I'm gonna pick that,
00:47:30.120 | and that's gonna be the input,
00:47:31.720 | or the output of this particular design.
00:47:34.040 | So the output at 80 milliseconds
00:47:36.520 | hitting an accuracy of 60%,
00:47:37.880 | and it's gonna be the input to the next iteration.
00:47:39.840 | And then I'm gonna tighten the budget.
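A highly simplified sketch of that loop, assuming hypothetical helpers for measuring latency on the target device, shrinking a layer to meet the budget, short fine-tuning, and evaluating accuracy; the real NetAdapt algorithm has more machinery (lookup tables, long-term fine-tuning), so treat this as the skeleton only.

```python
def netadapt(network, latency_target, step, measure_latency, shrink_layer, finetune, accuracy):
    """Iteratively adapt `network` until its measured latency meets `latency_target`."""
    budget = measure_latency(network)
    while budget > latency_target:
        budget -= step                                   # tighten the budget each iteration
        proposals = []
        for layer in network.layers:                     # one proposal per layer (assumed attribute)
            candidate = shrink_layer(network, layer, budget, measure_latency)
            proposals.append(finetune(candidate))        # short fine-tune before comparing
        network = max(proposals, key=accuracy)           # keep the most accurate proposal
    return network
```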
00:47:41.880 | Okay, again, if you're interested,
00:47:43.520 | I just invite you to go take a look at the NetAdapt paper.
00:47:46.120 | But what are the, what is the impact
00:47:48.280 | of this particular approach?
00:47:49.720 | Well, it gives you actually a very much improved trade-off
00:47:52.920 | between latency and accuracy, right?
00:47:55.760 | So if you look at this plot again,
00:47:57.560 | on the x-axis is the latency, right?
00:48:00.520 | So to the left is better, so it's lower latency.
00:48:04.440 | And then on the y-axis,
00:48:07.400 | it's gonna be the accuracy, so higher is better.
00:48:09.120 | So here, higher and to the left is good.
00:48:12.600 | And so we have first shown in blue and green
00:48:15.120 | various kind of handcrafted neural network-based approaches.
00:48:18.960 | And you can see NetAdapt, which generates the red dots
00:48:22.920 | as it's iterating through its optimization.
00:48:25.240 | And you can see that it achieves,
00:48:27.280 | for the same accuracy, it can be up to 1.7x faster
00:48:31.240 | than a manually designed approach.
00:48:34.160 | This approach also falls under the umbrella of
00:48:38.360 | basically network architecture search;
00:48:39.800 | it's kind of in that same flavor.
00:48:42.160 | But in general, the takeaway here is that
00:48:43.960 | if you're gonna design neural networks
00:48:45.680 | or efficient neural networks,
00:48:47.320 | that you wanna run quickly or you wanna be energy efficient,
00:48:50.360 | you should really take, you know,
00:48:51.640 | put hardware into the design loop
00:48:53.320 | and take in, you know, the accurate energy
00:48:56.160 | or latency measurements into the design itself
00:48:58.160 | of the neural network.
00:48:59.280 | This particular, you know, example here is shown
00:49:02.960 | for an image classification task,
00:49:04.920 | meaning I give you an image
00:49:06.240 | and you can classify it to the right.
00:49:08.720 | You can say what's in the image itself.
00:49:10.560 | You can imagine that that type of approach
00:49:12.120 | is kind of like reducing information, right?
00:49:14.000 | From a 2D image, you reduce it down to a label.
00:49:16.840 | This is very commonly used.
00:49:19.000 | But we actually want to see if we can still apply
00:49:20.720 | this approach to a more difficult task
00:49:23.120 | of something like depth estimation.
00:49:24.840 | In this case, you know, I give you a 2D image
00:49:27.640 | and the output is also a 2D image
00:49:29.720 | where each pixel shows the depth,
00:49:33.160 | or, you know, the output picture
00:49:34.680 | is basically showing the depth of each pixel of the input.
00:49:37.720 | This is often what we'd refer to as, you know,
00:49:40.040 | monocular depth.
00:49:41.000 | So I give you just a 2D image
00:49:44.680 | as input and you can estimate the depth itself.
00:49:46.400 | The reason why you want to do this is, you know,
00:49:47.960 | 2D cameras, regular cameras are pretty cheap, right?
00:49:50.920 | So it'd be ideal to be able to do this.
00:49:52.880 | You can imagine like the way that we would do this
00:49:55.880 | is to use an autoencoder.
00:49:57.200 | So the front half of the neural net
00:49:59.040 | is still looking like a, what we call an encoder.
00:50:01.360 | It's a reduction element.
00:50:02.960 | So this is very similar to what you would do
00:50:04.560 | for a classification, but then the backend
00:50:06.800 | of the autoencoder is a decoder.
00:50:09.080 | So it's going to expand the information back out, right?
00:50:11.760 | And so, as I mentioned, again,
00:50:12.840 | this is going to be much more difficult
00:50:14.280 | than just classification because now my output
00:50:17.160 | has to be also very dense as well.
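A minimal PyTorch sketch of the encoder-decoder shape being described: a small encoder that downsamples, and a decoder that upsamples back to a dense per-pixel depth map. This is only the structural idea, not the actual FastDepth architecture (which uses a MobileNet encoder and depthwise-separable decoder layers).

```python
import torch
import torch.nn as nn

class TinyDepthNet(nn.Module):
    """Toy encoder-decoder: RGB image in, one depth value per pixel out."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(               # downsample: 'reduce' the information
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(               # upsample: expand back to a dense map
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

depth = TinyDepthNet()(torch.randn(1, 3, 224, 224))
print(depth.shape)  # torch.Size([1, 1, 224, 224])
```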
00:50:19.160 | And so we want to see if we could make this really fast
00:50:22.800 | with approaches that we just talked about,
00:50:24.440 | for example, NetAdapt.
00:50:26.280 | So indeed you can make it pretty fast.
00:50:28.120 | So if you apply NetAdapt plus the, you know,
00:50:30.240 | compact network design and then do some
00:50:32.240 | depth-wise decomposition, you can actually increase
00:50:36.200 | the frame rate by an order of magnitude.
00:50:37.920 | So again, here I'm going to show the plot.
00:50:39.560 | On the x-axis, here is the frame rate
00:50:42.080 | on a Jetson TX2 GPU.
00:50:44.320 | This is measured with a batch size of one
00:50:46.480 | with 32-bit float.
00:50:48.280 | And on the vertical axis, it's the accuracy,
00:50:51.480 | the depth estimation in terms of the delta one metric,
00:50:53.920 | which means the percentage of pixels
00:50:55.800 | that are within 25% of the correct depth.
00:50:58.960 | So higher, the better.
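For reference, the delta-one metric being plotted can be computed as below: the fraction of pixels where the ratio between predicted and true depth stays within 1.25, i.e., within 25%.

```python
import numpy as np

def delta1(pred, gt):
    """Fraction of pixels whose predicted depth is within 25% of the ground truth."""
    ratio = np.maximum(pred / gt, gt / pred)
    return np.mean(ratio < 1.25)

gt = np.random.uniform(0.5, 10.0, size=(480, 640))
pred = gt * np.random.uniform(0.8, 1.2, size=gt.shape)   # toy predictions
print(f"delta1 = {delta1(pred, gt):.3f}")
```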
00:51:00.720 | And so you can see, you know, the various different
00:51:02.600 | approaches out there.
01:00:04.240 | This star, the red star, is the approach, FastDepth,
01:00:07.520 | using all the different efficient
01:00:09.800 | network design techniques that we talked about.
00:51:11.240 | And you can see you can get an order of magnitude
00:51:13.040 | over a 10x speedup while maintaining accuracy.
00:51:17.240 | And the models and all the code to do this
01:00:18.840 | is available on the FastDepth website.
00:51:21.240 | We presented this at ICRA, which is a robotics conference
00:51:25.000 | in the middle of last year.
00:51:26.120 | And we wanted to show some live footage there.
00:51:28.000 | So at ICRA, we actually captured some footage on an iPhone
00:51:31.880 | and showed the real-time depth estimation on an iPhone itself.
00:51:35.320 | And you can achieve about 40 frames per second on an iPhone
00:51:38.320 | using FastDepth.
00:51:39.680 | So again, if you're interested in this particular type
00:51:42.520 | of application or efficient networks for depth estimation,
00:51:44.960 | I invite you to visit the website for that.
00:51:47.800 | OK, so that's the algorithmic side of things.
00:51:49.760 | But let's return to the hardware,
00:51:51.200 | building specialized hardware that
00:51:52.680 | are efficient for neural network processing.
00:51:56.160 | So again, we saw that there's many different ways
00:51:59.040 | of making the neural network efficient,
00:52:01.480 | from network pruning to efficient network
00:52:03.760 | architectures to reduce precision.
00:52:05.880 | The challenge for the hardware designer,
00:52:08.040 | though, is that there's no guarantee
00:52:09.880 | as to which type of approach someone
00:52:12.840 | might apply to the algorithm that they're going
00:52:14.880 | to run on the hardware.
00:52:15.920 | So if you only own the hardware, you
00:52:17.440 | don't know what kind of algorithm
00:52:19.200 | someone's going to run on your hardware
00:52:20.200 | unless you own the whole stack.
00:52:21.760 | So as a result, you really, really
00:52:23.520 | need to have flexible hardware so it
00:52:25.560 | can support all of these different approaches
00:52:27.680 | and translate these approaches to improvements in energy
00:52:31.400 | efficiency and latency.
00:52:33.600 | Now, the challenge is a lot of the specialized DNN hardware
00:52:37.920 | that exist out there often rely on certain properties of the DNN
00:52:42.600 | in order to achieve high efficiency.
00:52:44.520 | So a very typical structure that you might see
00:52:47.240 | is that you might have an array of multiply and accumulate
00:52:50.120 | units, so a MAC array.
00:52:51.560 | And it's going to reduce memory access
00:52:54.520 | by amortizing reads across arrays.
00:52:56.400 | What do I mean by that?
00:52:57.600 | So if I read a weight once from the memory,
00:53:00.680 | weight memory once, I'm going to reuse it multiple times
00:53:02.880 | across the array.
00:53:03.760 | Send it across the array, so one read,
00:53:06.120 | and it can be used multiple times by multiple engines
00:53:08.720 | or multiple MACs.
00:53:10.080 | Similarly, activation memory, I'm
00:53:12.120 | going to read the input activation once
00:53:13.800 | and reuse it multiple times.
00:53:16.880 | The issue here is that the amount of reuse
00:53:20.320 | and the array utilization depends on the number of channels
00:53:23.800 | you have on your neural net, the size of the feature map,
00:53:26.080 | and the batch size.
00:53:27.400 | So this is, again, just showing two different variations of--
00:53:30.160 | you're going to reuse based on the number of filters, number
00:53:32.880 | of input channels, feature map, batch size.
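As a rough way to see how reuse depends on the layer shape, the sketch below counts how many times a single weight and a single input activation can be reused in a convolutional layer (ignoring stride and border effects); the formulas are standard bookkeeping, and the shapes, including the compact depthwise case discussed next, are made-up examples.

```python
def reuse_per_weight(out_h, out_w, batch):
    # each filter weight is applied at every output position, for every image in the batch
    return out_h * out_w * batch

def reuse_per_activation(num_filters, k):
    # each input activation is touched by every filter, at up to k*k positions of that filter
    return num_filters * k * k

# a large "classic" layer vs. a compact depthwise layer (hypothetical shapes)
print("large layer, weight reuse:", reuse_per_weight(56, 56, batch=4),
      "activation reuse:", reuse_per_activation(num_filters=256, k=3))
print("depthwise layer, weight reuse:", reuse_per_weight(56, 56, batch=1),
      "activation reuse:", reuse_per_activation(num_filters=1, k=3))
```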
00:53:35.960 | And the problem now is that when we
00:53:37.360 | start looking at these efficient neural network models,
00:53:40.520 | they're not going to have as much reuse,
00:53:42.560 | particularly for the compact cases.
00:53:44.560 | So for example, a very typical approach
00:53:46.360 | is to use what we call depth-wise layers.
00:53:48.080 | We saw you took that 3D filter and then decomposed it
00:53:51.320 | into a 2D filter and a one-by-one.
00:53:54.480 | And so as a result, you only have one channel.
00:53:56.400 | So you're not going to have much reuse across the input channel.
00:53:59.520 | And so rather than filling this array with a lot of computation
00:54:03.720 | that you can process, you're only
00:54:05.160 | going to be able to utilize a very small subset, which
00:54:07.640 | I've highlighted here in green, of the array itself
00:54:09.760 | for computation.
00:54:10.840 | So even though you throw down 1,000
00:54:13.680 | or 10,000 multiply-and-accumulate engines,
00:54:15.800 | only a very small subset of them can actually do work.
00:54:19.880 | And that's not great.
00:54:20.760 | So this is also an issue because as I scale up the array size,
00:54:24.640 | it's going to become less efficient.
00:54:26.100 | Ideally, what you would like is that if I put more, you know,
00:54:28.920 | cores or processing elements down,
00:54:31.080 | the system should run faster, right?
00:54:32.600 | I'm paying for more cores, after all.
00:54:34.600 | But it doesn't, because the data can't reach or be
00:54:38.480 | reused by all of these different cores,
00:54:40.480 | and it's also going to be difficult to exploit sparsity.
00:54:42.560 | So what you need here are two things.
00:54:44.880 | One is a very flexible data flow,
00:54:47.760 | meaning that there's many different ways for the data
00:54:49.960 | to move through this array, right?
00:54:53.120 | And so you can imagine row stationary is a very flexible
00:54:56.120 | way that we can basically map the neural network
00:54:58.120 | onto the array itself.
00:54:59.040 | You can see here in the Eyeriss or row stationary case
00:55:01.800 | that a lot of the processing elements can be used.
00:55:04.640 | Another thing is how do you actually
00:55:06.120 | deliver the data for this varying degree of reuse?
00:55:10.040 | So here's like the spectrum of on-chip networks
00:55:13.720 | in terms of basically how can I deliver data
00:55:15.800 | from that global buffer to all those parallel processing
00:55:19.400 | engines, right?
00:55:21.360 | One use case is when I use these huge neural nets that
00:55:24.120 | have a lot of reuse.
00:55:25.320 | What I want to do is multicast, meaning
00:55:27.080 | I read once from the global buffer,
00:55:29.360 | and then I reuse that data multiple times
00:55:31.320 | in all of my processing elements.
00:55:32.680 | You can think of that as like broadcasting information out.
00:55:35.360 | And a type of network that you would do for that
00:55:37.560 | is shown here on the right-hand side.
00:55:39.480 | So this is low bandwidth, so I'm only reading very little data,
00:55:42.800 | but high spatial reuse.
00:55:44.160 | Many, many engines are using it.
00:55:46.680 | On the other extreme, when I design
00:55:49.600 | these very efficient neural networks,
00:55:51.180 | I'm not going to have very much reuse.
00:55:53.160 | And so what I want is unicast, meaning
00:55:54.920 | I want to send out unique information
00:55:58.080 | to each of the processing elements
00:56:00.280 | so that they can all work.
00:56:02.480 | So that's going to be, as shown here on the left-hand side,
00:56:05.520 | a case where you have very high bandwidth,
00:56:07.320 | a lot of unique information going out,
00:56:10.760 | and low spatial reuse.
00:56:11.760 | You're not sharing data.
00:56:13.280 | Now, it's very challenging to go across this entire spectrum.
00:56:16.680 | One solution would be what we call an all-to-all network
00:56:20.360 | that satisfies all of this.
00:56:21.680 | So all things are-- all inputs are connected to all inputs.
00:56:24.080 | It's going to be very expensive and not scalable.
00:56:27.760 | One solution that we have to this
00:56:29.360 | is what we call a hierarchical mesh.
00:56:30.860 | So you can break this problem into two steps.
00:56:33.040 | At the lowest level, you can use an all-to-all connection.
00:56:37.960 | And then at the higher level, you can use a mesh connection.
00:56:41.360 | And so the mesh will allow you to scale up.
00:56:44.000 | But the all-to-all allows you to achieve
00:56:45.840 | a lot of different types of reuse.
00:56:47.260 | And with this type of network on chip,
00:56:49.320 | you can basically support a lot of different delivery
00:56:51.560 | mechanisms to deliver data from the global buffer
00:56:54.480 | to all the processing elements so that all your cores,
00:56:57.520 | all your computes can be happening at the same time.
00:56:59.840 | And at its core, this is one of the key things
00:57:02.640 | that enable the second version of Eyeriss
00:57:04.760 | to be both flexible and efficient.
00:57:07.720 | So this is some results from the second version of Eyeriss.
00:57:11.480 | It supports a wide range of filter shapes,
00:57:13.520 | both the very large shapes as well as very compact,
00:57:18.400 | including convolutional, fully connected, and depth-wise layers.
00:57:21.040 | So you can see here in this plot, depending on the shape,
00:57:25.200 | you can get up to an order of magnitude speed up.
00:57:28.400 | It also supports a wide range of sparsities, both dense
00:57:30.840 | and sparse.
00:57:32.100 | So this is really important because some networks
00:57:34.100 | can be very sparse because you've
00:57:35.140 | done a lot of pruning.
00:57:36.280 | But some are not.
00:57:37.100 | And so you want to efficiently support all of those.
00:57:39.720 | You also want to be scalable.
00:57:40.960 | So as you increase the number of processing elements,
00:57:44.840 | the throughput also speeds up.
00:57:47.360 | And as a result of this particular type of design,
00:57:50.080 | you get an order of magnitude improvement
00:57:52.000 | in both speed and energy efficiency.
00:57:55.760 | All right, so this is great.
00:57:56.920 | And this is one way that you can speed up and make
00:57:59.800 | neural networks more efficient.
00:58:01.920 | But it's also important to take a step back and look
00:58:04.160 | beyond just the specialized hardware,
00:58:06.800 | the accelerator itself, both in terms
00:58:08.900 | of the algorithms and the hardware.
00:58:11.020 | So can we look beyond the DNN accelerator for acceleration?
00:58:15.300 | And so one good place to show this as an example
00:58:17.700 | is the task of super resolution.
00:58:19.740 | So how many of you are familiar with the task of super
00:58:21.980 | resolution?
00:58:23.140 | All right, so for those of you who aren't, the idea is
00:58:25.340 | as follows.
00:58:26.020 | So I want to basically generate a high-resolution image
00:58:30.060 | from a small-resolution image.
00:58:32.180 | And why do you want to do that?
00:58:33.460 | Well, there are a couple of reasons.
00:58:34.980 | One is that it can allow you to basically reduce
00:58:38.060 | the transmit bandwidth.
00:58:39.260 | So for example, if you have limited communication,
00:58:41.340 | I'm going to send a low-res version of a video,
00:58:43.780 | let's say, or image to your phone.
00:58:45.420 | And then your phone can make it high-res.
00:58:47.380 | That's one way.
00:58:48.740 | Another reason is that screens in general
00:58:51.420 | are getting larger and larger.
00:58:52.700 | So every year at CES, they announce a higher-resolution
00:58:55.300 | screen.
00:58:56.060 | But if you think about the movies that we watch,
00:58:58.700 | a lot of them are still 1080p, for example,
00:59:01.380 | or fixed resolution.
00:59:02.580 | So again, you want to generate a high-resolution
00:59:04.700 | representation of that low-resolution input.
00:59:09.260 | And the idea here is that your high-resolution is not
00:59:11.460 | just interpolation, because it can be very blurry,
00:59:13.460 | but there's ways that kind of hallucinate
00:59:15.420 | a high-resolution version of the video or image itself.
00:59:20.060 | And that's basically called super-resolution.
00:59:23.100 | But one of the challenges for super-resolution
00:59:25.580 | is that it's computationally very expensive.
00:59:27.580 | So again, you can imagine that the state-of-the-art approaches
00:59:30.160 | for super-res use deep neural nets.
00:59:32.340 | A lot of the examples we just talked about
00:59:34.140 | about neural nets are talking about input images
00:59:36.180 | of 200 by 200 pixels.
00:59:38.140 | Now imagine if you extend that to an HD image.
00:59:40.820 | It's going to be very, very expensive.
00:59:42.860 | So what we want to do is think of different ways
00:59:45.300 | that we can speed up the super-resolution process,
00:59:48.420 | not just by making DNNs faster, but kind
00:59:51.060 | of looking around the other components of the system
00:59:54.220 | and seeing if we can make it faster as well.
00:59:56.060 | So one of the approaches we took is this framework called FAST,
00:59:59.420 | where we're looking at accelerating
01:00:00.860 | any super-resolution algorithm by an order of magnitude.
01:00:03.740 | And this is operated on a compressed video.
01:00:06.100 | So before I was a faculty here, I
01:00:09.100 | worked a lot on video compression.
01:00:10.900 | And if you think about the video compression community,
01:00:14.300 | they look at video very differently than people
01:00:17.460 | who process super-resolution.
01:00:18.620 | So typically, when you're thinking
01:00:20.040 | about image processing or super-resolution,
01:00:22.020 | when I give you a compressed video, what you basically
01:00:24.700 | think of it is as a stack of pixels,
01:00:27.500 | a bunch of different images together.
01:00:29.300 | But if you asked a video compression person,
01:00:31.900 | what does a compressed video look like?
01:00:33.580 | Actually, a compressed video is a very structured
01:00:37.460 | representation of the redundancy in the video itself.
01:00:41.260 | So why is it that we can compress videos?
01:00:43.100 | It's because things like different frames
01:00:44.900 | look very-- consecutive frames look very similar.
01:00:47.500 | So it's telling you which pixels in frame 1
01:00:50.540 | is related to which pixel or looks
01:00:52.220 | like which pixel in frame 2.
01:00:53.820 | And so as a result, you don't have
01:00:55.220 | to send the pixels in frame 2.
01:00:56.740 | And that's where you get the compression from.
01:00:58.660 | So actually, what a compressed video looks like
01:01:00.620 | is a description of the structure of the video itself.
01:01:05.780 | And so you can use this representation
01:01:07.580 | to accelerate super-resolution.
01:01:09.700 | So for example, rather than applying super-resolution
01:01:14.100 | to every single low-res frame, which is the typical approach--
01:01:16.980 | so you would apply super-resolution
01:01:18.440 | to each low-res frame, and you would generate a bunch
01:01:20.780 | of high-res frame outputs--
01:01:22.980 | what you can actually do is apply super-resolution
01:01:26.140 | to one of the small low-resolution frames.
01:01:29.420 | And then you can use that free information
01:01:31.700 | you get in the compressed video that tells you
01:01:33.540 | the structure of the video to generate or transfer
01:01:36.780 | and generate all those high-resolution videos
01:01:39.740 | from that.
01:01:40.700 | And so it only needs to run on a subset of frames.
01:01:43.140 | And then the complexity to reconstruct
01:01:45.180 | all those high-resolution frames once you
01:01:47.140 | have that structured image is going to be very low.
01:01:49.940 | So for example, if I'm going to transfer to n frames,
01:01:53.780 | I'm going to get roughly an n-times speedup.
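A schematic sketch of that transfer idea, assuming hypothetical super_resolve and transfer helpers: run the expensive super-resolution network on one anchor frame, then for the following frames reuse the already-computed high-resolution pixels by following the block motion vectors that the compressed bitstream provides for free. The real FAST framework also handles residuals and decides adaptively when to re-run the network.

```python
def fast_super_resolution(frames, motion_vectors, super_resolve, transfer, group_size=4):
    """Super-resolve only every `group_size`-th frame; transfer the result to the rest."""
    outputs = []
    for i, frame in enumerate(frames):
        if i % group_size == 0:
            current_hr = super_resolve(frame)            # expensive DNN, run on 1 of N frames
        else:
            # cheap: move the previous high-res pixels along the bitstream's motion vectors
            current_hr = transfer(current_hr, motion_vectors[i])
        outputs.append(current_hr)
    return outputs
```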
01:01:57.100 | So to evaluate this, we showcase this on a range of videos.
01:02:00.260 | So this range of videos is the data
01:02:01.740 | set that we use to develop video standards.
01:02:03.620 | So it's quite broad.
01:02:05.220 | And you can see, first, on the left-hand side
01:02:07.540 | is that if I transfer to four different frames,
01:02:11.460 | you can get a 4x acceleration.
01:02:13.060 | And then the PSNR, which indicates the quality,
01:02:15.940 | doesn't change.
01:02:16.700 | So it's the same quality, but 4x faster.
01:02:18.980 | If I transfer to 16 frames, or 16x acceleration,
01:02:22.260 | there's a slight drop in quality.
01:02:24.380 | But still, you get basically a 16x acceleration.
01:02:28.820 | So the key idea here is, again, you'd
01:02:31.180 | want to look beyond the processing
01:02:33.300 | of the neural network itself to around it
01:02:35.060 | to see if you can speed it up.
01:02:36.620 | Usually with PSNR, you can't really
01:02:37.940 | tell too much about the quality.
01:02:39.060 | So another way to look at it is actually
01:02:40.700 | look at the video itself or subjective quality.
01:02:42.940 | So on the left-hand side here, this
01:02:45.380 | is if I applied super resolution on every single frame.
01:02:48.780 | So this is the traditional way of doing it.
01:02:51.580 | On the right-hand side here, this
01:02:53.820 | is if I just did interpolation on every single frame.
01:02:56.980 | And so where you can tell the difference is by looking
01:02:59.260 | at things like the text, you can see
01:03:00.780 | that the text is much sharper on the left video
01:03:03.500 | than the right video.
01:03:05.260 | Now, FAST plus SRCNN, so using FAST, is somewhere in between.
01:03:08.420 | So FAST actually has the same quality
01:03:11.620 | as the video on the left-hand side,
01:03:13.900 | but it's just as efficient in terms of processing speed
01:03:17.380 | as the approach on the right-hand side.
01:03:19.980 | So it kind of has the best of both worlds.
01:03:22.460 | And so the key takeaway for this is
01:03:24.140 | that if you want to accelerate DNNs for a given process,
01:03:27.660 | it's good to look beyond the hardware for the acceleration.
01:03:31.020 | We can look at things like the structure of the data that's
01:03:33.780 | entering the neural network accelerator.
01:03:36.060 | There might be opportunities there.
01:03:37.540 | For example, here, temporal correlation
01:03:39.740 | that allows you to further accelerate the processing.
01:03:42.220 | Again, if you're interested in this,
01:03:43.740 | all the code is on the website.
01:03:45.220 | So to end this lecture, I just want
01:03:46.660 | to talk about things that are actually
01:03:48.500 | beyond deep neural nets.
01:03:49.620 | I also-- I know neural nets are great.
01:03:51.260 | They're useful for many applications.
01:03:52.900 | But I think there's a lot of exciting problems
01:03:54.860 | outside the space of neural nets as well, which also
01:03:57.460 | require efficient computing.
01:04:00.140 | So the first thing is what we call
01:04:01.940 | visual inertial localization or visual odometry.
01:04:05.580 | This is something that's widely used for robots
01:04:07.900 | to kind of figure out where they are in the real world.
01:04:10.220 | So you can imagine for autonomous navigation,
01:04:12.140 | before you navigate the world, you
01:04:13.660 | have to know where you actually are in the world.
01:04:15.700 | So that's localization.
01:04:16.780 | This is also widely used for things like AR and VR
01:04:19.140 | as well, right, because you can know where you're actually
01:04:21.140 | looking in AR and VR.
01:04:22.620 | What does this actually mean?
01:04:24.540 | It means that you can basically take in a sequence of images.
01:04:27.740 | So you can imagine like a camera that's mounted on the robot
01:04:30.340 | or the person, as well as an IMU.
01:04:33.140 | So it has accelerometer and gyroscope information.
01:04:36.260 | And then visual inertial odometry,
01:04:38.180 | which is a subset of SLAM, basically fuses
01:04:40.180 | this information together.
01:04:41.860 | And the outcome of visual inertial odometry
01:04:44.660 | is the localization.
01:04:45.700 | So you can see here, basically, you're
01:04:47.420 | trying to estimate where you are in the 3D space.
01:04:50.220 | And the pose based on, in this case, the camera feed.
01:04:52.860 | But you can also measure IMU information there as well.
01:04:55.660 | And if you're in an unknown environment,
01:04:57.380 | you could also generate a map of that environment.
01:04:59.540 | So one of these is a very key task in navigation.
01:05:03.340 | And the key thing is, can you do it in an energy efficient way?
01:05:06.380 | So we've looked at building specialized hardware
01:05:09.340 | to do localization.
01:05:11.660 | This is actually the first chip that
01:05:13.160 | performs complete visual inertial odometry on chip.
01:05:15.740 | We call it Navion.
01:05:17.420 | This is done in collaboration with Sertac Karaman.
01:05:19.660 | So you can see here, here's the chip itself.
01:05:21.460 | It's 4 millimeters by 5 millimeters.
01:05:23.700 | You can see that it's smaller than a quarter.
01:05:26.180 | And you can imagine mounting it on a small robot.
01:05:29.180 | At the front end, it does basically
01:05:30.900 | processing of the camera information.
01:05:32.660 | It does things like feature detection,
01:05:34.260 | tracking, outlier elimination.
01:05:36.980 | It also processes-- it does pre-integration on the IMU.
01:05:40.700 | And then on the back end, it fuses this information
01:05:42.980 | together using a factor graph.
01:05:46.460 | And so when you compare this particular design,
01:05:48.940 | this Navion chip design, compared
01:05:50.580 | to mobile or desktop CPUs, you're
01:05:52.740 | talking about two to three orders of magnitude
01:05:55.660 | reduction in energy consumption because you have
01:05:58.060 | the specialized chip to do it.
01:05:59.700 | So what is the key component of this chip that
01:06:02.260 | enables us to do it?
01:06:03.340 | Well, again, sticking with the theme,
01:06:04.860 | the key thing is reduction in data movement.
01:06:07.620 | In particular, we reduce the amount
01:06:09.060 | of data that needs to be moved on and off chip.
01:06:11.380 | So all of the processing is located on the chip itself.
01:06:15.440 | And then furthermore, because we want
01:06:17.020 | to reduce the size of the chip and the size of the memories,
01:06:19.560 | we do things like apply low-cost compression on the frames
01:06:23.980 | and then also exploit sparsity, which
01:06:26.420 | means number of zeros in the factor graph itself.
01:06:28.820 | So all of the compression and exploiting sparsity
01:06:30.980 | can actually reduce the storage cost
01:06:32.620 | down to under a megabyte of storage
01:06:34.980 | on chip to do this processing.
01:06:36.260 | And that allows us to achieve this really low power
01:06:38.940 | consumption of below 25 milliwatts.
01:06:43.700 | Another thing that really matters for autonomous
01:06:45.700 | navigation is once you know where you are,
01:06:47.860 | where are you going to go next?
01:06:49.540 | So this is kind of a planning and mapping problem.
01:06:52.100 | And so in the context of things like robot exploration,
01:06:54.580 | where you want to basically explore an unknown area,
01:06:57.580 | you can do this by doing what we call computing
01:07:00.340 | Shannon's mutual information.
01:07:01.640 | Basically, you want to figure out
01:07:03.220 | where should I go next where I will discover
01:07:05.200 | the most amount of new information
01:07:07.540 | compared to what I already know.
01:07:09.660 | So you can imagine what's shown here is like an occupancy map.
01:07:12.860 | So this is basically the light colors
01:07:14.460 | show the place where it's free space.
01:07:16.380 | It's empty.
01:07:16.880 | Nothing's occupied.
01:07:18.260 | The dark gray area is unknown.
01:07:21.460 | And then the black lines are occupied things,
01:07:23.980 | so like walls, for example.
01:07:25.380 | And the question is, if I know that this is my current
01:07:27.620 | occupancy map, where should I go and scan, let's say,
01:07:30.460 | with a depth sensor to figure out more information
01:07:35.140 | about the map itself?
01:07:36.140 | So what you can do is you can compute
01:07:37.780 | what we call the mutual information of the map itself
01:07:40.820 | based on what you already know.
01:07:42.260 | And then you go to the location with the most information,
01:07:44.660 | and you scan it, and then you get an updated map.
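A toy sketch of that selection step, using per-cell Shannon entropy of the occupancy grid as a stand-in for the full mutual information computation: unknown cells (probability near 0.5) are the most uncertain, so the candidate whose neighborhood is most uncertain is the most informative place to scan. The radius and the candidate locations are made-up parameters.

```python
import numpy as np

def cell_entropy(p_occ):
    """Shannon entropy of each cell's occupancy probability (0.5 = unknown = most uncertain)."""
    p = np.clip(p_occ, 1e-6, 1 - 1e-6)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def best_scan_location(occupancy, candidates, radius=5):
    """Pick the candidate cell whose neighborhood is most uncertain (a proxy for information gain)."""
    h = cell_entropy(occupancy)
    scores = []
    for (r, c) in candidates:
        patch = h[max(r - radius, 0): r + radius + 1, max(c - radius, 0): c + radius + 1]
        scores.append(patch.sum())
    return candidates[int(np.argmax(scores))]

occupancy = np.full((50, 50), 0.5)          # everything unknown
occupancy[:, :20] = 0.05                    # already-explored free space
print(best_scan_location(occupancy, candidates=[(25, 10), (25, 30), (5, 45)]))
```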
01:07:47.420 | So shown here below is a miniature race car
01:07:49.660 | that's doing exactly that.
01:07:51.260 | So over here is the mutual information
01:07:55.580 | that's being computed.
01:07:56.540 | So it's trying to go to those light areas
01:08:01.020 | of the yellow areas that has the most information.
01:08:03.300 | So you can see that it's going to try and back up and come
01:08:06.020 | and scan this region to cover or figure out
01:08:09.020 | more information about that.
01:08:10.940 | So that's great.
01:08:11.780 | It's a very principled way of doing this.
01:08:13.660 | The problem of this kind of computation,
01:08:18.660 | the reason why it's been challenging,
01:08:20.540 | is, again, the computation, in particular, the data movement.
01:08:23.500 | So you can imagine, at any given position,
01:08:25.820 | you're going to do a 3D scanning with your LiDAR
01:08:29.100 | across a wide range of neighboring regions
01:08:32.100 | with your beams.
01:08:32.900 | You can imagine each of these beams with your LiDAR scan
01:08:35.220 | can be processed with different cores.
01:08:36.820 | So they can all be processed in parallel.
01:08:38.980 | So parallelism, again, here, just like the deep learning
01:08:41.460 | case, is very easily available.
01:08:45.500 | The challenge is data delivery.
01:08:47.540 | So what happens is that you're actually storing
01:08:49.700 | your occupancy map all in one memory.
01:08:52.580 | But now you have multiple cores that
01:08:54.220 | are going to try and process the scans on this occupancy map.
01:08:58.620 | And so you only actually, typically,
01:08:59.940 | for these types of memories, you're limited to two cores.
01:09:02.360 | But if you want to have n cores, 16 cores, 30 cores,
01:09:05.460 | it's going to be a challenge in terms of how
01:09:07.260 | to read data from this occupancy map
01:09:09.500 | and deliver it to the cores themselves.
01:09:12.300 | If we take a closer look at the memory access pattern,
01:09:15.740 | you can see here that as you scan it out,
01:09:18.180 | the numbers indicate which cycle you
01:09:20.500 | would use to read each of the locations on the map itself.
01:09:25.500 | And you can see it's kind of a diagonal pattern.
01:09:27.500 | So the question is, can I break this map into smaller memories
01:09:33.380 | and then access these smaller memories in parallel?
01:09:35.460 | And the question is, if I can break it into smaller memories,
01:09:38.000 | how should I decide what part of the map
01:09:39.860 | should go into which of these memories?
01:09:41.860 | So show here on the right-hand side,
01:09:44.620 | in the different colors basically
01:09:46.340 | indicate different memories or different banks of the memory.
01:09:49.020 | So they store different parts of the map.
01:09:50.700 | And again, if you think of the numbers
01:09:52.380 | as the cycle with which each location is accessed,
01:09:55.740 | what you'll notice is that for any given color, at most,
01:09:59.100 | two numbers are the same, meaning
01:10:01.680 | that I'm only going to access two pieces of the location
01:10:04.660 | for any given bank or memory.
01:10:06.100 | So there's going to be no conflict.
01:10:07.560 | So I can process all of these beams in parallel.
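A toy sketch of the banking principle: split the map across several small memories and count how many same-cycle accesses collide in the same bank. The banking rule below (interleave by row) and the access pattern are purely illustrative; the actual chip uses a banking pattern matched to its specific diagonal LiDAR access pattern.

```python
from collections import Counter

def bank_of(row, col, num_banks=8):
    # toy banking rule: interleave by row (illustrative assumption only)
    return row % num_banks

def extra_cycles(accesses_this_cycle, num_banks=8):
    """Accesses that land in the same bank must be serialized; count the extra cycles."""
    counts = Counter(bank_of(r, c, num_banks) for (r, c) in accesses_this_cycle)
    return sum(n - 1 for n in counts.values())

# 8 beams each reading one map cell in the same cycle (here they happen to hit distinct rows)
same_cycle = [(0, 7), (1, 7), (2, 6), (3, 6), (4, 5), (5, 4), (6, 3), (7, 1)]
print("single memory:", len(same_cycle) - 1, "extra cycles;",
      "banked:", extra_cycles(same_cycle), "extra cycles")
```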
01:10:11.220 | And so by doing this, this allows
01:10:13.020 | you to compute the mutual information of the entire map.
01:10:17.300 | And this can be a very large map,
01:10:19.060 | let's say 200 meters by 200 meters at 0.1 meter resolution,
01:10:22.900 | in under a second.
01:10:24.140 | This is very different from before,
01:10:25.620 | where you can only compute the mutual information
01:10:27.860 | of a subset of locations and just try and pick the best one.
01:10:31.060 | Now you can compute on the entire map.
01:10:32.780 | So you can know the absolute best location to go to get
01:10:35.220 | the most information.
01:10:36.860 | This is 100x speed up compared to a CPU
01:10:39.940 | at a tenth of the power on an FPGA.
01:10:43.500 | So that's another important example
01:10:44.960 | of how data movement is really critical in order
01:10:47.740 | to allow you to process things very, very quickly
01:10:50.060 | and how having specialized hardware can enable that.
01:10:53.780 | All right.
01:10:54.260 | So one last thing is looking at--
01:10:55.940 | so we talked about robotics.
01:10:57.260 | We talked about deep learning.
01:10:58.060 | But actually, what's really important
01:10:59.020 | is there's a lot of important applications
01:11:00.820 | where you can apply efficient processing that can help
01:11:03.980 | a lot of people around the world.
01:11:05.340 | So in particular, looking at monitoring neurodegenerative
01:11:08.740 | diseases.
01:11:09.980 | So we know things like dementia, so things like Alzheimer's,
01:11:12.700 | Parkinson's, affects tens of millions of people
01:11:15.420 | around the world and continues to grow.
01:11:17.500 | This is a very severe disease.
01:11:19.780 | The challenge for this disease is that--
01:11:21.620 | OK, one of the many challenges.
01:11:22.900 | But one of the challenges is that the neurological
01:11:25.020 | assessments for these diseases can be very time consuming
01:11:27.620 | and require a trained specialist.
01:11:29.460 | So normally, if you are suffering
01:11:31.220 | from one of these diseases or you might have this disease,
01:11:34.180 | what you need to do is you need to go see a specialist.
01:11:36.740 | And they'll ask you a series of questions.
01:11:39.220 | They'll do a mini mental exam, like what year is it?
01:11:41.780 | Where are you now?
01:11:42.580 | Can you count backwards and so on?
01:11:44.300 | Or you might be familiar with people
01:11:45.900 | are asked to draw the clock, these tests.
01:11:49.140 | And so you can imagine going to a specialist
01:11:51.020 | to do these type of things can be costly and time consuming.
01:11:53.620 | So you don't go very frequently.
01:11:55.540 | So as a result, the data that's collected is very sparse.
01:11:58.220 | Also, it's very qualitative.
01:12:00.100 | So if you go to different specialists,
01:12:01.700 | they might come up with a different assessment.
01:12:04.140 | So repeatability is also very much an issue.
01:12:08.060 | What's been super exciting is it's
01:12:09.900 | been shown in literature that there's actually
01:12:12.100 | a quantitative way of measuring or quantitative evaluating
01:12:16.860 | these types of diseases, potentially using eye movements.
01:12:20.660 | So eye movements can be used by a quantitative way
01:12:22.860 | to evaluate the severity or progression
01:12:25.260 | or regression of these particular type of diseases.
01:12:27.580 | So you imagine doing things like,
01:12:29.020 | if you're taking a certain drug, is your disease
01:12:31.140 | getting better or worse?
01:12:32.300 | And this eye movement can give a quantitative evaluation
01:12:34.780 | for that.
01:12:35.300 | But the challenge is that to do these eye movement evaluations,
01:12:40.900 | you still need to go in for that.
01:12:40.900 | So first, you need a very high speed camera.
01:12:43.020 | That can be very expensive.
01:12:44.500 | Often, you need to have substantial head support
01:12:46.660 | so your head doesn't move so you can really
01:12:47.780 | detect the eye movement.
01:12:48.940 | And you might even need IR illumination
01:12:50.700 | so you can more clearly see the eye.
01:12:53.300 | And so again, this still has the challenge
01:12:55.260 | that for clinical measurements of what
01:12:57.100 | we call saccade latency or eye movement latency or eye
01:12:59.460 | reaction time, they're done in very constrained environments.
01:13:02.420 | You still have to go see the special itself.
01:13:05.340 | And they use very specialized and costly equipment.
01:13:08.420 | So in the vein of enabling efficient computing
01:13:10.940 | and bringing compute to various devices, our question is,
01:13:13.980 | can we actually do these eye measurements on a phone
01:13:17.820 | itself that we all have?
01:13:20.540 | And so indeed, you can.
01:13:21.860 | You can develop various algorithms
01:13:23.340 | that can detect your eye reaction time
01:13:25.500 | on a consumer grade camera like your phone or an iPad.
01:13:29.460 | And we've shown that you can actually
01:13:31.300 | replicate the quality of results as you
01:13:33.820 | could with a Phantom high-speed camera.
01:13:35.020 | So shown here in the red are basically eye reaction times
01:13:38.980 | that are measured on a subject on an iPhone 6, which
01:13:41.660 | is obviously under $1,000, way cheaper now,
01:13:44.380 | compared to a Phantom camera shown here in blue.
01:13:46.380 | You can see that the distributions of the reaction
01:13:48.460 | times are about the same.
01:13:50.700 | Why is this exciting?
01:13:51.780 | Because it enables us to do low cost in-home measurements.
01:13:55.300 | So what you can imagine is a patient
01:13:56.780 | could do these measurements at home for many days,
01:13:59.460 | not just the day they go in.
01:14:00.780 | And then they can bring in this information.
01:14:02.620 | And this can give the physician or the specialist
01:14:04.860 | additional information to make the assessment as well.
01:14:07.180 | So this can be complementary.
01:14:08.380 | But it gives a much more rich set of information
01:14:10.500 | to do the diagnosis and evaluation.
01:14:12.940 | So we're talking about computing.
01:14:14.660 | But there's also other parts of the system
01:14:16.420 | that burn power as well, in particular,
01:14:18.420 | when we're talking about things like depth estimation using
01:14:20.920 | time of flight.
01:14:21.700 | Time of flight is very similar to LIDAR.
01:14:23.540 | Basically, what you're doing is you're sending a pulse
01:14:25.940 | and waiting for it to come back.
01:14:27.260 | And how long it takes to come back
01:14:28.620 | indicates the depth of whatever object you're trying to detect.
01:14:31.700 | The challenge is that depth estimation
01:14:33.580 | with time-of-flight sensors can be very expensive.
01:14:35.820 | You're emitting a pulse, waiting for it to come back.
01:14:38.020 | So we're talking about up to tens of watts of power.
01:14:42.860 | The question is, can we also reduce the sensor power
01:14:45.300 | if we can do efficient computing?
01:14:46.860 | So for example, can I reduce how often I emit the depth sensor
01:14:51.420 | and kind of recover the other information just using
01:14:54.460 | a monocular-based camera?
01:14:56.020 | So for example, typically, you have a pair of a depth sensor
01:14:59.620 | and an RGB camera.
01:15:00.940 | If at time 0, I turn both of them on, and at times 1 and 2,
01:15:05.400 | I turn the depth sensor off, but I still keep my RGB camera on,
01:15:08.700 | can I estimate the depth at time 1 and time 2?
01:15:13.180 | And then the key thing here is to make sure
01:15:15.020 | that the algorithms that you're running to estimate
01:15:17.180 | the depth without turning on the depth sensor itself
01:15:19.460 | is super cheap.
01:15:20.340 | So we actually have algorithms that
01:15:22.260 | can run on VGA at 30 frames per second
01:15:24.700 | on a Cortex A7, which is a super low-cost embedded processor.
01:15:29.780 | And just to give you an idea of how it looks like,
01:15:31.860 | so let's see, here's the left is the RGB image.
01:15:34.620 | In the middle is the depth map or the ground truth.
01:15:37.100 | So if I always had the depth sensor on,
01:15:38.660 | that's what it would look like.
01:15:39.820 | And then on the right-hand side is the estimated depth map.
01:15:42.660 | In this particular case, we're only turning on the sensor
01:15:46.100 | only 11% of the time, so every ninth frame.
01:15:49.460 | And your mean relative error is only about 0.7%,
01:15:52.540 | so the accuracy or quality is pretty aligned.
01:15:55.740 | OK, so at a high level, what are the key takeaways
01:15:59.780 | I want you guys to get from today's lecture?
01:16:02.460 | First is efficient computing is really important.
01:16:05.340 | It can extend the reach of AI beyond the cloud itself
01:16:09.060 | because it can reduce communication networking
01:16:11.060 | costs, enable privacy, and provide low latency.
01:16:15.140 | And so we can use AI for a wide range of applications,
01:16:17.580 | ranging from things like robotics to health care.
01:16:20.420 | And in order to achieve this energy efficient computing,
01:16:22.820 | it really requires cross-layer design.
01:16:24.980 | So not just focusing on the hardware,
01:16:26.980 | but specialized hardware plays an important role, but also
01:16:29.420 | the algorithms itself.
01:16:31.020 | And this is going to be really key to enabling AI
01:16:33.300 | for the next decade or so or beyond.
01:16:36.340 | OK, and we also covered a lot of points in the lecture,
01:16:39.700 | so the slides are all available on our website.
01:16:43.540 | Also, just because it's a deep learning seminar series,
01:16:46.020 | I just want to point some other resources
01:16:48.060 | that you might be interested if you
01:16:49.560 | want to learn more about efficient processing
01:16:51.500 | of neural nets.
01:16:52.100 | So again, I want to point you first to this survey paper
01:16:54.940 | that we've developed. This is with my collaborator Joel
01:16:57.300 | Emer.
01:16:57.800 | It really kind of covers what are the different techniques
01:17:00.260 | that people are looking at and give some insights
01:17:01.860 | of the key design principles.
01:17:03.300 | We also have a book coming soon.
01:17:04.680 | It's going to be within the next few weeks.
01:17:07.820 | We also have slides from various tutorials
01:17:09.820 | that we've given on this particular topic.
01:17:11.780 | In fact, we also teach a course on this here at MIT, 6.825.
01:17:16.660 | If you're interested in updates on all these types of materials,
01:17:19.460 | I invite you to join the mailing list or the Twitter feed.
01:17:23.820 | The other thing is if you're not an MIT student,
01:17:25.880 | but you want to take a two-day course on this particular topic,
01:17:29.940 | I also invite you to take a look at the MIT Professional
01:17:33.380 | Education option.
01:17:34.860 | So we run short courses on MIT campus over the summer.
01:17:38.180 | So you can come for two days, and we
01:17:39.800 | can talk about the various different approaches
01:17:41.260 | that people use to build efficient deep learning
01:17:43.340 | systems.
01:17:44.900 | And then finally, if you're interested in just video
01:17:47.780 | and tutorial videos on this talk,
01:17:49.420 | I actually, at the end of November during NeurIPS,
01:17:52.180 | I gave a 90-minute tutorial that goes really in-depth in terms
01:17:55.580 | of how to build efficient deep learning systems.
01:17:58.540 | So I invite you to visit that.
01:17:59.860 | And we also have some talks at the Mars Conference
01:18:02.140 | on Efficient Robotics.
01:18:03.540 | And we have a YouTube channel where this is all located.
01:18:07.140 | And then finally, I'd be remiss if I didn't acknowledge
01:18:09.900 | a lot of the work here is done by the students, so
01:18:12.700 | all the students in our group, as well as my collaborators,
01:18:14.940 | Joel Emer, Sertac Karaman, and Thomas Heldt,
01:18:16.900 | and then all of our sponsors that
01:18:18.820 | make this research possible.
01:18:20.740 | So that concludes my talk.
01:18:22.060 | Thank you very much.
01:18:22.860 | [APPLAUSE]
01:18:26.420 | Thank you.
01:18:27.980 | [APPLAUSE]