
Efficient Computing for Deep Learning, Robotics, and AI (Vivienne Sze) | MIT Deep Learning Series


Chapters

0:00 Introduction
0:43 Talk overview
1:18 Compute for deep learning
5:48 Power consumption for deep learning, robotics, and AI
9:23 Deep learning in the context of resource use
12:29 Deep learning basics
20:28 Hardware acceleration for deep learning
57:54 Looking beyond the DNN accelerator for acceleration
63:45 Beyond deep neural networks

Whisper Transcript

00:00:00.000 | - I'm happy to have Vivienne Sze here with us.
00:00:03.440 | She's a professor here at MIT,
00:00:05.060 | working in the very important and exciting space
00:00:08.300 | of developing energy efficient and high performance systems
00:00:11.400 | for machine learning, computer vision,
00:00:13.240 | and other multimedia applications.
00:00:16.600 | This involves joint design of algorithms,
00:00:18.720 | architecture, circuits, systems,
00:00:21.040 | to enable optimal trade-offs between power,
00:00:23.280 | speed, and quality of result.
00:00:25.340 | One of the important differences between the human brain
00:00:29.720 | and AI systems is the energy efficiency of the brain.
00:00:34.720 | So Vivienne is a world-class researcher at the forefront
00:00:38.160 | of discovering how we can close that gap.
00:00:40.560 | So please give her a warm welcome.
00:00:43.300 | - I'm really happy to be here to share some of the research
00:00:46.520 | and an overview of this area, efficient computing.
00:00:49.560 | So actually what I'm gonna be talking about today
00:00:51.720 | is gonna be a little bit broader than just deep learning.
00:00:53.880 | We'll start with deep learning,
00:00:54.840 | but we'll also move to how we might apply this to robotics
00:00:58.620 | and other AI tasks, and why it's really important
00:01:02.040 | to have efficient computing to enable a lot
00:01:03.720 | of these exciting applications.
00:01:06.100 | Also, I just wanna mention that a lot of the work
00:01:08.340 | I'm gonna present today is not done by myself,
00:01:10.560 | but in collaboration with a lot of folks at MIT over here.
00:01:14.680 | And of course, if you want access to the slides,
00:01:16.520 | they're available on our website.
00:01:18.160 | So given that this is a deep learning lecture series,
00:01:21.920 | I wanna first start talking a little bit
00:01:23.680 | about deep neural nets.
00:01:24.520 | So we know that deep neural nets have, you know,
00:01:27.600 | generated a lot of interest
00:01:29.240 | and have many very compelling applications.
00:01:33.040 | But one of the things that has, you know,
00:01:34.720 | come to light over the past few years
00:01:37.480 | is the increasing need for compute.
00:01:39.240 | OpenAI actually showed over the past few years
00:01:42.060 | that there's been a significant increase
00:01:44.560 | in the amount of compute that is required
00:01:46.960 | to perform deep learning applications
00:01:49.100 | and to do the training for deep learning
00:01:50.720 | over the past few years.
00:01:51.660 | So it's actually grown exponentially
00:01:53.920 | over the past few years.
00:01:54.760 | It's grown in fact by over 300,000 times
00:01:57.680 | in terms of the amount of compute we need to drive
00:02:01.360 | and increase the accuracy of a lot of the tasks
00:02:04.400 | that we're trying to achieve.
00:02:05.800 | At the same time, if we start looking at basically
00:02:09.880 | the environmental implications of all of this processing,
00:02:13.960 | it can be quite severe.
00:02:15.160 | So if we look at, for example,
00:02:17.040 | the carbon footprint of, you know, training neural nets,
00:02:20.080 | if you think of, you know, the amount of carbon footprint
00:02:23.120 | of flying across North America from New York
00:02:26.440 | to San Francisco or the carbon footprint
00:02:29.480 | of an average human life, you can see that, you know,
00:02:33.360 | neural networks are orders of magnitude greater than that.
00:02:36.720 | So the environmental or carbon footprint implications
00:02:40.080 | of computing for deep neural nets
00:02:41.600 | can be quite severe as well.
00:02:43.520 | Now this is a lot having to do with compute in the cloud.
00:02:46.000 | Another important area where we wanna do compute
00:02:48.520 | is actually moving the compute from the cloud
00:02:51.360 | to the edge itself, into the device
00:02:53.680 | where a lot of the data is being collected.
00:02:55.840 | So why would we wanna do that?
00:02:57.120 | So there's a couple of reasons.
00:02:58.600 | First of all, communication.
00:03:01.280 | So in a lot of places around the world
00:03:03.840 | and just even a lot of just places in general,
00:03:05.480 | you might not have a very strong
00:03:07.040 | communication infrastructure, right?
00:03:09.240 | So you don't wanna necessarily have to rely
00:03:10.840 | on a communication network
00:03:12.240 | in order to do a lot of these applications.
00:03:14.960 | So again, you know, removing your tethering
00:03:17.200 | from the cloud is important.
00:03:19.320 | Another reason is a lot of the times that we,
00:03:22.440 | you know, apply deep learning on a lot of applications
00:03:24.240 | where the data is very sensitive.
00:03:26.200 | So you can think about things like healthcare
00:03:28.560 | where you're collecting very sensitive data.
00:03:30.640 | And so privacy and security again is really critical.
00:03:34.160 | And you would, rather than sending the data to the cloud,
00:03:36.360 | you'd like to bring the compute to the data itself.
00:03:39.280 | Finally, another compelling reason for, you know,
00:03:44.200 | bringing the compute into the device
00:03:45.880 | or into the robot is latency.
00:03:47.520 | So this is particularly true for interactive applications.
00:03:51.040 | So you can think of things like autonomous navigation,
00:03:53.800 | robotics, or self-driving vehicles
00:03:55.920 | where you need to interact with the real world.
00:03:58.080 | You can imagine if you're driving very quickly
00:04:00.240 | down the highway and you detect an obstacle,
00:04:02.520 | you might not have enough time to send the data
00:04:04.480 | to the cloud, wait for it to be processed,
00:04:06.560 | and send the instruction back in.
00:04:08.400 | So again, you wanna move the compute into the robot
00:04:11.160 | or into the vehicle itself.
00:04:13.240 | Okay, so hopefully this is establishing
00:04:15.160 | why we wanna move the compute into the Edge.
00:04:17.800 | But one of the big challenges of doing processing
00:04:20.640 | in the robot or in the device actually has to do
00:04:22.760 | with power consumption itself.
00:04:24.120 | So if we take the self-driving car as an example,
00:04:26.800 | it's been reported that it consumes over 2000 watts
00:04:30.920 | of power just for the computation itself,
00:04:33.400 | just to process all the sensor data that it's collecting.
00:04:36.960 | Right, and this actually generates a lot of heat.
00:04:39.520 | It takes up a lot of space.
00:04:40.440 | You can see in this prototype
00:04:43.240 | that all the compute is being placed in the trunk,
00:04:46.320 | where it generates a lot of heat
00:04:47.680 | and often needs water cooling.
00:04:49.880 | So this can be a big cost and logistical challenges
00:04:53.320 | for self-driving vehicles.
00:04:55.160 | Now you can imagine that this is gonna be
00:04:56.520 | much more challenging if we shrink down the form factor
00:05:00.080 | of the device itself to something that is perhaps portable
00:05:02.840 | in your hands.
00:05:03.680 | You can think about smaller robots
00:05:05.360 | or something like your smartphone or cell phone.
00:05:08.280 | In these particular cases,
00:05:09.360 | when you think about portable devices,
00:05:11.360 | you actually have very limited energy capacity,
00:05:13.880 | and this is based on the fact that the battery itself
00:05:16.920 | is limited in terms of the size, weight, and its cost.
00:05:19.760 | Right, so you can't have a very large amount of energy
00:05:22.600 | on these particular devices itself.
00:05:24.680 | Secondly, when you take a look at the embedded platforms
00:05:27.960 | that are currently used for embedded processing
00:05:29.840 | for these particular applications,
00:05:31.760 | they tend to consume over 10 watts,
00:05:34.240 | which is an order of magnitude higher
00:05:35.880 | than the power consumption that you typically
00:05:38.100 | would allow for these particular handheld devices.
00:05:40.640 | So in these handheld devices,
00:05:41.960 | typically you're limited to under a watt
00:05:43.640 | due to the heat dissipation.
00:05:44.720 | For example, you don't want your cell phone
00:05:46.040 | to get super hot.
00:05:47.640 | Okay, so in the past decade or so, or decades,
00:05:51.920 | what we would do to address this challenge
00:05:53.760 | is that we would wait for transistors to become smaller,
00:05:56.880 | faster, and more efficient.
00:05:58.680 | However, this has become a challenge
00:06:01.160 | over the past few years,
00:06:02.400 | so transistors are not getting more efficient.
00:06:05.080 | So for example, Moore's Law,
00:06:07.520 | which typically makes transistors smaller and faster,
00:06:10.180 | has been slowing down,
00:06:11.360 | and Dennard scaling,
00:06:13.040 | which has made transistors more efficient,
00:06:15.400 | has also slowed down or ended.
00:06:17.440 | So you can see here over the past 10 years,
00:06:19.280 | this trend has really flattened out.
00:06:21.080 | Okay, so this is a particular challenge
00:06:23.400 | because we want more and more compute
00:06:25.120 | to drive deep neural network applications,
00:06:27.820 | but the transistors are not becoming more efficient.
00:06:30.720 | Right?
00:06:31.760 | So what we have to turn to in order to address this
00:06:34.880 | is specialized hardware
00:06:37.600 | to achieve the significant speed and energy efficiency
00:06:40.620 | that we require for our particular applications.
00:06:43.220 | When we talk about designing specialized hardware,
00:06:44.940 | this is really about thinking about
00:06:46.180 | how we can redesign the hardware from the ground up,
00:06:49.540 | particularly targeted at these AI, deep learning,
00:06:52.920 | and robotic tasks that we're really excited about.
00:06:55.900 | Okay, so this notion is not new.
00:06:57.500 | In fact, it's become extremely popular to do this.
00:07:00.980 | Over the past few years,
00:07:02.060 | there's been a large number of startups and companies
00:07:04.140 | that have focused on building
00:07:05.260 | specialized hardware for deep learning.
00:07:06.940 | So in fact, the New York Times reported,
00:07:09.060 | I guess it was two years ago
00:07:11.000 | that there's a record number of startups
00:07:12.600 | looking at building specialized hardware
00:07:14.800 | for AI and for deep learning.
00:07:16.920 | Okay, so we'll talk a little bit about
00:07:18.320 | what specialized hardware looks like
00:07:20.180 | for these particular applications.
00:07:22.280 | Now, if you really care about energy and power efficiency,
00:07:25.400 | the first question you should ask is
00:07:26.960 | where is the power actually going for these applications?
00:07:31.080 | And so as it turns out,
00:07:33.000 | power is dominated by data movement.
00:07:35.740 | So it's actually not the computations themselves
00:07:38.360 | that are expensive,
00:07:39.440 | but moving the data to the computation engine
00:07:42.320 | that's expensive.
00:07:43.160 | So for example, shown here in blue is
00:07:46.760 | a range of power consumption, energy consumption
00:07:49.240 | for a variety of types of computations,
00:07:51.820 | for example, multiplications and additions
00:07:54.640 | at various different precision.
00:07:56.100 | So you have, for example, floating point to fixed point
00:07:59.440 | and eight bit integer and same with additions.
00:08:01.640 | And you can see as it makes sense,
00:08:02.880 | as you scale down the precision,
00:08:04.760 | the energy consumption of each of these operations reduces.
00:08:07.860 | But what's really surprising here
00:08:09.500 | is that if you look lower
00:08:11.440 | at the energy consumption of data movement, right?
00:08:14.360 | Again, this is delivering the input data
00:08:16.260 | to do the multiplication and then, you know,
00:08:18.280 | moving the output of the multiplication
00:08:19.880 | somewhere into memory, it can be very expensive.
00:08:22.280 | So for example, if you look at the energy consumption
00:08:25.880 | of a 32 bit read from an SRAM memory,
00:08:28.440 | this is an eight kilobyte SRAM.
00:08:29.840 | So it's a very small memory
00:08:31.520 | that you would have on the processor or on the chip itself.
00:08:35.360 | This is already gonna consume five picojoules of energy.
00:08:38.880 | So equivalent or even more
00:08:40.840 | than a 32 bit floating point multiply.
00:08:43.840 | And this is from a very small memory.
00:08:45.880 | If you need to read this data from off chip,
00:08:48.200 | so outside the processor, for example, in DRAM,
00:08:51.920 | it's gonna be even more expensive.
00:08:54.080 | So in this particular case,
00:08:55.180 | we're showing 640 picojoules in terms of energy.
00:08:58.560 | And so you can notice here on the horizontal axis
00:09:01.500 | that this is basically a logarithmic axis.
00:09:05.160 | So you're talking about orders of magnitude increase
00:09:07.840 | in energy in terms of data movement
00:09:09.420 | compared to the compute itself, right?
00:09:11.580 | So this is a key takeaway here.
00:09:13.080 | So if we really want to address the energy consumption
00:09:17.080 | of these particular types of processing,
00:09:19.660 | we really wanna look at reducing data movement.
00:09:22.560 | Okay, but what's the challenge here?
00:09:24.200 | So if we take a look at a popular AI robotics
00:09:27.240 | type of application like autonomous navigation,
00:09:29.040 | the real challenge here though,
00:09:30.440 | is that these applications use a lot of data, right?
00:09:33.640 | So for example, one of the things you need to do
00:09:35.240 | in autonomous navigation
00:09:36.280 | is what we call semantic understanding.
00:09:38.380 | So you need to be able to identify, you know,
00:09:40.440 | which pixel belongs to what.
00:09:41.840 | So for example, in this scene,
00:09:42.840 | you need to know that this pixel represents the ground,
00:09:45.400 | this pixel represents the sky,
00:09:46.960 | this pixel represents, you know, a person itself.
00:09:49.880 | Okay, so this is an important type of processing.
00:09:51.640 | Often if you're traveling quickly,
00:09:53.440 | you wanna be able to do this at a very high frame rate.
00:09:56.940 | You might need to have large resolution.
00:09:58.560 | So for example, typically if you want HD images,
00:10:00.920 | you're talking about 2 million pixels per frame.
00:10:03.880 | And then often, if you also wanna be able to detect objects
00:10:06.560 | at different scales or see objects that are far away,
00:10:09.220 | you need to do what we call data expansion.
00:10:11.320 | For example, build a pyramid for this,
00:10:13.220 | and this would increase the amount of pixels
00:10:15.080 | or amount of data you need to process
00:10:17.000 | by, you know, one to two orders of magnitude.
00:10:19.640 | So that's a huge amount of data
00:10:20.800 | that you have to process right off the bat there.
00:10:23.360 | Another type of processing
00:10:25.480 | or understanding that you wanna do for autonomous navigation
00:10:27.520 | is what we call geometric understanding,
00:10:29.520 | and that's when you're kind of navigating,
00:10:30.840 | you wanna build a 3D map of the world that's around you.
00:10:34.080 | And you can imagine the longer you travel for,
00:10:37.320 | the larger the map you're gonna build.
00:10:39.300 | And again, that's gonna be more data
00:10:41.520 | that you're gonna have to process and compute on.
00:10:44.280 | Okay, so this is a significant challenge
00:10:46.080 | for autonomous navigation in terms of amount of data.
00:10:48.720 | Another aspect of autonomous navigation,
00:10:51.700 | and also of other applications like AR, VR, and so on,
00:10:54.160 | is understanding your environment, right?
00:10:56.720 | So a typical thing you might need to do
00:10:59.160 | is to do depth estimation.
00:11:00.680 | So for example, if I give you an image,
00:11:02.860 | can you estimate the distance
00:11:04.840 | of how far a given pixel is from you?
00:11:07.600 | And also semantic segmentation,
00:11:09.220 | we just talked about that before.
00:11:10.600 | So these are important types of ways
00:11:12.840 | to understand your environment
00:11:14.200 | when you're trying to navigate.
00:11:16.000 | And it should be no surprise to you
00:11:18.280 | that in order to do these types of processing,
00:11:20.840 | the state-of-the-art approaches utilize deep neural nets.
00:11:25.040 | Right?
00:11:26.280 | But the challenge here is that these deep neural nets
00:11:28.200 | often require several hundred million
00:11:30.640 | of operations and weights to do the computation.
00:11:33.560 | So when you try and compare it to something
00:11:35.720 | like you would all have on your phone,
00:11:37.000 | for example, video compression,
00:11:38.920 | you're talking about two to three orders of magnitude
00:11:41.860 | increase in computational complexity.
00:11:45.040 | So this is a significant challenge
00:11:46.400 | 'cause if we'd like to have deep neural networks
00:11:49.500 | be as ubiquitous as something like video compression,
00:11:52.460 | we really have to figure out
00:11:53.760 | how to address this computational complexity.
00:11:56.640 | We also know that deep neural networks
00:11:58.160 | are not just used for understanding the environment
00:12:00.440 | or autonomous navigation,
00:12:02.000 | but it's really become the cornerstone
00:12:03.360 | of many AI applications from computer vision,
00:12:06.320 | speech recognition, gameplay, and even medical applications.
00:12:09.920 | And I'm sure a lot of these have been covered
00:12:11.800 | through this course.
00:12:13.520 | So briefly, I'm just gonna give a quick overview
00:12:16.640 | of some of the key components in deep neural nets,
00:12:18.680 | not because, you know, I'm sure all of you understand it,
00:12:20.840 | but because since this area is very popular,
00:12:23.520 | the terminology can vary from discipline to discipline.
00:12:26.120 | So I'll just do a brief overview to align ourselves
00:12:28.140 | on the terminology itself.
00:12:30.520 | So what are deep neural nets?
00:12:32.920 | Basically, you can view it as a way of, for example,
00:12:36.360 | understanding the environment.
00:12:37.700 | It's a chain of different layers of processing
00:12:42.060 | where you can imagine for an input image,
00:12:44.200 | at the low level or the earlier parts of the neural net,
00:12:46.760 | you're trying to learn different low-level features
00:12:49.480 | such as edges of an image.
00:12:51.640 | And as you get deeper into the network,
00:12:53.960 | as you chain more of these kind of computational layers
00:12:56.560 | together, you start being able to detect
00:12:58.960 | higher and higher level features
00:13:00.360 | until you can, you know, recognize a vehicle, for example.
00:13:03.880 | And, you know, the difference of this particular approach
00:13:06.240 | compared to more traditional ways of doing computer vision
00:13:09.240 | is that how we extract these features are learned
00:13:12.180 | from the data itself, as opposed to having an expert
00:13:14.160 | come in and say, "Hey, look for the edges,
00:13:16.160 | look for, you know, the wheels," and so on.
00:13:18.160 | The fact that it recognizes these features
00:13:19.800 | is a learned approach.
00:13:22.320 | Okay, what is it doing at each of these layers?
00:13:24.680 | Well, it's actually doing a very simple computation.
00:13:28.200 | This is looking at the inference side of things.
00:13:29.920 | Basically, effectively, what it's doing is a weighted sum.
00:13:32.640 | Right, so you have the input values,
00:13:34.800 | and we'll color code the inputs as blue here
00:13:37.720 | and try and stay consistent with that throughout the talk.
00:13:41.240 | We apply certain weights to them,
00:13:43.440 | and these weights are learned from the training data,
00:13:45.800 | and then they would generate an output,
00:13:47.120 | which is typically red here,
00:13:48.380 | and it's basically a weighted sum, as we can see.
00:13:51.160 | We then pass this weighted sum
00:13:53.160 | through some form of non-linearity.
00:13:55.120 | So, you know, traditionally, it used to be sigmoids.
00:13:57.560 | More recently, we use things like ReLUs,
00:13:59.600 | which basically set negative values to zero.
00:14:05.760 | But the key takeaway here is that if you look
00:14:08.480 | at this computational kernel, the key operation
00:14:11.920 | to a lot of these neural networks
00:14:13.300 | is performing this multiply and accumulate
00:14:15.520 | to compute the weighted sum.
00:14:17.120 | And this accounts for over 90% of the computation.
00:14:20.360 | So if we really want to focus on, you know,
00:14:22.880 | accelerating neural nets or making them more efficient,
00:14:25.080 | we really want to focus on minimizing the cost
00:14:26.960 | of this multiply and accumulate itself.
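As a rough illustration of the kernel described here, the sketch below (plain Python, with made-up input values) computes one neuron output as a weighted sum followed by a ReLU. It is a minimal sketch of the math, not how any real accelerator implements it.

```python
def neuron_output(inputs, weights, bias=0.0):
    # Multiply-and-accumulate (MAC): this weighted sum is where over 90%
    # of the computation in a typical DNN is spent.
    acc = bias
    for x, w in zip(inputs, weights):
        acc += x * w
    # Non-linearity: a ReLU sets negative values to zero.
    return max(acc, 0.0)

# Example with 3 inputs and 3 learned weights (values are made up).
print(neuron_output([1.0, -2.0, 0.5], [0.3, 0.1, -0.4]))  # -0.1 before ReLU -> 0.0
```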
00:14:29.240 | There are also various popular types
00:14:32.960 | of deep neural network layers used for deep neural networks.
00:14:36.740 | They also often vary in terms of, you know,
00:14:38.960 | how you connect up the different layers.
00:14:40.840 | So for example, you can have feed-forward layers
00:14:43.040 | where the inputs are always connected to the outputs.
00:14:45.600 | You can have feed-back where the outputs
00:14:47.320 | are connected back into the inputs.
00:14:49.480 | You can have fully-connected inputs
00:14:51.720 | where basically all the outputs are connected
00:14:53.440 | to all the inputs, or sparsely connected.
00:14:56.800 | And you might be familiar with some of these layers.
00:14:58.360 | So for example, fully-connected layers,
00:15:00.160 | just like what we talked about,
00:15:01.040 | all inputs and all outputs are connected.
00:15:03.840 | They tend to be feed-forward.
00:15:05.680 | When you put them together, they're typically referred
00:15:08.160 | to as a multilayer perceptron.
00:15:10.460 | You have convolutional layers, which are also feed-forward,
00:15:14.520 | but then you have sparsely-connected
00:15:16.880 | weight-sharing connections.
00:15:18.760 | And when you put them together,
00:15:20.360 | they're often referred to as convolutional networks.
00:15:23.320 | And they're typically used for image-based processing.
00:15:25.960 | You have recurrent layers where we have
00:15:29.440 | this feedback connection, so the output
00:15:31.500 | is fed back to the input.
00:15:34.240 | When we combine recurrent layers,
00:15:35.800 | they're referred to as recurrent neural nets.
00:15:37.360 | And these are typically used to process sequential data,
00:15:40.360 | so speech or language-based processing.
00:15:42.960 | And then most recently, which has become really popular,
00:15:46.240 | it's attention layers or attention-based mechanisms.
00:15:49.800 | They often involve matrix multiply,
00:15:51.440 | which is again, multiply and accumulate.
00:15:53.840 | And when you combine these,
00:15:56.040 | they're often referred to as transformers.
00:15:58.760 | Okay, so let's first get an idea as to why
00:16:02.680 | convolutional neural nets, or deep learning in general,
00:16:05.880 | are computationally more complex than other types of processing.
00:16:08.720 | So we'll focus on convolutional neural nets as an example,
00:16:12.120 | although many of these principles apply
00:16:13.520 | to other types of neural nets.
00:16:15.360 | And the first thing to kind of take a look
00:16:17.320 | as to why it's complicated is to look
00:16:18.960 | at the computational kernel.
00:16:20.240 | So how does it actually perform convolution itself?
00:16:23.160 | So let's say you have this 2D input image.
00:16:27.320 | If it's at the input of the neural net, it would be an image.
00:16:29.280 | If it's deeper in the neural net,
00:16:30.600 | it would be the input feature map.
00:16:32.640 | And it's gonna be composed of activations.
00:16:35.640 | Or you can think from an image,
00:16:36.640 | it's gonna be composed of pixels.
00:16:38.320 | And we convolve it with, let's say, a 2D filter,
00:16:41.080 | which is composed of weights.
00:16:42.600 | Right, so typical convolution, what you would do
00:16:45.320 | is you would do an element-wise multiplication
00:16:47.840 | of the filter weights with the input feature map activations.
00:16:52.320 | You would sum them all together to generate one output value.
00:16:55.760 | And we would refer to that as the output activation.
00:16:58.720 | Right, and then because it's convolution,
00:17:00.480 | we would basically slide the filter
00:17:03.480 | across this input feature map
00:17:05.520 | and generate all the other output feature map activation.
00:17:08.840 | And so this kind of 2D convolution
00:17:10.960 | is pretty standard in image processing.
00:17:12.920 | We've been doing this for decades, right?
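Below is a minimal sketch of that sliding-window computation in plain Python, assuming stride 1 and no padding; the names ifmap, filt, and ofmap follow the terminology used in the talk, and the example values are made up.

```python
def conv2d(ifmap, filt):
    """2D convolution of one input feature map with one 2D filter (stride 1, no padding)."""
    H, W = len(ifmap), len(ifmap[0])   # input feature map of activations
    R, S = len(filt), len(filt[0])     # filter of weights
    ofmap = [[0.0] * (W - S + 1) for _ in range(H - R + 1)]
    for y in range(H - R + 1):         # slide the filter vertically
        for x in range(W - S + 1):     # ... and horizontally
            acc = 0.0
            for i in range(R):         # element-wise multiply and sum
                for j in range(S):
                    acc += ifmap[y + i][x + j] * filt[i][j]
            ofmap[y][x] = acc          # one output activation
    return ofmap

print(conv2d([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[1, 0], [0, 1]]))
# [[6, 8], [12, 14]]
```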
00:17:15.600 | What makes convolutional neural nets much more challenging
00:17:20.000 | is the increase in dimensionality.
00:17:21.480 | So first of all, rather than doing just this 2D convolution,
00:17:25.040 | we often stack multiple channels.
00:17:27.000 | So there's this third dimension called channels.
00:17:29.200 | And then what we're doing here is that we need to do
00:17:30.840 | a 2D convolution on each of the channels
00:17:33.560 | and then add it all together, right?
00:17:35.960 | And you can think of these channels for an image,
00:17:38.320 | these channels would be kind of the red, green,
00:17:40.560 | and blue components, for example.
00:17:42.320 | And as you get deeper into the feature map,
00:17:43.920 | the number of channels could potentially increase.
00:17:45.920 | So if you look at AlexNet, which is a popular neural net,
00:17:48.680 | the number of channels ranges from three to 192.
00:17:52.480 | Okay, so that already increases the dimensionality,
00:17:54.520 | one dimension of the neural net itself
00:17:57.320 | in terms of processing.
00:17:58.560 | Another dimension that we increase
00:18:01.200 | is we actually apply multiple filters
00:18:04.000 | to the same input feature map.
00:18:06.760 | Okay, so for example, you might apply M filters
00:18:10.560 | to the same input feature map,
00:18:12.120 | and then you would generate an output feature map
00:18:14.960 | of M channels, right?
00:18:16.720 | So in the previous slide, we showed that convolving
00:18:20.080 | this 3D filter generates one output channel
00:18:22.840 | in the output feature map.
00:18:24.120 | If we apply M filters,
00:18:28.560 | we're gonna generate M output channels
00:18:31.320 | in the output feature map.
00:18:33.080 | And again, just to give you an idea
00:18:34.400 | in terms of the scale of this,
00:18:35.640 | when you talk about things like AlexNet,
00:18:37.120 | we're talking about between 96 to 384 filters.
00:18:41.120 | And of course, this is increasing to thousands
00:18:43.280 | for other advanced or more modern neural nets itself.
00:18:46.880 | And then finally, often you wanna process
00:18:49.080 | more than one image at a given time, right?
00:18:52.280 | So if you wanna actually do that,
00:18:53.520 | we can actually extend it.
00:18:54.720 | So N input images become N output images,
00:18:58.800 | or N input feature maps becomes N output feature maps.
00:19:02.400 | And we typically refer to this as a batch size,
00:19:05.560 | like the number of images you're processing
00:19:07.200 | at the same time, and this can range from one to 256.
00:19:10.280 | Okay, so these are all the various different dimensions
00:19:13.520 | of the neural net.
00:19:14.640 | And so really what someone does
00:19:16.480 | when they're trying to define what we call
00:19:18.440 | the network architecture of the neural net itself
00:19:20.520 | is that they're gonna select the different
00:19:22.200 | or define the shape of the neural network
00:19:24.040 | for each of the different layers.
00:19:25.040 | So it's gonna define all these different dimensions
00:19:27.800 | of the neural net itself, and these shapes can vary
00:19:29.960 | across the different layers.
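The dimensions just described can be summarized as a seven-deep loop nest, sketched below with NumPy. The shapes are illustrative, not taken from any particular network, and the loop variables N, M, C, E, F, R, S are the names commonly used in the DNN-accelerator literature rather than anything specific to this talk.

```python
import numpy as np

N, M, C = 2, 4, 3                 # batch size, number of filters (output channels), input channels
H, W    = 8, 8                    # input feature map height / width
R, S    = 3, 3                    # filter height / width
E, F    = H - R + 1, W - S + 1    # output feature map size (stride 1, no padding)

ifmap   = np.random.rand(N, C, H, W)
filters = np.random.rand(M, C, R, S)
ofmap   = np.zeros((N, M, E, F))

for n in range(N):                 # batch
    for m in range(M):             # output channels (one per filter)
        for e in range(E):         # output rows
            for f in range(F):     # output cols
                for c in range(C):          # input channels
                    for r in range(R):      # filter rows
                        for s in range(S):  # filter cols
                            ofmap[n, m, e, f] += \
                                ifmap[n, c, e + r, f + s] * filters[m, c, r, s]
```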
00:19:31.840 | Just to give you an idea, if you look at
00:19:35.400 | MobileNet as an example, this is a very popular
00:19:37.920 | neural network model, you can see that the filter sizes,
00:19:40.840 | right, so the height and width of the filters
00:19:44.040 | and the number of filters and number of channels
00:19:45.600 | will vary across the different blocks or layers itself.
00:19:48.360 | The other thing I just wanna mention
00:19:51.440 | is that when we look towards popular DNN models,
00:19:55.120 | we can also see important trends.
00:19:56.760 | So shown here are the various different models
00:19:59.200 | that have been developed over the years
00:20:00.240 | that are quite popular.
00:20:02.360 | A couple of interesting trends,
00:20:03.800 | one is that the networks tend to become deeper,
00:20:06.480 | so you can see in the convolutional layers
00:20:08.120 | they're getting deeper and deeper.
00:20:09.800 | And then also the number of weights that they're using
00:20:13.760 | and the number of MACs are also increasing as well.
00:20:16.720 | So this is an important trend,
00:20:17.840 | the DNN models are getting larger and deeper,
00:20:20.200 | and so again, they're becoming much more
00:20:21.920 | computationally demanding.
00:20:23.720 | And so we need more sophisticated hardware
00:20:26.600 | to be able to process them.
00:20:28.440 | All right, so that's kind of a quick intro
00:20:31.280 | or overview into the deep neural network space,
00:20:33.160 | I hope we're all aligned.
00:20:34.040 | So the first thing I'm gonna talk about
00:20:35.880 | is how can we actually build hardware
00:20:38.600 | to make the processing of these neural networks
00:20:41.160 | more efficient and to run faster.
00:20:42.840 | And often we refer to this as hardware acceleration.
00:20:46.120 | All right, so we know these neural networks are very large,
00:20:49.040 | there's a lot of compute,
00:20:50.480 | but are there types of properties
00:20:51.960 | that we can leverage to make computing
00:20:53.840 | or processing of these networks more efficient?
00:20:56.960 | So the first thing that's really friendly
00:20:58.960 | is that they actually exhibit a lot of parallelism.
00:21:02.200 | So all these multiplies and accumulates,
00:21:04.400 | you can actually do them all in parallel.
00:21:06.840 | Right, so that's great.
00:21:07.960 | So what that means is high throughput
00:21:09.600 | or high speed is actually possible
00:21:11.040 | 'cause I can do a lot of these processing in parallel.
00:21:13.960 | What is difficult and what should not be a surprise
00:21:16.120 | to you now is that the memory access is the bottleneck.
00:21:18.920 | So delivering the data to the multiply
00:21:21.680 | and accumulate engine is what's really challenging.
00:21:24.240 | So I'll give you an insight as to why this is the case.
00:21:26.600 | So let's take, say we take this multiply
00:21:29.240 | and accumulate engine, what we call a MAC.
00:21:31.840 | It takes in three inputs for every MAC,
00:21:34.320 | so you have the filter weight,
00:21:37.040 | you have the input image pixel,
00:21:39.360 | or if you're deeper in the network,
00:21:40.520 | it would be an input feature map activation,
00:21:43.360 | and it also takes the partial sum,
00:21:45.160 | which is like the partially accumulated value
00:21:47.160 | from the previous multiply that it did,
00:21:49.320 | and then it would generate an updated partial sum.
00:21:52.800 | So for every computation that you do,
00:21:55.120 | for every MAC that you're doing,
00:21:56.840 | you need to have four memory accesses.
00:21:58.920 | So it's a four to one ratio in terms
00:22:01.560 | of memory accesses versus compute.
00:22:04.120 | The other challenge that you have is, as we mentioned,
00:22:08.840 | moving data is gonna be very expensive.
00:22:12.160 | So in the absolute worst case,
00:22:13.800 | and you would always try to avoid this,
00:22:15.280 | if you read the data from DRAM, it's off-chip memory,
00:22:19.560 | every time you access data from DRAM,
00:22:21.960 | it's gonna be two orders of magnitude more expensive
00:22:26.040 | than the computation of performing a MAC itself.
00:22:29.800 | Okay, so that's really, really bad.
00:22:31.320 | So if you can imagine, again, if we look at AlexNet,
00:22:33.480 | which has 700 million MACs,
00:22:35.400 | we're talking about three billion DRAM accesses
00:22:38.600 | to do that computation.
00:22:40.080 | Okay, but again, all is not lost.
00:22:43.320 | There are some things that we can exploit
00:22:45.280 | to help us along with this problem.
00:22:47.200 | So one is what we call input data reuse opportunities,
00:22:50.520 | which means that a lot of the data that we're reading,
00:22:53.000 | we're using to perform these multiplies and accumulates,
00:22:55.400 | they're actually used for many multiplies and accumulates.
00:22:58.360 | So if we read the data once,
00:23:00.560 | we can reuse it multiple times for many operations, right?
00:23:04.320 | So I'll show you some examples of that.
00:23:07.080 | First is what we call convolutional reuse.
00:23:09.400 | So again, if you remember, we're taking a filter
00:23:11.680 | and we're sliding it across this input image.
00:23:15.400 | And so as a result, the activations from the feature map
00:23:19.800 | and the weights from the filter
00:23:21.200 | are gonna be reused in different combinations
00:23:23.760 | to compute the different multiply and accumulate values
00:23:27.200 | or different MACs itself.
00:23:28.080 | So there's a lot of what we call
00:23:29.160 | convolutional reuse opportunities there.
00:23:32.000 | Another example is that we're actually, if you recall,
00:23:35.680 | gonna apply multiple filters on the same input feature map.
00:23:40.080 | So that means that each activation in that input feature map
00:23:43.960 | can be reused multiple times across the different filters.
00:23:49.040 | Finally, if we're gonna process many images
00:23:52.640 | at the same time or many feature maps,
00:23:55.280 | a given weight in the filter itself
00:23:57.760 | can be reused multiple times across these input feature maps.
00:24:01.800 | So that's what we called filter reuse.
00:24:03.960 | Okay, so there's a lot of these great data
00:24:05.920 | reuse opportunities in the neural network itself.
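To get a feel for the scale of these reuse opportunities, here is a rough count for one layer with illustrative dimensions (the numbers are made up for the example, not taken from a specific network, and edge effects are ignored).

```python
# N batch, M filters, E x F output size, R x S filter size.
N, M, E, F, R, S = 4, 64, 55, 55, 3, 3

weight_reuse     = E * F * N   # convolutional reuse + filter reuse across the batch
activation_reuse = R * S * M   # convolutional reuse + reuse across the M filters

print(weight_reuse)      # 12100: each weight can feed ~3,025 MACs per image, x4 images
print(activation_reuse)  # 576: each input activation can feed up to ~576 MACs
```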
00:24:09.320 | And so what can we do to exploit these reuse opportunities?
00:24:13.120 | Well, what we can do is we can build
00:24:14.320 | what we call a memory hierarchy
00:24:16.400 | that contains very low cost memories
00:24:19.320 | that allow us to reduce the overall cost
00:24:21.560 | of moving this data.
00:24:22.400 | So what do we mean here?
00:24:23.440 | We mean that if I have,
00:24:24.880 | if I build a multiply and accumulate engine,
00:24:27.640 | I'm gonna have a very small memory
00:24:31.400 | right beside the multiply and accumulate engine.
00:24:34.360 | And by small, I mean something on the order
00:24:36.240 | of under a kilobyte of memory
00:24:39.000 | locally beside that multiply and accumulate engine.
00:24:41.520 | Why do I want that?
00:24:42.360 | Because accessing that very small memory
00:24:45.000 | can be very cheap.
00:24:46.200 | So for example, if performing a multiply and accumulate
00:24:50.160 | in the ALU costs 1x, reading from this very small memory
00:24:55.160 | beside the multiply and accumulate engine
00:24:57.000 | is also gonna cost about the same amount of energy.
00:24:59.880 | I could also allow these processing elements
00:25:02.800 | and a processing element is gonna be this multiply
00:25:04.920 | and accumulate plus the small memory.
00:25:06.520 | I can also allow the different processing elements
00:25:08.680 | to also share data, okay?
00:25:11.720 | And so reading from a neighboring processing element
00:25:14.040 | is gonna be 2X the energy.
00:25:16.200 | And then finally, you can have a shared larger memory
00:25:20.200 | called a global buffer.
00:25:22.080 | And that's gonna be able to be shared
00:25:24.120 | across all the different processing elements.
00:25:25.400 | This tends to be larger, between 100 and 500 kilobytes.
00:25:29.440 | And that's gonna be more expensive,
00:25:30.920 | about 6X the energy itself.
00:25:33.240 | And of course, if you go off chip to DRAM,
00:25:35.600 | that's gonna be the most expensive at 200X the energy.
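A minimal sketch of this cost model, using the approximate relative energies quoted above (register file 1x, neighboring PE 2x, global buffer 6x, DRAM 200x); the access counts in the example are illustrative only.

```python
# Relative energy per access, normalized to the cost of one MAC.
COST = {
    "register_file": 1,    # small (<1 kB) memory next to the MAC unit
    "neighbor_pe":   2,    # data passed from an adjacent processing element
    "global_buffer": 6,    # shared 100-500 kB on-chip buffer
    "dram":          200,  # off-chip memory
}

def access_energy(counts):
    """counts: dict mapping memory level -> number of accesses."""
    return sum(COST[level] * n for level, n in counts.items())

# Example: serving most accesses locally vs. all of them from DRAM.
mostly_local = access_energy({"register_file": 900, "global_buffer": 90, "dram": 10})
all_dram     = access_energy({"dram": 1000})
print(mostly_local, all_dram)   # 3440 vs. 200000, in units of MAC energy
```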
00:25:39.760 | Right, and so the big issue here is,
00:25:41.560 | the way that you can think about this
00:25:43.280 | is what you would ideally like to do
00:25:46.120 | is to access all of the data
00:25:48.520 | from this very small local memory.
00:25:51.240 | But the challenge here is that this very small local memory
00:25:54.120 | is only 1Kbyte.
00:25:55.520 | We're talking about neural networks
00:25:56.800 | that are millions of weights in terms of size, right?
00:26:00.600 | So how do we go about doing that?
00:26:02.200 | So there's many challenges of doing that.
00:26:04.120 | Just as an analogy for you guys
00:26:05.480 | to kind of think through how this is related,
00:26:07.040 | you can imagine that accessing something
00:26:10.200 | from like, let's say your backpack
00:26:11.680 | is gonna be much cheaper
00:26:13.320 | than accessing something from your neighbor,
00:26:15.720 | or going back to, let's say, your office here,
00:26:18.480 | somewhere on campus to get the data
00:26:20.160 | versus going back all the way home, right?
00:26:21.960 | So ideally, you'd like to access
00:26:23.560 | all of your data from your backpack,
00:26:24.960 | but if you have a lot of work to do,
00:26:26.240 | you might not be able to fill it in your backpack.
00:26:28.120 | So the question is,
00:26:28.960 | how can I break up my large piece of work
00:26:32.120 | into smaller chunks so that I can access them all
00:26:35.000 | from this small memory itself?
00:26:36.440 | And that's the big challenge that you have.
00:26:38.080 | And so there's been a lot of research in this area
00:26:40.800 | in terms of what's the best way to break up the data
00:26:43.040 | and what should I store in this very small local memory?
00:26:46.800 | So one approach is what we call a weight stationary.
00:26:49.560 | And the idea here is I'm gonna store
00:26:51.080 | the weight information of the neural net
00:26:53.280 | into this small local memory, okay?
00:26:56.480 | And so as a result, I really minimize the weight energy.
00:26:59.960 | But the challenge here is that
00:27:01.920 | the other types of data that you have in your system,
00:27:04.280 | so for example, your input activations shown in the blue,
00:27:07.280 | and then the partial sums that are shown in the red,
00:27:09.480 | now those still have to move
00:27:11.040 | through the rest of the system itself,
00:27:12.360 | so through the network and from the global buffer, okay?
00:27:15.760 | Typical types of work that are popular
00:27:17.840 | that use this type of kind of data flow
00:27:19.840 | or weight stationary data flow,
00:27:21.120 | which is what we call it
00:27:21.960 | 'cause the weight remains stationary,
00:27:23.400 | are things like the TPU from Google
00:27:25.720 | and the NVDA accelerator from NVIDIA.
00:27:28.440 | Another approach that people take,
00:27:31.240 | or they, well, they say,
00:27:32.080 | "Well, so the weight, I only ever have to read it.
00:27:35.320 | "But the partial sums, I have to read it and write it
00:27:38.760 | "'cause the partial sum I'm gonna read,
00:27:40.440 | "accumulate, like update it,
00:27:41.840 | "and then write it back to the memory."
00:27:42.960 | So there's two memory accesses
00:27:44.440 | for that partial sum data type.
00:27:46.480 | So what, maybe I should put that partial sum
00:27:50.280 | locally into that small memory itself.
00:27:52.480 | So this is what we call output stationary
00:27:53.960 | 'cause the accumulation of the output
00:27:55.960 | is gonna be local within that one processing element.
00:27:58.400 | That's not gonna move.
00:27:59.680 | The trade-off, of course, is the activations of weights
00:28:02.960 | now have to move through the network.
00:28:05.080 | And then there's various different works called,
00:28:06.840 | like for example, some work from KU Leuven
00:28:09.560 | and some work from the Chinese Academy of Sciences
00:28:13.000 | that have been using this approach.
00:28:15.240 | Another piece of work is saying,
00:28:16.760 | "Well, forget about the inputs and the,
00:28:19.560 | "or so the outputs and the weights themselves.
00:28:22.680 | "Let's keep the input stationary within this small memory."
00:28:26.680 | And it's called input stationary.
00:28:28.400 | And some of the work, again,
00:28:29.960 | from some research work from NVIDIA has examined this.
00:28:33.160 | But all of these different types of work
00:28:34.560 | really focus on not moving one piece of type of data.
00:28:38.680 | Either focus on minimizing weight energy
00:28:41.680 | or out partial sum energy or input energy.
00:28:44.680 | I think what's important to think about
00:28:46.680 | is that maybe you wanna reduce the data movement
00:28:49.160 | of all different data types, all types of energy.
00:28:51.640 | So another approach,
00:28:52.600 | this is something that we've developed within our own group,
00:28:54.680 | is looking at what we call the row stationary data flow.
00:28:57.360 | And within each of the processing elements,
00:28:59.560 | you're gonna do one row of convolution.
00:29:04.040 | And this row is a mixture
00:29:05.520 | of all the different data types.
00:29:07.040 | So you have filter information,
00:29:08.360 | so the weights of the filter.
00:29:09.880 | You have the activations of your input feature map.
00:29:13.320 | And then you also have your partial sum information.
00:29:15.640 | So you're really trying to balance the data movement
00:29:18.200 | of all the different data types,
00:29:19.760 | not just one particular data type.
00:29:22.360 | This is just performing a one row,
00:29:23.840 | but we just talked about the fact that the neural network
00:29:26.520 | is much more than a 1D convolution.
00:29:28.400 | So you can imagine expanding this to higher dimensions.
00:29:32.440 | So this is just showing how you might expand
00:29:34.520 | this 1D convolution into a 2D convolution.
00:29:37.480 | And then there's other higher dimensionality
00:29:39.520 | that you can map onto this architecture as well.
00:29:42.200 | I won't get into the details of this,
00:29:43.520 | but the key takeaway here is that
00:29:45.480 | you might not wanna focus on one particular data type.
00:29:48.400 | You wanna actually optimize for all the different types
00:29:51.440 | of data that you're moving around in your system.
00:29:53.920 | Okay?
00:29:54.760 | And this can just show you some results
00:29:57.720 | in terms of how these different data types,
00:29:59.560 | or these different types of data flows would work.
00:30:02.520 | So for example, in the weight stationary case,
00:30:04.480 | as expected, the weight energy,
00:30:06.200 | the energy required to move the weights,
00:30:08.000 | shown in green, is gonna be the lowest.
00:30:10.360 | But then the red portion,
00:30:11.520 | which is the energy of the partial sums,
00:30:13.920 | and the green, or sorry, the blue part,
00:30:16.760 | which is the input feature map or input pixels,
00:30:19.400 | that's gonna be very high.
00:30:21.400 | Output stationary is another approach,
00:30:23.440 | as we talked about,
00:30:24.280 | you're trying to reduce the data movement
00:30:25.760 | of the partial sums, shown here in red.
00:30:28.000 | So the red part is really minimized,
00:30:29.600 | but you can see that the green part,
00:30:31.240 | which is the weight stationary data movement,
00:30:33.600 | or weight movement, is gonna be increased,
00:30:35.560 | and the blue is the input's gonna be increased.
00:30:39.640 | There's another approach called no local reuse,
00:30:41.440 | we don't have time to talk about that,
00:30:43.280 | but you can see that row stationary, for example,
00:30:45.160 | really aims to balance the data movement
00:30:47.880 | of all the different data types.
00:30:49.960 | Right, so the big takeaway here is that,
00:30:51.680 | you know, when you're trying to optimize,
00:30:53.720 | you know, a given piece of hardware,
00:30:55.640 | you don't wanna just optimize one,
00:30:57.280 | you know, for one particular type of data,
00:30:59.080 | you wanna optimize overall for all the movement
00:31:01.520 | in the hardware itself.
00:31:03.360 | Okay, another thing that you can also exploit
00:31:06.320 | to save a bit of power,
00:31:08.280 | is the fact that, you know, some of the data could be zero.
00:31:11.120 | So we know that anything multiplied by zero
00:31:14.720 | is gonna be zero, right?
00:31:16.640 | So if you know that one of the inputs
00:31:18.840 | to your multiply and accumulate is gonna be zero,
00:31:21.280 | you might as well skip that multiplication.
00:31:23.440 | In fact, you might as well skip, you know,
00:31:25.240 | accessing data or accessing the other input
00:31:28.000 | to that multiply and accumulate engine.
00:31:29.840 | So by doing that, you can actually
00:31:32.200 | reduce the power consumption by almost 50%.
00:31:36.440 | Another thing that you can do,
00:31:38.000 | is that if you have a bunch of zeros,
00:31:40.560 | you can also compress the data.
00:31:43.040 | For example, you can use things like run length encoding,
00:31:46.080 | which where basically a run of zeros
00:31:48.040 | is gonna be represented rather than, you know,
00:31:49.600 | zero, zero, zero, zero, zero,
00:31:50.880 | you can just say I have a run of five zeros.
00:31:53.000 | And this can actually reduce the amount of data movement
00:31:55.360 | by up to two X in your system itself.
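A minimal sketch of that run-length-encoding idea, using an illustrative (zero-run, value) format rather than the exact encoding used by any particular chip:

```python
def rle_encode(values):
    # Store each run of zeros as a count instead of storing every zero.
    encoded, zeros = [], 0
    for v in values:
        if v == 0:
            zeros += 1
        else:
            encoded.append((zeros, v))  # (number of zeros preceding v, v)
            zeros = 0
    if zeros:
        encoded.append((zeros, None))   # trailing run of zeros
    return encoded

# ReLU outputs tend to contain many zeros, so the encoded form is shorter.
print(rle_encode([0, 0, 0, 5, 0, 0, 7, 0, 0, 0, 0]))
# [(3, 5), (2, 7), (4, None)]
```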
00:31:59.000 | And in fact, in, you know, neural nets,
00:32:00.840 | there's a large number of, you know,
00:32:02.120 | possibilities for actually generating zeros.
00:32:03.920 | First of all, if you remember the ReLU,
00:32:06.040 | it's setting negative values to zero,
00:32:07.880 | so it naturally generates zeros.
00:32:09.440 | And then there's other techniques, for example,
00:32:11.200 | we call pruning, which is setting some of the weights
00:32:13.280 | of the neural net to zero as well.
00:32:14.480 | And so this can exploit all of that.
00:32:16.760 | Okay, so, you know, what is the impact
00:32:19.040 | of all these types of things?
00:32:20.120 | So we actually looked at building hardware
00:32:22.840 | in particular a customized chip that we called Eyeriss
00:32:25.720 | to demonstrate these particular approaches,
00:32:28.080 | in particular the row stationary data flow
00:32:30.840 | and exploiting sparsity in the activation data.
00:32:34.280 | So this Eyeriss chip has 14 by 12,
00:32:37.920 | so 168 processing elements.
00:32:40.360 | You can see that there's a shared buffer
00:32:43.040 | that's 100 kilobytes,
00:32:44.400 | and it has some compression, decompression
00:32:46.160 | before it goes to off-chip DRAM.
00:32:47.960 | And again, that's because accessing DRAM
00:32:49.680 | is the most expensive.
00:32:51.480 | Shown here on the right-hand side
00:32:53.360 | is a die photo of the fabricated chip itself, right?
00:32:56.800 | And this is four millimeters by four millimeters
00:32:59.400 | in terms of size.
00:33:00.600 | And so using that, you know, row stationary data flow,
00:33:04.040 | it exploits a lot of data reuse.
00:33:05.920 | So it actually reduces the number of times
00:33:08.880 | we access this global buffer by 100x.
00:33:12.600 | And it also reduces the amount of times
00:33:15.000 | we access the off-chip memory by over 1000x.
00:33:18.360 | This is all because, you know,
00:33:19.640 | each of these processing elements has, you know,
00:33:21.880 | a local memory that is trying to read
00:33:24.120 | most of its data from,
00:33:25.200 | and it's also sharing with other processing elements.
00:33:27.760 | So overall, when you compare it to a mobile GPU,
00:33:30.080 | you're talking about an order of magnitude reduction
00:33:32.840 | in energy consumption.
00:33:34.080 | If you'd like to learn a little bit more about that,
00:33:36.760 | I invite you to visit the Eyeriss project website.
00:33:40.640 | Okay, so this is great.
00:33:41.560 | We can build custom hardware,
00:33:42.840 | but what does this actually mean
00:33:44.680 | in terms of, you know, building a system
00:33:46.600 | that can efficiently compute neural nets?
00:33:48.840 | So let's say we take a step back.
00:33:50.360 | Let's say we don't care anything about the hardware,
00:33:52.720 | and we're, you know, a systems provider.
00:33:54.440 | We want to build, you know, an overall system.
00:33:56.160 | And what we really care about
00:33:57.880 | is the trade-off between energy and accuracy, right?
00:34:02.200 | That's the key thing that we care about.
00:34:04.680 | So shown here is a plot,
00:34:06.280 | and let's say this is for an object detection task, right?
00:34:08.560 | So accuracy is on the x-axis,
00:34:13.080 | and it's listed in terms of average precision,
00:34:15.680 | which is a metric that we use for object detection.
00:34:18.360 | It's on a linear scale, and higher, the better.
00:34:20.720 | Vertically, we have energy consumption.
00:34:25.280 | This is the energy that's being consumed per pixel.
00:34:27.760 | So you kind of average it.
00:34:28.640 | I can imagine a higher-resolution image
00:34:30.280 | can consume more energy.
00:34:31.560 | It's going to be an exponential scale.
00:34:33.280 | So let's first start on the accuracy axis.
00:34:37.840 | And so if you think before neural nets, you know,
00:34:40.240 | had its resurgence in around 2011, 2012,
00:34:43.400 | actually state-of-the-art approaches
00:34:44.880 | used features called histogram of oriented gradients, right?
00:34:49.320 | This was a very popular approach and quite accurate
00:34:52.280 | in terms of object detection.
00:34:55.560 | And we refer to as HOG.
00:34:57.720 | The reason why neural nets really took off
00:35:00.000 | is 'cause they really improved the accuracy.
00:35:01.480 | So you can imagine AlexNet here almost doubled the accuracy,
00:35:05.160 | and then VGG further increased the accuracy.
00:35:08.480 | So it's super exciting there.
00:35:10.800 | But then we want to look also on the vertical axis,
00:35:14.520 | which is the energy consumption.
00:35:16.280 | And I should mention, you know,
00:35:17.880 | basically you'll see these dots.
00:35:19.280 | We have the energy consumption
00:35:20.400 | for each of these different approaches.
00:35:21.920 | These approaches are actually measured,
00:35:24.360 | or these energy numbers are measured
00:35:25.680 | on specialized hardware already
00:35:27.760 | that's been designed for that particular task.
00:35:30.560 | So we have a chip here that's built
00:35:33.400 | in a 65-nanometer CMOS process.
00:35:35.520 | So they use the same transistors, around the same size,
00:35:37.840 | and it does object detection using the HOG features.
00:35:40.800 | And then here's the Iris chip that we just talked about.
00:35:43.360 | I should also note that both of these chips
00:35:45.040 | were built in my group.
00:35:46.160 | The students who built these chips, you know,
00:35:47.880 | started designing the chips at the same time
00:35:50.360 | and taped out at the same time.
00:35:51.440 | So it's somewhat of a controlled experiment
00:35:53.040 | in terms of optimization.
00:35:55.360 | Okay, so what does this tell us
00:35:56.480 | when we look on the energy axis?
00:35:58.360 | We can see that histogram of oriented gradients,
00:36:01.240 | or HOG features, are actually very efficient
00:36:03.640 | from an energy point of view.
00:36:05.040 | In fact, if we compare it to something like
00:36:07.000 | video compression, again, something that you all have
00:36:09.800 | in your phone, HOG features are actually more efficient
00:36:12.960 | than video compression, meaning for the same energy
00:36:16.000 | that you would spend compressing a pixel,
00:36:18.480 | you could actually understand that pixel.
00:36:20.680 | So that's pretty impressive.
00:36:22.920 | But if we start looking at AlexNet or VGG,
00:36:26.440 | we can see that the energy increases
00:36:28.680 | by two to three orders of magnitude,
00:36:30.920 | which is quite significant.
00:36:32.520 | I'll give you an example.
00:36:33.560 | So if I told you on your cell phone,
00:36:35.720 | I'm gonna double the accuracy of its recognition,
00:36:39.080 | but your phone would die 300 times faster,
00:36:42.240 | who here would be interested in that technology?
00:36:44.640 | Right, so exactly, so nobody, right?
00:36:47.760 | So in the sense that battery life is so critical
00:36:50.200 | to how we actually use these types of technologies.
00:36:53.920 | So we should not just look at the accuracy,
00:36:56.800 | which is the x-axis point of view,
00:36:57.960 | we should really also consider the energy consumption,
00:37:01.000 | and we really don't want the energy to be so high.
00:37:03.480 | And we can see that even with specialized hardware,
00:37:06.180 | we're still quite far away from making neural nets
00:37:10.080 | as efficient as something like video compression
00:37:13.280 | that you all have on your phones.
00:37:14.800 | So we really have to think of how we can further
00:37:17.440 | push the energy consumption down
00:37:20.560 | without sacrificing accuracy, of course.
00:37:23.600 | So actually, there's been a huge amount of research
00:37:25.440 | in this space, because we know neural nets are popular,
00:37:28.120 | and we know that they have a wide range of applications,
00:37:29.920 | but energy's really a big challenge.
00:37:31.640 | So people have looked at how can we design new hardware
00:37:35.240 | that can be more efficient, or how can we design algorithms
00:37:38.240 | that are more efficient to enable energy-efficient
00:37:40.440 | processing of DNNs.
00:37:41.720 | And so in fact, within our own research group,
00:37:43.660 | we spend quite a bit of time kind of surveying the area
00:37:46.440 | and understanding what are the various different types
00:37:48.520 | of developments that people have been looking at.
00:37:50.440 | So if you're interested in this topic,
00:37:51.840 | we actually generated various tutorials on this material,
00:37:55.440 | as well as overview papers.
00:37:57.280 | This is an overview paper that's about 30 pages
00:37:59.840 | and we're currently expanding it into a book.
00:38:01.720 | So if you're interested in this topic,
00:38:02.920 | I would encourage you to visit these resources.
00:38:05.520 | But the main thing that we learned about
00:38:07.120 | as we were doing this kind of survey of the area,
00:38:10.000 | is that we actually identified various limitations
00:38:12.540 | in terms of how people are approaching
00:38:14.920 | or how the research is approaching this problem.
00:38:17.560 | So first let's look on the algorithm side.
00:38:20.680 | So again, there's a wide range of approaches
00:38:22.840 | that people are using to try and make the DNN algorithms
00:38:25.680 | or models more efficient.
00:38:27.240 | So for example, we've kind of mentioned
00:38:28.920 | the idea of pruning.
00:38:30.320 | The idea here is you're gonna set some of the weights
00:38:32.560 | to become zero, and again, anything times zero is zero,
00:38:36.120 | so you can skip those operations.
00:38:38.040 | So there's a wide range of research there.
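As a concrete example of the pruning idea, here is a minimal sketch of magnitude-based pruning; the threshold and weight values are purely illustrative.

```python
import numpy as np

def prune(weights, threshold=0.05):
    # Zero out weights whose magnitude falls below the threshold, so the
    # corresponding MACs (and data accesses) can be skipped.
    mask = np.abs(weights) >= threshold
    return weights * mask, 1.0 - mask.mean()   # pruned weights, fraction removed

w = np.array([0.40, -0.01, 0.03, -0.22, 0.002, 0.18])
pruned, sparsity = prune(w)
print(pruned)    # [ 0.4  -0.    0.   -0.22  0.    0.18]
print(sparsity)  # 0.5 -> half of the weights are now zero
```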
00:38:40.440 | There's also looking at efficient network architectures,
00:38:43.080 | meaning rather than making my neural networks very large,
00:38:45.240 | these high three-dimensional convolutions,
00:38:48.180 | can I decompose them into smaller filters?
00:38:50.920 | So rather than this 3D filter, can I make it a 2D filter
00:38:54.200 | that operates per channel, plus a one-by-one filter
00:38:57.080 | that runs along the channel dimension, into the screen here?
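As a rough illustration of why this decomposition helps, here is a small sketch that counts multiply-accumulates for a standard 3x3 convolution versus a depthwise 3x3 followed by a one-by-one (pointwise) convolution; the layer shape is a made-up example.

```python
def macs_standard_conv(h, w, c_in, c_out, k=3):
    # every output pixel needs a full k x k x c_in dot product per output channel
    return h * w * c_out * k * k * c_in

def macs_depthwise_separable(h, w, c_in, c_out, k=3):
    depthwise = h * w * c_in * k * k       # one k x k filter per input channel
    pointwise = h * w * c_out * c_in       # 1x1 conv mixes the channels
    return depthwise + pointwise

h, w, c_in, c_out = 56, 56, 128, 128      # hypothetical layer shape
std = macs_standard_conv(h, w, c_in, c_out)
sep = macs_depthwise_separable(h, w, c_in, c_out)
print(f"standard: {std:,} MACs, depthwise separable: {sep:,} MACs, ~{std/sep:.1f}x fewer")
```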
00:38:59.000 | Another very popular thing is reduced precision.
00:39:01.240 | So rather than using the default of 32-bit float,
00:39:04.400 | can I reduce the number of bits down to eight bits
00:39:07.320 | or even binary?
00:39:08.160 | We saw before that as we reduce the precision
00:39:10.920 | of these operations, you also get energy savings,
00:39:12.980 | and you also reduce data movement as well
00:39:14.840 | 'cause you have to move less data.
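A minimal sketch of what reduced precision looks like in practice: uniform 8-bit quantization of a float32 weight tensor with a single scale factor. This is a simplified symmetric scheme; real quantizers add per-channel scales, zero points, and often quantization-aware training.

```python
import numpy as np

def quantize_int8(w):
    """Map float32 weights to int8 with one scale factor (simplified symmetric scheme)."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(w - dequantize(q, s)).mean()
print(f"storage: {w.nbytes} -> {q.nbytes} bytes (4x less data to move), mean abs error {err:.4f}")
```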
00:39:16.800 | A lot of this work really focuses on reducing
00:39:19.220 | the number of MACs and the number of weights,
00:39:23.080 | and those are primarily because those are easy to count.
00:39:25.980 | But the question that we should be asking
00:39:27.720 | if we care about the system is does this actually translate
00:39:31.280 | into energy savings and reduced latency?
00:39:33.760 | Because from a system's point of view,
00:39:35.640 | those are the things that we care about.
00:39:37.640 | We don't really, when you're thinking about something
00:39:39.400 | running on your phone, you don't care about the number
00:39:40.800 | of MACs and weights, you care about how much energy
00:39:42.520 | it's consuming 'cause that's gonna affect the battery life,
00:39:44.760 | or how quickly it might react.
00:39:47.840 | That's basically a measure of latency.
00:39:49.880 | And again, hopefully you haven't forgotten,
00:39:51.560 | but basically data movement is expensive.
00:39:53.680 | It really depends on how you move the data
00:39:58.080 | through the system.
00:39:58.900 | So the key takeaway from this slide is that if you remember
00:40:01.520 | where the energy comes from, which is the data movement,
00:40:04.360 | it's not because of how many weights or how many MACs you
00:40:06.960 | have, but really it depends on where the weight comes from.
00:40:10.320 | If it comes from this small memory register file
00:40:14.240 | that's nearby, it's gonna be super cheap as opposed
00:40:16.680 | to coming from off-chip DRAM.
00:40:18.720 | So all weights are basically not created equal,
00:40:21.160 | all MACs are not created equal.
00:40:22.400 | It really depends on the memory hierarchy
00:40:24.240 | and the data flow of the hardware itself.
00:40:26.240 | So we can't just look at the number of weights
00:40:29.720 | and the number of MACs and estimate how much energy
00:40:32.280 | is gonna be consumed.
00:40:33.960 | So this is quite a difficult challenge.
00:40:35.440 | So within our group, we've actually looked
00:40:37.360 | at developing different tools that allow us
00:40:39.320 | to estimate the energy consumption
00:40:41.480 | of the neural network itself.
00:40:42.840 | So for example, in this particular tool,
00:40:44.360 | which is available on this website,
00:40:46.760 | we basically take in the DNN weights and the input data,
00:40:50.320 | including its sparsity.
00:40:51.920 | We know the different shapes of the different layers
00:40:55.480 | of the neural net, and we run an optimization
00:40:57.600 | that figures out the memory access,
00:40:59.440 | how much energy consumed by the data movement,
00:41:01.880 | and then the energy consumed by the multiply
00:41:04.040 | and accumulate computations,
00:41:06.280 | and then the output is gonna be a breakdown
00:41:08.240 | of the energy for the different layers
00:41:10.160 | of the neural network.
00:41:11.360 | And once you have this, you can kind of figure out,
00:41:13.400 | well, where is the energy going so I can target my design
00:41:16.800 | to minimize that energy consumption?
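The actual tool performs a hardware-aware optimization, but the flavor of its output can be sketched with a toy model: charge each data access a cost that depends on which level of the memory hierarchy it comes from, add the MAC energy, and report a per-layer breakdown. The relative costs and access counts below are illustrative placeholders, not measured numbers.

```python
# Illustrative relative energy costs (normalized to one MAC); real values come from the hardware.
COST = {"mac": 1.0, "register_file": 1.0, "global_buffer": 6.0, "dram": 200.0}

def layer_energy(macs, accesses):
    """accesses: dict mapping memory level -> number of reads/writes for this layer."""
    energy = macs * COST["mac"]
    for level, count in accesses.items():
        energy += count * COST[level]
    return energy

layers = {
    "conv1": (1.0e8, {"register_file": 3.0e8, "global_buffer": 2.0e7, "dram": 1.0e6}),
    "conv2": (2.5e8, {"register_file": 7.5e8, "global_buffer": 5.0e7, "dram": 4.0e6}),
}
for name, (macs, accesses) in layers.items():
    print(name, f"{layer_energy(macs, accesses):.3e} (normalized energy units)")
```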
00:41:18.800 | Okay, and so by doing this, when we take a look,
00:41:23.280 | it should be no surprise, one of the key observations
00:41:25.480 | for this exercise is that the weights alone
00:41:28.720 | are not a good metric for energy consumption.
00:41:31.120 | If you take a look at GoogLeNet, for example,
00:41:34.400 | this is running on the Eyeriss architecture,
00:41:36.160 | you can see that the weights only account
00:41:37.960 | for 22% of the overall energy.
00:41:41.320 | In fact, a lot of the energy goes
00:41:43.120 | into moving the input feature maps
00:41:44.720 | and the output feature maps as well, right?
00:41:47.000 | And also computation.
00:41:48.080 | So in general, this is the same message as before.
00:41:50.800 | We shouldn't just look at the data movement
00:41:53.120 | in one particular data type.
00:41:54.440 | We should look at the energy consumption
00:41:55.800 | of all the different data types
00:41:57.080 | to give us an overall view
00:41:58.760 | of where the energy's actually going.
00:42:01.200 | Okay, and so once we actually know
00:42:03.240 | where the energy is going, how can we factor that
00:42:06.240 | into the design of the neural networks
00:42:08.040 | to make them more efficient?
00:42:09.800 | So we talked about the concept of pruning, right?
00:42:13.160 | So again, pruning was setting some of the weights
00:42:15.400 | of the neural net to zero, or you can think of it
00:42:17.280 | as removing some of the weights.
00:42:18.720 | And so what we wanna do here is that now we know
00:42:21.160 | that we know where the energy is going,
00:42:23.080 | why don't we incorporate the energy
00:42:25.320 | into the design of the algorithm,
00:42:27.120 | for example, to guide us to figure out
00:42:29.080 | where we should actually remove the weights from?
00:42:31.560 | You know, so for example,
00:42:33.160 | let's say here, this is on AlexNet
00:42:36.680 | for the same accuracy across the different approaches.
00:42:39.040 | Traditionally, what happens is that people tend
00:42:41.120 | to remove the weights that are small.
00:42:43.120 | And we call this magnitude-based pruning,
00:42:45.720 | and you can see that you get about a 2x reduction
00:42:48.720 | in terms of energy consumption.
00:42:50.680 | However, we know that like the size of the weight
00:42:53.200 | has nothing to do with, or the value of the weight
00:42:54.920 | has nothing to do with the energy consumption.
00:42:56.400 | Ideally, what you'd like to do is remove the weights
00:42:59.680 | that consume the most energy, right?
00:43:02.160 | In particular, we also know that the more weights
00:43:04.000 | that we remove, the accuracy is gonna go down.
00:43:07.480 | So to get the biggest bang for your buck,
00:43:08.760 | you wanna remove the weights
00:43:09.800 | that consume the most energy first.
00:43:11.760 | One way you can do this is you can take your neural network,
00:43:15.560 | figure out the energy consumption
00:43:17.320 | of each of the layers of the neural network.
00:43:19.560 | You can sort, then sort the layers
00:43:21.440 | in terms of high energy layer to low energy layers,
00:43:25.280 | and then you prune the high energy layers first.
00:43:28.320 | So this is what we call energy-aware pruning.
00:43:30.360 | And then by doing this, you actually now get
00:43:32.720 | a 3.7x reduction in energy consumption
00:43:35.640 | compared to 2x for the same accuracy.
00:43:38.240 | And again, this is because we factor in energy consumption
00:43:41.520 | into the design of the neural network itself.
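A sketch of the ordering idea behind energy-aware pruning: estimate each layer's energy (with a model like the one above), sort layers from most to least energy, and prune the high-energy layers first, keeping a change only if accuracy holds. The prune_layer and evaluate helpers are hypothetical placeholders supplied by the training framework.

```python
def energy_aware_prune(model, layer_energies, accuracy_floor, prune_layer, evaluate):
    """Prune layers in order of decreasing estimated energy, preserving a minimum accuracy.

    layer_energies: dict layer_name -> estimated energy
    prune_layer, evaluate: hypothetical helpers (prune some weights; measure accuracy).
    """
    order = sorted(layer_energies, key=layer_energies.get, reverse=True)  # high energy first
    for layer in order:
        candidate = prune_layer(model, layer)          # remove some weights from this layer
        if evaluate(candidate) >= accuracy_floor:      # keep the change only if accuracy holds
            model = candidate
    return model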
00:43:43.880 | All right, and the pruned models
00:43:46.800 | are all available on the Eyeriss website.
00:43:49.760 | Another important thing that we care about
00:43:52.120 | from a performance point of view is latency, right?
00:43:55.240 | So for example, latency has to do with how long it takes
00:43:58.000 | when I give it an image, how long will I get the result back?
00:44:01.720 | People are very sensitive to latency.
00:44:04.400 | But the challenge here is that latency, again,
00:44:06.200 | is not directly correlated to things
00:44:08.040 | like number of multiplies and accumulates.
00:44:10.040 | And so this is some data that was released
00:44:11.680 | by Google's Mobile Vision team,
00:44:13.880 | and they're showing here on the x-axis
00:44:17.320 | the number of multiplies and accumulates.
00:44:19.760 | So going along this axis, the number of MACs is increasing.
00:44:22.520 | And then on the y-axis, this is the latency.
00:44:25.560 | So this is actually the measured latency
00:44:28.080 | or delay it takes to get a result.
00:44:30.440 | And what they're showing here is that the number of MACs
00:44:33.520 | is not really a good approximation of latency.
00:44:35.920 | So in fact, for example, given
00:44:39.560 | neural networks that have the same number of MACs,
00:44:41.880 | there can be a 2x range or 2x swing in terms of latency.
00:44:45.920 | Or looking at it in a different way,
00:44:47.640 | given neural nets of the same latency,
00:44:50.680 | they can have a 3x swing in terms of number of MACs.
00:44:55.240 | So the key takeaway here is that you can't just count
00:44:57.120 | the number of MACs and say,
00:44:58.000 | oh, this is how quickly it's gonna run.
00:44:59.880 | It's actually much more challenging than that.
00:45:04.440 | And so what we want to ask is,
00:45:06.680 | is there a way that we can take latency
00:45:09.360 | and use that again to design the neural net directly?
00:45:12.200 | So rather than looking at MACs, use latency.
00:45:14.720 | And so together with Google's Mobile Vision team,
00:45:17.800 | we developed this approach called NetAdapt.
00:45:20.120 | And this is really a way that you can tailor
00:45:22.000 | your particular neural network for a given mobile platform
00:45:25.720 | for a latency or an energy budget.
00:45:27.760 | So it automatically adapts the neural net
00:45:29.520 | for that platform itself.
00:45:30.760 | And really what's driving the design
00:45:33.320 | is empirical measurements.
00:45:34.720 | So measurements of how that particular network
00:45:37.800 | performs on that platform.
00:45:39.760 | So measurements for things like latency and energy.
00:45:42.560 | And the reason why we want to use empirical measurements
00:45:44.480 | is that you can't often generate models
00:45:46.880 | for all the different types of hardware out there.
00:45:48.960 | In the case of Google, what they want is that,
00:45:51.400 | if they have a new phone, you can automatically tune
00:45:53.880 | the network for that particular phone.
00:45:55.400 | You don't want to have to model the phone as well.
00:45:57.640 | Okay, and so how does this work?
00:45:59.200 | I'll walk you through it.
00:46:00.040 | So you'll start off with a pre-trained network.
00:46:01.960 | So this is a network that's, let's say,
00:46:03.440 | trained in the cloud for very high accuracy.
00:46:07.000 | Great, start off with that,
00:46:08.400 | but it tends to be very large, let's say.
00:46:10.480 | And so what you're gonna do is you're gonna take that
00:46:12.280 | into the NetAdapt algorithm.
00:46:14.200 | You're gonna take a budget.
00:46:15.320 | So a budget will tell you like,
00:46:16.400 | oh, I can afford only this type of latency
00:46:18.800 | or this amount of latency, this amount of energy.
00:46:21.280 | What NetAdapt will do is gonna generate
00:46:23.520 | a bunch of proposals, so different options
00:46:26.000 | of how it might modify the network
00:46:27.720 | in terms of its dimensions.
00:46:29.400 | It's gonna measure these proposals
00:46:31.080 | on that target platform that you care about.
00:46:34.840 | And then based on these empirical measurements,
00:46:36.960 | NetAdapt is gonna then generate a new set of proposals.
00:46:39.840 | And it'll just iterate across this
00:46:41.880 | until it gets an adapted network as an output.
00:46:45.360 | Okay, and again, all of this is on the NetAdapt website.
00:46:48.400 | Just to give you a quick example of how this might work.
00:46:50.480 | So let's say you start off with, as your input,
00:46:53.320 | a neural network that has the accuracy that you want,
00:46:56.400 | but the latency is 100 milliseconds,
00:46:58.600 | and you would like for it to be 80 milliseconds.
00:47:00.800 | You want it to be faster.
00:47:02.440 | So what it's gonna do is it's gonna generate
00:47:04.160 | a bunch of proposals.
00:47:05.440 | And what the proposals could involve doing
00:47:07.480 | is taking one layer of the neural net
00:47:09.840 | and reducing the number of channels
00:47:11.640 | until it hits the latency budget of 80 milliseconds.
00:47:15.760 | And it can do that for all the different layers.
00:47:18.520 | Then it's gonna tune these different layers
00:47:20.160 | and measure the accuracy.
00:47:22.160 | Right, so let's say, oh, this one where I just
00:47:24.480 | shortened the number of channels in layer one
00:47:26.920 | maintains accuracy at 60%.
00:47:28.680 | So that means I'm gonna pick that,
00:47:30.120 | and that's gonna be the input,
00:47:31.720 | or the output of this particular design.
00:47:34.040 | So the output at 80 milliseconds
00:47:36.520 | hitting an accuracy of 60%,
00:47:37.880 | and it's gonna be the input to the next iteration.
00:47:39.840 | And then I'm gonna tighten the budget.
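A highly simplified sketch of that loop, assuming hypothetical helpers for measuring latency on the target device, shrinking a layer to meet the budget, short fine-tuning, and evaluating accuracy; the real NetAdapt algorithm has more machinery (lookup tables, long-term fine-tuning), so treat this as the skeleton only.

```python
def netadapt(network, latency_target, step, measure_latency, shrink_layer, finetune, accuracy):
    """Iteratively adapt `network` until its measured latency meets `latency_target`."""
    budget = measure_latency(network)
    while budget > latency_target:
        budget -= step                                   # tighten the budget each iteration
        proposals = []
        for layer in network.layers:                     # one proposal per layer (assumed attribute)
            candidate = shrink_layer(network, layer, budget, measure_latency)
            proposals.append(finetune(candidate))        # short fine-tune before comparing
        network = max(proposals, key=accuracy)           # keep the most accurate proposal
    return network
```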
00:47:41.880 | Okay, again, if you're interested,
00:47:43.520 | I just invite you to go take a look at the NetAdapt paper.
00:47:46.120 | But what are the, what is the impact
00:47:48.280 | of this particular approach?
00:47:49.720 | Well, it gives you actually a very much improved trade-off
00:47:52.920 | between latency and accuracy, right?
00:47:55.760 | So if you look at this plot again,
00:47:57.560 | on the x-axis is the latency, right?
00:48:00.520 | So to the left is better, so it's lower latency.
00:48:04.440 | And then on the y-axis,
00:48:07.400 | it's gonna be the accuracy, so higher is better.
00:48:09.120 | So here, higher and to the left is good.
00:48:12.600 | And so we have first shown in blue and green
00:48:15.120 | various kind of handcrafted neural network-based approaches.
00:48:18.960 | And you can see NetAdapt, which generates the red dots
00:48:22.920 | as it's iterating through its optimization.
00:48:25.240 | And you can see that it achieves,
00:48:27.280 | for the same accuracy, it can be up to 1.7x faster
00:48:31.240 | than a manually designed approach.
00:48:34.160 | This approach also falls under the umbrella of
00:48:38.360 | basically network architecture search;
00:48:39.800 | it's kind of in that same flavor.
00:48:42.160 | But in general, the takeaway here is that
00:48:43.960 | if you're gonna design neural networks
00:48:45.680 | or efficient neural networks,
00:48:47.320 | that you wanna run quickly or you wanna be energy efficient,
00:48:50.360 | you should really take, you know,
00:48:51.640 | put hardware into the design loop
00:48:53.320 | and take in, you know, the accurate energy
00:48:56.160 | or latency measurements into the design itself
00:48:58.160 | of the neural network.
00:48:59.280 | This particular, you know, example here is shown
00:49:02.960 | for an image classification task,
00:49:04.920 | meaning I give you an image
00:49:06.240 | and you can classify it to the right.
00:49:08.720 | You can say what's in the image itself.
00:49:10.560 | You can imagine that that type of approach
00:49:12.120 | is kind of like reducing information, right?
00:49:14.000 | From a 2D image, you reduce it down to a label.
00:49:16.840 | This is very commonly used.
00:49:19.000 | But we actually want to see if we can still apply
00:49:20.720 | this approach to a more difficult task
00:49:23.120 | of something like depth estimation.
00:49:24.840 | In this case, you know, I give you a 2D image
00:49:27.640 | and the output is also a 2D image
00:49:29.720 | where each pixel shows the depth,
00:49:33.160 | or, you know, the output picture
00:49:34.680 | is basically showing the depth of each pixel of the input.
00:49:37.720 | This is often what we'd refer to as, you know,
00:49:40.040 | monocular depth.
00:49:41.000 | So I give you just a 2D image
00:49:44.680 | as input and you can estimate the depth itself.
00:49:46.400 | The reason why you want to do this is, you know,
00:49:47.960 | 2D cameras, regular cameras are pretty cheap, right?
00:49:50.920 | So it'd be ideal to be able to do this.
00:49:52.880 | You can imagine like the way that we would do this
00:49:55.880 | is to use an autoencoder.
00:49:57.200 | So the front half of the neural net
00:49:59.040 | is still looking like a, what we call an encoder.
00:50:01.360 | It's a reduction element.
00:50:02.960 | So this is very similar to what you would do
00:50:04.560 | for a classification, but then the backend
00:50:06.800 | of the autoencoder is a decoder.
00:50:09.080 | So it's going to expand the information back out, right?
00:50:11.760 | And so, as I mentioned, again,
00:50:12.840 | this is going to be much more difficult
00:50:14.280 | than just classification because now my output
00:50:17.160 | has to be also very dense as well.
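A minimal PyTorch sketch of the encoder-decoder shape being described: a small encoder that downsamples, and a decoder that upsamples back to a dense per-pixel depth map. This is only the structural idea, not the actual FastDepth architecture (which uses a MobileNet encoder and depthwise-separable decoder layers).

```python
import torch
import torch.nn as nn

class TinyDepthNet(nn.Module):
    """Toy encoder-decoder: RGB image in, one depth value per pixel out."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(               # downsample: 'reduce' the information
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(               # upsample: expand back to a dense map
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

depth = TinyDepthNet()(torch.randn(1, 3, 224, 224))
print(depth.shape)  # torch.Size([1, 1, 224, 224])
```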
00:50:19.160 | And so we want to see if we could make this really fast
00:50:22.800 | with approaches that we just talked about,
00:50:24.440 | for example, NetAdapt.
00:50:26.280 | So indeed you can make it pretty fast.
00:50:28.120 | So if you apply NetAdapt plus the, you know,
00:50:30.240 | compact network design and then do some
00:50:32.240 | depth-wise decomposition, you can actually increase
00:50:36.200 | the frame rate by an order of magnitude.
00:50:37.920 | So again, here I'm going to show the plot.
00:50:39.560 | On the x-axis, here is the frame rate
00:50:42.080 | on a Jetson TX2 GPU.
00:50:44.320 | This is measured with a batch size of one
00:50:46.480 | with 32-bit float.
00:50:48.280 | And on the vertical axis, it's the accuracy,
00:50:51.480 | the depth estimation in terms of the delta one metric,
00:50:53.920 | which means the percentage of pixels
00:50:55.800 | that are within 25% of the correct depth.
00:50:58.960 | So higher, the better.
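For reference, the delta-one metric being plotted can be computed as below: the fraction of pixels where the ratio between predicted and true depth stays within 1.25, i.e., within 25%.

```python
import numpy as np

def delta1(pred, gt):
    """Fraction of pixels whose predicted depth is within 25% of the ground truth."""
    ratio = np.maximum(pred / gt, gt / pred)
    return np.mean(ratio < 1.25)

gt = np.random.uniform(0.5, 10.0, size=(480, 640))
pred = gt * np.random.uniform(0.8, 1.2, size=gt.shape)   # toy predictions
print(f"delta1 = {delta1(pred, gt):.3f}")
```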
00:51:00.720 | And so you can see, you know, the various different
00:51:02.600 | approaches out there.
01:00:04.240 | This star, the red star, is the approach, FastDepth,
01:00:07.520 | using all the different efficient
01:00:09.800 | network design techniques that we talked about.
00:51:11.240 | And you can see you can get an order of magnitude
00:51:13.040 | over a 10x speedup while maintaining accuracy.
00:51:17.240 | And the models and all the code to do this
01:00:18.840 | is available on the FastDepth website.
00:51:21.240 | We presented this at ICRA, which is a robotics conference
00:51:25.000 | in the middle of last year.
00:51:26.120 | And we wanted to show some live footage there.
00:51:28.000 | So at ICRA, we actually captured some footage on an iPhone
00:51:31.880 | and showed the real-time depth estimation on an iPhone itself.
00:51:35.320 | And you can achieve about 40 frames per second on an iPhone
00:51:38.320 | using FastDepth.
00:51:39.680 | So again, if you're interested in this particular type
00:51:42.520 | of application or efficient networks for depth estimation,
00:51:44.960 | I invite you to visit the website for that.
00:51:47.800 | OK, so that's the algorithmic side of things.
00:51:49.760 | But let's return to the hardware,
00:51:51.200 | building specialized hardware that
00:51:52.680 | are efficient for neural network processing.
00:51:56.160 | So again, we saw that there's many different ways
00:51:59.040 | of making the neural network efficient,
00:52:01.480 | from network pruning to efficient network
00:52:03.760 | architectures to reduce precision.
00:52:05.880 | The challenge for the hardware designer,
00:52:08.040 | though, is that there's no guarantee
00:52:09.880 | as to which type of approach someone
00:52:12.840 | might apply to the algorithm that they're going
00:52:14.880 | to run on the hardware.
00:52:15.920 | So if you only own the hardware, you
00:52:17.440 | don't know what kind of algorithm
00:52:19.200 | someone's going to run on your hardware
00:52:20.200 | unless you own the whole stack.
00:52:21.760 | So as a result, you really, really
00:52:23.520 | need to have flexible hardware so it
00:52:25.560 | can support all of these different approaches
00:52:27.680 | and translate these approaches to improvements in energy
00:52:31.400 | efficiency and latency.
00:52:33.600 | Now, the challenge is a lot of the specialized DNN hardware
00:52:37.920 | that exist out there often rely on certain properties of the DNN
00:52:42.600 | in order to achieve high efficiency.
00:52:44.520 | So a very typical structure that you might see
00:52:47.240 | is that you might have an array of multiply and accumulate
00:52:50.120 | units, so a MAC array.
00:52:51.560 | And it's going to reduce memory access
00:52:54.520 | by amortizing reads across arrays.
00:52:56.400 | What do I mean by that?
00:52:57.600 | So if I read a weight once from the memory,
00:53:00.680 | weight memory once, I'm going to reuse it multiple times
00:53:02.880 | across the array.
00:53:03.760 | Send it across the array, so one read,
00:53:06.120 | and it can be used multiple times by multiple engines
00:53:08.720 | or multiple MACs.
00:53:10.080 | Similarly, activation memory, I'm
00:53:12.120 | going to read the input activation once
00:53:13.800 | and reuse it multiple times.
00:53:16.880 | The issue here is that the amount of reuse
00:53:20.320 | and the array utilization depends on the number of channels
00:53:23.800 | you have on your neural net, the size of the feature map,
00:53:26.080 | and the batch size.
00:53:27.400 | So this is, again, just showing two different variations of--
00:53:30.160 | you're going to reuse based on the number of filters, number
00:53:32.880 | of input channels, feature map, batch size.
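As a rough way to see how reuse depends on the layer shape, the sketch below counts how many times a single weight and a single input activation can be reused in a convolutional layer (ignoring stride and border effects); the formulas are standard bookkeeping, and the shapes, including the compact depthwise case discussed next, are made-up examples.

```python
def reuse_per_weight(out_h, out_w, batch):
    # each filter weight is applied at every output position, for every image in the batch
    return out_h * out_w * batch

def reuse_per_activation(num_filters, k):
    # each input activation is touched by every filter, at up to k*k positions of that filter
    return num_filters * k * k

# a large "classic" layer vs. a compact depthwise layer (hypothetical shapes)
print("large layer, weight reuse:", reuse_per_weight(56, 56, batch=4),
      "activation reuse:", reuse_per_activation(num_filters=256, k=3))
print("depthwise layer, weight reuse:", reuse_per_weight(56, 56, batch=1),
      "activation reuse:", reuse_per_activation(num_filters=1, k=3))
```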
00:53:35.960 | And the problem now is that when we
00:53:37.360 | start looking at these efficient neural network models,
00:53:40.520 | they're not going to have as much reuse,
00:53:42.560 | particularly for the compact cases.
00:53:44.560 | So for example, a very typical approach
00:53:46.360 | is to use what we call depth-wise layers.
00:53:48.080 | We saw you took that 3D filter and then decomposed it
00:53:51.320 | into a 2D filter and a one-by-one.
00:53:54.480 | And so as a result, you only have one channel.
00:53:56.400 | So you're not going to have much reuse across the input channel.
00:53:59.520 | And so rather than filling this array with a lot of computation
00:54:03.720 | that you can process, you're only
00:54:05.160 | going to be able to utilize a very small subset, which
00:54:07.640 | I've highlighted here in green, of the array itself
00:54:09.760 | for computation.
00:54:10.840 | So even though you throw down 1,000
00:54:13.680 | or 10,000 multiply-and-accumulate engines,
00:54:15.800 | only a very small subset of them can actually do work.
00:54:19.880 | And that's not great.
00:54:20.760 | So this is also an issue because as I scale up the array size,
00:54:24.640 | it's going to become less efficient.
00:54:26.100 | Ideally, what you would like is that if I put more, you know,
00:54:28.920 | cores or processing elements down,
00:54:31.080 | the system should run faster, right?
00:54:32.600 | I'm paying for more cores, after all.
00:54:34.600 | But it doesn't, because the data can't reach or be
00:54:38.480 | reused by all of these different cores,
00:54:40.480 | and it's also going to be difficult to exploit sparsity.
00:54:42.560 | So what you need here are two things.
00:54:44.880 | One is a very flexible data flow,
00:54:47.760 | meaning that there's many different ways for the data
00:54:49.960 | to move through this array, right?
00:54:53.120 | And so you can imagine row stationary is a very flexible
00:54:56.120 | way that we can basically map the neural network
00:54:58.120 | onto the array itself.
00:54:59.040 | You can see here in the Eyeriss or row stationary case
00:55:01.800 | that a lot of the processing elements can be used.
00:55:04.640 | Another thing is how do you actually
00:55:06.120 | deliver the data for this varying degree of reuse?
00:55:10.040 | So here's like the spectrum of on-chip networks
00:55:13.720 | in terms of basically how can I deliver data
00:55:15.800 | from that global buffer to all those parallel processing
00:55:19.400 | engines, right?
00:55:21.360 | One use case is when I use these huge neural nets that
00:55:24.120 | have a lot of reuse.
00:55:25.320 | What I want to do is multicast, meaning
00:55:27.080 | I read once from the global buffer,
00:55:29.360 | and then I reuse that data multiple times
00:55:31.320 | in all of my processing elements.
00:55:32.680 | You can think of that as like broadcasting information out.
00:55:35.360 | And a type of network that you would do for that
00:55:37.560 | is shown here on the right-hand side.
00:55:39.480 | So this is low bandwidth, so I'm only reading very little data,
00:55:42.800 | but high spatial reuse.
00:55:44.160 | Many, many engines are using it.
00:55:46.680 | On the other extreme, when I design
00:55:49.600 | these very efficient neural networks,
00:55:51.180 | I'm not going to have very much reuse.
00:55:53.160 | And so what I want is unicast, meaning
00:55:54.920 | I want to send out unique information
00:55:58.080 | to each of the processing elements
00:56:00.280 | so that they can all work.
00:56:02.480 | So that's going to be, as shown here on the left-hand side,
00:56:05.520 | a case where you have very high bandwidth,
00:56:07.320 | a lot of unique information going out,
00:56:10.760 | and low spatial reuse.
00:56:11.760 | You're not sharing data.
00:56:13.280 | Now, it's very challenging to go across this entire spectrum.
00:56:16.680 | One solution would be what we call an all-to-all network
00:56:20.360 | that satisfies all of this.
00:56:21.680 | So all things are-- all inputs are connected to all inputs.
00:56:24.080 | It's going to be very expensive and not scalable.
00:56:27.760 | One solution that we have to this
00:56:29.360 | is what we call a hierarchical mesh.
00:56:30.860 | So you can break this problem into two steps.
00:56:33.040 | At the lowest level, you can use an all-to-all connection.
00:56:37.960 | And then at the higher level, you can use a mesh connection.
00:56:41.360 | And so the mesh will allow you to scale up.
00:56:44.000 | But the all-to-all allows you to achieve
00:56:45.840 | a lot of different types of reuse.
00:56:47.260 | And with this type of network on chip,
00:56:49.320 | you can basically support a lot of different delivery
00:56:51.560 | mechanisms to deliver data from the global buffer
00:56:54.480 | to all the processing elements so that all your cores,
00:56:57.520 | all your computes can be happening at the same time.
00:56:59.840 | And at its core, this is one of the key things
00:57:02.640 | that enable the second version of Eyeriss
00:57:04.760 | to be both flexible and efficient.
00:57:07.720 | So this is some results from the second version of Eyeriss.
00:57:11.480 | It supports a wide range of filter shapes,
00:57:13.520 | both the very large shapes as well as very compact,
00:57:18.400 | including convolutional, fully connected, and depth-wise layers.
00:57:21.040 | So you can see here in this plot, depending on the shape,
00:57:25.200 | you can get up to an order of magnitude speed up.
00:57:28.400 | It also supports a wide range of sparsities, both dense
00:57:30.840 | and sparse.
00:57:32.100 | So this is really important because some networks
00:57:34.100 | can be very sparse because you've
00:57:35.140 | done a lot of pruning.
00:57:36.280 | But some are not.
00:57:37.100 | And so you want to efficiently support all of those.
00:57:39.720 | You also want to be scalable.
00:57:40.960 | So as you increase the number of processing elements,
00:57:44.840 | the throughput also speeds up.
00:57:47.360 | And as a result of this particular type of design,
00:57:50.080 | you get an order of magnitude improvement
00:57:52.000 | in both speed and energy efficiency.
00:57:55.760 | All right, so this is great.
00:57:56.920 | And this is one way that you can speed up and make
00:57:59.800 | neural networks more efficient.
00:58:01.920 | But it's also important to take a step back and look
00:58:04.160 | beyond just the specialized hardware,
00:58:06.800 | the accelerator itself, both in terms
00:58:08.900 | of the algorithms and the hardware.
00:58:11.020 | So can we look beyond the DNN accelerator for acceleration?
00:58:15.300 | And so one good place to show this as an example
00:58:17.700 | is the task of super resolution.
00:58:19.740 | So how many of you are familiar with the task of super
00:58:21.980 | resolution?
00:58:23.140 | All right, so for those of you who aren't, the idea is
00:58:25.340 | as follows.
00:58:26.020 | So I want to basically generate a high-resolution image
00:58:30.060 | from a small-resolution image.
00:58:32.180 | And why do you want to do that?
00:58:33.460 | Well, there are a couple of reasons.
00:58:34.980 | One is that it can allow you to basically reduce
00:58:38.060 | the transmit bandwidth.
00:58:39.260 | So for example, if you have limited communication,
00:58:41.340 | I'm going to send a low-res version of a video,
00:58:43.780 | let's say, or image to your phone.
00:58:45.420 | And then your phone can make it high-res.
00:58:47.380 | That's one way.
00:58:48.740 | Another reason is that screens in general
00:58:51.420 | are getting larger and larger.
00:58:52.700 | So every year at CES, they announce a higher-resolution
00:58:55.300 | screen.
00:58:56.060 | But if you think about the movies that we watch,
00:58:58.700 | a lot of them are still 1080p, for example,
00:59:01.380 | or fixed resolution.
00:59:02.580 | So again, you want to generate a high-resolution
00:59:04.700 | representation of that low-resolution input.
00:59:09.260 | And the idea here is that your high-resolution is not
00:59:11.460 | just interpolation, because it can be very blurry,
00:59:13.460 | but there's ways that kind of hallucinate
00:59:15.420 | a high-resolution version of the video or image itself.
00:59:20.060 | And that's basically called super-resolution.
00:59:23.100 | But one of the challenges for super-resolution
00:59:25.580 | is that it's computationally very expensive.
00:59:27.580 | So again, you can imagine that the state-of-the-art approaches
00:59:30.160 | for super-res use deep neural nets.
00:59:32.340 | A lot of the examples we just talked about
00:59:34.140 | about neural nets are talking about input images
00:59:36.180 | of 200 by 200 pixels.
00:59:38.140 | Now imagine if you extend that to an HD image.
00:59:40.820 | It's going to be very, very expensive.
00:59:42.860 | So what we want to do is think of different ways
00:59:45.300 | that we can speed up the super-resolution process,
00:59:48.420 | not just by making DNNs faster, but kind
00:59:51.060 | of looking around the other components of the system
00:59:54.220 | and seeing if we can make it faster as well.
00:59:56.060 | So one of the approaches we took is this framework called FAST,
00:59:59.420 | where we're looking at accelerating
01:00:00.860 | any super-resolution algorithm by an order of magnitude.
01:00:03.740 | And this is operated on a compressed video.
01:00:06.100 | So before I was a faculty here, I
01:00:09.100 | worked a lot on video compression.
01:00:10.900 | And if you think about the video compression community,
01:00:14.300 | they look at video very differently than people
01:00:17.460 | who process super-resolution.
01:00:18.620 | So typically, when you're thinking
01:00:20.040 | about image processing or super-resolution,
01:00:22.020 | when I give you a compressed video, what you basically
01:00:24.700 | think of it is as a stack of pixels,
01:00:27.500 | a bunch of different images together.
01:00:29.300 | But if you asked a video compression person,
01:00:31.900 | what does a compressed video look like?
01:00:33.580 | Actually, a compressed video is a very structured
01:00:37.460 | representation of the redundancy in the video itself.
01:00:41.260 | So why is it that we can compress videos?
01:00:43.100 | It's because things like different frames
01:00:44.900 | look very-- consecutive frames look very similar.
01:00:47.500 | So it's telling you which pixels in frame 1
01:00:50.540 | is related to which pixel or looks
01:00:52.220 | like which pixel in frame 2.
01:00:53.820 | And so as a result, you don't have
01:00:55.220 | to send the pixels in frame 2.
01:00:56.740 | And that's where you get the compression from.
01:00:58.660 | So actually, what a compressed video looks like
01:01:00.620 | is a description of the structure of the video itself.
01:01:05.780 | And so you can use this representation
01:01:07.580 | to accelerate super-resolution.
01:01:09.700 | So for example, rather than applying super-resolution
01:01:14.100 | to every single low-res frame, which is the typical approach--
01:01:16.980 | so you would apply super-resolution
01:01:18.440 | to each low-res frame, and you would generate a bunch
01:01:20.780 | of high-res frame outputs--
01:01:22.980 | what you can actually do is apply super-resolution
01:01:26.140 | to one of the small low-resolution frames.
01:01:29.420 | And then you can use that free information
01:01:31.700 | you get in the compressed video that tells you
01:01:33.540 | the structure of the video to generate or transfer
01:01:36.780 | and generate all those high-resolution videos
01:01:39.740 | from that.
01:01:40.700 | And so it only needs to run on a subset of frames.
01:01:43.140 | And then the complexity to reconstruct
01:01:45.180 | all those high-resolution frames once you
01:01:47.140 | have that structured image is going to be very low.
01:01:49.940 | So for example, if I'm going to transfer to n frames,
01:01:53.780 | I'm going to get roughly an n-times speedup.
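A schematic sketch of that transfer idea, assuming hypothetical super_resolve and transfer helpers: run the expensive super-resolution network on one anchor frame, then for the following frames reuse the already-computed high-resolution pixels by following the block motion vectors that the compressed bitstream provides for free. The real FAST framework also handles residuals and decides adaptively when to re-run the network.

```python
def fast_super_resolution(frames, motion_vectors, super_resolve, transfer, group_size=4):
    """Super-resolve only every `group_size`-th frame; transfer the result to the rest."""
    outputs = []
    for i, frame in enumerate(frames):
        if i % group_size == 0:
            current_hr = super_resolve(frame)            # expensive DNN, run on 1 of N frames
        else:
            # cheap: move the previous high-res pixels along the bitstream's motion vectors
            current_hr = transfer(current_hr, motion_vectors[i])
        outputs.append(current_hr)
    return outputs
```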
01:01:57.100 | So to evaluate this, we showcase this on a range of videos.
01:02:00.260 | So this range of videos is the data
01:02:01.740 | set that we use to develop video standards.
01:02:03.620 | So it's quite broad.
01:02:05.220 | And you can see, first, on the left-hand side
01:02:07.540 | is that if I transfer to four different frames,
01:02:11.460 | you can get a 4x acceleration.
01:02:13.060 | And then the PSNR, which indicates the quality,
01:02:15.940 | doesn't change.
01:02:16.700 | So it's the same quality, but 4x faster.
01:02:18.980 | If I transfer to 16 frames, or 16x acceleration,
01:02:22.260 | there's a slight drop in quality.
01:02:24.380 | But still, you get basically a 16x acceleration.
01:02:28.820 | So the key idea here is, again, you'd
01:02:31.180 | want to look beyond the processing
01:02:33.300 | of the neural network itself to around it
01:02:35.060 | to see if you can speed it up.
01:02:36.620 | Usually with PSNR, you can't really
01:02:37.940 | tell too much about the quality.
01:02:39.060 | So another way to look at it is actually
01:02:40.700 | look at the video itself or subjective quality.
01:02:42.940 | So on the left-hand side here, this
01:02:45.380 | is if I applied super resolution on every single frame.
01:02:48.780 | So this is the traditional way of doing it.
01:02:51.580 | On the right-hand side here, this
01:02:53.820 | is if I just did interpolation on every single frame.
01:02:56.980 | And so where you can tell the difference is by looking
01:02:59.260 | at things like the text, you can see
01:03:00.780 | that the text is much sharper on the left video
01:03:03.500 | than the right video.
01:03:05.260 | Now, FAST plus SRCNN, so using FAST, is somewhere in between.
01:03:08.420 | So FAST actually has the same quality
01:03:11.620 | as the video on the left-hand side,
01:03:13.900 | but it's just as efficient in terms of processing speed
01:03:17.380 | as the approach on the right-hand side.
01:03:19.980 | So it kind of has the best of both worlds.
01:03:22.460 | And so the key takeaway for this is
01:03:24.140 | that if you want to accelerate DNNs for a given process,
01:03:27.660 | it's good to look beyond the hardware for the acceleration.
01:03:31.020 | We can look at things like the structure of the data that's
01:03:33.780 | entering the neural network accelerator.
01:03:36.060 | There might be opportunities there.
01:03:37.540 | For example, here, temporal correlation
01:03:39.740 | that allows you to further accelerate the processing.
01:03:42.220 | Again, if you're interested in this,
01:03:43.740 | all the code is on the website.
01:03:45.220 | So to end this lecture, I just want
01:03:46.660 | to talk about things that are actually
01:03:48.500 | beyond deep neural nets.
01:03:49.620 | I also-- I know neural nets are great.
01:03:51.260 | They're useful for many applications.
01:03:52.900 | But I think there's a lot of exciting problems
01:03:54.860 | outside the space of neural nets as well, which also
01:03:57.460 | require efficient computing.
01:04:00.140 | So the first thing is what we call
01:04:01.940 | visual inertial localization or visual odometry.
01:04:05.580 | This is something that's widely used for robots
01:04:07.900 | to kind of figure out where they are in the real world.
01:04:10.220 | So you can imagine for autonomous navigation,
01:04:12.140 | before you navigate the world, you
01:04:13.660 | have to know where you actually are in the world.
01:04:15.700 | So that's localization.
01:04:16.780 | This is also widely used for things like AR and VR
01:04:19.140 | as well, right, because you can know where you're actually
01:04:21.140 | looking in AR and VR.
01:04:22.620 | What does this actually mean?
01:04:24.540 | It means that you can basically take in a sequence of images.
01:04:27.740 | So you can imagine like a camera that's mounted on the robot
01:04:30.340 | or the person, as well as an IMU.
01:04:33.140 | So it has accelerometer and gyroscope information.
01:04:36.260 | And then visual inertial odometry,
01:04:38.180 | which is a subset of SLAM, basically fuses
01:04:40.180 | this information together.
01:04:41.860 | And the outcome of visual inertial odometry
01:04:44.660 | is the localization.
01:04:45.700 | So you can see here, basically, you're
01:04:47.420 | trying to estimate where you are in the 3D space.
01:04:50.220 | And the pose based on, in this case, the camera feed.
01:04:52.860 | But you can also measure IMU information there as well.
01:04:55.660 | And if you're in an unknown environment,
01:04:57.380 | you could also generate a map of that environment.
01:04:59.540 | So one of these is a very key task in navigation.
01:05:03.340 | And the key thing is, can you do it in an energy efficient way?
01:05:06.380 | So we've looked at building specialized hardware
01:05:09.340 | to do localization.
01:05:11.660 | This is actually the first chip that
01:05:13.160 | performs complete visual inertial odometry on chip.
01:05:15.740 | We call it Navion.
01:05:17.420 | This is done in collaboration with Sertac Karaman.
01:05:19.660 | So you can see here, here's the chip itself.
01:05:21.460 | It's 4 millimeters by 5 millimeters.
01:05:23.700 | You can see that it's smaller than a quarter.
01:05:26.180 | And you can imagine mounting it on a small robot.
01:05:29.180 | At the front end, it does basically
01:05:30.900 | processing of the camera information.
01:05:32.660 | It does things like feature detection,
01:05:34.260 | tracking, outlier elimination.
01:05:36.980 | It also processes-- it does pre-integration on the IMU.
01:05:40.700 | And then on the back end, it fuses this information
01:05:42.980 | together using a factor graph.
01:05:46.460 | And so when you compare this particular design,
01:05:48.940 | this Navion chip design, compared
01:05:50.580 | to mobile or desktop CPUs, you're
01:05:52.740 | talking about two to three orders of magnitude
01:05:55.660 | reduction in energy consumption because you have
01:05:58.060 | the specialized chip to do it.
01:05:59.700 | So what is the key component of this chip that
01:06:02.260 | enables us to do it?
01:06:03.340 | Well, again, sticking with the theme,
01:06:04.860 | the key thing is reduction in data movement.
01:06:07.620 | In particular, we reduce the amount
01:06:09.060 | of data that needs to be moved on and off chip.
01:06:11.380 | So all of the processing is located on the chip itself.
01:06:15.440 | And then furthermore, because we want
01:06:17.020 | to reduce the size of the chip and the size of the memories,
01:06:19.560 | we do things like apply low-cost compression on the frames
01:06:23.980 | and then also exploit sparsity, which
01:06:26.420 | means number of zeros in the factor graph itself.
01:06:28.820 | So all of the compression and exploiting sparsity
01:06:30.980 | can actually reduce the storage cost
01:06:32.620 | down to under a megabyte of storage
01:06:34.980 | on chip to do this processing.
01:06:36.260 | And that allows us to achieve this really low power
01:06:38.940 | consumption of below 25 milliwatts.
01:06:43.700 | Another thing that really matters for autonomous
01:06:45.700 | navigation is once you know where you are,
01:06:47.860 | where are you going to go next?
01:06:49.540 | So this is kind of a planning and mapping problem.
01:06:52.100 | And so in the context of things like robot exploration,
01:06:54.580 | where you want to basically explore an unknown area,
01:06:57.580 | you can do this by doing what we call computing
01:07:00.340 | Shannon's mutual information.
01:07:01.640 | Basically, you want to figure out
01:07:03.220 | where should I go next where I will discover
01:07:05.200 | the most amount of new information
01:07:07.540 | compared to what I already know.
01:07:09.660 | So you can imagine what's shown here is like an occupancy map.
01:07:12.860 | So this is basically the light colors
01:07:14.460 | show the place where it's free space.
01:07:16.380 | It's empty.
01:07:16.880 | Nothing's occupied.
01:07:18.260 | The dark gray area is unknown.
01:07:21.460 | And then the black lines are occupied things,
01:07:23.980 | so like walls, for example.
01:07:25.380 | And the question is, if I know that this is my current
01:07:27.620 | occupancy map, where should I go and scan, let's say,
01:07:30.460 | with a depth sensor to figure out more information
01:07:35.140 | about the map itself?
01:07:36.140 | So what you can do is you can compute
01:07:37.780 | what we call the mutual information of the map itself
01:07:40.820 | based on what you already know.
01:07:42.260 | And then you go to the location with the most information,
01:07:44.660 | and you scan it, and then you get an updated map.
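A toy sketch of that selection step, using per-cell Shannon entropy of the occupancy grid as a stand-in for the full mutual information computation: unknown cells (probability near 0.5) are the most uncertain, so the candidate whose neighborhood is most uncertain is the most informative place to scan. The radius and the candidate locations are made-up parameters.

```python
import numpy as np

def cell_entropy(p_occ):
    """Shannon entropy of each cell's occupancy probability (0.5 = unknown = most uncertain)."""
    p = np.clip(p_occ, 1e-6, 1 - 1e-6)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def best_scan_location(occupancy, candidates, radius=5):
    """Pick the candidate cell whose neighborhood is most uncertain (a proxy for information gain)."""
    h = cell_entropy(occupancy)
    scores = []
    for (r, c) in candidates:
        patch = h[max(r - radius, 0): r + radius + 1, max(c - radius, 0): c + radius + 1]
        scores.append(patch.sum())
    return candidates[int(np.argmax(scores))]

occupancy = np.full((50, 50), 0.5)          # everything unknown
occupancy[:, :20] = 0.05                    # already-explored free space
print(best_scan_location(occupancy, candidates=[(25, 10), (25, 30), (5, 45)]))
```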
01:07:47.420 | So shown here below is a miniature race car
01:07:49.660 | that's doing exactly that.
01:07:51.260 | So over here is the mutual information
01:07:55.580 | that's being computed.
01:07:56.540 | So it's trying to go to those light areas
01:08:01.020 | of the yellow areas that has the most information.
01:08:03.300 | So you can see that it's going to try and back up and come
01:08:06.020 | and scan this region to cover or figure out
01:08:09.020 | more information about that.
01:08:10.940 | So that's great.
01:08:11.780 | It's a very principled way of doing this.
01:08:13.660 | The problem of this kind of computation,
01:08:18.660 | the reason why it's been challenging,
01:08:20.540 | is, again, the computation, in particular, the data movement.
01:08:23.500 | So you can imagine, at any given position,
01:08:25.820 | you're going to do a 3D scanning with your LiDAR
01:08:29.100 | across a wide range of neighboring regions
01:08:32.100 | with your beams.
01:08:32.900 | You can imagine each of these beams with your LiDAR scan
01:08:35.220 | can be processed with different cores.
01:08:36.820 | So they can all be processed in parallel.
01:08:38.980 | So parallelism, again, here, just like the deep learning
01:08:41.460 | case, is very easily available.
01:08:45.500 | The challenge is data delivery.
01:08:47.540 | So what happens is that you're actually storing
01:08:49.700 | your occupancy map all in one memory.
01:08:52.580 | But now you have multiple cores that
01:08:54.220 | are going to try and process the scans on this occupancy map.
01:08:58.620 | And so you only actually, typically,
01:08:59.940 | for these types of memories, you're limited to two cores.
01:09:02.360 | But if you want to have n cores, 16 cores, 30 cores,
01:09:05.460 | it's going to be a challenge in terms of how
01:09:07.260 | to read data from this occupancy map
01:09:09.500 | and deliver it to the cores themselves.
01:09:12.300 | If we take a closer look at the memory access pattern,
01:09:15.740 | you can see here that as you scan it out,
01:09:18.180 | the numbers indicate which cycle you
01:09:20.500 | would use to read each of the locations on the map itself.
01:09:25.500 | And you can see it's kind of a diagonal pattern.
01:09:27.500 | So the question is, can I break this map into smaller memories
01:09:33.380 | and then access these smaller memories in parallel?
01:09:35.460 | And the question is, if I can break it into smaller memories,
01:09:38.000 | how should I decide what part of the map
01:09:39.860 | should go into which of these memories?
01:09:41.860 | So show here on the right-hand side,
01:09:44.620 | in the different colors basically
01:09:46.340 | indicate different memories or different banks of the memory.
01:09:49.020 | So they store different parts of the map.
01:09:50.700 | And again, if you think of the numbers
01:09:52.380 | as the cycle with which each location is accessed,
01:09:55.740 | what you'll notice is that for any given color, at most,
01:09:59.100 | two numbers are the same, meaning
01:10:01.680 | that I'm only going to access two pieces of the location
01:10:04.660 | for any given bank or memory.
01:10:06.100 | So there's going to be no conflict.
01:10:07.560 | So I can process all of these beams in parallel.
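A toy sketch of the banking principle: split the map across several small memories and count how many same-cycle accesses collide in the same bank. The banking rule below (interleave by row) and the access pattern are purely illustrative; the actual chip uses a banking pattern matched to its specific diagonal LiDAR access pattern.

```python
from collections import Counter

def bank_of(row, col, num_banks=8):
    # toy banking rule: interleave by row (illustrative assumption only)
    return row % num_banks

def extra_cycles(accesses_this_cycle, num_banks=8):
    """Accesses that land in the same bank must be serialized; count the extra cycles."""
    counts = Counter(bank_of(r, c, num_banks) for (r, c) in accesses_this_cycle)
    return sum(n - 1 for n in counts.values())

# 8 beams each reading one map cell in the same cycle (here they happen to hit distinct rows)
same_cycle = [(0, 7), (1, 7), (2, 6), (3, 6), (4, 5), (5, 4), (6, 3), (7, 1)]
print("single memory:", len(same_cycle) - 1, "extra cycles;",
      "banked:", extra_cycles(same_cycle), "extra cycles")
```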
01:10:11.220 | And so by doing this, this allows
01:10:13.020 | you to compute the mutual information of the entire map.
01:10:17.300 | And this can be a very large map,
01:10:19.060 | let's say 200 meters by 200 meters at 0.1 meter resolution,
01:10:22.900 | in under a second.
01:10:24.140 | This is very different from before,
01:10:25.620 | where you can only compute the mutual information
01:10:27.860 | of a subset of locations and just try and pick the best one.
01:10:31.060 | Now you can compute on the entire map.
01:10:32.780 | So you can know the absolute best location to go to get
01:10:35.220 | the most information.
01:10:36.860 | This is 100x speed up compared to a CPU
01:10:39.940 | at a tenth of the power on an FPGA.
01:10:43.500 | So that's another important example
01:10:44.960 | of how data movement is really critical in order
01:10:47.740 | to allow you to process things very, very quickly
01:10:50.060 | and how having specialized hardware can enable that.
01:10:53.780 | All right.
01:10:54.260 | So one last thing is looking at--
01:10:55.940 | so we talked about robotics.
01:10:57.260 | We talked about deep learning.
01:10:58.060 | But actually, what's really important
01:10:59.020 | is there's a lot of important applications
01:11:00.820 | where you can apply efficient processing that can help
01:11:03.980 | a lot of people around the world.
01:11:05.340 | So in particular, looking at monitoring neurodegenerative
01:11:08.740 | diseases.
01:11:09.980 | So we know things like dementia, so things like Alzheimer's,
01:11:12.700 | Parkinson's, affects tens of millions of people
01:11:15.420 | around the world and continues to grow.
01:11:17.500 | This is a very severe disease.
01:11:19.780 | The challenge for this disease is that--
01:11:21.620 | OK, one of the many challenges.
01:11:22.900 | But one of the challenges is that the neurological
01:11:25.020 | assessments for these diseases can be very time consuming
01:11:27.620 | and require a trained specialist.
01:11:29.460 | So normally, if you are suffering
01:11:31.220 | from one of these diseases or you might have this disease,
01:11:34.180 | what you need to do is you need to go see a specialist.
01:11:36.740 | And they'll ask you a series of questions.
01:11:39.220 | They'll do a mini mental exam, like what year is it?
01:11:41.780 | Where are you now?
01:11:42.580 | Can you count backwards and so on?
01:11:44.300 | Or you might be familiar with people
01:11:45.900 | are asked to draw the clock, these tests.
01:11:49.140 | And so you can imagine going to a specialist
01:11:51.020 | to do these type of things can be costly and time consuming.
01:11:53.620 | So you don't go very frequently.
01:11:55.540 | So as a result, the data that's collected is very sparse.
01:11:58.220 | Also, it's very qualitative.
01:12:00.100 | So if you go to different specialists,
01:12:01.700 | they might come up with a different assessment.
01:12:04.140 | So repeatability is also very much an issue.
01:12:08.060 | What's been super exciting is it's
01:12:09.900 | been shown in literature that there's actually
01:12:12.100 | a quantitative way of measuring or quantitative evaluating
01:12:16.860 | these types of diseases, potentially using eye movements.
01:12:20.660 | So eye movements can be used by a quantitative way
01:12:22.860 | to evaluate the severity or progression
01:12:25.260 | or regression of these particular type of diseases.
01:12:27.580 | So you imagine doing things like,
01:12:29.020 | if you're taking a certain drug, is your disease
01:12:31.140 | getting better or worse?
01:12:32.300 | And this eye movement can give a quantitative evaluation
01:12:34.780 | for that.
01:12:35.300 | But the challenge is that to do these eye movement evaluations,
01:12:40.900 | you still need to go in for that.
01:12:40.900 | So first, you need a very high speed camera.
01:12:43.020 | That can be very expensive.
01:12:44.500 | Often, you need to have substantial head support
01:12:46.660 | so your head doesn't move so you can really
01:12:47.780 | detect the eye movement.
01:12:48.940 | And you might even need IR illumination
01:12:50.700 | so you can more clearly see the eye.
01:12:53.300 | And so again, this still has the challenge
01:12:55.260 | that for clinical measurements of what
01:12:57.100 | we call saccade latency or eye movement latency or eye
01:12:59.460 | reaction time, they're done in very constrained environments.
01:13:02.420 | You still have to go see the special itself.
01:13:05.340 | And they use very specialized and costly equipment.
01:13:08.420 | So in the vein of enabling efficient computing
01:13:10.940 | and bringing compute to various devices, our question is,
01:13:13.980 | can we actually do these eye measurements on a phone
01:13:17.820 | itself that we all have?
01:13:20.540 | And so indeed, you can.
01:13:21.860 | You can develop various algorithms
01:13:23.340 | that can detect your eye reaction time
01:13:25.500 | on a consumer grade camera like your phone or an iPad.
01:13:29.460 | And we've shown that you can actually
01:13:31.300 | replicate the quality of results as you
01:13:33.820 | could with a Phantom high-speed camera.
01:13:35.020 | So shown here in the red are basically eye reaction times
01:13:38.980 | that are measured on a subject on an iPhone 6, which
01:13:41.660 | is obviously under $1,000, way cheaper now,
01:13:44.380 | compared to a Phantom camera shown here in blue.
01:13:46.380 | You can see that the distributions of the reaction
01:13:48.460 | times are about the same.
01:13:50.700 | Why is this exciting?
01:13:51.780 | Because it enables us to do low cost in-home measurements.
01:13:55.300 | So what you can imagine is a patient
01:13:56.780 | could do these measurements at home for many days,
01:13:59.460 | not just the day they go in.
01:14:00.780 | And then they can bring in this information.
01:14:02.620 | And this can give the physician or the specialist
01:14:04.860 | additional information to make the assessment as well.
01:14:07.180 | So this can be complementary.
01:14:08.380 | But it gives a much more rich set of information
01:14:10.500 | to do the diagnosis and evaluation.
01:14:12.940 | So we're talking about computing.
01:14:14.660 | But there's also other parts of the system
01:14:16.420 | that burn power as well, in particular,
01:14:18.420 | when we're talking about things like depth estimation using
01:14:20.920 | time of flight.
01:14:21.700 | Time of flight is very similar to LIDAR.
01:14:23.540 | Basically, what you're doing is you're sending a pulse
01:14:25.940 | and waiting for it to come back.
01:14:27.260 | And how long it takes to come back
01:14:28.620 | indicates the depth of whatever object you're trying to detect.
01:14:31.700 | The challenge is that depth estimation
01:14:33.580 | with time-of-flight sensors can be very expensive.
01:14:35.820 | You're emitting a pulse, waiting for it to come back.
01:14:38.020 | So we're talking about up to tens of watts of power.
01:14:42.860 | The question is, can we also reduce the sensor power
01:14:45.300 | if we can do efficient computing?
01:14:46.860 | So for example, can I reduce how often I emit the depth sensor
01:14:51.420 | and kind of recover the other information just using
01:14:54.460 | a monocular-based camera?
01:14:56.020 | So for example, typically, you have a pair of a depth sensor
01:14:59.620 | and an RGB camera.
01:15:00.940 | If at time 0, I turn both of them on, and at times 1 and 2,
01:15:05.400 | I turn the depth sensor off, but I still keep my RGB camera on,
01:15:08.700 | can I estimate the depth at time 1 and time 2?
01:15:13.180 | And then the key thing here is to make sure
01:15:15.020 | that the algorithms that you're running to estimate
01:15:17.180 | the depth without turning on the depth sensor itself
01:15:19.460 | is super cheap.
01:15:20.340 | So we actually have algorithms that
01:15:22.260 | can run on VGA at 30 frames per second
01:15:24.700 | on a Cortex A7, which is a super low-cost embedded processor.
01:15:29.780 | And just to give you an idea of how it looks like,
01:15:31.860 | so let's see, here's the left is the RGB image.
01:15:34.620 | In the middle is the depth map or the ground truth.
01:15:37.100 | So if I always had the depth sensor on,
01:15:38.660 | that's what it would look like.
01:15:39.820 | And then on the right-hand side is the estimated depth map.
01:15:42.660 | In this particular case, we're only turning on the sensor
01:15:46.100 | only 11% of the time, so every ninth frame.
01:15:49.460 | And your mean relative error is only about 0.7%,
01:15:52.540 | so the accuracy or quality is pretty aligned.
01:15:55.740 | OK, so at a high level, what are the key takeaways
01:15:59.780 | I want you guys to get from today's lecture?
01:16:02.460 | First is efficient computing is really important.
01:16:05.340 | It can extend the reach of AI beyond the cloud itself
01:16:09.060 | because it can reduce communication networking
01:16:11.060 | costs, enable privacy, and provide low latency.
01:16:15.140 | And so we can use AI for a wide range of applications,
01:16:17.580 | ranging from things like robotics to health care.
01:16:20.420 | And in order to achieve this energy efficient computing,
01:16:22.820 | it really requires cross-layer design.
01:16:24.980 | So not just focusing on the hardware,
01:16:26.980 | but specialized hardware plays an important role, but also
01:16:29.420 | the algorithms itself.
01:16:31.020 | And this is going to be really key to enabling AI
01:16:33.300 | for the next decade or so or beyond.
01:16:36.340 | OK, and we also covered a lot of points in the lecture,
01:16:39.700 | so the slides are all available on our website.
01:16:43.540 | Also, just because it's a deep learning seminar series,
01:16:46.020 | I just want to point some other resources
01:16:48.060 | that you might be interested if you
01:16:49.560 | want to learn more about efficient processing
01:16:51.500 | of neural nets.
01:16:52.100 | So again, I want to point you first to this survey paper
01:16:54.940 | that we've developed. This is with my collaborator Joel
01:16:57.300 | Emer.
01:16:57.800 | It really kind of covers what are the different techniques
01:17:00.260 | that people are looking at and give some insights
01:17:01.860 | of the key design principles.
01:17:03.300 | We also have a book coming soon.
01:17:04.680 | It's going to be within the next few weeks.
01:17:07.820 | We also have slides from various tutorials
01:17:09.820 | that we've given on this particular topic.
01:17:11.780 | In fact, we also teach a course on this here at MIT, 6.825.
01:17:16.660 | If you're interested in updates on all these types of materials,
01:17:19.460 | I invite you to join the mailing list or the Twitter feed.
01:17:23.820 | The other thing is if you're not an MIT student,
01:17:25.880 | but you want to take a two-day course on this particular topic,
01:17:29.940 | I also invite you to take a look at the MIT Professional
01:17:33.380 | Education option.
01:17:34.860 | So we run short courses on MIT campus over the summer.
01:17:38.180 | So you can come for two days, and we
01:17:39.800 | can talk about the various different approaches
01:17:41.260 | that people use to build efficient deep learning
01:17:43.340 | systems.
01:17:44.900 | And then finally, if you're interested in just video
01:17:47.780 | and tutorial videos on this talk,
01:17:49.420 | I actually, at the end of November during NeurIPS,
01:17:52.180 | I gave a 90-minute tutorial that goes really in-depth in terms
01:17:55.580 | of how to build efficient deep learning systems.
01:17:58.540 | So I invite you to visit that.
01:17:59.860 | And we also have some talks at the Mars Conference
01:18:02.140 | on Efficient Robotics.
01:18:03.540 | And we have a YouTube channel where this is all located.
01:18:07.140 | And then finally, I'd be remiss if I didn't acknowledge
01:18:09.900 | a lot of the work here is done by the students, so
01:18:12.700 | all the students in our group, as well as my collaborators,
01:18:14.940 | Joel Emer, Sertac Karaman, and Thomas Heldt,
01:18:16.900 | and then all of our sponsors that
01:18:18.820 | make this research possible.
01:18:20.740 | So that concludes my talk.
01:18:22.060 | Thank you very much.
01:18:22.860 | [APPLAUSE]
01:18:26.420 | Thank you.
01:18:27.980 | [APPLAUSE]