Efficient Computing for Deep Learning, Robotics, and AI (Vivienne Sze) | MIT Deep Learning Series
Chapters
0:00 Introduction
0:43 Talk overview
1:18 Compute for deep learning
5:48 Power consumption for deep learning, robotics, and AI
9:23 Deep learning in the context of resource use
12:29 Deep learning basics
20:28 Hardware acceleration for deep learning
57:54 Looking beyond the DNN accelerator for acceleration
1:03:45 Beyond deep neural networks
working in the very important and exciting space 00:00:08.300 |
of developing energy efficient and high performance systems 00:00:25.340 |
One of the important differences between the human brain 00:00:29.720 |
and AI systems is the energy efficiency of the brain. 00:00:34.720 |
So Vivienne is a world-class researcher at the forefront 00:00:43.300 |
- I'm really happy to be here to share some of the research 00:00:46.520 |
and an overview of this area, efficient computing. 00:00:49.560 |
So actually what I'm gonna be talking about today 00:00:51.720 |
is gonna be a little bit broader than just deep learning. 00:00:54.840 |
but we'll also move to how we might apply this to robotics 00:00:58.620 |
and other AI tasks, and why it's really important 00:01:06.100 |
Also, I just wanna mention that a lot of the work 00:01:08.340 |
I'm gonna present today is not done by myself, 00:01:10.560 |
but in collaboration with a lot of folks at MIT over here. 00:01:14.680 |
And of course, if you want access to the slides, 00:01:18.160 |
So given that this is a deep learning lecture series, 00:01:24.520 |
So we know that deep neural nets have, you know, 00:01:39.240 |
OpenAI actually showed over the past few years 00:01:57.680 |
in terms of the amount of compute we need to drive 00:02:01.360 |
and increase the accuracy of a lot of the tasks 00:02:05.800 |
At the same time, if we start looking at basically 00:02:09.880 |
the environmental implications of all of this processing, 00:02:17.040 |
the carbon footprint of, you know, training neural nets, 00:02:20.080 |
if you think of, you know, the amount of carbon footprint 00:02:29.480 |
of an average human life, you can see that, you know, 00:02:33.360 |
neural networks are orders of magnitude greater than that. 00:02:36.720 |
So the environmental or carbon footprint implications 00:02:43.520 |
Now this is a lot having to do with compute in the cloud. 00:02:46.000 |
Another important area where we wanna do compute 00:02:48.520 |
is actually moving the compute from the cloud 00:03:03.840 |
and just even a lot of just places in general, 00:03:19.320 |
Another reason is a lot of the times that we, 00:03:22.440 |
you know, apply deep learning on a lot of applications 00:03:26.200 |
So you can think about things like healthcare 00:03:30.640 |
And so privacy and security again is really critical. 00:03:34.160 |
And you would, rather than sending the data to the cloud, 00:03:36.360 |
you'd like to bring the compute to the data itself. 00:03:39.280 |
Finally, another compelling reason for, you know, 00:03:47.520 |
So this is particularly true for interactive applications. 00:03:51.040 |
So you can think of things like autonomous navigation, 00:03:55.920 |
where you need to interact with the real world. 00:03:58.080 |
You can imagine if you're driving very quickly 00:04:02.520 |
you might not have enough time to send the data 00:04:08.400 |
So again, you wanna move the compute into the robot 00:04:17.800 |
But one of the big challenges of doing processing 00:04:20.640 |
in the robot or in the device actually has to do 00:04:24.120 |
So if we take the self-driving car as an example, 00:04:26.800 |
it's been reported that it consumes over 2000 watts 00:04:33.400 |
just to process all the sensor data that it's collecting. 00:04:36.960 |
Right, and this actually generates a lot of heat. 00:04:40.440 |
You can see in this prototype 00:04:43.240 |
that all the compute aspects are being placed in the trunk, 00:04:49.880 |
So this can be a big cost and logistical challenges 00:04:56.520 |
much more challenging if we shrink down the form factor 00:05:00.080 |
of the device itself to something that is perhaps portable 00:05:05.360 |
or something like your smartphone or cell phone. 00:05:11.360 |
you actually have very limited energy capacity, 00:05:13.880 |
and this is based on the fact that the battery itself 00:05:16.920 |
is limited in terms of the size, weight, and its cost. 00:05:19.760 |
Right, so you can't have very large amount of energy 00:05:24.680 |
Secondly, when you take a look at the embedded platforms 00:05:27.960 |
that are currently used for embedded processing 00:05:35.880 |
than the power consumption that you typically 00:05:38.100 |
would allow for these particular handheld devices. 00:05:47.640 |
Okay, so in the past decade or so, or decades, 00:05:53.760 |
is that we would wait for transistors to become smaller, 00:06:02.400 |
so transistors are not getting more efficient. 00:06:07.520 |
which typically makes transistors smaller and faster, 00:06:27.820 |
but the transistors are not becoming more efficient. 00:06:31.760 |
So what we have to turn to in order to address this 00:06:34.880 |
is we need to turn towards specialized hardware 00:06:37.600 |
to achieve the significant speed and energy efficiency 00:06:40.620 |
that we require for our particular application. 00:06:43.220 |
When we talk about designing specialized hardware, 00:06:46.180 |
how we can redesign the hardware from the ground up, 00:06:49.540 |
particularly targeted at these AI, deep learning, 00:06:52.920 |
and robotic tasks that we're really excited about. 00:06:57.500 |
In fact, it's become extremely popular to do this. 00:07:02.060 |
there's been a large number of startups and companies 00:07:22.280 |
Now, if you really care about energy and power efficiency, 00:07:26.960 |
where is the power actually going for these applications? 00:07:35.740 |
So it's actually not the computations themselves 00:07:39.440 |
but moving the data to the computation engine 00:07:46.760 |
a range of power consumption, energy consumption 00:07:56.100 |
So you have, for example, going from floating point to fixed point 00:07:59.440 |
and eight-bit integer, and the same with additions. 00:08:04.760 |
As you reduce the precision, the energy consumption of each of these operations reduces. 00:08:11.440 |
at the energy consumption of data movement, right? 00:08:19.880 |
somewhere into memory, it can be very expensive. 00:08:22.280 |
So for example, if you look at the energy consumption 00:08:31.520 |
that you would have on the processor or on the chip itself. 00:08:35.360 |
This is already gonna consume five picojoules of energy. 00:08:48.200 |
so outside the processor, for example, in DRAM, 00:08:55.180 |
we're showing 640 picojoules in terms of energy. 00:08:58.560 |
And so you can notice here on the horizontal axis 00:09:01.500 |
that this is basically on a logarithmic scale. 00:09:05.160 |
So you're talking about orders of magnitude increase 00:09:13.080 |
So if we really want to address the energy consumption 00:09:19.660 |
we really wanna look at reducing data movement. 00:09:24.200 |
So if we take a look at a popular AI robotics 00:09:27.240 |
type of application like autonomous navigation, 00:09:30.440 |
is that these applications use a lot of data, right? 00:09:33.640 |
So for example, one of the things you need to do 00:09:38.380 |
So you need to be able to identify, you know, 00:09:42.840 |
you need to know that this pixel represents the ground, 00:09:46.960 |
this pixel represents, you know, a person itself. 00:09:49.880 |
Okay, so this is an important type of processing. 00:09:53.440 |
you wanna be able to do this at a very high frame rate. 00:09:58.560 |
So for example, typically if you want HD images, 00:10:00.920 |
you're talking about 2 million pixels per frame. 00:10:03.880 |
And then often, if you also wanna be able to detect objects 00:10:06.560 |
at different scales or see objects that are far away, 00:10:17.000 |
by, you know, one to two orders of magnitude. 00:10:20.800 |
that you have to process right off the bat there. 00:10:25.480 |
or understanding that you wanna do for autonomous navigation 00:10:30.840 |
you wanna build a 3D map of the world that's around you. 00:10:34.080 |
And you can imagine the longer you travel for, 00:10:41.520 |
that you're gonna have to process and compute on. 00:10:46.080 |
for autonomous navigation in terms of amount of data. 00:10:51.700 |
also other applications like AR, VR, and so on, 00:11:18.280 |
that in order to do these types of processing, 00:11:20.840 |
the state-of-the-art approaches utilize deep neural nets. 00:11:26.280 |
But the challenge here is that these deep neural nets 00:11:30.640 |
of operations and weights to do the computation. 00:11:38.920 |
you're talking about two to three orders of magnitude 00:11:46.400 |
'cause if we'd like to have deep neural networks 00:11:49.500 |
be as ubiquitous as something like video compression, 00:11:53.760 |
how to address this computational complexity. 00:11:58.160 |
are not just used for understanding the environment 00:12:03.360 |
of many AI applications from computer vision, 00:12:06.320 |
speech recognition, gameplay, and even medical applications. 00:12:09.920 |
And I'm sure a lot of these have been covered 00:12:13.520 |
So briefly, I'm just gonna give a quick overview 00:12:16.640 |
of some of the key components in deep neural nets, 00:12:18.680 |
not because, you know, I'm sure all of you understand it, 00:12:23.520 |
the terminology can vary from discipline to discipline. 00:12:26.120 |
So I'll just do a brief overview to align ourselves 00:12:32.920 |
Basically, you can view it as a way of, for example, 00:12:37.700 |
It's a chain of different layers of processing 00:12:44.200 |
at the low level or the earlier parts of the neural net, 00:12:46.760 |
you're trying to learn different low-level features 00:12:53.960 |
as you chain more of these kind of computational layers 00:13:00.360 |
until you can, you know, recognize a vehicle, for example. 00:13:03.880 |
And, you know, the difference of this particular approach 00:13:06.240 |
compared to more traditional ways of doing computer vision 00:13:09.240 |
is that how we extract these features are learned 00:13:12.180 |
from the data itself, as opposed to having an expert 00:13:22.320 |
Okay, what is it doing at each of these layers? 00:13:24.680 |
Well, it's actually doing a very simple computation. 00:13:28.200 |
This is looking at the inference side of things. 00:13:29.920 |
Basically, effectively, what it's doing is a weighted sum. 00:13:37.720 |
and try and stay consistent with that throughout the talk. 00:13:43.440 |
and these weights are learned from the training data, 00:13:48.380 |
and it's basically a weighted sum, as we can see. 00:13:55.120 |
So, you know, traditionally, it used to be sigmoids. 00:13:59.600 |
which basically sets, you know, negative values to zero. 00:14:05.760 |
But the key takeaway here is that if you look 00:14:08.480 |
at this computational kernel, the key operation is the multiply and accumulate. 00:14:17.120 |
And this accounts for over 90% of the computation. 00:14:22.880 |
accelerating neural nets or making them more efficient, 00:14:25.080 |
we really want to focus on minimizing the cost of these multiply and accumulate operations. 00:14:32.960 |
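(As an illustrative aside, not from the talk: the per-layer computation described above is essentially a weighted sum followed by a non-linearity. Here is a minimal NumPy sketch; all names and sizes are invented for the example.)

```python
import numpy as np

def fully_connected_layer(x, W, b):
    """Weighted sum (multiply-and-accumulate over the inputs) followed by a ReLU."""
    y = W @ x + b            # each output j is sum_i W[j, i] * x[i] + b[j]
    return np.maximum(y, 0)  # ReLU non-linearity: negative values become zero

# Toy example: 4 inputs, 3 outputs (arbitrary sizes).
x = np.random.randn(4)
W = np.random.randn(3, 4)    # weights learned from the training data
b = np.zeros(3)
print(fully_connected_layer(x, W, b))
```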
There are various different types of layers used for deep neural networks. 00:14:40.840 |
So for example, you can have feed-forward layers 00:14:43.040 |
where the inputs are always connected to the outputs. 00:14:51.720 |
where basically all the outputs are connected 00:14:56.800 |
And you might be familiar with some of these layers. 00:15:05.680 |
When you put them together, they're typically referred 00:15:10.460 |
You have convolutional layers, which are also feed-forward, 00:15:20.360 |
they're often referred to as convolutional networks. 00:15:23.320 |
And they're typically used for image-based processing. 00:15:35.800 |
they're referred to as recurrent neural nets. 00:15:37.360 |
And these are typically used to process sequential data, 00:15:42.960 |
And then most recently, which has become really popular, 00:15:46.240 |
are the attention layers or attention-based mechanisms. 00:16:05.880 |
computationally more complex than other types of processing. 00:16:08.720 |
So we'll focus on convolutional neural nets as an example, 00:16:20.240 |
So how does it actually perform convolution itself? 00:16:27.320 |
If it's at the input of the neural net, it would be an image. 00:16:38.320 |
And we convolve it with, let's say, a 2D filter, 00:16:42.600 |
Right, so typical convolution, what you would do 00:16:45.320 |
is you would do an element-wise multiplication 00:16:47.840 |
of the filter weights with the input feature map activations. 00:16:52.320 |
You would sum them all together to generate one output value. 00:16:55.760 |
And we would refer to that as the output activation. 00:17:05.520 |
and generate all the other output feature map activations. 00:17:05.520 |
What makes convolutional neural nets much more challenging 00:17:21.480 |
So first of all, rather than doing just this 2D convolution, 00:17:27.000 |
So there's this third dimension called channels. 00:17:29.200 |
And then what we're doing here is that we need to do 00:17:35.960 |
And you can think of these channels for an image, 00:17:38.320 |
these channels would be kind of the red, green, 00:17:43.920 |
the number of channels could potentially increase. 00:17:45.920 |
So if you look at AlexNet, which is a popular neural net, 00:17:48.680 |
the number of channels ranges from three to 192. 00:17:52.480 |
Okay, so that already increases the dimensionality, 00:18:06.760 |
Okay, so for example, you might apply N filters 00:18:12.120 |
and then you would generate an output feature map 00:18:16.720 |
So in the previous slide, we showed that convolving 00:18:37.120 |
we're talking about between 96 to 384 filters. 00:18:41.120 |
And of course, this is increasing to thousands 00:18:43.280 |
for other advanced or more modern neural nets itself. 00:18:58.800 |
or N input feature maps becomes N output feature maps. 00:19:02.400 |
And we typically refer to this as a batch size, 00:19:07.200 |
at the same time, and this can range from one to 256. 00:19:10.280 |
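(Illustrative aside, not from the talk: to make the dimensions concrete, here is a deliberately naive sketch of a convolutional layer with a batch of N inputs, C input channels, and M filters; the loop structure mirrors the description above, and the sizes are arbitrary.)

```python
import numpy as np

def conv_layer(inputs, filters):
    """Naive convolution: inputs is (N, C, H, W), filters is (M, C, R, S).
    Each of the M filters spans all C input channels and produces one output channel."""
    N, C, H, W = inputs.shape
    M, _, R, S = filters.shape
    out = np.zeros((N, M, H - R + 1, W - S + 1))
    for n in range(N):                      # batch
        for m in range(M):                  # filters -> output channels
            for i in range(H - R + 1):      # output height
                for j in range(W - S + 1):  # output width
                    # multiply-and-accumulate over an R x S window across all C channels
                    out[n, m, i, j] = np.sum(inputs[n, :, i:i+R, j:j+S] * filters[m])
    return out

# Arbitrary example sizes: batch of 2, 3 input channels, 8x8 inputs, 4 filters of 3x3.
y = conv_layer(np.random.randn(2, 3, 8, 8), np.random.randn(4, 3, 3, 3))
print(y.shape)  # (2, 4, 6, 6)
```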
Okay, so these are all the various different dimensions 00:19:18.440 |
the network architecture of the neural net itself 00:19:25.040 |
So it's gonna define all these different dimensions 00:19:27.800 |
of the neural net itself, and these shapes can vary 00:19:35.400 |
MobileNet as an example, this is a very popular 00:19:37.920 |
neural network cell, you can see that the filter sizes, 00:19:40.840 |
right, so the height and width of the filters 00:19:44.040 |
and the number of filters and number of channels 00:19:45.600 |
will vary across the different blocks or layers itself. 00:19:51.440 |
is that when we look towards popular DNN models, 00:19:56.760 |
So shown here are the various different models 00:20:03.800 |
one is that the networks tend to become deeper, 00:20:09.800 |
And then also the number of weights that they're using 00:20:13.760 |
and the number of MACs are also increasing as well. 00:20:17.840 |
the DNN models are getting larger and deeper, 00:20:31.280 |
or overview into the deep neural network space, 00:20:38.600 |
to make the processing of these neural networks 00:20:42.840 |
And often we refer to this as hardware acceleration. 00:20:46.120 |
All right, so we know these neural networks are very large, 00:20:53.840 |
or processing of these networks more efficient? 00:20:58.960 |
is that they actually exhibit a lot of parallelism. 00:21:11.040 |
'cause I can do a lot of these processing in parallel. 00:21:13.960 |
What is difficult and what should not be a surprise 00:21:16.120 |
to you now is that the memory access is the bottleneck. 00:21:21.680 |
Delivering the data to the multiply and accumulate engine is what's really challenging. 00:21:24.240 |
So I'll give you an insight as to why this is the case. 00:21:45.160 |
which is like the partially accumulated value 00:21:49.320 |
and then it would generate an updated partial sum. 00:22:04.120 |
The other challenge that you have is, as we mentioned, 00:22:15.280 |
if you read the data from DRAM, it's off-chip memory, 00:22:21.960 |
it's gonna be two orders of magnitude more expensive 00:22:26.040 |
than the computation of performing a MAC itself. 00:22:31.320 |
So if you can imagine, again, if we look at AlexNet, 00:22:35.400 |
we're talking about three billion DRAM accesses 00:22:47.200 |
So one is what we call input data reuse opportunities, 00:22:50.520 |
which means that a lot of the data that we're reading, 00:22:53.000 |
we're using to perform these multiplies and accumulates, 00:22:55.400 |
they're actually used for many multiplies and accumulates. 00:23:00.560 |
we can reuse it multiple times for many operations, right? 00:23:09.400 |
So again, if you remember, we're taking a filter 00:23:11.680 |
and we're sliding it across this input image. 00:23:15.400 |
And so as a result, the activations from the feature map 00:23:21.200 |
are gonna be reused in different combinations 00:23:23.760 |
to compute the different multiply and accumulate values 00:23:32.000 |
Another example is that we're actually, if you recall, 00:23:35.680 |
gonna apply multiple filters on the same input feature map. 00:23:40.080 |
So that means that each activation in that input feature map 00:23:43.960 |
can be reused multiple times across the different filters. 00:23:57.760 |
can be reused multiple times across these input feature maps. 00:24:05.920 |
reuse opportunities in the neural network itself. 00:24:09.320 |
And so what can we do to exploit these reuse opportunities? 00:24:09.320 |
right beside the multiply and accumulate engine. 00:24:39.000 |
locally beside that multiply and accumulate engine. 00:24:46.200 |
So for example, if performing a multiply and accumulate 00:24:50.160 |
with the ALU costs 1X in energy, reading from this very small memory 00:25:02.800 |
and a processing element is gonna be this multiply 00:25:06.520 |
I can also allow the different processing elements 00:25:11.720 |
And so reading from a neighboring processing element 00:25:16.200 |
And then finally, you can have a shared larger memory 00:25:24.120 |
across all the different processing elements. 00:25:25.400 |
This tends to be larger between 100 and 500 Kbytes. 00:25:35.600 |
that's gonna be the most expensive at 200X the energy. 00:25:51.240 |
But the challenge here is that this very small local memory 00:25:56.800 |
that are millions of weights in terms of size, right? 00:26:05.480 |
to kind of think through how this is related, 00:26:15.720 |
or going back to, let's say, your office here, 00:26:26.240 |
you might not be able to fill it in your backpack. 00:26:32.120 |
into smaller chunks so that I can access them all 00:26:38.080 |
And so there's been a lot of research in this area 00:26:40.800 |
in terms of what's the best way to break up the data 00:26:43.040 |
and what should I store in this very small local memory? 00:26:46.800 |
So one approach is what we call weight stationary. 00:26:56.480 |
And so as a result, I really minimize the weight energy. 00:27:01.920 |
the other types of data that you have in your system, 00:27:04.280 |
so for example, your input activations shown in the blue, 00:27:07.280 |
and then the partial sums that are shown in the red, 00:27:12.360 |
so through the network and from the global buffer, okay? 00:27:32.080 |
"Well, so the weight, I only ever have to read it. 00:27:35.320 |
"But the partial sums, I have to read it and write it 00:27:55.960 |
is gonna be local within that one processing element. 00:27:59.680 |
The trade-off, of course, is the activations of weights 00:28:05.080 |
And then there's various different works called, 00:28:09.560 |
and some work from the Chinese Academy of Sciences 00:28:19.560 |
"or so the outputs and the weights themselves. 00:28:22.680 |
"Let's keep the input stationary within this small memory." 00:28:29.960 |
from some research work from NVIDIA has examined this. 00:28:34.560 |
really focus on not moving one particular type of data. 00:28:34.560 |
is that maybe you wanna reduce the data movement 00:28:49.160 |
of all different data types, all types of energy. 00:28:52.600 |
this is something that we've developed within our own group, 00:28:54.680 |
is looking at what we call the row stationary data flow. 00:29:09.880 |
You have the activations of your input feature map. 00:29:13.320 |
And then you also have your partial sum information. 00:29:15.640 |
So you're really trying to balance the data movement 00:29:23.840 |
but we just talked about the fact that the neural network 00:29:28.400 |
So you can imagine expanding this to higher dimensions. 00:29:39.520 |
that you can map onto this architecture as well. 00:29:45.480 |
you might not wanna focus on one particular data type. 00:29:48.400 |
You wanna actually optimize for all the different types 00:29:51.440 |
of data that you're moving around in your system. 00:29:59.560 |
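(Illustrative aside, not from the talk: the dataflows being compared mostly differ in which operand is held fixed in the small local memory while the loops iterate around it. Below is a rough 1D sketch contrasting a weight-stationary and an output-stationary loop ordering; it ignores the spatial array of processing elements entirely, and all names and sizes are invented.)

```python
import numpy as np

def weight_stationary(X, W):
    """X is (B, K) activations, W is (M, K) weights. Each weight is fetched once and
    held locally while it is reused across the whole batch; the partial sums are the
    data that keep moving in and out instead."""
    B, K = X.shape
    M, _ = W.shape
    out = np.zeros((B, M))
    for m in range(M):
        for k in range(K):
            w = W[m, k]                       # weight held "stationary"
            for b in range(B):
                out[b, m] += w * X[b, k]      # partial sums re-read and re-written
    return out

def output_stationary(X, W):
    """Each partial sum stays in a local accumulator until it is final; the weights
    and activations stream past it instead."""
    B, K = X.shape
    M, _ = W.shape
    out = np.zeros((B, M))
    for b in range(B):
        for m in range(M):
            acc = 0.0                         # partial sum held "stationary"
            for k in range(K):
                acc += W[m, k] * X[b, k]
            out[b, m] = acc
    return out

X, W = np.random.randn(2, 8), np.random.randn(4, 8)
assert np.allclose(weight_stationary(X, W), output_stationary(X, W))
```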
or these different types of data flows would work. 00:30:02.520 |
So for example, in the weight stationary case, 00:30:16.760 |
which is the input feature map or input pixels, 00:30:31.240 |
which is the weight stationary data movement, 00:30:35.560 |
and the blue, which is the inputs, is gonna be increased. 00:30:35.560 |
There's another approach called no local reuse, 00:30:43.280 |
but you can see that row stationary, for example, 00:30:59.080 |
you wanna optimize overall for all the movement 00:31:03.360 |
Okay, another thing that you can also exploit 00:31:08.280 |
is the fact that, you know, some of the data could be zero. 00:31:18.840 |
to your multiply and accumulate is gonna be zero, 00:31:43.040 |
For example, you can use things like run length encoding, 00:31:48.040 |
is gonna be represented rather than, you know, 00:31:53.000 |
And this can actually reduce the amount of data movement 00:32:09.440 |
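(Illustrative aside, not from the talk: one simple way to see how run-length coding helps is the sketch below, which stores sparse activations as (run-of-zeros, value) pairs. This is only a toy version of the idea, not the encoding used on the actual chip.)

```python
def rle_encode(values):
    """Encode a sequence as (number_of_preceding_zeros, nonzero_value) pairs.
    Long runs of zeros then cost almost nothing to store or move."""
    encoded, zeros = [], 0
    for v in values:
        if v == 0:
            zeros += 1
        else:
            encoded.append((zeros, v))
            zeros = 0
    if zeros:                      # trailing zeros, marked with a sentinel value
        encoded.append((zeros, None))
    return encoded

def rle_decode(encoded):
    out = []
    for zeros, v in encoded:
        out.extend([0] * zeros)
        if v is not None:
            out.append(v)
    return out

acts = [0, 0, 0, 5, 0, 0, 7, 0, 0, 0, 0]
enc = rle_encode(acts)
assert rle_decode(enc) == acts
print(enc)   # [(3, 5), (2, 7), (4, None)]
```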
And then there's other techniques, for example, 00:32:11.200 |
we call pruning, which is setting some of the weights 00:32:22.840 |
in particular a customized chip that we called Eyeriss 00:32:22.840 |
and exploiting sparsity in the activation data. 00:32:53.360 |
is a die photo of the fabricated chip itself, right? 00:32:53.360 |
And this is four millimeters by four millimeters 00:33:00.600 |
And so using that, you know, row stationary data flow, 00:33:19.640 |
each of these processing elements has, you know, 00:33:25.200 |
and it's also sharing with other processing elements. 00:33:27.760 |
So overall, when you compare it to a mobile GPU, 00:33:30.080 |
you're talking about an order of magnitude reduction 00:33:34.080 |
If you'd like to learn a little bit more about that, 00:33:36.760 |
I invite you to visit the Eyeriss project website. 00:33:36.760 |
Let's say we don't care anything about the hardware, 00:33:54.440 |
We want to build, you know, an overall system. 00:33:57.880 |
is the trade-off between energy and accuracy, right? 00:34:06.280 |
and let's say this is for an object detection task, right? 00:34:13.080 |
and it's listed in terms of average precision, 00:34:15.680 |
which is a metric that we use for object detection. 00:34:18.360 |
It's on a linear scale, and higher, the better. 00:34:25.280 |
This is the energy that's being consumed per pixel. 00:34:37.840 |
And so if you think before neural nets, you know, 00:34:44.880 |
used features called histogram of oriented gradients, right? 00:34:49.320 |
This is a very popular approach to be very efficient 00:34:52.280 |
in terms of, or quite accurate in terms of object detection. 00:35:01.480 |
So you can imagine AlexNet here almost doubled the accuracy, 00:35:10.800 |
But then we want to look also on the vertical axis, 00:35:27.760 |
that's been designed for that particular task. 00:35:35.520 |
So they use the same transistors around the same size 00:35:37.840 |
that does object detection using the HOG features. 00:35:40.800 |
And then here's the Eyeriss chip that we just talked about. 00:35:40.800 |
The students who built these chips, you know, 00:35:58.360 |
We can see that histogram of oriented gradients, 00:36:07.000 |
video compression, again, something that you all have 00:36:09.800 |
in your phone, HOG features are actually more efficient 00:36:12.960 |
than video compression, meaning for the same energy 00:36:35.720 |
I'm gonna double the accuracy of its recognition, 00:36:42.240 |
who here would be interested in that technology? 00:36:47.760 |
So in the sense that battery life is so critical 00:36:50.200 |
to how we actually use these types of technologies. 00:36:57.960 |
we should really also consider the energy consumption, 00:37:01.000 |
and we really don't want the energy to be so high. 00:37:03.480 |
And we can see that even with specialized hardware, 00:37:06.180 |
we're still quite far away from making neural nets 00:37:10.080 |
as efficient as something like video compression 00:37:14.800 |
So we really have to think of how we can further 00:37:23.600 |
So actually, there's been a huge amount of research 00:37:25.440 |
in this space, because we know neural nets are popular, 00:37:28.120 |
and we know that they have a wide range of applications, 00:37:31.640 |
So people have looked at how can we design new hardware 00:37:35.240 |
that can be more efficient, or how can we design algorithms 00:37:38.240 |
that are more efficient to enable energy-efficient 00:37:41.720 |
And so in fact, within our own research group, 00:37:43.660 |
we spend quite a bit of time kind of surveying the area 00:37:46.440 |
and understanding what are the various different types 00:37:48.520 |
of developments that people have been looking at. 00:37:51.840 |
we actually generated various tutorials on this material, 00:37:57.280 |
This is an overview paper that's about 30 pages 00:37:59.840 |
and we're currently expanding it into a book. 00:38:02.920 |
I would encourage you to visit these resources. 00:38:07.120 |
as we were doing this kind of survey of the area, 00:38:10.000 |
is that we actually identified various limitations 00:38:14.920 |
or how the research is approaching this problem. 00:38:22.840 |
that people are using to try and make the DNN algorithms 00:38:30.320 |
The idea here is you're gonna set some of the weights 00:38:32.560 |
to become zero, and again, anything times zero is zero, 00:38:40.440 |
There's also looking at efficient network architectures, 00:38:43.080 |
meaning rather than making my neural networks very large, 00:38:50.920 |
So rather than this 3D filter, can I make it a 2D filter 00:38:59.000 |
Another very popular thing is reduced precision. 00:39:01.240 |
So rather than using the default of 32-bit float, 00:39:04.400 |
can I reduce the number of bits down to eight bits 00:39:08.160 |
We saw before that as we reduce the precision 00:39:10.920 |
of these operations, you also get energy savings, 00:39:16.800 |
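(Illustrative aside, not from the talk: a simple linear quantizer along these lines maps 32-bit floats onto 8-bit integers plus a scale factor. This is a generic sketch, not the exact scheme used by any particular accelerator.)

```python
import numpy as np

def quantize_int8(w):
    """Map float weights to 8-bit integers plus a per-tensor scale factor."""
    scale = np.max(np.abs(w)) / 127.0            # largest magnitude maps to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(1000).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", np.max(np.abs(w - dequantize(q, s))))  # roughly scale / 2
```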
A lot of this work really focuses on reducing 00:39:19.220 |
the number of MACs and the number of weights, 00:39:23.080 |
and those are primarily because those are easy to count. 00:39:27.720 |
if we care about the system is does this actually translate 00:39:37.640 |
We don't really, when you're thinking about something 00:39:39.400 |
running on your phone, you don't care about the number 00:39:40.800 |
of MACs and weights, you care about how much energy 00:39:42.520 |
it's consuming 'cause that's gonna affect the battery life, 00:39:58.900 |
So the key takeaway from this slide is that if you remember 00:40:01.520 |
where the energy comes from, which is the data movement, 00:40:04.360 |
it's not because of how many weights or how many MACs you 00:40:06.960 |
have, but really it depends on where the weight comes from. 00:40:10.320 |
If it comes from this small memory register file 00:40:14.240 |
that's nearby, it's gonna be super cheap as opposed 00:40:18.720 |
So all weights are basically not created equal, 00:40:26.240 |
So we can't just look at the number of weights 00:40:29.720 |
and the number of MACs and estimate how much energy 00:40:46.760 |
we basically take in the DNN weights and the input data, 00:40:51.920 |
We know the different shapes of the different layers 00:40:55.480 |
of the neural net, and we run an optimization 00:40:59.440 |
how much energy consumed by the data movement, 00:41:11.360 |
And once you have this, you can kind of figure out, 00:41:13.400 |
well, where is the energy going so I can target my design 00:41:18.800 |
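(Illustrative aside, not from the talk: conceptually, this kind of estimate boils down to counting how many times each piece of data is fetched from each level of the memory hierarchy and weighting those counts by a per-access energy cost. The relative costs below are made up, loosely in the spirit of the 1X / 2X / 6X / 200X numbers mentioned earlier, and the access counts are hypothetical.)

```python
# Assumed relative energy per access (normalized to one MAC); illustrative only.
COST = {"register_file": 1.0, "neighbor_pe": 2.0, "global_buffer": 6.0, "dram": 200.0}

def estimate_energy(access_counts, num_macs):
    """access_counts maps memory level -> number of accesses (weights, activations,
    and partial sums combined); num_macs is the compute count."""
    data_movement = sum(COST[level] * n for level, n in access_counts.items())
    compute = 1.0 * num_macs
    return compute + data_movement

# Hypothetical layer: most traffic served from small local memories, some from DRAM.
print(estimate_energy(
    {"register_file": 3e6, "global_buffer": 4e5, "dram": 5e4}, num_macs=1e6))
```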
Okay, and so by doing this, when we take a look, 00:41:23.280 |
it should be no surprise, one of the key observations 00:41:28.720 |
are not a good metric for energy consumption. 00:41:31.120 |
If you take a look at GoogleNet, for example, 00:41:34.400 |
this is running on kind of the Eyeriss architecture, 00:41:34.400 |
So in general, this is the same message as before. 00:42:03.240 |
where the energy is going, how can we factor that 00:42:09.800 |
So we talked about the concept of pruning, right? 00:42:13.160 |
So again, pruning was setting some of the weights 00:42:15.400 |
of the neural net to zero, or you can think of it 00:42:18.720 |
And so what we wanna do here is that now we know 00:42:29.080 |
where we should actually remove the weights from? 00:42:36.680 |
for the same accuracy across the different approaches. 00:42:39.040 |
Traditionally, what happens is that people tend to remove the smallest weights first, 00:42:45.720 |
and you can see that you get about a 2x reduction 00:42:50.680 |
However, we know that like the size of the weight 00:42:53.200 |
has nothing to do with, or the value of the weight 00:42:54.920 |
has nothing to do with the energy consumption. 00:42:56.400 |
Ideally, what you'd like to do is remove the weights 00:43:02.160 |
In particular, we also know that the more weights 00:43:04.000 |
that we remove, the accuracy is gonna go down. 00:43:11.760 |
One way you can do this is you can take your neural network, 00:43:21.440 |
in terms of high energy layer to low energy layers, 00:43:25.280 |
and then you prune the high energy layers first. 00:43:28.320 |
So this is what we call energy-aware pruning. 00:43:38.240 |
And again, this is because we factor in energy consumption 00:43:41.520 |
into the design of the neural network itself. 00:43:52.120 |
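(Illustrative aside, not from the talk and not the actual algorithm from the paper: a minimal sketch of the ordering idea is to estimate each layer's energy, sort layers from highest to lowest energy, and prune the smallest-magnitude weights in the high-energy layers first, re-checking accuracy as you go.)

```python
import numpy as np

def energy_aware_prune(layers, estimate_energy, prune_fraction=0.2):
    """layers: dict name -> weight array. estimate_energy: callable name -> energy cost.
    Prunes the highest-energy layers first by zeroing their smallest-magnitude weights."""
    order = sorted(layers, key=estimate_energy, reverse=True)  # high-energy layers first
    for name in order:
        w = layers[name]
        threshold = np.quantile(np.abs(w), prune_fraction)
        w[np.abs(w) < threshold] = 0.0        # magnitude pruning within the layer
        # ... in the real flow you would fine-tune here and stop once the
        # accuracy drop exceeds your budget.
    return layers

layers = {"conv1": np.random.randn(64, 3, 3, 3), "fc1": np.random.randn(10, 256)}
pruned = energy_aware_prune(layers, estimate_energy=lambda n: {"conv1": 5.0, "fc1": 1.0}[n])
print({n: float((w == 0).mean()) for n, w in pruned.items()})
```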
from a performance point of view is latency, right? 00:43:55.240 |
So for example, latency has to do with how long it takes 00:43:58.000 |
when I give it an image, how long will I get the result back? 00:44:04.400 |
But the challenge here is that latency, again, 00:44:19.760 |
You can do, so going towards the left, you're increasing. 00:44:30.440 |
And what they're showing here is that the number of MACs 00:44:33.520 |
is not really a good approximation of latency. 00:44:39.560 |
neural networks that have the same number of MACs, 00:44:41.880 |
there can be a 2x range or 2x swing in terms of latency. 00:44:50.680 |
they can have a 3x swing in terms of number of MACs. 00:44:55.240 |
So the key takeaway here is that you can't just count the number of MACs to estimate latency. 00:44:59.880 |
It's actually much more challenging than that. 00:45:09.360 |
and use that again to design the neural net directly? 00:45:14.720 |
And so together with Google's Mobile Vision team, 00:45:22.000 |
your particular neural network for a given mobile platform 00:45:34.720 |
So measurements of how that particular network 00:45:39.760 |
So measurements for things like latency and energy. 00:45:42.560 |
And the reason why we want to use empirical measurements 00:45:46.880 |
for all the different types of hardware out there. 00:45:48.960 |
In the case of Google, what they want is that, 00:45:51.400 |
if they have a new phone, you can automatically tune 00:45:55.400 |
You don't want to have to model the phone as well. 00:46:00.040 |
So you'll start off with a pre-trained network. 00:46:10.480 |
And so what you're gonna do is you're gonna take that 00:46:18.800 |
or this amount of latency, this amount of energy. 00:46:34.840 |
And then based on these empirical measurements, 00:46:36.960 |
NetAdapt is gonna then generate a new set of proposals. 00:46:41.880 |
until it gets an adapted network as an output. 00:46:45.360 |
Okay, and again, all of this is on the NetAdapt website. 00:46:48.400 |
Just to give you a quick example of how this might work. 00:46:50.480 |
So let's say you start off with, as your input, 00:46:53.320 |
a neural network that has the accuracy that you want, 00:46:58.600 |
and you would like for it to be 80 milliseconds. 00:47:11.640 |
until it hits the latency budget of 80 milliseconds. 00:47:15.760 |
And it can do that for all the different layers. 00:47:22.160 |
Right, so let's say, oh, this one where I just 00:47:24.480 |
shortened the number of channels in layer one 00:47:37.880 |
and it's gonna be the input to the next iteration. 00:47:43.520 |
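(Illustrative aside, not the released NetAdapt code: the loop just described can be sketched roughly as below. The "network" here is just a list of per-layer widths, and measure_latency and evaluate_accuracy are toy stand-ins for the empirical on-device measurement and the short fine-tune-plus-evaluate step.)

```python
# Toy stand-ins (purely illustrative): a "network" is just a list of per-layer widths,
# "latency" grows with total width, and "accuracy" shrinks as layers get narrower.
def measure_latency(net):                 # empirical on-device measurement in the real flow
    return sum(net)

def evaluate_accuracy(net):               # short fine-tune + evaluation in the real flow
    return sum(w ** 0.5 for w in net)

def netadapt_style_search(net, target_latency, step=0.05):
    """Iteratively shrink one layer at a time until the measured latency budget is met,
    keeping, at each iteration, the proposal that preserves the most accuracy."""
    budget = measure_latency(net)
    while measure_latency(net) > target_latency:
        budget = max(target_latency, int(budget * (1 - step)))   # tighten budget gradually
        proposals = []
        for i in range(len(net)):
            candidate = list(net)
            # Shrink layer i just enough for the whole net to meet the current budget.
            candidate[i] = max(1, candidate[i] - (measure_latency(net) - budget))
            proposals.append(candidate)
        net = max(proposals, key=evaluate_accuracy)              # best-accuracy proposal wins
    return net

print(netadapt_style_search([100, 120, 80], target_latency=240))
```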
I just invite you to go take a look at the NetAdapt paper. 00:47:49.720 |
Well, it gives you actually a very much improved trade-off 00:48:00.520 |
So to the left is better, so it's lower latency. 00:48:07.400 |
it's gonna be the accuracy, so higher, better. 00:48:15.120 |
various kind of handcrafted neural network-based approaches. 00:48:18.960 |
And you can see NetAdapt, which generates the red dots 00:48:27.280 |
for the same accuracy, it can be up to 1.7x faster 00:48:47.320 |
that you wanna run quickly or you wanna be energy efficient, 00:48:56.160 |
or latency measurements into the design itself 00:48:59.280 |
This particular, you know, example here is shown 00:49:14.000 |
From a 2D image, you reduce it down to a label. 00:49:19.000 |
But we actually want to see if we can still apply 00:49:24.840 |
In this case, you know, I give you a 2D image 00:49:34.680 |
is basically showing the depth of each pixel at the input. 00:49:37.720 |
This is often what we'd refer to as, you know, monocular depth estimation: you take a single 00:49:44.680 |
image input and you can estimate the depth itself. 00:49:46.400 |
The reason why you want to do this is, you know, 00:49:47.960 |
2D cameras, regular cameras are pretty cheap, right? 00:49:52.880 |
You can imagine like the way that we would do this 00:49:59.040 |
is still looking like what we call an encoder followed by a decoder. 00:50:09.080 |
So it's going to expand the information back out, right? 00:50:14.280 |
than just classification because now my output 00:50:19.160 |
And so we want to see if we could make this really fast 00:50:32.240 |
depth-wise decomposition, you can actually increase 00:50:51.480 |
the depth estimation in terms of the delta one metric, 00:51:00.720 |
And so you can see, you know, the various different 00:51:04.240 |
This star, the red star, is the approach, FastDepth, 00:51:07.520 |
using all the different efficient 00:51:09.800 |
network design techniques that we talked about. 00:51:11.240 |
And you can see you can get an order of magnitude 00:51:13.040 |
over a 10x speedup while maintaining accuracy. 00:51:21.240 |
We presented this at ICRA, which is a robotics conference 00:51:26.120 |
And we wanted to show some live footage there. 00:51:28.000 |
So at ICRA, we actually captured some footage on an iPhone 00:51:31.880 |
and showed the real-time depth estimation on an iPhone itself. 00:51:35.320 |
And you can achieve about 40 frames per second on an iPhone 00:51:39.680 |
So again, if you're interested in this particular type 00:51:42.520 |
of application or efficient networks for depth estimation, 00:51:47.800 |
OK, so that's the algorithmic side of things. 00:51:56.160 |
So again, we saw that there's many different ways 00:52:12.840 |
might apply to the algorithm that they're going 00:52:25.560 |
can support all of these different approaches 00:52:27.680 |
and translate these approaches to improvements in energy 00:52:33.600 |
Now, the challenge is a lot of the specialized DNN hardware 00:52:37.920 |
that exist out there often rely on certain properties of the DNN 00:52:44.520 |
So a very typical structure that you might see 00:52:47.240 |
is that you might have an array of multiply and accumulate 00:53:00.680 |
weight memory once, I'm going to reuse it multiple times 00:53:06.120 |
and it can be used multiple times by multiple engines 00:53:20.320 |
and the array utilization depends on the number of channels 00:53:23.800 |
you have on your neural net, the size of the feature map, 00:53:27.400 |
So this is, again, just showing two different variations of-- 00:53:30.160 |
you're going to reuse based on the number of filters, number 00:53:37.360 |
start looking at these efficient neural network models, 00:53:48.080 |
We saw you took that 3D filter and then decomposed it 00:53:54.480 |
And so as a result, you only have one channel. 00:53:56.400 |
So you're not going to have much reuse across the input channel. 00:53:59.520 |
And so rather than filling this array with a lot of computation 00:54:05.160 |
going to be able to utilize a very small subset, which 00:54:07.640 |
I've highlighted here in green, of the array itself 00:54:10.840 |
So even though you throw down 1,000 multiplies, 00:54:15.800 |
only a very small subset of them can actually do work. 00:54:20.760 |
So this is also an issue because as I scale up the array size, 00:54:26.100 |
Ideally, what you would like is that if I put more, you know, 00:54:34.600 |
But it doesn't because it can't- the data can't reach or be 00:54:40.480 |
and it's also going to be difficult to exploit sparsity. 00:54:47.760 |
meaning that there's many different ways for the data 00:54:53.120 |
And so you can imagine row stationary is a very flexible 00:54:56.120 |
way that we can basically map the neural network 00:54:59.040 |
You can see here in the Eyeriss or row stationary case 00:55:01.800 |
that a lot of the processing elements can be used. 00:55:06.120 |
deliver the data for this varying degree of reuse? 00:55:10.040 |
So here's like the spectrum of on-chip networks 00:55:15.800 |
from that global buffer to all those parallel processing 00:55:21.360 |
One use case is when I use these huge neural nets that 00:55:32.680 |
You can think of that as like broadcasting information out. 00:55:35.360 |
And a type of network that you would do for that 00:55:39.480 |
So this is low bandwidth, so I'm only reading very little data, 00:56:02.480 |
So that's going to be, as shown here on the left-hand side, 00:56:13.280 |
Now, it's very challenging to go across this entire spectrum. 00:56:16.680 |
One solution would be what we call an all-to-all network 00:56:21.680 |
So all things are-- all inputs are connected to all outputs. 00:56:24.080 |
It's going to be very expensive and not scalable. 00:56:30.860 |
So you can break this problem into two steps. 00:56:33.040 |
At the lowest level, you can use an all-to-all connection. 00:56:37.960 |
And then at the higher level, you can use a mesh connection. 00:56:49.320 |
you can basically support a lot of different delivery 00:56:51.560 |
mechanisms to deliver data from the global buffer 00:56:54.480 |
to all the processing elements so that all your cores, 00:56:57.520 |
all your computes can be happening at the same time. 00:56:59.840 |
And at its core, this is one of the key things 00:57:07.720 |
So these are some results from the second version of Eyeriss. 00:57:13.520 |
both the very large shapes as well as very compact, 00:57:18.400 |
including convolutional, fully connected, and depth-wise layers. 00:57:21.040 |
So you can see here in this plot, depending on the shape, 00:57:25.200 |
you can get up to an order of magnitude speed up. 00:57:28.400 |
It also supports a wide range of sparsities, both dense 00:57:32.100 |
So this is really important because some networks 00:57:37.100 |
And so you want to efficiently support all of those. 00:57:40.960 |
So as you increase the number of processing elements, 00:57:47.360 |
And as a result of this particular type of design, 00:57:56.920 |
And this is one way that you can speed up and make 00:58:01.920 |
But it's also important to take a step back and look 00:58:11.020 |
So can we look beyond the DNN accelerator for acceleration? 00:58:15.300 |
And so one good place to show this as an example 00:58:19.740 |
So how many of you are familiar with the task of super-resolution? 00:58:23.140 |
All right, so for those of you who aren't, the idea is 00:58:26.020 |
So I want to basically generate a high-resolution image 00:58:34.980 |
One is that it can allow you to basically reduce 00:58:39.260 |
So for example, if you have limited communication, 00:58:41.340 |
I'm going to send a low-res version of a video, 00:58:52.700 |
So every year at CES, they announce a higher-resolution 00:58:56.060 |
But if you think about the movies that we watch, 00:59:02.580 |
So again, you want to generate a high-resolution 00:59:09.260 |
And the idea here is that your high-resolution is not 00:59:11.460 |
just interpolation, because it can be very blurry, 00:59:15.420 |
a high-resolution version of the video or image itself. 00:59:20.060 |
And that's basically called super-resolution. 00:59:23.100 |
But one of the challenges for super-resolution 00:59:27.580 |
So again, you can imagine that the state-of-the-art approaches 00:59:34.140 |
about neural nets are talking about input images 00:59:38.140 |
Now imagine if you extend that to an HD image. 00:59:42.860 |
So what we want to do is think of different ways 00:59:45.300 |
that we can speed up the super-resolution process, 00:59:51.060 |
of looking around the other components of the system 00:59:56.060 |
So one of the approaches we took is this framework called FAST, 01:00:00.860 |
which can accelerate any super-resolution algorithm by an order of magnitude. 01:00:10.900 |
And if you think about the video compression community, 01:00:14.300 |
they look at video very differently than people 01:00:22.020 |
when I give you a compressed video, what you basically 01:00:33.580 |
Actually, a compressed video is a very structured 01:00:37.460 |
representation of the redundancy in the video itself. 01:00:44.900 |
look very-- consecutive frames look very similar. 01:00:56.740 |
And that's where you get the compression from. 01:00:58.660 |
So actually, what a compressed video looks like 01:01:00.620 |
is a description of the structure of the video itself. 01:01:09.700 |
So for example, rather than applying super-resolution 01:01:14.100 |
to every single low-res frame, which is the typical approach-- 01:01:18.440 |
to each low-res frame, and you would generate a bunch 01:01:22.980 |
what you can actually do is apply super-resolution 01:01:31.700 |
you get in the compressed video that tells you 01:01:33.540 |
the structure of the video to generate or transfer 01:01:36.780 |
and generate all those high-resolution videos 01:01:40.700 |
And so it only needs to run on a subset of frames. 01:01:47.140 |
have that structured image is going to be very low. 01:01:49.940 |
So for example, if I'm going to transfer to n frames, 01:01:57.100 |
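(Illustrative aside, not from the talk: a very rough sketch of the transfer idea is below. One anchor frame is super-resolved, and the motion vectors already present in the compressed bitstream are reused to place those high-resolution pixels in the following frames. Real codecs store motion vectors per block and the real FAST framework handles residuals and block boundaries far more carefully; the function and variable names here are invented.)

```python
import numpy as np

def transfer_super_resolution(sr_anchor, motion_vectors, block=16, scale=2):
    """sr_anchor: the one frame we actually super-resolved, shape (H*scale, W*scale).
    motion_vectors: dict mapping each low-res block (by, bx) -> (dy, dx) motion in
    low-res pixels, taken straight from the compressed bitstream. For every block we
    copy the matching high-res patch from the anchor instead of re-running the SR network."""
    out = np.zeros_like(sr_anchor)
    hb = block * scale                                   # high-res block size
    for (by, bx), (dy, dx) in motion_vectors.items():
        y, x = by * hb, bx * hb                          # destination block (high-res coords)
        sy, sx = y + dy * scale, x + dx * scale          # source location in the anchor
        out[y:y + hb, x:x + hb] = sr_anchor[sy:sy + hb, sx:sx + hb]
    return out

# Toy example: a 64x64 low-res frame upscaled 2x; a few blocks all move by (1, -2) pixels.
sr = np.random.rand(128, 128)
mvs = {(by, bx): (1, -2) for by in range(1, 3) for bx in range(1, 3)}
print(transfer_super_resolution(sr, mvs).shape)  # (128, 128)
```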
So to evaluate this, we showcase this on a range of videos. 01:02:05.220 |
And you can see, first, on the left-hand side 01:02:07.540 |
is that if I transfer to four different frames, 01:02:13.060 |
And then the PSNR, which indicates the quality, 01:02:18.980 |
If I do transfer to 16 frames or 16 acceleration, 01:02:24.380 |
But still, you get basically a 16x acceleration. 01:02:40.700 |
look at the video itself or subjective quality. 01:02:45.380 |
is if I applied super resolution on every single frame. 01:02:53.820 |
is if I just did interpolation on every single frame. 01:02:56.980 |
And so where you can tell the difference is by looking 01:03:00.780 |
that the text is much sharper on the left video 01:03:05.260 |
Now, FAST plus SRCNN, so using FAST, is somewhere in between. 01:03:13.900 |
but it's just as efficient in terms of processing speed 01:03:24.140 |
that if you want to accelerate DNNs for a given process, 01:03:27.660 |
it's good to look beyond the hardware for the acceleration. 01:03:31.020 |
We can look at things like the structure of the data that's 01:03:39.740 |
that allows you to further accelerate the processing. 01:03:52.900 |
But I think there's a lot of exciting problems 01:03:54.860 |
outside the space of neural nets as well, which also 01:04:01.940 |
visual inertial localization or visual odometry. 01:04:05.580 |
This is something that's widely used for robots 01:04:07.900 |
to kind of figure out where they are in the real world. 01:04:10.220 |
So you can imagine for autonomous navigation, 01:04:13.660 |
have to know where you actually are in the world. 01:04:16.780 |
This is also widely used for things like AR and VR 01:04:19.140 |
as well, right, because you can know where you're actually 01:04:24.540 |
It means that you can basically take in a sequence of images. 01:04:27.740 |
So you can imagine like a camera that's mounted on the robot 01:04:33.140 |
So it has accelerometer and gyroscope information. 01:04:47.420 |
trying to estimate where you are in the 3D space. 01:04:50.220 |
And the pose based on, in this case, the camera feed. 01:04:52.860 |
But you can also measure IMU information there as well. 01:04:57.380 |
you could also generate a map of that environment. 01:04:59.540 |
So one of these is a very key task in navigation. 01:05:03.340 |
And the key thing is, can you do it in an energy efficient way? 01:05:06.380 |
So we've looked at building specialized hardware 01:05:13.160 |
performs complete visual inertial odometry on chip. 01:05:17.420 |
This is done in collaboration with Sertac Karaman. 01:05:23.700 |
You can see that it's smaller than a quarter. 01:05:26.180 |
And you can imagine mounting it on a small robot. 01:05:36.980 |
It also processes-- it does pre-integration on the IMU. 01:05:40.700 |
And then on the back end, it fuses this information 01:05:46.460 |
And so when you compare this particular design, 01:05:52.740 |
talking about two to three orders of magnitude 01:05:55.660 |
reduction in energy consumption because you have 01:05:59.700 |
So what is the key component of this chip that 01:06:09.060 |
of data that needs to be moved on and off chip. 01:06:11.380 |
So all of the processing is located on the chip itself. 01:06:17.020 |
to reduce the size of the chip and the size of the memories, 01:06:19.560 |
we do things like apply low-cost compression on the frames 01:06:26.420 |
and exploit sparsity, meaning the number of zeros, in the factor graph itself. 01:06:28.820 |
So all of the compression and exploiting sparsity 01:06:36.260 |
And that allows us to achieve this really low power 01:06:43.700 |
Another thing that really matters for autonomous 01:06:49.540 |
So this is kind of a planning and mapping problem. 01:06:52.100 |
And so in the context of things like robot exploration, 01:06:54.580 |
where you want to basically explore an unknown area, 01:06:57.580 |
you can do this by doing what we call computing 01:07:09.660 |
So you can imagine what's shown here is like an occupancy map. 01:07:21.460 |
And then the black lines are occupied things, 01:07:25.380 |
And the question is, if I know that this is my current 01:07:27.620 |
occupancy map, where should I go and scan, let's say, 01:07:30.460 |
with a depth sensor to figure out more information 01:07:37.780 |
what we call the mutual information of the map itself 01:07:42.260 |
And then you go to the location with the most information, 01:07:44.660 |
and you scan it, and then you get an updated map. 01:08:01.020 |
of the yellow areas that has the most information. 01:08:03.300 |
So you can see that it's going to try and back up and come 01:08:20.540 |
is, again, the computation, in particular, the data movement. 01:08:25.820 |
you're going to do a 3D scanning with your LiDAR 01:08:32.900 |
You can imagine each of these beams with your LiDAR scan 01:08:38.980 |
So parallelism, again, here, just like the deep learning 01:08:47.540 |
So what happens is that you're actually storing 01:08:54.220 |
are going to try and process the scans on this occupancy map. 01:08:59.940 |
for these types of memories, you're limited to two cores. 01:09:02.360 |
But if you want to have n cores, 16 cores, 30 cores, 01:09:12.300 |
If we take a closer look at the memory access pattern, 01:09:20.500 |
would use to read each of the locations on the map itself. 01:09:25.500 |
And you can see it's kind of a diagonal pattern. 01:09:27.500 |
So the question is, can I break this map into smaller memories 01:09:33.380 |
and then access these smaller memories in parallel? 01:09:35.460 |
And the question is, if I can break it into smaller memories, 01:09:46.340 |
indicate different memories or different banks of the memory. 01:09:52.380 |
as the cycle with which each location is accessed, 01:09:55.740 |
what you'll notice is that for any given color, at most, 01:10:01.680 |
that I'm only going to access two pieces of the location 01:10:07.560 |
So I can process all of these beams in parallel. 01:10:13.020 |
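(Illustrative aside, not from the talk: the banking idea can be sketched as below. The map is split across several small memories with a diagonal rule like (x + y) mod B, chosen here purely for illustration, so that the cells touched by different beams in the same cycle mostly land in different banks and can be read in parallel.)

```python
def bank_of(x, y, num_banks):
    """Diagonal banking rule (illustrative): neighboring cells along a beam
    tend to fall into different banks."""
    return (x + y) % num_banks

def banks_touched_per_step(beams, num_banks):
    """beams: list of beams, each a list of (x, y) map cells, one cell per time step.
    Returns, for each step, how many accesses collide on the busiest bank
    (1 means every beam can read its cell in parallel that cycle)."""
    worst = []
    for step in range(len(beams[0])):
        counts = {}
        for beam in beams:
            b = bank_of(*beam[step], num_banks)
            counts[b] = counts.get(b, 0) + 1
        worst.append(max(counts.values()))
    return worst

# Toy example: 8 beams marching diagonally across the map, 16 banks.
beams = [[(i + t, t) for t in range(10)] for i in range(8)]
print(banks_touched_per_step(beams, num_banks=16))  # mostly 1s => beams processed in parallel
```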
you to compute the mutual information of the entire map. 01:10:19.060 |
let's say 200 meters by 200 meters at 0.1 meter resolution 01:10:25.620 |
where you can only compute the mutual information 01:10:27.860 |
of a subset of locations and just try and pick the best one. 01:10:32.780 |
So you can know the absolute best location to go to get 01:10:44.960 |
of how data movement is really critical in order 01:10:47.740 |
to allow you to process things very, very quickly 01:10:50.060 |
and how having specialized hardware can enable that. 01:11:00.820 |
where you can apply efficient processing that can help 01:11:05.340 |
So in particular, looking at monitoring neurodegenerative 01:11:09.980 |
So we know things like dementia, so things like Alzheimer's, 01:11:12.700 |
Parkinson's, affect tens of millions of people 01:11:22.900 |
But one of the challenges is that the neurological 01:11:25.020 |
assessments for these diseases can be very time consuming 01:11:31.220 |
from one of these diseases or you might have this disease, 01:11:34.180 |
what you need to do is you need to go see a specialist. 01:11:39.220 |
They'll do a mini mental exam, like what year is it? 01:11:51.020 |
to do these type of things can be costly and time consuming. 01:11:55.540 |
So as a result, the data that's collected is very sparse. 01:12:01.700 |
they might come up with a different assessment. 01:12:09.900 |
been shown in literature that there's actually 01:12:12.100 |
a quantitative way of measuring or quantitatively evaluating 01:12:16.860 |
these types of diseases, potentially using eye movements. 01:12:20.660 |
So eye movements can be used as a quantitative way 01:12:25.260 |
of evaluating the progression or regression of these particular types of diseases. 01:12:29.020 |
if you're taking a certain drug, is your disease 01:12:32.300 |
And this eye movement can give a quantitative evaluation 01:12:35.300 |
But the challenge is that to do these eye movement evaluations, 01:12:44.500 |
Often, you need to have substantial head support 01:12:57.100 |
we call saccade latency or eye movement latency or eye 01:12:59.460 |
reaction time, they're done in very constrained environments. 01:13:05.340 |
And they use very specialized and costly equipment. 01:13:08.420 |
So in the vein of enabling efficient computing 01:13:10.940 |
and bringing compute to various devices, our question is, 01:13:13.980 |
can we actually do these eye measurements on a phone 01:13:25.500 |
on a consumer grade camera like your phone or an iPad. 01:13:35.020 |
So shown here in the red are basically eye reaction times 01:13:38.980 |
that are measured on a subject on an iPhone 6, which 01:13:44.380 |
compared to a phantom camera shown here in blue. 01:13:46.380 |
You can see that the distributions of the reaction 01:13:51.780 |
Because it enables us to do low cost in-home measurements. 01:13:56.780 |
could do these measurements at home for many days, 01:14:02.620 |
And this can give the physician or the specialist 01:14:04.860 |
additional information to make the assessment as well. 01:14:08.380 |
But it gives a much more rich set of information 01:14:18.420 |
when we're talking about things like depth estimation using 01:14:23.540 |
Basically, what you're doing is you're sending a pulse 01:14:28.620 |
indicates the depth of whatever object you're trying to detect. 01:14:33.580 |
with time of flight sensors can be very expensive. 01:14:35.820 |
You're emitting a pulse, waiting for it to come back. 01:14:38.020 |
So talking about up to tens of watts of power. 01:14:42.860 |
The question is, can we also reduce the sensor power 01:14:46.860 |
So for example, can I reduce how often I turn on the depth sensor 01:14:51.420 |
and kind of recover the other information just using 01:14:56.020 |
So for example, typically, you have a pair of a depth sensor 01:15:00.940 |
If at time 0, I turn both of them on, and time 1 and 2, 01:15:05.400 |
I turn the depth sensor off, but I still keep my RGB camera on, 01:15:08.700 |
can I estimate the depth for at time 2 and time 3? 01:15:15.020 |
that the algorithms that you're running to estimate 01:15:17.180 |
the depth without turning on the depth sensor itself 01:15:24.700 |
on a Cortex A7, which is a super low-cost embedded processor. 01:15:29.780 |
And just to give you an idea of how it looks like, 01:15:31.860 |
so let's see, here's the left is the RGB image. 01:15:34.620 |
In the middle is the depth map or the ground truth. 01:15:39.820 |
And then on the right-hand side is the estimated depth map. 01:15:42.660 |
In this particular case, we're only turning on the sensor 01:15:49.460 |
And your mean relative error is only about 0.7%, 01:15:52.540 |
so the accuracy or quality is pretty aligned. 01:15:55.740 |
OK, so at a high level, what are the key takeaways 01:16:02.460 |
First is efficient computing is really important. 01:16:05.340 |
It can extend the reach of AI beyond the cloud itself 01:16:09.060 |
because it can reduce communication networking 01:16:11.060 |
costs, enable privacy, and provide low latency. 01:16:15.140 |
And so we can use AI for a wide range of applications, 01:16:17.580 |
ranging from things like robotics to health care. 01:16:20.420 |
And in order to achieve this energy efficient computing, 01:16:26.980 |
but specialized hardware plays an important role, but also 01:16:31.020 |
And this is going to be really key to enabling AI 01:16:36.340 |
OK, and we also covered a lot of points in the lecture, 01:16:39.700 |
so the slides are all available on our website. 01:16:43.540 |
Also, just because it's a deep learning seminar series, 01:16:49.560 |
want to learn more about efficient processing 01:16:52.100 |
So again, I want to point you first to this survey paper 01:16:54.940 |
that we've developed. This is with my collaborator Joel Emer. 01:16:57.800 |
It really kind of covers what are the different techniques 01:17:00.260 |
that people are looking at and give some insights 01:17:11.780 |
In fact, we also teach a course on this here at MIT, 6.825. 01:17:16.660 |
If you're interested in updates on all these types of materials, 01:17:19.460 |
I invite you to join the mailing list or the Twitter feed. 01:17:23.820 |
The other thing is if you're not an MIT student, 01:17:25.880 |
but you want to take a two-day course on this particular topic, 01:17:29.940 |
I also invite you to take a look at the MIT Professional 01:17:34.860 |
So we run short courses on MIT campus over the summer. 01:17:39.800 |
can talk about the various different approaches 01:17:41.260 |
that people use to build efficient deep learning 01:17:44.900 |
And then finally, if you're interested in just video 01:17:49.420 |
I actually, at the end of November during NeurIPS, 01:17:52.180 |
I gave a 90-minute tutorial that goes really in-depth in terms 01:17:55.580 |
of how to build efficient deep learning systems. 01:17:59.860 |
And we also have some talks at the Mars Conference 01:18:03.540 |
And we have a YouTube channel where this is all located. 01:18:07.140 |
And then finally, I'd be remiss if I didn't acknowledge 01:18:09.900 |
a lot of the work here is done by the students, so 01:18:12.700 |
all the students in our group, as well as my collaborators, 01:18:14.940 |
Joel Emer, Sertac Karaman, and Thomas Heldt,