Quantization explained with PyTorch - Post-Training Quantization, Quantization-Aware Training
Chapters
0:00 Introduction
1:10 What is quantization?
3:42 Integer representation
7:25 Floating-point representation
9:16 Quantization (details)
13:50 Asymmetric vs Symmetric Quantization
15:38 Asymmetric Quantization
18:34 Symmetric Quantization
20:57 Asymmetric vs Symmetric Quantization (Python Code)
24:16 Dynamic Quantization & Calibration
27:57 Multiply-Accumulate Block
30:05 Range selection strategies
34:40 Quantization granularity
35:49 Post-Training Quantization
43:05 Quantization-Aware Training
00:00:00.000 |
Hello guys, welcome back to my channel. Today we are gonna talk about quantization. 00:00:04.000 |
Let's review the topics of today. I will start by showing what is quantization 00:00:09.000 |
and why we need quantization and later we will briefly introduce what are the 00:00:13.400 |
numerical representation for integers and floating-point numbers in our 00:00:16.560 |
hardware, so in CPUs and GPUs. I will show you later what is quantization at the 00:00:21.920 |
neural network level by giving you some examples and later we will go into the 00:00:27.000 |
detail of the types of quantization, so the asymmetric and the symmetric 00:00:30.120 |
quantization, what we mean by the range and the granularity and later we will 00:00:34.400 |
see also post-training quantization and quantization-aware training. For all of 00:00:38.540 |
these topics I will also show you the PyTorch and Python code 00:00:42.560 |
on how to do it from scratch. So actually we will build the asymmetric 00:00:46.840 |
quantization and the symmetric quantization from scratch using PyTorch 00:00:50.000 |
and then later we will also apply it to a sample neural network using 00:00:53.600 |
post-training quantization and quantization-aware training. What do I 00:00:57.380 |
expect you guys to already know before watching this video is basically you 00:01:01.320 |
have some basic understanding of neural networks and then you have some 00:01:05.740 |
background in mathematics, just high school mathematics is enough. So let's 00:01:11.060 |
start our journey. Let's see what is quantization first of all. So quantization 00:01:15.680 |
aims to solve a problem. The problem is that most modern deep neural networks 00:01:19.720 |
are made up of millions if not billions of parameters. For example the smallest 00:01:24.480 |
Llama 2 has 7 billion parameters. Now, if every parameter is 32 bits (4 bytes), then we 00:01:30.600 |
need 28 gigabytes just to store the parameters on disk. Also, when we 00:01:35.900 |
run inference with the model, we need to load all the parameters of the model into 00:01:39.320 |
memory. If we are using the CPU for example for inference then we need to 00:01:42.660 |
load it into RAM, but if we are using the GPU we need to load it into the memory 00:01:46.240 |
of the GPU. Of course, big models cannot easily be loaded into RAM 00:01:52.640 |
or GPU memory if we are using a standard PC or a small device like a 00:01:56.480 |
smartphone. And also, just like humans, computers are slower at computing 00:02:03.080 |
floating-point operations compared to integer operations. For example if you 00:02:07.320 |
try to do mentally 3 multiplied by 6 and also mentally 1.21 multiplied by 2.897 00:02:14.560 |
of course you are able to do much faster the 3 by 6 multiplication and the same 00:02:19.360 |
goes on with computers. So the solution is quantization. Quantization basically 00:02:25.000 |
aims to reduce the number of bits required to represent each 00:02:30.280 |
parameter, usually by converting the floating-point numbers into 00:02:34.800 |
integers. This way, for example, a model that normally occupies many gigabytes 00:02:39.320 |
can be compressed to a much smaller size. Also please note that 00:02:44.840 |
quantization doesn't mean that we just round up or round down all the 00:02:48.600 |
floating-point numbers to the nearest integer, this is not what quantization 00:02:52.300 |
does. We will see later how it works so please don't be confused. And the 00:02:57.120 |
quantization can also speed up computation because working with 00:03:00.520 |
smaller data types is faster: for example, the computer is much faster at 00:03:04.720 |
multiplying two matrices made up of integers than two matrices made up of 00:03:09.360 |
floating-point numbers. And later we will see actually how this matrix 00:03:13.720 |
multiplication works at the GPU level also. So what is the advantage of 00:03:18.720 |
quantization? First of all we have less memory consumption when loading models 00:03:22.560 |
so the model can be compressed into a much smaller size and we have less 00:03:27.720 |
inference time because of using simpler data types so for example integers 00:03:32.440 |
instead of floating-point numbers. And these two combined lead to less 00:03:36.720 |
energy consumption which is very important for like for example 00:03:40.000 |
smartphones. Okay now let's go review how numbers are represented in the 00:03:46.000 |
hardware so in the CPU level or in the GPU level. So computers use a 00:03:52.120 |
fixed number of bits to represent any piece of data. For example, to represent a 00:03:56.880 |
number, a character or a pixel color, we always use a fixed number of bits. A 00:04:00.800 |
bit string that is made up of n bits can represent up to 2 to the power of n 00:04:05.360 |
distinct numbers. For example with 3 bit we can represent 8 possible numbers 00:04:10.720 |
from 0 to 7 and for each number you can see its binary representation. We can 00:04:16.560 |
always convert the binary representation into the decimal representation by 00:04:20.960 |
multiplying each digit by 2 raised to the power of 00:04:27.960 |
the position of that digit inside the bit string. And in most CPUs, 00:04:34.240 |
integer numbers are represented using two's complement, 00:04:38.200 |
which means that the first bit of the number indicates the sign so 0 means 00:04:43.480 |
positive and 1 means negative while the rest of the bits indicate the absolute 00:04:49.000 |
value of the number in case it's positive or its complement in case it's 00:04:52.360 |
negative. The reason we use two's complement is that we want one unique 00:04:57.120 |
representation for the zero, so plus zero and minus zero have the same binary representation. 00:05:00.920 |
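(As a small aside, here is a quick sketch of this fixed-width, two's-complement behaviour using PyTorch's int8 type; the specific values are just an example.)

```python
import torch

# An 8-bit two's-complement integer covers the range [-128, 127]
info = torch.iinfo(torch.int8)
print(info.min, info.max)  # -128 127

# The first bit is the sign: 00000110 is +6, and its two's complement is -6
x = torch.tensor([6, -6], dtype=torch.int8)
print([f"{v & 0xFF:08b}" for v in x.tolist()])  # ['00000110', '11111010']
```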
But of course you may argue: okay, computers use a fixed 00:05:06.200 |
number of bits to represent numbers but how can Python handle such big numbers 00:05:11.800 |
without any problems? Like, when you run 2 to the power of 9999 in Python, you 00:05:17.200 |
will get a result which is much bigger than any 64-bit number and how can 00:05:23.080 |
Python handle these huge numbers without any problem? Well Python uses the so 00:05:27.760 |
called the big num arithmetic so as we saw before in this table the number 6 00:05:34.280 |
when it's represented in base 10 only needs one digit but when it's 00:05:38.520 |
represented in base 2 it needs three digits. So this is actually a rule: the 00:05:43.880 |
smaller the base the bigger the number of digits we need to represent the 00:05:47.880 |
number and Python does the inverse so it saves all these numbers as an 00:05:53.000 |
array of digits in which each digit is the digit of the number in base 2 to the 00:05:58.040 |
power of 30 so overall we need less digits to store very big numbers for 00:06:02.960 |
example if this number, which is the result of 2 to the power of 9999, is 00:06:07.600 |
represented as a decimal number we would need an array of 3,000 digits to store 00:06:12.720 |
it in memory while Python stores this number as an array of digits in base 2 00:06:19.240 |
to the power of 30 so it only needs 334 elements in which all the elements are 00:06:26.880 |
zero except the most significant one which is equal to 512 and as a matter of 00:06:31.880 |
fact you can check by yourself that by doing 512 multiplied by the base so 2 to 00:06:38.360 |
the power of 30 then to the power of the position of this digit in the array we 00:06:44.480 |
will obtain the number 2 to the power of 9999. I also want you to notice that 00:06:49.760 |
this is something that is implemented by CPython which is the Python 00:06:53.760 |
interpreter not by the CPU so it's not the CPU that is doing this big num 00:06:58.400 |
arithmetic for us it's the Python interpreter for example when you compile 00:07:02.440 |
C++ code the code will run directly on the hardware on the CPU which means also 00:07:08.080 |
that the C++ code is compiled for the specific hardware it will run 00:07:13.320 |
on, while Python code we never compile ahead of time, because CPython takes care of 00:07:17.760 |
translating our Python instructions into bytecode and executing it 00:07:22.920 |
at runtime. Okay, let's review how floating-point numbers are 00:07:28.560 |
represented now decimal numbers are just numbers that also include the negative 00:07:34.240 |
powers of the base for example the number 85.612 can be written as each 00:07:39.640 |
number multiplied so each digit multiplied by a power of the base which 00:07:44.400 |
is 10 but the decimal part have negative powers of 10 as you can see 10 to the 00:07:49.880 |
power of minus 1, minus 2 and minus 3, and this same reasoning is used in the 00:07:54.800 |
standard IEEE 754 which defines the representation of floating-point 00:08:00.320 |
numbers in 32 bits. Basically we divide the 32-bit string into three parts: the 00:08:06.200 |
first bit indicates the sign, so 0 means positive, the next 8 bits indicate the 00:08:11.920 |
exponent which also indicates the magnitude of the number so how big is 00:08:15.560 |
the number and the last 23 bits indicate the fractional part of the number so all 00:08:21.520 |
the digits corresponding to the negative powers of 2 to convert this bit string 00:08:27.880 |
into a decimal value we just need to do this: (-1) to the power of the sign, multiplied by 2 00:08:32.660 |
to the power of (exponent minus 127), multiplied by the fraction, which is 1 plus all 00:08:39.240 |
the negative powers of 2, and this should correspond to the number 0.15625. 00:08:47.680 |
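(As a quick check of this layout, here is a small sketch using Python's standard struct module to print the sign, exponent and fraction bits of 0.15625; the helper name is just made up for the example.)

```python
import struct

def float32_bits(x: float) -> str:
    # Re-interpret the IEEE 754 32-bit encoding of x as an unsigned integer
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = f"{bits:032b}"
    return f"sign={s[0]} exponent={s[1:9]} fraction={s[9:]}"

print(float32_bits(0.15625))
# 0.15625 = 1.01 (binary) x 2^-3, so the stored exponent is 124 = -3 + 127
```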
most modern GPUs also support a 16-bit floating-point number but of course this 00:08:54.360 |
results in less precision, because we have fewer bits dedicated to the 00:08:58.160 |
fractional part and fewer bits dedicated to the exponent, and of course they are 00:09:03.140 |
smaller, which means that they can represent floating-point 00:09:07.800 |
numbers with less precision, so we cannot have too many digits after the 00:09:12.160 |
decimal point, for example. Okay, let's go inside the details of quantization now. First of 00:09:19.920 |
all we review how neural networks work so we start with an input which could be 00:09:24.180 |
a tensor and we give it to a layer which could be a linear layer for example 00:09:28.200 |
which then maps to another linear layer and finally we have an output we have 00:09:33.780 |
usually a target we compare the output and the target through a loss function 00:09:38.440 |
and we calculate the gradient of the loss function with respect to each 00:09:42.360 |
parameter and we run back propagation to update these parameters the neural 00:09:48.960 |
network can be made up of many different layers for example a linear layer is 00:09:53.340 |
made up of two matrices one is called the weight and one is called the bias 00:09:57.320 |
which are commonly represented using floating-point numbers quantization 00:10:02.120 |
aims to use integer numbers to represent these two matrices while maintaining the 00:10:06.960 |
accuracy of the model let's see how so this linear layer for example the first 00:10:12.180 |
linear layer of this neural network represents an operation which is the 00:10:16.080 |
input multiplied by a weight matrix which are the parameters of this linear 00:10:22.200 |
layer, plus a bias, which is also a parameter of this linear layer. 00:10:28.520 |
The goal of quantization is to quantize the input, the weight matrix and 00:10:34.280 |
the bias matrix into integers such that we perform all these operations here as 00:10:39.720 |
integer operations because they are much faster compared to floating-point 00:10:44.080 |
operations we take then the output we dequantize it and we feed it to the next 00:10:49.640 |
layer and we dequantize in such a way that the next layer should not even 00:10:54.840 |
realize that there have been a quantization in the previous layer so we 00:10:58.360 |
want to do quantization in such a way that the model's output should not 00:11:02.640 |
change because of quantization so we want to keep the model's performance the 00:11:07.600 |
accuracy of the model but we want to perform all these operations using 00:11:11.840 |
integers so we need to find a mapping between floating-point numbers and 00:11:16.440 |
integers and a reversible mapping of course so we can go from floating-point 00:11:21.000 |
to integers and from integers to floating-point but in such a way that we 00:11:25.640 |
don't lose the precision of the model but at the same time we want to optimize 00:11:31.200 |
the space occupation of the model inside the RAM and on the disk and we want to 00:11:36.640 |
make it faster to compute these operations because as we saw before 00:11:39.880 |
computing integer operations is much faster than computing floating-point 00:11:44.600 |
operations the main benefit is that the integer operations is much faster in 00:11:50.160 |
most hardware than floating-point operations. Plus, on some embedded 00:11:55.040 |
hardware, especially very small embedded devices, we don't even have 00:11:59.000 |
floating-point numbers so we are forced to use integer operations in those 00:12:03.880 |
devices okay let's see how it works so this hidden layer here for example may 00:12:09.240 |
have a weight matrix which could be a 5 by 5 matrix that we can see here the 00:12:14.080 |
goal of quantization is to reduce the precision of each number that we see in 00:12:20.760 |
this matrix by mapping it into a range that occupies less bits so this is a 00:12:25.960 |
floating-point number and occupies 4 bytes so 32 bits we want to 00:12:32.720 |
quantize using only 8 bits so each number should be represented only using 00:12:37.400 |
8 bit now with 8 bit we can represent the range from -128 to +127 but 00:12:45.240 |
usually we sacrifice the -128 to obtain a symmetric range so we map each 00:12:51.200 |
number into its 8 bit representation in such a way that we can then map back to 00:12:56.600 |
the original array in an operation that is first called quantization and the 00:13:00.640 |
second is called dequantization. Now, after dequantization we should 00:13:04.840 |
obtain the original array, the original tensor or matrix, but we usually lose 00:13:11.000 |
some precision so for example if you look at the first value it's exactly the 00:13:15.040 |
same as the original matrix but the second value here is similar but not 00:13:19.840 |
exactly the same and this is to say that with quantization we introduce some 00:13:25.140 |
error, so the model will not be as accurate as the non-quantized model, but 00:13:31.720 |
we want to design the quantization process in such a way that we lose the 00:13:36.000 |
least accuracy possible so we don't want to lose precision so we want to minimize 00:13:40.720 |
this error that we introduce okay let's go into the details of quantization now 00:13:47.240 |
so by reviewing the types of quantization we have available first of 00:13:52.120 |
all I will show you the difference between asymmetric and symmetric 00:13:54.760 |
quantization so imagine we have a tensor which is made up of 10 values that you 00:13:59.760 |
can see here the goal of asymmetric quantization is to map the original 00:14:04.640 |
tensor, which is distributed between this range, so minus 44.93, which is the 00:14:10.880 |
smallest number in this tensor and 43.31 which is the 00:14:15.360 |
biggest number in this tensor we want to map it into another range that is made 00:14:20.320 |
up of integers that are between 0 and 255 which are the integers that we can 00:14:26.200 |
represent using 8-bit for example and if we do this operation we will obtain a 00:14:31.640 |
new tensor that will map for example this first number into 255 this number 00:14:36.560 |
here into 0 this number here into 130 etc the other type of quantization is 00:14:43.720 |
the symmetric quantization which aims to map a symmetric range so we take this 00:14:51.040 |
tensor and we treat it as a symmetric range, even if it's not symmetric, 00:14:57.400 |
because as you can see the biggest value here is 43.31 and the smallest value is 00:15:03.000 |
minus 44.93, so they are not symmetric with respect to zero, but we treat them as if they were. 00:15:08.960 |
We then map this symmetric input 00:15:13.400 |
range into another symmetric range, also using 8 bits in our case. 00:15:18.960 |
This gives you the advantage that the zero is always mapped 00:15:22.800 |
into the zero in the quantized numbers. I will show you later how we actually do 00:15:28.840 |
this computation so how do we compute the quantized version using the original 00:15:34.680 |
tensor, and also how to dequantize back. So let's start with asymmetric 00:15:40.560 |
quantization imagine we have an original tensor that is like this so these 10 00:15:46.600 |
items we can see here we quantize using the following formula so the quantized 00:15:52.660 |
version of each of these numbers is equal to the floating point number so 00:15:56.680 |
the original floating point number divided by a parameter called S which 00:16:00.640 |
stands for scale we round down or round up to the nearest integer plus a 00:16:08.080 |
number Z and if the result of this operation is smaller than zero then we 00:16:15.560 |
clamp it to zero, and if it's bigger than 2^n - 1 (which is 255 for 8 bits) then we 00:16:20.040 |
clamp it to 2^n - 1. What is n? n is the number of bits that 00:16:25.600 |
we want to use for quantization so we want to quantize for example all these 00:16:29.840 |
floating point numbers into 8 bits so we will choose n equal to 8 how to 00:16:35.080 |
calculate this S parameter the S parameter is given by alpha minus beta 00:16:39.920 |
divided by 2^n - 1, so basically the width of the 00:16:44.680 |
output range. What are beta and alpha? They are the biggest number in 00:16:50.640 |
the original tensor and the smallest number in the original tensor so we take 00:16:54.600 |
basically the range of the original tensor and we squeeze it into the output 00:17:00.040 |
range by means of this scale parameter and then we center it using the Z 00:17:05.200 |
parameter this Z parameter is computed as minus 1 multiplied by beta divided by 00:17:12.160 |
S and then rounded to the nearest integer so the Z parameter is an integer 00:17:17.120 |
while the scale parameter is not an integer it is a floating point 00:17:21.440 |
number if we do this operation so we take each floating point and we run it 00:17:26.320 |
through this formula we will obtain this quantized vector what we can see first 00:17:33.800 |
of all, the biggest number, using asymmetric quantization, is always mapped 00:17:37.040 |
to the biggest number in the output range and the smallest number is always 00:17:40.520 |
mapped to the zero in the output range the zero number in the original vector 00:17:45.760 |
is mapped into the Z parameter so this 130 is actually the Z parameter if you 00:17:50.320 |
compute it and all the other numbers are mapped into something that is in between 00:17:55.160 |
0 and 255 we can then dequantize using the following formula so to 00:18:02.040 |
dequantize, to obtain the floating point number back, we just need to 00:18:05.360 |
multiply the scale by (the quantized number minus Z) and we should 00:18:11.720 |
obtain the original tensor but you should see that the numbers are similar 00:18:17.320 |
but not exactly the same because the quantization introduces some error 00:18:21.600 |
because we are trying to squeeze a range that could be very big because with 32 00:18:26.400 |
bit we can represent a very big range into a range that is much smaller with 00:18:30.880 |
8 bits, so of course we will introduce some error. 00:18:36.400 |
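(A minimal PyTorch sketch of the asymmetric quantize/dequantize formulas just described; the sample values and the choice of 8 bits are only an example.)

```python
import torch

def asymmetric_quantize(x: torch.Tensor, bits: int = 8):
    q_max = 2 ** bits - 1                    # output range is [0, 2^n - 1], e.g. [0, 255]
    alpha, beta = x.max(), x.min()           # biggest and smallest value in the tensor
    scale = (alpha - beta) / q_max           # S parameter
    zero = int(torch.round(-beta / scale))   # Z parameter (an integer)
    q = torch.clamp(torch.round(x / scale) + zero, 0, q_max).to(torch.uint8)
    return q, scale.item(), zero

def asymmetric_dequantize(q: torch.Tensor, scale: float, zero: int) -> torch.Tensor:
    return scale * (q.to(torch.float32) - zero)

x = torch.tensor([43.31, -44.93, 0.0, 12.7, -8.4])   # example values
q, s, z = asymmetric_quantize(x)
x_hat = asymmetric_dequantize(q, s, z)                # close to x, but not identical
```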
Let's see symmetric quantization now. Symmetric quantization, as we saw before, aims to transform a 00:18:41.280 |
symmetric input range into a symmetric output range so imagine we still have 00:18:45.800 |
this tensor what we do we compute the quantized values as follows so each 00:18:51.440 |
number the floating point number divided by a parameter S so the scale and 00:18:56.360 |
clamped between the two limits -(2^(n-1) - 1) and 2^(n-1) - 1, where n is the number 00:19:01.000 |
of bits that we want to use for quantizing, and the S parameter is 00:19:05.400 |
calculated as the absolute value of alpha divided by 2^(n-1) - 1, where alpha is the biggest number 00:19:10.160 |
here in absolute terms; in this case it's the number minus 44.93 because in 00:19:16.060 |
absolute terms is the biggest value and we can then quantize this tensor and we 00:19:23.480 |
should obtain something like this. We should notice that the zero in this 00:19:27.840 |
case is mapped into the zero which is very useful we can then dequantize using 00:19:34.040 |
the formula we can see here so to obtain the floating point number we take the 00:19:38.000 |
quantized number multiplied by the scale parameter so the S parameter and we 00:19:42.400 |
should obtain the original vector but of course we will lose some precision so we 00:19:47.880 |
lose some: as you can see, the original number was 43.31 and the 00:19:52.280 |
dequantized number is 43.16, so we lost some precision, but our 00:19:57.920 |
goal of course is to have it as similar as possible to the original array. 00:20:02.440 |
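(And a matching sketch for symmetric quantization, under the same assumptions.)

```python
import torch

def symmetric_quantize(x: torch.Tensor, bits: int = 8):
    q_max = 2 ** (bits - 1) - 1              # output range is [-(2^(n-1)-1), 2^(n-1)-1], e.g. [-127, 127]
    alpha = x.abs().max()                    # biggest value in absolute terms
    scale = alpha / q_max                    # S parameter; there is no zero point
    q = torch.clamp(torch.round(x / scale), -q_max, q_max).to(torch.int8)
    return q, scale.item()

def symmetric_dequantize(q: torch.Tensor, scale: float) -> torch.Tensor:
    return scale * q.to(torch.float32)

x = torch.tensor([43.31, -44.93, 0.0, 12.7, -8.4])   # example values
q, s = symmetric_quantize(x)
x_hat = symmetric_dequantize(q, s)                    # 0.0 maps back exactly to 0.0
```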
The easiest way to reduce this error, of course, would be to just increase the number of 00:20:07.640 |
bits of the quantization but of course we cannot just choose any number of 00:20:13.680 |
bits, because as we saw before we want the matrix multiplication in 00:20:18.120 |
the linear layer to be accelerated by the CPU, and the CPU always works with 00:20:22.960 |
a fixed number of bits, and the operations inside the CPU are 00:20:26.360 |
optimized for a fixed number of bits, so for example we have optimizations for 8 00:20:30.680 |
bits, 16 bits, 32 bits and 64 bits. But of course if we choose 11 bits for 00:20:36.440 |
quantization the CPU may not support the acceleration of operations using 11 bits 00:20:42.440 |
so we have to be careful to choose a good compromise between the number of 00:20:46.600 |
bits and also the availability of the hardware later we will also see how the 00:20:52.240 |
GPU computes the matrix multiplication in the accelerated form okay I have 00:20:59.280 |
shown you the symmetric and the asymmetric quantization now it's time to 00:21:02.720 |
actually look at the code on how it is implemented in reality let's have a look 00:21:07.160 |
okay I created a very simple notebook in which basically I generated 20 random 00:21:14.560 |
numbers between -50 and 150 I modified these numbers in such a way that the 00:21:20.440 |
first number is the biggest one and the second number is the smallest one and 00:21:24.480 |
then the third is a zero so we can check the effect of the quantization on the 00:21:28.080 |
biggest number on the smallest number and on the zero suppose this is the 00:21:32.440 |
original numbers so this array of 20 numbers we define the functions that 00:21:37.520 |
will quantize this vector so asymmetric quantization basically it will 00:21:43.200 |
compute the alpha as the maximum value the beta as the minimum value it will 00:21:47.160 |
calculate the scale and the zero using the formula that we saw on the slide 00:21:50.800 |
before and then it will quantize using the same formula that we saw before and 00:21:55.080 |
the same goes for the symmetric quantization we calculate the alpha the 00:21:59.120 |
scale parameter the upper bound and the lower bound for clamping and we can 00:22:04.000 |
also dequantize using the same formula that we saw on the slide so in the case 00:22:07.840 |
of asymmetric is this one with the zero and in the case of symmetric we don't 00:22:11.520 |
have the zero because the zero is always mapped into the zero we can also 00:22:16.080 |
calculate the quantization error by comparing the original values and the 00:22:19.840 |
dequantized values by using the mean squared error so let's try to see what 00:22:24.960 |
is the effect on quantization so this is our original array of floating point 00:22:29.040 |
numbers if we quantize it using asymmetric quantization we will obtain 00:22:32.880 |
this array here in which we can see that the biggest value is mapped into 255 00:22:39.360 |
which is the biggest value of the output range the smallest value is mapped into 00:22:43.960 |
the zero and the zero is mapped into the Z parameter which is a 61 and as you 00:22:49.000 |
can see the zero is mapped into the 61 while with the symmetric quantization we 00:22:55.320 |
have that the zero is mapped into the zero so the third element of the 00:22:59.080 |
original vector is mapped into the third element of the symmetric range and it's 00:23:02.600 |
the zero if we dequantize back the quantized parameters we will see that 00:23:08.760 |
they are similar to the original vector but not exactly the same as you can see 00:23:14.640 |
we lose a little bit of the precision and we can measure this precision using 00:23:18.640 |
the mean squared error for example and we can see that the error is much bigger 00:23:22.800 |
for the symmetric quantization why because the original vector is not 00:23:28.680 |
symmetric the original vector is between -50 and 150 so what we are doing with 00:23:35.120 |
symmetric quantization is that, let's see here, with 00:23:39.440 |
symmetric quantization we are basically taking the biggest value in absolute 00:23:44.960 |
terms as alpha, which means that the 00:23:50.320 |
symmetric quantization maps onto an output range that goes from -127 to +127, but all the 00:23:57.620 |
quantized values between -127 and about -40 never appear, because the original numbers only go down to about -50, so 00:24:03.840 |
a lot of the range will be unused, and that's why all the other 00:24:08.160 |
numbers will suffer from this bad distribution, let's say, and this is why 00:24:14.040 |
the symmetric quantization has a bigger error. Okay, let's review again how the 00:24:19.160 |
quantization will work in our case of the linear layer so if we never quantize 00:24:24.280 |
this network we will have a weight matrix a bias matrix the output of this 00:24:29.560 |
layer will be a weight multiplied by the input of this layer plus the bias and 00:24:34.760 |
the output will be another floating-point number so all of these 00:24:37.640 |
matrices are floating-point numbers but when we quantize we quantize the weight 00:24:43.560 |
matrix which is a fixed matrix because we pretend the network has already been 00:24:47.300 |
trained so the weight matrix is fixed and we can quantize it by calculating 00:24:52.320 |
the alpha and the beta that we saw before using the symmetric quantization 00:24:55.900 |
or the asymmetric quantization. The bias parameter can also be quantized because 00:25:00.960 |
it's a fixed vector and we can calculate the alpha and the beta of this vector 00:25:06.080 |
and we can quantize using 8 bits we want our goal is to perform all these 00:25:12.080 |
operations using integers so how can we quantize the X matrix because this is 00:25:16.880 |
the X matrix is an input which depends on the input the network receives one 00:25:22.480 |
way is called the dynamic quantization dynamic quantization means that for 00:25:26.720 |
every input we receive on the fly we calculate the alpha and the beta because 00:25:32.000 |
we have a vector so we can calculate the alpha and the beta and then we can 00:25:35.440 |
quantize it on the fly okay now we have quantized also the input matrix by using 00:25:41.020 |
for example dynamic quantization we can perform this matrix multiplication 00:25:45.240 |
which will become an integer matrix multiplication. The output will be 00:25:50.620 |
Y which is an integer matrix but this matrix here is not the original 00:25:57.280 |
floating-point number of the not quantized network it's a quantized value 00:26:02.640 |
how can we map it back to the original floating-point number well we need to do 00:26:09.340 |
a process called calibration. Calibration means that we take the 00:26:14.600 |
network we run some input through the network and we check what are the 00:26:19.220 |
typical values of Y by using these typical values of Y we can check what 00:26:24.880 |
could be a reasonable alpha and the reasonable beta for these values that we 00:26:29.680 |
observe of Y and then we can use the output of this integer matrix 00:26:35.040 |
multiplication and use the scale and the zero parameter that we have computed by 00:26:41.660 |
collecting statistics about this Y to dequantize this output matrix 00:26:48.520 |
here such that it's mapped back into a floating-point number such that the 00:26:53.880 |
network output doesn't differ too much from that of the non-quantized network, 00:26:59.780 |
so the goal of quantization is to reduce the number of bits required to represent 00:27:04.620 |
each parameter and also to speed up the computation but our goal is to obtain 00:27:09.720 |
the same output for the same input or at least to obtain a very similar output 00:27:14.600 |
for the same input so we don't want to lose the precision so we need to find a 00:27:19.120 |
way to of course map back into the floating-point numbers each output of 00:27:24.520 |
each linear layer and this is how we do it so the input matrix we can observe it 00:27:30.000 |
every time by using dynamic quantization so on the fly we can quantize it the 00:27:34.440 |
output we can observe it for a few samples so we know what are the typical 00:27:39.000 |
maximum and the minimum values such that we can then use them as alpha and 00:27:44.160 |
beta and then we can dequantize the output Y using these values that we have 00:27:49.520 |
observed. We will see this later in practice with post-training 00:27:52.880 |
quantization, where we will actually look at the code and see how it works. 00:27:58.680 |
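(For the dynamic flavour specifically, recent PyTorch versions expose a one-line API; a minimal sketch, where the model is just a stand-in, not the one from the video.)

```python
import torch
import torch.nn as nn

# A stand-in model: two linear layers, as in the earlier examples
model = nn.Sequential(nn.Linear(784, 100), nn.ReLU(), nn.Linear(100, 10))

# Dynamic quantization: weights are converted to int8 ahead of time,
# activations are quantized on the fly using their observed range
qmodel = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 784)
print(qmodel(x).shape)  # torch.Size([1, 10])
```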
I also want to give you a glimpse into how GPUs perform matrix multiplication. So when we 00:28:04.440 |
calculate the product X multiplied by W plus B which is a matrix multiplication 00:28:09.760 |
followed by a matrix addition the result is a list of dot products between each 00:28:15.240 |
row of the X matrix and each column of the W matrix, summing the corresponding 00:28:21.720 |
element of the bias vector B this operation so the matrix multiplication 00:28:27.140 |
plus bias can be accelerated by the GPU using a block called the multiply 00:28:32.600 |
accumulate in which for example imagine each matrix is made up of vectors of 00:28:37.480 |
four elements, so we load the first row of the X matrix 00:28:43.680 |
and then the first column of the W matrix and we compute the corresponding 00:28:50.080 |
products, so X11 with W11, then X12 with W21, X13 with W31, etc., and 00:28:57.320 |
then we sum all these values into a register called the accumulator. Now here, 00:29:03.080 |
the X value is an 8-bit integer and the W value is an 8-bit integer, because we quantized them, but the 00:29:10.720 |
result of a multiplication of two 8-bit integers may not fit in an 8-bit integer; it 00:29:15.920 |
can of course be 16 bits or more, and for this reason the accumulator here 00:29:22.720 |
is usually 32-bit, and this is also the reason we quantize the bias 00:29:29.440 |
vector here as 32-bit, because the accumulator is initialized 00:29:34.680 |
with the bias element. So the GPU will perform this operation in 00:29:39.840 |
parallel for every row and column of the initial matrices using many blocks 00:29:45.400 |
like this and this is how the GPU acceleration works for matrix 00:29:48.800 |
multiplication. If you are interested in how this happens at a low, 00:29:53.520 |
algorithmic level, I recommend reading the article from Google about their 00:29:58.680 |
general matrix multiplication library, a low-precision matrix multiplication library. 00:30:02.640 |
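(To make the accumulator point concrete, here is a tiny sketch of one row-times-column multiply-accumulate done with int8 inputs and a 32-bit accumulator; the values are arbitrary.)

```python
import torch

x_row = torch.tensor([120, -100, 87, 45], dtype=torch.int8)   # one row of quantized X
w_col = torch.tensor([110, 115, -90, 60], dtype=torch.int8)   # one column of quantized W
bias = torch.tensor(5000, dtype=torch.int32)                  # bias quantized to 32 bits

# Each int8 * int8 product can need up to 16 bits, and their sum even more,
# so the products are accumulated in 32 bits, starting from the bias element
acc = bias + (x_row.to(torch.int32) * w_col.to(torch.int32)).sum(dtype=torch.int32)
print(acc)  # a 32-bit integer, which would then be dequantized before the next layer
```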
Okay, now that we have seen the difference between 00:30:09.560 |
symmetric and asymmetric quantization we may also want to understand how do we 00:30:14.840 |
choose the beta and the alpha parameter we saw before one way of course is to 00:30:19.000 |
choose, in the case of asymmetric quantization, beta and alpha to 00:30:22.460 |
be the smallest and the biggest value, and in the case of symmetric 00:30:26.560 |
quantization to choose alpha as the biggest value in absolute terms but this 00:30:30.860 |
is not the only strategy and they have pros and cons so let's review all the 00:30:34.360 |
strategies we have. The strategy that we used before is called the minimax 00:30:38.680 |
strategy which means that we choose alpha as the biggest value in the 00:30:41.720 |
original tensor and beta as the minimum value in the original tensor this 00:30:46.520 |
however is sensitive to outliers, because imagine we have a vector that is 00:30:51.320 |
more or less distributed around the minus 50 and plus 50 but then we have an 00:30:56.680 |
outlier that is a very big number here the problem with this strategy is that 00:31:01.280 |
the outlier will make the quantization error of all the 00:31:07.840 |
numbers very big so all the numbers as you can see when we quantize and then 00:31:12.760 |
dequantize using asymmetric quantization with minimax strategy we see that all 00:31:18.320 |
the numbers are not very similar to the original they are actually quite 00:31:21.840 |
different: this one is 43.31, this one is 45.08, so it's actually quite a big 00:31:26.920 |
error for the quantization a better strategy to avoid the outliers ruining 00:31:32.720 |
the original the input range is to use the percentile strategy so we set the 00:31:37.960 |
range alpha and beta basically to be a percentile of the original distribution 00:31:42.760 |
so not the maximum or the minimum but using, for example, the 99th 00:31:47.080 |
percentile and if we use the percentile we will see that the quantization error 00:31:53.440 |
is reduced for all the terms and the only term that will suffer a lot from 00:31:57.480 |
the quantization error is the outlier itself okay let's have a look at the 00:32:02.560 |
code to see how this minimax strategy and percentile strategy differ so we 00:32:08.480 |
open this one in which we again have a lot of numbers so 10,000 numbers 00:32:15.820 |
distributed between -50 and 150 and then we introduce an outlier let's say the 00:32:20.640 |
last number is an outlier so it's equal to 1000 all the other numbers are 00:32:25.040 |
distributed between -50 and 150 we compare these two strategies so the 00:32:31.300 |
asymmetric quantization using the minimax strategy and the asymmetric 00:32:34.960 |
quantization using the percentile strategy as you can see the only 00:32:37.960 |
modification between these two methods is how we compute alpha and beta here 00:32:42.280 |
alpha is computed as the maximum value here alpha is computed as a percentile 00:32:45.920 |
of 99.99, and we can compare the quantized values that we can see 00:32:55.320 |
here and then we can dequantize and when we dequantize we will see that the 00:33:01.280 |
all the values using the minimax strategy suffer from a big quantization 00:33:06.640 |
error while when we use the percentile we will see that the only value that 00:33:10.040 |
suffers from a big quantization error is the outlier itself and as we can see if 00:33:15.880 |
we exclude the outlier and we compute the quantization error on the other 00:33:19.440 |
terms we will see that with the percentile we have a much smaller error 00:33:24.120 |
while with the minimax strategy we have a very big error for all the numbers except the outlier. 00:33:28.440 |
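(A sketch of how the two choices of alpha and beta could look in code; the 10,000 uniform values, the single outlier and the 99.99 percentile mirror the example in the notebook, but the code itself is only an illustration.)

```python
import torch

def range_minmax(x: torch.Tensor):
    return x.max(), x.min()                          # alpha, beta

def range_percentile(x: torch.Tensor, pct: float = 99.99):
    # Clip the range at a percentile so a single outlier cannot blow up the scale
    return torch.quantile(x, pct / 100), torch.quantile(x, 1 - pct / 100)

x = torch.cat([torch.empty(10_000).uniform_(-50, 150), torch.tensor([1000.0])])  # one outlier
for alpha, beta in (range_minmax(x), range_percentile(x)):
    scale = (alpha - beta) / 255                     # 8-bit asymmetric scale
    print(f"alpha={alpha.item():.2f} beta={beta.item():.2f} scale={scale.item():.4f}")
```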
Two other strategies that are commonly used for choosing 00:33:33.480 |
alpha and beta are the mean squared error and the cross entropy. Mean squared 00:33:37.560 |
error means that we choose alpha and beta such that the mean squared error 00:33:41.120 |
between the original values and the quantized values is minimized so we 00:33:46.160 |
usually use a grid search for this and the cross entropy is used as a strategy 00:33:51.760 |
whenever we are dealing for example with a language model as you know in the 00:33:55.640 |
language model we have the last layer which is a linear layer plus softmax 00:33:59.240 |
which allows us to choose a token from the vocabulary. The goal of this 00:34:05.040 |
softmax layer is to create a probability distribution, 00:34:08.680 |
on which usually we apply the greedy strategy or the top-p strategy, so 00:34:12.560 |
what we are concerned about are not the values inside this distribution but 00:34:16.920 |
actually the distribution itself: the biggest number should remain the 00:34:21.400 |
biggest number also in the quantized values and the intermediate numbers 00:34:24.760 |
should not change the relative distribution and for this case we use 00:34:28.720 |
the cross entropy strategy which means that we choose alpha and beta such that 00:34:32.680 |
the cross entropy between the original, not 00:34:37.240 |
quantized, values and the dequantized values is minimized, 00:34:41.760 |
and another topic when we are doing quantization which comes to play every 00:34:48.480 |
time we have a convolutional layer is the granularity. As you know convolutional 00:34:52.440 |
layers are made up of many filters or kernels and each kernel is run through 00:34:57.560 |
the image, for example, to calculate specific features. Now, for example, these 00:35:03.280 |
kernels are made of parameters which may be distributed differently for example 00:35:08.160 |
we may have a kernel that is distributed for example between minus 5 and plus 5 00:35:12.640 |
another one that is distributed between minus 10 and plus 10 and another one 00:35:17.680 |
that is distributed for example between minus 6 and plus 6. If we use the same 00:35:23.040 |
alpha and beta for all of them we will have that some kernels are wasting their 00:35:28.840 |
quantization range here and here for example so in this case it's better to 00:35:34.480 |
perform a channel wise quantization which means that for each kernel we will 00:35:39.040 |
calculate an alpha and beta, and they will basically be different for each kernel, 00:35:43.800 |
which results in a higher-quality quantization, so we lose less precision this way. 00:35:48.800 |
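(A sketch of the difference between one shared scale and channel-wise scales for a convolution weight; the shapes are just an example.)

```python
import torch

w = torch.randn(3, 16, 3, 3)                        # 3 kernels (output channels), each 16x3x3

# Per-tensor: one symmetric scale shared by every kernel
scale_tensor = w.abs().max() / 127

# Per-channel: one scale per kernel, computed from that kernel's own range
scale_channel = w.abs().amax(dim=(1, 2, 3)) / 127   # shape (3,)

q_tensor = torch.clamp(torch.round(w / scale_tensor), -127, 127)
q_channel = torch.clamp(torch.round(w / scale_channel.view(-1, 1, 1, 1)), -127, 127)
# Kernels with a narrow range waste far fewer quantization levels in the per-channel case
```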
Now let's look at what post-training quantization is. So post- 00:35:54.320 |
training quantization means that we have a pre-trained model that we want to 00:35:58.040 |
quantize. How do we do that? Well we need the pre-trained model and we need some 00:36:03.040 |
data, which is unlabeled data, so we do not need the original training 00:36:07.560 |
data we just need some data that we can run inference on. For example imagine 00:36:11.800 |
that the pre-trained model is a model that can classify dogs and cats what we 00:36:16.560 |
need as data we just need some pictures of dogs and cats which may also not come 00:36:20.640 |
from the training set and what we do is basically we take this pre-trained model 00:36:27.000 |
and we attach some observers that will collect some statistics while we are 00:36:33.040 |
running inference on the model and this statistics will be used to 00:36:38.440 |
calculate the Z and the S parameter for each layer of the model and then we can 00:36:44.160 |
use it to quantize the model. Let's see how this works in code. In this case I 00:36:50.680 |
will be creating a very simple model so first we import some libraries but 00:36:55.280 |
basically just a torch and then we import the data set we will be using 00:37:00.800 |
MNIST in our case. I define a very simple model for classifying MNIST 00:37:06.160 |
digits which is made up of three linear layers with ReLU activations. I 00:37:11.340 |
create this network I run a training on this network so this is just a basic 00:37:16.780 |
training loop you can see here, and we save this network in this file; 00:37:25.480 |
so we train it for I don't remember how many epochs for five epochs and then we 00:37:29.640 |
save it in a file. We define the testing loop, which is just for validating the 00:37:36.040 |
accuracy of this model. So first let's look at the not 00:37:41.560 |
quantized model so the pre-trained model for example. In this case let's look at 00:37:46.120 |
the weights of the first linear layer. In this case we can see that the linear 00:37:49.840 |
layer is made up of a weight matrix which is made up of many numbers which 00:37:53.800 |
are floating point of 32 bits. Floating point numbers of 32 bits. The size of the 00:37:59.800 |
model before quantization is 360 kilobyte. If we run the 00:38:07.120 |
testing loop on this model we will see that the accuracy is 96% which is not 00:38:12.360 |
bad. Of course our goal is to quantize which means that we want to speed up the 00:38:17.000 |
computation we want to reduce the size of the model but while maintaining the 00:38:20.460 |
accuracy. Let's see how it works. The first thing we do is we create a copy of 00:38:25.520 |
the model by introducing some observers. So as you can see this is a quantization 00:38:31.940 |
stub and this is a de-quantization stub that is used by PyTorch to do 00:38:36.960 |
quantization on the fly. And then we introduce also some observers in all the 00:38:43.080 |
intermediate layers. So we take this new model that is with observers we 00:38:48.480 |
basically take the weights from the pre-trained model and copy it into this 00:38:53.600 |
new model that we have created. So we are not training a new model we are just 00:38:57.000 |
copying the weights of the pre-trained model into this new model that we have 00:39:01.280 |
defined which is exactly the same as the original one just with some observers. 00:39:05.840 |
And we also insert some observers in all the intermediate layers. Let's see 00:39:12.120 |
these observers: basically they are some special classes of objects made 00:39:18.560 |
available by PyTorch that, for each linear layer, will observe some 00:39:22.440 |
statistics when we run some inference on this model. And as you can see, 00:39:27.760 |
the statistics they collect are just the minimum value they see and the maximum 00:39:31.540 |
value they see for each layer also for the input and this is why we have this 00:39:36.520 |
quant stub as input. And we calibrate the model using the test set. We run 00:39:42.680 |
inference on the model using the test set, for example, because we just need 00:39:46.720 |
some data to run inference on the model so that these observers will collect 00:39:51.160 |
statistics. We do it, and this will run inference of 00:39:57.520 |
all the test set on the model so we are not training anything we're just running 00:40:01.960 |
inference. The observers after running inference we will have collected some 00:40:07.760 |
statistics so for example the input observer here has collected some 00:40:11.720 |
statistics. The observer for the first linear layer also have collected some 00:40:16.160 |
statistics the second and the third etc etc. We can use the statistics that we 00:40:22.160 |
have collected to create the quantized model so the actual quantization happens 00:40:27.160 |
after we have collected these statistics and then we run this method which is 00:40:31.760 |
quantization.convert, which will create the quantized model. 00:40:37.440 |
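(Condensing the steps described here, a minimal sketch of this eager-mode post-training quantization workflow might look like the following; the model definition and the random calibration data are stand-ins, not the exact notebook code.)

```python
import torch
import torch.nn as nn
import torch.ao.quantization as tq

class QuantizableNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()          # observes and quantizes the input
        self.fc1 = nn.Linear(28 * 28, 100)
        self.fc2 = nn.Linear(100, 10)
        self.relu = nn.ReLU()
        self.dequant = tq.DeQuantStub()      # maps the output back to float

    def forward(self, x):
        x = self.quant(x.view(-1, 28 * 28))
        x = self.relu(self.fc1(x))
        return self.dequant(self.fc2(x))

model = QuantizableNet()
# model.load_state_dict(pretrained.state_dict())   # copy the pre-trained weights

model.eval()
model.qconfig = tq.default_qconfig               # min-max observers per layer
tq.prepare(model, inplace=True)                  # attach the observers
with torch.no_grad():
    for _ in range(10):                          # calibration: just run inference
        model(torch.randn(32, 1, 28, 28))        # stand-in for the MNIST test set
tq.convert(model, inplace=True)                  # compute S and Z, swap in quantized layers
```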
And we can now see that after we quantize it, each layer becomes a quantized layer: so 00:40:42.880 |
before quantization it's just a linear layer but after they become a quantized 00:40:46.720 |
linear layer. Each of them has some special parameter that is the S and the 00:40:52.500 |
Z parameter that we saw in the slide so the scale and the zero point. And we can 00:40:58.320 |
also print the weight matrix after quantization and we can see that the 00:41:01.600 |
weight matrix has become an integer of 8 bits so as you can see here. We can 00:41:07.040 |
compare the dequantized weights and the original weights so the original weights 00:41:11.200 |
were floating-point numbers of 32 bits while the dequantized weights so after 00:41:18.560 |
we dequantize, of course, we obtain back floating-point numbers. The integers are 00:41:22.640 |
how the weights are stored on disk, but when we dequantize we 00:41:27.720 |
obtain something that is very similar to the original weight matrix but not 00:41:32.360 |
exactly the same because we introduce some error because of the quantization. 00:41:36.560 |
So the dequantized weights are very similar to the original number but not 00:41:41.360 |
exactly the same. For example the first number is quite different the second one 00:41:46.160 |
is quite similar the third one is quite similar etc etc. We can check the size of 00:41:51.920 |
the model after it's been quantized and we can see that the new size of the 00:41:55.560 |
model is 94 kilobyte. Originally it was 360 if I remember correctly so it has 00:42:02.040 |
been reduced by four times. Why? Because each number instead of being a 00:42:06.480 |
floating-point number of 32 bits is now an integer plus some overhead because we 00:42:11.960 |
need to save some other data because for example we need to save all this scale 00:42:16.720 |
the scale value the zero point value and also PyTorch saves some other values. 00:42:22.040 |
We can also check the accuracy of the quantized model and we see that the 00:42:26.360 |
model didn't suffer much, actually didn't suffer at all, from the 00:42:31.320 |
quantization so the accuracy remained practically the same. In reality okay 00:42:35.480 |
this is a very simple example and the model is quite big so I think the model 00:42:39.040 |
has plenty of parameters to predict well. But in reality, usually when 00:42:46.720 |
we quantize a model we will lose some accuracy and we will see later a 00:42:51.120 |
training approach that makes the model more robust to quantization 00:42:56.000 |
which is called the quantization aware training. So this is the post-training 00:43:02.240 |
quantization and that's all for this one. Let's look at the next quantization 00:43:06.920 |
strategy which is the quantization aware training. What we do basically is that we 00:43:11.960 |
insert some fake modules in the computational graph of the model to 00:43:16.760 |
simulate the effect of quantization during training. So before we were 00:43:20.400 |
talking about how to quantize a model after we have already trained it. In this 00:43:26.000 |
case we want to train a model such that the model is more robust to the 00:43:30.640 |
quantization effect. So this is done during training not after the 00:43:36.560 |
training. And basically what we do is we have our model which has input then we 00:43:42.640 |
have some linear layers, we have output, we have a target, we compute the loss. 00:43:46.480 |
What we do basically is we insert between each layer some special 00:43:53.620 |
quantize and dequantize operations, some fake operations. So 00:43:58.120 |
actually we are not quantizing the model or the weights because the model is 00:44:02.080 |
getting trained. But we do some quantization on the fly. So 00:44:06.800 |
every time we see an input here we quantize it and dequantize it 00:44:10.280 |
immediately and run it to the next layer. Then this will produce some output. We 00:44:14.920 |
quantize it and dequantize it immediately and we give it to the next layer, 00:44:18.520 |
because this will introduce some quantization error and we hope that the 00:44:23.360 |
loss function will learn to be more robust to handle this 00:44:28.500 |
quantization error that is introduced by this fake quantization that we are 00:44:32.360 |
introducing. So the goal of introducing these operations is just to introduce 00:44:36.640 |
some quantization error, so that the loss function can get ready to counteract 00:44:42.920 |
the effects of quantization. Let's look at the code of how it is done. 00:44:49.960 |
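(A minimal sketch of this quantization-aware training setup, again with a stand-in model, random data and training loop rather than the exact notebook code.)

```python
import torch
import torch.nn as nn
import torch.ao.quantization as tq

model = nn.Sequential(                             # stand-in for the MNIST classifier
    tq.QuantStub(),
    nn.Linear(28 * 28, 100), nn.ReLU(),
    nn.Linear(100, 10),
    tq.DeQuantStub(),
)

model.train()
model.qconfig = tq.get_default_qat_qconfig("fbgemm")
tq.prepare_qat(model, inplace=True)                # insert fake quantize/dequantize modules

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
for _ in range(100):                               # stand-in training loop
    x, y = torch.randn(32, 28 * 28), torch.randint(0, 10, (32,))
    loss = loss_fn(model(x), y)                    # forward pass goes through the fake quant ops
    optimizer.zero_grad()
    loss.backward()                                # gradients use the straight-through estimator
    optimizer.step()

model.eval()
qmodel = tq.convert(model)                         # use the collected statistics to quantize
```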
So we go to quantization aware training. Okay we import the necessary libraries 00:44:56.440 |
just like before. We import the data set, in our case it's MNIST. We define a 00:45:01.760 |
model which is exactly the same as before but we notice that here we 00:45:05.760 |
already start with a model that is ready for quantization, because 00:45:10.300 |
here we want to train the model in a way that it's already aware of the 00:45:14.720 |
quantization. That's why it's called quantization aware training and the rest 00:45:19.520 |
of the structure of the model is the same as before. We insert the minimax 00:45:23.720 |
observers in the model for every layer so as you can see this model is not 00:45:28.240 |
trained and we already insert some observers. These observers are not 00:45:33.880 |
calibrated because we never run any inference or we never run any training 00:45:37.400 |
on this model so all these values are plus and minus infinity. Then we 00:45:43.400 |
train the model using the MNIST and we train it for one epoch and we check the 00:45:50.760 |
statistics collected by these observers during training and we can see that 00:45:56.200 |
during training they have collected some statistics so the minimum and the 00:46:00.000 |
maximum value and you can see that when we do quantization aware training we 00:46:05.360 |
have this weight fake quant so this is actually all the fake quantization 00:46:10.360 |
observers that we have introduced during the training and they have collected 00:46:15.280 |
some values or some statistics. We can then quantize the model by using the 00:46:22.160 |
statistics that have been collected during training and we can print the 00:46:27.160 |
values scale and zero point of the quantized model and we can see them here. 00:46:32.520 |
We can also print the weights of the quantized model and you can see that the 00:46:37.040 |
weight matrix of the first linear layer is actually an integer matrix and we can 00:46:41.720 |
also run the accuracy and we can see that the accuracy of this model is 0.952 00:46:47.200 |
okay in this case it's a little worse than the other case but this is not the 00:46:52.320 |
rule usually quantization aware training makes the model more robust to the 00:46:56.200 |
effects of quantization so usually when we do post training quantization the 00:47:01.200 |
model loses more accuracy compared to when we train a model with 00:47:06.000 |
quantization aware training. Let's go back to the slides. Now there is one 00:47:12.420 |
thing that we should notice that with quantization aware training we are 00:47:15.800 |
introducing some observers between each layer some special quantized and 00:47:22.520 |
dequantized operation between each layer and then we do it while training. This 00:47:27.400 |
means that the backpropagation algorithm should also be able to calculate the 00:47:32.220 |
gradient of the loss function with respect to this operation that we are 00:47:37.840 |
doing. But the operation of quantization is not differentiable, so 00:47:42.520 |
how can the backpropagation algorithm calculate the gradient of the 00:47:47.120 |
quantization operation that we are doing during the forward pass? Well, we usually 00:47:53.920 |
approximate the gradient using the straight through estimator which means 00:47:58.360 |
that for all the values that fall in between the beta and the alpha parameter 00:48:04.720 |
we give a gradient of 1 and for all the other values that are outside of this 00:48:09.520 |
range we approximate the gradient with 0 and this is because the quantization 00:48:16.040 |
operation is not differentiable; this is why we need to approximate the gradient 00:48:19.800 |
using this estimator. 00:48:26.840 |
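(A sketch of the straight-through estimator written as a custom autograd function; this is a simplified illustration, not PyTorch's internal implementation.)

```python
import torch

class FakeQuantizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale, zero, qmin, qmax):
        q = torch.round(x / scale) + zero
        ctx.save_for_backward((q >= qmin) & (q <= qmax))  # which values were NOT clipped
        q = torch.clamp(q, qmin, qmax)
        return scale * (q - zero)                         # quantize, then dequantize

    @staticmethod
    def backward(ctx, grad_output):
        (in_range,) = ctx.saved_tensors
        # Straight-through estimator: gradient 1 for values inside the representable
        # range (between beta and alpha), gradient 0 for values that were clipped
        return grad_output * in_range, None, None, None, None

x = torch.randn(5, requires_grad=True)
y = FakeQuantizeSTE.apply(x, 0.1, 128, 0, 255)
y.sum().backward()   # x.grad is 1.0 where x fell inside the range, 0.0 where it was clipped
```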
The next thing that we should notice is why quantization-aware training works, I mean, what is the effect of quantization-aware 00:48:31.200 |
training on the loss function because as I told you before our goal is to 00:48:35.520 |
introduce the quantization error during training such that the loss function can 00:48:40.200 |
react to it but how? Now imagine we do post training quantization when we train 00:48:46.040 |
a model that has no notion of quantization. Imagine we only have one 00:48:51.920 |
weight and the loss function is computed for this particular weight. The goal of 00:48:58.000 |
the backpropagation algorithm, or rather of the 00:49:03.580 |
gradient descent algorithm, is to calculate the weights of the model such 00:49:08.520 |
that we minimize the loss, and suppose that this is the 00:49:14.240 |
loss function and we end up in this local minimum here. The goal of 00:49:19.240 |
quantization aware training is to make the model reach a local minimum that is 00:49:25.400 |
wider. Why? Because the weight value here, after we quantize 00:49:31.800 |
it will change and for example if we do it without quantization aware training 00:49:37.280 |
if the loss was here and the weight value was here after quantization this 00:49:42.520 |
weight value will be changed of course so it may go here but the loss will 00:49:46.360 |
increase a lot for example but with quantization aware training we choose a 00:49:51.480 |
local minimum, a minimum that is wider, so that if the weight after the 00:49:56.820 |
quantization moves a little bit the loss will not increase by much and this is 00:50:02.320 |
why quantization aware training works. Thank you guys for watching my video I 00:50:07.600 |
hope you enjoyed learning about quantization I didn't talk about 00:50:11.400 |
advanced topics like GPTQ or AWQ, which I hope to do in my next videos. If you 00:50:18.080 |
liked the video please subscribe and like the video and share it with your 00:50:21.560 |
friends, colleagues and students. I have other videos about deep learning 00:50:27.840 |
and machine learning so please let me know if there is something you don't 00:50:30.760 |
understand, and feel free to connect with me on LinkedIn or on social media.