
Quantization explained with PyTorch - Post-Training Quantization, Quantization-Aware Training


Chapters

0:00 Introduction
1:10 What is quantization?
3:42 Integer representation
7:25 Floating-point representation
9:16 Quantization (details)
13:50 Asymmetric vs Symmetric Quantization
15:38 Asymmetric Quantization
18:34 Symmetric Quantization
20:57 Asymmetric vs Symmetric Quantization (Python Code)
24:16 Dynamic Quantization & Calibration
27:57 Multiply-Accumulate Block
30:05 Range selection strategies
34:40 Quantization granularity
35:49 Post-Training Quantization
43:05 Quantization-Aware Training

Transcript

Hello guys, welcome back to my channel. Today we are going to talk about quantization. Let's review the topics of today: I will start by showing what quantization is and why we need it, and later we will briefly introduce the numerical representations for integers and floating-point numbers in our hardware, so in CPUs and GPUs.

I will then show you what quantization is at the neural network level by giving you some examples, and later we will go into the details of the types of quantization, so asymmetric and symmetric quantization, what we mean by range and granularity, and finally we will also see post-training quantization and quantization-aware training.

For all of these topics I will also show you the PyTorch and Python code on how to do it from scratch. So we will actually build asymmetric and symmetric quantization from scratch using PyTorch, and then later we will also apply post-training quantization and quantization-aware training to a sample neural network.

What I expect you to already know before watching this video is a basic understanding of neural networks and some background in mathematics; high school mathematics is enough. So let's start our journey. Let's see what quantization is, first of all. Quantization aims to solve a problem.

The problem is that most modern deep neural networks are made up of millions, if not billions, of parameters. For example, the smallest Llama 2 has 7 billion parameters. Now, if every parameter is 32 bits, then we need 28 gigabytes just to store the parameters on disk. Also, when we run inference on the model, we need to load all of its parameters into memory.
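To put numbers on this, here is a quick back-of-the-envelope calculation (mine, not from the video):

```python
params = 7_000_000_000           # the smallest Llama 2 model
print(params * 4 / 1e9, "GB")    # 32-bit floats: 4 bytes each -> 28.0 GB of weights
print(params * 1 / 1e9, "GB")    # 8-bit integers: 1 byte each -> 7.0 GB
```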

If we are using the CPU for inference, then we need to load them into RAM, but if we are using the GPU, we need to load them into the memory of the GPU. Of course, big models cannot easily be loaded into RAM or GPU memory if we are using a standard PC or a small device like a smartphone.

And also, just like humans, computers are slower at computing floating-point operations than integer operations. For example, try to compute mentally 3 multiplied by 6, and then 1.21 multiplied by 2.897: of course you can do the 3 by 6 multiplication much faster, and the same goes for computers.

So the solution is quantization. Quantization aims to reduce the number of bits required to represent each parameter, usually by converting the floating-point numbers into integers. This way, for example, a model that normally occupies many gigabytes can be compressed to a much smaller size.

Also, please note that quantization doesn't mean that we just round up or round down all the floating-point numbers to the nearest integer; this is not what quantization does, and we will see later how it actually works, so please don't be confused. Quantization can also speed up computation, because working with smaller data types is faster: for example, the computer is much faster at multiplying two matrices made up of integers than two matrices made up of floating-point numbers.

Later we will also see how this matrix multiplication works at the GPU level. So what are the advantages of quantization? First of all, we have less memory consumption when loading models, since the model can be compressed into a much smaller size, and we have less inference time because we use simpler data types, for example integers instead of floating-point numbers.

These two combined lead to less energy consumption, which is very important for devices like smartphones. Okay, now let's review how numbers are represented in hardware, so at the CPU or GPU level. Computers use a fixed number of bits to represent any piece of data.

For example, to represent a number, a character, or a pixel color, we always use a fixed number of bits. A bit string made up of n bits can represent up to 2 to the power of n distinct numbers. For example, with 3 bits we can represent 8 possible values, from 0 to 7, and for each number you can see its binary representation.

We can always convert the binary representation into the decimal representation by multiplying each digit by 2 raised to the power of the digit's position inside the bit string. In most CPUs, integer numbers are actually represented using two's complement, which means that the first bit of the number indicates the sign, so 0 means positive and 1 means negative, while the rest of the bits indicate the absolute value of the number in case it's positive, or its complement in case it's negative.
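A quick illustration in Python (this snippet is mine, not from the video) using NumPy's fixed-width integer types, which behave like the hardware types described here:

```python
import numpy as np

# 3 bits -> 2**3 = 8 distinct values; '110' = 1*4 + 1*2 + 0*1 = 6
print(np.binary_repr(6, width=3))                     # '110'
# two's complement: the first bit is the sign, -1 is all ones
print(np.binary_repr(-1, width=8))                    # '11111111'
# fixed-width integers wrap around, unlike Python's unbounded ints
print(np.array([127], dtype=np.int8) + np.int8(1))    # [-128]
```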

The reason we use two's complement is that we want one unique representation for zero, so that plus zero and minus zero have the same binary representation. But of course, you may argue: okay, computers use a fixed number of bits to represent numbers, but then how can Python handle very big numbers without any problem? When you run 2 to the power of 9999 in Python, you get a result that is much bigger than any 64-bit number; how can Python handle these huge numbers without any problem?

Well, Python uses so-called bignum arithmetic. As we saw before in the table, the number 6 needs only one digit when represented in base 10, but it needs three digits when represented in base 2. This is actually a rule: the smaller the base, the more digits we need to represent a number. Python does the inverse: it stores these numbers as an array of digits, in which each digit is a digit of the number in base 2 to the power of 30, so overall we need fewer digits to store very big numbers. For example, if the result of 2 to the power of 9999 were represented as a decimal number, we would need an array of about 3,000 digits to store it in memory, while Python stores this number as an array of digits in base 2 to the power of 30, so it only needs 334 elements, in which all the elements are zero except the most significant one, which is equal to 512. As a matter of fact, you can check by yourself that by doing 512 multiplied by the base, so 2 to the power of 30, raised to the power of the position of this digit in the array, we obtain the number 2 to the power of 9999.

I also want you to notice that this is something implemented by CPython, which is the Python interpreter, not by the CPU, so it's not the CPU that is doing this bignum arithmetic for us, it's the Python interpreter. For example, when you compile C++ code, the code will run directly on the hardware, on the CPU, which also means that the C++ code is compiled for the specific hardware it will run on, while Python code is never compiled ahead of time for a specific machine, because the CPython interpreter takes care of executing our Python instructions at run time.

Okay, let's review how floating-point numbers are represented. Decimal numbers are just numbers that also include the negative powers of the base. For example, the number 85.612 can be written as each digit multiplied by a power of the base, which is 10, but the decimal part has negative powers of 10, as you can see: 10 to the power of minus 1, minus 2 and minus 3. This same reasoning is used in the IEEE 754 standard, which defines the representation of floating-point numbers in 32 bits. Basically, we divide the 32-bit string into three parts: the first bit indicates the sign (0 means positive), the next 8 bits indicate the exponent, which also indicates the magnitude of the number, so how big the number is, and the last 23 bits indicate the fractional part of the number, so all the digits corresponding to the negative powers of 2. To convert this bit string into a decimal value, we compute the sign, multiplied by 2 to the power of (exponent minus 127), multiplied by the fraction (1 plus all the negative powers of 2), and in the example this corresponds to the number 0.15625.

Most modern GPUs also support 16-bit floating-point numbers, but of course this results in less precision, because we have fewer bits dedicated to the fractional part and fewer bits dedicated to the exponent; being smaller, they can represent floating-point numbers with less precision, so we cannot have too many digits after the decimal point, for example.
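To make the 32-bit layout concrete, here is a small Python sketch (mine, not from the video) that unpacks the example value 0.15625 into its sign, exponent and fraction fields and reconstructs it with the formula above:

```python
import struct

# reinterpret the 32 bits of the float 0.15625 as an unsigned integer
bits = struct.unpack(">I", struct.pack(">f", 0.15625))[0]
sign     = bits >> 31              # 1 bit
exponent = (bits >> 23) & 0xFF     # 8 bits
fraction = bits & 0x7FFFFF         # 23 bits
print(sign, exponent, fraction)    # 0 124 2097152

# (-1)^sign * 2^(exponent - 127) * (1 + fraction / 2^23)
value = (-1) ** sign * 2 ** (exponent - 127) * (1 + fraction / 2 ** 23)
print(value)                       # 0.15625
```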
Okay, let's go into the details of quantization now. First of all, let's review how neural networks work. We start with an input, which could be a tensor, and we give it to a layer, which could be a linear layer for example, which then maps to another linear layer, and finally we have an output. We usually have a target: we compare the output and the target through a loss function, we calculate the gradient of the loss function with respect to each parameter, and we run backpropagation to update these parameters. The neural network can be made up of many different layers; for example, a linear layer is made up of two matrices, one called the weight and one called the bias, which are commonly represented using floating-point numbers. Quantization aims to use integer numbers to represent these two matrices while maintaining the accuracy of the model. Let's see how.

This linear layer, for example the first linear layer of this neural network, represents an operation: the input multiplied by a weight matrix (the parameters of this linear layer) plus a bias (also parameters of this linear layer). The goal of quantization is to quantize the input, the weight matrix and the bias matrix into integers, such that we perform all of these operations as integer operations, because they are much faster compared to floating-point operations. We then take the output, dequantize it, and feed it to the next layer, and we dequantize in such a way that the next layer should not even realize that there has been a quantization in the previous layer. So we want to do quantization in such a way that the model's output should not change because of quantization: we want to keep the model's performance, the accuracy of the model, but perform all these operations using integers. So we need to find a mapping between floating-point numbers and integers, and a reversible mapping of course, so we can go from floating-point to integers and from integers back to floating-point, but in such a way that we don't lose the precision of the model, while at the same time optimizing the space the model occupies in RAM and on disk and making these operations faster to compute, because as we saw before, computing integer operations is much faster than computing floating-point operations. The main benefit is that integer operations are much faster on most hardware than floating-point operations; plus, on most embedded hardware, especially very small embedded devices, we don't even have floating-point support, so we are forced to use integer operations on those devices.

Okay, let's see how it works. This hidden layer here, for example, may have a weight matrix, which could be the 5 by 5 matrix we can see here. The goal of quantization is to reduce the precision of each number in this matrix by mapping it into a range that occupies fewer bits. Each of these is a floating-point number and occupies 4 bytes, so 32 bits; we want to quantize using only 8 bits, so each number should be represented using only 8 bits. Now, with 8 bits we can represent the range from -128 to +127, but usually we sacrifice the -128 to obtain a symmetric range. So we map each number into its 8-bit representation in such a way that we can then map back to the original array, in two operations: the first is called quantization and the second is called dequantization. After dequantization we should obtain the original array, the original tensor or matrix, but we usually lose some precision: for example, if you look at the first value it's exactly the same as in the original matrix, but the second value here is similar, not exactly the same. This is to say that with quantization we introduce some error, so the model will not be as accurate as the non-quantized model, but we want to design the quantization process in such a way that we lose the least accuracy possible, so we want to minimize the error that we introduce.

Okay, let's go into the details of quantization now, by reviewing the types of quantization we have available. First of all, I will show you the difference between asymmetric and symmetric quantization. Imagine we have a tensor made up of the 10 values you can see here. The goal of asymmetric quantization is to map the original tensor, which is distributed in the range between -44.93, the smallest number in this tensor, and 43.31, the biggest number in this tensor, into another range made up of integers between 0 and 255, which are the integers we can represent using 8 bits, for example. If we do this operation we will obtain a new tensor that maps, for example, this first number into 255, this number here into 0, this number here into 130, etc. The other type of quantization is symmetric quantization, which aims to map a symmetric range: we take this tensor and treat it as a symmetric range, even if it's not symmetric, because as you can see the biggest value here is 43.31 and the smallest value is -44.93, so they are not symmetric with respect to zero. Symmetric quantization maps this (approximately) symmetric input range into another symmetric range, also using 8 bits in our case; this gives you the advantage that the zero is always mapped into the zero in the quantized numbers. I will show you now how we actually do this computation, so how we compute the quantized version from the original tensor and also how to dequantize back.

Let's start with the case of asymmetric quantization. Imagine we have an original tensor like this, so the 10 items we can see here. We quantize using the following formula: the quantized version of each number is equal to the original floating-point number divided by a parameter called S, which stands for scale, rounded to the nearest integer, plus a number Z; if the result of this operation is smaller than zero we clamp it to zero, and if it's bigger than 2^n - 1 we clamp it to 2^n - 1. What is n?

n is the number of bits that we want to use for quantization: we want to quantize all these floating-point numbers into 8 bits, for example, so we will choose n equal to 8. How do we calculate the S parameter? The S parameter is given by (alpha minus beta) divided by the size of the output range, 2^n - 1, so basically by how many steps the output range can represent. What are beta and alpha? They are the smallest and the biggest number in the original tensor. So we basically take the range of the original tensor and we squeeze it into the output range by means of this scale parameter, and then we center it using the Z parameter. The Z parameter is computed as minus beta divided by S, rounded to the nearest integer, so the Z parameter is an integer, while the scale parameter is not an integer, it is a floating-point number. If we take each floating-point number and run it through this formula, we obtain this quantized vector. What we can see, first of all, is that with asymmetric quantization the biggest number is always mapped to the biggest number in the output range, and the smallest number is always mapped to zero in the output range. The zero in the original vector is mapped into the Z parameter, so this 130 is actually the Z parameter if you compute it, and all the other numbers are mapped into something in between 0 and 255. We can then dequantize using the following formula: to obtain the floating-point number back, we just multiply the scale by (the quantized number minus Z), and we should obtain the original tensor. But you will see that the numbers are similar, not exactly the same, because the quantization introduces some error: we are trying to squeeze a range that could be very big (with 32 bits we can represent a very big range) into a range that is much smaller with 8 bits, so of course we will introduce some error.

Let's see symmetric quantization. With symmetric quantization, as we saw before, we aim to transform a symmetric input range into a symmetric output range. Imagine we still have this tensor; we compute the quantized values as follows: each floating-point number is divided by the scale parameter S and clamped between the two limits, -(2^(n-1) - 1) and +(2^(n-1) - 1), where n is the number of bits we want to use for quantizing. The S parameter is calculated as the absolute value of alpha divided by 2^(n-1) - 1, where alpha is the biggest number in absolute terms, in this case the number -44.93, because in absolute terms it is the biggest value. We can then quantize this tensor and we should obtain something like this; notice that the zero in this case is mapped into the zero, which is very useful. We can then dequantize using the formula we can see here: to obtain the floating-point number, we take the quantized number multiplied by the scale parameter S, and we should obtain the original vector, but of course we will lose some precision. As you can see, the original number was 43.31 and the dequantized number is 43.16, so we lost some precision. Our goal, of course, is to have it as similar as possible to the original array, and the easiest way would be to just increase the number of bits of the quantization. But we cannot just choose any number of bits, because as we saw before, we want the matrix multiplication in the linear layer to be accelerated by the CPU, and the CPU always works with a fixed number of bits: the operations inside the CPU are optimized for a fixed number of bits, so for example we have optimizations for 8 bits, 16 bits, 32 bits and 64 bits, but if we choose 11 bits for quantization the CPU may not support accelerated operations on 11-bit values. So we have to choose a good compromise between the number of bits and the availability of the hardware. Later we will also see how the GPU computes the matrix multiplication in accelerated form.

Okay, I have shown you the symmetric and the asymmetric quantization; now it's time to actually look at the code and how it is implemented in reality. Let's have a look. I created a very simple notebook in which I generated 20 random numbers between -50 and 150, and I modified these numbers in such a way that the first number is the biggest one, the second number is the smallest one, and the third is a zero, so we can check the effect of the quantization on the biggest number, on the smallest number and on the zero. Suppose these are the original numbers, so this array of 20 numbers. We define the functions that will quantize this vector: the asymmetric quantization function computes alpha as the maximum value and beta as the minimum value, calculates the scale and the zero point using the formulas we saw on the slides before, and then quantizes using the same formula we saw before. The same goes for the symmetric quantization: we calculate alpha, the scale parameter, and the upper and lower bounds for clamping. We can also dequantize using the same formulas we saw on the slides: in the asymmetric case it's the one with the zero point, and in the symmetric case we don't have the zero point, because the zero is always mapped into the zero. We can also calculate the quantization error by comparing the original values and the dequantized values using the mean squared error.

So let's see the effect of quantization. This is our original array of floating-point numbers; if we quantize it using asymmetric quantization we obtain this array here, in which we can see that the biggest value is mapped into 255, which is the biggest value of the output range, the smallest value is mapped into zero, and the zero is mapped into the Z parameter, which is 61 here. With symmetric quantization, instead, the zero is mapped into the zero, so the third element of the original vector is mapped into the zero of the symmetric range. If we dequantize back the quantized values, we see that they are similar to the original vector but not exactly the same; we lose a little bit of precision, and we can measure it using the mean squared error, for example. We can see that the error is much bigger for the symmetric quantization. Why? Because the original vector is not symmetric: the original vector is between -50 and 150, so with symmetric quantization we take the biggest value in absolute terms and map it to 127, which means the quantization maps onto an output range between -127 and +127; but since the original values never go much below -50, all the quantized values between -127 and roughly -40 never appear in this array, so a large part of the range is unused, and that's why all the other numbers suffer from this bad distribution, let's say, and this is why the symmetric quantization has a bigger error.
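The notebook itself isn't reproduced in this transcript, but a minimal PyTorch sketch of such asymmetric and symmetric quantization functions (variable names and sample values are my own) could look like this:

```python
import torch

def asymmetric_quantize(x: torch.Tensor, bits: int = 8):
    """Map [beta, alpha] = [x.min(), x.max()] onto the integer range [0, 2**bits - 1]."""
    alpha, beta = x.max(), x.min()
    scale = (alpha - beta) / (2 ** bits - 1)
    zero = int(torch.round(-beta / scale))
    q = torch.clamp(torch.round(x / scale) + zero, 0, 2 ** bits - 1)
    return q.to(torch.int32), scale.item(), zero

def asymmetric_dequantize(q, scale, zero):
    return scale * (q.float() - zero)

def symmetric_quantize(x: torch.Tensor, bits: int = 8):
    """Symmetric version: zero maps to zero, the scale uses the largest absolute value."""
    scale = x.abs().max() / (2 ** (bits - 1) - 1)
    q = torch.clamp(torch.round(x / scale), -(2 ** (bits - 1) - 1), 2 ** (bits - 1) - 1)
    return q.to(torch.int32), scale.item()

x = torch.tensor([43.31, -44.93, 0.0, 12.7, -8.4])
q, s, z = asymmetric_quantize(x)
print(q)                                # max -> 255, min -> 0, 0.0 -> the zero point z
print(asymmetric_dequantize(q, s, z))   # close to x, but not exactly equal
print(symmetric_quantize(x)[0])         # 0.0 -> 0, values spread over [-127, 127]
```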
Okay, let's review again how quantization works in the case of the linear layer. If we never quantize this network, we have a weight matrix and a bias matrix, the output of this layer will be the weight multiplied by the input of this layer plus the bias, and the output will be another floating-point matrix; all of these matrices are floating-point numbers. When we quantize, we quantize the weight matrix, which is a fixed matrix, because we pretend the network has already been trained, so the weight matrix is fixed and we can quantize it by calculating the alpha and beta we saw before, using symmetric or asymmetric quantization. The bias can also be quantized, because it is a fixed vector: we can calculate the alpha and beta of this vector and quantize it. Our goal is to perform all these operations using integers, so how can we quantize the X matrix? The X matrix depends on the input the network receives. One way is called dynamic quantization: for every input we receive, we calculate its alpha and beta on the fly (we have a vector, so we can compute its alpha and beta) and then we quantize it on the fly.

Okay, now that we have also quantized the input matrix, for example using dynamic quantization, we can perform this matrix multiplication, which becomes an integer matrix multiplication; the output will be Y, which is an integer matrix. But this matrix is not the original floating-point output of the non-quantized network, it's a quantized value, so how can we map it back to the original floating-point numbers? We need a process called calibration. Calibration means that we take the network, run some inputs through it, and check what the typical values of Y are. Using these typical values of Y, we can choose a reasonable alpha and a reasonable beta for the values we observe, and then we can take the output of this integer matrix multiplication and use the scale and zero point computed from the statistics collected about Y to dequantize this output matrix, so that it's mapped back into floating-point numbers and the network's output doesn't differ too much from that of the non-quantized network. So the goal of quantization is to reduce the number of bits required to represent each parameter and to speed up the computation, but we also want to obtain the same output for the same input, or at least a very similar output, so we don't want to lose precision, and we need a way to map the output of each linear layer back into floating-point numbers. This is how we do it: the input matrix we can observe every time, quantizing it on the fly with dynamic quantization; the output we can observe for a few samples, so we know its typical maximum and minimum values, use them as alpha and beta, and then dequantize the output Y using these observed values. We will see this later in practice with post-training quantization, where we will actually look at the code.
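As a side note, PyTorch already ships this idea as dynamic quantization, where weights are quantized ahead of time and activations are quantized on the fly per batch. A minimal sketch (the layer sizes here are arbitrary):

```python
import torch
import torch.nn as nn

float_model = nn.Sequential(nn.Linear(100, 100), nn.ReLU(), nn.Linear(100, 10))

# weights are quantized once; activations are quantized on the fly at inference time
dq_model = torch.ao.quantization.quantize_dynamic(
    float_model, {nn.Linear}, dtype=torch.qint8
)
out = dq_model(torch.randn(1, 100))   # the output is dequantized back to float32
print(dq_model)
print(out.dtype)                      # torch.float32
```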
I also want to give you a glimpse into how the GPU performs matrix multiplication. When we calculate the product X multiplied by W plus B, which is a matrix multiplication followed by a matrix addition, the result is a list of dot products between each row of the X matrix and each column of the W matrix, summed with the corresponding element of the bias vector B. This operation, the matrix multiplication plus bias, can be accelerated by the GPU using a block called the multiply-accumulate block. Imagine, for example, that each vector is made up of four elements: we load the first row of the X matrix and the first column of the W matrix, we compute the corresponding products, so X11 with W11, then X12 with W21, X13 with W31, etc., and we sum all these values into a register called the accumulator. Now, each of these operands is an 8-bit integer, because we quantized them, but the result of a multiplication of two 8-bit integers may not fit into 8 bits: it can need 16 bits or more, and for this reason the accumulator is usually 32-bit. This is also the reason we quantize the bias vector using 32 bits, because the accumulator is initialized with the bias element. The GPU performs this operation in parallel for every row and column of the initial matrices using many blocks like this, and this is how GPU acceleration works for matrix multiplication. If you are interested in how this happens at a low, algorithmic level, I recommend reading the documentation of Google's low-precision general matrix multiplication library.

Okay, now that we have seen the difference between symmetric and asymmetric quantization, we may also want to understand how we choose the alpha and beta parameters we saw before. One way, of course, is to choose beta and alpha to be the smallest and the biggest value in the case of asymmetric quantization, and to choose alpha as the biggest value in absolute terms for symmetric quantization. But this is not the only strategy, and each strategy has pros and cons, so let's review them. The strategy we used before is called the min-max strategy, which means that we choose alpha as the biggest value in the original tensor and beta as the minimum value in the original tensor. This, however, is sensitive to outliers: imagine we have a vector that is more or less distributed between minus 50 and plus 50, but then we have an outlier that is a very big number. The problem with this strategy is that the outlier makes the quantization error of all the numbers very big: as you can see, when we quantize and then dequantize using asymmetric quantization with the min-max strategy, all the numbers are not very similar to the original, they are actually quite different; this one was 43.31 and it becomes 45.08, so it's quite a big error for the quantization. A better strategy, to avoid the outlier ruining the input range, is to use the percentile strategy: we set the range alpha and beta to be a percentile of the original distribution, so not the maximum and the minimum but, for example, the 99.99th percentile. If we use the percentile, we will see that the quantization error is reduced for all the terms, and the only term that suffers a lot from the quantization error is the outlier itself.
Okay, let's have a look at the code to see how the min-max strategy and the percentile strategy differ. We open a notebook in which we again have a lot of numbers: 10,000 numbers distributed between -50 and 150, and then we introduce an outlier, let's say the last number, which is equal to 1000, while all the other numbers are distributed between -50 and 150. We compare the two strategies, asymmetric quantization using the min-max strategy and asymmetric quantization using the percentile strategy; as you can see, the only difference between the two methods is how we compute alpha and beta: here alpha is computed as the maximum value, and here alpha is computed as a percentile, the 99.99th percentile. We can compare the quantized values, which we can see here, and then we can dequantize. When we dequantize, we see that all the values quantized with the min-max strategy suffer from a big quantization error, while when we use the percentile the only value that suffers from a big quantization error is the outlier itself. And as we can see, if we exclude the outlier and compute the quantization error on the other terms, with the percentile we have a much smaller error, while with the min-max strategy we have a very big error for all the numbers except the outlier.
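A compact sketch of the same comparison (the sample data, the 99.99th percentile and the helper name are my own; it reuses the asymmetric formula from the earlier sketch):

```python
import torch

x = torch.cat([torch.rand(10_000) * 200 - 50, torch.tensor([1000.0])])  # outlier at the end

def quantize_dequantize(x, alpha, beta, bits=8):
    """Asymmetric quantization with an explicitly chosen [beta, alpha] range, then dequantization."""
    scale = (alpha - beta) / (2 ** bits - 1)
    zero = torch.round(-beta / scale)
    q = torch.clamp(torch.round(x / scale) + zero, 0, 2 ** bits - 1)
    return scale * (q - zero)

# min-max: the outlier stretches the range, so every value loses precision
dq_minmax = quantize_dequantize(x, x.max(), x.min())
# percentile: clip alpha at the 99.99th percentile, so only the outlier suffers
dq_perc = quantize_dequantize(x, torch.quantile(x, 0.9999), x.min())

print(((x[:-1] - dq_minmax[:-1]) ** 2).mean())  # large error on the normal values
print(((x[:-1] - dq_perc[:-1]) ** 2).mean())    # much smaller error (outlier excluded)
```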

Two other strategies commonly used for choosing alpha and beta are the mean squared error and the cross entropy. Mean squared error means that we choose alpha and beta such that the mean squared error between the original values and the dequantized values is minimized, and we usually use a grid search for this. The cross-entropy strategy is used whenever we are dealing, for example, with a language model: as you know, in a language model the last layer is a linear layer plus a softmax, which allows us to choose a token from the vocabulary.

The goal of this softmax layer is to create a probability distribution, on which we usually apply a greedy strategy or a top-p strategy, so what we care about are not so much the values inside this distribution but the distribution itself: the biggest number should remain the biggest number also in the quantized values, and the intermediate numbers should not change their relative order. For this case we use the cross-entropy strategy, which means that we choose alpha and beta such that the cross entropy between the original (not quantized) values and the dequantized values is minimized.
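The video does not show code for these strategies, but a rough sketch of the MSE approach, using a simple grid search over candidate symmetric clipping ranges (the function name and search grid are my own choices), could look like this:

```python
import torch

def mse_best_alpha(x: torch.Tensor, bits: int = 8, n_steps: int = 100):
    """Grid-search a symmetric range [-alpha, alpha] that minimizes the MSE
    between x and dequantize(quantize(x))."""
    qmax = 2 ** (bits - 1) - 1
    best_alpha, best_err = None, float("inf")
    for frac in torch.linspace(0.05, 1.0, n_steps):
        alpha = frac * x.abs().max()
        scale = alpha / qmax
        q = torch.clamp(torch.round(x / scale), -qmax, qmax)     # clip at [-alpha, alpha]
        err = torch.mean((x - q * scale) ** 2).item()
        if err < best_err:
            best_alpha, best_err = alpha.item(), err
    return best_alpha, best_err

x = torch.randn(10_000) * 10
x[0] = 500.0                    # an outlier
print(mse_best_alpha(x))        # the chosen alpha ends up well below the outlier's value
```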

Another topic that comes into play whenever we quantize a convolutional layer is granularity. As you know, convolutional layers are made up of many filters, or kernels, and each kernel is run over the image, for example, to compute specific features. These kernels are made up of parameters that may be distributed differently: we may have one kernel whose values are distributed between minus 5 and plus 5, another distributed between minus 10 and plus 10, and another distributed between minus 6 and plus 6.

If we use the same alpha and beta for all of them, some kernels will waste part of their quantization range, here and here for example. In this case it's better to perform channel-wise quantization, which means that for each kernel we calculate a separate alpha and beta, different for each kernel, which results in a higher-quality quantization: we lose less precision this way.
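PyTorch exposes both granularities directly on tensors; a small sketch (the kernel values are made up) comparing per-tensor and per-channel quantization of a weight matrix:

```python
import torch

# three "kernels" (rows) with visibly different value ranges
w = torch.randn(3, 5) * torch.tensor([[5.0], [10.0], [6.0]])

# per-tensor (symmetric, 8-bit): one scale shared by the whole matrix
scale_t = w.abs().max() / 127
q_tensor = torch.quantize_per_tensor(w, scale=float(scale_t), zero_point=0, dtype=torch.qint8)

# per-channel: one scale per kernel (axis 0), so each row uses its full range
scales = w.abs().amax(dim=1) / 127
zero_points = torch.zeros(3, dtype=torch.int64)
q_channel = torch.quantize_per_channel(w, scales, zero_points, axis=0, dtype=torch.qint8)

print((w - q_tensor.dequantize()).abs().mean())   # larger reconstruction error
print((w - q_channel.dequantize()).abs().mean())  # smaller reconstruction error
```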

And now let's look at what post-training quantization is. Post-training quantization means that we have a pre-trained model that we want to quantize. How do we do that? We need the pre-trained model and some data, which can be unlabeled: we do not need the original training data, we just need some data that we can run inference on.

For example, imagine that the pre-trained model is a model that can classify dogs and cats; as data we just need some pictures of dogs and cats, which may not even come from the training set. What we do is take this pre-trained model and attach some observers that will collect statistics while we run inference on the model, and these statistics will be used to calculate the Z and S parameters for each layer of the model, which we can then use to quantize the model.

Let's see how this works in code. In this case I will be creating a very simple model: first we import some libraries, basically just torch, and then we import the dataset, MNIST in our case. I define a very simple model for classifying MNIST digits, made up of three linear layers with ReLU activations.

I create this network and run a training on it, so this is just a basic training loop you can see here; we train it for five epochs and then we save the network to a file.

We define the testing loop, which is just for validating the accuracy of this model. First, let's look at the non-quantized model, so the pre-trained model. Let's look at the weights of the first linear layer: we can see that the linear layer is made up of a weight matrix containing many 32-bit floating-point numbers.

The size of the model before quantization is 360 kilobytes. If we run the testing loop on this model, we see that the accuracy is 96%, which is not bad. Our goal is to quantize it, which means we want to speed up the computation and reduce the size of the model while maintaining the accuracy.

Let's see how it works. The first thing we do is create a copy of the model with some observers introduced. As you can see, this is a quantization stub and this is a dequantization stub, which are used by PyTorch to do quantization on the fly, and then we also introduce some observers in all the intermediate layers.

We take this new model with observers and copy the weights from the pre-trained model into it. So we are not training a new model, we are just copying the weights of the pre-trained model into this new model we have defined, which is exactly the same as the original one, just with some observers.

We also insert some observers in all the intermediate layers. These observers are a special class of objects made available by PyTorch that, for each linear layer, will record some statistics when we run inference on the model. As you can see, the statistics they collect are just the minimum and the maximum value they see for each layer, and also for the input, which is why we have the quant stub on the input.

Then we calibrate the model using the test set: we just need some data to run inference on, so that these observers collect statistics. So we run the whole test set through the model; we are not training anything, we are just running inference.

After running inference, the observers will have collected some statistics: for example, the input observer here has collected some statistics, the observer of the first linear layer has also collected some statistics, the second, the third, etc. We can use the statistics we have collected to create the quantized model; the actual quantization happens after we have collected these statistics, when we run this method, quantization.convert, which creates the quantized model.
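Condensed into one place, the whole post-training quantization workflow described here looks roughly like the following sketch (the layer sizes, calibration data and qconfig choice are my own placeholders, not the exact code from the video):

```python
import torch
import torch.nn as nn
from torch.ao.quantization import QuantStub, DeQuantStub, default_qconfig, prepare, convert

class QuantReadyNet(nn.Module):
    """Same structure as the pre-trained MNIST classifier, plus quant/dequant stubs."""
    def __init__(self, hidden=100):                  # hidden size is an assumption
        super().__init__()
        self.quant = QuantStub()                     # observes and quantizes the input
        self.net = nn.Sequential(
            nn.Linear(28 * 28, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 10),
        )
        self.dequant = DeQuantStub()                 # maps the output back to float

    def forward(self, x):
        return self.dequant(self.net(self.quant(x)))

model = QuantReadyNet().eval()
# model.load_state_dict(pretrained.state_dict())    # copy the pre-trained weights
model.qconfig = default_qconfig                      # min/max observers
prepared = prepare(model)                            # attach observers to every layer

with torch.no_grad():                                # calibration: inference only
    for _ in range(20):
        prepared(torch.rand(32, 28 * 28))            # stand-in for the test images

quantized = convert(prepared)                        # int8 weights + scale / zero point
print(quantized.net[0].weight().int_repr()[0, :5])   # 8-bit integer weights
print(quantized.net[0].scale, quantized.net[0].zero_point)
```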

And we can now see that after quantization each layer has become a quantized layer: before quantization it was just a linear layer, and afterwards it becomes a quantized linear layer. Each of them has some special parameters, the S and Z parameters we saw in the slides, so the scale and the zero point.

We can also print the weight matrix after quantization, and we can see that it has become an 8-bit integer matrix, as you can see here. We can compare the dequantized weights and the original weights: the original weights were 32-bit floating-point numbers, and when we dequantize we obtain floating-point numbers back; the integers are how the weights are stored on disk, but when we dequantize we obtain something that is very similar to the original weight matrix, not exactly the same, because we introduced some error with the quantization.

So the dequantized weights are very similar to the original numbers but not exactly the same: for example, the first number is quite different, the second one is quite similar, the third one is quite similar, etc. We can check the size of the model after it has been quantized, and we can see that the new size of the model is 94 kilobytes.

Originally it was 360 kilobytes, if I remember correctly, so it has been reduced by roughly four times. Why? Because each number, instead of being a 32-bit floating-point number, is now an 8-bit integer, plus some overhead, because we also need to save some other data: for example, the scale and zero-point values, and PyTorch saves some other values as well.
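One simple way to reproduce this kind of size comparison yourself (assuming the `model` and `quantized` objects from the sketch above) is to serialize both state dicts and compare the files:

```python
import os
import torch

torch.save(model.state_dict(), "model_fp32.pt")
torch.save(quantized.state_dict(), "model_int8.pt")
print(os.path.getsize("model_fp32.pt") / 1024, "KB")   # roughly 4x larger
print(os.path.getsize("model_int8.pt") / 1024, "KB")   # int8 weights + scales / zero points
```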

We can also check the accuracy of the quantized model, and we see that the model didn't suffer from the quantization at all: the accuracy remained practically the same. To be fair, this is a very simple example and the model is quite big for the task, so it has plenty of parameters with which to predict well.

But in reality, when we quantize a model we will usually lose some accuracy, and we will see later a training approach that makes the model more robust to quantization, called quantization-aware training. So this is post-training quantization, and that's all for this one. Let's look at the next quantization strategy, which is quantization-aware training.

What we do basically is that we insert some fake modules in the computational graph of the model to simulate the effect of quantization during training. So before we were talking about how to quantize a model after we have already trained it. In this case we want to train a model such that the model is more robust to the quantization effect.

So this is done during training, not after training. Basically, we have our model: an input, then some linear layers, an output, a target, and we compute the loss. What we do is insert between the layers some special quantize and dequantize operations, some fake operations.

So we are not actually quantizing the model or the weights, because the model is being trained; we just do some quantization on the fly. Every time we see an input here, we quantize it and dequantize it immediately and pass it to the next layer, which will then produce some output.

We quantize that output and dequantize it immediately and give it to the next layer, because this introduces some quantization error, and we hope that the model will learn to be more robust and handle the quantization error introduced by this fake quantization. So the goal of introducing these operations is just to introduce some quantization error during training, so that the training can counteract the effects of quantization.

Let's look at the code of how it is done. We go to quantization-aware training: we import the necessary libraries just like before, and we import the dataset, MNIST in our case. We define a model which is exactly the same as before, but notice that here we already start with a model that is ready for quantization, because we want to train the model in a way that makes it aware of the quantization from the start.

That's why it's called quantization-aware training; the rest of the structure of the model is the same as before. We insert the min-max observers into the model for every layer; as you can see, this model is not trained yet and we already insert some observers. These observers are not calibrated, because we have never run any inference or training on this model, so all their values are plus and minus infinity.

Then we train the model on MNIST for one epoch and we check the statistics collected by these observers during training: we can see that they have collected the minimum and maximum values, and you can see that with quantization-aware training we have this weight fake-quant, which corresponds to the fake quantization observers we introduced during training; they too have collected some statistics.

We can then quantize the model using the statistics collected during training, and print the scale and zero-point values of the quantized model, which we can see here. We can also print the weights of the quantized model, and you can see that the weight matrix of the first linear layer is now an integer matrix. We can also measure the accuracy, and we see that the accuracy of this model is 0.952. Okay, in this case it's a little worse than the post-training case, but this is not the rule: usually quantization-aware training makes the model more robust to the effects of quantization, so a model quantized with post-training quantization typically loses more accuracy than a model trained with quantization-aware training.
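As with post-training quantization, the eager-mode API can be sketched in a few lines (again using the stub-wrapped network from the earlier sketch; the qconfig and training-loop details are placeholders, not the exact code from the video):

```python
import torch
from torch.ao.quantization import prepare_qat, convert, default_qat_qconfig

model = QuantReadyNet().train()         # the stub-wrapped network from the earlier sketch
model.qconfig = default_qat_qconfig     # fake-quantize modules + min/max observers
qat_model = prepare_qat(model)          # inserts fake quantize/dequantize ops between layers

# ordinary training loop: the fake quantization injects the rounding error into the forward pass
# optimizer = torch.optim.Adam(qat_model.parameters(), lr=1e-3)
# for images, labels in train_loader:
#     loss = torch.nn.functional.cross_entropy(qat_model(images.view(-1, 28 * 28)), labels)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()

qat_model.eval()
int8_model = convert(qat_model)         # real int8 weights with per-layer scale and zero point
```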

Let's go back to the slides. There is one thing we should notice: with quantization-aware training we are introducing, between the layers, some special quantize and dequantize operations, and we do this while training. This means that the backpropagation algorithm should also be able to calculate the gradient of the loss function with respect to these operations, but the quantization operation is not differentiable, so how can the backpropagation algorithm calculate the gradient of the quantization operation that we perform during the forward pass?

Well, we usually approximate the gradient using the straight-through estimator, which means that for all the values that fall between the beta and alpha parameters we give a gradient of 1, and for all the values outside of this range we approximate the gradient with 0. This is because the quantization operation is not differentiable, which is why we need to approximate its gradient with this estimator.
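A minimal sketch of how such a fake-quantization operation with a straight-through estimator could be written as a custom autograd function (the class name and the example scale/zero point are my own choices):

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Fake-quantize in the forward pass; pass the gradient straight through in the backward pass."""

    @staticmethod
    def forward(ctx, x, scale, zero_point, qmin, qmax):
        ctx.save_for_backward(x)
        # the dequantized values corresponding to the clamping limits, i.e. [beta, alpha]
        ctx.bounds = (scale * (qmin - zero_point), scale * (qmax - zero_point))
        q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
        return scale * (q - zero_point)            # quantize, then immediately dequantize

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        beta, alpha = ctx.bounds
        inside = (x >= beta) & (x <= alpha)        # gradient 1 inside [beta, alpha], 0 outside
        return grad_output * inside, None, None, None, None

x = torch.randn(10, requires_grad=True)
y = FakeQuantSTE.apply(x, 0.1, 128, 0, 255)        # scale=0.1, zero point=128, 8-bit range
y.sum().backward()
print(x.grad)                                      # 1 where x fell inside the range, 0 elsewhere
```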

The next thing we should understand is why quantization-aware training works, I mean, what the effect of quantization-aware training on the loss function is. As I told you before, our goal is to introduce the quantization error during training so that the training can react to it, but how?

Now imagine we do post-training quantization, so we train a model that has no notion of quantization. Imagine we only have one weight, and the loss is computed as a function of this particular weight. The goal of the gradient descent algorithm is to find the weights of the model that minimize the loss, and suppose that, with this loss function, we end up in this local minimum here.

The goal of quantization-aware training is to make the model reach a local minimum that is wider. Why? Because the weight value will change after we quantize it. Without quantization-aware training, if the loss was here and the weight value was here, after quantization the weight value changes, so it may move here, and the loss can increase a lot. With quantization-aware training we instead end up in a wider minimum, so that if the weight moves a little bit after quantization, the loss does not increase by much, and this is why quantization-aware training works.

Thank you guys for watching my video, I hope you enjoyed learning about quantization. I didn't talk about advanced topics like GPTQ or AWQ, which I hope to cover in my next videos. If you liked the video, please subscribe, like it, and share it with your friends, colleagues and students.

I have other videos about deep learning and machine learning, so please let me know if there is something you don't understand, and feel free to connect with me on LinkedIn or on social media.