
LoRA: Low-Rank Adaptation of Large Language Models - Explained visually + PyTorch code from scratch


Chapters

0:00 Introduction
0:47 How neural networks work
1:48 How fine-tuning works
3:50 LoRA
8:58 Math intuition
10:25 Math explanation
14:05 PyTorch implementation from scratch

Transcript

Hello guys, welcome back to my channel. Today we will be exploring a very influential paper called LoRA. LoRA stands for Low-Rank Adaptation of Large Language Models, and it came out, I think, two years ago from Microsoft. In this video we will see what LoRA is and how it works, and we will also implement it in PyTorch from scratch, without using any external libraries except for torch, of course. Let's go.

We are in the domain of language models, but LoRA can actually be applied to any kind of model; in fact, in the demo that I will show you later, we will apply it to a very simple classification task. Before we study LoRA, we need to understand why we need it in the first place.

So let's review some basics about neural networks. Imagine we have some input, which could be a single number or a vector of numbers, then a hidden layer, which is usually represented by a matrix (here I show the graphical representation), then another hidden layer, and finally the output.

Usually, when we train a network, we also have a target: we compare the output with the target to produce a loss, and then we backpropagate the loss to the weights of all the layers. In this layer, for example, we have a weights matrix and a bias matrix, and each of these weights will be modified according to the loss; the same goes for the weights and bias of the other layer.
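To make this concrete, here is a minimal PyTorch sketch of the loop just described (the layer sizes are made up for the example):

```python
import torch
import torch.nn as nn

# Two hidden layers, as in the diagram; sizes are arbitrary for illustration.
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 5))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

x = torch.randn(1, 10)      # input vector
target = torch.randn(1, 5)  # target

output = model(x)               # forward pass
loss = loss_fn(output, target)  # compare output and target to produce a loss
loss.backward()                 # backpropagate the loss to every weight and bias
optimizer.step()                # update all the parameters
optimizer.zero_grad()
```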

Now, what is fine-tuning? Fine-tuning means that we have a pre-trained model and we want to adapt it to data that the original model may have never seen. For example, imagine we work for a company that has built its own database, with its own SQL dialect, and we have downloaded a pre-trained model, let's say GPT, that was trained on a lot of programming languages. We want to fine-tune it on our own SQL dialect so that the model can help our users build queries for our database. What we used to do is train this entire model on the new data, altering all of its weights. However, this creates some problems.

The problem with full fine-tuning is that we must train the full network. First of all, this is computationally expensive for the average user, because you need to load the whole model into memory and run backpropagation on all of its weights. The storage requirements for the checkpoints are also expensive: we usually save a checkpoint for every epoch, and if we also save the optimizer state, for example with the Adam optimizer, which keeps statistics for each weight to better optimize the model, we are saving a lot of data. Finally, suppose we want to use the same base model fine-tuned on two different datasets, so we have two different fine-tuned models: switching between them is very expensive, because we need to unload the previous model and then load all the weights of the other fine-tuned model, replacing every weight matrix. LoRA gives us a better solution to these problems.
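As a small illustration of the optimizer-state cost mentioned above: Adam keeps two extra tensors (first- and second-moment estimates) for every parameter, so a checkpoint that includes optimizer state stores roughly three numbers for each weight.

```python
import torch

p = torch.nn.Parameter(torch.randn(1000, 1000))
opt = torch.optim.Adam([p])

p.sum().backward()
opt.step()

# Adam's per-parameter statistics have the same shape as the parameter itself.
state = opt.state[p]
print(state["exp_avg"].shape)     # torch.Size([1000, 1000])
print(state["exp_avg_sq"].shape)  # torch.Size([1000, 1000])
```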

In LoRA, things work differently. We start with an input and our pre-trained model, which we want to fine-tune, and we freeze its weights: we tell PyTorch to never touch these weights, to use them as read-only, and to never run backpropagation on them. Then we create two new matrices, B and A, for each of the matrices we want to train. In LoRA we don't have to create the matrices B and A for every layer of the original model; we can do it only for some layers, and we will see how later. For this example, suppose we have only one layer, and we introduce the matrices B and A.

What is the difference between B and A and the original matrix W? First of all, let's look at the dimensions. The original matrix is d by k; suppose d = 1,000 and k = 5,000. We want to create two new matrices that, when multiplied together, produce that same dimension d by k. In fact, we can see it here: a d by r matrix multiplied by an r by k matrix produces a d by k matrix, because the inner dimensions cancel out. We want r to be much smaller than d or k; we may even choose r = 1. If we choose r = 1, we will have one matrix that is d by 1, so 1,000 by 1, and another matrix that is 1 by 5,000. Now compare the number of parameters: the original matrix W has p = d × k = 5,000,000 numbers, while B and A together have only 1,000 + 5,000 = 6,000 numbers, with the advantage that, when multiplied together, they still produce a d by k matrix.

Of course, you may think that these two matrices cannot capture the same information as the original matrix W, because they are much smaller: even if their product has the same dimensions, it is a lower-rank representation, so we should lose some information. But this is the whole idea behind LoRA: the matrix W contains a lot of weights that are not actually meaningful for our purpose. They do not add any information to the model, because they are just combinations of the other weights; they are redundant. So we don't need the whole matrix W: we can create a low-rank representation of it and fine-tune that one.

Let's continue our journey with this model. We create the two matrices B and A, and we combine them with W: B multiplied by A has dimension d by k, so we can sum it with the original W. We produce the output, compare it with our usual target to calculate the loss, and backpropagate the loss only to the matrices we want to train, B and A. We never touch the W matrix, so the original pre-trained model stays frozen; we only modify B and A.
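Here is a small sketch of the shape and parameter arithmetic just described, using the same numbers:

```python
import torch

d, k, r = 1000, 5000, 1

W = torch.randn(d, k)   # frozen pre-trained weights: d x k = 5,000,000 numbers
B = torch.zeros(d, r)   # d x r = 1,000 numbers
A = torch.randn(r, k)   # r x k = 5,000 numbers

delta_W = B @ A         # (d x r) @ (r x k) -> d x k, same shape as W
assert delta_W.shape == W.shape

print(W.numel())              # 5000000
print(B.numel() + A.numel())  # 6000 trainable numbers instead of 5 million
```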
What are the benefits? First of all, as we saw before, we have fewer parameters to train and store: in the example above the original W matrix has five million parameters, while with r = 5, LoRA only has thirty thousand parameters in total, less than one percent of the original. Fewer parameters also means less storage and faster backpropagation, because we don't need to evaluate the gradient for most of the parameters. And we can easily switch between two fine-tuned models: imagine we have two different models, one for SQL and one for generating JavaScript code. We only need to reload these two small matrices to switch between them; we don't need to reload the W matrix, because it was never touched and is still the same as in the original pre-trained model.

Why does this work? The intuition, as written in the paper, is that pre-trained models have an intrinsic dimension that is smaller than their actual dimension; inspired by this, the authors hypothesize that the updates to the weights also have a low intrinsic rank during adaptation. What is the rank of a matrix? We will see it later with a practical example, but basically: imagine a matrix made of many column vectors; the rank of the matrix is the number of those vectors that are linearly independent from each other, meaning you cannot linearly combine some of them to produce another one. It also indicates how many columns are redundant, because they can be obtained by linearly combining the others. What the paper means is that the W matrix is rank-deficient: it does not have full rank. It may have dimension, say, 1,000 by 1,000, but if its actual rank is, let's say, 10, we can use much smaller matrices to capture most of its information. This idea of rank reduction is used in a lot of scenarios, for example in compression algorithms.

So let's review some mathematics of rank and matrix decomposition, and then we will check the LoRA implementation in PyTorch. Let's switch here. First, I will show you a very simple example of matrix decomposition: how a matrix can be rank-deficient, and how we can produce smaller matrices that capture most of its information. We start by importing two simple libraries, torch and numpy. Then I create a 10 by 10 matrix that is artificially rank-deficient: I build it in such a way that its rank is 2. So even if this matrix is 10 by 10, with 100 numbers, its rank is actually 2, and we can verify that using numpy. This means we can decompose it using an algorithm called SVD, singular value decomposition, which produces three matrices U, S, and V that, when multiplied together, give us back W. If we take only the first r columns of these matrices, where r is the rank of the original matrix, they capture most of the information of the original matrix. We can visualize that in a simple way: we calculate the B and A matrices from this decomposition, just like in the LoRA case, and we obtain a low-rank representation of the W matrix, which was originally 10 by 10, as two matrices that are 10 by 2 and 2 by 10. Then we take some random input, let's call it x, and a bias, and we compute the output using the original 10 by 10 matrix W: we multiply W by x and add the bias.
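The demo described here goes roughly like this (a sketch; the original notebook uses numpy for the rank check, but torch has the same operations):

```python
import torch

torch.manual_seed(0)
d, k, r = 10, 10, 2

# Build a 10x10 matrix that is rank-deficient by construction:
# the product of a 10x2 and a 2x10 matrix has rank at most 2.
W = torch.randn(d, r) @ torch.randn(r, k)
print(torch.linalg.matrix_rank(W))  # tensor(2)

# Singular value decomposition: W = U @ diag(S) @ Vh.
U, S, Vh = torch.linalg.svd(W)

# Keep only the first r singular directions.
B = U[:, :r] @ torch.diag(S[:r])  # 10 x 2
A = Vh[:r, :]                     # 2 x 10

x = torch.randn(k)
bias = torch.randn(d)

y = W @ x + bias              # output with the full matrix (100 numbers)
y_prime = (B @ A) @ x + bias  # output with the decomposition (40 numbers)
print(torch.allclose(y, y_prime, atol=1e-4))  # True
```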
We also compute the output using the B and A matrices that result from the decomposition: we calculate y′ as B multiplied by A, just like LoRA, multiplied by x, plus the bias, and we see that the output is the same, even though B and A have far fewer elements. (In the notebook I forgot to rename the variables: this is B and this is A.) Now, this is not a proof, because I created this W matrix and made it rank-deficient artificially (I took this code from somewhere, I don't remember where); the idea is simply that a smaller pair of matrices can produce the same output for the same given input while using far fewer parameters. As you can see, the B and A matrices combined have 40 elements, while the original matrix had 100, and they still produce the same output for the same input, which means that B and A captured the most important information in W.

Now let's go to LoRA, and let's implement it step by step. We will do a classification task: imagine we have a simple neural network for classifying MNIST digits, and we notice that the performance on one specific digit is not very good, so we want to fine-tune the model on that digit only. We will use LoRA and show that when we fine-tune with LoRA, we modify, and therefore need to save, only a very small number of parameters compared to the pre-trained model.

Let's start. We import the usual libraries, torch and matplotlib (which we will actually not need), and tqdm for visualizing the progress bar. We make everything deterministic so it always returns the same results, and we load the MNIST dataset, which is already integrated into torchvision, so it's not a big deal, and we create the data loader. Then we create a deliberately unoptimized neural network for classifying the digits. This is a very big network for the task; we don't need such a big network, but I made it big on purpose, because I want to show the savings in parameters that we get. I call it RichBoyNet, because daddy's got money, so I don't care about efficiency, right? It's a very simple network made of three linear layers with ReLU activations, and the final layer classifies the digit into one of the ten categories, 0 up to 9.

We create this network and train it on MNIST for only one epoch, just a simple classification training loop. Then we keep a copy of the original weights, because we will need it later to prove that LoRA does not modify the weights of the pre-trained model. We can also test the pre-trained model and check its accuracy: the accuracy is very high, but we can see that for the digit 9 it is not as good as for the other digits, so maybe we want to fine-tune especially on the digit 9. In the paper, LoRA was applied to large language models, which I cannot do because I don't have the computational resources; that's why I'm using MNIST and this very simple example. Anyway, we have one digit that we want to fine-tune. Before we do any fine-tuning, let's visualize how many parameters we have in the network we created.
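A sketch of the network described above; the hidden sizes (1,000 and 2,000) are assumptions, chosen so that the parameter count matches the 2,807,010 total reported below:

```python
import torch
import torch.nn as nn

class RichBoyNet(nn.Module):
    def __init__(self, hidden1=1000, hidden2=2000):
        super().__init__()
        self.linear1 = nn.Linear(28 * 28, hidden1)  # MNIST images are 28x28
        self.linear2 = nn.Linear(hidden1, hidden2)
        self.linear3 = nn.Linear(hidden2, 10)       # 10 digit classes
        self.relu = nn.ReLU()

    def forward(self, img):
        x = img.view(-1, 28 * 28)
        x = self.relu(self.linear1(x))
        x = self.relu(self.linear2(x))
        return self.linear3(x)

net = RichBoyNet()
print(sum(p.numel() for p in net.parameters()))  # 2807010
```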
In layer 1 we have this weights matrix and this bias; the same for layer 2 and layer 3. In total we have 2,807,010 parameters.

Now let's introduce LoRA. As we saw before, LoRA introduces two matrices, called A and B, and their sizes are as follows: if the original weights matrix is d by k, then B is d by r and A is r by k (in the code I just call these dimensions features_in and features_out). In the paper it is written that they initialize the B matrix with zeros and the A matrix with a random Gaussian initialization, and this is what I do here as well. They also introduce a scale parameter, described in section 4.1 of the paper, which allows you to change the rank without changing the scale of the numbers: alpha is fixed, and if you want to try the same model with different ranks, the scale keeps the magnitude of the values the same.

If LoRA is enabled, we modify only the weights matrix, not the bias, because in the paper they also leave the bias alone. The weights become the original weights plus B multiplied by A, multiplied by the scale, just like in the paper: instead of multiplying x only by W, as in the original network, we multiply it by W plus delta W, where delta W, the amount the weights have moved because of the fine-tuning, is B multiplied by A. We can see this written in the paper; let's go down, it's written here.

We add this parametrization to our network using a special PyTorch feature called parametrization (if you want more information on how it works, this is the link, but I will briefly introduce it). Parametrization allows us to replace the weights matrix of, in this case, the linear layer with a function: every time the network wants to access the weights matrix, it will not access the matrix directly, it will call this function, which is our LoRA parametrization. The function receives the original weights and alters them by introducing the B and A matrices. PyTorch then keeps doing its usual work, multiplying the weights by x, except that the weights are now the original weights plus B and A combined according to the paper. We can also easily enable or disable LoRA in each layer by modifying the enabled property, which we can see here: if it is enabled, we use the B and A matrices, so the model behaves like the fine-tuned one; if it is disabled, we use only the original weights, so the model behaves just like the pre-trained model.

We can also visualize the parameters added by LoRA. In the original layers 1, 2, and 3, we only had the weights and the bias; now we also have the lora_A and lora_B matrices. I chose a rank of 1, which I defined here.
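A sketch of this parametrization, assuming the RichBoyNet layers from the snippet above (the class name and its arguments are illustrative; see the pytorch-lora repository for the real code):

```python
import torch
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize

class LoRAParametrization(nn.Module):
    def __init__(self, rows, cols, rank=1, alpha=1):
        super().__init__()
        # Section 4.1 of the paper: B starts at zero and A is Gaussian,
        # so B @ A is zero and fine-tuning starts from the pre-trained model.
        self.lora_B = nn.Parameter(torch.zeros(rows, rank))
        self.lora_A = nn.Parameter(torch.randn(rank, cols))
        self.scale = alpha / rank
        self.enabled = True

    def forward(self, original_weights):
        if self.enabled:
            # W + (B @ A) * scale, as in the paper
            return original_weights + (self.lora_B @ self.lora_A) * self.scale
        return original_weights  # behave exactly like the pre-trained model

# Replace every access to `weight` with the function above; the bias is
# left untouched, as in the paper.
for layer in [net.linear1, net.linear2, net.linear3]:
    rows, cols = layer.weight.shape
    parametrize.register_parametrization(
        layer, "weight", LoRAParametrization(rows, cols, rank=1)
    )
```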
With rank 1, the B matrix for the first layer is 1,000 by 1, because the weight matrix is 1,000 by 784, and 1,000 by 1 multiplied by 1 by 784 gives the same dimension as the weight matrix; we do the same for all the layers. So in the original model, without LoRA, we had 2,807,010 parameters; by adding the LoRA matrices we have 2,813,804 parameters, but only 6,794 of them, the ones introduced by LoRA, will actually be trained. All the others will not be trained, and to do that we freeze the non-LoRA parameters: you can see here the code that freezes them by just setting requires_grad to False.

Then we fine-tune the model only on the digit 9, because, as I showed you, we want to improve the accuracy of that digit, so we don't fine-tune on anything else. We have a pre-trained model that was trained on all the digits, and now we fine-tune it only on the digit 9, hoping to improve its accuracy there, maybe at the cost of the accuracy of the other digits. I fine-tune for only 100 batches, because I don't want to alter the model too much, and the training is very fast.

Then I want to show you that the frozen parameters are still unchanged by the fine-tuning: the frozen parameters are these ones, and they are still the same as the original weights that we saved after pre-training the model (we actually cloned them so they would not get altered). Then we enable LoRA, and when we access the weights, PyTorch actually replaces them with the original weights plus B multiplied by A multiplied by the scale, according to the formula we defined: every time PyTorch tries to access the weight matrix, it runs that function, which returns the original weights plus B times A times the scale. If we disable LoRA, the parametrization just returns the original weights, because that is what we wrote: when LoRA is disabled, return the original weights.

So now we can enable LoRA and test the model, and we see that the digit 9 now performs much better, although of course we lost some accuracy on the other digits. And if we disable LoRA, the model behaves exactly the same as the pre-trained model, without any fine-tuning: these numbers are the same as the pre-trained model's here; the wrong count for the digit 0 was 33 and the wrong count for the digit 9 was 107, the same as before. When we disable LoRA, the model behaves exactly like the pre-trained model; when we enable it, the B and A matrices make it behave like the fine-tuned one.

And the best thing about LoRA is that we never altered the original weights: the only weights we altered are the B and A matrices, whose dimensions are much smaller than the W matrix. So if we want to save this fine-tuned model, we only need to save these 6,794 numbers instead of 2.8 million. We can fine-tune many versions of this model and easily switch between them just by swapping the B and A matrices in the parametrization; we don't need to reload all the W matrices of the original pre-trained model.
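Sketches of the two pieces just mentioned, freezing the non-LoRA parameters and toggling the parametrization (assuming the registration from the previous snippet):

```python
# Freeze everything that is not a LoRA matrix: only B and A get gradients.
# After register_parametrization, the original weights are exposed under
# names like "linear1.parametrizations.weight.original".
for name, param in net.named_parameters():
    if "lora" not in name:
        param.requires_grad = False

def enable_disable_lora(enabled=True):
    for layer in [net.linear1, net.linear2, net.linear3]:
        layer.parametrizations["weight"][0].enabled = enabled

enable_disable_lora(True)   # weights = W + (B @ A) * scale: fine-tuned behavior
enable_disable_lora(False)  # weights = W: identical to the pre-trained model
```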
And this is the power of LoRA. I hope this video was clear; I try to make videos that are both theoretical and practical, so please let me know in the comments if there is something you would like explained a little better. You can use my repository, pytorch-lora on my account, and play with it: try different ranks, or different models, it's very easy. I also suggest you read about this parametrize function of PyTorch, because it makes it very easy to introduce different kinds of parametrizations and to play with the parametrization of a neural network. Thank you again for listening; I hope you enjoyed the video, and please come back to my channel for more videos about machine learning and deep learning.