English (unofficial) translations of posts at kexue.fm

LoRA from a Gradient Perspective: Introduction, Analysis, Conjectures, and Generalization

Translated by Gemini Flash 3.0 Preview. The translation may be inaccurate; please refer to the original post for anything important.

With the popularity of ChatGPT and its alternatives, various parameter-efficient fine-tuning methods have also "risen with the tide." One of the most popular solutions is the subject of this article: LoRA, which originates from the paper "LoRA: Low-Rank Adaptation of Large Language Models". The LoRA method is relatively simple and direct, and there are many existing implementations. Since it is easy to understand and use, there isn’t much left to write about it in detail.

However, directly implementing LoRA requires modifying the network structure, which is slightly cumbersome. At the same time, LoRA feels very similar to the previous optimizer AdaFactor. Therefore, my question is: Can we analyze and implement LoRA from an optimizer perspective? This article revolves around this theme.

Method Introduction

Some previous results (such as "Exploring Universal Intrinsic Task Subspace via Prompt Tuning") show that although pre-trained models have a large number of parameters, the intrinsic dimension corresponding to each downstream task is not large. In other words, theoretically, we can achieve good results on downstream tasks by fine-tuning a very small number of parameters.

Drawing on these results, LoRA proposes that for a pre-trained parameter matrix W_0 \in \mathbb{R}^{n \times m}, we do not directly fine-tune W_0. Instead, we assume a low-rank decomposition for the increment: \begin{equation} W = W_0 + A B,\qquad A \in \mathbb{R}^{n \times r}, B \in \mathbb{R}^{r \times m} \end{equation} where one of A or B is initialized to all zeros, W_0 remains fixed, and the optimizer only optimizes A and B. Due to the conclusion that the intrinsic dimension is very small, we can set r to be very small; r=8 is common, and in extreme cases, we can even set r=1. Thus, LoRA is a parameter-efficient fine-tuning method, as the number of optimized parameters is significantly reduced.
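As a concrete sketch of this parameterization (the sizes and initialization scale below are illustrative choices, not prescribed by the paper), the forward pass of a LoRA-adapted linear layer takes only a few lines of NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 6, 4, 2                 # toy sizes; in practice r << min(n, m)

W0 = rng.standard_normal((n, m))        # frozen pre-trained weight
A = rng.standard_normal((n, r)) * 0.01  # small random init
B = np.zeros((r, m))              # zero init, so training starts at W = W0

def lora_forward(X, W0, A, B):
    # Compute X W0 + (X A) B rather than X (W0 + A B),
    # so the full n x m product A B is never materialized.
    return X @ W0 + (X @ A) @ B

X = rng.standard_normal((3, n))
Y = lora_forward(X, W0, A, B)
# With B = 0 the adapted layer reproduces the pre-trained layer exactly
assert np.allclose(Y, X @ W0)
```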

(A diagram in the original post illustrates the concept here: the frozen W_0 added to the product of the trainable low-rank factors A and B.)

Gradient Analysis

As mentioned in "Ladder Side-Tuning: A ’Ladder’ for Pre-trained Models", many parameter-efficient fine-tuning methods only reduce memory requirements and do not reduce the amount of computation. Is LoRA an exception? How efficient is it in terms of memory and computation? Let’s analyze it.

First, we know that the memory consumed during model training comes from four parts: model parameters, model gradients, model activations, and optimizer states. LoRA reduces the number of model parameters through low-rank decomposition, so the gradients and optimizer states will also decrease accordingly, leading to obvious memory savings. But can it save computation?

This depends on the implementation of LoRA. Different implementations have different complexities for calculating gradients. Two equivalent implementations of LoRA are as follows: \begin{align} Y &= XW = X(W_0 + AB) \label{eq:lora-1} \\[5pt] Y &= XW_0 + XAB = XW_0 + ZB \label{eq:lora-2} \end{align} where X \in \mathbb{R}^{b \times n} is the model input and Z = XA \in \mathbb{R}^{b \times r} is an intermediate output.

For implementation [eq:lora-1], we have: \begin{equation} \frac{\partial \mathcal{L}}{\partial A} = \frac{\partial \mathcal{L}}{\partial W} B^{\top} = \left(X^{\top}\frac{\partial \mathcal{L}}{\partial Y}\right) B^{\top},\quad \frac{\partial \mathcal{L}}{\partial B} = A^{\top}\frac{\partial \mathcal{L}}{\partial W} = A^{\top}\left(X^{\top}\frac{\partial \mathcal{L}}{\partial Y}\right) \label{eq:grad-1} \end{equation} where \mathcal{L} is the loss function. Obviously, this implementation requires calculating the full gradient \frac{\partial \mathcal{L}}{\partial W} \in \mathbb{R}^{n \times m} before calculating the gradients for A and B. This means it is even slower than not using LoRA and consumes more memory.

For implementation [eq:lora-2], we have: \begin{equation} \frac{\partial \mathcal{L}}{\partial A} = X^{\top}\frac{\partial \mathcal{L}}{\partial Z} = X^{\top}\left(\frac{\partial \mathcal{L}}{\partial Y} B^{\top}\right),\quad \frac{\partial \mathcal{L}}{\partial B} = Z^{\top}\frac{\partial \mathcal{L}}{\partial Y} = (XA)^{\top}\frac{\partial \mathcal{L}}{\partial Y} \label{eq:grad-2} \end{equation} In this case, Z, \frac{\partial \mathcal{L}}{\partial Z} \in \mathbb{R}^{b \times r}, which is significantly smaller than the full gradient, and the computational complexity is also markedly reduced. Therefore, for LoRA to maximize memory and computational savings, it is crucial to implement it according to [eq:lora-2] rather than [eq:lora-1].
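The two gradient formulas are algebraically identical but differ sharply in cost. A small NumPy check (with a random stand-in for \partial \mathcal{L}/\partial Y, since only the shapes and the algebra matter here) confirms they agree:

```python
import numpy as np

rng = np.random.default_rng(1)
b, n, m, r = 5, 8, 7, 2
X = rng.standard_normal((b, n))
A = rng.standard_normal((n, r))
B = rng.standard_normal((r, m))
dY = rng.standard_normal((b, m))  # stand-in for dL/dY from backprop

# Implementation (1): goes through the full gradient dL/dW (n x m),
# as large and as expensive as in full fine-tuning
dW = X.T @ dY
dA_1 = dW @ B.T
dB_1 = A.T @ dW

# Implementation (2): only b x r intermediates, dL/dW is never formed
Z = X @ A
dA_2 = X.T @ (dY @ B.T)
dB_2 = Z.T @ dY

assert np.allclose(dA_1, dA_2) and np.allclose(dB_1, dB_2)
```

The equality is just associativity of matrix products; the saving comes entirely from the order of multiplication.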

(Note: Regarding matrix gradient calculations, we can "assemble" them based on the chain rule and output shapes. For example, for \frac{\partial \mathcal{L}}{\partial A}, we know from the chain rule it must be a product of \frac{\partial \mathcal{L}}{\partial W} and B in some way. We agree that the shape of \frac{\partial \mathcal{L}}{\partial A} matches A, i.e., n \times r. To get an n \times r result from \frac{\partial \mathcal{L}}{\partial W} and B, the only way is \frac{\partial \mathcal{L}}{\partial W} B^{\top}.)

Other Reasons

Besides the benefits of low-rank decomposition, the following points also contribute to LoRA’s memory savings and speedup:

1. Updating only part of the parameters: For example, the original LoRA paper chose to update only the parameters of the Self-Attention mechanism. In practice, we can also choose to update parameters only in certain layers.

2. Reduced communication time: Since the number of updated parameters is smaller, the amount of data to be transmitted (especially in multi-GPU training) is reduced, thereby decreasing transmission time.

3. Use of low-precision acceleration techniques, such as FP16, FP8, or INT8 quantization.

Of course, these three reasons can indeed speed up training, but they are not unique to LoRA. In fact, almost all parameter-efficient methods share these characteristics. LoRA’s prominent advantages are its intuitive low-rank decomposition, its performance being consistent with full fine-tuning in many scenarios, and the fact that W_0, A, B can be merged into a single matrix during the inference stage, thus adding no inference cost.

Optimization Perspective

Gradient [eq:grad-1] also tells us how to implement LoRA from an optimizer perspective. The optimizer can directly access the full gradient \frac{\partial \mathcal{L}}{\partial W}, and we only need to project the gradient according to formula [eq:grad-1] to obtain the gradients for A and B. Then, we can implement the updates for A and B following a standard optimizer implementation.

If the optimizer is SGD, the updates are: \begin{equation} \begin{aligned} A_{t+1} &= A_t - \eta\frac{\partial \mathcal{L}}{\partial W_t} B_t^{\top},\quad B_{t+1} = B_t - \eta A_t^{\top}\frac{\partial \mathcal{L}}{\partial W_t} \\[5pt] W_{t+1} &= W_0 + A_{t+1} B_{t+1} = W_t + (A_{t+1} B_{t+1} - A_t B_t) \end{aligned} \end{equation} If it is an optimizer with moving averages like Adam, we only need to track the moving averages of the projected gradients. This reduces the number of optimizer states and saves some memory; the larger the model, the larger the fraction of total memory these states occupy, and hence the larger the saving.
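A minimal sketch of this optimizer-side view (toy sizes, random stand-in gradient), together with the optimizer-state saving for Adam-style moments:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, r, eta = 6, 5, 2, 0.1
A = rng.standard_normal((n, r))
B = rng.standard_normal((r, m))
G = rng.standard_normal((n, m))   # stand-in for the full gradient dL/dW_t

def lora_sgd_step(A, B, G, eta):
    # Project the full gradient onto the factors, then take plain SGD steps.
    # Note: both projections use the factors from time t.
    return A - eta * G @ B.T, B - eta * A.T @ G

A1, B1 = lora_sgd_step(A, B, G, eta)
assert A1.shape == (n, r) and B1.shape == (r, m)

# Adam-style moments are kept only for the projected gradients, so the
# optimizer state per matrix shrinks from n*m to n*r + r*m entries:
N = M = 4096; R = 8
ratio = (N * R + R * M) / (N * M)  # = r(n+m)/(nm); here 2r/m = 1/256
assert abs(ratio - 1 / 256) < 1e-12
```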

LoRA stipulates that either A or B is initialized with zeros to ensure the initial state of the model is consistent with the pre-trained model. However, this also introduces an asymmetry (one is zero, the other is non-zero). In fact, it is also possible to initialize both A and B with non-zero values; one just needs to subtract A_0 B_0 from the pre-trained weights beforehand, or equivalently, parameterize W as: \begin{equation} W = W_0 - A_0 B_0 + A B \end{equation} This maintains initial state consistency while allowing both A and B to be initialized with non-zero values, enhancing symmetry.
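A quick numerical check of this reparameterization (toy sizes): subtracting A_0 B_0 once up front keeps the initial weights exactly at W_0 even though both factors are non-zero:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, r = 6, 5, 2
W0 = rng.standard_normal((n, m))
A0 = rng.standard_normal((n, r))  # both factors non-zero at initialization
B0 = rng.standard_normal((r, m))

# Parameterize W = (W0 - A0 B0) + A B; at t = 0, A = A0 and B = B0,
# so the model starts exactly at the pre-trained weights:
W_init = (W0 - A0 @ B0) + A0 @ B0
assert np.allclose(W_init, W0)
```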

Random Projection

If we expand the update amount A_{t+1} B_{t+1} - A_t B_t in the SGD scenario, the result is: \begin{equation} - \eta\left(\frac{\partial \mathcal{L}}{\partial W_t} B_t^{\top} B_t + A_t A_t^{\top}\frac{\partial \mathcal{L}}{\partial W_t}\right) + \eta^2 \frac{\partial \mathcal{L}}{\partial W_t} B_t^{\top} A_t^{\top}\frac{\partial \mathcal{L}}{\partial W_t} \end{equation} Assuming the \eta^2 term is a negligible higher-order term, we are left with: \begin{equation} - \eta\left(\frac{\partial \mathcal{L}}{\partial W_t} B_t^{\top} B_t + A_t A_t^{\top}\frac{\partial \mathcal{L}}{\partial W_t}\right) \end{equation} From this perspective, compared to full fine-tuning with SGD, LoRA replaces the full gradient \frac{\partial \mathcal{L}}{\partial W_t} with the result in the parentheses.

For simplicity, let’s consider the case where r=1. Note that in the above formula, the projection vectors A_t, B_t at time t depend on t. What would happen if we replaced them with random vectors that do not depend on t (re-randomized at each training step)? Consider u, v \sim \mathcal{N}(0,1), where u \in \mathbb{R}^{n \times 1}, v \in \mathbb{R}^{1 \times m}. The update amount becomes: \begin{equation} - \eta\left(\frac{\partial \mathcal{L}}{\partial W_t} v^{\top} v + u u^{\top}\frac{\partial \mathcal{L}}{\partial W_t}\right) \end{equation} It can be proven that: \begin{equation} \mathbb{E}_{u\sim \mathcal{N}(0,1)}[u u^{\top}] = I_{n\times n},\quad \mathbb{E}_{v\sim \mathcal{N}(0,1)}[v^{\top} v] = I_{m\times m} \end{equation} where I_{n\times n}, I_{m\times m} are identity matrices. Therefore, similar to "zero-order gradients," in an average sense, this LoRA that re-initializes at every step is actually equivalent to full-rank SGD. However, implementing it this way would likely be slower than full-rank SGD. Its purpose is not speed, but rather to mitigate catastrophic forgetting—by using low-rank updates for individual batches, it reduces the impact on the overall model weights. Of course, this is just a conjecture; the actual effect has not been tested by the author.
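The identity \mathbb{E}[u u^{\top}] = I is easy to verify by Monte Carlo (a sketch; the sample count and tolerance are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, samples = 4, 200_000
U = rng.standard_normal((samples, n))  # each row is one draw of u

est = U.T @ U / samples  # Monte Carlo estimate of E[u u^T]
assert np.allclose(est, np.eye(n), atol=0.05)
```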

A Variant

Again, let’s first consider the r=1 case. LoRA assumes \Delta w_{i,j} = u_i v_j. Can we make other low-rank decomposition assumptions? For example, \Delta w_{i,j} = u_i + v_j? In matrix form, this is: \begin{equation} W = W_0 + A \mathbbm{1}_{1\times m} + \mathbbm{1}_{n\times 1} B,\qquad A\in\mathbb{R}^{n\times 1},B\in\mathbb{R}^{1\times m} \end{equation} where \mathbbm{1}_{1\times m}, \mathbbm{1}_{n\times 1} are all-ones matrices. The gradients are easily found to be: \begin{equation} \frac{\partial \mathcal{L}}{\partial A} = \frac{\partial \mathcal{L}}{\partial W} \mathbbm{1}_{m\times 1},\quad \frac{\partial \mathcal{L}}{\partial B} = \mathbbm{1}_{1\times n}\frac{\partial \mathcal{L}}{\partial W} \end{equation} This is simply the row-wise and column-wise sums of the original gradient. Compared to the original LoRA, this additive decomposition has two advantages: 1. Addition is computationally cheaper than multiplication, and the gradient form is simpler; 2. While the rank of AB is always 1, the rank of A \mathbbm{1}_{1\times m} + \mathbbm{1}_{n\times 1} B can be 2. If rank represents model capacity, then for the same number of parameters, the additive version might have stronger expressive power. As for the actual effect, I will conduct comparative experiments when I use LoRA in the future.
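Both claims are easy to check numerically: the gradient projections of the additive variant are just row and column sums, and its increment is generically rank 2 (a sketch with random toy matrices):

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 6, 5
a = rng.standard_normal((n, 1))   # "A" in the r = 1 case
b = rng.standard_normal((1, m))   # "B" in the r = 1 case

mult_delta = a @ b                                     # original LoRA
add_delta = a @ np.ones((1, m)) + np.ones((n, 1)) @ b  # additive variant
assert np.linalg.matrix_rank(mult_delta) == 1
assert np.linalg.matrix_rank(add_delta) == 2  # generically rank 2

# The gradient projections are just row / column sums of dL/dW:
dW = rng.standard_normal((n, m))
assert np.allclose(dW @ np.ones((m, 1)), dW.sum(axis=1, keepdims=True))
assert np.allclose(np.ones((1, n)) @ dW, dW.sum(axis=0, keepdims=True))
```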

Can the additive decomposition be generalized to r > 1? Yes, but it requires a bit of technique. Assuming m, n are divisible by r, we can change the parameterization to: \begin{equation} W = W_0 + A I_{r(1\times m/r)} + I_{r(n/r\times 1)} B,\qquad A\in\mathbb{R}^{n\times r},B\in\mathbb{R}^{r\times m} \end{equation} Here, I_{r(1\times m/r)} and I_{r(n/r\times 1)} refer to block matrices of sizes 1 \times m/r and n/r \times 1 respectively, where each block is an r \times r identity matrix. Essentially, this treats A and B as n/r \times 1 and 1 \times m/r block matrices and applies the r=1 logic.
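These block matrices can be built with a Kronecker product; multiplying by them simply tiles A horizontally and stacks B vertically (a sketch with toy sizes where r divides both n and m):

```python
import numpy as np

n, m, r = 6, 9, 3  # r must divide both n and m
P = np.kron(np.ones((1, m // r)), np.eye(r))  # r x m, "I_{r(1 x m/r)}"
Q = np.kron(np.ones((n // r, 1)), np.eye(r))  # n x r, "I_{r(n/r x 1)}"

rng = np.random.default_rng(6)
A = rng.standard_normal((n, r))
B = rng.standard_normal((r, m))

# A P tiles A horizontally m/r times; Q B stacks B vertically n/r times
assert np.allclose(A @ P, np.hstack([A] * (m // r)))
assert np.allclose(Q @ B, np.vstack([B] * (n // r)))

delta = A @ P + Q @ B  # the additive increment for r > 1
assert delta.shape == (n, m)
```

For r = 1 these reduce to the all-ones matrices of the r = 1 parameterization.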

Article Summary

This article introduces an understanding of LoRA from a gradient perspective. In addition to the basic introduction, it includes some of the author’s conjectures and generalizations for the reader’s reference.

Original address: https://kexue.fm/archives/9590

For more details on reprinting, please refer to: "Scientific Space FAQ"