In the Chinese-speaking community, this site was likely among the earliest to pay attention to Linear Attention. When I wrote the first related blog post in 2020, "Exploration of Linear Attention: Does Attention Necessarily Need a Softmax?", the mainstream discussion was still focused on BERT-related Softmax Attention. In hindsight, considering Linear Attention during the BERT era was not particularly wise; training lengths were relatively short, and models were primarily Encoders, meaning Linear Attention offered essentially no advantage. I previously expressed this view in "Linear Transformer Might Not Be the Model You Are Waiting For".
It wasn’t until the emergence of ChatGPT, which forced everyone toward Decoder-only generative models, that Linear Attention found a perfect match in its RNN-like form. Simultaneously, the pursuit of longer training sequences made the quadratic complexity bottleneck of Softmax Attention increasingly apparent. In this new context, Linear Attention has become more competitive, even showing signs of "reciprocating" (feeding back improvements to) Softmax Attention.
Quadratic Complexity
First, let us introduce some notation: \begin{equation} \begin{gathered} \boldsymbol{q}_i, \boldsymbol{k}_i, \boldsymbol{v}_i, \boldsymbol{o}_i \in \mathbb{R}^{d \times 1} \\[6pt] \boldsymbol{Q}=[\boldsymbol{q}_1, \boldsymbol{q}_2, \cdots, \boldsymbol{q}_n]^{\top} \in \mathbb{R}^{n \times d} \\[6pt] \boldsymbol{K}=[\boldsymbol{k}_1, \boldsymbol{k}_2, \cdots, \boldsymbol{k}_n]^{\top} \in \mathbb{R}^{n \times d} \\[6pt] \boldsymbol{V}=[\boldsymbol{v}_1, \boldsymbol{v}_2, \cdots, \boldsymbol{v}_n]^{\top} \in \mathbb{R}^{n \times d} \\[6pt] \boldsymbol{O}=[\boldsymbol{o}_1, \boldsymbol{o}_2, \cdots, \boldsymbol{o}_n]^{\top} \in \mathbb{R}^{n \times d} \\[6pt] \end{gathered} \end{equation} An Attention model is essentially a mapping \boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V} \to \boldsymbol{O}. This article primarily focuses on the Causal scenario, which means \boldsymbol{o}_t is at most related to \boldsymbol{Q}_{[:t]}, \boldsymbol{K}_{[:t]}, \boldsymbol{V}_{[:t]}. In principle, the dimension d of \boldsymbol{Q}, \boldsymbol{K} and \boldsymbol{V}, \boldsymbol{O} can differ—as seen in GAU and MLA—but simplifying them to the same dimension does not change the essence of the problem.
Standard Softmax Attention usually refers to the mechanism proposed in "Attention is All You Need": \begin{equation} \boldsymbol{O} = \mathop{\text{softmax}}(\boldsymbol{Q}\boldsymbol{K}^{\top} + \log \boldsymbol{M})\boldsymbol{V} \end{equation} The scaling factor 1/\sqrt{d} is omitted here as it can always be absorbed into \boldsymbol{Q} and \boldsymbol{K}. The \mathop{\text{softmax}} operation performs exponential normalization over the second dimension, and \boldsymbol{M} \in \mathbb{R}^{n \times n} is a lower triangular matrix called the mask matrix, defined as: \begin{equation} M_{i,j} = \left\{\begin{aligned} &1, &i \geq j \\ &0, &i < j\end{aligned}\right. \end{equation} \log \boldsymbol{M} refers to the element-wise logarithm of \boldsymbol{M}, where \log 0 = -\infty. Written in component form, Softmax Attention is: \begin{equation} \boldsymbol{o}_t = \frac{\sum_{j=1}^t \exp(\boldsymbol{q}_t^{\top}\boldsymbol{k}_j) \boldsymbol{v}_j}{\sum_{j=1}^t \exp(\boldsymbol{q}_t^{\top}\boldsymbol{k}_j) } \end{equation} The denominator primarily serves numerical stability; furthermore, if we apply RMSNorm to \boldsymbol{O}, the denominator is automatically canceled out. Thus, the core of Softmax Attention is the numerator: \begin{equation} \boldsymbol{O} = \exp(\boldsymbol{Q}\boldsymbol{K}^{\top} + \log \boldsymbol{M})\boldsymbol{V} = (\exp(\boldsymbol{Q}\boldsymbol{K}^{\top})\odot \boldsymbol{M})\boldsymbol{V} \end{equation} where \odot is the Hadamard product and \exp is element-wise. It is easy to see that the denominator is equivalent to replacing \boldsymbol{V} with an n \times 1 matrix of ones. Standard implementations of Softmax Attention require computing the n \times n matrix \exp(\boldsymbol{Q}\boldsymbol{K}^{\top}), so both space and time complexity are proportional to n^2. While Flash Attention reduced space requirements, the quadratic time complexity remains unavoidable.
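For concreteness, here is a minimal NumPy sketch of the causal Softmax Attention above (the 1/\sqrt{d} scaling is omitted, matching the text). It materializes the n \times n matrix, which is exactly the quadratic cost under discussion; function name and shapes are mine, for illustration only.

```python
import numpy as np

def softmax_attention(Q, K, V):
    n = Q.shape[0]
    # log M: 0 where i >= j, -inf where i < j (the causal mask)
    log_M = np.where(np.tril(np.ones((n, n))) > 0, 0.0, -np.inf)
    logits = Q @ K.T + log_M
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability only
    A = np.exp(logits)
    A /= A.sum(axis=1, keepdims=True)             # softmax over the second dimension
    return A @ V                                  # the n x n matrix A is the quadratic cost

n, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))
print(softmax_attention(Q, K, V).shape)           # (8, 4)
```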
The Original Form
The earliest approach to Linear Attention was primarily to imitate and approximate Softmax Attention. The simplest solution was to directly remove the \exp, yielding: \begin{equation} \boldsymbol{O} = (\boldsymbol{Q}\boldsymbol{K}^{\top}\odot \boldsymbol{M})\boldsymbol{V} \label{eq:linear-attn} \end{equation} For simplicity, we assume matrix multiplication has higher priority than the Hadamard product, allowing us to omit a set of parentheses. Why is this form "Linear" Attention? To understand this quickly, consider the non-causal version without \odot \boldsymbol{M}. In that case, \boldsymbol{O} = (\boldsymbol{Q}\boldsymbol{K}^{\top})\boldsymbol{V} = \boldsymbol{Q}(\boldsymbol{K}^{\top}\boldsymbol{V}). Note that computing \boldsymbol{K}^{\top}\boldsymbol{V} has a complexity of \mathcal{O}(nd^2), resulting in a d \times d matrix. Multiplying this by \boldsymbol{Q} also has a complexity of \mathcal{O}(nd^2). Thus, the complexity depends linearly on n.
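A quick numerical sanity check of this associativity argument (a toy NumPy sketch, not an optimized implementation):

```python
import numpy as np

n, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))

O_quadratic = (Q @ K.T) @ V   # forms an n x n matrix: O(n^2 d)
O_linear = Q @ (K.T @ V)      # forms a d x d matrix:   O(n d^2)
print(np.allclose(O_quadratic, O_linear))  # True, up to floating point
```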
As for the Causal version in Eq. [eq:linear-attn], we can understand it through its component form: \begin{equation} \boldsymbol{o}_t = \sum_{j=1}^t \boldsymbol{v}_j (\boldsymbol{k}_j^{\top} \boldsymbol{q}_t) = \sum_{j=1}^t (\boldsymbol{v}_j \boldsymbol{k}_j^{\top}) \boldsymbol{q}_t = \left(\sum_{j=1}^t \boldsymbol{v}_j \boldsymbol{k}_j^{\top}\right) \boldsymbol{q}_t \end{equation} If we denote the term in parentheses as \boldsymbol{S}_t, we have: \begin{equation} \boldsymbol{o}_t = \boldsymbol{S}_t \boldsymbol{q}_t, \qquad \boldsymbol{S}_t = \boldsymbol{S}_{t-1} + \boldsymbol{v}_t \boldsymbol{k}_t^{\top} \label{eq:linear-attn-rnn} \end{equation} Evidently, causal Linear Attention can be written as a linear RNN with \boldsymbol{S}_t as the state. Consequently, the complexity of each step is constant, and the total complexity is proportional to the sequence length n. Note the term "Linear RNN" here: it is the broader concept, and Linear Attention is one kind of Linear RNN. Linear RNNs also developed independently for a time (e.g., LRU, SSM), but recently, the most competitive linear architectures have taken the form of Linear Attention.
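The recurrent view in Eq. [eq:linear-attn-rnn] is easy to spell out in code. Below is a minimal NumPy sketch that runs the d \times d state update step by step and checks it against the masked matrix form of Eq. [eq:linear-attn]; helper names are mine.

```python
import numpy as np

def linear_attention_recurrent(Q, K, V):
    n, d = Q.shape
    S = np.zeros((d, d))
    O = np.zeros_like(V)
    for t in range(n):
        S = S + np.outer(V[t], K[t])   # S_t = S_{t-1} + v_t k_t^T
        O[t] = S @ Q[t]                # o_t = S_t q_t
    return O

n, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))
M = np.tril(np.ones((n, n)))
print(np.allclose(linear_attention_recurrent(Q, K, V),
                  (Q @ K.T * M) @ V))  # True: matches Eq. (eq:linear-attn)
```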
Early Linear Attention also exhibited obvious traits of imitating Softmax Attention, such as adding a denominator to Eq. [eq:linear-attn] for normalization. To normalize, \boldsymbol{k}_j^{\top} \boldsymbol{q}_t had to be non-negative, leading to the addition of non-negative activation functions to \boldsymbol{Q} and \boldsymbol{K}. Works like Performer and RFA were built specifically to approximate \exp(\boldsymbol{Q}\boldsymbol{K}^{\top}).
However, later research such as "The Devil in Linear Transformer" found that normalizing along the sequence length dimension does not completely avoid numerical instability. It is better to normalize after the fact, such as: \begin{equation} \boldsymbol{O} = \mathop{\text{RMSNorm}}((\boldsymbol{Q}\boldsymbol{K}^{\top}\odot \boldsymbol{M})\boldsymbol{V}) \end{equation} Since normalization is no longer required, non-negative activation functions for \boldsymbol{Q} and \boldsymbol{K} are no longer mandatory. Is there still a point in adding (not necessarily non-negative) activation functions? My view is that adding activation functions is a design choice; it’s possible a specific function might yield better results, but it doesn’t change the form of Linear Attention. Furthermore, existing results suggest that not adding them is often sufficient.
Fancy Forget Gates
From Eq. [eq:linear-attn-rnn], we can see that Linear Attention in this form is essentially a \mathop{\text{cumsum}}, where all historical information is superimposed with equal weight. It is easy to imagine that once enough tokens have been superimposed, the relative contribution of each token becomes extremely small, and the fixed-size \boldsymbol{S}_t matrix may fail to accurately reconstruct any single token; intuitively, the memory of each token becomes blurred.
To alleviate this, RetNet introduced a decay effect to Linear Attention: \begin{equation} \boldsymbol{o}_t = \boldsymbol{S}_t \boldsymbol{q}_t, \qquad \boldsymbol{S}_t = \gamma\boldsymbol{S}_{t-1} + \boldsymbol{v}_t \boldsymbol{k}_t^{\top} \label{eq:linear-attn-retnet} \end{equation} where the decay factor \gamma \in (0,1). In RetNet, this is a constant, though some works set it as a trainable parameter or a diagonal matrix. MiniMax-01 also uses this form. Note that decay factors existed before RetNet, but mostly in the context of Linear RNNs like LRU and SSM. RetNet was likely the first to combine it with Linear Attention. With the decay factor, the model tends to forget more distant history, ensuring higher resolution for recent tokens—a manifestation of the "Recency Bias" inherent in language modeling.
Additionally, a noteworthy detail is that RetNet applies RoPE to \boldsymbol{Q} and \boldsymbol{K}. This is equivalent to generalizing the decay factor to a complex number \gamma e^{\text{i}\theta}, which from the perspective of LRU corresponds to considering complex eigenvalues. While adding positional encoding to an RNN might seem counterintuitive, experiments like the recent TransXSSM show that adding RoPE to Linear Attention can have positive effects, depending on the specific model variant and experimental setup.
A simple generalization of Eq. [eq:linear-attn-retnet] is to replace \gamma with a function of position t, \gamma_t, as seen in SSM. Later, works like DFW, Mamba, and Mamba2 generalized this to be input-dependent, forming a series of works related to "data-dependent decay." This is very similar to the "forget gates" in non-linear RNNs like GRU and LSTM, except that to maintain linearity, the forget gate does not depend on the State (e.g., \boldsymbol{S}_t).
Why do we prefer Linear RNNs? Because they can almost always be trained in parallel, making them competitive with Softmax Attention in both training and inference efficiency. The "universal solution" for parallelization is to convert the problem into a Prefix Sum problem and use an Associative Scan. We briefly introduced this in the "Parallelization" section of "Google’s New Work Attempts to ’Revive’ RNNs: Can RNNs Shine Again?".
However, the "universal solution" is not always GPU-efficient. GPUs are most efficient at matrix multiplication. Therefore, finding parallel algorithms that heavily utilize matrix multiplication is ideal. Even without full parallelization, finding a "Chunk by Chunk" recursive format that utilizes matrix multiplication can significantly improve training efficiency. This, in turn, imposes requirements on the model; for example, only forget gates in outer-product form can achieve this. A typical counter-example is Mamba, which uses a non-outer-product forget gate and cannot fully exploit GPU performance, leading to subsequent changes in Mamba2 and GLA.
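To make the chunk-by-chunk idea concrete, here is a toy NumPy sketch of chunkwise Linear Attention (no decay, for simplicity): within a chunk we use small masked matmuls, and across chunks we carry a d \times d state. Note the state is stored as \boldsymbol{Z} = \sum_j \boldsymbol{k}_j\boldsymbol{v}_j^{\top}, the transpose of \boldsymbol{S}_t above, so the inter-chunk term is a plain matmul. This is only a sketch of the idea, not a real kernel.

```python
import numpy as np

def linear_attention_chunkwise(Q, K, V, chunk=4):
    n, d = Q.shape
    Z = np.zeros((d, d))                      # running sum of k_j v_j^T from earlier chunks
    O = np.zeros_like(V)
    M = np.tril(np.ones((chunk, chunk)))      # intra-chunk causal mask
    for s in range(0, n, chunk):
        q, k, v = Q[s:s+chunk], K[s:s+chunk], V[s:s+chunk]
        c = len(q)
        # inter-chunk contribution (matmul with the state) + intra-chunk masked attention
        O[s:s+chunk] = q @ Z + (q @ k.T * M[:c, :c]) @ v
        Z = Z + k.T @ v                       # absorb this chunk into the state
    return O

n, d = 16, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))
O_chunk = linear_attention_chunkwise(Q, K, V)
O_full = (Q @ K.T * np.tril(np.ones((n, n)))) @ V
print(np.allclose(O_chunk, O_full))  # True
```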
Test-Time Training
By now, Linear Attention has evolved from simple imitation of Softmax Attention to incorporating static decay factors and even "data-dependent decay," forming its own identity. However, most of these advances were designed based on human intuition. We must ask: Is there a higher-level principle to guide the design of Linear Attention or general sequence models (Token-Mixers)?
TTT (Test Time Training) provides an answer. It views sequence modeling as an "Online Learning" problem and proposes constructing RNNs (not necessarily linear) using optimizers. Specifically, it treats \boldsymbol{K}, \boldsymbol{V} as a corpus of pairs (\boldsymbol{k}_1, \boldsymbol{v}_1), (\boldsymbol{k}_2, \boldsymbol{v}_2), \cdots, (\boldsymbol{k}_t, \boldsymbol{v}_t). Based on this corpus, a model \boldsymbol{v} = \boldsymbol{f}(\boldsymbol{S}_t; \boldsymbol{k}) is trained, and the output is \boldsymbol{o}_t = \boldsymbol{f}(\boldsymbol{S}_t; \boldsymbol{q}_t), where \boldsymbol{S}_t represents the model parameters. The model structure \boldsymbol{f} is largely arbitrary.
What does this have to do with RNNs? Simple: optimizers like SGD or Adam are essentially RNNs whose recurrent state is the model parameters! This perspective isn’t new; it was proposed during the Meta-Learning boom around 2017, where researchers tried using RNNs (LSTMs) to simulate better optimizers (see "Optimization as a Model for Few-Shot Learning").
As the saying goes, "what goes around comes around." Years later, TTT proposes using optimizers to construct RNNs. The process is: given current parameters \boldsymbol{S}_{t-1}, the optimizer (SGD) receives new data (\boldsymbol{k}_t, \boldsymbol{v}_t), updates the parameters to \boldsymbol{S}_t, and returns the prediction for \boldsymbol{q}_t as \boldsymbol{f}(\boldsymbol{S}_t; \boldsymbol{q}_t). Thus, the RNN implemented by TTT can be written as: \begin{equation} \boldsymbol{o}_t = \boldsymbol{f}(\boldsymbol{S}_t; \boldsymbol{q}_t), \qquad \boldsymbol{S}_t = \boldsymbol{S}_{t-1} - \eta_t\nabla_{\boldsymbol{S}_{t-1}}\mathcal{L}(\boldsymbol{f}(\boldsymbol{S}_{t-1};\boldsymbol{k}_t), \boldsymbol{v}_t) \label{eq:ttt-rnn} \end{equation} where \mathcal{L} is the loss function for the current data under current parameters, and \eta_t is the learning rate (which can also be data-dependent). This form covers many RNN models. For instance, Eq. [eq:linear-attn-rnn] and [eq:linear-attn-retnet] are special cases:
| RNN | \boldsymbol{S}_t | \boldsymbol{o}_t | \boldsymbol{f}(\boldsymbol{S};\boldsymbol{k}) | \mathcal{L}(\boldsymbol{f}(\boldsymbol{S};\boldsymbol{k}),\boldsymbol{v}) | \eta_t |
|---|---|---|---|---|---|
| [eq:linear-attn-rnn] | \boldsymbol{S}_t = \boldsymbol{S}_{t-1} + \boldsymbol{v}_t \boldsymbol{k}_t^{\top} | \boldsymbol{o}_t = \boldsymbol{S}_t \boldsymbol{q}_t | \boldsymbol{S}\boldsymbol{k} | -\boldsymbol{v}^{\top}(\boldsymbol{S}\boldsymbol{k}) | 1 |
| [eq:linear-attn-retnet] | \boldsymbol{S}_t = \gamma\boldsymbol{S}_{t-1} + \boldsymbol{v}_t \boldsymbol{k}_t^{\top} | \boldsymbol{o}_t = \boldsymbol{S}_t \boldsymbol{q}_t | \boldsymbol{S}\boldsymbol{k} | -\boldsymbol{v}^{\top}(\boldsymbol{S}\boldsymbol{k}) + \frac{1-\gamma}{2}\Vert\boldsymbol{S}\Vert_F^2 | 1 |
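As a sanity check of this correspondence, here is a toy NumPy sketch of the TTT recipe in Eq. [eq:ttt-rnn] for the linear "fast model" \boldsymbol{f}(\boldsymbol{S};\boldsymbol{k}) = \boldsymbol{S}\boldsymbol{k}: plugging in the two losses from the table reproduces the vanilla Linear Attention and RetNet recurrences. Function names are mine, for illustration only.

```python
import numpy as np

def ttt_rnn(Q, K, V, grad_fn, eta=1.0):
    n, d = Q.shape
    S = np.zeros((d, d))
    O = np.zeros_like(V)
    for t in range(n):
        S = S - eta * grad_fn(S, K[t], V[t])   # one online SGD step on (k_t, v_t)
        O[t] = S @ Q[t]                        # o_t = f(S_t; q_t)
    return O

def grad_vanilla(S, k, v):
    # gradient of L(S) = -v^T (S k) is -v k^T, so the SGD step just adds v k^T
    return -np.outer(v, k)

def grad_retnet(S, k, v, gamma=0.95):
    # gradient of L(S) = -v^T (S k) + (1 - gamma)/2 * ||S||_F^2; the extra term decays S
    return -np.outer(v, k) + (1 - gamma) * S

n, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))
M = np.tril(np.ones((n, n)))
print(np.allclose(ttt_rnn(Q, K, V, grad_vanilla), (Q @ K.T * M) @ V))  # True: first row of the table
```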
The original TTT paper explored non-linear RNNs in mini-batches. Later, Titans added momentum to TTT’s SGD, and "Test-Time Training Done Right" explored large-batch TTT and the "TTT + Muon" combination. Note that TTT only uses the optimizer to construct the RNN; parameters outside the RNN (like those for \boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}) are still trained using a global optimizer after the whole model is built.
Why can TTT serve as a "guiding principle" for RNNs? The core goal of an RNN is to effectively compress historical data into a fixed-size State. Model parameters are fixed-size, and training a model is essentially compressing data into weights. TTT leverages this alignment. Simply put, if an RNN is a compression task, TTT treats \boldsymbol{f} as the "decompressor," its weights as the "compressed archive," SGD as the compression algorithm, and the loss \mathcal{L} as the compression rate.
This allows us to stop worrying about recursive formats and instead focus on designing \boldsymbol{f} and \mathcal{L}. We can judge an RNN’s strength simply by looking at its corresponding \boldsymbol{f} and \mathcal{L}. Furthermore, because TTT uses Online Learning, the resulting RNN is naturally suited for In-Context Learning (ICL) tasks. Previously, "Why Can GPT Learn In-Context?" even explained Softmax Attention’s ICL ability by treating it as a form of gradient descent; from today’s perspective, it was constructing a corresponding TTT.
Removing the Old and Welcoming the New
For example, the loss function for the earliest Linear Attention was -\boldsymbol{v}^{\top}(\boldsymbol{S}\boldsymbol{k}). This is clearly unreliable because it is unbounded, which could lead to \boldsymbol{S} tending toward infinity. In contrast, RetNet added an L2 regularization term to the loss, avoiding this risk and mitigating overfitting, resulting in a better RNN.
However, while using the inner product as a loss is concise, it doesn’t directly encourage \boldsymbol{S}\boldsymbol{k}=\boldsymbol{v}. A better objective would be the squared loss, \frac{1}{2}\Vert\boldsymbol{S}\boldsymbol{k} - \boldsymbol{v}\Vert^2. Substituting this into the TTT formula Eq. [eq:ttt-rnn] gives: \begin{equation} \boldsymbol{o}_t = \boldsymbol{f}(\boldsymbol{S}_t; \boldsymbol{q}_t), \qquad \boldsymbol{S}_t = \boldsymbol{S}_{t-1} - \eta_t \underbrace{(\boldsymbol{S}_{t-1} \boldsymbol{k}_t - \boldsymbol{v}_t)\boldsymbol{k}_t^{\top}}_{\nabla_{\boldsymbol{S}_{t-1}}\frac{1}{2}\Vert\boldsymbol{S}_{t-1}\boldsymbol{k}_t - \boldsymbol{v}_t\Vert^2} \end{equation} This is DeltaNet, a name from "Parallelizing Linear Transformers with the Delta Rule over Sequence Length", though it was proposed earlier in "Linear Transformers Are Secretly Fast Weight Programmers". Note that \eta_t (\boldsymbol{S}_{t-1} \boldsymbol{k}_t - \boldsymbol{v}_t)\boldsymbol{k}_t^{\top} = (\boldsymbol{S}_{t-1} (\sqrt{\eta_t}\boldsymbol{k}_t) - (\sqrt{\eta_t}\boldsymbol{v}_t))(\sqrt{\eta_t}\boldsymbol{k}_t)^{\top}, meaning \eta_t can be absorbed into the definitions of \boldsymbol{k}_t and \boldsymbol{v}_t. Thus, we consider \eta_t=1: \begin{equation} \begin{aligned} \boldsymbol{S}_t =&\, \boldsymbol{S}_{t-1} -(\boldsymbol{S}_{t-1} \boldsymbol{k}_t - \boldsymbol{v}_t)\boldsymbol{k}_t^{\top} \\ =&\, \boldsymbol{S}_{t-1} -(\boldsymbol{S}_{t-1} \boldsymbol{k}_t)\boldsymbol{k}_t^{\top} + \boldsymbol{v}_t\boldsymbol{k}_t^{\top} \\ =&\, \boldsymbol{S}_{t-1} (\boldsymbol{I} - \boldsymbol{k}_t\boldsymbol{k}_t^{\top}) + \boldsymbol{v}_t\boldsymbol{k}_t^{\top} \end{aligned} \label{eq:linear-attn-deltanet} \end{equation} Compared to the original Linear Attention in Eq. [eq:linear-attn-rnn], DeltaNet subtracts (\boldsymbol{S}_{t-1} \boldsymbol{k}_t)\boldsymbol{k}_t^{\top} before adding \boldsymbol{v}_t\boldsymbol{k}_t^{\top}. Here, \boldsymbol{S}_{t-1} \boldsymbol{k}_t is the prediction of the new input \boldsymbol{k}_t by the old model \boldsymbol{S}_{t-1}.
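A minimal NumPy sketch of this recurrence, with \eta_t=1 and, as a stability choice for the demo, unit-norm keys (under which \boldsymbol{S}_t\boldsymbol{k}_t = \boldsymbol{v}_t exactly after each step):

```python
import numpy as np

def deltanet_recurrent(Q, K, V):
    n, d = Q.shape
    S = np.zeros((d, d))
    O = np.zeros_like(V)
    for t in range(n):
        S = S + np.outer(V[t] - S @ K[t], K[t])   # erase the old association for k_t, then write the new one
        # with unit-norm keys, the state now maps k_t exactly to v_t ("remove the old, welcome the new")
        assert np.allclose(S @ K[t], V[t])
        O[t] = S @ Q[t]
    return O

n, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))
K = K / np.linalg.norm(K, axis=1, keepdims=True)  # unit-norm keys, a common stabilizing choice
print(deltanet_recurrent(Q, K, V).shape)          # (8, 4)
```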
Intuitively, "subtract then add" means removing the model’s old knowledge of \boldsymbol{k}_t and then adding new knowledge based on (\boldsymbol{k}_t, \boldsymbol{v}_t), achieving a "remove the old, welcome the new" effect. This is known as the "Delta Rule," which is the source of the "Delta" in DeltaNet. The Delta Rule is not new; it is also known as Least Mean Square or the Widrow-Hoff Algorithm, dating back to the 1960s. In fact, very little in this field is entirely new; most changes can be traced back to "ancient" works. Current efforts focus on identifying the parts that are Scalable.
It should also be noted that chronologically, DeltaNet came before TTT. Understanding RNNs from an Online Learning perspective appeared sporadically before TTT, but TTT systematically proposed this "guiding principle."
Is DeltaNet still a Linear RNN? Yes. A Linear RNN is defined by a recursive formula where the dependence on the State variable is linear, even if the dependence on the input or \boldsymbol{q}, \boldsymbol{k}, \boldsymbol{v} is non-linear. In Eq. [eq:linear-attn-deltanet], \boldsymbol{S}_{t-1} only appears to the first power, satisfying the definition.
Inversion and Generalization
As mentioned, the most ideal (GPU-efficient) parallel algorithms for Linear RNNs involve matrix multiplication. To achieve this for DeltaNet, we first write: \begin{equation} \boldsymbol{S}_t = \boldsymbol{S}_{t-1} + (\boldsymbol{v}_t - \boldsymbol{S}_{t-1} \boldsymbol{k}_t)\boldsymbol{k}_t^{\top} \end{equation} Let \boldsymbol{u}_t = \boldsymbol{v}_t - \boldsymbol{S}_{t-1} \boldsymbol{k}_t, then \boldsymbol{S}_t = \boldsymbol{S}_{t-1} + \boldsymbol{u}_t\boldsymbol{k}_t^{\top}. This is just the original Linear Attention with \boldsymbol{V} replaced by \boldsymbol{U}=[\boldsymbol{u}_1, \boldsymbol{u}_2, \cdots, \boldsymbol{u}_n]^{\top}. Iterating t-1 times: \begin{equation} \boldsymbol{S}_{t-1} = \sum_{j=1}^{t-1} \boldsymbol{u}_j\boldsymbol{k}_j^{\top} \quad \Rightarrow \quad \boldsymbol{u}_t = \boldsymbol{v}_t - \left(\sum_{j=1}^{t-1} \boldsymbol{u}_j\boldsymbol{k}_j^{\top}\right)\boldsymbol{k}_t = \boldsymbol{v}_t - \sum_{j=1}^{t-1} \boldsymbol{u}_j(\boldsymbol{k}_j^{\top}\boldsymbol{k}_t) \end{equation} In matrix form: \boldsymbol{U} = \boldsymbol{V} - (\boldsymbol{K}\boldsymbol{K}^{\top}\odot \boldsymbol{M}^-)\boldsymbol{U}, where \boldsymbol{M}^- = \boldsymbol{M} - \boldsymbol{I}. This is a system of linear equations with the solution: \begin{equation} \boldsymbol{U} = (\boldsymbol{I} + \underbrace{\boldsymbol{K}\boldsymbol{K}^{\top}\odot \boldsymbol{M}^-}_{\text{denoted as } \boldsymbol{B}})^{-1}\boldsymbol{V} \end{equation} This involves the inverse of an n \times n matrix, which has a standard complexity of \mathcal{O}(n^3)—higher than Softmax Attention! Fortunately, we don’t need the explicit inverse; we only need \boldsymbol{U}, which can be solved as (\boldsymbol{I}+\boldsymbol{B})\boldsymbol{U}=\boldsymbol{V} in \mathcal{O}(n^2). Furthermore, using the lower triangular and low-rank structure of \boldsymbol{I}+\boldsymbol{B}, complexity can be reduced to linear. For these details, please refer to the original paper.
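A toy NumPy sketch of this "inversion" view, checked against the step-by-step recurrence. For simplicity it calls a general solver; a real implementation would exploit the triangular and low-rank structure as described above.

```python
import numpy as np

def deltanet_parallel(Q, K, V):
    n = Q.shape[0]
    M = np.tril(np.ones((n, n)))
    M_minus = M - np.eye(n)                     # strictly lower triangular mask
    B = K @ K.T * M_minus
    U = np.linalg.solve(np.eye(n) + B, V)       # solve (I + B) U = V; triangular in practice
    return (Q @ K.T * M) @ U                    # plain Linear Attention with U in place of V

def deltanet_recurrent(Q, K, V):                # reference: Eq. (eq:linear-attn-deltanet)
    n, d = Q.shape
    S = np.zeros((d, d))
    O = np.zeros_like(V)
    for t in range(n):
        S = S + np.outer(V[t] - S @ K[t], K[t])
        O[t] = S @ Q[t]
    return O

n, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))
K = K / np.linalg.norm(K, axis=1, keepdims=True)
print(np.allclose(deltanet_parallel(Q, K, V), deltanet_recurrent(Q, K, V)))  # True
```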
Following DeltaNet, Gated DeltaNet (GDN) introduced forget gates. The original GDN formulation was: \begin{equation} \boldsymbol{S}_t = \alpha_t \boldsymbol{S}_{t-1} (\boldsymbol{I} - \beta_t\boldsymbol{k}_t\boldsymbol{k}_t^{\top}) + \beta_t\boldsymbol{v}_t\boldsymbol{k}_t^{\top} \label{eq:gdn-orgi} \end{equation} However, I believe this explicitly breaks the Delta Rule. A better formulation, like in Comba, applies the multiplier only to the first \boldsymbol{S}_{t-1}: \begin{equation} \boldsymbol{S}_t = \gamma_t\boldsymbol{S}_{t-1} + \eta_t(\boldsymbol{v}_t - \boldsymbol{S}_{t-1}\boldsymbol{k}_t)\boldsymbol{k}_t^{\top} \label{eq:gdn-comba} \end{equation} This corresponds to a loss function \frac{1}{2}\Vert\boldsymbol{S}\boldsymbol{k} - \boldsymbol{v}\Vert^2 + \frac{1-\gamma}{2\eta}\Vert\boldsymbol{S}\Vert_F^2. Mathematically, these two are equivalent: \begin{equation} \alpha_t\boldsymbol{S}_{t-1} (\boldsymbol{I} - \beta_t\boldsymbol{k}_t\boldsymbol{k}_t^{\top}) + \beta_t\boldsymbol{v}_t\boldsymbol{k}_t^{\top} = \alpha_t \boldsymbol{S}_{t-1} + \alpha_t \beta_t (\boldsymbol{v}_t/\alpha_t - \boldsymbol{S}_{t-1}\boldsymbol{k}_t)\boldsymbol{k}_t^{\top} \end{equation} By setting \gamma_t = \alpha_t, \eta_t = \alpha_t \beta_t and absorbing 1/\alpha_t into \boldsymbol{v}_t, the former transforms into the latter. Since most \alpha_t are close to 1, their performance is likely similar, though the latter more clearly preserves the Delta Rule.
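A quick numerical check of this equivalence (toy NumPy, a single step):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
S = rng.normal(size=(d, d))
k, v = rng.normal(size=(2, d))
alpha, beta = 0.9, 0.5

# one step of the original Gated DeltaNet update, Eq. (eq:gdn-orgi)
S_orig = alpha * S @ (np.eye(d) - beta * np.outer(k, k)) + beta * np.outer(v, k)

# one Comba-style step, Eq. (eq:gdn-comba), with gamma = alpha, eta = alpha*beta, v -> v/alpha
gamma, eta = alpha, alpha * beta
S_comba = gamma * S + eta * np.outer(v / alpha - S @ k, k)

print(np.allclose(S_orig, S_comba))  # True
```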
Theoretically, Gated DeltaNet can also be written in DeltaNet form. By defining \bar{\alpha}_t = \prod_{j=1}^t \alpha_j, dividing Eq. [eq:gdn-orgi] by \bar{\alpha}_t yields: \begin{equation} \bar{\alpha}_t^{-1}\boldsymbol{S}_t = \bar{\alpha}_{t-1}^{-1}\boldsymbol{S}_{t-1} (\boldsymbol{I} - \beta_t\boldsymbol{k}_t\boldsymbol{k}_t^{\top}) + \beta_t(\bar{\alpha}_t^{-1}\boldsymbol{v}_t)\boldsymbol{k}_t^{\top} \end{equation} Combined with \boldsymbol{o}_t = \boldsymbol{S}_t \boldsymbol{q}_t = (\bar{\alpha}_t^{-1}\boldsymbol{S}_t) (\bar{\alpha}_t\boldsymbol{q}_t), we see that by redefining \bar{\alpha}_t\boldsymbol{q}_t and \bar{\alpha}_t^{-1}\boldsymbol{v}_t as the new \boldsymbol{q}_t, \boldsymbol{v}_t, it simplifies to DeltaNet. However, this is mostly of theoretical value; in practice, for large t, \bar{\alpha}_t would likely underflow or \bar{\alpha}_t^{-1} would overflow.
Another extension is DeltaProduct, which expands \boldsymbol{k}, \boldsymbol{v} several times before applying DeltaNet or GDN to enhance state-tracking. Personally, I find the approach in "Chapter of Space-Time: Treating Attention as a Quadratic Complexity RNN" more aesthetically pleasing—exploring quadratic complexity RNNs to see if they can surpass Softmax Attention.
Reciprocation in Progress
Regarding surpassing Softmax Attention, I mentioned that Linear Attention is now "reciprocating" it. This makes sense: Softmax Attention has been "regressing" for years, moving from MHA to GQA and MQA to compress KV Cache. Linear Attention, having no KV Cache issue, has continued to move forward.
To see this, let’s write the Attention mechanisms in matrix form:
| Model | Formula |
|---|---|
| Softmax Attention | (\exp(\boldsymbol{Q}\boldsymbol{K}^{\top})\odot \boldsymbol{M})\boldsymbol{V} |
| Original Linear Attention | (\boldsymbol{Q}\boldsymbol{K}^{\top}\odot \boldsymbol{M})\boldsymbol{V} |
| With Forget Gate | (\boldsymbol{Q}\boldsymbol{K}^{\top}\odot \boldsymbol{\Gamma})\boldsymbol{V} |
| DeltaNet | (\boldsymbol{Q}\boldsymbol{K}^{\top}\odot \boldsymbol{M})(\boldsymbol{I} + \boldsymbol{K}\boldsymbol{K}^{\top}\odot \boldsymbol{M}^-)^{-1}\boldsymbol{V} |
| Gated DeltaNet | (\boldsymbol{Q}\boldsymbol{K}^{\top}\odot \boldsymbol{\Gamma})(\boldsymbol{I} + \boldsymbol{K}\boldsymbol{K}^{\top}\odot \boldsymbol{\Gamma}^-)^{-1}\boldsymbol{V} |
where \begin{equation} \Gamma_{i,j} = \left\{\begin{aligned} &\prod_{\tau=j+1}^i \gamma_{\tau}, &i > j \\[6pt] &1, &i = j \\[6pt] &0, &i < j\end{aligned}\right. \end{equation} and \boldsymbol{\Gamma}^- = \boldsymbol{\Gamma} - \boldsymbol{I}. Softmax Attention is still stuck in the form of the earliest Linear Attention. How does "reciprocation" happen? First, we need a way to convert Softmax Attention into Linear Attention. We summarized three schemes for converting Softmax Attention into infinite-dimensional Linear Attention in "Transformer Upgrade: 5. Linear Attention as Infinite Dimensions".
Essentially, there exists a mapping \phi that maps \boldsymbol{Q}, \boldsymbol{K} from n \times d to n \times \infty such that \exp(\boldsymbol{Q}\boldsymbol{K}^{\top}) = \phi(\boldsymbol{Q})\phi(\boldsymbol{K})^{\top} (the "kernel trick"). We then replace \boldsymbol{Q}, \boldsymbol{K} in the Linear Attention formulas with \phi(\boldsymbol{Q}), \phi(\boldsymbol{K}) and restore the \exp and normalization. For example, applying this to the forget gate formula: \begin{equation} (\phi(\boldsymbol{Q})\phi(\boldsymbol{K})^{\top}\odot \boldsymbol{\Gamma})\boldsymbol{V} = \exp(\boldsymbol{Q}\boldsymbol{K}^{\top} + \log\boldsymbol{\Gamma})\boldsymbol{V} \end{equation} If \gamma_t is constant, this is ALIBI ("Train Short, Test Long..."); if \gamma_t is input-dependent, it is FoX ("Forgetting Transformer...").
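A toy NumPy sketch of this "decay as a logit bias" view: we add \log\boldsymbol{\Gamma} to the attention logits before the softmax. The \gamma_t below are random stand-ins; in FoX they are computed from the input, and in ALIBI-style decay they are constants.

```python
import numpy as np

def decayed_softmax_attention(Q, K, V, gammas):
    n = Q.shape[0]
    # log Gamma_{i,j} = sum_{tau=j+1..i} log gamma_tau for i >= j, and -inf for i < j
    cum = np.concatenate([[0.0], np.cumsum(np.log(gammas))])
    log_Gamma = cum[1:, None] - cum[None, 1:]
    log_Gamma = np.where(np.tril(np.ones((n, n))) > 0, log_Gamma, -np.inf)
    logits = Q @ K.T + log_Gamma
    A = np.exp(logits - logits.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    return A @ V

n, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))
gammas = 1.0 / (1.0 + np.exp(-rng.normal(size=n)))  # stand-in for gamma_t in (0, 1)
print(decayed_softmax_attention(Q, K, V, gammas).shape)  # (8, 4)
```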
An even more interesting result is DeltaFormer ("Understanding Transformer from the Perspective of Associative Memory"), which is the DeltaNet version of Softmax Attention. Replacing \boldsymbol{Q}, \boldsymbol{K} in DeltaNet with \phi(\boldsymbol{Q}), \phi(\boldsymbol{K}): \begin{equation} \begin{aligned} &\,(\phi(\boldsymbol{Q})\phi(\boldsymbol{K})^{\top}\odot \boldsymbol{M})(\boldsymbol{I} + \phi(\boldsymbol{K})\phi(\boldsymbol{K})^{\top}\odot \boldsymbol{M}^-)^{-1}\boldsymbol{V} \\[8pt] =&\, \underbrace{\exp(\boldsymbol{Q} \boldsymbol{K}^{\top} + \log\boldsymbol{M})}_{\text{denoted as } \boldsymbol{A}}(\boldsymbol{I} + \underbrace{\exp(\boldsymbol{K} \boldsymbol{K}^{\top} + \log\boldsymbol{M}^-)}_{\text{denoted as } \boldsymbol{B}})^{-1}\boldsymbol{V} \end{aligned} \end{equation} For normalization, we replace \exp with \text{softmax}. Compared to Softmax Attention, DeltaFormer changes \boldsymbol{A}\boldsymbol{V} to \boldsymbol{A}(\boldsymbol{I}+\boldsymbol{B})^{-1}\boldsymbol{V}. Note that: \begin{equation} \begin{aligned} \boldsymbol{A}(\boldsymbol{I}+\boldsymbol{B})^{-1}\boldsymbol{V} =&\, \boldsymbol{A}(\boldsymbol{I}-\boldsymbol{B}+\boldsymbol{B}^2- \boldsymbol{B}^3 + \cdots)\boldsymbol{V} \\ =&\, \boldsymbol{A}(\boldsymbol{V}-\boldsymbol{B}\boldsymbol{V}+\boldsymbol{B}^2\boldsymbol{V}- \boldsymbol{B}^3\boldsymbol{V} + \cdots) \end{aligned} \end{equation} DeltaFormer is equivalent to computing Attention multiple times with \boldsymbol{K}, \boldsymbol{K}, \boldsymbol{V}, superimposing the results as a new \boldsymbol{V}, and then computing Attention with \boldsymbol{Q}, \boldsymbol{K}. This is highly effective for Multi-Hop tasks (like Code). Furthermore, since only \boldsymbol{K}, \boldsymbol{V} are involved in (\boldsymbol{I}+\boldsymbol{B})^{-1}\boldsymbol{V}, it pairs perfectly with MQA, where \boldsymbol{K}, \boldsymbol{V} have only a single head.
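A naive NumPy sketch of DeltaFormer's core structural change (normalization and numerical care omitted; the small input scaling is only to keep \exp tame in this demo):

```python
import numpy as np

def deltaformer_core(Q, K, V):
    n = Q.shape[0]
    M = np.tril(np.ones((n, n)))
    M_minus = M - np.eye(n)
    A = np.exp(Q @ K.T) * M                    # exp(Q K^T + log M)
    B = np.exp(K @ K.T) * M_minus              # exp(K K^T + log M^-), strictly lower triangular
    U = np.linalg.solve(np.eye(n) + B, V)      # (I + B)^{-1} V, a triangular solve in practice
    return A @ U                               # A V becomes A (I + B)^{-1} V

n, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = 0.1 * rng.normal(size=(3, n, d))     # small scale keeps exp() tame in this demo
print(deltaformer_core(Q, K, V).shape)         # (8, 4)
```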
However, in my view, this fixed-coefficient superposition might be a "no free lunch" scenario. My experiments show that DeltaFormer’s language model loss doesn’t change much, implying that if performance improves on some tasks, it must decline on others.
Hardcore Encoding
Another notable reciprocation work is PaTH Attention ("PaTH Attention: Position Encoding via Accumulating Householder Transformations"), which feeds DeltaNet back into Softmax Attention from a positional encoding perspective.
In "Transformer Upgrade: 6. Completeness Analysis of RoPE", we noted that for any orthogonal matrix \boldsymbol{\Omega}, \boldsymbol{R}_m = \boldsymbol{\Omega}^m is a generalized RoPE. Besides rotation matrices, what other orthogonal matrices are easy to construct? PaTH uses Householder matrices: if \boldsymbol{w} is any column vector with norm \sqrt{2}, then \boldsymbol{I}-\boldsymbol{w}\boldsymbol{w}^{\top} is an orthogonal matrix (representing a reflection).
This is identical to the \boldsymbol{I}-\boldsymbol{k}_t\boldsymbol{k}_t^{\top} term in DeltaNet. PaTH abandons the \boldsymbol{\Omega}^m form and the \Vert\boldsymbol{w}\Vert=\sqrt{2} constraint, using a sequence of \boldsymbol{I}-\boldsymbol{w}\boldsymbol{w}^{\top} multiplications to express position: \begin{equation} \boldsymbol{q}_i^{\top}\boldsymbol{k}_j \quad \to \quad \boldsymbol{q}_i^{\top}\underbrace{(\boldsymbol{I}-\boldsymbol{w}_i\boldsymbol{w}_i^{\top})(\boldsymbol{I}-\boldsymbol{w}_{i-1}\boldsymbol{w}_{i-1}^{\top})\cdots(\boldsymbol{I}-\boldsymbol{w}_{j+1}\boldsymbol{w}_{j+1}^{\top})}_{\text{denoted as } \boldsymbol{R}_{i,j}}\boldsymbol{k}_j \end{equation} The recursive form is \boldsymbol{R}_{i,j} = (\boldsymbol{I}-\boldsymbol{w}_i\boldsymbol{w}_i^{\top})\boldsymbol{R}_{i-1,j} with \boldsymbol{R}_{j,j} = \boldsymbol{I}. This is like DeltaNet with \boldsymbol{v}_t=0 but a non-zero initial \boldsymbol{S}_0. Using the same inversion process: \begin{equation} \boldsymbol{R}_{i,j} = \boldsymbol{I} - \boldsymbol{W}_{[j:i]}^{\top}(\boldsymbol{I} + \boldsymbol{W}_{[j:i]}\boldsymbol{W}_{[j:i]}^{\top}\odot\boldsymbol{M}^-)^{-1}\boldsymbol{W}_{[j:i]} \end{equation} where \boldsymbol{W}=[\boldsymbol{w}_1, \boldsymbol{w}_2, \cdots, \boldsymbol{w}_n]^{\top}. Using the properties of triangular matrices, we can write the full Attention matrix (before Softmax) as: \begin{equation} \boldsymbol{A} = \boldsymbol{Q}\boldsymbol{K}^{\top}\odot\boldsymbol{M} - (\boldsymbol{Q}\boldsymbol{W}^{\top}\odot\boldsymbol{M})(\boldsymbol{I} + \boldsymbol{W}\boldsymbol{W}^{\top}\odot\boldsymbol{M}^-)^{-1}(\boldsymbol{W}\boldsymbol{K}^{\top}\odot\boldsymbol{M}^-) \label{eq:path-attn} \end{equation} Direct inversion is \mathcal{O}(n^3), which is unacceptable. One must use the low-rank nature of \boldsymbol{W}\boldsymbol{W}^{\top} to reduce complexity to \mathcal{O}(n^2), derive backpropagation, and write a Flash Attention-like implementation. It is extremely hardcore.
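To make Eq. [eq:path-attn] less forbidding, here is a deliberately naive NumPy check that it matches the definition of \boldsymbol{R}_{i,j} as an explicit product of Householder factors. Both versions are for verification only (the closed form even inverts the n \times n matrix directly) and bear no resemblance to the actual kernel.

```python
import numpy as np

def path_logits_naive(Q, K, W):
    n, d = Q.shape
    A = np.zeros((n, n))
    for i in range(n):
        R = np.eye(d)                 # R_{i,i} = I
        A[i, i] = Q[i] @ K[i]
        for j in range(i - 1, -1, -1):
            # going from R_{i,j+1} to R_{i,j}: multiply by (I - w_{j+1} w_{j+1}^T) on the right
            R = R @ (np.eye(d) - np.outer(W[j + 1], W[j + 1]))
            A[i, j] = Q[i] @ R @ K[j]
    return A

def path_logits_closed_form(Q, K, W):  # Eq. (eq:path-attn), with an explicit inverse
    n = Q.shape[0]
    M = np.tril(np.ones((n, n)))
    M_minus = M - np.eye(n)
    inv = np.linalg.inv(np.eye(n) + W @ W.T * M_minus)
    return Q @ K.T * M - (Q @ W.T * M) @ inv @ (W @ K.T * M_minus)

n, d = 6, 4
rng = np.random.default_rng(0)
Q, K, W = rng.normal(size=(3, n, d))
print(np.allclose(path_logits_naive(Q, K, W), path_logits_closed_form(Q, K, W)))  # True
```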
PaTH is a type of CoPE (Contextual Position Encoding), where positions are generated based on context rather than fixed indices. Similarly, FoX can be seen as a contextual version of ALIBI. Context-dependent positional information is a key feature of current Linear Attention and likely a major direction for reciprocating Softmax Attention.
Simplification is Fun
Let’s look at two special cases of PaTH to understand its relationship with DeltaNet.
The first case is \boldsymbol{W}=\boldsymbol{K}. Substituting into Eq. [eq:path-attn]: \begin{equation} \begin{aligned} \boldsymbol{A} =&\, (\boldsymbol{Q}\boldsymbol{K}^{\top}\odot\boldsymbol{M})(\boldsymbol{I} - (\boldsymbol{I} + \boldsymbol{K}\boldsymbol{K}^{\top}\odot\boldsymbol{M}^-)^{-1}(\boldsymbol{K}\boldsymbol{K}^{\top}\odot\boldsymbol{M}^-)) \\[6pt] =&\, (\boldsymbol{Q}\boldsymbol{K}^{\top}\odot\boldsymbol{M})(\boldsymbol{I} + \boldsymbol{K}\boldsymbol{K}^{\top}\odot\boldsymbol{M}^-)^{-1} \end{aligned} \end{equation} This is exactly the Attention matrix of DeltaNet! Thus, the difference between PaTH and DeltaFormer is that DeltaFormer applies \exp to \boldsymbol{Q}\boldsymbol{K}^{\top} and \boldsymbol{K}\boldsymbol{K}^{\top} based on the kernel trick, while PaTH applies \exp directly to the DeltaNet Attention matrix.
The second case is re-introducing the \Vert\boldsymbol{w}\Vert=\sqrt{2} constraint. Then \boldsymbol{I}-\boldsymbol{w}\boldsymbol{w}^{\top} is orthogonal. We can implement PaTH using absolute positions, multiplying each \boldsymbol{q}_i, \boldsymbol{k}_i by \boldsymbol{R}_i. This turns out to be: \begin{equation} \mathop{\text{SoftmaxAttention}}(\underbrace{\boldsymbol{Q}-\mathop{\text{DeltaNet}}(\boldsymbol{Q},\boldsymbol{W},\boldsymbol{W})}_{\tilde{\boldsymbol{Q}}},\underbrace{\boldsymbol{K}-\mathop{\text{DeltaNet}}(\boldsymbol{K},\boldsymbol{W},\boldsymbol{W})}_{\tilde{\boldsymbol{K}}},\boldsymbol{V}) \end{equation} This is equivalent to using DeltaNet to add positional encoding to \boldsymbol{Q} and \boldsymbol{K}. In this light, PaTH is an intra-layer hybrid of Softmax Attention and DeltaNet.
The "Side-Path" Method
Finally, let’s look at MesaNet (and the similar Atlas). TTT’s Online Learning perspective tells us DeltaNet uses SGD to optimize \frac{1}{2}\Vert\boldsymbol{S}\boldsymbol{k} - \boldsymbol{v}\Vert^2. Since \boldsymbol{S}\boldsymbol{k} is linear in \boldsymbol{k}, this is just linear regression, which has an analytical solution! \begin{equation} \boldsymbol{S}_t = \boldsymbol{G}_t \boldsymbol{H}_t^{-1},\quad \boldsymbol{G}_t = \sum_{j=1}^t \boldsymbol{v}_j \boldsymbol{k}_j^{\top},\quad \boldsymbol{H}_t = \sum_{j=1}^t \boldsymbol{k}_j \boldsymbol{k}_j^{\top} \end{equation} MesaNet ("MesaNet: Sequence Modeling by Locally Optimal Test-Time Training") uses this analytical solution, adding forget gates and a diagonal matrix \boldsymbol{\Lambda}_t for invertibility: \begin{equation} \boldsymbol{o}_t = \boldsymbol{G}_t (\boldsymbol{H}_t + \boldsymbol{\Lambda}_t)^{-1} \boldsymbol{q}_t,\quad \boldsymbol{G}_t = \gamma_t \boldsymbol{G}_{t-1} + \boldsymbol{v}_t \boldsymbol{k}_t^{\top},\quad\boldsymbol{H}_t = \gamma_t \boldsymbol{H}_{t-1} + \boldsymbol{k}_t \boldsymbol{k}_t^{\top} \end{equation} MesaNet is still Linear Attention. From a signal processing view, MesaNet vs. DeltaNet is Recursive Least Square vs. Least Mean Square.
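A minimal recurrent NumPy sketch of this output rule; \gamma_t and \boldsymbol{\Lambda}_t are stand-ins here (a constant decay and a fixed ridge term), whereas in the paper they are learned.

```python
import numpy as np

def mesanet_recurrent(Q, K, V, gamma=0.95, lam=1e-2):
    n, d = Q.shape
    G = np.zeros((d, d))              # decayed sum of v_j k_j^T
    H = np.zeros((d, d))              # decayed sum of k_j k_j^T
    O = np.zeros_like(V)
    Lam = lam * np.eye(d)
    for t in range(n):
        G = gamma * G + np.outer(V[t], K[t])
        H = gamma * H + np.outer(K[t], K[t])
        O[t] = G @ np.linalg.solve(H + Lam, Q[t])   # o_t = G_t (H_t + Lambda_t)^{-1} q_t
    return O

n, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))
print(mesanet_recurrent(Q, K, V).shape)  # (8, 4)
```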
Why do I call it "side-path"? Because analytical solutions are rare and inflexible. History shows that branches relying on analytical solutions often decline because they lack generality. Furthermore, \boldsymbol{H}_t + \boldsymbol{\Lambda}_t is not triangular, making parallel computation difficult. MesaNet currently uses the Conjugate Gradient method for approximate solutions. Theoretically, MesaNet’s "sliding average" update might not be as powerful as DeltaNet’s "remove the old, welcome the new" Delta Rule for precise long-term memory.
The Road is Still Long
This article has briefly traced the development of Linear Attention. From imitating Softmax Attention to developing its own features, Linear Attention has become a highly competitive sequence modeling solution, even providing new ideas for Softmax Attention. This process is both fascinating and enlightening.
Reprinting: Please include the original address: https://kexue.fm/archives/11033
For more details on reprinting/citation, please refer to: "Scientific Space FAQ"