Attention Residuals: A Memoir · English (unofficial) translations of posts at kexue.fm

This article introduces our latest work, Attention Residuals (AttnRes), which, as the name suggests, improves residuals using the idea of attention.

Many readers have likely heard of the debate between Pre-Norm and Post-Norm, but ultimately this is just an internal struggle within residuals themselves, as are many subsequent changes to normalization. A more interesting change is HC, which took the path of expanding residual streams, but perhaps due to unstable performance, it did not attract much attention. The subsequent story is probably known to everyone: at the end of last year, DeepSeek’s mHC improved upon HC and validated its effectiveness in larger-scale experiments.

Compared to further expanding residual streams, we chose a more radical route: directly using attention between layers to replace residuals. Of course, making the whole thing work took a great deal of detailed effort, and here we simply recount that journey.

Inter-Layer Attention

As usual, we start with Residuals, which everyone should be familiar with. Its form is

\boldsymbol{x}_t = \boldsymbol{x}_{t-1} + \boldsymbol{f}_t(\boldsymbol{x}_{t-1})

Here we use an alternative notation that reveals something deeper. Let \boldsymbol{y}_t=\boldsymbol{f}_t(\boldsymbol{x}_{t-1}), then \boldsymbol{x}_t=\boldsymbol{x}_{t-1}+\boldsymbol{y}_t. Defining \boldsymbol{y}_0=\boldsymbol{x}_0, it is easy to see that \boldsymbol{x}_t=\boldsymbol{y}_0+\boldsymbol{y}_1+\cdots+\boldsymbol{y}_t, so it can be equivalently written as

\boldsymbol{y}_{t+1} = \boldsymbol{f}_{t+1}(\boldsymbol{y}_0+\boldsymbol{y}_1+\cdots+\boldsymbol{y}_t)\label{eq:res-sum}

That is, from the perspective of \boldsymbol{y}, Residuals sum \boldsymbol{y}_0,\boldsymbol{y}_1,\cdots,\boldsymbol{y}_t with equal weights as the input to \boldsymbol{f}_{t+1} to obtain \boldsymbol{y}_{t+1}. A natural generalization is to use a weighted sum:

\boldsymbol{y}_{t+1} = \boldsymbol{f}_{t+1}\left(\sum_{s=0}^t a_{t+1,s}\boldsymbol{y}_s\right)\qquad \text{where}\qquad a_{t+1,s}\geq 0,\quad\sum_{s=0}^t a_{t+1,s}=1\label{eq:attnres-gen}

This is the germ of AttnRes. The above equation also imposes two constraints on a_{t,s}; let’s discuss their necessity.

1. The constraint a_{t,s}\geq 0 ensures that the contribution of the same \boldsymbol{y}_s to different layers is always in the same direction, avoiding the inconsistency where one layer tries to increase \boldsymbol{y}_s while another tries to decrease it. Intuitively, this is more friendly to model learning.

2. Our \boldsymbol{f} uses In Norm, which first applies \mathop{\mathrm{RMSNorm}} to the input. Since \mathop{\mathrm{RMSNorm}}(\boldsymbol{x})=\mathop{\mathrm{RMSNorm}}(c\boldsymbol{x}) holds for any c > 0, weighted averaging and weighted summation are completely equivalent, so the constraint \sum_{s=0}^t a_{t+1,s}=1 does not reduce expressiveness.
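The second point can be checked numerically. Below is a minimal sketch in plain Python, assuming an RMSNorm without a learned gain and arbitrary toy vectors (the values and helper names are illustrative, not taken from the experiments). It shows that rescaling the weights so they sum to 1 does not change what \boldsymbol{f} sees after RMSNorm.

```python
import math

def rms_norm(x):
    """RMSNorm without a learned gain (assumes x is not all zeros)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [v / rms for v in x]

def weighted_sum(weights, ys):
    """Return sum_s weights[s] * ys[s] for a list of equal-length vectors."""
    return [sum(w * y[i] for w, y in zip(weights, ys)) for i in range(len(ys[0]))]

# Three toy layer outputs y_0, y_1, y_2 (arbitrary illustrative values).
ys = [[1.0, -2.0, 0.5], [0.3, 0.1, -0.7], [2.0, 1.0, 1.5]]

raw = [0.2, 1.0, 0.8]                  # any non-negative weights
avg = [w / sum(raw) for w in raw]      # the same weights rescaled to sum to 1

norm_of_sum = rms_norm(weighted_sum(raw, ys))
norm_of_avg = rms_norm(weighted_sum(avg, ys))
max_gap = max(abs(a - b) for a, b in zip(norm_of_sum, norm_of_avg))
print(max_gap)   # differs only by floating-point rounding
```

Setting `raw = [1.0, 1.0, 1.0]` gives the plain residual sum, so the usual residual stream and its equal-weight average are likewise indistinguishable to an In Norm layer.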

Hyper-Connections

Before developing AttnRes, let’s briefly review HC (Hyper-Connections) and demonstrate that it can also be understood as inter-layer attention, thereby showing that inter-layer attention is indeed a more fundamental approach. HC changes Residuals to

\boldsymbol{X}_t = \boldsymbol{H}_t^{res}\boldsymbol{X}_{t-1} + \boldsymbol{H}_t^{post} \boldsymbol{f}_t(\boldsymbol{H}_t^{pre}\boldsymbol{X}_{t-1})

where \boldsymbol{X}\in\mathbb{R}^{k\times d},\boldsymbol{H}^{res}\in\mathbb{R}^{k\times k},\boldsymbol{H}^{pre}\in\mathbb{R}^{1\times k},\boldsymbol{H}^{post}\in\mathbb{R}^{k\times 1}. The classic choice is k=4. Simply put, the state variable is expanded to k times its original size. Before it is fed to \boldsymbol{f}_t, \boldsymbol{H}_t^{pre} reduces it back to a single stream; after the output, \boldsymbol{H}_t^{post} expands it back to k streams, and finally it is added to the \boldsymbol{H}_t^{res}-modulated \boldsymbol{X}_{t-1}. Without restricting the forms of \boldsymbol{H}_t^{res}, \boldsymbol{H}_t^{pre}, \boldsymbol{H}_t^{post}, approaches like Post Norm and Highway are special cases of HC.

Similarly, let \boldsymbol{y}_t=\boldsymbol{f}_t(\boldsymbol{H}_t^{pre}\boldsymbol{X}_{t-1}), then \boldsymbol{X}_t = \boldsymbol{H}_t^{res}\boldsymbol{X}_{t-1} + \boldsymbol{H}_t^{post} \boldsymbol{y}_t. Setting \boldsymbol{X}_0 = \boldsymbol{H}_0^{post}\boldsymbol{y}_0, we can expand it as

\boldsymbol{X}_t = \boldsymbol{H}_{t\leftarrow 1}^{res}\boldsymbol{H}_0^{post}\boldsymbol{y}_0 + \boldsymbol{H}_{t\leftarrow 2}^{res}\boldsymbol{H}_1^{post}\boldsymbol{y}_1 + \cdots + \boldsymbol{H}_{t\leftarrow t}^{res}\boldsymbol{H}_{t-1}^{post}\boldsymbol{y}_{t-1} + \boldsymbol{H}_t^{post}\boldsymbol{y}_t

where \boldsymbol{H}_{t\leftarrow s}^{res} is defined as \boldsymbol{H}_t^{res}\boldsymbol{H}_{t-1}^{res}\cdots \boldsymbol{H}_{s+1}^{res}\boldsymbol{H}_s^{res}. Further defining \boldsymbol{H}_{t\leftarrow t+1}^{res} = \boldsymbol{I}, we can write

\boldsymbol{y}_{t+1} = \boldsymbol{f}_{t+1}(\boldsymbol{H}_{t+1}^{pre}\boldsymbol{X}_t) = \boldsymbol{f}_{t+1}\bigg(\sum_{s=0}^t\underbrace{\boldsymbol{H}_{t+1}^{pre}\boldsymbol{H}_{t\leftarrow s+1}^{res}\boldsymbol{H}_s^{post}}_{a_{t+1,s}}\boldsymbol{y}_s\bigg)

Note that each \boldsymbol{H}_{t+1}^{pre}\boldsymbol{H}_{t\leftarrow s+1}^{res}\boldsymbol{H}_s^{post} is a 1\times 1 matrix, equivalent to a scalar, so it is also an inter-layer attention form as in Eq.[eq:attnres-gen]. Readers familiar with linear attention should quickly understand this result: HC is essentially DeltaNet “rotated by 90 degrees”. In practice, the three \boldsymbol{H} matrices are computed from simple linear layers with \tanh activation, which causes the product \boldsymbol{H}_{t\leftarrow s}^{res} to risk explosion or collapse, and cannot guarantee the non-negativity of a_{t+1,s}.
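The unrolling above can be verified on a toy example. The following is a minimal sketch in plain Python under stated assumptions: k=2 streams, d=3, three layers, fixed made-up \boldsymbol{H} matrices (in real HC they are computed from the data), and a hypothetical stand-in `f` for the layers. It runs the HC recurrence and checks that the input of each layer equals \sum_s a_{t+1,s}\boldsymbol{y}_s with the scalar weights defined above.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def f(t, v):
    """Hypothetical stand-in for the layer f_t."""
    return [math.tanh(x) + 0.1 * t for x in v]

k, d, T = 2, 3, 3
# Made-up HC matrices for layers 1..T; index 0 only needs H_post.
H_res = [None] + [[[0.9, 0.2 * t], [-0.1, 1.1]] for t in range(1, T + 1)]
H_pre = [None] + [[0.5 + 0.1 * t, 0.4] for t in range(1, T + 1)]
H_post = [[1.0, 0.5]] + [[0.3 * t, 1.0 - 0.2 * t] for t in range(1, T + 1)]

# Run the recurrence X_t = H_res X_{t-1} + H_post y_t, storing each layer input.
y = [[1.0, -0.5, 2.0]]                                   # y_0 = x_0
X = [[p * v for v in y[0]] for p in H_post[0]]           # X_0 = H_0^post y_0 (k x d)
inputs = [None]
for t in range(1, T + 1):
    h = [sum(H_pre[t][i] * X[i][j] for i in range(k)) for j in range(d)]
    inputs.append(h)
    y.append(f(t, h))
    X = [[sum(H_res[t][i][l] * X[l][j] for l in range(k)) + H_post[t][i] * y[t][j]
          for j in range(d)] for i in range(k)]

def attn_weight(t_next, s):
    """Scalar a_{t+1,s} = H_{t+1}^pre H_{t<-s+1}^res H_s^post."""
    v = H_post[s]
    for r in range(s + 1, t_next):       # apply H_{s+1}^res, ..., H_t^res in order
        v = matvec(H_res[r], v)
    return dot(H_pre[t_next], v)

max_gap = 0.0
for t_next in range(1, T + 1):
    mix = [sum(attn_weight(t_next, s) * y[s][j] for s in range(t_next)) for j in range(d)]
    max_gap = max(max_gap, max(abs(a - b) for a, b in zip(mix, inputs[t_next])))
print(max_gap)   # the two forms agree up to floating-point rounding
```

With these unconstrained toy matrices some of the weights may come out negative, which is the issue the text points out for the original HC.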

Later, mHC made improvements: it first changed all three \boldsymbol{H} matrices to use sigmoid activation, ensuring non-negativity of a_{t+1,s}, then alternately normalized the rows and columns of \boldsymbol{H}_t^{res} to make it doubly stochastic, relying on the closure of doubly stochastic matrices under multiplication to guarantee the stability of \boldsymbol{H}_{t\leftarrow s}^{res}. Experiments later verified the effectiveness of these modifications. However, some new experiments like “Your deepseek mHC may not need the ‘m’ ” show that directly setting \boldsymbol{H}_t^{res} to the identity matrix is already sufficient.
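The closure property is easy to see in a small experiment. The sketch below is plain Python and only illustrates the idea: it alternately normalizes rows and columns of arbitrary positive 4\times 4 matrices (a Sinkhorn-style iteration; the exact parameterization used by mHC is not reproduced here) and checks that a long product of the results still has rows and columns summing to 1.

```python
import math

def sinkhorn(M, iters=50):
    """Alternately normalize rows and columns of a positive square matrix."""
    n = len(M)
    for _ in range(iters):
        M = [[v / sum(row) for v in row] for row in M]                 # rows sum to 1
        col = [sum(M[i][j] for i in range(n)) for j in range(n)]
        M = [[M[i][j] / col[j] for j in range(n)] for i in range(n)]   # columns sum to 1
    return M

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][l] * B[l][j] for l in range(n)) for j in range(n)] for i in range(n)]

def max_marginal_error(M):
    """Largest deviation of any row sum or column sum from 1."""
    n = len(M)
    rows = [abs(sum(M[i]) - 1.0) for i in range(n)]
    cols = [abs(sum(M[i][j] for i in range(n)) - 1.0) for j in range(n)]
    return max(rows + cols)

# Six arbitrary positive matrices (toy values), one per layer.
raw = [[[math.exp(math.sin(1.7 * i + 0.9 * j + t)) for j in range(4)] for i in range(4)]
       for t in range(6)]
H = [sinkhorn(m) for m in raw]

P = H[0]
for M in H[1:]:
    P = matmul(M, P)          # running product, playing the role of H_{t<-s}^res

err = max_marginal_error(P)
print(err)                    # stays near zero however many factors are multiplied
```

Because every entry of the product is non-negative and each row sums to 1, no entry can exceed 1, so the product can neither explode nor flip sign.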

Many Hands Make Light Work

Let’s return to AttnRes. After realizing the feasibility of AttnRes, the next question is: what form should a_{t+1,s} take? A natural idea is to follow the standard Scaled Dot-Product Attention, but at the time the author wanted to do a quick test, so he chose a simpler form: a_{t+1,s} \propto \exp(\boldsymbol{w}_{t+1}\cdot \boldsymbol{y}_s) where \boldsymbol{w}_t is a trainable vector parameter. That is, we directly use a data-independent static vector as Q, and both K and V are \boldsymbol{y}_s, to perform Softmax Attention. This was the first version of AttnRes. Surprisingly, this simple design already brought a very significant improvement over Residuals!

When the author shared the preliminary experimental results of AttnRes within the group, @Zhang Yu and @Guangyu showed great interest and joined in. They started verifying on larger-scale models and found the results very encouraging. During this period, we also tried some more complex designs, but most were inferior to this simple version. Only adding an \mathop{\mathrm{RMSNorm}} operation to K yielded a relatively stable gain, forming the final AttnRes form: a_{t+1,s} \propto \exp(\boldsymbol{w}_{t+1}\cdot \mathop{\mathrm{RMSNorm}}(\boldsymbol{y}_s))
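To make the final form concrete, here is a minimal forward-pass sketch in plain Python. It is not the implementation from the paper: `toy_layer` is a hypothetical stand-in for \boldsymbol{f}_t, RMSNorm is used without a learned gain, and all numbers are made up. Only the weighting rule a_{t+1,s} \propto \exp(\boldsymbol{w}_{t+1}\cdot \mathop{\mathrm{RMSNorm}}(\boldsymbol{y}_s)) follows the text.

```python
import math

def rms_norm(x):
    """RMSNorm without a learned gain (assumes x is not all zeros)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [v / rms for v in x]

def softmax(logits):
    m = max(logits)                              # subtract the max for stability
    e = [math.exp(z - m) for z in logits]
    total = sum(e)
    return [v / total for v in e]

def attnres_weights(w, ys):
    """a_{t+1,s} proportional to exp(w_{t+1} . RMSNorm(y_s)) for s = 0..t."""
    logits = [sum(a * b for a, b in zip(w, rms_norm(y))) for y in ys]
    return softmax(logits)

def attnres_input(w, ys):
    """Weighted sum of all earlier outputs: the input handed to f_{t+1}."""
    a = attnres_weights(w, ys)
    return [sum(a[s] * ys[s][i] for s in range(len(ys))) for i in range(len(ys[0]))]

def toy_layer(t, x):
    """Hypothetical stand-in for f_t: RMSNorm first, then a toy nonlinearity."""
    return [math.tanh(v + 0.1 * t) for v in rms_norm(x)]

def full_attnres_forward(x0, ws):
    ys = [x0]                                    # y_0 is the embedding
    for t, w in enumerate(ws, start=1):          # ws[t-1] plays the role of w_t
        ys.append(toy_layer(t, attnres_input(w, ys)))
    return ys

x0 = [1.0, -0.5, 2.0, 0.3]
ws = [[0.2, -0.1, 0.0, 0.4], [0.0, 0.3, -0.2, 0.1], [-0.3, 0.1, 0.2, 0.0]]
ys = full_attnres_forward(x0, ws)
weights_last = attnres_weights(ws[-1], ys[:-1])  # what the third layer attended to
print(weights_last)
```

Each layer adds only one vector \boldsymbol{w}_t of parameters, and the softmax guarantees both constraints (non-negative weights that sum to 1) by construction.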

However, AttnRes is after all a dense inter-layer connection scheme. Is it feasible for training and inference at the scale of K2 or even larger? Encouragingly, @V (Brother V), through clever analysis, first confirmed the feasibility of inference, and the “stroke of genius” was precisely the static Q design we adopted for convenience! This allowed us to precompute the attention a_{t,s} for t > s right after computing \boldsymbol{y}_s, providing enough wiggle room for infra.
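The role of the static Q can be sketched as follows. This is only an illustration in plain Python of the precomputation the paragraph describes, not the actual inference strategy (for that, see @V’s analysis): because \boldsymbol{w}_t does not depend on the data, the logits \boldsymbol{w}_t\cdot\mathop{\mathrm{RMSNorm}}(\boldsymbol{y}_s) for every later layer t > s can be filled in as one batch as soon as \boldsymbol{y}_s exists. `toy_layer` and all values are hypothetical.

```python
import math

def rms_norm(x):
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [v / rms for v in x]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(logits):
    m = max(logits)
    e = [math.exp(z - m) for z in logits]
    total = sum(e)
    return [v / total for v in e]

def toy_layer(t, x):
    """Hypothetical stand-in for f_t."""
    return [math.tanh(v + 0.1 * t) for v in rms_norm(x)]

def forward_with_cached_logits(x0, ws):
    num_layers = len(ws)                          # ws[t-1] plays the role of w_t
    logits = [[] for _ in range(num_layers + 1)]  # logits[t][s] = w_t . RMSNorm(y_s)
    ys = [x0]
    for s in range(num_layers):
        key = rms_norm(ys[s])                     # normalize y_s once
        for t in range(s + 1, num_layers + 1):    # fill column s for every later layer
            logits[t].append(dot(ws[t - 1], key))
        a = softmax(logits[s + 1])                # row s+1 is complete at this point
        x = [sum(a[j] * ys[j][i] for j in range(s + 1)) for i in range(len(x0))]
        ys.append(toy_layer(s + 1, x))
    return ys, logits

x0 = [1.0, -0.5, 2.0, 0.3]
ws = [[0.2, -0.1, 0.0, 0.4], [0.0, 0.3, -0.2, 0.1], [-0.3, 0.1, 0.2, 0.0]]
ys, logits = forward_with_cached_logits(x0, ws)
```

With a data-dependent Q, the inner loop over t could not run until the later layers had produced their own queries, so this batching would not be available.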

Unfortunately, our training colleague @Wang (Brother Wang), after careful analysis, judged that under our current training environment, AttnRes was still not feasible enough (in short, we were too poor), and we needed a solution to further reduce communication and memory. Hence the Block version below; correspondingly, we call the previous version the Full version.

Block Version

Going from Full AttnRes to Block AttnRes is analogous to the process of linearizing quadratic attention. Various existing efficient attention ideas can be tried; for example, our first attempt was SWA (Sliding Window Attention). However, we found the actual results were very poor, even worse than Residuals.

After reflection, the author believes it can be understood as follows: Residuals itself is already a very strong baseline, corresponding to an equal-weight sum of all state vectors. Any new design that wants to surpass it must, at least in form, be able to cover it. Full AttnRes clearly satisfies this condition, but adding SWA does not, because it discards some states and cannot cover the special case of “equal-weight sum of all state vectors”.
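This coverage argument can be illustrated with a few lines of plain Python. In the sketch below (toy vectors; the `window` argument is a simplified, assumed form of SWA over layers), setting \boldsymbol{w}=\boldsymbol{0} makes Full AttnRes weight every state equally, which after RMSNorm is the same input as Residuals, whereas a sliding window forces the early states to weight 0 no matter what \boldsymbol{w} is.

```python
import math

def rms_norm(x):
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [v / rms for v in x]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(logits):
    m = max(logits)
    e = [math.exp(z - m) for z in logits]
    total = sum(e)
    return [v / total for v in e]

def mix(weights, ys):
    return [sum(w * y[i] for w, y in zip(weights, ys)) for i in range(len(ys[0]))]

def attn_weights(w, ys, window=None):
    """Softmax weights over layer outputs; `window` keeps only the last few states."""
    logits = [dot(w, rms_norm(y)) for y in ys]
    start = 0 if window is None else max(0, len(ys) - window)
    return [0.0] * start + softmax(logits[start:])

ys = [[1.0, -2.0, 0.5], [0.3, 0.1, -0.7], [2.0, 1.0, 1.5],
      [-0.4, 0.6, 0.2], [0.9, -0.3, 1.1]]
zero_w = [0.0, 0.0, 0.0]

full = attn_weights(zero_w, ys)             # equal weight on all five states
swa = attn_weights(zero_w, ys, window=2)    # the first three states are dropped

residual_input = rms_norm([sum(col) for col in zip(*ys)])
full_input = rms_norm(mix(full, ys))
gap = max(abs(a - b) for a, b in zip(residual_input, full_input))
print(full, swa, gap)
```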

This made us realize that for AttnRes, “compression” may be more effective than “sparsification,” and the compression need not be too fine-grained; simple weighted summation may suffice. After some conceptualization and polishing, @Zhang Yu and @Guangyu proposed the Block AttnRes design in the paper, which combines block-wise processing and summation compression, achieving performance close to the Full version.

The idea of Block AttnRes is roughly as follows: First, the embedding layer is treated as a separate block, because by observing the attention matrix of the Full version (another benefit of the attention concept: we can visualize attention patterns at any time), we found that the model tends to give considerable attention to the embedding layer, so it is necessary to isolate it. The remaining layers are grouped every m layers as a block. Within each block, compression is done by summation, and inter-block attention is computed using the summation results as units.
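The grouping described above can be sketched in plain Python as follows. This is a simplified illustration, not the paper’s implementation: the helper names are made up, the last block is allowed to be partial (it holds the layers computed so far), and the softmax is taken over blocks, which is an assumption; by the scale invariance of RMSNorm a different normalization would only rescale the layer input.

```python
import math

def rms_norm(x):
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [v / rms for v in x]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(logits):
    m = max(logits)
    e = [math.exp(z - m) for z in logits]
    total = sum(e)
    return [v / total for v in e]

def block_units(ys, m):
    """y_0 alone, then y_1, y_2, ... in blocks of m, summed inside each block."""
    units = [ys[0]]
    for start in range(1, len(ys), m):
        block = ys[start:start + m]               # may be shorter for the last block
        units.append([sum(col) for col in zip(*block)])
    return units

def block_attnres_input(w, ys, m):
    """Attention over block sums instead of over individual layer outputs."""
    units = block_units(ys, m)
    a = softmax([dot(w, rms_norm(u)) for u in units])
    return [sum(a[j] * units[j][i] for j in range(len(units))) for i in range(len(w))]

# Seven toy states y_0..y_6 (arbitrary values), blocks of m = 3.
ys = [[math.sin(1.3 * s + 0.7 * i) + 0.5 for i in range(4)] for s in range(7)]
units = block_units(ys, 3)                        # y_0, y_{1:3}, y_{4:6}

# With w = 0 every block gets the same weight, so the input is proportional
# to the plain residual sum: Block AttnRes still covers Residuals.
block_in = rms_norm(block_attnres_input([0.0] * 4, ys, 3))
residual_in = rms_norm([sum(col) for col in zip(*ys)])
gap = max(abs(a - b) for a, b in zip(block_in, residual_in))
print(len(units), gap)
```

Only one running sum per block has to be kept, so the number of states to store and communicate is the number of blocks, not the number of layers.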

Experiments show that simply dividing into about 8 blocks yields most of the gains of AttnRes. After evaluation, both training and inference colleagues agreed that the additional overhead of Block AttnRes is very small and well worth the performance improvement (for detailed analysis, see @Wang’s and @V’s posts; to give a number, roughly within 5% overhead for a 25% gain). So everyone pushed hard to integrate it into the main branch, another fulfilling and enjoyable experience that we won’t elaborate on here.

Matrix Perspective

It is worth mentioning that we can unify Residuals, HC/mHC, Full AttnRes, and Block AttnRes through the attention matrix, which provides an interesting perspective. Examples are shown below. Here \phi(\boldsymbol{q},\boldsymbol{k}) = \exp(\boldsymbol{q}\cdot \mathop{\mathrm{RMSNorm}}(\boldsymbol{k})), the Block AttnRes version corresponds to m=3, and \boldsymbol{y}_{s:t}=\sum_{i=s}^t \boldsymbol{y}_i, a notation we also used in the article “Making Alchemy More Scientific (Part 4): New Identity, New Learning Rate”.

Residuals

\boldsymbol{A}=\left(\begin{array}{ccccccc} 1 & & & & & & \\ 1 & 1 & & & & & \\ 1 & 1 & 1 & & & & \\ 1 & 1 & 1 & 1 & & & \\ 1 & 1 & 1 & 1 & 1 & & \\ 1 & 1 & 1 & 1 & 1 & 1 & \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ \end{array}\right)

HC/mHC

\boldsymbol{A}=\left(\begin{array}{ccccccc} \boldsymbol{H}_1^{pre} \boldsymbol{H}_0^{post} \\ \boldsymbol{H}_2^{pre}\boldsymbol{H}_{1\leftarrow 1}^{res}\boldsymbol{H}_0^{post} & \boldsymbol{H}_2^{pre}\boldsymbol{H}_1^{post} \\ \boldsymbol{H}_3^{pre}\boldsymbol{H}_{2\leftarrow 1}^{res}\boldsymbol{H}_0^{post} & \boldsymbol{H}_3^{pre}\boldsymbol{H}_{2\leftarrow 2}^{res}\boldsymbol{H}_1^{post} & \boldsymbol{H}_3^{pre}\boldsymbol{H}_2^{post} \\ \boldsymbol{H}_4^{pre}\boldsymbol{H}_{3\leftarrow 1}^{res}\boldsymbol{H}_0^{post} & \boldsymbol{H}_4^{pre}\boldsymbol{H}_{3\leftarrow 2}^{res}\boldsymbol{H}_1^{post} & \boldsymbol{H}_4^{pre}\boldsymbol{H}_{3\leftarrow 3}^{res}\boldsymbol{H}_2^{post} & \boldsymbol{H}_4^{pre}\boldsymbol{H}_3^{post} \\ \boldsymbol{H}_5^{pre}\boldsymbol{H}_{4\leftarrow 1}^{res}\boldsymbol{H}_0^{post} & \boldsymbol{H}_5^{pre}\boldsymbol{H}_{4\leftarrow 2}^{res}\boldsymbol{H}_1^{post} & \boldsymbol{H}_5^{pre}\boldsymbol{H}_{4\leftarrow 3}^{res}\boldsymbol{H}_2^{post} & \boldsymbol{H}_5^{pre}\boldsymbol{H}_{4\leftarrow 4}^{res}\boldsymbol{H}_3^{post} & \boldsymbol{H}_5^{pre}\boldsymbol{H}_4^{post} \\ \boldsymbol{H}_6^{pre}\boldsymbol{H}_{5\leftarrow 1}^{res}\boldsymbol{H}_0^{post} & \boldsymbol{H}_6^{pre}\boldsymbol{H}_{5\leftarrow 2}^{res}\boldsymbol{H}_1^{post} & \boldsymbol{H}_6^{pre}\boldsymbol{H}_{5\leftarrow 3}^{res}\boldsymbol{H}_2^{post} & \boldsymbol{H}_6^{pre}\boldsymbol{H}_{5\leftarrow 4}^{res}\boldsymbol{H}_3^{post} & \boldsymbol{H}_6^{pre}\boldsymbol{H}_{5\leftarrow 5}^{res}\boldsymbol{H}_4^{post} & \boldsymbol{H}_6^{pre}\boldsymbol{H}_5^{post} \\ \boldsymbol{H}_7^{pre}\boldsymbol{H}_{6\leftarrow 1}^{res}\boldsymbol{H}_0^{post} & \boldsymbol{H}_7^{pre}\boldsymbol{H}_{6\leftarrow 2}^{res}\boldsymbol{H}_1^{post} & \boldsymbol{H}_7^{pre}\boldsymbol{H}_{6\leftarrow 3}^{res}\boldsymbol{H}_2^{post} & \boldsymbol{H}_7^{pre}\boldsymbol{H}_{6\leftarrow 4}^{res}\boldsymbol{H}_3^{post} & \boldsymbol{H}_7^{pre}\boldsymbol{H}_{6\leftarrow 5}^{res}\boldsymbol{H}_4^{post} & \boldsymbol{H}_7^{pre}\boldsymbol{H}_{6\leftarrow 6}^{res}\boldsymbol{H}_5^{post} & \boldsymbol{H}_7^{pre}\boldsymbol{H}_6^{post} \\ \end{array}\right)

Full AttnRes

\boldsymbol{A}=\left(\begin{array}{ccccccc} \phi(\boldsymbol{w}_1, \boldsymbol{y}_0) \\ \phi(\boldsymbol{w}_2, \boldsymbol{y}_0) & \phi(\boldsymbol{w}_2, \boldsymbol{y}_1) \\ \phi(\boldsymbol{w}_3, \boldsymbol{y}_0) & \phi(\boldsymbol{w}_3, \boldsymbol{y}_1) & \phi(\boldsymbol{w}_3, \boldsymbol{y}_2) \\ \phi(\boldsymbol{w}_4, \boldsymbol{y}_0) & \phi(\boldsymbol{w}_4, \boldsymbol{y}_1) & \phi(\boldsymbol{w}_4, \boldsymbol{y}_2) & \phi(\boldsymbol{w}_4, \boldsymbol{y}_3) \\ \phi(\boldsymbol{w}_5, \boldsymbol{y}_0) & \phi(\boldsymbol{w}_5, \boldsymbol{y}_1) & \phi(\boldsymbol{w}_5, \boldsymbol{y}_2) & \phi(\boldsymbol{w}_5, \boldsymbol{y}_3) & \phi(\boldsymbol{w}_5, \boldsymbol{y}_4) \\ \phi(\boldsymbol{w}_6, \boldsymbol{y}_0) & \phi(\boldsymbol{w}_6, \boldsymbol{y}_1) & \phi(\boldsymbol{w}_6, \boldsymbol{y}_2) & \phi(\boldsymbol{w}_6, \boldsymbol{y}_3) & \phi(\boldsymbol{w}_6, \boldsymbol{y}_4) & \phi(\boldsymbol{w}_6, \boldsymbol{y}_5) \\ \phi(\boldsymbol{w}_7, \boldsymbol{y}_0) & \phi(\boldsymbol{w}_7, \boldsymbol{y}_1) & \phi(\boldsymbol{w}_7, \boldsymbol{y}_2) & \phi(\boldsymbol{w}_7, \boldsymbol{y}_3) & \phi(\boldsymbol{w}_7, \boldsymbol{y}_4) & \phi(\boldsymbol{w}_7, \boldsymbol{y}_5) & \phi(\boldsymbol{w}_7, \boldsymbol{y}_6) \\ \end{array}\right)

Block AttnRes

\boldsymbol{A}=\left(\begin{array}{c|ccc|ccc} \phi(\boldsymbol{w}_1, \boldsymbol{y}_0) \\ \hline \phi(\boldsymbol{w}_2, \boldsymbol{y}_0) & \phi(\boldsymbol{w}_2, \boldsymbol{y}_1) \\ \phi(\boldsymbol{w}_3, \boldsymbol{y}_0) & \phi(\boldsymbol{w}_3, \boldsymbol{y}_{1:2}) & \phi(\boldsymbol{w}_3, \boldsymbol{y}_{1:2}) \\ \phi(\boldsymbol{w}_4, \boldsymbol{y}_0) & \phi(\boldsymbol{w}_4, \boldsymbol{y}_{1:3}) & \phi(\boldsymbol{w}_4, \boldsymbol{y}_{1:3}) & \phi(\boldsymbol{w}_4, \boldsymbol{y}_{1:3}) \\ \hline \phi(\boldsymbol{w}_5, \boldsymbol{y}_0) & \phi(\boldsymbol{w}_5, \boldsymbol{y}_{1:3}) & \phi(\boldsymbol{w}_5, \boldsymbol{y}_{1:3}) & \phi(\boldsymbol{w}_5, \boldsymbol{y}_{1:3}) & \phi(\boldsymbol{w}_5, \boldsymbol{y}_4)\\ \phi(\boldsymbol{w}_6, \boldsymbol{y}_0) & \phi(\boldsymbol{w}_6, \boldsymbol{y}_{1:3}) & \phi(\boldsymbol{w}_6, \boldsymbol{y}_{1:3}) & \phi(\boldsymbol{w}_6, \boldsymbol{y}_{1:3}) & \phi(\boldsymbol{w}_6, \boldsymbol{y}_{4:5}) & \phi(\boldsymbol{w}_6, \boldsymbol{y}_{4:5}) \\ \phi(\boldsymbol{w}_7, \boldsymbol{y}_0) & \phi(\boldsymbol{w}_7, \boldsymbol{y}_{1:3}) & \phi(\boldsymbol{w}_7, \boldsymbol{y}_{1:3}) & \phi(\boldsymbol{w}_7, \boldsymbol{y}_{1:3}) & \phi(\boldsymbol{w}_7, \boldsymbol{y}_{4:6}) & \phi(\boldsymbol{w}_7, \boldsymbol{y}_{4:6}) & \phi(\boldsymbol{w}_7, \boldsymbol{y}_{4:6}) \\ \end{array}\right)
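The pattern of keys in the Block AttnRes matrix can be generated mechanically. The sketch below is plain Python with hypothetical helper names; it assumes the layout shown above (the embedding \boldsymbol{y}_0 as its own block, then blocks of m layers, with the current block clipped to the layers already computed) and prints, for each row t and column s, which \boldsymbol{y}_{lo:hi} serves as the key.

```python
def block_key_range(t, s, m):
    """Indices (lo, hi) of the summed key y_{lo:hi} seen by layer t at column s."""
    if s == 0:
        return (0, 0)                   # the embedding is its own block
    lo = ((s - 1) // m) * m + 1         # first layer index of the block containing s
    hi = min(lo + m - 1, t - 1)         # block end, clipped to what exists before layer t
    return (lo, hi)

def label(lo, hi):
    return f"y{lo}" if lo == hi else f"y{lo}:{hi}"

def block_pattern(num_layers, m):
    """Row t (1-based) lists the key used at each column s = 0..t-1."""
    return [[label(*block_key_range(t, s, m)) for s in range(t)]
            for t in range(1, num_layers + 1)]

rows = block_pattern(7, 3)
for row in rows:
    print("  ".join(row))
```

Taking m=1 makes every block a single layer and gives back the key pattern of Full AttnRes.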

Conclusion

This article introduced our latest result on model architecture, Attention Residuals (AttnRes), which replaces plain residuals with inter-layer attention and, through careful design, meets the efficiency requirements for training and inference, ultimately successfully scaling it to sufficiently large models.

For reprinting, please include the original article link: https://kexue.fm/archives/11664