From the article "A Brief History of Linear Attention: From Imitation and Innovation to Feedback", we can see that the parallel form of DeltaNet involves an inverse matrix of the form (\boldsymbol{I} + \boldsymbol{K}\boldsymbol{K}^{\top}\odot \boldsymbol{M}^-)^{-1}. Recently, a reader @Arch123 pointed out that through experiments, it can be observed that the elements of this inverse matrix are always within the range [-1, 1], and asked whether this could be mathematically proven or disproven.
In this article, we will prove that this conclusion strictly holds through two different methods.
Problem Description
First, let us accurately restate the problem. Suppose we have a matrix \boldsymbol{K}=[\boldsymbol{k}_1,\boldsymbol{k}_2,\cdots,\boldsymbol{k}_n]^{\top}\in\mathbb{R}^{n\times d}, where each \boldsymbol{k}_i\in\mathbb{R}^{d\times 1} is a column vector with a norm not exceeding 1. Let \boldsymbol{M}\in\mathbb{R}^{n\times n} be a lower triangular mask matrix defined as: M_{i,j} = \left\{\begin{aligned} &1, &i \geq j \\ &0, &i < j\end{aligned}\right. \boldsymbol{I} is the identity matrix, and \boldsymbol{M}^- = \boldsymbol{M} - \boldsymbol{I}. We want to prove that: (\boldsymbol{I} + \boldsymbol{K}\boldsymbol{K}^{\top}\odot \boldsymbol{M}^-)^{-1}\quad\in\quad [-1, 1]^{n\times n} Let \boldsymbol{X}=\boldsymbol{I} + \boldsymbol{K}\boldsymbol{K}^{\top}\odot \boldsymbol{M}^-, and denote its inverse matrix as \boldsymbol{Y}. Writing \boldsymbol{X} explicitly gives: \boldsymbol{X}=\boldsymbol{I} + \boldsymbol{K}\boldsymbol{K}^{\top}\odot \boldsymbol{M}^- = \left(\begin{array}{cccccc} 1 & 0 & 0 & \cdots & 0 & 0 \\[5pt] \boldsymbol{k}_1^{\top}\boldsymbol{k}_2 & 1 & 0 & \cdots & 0 & 0 \\[5pt] \boldsymbol{k}_1^{\top}\boldsymbol{k}_3 & \boldsymbol{k}_2^{\top}\boldsymbol{k}_3 & 1 & \cdots & 0 & 0 \\[5pt] \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\[5pt] \boldsymbol{k}_1^{\top}\boldsymbol{k}_{n-1} & \boldsymbol{k}_2^{\top}\boldsymbol{k}_{n-1} & \boldsymbol{k}_3^{\top}\boldsymbol{k}_{n-1} & \cdots & 1 & 0 \\[5pt] \boldsymbol{k}_1^{\top}\boldsymbol{k}_n & \boldsymbol{k}_2^{\top}\boldsymbol{k}_n & \boldsymbol{k}_3^{\top}\boldsymbol{k}_n & \cdots & \boldsymbol{k}_{n-1}^{\top}\boldsymbol{k}_n & 1 \end{array}\right) Since the norm of each \boldsymbol{k}_i does not exceed 1, it is obvious that every element of \boldsymbol{X} is within [-1, 1]. We aim to prove that every element of \boldsymbol{Y}=\boldsymbol{X}^{-1} is also within [-1, 1].
Mathematical Induction
Since this inverse matrix appears in DeltaNet, it certainly has a clear model background. Combining this background might help us prove it more quickly, which we will leave for the next section. In this section, we will treat it as a pure mathematical problem.
Diagonal Properties
In "Efficient Inversion Methods for ’Diagonal + Low-Rank’ Triangular Matrices", we have seen that lower triangular matrices have several excellent properties. First, they are closed under matrix multiplication. Second, they satisfy the property that "the diagonal of the inverse matrix equals the inverse of the diagonal." This can be generalized to block lower triangular matrices: "the diagonal blocks of the inverse matrix equal the inverses of the diagonal blocks."
From this, we have \boldsymbol{Y}_{[:n-1,:n-1]}=(\boldsymbol{X}_{[:n-1,:n-1]})^{-1} and \boldsymbol{Y}_{[1:,1:]}=(\boldsymbol{X}_{[1:,1:]})^{-1} (where slicing is understood as in NumPy). This property suggests that we can use mathematical induction on n. Assuming the conclusion holds for any (n-1)\times d matrix \boldsymbol{K} satisfying the conditions, for the matrix \boldsymbol{X}, we only need to consider the following two partitioning methods: \boldsymbol{X} = \left(\begin{array}{ccccc|c} 1 & 0 & 0 & \cdots & 0 & 0 \\[5pt] \boldsymbol{k}_1^{\top}\boldsymbol{k}_2 & 1 & 0 & \cdots & 0 & 0 \\[5pt] \boldsymbol{k}_1^{\top}\boldsymbol{k}_3 & \boldsymbol{k}_2^{\top}\boldsymbol{k}_3 & 1 & \cdots & 0 & 0 \\[5pt] \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\[5pt] \boldsymbol{k}_1^{\top}\boldsymbol{k}_{n-1} & \boldsymbol{k}_2^{\top}\boldsymbol{k}_{n-1} & \boldsymbol{k}_3^{\top}\boldsymbol{k}_{n-1} & \cdots & 1 & 0 \\[5pt] \hline \boldsymbol{k}_1^{\top}\boldsymbol{k}_n & \boldsymbol{k}_2^{\top}\boldsymbol{k}_n & \boldsymbol{k}_3^{\top}\boldsymbol{k}_n & \cdots & \boldsymbol{k}_{n-1}^{\top}\boldsymbol{k}_n & 1 \end{array}\right) = \left(\begin{array}{c|cccc} 1 & 0 & 0 & \cdots & 0 \\[5pt] \hline \boldsymbol{k}_1^{\top}\boldsymbol{k}_2 & 1 & 0 & \cdots & 0 \\[5pt] \boldsymbol{k}_1^{\top}\boldsymbol{k}_3 & \boldsymbol{k}_2^{\top}\boldsymbol{k}_3 & 1 & \cdots & 0 \\[5pt] \vdots & \vdots & \vdots & \ddots & \vdots \\[5pt] \boldsymbol{k}_1^{\top}\boldsymbol{k}_{n-1} & \boldsymbol{k}_2^{\top}\boldsymbol{k}_{n-1} & \boldsymbol{k}_3^{\top}\boldsymbol{k}_{n-1} & \cdots & 1 \\[5pt] \boldsymbol{k}_1^{\top}\boldsymbol{k}_n & \boldsymbol{k}_2^{\top}\boldsymbol{k}_n & \boldsymbol{k}_3^{\top}\boldsymbol{k}_n & \cdots & \boldsymbol{k}_{n-1}^{\top}\boldsymbol{k}_n \end{array}\right) Then, based on \boldsymbol{Y}_{[:n-1,:n-1]}=(\boldsymbol{X}_{[:n-1,:n-1]})^{-1} and \boldsymbol{Y}_{[1:,1:]}=(\boldsymbol{X}_{[1:,1:]})^{-1}, we can prove that all elements of \boldsymbol{Y} except for the bottom-left corner Y_{n,1} are within [-1, 1]. Thus, we only need to provide an additional proof that Y_{n,1}\in [-1,1].
Adjugate Matrix
To prove Y_{n,1}\in [-1,1], we consider using the adjugate matrix to represent \boldsymbol{X}^{-1} explicitly. Since \boldsymbol{X} is a lower triangular matrix with 1s on the diagonal, its determinant is 1. Therefore, according to the formula for the inverse using the adjugate matrix, we have: Y_{n,1} = (-1)^{n+1}\det\left(\begin{array}{ccccc} \boldsymbol{k}_1^{\top}\boldsymbol{k}_2 & 1 & 0 & \cdots & 0 \\[5pt] \boldsymbol{k}_1^{\top}\boldsymbol{k}_3 & \boldsymbol{k}_2^{\top}\boldsymbol{k}_3 & 1 & \cdots & 0 \\[5pt] \boldsymbol{k}_1^{\top}\boldsymbol{k}_4 & \boldsymbol{k}_2^{\top}\boldsymbol{k}_4 & \boldsymbol{k}_3^{\top}\boldsymbol{k}_4 & \cdots & 0 \\[5pt] \vdots & \vdots & \vdots & \ddots & \vdots \\[5pt] \boldsymbol{k}_1^{\top}\boldsymbol{k}_{n-1} & \boldsymbol{k}_2^{\top}\boldsymbol{k}_{n-1} & \boldsymbol{k}_3^{\top}\boldsymbol{k}_{n-1} & \cdots & 1 \\[5pt] \boldsymbol{k}_1^{\top}\boldsymbol{k}_n & \boldsymbol{k}_2^{\top}\boldsymbol{k}_n & \boldsymbol{k}_3^{\top}\boldsymbol{k}_n & \cdots & \boldsymbol{k}_{n-1}^{\top}\boldsymbol{k}_n \end{array}\right) This is essentially a lower triangular matrix where the sub-diagonal elements have become 1, making the determinant calculation slightly more complex, but not excessively so. Since the last column has only two non-zero elements, expanding the determinant along the last column transforms it into the difference of two determinants of order n-1. This characteristic again leads us toward mathematical induction.
Product Form
We can try to calculate the first few results by hand: \begin{gathered} Y_{2,1} = -\boldsymbol{k}_1^{\top}\boldsymbol{k}_2 ,\qquad Y_{3,1} = \det\left(\begin{array}{cc} \boldsymbol{k}_1^{\top}\boldsymbol{k}_2 & 1 \\[5pt] \boldsymbol{k}_1^{\top}\boldsymbol{k}_3 & \boldsymbol{k}_2^{\top}\boldsymbol{k}_3 \end{array}\right) = \boldsymbol{k}_1^{\top}\boldsymbol{k}_2\boldsymbol{k}_2^{\top}\boldsymbol{k}_3 - \boldsymbol{k}_1^{\top}\boldsymbol{k}_3 = -\boldsymbol{k}_1^{\top}(\boldsymbol{I}-\boldsymbol{k}_2\boldsymbol{k}_2^{\top})\boldsymbol{k}_3 \\[8pt] Y_{4,1} = -\det\left(\begin{array}{ccc} \boldsymbol{k}_1^{\top}\boldsymbol{k}_2 & 1 & 0 \\[5pt] \boldsymbol{k}_1^{\top}\boldsymbol{k}_3 & \boldsymbol{k}_2^{\top}\boldsymbol{k}_3 & 1 \\[5pt] \boldsymbol{k}_1^{\top}\boldsymbol{k}_4 & \boldsymbol{k}_2^{\top}\boldsymbol{k}_4 & \boldsymbol{k}_3^{\top}\boldsymbol{k}_4 \end{array}\right) = \det\left(\begin{array}{cc} \boldsymbol{k}_1^{\top}\boldsymbol{k}_2 & 1 \\[5pt] \boldsymbol{k}_1^{\top}\boldsymbol{k}_4 & \boldsymbol{k}_2^{\top}\boldsymbol{k}_4 \end{array}\right) -\det\left(\begin{array}{cc} \boldsymbol{k}_1^{\top}\boldsymbol{k}_2 & 1 \\[5pt] \boldsymbol{k}_1^{\top}\boldsymbol{k}_3 & \boldsymbol{k}_2^{\top}\boldsymbol{k}_3 \end{array}\right) \boldsymbol{k}_3^{\top}\boldsymbol{k}_4 = -\boldsymbol{k}_1^{\top}(\boldsymbol{I}-\boldsymbol{k}_2\boldsymbol{k}_2^{\top})(\boldsymbol{I}-\boldsymbol{k}_3\boldsymbol{k}_3^{\top})\boldsymbol{k}_4 \end{gathered} From this, we can conjecture: Y_{n,1} = -\boldsymbol{k}_1^{\top}(\boldsymbol{I}-\boldsymbol{k}_2\boldsymbol{k}_2^{\top})(\boldsymbol{I}-\boldsymbol{k}_3\boldsymbol{k}_3^{\top})\cdots (\boldsymbol{I}-\boldsymbol{k}_{n-1}\boldsymbol{k}_{n-1}^{\top})\boldsymbol{k}_n The reader is invited to prove this conjecture using mathematical induction. Assuming the proof is complete, then according to the condition \|\boldsymbol{k}_i\|\leq 1, we have \|\mathbf{(I}-\boldsymbol{k}_i\boldsymbol{k}_i^{\top})\boldsymbol{x}\|^2 = \|\boldsymbol{x}\|^2 + (\boldsymbol{k}_i^{\top} \boldsymbol{x})^2 (\|\boldsymbol{k}_i\|^2 - 2) \leq \|\boldsymbol{x}\|^2. Therefore: |Y_{n,1}| \leq \|\boldsymbol{k}_1\|\times \|(\boldsymbol{I}-\boldsymbol{k}_2\boldsymbol{k}_2^{\top})(\boldsymbol{I}-\boldsymbol{k}_3\boldsymbol{k}_3^{\top})\cdots (\boldsymbol{I}-\boldsymbol{k}_{n-1}\boldsymbol{k}_{n-1}^{\top})\boldsymbol{k}_n\| \leq \|\boldsymbol{k}_1\|\times \|\boldsymbol{k}_n\| \leq 1 This completes the proof that Y_{n,1}\in [-1,1], and by mathematical induction, \boldsymbol{Y}\in[-1, 1]^{n\times n} holds true.
Double Calculation
The direct proof in the previous section was somewhat "brute force." Later, @zhxlin from the FLA group reminded me that we can return to the original background of this inverse matrix—DeltaNet—and calculate DeltaNet in two different ways. By comparing the results, we can derive the explicit expression for the inverse matrix, leading to a relatively more elegant proof.
Inverse Matrix Form
We know that the recursive form of DeltaNet is: \boldsymbol{S}_t = \boldsymbol{S}_{t-1}(\boldsymbol{I} - \boldsymbol{k}_t\boldsymbol{k}_t^{\top}) + \boldsymbol{v}_t\boldsymbol{k}_t^{\top} = \boldsymbol{S}_{t-1} + (\boldsymbol{v}_t - \boldsymbol{S}_{t-1} \boldsymbol{k}_t)\boldsymbol{k}_t^{\top} where \boldsymbol{S}_0 = \boldsymbol{0}. Looking at the second equality, let \boldsymbol{u}_t=\boldsymbol{v}_t - \boldsymbol{S}_{t-1} \boldsymbol{k}_t, then: \boldsymbol{S}_t = \boldsymbol{S}_{t-1} + \boldsymbol{u}_t\boldsymbol{k}_t^{\top} = \sum_{i=1}^t \boldsymbol{u}_i\boldsymbol{k}_i^{\top} So: \boldsymbol{u}_t = \boldsymbol{v}_t - \boldsymbol{S}_{t-1} \boldsymbol{k}_t = \boldsymbol{v}_t - \left(\sum_{i=1}^{t-1} \boldsymbol{u}_i\boldsymbol{k}_i^{\top}\right) \boldsymbol{k}_t = \boldsymbol{v}_t - \sum_{i=1}^{t-1} \boldsymbol{u}_i(\boldsymbol{k}_i^{\top} \boldsymbol{k}_t) Defining \boldsymbol{U}=[\boldsymbol{u}_1,\boldsymbol{u}_2,\cdots,\boldsymbol{u}_n]^{\top} and \boldsymbol{V}=[\boldsymbol{v}_1,\boldsymbol{v}_2,\cdots,\boldsymbol{v}_n]^{\top}, the above equation is equivalent to \boldsymbol{U} = \boldsymbol{V} - (\boldsymbol{K}\boldsymbol{K}^{\top}\odot \boldsymbol{M}^-)\boldsymbol{U}, or \boldsymbol{U}=(\boldsymbol{I} + \boldsymbol{K}\boldsymbol{K}^{\top}\odot \boldsymbol{M}^-)^{-1}\boldsymbol{V}. This is the origin of the inverse matrix (\boldsymbol{I} + \boldsymbol{K}\boldsymbol{K}^{\top}\odot \boldsymbol{M}^-)^{-1}, which is key to the parallel computation of DeltaNet and subsequent works like GDN and KDA.
Recursive Expansion
Looking at the first equality, expanding it step-by-step gives: \boldsymbol{S}_t = \boldsymbol{v}_1\boldsymbol{k}_1^{\top} \boldsymbol{H}_{2\to t} + \cdots + \boldsymbol{v}_{t-1}\boldsymbol{k}_{t-1}^{\top} \boldsymbol{H}_{t\to t} + \boldsymbol{v}_t\boldsymbol{k}_t^{\top} = \sum_{i=1}^t \boldsymbol{v}_i\boldsymbol{k}_i^{\top} \boldsymbol{H}_{i+1\to t} Here \boldsymbol{H}_{r\to t} \triangleq (\boldsymbol{I} - \boldsymbol{k}_r\boldsymbol{k}_r^{\top})(\boldsymbol{I} - \boldsymbol{k}_{r+1}\boldsymbol{k}_{r+1}^{\top})\cdots(\boldsymbol{I} - \boldsymbol{k}_t\boldsymbol{k}_t^{\top}), with the convention \boldsymbol{H}_{t+1\to t}=\boldsymbol{I}.
Substituting this into the definition of \boldsymbol{u}_t: \boldsymbol{u}_t = \boldsymbol{v}_t - \boldsymbol{S}_{t-1} \boldsymbol{k}_t = \boldsymbol{v}_t - \sum_{i=1}^{t-1} \boldsymbol{v}_i\boldsymbol{k}_i^{\top} \boldsymbol{H}_{i+1\to t-1}\boldsymbol{k}_t
Comparison of Results
Again, let \boldsymbol{Y}=(\boldsymbol{I} + \boldsymbol{K}\boldsymbol{K}^{\top}\odot \boldsymbol{M}^-)^{-1}. Then \boldsymbol{U}=\boldsymbol{Y}\boldsymbol{V} gives \boldsymbol{u}_t = \sum_{i=1}^n Y_{t, i} \boldsymbol{v}_i. Comparing this with the previous equation, we can read off: Y_{t, i} = \left\{\begin{aligned}& 0, & i > t \\ & 1, & i = t \\ & -\boldsymbol{k}_i^{\top} \boldsymbol{H}_{i+1\to t-1}\boldsymbol{k}_t, & i < t \\ \end{aligned}\right. Therefore, we only need to prove |\boldsymbol{k}_i^{\top} \boldsymbol{H}_{i+1\to t-1}\boldsymbol{k}_t|\leq 1. This follows immediately from \|\boldsymbol{k}_i\|\leq 1 and the property \|(\boldsymbol{I}-\boldsymbol{k}_i\boldsymbol{k}_i^{\top})\boldsymbol{x}\| \leq \|\boldsymbol{x}\| proven in the first section: |\boldsymbol{k}_i^{\top} \boldsymbol{H}_{i+1\to t-1}\boldsymbol{k}_t| \leq \|\boldsymbol{k}_i\|\times \|\boldsymbol{H}_{i+1\to t-1}\boldsymbol{k}_t\| \leq \|\boldsymbol{k}_i\|\times \|\boldsymbol{k}_t\| \leq 1 Thus, the conclusion is proven.
Conclusion
This article provides two proofs for the boundedness of the core inverse matrix in DeltaNet.
When reposting, please include the original address of this article: https://kexue.fm/archives/11563
For more detailed reposting matters, please refer to: "Scientific Space FAQ"