When analyzing a model’s parameters, we sometimes treat them all as a single holistic vector, and at other times break them into pieces to examine individually. For example, the 7 billion parameters of a 7B LLaMA model can be viewed as “one 7-billion-dimensional vector,” or as “hundreds of vectors of various dimensions” following the model’s implementation, or, in the most extreme case, as “seven billion 1-dimensional vectors.” Since there are different ways of viewing the parameters, there are correspondingly different ways of computing a given statistic, namely local computation and global computation. This raises the question of how locally computed statistics relate to globally computed ones.
In this article, we are concerned with the cosine similarity of two vectors. If a large vector’s dimensions are split into several groups, and the cosine similarity of each pair of corresponding sub-vectors is very large, must the cosine similarity of the two large vectors also be large? The answer is no. As it turns out, this is related to the famous “Simpson’s Paradox.”
Problem Background
This problem originated from the author’s analysis of how the loss function changes under an optimizer’s parameter increment. Specifically, assume the optimizer’s update rule is: \begin{equation} \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t \boldsymbol{u}_t \end{equation} where \boldsymbol{u}_t is the update direction produced by the optimizer (applied along the negative direction). A first-order Taylor expansion then gives: \begin{equation} \mathcal{L}(\boldsymbol{\theta}_{t+1}) = \mathcal{L}(\boldsymbol{\theta}_t - \eta_t \boldsymbol{u}_t)\approx \mathcal{L}(\boldsymbol{\theta}_t) - \eta_t \langle\boldsymbol{u}_t,\boldsymbol{g}_t\rangle \end{equation} where \boldsymbol{g}_t = \nabla_{\boldsymbol{\theta}_t}\mathcal{L}(\boldsymbol{\theta}_t) is the gradient. The change in the loss function is therefore approximately: \begin{equation} - \eta_t \langle\boldsymbol{u}_t,\boldsymbol{g}_t\rangle = - \eta_t \Vert\boldsymbol{u}_t\Vert \Vert\boldsymbol{g}_t\Vert \cos(\boldsymbol{u}_t,\boldsymbol{g}_t) \end{equation} This is what led the author to observe the cosine similarity between \boldsymbol{u}_t and \boldsymbol{g}_t, i.e., the directional consistency between the update vector and the gradient.
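The two ways of computing this cosine can be sketched as follows (a minimal illustration, not the author’s actual code; the per-parameter update and gradient tensors are assumed to be given as lists of NumPy arrays):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two flat vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def global_cosine(updates, grads):
    """Global view: flatten all parameter tensors into one big vector each."""
    u = np.concatenate([t.ravel() for t in updates])
    g = np.concatenate([t.ravel() for t in grads])
    return cosine(u, g)

def local_cosines(updates, grads):
    """Local view: one cosine per parameter tensor."""
    return [cosine(u.ravel(), g.ravel()) for u, g in zip(updates, grads)]
```

For example, if the update tensors exactly equal the gradient tensors, the global cosine and every local cosine are all 1.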
But a problem arises: as mentioned at the beginning of this article, model parameters can be split in different ways. Should we treat all model parameters as one large vector and compute the cosine between the update vector and the gradient (globally), or compute it for each layer or each parameter tensor separately (locally)? The author did both, applying truncation based on the local cosine (ensuring that the cosine between the update vector and the gradient for every parameter exceeds a certain positive threshold), and then found that the global cosine was actually smaller than that threshold. At first glance this seemed quite surprising, so a simple analysis follows.
Simple Analysis
The problem is now abstracted as:
If the local cosine similarities of two vectors are all no less than \lambda > 0, is the global cosine similarity of these two vectors necessarily no less than \lambda?
As stated at the outset, the answer is negative, and a single counterexample suffices. Take \boldsymbol{x}=(1,1) and \boldsymbol{y}=(1,2). Since \boldsymbol{y} is not a positive scalar multiple of \boldsymbol{x}, we have \cos(\boldsymbol{x},\boldsymbol{y})\neq 1 (in fact \cos(\boldsymbol{x},\boldsymbol{y}) = 3/\sqrt{10} < 1). However, their sub-vectors—the individual components—are all positive numbers, so as 1-dimensional vectors their cosine similarities are all 1. We thus have a counterexample where every local cosine similarity equals 1, yet the global similarity is less than 1.
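The counterexample can be checked in a few lines:

```python
import numpy as np

# Each 1-dimensional sub-vector pair has cosine 1 (all components are positive),
# yet the global cosine of x = (1, 1) and y = (1, 2) is strictly below 1.
x, y = np.array([1.0, 1.0]), np.array([1.0, 2.0])
cos_xy = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
print(cos_xy)  # 3/sqrt(10) ≈ 0.9487
```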
For a more general analysis, we can let \boldsymbol{x}=[\boldsymbol{x}_1,\boldsymbol{x}_2] and \boldsymbol{y}=[\boldsymbol{y}_1,\boldsymbol{y}_2], then: \begin{equation} \begin{aligned} \cos(\boldsymbol{x},\boldsymbol{y}) =&\, \frac{\langle \boldsymbol{x}, \boldsymbol{y}\rangle}{\Vert\boldsymbol{x}\Vert \Vert\boldsymbol{y}\Vert} \\ = & \frac{\langle \boldsymbol{x}_1, \boldsymbol{y}_1\rangle + \langle \boldsymbol{x}_2, \boldsymbol{y}_2\rangle}{\sqrt{\Vert\boldsymbol{x}_1\Vert^2 + \Vert\boldsymbol{x}_2\Vert^2} \sqrt{\Vert\boldsymbol{y}_1\Vert^2 + \Vert\boldsymbol{y}_2\Vert^2}} \\[6pt] =& \,\frac{\cos(\boldsymbol{x}_1, \boldsymbol{y}_1) \Vert\boldsymbol{x}_1\Vert \Vert\boldsymbol{y}_1\Vert+ \cos(\boldsymbol{x}_2, \boldsymbol{y}_2)\Vert\boldsymbol{x}_2\Vert \Vert\boldsymbol{y}_2\Vert}{\sqrt{\Vert\boldsymbol{x}_1\Vert^2 + \Vert\boldsymbol{x}_2\Vert^2} \sqrt{\Vert\boldsymbol{y}_1\Vert^2 + \Vert\boldsymbol{y}_2\Vert^2}} \end{aligned}\label{eq:cos} \end{equation} If we let \Vert\boldsymbol{x}_1\Vert, \Vert\boldsymbol{y}_2\Vert \to 0, while keeping \Vert\boldsymbol{x}_2\Vert, \Vert\boldsymbol{y}_1\Vert greater than zero (without loss of generality, we can set \Vert\boldsymbol{x}_2\Vert=\Vert\boldsymbol{y}_1\Vert=1), then \cos(\boldsymbol{x},\boldsymbol{y})\to 0. That is, no matter how large \cos(\boldsymbol{x}_1,\boldsymbol{y}_1) and \cos(\boldsymbol{x}_2,\boldsymbol{y}_2) are, there is always a configuration of norms that makes \cos(\boldsymbol{x},\boldsymbol{y}) arbitrarily close to 0. In other words, no lower bound on \cos(\boldsymbol{x},\boldsymbol{y}) can be established from \cos(\boldsymbol{x}_1,\boldsymbol{y}_1) and \cos(\boldsymbol{x}_2,\boldsymbol{y}_2) alone.
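The limiting construction can be demonstrated with 1-dimensional blocks (a minimal instance of the argument above): take x_1 = \epsilon, y_1 = 1 and x_2 = 1, y_2 = \epsilon, so both local cosines are exactly 1, while the global cosine equals 2\epsilon/(1+\epsilon^2), which tends to 0:

```python
import numpy as np

def global_cos(eps):
    # Blocks: x1 = eps, y1 = 1 and x2 = 1, y2 = eps; both local cosines are 1.
    x = np.array([eps, 1.0])
    y = np.array([1.0, eps])
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

for eps in [1.0, 0.1, 0.01, 0.001]:
    print(eps, global_cos(eps))  # shrinks toward 0 as eps -> 0
```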
As for the upper bound, it can be proven that: \begin{equation} \cos(\boldsymbol{x},\boldsymbol{y})\leq \max\big\{\cos(\boldsymbol{x}_1,\boldsymbol{y}_1),\cos(\boldsymbol{x}_2,\boldsymbol{y}_2)\big\}\label{eq:cos-ul} \end{equation} The proof is very simple because this bound is quite loose. Without loss of generality, assume \cos(\boldsymbol{x}_1,\boldsymbol{y}_1)\leq\cos(\boldsymbol{x}_2,\boldsymbol{y}_2); replacing \cos(\boldsymbol{x}_1,\boldsymbol{y}_1) with \cos(\boldsymbol{x}_2,\boldsymbol{y}_2) in the numerator of equation [eq:cos] gives: \begin{equation} \cos(\boldsymbol{x},\boldsymbol{y}) \leq\left[\frac{\Vert\boldsymbol{x}_1\Vert \Vert\boldsymbol{y}_1\Vert+ \Vert\boldsymbol{x}_2\Vert \Vert\boldsymbol{y}_2\Vert}{\sqrt{\Vert\boldsymbol{x}_1\Vert^2 + \Vert\boldsymbol{x}_2\Vert^2} \sqrt{\Vert\boldsymbol{y}_1\Vert^2 + \Vert\boldsymbol{y}_2\Vert^2}}\right]\cos(\boldsymbol{x}_2, \boldsymbol{y}_2) \end{equation} The part in the square brackets is exactly the cosine similarity of the two-dimensional vectors (\Vert\boldsymbol{x}_1\Vert,\Vert\boldsymbol{x}_2\Vert) and (\Vert\boldsymbol{y}_1\Vert,\Vert\boldsymbol{y}_2\Vert), so it is at most 1. Thus \cos(\boldsymbol{x},\boldsymbol{y})\leq\cos(\boldsymbol{x}_2,\boldsymbol{y}_2), which proves inequality [eq:cos-ul].
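The bound can also be sanity-checked numerically with random vectors; pairs whose local cosines are negative are skipped, since the proof is given for nonnegative local cosines:

```python
import numpy as np

rng = np.random.default_rng(0)

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Check cos(x, y) <= max(cos(x1, y1), cos(x2, y2)) over random splits.
violations = 0
for _ in range(10000):
    x1, y1 = rng.normal(size=5), rng.normal(size=5)
    x2, y2 = rng.normal(size=7), rng.normal(size=7)
    c1, c2 = cos(x1, y1), cos(x2, y2)
    if c1 < 0 or c2 < 0:
        continue  # the bound is only claimed for nonnegative local cosines
    c = cos(np.concatenate([x1, x2]), np.concatenate([y1, y2]))
    if c > max(c1, c2) + 1e-12:
        violations += 1
print(violations)  # 0
```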
(Again, note that the above proofs assume \cos(\boldsymbol{x}_1,\boldsymbol{y}_1)\geq 0 and \cos(\boldsymbol{x}_2,\boldsymbol{y}_2) \geq 0. If negative values are allowed, the conclusions may need slight modification.)
Related Paradox
Does the above result have a more concrete real-world counterpart? Yes: placing it in the context of correlation analysis leads to the famous “Simpson’s Paradox.”
We know there is a coefficient for measuring linear correlation called the “Pearson correlation coefficient,” defined as: \begin{equation} r = \frac{\sum\limits_{i=1}^n (x_i-\bar{x})(y_i - \bar{y})}{\sqrt{\sum\limits_{i=1}^n (x_i-\bar{x})^2}\sqrt{\sum\limits_{i=1}^n(y_i - \bar{y})^2}} \end{equation} Looking more closely, if we denote \boldsymbol{x} = (x_1,x_2,\cdots,x_n) and \boldsymbol{y} = (y_1,y_2,\cdots,y_n), then the above formula is just: \begin{equation} r = \cos(\boldsymbol{x}-\bar{x},\boldsymbol{y}-\bar{y}) \end{equation} where \bar{x} and \bar{y} are subtracted componentwise. In other words, the Pearson correlation coefficient is the cosine similarity of the data after mean-centering. Since it is a cosine similarity, the results of the previous section apply. The direct conclusion is that even if two sets of data each show obvious linear correlation (\cos > 0), they might be linearly uncorrelated (\cos \to 0) after being combined.
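The identity between the Pearson coefficient and the cosine of the centered data is easy to verify (here against NumPy’s `corrcoef`):

```python
import numpy as np

def pearson(x, y):
    # Pearson coefficient as the cosine similarity of mean-centered vectors.
    xc, yc = x - x.mean(), y - y.mean()
    return xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc))

rng = np.random.default_rng(42)
x, y = rng.normal(size=100), rng.normal(size=100)
print(pearson(x, y), np.corrcoef(x, y)[0, 1])  # equal up to float error
```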
And “Simpson’s Paradox” goes a step further: while each batch of data is positively correlated, the combined data may be not merely uncorrelated but actually negatively correlated. This is because the correlation coefficient has extra parameters (\bar{x}, \bar{y}) compared with plain cosine similarity, allowing greater degrees of freedom. The geometric picture is also very intuitive, as shown below:
[Figure: Intuitive illustration of Simpson’s Paradox]
In the figure above, the blue data points are perfectly on the same straight line with a positive slope, so the correlation coefficient is 1. The same applies to the red data points; they are both “perfectly positively linearly correlated” within their own batches. However, after combining the data, if one must fit them with a single straight line, it can only be the dashed line, which has a negative slope—meaning it becomes a negative correlation. This constitutes a classic example of “Simpson’s Paradox.”
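A minimal numerical instance of this picture (the point coordinates below are chosen purely for illustration): each group lies exactly on a line of positive slope, so each within-group correlation is 1, yet the pooled correlation is negative.

```python
import numpy as np

def pearson(x, y):
    xc, yc = x - x.mean(), y - y.mean()
    return xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc))

# Two groups, each perfectly on a positively sloped line.
x_blue, y_blue = np.array([0.0, 0.5, 1.0]), np.array([3.0, 3.5, 4.0])  # y = x + 3
x_red,  y_red  = np.array([2.0, 2.5, 3.0]), np.array([0.0, 0.5, 1.0])  # y = x - 2
x_all = np.concatenate([x_blue, x_red])
y_all = np.concatenate([y_blue, y_red])

print(pearson(x_blue, y_blue))  # 1.0
print(pearson(x_red, y_red))    # 1.0
print(pearson(x_all, y_all))    # negative
```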
Summary
This article briefly discussed the relationship between the local cosine similarity and global cosine similarity of high-dimensional vectors, and further discussed the related “Simpson’s Paradox.”
When reprinting, please include the original address of this article: https://kexue.fm/archives/9931