English (unofficial) translations of posts at kexue.fm

When BERT-whitening Introduces Hyperparameters: There's Always One for You

Translated by DeepSeek V4 Pro. Translations can be inaccurate, please refer to the original post for important stuff.

In the post "You Might Not Need BERT-flow: A Linear Transformation Comparable to BERT-flow", I proposed BERT-whitening, demonstrating that a simple linear transformation could rival the then-SOTA method, BERT-flow. Furthermore, BERT-whitening enables dimensionality reduction for sentence vectors, resulting in lower memory usage and faster retrieval speeds. However, in "Which Unsupervised Semantic Similarity Method is Best? A Comprehensive Evaluation", we also observed that the whitening operation does not always yield improvements. For models that are already well-suited to the task (such as SimBERT, which undergoes supervised training), additional whitening often degrades performance.

To address this deficiency, this article proposes the introduction of two hyperparameters into BERT-whitening. By adjusting these two hyperparameters, we can almost always achieve "dimensionality reduction without performance loss." In other words, even for tasks where whitening originally caused a decline in effectiveness, there is now an opportunity to maintain or even improve performance while reducing dimensions.

Method Overview

The current process for BERT-whitening is as follows:

\begin{aligned} \tilde{\boldsymbol{x}}_i &= (\boldsymbol{x}_i - \boldsymbol{\mu})\boldsymbol{U}\boldsymbol{\Lambda}^{-1/2} \\ \boldsymbol{\mu} &= \frac{1}{N}\sum_{i=1}^N \boldsymbol{x}_i \\ \boldsymbol{\Sigma} &= \frac{1}{N}\sum_{i=1}^N (\boldsymbol{x}_i - \boldsymbol{\mu})^{\top}(\boldsymbol{x}_i - \boldsymbol{\mu}) = \boldsymbol{U}\boldsymbol{\Lambda}\boldsymbol{U}^{\top} \,\,(\text{SVD decomposition}) \end{aligned}

Here \boldsymbol{x}_i is the given sentence vector (vectors are row vectors by default unless otherwise specified), and \tilde{\boldsymbol{x}}_i is the transformed vector. In the SVD results, \boldsymbol{U} is an orthogonal matrix, and \boldsymbol{\Lambda} is a diagonal matrix with non-negative diagonal elements arranged in descending order. As can be seen, the current process is entirely fixed: there are no adjustable hyperparameters.

To increase the flexibility of the transformation, we can introduce two scalar hyperparameters \beta and \gamma, modifying the process to:

\begin{aligned} \tilde{\boldsymbol{x}}_i &= (\boldsymbol{x}_i - {\color{red}\beta}\boldsymbol{\mu})\boldsymbol{U}\boldsymbol{\Lambda}^{-{\color{red}\gamma}/2} \\ \boldsymbol{\mu} &= \frac{1}{N}\sum_{i=1}^N \boldsymbol{x}_i \\ \boldsymbol{\Sigma} &= \frac{1}{N}\sum_{i=1}^N (\boldsymbol{x}_i - {\color{red}\beta}\boldsymbol{\mu})^{\top}(\boldsymbol{x}_i - {\color{red}\beta}\boldsymbol{\mu}) = \boldsymbol{U}\boldsymbol{\Lambda}\boldsymbol{U}^{\top} \,\,(\text{SVD decomposition}) \end{aligned}
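The modified process can be sketched in NumPy as follows. This is a minimal illustration rather than the author's released code: the function names `fit_whitening` and `transform`, and the `eps` floor that guards against zero eigenvalues, are assumptions made for this example.

```python
import numpy as np

def fit_whitening(x, beta=1.0, gamma=1.0, eps=1e-12):
    """Fit the shift and kernel of the generalized whitening on row vectors x (N, d).

    eps is an assumption of this sketch: it floors the eigenvalues so that
    a zero eigenvalue does not produce an infinite scale when gamma > 0.
    """
    mu = x.mean(axis=0, keepdims=True)       # mean row vector, shape (1, d)
    shift = beta * mu                        # beta * mu
    centered = x - shift
    sigma = centered.T @ centered / len(x)   # (d, d) second-moment matrix
    u, lam, _ = np.linalg.svd(sigma)         # lam is sorted in descending order
    kernel = u * np.maximum(lam, eps) ** (-gamma / 2)  # U @ diag(lam^(-gamma/2))
    return shift, kernel

def transform(x, shift, kernel, k=None):
    """Apply (x - beta * mu) U Lambda^(-gamma/2), optionally keeping the first k dims."""
    y = (x - shift) @ kernel
    return y if k is None else y[:, :k]
```

With `beta=1.0, gamma=1.0` this reduces to the original whitening; with `beta=0.0, gamma=0.0` the kernel is just \boldsymbol{U} and the shift is zero.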

Analysis of the Approach

When \beta=\gamma=1, the method is identical to the original BERT-whitening. When \beta=\gamma=0, the net transformation becomes:

\tilde{\boldsymbol{x}}_i = \boldsymbol{x}_i \boldsymbol{U}

Since \boldsymbol{U} is an orthogonal matrix, it preserves the inner product between any two vectors, i.e., \tilde{\boldsymbol{x}}_i\tilde{\boldsymbol{x}}_j^{\top} = \boldsymbol{x}_i \boldsymbol{U} (\boldsymbol{x}_j \boldsymbol{U})^{\top} = \boldsymbol{x}_i\boldsymbol{x}_j^{\top} (taking j=i shows that norms are preserved as well). Therefore, when cosine similarity is the metric, the original results are unchanged. In other words, these hyperparameters make it possible to achieve results "no worse than before the transformation," and tuning them may yield better results than the original vectors. This is the core design philosophy behind the two hyperparameters.
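The invariance at \beta=\gamma=0 is easy to check numerically. The snippet below is a small sanity check on random data, not an experiment from the post; the helper `cosine_matrix` and the toy dimensions are assumptions of this example.

```python
import numpy as np

def cosine_matrix(v):
    """Pairwise cosine similarities between the rows of v."""
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    return v @ v.T

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 16)) + 1.0        # toy "sentence vectors" with a non-zero mean

# beta = 0: no mean is subtracted, so Sigma is the raw second-moment matrix.
sigma = x.T @ x / len(x)
u, lam, _ = np.linalg.svd(sigma)

# gamma = 0: Lambda^0 is the identity, so the net transformation is x @ U.
y = x @ u

# Largest absolute change in any pairwise cosine similarity.
max_gap = np.abs(cosine_matrix(x) - cosine_matrix(y)).max()
```

Here `max_gap` should be at the level of floating-point round-off, since \boldsymbol{U} only rotates (or reflects) the vectors.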

Furthermore, under these modifications, the original dimensionality reduction capability is preserved. We can decompose the transformation into two parts:

\tilde{\boldsymbol{x}}_i = \underbrace{(\boldsymbol{x}_i - \beta\boldsymbol{\mu})\boldsymbol{U}}_{\text{Part 1}} \cdot \underbrace{\boldsymbol{\Lambda}^{-\gamma/2}}_{\text{Part 2}}

The first part is primarily the orthogonal transformation \boldsymbol{U}. \boldsymbol{U} comes from the SVD of the matrix \boldsymbol{\Sigma}, and it transforms the vector \boldsymbol{x}_i - \beta\boldsymbol{\mu} into a new vector whose components are as independent (decorrelated) as possible. The average fluctuation of each component of the new vector around 0 is measured by the corresponding diagonal element of \boldsymbol{\Lambda}^{1/2}. If a fluctuation is very close to 0, we can treat that component as effectively 0; discarding it will not significantly affect the cosine value. This is the principle behind the dimensionality reduction. Since the SVD already sorts \boldsymbol{\Lambda} in descending order, reducing to k dimensions simply means keeping the first k components: \tilde{\boldsymbol{x}}_i[:k].
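The claim about fluctuations can be spelled out in one short derivation. The symbols \boldsymbol{y}_i (the output of Part 1), y_{i,m} (its m-th component), \lambda_m (the m-th diagonal element of \boldsymbol{\Lambda}) and d (the original dimension) are notation introduced here for illustration; they do not appear in the original post.

```latex
\begin{aligned}
\boldsymbol{y}_i &= (\boldsymbol{x}_i - \beta\boldsymbol{\mu})\boldsymbol{U} \\
\frac{1}{N}\sum_{i=1}^N \boldsymbol{y}_i^{\top}\boldsymbol{y}_i &= \boldsymbol{U}^{\top}\boldsymbol{\Sigma}\boldsymbol{U} = \boldsymbol{U}^{\top}\boldsymbol{U}\boldsymbol{\Lambda}\boldsymbol{U}^{\top}\boldsymbol{U} = \boldsymbol{\Lambda} \\
\frac{1}{N}\sum_{i=1}^N y_{i,m}^2 &= \lambda_m, \qquad \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0 \\
\boldsymbol{y}_i\boldsymbol{y}_j^{\top} &= \sum_{m=1}^{d} y_{i,m}\,y_{j,m} \approx \sum_{m=1}^{k} y_{i,m}\,y_{j,m} \quad \text{when } \lambda_{k+1},\dots,\lambda_d \approx 0
\end{aligned}
```

The second line shows that the components of \boldsymbol{y}_i are mutually uncorrelated (the off-diagonal second moments vanish), and the third that the root-mean-square size of component m is \sqrt{\lambda_m}, which is why the trailing components contribute little to inner products and can be dropped.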

As for the second part, \boldsymbol{\Lambda}^{-\gamma/2}, it can be understood as the degree of dependence of the current task on isotropy. If \gamma=1, it is equivalent to giving each component equal weight, which serves as an unsupervised prior. However, this may not be optimal for all tasks, so we can adjust \gamma to better adapt to the specific task at hand.
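To make the role of \gamma concrete, the following check measures the mean square of each output component for several values of \gamma. It is a toy illustration on synthetic anisotropic data; the per-dimension scales and the helper `component_power` are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy anisotropic data: each dimension has a different scale.
x = rng.normal(size=(1000, 6)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.1])

beta = 1.0
centered = x - beta * x.mean(axis=0, keepdims=True)
sigma = centered.T @ centered / len(x)
u, lam, _ = np.linalg.svd(sigma)           # lam: eigenvalues in descending order

def component_power(gamma):
    """Mean square of each component after applying U Lambda^(-gamma/2)."""
    y = centered @ (u * lam ** (-gamma / 2))
    return (y ** 2).mean(axis=0)

# gamma = 0 keeps the original spectrum, gamma = 1 flattens it completely,
# and intermediate values interpolate: the result equals lam ** (1 - gamma).
powers = {gamma: component_power(gamma) for gamma in (0.0, 0.5, 1.0)}
```

In this sketch `powers[1.0]` is all ones (full isotropy), `powers[0.0]` equals `lam`, and `powers[0.5]` equals `lam ** 0.5`, so \gamma acts as a dial between the original anisotropy and equal weighting.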

Experimental Results

The article "Which Unsupervised Semantic Similarity Method is Best? A Comprehensive Evaluation" showed that on the ATEC, BQ, and LCQMC tasks, applying the default whitening operation (\beta=\gamma=1) to SimBERT led to a performance drop. However, if we set \beta=\gamma=0, the results change (two combinations are demonstrated here; others show similar trends):

BERT-P4 Performance Table

Setting         ATEC                    BQ                      LCQMC                   PAWSX                   STS-B
\beta=\gamma=1  24.51 / 27.00 / 27.91   38.81 / 32.29 / 37.67   64.75 / 64.75 / 65.65   15.12 / 17.80 / 15.34   61.66 / 69.45 / 69.37
\beta=\gamma=0  24.51 / 24.51 / 24.59   38.81 / 38.81 / 38.99   64.75 / 64.75 / 63.45   15.12 / 15.12 / 14.59   61.66 / 61.66 / 62.30

SimBERT-P1 Performance Table

Setting         ATEC                    BQ                      LCQMC                   PAWSX                   STS-B
\beta=\gamma=1  38.50 / 23.64 / 30.79   48.54 / 31.78 / 40.01   76.23 / 75.05 / 74.50   15.10 / 18.49 / 15.64   74.14 / 73.37 / 75.29
\beta=\gamma=0  38.50 / 38.50 / 38.81   48.54 / 48.54 / 48.66   76.23 / 76.23 / 76.22   15.10 / 15.10 / 14.88   74.14 / 74.14 / 74.46

As in previous articles, each element in the table follows the format a / b / c, representing:

  • a: Score without whitening.

  • b: Score with whitening.

  • c: Score with whitening and reduction to 256 dimensions.

In the original post, b is shown in green if b > a and in red otherwise, and likewise for c. As mentioned earlier, without dimensionality reduction the net transformation for \beta=\gamma=0 is just \boldsymbol{U}, which does not change the cosine similarity; thus, a and b are equal when \beta=\gamma=0.

In these tables, we primarily focus on the third result c, which is the outcome of reducing the vector from 768 dimensions to 256 dimensions. It can be seen that when \beta=\gamma=0, for both unsupervised BERT and supervised SimBERT, the results are very close to the original vector results (a), and some results even show improvement. This means that the combination of \beta=\gamma=0, k=256 can be considered a "free lunch": it achieves dimensionality reduction with almost no loss in performance.

I have also tried tuning \beta and \gamma, which indeed yields better results than the two combinations above on some tasks. However, tuning requires labeled data, which might be controversial in an unsupervised context, so I will not demonstrate it here. If the original sentence vector model was already obtained through supervised training and BERT-whitening is used solely for dimensionality reduction, then using a validation set to tune \beta, \gamma, and k is perfectly appropriate and uncontroversial.
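For the supervised setting, tuning \beta, \gamma, and k on a validation set could look like the grid search below. This is only a sketch under stated assumptions: the post does not describe a search procedure, the data here is a synthetic stand-in for validation sentence pairs, Spearman correlation of cosine similarities is assumed as the metric, and every function name (`fit`, `average_ranks`, `spearman`, `pair_score`, `grid_search`) is hypothetical.

```python
import numpy as np

def fit(x, beta, gamma, eps=1e-12):
    """Shift and kernel of the generalized whitening (eps floors eigenvalues)."""
    shift = beta * x.mean(axis=0, keepdims=True)
    centered = x - shift
    u, lam, _ = np.linalg.svd(centered.T @ centered / len(x))
    return shift, u * np.maximum(lam, eps) ** (-gamma / 2)

def average_ranks(a):
    """Ranks of a, with tied values sharing the average of their ranks."""
    a = np.asarray(a, dtype=float).ravel()
    ordinal = np.empty(len(a))
    ordinal[np.argsort(a, kind="stable")] = np.arange(len(a))
    _, inverse, counts = np.unique(a, return_inverse=True, return_counts=True)
    return (np.bincount(inverse, weights=ordinal) / counts)[inverse]

def spearman(a, b):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return np.corrcoef(average_ranks(a), average_ranks(b))[0, 1]

def pair_score(va, vb, labels, shift, kernel, k):
    """Spearman correlation between pairwise cosine similarity and the labels."""
    ya = ((va - shift) @ kernel)[:, :k]
    yb = ((vb - shift) @ kernel)[:, :k]
    cos = (ya * yb).sum(axis=1) / (
        np.linalg.norm(ya, axis=1) * np.linalg.norm(yb, axis=1))
    return spearman(cos, labels)

def grid_search(fit_vecs, va, vb, labels, betas, gammas, ks):
    """Return (score, beta, gamma, k) with the best validation score."""
    best = None
    for beta in betas:
        for gamma in gammas:
            shift, kernel = fit(fit_vecs, beta, gamma)
            for k in ks:
                score = pair_score(va, vb, labels, shift, kernel, k)
                if best is None or score > best[0]:
                    best = (score, beta, gamma, k)
    return best

# Synthetic stand-in for validation pairs: the noisier the pair, the lower its label.
rng = np.random.default_rng(0)
d = 32
va = rng.normal(size=(300, d)) + 0.5
noise_level = rng.uniform(0.1, 2.0, size=(300, 1))
vb = va + noise_level * rng.normal(size=(300, d))
labels = -noise_level.ravel()

best = grid_search(np.vstack([va, vb]), va, vb, labels,
                   betas=(0.0, 0.5, 1.0), gammas=(0.0, 0.5, 1.0), ks=(8, 16, 32))
```

Because the grid includes \beta=\gamma=0 with k equal to the full dimension, the selected configuration can never score below the untransformed vectors on the validation set, which mirrors the "no worse than before" argument above.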

Conclusion

This article introduces two hyperparameters that give BERT-whitening room for tuning, allowing it to achieve results "no worse than before the transformation" while retaining its dimensionality reduction ability. In other words, even for previously trained sentence vector models, we can use the new BERT-whitening to reduce their dimensions while keeping performance basically unchanged, and sometimes even improving it!

Reprinting: Please include the original address of this article: https://kexue.fm/archives/9079
