English (unofficial) translations of posts at kexue.fm

CoSENT (III): As a Loss Function for Interaction-based Similarity

Translated by Gemini Flash 3.0 Preview. Translations may be inaccurate; please refer to the original post for anything important.

In "CoSENT (I): A More Effective Sentence Vector Scheme than Sentence-BERT", I proposed a supervised sentence vector scheme named "CoSENT." Since it directly trains on cosine similarity, it is more relevant to the evaluation metrics, typically yielding better results and faster convergence than Sentence-BERT. In "CoSENT (II): How Big is the Gap Between Feature-based and Interaction-based Matching?", we also compared the differences between it and interaction-based similarity models, showing that its performance on certain tasks can even approach that of interaction-based models.

However, at that time, my primary goal was to find a Sentence-BERT alternative that was closer to the evaluation objectives, so the results were oriented towards supervised sentence vectors, i.e., feature-based similarity models. Recently, it occurred to me that CoSENT can actually also serve as a loss function for interaction-based similarity models. So, how does it compare to the standard choice, Cross-Entropy? This article supplements that part of the experiment.

Background Review

When CoSENT was first proposed, it was designed as a loss function for supervised sentence vectors:

\begin{equation}
\log \left(1 + \sum_{\text{sim}(i,j) > \text{sim}(k,l)} e^{\lambda(\cos(u_k, u_l) - \cos(u_i, u_j))}\right)
\end{equation}

where i, j, k, l are four training samples (e.g., four sentences), u_i, u_j, u_k, u_l are the sentence vectors to be learned (e.g., their [CLS] vectors after passing through BERT), \cos(\cdot,\cdot) denotes the cosine similarity between two vectors, and \text{sim}(\cdot,\cdot) denotes their similarity labels. The definition of this loss function is clear: if you believe the similarity of (i,j) should be greater than the similarity of (k,l), then a term e^{\lambda(\cos(u_k, u_l) - \cos(u_i, u_j))} is added inside the \log.

From this form, it is evident that CoSENT was originally intended for feature-based models trained on cosine similarity; even the name "CoSENT" comes from Cosine Sentence. However, setting aside the cosine similarity aspect, CoSENT is essentially a loss function that relies only on the relative order of the labels; it has no necessary connection to cosine similarity. We can generalize it as:

\begin{equation}
\log \left(1 + \sum_{\text{sim}(i,j) > \text{sim}(k,l)} e^{\lambda(f(k,l) - f(i,j))}\right)
\end{equation}

where f(\cdot,\cdot) is any scalar-output function (generally no activation function is needed), representing the similarity model to be learned. This includes "interaction-based similarity" models, where the two inputs are concatenated into a single text and fed into BERT!
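As a sketch under my own assumptions (plain NumPy and a toy `cosent_loss` helper of my own naming, not the author's bert4keras implementation), the generalized loss can be computed from a batch of scalar scores and their labels like this:

```python
import numpy as np

def cosent_loss(scores, labels, lam=20.0):
    """Generalized CoSENT loss for one batch.

    scores: (n,) model outputs f(.,.) for n pairs -- cosines for a
            feature-based model, or raw scalar logits for an
            interaction-based model.
    labels: (n,) similarity labels; only their relative order is used.

    Sums exp(lam * (f(k,l) - f(i,j))) over all index pairs where
    sim(i,j) > sim(k,l), then takes log(1 + sum).
    """
    scores = np.asarray(scores, dtype=float) * lam
    labels = np.asarray(labels, dtype=float)
    # mask[a, b] is True where labels[a] > labels[b],
    # i.e. pair a should be scored higher than pair b
    mask = labels[:, None] > labels[None, :]
    # diffs[a, b] = lam * (f_b - f_a): penalty grows when the
    # lower-labeled pair b outscores the higher-labeled pair a
    diffs = scores[None, :] - scores[:, None]
    return np.log1p(np.exp(diffs[mask]).sum())
```

Note that pairs with equal labels contribute nothing (the mask excludes them), and a batch whose scores are already ordered consistently with the labels yields a loss close to zero.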

Experimental Comparison

The conventional way to train interaction-based similarity is to construct a two-node output at the end, followed by a softmax, using Cross-Entropy (abbreviated as CE in the table below) as the loss function. This is also equivalent to adding a sigmoid activation to the aforementioned f(\cdot,\cdot) and using binary cross-entropy for a single node. However, this approach is only suitable for labels in a binary classification format. If the labels are continuous scores (e.g., STS-B is 1–5 points), it is not very suitable, and the problem is usually converted into a regression task. CoSENT does not have this limitation because it only requires the rank information of the labels, a characteristic consistent with the commonly used evaluation metric, the Spearman correlation coefficient.
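The claimed equivalence between the two formulations of CE, a two-node softmax output versus a single sigmoid node trained with binary cross-entropy on the logit difference, can be checked numerically. The helper names below are my own illustrative choices:

```python
import math

def softmax_ce(z0, z1, label):
    """Cross-entropy with a two-node output [z0, z1];
    label 1 selects the 'similar' node z1."""
    m = max(z0, z1)  # subtract the max for numerical stability
    log_norm = m + math.log(math.exp(z0 - m) + math.exp(z1 - m))
    return -((z1 if label == 1 else z0) - log_norm)

def sigmoid_bce(z, label):
    """Binary cross-entropy with a single logit z and sigmoid activation."""
    p = 1.0 / (1.0 + math.exp(-z))
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

# The two losses coincide when the single logit is the difference z1 - z0:
for z0, z1, y in [(0.3, 1.7, 1), (2.0, -0.5, 0)]:
    assert abs(softmax_ce(z0, z1, y) - sigmoid_bce(z1 - z0, y)) < 1e-9
```

The identity follows from softmax([z0, z1])[1] = sigmoid(z1 - z0), which is why the two-node and one-node setups are interchangeable for binary labels.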

The reference code for the comparative experiment between the two is as follows:

https://github.com/bojone/CoSENT/blob/main/accuracy/interact_cosent.py

The experimental results are:

Evaluation metric: Spearman correlation coefficient

Model             ATEC    BQ      LCQMC   PAWSX   avg
BERT + CE         48.01   71.96   78.53   68.59   66.77
BERT + CoSENT     48.09   72.25   78.70   69.34   67.10
RoBERTa + CE      49.70   73.20   79.13   70.52   68.14
RoBERTa + CoSENT  49.82   73.09   78.78   70.54   68.06

Evaluation metric: Accuracy

Model             ATEC    BQ      LCQMC   PAWSX   avg
BERT + CE         85.38   83.57   88.10   81.45   84.63
BERT + CoSENT     85.55   83.73   87.92   81.85   84.76
RoBERTa + CE      85.97   84.67   88.14   82.85   85.41
RoBERTa + CoSENT  86.06   84.23   88.14   83.03   85.37

As can be seen, there are no surprises; the effects of CE and CoSENT are basically consistent. If one must dig for subtle differences, it can be observed that in BERT, CoSENT performs relatively better, while in RoBERTa, there is essentially no difference. Additionally, on the PAWSX task, the improvement from CoSENT is relatively more noticeable, while it remains basically flat on other tasks. Thus, one can "weakly" conclude:

When the model is weak (BERT is weaker than RoBERTa) or the task is difficult (PAWSX is relatively more difficult than the other three tasks), CoSENT might achieve better results than CE.

Note the word "might"; I cannot guarantee it. Realistically, I do not believe the two differ significantly. However, one can speculate that because the forms of the two loss functions are distinctly different, even if the final metrics are similar, there should be some internal differences between the resulting models. In such cases, perhaps ensembling the two could be worth considering?

Summary

This article primarily explores and experiments with the feasibility of CoSENT in interaction-based similarity models. The final conclusion is "feasible, but with no significant improvement in performance."

Original Address: https://kexue.fm/archives/9341