I recently came across the concept of “Geodesic Distance” in a paper titled “Unsupervised Opinion Summarization Using Approximate Geodesics”. I found it quite interesting and would like to share it with everyone.
To me, what is “new” is not the concept of geodesic distance itself (which I encountered back when studying Riemannian geometry), but rather the fact that geodesic distance can be cleverly constructed in the field of semantic similarity and play a role in certain scenarios. If we want to sound more sophisticated, we could even call this “Semantic Similarity on Manifolds.” Doesn’t that sound much more advanced?
Paper Summary
First, let’s briefly summarize the main content of the original paper. As the name suggests, the theme of the paper is summarization. Typically, unsupervised summarization is done like this: assume an article consists of n sentences t_1, t_2, \dots, t_n. We design a scoring function s(t_i) for each sentence (classically tf-idf and its variants) and then select the top-scoring sentences as the summary. Of course, the paper does not deal with simple summarization but rather “Opinion Summarization.” This “Opinion” can be understood as a given theme or center c. The summary should lean towards extracting sentences related to c, so the scoring function should also be related to c, i.e., s(t_i, c).
Since the era of “Everything is an Embedding,” a mainstream way to design s(t_i, c) is to encode both the sentence t_i and the theme c into corresponding sentence vectors \boldsymbol{v}(t_i) and \boldsymbol{v}(c), and then use the reciprocal of some distance as the scoring function: \begin{equation} s(t_i, c) = \frac{1}{d(\boldsymbol{v}(t_i), \boldsymbol{v}(c))} \end{equation} In this design, both the sentence vector encoding model \boldsymbol{v}(\cdot) and the distance function d(\cdot, \cdot) are designable spaces. The original paper contributes to both \boldsymbol{v}(\cdot) and d(\cdot, \cdot). The work on \boldsymbol{v}(\cdot) is not the focus of this article, so we will skip it for now; interested readers can refer to the original paper. As for the contribution to d(\cdot, \cdot), it involves replacing common simple distances with the subject of this article: “Geodesic Distance.”
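As a minimal sketch of the scoring scheme above: here the bag-of-words `encode` is a toy stand-in for a real sentence encoder \boldsymbol{v}(\cdot), Euclidean distance stands in for d(\cdot, \cdot), and all names and example sentences are illustrative, not from the paper.

```python
import numpy as np

def encode(text, vocab):
    # Toy stand-in for the sentence encoder v(.): bag-of-words counts.
    vec = np.zeros(len(vocab))
    for w in text.lower().split():
        if w in vocab:
            vec[vocab[w]] += 1.0
    return vec

def score(sentence_vec, theme_vec, eps=1e-8):
    # s(t_i, c) = 1 / d(v(t_i), v(c)), with d taken as Euclidean distance.
    return 1.0 / (np.linalg.norm(sentence_vec - theme_vec) + eps)

sentences = ["the battery life is great",
             "shipping was slow",
             "battery lasts all day"]
theme = "battery life"
vocab = {w: i for i, w in enumerate(
    sorted({w for s in sentences + [theme] for w in s.lower().split()}))}

theme_vec = encode(theme, vocab)
# Rank sentences by s(t_i, c), highest score first; the top sentences
# would form the opinion summary.
ranked = sorted(sentences,
                key=lambda s: score(encode(s, vocab), theme_vec),
                reverse=True)
```

With a real encoder (e.g., a Sentence-BERT-style model), only `encode` would change; the scoring and ranking logic stays the same.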
Principle Analysis
Why use geodesic distance? This starts with how we train sentence vectors.
Sentence vectors can be learned in either a supervised or unsupervised manner. Taking supervised learning as an example, it generally involves contrastive learning with positive and negative sample pairs (refer to “CoSENT (1): A More Effective Sentence Vector Scheme than Sentence-BERT”). Positive pairs are sentences labeled as having essentially the same semantics; we consider them to have high similarity, or equivalently a small distance. The problem lies with negative pairs: a negative pair might consist of deliberately labeled hard negatives, or simply of two randomly selected, unrelated samples. In principle, these two cases should be assigned different distances, but in practice they often receive the same label, namely “negative.”
This leads to a result where the distance values calculated using sentence vectors are theoretically accurate only for sentences that are semantically close. For sentences with large semantic gaps, the distance values can only be used to distinguish between positive and negative samples but cannot be used for comparison within a local range. For example, we can say that a distance of 1 is more similar than a distance of 2, and we can also say that a distance of 1 is more similar than a distance of 10. However, we cannot reliably say that a distance of 10 is more similar than a distance of 11, because once the distance is large, its absolute value becomes inaccurate.
In retrieval scenarios, we usually want to recall samples with very high similarity (i.e., very small distance), so we can directly use a simple distance function d(\cdot, \cdot) for retrieval. However, for the “Opinion Summarization” scenario in the original paper, we need to calculate the distance d(\boldsymbol{v}(t_i), \boldsymbol{v}(c)) between a sentence t_i and a theme c. The similarity between a “sentence” and a “theme” is not necessarily very high (the distance may be large). In other words, we need to make reliable relative comparisons in the regime where distances are relatively large, and this is exactly where geodesic distance becomes useful.
Geodesic Distance
Geodesic distance, simply put, is the shortest distance between two points. Since a manifold may not be flat, this distance is not necessarily the straight-line distance (Euclidean distance) between two points. A classic example is traveling from the Earth’s South Pole to the North Pole. We cannot travel in a straight line through the Earth’s core; we must travel along the Earth’s surface to the equator and then to the North Pole, covering a curved (semi-circular) distance.
In a local area (where the distance is small), the Earth is still effectively flat, so Euclidean distance is still usable. However, for large distances like “South Pole to North Pole” or “South Pole to Equator,” it is not accurate enough. This is very similar to the semantic similarity scenario mentioned earlier—the known distance (such as Euclidean distance) is accurate at short distances but inaccurate at long distances, essentially because the manifold is not flat.
Fortunately, having accurate local distances is enough. We can turn this into a graph problem and use a “shortest path” algorithm to estimate an approximate geodesic distance. Specifically, we use an existing distance function to compute the distance from each point to all other points, and then keep only the k nearest points (or truncate by a threshold, depending on the situation). We connect each point to its retained neighbors with edges weighted by those distances. All points and edges then form a weighted graph (which we call a “k-nearest neighbor graph”). We can then use Dijkstra’s algorithm to search for the shortest path between any two points on the graph; the length of that path is the approximation of the geodesic distance.
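The k-nearest-neighbor graph plus shortest-path recipe above can be sketched with the standard library alone. This is a minimal illustration, not the paper’s implementation: the function names and the unit-circle demo (which plays the role of a curved manifold, like the Earth’s surface example) are my own.

```python
import heapq
import math

def knn_graph(points, k, dist):
    # Keep only each point's k nearest neighbors, connected by edges
    # weighted with the local (trusted) distance.
    n = len(points)
    adj = {i: {} for i in range(n)}
    for i in range(n):
        nearest = sorted((dist(points[i], points[j]), j)
                         for j in range(n) if j != i)[:k]
        for d, j in nearest:
            adj[i][j] = d
            adj[j][i] = d  # keep the graph undirected
    return adj

def geodesic(adj, src, dst):
    # Dijkstra's shortest path; its length approximates the geodesic distance.
    best = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > best.get(u, math.inf):
            continue  # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < best.get(v, math.inf):
                best[v] = nd
                heapq.heappush(heap, (nd, v))
    return math.inf  # dst not reachable from src

# Demo: 12 points on a unit circle, a manifold where the straight-line
# (Euclidean) distance is only trustworthy between close neighbors.
points = [(math.cos(2 * math.pi * t / 12), math.sin(2 * math.pi * t / 12))
          for t in range(12)]
adj = knn_graph(points, k=2, dist=math.dist)
d_geo = geodesic(adj, 0, 6)                   # ~3.106, close to the arc length pi
d_straight = math.dist(points[0], points[6])  # 2.0, the chord through the "core"
```

For real sentence-vector sets, `math.dist` would be replaced by whatever trusted local distance the embeddings provide, and a library routine such as `scipy.sparse.csgraph.shortest_path` could stand in for the hand-rolled Dijkstra.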
In summary, under the assumption that “distances between nearby points are accurate, while distances between distant points are inaccurate,” we can use the k-nearest neighbor graph plus the shortest path method to estimate the geodesic distance of distant points as a substitute. Since geodesic distance accounts for the manifold structure of the vector space, it potentially yields better results (refer to Table 8 in the original paper).
Summary
This article shared the concept of “Geodesic Distance,” pointing out that the similarity functions we usually train may only be reliable locally. For larger distances, switching to geodesic distance might be helpful for certain semantic similarity problems.