Whether in basic classification tasks or the now-ubiquitous attention mechanisms, the construction of probability distributions is a key step. Specifically, this involves converting an n-dimensional arbitrary vector into an n-element discrete probability distribution. As is well known, the standard answer to this problem is Softmax, which takes the form of exponential normalization. It is relatively simple and intuitive, while also possessing many excellent properties, making it the “standard equipment” in most scenarios.
Nevertheless, Softmax has some drawbacks in certain scenarios, such as not being sparse enough and being unable to exactly equal zero. Consequently, many alternatives have emerged. In this article, we will briefly summarize the properties of Softmax and review and compare some of its alternative solutions.
Softmax Review
First, let’s introduce some general notation: \boldsymbol{x} = (x_1, x_2, \cdots, x_n) \in \mathbb{R}^n is the n-dimensional vector that needs to be converted into a probability distribution. Its components can be positive or negative, and there are no defined upper or lower bounds. \Delta^{n-1} is defined as the set of all n-element discrete probability distributions, i.e., \begin{equation} \Delta^{n-1} = \left\{\boldsymbol{p}=(p_1,p_2,\cdots,p_n)\left|\, p_1,p_2,\cdots,p_n\geq 0,\sum_{i=1}^n p_i = 1\right.\right\} \end{equation} The reason for labeling it n-1 instead of n is that the constraint \sum_{i=1}^n p_i = 1 defines an (n-1)-dimensional hyperplane in n-dimensional space. Combined with the constraint p_i \geq 0, the set of (p_1, p_2, \cdots, p_n) is just a subset of that plane, meaning the actual dimensionality is only n-1.
Based on these notations, the theme of this article can be simply expressed as exploring the mapping \mathbb{R}^n \mapsto \Delta^{n-1}, where \boldsymbol{x} \in \mathbb{R}^n is what we usually call Logits or Scores.
Basic Definition
The definition of Softmax is very simple: \begin{equation} p_i = \text{softmax}(\boldsymbol{x})_i = \frac{e^{x_i}}{\sum\limits_{j=1}^n e^{x_j}} \end{equation} There are too many sources and interpretations for Softmax, such as energy-based models, statistical mechanics, or simply as a smooth approximation of \text{argmax}. Therefore, it is difficult to verify its earliest origin, and we will not attempt to do so here. Often, we also add a temperature parameter, considering \text{softmax}(\boldsymbol{x}/\tau), but \tau itself can be integrated into the definition of \boldsymbol{x}, so we do not specifically separate the \tau parameter here.
The denominator of Softmax is usually denoted as Z(\boldsymbol{x}), and its logarithm is the \text{logsumexp} operation provided by most deep learning frameworks. It is a smooth approximation of \max: \begin{align} \log Z(\boldsymbol{x}) &= \log \sum\limits_{j=1}^n e^{x_j} = \text{logsumexp}(\boldsymbol{x}) \\ \lim_{\tau\to 0^+} \tau\,\text{logsumexp}(\boldsymbol{x}/\tau) &= \max(\boldsymbol{x}) \end{align} Taking \tau = 1 gives the approximation \text{logsumexp}(\boldsymbol{x}) \approx \max(\boldsymbol{x}); the larger the variance of the components of \boldsymbol{x}, the better the approximation. For further discussion, please refer to “Seeking a Smooth Maximum Function”.
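As a quick numerical check of this approximation (a minimal NumPy sketch; the vector and temperatures are arbitrary):

import numpy as np

def logsumexp(x):
    # numerically stable logsumexp
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

x = np.array([3.0, 1.0, -2.0, 0.5])
print(logsumexp(x), x.max())         # about 3.20 vs 3.0
for tau in [1.0, 0.1, 0.01]:
    print(tau * logsumexp(x / tau))  # approaches max(x) = 3.0 as tau -> 0+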
Two Properties
In addition to converting arbitrary vectors into probability distributions, Softmax also satisfies two properties: \begin{align} &{\color{red}{\text{Monotonicity}}}:\quad p_i > p_j \Leftrightarrow x_i > x_j,\quad p_i = p_j \Leftrightarrow x_i = x_j \\[5pt] &{\color{red}{\text{Invariance}}}:\quad \text{softmax}(\boldsymbol{x}) = \text{softmax}(\boldsymbol{x} + c),\,\,\forall c\in\mathbb{R} \end{align} Monotonicity means that Softmax is order-preserving; the maximum/minimum values of \boldsymbol{x} correspond to the maximum/minimum values of \boldsymbol{p}. Invariance states that if a constant is added to every component of \boldsymbol{x}, the result of Softmax remains unchanged. This is the same property as \text{argmax}, i.e., \text{argmax}(\boldsymbol{x}) = \text{argmax}(\boldsymbol{x} + c).
Therefore, based on these two properties, we can conclude that Softmax is actually a smooth approximation of \text{argmax} (more accurately, a smooth approximation of \text{onehot}(\text{argmax}(\cdot))). Specifically, we have: \begin{equation} \lim_{\tau\to 0^+} \text{softmax}(\boldsymbol{x}/\tau) = \text{onehot}(\text{argmax}(\boldsymbol{x})) \end{equation} This is likely the origin of the name Softmax. Note not to confuse them: Softmax is a smooth approximation of \text{argmax}, not \max; the smooth approximation of \max is \text{logsumexp}.
Gradient Calculation
For deep learning, one of the important ways to understand the properties of a function is to understand its gradient. For Softmax, we previously calculated this in “Looking at the Scale Operation of Attention from Gradient Maximization”: \begin{equation} \frac{\partial p_i}{\partial x_j} = p_i\delta_{i,j} - p_i p_j = \left\{\begin{aligned} p_i - p_i^2,&\quad i=j\\ - p_i p_j,&\quad i\neq j \end{aligned}\right. \end{equation} The matrix formed this way is also called the Jacobian Matrix of Softmax. Its L1 norm has a simple form: \begin{equation} \frac{1}{2}\left\Vert\frac{\partial \boldsymbol{p}}{\partial \boldsymbol{x}}\right\Vert_1=\frac{1}{2}\sum_{i,j}\left|\frac{\partial p_i}{\partial x_j}\right|=\frac{1}{2}\sum_i (p_i - p_i^2) + \frac{1}{2}\sum_{i\neq j} p_i p_j = 1 - \sum_i p_i^2 \end{equation} When \boldsymbol{p} is a one-hot distribution, the above expression equals 0. This means that the closer the Softmax result is to one-hot, the more severe the gradient vanishing phenomenon becomes. Therefore, at least during the initialization phase, we cannot initialize Softmax to be close to one-hot. At the same time, the rightmost side of the above equation is linked to the concept of Rényi entropy, which is similar to the common Shannon entropy.
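This identity for the L1 norm is easy to verify numerically (a minimal NumPy sketch with a random logit vector):

import numpy as np

x = np.random.randn(5)
p = np.exp(x); p /= p.sum()                    # softmax(x)
J = np.diag(p) - np.outer(p, p)                # Jacobian: p_i * delta_{ij} - p_i * p_j
print(np.abs(J).sum() / 2, 1 - (p**2).sum())   # the two printed values coincide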
Reference Implementation
The direct implementation of Softmax is simple: just take the \exp and then normalize. The reference code in Numpy is:
import numpy as np

def softmax(x):
    # naive implementation: exponentiate, then normalize
    y = np.exp(x)
    return y / y.sum()
However, if there are large components in \boldsymbol{x}, calculating \exp can easily lead to overflow. Therefore, we usually utilize the invariance of Softmax by first subtracting the maximum value of all components from each component before calculating Softmax. This ensures that each component for which \exp is calculated is no greater than 0, preventing overflow:
def softmax(x):
    # subtract the max first; by invariance the result is unchanged, but overflow is avoided
    y = np.exp(x - x.max())
    return y / y.sum()
Loss Function
One of the main uses of constructing probability distributions is for building the output of single-label multi-class classification tasks. That is, assuming there is an n-class task and \boldsymbol{x} is the model output, we hope to predict the probability of each class through \boldsymbol{p}=\text{softmax}(\boldsymbol{x}). To train this model, we need a loss function. Assuming the target class is t, the common choice is the cross-entropy loss: \begin{equation} \mathcal{L}_t = - \log p_t = - \log \text{softmax}(\boldsymbol{x})_t \end{equation} We can find its gradient: \begin{equation} -\frac{\partial \log p_t}{\partial x_j} = p_j - \delta_{t,j} = \left\{\begin{aligned} p_t - 1,&\quad j=t\\ p_j,&\quad j\neq t \end{aligned}\right. \end{equation} Note that t is given, so \delta_{t,j} actually expresses the target distribution \text{onehot}(t), while the set of all p_j is \boldsymbol{p} itself. Thus, the above equation can be written more intuitively as: \begin{equation} -\frac{\partial \log p_t}{\partial \boldsymbol{x}} = \boldsymbol{p} - \text{onehot}(t) \label{eq:softmax-ce-grad} \end{equation} In other words, its gradient is exactly the difference between the target distribution and the predicted distribution. As long as the two are not equal, the gradient will always exist, and optimization can continue. This is the advantage of cross-entropy. Of course, in some cases, this is also a disadvantage because Softmax only yields a one-hot distribution when \tau \to 0^+. In normal circumstances, a one-hot distribution will never appear, meaning optimization will never completely stop, which may lead to over-optimization. This is the motivation for some of the alternatives discussed later.
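The gradient formula \boldsymbol{p} - \text{onehot}(t) can be checked with finite differences (a minimal NumPy sketch; the vector, target class, and step size are arbitrary):

import numpy as np

def ce_loss(x, t):
    # cross-entropy -log softmax(x)_t, written via logsumexp for numerical stability
    return x.max() + np.log(np.exp(x - x.max()).sum()) - x[t]

x, t, eps = np.random.randn(6), 2, 1e-5
p = np.exp(x - x.max()); p /= p.sum()
eye = np.eye(len(x))
num_grad = np.array([(ce_loss(x + eps * eye[j], t) - ce_loss(x - eps * eye[j], t)) / (2 * eps)
                     for j in range(len(x))])
print(np.allclose(num_grad, p - eye[t], atol=1e-6))   # True: the gradient is p - onehot(t)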
Besides cross-entropy, other losses are available, such as -p_t, which can be understood as the negative of a smooth approximation of accuracy. However, it may suffer from gradient vanishing, so its optimization efficiency is often not as good as cross-entropy. It is generally suitable for fine-tuning rather than training from scratch. For more discussion, refer to “How to Train Your Accuracy?”.
Softmax Variants
Having introduced Softmax, we will now summarize some related variant works discussed previously on this blog, such as Margin Softmax, Taylor Softmax, and Sparse Softmax. These are derivatives based on Softmax, focusing on improvements in different aspects like loss functions, sparsity, and long-tail properties.
Margin Softmax
First, we introduce a series of Softmax variants originating from face recognition, collectively known as Margin Softmax. They were later applied to Sentence Embedding training in NLP. This site discussed one variant, AM-Softmax, in “Sentence Similarity Model Based on GRU and AM-Softmax”, and later provided a more general discussion in “From Triangle Inequality to Margin Softmax”.
Although Margin Softmax bears the name Softmax, it is actually more of an improvement to the loss function. Taking AM-Softmax as an example, it has two characteristics: first, Logits are constructed in \cos form, i.e., \boldsymbol{x} = [\cos(\boldsymbol{z},\boldsymbol{c}_1), \cos(\boldsymbol{z},\boldsymbol{c}_2), \cdots, \cos(\boldsymbol{z},\boldsymbol{c}_n)]/\tau. Here, the temperature parameter \tau is necessary because the range of \cos is [-1, 1], which is insufficient to pull apart class probabilities. Second, it does not simply use -\log p_t as the loss but strengthens it: \begin{equation} \mathcal{L} = - \log \frac{e^{[\cos(\boldsymbol{z},\boldsymbol{c}_t)-m]/\tau}}{e^{[\cos(\boldsymbol{z},\boldsymbol{c}_t)-m]/\tau} + \sum_{j\neq t} e^{\cos(\boldsymbol{z},\boldsymbol{c}_j)/\tau}} \end{equation} Intuitively, while cross-entropy wants x_t to be the largest among all components of \boldsymbol{x}, AM-Softmax not only wants x_t to be the largest but also wants it to be at least m/\tau larger than the second-largest component. Here, m/\tau is called the Margin.
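For concreteness, here is a minimal NumPy sketch of the AM-Softmax loss above for a single sample (the names z, centers, and the default values of m and \tau are our own illustrative choices; a real implementation would be batched and written in an autodiff framework):

import numpy as np

def am_softmax_loss(z, centers, t, m=0.35, tau=0.05):
    # z: feature vector, centers: (n_classes, dim) class-center matrix, t: target class
    cos = centers @ z / (np.linalg.norm(centers, axis=1) * np.linalg.norm(z))
    logits = cos / tau
    logits[t] -= m / tau               # subtract the margin from the target logit only
    logits -= logits.max()             # numerical stability
    return -np.log(np.exp(logits[t]) / np.exp(logits).sum())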
Why increase the requirement for the target class? This is due to the application scenario. Margin Softmax originated in face recognition and can be used for semantic retrieval in NLP. That is, the application is retrieval, but the training is classification. If only the cross-entropy of the classification task is used, the features encoded by the model may not satisfy retrieval requirements well. Adding a Margin makes the features more compact.
Taylor Softmax
Next is Taylor Softmax, discussed in “The Even-Order Taylor Expansion of exp(x) at x=0 is Always Positive”. It utilizes an interesting property of the Taylor expansion of \exp(x):
For any real number x and even number k, f_k(x) \triangleq \sum_{m=0}^k \frac{x^m}{m!} > 0 always holds. That is, the even-order Taylor expansion of e^x at x=0 is always positive.
Using this property, we can construct a Softmax variant (where k > 0 is any even number): \begin{equation} \text{taylor-softmax}(\boldsymbol{x}, k)_i = \frac{f_k(x_i)}{\sum\limits_{j=1}^n f_k(x_j)} \end{equation} Since it is based on the Taylor expansion of \exp, Taylor Softmax approximates Softmax within a certain range. In some scenarios, Taylor Softmax can replace Softmax. What are its characteristics? The answer is that it is more long-tailed. Because Taylor Softmax is a normalization of a polynomial function, it decays more slowly than an exponential function. Thus, for tail classes, Taylor Softmax tends to assign higher probabilities, which may help alleviate the over-confidence of Softmax.
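A minimal NumPy sketch of Taylor Softmax (k is the even truncation order):

import numpy as np
from math import factorial

def taylor_softmax(x, k=2):
    # f_k(x) = sum_{m=0}^{k} x^m / m!, positive for even k
    f = sum(x**m / factorial(m) for m in range(k + 1))
    return f / f.sum()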
A recent application of Taylor Softmax is to replace Softmax in Attention, reducing the original quadratic complexity to linear complexity. Theoretical derivations can be found in “Transformer Upgrade: 5. Linear Attention as Infinite Dimensions”. A recent practice of this idea is a model named Based, which uses e^x \approx 1+x+x^2/2 to linearize Attention, claiming to be more efficient than Attention and better than Mamba.
Sparse Softmax
Sparse Softmax is a simple sparse variant of Softmax proposed by the author during the 2020 China Legal Research Cup. It was first published in the blog post “SPACES: Extract-Generate Long Text Summarization”, and later supplemented with experiments in the paper “Sparse-softmax: A Simpler and Faster Alternative Softmax Transformation”.
In text generation, we often use deterministic Beam Search or stochastic TopK/TopP Sampling. These algorithms only keep the tokens with the highest predicted probabilities, which is equivalent to treating the remaining token probabilities as zero. However, if Softmax is used during training, there is no possibility of a probability being exactly zero, creating an inconsistency between training and prediction. Sparse Softmax aims to address this by setting the probabilities of tokens outside the Top-k to zero during training:
| | Softmax | Sparse Softmax |
|---|---|---|
| Basic Definition | p_i = \frac{e^{x_i}}{\sum\limits_{j=1}^n e^{x_j}} | p_i=\left\{\begin{aligned}&\frac{e^{x_i}}{\sum\limits_{j\in\Omega_k} e^{x_j}},\,i\in\Omega_k\\ &\quad 0,\,i\not\in\Omega_k\end{aligned}\right. |
| Loss Function | \log\left(\sum\limits_{i=1}^n e^{x_i}\right) - x_t | \log\left(\sum\limits_{i\in\Omega_k} e^{x_i}\right) - x_t |
Here \Omega_k is the set of indices of the top-k elements after sorting x_1, x_2, \cdots, x_n in descending order. Simply put, training performs the same truncation operation as the prediction stage. Note that Sparse Softmax forcibly sets the remaining probabilities to zero, so no gradient flows back to the truncated Logits. Therefore, the training efficiency of Sparse Softmax is lower than that of Softmax, making it generally more suitable for fine-tuning than for training from scratch.
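For reference, a minimal NumPy sketch of the forward computation in the table above (the function name sparse_softmax and the default k are our own choices; during training the loss would use the same truncated logsumexp):

import numpy as np

def sparse_softmax(x, k=10):
    # keep only the top-k logits; the remaining probabilities are exactly zero
    omega = np.argsort(x)[::-1][:k]
    p = np.zeros_like(x)
    e = np.exp(x[omega] - x[omega].max())
    p[omega] = e / e.sum()
    return p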
Perturb Max
This section introduces a new way to construct probability distributions called Perturb Max. It is a generalization of Gumbel Max, first introduced on this site in “Constructing Discrete Probability Distributions from a Reparameterization Perspective”.
Reflections on the Problem
Constructing a mapping \mathbb{R}^n \mapsto \Delta^{n-1} is not difficult. As long as f(x) is a mapping from \mathbb{R} \mapsto \mathbb{R}^* (reals to non-negative reals), such as x^2, then: \begin{equation} p_i = \frac{f(x_i)}{\sum\limits_{j=1}^n f(x_j)} \end{equation} satisfies the condition. If we add Monotonicity, we just need f(x) to be a monotonically increasing function, like \text{sigmoid}(x). But what if we also add Invariance? Can we easily write a mapping that satisfies Invariance?
One might ask: why maintain Monotonicity and Invariance? From the perspective of fitting a distribution, they might not be strictly necessary. However, as a “Softmax alternative,” we want the new distribution to serve as a smooth approximation of \text{argmax}, so we want to maintain as many properties of \text{argmax} as possible.
Gumbel Max
Perturb Max is a generalization of the Gumbel Max trick, which rests on the identity: \begin{equation} P[\text{argmax}(\boldsymbol{x}+\boldsymbol{\varepsilon}) = i] = \text{softmax}(\boldsymbol{x})_i,\quad \boldsymbol{\varepsilon}\sim \text{Gumbel Noise} \end{equation} where each component of \boldsymbol{\varepsilon} is independently sampled from the standard Gumbel distribution. While \text{argmax}(\boldsymbol{x}) is deterministic, adding the noise \boldsymbol{\varepsilon} makes \text{argmax}(\boldsymbol{x}+\boldsymbol{\varepsilon}) stochastic, and Gumbel Max tells us that under Gumbel noise, the probability that the argmax equals i is exactly \text{softmax}(\boldsymbol{x})_i.
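This identity is easy to check by Monte Carlo simulation (a minimal NumPy sketch; the vector and sample size are arbitrary):

import numpy as np

x = np.array([1.0, 0.5, -1.0, 2.0])
n_samples = 100000
# standard Gumbel noise: -log(-log(u)) with u ~ Uniform(0, 1)
eps = -np.log(-np.log(np.random.rand(n_samples, len(x))))
counts = np.bincount((x + eps).argmax(axis=1), minlength=len(x))
p = np.exp(x - x.max()); p /= p.sum()
print(counts / n_samples)   # empirical argmax frequencies
print(p)                    # softmax(x); the two agree up to sampling error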
General Noise
Perturb Max asks: if Softmax can be derived from the Gumbel distribution, what if we replace it with a general distribution, like the normal distribution? \begin{equation} p_i = P[\text{argmax}(\boldsymbol{x}+\boldsymbol{\varepsilon}) = i],\quad \varepsilon_1,\varepsilon_2,\cdots,\varepsilon_n\sim p(\varepsilon) \end{equation} Following the derivation of Gumbel Max, we get: \begin{equation} p_i = \int_{-\infty}^{\infty} p(\varepsilon_i)\left[\prod_{j\neq i} \Phi(x_i - x_j + \varepsilon_i)\right]d\varepsilon_i = \mathbb{E}_{\varepsilon}\left[\prod_{j\neq i} \Phi(x_i - x_j + \varepsilon)\right] \end{equation} where \Phi(\varepsilon) is the cumulative distribution function (CDF) of p(\varepsilon). For general distributions, this rarely has an analytical solution and must be estimated numerically. Nevertheless, Perturb Max satisfies both Monotonicity and Invariance: adding the same constant to every component of \boldsymbol{x} does not change the argmax of \boldsymbol{x}+\boldsymbol{\varepsilon}, and increasing x_i can only increase the probability that i is the argmax.
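For a general noise distribution the expectation is usually estimated by sampling; a minimal NumPy sketch with standard normal noise (the function name perturb_max and the sample size are our own choices):

import numpy as np

def perturb_max(x, n_samples=100000, sampler=np.random.standard_normal):
    # Monte Carlo estimate of p_i = P[argmax(x + eps) = i] under i.i.d. noise
    eps = sampler((n_samples, len(x)))
    counts = np.bincount((x + eps).argmax(axis=1), minlength=len(x))
    return counts / n_samples

x = np.array([1.0, 0.5, -1.0, 2.0])
print(perturb_max(x))          # with normal noise this is not softmax(x)
print(perturb_max(x + 3.0))    # but it matches the line above up to sampling error (Invariance)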
Sparsemax
Next is Sparsemax, from the 2016 paper “From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification”. Like Sparse Softmax, it targets sparsity, but it provides a more adaptive construction.
Basic Definition
The original paper defines Sparsemax as the solution to the following optimization problem: \begin{equation} \text{sparsemax}(\boldsymbol{x}) = \mathop{\text{argmin}}\limits_{\boldsymbol{p}\in\Delta^{n-1}}\Vert \boldsymbol{p} - \boldsymbol{x}\Vert^2 \end{equation} A more intuitive way to see the connection to Softmax is: \begin{equation} \boldsymbol{p} = \text{softmax}(\boldsymbol{x}) = \exp(\boldsymbol{x} - \lambda(\boldsymbol{x})) \end{equation} where \lambda(\boldsymbol{x}) is a constant ensuring the sum is 1. If we use the approximation \exp(x) \approx \text{relu}(1+x), we get: \begin{equation} \boldsymbol{p} = \text{sparsemax}(\boldsymbol{x}) = \text{relu}(\boldsymbol{x} - \lambda(\boldsymbol{x})) \end{equation} (where the constant 1 is absorbed into \lambda(\boldsymbol{x})).
Solving Algorithm
Assume \boldsymbol{x} is sorted in descending order: x_1 \geq x_2 \geq \cdots \geq x_n. If we assume x_k \geq \lambda(\boldsymbol{x}) \geq x_{k+1}, then: \begin{equation} \sum_{i=1}^k [x_i - \lambda(\boldsymbol{x})] = 1 \quad \Rightarrow \quad \lambda(\boldsymbol{x}) = \frac{(\sum_{i=1}^k x_i) - 1}{k} \end{equation} Denoting this candidate value by \lambda_k(\boldsymbol{x}), the correct threshold is obtained by finding the largest k such that x_k > \lambda_k(\boldsymbol{x}).
def sparsemax(x):
    # sort the logits in descending order
    x_sort = np.sort(x)[::-1]
    # candidate lambda_k for each k = 1, ..., n
    x_lamb = (np.cumsum(x_sort) - 1) / np.arange(1, len(x) + 1)
    # pick the largest k with x_sort[k-1] >= lambda_k
    lamb = x_lamb[(x_sort >= x_lamb).argmin() - 1]
    return np.maximum(x - lamb, 0)
Gradient and Loss
Sparsemax only has gradients for a subset of categories \Omega(\boldsymbol{x}) = \{k | x_k > \lambda(\boldsymbol{x})\}. For the loss function, using -\log p_t is problematic because p_t can be 0. A better choice is a loss whose gradient is \boldsymbol{p} - \text{onehot}(t): \begin{equation} \mathcal{L}_t = \frac{1}{2} - x_t + \sum_{i\in\Omega(\boldsymbol{x})}\frac{1}{2}\left(x_i^2 - \lambda^2(\boldsymbol{x})\right) \end{equation}
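A minimal NumPy sketch of this loss (the helper name sparsemax_loss is our own; it recomputes the threshold \lambda(\boldsymbol{x}) in the same way as the solver above):

import numpy as np

def sparsemax_loss(x, t):
    # recompute the threshold lambda(x) as in the solver above
    x_sort = np.sort(x)[::-1]
    x_lamb = (np.cumsum(x_sort) - 1) / np.arange(1, len(x) + 1)
    k = (x_sort > x_lamb).sum()        # size of the support Omega(x)
    lamb = x_lamb[k - 1]
    omega = x > lamb
    return 0.5 - x[t] + 0.5 * (x[omega]**2 - lamb**2).sum()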
Entmax-\alpha
Entmax-\alpha generalizes Sparsemax, providing a smooth transition from Softmax (\alpha=1) to Sparsemax (\alpha=2). It comes from the paper “Sparse Sequence-to-Sequence Models”.
Basic Definition
Using the approximation \exp(x) \approx \text{relu}^{1/\beta}(1 + \beta x) where \beta = \alpha - 1: \begin{equation} \text{Entmax}_{\alpha}(\boldsymbol{x}) = \text{relu}^{1/\beta}(\beta\boldsymbol{x} - \lambda(\boldsymbol{x})) \end{equation}
Solving Algorithm
For general \beta, \lambda(\boldsymbol{x}) is solved via bisection. For \beta=1/2 (\alpha=1.5), an analytical solution exists. Assuming z_k \geq \lambda(\boldsymbol{x}) \geq z_{k+1} for z_i = \beta x_i: \begin{equation} \lambda(\boldsymbol{x}) = \mu_k - \sqrt{\frac{1}{k} - \sigma_k^2} \end{equation} where \mu_k and \sigma_k^2 are the mean and variance of the top k elements of \boldsymbol{z}.
def entmax15(x):
    # Entmax with alpha = 1.5 (beta = 1/2), so we work with z = x / 2
    x_sort = np.sort(x / 2)[::-1]
    k = np.arange(1, len(x) + 1)
    # running mean and variance of the top-k elements of z
    x_mu = np.cumsum(x_sort) / k
    x_sigma2 = np.cumsum(x_sort**2) / k - x_mu**2
    # candidate lambda_k for each k (clipped to keep the sqrt real)
    x_lamb = x_mu - np.sqrt(np.maximum(1. / k - x_sigma2, 0))
    # keep the k satisfying z_k > lambda_k > z_{k+1}
    x_sort_shift = np.pad(x_sort[1:], (0, 1), constant_values=-np.inf)
    lamb = x_lamb[(x_sort > x_lamb) & (x_lamb > x_sort_shift)]
    return np.maximum(x / 2 - lamb, 0)**2
Loss Function
A simple way to define the loss is to use the stop_gradient operator, applied to \boldsymbol{p} rather than to \boldsymbol{x}: \begin{equation} \mathcal{L}_t = \left(\text{stop\_gradient}(\boldsymbol{p}) - \text{onehot}(t)\right)\cdot \boldsymbol{x} \end{equation} Since \text{stop\_gradient}(\boldsymbol{p}) is treated as a constant during backpropagation, the gradient with respect to \boldsymbol{x} is exactly \boldsymbol{p} - \text{onehot}(t), though the loss value itself is not directly interpretable.
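As an illustration of this trick, a minimal PyTorch-style sketch (our own helper entmax_loss; detach plays the role of stop_gradient, and entmax_fn can be any function mapping logits to probabilities, differentiable or not):

import torch

def entmax_loss(x, t, entmax_fn):
    # x: 1-D logit tensor with requires_grad=True, t: target class index
    p = entmax_fn(x).detach()          # stop_gradient(p): treated as a constant
    onehot = torch.zeros_like(p)
    onehot[t] = 1.0
    return torch.dot(p - onehot, x)    # gradient w.r.t. x is exactly p - onehot(t)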
Summary
This article reviewed Softmax and its alternatives, including Margin Softmax, Taylor Softmax, Sparse Softmax, Perturb Max, Sparsemax, and Entmax-\alpha, covering their definitions, properties, and implementations.