English (unofficial) translations of posts at kexue.fm
Source

Constructing Discrete Probability Distributions from the Perspective of Reparameterization

Translated by DeepSeek V4 Pro. Translations can be inaccurate, please refer to the original post for important stuff.

Generally speaking, the outputs of neural networks are unconstrained, meaning their range is \mathbb{R}. To obtain constrained outputs, activation functions are typically employed. For example, if we want to output a probability distribution representing the probability of each category, Softmax is usually added as the final activation function. A natural follow-up question is: besides Softmax, are there any other operations that can generate a probability distribution?

In "A Discussion on Reparameterization: From Normal Distribution to Gumbel Softmax", we introduced the reparameterization operation of Softmax. In this article, we reverse this process: we first define a reparameterization operation and then derive the corresponding probability distribution, thereby obtaining a new perspective on the construction of probability distributions.

Problem Definition

Suppose the output vector of a model is \boldsymbol{\mu}=[\mu_1,\cdots,\mu_n]\in\mathbb{R}^n. Without loss of generality, we assume that \mu_i are pairwise distinct. We hope to transform \boldsymbol{\mu} into an n-dimensional probability distribution \boldsymbol{p}=[p_1,\cdots,p_n] through some transformation \mathcal{T}, while maintaining certain properties. For instance, the most basic requirements are: {\color{red}1.}\,p_i\geq 0 \qquad {\color{red}2.}\,\sum_i p_i = 1 \qquad {\color{red}3.}\,p_i \geq p_j \Leftrightarrow \mu_i \geq \mu_j Of course, these requirements are quite trivial. As long as f is a monotonic function from \mathbb{R} \mapsto \mathbb{R}^+ (for Softmax, f(x)=e^x), the transformation p_i = \frac{f(\mu_i)}{\sum\limits_j f(\mu_j)} will satisfy the above requirements. Next, we add a less trivial condition: {\color{red}4.}\, \mathcal{T}(\boldsymbol{\mu}) = \mathcal{T}(\boldsymbol{\mu} + c\boldsymbol{1})\quad (\forall c \in \mathbb{R}) where \boldsymbol{1} represents a vector of all ones, and c is an arbitrary constant. That is to say, after adding the same constant to each component of \boldsymbol{\mu}, the result of the transformation remains unchanged. This condition is proposed because adding a constant to each component does not change the result of \mathop{\text{argmax}}, and it is preferable for \mathcal{T} to maintain similar properties. It is easy to verify that Softmax satisfies this condition; however, besides Softmax, it seems difficult to think of other transformations.

Noise Perturbation

Interestingly, we can construct such a transformation using the inverse process of reparameterization! Suppose \boldsymbol{\varepsilon}=[\varepsilon_1,\cdots,\varepsilon_n] is a vector obtained by independently and identically sampling n times from a distribution p(\varepsilon). Since \boldsymbol{\varepsilon} is random, \mathop{\text{argmax}}(\boldsymbol{\mu}+\boldsymbol{\varepsilon}) is usually also random. We can then define the transformation \mathcal{T} through: p_i = P[\mathop{\text{argmax}}(\boldsymbol{\mu}+\boldsymbol{\varepsilon})=i] Since \boldsymbol{\varepsilon} is i.i.d. and the entire definition only depends on \mathop{\text{argmax}}(\boldsymbol{\mu}+\boldsymbol{\varepsilon}) (which only involves the relative magnitudes of each component), the defined transformation must satisfy the four aforementioned conditions.

We can also judge the properties it satisfies by directly calculating the form of p_i. Specifically, \mathop{\text{argmax}}(\boldsymbol{\mu}+\boldsymbol{\varepsilon})=i means \mu_i + \varepsilon_i > \mu_j + \varepsilon_j\quad (\forall j\neq i) which is \mu_i - \mu_j + \varepsilon_i > \varepsilon_j. Clearly, the larger \mu_i is, the more likely this inequality holds, meaning the larger the corresponding p_i is; this is condition 3. Specifically, for a fixed \varepsilon_i, the probability that this condition holds is \int_{-\infty}^{\mu_i - \mu_j + \varepsilon_i} p(\varepsilon_j)d\varepsilon_j = \Phi(\mu_i - \mu_j + \varepsilon_i) where \Phi is the Cumulative Distribution Function (CDF) of p(\varepsilon). Since each \varepsilon_j is i.i.d., we can multiply the probabilities directly: \prod_{j\neq i} \Phi(\mu_i - \mu_j + \varepsilon_i) This is the probability that \mathop{\text{argmax}}(\boldsymbol{\mu}+\boldsymbol{\varepsilon})=i given a fixed \varepsilon_i. Finally, we only need to average over \varepsilon_i to obtain p_i: p_i = \int_{-\infty}^{\infty} p(\varepsilon_i)\left[\prod_{j\neq i} \Phi(\mu_i - \mu_j + \varepsilon_i)\right]d\varepsilon_i \label{eq:pi} From the expression of p_i, we can see that it only depends on the relative values \mu_i - \mu_j, so it clearly satisfies condition 4 in the definition.

Reviewing the Old to Know the New

Referring back to the introduction of Gumbel Max in "A Discussion on Reparameterization: From Normal Distribution to Gumbel Softmax", we can see that the above derivation is exactly the inverse of reparameterization. It first defines the reparameterization method and then derives the corresponding probability distribution.

Now we can re-examine the previous results: when the noise distribution is the Gumbel distribution, does equation [eq:pi] yield the standard Softmax operation? Gumbel noise is transformed from u\sim U[0,1] via \varepsilon = -\log(-\log u). Since the distribution of u is exactly U[0,1], solving for u=e^{-e^{-\varepsilon}} gives the CDF of the Gumbel distribution, i.e., \Phi(\varepsilon)=e^{-e^{-\varepsilon}}, and p(\varepsilon) is the derivative of \Phi(\varepsilon), which is p(\varepsilon)=\Phi'(\varepsilon)=e^{-\varepsilon-e^{-\varepsilon}}.

Substituting these results into equation [eq:pi], we get: \begin{aligned} p_i =&\, \int_{-\infty}^{\infty} e^{-\varepsilon_i-e^{-\varepsilon_i}} e^{-\sum\limits_{j\neq i}e^{-\varepsilon_i + \mu_j - \mu_i}} d\varepsilon_i \\ =&\, \int_{-\infty}^0 e^{-e^{-\varepsilon_i}\left(1+\sum\limits_{j\neq i}e^{\mu_j - \mu_i}\right)} d(-e^{-\varepsilon_i}) \\ =&\, \int_{-\infty}^0 e^{t\left(1+\sum\limits_{j\neq i}e^{\mu_j - \mu_i}\right)} dt\\ =&\, \frac{1}{1+\sum\limits_{j\neq i}e^{\mu_j - \mu_i}} = \frac{e^{\mu_i}}{\sum\limits_j e^{\mu_j }} \end{aligned} This is exactly Softmax. Thus, we have once again verified the correspondence between Gumbel Max and Softmax.

Numerical Computation

Being able to find an analytical solution like Softmax from the Gumbel distribution is extremely rare; at least for now, the author cannot find a second example. Therefore, in most cases, we can only use numerical methods to approximate equation [eq:pi]. Since p(\varepsilon)=\Phi'(\varepsilon), we can use integration by substitution: p_i = \int_0^1 \left[\prod_{j\neq i} \Phi(\mu_i - \mu_j + \varepsilon_i)\right]d\Phi(\varepsilon_i) Let t=\Phi(\varepsilon_i), then \begin{aligned} p_i =&\, \int_0^1 \left[\prod_{j\neq i} \Phi(\mu_i - \mu_j + \Phi^{-1}(t))\right]dt \\ \approx&\, \frac{1}{K}\sum_{k=1}^K\prod_{j\neq i} \Phi\left(\mu_i - \mu_j + \Phi^{-1}\left(\frac{k}{K+1}\right)\right) \end{aligned} where \Phi^{-1} is the inverse function of \Phi, also known in probability as the Quantile Function or Percent Point Function.

From the above formula, we can see that as long as we know the analytical form of \Phi, we can approximate p_i. Note that we do not need the analytical form of \Phi^{-1}, because the sampling points \Phi^{-1}\left(\frac{k}{K+1}\right) can be precomputed using other numerical methods.

Taking the standard normal distribution as an example, \Phi(x)=\frac{1}{2} \left(1+\text{erf}\left(\frac{x}{\sqrt{2}}\right)\right). Since most mainstream deep learning frameworks include the \text{erf} function, the calculation of \Phi(x) is straightforward. As for \Phi^{-1}\left(\frac{k}{K+1}\right), we can precompute it using scipy.stats.norm.ppf. Therefore, when \boldsymbol{\varepsilon} is sampled from a standard normal distribution, the calculation of p_i is feasible in mainstream deep learning frameworks.

Summary

This article generalizes Softmax from the perspective of reparameterization, resulting in a class of probability normalization methods with similar properties.

When reposting, please include the original address of this article: https://kexue.fm/archives/9085

For more detailed information regarding reposting, please refer to: "Scientific Space FAQ"