Several Inequalities for the LogSumExp Operation · English (unofficial) translations of posts at kexue.fm

\mathop{\mathrm{logsumexp}} is an operation frequently encountered in machine learning, especially in the implementation and derivation of cross-entropy. At the same time, it is also a smooth approximation of the \max function (refer to “Seeking a Smooth Maximum Function”). Let x=(x_1, x_2, \dots, x_n); \mathop{\mathrm{logsumexp}} is defined as: \mathop{\mathrm{logsumexp}}(x) = \log \sum_{i=1}^n e^{x_i} This article introduces several inequalities related to \mathop{\mathrm{logsumexp}} that may be useful in theoretical derivations.

Basic Bounds

Let x_{\max} = \max(x_1, x_2, \dots, x_n). Then it is obvious that: e^{x_{\max}} < \sum_{i=1}^n e^{x_i} \leq \sum_{i=1}^n e^{x_{\max}} = ne^{x_{\max}} Taking the logarithm of each side, we obtain: x_{\max} < \mathop{\mathrm{logsumexp}}(x) \leq x_{\max} + \log n This is the most fundamental result regarding the upper and lower bounds of \mathop{\mathrm{logsumexp}}. It shows that the approximation error of \mathop{\mathrm{logsumexp}} relative to \max does not exceed \log n. Note that this error is independent of x itself. Consequently, we have: x_{\max}/\tau < \mathop{\mathrm{logsumexp}}(x/\tau) \leq x_{\max}/\tau + \log n Multiplying each side by \tau gives: x_{\max} < \tau \mathop{\mathrm{logsumexp}}(x/\tau) \leq x_{\max} + \tau \log n As \tau \to 0, the error approaches 0. This tells us that we can improve the degree of approximation to \max by lowering the temperature parameter \tau.

Average Bounds

We know that e^x is a convex function, which satisfies Jensen’s inequality \mathbb{E}[e^{x}] \geq e^{\mathbb{E}[x]}. Therefore: \frac{1}{n} \sum_{i=1}^n e^{x_i} \geq e^{\bar{x}} where \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i. Multiplying both sides by n and taking the logarithm yields: \mathop{\mathrm{logsumexp}}(x) \geq \bar{x} + \log n This is another result regarding the lower bound of \mathop{\mathrm{logsumexp}}. This result can be further generalized to the case of weighted averages: let p_1, p_2, \dots, p_n \geq 0 and \sum_{i=1}^n p_i = 1. From the Cauchy-Schwarz inequality, we have: \left[\sum_{i=1}^n (e^{x_i/2})^2\right] \left[\sum_{i=1}^n p_i^2\right] \geq \left[\sum_{i=1}^n p_i e^{x_i/2}\right]^2 Applying Jensen’s inequality to the expression inside the square brackets on the right side, we get: \left[\sum_{i=1}^n p_i e^{x_i/2}\right]^2 \geq \left[e^{\left(\sum_{i=1}^n p_i x_i/2\right)}\right]^2 = e^{\left(\sum_{i=1}^n p_i x_i\right)} Taking the logarithm of both sides and rearranging gives: \mathop{\mathrm{logsumexp}}(x) \geq \sum_{i=1}^n p_i x_i - \log \sum_{i=1}^n p_i^2 If we use the more general Hölder’s inequality instead of the Cauchy-Schwarz inequality at the beginning, we can also obtain: \mathop{\mathrm{logsumexp}}(x) \geq \sum_{i=1}^n p_i x_i - \frac{1}{t-1} \log \sum_{i=1}^n p_i^t, \quad \forall t > 1 In particular, by taking the limit as t \to 1, we obtain: \mathop{\mathrm{logsumexp}}(x) \geq \sum_{i=1}^n p_i x_i - \sum_{i=1}^n p_i \log p_i This can be equivalently rewritten as \sum_{i=1}^n p_i \log \frac{p_i}{e^{x_i}/Z} \geq 0, where Z = e^{\mathop{\mathrm{logsumexp}}(x)} is the normalization factor. Thus, it is actually the KL divergence between two distributions.

Lipschitz Constraint

Under the infinity norm, \mathop{\mathrm{logsumexp}} also satisfies a Lipschitz constraint, namely: |\mathop{\mathrm{logsumexp}}(x) - \mathop{\mathrm{logsumexp}}(y)| \leq \|x - y\|_{\infty} where \|x - y\|_{\infty} = \max_i |x_i - y_i| (it is actually more intuitive to denote it as \|x - y\|_{\max}). The proof is not difficult. Define: f(t) = \mathop{\mathrm{logsumexp}}(tx + (1-t)y), \quad t \in [0, 1] Treating this as a univariate function of t, by the Mean Value Theorem, there exists \varepsilon \in (0, 1) such that: f'(\varepsilon) = \frac{f(1) - f(0)}{1 - 0} = \mathop{\mathrm{logsumexp}}(x) - \mathop{\mathrm{logsumexp}}(y) It is easy to calculate that: f'(\varepsilon) = \frac{\sum_{i=1}^n e^{\varepsilon x_i + (1-\varepsilon)y_i}(x_i - y_i)}{\sum_{i=1}^n e^{\varepsilon x_i + (1-\varepsilon)y_i}} Therefore: \begin{aligned} |\mathop{\mathrm{logsumexp}}(x) - \mathop{\mathrm{logsumexp}}(y)| &= \left|\frac{\sum_{i=1}^n e^{\varepsilon x_i + (1-\varepsilon)y_i}(x_i - y_i)}{\sum_{i=1}^n e^{\varepsilon x_i + (1-\varepsilon)y_i}}\right| \\ &\leq \frac{\sum_{i=1}^n e^{\varepsilon x_i + (1-\varepsilon)y_i} |x_i - y_i|}{\sum_{i=1}^n e^{\varepsilon x_i + (1-\varepsilon)y_i}} \\ &\leq \frac{\sum_{i=1}^n e^{\varepsilon x_i + (1-\varepsilon)y_i} \|x - y\|_{\infty}}{\sum_{i=1}^n e^{\varepsilon x_i + (1-\varepsilon)y_i}} = \|x - y\|_{\infty} \end{aligned}

Convexity

Finally, there is a very strong conclusion: \mathop{\mathrm{logsumexp}} is a convex function! This means that all inequalities related to convex functions apply to \mathop{\mathrm{logsumexp}}, such as the basic Jensen’s inequality: \mathbb{E}[\mathop{\mathrm{logsumexp}}(x)] \geq \mathop{\mathrm{logsumexp}}(\mathbb{E}[x])

To prove that \mathop{\mathrm{logsumexp}} is a convex function is to prove that for \forall t \in [0, 1], the following holds: t \mathop{\mathrm{logsumexp}}(x) + (1-t) \mathop{\mathrm{logsumexp}}(y) \geq \mathop{\mathrm{logsumexp}}(tx + (1-t)y) The proof process is essentially a basic application of Hölder’s inequality. Specifically, we have: t \mathop{\mathrm{logsumexp}}(x) + (1-t) \mathop{\mathrm{logsumexp}}(y) = \log \left(\sum_{i=1}^n e^{x_i}\right)^t \left(\sum_{i=1}^n e^{y_i}\right)^{(1-t)} Now, directly applying Hölder’s inequality gives: \log \left(\sum_{i=1}^n e^{x_i}\right)^t \left(\sum_{i=1}^n e^{y_i}\right)^{(1-t)} \geq \log \sum_{i=1}^n e^{tx_i + (1-t)y_i} = \mathop{\mathrm{logsumexp}}(tx + (1-t)y) This proves that \mathop{\mathrm{logsumexp}} is a convex function.

Conclusion

This article mainly summarizes the inequalities related to the \mathop{\mathrm{logsumexp}} operation for future reference.

Original Address: https://kexue.fm/archives/9070