English (unofficial) translations of posts at kexue.fm

An Identity for ReLU/GeLU/Swish

Translated by Gemini Flash 3.0 Preview. Translations can be inaccurate; please refer to the original post for anything important.

Today I will share some light content based on an identity I realized over the past couple of days. This identity is actually quite simple, but it feels a bit unexpected at first glance, so I am recording it here.

Basic Result

We know that \mathop{\mathrm{relu}}(x) = \max(x, 0). It is easy to prove the following identity: \begin{equation} x = \mathop{\mathrm{relu}}(x) - \mathop{\mathrm{relu}}(-x) \end{equation} If x is a vector, the above equation is even more intuitive: \mathop{\mathrm{relu}}(x) extracts the positive components of x, and -\mathop{\mathrm{relu}}(-x) extracts the negative components of x. Adding the two together yields the original vector.
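As a quick numerical sanity check (a minimal NumPy sketch, not part of the original post):

```python
import numpy as np

def relu(x):
    # relu(x) = max(x, 0), applied elementwise
    return np.maximum(x, 0.0)

x = np.random.randn(5)
# relu(x) keeps the positive components, -relu(-x) keeps the negative ones
assert np.allclose(x, relu(x) - relu(-x))
```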

General Conclusion

The next question is: do activation functions like GeLU and Swish satisfy similar identities? At first glance, they might not seem to, but in fact, they do! We even have a more general conclusion:

Let \phi(x) be any odd function, and let f(x) = \frac{1}{2}(\phi(x) + 1)x. Then the following identity holds: \begin{equation} x = f(x) - f(-x) \end{equation}

Proving this conclusion is also very straightforward, so I will not elaborate on it here. For Swish, we have \phi(x) = \tanh\left(\frac{x}{2}\right), and for GeLU, we have \phi(x) = \mathop{\mathrm{erf}}\left(\frac{x}{\sqrt{2}}\right). Since both are odd functions, they satisfy the same identity.
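The identity is also easy to verify numerically. Below is a minimal NumPy/SciPy sketch (my own check, not part of the original post) for the Swish and GeLU choices of \phi given above:

```python
import numpy as np
from scipy.special import erf

def f(x, phi):
    # f(x) = (1/2) * (phi(x) + 1) * x, where phi is an odd function
    return 0.5 * (phi(x) + 1.0) * x

# phi for Swish (beta = 1): tanh(x/2), so that f(x) = x * sigmoid(x)
swish_phi = lambda x: np.tanh(x / 2.0)
# phi for GeLU: erf(x / sqrt(2)), so that f(x) = x * Phi(x), Phi being the standard normal CDF
gelu_phi = lambda x: erf(x / np.sqrt(2.0))

x = np.random.randn(1000)
for phi in (swish_phi, gelu_phi):
    assert np.allclose(x, f(x, phi) - f(-x, phi))
```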

Reflections on Significance

The identity above can be written in matrix form as: \begin{equation} x = f(x) - f(-x) = f\left(x\begin{bmatrix}1 & -1\end{bmatrix}\right)\begin{bmatrix}1 \\ -1\end{bmatrix} \end{equation} This indicates that when ReLU, GeLU, Swish, etc., are used as the activation function, a two-layer neural network has the capacity to degenerate into a single linear layer. In other words, the model can adaptively adjust its effective depth, which is similar in principle to how ResNet works. This might be one of the reasons why these activation functions perform better than traditional ones like Tanh or Sigmoid.
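To make the "degenerate into a single layer" point concrete, here is a minimal sketch (my own illustration, assuming a bias-free two-neuron hidden layer with Swish activation) in which the two-layer block reproduces the identity map exactly:

```python
import numpy as np

def swish(x):
    # Swish: f(x) = x * sigmoid(x) = (1/2) * (tanh(x/2) + 1) * x
    return x / (1.0 + np.exp(-x))

W1 = np.array([[1.0, -1.0]])    # first layer: x -> [x, -x]
W2 = np.array([[1.0], [-1.0]])  # second layer: [f(x), f(-x)] -> f(x) - f(-x)

x = np.random.randn(8, 1)
y = swish(x @ W1) @ W2  # two-layer block with no biases
assert np.allclose(x, y)  # the block reduces to the identity map
```

By the same identity, replacing [1, -1] with [W, -W] and the stacked [1; -1] with [I; -I] collapses the two-layer block to an arbitrary linear map xW.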

When reposting, please include the original address of this article: https://kexue.fm/archives/11233
For more details on reposting, please refer to: "Scientific Spaces FAQ"