Today I would like to share something light: an identity I noticed over the past couple of days. The identity itself is quite simple, but it feels a bit unexpected at first glance, so I am recording it here.
Basic Result
We know that \mathop{\mathrm{relu}}(x) = \max(x, 0). It is easy to prove the following identity: \begin{equation} x = \mathop{\mathrm{relu}}(x) - \mathop{\mathrm{relu}}(-x) \end{equation} If x is a vector, the above equation is even more intuitive: \mathop{\mathrm{relu}}(x) extracts the positive components of x, and -\mathop{\mathrm{relu}}(-x) extracts the negative components of x. Adding the two together yields the original vector.
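As a quick numerical sketch (plain NumPy, not from the original post), the decomposition can be checked directly: \mathop{\mathrm{relu}}(x) keeps the positive entries, -\mathop{\mathrm{relu}}(-x) keeps the negative ones, and their sum recovers x.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

# A vector with mixed signs: relu(x) keeps the positive part,
# -relu(-x) keeps the negative part, and their sum recovers x.
x = np.array([1.5, -2.0, 0.0, 3.2, -0.7])
reconstructed = relu(x) - relu(-x)
print(np.allclose(x, reconstructed))  # True
```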
General Conclusion
The next question is: do activation functions like GeLU and Swish satisfy similar identities? At first glance, they might not seem to, but in fact, they do! We even have a more general conclusion:
Let \phi(x) be any odd function, and let f(x) = \frac{1}{2}(\phi(x) + 1)x. Then the following identity holds: \begin{equation} x = f(x) - f(-x) \end{equation}
Proving this conclusion is also very straightforward, so I will not elaborate on it here. For Swish, we have \phi(x) = \tanh\left(\frac{x}{2}\right), and for GeLU, we have \phi(x) = \mathop{\mathrm{erf}}\left(\frac{x}{\sqrt{2}}\right). Since both of these \phi are odd functions, Swish and GeLU satisfy the same identity.
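The identity is also easy to verify numerically. The sketch below (my own, using NumPy and SciPy) builds f(x) = \frac{1}{2}(\phi(x)+1)x for the two choices of \phi above; note that the Swish case reduces to x\,\sigma(x) and the GeLU case to x\,\Phi(x).

```python
import numpy as np
from scipy.special import erf

def f(x, phi):
    # f(x) = (phi(x) + 1) * x / 2, with phi an odd function
    return 0.5 * (phi(x) + 1.0) * x

phi_swish = lambda x: np.tanh(x / 2)        # f becomes Swish: x * sigmoid(x)
phi_gelu  = lambda x: erf(x / np.sqrt(2))   # f becomes GeLU:  x * Phi(x)

x = np.linspace(-5, 5, 101)
for phi in (phi_swish, phi_gelu):
    assert np.allclose(f(x, phi) - f(-x, phi), x)
print("x = f(x) - f(-x) holds for Swish and GeLU")
```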
Reflections on Significance
The identity above can be written in matrix form as: \begin{equation} x = f(x) - f(-x) = f\left(x\begin{bmatrix}1 & -1\end{bmatrix}\right)\begin{bmatrix}1 \\ -1\end{bmatrix} \end{equation} where f acts elementwise (for a vector x, the 1 and -1 become identity blocks). This indicates that a two-layer neural network using ReLU, GeLU, Swish, etc., as its activation function has the capacity to degenerate into a single linear layer, i.e., to reproduce the identity mapping exactly. In other words, such networks can adaptively adjust their effective depth, which is similar in spirit to how residual connections work in ResNet. This might be one of the reasons why these activation functions perform better than traditional ones like Tanh or Sigmoid.
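As a concrete illustration (my own sketch; the weight matrices W1 and W2 below are hypothetical, chosen to realize the identity), a two-layer block y = f(xW_1)W_2 with these weights collapses to the identity map:

```python
import numpy as np
from scipy.special import erf

def gelu(x):
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2)))

d = 4
I = np.eye(d)
W1 = np.concatenate([I, -I], axis=1)   # d -> 2d: produces [x, -x]
W2 = np.concatenate([I, -I], axis=0)   # 2d -> d: computes f(x) - f(-x)

x = np.random.randn(3, d)              # a batch of inputs
y = gelu(x @ W1) @ W2                  # two-layer block with GeLU activation
print(np.allclose(y, x))               # True: the block acts as the identity
```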