In the previous article Generative Diffusion Models Part 19: GANs as Diffusion ODEs, we introduced how to understand GANs as diffusion ODEs in another time dimension. In short, a GAN actually transforms the movement of samples in a diffusion model into the movement of generator parameters! However, the derivation in that article relied on relatively complex, self-contained concepts such as Wasserstein gradient flow, which did not connect well with the earlier articles in the diffusion series, leaving something of a gap in the exposition.
In the author’s view, ReFlow, introduced in Part 17, is the most intuitive scheme for understanding diffusion ODEs. Since GANs can be understood from the perspective of diffusion ODEs, there must exist a way to understand GANs from the perspective of ReFlow. After some attempts, the author successfully derived results similar to WGAN-GP from ReFlow.
Theoretical Review
The reason why "ReFlow is the most intuitive scheme for understanding diffusion ODEs" is that it is inherently very flexible and closely aligns with experimental code. It can establish a mapping from an arbitrary noise distribution to a target data distribution through an ODE, and the training objective is very straightforward, corresponding directly to experimental code without any "convoluted steps."
Specifically, assume \boldsymbol{x}_0 \sim p_0(\boldsymbol{x}_0) is a random noise sampled from a prior distribution, and \boldsymbol{x}_1 \sim p_1(\boldsymbol{x}_1) is a real sample sampled from the target distribution (Note: In previous articles, \boldsymbol{x}_T was generally noise and \boldsymbol{x}_0 was the target sample; here, they are reversed for convenience). ReFlow allows us to specify any trajectory from \boldsymbol{x}_0 to \boldsymbol{x}_1. For simplicity, ReFlow chooses a straight line, namely: \begin{equation} \boldsymbol{x}_t = (1-t)\boldsymbol{x}_0 + t \boldsymbol{x}_1 \label{eq:line} \end{equation} Now we find the ODE it satisfies: \begin{equation} \frac{d\boldsymbol{x}_t}{dt} = \boldsymbol{x}_1 - \boldsymbol{x}_0 \end{equation} This ODE is simple but impractical because we want to generate \boldsymbol{x}_1 from \boldsymbol{x}_0 via the ODE, but the above ODE puts the target we want to generate inside the equation, which is a "causal inversion." To remedy this defect, ReFlow’s idea is simple: learn a function of \boldsymbol{x}_t to approximate \boldsymbol{x}_1 - \boldsymbol{x}_0. 
Once learned, use it to replace \boldsymbol{x}_1 - \boldsymbol{x}_0, i.e., \begin{equation} \boldsymbol{\varphi}^* = \mathop{\text{argmin}}_{\boldsymbol{\varphi}} \mathbb{E}_{\boldsymbol{x}_0\sim p_0(\boldsymbol{x}_0),\boldsymbol{x}_1\sim p_1(\boldsymbol{x}_1),t\sim U[0,1]}\left[\frac{1}{2}\Vert\boldsymbol{v}_{\boldsymbol{\varphi}}(\boldsymbol{x}_t, t) - (\boldsymbol{x}_1 - \boldsymbol{x}_0)\Vert^2\right] \label{eq:s-loss} \end{equation} (note that t must also be sampled, uniformly from [0,1], since \boldsymbol{x}_t depends on it) and \begin{equation} \frac{d\boldsymbol{x}_t}{dt} = \boldsymbol{x}_1 - \boldsymbol{x}_0 \quad \Rightarrow \quad \frac{d\boldsymbol{x}_t}{dt} = \boldsymbol{v}_{\boldsymbol{\varphi}^*}(\boldsymbol{x}_t, t) \label{eq:ode-core} \end{equation} We have previously proved that, under the assumption that \boldsymbol{v}_{\boldsymbol{\varphi}}(\boldsymbol{x}_t, t) has infinite fitting capacity, the new ODE indeed achieves the sample transformation from distribution p_0(\boldsymbol{x}_0) to distribution p_1(\boldsymbol{x}_1).
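The training objective above translates almost line-for-line into code. Below is a minimal numerical sketch of Eq. [eq:s-loss], where the velocity field is a stand-in callable rather than a neural network (the names `reflow_loss` and `v_zero` are illustrative choices, not from the original):

```python
import numpy as np

rng = np.random.default_rng(0)

def reflow_loss(v, x0, x1, t):
    """Monte-Carlo estimate of the ReFlow objective
    E[ 1/2 ||v(x_t, t) - (x1 - x0)||^2 ]  with  x_t = (1-t) x0 + t x1."""
    xt = (1.0 - t)[:, None] * x0 + t[:, None] * x1   # linear interpolation (eq:line)
    residual = v(xt, t) - (x1 - x0)                  # v should predict x1 - x0
    return 0.5 * np.mean(np.sum(residual**2, axis=-1))

# toy batch: noise x0 and shifted "data" x1 in 2-D
x0 = rng.standard_normal((256, 2))
x1 = rng.standard_normal((256, 2)) + 3.0
t = rng.uniform(size=256)

# the all-zero field incurs the full 1/2 E||x1 - x0||^2 penalty
v_zero = lambda xt, t: np.zeros_like(xt)
print(reflow_loss(v_zero, x0, x1, t))
```

In practice `v` would be a network taking the pair (x_t, t), trained by stochastic gradient descent on this loss; the sketch only exposes the shape of the objective.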
Relative Motion
One of the important characteristics of ReFlow is that it does not restrict the form of the prior distribution p_0(\boldsymbol{x}_0). This means we can replace the prior distribution with any distribution we want, for example, a distribution transformed by a generator \boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z}): \begin{equation} \boldsymbol{x}_0 \sim p_0(\boldsymbol{x}_0) \quad \Leftrightarrow \quad \boldsymbol{x}_0 = \boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z}), \, \boldsymbol{z} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I}) \end{equation} After completing the training by substituting this into Eq. [eq:s-loss], we can use Eq. [eq:ode-core] to transform any \boldsymbol{x}_0 = \boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z}) into a real sample \boldsymbol{x}_1.
However, we are not satisfied with just this. As mentioned earlier, GANs transform the movement of samples in a diffusion model into the movement of generator parameters. This can also be done in the ReFlow framework: assuming the current parameters of the generator are \boldsymbol{\theta}_{\tau}, we expect the change \boldsymbol{\theta}_{\tau} \to \boldsymbol{\theta}_{\tau+1} to simulate the effect of moving forward a small step in Eq. [eq:ode-core]: \begin{equation} \boldsymbol{\theta}_{\tau+1} = \mathop{\text{argmin}}_{\boldsymbol{\theta}}\mathbb{E}_{\boldsymbol{z}\sim \mathcal{N}(\boldsymbol{0},\boldsymbol{I})}\Big[\big\Vert \boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z}) - \boldsymbol{g}_{\boldsymbol{\theta}_{\tau}}(\boldsymbol{z}) - \epsilon\,\boldsymbol{v}_{\boldsymbol{\varphi}^*}(\boldsymbol{g}_{\boldsymbol{\theta}_{\tau}}(\boldsymbol{z}), 0)\big\Vert^2\Big] \label{eq:g-loss} \end{equation} Note that t in Eq. [eq:s-loss] and Eq. [eq:ode-core] does not have the same meaning as \tau in the parameters \boldsymbol{\theta}_{\tau}. The former is the time parameter of the ODE, while the latter is the training progress; hence different notations are used. Furthermore, \boldsymbol{g}_{\boldsymbol{\theta}_{\tau}}(\boldsymbol{z}) plays the role of \boldsymbol{x}_0 for the ODE, so pushing forward a small step \epsilon yields \boldsymbol{x}_{\epsilon}, and the time argument passed to \boldsymbol{v}_{\boldsymbol{\varphi}^*}(\boldsymbol{x}_t, t) is t = 0.
Now, we have a new \boldsymbol{g}_{\boldsymbol{\theta}_{\tau+1}}(\boldsymbol{z}). Theoretically, the distribution it produces is closer to the real distribution (because it has moved forward a small step). Then, we treat it as the new \boldsymbol{x}_0, substitute it into Eq. [eq:s-loss] for training, and after training, substitute it back into Eq. [eq:g-loss] to optimize the generator. This iterative process is a GAN-like alternating training procedure.
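The alternating procedure can be run end to end on a toy 1-D example. Here both the prior and the data are unit-variance Gaussians, so the optimum of Eq. [eq:s-loss] at t = 0 is available exactly: since \boldsymbol{x}_1 is independent of \boldsymbol{x}_0, E[x_1 - x_0 | x_0] = m - x_0. This closed form stands in for the velocity-training phase (a sketch under these toy assumptions, not a general implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
m, eps = 3.0, 0.1      # data mean and ODE step size
theta = 0.0            # shift generator g_theta(z) = z + theta

# "velocity" phase, solved exactly for this toy: at t = 0 the optimum
# of eq:s-loss is E[x1 - x0 | x0] = m - x0
v0 = lambda x: m - x

for step in range(100):
    # generator phase (eq:g-loss): absorb one Euler step into theta
    z = rng.standard_normal(1024)
    x0 = z + theta
    theta = theta + eps * np.mean(v0(x0))

print(theta)   # drifts toward the data mean m
```

Each pass moves the generated distribution a step ε along the ODE, exactly the GAN-like alternation described above: re-fit the field for the current fake distribution, then push the generator forward.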
WGAN-GP
Can this process be quantitatively linked to existing GANs? Yes! Specifically to WGAN-GP with gradient penalty.
First, let’s look at the loss function [eq:s-loss]. Expanding the squared norm inside the expectation gives: \begin{equation} \frac{1}{2}\Vert\boldsymbol{v}_{\boldsymbol{\varphi}}(\boldsymbol{x}_t, t)\Vert^2 - \langle\boldsymbol{v}_{\boldsymbol{\varphi}}(\boldsymbol{x}_t, t),\boldsymbol{x}_1 - \boldsymbol{x}_0\rangle + \frac{1}{2}\Vert\boldsymbol{x}_1 - \boldsymbol{x}_0\Vert^2 \end{equation} The third term is independent of the parameter \boldsymbol{\varphi}, so removing it does not affect the result. Now we assume \boldsymbol{v}_{\boldsymbol{\varphi}} has strong enough fitting capacity that we do not need to explicitly input t. Then the above expression as a loss function is equivalent to: \begin{equation} \frac{1}{2}\Vert\boldsymbol{v}_{\boldsymbol{\varphi}}(\boldsymbol{x}_t)\Vert^2 - \langle\boldsymbol{v}_{\boldsymbol{\varphi}}(\boldsymbol{x}_t),\boldsymbol{x}_1 - \boldsymbol{x}_0\rangle = \frac{1}{2}\Vert\boldsymbol{v}_{\boldsymbol{\varphi}}(\boldsymbol{x}_t)\Vert^2 - \left\langle\boldsymbol{v}_{\boldsymbol{\varphi}}(\boldsymbol{x}_t),\frac{d\boldsymbol{x}_t}{dt}\right\rangle \end{equation} \boldsymbol{v}_{\boldsymbol{\varphi}}(\boldsymbol{x}_t) is a vector function with the same input and output dimensions. We further assume it is the gradient of some scalar function D_{\boldsymbol{\varphi}}(\boldsymbol{x}_t), i.e., \boldsymbol{v}_{\boldsymbol{\varphi}}(\boldsymbol{x}_t)=\nabla_{\boldsymbol{x}_t} D_{\boldsymbol{\varphi}}(\boldsymbol{x}_t).
Then the above expression becomes: \begin{equation} \frac{1}{2}\Vert\nabla_{\boldsymbol{x}_t} D_{\boldsymbol{\varphi}}(\boldsymbol{x}_t)\Vert^2 - \left\langle\nabla_{\boldsymbol{x}_t} D_{\boldsymbol{\varphi}}(\boldsymbol{x}_t),\frac{d\boldsymbol{x}_t}{dt}\right\rangle = \frac{1}{2}\Vert\nabla_{\boldsymbol{x}_t} D_{\boldsymbol{\varphi}}(\boldsymbol{x}_t)\Vert^2 - \frac{d D_{\boldsymbol{\varphi}}(\boldsymbol{x}_t)}{dt} \end{equation} Assuming that the change in D_{\boldsymbol{\varphi}}(\boldsymbol{x}_t) is relatively smooth, then \frac{d D_{\boldsymbol{\varphi}}(\boldsymbol{x}_t)}{dt} should be close to its difference at the two points t=0 and t=1, D_{\boldsymbol{\varphi}}(\boldsymbol{x}_1)-D_{\boldsymbol{\varphi}}(\boldsymbol{x}_0). Thus, the above loss function is approximately: \begin{equation} \frac{1}{2}\Vert\nabla_{\boldsymbol{x}_t} D_{\boldsymbol{\varphi}}(\boldsymbol{x}_t)\Vert^2 - D_{\boldsymbol{\varphi}}(\boldsymbol{x}_1) + D_{\boldsymbol{\varphi}}(\boldsymbol{x}_0) \end{equation} Readers familiar with GANs should find this very recognizable; it is exactly the discriminator loss function of WGAN with gradient penalty! Even the construction method for \boldsymbol{x}_t in the gradient penalty term [eq:line] is identical (linear interpolation between real and fake samples)! The only difference is that the gradient penalty in the original WGAN-GP is centered at 1, whereas here it is centered at zero. However, articles such as WGAN-div: A Little-Known WGAN Fixer and Optimization Algorithms from a Dynamical Perspective (IV): The Third Stage of GANs have already shown that zero-centered gradient penalties usually perform better.
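In fact, if t is sampled uniformly from [0,1] in the loss, the smoothness assumption is not even needed for the derivative term: by the fundamental theorem of calculus, its average over t recovers the endpoint difference exactly, \begin{equation} \mathbb{E}_{t\sim U[0,1]}\left[\frac{d D_{\boldsymbol{\varphi}}(\boldsymbol{x}_t)}{dt}\right] = \int_0^1 \frac{d D_{\boldsymbol{\varphi}}(\boldsymbol{x}_t)}{dt}\,dt = D_{\boldsymbol{\varphi}}(\boldsymbol{x}_1) - D_{\boldsymbol{\varphi}}(\boldsymbol{x}_0) \end{equation} so only the gradient-penalty term retains its dependence on the interpolated point \boldsymbol{x}_t.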
Therefore, under specific parameterization and assumptions, the loss function [eq:s-loss] is actually equivalent to the WGAN-GP discriminator loss. As for the generator loss, in the previous article Generative Diffusion Models Part 19: GANs as Diffusion ODEs, we already proved that when \boldsymbol{v}_{\boldsymbol{\varphi}}(\boldsymbol{x}_t)=\nabla_{\boldsymbol{x}_t} D_{\boldsymbol{\varphi}}(\boldsymbol{x}_t), the gradient of the single-step optimization in Eq. [eq:g-loss] is equivalent to the gradient of: \begin{equation} \boldsymbol{\theta}_{\tau+1} = \mathop{\text{argmin}}_{\boldsymbol{\theta}}\mathbb{E}_{\boldsymbol{z}\sim \mathcal{N}(\boldsymbol{0},\boldsymbol{I})}[-D_{\boldsymbol{\varphi}^*}(\boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z}))] \end{equation} which is exactly the generator loss of WGAN-GP.
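For concreteness, the pair of losses just derived can be written out numerically. The critic below is a toy quadratic whose input gradient is analytic (a stand-in for a network plus autograd); the function name and the choice of D are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical quadratic critic D(x) = -1/2 ||x - c||^2, with analytic gradient
c = np.array([1.0, 2.0])
D = lambda x: -0.5 * np.sum((x - c)**2, axis=-1)
grad_D = lambda x: -(x - c)

def wgan_gp_losses(x_real, x_fake, t):
    """Zero-centered WGAN-GP losses as recovered from the ReFlow objective."""
    # x_t per eq:line, with x0 = fake and x1 = real (real/fake interpolation)
    xt = (1.0 - t)[:, None] * x_fake + t[:, None] * x_real
    gp = 0.5 * np.mean(np.sum(grad_D(xt)**2, axis=-1))   # penalty centered at 0
    d_loss = gp - np.mean(D(x_real)) + np.mean(D(x_fake))
    g_loss = -np.mean(D(x_fake))
    return d_loss, g_loss

x_real = rng.standard_normal((128, 2)) + c
x_fake = rng.standard_normal((128, 2))
t = rng.uniform(size=128)
print(wgan_gp_losses(x_real, x_fake, t))
```

Note the penalty pushes \Vert\nabla D\Vert toward zero rather than toward one, matching the zero-centered variant discussed above; a real implementation would obtain grad_D via automatic differentiation.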
Summary
In this article, the author attempted to derive the connection between WGAN-GP and diffusion ODEs starting from ReFlow. This perspective is relatively simpler and more intuitive, and it avoids relatively complex concepts such as Wasserstein gradient flow.
Reprinting: Please include the original address of this article: https://kexue.fm/archives/9668
Further details on reprinting: Please refer to Scientific Space FAQ.