In the article "Generative Diffusion Models (16): Wasserstein Distance \le Score Matching", we derived an inequality between the Wasserstein distance and the score matching loss of diffusion models, indicating that the optimization objectives of diffusion models and WGANs are similar to some extent. In this article, we will explore the research findings in "MonoFlow: Rethinking Divergence GANs via the Perspective of Wasserstein Gradient Flows", which further demonstrates the connection between GANs and diffusion models: GANs can actually be viewed as diffusion ODEs in another time dimension!
These findings suggest that although GANs and diffusion models appear to be two distinct types of generative models on the surface, they actually share many similarities and can learn from and reference each other in many aspects.
Introduction to the Idea
We know that the generator trained by a GAN is a direct deterministic transformation \boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z}) from noise \boldsymbol{z} to a real sample. In contrast, the prominent feature of diffusion models is "progressive generation," where the generation process corresponds to sampling from a series of gradually changing distributions p_0(\boldsymbol{x}_0), p_1(\boldsymbol{x}_1), \dots, p_T(\boldsymbol{x}_T) (Note: In the previous dozen articles, \boldsymbol{x}_T was noise and \boldsymbol{x}_0 was the target sample, with the sampling process being \boldsymbol{x}_T \to \boldsymbol{x}_0. For the convenience of the following description, we reverse this to \boldsymbol{x}_0 \to \boldsymbol{x}_T). At first glance, there seem to be few similarities. How can we connect the two?
Clearly, if we want to understand GANs from the perspective of diffusion models, we must find a way to construct a series of gradually changing distributions. The generator \boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z}) itself is a one-step transformation without gradual change. However, we know that the optimization of the model is gradual. Can we use the historical trajectory of the parameters \boldsymbol{\theta}_t to construct this series of gradual distributions? Specifically, assume the generator is initialized with \boldsymbol{\theta}_0, and after T steps of adversarial training, we obtain the optimal parameters \boldsymbol{\theta}_T. The intermediate parameters during training are \boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \dots, \boldsymbol{\theta}_{T-1}. If we define \boldsymbol{x}_t = \boldsymbol{g}_{\boldsymbol{\theta}_t}(\boldsymbol{z}), haven’t we defined a series of gradually changing \boldsymbol{x}_0, \boldsymbol{x}_1, \dots, \boldsymbol{x}_T, and thus defined the gradual distributions p_0(\boldsymbol{x}_0), p_1(\boldsymbol{x}_1), \dots, p_T(\boldsymbol{x}_T)?
If this idea is feasible, then GANs can be interpreted as diffusion models in the (virtual) time dimension of gradient descent! Let us explore along this line of thought.
Gradient Flow
First, we need to revisit the results on Wasserstein gradient flow from the previous article "Gradient Flow: Exploring the Path to the Minimum". There we showed that the evolution \begin{equation} \frac{\partial q_t(\boldsymbol{x})}{\partial t} = - \nabla_{\boldsymbol{x}}\cdot\big(q_t(\boldsymbol{x})\nabla_{\boldsymbol{x}}\log r_t(\boldsymbol{x})\big) \label{eq:w-flow} \end{equation} where r_t(\boldsymbol{x})=\frac{p(\boldsymbol{x})}{q_t(\boldsymbol{x})}, minimizes the KL divergence KL\big(q_t(\boldsymbol{x})\,\Vert\, p(\boldsymbol{x})\big), so that \lim_{t\to\infty} q_t(\boldsymbol{x}) = p(\boldsymbol{x}). If p(\boldsymbol{x}) is the distribution of real samples, then being able to sample from q_t(\boldsymbol{x}) means being able to sample from p(\boldsymbol{x}) as t \to \infty. According to "Deriving the Continuity Equation and Fokker-Planck Equation via the Test Function Method", sampling from q_t(\boldsymbol{x}) can be achieved through the following ODE: \begin{equation} \frac{d\boldsymbol{x}}{dt} = \nabla_{\boldsymbol{x}}\log r_t(\boldsymbol{x}) \label{eq:ode-core} \end{equation} However, r_t(\boldsymbol{x}) in the above equation is unknown, so we cannot yet sample through it; we first need a way to estimate r_t(\boldsymbol{x}).
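As a sanity check, the flow [eq:ode-core] can be simulated with particles in one dimension. The sketch below is my own toy construction (not from the paper): it takes p = N(3, 1), starts the particles from N(0, 1), and sidesteps the unknown density ratio by fitting a Gaussian to the current particles so that \nabla_{\boldsymbol{x}}\log r_t has a closed form.

```python
import numpy as np

def wasserstein_flow_1d(mu_p=3.0, sigma_p=1.0, n=5000, steps=2000, eps=1e-2, seed=0):
    """Particle simulation of dx/dt = grad log r_t(x) with r_t = p/q_t,
    for p = N(mu_p, sigma_p^2). q_t is approximated by a Gaussian fitted
    to the current particles: a crude closed-form stand-in for the
    unknown density ratio."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)  # particles start from q_0 = N(0, 1)
    for _ in range(steps):
        mu_q, sigma_q = x.mean(), x.std() + 1e-8
        # grad log r_t(x) = grad log p(x) - grad log q_t(x)
        grad = -(x - mu_p) / sigma_p**2 + (x - mu_q) / sigma_q**2
        x = x + eps * grad  # one Euler step of the ODE
    return x

particles = wasserstein_flow_1d()
print(particles.mean(), particles.std())  # approaches (3.0, 1.0)
```

After enough steps the particle cloud matches p, which is exactly the \lim_{t\to\infty} q_t = p behaviour described above.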
Discriminative Estimation
This is where the GAN discriminator comes in. Taking the original Vanilla GAN as an example, its training objective is: \begin{equation} \max_D\, \mathbb{E}_{\boldsymbol{x}\sim p(\boldsymbol{x})}[\log \sigma(D(\boldsymbol{x}))] + \mathbb{E}_{\boldsymbol{x}\sim q(\boldsymbol{x})}[\log (1 - \sigma(D(\boldsymbol{x})))] \label{eq:gan-d} \end{equation} where D is the discriminator, \sigma(t)=1/(1+e^{-t}) is the sigmoid function, p(\boldsymbol{x}) is the distribution of real samples, and q(\boldsymbol{x}) is the distribution of fake samples. It can be proven (readers who are unclear can refer to the "Supplementary Proof" section in "RSGAN: The ’Turing Test’ Idea in Adversarial Models") that the theoretical optimal solution for the discriminator D in the above equation is: \begin{equation} D(\boldsymbol{x}) = \log \frac{p(\boldsymbol{x})}{q(\boldsymbol{x})} \end{equation} The results for the more generalized f-GAN (refer to "Introduction to f-GAN: The Production Workshop of GAN Models" and "Designing GANs: Another GAN Production Workshop") will be slightly different, but it can be proven that their theoretical optimal discriminators are all functions of \frac{p(\boldsymbol{x})}{q(\boldsymbol{x})}. In other words, as long as we can sample from p(\boldsymbol{x}) and q_t(\boldsymbol{x}), we can estimate r_t(\boldsymbol{x})=\frac{p(\boldsymbol{x})}{q_t(\boldsymbol{x})} by training the GAN discriminator using [eq:gan-d].
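A minimal numerical check of this optimality result: fit a linear discriminator D(x) = a x + b by gradient ascent on objective [eq:gan-d], with p = N(2, 1) as "real" and q = N(0, 1) as "fake". The theoretical optimum is D(x) = \log\frac{p(x)}{q(x)} = 2x - 2, so we expect a \approx 2 and b \approx -2. The linear parameterization and the specific Gaussians are my own toy choices.

```python
import numpy as np

def fit_discriminator(n=20000, steps=5000, lr=0.5, seed=0):
    """Fit a linear discriminator D(x) = a*x + b by full-batch gradient
    ascent on the vanilla GAN objective, with real samples from
    p = N(2, 1) and fake samples from q = N(0, 1)."""
    rng = np.random.default_rng(seed)
    xp = rng.normal(2.0, 1.0, n)  # x ~ p(x)
    xq = rng.normal(0.0, 1.0, n)  # x ~ q(x)
    a, b = 0.0, 0.0
    sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
    for _ in range(steps):
        sp, sq = sigmoid(a * xp + b), sigmoid(a * xq + b)
        # gradients of E_p[log sigma(D)] + E_q[log(1 - sigma(D))]
        a += lr * (((1 - sp) * xp).mean() - (sq * xq).mean())
        b += lr * ((1 - sp).mean() - sq.mean())
    return a, b

a, b = fit_discriminator()
print(a, b)  # close to 2.0 and -2.0, i.e. D(x) ~ log p(x)/q(x) = 2x - 2
```

Since the problem is convex logistic regression, plain gradient ascent suffices; the recovered D is (up to sampling error) the log density ratio.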
One Step Forward
At this point, some readers might be confused: haven't we fallen into a "chicken and egg" circularity? We estimate r_t(\boldsymbol{x}) precisely so that we can sample from q_t(\boldsymbol{x}) via equation [eq:ode-core], yet now we assume we can already sample from q_t(\boldsymbol{x}) in order to estimate r_t(\boldsymbol{x})? Don't worry, the masterstroke is coming.
Suppose we have a generator \boldsymbol{g}_{\boldsymbol{\theta}_t}(\boldsymbol{z}) whose samples are distributed exactly as q_t(\boldsymbol{x}), i.e., \begin{equation} \big\{\boldsymbol{g}_{\boldsymbol{\theta}_t}(\boldsymbol{z})\big|\,\boldsymbol{z}\sim \mathcal{N}(\boldsymbol{0},\boldsymbol{I})\big\}\quad = \quad\big\{\boldsymbol{x}_t\big|\,\boldsymbol{x}_t\sim q_t(\boldsymbol{x})\big\} \end{equation} Now we can use it together with equation [eq:gan-d] to estimate r_t(\boldsymbol{x}). Note that this gives us r_t(\boldsymbol{x}) only at time t; we do not know it at other times, so we cannot yet complete the full sampling process through equation [eq:ode-core]. However, we can push forward by one small step: \begin{equation} \boldsymbol{x}_{t+1} = \boldsymbol{x}_t + \epsilon \nabla_{\boldsymbol{x}_t}\log r_t(\boldsymbol{x}_t) = \boldsymbol{x}_t + \epsilon \nabla_{\boldsymbol{x}_t} D(\boldsymbol{x}_t) \label{eq:forward} \end{equation} where \epsilon is a small positive step size. This yields the samples of the next step, and we want them to remain equivalent to the samples of the generator at the next step, i.e., \begin{equation} \begin{aligned} \big\{\boldsymbol{g}_{\boldsymbol{\theta}_{t+1}}(\boldsymbol{z})\big|\,\boldsymbol{z}\sim \mathcal{N}(\boldsymbol{0},\boldsymbol{I})\big\}\quad =& \quad\big\{\boldsymbol{x}_{t+1}\big|\,\boldsymbol{x}_{t+1}\sim q_{t+1}(\boldsymbol{x})\big\} \\[5pt] \quad =& \quad\big\{\boldsymbol{x}_{t+1}\big|\,\boldsymbol{x}_{t+1} = \boldsymbol{x}_t + \epsilon \nabla_{\boldsymbol{x}_t} D(\boldsymbol{x}_t),\boldsymbol{x}_t\sim q_t(\boldsymbol{x})\big\} \end{aligned} \end{equation} In other words, we want to transform the movement of samples in the diffusion model into the movement of the generator's parameters!
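Equation [eq:forward] is just an explicit Euler step driven by the discriminator gradient. A toy sketch, with the closed-form optimal D(x) = 2x - 2 for p = N(2, 1) and q_t = N(0, 1) plugged in by assumption rather than trained:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5000)  # x_t ~ q_t = N(0, 1)

# For p = N(2, 1) and q_t = N(0, 1), the optimal D(x) = 2x - 2, so grad D(x) = 2.
eps = 0.05
x_next = x + eps * 2.0  # x_{t+1} = x_t + eps * grad D(x_t), eq. (forward)
print(x_next.mean() - x.mean())  # every particle drifts by eps * 2 = 0.1
```

Each particle takes one small step up the log density ratio, nudging q_t toward p; a full sampler would need r_t at every time step, which is exactly what the alternating scheme below supplies.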
To achieve this goal, we solve for \boldsymbol{\theta}_{t+1} using the following loss: \begin{equation} \boldsymbol{\theta}_{t+1} = \mathop{\text{argmin}}_{\boldsymbol{\theta}}\mathbb{E}_{\boldsymbol{z}\sim \mathcal{N}(\boldsymbol{0},\boldsymbol{I})}\Big[\big\Vert \boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z}) - \boldsymbol{g}_{\boldsymbol{\theta}_t}(\boldsymbol{z}) - \epsilon \nabla_{\boldsymbol{g}}D(\boldsymbol{g}_{\boldsymbol{\theta}_t}(\boldsymbol{z}))\big\Vert^2\Big] \label{eq:gan-g0} \end{equation} That is, we take \boldsymbol{x}_t = \boldsymbol{g}_{\boldsymbol{\theta}_t}(\boldsymbol{z}), iterate forward one step to get \boldsymbol{x}_{t+1}, and then hope the new \boldsymbol{g}_{\boldsymbol{\theta}_{t+1}}(\boldsymbol{z}) is as close to \boldsymbol{x}_{t+1} as possible. After completing this round, replace the original \boldsymbol{\theta}_t with \boldsymbol{\theta}_{t+1} and start a new round of iteration. Thus, equations [eq:gan-d] and [eq:gan-g0] are executed alternately. Doesn’t this smell like a GAN?
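The alternation can be sketched end to end with a toy linear generator g_\theta(z) = \theta_0 + \theta_1 z, for which solving [eq:gan-g0] is an exact least-squares problem; the trained discriminator gradient is replaced by the closed-form \nabla\log r_t for Gaussian p and q_t. All of these simplifications are mine, not the paper's.

```python
import numpy as np

def update_generator(theta, z, grad_D, eps):
    """One distillation step of eq. (gan-g0) for a toy linear generator
    g_theta(z) = theta[0] + theta[1]*z: move the current samples by
    eps * grad_D (eq. forward), then regress g_theta onto the moved
    samples by least squares."""
    x_t = theta[0] + theta[1] * z
    target = x_t + eps * grad_D(x_t)            # x_{t+1}
    A = np.stack([np.ones_like(z), z], axis=1)  # design matrix for [theta0, theta1]
    new_theta, *_ = np.linalg.lstsq(A, target, rcond=None)
    return new_theta

rng = np.random.default_rng(0)
z = rng.standard_normal(4000)
theta = np.array([0.0, 2.0])  # generator starts as N(0, 4)
for _ in range(300):
    mu, sigma = theta[0], abs(theta[1])
    # closed-form grad log r_t for p = N(2, 1) and the current Gaussian q_t,
    # standing in for a trained discriminator gradient grad D
    grad_D = lambda x, m=mu, s=sigma: -(x - 2.0) + (x - m) / s**2
    theta = update_generator(theta, z, grad_D, eps=0.05)
print(theta)  # approaches [2.0, 1.0], i.e. g pushes N(0,1) noise to N(2,1)
```

Each round "estimates" r_t (here analytically), moves the samples one Euler step, and re-fits the generator, which is exactly the alternation of [eq:gan-d] and [eq:gan-g0].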
The Finishing Touch
If this is not enough, we can continue to refine it to make it more consistent with GANs. Note that the gradient of the function inside the expectation in [eq:gan-g0] is: \begin{equation} \begin{aligned} & \nabla_{\boldsymbol{\theta}}\Vert \boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z}) - \boldsymbol{g}_{\boldsymbol{\theta}_t}(\boldsymbol{z}) - \epsilon \nabla_{\boldsymbol{g}}D(\boldsymbol{g}_{\boldsymbol{\theta}_t}(\boldsymbol{z}))\Vert^2 \\ =& 2\big\langle\boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z}) - \boldsymbol{g}_{\boldsymbol{\theta}_t}(\boldsymbol{z}) - \epsilon \nabla_{\boldsymbol{g}}D(\boldsymbol{g}_{\boldsymbol{\theta}_t}(\boldsymbol{z})), \nabla_{\boldsymbol{\theta}}\boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z}) \big\rangle \\ \end{aligned} \end{equation} Substituting the current value \boldsymbol{\theta}=\boldsymbol{\theta}_t, the result is: \begin{equation} -2\epsilon\big\langle \nabla_{\boldsymbol{g}}D(\boldsymbol{g}_{\boldsymbol{\theta}_t}(\boldsymbol{z})), \nabla_{\boldsymbol{\theta}_t}\boldsymbol{g}_{\boldsymbol{\theta}_t}(\boldsymbol{z}) \big\rangle = -2\epsilon\nabla_{\boldsymbol{\theta}_t}D(\boldsymbol{g}_{\boldsymbol{\theta}_t}(\boldsymbol{z})) \end{equation} In other words, if we only optimize one step using a gradient-based optimizer, then using [eq:gan-g0] as the loss function is equivalent to using the following as the loss function (since the gradients only differ by a constant factor): \begin{equation} \boldsymbol{\theta}_{t+1} = \mathop{\text{argmin}}_{\boldsymbol{\theta}}\mathbb{E}_{\boldsymbol{z}\sim \mathcal{N}(\boldsymbol{0},\boldsymbol{I})}[-D(\boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z}))] \label{eq:gan-g} \end{equation} This is one of the common generator losses. Alternating training of [eq:gan-d] and [eq:gan-g] is a common GAN variant.
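This gradient equivalence is easy to verify numerically: for a toy linear generator and any smooth D, the finite-difference gradients of loss [eq:gan-g0] and loss [eq:gan-g] at \boldsymbol{\theta}=\boldsymbol{\theta}_t should differ exactly by the factor 2\epsilon. The particular D and generator below are my own arbitrary choices.

```python
import numpy as np

eps = 1e-2
rng = np.random.default_rng(0)
z = rng.standard_normal(2000)
theta_t = np.array([0.5, 1.5])

g = lambda th: th[0] + th[1] * z  # toy linear generator g_theta(z)
D = lambda x: np.sin(x)           # any smooth discriminator works here
dD = lambda x: np.cos(x)

x_t = g(theta_t)
target = x_t + eps * dD(x_t)      # x_{t+1} from eq. (forward)

L1 = lambda th: np.mean((g(th) - target) ** 2)  # eq. (gan-g0)
L2 = lambda th: np.mean(-D(g(th)))              # eq. (gan-g)

def num_grad(f, th, h=1e-6):
    """Central finite-difference gradient of f at th."""
    gvec = np.zeros_like(th)
    for i in range(len(th)):
        e = np.zeros_like(th)
        e[i] = h
        gvec[i] = (f(th + e) - f(th - e)) / (2 * h)
    return gvec

g1, g2 = num_grad(L1, theta_t), num_grad(L2, theta_t)
print(g1 / g2)  # both components equal 2*eps = 0.02
```

So a single gradient step on either loss moves \boldsymbol{\theta} in the same direction, which is why the regression loss can be swapped for the familiar GAN generator loss.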
In particular, the original paper also proved that the generator’s loss function can be generalized to: \begin{equation} \boldsymbol{\theta}_{t+1} = \mathop{\text{argmin}}_{\boldsymbol{\theta}}\mathbb{E}_{\boldsymbol{z}\sim \mathcal{N}(\boldsymbol{0},\boldsymbol{I})}[-h(D(\boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z})))] \end{equation} where h(\cdot) is any monotonically increasing function. This also corresponds to the fact that \log r_t(\boldsymbol{x}) in the Wasserstein gradient flow [eq:w-flow] can be replaced by h(\log r_t(\boldsymbol{x})). This is likely the origin of the term MonoFlow (Monotonically increasing function + Wasserstein flow). We will not expand on this proof; interested readers can refer to the original paper.
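A quick numerical illustration of why a monotone h is harmless: by the chain rule, \nabla_{\boldsymbol{x}} h(D(\boldsymbol{x})) = h'(D(\boldsymbol{x}))\nabla_{\boldsymbol{x}} D(\boldsymbol{x}) with h' > 0, so the samples are pushed in the same direction, only with a position-dependent rescaling of the step size. The specific D and h below are toy choices of mine.

```python
import numpy as np

D  = lambda x: 2.0 * x - 2.0         # optimal D for p = N(2,1), q = N(0,1)
dD = lambda x: 2.0 + 0.0 * x         # its gradient
h  = np.tanh                         # an arbitrary monotonically increasing h
dh = lambda t: 1.0 - np.tanh(t)**2   # h' > 0 everywhere

x = np.linspace(-3.0, 3.0, 7)
g_plain = dD(x)              # sample velocity from the plain loss -D
g_mono  = dh(D(x)) * dD(x)   # chain rule: gradient of h(D(x))
print(np.sign(g_mono) == np.sign(g_plain))  # all True: same direction
```

Only the magnitude of the velocity field changes, consistent with replacing \log r_t by h(\log r_t) in the flow.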
Reflections on Significance
In summary, the idea of understanding GANs as diffusion models is: \begin{equation*} \begin{CD} \cdots @>>> \boldsymbol{g}_{\boldsymbol{\theta}_t}(\boldsymbol{z}) @>\text{Eq. \eqref{eq:gan-d}}>> r_t(\boldsymbol{x}) @>\text{Eq. \eqref{eq:forward}}>> \boldsymbol{x}_{t+1} @>\text{Eq. \eqref{eq:gan-g0}}>> \boldsymbol{g}_{\boldsymbol{\theta}_{t+1}}(\boldsymbol{z}) @>>> \cdots \end{CD} \end{equation*} The core equation is [eq:forward], which originates from the Wasserstein gradient flow equations [eq:w-flow] and [eq:ode-core], which we discussed in the previous article "Gradient Flow: Exploring the Path to the Minimum".
Some readers might ask: this perspective doesn’t seem to yield more than GANs already do, so why go to such great lengths to re-understand GANs? First, in the author’s view, understanding GANs from the perspective of diffusion models, or unifying diffusion models and GANs, is inherently interesting and fun. It doesn’t necessarily need to have a practical application; being interesting and fun is its greatest significance.
Secondly, as the authors mention in the original paper, the existing derivation process of GANs is inconsistent with their actual training process, whereas the diffusion perspective discussed here is consistent with the training process. That is to say, if the training process is the standard, the existing derivation of GANs is "wrong," and the diffusion perspective in this article is "correct." How should we understand this? Taking the GAN mentioned earlier as an example, the objectives of the discriminator and generator are: \begin{gather} \max_D\, \mathbb{E}_{\boldsymbol{x}\sim p(\boldsymbol{x})}[\log \sigma(D(\boldsymbol{x}))] + \mathbb{E}_{\boldsymbol{x}\sim q(\boldsymbol{x})}[\log (1 - \sigma(D(\boldsymbol{x})))] \\ \min_q\mathbb{E}_{\boldsymbol{x}\sim q(\boldsymbol{x})}[-D(\boldsymbol{x})] \end{gather} The usual proof method is to show that the optimal solution for D is \log\frac{p(\boldsymbol{x})}{q(\boldsymbol{x})}, then substitute it into the generator’s loss function, finding that it minimizes the KL divergence between q(\boldsymbol{x}) and p(\boldsymbol{x}), so the optimal solution is q(\boldsymbol{x})=p(\boldsymbol{x}). However, such a proof corresponds to a training method where, for any q(\boldsymbol{x}), the \max_D step is solved completely (the resulting D should be a function of q(x), or rather, a function of the generator parameters \boldsymbol{\theta}), and then the \min_q step is executed, rather than the alternating training actually used. The understanding based on diffusion models is designed to be alternating, so it is more consistent with the training process.
Overall, understanding GANs from the perspective of diffusion models is not just a new perspective but also one that is closer to the training process. For example, we can explain why the GAN generator cannot be trained for too many steps: because only during single-step optimization are equations [eq:gan-g] and [eq:gan-g0] equivalent. If a GAN is to undergo more steps of optimization, then equation [eq:gan-g0] should be used as the loss function. In fact, equation [eq:gan-g0] is equivalent to the KL\left(q(\boldsymbol{x})\,\Vert\, q^{o}(\boldsymbol{x})\right) term proposed by the author previously in "Unified Understanding of Generative Models (VAE, GAN, AAE, ALI) via Variational Inference", which ensures the "inheritance" of the generator rather than just "innovation."
Summary
This article introduced MonoFlow, which demonstrates that GANs can be understood as diffusion ODEs in another time dimension, thereby establishing a new perspective for understanding GANs based on diffusion models. In particular, this is a perspective that is closer to the training process than the conventional derivation of GANs.
When reposting, please include the original address of this article: https://kexue.fm/archives/9662
For more detailed reposting matters, please refer to: "Scientific Space FAQ"