Generative Diffusion Models (6): The ODE Perspective of the General Framework · English (unofficial) translations of posts at kexue.fm

In the previous article "Generative Diffusion Models (5): The SDE Perspective of the General Framework", we gave a basic introduction to, and derivation of, Dr. Yang Song’s paper "Score-Based Generative Modeling through Stochastic Differential Equations". However, as its title suggests, that post mainly covered the SDE-related parts of the paper and left out the part known as the "Probability Flow ODE." This article fills that gap.

In fact, this remaining content occupies only a small section of the original paper's main body. It nevertheless needs its own article because, after much consideration, I concluded that the derivation of this result cannot bypass the Fokker-Planck equation. We therefore need some space to introduce the Fokker-Planck equation before the protagonist, the ODE, can take the stage.

Reflecting Again

Let us briefly summarize the content of the previous article. First, we defined a forward process ("demolishing a building") via an SDE: d\boldsymbol{x} = \boldsymbol{f}_t(\boldsymbol{x}) dt + g_t d\boldsymbol{w}\label{eq:sde-forward} Then, we derived the corresponding reverse process SDE ("constructing a building"): d\boldsymbol{x} = \left[\boldsymbol{f}_t(\boldsymbol{x}) - g_t^2\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x}) \right] dt + g_t d\boldsymbol{w}\label{eq:sde-reverse} Finally, we derived the loss function (score matching) for estimating \nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x}) using a neural network \boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}, t): \mathbb{E}_{\boldsymbol{x}_0,\boldsymbol{x}_t \sim p(\boldsymbol{x}_t|\boldsymbol{x}_0)\tilde{p}(\boldsymbol{x}_0)}\left[\left\Vert \boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) - \nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{x}_t|\boldsymbol{x}_0)\right\Vert^2\right] With this, we completed the general framework for training and prediction of diffusion models. It can be said to be a very general extension of DDPM. However, just as DDIM was introduced in "Generative Diffusion Models (4): DDIM = High-level DDPM" as a high-level reflection of DDPM, does SDE, as an extension of DDPM, have a corresponding "high-level reflection"? Yes, and the result is the subject of this article: the "Probability Flow ODE."

The Dirac Delta Function

What reflection did DDIM perform? Simply put, DDIM discovered that the training objective of DDPM is mainly related to p(\boldsymbol{x}_t|\boldsymbol{x}_0) and unrelated to p(\boldsymbol{x}_t|\boldsymbol{x}_{t-1}). Therefore, it took p(\boldsymbol{x}_t|\boldsymbol{x}_0) as a starting point to derive more general forms of p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t,\boldsymbol{x}_0) and p(\boldsymbol{x}_t|\boldsymbol{x}_{t-1},\boldsymbol{x}_0). The reflection performed by the Probability Flow ODE is similar: it seeks to know which different p(\boldsymbol{x}_{t+\Delta t}|\boldsymbol{x}_t) (or different forward SDEs) can be found for a fixed p(\boldsymbol{x}_t) within the SDE framework.

We first write the discrete form of the forward process [eq:sde-forward]: \boldsymbol{x}_{t+\Delta t} = \boldsymbol{x}_t + \boldsymbol{f}_t(\boldsymbol{x}_t) \Delta t + g_t \sqrt{\Delta t}\boldsymbol{\varepsilon},\quad \boldsymbol{\varepsilon}\sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})\label{eq:sde-discrete} This equation describes the relationship between the random variables \boldsymbol{x}_{t+\Delta t}, \boldsymbol{x}_t, \boldsymbol{\varepsilon}. We could easily take the expectation of both sides; however, we do not want the expectation, but rather the relationship satisfied by the distribution p_t(\boldsymbol{x}). How do we convert a distribution into an expectation form? The answer is the Dirac Delta function: p(\boldsymbol{x}) = \int \delta(\boldsymbol{x} - \boldsymbol{y}) p(\boldsymbol{y}) d\boldsymbol{y} = \mathbb{E}_{\boldsymbol{y}}[\delta(\boldsymbol{x} - \boldsymbol{y})] Strictly speaking, the Dirac function belongs to the realm of functional analysis, but we usually treat it as an ordinary function, which generally yields correct results. From the above equation, we also know that for any f(\boldsymbol{x}), the following holds: p(\boldsymbol{x})f(\boldsymbol{x}) = \int \delta(\boldsymbol{x} - \boldsymbol{y}) p(\boldsymbol{y})f(\boldsymbol{y}) d\boldsymbol{y} = \mathbb{E}_{\boldsymbol{y}}[\delta(\boldsymbol{x} - \boldsymbol{y}) f(\boldsymbol{y})] Taking the partial derivative of both sides with respect to \boldsymbol{x}, we get: \nabla_{\boldsymbol{x}}[p(\boldsymbol{x}) f(\boldsymbol{x})] = \mathbb{E}_{\boldsymbol{y}}\left[\nabla_{\boldsymbol{x}}\delta(\boldsymbol{x} - \boldsymbol{y}) f(\boldsymbol{y})\right] = \mathbb{E}_{\boldsymbol{y}}\left[f(\boldsymbol{y})\nabla_{\boldsymbol{x}}\delta(\boldsymbol{x} - \boldsymbol{y})\right] This is one of the properties we will use later. It essentially shows that the derivative of the Dirac function can be transferred to the function it multiplies through integration.
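As a quick numerical sanity check of the two identities above, the following sketch replaces the Dirac delta with a narrow Gaussian and estimates the expectations by Monte Carlo. The kernel width eps, the choice p = N(0, 1), the test function f(y) = y^2, and all variable names are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def delta_eps(u, eps):
    """Narrow Gaussian used as a stand-in for the Dirac delta (assumed width eps)."""
    return np.exp(-0.5 * (u / eps) ** 2) / (eps * np.sqrt(2.0 * np.pi))

# Samples y ~ p(y) = N(0, 1) and a few evaluation points x.
y = rng.standard_normal(200_000)
x = np.linspace(-2.0, 2.0, 9)

# kernel[i, j] approximates delta(x_i - y_j).
kernel = delta_eps(x[:, None] - y[None, :], eps=0.05)

# p(x) = E_y[delta(x - y)]
p_est = kernel.mean(axis=1)
p_true = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

# p(x) f(x) = E_y[delta(x - y) f(y)] with f(y) = y^2
pf_est = (kernel * y[None, :] ** 2).mean(axis=1)
pf_true = p_true * x**2

max_err_p = np.abs(p_est - p_true).max()
max_err_pf = np.abs(pf_est - pf_true).max()
print(max_err_p, max_err_pf)
```

Both errors should be small and shrink further as eps decreases and the sample count grows, which is the sense in which the delta function turns a density into an expectation.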

The Fokker-Planck Equation

With the above preparation, we now use equation [eq:sde-discrete] to write: \begin{aligned} &\,\delta(\boldsymbol{x} - \boldsymbol{x}_{t+\Delta t}) \\[5pt] =&\, \delta(\boldsymbol{x} - \boldsymbol{x}_t - \boldsymbol{f}_t(\boldsymbol{x}_t) \Delta t - g_t \sqrt{\Delta t}\boldsymbol{\varepsilon}) \\ \approx&\, \delta(\boldsymbol{x} - \boldsymbol{x}_t) - \left(\boldsymbol{f}_t(\boldsymbol{x}_t) \Delta t + g_t \sqrt{\Delta t}\boldsymbol{\varepsilon}\right)\cdot \nabla_{\boldsymbol{x}}\delta(\boldsymbol{x} - \boldsymbol{x}_t) + \frac{1}{2} \left(g_t\sqrt{\Delta t}\boldsymbol{\varepsilon}\cdot \nabla_{\boldsymbol{x}}\right)^2\delta(\boldsymbol{x} - \boldsymbol{x}_t) \end{aligned} Here, we performed a Taylor expansion on \delta(\cdot) as if it were an ordinary function, retaining only terms up to \mathcal{O}(\Delta t). Now we take the expectation of both sides: \begin{aligned} &\,p_{t+\Delta t}(\boldsymbol{x}) \\[6pt] =&\, \mathbb{E}_{\boldsymbol{x}_{t+\Delta t}}\left[\delta(\boldsymbol{x} - \boldsymbol{x}_{t+\Delta t})\right] \\ \approx&\, \mathbb{E}_{\boldsymbol{x}_t, \boldsymbol{\varepsilon}}\left[\delta(\boldsymbol{x} - \boldsymbol{x}_t) - \left(\boldsymbol{f}_t(\boldsymbol{x}_t) \Delta t + g_t \sqrt{\Delta t}\boldsymbol{\varepsilon}\right)\cdot \nabla_{\boldsymbol{x}}\delta(\boldsymbol{x} - \boldsymbol{x}_t) + \frac{1}{2} \left(g_t\sqrt{\Delta t}\boldsymbol{\varepsilon}\cdot \nabla_{\boldsymbol{x}}\right)^2\delta(\boldsymbol{x} - \boldsymbol{x}_t)\right] \\ =&\, \mathbb{E}_{\boldsymbol{x}_t}\left[\delta(\boldsymbol{x} - \boldsymbol{x}_t) - \boldsymbol{f}_t(\boldsymbol{x}_t) \Delta t\cdot \nabla_{\boldsymbol{x}}\delta(\boldsymbol{x} - \boldsymbol{x}_t) + \frac{1}{2} g_t^2\Delta t \nabla_{\boldsymbol{x}}\cdot \nabla_{\boldsymbol{x}}\delta(\boldsymbol{x} - \boldsymbol{x}_t)\right] \\ =&\, p_t(\boldsymbol{x}) - \nabla_{\boldsymbol{x}}\cdot\left[\boldsymbol{f}_t(\boldsymbol{x})\Delta t\, p_t(\boldsymbol{x})\right] + \frac{1}{2}g_t^2\Delta t \nabla_{\boldsymbol{x}}\cdot\nabla_{\boldsymbol{x}}p_t(\boldsymbol{x}) \end{aligned} The third line averages out \boldsymbol{\varepsilon}, which is independent of \boldsymbol{x}_t, using \mathbb{E}[\boldsymbol{\varepsilon}] = \boldsymbol{0} and \mathbb{E}[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^{\top}] = \boldsymbol{I}; the last line applies the Dirac function property from the previous section. Dividing both sides by \Delta t and taking the limit \Delta t \to 0, we obtain: \frac{\partial}{\partial t} p_t(\boldsymbol{x}) = - \nabla_{\boldsymbol{x}}\cdot\left[\boldsymbol{f}_t(\boldsymbol{x}) p_t(\boldsymbol{x})\right] + \frac{1}{2}g_t^2 \nabla_{\boldsymbol{x}}\cdot\nabla_{\boldsymbol{x}}p_t(\boldsymbol{x})\label{eq:fp} This is the "F-P equation" (Fokker-Planck equation) corresponding to equation [eq:sde-forward]. It is a partial differential equation describing the marginal distribution.
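To make the F-P equation concrete, here is a minimal finite-difference check on a one-dimensional toy case. Everything specific in it is an assumption chosen for illustration: the drift f_t(x) = -x/2, the constant g_t = 1, and a Gaussian initial distribution N(m0, v0), for which the marginal p_t stays Gaussian with the closed-form mean and variance used below.

```python
import numpy as np

# Assumed toy setup: dx = -x/2 dt + dw with x_0 ~ N(m0, v0).
m0, v0 = 2.0, 0.25

def p(x, t):
    """Closed-form Gaussian marginal p_t(x) of the toy SDE."""
    m = m0 * np.exp(-0.5 * t)            # mean solves dm/dt = -m/2
    v = 1.0 + (v0 - 1.0) * np.exp(-t)    # variance solves dv/dt = -v + 1
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2.0 * np.pi * v)

x = np.linspace(-4.0, 6.0, 2001)
dx = x[1] - x[0]
t, dt = 0.7, 1e-4

# Left side: time derivative of p_t(x) by central differences.
lhs = (p(x, t + dt) - p(x, t - dt)) / (2.0 * dt)

# Right side: -d/dx [f p] + (1/2) g^2 d^2 p / dx^2 with f = -x/2, g = 1.
div_fp = np.gradient(-0.5 * x * p(x, t), dx)
lap_p = np.gradient(np.gradient(p(x, t), dx), dx)
rhs = -div_fp + 0.5 * lap_p

# Skip the grid edges, where np.gradient falls back to one-sided differences.
residual = np.abs(lhs - rhs)[2:-2].max()
print(residual)
```

The residual is limited only by the finite-difference step sizes, so it should be several orders of magnitude smaller than the individual terms.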

Equivalent Transformation

There is no need to worry about the partial differential equation, as we do not intend to study how to solve it. We are merely using it to guide an equivalent transformation. For any function \sigma_t satisfying \sigma_t^2 \leq g_t^2, the F-P equation [eq:fp] is completely equivalent to: \begin{aligned} \frac{\partial}{\partial t} p_t(\boldsymbol{x}) =&\, - \nabla_{\boldsymbol{x}}\cdot\left[\boldsymbol{f}_t(\boldsymbol{x})p_t(\boldsymbol{x}) - \frac{1}{2}(g_t^2 - \sigma_t^2)\nabla_{\boldsymbol{x}}p_t(\boldsymbol{x})\right] + \frac{1}{2}\sigma_t^2 \nabla_{\boldsymbol{x}}\cdot\nabla_{\boldsymbol{x}}p_t(\boldsymbol{x}) \\ =&\, - \nabla_{\boldsymbol{x}}\cdot\left[\left(\boldsymbol{f}_t(\boldsymbol{x}) - \frac{1}{2}(g_t^2 - \sigma_t^2)\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x})\right)p_t(\boldsymbol{x})\right] + \frac{1}{2}\sigma_t^2 \nabla_{\boldsymbol{x}}\cdot\nabla_{\boldsymbol{x}}p_t(\boldsymbol{x}) \end{aligned}\label{eq:fp-2} Formally, this F-P equation is equivalent to the original F-P equation with \boldsymbol{f}_t(\boldsymbol{x}) replaced by \boldsymbol{f}_t(\boldsymbol{x}) - \frac{1}{2}(g_t^2 - \sigma_t^2)\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x}) and g_t replaced by \sigma_t. Just as equation [eq:fp] corresponds to equation [eq:sde-forward], the above equation corresponds to: d\boldsymbol{x} = \left(\boldsymbol{f}_t(\boldsymbol{x}) - \frac{1}{2}(g_t^2 - \sigma_t^2)\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x})\right) dt + \sigma_t d\boldsymbol{w}\label{eq:sde-forward-2} But do not forget that equation [eq:fp] and equation [eq:fp-2] are completely equivalent. This means that the marginal distributions p_t(\boldsymbol{x}) corresponding to the two stochastic differential equations [eq:sde-forward] and [eq:sde-forward-2] are identical! This result tells us that there exist forward processes with different variances that produce the same marginal distributions. 
This result is an upgraded version of DDIM; later, we will also prove that when \boldsymbol{f}_t(\boldsymbol{x}) is a linear function of \boldsymbol{x}, it is completely equivalent to DDIM.
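The claim that different \sigma_t give the same marginals can be checked by simulation on a toy problem where the score is known in closed form. The sketch below is only an illustration under stated assumptions: a one-dimensional drift f_t(x) = -x/2, constant g_t = 1, Gaussian data N(m0, v0), a plain Euler-Maruyama discretization, and hypothetical helper names (mean_var, score, simulate).

```python
import numpy as np

# Assumed toy setup: dx = -x/2 dt + dw with x_0 ~ N(m0, v0).
m0, v0 = 2.0, 0.25

def mean_var(t):
    """Closed-form mean and variance of the marginal p_t."""
    return m0 * np.exp(-0.5 * t), 1.0 + (v0 - 1.0) * np.exp(-t)

def score(x, t):
    """Exact score of the Gaussian marginal p_t."""
    m, v = mean_var(t)
    return -(x - m) / v

def simulate(sigma, T=1.0, n_steps=200, n_samples=50_000, seed=0):
    """Euler-Maruyama for dx = (f - (g^2 - sigma^2)/2 * score) dt + sigma dw."""
    rng = np.random.default_rng(seed)
    x = m0 + np.sqrt(v0) * rng.standard_normal(n_samples)
    dt = T / n_steps
    for k in range(n_steps):
        t = k * dt
        drift = -0.5 * x - 0.5 * (1.0 - sigma**2) * score(x, t)
        x = x + drift * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_samples)
    return x

# sigma = 1 is the original SDE, sigma = 0 is the noise-free limit.
results = {sigma: simulate(sigma) for sigma in (1.0, 0.5, 0.0)}
m_T, v_T = mean_var(1.0)
for sigma, x_T in results.items():
    print(sigma, x_T.mean() - m_T, x_T.var() - v_T)
```

Up to discretization and sampling error, all three runs should end with the same mean and variance, even though individual trajectories behave very differently.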

In particular, based on the SDE results from the previous post, we can write the reverse SDE corresponding to equation [eq:sde-forward-2]: d\boldsymbol{x} = \left(\boldsymbol{f}_t(\boldsymbol{x}) - \frac{1}{2}(g_t^2 + \sigma_t^2)\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x})\right) dt + \sigma_t d\boldsymbol{w}\label{eq:sde-reverse-2}
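For readers who want the intermediate step, the reverse drift can be obtained by applying the reverse-SDE rule [eq:sde-reverse] to the new forward process. The symbol \boldsymbol{F}_t below is introduced here only as shorthand for the drift of [eq:sde-forward-2]; it does not appear in the original paper.

```latex
% F_t is shorthand (introduced for this sketch) for the drift of the new forward SDE.
\begin{aligned}
\boldsymbol{F}_t(\boldsymbol{x}) &= \boldsymbol{f}_t(\boldsymbol{x}) - \frac{1}{2}(g_t^2 - \sigma_t^2)\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x}) \\
\boldsymbol{F}_t(\boldsymbol{x}) - \sigma_t^2\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x})
  &= \boldsymbol{f}_t(\boldsymbol{x}) - \frac{1}{2}(g_t^2 - \sigma_t^2 + 2\sigma_t^2)\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x}) \\
  &= \boldsymbol{f}_t(\boldsymbol{x}) - \frac{1}{2}(g_t^2 + \sigma_t^2)\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x})
\end{aligned}
```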

Neural ODE

Equation [eq:sde-forward-2] allows us to change the variance of the sampling process. Here, we specifically consider the extreme case where \sigma_t = 0, at which point the SDE degenerates into an ODE (Ordinary Differential Equation): d\boldsymbol{x} = \left(\boldsymbol{f}_t(\boldsymbol{x}) - \frac{1}{2}g_t^2\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x})\right) dt\label{eq:flow-ode} This ODE is called the "Probability Flow ODE." Since in practice \nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x}) needs to be approximated by a neural network \boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}, t), the above equation also corresponds to a "Neural ODE."

Why study the case where the variance is 0? Because at this point, the propagation process carries no noise, and the transformation from \boldsymbol{x}_0 to \boldsymbol{x}_T is deterministic. Thus, we can obtain the inverse transformation from \boldsymbol{x}_T to \boldsymbol{x}_0 by directly solving the ODE in reverse, which is also a deterministic transformation (substituting \sigma_t=0 into equation [eq:sde-reverse-2] also reveals that the forward and reverse equations are the same). This process is consistent with flow models (i.e., transforming noise into samples through an invertible transformation). Therefore, the Probability Flow ODE allows us to relate the results of diffusion models to those of flow models. For example, the original paper mentions that the Probability Flow ODE allows for exact likelihood calculation and obtaining latent representations, which are essentially the benefits of flow models. Due to the invertibility of flow models, it also allows us to perform image editing and other operations in the latent space.
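The invertibility described above can be illustrated on a toy problem. The sketch below assumes a one-dimensional drift f_t(x) = -x/2, constant g_t = 1, Gaussian data N(m0, v0) with an exactly known score, and a hand-written classical Runge-Kutta integrator; none of these choices come from the original paper. For this linear-Gaussian case the exact ODE map is known, since (x - m_t)/\sqrt{v_t} is conserved along each trajectory, so the result can be checked.

```python
import numpy as np

# Assumed toy setup: f_t(x) = -x/2, g_t = 1, x_0 ~ N(m0, v0).
m0, v0 = 2.0, 0.25

def mean_var(t):
    """Closed-form mean and variance of the marginal p_t."""
    return m0 * np.exp(-0.5 * t), 1.0 + (v0 - 1.0) * np.exp(-t)

def velocity(x, t):
    """Right-hand side of the ODE: f_t(x) - (1/2) g_t^2 * score."""
    m, v = mean_var(t)
    score = -(x - m) / v
    return -0.5 * x - 0.5 * score

def rk4(x, t_start, t_end, n_steps=100):
    """Classical RK4; the step h is negative when integrating backward."""
    h = (t_end - t_start) / n_steps
    t = t_start
    for _ in range(n_steps):
        k1 = velocity(x, t)
        k2 = velocity(x + 0.5 * h * k1, t + 0.5 * h)
        k3 = velocity(x + 0.5 * h * k2, t + 0.5 * h)
        k4 = velocity(x + h * k3, t + h)
        x = x + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        t = t + h
    return x

rng = np.random.default_rng(0)
x0 = m0 + np.sqrt(v0) * rng.standard_normal(5)

xT = rk4(x0, 0.0, 1.0)       # data -> latent
x0_rec = rk4(xT, 1.0, 0.0)   # latent -> data, same ODE run backward

mT, vT = mean_var(1.0)
xT_exact = mT + np.sqrt(vT / v0) * (x0 - m0)   # exact map for this toy case
round_trip_err = np.abs(x0_rec - x0).max()
map_err = np.abs(xT - xT_exact).max()
print(round_trip_err, map_err)
```

The same velocity function is used in both directions; only the sign of the step changes, which is what makes the encoding deterministic and invertible.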

On the other hand, the transformation from \boldsymbol{x}_T to \boldsymbol{x}_0 is described by an ODE, which means we can use various high-order ODE numerical algorithms to accelerate the transformation process from \boldsymbol{x}_T to \boldsymbol{x}_0. Of course, in principle, there are also acceleration methods for solving SDEs, but research on SDE acceleration is far less mature and deep than that for ODEs. Overall, compared to SDEs, ODEs appear much simpler and more direct in both theoretical analysis and practical solving.
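As a small illustration of why higher-order solvers help, the sketch below integrates a toy Probability Flow ODE backward from t = 1 to t = 0 with Euler and with classical RK4, giving both the same budget of 40 velocity evaluations. The toy setup (f_t(x) = -x/2, g_t = 1, Gaussian data with an exact score) and the helper names are assumptions for illustration; a real model would call the network \boldsymbol{s}_{\boldsymbol{\theta}} instead.

```python
import numpy as np

# Assumed toy setup: f_t(x) = -x/2, g_t = 1, x_0 ~ N(m0, v0).
m0, v0 = 2.0, 0.25

def mean_var(t):
    return m0 * np.exp(-0.5 * t), 1.0 + (v0 - 1.0) * np.exp(-t)

def velocity(x, t):
    """f_t(x) - (1/2) g_t^2 * score, with the exact Gaussian score."""
    m, v = mean_var(t)
    return -0.5 * x + 0.5 * (x - m) / v

def euler_step(x, t, h):
    return x + h * velocity(x, t)          # 1 evaluation per step

def rk4_step(x, t, h):
    k1 = velocity(x, t)                    # 4 evaluations per step
    k2 = velocity(x + 0.5 * h * k1, t + 0.5 * h)
    k3 = velocity(x + 0.5 * h * k2, t + 0.5 * h)
    k4 = velocity(x + h * k3, t + h)
    return x + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def solve(step, x, t_start, t_end, n_steps):
    h = (t_end - t_start) / n_steps
    t = t_start
    for _ in range(n_steps):
        x = step(x, t, h)
        t += h
    return x

T = 1.0
mT, vT = mean_var(T)
xT = mT + np.sqrt(vT) * np.linspace(-2.0, 2.0, 5)
x0_exact = m0 + np.sqrt(v0 / vT) * (xT - mT)   # exact answer for this toy case

err_euler = np.abs(solve(euler_step, xT, T, 0.0, 40) - x0_exact).max()
err_rk4 = np.abs(solve(rk4_step, xT, T, 0.0, 10) - x0_exact).max()
print(err_euler, err_rk4)
```

With equal cost, the RK4 error should come out far smaller than the Euler error on this smooth problem; how much of that advantage carries over to a learned score is an empirical question.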

Reviewing DDIM

At the end of "Generative Diffusion Models (4): DDIM = High-level DDPM", we derived that the continuous version of DDIM corresponds to the ODE: \frac{d}{ds}\left(\frac{\boldsymbol{x}(s)}{\bar{\alpha}(s)}\right) = \boldsymbol{\epsilon}_{\boldsymbol{\theta}}\left(\boldsymbol{x}(s), t(s)\right)\frac{d}{ds}\left(\frac{\bar{\beta}(s)}{\bar{\alpha}(s)}\right)\label{eq:ddim-ode} We now show that this result is actually a special case of equation [eq:flow-ode] when \boldsymbol{f}_t(\boldsymbol{x}) is a linear function f_t \boldsymbol{x}. At the end of "Generative Diffusion Models (5): The SDE Perspective of the General Framework", we derived the corresponding relationships: \left\{\begin{aligned} &f_t = \frac{1}{\bar{\alpha}_t}\frac{d\bar{\alpha}_t}{dt} \\ &g_t^2 = 2\bar{\alpha}_t \bar{\beta}_t \frac{d}{dt}\left(\frac{\bar{\beta}_t}{\bar{\alpha}_t}\right) \\ &\boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}, t) = -\frac{\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}, t)}{\bar{\beta}_t} \end{aligned}\right. Substituting these relationships into equation [eq:flow-ode] [replacing \nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x}) with \boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}, t)] and rearranging, we get: \frac{1}{\bar{\alpha}_t}\frac{d\boldsymbol{x}}{dt} - \frac{\boldsymbol{x}}{\bar{\alpha}_t^2}\frac{d\bar{\alpha}_t}{dt} = \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}, t)\frac{d}{dt}\left(\frac{\bar{\beta}_t}{\bar{\alpha}_t}\right) By the quotient rule, the left side is exactly \frac{d}{dt}\left(\frac{\boldsymbol{x}}{\bar{\alpha}_t}\right), so the equation is completely equivalent to equation [eq:ddim-ode], with t playing the role of the parameter s.
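The substitution and rearrangement can be spelled out line by line as follows. This is only a worked version of the step described above, using the same symbols and the three correspondence relations as given.

```latex
\begin{aligned}
\frac{d\boldsymbol{x}}{dt}
  &= f_t\,\boldsymbol{x} - \frac{1}{2}g_t^2\,\boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}, t) \\
  &= \frac{1}{\bar{\alpha}_t}\frac{d\bar{\alpha}_t}{dt}\boldsymbol{x}
     + \bar{\alpha}_t\bar{\beta}_t\frac{d}{dt}\left(\frac{\bar{\beta}_t}{\bar{\alpha}_t}\right)
       \frac{\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}, t)}{\bar{\beta}_t} \\
  &= \frac{1}{\bar{\alpha}_t}\frac{d\bar{\alpha}_t}{dt}\boldsymbol{x}
     + \bar{\alpha}_t\frac{d}{dt}\left(\frac{\bar{\beta}_t}{\bar{\alpha}_t}\right)
       \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}, t) \\
\frac{d}{dt}\left(\frac{\boldsymbol{x}}{\bar{\alpha}_t}\right)
  &= \frac{1}{\bar{\alpha}_t}\frac{d\boldsymbol{x}}{dt}
     - \frac{\boldsymbol{x}}{\bar{\alpha}_t^2}\frac{d\bar{\alpha}_t}{dt}
   = \frac{d}{dt}\left(\frac{\bar{\beta}_t}{\bar{\alpha}_t}\right)
     \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}, t)
\end{aligned}
```

The last line follows by dividing the third line by \bar{\alpha}_t and moving the term proportional to \boldsymbol{x} to the left.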

Summary

Building on the SDE article, this post used the F-P equation to derive a more generalized forward equation, which in turn led to the "Probability Flow ODE," and proved that DDIM is a special case of it.

Reprinting: Please include the original address of this article: https://kexue.fm/archives/9228
