Generative Diffusion Models (12): ``Directly Tackling'' the Diffusion ODE · English (unofficial) translations of posts at kexue.fm

In Generative Diffusion Models (5): The SDE Perspective, we understood generative diffusion models from the perspective of Stochastic Differential Equations (SDEs). Then, in Generative Diffusion Models (6): The ODE Perspective, we learned that the diffusion model corresponding to an SDE actually implies an underlying ODE model. Coincidentally, in Generative Diffusion Models (4): DDIM = High-Level DDPM, we also saw that the originally stochastic DDPM sampling process contains a deterministic sampling process called DDIM, whose continuous limit is also an ODE.

Reflecting on these processes, one can see that whether it is “DDPM \to DDIM” or “SDE \to ODE”, the transition is always from a stochastic sampling model to a deterministic one. However, if our goal from the very beginning was an ODE, this path seems somewhat “circuitous.” In this article, the author attempts to provide a direct derivation of the ODE diffusion model and reveals its connections to the Jacobian determinant, the heat equation, and other concepts.

Differential Equations

Generative models like GANs essentially aim to find a deterministic transformation that can map random variables sampled from a simple distribution (such as a standard normal distribution) into samples from a specific data distribution. Flow models are also a type of generative model; their approach is the reverse: first find an invertible transformation that maps the data distribution to a simple distribution, and then solve the corresponding inverse transformation to obtain a generative model.

Traditional flow models achieve this invertible transformation by designing sophisticated coupling layers (refer to the “Long Flow of Water” series). However, it was later realized that this transformation can also be achieved through differential equations, which is theoretically quite elegant. Research on building generative models based on “Neural Networks + Differential Equations” constitutes a subfield known as “Neural ODEs.”

Consider a first-order (ordinary) differential equation (system) on \boldsymbol{x}_t \in \mathbb{R}^d: \frac{d\boldsymbol{x}_t}{dt}=\boldsymbol{f}_t(\boldsymbol{x}_t) \label{eq:ode} Assuming t \in [0, T], given \boldsymbol{x}_0, we can (under relatively easy-to-satisfy conditions) deterministically solve for \boldsymbol{x}_T. In other words, this differential equation describes a transformation from \boldsymbol{x}_0 to \boldsymbol{x}_T. Notably, this transformation is invertible; one can solve the differential equation in reverse to obtain the transformation from \boldsymbol{x}_T to \boldsymbol{x}_0. Thus, differential equations themselves are a theoretically elegant scheme for constructing invertible transformations.

Jacobian Determinant

As with previous diffusion models, in this article, we treat \boldsymbol{x}_0 as a data sample and \boldsymbol{x}_T as a sample from a simple distribution. We hope to use a differential equation to achieve the transformation from the data distribution to the simple distribution.

First, let’s understand the differential equation [eq:ode] from a discretization perspective: \boldsymbol{x}_{t+\Delta t} - \boldsymbol{x}_t = \boldsymbol{f}_t(\boldsymbol{x}_t)\Delta t \label{eq:ode-diff} Since this is a deterministic transformation, we have: p_t(\boldsymbol{x}_t) d\boldsymbol{x}_t = p_{t+\Delta t}(\boldsymbol{x}_{t+\Delta t}) d\boldsymbol{x}_{t+\Delta t} = p_{t+\Delta t}(\boldsymbol{x}_{t+\Delta t}) \left| \frac{\partial \boldsymbol{x}_{t+\Delta t}}{\partial \boldsymbol{x}_t} \right| d\boldsymbol{x}_t Here \frac{\partial \boldsymbol{x}_{t+\Delta t}}{\partial \boldsymbol{x}_t} represents the Jacobian matrix of the transformation, and |\cdot| represents the absolute value of the determinant. By taking the partial derivative of both sides of Eq. [eq:ode-diff], we get: \frac{\partial \boldsymbol{x}_{t+\Delta t}}{\partial \boldsymbol{x}_t} = \boldsymbol{I} + \frac{\partial \boldsymbol{f}_t(\boldsymbol{x}_t)}{\partial \boldsymbol{x}_t}\Delta t According to the article “Derivatives of Determinants”, we have: \left|\frac{\partial \boldsymbol{x}_{t+\Delta t}}{\partial \boldsymbol{x}_t}\right| \approx 1 + \text{Tr}\,\frac{\partial \boldsymbol{f}_t(\boldsymbol{x}_t)}{\partial \boldsymbol{x}_t}\Delta t = 1 + \nabla_{\boldsymbol{x}_t}\cdot \boldsymbol{f}_t(\boldsymbol{x}_t) \Delta t \approx e^{\nabla_{\boldsymbol{x}_t}\cdot \boldsymbol{f}_t(\boldsymbol{x}_t) \Delta t} Thus, we can write: \log p_{t+\Delta t}(\boldsymbol{x}_{t+\Delta t}) - \log p_t(\boldsymbol{x}_t) \approx -\nabla_{\boldsymbol{x}_t}\cdot \boldsymbol{f}_t(\boldsymbol{x}_t) \Delta t \label{eq:approx-ode}

Taylor Approximation

Assume p_t(\boldsymbol{x}_t) is a family of probability density functions that vary continuously with the parameter t, where p_0(\boldsymbol{x}_0) is the data distribution and p_T(\boldsymbol{x}_T) is a simple distribution. When both \Delta t and \boldsymbol{x}_{t+\Delta t} - \boldsymbol{x}_t are small, we have the first-order Taylor approximation: \log p_{t+\Delta t}(\boldsymbol{x}_{t+\Delta t}) - \log p_t(\boldsymbol{x}_t) \approx (\boldsymbol{x}_{t+\Delta t} - \boldsymbol{x}_t)\cdot \nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t) + \Delta t\frac{\partial}{\partial t}\log p_t(\boldsymbol{x}_t) Substituting \boldsymbol{x}_{t+\Delta t} - \boldsymbol{x}_t from Eq. [eq:ode-diff] and comparing it with Eq. [eq:approx-ode], we obtain the equation that \boldsymbol{f}_t(\boldsymbol{x}_t) must satisfy: -\nabla_{\boldsymbol{x}_t}\cdot \boldsymbol{f}_t(\boldsymbol{x}_t) = \boldsymbol{f}_t(\boldsymbol{x}_t)\cdot \nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t) + \frac{\partial}{\partial t}\log p_t(\boldsymbol{x}_t) \label{eq:ode-f-eq} In other words, any \boldsymbol{f}_t(\boldsymbol{x}_t) satisfying this equation can be used to construct a differential equation [eq:ode] to achieve the transformation between the data distribution and the simple distribution. We can also rearrange it as: \frac{\partial}{\partial t} p_t(\boldsymbol{x}_t) = - \nabla_{\boldsymbol{x}_t}\cdot\Big(\boldsymbol{f}_t(\boldsymbol{x}_t) p_t(\boldsymbol{x}_t)\Big) \label{eq:ode-f-eq-fp} This is actually a special case of the “Fokker-Planck equation” introduced in Generative Diffusion Models (6): The ODE Perspective when g_t=0.

Heat Equation

We consider a solution of the following form: \boldsymbol{f}_t(\boldsymbol{x}_t) = - \boldsymbol{D}_t(\boldsymbol{x}_t)\,\nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t) \label{eq:ode-f-grad} where \boldsymbol{D}_t(\boldsymbol{x}_t) can be a matrix or a scalar, depending on the complexity considered. Why consider a solution of this form? To be honest, the author initially aimed to match the DDIM format and later discovered that after generalization, it could be linked to the diffusion equation below, so Eq. [eq:ode-f-grad] was set directly. In hindsight, if we assume \boldsymbol{D}_t(\boldsymbol{x}_t) is a non-negative scalar function, then substituting it into Eq. [eq:ode-diff] reveals a format similar to gradient descent. That is, moving from \boldsymbol{x}_0 to \boldsymbol{x}_T gradually seeks low-probability regions, while moving from \boldsymbol{x}_T to \boldsymbol{x}_0 gradually seeks high-probability regions, which aligns with intuition. This serves as a heuristic guide for Eq. [eq:ode-f-grad].

Substituting Eq. [eq:ode-f-grad] into Eq. [eq:ode-f-eq-fp], we get: \frac{\partial}{\partial t}p_t(\boldsymbol{x}_t) = \nabla_{\boldsymbol{x}_t}\cdot\Big(\boldsymbol{D}_t(\boldsymbol{x}_t)\,\nabla_{\boldsymbol{x}_t} p_t(\boldsymbol{x}_t)\Big) This is the “diffusion equation” in partial differential equations. Here we only consider the simplest case—where \boldsymbol{D}_t(\boldsymbol{x}_t) is a scalar function D_t independent of \boldsymbol{x}_t. In this case, the diffusion equation simplifies to: \frac{\partial}{\partial t}p_t(\boldsymbol{x}_t) = D_t \nabla_{\boldsymbol{x}_t}^2 p_t(\boldsymbol{x}_t) \label{eq:heat} This is the “heat equation,” which is the primary object we will solve and analyze next.

Solving the Distribution

Using the Fourier transform, the heat equation can be converted into an ordinary differential equation, allowing us to solve for the distribution p_t(\boldsymbol{x}_t). The result is: \begin{aligned} p_t(\boldsymbol{x}_t) =&\, \int \frac{1}{(2\pi\sigma_t^2)^{d/2}}\exp\left(-\frac{\Vert \boldsymbol{x}_t - \boldsymbol{x}_0\Vert^2}{2\sigma_t^2}\right)p_0(\boldsymbol{x}_0) d \boldsymbol{x}_0 \\ =&\, \int \mathcal{N}(\boldsymbol{x}_t; \boldsymbol{x}_0, \sigma_t^2 \boldsymbol{I})\, p_0(\boldsymbol{x}_0) d \boldsymbol{x}_0 \end{aligned} \label{eq:heat-sol} where \sigma_t^2 = 2\int_0^t D_s ds, or D_t = \dot{\sigma}_t \sigma_t (with \sigma_0=0). As we can see, the solution to the heat equation is exactly a Gaussian mixture model with p_0(\boldsymbol{x}_0) as the initial distribution.

Process: Here is a brief introduction to the method for solving the heat equation. Readers who are not interested in the derivation or are already familiar with the heat equation can skip this part.

Solving the heat equation [eq:heat] with the Fourier transform is quite simple. Apply the Fourier transform to the variable \boldsymbol{x}_t on both sides. According to the principle \nabla_{\boldsymbol{x}_t} \to i\boldsymbol{\omega}, the result is: \frac{\partial}{\partial t}\mathcal{F}_t(\boldsymbol{\omega}) = -D_t \boldsymbol{\omega}^2 \mathcal{F}_t(\boldsymbol{\omega}) This is just an ordinary differential equation with respect to t, which can be solved as: \mathcal{F}_t(\boldsymbol{\omega}) = \mathcal{F}_0(\boldsymbol{\omega}) \exp\left(-\frac{1}{2}\sigma_t^2 \boldsymbol{\omega}^2\right) where \sigma_t^2 = 2\int_0^t D_s ds, and \mathcal{F}_0(\boldsymbol{\omega}) is the Fourier transform of p_0(\boldsymbol{x}_0). Now, applying the inverse Fourier transform to both sides, \mathcal{F}_t(\boldsymbol{\omega}) returns to p_t(\boldsymbol{x}_t), \mathcal{F}_0(\boldsymbol{\omega}) returns to p_0(\boldsymbol{x}_0), and \exp\left(-\frac{1}{2}\sigma_t^2 \boldsymbol{\omega}^2\right) corresponds to the normal distribution \mathcal{N}(\boldsymbol{x}_t; \boldsymbol{0}, \sigma_t^2 \boldsymbol{I}). Finally, using the convolution property of the Fourier transform, we obtain the solution [eq:heat-sol].

Completing the Design

Now let’s summarize our results: by solving the heat equation, we have determined: p_t(\boldsymbol{x}_t) = \int \mathcal{N}(\boldsymbol{x}_t; \boldsymbol{x}_0, \sigma_t^2 \boldsymbol{I})\, p_0(\boldsymbol{x}_0) d \boldsymbol{x}_0 \label{eq:heat-sol-2} At this point, the corresponding differential equation: \frac{d\boldsymbol{x}_t}{dt}=-\dot{\sigma}_t \sigma_t \nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t) provides a deterministic transformation from p_0(\boldsymbol{x}_0) to p_T(\boldsymbol{x}_T). If p_T(\boldsymbol{x}_T) is easy to sample from and \nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t) is known, then we can randomly sample \boldsymbol{x}_T \sim p_T(\boldsymbol{x}_T) and solve the differential equation in reverse to generate samples \boldsymbol{x}_0 \sim p_0(\boldsymbol{x}_0).

The first question: when is p_T(\boldsymbol{x}_T) easy to sample? According to the result [eq:heat-sol-2], we know: \boldsymbol{x}_T \sim p_T(\boldsymbol{x}_T) \quad\Leftrightarrow\quad \boldsymbol{x}_T = \boldsymbol{x}_0 + \sigma_T \boldsymbol{\varepsilon},\,\,\, \boldsymbol{x}_0 \sim p_0(\boldsymbol{x}_0),\,\boldsymbol{\varepsilon} \sim \mathcal{N}(\boldsymbol{0},\boldsymbol{I}) When \sigma_T is large enough, the influence of \boldsymbol{x}_0 on \boldsymbol{x}_T becomes very weak. At this point, we can assume: \boldsymbol{x}_T \sim p_T(\boldsymbol{x}_T) \quad\Leftrightarrow\quad \boldsymbol{x}_T = \sigma_T \boldsymbol{\varepsilon},\,\,\,\boldsymbol{\varepsilon} \sim \mathcal{N}(\boldsymbol{0},\boldsymbol{I}) This achieves the goal of making p_T(\boldsymbol{x}_T) easy to sample. Therefore, the general requirement for choosing \sigma_t is a smooth monotonically increasing function satisfying \sigma_0 = 0 and \sigma_T \gg 1.

The second question: how to calculate \nabla_{\boldsymbol{x}_t}\log p_t(\boldsymbol{x}_t)? This is actually the same as the “Score Matching” section in Generative Diffusion Models (5): The SDE Perspective. We use a neural network \boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) to fit it, with the training objective: \mathbb{E}_{\boldsymbol{x}_0,\boldsymbol{x}_t \sim \mathcal{N}(\boldsymbol{x}_t; \boldsymbol{x}_0, \sigma_t^2 \boldsymbol{I})p_0(\boldsymbol{x}_0)}\left[\left\Vert \boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) - \nabla_{\boldsymbol{x}_t} \log \mathcal{N}(\boldsymbol{x}_t; \boldsymbol{x}_0, \sigma_t^2 \boldsymbol{I})\right\Vert^2\right] This is called “Conditional Score Matching.” Its derivation was already provided in the SDE part, so we won’t repeat it here.

Summary

In this article, we have performed a “top-down” derivation of the ODE-based diffusion model: starting from an ODE, we combined the Jacobian determinant to obtain a first-order approximation of the probability change, then compared it with the first-order approximation from a direct Taylor expansion to obtain the equation the ODE must satisfy. This was then transformed into a diffusion equation and a heat equation to be solved. Relatively speaking, the entire process is quite direct and does not require results from SDEs or FP equations as transitions.

When reposting, please include the original address of this article: https://kexue.fm/archives/9280

For more detailed reposting matters, please refer to: Scientific Space FAQ