English (unofficial) translations of posts at kexue.fm

Generative Diffusion Models Part 15: General Steps for Constructing ODEs (Part 2)

Translated by Gemini Flash 3.0 Preview. Translations may be inaccurate; please refer to the original post for anything important.

Last week, I wrote "Generative Diffusion Models Part 14: General Steps for Constructing ODEs (Part 1)" (at the time, it didn’t have the "Part 1" suffix). I thought I had already glimpsed the general laws for constructing ODE diffusion models. However, shortly after, the expert @gaohuazuo provided a more efficient and intuitive scheme for constructing Green’s functions in the comments, which made me feel quite humble. Recalling that the same expert previously gave a brilliant description of diffusion ODEs in "Generative Diffusion Models Part 12: ’Hard-Core’ Diffusion ODEs" (which indirectly inspired the results of the previous blog post), one cannot help but admire their insight.

After discussion and reflection, I found that the expert’s idea is essentially the Method of Characteristics for first-order partial differential equations. By constructing a specific vector field to guarantee the initial condition and then solving the differential equation to guarantee the terminal condition, both conditions are satisfied simultaneously. It is truly ingenious! Finally, I have summarized my findings in this article as a follow-up to the previous one.

Review of Previous Content

Let’s briefly review the results of the previous article. Suppose a random variable \boldsymbol{x}_0 \in \mathbb{R}^d continuously transforms into \boldsymbol{x}_T, and its law of change follows an ODE: \begin{equation} \frac{d\boldsymbol{x}_t}{dt} = \boldsymbol{f}_t(\boldsymbol{x}_t) \label{eq-ode} \end{equation} Then the corresponding distribution p_t(\boldsymbol{x}_t) at time t follows the "continuity equation": \begin{equation} \frac{\partial}{\partial t} p_t(\boldsymbol{x}_t) = - \nabla_{\boldsymbol{x}_t} \cdot \Big(\boldsymbol{f}_t(\boldsymbol{x}_t) p_t(\boldsymbol{x}_t)\Big) \label{eq:ode-f-eq-fp} \end{equation} Let \boldsymbol{u}(t, \boldsymbol{x}_t) = (p_t(\boldsymbol{x}_t), \boldsymbol{f}_t(\boldsymbol{x}_t) p_t(\boldsymbol{x}_t)) \in \mathbb{R}^{d+1}. Then the continuity equation can be written concisely as: \begin{equation} \left\{\begin{aligned} &\nabla_{(t,\, \boldsymbol{x}_t)} \cdot \boldsymbol{u}(t, \boldsymbol{x}_t) = 0 \\ &\boldsymbol{u}_1(0, \boldsymbol{x}_0) = p_0(\boldsymbol{x}_0), \quad \int \boldsymbol{u}_1(t, \boldsymbol{x}_t) d\boldsymbol{x}_t = 1 \end{aligned}\right. \label{eq:div-eq} \end{equation} To solve this equation, we can use the idea of Green’s functions, i.e., first solve: \begin{equation} \left\{\begin{aligned} &\nabla_{(t,\, \boldsymbol{x}_t)} \cdot \boldsymbol{G}(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0) = 0 \\ &\boldsymbol{G}_1(0, 0; \boldsymbol{x}_t, \boldsymbol{x}_0) = \delta(\boldsymbol{x}_t - \boldsymbol{x}_0), \quad \int \boldsymbol{G}_1(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0) d\boldsymbol{x}_t = 1 \end{aligned}\right. 
\label{eq:div-green} \end{equation} Then: \begin{equation} \boldsymbol{u}(t, \boldsymbol{x}_t) = \int \boldsymbol{G}(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0) p_0(\boldsymbol{x}_0) d\boldsymbol{x}_0 = \mathbb{E}_{\boldsymbol{x}_0 \sim p_0(\boldsymbol{x}_0)}[\boldsymbol{G}(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0)] \label{eq:div-green-int} \end{equation} is one of the solutions that satisfies the constraints.

Geometric Intuition

The idea of the Green’s function is actually very simple. It says that we should not rush to solve complex data generation; instead, we first assume that the data to be generated is just a single point \boldsymbol{x}_0, and solve the generation for a single data point first. Some readers might think, isn’t this simple? Just \boldsymbol{x}_T \times 0 + \boldsymbol{x}_0 and it’s done? Of course, it’s not that simple. What we need is a continuous, gradual generation. As shown in the figure below, every point \boldsymbol{x}_T at t=T runs along a smooth trajectory to the point \boldsymbol{x}_0 at t=0:

[Image: Green’s Function Diagram. In the figure, T=1. Each point at t=1 runs along a specific trajectory to a point at t=0. Except for the common point, there is no overlap between trajectories. These trajectories are the field lines of the Green’s function.]


Since our goal is just to construct a generative model, we don’t fundamentally care about the shape of the trajectories, as long as they all pass through \boldsymbol{x}_0. Thus, we can artificially choose a family of trajectories we like that pass through \boldsymbol{x}_0, denoted as: \begin{equation} \boldsymbol{\varphi}_t(\boldsymbol{x}_t|\boldsymbol{x}_0) = \boldsymbol{x}_T \label{eq:track} \end{equation} To emphasize again, this represents a family of trajectories starting from \boldsymbol{x}_0 and ending at \boldsymbol{x}_T. The independent and dependent variables of the trajectory are t and \boldsymbol{x}_t, respectively. The starting point \boldsymbol{x}_0 is fixed, while the endpoint \boldsymbol{x}_T can vary arbitrarily. The shape of the trajectory is irrelevant; we can choose straight lines, parabolas, etc.

Now we take the derivative of both sides of Eq. [eq:track]. Since \boldsymbol{x}_T can vary freely, it acts like the integration constant of a differential equation, so its derivative is \boldsymbol{0}. Thus, we have: \begin{equation} \frac{\partial \boldsymbol{\varphi}_t(\boldsymbol{x}_t|\boldsymbol{x}_0)}{\partial \boldsymbol{x}_t} \frac{d\boldsymbol{x}_t}{dt} + \frac{\partial \boldsymbol{\varphi}_t(\boldsymbol{x}_t|\boldsymbol{x}_0)}{\partial t} = \boldsymbol{0} \end{equation} \begin{equation} \Downarrow \nonumber \end{equation} \begin{equation} \frac{d\boldsymbol{x}_t}{dt} = - \left(\frac{\partial \boldsymbol{\varphi}_t(\boldsymbol{x}_t|\boldsymbol{x}_0)}{\partial \boldsymbol{x}_t}\right)^{-1} \frac{\partial \boldsymbol{\varphi}_t(\boldsymbol{x}_t|\boldsymbol{x}_0)}{\partial t} \end{equation} Comparing this with Eq. [eq-ode], we obtain: \begin{equation} \boldsymbol{f}_t(\boldsymbol{x}_t|\boldsymbol{x}_0) = - \left(\frac{\partial \boldsymbol{\varphi}_t(\boldsymbol{x}_t|\boldsymbol{x}_0)}{\partial \boldsymbol{x}_t}\right)^{-1} \frac{\partial \boldsymbol{\varphi}_t(\boldsymbol{x}_t|\boldsymbol{x}_0)}{\partial t} \label{eq:f-xt-x0} \end{equation} Here, the original notation \boldsymbol{f}_t(\boldsymbol{x}_t) is replaced by \boldsymbol{f}_t(\boldsymbol{x}_t|\boldsymbol{x}_0) to mark that the trajectories have a common point \boldsymbol{x}_0. In other words, the ODE trajectory corresponding to the force field \boldsymbol{f}_t(\boldsymbol{x}_t|\boldsymbol{x}_0) constructed this way must pass through \boldsymbol{x}_0, which guarantees the initial condition of the Green’s function.
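In one dimension the Jacobian inverse is just a reciprocal, so Eq. [eq:f-xt-x0] is easy to sanity-check symbolically. Below is a minimal sketch using sympy, assuming the straight-line trajectory that appears in the examples later in this post:

```python
import sympy as sp

t = sp.symbols('t', positive=True)
xt, x0 = sp.symbols('x_t x_0', real=True)

phi = (xt - x0) / t + x0                 # trajectory family: phi_t(x_t|x_0) = x_1
f = -sp.diff(phi, t) / sp.diff(phi, xt)  # scalar form of Eq. [eq:f-xt-x0]

# The constructed field recovers f_t(x_t|x_0) = (x_t - x_0)/t.
assert sp.simplify(f - (xt - x0) / t) == 0
```

Any other trajectory family through \boldsymbol{x}_0 can be substituted for `phi` in the same way.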

Method of Characteristics

Since the initial condition is guaranteed, we might as well ask for a bit more: let’s guarantee the terminal condition as well. The terminal condition means we hope that at t=T, the distribution of \boldsymbol{x}_T is a simple distribution independent of \boldsymbol{x}_0. The main drawback of the solution framework in the previous article was the inability to directly guarantee the simplicity of the terminal distribution, which could only be studied through post-hoc analysis. The idea in this article is to directly design a specific \boldsymbol{f}_t(\boldsymbol{x}_t|\boldsymbol{x}_0) to guarantee the initial condition, leaving room to guarantee the terminal condition. Moreover, once both initial and terminal conditions are guaranteed, the integral condition is naturally satisfied under the premise of satisfying the continuity equation [eq:ode-f-eq-fp].

Mathematically speaking, we want to solve Eq. [eq:ode-f-eq-fp] given \boldsymbol{f}_t(\boldsymbol{x}_t|\boldsymbol{x}_0) and p_T(\boldsymbol{x}_T). This is a first-order partial differential equation, which can be solved using the "Method of Characteristics." For a theoretical introduction, please refer to my previous post "Method of Characteristics for First-Order PDEs". First, we rewrite Eq. [eq:ode-f-eq-fp] equivalently as: \begin{equation} \frac{\partial}{\partial t} p_t(\boldsymbol{x}_t|\boldsymbol{x}_0) + \nabla_{\boldsymbol{x}_t} p_t(\boldsymbol{x}_t|\boldsymbol{x}_0) \cdot \boldsymbol{f}_t(\boldsymbol{x}_t|\boldsymbol{x}_0) = - p_t(\boldsymbol{x}_t|\boldsymbol{x}_0) \nabla_{\boldsymbol{x}_t} \cdot \boldsymbol{f}_t(\boldsymbol{x}_t|\boldsymbol{x}_0) \end{equation} Similar to before, since we are solving given a starting point \boldsymbol{x}_0, we replace p_t(\boldsymbol{x}_t) with p_t(\boldsymbol{x}_t|\boldsymbol{x}_0) to mark this as the solution starting from \boldsymbol{x}_0.
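This rewriting is nothing more than the product rule \nabla_{\boldsymbol{x}_t} \cdot (p \boldsymbol{f}) = p \, \nabla_{\boldsymbol{x}_t} \cdot \boldsymbol{f} + \nabla_{\boldsymbol{x}_t} p \cdot \boldsymbol{f} applied to the right side of the continuity equation; in one dimension it can be confirmed symbolically:

```python
import sympy as sp

x = sp.symbols('x')
p = sp.Function('p')(x)
f = sp.Function('f')(x)

# d(f p)/dx = p df/dx + f dp/dx: the product rule behind the rewriting
lhs = sp.diff(f * p, x)
rhs = p * sp.diff(f, x) + f * sp.diff(p, x)
assert sp.simplify(lhs - rhs) == 0
```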

The idea of the method of characteristics is to first consider the solution of the PDE on a specific trajectory, which converts the partial differential equation into an ordinary differential equation, reducing the difficulty of solving. Specifically, we assume \boldsymbol{x}_t is a function of t and solve along the trajectory of Eq. [eq-ode]. Since Eq. [eq-ode] holds, after replacing \boldsymbol{f}_t(\boldsymbol{x}_t|\boldsymbol{x}_0) on the left side with \frac{d\boldsymbol{x}_t}{dt}, the left side becomes exactly the total derivative of p_t(\boldsymbol{x}_t|\boldsymbol{x}_0). Thus, we have: \begin{equation} \frac{d}{dt} p_t(\boldsymbol{x}_t|\boldsymbol{x}_0) = - p_t(\boldsymbol{x}_t|\boldsymbol{x}_0) \nabla_{\boldsymbol{x}_t} \cdot \boldsymbol{f}_t(\boldsymbol{x}_t|\boldsymbol{x}_0) \end{equation} Note that at this point, all \boldsymbol{x}_t should be replaced by the corresponding functions of t, which theoretically can be solved from the trajectory equation [eq:track]. After replacement, p and \boldsymbol{f} in the above equation are purely functions of t, so the equation is just a linear ODE for p, which can be solved as: \begin{equation} p_t(\boldsymbol{x}_t|\boldsymbol{x}_0) = C \exp\left(\int_t^T \nabla_{\boldsymbol{x}_s} \cdot \boldsymbol{f}_s(\boldsymbol{x}_s|\boldsymbol{x}_0) ds\right) \end{equation} Substituting the terminal condition p_T(\boldsymbol{x}_T), we get C = p_T(\boldsymbol{x}_T), i.e.: \begin{equation} p_t(\boldsymbol{x}_t|\boldsymbol{x}_0) = p_T(\boldsymbol{x}_T) \exp\left(\int_t^T \nabla_{\boldsymbol{x}_s} \cdot \boldsymbol{f}_s(\boldsymbol{x}_s|\boldsymbol{x}_0) ds\right) \label{eq:pt-xt-x0} \end{equation} By substituting \boldsymbol{x}_T from the trajectory equation [eq:track], we obtain a function containing only t, \boldsymbol{x}_t, \boldsymbol{x}_0, which is the Green’s function \boldsymbol{G}_1(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0) we seek. 
Correspondingly, \boldsymbol{G}_{> 1}(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0) = p_t(\boldsymbol{x}_t|\boldsymbol{x}_0) \boldsymbol{f}_t(\boldsymbol{x}_t|\boldsymbol{x}_0).
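As a one-dimensional sanity check of Eq. [eq:pt-xt-x0], take T=1, the straight-line field \boldsymbol{f}_t(\boldsymbol{x}_t|\boldsymbol{x}_0) = (\boldsymbol{x}_t - \boldsymbol{x}_0)/t used later in the post, and (as an assumed instance) a standard normal terminal distribution; sympy then confirms that the resulting p_t satisfies the continuity equation:

```python
import sympy as sp

t = sp.symbols('t', positive=True)
xt, x0 = sp.symbols('x_t x_0', real=True)

f = (xt - x0) / t                              # straight-line field
x1 = (xt - x0) / t + x0                        # endpoint from the trajectory equation
p1 = sp.exp(-x1**2 / 2) / sp.sqrt(2 * sp.pi)   # standard normal terminal density
pt = p1 / t                                    # Eq. [eq:pt-xt-x0] with d = 1

# Continuity equation: d(pt)/dt = -d(f * pt)/dx
lhs = sp.diff(pt, t)
rhs = -sp.diff(f * pt, xt)
assert sp.simplify((lhs - rhs) / pt) == 0
```

Dividing by pt before simplifying cancels the exponential factor, so the check reduces to a rational identity.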

Training Objective

With the Green’s function, we can obtain: \begin{equation} \begin{aligned} \boldsymbol{u}_1(t, \boldsymbol{x}_t) &= \int p_t(\boldsymbol{x}_t|\boldsymbol{x}_0) p_0(\boldsymbol{x}_0) d\boldsymbol{x}_0 = p_t(\boldsymbol{x}_t) \\ \boldsymbol{u}_{> 1}(t, \boldsymbol{x}_t) &= \int \boldsymbol{f}_t(\boldsymbol{x}_t|\boldsymbol{x}_0) p_t(\boldsymbol{x}_t|\boldsymbol{x}_0) p_0(\boldsymbol{x}_0) d\boldsymbol{x}_0 \end{aligned} \end{equation} Thus: \begin{equation} \begin{aligned} \boldsymbol{f}_t(\boldsymbol{x}_t) &= \frac{\boldsymbol{u}_{> 1}(t, \boldsymbol{x}_t)}{\boldsymbol{u}_1(t, \boldsymbol{x}_t)} \\ &= \int \boldsymbol{f}_t(\boldsymbol{x}_t|\boldsymbol{x}_0) \frac{p_t(\boldsymbol{x}_t|\boldsymbol{x}_0) p_0(\boldsymbol{x}_0)}{p_t(\boldsymbol{x}_t)} d\boldsymbol{x}_0 \\ &= \int \boldsymbol{f}_t(\boldsymbol{x}_t|\boldsymbol{x}_0) p_t(\boldsymbol{x}_0|\boldsymbol{x}_t) d\boldsymbol{x}_0 \\ &= \mathbb{E}_{\boldsymbol{x}_0 \sim p_t(\boldsymbol{x}_0|\boldsymbol{x}_t)}\left[\boldsymbol{f}_t(\boldsymbol{x}_t|\boldsymbol{x}_0)\right] \end{aligned} \end{equation} According to the method for constructing score matching objectives in "Generative Diffusion Models Part 5: SDE in General Framework", we can construct the training objective: \begin{equation} \begin{aligned} &\mathbb{E}_{\boldsymbol{x}_t \sim p_t(\boldsymbol{x}_t)} \Big[ \mathbb{E}_{\boldsymbol{x}_0 \sim p_t(\boldsymbol{x}_0|\boldsymbol{x}_t)} \left[ \left\Vert \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) - \boldsymbol{f}_t(\boldsymbol{x}_t|\boldsymbol{x}_0) \right\Vert^2 \right] \Big] \\ &= \mathbb{E}_{\boldsymbol{x}_0, \boldsymbol{x}_t \sim p_t(\boldsymbol{x}_t|\boldsymbol{x}_0) p_0(\boldsymbol{x}_0)} \left[ \left\Vert \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) - \boldsymbol{f}_t(\boldsymbol{x}_t|\boldsymbol{x}_0) \right\Vert^2 \right] \end{aligned} \label{eq:score-match} \end{equation} This is formally consistent with the "Conditional Flow Matching" objective presented in "Flow Matching for Generative Modeling". As we will see later, the results of that paper can be derived from the method in this article. Once training is complete, samples can be generated by solving the equation \frac{d\boldsymbol{x}_t}{dt} = \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t). From this training objective, we can also see that our only requirement on p_t(\boldsymbol{x}_t|\boldsymbol{x}_0) is that it must be easy to sample from.
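As a minimal numpy sketch of how training tuples for this objective could be assembled, take the straight-line trajectory and the uniform prior on [0, 2] from the examples below. `make_batch` is an illustrative name, and the network \boldsymbol{v}_{\boldsymbol{\theta}} and optimizer are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_batch(data, batch_size=64):
    """Sample (x_0, t, x_t, target) tuples for the objective in Eq. [eq:score-match],
    assuming the straight-line trajectory and a uniform prior on [0, 2]."""
    x0 = rng.choice(data, size=batch_size)        # x_0 ~ p_0 (empirical data distribution)
    eps = rng.uniform(0.0, 2.0, size=batch_size)  # x_1 ~ p_1 (uniform prior)
    t = rng.uniform(1e-3, 1.0, size=batch_size)   # avoid t = 0, where f is singular
    xt = (1 - t) * x0 + t * eps                   # a sample from p_t(x_t | x_0)
    target = (xt - x0) / t                        # f_t(x_t | x_0): the regression target
    return x0, t, xt, target

# A network v_theta(x_t, t) would then be fit by minimizing the mean of
# ||v_theta(xt, t) - target||^2 over such batches.
x0, t, xt, target = make_batch(np.array([0.5, 1.2, 1.7]))
```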

Some Examples

The abstract results above may still be hard to grasp, so next we work through some concrete examples to build intuition for this framework. As for the method of characteristics itself, as I mentioned in "Method of Characteristics for First-Order PDEs", I initially found it as elusive as "magic": following the steps is not difficult, but grasping the key ideas is. Understanding it takes a process of repeated deliberation, which I'm afraid I cannot shortcut for you.

Straight Line Trajectory

As the simplest example, we assume \boldsymbol{x}_T changes to \boldsymbol{x}_0 along a straight line trajectory. For simplicity, we can set T=1 without loss of generality. Then the equation for \boldsymbol{x}_t can be written as: \begin{equation} \boldsymbol{x}_t = (\boldsymbol{x}_1 - \boldsymbol{x}_0)t + \boldsymbol{x}_0 \quad \Rightarrow \quad \frac{\boldsymbol{x}_t - \boldsymbol{x}_0}{t} + \boldsymbol{x}_0 = \boldsymbol{x}_1 \label{eq:simplest-x1} \end{equation} According to Eq. [eq:f-xt-x0], we have: \begin{equation} \boldsymbol{f}_t(\boldsymbol{x}_t|\boldsymbol{x}_0) = \frac{\boldsymbol{x}_t - \boldsymbol{x}_0}{t} \end{equation} In this case, \nabla_{\boldsymbol{x}_t} \cdot \boldsymbol{f}_t(\boldsymbol{x}_t|\boldsymbol{x}_0) = \frac{d}{t}. According to Eq. [eq:pt-xt-x0], we have: \begin{equation} p_t(\boldsymbol{x}_t|\boldsymbol{x}_0) = \frac{p_1(\boldsymbol{x}_1)}{t^d} \end{equation} Substituting \boldsymbol{x}_1 from Eq. [eq:simplest-x1], we get: \begin{equation} p_t(\boldsymbol{x}_t|\boldsymbol{x}_0) = \frac{p_1\left(\frac{\boldsymbol{x}_t - \boldsymbol{x}_0}{t} + \boldsymbol{x}_0\right)}{t^d} \end{equation} In particular, if p_1(\boldsymbol{x}_1) is a standard normal distribution, then the above equation actually means p_t(\boldsymbol{x}_t|\boldsymbol{x}_0) = \mathcal{N}(\boldsymbol{x}_t; (1-t)\boldsymbol{x}_0, t^2\boldsymbol{I}), which is exactly one of the common Gaussian diffusion models. The new result of this framework is that it allows us to choose a more general prior distribution p_1(\boldsymbol{x}_1), such as a uniform distribution. 
Additionally, as mentioned when introducing score matching [eq:score-match], we only need to know how to sample from p_t(\boldsymbol{x}_t|\boldsymbol{x}_0), and the above equation tells us we only need the prior distribution to be easy to sample from, because: \begin{equation} \boldsymbol{x}_t \sim p_t(\boldsymbol{x}_t|\boldsymbol{x}_0) \quad \Leftrightarrow \quad \boldsymbol{x}_t = (1-t)\boldsymbol{x}_0 + t\boldsymbol{\varepsilon}, \, \boldsymbol{\varepsilon} \sim p_1(\boldsymbol{\varepsilon}) \end{equation}
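A quick Monte Carlo check of this equivalence, assuming (as in the plotting code below) a uniform prior on [0, 2]:

```python
import numpy as np

rng = np.random.default_rng(42)
x0, t = 0.5, 0.3
eps = rng.uniform(0.0, 2.0, size=200_000)  # epsilon ~ p_1, uniform on [0, 2]
xt = (1 - t) * x0 + t * eps                # samples from p_t(x_t | x_0)

# The mean of the uniform prior is 1, so E[x_t] = (1 - t) * x_0 + t * 1.
assert abs(xt.mean() - ((1 - t) * x0 + t * 1.0)) < 1e-2
# The support is [(1-t)x_0, (1-t)x_0 + 2t], consistent with
# p_t(x_t|x_0) = p_1((x_t - x_0)/t + x_0) / t.
assert xt.min() >= (1 - t) * x0 and xt.max() <= (1 - t) * x0 + 2 * t
```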

Effect Demonstration

Note that our assumption that the trajectory from \boldsymbol{x}_0 to \boldsymbol{x}_1 is a straight line applies only to single-point generation, i.e., the Green’s function solution. When the force field \boldsymbol{f}_t(\boldsymbol{x}_t) corresponding to a general distribution is superimposed via the Green’s function, the generation trajectory is no longer a straight line.

The figure below demonstrates the trajectory plots for multi-point generation when the prior distribution is a uniform distribution:

[Image: Single-point generation] [Image: Two-point generation] [Image: Three-point generation]

Trajectories for single-point, two-point, and three-point generation.

Reference plotting code:

import numpy as np
from scipy.integrate import odeint
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rc('text', usetex=True)  # requires a LaTeX installation
matplotlib.rcParams['text.latex.preamble'] = r"\usepackage{amsmath}"  # a string, not a list, in modern matplotlib

prior = lambda x: 0.5 if 2 >= x >= 0 else 0           # p_1: uniform density on [0, 2]
p = lambda xt, x0, t: prior((xt - x0) / t + x0) / t   # p_t(x_t|x_0), Eq. (pt-xt-x0) with d = 1
f = lambda xt, x0, t: (xt - x0) / t                   # f_t(x_t|x_0) for the straight-line trajectory

def f_full(xt, t):
    x0s = [0.5, 0.5, 1.2, 1.7]  # 0.5 appears twice, representing double frequency
    fs = np.array([f(xt, x0, t) for x0 in x0s]).reshape(-1)
    ps = np.array([p(xt, x0, t) for x0 in x0s]).reshape(-1)
    return (fs * ps).sum() / (ps.sum() + 1e-8)

for x1 in np.arange(0.01, 1.99, 0.10999/2):
    ts = np.arange(1, 0, -0.001)
    xs = odeint(f_full, x1, ts).reshape(-1)[::-1]
    ts = ts[::-1]
    if abs(xs[0] - 0.5) < 0.1:
        _ = plt.plot(ts, xs, color='skyblue')
    elif abs(xs[0] - 1.2) < 0.1:
        _ = plt.plot(ts, xs, color='orange')
    else:
        _ = plt.plot(ts, xs, color='limegreen')

plt.xlabel('$t$')
plt.ylabel(r'$\boldsymbol{x}$')
plt.show()

A General Extension

In fact, the above results can be generalized to: \begin{equation} \boldsymbol{x}_t = \boldsymbol{\mu}_t(\boldsymbol{x}_0) + \sigma_t \boldsymbol{x}_1 \quad \Rightarrow \quad \frac{\boldsymbol{x}_t - \boldsymbol{\mu}_t(\boldsymbol{x}_0)}{\sigma_t} = \boldsymbol{x}_1 \end{equation} Here \boldsymbol{\mu}_t(\boldsymbol{x}_0) is any function \mathbb{R}^d \mapsto \mathbb{R}^d satisfying \boldsymbol{\mu}_0(\boldsymbol{x}_0) = \boldsymbol{x}_0, \boldsymbol{\mu}_1(\boldsymbol{x}_0) = \boldsymbol{0}, and \sigma_t is any monotonically increasing function satisfying \sigma_0 = 0, \sigma_1 = 1. According to Eq. [eq:f-xt-x0], we have: \begin{equation} \boldsymbol{f}_t(\boldsymbol{x}_t|\boldsymbol{x}_0) = \dot{\boldsymbol{\mu}}_t(\boldsymbol{x}_0) + \frac{\dot{\sigma}_t}{\sigma_t}(\boldsymbol{x}_t - \boldsymbol{\mu}_t(\boldsymbol{x}_0)) \end{equation} This is also equivalent to Eq. (15) in "Flow Matching for Generative Modeling". In this case, \nabla_{\boldsymbol{x}_t} \cdot \boldsymbol{f}_t(\boldsymbol{x}_t|\boldsymbol{x}_0) = \frac{d\dot{\sigma}_t}{\sigma_t}. According to Eq. [eq:pt-xt-x0], we have: \begin{equation} p_t(\boldsymbol{x}_t|\boldsymbol{x}_0) = \frac{p_1(\boldsymbol{x}_1)}{\sigma_t^d} \end{equation} Substituting \boldsymbol{x}_1, the final result is: \begin{equation} p_t(\boldsymbol{x}_t|\boldsymbol{x}_0) = \frac{p_1\left(\frac{\boldsymbol{x}_t - \boldsymbol{\mu}_t(\boldsymbol{x}_0)}{\sigma_t}\right)}{\sigma_t^d} \end{equation} This is a general result for linear ODE diffusion, which includes Gaussian diffusion and allows for the use of non-Gaussian prior distributions.
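As a symbolic check, one admissible choice is \boldsymbol{\mu}_t(\boldsymbol{x}_0) = (1-t)\boldsymbol{x}_0 and \sigma_t = t, under which the general field should reduce to the straight-line one. In one dimension:

```python
import sympy as sp

t = sp.symbols('t', positive=True)
xt, x0 = sp.symbols('x_t x_0', real=True)

mu = (1 - t) * x0   # mu_t(x_0): mu_0 = x_0, mu_1 = 0
sigma = t           # sigma_t: increasing, sigma_0 = 0, sigma_1 = 1

# f_t(x_t|x_0) = mu_dot + (sigma_dot / sigma) * (x_t - mu_t)
f_general = sp.diff(mu, t) + sp.diff(sigma, t) / sigma * (xt - mu)
assert sp.simplify(f_general - (xt - x0) / t) == 0
```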

More Complex?

The previous examples all construct the change trajectory of \boldsymbol{x}_t through a simple linear interpolation of \boldsymbol{x}_0 (or some transformation of it) and \boldsymbol{x}_1 (where the interpolation weights are purely functions of t). A natural question is: can we consider more complex trajectories? Theoretically, yes, but higher complexity implies more hidden assumptions, and it is usually difficult to verify whether the target data supports these assumptions. Therefore, more complex trajectories are generally not considered. Furthermore, for more complex trajectories, the difficulty of analytical solutions is usually higher, making it hard to proceed both theoretically and experimentally.

More importantly, the trajectories we currently assume are only for single-point generation. As demonstrated earlier, even if we assume straight lines, multi-point generation still leads to complex curves. Therefore, if the trajectories for single-point generation are assumed to be unnecessarily complex, one can imagine that the complexity of multi-point generation trajectories will be extremely high, and the model might become highly unstable.

Summary

Continuing from the previous article, this post discussed the construction ideas for ODE-style diffusion models again. This time, starting from geometric intuition, we constructed a specific vector field to ensure the results satisfy the initial distribution condition, and then solved the differential equation to ensure the terminal distribution condition, obtaining a Green’s function that satisfies both initial and terminal conditions simultaneously. In particular, this method allows us to use any simple distribution as a prior distribution, breaking away from the previous dependence on Gaussian distributions to construct diffusion models.

Original Address: https://kexue.fm/archives/9379