English (unofficial) translations of posts at kexue.fm
Source

Generative Diffusion Models (17): General Steps for Constructing ODEs (Part 3)

Translated by Gemini Flash 3.0 Preview. Translations may be inaccurate; please refer to the original post for anything important.

History has a striking way of repeating itself. When I was writing “Generative Diffusion Models (14): General Steps for Constructing ODEs (Part 1)” (which didn’t have the “Part 1” suffix at the time), I thought I had already clarified the general steps for constructing ODE-based diffusion. However, the reader @gaohuazuo provided a new, intuitive, and effective scheme, which directly led to the subsequent “Generative Diffusion Models (14): General Steps for Constructing ODEs (Part 2)” (which was then labeled as “Part 2”). Just when I thought the matter was settled, I discovered the ICLR 2023 paper “Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow”, which presents yet another scheme for constructing ODE-based diffusion models. Its simplicity and intuitiveness are unprecedented and absolutely brilliant. Consequently, I have quietly renamed the suffix of the previous post to “Part 2” and written this “Part 3” to share this new result.

Intuitive Results

As we know, a diffusion model is an evolutionary process \boldsymbol{x}_T \to \boldsymbol{x}_0, and an ODE-based diffusion model specifies that the evolution follows a particular ODE: \begin{equation} \frac{d\boldsymbol{x}_t}{dt}=\boldsymbol{f}_t(\boldsymbol{x}_t)\label{eq:ode} \end{equation} The so-called construction of an ODE-based diffusion model involves designing a function \boldsymbol{f}_t(\boldsymbol{x}_t) such that the corresponding evolutionary trajectory constitutes a transformation between given distributions p_T(\boldsymbol{x}_T) and p_0(\boldsymbol{x}_0). Simply put, we want to randomly sample \boldsymbol{x}_T from p_T(\boldsymbol{x}_T) and have the \boldsymbol{x}_0 obtained by evolving backward according to the ODE follow the distribution p_0(\boldsymbol{x}_0).
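To make the "evolve backward according to the ODE" step concrete, here is a minimal sketch (not from the original post; the function name `euler_sample` and the choice of the Euler method are our own) of integrating Eq. [eq:ode] from t=T down to t=0:

```python
import numpy as np

def euler_sample(f, x_T, T=1.0, steps=100):
    """Integrate dx/dt = f(x, t) backward from t = T to t = 0
    with simple explicit Euler steps, returning the estimate of x_0."""
    x = np.asarray(x_T, dtype=float)
    dt = T / steps
    t = T
    for _ in range(steps):
        x = x - dt * f(x, t)  # step t -> t - dt along the ODE
        t -= dt
    return x

# Sanity check with dx/dt = x, whose exact backward solution is
# x_0 = x_T * exp(-T); with x_T = e and T = 1 the exact answer is 1.
x0_est = float(euler_sample(lambda x, t: x, np.e))
```

Any off-the-shelf ODE solver would do equally well here; Euler is used only because it makes the discretization explicit.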

The idea in the original paper is very simple. Randomly select \boldsymbol{x}_0 \sim p_0(\boldsymbol{x}_0) and \boldsymbol{x}_T \sim p_T(\boldsymbol{x}_T), and assume they transform according to a trajectory: \begin{equation} \boldsymbol{x}_t = \boldsymbol{\varphi}_t(\boldsymbol{x}_0, \boldsymbol{x}_T)\label{eq:track} \end{equation} This trajectory is a known function that we design ourselves. In theory, any continuous function that satisfies the following conditions is acceptable: \begin{equation} \boldsymbol{x}_0 = \boldsymbol{\varphi}_0(\boldsymbol{x}_0, \boldsymbol{x}_T),\quad \boldsymbol{x}_T = \boldsymbol{\varphi}_T(\boldsymbol{x}_0, \boldsymbol{x}_T) \end{equation} We can then write the differential equation it satisfies: \begin{equation} \frac{d\boldsymbol{x}_t}{dt} = \frac{\partial \boldsymbol{\varphi}_t(\boldsymbol{x}_0, \boldsymbol{x}_T)}{\partial t}\label{eq:fake-ode} \end{equation} However, this differential equation is not practical because we want to generate \boldsymbol{x}_0 given \boldsymbol{x}_T, but the right-hand side is a function of \boldsymbol{x}_0 (if \boldsymbol{x}_0 were known, we would already be done). Only an ODE like Eq. [eq:ode], where the right-hand side contains only \boldsymbol{x}_t (from a causal perspective, it could theoretically include \boldsymbol{x}_T, but we generally do not consider this case), can be used for practical evolution. Thus, an intuitive and somewhat “bold” idea is: learn a function \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) to approximate the right-hand side of the above equation as closely as possible! 
To this end, we optimize the following objective: \begin{equation} \mathbb{E}_{\boldsymbol{x}_0\sim p_0(\boldsymbol{x}_0),\boldsymbol{x}_T\sim p_T(\boldsymbol{x}_T)}\left[\left\Vert \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) - \frac{\partial \boldsymbol{\varphi}_t(\boldsymbol{x}_0, \boldsymbol{x}_T)}{\partial t}\right\Vert^2\right] \label{eq:objective} \end{equation} Since \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) approximates \frac{\partial \boldsymbol{\varphi}_t(\boldsymbol{x}_0, \boldsymbol{x}_T)}{\partial t}, we assume that replacing the right-hand side of Eq. [eq:fake-ode] with \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) is also valid, which gives us the practical diffusion ODE: \begin{equation} \frac{d\boldsymbol{x}_t}{dt} = \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)\label{eq:s-ode} \end{equation}
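A Monte-Carlo sketch of objective [eq:objective] (our own illustration, with hypothetical names `flow_matching_loss`, `phi`, `dphi_dt`; the real model \boldsymbol{v}_{\boldsymbol{\theta}} would be a neural network):

```python
import numpy as np

def flow_matching_loss(v, dphi_dt, phi, x0, xT, t):
    """Monte-Carlo estimate of E[ || v(x_t, t) - dphi_t/dt ||^2 ]
    over a batch of (x_0, x_T, t) samples."""
    x_t = phi(x0, xT, t)                      # x_t on the chosen trajectory
    diff = v(x_t, t) - dphi_dt(x0, xT, t)     # model output vs. trajectory velocity
    return float(np.mean(np.sum(diff**2, axis=-1)))

# Sanity check with a degenerate p_0 (x_0 = 0) and the linear trajectory:
# then x_t = t * x_T, the target is x_T, and v(x, t) = x / t fits exactly.
rng = np.random.default_rng(0)
xT = rng.normal(size=(4, 3))
x0 = np.zeros((4, 3))
t = rng.uniform(0.5, 1.0, size=(4, 1))
loss = flow_matching_loss(
    lambda x, t: x / t,                       # "oracle" model for this case
    lambda x0, xT, t: xT - x0,                # dphi/dt of the linear trajectory
    lambda x0, xT, t: (xT - x0) * t + x0,     # the linear trajectory itself
    x0, xT, t,
)
```

In practice one would minimize this loss over the parameters of \boldsymbol{v}_{\boldsymbol{\theta}} with stochastic gradient descent, sampling t uniformly per batch.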

Simple Example

As a simple example, let T=1 and assume the transformation trajectory is a straight line: \begin{equation} \boldsymbol{x}_t = \boldsymbol{\varphi}_t(\boldsymbol{x}_0,\boldsymbol{x}_1) = (\boldsymbol{x}_1 - \boldsymbol{x}_0)t + \boldsymbol{x}_0 \end{equation} Then: \begin{equation} \frac{\partial \boldsymbol{\varphi}_t(\boldsymbol{x}_0, \boldsymbol{x}_1)}{\partial t} = \boldsymbol{x}_1 - \boldsymbol{x}_0 \end{equation} So the training objective [eq:objective] becomes: \begin{equation} \mathbb{E}_{\boldsymbol{x}_0\sim p_0(\boldsymbol{x}_0),\boldsymbol{x}_1\sim p_1(\boldsymbol{x}_1)}\left[\left\Vert \boldsymbol{v}_{\boldsymbol{\theta}}\big((\boldsymbol{x}_1 - \boldsymbol{x}_0)t + \boldsymbol{x}_0, t\big) - (\boldsymbol{x}_1 - \boldsymbol{x}_0)\right\Vert^2\right] \end{equation} Or equivalently: \begin{equation} \mathbb{E}_{\boldsymbol{x}_0,\boldsymbol{x}_t\sim p_0(\boldsymbol{x}_0)p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)}\left[\left\Vert \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) - \frac{\boldsymbol{x}_t - \boldsymbol{x}_0}{t}\right\Vert^2\right] \end{equation} And that’s it! The result is completely consistent with the “linear trajectory” example in “Generative Diffusion Models (14): General Steps for Constructing ODEs (Part 2)”. This is the primary model studied in the original paper, known as “Rectified Flow.”
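The equivalence of the two forms of the objective follows from \boldsymbol{x}_1 - \boldsymbol{x}_0 = (\boldsymbol{x}_t - \boldsymbol{x}_0)/t on the linear trajectory; a quick numeric check (our own illustration, with a placeholder model standing in for \boldsymbol{v}_{\boldsymbol{\theta}}):

```python
import numpy as np

rng = np.random.default_rng(0)
B, D = 256, 2
x0 = rng.normal(size=(B, D))            # x_0 ~ p_0
x1 = rng.normal(size=(B, D)) + 3.0      # x_1 ~ p_1 (shifted Gaussian, for variety)
t = rng.uniform(0.1, 1.0, size=(B, 1))
xt = (x1 - x0) * t + x0                 # the linear trajectory

def v(x, t):
    """Placeholder for v_theta; any function gives the same equivalence."""
    return np.zeros_like(x)

# The two forms of the Rectified Flow objective on this batch:
loss_a = np.mean(np.sum((v(xt, t) - (x1 - x0))**2, axis=-1))
loss_b = np.mean(np.sum((v(xt, t) - (xt - x0) / t)**2, axis=-1))
```

Both expressions evaluate to the same number for every batch, which is why the second form (written purely in terms of \boldsymbol{x}_0 and \boldsymbol{x}_t) can be used for training.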

From this linear example, one can see that the steps to construct a diffusion ODE using this approach take only a few lines. Compared to previous processes, it is greatly simplified—so simple that it almost feels unbelievable, as if it overturns one’s impression of diffusion models.

Proof Process

However, the conclusion in the “Intuitive Results” section so far can only be considered an intuitive guess, as we have not yet theoretically proven that the ODE [eq:s-ode] obtained by optimizing objective [eq:objective] indeed achieves the transformation between distributions p_T(\boldsymbol{x}_T) and p_0(\boldsymbol{x}_0).

To prove this, my initial thought was to show that the optimal solution of objective [eq:objective] satisfies the continuity equation: \begin{equation} \frac{\partial p_t(\boldsymbol{x}_t)}{\partial t} = -\nabla_{\boldsymbol{x}_t}\cdot\big(p_t(\boldsymbol{x}_t)\boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)\big) \end{equation} If it does, then according to the relationship between the continuity equation and ODEs (refer to “Generative Diffusion Models (12): ’Hardcore’ Diffusion ODE” and “Deriving the Continuity Equation and Fokker-Planck Equation via the Test Function Method”), Eq. [eq:s-ode] is indeed a transformation between p_T(\boldsymbol{x}_T) and p_0(\boldsymbol{x}_0).

But thinking more carefully, this path seems a bit roundabout. According to the article “Deriving the Continuity Equation and Fokker-Planck Equation via the Test Function Method”, the continuity equation itself is derived from the ODE via: \begin{equation} \mathbb{E}_{\boldsymbol{x}_{t+\Delta t}}\left[\phi(\boldsymbol{x}_{t+\Delta t})\right] = \mathbb{E}_{\boldsymbol{x}_t}\left[\phi(\boldsymbol{x}_t + \boldsymbol{f}_t(\boldsymbol{x}_t)\Delta t)\right]\label{eq:base} \end{equation} Therefore, Eq. [eq:base] is more fundamental. We only need to prove that the optimal solution of [eq:objective] satisfies it. That is, we want to find a function \boldsymbol{f}_t(\boldsymbol{x}_t) that depends purely on \boldsymbol{x}_t and satisfies [eq:base], and then discover that it is exactly the optimal solution of [eq:objective].

Thus, we write (for simplicity, \boldsymbol{\varphi}_t(\boldsymbol{x}_0,\boldsymbol{x}_T) is abbreviated as \boldsymbol{\varphi}_t): \begin{equation} \begin{aligned} \mathbb{E}_{\boldsymbol{x}_{t+\Delta t}}\left[\phi(\boldsymbol{x}_{t+\Delta t})\right] =&\, \mathbb{E}_{\boldsymbol{x}_0, \boldsymbol{x}_T}\left[\phi(\boldsymbol{\varphi}_{t+\Delta t})\right] \\ =&\, \mathbb{E}_{\boldsymbol{x}_0, \boldsymbol{x}_T}\left[\phi(\boldsymbol{\varphi}_t) + \Delta t\,\frac{\partial \boldsymbol{\varphi}_t}{\partial t}\cdot\nabla_{\boldsymbol{\varphi}_t}\phi(\boldsymbol{\varphi}_t)\right] \\ =&\, \mathbb{E}_{\boldsymbol{x}_0, \boldsymbol{x}_T}\left[\phi(\boldsymbol{x}_t)\right] + \Delta t\,\mathbb{E}_{\boldsymbol{x}_0, \boldsymbol{x}_T}\left[\frac{\partial \boldsymbol{\varphi}_t}{\partial t}\cdot\nabla_{\boldsymbol{x}_t}\phi(\boldsymbol{x}_t)\right] \\ =&\, \mathbb{E}_{\boldsymbol{x}_t}\left[\phi(\boldsymbol{x}_t)\right] + \Delta t\,\mathbb{E}_{\boldsymbol{x}_0, \boldsymbol{x}_T}\left[\frac{\partial \boldsymbol{\varphi}_t}{\partial t}\cdot\nabla_{\boldsymbol{x}_t}\phi(\boldsymbol{x}_t)\right] \\ \end{aligned} \end{equation} where the first equality comes from Eq. [eq:track], the second is a first-order Taylor expansion, the third is again from Eq. [eq:track], and the fourth is because \boldsymbol{x}_t is a deterministic function of \boldsymbol{x}_0, \boldsymbol{x}_T, so the expectation over \boldsymbol{x}_0, \boldsymbol{x}_T is the expectation over \boldsymbol{x}_t.

We see that \frac{\partial \boldsymbol{\varphi}_t}{\partial t} is a function of \boldsymbol{x}_0, \boldsymbol{x}_T. Next, we make an assumption: Eq. [eq:track] is invertible with respect to \boldsymbol{x}_T. This assumption implies we can solve for \boldsymbol{x}_T = \boldsymbol{\psi}_t(\boldsymbol{x}_0, \boldsymbol{x}_t) from Eq. [eq:track]. This result can be substituted into \frac{\partial \boldsymbol{\varphi}_t}{\partial t}, making it a function of \boldsymbol{x}_0, \boldsymbol{x}_t. Thus we have: \begin{equation} \begin{aligned} \mathbb{E}_{\boldsymbol{x}_{t+\Delta t}}\left[\phi(\boldsymbol{x}_{t+\Delta t})\right] =&\, \mathbb{E}_{\boldsymbol{x}_t}\left[\phi(\boldsymbol{x}_t)\right] + \Delta t\,\mathbb{E}_{\boldsymbol{x}_0, \boldsymbol{x}_T}\left[\frac{\partial \boldsymbol{\varphi}_t}{\partial t}\cdot\nabla_{\boldsymbol{x}_t}\phi(\boldsymbol{x}_t)\right] \\ =&\, \mathbb{E}_{\boldsymbol{x}_t}\left[\phi(\boldsymbol{x}_t)\right] + \Delta t\,\mathbb{E}_{\boldsymbol{x}_0, \boldsymbol{x}_t}\left[\frac{\partial \boldsymbol{\varphi}_t}{\partial t}\cdot\nabla_{\boldsymbol{x}_t}\phi(\boldsymbol{x}_t)\right] \\ =&\, \mathbb{E}_{\boldsymbol{x}_t}\left[\phi(\boldsymbol{x}_t)\right] + \Delta t\,\mathbb{E}_{\boldsymbol{x}_t}\left[\underbrace{\mathbb{E}_{\boldsymbol{x}_0|\boldsymbol{x}_t}\left[\frac{\partial \boldsymbol{\varphi}_t}{\partial t}\right]}_{\text{Function of } \boldsymbol{x}_t}\cdot\nabla_{\boldsymbol{x}_t}\phi(\boldsymbol{x}_t)\right] \\ =&\, \mathbb{E}_{\boldsymbol{x}_t}\left[\phi\left(\boldsymbol{x}_t + \Delta t\,\mathbb{E}_{\boldsymbol{x}_0|\boldsymbol{x}_t}\left[\frac{\partial \boldsymbol{\varphi}_t}{\partial t}\right]\right)\right] \end{aligned} \end{equation} The second equality holds because \frac{\partial \boldsymbol{\varphi}_t}{\partial t} has been rewritten as a function of \boldsymbol{x}_0, \boldsymbol{x}_t, so the random variables in the second expectation are changed to \boldsymbol{x}_0, \boldsymbol{x}_t. 
The third equality is equivalent to the decomposition p(\boldsymbol{x}_0, \boldsymbol{x}_t) = p(\boldsymbol{x}_0|\boldsymbol{x}_t)p(\boldsymbol{x}_t). Here \boldsymbol{x}_0 and \boldsymbol{x}_t are not independent, so we denote \boldsymbol{x}_0|\boldsymbol{x}_t. Note that while \frac{\partial \boldsymbol{\varphi}_t}{\partial t} was originally a function of \boldsymbol{x}_0, \boldsymbol{x}_t, after taking the expectation over \boldsymbol{x}_0, the only remaining independent variable is \boldsymbol{x}_t. As we will see, this is exactly the function of \boldsymbol{x}_t we are looking for! The fourth equality uses the Taylor expansion formula to merge the two terms back together.
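For the linear trajectory of the "Simple Example" section, the invertibility assumption is easy to verify explicitly: solving \boldsymbol{x}_t = (\boldsymbol{x}_1 - \boldsymbol{x}_0)t + \boldsymbol{x}_0 for \boldsymbol{x}_1 gives \boldsymbol{\psi}_t(\boldsymbol{x}_0, \boldsymbol{x}_t) = (\boldsymbol{x}_t - (1-t)\boldsymbol{x}_0)/t. A minimal numeric check of this inversion (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
x0 = rng.normal(size=5)
x1 = rng.normal(size=5)
t = 0.7
xt = (x1 - x0) * t + x0             # phi_t: forward trajectory
x1_rec = (xt - (1 - t) * x0) / t    # psi_t: recover x_1 from (x_0, x_t)
```

Note the inversion degenerates at t=0, where \boldsymbol{x}_t carries no information about \boldsymbol{x}_1; this is consistent with the boundary condition \boldsymbol{x}_0 = \boldsymbol{\varphi}_0(\boldsymbol{x}_0, \boldsymbol{x}_1).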

Now, we have obtained: \begin{equation} \mathbb{E}_{\boldsymbol{x}_{t+\Delta t}}\left[\phi(\boldsymbol{x}_{t+\Delta t})\right] = \mathbb{E}_{\boldsymbol{x}_t}\left[\phi\left(\boldsymbol{x}_t + \Delta t\,\mathbb{E}_{\boldsymbol{x}_0|\boldsymbol{x}_t}\left[\frac{\partial \boldsymbol{\varphi}_t}{\partial t}\right]\right)\right] \end{equation} Since this holds for any test function \phi, it implies: \begin{equation} \boldsymbol{x}_{t+\Delta t} = \boldsymbol{x}_t + \Delta t\,\mathbb{E}_{\boldsymbol{x}_0|\boldsymbol{x}_t}\left[\frac{\partial \boldsymbol{\varphi}_t}{\partial t}\right]\quad\Rightarrow\quad\frac{d\boldsymbol{x}_t}{dt} = \mathbb{E}_{\boldsymbol{x}_0|\boldsymbol{x}_t}\left[\frac{\partial \boldsymbol{\varphi}_t}{\partial t}\right]\label{eq:real-ode} \end{equation} This is the ODE we were seeking. According to the property: \begin{equation} \mathbb{E}_{\boldsymbol{x}}[\boldsymbol{x}] = \mathop{\text{argmin}}_{\boldsymbol{\mu}}\mathbb{E}_{\boldsymbol{x}}\left[\Vert \boldsymbol{x} - \boldsymbol{\mu}\Vert^2\right]\label{eq:mean-opt} \end{equation} The right-hand side of Eq. [eq:real-ode] is exactly the optimal solution to the training objective [eq:objective]. This proves that optimizing the training objective [eq:objective] to obtain Eq. [eq:s-ode] indeed implements the transformation between distributions p_T(\boldsymbol{x}_T) and p_0(\boldsymbol{x}_0).
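Property [eq:mean-opt] says the mean is the minimizer of the expected squared error; a small numeric illustration (our own, via a brute-force grid search over \boldsymbol{\mu} in 1D):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, size=100_000)   # samples of a random variable x

# Grid search for the mu minimizing E[ (x - mu)^2 ]
mus = np.linspace(0.0, 4.0, 401)
mse = np.array([np.mean((x - m)**2) for m in mus])
best = float(mus[np.argmin(mse)])       # should land next to the sample mean
```

The same argument applied conditionally on \boldsymbol{x}_t is what identifies the optimum of objective [eq:objective] with the conditional expectation on the right-hand side of Eq. [eq:real-ode].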

Reflections

Regarding the construction of diffusion ODEs mentioned in the “Intuitive Results” section, the authors of the original paper also wrote a Zhihu column article “[ICLR2023] A New Method for Diffusion Generative Models: Extreme Simplification, One-Step Generation”, which I recommend reading. I first learned about this method through that column and was deeply shocked and impressed by it.

If you have read “Generative Diffusion Models (14): General Steps for Constructing ODEs (Part 2)”, you will appreciate even more how simple and direct this approach is, and you will understand why I am so generous with my praise. To be honest, when I was writing “Part 2” (then labeled as the final part), I had considered the trajectory described by Eq. [eq:track]. However, within the framework I had at the time, I couldn’t figure out how to proceed, and it ended in failure. I never imagined it could be carried out in such a concise manner. Writing this series on diffusion ODEs truly makes one feel that “comparisons are odious”; Part 2 and Part 3 are the best witnesses to my own intellect being repeatedly subjected to a “dimensionality reduction attack.”

Readers might wonder if there will be a fourth part that is even simpler, causing me to experience another dimensionality reduction attack. It’s possible, but the probability is very small. It is truly hard to imagine a construction process simpler than this. The “Intuitive Results” section looks long, but the actual steps are only two: 1. Choose an interpolating trajectory connecting \boldsymbol{x}_0 and \boldsymbol{x}_T; 2. Use a function of \boldsymbol{x}_t to approximate the derivative of that trajectory with respect to t. With just these two steps, how could it be simplified further? Even the derivation in the “Proof Process” section is quite simple; although it is written out at length, it essentially involves taking a derivative and changing the distribution of the expectation—much simpler than the processes in the previous two parts. In short, any reader who has personally completed the derivations for the first two parts of the ODE diffusion series will deeply feel that the logic in this part is so simple that it feels like it cannot be simplified any further.

Furthermore, in addition to providing a simple idea for constructing diffusion ODEs, the original paper discusses the connection between Rectified Flow and optimal transport, and how to use this connection to accelerate the sampling process, among other things. However, those parts are not the main focus of this article, so we may discuss them in the future if the opportunity arises.

Summary

This article introduced an extremely simple and intuitive approach to constructing ODE-based diffusion models as proposed in the Rectified Flow paper, and provided a derivation for it.

Reprinting notice: Please include the original address of this article: https://kexue.fm/archives/9497