As is well known, slow generation speed has always been a pain point for diffusion models. To solve this problem, researchers have "crossed the sea like the Eight Immortals, each showing their own prowess," proposing a wide variety of solutions. However, for a long time, no single work has managed to stand out and become the standard. What kind of work could reach such a standard? In my view, it must satisfy at least a few conditions:
Clear mathematical principles that reveal the essence of fast generation;
Ability to be trained from scratch with a single objective, without requiring extra means like adversarial training or distillation;
Single-step generation performance close to SOTA, with the ability to improve results by increasing the number of steps.
Based on my reading experience, almost no work has satisfied all three criteria simultaneously. However, just a few days ago, a paper appeared on arXiv titled "Mean Flows for One-step Generative Modeling" (referred to as "MeanFlow"), which seems very promising. Next, we will take this as an opportunity to discuss the relevant ideas and progress.
Existing Approaches
There has already been a great deal of work on accelerating diffusion model generation, some of which has been briefly introduced in this blog before. Generally speaking, acceleration strategies can be divided into three categories.
First, converting the diffusion model into an SDE/ODE and then researching more efficient solvers. Representative works include DPM-Solver and its subsequent improvements. However, this approach usually only reduces the NFE (Number of Function Evaluations) to around 10; any lower significantly degrades generation quality. This is because a solver's error typically scales as a power of the step size: when the NFE is very small, the step size cannot be made small enough, so the solver does not converge accurately enough to be usable.
Second, converting a pre-trained diffusion model into a generator with fewer steps through distillation. Many works and schemes have emerged from this, including the SiD scheme we introduced previously. Distillation is a relatively standard and general approach, but its common drawback is the requirement for additional training costs; it is not a "from-scratch" training solution. Some works, in order to distill into a single-step generator, also add multiple optimization strategies like adversarial training, making the entire scheme overly complex.
Third, approaches based on Consistency Models (CM), including the CM we briefly introduced in "Generative Diffusion Model Chat (28): Understanding Consistency Models Step-by-Step", its continuous version sCM, and CTM. CM is a self-contained approach that can be trained from scratch to obtain models with very small NFE, or used for distillation. However, the CM objective relies on EMA or stop_gradient operations, meaning it is coupled with optimizer dynamics, which often leaves one with a vague, "indescribable" feeling.
Instantaneous Velocity
So far, the diffusion models with the smallest generation NFE are basically ODEs, because deterministic models are often easier to analyze and solve. This article also focuses only on ODE-style diffusion. The framework used is ReFlow, introduced in "Generative Diffusion Model Chat (17): General Steps for Constructing ODEs (Part 2)", which is essentially consistent with Flow Matching but more intuitive.
ODE-style diffusion aims to learn an ODE: \begin{equation} \frac{d\boldsymbol{x}_t}{dt} = \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)\label{eq:ode} \end{equation} to construct a transformation \boldsymbol{x}_1 \to \boldsymbol{x}_0. Specifically, let \boldsymbol{x}_1 \sim p_1(\boldsymbol{x}_1) be a random noise that is easy to sample, and \boldsymbol{x}_0 \sim p_0(\boldsymbol{x}_0) be a real sample from the target distribution. We hope to achieve the transformation from random noise to target samples through the above ODE. That is, by sampling \boldsymbol{x}_1 \sim p_1(\boldsymbol{x}_1) as an initial value, the \boldsymbol{x}_0 obtained by solving the ODE is a sample of p_0(\boldsymbol{x}_0).
If we view t as time and \boldsymbol{x}_t as displacement, then d\boldsymbol{x}_t/dt is the instantaneous velocity. Thus, ODE-style diffusion is the modeling of instantaneous velocity. How do we train \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)? ReFlow proposes a very intuitive method: first, construct any interpolation between \boldsymbol{x}_0 and \boldsymbol{x}_1, such as the simplest linear interpolation \boldsymbol{x}_t = (1-t)\boldsymbol{x}_0 + t \boldsymbol{x}_1. Taking the derivative with respect to t gives: \begin{equation} \frac{d\boldsymbol{x}_t}{dt} = \boldsymbol{x}_1 - \boldsymbol{x}_0 \end{equation} This is an extremely simple ODE, but it does not meet our requirements because \boldsymbol{x}_0 is our target and should not appear in the ODE. To address this, ReFlow proposes a very intuitive idea—use \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t) to approximate \boldsymbol{x}_1 - \boldsymbol{x}_0: \begin{equation} \mathbb{E}_{t,\boldsymbol{x}_0,\boldsymbol{x}_1}\left[\Vert\boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t) - (\boldsymbol{x}_1 - \boldsymbol{x}_0)\Vert^2\right]\label{eq:loss-reflow} \end{equation} This is the objective function of ReFlow. It is worth noting that: 1) ReFlow theoretically allows any interpolation method between \boldsymbol{x}_0 and \boldsymbol{x}_1; 2) Although ReFlow is intuitive, it is also theoretically rigorous; it can be proven that its optimal solution is indeed the ODE we seek. For details, please refer to the original paper and the aforementioned blog post.
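As a concrete illustration, the ReFlow objective can be estimated by Monte Carlo over a batch of (noise, data, time) triples. Below is a minimal JAX sketch; the model signature (params, x_t, t) → velocity and all variable names are my own assumptions for illustration, not code from the paper:

```python
import jax
import jax.numpy as jnp

def reflow_loss(v, params, x0, x1, t):
    """Monte-Carlo estimate of the ReFlow objective for one batch.
    v is a hypothetical velocity model with signature (params, x_t, t)."""
    xt = (1.0 - t)[:, None] * x0 + t[:, None] * x1  # linear interpolation x_t
    target = x1 - x0                                # regression target dx_t/dt
    return jnp.mean(jnp.sum((v(params, xt, t) - target) ** 2, axis=-1))

# Toy check: a "model" that always predicts its parameter vector.
v = lambda params, xt, t: jnp.broadcast_to(params, xt.shape)
x0 = jax.random.normal(jax.random.PRNGKey(0), (4, 2))  # stand-in "data"
x1 = jax.random.normal(jax.random.PRNGKey(1), (4, 2))  # stand-in "noise"
t = jnp.array([0.1, 0.4, 0.6, 0.9])
loss = reflow_loss(v, jnp.zeros(2), x0, x1, t)
```

In practice `v` would be a neural network and the expectation is taken over mini-batches of real samples, noise, and uniformly sampled t.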
Average Velocity
However, an ODE is merely a pure mathematical form. Actual solving still requires discretization, such as the simplest Euler method: \begin{equation} \boldsymbol{x}_{t - \Delta t} = \boldsymbol{x}_t - \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t) \Delta t \end{equation} The NFE from 1 to 0 is 1/\Delta t. Wanting a small NFE is equivalent to a large \Delta t. However, the theoretical basis of ReFlow is an exact ODE, meaning target sample generation is achieved only when the ODE is solved exactly. This implies that \Delta t should be as small as possible, which contradicts our expectations. Although ReFlow claims that using straight-line interpolation can make the ODE trajectory straighter, allowing for a larger \Delta t, the actual trajectory is ultimately curved. It is difficult for \Delta t to approach 1, so ReFlow struggles to achieve single-step generation.
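The Euler discretization above is easy to state in code. The sketch below is generic (the velocity function `v` is arbitrary and hypothetical); the toy check uses a perfectly straight field, for which even a single Euler step is exact:

```python
import jax.numpy as jnp

def euler_sample(v, x1, n_steps):
    """Integrate dx/dt = v(x, t) from t=1 down to t=0 with the Euler method.
    NFE equals n_steps, i.e. 1/dt."""
    dt = 1.0 / n_steps
    x, t = x1, 1.0
    for _ in range(n_steps):
        x = x - v(x, t) * dt  # x_{t-dt} = x_t - v(x_t, t) * dt
        t = t - dt
    return x

# For a perfectly straight field v = x1 - x0, one Euler step is already exact;
# real learned fields are curved, which is why small NFE degrades quality.
x0 = jnp.array([1.0, -2.0])
x1 = jnp.array([0.5, 0.5])
straight_v = lambda x, t: x1 - x0
```

For a curved trajectory the per-step error accumulates, which is exactly the obstacle to taking Δt close to 1.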
Ultimately, an ODE is inherently something where \Delta t \to 0. Insisting on using it for \Delta t \to 1 while demanding high performance is "forcing the model’s hand." Therefore, changing the modeling target, rather than continuing to "burden" the model, is the essential idea for achieving faster generation. To this end, we consider integrating both sides of Eq. [eq:ode]: \begin{equation} \boldsymbol{x}_t - \boldsymbol{x}_r = \int_r^t \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_{\tau},\tau) d\tau = (t-r)\times \frac{1}{t-r}\int_r^t \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_{\tau},\tau) d\tau \end{equation} If we can model: \begin{equation} \boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) \triangleq \frac{1}{t-r}\int_r^t \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_{\tau},\tau) d\tau \end{equation} then we have \boldsymbol{x}_0 = \boldsymbol{x}_1 - \boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_1, 0, 1), which theoretically allows for precise single-step generation without resorting to approximate relationships. If \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t) is the instantaneous velocity at time t, then clearly \boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) is the average velocity over the time interval [r, t]. In other words, to accelerate generation or even achieve single-step generation, our modeling target should be the average velocity, not the instantaneous velocity of the ODE.
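Rearranging \boldsymbol{x}_t - \boldsymbol{x}_r = (t-r)\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) gives the update x_r = x_t - (t - r) u(x_t, r, t), which covers single-step and multi-step sampling uniformly. A minimal sketch, assuming a hypothetical average-velocity function `u(x, r, t)`:

```python
import jax.numpy as jnp

def average_velocity_sample(u, x1, steps):
    """Sample with an average-velocity model via x_r = x_t - (t-r) u(x_t, r, t).
    steps=1 is exactly single-step generation x_0 = x_1 - u(x_1, 0, 1)."""
    ts = jnp.linspace(1.0, 0.0, steps + 1)
    x = x1
    for t, r in zip(ts[:-1], ts[1:]):
        x = x - (t - r) * u(x, r, t)
    return x

# Toy check: along the linear trajectory of a single pair (x0, x1), the exact
# average velocity over any interval is the constant x1 - x0.
x0 = jnp.array([2.0, -1.0])
x1 = jnp.array([0.0, 3.0])
exact_u = lambda x, r, t: x1 - x0
```

Note there is no approximation in this update itself: if u is learned exactly, steps=1 already lands on the data distribution.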
Identity Transformation
Of course, the shift from instantaneous velocity to average velocity is not hard to imagine. The truly difficult part is how to construct a loss function for it. ReFlow only tells us how to build a loss function for instantaneous velocity; we know nothing about training for average velocity.
A natural next thought is to "transform the unknown into the known"—that is, use the average velocity \boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) as a starting point to construct the instantaneous velocity \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t), and then substitute it into the ReFlow objective function. This requires us to derive the identity transformation between the two. From the definition of \boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t), we get: \begin{equation} \int_r^t \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_{\tau},\tau) d\tau = (t-r)\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) \end{equation} Differentiating both sides with respect to t, we get: \begin{equation} \begin{aligned} \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t) =&\, \boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) + (t-r)\frac{d}{dt}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) \\ =&\, \boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) + (t-r)\left[\frac{d\boldsymbol{x}_t}{dt}\cdot\frac{\partial}{\partial \boldsymbol{x}_t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) + \frac{\partial}{\partial t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t)\right] \end{aligned}\label{eq:id1} \end{equation} This is the first identity relationship between \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t) and \boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t). Where there is a first, there is naturally a second. 
The second identity relationship is obtained from the definition of average velocity: \begin{equation} \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t) = \lim_{r\to t}\frac{1}{t-r}\int_r^t \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_{\tau},\tau) d\tau = \boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t, t)\label{eq:id2} \end{equation} Simply put, the average velocity over an infinitesimal interval equals the instantaneous velocity.
First Objective
Based on d\boldsymbol{x}_t/dt = \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t) and identity [eq:id2], we can replace d\boldsymbol{x}_t/dt in identity [eq:id1] with \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t) or \boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t, t). The former is an implicit relationship, which we will discuss later; let’s look at the latter first. In this case, we have: \begin{equation} \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t) = \boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) + (t-r)\left[\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t, t)\cdot\frac{\partial}{\partial \boldsymbol{x}_t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) + \frac{\partial}{\partial t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t)\right] \end{equation} Substituting this into ReFlow, we obtain the first objective function that can be used to train \boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t): \begin{equation} \mathbb{E}_{r,t,\boldsymbol{x}_0,\boldsymbol{x}_1}\left[\left\Vert\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) + (t-r)\left[\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t, t)\cdot\frac{\partial}{\partial \boldsymbol{x}_t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) + \frac{\partial}{\partial t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t)\right] - (\boldsymbol{x}_1 - \boldsymbol{x}_0)\right\Vert^2\right]\label{eq:loss-1} \end{equation} This is a very ideal result. It satisfies all our expectations for a generative model objective function:
A single explicit minimization target;
No EMA, stop_gradient, or similar operations;
Theoretically guaranteed (via ReFlow).
These characteristics mean that no matter what optimization algorithm we use, as long as we can find the minimum point of the above equation, it will be the average velocity model we want—that is, a generative model that can theoretically achieve single-step generation. In other words, it possesses the training simplicity and theoretical guarantees of diffusion models, while being able to generate in one step like a GAN, without needing to pray that the model doesn’t "lose its mind" and collapse during training.
JVP Operation
However, for some readers, implementing objective function [eq:loss-1] might still be a bit difficult because it involves the "Jacobian-Vector Product (JVP)," which is relatively uncommon for average users. Specifically, we can write the part inside the square brackets of the objective function as: \begin{equation} \underbrace{\left[\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t, t),0,1\right]}_{\text{Vector}} \cdot \underbrace{\left[\frac{\partial}{\partial \boldsymbol{x}_t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t), \frac{\partial}{\partial r}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t), \frac{\partial}{\partial t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t)\right]}_{\text{Jacobian Matrix}} \end{equation} This is the multiplication of the Jacobian matrix of \boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) with a given vector [\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t, t), 0, 1]. The result is a vector of the same size as \boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t). This operation is called JVP. There are ready-made implementations in Jax and Torch. For example, the reference code in Jax is:
u = lambda xt, r, t: diffusion_model(weights, [xt, r, t])
urt, durt = jax.jvp(u, (xt, r, t), (u(xt, t, t), r * 0, t * 0 + 1))

Where urt is \boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t), and durt is the corresponding JVP result. The usage in Torch is similar. Once the JVP operation is understood, implementing objective function [eq:loss-1] is essentially straightforward.
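Putting the JVP together with the rest of objective [eq:loss-1], a single-sample sketch might look as follows. Here `model` is a hypothetical average-velocity network with signature (params, x_t, r, t) → u (my naming, not the paper's code), and the toy check uses a constant model, whose JVP term vanishes:

```python
import jax
import jax.numpy as jnp

def first_objective(params, model, x0, x1, r, t):
    """Sketch of objective [eq:loss-1] for one sample; model is hypothetical."""
    xt = (1.0 - t) * x0 + t * x1
    u = lambda xt_, r_, t_: model(params, xt_, r_, t_)
    u_tt = u(xt, t, t)  # instantaneous velocity u(x_t, t, t)
    # JVP along (u_tt, 0, 1) computes u_tt . du/dx_t + du/dt at (x_t, r, t)
    urt, durt = jax.jvp(u, (xt, r, t),
                        (u_tt, jnp.zeros_like(r), jnp.ones_like(t)))
    return jnp.sum((urt + (t - r) * durt - (x1 - x0)) ** 2)

# Toy check: a constant "model" that already outputs x1 - x0 has zero loss,
# since the JVP of a constant function is zero.
x0 = jnp.array([1.0, 2.0])
x1 = jnp.array([-1.0, 0.5])
model = lambda params, xt, r, t: params
loss = first_objective(x1 - x0, model, x0, x1, jnp.array(0.2), jnp.array(0.8))
```

In real training, `params` would be network weights and the loss would be averaged over batches of (x0, x1, r, t).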
Second Objective
If there is a disadvantage to objective function [eq:loss-1], in my view, it is only that the computational cost is relatively high. It requires two different forward passes, \boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) and \boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t, t), and the JVP itself is a derivative computation. When optimizing with gradient descent, yet another gradient must be taken on top of it, so it essentially requires second-order gradients, similar to the gradient penalty in WGAN-GP.
To reduce the computational cost, we can consider adding a stop_gradient operation (\color{skyblue}{\mathop{\text{sg}}\left[\color{blue}{\cdot}\right]}) to the JVP part: \begin{equation} \mathbb{E}_{r,t,\boldsymbol{x}_0,\boldsymbol{x}_1}\left[\left\Vert\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) + (t-r)\color{skyblue}{\mathop{\text{sg}}\left[\color{blue}{\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t, t)\cdot\frac{\partial}{\partial \boldsymbol{x}_t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) + \frac{\partial}{\partial t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t)}\right]} - (\boldsymbol{x}_1 - \boldsymbol{x}_0)\right\Vert^2\right]\label{eq:loss-2} \end{equation} This avoids taking the gradient of the JVP again (though it still requires two forward passes). Experimental results show that, compared to the first objective [eq:loss-1], the above objective trains nearly twice as fast under gradient optimizers, with no apparent loss in quality.
Note that the stop_gradient here serves purely to reduce computation; the actual optimization direction is still to minimize the loss value. This is different from the CM series, especially sCM, whose loss functions are only equivalent in gradient and are not necessarily better when smaller. Their stop_gradient is often mandatory; removing it would almost certainly lead to training collapse.
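In JAX, this modification is a one-line jax.lax.stop_gradient around the JVP output. A sketch under the same hypothetical model signature as before (names are illustrative):

```python
import jax
import jax.numpy as jnp

def second_objective(params, model, x0, x1, r, t):
    """Sketch of objective [eq:loss-2] for one sample; model is hypothetical."""
    xt = (1.0 - t) * x0 + t * x1
    u = lambda xt_, r_, t_: model(params, xt_, r_, t_)
    # same JVP as the first objective ...
    urt, durt = jax.jvp(u, (xt, r, t),
                        (u(xt, t, t), jnp.zeros_like(r), jnp.ones_like(t)))
    # ... but wrapped in sg[.], so no gradient flows through the JVP
    durt = jax.lax.stop_gradient(durt)
    return jnp.sum((urt + (t - r) * durt - (x1 - x0)) ** 2)

# The loss is still an explicit scalar, so first-order jax.grad suffices.
model = lambda params, xt, r, t: params * jnp.ones_like(xt)
x0 = jnp.array([0.5, -0.5])
x1 = jnp.array([1.0, 1.0])
g = jax.grad(second_objective)(jnp.array(0.3), model, x0, x1,
                               jnp.array(0.1), jnp.array(0.9))
```

The only change from the first objective is the stop_gradient line; everything else, including both forward passes, is identical.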
Third Objective
Earlier, we mentioned that another way to handle d\boldsymbol{x}_t/dt in identity [eq:id1] is to replace it with \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t), which leads to: \begin{equation} \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t) = \boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) + (t-r)\left[\boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t)\cdot\frac{\partial}{\partial \boldsymbol{x}_t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) + \frac{\partial}{\partial t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t)\right] \end{equation} If we were to solve for \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t), the result would be: \begin{equation} \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t) = \left[\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) + (t-r)\frac{\partial}{\partial t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t)\right]\cdot\left[\boldsymbol{I} - (t-r)\frac{\partial}{\partial \boldsymbol{x}_t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t)\right]^{-1} \end{equation} This involves a massive matrix inversion, which is impractical. MeanFlow provides a compromise: since the regression target of d\boldsymbol{x}_t/dt = \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t) is \boldsymbol{x}_1 - \boldsymbol{x}_0, why not just replace d\boldsymbol{x}_t/dt with \boldsymbol{x}_1 - \boldsymbol{x}_0?
Thus, the objective function becomes: \begin{equation} \mathbb{E}_{r,t,\boldsymbol{x}_0,\boldsymbol{x}_1}\left[\left\Vert\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) + (t-r)\left[(\boldsymbol{x}_1-\boldsymbol{x}_0)\cdot\frac{\partial}{\partial \boldsymbol{x}_t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) + \frac{\partial}{\partial t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t)\right] - (\boldsymbol{x}_1 - \boldsymbol{x}_0)\right\Vert^2\right] \end{equation} However, at this point \boldsymbol{x}_1 - \boldsymbol{x}_0 is both the regression target and appears in the definition of the model \boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t,t), which inevitably feels like "label leakage." To avoid this, MeanFlow also adopts the method of adding stop_gradient to the JVP part: \begin{equation} \mathbb{E}_{r,t,\boldsymbol{x}_0,\boldsymbol{x}_1}\left[\left\Vert\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) + (t-r)\color{skyblue}{\mathop{\text{sg}}\left[\color{blue}{(\boldsymbol{x}_1-\boldsymbol{x}_0)\cdot\frac{\partial}{\partial \boldsymbol{x}_t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) + \frac{\partial}{\partial t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t)}\right]} - (\boldsymbol{x}_1 - \boldsymbol{x}_0)\right\Vert^2\right]\label{eq:loss-3} \end{equation} This is the final loss function used by MeanFlow, which we call the "third objective." Compared to the second objective [eq:loss-2], it requires one fewer forward pass \boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t, t), so training is faster. But the introduction of "label leakage" and the stop_gradient countermeasure mean that the training of the third objective is coupled with the gradient optimizer, which, like CM, adds a bit of indescribable mystery.
The paper’s experimental results show that the objective [eq:loss-3] with \color{skyblue}{\mathop{\text{sg}}\left[\color{blue}{\cdot}\right]} can train reasonable results. What if it is removed? I asked the author, and he indicated that after removing \color{skyblue}{\mathop{\text{sg}}\left[\color{blue}{\cdot}\right]}, training still converges and multi-step generation is possible, but the single-step generation capability is lost. This is actually not hard to understand, because when r=t, regardless of whether \color{skyblue}{\mathop{\text{sg}}\left[\color{blue}{\cdot}\right]} is present, the objective function reduces to ReFlow: \begin{equation} \mathbb{E}_{t,\boldsymbol{x}_0,\boldsymbol{x}_1}\left[\Vert\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t, t) - (\boldsymbol{x}_1 - \boldsymbol{x}_0)\Vert^2\right]\label{eq:loss-reflow-2} \end{equation} In other words, MeanFlow always has ReFlow as a "safety net," so it won’t be too bad. However, after removing \color{skyblue}{\mathop{\text{sg}}\left[\color{blue}{\cdot}\right]}, the negative impact of "label leakage" intensifies, making it inferior to keeping it.
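For completeness, here is a sketch of the third objective [eq:loss-3]. Compared with the second objective, the tangent direction in the JVP becomes x_1 - x_0, so the extra forward pass u(x_t, t, t) disappears. As before, the `model` signature (params, x_t, r, t) → u is my own illustrative assumption:

```python
import jax
import jax.numpy as jnp

def third_objective(params, model, x0, x1, r, t):
    """Sketch of MeanFlow's objective [eq:loss-3] for one sample."""
    xt = (1.0 - t) * x0 + t * x1
    u = lambda xt_, r_, t_: model(params, xt_, r_, t_)
    # tangent direction is (x1 - x0, 0, 1): no second forward pass needed,
    # since the training pair supplies x1 - x0 for free
    urt, durt = jax.jvp(u, (xt, r, t),
                        (x1 - x0, jnp.zeros_like(r), jnp.ones_like(t)))
    durt = jax.lax.stop_gradient(durt)  # sg[.] around the JVP part
    return jnp.sum((urt + (t - r) * durt - (x1 - x0)) ** 2)

# Toy check: a constant model that already outputs x1 - x0 gives zero loss
# (its JVP vanishes); note that at r = t the objective reduces to ReFlow.
x0 = jnp.array([1.0, 0.0])
x1 = jnp.array([0.0, 1.0])
model = lambda params, xt, r, t: params
loss = third_objective(x1 - x0, model, x0, x1, jnp.array(0.0), jnp.array(1.0))
```

The r = t "safety net" mentioned above is visible in the code: with r equal to t, the (t - r) term drops out and only the ReFlow residual remains.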
A Proof
Can we theoretically prove, like ReFlow, that the optimal solution of the third objective [eq:loss-3] is indeed our desired average velocity model? Let’s try. First, let’s review two key lemmas used to prove ReFlow:
\mathop{\text{argmin}}_{\boldsymbol{\mu}}\mathbb{E}[\Vert\boldsymbol{\mu} - \boldsymbol{x}\Vert^2] = \mathbb{E}[\boldsymbol{x}], i.e., the optimal solution for minimizing the squared error between \boldsymbol{\mu} and \boldsymbol{x} is the mean of \boldsymbol{x};
The ODE solution that transforms \boldsymbol{x}_1 to \boldsymbol{x}_0 according to the distribution trajectory \boldsymbol{x}_t = (1-t)\boldsymbol{x}_0 + t \boldsymbol{x}_1 is d\boldsymbol{x}_t/dt = \mathbb{E}_{\boldsymbol{x}_0|\boldsymbol{x}_t}[\boldsymbol{x}_1 - \boldsymbol{x}_0].
The proof of Lemma 1 is simple: take the gradient with respect to \boldsymbol{\mu} to get \mathbb{E}[\boldsymbol{\mu} - \boldsymbol{x}] = \boldsymbol{\mu} - \mathbb{E}[\boldsymbol{x}] and set it to zero. The proof details for Lemma 2 can be found in "Generative Diffusion Model Chat (17)", where \mathbb{E}_{\boldsymbol{x}_0|\boldsymbol{x}_t}[\boldsymbol{x}_1 - \boldsymbol{x}_0] requires first using \boldsymbol{x}_t = (1-t)\boldsymbol{x}_0 + t \boldsymbol{x}_1 to eliminate \boldsymbol{x}_1, obtaining a function of \boldsymbol{x}_0, \boldsymbol{x}_t, and then taking the expectation over the distribution p_t(\boldsymbol{x}_0|\boldsymbol{x}_t), resulting in a function of t, \boldsymbol{x}_t.
Using Lemma 1, we can prove that the theoretical optimal solution of the ReFlow objective function [eq:loss-reflow] is \boldsymbol{v}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_t,t) = \mathbb{E}_{\boldsymbol{x}_0|\boldsymbol{x}_t}[\boldsymbol{x}_1 - \boldsymbol{x}_0]. Combined with Lemma 2, we get d\boldsymbol{x}_t/dt = \boldsymbol{v}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_t,t) as our desired ODE. The proof for the third objective [eq:loss-3] is similar. Since it contains \color{skyblue}{\mathop{\text{sg}}\left[\color{blue}{\cdot}\right]}, taking the gradient with respect to \boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, r, t) and setting it to zero gives: \begin{equation} \begin{aligned} \boldsymbol{0} =&\, \boldsymbol{u}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_t, r, t) + \mathbb{E}_{\boldsymbol{x}_0|\boldsymbol{x}_t}\left[(t-r)\left[(\boldsymbol{x}_1-\boldsymbol{x}_0)\cdot\frac{\partial}{\partial \boldsymbol{x}_t}\boldsymbol{u}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_t, r, t) + \frac{\partial}{\partial t}\boldsymbol{u}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_t, r, t)\right] - (\boldsymbol{x}_1 - \boldsymbol{x}_0)\right] \\ =&\, \boldsymbol{u}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_t, r, t) + (t-r)\left[\mathbb{E}_{\boldsymbol{x}_0|\boldsymbol{x}_t}[\boldsymbol{x}_1-\boldsymbol{x}_0]\cdot\frac{\partial}{\partial \boldsymbol{x}_t}\boldsymbol{u}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_t, r, t) + \frac{\partial}{\partial t}\boldsymbol{u}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_t, r, t)\right] - \mathbb{E}_{\boldsymbol{x}_0|\boldsymbol{x}_t}[\boldsymbol{x}_1 - \boldsymbol{x}_0] \\ =&\, \boldsymbol{u}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_t, r, t) + (t-r)\left[\frac{d\boldsymbol{x}_t}{dt}\cdot\frac{\partial}{\partial \boldsymbol{x}_t}\boldsymbol{u}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_t, r, t) + \frac{\partial}{\partial t}\boldsymbol{u}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_t, r, t)\right] - \frac{d\boldsymbol{x}_t}{dt} \\ =&\, \frac{d}{dt}\left[(t - r) 
\boldsymbol{u}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_t, r, t) - (\boldsymbol{x}_t - \boldsymbol{x}_r)\right] \\ \end{aligned} \end{equation} Thus, under appropriate boundary conditions, we have \boldsymbol{x}_t - \boldsymbol{x}_r = (t - r) \boldsymbol{u}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_t, r, t), which is our desired average velocity model.
The key to this process is that the introduction of \color{skyblue}{\mathop{\text{sg}}\left[\color{blue}{\cdot}\right]} avoids taking the gradient of the JVP part, thereby simplifying the gradient expression and yielding the correct result. If \color{skyblue}{\mathop{\text{sg}}\left[\color{blue}{\cdot}\right]} were removed, the right side of the above equation would be multiplied by an additional term—the Jacobian matrix of the JVP part with respect to \boldsymbol{u}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_t, r, t). As a result, the term \frac{d}{dt}\left[(t - r) \boldsymbol{u}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_t, r, t) - (\boldsymbol{x}_t - \boldsymbol{x}_r)\right] could not be isolated. The mathematical significance of introducing \color{skyblue}{\mathop{\text{sg}}\left[\color{blue}{\cdot}\right]} is to solve this problem.
Of course, as I said, the introduction of \color{skyblue}{\mathop{\text{sg}}\left[\color{blue}{\cdot}\right]} also couples the entire model training with the gradient optimizer, adding a touch of ambiguity. At this point, the point where the gradient equals zero is at most a stationary point rather than a (local) minimum, so stability is also unclear. This is actually a common feature of all models coupled with \color{skyblue}{\mathop{\text{sg}}\left[\color{blue}{\cdot}\right]}.
Consistency Models
Finally, let’s discuss Consistency Models. Since CM and sCM came first, MeanFlow’s success actually borrowed from their experience, especially the operation of adding \color{skyblue}{\mathop{\text{sg}}\left[\color{blue}{\cdot}\right]} to the JVP, which is also mentioned in the original paper. Of course, one of the authors of MeanFlow, Professor Kaiming He, is himself a master of manipulating gradients (e.g., SimSiam), so the emergence of MeanFlow seems very natural.
We carefully analyzed discrete CM in "Generative Diffusion Model Chat (28)". If we replace the EMA operator in CM with stop_gradient, take the gradient, and take the limit \Delta t \to 0, we get the objective function of sCM from "Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models": \begin{equation} \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)\cdot \frac{d}{dt}\boldsymbol{f}_{\color{skyblue}{\mathop{\text{sg}}\left[\color{blue}{\boldsymbol{\theta}}\right]}}(\boldsymbol{x}_t, t) = \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)\cdot\color{skyblue}{\mathop{\text{sg}}\left[\color{blue}{\frac{d\boldsymbol{x}_t}{dt}\cdot\frac{\partial}{\partial \boldsymbol{x}_t}\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) + \frac{\partial}{\partial t}\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)}\right]}\label{eq:loss-scm} \end{equation} If we replace \frac{d\boldsymbol{x}_t}{dt} with \boldsymbol{x}_1 - \boldsymbol{x}_0, and then denote \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) = \boldsymbol{x}_t - t\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, 0, t), then its gradient is equivalent to MeanFlow's third objective [eq:loss-3] when r=0: \begin{equation} \begin{aligned} \nabla_{\boldsymbol{\theta}}\eqref{eq:loss-scm} =&\, \nabla_{\boldsymbol{\theta}}\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)\cdot \left[\frac{d\boldsymbol{x}_t}{dt}\cdot\frac{\partial}{\partial \boldsymbol{x}_t}\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) + \frac{\partial}{\partial t}\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)\right] \\[10pt] =&\, -t\nabla_{\boldsymbol{\theta}}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, 0, t)\cdot \left[\frac{d\boldsymbol{x}_t}{dt} - t\frac{d\boldsymbol{x}_t}{dt}\cdot\frac{\partial}{\partial \boldsymbol{x}_t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, 0, t) - \boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, 0, t) - t\frac{\partial}{\partial t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, 0, t)\right] \\[10pt] =&\, t\nabla_{\boldsymbol{\theta}}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, 0, t)\cdot \left[\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, 0, t) + t\left[\frac{d\boldsymbol{x}_t}{dt}\cdot\frac{\partial}{\partial \boldsymbol{x}_t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, 0, t) + \frac{\partial}{\partial t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, 0, t)\right]- \frac{d\boldsymbol{x}_t}{dt}\right] \\[10pt] =&\, \frac{t}{2}\nabla_{\boldsymbol{\theta}}\left\Vert\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, 0, t) + t\color{skyblue}{\mathop{\text{sg}}\left[\color{blue}{\frac{d\boldsymbol{x}_t}{dt}\cdot\frac{\partial}{\partial \boldsymbol{x}_t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, 0, t) + \frac{\partial}{\partial t}\boldsymbol{u}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, 0, t)}\right]}- \frac{d\boldsymbol{x}_t}{dt}\right\Vert^2 \\[10pt] \sim &\, \left.\nabla_{\boldsymbol{\theta}}\eqref{eq:loss-3}\right|_{r=0} \end{aligned} \end{equation}
So, from this perspective, sCM is a special case of MeanFlow when r=0. As mentioned earlier, introducing the additional time parameter r allows ReFlow to provide a "safety net" for MeanFlow (when r=t), thereby better avoiding training collapse, which is one of its advantages. Of course, starting from sCM, one could also introduce dual time parameters to obtain results identical to the third objective. However, from a personal aesthetic point of view, the physical meaning of CM and sCM is ultimately not as intuitive as the interpretation of average velocity in MeanFlow.
Furthermore, the starting point of combining average velocity and ReFlow can also yield the other two results: the first objective [eq:loss-1] and the second objective [eq:loss-2]. For a stop_gradient purist like myself, these are very comfortable and beautiful results. In my view, we can consider adding stop_gradient to the loss function for reasons of computational cost, but the first principles of the derivation and the basic results should not be coupled with stop_gradient; otherwise, it means they are strongly coupled with the optimizer and its dynamics, which is not how an essential result should behave.
Summary
This article focused on the recently released MeanFlow and discussed the idea of accelerating diffusion model generation from the perspective of "average velocity."
When reposting, please include the original address of this article: https://kexue.fm/archives/10958
For more detailed reposting matters, please refer to: "Scientific Space FAQ"