English (unofficial) translations of posts at kexue.fm

Generative Diffusion Models (28): Step-by-Step Understanding of Consistency Models

Translated by Gemini Flash 3.0 Preview. Translations may be inaccurate; please refer to the original post for anything important.

In the previous post, "Generative Diffusion Models (27): Using Step Size as Conditional Input", we introduced the Shortcut model for accelerated sampling. One of the models it was compared against was "Consistency Models". In fact, as early as "Generative Diffusion Models (17): General Steps for Constructing ODEs (Part 2)", where ReFlow was introduced, some readers brought up Consistency Models. However, I initially felt it was more of a practical trick with a somewhat thin theoretical foundation, so I had little interest in it at the time.

However, since we have begun to focus on the progress of accelerated sampling in diffusion models, Consistency Models is a work that cannot be ignored. Therefore, I would like to take this opportunity to share my understanding of Consistency Models here.

The Familiar Recipe

Using the same familiar recipe, our starting point remains ReFlow, as it is perhaps the simplest way to understand ODE-based diffusion. Let \boldsymbol{x}_0 \sim p_0(\boldsymbol{x}_0) be a real sample from the target distribution, \boldsymbol{x}_1 \sim p_1(\boldsymbol{x}_1) be random noise from the prior distribution, and \boldsymbol{x}_t = (1-t)\boldsymbol{x}_0 + t\boldsymbol{x}_1 be the noisy sample. Then the training objective of ReFlow is: \begin{equation} \boldsymbol{\theta}^* = \mathop{\text{argmin}}_{\boldsymbol{\theta}} \mathbb{E}_{t\sim U[0,1],\boldsymbol{x}_0\sim p_0(\boldsymbol{x}_0),\boldsymbol{x}_1\sim p_1(\boldsymbol{x}_1)}\left[w(t)\Vert\boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) - (\boldsymbol{x}_1 - \boldsymbol{x}_0)\Vert^2\right] \label{eq:loss} \end{equation} where w(t) is a tunable weight. After training, sampling can be achieved by solving d\boldsymbol{x}_t/dt = \boldsymbol{v}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_t, t) to transform \boldsymbol{x}_1 into \boldsymbol{x}_0.
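To make the objective concrete, here is a minimal numpy sketch of one Monte-Carlo estimate of the ReFlow loss, taking w(t) = 1. The linear `v_theta` (with placeholder parameters `W`, `b`) is purely an illustrative stand-in for a neural velocity network, not anything from the original post.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, batch = 2, 256

# Toy linear "network" standing in for the learned velocity field;
# W and b are illustrative placeholders, not part of the original post.
W = rng.standard_normal((dim, dim)) * 0.1
b = np.zeros(dim)

def v_theta(x_t, t):
    return x_t @ W + b

# One Monte-Carlo estimate of the ReFlow objective, with w(t) = 1.
x0 = rng.standard_normal((batch, dim))   # real samples from p_0
x1 = rng.standard_normal((batch, dim))   # prior noise from p_1
t = rng.uniform(size=(batch, 1))         # t ~ U[0, 1]
xt = (1 - t) * x0 + t * x1               # noisy interpolant
loss = np.mean(np.sum((v_theta(xt, t) - (x1 - x0)) ** 2, axis=1))
```

In a real implementation the mean-squared residual would be minimized over the network parameters by gradient descent; here we only evaluate it once.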

It should be noted that the noise schedule for Consistency Models is \boldsymbol{x}_t = \boldsymbol{x}_0 + t\boldsymbol{x}_1 (where \boldsymbol{x}_t is also close to pure noise when t is large enough), which is slightly different from ReFlow. However, the main purpose of this article is to attempt to derive the same training philosophy and objective as Consistency Models step-by-step. I believe ReFlow’s approach is easier to understand, so I will continue to introduce it based on ReFlow. As for the specific training details, readers can adjust them as needed.

Using \boldsymbol{x}_t = (1-t)\boldsymbol{x}_0 + t\boldsymbol{x}_1, we can eliminate \boldsymbol{x}_1 from the objective [eq:loss]: \begin{equation} \boldsymbol{\theta}^* = \mathop{\text{argmin}}_{\boldsymbol{\theta}} \mathbb{E}_{t\sim U[0,1],\boldsymbol{x}_0\sim p_0(\boldsymbol{x}_0),\boldsymbol{x}_1\sim p_1(\boldsymbol{x}_1)}\big[\tilde{w}(t)\Vert \underbrace{\boldsymbol{x}_t - t\boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)}_{\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)} - \boldsymbol{x}_0\Vert^2\big] \label{eq:loss-2} \end{equation} where \tilde{w}(t) = w(t)/t^2. Note that \boldsymbol{x}_0 is the clean sample and \boldsymbol{x}_t is the noisy sample, so the training objective of ReFlow is actually also performing denoising. The model predicting the clean sample is \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) = \boldsymbol{x}_t - t\boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t). An important property of this function is that \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_0, 0) = \boldsymbol{x}_0 always holds, which is one of the key constraints of Consistency Models.
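The boundary property can be checked numerically: for any velocity model, the predictor \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) = \boldsymbol{x}_t - t\boldsymbol{v}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) returns its input exactly at t = 0. A sketch, with `v_theta` an arbitrary placeholder:

```python
import numpy as np

rng = np.random.default_rng(0)

def v_theta(x_t, t):
    # Placeholder velocity model; a real one would be a neural network.
    return 0.5 * x_t + t

def f_theta(x_t, t):
    # Clean-sample predictor implied by the velocity model:
    # f_theta(x_t, t) = x_t - t * v_theta(x_t, t).
    return x_t - t * v_theta(x_t, t)

x0 = rng.standard_normal(4)
# The boundary condition f_theta(x_0, 0) = x_0 holds by construction,
# no matter what v_theta is.
assert np.allclose(f_theta(x0, 0.0), x0)
```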

Step-by-Step Understanding

Next, let us deconstruct the training process of ReFlow step-by-step, attempting to find a better training objective. First, we divide [0,1] into n equal parts, each of size 1/n, and denote t_k = k/n. Then t only needs to be sampled uniformly from the finite set \{0, t_1, t_2, \dots, t_n\}. Of course, we could also choose a non-uniform discretization method; these are non-critical details.

Since t_0=0 is trivial, we start from t_1. The training objective for the first step is: \begin{equation} \boldsymbol{\theta}_1^* = \mathop{\text{argmin}}_{\boldsymbol{\theta}} \mathbb{E}_{\boldsymbol{x}_0\sim p_0(\boldsymbol{x}_0),\boldsymbol{x}_1\sim p_1(\boldsymbol{x}_1)}\big[\tilde{w}(t_1)\Vert \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_{t_1}, t_1) - \boldsymbol{x}_0\Vert^2\big] \end{equation} Next, consider the training objective for the second step. If we followed [eq:loss-2], it would be the expectation of \Vert \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_{t_2}, t_2) - \boldsymbol{x}_0\Vert^2. Instead, we consider a new objective: \begin{equation} \boldsymbol{\theta}_2^* = \mathop{\text{argmin}}_{\boldsymbol{\theta}} \mathbb{E}_{\boldsymbol{x}_0\sim p_0(\boldsymbol{x}_0),\boldsymbol{x}_1\sim p_1(\boldsymbol{x}_1)}\big[\tilde{w}(t_2)\Vert \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_{t_2}, t_2) - \boldsymbol{f}_{\boldsymbol{\theta}_1^*}(\boldsymbol{x}_{t_1}, t_1)\Vert^2\big] \end{equation} In other words, the prediction target is changed from \boldsymbol{x}_0 to \boldsymbol{f}_{\boldsymbol{\theta}_1^*}(\boldsymbol{x}_{t_1}, t_1). Why make this change? We can discuss it from two angles: feasibility and necessity. Regarding feasibility: \boldsymbol{x}_{t_2} contains more noise than \boldsymbol{x}_{t_1}, so denoising it is harder; that is, the reconstruction \boldsymbol{f}_{\boldsymbol{\theta}_2^*}(\boldsymbol{x}_{t_2}, t_2) can be no more accurate than \boldsymbol{f}_{\boldsymbol{\theta}_1^*}(\boldsymbol{x}_{t_1}, t_1). Since the latter is already an estimate of \boldsymbol{x}_0, replacing \boldsymbol{x}_0 with \boldsymbol{f}_{\boldsymbol{\theta}_1^*}(\boldsymbol{x}_{t_1}, t_1) as the training target for the second step is entirely feasible.

But even so, what is the necessity of this change? The answer is to reduce "trajectory crossing." Since \boldsymbol{x}_{t_k} = (1-t_k)\boldsymbol{x}_0 + t_k\boldsymbol{x}_1, as k increases, the dependence of \boldsymbol{x}_{t_k} on \boldsymbol{x}_0 becomes weaker and weaker, to the point where two different \boldsymbol{x}_0 values might result in very similar \boldsymbol{x}_{t_k} values. At this point, if \boldsymbol{x}_0 is still used as the prediction target, the dilemma of "one input, multiple targets" arises, which is "trajectory crossing."

To address this dilemma, ReFlow’s strategy is post-hoc distillation: after pre-training, solving d\boldsymbol{x}_t/dt = \boldsymbol{v}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_t, t) yields many paired (\boldsymbol{x}_0, \boldsymbol{x}_1) samples, and constructing \boldsymbol{x}_t from these pairs avoids crossing. The idea of Consistency Models is instead to change the prediction target to \boldsymbol{f}_{\boldsymbol{\theta}_{k-1}^*}(\boldsymbol{x}_{t_{k-1}}, t_{k-1}): for "the same \boldsymbol{x}_1 but different \boldsymbol{x}_0," the differences between the corresponding \boldsymbol{f}_{\boldsymbol{\theta}_{k-1}^*}(\boldsymbol{x}_{t_{k-1}}, t_{k-1}) values are smaller than the differences between the \boldsymbol{x}_0 values themselves, thus reducing the risk of crossing.

Simply put, it is easier for \boldsymbol{f}_{\boldsymbol{\theta}_2^*}(\boldsymbol{x}_{t_2}, t_2) to predict \boldsymbol{f}_{\boldsymbol{\theta}_1^*}(\boldsymbol{x}_{t_1}, t_1) than to predict \boldsymbol{x}_0, and the desired effect can still be achieved, so the prediction target is adjusted. Similarly, we can write: \begin{equation} \begin{gathered} \boldsymbol{\theta}_3^* = \mathop{\text{argmin}}_{\boldsymbol{\theta}} \mathbb{E}_{\boldsymbol{x}_0\sim p_0(\boldsymbol{x}_0),\boldsymbol{x}_1\sim p_1(\boldsymbol{x}_1)}\big[\tilde{w}(t_3)\Vert \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_{t_3}, t_3) - \boldsymbol{f}_{\boldsymbol{\theta}_2^*}(\boldsymbol{x}_{t_2}, t_2)\Vert^2\big] \\ \boldsymbol{\theta}_4^* = \mathop{\text{argmin}}_{\boldsymbol{\theta}} \mathbb{E}_{\boldsymbol{x}_0\sim p_0(\boldsymbol{x}_0),\boldsymbol{x}_1\sim p_1(\boldsymbol{x}_1)}\big[\tilde{w}(t_4)\Vert \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_{t_4}, t_4) - \boldsymbol{f}_{\boldsymbol{\theta}_3^*}(\boldsymbol{x}_{t_3}, t_3)\Vert^2\big] \\ \vdots \\[5pt] \boldsymbol{\theta}_n^* = \mathop{\text{argmin}}_{\boldsymbol{\theta}} \mathbb{E}_{\boldsymbol{x}_0\sim p_0(\boldsymbol{x}_0),\boldsymbol{x}_1\sim p_1(\boldsymbol{x}_1)}\big[\tilde{w}(t_n)\Vert \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_{t_n}, t_n) - \boldsymbol{f}_{\boldsymbol{\theta}_{n-1}^*}(\boldsymbol{x}_{t_{n-1}}, t_{n-1})\Vert^2\big] \end{gathered} \end{equation}

Consistency Training

Now that we have completed the deconstruction of the ReFlow model and obtained a new, more reasonable training objective, the cost is that we have n sets of parameters \boldsymbol{\theta}_1^*, \boldsymbol{\theta}_2^*, \dots, \boldsymbol{\theta}_n^*. This is not what we want; we want a single model. Thus, we assume all \boldsymbol{\theta}_i^* can share the same set of parameters, and we can write the training objective as: \begin{equation} \boldsymbol{\theta}^* = \mathop{\text{argmin}}_{\boldsymbol{\theta}} \mathbb{E}_{k\sim[n],\boldsymbol{x}_0\sim p_0(\boldsymbol{x}_0),\boldsymbol{x}_1\sim p_1(\boldsymbol{x}_1)}\big[\tilde{w}(t_k)\Vert \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_{t_k}, t_k) - \boldsymbol{f}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_{t_{k-1}}, t_{k-1})\Vert^2\big] \label{eq:loss-3} \end{equation} Here k\sim[n] means k is sampled uniformly from \{1, 2, \dots, n\}. The problem with the above equation is that \boldsymbol{\theta}^* is the parameter we are solving for, but it also appears in the objective function. This is clearly unscientific (if I knew \boldsymbol{\theta}^*, why would I train?). Therefore, we must modify the objective to make it feasible.

Here, \boldsymbol{\theta}^* denotes the theoretical optimal solution. Considering that \boldsymbol{\theta} gradually approaches \boldsymbol{\theta}^* as training progresses, we can relax the condition in the objective function to a "superior solution": it only needs to be better than the current \boldsymbol{\theta}. How do we construct such a superior solution? The approach of Consistency Models is to take an EMA (Exponential Moving Average) of the historical weights. This often yields a better solution, a trick frequently used in competitions in earlier years.
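The EMA update itself is a one-liner per parameter. A minimal sketch, where the decay value and the dictionary-of-arrays representation are illustrative choices, not details from the original paper:

```python
import numpy as np

def ema_update(theta_bar, theta, decay=0.999):
    # bar(theta) <- decay * bar(theta) + (1 - decay) * theta, parameter-wise.
    # The decay value is illustrative; real runs tune it or schedule it.
    return {k: decay * theta_bar[k] + (1 - decay) * theta[k] for k in theta}

theta = {"W": np.ones(3)}        # current weights
theta_bar = {"W": np.zeros(3)}   # running EMA of historical weights
theta_bar = ema_update(theta_bar, theta, decay=0.9)
```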

Therefore, the final training objective for Consistency Models is: \begin{equation} \boldsymbol{\theta}^* = \mathop{\text{argmin}}_{\boldsymbol{\theta}} \mathbb{E}_{k\sim[n],\boldsymbol{x}_0\sim p_0(\boldsymbol{x}_0),\boldsymbol{x}_1\sim p_1(\boldsymbol{x}_1)}\big[\tilde{w}(t_k)\Vert \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_{t_k}, t_k) - \boldsymbol{f}_{\bar{\boldsymbol{\theta}}}(\boldsymbol{x}_{t_{k-1}}, t_{k-1})\Vert^2\big] \label{eq:loss-4} \end{equation} where \bar{\boldsymbol{\theta}} is the EMA of \boldsymbol{\theta}. This is the "Consistency Training (CT)" described in the original paper. In practice, we can also replace \Vert\cdot - \cdot\Vert^2 with a more general metric d(\cdot, \cdot) to better fit the data characteristics.
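One consistency-training step can be sketched as follows, with \tilde{w}(t_k) = 1 and a toy one-parameter linear velocity model standing in for the network (the values of `theta` and `theta_ema` are arbitrary illustrations). The essential points are that both time points use the same (\boldsymbol{x}_0, \boldsymbol{x}_1) pair and that the target comes from the EMA weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x_t, t, theta):
    # f_theta(x_t, t) = x_t - t * v_theta(x_t, t), with a toy one-
    # parameter linear velocity model standing in for the network.
    return x_t - t * (theta * x_t)

n = 10
t_grid = np.arange(n + 1) / n            # t_k = k / n
theta, theta_ema = 0.3, 0.25             # student and EMA weights (toy)

x0 = rng.standard_normal(4)              # x_0 ~ p_0
x1 = rng.standard_normal(4)              # x_1 ~ p_1
k = int(rng.integers(1, n + 1))          # k ~ Uniform{1, ..., n}

# Crucially, both time points use the SAME (x0, x1) pair.
x_tk = (1 - t_grid[k]) * x0 + t_grid[k] * x1
x_tk1 = (1 - t_grid[k - 1]) * x0 + t_grid[k - 1] * x1

target = f(x_tk1, t_grid[k - 1], theta_ema)   # EMA target, no gradient
loss = np.sum((f(x_tk, t_grid[k], theta) - target) ** 2)
```

In a real setup `loss` would be backpropagated through the student branch only, with the EMA branch held fixed.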

Sampling Analysis

Since we derived this step-by-step through "equivalent transformations" from ReFlow, a basic sampling method after training is the same as ReFlow: solving the ODE \begin{equation} d\boldsymbol{x}_t/dt = \boldsymbol{v}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_t, t) = \frac{\boldsymbol{x}_t - \boldsymbol{f}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_t, t)}{t} \label{eq:ode} \end{equation} Of course, if all this effort only yielded the same results as ReFlow, it would be a waste of time. Fortunately, models obtained through consistency training have an important advantage: they can use larger sampling steps—even a step size of 1, which allows for single-step generation: \begin{equation} \boldsymbol{x}_0 = \boldsymbol{x}_1 - \boldsymbol{v}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_1, 1)\times 1 = \boldsymbol{f}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_1, 1) \end{equation} The reason is: \begin{equation} \begin{aligned} \Vert\boldsymbol{f}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_1, 1) - \boldsymbol{x}_0\Vert &= \left\Vert\sum_{k=1}^n \Big[\boldsymbol{f}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_{t_k}, t_k) - \boldsymbol{f}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_{t_{k-1}}, t_{k-1})\Big]\right\Vert \\[5pt] &\leq \sum_{k=1}^n \Vert\boldsymbol{f}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_{t_k}, t_k) - \boldsymbol{f}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_{t_{k-1}}, t_{k-1})\Vert \\ \end{aligned} \label{eq:f-x1-x0} \end{equation} As we can see, consistency training is equivalent to optimizing the upper bound of \Vert\boldsymbol{f}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_1, 1) - \boldsymbol{x}_0\Vert. When the loss is small enough, it means \Vert\boldsymbol{f}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_1, 1) - \boldsymbol{x}_0\Vert is also small enough, thus enabling single-step generation.
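The telescoping identity and the triangle-inequality bound can be verified numerically for any predictor of the form \boldsymbol{x}_t - t\boldsymbol{v}(\boldsymbol{x}_t, t), since such an \boldsymbol{f} automatically satisfies \boldsymbol{f}(\boldsymbol{x}, 0) = \boldsymbol{x}. The `tanh` velocity below is an arbitrary stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x_t, t):
    # Any predictor of the form x_t - t * v(x_t, t) satisfies f(x, 0) = x;
    # tanh here is an arbitrary stand-in for the learned velocity.
    return x_t - t * np.tanh(x_t)

n = 8
t_grid = np.arange(n + 1) / n
x0 = rng.standard_normal(3)
x1 = rng.standard_normal(3)
xs = [(1 - tk) * x0 + tk * x1 for tk in t_grid]

diffs = [f(xs[k], t_grid[k]) - f(xs[k - 1], t_grid[k - 1])
         for k in range(1, n + 1)]
# Telescoping: the consecutive differences collapse to
# f(x_1, 1) - f(x_0, 0) = f(x_1, 1) - x_0.
assert np.allclose(sum(diffs), f(xs[n], 1.0) - x0)
# Triangle inequality: the CT loss terms upper-bound the one-step error.
lhs = np.linalg.norm(f(xs[n], 1.0) - x0)
rhs = sum(np.linalg.norm(d) for d in diffs)
assert lhs <= rhs + 1e-9
```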

But \Vert\boldsymbol{f}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_1, 1) - \boldsymbol{x}_0\Vert was the original training objective of ReFlow. Why is optimizing its upper bound better than optimizing it directly? This goes back to the "trajectory crossing" problem. In direct training, \boldsymbol{x}_0 and \boldsymbol{x}_1 are sampled randomly without a one-to-one pairing, so a single-step generation model cannot be trained directly. However, by training the upper bound, the transitivity of multiple \boldsymbol{f}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_{t_k}, t_k) and \boldsymbol{f}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_{t_{k-1}}, t_{k-1}) implicitly achieves the pairing of \boldsymbol{x}_0 and \boldsymbol{x}_1.

If single-step generation is not satisfactory, we can increase the number of sampling steps to improve generation quality. There are two approaches: 1. Use smaller steps to numerically solve [eq:ode]; 2. Transform it into a stochastic iteration similar to SDE. The former is conventional, so we mainly discuss the latter.

First, note that replacing \boldsymbol{f}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_1, 1) in [eq:f-x1-x0] with any \boldsymbol{f}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_t, t) yields a similar inequality, meaning any \boldsymbol{f}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_t, t) predicts \boldsymbol{x}_0. Thus, starting from \boldsymbol{x}_1, we get a preliminary \boldsymbol{x}_0 via \boldsymbol{f}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_1, 1). Since it might not be perfect, we "mask" this imperfection by adding noise to get \boldsymbol{x}_{t_{n-1}}, then substitute it into \boldsymbol{f}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_{t_{n-1}}, t_{n-1}) to get a better result, and so on: \begin{equation} \begin{aligned} &\boldsymbol{x}_1\sim\mathcal{N}(\boldsymbol{0},\boldsymbol{I}) \\ &\boldsymbol{x}_0\leftarrow \boldsymbol{f}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_1, 1) \\ &\text{for }k=n-1,n-2,\cdots,1: \\ &\qquad \boldsymbol{z} \sim \mathcal{N}(\boldsymbol{0},\boldsymbol{I}) \\ &\qquad \boldsymbol{x}_{t_k} \leftarrow (1 - t_k)\boldsymbol{x}_0 + t_k\boldsymbol{z} \\ &\qquad \boldsymbol{x}_0\leftarrow \boldsymbol{f}_{\boldsymbol{\theta}^*}(\boldsymbol{x}_{t_k}, t_k) \end{aligned} \end{equation}
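The iteration above translates directly into a short loop. In this sketch, `f` is a trivial stand-in for the trained predictor \boldsymbol{f}_{\boldsymbol{\theta}^*} (a real one would be the trained network), so only the control flow is meaningful:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x_t, t):
    # Stand-in for the trained predictor f_{theta*}; it simply shrinks
    # toward the origin so the loop has something to iterate.
    return (1 - t) * x_t

n = 5
t_grid = np.arange(n + 1) / n

x1 = rng.standard_normal(4)          # x_1 ~ N(0, I)
x0 = f(x1, 1.0)                      # single-step generation
for k in range(n - 1, 0, -1):        # k = n-1, ..., 1
    z = rng.standard_normal(4)       # fresh noise each step
    x_tk = (1 - t_grid[k]) * x0 + t_grid[k] * z
    x0 = f(x_tk, t_grid[k])          # re-denoise at a smaller t
```

Note that each iteration re-noises the current estimate with freshly sampled \boldsymbol{z}, rather than reusing the original \boldsymbol{x}_1, which is what makes this a stochastic refinement rather than an ODE solve.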

For Distillation

The training philosophy of Consistency Models can also be used for distilling existing diffusion models, resulting in "Consistency Distillation (CD)". The method is to change the learning target of \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_{t_k}, t_k) in [eq:loss-4] from \boldsymbol{f}_{\bar{\boldsymbol{\theta}}}(\boldsymbol{x}_{t_{k-1}}, t_{k-1}) to \boldsymbol{f}_{\bar{\boldsymbol{\theta}}}(\hat{\boldsymbol{x}}_{t_{k-1}}^{\boldsymbol{\varphi}^*}, t_{k-1}): \begin{equation} \boldsymbol{\theta}^* = \mathop{\text{argmin}}_{\boldsymbol{\theta}} \mathbb{E}_{k\sim[n],\boldsymbol{x}_0\sim p_0(\boldsymbol{x}_0),\boldsymbol{x}_1\sim p_1(\boldsymbol{x}_1)}\big[\tilde{w}(t_k)\Vert \boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}_{t_k}, t_k) - \boldsymbol{f}_{\bar{\boldsymbol{\theta}}}(\hat{\boldsymbol{x}}_{t_{k-1}}^{\boldsymbol{\varphi}^*}, t_{k-1})\Vert^2\big] \label{eq:loss-5} \end{equation} where \hat{\boldsymbol{x}}_{t_{k-1}}^{\boldsymbol{\varphi}^*} is the prediction of \boldsymbol{x}_{t_{k-1}} by the teacher diffusion model starting from \boldsymbol{x}_{t_k}. For example, using a simple Euler solver: \begin{equation} \hat{\boldsymbol{x}}_{t_{k-1}}^{\boldsymbol{\varphi}^*} \approx \boldsymbol{x}_{t_k} - (t_k - t_{k-1})\boldsymbol{v}_{\boldsymbol{\varphi}^*}(\boldsymbol{x}_{t_k}, t_k) \end{equation} The reason is simple: if a pre-trained diffusion model is available, there is no need to find learning targets on the line \boldsymbol{x}_t = (1-t)\boldsymbol{x}_0 + t\boldsymbol{x}_1, which is artificially defined and carries the risk of crossing. Instead, we use the pre-trained model to predict the trajectory. The targets found this way might not be the "straightest," but they certainly won’t cross.
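The teacher step in consistency distillation is just one Euler update of the teacher ODE. A sketch, where `v_teacher` is a toy stand-in for the pre-trained \boldsymbol{v}_{\boldsymbol{\varphi}^*}:

```python
import numpy as np

rng = np.random.default_rng(0)

def v_teacher(x_t, t):
    # Stand-in for the pre-trained teacher velocity field v_{phi*}.
    return -0.5 * x_t

n, k = 10, 7
t_grid = np.arange(n + 1) / n

x0 = rng.standard_normal(4)
x1 = rng.standard_normal(4)
x_tk = (1 - t_grid[k]) * x0 + t_grid[k] * x1

# One Euler step of the teacher ODE, from t_k back to t_{k-1}; this
# x_hat replaces the straight-line x_{t_{k-1}} as the input to the
# EMA branch in the distillation loss.
x_hat = x_tk - (t_grid[k] - t_grid[k - 1]) * v_teacher(x_tk, t_grid[k])
```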

If cost is no object, we could also start from randomly sampled \boldsymbol{x}_1, solve for \boldsymbol{x}_0 using the pre-trained model, and use paired (\boldsymbol{x}_0, \boldsymbol{x}_1) to construct learning targets. This is similar to the distillation idea in ReFlow. The disadvantage is that the teacher model must run the full sampling process, which is time-consuming. In contrast, consistency distillation only requires running a single step of the teacher model, making the computational cost much lower.

However, consistency distillation still requires real samples during the distillation process, which is a drawback in some scenarios. If one wants to avoid both running the full teacher model sampling and providing real data, an alternative is SiD, which we introduced previously, though at the cost of more complex model derivation.

Summary

By step-by-step deconstructing and optimizing the ReFlow training process, this article provides an intuitive path for understanding the transition from ReFlow to Consistency Models.

Reprinted from: https://kexue.fm/archives/10633
