English (unofficial) translations of posts at kexue.fm
Source

Generative Diffusion Model Chat (25): Identity-Based Distillation (Part 1)

Translated by Gemini Flash 3.0 Preview. Translations may be inaccurate; please refer to the original post for anything important.

Today we share the paper "Score identity Distillation: Exponentially Fast Distillation of Pretrained Diffusion Models for One-Step Generation". As the name suggests, this is a new paper exploring how to distill diffusion models faster and better.

Even if you haven’t worked on distillation, you can probably guess the conventional steps: randomly sample a large number of inputs, use the diffusion model to generate corresponding results as outputs, and use these input-output pairs as training data to supervise a new model. However, it is well known that the original teacher diffusion model usually requires many iterative steps (e.g., 1000 steps) to generate high-quality outputs. Therefore, regardless of the training details, a significant disadvantage of this scheme is that generating training data is extremely time-consuming and labor-intensive. Furthermore, the student model after distillation usually suffers from some loss in performance.

Is there a method that can solve both of these disadvantages at once? This is the problem the aforementioned paper attempts to address.

A Comeback

The paper refers to the proposed scheme as "Score identity Distillation (SiD)". The name is derived from the fact that the entire framework is designed and derived based on several identities. Choosing this slightly casual name is likely intended to highlight the key role of identity transformations in SiD, which is indeed its core contribution.

As for the training philosophy of SiD, it is almost identical to the paper "Learning Generative Models using Denoising Density Estimators" (referred to as "DDE"), which was previously introduced in "From Denoising Autoencoders to Generative Models". Even the final form is about sixty percent similar. However, at that time, diffusion models had not yet risen to prominence, so DDE was proposed as a new type of generative model, which made it appear very niche. Today, with the popularity of diffusion models, it can be reformulated as a distillation method for diffusion models because it requires a trained denoising autoencoder—which happens to be the core of a diffusion model.

Next, I will introduce SiD using my own line of reasoning. Suppose we have a teacher diffusion model \boldsymbol{\epsilon}_{\boldsymbol{\varphi}^*}(\boldsymbol{x}_t,t) trained on a target dataset. It requires multi-step sampling to generate high-quality images. Our goal is to train a student model \boldsymbol{x} = \boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z}) for one-step sampling, which is a generator similar to a GAN that can directly generate images meeting the requirements from a specified noise \boldsymbol{z}. If we had many (\boldsymbol{z},\boldsymbol{x}) pairs, we could use direct supervised training (of course, the loss function and other details would need further determination; readers can refer to related work). But what if we don’t? It’s certainly not impossible to train, because one can train even without \boldsymbol{\epsilon}_{\boldsymbol{\varphi}^*}(\boldsymbol{x}_t,t), such as with GANs. So the key is how to leverage the already trained diffusion model to provide better signals.

SiD and its predecessor DDE use a logic that seems roundabout but is also very clever:

If the data distribution produced by \boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z}) is very similar to the target distribution, then if we use the dataset generated by \boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z}) to train a diffusion model \boldsymbol{\epsilon}_{\boldsymbol{\psi}^*}(\boldsymbol{x}_t,t), should it also be very similar to \boldsymbol{\epsilon}_{\boldsymbol{\varphi}^*}(\boldsymbol{x}_t,t)?

Preliminary Form

The cleverness of this idea lies in the fact that it bypasses the need for samples generated by the teacher model and the need for real samples used to train the teacher model. This is because "training a diffusion model using the dataset generated by \boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z})" only requires data generated by the student model \boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z}) (referred to as "student data"). Since \boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z}) is a one-step model, using it to generate data is very time-efficient.

Of course, this is still just an idea; there is still a way to go to convert it into a practical training scheme. First, let’s review the diffusion model. We adopt the form from "Generative Diffusion Model Chat (3): DDPM = Bayesian + Denoising". We add noise to the input \boldsymbol{x}_0 as follows: \begin{equation} \boldsymbol{x}_t = \bar{\alpha}_t\boldsymbol{x}_0 + \bar{\beta}_t\boldsymbol{\varepsilon},\quad \boldsymbol{\varepsilon}\sim\mathcal{N}(\boldsymbol{0}, \boldsymbol{I}) \end{equation} In other words, p(\boldsymbol{x}_t|\boldsymbol{x}_0)=\mathcal{N}(\boldsymbol{x}_t;\bar{\alpha}_t\boldsymbol{x}_0,\bar{\beta}_t^2 \boldsymbol{I}). The way to train \boldsymbol{\epsilon}_{\boldsymbol{\varphi}^*}(\boldsymbol{x}_t,t) is through denoising: \begin{equation} \boldsymbol{\varphi}^* = \mathop{\text{argmin}}_{\boldsymbol{\varphi}} \mathbb{E}_{\boldsymbol{x}_0\sim \tilde{p}(\boldsymbol{x}_0),\boldsymbol{\varepsilon}\sim\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})}\left[\Vert\boldsymbol{\epsilon}_{\boldsymbol{\varphi}}(\bar{\alpha}_t\boldsymbol{x}_0 + \bar{\beta}_t\boldsymbol{\varepsilon},t) - \boldsymbol{\varepsilon}\Vert^2\right] \label{eq:d-real-data} \end{equation} Here \tilde{p}(\boldsymbol{x}_0) is the training data of the teacher model. 
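To make the objective concrete, here is a minimal NumPy sketch of equation \eqref{eq:d-real-data} at a single fixed timestep, on 1-D toy data. Everything here is an illustrative assumption, not from the paper: the schedule values, the linear "denoiser" family, and the Gaussian data, for which the optimal denoiser is known in closed form and can be checked against an arbitrary alternative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy noise schedule at one fixed timestep t (illustrative values only),
# variance-preserving: alpha_bar**2 + beta_bar**2 = 1.
alpha_bar, beta_bar = 0.8, 0.6

def denoising_loss(eps_model, x0):
    """Monte-Carlo estimate of E[ || eps_model(x_t) - eps ||^2 ]."""
    eps = rng.standard_normal(x0.shape)
    x_t = alpha_bar * x0 + beta_bar * eps   # forward noising
    return np.mean((eps_model(x_t) - eps) ** 2)

# For x0 ~ N(0, 1) we get x_t ~ N(0, 1), and the MSE-optimal denoiser is the
# posterior mean E[eps | x_t] = beta_bar * x_t (which also equals
# -beta_bar * d/dx_t log p(x_t), matching the score interpretation below).
teacher_opt = lambda x_t: beta_bar * x_t

x0 = rng.standard_normal(100_000)
loss_opt = denoising_loss(teacher_opt, x0)            # ~ 1 - beta_bar**2 = 0.64
loss_bad = denoising_loss(lambda x_t: 0.0 * x_t, x0)  # ~ E[eps^2] = 1
assert loss_opt < loss_bad
```

In a real implementation x0 would be image batches, eps_model a neural network, and t would also be sampled inside the expectation.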
Similarly, if we want to train a diffusion model using the student data from \boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z}), the training objective is: \begin{equation} \begin{aligned} \boldsymbol{\psi}^* =&\; \mathop{\text{argmin}}_{\boldsymbol{\psi}} \mathbb{E}_{\boldsymbol{x}_0^{(g)}\sim p_{\boldsymbol{\theta}}(\boldsymbol{x}_0^{(g)}),\boldsymbol{\varepsilon}\sim\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})}\left[\Vert\boldsymbol{\epsilon}_{\boldsymbol{\psi}}(\boldsymbol{x}_t^{(g)},t) - \boldsymbol{\varepsilon}\Vert^2\right] \\ =&\; \mathop{\text{argmin}}_{\boldsymbol{\psi}} \mathbb{E}_{\boldsymbol{z},\boldsymbol{\varepsilon}\sim\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})}\left[\Vert\boldsymbol{\epsilon}_{\boldsymbol{\psi}}(\boldsymbol{x}_t^{(g)},t) - \boldsymbol{\varepsilon}\Vert^2\right] \end{aligned}\label{eq:dloss} \end{equation} Here \boldsymbol{x}_t^{(g)}=\bar{\alpha}_t\boldsymbol{x}_0^{(g)} + \bar{\beta}_t\boldsymbol{\varepsilon}=\bar{\alpha}_t\boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z}) + \bar{\beta}_t\boldsymbol{\varepsilon} is a sample after adding noise to the student data, and the distribution of student data is denoted as p_{\boldsymbol{\theta}}(\boldsymbol{x}_0^{(g)}). The second equality uses the fact that "\boldsymbol{x}_0^{(g)} is directly determined by \boldsymbol{z}", so the expectation over \boldsymbol{x}_0^{(g)} is equivalent to the expectation over \boldsymbol{z}. Now we have two diffusion models. The difference between them measures, to some extent, the difference between the data distributions generated by the teacher and student models. 
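The student-side objective \eqref{eq:dloss} looks the same except that samples come from the generator. Continuing the 1-D toy sketch with a hypothetical linear generator \boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z}) = \theta \boldsymbol{z} (again illustrative, chosen so the inner optimum \boldsymbol{\psi}^* has a closed form to verify against):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha_bar, beta_bar = 0.8, 0.6   # same toy schedule at one fixed t
theta = 2.0                      # toy one-step generator: x0_g = g(z) = theta * z
g = lambda z: theta * z

def student_denoising_loss(psi, n=100_000):
    """Eq. (dloss): the expectation is over z, since x0_g is a function of z."""
    z = rng.standard_normal(n)
    eps = rng.standard_normal(n)
    x_t_g = alpha_bar * g(z) + beta_bar * eps   # noised student sample
    return np.mean((psi * x_t_g - eps) ** 2)    # linear denoiser eps_psi = psi * x_t

# For x0_g ~ N(0, theta^2) we get x_t_g ~ N(0, alpha_bar^2 theta^2 + beta_bar^2),
# and the optimum over the linear family is:
psi_star = beta_bar / (alpha_bar**2 * theta**2 + beta_bar**2)
assert student_denoising_loss(psi_star) < student_denoising_loss(psi_star + 0.3)
```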
Therefore, an intuitive idea is to learn the student model by minimizing the difference between them: \begin{equation} \boldsymbol{\theta}^* = \mathop{\text{argmin}}_{\boldsymbol{\theta}} \underbrace{\mathbb{E}_{\boldsymbol{z},\boldsymbol{\varepsilon}\sim\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})}\left[\Vert\boldsymbol{\epsilon}_{\boldsymbol{\varphi}^*}(\boldsymbol{x}_t^{(g)},t) - \boldsymbol{\epsilon}_{\boldsymbol{\psi}^*}(\boldsymbol{x}_t^{(g)},t)\Vert^2\right]}_{\mathcal{L}_1}\label{eq:gloss-1} \end{equation} Note that the optimization of equation \eqref{eq:dloss} depends on \boldsymbol{\theta}. Therefore, when \boldsymbol{\theta} changes through equation \eqref{eq:gloss-1}, the value of \boldsymbol{\psi}^* also changes. Thus, equation \eqref{eq:dloss} and equation \eqref{eq:gloss-1} actually need to be optimized alternately, similar to a GAN.
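A crude sketch of this alternating scheme in the same toy setting: here the inner problem \eqref{eq:dloss} has a closed-form solution \psi^*(\theta) (standing in for an inner gradient loop), and scanning \mathcal{L}_1 over generator scales \theta locates the minimum at \theta = 1, where the student distribution matches the toy real data \mathcal{N}(0,1).

```python
import numpy as np

rng = np.random.default_rng(2)
alpha_bar, beta_bar = 0.8, 0.6   # toy schedule at one fixed t (illustrative)

# Teacher denoiser, optimal for real data x0 ~ N(0, 1).
eps_teacher = lambda x_t: beta_bar * x_t

def psi_star(theta):
    """Closed-form inner optimum of eq. (dloss) for the linear toy case
    (in practice this is only approximated by a few gradient steps on psi)."""
    return beta_bar / (alpha_bar**2 * theta**2 + beta_bar**2)

def L1(theta, psi, n=200_000):
    """Eq. (gloss-1): || eps_phi*(x_t_g) - eps_psi(x_t_g) ||^2 on student samples."""
    z = rng.standard_normal(n)
    eps = rng.standard_normal(n)
    x_t_g = alpha_bar * theta * z + beta_bar * eps
    return np.mean((eps_teacher(x_t_g) - psi * x_t_g) ** 2)

# "Alternating" optimization reduced to a grid scan over the generator scale theta.
thetas = np.linspace(0.2, 3.0, 29)
best = thetas[int(np.argmin([L1(th, psi_star(th)) for th in thetas]))]
assert abs(best - 1.0) < 1e-6   # minimized when the student matches the data scale
```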

The Finishing Touch

Speaking of GANs, some readers might "recoil in horror" because they are notoriously easy to collapse during training. Unfortunately, the alternating training scheme proposed above in equations \eqref{eq:dloss} and \eqref{eq:gloss-1} suffers from the same problem. While it is theoretically sound, problems arise in the gap between theory and practice, mainly manifested in two points:

  1. Theoretically, it is required to find the optimal solution for equation \eqref{eq:dloss} before optimizing equation \eqref{eq:gloss-1}. However, in practice, due to training costs, we optimize equation \eqref{eq:gloss-1} before reaching the optimum.

  2. Theoretically, \boldsymbol{\psi}^* changes with \boldsymbol{\theta}, meaning it should be written as \boldsymbol{\psi}^*(\boldsymbol{\theta}). Thus, when optimizing equation \eqref{eq:gloss-1}, there should be an additional term for the gradient of \boldsymbol{\psi}^*(\boldsymbol{\theta}) with respect to \boldsymbol{\theta}. However, in practice, we treat \boldsymbol{\psi}^* as a constant when optimizing equation \eqref{eq:gloss-1}.

These two problems are very fundamental and are the root causes of GAN training instability. A previous paper, "Revisiting GANs by Best-Response Constraint: Perspective, Methodology, and Application", specifically improved GAN training starting from the second point. It seems that neither of these problems can be easily solved. Especially the first one: it is almost impossible to always find the optimal \boldsymbol{\psi}, as the cost would be absolutely unacceptable. As for the second one, in an alternating training scenario, we have no good way to obtain any effective information about \boldsymbol{\psi}^*(\boldsymbol{\theta}), making it even more impossible to obtain its gradient with respect to \boldsymbol{\theta}.

Fortunately, for the aforementioned diffusion model distillation problem, SiD proposes an effective scheme to alleviate these two problems. SiD’s idea is quite "simple": Since taking an approximate value for \boldsymbol{\psi}^* and treating \boldsymbol{\psi}^* as a constant are unavoidable, the only way is to use identity transformations to minimize the dependence of the optimization objective \eqref{eq:gloss-1} on \boldsymbol{\psi}^*. As long as the dependence of equation \eqref{eq:gloss-1} on \boldsymbol{\psi}^* is weak enough, the negative impact of the two problems mentioned above will also be weak enough.

This is the core contribution of SiD, and it is a truly remarkable "finishing touch."

Identity Transformation

Next, let’s look at the specific identity transformations performed. First, consider equation \eqref{eq:d-real-data}. Its optimization objective can be equivalently rewritten as: \begin{equation} \begin{aligned} &\, \mathbb{E}_{\boldsymbol{x}_0\sim \tilde{p}(\boldsymbol{x}_0),\boldsymbol{\varepsilon}\sim\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})}\left[\Vert\boldsymbol{\epsilon}_{\boldsymbol{\varphi}}(\bar{\alpha}_t\boldsymbol{x}_0 + \bar{\beta}_t\boldsymbol{\varepsilon},t) - \boldsymbol{\varepsilon}\Vert^2\right] \\ =&\, \mathbb{E}_{\boldsymbol{x}_0\sim \tilde{p}(\boldsymbol{x}_0),\boldsymbol{x}_t\sim p(\boldsymbol{x}_t|\boldsymbol{x}_0)}\left[\left\Vert\boldsymbol{\epsilon}_{\boldsymbol{\varphi}}(\boldsymbol{x}_t,t) - \frac{\boldsymbol{x}_t - \bar{\alpha}_t \boldsymbol{x}_0}{\bar{\beta}_t}\right\Vert^2\right] \\ =&\, \mathbb{E}_{\boldsymbol{x}_0\sim \tilde{p}(\boldsymbol{x}_0),\boldsymbol{x}_t\sim p(\boldsymbol{x}_t|\boldsymbol{x}_0)}\left[\left\Vert\boldsymbol{\epsilon}_{\boldsymbol{\varphi}}(\boldsymbol{x}_t,t) + \bar{\beta}_t\nabla_{\boldsymbol{x}_t}\log p(\boldsymbol{x}_t|\boldsymbol{x}_0)\right\Vert^2\right] \end{aligned} \end{equation} According to the score matching results in "Generative Diffusion Model Chat (5): General Framework SDE Edition", the optimal solution to the above objective is \boldsymbol{\epsilon}_{\boldsymbol{\varphi}^*}(\boldsymbol{x}_t,t)=-\bar{\beta}_t\nabla_{\boldsymbol{x}_t}\log p(\boldsymbol{x}_t). Similarly, the optimal solution to equation \eqref{eq:dloss} is \boldsymbol{\epsilon}_{\boldsymbol{\psi}^*}(\boldsymbol{x}_t^{(g)},t)=-\bar{\beta}_t\nabla_{\boldsymbol{x}_t^{(g)}}\log p_{\boldsymbol{\theta}}(\boldsymbol{x}_t^{(g)}). 
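As a one-line recap of why this holds (condensing the argument from the linked posts): the pointwise minimizer of a mean squared error is the conditional expectation, and applying the identity \eqref{eq:id} stated just below gives

```latex
\begin{aligned}
\boldsymbol{\epsilon}_{\boldsymbol{\varphi}^*}(\boldsymbol{x}_t,t)
&= \mathbb{E}_{\boldsymbol{x}_0\sim p(\boldsymbol{x}_0|\boldsymbol{x}_t)}\left[\frac{\boldsymbol{x}_t - \bar{\alpha}_t\boldsymbol{x}_0}{\bar{\beta}_t}\right]
= -\bar{\beta}_t\,\mathbb{E}_{\boldsymbol{x}_0\sim p(\boldsymbol{x}_0|\boldsymbol{x}_t)}\left[\nabla_{\boldsymbol{x}_t}\log p(\boldsymbol{x}_t|\boldsymbol{x}_0)\right]
= -\bar{\beta}_t\nabla_{\boldsymbol{x}_t}\log p(\boldsymbol{x}_t)
\end{aligned}
```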
At this point, the objective function of equation \eqref{eq:gloss-1} can be equivalently rewritten as: \begin{equation} \begin{aligned} &\,\mathbb{E}_{\boldsymbol{z},\boldsymbol{\varepsilon}\sim\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})}\left[\Vert\boldsymbol{\epsilon}_{\boldsymbol{\varphi}^*}(\boldsymbol{x}_t^{(g)},t) - \boldsymbol{\epsilon}_{\boldsymbol{\psi}^*}(\boldsymbol{x}_t^{(g)},t)\Vert^2\right] \\[5pt] =&\, \mathbb{E}_{\boldsymbol{z},\boldsymbol{\varepsilon}\sim\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})}\left[\left\langle\boldsymbol{\epsilon}_{\boldsymbol{\varphi}^*}(\boldsymbol{x}_t^{(g)},t) - \boldsymbol{\epsilon}_{\boldsymbol{\psi}^*}(\boldsymbol{x}_t^{(g)},t), \boldsymbol{\epsilon}_{\boldsymbol{\varphi}^*}(\boldsymbol{x}_t^{(g)},t) + \bar{\beta}_t\nabla_{\boldsymbol{x}_t^{(g)}}\log p_{\boldsymbol{\theta}}(\boldsymbol{x}_t^{(g)})\right\rangle\right] \\[5pt] =&\, \color{green}{\mathbb{E}_{\boldsymbol{z},\boldsymbol{\varepsilon}\sim\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})}\left[\left\langle\boldsymbol{\epsilon}_{\boldsymbol{\varphi}^*}(\boldsymbol{x}_t^{(g)},t) - \boldsymbol{\epsilon}_{\boldsymbol{\psi}^*}(\boldsymbol{x}_t^{(g)},t), \boldsymbol{\epsilon}_{\boldsymbol{\varphi}^*}(\boldsymbol{x}_t^{(g)},t)\right\rangle\right]} \\[5pt] &\, + \color{red}{\mathbb{E}_{\boldsymbol{x}_t^{(g)}\sim p_{\boldsymbol{\theta}}(\boldsymbol{x}_t^{(g)})}\left[\left\langle\boldsymbol{\epsilon}_{\boldsymbol{\varphi}^*}(\boldsymbol{x}_t^{(g)},t) - \boldsymbol{\epsilon}_{\boldsymbol{\psi}^*}(\boldsymbol{x}_t^{(g)},t), \bar{\beta}_t\nabla_{\boldsymbol{x}_t^{(g)}}\log p_{\boldsymbol{\theta}}(\boldsymbol{x}_t^{(g)})\right\rangle\right]} \end{aligned} \end{equation} Next, we use an identity proved in "Generative Diffusion Model Chat (18): Score Matching = Conditional Score Matching" to simplify the red part: \begin{equation} \nabla_{\boldsymbol{x}_t}\log p(\boldsymbol{x}_t) = \mathbb{E}_{\boldsymbol{x}_0\sim p(\boldsymbol{x}_0|\boldsymbol{x}_t)}\left[\nabla_{\boldsymbol{x}_t} 
\log p(\boldsymbol{x}_t|\boldsymbol{x}_0)\right] \label{eq:id} \end{equation} This identity is derived from the definition of probability density and Bayes’ rule, and it does not depend on the specific forms of p(\boldsymbol{x}_t), p(\boldsymbol{x}_t|\boldsymbol{x}_0), p(\boldsymbol{x}_0|\boldsymbol{x}_t). Substituting this identity into the red part, we have: \begin{equation} \color{red}{\begin{aligned} &\,\mathbb{E}_{\boldsymbol{x}_t^{(g)}\sim p_{\boldsymbol{\theta}}(\boldsymbol{x}_t^{(g)})}\left[\left\langle\boldsymbol{\epsilon}_{\boldsymbol{\varphi}^*}(\boldsymbol{x}_t^{(g)},t) - \boldsymbol{\epsilon}_{\boldsymbol{\psi}^*}(\boldsymbol{x}_t^{(g)},t), \bar{\beta}_t\nabla_{\boldsymbol{x}_t^{(g)}}\log p_{\boldsymbol{\theta}}(\boldsymbol{x}_t^{(g)})\right\rangle\right] \\[5pt] = &\, \mathbb{E}_{\boldsymbol{x}_t^{(g)}\sim p_{\boldsymbol{\theta}}(\boldsymbol{x}_t^{(g)}),\boldsymbol{x}_0^{(g)}\sim p_{\boldsymbol{\theta}}(\boldsymbol{x}_0^{(g)}|\boldsymbol{x}_t^{(g)})}\left[\left\langle\boldsymbol{\epsilon}_{\boldsymbol{\varphi}^*}(\boldsymbol{x}_t^{(g)},t) - \boldsymbol{\epsilon}_{\boldsymbol{\psi}^*}(\boldsymbol{x}_t^{(g)},t), \bar{\beta}_t\nabla_{\boldsymbol{x}_t^{(g)}} \log p(\boldsymbol{x}_t^{(g)}|\boldsymbol{x}_0^{(g)})\right\rangle\right] \\[5pt] = &\, -\mathbb{E}_{\boldsymbol{x}_0^{(g)}\sim p_{\boldsymbol{\theta}}(\boldsymbol{x}_0^{(g)}),\boldsymbol{x}_t^{(g)}\sim p_{\boldsymbol{\theta}}(\boldsymbol{x}_t^{(g)}|\boldsymbol{x}_0^{(g)})}\left[\left\langle\boldsymbol{\epsilon}_{\boldsymbol{\varphi}^*}(\boldsymbol{x}_t^{(g)},t) - \boldsymbol{\epsilon}_{\boldsymbol{\psi}^*}(\boldsymbol{x}_t^{(g)},t), \frac{\boldsymbol{x}_t^{(g)} - \bar{\alpha}_t \boldsymbol{x}_0^{(g)}}{\bar{\beta}_t}\right\rangle\right] \\[5pt] = &\, -\mathbb{E}_{\boldsymbol{z},\boldsymbol{\varepsilon}\sim\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})}\left[\left\langle\boldsymbol{\epsilon}_{\boldsymbol{\varphi}^*}(\boldsymbol{x}_t^{(g)},t) - \boldsymbol{\epsilon}_{\boldsymbol{\psi}^*}(\boldsymbol{x}_t^{(g)},t), 
\boldsymbol{\varepsilon}\right\rangle\right] \end{aligned}} \end{equation} Combining this with the green part, we obtain the new loss function for the student model: \begin{equation} \mathcal{L}_2 = \mathbb{E}_{\boldsymbol{z},\boldsymbol{\varepsilon}\sim\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})}\left[\left\langle\boldsymbol{\epsilon}_{\boldsymbol{\varphi}^*}(\boldsymbol{x}_t^{(g)},t) - \boldsymbol{\epsilon}_{\boldsymbol{\psi}^*}(\boldsymbol{x}_t^{(g)},t), \boldsymbol{\epsilon}_{\boldsymbol{\varphi}^*}(\boldsymbol{x}_t^{(g)},t) - \boldsymbol{\varepsilon}\right\rangle\right]\label{eq:gloss-2} \end{equation} This is the core result of SiD. The experimental results in the original paper show that it can efficiently achieve distillation, whereas equation \eqref{eq:gloss-1} failed to produce meaningful results.
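As a numerical sanity check in the earlier 1-D linear-Gaussian toy setting (illustrative, not from the paper): when \boldsymbol{\psi} is exactly optimal, the identity transformation is exact, so Monte-Carlo estimates of \eqref{eq:gloss-1} and \eqref{eq:gloss-2} agree; they diverge only once \boldsymbol{\psi} is approximate, which is precisely where the weaker dependence of \eqref{eq:gloss-2} on \boldsymbol{\psi}^* pays off.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha_bar, beta_bar, theta = 0.8, 0.6, 2.0   # toy schedule / linear generator

# Closed-form optimal denoisers in the linear-Gaussian toy case:
eps_phi = lambda x: beta_bar * x             # teacher, for real data ~ N(0, 1)
v = alpha_bar**2 * theta**2 + beta_bar**2    # variance of noised student samples
eps_psi = lambda x: (beta_bar / v) * x       # exact psi* for the student data

z = rng.standard_normal(500_000)
eps = rng.standard_normal(500_000)
x_t_g = alpha_bar * theta * z + beta_bar * eps

d = eps_phi(x_t_g) - eps_psi(x_t_g)
L1 = np.mean(d * d)                          # eq. (gloss-1)
L2 = np.mean(d * (eps_phi(x_t_g) - eps))     # eq. (gloss-2), SiD's rewritten loss
assert np.isclose(L1, L2, atol=0.02)         # identical at the exact optimum psi*
```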

Compared to equation \eqref{eq:gloss-1}, equation \eqref{eq:gloss-2} clearly involves \boldsymbol{\psi}^* fewer times, so its dependence on \boldsymbol{\psi}^* is weaker. Furthermore, this equation is derived via an identity transformation of the optimal solution \boldsymbol{\epsilon}_{\boldsymbol{\psi}^*}(\boldsymbol{x}_t^{(g)},t)=-\bar{\beta}_t\nabla_{\boldsymbol{x}_t^{(g)}}\log p_{\boldsymbol{\theta}}(\boldsymbol{x}_t^{(g)}), meaning it (partially) incorporates exact information about \boldsymbol{\psi}^* in advance, which is another reason for its superiority.

Other Details

So far, the derivation in this article has essentially reproduced that of the original paper, but apart from some inconsistencies in notation, a few details differ. Let me clarify them briefly to avoid confusion.

First, the paper’s derivation assumes \bar{\alpha}_t=1, following the settings in "Elucidating the Design Space of Diffusion-Based Generative Models". Although \bar{\alpha}_t=1 is representative and simplifies the form, it does not cover all types of diffusion models, so the derivation in this article retains \bar{\alpha}_t. Second, the paper’s results are stated in terms of \bar{\boldsymbol{\mu}}(\boldsymbol{x}_t) = \frac{\boldsymbol{x}_t - \bar{\beta}_t \boldsymbol{\epsilon}(\boldsymbol{x}_t,t)}{\bar{\alpha}_t}, which departs from the common practice in diffusion models of working with \boldsymbol{\epsilon}(\boldsymbol{x}_t,t); I have yet to see what advantage the original paper’s parameterization offers.

Finally, the original paper found that the loss function \mathcal{L}_1 (equation \eqref{eq:gloss-1}) is so unstable that it often has a negative effect. Therefore, SiD ultimately takes the negative of equation \eqref{eq:gloss-1} as an additional loss function, weighted into the improved loss function \eqref{eq:gloss-2}, i.e., the final loss is \mathcal{L}_2 - \lambda\mathcal{L}_1 (Note: the weight notation in the original paper is \alpha, but since \alpha is used for the noise schedule here, \lambda is used instead). This can achieve even better distillation results in some cases. For specific experimental details and data, readers can refer to the original paper.
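Putting the pieces together, the final generator objective \mathcal{L}_2 - \lambda\mathcal{L}_1 might be sketched as follows (NumPy, 1-D toy functions; \lambda=1.2 is only an illustrative value, and a real implementation would also sample t, use neural networks, and backpropagate through \boldsymbol{x}_t^{(g)} into \boldsymbol{\theta}):

```python
import numpy as np

rng = np.random.default_rng(4)
alpha_bar, beta_bar = 0.8, 0.6   # toy schedule at one fixed t (illustrative)
lam = 1.2                        # weight lambda (illustrative value)

def sid_generator_loss(eps_phi, eps_psi, x0_g):
    """Final SiD generator loss L2 - lam * L1 on a batch of student samples x0_g."""
    eps = rng.standard_normal(x0_g.shape)
    x_t_g = alpha_bar * x0_g + beta_bar * eps
    d = eps_phi(x_t_g) - eps_psi(x_t_g)        # teacher minus fake-score residual
    L1 = np.mean(d * d)                        # eq. (gloss-1)
    L2 = np.mean(d * (eps_phi(x_t_g) - eps))   # eq. (gloss-2)
    return L2 - lam * L1

x0_g = rng.standard_normal(10_000)             # stand-in for g_theta(z) outputs
loss = sid_generator_loss(lambda x: 0.6 * x, lambda x: 0.5 * x, x0_g)
# When the two denoisers coincide, both terms vanish and the loss is exactly zero.
loss_zero = sid_generator_loss(lambda x: 0.6 * x, lambda x: 0.6 * x, x0_g)
```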

Compared to other distillation methods, the disadvantage of SiD is its high demand for GPU memory, as it needs to keep three models of the same size simultaneously: \boldsymbol{\epsilon}_{\boldsymbol{\varphi}}(\boldsymbol{x}_t,t), \boldsymbol{\epsilon}_{\boldsymbol{\psi}}(\boldsymbol{x}_t,t), and \boldsymbol{g}_{\boldsymbol{\theta}}(\boldsymbol{z}). Although they are not all backpropagated through at the same time, the total memory requirement is still roughly doubled. To address this, the SiD paper suggests at the end that one could try attaching LoRA modules to the pre-trained model to serve as the two additional models, further reducing memory requirements.

Extended Thinking

I believe that many readers with a solid theoretical foundation who think deeply could have come up with the initial "preliminary form", i.e., the alternating optimization of equations \eqref{eq:dloss} and \eqref{eq:gloss-1}, especially given that DDE already existed. The brilliance of SiD is that it did not stop there: it proposed the subsequent identity transformations, making training more stable and efficient. This reflects the authors’ deep understanding of diffusion models and optimization theory.

At the same time, SiD leaves many questions worth further thinking and exploration. For example, has the identity simplification of the student model’s loss \eqref{eq:gloss-2} reached its end? Not necessarily, because there is still \boldsymbol{\epsilon}_{\boldsymbol{\psi}^*}(\boldsymbol{x}_t^{(g)},t) on the left side of its inner product, which can be simplified in the same way. Specifically, we have: \begin{equation} \begin{aligned} &\,\mathbb{E}_{\boldsymbol{z},\boldsymbol{\varepsilon}\sim\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})}\left[\Vert\boldsymbol{\epsilon}_{\boldsymbol{\varphi}^*}(\boldsymbol{x}_t^{(g)},t) - \boldsymbol{\epsilon}_{\boldsymbol{\psi}^*}(\boldsymbol{x}_t^{(g)},t)\Vert^2\right] \\[5pt] =&\,\mathbb{E}_{\boldsymbol{z},\boldsymbol{\varepsilon}\sim\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})}\left[\Vert\boldsymbol{\epsilon}_{\boldsymbol{\varphi}^*}(\boldsymbol{x}_t^{(g)},t)\Vert^2 - 2\left\langle\boldsymbol{\epsilon}_{\boldsymbol{\varphi}^*}(\boldsymbol{x}_t^{(g)},t),\boldsymbol{\epsilon}_{\boldsymbol{\psi}^*}(\boldsymbol{x}_t^{(g)},t)\right\rangle + \Vert\boldsymbol{\epsilon}_{\boldsymbol{\psi}^*}(\boldsymbol{x}_t^{(g)},t)\Vert^2\right] \\[5pt] =&\,\mathbb{E}_{\boldsymbol{x}_t^{(g)}\sim p_{\boldsymbol{\theta}}(\boldsymbol{x}_t^{(g)})}\left[ \Vert\boldsymbol{\epsilon}_{\boldsymbol{\varphi}^*}(\boldsymbol{x}_t^{(g)},t)\Vert^2 - 2\left\langle\boldsymbol{\epsilon}_{\boldsymbol{\varphi}^*}(\boldsymbol{x}_t^{(g)},t),-\bar{\beta}_t\nabla_{\boldsymbol{x}_t^{(g)}}\log p_{\boldsymbol{\theta}}(\boldsymbol{x}_t^{(g)})\right\rangle + \left\langle\boldsymbol{\epsilon}_{\boldsymbol{\psi}^*}(\boldsymbol{x}_t^{(g)},t),-\bar{\beta}_t\nabla_{\boldsymbol{x}_t^{(g)}}\log p_{\boldsymbol{\theta}}(\boldsymbol{x}_t^{(g)})\right\rangle \right] \end{aligned} \end{equation} Each -\bar{\beta}_t\nabla_{\boldsymbol{x}_t^{(g)}}\log p_{\boldsymbol{\theta}}(\boldsymbol{x}_t^{(g)}) can be transformed into a single \boldsymbol{\varepsilon} using the same identity \eqref{eq:id} 
(but note that in \Vert\boldsymbol{\epsilon}_{\boldsymbol{\psi}^*}(\boldsymbol{x}_t^{(g)},t)\Vert^2=\langle\boldsymbol{\epsilon}_{\boldsymbol{\psi}^*}(\boldsymbol{x}_t^{(g)},t),\boldsymbol{\epsilon}_{\boldsymbol{\psi}^*}(\boldsymbol{x}_t^{(g)},t)\rangle, only one can be transformed). Equation \eqref{eq:gloss-2} only transforms a part. Would it be better to transform everything? Since there are no experimental results, it is currently unknown. But there is a particularly interesting form: if only the middle part above is transformed, the loss function can be written as: \begin{equation} \begin{aligned} &\,\mathbb{E}_{\boldsymbol{z},\boldsymbol{\varepsilon}\sim\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})}\left[\Vert\boldsymbol{\epsilon}_{\boldsymbol{\varphi}^*}(\boldsymbol{x}_t^{(g)},t) - \boldsymbol{\epsilon}_{\boldsymbol{\psi}^*}(\boldsymbol{x}_t^{(g)},t)\Vert^2\right] \\[5pt] =&\,\mathbb{E}_{\boldsymbol{z},\boldsymbol{\varepsilon}\sim\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})}\left[\Vert\boldsymbol{\epsilon}_{\boldsymbol{\varphi}^*}(\boldsymbol{x}_t^{(g)},t)\Vert^2 - 2\left\langle\boldsymbol{\epsilon}_{\boldsymbol{\varphi}^*}(\boldsymbol{x}_t^{(g)},t),\boldsymbol{\varepsilon}\right\rangle + \Vert\boldsymbol{\epsilon}_{\boldsymbol{\psi}^*}(\boldsymbol{x}_t^{(g)},t)\Vert^2\right] \\[5pt] =&\,\mathbb{E}_{\boldsymbol{z},\boldsymbol{\varepsilon}\sim\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})}\left[\Vert\boldsymbol{\epsilon}_{\boldsymbol{\varphi}^*}(\boldsymbol{x}_t^{(g)},t) - \boldsymbol{\varepsilon}\Vert^2 + \Vert\boldsymbol{\epsilon}_{\boldsymbol{\psi}^*}(\boldsymbol{x}_t^{(g)},t)\Vert^2\right] + \text{const} \end{aligned}\label{eq:gloss-3} \end{equation} This is the loss for the student model (generator). 
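In the same 1-D toy setting used above (illustrative only), one can check numerically that, with \boldsymbol{\psi} exactly optimal, \eqref{eq:gloss-3} matches \mathcal{L}_1 up to the additive constant, which in one dimension is \mathbb{E}[\Vert\boldsymbol{\varepsilon}\Vert^2]=1:

```python
import numpy as np

rng = np.random.default_rng(5)
alpha_bar, beta_bar, theta = 0.8, 0.6, 2.0   # toy schedule / linear generator
v = alpha_bar**2 * theta**2 + beta_bar**2    # variance of noised student samples
eps_phi = lambda x: beta_bar * x             # teacher optimum for data ~ N(0, 1)
eps_psi = lambda x: (beta_bar / v) * x       # exact fake-score optimum psi*

z = rng.standard_normal(500_000)
eps = rng.standard_normal(500_000)
x_t_g = alpha_bar * theta * z + beta_bar * eps

L1 = np.mean((eps_phi(x_t_g) - eps_psi(x_t_g)) ** 2)             # eq. (gloss-1)
g3 = np.mean((eps_phi(x_t_g) - eps) ** 2 + eps_psi(x_t_g) ** 2)  # eq. (gloss-3)
assert np.isclose(L1, g3 - 1.0, atol=0.02)   # equal up to the constant E||eps||^2
```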
Now let’s compare it with the loss of the student data denoising model \eqref{eq:dloss}: \begin{equation} \boldsymbol{\psi}^* = \mathop{\text{argmin}}_{\boldsymbol{\psi}} \mathbb{E}_{\boldsymbol{z},\boldsymbol{\varepsilon}\sim\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})}\left[\Vert\boldsymbol{\epsilon}_{\boldsymbol{\psi}}(\boldsymbol{x}_t^{(g)},t) - \boldsymbol{\varepsilon}\Vert^2\right]\label{eq:dloss-1} \end{equation} Looking at these two equations together, we can see that the student model is actually aligning with the teacher model and trying to move away from the denoising model trained on student data. Formally, this is very similar to LSGAN, where \boldsymbol{\epsilon}_{\boldsymbol{\psi}}(\boldsymbol{x}_t^{(g)},t) is like a GAN discriminator. The difference is that in a GAN, the discriminator usually has two loss terms and the generator has one, while in SiD, it is reversed. This actually reflects two different learning approaches:

  1. GAN: Initially, both the forger (generator) and the appraiser (discriminator) are novices. The appraiser improves their appraisal level by comparing real items and fakes, and the forger improves their forgery level through the appraiser’s feedback.

  2. SiD: There are no real items at all, but there is an absolutely authoritative master appraiser (teacher model). The forger (student model) continuously produces fakes while training its own appraiser (the denoising model trained on student data), and then improves its forgery level through the communication between its own appraiser and the master.

Some readers might ask: why doesn’t the forger in SiD consult the master directly, rather than obtaining feedback indirectly by training its own appraiser? The reason is that in direct communication with the master, a potential failure mode is that the two might dwell on the techniques of a single work, so that the forger eventually produces only one type of fake that passes as genuine (mode collapse). Training its own appraiser avoids this problem to some extent, because the forger’s learning strategy is to "win more praise from the master while minimizing praise from its own appraiser." If the forger kept producing only one type of fake, praise from both the master and its own appraiser would increase, which contradicts this strategy, thus forcing the forger to keep developing new products rather than stagnating.

Furthermore, readers may notice that the entire SiD training does not use any information from the recursive sampling of the diffusion model. In other words, it purely utilizes the denoising model trained through the denoising training method. A natural question is: if the goal is simply to train a single-step generative model rather than distilling an existing diffusion model, would it be better to train a denoising model with only a single noise intensity? For example, like DDE, fixing \bar{\alpha}_t=1 and \bar{\beta}_t=\beta=\text{some constant} to train a denoising model, and then using it to repeat the SiD training process. Would this simplify the training difficulty and improve efficiency? This is also a question worth further confirmation.

Summary

In this article, we introduced a new scheme for distilling diffusion models into single-step generative models. Its idea can be traced back to the work of training generative models using denoising autoencoders in the past two years. It does not require the teacher model’s real training set, nor does it require iterating the teacher model to generate sample pairs. Instead, it introduces GAN-like alternating training and proposes key identity transformations to stabilize the training process. There is much to learn from the entire method.