Talk on Generative Diffusion Models (9): Conditional Control of Generation Results · English (unofficial) translations of posts at kexue.fm

The previous articles in this series have mostly focused on theoretical results. In this article, we will discuss a topic of significant practical value: conditional control of generation results.

The development of diffusion models closely mirrors that of other generative models such as VAEs, GANs, and flow-based models: unconditional generation appears first, followed closely by conditional generation. While unconditional generation is often used to explore the upper limits of performance, conditional generation is more focused on the application level, as it allows us to control the output according to our intentions. Since the inception of DDPM, many works on conditional diffusion models have emerged. In fact, it could be said that it was conditional diffusion models that truly made diffusion models popular, such as the well-known text-to-image models DALL·E 2 and Imagen.

In this article, we will conduct a simple study and summary of the theoretical foundations of conditional diffusion models.

Technical Analysis

From a methodological perspective, there are two ways to achieve conditional control: post-hoc modification (Classifier-Guidance) and pre-training (Classifier-Free).

For most people, the cost of training a SOTA-level diffusion model is too high, whereas training a classifier is relatively acceptable. Therefore, many choose to reuse a pre-trained unconditional diffusion model and use a classifier to adjust the generation process to achieve controlled generation. This is the post-hoc modification approach known as Classifier-Guidance. On the other hand, companies with abundant resources like Google and OpenAI have no shortage of data and computing power, so they prefer to incorporate conditional signals directly into the training process of the diffusion model to achieve better generation results. This is the pre-training approach known as Classifier-Free.

The Classifier-Guidance scheme first appeared in "Diffusion Models Beat GANs on Image Synthesis", where it was initially used for class-conditioned generation. Later, "More Control for Free! Image Synthesis with Semantic Diffusion Guidance" generalized the concept of a "Classifier," allowing it to generate based on images or text. The training cost of the Classifier-Guidance scheme is relatively low (readers familiar with NLP might find it similar to the PPLM model), but the inference cost is higher, and the control over details is usually not as precise.

As for the Classifier-Free scheme, it originated from "Classifier-Free Diffusion Guidance". Subsequent high-profile models such as DALL·E 2 and Imagen are essentially built on it. It is worth mentioning that although the paper was only uploaded to arXiv last month, it was actually accepted at NeurIPS 2021 last year. The Classifier-Free scheme itself does not involve many theoretical tricks; it is the most straightforward approach to conditional diffusion models. Its late appearance was likely due to the high cost of re-training diffusion models. Given sufficient data and computing power, the Classifier-Free scheme has demonstrated astonishing control over details.

Conditional Input

To put it simply, the Classifier-Free scheme is costly to train but "technically straightforward." Therefore, the majority of this article will focus on the Classifier-Guidance scheme, with the Classifier-Free scheme briefly introduced at the end.

Through the analysis in previous articles, readers should already know that the most critical step in generative diffusion models is the construction of the generation process p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t). For generation conditioned on \boldsymbol{y}, this simply means replacing p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t) with p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{y}). In other words, the input \boldsymbol{y} is added to the generation process. To reuse a pre-trained unconditional model p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t), we use Bayes’ theorem:

p(\boldsymbol{x}_{t-1}|\boldsymbol{y}) = \frac{p(\boldsymbol{x}_{t-1})p(\boldsymbol{y}|\boldsymbol{x}_{t-1})}{p(\boldsymbol{y})}

By adding the condition \boldsymbol{x}_t to each term, we get:

p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{y}) = \frac{p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t)p(\boldsymbol{y}|\boldsymbol{x}_{t-1}, \boldsymbol{x}_t)}{p(\boldsymbol{y}|\boldsymbol{x}_t)} \label{eq:bayes-1}

Note that in the forward process, \boldsymbol{x}_t is obtained by adding noise to \boldsymbol{x}_{t-1}. Noise does not help with classification, so adding \boldsymbol{x}_t provides no additional benefit for classification. Thus, p(\boldsymbol{y}|\boldsymbol{x}_{t-1}, \boldsymbol{x}_t)=p(\boldsymbol{y}|\boldsymbol{x}_{t-1}), which leads to:

p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{y}) = \frac{p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t)p(\boldsymbol{y}|\boldsymbol{x}_{t-1})}{p(\boldsymbol{y}|\boldsymbol{x}_t)} = p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t) e^{\log p(\boldsymbol{y}|\boldsymbol{x}_{t-1}) - \log p(\boldsymbol{y}|\boldsymbol{x}_t)} \label{eq:bayes-2}
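The factorization above can be sanity-checked numerically. The following is a minimal sketch on a toy discrete chain y -> x_{t-1} -> x_t; the state-space sizes and the random probability tables are assumptions made purely for illustration (real diffusion states are continuous). It checks that p(x_{t-1}|x_t, y) agrees with p(x_{t-1}|x_t) p(y|x_{t-1}) / p(y|x_t) whenever x_t depends on y only through x_{t-1}.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_conditional(rows, cols):
    """Random table whose rows each sum to 1, read as p(col | row)."""
    table = rng.random((rows, cols))
    return table / table.sum(axis=1, keepdims=True)

# Toy chain y -> x_{t-1} -> x_t with small discrete state spaces (illustrative only).
p_y = np.array([0.3, 0.7])                 # p(y)
p_prev_given_y = random_conditional(2, 3)  # p(x_{t-1} | y)
p_t_given_prev = random_conditional(3, 4)  # p(x_t | x_{t-1}), the "noising" step

# Joint p(y, x_{t-1}, x_t), indexed as [y, x_{t-1}, x_t].
joint = p_y[:, None, None] * p_prev_given_y[:, :, None] * p_t_given_prev[None, :, :]

# Left-hand side: p(x_{t-1} | x_t, y).
lhs = joint / joint.sum(axis=1, keepdims=True)

# Right-hand side: p(x_{t-1} | x_t) * p(y | x_{t-1}) / p(y | x_t).
p_prev_t = joint.sum(axis=0)                                      # p(x_{t-1}, x_t)
p_prev_given_t = p_prev_t / p_prev_t.sum(axis=0, keepdims=True)   # [x_{t-1}, x_t]
p_y_prev = joint.sum(axis=2)                                      # p(y, x_{t-1})
p_y_given_prev = p_y_prev / p_y_prev.sum(axis=0, keepdims=True)   # [y, x_{t-1}]
p_y_t = joint.sum(axis=1)                                         # p(y, x_t)
p_y_given_t = p_y_t / p_y_t.sum(axis=0, keepdims=True)            # [y, x_t]

rhs = p_prev_given_t[None, :, :] * p_y_given_prev[:, :, None] / p_y_given_t[:, None, :]

max_gap = np.abs(lhs - rhs).max()
print("largest discrepancy:", max_gap)
```

The discrepancy should be at the level of floating-point rounding. If the joint were built without the chain structure (for example, letting x_t depend on y directly), the two sides would generally differ, which is exactly where the assumption p(y|x_{t-1}, x_t) = p(y|x_{t-1}) enters.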

Approximate Distribution

For readers who have read "Talk on Generative Diffusion Models (5): SDE Perspective of the General Framework", the following process might seem familiar. However, even if you haven’t read it, we will still provide a complete derivation below.

When T is sufficiently large, the variance of p(\boldsymbol{x}_t|\boldsymbol{x}_{t-1}) is small enough that the probability is significantly greater than 0 only when \boldsymbol{x}_t is very close to \boldsymbol{x}_{t-1}. The reverse is also true: p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{y}) or p(\boldsymbol{x}_t|\boldsymbol{x}_{t-1}, \boldsymbol{y}) is significantly greater than 0 only when \boldsymbol{x}_t and \boldsymbol{x}_{t-1} are very close. We only need to focus on the probability changes within this range. To this end, we use a Taylor expansion:

\log p(\boldsymbol{y}|\boldsymbol{x}_{t-1}) - \log p(\boldsymbol{y}|\boldsymbol{x}_t) \approx (\boldsymbol{x}_{t-1} - \boldsymbol{x}_t) \cdot \nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{y}|\boldsymbol{x}_t)

Strictly speaking, there is also a term regarding the change in t, but that term is independent of \boldsymbol{x}_{t-1} and acts as a constant that does not affect the probability of \boldsymbol{x}_{t-1}, so we have omitted it.

Assuming the original distribution is p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t)=\mathcal{N}(\boldsymbol{x}_{t-1};\boldsymbol{\mu}(\boldsymbol{x}_t),\sigma_t^2\boldsymbol{I}) \propto e^{-\Vert \boldsymbol{x}_{t-1} - \boldsymbol{\mu}(\boldsymbol{x}_t)\Vert^2/2\sigma_t^2}, then we approximately have:

\begin{aligned}
p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{y}) \propto&\, e^{-\Vert \boldsymbol{x}_{t-1} - \boldsymbol{\mu}(\boldsymbol{x}_t)\Vert^2/2\sigma_t^2 + (\boldsymbol{x}_{t-1} - \boldsymbol{x}_t)\cdot\nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{y}|\boldsymbol{x}_t)} \\
\propto&\, e^{-\Vert \boldsymbol{x}_{t-1} - \boldsymbol{\mu}(\boldsymbol{x}_t) - \sigma_t^2 \nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{y}|\boldsymbol{x}_t)\Vert^2/2\sigma_t^2}
\end{aligned}

From this result, we can see that p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{y}) is approximately \mathcal{N}(\boldsymbol{x}_{t-1};\boldsymbol{\mu}(\boldsymbol{x}_t) + \sigma_t^2 \nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{y}|\boldsymbol{x}_t),\sigma_t^2\boldsymbol{I}). Therefore, we only need to modify the sampling in the generation process to:

\boldsymbol{x}_{t-1} = \boldsymbol{\mu}(\boldsymbol{x}_t) + \underbrace{\sigma_t^2 \nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{y}|\boldsymbol{x}_t)}_{\text{New term}} + \sigma_t\boldsymbol{\varepsilon},\quad \boldsymbol{\varepsilon}\sim \mathcal{N}(\boldsymbol{0},\boldsymbol{I})

This is the core result of the Classifier-Guidance scheme. Note that the input to p(\boldsymbol{y}|\boldsymbol{x}_t) is the noisy sample \boldsymbol{x}_t, which means we need a model capable of making predictions on noisy samples.

If we only have a model p_o(\boldsymbol{y}|\boldsymbol{x}) trained on clean samples, we can consider:

p(\boldsymbol{y}|\boldsymbol{x}_t) = p_{o}(\boldsymbol{y}|\boldsymbol{\mu}(\boldsymbol{x}_t))

That is, use \boldsymbol{\mu}(\cdot) to denoise \boldsymbol{x}_t before passing it to p_o(\boldsymbol{y}|\boldsymbol{x}), thereby avoiding the cost of training a separate classifier on noisy samples.
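As a rough illustration of the modified sampling rule, here is a minimal NumPy sketch of a single guided reverse step. The names `guided_step`, `toy_mu`, `toy_grad_log_p` and `W_CLS` are hypothetical: `toy_mu` stands in for the mean predicted by a pre-trained unconditional model, and the classifier is assumed to be a simple logistic model so that its gradient has a closed form. In practice the gradient would come from automatic differentiation of a classifier trained on noisy samples.

```python
import numpy as np

def guided_step(x_t, mu_fn, grad_log_p_fn, sigma_t, rng):
    """Sample x_{t-1} = mu(x_t) + sigma_t^2 * grad log p(y|x_t) + sigma_t * eps."""
    eps = rng.standard_normal(x_t.shape)
    return mu_fn(x_t) + sigma_t**2 * grad_log_p_fn(x_t) + sigma_t * eps

# Hypothetical stand-ins, chosen only so the example runs end to end.
W_CLS = np.array([1.0, -2.0])  # weights of a toy logistic classifier for y = 1

def toy_mu(x_t):
    """Placeholder for the mean of a pre-trained unconditional model."""
    return 0.9 * x_t

def toy_grad_log_p(x_t):
    """Gradient of log sigmoid(w . x_t) with respect to x_t."""
    prob = 1.0 / (1.0 + np.exp(-(W_CLS @ x_t)))
    return (1.0 - prob) * W_CLS

x_t = np.array([0.2, 0.1])
sigma_t = 0.5

# Use the same seed for both calls so that only the guidance term differs.
x_prev = guided_step(x_t, toy_mu, toy_grad_log_p, sigma_t, np.random.default_rng(0))
x_prev_plain = guided_step(x_t, toy_mu, lambda x: np.zeros_like(x), sigma_t,
                           np.random.default_rng(0))

shift = x_prev - x_prev_plain  # equals sigma_t^2 * grad log p(y|x_t)
print("shift caused by guidance:", shift)
```

With the noise held fixed, the only difference between the guided and unguided samples is the new term, which moves the sample in the direction that increases the classifier's log-probability for y.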

Gradient Scaling

The original paper ("Diffusion Models Beat GANs on Image Synthesis") found that introducing a scaling parameter \gamma to the classifier gradient can better adjust the generation effect:

\boldsymbol{x}_{t-1} = \boldsymbol{\mu}(\boldsymbol{x}_t) + \sigma_t^2 \gamma \nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{y}|\boldsymbol{x}_t) + \sigma_t\boldsymbol{\varepsilon},\quad \boldsymbol{\varepsilon}\sim \mathcal{N}(\boldsymbol{0},\boldsymbol{I}) \label{eq:gamma-sample}

When \gamma > 1, the generation process uses more signal from the classifier, which improves the correlation between the generated result and the input signal \boldsymbol{y}, but correspondingly reduces the diversity of the generated results. Conversely, when \gamma < 1, the correlation decreases but the diversity increases.
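The trade-off between correlation and diversity can be seen in closed form in a deliberately simple toy setting, which is an assumption of this sketch and not something taken from the paper: a one-dimensional prior N(x; 0, 1) and a Gaussian "classifier" p(y|x) = N(y; x, s2). Raising p(y|x) to the power \gamma and multiplying by the prior again gives a Gaussian in x, whose mean and variance are computed by the hypothetical helper `tempered_posterior` below.

```python
def tempered_posterior(y, s2, gamma):
    """Mean and variance of the density proportional to N(x; 0, 1) * N(y; x, s2)**gamma."""
    precision = 1.0 + gamma / s2          # prior precision plus scaled likelihood precision
    mean = (gamma * y / s2) / precision
    return mean, 1.0 / precision

gammas = (0.5, 1.0, 2.0, 5.0)
results = [tempered_posterior(y=2.0, s2=1.0, gamma=g) for g in gammas]

for g, (mean, var) in zip(gammas, results):
    print(f"gamma={g}: mean={mean:.3f}, variance={var:.3f}")
```

In this toy case, a larger \gamma pulls the mean toward the condition y (higher correlation) while shrinking the variance (lower diversity), which matches the qualitative behaviour described above.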

How can we understand this parameter theoretically? The original paper suggests interpreting it as increasing the focus of the distribution through a power operation, i.e., defining:

\tilde{p}(\boldsymbol{y}|\boldsymbol{x}_t) = \frac{p^{\gamma}(\boldsymbol{y}|\boldsymbol{x}_t)}{Z(\boldsymbol{x}_t)},\quad Z(\boldsymbol{x}_t)=\sum_{\boldsymbol{y}} p^{\gamma}(\boldsymbol{y}|\boldsymbol{x}_t)

As \gamma increases, the prediction of \tilde{p}(\boldsymbol{y}|\boldsymbol{x}_t) becomes closer to a one-hot distribution. Using this as the classifier for Classifier-Guidance, the generation process will tend to pick samples with high classification confidence.

However, while this perspective provides some reference value, it is not entirely correct because:

\nabla_{\boldsymbol{x}_t}\log \tilde{p}(\boldsymbol{y}|\boldsymbol{x}_t) = \gamma\nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{y}|\boldsymbol{x}_t) - \nabla_{\boldsymbol{x}_t} \log Z(\boldsymbol{x}_t) \neq \gamma\nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{y}|\boldsymbol{x}_t)

The original paper mistakenly assumed Z(\boldsymbol{x}_t) is a constant, so \nabla_{\boldsymbol{x}_t} \log Z(\boldsymbol{x}_t)=0. But in fact, when \gamma \neq 1, Z(\boldsymbol{x}_t) explicitly depends on \boldsymbol{x}_t. I have thought about whether there is any remedy for this, but unfortunately, there are no clear results. It seems one can only loosely assume that the gradient properties at \gamma=1 (where Z(\boldsymbol{x}_t)=1) can approximately generalize to the case where \gamma \neq 1.
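The dependence of Z(\boldsymbol{x}_t) on \boldsymbol{x}_t is easy to observe numerically. The sketch below assumes a hypothetical three-class linear softmax classifier (the weight matrix `W` is made up for illustration) and estimates the gradient of \log Z by central finite differences at \gamma = 1 and \gamma = 2.

```python
import numpy as np

# Hypothetical 3-class linear softmax classifier on 2-D inputs (illustrative weights).
W = np.array([[1.0, -0.5],
              [0.2, 0.8],
              [-1.0, 0.3]])

def log_probs(x):
    """log p(y|x) for every class y."""
    logits = W @ x
    return logits - np.log(np.sum(np.exp(logits)))

def log_Z(x, gamma):
    """log of Z(x) = sum_y p(y|x)**gamma."""
    return np.log(np.sum(np.exp(gamma * log_probs(x))))

def numerical_grad(f, x, h=1e-5):
    """Central finite-difference gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

x = np.array([0.3, -0.7])
grad_at_1 = numerical_grad(lambda v: log_Z(v, 1.0), x)
grad_at_2 = numerical_grad(lambda v: log_Z(v, 2.0), x)

print("grad log Z at gamma=1:", grad_at_1)
print("grad log Z at gamma=2:", grad_at_2)
```

At \gamma = 1 the gradient vanishes up to rounding because Z is identically 1, whereas at \gamma = 2 it is clearly nonzero, so the extra term cannot simply be dropped.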

Similarity Control

In fact, the best way to understand \gamma \neq 1 is to abandon the interpretation of p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{y}) via Bayes’ theorem in Eq. [eq:bayes-1] and [eq:bayes-2], and instead directly define:

p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{y}) = \frac{p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t) e^{\gamma\cdot\text{sim}(\boldsymbol{x}_{t-1}, \boldsymbol{y})}}{Z(\boldsymbol{x}_t, \boldsymbol{y})},\quad Z(\boldsymbol{x}_t,\boldsymbol{y})=\sum_{\boldsymbol{x}_{t-1}} p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t) e^{\gamma\cdot\text{sim}(\boldsymbol{x}_{t-1}, \boldsymbol{y})}

where \text{sim}(\boldsymbol{x}_{t-1}, \boldsymbol{y}) is some measure of similarity or correlation between the generated result \boldsymbol{x}_{t-1} and the condition \boldsymbol{y}. In this view, \gamma is directly integrated into the definition of p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{y}), directly controlling the correlation between the result and the condition. As \gamma increases, the model tends to generate \boldsymbol{x}_{t-1} that is more correlated with \boldsymbol{y}.

To further obtain an approximate result for sampling, we can expand at \boldsymbol{x}_{t-1}=\boldsymbol{x}_t (or at \boldsymbol{x}_{t-1}=\boldsymbol{\mu}(\boldsymbol{x}_t), similar to before):

e^{\gamma\cdot\text{sim}(\boldsymbol{x}_{t-1}, \boldsymbol{y})}\approx e^{\gamma\cdot\text{sim}(\boldsymbol{x}_t, \boldsymbol{y}) + \gamma\cdot(\boldsymbol{x}_{t-1}-\boldsymbol{x}_t)\cdot\nabla_{\boldsymbol{x}_t}\text{sim}(\boldsymbol{x}_t, \boldsymbol{y})}

Assuming this approximation is sufficient, and removing terms independent of \boldsymbol{x}_{t-1}, we get:

p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{y})\propto p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t)e^{\gamma\cdot(\boldsymbol{x}_{t-1}-\boldsymbol{x}_t)\cdot\nabla_{\boldsymbol{x}_t}\text{sim}(\boldsymbol{x}_t, \boldsymbol{y})}

As before, substituting p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t)=\mathcal{N}(\boldsymbol{x}_{t-1};\boldsymbol{\mu}(\boldsymbol{x}_t),\sigma_t^2\boldsymbol{I}) and completing the square yields:

p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{y})\approx \mathcal{N}(\boldsymbol{x}_{t-1}; \boldsymbol{\mu}(\boldsymbol{x}_t) + \sigma_t^2\gamma \nabla_{\boldsymbol{x}_t} \text{sim}(\boldsymbol{x}_t, \boldsymbol{y}),\sigma_t^2\boldsymbol{I})
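For readers who want the completing-the-square step spelled out, the following worked derivation may help. The shorthand \boldsymbol{g} = \nabla_{\boldsymbol{x}_t}\text{sim}(\boldsymbol{x}_t, \boldsymbol{y}) and \boldsymbol{\mu} = \boldsymbol{\mu}(\boldsymbol{x}_t) is introduced here only for brevity; setting \gamma = 1 and \boldsymbol{g} = \nabla_{\boldsymbol{x}_t}\log p(\boldsymbol{y}|\boldsymbol{x}_t) recovers the earlier Classifier-Guidance case.

```latex
\begin{aligned}
&-\frac{\Vert \boldsymbol{x}_{t-1} - \boldsymbol{\mu}\Vert^2}{2\sigma_t^2}
  + \gamma\,(\boldsymbol{x}_{t-1} - \boldsymbol{x}_t)\cdot\boldsymbol{g} \\
=\;& -\frac{\Vert \boldsymbol{x}_{t-1} - \boldsymbol{\mu}\Vert^2
  - 2\sigma_t^2\gamma\,(\boldsymbol{x}_{t-1} - \boldsymbol{\mu})\cdot\boldsymbol{g}}{2\sigma_t^2}
  + \gamma\,(\boldsymbol{\mu} - \boldsymbol{x}_t)\cdot\boldsymbol{g} \\
=\;& -\frac{\Vert \boldsymbol{x}_{t-1} - \boldsymbol{\mu} - \sigma_t^2\gamma\,\boldsymbol{g}\Vert^2}{2\sigma_t^2}
  + \underbrace{\frac{\sigma_t^2\gamma^2\Vert\boldsymbol{g}\Vert^2}{2}
  + \gamma\,(\boldsymbol{\mu} - \boldsymbol{x}_t)\cdot\boldsymbol{g}}_{\text{independent of } \boldsymbol{x}_{t-1}}
\end{aligned}
```

The underbraced terms do not involve \boldsymbol{x}_{t-1}, so they are absorbed into the normalization constant, leaving a Gaussian in \boldsymbol{x}_{t-1} with mean \boldsymbol{\mu} + \sigma_t^2\gamma\,\boldsymbol{g} and unchanged variance \sigma_t^2.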

In this way, we don’t need to worry about the probabilistic meaning of p(\boldsymbol{y}|\boldsymbol{x}_t). We only need to directly define the metric function \text{sim}(\boldsymbol{x}_t, \boldsymbol{y}). Here, \boldsymbol{y} is no longer limited to "classes"; it can be text, images, or any other input signal. The usual approach is to encode them into feature vectors using their respective encoders and then use cosine similarity:

\text{sim}(\boldsymbol{x}_t, \boldsymbol{y}) = \frac{E_1(\boldsymbol{x}_t)\cdot E_2(\boldsymbol{y})}{\Vert E_1(\boldsymbol{x}_t)\Vert \Vert E_2(\boldsymbol{y})\Vert}

It should be noted that the intermediate \boldsymbol{x}_t contains Gaussian noise, so the encoder E_1 generally cannot be a standard encoder trained on clean data; it is better to fine-tune it using noisy data. Furthermore, for style transfer, one usually uses Gram matrix distance instead of cosine similarity. These choices depend on the specific scenario. The above is a series of results from the paper "More Control for Free! Image Synthesis with Semantic Diffusion Guidance". For more details, please refer to the original paper.
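To make the similarity gradient concrete, here is a small sketch under a strong simplifying assumption: the encoder E_1 is taken to be a linear map `A`, and E_2(\boldsymbol{y}) is represented by a fixed feature vector `v`. Both are hypothetical placeholders for real (typically neural) encoders, where the gradient would be obtained by automatic differentiation. The analytic gradient of the cosine similarity is compared against a finite-difference estimate.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity between two feature vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def grad_sim_wrt_x(A, x, v):
    """Gradient of cosine_sim(A @ x, v) with respect to x (linear encoder assumed)."""
    u = A @ x
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    grad_u = v / (nu * nv) - (u @ v) * u / (nu**3 * nv)  # d sim / d u
    return A.T @ grad_u                                  # chain rule through u = A x

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))   # hypothetical linear encoder E_1
x_t = rng.standard_normal(3)      # noisy sample
v = rng.standard_normal(4)        # hypothetical feature vector E_2(y)

analytic = grad_sim_wrt_x(A, x_t, v)

# Central finite differences as an independent check.
h = 1e-6
numeric = np.array([
    (cosine_sim(A @ (x_t + h * e), v) - cosine_sim(A @ (x_t - h * e), v)) / (2 * h)
    for e in np.eye(3)
])

# Guided mean with placeholder values for mu(x_t), sigma_t and gamma.
mu, sigma_t, gamma = 0.9 * x_t, 0.5, 3.0
guided_mean = mu + sigma_t**2 * gamma * analytic
print("analytic gradient:", analytic)
print("numeric gradient: ", numeric)
```

Because cosine similarity ignores the scale of the features and the encoder here is linear, the gradient is orthogonal to \boldsymbol{x}_t in this toy setting; with a nonlinear encoder that property no longer holds in general.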

Continuous Case

Through the previous derivations, we obtained the correction term for the mean as \sigma_t^2 \gamma \nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{y}|\boldsymbol{x}_t) or \sigma_t^2\gamma \nabla_{\boldsymbol{x}_t} \text{sim}(\boldsymbol{x}_t, \boldsymbol{y}). They share a common characteristic: when \sigma_t=0, the correction term also becomes 0, and the control fails.

Can the variance \sigma_t^2 in the generation process be zero? Certainly. For example, DDIM, introduced in "Talk on Generative Diffusion Models (4): DDIM = High-level DDPM", is a generation process with zero variance. How do we perform controlled generation in this case? Here, we need to use the general SDE-based results introduced in "Talk on Generative Diffusion Models (6): ODE Perspective of the General Framework". There, we introduced that for the forward SDE:

d\boldsymbol{x} = \boldsymbol{f}_t(\boldsymbol{x}) dt + g_t d\boldsymbol{w}

The corresponding most general reverse SDE is:

d\boldsymbol{x} = \left(\boldsymbol{f}_t(\boldsymbol{x}) - \frac{1}{2}(g_t^2 + \sigma_t^2)\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x})\right) dt + \sigma_t d\boldsymbol{w}

This allows us to freely choose the reverse variance \sigma_t^2. DDPM and DDIM can be considered special cases, where \sigma_t=0 corresponds to a generalized DDIM. As we can see, the only part of the reverse SDE related to the input is \nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x}). If we want to perform conditional generation, we naturally replace it with \nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x}|\boldsymbol{y}).

Using Bayes’ theorem:

\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x}|\boldsymbol{y}) = \nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x}) + \nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{y}|\boldsymbol{x})

Under standard parameterization, \nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x}) = -\frac{\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)}{\bar{\beta}_t}, therefore:

\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x}|\boldsymbol{y}) = -\frac{\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)}{\bar{\beta}_t} + \nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{y}|\boldsymbol{x}) = -\frac{\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) - \bar{\beta}_t\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{y}|\boldsymbol{x})}{\bar{\beta}_t}

This means that regardless of the generation variance, we only need to replace \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) with \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) - \bar{\beta}_t\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{y}|\boldsymbol{x}) to achieve conditional control. Thus, from the unified perspective of SDEs, we can obtain the most general results for the Classifier-Guidance scheme very simply and directly.
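The replacement rule can be sketched in a few lines. In the code below, `guided_eps` implements the substitution stated above and `score_from_eps` converts a noise prediction into a score. The deterministic update `ddim_step` is an assumption of this sketch: it uses the common zero-variance form written in this series' notation (\boldsymbol{x}_t = \bar{\alpha}_t\boldsymbol{x}_0 + \bar{\beta}_t\boldsymbol{\varepsilon} with \bar{\alpha}_t^2 + \bar{\beta}_t^2 = 1), and the true noise is used in place of a trained network's prediction so that the example is self-contained.

```python
import numpy as np

def score_from_eps(eps, bar_beta_t):
    """Score implied by a noise prediction: grad log p_t(x) = -eps / bar_beta_t."""
    return -eps / bar_beta_t

def guided_eps(eps_pred, grad_log_p_y, bar_beta_t):
    """Replace eps with eps - bar_beta_t * grad log p_t(y|x)."""
    return eps_pred - bar_beta_t * grad_log_p_y

def ddim_step(x_t, eps, bar_alpha_t, bar_beta_t, bar_alpha_prev, bar_beta_prev):
    """Zero-variance update (assumed form): estimate x_0, then re-noise with the same eps."""
    x0_est = (x_t - bar_beta_t * eps) / bar_alpha_t
    return bar_alpha_prev * x0_est + bar_beta_prev * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(3)
noise = rng.standard_normal(3)           # stands in for eps_theta(x_t, t)
bar_alpha_t, bar_beta_t = 0.6, 0.8       # bar_alpha^2 + bar_beta^2 = 1
bar_alpha_prev, bar_beta_prev = 0.8, 0.6
x_t = bar_alpha_t * x0 + bar_beta_t * noise
grad = np.array([0.5, -1.0, 0.25])       # placeholder for grad log p_t(y|x)

eps_new = guided_eps(noise, grad, bar_beta_t)
x_prev_plain = ddim_step(x_t, noise, bar_alpha_t, bar_beta_t, bar_alpha_prev, bar_beta_prev)
x_prev_guided = ddim_step(x_t, eps_new, bar_alpha_t, bar_beta_t, bar_alpha_prev, bar_beta_prev)
print("change due to guidance:", x_prev_guided - x_prev_plain)
```

Even though no noise is injected in this update, the guided and unguided results differ, which illustrates why modifying \boldsymbol{\epsilon}_{\boldsymbol{\theta}} keeps the control effective when \sigma_t = 0.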

Classifier-Free

Finally, let’s briefly introduce the Classifier-Free scheme. It is actually very simple; it directly defines:

p(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t, \boldsymbol{y}) = \mathcal{N}(\boldsymbol{x}_{t-1}; \boldsymbol{\mu}(\boldsymbol{x}_t, \boldsymbol{y}),\sigma_t^2\boldsymbol{I})

Following the results from previous DDPM articles, \boldsymbol{\mu}(\boldsymbol{x}_t, \boldsymbol{y}) is generally parameterized as:

\boldsymbol{\mu}(\boldsymbol{x}_t, \boldsymbol{y}) = \frac{1}{\alpha_t}\left(\boldsymbol{x}_t - \frac{\beta_t^2}{\bar{\beta}_t}\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, \boldsymbol{y}, t)\right)

The training loss function is:

\mathbb{E}_{\boldsymbol{x}_0,\boldsymbol{y}\sim\tilde{p}(\boldsymbol{x}_0,\boldsymbol{y}), \boldsymbol{\varepsilon}\sim\mathcal{N}(\boldsymbol{0}, \boldsymbol{I})}\left[\left\Vert\boldsymbol{\varepsilon} - \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\bar{\alpha}_t \boldsymbol{x}_0 + \bar{\beta}_t \boldsymbol{\varepsilon}, \boldsymbol{y}, t)\right\Vert^2\right]

Its advantage is that the extra input \boldsymbol{y} is introduced during training; theoretically, more input information makes training easier. Its disadvantage is also that \boldsymbol{y} is introduced during training, meaning that for every new set of control signals, the entire diffusion model must be re-trained.
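The loss above can be estimated on a batch as in the following sketch. The function `conditional_loss` and the two toy models are hypothetical and exist only to make the example runnable; a real \boldsymbol{\epsilon}_{\boldsymbol{\theta}} would be a neural network. To illustrate the remark that extra input information makes the task easier, the example considers the extreme (and unrealistic) case where the condition \boldsymbol{y} reveals \boldsymbol{x}_0 exactly, and compares it with a model that ignores all inputs.

```python
import numpy as np

def conditional_loss(eps_model, x0, y, t, bar_alpha_t, bar_beta_t, rng):
    """Batch estimate of E || eps - eps_theta(bar_alpha_t * x0 + bar_beta_t * eps, y, t) ||^2."""
    noise = rng.standard_normal(x0.shape)
    x_t = bar_alpha_t * x0 + bar_beta_t * noise
    residual = noise - eps_model(x_t, y, t)
    return float(np.mean(np.sum(residual**2, axis=-1)))

BAR_ALPHA, BAR_BETA = 0.6, 0.8  # illustrative noise-schedule values at one step t

def oracle_model(x_t, y, t):
    """Toy model for the extreme case y = x0: the noise can be recovered exactly."""
    return (x_t - BAR_ALPHA * y) / BAR_BETA

def blind_model(x_t, y, t):
    """Toy model that ignores every input and always predicts zero noise."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((256, 4))  # a batch of 256 four-dimensional "images"

loss_oracle = conditional_loss(oracle_model, x0, x0, 10, BAR_ALPHA, BAR_BETA, rng)
loss_blind = conditional_loss(blind_model, x0, x0, 10, BAR_ALPHA, BAR_BETA, rng)
print("loss with fully informative y:", loss_oracle)
print("loss when inputs are ignored: ", loss_blind)
```

The first loss is essentially zero, while the second is close to the data dimension, since the expected squared norm of standard Gaussian noise equals its dimension.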

Notably, the Classifier-Free scheme also mimics the scaling mechanism of the Classifier-Guidance scheme by adding a \gamma parameter to balance correlation and diversity. Specifically, the mean in Eq. [eq:gamma-sample] can be rewritten as:

\boldsymbol{\mu}(\boldsymbol{x}_t) + \sigma_t^2 \gamma \nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{y}|\boldsymbol{x}_t) = \gamma\left[\boldsymbol{\mu}(\boldsymbol{x}_t) + \sigma_t^2 \nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{y}|\boldsymbol{x}_t)\right] - (\gamma - 1) \boldsymbol{\mu}(\boldsymbol{x}_t)

The Classifier-Free scheme essentially uses the model to directly fit \boldsymbol{\mu}(\boldsymbol{x}_t) + \sigma_t^2 \nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{y}|\boldsymbol{x}_t). By analogy, we can introduce a parameter w = \gamma - 1 in the Classifier-Free scheme and use:

\tilde{\boldsymbol{\epsilon}}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, \boldsymbol{y}, t) = (1 + w)\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, \boldsymbol{y}, t) - w \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)

instead of \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, \boldsymbol{y}, t) for generation. Where does the unconditional \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) come from? We can introduce a specific null input \boldsymbol{\phi} whose target images are all images in the dataset and include it in the model training. This way, we can consider \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) = \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, \boldsymbol{\phi}, t).
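A minimal sketch of this combination is shown below. The names `cfg_eps`, `maybe_drop_condition` and `toy_eps_model` are hypothetical, and `NULL` plays the role of the null input \boldsymbol{\phi}. Randomly replacing the condition with the null input during training, with probability `p_uncond`, is one common way to include \boldsymbol{\phi} in training; the value 0.1 is an assumption of this sketch, not something specified in the text.

```python
import numpy as np

NULL = None  # stands in for the null input phi

def cfg_eps(eps_model, x_t, y, t, w):
    """Guided prediction (1 + w) * eps(x_t, y, t) - w * eps(x_t, phi, t)."""
    eps_cond = eps_model(x_t, y, t)
    eps_uncond = eps_model(x_t, NULL, t)
    return (1.0 + w) * eps_cond - w * eps_uncond

def maybe_drop_condition(y, rng, p_uncond=0.1):
    """During training, swap the condition for the null input with probability p_uncond."""
    return NULL if rng.random() < p_uncond else y

def toy_eps_model(x_t, y, t):
    """Hypothetical model: the condition, when present, shifts the prediction."""
    return x_t if y is NULL else x_t - y

x_t = np.array([1.0, 2.0])
y = np.array([0.5, -0.5])

eps_w0 = cfg_eps(toy_eps_model, x_t, y, t=10, w=0.0)  # plain conditional prediction
eps_w1 = cfg_eps(toy_eps_model, x_t, y, t=10, w=1.0)  # extrapolates away from unconditional

rng = np.random.default_rng(0)
dropped = sum(maybe_drop_condition(y, rng) is NULL for _ in range(10000))
print("w=0:", eps_w0, " w=1:", eps_w1, " fraction dropped:", dropped / 10000)
```

Setting w = 0 recovers the ordinary conditional model, while larger w extrapolates further from the unconditional prediction in the direction of the conditional one, mirroring \gamma > 1 in the Classifier-Guidance scheme.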

Summary

This article briefly introduced the theoretical results for establishing conditional diffusion models, mainly covering two schemes: post-hoc modification (Classifier-Guidance) and pre-training (Classifier-Free). The former does not require re-training the diffusion model and can achieve simple control at a low cost; the latter requires re-training the model, which is more expensive but allows for more refined control.

Original Address: https://kexue.fm/archives/9257