A Discussion on Generative Diffusion Models (I): DDPM = Demolition + Construction · English (unofficial) translations of posts at kexue.fm

When it comes to generative models, VAE and GAN are household names, and this site has shared many posts about them. There are also some relatively niche choices, such as flow models and VQ-VAE, which are quite popular in their own right. In particular, VQ-VAE and its variant VQ-GAN have recently evolved into "image tokenizers," which let pre-training methods from NLP be applied directly to images. Beyond these, there is another choice that was originally even more niche, the diffusion model, which is now rising rapidly in the field of generative models. Currently, the two most advanced text-to-image models, OpenAI’s DALL·E 2 and Google’s Imagen, are both based on diffusion models.

Some examples of "Text-to-Image" from Imagen

Starting with this article, we open a new series that gradually introduces the progress made in generative diffusion models over the past two years. Generative diffusion models are said to be famous for their mathematical complexity and to be much harder to understand than VAEs or GANs. Is this really the case? Can diffusion models really not be understood in "plain language"? Let’s wait and see.

A New Starting Point

In fact, we briefly introduced diffusion models in previous articles such as "GAN Models from an Energy Perspective (III): Generative Model = Energy Model" and "From Denoising Autoencoders to Generative Models". When talking about diffusion models, general articles mention Energy-based Models, Score Matching, Langevin Equations, and so on. Simply put, an energy model is trained through techniques like score matching, and then sampling from the energy model is performed via the Langevin equation.

Theoretically, this is a very mature scheme that can, in principle, achieve the generation and sampling of any continuous objects (speech, images, etc.). However, from a practical perspective, training the energy function is a very difficult task, especially when the data dimension is large (such as high-resolution images), making it hard to train a complete energy function. On the other hand, sampling from the energy model via the Langevin equation also has great uncertainty, often resulting in noisy sampling results. Therefore, for a long time, diffusion models following this traditional path were only experimented with on relatively low-resolution images.

The current popularity of generative diffusion models began with DDPM (Denoising Diffusion Probabilistic Model) proposed in 2020. Although it also uses the name "diffusion model," in fact, except for some similarities in the form of the sampling process, DDPM is almost entirely different from traditional diffusion models based on Langevin equation sampling. This is truly a new starting point and a new chapter.

To be precise, it would be more accurate to call DDPM a "gradual change model." The name "diffusion model" is actually misleading: concepts like energy models, score matching, and Langevin equations from traditional diffusion models have little to do with DDPM and its subsequent variants. Interestingly, the mathematical framework of DDPM was already complete in the ICML 2015 paper "Deep Unsupervised Learning using Nonequilibrium Thermodynamics", but DDPM was the first to successfully tune it for high-resolution image generation, which triggered the subsequent craze. This shows that the birth and popularity of a model often require time and opportunity.

Demolition and Construction

Many articles, when introducing DDPM, start by introducing transition distributions followed by variational inference. A pile of mathematical notations scares away many people (of course, from this introduction, we can see again that DDPM is actually a VAE rather than a traditional diffusion model). Coupled with people’s inherent impression of traditional diffusion models, the illusion that "advanced mathematical knowledge is required" is formed. In fact, DDPM can also be understood in "plain language"; it is no more difficult than GANs, which have the popular analogy of "counterfeiting vs. identification."

First, we want to make a generative model like a GAN, which is essentially a process of transforming a random noise \boldsymbol{z} into a data sample \boldsymbol{x}:

\begin{CD} \text{Random Noise } \boldsymbol{z} @>\text{Transformation}>> \text{Data Sample } \boldsymbol{x} \\ @V \text{Analogy} VV @VV \text{Analogy} V \\ \text{Bricks and Cement} @>\text{Construction}>> \text{Skyscraper} \end{CD}

We can imagine this process as "construction," where the random noise \boldsymbol{z} represents raw materials like bricks and cement, and the data sample \boldsymbol{x} is the skyscraper. Thus, the generative model is a construction team that builds skyscrapers from raw materials.

This process is certainly difficult, which is why there is so much research on generative models. But as the saying goes, "destruction is easy, construction is hard." If you don’t know how to build a building, surely you know how to demolish one? Let’s consider the process of step-by-step demolishing a skyscraper into bricks and cement: let \boldsymbol{x}_0 be the completed skyscraper (data sample), and \boldsymbol{x}_T be the demolished bricks and cement (random noise). Assuming "demolition" takes T steps, the whole process can be expressed as:

\boldsymbol{x} = \boldsymbol{x}_0 \to \boldsymbol{x}_1 \to \boldsymbol{x}_2 \to \cdots \to \boldsymbol{x}_{T-1} \to \boldsymbol{x}_T = \boldsymbol{z}

The difficulty of building a skyscraper lies in the fact that the span from raw materials \boldsymbol{x}_T to the final skyscraper \boldsymbol{x}_0 is too large; it’s hard for ordinary people to understand how \boldsymbol{x}_T suddenly becomes \boldsymbol{x}_0. However, once we have the intermediate steps of "demolition" \boldsymbol{x}_1, \boldsymbol{x}_2, \cdots, \boldsymbol{x}_T, we know that \boldsymbol{x}_{t-1} \to \boldsymbol{x}_t represents one step of demolition. Then, conversely, \boldsymbol{x}_t \to \boldsymbol{x}_{t-1} is one step of construction! If we can learn the transformation relationship between the two, \boldsymbol{x}_{t-1} = \boldsymbol{\mu}(\boldsymbol{x}_t), then starting from \boldsymbol{x}_T and repeatedly executing \boldsymbol{x}_{T-1} = \boldsymbol{\mu}(\boldsymbol{x}_T), \boldsymbol{x}_{T-2} = \boldsymbol{\mu}(\boldsymbol{x}_{T-1}), ..., won’t we eventually build the skyscraper \boldsymbol{x}_0?

How to Demolish

As the saying goes, "eat your meal one bite at a time," and a building must be built step by step. The process of DDPM as a generative model is exactly consistent with the "demolition-construction" analogy mentioned above. It first constructs a process that gradually changes from a data sample to random noise and then considers its inverse transformation, completing the generation of data samples by repeatedly executing the inverse transformation. This is why I said earlier that DDPM should be more accurately called a "gradual change model" rather than a "diffusion model."

Specifically, DDPM models the "demolition" process as:

\boldsymbol{x}_t = \alpha_t \boldsymbol{x}_{t-1} + \beta_t \boldsymbol{\varepsilon}_t, \quad \boldsymbol{\varepsilon}_t \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I}) \label{eq:forward}

where \alpha_t, \beta_t > 0 and \alpha_t^2 + \beta_t^2 = 1. \beta_t is usually very close to 0, representing the degree of damage to the original building structure in a single step of "demolition." The introduction of noise \boldsymbol{\varepsilon}_t represents a kind of destruction to the original signal. We can also understand it as "raw material," meaning in each step of "demolition," we decompose \boldsymbol{x}_{t-1} into "building structure of \alpha_t \boldsymbol{x}_{t-1} + raw material of \beta_t \boldsymbol{\varepsilon}_t." (Note: The definitions of \alpha_t, \beta_t in this article are different from the original paper.)
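As a minimal sketch of a single "demolition" step, the snippet below applies the forward equation element-wise to a toy vector in plain Python. The function name `demolish_step`, the toy sample, and the value 0.999 for \alpha_t are illustrative assumptions, not settings taken from the paper.

```python
import math
import random

def demolish_step(x_prev, alpha_t, rng):
    """One forward step x_t = alpha_t * x_{t-1} + beta_t * eps_t, element-wise."""
    # beta_t is derived from alpha_t so that alpha_t^2 + beta_t^2 = 1 holds.
    beta_t = math.sqrt(1.0 - alpha_t ** 2)
    return [alpha_t * v + beta_t * rng.gauss(0.0, 1.0) for v in x_prev]

rng = random.Random(0)
x0 = [1.0, -0.5, 0.25]   # hypothetical toy "skyscraper"
alpha = 0.999            # illustrative: close to 1, so beta_t is small
x1 = demolish_step(x0, alpha, rng)
```

Because \beta_t is small here, `x1` stays close to `x0`: a single step only damages the structure slightly.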

By repeatedly executing this demolition step, we can obtain:

\begin{aligned} \boldsymbol{x}_t =&\, \alpha_t \boldsymbol{x}_{t-1} + \beta_t \boldsymbol{\varepsilon}_t \\ =&\, \alpha_t \big(\alpha_{t-1} \boldsymbol{x}_{t-2} + \beta_{t-1} \boldsymbol{\varepsilon}_{t-1}\big) + \beta_t \boldsymbol{\varepsilon}_t \\ =&\, \cdots \\ =&\, (\alpha_t \cdots \alpha_1) \boldsymbol{x}_0 + \underbrace{(\alpha_t \cdots \alpha_2)\beta_1 \boldsymbol{\varepsilon}_1 + (\alpha_t \cdots \alpha_3)\beta_2 \boldsymbol{\varepsilon}_2 + \cdots + \alpha_t\beta_{t-1} \boldsymbol{\varepsilon}_{t-1} + \beta_t \boldsymbol{\varepsilon}_t}_{\text{Sum of multiple independent normal noises}} \end{aligned} \label{eq:expand}

Perhaps the reader just wanted to ask why the coefficients of the superposition must satisfy \alpha_t^2 + \beta_t^2 = 1. Now we can answer this question. First, the part indicated by the braces in the equation is exactly the sum of multiple independent normal noises, with a mean of 0 and variances of (\alpha_t \cdots \alpha_2)^2\beta_1^2, (\alpha_t \cdots \alpha_3)^2\beta_2^2, ..., \alpha_t^2\beta_{t-1}^2, \beta_t^2, respectively. Then, we use a piece of knowledge from probability theory, the additivity of normal distributions, which means the sum of the multiple independent normal noises above is itself normally distributed with mean 0 and variance (\alpha_t \cdots \alpha_2)^2\beta_1^2 + (\alpha_t \cdots \alpha_3)^2\beta_2^2 + \cdots + \alpha_t^2\beta_{t-1}^2 + \beta_t^2.

Finally, under the condition that \alpha_t^2 + \beta_t^2 = 1 always holds, we can find that the sum of the squares of the coefficients in equation [eq:expand] is still 1, i.e.,

(\alpha_t \cdots \alpha_1)^2 + (\alpha_t \cdots \alpha_2)^2\beta_1^2 + (\alpha_t \cdots \alpha_3)^2\beta_2^2 + \cdots + \alpha_t^2\beta_{t-1}^2 + \beta_t^2 = 1

So it is actually equivalent to:

\boldsymbol{x}_t = \underbrace{(\alpha_t \cdots \alpha_1)}_{\text{denoted as } \bar{\alpha}_t} \boldsymbol{x}_0 + \underbrace{\sqrt{1 - (\alpha_t \cdots \alpha_1)^2}}_{\text{denoted as } \bar{\beta}_t} \bar{\boldsymbol{\varepsilon}}_t, \quad \bar{\boldsymbol{\varepsilon}}_t \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I}) \label{eq:skip}

This provides great convenience for calculating \boldsymbol{x}_t. On the other hand, DDPM chooses an appropriate form for \alpha_t such that \bar{\alpha}_T \approx 0, which means that after T steps of demolition, the remaining building structure is almost negligible and has been entirely converted into raw material \boldsymbol{\varepsilon}. (Note: The definition of \bar{\alpha}_t in this article is different from the original paper.)
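The claim that the squared coefficients still sum to 1 can be checked numerically. The sketch below builds the coefficients of \boldsymbol{x}_0, \boldsymbol{\varepsilon}_1, ..., \boldsymbol{\varepsilon}_t from the expansion for a short, arbitrary list of \alpha values. The helper name `expansion_coefficients` and the four \alpha values are assumptions made for illustration only.

```python
import math

def expansion_coefficients(alphas):
    """Coefficients of x_0, eps_1, ..., eps_t in the expanded forward process.

    alphas[k] holds alpha_{k+1}; each beta is derived from alpha^2 + beta^2 = 1.
    """
    betas = [math.sqrt(1.0 - a * a) for a in alphas]
    coeffs = [math.prod(alphas)]                 # (alpha_t ... alpha_1) on x_0
    for k in range(len(alphas)):
        # eps_{k+1} is scaled by beta_{k+1} and by every later alpha.
        coeffs.append(math.prod(alphas[k + 1:]) * betas[k])
    return coeffs

alphas = [0.99, 0.98, 0.97, 0.96]                # illustrative values
coeffs = expansion_coefficients(alphas)
total = sum(c * c for c in coeffs)               # should equal 1

alpha_bar = coeffs[0]
beta_bar = math.sqrt(1.0 - alpha_bar ** 2)
# Standard deviation of the summed noise terms; should match beta_bar.
noise_std = math.sqrt(sum(c * c for c in coeffs[1:]))
```

Up to floating-point error, `total` is 1 and `noise_std` agrees with `beta_bar`, which is what justifies sampling \boldsymbol{x}_t from \boldsymbol{x}_0 in one shot.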

How to Build

"Demolition" is the process \boldsymbol{x}_{t-1} \to \boldsymbol{x}_t. From this process, we obtain many data pairs (\boldsymbol{x}_{t-1}, \boldsymbol{x}_t). Then "construction" naturally involves learning a model \boldsymbol{x}_t \to \boldsymbol{x}_{t-1} from these data pairs. Let this model be \boldsymbol{\mu}(\boldsymbol{x}_t). It is easy to think that the learning scheme is to minimize the Euclidean distance between the two:

\left\Vert\boldsymbol{x}_{t-1} - \boldsymbol{\mu}(\boldsymbol{x}_t)\right\Vert^2 \label{eq:loss-0}

In fact, this is already very close to the final DDPM model. Next, let’s make this process more refined. First, the "demolition" equation [eq:forward] can be rewritten as \boldsymbol{x}_{t-1} = \frac{1}{\alpha_t}\left(\boldsymbol{x}_t - \beta_t \boldsymbol{\varepsilon}_t\right). This inspires us that we might design the "construction" model \boldsymbol{\mu}(\boldsymbol{x}_t) in the form:

\boldsymbol{\mu}(\boldsymbol{x}_t) = \frac{1}{\alpha_t}\left(\boldsymbol{x}_t - \beta_t \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)\right) \label{eq:sample}

where \boldsymbol{\theta} are the training parameters. Substituting this into the loss function, we get:

\left\Vert\boldsymbol{x}_{t-1} - \boldsymbol{\mu}(\boldsymbol{x}_t)\right\Vert^2 = \frac{\beta_t^2}{\alpha_t^2}\left\Vert \boldsymbol{\varepsilon}_t - \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)\right\Vert^2

The preceding factor \frac{\beta_t^2}{\alpha_t^2} represents the weight of the loss, which we can temporarily ignore. Finally, substituting the expression for \boldsymbol{x}_t given by combining equations [eq:skip] and [eq:forward]:

\boldsymbol{x}_t = \alpha_t\boldsymbol{x}_{t-1} + \beta_t \boldsymbol{\varepsilon}_t = \alpha_t\left(\bar{\alpha}_{t-1}\boldsymbol{x}_0 + \bar{\beta}_{t-1}\bar{\boldsymbol{\varepsilon}}_{t-1}\right) + \beta_t \boldsymbol{\varepsilon}_t = \bar{\alpha}_t\boldsymbol{x}_0 + \alpha_t\bar{\beta}_{t-1}\bar{\boldsymbol{\varepsilon}}_{t-1} + \beta_t \boldsymbol{\varepsilon}_t

We get the form of the loss function as:

\left\Vert \boldsymbol{\varepsilon}_t - \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\bar{\alpha}_t\boldsymbol{x}_0 + \alpha_t\bar{\beta}_{t-1}\bar{\boldsymbol{\varepsilon}}_{t-1} + \beta_t \boldsymbol{\varepsilon}_t, t)\right\Vert^2 \label{eq:loss-1}

The reader might ask why we go back one step to provide \boldsymbol{x}_t. Is it okay to provide \boldsymbol{x}_t directly according to equation [eq:skip]? The answer is no, because we have already sampled \boldsymbol{\varepsilon}_t beforehand, and \boldsymbol{\varepsilon}_t and \bar{\boldsymbol{\varepsilon}}_t are not independent. Therefore, given \boldsymbol{\varepsilon}_t, we cannot sample \bar{\boldsymbol{\varepsilon}}_t completely independently.

Reducing Variance

In principle, the loss function [eq:loss-1] can complete the training of DDPM, but in practice, it may have the risk of excessive variance, leading to problems like slow convergence. It is not difficult to understand this; just observe that equation [eq:loss-1] actually contains 4 random variables that need to be sampled:

1. Sample an \boldsymbol{x}_0 from all training samples;
2. Sample \bar{\boldsymbol{\varepsilon}}_{t-1}, \boldsymbol{\varepsilon}_t from the normal distribution \mathcal{N}(\boldsymbol{0}, \boldsymbol{I}) (two different sampling results);
3. Sample a t from 1 \sim T.

The more random variables there are to sample, the harder it is to estimate the loss function accurately. Put differently, the fluctuation (variance) of each individual estimate of the loss function becomes too large. Fortunately, we can use an integration trick to combine \bar{\boldsymbol{\varepsilon}}_{t-1} and \boldsymbol{\varepsilon}_t into a single normal random variable, thereby alleviating the problem of large variance.
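To make the four sources of randomness concrete, here is a rough sketch of one Monte Carlo draw of the loss [eq:loss-1] in plain Python. The schedule \alpha_t = \sqrt{1 - 0.02t/T} is the one quoted in the hyperparameter discussion of this article; the tiny dataset, the function names, and the `zero_model` placeholder standing in for a trained \boldsymbol{\epsilon}_{\boldsymbol{\theta}} are hypothetical.

```python
import math
import random

T = 1000
ALPHAS = [math.sqrt(1.0 - 0.02 * t / T) for t in range(1, T + 1)]

def alpha_bar(t):
    """Product alpha_1 ... alpha_t; the empty product gives alpha_bar_0 = 1."""
    return math.prod(ALPHAS[:t])

def naive_loss_sample(dataset, eps_model, rng):
    x0 = rng.choice(dataset)                          # random draw 1: data sample
    eps_bar_prev = [rng.gauss(0.0, 1.0) for _ in x0]  # random draw 2a: noise up to t-1
    eps_t = [rng.gauss(0.0, 1.0) for _ in x0]         # random draw 2b: noise of step t
    t = rng.randint(1, T)                             # random draw 3: step index (inclusive)

    alpha_t = ALPHAS[t - 1]
    beta_t = math.sqrt(1.0 - alpha_t ** 2)
    a_bar_prev = alpha_bar(t - 1)
    b_bar_prev = math.sqrt(1.0 - a_bar_prev ** 2)
    # Go back one step: build x_{t-1} in closed form, then apply one forward step.
    x_t = [alpha_t * (a_bar_prev * v + b_bar_prev * e1) + beta_t * e2
           for v, e1, e2 in zip(x0, eps_bar_prev, eps_t)]
    pred = eps_model(x_t, t)
    return sum((e - p) ** 2 for e, p in zip(eps_t, pred))

def zero_model(x_t, t):
    # Placeholder for a trained network: always predicts zero noise.
    return [0.0 for _ in x_t]

rng = random.Random(0)
dataset = [[1.0, -1.0], [0.5, 0.25]]                  # hypothetical toy data
loss = naive_loss_sample(dataset, zero_model, rng)
```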

This integration does require some skill, but it is not too complex. Due to the additivity of normal distributions, \alpha_t\bar{\beta}_{t-1}\bar{\boldsymbol{\varepsilon}}_{t-1} + \beta_t \boldsymbol{\varepsilon}_t is normal with mean 0 and variance \alpha_t^2\bar{\beta}_{t-1}^2 + \beta_t^2 = \alpha_t^2(1 - \bar{\alpha}_{t-1}^2) + \beta_t^2 = 1 - \bar{\alpha}_t^2 = \bar{\beta}_t^2, so it is equivalent to a single random variable \bar{\beta}_t\boldsymbol{\varepsilon} with \boldsymbol{\varepsilon} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I}). Similarly, \beta_t \bar{\boldsymbol{\varepsilon}}_{t-1} - \alpha_t\bar{\beta}_{t-1} \boldsymbol{\varepsilon}_t is equivalent to a single random variable \bar{\beta}_t\boldsymbol{\omega} with \boldsymbol{\omega} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I}). It can be verified that \mathbb{E}[\boldsymbol{\varepsilon}\boldsymbol{\omega}^{\top}] = \boldsymbol{0}; since \boldsymbol{\varepsilon} and \boldsymbol{\omega} are jointly normal, being uncorrelated means they are two independent normal random variables.
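The change of variables from (\bar{\boldsymbol{\varepsilon}}_{t-1}, \boldsymbol{\varepsilon}_t) to (\boldsymbol{\varepsilon}, \boldsymbol{\omega}) is a rotation, which can be checked on scalars. The sketch below is only a numerical sanity check under assumed values: `alpha_t`, `beta_bar_prev`, and the two noise values are arbitrary illustrative numbers.

```python
import math

alpha_t = 0.995                  # illustrative alpha_t
beta_bar_prev = 0.8              # illustrative beta_bar_{t-1}
beta_t = math.sqrt(1.0 - alpha_t ** 2)
beta_bar_t = math.sqrt(alpha_t ** 2 * beta_bar_prev ** 2 + beta_t ** 2)

# Consistency with beta_bar_t^2 = 1 - alpha_bar_t^2, where alpha_bar_t = alpha_t * alpha_bar_{t-1}.
alpha_bar_prev = math.sqrt(1.0 - beta_bar_prev ** 2)
beta_bar_t_alt = math.sqrt(1.0 - (alpha_t * alpha_bar_prev) ** 2)

eps_bar_prev, eps_t = 0.3, -1.2  # arbitrary values of the two original noises
eps = (alpha_t * beta_bar_prev * eps_bar_prev + beta_t * eps_t) / beta_bar_t
omega = (beta_t * eps_bar_prev - alpha_t * beta_bar_prev * eps_t) / beta_bar_t

# Covariance of eps and omega is proportional to this expression, which cancels to 0.
cross_term = alpha_t * beta_bar_prev * beta_t - beta_t * alpha_t * beta_bar_prev

# A rotation preserves the squared norm of the pair.
norm_before = eps_bar_prev ** 2 + eps_t ** 2
norm_after = eps ** 2 + omega ** 2
```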

Next, we conversely express \boldsymbol{\varepsilon}_t in terms of \boldsymbol{\varepsilon} and \boldsymbol{\omega}:

\boldsymbol{\varepsilon}_t = \frac{(\beta_t \boldsymbol{\varepsilon} - \alpha_t\bar{\beta}_{t-1} \boldsymbol{\omega})\bar{\beta}_t}{\beta_t^2 + \alpha_t^2\bar{\beta}_{t-1}^2} = \frac{\beta_t \boldsymbol{\varepsilon} - \alpha_t\bar{\beta}_{t-1} \boldsymbol{\omega}}{\bar{\beta}_t}

Substituting this into equation [eq:loss-1], we get:

\begin{aligned} &\,\mathbb{E}_{\bar{\boldsymbol{\varepsilon}}_{t-1}, \boldsymbol{\varepsilon}_t \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})}\left[\left\Vert \boldsymbol{\varepsilon}_t - \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\bar{\alpha}_t\boldsymbol{x}_0 + \alpha_t\bar{\beta}_{t-1}\bar{\boldsymbol{\varepsilon}}_{t-1} + \beta_t \boldsymbol{\varepsilon}_t, t)\right\Vert^2\right] \\ =&\, \mathbb{E}_{\boldsymbol{\omega}, \boldsymbol{\varepsilon} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})}\left[\left\Vert \frac{\beta_t \boldsymbol{\varepsilon} - \alpha_t\bar{\beta}_{t-1} \boldsymbol{\omega}}{\bar{\beta}_t} - \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\bar{\alpha}_t\boldsymbol{x}_0 + \bar{\beta}_t\boldsymbol{\varepsilon}, t)\right\Vert^2\right] \end{aligned}

Notice that the loss function is now only quadratic with respect to \boldsymbol{\omega}, so we can expand it and calculate its expectation directly. The result is:

\frac{\beta_t^2}{\bar{\beta}_t^2}\mathbb{E}_{\boldsymbol{\varepsilon} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})}\left[\left\Vert\boldsymbol{\varepsilon} - \frac{\bar{\beta}_t}{\beta_t}\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\bar{\alpha}_t\boldsymbol{x}_0 + \bar{\beta}_t\boldsymbol{\varepsilon}, t)\right\Vert^2\right] + \text{constant}

Again, omitting the constant and the weight of the loss function, we obtain the final loss function used by DDPM:

\left\Vert\boldsymbol{\varepsilon} - \frac{\bar{\beta}_t}{\beta_t}\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\bar{\alpha}_t\boldsymbol{x}_0 + \bar{\beta}_t\boldsymbol{\varepsilon}, t)\right\Vert^2

(Note: The \boldsymbol{\epsilon}_{\boldsymbol{\theta}} in the original paper is actually \frac{\bar{\beta}_t}{\beta_t}\boldsymbol{\epsilon}_{\boldsymbol{\theta}} in this article, so the results are identical.)
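A single evaluation of this final loss can be sketched as follows, again in plain Python and in this article's notation (so the model output is rescaled by \bar{\beta}_t / \beta_t). The schedule is the one quoted in the hyperparameter section; `make_oracle` is a hypothetical stand-in that "cheats" by knowing \boldsymbol{x}_0, used here only to show that a perfect noise predictor drives the loss to zero. A real \boldsymbol{\epsilon}_{\boldsymbol{\theta}} would be a trained network.

```python
import math
import random

T = 1000
ALPHAS = [math.sqrt(1.0 - 0.02 * t / T) for t in range(1, T + 1)]

def bar_coeffs(t):
    """Return (alpha_bar_t, beta_bar_t) for 1 <= t <= T."""
    a_bar = math.prod(ALPHAS[:t])
    return a_bar, math.sqrt(1.0 - a_bar ** 2)

def ddpm_loss_sample(x0, t, eps_model, rng):
    a_bar, b_bar = bar_coeffs(t)
    beta_t = math.sqrt(1.0 - ALPHAS[t - 1] ** 2)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]            # the only noise draw left
    x_t = [a_bar * v + b_bar * e for v, e in zip(x0, eps)]
    pred = eps_model(x_t, t)
    scale = b_bar / beta_t                             # beta_bar_t / beta_t
    return sum((e - scale * p) ** 2 for e, p in zip(eps, pred))

def make_oracle(x0):
    # Hypothetical "perfect" model: recovers eps from x_t because it knows x0.
    def eps_model(x_t, t):
        a_bar, b_bar = bar_coeffs(t)
        beta_t = math.sqrt(1.0 - ALPHAS[t - 1] ** 2)
        return [(beta_t / b_bar) * (v - a_bar * x) / b_bar for v, x in zip(x_t, x0)]
    return eps_model

rng = random.Random(0)
x0 = [1.0, -0.5, 0.25]                                 # hypothetical toy sample
oracle_loss = ddpm_loss_sample(x0, 500, make_oracle(x0), rng)
zero_loss = ddpm_loss_sample(x0, 500, lambda x_t, t: [0.0] * len(x_t), rng)
```

Compared with the earlier form of the loss, only \boldsymbol{x}_0, t, and a single \boldsymbol{\varepsilon} need to be sampled per evaluation.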

Recursive Generation

So far, we have clarified the entire training process of DDPM. A lot has been written; to say it’s easy would certainly be an overstatement, but there are almost no truly difficult parts: no traditional energy functions, score matching, or even variational inference knowledge was used. With only the "demolition-construction" analogy and some basic probability theory, we obtain exactly the same results as the original paper. Therefore, the newly emerging generative diffusion models represented by DDPM are actually not as complex as many readers imagine. They can be seen as a figurative modeling of how we learn new knowledge through "disassembly and reassembly."

After training, we can start from a random noise \boldsymbol{x}_T \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I}) and execute T steps of equation [eq:sample] for generation:

\boldsymbol{x}_{t-1} = \frac{1}{\alpha_t}\left(\boldsymbol{x}_t - \beta_t \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)\right)

This corresponds to Greedy Search in autoregressive decoding. If Random Sampling is to be performed, a noise term needs to be added:

\boldsymbol{x}_{t-1} = \frac{1}{\alpha_t}\left(\boldsymbol{x}_t - \beta_t \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)\right) + \sigma_t \boldsymbol{z}, \quad \boldsymbol{z} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})

Generally, we can let \sigma_t = \beta_t, keeping the forward and backward variances synchronized. The difference between this sampling process and the Langevin sampling of traditional diffusion models is that DDPM sampling starts from a random noise every time and requires T iterations to get one sample output; Langevin sampling starts from any point and iterates infinitely, and theoretically, in this infinite iteration process, all data samples are generated. So, besides the similarity in form, they are essentially two completely different models.
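The two update rules above can be sketched as one loop. Since no trained network is available here, the example uses a hypothetical toy model, `make_point_mass_model`, which is the exact noise predictor for the degenerate case where every training sample equals a constant c; under that assumption the loop should land on c. Skipping the noise at the final step (t = 1) follows the sampling algorithm of the DDPM paper rather than anything stated in this article.

```python
import math
import random

T = 1000
ALPHAS = [math.sqrt(1.0 - 0.02 * t / T) for t in range(1, T + 1)]
ALPHA_BARS = []
running = 1.0
for a in ALPHAS:
    running *= a
    ALPHA_BARS.append(running)       # ALPHA_BARS[t - 1] = alpha_1 ... alpha_t

def make_point_mass_model(c):
    """Toy stand-in for eps_theta when all data equal c (hypothetical)."""
    def eps_model(x_t, t):
        beta_t = math.sqrt(1.0 - ALPHAS[t - 1] ** 2)
        a_bar = ALPHA_BARS[t - 1]
        b_bar_sq = 1.0 - a_bar ** 2
        # x_t = a_bar * c + b_bar * eps, and eps_theta = (beta_t / b_bar) * eps.
        return [beta_t * (v - a_bar * c) / b_bar_sq for v in x_t]
    return eps_model

def generate(eps_model, dim, rng, stochastic=False):
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]     # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        alpha_t = ALPHAS[t - 1]
        beta_t = math.sqrt(1.0 - alpha_t ** 2)
        pred = eps_model(x, t)
        x = [(v - beta_t * p) / alpha_t for v, p in zip(x, pred)]
        if stochastic and t > 1:
            # Random sampling with sigma_t = beta_t; no noise on the last step.
            x = [v + beta_t * rng.gauss(0.0, 1.0) for v in x]
    return x

rng = random.Random(0)
model = make_point_mass_model(2.0)
greedy = generate(model, 4, rng)
sampled = generate(model, 4, rng, stochastic=True)
```

Each call to `generate` runs the model T times in sequence, which is the sampling cost discussed next.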

From this generation process, we can also feel that it is actually the same as the decoding process of Seq2Seq, both being serial autoregressive generation. Therefore, generation speed is a bottleneck. DDPM sets T=1000, which means that for every image generated, \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) must be executed 1000 times. Thus, a major drawback of DDPM is slow sampling speed, and much subsequent work has been dedicated to improving the sampling speed of DDPM. Speaking of "image generation + autoregressive model + very slow," some readers might think of early models like PixelRNN and PixelCNN. They convert image generation into a language modeling task, so they are also recursive in sampling and generation and equally slow. So, what is the substantial difference between DDPM’s autoregressive generation and that of PixelRNN/PixelCNN? Why didn’t PixelRNN/PixelCNN become popular, but DDPM did?

Readers familiar with PixelRNN/PixelCNN know that these generative models generate images pixel by pixel. Since autoregressive generation is ordered, we must pre-arrange the order of every pixel in the image, and the final generation effect is closely related to this order. However, currently, this order can only be designed based on human experience (such designs are collectively called "Inductive Bias"), and no theoretical optimal solution has been found yet. In other words, the generation effect of PixelRNN/PixelCNN is heavily influenced by Inductive Bias. But DDPM is different; it redefines an autoregressive direction through "demolition," and for all pixels, they are equal and unbiased, thus reducing the influence of Inductive Bias and improving the effect. In addition, the number of iterations for DDPM generation is a fixed T, while for PixelRNN/PixelCNN, it equals the image resolution (\text{width} \times \text{height} \times \text{channels}), so DDPM generates high-resolution images much faster than PixelRNN/PixelCNN.
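The iteration counts are easy to compare directly. The snippet below is a small arithmetic sketch; the 256×256 RGB resolution is an arbitrary example chosen for illustration, and it counts sequential steps only, not the cost of each step.

```python
T = 1000  # number of DDPM sampling steps

def pixel_autoregressive_steps(width, height, channels):
    """Sequential steps when generating one sub-pixel value at a time."""
    return width * height * channels

steps = pixel_autoregressive_steps(256, 256, 3)   # 196608 sequential steps
ratio = steps / T                                 # how many times more steps than DDPM
```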

Hyperparameter Settings

In this section, we discuss the setting of hyperparameters.

In DDPM, T=1000, which might be larger than many readers imagine. Why set such a large T? On the other hand, for the choice of \alpha_t, translating the settings of the original paper to the notation of this blog, it is roughly:

\alpha_t = \sqrt{1 - \frac{0.02t}{T}}

This is a monotonically decreasing function. Why choose a monotonically decreasing \alpha_t?
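In code, this schedule is a one-liner. The sketch below tabulates it in plain Python together with the matching \beta_t, using this article's convention \alpha_t^2 + \beta_t^2 = 1; the list names are illustrative.

```python
import math

T = 1000
# Index k holds the value for step t = k + 1.
alphas = [math.sqrt(1.0 - 0.02 * t / T) for t in range(1, T + 1)]
betas = [math.sqrt(0.02 * t / T) for t in range(1, T + 1)]   # beta_t^2 = 1 - alpha_t^2

is_decreasing = all(a > b for a, b in zip(alphas, alphas[1:]))
first_alpha, last_alpha = alphas[0], alphas[-1]              # sqrt(0.99998) and sqrt(0.98)
```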

Actually, these two questions have similar answers related to the specific data background. For simplicity, we used Euclidean distance [eq:loss-0] as the loss function during reconstruction. As readers who have done image generation know, Euclidean distance is not a good measure of image realism. When VAE uses Euclidean distance for reconstruction, it often yields blurry results unless the input and output images are very close. Therefore, choosing T as large as possible is precisely to make the input and output as similar as possible, reducing the blurring problem caused by Euclidean distance.

Choosing a monotonically decreasing \alpha_t has similar considerations. When t is small, \boldsymbol{x}_t is still close to the real image, so we want to reduce the gap between \boldsymbol{x}_{t-1} and \boldsymbol{x}_t to better apply the Euclidean distance [eq:loss-0], hence using a larger \alpha_t. When t is large, \boldsymbol{x}_t is already close to pure noise, and Euclidean distance is fine for noise, so we can slightly increase the gap between \boldsymbol{x}_{t-1} and \boldsymbol{x}_t, i.e., use a smaller \alpha_t. So, can we use a large \alpha_t all the time? Yes, but T must be increased. Note that when deriving [eq:skip], we said that we need \bar{\alpha}_T \approx 0, and we can directly estimate:

\log \bar{\alpha}_T = \sum_{t=1}^T \log\alpha_t = \frac{1}{2} \sum_{t=1}^T \log\left(1 - \frac{0.02t}{T}\right) < \frac{1}{2} \sum_{t=1}^T \left(- \frac{0.02t}{T}\right) = -0.005(T+1)

Substituting T=1000 gives roughly \bar{\alpha}_T \approx e^{-5}, which just reaches the standard of \approx 0. So if a large \alpha_t is used from beginning to end, a larger T is inevitably required to make \bar{\alpha}_T \approx 0.
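The estimate can be confirmed numerically. The short sketch below sums the logarithms exactly for T = 1000 and compares the result with the bound -0.005(T+1); the variable names are illustrative.

```python
import math

T = 1000
# Exact value of log(alpha_bar_T) under alpha_t = sqrt(1 - 0.02 t / T).
log_alpha_bar_T = 0.5 * sum(math.log(1.0 - 0.02 * t / T) for t in range(1, T + 1))
# Upper bound from log(1 - x) < -x, which evaluates to -5.005 for T = 1000.
upper_bound = -0.005 * (T + 1)
alpha_bar_T = math.exp(log_alpha_bar_T)   # fraction of x_0 surviving after T steps
```

The exact value sits slightly below the bound, so \bar{\alpha}_T is a little smaller than e^{-5}, consistent with the approximation in the text.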

Finally, we notice that in the "construction" model \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\bar{\alpha}_t\boldsymbol{x}_0 + \bar{\beta}_t\boldsymbol{\varepsilon}, t), we explicitly write t in the input. This is because, in principle, different values of t deal with objects at different levels, so a different reconstruction model should be used for each, meaning there would be T different reconstruction models. In practice, we instead share the parameters of all these reconstruction models and pass t in as a condition. According to the appendix of the paper, t is converted into the positional encoding introduced in "The Road to Transformer Upgrade: 1. Tracing the Source of Sinusoidal Positional Encoding" and then directly added to the residual module.
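As a rough illustration of how t can be turned into a vector, the sketch below implements the standard sinusoidal positional encoding from the Transformer, with sine and cosine interleaved. The function name, the embedding size of 8, and the base of 10000 are assumptions for illustration; the actual DDPM code may order or scale the components differently.

```python
import math

def timestep_embedding(t, dim, base=10000.0):
    """Sinusoidal encoding of a scalar step t into a vector of even length dim."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    emb = []
    for i in range(dim // 2):
        freq = base ** (-2.0 * i / dim)   # 1 / base^(2i / dim)
        emb.append(math.sin(t * freq))    # component 2i
        emb.append(math.cos(t * freq))    # component 2i + 1
    return emb

emb_0 = timestep_embedding(0, 8)          # alternates 0 and 1 at t = 0
emb_500 = timestep_embedding(500, 8)
```

A vector like `emb_500` would then be projected and added inside the network's residual blocks, so that a single set of shared parameters can behave differently at each step.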

Summary

This article introduced the latest generative diffusion model, DDPM, through the popular analogy of "demolition and construction." From this perspective, we can obtain exactly the same results as the original paper through "plain language" descriptions and relatively little mathematical derivation. Overall, this article shows that DDPM, like GANs, admits a figurative analogy. It avoids both the "variational inference" of VAE and the "probability divergence" or "optimal transport" of GANs. In this sense, DDPM can even be considered simpler than VAE and GAN.

Reprinting: Please include the original address of this article: https://kexue.fm/archives/9119
