Generative Diffusion Models Part 5: SDE General Framework · English (unofficial) translations of posts at kexue.fm

When I wrote the first article on generative diffusion models, a reader recommended Dr. Yang Song’s paper "Score-Based Generative Modeling through Stochastic Differential Equations" in the comments. It can be said that this paper constructs a quite generalized theoretical framework for generative diffusion models, connecting various results such as DDPM, SDE, and ODE. Admittedly, it is an excellent paper, but it is not suitable for beginners. It directly utilizes a large number of results from Stochastic Differential Equations (SDE), Fokker-Planck equations, and score matching, making the entry barrier quite high.

However, after the accumulation of the previous four articles, we can now attempt to study this paper. In the following sections, I will try to reproduce the derivation results from the original paper starting from as little theoretical foundation as possible.

Stochastic Differential

In DDPM, the diffusion process is divided into a fixed number of T steps. Using the analogy from "Generative Diffusion Models Part 1: DDPM = Demolishing + Building", both "demolishing" and "building" are pre-divided into T steps. This division is quite arbitrary. In fact, the real "demolishing" and "building" processes should not have deliberately divided steps; we can understand them as a continuous transformation process in time, which can be described by a Stochastic Differential Equation (SDE).

To this end, we use the following SDE to describe the forward process ("demolishing"): d\boldsymbol{x} = \boldsymbol{f}_t(\boldsymbol{x}) dt + g_t d\boldsymbol{w}\label{eq:sde-forward} I believe many readers are unfamiliar with SDEs; I only encountered them briefly during my master’s studies and know just the basics. But it doesn’t matter if you don’t understand it; we only need to view it as the limit of the following discrete form as \Delta t \to 0: \boldsymbol{x}_{t+\Delta t} - \boldsymbol{x}_t = \boldsymbol{f}_t(\boldsymbol{x}_t) \Delta t + g_t \sqrt{\Delta t}\boldsymbol{\varepsilon},\quad \boldsymbol{\varepsilon}\sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})\label{eq:sde-discrete} To be more direct, if we assume demolishing a building takes 1 day, then demolishing is the process of \boldsymbol{x} changing from t=0 to t=1. Each small step of change can be described by the above equation. As for the time interval \Delta t, we do not impose special restrictions; a smaller \Delta t simply means a better approximation of the original SDE. If we take \Delta t=0.001, it corresponds to the original T=1000; if \Delta t = 0.01, it corresponds to T=100, and so on. In other words, under the continuous-time SDE perspective, different T values are manifestations of different discretization schemes of the SDE. They automatically lead to similar results, and we do not need to specify T in advance; instead, we choose an appropriate T for numerical calculation based on the required accuracy in practice.

Therefore, the essential benefit of introducing the SDE form to describe diffusion models is to "separate theoretical analysis from code implementation." We can use the mathematical tools of continuous SDEs for analysis, and when it comes to practice, we only need to use any appropriate discretization scheme to perform numerical calculations on the SDE.

Regarding Equation [eq:sde-discrete], readers might wonder why the first term on the right is \mathcal{O}(\Delta t) while the second term is \mathcal{O}(\sqrt{\Delta t}). That is, why is the order of the stochastic term higher than that of the deterministic term? This is actually not easy to explain and is one of the confusing aspects of SDEs. Simply put, \boldsymbol{\varepsilon} always follows a standard normal distribution. If the weight of the stochastic term were also \mathcal{O}(\Delta t), then because the mean of the standard normal distribution is \boldsymbol{0} and the covariance is \boldsymbol{I}, adjacent stochastic effects would cancel each other out. It must be scaled to \mathcal{O}(\sqrt{\Delta t}) for the stochastic effect to manifest in the long-term results.

Reverse Equation

In the language of probability, Equation [eq:sde-discrete] implies that the conditional probability is: \begin{aligned} p(\boldsymbol{x}_{t+\Delta t}|\boldsymbol{x}_t) =&\, \mathcal{N}\left(\boldsymbol{x}_{t+\Delta t};\boldsymbol{x}_t + \boldsymbol{f}_t(\boldsymbol{x}_t) \Delta t, g_t^2\Delta t \,\boldsymbol{I}\right)\\ \propto&\, \exp\left(-\frac{\Vert\boldsymbol{x}_{t+\Delta t} - \boldsymbol{x}_t - \boldsymbol{f}_t(\boldsymbol{x}_t) \Delta t\Vert^2}{2 g_t^2\Delta t}\right) \end{aligned}\label{eq:sde-proba} For simplicity, the irrelevant normalization factor is omitted here. According to the idea of DDPM, we ultimately want to learn "building" from the process of "demolishing," i.e., to obtain p(\boldsymbol{x}_t|\boldsymbol{x}_{t+\Delta t}). To this end, as in "Generative Diffusion Models Part 3: DDPM = Bayes + Denoising", we use Bayes’ theorem: \begin{aligned} p(\boldsymbol{x}_t|\boldsymbol{x}_{t+\Delta t}) =&\, \frac{p(\boldsymbol{x}_{t+\Delta t}|\boldsymbol{x}_t)p(\boldsymbol{x}_t)}{p(\boldsymbol{x}_{t+\Delta t})} = p(\boldsymbol{x}_{t+\Delta t}|\boldsymbol{x}_t) \exp\left(\log p(\boldsymbol{x}_t) - \log p(\boldsymbol{x}_{t+\Delta t})\right)\\ \propto&\, \exp\left(-\frac{\Vert\boldsymbol{x}_{t+\Delta t} - \boldsymbol{x}_t - \boldsymbol{f}_t(\boldsymbol{x}_t) \Delta t\Vert^2}{2 g_t^2\Delta t} + \log p(\boldsymbol{x}_t) - \log p(\boldsymbol{x}_{t+\Delta t})\right) \end{aligned}\label{eq:bayes-dt} It is not hard to find that when \Delta t is small enough, p(\boldsymbol{x}_{t+\Delta t}|\boldsymbol{x}_t) is significantly non-zero only when \boldsymbol{x}_{t+\Delta t} is sufficiently close to \boldsymbol{x}_t; conversely, p(\boldsymbol{x}_t|\boldsymbol{x}_{t+\Delta t}) is significantly non-zero only in this case. Therefore, we only need to perform an approximate analysis for the case where \boldsymbol{x}_{t+\Delta t} and \boldsymbol{x}_t are sufficiently close. For this, we can use a Taylor expansion: \log p(\boldsymbol{x}_{t+\Delta t})\approx \log p(\boldsymbol{x}_t) + (\boldsymbol{x}_{t+\Delta t} - \boldsymbol{x}_t)\cdot \nabla_{\boldsymbol{x}_t}\log p(\boldsymbol{x}_t) + \Delta t \frac{\partial}{\partial t}\log p(\boldsymbol{x}_t) Note not to ignore the \frac{\partial}{\partial t} term, because p(\boldsymbol{x}_t) is actually the "probability density that the random variable at time t equals \boldsymbol{x}_t," while p(\boldsymbol{x}_{t+\Delta t}) is the "probability density that the random variable at time t+\Delta t equals \boldsymbol{x}_{t+\Delta t}." That is, p(\boldsymbol{x}_t) is actually a function of both t and \boldsymbol{x}_t, so an extra partial derivative with respect to t is needed. Substituting this into Equation [eq:bayes-dt] and completing the square, we get: p(\boldsymbol{x}_t|\boldsymbol{x}_{t+\Delta t}) \propto \exp\left(-\frac{\Vert\boldsymbol{x}_{t+\Delta t} - \boldsymbol{x}_t - \left[\boldsymbol{f}_t(\boldsymbol{x}_t) - g_t^2\nabla_{\boldsymbol{x}_t}\log p(\boldsymbol{x}_t) \right]\Delta t\Vert^2}{2 g_t^2\Delta t} + \mathcal{O}(\Delta t)\right) As \Delta t\to 0, the \mathcal{O}(\Delta t) term vanishes, thus: \begin{aligned} p(\boldsymbol{x}_t|\boldsymbol{x}_{t+\Delta t}) \propto&\, \exp\left(-\frac{\Vert\boldsymbol{x}_{t+\Delta t} - \boldsymbol{x}_t - \left[\boldsymbol{f}_t(\boldsymbol{x}_t) - g_t^2\nabla_{\boldsymbol{x}_t}\log p(\boldsymbol{x}_t) \right]\Delta t\Vert^2}{2 g_t^2\Delta t}\right) \\ \approx&\,\exp\left(-\frac{\Vert \boldsymbol{x}_t - \boldsymbol{x}_{t+\Delta t} + \left[\boldsymbol{f}_{t+\Delta t}(\boldsymbol{x}_{t+\Delta t}) - g_{t+\Delta t}^2\nabla_{\boldsymbol{x}_{t+\Delta t}}\log p(\boldsymbol{x}_{t+\Delta t}) \right]\Delta t\Vert^2}{2 g_{t+\Delta t}^2\Delta t}\right) \end{aligned} That is, p(\boldsymbol{x}_t|\boldsymbol{x}_{t+\Delta t}) approximates a normal distribution with mean \boldsymbol{x}_{t+\Delta t} - \left[\boldsymbol{f}_{t+\Delta t}(\boldsymbol{x}_{t+\Delta t}) - g_{t+\Delta t}^2\nabla_{\boldsymbol{x}_{t+\Delta t}}\log p(\boldsymbol{x}_{t+\Delta t}) \right]\Delta t and covariance g_{t+\Delta t}^2\Delta t\,\boldsymbol{I}. Taking the limit \Delta t\to 0, this corresponds to the SDE: d\boldsymbol{x} = \left[\boldsymbol{f}_t(\boldsymbol{x}) - g_t^2\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x}) \right] dt + g_t d\boldsymbol{w}\label{eq:reverse-sde} This is the SDE corresponding to the reverse process, which first appeared in "Reverse-Time Diffusion Equation Models". Here we specifically added the subscript t to p to emphasize that this is the distribution at time t.

Score Matching

Now that we have obtained the reverse SDE as Equation [eq:reverse-sde], if we further know \nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x}), we can use the discretization format: \boldsymbol{x}_t - \boldsymbol{x}_{t+\Delta t} = - \left[\boldsymbol{f}_{t+\Delta t}(\boldsymbol{x}_{t+\Delta t}) - g_{t+\Delta t}^2\nabla_{\boldsymbol{x}_{t+\Delta t}}\log p(\boldsymbol{x}_{t+\Delta t}) \right]\Delta t - g_{t+\Delta t} \sqrt{\Delta t}\boldsymbol{\varepsilon}\label{eq:reverse-sde-discrete} to step-by-step complete the generative process of "building" [where \boldsymbol{\varepsilon}\sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})], thereby completing the construction of a generative diffusion model.

So how do we obtain \nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x})? The p_t(\boldsymbol{x}) at time t is the p(\boldsymbol{x}_t) mentioned earlier, which represents the marginal distribution at time t. In practice, we generally design models where the analytical solution for p(\boldsymbol{x}_t|\boldsymbol{x}_0) can be found, which means:

p(\boldsymbol{x}_t|\boldsymbol{x}_0) = \lim_{\Delta t\to 0}\int\cdots\iint p(\boldsymbol{x}_t|\boldsymbol{x}_{t-\Delta t})p(\boldsymbol{x}_{t-\Delta t}|\boldsymbol{x}_{t-2\Delta t})\cdots p(\boldsymbol{x}_{\Delta t}|\boldsymbol{x}_0) d\boldsymbol{x}_{t-\Delta t} d\boldsymbol{x}_{t-2\Delta t}\cdots d\boldsymbol{x}_{\Delta t}

can be directly solved. For example, when \boldsymbol{f}_t(\boldsymbol{x}) is a linear function of \boldsymbol{x}, p(\boldsymbol{x}_t|\boldsymbol{x}_0) can be solved analytically. Under this premise, we have: p(\boldsymbol{x}_t) = \int p(\boldsymbol{x}_t|\boldsymbol{x}_0)\tilde{p}(\boldsymbol{x}_0)d\boldsymbol{x}_0=\mathbb{E}_{\boldsymbol{x}_0}\left[p(\boldsymbol{x}_t|\boldsymbol{x}_0)\right] Thus: \nabla_{\boldsymbol{x}_t}\log p(\boldsymbol{x}_t) = \frac{\mathbb{E}_{\boldsymbol{x}_0}\left[\nabla_{\boldsymbol{x}_t} p(\boldsymbol{x}_t|\boldsymbol{x}_0)\right]}{\mathbb{E}_{\boldsymbol{x}_0}\left[p(\boldsymbol{x}_t|\boldsymbol{x}_0)\right]} = \frac{\mathbb{E}_{\boldsymbol{x}_0}\left[p(\boldsymbol{x}_t|\boldsymbol{x}_0)\nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{x}_t|\boldsymbol{x}_0)\right]}{\mathbb{E}_{\boldsymbol{x}_0}\left[p(\boldsymbol{x}_t|\boldsymbol{x}_0)\right]} It can be seen that the final expression has the form of a "weighted average of \nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{x}_t|\boldsymbol{x}_0)." Since we assumed p(\boldsymbol{x}_t|\boldsymbol{x}_0) has an analytical solution, the above expression can actually be estimated directly. However, it involves an average over all training samples \boldsymbol{x}_0, which is computationally expensive and has poor generalization. Therefore, we hope to use a neural network to learn a function \boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) such that it can directly compute \nabla_{\boldsymbol{x}_t}\log p(\boldsymbol{x}_t).

Many readers should be familiar with the following result (or it is not difficult to derive): \mathbb{E}[\boldsymbol{x}] = \mathop{\text{argmin}}_{\boldsymbol{\mu}}\mathbb{E}_{\boldsymbol{x}}\left[\Vert \boldsymbol{\mu} - \boldsymbol{x}\Vert^2\right] That is, to make \boldsymbol{\mu} equal to the mean of \boldsymbol{x}, one only needs to minimize the mean of \Vert \boldsymbol{\mu} - \boldsymbol{x}\Vert^2. Similarly, to make \boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) equal to the weighted average of \nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{x}_t|\boldsymbol{x}_0) [i.e., \nabla_{\boldsymbol{x}_t}\log p(\boldsymbol{x}_t)], we only need to minimize the weighted average of \left\Vert \boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) - \nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{x}_t|\boldsymbol{x}_0)\right\Vert^2, which is: \frac{\mathbb{E}_{\boldsymbol{x}_0}\left[p(\boldsymbol{x}_t|\boldsymbol{x}_0)\left\Vert \boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) - \nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{x}_t|\boldsymbol{x}_0)\right\Vert^2\right]}{\mathbb{E}_{\boldsymbol{x}_0}\left[p(\boldsymbol{x}_t|\boldsymbol{x}_0)\right]} The denominator \mathbb{E}_{\boldsymbol{x}_0}\left[p(\boldsymbol{x}_t|\boldsymbol{x}_0)\right] only serves to adjust the Loss weight; for simplicity, we can remove it directly, as it does not affect the optimal solution. Finally, we integrate over \boldsymbol{x}_t (equivalent to minimizing the above loss for every \boldsymbol{x}_t), obtaining the final loss function: \begin{aligned}&\int \mathbb{E}_{\boldsymbol{x}_0}\left[p(\boldsymbol{x}_t|\boldsymbol{x}_0)\left\Vert \boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) - \nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{x}_t|\boldsymbol{x}_0)\right\Vert^2\right] d\boldsymbol{x}_t \\ =&\, \mathbb{E}_{\boldsymbol{x}_0,\boldsymbol{x}_t \sim p(\boldsymbol{x}_t|\boldsymbol{x}_0)\tilde{p}(\boldsymbol{x}_0)}\left[\left\Vert \boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) - \nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{x}_t|\boldsymbol{x}_0)\right\Vert^2\right] \end{aligned}\label{eq:score-match} This is the loss function for "(conditional) score matching." The analytical solution for denoising autoencoders we derived in "From Denoising Autoencoders to Generative Models" is also a special case of this. The earliest origin of score matching can be traced back to the 2005 paper "Estimation of Non-Normalized Statistical Models by Score Matching". As for the earliest origin of conditional score matching, I traced it back to the 2011 paper "A Connection Between Score Matching and Denoising Autoencoders".

However, although the result is the same as score matching, in the derivation of this section, we have already set aside the concept of "score." It is an answer guided naturally by the objective. I believe this process is more enlightening and hope this derivation reduces the difficulty of understanding score matching.

Reverse Derivation

So far, we have constructed a general process for generative diffusion models:

1. Define "demolishing" (forward process) via the stochastic differential equation [eq:sde-forward];

2. Find the expression for p(\boldsymbol{x}_t|\boldsymbol{x}_0);

3. Train \boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) via the loss function [eq:score-match] (score matching);

4. Replace \nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x}) in Equation [eq:reverse-sde] with \boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) to complete "building" (reverse process).

Perhaps seeing terms like SDE and differential equations naturally makes people feel "panicked," but essentially, SDE is just a "front." Once the understanding of SDE is converted to Equations [eq:sde-discrete] and [eq:sde-proba], the concept of SDE can be completely set aside. Therefore, conceptually, there is not much difficulty.

It is not hard to find that defining a stochastic differential equation [eq:sde-forward] is easy, but solving for p(\boldsymbol{x}_t|\boldsymbol{x}_0) from [eq:sde-forward] is not. The remaining part of the original paper mainly provides derivations and experiments for two practical examples. However, since solving for p(\boldsymbol{x}_t|\boldsymbol{x}_0) is difficult, in my view, rather than defining [eq:sde-forward] first and then solving for p(\boldsymbol{x}_t|\boldsymbol{x}_0), why not define p(\boldsymbol{x}_t|\boldsymbol{x}_0) first and then reverse-derive the corresponding SDE, just like in DDIM?

For example, we first define: p(\boldsymbol{x}_t|\boldsymbol{x}_0) = \mathcal{N}(\boldsymbol{x}_t; \bar{\alpha}_t \boldsymbol{x}_0,\bar{\beta}_t^2 \boldsymbol{I}) And without loss of generality, assume the start is t=0 and the end is t=1, then the boundaries \bar{\alpha}_t, \bar{\beta}_t must satisfy: \bar{\alpha}_0 = 1,\quad \bar{\alpha}_1 = 0,\quad \bar{\beta}_0 = 0,\quad \bar{\beta}_1 = 1 Of course, these boundary conditions theoretically only need to be sufficiently approximate; they don’t necessarily have to be exactly equal. For instance, in the previous article, we analyzed that DDPM is equivalent to choosing \bar{\alpha}_t = e^{-5t^2}, which results in e^{-5}\approx 0 when t=1.

With p(\boldsymbol{x}_t|\boldsymbol{x}_0), we reverse-derive [eq:sde-forward], which essentially requires solving for p(\boldsymbol{x}_{t+\Delta t}|\boldsymbol{x}_t). It must satisfy: p(\boldsymbol{x}_{t+\Delta t}|\boldsymbol{x}_0) = \int p(\boldsymbol{x}_{t+\Delta t}|\boldsymbol{x}_t) p(\boldsymbol{x}_t|\boldsymbol{x}_0) d\boldsymbol{x}_t We consider a linear solution, i.e.: d\boldsymbol{x} = f_t\boldsymbol{x} dt + g_t d\boldsymbol{w} Similar to "Generative Diffusion Models Part 4: DDIM = High-Perspective DDPM", we write:

From this, we obtain: \begin{aligned} \bar{\alpha}_{t+\Delta t} =&\, (1 + f_t\Delta t) \bar{\alpha}_t \\ \bar{\beta}_{t+\Delta t}^2 =&\, (1 + f_t\Delta t)^2\bar{\beta}_t^2 + g_t^2\Delta t \end{aligned} Letting \Delta t\to 0, we solve for: f_t = \frac{d}{dt} \left(\ln \bar{\alpha}_t\right) = \frac{1}{\bar{\alpha}_t}\frac{d\bar{\alpha}_t}{dt}, \quad g_t^2 = \bar{\alpha}_t^2 \frac{d}{dt}\left(\frac{\bar{\beta}_t^2}{\bar{\alpha}_t^2}\right) = 2\bar{\alpha}_t \bar{\beta}_t \frac{d}{dt}\left(\frac{\bar{\beta}_t}{\bar{\alpha}_t}\right) When \bar{\alpha}_t\equiv 1, the result is the VE-SDE (Variance Exploding SDE) in the paper; if we take \bar{\alpha}_t^2 + \bar{\beta}_t^2=1, the result is the VP-SDE (Variance Preserving SDE) in the original paper.

As for the loss function, we can now calculate: \nabla_{\boldsymbol{x}_t} \log p(\boldsymbol{x}_t|\boldsymbol{x}_0) = -\frac{\boldsymbol{x}_t - \bar{\alpha}_t\boldsymbol{x}_0}{\bar{\beta}_t^2}=-\frac{\boldsymbol{\varepsilon}}{\bar{\beta}_t} The second equality holds because \boldsymbol{x}_t = \bar{\alpha}_t\boldsymbol{x}_0 + \bar{\beta}_t\boldsymbol{\varepsilon}. To align with previous results, we set \boldsymbol{s}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) = -\frac{\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)}{\bar{\beta}_t}. At this point, Equation [eq:score-match] becomes: \frac{1}{\bar{\beta}_t^2}\mathbb{E}_{\boldsymbol{x}_0\sim \tilde{p}(\boldsymbol{x}_0),\boldsymbol{\varepsilon}\sim\mathcal{N}(\boldsymbol{0},\boldsymbol{I})}\left[\left\Vert \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\bar{\alpha}_t\boldsymbol{x}_0 + \bar{\beta}_t\boldsymbol{\varepsilon}, t) - \boldsymbol{\varepsilon}\right\Vert^2\right] Ignoring the coefficient, this is the loss function of DDPM. Replacing \nabla_{\boldsymbol{x}_{t+\Delta t}}\log p(\boldsymbol{x}_{t+\Delta t}) in Equation [eq:reverse-sde-discrete] with -\frac{\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_{t+\Delta t}, t+\Delta t)}{\bar{\beta}_{t+\Delta t}}, the result has the same first-order approximation as the DDPM sampling process (meaning they are equivalent as \Delta t\to 0).

Summary

This article mainly introduced the general framework for understanding diffusion models using SDEs established by Dr. Yang Song. This includes deriving the reverse SDE, score matching, and other results in as intuitive a language as possible, and providing my own thoughts on solving the equations.