In Generative Diffusion Models (13): From Universal Gravitation to Diffusion Models, we introduced an ODE-based generative diffusion model, inspired by universal gravitation, with a very clear geometric interpretation. After reading it, some readers asked whether "universal gravitation" is the only choice: could other forms of force be used to construct diffusion models under the same physical picture? Moreover, while the model is physically intuitive, it still lacks a rigorous mathematical proof that it can indeed learn the data distribution.
This article attempts to answer the question "what kind of force field is suitable for constructing ODE-based generative diffusion models" from a more precise mathematical perspective.
Basic Conclusions
To answer this question, we need to use a conclusion regarding the distribution changes corresponding to ordinary differential equations (ODEs) derived in Generative Diffusion Models (12): "Hard-Core" Diffusion ODEs.
Consider a first-order (system of) ordinary differential equation(s) for \boldsymbol{x}_t \in \mathbb{R}^d, t \in [0,T]: \begin{equation} \frac{d\boldsymbol{x}_t}{dt}=\boldsymbol{f}_t(\boldsymbol{x}_t) \label{eq:ode} \end{equation} This describes a (reversible) transformation from \boldsymbol{x}_0 to \boldsymbol{x}_T. If \boldsymbol{x}_0 is a random variable, then \boldsymbol{x}_t throughout the process is also a random variable. The law governing its distribution change can be described by the following equation: \begin{equation} \frac{\partial}{\partial t} p_t(\boldsymbol{x}_t) = - \nabla_{\boldsymbol{x}_t}\cdot\Big(\boldsymbol{f}_t(\boldsymbol{x}_t) p_t(\boldsymbol{x}_t)\Big) \label{eq:ode-f-eq-fp} \end{equation} This result can be derived using the "Jacobian determinant + Taylor approximation" method as in Part 12, or by first deriving the full "Fokker-Planck equation" and setting g_t=0 as in Generative Diffusion Models (6): ODEs in the General Framework. Incidentally, Equation [eq:ode-f-eq-fp] is very famous in physics; it is known as the Continuity Equation and is one of the manifestations of various conservation laws.
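As a quick sanity check (my addition, not part of the original derivation), the continuity equation can be verified numerically in a toy one-dimensional case: the flow f_t(x) = x transports \mathcal{N}(0,1) to \mathcal{N}(0, e^{2t}), and finite differences confirm Equation [eq:ode-f-eq-fp]:

```python
import numpy as np

# Toy check of the continuity equation in 1D: the flow f_t(x) = x
# transports N(0, 1) to N(0, e^{2t}), so p_t(x) = N(x; 0, e^{2t})
# should satisfy  dp/dt = -d(f * p)/dx.
def p(t, x):
    s = np.exp(t)
    return np.exp(-x**2 / (2 * s**2)) / (np.sqrt(2 * np.pi) * s)

t, x = 0.3, np.linspace(-4, 4, 2001)
h = 1e-5
lhs = (p(t + h, x) - p(t - h, x)) / (2 * h)  # dp/dt via central difference
rhs = -np.gradient(x * p(t, x), x)           # -d(f p)/dx on the grid
assert np.max(np.abs(lhs - rhs)) < 1e-3
```

The same check works for any flow whose p_t is known in closed form.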
Returning to diffusion models, the goal is to construct a transformation that turns samples from a simple distribution into samples from a target distribution. Using Equation [eq:ode-f-eq-fp], we can in principle solve for a feasible \boldsymbol{f}_t(\boldsymbol{x}_t) given a specified p_t(\boldsymbol{x}_t), and then complete the generation process using Equation [eq:ode]. Note that Equation [eq:ode-f-eq-fp] is a single equation, while the unknown \boldsymbol{f}_t(\boldsymbol{x}_t) has d components; it is therefore an underdetermined system. In principle, we can arbitrarily specify the complete p_t(\boldsymbol{x}_t) (not just the boundary values at t=0, T) and solve for \boldsymbol{f}_t(\boldsymbol{x}_t).
So, theoretically, constructing an ODE-based diffusion model amounts to solving a very loosely constrained underdetermined system. This is true, but the solutions obtained this way may be difficult to implement in practice. The precise formulation of the problem is therefore: how do we find more practical solutions of Equation [eq:ode-f-eq-fp]?
Simplifying the Equation
Notice that Equation [eq:ode-f-eq-fp] can be rewritten as: \begin{equation} \underbrace{\left(\frac{\partial}{\partial t}, \nabla_{\boldsymbol{x}_t}\right)}_{\nabla_{(t,\, \boldsymbol{x}_t)}}\cdot \underbrace{\Big(p_t( \boldsymbol{x}_t), \boldsymbol{f}_t(\boldsymbol{x}_t) p_t(\boldsymbol{x}_t)\Big)}_{\boldsymbol{u}\in\mathbb{R}^{d+1}}=0 \end{equation} As shown above, we can treat \left(\frac{\partial}{\partial t}, \nabla_{\boldsymbol{x}_t}\right) as a (d+1)-dimensional gradient \nabla_{(t,\, \boldsymbol{x}_t)}, and \big(p_t( \boldsymbol{x}_t), \boldsymbol{f}_t(\boldsymbol{x}_t) p_t(\boldsymbol{x}_t)\big) forms a (d+1)-dimensional vector \boldsymbol{u}(t, \boldsymbol{x}_t). Thus, [eq:ode-f-eq-fp] can be written as a simple divergence equation: \begin{equation} \nabla_{(t,\, \boldsymbol{x}_t)}\cdot\boldsymbol{u}(t, \boldsymbol{x}_t)=0 \label{eq:div-eq} \end{equation} In this form, we have: \begin{equation} \frac{d\boldsymbol{x}_t}{dt} = \boldsymbol{f}_t(\boldsymbol{x}_t) = \frac{\boldsymbol{u}_{> 1}(t, \boldsymbol{x}_t)}{\boldsymbol{u}_1(t, \boldsymbol{x}_t)} \label{eq:div-eq-ode} \end{equation} where \boldsymbol{u}_1 and \boldsymbol{u}_{> 1} represent the first component and the remaining d components of \boldsymbol{u}, respectively. Of course, we must not forget the constraints: \begin{equation} \left\{\begin{aligned} &\boldsymbol{u}_1(0, \boldsymbol{x}_0) = p_0(\boldsymbol{x}_0)\quad&(\text{Initial condition}) \\[5pt] &\int \boldsymbol{u}_1(t, \boldsymbol{x}_t) d\boldsymbol{x}_t = 1\quad&(\text{Integral condition}) \end{aligned}\right. \end{equation} where p_0(\boldsymbol{x}_0) is the data distribution, i.e., the target distribution to be generated. For the final distribution at t=T, our requirement is simply that it be as simple as possible to facilitate sampling; otherwise, there are no quantitative requirements, so we do not need to write it out for now.
Green’s Function
After this formal transformation, we can view \boldsymbol{u}(t, \boldsymbol{x}_t) as a (d+1)-dimensional vector field, and the differential equation [eq:div-eq-ode] exactly describes the trajectory of a particle moving along the field lines. This is identical to the physical picture presented in Part 13.
To find a general solution for \boldsymbol{u}(t, \boldsymbol{x}_t), we can use the idea of Green’s functions. First, we attempt to solve the following problem: \begin{equation} \left\{\begin{aligned} &\nabla_{(t,\, \boldsymbol{x}_t)}\cdot\boldsymbol{G}(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0)=0\\ &\boldsymbol{G}_1(0, 0; \boldsymbol{x}_t, \boldsymbol{x}_0) = \delta(\boldsymbol{x}_t - \boldsymbol{x}_0),\int \boldsymbol{G}_1(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0) d\boldsymbol{x}_t = 1 \end{aligned}\right. \label{eq:div-green} \end{equation} It is easy to prove that if the above holds, then: \begin{equation} \boldsymbol{u}(t, \boldsymbol{x}_t) = \int \boldsymbol{G}(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0)p_0(\boldsymbol{x}_0) d\boldsymbol{x}_0 = \mathbb{E}_{\boldsymbol{x}_0\sim p_0(\boldsymbol{x}_0)}[\boldsymbol{G}(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0)] \label{eq:div-green-int} \end{equation} will be a solution to Equation [eq:div-eq] satisfying the corresponding constraints. In this way, we express \boldsymbol{u}(t, \boldsymbol{x}_t) as an expectation over training samples, which facilitates model training. It is not hard to see that \boldsymbol{G}_1(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0) here is actually the conditional probability p_t(\boldsymbol{x}_t|\boldsymbol{x}_0) in the diffusion model.
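The proof is a short superposition argument: the divergence equation and both constraints are linear in \boldsymbol{G}, so they survive integration against p_0(\boldsymbol{x}_0): \begin{equation} \begin{aligned} \nabla_{(t,\,\boldsymbol{x}_t)}\cdot\boldsymbol{u}(t, \boldsymbol{x}_t) &= \int \nabla_{(t,\,\boldsymbol{x}_t)}\cdot\boldsymbol{G}(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0)\, p_0(\boldsymbol{x}_0)\, d\boldsymbol{x}_0 = 0 \\ \boldsymbol{u}_1(0, \boldsymbol{x}_0) &= \int \delta(\boldsymbol{x}_0 - \boldsymbol{x}_0')\, p_0(\boldsymbol{x}_0')\, d\boldsymbol{x}_0' = p_0(\boldsymbol{x}_0) \\ \int \boldsymbol{u}_1(t, \boldsymbol{x}_t)\, d\boldsymbol{x}_t &= \int\left(\int \boldsymbol{G}_1(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0)\, d\boldsymbol{x}_t\right) p_0(\boldsymbol{x}_0)\, d\boldsymbol{x}_0 = 1 \end{aligned} \end{equation}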
In fact, \boldsymbol{G}(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0) defined in Equation [eq:div-green] is not a Green’s function in the usual sense. A standard Green’s function refers to a solution under a point source, whereas the "point source" here is placed at the boundary. Nevertheless, the defined \boldsymbol{G}(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0) still possesses properties similar to conventional Green’s functions; it itself acts as a "force field" generated by a point source, and Equation [eq:div-green-int] is exactly the integration of fields from point sources to find the field of a continuous distribution of sources.
Universal Gravitation
Now, based on the above framework, we solve for some specific results. As mentioned earlier, Equations [eq:div-eq] and [eq:div-green] are underdetermined systems with "d+1 unknowns and one equation," so they theoretically admit infinitely many solutions of various types. To solve them, we must introduce additional assumptions to pin the solution down. The first solution is based on the isotropy assumption, which corresponds exactly to the results in Part 13.
Solving under Assumptions
Note that "isotropic" here refers to isotropy in the (d+1)-dimensional space composed of (t, \boldsymbol{x}_t). This means \boldsymbol{G}(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0) points towards the source point (0, \boldsymbol{x}_0), and its magnitude depends only on R = \sqrt{(t-0)^2 + \Vert \boldsymbol{x}_t - \boldsymbol{x}_0\Vert^2}. Thus, we can set: \begin{equation} \boldsymbol{G}(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0) = \varphi(R)(t, \boldsymbol{x}_t - \boldsymbol{x}_0) \end{equation} Then: \begin{equation} \begin{aligned} 0 =&\, \nabla_{(t,\, \boldsymbol{x}_t)}\cdot\boldsymbol{G}(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0) \\ =&\, \nabla_{(t,\, \boldsymbol{x}_t)}\varphi(R)\cdot(t, \boldsymbol{x}_t - \boldsymbol{x}_0) + \varphi(R)\nabla_{(t,\, \boldsymbol{x}_t)}\cdot (t, \boldsymbol{x}_t - \boldsymbol{x}_0) \\ =&\, \varphi'(R) \frac{(t, \boldsymbol{x}_t - \boldsymbol{x}_0)}{R}\cdot(t, \boldsymbol{x}_t - \boldsymbol{x}_0) + (d+1)\varphi(R)\\ =&\, \varphi'(R) R + (d+1)\varphi(R) \\ =&\, \frac{[\varphi(R)R^{d+1}]'}{R^d} \end{aligned} \end{equation} That is, [\varphi(R)R^{d+1}]'=0, or \varphi(R)R^{d+1}=C, which gives \varphi(R)=C\times R^{-(d+1)}. Therefore, a candidate solution is: \begin{equation} \boldsymbol{G}(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0) = C\times\frac{(t, \boldsymbol{x}_t - \boldsymbol{x}_0)}{\left(t^2 + \Vert \boldsymbol{x}_t - \boldsymbol{x}_0\Vert^2\right)^{(d+1)/2}} \end{equation}
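As an illustrative check (my addition), here is a finite-difference verification, for d = 2 and a source at the origin, that this candidate field is divergence-free away from the source point:

```python
import numpy as np

# Finite-difference check that
#   G(t, x) = (t, x - x0) / (t^2 + ||x - x0||^2)^((d+1)/2)
# is divergence-free away from the source; here d = 2 and x0 = 0.
d = 2

def G(t, x):                         # x has shape (d,)
    R2 = t**2 + np.dot(x, x)
    return np.concatenate(([t], x)) / R2**((d + 1) / 2)

t0, x0 = 0.7, np.array([0.4, -0.9])  # an arbitrary point off the source
h = 1e-5
div = (G(t0 + h, x0)[0] - G(t0 - h, x0)[0]) / (2 * h)  # d(G_1)/dt
for i in range(d):                   # add the spatial divergence terms
    e = np.zeros(d); e[i] = h
    div += (G(t0, x0 + e)[i + 1] - G(t0, x0 - e)[i + 1]) / (2 * h)
assert abs(div) < 1e-6
```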
Constraints
As we can see, under the isotropic assumption, the universal gravitation solution is the unique solution. To prove it is a feasible solution, we must check the constraints, the key one being: \begin{equation} \int\boldsymbol{G}_1(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0) d\boldsymbol{x}_t = C\times \int\frac{t}{\left(t^2 + \Vert \boldsymbol{x}_t - \boldsymbol{x}_0\Vert^2\right)^{(d+1)/2}}d\boldsymbol{x}_t \end{equation} In fact, we only need to verify that the integral result is independent of t and \boldsymbol{x}_0; then we can choose an appropriate constant C to make the integral equal to 1. For t > 0, we can perform a change of variables \boldsymbol{z} = (\boldsymbol{x}_t - \boldsymbol{x}_0) / t. Since the range of \boldsymbol{x}_t is the entire space, \boldsymbol{z} is also over the entire space. Substituting this into the equation gives: \begin{equation} \int\boldsymbol{G}_1(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0) d\boldsymbol{x}_t = C\times \int\frac{1}{\left(1 + \Vert \boldsymbol{z}\Vert^2\right)^{(d+1)/2}}d\boldsymbol{z} \label{eq:pz} \end{equation} Now it is clear that the integral result is independent of t and \boldsymbol{x}_0. Thus, by choosing an appropriate C, the condition that the integral equals 1 is satisfied. Henceforth, we assume C has been chosen such that the integral is 1.
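For concreteness (my addition), in the case d = 1 the integral can even be done in closed form: \int t/(t^2+x^2)\,dx = \pi for every t > 0, so C = 1/\pi in that dimension. A quick numerical check:

```python
import numpy as np
from scipy.integrate import quad

# For d = 1 the normalization integral is
#   I(t) = integral of t / (t^2 + x^2) over all x  =  pi, for every t > 0,
# independent of t, so in this dimension C = 1/pi.
def I(t):
    val, _ = quad(lambda x: t / (t**2 + x**2), -np.inf, np.inf)
    return val

assert abs(I(0.5) - np.pi) < 1e-6
assert abs(I(3.0) - np.pi) < 1e-6
```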
As for the initial value, we need to verify \lim\limits_{t\to 0^+}\boldsymbol{G}_1(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0) = \delta(\boldsymbol{x}_t - \boldsymbol{x}_0). This can be checked according to the definition of the Dirac delta function:
1. When \boldsymbol{x}_t \neq \boldsymbol{x}_0, the limit is clearly 0;
2. When \boldsymbol{x}_t = \boldsymbol{x}_0, the limit is clearly \infty;
3. We have already verified that the integral of \boldsymbol{G}_1(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0) over \boldsymbol{x}_t is always 1.
These three points are exactly the basic properties of the Dirac delta function, or even its definition, so the initial value check also passes.
Analysis of Results
Now, according to Equation [eq:div-green-int], we have: \begin{equation} \boldsymbol{u}(t, \boldsymbol{x}_t) = C\times\mathbb{E}_{\boldsymbol{x}_0\sim p_0(\boldsymbol{x}_0)}\left[\frac{(t, \boldsymbol{x}_t - \boldsymbol{x}_0)}{\left(t^2 + \Vert \boldsymbol{x}_t - \boldsymbol{x}_0\Vert^2\right)^{(d+1)/2}}\right] \end{equation} Next, we can use \mathbb{E}_{\boldsymbol{x}}[\boldsymbol{x}] = \mathop{\text{argmin}}_{\boldsymbol{\mu}}\mathbb{E}_{\boldsymbol{x}}\left[\Vert \boldsymbol{x} - \boldsymbol{\mu}\Vert^2\right] to construct a score-matching-like objective for learning. This process has been discussed many times and will not be repeated here.
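Concretely, since for each fixed (t, \boldsymbol{x}_t) the minimizer of the squared loss over \boldsymbol{x}_0\sim p_0 is exactly the expectation in Equation [eq:div-green-int], one possible training objective (a sketch, with (t, \boldsymbol{x}_t) drawn from some full-support sampling distribution independent of \boldsymbol{x}_0; the paper's actual sampling scheme may differ) is: \begin{equation} \min_{\theta}\, \mathbb{E}_{t,\, \boldsymbol{x}_t}\, \mathbb{E}_{\boldsymbol{x}_0\sim p_0(\boldsymbol{x}_0)}\left[\left\Vert \boldsymbol{u}_{\theta}(t, \boldsymbol{x}_t) - C\times\frac{(t, \boldsymbol{x}_t - \boldsymbol{x}_0)}{\left(t^2 + \Vert \boldsymbol{x}_t - \boldsymbol{x}_0\Vert^2\right)^{(d+1)/2}}\right\Vert^2\right] \end{equation}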
As mentioned before, \boldsymbol{G}_1(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0) is actually p_t(\boldsymbol{x}_t|\boldsymbol{x}_0). We now know its specific form: \begin{equation} p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)\propto \frac{t}{\left(t^2 + \Vert \boldsymbol{x}_t - \boldsymbol{x}_0\Vert^2\right)^{(d+1)/2}} \end{equation} When T is large enough, the influence of \boldsymbol{x}_0 becomes negligible, and p_T(\boldsymbol{x}_T|\boldsymbol{x}_0) degenerates into a prior distribution independent of \boldsymbol{x}_0: \begin{equation} p_{prior}(\boldsymbol{x}_T) \propto \frac{T}{(T^2 + \Vert\boldsymbol{x}_T\Vert^2)^{(d+1)/2}} \end{equation} In Part 13, deriving this result was quite involved, but within this framework it follows naturally. Furthermore, since we now have p_t(\boldsymbol{x}_t|\boldsymbol{x}_0), we can in principle sample \boldsymbol{x}_t \sim p_t(\boldsymbol{x}_t|\boldsymbol{x}_0). From the derivation of Equation [eq:pz], we know that under the substitution \boldsymbol{z} = (\boldsymbol{x}_t - \boldsymbol{x}_0) / t: \begin{equation} p(\boldsymbol{z}) \propto \frac{1}{\left(1 + \Vert \boldsymbol{z}\Vert^2\right)^{(d+1)/2}} \label{eq:pz-2} \end{equation} Thus, we can first sample \boldsymbol{z}\sim p(\boldsymbol{z}) and then obtain the corresponding \boldsymbol{x}_t via \boldsymbol{x}_t = \boldsymbol{x}_0 + t\, \boldsymbol{z}. Since p(\boldsymbol{z}) depends only on the magnitude \Vert\boldsymbol{z}\Vert, we can sample the magnitude via the inverse CDF method and then draw a uniformly random direction, exactly as when sampling from the prior distribution. However, while revisiting a problem left over from Part 13, I discovered an unexpected "surprise"!
Problem Revisited
In Part 13, we pointed out that the sampling scheme given in the original paper was: \begin{equation} \boldsymbol{x}_t = \boldsymbol{x}_0 + \Vert \boldsymbol{\varepsilon}_{\boldsymbol{x}}\Vert (1+\tau)^m \boldsymbol{u},\quad t = |\varepsilon_t| (1+\tau)^m \end{equation} where (\boldsymbol{\varepsilon}_{\boldsymbol{x}},\varepsilon_t)\sim\mathcal{N}(\boldsymbol{0}, \sigma^2\boldsymbol{I}_{(d+1)\times(d+1)}), m\sim U[0,M], \boldsymbol{u} is a unit vector uniformly distributed on the d-dimensional unit sphere, and \tau, \sigma, M are constants. At the time, we remarked that this sampling scheme seemed "rather subjective," i.e., designed ad hoc without much justification. However, whether the author intended it or not, I found a remarkable "coincidence": this sampling is exactly an implementation of Equation [eq:pz-2]!
Let’s prove this. First, substitute the second half of the above equation into the first half: \begin{equation} \boldsymbol{x}_t = \boldsymbol{x}_0 + t\times \frac{\Vert \boldsymbol{\varepsilon}_{\boldsymbol{x}}\Vert}{|\varepsilon_t|} \boldsymbol{u} \end{equation} Formally, this is already the same as \boldsymbol{x}_t = \boldsymbol{x}_0 + t\, \boldsymbol{z} mentioned in the previous section, and \boldsymbol{u} is an isotropic unit random vector. So the question becomes whether \frac{\Vert \boldsymbol{\varepsilon}_{\boldsymbol{x}}\Vert}{|\varepsilon_t|} has the same distribution as \Vert\boldsymbol{z}\Vert. The answer is yes! Note that when the probability density changes from Cartesian coordinates to spherical coordinates, it must be multiplied by \text{radius}^{d-1}. Thus, according to Equation [eq:pz-2]: \begin{equation} p(\Vert\boldsymbol{z}\Vert) \propto \frac{\Vert \boldsymbol{z}\Vert^{d-1}}{\left(1 + \Vert \boldsymbol{z}\Vert^2\right)^{(d+1)/2}} \label{eq:pz-3} \end{equation} Given (\boldsymbol{\varepsilon}_{\boldsymbol{x}},\varepsilon_t)\sim\mathcal{N}(\boldsymbol{0}, \boldsymbol{I}_{(d+1)\times(d+1)}) (since we are studying a ratio, the variance cancels out, so we take \sigma=1 for simplicity): \begin{equation} p(\Vert\boldsymbol{\varepsilon}_{\boldsymbol{x}}\Vert) \propto \Vert\boldsymbol{\varepsilon}_{\boldsymbol{x}}\Vert^{d-1} e^{-\Vert\boldsymbol{\varepsilon}_{\boldsymbol{x}}\Vert^2/2}, \quad p(|\varepsilon_t|) \propto e^{-|\varepsilon_t|^2/2} \end{equation} Let r = \frac{\Vert \boldsymbol{\varepsilon}_{\boldsymbol{x}}\Vert}{|\varepsilon_t|}, then \Vert \boldsymbol{\varepsilon}_{\boldsymbol{x}}\Vert=r|\varepsilon_t|. 
Based on the equality of probabilities: \begin{equation} \begin{aligned} p(r)dr =&\, \mathbb{E}_{|\varepsilon_t|\sim p(|\varepsilon_t|)}\big[p(\Vert \boldsymbol{\varepsilon}_{\boldsymbol{x}}\Vert = r|\varepsilon_t|)d(r|\varepsilon_t|)\big] \\[5pt] \propto&\, \mathbb{E}_{|\varepsilon_t|\sim p(|\varepsilon_t|)}\big[r^{d-1}|\varepsilon_t|^d e^{-r^2|\varepsilon_t|^2/2} dr\big] \\[5pt] \propto&\, \int_0^{\infty} r^{d-1}|\varepsilon_t|^d e^{-r^2|\varepsilon_t|^2/2} e^{-|\varepsilon_t|^2/2} d|\varepsilon_t| dr \\ =&\, \int_0^{\infty} r^{d-1}|\varepsilon_t|^d e^{-(r^2+1)|\varepsilon_t|^2/2} d|\varepsilon_t| dr \\ =&\, \frac{r^{d-1}}{(1+r^2)^{(d+1)/2}} \int_0^{\infty} s^d e^{-s^2/2} ds dr \quad\left(\text{let } s = |\varepsilon_t|\sqrt{r^2+1}\right) \\ \propto&\, \frac{r^{d-1}}{(1+r^2)^{(d+1)/2}} dr \end{aligned} \end{equation} Therefore p(r)\propto \frac{r^{d-1}}{(1+r^2)^{(d+1)/2}}, which is perfectly consistent with [eq:pz-3]. Thus, \frac{\Vert \boldsymbol{\varepsilon}_{\boldsymbol{x}}\Vert}{|\varepsilon_t|}\boldsymbol{u} indeed provides an effective way to sample \boldsymbol{z}. This is much simpler to implement than the inverse CDF method, though the original paper did not mention this.
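This equivalence is also easy to check by simulation (my addition). For d = 1, p(r) \propto 1/(1+r^2) is the half-Cauchy distribution, whose quartiles are \tan(\pi/8), \tan(\pi/4), \tan(3\pi/8), and the empirical quartiles of \Vert\boldsymbol{\varepsilon}_{\boldsymbol{x}}\Vert/|\varepsilon_t| match:

```python
import numpy as np

# Monte-Carlo check of the ratio trick for d = 1: with
# (eps_x, eps_t) ~ N(0, I_2), the ratio |eps_x| / |eps_t| should follow
# p(r) proportional to 1 / (1 + r^2), a half-Cauchy law whose quartiles
# are tan(pi/8), tan(pi/4), tan(3*pi/8).
rng = np.random.default_rng(0)
eps_x, eps_t = rng.standard_normal((2, 200_000))
r = np.abs(eps_x) / np.abs(eps_t)

q1, q2, q3 = np.quantile(r, [0.25, 0.5, 0.75])
assert abs(q1 - np.tan(np.pi / 8)) < 0.02
assert abs(q2 - np.tan(np.pi / 4)) < 0.02
assert abs(q3 - np.tan(3 * np.pi / 8)) < 0.1
```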
Spatiotemporal Separation
We just solved for the isotropic solution in the (d+1)-dimensional space composed of (t, \boldsymbol{x}_t). In a sense, this is the simplest solution. This statement might be hard for some readers to accept, as the universal gravitation diffusion model clearly looks much more mathematically complex. However, in solving mathematical physics equations, isotropic solutions are indeed often used as the simplest trial solutions.
Of course, treating (t, \boldsymbol{x}_t) as a "space-time" whole is not very intuitive. We are more accustomed to understanding isotropy in space and keeping the time dimension independent. This section solves under that assumption.
Solving under Assumptions
That is to say, "isotropy" here refers to isotropy in the d-dimensional space of \boldsymbol{x}_t. \boldsymbol{G}(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0) is decomposed into two parts: (\boldsymbol{G}_1(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0), \boldsymbol{G}_{> 1}(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0)). Here \boldsymbol{G}_1(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0) is just a scalar, and isotropy means it depends only on r = \Vert \boldsymbol{x}_t - \boldsymbol{x}_0\Vert, which we denote as \phi_t(r). \boldsymbol{G}_{> 1}(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0) is a d-dimensional vector, and isotropy means \boldsymbol{G}_{> 1}(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0) points towards the source point \boldsymbol{x}_0, and its magnitude depends only on r = \Vert \boldsymbol{x}_t - \boldsymbol{x}_0\Vert. Thus, we can set: \begin{equation} \boldsymbol{G}_{>1}(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0) = \varphi_t(r)(\boldsymbol{x}_t - \boldsymbol{x}_0) \end{equation} Then: \begin{equation} \begin{aligned} 0 =&\, \frac{\partial}{\partial t}\phi_t(r) + \nabla_{\boldsymbol{x}_t}\cdot(\varphi_t(r) (\boldsymbol{x}_t - \boldsymbol{x}_0)) \\ =&\, \frac{\partial}{\partial t}\phi_t(r) + r\frac{\partial}{\partial r}\varphi_t(r) + d\, \varphi_t(r) \\ =&\, \frac{\partial}{\partial t}\phi_t(r) + \frac{1}{r^{d-1}}\frac{\partial}{\partial r}\big(\varphi_t(r) r^d\big)\\ \end{aligned} \end{equation} There are two unknown functions \phi_t(r) and \varphi_t(r), but only one equation, so solving is even simpler. Since the constraints apply to \boldsymbol{G}_1(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0), i.e., \phi_t(r) rather than \varphi_t(r), we usually specify a \phi_t(r) that satisfies the conditions and solve for \varphi_t(r). The result is: \begin{equation} \varphi_t(r) = -\frac{1}{r^d}\int \frac{\partial}{\partial t}\phi_t(r) r^{d-1} dr = -\frac{1}{r^d}\frac{\partial}{\partial t}\int \phi_t(r) r^{d-1} dr \label{eq:f-g-t-r} \end{equation}
Gaussian Diffusion
In this part, we show that the common ODE diffusion models based on the Gaussian distribution assumption are also a special case of Equation [eq:f-g-t-r]. For the Gaussian assumption: \begin{equation} \boldsymbol{G}_1(t, 0; \boldsymbol{x}_t, \boldsymbol{x}_0) = p_t(\boldsymbol{x}_t|\boldsymbol{x}_0) = \frac{1}{(2\pi\sigma_t^2)^{d/2}} e^{-\Vert\boldsymbol{x}_t-\boldsymbol{x}_0\Vert^2/2\sigma_t^2} \end{equation} That is, \phi_t(r) = \frac{1}{(2\pi\sigma_t^2)^{d/2}} e^{-r^2/2\sigma_t^2}, where \sigma_t is a monotonically increasing function of t, satisfying \sigma_0=0 and \sigma_T being large enough. \sigma_0=0 is to satisfy the initial condition, and a large \sigma_T ensures the prior distribution is independent of the data. As for the constraint that the integral equals 1, this is a basic property of the Gaussian distribution and is naturally satisfied.
Substituting into Equation [eq:f-g-t-r] yields: \begin{equation} \varphi_t(r) = \frac{\dot{\sigma}_t}{(2\pi\sigma_t^2)^{d/2}\sigma_t} e^{-r^2/2\sigma_t^2} = \frac{\dot{\sigma}_t}{\sigma_t}\phi_t(r) \end{equation} (The intermediate integral over r involves the incomplete gamma function and is quite complex; I computed it directly with Mathematica.) With this result, we have: \begin{equation} \begin{aligned} \boldsymbol{u}_1(t, \boldsymbol{x}_t) =&\, \int p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)p_0(\boldsymbol{x}_0) d\boldsymbol{x}_0 = p_t(\boldsymbol{x}_t) \\ \boldsymbol{u}_{> 1}(t, \boldsymbol{x}_t) =&\, \int \frac{\dot{\sigma}_t}{\sigma_t}(\boldsymbol{x}_t - \boldsymbol{x}_0)p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)p_0(\boldsymbol{x}_0) d\boldsymbol{x}_0 \\ =&\, -\dot{\sigma}_t\sigma_t \int\nabla_{\boldsymbol{x}_t} p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)p_0(\boldsymbol{x}_0) d\boldsymbol{x}_0 \\ =&\, -\dot{\sigma}_t\sigma_t \nabla_{\boldsymbol{x}_t} p_t(\boldsymbol{x}_t) \\ \end{aligned} \end{equation} Thus, according to Equation [eq:div-eq-ode]: \begin{equation} \boldsymbol{f}_t(\boldsymbol{x}_t) = \frac{\boldsymbol{u}_{> 1}(t, \boldsymbol{x}_t)}{\boldsymbol{u}_1(t, \boldsymbol{x}_t)} = -\dot{\sigma}_t\sigma_t \nabla_{\boldsymbol{x}_t} \log p_t(\boldsymbol{x}_t) \end{equation} These results are perfectly consistent with those in Part 12; for the remaining processing details, one can refer to that article.
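To make the generation step concrete, here is a minimal sketch (my addition, not the paper's implementation) of integrating this probability-flow ODE backward in time, in the toy case where all data mass sits at \boldsymbol{x}_0 = 0 and \sigma_t = t, so the score -\boldsymbol{x}_t/t^2 is known exactly and the backward solution from (T, x_T) is x_t = x_T\, t/T:

```python
import numpy as np

# A toy backward integration of the probability-flow ODE
#   dx/dt = -sigma'_t * sigma_t * score(x, t).
# Assume all data mass sits at x_0 = 0 and sigma_t = t, so
# p_t = N(0, t^2), score(x, t) = -x / t^2, and the exact backward
# solution from (T, x_T) is x_t = x_T * t / T.
def velocity(x, t):
    sigma, dsigma = t, 1.0           # sigma_t = t, so sigma'_t = 1
    score = -x / sigma**2            # exact score of p_t = N(0, t^2)
    return -dsigma * sigma * score   # = x / t

T, eps, n_steps = 1.0, 1e-3, 10_000
ts = np.geomspace(T, eps, n_steps)   # log-spaced grid keeps |dt|/t small
x = 2.0                              # a "prior sample" drawn from N(0, T^2)
for t_curr, t_next in zip(ts[:-1], ts[1:]):
    x = x + (t_next - t_curr) * velocity(x, t_curr)  # explicit Euler

assert abs(x - 2.0 * eps / T) < 1e-6  # matches x_T * eps / T
```

With a learned score model in place of the exact score, the same loop becomes a generic ODE sampler.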
Reverse Construction
The approach of specifying \phi_t(r) and solving for \varphi_t(r) is theoretically simple, but in practice it faces two difficulties: 1. \phi_t(r) must satisfy both the initial condition and the integral condition, which is not easy to construct; 2. the integral over r may not have a simple elementary form. Given this, we can turn to a reverse construction.
We know that \phi_t(r) is the probability density in Cartesian coordinates; converting to spherical coordinates requires multiplying by C_d r^{d-1}, where C_d is a constant depending only on d. According to Equation [eq:div-eq-ode], the final result is a ratio and is unaffected by constant factors, so for simplicity we drop this constant. After dropping it, \phi_t(r) r^{d-1} is exactly the integrand in Equation [eq:f-g-t-r]. Therefore, the integral in Equation [eq:f-g-t-r]: \begin{equation} \int \phi_t(r) r^{d-1} dr \end{equation} is exactly a cumulative probability function (more accurately, 1/C_d times the cumulative probability function plus a constant, but we have agreed to ignore irrelevant constants). Computing the cumulative probability from the density is not always easy, but computing the density from the cumulative probability is simple (differentiation). Thus, we can first construct the cumulative probability function and then obtain the corresponding \phi_t(r) and \varphi_t(r), thereby avoiding the difficulty of integration.
Specifically, construct a cumulative probability function \psi_t(r) satisfying the following conditions:
1. \psi_t(0)=0, \psi_t(\infty)=1;
2. \psi_t(r) is monotonically increasing with respect to r;
3. \forall r > 0, \lim\limits_{t\to 0^+} \psi_t(r)=1.
Students who have studied activation functions should find it easy to construct functions satisfying these conditions; they are essentially smooth approximations of the Heaviside step function, such as \tanh\left(\frac{r}{t}\right), 1-e^{-r/t}, etc. With \psi_t(r), according to Equation [eq:f-g-t-r], we have: \begin{equation} \phi_t(r) = \frac{1}{r^{d-1}}\frac{\partial}{\partial r}\psi_t(r), \quad \varphi_t(r) = -\frac{1}{r^d}\frac{\partial}{\partial t}(\psi_t(r) + \lambda_t) \end{equation} where \lambda_t is any function of t, which can generally be set to 0. Of course, these isotropic solutions are essentially equivalent, including the "universal gravitation diffusion" derived in the previous section. They can all be incorporated into the above formula and derived from each other through coordinate transformations. This is because the formula only depends on a univariate cumulative probability function \psi_t(r), and cumulative probability functions between different distributions can generally be transformed into each other (as they are well-behaved monotonically increasing functions).
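As a worked example (my addition), take the hypothetical choice \psi_t(r) = 1 - e^{-r/t}, read off \phi_t and \varphi_t by differentiation, and verify symbolically that the continuity relation \frac{\partial}{\partial t}\phi_t(r) + \frac{1}{r^{d-1}}\frac{\partial}{\partial r}\big(\varphi_t(r) r^d\big) = 0 indeed holds:

```python
import sympy as sp

# Reverse-construction sketch with the hypothetical CDF
# psi_t(r) = 1 - exp(-r/t): read off phi_t and varphi_t by
# differentiation and verify the continuity equation
#   d(phi)/dt + (1/r^{d-1}) d(varphi * r^d)/dr = 0
# symbolically, for symbolic dimension d.
r, t = sp.symbols('r t', positive=True)
d = sp.Symbol('d', positive=True)

psi = 1 - sp.exp(-r / t)                 # satisfies the three conditions
phi = sp.diff(psi, r) / r**(d - 1)       # density part G_1
varphi = -sp.diff(psi, t) / r**d         # radial magnitude of G_{>1}

residual = sp.diff(phi, t) + sp.diff(varphi * r**d, r) / r**(d - 1)
assert sp.simplify(residual) == 0
```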
Summary
This article constructs a general framework for ODE-based diffusion. Theoretically, all ODE-based diffusion models can be incorporated into this framework. We can also derive various novel or even bizarre ODE-based diffusion models from it. For example, the current derivations are based on the isotropic assumption, but we could replace the isotropic \varphi(R) with a more general \varphi(t; \boldsymbol{x}_t, \boldsymbol{x}_0). This can be solved using the Method of Characteristics for first-order partial differential equations, yielding a new family of models. Overall, this is a veritable "production workshop" for ODE-based diffusion models.
Some readers might ask: I just want a usable generative diffusion model; what is the value of creating so many fancy variants? In fact, similar to f-GAN: A Production Workshop for GAN Models and Designing GANs: Another GAN Production Workshop, we hope to discover and master the construction laws of generative models to further understand the keys to generation, thereby discovering more effective models. This is a never-ending pursuit of perfection.
The experimental results in the "universal gravitation diffusion" paper have already shown that, as an ODE-based diffusion model, it performs better than Gaussian diffusion. This indicates that even under the isotropic assumption, these mathematically equivalent diffusion models still exhibit performance differences in practice. Therefore, how to better combine experimental details to answer "what kind of design makes a better diffusion model" will be a very meaningful research question for the future.
Reprinting: Please include the original address of this article: https://kexue.fm/archives/9370
Further details on reprinting/citation: Please refer to Scientific Space FAQ.