In the previous article "Casual Talk on Generative Diffusion Models (22): Signal-to-Noise Ratio and High-Resolution Generation (Part 1)", we introduced how to improve the noise schedule by aligning the signal-to-noise ratio (SNR) of low-resolution images, thereby enhancing the performance of diffusion models trained directly in pixel space for high-resolution image generation. The protagonist of this article is also SNR and high-resolution generation, but it achieves something even more amazing—directly using a diffusion model trained on low-resolution images for high-resolution image generation without any additional training, achieving performance and inference costs comparable to models trained directly on high-resolution data!
This work comes from the recent paper "Upsample Guidance: Scale Up Diffusion Models without Training". It cleverly uses the upsampled output of a low-resolution model as a guidance signal and combines it with the translational invariance of CNNs for texture details, successfully achieving training-free high-resolution image generation.
Discussion of Ideas
We know that the training objective of a diffusion model is denoising (the first "D" in DDPM). Intuitively, the task of denoising should be resolution-independent. In other words, in an ideal scenario, a denoising model trained on low-resolution images should also be applicable to high-resolution image denoising, meaning a low-resolution diffusion model should be directly usable for high-resolution generation.
Is it really that ideal? I took a 128 \times 128 face (CelebA-HQ) diffusion model I had trained previously and ran it directly as a 256 \times 256 model at inference time. The generated results look like this:
As we can see, the generated results have two characteristics:
1. The results are no longer human faces, indicating that a denoising model trained at 128 \times 128 cannot be used directly at 256 \times 256.
2. Although the results are not ideal, they are very clear, without obvious blurring or checkerboard effects, and they retain some facial texture details.
We know that simply enlarging a small image (upsampling) is the most basic form of high-resolution generation. However, depending on the upsampling algorithm, the enlarged images usually suffer from blurring or checkerboard artifacts; in short, they lack sufficient texture details. At this point, a "whimsical" idea arises: since upsampled small images lack detail, while directly running a small-image model at high resolution retains some detail, can we use the latter to supplement the details of the former?
This is the core idea of the method proposed in the original paper.
Mathematical Description
In this section, we will formalize the idea with formulas to see how to proceed.
First, let’s unify the notation. Our target image resolution is w \times h, and the training image resolution is w/s \times h/s. Therefore, \boldsymbol{x}, \boldsymbol{\varepsilon} are of size w \times h \times 3 (including the channel dimension), while \boldsymbol{x}^{\text{low}}, \boldsymbol{\varepsilon}^{\text{low}} are of size w/s \times h/s \times 3. Let \mathcal{D} be the downsampling operator that performs average pooling from w \times h to w/s \times h/s, and \mathcal{U} be the upsampling operator using nearest-neighbor interpolation (direct repetition) from w/s \times h/s to w \times h.
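Concretely, these two operators might be implemented as follows (a minimal PyTorch sketch of my own; the paper does not prescribe a framework). Note that \mathcal{D}[\mathcal{U}[\cdot]] is the identity, since averaging an s \times s block of repeated values recovers the original pixel:

```python
import torch
import torch.nn.functional as F

def D(x, s):
    """Downsampling operator: s x s average pooling on a (batch, channel, h, w) tensor."""
    return F.avg_pool2d(x, kernel_size=s)

def U(x, s):
    """Upsampling operator: nearest-neighbor interpolation (direct repetition)."""
    return F.interpolate(x, scale_factor=s, mode="nearest")

x = torch.randn(1, 3, 128, 128)
assert U(x, 2).shape == (1, 3, 256, 256)
assert torch.allclose(D(U(x, 2), 2), x)  # D is a left inverse of U
```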
We know that a diffusion model relies on a trained denoising model \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t). Taking DDPM as an example (using the form from "Casual Talk on Generative Diffusion Models (3): DDPM = Bayesian + Denoising", which matches the mainstream formulation), its sampling step is: \begin{equation} \boldsymbol{x}_{t-1} = \frac{1}{\alpha_t}\left(\boldsymbol{x}_t - \frac{\beta_t^2}{\bar{\beta}_t}\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)\right) + \sigma_t \boldsymbol{\varepsilon},\quad \boldsymbol{\varepsilon}\sim\mathcal{N}(\boldsymbol{0}, \boldsymbol{I}) \end{equation} where the mainstream choices for \sigma_t are \frac{\bar{\beta}_{t-1}\beta_t}{\bar{\beta}_t} or \beta_t. However, we do not have an \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) trained at resolution w \times h; we only have one trained at w/s \times h/s.
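For orientation, one step of this sampler might look as follows (a sketch with assumed names of my own: alpha, beta, bar_beta, sigma are per-step schedule arrays and eps_model is the trained denoiser; this is not code from any specific library):

```python
import torch

def ddpm_step(x_t, t, eps_model, alpha, beta, bar_beta, sigma):
    """One DDPM reverse step x_t -> x_{t-1} in the notation above."""
    eps = eps_model(x_t, t)
    mean = (x_t - (beta[t] ** 2 / bar_beta[t]) * eps) / alpha[t]
    return mean + sigma[t] * torch.randn_like(x_t)
```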
Based on our experience, although shrinking and then enlarging a large image causes distortion, the result can still be considered a good approximation of the original image. This suggests that a primary term for the denoising model can be constructed in the same way. Specifically, to denoise a w \times h image, we can first shrink it (downsample via average pooling) to w/s \times h/s, feed it into the denoising model trained at w/s \times h/s, and finally enlarge (upsample) the denoising result back to w \times h. Although this is not the ideal denoising result, it should already capture the primary component of the ideal result.
Next, as demonstrated in the previous section, directly using a low-resolution trained denoising model as a high-resolution model can preserve some texture details. Thus, we can consider the unmodified \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) as a secondary term depicting details. By finding a way to integrate these primary and secondary terms, we might obtain a sufficiently good approximation of the precise denoising model, thereby achieving training-free high-resolution diffusion generation.
Invoking SNR Again
Now let’s discuss the primary term. First, we clarify that this paper does not aim to retrain a high-resolution model but to reuse the original low-resolution model on high-resolution inputs, so the noise schedule remains the original \bar{\alpha}_t, \bar{\beta}_t. We can therefore assume: \begin{equation} \boldsymbol{x}_t = \bar{\alpha}_t \boldsymbol{x}_0 + \bar{\beta}_t \boldsymbol{\varepsilon} \end{equation} where \boldsymbol{\varepsilon} is a standard normal random vector. As mentioned, the primary term requires downsampling before denoising. Applying the average pooling operator \mathcal{D} gives: \begin{equation} \mathcal{D}[\boldsymbol{x}_t] = \bar{\alpha}_t \mathcal{D}[\boldsymbol{x}_0] + \frac{\bar{\beta}_t}{s} \boldsymbol{\varepsilon} \label{eq:dx} \end{equation} Equality here means equality in distribution: averaging an s \times s block of i.i.d. noise divides its standard deviation by s. In the previous article, we introduced the signal-to-noise ratio SNR(t)=\frac{\bar{\alpha}_t^2}{\bar{\beta}_t^2}. From this, we see that the SNR of \boldsymbol{x}_t is \frac{\bar{\alpha}_t^2}{\bar{\beta}_t^2}, while the SNR of \mathcal{D}[\boldsymbol{x}_t] is \frac{s^2\bar{\alpha}_t^2}{\bar{\beta}_t^2}. However, the denoising model \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) was trained only on low-resolution images with the noise schedule \bar{\alpha}_t, \bar{\beta}_t, so the input SNR it expects at time t is \frac{\bar{\alpha}_t^2}{\bar{\beta}_t^2}. Feeding \mathcal{D}[\boldsymbol{x}_t] to the model at time t is therefore not optimal.
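The \bar{\beta}_t/s factor is easy to verify numerically. A quick sanity check (my own sketch, not code from the paper):

```python
# Average-pooling an s x s block of i.i.d. N(0,1) noise divides its
# standard deviation by s, which is where the bar_beta_t / s factor comes from.
import torch
import torch.nn.functional as F

s = 2
eps = torch.randn(16, 3, 256, 256)            # i.i.d. standard normal noise
pooled = F.avg_pool2d(eps, kernel_size=s)     # the operator D
print(eps.std().item(), pooled.std().item())  # ~1.0 and ~0.5 = 1/s
```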
What should we do? It’s simple: the SNR changes with time t, so we can find another time \tau whose SNR is exactly \frac{s^2\bar{\alpha}_t^2}{\bar{\beta}_t^2}, which means solving the equation: \begin{equation} \frac{\bar{\alpha}_{\tau}^2}{\bar{\beta}_{\tau}^2} = \frac{s^2\bar{\alpha}_t^2}{\bar{\beta}_t^2} \end{equation} Since the SNR decreases monotonically with t and s > 1 raises the SNR, the solution satisfies \tau < t. The model at time \tau is then the right one for an input with SNR \frac{s^2\bar{\alpha}_t^2}{\bar{\beta}_t^2}, so the denoising of \mathcal{D}[\boldsymbol{x}_t] should use the model at time \tau instead of t. Furthermore, \mathcal{D}[\boldsymbol{x}_t] itself can be improved. From equation [eq:dx], we see that when s > 1 the sum of the squares of the two coefficients, \rho_t^2 = \bar{\alpha}_t^2 + \frac{\bar{\beta}_t^2}{s^2}, is no longer 1, whereas during training the sum of squares is always 1. We can therefore divide by \rho_t to bring the input closer to the form seen in training. Finally, the primary denoising term constructed from \mathcal{D}[\boldsymbol{x}_t] is: \begin{equation} \boldsymbol{\epsilon}_{\boldsymbol{\theta}}\left(\frac{\mathcal{D}[\boldsymbol{x}_t]}{\rho_t}, \tau\right) \label{eq:down-denoise} \end{equation}
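On a discrete schedule, both \tau and \rho_t are easy to compute. Below is a minimal sketch (a helper of my own, assuming bar_alpha and bar_beta are arrays holding the schedule values):

```python
import numpy as np

def shifted_time_and_scale(t, s, bar_alpha, bar_beta):
    """Find tau with SNR(tau) = s^2 * SNR(t), plus the rescaling factor rho_t."""
    snr = bar_alpha ** 2 / bar_beta ** 2
    tau = int(np.abs(snr - s ** 2 * snr[t]).argmin())  # closest discrete solution
    rho = float(np.sqrt(bar_alpha[t] ** 2 + bar_beta[t] ** 2 / s ** 2))
    return tau, rho
```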
Decomposition and Approximation
Now we have two denoising models available: one is \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t), which directly uses the low-resolution model as a high-resolution one, and the other is the model derived in the previous section [eq:down-denoise], which downsamples before denoising. Next, we can attempt to assemble them.
Suppose we have a perfect denoising model \boldsymbol{\epsilon}^{\text{high}}(\boldsymbol{x}_t, t) trained on high-resolution images. We can decompose it as: \begin{equation} \boldsymbol{\epsilon}^{\text{high}}(\boldsymbol{x}_t, t) = \underbrace{\color{red}{\mathcal{U}\left[\mathcal{D}\left[\boldsymbol{\epsilon}^{\text{high}}(\boldsymbol{x}_t, t)\right]\right]}}_{\text{Low-res primary term}} + \underbrace{\Big\{\color{green}{\boldsymbol{\epsilon}^{\text{high}}(\boldsymbol{x}_t, t) - \mathcal{U}\left[\mathcal{D}\left[\boldsymbol{\epsilon}^{\text{high}}(\boldsymbol{x}_t, t)\right]\right]}\Big\}}_{\text{High-res detail term}} \end{equation} At first glance, this decomposition is just a trivial identity. However, it has a very intuitive meaning: the first term downsamples and then upsamples the exact denoising result, i.e., shrinks and then enlarges it. This is a lossy transformation, but the result suffices to capture the main outline, making it the primary term. The second term subtracts the main outline from the exact result, so it clearly represents the local details.
Combining our previous discussion, we believe that equation [eq:down-denoise] provided in the previous section is a good approximation of the low-resolution primary term, so we write: \begin{equation} \mathcal{D}\left[\boldsymbol{\epsilon}^{\text{high}}(\boldsymbol{x}_t, t)\right] \approx \frac{1}{s}\boldsymbol{\epsilon}_{\boldsymbol{\theta}}\left(\frac{\mathcal{D}[\boldsymbol{x}_t]}{\rho_t}, \tau\right) \end{equation} Note that the factor 1/s cannot be omitted. This is because denoising models usually predict standard normal noise (i.e., \boldsymbol{\varepsilon}), so their output approximately satisfies zero mean and unit variance. After downsampling \mathcal{D}, the variance becomes 1/s^2. Since the output of \boldsymbol{\epsilon}_{\boldsymbol{\theta}} also has unit variance, we must divide by s to make the variance 1/s^2, improving the approximation.
For the high-resolution detail term, we write: \begin{equation} \boldsymbol{\epsilon}^{\text{high}}(\boldsymbol{x}_t, t) - \mathcal{U}\left[\mathcal{D}\left[\boldsymbol{\epsilon}^{\text{high}}(\boldsymbol{x}_t, t)\right]\right] \approx \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) - \mathcal{U}\left[\mathcal{D}\left[\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)\right]\right] \end{equation} This is also based on the previous discussion—directly using the low-resolution denoising model as a high-resolution one preserves texture details well. Thus, we assume that for high-resolution details, \boldsymbol{\epsilon}_{\boldsymbol{\theta}} is a good approximation of \boldsymbol{\epsilon}^{\text{high}}.
Combining these two approximations, we can write the complete expression: \begin{equation} \boldsymbol{\epsilon}^{\text{high}}(\boldsymbol{x}_t, t) \approx \frac{1}{s}\mathcal{U}\left[\boldsymbol{\epsilon}_{\boldsymbol{\theta}}\left(\frac{\mathcal{D}[\boldsymbol{x}_t]}{\rho_t}, \tau\right) \right] + \Big\{\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) - \mathcal{U}\left[\mathcal{D}\left[\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)\right]\right]\Big\} \triangleq \boldsymbol{\epsilon}_{\boldsymbol{\theta}}^{\text{approx}}(\boldsymbol{x}_t, t) \label{eq:high-key} \end{equation} This is the key approximation for the high-resolution denoising model we were looking for!
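As a sketch, equation [eq:high-key] translates almost line by line into code, reusing the hypothetical D, U, and shifted_time_and_scale helpers from the earlier sketches (eps_model is the low-resolution denoiser; these names are my own, not the paper's reference code):

```python
def eps_approx(x_t, t, s, eps_model, bar_alpha, bar_beta):
    """Training-free approximation of the high-resolution denoiser (eq:high-key)."""
    tau, rho = shifted_time_and_scale(t, s, bar_alpha, bar_beta)
    low = eps_model(D(x_t, s) / rho, tau)  # low-res primary term at shifted time tau
    base = eps_model(x_t, t)               # low-res model applied at full resolution
    return U(low, s) / s + (base - U(D(base, s), s))  # primary + detail terms
```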
In fact, using equation [eq:high-key] directly to generate high-resolution images already yields good results. However, we can introduce an adjustable hyperparameter to make it even better. The idea is to mimic the approach of strengthening conditional generation with an unconditional model (as in "Casual Talk on Generative Diffusion Models (9): Conditioned Generation"). We treat \boldsymbol{\epsilon}_{\boldsymbol{\theta}}^{\text{approx}}(\boldsymbol{x}_t, t) as a conditional denoising model, where the guidance signal is the upsampled low-resolution primary term (hence the title "Upsample Guidance", or UG for short), and \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) as the unconditional denoising model. To strengthen the condition, we introduce an adjustable parameter w > 0 and express the final denoising model as: \begin{equation} \begin{aligned} \tilde{\boldsymbol{\epsilon}}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) &= (1 + w)\, \boldsymbol{\epsilon}_{\boldsymbol{\theta}}^{\text{approx}}(\boldsymbol{x}_t, t) - w\,\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) \\ &= \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) + (1 + w)\mathcal{U}\left[\frac{1}{s}\boldsymbol{\epsilon}_{\boldsymbol{\theta}}\left(\frac{\mathcal{D}[\boldsymbol{x}_t]}{\rho_t}, \tau\right) - \mathcal{D}\left[\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t)\right]\right] \end{aligned} \end{equation} According to the experimental results in the original paper, values around w=0.2 yield better performance.
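In code, the guided denoiser is a thin wrapper around the sketch above (again with my own naming; in practice the two eps_model(x_t, t) calls would be shared rather than recomputed, and the equivalence with the second line of the equation follows from the linearity of \mathcal{U}):

```python
def eps_ug(x_t, t, s, w, eps_model, bar_alpha, bar_beta):
    """Upsample Guidance: extrapolate from the plain model towards eps_approx."""
    base = eps_model(x_t, t)
    approx = eps_approx(x_t, t, s, eps_model, bar_alpha, bar_beta)
    return (1 + w) * approx - w * base  # w ~ 0.2 per the paper's experiments
```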
LDM Extension
Although the previous results do not appear to distinguish between pixel-space diffusion models and Latent Diffusion Models (LDM), theoretically they strictly apply only to pixel-space models. LDMs have a non-linear Encoder: pooling the Encoder features of a large image does not necessarily reproduce the Encoder features of the corresponding small image. Therefore, the assumption that we can construct the primary term of the high-resolution denoising model via downsampling and then upsampling might not hold.
To observe the differences in the LDM scenario, we can look at two experimental results from the original paper. The first is the reconstruction result after up/downsampling the Encoder features and feeding them into the Decoder, as shown below. The results show that whether upsampling or downsampling, performing such operations directly in the feature space leads to image degradation. This means the weight of the primary term constructed by downsampling and then upsampling should perhaps be appropriately reduced.
The second experiment directly uses a low-resolution LDM as a high-resolution model without modification; the generation results after the Decoder can be seen in the "w/o UG" part of the figure below. Unlike pixel-space diffusion models, and likely because the Decoder is robust to perturbed features, directly using \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) as a high-resolution model works much better in the LDM scenario: semantics and clarity are largely preserved, though some "deformities" appear in certain areas.
Based on these two experimental observations, the original paper changes w in the LDM scenario to a function of time t: \begin{equation} w_t = \left\{\begin{aligned} w,\quad t \geq (1-\eta) T \\ -1,\quad t < (1-\eta) T \end{aligned}\right. \end{equation} When w_t = -1, the guided model reduces to the plain \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t), i.e., Upsample Guidance is switched off entirely. Since sampling runs from t = T down to t = 0, this means Upsample Guidance is applied only in the early, high-noise stages of sampling. This uses Upsample Guidance early on to prevent deformities, fully exploits \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) in the later stages to produce clearer and sharper results, and saves computation as well: truly "killing three birds with one stone."
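The schedule itself is trivial to implement (a sketch; eta and w are the hyperparameters and T is the total number of steps):

```python
def w_t(t, w, eta, T):
    """Guidance weight for LDMs: on for the first eta fraction of sampling
    (the high-noise steps t >= (1 - eta) * T), and -1 afterwards, which
    exactly cancels the guidance term and leaves the plain low-res denoiser."""
    return w if t >= (1 - eta) * T else -1.0
```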
Performance Demonstration
Finally, we come to the experimental section. In fact, the "w/ UG" part of the images in the previous section has already demonstrated the effect of Upsample Guidance in the LDM scenario. It can be seen that Upsample Guidance indeed corrects the deformities caused by using \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) directly for high-resolution generation, while ensuring semantic correctness and image clarity.
As for the generation effect in pixel space, refer to the following figure:
Due to the existence of Upsample Guidance, the entire method is somewhat like generating a low-resolution image first and then using super-resolution to generate a high-definition image, except it is done in an unsupervised manner. Therefore, it basically ensures that metrics like FID are no worse than the low-resolution generation results:
Finally, I also tried using the 128 \times 128 CelebA face diffusion model I trained previously, further confirming the effectiveness of Upsample Guidance:
In terms of quality, it is certainly not as good as a directly trained high-resolution model, but it is better than simply enlarging a low-resolution image. In terms of inference cost, compared to an \boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\boldsymbol{x}_t, t) trained on high-resolution images, Upsample Guidance adds one low-resolution forward pass per step, which increases the computational cost by roughly 1/s^2. For LDM, since Upsample Guidance is switched off in the later stages, this ratio is even smaller. Overall, Upsample Guidance comes close to being a "free lunch" for high-resolution generation.
Reflections and Analysis
Having gone through the entire Upsample Guidance framework, what is your impression? Mine is that it is very much in the style of a physicist: imaginative and bold in its assumptions, yet somehow capturing the essence. I can write an interpretation of this kind of work, but I could never have come up with it independently, because at best I have only a rather rigid mathematical mindset.
A natural question about Upsample Guidance is: why exactly does it work? Take the CelebA face generation model I trained in pixel space as an example: it was trained only on 128 \times 128 small images and has never seen a 256 \times 256 large image. Why can it nevertheless generate plausible 256 \times 256 images? Note that this is different from ImageNet. ImageNet is a multi-scale dataset: a 128 \times 128 image could be a fish, or a person holding a fish. That is, although the inputs are all 128 \times 128, the model has seen fish at different scales, allowing it to adapt better to different resolutions. CelebA is different: it is a single-scale dataset in which all faces are aligned in size, position, and orientation. Even so, Upsample Guidance still generalizes it successfully.
I believe this is somewhat related to DIP (Deep Image Prior). The general idea of DIP is that the CNN architectures commonly used in CV have been so heavily selected that they fit vision itself very well; even models not trained on real data can therefore complete some vision tasks, such as denoising, inpainting, and even simple super-resolution. Upsample Guidance allows a diffusion model that has never seen large images to generate large images that basically match our expectations, which likewise seems to benefit from the architectural prior of the CNN itself. Simply put, as the experiment in the first section of this article showed, Upsample Guidance relies on the fact that directly running a low-resolution model at high resolution retains at least some valid texture details, and this is not a trivial property.
To verify this, I tried a pure Transformer diffusion model (somewhat similar to DiT + RoPE-2D) I had trained previously and found that it could not reproduce the effect of Upsample Guidance at all. This indicates that the method depends, at least in part, on the CNN-based U-Net architecture. However, readers using Transformers need not be discouraged: while Transformers cannot take the Upsample Guidance route, they can take the length-extrapolation route from NLP. The paper "FiT: Flexible Vision Transformer for Diffusion Model" shows that by training a diffusion model with Transformer + RoPE-2D, one can reuse length extrapolation techniques like NTK and YaRN to achieve high-resolution generation with no training or minimal fine-tuning.
Summary
This article introduced a technique called Upsample Guidance, which allows a trained low-resolution diffusion model to directly generate high-resolution images without additional fine-tuning costs. Experiments show that it can basically double the resolution stably. Although the effect is still somewhat behind directly trained high-resolution diffusion models, this nearly free lunch is still worth learning. This article reorganized the ideas and derivations of the method from my perspective and provided reflections on the reasons for its effectiveness.
(Postscript: In fact, according to the original plan, this article was to be published two days ago. The reason for the two-day delay is that during the writing process, I found that many details I thought I understood were actually vague. So I spent two extra days on derivation and experimentation to obtain a more precise understanding. This shows that systematically and clearly restating what one has learned is itself a process of continuous self-improvement and refinement. This is probably the meaning of persisting in writing.)
Original URL: https://kexue.fm/archives/10055