English (unofficial) translations of posts at kexue.fm

Idempotent Generative Network (IGN): A GAN Attempting to Unify Discrimination and Generation

Translated by Gemini Flash 3.0 Preview. Translations may be inaccurate; please refer to the original post for anything important.

Some time ago, a generative model named "Idempotent Generative Network (IGN)" attracted a certain amount of attention. It claims to be a new type of generative model, independent of the existing VAE, GAN, flow, and diffusion families, featuring single-step sampling. Perhaps because everyone has long suffered under the multi-step sampling of current mainstream diffusion models, even the slightest hint of single-step sampling easily attracts attention. Furthermore, the word "idempotent" in the name IGN adds a sense of mystery, heightening expectations further, and it successfully piqued my interest. However, I was busy with other things at the time and did not get around to reading the model details carefully.

Recently, having some free time, I remembered that I still hadn't read IGN, so I pulled up the paper. After reading it, though, I was quite puzzled: how is this a new model? Isn't it just a variant of a GAN? The difference from a conventional GAN is that it merges the generator and the discriminator into one. Does this "merging" bring any special benefit, such as more stable training? Personally, I don't think it does. Below, I share my process of understanding IGN from a GAN perspective, along with the questions it left me with.

Generative Adversarial Networks

Regarding GANs (Generative Adversarial Networks), I studied them systematically for a period a few years ago (you can find related articles under the GAN tag), but I haven’t followed them continuously in recent years. Therefore, I will first provide a brief review of GANs to facilitate the comparison between GANs and IGN in later sections.

A GAN has two basic components: a Discriminator and a Generator, which can be vividly compared to a "critic" and a "forger." The discriminator is responsible for distinguishing real samples from the fake samples produced by the generator, while the generator is responsible for mapping simple random noise to target samples, using signals from the discriminator to improve its generation quality. Through this continual "offensive and defensive confrontation," the generator improves until the discriminator can no longer distinguish real samples from fake ones, at which point the generated samples look realistic.

Taking WGAN as an example, the training objective of the discriminator D_{\theta} is to widen the score gap between real and fake samples: \begin{equation} \max_{\theta} D_{\theta}(G_{\varphi}(z)) - D_{\theta}(x)\label{eq:d-loss} \end{equation} where x is a real sample from the training set, z is random noise, G_{\varphi} is the generator, and G_{\varphi}(z) is the fake sample produced by the generator (so under this sign convention, a higher score means the discriminator considers a sample more likely to be fake). The generator's training objective is to narrow the score gap between real and fake samples, i.e., to minimize the above expression. However, for the generator, x (which contains no parameters) is equivalent to a constant, so the objective simplifies to: \begin{equation} \min_{\varphi} D_{\theta}(G_{\varphi}(z))\label{eq:g-loss} \end{equation} There are also details concerning Lipschitz constraints, which I won't expand on here. Interested readers can refer to articles such as "The Art of Mutual Confrontation: From Zero to WGAN-GP" and "From Wasserstein Distance and Duality Theory to WGAN".
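To make the alternating procedure concrete, here is a minimal PyTorch sketch of one training step under the sign convention above. The networks D and G, the two optimizers, and the noise dimension z_dim are assumed to be defined elsewhere, and the Lipschitz constraint is omitted.

```python
import torch

# A minimal sketch of one alternating WGAN training step, matching the
# two objectives above (Lipschitz constraint omitted).
def wgan_step(D, G, opt_D, opt_G, x, z_dim):
    # Discriminator step: minimize D(x) - D(G(z)), i.e. widen the gap
    # D(G(z)) - D(x); the fake batch is detached so no gradient reaches G.
    z = torch.randn(x.size(0), z_dim)
    d_loss = D(x).mean() - D(G(z).detach()).mean()
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # Generator step: minimize D(G(z)), pulling the fake scores back down.
    z = torch.randn(x.size(0), z_dim)
    g_loss = D(G(z)).mean()
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()
```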

Generally, GANs are trained by alternating between two losses. Sometimes they can be written as a single loss optimized in two directions: some parameters perform gradient descent while others perform gradient ascent. This optimization in opposite directions (i.e., \min-\max) is usually unstable and prone to collapse. Even when training does succeed, there may be a "mode collapse" problem, where the generated results are repetitive and lack diversity.

Single Loss

Some readers might object: "You said GANs are optimized with two alternating losses, but IGN clearly uses a single loss. How can you say IGN is a special case of a GAN?"

In fact, the way IGN writes its single loss is a bit "disingenuous." Following its logic, any GAN can also be written in a single loss format. How? It’s simple. Assume \theta', \varphi' are weight copies of \theta, \varphi, meaning \theta' \equiv \theta, \varphi' \equiv \varphi, but they do not compute gradients. Then equations [eq:d-loss] and [eq:g-loss] can be combined as: \begin{equation} \min_{\theta,\varphi} D_{\theta}(x) - D_{\theta}(G_{\varphi'}(z)) + D_{\theta'}(G_{\varphi}(z))\label{eq:pure-one-loss} \end{equation} At this point, the gradients with respect to \theta and \varphi are the same as when the two losses are separate, making it an equivalent implementation. But why is this "disingenuous"? Because it involves no actual technique; it purely replaces the original \min-\max with a different notation. If implemented literally, one would need to constantly clone D_{\theta'}, G_{\varphi'} and stop their gradients, which is very inefficient for training.
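To see why, here is what a literal PyTorch implementation of equation [eq:pure-one-loss] might look like (a sketch; D, G, x, z are hypothetical stand-ins). The per-step deep copies are precisely the inefficiency just described.

```python
import copy
import torch

# Literal implementation of the combined loss: clone both networks each
# step and freeze the clones, so they play the roles of D_{theta'} and
# G_{phi'}. The gradients match the two separate losses exactly, but the
# per-step copies waste time and memory.
def naive_single_loss(D, G, x, z):
    D_frozen = copy.deepcopy(D).requires_grad_(False)  # D_{theta'}
    G_frozen = copy.deepcopy(G).requires_grad_(False)  # G_{phi'}
    return (D(x).mean()
            - D(G_frozen(z)).mean()    # only theta receives a gradient here
            + D_frozen(G(z)).mean())   # only phi receives a gradient here
```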

In fact, to write a GAN in a single loss training format while maintaining practicality, one can refer to my previous post "Cleverly Stopping Gradients: Implementing GAN Models with a Single Loss". By using the stop_gradient operator provided by frameworks and some gradient calculation tricks, this goal can be achieved. Specifically, stop_gradient forces the gradient of a certain part of the model to be zero. For example: \begin{equation} \nabla_{\theta,\varphi} D_{\theta}(G_{\varphi}(z)) = \left(\frac{\partial D_{\theta}(G_{\varphi}(z))}{\partial\theta},\frac{\partial D_{\theta}(G_{\varphi}(z))}{\partial\varphi}\right)\label{eq:full-grad} \end{equation} After adding the stop_gradient operator (abbreviated as \textcolor{cyan}{\text{sg}}), we have: \begin{equation} \nabla_{\theta,\varphi} D_{\theta}(\textcolor{cyan}{\text{sg}(}G_{\varphi}(z)\textcolor{cyan}{)}) = \left(\frac{\partial D_{\theta}(G_{\varphi}(z))}{\partial\theta}, 0\right)\label{eq:stop-grad} \end{equation} So, through the stop_gradient operator, we can easily mask the inner gradient of a nested function (i.e., the gradient of \varphi). What if we need to mask the outer gradient of a nested function (i.e., the gradient of \theta), as required by the generator? There is no direct way, but we can use a trick: subtract equation [eq:stop-grad] from equation [eq:full-grad] to get: \begin{equation} \nabla_{\theta,\varphi} D_{\theta}(G_{\varphi}(z)) - \nabla_{\theta,\varphi} D_{\theta}(\textcolor{cyan}{\text{sg}(}G_{\varphi}(z)\textcolor{cyan}{)}) = \left(0,\frac{\partial D_{\theta}(G_{\varphi}(z))}{\partial\varphi}\right) \end{equation} This achieves the masking of the outer gradient. Combining the two, we get a way to train a GAN with a single loss: \begin{equation} \begin{gathered} \min_{\theta,\varphi} \underbrace{D_{\theta}(x) - D_{\theta}(\textcolor{cyan}{\text{sg}(}G_{\varphi}(z)\textcolor{cyan}{)})}_{\text{Gradient of } \varphi \text{ removed}} + \underbrace{D_{\theta}(G_{\varphi}(z)) - D_{\theta}(\textcolor{cyan}{\text{sg}(}G_{\varphi}(z)\textcolor{cyan}{)})}_{\text{Gradient of } \theta \text{ removed}} \\[8pt] = \min_{\theta,\varphi} D_{\theta}(x) - 2 D_{\theta}(\textcolor{cyan}{\text{sg}(}G_{\varphi}(z)\textcolor{cyan}{)}) + D_{\theta}(G_{\varphi}(z))\end{gathered} \end{equation} This way, there is no need to repeatedly clone the model, and gradient equivalence is achieved within a single loss.
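In PyTorch, the stop_gradient operator is spelled detach(), so the combined loss above takes only a couple of lines. A minimal sketch, again assuming D, G, x, z are defined elsewhere:

```python
import torch

# Single-loss GAN training via stop_gradient:
#   min_{theta,phi}  D(x) - 2*D(sg(G(z))) + D(G(z))
# One backward pass yields the discriminator gradient (through the first
# two terms) and the generator gradient (through the last two), with no
# model cloning.
def gan_single_loss(D, G, x, z):
    fake = G(z)
    return D(x).mean() - 2 * D(fake.detach()).mean() + D(fake).mean()

# Usage sketch: a single optimizer over both parameter sets suffices.
# opt = torch.optim.Adam(list(D.parameters()) + list(G.parameters()))
# loss = gan_single_loss(D, G, x, z)
# opt.zero_grad(); loss.backward(); opt.step()
```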

Idempotent Generation

After all that, we can finally invite the protagonist—the Idempotent Generative Network (IGN)—to the stage. But before its formal debut, please wait a moment while we discuss the motivation behind IGN.

A notable feature of GANs is that once training is successful, usually only the generator is kept, while the discriminator is mostly "discarded." However, in a reasonable GAN, the generator and discriminator usually have a similar number of parameters. Discarding the discriminator means half of the parameters are wasted, which is quite a pity. To address this, some works have tried adding an encoder to the GAN and sharing some parameters between the discriminator and the encoder to improve parameter utilization. Among them, the most minimalist work is the O-GAN I proposed, which only slightly modifies the discriminator structure and adds an extra loss to turn the discriminator into an encoder without increasing parameters or computation. It is a work I am quite satisfied with.

As the title suggests, IGN is a GAN that attempts to unify the discriminator and the generator: the generator "is both the player and the referee." From this perspective, IGN can also be seen as a way to improve parameter utilization. First, IGN assumes that z and x have the same size, so the input and output of the generator G_{\varphi} are the same size; this differs from typical GANs, where the dimension of z is usually smaller than that of x. With identical input and output sizes, an image itself can be fed back into the generator for further computation. IGN therefore builds its discriminator out of a reconstruction loss: \begin{equation} \delta_{\varphi}(x) = \Vert G_{\varphi}(x) - x\Vert^2\label{eq:ign-d} \end{equation} The notation \delta_{\varphi} follows the original IGN paper and carries no special meaning. This design completely reuses the generator's parameters and adds none of its own, which seems very elegant. Now, substituting this discriminator into equation [eq:pure-one-loss] gives: \begin{equation} \min_{\varphi}\underbrace{\delta_{\varphi}(x) - \delta_{\varphi}(G_{\varphi'}(z))}_{\text{Discriminator Loss}} + \underbrace{\delta_{\varphi'}(G_{\varphi}(z))}_{\text{Generator Loss}} \end{equation} Isn't this exactly the final optimization objective in the original IGN paper? The original paper does add two adjustable coefficients, but every term's coefficient in equation [eq:pure-one-loss] is adjustable too, so that is nothing special. Clearly, IGN can be derived entirely from a GAN perspective; it is a special case of a GAN, even though the authors say they did not arrive at IGN from a GAN viewpoint.
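For concreteness, here is a PyTorch sketch of this objective (the paper's tunable coefficients are dropped; G, x, z are assumed given, and the frozen copy stands in for G_{\varphi'}):

```python
import copy
import torch.nn.functional as F

# delta_{phi}(x) = ||G_{phi}(x) - x||^2, the reconstruction-error
# "discriminator". F.mse_loss averages rather than sums, which only
# changes the loss by a constant factor.
def delta(G, x):
    return F.mse_loss(G(x), x)

# IGN's combined objective. G_frozen plays G_{phi'}: a gradient-stopped
# weight copy refreshed at every step.
def ign_loss(G, x, z):
    G_frozen = copy.deepcopy(G).requires_grad_(False)
    fake = G_frozen(z)  # G_{phi'}(z): carries no gradient to phi
    gen = G(z)          # G_{phi}(z): the generator half flows through here
    return (delta(G, x)              # drive reconstruction error of real x down
            - delta(G, fake)         # push error of frozen fakes up (discriminator half)
            + delta(G_frozen, gen))  # make G's samples fixed points of G_{phi'} (generator half)
```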

The term "idempotent" comes from the authors' observation that when IGN is trained to optimality, the discriminator's score for real samples is zero, meaning G_{\varphi}(x) = x. From this, one can further derive: \begin{equation} G_{\varphi}(\cdots G_{\varphi}(x)) = \cdots = G_{\varphi}(G_{\varphi}(x)) = G_{\varphi}(x) = x \end{equation} In other words, applying G_{\varphi} to a real sample x any number of times leaves the result unchanged, which is exactly the mathematical meaning of "idempotent." In practice, however, we cannot guarantee that the GAN discriminator's loss on real samples reaches zero, so true idempotency is hard to achieve; the experimental results in the original paper bear this out.
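A quick way to probe this empirically (a hypothetical snippet, assuming a trained G and a real batch x): apply G repeatedly and track how far the iterates drift from x. Exact idempotency would keep every value at zero.

```python
import torch

# Measure ||G^k(x) - x|| for k = 1..steps. Under exact idempotency every
# value would be zero; in practice the residual reconstruction error
# makes the iterates drift.
@torch.no_grad()
def idempotency_drift(G, x, steps=4):
    y = x
    for k in range(1, steps + 1):
        y = G(y)
        print(f"step {k}: ||G^{k}(x) - x|| = {(y - x).norm().item():.4f}")
```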

Personal Analysis

A very important question to consider is: why can the reconstruction loss [eq:ign-d] successfully serve as a discriminator? Or, given the many expressions one could construct from G_{\varphi}(x) and x, would an arbitrary one of them serve as a discriminator?

Looking solely at "reconstruction loss as a discriminator," IGN is very similar to EBGAN. But EBGAN's success does not explain IGN's, because EBGAN's generator is independent of its discriminator and is not constrained to share all parameters with it. EBGAN's success is therefore "within reason" and fits the original design of GANs. IGN is different: its discriminator and generator share parameters completely. Since GAN training is already quite unstable, the two roles can easily drag each other down, causing both to fail during training.

In my view, the reason IGN has a chance not to fail is that it happens to satisfy a kind of "self-consistency." The fundamental goal of a GAN is that, for input noise z, G_{\varphi}(z) should output a realistic image. With IGN's "reconstruction loss as discriminator" design, even if the optimal discriminator loss is not exactly zero, it should be close, so G_{\varphi}(x) \approx x holds approximately. This simultaneously satisfies the condition that "for an input image x, G_{\varphi}(x) should output a realistic image." In other words, regardless of the input, the output space consists of realistic samples. This self-consistency is crucial; without it, the generator could "fall apart," being pulled toward two different generation targets at once.

That being said, what substantial improvement does IGN offer over a general GAN? Please forgive my dullness, but I truly cannot see IGN's advantage. Take parameter utilization, for example: IGN's parameter sharing seems to improve utilization, but in fact, to make the generator G_{\varphi}'s input and output the same size, IGN uses an autoencoder structure, whose parameter count and computation equal the sum of a discriminator and a generator in a typical GAN! In other words, IGN does not reduce the number of parameters; instead it increases total computation, because it enlarges the generator.

I also ran a simple experiment with IGN and found that its training suffers from the same instability issues, and is arguably even more unstable, because hard constraints like "parameter sharing + Euclidean distance" tend to amplify the instability, leading to "suffering together" rather than "prospering together." Furthermore, because its input and output have identical sizes, IGN forfeits the advantage that ordinary GAN generators have of mapping a low-dimensional latent space into high-dimensional data. IGN is also prone to mode collapse, and because of the Euclidean distance, the generated images tend to be blurry, much like those of a VAE.

Summary

This article introduced the Idempotent Generative Network (IGN), which recently attracted some attention, from a GAN perspective, comparing the connections and differences between IGN and GANs and sharing my own analysis of IGN.

When reposting, please include the original article address: https://kexue.fm/archives/9969

For more detailed reposting matters, please refer to: "Scientific Space FAQ"