For researchers who adhere to the discretization route, VQ (Vector Quantization) is a key part of visual understanding and generation, serving as the "Tokenizer" in vision. It was proposed in the 2017 paper "Neural Discrete Representation Learning", and I also introduced it in my 2019 blog post "A Concise Introduction to VQ-VAE: Quantized Autoencoders".
However, after all these years, the training recipe for VQ has remained almost unchanged: STE (Straight-Through Estimator) plus an additional Aux Loss. There is nothing wrong with STE itself; it is arguably the standard way to design gradients for discretization operations. But the Aux Loss always feels not quite "end-to-end" enough, and it also introduces extra hyperparameters to tune.
Fortunately, this situation might be coming to an end. Last week’s paper "DiVeQ: Differentiable Vector Quantization Using the Reparameterization Trick" proposed a new STE trick. Its biggest highlight is that it does not require an Aux Loss, which makes it particularly concise and beautiful!
Discrete Encoding
As usual, let’s first review the existing training schemes for VQ. First, it should be pointed out that VQ (Vector Quantization) itself is actually a very old concept, dating back to the 1980s. Its original meaning was to cluster vectors and replace them with corresponding cluster centers, thereby achieving data compression.
But the VQ we are talking about here mainly refers to the VQ in VQ-VAE proposed in the paper "Neural Discrete Representation Learning". Of course, the definition of VQ itself hasn’t changed—it’s the mapping from a vector to a cluster center. The core of the VQ-VAE paper was to provide an end-to-end training scheme for performing VQ on latent variables and then decoding for reconstruction. The difficulty is that the VQ step is a discretization operation with no existing gradient, so a gradient must be designed for it.
In formulas, a standard AE (AutoEncoder) is: \begin{equation} z = \text{encoder}(x),\quad \hat{x}=\text{decoder}(z),\quad \mathcal{L}=\Vert x - \hat{x}\Vert^2 \end{equation} where x is the original input, z is the encoded vector, and \hat{x} is the reconstruction. What VQ-VAE does, following the idea of VQ, is to replace z with one of the entries in the codebook E=\{e_1,e_2,\cdots,e_K\}: \begin{equation} q = \mathop{\mathrm{argmin}}_{e\in E} \Vert z - e\Vert \end{equation} The decoder then takes q as input for reconstruction. Since q corresponds one-to-one with a codebook index, q is effectively an integer encoding of x. Of course, to preserve reconstruction quality, in practice the input is encoded not into a single vector but into multiple vectors, which after VQ become multiple integers. In other words, VQ-VAE encodes the input into an integer sequence, much like a text Tokenizer.
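To make the VQ step concrete, here is a minimal PyTorch sketch of the nearest-neighbour lookup (the shapes and variable names are my own illustration, not from any of the papers):

```python
import torch

# Hypothetical shapes: z is a batch of encoder vectors, codebook holds K entries
K, d = 512, 64
codebook = torch.randn(K, d)          # E = {e_1, ..., e_K}
z = torch.randn(32, d)                # encoder output, one vector per position

# q = argmin_{e in E} ||z - e||, computed via pairwise distances
dists = torch.cdist(z, codebook)      # (32, K)
indices = dists.argmin(dim=-1)        # integer codes, the "tokens"
q = codebook[indices]                 # quantized vectors passed to the decoder
```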
Gradient Design
Now the modules we need to train include the encoder, decoder, and the codebook E. Since the VQ operation involves an \mathop{\mathrm{argmin}} operation, the gradient breaks at q and cannot be passed back to the encoder.
VQ-VAE uses a trick called STE: the input to the decoder is still the q after VQ, but during backpropagation the gradient is computed as if the input were the z before VQ. This way, the gradient can be passed back to the encoder. It can be implemented with the stop_gradient operator (\operatorname{sg}): \begin{equation} z = \text{encoder}(x),\quad q = \mathop{\mathrm{argmin}}_{e\in E} \Vert z - e\Vert,\quad z_q = z + \operatorname{sg}[q - z],\quad \hat{x} = \text{decoder}(z_q) \end{equation} Simply put, STE achieves z_q=q in the forward pass but \nabla z_q = \nabla z in the backward pass. This gives the encoder a gradient, but q itself still has no gradient, so the codebook cannot be optimized. To solve this, VQ-VAE adds two Aux Loss terms: \begin{equation} \mathcal{L} = \Vert x - \hat{x}\Vert^2 + \beta\Vert q - \operatorname{sg}[z]\Vert^2 + \gamma\Vert z - \operatorname{sg}[q]\Vert^2 \end{equation} The two terms pull q toward z and z toward q, respectively, which is consistent with the original idea of VQ. The combination of STE and these two Aux Loss terms constitutes the standard VQ-VAE. There is also a simple variant that sets \beta=0 and instead updates the codebook with an exponential moving average of z, which is essentially equivalent to minimizing the Aux Loss on q with a plain SGD-style update.
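For reference, a minimal sketch of this standard "STE + Aux Loss" quantizer (a simplified illustration; the weights beta and gamma and the function name are placeholders, not the original implementation):

```python
import torch

def vqvae_quantize(z, codebook, beta=1.0, gamma=0.25):
    """Standard VQ-VAE: STE for the encoder, plus the two Aux Loss terms."""
    indices = torch.cdist(z, codebook).argmin(dim=-1)
    q = codebook[indices]
    # Straight-through estimator: forward value is q, backward gradient flows into z
    z_q = z + (q - z).detach()
    # beta * ||q - sg[z]||^2 pulls the codebook toward the encoder output;
    # gamma * ||z - sg[q]||^2 pulls the encoder output toward the codebook
    aux_loss = beta * ((q - z.detach()) ** 2).sum(dim=-1).mean() \
             + gamma * ((z - q.detach()) ** 2).sum(dim=-1).mean()
    return z_q, indices, aux_loss
```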
It is worth pointing out here that although VQ-VAE was named "VAE" by the original paper, it is actually an AE, so "VQ-AE" would be more appropriate in principle. However, the name has already become popular, so we have to continue using it. The later VQGAN added GAN Loss and other tricks on top of VQ-VAE to improve reconstruction clarity.
Alternative Works
For me, these two extra Aux Loss terms have always been a bit of an eyesore, and I suspect many practitioners feel the same way, since improvements in this direction appear from time to time.
Among them, the most "drastic" approach is to switch to a discretization scheme other than VQ, such as FSQ introduced in "Embarrassingly Simple FSQ: ’Rounding’ Surpasses VQ-VAE", which does not require Aux Loss. If VQ is about clustering high-dimensional vectors, then FSQ is about "rounding" low-dimensional vectors to achieve discretization. However, as I evaluated in that article, FSQ cannot replace VQ in all scenarios, so improving VQ itself remains valuable.
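For intuition only, here is a rough sketch of the FSQ-style "rounding" as I understand it (the tanh bounding and the `levels` value are illustrative assumptions, not the exact FSQ recipe):

```python
import torch

def fsq(z, levels=8):
    """'Rounding' quantization: each low-dimensional coordinate snaps to one of `levels` grid points."""
    half = (levels - 1) / 2
    bounded = half * torch.tanh(z)        # squash each dimension into (-half, half)
    rounded = torch.round(bounded)        # discretize; each dimension becomes a small integer
    # STE for the rounding step: forward uses rounded, gradient uses bounded
    return bounded + (rounded - bounded).detach()
```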
Before proposing DiVeQ, the original author also proposed a scheme called "NSVQ", which took a small step toward "abolishing" the Aux Loss. It changes z_q to: \begin{equation} z_q = z + \Vert q - z\Vert \times \frac{\varepsilon}{\Vert \varepsilon\Vert},\qquad \varepsilon\sim\mathcal{N}(0, I)\label{eq:nsvq} \end{equation} Here \varepsilon is a vector of the same size as z and q, with components drawn from a standard normal distribution. With this new z_q, since \Vert q - z\Vert is differentiable, q also receives a gradient, so in principle the codebook can be trained without an Aux Loss. The geometric meaning of NSVQ is very intuitive: it is uniform sampling on the sphere centered at z with radius \Vert q-z\Vert. The drawback is that what is fed to the decoder is no longer q, whereas at inference time we care about reconstruction from q, so NSVQ has a train-inference inconsistency.
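A minimal sketch of the NSVQ substitution in Eq. [eq:nsvq] (function and variable names are my own):

```python
import torch

def nsvq(z, q):
    """NSVQ: z_q lies on the sphere of radius ||q - z|| around z, in a random direction."""
    eps = torch.randn_like(z)
    direction = eps / eps.norm(dim=-1, keepdim=True)     # uniform on the unit sphere
    radius = (q - z).norm(dim=-1, keepdim=True)          # differentiable in both z and q
    return z + radius * direction
```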
The Main Character Appears
Starting from NSVQ, if we want to keep the forward pass as q while retaining the gradient brought by \Vert q - z\Vert, then it is easy to propose an improved version: \begin{equation} z_q = z + \Vert q - z\Vert \times \operatorname{sg}\left[\frac{q - z}{\Vert q - z\Vert}\right]\label{eq:diveq0} \end{equation} In the forward pass, it strictly has z_q = q, but in the backward pass, it retains the gradients of z and \Vert q - z\Vert. This is the "DiVeQ-detach" in the appendix of the DiVeQ paper. The DiVeQ in the main text can be seen as a kind of interpolation between Eq. [eq:diveq0] and Eq. [eq:nsvq]: \begin{equation} z_q = z + \Vert q - z\Vert \times \operatorname{sg}\left[\frac{q - z + \varepsilon}{\Vert q - z + \varepsilon\Vert}\right],\qquad \varepsilon\sim\mathcal{N}(0, \sigma^2 I)\label{eq:diveq} \end{equation} Obviously, when \sigma=0, the result is "DiVeQ-detach", and when \sigma\to\infty, the result is "NSVQ". The original paper’s appendix searched for \sigma and concluded that \sigma^2 = 10^{-3} is a generally superior choice.
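Both variants amount to a one-line change on top of the usual nearest-neighbour lookup. Below is a minimal sketch of my reading of Eq. [eq:diveq0] and Eq. [eq:diveq] (names and defaults are placeholders, not the authors' code):

```python
import torch

def diveq(z, q, sigma=0.0):
    """sigma=0 gives DiVeQ-detach; sigma > 0 gives DiVeQ with reparameterization noise."""
    diff = q - z
    radius = diff.norm(dim=-1, keepdim=True)     # differentiable: trains both the encoder and the codebook
    noisy = diff + sigma * torch.randn_like(diff)
    direction = (noisy / noisy.norm(dim=-1, keepdim=True)).detach()   # the sg[...] part
    return z + radius * direction
```

With sigma=0 the forward output is exactly q (DiVeQ-detach); setting sigma**2 = 1e-3 would correspond to the default suggested in the paper's appendix.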
The experimental results in the paper show that although Eq. [eq:diveq] introduces randomness and thus a certain degree of train-inference inconsistency, it performs better than Eq. [eq:diveq0]. However, to my taste, performance should not come at the cost of elegance, so "DiVeQ-detach" in Eq. [eq:diveq0] is the ideal scheme in my mind. In the analysis below, DiVeQ refers to "DiVeQ-detach".
Theoretical Analysis
Unfortunately, the original paper does not provide much theoretical analysis, so in this section I attempt a basic analysis of why DiVeQ works and how it relates to the original VQ training scheme. First, consider the general form of Eq. [eq:diveq0]: \begin{equation} z_q = z + r(q, z) \times \operatorname{sg}\left[\frac{q - z}{r(q, z)}\right] \end{equation} where r(q,z) is any differentiable scalar function of q and z, which can be regarded as a generalized distance between q and z (note that the forward value is still z_q = q for any nonzero r). Denoting the loss function as \mathcal{L}(z_q), its differential is: \begin{equation} d\mathcal{L} = \langle\nabla_{z_q} \mathcal{L},d z_q\rangle = \left\langle\nabla_{z_q} \mathcal{L},dz + dr \times\frac{q-z}{r}\right\rangle = \langle\nabla_{z_q} \mathcal{L},d z\rangle + \langle\nabla_{z_q} \mathcal{L}, q-z\rangle d(\ln r) \end{equation} The term \langle\nabla_{z_q} \mathcal{L},d z\rangle is exactly what the original STE already provides. DiVeQ adds an extra \langle\nabla_{z_q} \mathcal{L}, q-z\rangle d(\ln r), which is equivalent to introducing an Aux Loss \operatorname{sg}[\langle\nabla_{z_q} \mathcal{L}, q-z\rangle] \ln r. If r represents some distance between q and z, this loss pulls q and z closer together, which matches the Aux Loss introduced by VQ. This gives a theoretical account of why DiVeQ works.
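This equivalence is easy to verify numerically. The sketch below uses a toy quadratic loss and random vectors (purely illustrative) to compare the gradient that DiVeQ-detach assigns to the codebook entry q with the gradient of the implied Aux Loss, for the case r(q,z)=\Vert q-z\Vert:

```python
import torch

torch.manual_seed(0)
d = 8
z = torch.randn(d)                        # stand-in for an encoder output (kept fixed)
q = torch.randn(d, requires_grad=True)    # stand-in for the selected codebook entry
target = torch.randn(d)                   # toy target replacing the decoder + data

# DiVeQ-detach with r(q, z) = ||q - z||
r = (q - z).norm()
z_q = z + r * ((q - z) / r).detach()
loss = ((z_q - target) ** 2).sum()        # toy loss standing in for L(z_q)
loss.backward()
grad_via_diveq = q.grad.clone()

# Gradient of the implied Aux Loss: sg[<grad_{z_q} L, q - z>] * ln ||q - z||
with torch.no_grad():
    grad_zq = 2 * (z_q - target)          # grad_{z_q} L for this toy loss
    coeff = torch.dot(grad_zq, q - z)     # sg[<grad_{z_q} L, q - z>]
q.grad = None
aux = coeff * torch.log((q - z).norm())
aux.backward()
print(torch.allclose(grad_via_diveq, q.grad, atol=1e-5))   # expected: True
```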
But don’t celebrate too early: the premise for this explanation to hold is that the coefficient \langle\nabla_{z_q} \mathcal{L}, q-z\rangle > 0; otherwise the Aux Loss would be pushing q and z apart. To argue this, consider the first-order approximation of the loss function \mathcal{L}(z) around z_q: \begin{equation} \mathcal{L}(z) \approx \mathcal{L}(z_q) + \langle\nabla_{z_q} \mathcal{L}, z - z_q\rangle = \mathcal{L}(z_q) + \langle\nabla_{z_q} \mathcal{L}, z - q\rangle \end{equation} That is, \langle\nabla_{z_q} \mathcal{L}, q-z\rangle \approx \mathcal{L}(z_q) - \mathcal{L}(z). Note that z and z_q are the features before and after VQ, respectively. VQ loses information, so using z for the target task (such as reconstruction) should be easier than using z_q. Therefore, once training starts to converge, we can expect the loss evaluated at z to be lower, i.e., \mathcal{L}(z_q) - \mathcal{L}(z) > 0. This argues that \langle\nabla_{z_q} \mathcal{L}, q-z\rangle > 0 is likely to hold.
Improvement Directions
Strictly speaking, \langle\nabla_{z_q} \mathcal{L}, q-z\rangle > 0 is only a necessary condition for DiVeQ to be effective; to fully establish effectiveness one would also need to show that this coefficient is "just right". Because r(q,z) is arbitrary, we can only analyze specific choices. Taking r(q,z)=\Vert q-z\Vert^{\alpha} is equivalent to introducing the following Aux Loss: \begin{equation} \operatorname{sg}[\langle\nabla_{z_q} \mathcal{L}, q-z\rangle] \ln \Vert q-z\Vert^{\alpha} \approx \operatorname{sg}[\mathcal{L}(z_q) - \mathcal{L}(z)] \times \alpha \ln \Vert q-z\Vert \end{equation} The coefficient \mathcal{L}(z_q) - \mathcal{L}(z) has the same scale as the main loss \mathcal{L}(z_q), so it adapts naturally to the scale of the main loss and adjusts the Aux Loss weight according to the performance gap before and after VQ. As for which \alpha is best, that has to be determined experimentally; in my own trials, \alpha=1 generally performs well. Interested readers can also try tuning \alpha or even other choices of r(q, z).
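Plugging r(q,z)=\Vert q-z\Vert^{\alpha} into the general form gives a one-parameter family that is easy to experiment with; a minimal sketch (the function name and default are my own):

```python
import torch

def diveq_detach_general(z, q, alpha=1.0):
    """DiVeQ-detach with r(q, z) = ||q - z||^alpha; the forward output is still exactly q."""
    diff = q - z
    r = diff.norm(dim=-1, keepdim=True) ** alpha    # differentiable "distance" factor
    return z + r * (diff / r).detach()              # == q in the forward pass for any nonzero r
```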
It should be noted that DiVeQ only provides a new VQ training scheme without an Aux Loss; in principle it does not solve VQ's other problems, such as low codebook utilization or codebook collapse. Enhancement tricks that worked in the "STE + Aux Loss" setting can likely be layered on top of DiVeQ as well. The original paper combines DiVeQ with SFVQ to propose SF-DiVeQ, which alleviates codebook collapse and related issues.
However, I find SFVQ somewhat cumbersome, so I won't introduce it in detail here; the author presumably chose SFVQ mainly because it was his own earlier work. I prefer the linear-transformation trick introduced in "Another Trick for VQ: Adding a Linear Transformation to the Codebook", which adds a linear transformation after the codebook. Experiments show that it also significantly enhances DiVeQ.
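As a rough illustration of how that trick might compose with DiVeQ (this is only one plausible reading of "adding a linear transformation after the codebook"; the learnable matrix, the search against transformed entries, and all names are my own assumptions, not code from either article):

```python
import torch

K, d = 512, 64
codebook = torch.nn.Parameter(torch.randn(K, d))
transform = torch.nn.Linear(d, d, bias=False)   # learnable linear map applied after the codebook

def quantize_with_transform(z):
    entries = transform(codebook)                       # transformed codebook entries, (K, d)
    indices = torch.cdist(z, entries).argmin(dim=-1)    # nearest neighbour among transformed entries
    q = entries[indices]
    # DiVeQ-detach on top: gradients reach both `codebook` and `transform` through ||q - z||
    diff = q - z
    r = diff.norm(dim=-1, keepdim=True)
    return z + r * (diff / r).detach(), indices

z = torch.randn(32, d)
z_q, idx = quantize_with_transform(z)
```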
Summary
This article introduced a new training scheme for VQ (Vector Quantization): it can be implemented with nothing more than an STE-style stop_gradient trick and requires no additional Aux Loss, which makes it particularly concise and elegant.
When reposting, please include the original address of this article: https://kexue.fm/archives/11328
For more detailed reposting matters, please refer to: "Scientific Space FAQ"