English (unofficial) translations of posts at kexue.fm

Naive Bayes is All You Need?

Translated by Gemini Flash 3.0 Preview. Translations can be inaccurate; please refer to the original post for anything important.

I apologize for choosing such a clickbaity title. After writing "NBCE: Using Naive Bayes to Extend the Context Processing Length of LLMs", I felt that the Naive Bayes mechanism shares many characteristics with the Attention mechanism. Upon further derivation, I discovered that the Attention mechanism can actually be seen as a generalized, parameterized Naive Bayes. Since "Attention is All You Need," does it not imply that "Naive Bayes is all you need"? This is the origin of the title.

In the following, I will introduce my thought process and analyze how to understand the Attention mechanism from the perspective of Naive Bayes.

Naive Bayes

In this article, we primarily consider language models, which aim to model p(x_t|x_1, \dots, x_{t-1}). According to Bayes’ theorem, we have: \begin{equation} p(x_t|x_1, \dots, x_{t-1}) = \frac{p(x_1, \dots, x_{t-1}|x_t)p(x_t)}{p(x_1, \dots, x_{t-1})} \propto p(x_1, \dots, x_{t-1}|x_t)p(x_t) \end{equation} Under the independence assumption p(x_1, \dots, x_{t-1}|x_t) = \prod_{j=1}^{t-1} p(x_j|x_t), we obtain: \begin{equation} p(x_t|x_1, \dots, x_{t-1}) \propto \prod_{j=1}^{t-1} p(x_j|x_t)p(x_t) \end{equation} Applying Bayes’ theorem again, p(x_j|x_t) = \frac{p(x_t|x_j)p(x_j)}{p(x_t)} \propto \frac{p(x_t|x_j)}{p(x_t)}, we get: \begin{equation} p(x_t|x_1, \dots, x_{t-1}) \propto \frac{1}{[p(x_t)]^{t-2}} \prod_{j=1}^{t-1} p(x_t|x_j) \end{equation} Taking the logarithm of both sides yields: \begin{equation} \log p(x_t|x_1, \dots, x_{t-1}) = \sum_{j=1}^{t-1} \log p(x_t|x_j) - (t - 2) \log p(x_t) + \text{constant} \end{equation}
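To make the arithmetic concrete, here is a toy numeric sketch of the final equation above. All probability tables are hypothetical values chosen only for illustration; the point is just the scoring rule, not the numbers.

```python
import math

# Score each candidate x_t by the final Naive Bayes equation:
#   sum_j log p(x_t | x_j) - (t - 2) * log p(x_t)  (up to a constant),
# then exp-normalize over the vocabulary.
vocab = ["a", "b", "c"]

# log_cond[j][i] = log p(x_t = vocab[i] | x_j), one row per context token.
log_cond = [
    [math.log(p) for p in (0.6, 0.3, 0.1)],  # p(. | first context token)
    [math.log(p) for p in (0.2, 0.5, 0.3)],  # p(. | second context token)
]
log_prior = [math.log(p) for p in (0.4, 0.4, 0.2)]  # log p(x_t)

num_ctx = len(log_cond)  # t - 1 context tokens, so t - 2 = num_ctx - 1
scores = [
    sum(row[i] for row in log_cond) - (num_ctx - 1) * log_prior[i]
    for i in range(len(vocab))
]

# exp-normalize into p(x_t | x_1, ..., x_{t-1})
m = max(scores)
exp_s = [math.exp(s - m) for s in scores]
probs = [e / sum(exp_s) for e in exp_s]
```

The normalization absorbs the constant term, which is why the equation only needs to hold up to an additive constant.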

Generalized Result

We performed a similar derivation in "NBCE: Using Naive Bayes to Extend the Context Processing Length of LLMs". As in that article, we generalize the above formula as: \begin{equation} \log p(x_t|x_1, \dots, x_{t-1}) = (1 + \beta)\mathcal{P}[\log p(x_t|x_j)] - \beta \log p(x_t) + \text{constant} \end{equation} Here, \beta serves as a hyperparameter to be tuned, and \mathcal{P} represents some form of Pooling. Next, we focus on the case where \beta=0 and the Pooling method is a weighted average: \begin{equation} \log p(x_t|x_1, \dots, x_{t-1}) = \sum_j a_{t,j} \log p(x_t|x_j) + \text{constant} \label{eq:nb-core} \end{equation} In this expression, a_{t,j} is a function of x_{t-1} and x_j.
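The generalized formula can be sketched as a small scoring function. The value of \beta and the probabilities below are hypothetical; the sketch only shows how the pooling term and the prior term combine.

```python
import math

# Generalized Naive Bayes score:
#   score(x_t) = (1 + beta) * P[log p(x_t | x_j)] - beta * log p(x_t),
# where P is here a weighted-average pooling over context positions j.
def pooled_score(log_cond_col, log_prior_t, weights, beta):
    """log_cond_col[j] = log p(x_t | x_j) for one fixed candidate x_t;
    weights[j] = a_{t,j}, assumed normalized (they sum to 1)."""
    pooled = sum(a * lp for a, lp in zip(weights, log_cond_col))
    return (1 + beta) * pooled - beta * log_prior_t

# With beta = 0 and uniform weights, this reduces to an equal-weight
# average of the log-conditionals, i.e. a rescaled standard Naive Bayes.
uniform = [0.5, 0.5]
s = pooled_score([math.log(0.6), math.log(0.2)], math.log(0.4), uniform, beta=0.0)
```

Replacing `uniform` with position-dependent weights a_{t,j} gives exactly the \beta=0 case singled out in the text.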

Some readers might ask: can this generalized formula still be considered Naive Bayes? I believe it can be viewed as a generalized Naive Bayes: standard Naive Bayes amounts to an equal-weighted average of the terms \log p(x_t|x_j), which is here replaced by a more general weighted average. Moreover, choosing a_{t,j} to be a function of x_{t-1} and x_j gives x_{t-1} a special status and mitigates Naive Bayes's inherent insensitivity to word order. More precisely, then, Equation [eq:nb-core] is a combination of a 2-gram language model and Naive Bayes.

The Emergence of Attention

Next, by further parameterizing \log p(x_t|x_j), we can see the form of Attention. It is not hard to see that p(x_t|x_j) is essentially the Skip-Gram model from the original Word2Vec. Its conventional modeling approach is "Embedding + Inner Product + Softmax," namely: \begin{equation} p(x_t|x_j) = \frac{e^{v(x_j) \cdot w(x_t)}}{Z(x_j)}, \quad Z(x_j) = \sum_{x_t \in \text{Vocab}} e^{v(x_j) \cdot w(x_t)} \end{equation} Thus, we can simply assume: \begin{equation} \log p(x_t|x_j) = v(x_j) \cdot w(x_t) + \text{constant} \end{equation} Substituting this into Equation [eq:nb-core], we get: \begin{equation} \log p(x_t|x_1, \dots, x_{t-1}) = \left(\sum_j a_{t,j} v(x_j)\right) \cdot w(x_t) + \text{constant} \label{eq:nb-core-2} \end{equation} The expression inside the parentheses is a standard feature-fusion (weighted-sum) operation: exactly the form of the Attention mechanism. Therefore, a single-layer Attention acting as a language model is, in fact, a generalized Naive Bayes.
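The step from Equation [eq:nb-core] to Equation [eq:nb-core-2] is just linearity of the inner product: a weighted sum of scores v(x_j) \cdot w(x_t) equals the score of the weighted sum of the v(x_j). A minimal numeric check, with hypothetical toy vectors:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

a = [0.7, 0.3]                   # a_{t,j}, normalized weights
vs = [[1.0, 2.0], [3.0, -1.0]]   # v(x_j) embeddings of the context tokens
w = [0.5, 1.5]                   # w(x_t) output embedding of the candidate

# Left side: weighted sum of per-token log-probabilities (scores).
lhs = sum(ai * dot(vi, w) for ai, vi in zip(a, vs))

# Right side: score of the fused feature (the "attention output").
fused = [sum(ai * vi[d] for ai, vi in zip(a, vs)) for d in range(2)]
rhs = dot(fused, w)
```

The two sides agree, which is why the fused vector alone suffices to produce the logits over the whole vocabulary.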

Of course, we have not yet determined a_{t,j}. In the previous section, we stated that a_{t,j} is a function of x_{t-1} and x_j, and it must also be normalized (for the weighted average). A simple way to achieve this, similar to Skip-Gram, is "Embedding + Inner Product + Softmax": \begin{equation} a_{t,j} = \frac{e^{q(x_{t-1}) \cdot k(x_j)}}{Z_t}, \quad Z_t = \sum_{j=1}^{t-1} e^{q(x_{t-1}) \cdot k(x_j)} \end{equation} Substituting this into Equation [eq:nb-core-2] results in the Dot-Product Attention most commonly used today. Of course, this method is not unique; there are also Additive Attention and others. The primary reason for choosing Dot-Product is that it can be implemented in parallel while being relatively memory-efficient.
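Putting the two parameterizations together, a single attention layer acting as a language model can be sketched in a few lines. All embedding tables (`q`, `k`, `v`, `w`) below are tiny hypothetical lookups, not trained values; the sketch only mirrors the equations above.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = {0: [1.0, 0.0], 1: [0.0, 1.0]}        # query embedding of x_{t-1}
k = {0: [1.0, 0.0], 1: [0.0, 1.0]}        # key embedding of each x_j
v = {0: [0.5, 0.5], 1: [-0.5, 1.0]}       # value embedding of each x_j
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # output embedding w(x_t), 3-token vocab

context = [0, 1]          # x_1, ..., x_{t-1}
query_tok = context[-1]   # the weights depend on x_{t-1}

# a_{t,j} = softmax_j( q(x_{t-1}) . k(x_j) )
weights = softmax([dot(q[query_tok], k[xj]) for xj in context])

# fused feature, then logits and probabilities over the vocabulary
fused = [sum(a * v[xj][d] for a, xj in zip(weights, context)) for d in range(2)]
logits = [dot(fused, wt) for wt in w]
probs = softmax(logits)
```

This is exactly the dot-product attention described in the text, restricted to one layer and one head.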

Stacking and Residuals

Regardless of the parameterization, the capability of a single-layer Naive Bayes is always limited, so the model complexity needs to be further increased. From a neural network perspective, the main way to increase model complexity is to increase depth, which is the stacking of layers. How can we understand this stacking from a probabilistic distribution perspective? The answer lies in latent variable models.

A latent variable model introduces latent variables z_1, z_2, \dots, z_{t-1} such that: \begin{equation} p(x_t|x_1, \dots, x_{t-1}) = \int p(x_t|z_1, \dots, z_{t-1}) p(z_1, \dots, z_{t-1}|x_1, \dots, x_{t-1}) dz_1 \cdots dz_{t-1} \end{equation} Simply put, it fits more complex distributions through the superposition of simpler ones, which is consistent with the idea of GMM (Gaussian Mixture Models). Based on our previous discussion, we also model p(x_t|z_1, \dots, z_{t-1}) using Naive Bayes, which means it is a single-layer Attention at the feature level. For p(z_1, \dots, z_{t-1}|x_1, \dots, x_{t-1}), following the characteristics of autoregressive models, we decompose it as: \begin{equation} p(z_1, \dots, z_{t-1}|x_1, \dots, x_{t-1}) = \prod_{k=1}^{t-1} p(z_k|x_1, \dots, x_k) \end{equation} In this way, each p(z_k|x_1, \dots, x_k) has the same form as p(x_t|z_1, \dots, z_{t-1}) and can likewise be modeled using Naive Bayes. For simplicity, if we define z_k as a continuous variable and p(z_k|x_1, \dots, x_k) as a Dirac distribution, the integral can be calculated directly, and the result is the stacking of two Attention layers.
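Under the Dirac assumption the integral collapses, and the construction can be sketched directly with two stacked attention reads. The token embeddings below are hypothetical; layer one deterministically computes each z_k from x_1, \dots, x_k, and layer two attends over the resulting z's.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(query, keys, values):
    """One attention read: softmax-weighted average of the values."""
    weights = softmax([dot(query, kj) for kj in keys])
    dim = len(values[0])
    return [sum(a * vj[d] for a, vj in zip(weights, values)) for d in range(dim)]

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # embeddings of x_1, x_2, x_3

# Layer 1: the Dirac assumption means p(z_k | x_1, ..., x_k) puts all its
# mass on this single deterministic value of z_k.
zs = [attend(xs[k], xs[: k + 1], xs[: k + 1]) for k in range(len(xs))]

# Layer 2: attend over z_1, ..., z_{t-1} to get the predictive feature.
fused = attend(zs[-1], zs, zs)
```

Note that each z_k only reads from positions up to k, which is the causal (autoregressive) masking of the decomposition above.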

Finally, another key component in the Transformer is the residual connection. In fact, it generalizes Equation [eq:nb-core] to: \begin{equation} \log p(x_t|x_1, \dots, x_{t-1}) = \log p(x_t|x_{t-1}) + \sum_j a_{t,j} \log p(x_t|x_j) + \text{constant} \end{equation} This can be understood as a Pooling scheme that gives the 2-gram term special weight, acting as a kind of prior. The remaining components, such as the FeedForward and LayerNorm layers, do not involve interactions between tokens; they can be understood as more elaborately parameterized parts of the Naive Bayes framework.
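A toy sketch of this residual-augmented scoring, with hypothetical log-probabilities, shows the 2-gram term acting as a prior that can override the attention-pooled term:

```python
import math

log_bigram = [math.log(p) for p in (0.7, 0.2, 0.1)]  # log p(x_t | x_{t-1})
pooled = [math.log(p) for p in (0.2, 0.5, 0.3)]      # sum_j a_{t,j} log p(x_t | x_j)

# Residual: the 2-gram term is added directly to the pooled term.
scores = [b + p for b, p in zip(log_bigram, pooled)]

# exp-normalize into p(x_t | x_1, ..., x_{t-1})
m = max(scores)
exp_s = [math.exp(s - m) for s in scores]
probs = [e / sum(exp_s) for e in exp_s]
```

Here the pooled term alone would favor the second token, but the strong 2-gram prior flips the prediction to the first, illustrating the privileged status of x_{t-1}.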

Admittedly, such a broad explanation might seem a bit forced. However, my original intention was not to precisely explain the Transformer or Attention, but to gain new insights into length extrapolation from the perspective of Naive Bayes. Unfortunately, I have not yet achieved the expected results. Nevertheless, at the risk of flattering myself, I still believe the perspective of Naive Bayes and latent variable models has further potential for exploration, for example in explaining, from a Naive Bayes perspective, why In-Context Learning in Attention-based language models is effective.

Article Summary

This article elaborates on the connection between Naive Bayes and the Attention mechanism, showing that Attention can be regarded as a generalized Naive Bayes. From this perspective, we can further understand elements such as stacking and residuals in Attention.