Still playing with Naive Bayes in the era of LLMs?
This may well be the first thought many readers have on seeing the title. Indeed, when the venerable Naive Bayes meets cutting-edge LLMs, a surprising result emerges: we can directly extend the context length of existing LLMs without fine-tuning, independent of model architecture, with linear efficiency and promising results. This is the NBCE (Naive Bayes-based Context Extension) method proposed in this article.
Initial Exploration
Assume T is the token sequence to be generated, and S_1, S_2, \dots, S_n are several given, relatively independent contexts (e.g., n different paragraphs; at the very least, no single sentence should be split across two of them). Suppose their total length exceeds the training length of the model, while any single S_k plus T still fits within it. We need to generate T based on S_1, S_2, \dots, S_n, which means estimating p(T|S_1, S_2, \dots, S_n).
Simply put, Naive Bayes is "Bayes’ theorem + independence assumption." According to Bayes’ theorem: \begin{equation} p(T|S_1, S_2, \dots, S_n) \propto p(S_1, S_2, \dots, S_n|T)p(T) \end{equation} Here, \propto omits constant factors independent of T. According to the (conditional) independence assumption: \begin{equation} p(S_1, S_2, \dots, S_n|T) = \prod_{k=1}^n p(S_k|T) \end{equation} Therefore, we have: \begin{equation} p(T|S_1, S_2, \dots, S_n) \propto p(T)\prod_{k=1}^n p(S_k|T) \end{equation} Applying Bayes’ theorem again, p(S_k|T) \propto \frac{p(T|S_k)}{p(T)}, we get: \begin{equation} p(T|S_1, S_2, \dots, S_n) \propto \frac{1}{p^{n-1}(T)}\prod_{k=1}^n p(T|S_k) \end{equation} Or: \begin{equation} \log p(T|S_1, S_2, \dots, S_n) = \textcolor{red}{\sum_{k=1}^n \log p(T|S_k)} - \textcolor{green}{(n-1)\log p(T)} + \textcolor{cyan}{\text{constant}} \label{eq:nbce-1} \end{equation}
In this formula, \textcolor{red}{p(T|S_k)} and \textcolor{green}{p(T)} can be directly calculated using existing LLMs. Any language model will work, regardless of architecture, and no long-text fine-tuning is required. Here, \textcolor{red}{p(T|S_k)} is the probability predicted with a single context, while \textcolor{green}{p(T)} is the probability with no context (or an empty context). Multiple contexts can be placed in the same batch for parallel calculation, and the computational cost grows linearly with the number of contexts.
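The combination in Equation [eq:nbce-1] is just a sum of per-context log-probabilities minus a scaled prior. A minimal sketch with NumPy, where the toy arrays stand in for actual model outputs (in practice, `logp_given_s` would come from n batched forward passes and `logp_prior` from a forward pass with empty context):

```python
import numpy as np

def nbce_naive(logp_given_s, logp_prior):
    """Combine per-context log-probs via Equation (nbce-1).

    logp_given_s: array of shape (n, vocab), log p(T|S_k) for each context k
    logp_prior:   array of shape (vocab,),   log p(T) with empty context
    Returns log p(T|S_1, ..., S_n) up to an additive constant.
    """
    n = logp_given_s.shape[0]
    return logp_given_s.sum(axis=0) - (n - 1) * logp_prior

# Toy example: 3 contexts over a 4-token vocabulary
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4))
# log-softmax turns raw logits into log-probabilities
logp_given_s = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
logp_prior = np.full(4, -np.log(4.0))  # uniform prior, for illustration only
combined = nbce_naive(logp_given_s, logp_prior)
```

The result is unnormalized (the constant in Equation [eq:nbce-1] is dropped), which is all that greedy search or a softmax-based sampler needs.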
In-depth Analysis
Of course, Naive Bayes relies on the independence assumption, which limits its practical effectiveness. To improve upon it, let us further analyze and refine Equation [eq:nbce-1] to achieve better results.
First, we denote \log p(T|S) = [\log p(T|S_1), \dots, \log p(T|S_n)], and: \begin{equation} \overline{\log p(T|S)} = \frac{1}{n}\sum_{k=1}^n \log p(T|S_k) \end{equation} Let \beta = n - 1. Then Equation [eq:nbce-1] can be rewritten as: \begin{equation} \log p(T|S_1, S_2, \dots, S_n) = \textcolor{red}{(\beta + 1)\overline{\log p(T|S)}} - \textcolor{green}{\beta\log p(T)} + \textcolor{cyan}{\text{constant}} \label{eq:nbce-2} \end{equation}
Rewriting it in this form naturally leads to two questions:
If we treat \beta as a hyperparameter to be tuned, is it possible to achieve better results?
Since \overline{\log p(T|S)} is the Average Pooling of \log p(T|S), would other pooling methods (denoted as \mathcal{P}) yield better results? That is: \begin{equation} \log p(T|S_1, S_2, \dots, S_n) = \textcolor{red}{(\beta + 1)\mathcal{P}[\log p(T|S)]} - \textcolor{green}{\beta\log p(T)} + \textcolor{cyan}{\text{constant}} \label{eq:nbce-3} \end{equation}
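The two generalizations can be sketched together. Below is a hedged sketch of Equation [eq:nbce-3] with a tunable \beta and a choice of pooling method; again the arrays are toy stand-ins for model outputs. Note that average pooling with \beta = n - 1 recovers Equation [eq:nbce-1] exactly:

```python
import numpy as np

def nbce_pooled(logp_given_s, logp_prior, beta, pooling="avg"):
    """Generalized combination of Equation (nbce-3):
    (beta + 1) * P[log p(T|S)] - beta * log p(T), up to a constant."""
    if pooling == "avg":            # recovers Equation (nbce-2)
        pooled = logp_given_s.mean(axis=0)
    elif pooling == "max":          # element-wise max over contexts
        pooled = logp_given_s.max(axis=0)
    else:
        raise ValueError(f"unknown pooling: {pooling}")
    return (beta + 1) * pooled - beta * logp_prior

# Toy example: 3 contexts over a 5-token vocabulary
rng = np.random.default_rng(1)
logits = rng.normal(size=(3, 5))
logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
prior = np.full(5, -np.log(5.0))

# With beta = n - 1 = 2 and average pooling, this equals Equation (nbce-1)
out = nbce_pooled(logp, prior, beta=2, pooling="avg")
```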
The author experimented with these two questions on a 7B model and reached a preliminary conclusion: in reading comprehension scenarios, Max Pooling with \beta=0.25 performs well overall under Greedy Search, but its results under Random Sampling were essentially unreadable.
Final Solution
Why does Greedy Search perform well while Random Sampling fails? We know that Random Sampling draws tokens according to a distribution, so its poor performance indicates that the combined result under Max Pooling is not a reasonable probability distribution. Greedy Search, by contrast, cares only about the token with the highest probability, not the soundness of the whole distribution; its success tells us that the highest-probability token is quite accurate.
Higher probability indicates lower uncertainty. To improve Random Sampling, we change the pooling method to directly output the distribution with the lowest uncertainty: \begin{equation} \begin{aligned} &\mathcal{P}[\log p(T|S)] = \log p(T|S_{\textcolor{red}{k}}) \\[5pt] &\textcolor{red}{k} = \mathop{\text{argmin}} \big\{H_1, H_2, \dots, H_n\big\} \\[5pt] &H_i = -\sum_T p(T|S_i)\log p(T|S_i) \end{aligned} \end{equation} Substituting this into Equation [eq:nbce-3] gives the final NBCE (Naive Bayes-based Context Extension).
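The minimum-entropy pooling above is straightforward to implement: compute the entropy H_i of each context's predictive distribution and return the distribution with the smallest one. A sketch with toy log-probabilities (real inputs would be the model's per-context log-softmax outputs):

```python
import numpy as np

def min_entropy_pooling(logp_given_s):
    """Select the per-context distribution with the lowest entropy.

    logp_given_s: (n, vocab) array of log p(T|S_k).
    Returns (log p(T|S_k*), k*) where k* = argmin_k H_k.
    """
    p = np.exp(logp_given_s)
    entropy = -(p * logp_given_s).sum(axis=-1)  # H_k for each context
    k = int(entropy.argmin())
    return logp_given_s[k], k

# Toy check: one sharp (confident) distribution among two flat ones
sharp = np.log(np.array([0.97, 0.01, 0.01, 0.01]))
flat = np.log(np.full(4, 0.25))
logp = np.stack([flat, sharp, flat])
pooled, k = min_entropy_pooling(logp)  # k == 1: the sharp context wins
```

Because the pooled vector is itself one of the model's own predictive distributions, the output of Equation [eq:nbce-3] behaves much better under Random Sampling than the Max Pooling variant.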
It is worth noting that while our starting point was Naive Bayes, the generalized Equation [eq:nbce-3] has moved beyond the scope of conventional Naive Bayes while retaining its interpretability. The form of Equation [eq:nbce-3] is quite intuitive:
Predictions from different contexts are aggregated (or "voted") together via method \mathcal{P} (with weight \beta+1), and the prediction without context is subtracted (with weight \beta).
The reason for subtracting the no-context prediction is to make the model more inclined to use the context rather than rely purely on its own internal knowledge. (Note: the paper "Trusting Your Evidence: Hallucinate Less with Context-aware Decoding", which appeared on arXiv three days later, proposes the same technique to reduce hallucination.)
Different \beta values can be chosen for different scenarios. For reading comprehension requiring context, a larger \beta can be considered. For creative writing, a smaller \beta might be better. The author believes \beta \geq -1 is reasonable.
Reference Implementation
The reference implementation of NBCE is provided below:
Github: https://github.com/bojone/NBCE
As seen in the demo code, the implementation of NBCE is very simple. It only requires modifying the way logits are constructed in the decoding function and does not conflict with the choice of decoding algorithm.
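To illustrate this point, here is a hedged sketch of one NBCE decoding step, assuming the n contexts and one empty context have already been run through the model as a single batch and we are given the stacked next-token logits (the function and variable names are illustrative, not the demo's actual API):

```python
import numpy as np

def nbce_logits(batch_logits, beta=0.25):
    """One NBCE decoding step. `batch_logits` stacks the model's
    next-token logits for [context_1, ..., context_n, no_context],
    shape (n + 1, vocab). Returns combined logits that any decoding
    algorithm (greedy, sampling, ...) can consume as usual."""
    # log-softmax over the vocabulary
    logp = batch_logits - np.log(np.exp(batch_logits).sum(-1, keepdims=True))
    ctx, prior = logp[:-1], logp[-1]
    # minimum-entropy pooling: pick the lowest-uncertainty context
    p = np.exp(ctx)
    entropy = -(p * ctx).sum(-1)
    k = int(entropy.argmin())
    return (beta + 1) * ctx[k] - beta * prior  # Equation (nbce-3)

# Toy step: 3 contexts + 1 empty context, vocabulary of 6 tokens
rng = np.random.default_rng(2)
batch = rng.normal(size=(4, 6))
logits = nbce_logits(batch)
```

In a real loop one would sample a token from these logits, append it to every context in the batch, and repeat; nothing else in the decoding function changes.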
The provided demo includes 12 different context segments with a total length of over 9,000 characters. These, along with 8 questions, are fed to the model at once (a 7B model with a training length of 2048, available from OpenBuddy). The model correctly answers all 8 questions based on the provided contexts. Notably, the total length of all contexts, questions, and answers exceeds 10,000 characters! Some readers have also tried it for resume matching and essay scoring with decent results. I highly recommend trying it out yourself.
Further Reflections
A major drawback of NBCE is its lack of order; it cannot recognize the input order of contexts. This may lead to poor performance in scenarios like story continuation. To mitigate this, one could consider adding a prefix to each context to indicate order information, similar to "Chapter 1" or "Chapter 2" in a novel.
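The prefix idea amounts to a one-line preprocessing step; a sketch (the exact wording of the prefixes is of course arbitrary):

```python
def add_order_prefix(contexts):
    """Prepend an explicit order marker to each context so that NBCE,
    which is otherwise order-agnostic, can see positional information."""
    return [f"Chapter {i + 1}: {c}" for i, c in enumerate(contexts)]

# add_order_prefix(["Once upon a time...", "Many years later..."])
# -> ["Chapter 1: Once upon a time...", "Chapter 2: Many years later..."]
```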
Overall, the author’s tests of NBCE have been limited to "reading comprehension" scenarios (i.e., "understanding" long text). Whether this method can be used to "generate" long text remains an open question, and I look forward to everyone’s test results.
Furthermore, there is an interesting question:
Since Naive Bayes can be useful in the LLM field, could other traditional probabilistic models (such as HMM) also find their place in the LLM field?
Conclusion
This article proposes NBCE (Naive Bayes-based Context Extension), which extends the context processing length of LLMs based on the idea of Naive Bayes. It has the advantages of being plug-and-play, model-agnostic, requiring no fine-tuning, linear efficiency, and simple implementation. The results appear promising, and everyone is welcome to test it.
Original URL: https://kexue.fm/archives/9617