Last week, in the post "NBCE: Using Naive Bayes to Extend the Context Processing Length of LLMs", we introduced a scheme called NBCE (Naive Bayes-based Context Extension) to extend the context length of LLMs based on Naive Bayes. Due to its advantages such as being plug-and-play, model-agnostic, and requiring no fine-tuning, it has gained recognition from some readers. Overall, the test results reported by users so far have been quite good.
Of course, some readers also raised questions during use. This article provides some supplementary explanations and analysis of the NBCE method, combining readers’ inquiries with the author’s subsequent reflections.
Method Review
Suppose T is the token sequence to be generated and S_1, S_2, \dots, S_n are several given contexts. We need to generate T based on S_1, S_2, \dots, S_n, which requires estimating p(T|S_1, S_2, \dots, S_n). Applying the Naive Bayes idea, we obtain:
\begin{equation} \log p(T|S_1, S_2, \dots, S_n) = \color{red}{(\beta + 1)\overline{\log p(T|S)}} - \color{green}{\beta\log p(T)} + \color{skyblue}{\text{constant}} \label{eq:nbce-2} \end{equation}
where \beta = n - 1 and \overline{\log p(T|S)} = \frac{1}{n}\sum_{k=1}^n \log p(T|S_k). For details, please refer to the previous article.
NBCE makes two modifications: 1. treat \beta as a hyperparameter to be tuned; 2. replace \overline{\log p(T|S)} with a general pooling operator \mathcal{P}. The result becomes:
\begin{equation} \log p(T|S_1, S_2, \dots, S_n) = \color{red}{(\beta + 1)\mathcal{P}[\log p(T|S)]} - \color{green}{\beta\log p(T)} + \color{skyblue}{\text{constant}} \label{eq:nbce-3} \end{equation}
Finally, the pooling scheme chosen for NBCE is "selecting the one with minimum entropy":
\begin{equation} \begin{aligned} &\mathcal{P}[\log p(T|S)] = \log p(T|S_{\color{red}{k}}) \\[5pt] &\color{red}{k} = \mathop{\text{argmin}} \big\{H_1, H_2, \dots, H_n\big\} \\[5pt] &H_i = -\sum_T p(T|S_i)\log p(T|S_i) \end{aligned} \label{eq:min-h} \end{equation}
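As a concrete illustration, the minimum-entropy pooling of Equation [eq:min-h] can be sketched in a few lines of NumPy (the actual repository works with PyTorch logits; the function name, shapes, and the NumPy formulation here are this sketch's assumptions):

```python
import numpy as np

def min_entropy_pooling(logps):
    """Illustrative sketch of NBCE's min-entropy pooling.

    logps: array of shape (n, vocab) holding log p(T|S_k) for each of
    the n contexts at the current decoding step (assumed finite here,
    i.e. taken before any truncation). Returns the log-distribution of
    the context with the smallest predictive entropy, and its index.
    """
    probs = np.exp(logps)                       # p(T|S_k)
    entropies = -(probs * logps).sum(axis=-1)   # H_k = -sum_T p log p
    k = int(entropies.argmin())                 # pick the most confident context
    return logps[k], k
```

A sharply peaked (confident) distribution has low entropy, so it wins over a near-uniform one.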
Truncated Prediction
Equation [eq:nbce-2] is the standard Naive Bayes result. However, when I implemented it, I found that as n increased, its performance gradually deteriorated until it produced complete gibberish. After repeated adjustments, I therefore settled on "selecting the one with minimum entropy" as the pooling scheme for NBCE. On reflection, though, this behavior of Equation [eq:nbce-2] is abnormal: the only assumption Naive Bayes makes is that the contexts are independent, and the contexts I tested were several randomly selected news articles (which satisfy this assumption reasonably well), so Equation [eq:nbce-2] should not be so bad as to produce complete gibberish.
While I was puzzling over this, @Kong Mouren reminded me on WeChat: language models are trained with one-hot labels, so their predictions are trustworthy only at the head (the highest-probability part). This tip hit the nail on the head and made the answer immediately clear: Equation [eq:nbce-2] contains the term -\beta\log p(T), which amplifies the predictions in the tail. If the tail predictions are unreliable, this amplification can completely disrupt the accurate results at the head. Why does "selecting the one with minimum entropy" escape this problem? Because the minimum-entropy prediction tends to have a larger head probability and a smaller tail probability, so even after -\beta\log p(T) amplifies the tail, the tail still cannot overcome the head. Equation [eq:nbce-2], by contrast, averages all predictions, which weakens the head to the point where the tail overcomes it once -\beta\log p(T) is added.
With this clue, the solution is obvious: apply Top-P or Top-K truncation to each prediction result. In the GitHub code, I chose Top-P truncation.
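A minimal sketch of Top-P truncation on a single log-distribution might look as follows (the repository applies the same idea to torch tensors; the function name and the default p here are illustrative):

```python
import numpy as np

def top_p_truncate(logp, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p; set all other positions to -inf."""
    probs = np.exp(logp)
    order = np.argsort(-probs)                  # tokens sorted by descending prob
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, p)) + 1   # number of tokens needed to reach mass p
    keep = order[:cutoff]
    out = np.full_like(logp, -np.inf)           # truncated tail becomes -inf
    out[keep] = logp[keep]
    return out
```

After this step, only the trustworthy head of each distribution survives, which is exactly what makes the infinity handling below necessary.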
Infinity Handling
However, the matter is not yet over. After truncation, the tails of \log p(T|S_k) and \log p(T) both become -\infty. At this point, Equation [eq:nbce-2] or Equation [eq:nbce-3] might encounter the meaningless operation (-\infty) - (-\infty). Generally, there are the following cases:
| | \log p(T|S_k) | \log p(T) | \log p(T|S_k) - \log p(T) |
|---|---|---|---|
| Case 1 | > -\infty | > -\infty | > -\infty |
| Case 2 | > -\infty | = -\infty | = +\infty |
| Case 3 | = -\infty | > -\infty | = -\infty |
| Case 4 | = -\infty | = -\infty | NaN |
Among these, "Case 1" and "Case 3" can be calculated normally. "Case 2" can also be calculated, but its result of positive infinity is unreasonable; "Case 4" is an ill-defined, meaningless operation. That is, we need to correct "Case 2" and "Case 4". These two cases are exactly the ones with \log p(T) = -\infty, so we modify Equation [eq:nbce-3] to: \begin{equation} \log p(T|S_1, S_2, \dots, S_n) = \left\{ \begin{aligned} &\color{red}{\mathcal{P}[\log p(T|S)]}, \quad \text{if } \color{green}{\log p(T) = -\infty} \\[5pt] &\color{red}{(\beta + 1)\mathcal{P}[\log p(T|S)]} - \color{green}{\beta\log p(T)}, \quad \text{otherwise}\\ \end{aligned}\right\} + \color{skyblue}{\text{constant}} \label{eq:nbce-4} \end{equation} With this treatment, even the standard Naive Bayes Equation [eq:nbce-2] outputs normal results (the final effect is still not as good as selecting the minimum entropy, but at least it no longer produces gibberish), and the modified code is more robust to the choice of pooling method and \beta.
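This piecewise rule can be sketched with a mask over the positions where \log p(T) = -\infty (the function name and the NumPy formulation are this sketch's assumptions, not the repository's API):

```python
import numpy as np

def nbce_combine(pooled, logp_uncond, beta):
    """Combine a pooled log-distribution with the unconditional one.

    Wherever the (truncated) unconditional log p(T) is -inf, fall back
    to the pooled prediction alone, so the cases (-inf) - (-inf) and
    +inf from the table above never arise.
    """
    mask = np.isneginf(logp_uncond)             # Case 2 / Case 4 positions
    with np.errstate(invalid="ignore"):         # np.where evaluates both branches
        out = np.where(
            mask,
            pooled,                                   # drop the -beta*log p(T) term
            (beta + 1) * pooled - beta * logp_uncond  # standard combination
        )
    return out
```

Note that np.where evaluates both branches eagerly, so the NaN of "Case 4" is still produced internally; the mask simply discards it.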
Transition Probability
When used to answer opinion-based questions or questions biased towards free creation, NBCE might exhibit the problem of repeatedly jumping between contexts. Specifically, the model does not confidently focus on a particular context, so the differences between H_1, H_2, \dots, H_n are small. Consequently, the \mathop{\text{argmin}} result in Equation [eq:min-h] becomes unstable, selecting a different context at each generation step. This leads to semantic discontinuity in the generated results or even complete irrelevance to the contexts, exacerbating the "hallucination" phenomenon of LLMs.
To alleviate this problem, we can mimic the concept of transition probability by appropriately weighting the context selected in the previous step, encouraging the model to "not jump unless necessary." The specific approach is to introduce a parameter \eta > 0 and modify Equation [eq:min-h] to: \begin{equation} \color{red}{k} = \mathop{\text{argmin}} \big\{H_1, \dots, H_{k'-1}, H_{k'} \color{red}{- \eta}, H_{k'+1}, \dots, H_n\big\} \end{equation} where k' is the index of the context selected in the previous generation step. In this way, a context jump occurs only when H_k < H_{k'} - \eta, thereby reducing the jump probability.
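A sketch of this "sticky" argmin (the function name and the default \eta are this sketch's choices):

```python
import numpy as np

def sticky_argmin(entropies, prev_k, eta=0.1):
    """Select a context index, favouring the previously chosen one.

    Subtracts a bonus eta from the previous context's entropy, so a
    jump to context k happens only when H_k < H_{k'} - eta.
    """
    h = np.asarray(entropies, dtype=float).copy()
    if prev_k is not None:                 # no previous step at the first token
        h[prev_k] -= eta                   # encourage "not jumping unless necessary"
    return int(h.argmin())
```

With a small entropy gap between contexts, a moderate \eta keeps the selection on the previous context; a tiny \eta restores the plain argmin behaviour.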
All the modifications mentioned above have been synchronized to Github:
Github: https://github.com/bojone/NBCE
Applicable Scenarios
Given the independence assumption made by Naive Bayes, many readers may wonder: will the performance of NBCE drop significantly when there is substantial semantic overlap between contexts? In other words, what are the applicable scenarios for NBCE?
In fact, it is the standard Naive Bayes (Equation [eq:nbce-2]) that is limited by the independence assumption; the generalized Equation [eq:nbce-3] and Equation [eq:min-h] are essentially no longer restricted by it. The "minimum entropy" version of NBCE essentially uses the entropy of the LLM as a similarity measure to retrieve contexts, updating the retrieval result at each generation step. The applicable scenario for NBCE is therefore: the answer to be predicted can be divided into several segments, each of which depends on only one context.
Based on this conclusion, when we have only one long text as context (such as a novel), we can automatically divide it into multiple short contexts with an overlapping sliding window, rather than requiring manual segmentation into relatively independent fragments, because the applicability of NBCE does not depend on the contexts being non-overlapping. The overlapping sliding window is used simply so that the model can, as far as possible, produce a complete answer by relying on a single context.
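A possible sketch of such an overlapping sliding window over a token list (the window size and stride values are purely illustrative; stride < size gives an overlap of size - stride tokens between neighbours):

```python
def sliding_windows(tokens, size=512, stride=384):
    """Split one long token sequence into overlapping short contexts."""
    if len(tokens) <= size:
        return [tokens]
    windows = [tokens[i:i + size]
               for i in range(0, len(tokens) - size + 1, stride)]
    if (len(tokens) - size) % stride != 0:   # make sure the tail is covered
        windows.append(tokens[-size:])
    return windows
```

Each window then serves as one of the contexts S_1, \dots, S_n fed to NBCE.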
The following two scenarios are likely not to work well with NBCE:
Ordered Context: This refers to cases where the generation result strongly depends on the input order of the contexts (or more complex nested structures). NBCE usually performs poorly here because it retains the unordered nature of Naive Bayes. A typical example is writing a summary for a novel (where the novel is cut into multiple contexts). A temporary solution is to manually add markers to each context that identify the order, such as "Chapter xx".
Coupled Context: This refers to cases where the output must be constructed by combining two or more contexts. NBCE performs poorly here because it only selects one context at a time. @Kong Mouren gave a typical example: "Given x > 1 and x < 0, find the solution set for x." If the two conditions are split into two contexts, the model must combine both to output the correct answer "empty set." Looking at a single context alone cannot determine that the set is empty.
If NBCE is to be further developed, it will likely revolve around improving these two scenarios.
Conclusion
This article introduced some subsequent updates and analysis of the context length extension scheme NBCE, and further discussed its applicable scenarios.
Reprinted from: https://kexue.fm/archives/9632
For more details on reprinting, please refer to: "Scientific Space FAQ"