Still playing with Naive Bayes in the era of LLMs?
This may well be the first thought many readers have on seeing the title. Indeed, when the venerable Naive Bayes meets cutting-edge LLMs, a surprising result emerges: we can directly extend the context length of existing LLMs without fine-tuning, independent of model architecture, with linear efficiency and promising results. This is the NBCE (Naive Bayes-based Context Extension) method proposed in this article.
Initial Exploration
Assume T is the token sequence to be generated, and S_1, S_2, \dots, S_n are several given, relatively independent contexts (e.g., n different paragraphs; at the very least, no single sentence should be split across two of them). Suppose their total length exceeds the training length of the model, while any single S_k plus T still fits within it. We need to generate T based on S_1, S_2, \dots, S_n, which means estimating p(T|S_1, S_2, \dots, S_n).
Simply put, Naive Bayes is "Bayes’ theorem + independence assumption." According to Bayes’ theorem: \begin{equation} p(T|S_1, S_2, \dots, S_n) \propto p(S_1, S_2, \dots, S_n|T)p(T) \end{equation} Here, \propto omits constant factors independent of T. According to the (conditional) independence assumption: \begin{equation} p(S_1, S_2, \dots, S_n|T) = \prod_{k=1}^n p(S_k|T) \end{equation} Therefore, we have: \begin{equation} p(T|S_1, S_2, \dots, S_n) \propto p(T)\prod_{k=1}^n p(S_k|T) \end{equation} Applying Bayes’ theorem again, p(S_k|T) \propto \frac{p(T|S_k)}{p(T)}, we get: \begin{equation} p(T|S_1, S_2, \dots, S_n) \propto \frac{1}{p^{n-1}(T)}\prod_{k=1}^n p(T|S_k) \end{equation} Or: \begin{equation} \log p(T|S_1, S_2, \dots, S_n) = \textcolor{red}{\sum_{k=1}^n \log p(T|S_k)} - \textcolor{green}{(n-1)\log p(T)} + \textcolor{cyan}{\text{constant}} \label{eq:nbce-1} \end{equation}
In this formula, \textcolor{red}{p(T|S_k)} and \textcolor{green}{p(T)} can be directly calculated using existing LLMs. Any language model will work, regardless of architecture, and no long-text fine-tuning is required. Here, \textcolor{red}{p(T|S_k)} is the probability predicted with a single context, while \textcolor{green}{p(T)} is the probability with no context (or an empty context). Multiple contexts can be placed in the same batch for parallel calculation, and the computational cost grows linearly with the number of contexts.
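The combination in Equation [eq:nbce-1] is just a sum of per-context log-probabilities minus a scaled prior. A minimal sketch with NumPy, where the toy arrays stand in for actual model outputs (in practice, `logp_given_s` would come from n batched forward passes and `logp_prior` from a forward pass with empty context):

```python
import numpy as np

def nbce_naive(logp_given_s, logp_prior):
    """Combine per-context log-probs via Equation (nbce-1).

    logp_given_s: array of shape (n, vocab), log p(T|S_k) for each context k
    logp_prior:   array of shape (vocab,),   log p(T) with empty context
    Returns log p(T|S_1, ..., S_n) up to an additive constant.
    """
    n = logp_given_s.shape[0]
    return logp_given_s.sum(axis=0) - (n - 1) * logp_prior

# Toy example: 3 contexts over a 4-token vocabulary
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4))
# log-softmax turns raw logits into log-probabilities
logp_given_s = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
logp_prior = np.full(4, -np.log(4.0))  # uniform prior, for illustration only
combined = nbce_naive(logp_given_s, logp_prior)
```

The result is unnormalized (the constant in Equation [eq:nbce-1] is dropped), which is all that greedy search or a softmax-based sampler needs.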
In-depth Analysis
Of course, Naive Bayes relies on the independence assumption, which limits its practical effectiveness. To improve upon it, let us further analyze and refine Equation [eq:nbce-1] to achieve better results.
First, we denote \log p(T|S) = [\log p(T|S_1), \dots, \log p(T|S_n)], and: \begin{equation} \overline{\log p(T|S)} = \frac{1}{n}\sum_{k=1}^n \log p(T|S_k) \end{equation} Let \beta = n - 1. Then Equation [eq:nbce-1] can be rewritten as: \begin{equation} \log p(T|S_1, S_2, \dots, S_n) = \textcolor{red}{(\beta + 1)\overline{\log p(T|S)}} - \textcolor{green}{\beta\log p(T)} + \textcolor{cyan}{\text{constant}} \label{eq:nbce-2} \end{equation}
Rewriting it in this form naturally leads to two questions:
If we treat \beta as a hyperparameter to be tuned, is it possible to achieve better results?
Since \overline{\log p(T|S)} is the Average Pooling of \log p(T|S), would other pooling methods (denoted as \mathcal{P}) yield better results? That is: \begin{equation} \log p(T|S_1, S_2, \dots, S_n) = \textcolor{red}{(\beta + 1)\mathcal{P}[\log p(T|S)]} - \textcolor{green}{\beta\log p(T)} + \textcolor{cyan}{\text{constant}} \label{eq:nbce-3} \end{equation}
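The two generalizations can be sketched together. Below is a hedged sketch of Equation [eq:nbce-3] with a tunable \beta and a choice of pooling method; again the arrays are toy stand-ins for model outputs. Note that average pooling with \beta = n - 1 recovers Equation [eq:nbce-1] exactly:

```python
import numpy as np

def nbce_pooled(logp_given_s, logp_prior, beta, pooling="avg"):
    """Generalized combination of Equation (nbce-3):
    (beta + 1) * P[log p(T|S)] - beta * log p(T), up to a constant."""
    if pooling == "avg":            # recovers Equation (nbce-2)
        pooled = logp_given_s.mean(axis=0)
    elif pooling == "max":          # element-wise max over contexts
        pooled = logp_given_s.max(axis=0)
    else:
        raise ValueError(f"unknown pooling: {pooling}")
    return (beta + 1) * pooled - beta * logp_prior

# Toy example: 3 contexts over a 5-token vocabulary
rng = np.random.default_rng(1)
logits = rng.normal(size=(3, 5))
logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
prior = np.full(5, -np.log(5.0))

# With beta = n - 1 = 2 and average pooling, this equals Equation (nbce-1)
out = nbce_pooled(logp, prior, beta=2, pooling="avg")
```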
The author experimented with these two questions on a 7B model and reached a preliminary conclusion: in reading comprehension scenarios, Max Pooling with \beta=0.25 performs well overall under Greedy Search, but its results under Random Sampling were essentially unreadable.
Final Solution
Why does Greedy Search perform well while Random Sampling fails? We know that Random Sampling draws tokens according to a distribution, so its poor performance indicates that the combined result under Max Pooling is not a reasonable probability distribution. Greedy Search, by contrast, cares only about the token with the highest probability, not the soundness of the whole distribution; its success tells us that the highest-probability token is quite accurate.
Higher probability indicates lower uncertainty. To improve Random Sampling, we change the pooling method to directly output the distribution with the lowest uncertainty: \begin{equation} \begin{aligned} &\mathcal{P}[\log p(T|S)] = \log p(T|S_{\textcolor{red}{k}}) \\[5pt] &\textcolor{red}{k} = \mathop{\text{argmin}} \big\{H_1, H_2, \dots, H_n\big\} \\[5pt] &H_i = -\sum_T p(T|S_i)\log p(T|S_i) \end{aligned} \end{equation} Substituting this into Equation [eq:nbce-3] gives the final NBCE (Naive Bayes-based Context Extension).
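The minimum-entropy pooling above is straightforward to implement: compute the entropy H_i of each context's predictive distribution and return the distribution with the smallest one. A sketch with toy log-probabilities (real inputs would be the model's per-context log-softmax outputs):

```python
import numpy as np

def min_entropy_pooling(logp_given_s):
    """Select the per-context distribution with the lowest entropy.

    logp_given_s: (n, vocab) array of log p(T|S_k).
    Returns (log p(T|S_k*), k*) where k* = argmin_k H_k.
    """
    p = np.exp(logp_given_s)
    entropy = -(p * logp_given_s).sum(axis=-1)  # H_k for each context
    k = int(entropy.argmin())
    return logp_given_s[k], k

# Toy check: one sharp (confident) distribution among two flat ones
sharp = np.log(np.array([0.97, 0.01, 0.01, 0.01]))
flat = np.log(np.full(4, 0.25))
logp = np.stack([flat, sharp, flat])
pooled, k = min_entropy_pooling(logp)  # k == 1: the sharp context wins
```

Because the pooled vector is itself one of the model's own predictive distributions, the output of Equation [eq:nbce-3] behaves much better under Random Sampling than the Max Pooling variant.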
It is worth noting that while our starting point was Naive Bayes, the generalized Equation [eq:nbce-3] has moved beyond the scope of conventional Naive Bayes while retaining its interpretability. The form of Equation [eq:nbce-3] is quite intuitive:
Predictions from different contexts are aggregated (or "voted") together via method \mathcal{P} (with weight \beta+1), and the prediction without context is subtracted (with weight \beta).
The reason for subtracting the no-context prediction is to make the model more inclined to use the context rather than rely purely on its own internal knowledge. (Note: the paper "Trusting Your Evidence: Hallucinate Less with Context-aware Decoding", which appeared on arXiv three days later, proposes the same technique to reduce hallucination.)
Different \beta values can be chosen for different scenarios. For reading comprehension requiring context, a larger \beta can be considered. For creative writing, a smaller \beta might be better. The author believes \beta \geq -1 is reasonable.
Reference Implementation
The reference implementation of NBCE is provided below:
Github: https://github.com/bojone/NBCE
As seen in the demo code, the implementation of NBCE is very simple. It only requires modifying the way logits are constructed in the decoding function and does not conflict with the choice of decoding algorithm.
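To illustrate this point, here is a hedged sketch of one NBCE decoding step, assuming the n contexts and one empty context have already been run through the model as a single batch and we are given the stacked next-token logits (the function and variable names are illustrative, not the demo's actual API):

```python
import numpy as np

def nbce_logits(batch_logits, beta=0.25):
    """One NBCE decoding step. `batch_logits` stacks the model's
    next-token logits for [context_1, ..., context_n, no_context],
    shape (n + 1, vocab). Returns combined logits that any decoding
    algorithm (greedy, sampling, ...) can consume as usual."""
    # log-softmax over the vocabulary
    logp = batch_logits - np.log(np.exp(batch_logits).sum(-1, keepdims=True))
    ctx, prior = logp[:-1], logp[-1]
    # minimum-entropy pooling: pick the lowest-uncertainty context
    p = np.exp(ctx)
    entropy = -(p * ctx).sum(-1)
    k = int(entropy.argmin())
    return (beta + 1) * ctx[k] - beta * prior  # Equation (nbce-3)

# Toy step: 3 contexts + 1 empty context, vocabulary of 6 tokens
rng = np.random.default_rng(2)
batch = rng.normal(size=(4, 6))
logits = nbce_logits(batch)
```

In a real loop one would sample a token from these logits, append it to every context in the batch, and repeat; nothing else in the decoding function changes.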
The provided demo includes 12 different context segments with a total length of over 9,000 characters. These, along with 8 questions, are fed to the model at once (a 7B model with a training length of 2048, available from OpenBuddy). The model correctly answers all 8 questions based on the provided contexts. Notably, the total length of all contexts, questions, and answers exceeds 10,000 characters! Some readers have also tried it for resume matching and essay scoring with decent results. I highly recommend trying it out yourself.
Further Reflections
A major drawback of NBCE is its lack of order; it cannot recognize the input order of contexts. This may lead to poor performance in scenarios like story continuation. To mitigate this, one could consider adding a prefix to each context to indicate order information, similar to "Chapter 1" or "Chapter 2" in a novel.
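The prefix idea amounts to a one-line preprocessing step; a sketch (the exact wording of the prefixes is of course arbitrary):

```python
def add_order_prefix(contexts):
    """Prepend an explicit order marker to each context so that NBCE,
    which is otherwise order-agnostic, can see positional information."""
    return [f"Chapter {i + 1}: {c}" for i, c in enumerate(contexts)]

# add_order_prefix(["Once upon a time...", "Many years later..."])
# -> ["Chapter 1: Once upon a time...", "Chapter 2: Many years later..."]
```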
Overall, the author’s tests of NBCE have been limited to "reading comprehension" scenarios (i.e., "understanding" long text). Whether this method can be used to "generate" long text remains an open question, and I look forward to everyone’s test results.
Furthermore, there is an interesting question:
Since Naive Bayes can be useful in the LLM field, could other traditional probabilistic models (such as HMM) also find their place in the LLM field?
Conclusion
This article proposes NBCE (Naive Bayes-based Context Extension), which extends the context processing length of LLMs based on the idea of Naive Bayes. It has the advantages of being plug-and-play, model-agnostic, requiring no fine-tuning, linear efficiency, and simple implementation. The results appear promising, and everyone is welcome to test it.
Original URL: https://kexue.fm/archives/9617