English (unofficial) translations of posts at kexue.fm

A Problem and Countermeasure for Large-Vocabulary Language Models in Text Continuation Tasks

Translated by Gemini Flash 3.0 Preview. Translations can be inaccurate, please refer to the original post for important stuff.

For Large Language Models (LLMs), increasing the tokenizer’s vocabulary size to improve the compression rate—thereby shortening sequence lengths and reducing decoding costs—is a development welcomed by everyone. After all, increasing the vocabulary only requires expanding the Embedding layer and the output Dense layer. The resulting increase in computational overhead is almost imperceptible, while the improvement in decoding speed brought by shorter sequences is very real. However, increasing the vocabulary size may also have negative impacts on model performance, so it cannot be expanded without limit. This article analyzes a specific problem that arises in text continuation tasks after increasing the vocabulary size and proposes a potential solution.

Analysis of Pros and Cons

The benefits of increasing the vocabulary size are obvious. On one hand, since LLMs decode autoregressively, token by token, the chain of "larger vocabulary → higher compression rate → shorter sequences" means the same text is represented by fewer tokens. In other words, the number of decoding steps is reduced, leading to a direct boost in decoding speed. On the other hand, language models are trained with Teacher Forcing; shortening the sequence length can alleviate the Exposure Bias problem that Teacher Forcing causes, which may improve model performance.

However, the disadvantages of increasing the vocabulary are also clear. The most direct issue is that it severs the connection between tokens at the character level, which may hurt generalization or even destroy the ability to perform certain tasks. For example, if both "solar energy" and "solar" are individual tokens in the vocabulary, the model does not inherently know that "solar energy" is composed of "solar" and "energy," nor does it know that "solar" breaks down into the sub-strings "so" and "lar." This makes tasks involving sub-words or characters quite difficult. A classic example is asking: "How do you read 'solar energy' backwards?" The expected answer is "ygrene ralos," but since the model treats the whole phrase as a single token, it is difficult for it to answer correctly.

The Continuation Problem

Recently, @Armen Aghajanyan shared another issue. While training a code model, they used an extremely large vocabulary. As a result, common commands like "import numpy as np" became a single token. They discovered that when a user inputs "import numpy", the model is unable to continue with " as np". The reason is simple: because "import numpy as np" was treated as a single token, the model finds that "import numpy" (as separate tokens) is almost never followed by " as np" in the training data (since those instances were merged into the single long token). Consequently, it fails to complete the continuation.

This phenomenon is quite classic. It occurs not only in code models but also in ordinary natural language models. For instance, if "solar energy" and "solar" are both independent tokens, then after a user inputs "solar," the next predicted token will rarely be "energy," which may not match the user's expected distribution. Similarly, if "White Cloud," "White Cloud Mountain," and "White Cloud Airport" are all independent tokens and a user inputs "Guangzhou's White Cloud," the model will likely fail to continue with "Airport" or "Mountain."
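The failure mode above can be reproduced with a toy greedy longest-match tokenizer. Everything here is illustrative — the vocabulary and the `greedy_tokenize` function are hypothetical stand-ins, not the tokenizer or vocabulary of any real model:

```python
def greedy_tokenize(text, vocab):
    """Split `text` by repeatedly taking the longest vocabulary match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible match first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: emit it as its own token.
            tokens.append(text[i])
            i += 1
    return tokens

vocab = {"import numpy as np", "import numpy", "import", " numpy", " as np"}

# Training text collapses to ONE token, so the pair
# ("import numpy", " as np") never appears as a token sequence:
print(greedy_tokenize("import numpy as np", vocab))  # ['import numpy as np']
# Yet the user's prompt tokenizes to exactly that orphaned prefix:
print(greedy_tokenize("import numpy", vocab))        # ['import numpy']
```

Since the merged token absorbs every training occurrence of " as np" after "import numpy", the model assigns almost no probability to " as np" following the shorter token.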

Proposed Countermeasure

However, I believe that the phenomenon mentioned by Armen Aghajanyan does not necessarily constitute a disadvantage of large vocabularies. In fact, with a bit of processing, it can even become an advantage. This problem is actually quite simple. Before the era of LLMs, we could perform completion tasks based on "vocabulary + prefix search." Now that we have LLMs, must we be confined strictly to the LLM’s internal logic? Can we not combine LLM-based continuation with vocabulary-based continuation?

Returning to the previous example, suppose the user inputs "Guangzhou’s White Cloud." The tokenizer splits it into "Guangzhou / ’s / White Cloud." If these three tokens are converted into IDs and fed into the model, it will fail to generate "Airport" to complete "Guangzhou’s White Cloud Airport." This is essentially because the tokenizer cannot predict future text, leading to a "suboptimal" tokenization result. (Of course, one could consider using a stochastic tokenization algorithm during training, where "White Cloud Airport" might appear as one token or as "White Cloud / Airport." In this case, the tokenization would not severely impact the results and might even enhance generalization; see "Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates").

So, can we estimate the future text? Suppose that after tokenizing into "Guangzhou / ’s / White Cloud," we backtrack one step and use "White Cloud" to perform a prefix search in the vocabulary. Let’s assume the search results are "White Cloud," "White Cloud Airport," "White Cloud Mountain," and "White Cloud Road." This search step is purely based on the vocabulary, and its computational cost is negligible compared to the LLM. Once we have the search results, we use the LLM to calculate:

\begin{equation}\begin{aligned}
&p(\text{White Cloud} \mid \text{Guangzhou}, \text{'s}) \\
&p(\text{White Cloud Airport} \mid \text{Guangzhou}, \text{'s}) \\
&p(\text{White Cloud Mountain} \mid \text{Guangzhou}, \text{'s}) \\
&p(\text{White Cloud Road} \mid \text{Guangzhou}, \text{'s})
\end{aligned}\end{equation}

Since the input context is the same, calculating these four conditional probabilities only requires a single forward pass of the LLM. Once we have these probabilities, we re-normalize them and sample. If the sampled result is "White Cloud," we continue as "Guangzhou / ’s / White Cloud." If we sample "White Cloud Airport," we output "Airport" and continue as "Guangzhou / ’s / White Cloud Airport," and so on. This easily solves the problem mentioned by Armen Aghajanyan and turns the disadvantage into an advantage (when the compression rate is high, even if we backtrack one step, the prefix-searched word might be very long, allowing for more characters to be generated at once). Notably, the backtracking operation only needs to be performed at the first step of sampling to avoid tokenization errors caused by incomplete input; it is not needed from the second step onwards. Therefore, the additional computational cost is minimal.
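The scheme above can be sketched as follows. This is a minimal illustration, not the author's implementation: `lm_logits` is a hypothetical stand-in for one LLM forward pass that returns next-token log-probabilities, and the vocabulary, context, and scores are all made up:

```python
import math
import random

def prefix_candidates(vocab, prefix):
    """Vocabulary-only prefix search; negligible cost next to the LLM."""
    return [tok for tok in vocab if tok.startswith(prefix)]

def continue_with_backtrack(context_tokens, vocab, lm_logits, rng=random):
    """Back off one token, prefix-search the vocabulary, rescore with the LLM."""
    *head, last = context_tokens
    candidates = prefix_candidates(vocab, last)  # e.g. "White Cloud" -> 4 hits
    logits = lm_logits(head)                     # ONE forward pass on the shared context
    # Renormalize over the candidate set only, then sample.
    weights = [math.exp(logits[tok]) for tok in candidates]
    total = sum(weights)
    probs = [w / total for w in weights]
    choice = rng.choices(candidates, weights=probs, k=1)[0]
    # Emit only the characters the user has not already typed.
    return head + [choice], choice[len(last):]

# Toy stand-in for the model: fixed scores regardless of context.
vocab = ["White Cloud", "White Cloud Airport",
         "White Cloud Mountain", "White Cloud Road"]
def lm_logits(context):
    return {"White Cloud": 0.0, "White Cloud Airport": 1.0,
            "White Cloud Mountain": 0.5, "White Cloud Road": -1.0}

tokens, emitted = continue_with_backtrack(
    ["Guangzhou", "'s", "White Cloud"], vocab, lm_logits)
```

If "White Cloud Airport" is sampled, `emitted` is " Airport"-style tail text (here "Airport" without a space, since the toy tokens contain none), and decoding then proceeds normally from the repaired token sequence.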

It is worth mentioning that Microsoft's "guidance" library implements a similar technique, which it calls "token healing." Furthermore, in more general scenarios, backtracking one step is sometimes not enough. For example, in the "import numpy as np" case, the input "import numpy" might be tokenized as "import / numpy", so one would need to backtrack at least two steps to recover a complete and reasonable prefix. The logic is fundamentally the same, only slightly more involved in the details; I will not expand on it here, and readers can implement it themselves when deploying inference models.
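Multi-step backtracking can be sketched like so. The backoff policy here (try backing off up to `max_back` tokens, keep the deepest backtrack whose detokenized suffix is a strict prefix of some longer vocabulary entry) is one illustrative choice among several, and the vocabulary is hypothetical:

```python
def backtrack_candidates(context_tokens, vocab, max_back=3):
    """Return (kept_head, detokenized_suffix, candidates) for the deepest
    backtrack that still yields prefix-search hits, or None if there are none."""
    best = None
    for k in range(1, min(max_back, len(context_tokens)) + 1):
        suffix = "".join(context_tokens[-k:])
        # Strict prefixes only: the suffix itself is not a useful candidate.
        cands = [tok for tok in vocab if tok.startswith(suffix) and tok != suffix]
        if cands:
            best = (context_tokens[:-k], suffix, cands)
    return best

vocab = ["import numpy as np", "import numpy", "import", " numpy"]

# One step back (" numpy") finds nothing, but two steps back recovers
# the full prefix "import numpy" and its long-token completion.
head, suffix, cands = backtrack_candidates(["import", " numpy"], vocab)
```

As in the one-step case, the candidates would then be rescored with a single LLM forward pass on `head` before renormalizing and sampling.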

Summary

This article introduced a problem that may occur when using LLMs with very large vocabularies for text continuation tasks and shared a potential solution.

Original URL: https://kexue.fm/archives/9762
