After the publication of the previous article "A Problem and Countermeasure for Large Vocabulary Language Models in Continuation Tasks", readers quickly pointed out that introducing randomness into the tokenization process during the training phase could solve the same problem, and that there are already existing papers and implementations. After further study, I found that this is a technique called Subword Regularization, which was first applied in Neural Machine Translation (NMT). Currently, SentencePiece also has a corresponding implementation. It seems this technique can indeed alleviate the aforementioned issues and even help enhance the fault tolerance of language models. Therefore, I had the idea of incorporating it into BytePiece.
The question then arises: how can we change deterministic tokenization into random tokenization? BytePiece is based on the Unigram model, which uses the Viterbi algorithm to find the segmentation scheme with the maximum probability. Since there is a probability involved, can we naturally derive a random sampling method? This article discusses this issue and shares my own solution.
Key Analysis
At this stage, Unigram tokenization directly outputs the segmentation scheme with the highest probability, which is usually a deterministic output. Specifically, suppose \boldsymbol{w}=(w_1,w_2,\cdots,w_k) represents a segmentation scheme, and the corresponding score is P(\boldsymbol{w})=p(w_1)p(w_2)\cdots p(w_k). Let \Omega(S) represent the set of all possible segmentation schemes for sentence S. Then the tokenization algorithm can be described as: \begin{equation} \boldsymbol{w}^* = \mathop{\text{argmax}}_{\boldsymbol{w}\in \Omega(S)}P(\boldsymbol{w}) \end{equation} This can be completed in linear time using the Viterbi algorithm, so this process is also called "Viterbi Decoding." It seems that since the Unigram model inherently involves probabilities, it shouldn’t be difficult to change it into a probabilistic sampling form. However, upon closer reflection, it turns out to be a non-trivial problem with many detailed difficulties to overcome.
My initial idea was to mimic the recursive sampling process of an autoregressive language model. The most difficult part here is how to keep the ranking of the original candidate segmentation schemes as unchanged as possible, or at least ensure that the maximum probability remains unchanged—that is, the maximum probability path \boldsymbol{w}^* from Viterbi decoding should correspond to the maximum probability sampling result of the designed recursive sampling algorithm. Since all segmentation schemes \Omega(S) form a Directed Acyclic Graph (DAG), I initially thought that a random walk directly on the DAG would be a feasible solution. But after further thought, I found it difficult to design appropriate transition probabilities to ensure the maximum probability path remains unchanged (because different edges from the same starting point are not equal, and one cannot simply sample based on edge frequency as weights).
Existing Solutions
Since I didn’t have any new ideas for a while, I decided to look at the "reference answer"—the original paper on Subword Regularization: "Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates".
However, this "standard answer" was somewhat unexpected. It turns out the idea of Subword Regularization is very simple and direct: first, search for the n segmentation schemes with the highest P(\boldsymbol{w}), denoted as \boldsymbol{w}^*_1, \boldsymbol{w}^*_2, \cdots, \boldsymbol{w}^*_n (n-best segmentations), and then construct the following distribution: \begin{equation} p_i = \frac{P(\boldsymbol{w}^*_i)^{\alpha}}{\sum\limits_{j=1}^n P(\boldsymbol{w}^*_j)^{\alpha}} \end{equation} Then, sample from these n schemes according to their probabilities, where \alpha > 0 is a hyperparameter. This algorithm is already integrated into SentencePiece, and readers can test it themselves (refer to the usage here).
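Given the n-best list, the sampling step itself is easy; the expensive part is producing the list via N-best Viterbi, which is omitted here. A sketch of the sampling step (the segmentations and log-probabilities below are hypothetical):

```python
import math
import random

def sample_nbest(segmentations, log_scores, alpha=0.1):
    """Sample one segmentation from an n-best list with p_i ∝ P(w*_i)^alpha."""
    # P^alpha = exp(alpha * log P); subtract the max log-score for stability.
    m = max(log_scores)
    weights = [math.exp(alpha * (s - m)) for s in log_scores]
    total = sum(weights)
    r = random.random() * total
    for seg, w in zip(segmentations, weights):
        r -= w
        if r <= 0:
            return seg
    return segmentations[-1]  # guard against floating-point leftovers

# Hypothetical 3-best segmentations of "abc" with their log-probabilities.
nbest = [['abc'], ['ab', 'c'], ['a', 'b', 'c']]
scores = [math.log(0.25), math.log(0.015), math.log(0.002)]
print(sample_nbest(nbest, scores, alpha=0.1))
```

With a small \alpha the three candidates are drawn with fairly flat probabilities; as \alpha grows, the distribution concentrates on the top-1 segmentation.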
The problem is that "simple and direct" does not necessarily mean "efficient." Although the complexity of searching for the top-n best segmentation schemes is also linear (interested readers can look up materials on N-best Viterbi), it is significantly larger than the Viterbi Decoding that only finds the top-1 (theoretically n times the complexity). Consequently, once random sampling is enabled, it becomes much slower than deterministic tokenization. Therefore, this is not the ideal sampling method I had in mind.
Personal Approach
My thoughts hit a dead end again. In a moment of frustration, I decided to re-evaluate: our goal is to find a random sampling algorithm with a complexity similar to Viterbi Decoding. Since that’s the case, Viterbi Decoding itself should be a good breakthrough point. So, I opened the tokenization code again. The tokenization function at that time looked like this:
```python
def _tokenize(self, bytes text):
    cdef int e, k, s
    cdef double v, score
    cdef list routes = [(0, None)] + [(-INFINITY, None) for _ in text]
    cdef list tokens = []
    for e, (k, v) in self._automaton.iter(text):
        s, e = e - k + 1, e + 1
        score = routes[s][0] + v
        if score > routes[e][0]:
            routes[e] = score, s
    while text:
        s = routes[e][1]
        tokens.append(text[s:e])
        text, e = text[:s], s
    return tokens[::-1]
```

After reading it several times, I finally got an inspiration: the key to Viterbi Decoding is the line `if score > routes[e][0]:`. It keeps the optimal segmentation scheme up to the current position, where `score` is the score (log probability) of the new segmentation scheme and `routes[e][0]` is the best historical score; if the new scheme is better, it overwrites the old one. This reminded me of the acceptance-rate design in MCMC algorithms. If we introduce random sampling at exactly this step, wouldn't the tokenization result become random?
We use r \in \{1, 0\} to represent accepting/rejecting the new scheme. Since this step is just a binary choice, it is very simple to make it probabilistic: \begin{equation} r_i = \left\{\begin{aligned}&\,1\,, \,\, s_i > s_{i-1} \\ &\,0\,, \,\, \text{else}\end{aligned}\right.\qquad\longrightarrow\qquad r_i = \left\{\begin{aligned}&\,1\,, \,\, \varepsilon < \sigma(\alpha(s_i - s_{i-1})) \\ &\,0\,, \,\, \text{else}\end{aligned}\right. \end{equation} Here \varepsilon \sim U[0,1] is a uniform random number, \alpha > 0 is a hyperparameter, \sigma(t)=1/(1+e^{-t}) is the Sigmoid function, and s_i, s_{i-1} are the scores (log probabilities) of the new and old schemes, respectively. It is not hard to see that the deterministic rule on the left is recovered from the random rule on the right in the limit \alpha \to \infty.
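The role of \alpha can be checked numerically. A quick standalone sketch of the acceptance rule (not BytePiece's internal code; the scores are arbitrary examples):

```python
import math
import random

def accept(s_new, s_old, alpha):
    """Accept the new scheme with probability sigma(alpha * (s_new - s_old))."""
    p = 1.0 / (1.0 + math.exp(-alpha * (s_new - s_old)))
    return random.random() < p

# A large alpha recovers the deterministic rule: when s_new > s_old the new
# scheme is (almost) always accepted, otherwise (almost) never.
print(sum(accept(-1.0, -2.0, alpha=100.0) for _ in range(1000)))  # 1000
print(sum(accept(-2.0, -1.0, alpha=100.0) for _ in range(1000)))  # 0

# A small alpha keeps the preference but leaves room for randomness:
# sigma(0.1 * 1.0) ≈ 0.525, so the better scheme wins only slightly more often.
```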
In this way, based on Viterbi Decoding, we obtain a very natural and lightweight random sampling algorithm, which I call "Viterbi Sampling." Implementing it only requires replacing the `if score > routes[e][0]:` criterion with a version that includes a random number. By the monotonicity of the Sigmoid function, when s_i > s_{i-1} the rule assigns a higher acceptance probability to the new scheme, so the original maximum-probability segmentation remains the most probable result under Viterbi Sampling. Furthermore, as s_i - s_{i-1} increases, \sigma(\alpha(s_i - s_{i-1})) also increases, meaning that schemes with higher original scores have a higher probability of being sampled. This maintains the ranking of segmentation schemes to a certain extent (although it hasn't been proven to be strictly order-preserving, from an application perspective approximate order-preservation is sufficient).
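To make the modification concrete, here is a standalone plain-Python sketch of the idea, not BytePiece's Cython implementation; the toy vocabulary and probabilities are invented for illustration:

```python
import math
import random

# Toy unigram vocabulary with invented probabilities, purely for illustration.
VOCAB = {'a': 0.2, 'b': 0.2, 'ab': 0.3, 'abc': 0.25, 'c': 0.05}

def viterbi_sample(text, alpha=0.1):
    """Viterbi Sampling: Viterbi Decoding with the greedy comparison replaced
    by a sigmoid acceptance test; alpha <= 0 falls back to deterministic."""
    n = len(text)
    # routes[e] = (log-score of the kept segmentation of text[:e], last start)
    routes = [(0.0, None)] + [(-math.inf, None)] * n
    for e in range(1, n + 1):
        for s in range(e):
            w = text[s:e]
            if w not in VOCAB or routes[s][0] == -math.inf:
                continue  # unknown word or unreachable prefix
            score = routes[s][0] + math.log(VOCAB[w])
            if alpha <= 0:
                ok = score > routes[e][0]  # original deterministic criterion
            else:
                # Accept with probability sigma(alpha * (score - old_score));
                # against the initial -inf score this probability is 1.
                ok = random.random() < 1.0 / (
                    1.0 + math.exp(-alpha * (score - routes[e][0])))
            if ok:
                routes[e] = (score, s)
    tokens, e = [], n  # backtrack exactly as in Viterbi Decoding
    while e > 0:
        s = routes[e][1]
        tokens.append(text[s:e])
        e = s
    return tokens[::-1]

print(viterbi_sample('abc', alpha=-1))   # deterministic: ['abc']
print(viterbi_sample('abc', alpha=0.1))  # random; always concatenates to 'abc'
```

Whatever is sampled, every returned segmentation still concatenates back to the input, because only the choice among valid paths through the DAG is randomized.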
Simple Test
Starting from version 0.4.0, Viterbi Sampling is built into BytePiece's tokenization function. You only need to pass an `alpha` parameter greater than 0 when calling `tokenizer.tokenize` or `tokenizer.encode` to get random results:

```python
import bytepiece
assert bytepiece.__version__ >= '0.4.0'

tokenizer = bytepiece.Tokenizer('bytepiece_160k.model')
text = 'Today is a nice day'

print(tokenizer.tokenize(text))  # alpha defaults to -1; alpha <= 0 means deterministic
for i in range(5):
    print(tokenizer.tokenize(text, alpha=0.1))

# Example outputs (translated for clarity):
# [b'Today', b' is', b' a', b' nice', b' day']
# [b'Today', b' is', b' a', b' ni', b'ce', b' day']
# [b'Today', b' is', b' a', b' nice', b' day']
# [b'To', b'day', b' is', b' a', b' nice', b' day']
# [b'Today', b' i', b's', b' a', b' nice', b' day']
```

Below is a comparison of the speeds of SentencePiece's Subword Regularization and BytePiece's Viterbi Sampling (both set to \alpha=0.1 for random tokenization):
| | Deterministic Tokenization | Random Tokenization |
|---|---|---|
| SP-BPE | 1.36M bytes/sec | 1.25M bytes/sec |
| SP-Unigram | 5.65M bytes/sec | 1.28M bytes/sec |
| BytePiece | 1.95M bytes/sec | 1.36M bytes/sec |
As can be seen, after enabling Subword Regularization (the "SP-Unigram" row), the tokenization speed drops to less than a quarter of the original, indicating that its sampling algorithm is quite inefficient. In contrast, the Viterbi Sampling proposed in this article drops by only about 30%, which is clearly more efficient. The remaining overhead comes from generating random numbers and evaluating the Sigmoid function; if these two parts were further optimized, the speed could be improved further. As for the BPE model, its random-tokenization counterpart is called BPE Dropout, a method specific to BPE models; interested readers can look into it themselves, as it won't be introduced here.
Summary
This article mainly explored strategies for changing the deterministic tokenization of the Unigram model into random tokenization. Although an existing method called "Subword Regularization" can achieve this goal, its efficiency is relatively low. To this end, I proposed a more efficient sampling algorithm, Viterbi Sampling, which only requires simple modifications to the deterministic Viterbi Decoding, thereby largely maintaining the original efficiency. Experiments show that the sampling speed of the new algorithm significantly exceeds that of Subword Regularization. The corresponding implementation is already built into the latest version of BytePiece.
When reposting, please include the original address of this article: https://kexue.fm/archives/9768

For more detailed reposting matters, please refer to: "Scientific Space FAQ"