In the article "A Brief Exploration of Random Word Segmentation: From Viterbi Decoding to Viterbi Sampling", I proposed a random word segmentation algorithm called "Viterbi Sampling." It is a minor modification of the Viterbi Decoding algorithm used to find the optimal segmentation, so it retains the simplicity and speed of the Viterbi algorithm while being significantly more efficient than the existing Subword Regularization. However, a reader on Zhihu, @HeWu, pointed out that the current sampling algorithm may "dilute" the probability of certain segmentation schemes through its sequence of binary choices. The direct consequence is that the segmentation with the highest original score does not necessarily appear with the highest probability.
After careful consideration, I found that this issue does indeed exist: in my haste to develop a new sampling algorithm, my treatment of the details was somewhat coarse. This article therefore further refines the Viterbi Sampling algorithm and proves that the refined algorithm is equivalent in effect to Subword Regularization.
Problem Analysis
First, let’s look at the original comment:
Subword regularization can guarantee sampling according to probability (with a temperature hyperparameter). In the proposed method, for each end position e, the first route calculated will be "challenged" multiple times, one-on-one. Will the final probability distribution differ substantially from the existing algorithm's?
For example, if "watching" has three splitting methods: "watch ing", "wat ching", and "w atching", and the probabilities are all 1/3, the sampling probability in the proposed scheme would become 1/4 for the first two and 1/2 for the third. Is that so?
The comment is actually very clear. If readers still don’t understand, I will expand on it slightly. Suppose there are three segmentation schemes, each with the same score. Naturally, we expect each scheme to have a 1/3 probability of being sampled. However, Viterbi Sampling transforms a multi-choice sampling process into multiple steps of binary choices:
\begin{equation} r_i = \left\{\begin{aligned}&\,1\,, \,\, s_i > s_{i-1} \\ &\,0\,, \,\, \text{else}\end{aligned}\right.\qquad\longrightarrow\qquad r_i = \left\{\begin{aligned}&\,1\,, \,\, \varepsilon < \sigma(\alpha(s_i - s_{i-1})) \\ &\,0\,, \,\, \text{else}\end{aligned}\right. \end{equation}
In this way, the first two segmentation schemes undergo a binary choice first, each with a probability of \frac{1/3}{1/3+1/3}=1/2. After a result is selected, it is put together with the third scheme for another binary choice. Since the probability is calculated according to their respective scores, the probability for each is still 1/2 at this stage. Thus, in the complete sampling process, the probability of the first two schemes appearing is 1/4, and the probability of the last scheme appearing is 1/2. The later a scheme appears, the more "advantageous" it is, while the earlier schemes have their probabilities diluted more severely. Unfortunately, according to the return order of the BytePiece AC automaton, longer words (which usually have higher scores) appear earlier. Therefore, in Viterbi Sampling, schemes with higher scores are actually more likely to have their probabilities diluted.
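To see the dilution concretely, here is a small Monte Carlo sketch of the old sequential binary-choice rule (toy scores, not the actual BytePiece implementation) with three equally scored schemes; the empirical frequencies come out near 1/4, 1/4, 1/2 rather than the desired 1/3 each:

```python
import math
import random

def old_viterbi_sampling(scores, alpha=1.0, rng=random):
    """Old rule: each new candidate challenges the current winner
    with probability sigmoid(alpha * (s_new - s_winner))."""
    winner = 0
    for i in range(1, len(scores)):
        p_new = 1.0 / (1.0 + math.exp(-alpha * (scores[i] - scores[winner])))
        if rng.random() < p_new:
            winner = i
    return winner

rng = random.Random(0)
scores = [0.0, 0.0, 0.0]  # three schemes with identical scores
n = 100000
counts = [0, 0, 0]
for _ in range(n):
    counts[old_viterbi_sampling(scores, rng=rng)] += 1
freqs = [c / n for c in counts]
print(freqs)  # close to [0.25, 0.25, 0.5]: the last scheme is favoured
```

The later a scheme enters the tournament, the fewer challenges it must survive, which reproduces the 1/4, 1/4, 1/2 distribution from the comment.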
Solution
It now appears that the solution is quite simple: after each binary choice, we should also cache the cumulative weight. From the second step onward, each binary choice compares the new candidate not against the current winner's own score, but against the cumulative weight of all candidates seen so far. This is essentially the well-known "reservoir sampling" algorithm, in its weighted form.
Using the previous example, when the first two segmentation schemes come in, one is selected with a probability of \frac{1/3}{1/3+1/3}=1/2, and their total cumulative probability is 2/3. Next, the winner is compared with the new scheme. The probability of the new scheme being selected should be \frac{1/3}{2/3+1/3}=1/3. That is, it must be compared against the cumulative probability, not just the winner’s own probability. In this way, throughout the complete sampling process, the probability of each segmentation scheme appearing is 1/3.
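The corrected rule can be sketched in the same toy setting (again an illustrative sketch, not the actual bytepiece code): keep a running cumulative weight Z, and accept each new candidate with probability equal to its weight divided by Z.

```python
import math
import random

def new_viterbi_sampling(scores, alpha=1.0, rng=random):
    """Weighted reservoir sampling: candidate i replaces the current
    winner with probability exp(alpha*s_i) / Z_i, where Z_i is the
    cumulative weight of all candidates seen so far."""
    winner = 0
    Z = math.exp(alpha * scores[0])
    for i in range(1, len(scores)):
        w = math.exp(alpha * scores[i])
        Z += w
        if rng.random() < w / Z:
            winner = i
    return winner

rng = random.Random(0)
scores = [0.0, 0.0, 0.0]
n = 100000
counts = [0, 0, 0]
for _ in range(n):
    counts[new_viterbi_sampling(scores, rng=rng)] += 1
freqs = [c / n for c in counts]
print(freqs)  # each frequency close to 1/3
```

With this rule, each candidate's final selection probability is exactly its weight divided by the total weight, regardless of arrival order.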
For Viterbi Sampling, each end position will have multiple segmentation schemes. We need to perform multi-choice sampling over them, where the probability of being selected is constructed from their respective scores as p_i = e^{\alpha s_i}/Z, with Z the normalization factor. Because candidates arrive one by one during the recursion, we do not know in advance how many choices there are, nor can we compute Z up front. However, this is not important: knowing e^{\alpha s_i} is enough, because calculating the conditional sampling probability at each step does not require the full Z, only the running partial sum Z_i:
| Viterbi Decoding | Old Viterbi Sampling | New Viterbi Sampling |
|---|---|---|
| r_i = \left\{\begin{aligned}&\,1\,, \,\, s_i > s_{i-1} \\ &\,0\,, \,\, \text{else}\end{aligned}\right. | r_i = \left\{\begin{aligned}&\,1\,, \,\, \varepsilon < \sigma(\alpha(s_i - s_{i-1})) \\ &\,0\,, \,\, \text{else}\end{aligned}\right. | \begin{aligned}Z_i =&\, Z_{i - 1} + e^{\alpha s_i} \\[1pt] r_i =&\, \left\{\begin{aligned}&\,1\,, \,\, \varepsilon < e^{\alpha s_i} / Z_i \\ &\,0\,, \,\, \text{else}\end{aligned}\right.\end{aligned} |
In actual calculation, due to the risk of exponential explosion, directly caching Z_i has a high probability of overflow. Therefore, we generally cache its logarithm Z^{\log}_i and use the \text{logsumexp} function to avoid overflow:
\begin{equation} \begin{aligned}&\,Z^{\log}_i = \text{logsumexp}(Z^{\log}_{i-1}, \alpha s_i) \\ &\qquad e^{\alpha s_i} / Z_i \to e^{\alpha s_i - Z^{\log}_i} \end{aligned},\qquad \text{logsumexp}(x,y) = \left\{\begin{aligned}&\,x + \log(1+e^{y-x}),\,\, x \geq y \\ &\,y + \log(1 + e^{x-y}),\,\,x < y \end{aligned}\right. \end{equation}
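In code, the numerically stable streaming update can look like this (a minimal sketch following the formula above, with made-up scores):

```python
import math

def logsumexp(x, y):
    """log(e^x + e^y), computed without overflow by factoring out max(x, y)."""
    if x < y:
        x, y = y, x
    return x + math.log1p(math.exp(y - x))

# Streaming update: cache Z^log_i instead of Z_i = Z_{i-1} + e^{alpha*s_i}
alpha = 0.5
scores = [3.0, 1.0, 2.5]  # illustrative candidate scores
Z_log = alpha * scores[0]
for s in scores[1:]:
    Z_log = logsumexp(Z_log, alpha * s)
    accept_prob = math.exp(alpha * s - Z_log)  # e^{alpha*s_i} / Z_i
# Z_log now equals log(sum_i e^{alpha * s_i}), computed stably
print(Z_log, math.log(sum(math.exp(alpha * s) for s in scores)))
```

Factoring out the larger exponent means `math.exp` is only ever called on non-positive arguments, so no intermediate value can overflow.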
The corresponding implementation is already built into bytepiece>=0.5.0.
Perfect Sampling
Overall, the flaws in the old version of Viterbi Sampling occurred because I was too hasty. Now, I am seriously providing a mathematical proof for the new version. Interestingly, it can be proven that the updated Viterbi Sampling is a "perfect sampling" algorithm, just like Subword Regularization.
As introduced before, the approach of Subword Regularization is very "brute-force": it directly finds the top k segmentation schemes with the highest scores and then computes each scheme's selection probability as p_i = e^{\alpha s_i}/Z, where s_i is the score of the i-th scheme. This approach has no flaw other than its high complexity. When k is unrestricted (i.e., all segmentation schemes are found), we obtain a random sample over all segmentation schemes, with each scheme sampled with probability proportional to e^{\alpha s_i}, a monotonically increasing function of the score s_i, so the ranking of sampling probabilities coincides with the ranking of scores. I call an algorithm satisfying these two conditions (sampling over all schemes, with probability monotone in score) "perfect sampling."
Decoding
To prove that the new Viterbi Sampling is also "perfect sampling," let’s first review Viterbi Decoding. Let there be a byte string c_1, c_2, \dots, c_l of length l. Let S^*(c_1, c_2, \dots, c_l) denote the score of the optimal segmentation scheme. If we know that there must be a break between c_k and c_{k+1}, then we must have: \begin{equation} S^*(c_1, c_2, \dots, c_l) = S^*(c_1, c_2, \dots, c_k) + S^*(c_{k+1}, c_{k+2}, \dots, c_l) \end{equation} In other words, a sub-segment of the optimal segmentation scheme must also be the optimal segmentation scheme for the corresponding sub-byte string. This is the fundamental basis of dynamic programming. Of course, in reality, we cannot predict where the breaks will occur, so we must use an exhaustive method: \begin{equation} S^*(c_1,c_2,\cdots,c_l) = \max\left\{\begin{aligned} &\,\color{green}{s\left(\overline{c_1,\cdots,c_l}\right)} \\ \color{red}{S^*(c_1)} \,+&\,\color{green}{s\left(\overline{c_2,\cdots,c_l}\right)} \\ \color{red}{S^*(c_1,c_2)} \,+&\,\color{green}{s\left(\overline{c_3,\cdots,c_l}\right)} \\ \vdots \\ \color{red}{S^*(c_1,\cdots,c_{l-2})} \,+&\,\color{green}{s\left(\overline{c_{l-1},c_l}\right)} \\ \color{red}{S^*(c_1,\cdots,c_{l-1})} \,+&\,\color{green}{s\left(\overline{c_l}\right)} \end{aligned}\right\}\label{eq:core} \end{equation} where s\left(\overline{c_1,\cdots,c_l}\right) refers to the score of the byte string c_1, \dots, c_l as a single token (if it is not a token in the vocabulary, it is recorded as -\infty). Thus, the calculation of S^*(c_1, c_2, \dots, c_l) is transformed into the calculation of S^*(c_1), S^*(c_1, c_2), \dots, S^*(c_1, \dots, c_{l-1}). By extension, the calculation of S^*(c_1, \dots, c_{l-1}) can be transformed into the calculation of S^*(c_1), \dots, S^*(c_1, \dots, c_{l-2}), and so on. That is, the results of S^* can be reused. Therefore, the entire process can be summarized in one sentence:
When scanning to each position, record the optimal segmentation scheme and its score up to that position.
Of course, if we recurse directly according to Equation [eq:core], the theoretical complexity is \mathcal{O}(l^2). However, in reality, it is impossible for every sub-byte string to be a token in the vocabulary. Therefore, methods like Trie trees or AC automata can be used to pre-scan all possible tokens based on the vocabulary. Then the complexity is proportional to the number of candidate tokens searched, which is linear with respect to l. If a numerical estimate is required, assuming the maximum length of a token in the vocabulary is m, then the number of tokens scanned for a byte string of length l \geq m does not exceed: \begin{equation} l + (l - 1) + \dots + (l - m + 1) = lm - \frac{1}{2}m(m-1) = \mathcal{O}(lm) \end{equation}
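As an illustration of the recursion, here is a minimal Viterbi Decoding sketch over a toy vocabulary (the tokens and scores are invented for illustration; BytePiece's real implementation pre-scans candidates with an AC automaton rather than enumerating all substrings):

```python
import math

# Hypothetical toy vocabulary: token -> log-score
vocab = {'w': -3.0, 'a': -3.0, 't': -3.0, 'c': -3.0, 'h': -3.0,
         'i': -3.0, 'n': -3.0, 'g': -3.0,
         'watch': -1.0, 'ing': -1.0, 'wat': -2.0, 'ching': -2.0}

def viterbi_decode(text):
    """best[i] = S*(c_1..c_i): the optimal score of segmenting text[:i],
    computed by the dynamic-programming recursion in the article."""
    n = len(text)
    best = [-math.inf] * (n + 1)
    back = [None] * (n + 1)
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(i):
            token = text[j:i]
            if token in vocab and best[j] + vocab[token] > best[i]:
                best[i] = best[j] + vocab[token]
                back[i] = j
    # Trace back the optimal segmentation (assumes text is segmentable)
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1], best[n]

print(viterbi_decode('watching'))  # best split here is ['watch', 'ing']
```

The double loop makes this sketch \mathcal{O}(l^2); restricting the inner loop to tokens actually found by a Trie or AC automaton yields the \mathcal{O}(lm) bound above.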
Sampling
With the Decoding part as a foundation, understanding Sampling becomes relatively easier. The key still lies in Equation [eq:core]. We use Z(c_1, c_2, \dots, c_l) to represent the normalization factor (perfect sampling) for all segmentation schemes of the byte string c_1, c_2, \dots, c_l. Then we have: \begin{equation} Z(c_1,c_2,\cdots,c_l) = \sum\left\{\begin{aligned} &\,\color{green}{e^{\alpha\cdot s\left(\overline{c_1,\cdots,c_l}\right)}} \\ \color{red}{Z(c_1)} &\, \color{green}{e^{\alpha\cdot s\left(\overline{c_2,\cdots,c_l}\right)}} \\ \color{red}{Z(c_1,c_2)} &\, \color{green}{e^{\alpha\cdot s\left(\overline{c_3,\cdots,c_l}\right)}} \\ \vdots \\ \color{red}{Z(c_1,\cdots,c_{l-2})} &\, \color{green}{e^{\alpha\cdot s\left(\overline{c_{l-1},c_l}\right)}} \\ \color{red}{Z(c_1,\cdots,c_{l-1})} &\, \color{green}{e^{\alpha\cdot s\left(\overline{c_l}\right)}} \end{aligned}\right\}\label{eq:core-2} \end{equation} This equation also indicates that to implement sampling from all segmentation schemes of c_1, c_2, \dots, c_l according to the weight e^{\alpha s}, one can: randomly select one from all segmentation schemes of c_1, \dots, c_{l-1} and append the token \overline{c_l}; randomly select one from all segmentation schemes of c_1, \dots, c_{l-2} and append the token \overline{c_{l-1}, c_l}; randomly select one from all segmentation schemes of c_1, \dots, c_{l-3} and append the token \overline{c_{l-2}, c_{l-1}, c_l}; and so on. After obtaining these l sampling results, select one from them with weights Z(c_1, \dots, c_{l-1}) e^{\alpha\cdot s\left(\overline{c_l}\right)}, Z(c_1, \dots, c_{l-2}) e^{\alpha\cdot s\left(\overline{c_{l-1}, c_l}\right)}, Z(c_1, \dots, c_{l-3}) e^{\alpha\cdot s\left(\overline{c_{l-2}, c_{l-1}, c_l}\right)}, etc.
Next, as in the Decoding case, the calculation of Z(c_1, \dots, c_{l-1}) can reuse the results of Z(c_1), Z(c_1, c_2), \dots, Z(c_1, \dots, c_{l-2}), and the calculation of Z(c_1, \dots, c_{l-2}) can reuse the results of Z(c_1), Z(c_1, c_2), \dots, Z(c_1, \dots, c_{l-3}), and so on. The sampling results are also reusable. Thus, similarly, the entire Sampling algorithm can be summarized in one sentence:
When scanning to each position, perform sampling for all segmentation schemes ending at the current position according to the e^{\alpha s} weights, and record the sampling result and the cumulative weight Z.
If we take the logarithm of both sides, Equation [eq:core-2] can be equivalently rewritten as: \begin{equation} Z^{\log}(c_1,c_2,\cdots,c_l) = \text{logsumexp}\left\{\begin{aligned} &\,\color{green}{\alpha\cdot s\left(\overline{c_1,\cdots,c_l}\right)} \\ \color{red}{Z^{\log}(c_1)} \,+&\,\color{green}{\alpha\cdot s\left(\overline{c_2,\cdots,c_l}\right)} \\ \color{red}{Z^{\log}(c_1,c_2)} \,+&\,\color{green}{\alpha\cdot s\left(\overline{c_3,\cdots,c_l}\right)} \\ \vdots \\ \color{red}{Z^{\log}(c_1,\cdots,c_{l-2})} \,+&\,\color{green}{\alpha\cdot s\left(\overline{c_{l-1},c_l}\right)} \\ \color{red}{Z^{\log}(c_1,\cdots,c_{l-1})} \,+&\,\color{green}{\alpha\cdot s\left(\overline{c_l}\right)} \end{aligned}\right\} \end{equation}
The difference from Equation [eq:core] of Viterbi Decoding is that Z^{\log} replaces S^*, and \text{logsumexp} replaces \max. Since \text{logsumexp} is exactly a smooth approximation of \max, it degenerates to Viterbi Decoding as \alpha \to \infty. On the other hand, in actual calculation, multiple segmentation schemes for the same endpoint arrive one by one rather than all at once. Therefore, it is necessary to transform the single-step "multi-choice" into multi-step "binary choices," which is what was discussed in the "Solution" section. Thus, we have proven (or rather, re-derived from Viterbi Decoding) that the modified Viterbi Sampling is indeed a perfect sampling algorithm, just like Subword Regularization.
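Putting the pieces together, here is a minimal sketch of the new Viterbi Sampling (toy vocabulary and scores invented for illustration, substring enumeration instead of an AC automaton; not the actual bytepiece code). For each end position it streams over candidate splits, maintains Z^{\log} via logsumexp, and keeps one sampled segmentation via weighted binary choices:

```python
import math
import random

def logsumexp(x, y):
    if x < y:
        x, y = y, x
    return x + math.log1p(math.exp(y - x))

def viterbi_sample(text, vocab, alpha=1.0, rng=random):
    """For each end position i, stream over candidate splits (j, token),
    update Z^log(c_1..c_i), and keep one sampled segmentation."""
    n = len(text)
    Z_log = [-math.inf] * (n + 1)   # Z^log(c_1..c_i)
    sample = [None] * (n + 1)       # sampled segmentation ending at i
    Z_log[0], sample[0] = 0.0, []
    for i in range(1, n + 1):
        for j in range(i):
            token = text[j:i]
            if token not in vocab or sample[j] is None:
                continue
            w_log = Z_log[j] + alpha * vocab[token]
            Z_log[i] = logsumexp(Z_log[i], w_log)
            # Weighted binary choice: accept with prob e^{w_log} / Z_i
            if rng.random() < math.exp(w_log - Z_log[i]):
                sample[i] = sample[j] + [token]
    return sample[n]

# Three possible segmentations of 'watching' under this toy vocab,
# with total scores -2, -4 and -6 respectively
vocab = {'watch': -1.0, 'ing': -1.0, 'wat': -2.0, 'ching': -2.0,
         'w': -3.0, 'atching': -3.0}
rng = random.Random(0)
counts = {}
for _ in range(20000):
    seg = tuple(viterbi_sample('watching', vocab, alpha=1.0, rng=rng))
    counts[seg] = counts.get(seg, 0) + 1
print(counts)  # frequencies roughly proportional to e^{alpha * total score}
```

With \alpha = 1 the three schemes have weights e^{-2}, e^{-4}, e^{-6}, i.e. probabilities of roughly 0.87, 0.12, and 0.02, and the empirical frequencies match, consistent with the perfect-sampling property proven above.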
Summary
This article refines the previously proposed random word segmentation algorithm, Viterbi Sampling, and mathematically proves that it is a "perfect sampling" algorithm equivalent in effect to Subword Regularization, while offering significantly higher efficiency in practice.
When reposting, please include the original address: https://kexue.fm/archives/9811
For more detailed reposting matters, please refer to: "Scientific Space FAQ"