Currently, the most popular Tokenizer in Large Language Models (LLMs) is likely Google’s SentencePiece. It adheres to several ideal characteristics of a Tokenizer, such as being language-independent and data-driven. Furthermore, because it is written in C++, its tokenization speed is very fast, making it highly suitable for scenarios where efficiency is paramount. However, it also has some notable drawbacks, such as slow training speed (for the BPE algorithm) and high memory consumption. Additionally, because it is written in C++, it remains a "black box" for most users, making it inconvenient for research and secondary development.
In fact, training a Tokenizer is equivalent to the "new word discovery" of the past. I have previously written a series of articles on Chinese word segmentation and minimum entropy, accumulating some experience in new word discovery. Consequently, I have long harbored the idea of writing my own Tokenizer. Over the past few days, I finally found the time to complete an initial version. Following the example of SentencePiece, I have named it "BytePiece."
Ideal Characteristics
Since we are rewriting the Tokenizer, we must consider what an ideal Tokenizer should look like to judge whether the final result meets expectations. In my view, a Tokenizer should possess at least the following basic characteristics:
Lossless Reconstruction: The tokenization results should be able to be restored to the input without any loss.
High Compression Ratio: For the same vocabulary size, the number of tokens for the same batch of data should be as small as possible.
Language Independence: Based on statistics, neither the training nor the tokenization process should introduce language-specific features.
Data-Driven: It should be capable of unsupervised training directly on raw corpora.
Training-Friendly: It should be able to complete the training process within a reasonable timeframe and configuration.
Finally, there are some bonus points, such as fast tokenization speed, readable code, and ease of secondary expansion. While it is best to satisfy these, I do not consider them essential basic characteristics.
For me, the biggest complaints about SentencePiece concern "Lossless Reconstruction" and "Training-Friendly." First, SentencePiece performs NFKC normalization by default, which leads to irreversible changes such as converting full-width commas to half-width commas. Thus, by default, it does not even satisfy "Lossless Reconstruction." For a long time, it was not on my list of candidates, until I discovered that adding the parameter --normalization_rule_name=identity during training prevents any conversion. So SentencePiece does support lossless reconstruction, but only with specific settings.
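The NFKC issue is easy to demonstrate with Python's standard unicodedata module, which implements the same normalization:

```python
import unicodedata

# NFKC maps the full-width comma (U+FF0C) to the half-width comma
# (U+002C) -- an irreversible change: the original cannot be recovered.
text = "你好，世界"
normalized = unicodedata.normalize("NFKC", text)
print(normalized)          # the full-width comma has become ","
print(normalized == text)  # False
```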
As for training, it is even more frustrating. SentencePiece supports two mainstream algorithms: BPE and Unigram. The training speed of Unigram is acceptable, but its compression ratio is slightly lower. BPE has a higher compression ratio, but its training speed is an order of magnitude slower than Unigram! Moreover, whether using BPE or Unigram, the training process is extremely memory-intensive. In short, using a large corpus to train a SentencePiece model is not a pleasant experience.
Model Design
The construction of a new Tokenizer can be decomposed into three parts: 1. Basic unit; 2. Segmentation algorithm; 3. Training algorithm. Once these three parts are determined, the rest is a matter of programming skill. Below is the reasoning behind BytePiece’s approach to these issues.
Basic Unit
We know that the default string type in Python 3 is Unicode. If we use Unicode as the basic unit, we call it Char-based. Char-based is intuitive and convenient—a Chinese character is represented as a single character of length 1—but there are simply too many characters across different languages. Even covering single characters would consume a massive vocab_size, let alone introducing words. Therefore, like mainstream Tokenizers, BytePiece uses the Byte as its basic unit.
Returning to Bytes makes many problems "suddenly clear." Since there are only 256 different single Bytes, as long as the vocabulary contains these 256 single Bytes, Out-of-Vocabulary (OOV) issues can be eliminated. This is an obvious benefit. Furthermore, we know that the average information entropy of Chinese characters is greater than that of English letters. If we choose a Char-based approach, although each character appears to have a length of 1, the "internal" granularity differs, leading to biased statistical results. In contrast, the information entropy of each Byte is more uniform (for example, the UTF-8 encoding of most Chinese characters corresponds to 3 Bytes, and the average information entropy of a Chinese character is exactly 2 to 3 times that of an English letter, which corresponds to one Byte). Therefore, statistical results using Bytes are more unbiased, making the model more "language-independent."
In terms of being Byte-based, BytePiece is more thorough than SentencePiece. SentencePiece first processes text as Char-based and only falls back to Byte-based for OOV handling. BytePiece, however, converts text to Bytes via text.encode() at the very beginning, before performing any subsequent operations, making it purer by comparison.
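A quick toy illustration of why operating on Bytes guarantees lossless reconstruction: any split of the UTF-8 byte string can be re-joined and decoded back to the original text.

```python
text = "断舍离 tokenizer"
b = text.encode("utf-8")   # operate on bytes from the very beginning
print(len(text), len(b))   # 13 chars vs 19 bytes: each CJK char is 3 bytes

# Any segmentation of the byte string restores the input losslessly:
pieces = [b[:3], b[3:9], b[9:]]
assert b"".join(pieces).decode("utf-8") == text
```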
Segmentation Algorithm
There are only a few algorithms for segmentation based on a dictionary, such as maximum matching, shortest path, and maximum probability path. Readers interested in tracing these can refer to Matrix67’s "Talks on Chinese Automatic Segmentation and Semantic Recognition (Part 1): Chinese Segmentation Algorithms."
Like Chinese word segmentation tools such as jieba, BytePiece chooses maximum probability path segmentation, also known as the "unigram model." There are three considerations for choosing Unigram: first, maximum probability in Unigram is, in other words, maximum likelihood, which is consistent with the training objective of LLMs; second, from a compression perspective, maximum probability actually means the shortest encoding length (also called Minimum Description Length), reflecting maximization of the compression ratio, which aligns with the belief that "compression is intelligence"; third, finding the optimal segmentation scheme for Unigram can be completed in linear complexity using the Viterbi algorithm, which is the theoretically optimal complexity.
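The Viterbi search over the maximum probability path can be sketched in a few lines of Python. This is a hypothetical toy implementation, not BytePiece's actual code; the names `viterbi_segment`, `logp`, and `max_len` are invented for illustration, and single bytes are assumed to be in the vocabulary so every position is reachable:

```python
import math

def viterbi_segment(s, logp, max_len=8):
    """Maximum-probability (Viterbi) segmentation of byte string s,
    given a dict logp mapping piece -> log-probability."""
    n = len(s)
    best = [0.0] + [-math.inf] * n   # best[i]: best score of s[:i]
    back = [0] * (n + 1)             # back[i]: start index of last piece
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = s[j:i]
            if piece in logp and best[j] + logp[piece] > best[i]:
                best[i] = best[j] + logp[piece]
                back[i] = j
    pieces, i = [], n
    while i > 0:                     # backtrack the optimal path
        pieces.append(s[back[i]:i])
        i = back[i]
    return pieces[::-1]

vocab = {b"a": math.log(0.1), b"b": math.log(0.1), b"ab": math.log(0.5)}
print(viterbi_segment(b"abab", vocab))  # [b'ab', b'ab']
```

Each position looks back at most max_len bytes, so the search is linear in the length of the input.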
Of course, since there is a "unigram model," there are naturally more complex "bigram models," "trigram models," etc. However, their increase in complexity far outweighs the benefits they bring, so we generally do not consider these higher-order models.
Training Algorithm
The reason for discussing the segmentation algorithm before the training algorithm is that the optimization goal for training can only be determined once the segmentation algorithm is fixed.
As mentioned at the beginning, Tokenizer training is essentially "new word discovery." I have previously proposed several new word discovery algorithms, such as "New Word Discovery Based on Segmentation", "Unsupervised Segmentation Based on Language Models", and "Better New Word Discovery Algorithms". It now appears that the one most compatible with and promising for the Unigram segmentation algorithm is "Unsupervised Segmentation Based on Language Models". BytePiece’s training is implemented based on this, referred to here as the Byte-based N-gram Language Model (BNLM).
Specifically, for Unigram segmentation, if the optimal segmentation of a byte string c_1, c_2, \dots, c_l of length l is w_1, w_2, \dots, w_m, then the probability product p(w_1)p(w_2)\dots p(w_m) should be the largest among all possible segmentations. Let the lengths of w_1, w_2, \dots, w_m be l_1, l_2, \dots, l_m, respectively. Then, according to the conditional decomposition formula: \begin{equation} \prod_{i=1}^m p(w_i) = \prod_{i=1}^m \prod_{j=L_{i-1} + 1}^{L_{i-1} + l_i} p(c_j|c_{L_{i-1} + 1}, \dots, c_{j-1}) \end{equation} where L_i = l_1 + l_2 + \dots + l_i. Considering only the n-gram model, we approximate p(c_j|c_{L_{i-1} + 1}, \dots, c_{j-1}) for j > L_{i-1} + n with p(c_j|c_{j - n + 1}, \dots, c_{j-1}). Thus, Unigram segmentation is transformed into a character (byte) labeling problem, and Tokenizer training is transformed into n-gram language model training (n=6 is recommended), which can be completed directly in an unsupervised manner. For a more detailed introduction, please refer to the original article "Unsupervised Segmentation Based on Language Models".
(Note: n=6 only means that BytePiece’s statistical information goes up to 6-gram, but it does not mean it can only generate pieces of length 6. Since we approximate conditional probabilities for n-grams where n > 6 using the 6-gram model, it can theoretically generate pieces of any order and length.)
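To make the decomposition concrete, here is a minimal sketch (not BytePiece's actual code) of scoring a piece under a toy n-gram byte language model. The function name `piece_logprob`, the fallback probability, and the model values are all invented for illustration:

```python
import math

def piece_logprob(piece, ngram_logp, n=3):
    # Sum log p(c_j | c_{j-n+1} .. c_{j-1}), truncating each context
    # to at most n-1 preceding bytes, exactly as in the approximation.
    total = 0.0
    for j in range(len(piece)):
        ctx = piece[max(0, j - n + 1):j]
        total += ngram_logp.get((ctx, piece[j:j + 1]), math.log(1e-6))
    return total

# Toy 3-gram model (hypothetical probabilities, for illustration only)
lm = {(b"", b"a"): math.log(0.5),
      (b"a", b"b"): math.log(0.8),
      (b"ab", b"c"): math.log(0.9)}
score = piece_logprob(b"abc", lm)
print(math.exp(score))  # 0.5 * 0.8 * 0.9 ≈ 0.36
```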
Implementation
After the principles were determined, the rest was the tedious work of development. I have managed to write a set of usable code.
The code is simple: it is contained in a single file, with two classes, Trainer and Tokenizer. Tokenization is slightly accelerated by building an AC automaton using pyahocorasick. It works, but it is still significantly slower than SentencePiece, as pure Python cannot compete with C++ in terms of speed. Training is divided into four main steps: 1. n-gram counting; 2. n-gram pruning; 3. pre-segmentation; 4. pruning of pre-segmentation results. Steps 1, 3, and 4 are computationally intensive and parallelizable, so corresponding multi-process implementations were written. With enough processes (I used 64, and each process was nearly at full utilization), the training speed is comparable to SentencePiece's Unigram training.
A special mention should be made regarding the pruning of results. The most basic pruning criteria are frequency and vocab_size, but this is not enough. Sometimes p(w_1)p(w_2) > p(w_1 \circ w_2) (where w_1 \circ w_2 denotes the concatenation of the two words) occurs even when w_1, w_2, and w_1 \circ w_2 are all in the vocabulary. In such cases, w_1 \circ w_2 will never be segmented out, so keeping it in the vocabulary is a waste of space. Therefore, the pruning process also excludes such cases.
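The idea behind this extra pruning step can be sketched as follows. This is a simplified illustration that only checks two-way splits of each piece; all names and probabilities are hypothetical, and the real pruning also applies the frequency and vocab_size criteria:

```python
import math

def prune_redundant(logp):
    """Drop pieces w1∘w2 that can never be segmented out because some
    split w1, w2 already scores higher: p(w1)p(w2) > p(w1∘w2)."""
    pruned = dict(logp)
    for w in list(pruned):
        if len(w) < 2:
            continue  # single bytes must stay to avoid OOV
        for k in range(1, len(w)):
            w1, w2 = w[:k], w[k:]
            if w1 in logp and w2 in logp and logp[w1] + logp[w2] > logp[w]:
                del pruned[w]  # keeping w would be a waste of vocab space
                break
    return pruned

vocab = {b"a": math.log(0.4), b"b": math.log(0.4), b"ab": math.log(0.1)}
print(sorted(prune_redundant(vocab)))  # b"ab" is pruned: 0.4*0.4 > 0.1
```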
Performance Testing
Now for the testing phase. First, a small-scale test: 100,000 samples were randomly selected from the WuDao open-source dataset as the training set (the exported file is about 330MB), and another 1,000 samples were used as the test set. A vocabulary of vocab_size=50k was trained. The results are as follows:
| | Training Time ↓ | Max Memory ↓ | Compression Ratio ↑ |
|---|---|---|---|
| SP-BPE | 55.3 min | 5.2 GB | 4.80 |
| SP-Unigram | 1.6 min | 2.5 GB | 4.73 |
| BytePiece | 6.5 min | 4.3 GB | 5.05 |
Here, SP-BPE and SP-Unigram refer to SentencePiece with model_type set to BPE and Unigram, respectively. The training commands were:
```python
spm.SentencePieceTrainer.train('--input=wudao.txt --model_prefix=wudao_m --vocab_size=50000 --model_type=bpe --train_extremely_large_corpus=true --normalization_rule_name=identity')
spm.SentencePieceTrainer.train('--input=wudao.txt --model_prefix=wudao_m2 --vocab_size=50000 --model_type=unigram --train_extremely_large_corpus=true --normalization_rule_name=identity')
```
The unit for compression ratio is "bytes/token," representing the average number of bytes per token. As seen, BytePiece achieves the highest compression ratio while maintaining a reasonable balance between training time and memory.
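As a sanity check on the metric itself, "bytes/token" can be computed as follows; the toy whitespace tokenizer is purely for illustration, not how any of the tested models segment text:

```python
def compression_ratio(texts, tokenize):
    """bytes/token: average number of UTF-8 bytes covered by one token."""
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    total_tokens = sum(len(tokenize(t)) for t in texts)
    return total_bytes / total_tokens

# Toy tokenizer: split on whitespace (for illustration only)
ratio = compression_ratio(["hello world", "foo bar baz"], str.split)
print(round(ratio, 2))  # 22 bytes / 5 tokens = 4.4
```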
Next, a larger-scale test was conducted. From a mixed Chinese-English corpus (ratio approx. 3:5), 100,000 samples were extracted to train a Tokenizer with vocab_size=100k. The text in this corpus is quite long, so the 100,000 samples export to a file of about 13GB. The test set consists of two parts: 1,000 samples from the same corpus (in-domain) and 1,000 samples from the WuDao dataset (out-of-domain).
| | Training Time ↓ | Max Memory ↓ | Comp. Ratio (In-domain) ↑ | Comp. Ratio (Out-of-domain) ↑ |
|---|---|---|---|---|
| SP-BPE | 19.21 h | 97 GB | 4.52 | 4.46 |
| SP-Unigram | 2.02 h | 384 GB | 4.51 | 4.48 |
| BytePiece | 2.24 h | 51 GB | 5.39 | 4.51 |
Whether in terms of training time, memory, or compression ratio, it appears that the larger the training data, the more advantageous BytePiece becomes!
To Be Continued
Based on current results, BytePiece has certain advantages in training and acceptable segmentation effects. However, being pure Python, its tokenization speed is only about 1/10th that of SentencePiece. This is one of the future directions for optimization, and we look forward to C/C++ experts participating to help improve BytePiece's speed. (Note: Starting from version 0.2.0, Cython has been used to accelerate the tokenization function. BytePiece's speed is now close to that of SP-BPE and can even outperform it when the text is long enough.)
In fact, if techniques like random sampling and dynamic pruning are adopted, BytePiece’s training speed and memory usage can be further optimized. Currently, to ensure deterministic results, BytePiece only prunes after all statistics are collected. If inputs were shuffled and pruning performed periodically, memory usage could be further controlled and statistics gathered faster, with little expected impact on the final result. This work will be introduced later based on user feedback.
Conclusion
This article introduced BytePiece, a Tokenizer I developed. It is a Byte-based Unigram segmenter implemented in pure Python, making it more readable and extensible. Due to the new training algorithm, its compression ratio is generally higher than existing tokenizers, and it supports multi-process acceleration. Furthermore, it operates directly on UTF-8 bytes of text with almost no preprocessing, making it purer and more language-independent.
Please include the original address when reprinting: https://kexue.fm/archives/9752
For more details on reprinting, please refer to: "Scientific Space FAQ"