Modern language models rely on tokenizers to break text into processable sub-word units—but this seemingly technical detail has profound implications for how different writing systems are represented and processed.
The core issue. Tokenizers fragment text differently based on their training corpus and byte encoding strategies. This creates systematic inefficiencies for non-Latin scripts:
- Uncommon kanji. Characters like 斎 (sai, “purification”) are often split into two byte-pair fragments by GPT-4o's tokenizer.
- Common combinations. The surname 鈴木 (Suzuki) reveals divergent strategies: DeepSeek V3 encodes it as two tokens (one per character), while GPT-4o requires three (see the sketch after this list).
- Vendor optimization. Model developers prioritize different scripts depending on target markets and available data.
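You can check these counts yourself. The following is a minimal sketch, assuming the `tiktoken` library is installed and that its `o200k_base` encoding corresponds to GPT-4o; the DeepSeek V3 comparison via Hugging Face `transformers` is left commented out as an assumption (the repo id `deepseek-ai/DeepSeek-V3` may differ from what the vendor actually publishes).

```python
# Sketch: count tokens for a few Japanese strings with a GPT-4o-style encoding.
# Assumes: pip install tiktoken
import tiktoken

def count_gpt4o_tokens(text: str) -> int:
    # "o200k_base" is the encoding tiktoken associates with GPT-4o.
    enc = tiktoken.get_encoding("o200k_base")
    return len(enc.encode(text))

for sample in ["鈴木", "斎", "Suzuki"]:
    print(f"{sample!r}: {count_gpt4o_tokens(sample)} GPT-4o tokens")

# Optional comparison against DeepSeek V3's tokenizer (assumed to be published
# on Hugging Face as "deepseek-ai/DeepSeek-V3"; adjust the repo id if needed):
# from transformers import AutoTokenizer
# ds = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3", trust_remote_code=True)
# print(len(ds.encode("鈴木", add_special_tokens=False)))
```

The point of the exercise is not the exact numbers, which depend on tokenizer versions, but how quickly the per-character token count diverges across vendors for the same script.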
The historical echo. This fragmentation mirrors the character-encoding debates of the 1990s and 2000s. UTF-8 ultimately won for storage and transmission, but tokenizer vocabularies remain fragmented, with each model making its own trade-offs between vocabulary size, compression efficiency, and multilingual coverage.
Practical impact. These differences affect inference costs, context window utilization, and LLM performance across languages—making tokenizer design a crucial, often overlooked pillar of multilingual AI.
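To make the cost and context-window point concrete, here is a rough sketch that compares token counts for roughly parallel English and Japanese sentences. The per-1K-token price below is a placeholder for illustration, not a published rate, and `o200k_base` is again assumed to stand in for GPT-4o.

```python
# Sketch: token inflation translates directly into cost and context pressure.
# Assumes: pip install tiktoken; PRICE_PER_1K is a hypothetical figure.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

pairs = {
    "en": "Mr. Suzuki will attend the purification ceremony tomorrow.",
    "ja": "鈴木さんは明日、お祓いの儀式に出席します。",
}

PRICE_PER_1K = 0.005  # placeholder input price per 1K tokens, illustration only

for lang, text in pairs.items():
    n_tokens = len(enc.encode(text))
    est_cost = n_tokens / 1000 * PRICE_PER_1K
    print(f"{lang}: {n_tokens} tokens, ~${est_cost:.5f} estimated input cost")
```

Whatever the exact ratio, a language that needs more tokens per unit of meaning pays proportionally more per request and fits proportionally less content into the same context window.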
Curious how your model handles multilingual names? Load the pre-filled Suzuki example playground and compare the token bars across GPT-4o and DeepSeek V3.