The 'Chinese Tax' on AI Tokens: Why Non-English Costs More

📅 2026-05-03 · 📁 LLM News

💡 Chinese text consumes up to 2x more tokens than English in most LLMs. Here's why tokenizers create an invisible language surcharge.

Claude Opus 4 launched with a new tokenizer that sent costs soaring for many users — but a curious claim emerged online: Chinese-language users may have dodged the price hike. Even more surprising, some suggested that classical Chinese is more token-efficient than modern Mandarin, meaning you could theoretically save money by chatting with AI in ancient literary style.

These claims illuminate a deeper, often-overlooked problem in the LLM industry: the so-called 'Chinese tax' on tokens, where non-English languages — especially Chinese — systematically consume more tokens for the same semantic content.

Opus 4's Token Shock Sets the Stage

When Anthropic released Opus 4, the official API pricing stayed the same: $5 per million input tokens and $25 per million output tokens. But two changes compounded costs dramatically.

First, the model shipped with a new tokenizer. Second, Claude Code bumped its default reasoning effort from 'high' to 'xhigh.' Together, these changes meant the same task now consumed 2x to 2.7x more tokens than before.
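Because list prices stayed flat, the bill for a task scales directly with its token count. Here is a minimal sketch of that arithmetic. The task size (100,000 input and 20,000 output tokens) is hypothetical, and applying the 2x to 2.7x multiplier equally to input and output is an assumption, since the reports do not break it down:

```python
# Prices quoted in the article, in USD per million tokens.
INPUT_PRICE_PER_M = 5.0
OUTPUT_PRICE_PER_M = 25.0


def task_cost(input_tokens, output_tokens, multiplier=1.0):
    """Return the USD cost of one task after scaling its token counts.

    `multiplier` models the reported token inflation; applying it to both
    input and output is an illustrative assumption.
    """
    scaled_in = input_tokens * multiplier
    scaled_out = output_tokens * multiplier
    return (scaled_in * INPUT_PRICE_PER_M + scaled_out * OUTPUT_PRICE_PER_M) / 1_000_000


if __name__ == "__main__":
    # Hypothetical task: 100k input tokens, 20k output tokens.
    for m in (1.0, 2.0, 2.7):
        print(f"{m:.1f}x tokens -> ${task_cost(100_000, 20_000, m):.2f}")
```

Under these assumptions, a task that cost $1.00 before would cost between $2.00 and $2.70 afterward, even though the per-token price never moved.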

Users on X complained loudly. Some reported burning through their $200 Max subscription in under 2 hours. Independent developer BridgeMind acknowledged Claude as 'the best model in the world' — but also the most expensive. His workaround? He bought two Max subscriptions.

Why Chinese Text Eats More Tokens

The root cause lies in how tokenizers work. Most modern LLMs use a method called Byte Pair Encoding (BPE), which builds a vocabulary by iteratively merging the most frequent pair of adjacent symbols (characters or raw bytes) in a training corpus.
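The merge loop can be sketched in a few lines. This is a toy byte-level trainer, not any vendor's actual tokenizer: the corpus, the function names, and the merge budget are all made up for illustration, and production tokenizers add pre-tokenization rules and far larger vocabularies.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Return the most common adjacent symbol pair, or None if there are no pairs."""
    counts = Counter()
    for seq in sequences:
        for left, right in zip(seq, seq[1:]):
            counts[(left, right)] += 1
    # Ties go to the pair seen first, since Counter keeps insertion order.
    return max(counts, key=counts.get) if counts else None


def merge_pair(seq, pair):
    """Replace every non-overlapping occurrence of `pair` in `seq` with one merged symbol."""
    merged = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merges, starting from single UTF-8 bytes."""
    sequences = [[bytes([b]) for b in word.encode("utf-8")] for word in words]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        merges.append(pair)
        sequences = [merge_pair(seq, pair) for seq in sequences]
    return merges, sequences


if __name__ == "__main__":
    # Toy corpus: English dominates, Chinese appears once.
    corpus = ["the"] * 5 + ["人工"]
    merges, sequences = train_bpe(corpus, num_merges=2)
    print("learned merges:", merges)
    print("'the' ->", len(sequences[0]), "token(s)")
    print("'人工' ->", len(sequences[-1]), "token(s)")
```

With only two merges available, both go to the frequent English word, which collapses to a single token, while the rare two-character Chinese word stays split into its six raw bytes.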

Since training data is overwhelmingly English, the resulting vocabulary is optimized for English text. Common English words like 'the,' 'information,' or 'development' often map to a single token. Chinese characters, by contrast, get fragmented into multiple byte-level tokens because they appear far less frequently in training data.

Here is what that looks like in practice:

The English phrase 'artificial intelligence' typically costs 2 tokens
Its Chinese equivalent '人工智能' can cost 4-6 tokens depending on the tokenizer
A 1,000-word English article might use ~1,300 tokens
The same content in Chinese could consume ~2,000-2,600 tokens
This creates an effective 50-100% surcharge for Chinese-language users
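The surcharge range follows directly from the token estimates above. A small sketch of the arithmetic, using only the approximate figures quoted in this article (the helper name is ours):

```python
def surcharge(base_tokens, other_tokens):
    """Return the fractional extra token cost of `other_tokens` relative to `base_tokens`."""
    return (other_tokens - base_tokens) / base_tokens


if __name__ == "__main__":
    english = 1_300  # approximate tokens for a 1,000-word English article
    for chinese in (2_000, 2_600):  # approximate range for the same content in Chinese
        print(f"{chinese} vs {english} tokens: {surcharge(english, chinese):.0%} surcharge")
```

On these estimates the low end works out to about 54% and the high end to exactly 100%, which is where the rough 50-100% range comes from.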

The problem is not unique to Chinese. Japanese, Korean, Thai, Arabic, and other non-Latin scripts all face similar inflation. But Chinese is particularly affected because its common characters take 3 bytes each under UTF-8 (rarer ones take 4), and many characters fall outside the tokenizer's learned merge patterns.
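The byte widths are easy to check with Python's built-in UTF-8 encoder. This sketch measures bytes only, not tokens: the byte count is the worst case for a byte-level tokenizer that has learned no merges for a string, and actual token counts depend on the specific tokenizer.

```python
def utf8_widths(text):
    """Return a list of (character, UTF-8 byte length) pairs for `text`."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


if __name__ == "__main__":
    for sample in ("AI", "人工智能"):
        total = len(sample.encode("utf-8"))
        print(sample, utf8_widths(sample), "total bytes:", total)

    # A rarer character from outside the Basic Multilingual Plane (U+20000).
    rare = "\U00020000"
    print("U+20000 byte length:", len(rare.encode("utf-8")))
```

Each ASCII letter is one byte, each of the four characters in '人工智能' is three, so a tokenizer with no useful merges for them could spend up to 12 tokens on a four-character word.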

The Classical Chinese Paradox

The claim that classical Chinese (文言文) is more token-efficient than modern Mandarin is technically plausible — and it reveals something interesting about information density.

Classical Chinese is extraordinarily compact. A single character in literary Chinese often carries the semantic weight of an entire modern phrase. The famous open

📌 Source: GogoAI News (www.gogoai.xin)

🔗 Original: https://www.gogoai.xin/article/chinese-tax-ai-tokens-why-non-english-costs-more

⚠️ Please credit GogoAI when republishing.
