Writing in Italian Costs More Than You Think
Tokenizers are optimized for English. Here's how much extra you pay when prompting in Italian — and how to measure it yourself.
Universal Declaration of Human Rights
D ich iar azione Univers ale dei Dir itti U mani
This is, roughly, what LLMs see.
Even if you don't know Italian, you can immediately sense that the second representation is more fragmented, and therefore harder to read. Let’s look a bit more closely.
How tokenization works
Modern LLMs don’t process raw text directly. Not word by word, not character by character. They first split it into tokens: chunks chosen from a predefined dictionary learned during training. The details of the algorithm matter less than this practical fact: if the tokenizer’s dictionary contains good whole chunks for a language, that language is encoded efficiently. If not, words get broken into more pieces.
The key point is simple: the tokenizer was learned mostly from English-heavy text. OpenAI models such as GPT-4o and GPT-5 use a vocabulary of roughly 200,000 entries called o200k_base. The tokenizer is optimized for patterns that appear often in English. As a result, words like “Universal”, “Declaration”, and “Human” may fit into a single token, while their Italian equivalents — “Universale”, “Dichiarazione”, “Umani” — are more likely to be split into two or more.
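Since the tokenizer ships as the open-source tiktoken library, you can poke at that vocabulary directly. Here is a minimal sketch; the words are just the ones from the title used throughout this post:
import tiktoken

# o200k_base is the vocabulary used by recent OpenAI models
enc = tiktoken.get_encoding("o200k_base")
print(enc.n_vocab)  # vocabulary size: roughly 200,000 entries

# How many pieces does each word become?
for word in ["Universal", "Universale", "Declaration", "Dichiarazione"]:
    print(word, len(enc.encode(word)))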
The result: the same semantic content expressed in Italian requires more tokens than in English.
English
Universal Declaration of Human Rights
Italian
D ich iar azione Univers ale dei Dir itti U mani
To make the difference easier to see, the examples above show each token as a separate piece. That makes it easy to spot when individual words are broken apart, and when they are not.
The UDHR experiment
The Universal Declaration of Human Rights is a useful benchmark because it is available in an unusually large number of official translations, making it especially helpful for extending this analysis across languages. It also carries enormous human significance.
That makes it a good place to move from intuition to measurement. Since OpenAI’s tokenizer is open source, it is easy to inspect how it splits the text. For the example at the top of the page, you can use a simple script like this:
import tiktoken

# GPT-5 uses the o200k_base vocabulary
encoding = tiktoken.encoding_for_model("gpt-5")

samples = [
    'Universal Declaration of Human Rights',
    'Dichiarazione Universale dei Diritti Umani',
]

for text in samples:
    # Encode to token IDs, then map each ID back to the text piece it represents
    token_ids = encoding.encode(text)
    token_pieces = [encoding.decode([token_id]) for token_id in token_ids]
    print(token_ids)
    print(token_pieces)
Let’s check what the tokens are:
| | Token ids |
|---|---|
| English | 84451, 78572, 328, 16296, 19799 |
| Italian | 35, 703, 9934, 10436, 11912, 1167, 12749, 22046, 34203, 601, 70740 |
To decode these IDs back into text pieces, the tokenizer looks them up in its vocabulary, which is just a map from IDs to token strings.
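For example, looking up a few of the IDs from the two titles (this reuses the encoding object from the script above):
for token_id in [84451, 78572, 328, 35, 703, 9934]:
    print(token_id, repr(encoding.decode([token_id])))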
84451 'Universal'
78572 ' Declaration'
328 ' of'
35 'D'
703 'ich'
9934 'iar'
| | Tokens |
|---|---|
| English | 'Universal', ' Declaration', ' of', ' Human', ' Rights' |
| Italian | 'D', 'ich', 'iar', 'azione', ' Univers', 'ale', ' dei', ' Dir', 'itti', ' U', 'mani' |
In this example, the English title maps cleanly to one token per word, while the Italian one breaks most words into two or more pieces. That is not always true, but it is common enough to matter. In summary, for the sample titles:
| | Characters | Words | Tokens |
|---|---|---|---|
| English | 37 | 5 | 5 |
| Italian | 42 | 5 | 11 |
| Difference | +13% | — | +120% |
Same number of words, just +13% more characters… but +120% tokens.
If we look at the full declaration, the picture becomes more precise:
| | Characters | Words | Tokens | Tokens/Words |
|---|---|---|---|---|
| English | 10678 | 1752 | 2010 | 1.15 |
| Italian | 11974 | 1846 | 2858 | 1.55 |
| Difference | +12% | +5% | +42% | +35% |
Italian is somewhat more verbose by nature: here it uses 12% more characters and 5% more words. But that does not explain a 42% increase in tokens. Most of that gap comes from the tokenizer vocabulary itself, which, as mentioned above, has a strong bias toward English.
Tokenization excerpt: Article 1
EN
Article 1
All human beings are born free and equal in dignity and rights . They are endowed with reason and conscience and should act towards one another in a spirit of brother hood .
IT
Art icolo 1
T utti gli ess eri um ani nas cono lib eri ed eg uali in dign ità e dir itti . Ess i sono dot ati di rag ione e di cos c ienza e devono ag ire gli uni verso gli altri in spir ito di frat ell anza .
Why it matters
This is not just a curiosity about tokenization. It has practical effects:
- Cost. More tokens means higher API bills for the same task.
- Efficiency. If a system processes tokens at a roughly fixed throughput, more tokens for the same meaning means fewer useful words per second.
- Context window. A longer tokenized representation fills the window faster. With a +42% token overhead, the same context budget carries about 30% less content (1/1.42 ≈ 0.70).
- Internal translation/reasoning overhead. This is an inference, not something exposed directly by the API, but if a model reasons more effectively in English, an Italian prompt may induce extra hidden steps. In general, reasoning tokens still count toward output usage and occupy context window space.
Let’s put numbers on this using the percentages measured above.
Imagine your provider charges $1.75 / 1M input tokens and $14.00 / 1M output tokens.
Assume:
- a typical chatbot session: 20 turns
- average usage per turn in English: 300 input tokens and 400 output tokens
- 10,000 sessions per month
English baseline
- Input: 20 × 300 × 10,000 = 60M tokens → $105
- Output: 20 × 400 × 10,000 = 80M tokens → $1,120
- Total: $1,225/month
Italian, using the measured +42% token overhead on both input and output
- Input: 60M × 1.42 ≈ 85M tokens → $149
- Output: 80M × 1.42 ≈ 114M tokens → $1,590
- Total: $1,740/month
That is $515/month extra. At 100,000 sessions per month, the gap becomes about $5,150/month.
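If you want to redo this arithmetic for your own traffic, here is a minimal sketch; the prices, per-turn token counts, and the +42% overhead are exactly the assumptions listed above, so swap in your own numbers:
# Assumptions from the example above: swap in your own numbers
input_price, output_price = 1.75, 14.00      # $ per 1M tokens
turns, sessions = 20, 10_000
input_per_turn, output_per_turn = 300, 400   # English baseline per turn
overhead = 0.42                              # measured Italian token overhead

def monthly_cost(multiplier):
    input_tokens = turns * input_per_turn * sessions * multiplier
    output_tokens = turns * output_per_turn * sessions * multiplier
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

english = monthly_cost(1.0)
italian = monthly_cost(1.0 + overhead)
print(f"English: ${english:,.0f}/month")   # ~$1,225
print(f"Italian: ${italian:,.0f}/month")   # ~$1,740
print(f"Extra:   ${italian - english:,.0f}/month")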
Theory or practice?
But can we trust the results we get from a tokenizer library, or is this just a didactic tool with little correspondence to real usage?
You can verify it in practice.
The annoying part is that input token counts are messy in real tools: they get polluted by system prompts, tool traces, cached context, wrappers, and provider-specific overhead. Output token counts are much more stable, because you can force the model to emit a predictable string and measure that.
A simple trick is to prompt:
Repeat verbatim: <text of the declaration>
If the model does what it is told, the output is just the declaration repeated back. That gives you a much cleaner way to compare English and Italian output token counts.
Here are three practical ways to do it.
Codex with GPT-5.4
codex exec "Repeat verbatim:$(<data/udhr_english.txt)" --model gpt-5.4
codex exec "Repeat verbatim:$(<data/udhr_italian.txt)" --model gpt-5.4
Open your session log in ~/.codex/session and look for total_token_usage:
{
  "type": "token_count",
  "info": {
    "total_token_usage": {
      "input_tokens": 11057,
      "cached_input_tokens": 7040,
      "output_tokens": 2947,
      "reasoning_output_tokens": 83,
      "total_tokens": 14004
    }
  }
}
While reasoning_output_tokens varies a bit across runs, the difference output_tokens - reasoning_output_tokens is stable. That gives us the number we are looking for.
I found 2016 for English and 2864 for Italian.
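If you prefer to pull that number out of the log programmatically, a rough sketch along these lines works. It assumes the session log is JSON Lines with entries shaped like the excerpt above, and it takes the log path as an argument because session file names vary:
import json
import sys
from pathlib import Path

# Usage: python count_output_tokens.py <path-to-session-log>
log_path = Path(sys.argv[1])

usage = None
for line in log_path.read_text(encoding="utf-8").splitlines():
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        continue
    # Keep the most recent token_count entry (field names from the excerpt above)
    if entry.get("type") == "token_count":
        usage = entry.get("info", {}).get("total_token_usage")

if usage:
    print(usage["output_tokens"] - usage["reasoning_output_tokens"])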
Note. I had to edit the English version slightly to bypass a content filter. After some experimentation, changing this part of the preamble:
in cooperation with the United Nations,
to:
in United cooperation with the Nations,
did the trick and avoided the content filter issue. This change does not affect the token count, but it is peculiar enough that I am investigating it further and will write about it in another post.
Claude with Sonnet 4.6
claude "Repeat verbatim:$(<data/udhr_english.txt)" --print --verbose --output-format stream-json
claude "Repeat verbatim:$(<data/udhr_italian.txt)" --print --verbose --output-format stream-json
"modelUsage": {
"claude-sonnet-4-6": {
"inputTokens": 3,
"outputTokens": 2152,
"cacheReadInputTokens": 6284,
"cacheCreationInputTokens": 10965,
...
Here I found 2152 for English and 3437 for Italian.
Open models
You can run the same comparison with open-weight models through Hugging Face transformers. As a minimal example, here is how to inspect a single tokenizer:
from transformers import AutoTokenizer
from pathlib import Path

# Only the tokenizer is needed to count tokens; no model weights are downloaded
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-27b-it")

english = Path("data/udhr_english.txt").read_text(encoding="utf-8")
italian = Path("data/udhr_italian.txt").read_text(encoding="utf-8")

# Count tokens without BOS/EOS special tokens, so only the raw text is compared
english_tokens = tokenizer(english, add_special_tokens=False)["input_ids"]
italian_tokens = tokenizer(italian, add_special_tokens=False)["input_ids"]

print("English:", len(english_tokens))
print("Italian:", len(italian_tokens))
Results across models
Here is the comparison across the models I tested:
| Model | English | Italian | Difference |
|---|---|---|---|
| OpenAI/GPT-5.4 | 2016 | 2864 | +42.1% |
| Anthropic/Sonnet 4.6 | 2152 | 3437 | +59.7% |
| nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 | 2051 | 2782 | +35.6% |
| Qwen/Qwen3.5-27B | 2087 | 2634 | +26.2% |
| google/gemma-3-27b-it | 2070 | 2714 | +31.1% |
| mistralai/Mistral-7B-Instruct-v0.3 | 2313 | 3786 | +63.7% |
| deepseek-ai/DeepSeek-V3.2 | 2376 | 3557 | +49.7% |
What you can actually do
1. Write system prompts in English, even for Italian-facing products. The model understands English instructions regardless of what language the user writes in. A system prompt that costs 1,000 tokens in English can easily cost 1,300-1,400 tokens in Italian, so keeping it in English saves a few hundred tokens on every single request.
2. Compress instruction language. Bullet points and short sentences tokenize better than flowing prose in any language.
3. Monitor token usage per request. Both the OpenAI and Anthropic APIs return token counts in every response. Log them. You’ll see the pattern quickly.
4. Use tiktoken in CI to catch prompt bloat. If you version-control your system prompts, a simple token count check in CI catches regressions before they hit production; see the sketch after this list.
5. Consider smaller models for simpler tasks. If your local-language task is classification or extraction, a smaller model with a lower per-token cost may offset the tokenization overhead better than a flagship model.
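As a concrete starting point for point 4, here is a minimal sketch of such a check; the prompt path and the token budget are placeholders to adapt to your repository:
import sys
import tiktoken

# Placeholders: point these at your own prompt file and budget
PROMPT_FILE = "prompts/system_prompt.txt"
MAX_TOKENS = 1200

encoding = tiktoken.get_encoding("o200k_base")
with open(PROMPT_FILE, encoding="utf-8") as f:
    n_tokens = len(encoding.encode(f.read()))

print(f"{PROMPT_FILE}: {n_tokens} tokens (budget: {MAX_TOKENS})")
sys.exit(0 if n_tokens <= MAX_TOKENS else 1)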
What providers should do (but probably won’t)
1. Train models that are either more language-neutral or explicitly better optimized for major non-English languages. If you know a model will be used heavily in Italian, Spanish, Arabic, Hindi, or other large languages, English should not remain the hidden default. “Multilingual” should not mean “works, but with a tax.”
2. Move beyond tokenization as the long-term default. Subword tokenization is a technical workaround, not a law of nature. Papers like ByT5: Towards a token-free future with pre-trained byte-to-byte models point to a more language-agnostic direction: models that operate directly on bytes, avoid language-specific vocabularies, and reduce the structural advantage of English. That approach still has real tradeoffs in sequence length and inference cost, but it is a better long-term answer than pretending tokenizer bias is inevitable.
3. Be more honest in product and pricing documentation. OpenAI’s own help center says its models are
Optimized for English, but trained on multilingual data
That is true, but operationally incomplete. If you use another language, you should often expect both higher token costs and, depending on the task, weaker performance. The honest version would be much closer to:
Optimized for English, other languages cost more and work worse
or
Use your own language. Your extra money helps pay for more English training
Okay, I probably won’t get a job offer from OpenAI marketing for that.
Conclusion
The tokenization gap between English and Italian is real and easy to measure. In the tests shown here, Italian consistently required more tokens than English, with the gap ranging from about +26% to +64% depending on the model.
This analysis was done on Italian, but it is easy to extend it to other languages using the same method. I would expect the general pattern to remain the same: languages that are less favored by English-centric tokenizers will require more tokens to express the same content.
This does not look like a quirk of one tokenizer. The same pattern appears across both flagship commercial models and open models, which suggests a broader structural bias toward English in current tokenization schemes.
That makes this more than a technical footnote. It is also a product and pricing issue, and providers should be much clearer about it than they usually are.
This is not an argument against using LLMs in Italian. It is an argument for being deliberate about where language choice matters operationally: system prompts, internal tooling, and long conversation histories are all places where English can reduce token costs, while Italian may still be the right choice for user-facing output.
The good news is that this is easy to test on your own workload. The tooling is public, the scripts are simple, and the cost difference is large enough to be worth measuring.
First published in October 2025. Updated in March 2026.