Do Language Models Count Letter R's?

The Hidden Cost of Tokenization and Preprocessing

Mar 25, 2025

Every time a new large language model drops — be it GPT-4.5, Claude 3.7, or a 72B-parameter open-source beast — the same tired ritual floods tech forums. Someone inevitably posts: "Hey [Model], how many R's are in 'strawberry'?", followed by mockery when the answer fails. But here's the uncomfortable truth: this test measures nothing about intelligence.

When you type "strawberry", the model doesn’t see S-T-R-A-W-B-E-R-R-Y. It sees something else entirely — a fragmented version of your text, reassembled through rules no user ever sees.

The result? A model might claim there are two R’s, or one, or even none. But before declaring it "dumb", consider this: the error isn’t in the model’s logic, but in a silent translation layer operating behind the scenes. The real question isn’t why models fail this test — it’s why we keep administering a benchmark that misunderstands their most basic design constraint.

What exactly happens to your text before the model "sees" it? And why does this invisible process sabotage simple tasks like letter counting? Let’s dissect the unsung culprit: the tokenizer.

[Figure: counting the R's in "strawberry"]

What is a Tokenizer?

A tokenizer is the first and most opinionated gatekeeper of your LLM. Think of it as a translator that converts human-readable text into a secret code the model understands — except this translator has strict rules:

  1. It fragments text into small pieces called tokens, chosen by how frequently those pieces appear in training text.
    Example input: "Do Language Models Count Letter R's?"
    Tokenized: ["Do", " Language", " Models", " Count", " Letter", " R's", "?"]

  2. It forgets original characters: Once text is tokenized, letters like R become prisoners inside token blocks. The model sees "Letter", not L-e-t-t-e-r.

  3. It uses a dictionary to map tokens to integer IDs.
    Example token to ID mapping (illustrative):

    • "Do" -> 1547
    • " Language" -> 2938
    • " Models" -> 2102
    • " Count" -> 867
    • " Letter" -> 1512
    • " R's" -> 1229
    • "?" -> 30

Try this interactively at Tiktokenizer
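
You can see the same effect programmatically. Below is a minimal sketch using OpenAI's tiktoken library with the cl100k_base encoding (one tokenizer among many; the splits and IDs shown earlier are illustrative and will differ here):

```python
# Minimal sketch: inspect how a BPE tokenizer fragments text.
# Requires: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # one of OpenAI's public encodings

text = "Do Language Models Count Letter R's?"
ids = enc.encode(text)                   # text -> integer token IDs
pieces = [enc.decode([i]) for i in ids]  # each ID back to its text fragment

print(ids)     # the only thing the model ever receives
print(pieces)  # the fragments those IDs stand for
```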

Decoder-Only LLM's Minimalist Pipeline

[Input] → [Tokenizer] → [Core Processing] → [Tokenizer] → [Output]

Let’s try the sentence: "Count R's in 'strawberry'?"

1. Input → Tokenizer

The tokenizer breaks it down into:
["Count", " R's", " in", " 'st", "raw", "berry", "'?"]
then maps each piece to a token ID and feeds the ID sequence into the core model.

2. Core Processing

This black box processes the input token sequence (e.g., [867, 1229, 451, 9102, 2981, 1104, 341]) and outputs a predicted sequence (e.g., [2341, 15, 132]). The prediction is driven by statistical associations learned from training data, not by inspecting individual characters.

3. Tokenizer (Output Phase)

The output token IDs get decoded back to human-readable text.
For example: [2341, 15, 132] → "Answer: 2"
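
Here is the same round trip as a sketch, again assuming tiktoken's cl100k_base (the IDs in the walkthrough above are illustrative, so a real tokenizer will produce different numbers):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# 1. Input phase: text -> token IDs
prompt_ids = enc.encode("Count R's in 'strawberry'?")
print(prompt_ids)

# 2. Core processing would happen here: the model predicts output token IDs.
#    We fake that step by encoding the answer string ourselves.
output_ids = enc.encode("Answer: 2")

# 3. Output phase: token IDs -> text
print(enc.decode(output_ids))  # "Answer: 2"
```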

The Brutal Reality

The model never truly saw the full word — it guessed based on token relationships.

| Stage | What you think happens | What actually happens |
| --- | --- | --- |
| Input | Model sees "strawberry" | Tokenizer fragments it into "straw" + "berry" |
| Core Processing | Logical counting of characters | Pattern matching: "berry" often paired with "2 R's" |
| Output | Precise, deterministic answer | Probabilistic guess based on training patterns |
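
To make the contrast concrete, here is deterministic character counting next to the fragments a tokenizer would actually hand the model (a sketch; the exact split depends on the tokenizer you pick):

```python
import tiktoken

word = "strawberry"

# Character-level counting: deterministic, always 3
print(word.lower().count("r"))

# Token-level view: a handful of opaque fragments, no individual letters
enc = tiktoken.get_encoding("cl100k_base")
print([enc.decode([i]) for i in enc.encode(word)])
```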

CLIP’s Hidden Preprocessor

Just as LLMs don't see raw text, vision models like CLIP never truly "see" your images. Here’s the hidden distortion pipeline:

[Your Image] → [CLIP Preprocessor] → [Model "Sees"] → [Output]

CLIP promises a unified way to process images — but its standardized 224x224 input hides distortions that vary by source resolution and DPI. Let’s compare two icons through the CLIP pipeline.
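
Before the case study, here is roughly what that standardization looks like in code. This is a sketch approximating CLIP's published preprocessing with torchvision (the authoritative transform is the preprocess function returned by the model's own loader), and the file name is hypothetical:

```python
# Sketch of a CLIP-style image preprocessing pipeline (approximation, not the official code).
# Requires: pip install torch torchvision pillow
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),  # shorter side -> 224
    transforms.CenterCrop(224),                                                  # square crop
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),               # CLIP's normalization stats
                         std=(0.26862954, 0.26130258, 0.27577711)),
])

img = Image.open("icon.png").convert("RGB")  # hypothetical path to one of the icons
tensor = preprocess(img)
print(tensor.shape)  # torch.Size([3, 224, 224]), regardless of the original size
```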

Case Study: Two Displays, Two Realities

Image 1: Old Apple Cinema Display (2K)

[Figure: original icon (64x85) vs. preprocessed icon (224x224)]

Jagged edges dominate. The original circle now resembles a hexagon.

Image 2: Modern MacBook Pro (5K)

[Figure: original icon (104x98) vs. preprocessed icon (224x224)]

Curves remain intact. The model sees a more faithful representation.
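
The difference comes down to how much interpolation each icon needs. A quick back-of-the-envelope sketch, assuming a resize that stretches the shorter side to 224 (sizes taken from the case study above):

```python
# Approximate upscaling factor before the model "sees" each icon.
sizes = {
    "cinema_display_icon": (64, 85),   # width, height
    "macbook_pro_icon": (104, 98),
}

for name, (w, h) in sizes.items():
    scale = 224 / min(w, h)  # shorter side stretched to 224
    print(f"{name}: ~{scale:.1f}x upscaling")
# cinema_display_icon: ~3.5x, macbook_pro_icon: ~2.3x
```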


Key Takeaways: The Input Pipeline Paradox

The Core Lesson

A model’s competence depends as much on its input pipeline as its architecture. What we call "AI errors" are often preprocessing side effects.



Have questions or feedback?
Open an issue