Why AI Can't Count the E's in 'Seventeen'

💡 A viral video exposing AI models' inability to count letters in simple words has reignited debate about LLM hallucination and tokenization limits.

A deceptively simple question — 'how many e's are in the word seventeen?' — has become the internet's favorite test for exposing a fundamental flaw in today's most advanced AI systems. A viral video circulating across social media platforms shows multiple large language models confidently delivering wrong answers to this elementary letter-counting task, sparking widespread discussion about the nature of AI hallucination and why billion-dollar models stumble on problems a 7-year-old can solve.

The correct answer is 4. The word s-e-v-e-n-t-e-e-n contains exactly 4 instances of the letter 'e.' Yet models from OpenAI, Google, Meta, and others frequently answer 2 or 3, sometimes providing elaborate but entirely fabricated reasoning to justify their incorrect responses.
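The count is easy to confirm in code. The short Python check below is an illustration added here, not something shown in the video; it lists the position of each 'e' and then counts them.

```python
word = "seventeen"

# 1-based positions of every 'e' in the word.
positions = [i for i, ch in enumerate(word, start=1) if ch == "e"]

print(len(word))        # 9
print(positions)        # [2, 4, 7, 8]
print(word.count("e"))  # 4
```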

Key Takeaways

  • Letter counting is a consistently difficult task for LLMs due to how they process text through tokenization
  • The word 'seventeen' contains 4 e's, but many AI models answer incorrectly and with high confidence
  • This failure stems from architectural limitations, not a lack of training data
  • The viral video has been viewed millions of times, fueling public skepticism about AI reliability
  • Newer reasoning models like OpenAI's o1 and o3 perform better but still aren't immune
  • The problem highlights the gap between AI's apparent intelligence and genuine understanding

Tokenization Is the Root Cause, Not Stupidity

Tokenization is the process by which LLMs break text into smaller units called tokens before processing. These tokens are not individual characters. They are subword chunks drawn from a vocabulary that is built from text statistics before the model itself is trained. For example, the word 'seventeen' might be tokenized as 'seven' + 'teen' or even as a single token, depending on the model's vocabulary.
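To make the idea concrete, here is a minimal sketch of a subword tokenizer. Everything in it is an assumption for illustration: the vocabulary `TOY_VOCAB`, its ID numbers, and the greedy longest-match rule are hypothetical, and production tokenizers typically use merge-based schemes such as byte-pair encoding with far larger vocabularies.

```python
# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of entries.
TOY_VOCAB = {"seven": 101, "teen": 102, "s": 1, "e": 2, "v": 3, "n": 4, "t": 5}


def toy_tokenize(text: str) -> list[str]:
    """Greedy longest-match split of `text` into pieces found in TOY_VOCAB."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a vocab entry matches.
        for j in range(len(text), i, -1):
            if text[i:j] in TOY_VOCAB:
                pieces.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return pieces


pieces = toy_tokenize("seventeen")
token_ids = [TOY_VOCAB[p] for p in pieces]

print(pieces)     # ['seven', 'teen']
print(token_ids)  # [101, 102]
```

In this sketch the model-facing input is the list `[101, 102]`. Nothing in those two numbers says how many e's the original word contained.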

This means the model never actually 'sees' individual letters. It processes semantic meaning at the token level, making character-level tasks like counting specific letters fundamentally misaligned with its architecture. When you ask GPT-4 or Claude how many e's appear in a word, the model is essentially guessing based on patterns it has seen during training rather than performing actual character-by-character analysis.

Compare this to how a calculator handles math. A calculator operates on numbers directly. An LLM operates on tokens — abstract representations of language chunks. Asking an LLM to count letters is like asking a translator to do plumbing. The tool simply was not designed for the task.

Why Confident Wrong Answers Are More Dangerous Than Silence

The viral video doesn't just show AI getting the answer wrong. It shows AI getting the answer wrong with absolute confidence. Several models in the demonstration respond with statements like 'there are 2 e's in the word seventeen' and then proceed to provide step-by-step breakdowns that appear logical but contain critical errors.

This phenomenon is what researchers call confabulation: the model generates plausible-sounding explanations that are factually incorrect. The model might spell out 's-e-v-e-n-t-e-e-n' correctly and then still count only 2 or 3 e's, skipping instances without acknowledging the error.

For casual users, this confident delivery is deeply misleading. Studies from Stanford's Human-Centered AI Institute have shown that users are significantly more likely to trust AI outputs when they are delivered with apparent certainty, even when those outputs are demonstrably wrong. The letter-counting failure becomes a powerful metaphor for a much larger problem: AI systems that sound authoritative while being fundamentally unreliable on certain task categories.

The Broader Hallucination Crisis in Numbers

Letter counting is a trivial example, but AI hallucination is anything but trivial in high-stakes applications. Recent research paints a concerning picture of how widespread the problem remains:

  • A 2024 study by Vectara found hallucination rates ranging from 3% to 27% across major LLMs
  • GPT-4 hallucinates in approximately 3-5% of factual queries, according to OpenAI's own benchmarks
  • Open-source models like Llama 2 showed hallucination rates exceeding 15% on certain factual recall tasks
  • Legal AI tools have generated fabricated case citations that were submitted to actual courts, resulting in sanctions against attorneys
  • Medical AI chatbots have provided incorrect drug interaction information in documented test scenarios
  • Customer-facing AI assistants have invented company policies, refund amounts, and product specifications

The 'how many e's in seventeen' question has become a litmus test precisely because it is so simple. If a model cannot reliably count letters in a 9-letter word, what does that imply about its reliability when summarizing legal contracts, analyzing medical records, or generating financial reports?

Reasoning Models Offer Partial Improvements

Newer chain-of-thought and reasoning-focused models have shown measurable improvement on character-level tasks. OpenAI's o1 and o3 models, which spend additional compute time 'thinking' before responding, are more likely to correctly count letters because they can break down the problem step by step.

Google's Gemini 2.5 Pro with its built-in 'thinking' mode has also demonstrated better performance on these types of tasks. By explicitly spelling out each character and counting sequentially, these models can sometimes overcome the tokenization limitation through learned reasoning strategies.

However, even these advanced models are not 100% reliable on character counting. The improvement is probabilistic, not deterministic. A reasoning model might get the answer right 8 out of 10 times instead of 3 out of 10, but that remaining 20% failure rate on a task this simple underscores a fundamental architectural constraint that no amount of scaling has fully resolved.
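A quick back-of-the-envelope calculation shows why a per-query success rate matters in practice. The sketch below takes the illustrative 3-in-10 and 8-in-10 figures from the paragraph above and assumes, as a simplification, that each query succeeds or fails independently; the function name `prob_all_correct` is made up for this example.

```python
def prob_all_correct(per_query_accuracy: float, n_queries: int) -> float:
    """Probability that n independent queries are all answered correctly."""
    return per_query_accuracy ** n_queries


# Chance of five correct answers in a row at each illustrative accuracy level.
for accuracy in (0.3, 0.8):
    print(accuracy, round(prob_all_correct(accuracy, 5), 4))
```

Under these assumptions, a model that is right 80% of the time answers five questions in a row correctly only about a third of the time (0.8 to the fifth power is roughly 0.33).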

Anthropic's Claude 3.5 Sonnet and Claude 4 have shown relatively stronger performance on letter-counting tasks in community benchmarks, though Anthropic has not made specific claims about character-level accuracy. The variation between models suggests that training data composition and reinforcement learning from human feedback (RLHF) play a role in how well models handle these edge cases.

What This Means for Developers and Businesses

For anyone building products on top of LLMs, the letter-counting failure carries practical lessons that extend far beyond party tricks:

  • Never trust LLM output for precision tasks without external validation layers
  • Implement programmatic checks for any task involving counting, calculation, or character-level text manipulation
  • Use tool calling: modern APIs from OpenAI, Anthropic, and Google allow models to invoke code execution for tasks like counting, which sidesteps the tokenization problem as long as the model passes the right text to the tool
  • Set user expectations clearly about what AI can and cannot do reliably
  • Test edge cases aggressively before deploying AI in customer-facing or high-stakes scenarios
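A programmatic check of the kind recommended in the list above can be very small. The following is a minimal sketch, assuming the application has already extracted the word, the letter, and the model's claimed count from the response; `validate_letter_count` is a hypothetical helper name, not part of any vendor SDK.

```python
def validate_letter_count(word: str, letter: str, claimed: int) -> tuple[bool, int]:
    """Return (is_correct, actual_count) for a model's claimed letter count."""
    actual = word.lower().count(letter.lower())
    return claimed == actual, actual


# Hypothetical model answer, similar to the wrong responses described earlier.
ok, actual = validate_letter_count("seventeen", "e", claimed=2)
print(ok, actual)  # False 4
```

When the check fails, the application can replace the model's number with the computed one or flag the response for review instead of showing it to the user unchanged.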

The most effective mitigation strategy is hybrid architecture design. Rather than asking the LLM to count letters directly, a well-designed system would have the LLM recognize that a counting task is being requested, then delegate the actual counting to a simple Python function. This approach — sometimes called 'tool use' or 'function calling' — combines the LLM's natural language understanding with deterministic code execution.
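The delegation pattern can be sketched without committing to any particular vendor's API. In the example below, the JSON shape of the tool request, the `TOOLS` registry, and the function names are all assumptions chosen for illustration; real function-calling APIs define their own request and response formats.

```python
import json


def count_letter(word: str, letter: str) -> int:
    """Deterministically count occurrences of `letter` in `word`, ignoring case."""
    return word.lower().count(letter.lower())


# Hypothetical registry mapping tool names to plain Python functions.
TOOLS = {"count_letter": count_letter}


def run_tool_call(tool_call_json: str) -> str:
    """Execute a structured tool request and return the result as JSON."""
    call = json.loads(tool_call_json)
    func = TOOLS[call["name"]]
    result = func(**call["arguments"])
    return json.dumps({"name": call["name"], "result": result})


# Assumed shape of a request the model might emit instead of guessing a number.
model_output = '{"name": "count_letter", "arguments": {"word": "seventeen", "letter": "e"}}'
print(run_tool_call(model_output))  # {"name": "count_letter", "result": 4}
```

The model's job shrinks to recognizing the task and filling in the arguments. The count itself comes from ordinary string code, so it does not vary from run to run.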

The Public Perception Problem AI Companies Face

Viral moments like the 'seventeen' video create a trust deficit that AI companies struggle to overcome. When millions of viewers watch a $100 billion technology fail at something a child can do, it shapes public perception in ways that no marketing campaign can easily reverse.

This is particularly challenging because the failure is so intuitive. You don't need a PhD in computer science to understand that counting letters in a word should be easy. The gap between expectation and reality becomes a powerful narrative: if AI can't do this, what else is it getting wrong?

AI companies have generally responded by emphasizing that LLMs are designed for language understanding and generation, not character-level text manipulation. This is technically accurate but rhetorically weak. Users don't care about tokenization architectures — they care about whether they can trust the answers they receive.

Looking Ahead: Will This Problem Ever Be Fully Solved?

Several research directions could eventually address the character-level processing limitation. Byte-level and character-level models, which process text one byte or character at a time rather than using subword tokenization, have shown promise in academic research but remain computationally expensive to scale because sequences become much longer.
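For a sense of what a byte-level input looks like, the snippet below encodes the word as UTF-8 using only the Python standard library. It is an illustration of the input representation, not of any specific byte-level model.

```python
word = "seventeen"

# Each ASCII letter becomes exactly one byte, so every 'e' is individually visible.
byte_values = list(word.encode("utf-8"))

print(byte_values)                  # [115, 101, 118, 101, 110, 116, 101, 101, 110]
print(byte_values.count(ord("e")))  # 4
```

The trade-off is sequence length: nine input positions here, compared with the one or two tokens a subword tokenizer might produce for the same word.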

Multimodal approaches represent another potential solution. A model that can 'see' text as an image and process it visually might handle character counting more reliably than one that processes text purely through tokenization. Early experiments with vision-language models suggest this is viable but adds latency and complexity.

The most likely near-term solution is the continued refinement of tool-augmented reasoning. As models become better at recognizing when they need external tools and seamlessly invoking them, the practical impact of the tokenization limitation will diminish even if the underlying architectural constraint remains.

For now, the 'how many e's in seventeen' question serves as a humbling reminder: today's AI systems are remarkably capable pattern-matching engines, but they are not thinking machines. Understanding that distinction is essential for anyone building with, investing in, or relying on artificial intelligence in 2025 and beyond.