kalinga.ai

LLM Spelling Errors: Why AI Can’t Spell — And What It Reveals About How It Thinks

[Image: concept art showing an AI model breaking the word “strawberry” into chunks, illustrating LLM spelling errors.]
Because AI reads in “tokens” rather than individual letters, even the most advanced language models frequently stumble on character-level tasks.

Google’s own AI can’t spell the word “Google.” If that sentence surprises you, the technical reason behind it will surprise you even more — and it has serious implications for every business deploying AI-generated content today.

In May 2026, Google’s overhauled AI-powered search engine made headlines for all the wrong reasons. Its AI Overviews claimed there were two P’s in “Google,” one R in “poop,” two D’s in “journalism” (spelled out as j-o-u-r-n-a-d-i-s-m), and offered a creative rendition of the sitting U.S. president’s last name. These aren’t typos or server glitches. They are a window into a fundamental architectural limitation of modern large language models — one that researchers have known about for years and that no amount of fine-tuning has been able to permanently solve.

This post explains why LLM spelling errors happen, why they are structurally difficult to fix, what the Google incident reveals about the limits of deploying generative AI at scale, and what practical steps businesses and content teams should take to protect themselves.


The Strawberry Problem — Why LLM Spelling Errors Aren’t Going Away

Every time a major AI lab releases a new model, a ritual takes place in developer communities: someone asks it how many R’s are in the word “strawberry.” The answer — almost always wrong — has become a reliable benchmark not for intelligence, but for a specific, persistent class of LLM spelling errors rooted in how these systems are built.

The strawberry test has been a running joke in AI circles since at least 2024. But when Google built its new AI-first search product on top of the same class of models and shipped it to billions of users, the joke stopped being funny. The same architectural quirk that makes a chatbot stumble on “strawberry” now shows up in the world’s most-used search engine, generating incorrect letter counts and garbled spellings at scale.

The problem is not that AI models are “dumb.” These are the same systems capable of writing functional code in seconds, passing medical licensing exams, and solving mathematical problems that stumped researchers for decades. The problem is more precise and more interesting than that: LLMs do not read the way humans do, and that difference has specific, predictable consequences for spelling and character-level reasoning.


What Is AI Tokenization?

AI tokenization is the process by which a large language model converts raw text into numerical units called tokens before processing it. A token is not necessarily a letter or even a complete word — it is a chunk of text determined by a tokenization algorithm, which could be a full word, a common suffix, a syllable, or occasionally a single character, depending on frequency and context.

This is the root cause of LLM spelling errors. When a model processes the word “strawberry,” it does not see S-T-R-A-W-B-E-R-R-Y as ten individual characters. It sees something closer to two or three tokens, perhaps “straw” and “berry,” or “strawber” and “ry.” The model has learned to associate these chunks with statistical patterns in language, but it has no reliable internal representation of the individual letters that make up each chunk.
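To make the chunking concrete, here is a minimal sketch of a toy tokenizer. The vocabulary, the IDs, and the function names are hypothetical and chosen only for illustration. Real tokenizers learn their vocabularies from data and use more elaborate splitting rules, so treat this as a simplified stand-in rather than how any production model works.

```python
# Toy vocabulary: hypothetical chunks and IDs, not taken from any real model.
TOY_VOCAB = {"straw": 101, "berry": 102, "goo": 103, "gle": 104}


def toy_tokenize(word, vocab=TOY_VOCAB):
    """Greedy longest-match split of `word` into vocabulary chunks.

    Characters that match no chunk fall back to single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking until something matches.
        for end in range(len(word), i, -1):
            piece = word[i:end]
            if piece in vocab or end == i + 1:
                tokens.append(piece)
                i = end
                break
    return tokens


def toy_encode(word, vocab=TOY_VOCAB):
    """Map each chunk to an integer ID; unknown single characters get -1."""
    return [vocab.get(tok, -1) for tok in toy_tokenize(word, vocab)]


tokens = toy_tokenize("strawberry")
ids = toy_encode("strawberry")
print(tokens)  # ['straw', 'berry']
print(ids)     # [101, 102]
print(len("strawberry"), "characters become", len(ids), "token IDs")
```

Once the word has become the two IDs, the individual letters are no longer part of what gets passed downstream.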

Tokens Are Not Letters

The distinction between tokens and letters is not a minor technical detail — it is the entire explanation for why LLM spelling errors are so consistent and so hard to fix.

When a human spells a word, they have direct access to a sequence of discrete symbols. They can count letters, reverse a word, or identify repeated characters with ease. When an LLM is asked to count the letters in a word, it is not performing that same discrete lookup. It is doing something more like pattern-matching against statistical regularities in its training data.
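For contrast, the discrete lookup described above is trivial for ordinary code that has the characters in front of it. The snippet below is a small illustration using only the Python standard library; the variable names are arbitrary.

```python
from collections import Counter

word = "strawberry"

letter_counts = Counter(word)          # exact tally of every character
r_count = word.count("r")              # direct count of one letter
reversed_word = word[::-1]             # reverse the character sequence
repeated = sorted(ch for ch, n in letter_counts.items() if n > 1)

print(len(word))       # 10
print(r_count)         # 3
print(reversed_word)   # yrrebwarts
print(repeated)        # ['r']
```

Each result is computed directly from the character sequence, which is exactly the access a token-based model lacks.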

Matthew Guzdial, an AI researcher and assistant professor at the University of Alberta, explained the mechanism clearly: the model sees an encoding of what “the” means, but it does not know about ‘T,’ ‘H,’ ‘E’ as separate items. The word exists in the model as a compressed numerical representation, not as a series of inspectable characters.

How Numerical Encoding Replaces Reading

Transformer-based models, the family that includes GPT, Gemini, and most production LLMs, convert text into high-dimensional numerical vectors. These vectors capture semantic relationships. A classic illustration from word-embedding research is that “king” minus “man” plus “woman” approximates “queen.” But the vectors are not designed to capture orthographic structure, meaning what a word looks like as a sequence of characters.

This means the model’s internal representation of “Google” carries information about what the company does, its industry, its cultural significance, and its typical usage patterns. What it does not reliably encode is the sequence G-O-O-G-L-E, or the fact that “oo” appears once and not twice. When asked to spell it out, the model is making a probabilistic inference, not reading from a stored character sequence.
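A rough sketch of that encoding step is shown below. The token IDs, the embedding width, and the random values are all assumptions made for illustration; a trained model learns its vectors rather than drawing them at random, and real embeddings are far wider.

```python
import random

# Hypothetical token IDs and a tiny embedding width, chosen for illustration.
TOY_IDS = {"goo": 103, "gle": 104}
EMBED_DIM = 4

rng = random.Random(0)  # fixed seed so the toy table is reproducible
# One vector per token ID; a trained model learns these values instead.
EMBEDDING_TABLE = {
    token_id: [round(rng.uniform(-1, 1), 3) for _ in range(EMBED_DIM)]
    for token_id in TOY_IDS.values()
}


def model_input(tokens):
    """Return the vectors a model would receive for a list of token chunks."""
    return [EMBEDDING_TABLE[TOY_IDS[tok]] for tok in tokens]


vectors = model_input(["goo", "gle"])
for vec in vectors:
    print(vec)  # four floats per token; no letters appear anywhere
```

The character string is dropped at the ID lookup. Whatever a model knows about spelling has to be learned indirectly from training data, which is why that knowledge is unreliable.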


Google’s Spelling Crisis, Explained

What Went Wrong with Google AI Overviews?

Google’s AI Overviews are powered by a large language model that, like all transformer-based systems, processes text through tokenization and numerical encoding. When users asked the system to count letters in “Google” or spell out a word character by character, they were asking the model to perform exactly the kind of task it is worst at: low-level orthographic reasoning.

The specific LLM spelling errors reported in May 2026 included:

  • Two P’s in “Google” (there are none)
  • One R in “poop” (there are none)
  • Two D’s in “journalism,” spelled as “j-o-u-r-n-a-d-i-s-m” (the word contains no D)
  • The president’s surname rendered as “t-r-p-u-m”

This was not the first time Google’s AI search product embarrassed itself. Earlier in 2026, searching the word “disregard” returned what appeared to be a dictionary definition that instead read: “Understood. Let me know whenever you have a new prompt or question” — a leaked system prompt response that was incorrectly surfaced. Google patched that issue. The spelling errors, however, proved more persistent.

Google’s official response acknowledged the issue directly, stating that counting within words has been a known challenge for LLMs and that the company is working to fix it.

Is This Fixable?

This is the question researchers have been wrestling with for years, and the honest answer is: not cleanly, not at the architectural level.

Sheridan Feucht, a PhD student studying large language model interpretability at Northeastern University, put it plainly: there may be no such thing as a perfect tokenizer due to the inherent fuzziness in how models chunk language. Even if researchers agreed on an ideal token vocabulary, models would likely still find it useful to compress text further internally — obscuring the character-level information needed for accurate spelling.

Some approaches can reduce LLM spelling errors at the margin:

  • Training on more character-level data
  • Post-processing outputs through a dedicated spelling validator
  • Using retrieval-augmented generation (RAG) to cross-check letter counts against a dictionary lookup

But none of these eliminate the underlying issue. They treat the symptom rather than the cause, because the cause is baked into the transformer architecture itself.
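As an illustration of the post-processing idea, the sketch below checks a model's letter-count or spelled-out claim against an exact computation. It assumes the application has already extracted the claim into structured form (word, letter, count, or a hyphenated spelling). The function names are hypothetical, and extracting claims from free text is a separate and harder problem not shown here.

```python
def check_letter_count(word, letter, claimed_count):
    """Compare a claimed letter count with the exact count in `word`."""
    actual = word.lower().count(letter.lower())
    return {"word": word, "letter": letter, "claimed": claimed_count,
            "actual": actual, "ok": actual == claimed_count}


def check_spelled_out(word, spelled_out, separator="-"):
    """Check that a spelled-out form such as 'g-o-o-g-l-e' matches `word`."""
    joined = "".join(part.strip() for part in spelled_out.split(separator))
    return joined.lower() == word.lower()


print(check_letter_count("Google", "p", 2)["ok"])              # False
print(check_letter_count("strawberry", "r", 3)["ok"])          # True
print(check_spelled_out("journalism", "j-o-u-r-n-a-d-i-s-m"))  # False
print(check_spelled_out("Google", "g-o-o-g-l-e"))              # True
```

A check like this only catches claims it is pointed at, which is consistent with the point above: it works around the limitation without removing it.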


LLM Spelling Errors vs. Human Spelling Errors — Key Differences

Understanding the contrast between how humans and LLMs make spelling mistakes clarifies why the AI version is harder to detect and patch.

Dimension | Human Spelling Errors | LLM Spelling Errors
Root cause | Phonetic confusion, memory gaps, typos | Tokenization: no direct character access
Pattern | Irregular, often one-off mistakes | Systematic, reproducible across queries
Self-correction ability | Can be prompted to double-check | Self-correction unreliable; same error on recheck
Letter counting | Easy: humans can count characters | Structurally difficult: tokens obscure character count
Fixing mechanism | Education, spell-check tools | Architecture change or post-processing workarounds
Affected words | Usually low-frequency or complex words | Even extremely common words (e.g., “Google”)
Confidence level | Uncertain humans typically hedge | LLMs often state incorrect spellings confidently

The last row is perhaps the most important for practical purposes. A human who is unsure how to spell a word typically signals that uncertainty. LLMs, by contrast, generate incorrect letter sequences with the same fluency and confidence they use to produce correct ones. This is what makes LLM spelling errors particularly hazardous in production environments where output is not reviewed before being published.


Why Spelling Is a Low Priority for AI Researchers — But Matters Deeply for Users

From a research perspective, LLM spelling errors are an acknowledged nuisance rather than a crisis. The reason is straightforward: the utility of large language models does not come from their ability to spell. Their value lies in reasoning, synthesis, code generation, and language understanding — capabilities that operate at a semantic level far above individual characters.

Researchers are focused on alignment, reasoning accuracy, hallucination reduction, and safety — problems with much larger real-world stakes than whether a model can correctly count the letters in “strawberry.” This is a defensible prioritization from an engineering standpoint.

But from a user’s perspective, and especially from a business perspective, spelling errors carry disproportionate costs:

  • They erode trust in AI-generated content
  • They can introduce factual-looking errors into published material (a misspelled proper noun or technical term can cause real confusion)
  • They are particularly damaging when the AI output is presented as authoritative, as in the case of Google’s AI Overviews in a search context
  • They are hard to catch at scale without dedicated review processes

The mismatch between how researchers rank the problem and how users experience it is part of what made the Google incident so striking. The company shipped an AI-forward product designed to replace traditional search results, then discovered that the model couldn’t reliably handle one of the most basic tasks a search user might ask of it.


What This Means for Businesses Using AI in Production

The Google incident is not just an embarrassment for one company. It is a signal for any organization using LLMs to generate, review, or publish text.

LLM spelling errors are a predictable failure mode of the current generation of transformer-based models. If your business is using AI to generate product descriptions, knowledge base articles, customer communications, or any content where character-level accuracy matters, you need explicit safeguards — not because the AI is broken, but because it was never designed to handle this class of task reliably.

This is especially important in domains where spelling directly affects accuracy:

  • Medical content (drug names, dosage instructions, anatomical terms)
  • Legal content (proper names, statute citations, party identifiers)
  • Technical documentation (variable names, command-line syntax, API endpoints)
  • Localization and translation work involving proper nouns or branded terms

The risk is not that the model will occasionally produce a typo. The risk is that the model will confidently produce a systematically wrong character sequence and that, without a review process, that error will be published, indexed, and trusted.
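One possible safeguard for the domains listed above is a glossary check that flags words resembling a protected term without matching it exactly. The sketch below uses Python's standard `difflib` module. The glossary contents, the 0.8 similarity threshold, and the function name are assumptions for illustration, and the threshold in particular would need tuning against real content.

```python
import difflib
import re

# Hypothetical glossary of terms that must appear exactly as written.
PROTECTED_TERMS = ["Google", "acetaminophen", "Kubernetes"]


def find_near_misses(text, terms=PROTECTED_TERMS, threshold=0.8):
    """Flag words that closely resemble a protected term but do not match it."""
    exact = {t.lower() for t in terms}
    flagged = []
    for word in re.findall(r"[A-Za-z]+", text):
        lowered = word.lower()
        if lowered in exact:
            continue  # spelled correctly (case is ignored in this sketch)
        for term in terms:
            ratio = difflib.SequenceMatcher(None, lowered, term.lower()).ratio()
            if ratio >= threshold:
                flagged.append((word, term))
                break
    return flagged


draft = "Take acetaminophin with water. Search it on Gooogle."
print(find_near_misses(draft))
```

Flagged pairs are candidates for human review, not automatic corrections, since a near match can occasionally be a legitimate different word.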


The Broader Lesson: AI Confidence Is Not the Same as AI Accuracy

The deeper issue revealed by LLM spelling errors is one of calibration. These models do not express uncertainty proportional to their actual reliability. They produce confident output whether they are reasoning about a domain they have been extensively trained on or being asked to count letters in a three-syllable word.

This is not a flaw specific to spelling. It shows up across many categories of AI output. But spelling is a domain where the ground truth is perfectly clear and verifiable: you can check whether “Google” contains any P’s in about two seconds. That makes it a uniquely transparent test case for AI over-confidence.

Users and businesses who understand this dynamic are better positioned to use these tools effectively. The goal is not to distrust AI output wholesale — that would forfeit genuine productivity gains. The goal is to deploy verification where it matters: at the intersection of high confidence and high consequence.


Practical Checklist — When to Trust AI-Generated Text

Use this checklist to evaluate when AI text output requires additional human review, particularly with regard to character-level accuracy.

  • Character-level tasks require human verification. Any AI output involving letter counts, specific spellings of proper nouns, acronym expansions, or character-by-character transcription should be reviewed manually.
  • Proper nouns are high risk. Brand names, person names, place names, and product names are frequent sites of LLM spelling errors because they are often tokenized as compressed chunks rather than spelled-out sequences.
  • Technical strings need validation. Code snippets, API keys, command-line arguments, and URL paths should always be tested — not just read — before being used or published.
  • Confident output is not verified output. A smooth, fluent sentence with no hedging language does not mean the content is accurate. Apply the same scrutiny regardless of how polished the output looks.
  • Run a spot-check on any newly deployed AI feature. Before shipping any AI-powered content generation tool, ask it to spell its own product name. If it fails, assume it will fail on similarly basic tasks in production.
  • Post-processing helps, but is not foolproof. Spell-checkers can catch some LLM spelling errors, but may miss incorrect-yet-valid strings (e.g., a misspelled proper noun that happens to be a real word).
  • Establish a feedback loop. If users can report errors in AI-generated content, create a fast path to review and correction. Spelling errors in AI output often appear in clusters — one broken pattern can repeat across many outputs.
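The spot-check item in the checklist above could be automated along these lines. In this sketch, `ask_model` is a hypothetical callable (prompt string in, reply string out) standing in for whatever client a deployment actually uses, and `fake_model` exists only to exercise the check. It also assumes the reply contains nothing but the spelled-out letters, which a real reply may not.

```python
def spelling_spot_check(ask_model, names):
    """Ask a model to spell each name and report the ones it gets wrong.

    `ask_model` is a hypothetical callable: prompt string in, reply string out.
    """
    failures = []
    for name in names:
        reply = ask_model(f"Spell '{name}' one letter at a time, separated by hyphens.")
        # Keep only the letters, assuming the reply holds just the spelling.
        spelled = "".join(ch for ch in reply if ch.isalpha())
        if spelled.lower() != name.lower():
            failures.append((name, reply))
    return failures


def fake_model(prompt):
    """Stand-in model that misspells one name, used only to exercise the check."""
    canned = {"Google": "G-o-o-g-l-e", "Acme": "A-c-e-m"}
    for name, reply in canned.items():
        if f"'{name}'" in prompt:
            return reply
    return ""


print(spelling_spot_check(fake_model, ["Google", "Acme"]))
# [('Acme', 'A-c-e-m')]
```

Running a check like this on every release, with the list of names drawn from your own product and brand glossary, gives the feedback loop a concrete starting point.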

Conclusion: Understanding AI Limitations Is a Competitive Advantage

LLM spelling errors are not a temporary embarrassment that will disappear with the next model release. They are an architectural reality of how transformer-based systems process language — and while improvements are possible at the margins, the fundamental constraint of tokenization is not going away soon.

What does change is how informed organizations respond to that constraint. The businesses and content teams that understand why AI can’t spell are the ones that build appropriate review processes, set accurate expectations for their users, and avoid the costly trust damage that comes from shipping confidently wrong AI output at scale.

Google’s AI Overview spelling failures are visible precisely because they occurred in the world’s most prominent search surface. But the same failure mode exists in every deployment of a large language model — in enterprise chatbots, AI writing assistants, automated customer support tools, and generated product content. The question is not whether your AI will make LLM spelling errors. The question is whether your systems are designed to catch them before they reach your users.

The answer to “why can’t AI spell?” is actually a detailed, fascinating window into how these systems work. Understanding that answer is not just academically interesting — it is directly useful for anyone building, deploying, or managing AI-generated content in 2026 and beyond.

Frequently Asked Questions

Why do LLM spelling errors happen if AI models are so advanced?

LLM spelling errors stem from tokenization, a fundamental design constraint of current models. When a large language model reads text, it does not see individual letters the way a human does. Instead, it processes text as chunks called tokens, each of which is mapped to a numerical representation. For instance, a word like “strawberry” might be broken down into two tokens: “straw” and “berry”. Because the model processes the entire chunk as a unified statistical representation, it lacks a reliable internal map of the discrete characters inside that chunk. This architectural blind spot is the root cause of frequent character-level reasoning mistakes.

What is the “Strawberry Problem” in generative AI?

The “Strawberry Problem” is a well-known informal benchmark in developer communities, used to highlight LLM spelling errors and character counting flaws. When asked how many times the letter “R” appears in the word “strawberry,” models have frequently failed because they do not look at the word letter-by-letter. Instead of doing a discrete character lookup, the model performs pattern-matching against its training data. This test serves as a persistent reminder that semantic intelligence does not automatically equal orthographic accuracy.

How did LLM spelling errors affect Google’s AI Overviews?

In May 2026, Google’s AI-powered search engine faced public scrutiny when its AI Overviews generated high-profile character mistakes. The model claimed there was one “R” in the word “poop” and two “P”s in the word “Google,” although neither word contains that letter. It also generated garbled text sequences like “j-o-u-r-n-a-d-i-s-m” for journalism. These issues showed that even highly sophisticated production systems struggle with low-level orthographic reasoning when forced to count or spell out words character by character.

Can developers permanently fix LLM spelling errors?

Unfortunately, LLM spelling errors cannot be easily fixed at the transformer architecture level. AI researchers have pointed out that modifying token vocabulary to be completely precise is incredibly difficult because models naturally compress data internally to optimize language understanding. While engineering workarounds like retrieval-augmented generation (RAG) or post-processing spell-checkers can catch mistakes at the margin, they only treat the symptoms rather than changing how the model natively reads language.

How can businesses mitigate the risk of LLM spelling errors?

To protect your brand credibility from confident but incorrect AI outputs, businesses should implement a strict validation checklist:

  • Human Verification: Always mandate human review for text involving brand names, medical terms, or proper nouns.
  • Technical Audits: Test all generated technical strings, code blocks, and URL paths before publishing.
  • Post-Processing Filters: Deploy separate, traditional dictionary spell-checkers to act as a safety net.
