Why do LLMs struggle with spelling if they can write code and solve math problems?

Transformer-based models process text as tokens (which may be full words, syllables, or sub-word fragments), not as character sequences. The tokens are converted into numerical encodings that help the model predict words in context but carry no explicit character-level information, which makes letter counting unreliable.

Is Google's spelling issue fixable?

According to Google, "counting within words has been a known challenge for LLMs, and we're working to fix this particular issue." TechCrunch's reporting indicates that researchers have not been optimistic about solving this under the current tokenization approach, and Google has not disclosed a timeline.

Google's AI Overview struggles with basic spelling, revealing token-level architectural limits

Google’s AI Overview Spelling Failures Signal Architectural Friction

Google’s AI Overview is producing embarrassing spelling errors on elementary words, according to TechCrunch. The system has misspelled “Google” as “Googel,” claimed there are two ‘r’s in “poop,” and rendered “journalism” as “j-o-u-r-n-a-d-i-s-m.” The errors extend to proper nouns: the system identified one ‘P’ in the U.S. president’s last name but spelled it “t-r-p-u-m.” Google acknowledged the pattern in a statement to TechCrunch, saying “counting within words has been a known challenge for LLMs, and we’re working to fix this particular issue.”

These failures are not unique to Google. According to TechCrunch, this has become an industry in-joke: models’ stumbles on words like “strawberry” are now a standard joke test for new language model releases. Yet the persistence of the problem, despite advances in mathematical reasoning and code generation, points to a structural mismatch in how transformer-based models represent text.

The root cause lies in how transformer architectures process language. Rather than operating on individual characters or graphemes, models break text into tokens—units that can be full words, syllables, or sub-word fragments depending on the tokenizer. According to TechCrunch, this means the model receives numerical encodings of semantic or syntactic meaning, but lacks explicit awareness of character sequences. The system can predict contextually appropriate words without ever “seeing” their letter-by-letter composition in the way humans do.
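
To make the point concrete, here is a minimal Python sketch of sub-word tokenization. The vocabulary, the token IDs, and the `toy_tokenize` helper are invented for illustration and are not taken from Google's or any other vendor's tokenizer; production tokenizers are learned from data and far larger. The sketch shows only that what reaches the model is a short list of integers, not a sequence of letters.

```python
# Toy illustration only: this vocabulary and these IDs are invented,
# not taken from any real model's tokenizer.
TOY_VOCAB = {"straw": 101, "berry": 102, "jour": 103, "nal": 104, "ism": 105}


def toy_tokenize(word, vocab=TOY_VOCAB):
    """Greedy longest-match split of `word` into sub-word token IDs."""
    ids = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            # No vocabulary entry matches at position i.
            raise ValueError(f"no token covers {word[i:]!r}")
    return ids


print(toy_tokenize("strawberry"))  # [101, 102]
print(toy_tokenize("journalism"))  # [103, 104, 105]
```

In this toy setup, “strawberry” arrives as two IDs and “journalism” as three. Nothing in `[101, 102]` says how many times the letter ‘r’ appears; a model would have to have learned that association separately.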

This architectural choice optimizes for semantic efficiency and computational speed, both critical for scaling large models, but it leaves character-level operations unintuitive. Spelling tasks require explicit enumeration of letters, a capability that does not emerge reliably from token-based representations. Any fix is likely to involve a trade-off for Google’s engineers: retrofitting character-level awareness into transformer inference would add latency and memory cost, while improvements to training data alone have not historically resolved the issue across the industry.
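
For contrast, explicit enumeration is trivial for a program that operates on the characters themselves. The following Python sketch uses two hypothetical helpers, `count_letter` and `spell_out`; it illustrates the character-level operation only and does not describe how Google or any other vendor addresses the problem.

```python
def count_letter(word, letter):
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


def spell_out(word):
    """Return the word spelled letter by letter, separated by hyphens."""
    return "-".join(word)


print(count_letter("strawberry", "r"))  # 3
print(count_letter("poop", "r"))        # 0
print(spell_out("journalism"))          # j-o-u-r-n-a-l-i-s-m
```

The code gets these right because it iterates over characters directly, which is exactly the view of the text that a token-based model does not have.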

Broader Implications for AI-First Search

This is not Google’s first stumble with AI Overview. TechCrunch reports that earlier versions cited satirical posts from The Onion and Reddit, recommending users eat rocks or apply glue to pizza. A more recent bug caused searches for “disregard” to return what appeared to be a prompt-injection response: “Understood. Let me know whenever you have a new prompt or question!”

As Google doubles down on embedding generative AI into its 29-year-old flagship search product, these errors compound the company’s credibility risk. Each spelling mistake and hallucination fuels skepticism about AI Overview’s reliability for factual retrieval.

Why This Matters

Teams piloting AI Overview for customer-facing Q&A, especially in spelling-sensitive domains like legal, medical, or educational content, should implement human review gates until Google publishes a fix. For enterprises choosing between Google’s AI Overview and competitors for search integration, the persistence of basic spelling errors despite multiple patch cycles suggests the underlying tokenization architecture may require more fundamental redesign than incremental training improvements can address. Organizations with high accuracy requirements should factor this into vendor selection over the next 6–12 months.