The Gap Between Formal Multilingual AI a…

Zero. That is essentially the overlap between how LLMs are trained to “speak” Indic languages and how millions of people actually type them on a smartphone.

For years, the big labs have been patting themselves on the back for “multilingual” support. They point to benchmarks where a model can translate formal Hindi or Tamil with high precision. But there is a massive gap between a textbook and a WhatsApp chat. In the real world, bilingual speakers use Romanized Code Mixing (RCM)—blending local languages with English using the Roman script. It is the dominant way of communicating across these communities, yet it remains a blind spot for almost every major model.

The Indi-RomCoM benchmark finally puts a number to this failure. It evaluates how LLMs handle Romanized Indic-English instructions, and the results are a cold shower for anyone who thinks “global AI” is already here. The models struggle because they are trained on clean, formal datasets. They can handle a formal request in Devanagari, but the moment you switch to the phonetic, hybridized Romanized version that actual humans use, the logic falls apart.

Why do we keep pretending a model is “global” when it can’t parse a basic text message from Delhi? That failure is an indictment of the current training pipeline. We have spent so much time optimizing for the “gold standard” of language that we forgot people actually talk in fragments and hybrids.

The problem feels like more than just a lack of data; it looks like a structural failure in how these models “see” text, and I suspect the tokenizers are the primary culprit. Most tokenizers are optimized for English or formal scripts. When they encounter Romanized Indic code-mixing, they don’t see words; they see a chaotic sequence of fragments. It is like learning a language from a formal textbook and then being dropped into the middle of a crowded street market: you know the grammar, but you can’t understand a word anyone is actually saying.
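To make the fragmentation idea concrete, here is a toy sketch, not any production tokenizer. It assumes a greedy longest-match subword splitter and a tiny, hypothetical English-heavy vocabulary (TOY_VOCAB), and it compares tokens per word for an English request against a made-up Romanized Hindi-English one. Real tokenizers use learned merges or byte-level fallbacks, so the numbers are illustrative only.

```python
def greedy_tokenize(word, vocab):
    """Split a word into the longest vocab pieces, left to right.

    Falls back to a single character when no vocab entry matches.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate substring starting at i first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Nothing matched: emit one character as its own token.
            pieces.append(word[i])
            i += 1
    return pieces


# Hypothetical vocabulary that only knows common English pieces.
TOY_VOCAB = {"please", "send", "the", "file", "to", "day", "me", "ing"}


def tokens_per_word(sentence, vocab):
    """Average number of tokens each whitespace-separated word is split into."""
    words = sentence.lower().split()
    return sum(len(greedy_tokenize(w, vocab)) for w in words) / len(words)


english = "please send the file today"
romanized = "kal subah file bhej dena yaar"  # illustrative code-mixed request

print(greedy_tokenize("today", TOY_VOCAB))   # ['to', 'day']
print(greedy_tokenize("bhej", TOY_VOCAB))    # ['b', 'h', 'e', 'j']
print(tokens_per_word(english, TOY_VOCAB))   # 1.2
print(tokens_per_word(romanized, TOY_VOCAB)) # 3.5
```

The only Romanized word that survives intact is the English loan “file”; everything else shatters into single characters, which is roughly the “chaotic sequence of fragments” described above.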

The model is essentially guessing the meaning based on proximity to English tokens, rather than actually understanding the linguistic blend. Or maybe it is mostly a data scarcity problem. The friction there is real: scraping and cleaning RCM data is a nightmare because there is no standardized spelling. One person writes “namaste” and another writes “namastay,” and the model has to treat those as distinct entities unless the tokenizer is smart enough to see the phonetic root.
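One crude way to picture “seeing the phonetic root” is a normalization key applied while cleaning data. The sketch below is my own assumption, not something taken from the benchmark or from any tokenizer: the hypothetical phonetic_key function replaces every run of vowels (treating y as a vowel) with a single placeholder and collapses doubled letters, so spelling variants land in the same bucket.

```python
import re
from collections import defaultdict


def phonetic_key(word):
    """Crude spelling-insensitive key for a Romanized word (illustrative only)."""
    w = word.lower()
    # Any run of vowels (y included) becomes one placeholder vowel.
    w = re.sub(r"[aeiouy]+", "a", w)
    # Collapse repeated characters, e.g. "cc" -> "c".
    w = re.sub(r"(.)\1+", r"\1", w)
    return w


def group_variants(words):
    """Bucket words that share the same phonetic key."""
    groups = defaultdict(list)
    for w in words:
        groups[phonetic_key(w)].append(w)
    return dict(groups)


samples = ["namaste", "namastay", "accha", "acha", "bahut", "bohot"]
for key, variants in group_variants(samples).items():
    print(key, variants)
# namasta ['namaste', 'namastay']
# acha ['accha', 'acha']
# bahat ['bahut', 'bohot']
```

The trade-off is obvious: a key this blunt also merges genuinely different words that differ only in their vowels, so it is at best a pre-clustering step for cleaning scraped text, not a substitute for a tokenizer trained on real code-mixed data.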

This creates a tiered system of AI utility. If you are a corporate employee writing formal emails in a native script, the AI is great. If you are a developer or a casual user in Mumbai typing in a Romanized hybrid, the AI is a liability. This isn’t a niche edge case; it’s the primary interface for hundreds of millions of people.

We have seen this pattern before with low-resource languages, where “support” actually means “we scraped a few Wikipedia pages and hope for the best.” The apathetic approach to RCM suggests that labs are more interested in the optics of multilingualism than the actual utility. They want to check a box on a slide deck, not actually solve the friction of real-world communication.

It’s a failure of imagination by the big labs.

If the industry doesn’t pivot from “formal translation” to “actual usage,” these models will remain toys for the elite and footnotes for the rest of the world. We need tokenizers that recognize the phonetic patterns of Romanized Indic languages and training sets that prioritize the messy, blended reality of code-mixing.

My prediction: by Q1 of next year, we will see the first major model release that explicitly lists “Romanized Indic” as a primary training objective rather than a side effect of web-scraping. Until then, these benchmarks serve as a necessary reminder that “multilingual” is often just a marketing term for “can translate a Wikipedia paragraph.”
