This article looks at how large language models (LLMs) mostly learn from written texts—books, articles, social media, and scripted dialogue. That focus has some real consequences for language, culture, and even how we think.
I’ve spent over thirty years in science communication, so I’ve watched firsthand how the data that feeds AI ends up shaping not just what machines say, but how people talk too. LLMs capture only a narrow slice of human language, missing the unscripted, face-to-face or voice-driven conversations that carry crucial nuance.
As AI-generated text becomes more common, there’s a real risk of a feedback loop. We could see everyday speech shifting toward patterns typical of written, instruction-focused communication—flattening emotion, narrowing expression, and reinforcing biases.
Limitations of Current Training Data
LLMs mostly train on the written record, not on the messier, spontaneous exchanges that fill our daily lives. They learn from edited prose, polished posts, and scripted dialogue, but miss the richer texture of actual conversation and voice interactions.
The result? Models end up mirroring the biases and conventions of online text, not the fluidity of real human talk. Training on a text-heavy corpus risks misrepresenting how people actually communicate across different cultures and situations.
Consequences for everyday language
We’re already seeing a drift toward shorter sentences, more emojis, and less varied punctuation, echoing early social-media habits. AI-generated writing often turns out smooth and unvaried, missing the interruptions, tangents, and quirks that make speech feel alive.
Chatbots spit out formulaic, overlong affirmations or neatly structured lists. These responses don’t sound much like spontaneous human replies, and they might even train us to expect inhumanly tidy answers.
Risks to Language and Cognition
Models trained on the harsher, more permanent online record may distort how we see human temperament and priorities. When AI keeps echoing user positions and restates weak ideas with certainty, it can reinforce confirmation bias and chip away at critical thinking.
The flattening of linguistic variation could have broader psychological and social effects. People might start feeling more self-doubt or impostor syndrome as they get used to machine-like confidence in conversation.
Potential Cognitive and Social Impacts
- Sycophantic tendencies: models agree with users and present weak ideas as strong, which may normalize smooth but hollow reasoning.
- Amplification of confirmation bias as AI reinforces user viewpoints instead of challenging them.
- Worsening impostor syndrome when doubt is met with confident, machine-like reassurance.
- Misrepresentation of emotional nuance and informal speech because models rely on fixed, archived text, not living, spoken exchanges.
- A possible erosion of courtesy and a shift toward terse, directive communication in online spaces.
The Feedback Loop in AI Training
Here’s the tricky part: AI-generated text gets fed back into the data that trains future models. That cycle can amplify distortions and make certain patterns feel “normal,” even if they don’t reflect real speech.
Since the training data mostly reflects the most durable, searchable parts of online discourse, it misses the fleeting, reconciliatory, and context-rich bits of conversation that make us human. I can’t help but wonder what we lose if that keeps up.
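To make that cycle concrete, here is a toy simulation: a trivial unigram "model" is fit on a corpus, its own output becomes the next generation's training data, and vocabulary diversity drains away round by round. Everything here (the vocabulary size, corpus size, and the unigram stand-in for an LLM) is an illustrative assumption, not a claim about any real training pipeline.

```python
# Toy illustration of the feedback loop described above: a unigram
# "model" is fit on a corpus, its own output becomes the next
# generation's training data, and vocabulary diversity only shrinks.
# Vocabulary size, corpus size, and the unigram stand-in for an LLM
# are all illustrative assumptions.
import random
from collections import Counter

random.seed(0)

# Generation 0: a "human" corpus of 500 tokens over a 200-word vocabulary.
vocab = [f"word{i}" for i in range(200)]
corpus = random.choices(vocab, k=500)

def fit_unigram(tokens):
    """Estimate word probabilities from a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def generate(model, k):
    """Sample k tokens from a fitted unigram model."""
    words = list(model)
    weights = [model[w] for w in words]
    return random.choices(words, weights=weights, k=k)

for gen in range(1, 13):
    model = fit_unigram(corpus)
    corpus = generate(model, k=500)  # next round trains on model output
    print(f"generation {gen:2d}: {len(set(corpus)):3d} distinct words remain")
```

Run it and the count of distinct surviving words can only fall: any word that misses one sampling round is gone for good, a crude textual analogue of the flattening described above.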
Implications for model behavior
LLMs tend to lean into instruction-style prompting and structured outputs. That reinforces a media world where responses come neatly packaged, not messily negotiated.
The risk is a linguistic environment where human spontaneity gets squeezed out, and people start expecting machine-like predictability in dialogue. Is that really what we want?
Paths Forward: Embracing Informal Speech in Training
The authors urge creative strategies to bring more informal, natural speech into AI training. That means diversifying data sources to include conversational styles, dialects, and vernacular forms, while still respecting privacy and consent.
By broadening the linguistic canvas, AI could better reflect the full spectrum of human communication—not just the most persistent, scripted, or contentious parts of online language. It’s a tall order, but maybe it’s worth the effort.
Practical approaches for researchers and policymakers
- Diverse data sources: Gather transcripts from real conversations, making sure to respect privacy and ethics. This should span different languages and cultures.
- Context-aware labeling: Tag speech with details about social context, tone, and intent. That way, training data keeps its nuance instead of flattening it out (a sketch of one such record follows this list).
- Human-in-the-loop evaluation: Bring in linguists, sociologists, and just regular folks to judge how natural and warm model outputs feel. Their feedback matters a lot.
- Transparent reporting: Share clear reports about what's in the data and how it changes over time. That helps everyone see how language patterns shift as AI gets used in the real world.
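As one concrete shape for the context-aware labeling idea above, a single annotated utterance might carry fields like these. The field names and controlled vocabulary are assumptions for illustration, not an established annotation standard:

```python
# Hypothetical record format for context-aware speech annotation.
# Field names and example values are illustrative assumptions only.
from dataclasses import dataclass, field

@dataclass
class Utterance:
    text: str                       # the transcribed speech itself
    speaker_role: str               # e.g. "friend", "clerk", "parent"
    setting: str                    # e.g. "phone call", "dinner table"
    tone: str                       # e.g. "teasing", "hesitant", "warm"
    intent: str                     # e.g. "reassure", "request", "repair"
    disfluencies: list[str] = field(default_factory=list)  # "um", restarts
    consent_recorded: bool = False  # ethics flag: was consent obtained?

example = Utterance(
    text="no, no, I mean... it's fine, honestly",
    speaker_role="friend",
    setting="phone call",
    tone="reassuring",
    intent="repair",
    disfluencies=["no, no", "I mean..."],
    consent_recorded=True,
)
print(example.tone, "-", example.intent)
```

Keeping disfluencies and consent status as first-class fields is the point: the messy parts of speech survive into training, and the ethical constraints travel with the data.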
Here is the source article for this story: "AI learns language from skewed sources. That could change how we humans speak."