No Universal Courtesy: A Cross-Linguistic, Multi-Model Study of Politeness Effects on LLMs Using the PLUM Corpus

Researchers tested five major LLMs across English, Hindi, and Spanish to measure how politeness in user prompts affects model output quality. Using 22,500 prompt-response pairs and an eight-factor evaluation framework, they found performance varies significantly by model and language, suggesting politeness effects aren't universal across systems.

Modelwire context

Explainer

The more consequential finding is not that politeness matters but that the effect is model-specific and language-specific: prompt engineering advice that works in English for GPT-4o Mini may actively degrade performance in Hindi on Claude 3.7 Sonnet. That interaction effect is what makes the study practically relevant for anyone building multilingual applications.

This connects directly to the reliability problems surfaced in 'Diagnosing LLM Judge Reliability' from April 16, which found that aggregate consistency metrics can look healthy while masking per-instance failures. The PLUM study has the same structural problem in reverse: aggregate politeness effects may look negligible until you disaggregate by model and language, at which point the variance becomes the story. Both papers point at the same underlying issue: LLM behavior is less uniform across conditions than headline numbers suggest, and evaluation frameworks need to account for interaction effects rather than main effects alone.
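To make the main-effect versus interaction-effect distinction concrete, here is a minimal Python sketch. The model names, language codes, and scores are invented for illustration and are not results from the PLUM study; the 0-10 scale and the two-tone setup (polite versus neutral) are likewise assumptions. It shows how a pooled politeness effect can be exactly zero while every individual model-language cell moves by a full point.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical quality scores on an assumed 0-10 scale, keyed by
# (model, language, tone). Invented for illustration only; these are
# NOT figures from the PLUM study.
scores = {
    ("model_a", "en", "polite"): 8.0, ("model_a", "en", "neutral"): 7.0,
    ("model_a", "hi", "polite"): 6.0, ("model_a", "hi", "neutral"): 7.0,
    ("model_b", "en", "polite"): 7.0, ("model_b", "en", "neutral"): 8.0,
    ("model_b", "hi", "polite"): 8.0, ("model_b", "hi", "neutral"): 7.0,
}


def main_effect(scores):
    """Mean polite score minus mean neutral score, pooled over all cells."""
    polite = [v for (_, _, tone), v in scores.items() if tone == "polite"]
    neutral = [v for (_, _, tone), v in scores.items() if tone == "neutral"]
    return mean(polite) - mean(neutral)


def cell_effects(scores):
    """Polite-minus-neutral difference for each (model, language) cell."""
    cells = defaultdict(dict)
    for (model, lang, tone), value in scores.items():
        cells[(model, lang)][tone] = value
    return {cell: tones["polite"] - tones["neutral"] for cell, tones in cells.items()}


print("pooled effect:", main_effect(scores))
for cell, delta in sorted(cell_effects(scores).items()):
    print(cell, delta)
```

In this toy data the pooled effect is zero because the per-cell effects cancel, which is the pattern a headline number would hide. A real analysis would also need repeated samples per cell and a significance test before reading anything into the differences.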

Watch whether any of the labs behind the five tested models respond with documentation or guidance on language-specific prompting norms. If none do within six months, that's a signal the industry treats this as an academic edge case rather than a deployment concern worth addressing.

Coverage we drew on

Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations · arXiv cs.LG

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Gemini-Pro · GPT-4o Mini · Claude 3.7 Sonnet · DeepSeek-Chat · Llama 3 · PLUM Corpus

Read full story at arXiv cs.CL →(arxiv.org)
