Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models

Researchers benchmarked the consistency of GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash when each was asked to generate exercise prescriptions repeatedly. GPT-4.1 achieved the highest semantic stability (0.955) yet produced an entirely unique output on every run, revealing a tension between reproducibility and diversity that matters for clinical AI deployment.
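A small sketch may help show how both findings can hold at once: a set of outputs can score near the top on a similarity measure while no two outputs are textually identical. The summary does not say how the study computes semantic stability, so the code below swaps in a simple bag-of-words cosine similarity as a stand-in; the function names and the three sample prescriptions are invented for illustration and are not taken from the paper.

```python
import math
import re
from collections import Counter
from itertools import combinations


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity of word counts (an assumed stand-in for a semantic metric)."""
    ca = Counter(re.findall(r"\w+", a.lower()))
    cb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_pairwise_similarity(outputs: list[str]) -> float:
    """Average similarity over all unordered pairs of outputs."""
    pairs = list(combinations(outputs, 2))
    return sum(bow_cosine(a, b) for a, b in pairs) / len(pairs)


def unique_fraction(outputs: list[str]) -> float:
    """Share of outputs that are distinct as exact strings."""
    return len(set(outputs)) / len(outputs)


# Hypothetical repeated generations: same content, different wording order.
outputs = [
    "Walk briskly for 30 minutes on five days each week.",
    "On five days each week, walk briskly for 30 minutes.",
    "Walk briskly for 30 minutes each week on five days.",
]

print(f"mean pairwise similarity: {mean_pairwise_similarity(outputs):.3f}")
print(f"unique fraction: {unique_fraction(outputs):.3f}")
```

In this toy case every string is distinct, so the unique fraction is 1.0, while the pairwise similarity stays close to 1 because the wording differs only in order. A real evaluation would more likely use sentence embeddings than word counts, which is why the numbers here should not be compared with the 0.955 reported in the study.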
Mentions: GPT-4.1 · Claude Sonnet 4.6 · Gemini 2.5 Flash
Read the full story at arXiv cs.CL → (arxiv.org)