Many-Shot CoT-ICL: Making In-Context Learning Truly Learn

Researchers challenge the assumption that many-shot in-context learning scales uniformly across LLM types and task domains. The study finds that chain-of-thought demonstrations behave unpredictably when scaled up on non-reasoning models, while reasoning-specialized LLMs benefit consistently. The finding changes how practitioners should design prompting strategies and suggests that model architecture and training objectives fundamentally alter how models absorb multi-example conditioning. The instability on general-purpose models has immediate implications for production deployments that rely on long-context windows.

Modelwire context

Explainer

The critical qualifier buried in the findings is that the instability on general-purpose models is not just a performance dip. It is unpredictable: practitioners cannot reliably anticipate whether adding more demonstrations will help or hurt without knowing their model's training objective upfront.
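One practical response to that unpredictability is to measure it directly: sweep the number of chain-of-thought demonstrations on a held-out set and check whether accuracy actually rises with shot count. The sketch below is a minimal illustration, not the paper's evaluation protocol. The prompt template, the `generate` callable (a stand-in for whatever model client is in use), the suffix-match scoring rule, and all function names are assumptions made for this example.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical demo format: (question, chain-of-thought rationale, answer).
Demo = Tuple[str, str, str]


def build_prompt(demos: Sequence[Demo], question: str, k: int) -> str:
    """Concatenate the first k CoT demonstrations, then the target question."""
    blocks = [f"Q: {q}\nReasoning: {r}\nA: {a}" for q, r, a in demos[:k]]
    blocks.append(f"Q: {question}\nReasoning:")
    return "\n\n".join(blocks)


def sweep_shot_counts(
    generate: Callable[[str], str],          # assumed model wrapper: prompt -> text
    demos: Sequence[Demo],
    eval_set: Sequence[Tuple[str, str]],     # (question, gold answer) pairs
    shot_counts: Sequence[int],
) -> Dict[int, float]:
    """Return accuracy on eval_set for each demonstration count."""
    results: Dict[int, float] = {}
    for k in shot_counts:
        correct = 0
        for question, gold in eval_set:
            output = generate(build_prompt(demos, question, k))
            # Crude scoring assumption: the output must end with the gold answer.
            correct += int(output.strip().endswith(gold))
        results[k] = correct / len(eval_set)
    return results


def scales_monotonically(results: Dict[int, float], tolerance: float = 0.0) -> bool:
    """True if accuracy never drops by more than `tolerance` as shots increase."""
    ks = sorted(results)
    return all(results[b] >= results[a] - tolerance for a, b in zip(ks, ks[1:]))
```

If `scales_monotonically` returns False for a given model, that is a signal to cap the shot count empirically rather than assume that more demonstrations help.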

This connects directly to the 'Locale-Conditioned Few-Shot Prompting' paper from the same day, which exposed a different but structurally similar failure mode: naive scaling of demonstrations in small quantized models causes verbatim regurgitation rather than genuine generalization. Both papers are converging on the same uncomfortable conclusion that few-shot prompt design cannot be treated as model-agnostic. Together they suggest the field is accumulating evidence that demonstration scaling interacts with model internals in ways that current prompting intuitions don't capture. The 'Prefix Teach, Suffix Fade' distillation paper adds a third data point, showing that dense supervision across full sequences can degrade rather than improve learning, which rhymes with the instability reported here.

Watch whether any of the major inference API providers (OpenAI, Anthropic, Google) publish guidance distinguishing many-shot CoT behavior by model family within the next two quarters. If they do, it signals this instability is reproducible enough to affect production recommendations at scale.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: In-context learning · Chain-of-thought · Large language models · Long-context models

Read the full story at arXiv cs.CL (arxiv.org)

Modelwire Editorial
