What Survives Into Context: A Diagnostic for Budget-Constrained Multi-Hop RAG and When Submodular Evidence Packing Improves It

A new diagnostic metric called answer-in-context addresses a fundamental constraint in retrieval-augmented generation: when the context window is fixed, not all retrieved evidence fits. The work shows that traditional document recall is a poor proxy for what matters in practice, namely whether gold answers survive into the final reader context, and it examines when submodular evidence packing improves that survival. This reshapes how RAG systems should be evaluated and optimized, with implications for production deployments where token budgets are hard constraints rather than soft targets.
Modelwire context
The paper's core insight is that retrieval metrics used in production (document recall, BM25 ranking) don't predict whether the information needed to answer a question actually survives the context-window bottleneck. This is a measurement problem, not just an engineering one.
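The gap between document recall and what the reader sees can be illustrated with a small sketch. It is not the paper's implementation: the function names, the whitespace tokenizer, the rank-order truncation policy, the substring match, and the toy passages are all assumptions made for illustration.

```python
from typing import List


def count_tokens(text: str) -> int:
    # Whitespace split stands in for a real tokenizer (assumption).
    return len(text.split())


def truncate_to_budget(ranked: List[str], budget: int) -> List[str]:
    """Keep passages in retrieval order until the next one would overflow the budget."""
    kept, used = [], 0
    for passage in ranked:
        cost = count_tokens(passage)
        if used + cost > budget:
            break
        kept.append(passage)
        used += cost
    return kept


def document_recall(retrieved: List[str], gold: List[str]) -> float:
    """Fraction of gold passages that appear anywhere in the retrieved list."""
    return sum(1 for g in gold if g in retrieved) / len(gold)


def answer_in_context(context: List[str], answer: str) -> bool:
    """True if the answer string appears in the text the reader actually receives."""
    return answer.lower() in " ".join(context).lower()


# Toy two-hop example: a long distractor is ranked above both gold passages.
ranked = [
    "The city hosts an annual film festival every spring season",  # 10 tokens
    "Alice Munro was born in Wingham Ontario",                     # 7 tokens
    "Wingham Ontario lies in Huron County",                        # 6 tokens
]
gold = ranked[1:]

context = truncate_to_budget(ranked, budget=14)
recall = document_recall(ranked, gold)
survives = answer_in_context(context, "Huron County")

print(f"document recall: {recall:.2f}, answer in context: {survives}")
```

In this toy case the retriever finds every gold passage, yet the 14-token budget admits only the distractor, so the answer never reaches the reader. A recall-only evaluation would not register that failure.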
This connects directly to the hallucination detection work from earlier this month (the span-level benchmark across code and documents). Both papers identify a gap between what we measure and what matters in deployed systems. Where that work showed hallucinations leak through even grounded inputs, this paper shows that grounding itself is incomplete if the right evidence gets truncated before the reader sees it. The clinical NLP production study from the same batch also surfaces this tension: systems fail not because retrieval is wrong, but because constraints force trade-offs that existing metrics don't capture. Submodular packing is a concrete answer to a problem those papers exposed empirically.
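As a rough illustration of what submodular packing under a token budget can look like, the sketch below runs a cost-scaled greedy selection over a simple term-coverage objective (coverage functions are monotone submodular). The summary does not state the paper's actual objective or algorithm, so the `targets` set, the coverage function, the whitespace token costs, and all names here are hypothetical.

```python
from typing import List, Set


def terms(text: str) -> Set[str]:
    # Lowercased whitespace tokens; a stand-in for real term extraction (assumption).
    return set(text.lower().split())


def coverage(selected: List[str], targets: Set[str]) -> int:
    """Number of distinct target terms covered by the selected passages."""
    covered: Set[str] = set()
    for passage in selected:
        covered |= terms(passage) & targets
    return len(covered)


def greedy_pack(passages: List[str], targets: Set[str], budget: int) -> List[str]:
    """Repeatedly add the passage with the best marginal gain per token that still fits."""
    selected: List[str] = []
    used = 0
    remaining = list(passages)
    while remaining:
        base = coverage(selected, targets)
        best, best_ratio = None, 0.0
        for passage in remaining:
            cost = len(passage.split())
            if cost == 0 or used + cost > budget:
                continue  # skip passages that no longer fit
            gain = coverage(selected + [passage], targets) - base
            if gain / cost > best_ratio:
                best, best_ratio = passage, gain / cost
        if best is None:
            break  # nothing fits or nothing adds new coverage
        selected.append(best)
        used += len(best.split())
        remaining.remove(best)
    return selected


passages = [
    "The city hosts an annual film festival every spring season",
    "Alice Munro was born in Wingham Ontario",
    "Wingham Ontario lies in Huron County",
]
# Hypothetical target terms, e.g. entities derived from the question and first-hop evidence.
targets = {"munro", "wingham", "huron", "county"}

packed = greedy_pack(passages, targets, budget=14)
print(packed)
```

With the same 14-token budget that a rank-order cutoff would spend on the distractor, the greedy packer selects both supporting passages because the distractor contributes no marginal coverage. Whether a gain like this holds on real data depends on how well the chosen objective tracks the evidence a question needs.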
If teams adopting this diagnostic see measurable improvements in answer correctness on their internal multi-hop benchmarks without increasing retrieval volume, that validates the core claim. Over the next two quarters, watch whether HotpotQA leaderboards shift once answer-in-context is reported instead of document recall; if they don't move significantly, the diagnostic may not generalize beyond the paper's test set.
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Modelwire Editorial
This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.
Modelwire summarizes, we don’t republish. The full content lives on arxiv.org.