Revising RVL-CDIP: Quantifying Errors and Test-Train Overlap

A rigorous audit of RVL-CDIP, a foundational document classification benchmark, reveals systemic data quality issues that undermine published model comparisons. The study identifies 12% label corruption and 35% test-train leakage, both of which artificially inflate reported accuracy metrics. This work matters because practitioners routinely cite RVL-CDIP results to validate production systems, and the corrected dataset variants now enable honest performance assessment. The finding that removing duplicates paradoxically degrades accuracy suggests models may have learned dataset artifacts rather than generalizable patterns, a cautionary signal for how benchmark contamination propagates through the field.
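To make the leakage claim concrete, here is a minimal sketch of how test-train overlap and its effect on accuracy might be measured. It is not the study's method, which this summary does not describe. It assumes byte-identical duplicates only, so a real audit of scanned documents would likely need near-duplicate detection instead. The function names and toy data are hypothetical.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hash raw document bytes; this only catches byte-identical duplicates."""
    return hashlib.sha256(data).hexdigest()


def overlap_report(train_docs, test_docs):
    """Return (fraction of test docs also in train, indices of unseen test docs)."""
    train_hashes = {content_hash(d) for d in train_docs}
    clean_idx = [
        i for i, d in enumerate(test_docs)
        if content_hash(d) not in train_hashes
    ]
    overlap = 1 - len(clean_idx) / len(test_docs)
    return overlap, clean_idx


def accuracy(preds, labels, idx=None):
    """Accuracy over all examples, or over the subset given by idx."""
    idx = list(range(len(labels))) if idx is None else list(idx)
    return sum(preds[i] == labels[i] for i in idx) / len(idx)


# Toy data (invented for illustration, not drawn from RVL-CDIP).
train = [b"invoice 001", b"memo A", b"letter X"]
test = [b"invoice 001", b"memo B", b"letter X", b"form Q"]
labels = ["invoice", "memo", "letter", "form"]
preds = ["invoice", "letter", "letter", "form"]

overlap, clean_idx = overlap_report(train, test)
full_acc = accuracy(preds, labels)              # scored on the leaky test set
clean_acc = accuracy(preds, labels, clean_idx)  # scored on unseen documents only

print(f"overlap={overlap:.0%} full={full_acc:.2f} clean={clean_acc:.2f}")
```

In this toy case the model scores higher on the full test set than on the deduplicated subset, which is the pattern the summary attributes to leakage: duplicated test items reward memorization rather than generalization.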
Modelwire context
Skeptical read: The study doesn't just flag data quality issues; it reveals that models trained on the contaminated benchmark actually perform worse when the artifacts are removed. This suggests the field may have been optimizing for noise rather than learning transferable document classification skills.
This audit sits alongside the TabPATE work from the same day, which exposed how foundation models leak private training data through predictions. Both papers share a common thread: the gap between what benchmarks and published results claim and what actually happens in deployment. Where TabPATE addresses privacy leakage in tabular models, the RVL-CDIP audit reveals that benchmark contamination can make models appear stronger than they are. The difference is that TabPATE offers a concrete fix (PATE aggregation), whereas RVL-CDIP's corrected variants exist but their adoption depends on whether practitioners actually retrain and re-validate their systems.
Monitor whether major document classification papers published in the next 12 months cite the corrected RVL-CDIP splits or continue using the original. If the corrected variants remain below 20% adoption in new work, it signals the field treats benchmark audits as academic exercises rather than actionable corrections.
This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.
Mentions: RVL-CDIP
Modelwire Editorial
This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.
Modelwire summarizes, we don’t republish. The full content lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.