Research Models & Releases·arXiv cs.LG·May 3

Benchmarking Single-Pose Docking, Consensus Rescoring, and Supervised ML on the LIT-PCBA Library: A Critical Evaluation of DiffDock, AutoDock-GPU, GNINA, and DiffDock-NMDN

A large-scale empirical study on the LIT-PCBA library finds that traditional docking combined with neural rescoring does not uniformly outperform classical methods in virtual screening. AutoDock-GPU paired with GNINA rescoring achieved the strongest single-method performance (EF1% of 2.14), while newer AI-native approaches such as DiffDock showed mixed results on real experimental data. The result challenges the narrative that deep learning docking automatically supersedes conventional tools. It matters both for practitioners choosing screening pipelines and for researchers calibrating expectations around recent claims for AI-based molecular modeling.
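For readers unfamiliar with the headline metric: under the common definition, the enrichment factor at 1% (EF1%) is the hit rate among the top-ranked 1% of compounds divided by the hit rate across the whole library, so a value of 2.14 would mean the top 1% holds roughly 2.14 times as many actives as a random pick of the same size. The sketch below is a minimal illustration of that common definition, not the paper's evaluation code. The function name `enrichment_factor`, its parameters, and the toy data are all hypothetical, and the summary does not say how the study handles ties or rounding.

```python
import math


def enrichment_factor(scores, labels, fraction=0.01, higher_is_better=True):
    """Enrichment factor at a given top fraction (hypothetical helper).

    scores: one docking or rescoring score per compound.
    labels: 1 for experimentally active compounds, 0 for inactives.
    higher_is_better: set to False for energy-like scores where lower is better.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n_total = len(scores)
    n_actives = sum(labels)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need at least one compound and at least one active")

    # Size of the top slice; rounding up is an assumption, conventions vary.
    n_top = max(1, math.ceil(fraction * n_total))

    # Rank compounds from best score to worst score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0],
                    reverse=higher_is_better)
    hits_in_top = sum(label for _, label in ranked[:n_top])

    # (hits_in_top / n_top) / (n_actives / n_total), rearranged to one division.
    return (hits_in_top * n_total) / (n_top * n_actives)


# Toy library: 1000 compounds, 10 actives, 2 of them ranked in the top 10.
toy_scores = [1000 - i for i in range(1000)]
active_positions = {0, 5, 100, 200, 300, 400, 500, 600, 700, 800}
toy_labels = [1 if i in active_positions else 0 for i in range(1000)]
toy_ef = enrichment_factor(toy_scores, toy_labels, fraction=0.01)
```

In the toy data the top 1% is 10 compounds containing 2 of the 10 actives, so the hit rate there is 0.2 against a library-wide rate of 0.01. A ranking no better than chance would give a value near 1.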

Modelwire context

Skeptical read

The study's critical finding isn't that DiffDock underperformed, but that the combination of old + new (AutoDock-GPU with GNINA neural rescoring) outperformed pure AI-native approaches. This inverts the expected hierarchy and suggests the screening pipeline narrative has been oversimplified.

This joins a pattern visible in recent benchmarking work: when researchers move from toy tasks to real-world validation, AI-native methods often reveal hidden fragility. The Harvard diagnostic study from May 3rd showed LLMs outperforming ER doctors on case accuracy, but Google DeepMind's co-clinician work (also early May) exposed that general-purpose models still lag experienced physicians on clinical judgment. Similarly, the AutoMat study on computational materials science revealed that coding agents excel at generic benchmarks but fail on underspecified real procedures. Here, DiffDock wins on curated benchmarks but stumbles on LIT-PCBA's experimental data, suggesting the gap between controlled evaluation and production use remains material.

If the same rescoring hierarchy (classical + neural > pure diffusion) holds when tested on the upcoming DEKOIS 2.0 library (expected Q3 2026), that would signal the finding generalizes beyond LIT-PCBA. If AutoDock-GPU + GNINA becomes the de facto standard in pharma screening pipelines by the end of 2026, that adoption would suggest practitioners trust the empirical result over the AI narrative.

Coverage we drew on

Google Deepmind's "AI co-clinician" beats GPT-5.4 in blind doctor tests but still trails experienced physicians · The Decoder

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: DiffDock · AutoDock-GPU · GNINA · DiffDock-NMDN · LIT-PCBA · NMDN

