COCOLogic-V2: Identifying Logical Inconsistencies via Truly Hard-Negatives

COCOLogic-V2 addresses a critical gap in interpretable AI evaluation by introducing a dataset that stress-tests concept bottleneck models and program synthesis methods on real-world visual reasoning tasks grounded in first-order logic. By treating near-boundary negatives as their own category, the dataset exposes a fundamental weakness: models confidently separate easy cases but systematically fail on hard negatives, where reasoning precision matters most. This finding has immediate implications for practitioners deploying interpretable models in high-stakes domains, because it suggests that current verification approaches may provide false confidence in model accountability.

Modelwire context

Explainer

The critical insight is not just that models fail on hard cases (expected) but that they fail *confidently* on cases near decision boundaries. This means standard accuracy metrics and even typical adversarial robustness checks miss the failure mode entirely, creating a false sense of interpretability.
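To make the point about aggregate metrics concrete, here is a minimal sketch of a stratified audit. It is not the COCOLogic-V2 evaluation code: the stratum names (`positive`, `easy_negative`, `hard_negative`), the `Prediction` record, the `audit` helper, and the toy probabilities are all assumptions made up for illustration. It reports accuracy per stratum alongside the model's mean confidence on the examples it gets wrong.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Prediction:
    label: int    # ground truth: 1 = image satisfies the rule, 0 = it does not
    prob: float   # model's predicted probability that the label is 1
    stratum: str  # hypothetical stratum name, e.g. "hard_negative"


def audit(preds, threshold=0.5):
    """Per-stratum accuracy and mean confidence on wrong predictions."""
    report = {}
    for stratum in sorted({p.stratum for p in preds}):
        group = [p for p in preds if p.stratum == stratum]
        # A prediction is wrong when the thresholded output disagrees with the label.
        wrong = [p for p in group if (p.prob >= threshold) != bool(p.label)]
        # Confidence is the probability assigned to the predicted class.
        conf_wrong = [max(p.prob, 1 - p.prob) for p in wrong]
        report[stratum] = {
            "n": len(group),
            "accuracy": 1 - len(wrong) / len(group),
            "mean_conf_when_wrong": mean(conf_wrong) if conf_wrong else None,
        }
    return report


# Toy, made-up predictions (not results from the paper).
toy = (
    [Prediction(1, p, "positive") for p in (0.90, 0.80, 0.95, 0.85)]
    + [Prediction(0, p, "easy_negative") for p in (0.05, 0.10, 0.20, 0.10)]
    + [Prediction(0, p, "hard_negative") for p in (0.90, 0.85, 0.10, 0.95)]
)

report = audit(toy)
overall = sum((p.prob >= 0.5) == bool(p.label) for p in toy) / len(toy)

print("overall accuracy:", overall)
for name, row in report.items():
    print(name, row)
```

In this toy setup the overall accuracy looks tolerable while the hard-negative stratum is mostly wrong and wrong with high confidence, which is the pattern a single aggregate number would hide.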

This connects directly to the verification infrastructure problem raised in Google's Paper Assistant Tool coverage from the same day. That work proposed AI-assisted peer review to handle scaling scientific validation. COCOLogic-V2 reveals a specific blind spot in that validation pipeline: interpretable models used for high-stakes reasoning can pass existing verification checks while systematically failing on boundary cases. The nuclear physics interpretability paper from the same batch showed how domain constraints improve explainability, but it didn't test whether those constraints hold under adversarial pressure. COCOLogic-V2 essentially asks that question for concept bottleneck models and finds the answer is no.

If practitioners adopting concept bottleneck models for medical imaging or legal document classification run COCOLogic-V2-style audits on their own datasets within the next six months and report similar failure rates on near-boundary cases, this becomes a required pre-deployment check. If they don't, the finding stays confined to academic benchmarking and doesn't shift practice.

Coverage we drew on

Towards Automating Scientific Review with Google's Paper Assistant Tool · arXiv cs.LG

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: COCOLogic-V2 · concept bottleneck models · program synthesis

Read the full story at arXiv cs.LG (arxiv.org)
