Building an Adversarial Malware Dataset by Family and Type: Generation, Evasion, and Poisoning Evaluation

Researchers have constructed a large-scale adversarial malware dataset, organized by family and type, that exposes critical vulnerabilities in ML-based security classifiers. They generated 77,943 evasive PE binaries that achieve evasion rates above 98% against the EMBER detector, showing that malware detection pipelines remain brittle against both adversarial generation and data poisoning. Injecting just 0.5% mislabeled samples during training dramatically degrades classifier performance, which suggests that production security systems relying on supervised learning face an underestimated attack surface. The research challenges assumptions built into deployed threat detection and highlights the gap between academic robustness claims and real-world classifier resilience.
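
For readers unfamiliar with the metric, an evasion rate is typically the share of malicious samples that a detector fails to flag. The sketch below is a minimal illustration of that arithmetic, not the paper's evaluation code: the function name `evasion_rate`, the 0.5 decision threshold, and the example scores are all assumptions made for illustration.

```python
from typing import Sequence


def evasion_rate(scores: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of malicious samples whose detector score falls below the threshold.

    `scores` are hypothetical maliciousness scores in [0, 1] produced by some
    detector for samples that are all known to be malicious. The 0.5 threshold
    is an assumed default, not a value taken from the paper.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    evaded = sum(1 for s in scores if s < threshold)
    return evaded / len(scores)


# Made-up scores for five adversarial samples (not data from the paper).
example_scores = [0.02, 0.10, 0.91, 0.30, 0.05]
print(f"evasion rate: {evasion_rate(example_scores):.0%}")  # evasion rate: 80%
```

Under this definition, a rate above 98% on 77,943 binaries would mean fewer than 2% of the generated samples were flagged at the chosen threshold.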

Modelwire context

Explainer

The more unsettling finding is not the evasion rate but the poisoning result: a classifier can be quietly degraded by corrupting as few as one in two hundred training samples (0.5%), which means the attack surface extends backward into data pipelines and labeling workflows rather than stopping at inference time.
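
To make the scale of a 0.5% poisoning budget concrete, here is a minimal sketch of random label flipping on a binary-labeled training set. The summary does not say how the mislabeled samples were produced, so random flipping is an assumption, as are the helper name `flip_labels`, the seed, and the synthetic 10,000-sample label list.

```python
import random
from typing import List, Tuple


def flip_labels(
    labels: List[int], fraction: float, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return a copy of binary labels with `fraction` of them flipped.

    Also returns the sorted indices that were flipped. This models one simple
    poisoning strategy (random label flipping); the paper may use another.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    n_flip = round(len(labels) * fraction)
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    flipped_idx = sorted(rng.sample(range(len(labels)), n_flip))
    poisoned = list(labels)
    for i in flipped_idx:
        poisoned[i] = 1 - poisoned[i]  # 0 (benign) <-> 1 (malicious)
    return poisoned, flipped_idx


# Synthetic labels: 10,000 samples alternating benign/malicious (illustrative only).
clean = [i % 2 for i in range(10_000)]
poisoned, idx = flip_labels(clean, fraction=0.005)
print(len(idx))  # 50
```

Only 50 of 10,000 labels change, which is the kind of corruption that routine spot checks of a labeling pipeline could plausibly miss.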

This connects directly to the deployment-complete benchmarking work covered the same day, which showed benchmark coverage of 94.98% collapsing to 10.07% in real deployment. That paper argued that standard evaluation methods systematically overstate readiness. The malware research is essentially a domain-specific demonstration of that thesis: EMBER's academic robustness claims do not hold under adversarial conditions that are entirely realistic in production. Together, the two papers reinforce a pattern emerging across recent coverage. The gap between what models demonstrate in controlled evaluation and what they deliver once adversaries, distribution shift, or corrupted data enter the picture is a structural problem in how the field validates systems before shipping them, not a niche concern.

Watch whether VirusTotal or any major EDR vendor publicly responds to the RawMal-TF dataset release with updated detection benchmarks within the next six months. Silence would suggest the industry is absorbing this finding quietly rather than treating it as an actionable signal.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: EMBER · VirusTotal · RawMal-TF

Read the full story at arXiv cs.LG (arxiv.org)


