CHASM: Unveiling Covert Advertisements on Chinese Social Media

Researchers released CHASM, a 4,992-instance dataset for evaluating multimodal LLMs on detecting covert ads disguised as organic posts on Chinese social media. The benchmark addresses a gap in current LLM evaluation suites by testing real-world deception detection capabilities.

Modelwire context

Explainer

The harder problem CHASM targets is not spam detection but posts that are structurally indistinguishable from organic content, which requires models to reason across image, caption, and platform-specific social cues simultaneously. Because the benchmark focuses on Rednote (Xiaohongshu), the evaluation also demands cross-cultural fluency that most Western-trained multimodal models haven't been tested against.

CHASM belongs to a growing cluster of domain-specific benchmarks designed to expose gaps that general LLM evaluations miss. We covered a similar impulse in MADE (from mid-April), which targeted medical adverse event classification precisely because standard benchmarks failed to capture label ambiguity in high-stakes contexts. The pattern is consistent: researchers are building narrow, real-world-grounded datasets because broad capability benchmarks don't stress-test the failure modes that matter in deployment. CoopEval, also from mid-April, extended this logic to social behavior, finding that models optimized on standard evals still defect in game-theoretic settings. CHASM adds deception detection to that list of underexplored blind spots.

Watch whether the providers of major multimodal models (Qwen, GPT-4o, Gemini) publish results against CHASM within the next two quarters. If scores cluster near the random baseline on the most visually ambiguous ad categories, that would confirm the benchmark is stress-testing something models can't currently handle rather than measuring a capability they already have.
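One way to make "cluster near random baseline" concrete is to compute accuracy per ad category and test each against chance. The sketch below is illustrative only: the record layout, the category names, and the 0.5 chance level (which assumes a balanced ad/organic split) are our assumptions, not details taken from the CHASM paper.

```python
from collections import defaultdict
from math import exp, lgamma, log


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials, computed in log space
    so it stays finite for categories with thousands of posts."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))
    return exp(log_pmf)


def two_sided_p_value(k, n, p=0.5):
    """Exact two-sided binomial test: total probability of every outcome
    that is no more likely than the observed count k."""
    probs = [binom_pmf(i, n, p) for i in range(n + 1)]
    threshold = probs[k] * (1 + 1e-9)  # tolerance for floating-point ties
    return min(1.0, sum(q for q in probs if q <= threshold))


def per_category_report(records, chance=0.5, alpha=0.05):
    """records: iterable of (category, gold_is_ad, predicted_is_ad) tuples.
    Assumed layout for illustration; CHASM's actual schema may differ."""
    counts = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for category, gold, pred in records:
        counts[category][0] += int(gold == pred)
        counts[category][1] += 1

    report = {}
    for category, (correct, total) in counts.items():
        p_value = two_sided_p_value(correct, total, chance)
        report[category] = {
            "accuracy": correct / total,
            "n": total,
            "p_value": p_value,
            # True when accuracy is statistically indistinguishable from chance
            "near_chance": p_value >= alpha,
        }
    return report


if __name__ == "__main__":
    # Toy predictions with hypothetical category names
    toy = ([("lifestyle", True, True)] * 10 + [("lifestyle", True, False)] * 10
           + [("cosmetics", True, True)] * 19 + [("cosmetics", True, False)])
    for name, stats in per_category_report(toy).items():
        print(name, stats)
```

A category flagged `near_chance` is one where a model's predictions cannot be told apart from coin flips at the chosen significance level. If the dataset's classes are imbalanced, the `chance` argument should be set to the majority-class rate instead of 0.5.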

Coverage we drew on

MADE: A Living Benchmark for Multi-Label Text Classification with Uncertainty Quantification of Medical Device Adverse Events · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: CHASM · Rednote · Multimodal Large Language Models

Read the full story at arXiv cs.CL (arxiv.org)

Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.


Modelwire summarizes; we don’t republish. The full content lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.