Research Models & Releases · arXiv cs.CL · May 28

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

Conversational AI has largely ignored the visual and gestural layer of human interaction, treating dialogue as speech-only. VideoFDB addresses this gap by introducing the first benchmark for evaluating agents that must both perceive and generate nonverbal cues alongside audio in real-time two-way exchanges. The dataset spans 237 video call clips annotated for 11 distinct conversational dynamics, paired with a rubric-based evaluation framework that separates perception from generation tasks. This work signals a maturation in multimodal agent design, pushing the field beyond speech-centric full-duplex systems toward embodied conversational intelligence that mirrors human social presence.
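To make the perception/generation split concrete, here is a minimal sketch of how rubric scores might be aggregated per task type. It is not VideoFDB's actual schema or scoring code: the `RubricScore` fields, the 0-to-1 score scale, the clip IDs, and the dynamic names (`turn_taking`, `backchannel`) are all hypothetical placeholders chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RubricScore:
    """One hypothetical rubric judgment for one clip (not the paper's schema)."""
    clip_id: str
    dynamic: str   # illustrative label for a conversational dynamic
    task: str      # "perception" or "generation"
    score: float   # assumed scale: 0.0 (fails rubric) to 1.0 (fully meets it)


def mean_by_task(scores):
    """Average rubric scores separately for each task type."""
    buckets = defaultdict(list)
    for s in scores:
        buckets[s.task].append(s.score)
    return {task: mean(vals) for task, vals in buckets.items()}


# Made-up example data; real annotations would come from the benchmark itself.
scores = [
    RubricScore("clip_001", "turn_taking", "perception", 1.0),
    RubricScore("clip_001", "turn_taking", "generation", 0.5),
    RubricScore("clip_002", "backchannel", "perception", 0.5),
    RubricScore("clip_002", "backchannel", "generation", 0.0),
]

report = mean_by_task(scores)
print(report)
```

Reporting the two task types separately, rather than as one blended number, is what lets a reader see whether an agent that reads nonverbal cues well can also produce them.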

Modelwire context

Explainer

The harder problem VideoFDB surfaces is not perception but generation. Producing contextually appropriate nonverbal responses in real time, synchronized with speech, is a constraint that no existing benchmark has formally decomposed. The 11-category annotation rubric carries much of the evaluative weight here, and its validity as a proxy for human social presence has not yet been tested against actual user studies.

This connects most directly to the LoMo paper covered the same day, which identified that current vision-language architectures process visual and textual signals asymmetrically at a structural level. VideoFDB essentially stress-tests that same asymmetry in a dynamic, temporal setting where the modality gap compounds across a live exchange rather than a static query. Both papers are pointing at the same underlying problem from different angles: multimodal models are not actually fusing modalities; they are tolerating them. The Qwen-VLA coverage is also relevant context, since embodied action generation via diffusion decoders is one plausible technical path toward the real-time nonverbal generation VideoFDB is trying to evaluate.

Watch whether any of the major full-duplex voice labs (Google, Hume, or similar) adopt VideoFDB as an external evaluation target within the next six months. Adoption by a third party would validate the rubric; continued silence would suggest the benchmark's annotation categories do not map cleanly onto how production systems are actually built.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: VideoFDB · full-duplex conversational agents · audio-visual-to-audio-visual (AV2AV)

Read the full story at arXiv cs.CL (arxiv.org)


