Research Models & Releases·arXiv cs.CL·May 25

Forgotten Words: Benchmarking NeoBERT for Dementia Detection in Low-Resource Conversational Filipino and English Speech

Illustration accompanying: Forgotten Words: Benchmarking NeoBERT for Dementia Detection in Low-Resource Conversational Filipino and English Speech

Researchers have built the first NLP pipeline for dementia detection in Filipino speech, addressing a critical gap in clinical AI that has remained almost entirely English-focused. By constructing a parallel bilingual dataset of 4,000 manually translated transcripts from DementiaBank, the team isolates language effects from domain-specific cognitive markers, then benchmarks five transformer architectures including NeoBERT in this low-resource setting. The work matters because code-switching populations like the Philippines have been systematically excluded from clinical NLP validation, yet they represent millions of potential users. This establishes both a methodological template for non-English clinical AI and evidence that existing models degrade predictably when domain and language effects interact.

Modelwire context

Explainer

The real novelty is the parallel bilingual dataset construction itself. Rather than simply translating existing English clinical speech data, the researchers manually aligned 4,000 transcripts to isolate which performance drops come from language scarcity versus which come from genuine cognitive signal loss. This lets them measure model degradation predictably, not just observe it.

This connects directly to the WhoSaidIt framework from late May, which treated annotation disagreement as a feature and used iterative expert feedback to stabilize labels across multilingual contexts. Both papers share a core insight: quality datasets for underrepresented populations require deliberate design choices, not just scaling existing English pipelines. The dementia work goes further by making the language-domain interaction explicit through controlled parallel data, whereas WhoSaidIt focused on handling inherent ambiguity in demographic inference. Together they signal the field is moving past treating non-English clinical NLP as a translation problem and toward treating it as a distinct validation requirement.

If the same five transformer architectures are benchmarked on a held-out Filipino clinical cohort (not translated, but native speakers) within the next six months, and performance gaps match the predictions from the bilingual dataset, that confirms the method generalizes. If performance diverges significantly, it suggests the translation process itself introduced artifacts that don't reflect real-world code-switching behavior.

Coverage we drew on

WhoSaidIt: Human-LLM Collaborative Annotation for Text-Based Multilingual Speaker-Attribute Classification · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting. How we write it.

MentionsNeoBERT · BERT · XLM-R · RoBERTa · DementiaBank · Philippines

Read full story at arXiv cs.CL →(arxiv.org)

Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.

Our mission How we write

Modelwire summarizes, we don’t republish. The full content lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.