Research Products & Apps·arXiv cs.CL·1d ago

Transferable Self-Harm Surveillance from Emergency Department Triage Notes Using an Evidence-Augmented Machine Learning Approach


Researchers developed a hybrid machine learning pipeline that combines traditional classifiers with LLM-based screening to detect self-harm signals in emergency department triage notes. It achieved an area under the precision-recall curve (AUPRC) of 0.88 or higher across internal and external validation at three Australian hospitals. The work demonstrates practical transferability of language models in clinical surveillance, where diagnostic coding alone misses critical cases. It is a concrete application of LLMs to high-stakes healthcare screening, where generalization across institutions directly affects public health outcomes.

Modelwire context

Explainer

The paper's core contribution is not self-harm detection itself but the demonstration that a two-stage pipeline (traditional classifiers plus LLM screening) outperforms diagnostic coding alone. The practical insight: triage notes contain clinical signals that structured billing codes systematically miss, and LLMs can surface them reliably enough to generalize across different hospital systems.

This work offers a counterpoint to the broader LLM evaluation landscape we covered recently. ClinEnv (early June) argued that static benchmarks fail to capture real clinical decision-making under incomplete information and sequential constraints. This self-harm detection study supports that argument from the opposite direction: it shows LLMs can handle a narrower, well-defined screening task with consistent performance across institutional contexts. Where ClinEnv demands that agents query multiple specialized tools before committing to treatment, this pipeline succeeds by doing something simpler: flagging high-risk cases for human review. The difference matters because it suggests LLMs have a viable role in triage and surveillance, even if they are not yet ready for fully autonomous clinical reasoning.

If the same hybrid architecture (classifier plus LLM) is tested within the next 12 months on a fourth Australian hospital that was not part of the original validation set and maintains an AUPRC of 0.88 or higher, that would support genuine transferability. If performance drops below 0.80, it would signal that the model overfit to the three hospitals in the study, and the generalization claim collapses.
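For readers who want to see what that check involves, here is a minimal sketch in Python. It is not the paper's evaluation code. It assumes AUPRC is estimated with average precision (a common step-wise estimate), that model scores have no ties, and that results between 0.80 and 0.88 are treated as inconclusive, a band the text above does not define. The function names `average_precision` and `transfer_verdict` and the toy labels and scores are hypothetical.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise estimate of AUPRC: mean precision at each true positive.

    Assumes binary labels (1 = self-harm presentation) and no tied scores.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank notes from highest to lowest model score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / total_pos


def transfer_verdict(auprc: float,
                     confirm_at: float = 0.88,
                     collapse_below: float = 0.80) -> str:
    """Apply the two thresholds from the text; the middle band is our assumption."""
    if auprc >= confirm_at:
        return "consistent with transferability"
    if auprc < collapse_below:
        return "suggests overfitting to the original sites"
    return "inconclusive"


# Toy data standing in for a held-out fourth hospital (hypothetical values).
toy_labels = [1, 0, 1, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
toy_auprc = average_precision(toy_labels, toy_scores)
print(f"AUPRC estimate: {toy_auprc:.3f} -> {transfer_verdict(toy_auprc)}")
```

In practice a held-out site would supply thousands of notes, and a library routine such as scikit-learn's `average_precision_score` would handle tied scores properly; the sketch only shows how the 0.88 and 0.80 thresholds would be applied to a single number.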

Coverage we drew on

ClinEnv: An Interactive Multi-Stage Long Horizon EHR Environment for Agents · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Large Language Models · Emergency Department Triage · Machine Learning · Self-harm Detection · Australian Hospitals

Read the full story at arXiv cs.CL (arxiv.org)

Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.


Modelwire summarizes, we don’t republish. The full content lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.

Research

When Rating Scales Fall Short: LLM-Assisted Discovery of ADHD Signals in Turkish Teacher Narratives

arXiv cs.CL·1d ago

Research

Towards Multidisciplinary Summarization of Hospital Stays: Efficient Sentence-Level Clinical Provenance Categorization

arXiv cs.CL·1d ago

Research

Food Noise & False Safety: A Systematic Evaluation of How LLMs Fail to Adapt to Eating Disorder Queries with Clinician Feedback

arXiv cs.CL·1d ago
