SwanNLP at SemEval-2026 Task 5: An LLM-based Framework for Plausibility Scoring in Narrative Word Sense Disambiguation


SwanNLP's SemEval-2026 submission tests LLM plausibility scoring on narrative word sense disambiguation, comparing fine-tuned smaller models against few-shot prompted large models. The work bridges a gap between benchmark performance and real-world story understanding, revealing how different model scales handle contextual sense selection.

Modelwire context

Explainer

The framing around 'plausibility scoring' is the detail worth unpacking. Rather than asking a model to pick the correct sense from a lexicon, the task asks it to rate how believable a given sense interpretation is within a narrative context. That is a softer, more graded signal than traditional WSD evaluation, which scores a single correct choice.
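To make the contrast concrete, here is a minimal sketch that scores the same predictions two ways: as a hard top-sense choice and as graded ratings compared by rank correlation. Everything in it is illustrative. The stories, senses, and ratings are invented, the 1-to-5 scale is an assumption, and Spearman correlation is used only as a common way to compare graded scores, not as the task's confirmed official metric.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def top_sense_accuracy(items):
    """Traditional-style check: does the model's top sense match the human's?"""
    by_story = {}
    for story, sense, human, model in items:
        by_story.setdefault(story, []).append((sense, human, model))
    hits = 0
    for senses in by_story.values():
        human_top = max(senses, key=lambda s: s[1])[0]
        model_top = max(senses, key=lambda s: s[2])[0]
        hits += human_top == model_top
    return hits / len(by_story)


# Hypothetical data: (story, sense, human rating, model rating), assumed 1-5 scale.
items = [
    ("story-1", "bank: river edge", 4.6, 4.0),
    ("story-1", "bank: financial institution", 1.8, 2.0),
    ("story-2", "bat: animal", 3.2, 3.5),
    ("story-2", "bat: club", 2.4, 1.5),
]

accuracy = top_sense_accuracy(items)
rho = spearman([h for _, _, h, _ in items], [m for _, _, _, m in items])
print(f"top-sense accuracy: {accuracy:.2f}")
print(f"Spearman correlation: {rho:.2f}")
```

On this toy data the model picks the same top sense as the human raters in both stories (accuracy 1.00), yet the rank correlation is 0.80 because it orders the two less plausible senses differently. That gap is the kind of information a graded evaluation can surface and a pick-one evaluation hides.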

The reliability question lurking inside this paper connects directly to coverage from the day prior. The 'Diagnosing LLM Judge Reliability' piece found that even when aggregate consistency looks high, a large share of individual documents produce logically inconsistent pairwise judgments. Plausibility scoring in WSD faces the same structural problem: a model can appear calibrated at the dataset level while being erratic on specific narrative contexts. Similarly, 'DiscoTrace' from the same period showed that LLMs systematically favor breadth over selectivity when constructing answers, which maps onto the risk that few-shot prompted large models here may score many senses as plausible rather than committing to the contextually appropriate one.
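The "logically inconsistent pairwise judgments" idea can be illustrated with a small check for transitivity violations. This is a sketch under stated assumptions, not the method of either paper: the function name `count_cyclic_triples`, the sense labels, and the judgments are all hypothetical, and the code assumes every pair of senses has been judged exactly once.

```python
from itertools import combinations


def count_cyclic_triples(prefers):
    """Count triples (a, b, c) whose pairwise judgments form a cycle.

    `prefers` maps an ordered pair (a, b) to True if a was judged more
    plausible than b, and False if b was judged more plausible than a.
    Assumes each unordered pair appears exactly once.
    """
    items = sorted({x for pair in prefers for x in pair})

    def beats(a, b):
        if (a, b) in prefers:
            return prefers[(a, b)]
        return not prefers[(b, a)]

    cycles = 0
    for a, b, c in combinations(items, 3):
        # Cyclic when a>b, b>c, c>a all hold, or when all three are reversed.
        if beats(a, b) == beats(b, c) == beats(c, a):
            cycles += 1
    return cycles


# Hypothetical judgments over three candidate senses in one narrative.
consistent = {
    ("sense-1", "sense-2"): True,
    ("sense-2", "sense-3"): True,
    ("sense-1", "sense-3"): True,
}
inconsistent = {
    ("sense-1", "sense-2"): True,
    ("sense-2", "sense-3"): True,
    ("sense-1", "sense-3"): False,  # sense-3 preferred over sense-1: a cycle
}

print(count_cyclic_triples(consistent))
print(count_cyclic_triples(inconsistent))
```

Running the check per narrative rather than over the pooled dataset is the point: a low overall violation rate can still hide individual contexts where the model's preferences loop back on themselves.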

If the fine-tuned smaller models outperform few-shot large models on the official SemEval-2026 Task 5 leaderboard once final scores are published, that would suggest narrative context requires task-specific adaptation that prompt engineering alone cannot replace.

Coverage we drew on

Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations · arXiv cs.LG

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: SwanNLP · SemEval-2026 Task 5 · LLM

Read full story at arXiv cs.CL (arxiv.org)
