Modelwire

A curated feed of what matters in AI. Independent, ad-supported, built in Denver, Colorado.

© 2026 Modelwire. All article links go to the original publishers. Summaries generated by Modelwire. We don’t republish full articles.

arXiv cs.CL

https://arxiv.org/list/cs.CL/recent · Editorial weight 5/10

Study finds LLMs violate basic probability laws in conditional reasoning

Researchers probe whether large language models actually behave like probabilistic systems when prompted in context. Using recursive population partitioning and binary tree structures, they test whether LLM outputs satisfy the law of total probability, a foundational principle that should hold if in-context learning truly functions as conditional inference. The work exposes gaps between how we theorize LLM behavior and what models actually compute, with implications for reliability in downstream applications and our understanding of what in-context learning mechanisms accomplish.
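The consistency check named in this summary can be sketched in a few lines. This is a minimal illustration of the law of total probability, P(A) = Σ P(A | B_i) · P(B_i), applied to one split of a binary tree; the function name and the numbers are hypothetical and not taken from the paper.

```python
from math import isclose

def total_probability_gap(p_event, branches):
    """Return |P(A) - sum_i P(A | B_i) * P(B_i)| for one partition.

    p_event  : probability assigned to event A at the parent node
    branches : list of (P(B_i), P(A | B_i)) pairs for the child nodes
    """
    weight_sum = sum(p_b for p_b, _ in branches)
    if not isclose(weight_sum, 1.0, abs_tol=1e-9):
        raise ValueError("partition weights must sum to 1")
    # Rebuild P(A) from the children and compare with the parent's value.
    reconstructed = sum(p_b * p_a_given_b for p_b, p_a_given_b in branches)
    return abs(p_event - reconstructed)

# Hypothetical numbers for illustration only.
coherent = total_probability_gap(0.62, [(0.4, 0.5), (0.6, 0.7)])
incoherent = total_probability_gap(0.80, [(0.4, 0.5), (0.6, 0.7)])
```

A gap near zero means the parent and child estimates agree; a large gap is the kind of violation the study reports looking for.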

arXiv cs.CL·2d ago

62

Research Tools & Code

New benchmark teaches AI to revise scientific figures from paper edits

Researchers have released SciDiagramEdit, a benchmark and framework that automates the revision of scientific figures through natural language instructions. The system learns from real paper edits and operates on vector-based diagram sources, allowing researchers to co-edit with an AI agent rather than manually redrawing components. This addresses a genuine friction point in academic publishing: the iterative refinement of complex infographics containing schematics, plots, photos, and captions. The work signals growing interest in AI agents that understand domain-specific visual grammars and can collaborate on specialized editing tasks, opening pathways for similar tools across technical documentation and design workflows.

arXiv cs.CL·2d ago

58

Web-scale poisoning attacks can corrupt LLM pretraining at scale

Researchers have demonstrated that large language models can be compromised during pretraining through poisoning attacks injected via public web interfaces, a vector far more scalable than prior work targeting isolated datasets like Wikipedia. The study introduces HalfLife, a measurement framework for detecting adversarial content that survives web crawling and data curation pipelines. This work exposes a critical supply-chain vulnerability in how foundation models ingest internet-scale data, suggesting that malicious actors need not compromise centralized repositories to corrupt model behavior at scale. The findings reshape threat modeling for pretraining and highlight why data provenance and filtering remain unsolved problems in the industry.

arXiv cs.CL·2d ago

68

Static retrieval scores miss causal value in multi-turn agent search

Researchers expose a fundamental gap between how retrieval systems are benchmarked and how they perform in multi-turn agentic workflows. Traditional evaluation scores documents by immediate answer improvement, but agents benefit from intermediate documents that enable better downstream reasoning without directly answering the current query. Using counterfactual trajectory analysis on HotpotQA, the work quantifies this mismatch and suggests that static retrieval metrics systematically undervalue documents with high causal utility in reasoning chains. This finding reshapes how teams should evaluate and train retrieval components for production agents.

arXiv cs.CL·2d ago

62

Benchmark wins mask reasoning failures in clinical multimodal AI

Clinical AI systems optimized for benchmark performance often fail to produce trustworthy reasoning in practice. This retrospective analysis of nine multimodal VQA systems from MediaEval Medico 2025 reveals that parameter-efficient fine-tuning wins on leaderboards without guaranteeing faithful explanations or robust handling of diverse question types. Systems enforcing structured reasoning and explicit evidence grounding showed more reliable clinical behavior, suggesting the field needs evaluation metrics that go beyond lexical overlap, as well as standardized evidence-linked explanations. The finding challenges the assumption that downstream task performance correlates with interpretability, a critical gap for healthcare deployment.

arXiv cs.CL·2d ago

58

Research Tools & Code

TikStance dataset enables multimodal stance detection in short-form political video

Researchers have released TikStance, a multimodal dataset linking 161 political videos with 13,876 hierarchical comments from the 2024 U.S. election cycle. The resource preserves both audiovisual and conversational context for stance detection, addressing a critical gap in training data for short-form video analysis. As political discourse migrates to platforms like TikTok, this dataset enables development of models that understand nuanced positions across modalities and nested discussions, directly supporting the next generation of content moderation and political discourse analysis systems.

arXiv cs.CL·2d ago

58

Log-ratio geometry recovers efficient language identification without neural networks

Researchers propose a mathematically grounded alternative to neural language identification by treating character and bigram frequencies as compositional data mapped through log-ratio geometry. The approach recovers linear-time efficiency of classical n-gram methods while addressing a fundamental statistical flaw: frequency distributions live on a simplex, not Euclidean space, making standard distance metrics inappropriate. By applying the centered log-ratio transformation, the method aligns computational geometry with statistical reality, enabling sparse feature handling via Laplace smoothing. This work signals renewed interest in principled statistical foundations for NLP tasks often assumed to require deep learning, relevant to practitioners balancing accuracy, latency, and resource constraints in production systems.
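The centered log-ratio (CLR) step described above can be sketched as follows. This is a minimal illustration rather than the paper's implementation: it assumes a Latin-only alphabet, uses single characters without bigrams, and the names `clr_profile` and `clr_distance` are hypothetical.

```python
from collections import Counter
from math import log, sqrt

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumption: Latin letters only

def clr_profile(text, alphabet=ALPHABET, alpha=1.0):
    """Map character frequencies to a centered log-ratio vector."""
    counts = Counter(ch for ch in text.lower() if ch in alphabet)
    # Laplace smoothing keeps unseen characters away from log(0).
    total = sum(counts.values()) + alpha * len(alphabet)
    log_parts = [log((counts[ch] + alpha) / total) for ch in alphabet]
    # Centering: subtract the mean log, i.e. the log of the geometric mean.
    mean_log = sum(log_parts) / len(log_parts)
    return [value - mean_log for value in log_parts]

def clr_distance(u, v):
    """Euclidean distance between CLR vectors (the Aitchison distance)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

english = clr_profile("the quick brown fox jumps over the lazy dog")
german = clr_profile("der schnelle braune fuchs springt ueber den faulen hund")
gap = clr_distance(english, german)
```

Each profile is built in one pass over the text, which is where the linear-time cost comes from; a classifier along these lines would assign a text to the language whose reference profile is nearest.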

arXiv cs.CL·2d ago

52

Research Models & Releases

XLM-R extended with Ge'ez vocabulary to fix African language tokenization

Researchers have identified a critical bottleneck in multilingual AI: standard tokenizers trained on Latin-script data severely degrade performance on non-Latin languages like Amharic and Tigrinya. VEXMLM addresses this by extending XLM-R with 30,000 Ge'ez-script subwords, trained on curated monolingual corpora and initialized through embedding averaging. The approach targets 19 African languages, tackling both vocabulary gaps and fragmentation that plague low-resource, non-Latin-script communities. This work signals growing recognition that universal pretraining assumptions break down in the face of linguistic diversity, forcing the field to rethink tokenization as a foundational design choice rather than a solved problem.
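The summary does not say which averaging variant is used. One common reading, sketched here as an assumption, is that each new subword starts from the mean of the embeddings of the pieces the original tokenizer would have split it into. The function name and the toy table are hypothetical.

```python
def init_new_embedding(piece_ids, embedding_table):
    """Average the existing embeddings of the old tokenizer's pieces.

    piece_ids       : ids of the pieces the old tokenizer produced for the new subword
    embedding_table : list of embedding vectors indexed by token id
    """
    if not piece_ids:
        raise ValueError("a new token needs at least one source piece")
    vectors = [embedding_table[i] for i in piece_ids]
    dim = len(vectors[0])
    # Component-wise mean across the source pieces.
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]

# Toy two-dimensional table for illustration only.
toy_table = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
new_vector = init_new_embedding([0, 1], toy_table)
```

Starting from an average rather than random values keeps the new token near a region of embedding space the pretrained model already handles.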

arXiv cs.CL·2d ago

62

Researchers decompose masked diffusion RL into token and masking objectives

Researchers have cracked a longstanding challenge in reinforcement learning for masked diffusion language models by decomposing the policy gradient into two distinct optimization targets: token prediction and position unmasking strategy. Prior work treated generation as a single decision problem, but this work recognizes that MDLMs make sequential choices about both what to generate and where to generate it. By optimizing both components jointly, the approach achieves state-of-the-art performance on mathematical reasoning and code generation tasks. This matters because it opens a new pathway for applying RL to non-autoregressive architectures, potentially enabling faster inference while maintaining reasoning quality.

arXiv cs.CL·2d ago

62

Research Models & Releases

Transformer variant preserves reasoning state across decoding steps

Researchers propose T2MLR, an architectural modification that addresses a fundamental bottleneck in transformer inference: the compression of reasoning state into discrete tokens during autoregressive decoding. By caching middle-layer representations and injecting them into earlier layers of subsequent positions, the approach preserves abstract computation across decoding steps with minimal overhead. Results show consistent gains over parameter-matched baselines on both pretraining and multi-hop reasoning tasks. This technique matters because it targets a real efficiency and capability ceiling in current LLMs, suggesting a path toward more persistent reasoning without scaling model size or compute.

arXiv cs.CL·2d ago

62

Research Models & Releases

Gemini outperforms humans on scientific visualization, most MLLMs lag

A new benchmark reveals significant gaps in how multimodal models interpret scientific visualizations, a capability increasingly critical as these systems move into research and education workflows. Testing six leading MLLMs against a 49-item assessment spanning diverse SciVis techniques showed uneven performance, with Gemini outperforming human averages but others lagging substantially. The finding matters because chart-reading benchmarks have masked deeper literacy deficits, and as organizations deploy these models for data analysis and scientific communication, understanding their actual visualization reasoning becomes a reliability and safety concern for downstream users.

arXiv cs.CL·2d ago

62

Researchers separate grammaticality from probability in language model internals

Researchers challenge the dominant paradigm for measuring grammatical knowledge in language models by moving beyond probability-based metrics. The work investigates whether grammaticality is encoded as a distinct feature in model internals, rather than conflated with likelihood, lexical frequency, and world knowledge. This distinction matters for interpretability: if models encode grammar as a separable representation, it reshapes how we evaluate their linguistic competence and debug failure modes. The finding could influence how practitioners design probes and evals for downstream tasks requiring robust syntactic reasoning.

arXiv cs.CL·2d ago

58

Research Tools & Code

Clinicians build safety taxonomy for medical AI model failures

Medical AI safety has lacked a systematic failure taxonomy. MedFailBench introduces a clinician-authored benchmark that categorizes model errors by severity and failure mode, not just accuracy. The framework identifies six distinct safety gates: missed escalations, unsafe dosing, inappropriate discharge reassurance, hallucinated evidence, protocol violations, and unsupported claims. This shifts evaluation from binary correctness toward granular risk profiling, enabling developers to stress-test models against realistic clinical failure patterns. The open-source release with automated screening pipelines establishes infrastructure for safety-focused model iteration in healthcare, addressing a gap where traditional benchmarks miss high-stakes boundary violations.

arXiv cs.CL·2d ago

62

Grok encyclopedia audit reveals LLM bias persists across judges

Researchers conducted a large-scale audit comparing political bias in Grok-authored Grokipedia against Wikipedia by analyzing 1,394 government member articles across nine ideological dimensions using four LLM judges (Grok, Claude, Mistral, DeepSeek). The study directly tests whether LLM-generated content achieves genuine neutrality or simply redistributes bias, while also examining whether the judges themselves exhibit systematic political leanings. This work exposes a critical tension in AI-driven knowledge systems: as LLMs become primary information sources, their embedded ideologies may shape democratic discourse in ways that differ from but don't necessarily improve upon existing platforms.

arXiv cs.CL·2d ago

62

LLM agents trained to sustain partisan positions in coalition simulations

Researchers have developed a multi-agent framework that enables LLMs to sustain partisan political positions during coalition negotiations, addressing a fundamental limitation in current models. By combining supervised fine-tuning, direct preference optimization, and retrieval-augmented generation tied to party manifestos, the system overcomes RLHF-induced neutrality biases that typically flatten ideological commitment. The work operationalizes this approach on real electoral data, suggesting computational political science can now model adversarial negotiation dynamics with ideologically coherent agents rather than consensus-seeking proxies. This matters for understanding how AI systems might simulate or influence multi-stakeholder policy formation.

arXiv cs.CL·2d ago

58

Research Tools & Code

Self-validating rubrics emerge from queries without human labels

Researchers propose Rubrics on Trial, a method for automatically generating and validating evaluation rubrics from user queries alone, without human annotation or model retraining. The framework bootstraps rubric quality by synthesizing response pairs conditioned on candidate rubrics, then tests each proposal's ability to meaningfully distinguish answer quality before incorporation. This addresses a critical bottleneck in LLM training and evaluation: the difficulty of constructing reliable, task-specific scoring criteria. For practitioners building custom evaluators or fine-tuning models, this reduces dependency on expensive human-labeled preference data while maintaining rigor through synthetic validation.

arXiv cs.CL·2d ago

58

Research Models & Releases

New benchmark tests AI agents across 354 real-world application domains

Researchers have built OmniaBench, a comprehensive evaluation framework that tests AI agents across 354 distinct application domains spanning consumer, business, and enterprise use cases. The benchmark addresses a critical gap in agent assessment: existing evaluations remain siloed around narrow tool sets or interaction patterns, obscuring how well models generalize across real-world deployment scenarios. By grounding domains in app store data, product documentation, and industry resources, OmniaBench creates a hierarchical taxonomy that lets practitioners measure agent robustness at scale. This matters because as LLMs transition from text completion to autonomous task execution, systematic cross-domain evaluation becomes essential for identifying capability ceilings and deployment readiness.

arXiv cs.CL·2d ago

62

New detection method tracks how AI text evolves through latent space

Researchers propose a fundamentally different lens for detecting AI-generated text by modeling how semantic representations shift across a document's sequence rather than analyzing static aggregate features. The Geometric Trajectory and Contrastive Learning framework treats generation as a dynamic process unfolding through latent space, segmenting text into ordered units and learning to distinguish human writing patterns from autoregressive model outputs. This trajectory-based approach addresses a blind spot in current detection methods and could reshape how systems identify synthetic content as language models become harder to distinguish from human writing.

arXiv cs.CL·2d ago

58

Graph-based reasoning patterns outperform surface features for LLM detection

Researchers have moved beyond surface-level linguistic fingerprinting to detect LLM authorship by analyzing reasoning structures within generated text. Using graph neural networks to extract and map argument patterns, the team demonstrates substantially higher robustness against paraphrasing attacks compared to traditional transformer baselines. This shift toward deeper semantic signals matters because it raises the bar for detection evasion, forcing future obfuscation techniques to manipulate reasoning itself rather than just vocabulary and syntax. The work signals a maturing arms race in LLM provenance verification, with implications for content authenticity, academic integrity, and trust in AI-generated outputs.

arXiv cs.CL·2d ago

62

Research Models & Releases

Instruction tuning and merging extend reasoning models to unverifiable domains

Researchers have identified a practical pathway to extend reasoning models beyond domains with automated verification, addressing a fundamental bottleneck in reinforcement learning-driven model development. By combining instruction tuning on human-authored solutions with model merging, the work recovers performance gains that would otherwise require expensive RL infrastructure. This technique matters because it unlocks adaptation of reasoning capabilities to subjective or hard-to-verify domains like open-ended writing or strategy, where supervised data exists but reward signals don't. The approach signals a shift toward hybrid training regimes that blend classical fine-tuning with modern reasoning architectures, potentially democratizing reasoning model customization across industries lacking verification infrastructure.

arXiv cs.CL·2d ago

62

Research Policy & Regulation

Finetuning on benign data causes ideological drift across unrelated domains

Researchers demonstrate that finetuning language models on narrow, benign datasets produces unexpected ideological drift across unrelated domains. Training GPT-4.1 on economics Q&A shifted outputs on criminal justice, environment, and cultural topics; similar effects emerged from HR policy and finance datasets. The phenomenon, termed ideological generalisation, reveals a critical deployment risk: models can absorb and amplify latent value systems embedded in training data without explicit instruction, even when individual examples pass moderation review. This challenges assumptions about domain-specific adaptation and raises questions about how organizations can safely customize models without inadvertently encoding systematic biases.

arXiv cs.CL·2d ago

72

LLMs challenge specialized classifiers on German library indexing task

Researchers at the German National Library benchmarked supervised extreme multi-label classification against LLM-based approaches for automated subject indexing of scientific literature. The study directly tests whether generative models outperform specialized XMLC algorithms on a real-world library task with thousands of controlled vocabulary terms. Results matter for institutions managing large document collections: if LLMs prove competitive or superior, it reshapes how libraries and archives approach metadata automation, potentially consolidating workflows around foundation models rather than domain-specific classifiers.

arXiv cs.CL·2d ago

52

Larger LLMs burn energy faster than they earn in survival economy simulation

Researchers have built a controlled simulation where LLM-based agents face genuine survival constraints tied to computational cost. The Energy Society framework reveals that larger models structurally overspend relative to their earnings, even when token costs are decoupled from model size, suggesting scale itself creates economic inefficiency in multi-agent settings. Critically, cooperative incentive structures substantially reshape agent behavior compared to competitive baselines, pointing to how economic design shapes emergent LLM cooperation patterns. This work matters for understanding whether scaling and market incentives naturally align in deployed multi-agent systems.

arXiv cs.CL·2d ago

62

Research Models & Releases

New RL framework converts sparse rewards into reusable agent skills

Researchers introduce SEED, a reinforcement learning framework that addresses a critical bottleneck in agent training: converting sparse episode-level rewards into dense token-level guidance. The method distills completed trajectories into natural-language skills that capture decision patterns, then feeds these insights back into the policy model. This bridges the supervision gap that has limited RL effectiveness for long-horizon LLM agents, offering a practical pathway to improve multi-turn reasoning and tool-use tasks without requiring dense reward engineering.

arXiv cs.CL·2d ago

62

Dialogue summarization framework incorporates emotion dynamics across speakers

Researchers propose a hierarchical framework for dialogue summarization that jointly models semantic content and emotional tone across multiple speakers. The approach decomposes conversations into topic-driven segments and participant-specific utterance clusters, then generates summaries that preserve emotional context through multimodal inputs. This work addresses a gap in summarization research, which has historically focused on single-author texts like articles and reports. The technique matters for conversational AI systems handling customer service, meeting transcription, and interview analysis, where speaker dynamics and sentiment shifts carry material meaning that traditional extractive methods miss.

arXiv cs.CL·2d ago

52

Research Models & Releases

Small models solve hard reasoning via symbolic code generation

Team CoTu's entry into the EXACT 2026 competition demonstrates a practical path for transparent AI reasoning without scale: a neuro-symbolic pipeline that grounds 4B-parameter models in formal logic and executable code rather than black-box token prediction. By routing regulation queries through Z3 constraint solvers and physics problems through symbolic computation, the approach trades inference speed for verifiability, a tradeoff increasingly relevant as institutions demand explainability alongside accuracy. This signals a maturing recognition that reasoning transparency may require hybrid architectures, not just larger models.

arXiv cs.CL·2d ago

58

GPT-2 detection model flags autistic writing at elevated rates

A new empirical study challenges the reliability of AI detection systems, revealing that GPT-2 detection models systematically misclassify autistic writing at higher rates than general text. Using 60,000 Reddit posts, researchers found that while overall false-positive rates remain low, neurodivergent communication patterns trigger detection algorithms disproportionately. This exposes a critical bias vector in content moderation and authenticity verification pipelines that many platforms rely on, suggesting detection models encode linguistic assumptions that penalize non-neurotypical expression. The finding underscores how AI safety tooling can inadvertently harm minority populations through statistical artifacts rather than intentional design.

arXiv cs.CL·2d ago

62

Single LLM pass beats multi-agent debate for research paper feedback

A pre-registered experiment testing multi-agent debate as a mechanism for improving AI feedback found that simpler single-pass LLM analysis outperformed two specialized debate systems on real research papers. Across 44 meta-analyses in economics, authors ranked a frontier model's direct critique above both debate variants, despite one system consuming 30x more tokens. The finding challenges a popular assumption in AI reasoning research: that orchestrating multiple model instances in adversarial or collaborative setups reliably produces better outputs. This has implications for how teams architect AI-assisted research tools and suggests efficiency gains may not justify the computational overhead of debate frameworks.

arXiv cs.CL·2d ago

62

Research Tools & Code

Execution-verified code distillation improves financial reasoning in smaller models

Researchers have developed a distillation method that transfers numerical reasoning capabilities from large language models to smaller ones by using execution-verified Python programs as supervision signals rather than natural-language explanations. The approach addresses a critical weakness in LLM-based financial reasoning: arithmetic errors in textual rationales that corrupt training data. By filtering for programs that execute correctly and match gold answers, the technique ensures higher-quality knowledge transfer for domain-specific tasks requiring hybrid reasoning across tables and text. This matters for practitioners building compact financial AI systems that must balance accuracy with inference efficiency.
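The filtering step described above, keeping only programs that run and reproduce the gold answer, might look like the sketch below. It is an illustration under stated assumptions: each candidate is assumed to store its result in a variable named `answer`, the helper names are hypothetical, and the stripped-down `exec` call is not a real sandbox.

```python
from math import isclose

def run_program(source):
    """Execute candidate code and return its `answer` variable, or None on failure."""
    namespace = {}
    try:
        # Empty builtins limit what the code can call; this is NOT a security boundary.
        exec(source, {"__builtins__": {}}, namespace)
    except Exception:
        return None
    return namespace.get("answer")

def keep_verified(candidates, gold, rel_tol=1e-6):
    """Keep only programs that execute cleanly and match the gold answer."""
    kept = []
    for source in candidates:
        result = run_program(source)
        if isinstance(result, (int, float)) and isclose(result, gold, rel_tol=rel_tol):
            kept.append(source)
    return kept

# Hypothetical teacher outputs for a profit-margin question with gold answer 0.25.
candidates = [
    "revenue = 120.0\ncost = 90.0\nanswer = (revenue - cost) / revenue",  # correct
    "revenue = 120.0\ncost = 90.0\nanswer = (revenue - cost) / cost",     # wrong denominator
    "answer = 1 / 0",                                                     # fails at runtime
]
verified = keep_verified(candidates, gold=0.25)
```

Only the first candidate survives, so only it would be passed on as supervision for the smaller model.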

arXiv cs.CL·2d ago

58

Research Tools & Code

Smaller models with scaffolding beat larger ones in high-stakes supervision

Researchers demonstrate that wrapping smaller language models in deterministic scaffolding (retrieval systems, schema validation, human-in-the-loop gates, audit trails) outperforms larger unstructured models in high-stakes domains. The case study compares a baseline GPT-5 chatbot against a GPT-4o-mini system embedded in a LangGraph harness for academic supervision, revealing a critical shift in production AI: raw model scale matters less than architectural composition when reliability and accountability are non-negotiable. This work signals growing maturity in operationalizing LLMs beyond chat, with implications for regulated industries where explainability and auditability trump raw fluency.

arXiv cs.CL·2d ago

62