Research Products & Apps · arXiv cs.CL · 13h ago

Automated grading of Linux/bash examinations using large language models: a four-level cognitive taxonomy approach

Researchers benchmarked four frontier LLMs (GPT, Claude Opus, Gemini, GLM) against expert judgment in grading Linux/bash exams, using a four-level cognitive taxonomy to assess whether models can reliably award partial credit and recognize equivalent solutions where rule-based autograders fail. The work signals growing viability of LLMs as scalable assessment tools in computing education, where enrollment pressures make manual marking unsustainable. Results matter for educators adopting AI-driven grading and for understanding model reliability in structured evaluation tasks beyond open-ended generation.

Modelwire context

Explainer

The paper doesn't just show that LLMs can grade bash exams; it demonstrates they can award partial credit and recognize functionally equivalent solutions that traditional autograders miss. That's a structural difference in what the tool can do, not just a performance number.
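To make the equivalence point concrete, here is a minimal sketch of the kind of rule-based check that misses a functionally equivalent answer. The function name `exact_match_grade` and the sample commands are illustrative assumptions, not taken from the paper or from any real autograder.

```python
def exact_match_grade(expected: str, submitted: str) -> int:
    """Hypothetical rule-based check: 1 point only for a character-for-character match."""
    return 1 if submitted.strip() == expected.strip() else 0


# Illustrative data (assumed): both commands list all files in long format,
# because short flags to `ls` can be combined in either order.
reference_answer = "ls -la"
student_answer = "ls -al"

score = exact_match_grade(reference_answer, student_answer)
print(score)  # 0: the equivalent answer gets no credit under string matching
```

A grader that reasons about what the command does, rather than how it is spelled, would accept both answers. That is the capability the paper tests for.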

This connects directly to the emotion taxonomy and persona stability work from early July. Like the affective gap study (which found Gemini at 39.9% on fine-grained emotion classification) and the persona instability research, this paper benchmarks whether LLMs can maintain consistent, reliable judgment on structured tasks where nuance matters. The Linux grading task is analogous: it requires the model to recognize intent and equivalence, not just pattern-match. The key difference is that grading has a clear ground truth (expert judgment), whereas emotion and persona consistency lack that anchor. If LLMs struggle with persona stability in multiple-choice question answering (MCQA), the question here is whether they are more reliable when grading against expert consensus than when maintaining internal coherence.

If the same four models (GPT, Claude Opus, Gemini, GLM) show inter-rater agreement with human experts above 85% on a held-out test set of bash problems not seen during development, that would signal genuine transfer. If agreement drops below 75% on novel problem types (e.g., shell scripting vs. system administration tasks), the taxonomy approach hasn't solved the generalization problem and educators should treat this as task-specific, not a universal grading solution.
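The thresholds above can be sketched as a simple check. This is a minimal illustration, assuming "agreement" means the share of answers where the model's grade exactly matches the expert's grade; the paper may use a different statistic. The function names, the "inconclusive" label for the 75% to 85% band, and the sample grades are all assumptions for illustration. The sketch also folds the two conditions (held-out set, novel problem types) into one function for brevity.

```python
from typing import Sequence


def percent_agreement(model_grades: Sequence[int], expert_grades: Sequence[int]) -> float:
    """Share of items (in percent) where the model and expert gave the same grade."""
    if len(model_grades) != len(expert_grades) or not expert_grades:
        raise ValueError("grade lists must be non-empty and the same length")
    matches = sum(m == e for m, e in zip(model_grades, expert_grades))
    return 100.0 * matches / len(expert_grades)


def interpret(agreement_pct: float) -> str:
    """Apply the 85% / 75% thresholds from the text; the middle band is an assumed label."""
    if agreement_pct > 85.0:
        return "transfer"
    if agreement_pct < 75.0:
        return "task-specific"
    return "inconclusive"


# Hypothetical grades on a four-level scale (0 to 3), not data from the paper.
model = [3, 2, 1, 0]
expert = [3, 2, 1, 1]
agreement = percent_agreement(model, expert)
print(agreement, interpret(agreement))
```

Raw percent agreement does not correct for agreement by chance, so a chance-corrected statistic such as Cohen's kappa would be a reasonable alternative if the grade distribution is skewed.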

Coverage we drew on

Quantifying the Affective Gap: A Zero-Shot Evaluation of LLMs on Fine-Grained Emotion Taxonomies · arXiv cs.CL

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: GPT · Claude Opus · Gemini · GLM
