November 11, 2025
Machine learning needs ground truth data - a set of inputs and outputs deemed “correct” - to learn from. For many machine learning applications such as image recognition, speech transcription, or even self-driving cars, humans can reliably generate this data, and the idea of “correct” is rarely in question.
But when it comes to subjective fields such as grading and feedback, and as machines move from simply recognizing and labeling to forming nuanced judgments, defining “correct” becomes part of the problem itself.
In 2012, the Hewlett Foundation sponsored the Automated Student Assessment Prize (ASAP), a competition to create automated essay graders. The competition compiled a dataset for training and testing that continues to serve as a benchmark for the field today (The Hewlett Foundation, 2012). For each of the 12,977 essays in the training dataset, the “true grade” was generated by two different human graders. Against each other, those two graders achieved a QWK of 0.77, a Spearman’s rho of 0.75, 62% exact agreement, and 92% agreement within one point - close, but still considerably imperfect, alignment.
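Quadratic weighted kappa (QWK) penalizes disagreements by the square of their distance, so a one-point gap costs far less than a three-point gap. A minimal sketch of how two graders might be compared - the scores below are illustrative, not drawn from the ASAP data:

```python
import numpy as np

def quadratic_weighted_kappa(a, b, min_rating, max_rating):
    """Cohen's kappa with quadratic weights between two raters' integer scores."""
    a = np.asarray(a) - min_rating
    b = np.asarray(b) - min_rating
    n = max_rating - min_rating + 1
    # Observed confusion matrix between the two raters
    observed = np.zeros((n, n))
    for i, j in zip(a, b):
        observed[i, j] += 1
    # Expected matrix under independence (outer product of the marginals)
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    # Quadratic disagreement weights: 0 on the diagonal, growing with distance
    idx = np.arange(n)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n - 1) ** 2
    return 1 - (weights * observed).sum() / (weights * expected).sum()

# Illustrative scores on a 1-6 scale (not real ASAP grades)
grader_1 = np.array([4, 3, 5, 2, 4, 4, 3, 5])
grader_2 = np.array([4, 4, 5, 2, 3, 4, 3, 4])

qwk = quadratic_weighted_kappa(grader_1, grader_2, 1, 6)
exact = np.mean(grader_1 == grader_2)
within_one = np.mean(np.abs(grader_1 - grader_2) <= 1)
```

The three values mirror the three agreement statistics reported for the ASAP graders: QWK, exact match rate, and the share of scores within one point.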
Current state-of-the-art grading systems, powered by a blend of classical machine learning and large language models (LLMs), train on the many thousands of essays in the ASAP dataset and achieve impressive metrics. Jiao, Choi, and Hua (2025) achieved a QWK of 0.873 on ASAP essay set 6. While such high numbers suggest strong progress in the field, they obscure a more fundamental problem: the authors chose “truth” to mean the higher of the two human scores - a largely arbitrary decision. And how can a QWK any higher than 0.77 be meaningful when 0.77 is the limit of human agreement?
This raises the question: if trained human graders disagree so often on essays in the dataset, how do we establish a source of truth?
The most accurate answer is that there is no source of truth - only a number of philosophies we can choose from to determine final grades. Two major philosophies exist: average the opinions of several trained humans, or designate one person as the principal grader against whom all others are calibrated. Most large testing organizations opt for the latter, periodically checking that all examiners align with the principal, because averaging multiple graders on every submission is prohibitively expensive and complex. But that still means every student submission is graded to align with a single human’s - or a small group of humans’ - subjective judgment, rather than any sort of absolute truth.
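The two philosophies can be made concrete. Given several graders’ scores for one submission, the averaging approach takes the (rounded) mean of everyone’s opinion, while the principal-grader approach takes only the designated grader’s score. The grader names and 1-6 scale below are illustrative:

```python
from statistics import mean

def average_truth(scores: dict[str, int]) -> int:
    # Consensus label: the rounded mean of all graders' scores
    return round(mean(scores.values()))

def principal_truth(scores: dict[str, int], principal: str) -> int:
    # Label is whatever the designated principal grader assigned;
    # other graders are only ever compared against this value
    return scores[principal]

submission = {"grader_a": 5, "grader_b": 3, "grader_c": 3}
consensus = average_truth(submission)              # center of the group's opinion
principal = principal_truth(submission, "grader_a")  # one grader's view wins
```

Here the two philosophies already disagree (a consensus of 4 versus the principal’s 5), which is exactly the choice of “truth” the text describes.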
We believe the average of trained opinions ought to be closest to a ground truth, and we aim to use it wherever possible. However, an averaged judgment may often misalign with an individual teacher, sowing doubt about who is correct. For most formative assessments teachers create, an average of opinions isn’t available, so the AI must be trained at the individual teacher level, relying on the history of a single human’s judgments. Across many professions, that kind of individual judgment has been shown to be unreliable - prone to numerous subconscious biases.
In the same way that court judges are more likely to rule in favor of a defendant after meal breaks or at the beginning of a session, time of day, fatigue, and emotion introduce substantial biases in teachers’ decision-making (Vicario et al., 2025; Mahshanian & Shahnazari, 2020; Brackett et al., 2013). Specifically in grading assessment, heavy biases also arise from handwriting quality and knowledge of previous student performance, among other irrelevant factors about the students themselves (Greifeneder et al., 2010; Malouff et al., 2014; Malouff & Thorsteinsson, 2016).
All of these biases, combined with flexibly interpretable rubrics, make it difficult to train an AI on unique assessments set by specific teachers. For standardized assessments, with historic datasets spanning hundreds of samples across multiple teachers or examiners, these problems do not disappear, but it can become easier to generate defensible grade predictions that appear reasonable.
For all we know, it may be impossible for any system, human or machine, to exactly predict how a specific human will judge more than 85% of submissions. But given the lack of any absolute truth, a prediction that appears reasonable may not be a compromise; it may be the most defensible approach we have: predictions that reflect the center of informed human judgment, rather than its edges.
References
Brackett, M. A., Floman, J. L., Ashton-James, C., Cherkasskiy, L., & Salovey, P. (2013). The influence of teacher emotion on grading practices: A preliminary look at the evaluation of student writing. Teachers and Teaching: Theory and Practice, 19(6), 634–646. https://doi.org/10.1080/13540602.2013.827453
Greifeneder, R., Alt, A., Bottenberg, K., Seele, T., Zelt, S., & Wagener, D. (2010). On writing legibly: Processing fluency systematically biases evaluations of handwritten material. Social Psychological and Personality Science, 1(3), 230–237. https://doi.org/10.1177/1948550610368434
Jiao, H., Choi, H., & Hua, H. (2025). Exploring the utilities of the rationales from large language models to enhance automated essay scoring. arXiv. https://doi.org/10.48550/arXiv.2510.27131
Mahshanian, A., & Shahnazari, M. (2020). The effect of raters' fatigue on scoring EFL writing tasks. Indonesian Journal of Applied Linguistics, 10(1), 1–13. https://doi.org/10.17509/ijal.v10i1.24956
Malouff, J. M., & Thorsteinsson, E. B. (2016). Bias in grading: A meta-analysis of experimental research findings. Australian Journal of Education, 60(3), 245–256. https://doi.org/10.1177/0004944116664618
Malouff, J. M., Stein, S. J., Bothma, L. N., Coulter, K., & Emmerton, A. J. (2014). Preventing halo bias in grading the work of university students. Cogent Psychology, 1(1), 988937. https://doi.org/10.1080/23311908.2014.988937
The Hewlett Foundation. (2012). Automated Essay Scoring [Competition page]. Kaggle. https://www.kaggle.com/competitions/asap-aes
Vicario, C. M., Nitsche, M. A., Lucifora, C., Perconti, P., Salehinejad, M. A., Tomaiuolo, F., Massimino, S., Avenanti, A., & Mucciardi, M. (2025). Timing matters! Academic assessment changes throughout the day. Frontiers in Psychology, 16, Article 1605041. https://doi.org/10.3389/fpsyg.2025.1605041