November 11, 2025
Machine learning needs ground truth data - a set of inputs and outputs deemed “correct” - to learn from. For many machine learning applications such as image recognition, speech transcription, or even self-driving cars, humans can reliably generate this data, and the idea of “correct” is rarely in question.
But when it comes to subjective fields such as grading and feedback, and as machines move from simply recognizing and labeling to forming nuanced judgments, defining “correct” becomes part of the problem itself.
In 2012, the Hewlett Foundation sponsored the Automated Student Assessment Prize (ASAP), a competition to create automated essay graders. The competition compiled a dataset for training and testing that continues to serve as a benchmark for the field today (The Hewlett Foundation, 2012). For each of the 12,977 essays in the training dataset, the “true grade” was generated by two different human graders. Against each other, those two graders achieved a QWK of 0.77, a Spearman’s rho of 0.75, 62% exact agreement, and 92% agreement within one point - close, but still considerably imperfect, alignment.
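Quadratic weighted kappa (QWK) penalizes disagreements by the square of their distance, so a one-point gap costs far less than a three-point gap. A minimal sketch of how two graders might be compared - the scores below are illustrative, not drawn from the ASAP data:

```python
import numpy as np

def quadratic_weighted_kappa(a, b, min_rating, max_rating):
    """Cohen's kappa with quadratic weights between two raters' integer scores."""
    a = np.asarray(a) - min_rating
    b = np.asarray(b) - min_rating
    n = max_rating - min_rating + 1
    # Observed confusion matrix between the two raters
    observed = np.zeros((n, n))
    for i, j in zip(a, b):
        observed[i, j] += 1
    # Expected matrix under independence (outer product of the marginals)
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    # Quadratic disagreement weights: 0 on the diagonal, growing with distance
    idx = np.arange(n)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n - 1) ** 2
    return 1 - (weights * observed).sum() / (weights * expected).sum()

# Illustrative scores on a 1-6 scale (not real ASAP grades)
grader_1 = np.array([4, 3, 5, 2, 4, 4, 3, 5])
grader_2 = np.array([4, 4, 5, 2, 3, 4, 3, 4])

qwk = quadratic_weighted_kappa(grader_1, grader_2, 1, 6)
exact = np.mean(grader_1 == grader_2)
within_one = np.mean(np.abs(grader_1 - grader_2) <= 1)
```

The three values mirror the three agreement statistics reported for the ASAP graders: QWK, exact match rate, and the share of scores within one point.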
Current state-of-the-art grading systems, powered by a blend of classical machine learning and large language models (LLMs), train on the many thousands of essays in the ASAP dataset and achieve impressive metrics. Jiao, Choi, and Hua (2025) achieved a QWK of 0.873 on ASAP essay set 6. While such high numbers suggest strong progress in the field, they obscure a more fundamental problem: the authors chose “truth” to mean the higher of the two human scores - a largely arbitrary decision. And how can a QWK any higher than 0.77 be meaningful when 0.77 is the limit of human agreement?
This raises the question: if trained human graders disagree so often on essays in the dataset, how do we establish a source of truth?
The most accurate answer is that there is no source of truth - only a number of philosophies we can choose from to determine final grades. Two major philosophies exist: average the opinions of several trained humans, or designate one person as the principal grader against whom all others are calibrated. Most large testing organizations opt for the latter, periodically checking that all examiners align with the principal, because averaging multiple graders on every submission is prohibitively expensive and complex. But that still means every student submission is graded to align with a single human’s - or a small group of humans’ - subjective judgment, rather than any sort of absolute truth.
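The two philosophies can be made concrete. Given several graders’ scores for one submission, the averaging approach takes the (rounded) mean of everyone’s opinion, while the principal-grader approach takes only the designated grader’s score. The grader names and 1-6 scale below are illustrative:

```python
from statistics import mean

def average_truth(scores: dict[str, int]) -> int:
    # Consensus label: the rounded mean of all graders' scores
    return round(mean(scores.values()))

def principal_truth(scores: dict[str, int], principal: str) -> int:
    # Label is whatever the designated principal grader assigned;
    # other graders are only ever compared against this value
    return scores[principal]

submission = {"grader_a": 5, "grader_b": 3, "grader_c": 3}
consensus = average_truth(submission)              # center of the group's opinion
principal = principal_truth(submission, "grader_a")  # one grader's view wins
```

Here the two philosophies already disagree (a consensus of 4 versus the principal’s 5), which is exactly the choice of “truth” the text describes.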
We believe the average of trained opinions ought to be closest to a ground truth, and we aim to use it wherever possible. However, an averaged judgment may often misalign with an individual teacher, sowing doubt about who is correct. For most formative assessments teachers create, an average of opinions isn’t available, so the AI must be trained at the individual teacher level, relying on the history of a single human’s judgments. Across many professions, that kind of individual judgment has been shown to be unreliable - prone to numerous subconscious biases.
In the same way that court judges are more likely to rule in favor of a defendant after meal breaks or at the beginning of a session, time of day, fatigue, and emotion introduce substantial biases in teachers’ decision-making (Vicario et al., 2025; Mahshanian & Shahnazari, 2020; Brackett et al., 2013). Specifically in grading assessment, heavy biases also arise from handwriting quality and knowledge of previous student performance, among other irrelevant factors about the students themselves (Greifeneder et al., 2010; Malouff et al., 2014; Malouff & Thorsteinsson, 2016).
All of these biases, combined with flexibly interpretable rubrics, make it difficult to train an AI on unique assessments set by specific teachers. For standardized assessments, with historic datasets spanning hundreds of samples across multiple teachers or examiners, these problems do not disappear, but it can become easier to generate defensible grade predictions that appear reasonable.
For all we know, it may be impossible for any system, human or machine, to exactly predict how a specific human will judge more than 85% of submissions. But given the lack of any absolute truth, a prediction that appears reasonable may not be a compromise; it may be the most defensible approach we have: predictions that reflect the center of informed human judgment, rather than its edges.
References
Brackett, M. A., Floman, J. L., Ashton-James, C., Cherkasskiy, L., & Salovey, P. (2013). The influence of teacher emotion on grading practices: A preliminary look at the evaluation of student writing. Teachers and Teaching: Theory and Practice, 19(6), 634–646. https://doi.org/10.1080/13540602.2013.827453
Greifeneder, R., Alt, A., Bottenberg, K., Seele, T., Zelt, S., & Wagener, D. (2010). On writing legibly: Processing fluency systematically biases evaluations of handwritten material. Social Psychological and Personality Science, 1(3), 230–237. https://doi.org/10.1177/1948550610368434
Jiao, H., Choi, H., & Hua, H. (2025). Exploring the utilities of the rationales from large language models to enhance automated essay scoring. arXiv. https://doi.org/10.48550/arXiv.2510.27131
Mahshanian, A., & Shahnazari, M. (2020). The effect of raters' fatigue on scoring EFL writing tasks. Indonesian Journal of Applied Linguistics, 10(1), 1–13. https://doi.org/10.17509/ijal.v10i1.24956
Malouff, J. M., & Thorsteinsson, E. B. (2016). Bias in grading: A meta-analysis of experimental research findings. Australian Journal of Education, 60(3), 245–256. https://doi.org/10.1177/0004944116664618
Malouff, J. M., Stein, S. J., Bothma, L. N., Coulter, K., & Emmerton, A. J. (2014). Preventing halo bias in grading the work of university students. Cogent Psychology, 1(1), 988937. https://doi.org/10.1080/23311908.2014.988937
The Hewlett Foundation. (2012). Automated Essay Scoring [Competition page]. Kaggle. https://www.kaggle.com/competitions/asap-aes
Vicario, C. M., Nitsche, M. A., Lucifora, C., Perconti, P., Salehinejad, M. A., Tomaiuolo, F., Massimino, S., Avenanti, A., & Mucciardi, M. (2025). Timing matters! Academic assessment changes throughout the day. Frontiers in Psychology, 16, Article 1605041. https://doi.org/10.3389/fpsyg.2025.1605041