Comparison is Key

November 2, 2025

If you want to know how people feel about a set of things (foods, brands, crimes, etc.), don't ask them to rate each one on a fixed scale. Instead, have them choose between pairs, over and over.

This idea was first published and given mathematical support almost a century ago (Thurstone, 1927), and it remains a strong method for turning judgements in highly subjective fields into more objective scores.
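
As a toy illustration of the mechanics, the sketch below applies Thurstone-style Case V scaling to a made-up table of pairwise food preferences; the items, the counts, and the use of Case V with NumPy and SciPy are assumptions for illustration, not anything taken from the original paper.

```python
# Minimal sketch of Thurstone-style Case V scaling (illustrative only):
# turn pairwise "which do you prefer?" counts into a scale value per item.
import numpy as np
from scipy.stats import norm

items = ["apple", "burger", "salad", "pizza"]
# wins[i][j] = number of judges who preferred item i over item j
wins = np.array([
    [0, 12, 5, 9],
    [8, 0, 3, 7],
    [15, 17, 0, 14],
    [11, 13, 6, 0],
])

totals = wins + wins.T                 # judgements made for each pair
p = wins / np.maximum(totals, 1)       # proportion preferring i over j
np.fill_diagonal(p, 0.5)               # an item neither beats nor loses to itself
p = np.clip(p, 0.01, 0.99)             # keep the z-scores finite

z = norm.ppf(p)                        # Case V: z[i, j] estimates s_i - s_j
scale = z.mean(axis=1)                 # average over all opponents

for item, s in sorted(zip(items, scale), key=lambda t: -t[1]):
    print(f"{item:>8}: {s:+.2f}")
```

The point is that a pile of binary choices, each one easy to make, collapses into a single ordered scale.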

When it comes to rating things in subjective fields, assessment in education seeks objectivity chiefly through rubrics. Modern classroom assessments come with rubrics which aim to define what each level on the scale looks like. Good rubric design and thorough rater training can produce high inter-rater reliability, with quadratic weighted kappa (QWK) as high as 0.95 found among ETS examiners (Wendler, Glazer & Cline, 2019). But many rubrics are not designed with such rigor, and teachers do not receive the level of training that examiners do, at which point it becomes much easier to agree on which response is better than to agree on how a response meets the terms of the rubric.

Even with good rubrics, examining bodies in high-stakes settings ensure reliability by having two or more people grade each submission, and this is where comparison starts to show clear benefits. A study by Ofqual (Holmes, Black & Morin, 2020) compared the accuracy of groups of examiners grading AS History against the judgement of principal examiners, across three organizations responsible for administering the UK's A-levels. With traditional marking, the grade of any single examiner achieved a Spearman's rho of 0.47, compared to the average of any two examiners (0.54), any three (0.57), or up to eight (0.62), showing that aggregating more opinions yields higher accuracy. But when the teams were instead asked to make pairwise judgements, with each examiner given a random set of pairs and the results combined, they exceeded a Spearman's rho of 0.65 with the equivalent of two examiners, 0.80 with the equivalent of eight, and 0.85 with the equivalent of twelve. So among groups of humans, synthesizing comparisons is more accurate than averaging direct grades.
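
Spearman's rho here is simply a correlation computed over rank orders. The quick sketch below uses invented orderings (hypothetical numbers, not the Ofqual data) to show how a single examiner's ranking and a combined ranking would each be scored against the principal examiner's.

```python
# Illustrative comparison of rank orderings against a reference, using
# Spearman's rho (the agreement measure reported in the Ofqual study).
from scipy.stats import spearmanr

principal = [1, 2, 3, 4, 5, 6, 7, 8]   # principal examiner's rank order
single    = [2, 1, 5, 3, 4, 8, 6, 7]   # one examiner marking alone (noisier)
combined  = [1, 2, 4, 3, 5, 6, 8, 7]   # ordering synthesised from many judgements

rho_single, _ = spearmanr(principal, single)
rho_combined, _ = spearmanr(principal, combined)
print(f"single examiner:   {rho_single:.2f}")
print(f"combined ordering: {rho_combined:.2f}")
```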

Not only is a comparative judgement intuitively easier to make, and hence more accurate in itself, but comparison also unlocks something special in the data, which is especially helpful across teams of humans or AI systems.

The Ofqual study involved 60 essays for each organization. If an examiner directly grades each essay, the result is 60 datapoints. If eight people all grade the same 60 essays, the result is 480 datapoints, which can be averaged together for higher reliability. But if every essay were compared against every other essay in the set (each essay against the 59 others), the result would be 1,770 unique comparisons, which can be combined mathematically into a single rank ordering. The twelve examiners in the Ofqual study performed only 60 random comparisons each, totaling 720 - less than half of the possible combinations - and this still yielded significantly higher accuracy.
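
For anyone checking the arithmetic, those counts fall straight out of the binomial coefficient:

```python
# Quick check of the datapoint counts quoted above.
from math import comb

print(comb(60, 2))   # 1770 unique pairs among 60 essays
print(8 * 60)        # 480 direct grades from eight examiners
print(12 * 60)       # 720 random comparisons actually made in the study
```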

The strength of comparison is that the number of potential datapoints for learning and averaging grows quadratically with the size of the source data, and each comparison between essays creates a relationship which helps position every other essay in an interconnected network. This network is also much more forgiving of occasionally, or even frequently, inaccurate judgements. Because each essay in a set of 60 can be compared up to 59 times, the overall ordering should still be well captured even if some judgements are inaccurate.
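
A small simulation (made-up essays and an assumed 20% judge error rate, nothing from the study itself) illustrates how well a full comparison network recovers a rank ordering even when individual judgements are frequently wrong.

```python
# Illustrative only: flip 20% of pairwise judgements at random and see how
# closely the win counts still track the true quality ordering.
import random
from scipy.stats import spearmanr

random.seed(0)
n, error_rate = 60, 0.20
true_quality = list(range(n))             # essay i beats essay j whenever i > j

wins = [0] * n
for i in range(n):
    for j in range(i + 1, n):
        better = i if true_quality[i] > true_quality[j] else j
        if random.random() < error_rate:  # this judgement goes the wrong way
            better = i + j - better
        wins[better] += 1

rho, _ = spearmanr(true_quality, wins)
print(f"rank correlation with {error_rate:.0%} wrong judgements: {rho:.2f}")
```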

While this is effective for human teams striving for accuracy, it is infeasible in live examination scenarios to deploy teams of examiners large enough to judge a significant portion of possible comparisons. Thousands of essays quickly become millions of possible combinations, and millions become billions.

But this quadratic scaling is perfect for machine learning systems, because 60 datapoints can turn into 1,770, which is a dramatic difference. Turning fewer than 100 datapoints into well over 1,000 allows us to enter the arena where conventional machine learning can yield results.
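
As a hedged sketch of what that expansion looks like in practice (the corpus, field names, and seven-point grading scale below are all hypothetical), sixty graded essays can be unrolled into pairwise training examples like this:

```python
# Illustrative only: expand a small graded corpus into pairwise examples
# suitable for training a comparison model.
from itertools import combinations

essays = [{"id": k, "text": f"essay {k}", "grade": k % 7} for k in range(60)]

pairs = []
for a, b in combinations(essays, 2):
    if a["grade"] == b["grade"]:
        continue                          # ties carry no preference signal
    label = 1 if a["grade"] > b["grade"] else 0
    pairs.append((a["text"], b["text"], label))

print(len(pairs))   # well over 1,000 labelled pairs from 60 graded essays
```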

References

Holmes, S., Black, B., & Morin, C. (2020). Marking reliability studies 2017: Rank ordering versus marking – which is more reliable? Ofqual. https://www.gov.uk/government/publications/marking-reliability-studies-2017

Thurstone, L. L. (1927). A law of comparative judgment. Psychological Review, 34(4), 273–286. https://doi.org/10.1037/h0070288

Wendler, C., Glazer, N., & Cline, F. (2019). Examining the calibration process for raters of the GRE general test (GRE Board Research Report No. GRE-19-01). Educational Testing Service. https://doi.org/10.1002/ets2.12245
