Our Research.
State of the Art Small-Scale Learning
QWK measures agreement between two graders, accounting for the magnitude of disagreements. Being off by one point is much better than being off by three. QWK also considers whether agreement is meaningful or coincidental: if 70% of essays have been graded 4/6, QWK would be near zero for a model predicting 4/6 100% of the time.
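The intuition above can be made concrete. The sketch below is a minimal NumPy implementation of quadratic weighted kappa (the essay data and variable names are illustrative, not from a real dataset): a grader who always predicts the majority grade of 4/6 scores exactly 0, while a grader who mostly agrees and is only ever off by one scores high.

```python
import numpy as np

def quadratic_weighted_kappa(a, b, min_rating=1, max_rating=6):
    """Quadratic weighted kappa between two integer rating vectors."""
    a = np.asarray(a) - min_rating
    b = np.asarray(b) - min_rating
    k = max_rating - min_rating + 1
    # Observed rating matrix (confusion matrix of the two graders)
    O = np.zeros((k, k))
    for i, j in zip(a, b):
        O[i, j] += 1
    # Quadratic disagreement weights: off by three costs 9x off by one
    idx = np.arange(k)
    W = (idx[:, None] - idx[None, :]) ** 2 / (k - 1) ** 2
    # Expected matrix under chance agreement (outer product of marginals)
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()
    return 1.0 - (W * O).sum() / (W * E).sum()

human = [4, 4, 4, 4, 4, 4, 4, 2, 3, 5]    # 70% of grades are 4/6
lazy  = [4] * 10                          # always predicts the majority grade
close = [4, 4, 3, 4, 5, 4, 4, 2, 3, 5]   # agrees, or is off by one

print(round(quadratic_weighted_kappa(human, lazy), 3))   # → 0.0
print(round(quadratic_weighted_kappa(close, human), 3))  # → 0.848
```

The `lazy` grader agrees with the human 70% of the time, yet its kappa is zero: that is exactly the agreement chance alone predicts from the marginals.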
The chart compares performance on the AES 2.0 Kaggle Competition. The winning solution achieved 0.84 QWK after training on 1,700+ essays. Edexia achieved 0.81 QWK after training on only 20.
85× Less Training Data
Traditional AI grading systems require thousands of pre-graded essays to train. Edexia's approach achieves comparable accuracy with just 20 examples, making AI grading practical for individual teachers and small schools.
Our Background
Our team brings research experience from leading institutions in education, machine learning, and assessment science.
Harvard University
University of Cambridge
University of Technology Sydney
International Olympiad in Informatics
Small-Scale Learning
Conventional AI systems require hundreds to thousands of data points to achieve reliability. But that volume of data rarely exists at the class or school level.
To work in real educational settings, AI grading systems must adapt from as few as 5 samples and become reliable after 50.

Comparison is Key
Saying ‘Essay A is better than Essay B’ is easier than assigning exact grades, for both humans and machines. This unlocks higher reliability when training data is scarce.
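Edexia's training pipeline is not detailed here, but the underlying idea is well established: pairwise judgments can be turned into a full ranking with relative strengths. As an illustration, the sketch below fits a Bradley-Terry model (via the standard MM iteration) to a handful of hypothetical ‘A is better than B’ judgments over four essays:

```python
import numpy as np

def bradley_terry(n_items, judgments, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs via MM updates."""
    wins = np.zeros((n_items, n_items))
    for winner, loser in judgments:
        wins[winner, loser] += 1
    games = wins + wins.T          # total comparisons per pair
    p = np.ones(n_items)           # initial strengths
    for _ in range(iters):
        # MM update: p_i = (total wins of i) / sum_j games_ij / (p_i + p_j)
        denom = (games / (p[:, None] + p[None, :])).sum(axis=1)
        p = wins.sum(axis=1) / denom
        p /= p.sum()               # normalize for numerical stability
    return p

# Hypothetical judgments over essays 0-3: essay 3 mostly dominates,
# essay 0 beats 1 and 2, essays 1 and 2 split their meetings 2-1.
judgments = [(3, 0), (3, 0), (0, 3), (3, 1), (3, 2),
             (0, 1), (0, 2), (1, 2), (1, 2), (2, 1)]
strengths = bradley_terry(4, judgments)
print(np.argsort(-strengths))  # indices from strongest to weakest essay
```

No grader ever assigned a number to an essay, yet the model recovers a consistent ordering, and the fitted strengths can then be anchored to a rubric with only a few absolute grades.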

Teaching AI Like We Teach Humans
Teachers learn to grade from a handful of exemplars, discussion, and ongoing calibration. LLMs can do the same.
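One way this plays out in practice is in-context learning: showing the model a few graded exemplars with the reasoning behind each grade, exactly the way a new teacher is calibrated. The sketch below only assembles such a prompt; the exemplars, scale, and format are hypothetical, not Edexia's actual prompt.

```python
# Hypothetical graded exemplars: (essay excerpt, grade, rationale).
EXEMPLARS = [
    ("The industrial revolution reshaped labour because ...", 5,
     "Clear thesis, specific evidence, logical structure."),
    ("Factories were built and stuff changed ...", 2,
     "Vague claims with no supporting detail."),
]

def build_grading_prompt(essay: str) -> str:
    """Assemble a few-shot grading prompt: exemplars first, then the new essay."""
    parts = ["You are grading essays on a 1-6 scale."]
    for text, grade, rationale in EXEMPLARS:
        parts.append(f"Essay: {text}\nGrade: {grade}/6\nWhy: {rationale}")
    parts.append(f"Essay: {essay}\nGrade:")
    return "\n\n".join(parts)

print(build_grading_prompt("Trade networks expanded because ..."))
```

The rationale lines matter as much as the grades: they are the ‘discussion’ step of teacher calibration, carried into the prompt.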
Human Versus Machine
All machine learning systems require human data as the ‘ground truth’ for training and evaluation. But what happens when that ground truth is flawed?
When humans themselves disagree, the definition of ‘accuracy’ in grading becomes far less straightforward.

Taking Each At Their Best
Under ideal conditions, expert human raters can reach 0.95 QWK; yet on other datasets, modern AI systems now exceed the inter-rater reliability of the humans grading them.

The Reliability of Human Judgement
When two trained raters disagree on 35% of essays, which dataset should AI learn from — and how do we know if either rater was consistent?