Our Research

State of the Art Small-Scale Learning

Quadratic Weighted Kappa (QWK) measures agreement between two graders while accounting for the size of each disagreement: being off by one point is penalized far less than being off by three. QWK also corrects for agreement that would occur by chance: if 70% of essays were graded 4/6, a model that predicts 4/6 every time would score a QWK near zero.
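As a rough illustration of how QWK behaves, here is a minimal pure-Python sketch of the standard quadratic weighted kappa formula. The function name, the 1 to 6 score range, and the sample scores are assumptions made for this example; they are not Edexia's implementation or data.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Cohen's kappa with quadratic weights for two lists of integer scores."""
    if max_score <= min_score or len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need a score range of at least two values and equal-length, non-empty lists")
    n = max_score - min_score + 1
    total = len(rater_a)

    # Observed counts: how often rater A gave score i while rater B gave score j.
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1

    # Marginal score histograms for each rater.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            # Quadratic penalty: 0 on the diagonal, 1 for the largest possible gap.
            weight = (i - j) ** 2 / (n - 1) ** 2
            # Count expected in this cell if the two raters were independent.
            expected = hist_a[i] * hist_b[j] / total
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0:
        # Both raters gave one identical score throughout; kappa is undefined.
        return float("nan")
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical scores on a 1-6 scale: 70% of essays were graded 4.
human = [4, 4, 4, 4, 4, 4, 4, 2, 5, 6]
always_four = [4] * len(human)
chance_level = quadratic_weighted_kappa(human, always_four, 1, 6)

# One disagreement of one point versus one disagreement of three points.
reference = [1, 2, 3, 4, 5, 6]
off_by_one = quadratic_weighted_kappa(reference, [1, 2, 3, 4, 5, 5], 1, 6)
off_by_three = quadratic_weighted_kappa(reference, [1, 2, 3, 4, 5, 3], 1, 6)
```

In this sketch the constant predictor scores zero even though it matches 70% of the human grades, and the three-point miss lowers the score more than the one-point miss does.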

The chart compares performance on the AES 2.0 Kaggle Competition. The winning solution achieved 0.84 QWK after training on 1,700+ essays. Edexia achieved 0.81 QWK while training on only 20 essays.

85× Less Training Data

Traditional AI grading systems require thousands of pre-graded essays to train. Edexia's approach achieves comparable accuracy with just 20 examples, making AI grading practical for individual teachers and small schools.

[Chart: QWK by system, axis from 0.60 to 1.00. Standard ML (1,700 essays): 0.84. Edexia (20 essays): 0.81.]

Our Background

Our team brings research experience from leading institutions in education, machine learning, and assessment science.

Harvard University

University of Cambridge

University of Queensland

University of Technology Sydney

International Olympiad in Informatics

Small-Scale Learning

Conventional AI systems require hundreds to thousands of data points to become reliable, but that volume of data rarely exists at the class or school level.

To work in real educational settings, AI grading systems must adapt from as few as 5 samples and become reliable after 50.

November 2, 2025

Comparison is Key

Saying ‘Essay A is better than Essay B’ is easier than assigning exact grades, for both humans and machines. This unlocks higher reliability when training data is scarce.
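To make the idea concrete, the sketch below turns a handful of pairwise judgements into a ranking by simple win rate. This is only an illustration under stated assumptions: the function name, the essay labels, and the comparison data are hypothetical, and win-rate ranking is a deliberately simple stand-in rather than the method described in the article.

```python
from collections import defaultdict


def rank_by_win_rate(comparisons):
    """Rank essays from (winner, loser) pairs, best first, by share of comparisons won."""
    wins = defaultdict(int)
    appearances = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        appearances[winner] += 1
        appearances[loser] += 1

    # Win rate = comparisons won / comparisons taken part in.
    rates = {essay: wins[essay] / appearances[essay] for essay in appearances}

    # Sort by descending win rate; break ties alphabetically for a stable order.
    return sorted(rates, key=lambda essay: (-rates[essay], essay))


# Hypothetical judgements: each tuple reads "first essay is better than second".
judgements = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "A"), ("D", "B")]
ranking = rank_by_win_rate(judgements)
```

Five comparisons are enough here to order four essays, without anyone assigning an exact grade to any of them.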


February 27, 2026

Teaching AI Like We Teach Humans

Teachers learn to grade from a handful of exemplars, discussion, and ongoing calibration. LLMs can do the same.


Human Versus Machine

All machine learning systems require human data as the ‘ground truth’ for training and evaluation. But what happens when that ground truth is flawed?

When humans themselves disagree, the definition of ‘accuracy’ in grading becomes far less straightforward.

February 27, 2026

Taking Each At Their Best

Under ideal conditions, expert human raters can reach 0.95 QWK. On other datasets, however, modern systems now exceed human inter-rater reliability.


November 11, 2025

The Reliability of Human Judgement

When two trained raters disagree on 35% of essays, which set of grades should AI learn from, and how do we know whether either rater was consistent?
