March 24, 2025
Metrics We Use
Small-Scale Learning
Conventional AI systems require hundreds to thousands of datapoints to achieve reliability. However, such data volumes are rarely available at the class or school level.
To be viable in real educational settings, AI grading systems must begin adapting from as few as 5 samples and become reliable by 50.
Teachers learn to grade by discussing exemplars with colleagues, comparing submissions to only a handful of anchors, and updating their understanding of the rubric as they go. LLMs can do the same.
Human Versus Machine
All machine learning systems require human data as the 'ground truth' for training and evaluation. But what happens when that ground truth is flawed, or there is no reliable ground truth?
Exploring the inter-rater reliability of both humans and machines raises fundamental questions about what 'accuracy' truly means in grading.
Under ideal conditions, expert human raters can reach a quadratic weighted kappa (QWK) of 0.95, yet on other datasets modern systems now exceed the inter-rater reliability of humans.
When two trained raters disagree slightly on 35% of essays, which dataset should AI learn from? When the goal is to match a single teacher's grades, how can we tell if the teacher was consistent?
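For readers unfamiliar with the metric, the following is a minimal sketch of how QWK can be computed for two sets of integer scores. It is an illustration under stated assumptions, not a description of our production pipeline: the function name, the example scores, and the convention of returning 1.0 when both raters give the same single score to every submission are all choices made here for clarity.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Agreement between two raters on an ordinal scale.

    1.0 means perfect agreement, 0.0 means chance-level agreement,
    and negative values mean agreement worse than chance.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set")
    k = max_score - min_score + 1  # number of score categories
    if k < 2:
        raise ValueError("the scale needs at least two score points")
    n = len(rater_a)

    # Observed confusion matrix: rows are rater A, columns are rater B.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1

    # Marginal score counts for each rater.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Quadratic penalty: large disagreements cost far more than small ones.
            weight = (i - j) ** 2 / (k - 1) ** 2
            expected = hist_a[i] * hist_b[j] / n  # count expected by chance
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0:
        # Assumed convention: both raters gave one identical score throughout.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical scores on a 1-4 rubric, for illustration only.
teacher = [2, 3, 4, 4, 1, 3]
model = [2, 3, 3, 4, 1, 4]
print(quadratic_weighted_kappa(teacher, model, min_score=1, max_score=4))
```

Because the penalty grows with the square of the score gap, two raters who differ by one point on many essays can still achieve a high QWK, while a few large disagreements pull the score down sharply.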
We partner with K-12 educational institutions across the globe. If you think your institution would be a good fit, please submit an expression of interest.
