Our Accuracy
Independent trial with St Bernard's College VCE English team across 579 essays, 2025.
What That Means
Human examiners typically agree on exact scores 60 to 80% of the time for subjective essay assessment. Edexia's 81.2% exact match falls within the upper range of human inter-rater reliability. The 98.3% within-one figure means that in almost every case, Edexia and the teacher were at most one grade band apart.
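The two headline figures are simple agreement rates over paired grades. As a minimal sketch (with made-up illustrative grades, not the trial's data), treating each grade band as a number:

```python
# Sketch: computing exact-match and within-one agreement rates
# from paired grades. The data below is illustrative only.

def agreement_rates(ai_grades, teacher_grades):
    """Return (exact_match_rate, within_one_rate) for paired grade bands."""
    assert len(ai_grades) == len(teacher_grades) and ai_grades
    n = len(ai_grades)
    exact = sum(a == t for a, t in zip(ai_grades, teacher_grades))
    within_one = sum(abs(a - t) <= 1 for a, t in zip(ai_grades, teacher_grades))
    return exact / n, within_one / n

# Hypothetical grade bands (0-10) for five essays
ai = [7, 5, 8, 6, 9]
teacher = [7, 6, 8, 6, 7]
exact_rate, within_one_rate = agreement_rates(ai, teacher)
print(exact_rate, within_one_rate)  # 0.6 0.8
```

In the trial, the same calculation over all 579 essays yields the 81.2% exact-match and 98.3% within-one figures.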
How the Study Worked
- The trial was conducted with the VCE English department at St Bernard's College in 2025.
- 579 student essays across multiple VCE English texts were included.
- Teachers accepted or changed Edexia's grade for each essay.
How Edexia Is Trained
- A team of VCAA assessors (experienced VCE English examiners) trains and validates the system.
- Every criterion, grade descriptor, and study design requirement from the VCAA rubric is built in.
- The system is calibrated through ongoing moderation with schools, similar to how teachers calibrate with colleagues.
- School-level data is siloed. Your essays and grades are never used to train models for other schools.
Comparison to Other Approaches
When teachers use generic AI tools like ChatGPT directly, accuracy suffers because the model has no understanding of VCE-specific rubrics, text knowledge, or assessment standards.
Research from ETS found that GPT-4o scored approximately 0.9 points lower than human raters on average. Without curriculum-specific training, general-purpose AI tends to be inconsistent and imprecise.
Edexia closes this gap through assessor training, rubric calibration, and text-specific knowledge bases.
"We ran Edexia alongside our normal grading for a full term. The consistency surprised us. It matched our grades at the same rate our assessors agree with each other."
Chris Mason, Head of English at St Bernard's College
For more on our research program, including ongoing studies and methodology, visit our research page.