
Teaching AI Like We Teach Humans

February 27, 2026

When a new teacher joins a school, they don’t learn to grade by reading a rubric in isolation. They sit with colleagues, examine student work together, debate borderline cases, and gradually develop a shared understanding of what “proficient” actually looks like. This process – calibration through exemplars – is how humans have trained assessors for decades. It’s also how large language models learn best.

Human teacher training and LLM prompting share the same core structure: a small number of carefully chosen examples to establish standards, explicit reasoning about why a response merits a particular score, and iterative refinement as edge cases surface. These parallels suggest concrete ways to make AI grading systems more reliable – and more aligned with human judgment.

How Teachers Learn to Grade

Professional development in assessment typically centers on what educators call “anchor papers” or “benchmark responses” – exemplar student work that clearly illustrates each score level on a rubric. The National Assessment of Educational Progress (NAEP), which has assessed American students since 1969, trains its scorers using anchor sets containing three or four clear examples for each score category (National Center for Education Statistics, 2024). During training, facilitators read each anchor paper aloud and explain precisely why it fits its assigned score level. Scorers then use these anchors as primary references throughout operational scoring.

Humans calibrate their judgment not through abstract criteria alone, but through concrete examples that make those criteria tangible. The Educational Testing Service, which administers the GRE and other standardized tests, has found that even brief calibration exercises – as short as ten scored responses – can predict subsequent scoring accuracy with reasonable reliability (Wendler, Glazer & Cline, 2019). The key is not volume but quality: a handful of well-chosen exemplars, accompanied by explicit rationales, can establish scoring standards more effectively than lengthy rubric documents.

Calibration also happens socially. The consensus moderation model, widely used in Australian education, brings teachers together to independently mark student work, then discuss their judgments as a group (Queensland Curriculum and Assessment Authority, 2023). Through professional conversation, teachers compare their interpretations against standard descriptors and reach consensus on quality standards. This process doesn’t just improve reliability – it deepens teachers’ understanding of what the rubric actually means in practice. Ofqual, the UK’s qualifications regulator, has documented how such standardisation methods significantly improve marking consistency, particularly when combined with clear mark schemes and ongoing monitoring (Ofqual, 2014).

How LLMs Learn from Examples

In 2020, researchers at OpenAI demonstrated something surprising: GPT-3, a language model with 175 billion parameters, could perform new tasks with remarkable accuracy after seeing just a few examples in its prompt – no fine-tuning required (Brown et al., 2020). This capability, called “few-shot learning” or “in-context learning,” showed that large language models could generalize from minimal demonstrations in ways that smaller models could not. Crucially, the researchers found that few-shot performance improved more rapidly with model scale than zero-shot performance, suggesting that the ability to learn from examples is an emergent property of sufficiently large models.

Recent research has pushed this further into the “many-shot” regime, demonstrating that providing hundreds or even thousands of examples yields significant additional gains (Agarwal et al., 2024). Many-shot in-context learning can override pretraining biases and enable models to learn complex functions that approach fine-tuned performance – all without updating a single model weight. Perhaps most surprisingly, some studies have found that in-context learning can outperform fine-tuning on tasks with implicit patterns, even when fine-tuning uses orders of magnitude more training data (Zelikman et al., 2024). The model learns not by changing its parameters, but by recognizing patterns in the examples it’s shown.
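To make the mechanism concrete, here is a minimal sketch of what "examples in its prompt" means in practice. The task and demonstrations below are illustrative inventions, not drawn from the cited papers: the model receives a handful of labeled examples as plain text and is asked to complete the pattern for a new input.

```python
def build_few_shot_prompt(examples, query):
    """Assemble labeled demonstrations and one unlabeled query into a
    single prompt string for an LLM to complete."""
    lines = ["Classify the sentiment of each review as positive or negative.", ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")  # the model's completion supplies the label
    return "\n".join(lines)

examples = [
    ("The plot was gripping from start to finish.", "positive"),
    ("I walked out halfway through.", "negative"),
]
prompt = build_few_shot_prompt(examples, "A delightful surprise.")
print(prompt)
```

No weights change; the model simply continues the pattern it observes in its context window. Adding more demonstrations, as in the many-shot regime, just means a longer list of examples in the same prompt.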

Anchor Papers for AI

The connection between human calibration and LLM prompting becomes concrete in recent work on automated essay scoring. Choi and colleagues (2025) found that providing anchor papers – sample essays with assigned scores – significantly improved agreement between LLM-generated grades and human raters, bringing performance closer to human-human reliability. Their approach mirrors exactly what NAEP trainers do: show the model clear examples of each score level, explain the rubric, and let it generalize to new submissions. The researchers specifically focused on prompting strategies accessible to teachers, rather than resource-intensive optimization methods, demonstrating that the same exemplar-based approach that works for human training also works for AI.

This finding has practical implications. Traditional automated essay scoring systems require large datasets of pre-scored essays – often thousands of samples – to train effectively. But if LLMs can calibrate from a handful of anchor papers, the barrier to deploying AI grading drops dramatically. A teacher creating a new assignment doesn’t need to grade hundreds of submissions before the AI can help; they need only provide a few exemplars that illustrate their standards. This is precisely how human teaching assistants are trained: review a small set of graded examples, discuss the reasoning, then begin scoring with periodic check-ins.
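As a rough sketch of what this looks like in practice, an anchor-based grading prompt can be assembled from a rubric, a few scored exemplars with rationales, and the new submission. The function, field names, and four-point scale below are illustrative assumptions, not the specific prompt format used by Choi and colleagues.

```python
def build_anchor_prompt(rubric, anchors, submission):
    """Combine a rubric, scored anchor papers with rationales, and a new
    submission into one grading prompt, mirroring how NAEP-style anchor
    sets pair each exemplar with an explanation of its score."""
    parts = [f"Rubric: {rubric}", ""]
    for a in anchors:
        parts += [
            f"--- Anchor (score {a['score']}/4) ---",
            a["essay"],
            f"Rationale: {a['rationale']}",
            "",
        ]
    parts += [
        "--- New submission ---",
        submission,
        "",
        "Assign a score from 1 to 4 and explain your reasoning.",
    ]
    return "\n".join(parts)

anchors = [
    {"score": 4, "essay": "Example of a top essay...",
     "rationale": "Clear thesis, well-supported claims."},
    {"score": 2, "essay": "Example of a weaker essay...",
     "rationale": "Relevant ideas but little supporting evidence."},
]
print(build_anchor_prompt("Argument quality and use of evidence.",
                          anchors, "Student essay text..."))
```

A teacher supplies only the handful of anchors and rationales; the prompt does the rest, with no pre-scored training corpus required.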

Iterative Refinement in Both Worlds

Neither human calibration nor LLM prompting is a one-shot process. Teachers refine their understanding of rubrics as they encounter edge cases – submissions that don’t fit neatly into predefined categories. The best professional development programs build in opportunities for ongoing calibration, where teachers periodically reconvene to discuss difficult cases and realign their standards (Yancey & Huot, 1997). Similarly, effective AI grading systems benefit from iterative refinement: when the model’s judgment diverges from the teacher’s, that disagreement becomes a learning opportunity. The teacher can provide corrective feedback, effectively adding new anchor papers to the model’s context.

This iterative loop – grade, compare, discuss, refine – characterizes both human professional development and effective human-AI collaboration in assessment. The ETS has documented how operational scoring experience and mentored training both improve rater accuracy over time (Educational Testing Service, 2021). The same principle applies to AI systems: a model that receives feedback on its errors can adjust its interpretation of the rubric, just as a human scorer would. The difference is that the AI’s “learning” happens through updated prompts rather than updated neural weights – but the pedagogical structure is identical.
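The grade-compare-refine loop can be sketched in a few lines. The function and data shapes below are hypothetical illustrations, not a production system: when the model's score diverges from the teacher's, the disputed submission simply joins the anchor set that future prompts will carry.

```python
def calibration_step(anchors, submission, model_score, teacher_score, teacher_rationale):
    """One grade-compare-refine cycle: a disagreement between model and
    teacher turns the disputed submission into a new anchor paper."""
    if model_score != teacher_score:
        anchors = anchors + [{
            "essay": submission,
            "score": teacher_score,
            "rationale": teacher_rationale,
        }]
    return anchors

anchors = []
# Model said 3, teacher said 2: the disputed essay becomes a new anchor.
anchors = calibration_step(anchors, "Essay text...", 3, 2,
                           "Claims are stated but not supported.")
# Model and teacher agree: the anchor set is unchanged.
anchors = calibration_step(anchors, "Another essay...", 4, 4, "")
print(len(anchors))  # 1
```

The "learning" lives entirely in the growing anchor list passed to each new prompt, which is the prompt-level analogue of a human scorer revisiting difficult cases during recalibration.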

Implications for AI-Assisted Grading

Recognizing that LLMs learn like teachers learn suggests a design principle for AI grading systems: structure the interaction around exemplars, not just rubrics. Rather than asking teachers to write elaborate scoring criteria, ask them to grade a handful of representative submissions and explain their reasoning. Rather than treating AI disagreements as errors to be overridden, treat them as calibration opportunities – chances to refine the shared understanding between human and machine. And rather than expecting perfect alignment from the start, build in mechanisms for ongoing calibration as both teacher and AI encounter new types of student work.

This approach also suggests realistic expectations. Human raters, even after extensive training, rarely achieve perfect agreement – inter-rater reliability of 0.85 to 0.95 is considered excellent in high-stakes assessment (Wendler, Glazer & Cline, 2019). AI systems calibrated through few-shot learning should be held to similar standards: not perfect replication of any single teacher’s judgment, but reasonable alignment with the center of informed human opinion. The goal is not to replace human judgment but to extend it – to give every student access to feedback that reflects careful, calibrated assessment, even when human graders are scarce.
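For readers who want to quantify that alignment, one agreement statistic commonly reported in essay-scoring research is quadratic weighted kappa, which penalizes large score disagreements more heavily than adjacent ones. The implementation below is a self-contained illustrative sketch in plain Python, assuming integer scores on a 1..k scale and at least two distinct observed score levels.

```python
def quadratic_weighted_kappa(rater_a, rater_b, k):
    """Chance-corrected agreement between two raters on a 1..k scale:
    1.0 is perfect agreement, 0.0 is chance-level agreement.
    Assumes the raters use more than one score level (otherwise the
    expected-disagreement denominator is zero)."""
    n = len(rater_a)
    # Observed confusion matrix of the two raters' scores.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a - 1][b - 1] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # quadratic penalty
            expected = hist_a[i] * hist_b[j] / n  # chance agreement
            num += weight * observed[i][j]
            den += weight * expected
    return 1.0 - num / den

print(quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4], 4))  # 1.0
```

A calibrated AI grader can be evaluated with the same statistic used for pairs of human raters, making the "0.85 to 0.95 is excellent" benchmark directly applicable.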

Teachers have always learned to grade by examining examples together, debating borderline cases, and refining their understanding through practice. Large language models, it turns out, learn the same way. AI grading systems that mirror this process – exemplar-driven, iteratively calibrated, structured around reasoning rather than rules – stand to be more accurate and more transparent than those that do not.

References
  • Agarwal, R., Singh, A., Zhang, L. M., Bohnet, B., Rosias, L., Chan, S., ... & Larochelle, H. (2024). Many-shot in-context learning. Advances in Neural Information Processing Systems, 37. https://arxiv.org/abs/2404.11018
  • Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901. https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html
  • Choi, Y., Powell Tate, T., Ritchie, S., Nixon, N., & Warschauer, M. (2025). Anchor is the key: Toward accessible automated essay scoring with large language models through prompting. OSF Preprints. https://doi.org/10.35542/osf.io/cbhgz
  • Educational Testing Service. (2021). Best practices for constructed-response scoring. ETS. https://www.ets.org/pdfs/about/cr_best_practices.pdf
  • National Center for Education Statistics. (2024). NAEP technical documentation: Anchor papers. U.S. Department of Education. https://nces.ed.gov/nationsreportcard/tdw/scoring/training_scorers_guide_anchors.aspx
  • Ofqual. (2014). Standardisation methods, mark schemes, and their impact on marking reliability. Office of Qualifications and Examinations Regulation. https://www.gov.uk/government/publications/standardisation-methods-mark-schemes-marking-reliability
  • Queensland Curriculum and Assessment Authority. (2023). Consensus moderation model. QCAA. https://www.qcaa.qld.edu.au/downloads/aciqv9/general-resources/assessment/ac9_moderation_consensus_model.pdf
  • Wendler, C., Glazer, N., & Cline, F. (2019). Examining the calibration process for raters of the GRE general test (GRE Board Research Report No. GRE-19-01). Educational Testing Service. https://doi.org/10.1002/ets2.12245
  • Yancey, K. B., & Huot, B. (1997). Assessing writing across the curriculum: Diverse approaches and practices. Ablex Publishing.
  • Zelikman, E., Huang, Q., Poesia, G., Goodman, N. D., & Haber, N. (2024). Deeper insights without updates: The power of in-context learning over fine-tuning. arXiv. https://arxiv.org/abs/2410.04691