Inside the Multi-Agent Grading Engine: How MarkInMinutes Grades Like a Panel of Experts
A look under the hood at how our multi-agent AI architecture delivers consistent, evidence-based grading, with built-in fact-checking, adversarial review, and calibrated scoring.
When most people think of "AI grading," they picture a single model reading a student's work and spitting out a number. That approach has well-documented problems: inconsistency, shallow evaluation, and ungrounded feedback that crumbles under scrutiny.
MarkInMinutes takes a fundamentally different approach. Instead of one AI doing everything, we built a multi-agent grading engine: a pipeline of specialized agents that each handle one part of the evaluation process, then check each other's work. Think of it less like asking one overworked teaching assistant to grade 200 papers, and more like assembling a panel of experts who each bring a different lens to the evaluation.
Here's how it works.
The Pipeline: From Submission to Final Report
Every submission flows through a structured pipeline of ten distinct stages. Each stage has a specific job, and each one feeds its output into the next.
| Stage | Agent | What It Does |
|---|---|---|
| 1 | Loader | Loads the grading profile, rubric dimensions, and proficiency scale |
| 2 | Rubric Parser | Extracts specific requirements from the rubric document |
| 3a | Error Detector | Identifies technical errors: grammar, formatting, citations |
| 3b | Fact Checker | Verifies factual claims with parallel runs and an arbiter |
| 4a | Holistic Reviewer | Produces an overall assessment of the submission |
| 4b | Evaluators | Score each rubric dimension independently |
| 5 | Challenger | Plays devil's advocate, attacking the provisional scores |
| 6 | Auditor | Reconciles all perspectives into final, defensible scores |
| 7 | Error Clusterer | Groups errors into actionable patterns |
| 8–9 | Coaches | Generate short-term fixes and long-term growth plans |
| 10 | Report Generator | Synthesizes everything into the final feedback report |
Notice stages 3a/3b and 4a/4b: these run in parallel. The error detector and fact checker work simultaneously, as do the holistic reviewer and per-dimension evaluators. This isn't just an optimization for speed; it ensures that different perspectives develop independently without contaminating each other.
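The parallel fan-out at stages 3a/3b can be sketched in a few lines. This is a minimal illustration, not the engine's actual code: the agents here are stubs with invented return values, since the real agents are LLM-backed.

```python
from concurrent.futures import ThreadPoolExecutor

# Stubbed agents: in the real engine each stage wraps an LLM call,
# so the return values here are invented placeholders.
def error_detector(submission):
    return {"errors": ["missing citation on p. 2"]}

def fact_checker(submission):
    return {"claims_flagged": []}

def grade(submission):
    # Stages 3a and 3b run concurrently and never see each other's
    # output, so their findings develop independently.
    with ThreadPoolExecutor() as pool:
        f_errors = pool.submit(error_detector, submission)
        f_facts = pool.submit(fact_checker, submission)
        return {"error_report": f_errors.result(),
                "fact_report": f_facts.result()}
```

The key design point is that neither future's result is passed to the other agent; independence is structural, not a matter of prompt discipline.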
Separation of Concerns: Why Specialized Agents Matter
The most important architectural decision in our engine is that no single agent does the grading alone. Each agent is specialized, which means:
- The Rubric Parser only extracts requirements; it doesn't evaluate submissions
- The Evaluators only score individual dimensions; they don't try to form an overall impression
- The Fact Checker only verifies claims; it doesn't judge writing quality
- The Auditor only reconciles; it doesn't generate original assessments
This separation prevents the kind of "criterion blending" that plagues single-model grading, where an AI's impression of writing quality bleeds into its assessment of analytical depth, or where a factual error causes unfairly harsh scores across every dimension.
Why This Matters for Fairness
When a student makes a factual error in one section, it should affect the accuracy score for that dimension, not tank their marks for critical thinking, structure, or argumentation. Specialized agents enforce these boundaries automatically.
Adversarial Review: The Challenger–Auditor Pattern
Perhaps the most distinctive feature of our architecture is the adversarial review loop. After the evaluators assign provisional scores, two more agents step in:
The Challenger acts as a devil's advocate. Its only job is to find weaknesses in the provisional scores: places where the evaluator may have been too generous, where evidence is thin, or where the scoring doesn't align with the rubric's calibration anchors.
The Auditor then takes all perspectives (the evaluators' scores, the holistic reviewer's assessment, and the challenger's objections) and produces the final determination. It reconciles disagreements, applies any task-alignment modifiers, and documents an explicit audit trail explaining every adjustment.
This isn't just quality control. It's a structural guarantee that every score has been stress-tested before it reaches the student.
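To make the loop concrete, here is one way the challenge-then-reconcile step could be modeled. The data shapes and heuristics are hypothetical (a real challenger's objections come from an LLM, not a citation count); this only illustrates the control flow.

```python
def challenger(provisional):
    # Illustrative heuristic: object to any score backed by fewer
    # than two evidence citations. The real agent reasons over text.
    return [dim for dim, s in provisional.items()
            if len(s["evidence"]) < 2]

def auditor(provisional, objections):
    # Reconcile: a sustained objection pulls the score down one notch,
    # and every adjustment is recorded in an explicit audit trail.
    final, audit_trail = {}, []
    for dim, s in provisional.items():
        score = s["score"]
        if dim in objections:
            score -= 1
            audit_trail.append(f"{dim}: lowered 1 notch (insufficient evidence)")
        final[dim] = score
    return final, audit_trail

provisional = {
    "accuracy": {"score": 9, "evidence": ["p.2 quote", "p.4 table"]},
    "argumentation": {"score": 8, "evidence": ["p.1 thesis"]},
}
final, trail = auditor(provisional, challenger(provisional))
```

Note that the auditor never invents a new assessment; it only adjusts what the evaluators produced, which is exactly the separation of concerns described above.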
Fact-Checking with Parallel Runs and an Arbiter
Fact-checking is notoriously difficult for AI systems. A single pass can miss errors or, worse, flag correct information as wrong. Our engine addresses this with a three-step process:
- Two independent fact-checking runs execute in parallel, each analyzing the submission's claims without seeing the other's results
- An arbiter compares both sets of findings, resolving disagreements and filtering out false positives
- A dimension mapper routes confirmed factual issues to the specific rubric dimensions they affect
This parallel-then-reconcile approach dramatically reduces both false positives (incorrectly flagging accurate work) and false negatives (missing genuine errors). It's the same principle behind having two independent reviewers in academic peer review: agreement between independent assessors is far more reliable than a single opinion.
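A rough sketch of the parallel-then-reconcile step: the arbiter policy shown here (confirm only findings both runs agree on) is one plausible reconciliation rule, not necessarily the engine's actual policy, and the fact-check runs are stubbed with invented findings.

```python
from concurrent.futures import ThreadPoolExecutor

# Stubbed fact-check runs; in practice each is an independent LLM pass
# that never sees the other's results.
def run_a(submission):
    return {"Claim 2 is unsupported", "Claim 5 misstates the date"}

def run_b(submission):
    return {"Claim 5 misstates the date"}

def arbiter(findings_a, findings_b):
    # Findings both independent runs report are confirmed; findings
    # only one run reports are treated as likely false positives.
    return findings_a & findings_b

with ThreadPoolExecutor() as pool:
    a = pool.submit(run_a, "essay text")
    b = pool.submit(run_b, "essay text")
    confirmed = arbiter(a.result(), b.result())
```

After arbitration, the confirmed issues would be handed to the dimension mapper, which routes each one to the rubric dimension it affects.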
Calibrated Scoring: What "Good" Actually Means
One of the biggest problems with AI grading is score drift β where the model's interpretation of "excellent" shifts between submissions. We solve this with a calibrated proficiency system.
Every grading profile includes a 5-level proficiency scale, from Distinguished to Novice, with 11 notches for finer granularity (High, Mid, and Low subdivisions for the top three levels). Each dimension comes with calibration anchors: concrete descriptions of what work looks like at each level.
When an evaluator scores a dimension, it doesn't just pick a number. It maps the submission's evidence against these pre-defined benchmarks, ensuring that "Proficient β High" means the same thing for the first submission graded and the two-hundredth.
| Level | Notches | Description |
|---|---|---|
| 5 – Distinguished | High / Mid / Low | Exceptional mastery, goes beyond expectations |
| 4 – Proficient | High / Mid / Low | Strong competency, meets all criteria effectively |
| 3 – Developing | High / Mid / Low | Emerging understanding, inconsistent application |
| 2 – Beginning | (none) | Limited demonstration of required skills |
| 1 – Novice | (none) | Minimal or no evidence of competency |
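The scale above is easy to represent as data. Assuming a representation along these lines (the structure is invented for illustration), the 11 notches enumerate directly, so an evaluator picks a calibrated point on a fixed ordinal scale rather than a free-floating number:

```python
# Five levels, highest first; the top three subdivide into notches.
LEVELS = [
    ("Distinguished", True),
    ("Proficient", True),
    ("Developing", True),
    ("Beginning", False),
    ("Novice", False),
]

def notches():
    # Enumerate every scoring point, from Distinguished - High
    # down to Novice: 3 levels x 3 notches + 2 levels = 11 points.
    result = []
    for level, subdivided in LEVELS:
        if subdivided:
            result += [f"{level} - {n}" for n in ("High", "Mid", "Low")]
        else:
            result.append(level)
    return result

SCALE = notches()
```

Because the scale is fixed data rather than model output, "Proficient - High" denotes the same position for every submission.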
Task Alignment Gate: Did They Answer the Question?
Before any detailed grading begins, our holistic reviewer performs a task alignment check, a pre-grading validation that asks a simple but crucial question: Did the student actually address the assignment?
This produces one of three statuses:
- Fully met: The submission addresses all core requirements; grading proceeds normally
- Partially met: Some requirements are addressed; grading proceeds, but scores may be capped
- Not met: The submission doesn't address the assignment; an automatic modifier is applied
This prevents a well-written essay on the wrong topic from receiving high scores simply because the prose is polished. It's a check that experienced human graders perform instinctively, and our architecture makes it explicit and systematic.
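The gate reduces to a small decision function. The specific cap and modifier values below are invented placeholders; in the real engine they would come from the grading profile.

```python
def task_alignment_gate(status, raw_score):
    # status is one of "fully_met", "partially_met", "not_met";
    # raw_score is on a 1-10 scale here purely for illustration.
    if status == "fully_met":
        return raw_score                  # normal grading
    if status == "partially_met":
        return min(raw_score, 7)          # illustrative score cap
    if status == "not_met":
        return round(raw_score * 0.4)     # illustrative automatic modifier
    raise ValueError(f"unknown status: {status}")
```

A polished essay on the wrong topic might earn a high raw score from the prose-focused dimensions, but the gate ensures that score never survives unmodified.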
Evidence-Based Everything
Every score in our system comes with specific citations from the student's work. Not vague statements like "demonstrates good understanding," but direct references to passages, pages, and quoted text.
This evidence requirement is enforced at the architectural level: evaluators must attach evidence_citations to every score, and the auditor verifies that these citations actually support the assigned grade. If an evaluator's reasoning doesn't hold up against the cited evidence, the auditor adjusts the score.
The result: when a student asks "Why did I get this grade?", the documentation already exists. Every score has a paper trail.
Evidence Trails in Practice
Each dimension score includes: the proficiency level and notch, a scoring rationale, specific evidence citations with page references, and a detailed explanation of how the evidence maps to the rubric criteria. This level of transparency would take a human grader significantly longer to produce manually.
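A dimension score record of the kind described above might carry a shape like this. Apart from `evidence_citations`, which the text names explicitly, all field names are illustrative, as is the structural check, which stands in for the auditor's far richer verification.

```python
# Hypothetical shape of one dimension's score record.
score = {
    "dimension": "critical_thinking",
    "level": "Proficient",
    "notch": "High",
    "rationale": "Counterarguments are anticipated and addressed.",
    "evidence_citations": [
        {"page": 3, "quote": "While critics argue X, the data show Y..."},
    ],
}

def has_evidence_trail(s):
    # A minimal structural check: every score must arrive with at
    # least one citation, and each citation must locate its evidence.
    citations = s.get("evidence_citations", [])
    return bool(citations) and all(
        "page" in c and "quote" in c for c in citations
    )
```

The real auditor goes further, checking that the quoted evidence actually supports the assigned level, but even the structural requirement rules out scores with no paper trail.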
Configuration-Driven, Not Hard-Coded
Our grading engine doesn't have "one way" of grading hard-coded into it. Instead, behavior is defined by grading profiles, configuration objects that specify:
- Which dimensions to evaluate and how to weight them
- What the calibration anchors look like for each dimension
- Which skills are critical (failing a critical skill can fail the entire assignment)
- What level of feedback depth to provide
- The education level and subject context
This means the same engine can grade a university-level research paper, a high school persuasive essay, and a technical lab report β each with appropriate criteria and expectations. Educators define what matters; the engine handles how to evaluate it consistently.
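A grading profile, then, is just data. A hypothetical profile for a high school persuasive essay might look like this; every key and value is invented for illustration, not taken from the product's actual schema.

```python
profile = {
    "education_level": "high_school",
    "subject": "English",
    "feedback_depth": "detailed",
    "dimensions": [
        # "critical": failing a critical skill can fail the assignment.
        {"name": "argumentation", "weight": 0.4, "critical": True,
         "anchors": {"Distinguished": "Claims are original and fully warranted",
                     "Novice": "No discernible claim"}},
        {"name": "structure", "weight": 0.3, "critical": False,
         "anchors": {"Distinguished": "Seamless progression of ideas",
                     "Novice": "No organizational logic"}},
        {"name": "mechanics", "weight": 0.3, "critical": False,
         "anchors": {"Distinguished": "Error-free prose",
                     "Novice": "Errors impede comprehension"}},
    ],
}

# Sanity check: weights should sum to 1 so weighted scores average cleanly.
assert abs(sum(d["weight"] for d in profile["dimensions"]) - 1.0) < 1e-9
```

Swapping in a different profile (a lab report with an "experimental method" dimension, say) changes what the engine evaluates without touching how it evaluates.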
The Benefits: Why Architecture Matters
All of these architectural decisions add up to concrete benefits for educators and students:
Consistency across submissions. The calibrated proficiency scale and multi-agent review ensure that the 200th submission is graded to the same standard as the 1st. No fatigue, no drift.
Fairness through independence. Parallel evaluation and adversarial review prevent any single bias from propagating through the entire grade. Each perspective develops independently and gets challenged before finalization.
Transparency and defensibility. Evidence citations and audit trails mean every grade can be explained and justified. No black boxes.
Speed without shortcuts. Parallel execution of independent stages means thorough evaluation completes faster than serial processing would allow, without sacrificing any step in the pipeline.
Adaptability across contexts. Configuration-driven profiles mean the engine adapts to different subjects, education levels, and grading philosophies without code changes.
The Bottom Line
The quality of AI grading isn't just about using a better language model; it's about how you orchestrate that intelligence. A single model, no matter how capable, will always struggle with the consistency, fairness, and transparency that grading demands.
By decomposing the grading process into specialized agents, introducing adversarial review, enforcing evidence-based scoring, and calibrating against concrete benchmarks, we've built a system that doesn't just produce grades; it produces grades you can trust.
Want to see these principles in action? Read about why general-purpose chatbots fall short for grading, or explore our example results to see evidence-based feedback firsthand.
Written by
The team behind MarkInMinutes, building AI-powered grading tools for educators worldwide.
Related Articles

AI Grading Showdown: MarkInMinutes vs ChatGPT vs Claude
We ran the same 8 student papers through ChatGPT, Claude, and MarkInMinutes 3 times each. The results reveal dramatic differences in consistency, rubric adherence, and grading quality. Here's what we found.
Why You Can't Simply Use ChatGPT for Grading
ChatGPT is great for many things, but consistent, rubric-based grading isn't one of them. Here's why dedicated AI grading tools outperform general-purpose LLMs, and what to look for instead.