Inside the Multi-Agent Grading Engine: How MarkInMinutes Grades Like a Panel of Experts
A look under the hood at how our multi-agent AI architecture delivers consistent, evidence-based grading, with built-in fact-checking, adversarial review, and calibrated scoring.
When most people think of "AI grading," they picture a single model reading a student's work and spitting out a number. That approach has well-documented problems: inconsistency, shallow evaluation, and ungrounded feedback that crumbles under scrutiny.
MarkInMinutes takes a fundamentally different approach. Instead of one AI doing everything, we built a multi-agent grading engine: a pipeline of specialized agents that each handle one part of the evaluation process, then check each other's work. Think of it less like asking one overworked teaching assistant to grade 200 papers, and more like assembling a panel of experts who each bring a different lens to the evaluation.
Here's how it works.
The Pipeline: From Submission to Final Report
Every submission flows through a structured pipeline of ten distinct stages. Each stage has a specific job, and each one feeds its output into the next.
| Stage | Agent | What It Does |
|---|---|---|
| 1 | Loader | Loads the grading profile, rubric dimensions, and proficiency scale |
| 2 | Rubric Parser | Extracts specific requirements from the rubric document |
| 3a | Error Detector | Identifies technical errors: grammar, formatting, citations |
| 3b | Fact Checker | Verifies factual claims with parallel runs and an arbiter |
| 4a | Holistic Reviewer | Produces an overall assessment of the submission |
| 4b | Evaluators | Score each rubric dimension independently |
| 5 | Challenger | Plays devil's advocate, attacking the provisional scores |
| 6 | Auditor | Reconciles all perspectives into final, defensible scores |
| 7 | Error Clusterer | Groups errors into actionable patterns |
| 8–9 | Coaches | Generate short-term fixes and long-term growth plans |
| 10 | Report Generator | Synthesizes everything into the final feedback report |
Notice stages 3a/3b and 4a/4b: these run in parallel. The error detector and fact checker work simultaneously, as do the holistic reviewer and per-dimension evaluators. This isn't just an optimization for speed; it ensures that different perspectives develop independently without contaminating each other.
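The parallel fan-out at stages 3a/3b can be sketched in a few lines. This is a minimal illustration, not the engine's actual code: the agents here are stubs with invented return values, since the real agents are LLM-backed.

```python
from concurrent.futures import ThreadPoolExecutor

# Stubbed agents: in the real engine each stage wraps an LLM call,
# so the return values here are invented placeholders.
def error_detector(submission):
    return {"errors": ["missing citation on p. 2"]}

def fact_checker(submission):
    return {"claims_flagged": []}

def grade(submission):
    # Stages 3a and 3b run concurrently and never see each other's
    # output, so their findings develop independently.
    with ThreadPoolExecutor() as pool:
        f_errors = pool.submit(error_detector, submission)
        f_facts = pool.submit(fact_checker, submission)
        return {"error_report": f_errors.result(),
                "fact_report": f_facts.result()}
```

The key design point is that neither future's result is passed to the other agent; independence is structural, not a matter of prompt discipline.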
Separation of Concerns: Why Specialized Agents Matter
The most important architectural decision in our engine is that no single agent does the grading alone. Each agent is specialized, which means:
- The Rubric Parser only extracts requirements; it doesn't evaluate submissions
- The Evaluators only score individual dimensions; they don't try to form an overall impression
- The Fact Checker only verifies claims; it doesn't judge writing quality
- The Auditor only reconciles; it doesn't generate original assessments
This separation prevents the kind of "criterion blending" that plagues single-model grading, where an AI's impression of writing quality bleeds into its assessment of analytical depth, or where a factual error causes unfairly harsh scores across every dimension.
Why This Matters for Fairness
When a student makes a factual error in one section, it should affect the accuracy score for that dimension, not tank their marks for critical thinking, structure, or argumentation. Specialized agents enforce these boundaries automatically.
Adversarial Review: The Challenger–Auditor Pattern
Perhaps the most distinctive feature of our architecture is the adversarial review loop. After the evaluators assign provisional scores, two more agents step in:
The Challenger acts as a devil's advocate. Its only job is to find weaknesses in the provisional scores: places where the evaluator may have been too generous, where evidence is thin, or where the scoring doesn't align with the rubric's calibration anchors.
The Auditor then takes all perspectives (the evaluators' scores, the holistic reviewer's assessment, and the challenger's objections) and produces the final determination. It reconciles disagreements, applies any task-alignment modifiers, and documents an explicit audit trail explaining every adjustment.
This isn't just quality control. It's a structural guarantee that every score has been stress-tested before it reaches the student.
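To make the loop concrete, here is one way the challenge-then-reconcile step could be modeled. The data shapes and heuristics are hypothetical (a real challenger's objections come from an LLM, not a citation count); this only illustrates the control flow.

```python
def challenger(provisional):
    # Illustrative heuristic: object to any score backed by fewer
    # than two evidence citations. The real agent reasons over text.
    return [dim for dim, s in provisional.items()
            if len(s["evidence"]) < 2]

def auditor(provisional, objections):
    # Reconcile: a sustained objection pulls the score down one notch,
    # and every adjustment is recorded in an explicit audit trail.
    final, audit_trail = {}, []
    for dim, s in provisional.items():
        score = s["score"]
        if dim in objections:
            score -= 1
            audit_trail.append(f"{dim}: lowered 1 notch (insufficient evidence)")
        final[dim] = score
    return final, audit_trail

provisional = {
    "accuracy": {"score": 9, "evidence": ["p.2 quote", "p.4 table"]},
    "argumentation": {"score": 8, "evidence": ["p.1 thesis"]},
}
final, trail = auditor(provisional, challenger(provisional))
```

Note that the auditor never invents a new assessment; it only adjusts what the evaluators produced, which is exactly the separation of concerns described above.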
Fact-Checking with Parallel Runs and an Arbiter
Fact-checking is notoriously difficult for AI systems. A single pass can miss errors or, worse, flag correct information as wrong. Our engine addresses this with a three-step process:
- Two independent fact-checking runs execute in parallel, each analyzing the submission's claims without seeing the other's results
- An arbiter compares both sets of findings, resolving disagreements and filtering out false positives
- A dimension mapper routes confirmed factual issues to the specific rubric dimensions they affect
This parallel-then-reconcile approach dramatically reduces both false positives (incorrectly flagging accurate work) and false negatives (missing genuine errors). It's the same principle behind having two independent reviewers in academic peer review: agreement between independent assessors is far more reliable than a single opinion.
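A rough sketch of the parallel-then-reconcile step: the arbiter policy shown here (confirm only findings both runs agree on) is one plausible reconciliation rule, not necessarily the engine's actual policy, and the fact-check runs are stubbed with invented findings.

```python
from concurrent.futures import ThreadPoolExecutor

# Stubbed fact-check runs; in practice each is an independent LLM pass
# that never sees the other's results.
def run_a(submission):
    return {"Claim 2 is unsupported", "Claim 5 misstates the date"}

def run_b(submission):
    return {"Claim 5 misstates the date"}

def arbiter(findings_a, findings_b):
    # Findings both independent runs report are confirmed; findings
    # only one run reports are treated as likely false positives.
    return findings_a & findings_b

with ThreadPoolExecutor() as pool:
    a = pool.submit(run_a, "essay text")
    b = pool.submit(run_b, "essay text")
    confirmed = arbiter(a.result(), b.result())
```

After arbitration, the confirmed issues would be handed to the dimension mapper, which routes each one to the rubric dimension it affects.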
Calibrated Scoring: What "Good" Actually Means
One of the biggest problems with AI grading is score drift β where the model's interpretation of "excellent" shifts between submissions. We solve this with a calibrated proficiency system.
Every grading profile includes a 5-level proficiency scale, from Distinguished to Novice, with 11 notches for finer granularity (High, Mid, and Low subdivisions for the top three levels). Each dimension comes with calibration anchors: concrete descriptions of what work looks like at each level.
When an evaluator scores a dimension, it doesn't just pick a number. It maps the submission's evidence against these pre-defined benchmarks, ensuring that "Proficient β High" means the same thing for the first submission graded and the two-hundredth.
| Level | Notches | Description |
|---|---|---|
| 5 – Distinguished | High / Mid / Low | Exceptional mastery, goes beyond expectations |
| 4 – Proficient | High / Mid / Low | Strong competency, meets all criteria effectively |
| 3 – Developing | High / Mid / Low | Emerging understanding, inconsistent application |
| 2 – Beginning | (none) | Limited demonstration of required skills |
| 1 – Novice | (none) | Minimal or no evidence of competency |
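The scale above is easy to represent as data. Assuming a representation along these lines (the structure is invented for illustration), the 11 notches enumerate directly, so an evaluator picks a calibrated point on a fixed ordinal scale rather than a free-floating number:

```python
# Five levels, highest first; the top three subdivide into notches.
LEVELS = [
    ("Distinguished", True),
    ("Proficient", True),
    ("Developing", True),
    ("Beginning", False),
    ("Novice", False),
]

def notches():
    # Enumerate every scoring point, from Distinguished - High
    # down to Novice: 3 levels x 3 notches + 2 levels = 11 points.
    result = []
    for level, subdivided in LEVELS:
        if subdivided:
            result += [f"{level} - {n}" for n in ("High", "Mid", "Low")]
        else:
            result.append(level)
    return result

SCALE = notches()
```

Because the scale is fixed data rather than model output, "Proficient - High" denotes the same position for every submission.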
Task Alignment Gate: Did They Answer the Question?
Before any detailed grading begins, our holistic reviewer performs a task alignment check, a pre-grading validation that asks a simple but crucial question: Did the student actually address the assignment?
This produces one of three statuses:
- Fully met: The submission addresses all core requirements; grading proceeds normally
- Partially met: Some requirements are addressed; grading proceeds, but scores may be capped
- Not met: The submission doesn't address the assignment; an automatic modifier is applied
This prevents a well-written essay on the wrong topic from receiving high scores simply because the prose is polished. It's a check that experienced human graders perform instinctively, and our architecture makes it explicit and systematic.
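The gate reduces to a small decision function. The specific cap and modifier values below are invented placeholders; in the real engine they would come from the grading profile.

```python
def task_alignment_gate(status, raw_score):
    # status is one of "fully_met", "partially_met", "not_met";
    # raw_score is on a 1-10 scale here purely for illustration.
    if status == "fully_met":
        return raw_score                  # normal grading
    if status == "partially_met":
        return min(raw_score, 7)          # illustrative score cap
    if status == "not_met":
        return round(raw_score * 0.4)     # illustrative automatic modifier
    raise ValueError(f"unknown status: {status}")
```

A polished essay on the wrong topic might earn a high raw score from the prose-focused dimensions, but the gate ensures that score never survives unmodified.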
Evidence-Based Everything
Every score in our system comes with specific citations from the student's work. Not vague statements like "demonstrates good understanding," but direct references to passages, pages, and quoted text.
This evidence requirement is enforced at the architectural level: evaluators must attach evidence_citations to every score, and the auditor verifies that these citations actually support the assigned grade. If an evaluator's reasoning doesn't hold up against the cited evidence, the auditor adjusts the score.
The result: when a student asks "Why did I get this grade?", the documentation already exists. Every score has a paper trail.
Evidence Trails in Practice
Each dimension score includes: the proficiency level and notch, a scoring rationale, specific evidence citations with page references, and a detailed explanation of how the evidence maps to the rubric criteria. This level of transparency would take a human grader significantly longer to produce manually.
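A dimension score record of the kind described above might carry a shape like this. Apart from `evidence_citations`, which the text names explicitly, all field names are illustrative, as is the structural check, which stands in for the auditor's far richer verification.

```python
# Hypothetical shape of one dimension's score record.
score = {
    "dimension": "critical_thinking",
    "level": "Proficient",
    "notch": "High",
    "rationale": "Counterarguments are anticipated and addressed.",
    "evidence_citations": [
        {"page": 3, "quote": "While critics argue X, the data show Y..."},
    ],
}

def has_evidence_trail(s):
    # A minimal structural check: every score must arrive with at
    # least one citation, and each citation must locate its evidence.
    citations = s.get("evidence_citations", [])
    return bool(citations) and all(
        "page" in c and "quote" in c for c in citations
    )
```

The real auditor goes further, checking that the quoted evidence actually supports the assigned level, but even the structural requirement rules out scores with no paper trail.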
Configuration-Driven, Not Hard-Coded
Our grading engine doesn't have "one way" of grading hard-coded into it. Instead, behavior is defined by grading profiles, configuration objects that specify:
- Which dimensions to evaluate and how to weight them
- What the calibration anchors look like for each dimension
- Which skills are critical (failing a critical skill can fail the entire assignment)
- What level of feedback depth to provide
- The education level and subject context
This means the same engine can grade a university-level research paper, a high school persuasive essay, and a technical lab report β each with appropriate criteria and expectations. Educators define what matters; the engine handles how to evaluate it consistently.
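A grading profile, then, is just data. A hypothetical profile for a high school persuasive essay might look like this; every key and value is invented for illustration, not taken from the product's actual schema.

```python
profile = {
    "education_level": "high_school",
    "subject": "English",
    "feedback_depth": "detailed",
    "dimensions": [
        # "critical": failing a critical skill can fail the assignment.
        {"name": "argumentation", "weight": 0.4, "critical": True,
         "anchors": {"Distinguished": "Claims are original and fully warranted",
                     "Novice": "No discernible claim"}},
        {"name": "structure", "weight": 0.3, "critical": False,
         "anchors": {"Distinguished": "Seamless progression of ideas",
                     "Novice": "No organizational logic"}},
        {"name": "mechanics", "weight": 0.3, "critical": False,
         "anchors": {"Distinguished": "Error-free prose",
                     "Novice": "Errors impede comprehension"}},
    ],
}

# Sanity check: weights should sum to 1 so weighted scores average cleanly.
assert abs(sum(d["weight"] for d in profile["dimensions"]) - 1.0) < 1e-9
```

Swapping in a different profile (a lab report with an "experimental method" dimension, say) changes what the engine evaluates without touching how it evaluates.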
The Benefits: Why Architecture Matters
All of these architectural decisions add up to concrete benefits for educators and students:
Consistency across submissions. The calibrated proficiency scale and multi-agent review ensure that the 200th submission is graded to the same standard as the 1st. No fatigue, no drift.
Fairness through independence. Parallel evaluation and adversarial review prevent any single bias from propagating through the entire grade. Each perspective develops independently and gets challenged before finalization.
Transparency and defensibility. Evidence citations and audit trails mean every grade can be explained and justified. No black boxes.
Speed without shortcuts. Parallel execution of independent stages means thorough evaluation completes faster than serial processing would allow, without sacrificing any step in the pipeline.
Adaptability across contexts. Configuration-driven profiles mean the engine adapts to different subjects, education levels, and grading philosophies without code changes.
The Bottom Line
The quality of AI grading isn't just about using a better language model; it's about how you orchestrate that intelligence. A single model, no matter how capable, will always struggle with the consistency, fairness, and transparency that grading demands.
By decomposing the grading process into specialized agents, introducing adversarial review, enforcing evidence-based scoring, and calibrating against concrete benchmarks, we've built a system that doesn't just produce grades; it produces grades you can trust.
Want to see these principles in action? Read about why general-purpose chatbots fall short for grading, or explore our example results to see evidence-based feedback firsthand.
Written by
The team behind MarkInMinutes, building AI-powered grading tools for educators worldwide.
Related Articles

AI Grading Showdown: MarkInMinutes vs ChatGPT vs Claude
We ran the same 8 student papers through ChatGPT, Claude, and MarkInMinutes 3 times each. The results reveal dramatic differences in consistency, rubric adherence, and grading quality. Here's what we found.
Why You Can't Simply Use ChatGPT for Grading
ChatGPT is great for many things, but consistent, rubric-based grading isn't one of them. Here's why dedicated AI grading tools outperform general-purpose LLMs, and what to look for instead.