
AI Grading Showdown: MarkInMinutes vs ChatGPT vs Claude

We ran the same 8 student papers through ChatGPT, Claude, and MarkInMinutes 3 times each. The results reveal dramatic differences in consistency, rubric adherence, and grading quality. Here's what we found.

MarkInMinutes Team
February 12, 2026 · 14 min read
Bar chart comparing grading consistency of ChatGPT, Claude, and MarkInMinutes across 72 benchmark runs

Every educator who has experimented with AI grading has asked the same question: Can I just paste my rubric into ChatGPT and get reliable grades?

We decided to answer that question definitively. Not with theory, not with opinion β€” with data. We took 8 real student submissions from a university-level finance course, ran them through three different AI grading systems, and measured everything: consistency, rubric adherence, dimension-level stability, differentiation, cost, and speed.

The results were decisive — and some of them genuinely surprised us.

Executive Summary

We graded 8 real student papers 3 times each with ChatGPT (GPT-4o), Claude (Sonnet 4), and MarkInMinutes β€” 72 grading runs total. Here are the headline findings:

| Metric | ChatGPT | Claude | MarkInMinutes |
| --- | --- | --- | --- |
| Exact Notch Agreement | 31.2% 🔴 | 100.0% 🟢 | 93.8% 🟢 |
| Cohen's Kappa | -0.067 (Poor) 🔴 | 1.000 (Excellent) 🟢 | 0.914 (Excellent) 🟢 |
| Dimension Stability | 17% 🔴 | 58% 🔴 | 94% 🟢 |
| Unique Notches Used | 5 🟢 | 2 🔴 | 4 🟢 |
| Avg Run Time | 1.7 min 🟡 | 34s 🟢 | 10.1 min 🟡 |
| Result | Not suitable | Not suitable | Professional grade |

The bottom line:

  • ChatGPT is unreliable for grading. The same paper can receive grades 4 notches apart between runs β€” worse than random chance (negative Cohen's Kappa). Dimension-level stability is just 17%.
  • Claude appears perfectly consistent on the surface (100% overall agreement), but this is an illusion of aggregation. At the rubric dimension level, stability drops to 58% β€” the individual scores fluctuate, they just happen to cancel out. It also compresses nearly all grades into a single notch, failing to differentiate between submissions.
  • MarkInMinutes delivers genuine structural consistency: 93.8% overall and 94% at the dimension level. It meaningfully differentiates between submissions using 4 distinct notches, and every dimension score is reproducible. The tradeoff is a 10-minute processing time due to its multi-agent architecture.

The Bottom Line

If you need grades you can defend β€” to students, to colleagues, to accreditation bodies β€” MarkInMinutes is the only tool in this comparison that produces reliable results at every level of the rubric. Read on for the full methodology and data.

The Experiment

Setup

We selected 8 distinct group submissions from a graduate-level corporate finance assignment. Each submission consisted of a multi-page PDF covering financial modeling, strategic analysis, argumentative writing, and data visualization β€” the kind of complex, multi-dimensional work where grading consistency matters most.

Each submission was graded against a detailed proficiency-based rubric with four weighted dimensions:

| Dimension | What It Measures |
| --- | --- |
| Quantitative Modeling & Valuation Rigor | Accuracy of financial models, calculations, and valuation methods |
| Strategic Synthesis & Thesis Development | Quality of strategic arguments and thesis coherence |
| Argumentative Architecture & Red Thread | Logical flow, structure, and persuasiveness |
| Data Visualization & Professional Interface | Chart quality, formatting, and professional presentation |

The grading scale uses a 5-level, 11-notch proficiency system β€” from Novice (1.1) to Distinguished-High (5.3) β€” specifically designed for nuanced assessment. Think of it as a more granular version of traditional letter grades where "B+" and "B" represent meaningfully different achievement levels.

The Three Contestants

  1. ChatGPT (GPT-4o): OpenAI's flagship model. We provided it with the full rubric, proficiency scale, and submission PDF, then asked for structured grading output β€” the same approach thousands of educators use today.

  2. Claude (Sonnet 4): Anthropic's latest reasoning model. Same setup: full rubric, scale, and submission as native PDF input with structured JSON output.

  3. MarkInMinutes: Our multi-agent grading engine β€” a pipeline of 10 specialized AI agents that evaluate, cross-check, challenge, and audit each other's work before producing a final grade.

The Protocol

Every submission was graded 3 times by each model, yielding 72 total grading runs (8 submissions Γ— 3 models Γ— 3 runs). The repetitions allow us to measure something critical: inter-rater reliability β€” does the tool give the same grade when grading the same paper twice?

This is the AI equivalent of the test every human grading panel must pass: if two graders can't agree on a score, neither can be trusted.

Result 1: The Consistency Gap Is Enormous

The first thing that jumps out from the data is how wildly different the three models are in their consistency. We start with the most intuitive question: does the same paper get the same grade?

Grade Agreement: Level vs Notch

We measured agreement at two levels of precision. Level agreement checks whether the broad proficiency category (e.g. "Accomplished") matches across runs. Notch agreement is stricter β€” it requires the exact sub-level (e.g. "Accomplished-Mid") to match.
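Concretely, both measures can be computed from the repeated-run grades in a few lines of code. The sketch below is illustrative only: the notch strings, the three-run layout, and the pairwise definition of agreement are assumptions about one reasonable way to score it, not the benchmark's actual script.

```python
from itertools import combinations

# Hypothetical repeated-run grades: submission -> notch string per run.
# The integer part of a notch is the proficiency level (e.g. "4" = Accomplished);
# the full string is the exact notch (e.g. "4.2" = Accomplished-Mid).
runs = {
    "paper_01": ["4.2", "4.2", "4.2"],   # perfectly reproducible
    "paper_02": ["4.1", "5.2", "4.3"],   # volatile across runs
}

def pairwise_agreement(grades, key):
    """Share of run pairs that agree once `key` is applied to each grade."""
    pairs = list(combinations(grades, 2))
    return sum(key(a) == key(b) for a, b in pairs) / len(pairs)

level = [pairwise_agreement(g, key=lambda n: n.split(".")[0]) for g in runs.values()]
notch = [pairwise_agreement(g, key=lambda n: n) for g in runs.values()]

print("level agreement:", sum(level) / len(level))   # broad category match
print("notch agreement:", sum(notch) / len(notch))   # exact sub-level match
```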

How often does the same paper get the exact same grade across 3 runs?

| Model | Level Agreement | Notch Agreement |
| --- | --- | --- |
| ChatGPT | 62.5% | 31.2% |
| Claude | 100.0% | 100.0% |
| MarkInMinutes | 93.8% | 93.8% |

Grouped column chart: broad level agreement vs strict notch agreement per model. ChatGPT drops from 62.5% to 31.2% when moving from level to notch, revealing how imprecise its scoring really is.

At the broad level, ChatGPT agrees with itself only 62.5% of the time β€” and when you zoom into the exact notch, it drops to 31.2%. Claude and MarkInMinutes both maintain near-perfect agreement at both levels, but as we'll see, the underlying story is very different.

Statistical Reliability: Kappa & Grade Drift

Raw agreement percentages don't account for chance. Cohen's Kappa is a standard statistical measure of inter-rater reliability that corrects for this. We also tracked how far each grade drifts on average between runs.
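Both statistics are straightforward to reproduce on your own grading runs. Here is a minimal sketch using scikit-learn's cohen_kappa_score; the notch-to-index mapping and the paired-run data below are illustrative assumptions, not the benchmark's actual inputs.

```python
from sklearn.metrics import cohen_kappa_score

# Ordinal positions for the notches seen in this benchmark (a subset of the full scale).
NOTCH_INDEX = {"4.1": 0, "4.2": 1, "4.3": 2, "5.1": 3, "5.2": 4}

# Hypothetical paired runs: for each paper, the notch from two runs of the same model.
run_a = ["4.2", "4.1", "4.3", "5.1"]
run_b = ["4.2", "5.2", "4.3", "5.1"]

# Cohen's Kappa: chance-corrected agreement between the two runs.
kappa = cohen_kappa_score(run_a, run_b)

# Average notch drift: mean ordinal distance between paired grades.
drift = sum(abs(NOTCH_INDEX[a] - NOTCH_INDEX[b]) for a, b in zip(run_a, run_b)) / len(run_a)

print(f"kappa={kappa:.3f}, avg notch change={drift:.2f}")
```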

Cohen's Kappa and average grade drift between runs:

| Model | Cohen's Kappa | Avg Notch Change |
| --- | --- | --- |
| ChatGPT | -0.067 | 1.19 |
| Claude | 1.000 | 0.00 |
| MarkInMinutes | 0.914 | 0.06 |

Grouped column chart: Cohen's Kappa (higher = better) and average notch change (lower = more stable). A Kappa below zero means ChatGPT agrees with itself less than random chance, and an average notch change of 1.19 means each paper's grade shifts by more than one full notch between runs. Claude and MarkInMinutes both show near-zero drift, but the next chart reveals a crucial difference.

ChatGPT's Kappa of -0.067 is classified as "Poor" β€” it agrees with itself less often than random chance would predict. In practical terms: the same paper received grades ranging from Accomplished-Low (4.1) to Distinguished-Mid (5.2) across three runs. That's a 4-notch swing β€” the equivalent of getting a B- on Monday and an A on Tuesday for the same work.

What This Means for Students

If you're using ChatGPT for grading, two identical submissions could receive meaningfully different grades depending on when they happen to be processed. That's not just inaccurate β€” it's fundamentally unfair.

Claude and MarkInMinutes both show excellent statistical reliability β€” Claude with a perfect 1.000, MarkInMinutes with 0.914. But the next metric changes everything.

Result 2: Claude's Consistency Is an Illusion

Here's where the experiment gets interesting. When we look beyond the overall grade and examine the four individual rubric dimensions, a completely different picture emerges.

Dimension-Level Stability

Average exact agreement across all four rubric dimensions — the real test of consistency:

| Model | Dimension-Level Stability |
| --- | --- |
| ChatGPT | 17% |
| Claude | 58% |
| MarkInMinutes | 94% |

This changes everything for Claude. Despite 100% overall grade agreement, Claude's rubric dimensions are stable only 58% of the time. The per-dimension scores fluctuate — they just happen to cancel out. MarkInMinutes achieves 94% dimension stability — its consistency is structural, not statistical.

Radar chart: exact agreement per rubric dimension (Quantitative Modeling, Strategic Synthesis, Argumentation, Data Visualization); larger area = more consistent. Key insight: Claude achieves 100% overall stability but only 58% dimension stability — the consistency is a statistical artifact. MarkInMinutes achieves 94% stability at every level of the rubric.

Claude's dimension-level stability averages just 58% β€” meaning that across the four grading dimensions, Claude gives different scores to the same paper roughly half the time. Some dimensions are as low as 50% stability.

How can the overall grade be perfectly stable while the dimensions underneath it are volatile? It's a compensation effect: when Claude grades a paper multiple times, one dimension goes up while another goes down, and these fluctuations happen to cancel out when aggregated into the overall notch. The consistency is statistical, not structural.
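A toy example makes the compensation effect concrete. Assume, purely for illustration, that the overall notch is a weighted average of numeric dimension scores; the weights and scores below are invented, not taken from the benchmark.

```python
# Two hypothetical runs of the same paper with equal dimension weights.
weights = {"quant": 0.25, "strategy": 0.25, "argument": 0.25, "dataviz": 0.25}

run_1 = {"quant": 4.3, "strategy": 4.1, "argument": 4.2, "dataviz": 4.2}
run_2 = {"quant": 4.1, "strategy": 4.3, "argument": 4.2, "dataviz": 4.2}  # two dimensions flipped

overall_1 = round(sum(weights[d] * run_1[d] for d in weights), 1)
overall_2 = round(sum(weights[d] * run_2[d] for d in weights), 1)

# The dimension-level feedback changed, but the aggregated grade did not.
print(overall_1, overall_2)  # 4.2 4.2
```

It is exactly this per-dimension movement that the 58% figure captures, even while the headline grade looks perfectly stable.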

This matters enormously for any grading system that provides dimension-level feedback β€” which is exactly what students need for evidence-based grading. If a student receives an "Accomplished-High" on Strategic Synthesis in one run and "Accomplished-Low" in another, the feedback becomes meaningless regardless of whether the overall grade stayed the same.

Compare this to MarkInMinutes: 94% average dimension stability. The Data Visualization dimension hits 100% β€” every single submission received the exact same dimension score across all three runs. The consistency isn't just at the headline level; it runs all the way down to the individual rubric criteria.

Why Dimension Stability Matters

Students don't just need a final grade β€” they need to know specifically where they excelled and where they need to improve. If the dimension-level scores fluctuate between runs, the feedback is unreliable and can't guide meaningful learning. True evidence-based grading requires stability at every level of the rubric.

Result 3: The Differentiation Problem

A grading tool isn't useful if it gives every student the same grade. We measured differentiation by counting how many distinct grade notches each model used across its 24 runs (8 submissions × 3 repetitions).

Grade Distribution by Model

Boxplot of overall grades across 24 runs per model, spanning Accomplished-Low (4.1) to Distinguished-Mid (5.2). Dots show individual runs; narrow boxes = consistent, wide boxes = volatile.

Claude used only 2 distinct notches across all 24 grading runs: Accomplished-Mid (4.2) for 87.5% of grades, and Distinguished-Mid (5.2) for the remaining 12.5%. This extreme grade compression means Claude essentially sorts submissions into two buckets β€” "good" and "great" β€” with no meaningful differentiation between them.

ChatGPT used 5 distinct notches, which sounds better β€” but this spread is largely noise rather than signal. Because ChatGPT can't reliably reproduce its own grades, the variety reflects randomness, not genuine assessment.

MarkInMinutes used 4 distinct notches with near-perfect reproducibility. It meaningfully differentiates between a submission that's Accomplished-Low (solid work with room for improvement) and one that's Distinguished-Low (exceptional in most dimensions). And critically, it makes these distinctions consistently β€” the same paper lands on the same notch every time.
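Running the same differentiation check on your own data takes only a few lines: count the distinct notches each model assigns across its runs. The grade lists below are invented purely to show the shape of the check.

```python
from collections import Counter

# Invented notch lists (one entry per grading run), for illustration only.
grades_by_model = {
    "chatgpt":       ["4.1", "5.2", "4.3", "4.2", "5.1", "4.1"],
    "claude":        ["4.2", "4.2", "4.2", "4.2", "5.2", "4.2"],
    "markinminutes": ["4.1", "4.1", "4.3", "4.3", "5.1", "5.1"],
}

for model, notches in grades_by_model.items():
    counts = Counter(notches)
    # Differentiation: how many distinct notches are actually used, and how are they spread?
    print(f"{model}: {len(counts)} distinct notches {dict(counts)}")
```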

Result 4: Why Does MarkInMinutes Take Longer?

MarkInMinutes takes ~10 minutes per submission β€” significantly longer than ChatGPT (~1.7 min) or Claude (~34 seconds). The reason is architectural: it runs a pipeline of 10 specialized agents, each handling one aspect of the evaluation. The agents evaluate, cross-check, challenge, and audit each other before producing a final grade.

Why the Time Investment Pays Off

MarkInMinutes runs a pipeline of 10 specialized agents β€” from rubric parsing and fact-checking to adversarial challenge and final audit. Each agent focuses on one aspect of the evaluation, preventing the "criterion blending" that affects single-model approaches. The time investment is what makes the consistency genuine. Learn more in our architecture deep-dive.

The speed difference reflects a fundamental design choice: single-call models (ChatGPT, Claude) are fast but produce grades that are either volatile or superficially consistent. A multi-agent pipeline is slower but produces grades that are structurally reliable at every level of the rubric.
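To illustrate the architectural difference only (this is not MarkInMinutes' actual code), a single-call grader is one prompt in and one grade out, while a pipeline threads a shared grading state through specialized stages that can challenge and audit earlier outputs. A minimal sketch of that pattern, with invented stage names and stubbed logic:

```python
# Pattern sketch only: stage names and logic are invented and do not describe
# MarkInMinutes' internals. Each "agent" reads the shared state, adds findings,
# and can flag issues for a later stage to resolve.

def first_pass_grader(state):
    # Propose a provisional score for every rubric dimension (stubbed values here).
    state["scores"] = {dim: 4.2 for dim in state["rubric"]}
    return state

def adversarial_challenger(state):
    # Look for reasons each provisional score might be too high or too low.
    state["objections"] = [f"re-check evidence for {dim!r}" for dim in state["scores"]]
    return state

def final_auditor(state):
    # Accept the grade only if every objection was resolved; otherwise flag for review.
    resolved = set(state.get("resolved", []))
    state["audited"] = all(obj in resolved for obj in state["objections"])
    return state

PIPELINE = [first_pass_grader, adversarial_challenger, final_auditor]

state = {"submission": "student_paper.pdf", "rubric": ["Quantitative Modeling", "Data Visualization"]}
for agent in PIPELINE:
    state = agent(state)

print(state["scores"], "audited:", state["audited"])
```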

Result 5: Cross-Model Agreement

We also examined how much the three models agree with each other when grading the same submissions. This helps answer: Are they even measuring the same thing?

| Model Pair | Level Agreement | Cohen's Kappa (Level) |
| --- | --- | --- |
| Claude vs MarkInMinutes | 87.5% | 0.600 (Moderate) |
| ChatGPT vs Claude | 62.5% | -0.200 (Poor) |
| ChatGPT vs MarkInMinutes | 50.0% | -0.333 (Poor) |

Claude and MarkInMinutes agree on the proficiency level 87.5% of the time β€” the highest agreement of any pair. This makes sense: both are highly consistent graders, so their median assessments naturally converge. ChatGPT's volatile grades cause it to disagree with both other models.

The fact that the two most consistent models also agree most with each other suggests they're converging on something real about the submissions' quality β€” not just producing arbitrary numbers.

The Verdict

After 72 grading runs, the data tells a clear story:

ChatGPT: Not Suitable for Grading

  • Grades the same paper differently nearly every time
  • Cohen's Kappa below zero (worse than chance)
  • Dimension-level stability at 17% (essentially random)
  • Cheap and fast, but the output can't be trusted

ChatGPT is an extraordinary tool for many tasks. Consistent, rubric-aligned grading is not one of them. If you're currently using it for grading, we strongly encourage you to run your own consistency test β€” the results will likely match what we found here.
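One way to run that test is sketched below using the OpenAI Python SDK: send the identical rubric and submission three times and compare the notches that come back. The prompt wording and the rubric/submission placeholders are assumptions to adapt to your own material; this is a sketch of the check, not a full grading setup.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = "..."       # paste your rubric and proficiency scale here
SUBMISSION = "..."   # paste (or extract) the student submission text here

prompt = (
    "Grade the following submission against this rubric. "
    "Reply with only the overall notch (e.g. 4.2).\n\n"
    f"RUBRIC:\n{RUBRIC}\n\nSUBMISSION:\n{SUBMISSION}"
)

grades = []
for _ in range(3):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    grades.append(response.choices[0].message.content.strip())

# If the three grades differ, the tool is not consistent enough for high-stakes use.
print(grades)
```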

Claude: Deceptively Consistent

  • Perfect overall grade stability masks dimension-level volatility (58%)
  • Extreme grade compression (only 2 notches used)
  • Fast and moderately expensive
  • The consistency is a statistical artifact, not structural reliability

Claude is significantly better than ChatGPT but falls short of what educators need. If you only care about a single overall grade, it performs well. But if you need reliable rubric dimension feedback β€” which students and accreditation bodies increasingly require β€” the dimension-level instability is disqualifying.

MarkInMinutes: Genuinely Reliable

  • 93.8% exact notch agreement (Cohen's Kappa: 0.914, Excellent)
  • 94% dimension-level stability β€” consistent at every level of the rubric
  • Meaningful differentiation with 4 distinct notches used consistently
  • Multi-agent architecture prevents criterion blending and produces defensible evidence trails

The consistency isn't a trick of aggregation. It's structural β€” built into the architecture through specialized agents, adversarial review, and calibrated scoring. When MarkInMinutes says a submission earns "Accomplished-High" on Data Visualization, it means exactly that, every single time.

What This Means for Educators

If you're an educator considering AI-assisted grading, this benchmark highlights three questions you should ask about any tool before trusting it with student grades.

1. Does it produce the same grade for the same paper?

Run your own consistency test. Grade the same submission three times and compare results. If the grades vary, the tool isn't reliable enough for high-stakes use. In our test, ChatGPT failed this check dramatically (31% agreement), while MarkInMinutes passed with 93.8%.

2. Is the consistency genuine or superficial?

Check dimension-level agreement, not just the overall grade. A tool that's stable at the top but volatile underneath is giving you a false sense of reliability. Claude's 100% overall stability masked 58% dimension instability β€” a critical distinction.

3. Does it meaningfully differentiate?

If every paper gets the same grade, the tool isn't actually assessing quality β€” it's producing a default response. Look for a reasonable spread of grades that's reproducible across runs. Claude compressed 87.5% of grades into a single notch.

MarkInMinutes was built specifically to pass all three tests. Our multi-agent architecture doesn't just produce a number β€” it produces an evidence-based assessment with a documented audit trail that explains exactly why each dimension received its score.


Want to see how MarkInMinutes handles your specific grading needs? Visit our page for professors to learn how our multi-agent engine can save you hours while improving grading consistency β€” or try it yourself with your own rubric and submissions.

Written by

MarkInMinutes Team

The team behind MarkInMinutes β€” building AI-powered grading tools for educators worldwide.
