Eval Blind-Spot Report · Method: human vs. LLM-as-judge · n = 1,814 verdicts

Auto-graders vs Human Experts: More disagreements than you might think

We took a public dataset of expert-human preferences and the matching GPT-4 judgments, and measured how often the AI judge reaches the same verdict a panel of human experts does. The gap is large and, more importantly, it is not random.

32.8% · Verdicts where the AI judge disagrees with the expert majority
Counting every outcome: model A wins, model B wins, or a tie.
16.0% · Verdicts that flip when you swap the answer order
Same two answers, reversed positions, different winner.
READOUT: On roughly 1 in 3 judgments the machine and the humans don’t match, and about 1 in 6 of the machine’s verdicts are an artifact of answer ordering, not quality.
01 / METHOD

What we measured

We used MT-Bench Human Judgments, an open dataset of expert pairwise preferences released alongside the “LLM-as-a-Judge” paper. Annotators were mostly graduate students grading in their own areas of expertise.

The dataset pairs answers from six models against 80 questions spanning eight task types, and ships both the human verdicts and the verdicts a GPT-4 judge gave on the same answer pairs. That overlap is what makes it useful: for each question × model-pair × turn, we took the majority human verdict and compared it to the machine’s. We matched 1,814 such verdicts.
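The matching step can be sketched in a few lines. This is a minimal illustration, not the code behind the report: the record layout, the verdict labels (`model_a`, `model_b`, `tie`), and the rule for split votes (drop the item when the top two verdicts are level) are all assumptions, and the rows are made up rather than taken from the dataset.

```python
from collections import Counter, defaultdict

# Hypothetical toy records, NOT rows from the dataset:
# (question_id, model_a, model_b, turn, verdict)
human_votes = [
    (101, "m1", "m2", 1, "model_a"),
    (101, "m1", "m2", 1, "model_a"),
    (101, "m1", "m2", 1, "tie"),
    (102, "m1", "m3", 2, "model_b"),
    (102, "m1", "m3", 2, "tie"),
]
judge_verdicts = {
    (101, "m1", "m2", 1): "model_a",
    (102, "m1", "m3", 2): "tie",
}

def majority_verdicts(votes):
    """Return the most common human verdict per (question, pair, turn) key."""
    by_key = defaultdict(Counter)
    for qid, model_a, model_b, turn, verdict in votes:
        by_key[(qid, model_a, model_b, turn)][verdict] += 1
    result = {}
    for key, counts in by_key.items():
        ranked = counts.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            continue  # assumption: no clear majority, so the item is skipped
        result[key] = ranked[0][0]
    return result

human_majority = majority_verdicts(human_votes)

# Keep only the items that have both a human majority and a judge verdict.
matched = {
    key: (human_majority[key], judge_verdicts[key])
    for key in human_majority
    if key in judge_verdicts
}
```

How split votes are handled changes the size of the matched set, so it is worth stating the rule alongside any agreement number.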

3,355 expert human judgments
2,400 GPT-4 judge judgments
8 task categories, 6 models
02 / HEADLINE

The number depends on a choice the leaderboard hides

You can make the AI judge look reliable or unreliable purely by how you treat ties.

Restrict the comparison to the narrow slice where both the humans and the machine commit to a clear winner, and agreement is 88.4%, in line with the reassuring 80%+ figure usually quoted. But that throws away every case where one side said “tie,” which is exactly where judgment is hardest. Count all outcomes and the machine matches the expert majority only 67.2% of the time.
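The two ways of counting differ only in which rows enter the denominator. A minimal sketch, assuming each matched item is a `(human_majority, judge)` pair with the labels `model_a`, `model_b`, and `tie`; the pairs below are made up for illustration and do not reproduce the report's figures.

```python
# Hypothetical (human_majority, judge) pairs, made up for illustration.
pairs = [
    ("model_a", "model_a"),
    ("model_a", "model_b"),
    ("model_b", "model_b"),
    ("model_a", "tie"),
    ("tie", "model_b"),
    ("tie", "tie"),
]

def agreement_all(pairs):
    """Agreement counting every outcome, ties included."""
    return sum(human == judge for human, judge in pairs) / len(pairs)

def agreement_no_ties(pairs):
    """Agreement restricted to pairs where both sides picked a winner."""
    decided = [(h, j) for h, j in pairs if h != "tie" and j != "tie"]
    return sum(h == j for h, j in decided) / len(decided)

print(f"all outcomes:  {agreement_all(pairs):.1%}")
print(f"ties excluded: {agreement_no_ties(pairs):.1%}")
```

On this toy data the tie-free figure is higher simply because half the rows were dropped before counting, which is the same mechanism the paragraph above describes.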

The headline “our LLM judge agrees with humans 80%+ of the time” is technically true and practically misleading. It is measured on the easy half of the data.

That alone is the pitch for a human-in-the-loop check: the metric your team trusts is reported on the subset of cases where grading was never in doubt.

03 / WHERE IT BREAKS

Disagreement is worst exactly where it costs you

Disagreement isn’t spread evenly. It concentrates in the categories where quality is subjective or requires actually checking the work (writing and code), not in the tidy factual categories.

coding: 40.2%
writing: 38.6%
roleplay: 32.9%
reasoning: 31.7%
math: 30.7%
extraction: 30.7%
humanities: 30%
stem: 28.1%

Share of verdicts (per category) where the GPT-4 judge disagreed with the expert-human majority. Coding and writing (the work most teams actually ship) top the list.
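A per-category breakdown like the one above is a grouped version of the same comparison. The sketch below assumes each matched verdict carries a category label; the rows are invented and the resulting rates are not the report's.

```python
from collections import defaultdict

# Hypothetical (category, human_majority, judge) rows, made up for illustration.
rows = [
    ("coding", "model_a", "tie"),
    ("coding", "model_b", "model_b"),
    ("writing", "model_a", "model_b"),
    ("writing", "tie", "tie"),
    ("writing", "model_b", "tie"),
    ("stem", "model_a", "model_a"),
]

def disagreement_by_category(rows):
    """Share of verdicts per category where the judge differs from the humans."""
    totals = defaultdict(int)
    misses = defaultdict(int)
    for category, human, judge in rows:
        totals[category] += 1
        if human != judge:
            misses[category] += 1
    rates = {category: misses[category] / totals[category] for category in totals}
    # Sort worst-first so the riskiest categories lead the list.
    return dict(sorted(rates.items(), key=lambda item: item[1], reverse=True))

rates = disagreement_by_category(rows)
```

With small per-category counts the rates are noisy, so it helps to report the number of verdicts behind each percentage.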

04 / FAILURE MODES

Three repeatable ways the judge gets it wrong

We isolated the 76 hardest cases: where human experts were unanimous and the machine still disagreed. They fall into clear, repeatable patterns.

  • 71% · It hides behind “tie.” In 71% of unanimous-human disagreements, the judge declared a tie rather than commit to the answer humans clearly preferred. Auto-graders launder uncertainty into a non-verdict.
  • 58% · It’s unstable to answer order. 58% of those hard cases changed verdict when the two answers were swapped: position bias, not quality assessment. (16% of all verdicts are order-dependent.)
  • 62% · It rewards length. When the judge picked a different winner than humans, 62% of the time it chose the longer answer, preferring volume over the response humans judged better.
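The three shares above can be computed from a small set of fields per case. This is a sketch under stated assumptions, not the report's code: the field names are hypothetical, the swapped-order verdict is assumed to be labeled by position (so it must be mapped back before comparing), length is assumed to be a simple character or token count, and every value is made up.

```python
# Hypothetical cases where humans were unanimous and the judge disagreed.
# "judge_swapped" is the verdict from a second run with the answers reversed,
# labeled by position in that run. All values are made up for illustration.
cases = [
    {"human": "model_b", "judge": "tie", "judge_swapped": "model_a", "len_a": 40, "len_b": 220},
    {"human": "model_a", "judge": "tie", "judge_swapped": "tie", "len_a": 180, "len_b": 60},
    {"human": "model_b", "judge": "model_a", "judge_swapped": "model_b", "len_a": 300, "len_b": 120},
    {"human": "model_a", "judge": "model_b", "judge_swapped": "model_b", "len_a": 150, "len_b": 90},
    {"human": "model_a", "judge": "model_b", "judge_swapped": "model_a", "len_a": 80, "len_b": 400},
]

def unswap(verdict):
    """Map a verdict from the swapped run back to the original labeling."""
    return {"model_a": "model_b", "model_b": "model_a"}.get(verdict, verdict)

def picked_longer(case):
    """True if the answer the judge chose is the longer of the two."""
    if case["judge"] == "model_a":
        return case["len_a"] > case["len_b"]
    return case["len_b"] > case["len_a"]

def share(flags):
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0

def failure_mode_shares(cases):
    tie_share = share(c["judge"] == "tie" for c in cases)
    flip_share = share(c["judge"] != unswap(c["judge_swapped"]) for c in cases)
    # Length bias only applies where both sides named a winner and they differ.
    wrong_winner = [c for c in cases if c["judge"] != "tie" and c["human"] != "tie"]
    longer_share = share(picked_longer(c) for c in wrong_winner)
    return tie_share, flip_share, longer_share

tie_share, flip_share, longer_share = failure_mode_shares(cases)
```

Note that the length share uses a smaller denominator than the other two, since ties have no chosen answer to measure.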
05 / EXHIBITS

What that looks like on real answers

Four cases where every human grader agreed and the machine did not. Responses are excerpted; verdicts are verbatim from the dataset.

Exhibit A · Coding · Q121 · Tie vs. a non-answer
Prompt: Write a Python program that reads all text files in a directory and returns the top-5 most frequent words.
Answer A: llama-13b
“Here is a Python program that reads all the text files under a directory and returns top-5 words…” and then stops. No code follows.
Answer B: vicuna-13b
A complete, runnable program: imports os and Counter, defines count_words() and get_top_words(), iterates the .txt files and returns most_common(5).
Expert humans → Answer B
GPT-4 judge → Tie · order-inconsistent
What the human caught: A working program scored even against a sentence that promises code and delivers none.

Exhibit B · Writing · Q82 · Tie vs. a refusal
Prompt: Take a moment to evaluate and critique your own previous response.
Answer A: claude-v1
Delivers a specific self-critique: the email was concise but abrupt, could have opened with context, could have named particular charts to get feedback on.
Answer B: gpt-3.5-turbo
“As an AI language model, I do not have the ability to evaluate or critique my own response…” It declines the task, then gives generic advice.
Expert humans → Answer A
GPT-4 judge → Tie · order-inconsistent
What the human caught: A genuine answer tied against a refusal to do the task at all.

Exhibit C · Math · Q111 · Tie vs. correctness
Prompt: What is the area of the circle circumscribing the triangle?
Answer A: claude-v1
Works the problem: finds the circumscribed circle’s center via perpendicular bisectors, solves for the radius, then the area.
Answer B: llama-13b
“The area of a circle is πr², where r is the radius.” It states the formula, computes nothing.
Expert humans → Answer A
GPT-4 judge → Tie
What the human caught: A worked solution and a bare formula with no answer were scored as equally good.

Exhibit D · Writing · Q83 · Wrong winner · length bias
Prompt: Outline a blog post comparing two popular smartphone models, with key points and subheadings.
Answer A: gpt-4
A long, deeply nested outline: many sections, each split into Model 1 / Model 2 sub-bullets.
Answer B: claude-v1
A tight ~160-word outline with a clear thesis and focused comparison sections.
Expert humans → Answer B (the concise one)
GPT-4 judge → Answer A (the longer one)
What the human caught: Humans preferred the tighter outline; the judge preferred the one with more words.
06 / IMPLICATION

What this means for a team shipping models

None of this says LLM-as-judge is useless. It says it is confidently wrong in predictable places, and those places are the subjective and correctness-sensitive work most products ship.

If your release gate is an auto-grader, a regression in writing quality, a refusal slipping through, or a worked answer replaced by a plausible non-answer can pass review while the score holds steady. The judge will often call it a tie, prefer the longer output, or flip on answer order, and your dashboard stays green.

The fix isn’t to grade everything by hand. It’s to run a calibrated human panel on the slice that matters: the categories above, the borderline cases, and the verdicts your auto-grader marks “tie.” Use it to keep your automated judge honest.
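One way to pick that slice is a simple routing rule over auto-graded items. The sketch below is illustrative only: the field names are hypothetical, the high-risk set is assumed to be the two categories with the highest disagreement in this report, and order-inconsistency is used as a stand-in for "borderline," which a team may prefer to define differently.

```python
# Assumption: the two categories with the highest disagreement in this report.
HIGH_RISK_CATEGORIES = {"coding", "writing"}

def review_reasons(item):
    """Return the reasons an auto-graded item should go to the human panel."""
    reasons = []
    if item["category"] in HIGH_RISK_CATEGORIES:
        reasons.append("high-risk category")
    if item["judge"] == "tie":
        reasons.append("judge tie")
    if not item["order_consistent"]:
        # Assumption: a verdict that flips on answer order counts as borderline.
        reasons.append("order-inconsistent")
    return reasons

# Hypothetical items, made up for illustration.
items = [
    {"id": 1, "category": "coding", "judge": "tie", "order_consistent": False},
    {"id": 2, "category": "stem", "judge": "model_a", "order_consistent": True},
    {"id": 3, "category": "math", "judge": "tie", "order_consistent": True},
]

human_queue = [item["id"] for item in items if review_reasons(item)]
```

Returning the reasons rather than a bare yes/no makes it easier to see later which rule is sending the most work to the panel.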

The offer

We’ll run this on your product.

This report was built on a public dataset. The same method, pointed at your actual model outputs, tells you where your auto-grader is blind. We sample your outputs, grade them with vetted human experts, and hand you a short findings report with the disagreement rate and the specific cases it missed. No data access needed to start. We can run it on your public demo or API.

Request an eval audit