What we measured
We used MT-Bench Human Judgments, an open dataset of expert pairwise preferences released alongside the “LLM-as-a-Judge” paper. Annotators were mostly graduate students grading in their own areas of expertise.
The dataset covers answers from six models to 80 questions spanning eight task types, and ships both the human verdicts and the verdicts a GPT-4 judge gave on the same answer pairs. That overlap is what makes it useful: for each combination of question, model pair, and turn, we took the majority human verdict and compared it to the machine’s. We matched 1,814 such verdicts.
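The matching step can be sketched roughly as follows. This is a minimal illustration, not the pipeline used for the numbers here: the record layout, the model names, and the rule for split votes (drop the item) are all assumptions, and the toy votes are invented.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: (question_id, model_a, model_b, turn, winner).
# The layout and values are illustrative, not the dataset's actual schema.
human_votes = [
    (81, "model_x", "model_y", 1, "model_a"),
    (81, "model_x", "model_y", 1, "model_a"),
    (81, "model_x", "model_y", 1, "tie"),
    (82, "model_x", "model_y", 1, "model_b"),
]
judge_votes = {
    (81, "model_x", "model_y", 1): "tie",
    (82, "model_x", "model_y", 1): "model_b",
}


def majority_verdicts(votes):
    """Group votes by (question, pair, turn) and keep the most common verdict.

    Items without a unique most-common verdict are dropped. That is one
    possible convention, assumed here for the sketch.
    """
    grouped = defaultdict(list)
    for qid, a, b, turn, winner in votes:
        grouped[(qid, a, b, turn)].append(winner)

    result = {}
    for key, winners in grouped.items():
        counts = Counter(winners).most_common()
        top_label, top_count = counts[0]
        # Skip items where the top two verdicts received the same vote count.
        if len(counts) > 1 and counts[1][1] == top_count:
            continue
        result[key] = top_label
    return result


human_majority = majority_verdicts(human_votes)

# Keep only items that have both a human majority and a judge verdict.
matched = {
    key: (human_majority[key], judge_votes[key])
    for key in human_majority
    if key in judge_votes
}
```

Each entry in `matched` is one (human, judge) verdict pair, which is the unit every percentage below is counted over.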
The number depends on a choice the leaderboard hides
You can make the AI judge look reliable or unreliable purely by how you treat ties.
Restrict the comparison to the narrow slice where both the humans and the machine commit to a clear winner, and agreement is 88.4%, in line with the reassuring 80%+ figure usually quoted. But that throws away every case where one side said “tie,” which is exactly where judgment is hardest. Count all outcomes, ties included, and the machine matches the expert majority only 67.2% of the time.
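The two ways of counting differ only in the denominator. A minimal sketch, assuming verdicts are encoded as "A", "B", or "tie" (an encoding chosen here for illustration) and using invented toy pairs rather than the dataset:

```python
def agreement_rates(pairs):
    """Compute agreement two ways from (human_verdict, judge_verdict) pairs.

    "all_outcomes" counts every pair, ties included.
    "both_decisive" keeps only pairs where neither side said "tie".
    """
    if not pairs:
        raise ValueError("need at least one verdict pair")

    all_match = sum(h == j for h, j in pairs)
    decisive = [(h, j) for h, j in pairs if h != "tie" and j != "tie"]
    decisive_match = sum(h == j for h, j in decisive)

    return {
        "all_outcomes": all_match / len(pairs),
        "both_decisive": (
            decisive_match / len(decisive) if decisive else float("nan")
        ),
    }


# Invented toy data: three decisive pairs, three pairs involving a tie.
toy_pairs = [
    ("A", "A"),
    ("B", "B"),
    ("A", "B"),
    ("A", "tie"),
    ("tie", "A"),
    ("B", "tie"),
]
rates = agreement_rates(toy_pairs)
```

On the toy pairs, the same judge scores two out of three when ties are filtered out and two out of six when they are counted, which is the shape of the gap described above.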
The headline “our LLM judge agrees with humans 80%+ of the time” is technically true and practically misleading. It is measured on the easy slice of the data.
That alone is the pitch for a human-in-the-loop check: the metric your team trusts is reported on the subset of cases where grading was never in doubt.
Disagreement is worst exactly where it costs you
Disagreement isn’t spread evenly. It concentrates in the categories where quality is subjective or requires actually checking the work (writing and code), not in the tidy factual categories.
Figure: Share of verdicts, per category, where the GPT-4 judge disagreed with the expert-human majority. Coding and writing (the work most teams actually ship) top the list.
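A per-category breakdown like this can be computed with a simple group-by. The sketch below assumes each matched verdict carries a category label; the rows are invented for illustration and do not reproduce the dataset's figures.

```python
from collections import defaultdict


def disagreement_by_category(rows):
    """Return {category: share of verdicts where judge != human majority}.

    rows: iterable of (category, human_majority, judge_verdict).
    The result is sorted from highest to lowest disagreement.
    """
    totals = defaultdict(int)
    misses = defaultdict(int)
    for category, human, judge in rows:
        totals[category] += 1
        if human != judge:
            misses[category] += 1

    shares = {category: misses[category] / totals[category] for category in totals}
    return dict(sorted(shares.items(), key=lambda kv: kv[1], reverse=True))


# Invented toy rows, not dataset values.
toy_rows = [
    ("coding", "A", "tie"),
    ("coding", "A", "B"),
    ("coding", "B", "B"),
    ("writing", "A", "tie"),
    ("writing", "B", "B"),
    ("math", "A", "A"),
    ("math", "B", "B"),
]
by_category = disagreement_by_category(toy_rows)
```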
Three repeatable ways the judge gets it wrong
We isolated the 76 hardest cases, those where human experts were unanimous and the machine still disagreed. They fall into clear, repeatable patterns.
- 71%: It hides behind “tie.” In 71% of unanimous-human disagreements, the judge declared a tie rather than commit to the answer humans clearly preferred. Auto-graders launder uncertainty into a non-verdict.
- 58%: It’s unstable to answer order. 58% of those hard cases changed verdict when the two answers were swapped: position bias, not quality assessment. (16% of all verdicts are order-dependent.)
- 62%: It rewards length. When the judge picked a different winner than humans, 62% of the time it chose the longer answer, preferring volume over the response humans judged better.
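The three patterns above can each be expressed as a share over the disagreement cases. The sketch below is one way to compute them, under stated assumptions: the `Case` fields are hypothetical, the swapped-order verdict is assumed to be already mapped back to the original "A"/"B" labels, answer length is measured in characters, and the toy cases are invented. Note that the length share uses a different denominator (only disagreements where the judge named a winner).

```python
from dataclasses import dataclass


@dataclass
class Case:
    human: str          # unanimous human verdict: "A" or "B"
    judge: str          # judge verdict, original order: "A", "B", or "tie"
    judge_swapped: str  # judge verdict with answers swapped, relabeled to match
    len_a: int          # length of answer A (characters, an assumption)
    len_b: int          # length of answer B


def longer_answer(case):
    """Return "A" or "B" for the longer answer, or None if equal length."""
    if case.len_a == case.len_b:
        return None
    return "A" if case.len_a > case.len_b else "B"


def failure_profile(cases):
    """Shares of tie-hiding, order flips, and longer-answer picks."""
    wrong = [c for c in cases if c.judge != c.human]
    if not wrong:
        return None

    # Disagreements where the judge still named a winner.
    decisive_wrong = [c for c in wrong if c.judge != "tie"]
    picked_longer = sum(c.judge == longer_answer(c) for c in decisive_wrong)

    return {
        "tie_share": sum(c.judge == "tie" for c in wrong) / len(wrong),
        "flip_share": sum(c.judge != c.judge_swapped for c in wrong) / len(wrong),
        "longer_share": (
            picked_longer / len(decisive_wrong) if decisive_wrong else float("nan")
        ),
    }


# Invented toy cases, not dataset values.
toy_cases = [
    Case("A", "tie", "tie", 100, 200),  # hides behind a tie, stable
    Case("A", "tie", "A", 100, 200),    # tie that flips when swapped
    Case("A", "B", "A", 100, 300),      # picks the longer answer, flips
    Case("B", "A", "A", 50, 400),       # picks the shorter answer, stable
    Case("A", "A", "A", 120, 110),      # agrees with humans, excluded
]
profile = failure_profile(toy_cases)
```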
What that looks like on real answers
Four cases where every human grader agreed and the machine did not. Responses are excerpted; verdicts are verbatim from the dataset.
What this means for a team shipping models
None of this says LLM-as-judge is useless. It says it is confidently wrong in predictable places, and those places are the subjective and correctness-sensitive work most products ship.
If your release gate is an auto-grader, a regression in writing quality, a refusal slipping through, or a worked answer replaced by a plausible non-answer can pass review while the score holds steady. The judge will often call it a tie, prefer the longer output, or flip on answer order, and your dashboard stays green.
The fix isn’t to grade everything by hand. It’s to run a calibrated human panel on the slice that matters: the categories above, the borderline cases, and the verdicts your auto-grader marks “tie.” Use it to keep your automated judge honest.
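As a rough illustration of that routing, the sketch below flags an item for human review when it falls in a high-disagreement category, when the auto-grader returns a tie, or when the verdict changes with answer order. The category set, the function name, and the use of order-dependence as a stand-in for "borderline" are all assumptions made for this example, not a prescribed policy.

```python
# Categories assumed to need review, based on the writing and coding
# findings above. Adjust to your own per-category disagreement data.
REVIEW_CATEGORIES = {"writing", "coding"}


def needs_human_review(category, judge_verdict, judge_verdict_swapped):
    """Decide whether one auto-graded comparison should go to the human panel.

    judge_verdict_swapped is the verdict obtained with the two answers in
    reversed order, relabeled so it is comparable to judge_verdict.
    """
    if category in REVIEW_CATEGORIES:
        return True
    if judge_verdict == "tie":
        return True
    # Treat an order-dependent verdict as borderline (an assumption).
    return judge_verdict != judge_verdict_swapped


# Invented toy queue: (category, verdict, swapped-order verdict).
queue = [
    ("math", "A", "A"),
    ("math", "tie", "tie"),
    ("stem", "A", "B"),
    ("writing", "B", "B"),
]
flagged = [item for item in queue if needs_human_review(*item)]
```

Everything outside `flagged` stays with the automated judge, so the human panel's time goes to the cases where the auto-grader is least trustworthy.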