BonVivant Blog

The LLM Judge That Says "Check This," Not "This Is Wrong"

How we use an LLM-as-Judge as a publish gate for hand-written pages — and why its factual-accuracy score is a verification queue, not a verdict.

June 24, 2026 · 8 min read

Two different jobs for the same idea

We've written before about the LLM-as-Judge feedback loop inside our venue enrichment pipeline — a Haiku judge that scores machine-generated editorials and feeds rejections back to the writer. That judge's job is throughput: turn an 85% reject rate into a self-correcting loop across thousands of venues.

This post is about a second, quieter judge with a different job. When we started writing attraction pages by hand — long-form, opinionated guides like "Free Things to Do in Balboa Park" and "Is the San Diego Zoo Worth It?" — we pointed an LLM judge at the human prose before it could ship. Not to generate, not to loop. To gate.

The interesting part wasn't the gate. It was learning to read its scores correctly — because the most common failure it reports is one you should not act on the way the number implies.

The setup

Every editorial paragraph on an activity page runs through scripts/judge-activity.ts, which scores one section against the standard in specs/voice-profile-activities.md. The judge (pinned to our EDITORIAL_MODEL, Sonnet) returns five things:

Dimension	Threshold	Below threshold means
`voice_match`	≥ 0.80	Rewrite
`factual_accuracy`	≥ 0.85	Kick to a human
`differentiator_quality`	≥ 0.60	Flag for editor review
`banned_phrase_check`	pass	Auto-rewrite
`length_compliance`	pass	Truncate or expand

The exit code is 0 only if every dimension passes. Drop it into a loop over a page's sections and you have a build-time quality bar.
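
A minimal sketch of that pass/fail logic, under stated assumptions: the thresholds come from the table above and the field names follow its dimensions, but the `JudgeResult` shape, the `failedDimensions` and `exitCodeFor` helpers, and the layout of `length_compliance` are hypothetical, not the actual contents of scripts/judge-activity.ts.

```typescript
// Hypothetical result shape for one judged section (assumed for illustration).
interface JudgeResult {
  section: string;
  voice_match: number;
  factual_accuracy: number;
  differentiator_quality: number;
  banned_phrase_check: { pass: boolean; offending_phrases: string[] };
  length_compliance: { pass: boolean };
}

// Thresholds copied from the dimension table.
const THRESHOLDS = {
  voice_match: 0.8,
  factual_accuracy: 0.85,
  differentiator_quality: 0.6,
} as const;

// Names of every dimension that fails for one section.
function failedDimensions(r: JudgeResult): string[] {
  const failed: string[] = [];
  if (r.voice_match < THRESHOLDS.voice_match) failed.push("voice_match");
  if (r.factual_accuracy < THRESHOLDS.factual_accuracy) failed.push("factual_accuracy");
  if (r.differentiator_quality < THRESHOLDS.differentiator_quality) failed.push("differentiator_quality");
  if (!r.banned_phrase_check.pass) failed.push("banned_phrase_check");
  if (!r.length_compliance.pass) failed.push("length_compliance");
  return failed;
}

// 0 only if every dimension of every section passes, 1 otherwise.
function exitCodeFor(results: JudgeResult[]): 0 | 1 {
  return results.every((r) => failedDimensions(r).length === 0) ? 0 : 1;
}

// Two sample sections with illustrative scores.
const differentiator: JudgeResult = {
  section: "Differentiator",
  voice_match: 0.91,
  factual_accuracy: 0.92,
  differentiator_quality: 0.93,
  banned_phrase_check: { pass: true, offending_phrases: [] },
  length_compliance: { pass: true },
};
const bestTime: JudgeResult = {
  section: "Best-time",
  voice_match: 0.88,
  factual_accuracy: 0.82, // below 0.85, so the gate blocks
  differentiator_quality: 0.87,
  banned_phrase_check: { pass: true, offending_phrases: [] },
  length_compliance: { pass: true },
};
```

Here `exitCodeFor([differentiator])` is 0 and `exitCodeFor([differentiator, bestTime])` is 1. A real wrapper would hand that value to the process exit so the build stops on any failing section.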

Crucially, factual_accuracy here does not mean what it sounds like. The prompt is explicit with the model: you do not have a database of opening hours, ticket prices, or concert times, and you must not pretend you do. The dimension measures internal consistency and the absence of obvious impossibilities — and it flags confident operational/regulatory/schedule claims for a human to verify. A vivid, plausible, specific detail scores 1.0 even though the judge can't confirm it. The 0.85 threshold isn't "the judge thinks you're wrong." It's "the judge saw something worth a second look."

We did not fully internalize that until we watched it score real pages.

What the scores actually looked like

Here are the final result tables for both pages (after revisions; the Zoo's best-time section is shown before and after its rewrite):

Balboa Park — "Free Things to Do"

Section	voice_match	factual_accuracy	differentiator_quality	Gate
Differentiator	0.91	0.92	0.93	✅
Ranked experiences	0.82	0.88	0.80	✅
Practical	0.82	0.88	0.72	✅
Best-time	0.88	0.82	0.87	⚠️ factual 0.82

San Diego Zoo — "Is It Worth It?"

Section	voice_match	factual_accuracy	differentiator_quality	Gate
Differentiator	0.87	0.82	0.88	⚠️ factual 0.82
Ranked experiences	0.88	0.90	0.85	✅
Practical	0.87	0.82	0.81	⚠️ factual 0.82
Best-time (before)	0.78	0.88	0.72	❌
Best-time (after)	0.88	0.92	0.85	✅

Three things in that table taught us how to use the judge.

1. The factual flag is a queue, not a verdict

Three sections in the final tables sit at factual_accuracy 0.82, just under the 0.85 threshold. Every one of them was flagged for the same reason: a confident schedule or price claim. Balboa's best-time names the organ concert ("2pm every Sunday") and "Resident Free Tuesdays." The Zoo's practical names the "$16" parking fee. The judge can't verify any of those, so by design it drops the score below 0.85 and says send a human.

So we sent a human. And here's the punchline: we tried hedging the claim to make the score go up, and it didn't move. We softened Balboa's "2pm every Sunday" to "on a regular weekly schedule" and re-judged. Still 0.82. The flag isn't attached to the precision of the claim — it's attached to the category (a schedule exists in this paragraph at all). Hedging just makes the page worse while leaving the score exactly where it was.

The correct response to a factual flag is not to edit the prose. It's to verify the fact and record the verification — then let the score stand, documented. We confirmed the organ time against the Spreckels Organ Pavilion record, the Resident Free Tuesday program against the Japanese Friendship Garden's live ticketing page, and the $16 Zoo parking fee against the Zoo's own parking page. (More on how we verified in a moment — it's a story of its own.) Each became a VERIFIED line in the page's historicalNotes. The 0.82 stays in the table with an asterisk that means "human-cleared," not "broken."

This is the whole lesson: treat factual_accuracy as a verification worklist the judge generates for you. It is very good at finding every claim that needs a primary source. It is — deliberately — useless at telling you whether the claim is true. Confusing the two leads you to either ship unverified facts (if you trust the high scores) or sand the specificity out of your prose (if you try to satisfy the low ones). Both are worse than just doing the verification.
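
One way to make "worklist, not verdict" concrete is to turn every below-threshold factual score into verification tasks instead of rewrite tasks. The sketch below assumes the judge's output exposes the claims it flagged as a `flagged_claims` array; that field, the `WorkItem` type, and `buildVerificationQueue` are hypothetical names for illustration.

```typescript
// Hypothetical per-section output (flagged_claims is an assumed field).
interface SectionScore {
  page: string;
  section: string;
  factual_accuracy: number;
  flagged_claims: string[];
}

// One unit of human verification work.
interface WorkItem {
  page: string;
  section: string;
  claim: string;
  status: "UNVERIFIED" | "VERIFIED";
}

const FACTUAL_THRESHOLD = 0.85;

// Every claim in a below-threshold section becomes a task for a human.
// Nothing here edits the prose or tries to raise the score.
function buildVerificationQueue(scores: SectionScore[]): WorkItem[] {
  const queue: WorkItem[] = [];
  for (const s of scores) {
    if (s.factual_accuracy >= FACTUAL_THRESHOLD) continue;
    for (const claim of s.flagged_claims) {
      queue.push({ page: s.page, section: s.section, claim, status: "UNVERIFIED" });
    }
  }
  return queue;
}

// Sample input using the claims named in this post.
const scores: SectionScore[] = [
  { page: "balboa-park", section: "Differentiator", factual_accuracy: 0.92, flagged_claims: [] },
  {
    page: "balboa-park",
    section: "Best-time",
    factual_accuracy: 0.82,
    flagged_claims: ["Organ concert at 2pm every Sunday", "Resident Free Tuesdays"],
  },
  { page: "san-diego-zoo", section: "Practical", factual_accuracy: 0.82, flagged_claims: ["$16 parking fee"] },
];

const queue = buildVerificationQueue(scores);
```

With this input the queue holds three items, all UNVERIFIED, and the 0.82 scores are left exactly as the judge reported them.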

2. The score it caught for real was a voice miss

Exactly one section failed on something a rewrite should fix: the Zoo's best-time block, at voice_match 0.78. The judge's rationale was specific and correct:

"...reads as a listicle construction embedded in prose — it enumerates features rather than delivering a taste-signal... The paragraph also lacks a concrete sensory anchor."

The offending line was "you'll get cool air, active animals, and short lines." A list wearing a sentence's clothes. We rewrote it to lead with the judgment and land a sensory beat:

"A weekday morning in spring or fall is the version of this place worth paying for. The canyons still hold the night's cool then, the cats and bears are up and working the fence line before the heat flattens them into shade naps, and you walk onto the bus without a line..."

Re-judged: voice 0.78 → 0.88. That is the judge doing the job you actually want from a publish gate — catching the one paragraph that read like filler and telling you, concretely, why.

The contrast with the factual flags is the point. voice_match, differentiator_quality, banned_phrase_check, and length_compliance are verdicts — when they fail, you fix the prose. factual_accuracy is a queue — when it "fails," you verify the world. Same rubric, two completely different response protocols.
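
The two protocols can be written down as a small routing table. This is a sketch under assumptions: the dimension names are the ones in the rubric, while the `Protocol` labels and the `protocolFor` and `planResponse` helpers are hypothetical.

```typescript
type Dimension =
  | "voice_match"
  | "factual_accuracy"
  | "differentiator_quality"
  | "banned_phrase_check"
  | "length_compliance";

// Hypothetical labels for the two response protocols.
type Protocol = "fix_prose" | "verify_fact";

// factual_accuracy is the only queue; every other dimension is a verdict.
function protocolFor(dimension: Dimension): Protocol {
  return dimension === "factual_accuracy" ? "verify_fact" : "fix_prose";
}

// Split a section's failed dimensions into the two kinds of follow-up work.
function planResponse(failed: Dimension[]): { fix_prose: Dimension[]; verify_fact: Dimension[] } {
  const plan: { fix_prose: Dimension[]; verify_fact: Dimension[] } = { fix_prose: [], verify_fact: [] };
  for (const d of failed) {
    plan[protocolFor(d)].push(d);
  }
  return plan;
}
```

A section that fails both voice_match and factual_accuracy gets one rewrite task and one verification task, and the rewrite is never used to chase the factual score.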

3. The judge is occasionally, visibly flaky

One best-time run came back with banned_phrase_check.pass = false and an empty offending_phrases list. There is no banned phrase in the text: the banned list contains "you'll love," and the prose said "you'll get." The judge contradicted itself by setting the boolean to fail while listing zero offenders. Re-running returned pass = true.

We logged it rather than chase it. The lesson for a gate built on a probabilistic model: a single failing run is a signal, not a sentence. Our exit-code wrapper makes any single fail block the build, which is right for safety — but a human reading the output needs to recognize a self-contradicting result (fail + empty offender list) as noise and re-roll, not start deleting words. We're considering a cheap mitigation: best-of-N on the cheap dimensions (banned-phrase, length) where the right answer is deterministic and the model occasionally fumbles it.
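
A sketch of what that mitigation might look like, not something we have shipped. The `pass` and `offending_phrases` fields match the judge output described above; `isSelfContradicting`, `bestOfN`, and the plain-string cross-check `findBannedPhrases` are hypothetical helpers.

```typescript
interface BannedPhraseResult {
  pass: boolean;
  offending_phrases: string[];
}

// A fail that names no offending phrase contradicts itself: treat it as noise.
function isSelfContradicting(r: BannedPhraseResult): boolean {
  return !r.pass && r.offending_phrases.length === 0;
}

// Majority vote over N runs, ignoring self-contradicting ones.
function bestOfN(runs: BannedPhraseResult[]): "pass" | "fail" | "inconclusive" {
  const usable = runs.filter((r) => !isSelfContradicting(r));
  const passes = usable.filter((r) => r.pass).length;
  const fails = usable.length - passes;
  if (passes === fails) return "inconclusive"; // includes the case of no usable runs
  return passes > fails ? "pass" : "fail";
}

// Deterministic cross-check: case-insensitive substring match against the list.
function findBannedPhrases(text: string, banned: string[]): string[] {
  const lower = text.toLowerCase();
  return banned.filter((phrase) => lower.indexOf(phrase.toLowerCase()) !== -1);
}
```

For the run described above, `findBannedPhrases("you'll get cool air, active animals, and short lines", ["you'll love"])` returns an empty array, which is the cheap confirmation that the failing boolean was the model fumbling rather than the prose.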

The part where the internet was down

A footnote that turned into a methodology. During this build, our agent's web-research tools (WebSearch / WebFetch) were failing — the summarizer they route through was unavailable. So we couldn't verify the factual queue the normal way.

The workaround held a lesson worth keeping. WebFetch's fetch layer was fine: it resolved redirects and returned real HTTP statuses, and only the summarize-the-page step was broken. So we dropped to curl, pulled the raw HTML, and parsed it ourselves. That worked perfectly for server-rendered sources (Wikipedia, the organ society, the Zoo's and garden's own pages) and gave us the primary-source confirmations the judge had queued: organ at 2pm Sundays, Japanese Friendship Garden admission $16, Zoo parking $16/day (with an $8 San Diego-resident rate the brief had missed), and October "Kids Free" for ages 11 and under.

It failed on client-rendered SPA pages, where curl returns an empty shell. That is exactly why a few residuals (a panda timed-entry policy behind a JS widget, one garden's exact hours) remain flagged as publish blockers in the page's notes rather than asserted in the prose. This is the system working as intended: an unverified operational claim never reaches the reader as a fact. The prose hedges it ("check the current rules the morning you go") until a human clears it.
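
Telling a server-rendered page from an empty SPA shell can be approximated from the raw HTML alone. The heuristic below is an illustration, not the check we actually ran: `visibleText`, `looksLikeEmptyShell`, and the 200-character cutoff are all assumptions, and a regex pass like this is a rough filter, not an HTML parser.

```typescript
// Rough visible-text extraction: drop script/style/noscript blocks, then all tags.
function visibleText(html: string): string {
  return html
    .replace(/<(script|style|noscript)\b[^>]*>[\s\S]*?<\/\1>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Treat a page as an empty shell when almost no text survives.
// The 200-character default is an arbitrary, assumed cutoff.
function looksLikeEmptyShell(html: string, minChars: number = 200): boolean {
  return visibleText(html).length < minChars;
}

// A typical client-rendered shell: a mount point and a script bundle.
const spaShell =
  '<!doctype html><html><head><title>App</title><script src="/bundle.js"></script></head>' +
  '<body><div id="root"></div><noscript>Enable JavaScript</noscript></body></html>';

// A server-rendered page: the text is already in the markup (filler text here).
const serverRendered =
  "<html><body><p>" + new Array(11).join("Server-rendered paragraph text. ") + "</p></body></html>";
```

`looksLikeEmptyShell(spaShell)` is true and `looksLikeEmptyShell(serverRendered)` is false. Anything the heuristic marks as a shell stays on the unverified list rather than being parsed for facts.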

What we'd tell the next team

Run the judge as an exit-code gate, but read its dimensions with different protocols. Voice / differentiation / banned-phrase / length failures are verdicts — fix the prose. Factual-accuracy failures are a verification queue — verify the world, then let the documented score stand.
Don't optimize for the factual score. It does not respond to hedging, because it's flagging the presence of an operational claim, not its precision. Trying to satisfy it sands away the specificity that makes a page worth reading.
Write the prose so an unverified flag degrades gracefully. If a claim might not survive verification, phrase it so the worst case is "check before you go," never a false fact. Then the gate can be amber, not red.
A single bad run is noise. Build the gate to fail safe, but teach the humans reading it to recognize self-contradicting output and re-roll.
Keep a verification log next to the content. Every factual flag the judge raised has a one-line VERIFIED: <fact> — Source: <url> (or an honest UNVERIFIED) in the page record. That log is the actual product of the factual dimension — not the score.
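
The log format in the last point is simple enough to generate. A sketch, assuming historicalNotes is an array of strings and that an unverified entry is written as `UNVERIFIED: <fact>`; neither detail is spelled out above, and `VerificationEntry`, `formatLogLine`, and the example.com URL are placeholders.

```typescript
// One verification outcome; source is absent when the claim could not be confirmed.
interface VerificationEntry {
  fact: string;
  source?: string;
}

// Renders "VERIFIED: <fact> <dash> Source: <url>", or "UNVERIFIED: <fact>" (assumed shape).
// \u2014 is the dash character used in the log format described above.
function formatLogLine(entry: VerificationEntry): string {
  return entry.source
    ? `VERIFIED: ${entry.fact} \u2014 Source: ${entry.source}`
    : `UNVERIFIED: ${entry.fact}`;
}

// Appends without mutating, so the original page record stays untouched.
function appendToNotes(historicalNotes: string[], entry: VerificationEntry): string[] {
  return historicalNotes.concat([formatLogLine(entry)]);
}

// Placeholder URL; a real entry would cite the primary-source page.
const notes = appendToNotes([], { fact: "Zoo parking is $16/day", source: "https://example.com/zoo-parking" });
const withBlocker = appendToNotes(notes, { fact: "Panda timed-entry policy" });
```

Keeping the unverified entries in the same log means the publish blockers sit right next to the cleared facts in the page record.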

The enrichment judge made our pipeline self-correcting. This judge made our hand-written pages honest. Same technique; the value was entirely in knowing which of its complaints to argue with and which to go do homework about.

Stack

Judge model: Claude Sonnet (pinned via EDITORIAL_MODEL), temperature 0
Runner: scripts/judge-activity.ts — one paragraph in, JSON + exit code out
Standard: specs/voice-profile-activities.md (the voice the judge grades against)
Verification fallback: curl + HTML parsing for server-rendered primary sources; SPA-only facts stay flagged
Content store: typed const records in lib/activities/, with an in-record historicalNotes verification log
