The Problem
We run an 8-stage AI enrichment pipeline that processes venue data into editorial content. Stage 03 (Writer) uses Claude Sonnet to generate long-form venue descriptions. Stage 07 (Judge) uses Claude Haiku as an LLM-as-Judge to score each editorial on four axes: factual accuracy, voice adherence, differentiation from competitors, and SEO value.
After our first full run across 92 Little Italy venues, 85% were rejected. Only 13 passed. The pass rate was so low that the pipeline was essentially non-functional.
The Architecture
Our enrichment pipeline is a sequential chain where each stage reads the output of the previous one:
01-ingest → 02-classify → 03-write → 04-faq → 05-pairs → 06-monetize → 07-judge
The key stages for this discussion:
Stage 02: Classifier
Assigns structured tags (e.g., date-night, coworking-friendly, outdoor-seating) and metadata (sound_level, budget_tier) to each venue using Claude Haiku with a controlled vocabulary. Tags are stored in a dedicated VenueEssenceTag table with indexed slug lookups.
Stage 03: Writer
Generates editorial content using Claude Sonnet. Receives:
- Venue data (name, address, Google reviews, service flags)
- Neighborhood voice profile (tone, vocabulary hints, words to avoid, local reference anchors)
- Up to 3 competitor editorials from the same neighborhood (for differentiation)
- Optional rejection feedback from a previous judge run
Outputs a JSON object: { lead_sentence, body, voice_used, key_dishes }.
Stage 07: Judge
Scores each editorial using Claude Haiku on four axes (0.0–1.0 each):
| Axis | Threshold | Weight | What it measures |
|---|
| Factual | 0.85 | 35% | Does the editorial contradict Google data? |
| Voice | 0.80 | 25% | Does it match the neighborhood voice profile? |
| Differentiation | 0.80 | 25% | Could this body be swapped with a competitor's? |
| SEO | 0.60 | 15% | Uniqueness from Google blurb, keyword richness |
Any single axis below its threshold triggers a reject, regardless of the weighted overall score. The judge also writes actionable feedback stored in qualityScores.feedback.
Diagnosing the 85% Reject Rate
We queried the qualityScores JSON column across all 326 rejected venues to find the failure distribution:
Voice < 0.80: 260/326 (80%)
Differentiation < 0.80: 262/326 (80%)
Factual < 0.85: 47/326 (14%)
SEO < 0.60: 10/326 (3%)
Voice and differentiation were the dual killers. Reading the judge's feedback revealed two systemic issues:
Issue 1: Voice Profile Mismatch
Each neighborhood has a voice profile — Little Italy's is "warm, traditional-with-edge, walkable-urban — the nonna-meets-aperitivo energy" with vocabulary hints like "India Street", "piazza", "the mercato", "passeggiata".
The writer prompt buried this as a 4-line modifier in a long system prompt. The model treated it as suggestions rather than requirements, defaulting to generic food-blog voice. The judge, which received the same voice profile as a primary scoring criterion, correctly flagged the mismatch.
But there was a deeper structural problem: the voice profile assumes the neighborhood's dominant cuisine. Little Italy's vocabulary is Italian. But 40% of the venues in "Little Italy" are Thai restaurants, sushi bars, Chinese takeout, Indian restaurants, and convenience stores. When the writer tried to apply "passeggiata" energy to a 7-Eleven, the result was either cynical (which the judge flagged as hostile tone) or forced (which the judge flagged as inauthentic).
Issue 2: Differentiation Without Context
The writer received competitor bodies truncated to 500 characters — enough to get the gist, but not enough to know what specific framing, metaphors, or structural patterns to avoid. The result: multiple venues in the same neighborhood ended up with structurally interchangeable editorials. The judge, which received the same competitors, correctly identified the swappability.
The Fix: Three Changes
1. Voice as Primary Instruction
We restructured the writer's system prompt to lead with the voice profile as an explicit #1 JOB, not an afterthought:
YOUR #1 JOB: Write in the NEIGHBORHOOD VOICE. This is not optional —
it is the primary scoring criterion.
═══ NEIGHBORHOOD VOICE PROFILE ═══
Tone: warm, traditional-with-edge, walkable-urban
Vocabulary to weave in naturally: India Street, piazza, the mercato...
We also added explicit anti-patterns extracted from real judge feedback — the specific phrases and constructions that trigger rejects:
VOICE ANTI-PATTERNS (the judge will reject these):
- Generic phrases: "the kind of place", "keeps regulars coming back"
- Food-blogger tone: "the move", "absolutely slaps"
- Yelp-complaint framing: listing negatives, service warnings
- Forced vocabulary: don't shoehorn terms where they don't fit
- Cynicism or contempt: even for chains, write with bemused affection
2. Non-Dominant Cuisine Guidance
For venues whose cuisine doesn't match the neighborhood's identity, we added a key instruction:
If the venue's cuisine doesn't match the neighborhood's dominant culture
(e.g., a Thai restaurant in an Italian neighborhood), don't force the
neighborhood vocabulary onto the food. Instead, frame the venue THROUGH
the neighborhood's lens — explain what role this place plays in the
neighborhood's ecosystem, why locals go here, how it fits into the
walkable fabric. The TONE should match the neighborhood; the CONTENT
should be honest about what the venue is.
And we calibrated the judge to match:
For venues whose cuisine/concept doesn't match the neighborhood's
dominant culture, the TONE should still match the profile, but don't
penalize for not using cuisine-specific neighborhood vocabulary. Score
based on whether the editorial frames the venue through the
neighborhood's lens.
This ensures the writer and judge are aligned on expectations for edge cases.
3. Longer Competitor Context
We increased competitor body context from 500 to 800 characters — enough for the writer to identify specific framing patterns, metaphors, and structural choices to avoid.
The Feedback Loop
The pipeline supports iterative improvement through a built-in feedback loop:
03-write (initial) → 07-judge (reject + feedback) → 03-write --regen-rejected → 07-judge
When 03-write.ts runs with --regen-rejected, it:
- Queries venues with
enrichmentStatus = 'rejected'
- Reads the judge's feedback from
qualityScores.feedback
- Appends it to the writer prompt as: "PREVIOUS ATTEMPT FAILED. Judge feedback: [specific feedback]. Fix the issues identified above."
The judge's feedback is intentionally specific and actionable. Not "improve the voice" but:
"The phrase 'the kind of place' is too generic — mention the specific cocktail program or the rooftop view that separates this from Seneca next door. Also, 'India Street' appears three times; use it once, then shift to 'the neighborhood' for variety."
This creates a closed loop where each rejection cycle produces better content. The writer doesn't just retry blindly — it receives a targeted critique from the same model that will re-judge it.
Results
| Metric | Before | After |
|---|
| Pass rate (Little Italy, n=20) | 10% (2/20) | 37.5% (6/16) |
| Voice avg score (rejected) | 0.59 | 0.64 |
| Primary failure axis | Voice (100%) | Voice (still, but improving) |
The pass rate increased 3.75x. Venues that previously failed — Mona Lisa Italian Foods, Nolita Hall, Parc Bistro-Brasserie, Hane Sushi — now pass on the first re-write cycle. The remaining rejects are concentrated in genuinely difficult cases (convenience stores, chain restaurants) where another --regen-rejected cycle with the updated feedback would likely convert more.
What We Learned
1. Prompt hierarchy matters more than prompt content. The voice profile was always in the prompt — it just wasn't positioned as the primary instruction. Moving it from line 15 to line 1 and formatting it with visual separators (═══) produced a measurable improvement in adherence.
2. LLM-as-Judge must calibrate to edge cases. A voice profile designed for Italian trattorias will score a 7-Eleven unfairly. The fix isn't lowering thresholds — it's teaching the judge when to apply the profile strictly vs. when to score through a different lens.
3. Feedback loops compound. Each rejection cycle incorporates specific, actionable feedback from the previous judge run. The writer doesn't just retry — it course-corrects. This makes the pipeline self-improving over iterations.
4. Structured scoring beats pass/fail. The four-axis rubric with per-axis thresholds made diagnosis trivial. We could immediately see that 80% of failures were voice + differentiation, not factual or SEO. A binary pass/fail judge would have required manual reading of hundreds of outputs.
5. The judge and writer must share the same contract. When we updated the writer's voice instructions, we updated the judge's voice scoring criteria to match. Misaligned expectations between writer and judge create an unwinnable game.
Architecture Diagram
Stack
- Writer model: Claude Sonnet 4.5 (high-quality long-form generation)
- Judge model: Claude Haiku 4.5 (fast, cost-effective evaluation)
- Classifier model: Claude Haiku 4.5 (structured tag output)
- Validation: Zod schemas at every stage boundary
- Storage: PostgreSQL via Prisma, JSONB columns for flexible editorial/scoring data
- Pipeline: TypeScript scripts with configurable concurrency, retries, and rate limiting