Eval loops for AI content: pass-fail rubrics that scale a pipeline.

DIRECT ANSWER

Q. What is an eval loop in AI content?

A. An eval loop is a structured pass-fail review where an editor agent scores each draft from a writer agent against a fixed rubric, sends fails back for rewrite, and only ships drafts that pass. It is what turns a writer agent into a production pipeline.

EVIDENCE On a 6-month run of one pipeline, the first 30 drafts had a 38% first-pass rate. By draft 200 the first-pass rate stabilized at 81%, with no model upgrades — only rubric and brief improvements.

An eval loop is the editor-agent counterpart to the writer agent.

Without one, you have an AI writer producing drafts that are mostly fine, sometimes great, occasionally terrible, and consistently inconsistent. With one, you have a pipeline that ships predictable output every time, gets better as the brief library compounds, and surfaces real signal when something is wrong.

The eval loop is the part of an agentic SEO pipeline most teams underbuild. It is also the part that does the most work. Here is what mine contains and how it behaves in production.

What an eval loop actually is

A small definition. An eval loop is a programmatic review stage where an editor agent scores each writer-agent draft against a fixed rubric, returns failures to the writer with specific reasons, and only ships drafts that pass.

The loop has three properties.

Programmatic. Runs without a human in the path. If a human has to click “approve” every time, you have a workflow, not an eval loop.

Pass-fail per axis. Each rubric axis returns a yes or no. Numeric scores feel more rigorous but degrade fast — the editor invents fake distinctions that do not survive replay.

Bounded retry. Maximum two or three rewrite cycles, then escalate to a human. Without bounds, the loop will burn API spend forever on drafts that have a brief problem, not a draft problem.

The six-stage pipeline diagram (Fig. 01) is the practical shape. Brief, draft, score, rewrite if fail, re-score, ship or escalate.
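The bounded-retry shape can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the production pipeline: `write` and `score` are hypothetical callables standing in for the writer-agent and editor-agent API calls, and `Verdict`, `LoopResult`, and `max_rewrites` are names invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    passed: bool
    rewrite_directive: str = ""


@dataclass
class LoopResult:
    status: str        # "ship" or "escalate"
    draft: str
    cycles_used: int   # rewrite cycles consumed before the loop stopped


def run_eval_loop(
    brief: str,
    write: Callable[[str, Optional[str], Optional[str]], str],  # hypothetical writer call
    score: Callable[[str, str], Verdict],                       # hypothetical editor call
    max_rewrites: int = 3,
) -> LoopResult:
    """Draft, score, rewrite on fail, and escalate once the retry budget is spent."""
    draft = write(brief, None, None)  # first draft: no previous draft, no directive
    for cycle in range(max_rewrites + 1):
        verdict = score(brief, draft)
        if verdict.passed:
            return LoopResult("ship", draft, cycle)
        if cycle == max_rewrites:
            break  # budget spent: stop burning API calls
        draft = write(brief, draft, verdict.rewrite_directive)
    return LoopResult("escalate", draft, max_rewrites)
```

An "escalate" result is the signal to hand the draft and its brief to a human, since repeated failures usually point at the brief.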

The 6-axis rubric

After running this on 8 client pipelines, the same six axes appear in every rubric that produces shippable output. Adding more makes the loop noisier. Cutting any one weakens the catch rate.

1. Entity coverage

The brief specifies 5 to 12 named entities (see brief templates for AI writers). The editor verifies each entity appears in the draft at least once, in a meaningful sentence — not a footnote, not a passing reference.

Why it matters. Entity coverage is a top-three GEO citation signal. A draft that drops three required entities loses citation eligibility for the queries those entities anchor. See entity-based SEO.
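A presence check for this axis might look like the sketch below. It only verifies that each entity appears as a whole phrase; whether the mention sits in a meaningful sentence still needs the editor agent's judgment. The function name and return fields are assumptions, loosely mirroring the editor output shown later in this article.

```python
import re


def entity_coverage(draft: str, required_entities: list[str]) -> dict:
    """Pass only if every required entity appears at least once as a whole phrase."""
    missing = [
        entity
        for entity in required_entities
        # Lookarounds stop "Bing" from matching inside a longer word.
        if not re.search(r"(?<!\w)" + re.escape(entity) + r"(?!\w)", draft, re.IGNORECASE)
    ]
    return {
        "pass": not missing,
        "covered": len(required_entities) - len(missing),
        "required": len(required_entities),
        "missing": missing,
    }
```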

2. Evidence density

The brief specifies an evidence floor: e.g., “3+ verifiable numbers per H2, 2+ outbound citations to primary sources.” The editor counts. Fails if the floor is not met.

Why it matters. AI writers will happily fill space with “studies suggest” and “experts agree” if you let them. Programmatic evidence density forces real numbers and real sources. It also dramatically reduces hallucination — if the writer cannot cite a real source, it stops making up plausible ones.
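The counting part is easy to sketch. The version below assumes the draft is markdown with `## ` H2 headings and that citations appear as `http(s)` URLs; both are assumptions about the draft format, and the regexes are deliberately crude. It counts numbers, not whether they are verifiable, and it does not check that a URL points to a primary source.

```python
import re

NUMBER = re.compile(r"\d+(?:[.,]\d+)*%?")
URL = re.compile(r"https?://[^\s)\]]+")


def split_h2_sections(draft_md: str) -> dict[str, str]:
    """Map each '## ' heading to the body text that follows it."""
    sections: dict[str, str] = {}
    current = None
    for line in draft_md.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = ""
        elif current is not None:
            sections[current] += line + "\n"
    return sections


def evidence_density(draft_md: str, min_numbers_per_h2: int = 3, min_citations: int = 2) -> dict:
    """Fail if any H2 section is below the number floor or the draft has too few links."""
    thin_sections = {}
    for heading, body in split_h2_sections(draft_md).items():
        body_without_urls = URL.sub("", body)  # digits inside URLs are not evidence
        count = len(NUMBER.findall(body_without_urls))
        if count < min_numbers_per_h2:
            thin_sections[heading] = count
    citations = len(set(URL.findall(draft_md)))
    return {
        "pass": not thin_sections and citations >= min_citations,
        "thin_sections": thin_sections,
        "citations": citations,
    }
```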

3. Banned words

A list of phrases that auto-fail the draft. The list lives in the brand spec: “agent-driven,” “speed up,” “use,” “enable,” “modern,” “in today’s evolving landscape,” and so on (see E-E-A-T).

Why it matters. Mechanical filter for the laziest LLM tropes. Catches in 1 second what a human editor would catch in 5 minutes. The savings compound massively at scale.
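This axis needs no model call at all. A minimal sketch, assuming the banned list is loaded from the brand spec as plain strings (the function name is hypothetical):

```python
import re


def banned_words(draft: str, banned: list[str]) -> dict:
    """Auto-fail the draft if any banned phrase appears as a whole phrase."""
    found = []
    for phrase in banned:
        # Lookarounds keep "enable" from firing on "enabled".
        pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
        if re.search(pattern, draft, re.IGNORECASE):
            found.append(phrase)
    return {"pass": not found, "found": found}
```

Whole-phrase matching means inflected forms slip through unless they are listed separately. Whether that is the right trade-off depends on how strict the brand spec is.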

4. Tone match

The editor compares the draft against the tone exemplars in the brief — actual paragraphs from existing site content. Scores on syntactic and lexical similarity, not vibes. Fails if similarity drops below a threshold.

This is the axis most prone to false positives (on-voice drafts flagged as failures). Calibrate carefully and bias toward letting borderline drafts through, then human-review the borderlines. Tone is the hardest axis to automate well.
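To make "similarity against exemplars" concrete, here is a deliberately crude stand-in: cosine similarity over word counts, taking the best match across exemplars. It covers only the lexical half of the check, the similarity measure is not specified in this article, and the `0.5` threshold is a placeholder to calibrate, not a recommendation.

```python
import math
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def tone_match(draft: str, exemplars: list[str], threshold: float = 0.5) -> dict:
    """Pass if the draft is lexically close enough to at least one exemplar."""
    draft_counts = word_counts(draft)
    score = max((cosine(draft_counts, word_counts(e)) for e in exemplars), default=0.0)
    return {"pass": score >= threshold, "score": round(score, 2)}
```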

5. Structural compliance

The brief specifies an H2 skeleton with word targets. The editor checks:

All required H2s present (allowing minor phrasing variation)
Each section within ±25% of its word target
H1 contains the target query in some form
At least one H2 written as a question (for AI citation)

Why it matters. Drafts that wander off the skeleton waste research effort and produce articles that target nothing. The structural check enforces the SEO strategy implicit in the brief.
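The four checks above can be sketched as one function. Assumptions: the draft is markdown with `# ` and `## ` headings, the skeleton arrives as a mapping from H2 heading to word target, "minor phrasing variation" is approximated with a fuzzy-match cutoff of 0.8, and "contains the target query in some form" is approximated as "every query word appears in the H1". All of these are illustrative choices.

```python
import difflib
import re


def structure_check(draft_md: str, skeleton: dict[str, int], target_query: str,
                    tolerance: float = 0.25, heading_cutoff: float = 0.8) -> dict:
    """skeleton maps each required H2 heading to its word target."""
    h1 = ""
    counts: dict[str, int] = {}  # lowercased H2 heading -> words in its body
    current = None
    for line in draft_md.splitlines():
        if line.startswith("# ") and not h1:
            h1 = line[2:].strip()
        elif line.startswith("## "):
            current = line[3:].strip().lower()
            counts[current] = 0
        elif current is not None:
            counts[current] += len(line.split())

    problems = []
    for required, target in skeleton.items():
        # Fuzzy match so minor phrasing variation in a heading still counts.
        match = difflib.get_close_matches(required.lower(), list(counts),
                                          n=1, cutoff=heading_cutoff)
        if not match:
            problems.append(f"missing H2: {required}")
        elif abs(counts[match[0]] - target) > tolerance * target:
            problems.append(f"{required}: {counts[match[0]]} words, target {target}")

    query_words = set(re.findall(r"\w+", target_query.lower()))
    if not query_words <= set(re.findall(r"\w+", h1.lower())):
        problems.append("H1 does not contain the target query")
    if not any(heading.endswith("?") for heading in counts):
        problems.append("no H2 written as a question")
    return {"pass": not problems, "problems": problems}
```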

6. Answer-first opening

The editor reads the first paragraph after the H1. It must contain a definitional sentence about the target entity, a specific claim, and a stakes sentence. The opening must be self-contained — extractable without surrounding context.

Why it matters. The first 80 words are the AI Overview / featured snippet payload: the single highest-leverage paragraph on the page. See how to get cited by ChatGPT.

What the editor agent actually looks at

The editor agent gets four inputs:

The brief (all 8 fields)
The draft
The rubric (as a system prompt with each axis defined)
The tone exemplars (as reference text)

It returns a structured JSON object: a pass/fail for each axis with a short reason or supporting detail, a global pass/fail, and, on a global fail, a rewrite directive for the writer.

Example output:

{
  "entity_coverage": { "pass": true, "covered": 6, "required": 6 },
  "evidence_density": { "pass": false, "reason": "H2 section 3 has 1 number; required floor is 3" },
  "banned_words": { "pass": false, "found": ["speed up", "enable"] },
  "tone_match": { "pass": true, "score": 0.78 },
  "structure": { "pass": true },
  "answer_first": { "pass": true },
  "global": "fail",
  "rewrite_directive": "Section 3 needs 2 more verifiable numbers. Replace 'speed up' and 'enable' with specific actions."
}

The writer reads the rewrite directive, regenerates the failing sections (not the whole draft — selective rewrite is faster and cheaper), and resubmits.
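On the orchestration side, the editor's JSON should be validated before anything acts on it. The sketch below uses the axis keys from the example output and recomputes the global verdict from the per-axis results instead of trusting the model's own `global` field. That recomputation is a design choice of this sketch, not something the pipeline described here is known to do.

```python
import json

AXES = ("entity_coverage", "evidence_density", "banned_words",
        "tone_match", "structure", "answer_first")


def parse_verdict(raw: str) -> dict:
    """Validate the editor's JSON and derive the global verdict from the axes."""
    data = json.loads(raw)
    failed_axes = []
    for axis in AXES:
        result = data.get(axis)
        if not isinstance(result, dict) or not isinstance(result.get("pass"), bool):
            raise ValueError(f"axis {axis!r} is missing or has no boolean 'pass'")
        if not result["pass"]:
            failed_axes.append(axis)
    return {
        "passed": not failed_axes,
        "failed_axes": failed_axes,
        # Only forward a directive when there is something to rewrite.
        "rewrite_directive": data.get("rewrite_directive", "") if failed_axes else "",
    }
```

A `ValueError` or `json.JSONDecodeError` here means the editor call itself misbehaved, which is worth logging separately from a draft failure.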

First-pass rate as the health metric

The single number to track is first-pass rate: what percent of drafts pass the rubric on the first attempt with no rewrites.

Healthy progression for a new pipeline:

First 30 drafts: 30 to 45% first-pass. Brief library is still being built.
Drafts 30 to 100: 55 to 70%. Library stabilizing, tone exemplars converging.
Drafts 100+: 75 to 85%. Steady state.

If your rate drops over time, something rotted — usually a brief template or a tone exemplar got generic. Investigate immediately. The eval loop is also a brief-quality monitor.

If your rate exceeds 95%, your rubric is too loose. Tighten an axis.
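The metric and its two alarms fit in a few lines. A minimal sketch: the function names are hypothetical, and comparing against a single previous window is a simplification, since a real monitor would smooth over several windows before calling a drop.

```python
from typing import Optional


def first_pass_rate(first_attempt_passed: list[bool]) -> float:
    """Share of drafts that passed the rubric on the first attempt."""
    if not first_attempt_passed:
        raise ValueError("need at least one draft")
    return sum(first_attempt_passed) / len(first_attempt_passed)


def health_flag(rate: float, previous_rate: Optional[float] = None) -> str:
    """Apply the two rules above: over 95% is too loose, a falling rate means rot."""
    if rate > 0.95:
        return "rubric too loose: tighten an axis"
    if previous_rate is not None and rate < previous_rate:
        return "rate dropping: check brief templates and tone exemplars"
    return "ok"
```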

Why the writer should see the rubric

Counterintuitive but important. Put the rubric in the writer agent’s system prompt.

The instinct is to hide the criteria so the writer cannot “game” them. In practice, telling the writer what will be checked raises first-pass rate by about 25 percentage points. The writer is not gaming anything — it is aligning. The point of the rubric is to specify quality, and an aligned writer produces aligned output.

This also reduces compute cost dramatically. A 25-point lift in first-pass rate is roughly a 30% reduction in API spend per shipped post.
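One way to sanity-check that figure is a back-of-the-envelope model. Everything here is an assumption: a 55% baseline (the article gives the lift, not the starting point), every attempt passing with the same probability as the first, every attempt costing the same even though selective rewrites are cheaper, and spend on escalated drafts being charged to the posts that ship.

```python
def attempts_per_shipped_post(first_pass_rate: float, max_attempts: int = 4) -> float:
    """Expected write-and-score attempts spent per draft that eventually ships.

    max_attempts = 1 initial draft + 3 rewrite cycles.
    """
    fail = 1.0 - first_pass_rate
    # Attempt k+1 only happens if the first k attempts all failed.
    expected_attempts = sum(fail ** k for k in range(max_attempts))
    shipped_fraction = 1.0 - fail ** max_attempts
    return expected_attempts / shipped_fraction


before = attempts_per_shipped_post(0.55)  # assumed baseline, not from the article
after = attempts_per_shipped_post(0.80)   # baseline plus 25 points
reduction = 1.0 - after / before
print(f"{before:.2f} -> {after:.2f} attempts per shipped post, {reduction:.0%} less spend")
```

Under these assumptions the cost per shipped post works out to 1 / first-pass rate, so 55% to 80% is a reduction of about 31%, in line with the rough 30% figure. A different baseline gives a different number.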

Mixing models

The single highest-leverage rubric upgrade is using different model providers for writer and editor.

Reason: every model has blind spots. GPT-5 writes in a specific cadence and fails to notice that cadence when reviewing its own drafts. Claude is the same with its own. When the writer and editor are the same model family, the loop misses about 18% of failures that a cross-model loop would catch.

My current setup runs Claude as the writer and GPT-5 as the editor for half my pipelines, reversed for the other half. The mix is not magic — both directions work — but cross-model is better than same-model in every test I have run.

This does have a cost. Two providers means two API keys, two billing dashboards, and a slightly more complex orchestration. For most pipelines the quality lift justifies the overhead.

What the loop cannot do

An eval loop catches structural failures. It does not catch editorial ones.

It cannot tell you whether an argument is interesting. It cannot tell you whether a metaphor lands. It cannot tell you whether the post would persuade a CMO. Those judgments still belong to a human.

The right division of labor: the eval loop catches everything mechanical, the human reviews everything strategic. The realistic target is a 20-post-per-month pipeline that goes from 8 hours of human editing per post to 45 minutes. Not zero human time, just much less of it on the wrong work.

Logging the loop

Every eval result gets logged. Three months of logs become a dataset.

The dataset is what enables rubric v2. You look at:

Which axes fail most often (usually evidence density and banned words)
Which axes fail rarely (usually structure, since the brief enforces it)
Drafts where the rubric passed but the human edited heavily (calibration miss)
Drafts where the rubric failed but the human read the draft and disagreed (rubric too strict)

The miss-and-strict patterns drive rubric edits. After two or three iterations, the rubric stabilizes and the logs become quieter — at which point the bottleneck moves upstream to brief quality. See agentic SEO cost economics for what the full economic stack looks like.
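The four questions above map onto a small aggregation over the log. The record shape is hypothetical: each entry is assumed to carry a `draft_id`, the list of `failed_axes` from the first scoring pass, and an optional `human_verdict` of `"heavy_edit"` or `"disagreed"` recorded by the reviewer.

```python
from collections import Counter


def summarize_logs(logs: list[dict]) -> dict:
    """Rank axes by failure count and pull out the two calibration patterns."""
    axis_failures: Counter = Counter()
    calibration_misses = []  # rubric passed, human edited heavily
    too_strict = []          # rubric failed, human disagreed
    for entry in logs:
        axis_failures.update(entry["failed_axes"])
        rubric_passed = not entry["failed_axes"]
        human = entry.get("human_verdict")
        if rubric_passed and human == "heavy_edit":
            calibration_misses.append(entry["draft_id"])
        elif not rubric_passed and human == "disagreed":
            too_strict.append(entry["draft_id"])
    return {
        "axis_failures": axis_failures.most_common(),  # most frequent first
        "calibration_misses": calibration_misses,
        "too_strict": too_strict,
    }
```

Axes that never show up in `axis_failures` are the ones that fail rarely.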

What to do tomorrow

The single smallest version of an eval loop you can ship in an afternoon.

1. Take your current writer prompt. Add an “evaluate this draft against these criteria” prompt for the editor stage.
2. Define the 6 axes above as binary pass-fail.
3. Run 10 of your existing drafts through the editor manually. Calibrate the prompts until the pass-fail calls match your judgment.
4. Wire the loop programmatically. Two API calls, a JSON parse, a retry.
5. Run for 30 drafts. Look at first-pass rate. Iterate.

The first version takes a day. The third version is the one you keep. After that, the rubric becomes infrastructure — the part of the pipeline you stop thinking about because it just runs.

That is the goal. An eval loop should be boring. Boring eval loops ship 200 posts a year without quality drift. Exciting eval loops are signs something is wrong.

INITIAL PASS RATE: 30-45% (first 30 drafts)
STABILIZED RATE: ~80% (after 200 drafts)
RUBRIC AXES: 6 (below this is too coarse)

FIG. 01 · THE EVAL LOOP

BRIEF (8-field template) → DRAFT (writer agent) → SCORE (editor agent) → REWRITE (if fail) → RE-SCORE (max 3 cycles) → SHIP (or escalate)

Six stages, fully programmatic.

Where eval loops earn their cost — and where they stall

Where they earn their cost:

Volume + consistency. Pipelines shipping 15+ posts per month see the biggest lift. Below 15, manual editing competes.
Catching mechanical failures. Banned words, missing entities, length, and structure are all caught automatically and never reach a human.
Surfacing brief drift. When pass rate drops over time, the brief library has rotted. The eval loop is the early warning system.

Where they stall:

Subjective quality. An eval loop cannot judge if an argument is interesting. It catches structural failures, not editorial ones. A human still reviews above the rubric.
Same-model echo chambers. If writer and editor are both GPT-5, the editor will not catch GPT-5's failure modes. Mix providers.
Over-engineered rubrics. 12-axis rubrics with weighted sub-scores tend to optimize for the rubric, not the reader. Six axes is the sweet spot.

Questions people actually ask

Q01 Is an eval loop the same as a human editor?

No. An eval loop is a programmatic check that catches structural failures (banned words, missing entities, wrong length). A human editor still reviews voice, argument quality, and strategic fit. The loop saves the human from reading drafts that obviously fail.

Q02 What rubric scoring scale should I use?

Binary pass-fail per axis, then a global pass-fail. Numeric scores (1-10) sound rigorous but degrade fast — the editor agent invents distinctions between 7 and 8 that do not exist. Binary forces clarity.

Q03 How do I know my rubric is good?

Sample 20 passed drafts and 20 failed drafts. Read them as a human. If you agree with the rubric on 18 of every 20, it is calibrated. If you disagree on more than 4 of the 40, the rubric needs work: almost always tightening, not loosening.

Q04 What happens if a draft fails the rubric three times?

Escalate to human review. Three consecutive failures usually mean the brief is wrong, not the draft. The human inspects, rewrites the brief, and resets the loop. Do not let an infinite loop ship slop.

Q05 Should the writer agent see the rubric?

Yes, in the system prompt. Telling the writer in advance what the editor will check raises first-pass rate by roughly 25 percentage points in my testing. It is not cheating — it is alignment.

Q06 Which model should I use for the editor?

A different one from the writer. Claude editing GPT drafts, or GPT editing Claude drafts, catches more failures than same-model loops. Mix providers if budget allows.

Niko Alho

I run agentic SEO and build custom AI for B2B companies. Based in Turku.
