The eval-first AI build: why most AI projects fail QA.


DIRECT ANSWER

Q. Why do most AI projects fail QA?

A. Because the team defined success after building, not before. Without a written eval rubric and a golden dataset created before the first prompt, there is no objective definition of 'it works' — and the build ships based on vibes, not evidence.

EVIDENCE

Across the custom AI builds I have shipped and audited, the single most common failure pattern is a system that 'seemed fine in demo' and degraded within 6 weeks of production because nobody had defined a measurable quality bar up front.

Eval-first is the discipline of writing the pass/fail rubric and assembling a golden dataset before you write the first prompt. Most teams skip it. Most teams ship AI systems that nobody can objectively certify as working.

The failure pattern is consistent. A team gets excited about an AI use case. They build a demo in 2 days. The demo impresses stakeholders. Three weeks later the build is in production, and the system that worked great in the demo hallucinates on 15% of real inputs. Nobody catches it for 6 weeks because there is no monitoring. Nobody has a rubric against which to measure anything. The team debates whether the new Claude model is “better” by running 3 manual tests and calling it done.

This is not a model problem. Claude Sonnet, GPT-4o, Gemini 1.5 Pro — any of them can do the job you are asking. The problem is that you have no defined job for them to do, and no way to know if they are doing it.

Eval-first fixes this. Here is the full system.

Why AI projects fail QA: the actual reason

The McKinsey Global Survey on AI adoption (2023) found that fewer than 20% of companies that had deployed AI at scale could point to a formal evaluation process. That number is consistent with what I see in practice. The builds that fail QA share one trait: the team defined “it works” after building, not before.

There are three common failure modes.

Vibe-based assessment. The team runs 5 to 10 manual tests, the outputs look reasonable, they ship. Three weeks later a user hits an edge case the team never tested. There is no rubric, so there is no way to determine whether this is a bug or an expected limitation. The response is usually to patch the specific case, which makes the prompt more complex and introduces 3 new failure modes.

Demo-to-production gap. The demo dataset is curated. The production inputs are not. Real users ask things in ways the demo did not cover. They abbreviate, use jargon, send empty fields, attach PDFs with unusual formatting. A system that achieves 95% accuracy on the demo dataset and 60% on production inputs is not unusual — it is typical when the demo dataset was small and hand-picked.

Silent drift. Prompt drift happens when the underlying model changes (deprecations, fine-tune updates, behavior shifts between model versions), when the input data distribution shifts (new customers, new product lines, seasonal patterns), or when integration behavior changes (a CRM field that used to always be populated now sometimes comes back null). Without a regression suite, drift is invisible until a user complains, and users typically report only about 10% of the failures they experience.

The eval-first discipline catches all three failure modes because it forces specificity before any code exists.

The 4 components of an eval-first build

1. The rubric

A rubric is a written document with named criteria, a scoring scale, and example scores. It answers the question: how do we know this output is good?

A rubric for a content brief generator might look like this:

Accuracy: Does the brief correctly identify the target keyword and search intent? (Pass / Fail)
Completeness: Does the brief include all 8 required fields (keyword, intent, target audience, word count, H2 structure, internal links, CTA, sources)? (Score 0–8)
Specificity: Is the brief specific enough that a writer could produce the draft without clarifying questions? (1–5 scale)
Format compliance: Does the output validate against the expected JSON schema? (Pass / Fail)

Each criterion gets a threshold. Below that threshold, the output fails the eval suite and does not ship.
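
A minimal Python sketch of how the rubric above could be encoded so the pass/fail decision is mechanical. The threshold values (all 8 fields present, specificity of at least 4) and the names `Criterion`, `RUBRIC`, `failed_criteria`, and `passes_rubric` are assumptions for illustration; the text only says that each criterion gets a threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    min_score: int  # lowest score that still counts as a pass


# Pass/fail criteria are modelled as 0 or 1 with a threshold of 1.
# The completeness and specificity thresholds are illustrative assumptions.
RUBRIC = [
    Criterion("accuracy", min_score=1),
    Criterion("completeness", min_score=8),
    Criterion("specificity", min_score=4),
    Criterion("format", min_score=1),
]


def failed_criteria(scores: dict[str, int]) -> list[str]:
    """Return the names of criteria whose score is missing or below threshold."""
    failed = []
    for criterion in RUBRIC:
        score = scores.get(criterion.name)
        if score is None or score < criterion.min_score:
            failed.append(criterion.name)
    return failed


def passes_rubric(scores: dict[str, int]) -> bool:
    """An output ships only if every criterion meets its threshold."""
    return not failed_criteria(scores)


# Hypothetical scores for one brief: one required field is missing.
example = {"accuracy": 1, "completeness": 7, "specificity": 4, "format": 1}
print(passes_rubric(example), failed_criteria(example))
```

Keeping the thresholds in data rather than in prose means a change to the quality bar shows up as a reviewable diff.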

The rubric takes 4 to 8 hours to write well. That is the investment. It pays back every time you make a prompt change and run the regression suite to verify you did not break anything.

2. The golden dataset

A golden dataset is 20 to 50 input/output pairs where a human has already scored the output against the rubric. It is the ground truth. The eval tools — Braintrust, LangSmith, promptfoo — all treat the golden dataset as the benchmark your system must match or beat.

Where do you get the examples? Three sources work.

Real historical data. If the system is replacing a manual process, score 20 to 50 examples from the existing manual output. This captures the quality bar the humans were hitting and makes the rubric calibration grounded.

Constructed examples. If there is no historical data, write 20 to 50 representative inputs yourself, run them through the desired process manually, and record the ideal output. Cover the main edge cases: empty inputs, unusually long inputs, inputs in the wrong format, inputs that should trigger a graceful failure.

Adversarial examples. Include 5 to 10 inputs designed to break the system. Jailbreak attempts, nonsense inputs, inputs that are technically valid but semantically ambiguous. A system that handles adversarial inputs gracefully is far more likely to handle real-user inputs reliably.

For RAG systems in B2B contexts, the golden dataset construction is particularly important because retrieval quality and generation quality are separate failure modes that need separate rubric criteria.
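
One possible shape for a golden dataset record, with a sanity check for the size guidance above. This is a sketch under assumptions: the field names, the `source` labels, and the choice to count the adversarial examples as part of the dataset (not in addition to it) are all illustrative, not something the tools require.

```python
from collections import Counter
from dataclasses import dataclass

SOURCES = {"historical", "constructed", "adversarial"}


@dataclass
class GoldenExample:
    example_id: str
    source: str                   # one of SOURCES
    input_text: str
    expected_output: str
    human_scores: dict[str, int]  # rubric criterion -> score given by a human


def dataset_issues(examples: list[GoldenExample]) -> list[str]:
    """Check the dataset against the minimums described in the text."""
    issues = []
    if len(examples) < 20:
        issues.append(f"expected at least 20 examples, found {len(examples)}")
    counts = Counter(e.source for e in examples)
    unknown = set(counts) - SOURCES
    if unknown:
        issues.append(f"unknown sources: {sorted(unknown)}")
    if counts["adversarial"] < 5:
        issues.append(f"expected at least 5 adversarial examples, found {counts['adversarial']}")
    unscored = [e.example_id for e in examples if not e.human_scores]
    if unscored:
        issues.append(f"examples without human scores: {unscored}")
    return issues


# Hypothetical dataset: 5 adversarial and 15 constructed examples.
examples = [
    GoldenExample(f"ex-{i}", "adversarial" if i < 5 else "constructed",
                  f"input {i}", f"ideal output {i}", {"accuracy": 1})
    for i in range(20)
]
print(dataset_issues(examples))  # an empty list means the checks passed
```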

3. The baseline

Before you build anything, run your rubric against the current state. If you are replacing a manual process, score 20 examples from that process against your rubric. This gives you a baseline pass rate. Your build needs to match or beat it.

If you are building a net-new capability (no prior process), use the golden dataset itself as the baseline. After you write the first version of the system, score it against the golden dataset. That first score is your baseline. Every subsequent version should beat it.

This sounds obvious. Almost nobody does it. Without a baseline, “is v2 better than v1?” is a question that cannot be answered objectively.
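
To make the comparison concrete, here is a small sketch that turns per-example pass/fail verdicts into a baseline and compares a candidate against it. The function names and the sample numbers are hypothetical.

```python
def pass_rate(results: list[bool]) -> float:
    """Share of scored examples that passed the rubric, as a percentage."""
    if not results:
        raise ValueError("cannot compute a pass rate from zero examples")
    return 100.0 * sum(results) / len(results)


def compare_to_baseline(baseline: list[bool], candidate: list[bool]) -> str:
    """Answer 'is the candidate better than the baseline?' with numbers."""
    base, cand = pass_rate(baseline), pass_rate(candidate)
    delta = cand - base
    verdict = "beats" if delta > 0 else "matches" if delta == 0 else "is below"
    return f"candidate {cand:.1f}% {verdict} baseline {base:.1f}% ({delta:+.1f} points)"


# Made-up verdicts: 20 examples from the manual process, then 20 from v1.
manual_process = [True] * 16 + [False] * 4
v1_outputs = [True] * 17 + [False] * 3
print(compare_to_baseline(manual_process, v1_outputs))
```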

4. The eval pipeline

The eval pipeline is the machinery that runs your rubric at scale. Three tools cover most use cases.

Braintrust is the most complete hosted option. It manages dataset versioning, prompt experiments, LLM-as-judge evaluation, and a CI integration that blocks deployments when the rubric fails. The UI is clean for non-engineers. The downside is cost — it prices per eval run, which adds up at high volume.

LangSmith integrates tightly with LangChain and LangGraph pipelines. If your build uses LangChain, LangSmith is the path of least resistance: tracing is automatic, eval is built in, and the dataset management works without additional setup. For non-LangChain builds, it requires more manual wiring.

promptfoo is open-source and config-driven. The rubric lives in a YAML file, the dataset lives in a CSV or JSON, and you run it in CI with a single command. No hosted service, no per-run fees. The trade-off is that there is no hosted dashboard for non-engineers: results are viewed from the command line or a local web viewer, so this is a tool for engineers, not product managers.

OpenAI Evals is worth mentioning but narrower — it is tightly coupled to the OpenAI API surface and works best when your entire pipeline runs on OpenAI models. For multi-model or model-agnostic builds, Braintrust or promptfoo are more flexible.

LLM-as-judge: scaling eval without scaling headcount

Human review of 50 examples is feasible. Human review of 5,000 examples per day is not. This is where LLM-as-judge evaluation earns its place.

The pattern: run your production pipeline and collect outputs. Pass each output to a separate judge model along with the rubric criteria. The judge returns a structured score. Flag outputs below the threshold for human review. Ship a report of the daily pass rate.

A judge prompt for a content brief looks like this:

You are evaluating a content brief against a quality rubric.
Rubric criteria:
1. Accuracy: does the brief correctly identify target keyword and intent? (pass/fail)
2. Completeness: how many of the 8 required fields are present? (score 0-8)
3. Specificity: would a writer be able to draft without clarifying questions? (1-5)
4. Format: does the output validate as valid JSON? (pass/fail)

Brief to evaluate:
[OUTPUT]

Return a JSON object with keys: accuracy, completeness, specificity, format, overall_pass.
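
A sketch of how that prompt might be wired into code. `call_judge` is a hypothetical function you would supply (it takes a prompt string and returns the judge model's raw text reply), and `fake_judge` is a stand-in so the example runs offline. Validating the reply before trusting it is a design choice of this sketch, not something the prompt guarantees.

```python
import json
from typing import Callable

JUDGE_TEMPLATE = """You are evaluating a content brief against a quality rubric.
Rubric criteria:
1. Accuracy: does the brief correctly identify target keyword and intent? (pass/fail)
2. Completeness: how many of the 8 required fields are present? (score 0-8)
3. Specificity: would a writer be able to draft without clarifying questions? (1-5)
4. Format: does the output validate as valid JSON? (pass/fail)

Brief to evaluate:
[OUTPUT]

Return a JSON object with keys: accuracy, completeness, specificity, format, overall_pass."""

EXPECTED_KEYS = {"accuracy", "completeness", "specificity", "format", "overall_pass"}


def judge_brief(brief: str, call_judge: Callable[[str], str]) -> dict:
    """Send one brief to the judge and return its validated score object."""
    reply = call_judge(JUDGE_TEMPLATE.replace("[OUTPUT]", brief))
    scores = json.loads(reply)  # raises ValueError if the reply is not JSON
    if not isinstance(scores, dict):
        raise ValueError("judge reply is not a JSON object")
    missing = EXPECTED_KEYS - set(scores)
    if missing:
        raise ValueError(f"judge reply is missing keys: {sorted(missing)}")
    return scores


def fake_judge(prompt: str) -> str:
    # Stand-in for a real model call; always returns the same scores.
    return ('{"accuracy": "pass", "completeness": 8, "specificity": 4, '
            '"format": "pass", "overall_pass": true}')


print(judge_brief('{"keyword": "eval rubric"}', fake_judge))
```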

The risk with LLM-as-judge is judge bias. A judge model has its own preferences and blind spots. Calibrate against your golden dataset first: run the judge on every golden example and compare its scores to the human scores. If the judge disagrees with the human on more than 15% of cases, tune the judge prompt until agreement is at least 85%.
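
The calibration check itself is a few lines. This sketch compares overall pass/fail verdicts only, which is a simplifying assumption; per-criterion agreement would follow the same pattern. The verdict lists are made up.

```python
MIN_AGREEMENT = 0.85  # 15% disagreement or less counts as calibrated


def agreement_rate(human: list[bool], judge: list[bool]) -> float:
    """Fraction of golden examples where judge and human reach the same verdict."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty verdict lists of equal length")
    matches = sum(h == j for h, j in zip(human, judge))
    return matches / len(human)


def disagreements(human: list[bool], judge: list[bool]) -> list[int]:
    """Indexes of the examples worth reading when tuning the judge prompt."""
    return [i for i, (h, j) in enumerate(zip(human, judge)) if h != j]


# Hypothetical run over 50 golden examples: the judge is stricter on 4 of them.
human_verdicts = [True] * 40 + [False] * 10
judge_verdicts = [True] * 36 + [False] * 14
rate = agreement_rate(human_verdicts, judge_verdicts)
print(f"agreement {rate:.0%}, calibrated: {rate >= MIN_AGREEMENT}")
print("review examples:", disagreements(human_verdicts, judge_verdicts))
```

The list of disagreeing examples is usually more useful than the rate, because it shows where the judge prompt and the human reading of the rubric diverge.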

Braintrust has a built-in calibration workflow. LangSmith does too. promptfoo requires a manual calibration step but supports custom judge prompts out of the box.

The AI content eval loop pattern I use for content pipelines extends this principle — every draft passes a judge eval before it goes to the human editor. The editor sees a pass rate, not a pile of drafts to read from scratch.

Building the regression suite

A regression suite is a version of the eval pipeline that runs automatically every time you change the system — new prompt, new model, new integration behavior. It answers: “did this change break something that was working?”

The CI setup takes about a day to wire up. The workflow:

1. On every pull request, run the golden dataset through the new version of the system.
2. Score outputs against the rubric using LLM-as-judge.
3. Compare the pass rate against the baseline from the last released version.
4. If the pass rate drops by more than 2 percentage points, block the merge.
5. Otherwise (flat, improved, or within the 2-point tolerance), merge.
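
The merge decision in that workflow can be sketched as a function that returns a process exit code. How the two pass rates are loaded (from eval tool output, a stored baseline file, and so on) is left out, and the name `gate` is hypothetical.

```python
MAX_DROP_POINTS = 2.0  # block on a drop of more than 2 percentage points


def gate(baseline_pass_rate: float, candidate_pass_rate: float) -> int:
    """Return an exit code: 0 lets the merge through, 1 blocks it.

    Both rates are percentages (0-100) measured on the same golden dataset.
    """
    drop = baseline_pass_rate - candidate_pass_rate
    if drop > MAX_DROP_POINTS:
        print(f"BLOCK: pass rate fell {drop:.1f} points "
              f"({baseline_pass_rate:.1f}% -> {candidate_pass_rate:.1f}%)")
        return 1
    print(f"OK: pass rate {candidate_pass_rate:.1f}% vs baseline {baseline_pass_rate:.1f}%")
    return 0


# In CI the script would typically end with sys.exit(gate(baseline, candidate)).
# The numbers below are made up.
gate(90.0, 91.5)
gate(90.0, 86.0)
```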

This is how software engineering teams ship. A change that breaks a test is a change that does not ship until the test passes. AI teams should work the same way.

The custom AI build cost breakdown I published earlier shows that evals and monitoring typically run 15 to 20% of the build budget. The regression suite is the bulk of that. Teams that skip it are not saving the 15 to 20% — they are deferring it, with interest, to the debugging sessions 3 months later.

The golden dataset as living documentation

One benefit of the golden dataset that teams discover after the fact: it is the clearest specification of what the system is supposed to do. Better than a PRD. Better than a prompt comment. Better than any description.

When a new engineer joins the team, hand them the golden dataset. They will understand the system’s intent in 20 minutes. When a stakeholder asks why the system gave a certain output, you can compare it to the nearest golden example and show explicitly how it differs. When you want to extend the system to handle a new case, you add the new case to the golden dataset first — before you change any code.
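
The text does not define "nearest", so the following is only one way to do the lookup: character-level similarity from Python's standard `difflib`. For longer or more varied inputs an embedding-based search would probably serve better. The function name and sample inputs are hypothetical.

```python
from difflib import SequenceMatcher


def nearest_golden(query: str, golden_inputs: list[str]) -> tuple[str, float]:
    """Return the golden input most similar to `query` and its similarity (0-1)."""
    if not golden_inputs:
        raise ValueError("golden dataset is empty")
    scored = [
        (SequenceMatcher(None, query.lower(), g.lower()).ratio(), g)
        for g in golden_inputs
    ]
    similarity, best = max(scored)
    return best, similarity


golden_inputs = [
    "brief for keyword 'crm migration checklist'",
    "brief for keyword 'b2b seo audit'",
    "empty keyword field",
]
match, similarity = nearest_golden("brief for keyword 'crm migration guide'", golden_inputs)
print(match, round(similarity, 2))
```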

This is the eval-first discipline applied to the full development lifecycle, not just QA.

For agentic systems — builds where the AI is taking actions rather than generating text — the golden dataset becomes even more important. A text generation system can be spot-checked by reading. An agentic system that books meetings, sends emails, or writes to a database cannot. The only way to know it is working correctly is a structured eval against expected actions. See agentic AI in SEO workflows for how this plays out in a production context.
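
A structured eval against expected actions could look like the sketch below. The `Action` type and tool names are hypothetical, and the comparison deliberately ignores ordering and duplicate calls; a real suite may need to check both.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    tool: str    # e.g. "send_email" or "book_meeting"
    args: tuple  # sorted (key, value) pairs, so equal calls compare equal

    @staticmethod
    def of(tool: str, **kwargs) -> "Action":
        return Action(tool, tuple(sorted(kwargs.items())))


def diff_actions(expected: list[Action], actual: list[Action]) -> dict[str, list[Action]]:
    """Compare what the agent did with what the golden example says it should do."""
    return {
        "missing": [a for a in expected if a not in actual],
        "unexpected": [a for a in actual if a not in expected],
    }


def actions_pass(expected: list[Action], actual: list[Action]) -> bool:
    """Pass only when nothing expected is missing and nothing extra happened."""
    result = diff_actions(expected, actual)
    return not result["missing"] and not result["unexpected"]


# Golden example: the agent should book one meeting and do nothing else.
expected = [Action.of("book_meeting", attendee="a@example.com", day="Tuesday")]
actual = [
    Action.of("book_meeting", attendee="a@example.com", day="Tuesday"),
    Action.of("send_email", to="a@example.com"),
]
print(actions_pass(expected, actual), diff_actions(expected, actual)["unexpected"])
```

Treating an unexpected action as a failure matters for agents: an extra email sent is a defect even when the expected meeting was booked.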

When to hire for eval discipline

Most teams do not have an engineer who has built an eval pipeline from scratch. The skills overlap with testing and observability engineering, but the LLM-specific nuances — LLM-as-judge calibration, golden dataset construction, prompt regression — are not yet standard curriculum.

Two situations where bringing in a consultant specifically for the eval layer makes sense.

At the start of a build. The eval rubric shapes every subsequent decision — what integrations to build, what the prompt needs to handle, what the acceptance criteria are. Getting the rubric wrong at the start costs as much as getting the architecture wrong. A consultant who has built rubrics for similar use cases shortens the calibration process from weeks to days.

After a production failure. If a system is hallucinating in production and the team does not have a rubric or a golden dataset, the first step is building them retroactively. This is harder than building them at the start but recoverable. The process: sample 100 recent outputs, score them manually, write the rubric that those scores imply, instrument monitoring from that baseline forward.

See how to hire an AI consultant for how to evaluate whether someone actually has eval discipline versus someone who uses the word “evaluation” without having shipped a rubric.

What to do this week

If you have an AI build in progress or already in production, this is the 4-hour exercise most likely to improve quality:

1. Pick 20 recent outputs from the system. Actual production outputs, not cherry-picked ones.
2. Score each one. Good, acceptable, bad. Write one sentence explaining each score.
3. Extract the implicit criteria. What made the good ones good? What made the bad ones bad? Name 3 to 5 criteria.
4. Write a rubric. One paragraph per criterion. Define pass and fail explicitly.
5. Run the 20 outputs through the rubric. What is the current pass rate?

That pass rate is your baseline. Every change to the system should be evaluated against it. If you want a regression suite, add Braintrust or promptfoo to the CI pipeline. If you want to scale eval, wire up LLM-as-judge calibrated against those 20 human-scored examples.

The model is not the problem. The missing rubric is the problem. Four hours and 20 examples is enough to fix it.

FIG. 01 · THE EVAL-FIRST BUILD ORDER

RUBRIC (define pass/fail) → DATASET (20-50 examples) → BASELINE (score current state) → BUILD (prompt + pipeline) → EVAL (rubric pass gate) → SHIP (monitor + retrain)

Rubric and dataset exist before line 1 of code.


When eval-first is worth the overhead — and when it is overkill

Worth the overhead:

Production systems running 500+ times per month. Volume makes regression failures expensive. An eval suite catches drift before it costs you.
Customer-facing outputs. Any system where a bad output reaches a real user: support responses, generated reports, AI-authored content. The eval rubric is your quality gate.
Builds with multiple prompt versions in flight. When you are iterating on prompts, you need a regression suite to know if v2 broke something v1 handled correctly. LangSmith and Braintrust make this automatic.
Regulated or high-stakes domains. Legal, medical, financial outputs. The rubric doubles as audit evidence. OpenAI Evals and promptfoo both export machine-readable results.

Overkill:

One-off internal scripts running under 50 times per month. Manual spot-checking is faster and cheaper. Save eval infrastructure for volume.
Exploratory prototypes before you know the use case. Writing a rubric for an undefined use case produces a useless rubric. Get the use case right first, then instrument.
Builds where the output is purely subjective. If 10 reasonable people would give 10 different scores, an automated eval rubric will not help. Use human reviewers instead.

Questions people actually ask


Q01 What is an eval rubric in the context of AI builds?

A written document that defines what a good output looks like, what a bad output looks like, and how to score the difference. A rubric has named criteria (accuracy, format compliance, tone, citation correctness), a scoring scale per criterion (pass/fail or 1–5), and at least 20 example outputs already scored. It is the AI equivalent of a unit test suite — it tells you objectively whether the system works.

Q02 What is a golden dataset and how big does it need to be?

A golden dataset is a set of input/output pairs where the expected output has been hand-labeled by a human who understands the task. For most custom AI builds, 20 to 50 examples is enough to get started. You need enough variety to cover the main edge cases and failure modes, not statistical significance. Braintrust, LangSmith, and promptfoo all have dataset management built in — you do not need to build this infrastructure yourself.

Q03 What is LLM-as-judge evaluation?

Using a separate language model — often a more capable or cheaper one than the production model — to score the outputs of your main system against the rubric. For example: run your pipeline with GPT-4o, then pass each output to Claude Opus with the rubric as a system prompt and have it return a structured score. This scales to thousands of examples without human review. The risk is that the judge model has its own biases, so you calibrate it against your golden dataset first.

Q04 Which eval tools should I consider for a custom AI build?

Three tools cover most use cases. Braintrust is the most feature-complete: dataset versioning, prompt experimentation, LLM-as-judge, and a CI integration — good for teams that want a hosted platform. LangSmith (from LangChain) integrates tightly with LangChain and LangGraph pipelines and has strong tracing alongside evaluation. promptfoo is open-source, config-driven, and runs in CI without a hosted service — good if you want full control and no vendor dependency. OpenAI Evals is narrower — best if your pipeline is entirely within the OpenAI API surface.

Q05 How do I write an eval rubric if I do not know what good looks like yet?

Start with 10 real examples of the input you expect the system to handle. For each, write what a good output would look like and what would make it bad. Patterns emerge fast. By example 7 or 8 you will have implicit criteria you can make explicit. Then formalize: name the criteria, pick a scoring scale, get a second person to score the same examples independently, and reconcile disagreements. The calibration conversation is where the rubric actually sharpens.

Q06 Can I add eval-first discipline to a build that is already in production?

Yes, though it is harder. Start by sampling 50 recent outputs from the live system and scoring them manually. That becomes your retrospective golden dataset. Then write the rubric backwards from what you found. Then run the eval suite on the current system state to establish a baseline. Anything above that baseline on the next version is an improvement. Anything below is a regression. LangSmith makes this retroactive instrumentation easier than most tools because you can ingest historical traces.

Q07 How does eval-first connect to ongoing monitoring once the build ships?

The eval rubric is also the monitoring spec. You run a subset of the golden dataset against the live system on a schedule (daily or weekly, depending on volume) and alert when the pass rate drops. This is how you detect prompt drift, model deprecation side effects, and data distribution shifts without waiting for a user complaint. Braintrust and LangSmith both support scheduled eval runs against production.


Niko Alho

I run agentic SEO and build custom AI for B2B companies. Based in Turku.

