LLM evaluation: a practical playbook for production agents
Most teams never get past vibes-based LLM evaluation. Here is the evaluation harness we run on every production agent — golden sets, three-layer scoring, and when to stop adding coverage.

Every team we talk to is running some form of LLM evaluation. Most of them are running it wrong. Usually the issue is not the model — it is the measurement. This is the LLM evaluation playbook we run on every production agent, stripped of the parts that only sound good in conference talks.
Start with a golden set, not a metric
The first artifact in any LLM evaluation is a golden set: 40 hand-curated examples that represent the shape of your traffic. Not 400, not 4,000, but 40: small enough that a human can actually read every example, large enough to catch category-level regressions. Every time a production bug surfaces, the failing example goes into the golden set.
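For concreteness, here is one way the golden set can live on disk: a JSONL file with one example per line and a helper that turns a production bug into a permanent entry. This is a minimal sketch, not our exact harness; the file path and field names (prompt, expected_tool, expected_answer, category, source) are illustrative.

```python
import json
from pathlib import Path

GOLDEN_SET_PATH = Path("golden_set.jsonl")  # illustrative location

def add_golden_example(prompt: str, expected_tool: str | None,
                       expected_answer: str | None, category: str,
                       source: str = "production-bug") -> None:
    """Append a failing case so it becomes a permanent regression test."""
    example = {
        "prompt": prompt,
        "expected_tool": expected_tool,
        "expected_answer": expected_answer,
        "category": category,
        "source": source,
    }
    with GOLDEN_SET_PATH.open("a", encoding="utf-8") as f:
        f.write(json.dumps(example) + "\n")

def load_golden_set() -> list[dict]:
    """Read the set back for an eval run; at ~40 entries it stays human-readable."""
    with GOLDEN_SET_PATH.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f]
```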

The three-layer scoring model for LLM agents
We score every agent response at three layers. Hard constraints — did it call the right tool, did the output validate against the schema. Correctness — for verifiable tasks, is the answer actually right. Judgment — did a second model rate the response usable. The layers are not weighted: a failure on any layer is a failure.
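A sketch of what that scoring can look like in code, assuming each golden-set entry carries the expected tool and, where the task is verifiable, the expected answer. The judge is passed in as a callable because the judging model and prompt are deployment-specific; the schema check is a stand-in for whatever validator you already use, and every name here is a placeholder.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class LayerResult:
    layer: str
    passed: bool

def validate_schema(output, required_keys=("answer",)) -> bool:
    """Stand-in schema check; a real harness would use jsonschema or Pydantic."""
    return isinstance(output, dict) and all(k in output for k in required_keys)

def score_response(response: dict, expected: dict,
                   judge: Optional[Callable[[dict, dict], bool]] = None) -> list[LayerResult]:
    results = []

    # Layer 1: hard constraints -- right tool called, output validates against the schema.
    tool_ok = response.get("tool_called") == expected.get("expected_tool")
    schema_ok = validate_schema(response.get("output"))
    results.append(LayerResult("hard_constraints", tool_ok and schema_ok))

    # Layer 2: correctness -- only scored when the task has a verifiable answer.
    if expected.get("expected_answer") is not None:
        answer = (response.get("output") or {}).get("answer")
        results.append(LayerResult("correctness", answer == expected["expected_answer"]))

    # Layer 3: judgment -- a second model rates whether the response is usable.
    if judge is not None:
        results.append(LayerResult("judgment", bool(judge(response, expected))))

    return results

def response_passes(results: list[LayerResult]) -> bool:
    # Layers are not weighted: a failure on any layer fails the whole response.
    return all(r.passed for r in results)
```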


When to stop evaluating and start listening to production
More eval is not always better. Once the agent passes the golden set at >95%, the next regression will almost certainly come from a category you did not predict, not from a marginal drop in accuracy on categories you already cover. That is the point to stop adding coverage and start feeding production telemetry back into the golden set.
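One way that feedback loop can look, assuming your telemetry can tag a trace as flagged (user thumbs-down, schema failure, escalation) with a rough category. The trace fields and the per-category cap are assumptions for illustration, not a fixed interface; the cap just keeps one noisy failure mode from crowding out the rest of the set.

```python
def promote_flagged_traces(traces: list[dict], golden_set: list[dict],
                           max_per_category: int = 3) -> list[dict]:
    """Promote flagged production traces into the golden set, capped per category."""
    per_category: dict[str, int] = {}
    for trace in traces:
        if not trace.get("flagged"):
            continue
        category = trace.get("category", "uncategorized")
        if per_category.get(category, 0) >= max_per_category:
            continue
        golden_set.append({
            "prompt": trace["prompt"],
            "expected_tool": trace.get("expected_tool"),
            "expected_answer": None,  # often unknown at first; filled in during triage
            "category": category,
            "source": "production-telemetry",
        })
        per_category[category] = per_category.get(category, 0) + 1
    return golden_set
```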
“The eval harness is a forcing function for understanding your own product. If you cannot write the test, you do not know the feature well enough to ship it.”
