
Evals as a first-class artifact
Teams that test their conventional code but run AI features on intuition are operating a double standard with expensive consequences. Evaluations need the same status as tests: versioned, automated, and treated as a gate — not an afterthought.
Software teams have spent decades building the discipline of automated testing. They run unit tests, integration tests, and end-to-end tests on every change. They treat a failing test as a blocker, not an inconvenience. Then they introduce an LLM-powered feature and evaluate it by looking at a few outputs and saying 'that seems fine.' The asymmetry is striking and the consequences are predictable: regressions slip through, prompt changes feel like coin flips, and 'is this better?' becomes an unresolvable argument.
Evaluations — structured, repeatable, automated assessments of model output quality — are the testing discipline applied to the probabilistic part of the system. They deserve exactly the same status as tests: versioned with the code, run automatically in CI, and used to gate releases. The teams that treat them that way ship better AI features and change them with more confidence. The ones that don't are flying blind.
What an eval set actually is — and isn't
An eval set is a collection of representative inputs paired with a consistent method for judging the outputs. It is not a benchmark leaderboard, not a capability showcase, and not a set of inputs you chose because your system handles them gracefully. A good eval set is uncomfortable to look at: it contains the awkward inputs, the adversarial phrasings, the boundary cases that appeared in production last month, and the failure modes you've already fixed once and don't want to see again.
The judge — the mechanism that scores a given output — depends on the task. For classification, structured extraction, or code generation with a verifiable result, deterministic rule-based scoring is precise and fast. For summarization, question-answering, or any task with subjective quality, a rubric evaluated by a second model (a consistent, documented judge prompt applied to every candidate output) gives you a proxy score you can track and compare. For tasks where the cost of errors is high, a human spot-check on a stratified sample anchors the automated score in ground truth.
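To make the deterministic end of that spectrum concrete, here is a minimal sketch of a rule-based scorer for structured extraction. The `EvalCase` type, the field names, and the exact-match scoring rule are all illustrative assumptions, not a prescribed schema; a real task might need normalization or partial credit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """One eval example: an input plus the fields a correct extraction must contain."""
    case_id: str
    input_text: str
    expected: dict


def score_extraction(expected: dict, actual: dict) -> float:
    """Return the fraction of expected fields reproduced exactly (0.0 to 1.0)."""
    if not expected:
        return 1.0
    matched = sum(1 for key, value in expected.items() if actual.get(key) == value)
    return matched / len(expected)


# Hypothetical case; the identifiers and values are made up for illustration.
case = EvalCase(
    case_id="invoice-017",
    input_text="Invoice 4471 from Acme Ltd, total due 1,250.00 EUR",
    expected={"invoice_number": "4471", "vendor": "Acme Ltd", "currency": "EUR"},
)

# Stand-in for a model output; in a real harness this comes from the system under test.
model_output = {"invoice_number": "4471", "vendor": "ACME", "currency": "EUR"}

score = score_extraction(case.expected, model_output)
print(f"{case.case_id}: {score:.2f}")
```

Exact matching is deliberately strict here: the vendor mismatch costs a third of the score, which is the kind of signal a quick look at the output tends to miss.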
Start smaller than feels adequate. Twenty-five well-chosen, genuinely representative cases with careful annotations will reveal more than two hundred synthetic cases generated by asking the model to produce variations of your happy-path example. Synthetic data can fill gaps once the real distribution is established, but it's a poor substitute for starting with the actual inputs your system will face.
The mechanics of running evals in CI
Running evals automatically is not technically complicated, but it does require a few deliberate decisions. The eval runner needs access to the same versioned prompt templates and configuration that production uses. If evals run against a separate, manually maintained copy of the prompts, they will tell you about that copy's behavior, not production's. Tight coupling between the eval harness and the production codebase is a feature, not a smell.
Scoring outputs with a model judge introduces its own variance. A single judge call on a single output is noisy; the score on a given example can shift between runs without any change to the system under test. Mitigate this by averaging multiple judge calls for ambiguous examples, pinning the judge model version, and tracking score distributions rather than single-point results. A change that shifts the mean by two points within a known variance band is not the same as a change that shifts it by two points outside that band.
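The variance argument can be sketched in a few lines. This assumes a judge is any callable that returns a numeric score, and it uses "more than k standard deviations of the baseline runs" as the noise band; both the `fake_judge` stand-in and the choice of k = 2 are assumptions for illustration, not a recommended statistical test.

```python
import itertools
import statistics


def mean_judge_score(judge, output: str, calls: int = 5) -> float:
    """Average several judge calls on one output to damp per-call noise."""
    return statistics.mean([judge(output) for _ in range(calls)])


def outside_noise_band(baseline_runs, candidate_runs, k: float = 2.0) -> bool:
    """True if the candidate mean is more than k baseline standard deviations
    away from the baseline mean. k = 2 is an assumed rule of thumb."""
    band = k * statistics.stdev(baseline_runs)
    shift = statistics.mean(candidate_runs) - statistics.mean(baseline_runs)
    return abs(shift) > band


# Hypothetical judge that returns a slightly different score on each call.
_fake_scores = itertools.cycle([4.0, 5.0, 3.0, 4.0, 4.0])


def fake_judge(output: str) -> float:
    return next(_fake_scores)


averaged = mean_judge_score(fake_judge, "candidate summary", calls=5)

# The same two-point drop, judged against a quiet and a noisy baseline.
quiet_baseline = [90.0, 91.0, 92.0, 91.0, 91.0]
noisy_baseline = [88.0, 94.0, 91.0, 89.0, 93.0]
candidate = [89.0, 89.5, 88.5, 89.0, 89.0]

print(averaged)
print(outside_noise_band(quiet_baseline, candidate))
print(outside_noise_band(noisy_baseline, candidate))
```

Both comparisons involve a drop from 91 to 89. Against the quiet baseline the drop falls outside the band and deserves attention; against the noisy one it is indistinguishable from run-to-run variation.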
Establish thresholds before you need them. A score of 87 on your extraction rubric means nothing in isolation; it means something when your baseline from last month was 91 and CI is configured to flag drops greater than three points. Set those thresholds as soon as you have enough history to make them meaningful, and treat them as revisable: they should reflect what 'good enough' means for this feature at this stage, not an abstract ideal.
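A threshold gate of this kind is small enough to sketch in full. The function names and the convention of returning an exit code are assumptions; the numbers are the ones used in the paragraph above.

```python
def passes_gate(baseline: float, current: float, max_drop: float) -> bool:
    """True when the score has not fallen by more than max_drop points."""
    return (baseline - current) <= max_drop


def ci_exit_code(baseline: float, current: float, max_drop: float = 3.0) -> int:
    """Print a verdict and return 0 (pass) or 1 (fail) for the CI job."""
    drop = baseline - current
    ok = passes_gate(baseline, current, max_drop)
    verdict = "PASS" if ok else "FAIL"
    print(f"{verdict}: baseline={baseline:.1f} current={current:.1f} "
          f"drop={drop:.1f} allowed={max_drop:.1f}")
    return 0 if ok else 1


# The figures from the text: baseline 91, current 87, allowed drop 3.
exit_code = ci_exit_code(baseline=91.0, current=87.0)
# A real CI entry point would finish by passing exit_code to sys.exit().
```

Keeping `max_drop` as a parameter rather than a constant makes it easy to store the threshold in versioned configuration next to the eval set and revise it in a reviewed change.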
Evals change the entire development loop
The effect of having a reliable eval set isn't just that you catch regressions — it's that the entire way you work on AI features changes. Prompt changes become experiments with measurable outcomes instead of intuitions you hope hold. Model upgrades become straightforward comparisons on the same fixed evaluation set rather than qualitative judgment calls. Retrieval tuning — adjusting chunk size, reranking strategy, embedding model — has a quantitative scoreboard.
This matters especially for the conversations that involve people outside the engineering team. Product owners, stakeholders, and clients who want to know whether a new model version is worth the migration cost can look at the same eval results the engineering team looks at. 'This improved our extraction accuracy on the target rubric by four points on our validation set, with no regression on the adversarial cases' is a different kind of claim than 'we think it's better.' It is a claim that can be scrutinized, reproduced, and challenged — which makes it trustworthy.
It also surfaces problems earlier. A prompt change that helps on the common cases but degrades on a specific failure mode you fixed three months ago is invisible to anyone eyeballing a few outputs. It shows up immediately in an eval set that includes that failure mode. The earlier you catch a regression, the cheaper it is to fix.
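One way to make that visible is to report scores per tagged slice as well as overall. The sketch below assumes each eval result carries a tag naming its slice; the tags and scores are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_by_tag(results):
    """results: iterable of (tag, score) pairs. Returns {tag: mean score}."""
    grouped = defaultdict(list)
    for tag, score in results:
        grouped[tag].append(score)
    return {tag: mean(scores) for tag, scores in grouped.items()}


# Hypothetical scores before and after a prompt change.
before = [("common", 0.80), ("common", 0.82), ("common", 0.78), ("date-parsing", 0.90)]
after = [("common", 0.90), ("common", 0.92), ("common", 0.88), ("date-parsing", 0.70)]

overall_before = mean(score for _, score in before)
overall_after = mean(score for _, score in after)

slices_before = mean_by_tag(before)
slices_after = mean_by_tag(after)
deltas = {tag: slices_after[tag] - slices_before[tag] for tag in slices_before}

print(f"overall: {overall_before:.3f} -> {overall_after:.3f}")
for tag, delta in deltas.items():
    print(f"{tag}: {delta:+.2f}")
```

In this made-up data the overall mean improves while the "date-parsing" slice gets worse, which is exactly the pattern that a single aggregate number or a handful of eyeballed outputs would hide.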
Keeping eval sets honest over time
Eval sets rot in two ways. The first is coverage rot: the real input distribution shifts as users discover new ways to use the system, but the eval set stays frozen at the inputs you imagined when you first built the feature. The result is a score that looks stable while quality in production slowly degrades. Fix this by routing a sample of real production inputs into a triage queue, reviewing them weekly, and adding the instructive ones — especially failures and near-misses — to the eval set.
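Routing a sample into a triage queue can be as simple as hashing a request identifier, so that the same request is always sampled or always skipped. This sketch assumes each request has a stable string id and that something upstream can flag failures or near-misses; the function names and the 5% rate are hypothetical.

```python
import hashlib


def in_triage_sample(request_id: str, rate: float) -> bool:
    """Deterministically select about `rate` of requests by hashing the id."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a value in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def should_triage(request_id: str, flagged: bool, rate: float = 0.05) -> bool:
    """Always queue flagged failures or near-misses; sample the rest."""
    return flagged or in_triage_sample(request_id, rate)


request_ids = [f"req-{n}" for n in range(10_000)]
queued = [rid for rid in request_ids if should_triage(rid, flagged=False)]
print(len(queued) / len(request_ids))  # close to the 0.05 sampling rate
```

Hash-based sampling avoids storing sampling decisions and stays reproducible across services. The weekly human review of the queue remains the step that decides what actually enters the eval set.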
The second is annotation rot: as your understanding of what 'good' means evolves, older annotations become inconsistent with newer ones. A judge rubric written before you discovered a particular failure mode may not penalize that failure mode clearly. Audit annotations periodically; when you update the rubric, re-score the affected examples so the set remains internally consistent.
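A lightweight way to keep track of which examples need re-scoring is to record the rubric version alongside each annotation. The sketch below assumes that convention; the `Annotation` type and the version strings are hypothetical, and it treats every annotation made under an older rubric as affected, which is a conservative simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A stored judgment for one eval case, stamped with the rubric it used."""
    case_id: str
    rubric_version: str
    score: float


def needs_rescoring(annotations, current_version: str) -> list:
    """Return the case ids whose annotation predates the current rubric."""
    return [a.case_id for a in annotations if a.rubric_version != current_version]


# Hypothetical annotation records.
annotations = [
    Annotation("case-001", "v2", 0.9),
    Annotation("case-002", "v3", 0.7),
    Annotation("case-003", "v1", 1.0),
]

stale = needs_rescoring(annotations, current_version="v3")
print(stale)
```

Running a check like this in CI turns "audit annotations periodically" into a visible list of cases that are out of date with the rubric.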
The goal is an eval set that grows toward the real distribution of production traffic, with annotations that reflect your current understanding of quality. That living eval set is the most important artifact in your AI feature's engineering history. Treat it accordingly: review it in pull requests, track its coverage, and resist the temptation to prune examples that are inconvenient because your system struggles with them.
What good looks like
A mature eval practice looks like this: every AI feature has an eval set in the repository, living alongside the code and prompt templates it tests. Every pull request that touches a prompt, a retrieval configuration, or the model version runs the eval suite and posts a score comparison to the PR. Score drops above a defined threshold block the merge. The eval set is reviewed on a fixed cadence, and new production failure modes are added within a week of discovery. The judge model and rubric version are pinned and documented.
This is not a high bar. It is the minimum viable testing discipline for a production system whose behavior is nondeterministic. Teams that clear it ship AI features they can reason about, change with confidence, and explain to stakeholders with evidence. Teams that fall short of it ship features that work until they don't, with no reliable way to tell the difference before users find out.