Evals Aren't a Benchmark Suite. They're a Habit of Looking at Your Data.

Evals are how teams systematically improve their AI products — domain-specific assertions written after reading your traces and naming the failure modes, not benchmarks you import. The work is error analysis: open-code each trace, axial-code the notes into named failure modes, write code that fails when those modes recur. A vendor can sell you a runner. Nobody outside your team can author the evals themselves.

Evals are a systematic approach to improving your AI products. — Hamel Husain

Most teams adopting AI ask the same question first: which eval framework should we use?

Wrong question. The framework runs the evals. It does not write them. The writing is the part that matters, and it is the part every team skips because it does not feel like engineering — it feels like research.

This is the conflation that wrecks most AI rollouts. Benchmarks and evals get used interchangeably. They are not the same thing, and treating them the same is how teams end up with a green dashboard while production keeps producing the same wrong outputs.

Benchmark Is Not Eval

A benchmark is generic. MMLU, HumanEval, SWE-bench, GAIA: these are problem sets someone else curated to compare models in the abstract. They tell you which model is stronger on a population of tasks, none of which are yours.

An eval is yours. It is an assertion about how your AI should behave on your data, your prompts, your edge cases. It is written by someone who has read your traces and named what went wrong. Nobody outside your team can write it because nobody else has the traces, the users, or the product definition of "wrong."

When teams say they want to "add evals," they often mean they want to import a benchmark. The two get conflated because both produce a score. Only one tells you whether the feature works for the user in front of you.

The benchmark says: this model is competent at code generation in general. The eval says: this feature fails to refuse out-of-scope requests at a rate you can measure, and the failures cluster on a specific input pattern your users actually hit. One of those numbers will tell you whether to ship.

The Work Is Looking at Your Data

The single highest-leverage activity in evals is the one nobody puts on a roadmap: pull a hundred traces from production and read them.

Hamel Husain — whose AI Evals For Engineers course is the canonical curriculum on this — calls this error analysis, and he names two specific steps borrowed from qualitative research. Open coding: read each trace, write a free-form note on what went wrong, no taxonomy yet. Axial coding: take those notes and group them into a small set of named failure modes. The output of an hour of this is rarely an eval. It is a written taxonomy of failures, ranked by frequency, that the team can point at.

Most teams skip axial coding. They look at traces, feel concerned, and reach for a tool. The taxonomy is the part that turns "I saw some bad outputs" into "we have five named failure modes and we know which one happens most." Without it, every eval written is an opinion. With it, every eval is the encoding of a category you have already seen.
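
As a rough illustration of what the taxonomy can look like once it leaves the notebook, here is a minimal sketch. It assumes each reviewed trace carries a free-form note from the first pass and a failure-mode label from the second; the `CodedTrace` type, its field names, and the example labels are hypothetical, not part of any particular tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class CodedTrace:
    trace_id: str
    open_note: str      # free-form note from the open-coding pass
    failure_mode: str   # label from the axial-coding pass ("" if the trace was fine)


def rank_failure_modes(coded: list[CodedTrace]) -> list[tuple[str, int, float]]:
    """Return (mode, count, share of all reviewed traces), most frequent first."""
    counts = Counter(t.failure_mode for t in coded if t.failure_mode)
    total = len(coded)
    return [(mode, n, n / total) for mode, n in counts.most_common()]


# Hypothetical notes from one reading session; the labels are illustrative only.
coded = [
    CodedTrace("t1", "answered a billing question it should have declined", "out_of_scope_not_refused"),
    CodedTrace("t2", "cited a doc that is not in the corpus", "fabricated_citation"),
    CodedTrace("t3", "fine", ""),
    CodedTrace("t4", "gave legal advice instead of refusing", "out_of_scope_not_refused"),
]

for mode, count, share in rank_failure_modes(coded):
    print(f"{mode}: {count} traces ({share:.0%})")
```

The code is trivial on purpose. The value is in the labels, which only exist because someone read the traces.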

If you cannot write the eval, you do not understand the problem. The reverse holds: once you have looked at enough data and named the failures, the eval almost writes itself. The artefact you are building is the encoding of that understanding. Skipping the looking means encoding nothing.

Don't outsource this. The person who owns the product — engineer, PM, tech lead — has to read the traces themselves. A contractor can run the platform. A junior can label a dataset. Only the owner can decide which failure modes matter for this product, in this quarter, against this user base. Outsourcing error analysis is how teams end up with comprehensive eval suites that miss the only failure mode the business cares about.

The Three-Step Loop, Not the Setup Step

Evals are not a launch checklist item. They are an improvement discipline whose canonical shape is three steps the team turns continuously, driven by fresh data. Analyze. Measure. Improve.

Figure 1: The three-step loop (Analyze, Measure, Improve), drawn as a continuous cycle. The eval suite is the residue of the loop, not the entry point.

Analyze is the open coding plus axial coding work from the section above — collect representative traces, read them, name the failure modes. The output is a written taxonomy, not a number.

Measure is the step that turns the qualitative taxonomy into quantitative metrics. Each named failure mode becomes an assertion — code-based, LLM-as-judge, or human-rubric — that produces a number on every run. This is the step where the eval suite actually accumulates.
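
One way to picture "a number on every run" is a table of failure rates, one per named mode. The sketch below assumes traces are plain dictionaries with `prompt` and `response` keys and that each assertion is a function returning True when its failure mode is present. The mode names, the field names, and the 200-character budget are made up for illustration.

```python
from typing import Callable

# An assertion takes one trace and returns True when the failure mode is present.
Assertion = Callable[[dict], bool]


def failure_rates(traces: list[dict], assertions: dict[str, Assertion]) -> dict[str, float]:
    """Fraction of traces on which each named failure mode fires."""
    if not traces:
        raise ValueError("no traces to measure")
    return {
        mode: sum(1 for t in traces if check(t)) / len(traces)
        for mode, check in assertions.items()
    }


# Hypothetical failure modes and trace fields, for illustration only.
assertions: dict[str, Assertion] = {
    "empty_response": lambda t: not t["response"].strip(),
    "over_length_budget": lambda t: len(t["response"]) > 200,
}

traces = [
    {"prompt": "reset my password", "response": "Go to Settings, then Security."},
    {"prompt": "cancel my plan", "response": "   "},
    {"prompt": "export my data", "response": "x" * 500},
    {"prompt": "change my email", "response": "Open Profile and edit the email field."},
]

rates = failure_rates(traces, assertions)
print(rates)
```

The same shape works when an assertion wraps a judge prompt or a human label instead of a code check: the caller still gets one rate per failure mode.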

Improve is the engineering work the suite enables — refine the prompt, swap the model, restructure the pipeline, add retrieval, decompose the task. Then back to Analyze, on fresh traces, to see what the change moved and what it broke.

Each loop teaches the team something the suite did not previously cover. The suite grows in lockstep with what has been observed. It never reaches done. The model drifts, users find new edge cases, the product adds features — and the looking continues.

The teams that treat evals as a setup step run Analyze once, build a suite, watch it pass, and stop. Production keeps producing the same wrong outputs; the dashboard stays green. The failure mode the team didn't encode is the failure mode the team will keep shipping. Generic evals — imported benchmarks, off-the-shelf metric packs — are the worst version of this: a suite that passes uniformly because it was never about your product to begin with.

Three Rungs of Eval. Climb Down.

The Three Rungs are the substance of the Measure step. Not every eval can be code. But you should fight to write every eval as code.

Figure 2: The three rungs of eval. Code-based assertions sit on the bottom rung (fast, cheap, deterministic), LLM-as-judge in the middle (slower, drifts, costs money), and human review on top (expensive, does not scale). Reach for the bottom rung first.

Code-based assertions are the bottom rung. Whenever the rule can be expressed as code, write code. JSON schema validity. No PII in output. Output stays under length budget. Tool calls match the registered shape. Citations point to documents that actually exist in the corpus. These run in milliseconds, cost next to nothing, never drift, and fail the same way every time.
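
A few of these checks, sketched with nothing but the standard library. Every function name here is hypothetical, the 1200-character budget is an arbitrary placeholder, and the email pattern is deliberately naive: it stands in for a real PII detector rather than replacing one.

```python
import json
import re

# Deliberately simple pattern; real PII detection needs more than one regex.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def is_valid_json_object(output: str) -> bool:
    """True when the output parses as JSON and the top level is an object."""
    try:
        return isinstance(json.loads(output), dict)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


def contains_email(output: str) -> bool:
    """True when something shaped like an email address appears in the output."""
    return EMAIL_RE.search(output) is not None


def within_length_budget(output: str, max_chars: int = 1200) -> bool:
    """True when the output fits the (assumed) character budget."""
    return len(output) <= max_chars


def unknown_citations(cited_ids: list[str], corpus_ids: set[str]) -> list[str]:
    """Cited document ids that do not exist in the corpus."""
    return [doc_id for doc_id in cited_ids if doc_id not in corpus_ids]


# Hypothetical model output and corpus, for illustration only.
output = '{"answer": "See the refund policy.", "citations": ["doc-12", "doc-99"]}'
corpus_ids = {"doc-12", "doc-40"}

print(is_valid_json_object(output))
print(contains_email(output))
print(unknown_citations(json.loads(output)["citations"], corpus_ids))
```

Each function is small enough to read in one glance, which is the point of this rung: when one fails, nobody has to argue about why.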

LLM-as-judge is the middle rung. Use it when the property is genuinely fuzzy — "is the response on topic", "does this answer the user's question", "does the tone match the brand". A judge prompt grades each output; you sample-validate judge agreement against a human on a small set, then run it at scale. Cheaper than humans. Slower than code. Drifts when you change the judge model. Useful, but always second choice.
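
The sample-validation step can be as small as the sketch below. It assumes you already have two boolean label lists for the same outputs, one from a human and one from the judge, and it makes no call to any model. Cohen's kappa is used here as one common chance-corrected agreement statistic; that choice is an assumption of this example, not a prescription, and the labels are invented.

```python
import math


def agreement_and_kappa(human: list[bool], judge: list[bool]) -> tuple[float, float]:
    """Raw agreement and Cohen's kappa between human and judge labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    p_h = sum(human) / n  # share of outputs the human flagged
    p_j = sum(judge) / n  # share of outputs the judge flagged
    expected = p_h * p_j + (1 - p_h) * (1 - p_j)  # agreement expected by chance
    if expected == 1.0:
        return observed, math.nan  # kappa is undefined when only one label is ever used
    return observed, (observed - expected) / (1 - expected)


# Hypothetical labels on ten outputs: True means "this output has the failure mode".
human = [True, True, False, False, True, False, True, False, False, False]
judge = [True, True, False, False, False, False, True, True, False, False]

observed, kappa = agreement_and_kappa(human, judge)
print(f"agreement={observed:.2f} kappa={kappa:.2f}")
```

Raw agreement alone can flatter a judge when one label dominates, which is why the chance-corrected number is worth printing next to it. What counts as good enough is a team decision.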

Human review is the top rung. Reserve it for ambiguous, high-stakes, novel failure modes you have not yet learned to encode. The output of human review is rarely the final eval — it is the input that lets you write a code-based or judge-based assertion next time.

The rule is climb down. Every property starts on the lowest rung that can express it today. Each cycle of Analyze → Measure → Improve is a chance to push it one rung lower. A human-review property becomes a judge property once you have enough examples to write a calibrated judge prompt. A judge property becomes a code property the day you find a regex or schema that catches the mode reliably. The suite gets cheaper, faster, and more deterministic over time.
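
"Catches the mode reliably" can be checked rather than assumed. The sketch below scores a candidate regex against outputs that a judge or a human has already labelled, using precision and recall. The failure mode, the pattern, and the labelled examples are all hypothetical.

```python
import re


def precision_recall(pattern: re.Pattern, examples: list[tuple[str, bool]]) -> tuple[float, float]:
    """Score a candidate code check against already-labelled outputs.

    Each example is (output_text, has_failure_mode).
    """
    tp = sum(1 for text, bad in examples if bad and pattern.search(text))
    fp = sum(1 for text, bad in examples if not bad and pattern.search(text))
    fn = sum(1 for text, bad in examples if bad and not pattern.search(text))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical failure mode: the assistant leaks boilerplate about being an AI model.
candidate = re.compile(r"\bas an ai (language )?model\b", re.IGNORECASE)

labelled = [
    ("As an AI language model, I cannot access your account.", True),
    ("As an AI model I do not have opinions.", True),
    ("I'm just a language model, so I can't check that.", True),
    ("Your order ships on Tuesday.", False),
    ("Our AI model catalogue is listed on the pricing page.", False),
]

precision, recall = precision_recall(candidate, labelled)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy set the pattern never fires on a good output but misses one phrasing of the failure, so the property is not ready to leave the judge rung yet. The labelled examples produced higher up the ladder are what make that call possible.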

The Eval Is the Spec

The reason no vendor can write your evals is the same reason no vendor can write your tests.

Tests encode the behaviour the codebase must preserve. Evals encode the behaviour the AI feature must produce. Both are specifications written in executable form. Both are durable in a way that tickets, Notion pages, and Slack threads are not. Both belong to the team that owns the product, because they are the team's understanding of what "correct" means for this product, made executable.

This is the part that flips when teams adopt AI seriously. Before, the team's understanding of the product lived in tickets and code. Tickets were ephemeral; code was the durable record. After, the model is generating the code — so the durable record of what the product should do has to live somewhere the model reads on every iteration. Tests are one half of that record for code-writing agents. Evals are the other half for everything else the AI does. For a deeper look at how this maps to the AI agents writing inside your own delivery loop, see evals for engineering agents.

If the eval is not written, the behaviour is not specified. The model does what it does. You hope it keeps doing that. Hope is not a release gate.

What Most Teams Skip — and Why

The looking. Always the looking.

It feels manual. It feels slow. It feels like research, not engineering. An hour spent reading traces produces no commit, no PR, no shipped feature. The temptation is to skip to the part where you wire up the eval platform, watch the dashboard light up green, and call it discipline.

The skip is the failure mode that produces every other failure. A suite written without trace observation encodes the team's assumptions about how the system fails, not the failures the system actually produces. The eval platform without the looking is a thermometer in an empty room — the reading is precise and meaningless. Anything not in the suite is, for production purposes, not part of the spec. And anything not in the suite is anything you have not looked at long enough to encode.

The cost is not bugs. The cost is a feedback loop that lies to you. The team ships with confidence because the suite passes. The suite passes because it was written without the looking. Production produces wrong outputs at a rate no one is measuring, in modes no one is monitoring, until a user escalation or a postmortem reveals what the suite was never asked to check.

What To Do This Week

Five steps, in order. Each one is a gate, not a suggestion.

1. Pull a hundred traces from the last week of production. Real prompts, real responses, real users. Do not synthesise. If you cannot pull traces, your logging is the bug. Fix that first; evals are not possible without trace data.
2. Open-code, then axial-code. First pass: read each trace, write a free-form note on what went wrong. Second pass: group those notes into a small named taxonomy of failure modes, ranked by frequency. One engineer, one to two hours. No eval work proceeds until the taxonomy exists: the taxonomy is the gate, not the suggestion. The person who owns the product does this themselves; it is not delegated.
3. For each of the top three modes, commit a code-based assertion to the repo that fires when the mode recurs. If you cannot express it in code yet, ship an LLM-as-judge prompt with a "convert to code" issue filed against next sprint. An eval that is not in version control does not count.
4. Wire the assertions into CI on every prompt change, every model swap, every retrieval index rebuild. Block the deploy on regression against the last passing baseline. An assertion that is not gating a deploy is, for AI iteration purposes, not there.
5. Put the next looking session on the team calendar as a recurring event before you leave the room. Thirty minutes, fresh traces, same engineer or rotated. If the slot is not on the calendar, the habit is not a habit. It is an aspiration, and aspirations lose to delivery pressure every sprint.
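
The deploy gate in the fourth step can start as a comparison of per-mode failure rates against the last passing run. This is a minimal sketch under stated assumptions: rates are stored as a mapping from failure-mode name to a fraction, the one-percentage-point tolerance is arbitrary, and the function names and numbers are hypothetical.

```python
def regressions(baseline: dict[str, float], current: dict[str, float],
                tolerance: float = 0.01) -> list[str]:
    """Failure modes whose rate rose by more than `tolerance` over the baseline."""
    failed = []
    for mode, rate in current.items():
        previous = baseline.get(mode)
        if previous is None:
            continue  # newly added assertion: no baseline yet, so it cannot regress
        if rate > previous + tolerance:
            failed.append(f"{mode}: {previous:.1%} -> {rate:.1%}")
    return failed


def gate(baseline: dict[str, float], current: dict[str, float]) -> int:
    """Exit code for CI: 0 passes, 1 blocks the deploy."""
    problems = regressions(baseline, current)
    for line in problems:
        print("REGRESSION", line)
    return 1 if problems else 0


# Hypothetical rates from the last passing run and from the run under test.
baseline = {"out_of_scope_not_refused": 0.08, "fabricated_citation": 0.02}
current = {"out_of_scope_not_refused": 0.05, "fabricated_citation": 0.06, "empty_response": 0.01}

exit_code = gate(baseline, current)
```

In a CI job the script would end with `sys.exit(exit_code)` so that a non-zero code fails the pipeline. A newly added mode has no baseline to regress against, so its first passing run becomes the baseline.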

The vendor sells you the runner. Steps one through five are what you cannot buy.

The Bottom Line

The teams winning with AI are not the ones with the best eval tools. They are the ones who have made looking at their own data a recurring engineering ritual, and whose eval suite is the residue of that ritual.

The teams losing are running benchmarks against models they did not train, watching numbers that were never about their product, on dashboards that go green while their AI ships the same failure mode for the third month running. The cost is not a bug they will fix in a patch. It is a feedback loop that has quietly stopped working, on a system everyone has decided to trust.

If you cannot write the eval, you do not understand the problem. The looking is how you come to understand it. The suite is what's left over. Everything else is theatre.

