"Eval" is not one thing. There are three levels of evaluation — unit-test assertions, human-and-model evaluation on a dataset, and A/B testing in production — and three cross-cutting choices that decide the kind of any single eval: offline or online, reference-based or reference-free, end-to-end or per-component. Most teams build one corner of that map and call it "evals," the way you'd say "we have tests." The failure modes live in the corners they skipped.
Ask a team whether they have evals and you usually get a yes. Ask which kind, and the room goes quiet.
It is the same quiet you get when a team that says "we have tests" is asked whether they mean unit, integration, or end-to-end — and which behaviours any of them actually cover. "We have tests" is not a number. "We have evals" is not an answer. Both are categories, and the category contains a map most teams have never drawn.
The first post in this series argued evals aren't a benchmark suite — they're error analysis, the habit of reading your own data. The second showed how to build the first dataset before you have one. This post draws the map. When someone says "we have evals," what are the things they could mean — and which one have they actually built?
"We Have Evals" Is One of Many Things
The word collapses a category the way "we have tests" does. Nobody on a serious team says "we have tests" and stops there, because everyone knows the sentence is underspecified — a thousand green unit tests tell you nothing about whether the system works end to end, and a passing smoke test tells you nothing about the edge cases a unit test would catch. The kinds are different tools for different failures. You need more than one.
Evals are the same, and the industry is about three years behind on admitting it. Hamel Husain — whose work is the canonical curriculum on this — has spent that time mapping the category, and the map has more than one axis. There is the question of level: how mature and how expensive the evaluation is. And there are the questions of kind: for any single eval, what it runs against and what it grades. A team that has "evals" has made one choice on each of those axes, usually without noticing the other choices existed.
The map matters because production failures are not uniformly distributed across it. They cluster in the cells you didn't build. A team with a wall of Level 1 assertions and nothing else ships the failure mode that only shows up across a whole conversation. A team that judges every final answer but never inspects the steps can tell you the agent is failing and not one thing about why. The green dashboard covers the corner you built. The corner you skipped is where the user is.
The worst version of "we have evals" is the one bought off a shelf. A generic metric pack — imported, configured, lighting up green — is not a cell on your map at all. It is a picture of a map for someone else's product. Husain is blunt about it: "Generic evaluations waste time and create false confidence." The pack measures something. You don't know what, you didn't author it, and it was never about your users. That is not coverage. That is the appearance of coverage, which is worse than none, because none at least doesn't lie to you.
The Three Levels
Start with maturity. Husain's framework has three levels of evaluation, and they are not interchangeable — each runs at a different cost, on a different cadence, answering a different question.
Figure 1: The three levels of evaluation. Cost climbs and cadence falls as you go up — and the level most teams skip is the one that tells them whether any of it mattered to a user.
Level 1 — unit tests. Fast, cheap assertions, the kind you already run with pytest, organised by feature and scenario. Schema validity, no PII, length bounds, the tool call has the right shape. They run on every code change because they cost nothing to run. This is the level most teams reach for first, and the only level many ever build.
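As a rough illustration of what Level 1 can look like, here is a minimal sketch of pytest-style assertions covering the four checks named above. The output format, the `lookup_order` tool, the email-only PII check, and the 400-character bound are all assumptions made for the example, not part of Husain's framework.

```python
import json
import re

# Hypothetical output from the feature under test. In practice this would
# come from calling your own pipeline, not from a hard-coded string.
SAMPLE_OUTPUT = json.dumps({
    "tool": "lookup_order",
    "arguments": {"order_id": "A-1042"},
    "reply": "I found your order and it ships tomorrow.",
})

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
MAX_REPLY_CHARS = 400  # assumed length bound for this feature


def parse_output(raw: str) -> dict:
    """Schema validity: the output must be JSON with the expected keys and types."""
    data = json.loads(raw)
    assert isinstance(data, dict)
    assert set(data) == {"tool", "arguments", "reply"}
    assert isinstance(data["tool"], str)
    assert isinstance(data["arguments"], dict)
    assert isinstance(data["reply"], str)
    return data


def contains_email(text: str) -> bool:
    """A deliberately narrow PII check: email addresses only."""
    return EMAIL_RE.search(text) is not None


def test_schema_is_valid():
    parse_output(SAMPLE_OUTPUT)


def test_reply_has_no_email_address():
    assert not contains_email(parse_output(SAMPLE_OUTPUT)["reply"])


def test_reply_within_length_bound():
    assert len(parse_output(SAMPLE_OUTPUT)["reply"]) <= MAX_REPLY_CHARS


def test_tool_call_has_right_shape():
    data = parse_output(SAMPLE_OUTPUT)
    assert data["tool"] == "lookup_order"
    assert set(data["arguments"]) == {"order_id"}
```

Each function is an ordinary pytest test, so the file can sit next to the rest of the suite and run on every change. A real PII check would need to cover far more than email addresses.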
Level 2 — human and model eval. You log traces, read them through a domain-specific viewer, and grade them — by hand, and with an LLM-as-judge once you have aligned the judge to human labels. This is the level where the error analysis from the first post actually lives, and it runs on a schedule rather than on every commit because reading traces costs human attention. Most of the real eval work is here.
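One way to make "aligned the judge to human labels" concrete is to compare the judge's verdicts with human verdicts on the same traces. The sketch below assumes binary pass/fail labels and reports raw agreement alongside the true positive and true negative rates. The function name and the sample labels are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    agreement: float
    true_positive_rate: float
    true_negative_rate: float


def judge_alignment(human: list[bool], judge: list[bool]) -> Alignment:
    """Compare judge verdicts with human labels (True means 'pass')."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(human, judge))
    positives = sum(1 for h, _ in pairs if h)
    negatives = len(pairs) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one human pass and one human fail")
    # Count the traces where the judge matched the human on each class.
    true_pos = sum(1 for h, j in pairs if h and j)
    true_neg = sum(1 for h, j in pairs if not h and not j)
    return Alignment(
        agreement=(true_pos + true_neg) / len(pairs),
        true_positive_rate=true_pos / positives,
        true_negative_rate=true_neg / negatives,
    )


# Made-up labels for eight traces, purely for illustration.
human_labels = [True, True, True, False, False, True, False, True]
judge_labels = [True, True, False, False, True, True, False, True]
result = judge_alignment(human_labels, judge_labels)
print(result)
```

Reporting the two rates separately matters when passes far outnumber failures: a judge that passes everything can score high on raw agreement while catching no failures at all.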
Level 3 — A/B testing. The model meets real users, and you measure whether it moves the outcome you actually care about. Reserved for mature products, run after significant changes, because it is the most expensive signal you have. Its cost dominates the other two — and, as Husain notes, that cost ordering is exactly what dictates how often you can afford to run each level.
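For a sense of what "measure whether it moves the outcome" involves, here is a sketch of a two-proportion z-test, one standard way to compare a binary outcome between two variants. It assumes the outcome is a yes/no event per user (for example, "ticket resolved") and that users were randomly assigned. The function name and the counts are hypothetical.

```python
import math


def two_proportion_z_test(
    success_a: int, n_a: int, success_b: int, n_b: int
) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in success rates B - A."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both groups need at least one user")
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled rate under the null hypothesis that A and B perform the same.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: control (A) versus the new prompt (B).
z, p = two_proportion_z_test(success_a=100, n_a=1000, success_b=150, n_b=1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

This is the arithmetic only. Choosing the outcome metric, sizing the experiment, and deciding when to stop are where the real cost of Level 3 sits.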
The trap is treating these as a ladder where you discard each rung once you have climbed past it. They are not. A team that has Level 1 and stops has automated the cheap half and skipped the half where understanding lives. A team that runs Level 2 dataset evals forever and never reaches Level 3 is optimising a number that was never confirmed to be user value: a green offline suite that has never once been checked against whether a real person was better off. The levels accumulate. You don't graduate from one to the next; you add.
One distinction prevents the most common confusion in the field. These three levels are not the three rungs from the first post. The rungs — code-based assertion, LLM-as-judge, human review — are about how you implement a single check. The levels are about scope and stage. They are orthogonal: a Level 2 dataset eval can be implemented on any rung. Conflating them is how teams convince themselves that a pile of Level 1 code assertions is a complete eval program. It is one rung at one level. There is a lot of map left.
The Three Questions That Decide the Kind
Pick any single eval you want to write. Three questions decide what kind it is — and on each one, teams default to a single answer and never build the other.
Figure 2: An eval is a point in this space. Most teams build the same pole of every axis — offline, reference-based, end-to-end — and ship the failures that live on the poles they skipped.
Offline or online? Offline evals run on a fixed, curated dataset — in CI, on every prompt and model change — and favour cheap deterministic checks because they run constantly. Online evals sample live production traces asynchronously, and can afford a more expensive reference-free judge because they run less often. Build only the offline half and your dataset slowly becomes a museum: it can only contain the failure modes you already thought of, while production keeps inventing ones your grid never had a column for. Build only the online half and you are catching every failure after a user already hit it — pure detection, at the moment it is most expensive. The two are complementary by design: production monitoring finds the new failure pattern, and you fold a representative example back into the CI set so it can never regress again.
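The loop between the two halves can be sketched in a few lines: sample some production traces for review, and when a reviewer flags a new failure, fold it into the offline regression set. The `Trace` shape, the function names, and the choice to deduplicate on the user input are assumptions made for this example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    trace_id: str
    user_input: str
    output: str


def sample_traces(traces: list[Trace], k: int, seed: int = 0) -> list[Trace]:
    """Online half: draw a random sample of production traces for review."""
    rng = random.Random(seed)  # seeded so a review batch can be reproduced
    return rng.sample(traces, min(k, len(traces)))


def fold_into_regression_set(regression_set: dict[str, Trace], failing: Trace) -> bool:
    """Offline half: add a reviewed failure to the CI dataset.

    Keyed by user input so the same case is not added twice.
    Returns True if the case was new.
    """
    if failing.user_input in regression_set:
        return False
    regression_set[failing.user_input] = failing
    return True


# Hypothetical production log of 50 traces.
production = [Trace(f"t{i}", f"question {i}", f"answer {i}") for i in range(50)]
batch = sample_traces(production, k=10)

regression_set: dict[str, Trace] = {}
added_first = fold_into_regression_set(regression_set, batch[0])
added_again = fold_into_regression_set(regression_set, batch[0])
print(len(batch), added_first, added_again)
```

Uniform random sampling is the simplest choice. A real review queue might also oversample traces with negative feedback or unusual length.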
Reference-based or reference-free? A reference-based eval compares output to a known correct answer — it needs ground truth, and it shines where ground truth exists: retrieval, search, recommendation, debugging the RAG layer. A reference-free eval grades against criteria with no golden answer — code assertions and LLM-as-judge. Most of what a production LLM emits is open-ended; there is no single right answer to "summarise this ticket" or "answer this question in our voice." So reference-free dominates, and a team that built only reference-based evals — because those felt rigorous — has left everything open-ended completely unguarded. The catch is that reference-free quality leans on the judge, and the judge is an AI feature that needs its own eval. That is its own post.
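To make the contrast concrete, the sketch below puts one check of each kind side by side. `recall_at_k` needs a ground-truth set of relevant documents. `summary_meets_criteria` needs only the output and a list of criteria. The criteria shown (non-empty, a word limit, a banned phrase) are placeholders, not recommendations.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Reference-based: fraction of known-relevant documents found in the top k."""
    if not relevant:
        raise ValueError("recall is undefined without ground-truth documents")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def summary_meets_criteria(
    summary: str,
    max_words: int = 60,
    banned_phrases: tuple[str, ...] = ("as an ai",),
) -> bool:
    """Reference-free: grade the output against criteria, with no golden answer."""
    lowered = summary.lower()
    non_empty = bool(summary.strip())
    within_length = len(summary.split()) <= max_words
    no_banned = not any(phrase in lowered for phrase in banned_phrases)
    return non_empty and within_length and no_banned


# Reference-based: requires the labelled set {"d1", "d2"}.
print(recall_at_k(["d3", "d1", "d9", "d2"], {"d1", "d2"}, k=3))

# Reference-free: only the output and the criteria are needed.
print(summary_meets_criteria("Customer reports a failed payment and wants a refund."))
```

The first check cannot be written until someone has labelled which documents are relevant. The second can be written today, but it is only as good as the criteria it encodes.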
End-to-end or per-component? An end-to-end eval treats the system as a black box and asks one question: did we meet the user's goal? A per-component eval decomposes the run into steps: was the tool choice right, was the parameter extracted correctly, was the error handled? Husain's sequence is deliberate: start end-to-end, because that is the only number that corresponds to a user outcome, then let error analysis tell you which step is the hotspot and add component checks there. Build only end-to-end and you know the agent is broken without knowing where; you cannot fix what you cannot localise. Build only per-component and you get the unit-test fallacy in a new costume: every step passes its own check and the whole thing still fails the user, because nobody evaluated the handoffs.
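A small sketch of how the two views fit together: an end-to-end check says whether the run met the goal, and per-component checks localise the first step that went wrong. The step names, the trace shape, and the checks are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    name: str
    output: dict


Check = Callable[[dict], bool]


def first_failing_step(steps: list[Step], checks: dict[str, Check]) -> Optional[str]:
    """Per-component view: name of the first step whose check fails, if any."""
    for step in steps:
        check = checks.get(step.name)
        if check is not None and not check(step.output):
            return step.name
    return None


# A hypothetical agent run for "where is order A-1042?"
steps = [
    Step("tool_choice", {"tool": "lookup_order"}),
    Step("parameter_extraction", {"order_id": None}),
    Step("final_answer", {"text": "Sorry, I could not find that order."}),
]

checks: dict[str, Check] = {
    "tool_choice": lambda out: out.get("tool") == "lookup_order",
    "parameter_extraction": lambda out: bool(out.get("order_id")),
}

# End-to-end view: a stand-in goal check on the final answer only.
end_to_end_ok = "A-1042" in steps[-1].output["text"]
hotspot = first_failing_step(steps, checks)
print(end_to_end_ok, hotspot)
```

Here the end-to-end check reports only that the run failed, and the component checks point at parameter extraction. The reverse case, where every component check passes and the end-to-end check still fails, is the signal that a handoff between steps needs its own check.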
Three questions, two answers each. The point is not that you need all of it on day one. The point is that "we have evals" names exactly one answer to each question, and the gap between your green dashboard and your broken production is almost always a pole you never built.
Find Your One Cell
For a feature you already have evals on, five steps. Each is a diagnosis, not a suggestion.
- Write down the one cell you built. One sentence: which level (1, 2, or 3), offline or online, reference-based or reference-free, end-to-end or per-component. Most teams land on the same sentence — "Level 1, offline, reference-based, end-to-end, on the inputs we imagined." That sentence is your actual coverage. Everything outside it is unmonitored.
- Map your last bad incident onto an axis. The failure that reached a user — which pole was it on? A whole-conversation failure your per-turn checks missed is the end-to-end axis. A failure mode you'd never seen is the offline/online axis. Name the axis; that is the corner to build next.
- Add the missing pole, not more of the same cell. The instinct under pressure is to write more Level 1 assertions because they are cheap. If your incident lived on the online axis, fifty more offline assertions cover nothing. Build the pole the incident came from.
- If you are running a generic metric pack, replace it with error analysis. Off-the-shelf metrics are the false-confidence cell. Pull a hundred traces, read them, name the failure modes, and write the assertions that fire on your modes. Keep the generic metric only as a way to surface interesting traces — never as a measure of quality.
- Schedule the online half before you leave the room. A production-trace review cadence is what keeps the offline set from becoming a museum; Husain's rule of thumb is ten to twenty traces weekly for outliers and a hundred-plus every two to four weeks. A review that isn't on the calendar doesn't happen.
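The first step of the exercise above can be written down as data. The sketch below records each eval as one cell (a level plus one pole on each axis) and lists every level and pole that no eval covers yet. The `EvalCell` type and `uncovered_poles` helper are illustrative names, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCell:
    level: int             # 1, 2, or 3
    online: bool           # False means offline
    reference_based: bool  # False means reference-free
    end_to_end: bool       # False means per-component

    def __post_init__(self) -> None:
        if self.level not in (1, 2, 3):
            raise ValueError("level must be 1, 2, or 3")

    def poles(self) -> list[str]:
        """The level and the pole chosen on each axis, as labels."""
        return [
            f"Level {self.level}",
            "online" if self.online else "offline",
            "reference-based" if self.reference_based else "reference-free",
            "end-to-end" if self.end_to_end else "per-component",
        ]


ALL_POLES = [
    "Level 1", "Level 2", "Level 3",
    "offline", "online",
    "reference-based", "reference-free",
    "end-to-end", "per-component",
]


def uncovered_poles(cells: list[EvalCell]) -> list[str]:
    """Every level and pole that none of the given evals touches."""
    covered = {pole for cell in cells for pole in cell.poles()}
    return [pole for pole in ALL_POLES if pole not in covered]


# The sentence most teams land on: Level 1, offline, reference-based, end-to-end.
built = [EvalCell(level=1, online=False, reference_based=True, end_to_end=True)]
print(", ".join(built[0].poles()))
print(uncovered_poles(built))
```

Covering a pole once is a weak claim, since one online eval does not make production monitored. The list is still a quick way to see which corners have nothing in them at all.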
The Bottom Line
Eval is not one thing, and "we have evals" is not a coverage statement — it is one answer to a question with three levels and three axes, said as if the rest of the map didn't exist. The team that has mapped it knows exactly which cell it built, which it skipped, and which skipped cell its next incident will come from. The team that hasn't will find out the same way it always does: from a user, on a pole nobody was watching, long after the dashboard went green.