AI Engineering, Engineering Practices, Testing

Error Analysis Is the Eval Work. Here's How to Actually Do It.

By Harsh Parmar · 12 min read

Error analysis is reading your AI's actual outputs and labeling what went wrong — open coding each trace into a free-form note, then axial-coding the notes into a ranked taxonomy of failure modes. The taxonomy, not your assumptions, decides which evals to write first. It is the highest-leverage activity in evals and the one teams fake, because it looks like research instead of engineering. This is the method for doing it for real.

Look at your data.

Every team that takes evals seriously hears the same instruction first: look at your data. We've made the case ourselves — the single highest-leverage activity in evals is to pull a hundred traces and read them. Everyone nods. It is obviously right.

Then they open a hundred traces, scroll for ten minutes, feel vaguely worried, and reach for a tool.

That is not looking at your data. That is skimming your data and feeling bad about it. The instruction was never "glance at the outputs." It was a method — one with named steps, rules, and a concrete artifact at the end — that Hamel Husain, whose AI Evals course is the canonical curriculum, calls error analysis and ranks as the highest-ROI thing you can do. This post is the method. Where the first post in this series argued that you have to look, this one shows you how.

Before You Read a Single Trace

Error analysis is cheap to do and expensive to do wrong. Most of the wrongness is set before you read anything, in what you read and where you write.

A trace has to be judgeable. A trace is not just the model's output. It is the full input, the full output, and enough surrounding context that you can actually decide whether it was correct: the retrieved chunks for a RAG answer, the tool calls and their results for an agent, the user's real goal for anything conversational. If you cannot tell from the trace whether the output was right, you do not have a trace — you have a string. Fix your logging first. You cannot analyze data you cannot judge.
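One way to make "judgeable" concrete is to give a trace a fixed shape and check it before anyone reads it. The sketch below is a minimal illustration, not a logging standard: the `Trace` fields, the `kind` values, and the `missing_context` helper are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """One logged interaction, with enough context to judge correctness."""
    trace_id: str
    user_input: str
    output: str
    # Hypothetical context fields; use whatever your system actually logs.
    retrieved_chunks: list[str] = field(default_factory=list)
    tool_calls: list[dict] = field(default_factory=list)
    user_goal: str = ""


def missing_context(trace: Trace, kind: str) -> list[str]:
    """Return the names of fields a reviewer would need but that are empty."""
    required = ["user_input", "output"]
    if kind == "rag":
        required.append("retrieved_chunks")
    elif kind == "agent":
        required.append("tool_calls")
    elif kind == "chat":
        required.append("user_goal")
    return [name for name in required if not getattr(trace, name)]


trace = Trace(trace_id="t-001", user_input="How do refunds work?", output="Refunds are instant.")
print(missing_context(trace, kind="rag"))  # ['retrieved_chunks']
```

A non-empty result means the row is a string, not a trace, and the logging needs fixing before the read starts.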

Sample from real production. A hundred or so representative traces from the last week of real usage. Not synthetic inputs, not the happy-path demos, not the prompts you wrote to test it. If your traffic has segments that behave differently — free versus paid, first message versus deep in a thread — stratify so each is represented. Synthetic data tells you how the system handles the inputs you already imagined, which is precisely the set that never causes the incident. (If you have not launched yet and have no production traces to read, start by building a first eval set — error analysis needs data to analyze.)
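Stratifying can be as simple as grouping traces by segment and drawing from each group. The sketch below assumes each trace is a dict and that an equal share per segment is what you want; both are assumptions made for illustration, and a proportional split may suit your traffic better. The `plan` key is hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(traces, segment_of, total=100, seed=0):
    """Sample about `total` traces, giving each segment a roughly equal share."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_segment = defaultdict(list)
    for trace in traces:
        by_segment[segment_of(trace)].append(trace)
    if not by_segment:
        return []

    per_segment = max(1, total // len(by_segment))
    sample = []
    for segment in sorted(by_segment):
        items = by_segment[segment]
        # A small segment contributes everything it has.
        sample.extend(rng.sample(items, min(per_segment, len(items))))
    return sample


# Hypothetical traffic: one in ten traces comes from a paid user.
traces = [{"id": i, "plan": "paid" if i % 10 == 0 else "free"} for i in range(1000)]
sample = stratified_sample(traces, segment_of=lambda t: t["plan"], total=100)
print(len(sample), sum(t["plan"] == "paid" for t in sample))  # 100 50
```

A plain random draw of a hundred would include only about ten paid traces here, which is often too few to see how that segment fails.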

Have one place to write. One row per trace. A column for the trace, a column for your note, and a column — empty for now — for the failure mode you will assign later. A spreadsheet is fine. A notebook is fine. The note column is the entire job. Everything else is logistics.
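If a spreadsheet is the place, the sheet can be generated rather than typed. This is a minimal sketch using only the standard library; the column names and the `new_sheet` helper are illustrative choices, not a required schema.

```python
import csv
import io

COLUMNS = ["trace_id", "trace", "note", "failure_mode"]


def new_sheet(traces):
    """Build CSV text with one row per trace; note and failure_mode start empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    for trace_id, text in traces:
        writer.writerow({"trace_id": trace_id, "trace": text, "note": "", "failure_mode": ""})
    return buffer.getvalue()


sheet = new_sheet([
    ("t-001", "Q: How do refunds work? A: Refunds are instant."),
    ("t-002", "Q: How do I export my data? A: Use Settings, then Export."),
])
print(sheet)
```

Write the returned text to a file and open it in any spreadsheet tool; the `failure_mode` column stays blank until the second pass.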

That is the setup. Now the two passes.

Open Coding: Label What Went Wrong, in Your Words

Open coding is the first read. One trace at a time, top to bottom. You ask one question — did this fully do the job the user came for? — and when the answer is no, you write a single plain-language sentence describing what went wrong.

Figure 1: Open coding — one trace, one free-form note describing the first thing that went wrong. No categories yet.

The sentence is where the craft lives. Four rules make the difference between notes that become a taxonomy and notes that become noise.

Describe the failure, not the fix. Write "cited a document that never mentions refunds," not "needs better retrieval." The fix is a hypothesis you will test later; the failure is the thing you actually saw. The moment you write the fix, you have stopped observing and started guessing — and your guess about the cause is exactly the bias error analysis exists to remove.

Label the first upstream failure. In any multi-step trace — a RAG pipeline, an agent loop — one early mistake cascades into a string of downstream symptoms. The retrieval grabs the wrong doc, so the answer is wrong, so the follow-up is wrong, so the tone reads as evasive. If you label all four, you have quadruple-counted one failure. Find the first thing that went wrong and label that. The root in the trace is the one that earns an eval.
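The rule can be expressed as a tiny function over the steps of a trace. In this sketch the reviewer records each step in order with a pass or fail judgment; the tuple layout and step names are assumptions for illustration.

```python
def first_upstream_failure(steps):
    """Return the earliest failing step in a trace, or None if every step passed.

    `steps` is an ordered list of (step_name, ok, observation) tuples written
    by the reviewer while reading the trace from top to bottom.
    """
    for name, ok, observation in steps:
        if not ok:
            return name, observation
    return None


steps = [
    ("retrieval", False, "Pulled the cancellation-policy doc; user asked about refunds."),
    ("answer", False, "Answer is wrong because it rests on the wrong doc."),
    ("follow_up", False, "Follow-up repeats the wrong answer."),
    ("tone", False, "Reads as evasive."),
]
print(first_upstream_failure(steps)[0])  # retrieval
```

Four things went wrong in this trace, but only one note gets written: the retrieval failure that caused the rest.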

No taxonomy yet. Do not reach for a dropdown of categories. The categories have not earned their names — you have not seen the data yet. The whole power of open coding is that the structure emerges from what is actually there, not from a framework you imported. Free text only. Premature categories are the single most common way error analysis quietly turns into confirmation of what you already believed.

Be specific, in your own words. "Wrong" is useless. "Answered for the US plan when the user clearly asked about the EU plan" is a failure mode in waiting. The specificity you capture now is the resolution you get later; vague notes cluster into vague modes that map to no eval anyone can write.

And mark the clean ones too. A trace that did the job gets a note that says so. The ratio of good to bad is a number you want, and the good traces are your working definition of what "right" looks like for this product.

A handful of real notes from a worked example — a RAG support assistant — look like this:

  • "Pulled the cancellation-policy doc; user asked about refunds. Wrong document."
  • "Doc says refunds take 5–7 days; answer said 'instant.' Contradicts the source it cited."
  • "User asked two things — how to export and how to delete. Answered export, ignored delete."
  • "Clean. Correct doc, correct answer, both parts addressed."

Notice none of them name a category and none of them prescribe a fix. They describe what happened. That is open coding.
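Stored as rows, the same four notes already yield the good-to-bad ratio. The sketch below is illustrative only: the trace IDs and the `failed` flag are hypothetical fields added so the clean rate can be computed.

```python
notes = [
    {"trace_id": "t-01", "failed": True,
     "note": "Pulled the cancellation-policy doc; user asked about refunds. Wrong document."},
    {"trace_id": "t-02", "failed": True,
     "note": "Doc says refunds take 5–7 days; answer said 'instant.' Contradicts the source it cited."},
    {"trace_id": "t-03", "failed": True,
     "note": "User asked two things, how to export and how to delete. Answered export, ignored delete."},
    {"trace_id": "t-04", "failed": False,
     "note": "Clean. Correct doc, correct answer, both parts addressed."},
]


def summarize(notes):
    """Count failing and clean traces and compute the clean rate."""
    failed = sum(1 for n in notes if n["failed"])
    clean = len(notes) - failed
    return {"total": len(notes), "failed": failed, "clean": clean, "clean_rate": clean / len(notes)}


print(summarize(notes))  # {'total': 4, 'failed': 3, 'clean': 1, 'clean_rate': 0.25}
```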

Axial Coding: Cluster the Notes Into a Ranked Taxonomy

After an hour or two you have a hundred free-form notes. Axial coding is where they become structure. You pile the notes, merge the ones that say the same thing in different words, and let a small set of named failure modes surface. Then, and this is the part that makes it useful, you count each mode and rank by frequency.

Figure 2: Axial coding — free-form notes cluster into named failure modes, counted and ranked by frequency. The ranking is the product.

This is the step where an LLM can genuinely help — paste your notes in, ask it to propose clusters and names. But only now, and only because you have already read every trace yourself. The reading is where your judgment of "wrong for this product" forms, and that judgment is the thing no model has. Let it cluster what you have already understood. Using it to skip the read gives you a clean taxonomy of the model's assumptions, which is the opposite of the point.
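If you do hand the notes to a model, the prompt can be assembled from the note column. The sketch below only builds the prompt text and deliberately calls no model API; the wording and the `max_modes` parameter are assumptions to adapt, not a tested recipe.

```python
def build_clustering_prompt(notes, max_modes=8):
    """Build a prompt asking a model to propose clusters for notes you already wrote."""
    numbered = "\n".join(f"{i}. {note}" for i, note in enumerate(notes, start=1))
    return (
        "Below are free-form notes a reviewer wrote while reading production traces.\n"
        f"Propose at most {max_modes} failure modes that group notes describing the same failure.\n"
        "For each mode, give a short name and the numbers of the notes it covers.\n"
        "Do not invent failures that are not in the notes.\n\n"
        f"{numbered}"
    )


prompt = build_clustering_prompt([
    "Pulled the cancellation-policy doc; user asked about refunds. Wrong document.",
    "Answered export, ignored delete.",
])
print(prompt)
```

Treat the model's clusters as a proposal: rename, merge, and split them against what you remember from the read.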

For the worked example, a hundred traces might cluster into something like this:

Failure mode                                      Count
Retrieved the wrong document                         18
Answer contradicts the cited source                  11
Answered only part of a multi-part question           9
Assumed the wrong plan or region                      7
Refused a request that was in scope                   4
(Clean: did the job)                                 51

That table is the product of error analysis. Not a feeling that retrieval "seems flaky," but a ranked, counted statement that your single most common failure is wrong-document retrieval, at 18 in 100, and that contradicting the source is a real second, at 11 in 100.
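Counting and ranking is a few lines once every row has a failure mode. This sketch rebuilds the worked example's labels from the counts above and ranks them; the `CLEAN` label and the function name are illustrative.

```python
from collections import Counter

CLEAN = "Clean"


def rank_failure_modes(labels, clean_label=CLEAN):
    """Count each failure mode and rank by frequency, with its share of all traces."""
    total = len(labels)
    counts = Counter(label for label in labels if label != clean_label)
    return [(mode, n, n / total) for mode, n in counts.most_common()]


# One label per trace, reconstructed from the worked example's counts.
table = {
    "Retrieved the wrong document": 18,
    "Answer contradicts the cited source": 11,
    "Answered only part of a multi-part question": 9,
    "Assumed the wrong plan or region": 7,
    "Refused a request that was in scope": 4,
    CLEAN: 51,
}
labels = [mode for mode, n in table.items() for _ in range(n)]

ranked = rank_failure_modes(labels)
for mode, n, share in ranked:
    print(f"{n:>3}  {share:>4.0%}  {mode}")
```

The share is computed over all traces, clean ones included, so each row reads as "this many in a hundred."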

Stop when you hit saturation. How do you know a hundred was enough? You watch the categories. While new traces keep adding new failure modes, keep reading. When the last twenty traces add nothing you have not already named, you have sampled enough to act. The number of traces is a consequence of saturation, never a target you set in advance.
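The stopping rule can be checked mechanically once traces carry labels in reading order. This is a rough sketch: it uses `None` for a clean trace and a window of twenty to match the text, and both conventions are assumptions.

```python
def is_saturated(labels_in_reading_order, window=20):
    """True when the most recent `window` traces introduced no new failure mode.

    Each entry is a failure-mode name, or None for a clean trace.
    """
    if len(labels_in_reading_order) <= window:
        return False  # not enough history to compare against
    earlier = {m for m in labels_in_reading_order[:-window] if m is not None}
    recent = {m for m in labels_in_reading_order[-window:] if m is not None}
    return recent <= earlier


labels = ["wrong document", "contradicts source"] * 20 + [None] * 10 + ["wrong document"] * 10
print(is_saturated(labels))                                 # True
print(is_saturated(labels + ["refused in-scope request"]))  # False
```

The second call flips to False because a mode never seen before just appeared, which is the signal to keep reading.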

The Taxonomy Decides the Evals, Not the Other Way Around

Now, and only now, you write evals. You start at the top of the ranked list, because that is where the production pain actually is, and you take each mode down to the lowest rung it can reach. (That ladder — code, then LLM-as-judge, then human review — is the subject of an earlier post.)

  • "Retrieved the wrong document" and "cited a doc not in the corpus" can often be caught with code assertions: check that the cited document IDs exist and that the top retrieved chunk's metadata matches the query's product area.
  • "Answer contradicts the cited source" is fuzzy: it needs an LLM-as-judge faithfulness check, validated against your own labels before you trust it.
  • "Answered only part of a multi-part question" might start as a judge and harden into a structured check once you can detect multi-part inputs in code.
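The two code assertions in the first bullet might look like the sketch below. It assumes cited documents are identified by string IDs and that each retrieved chunk is a dict with a `product_area` key; those names are hypothetical and should be replaced with whatever your pipeline logs.

```python
def missing_citations(cited_ids, corpus_ids):
    """Return cited document IDs that do not exist in the corpus."""
    return sorted(set(cited_ids) - set(corpus_ids))


def top_chunk_matches_area(retrieved_chunks, query_area):
    """True when the top retrieved chunk's product area matches the query's."""
    if not retrieved_chunks:
        return False
    return retrieved_chunks[0].get("product_area") == query_area


corpus_ids = {"doc-refunds", "doc-cancellation", "doc-export"}
chunks = [{"doc_id": "doc-cancellation", "product_area": "cancellation"}]

print(missing_citations(["doc-refunds", "doc-pricing"], corpus_ids))  # ['doc-pricing']
print(top_chunk_matches_area(chunks, query_area="refunds"))           # False
```

Both checks are deterministic and cheap, which is why they sit on the lowest rung.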

The direction is what matters. You are encoding categories you observed, ranked by how often they actually occur. An eval that does not trace back to a row in your taxonomy is an eval you wrote from imagination — and imagination is exactly what produced the green dashboard over a system that keeps failing. The taxonomy is the link between what production does and what your suite checks. Without it, the two drift apart and nobody notices until a user does.
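The link between suite and taxonomy can itself be checked. In this sketch each eval declares the failure mode it targets, and a small audit reports evals with no matching row and modes with no eval; the eval names and the mapping are hypothetical.

```python
def audit_evals(evals, ranked_taxonomy):
    """Return (orphan evals, uncovered modes in rank order)."""
    known = set(ranked_taxonomy)
    orphans = sorted(name for name, mode in evals.items() if mode not in known)
    covered = set(evals.values())
    uncovered = [mode for mode in ranked_taxonomy if mode not in covered]
    return orphans, uncovered


ranked_taxonomy = [
    "Retrieved the wrong document",
    "Answer contradicts the cited source",
    "Answered only part of a multi-part question",
]
evals = {
    "test_cited_ids_exist": "Retrieved the wrong document",
    "test_faithfulness_judge": "Answer contradicts the cited source",
    "test_tone_is_friendly": "Tone",  # not a row in the taxonomy
}

orphans, uncovered = audit_evals(evals, ranked_taxonomy)
print(orphans)    # ['test_tone_is_friendly']
print(uncovered)  # ['Answered only part of a multi-part question']
```

Orphans are candidates for deletion or for a fresh look at the data; uncovered modes, read top to bottom, are the next evals to write.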

Where Error Analysis Goes Wrong

The method is simple. The ways teams hollow it out are predictable:

  • Starting with someone else's taxonomy. Generic metric packs and off-the-shelf "hallucination / toxicity / relevance" categories let you skip the reading. You inherit their failure modes and miss the one your business actually cares about.
  • Labeling the fix instead of the failure. The note that says "improve the prompt" has thrown away the observation. You can no longer cluster it, because it describes your plan, not the data.
  • Counting symptoms instead of the first upstream failure. Label every downstream consequence and your most-common mode is an artifact of cascade length, not frequency. You will pour eval effort into the wrong place.
  • Delegating the first read. A junior can label round two against an established taxonomy. The product owner has to do the first pass, because only they can decide which failures matter for this product, this quarter, this user base. Outsource the looking and you get a comprehensive suite that misses the only mode that counts.
  • Grading on a 1-to-5 scale in your head. "How good was this, 1 to 5" hides disagreement and rots into noise. Ask a binary question — did this trace exhibit a failure or not — and your counts mean something.
  • Stopping after one pass. The model drifts, the product ships features, users find inputs you never sampled. Error analysis is a recurring ritual, not a launch task. Saturation today is not saturation next month.
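The symptom-counting pitfall is easy to demonstrate on toy data. In the sketch below each trace lists what failed in the order it happened; the four traces are invented for illustration.

```python
from collections import Counter

# Each trace lists what failed, in the order it happened.
traces = [
    ["wrong document", "wrong answer", "evasive tone"],
    ["wrong document", "wrong answer"],
    ["wrong plan or region", "wrong answer"],
    ["partial answer"],
]

all_symptoms = Counter(label for trace in traces for label in trace)
first_failures = Counter(trace[0] for trace in traces)

print(all_symptoms.most_common(1))    # [('wrong answer', 3)]
print(first_failures.most_common(1))  # [('wrong document', 2)]
```

Counting every symptom makes "wrong answer" look like the top mode, yet it is never the root cause in any of these traces. Counting only the first failure points at retrieval.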

What To Do This Week

  1. Pull a hundred representative traces from the last week of real production, each with full input, output, and the context needed to judge it. If you cannot judge correctness from the trace, your logging is the first bug — fix that before anything else.
  2. Open-code every one. One sentence, the first failure, your own words, no categories. One person, one to two hours. Mark the clean ones too.
  3. Axial-code into a ranked taxonomy. Cluster the notes, name the modes, count and rank them. An LLM may help cluster — after you have read every trace yourself.
  4. Write evals from the top of the ranked list down, each at the lowest rung it can reach. Commit them to the repo, wire them into CI, gate the deploy on regression.
  5. Put the next read on the calendar before you leave the room. A recurring slot, fresh traces. If it is not scheduled, it is not a habit.
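Gating the deploy in step 4 can start as a comparison of per-mode failure rates against a stored baseline. This is a minimal sketch under stated assumptions: the rates are whatever your eval suite measures on a fixed trace set, and the `tolerance` of two percentage points is an arbitrary illustrative threshold.

```python
def regressions(baseline, current, tolerance=0.02):
    """Return modes whose failure rate rose more than `tolerance` above baseline."""
    worse = {}
    for mode, rate in current.items():
        base = baseline.get(mode, 0.0)  # a mode with no baseline counts from zero
        if rate - base > tolerance:
            worse[mode] = (base, rate)
    return worse


baseline = {"Retrieved the wrong document": 0.18, "Answer contradicts the cited source": 0.11}
current = {"Retrieved the wrong document": 0.25, "Answer contradicts the cited source": 0.10}

worse = regressions(baseline, current)
exit_code = 1 if worse else 0  # a CI job would exit with this code
print(worse, exit_code)
```

A non-zero exit code is what most CI systems treat as a failed step, so returning it from the job is enough to block the deploy.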

The Bottom Line

The eval suite everyone admires is downstream of an hour nobody wants to spend. Error analysis is that hour: read the traces, write what actually went wrong in your own words, cluster it, rank it. Skip it and you write evals for the failures you imagined, while the ones you actually ship keep shipping behind a dashboard that stays green. The teams winning with AI are not the ones with the most evals. They are the ones whose evals point at failures they have actually seen — because someone sat down and looked, on purpose, with a method.

Frequently Asked Questions

What is error analysis in the context of LLM evals?

Error analysis is the practice of reading your AI's actual production outputs and systematically labeling what went wrong, then clustering those labels into a ranked list of failure modes. It has two passes borrowed from qualitative research: open coding, where you write a free-form note on the first thing that failed in each trace, and axial coding, where you group those notes into a small named taxonomy ordered by frequency. The taxonomy — not your assumptions — is what tells you which evals to write first. Hamel Husain calls it the highest-ROI activity in evals.
