Evals for Engineering Agents: How We Test the AI That Tests Your Code

An eval for an engineering agent is a structured benchmark that measures the agent's behaviour against a labeled set of known problems with known correct outputs — code generation against a hidden test suite, code review against PRs labeled with real issues, test generation against mutation testing. Without evals on the agent itself, regressions in the agent show up as regressions in your codebase, invisibly. The eval suite is the only honest signal of whether the agent is helping your engineering team or amplifying its gaps.

Your team has an AI code reviewer. It comments on every PR. Developers either accept its suggestions or ignore them. Six months in, you cannot answer the question that matters: is it making your code better, or training your team to mute it?

You probably evaluate it the way most teams do — by vibe. People feel like it has caught a few bugs. People feel like it produces noise on certain repos. The product team rolled it out, the budget was approved, and nobody has actually measured whether the agent is doing better than the alternative of no agent at all.

This is the gap. Engineering agents — the AI that writes, reviews, and tests your code — are inside the delivery loop now. When they regress, the codebase regresses. When their prompts drift, the team's standards drift. And almost no team has a regression suite for the agent itself.

Feature evals are for AI in your product. Engineering agent evals are for AI in your pipeline. Both are required. One of them is almost universally missing.

Why the Agent Needs Its Own Test Suite

A product eval answers: "is this AI feature good enough to ship to users?" An engineering agent eval answers a harder question: "is this AI tool making my team better, or worse, than the next-best alternative?"

The two are not interchangeable. A code-reviewing agent that catches 60% of real bugs and produces a noise comment on 1 in 5 PRs sounds reasonable in isolation. Compared to a human reviewer who catches 85% and produces noise on 1 in 50, it is a downgrade dressed up as automation. Without measurement, you would not know which case you are in.

Worse, engineering agent regressions are silent. When a code generation agent gets a new model and starts producing code that compiles but breaks subtle business rules, no alert fires. The new code lands in PRs. The PRs ship. The regression shows up four weeks later as an uptick in mutation testing failures, an uptick in customer complaints, or an unexplained slope change in your DORA change failure rate. By then, dozens of PRs are downstream of the regression and the agent has been "improved" again.

The fix is not better prompts. It is the same fix the rest of software engineering uses for non-determinism: a regression suite that runs on every change.

The Three Agent Types and Their Eval Surfaces

Engineering agents fall into three categories. Each has a different eval surface — what you measure and how you measure it.

Code-writing agents. Agents that produce production code from a specification or prompt. Eval surface: a hidden suite of programming problems with known correct outputs. The classic structure is HumanEval-style — a function signature, a docstring, a hidden test suite. Run the agent, run the tests, measure pass rate. For your codebase, the equivalent is a curated set of well-defined feature slices from your history: the spec the engineer received, the code that eventually shipped, the tests that protected it. Run the agent against the spec. Score the output against the protective tests. Track pass rate over time.

Three metrics matter beyond pass rate: code quality scores (cyclomatic complexity, duplication, architecture-rule compliance), time-to-pass (how many iterations the agent needed), and consistency (whether re-running the same prompt produces the same shape of code).
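The pass-rate measurement for a code-writing agent can be sketched in a few lines. This is a minimal illustration, not a full harness: `HiddenTest`, `passes_all`, `pass_rate`, and the two stand-in "agent outputs" are hypothetical names invented for the example, and a real run would execute generated code in a sandbox rather than in-process.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class HiddenTest:
    """One hidden test case: call arguments and the expected return value."""
    args: tuple
    expected: object


def passes_all(candidate: Callable[..., object], tests: Sequence[HiddenTest]) -> bool:
    """A candidate passes only if every hidden test passes; exceptions count as failures."""
    for test in tests:
        try:
            if candidate(*test.args) != test.expected:
                return False
        except Exception:
            return False
    return True


def pass_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of problems whose generated code passed its hidden suite."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Stand-ins for code the agent generated (hypothetical examples).
def agent_clamp(x, lo, hi):
    return max(lo, min(x, hi))


def agent_discount(total_cents):
    # Compiles and looks plausible, but breaks a business rule:
    # the 10% discount should also apply at exactly 100.
    return total_cents - total_cents // 10 if total_cents > 100 else total_cents


suite = {
    "clamp": (agent_clamp, [HiddenTest((5, 0, 3), 3), HiddenTest((-1, 0, 3), 0)]),
    "discount": (agent_discount, [HiddenTest((200,), 180), HiddenTest((100,), 90)]),
}

results = {name: passes_all(fn, tests) for name, (fn, tests) in suite.items()}
rate = pass_rate(list(results.values()))
print(results, rate)
```

The `discount` case is the kind of failure described earlier: code that runs cleanly but violates a rule only the protective tests know about.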

Code-reviewing agents. Agents that comment on PRs, flag issues, or suggest changes. Eval surface: a labeled set of historical PRs. For each PR in the set, you know what the human reviewer flagged, what shipped, what caused incidents, and what was noise. Run the agent against the same PRs. Measure precision (real issues out of total flags), recall (real issues caught out of total real issues), and noise rate (false positives per PR).

The labeled set is the asset. It captures your team's institutional memory of what a good review looks like. Curating it is the work; once it exists, every prompt change and model change runs against it.

Test-writing agents. Agents that generate tests for existing code. Eval surface: mutation testing. A test suite that achieves 100% line coverage but a 30% mutation score is decorative — the tests run the code but do not verify behaviour. Run the agent against a known module, generate tests, run mutation testing on the resulting suite, measure the mutation score. The mutation score is the eval.

The metric is direct: did the agent generate tests that actually catch bugs? Coverage will lie about this. Mutation testing will not.
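To make the gap between coverage and mutation score concrete, here is a small sketch. It assumes hand-written mutants as a stand-in for what a real mutation testing tool would generate automatically; `is_adult`, the mutants, and the check functions are all hypothetical.

```python
from typing import Callable, Sequence

Impl = Callable[[int], bool]
Check = Callable[[Impl], None]


def is_adult(age: int) -> bool:
    """The module under test."""
    return age >= 18


# Hand-written mutants: each is a small, plausible bug in is_adult.
MUTANTS: list[Impl] = [
    lambda age: age > 18,    # boundary shifted
    lambda age: age >= 21,   # constant changed
    lambda age: age <= 18,   # comparison flipped
    lambda age: True,        # condition removed
]


def check_typical_adult(impl: Impl) -> None:
    assert impl(30) is True


def check_boundary_adult(impl: Impl) -> None:
    assert impl(18) is True


def check_minor(impl: Impl) -> None:
    assert impl(17) is False


def suite_passes(impl: Impl, checks: Sequence[Check]) -> bool:
    """Run every check against one implementation."""
    try:
        for check in checks:
            check(impl)
    except AssertionError:
        return False
    return True


def mutation_score(checks: Sequence[Check], mutants: Sequence[Impl]) -> float:
    """Fraction of mutants that make at least one check fail (are 'killed')."""
    killed = sum(1 for mutant in mutants if not suite_passes(mutant, checks))
    return killed / len(mutants)


# Both suites execute the single line in is_adult, so both have full line coverage.
weak_suite = [check_typical_adult]
strong_suite = [check_typical_adult, check_boundary_adult, check_minor]

weak_score = mutation_score(weak_suite, MUTANTS)
strong_score = mutation_score(strong_suite, MUTANTS)
print(weak_score, strong_score)
```

Both suites pass against the real `is_adult` and both cover every line, yet the weak suite kills only one of the four mutants while the strong suite kills all of them. That difference is what the eval scores.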

Figure 1: The three classes of engineering agents and the eval surface for each: a hidden test suite for the code writer, a labeled PR set for the code reviewer, and a mutation score for the test writer. Different agent types need different benchmarks; what they produce determines what you measure.

Building the Eval Set

The eval set is the artifact. The agent comes and goes. The set persists across model versions, prompt iterations, and vendor changes. Build it like you would build a test pyramid for production code: small at first, growing with every regression you survive.

A starter set for a code-reviewing agent looks like this:

1. Pick 50 historical PRs from the last 90 days. Mix sizes, mix authors, mix repos. Do not choose only PRs where reviewers caught something; you need negative examples too.
2. Label each PR. For each PR, write down what a competent reviewer should have caught. Mark which comments your human reviewers made that were real issues, and which were noise. Mark which PRs were merged clean and what production behaviour proved that out.
3. Run the agent against the labeled set. Classify every flag the agent produces as a true positive (it caught a real issue) or a false positive (noise). Count each labeled issue the agent missed as a false negative, and each clean PR it correctly stayed quiet on as a true negative.
4. Compute precision and recall. Precision = true positives ÷ (true positives + false positives). Recall = true positives ÷ (true positives + false negatives), that is, real issues caught out of all real issues. Track both as the agent evolves.
5. Add to the set after every production incident. Every bug that shipped is a labeled example for the next eval run: did the agent miss it? Would the new prompt catch it?
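The scoring step for a code-reviewing agent can be sketched as follows. This is an illustration under a strong simplifying assumption: each agent flag has already been matched to a labeled issue identifier (in practice that matching is the hard part and usually needs a human or a fuzzy matcher). `LabeledPR`, `score_review_agent`, and the issue identifiers are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LabeledPR:
    pr_id: str
    real_issues: frozenset   # issues a competent reviewer should have caught
    agent_flags: frozenset   # issues the agent flagged, already matched to labels


def ratio(numerator: int, denominator: int) -> float:
    """Safe division; reports 0.0 when the denominator is zero (metric undefined)."""
    return numerator / denominator if denominator else 0.0


def score_review_agent(prs: Sequence[LabeledPR]) -> dict:
    tp = sum(len(pr.agent_flags & pr.real_issues) for pr in prs)  # real issues caught
    fp = sum(len(pr.agent_flags - pr.real_issues) for pr in prs)  # noise
    fn = sum(len(pr.real_issues - pr.agent_flags) for pr in prs)  # real issues missed
    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "noise_per_pr": ratio(fp, len(prs)),
    }


labeled_set = [
    LabeledPR("pr-1", frozenset({"null-deref", "missing-auth-check"}),
              frozenset({"null-deref", "naming-nit"})),
    LabeledPR("pr-2", frozenset(), frozenset()),  # clean PR, agent stayed quiet
    LabeledPR("pr-3", frozenset({"race-condition"}),
              frozenset({"race-condition", "style-nit", "unused-import"})),
    LabeledPR("pr-4", frozenset({"off-by-one"}), frozenset()),  # agent missed it
]

scores = score_review_agent(labeled_set)
print(scores)
```

Here the agent caught 2 real issues, raised 3 noise flags, and missed 2 issues, giving a precision of 0.4, a recall of 0.5, and 0.75 noise flags per PR. The clean `pr-2` contributes nothing to precision or recall but does lower the noise rate, which is why the set needs negative examples.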

The same shape applies to the other agent types. For code writers, the set is feature slices with hidden tests. For test writers, the set is modules with known mutation scores. In every case, the set grows over time and becomes the regression memory of the agent.

How Evals Plug Into the Pipeline

Evals are not an annual benchmarking exercise. They run automatically, the same way unit tests run automatically. The plug-in is mechanical.

Every change to the agent — new model version, new system prompt, new skill set, new context retrieval strategy — triggers an eval run before the change reaches engineers. Compare against the previous baseline. The gates:

Pass rate cannot regress beyond a defined threshold (e.g. 2%).
Precision cannot drop below baseline.
False-positive rate cannot rise more than X%.
Mutation score on test-writer agents cannot fall below 80%.

If any of these regress, the change does not ship to the team. The agent rolls back. The prompt is iterated. The model version stays pinned.
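The gates above can be expressed as a small check that returns the list of violated rules. This is a sketch, not a prescribed implementation: `Metrics`, `Gates`, and `gate_violations` are hypothetical names, the 2% pass-rate threshold is read here as an absolute drop of 0.02, and the false-positive threshold is a required parameter because the text leaves its value open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    pass_rate: float
    precision: float
    false_positive_rate: float
    mutation_score: float


@dataclass(frozen=True)
class Gates:
    max_fp_rate_rise: float            # no default: each team has to choose this value
    max_pass_rate_drop: float = 0.02   # assumption: "2%" read as an absolute drop
    min_mutation_score: float = 0.80


def gate_violations(baseline: Metrics, candidate: Metrics, gates: Gates) -> list:
    """Return a human-readable list of failed gates; empty means the change may ship."""
    violations = []
    if baseline.pass_rate - candidate.pass_rate > gates.max_pass_rate_drop:
        violations.append("pass rate regressed beyond threshold")
    if candidate.precision < baseline.precision:
        violations.append("precision dropped below baseline")
    if candidate.false_positive_rate - baseline.false_positive_rate > gates.max_fp_rate_rise:
        violations.append("false-positive rate rose beyond threshold")
    if candidate.mutation_score < gates.min_mutation_score:
        violations.append("mutation score below floor")
    return violations


gates = Gates(max_fp_rate_rise=0.05)  # illustrative value only
baseline = Metrics(pass_rate=0.70, precision=0.60, false_positive_rate=0.20, mutation_score=0.85)

new_prompt = Metrics(pass_rate=0.71, precision=0.55, false_positive_rate=0.22, mutation_score=0.82)
new_model = Metrics(pass_rate=0.72, precision=0.61, false_positive_rate=0.20, mutation_score=0.85)

blocked = gate_violations(baseline, new_prompt, gates)
allowed = gate_violations(baseline, new_model, gates)
print(blocked, allowed)
```

In this example the new prompt improves pass rate slightly but loses precision, so it is blocked on that single gate; the new model clears every gate and can be promoted.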

This is the same canary discipline that protects production code from regressions, applied to the agent that now writes part of your production code. The pattern is not new. The medium is.

Figure 2: Agent changes route through the eval suite like code changes route through CI. A change is compared against the previous baseline and promoted to the team only if precision, recall, and pass rate all hold or improve. The baseline is the contract; regressions block the rollout.

What to Do This Week

Four steps to move from no evals to a working baseline:

1. Pick one engineering agent in your stack. The one your team relies on most heavily, usually the code reviewer or the test writer. Start there. You cannot eval everything at once.
2. Curate 25 labeled examples. For a code reviewer, pick 25 PRs from the last quarter and label the real issues. For a test writer, pick five modules with known good test suites. The number is small on purpose: the goal is to have a regression suite, not a perfect benchmark.
3. Run the agent against the set and record the baseline. This is your starting precision, recall, and pass rate. Every future change is measured against this baseline.
4. Wire the eval run into your agent's deployment process. New model version → eval run → diff against baseline → block or promote. The wire-up is a script, not a platform. A weekend of work.
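As a rough idea of how small that wire-up script can be, the sketch below records a baseline on the first run and diffs later runs against it. It assumes every metric is higher-is-better and that any drop blocks the change, which is stricter than a thresholded gate; `check_against_baseline` and the `baseline.json` file name are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def check_against_baseline(metrics: dict, baseline_path: Path) -> tuple:
    """Return (decision, deltas). Decision is 'baseline-recorded', 'promote', or 'block'."""
    if not baseline_path.exists():
        # First run: store the current numbers as the contract for future changes.
        baseline_path.write_text(json.dumps(metrics, indent=2))
        return "baseline-recorded", {}

    baseline = json.loads(baseline_path.read_text())
    deltas = {name: metrics[name] - baseline[name] for name in baseline}
    # Assumption: all metrics are higher-is-better, and any drop blocks the rollout.
    decision = "block" if any(delta < 0 for delta in deltas.values()) else "promote"
    return decision, deltas


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "baseline.json"

    first = check_against_baseline(
        {"precision": 0.5, "recall": 0.5, "pass_rate": 0.75}, path)
    regressed = check_against_baseline(
        {"precision": 0.5, "recall": 0.25, "pass_rate": 0.75}, path)
    improved = check_against_baseline(
        {"precision": 0.75, "recall": 0.5, "pass_rate": 0.75}, path)

print(first[0], regressed[0], improved[0])
```

In a deployment pipeline the script would exit with a non-zero status on `block` so the rollout step never runs. Whether a promoted run should overwrite the stored baseline is a policy choice; this sketch leaves the baseline untouched.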

The first iteration of the eval set will be embarrassing. Examples will be mislabeled. The baseline will look low. That is expected. The set improves the same way a unit test suite improves — every escaped regression becomes a new example.

The Bottom Line

The AI that writes, reviews, and tests your code is inside the delivery loop now. When it regresses, your codebase regresses — silently, slowly, in ways your DORA dashboards will eventually catch but your team will not notice in the moment. The fix is the same fix software engineering has used for every other non-deterministic system: a regression suite that runs on every change, with a baseline that has to hold or the change does not ship.

Feature evals protect users. Engineering agent evals protect the codebase. Both are required. Almost no team has the second one, and the longer they wait, the more drift compounds in code written, reviewed, and tested by an agent that nobody is testing.
