Your Tests Are the Only Spec AI Reads

When an AI agent works on your codebase, your test suite is the only artifact of intent it consistently reads. Tickets, Notion pages, and Slack threads do not survive the loop. If your tests describe what the system should do in problem-domain language, the agent is constrained. If they only describe what the code already does, you have no specification — and the agent ships whatever it generated.

Ask an AI agent what your feature is supposed to do. It will tell you whatever your test suite says, and nothing else.

That is uncomfortable because most teams do not write tests as if they were the specification. They write tests as if they were the checker, run after the code is written, mirroring whatever the code happens to do. The agent reads those tests, treats them as the contract, and ships code that satisfies the contract while missing the behavior.

The shift this post is about: tests are no longer the verification step at the end. They are the persistent specification the agent reads on every iteration. The role has inverted, and most test suites have not caught up.

The Read Channel Is Smaller Than You Think

When an AI agent generates code in your repository, here is what it actually reads:

The code in the files it is editing
The tests that cover those files
A small ring of adjacent files in its context window
Whatever you paste into the prompt

That is the read channel. It is not large.

Here is what the agent does not read:

Your Linear ticket with the acceptance criteria
The Notion page where the PM described the behavior
The Slack thread where the tech lead clarified an edge case
The whiteboard photo from yesterday's standup
The decisions someone made six months ago that everyone remembers but nobody wrote down

You can paste any of those into the prompt. They will help that turn. Then the prompt closes, the context resets, and the next agent invocation reads the same persistent surface: code and tests. Nothing else.

Figure 1: The read channel. Code and tests persist across every agent invocation. Tickets, Notion, and Slack do not.

This has a consequence most teams have not absorbed: your test suite is the only durable artifact of intent the agent sees. The ticket fades. The standup decision fades. The hallway conversation fades. The test stays, and the agent reads it on every iteration of every change.

That changes what a test is for.

The Role Inversion: From Checker to Specifier

In the pre-AI era, most teams treated tests as the checker. The flow was:

1. Read the ticket
2. Hold the intent in your head
3. Write the code that satisfies the intent
4. Write tests to verify the code does what the code does

The intent lived in the developer's head while they coded. The test verified that the code worked. The intent never made it into the test itself, because the developer was the bridge between the two.

AI agents are not that bridge. They have no head to hold the intent in. Their working memory is the context window, and the context window resets. The only thing that persists is what is written in files the agent reads.

The role inversion is this: tests stop being the checker and become the specifier. The test now has to encode the intent that used to live in the developer's head. If it does not, that intent is invisible to the agent — and the agent will generate code that satisfies a different intent, possibly the one inferred from the existing code's quirks.

This is also what TDD has always been, philosophically. The reason TDD-written tests have mutation scores two to three times higher than test-after suites is that they were written as specifications first and verifications second. The agent does not care about the philosophy. The agent cares about which interpretation is in the file when it reads it.

What a Spec-Grade Test Looks Like

Not every test that exists is doing the spec job. Most are not. We see six properties that separate spec-grade tests from checker-grade tests, and the gap between them is what determines whether the AI agent gets useful constraints or noise.

Figure 2: Spec-grade versus checker-grade tests. The agent reads both the same way. Only one constrains its output.

1. Behavior-coupled, not implementation-coupled. A spec-grade test asserts on what the caller observes — return values, persisted state, emitted events. A checker-grade test asserts on how the code achieves it — which method was called, in what order, with which mocked dependency. Behavior-coupled tests survive refactors. Implementation-coupled tests break the moment the agent restructures internals, even when the behavior is unchanged.

2. Stated in problem-domain language. A spec-grade test reads like a sentence from the problem domain: "should reject duplicate email during registration". A checker-grade test reads like an internal flow chart: "should call UserRepository.findByEmail once before inserting". The agent treats both as the contract. Only one tells it what the system should do.

3. Real boundaries, not mock boundaries. Spec-grade tests use real implementations of code you own (in-memory repositories, real domain logic) and only mock at true system boundaries (HTTP clients, third-party services). Checker-grade tests mock everything, which means the test verifies the mocks, not the behavior. The agent then satisfies the mocks while drifting from the behavior.

4. Edge cases and failure paths included. Spec-grade tests cover the negative space — empty inputs, boundary values, error conditions, concurrent access. The agent, left to itself, will generate happy-path implementations that satisfy happy-path tests. Edge-case tests are how you encode the constraints the agent would otherwise skip.

5. Deterministic and fast. A flaky test gets quarantined, and once quarantined, it is no longer in the read channel. A slow test gets skipped locally, and once skipped, it is no longer in the agent's feedback loop. Both stop being specifications. Determinism and speed are not just engineering hygiene — they are how the test stays in the channel.

6. Survives mutation testing. A spec-grade test would fail if a small mutation to the production code changed the behavior. A checker-grade test would still pass — because it was never examining the behavior, only the scaffolding. Mutation testing is the mechanical check for whether your tests would catch a wrong implementation. Which is the same question as: would they catch a wrong AI-generated implementation?
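
Properties 1 to 3 are easier to see side by side. The following is a minimal sketch, not code from any real project: `RegistrationService`, `InMemoryUserRepository`, and `DuplicateEmailError` are hypothetical names, written in Python with snake_case equivalents of the test names used above. Both tests pass against the same implementation, but only the first one says what the system should do.

```python
from unittest.mock import MagicMock


class DuplicateEmailError(Exception):
    """Raised when a registration reuses an existing email (hypothetical)."""


class InMemoryUserRepository:
    """A real, in-process implementation of code we own. No mocks needed."""

    def __init__(self):
        self._emails = set()

    def find_by_email(self, email):
        return email if email in self._emails else None

    def insert(self, email):
        self._emails.add(email)


class RegistrationService:
    """Hypothetical service under test."""

    def __init__(self, users):
        self._users = users

    def register(self, email):
        if self._users.find_by_email(email) is not None:
            raise DuplicateEmailError(email)
        self._users.insert(email)
        return email


# Spec-grade: problem-domain name, real repository, asserts on what the caller observes.
def test_rejects_duplicate_email_during_registration():
    service = RegistrationService(InMemoryUserRepository())
    service.register("ada@example.com")
    rejected = False
    try:
        service.register("ada@example.com")
    except DuplicateEmailError:
        rejected = True
    assert rejected, "second registration with the same email must be rejected"


# Checker-grade: flow-chart name, mocked repository, asserts on how the code works.
def test_calls_find_by_email_once_before_inserting():
    users = MagicMock()
    users.find_by_email.return_value = None
    RegistrationService(users).register("ada@example.com")
    users.find_by_email.assert_called_once_with("ada@example.com")
    users.insert.assert_called_once_with("ada@example.com")
```

If the service were refactored to do the lookup and the insert in a single repository call, the first test would keep passing because the observable behavior is unchanged. The second would fail, and it never checks that a duplicate is rejected in the first place.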

The same module can have all checker-grade tests, all spec-grade tests, or a mix. The agent does not distinguish between them. It just reads them and treats whatever it reads as the contract.
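
Properties 4 and 5 can be sketched together. The example below assumes a hypothetical `Session` class with a 30-minute lifetime and a hypothetical `FixedClock` helper. Injecting the clock keeps the tests deterministic and fast (no sleeping, no dependence on the wall clock), and the two tests pin down the boundary an agent would otherwise be free to guess.

```python
from datetime import datetime, timedelta, timezone


class FixedClock:
    """Test clock (hypothetical): always returns the instant it was built with."""

    def __init__(self, now):
        self._now = now

    def now(self):
        return self._now


class Session:
    """Hypothetical session that expires once its age reaches TTL."""

    TTL = timedelta(minutes=30)

    def __init__(self, issued_at, clock):
        self._issued_at = issued_at
        self._clock = clock

    def is_expired(self):
        return self._clock.now() - self._issued_at >= self.TTL


ISSUED = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)


def test_session_is_valid_one_second_before_ttl():
    clock = FixedClock(ISSUED + timedelta(minutes=29, seconds=59))
    assert not Session(ISSUED, clock).is_expired()


def test_session_expires_exactly_at_ttl():
    clock = FixedClock(ISSUED + timedelta(minutes=30))
    assert Session(ISSUED, clock).is_expired()
```

Whether expiry happens at exactly 30 minutes or just after is a product decision. The point of the sketch is that the decision is written down where the agent will read it.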

Why Most Test Suites Fail This Test

We covered the underlying mechanism in "88% of Your Tests Are Decorative": when you run mutation testing on a typical enterprise codebase, the vast majority of tests would still pass even if you deleted the production code they claim to test. Coverage culture trained teams to optimize for execution, not verification. The tests look healthy. The contract is empty.
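
To make the mechanism concrete, here is a small hand-rolled illustration. It assumes a hypothetical `apply_discount` function and three hand-written mutants; real mutation testing tools generate mutants automatically, so treat this as a sketch of the idea and not of how any particular tool works. The decorative suite executes every line of the original function, yet kills none of the mutants.

```python
def apply_discount(total_cents, is_member):
    """Original (hypothetical): members get 10% off orders of 100.00 or more."""
    if is_member and total_cents >= 10_000:
        return total_cents - total_cents // 10
    return total_cents


def mutant_boundary(total_cents, is_member):
    """Mutant 1: '>=' changed to '>'."""
    if is_member and total_cents > 10_000:
        return total_cents - total_cents // 10
    return total_cents


def mutant_no_discount(total_cents, is_member):
    """Mutant 2: the discount branch is removed."""
    return total_cents


def mutant_ignores_membership(total_cents, is_member):
    """Mutant 3: the membership check is dropped."""
    if total_cents >= 10_000:
        return total_cents - total_cents // 10
    return total_cents


MUTANTS = [mutant_boundary, mutant_no_discount, mutant_ignores_membership]


def decorative_suite(fn):
    """Runs every line of the original but only checks the return type."""
    small = fn(5_000, True)
    large = fn(20_000, True)
    other = fn(20_000, False)
    return all(isinstance(value, int) for value in (small, large, other))


def spec_suite(fn):
    """Checks the behavior a caller observes, including the boundary."""
    return (
        fn(10_000, True) == 9_000        # exactly 100.00 qualifies
        and fn(9_999, True) == 9_999     # just below the threshold: no discount
        and fn(20_000, False) == 20_000  # non-members never get the discount
    )


def mutation_score(suite, mutants):
    """Fraction of mutants killed, i.e. mutants on which the suite fails."""
    killed = sum(1 for mutant in mutants if not suite(mutant))
    return killed / len(mutants)
```

Both suites pass on the original function. Only `spec_suite` fails on the mutants, which is exactly the property you need when the "mutant" is a plausible but wrong implementation produced by an agent.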

That was a problem before AI. It is a different kind of problem now. A human developer working against a decorative test suite still holds intent in their head while coding. The bug they ship is the gap between what they thought they were building and what the tests required. The slow pace of human typing acts as a natural rate limiter on that gap.

An AI agent has no head. The gap between intent and tests becomes the gap between intent and what gets shipped, and the agent ships far faster than human typing ever could, so the old rate limiter no longer applies.

The 2025 DORA Report flagged this pattern without naming it. Teams adopting AI moved work through faster but landed in clusters with higher rework rates and lower stability. The rework cluster has a signature: weak tests, missing acceptance criteria, agents shipping plausible-looking code that fails on inputs nobody specified. Fix the spec channel and the cluster shifts.

This is also what makes the rubber-stamping anti-pattern so common. Developers commit AI-generated code without articulating what it was supposed to do or what criteria they verified against. The tests pass. The reviewer waves it through. Six months later, nobody can say what the change was for, because the intent never made it into any artifact that survives.

If your tests do not carry the intent, nothing does. This is the in-repo half of specification-driven development — the upstream Gherkin scenarios and acceptance criteria define intent for humans; the test suite is where that intent lands so the agent can read it on every iteration.

How to Apply This Monday Morning

The shift is not theoretical. Here is what changes in practice:

1. Stop counting coverage. Start measuring mutation score on critical paths. Coverage tells you which lines ran. Mutation score tells you which behaviors are specified. Pick three or four critical paths — payment, auth, core domain transformations — and measure mutation score there first. Most teams find numbers in the teens or twenties on their first run. That is the size of your spec channel.

2. Write the test as a sentence about behavior, then implement. The test name should describe what the system does in problem-domain terms. "should mark order as paid when payment confirms" is a specification. "should call orderRepository.save" is an implementation note. Use the spec-grade name as the prompt context when you ask an agent to implement.

3. Pair acceptance criteria with the test in the same commit. If the acceptance criteria are in Linear and the test is in code, the agent will only see the test. Co-locate the criteria — in a test comment, a feature file, or a markdown sibling — so the intent survives in the same artifact that survives in the read channel.

4. In code review, ask the spec question. Not "do the tests pass?" but "if I deleted the implementation and gave the tests to someone unfamiliar with the codebase, could they tell me what to build?" If yes, the tests are the spec. If no, they are decoration with green checkmarks.

5. Treat the test suite as documentation that compiles. Documentation rots because it sits next to code that changes without it. Tests do not rot the same way — they break when behavior changes, which forces the documentation update. That is the property to lean into. Stop writing prose docs about what the system does. Write tests that say it, in problem-domain language, and let them be the documentation.
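
Steps 2 and 3 might look like this in practice. The sketch assumes a hypothetical `confirm_payment` handler and `InMemoryOrderRepository`; the cancelled-order rule is an invented edge case, included only to show where such a rule would live. The acceptance criteria sit in the test's docstring, so they travel with the test instead of staying in the ticket.

```python
from dataclasses import dataclass


@dataclass
class Order:
    order_id: str
    status: str = "pending"


class InMemoryOrderRepository:
    """Real in-process repository (hypothetical), used instead of a mock."""

    def __init__(self):
        self._orders = {}

    def save(self, order):
        self._orders[order.order_id] = order

    def get(self, order_id):
        return self._orders[order_id]


def confirm_payment(orders, order_id):
    """Hypothetical handler invoked when the payment provider confirms."""
    order = orders.get(order_id)
    if order.status == "cancelled":
        raise ValueError("cannot pay a cancelled order")
    order.status = "paid"
    orders.save(order)


def test_marks_order_as_paid_when_payment_confirms():
    """
    Acceptance criteria (copied from the ticket so they live in the repo):
      - Given a pending order
      - When the payment provider confirms the payment
      - Then the order's status is "paid"
    """
    orders = InMemoryOrderRepository()
    orders.save(Order("o-1"))
    confirm_payment(orders, "o-1")
    assert orders.get("o-1").status == "paid"


def test_rejects_payment_confirmation_for_cancelled_order():
    """
    Acceptance criteria (invented edge case, for illustration):
      - Given a cancelled order
      - When a payment confirmation arrives
      - Then the confirmation is rejected and the order stays cancelled
    """
    orders = InMemoryOrderRepository()
    orders.save(Order("o-2", status="cancelled"))
    rejected = False
    try:
        confirm_payment(orders, "o-2")
    except ValueError:
        rejected = True
    assert rejected
    assert orders.get("o-2").status == "cancelled"
```

Applied to this file, the review question from step 4 has a clear answer: delete `confirm_payment`, hand someone the two tests, and they can tell you what to build.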

None of this is new advice. TDD practitioners have said it for twenty years. What is new is the cost of ignoring it. The agent reads the tests. The agent does not read the ticket. If the intent is not in the tests, the agent does not have the intent.

The Bottom Line

Your tests are not just verifying your code anymore. They are the spec your AI agent reads on every iteration, the only durable record of intent the agent encounters, and the contract every generated implementation is measured against.

When that contract is empty — when tests are checker-grade decoration that would pass with the production code deleted — the agent has no constraints, and you have no specification. When the contract is full — when tests describe observable behavior in problem-domain language and survive mutation testing — the agent has a target, and your repository has a spec that compiles.

The question is no longer "how much of our code is covered by tests?" The question is "how much of our intent is in our tests?" That number is smaller than you think. And it is the only one the agent can read.
