Engineering Practices, Testing, Code Quality

Characterisation Tests: The Safety Net You Need Before Touching Legacy Code

By Harsh Parmar · 9 min read

Characterisation tests are snapshot tests of what legacy code currently does — bugs, quirks, and all. They break the catch-22 that paralyses legacy teams: you cannot add real tests without refactoring, and you cannot refactor without tests. Characterisation tests go first. They are scaffolding. Clean tests replace them once the code is safe to reshape.

Every team with a legacy codebase has the same conversation.

"We should add tests."

"We cannot. The code is not testable."

"Then we should refactor."

"We cannot. We do not have tests."

Silence. Then somebody pulls up a ticket. And nothing happens. For months. For years. The scary parts of the codebase grow scarier. New features get glued around old code rather than into it. AI agents start generating plausible-looking patches that nobody can fully verify, because the system has no shape anyone can hold in their head.

There is a specific tool for this situation. Michael Feathers named it in 2004, in Working Effectively with Legacy Code, and most teams have still never used it. Characterisation tests. They are not elegant. They are not testing what the code is supposed to do. They are the safety net you install before you climb.

The Legacy Catch-22

The trap is simple and it is everywhere. The team inherits — or accumulates — code that no longer fits in anyone's head. Functions are hundreds of lines. Dependencies call into dependencies. A single database query is buried six layers deep inside a method that also sends emails, updates a cache, and maybe writes to disk, depending on a config flag added in 2019 by someone who has since left.

Adding unit tests requires isolation. Isolation requires seams — injection points where you can swap in test doubles. This code has none.

So the team concludes: we need to refactor first. But refactoring without tests is a gamble. Every change might break something nobody noticed was working. The CD knowledge base at beyond.minimumcd.org describes this exact symptom — "a large codebase has no automated tests" — and spells the loop out plainly: "adding tests to an untestable codebase requires refactoring first — and refactoring requires tests to do safely."

The loop does not break by finding discipline. It breaks by changing what the word "test" means for this piece of code.

The legacy catch-22 and how characterisation tests break it — two circular loops versus a single entry point through captured behaviour
Figure 1: The catch-22 has no native exit. A characterisation test is the entry point that turns the loop into a line.

What Characterisation Tests Actually Are

A characterisation test does not describe what the code should do. It describes what the code actually does. Bugs and all.

You pick a function. You call it with inputs. You capture the output. You write a test that asserts the output matches what you just captured. If the captured output contains a bug, the test preserves the bug. If it contains ten years of accumulated edge-case handling that nobody remembers deciding to build, the test preserves that too.

The point is not correctness. The point is pinning the current behaviour down, so that you can change the structure of the code without silently changing what it produces.
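In miniature, the pattern looks like this. The sketch below assumes a hypothetical legacy function, `legacy_price`, standing in for real code whose rules nobody remembers — the expected values in the test were captured by running the code, not derived from a spec:

```python
def legacy_price(qty, member):
    # Stand-in for the real legacy code: quirky, undocumented rules.
    price = qty * 9.99
    if member:
        price *= 0.9
    if qty > 10:
        price -= 5  # nobody remembers why this discount exists
    return round(price, 2)

def test_characterise_legacy_price():
    # These values were CAPTURED from the running code, bugs included.
    # They pin current behaviour; they do not claim it is correct.
    assert legacy_price(1, False) == 9.99
    assert legacy_price(3, True) == 26.97
    assert legacy_price(12, True) == 102.89
```

If one of those captured values turns out to encode a bug, the test still passes — and that is the point: the refactor must not change it silently.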

This is a strange idea if you have only practised TDD on greenfield code. TDD tests specify intent before implementation — we covered this in TDD is a design tool. Characterisation tests do the opposite. They capture the implementation as a contract you write against yourself, so that the refactor is not a gamble.

Emily Bache's work on approval testing — particularly her Gilded Rose refactoring kata — shows the idea in practice. You call the code, serialise the result, and commit the serialised output as a golden file. Every future run is compared against the golden. Any deviation is a signal: either you introduced a regression, or you intentionally changed behaviour and need to approve the new output explicitly.
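The golden-master loop can be sketched in a few lines. This is a minimal illustration, not the API of any real library — ApprovalTests, Jest snapshots, and Verify add diff tooling and approval workflows on top of the same idea:

```python
import os

def verify(name, actual, golden_dir="golden"):
    # Compare `actual` against a committed golden file.
    os.makedirs(golden_dir, exist_ok=True)
    path = os.path.join(golden_dir, name + ".approved.txt")
    if not os.path.exists(path):
        # First run: no golden exists yet, so record it and pass.
        with open(path, "w") as f:
            f.write(actual)
        return
    with open(path) as f:
        approved = f.read()
    # Any later deviation fails loudly: either a regression, or an
    # intentional change that needs explicit approval (update the file).
    assert actual == approved, f"{name}: output changed from golden master"
```

The golden files are committed to the repository, so the "approval" step is visible in code review like any other change.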

The mental flip: TDD tests describe intent and reject any implementation that violates it. Characterisation tests describe the implementation and reject any refactor that changes it. Same mechanism, opposite direction — and on legacy code, the second one has to come first.

The Workflow That Breaks the Loop

The practical sequence is four steps. Each step has a failure mode that kills the exercise if you skip it.

Step 1 — Find a seam. A seam, in Feathers' terminology, is a place where behaviour can be altered without editing in that place. For legacy code, the seam is usually the public entry point of a function or module — somewhere you can already call from outside. You do not need to refactor to create a seam. You just need to identify where the system accepts a call and returns a result.

Step 2 — Capture the current output. Call the seam with a representative input. Record everything the function produces — return values, emitted events, log lines, side effects you can observe. This captured output becomes the golden master: the reference behaviour that future changes will be compared against.

Step 3 — Write a test that asserts on the captured output. This is the characterisation test. It is not elegant. It may assert on a 300-line string of serialised state. That is fine. The point is to detect change, not to read nicely. Approval-test libraries (ApprovalTests, snapshot testing in Jest, Verify in .NET) automate most of this — they diff the current output against the committed golden file and fail the test if anything differs.

Step 4 — Now refactor. With the test in place, you can start introducing proper seams, splitting the function, extracting dependencies, applying dependency inversion. If any step breaks behaviour, the characterisation test tells you immediately. When the structure is clean enough to support intent-based tests, you replace the characterisation test with real unit tests — and delete the characterisation test. It was scaffolding.
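Step 4 in miniature: one common seam-extraction move is turning a hard-coded dependency into an injectable parameter with a production-preserving default. The example below is hypothetical — a legacy function whose behaviour depends on the real wall clock:

```python
import datetime

def is_expired_legacy(created):
    # Before: untestable — behaviour depends on today's actual date.
    return (datetime.date.today() - created).days > 30

def is_expired(created, today=None):
    # After: `today` is a seam. Tests inject a fixed date;
    # production callers pass nothing and get the old behaviour.
    today = today or datetime.date.today()
    return (today - created).days > 30

# The behaviour can now be pinned deterministically:
assert is_expired(datetime.date(2026, 1, 1), today=datetime.date(2026, 3, 1))
assert not is_expired(datetime.date(2026, 2, 20), today=datetime.date(2026, 3, 1))
```

The characterisation test guards this change: if the default path drifts from the legacy behaviour, the golden comparison fails.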

The four-step characterisation test workflow — find a seam, capture output, pin behaviour, refactor with a safety net
Figure 2: The four-step loop — characterisation tests are scaffolding that earns its keep for one refactor and then retires.

The traps are symmetric:

  • Skip step 2 and your assertions are hope.
  • Skip step 3 and your refactor is a gamble.
  • Skip step 4 and your characterisation tests become permanent fixtures, resistant to the legitimate behaviour changes that will come later.

Why This Matters More in 2026

The characterisation test workflow was invented for human developers in 2004. In 2026 it becomes a different kind of tool: a safety rail for AI coding agents.

AI agents are pattern amplifiers. On a codebase with strong tests, they refactor confidently — the tests catch mistakes. On a codebase without tests, they generate plausible-looking changes that can silently break behaviour, and CI stays green because there is nothing verifying what used to work.

CodeRabbit's 2025 analysis of 470 open-source pull requests found 1.75x more logic and correctness errors in AI-authored PRs than in human-authored ones. Meanwhile, GitClear's data shows refactoring activity collapsed from 25% of changed lines in 2021 to under 10% in 2024. AI is generating faster than teams are reshaping.

Characterisation tests invert that dynamic. The sequence becomes:

  1. Ask the agent to generate characterisation tests for the module, with three to five representative inputs. Commit the golden files.
  2. Now ask the agent to refactor.
  3. If the refactor silently changes behaviour, the test fails. The feedback lands in seconds, not in a production incident three weeks later.

The agent is not trusted by assumption. It is constrained by tests it just wrote, against the behaviour the codebase had five minutes ago. This is how AI refactoring of legacy code becomes viable instead of terrifying — and it is the specific practice that lets the "do not rewrite, improve incrementally" advice from the rewrite trap actually survive contact with a 200,000-line codebase.

Start Here

If your team has been stuck in the catch-22, pick one module this week.

  1. Pick the scariest function. Not the easiest. The one everyone routes around. That is where the highest leverage is.
  2. Characterise it before you change it. Capture outputs with three to five representative inputs — include the obvious happy path and two edge cases that people already know about. Commit the golden files to the repo.
  3. Refactor one thing. Extract a dependency. Split the function. Introduce a seam. Run the characterisation test.
  4. When the structure is clean enough, write the real tests. Intent-based unit tests that describe what the code should do. Then delete the characterisation test. It was scaffolding, and scaffolding retires when the building stands.
  5. Do this again next week. The compounding effect is the point. Each characterisation test makes the next refactor safer, which makes the next intent-based test easier to write, which makes the next characterisation test smaller in scope.

The Bottom Line

The legacy catch-22 is not broken by finding discipline. It is broken by using a different kind of test for a different job. Characterisation tests are not correctness tests. They are a snapshot of the current reality, committed to the repository, so that you can change the structure of the code without changing what it does. They are the entry point. The rewrite is not, and it never was. Once the safety net is up, everything else follows — including, eventually, the cleaner code you were hoping to rewrite toward.

Frequently Asked Questions

What is a characterisation test?

A characterisation test is a test that captures and pins down what legacy code currently does — bugs, quirks, and all — rather than what it is supposed to do. Michael Feathers coined the term in Working Effectively with Legacy Code. The test's job is to detect any change in behaviour during refactoring, so you can change the structure of the code without silently changing what it produces.

How do characterisation tests differ from unit tests or TDD tests?

How do I start adding tests to a legacy codebase with zero coverage?

Why are characterisation tests important when using AI coding agents on legacy code?

Want to see where your codebase is too untested to refactor safely?

Connect your repo and get a free engineering health diagnosis. We map the modules with no test coverage, the scariest functions to touch, and where a characterisation test would earn its keep first.

Get Your Free Diagnosis
