The Flaky Test Is the Most Expensive Test You Have

Flaky tests don't just waste CI time — they erode the pipeline's credibility. Once a team learns that red means "retry and see," a real regression has nowhere to surface. The most expensive consequence isn't the retries — it's the longer MTTR when engineers hesitate to act on a real failure hiding in flakiness noise.

Your CI pipeline went red. You clicked Retry. It went green. You merged.

On many teams this happens dozens of times a month, without anyone ever deciding it is okay. Nobody is counting.

Nobody made a policy. Nobody said "when the build goes red, just retry." It became what you do because it works often enough to feel fine, and because the alternative — stopping to investigate every red — is too slow when you have a sprint to finish.

This is how flaky tests become the most expensive thing in your codebase. Not because of the retries. Because of what the retries teach your team.

When Retry Becomes Policy

The cost of a flaky test is not the minutes spent re-running builds. Those costs are visible — you can count them, complain about them in retros.

The cost that compounds invisibly is what happens to your team's model of what red means.

Red is supposed to mean: something is wrong, do not merge. That is the entire value proposition of a CI pipeline. A team that trusts red stops defects at the boundary. A team that does not trust red has no pipeline — they have a ceremony that occasionally produces useful information.

Flakiness erodes trust one retry at a time. The first is a judgment call. The tenth is a habit. The fortieth is invisible — you have stopped seeing the red.

What Actually Dies When a Test Flakes

The test does not die. It still runs. It still produces results.

What dies is the signal.

With fewer but reliable tests, you at least know the edges of your confidence: what is covered and what is not. With flaky tests you have false confidence: green builds that arrived via retry, and coverage that does not protect you.

The Cascade

Flakiness does not cause a single catastrophic event. It creates a slow degradation in your team's ability to detect real failures.

One test flakes. The team retries, it passes, the pattern becomes habit. A second test flakes. Then a third. The "retry and it'll probably pass" response generalises until it is no longer specific to known-flaky tests — it becomes the default response to any red build.

Here is what that looks like when it lands.

Consider a team whose auth token refresh test flakes intermittently — an async timing issue on a session API call. Engineers learn to retry it. It goes green reliably on the second attempt. Nobody investigates for six weeks. When a genuine token refresh regression ships, the pipeline goes red. Engineers retry. The test keeps failing. It takes four re-runs and a Slack escalation before someone treats the red as real.

The bug reaches production. The incident review attributes it to the regression. The retry habit — the enabling condition — goes unnamed.

The Operational Cost

The argument against flaky tests is usually framed as developer productivity: wasted CI minutes, interrupted flow. That is real but it is the smallest part of the problem.

Three operational consequences matter more.

Longer MTTR. When an incident occurs, a pipeline failure is often the first signal. A team conditioned to retry does not act on that signal immediately — they retry, check Slack, retry again. By the time someone treats the red as real, the incident is older and the blast radius wider.

Release hesitation. Teams with unreliable pipelines delay deployments — not because the code is unsafe, but because they cannot trust the gates. This produces less frequent releases, larger batch sizes, and higher change failure rates when they do ship. The pipeline is supposed to give confidence to deploy. A flaky pipeline gives confidence to wait.

Alert fatigue that generalises. The habit of dismissing red does not stay contained to CI. Engineers who have learned that signals are probably noise apply that judgment to monitoring, to error rates, to staging failures. Flaky tests are practice for ignoring real signals everywhere.

What Makes a Test Flake

Flakiness looks random, but it is not. A non-deterministic result always has a cause, and four sources cover most cases.

Async timing. The test asserts before state has settled. It passes when the system is fast and fails when it is slow, and CI runners are often slower and noisier than a developer machine. Any test that uses an arbitrary sleep to "wait long enough" is a test waiting to flake.
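One way to replace an arbitrary sleep is to poll for the state change itself. The sketch below is a minimal illustration, not a prescribed implementation: `wait_until` and `FakeSession` are hypothetical names invented for this example, and the timeout here is an upper bound on how long the test is willing to wait, not a guess at how long the system takes.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 5.0, interval: float = 0.05) -> None:
    """Poll `condition` until it returns True, or raise once `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


class FakeSession:
    """Hypothetical stand-in: reports a refreshed token after a few polls."""

    def __init__(self, polls_until_ready: int) -> None:
        self._remaining = polls_until_ready

    def token_refreshed(self) -> bool:
        self._remaining -= 1
        return self._remaining <= 0


# The test proceeds as soon as the state settles, fast machine or slow.
session = FakeSession(polls_until_ready=3)
wait_until(session.token_refreshed, timeout=2.0, interval=0.01)
```

When the condition never becomes true, the helper raises instead of letting the test assert against unsettled state, so a failure points at the missing state change rather than at a timing accident.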

Shared state. Tests that read or mutate state left behind by other tests (a database row, a global, a temp file) fail when execution order shifts. Add a new test above one of them, change the suite, and yesterday's green is today's red, even though the code under test did not change.

Test order dependence. Tests that produce data other tests depend on break silently when the upstream test moves or is removed. The coupling is invisible until it fails in an order nobody predicted.
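Both shared state and order dependence usually yield to the same remedy: each test builds the state it needs and nothing else. The following is a small sketch using the standard library's `unittest`; `InMemoryUserStore` is a hypothetical class that exists only for illustration.

```python
import unittest


class InMemoryUserStore:
    """Hypothetical store used only for this example."""

    def __init__(self) -> None:
        self._users = {}  # maps user name to email

    def add(self, name: str, email: str) -> None:
        self._users[name] = email

    def count(self) -> int:
        return len(self._users)


class UserStoreTests(unittest.TestCase):
    def setUp(self) -> None:
        # A fresh store per test: nothing leaks in from an earlier test.
        self.store = InMemoryUserStore()

    def test_add_user(self) -> None:
        self.store.add("ada", "ada@example.com")
        self.assertEqual(self.store.count(), 1)

    def test_store_starts_empty(self) -> None:
        # Holds whether this runs before or after test_add_user.
        self.assertEqual(self.store.count(), 0)
```

Because `setUp` runs before every test method, reordering or removing a test cannot change what another test sees.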

External service calls. Tests that hit real APIs, databases, or file systems fail when those services have transient issues. A test that fails on the fiftieth run because of an API timeout is not a test. It is a bet on infrastructure stability.
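A controlled fake at the integration boundary removes the bet. The sketch below assumes a hypothetical currency-rate dependency; `RateClient`, `FakeRateClient`, and `convert` are invented names, and the point is only that the code under test depends on an interface the test can satisfy without a network.

```python
from typing import Mapping, Protocol


class RateClient(Protocol):
    """The boundary: anything that can return a rate for a currency."""

    def get_rate(self, currency: str) -> float: ...


class FakeRateClient:
    """Controlled fake: returns fixed rates and never touches the network."""

    def __init__(self, rates: Mapping[str, float]) -> None:
        self._rates = rates

    def get_rate(self, currency: str) -> float:
        return self._rates[currency]


def convert(amount: float, currency: str, client: RateClient) -> float:
    """Code under test: depends on the interface, not on a live service."""
    return round(amount * client.get_rate(currency), 2)


def test_convert_uses_rate() -> None:
    client = FakeRateClient({"EUR": 0.5})
    assert convert(10.0, "EUR", client) == 5.0
```

The real client still deserves its own tests against the real service, but those can run in a separate, clearly labelled job whose failures are not confused with regressions in application logic.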

Here is what tracking this data actually looks like in practice:

Figure 1: Flakiness rate by test, aggregated over 30 days. Any rate above 1% under unchanged conditions requires triage.


Most CI systems have this data. The pass/fail record per test per run is there. What is missing is the aggregation — and because nobody surfaces it by default, the pattern compounds undetected.
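The aggregation itself can be small. The sketch below makes two assumptions that are not from any particular CI system: run records are available as hypothetical `(test, commit, passed)` tuples, and "unchanged conditions" is approximated as "same commit". Under that reading, a failure counts as flaky only if the same test also passed on the same commit.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class RunRecord(NamedTuple):
    test: str     # test identifier
    commit: str   # commit SHA the run executed against
    passed: bool


def flakiness_rates(runs: Iterable[RunRecord]) -> Dict[str, float]:
    """Per test: share of runs that failed on a commit where the test also passed."""
    outcomes = defaultdict(lambda: [0, 0])  # (test, commit) -> [passes, failures]
    totals = defaultdict(int)               # test -> total runs
    for run in runs:
        outcomes[(run.test, run.commit)][0 if run.passed else 1] += 1
        totals[run.test] += 1

    flaky_failures = defaultdict(int)
    for (test, _commit), (passes, failures) in outcomes.items():
        if passes and failures:  # same code, different results
            flaky_failures[test] += failures
    return {test: flaky_failures[test] / total for test, total in totals.items()}


def needs_triage(rates: Dict[str, float], threshold: float = 0.01) -> List[str]:
    """Tests whose flakiness rate exceeds the threshold (1% by default)."""
    return sorted(test for test, rate in rates.items() if rate > threshold)


runs = [
    RunRecord("test_token_refresh", "abc123", False),
    RunRecord("test_token_refresh", "abc123", True),   # retried, went green
    RunRecord("test_login", "abc123", True),
    RunRecord("test_login", "def456", False),          # failed only on new code
]
print(needs_triage(flakiness_rates(runs)))
```

In this sample, `test_login` is not flagged: it failed only on a commit where it never passed, which looks like a real regression rather than flakiness.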

Quarantine Is Containment, Not a Fix

The standard first response to a known-flaky test is quarantine: tag it, remove it from the main pipeline, deal with it later.

Quarantine is operationally defensible as short-term containment. If a test is failing intermittently and a release is blocked, removing it from the critical path while the team investigates is reasonable triage — not capitulation.

The problem is indefinite quarantine without ownership or a deadline.

A quarantined test that stays quarantined for three weeks has effectively been deleted. It does not run. It does not protect anything. The difference from deletion is that the coverage gap is invisible — the test appears in your count, in your reports, in the assumption that "we have tests for that." It contributes nothing except false coverage that is harder to notice than an outright gap.

Treat quarantine like a P2 incident: it needs an owner, a deadline, and a definition of done. A test that exits quarantine must be deterministic — not "seems to be passing again."
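One way to keep quarantine from becoming indefinite is to record it as data and check the deadlines automatically. This is a sketch under assumptions: the `Quarantine` record and its field names are hypothetical, and where the list lives (a file in the repository, a tracker) is left open.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Quarantine:
    test: str
    owner: str
    deadline: date
    exit_criteria: str  # the definition of done, written down up front


def overdue(entries: List[Quarantine], today: date) -> List[Quarantine]:
    """Return quarantined tests whose deadline has already passed."""
    return [entry for entry in entries if today > entry.deadline]


entries = [
    Quarantine(
        test="test_token_refresh",
        owner="auth-team",
        deadline=date(2024, 3, 15),
        exit_criteria="explicit wait on refresh; repeated runs with zero failures",
    ),
]

# A scheduled job could fail, or notify the owner, when this list is non-empty.
expired = overdue(entries, today=date(2024, 3, 20))
```

The dates and team name are placeholders. What matters is that an expired quarantine becomes a visible failure instead of a silent coverage gap.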

The fix at root is removing the non-determinism: explicit state-change waits instead of arbitrary timeouts, isolated setup and teardown per test, controlled fakes at integration boundaries instead of live service calls.

What a Trustworthy Pipeline Looks Like

A trustworthy pipeline has one property: red means stop.

The operational mechanism: track test stability as a metric. Pass rate per test over 30 days, under unchanged conditions. Any test above a 1% failure rate enters triage with an owner and a deadline. The threshold matters less than consistency — 1% works as an early warning because flakiness compounds long before it becomes frequent. A test that fails 1 in 100 runs is already teaching your team to second-guess red.
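The arithmetic behind that compounding is short. Assuming flaky tests fail independently of one another (a simplification, since real suites often share causes), the chance that a build goes red for no real reason is one minus the chance that every flaky test happens to pass. The function name below is made up for this example.

```python
def spurious_red_probability(per_test_failure_rate: float, flaky_tests: int) -> float:
    """Chance that at least one of `flaky_tests` independent flaky tests fails in a build."""
    return 1.0 - (1.0 - per_test_failure_rate) ** flaky_tests


for count in (1, 10, 50):
    probability = spurious_red_probability(0.01, count)
    print(f"{count} flaky tests at 1% each: {probability:.1%} of builds go red spuriously")
```

Under that independence assumption, fifty tests that each fail 1 run in 100 turn roughly 4 builds in 10 red without any defect behind them. That is more than enough to make Retry the default.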

A small number of highly visible flaky tests can erode the norm. Every retry tells the whole team that red might be noise. One person watching you click Retry and move on is enough.

The prevention mechanism: quality gates that catch non-deterministic patterns before they merge. A gate that rejects a test with hardcoded timing assumptions at review time is cheaper than retroactive quarantine after the retry habit has spread.

The Bottom Line

Your CI pipeline produces one thing that matters: a trustworthy answer to "is this safe to ship?"

Flaky tests corrupt that answer without breaking the pipeline. The ceremony continues. The signal erodes.

Track which tests in your suite fail under unchanged conditions. If the number is not zero, the retry habit is already forming. The question is whether you find it before or after a real failure hides in the noise.
