Three threats compound in every codebase: defects, technical debt, and non-determinism. Dashboards show green because each threat hides behind different metrics that rarely intersect. AI amplifies all three simultaneously: it generates code faster than teams can refactor, ships code without developer understanding, and introduces inference-level variance that unit tests cannot catch. Seeing all three at once is the starting point for engineering health.
Your dashboards show green. Test coverage is stable. Change failure rate is flat. Deployment frequency is holding. By every metric you track, the codebase is healthy.
And yet every tech lead reading this knows the feeling: features take longer than they should. Simple changes ripple in unexpected ways. The same bug keeps showing up in the same module. The test suite fails on one branch, passes on rerun, nobody investigates. AI agents ship code, reviews get rubber-stamped, and nobody can quite explain why the codebase feels heavier every month.
That gap — between what the dashboards say and what the team experiences — is where three threats compound. Defects. Debt. Non-determinism. They are each measured by different metrics, hidden in different corners, and accelerated by the same force: AI generating code faster than any team can absorb it.
Here is how to see all three at once.
Why the Dashboards Miss Them
In February 2026, Martin Fowler and Margaret Storey proposed a "triple debt" model — technical debt, cognitive debt, and intent debt — because classical technical debt no longer captures what is happening in AI-assisted codebases. Around the same time, Dave Farley called vibe coding "the worst software idea of the year," identifying three problems: imprecise specifications, insufficient automated testing, and the difficulty of making incremental changes to generated code.
They are circling the same observation from different angles. The health of a codebase is not one metric. It is not even one category of metrics. It is a system of compounding pressures that standard dashboards were never built to surface.
The three threats below are how that system shows up inside the code itself — as things a tech lead can point at, measure, and contain.
Threat 1: Defects

Defects are the most familiar threat, which is exactly why they are underestimated in 2026.
Every codebase has a defect budget — the delta between bugs introduced and bugs caught. In a healthy codebase, pre-commit checks, component tests, contract tests at boundaries, and mutation analysis catch defects close to the source. The cost of a defect caught at pre-commit is measured in minutes. The cost of the same defect caught in production is measured in incidents, customer trust, and engineer time reconstructing context around code that was written last week.
AI changes the math.
CodeRabbit's 2025 analysis of 470 open-source pull requests found 1.64x more maintainability issues, 1.75x more logic and correctness errors, and 1.57x more security findings in AI-authored code compared to human-authored code. That is not a small delta. It is a structural shift in the defect baseline.
Worse, the defects hide differently. When a human writes code and the tests pass, the tests and the code were produced by the same mental model — if the model was wrong, both will likely be wrong together, and reviewers can often catch it. When an AI agent writes code and generates tests to match, the tests verify what the AI wrote, not what the specification required. Green CI becomes a lagging indicator of agreement between the AI and itself.
This is why defect source matters more than defect count. The CD knowledge base at beyond.minimumcd.org catalogues ten categories of defect sources — from testing and observability gaps to data and state issues to integration boundaries. The point of the catalogue is not to memorise categories. It is to answer a specific diagnostic question: where in the value stream are defects actually originating in your codebase, and how early are you catching them?
Most teams discover their defects are clustered in two or three sources — usually boundaries between services, data migrations, and untested edge cases. Those are the places where mutation testing, contract tests, and component-level characterisation earn their keep. Everywhere else, you are paying for coverage that does not catch anything.
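The mapping exercise is small enough to do in a spreadsheet, or in a few lines of code. A minimal sketch, with illustrative incident IDs and category names rather than the knowledge base's exact taxonomy:

```python
from collections import Counter

# Map each recent incident to the defect source it originated from.
# Incident IDs and category names are illustrative; use your own taxonomy.
incidents = [
    ("INC-101", "integration-boundary"),
    ("INC-102", "data-migration"),
    ("INC-103", "integration-boundary"),
    ("INC-104", "untested-edge-case"),
    ("INC-105", "integration-boundary"),
    ("INC-106", "data-migration"),
]

by_source = Counter(source for _, source in incidents)

# Most teams find two or three sources dominate; invest in tests there first.
for source, count in by_source.most_common():
    print(f"{source}: {count}")
```

Even with twenty incidents instead of six, the shape of the answer is usually obvious: a couple of categories carry most of the weight, and everything else is noise.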
Threat 2: Debt
Technical debt is the threat everyone claims to understand and nobody measures well. Martin Fowler's debt quadrant — deliberate/inadvertent on one axis, prudent/reckless on the other — assumed human decisions. AI removed the decision. An AI agent is not prudent or reckless. It does not deliberate. It generates.
This is why Fowler and Storey extended the model. Technical debt describes the code. Cognitive debt describes what the team no longer understands about the code. Intent debt describes the reasoning that was never written down — the reasoning AI agents need to make consistent decisions in future changes.
The pattern in AI-assisted codebases is predictable. The AI generates a new utility function that duplicates one three files away. It introduces a third pattern for error handling in a module that already has two. It copies a data access approach that the team decided to abandon last quarter. Each individual change passes review. Each one is locally reasonable. Together, they form architectural drift.
GitClear's research on AI-era codebases captures the shape of the drift with two numbers: refactoring activity dropped from 25% of changed lines in 2021 to less than 10% in 2024. Code cloning rose from 8.3% to 12.3%. Teams are not reshaping code anymore. They are adding to it.
Here is the cruel part: the debt is invisible to standard metrics. Test coverage is stable or improving. Change failure rate is flat. Deployment frequency holds. But development cycle time creeps up because every new change must navigate around inconsistencies previous changes introduced.
Gene Kim and Steve Yegge tell a concrete version of this in their Vibe Coding book. Kim describes finding that his writer's workbench tool had devolved into an incomprehensible 3,000-line function that took days to untangle. They also cite Dr. Dan Sturtevant's research showing developers working in tangled, non-modular systems are 9x more likely to quit or be fired. Debt is not just a code problem. It is an attrition problem.
Containing debt requires three things standard workflows do not provide by default:
- Scheduled refactoring sessions with their own intent and acceptance criteria (no behaviour changes, existing tests pass, named structural targets).
- Structural review gates — not correctness review, but duplication detection, complexity thresholds, and architecture fitness functions running in CI.
- Constraints in feature descriptions — so AI agents distinguish deliberate new patterns from drift.
Without those, AI accelerates debt faster than any team can absorb.
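Of the three, the structural review gate is the easiest to prototype. A minimal sketch of a duplication check using Python's `ast` module, flagging functions whose normalised bodies are identical; the function names and sample code are illustrative, and a production gate would run this across the repository in CI:

```python
import ast
from collections import defaultdict

def duplicate_functions(source: str) -> list[list[str]]:
    """Group function names whose bodies are structurally identical."""
    tree = ast.parse(source)
    buckets: dict[str, list[str]] = defaultdict(list)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            # Re-parse the unparsed function and blank its name, so only
            # the argument list and body shape are compared.
            clone = ast.parse(ast.unparse(node)).body[0]
            clone.name = "_"
            buckets[ast.dump(clone)].append(node.name)
    return [names for names in buckets.values() if len(names) > 1]

code = """
def parse_user(raw):
    return raw.strip().lower()

def parse_owner(raw):
    return raw.strip().lower()

def total(xs):
    return sum(xs)
"""

print(duplicate_functions(code))  # parse_user and parse_owner grouped
```

This only catches exact structural clones, which is the point: it is a cheap first gate, and failing it is a prompt to either deduplicate or document why the pattern is deliberate.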
Threat 3: Non-Determinism
Non-determinism is the newest threat in 2026, and the one teams are least equipped to diagnose — because it hides across three layers that are usually owned by different people.
Layer 1: Test architecture. Teams with inverted test pyramids — too few component tests, too many end-to-end tests — have flaky suites by construction. E2E tests depend on network, shared environments, timing, and external services. Any of these can produce a different result on each run. Retry-until-green becomes routine. Real regressions hide behind the noise. The CD knowledge base documents this as the single most common symptom in teams that have lost confidence in their suite.
Layer 2: Environments. Snowflake environments drift apart over time. Tests pass locally, fail in CI. Pass in CI, fail in staging. Pass in staging, fail in production. The defect is not in the code or the test. It is in the environment variance itself.
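A drift check does not need heavy tooling to get started. A minimal sketch that diffs two configuration snapshots; the keys and values here are hypothetical, and real snapshots would come from your IaC state or an environment export:

```python
def config_drift(env_a: dict, env_b: dict) -> dict:
    """Return keys whose values differ between two environment snapshots."""
    keys = env_a.keys() | env_b.keys()
    return {
        k: (env_a.get(k), env_b.get(k))
        for k in keys
        if env_a.get(k) != env_b.get(k)
    }

# Hypothetical snapshots of two environments.
ci = {"DB_POOL_SIZE": "10", "TLS_VERSION": "1.3", "FEATURE_X": "on"}
staging = {"DB_POOL_SIZE": "50", "TLS_VERSION": "1.3"}

print(config_drift(ci, staging))
```

Run this as a scheduled job between every pair of environments and the "passes in CI, fails in staging" class of failure stops being a mystery.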
Layer 3: Inference. This is the new one. The Thinking Machines Lab post Defeating Nondeterminism in LLM Inference, by Horace He, documents why the same prompt to the same model with the same seed produces different outputs. The root cause is not floating-point arithmetic alone, as commonly believed. It is batch-size variance in how the inference server schedules work. When an AI agent runs twice against the same codebase, it can make different architectural choices: not because the model changed, but because the infrastructure serving it did.
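The mechanism is easier to believe with a toy reduction. Floating-point addition is not associative, so the grouping a batch size imposes changes the answer; this is a deliberately simplified illustration of why kernels whose reduction strategy varies with batch size break run-to-run reproducibility:

```python
# Floating-point addition is not associative, so the grouping imposed
# by a batch size changes the result of a reduction.
xs = [0.1, 1e16, -1e16, 0.1]

# "Batch size 1": accumulate strictly left to right.
sequential = 0.0
for x in xs:
    sequential += x

# "Batch size 2": partial sum per batch, then combine.
batched = (xs[0] + xs[1]) + (xs[2] + xs[3])

print(sequential, batched)  # 0.1 vs 0.0 -- same inputs, different answer
```

Scale that from four numbers to billions of multiply-accumulates per token, with the batch grouping decided by whoever else's requests share your inference server, and "same prompt, same seed, different output" stops being surprising.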
Dave Farley has been blunt about what this means for engineering practice: "Programming languages are deterministic. AI is not. This changes the reliability characteristics of our tools." It changes how teams think about reproducibility, testing, and trust.
Containing non-determinism means treating each layer separately:
- Test architecture: component tests with real dependencies where possible, test doubles where not, E2E reserved for critical user journeys only.
- Environments: infrastructure-as-code, no manual configuration, parity enforced by CI.
- Inference: treat AI-generated code as a draft that needs deterministic verification — tests, type checks, architectural constraints, structural review — before it lands in main.
Without the third layer, the first two layers start lying. A team that has not made its AI workflow deterministic will see flaky outcomes that look like environment issues but actually come from the model picking a different path on Tuesday than it picked on Monday.
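The test-architecture bullet above, sketched as code. The payment-gateway names are hypothetical; the point is that the in-process double removes network, timing, and shared state from the component test:

```python
from typing import Protocol

class PaymentGateway(Protocol):
    def charge(self, amount_cents: int) -> bool: ...

# Hypothetical component under test: checkout that depends on an
# external payment service.
def checkout(cart_cents: list[int], gateway: PaymentGateway) -> str:
    total = sum(cart_cents)
    return "paid" if gateway.charge(total) else "declined"

# Deterministic in-process double: no network, no retries, no flakes.
class FakeGateway:
    def __init__(self, approve_up_to: int):
        self.approve_up_to = approve_up_to
        self.charged: list[int] = []

    def charge(self, amount_cents: int) -> bool:
        self.charged.append(amount_cents)
        return amount_cents <= self.approve_up_to

# Component test: same result on every run, on every machine.
gw = FakeGateway(approve_up_to=5000)
assert checkout([1200, 800], gw) == "paid"
assert checkout([9000], gw) == "declined"
assert gw.charged == [2000, 9000]
```

The real gateway still needs coverage, but at a boundary: one contract test against the provider, not a network call inside every test that touches checkout.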
How to See All Three at Once
The three threats are linked. Defects thrive where tests are ineffective. Ineffective tests are usually a symptom of inverted pyramids and environment drift — which is non-determinism. Non-deterministic workflows make structural drift invisible — which is debt. Debt increases the surface area where new defects can hide.
This is why single-metric dashboards miss all three. Coverage does not measure test effectiveness. Change failure rate does not measure debt. Deployment frequency does not measure structural drift. You need a diagnostic that looks at the interaction.
Here is the minimum viable assessment a tech lead can run this week:
For defects:
- Where are defects actually originating? (Map the last 20 incidents to defect-source categories.)
- What is your mutation score on modules touched by AI in the last quarter?
- How much of your test suite is component-level versus E2E?
For debt:
- Has your development cycle time increased in the last 60 days while coverage stayed flat?
- What percentage of changes in the last month were refactoring versus new code?
- How many duplicate utility functions does static analysis find in your AI-touched modules?
For non-determinism:
- What percentage of CI runs fail on first attempt and pass on rerun?
- Do your environments come from infrastructure-as-code?
- Does your AI workflow generate different architectural choices for the same prompt?
If any of those questions are uncomfortable to answer, that is where the threat is hiding.
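The first non-determinism question is also the easiest to automate. A minimal sketch of a fail-then-pass rate over illustrative CI run records; a real version would pull these tuples from your CI provider's API:

```python
from collections import defaultdict

# CI history as (pipeline_id, attempt, passed) tuples -- illustrative data.
runs = [
    ("build-1", 1, True),
    ("build-2", 1, False), ("build-2", 2, True),   # flaky: rerun passed
    ("build-3", 1, False), ("build-3", 2, False),  # real failure
    ("build-4", 1, False), ("build-4", 2, True),   # flaky
]

by_pipeline: dict[str, dict[int, bool]] = defaultdict(dict)
for pid, attempt, passed in runs:
    by_pipeline[pid][attempt] = passed

# A pipeline is "flaky" if attempt 1 failed and a later attempt passed.
flaky = sum(
    1 for attempts in by_pipeline.values()
    if attempts.get(1) is False and any(attempts[a] for a in attempts if a > 1)
)
flake_rate = flaky / len(by_pipeline)
print(f"fail-then-pass rate: {flake_rate:.0%}")
```

Track that one number weekly. If it is above a few percent, retry-until-green has become a habit, and real regressions are hiding behind it.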
The Bottom Line
Defects, debt, and non-determinism are not three independent problems. They are three faces of the same problem: a codebase whose health is no longer measured by the metrics it reports. AI made this problem visible by accelerating all three at once. Kent Beck, Dave Farley, Martin Fowler, and the DORA data all converge on the same conclusion — structural quality requires deliberate feedback loops, not dashboards. The threats are hiding where the dashboards do not look. Start looking there.