The Half-Life of AI-Generated Code

AI-generated code passes review on day one and ages worst by month six — not because the code is wrong, but because the cognitive step where intent used to form as a byproduct of typing is the step that gets skipped. The code carries no reasoning to inspect later; the next change has to re-decide under uncertainty whether the original choice was load-bearing or incidental. The teams that hold the line encode intent in artifacts the next iteration is required to read — behavior-first tests gated on mutation score, decision notes required on non-trivial PRs, and architectural fitness functions that fail builds when invariants are violated.

Picture the call six months from now. An engineer has a four-hundred-line module open in their editor. The PR that introduced it merged clean. The tests still pass. The architecture rules in CI still hold. The change they were asked to make is trivial — add a field, surface it through one endpoint, ship before end of week.

It is day three. The PR is not open yet. The engineer is staring at a Map<String, OrderState> and trying to decide whether the choice was load-bearing or incidental — whether the original author picked it deliberately, or whether the AI that wrote it produced it by autocomplete and nobody noticed. The original author moved teams two months ago. The AI does not remember the conversation. The git blame is a developer who Tab-accepted the suggestion and could not, even at the time, have told you why a Map and not a ConcurrentHashMap.

This is not a bug. The module is correct. The tests are honest. The architecture is intact. But the change that should have shipped Tuesday is going to ship next Monday, and the engineer doing it is going to be wrong about whether to over-protect or rip out at least half the time.

What aged worst is not the code. It is the intent that was never in the diff to begin with.

What Passes Review Is Not What Survives

Reviewers check what they can see. Does this work? Does it compile? Do the tests pass? Are the names sensible? Does the change fit the existing patterns? That is the bar most pull requests are asked to clear, and AI-generated code clears it consistently. Surface plausibility is the property AI is best at producing.

What reviewers cannot check is whether the code as shipped carries the reasoning that makes it modifiable later. They have never been able to check it, and nothing in the AI-augmented workflow has changed that. Why this shape and not the alternative? What constraint was being protected? What deprecation path was implicit in the signature? These are not questions a reviewer asks of the diff. They are questions the next person to touch the code asks of the author. For most of software's history, the author was reachable. They were on the team. The reasoning lived in the same head that did the typing, and that head could be queried.

The AI-augmented loop quietly broke that link. The "author" of the code is now a process with no memory across sessions. The "author of the PR" is a developer whose intent — to the extent any was held — lived in a Cursor session that closed when the tab did. The code goes into main. The intent, if it ever existed in articulable form, does not.

For about thirty years the software craftsmanship movement argued that ownership was the central virtue of the craft. We covered the merge-time version of that argument in "AI Wrote the Bug. You Shipped It." The half-life version is its sibling, and the one that produces longer bills. Ownership at merge time is one signature on the PR. Ownership across the half-life of the code is whether the next iteration can defend the same choices, and what trail you left so they can.

[Figure 1: Two timelines from the moment a PR merges through month 6. Top lane: the diff (variable names, function signatures, test assertions, control flow) persists unchanged. Bottom lane: the intent that produced the diff (why this shape, what constraint was being protected, what deprecation path was implicit) fades from full opacity at merge to a question mark by month 6 unless explicitly encoded.]

Figure 1: What gets written into the diff and what gets read out of it later are not the same artifact. The gap is the intent debt.

Why Intent Isn't in the Diff

Intent used to be a byproduct of the typing — not a separate step, a byproduct. The friction that's now optional is the friction that did most of the work.

A human engineer who picks Map<String, OrderState> over ConcurrentHashMap<String, OrderState> has reasoned about concurrency. They considered who calls this code, on which thread, under what contention. The reasoning may not be written down anywhere. It does not have to be. It lived in the same head that produced the line, and as long as that head is available, the choice carries the reasoning even when the comment doesn't.

An AI produces the same line without the reasoning. This is not because the model can't reason; it can produce a perfectly serviceable explanation if asked. It is because the explanation is generated against the code that already exists, not against the constraints that should have shaped it. The reasoning is post-hoc, plausible, and not load-bearing on the actual decision. The developer who Tab-accepted the line never had to evaluate the alternative, because the alternative was never presented in a way that asked them to. The cognitive step where intent forms, the brief, mandatory pause where a human engineer weighs an alternative and rejects it, is the step that gets skipped. Documentation is not what's missing. The decision moment is.

Six months later, this is what the next change runs into. The choice carries no reasoning because no reasoning was produced. The next engineer cannot ask the AI — it has no memory of the session. They cannot ask the developer — the developer cannot remember weighing it because they did not weigh it. The choice is uninspectable, not because nobody documented it, but because nothing happened that could have been documented. AI shifts the uncertainty from "does this code work?" to "what assumptions is this code secretly depending on?" That is a worse class of question — there is no green CI signal and no test failure that exposes it. Over-protect the choice (lock it down, refuse to refactor) or under-protect it (rip it out, sometimes pay the bill) — both inject risk on every change.

This is the mechanism the pattern-amplifier thesis compresses into three words. AI does not invent. It picks from the surface of what looks right. When the surface of what looks right coincides with what is right, the output is excellent. When it doesn't, the output looks identical and the cost arrives later — in uncertainty, not in bugs.

Margaret-Anne Storey named this category in early 2026, first in a February analysis on cognitive debt under AI, then in her March paper "From Technical Debt to Cognitive and Intent Debt," which proposes a triple-debt model that places intent debt alongside technical debt and cognitive debt. The framing matters because it locates the cost in the right place. Technical debt describes the code. Cognitive debt describes what the team no longer understands about it. Intent debt describes the reasoning that was never written down. AI doesn't reliably add to the first two on every merge. It adds to the third in a specific way: every undocumented decision becomes a future branching point someone has to re-decide under uncertainty, with no signal telling them whether the original choice was load-bearing.

We've sketched this elsewhere — the three threats covered intent debt as one of three compounding pressures dashboards miss; the senior-looking-code piece covered the comprehension version of the gap at write-time. This post is about the version that becomes visible at read-time, six months later, when the engineer who shipped the code is no longer the engineer staring at it.

How to Measure the Decay

"Half-life" is not a number. It is a way of thinking about a curve most engineering dashboards do not show, because the dashboard measures the moment of merge and the curve is everything that happens after.

There are four signals you can run this quarter without buying any new tooling. None of them require labelling commits as "AI-generated" — most teams can't do that reliably, and even rough labels rot fast. Track the symptoms, watch the trend lines as AI adoption ramps up across the team, and the curve emerges from data already in your repo.

Refactor abandon rate. Pull the last quarter's refactor PRs — the ones that change shape without changing behavior. How many merged within the original sprint? How many sat in review past seven days? How many were reverted within fourteen days of merging? Track the rate over the last three quarters. A climbing rate means engineers are giving up on refactors they otherwise would have completed — usually because they cannot reconstruct the original intent and cannot defend the new shape against assumptions they cannot see.
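
As a rough sketch of how that count could be computed, the snippet below folds the three questions into one rate. The `RefactorPR` record is hypothetical (fill it from whatever your Git host exports), and treating a never-merged refactor as abandoned is an assumption of this example, not something the thresholds above dictate.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class RefactorPR:
    """Hypothetical record for one refactor PR."""
    opened: datetime
    merged: Optional[datetime] = None    # None means the PR never merged
    reverted: Optional[datetime] = None  # None means it was never reverted


def is_abandoned(pr, stall_days=7, revert_days=14):
    """True if the PR never merged, stalled in review, or was reverted soon after merging."""
    if pr.merged is None:
        return True  # assumption: an unmerged refactor counts as abandoned
    if pr.merged - pr.opened > timedelta(days=stall_days):
        return True
    if pr.reverted is not None and pr.reverted - pr.merged <= timedelta(days=revert_days):
        return True
    return False


def abandon_rate(prs):
    """Share of refactor PRs in the sample that count as abandoned."""
    prs = list(prs)
    if not prs:
        return 0.0
    return sum(is_abandoned(pr) for pr in prs) / len(prs)


day = timedelta(days=1)
start = datetime(2026, 1, 5)
quarter = [
    RefactorPR(start, merged=start + 2 * day),                             # merged cleanly
    RefactorPR(start, merged=start + 10 * day),                            # sat in review past seven days
    RefactorPR(start, merged=start + 3 * day, reverted=start + 8 * day),   # reverted five days after merge
    RefactorPR(start),                                                     # never merged
]
print(abandon_rate(quarter))
```

Run the same function over each of the last three quarters and compare the three numbers; the trend matters more than any single value.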

Comprehension lead time on the second touch. Pull the lead time on the second non-trivial change to each module changed in the last quarter. That means the change that came after, not the change that introduced the code. The second change is where comprehension cost lives. If second-touch lead time is creeping up while first-touch stays flat, the code is shipping fast and becoming expensive to revisit.
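
One way to pull that comparison, sketched under the assumption that you can export each change as a `(module, opened, merged)` tuple and that trivial changes have already been filtered out. The function name and the sample data are illustrative only.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import median


def touch_lead_times(changes, touch_index):
    """Lead time in hours of the nth change (0 = first touch) to each module.

    changes: iterable of (module, opened, merged) tuples, non-trivial changes only.
    Modules with fewer than touch_index + 1 changes are skipped.
    """
    by_module = defaultdict(list)
    for module, opened, merged in changes:
        by_module[module].append((opened, merged))
    hours = []
    for history in by_module.values():
        history.sort()  # chronological by opened time
        if len(history) > touch_index:
            opened, merged = history[touch_index]
            hours.append((merged - opened).total_seconds() / 3600)
    return hours


t0 = datetime(2026, 1, 1)
h, d = timedelta(hours=1), timedelta(days=1)
changes = [
    ("orders", t0, t0 + 8 * h),
    ("orders", t0 + 30 * d, t0 + 30 * d + 40 * h),
    ("billing", t0, t0 + 6 * h),
    ("billing", t0 + 20 * d, t0 + 20 * d + 20 * h),
    ("search", t0, t0 + 10 * h),  # touched once, so no second-touch sample
]
first_touch = median(touch_lead_times(changes, 0))
second_touch = median(touch_lead_times(changes, 1))
print(first_touch, second_touch)
```

The median is used here because a single stuck PR can distort a mean; that choice is a judgment call, not part of the metric's definition.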

Rework at sixty and one-hundred-eighty days. The DORA 2025 report formalized Rework Rate as a fifth metric — the 2024 report had introduced it as an exploratory proxy for change failure rate — and most teams now track it at the commit-into-main level. Extend the window. Re-pull rework on the same commits at sixty days and at one-hundred-eighty days. Short-window rework catches the bugs the code shipped with. Long-window rework catches the bugs the missing intent shipped with — the changes that broke assumptions nobody documented because nobody held them.
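
A minimal sketch of the extended window, assuming you already have some way to attribute a later fix to the commit it reworks (for example through blame on the changed lines). That attribution is the hard part and is not shown; `merged_at` and `rework_events` are hypothetical inputs.

```python
from datetime import datetime, timedelta


def rework_rate(merged_at, rework_events, window_days):
    """Share of commits reworked within window_days of merging.

    merged_at: dict mapping commit id -> merge datetime
    rework_events: iterable of (original commit id, datetime of the reworking change)
    """
    if not merged_at:
        return 0.0
    window = timedelta(days=window_days)
    reworked = {
        sha
        for sha, when in rework_events
        if sha in merged_at and timedelta(0) <= when - merged_at[sha] <= window
    }
    return len(reworked) / len(merged_at)


merge_day = datetime(2026, 1, 1)
merged_at = {"a1": merge_day, "b2": merge_day, "c3": merge_day, "d4": merge_day}
rework_events = [
    ("a1", merge_day + timedelta(days=5)),
    ("b2", merge_day + timedelta(days=45)),
    ("c3", merge_day + timedelta(days=150)),
]
rates = {days: rework_rate(merged_at, rework_events, days) for days in (14, 60, 180)}
print(rates)
```

The same set of commits is measured three times; only the window changes, so the difference between the short and long figures is the late-arriving rework.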

"Why is this here?" rate on review comments. Skim the review comments on the last fifty PRs. Count the ones that ask, in any phrasing, for justification — "why this and not X?", "is there a reason for this pattern?", "do we still need this branch?" Track the rate per PR over time. A consistently rising rate is intent debt being paid down in review, one comment at a time. It is also the moment at which encoding the intent would have been cheapest.
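
If skimming by hand does not scale, a crude keyword pass can approximate the count. The patterns below are assumptions chosen to match the example phrasings above; expect both false positives and misses, and spot-check the output against a manual read.

```python
import re

# Hypothetical patterns for comments that ask for justification.
JUSTIFICATION_PATTERNS = [
    r"\bwhy\b.*\?",
    r"\bis there a reason\b",
    r"\bdo we (?:still )?need\b",
    r"\bwhat(?:'s| is) the (?:reason|rationale)\b",
]
_ASKS = re.compile("|".join(JUSTIFICATION_PATTERNS), re.IGNORECASE)


def asks_for_justification(comment):
    """True if the review comment looks like a request for the reasoning behind a choice."""
    return bool(_ASKS.search(comment))


def why_rate(prs):
    """Justification-seeking comments per PR. prs is a list of per-PR comment lists."""
    if not prs:
        return 0.0
    hits = sum(asks_for_justification(c) for comments in prs for c in comments)
    return hits / len(prs)


sample = [
    ["why this and not X?", "LGTM"],
    ["nit: rename this variable"],
    ["do we still need this branch?", "is there a reason for this pattern?"],
    [],
]
print(why_rate(sample))
```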

[Figure 2: A four-quadrant grid of decay signals. Top-left: refactor abandon rate on AI-touched modules versus manually-authored modules. Top-right: comprehension lead time on the second touch after merge. Bottom-left: rework rate at sixty and one hundred eighty days, split by author type. Bottom-right: "why is this here?" review comments per PR. Each quadrant is annotated with the question it answers about the decay curve.]

Figure 2: Four signals that show the decay rate without new tooling. Split your existing data by author type and let the curve speak.

None of these are perfect. All of them require the team to be willing to ask the question. The shape that emerges is consistent across the codebases I have seen: as AI adoption climbs, these four signals climb with it — code costs less to ship and more to change, and the gap widens by month rather than by quarter.

What to Encode So Intent Survives

The fix is not "make AI explain itself." AI explanations of code AI just wrote are post-hoc, plausible, and unreliable on the original context. Asking a model to justify a line it generated five minutes ago produces words, not understanding. The fix is encoding intent in artifacts the next iteration — AI or human — must read.

Three of them survive the loop. Each is non-trivial to set up. None of them work as guidelines; they only work as gates. Anything not enforced is, for AI iteration purposes, not there.

Behavior-first tests, gated on mutation score. A test that asserts what the code must do, not what the code does, is the only specification AI consistently reads on the next iteration. Mutation testing is the diagnostic that tells you whether your tests carry the spec or whether they just mirror the code. We covered the upstream mechanism in Tests Are the Spec AI Reads. Two things to add for the half-life problem specifically. First, the tests have to be written before the implementation, or against an independent specification, or they are reverse-engineered from the same surface plausibility that produced the code — and they go green for the same reasons the code looks right. Second, the tests have to be in the domain language, not the implementation language. A test that asserts "an order total includes tax for shipping addresses in taxable jurisdictions" survives a refactor of the tax engine. A test that asserts "calls TaxCalculator.compute with the shipping address" does not. The enforcement: mutation score below a threshold on a changed module blocks the merge. Without that gate, the suite drifts toward implementation mirroring on every AI-assisted commit.
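
To make the difference concrete, here is a hand-rolled illustration of what a mutation tool does automatically. Everything in it is hypothetical: the tax table, `TaxCalculator`, and the single hand-written mutant stand in for the many mutants a real tool (mutmut for Python or PIT for Java, for instance) would generate.

```python
from unittest.mock import Mock

TAX_RATES = {"CA": 0.10}  # hypothetical taxable jurisdictions


class TaxCalculator:
    def compute(self, subtotal, region):
        return round(subtotal * TAX_RATES.get(region, 0.0), 2)


def order_total(subtotal, region, calculator=None):
    calculator = calculator if calculator is not None else TaxCalculator()
    return subtotal + calculator.compute(subtotal, region)


def mutant_total(subtotal, region, calculator=None):
    """A mutant: the calculator is still called, but the tax is never added."""
    calculator = calculator if calculator is not None else TaxCalculator()
    calculator.compute(subtotal, region)
    return subtotal


def behavior_test(total_fn):
    """Domain language: totals include tax only in taxable jurisdictions."""
    return total_fn(100.0, "CA") == 110.0 and total_fn(100.0, "OR") == 100.0


def mirror_test(total_fn):
    """Implementation language: only checks that the collaborator was called."""
    calc = Mock()
    calc.compute.return_value = 0.0
    total_fn(100.0, "CA", calc)
    try:
        calc.compute.assert_called_once_with(100.0, "CA")
    except AssertionError:
        return False
    return True


def passes_mutation_gate(killed, total, threshold=0.8):
    """Merge gate: the share of mutants killed must reach the threshold."""
    score = killed / total if total else 1.0
    return score >= threshold


print(behavior_test(order_total), behavior_test(mutant_total))
print(mirror_test(order_total), mirror_test(mutant_total))
```

Both tests pass against the real function. Only the behavior test fails against the mutant, which is what "killing" it means; the mirror test lets the broken total through. The 0.8 threshold is a placeholder, not a recommendation.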

Decision notes, required on non-trivial PRs. Not the eight-header ADR template. A five-line note in the PR description or in a co-located markdown file: "Picked X over Y because Z." If the developer cannot write Z, they have not yet owned the choice — and that fact is more important than the note. The note exists to surface the moment when the choice was made and the alternative considered. AI does not generate this artifact, because AI does not generate alternatives it rejected. The human writes it, or it does not exist. The enforcement: a PR check that flags any non-trivial PR with no decision note attached, and a review-blocking convention that the PR cannot merge until a reviewer accepts the note. Done at the granularity of "every non-trivial choice in the PR" — usually one or two per merge, sometimes zero — this is the cheapest piece of structural intent you can buy. Six months from now, when the next engineer asks "why is this a Map?", the answer is one click away.
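
The automated half of that enforcement could look something like the sketch below. The regular expression, the line-count definition of "non-trivial," and the function name are all assumptions for illustration; the reviewer accepting the note remains a human step that no script replaces.

```python
import re

# Matches notes shaped like "Picked X over Y because Z" (also "Chose ...").
DECISION_NOTE = re.compile(
    r"\b(?:picked|chose)\s+.+?\s+over\s+.+?\s+because\s+\S+",
    re.IGNORECASE | re.DOTALL,
)


def decision_note_check(description, lines_changed, trivial_below=20):
    """Return (ok, reason) for a PR description.

    Assumption: a PR touching fewer than trivial_below lines is treated as trivial.
    """
    if lines_changed < trivial_below:
        return True, "trivial change, note not required"
    if DECISION_NOTE.search(description):
        return True, "decision note found"
    return False, "non-trivial PR without a 'Picked X over Y because Z' note"


with_note = (
    "Adds order state cache.\n\n"
    "Picked HashMap over ConcurrentHashMap because the cache is only "
    "touched from the request thread."
)
print(decision_note_check(with_note, lines_changed=180))
print(decision_note_check("Adds order state cache.", lines_changed=180))
```

A pattern match only proves the note has the right shape. Whether Z is a real reason is exactly what the reviewer is there to judge.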

Architectural fitness functions in CI. The boundary rules, dependency constraints, complexity budgets, deprecation deadlines — encoded as failing builds, not as a wiki page. The next iteration runs into them whether anyone documented them or not. This is already the most enforceable of the three: the agent commits, the build fails, the agent corrects. The CI signal is the spec. The trap to avoid is letting rules live in a "best practices" document instead of in CI. A rule that fails a build constrains the next change. A rule on a wiki page is a polite request.
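
As one small example of a boundary rule turned into a check, the sketch below scans a module's imports with Python's standard `ast` module and reports any that cross a forbidden layer. The `shop.*` package names and the rule table are hypothetical; a CI job would run this over every changed file and fail the build on any non-empty result.

```python
import ast

# Hypothetical layering rule: domain code must not import from web or db layers.
FORBIDDEN = {"shop.domain": ("shop.web", "shop.db")}


def _within(name, package):
    return name == package or name.startswith(package + ".")


def boundary_violations(module_name, source, rules=FORBIDDEN):
    """Return sorted (line, imported name) pairs that break the layering rules."""
    banned = [
        prefix
        for layer, prefixes in rules.items()
        if _within(module_name, layer)
        for prefix in prefixes
    ]
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]  # relative imports are ignored in this sketch
        else:
            continue
        for name in names:
            if any(_within(name, prefix) for prefix in banned):
                violations.append((node.lineno, name))
    return sorted(violations)


pricing_source = (
    "import shop.db.orders\n"
    "from shop.domain.money import Money\n"
    "from shop.web import views\n"
)
print(boundary_violations("shop.domain.pricing", pricing_source))
```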

A note on code comments. Not all comments are equal. The kind that paraphrase what the code does rot faster than the code itself — every refactor invalidates them, AI tools update the code without touching them, and the next reader has to decide which is right. Those are noise. The kind worth writing are the ones that name an invariant the next change must preserve: "this must be idempotent — retries are at-least-once," "do not move the lock acquisition — see incident #427." Those describe something a refactor must not break. But even those decay if they only live in prose; a "must-not-change" comment is a polite request the next AI iteration is free to ignore. Use comments to signal the invariant for human readers. Encode the invariant itself as a build that fails when it is violated.
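
Taking the idempotency comment as an example, the invariant can be written as a check that delivers every event twice and compares the result with delivering it once. The ledger shape and handler names are hypothetical; the point is only that the comment's claim becomes something a build can fail on.

```python
def apply_payment(ledger, event):
    """Must be idempotent: retries are at-least-once, so duplicates will arrive."""
    if event["id"] in ledger["seen"]:
        return ledger
    return {
        "balance": ledger["balance"] + event["amount"],
        "seen": ledger["seen"] | {event["id"]},
    }


def naive_apply_payment(ledger, event):
    """What a refactor that drops the dedupe check would look like."""
    return {"balance": ledger["balance"] + event["amount"], "seen": ledger["seen"]}


def is_idempotent(handler, initial, events):
    """True if delivering each event twice leaves the same state as delivering it once."""
    once = initial
    for event in events:
        once = handler(once, event)
    twice = initial
    for event in events:
        twice = handler(handler(twice, event), event)
    return once == twice


empty = {"balance": 0, "seen": frozenset()}
events = [{"id": "e1", "amount": 10}, {"id": "e2", "amount": 20}]
print(is_idempotent(apply_payment, empty, events))
print(is_idempotent(naive_apply_payment, empty, events))
```

Wrapped in a test that runs in CI, the second result is what stops a refactor from silently removing the dedupe check, whether or not anyone read the comment.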

The Bottom Line

The half-life of AI-generated code is short by default. What expires is not the code. It is the reasoning that produced it — and with it, the system's changeability.

The cost is not bugs. It is that your codebase becomes progressively less modifiable without you noticing — until the velocity you bought with AI quietly stops compounding, because every new change has to absorb the uncertainty of every previous change nobody documented.

What changed in 2026 is that the friction which used to force intent into the author's head — typing, naming, structuring, choosing — is now optional. The speed-side cost of that is what arrives at merge. The half-life cost is what arrives in month six. Both come from the same trade: intent has to be put somewhere on purpose, or it is nowhere at all.

The bar for shipping AI code didn't move. The work of meeting it now includes leaving a trail the next iteration is required to follow — not advised to.
