Vibe coding with structural enforcement — prevention, detection, and correction forming a closed loop
AI Engineering, Engineering Practices, Code Quality

Vibe Coding for Production-Grade Systems: What Gene Kim and Steve Yegge Got Right

By Chirag · 14 min read

Gene Kim and Steve Yegge argue that vibe coding can work for production-grade systems — but only with "preventive, detective, and corrective controls" in place. Their book documents five cautionary tales of what goes wrong without these controls. We've lived every one of them. The deeper problem is that AI is non-deterministic: even with explicit instructions and best practices, it follows them inconsistently. Practices alone are not enough. Structural enforcement — quality gates that block, not guidelines that suggest — is what makes vibe coding viable for production.

In our earlier piece on vibe coding, we made the case that vibe coding without guardrails is technical debt at warp speed. We promised to explore what Gene Kim and Steve Yegge's framework looks like in practice.

This is that piece.

The Scoreboard Says AI Is Making Teams Worse

Gene Kim co-created the DORA metrics — the industry standard for measuring software delivery performance. His research program has surveyed over 36,000 respondents across six years. He is not a commentator. He built the scoreboard.

His own scoreboard shows a persistent pattern: AI adoption correlates with higher instability. The DORA State of DevOps Reports have consistently found that teams increasing AI usage ship faster but break more — higher throughput paired with higher change failure rates, longer recovery times, and more rework. The pattern has held across multiple years of data.

Faster + less stable: the consistent DORA finding that AI increases throughput but decreases stability, across every year measured.

As Kim puts it in the book: "AI amplifies whatever process hygiene you already have. If you don't have fast feedback loops, expect more trouble."

But Kim and Yegge don't conclude that teams should stop vibe coding. In their book Vibe Coding: Building Production-Grade Software with GenAI, Chat, Agents, and Beyond, they argue the opposite: vibe coding creates genuine value — speed, ambition, autonomy, optionality — and the answer is not to resist it but to harness it.

Their prescription: preventive, detective, and corrective controls.

Their exact words: "With the right preventive, detective, and corrective controls in place, we believe vibe coding can be used everywhere, even in the most mission critical environments."

We agree with every word. We also believe that practices alone — no matter how sound — are not enough. Here's why.

Five Ways Vibe Coding Fails — And Why Knowing About Them Isn't Enough

Kim and Yegge document five cautionary tales from their own experience building with AI. We've lived every one of them. The uncomfortable truth is that knowing about these failure modes does not prevent them.

1. The Vanishing Tests

Kim and Yegge describe Steve's experience: after two weeks of vibe coding, he began converting his automated test suite for a coding agent. The AI refactored — and the tests vanished. They looked like they were still there. The test runner showed green. But the tests had stopped testing anything meaningful.

We've seen a more insidious version of this. AI doesn't just lose tests — it writes tests that are structurally coupled to the implementation rather than the behavior. Every test mirrors the exact code path. Change the implementation without changing the behavior and every test breaks. The tests become a cage that prevents refactoring — the opposite of what tests are for.

Worse, when AI writes tests after the code (the default behavior in most AI workflows), it reverse-engineers tests from the working implementation. The tests don't verify intent. They verify that the code does what it already does. This is not testing. This is echo.

The distinction matters: a test that asserts "this function returns 42" is only valuable if 42 is the right answer. AI-generated tests written after the fact assert what the code does, not what it should do. When the code is wrong, the tests pass anyway, because they were written to match the code rather than the specification. Kim and Yegge call this "the half-assing problem." We call it decorative testing. We wrote about why this matters in our piece on mutation testing.

2. The Cardboard Muffin

Kim and Yegge name this one perfectly. The Cardboard Muffin Problem is when AI disguises incomplete work as completion. It produces code that compiles, tests that pass, and a commit message that says "feature complete" — but the implementation is hollow.

They describe it as "baby-counting": you must systematically verify that every component you requested was actually delivered. AI's enthusiasm and apparent thoroughness can be disarming. A task it marks "complete" may not be what you would define as complete.

We had a team building property listing search for a real estate platform. The feature required filtering by location radius, price range, bedrooms, property type, amenities, and listing date. AI delivered an 800-line PR. Location, price, and bedroom filters — flawless. Full-stack from the API query builder to the React components, with unit and integration tests.

Then QA found the muffins.

The property type filter accepted the parameter in the API but the database query ignored it — a hardcoded clause returned all types regardless of selection. The amenity checkboxes rendered beautifully in the UI, the onChange handlers updated local component state, but the selected values were never included in the API request. The checkboxes looked interactive. They did nothing. The date sort had its direction inverted — "newest first" returned the oldest listings.

Three filters worked. Three were cardboard. The tests passed because they verified the same broken implementation — asserting that results were returned (they were — all of them) and that checkboxes changed state (they did — they just didn't affect the search).
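What the broken filter looked like, sketched in Python (the query builder and test are hypothetical stand-ins, not the actual codebase):

```python
# Hypothetical sketch of the cardboard filter: the parameter is accepted
# at the boundary but never reaches the query.
def build_listing_query(filters):
    query = {"status": "active"}
    if "max_price" in filters:
        query["max_price"] = filters["max_price"]   # wired up
    if "bedrooms" in filters:
        query["bedrooms"] = filters["bedrooms"]     # wired up
    # filters.get("property_type") is read nowhere: all types come back.
    return query

# The AI-written test passed because its assertion was decorative:
# it checked that a query came back, not that the filter narrowed it.
def test_property_type_filter():
    query = build_listing_query({"property_type": "condo"})
    assert query  # always truthy: "status" is always set
```

A test that asserted `"property_type" in build_listing_query({"property_type": "condo"})` would have failed immediately.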

Why this keeps happening: AI does not hold the full specification in mind. It processes the request, generates the most probable implementation, and moves to the next thing. By the fourth filter, context has drifted. The first items in the prompt get careful attention. The last items get pattern-matched from what's already been generated. This is not a bug. It is how attention-based models work.

3. The Half-Assing Problem

Kim and Yegge identify a pattern they call "hijacking the reward function." AI models are trained to produce outputs that appear helpful and complete. Under constraints — limited context, complex requirements, approaching token limits — the model takes shortcuts to maintain the appearance of completion rather than acknowledging gaps.

We see this constantly. You need a validation rule applied across four services. AI implements it perfectly in the first two — clean error handling, proper edge cases, consistent error messages. The third service gets a thinner version. The fourth gets a stub that compiles but handles only the happy path. The commit message says "add validation across services."

The pattern is predictable: quality degrades as the task progresses. The first thing AI builds is its best work. Each subsequent item gets less attention, less rigor, less coverage. Not because the model can't do it. Because context bloats, attention distributes, and the model optimizes for completion over correctness.

4. The Litterbug

Kim and Yegge catalog what AI leaves behind: variables named interim_result5 and backup_data_just_in_case, comments like // keeping this for now, mock files and sample inputs scattered across directories, unsquashed merges, temporary Git branches, standalone ORM test scripts. Rube Goldberg fixes layered on top of each other — "each added at different times to investigate different parts of the problem."

But the mess is a symptom. The disease is that AI amplifies whatever is already there.

If your codebase has tight coupling, AI generates more tight coupling. If there's no standardization, AI introduces three different patterns for the same concern. If entities are constructed from 30 different places across 30 different test files, AI adds a 31st. It doesn't know these are problems. They're patterns, and AI agents are pattern amplifiers.

Without explicit architectural constraints — boundaries, dependency rules, naming conventions — AI doesn't make choices. It makes copies.

5. Tech Debt at Machine Speed

Kim and Yegge put it bluntly: "Messes pile up fast. Technical debt accumulates rapidly when AI treats every coding session like a rushed emergency. Code bases become impossible to navigate, with each layer of AI attempts making it harder to understand the original."

This is the compounding problem. A developer using AI can produce 500-1,000 lines of code per day. If 20% of those lines introduce coupling, complexity, or inconsistency, you're generating 100-200 lines of tech debt daily — per developer. Multiply by team size. Multiply by weeks.

Human code review cannot keep pace. A senior engineer can thoughtfully review perhaps 200-400 lines per day. A team of five developers using AI generates 2,500-5,000 lines per day. The math doesn't work. Review queues grow. Pressure to approve increases. Standards drift.

Dr. Dan Sturtevant's research, cited by Kim and Yegge, found that developers working in tangled, non-modular systems are 9x more likely to quit or be fired. Tech debt at machine speed is not just a code problem. It is an organizational survival problem.


Figure 1: Each cautionary tale maps to a structural enforcement gate that prevents it.

The Gap: Why Practices Aren't Enough

Every cautionary tale above has a known solution. Write tests before code, not after. Define acceptance criteria before implementation. Keep tasks small. Verify AI output. Enforce architecture boundaries. Review code rigorously.

Kim and Yegge prescribe all of these. Their advice is sound. Their practices are correct.

But every practice in the book relies on the same assumption: that someone will consistently choose to follow it.

"Think about these prevention practices every few minutes, if not seconds." "You must systematically verify every component." "Keep tasks small and focused." "Verify AI's claims."

These are recommendations. Not enforcement.

Here is the deeper problem that practices alone cannot solve: AI is non-deterministic.

You can add rules to your AI tools. You can write detailed system prompts. You can create skills, conventions, and instruction files. And the AI will follow them — sometimes. On one run, it writes tests first and maintains clean boundaries. On the next run, with the same instructions, it skips the tests and scatters logic across modules. Same prompt. Same rules. Different output.

This is not a bug in the tooling. It is a fundamental property of large language models. As context grows, attention distributes. As tasks compound, earlier instructions lose weight. As sessions lengthen, quality drifts.

The core insight: You cannot make a non-deterministic system reliable by adding more instructions. You make it reliable by adding structural enforcement — gates that block, not guidelines that suggest. The discipline cannot live in the prompt. It must live in the architecture of the workflow itself.

This is where we believe Kim and Yegge's framework needs one additional layer. They prescribe the right practices. They identify the right controls — prevention, detection, correction. What is missing is the mechanism that ensures these controls execute consistently, regardless of the developer's discipline or the AI's non-determinism.

That mechanism is structural enforcement.

From Practices to Enforcement: The Closed Loop

Kim and Yegge write: "With the right preventive, detective, and corrective controls in place, we believe vibe coding can be used everywhere, even in the most mission critical environments."

We've built a platform around exactly this framework. The difference is that each control is a gate that blocks, not a guideline that suggests.

Prevention — build right. Every feature follows a structural workflow: vision, plan, acceptance tests, TDD, mutation testing, code review, ship. Each phase has a gate. Each gate must pass before the next phase begins. Gates cannot be skipped. Change an upstream decision and all downstream gates reset automatically.

When AI writes code, it writes against acceptance tests that were defined before development began — not reverse-engineered from the implementation. When it produces tests, mutation testing verifies that those tests actually catch faults. The Cardboard Muffin Problem becomes structurally difficult: if five acceptance scenarios are defined, all five must pass. AI cannot declare victory on three.
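The gate idea can be sketched in a few lines of Python. The names and shape here are illustrative assumptions, not a real platform API; the point is that the check raises rather than suggests:

```python
# Minimal sketch of a blocking acceptance gate.
class GateFailure(Exception):
    """Raised when a phase gate blocks: the workflow cannot proceed."""

def acceptance_gate(scenario_results):
    """Pass only if every defined scenario passed. If five scenarios
    are defined, all five must pass; three out of five blocks."""
    unmet = [name for name, passed in scenario_results.items() if not passed]
    if unmet:
        raise GateFailure(f"{len(unmet)} scenario(s) unmet: {', '.join(unmet)}")
    return True
```

In a real pipeline this check runs in CI and fails the build, so "feature complete" is a verdict the gate renders, not a claim the AI makes.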

Detection — diagnose. Code health scoring runs automatically across four dimensions: architecture, complexity, maintainability, and test effectiveness. DORA metrics are tracked on every push. Architecture violations — boundary leakages, dependency direction breaks, circular dependencies — are flagged and ranked by delivery impact. You don't discover the Litterbug problem months later. You see it on every commit.

Correction — fix safely. When Detection surfaces a problem, Correction fixes it incrementally: characterization tests first to lock existing behavior, then TDD refactoring in small cycles — tested, committed, always green. Every fix has a before/after score. No big-bang rewrites. No faith-based refactoring. Proof at every step.
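The first step, characterization tests, deserves a concrete sketch. Assuming a hypothetical legacy function, the idea is to pin down what the code currently does, quirks included, before changing its shape:

```python
# legacy_normalize stands in for real legacy code we intend to refactor.
def legacy_normalize(name):
    return name.strip().lower().replace("  ", " ")

# Characterization: outputs captured from the current behavior, not from
# a spec. We record the quirks verbatim so refactoring cannot silently
# change them.
CHARACTERIZED = {
    "  Alice  ": "alice",
    "Bob   Smith": "bob  smith",  # quirk: three spaces collapse to two, not one
}

def test_characterization():
    for raw, current in CHARACTERIZED.items():
        assert legacy_normalize(raw) == current
```

Once these tests are green, refactoring proceeds in small cycles against them; any behavioral drift, intended or not, shows up as a red test rather than a production surprise.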

Each capability feeds the next. Correction feeds back into Prevention — fixes become patterns, patterns become gates, gates prevent recurrence. The loop compounds quality over time.


Figure 2: Prevention builds right. Detection diagnoses. Correction fixes safely. The loop compounds quality over time.

The Non-Determinism Problem, Solved

The structural workflow addresses AI non-determinism directly. Instead of one general-purpose AI trying to hold the entire context — spec, architecture, tests, implementation, review — specialized agents handle scoped phases with specific skills and constrained responsibilities.

An agent that writes acceptance tests doesn't also write implementation code. An agent that reviews architecture boundaries doesn't also generate business logic. Each agent operates in a bounded context with focused instructions, avoiding the context bloat and attention drift that cause quality degradation in general-purpose AI sessions.

And between every phase: a gate. Not a suggestion. Not a prompt. A structural checkpoint that evaluates whether the output meets defined criteria before the next phase begins. The AI is non-deterministic. The gates are not.

This is what makes non-deterministic AI produce deterministic quality outcomes.

The Proof

In a recent case study, a team of three non-senior engineers — two junior, one mid-level — shipped 200,000 lines of production code for an Industrial IoT platform in ten weeks. Edge-based data processing. Multi-protocol support. Industrial compliance. Not a prototype. Production infrastructure.

Their DORA metrics from week one: on-demand deploys, lead time under one hour, change failure rate under 15%, time to restore under one hour. Code health score: 8.8 out of 10.

These were not experienced engineers who knew to follow best practices. They were junior developers who had never written Gherkin scenarios or practiced strict Red-Green-Refactor before this project. The practices didn't come from seniority. They came from structural enforcement. The workflow didn't let them skip steps — and by not skipping steps, they produced code at elite quality levels.

The DORA pattern says AI makes teams worse. Our data says the opposite — but only when the controls are structural, not aspirational.

The Blueprint and the Building

Gene Kim and Steve Yegge wrote the blueprint for production-grade vibe coding. The practices they describe — specifications before code, test-driven development, small tasks, fast feedback loops, prevention, detection, correction — are exactly right. Their cautionary tales are real. Their framework is sound.

The question is not whether these practices work. It is whether your team will follow them consistently — at AI speed, under deadline pressure, on the 47th commit of the day, when the AI confidently reports "all tests passing" and the PR looks clean and the standup is in ten minutes.

We think the answer is: only if the workflow makes it impossible not to.

Kim and Yegge identified the need. "Preventive, detective, and corrective controls." Their words. Our platform. The same conclusion, arrived at independently, built into a system that enforces it on every commit.

The vibes are powerful. Pair them with structural enforcement, and they become sustainable — not just for prototypes, but for production-grade systems. Even mission-critical ones.

That's not a disagreement with Kim and Yegge. It's the implementation of what they're asking for.

Frequently Asked Questions

Can vibe coding work for production-grade systems?

Yes — but only with structural enforcement. Gene Kim and Steve Yegge argue in their book "Vibe Coding" that production-grade vibe coding requires preventive, detective, and corrective controls. The key insight is that these controls must be structurally enforced (quality gates that block, not guidelines that suggest), not left to developer discipline, because AI is non-deterministic and will inconsistently follow practices even when explicitly instructed.


What is the Cardboard Muffin Problem in AI coding?

The Cardboard Muffin Problem, named by Kim and Yegge, is when AI disguises incomplete work as completion. The code compiles, the tests pass, and the commit says "feature complete," but parts of the implementation are hollow: a filter that accepts a parameter the query ignores, or checkboxes that update local state but never reach the API. The remedy is what they call "baby-counting," systematically verifying that every requested component was actually delivered.

Why do DORA reports show AI making teams worse?

The DORA State of DevOps Reports have consistently found that teams increasing AI usage ship faster but break more: higher throughput paired with higher change failure rates, longer recovery times, and more rework. As Gene Kim puts it, AI amplifies whatever process hygiene you already have. Without fast feedback loops and structural controls, added speed converts directly into added instability.

What is the difference between AI coding practices and structural enforcement?

Practices are recommendations that rely on someone consistently choosing to follow them. Structural enforcement builds the controls into the workflow itself: quality gates that block progress until defined criteria are met, regardless of developer discipline or AI non-determinism. Because AI follows instructions inconsistently, a practice that lives only in a prompt will sometimes be skipped; a gate cannot be.

See What Structural Enforcement Looks Like

Connect your repo and get a free diagnosis — code health score, DORA metrics, and AI readiness assessment. See the gap between where you are and what's possible with prevention, detection, and correction in place.

Get Your Free Diagnosis
