Series

Multi-part deep dives on practices that unlock what AI tools can actually do.

4 posts published

AI Eval Discipline

Most teams adopting AI ask which eval framework to buy. Wrong question. Evals are not a benchmark suite — they are a habit of looking at your data, naming the failure modes, and writing assertions that fire when they recur. This series walks the discipline from foundations to operational practice: what evals actually are, the three gulfs LLM development has to bridge, how to build an eval set without burning a quarter, and why mutation testing is the only honest eval for AI-written tests.

2 posts published

AI Won't Do This by Default — Until You Do

AI coding agents are powerful, but the teams getting the strongest results pair them with practices most teams skip. This series walks the full delivery value stream, one practice per post, showing what AI tools do well, what they don't do by default, and what changes when you add the practice that unlocks their real power.

AI Eval Discipline

Evals Aren't a Benchmark Suite. They're a Habit of Looking at Your Data.

You Can't Look at Data You Don't Have: Building Your First Eval Set

You Don't Have Evals. You Have One Kind of Eval.

Error Analysis Is the Eval Work. Here's How to Actually Do It.

AI Won't Do This by Default — Until You Do

AI Won't Do This by Default: Hypothesis-Driven Development

AI Won't Do This by Default: User Story Mapping & Shape