Example Mapping for AI: How We Turn Specs into Executable Tests

Example Mapping is a 25-30 minute collaborative technique that turns a vague user story into a visual map of business rules, concrete examples, and unresolved questions — using four colours of sticky notes. The green example cards translate directly into Gherkin acceptance tests an AI coding agent can build against, replacing prose tickets with executable contracts. The technique is the bridge between a PM's intent and a test suite the agent cannot misinterpret.

The ticket said "users can apply discount codes at checkout." It got estimated, picked up in sprint planning, and handed to an AI coding agent. The agent shipped it in three hours.

Three days later, in production: discount codes were stacking when they shouldn't. Expired codes were silently accepted. A code worth $10 was being applied to a $5 cart, generating negative totals. None of these failures were in the spec. None of them were in the tests. And none of them were the agent's fault: the agent built exactly what the spec said. The spec just didn't say much.

Fifteen years ago, this gap was bridged by the engineer pausing, walking over to the PM, and asking. Now the agent does not pause. It picks an interpretation and ships.

The technique we use to close that gap before the agent starts is Example Mapping — and it takes thirty minutes per story.

What Example Mapping Is

Example Mapping was created by Matt Wynne in 2015 as a fast, low-tech way for product, engineering, and QA to align on a user story before development begins. It uses four colours of sticky notes (physical or digital) and a strict time-box of 25-30 minutes per story.

The four cards:

Yellow — the story. One per session. The user story being mapped. Placed at the top of the board.
Blue — the rules. The business rules the system must enforce. Constraints, invariants, policies. Placed under the story.
Green — the examples. Concrete examples that prove each rule. Specific inputs, specific outputs, no hand-waving. Placed under each blue card.
Red — the questions. Things the team cannot answer in the room. Edge cases the PM is unsure about. Compliance constraints that need legal input. Placed off to the side, to be taken away and resolved.

The session ends when either the time-box runs out, or the team agrees every blue card has at least one green, and every green is small enough to be a single test case. A healthy session for a typical feature ends with three to seven blues, two to four greens per blue, and a small stack of reds.

[Image: Example Mapping board showing a yellow story card at the top, blue rule cards in a row underneath, green example cards stacked under each blue, and red question cards in the upper-right corner.]

Figure 1: The structure of an Example Mapping board — one story, several rules, multiple concrete examples per rule, and a side stack of unresolved questions.

The shape of the board tells you everything. Lots of blues with no greens means the rules are still abstract — the team is talking in generalities. Lots of greens but no reds means the team is over-confident — there is always something the room cannot answer. Lots of reds with no greens means the story is not ready for development — the discovery work has not been done.
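The shape checks above are simple enough to automate if the board lives in a digital tool. What follows is a minimal sketch, not a prescribed implementation: the `Rule`, `Board`, and `diagnose` names and the warning wording are all hypothetical, chosen only to mirror the three failure shapes described in the text.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    text: str                                            # blue card
    examples: list[str] = field(default_factory=list)    # green cards


@dataclass
class Board:
    story: str                                           # yellow card
    rules: list[Rule] = field(default_factory=list)
    questions: list[str] = field(default_factory=list)   # red cards


def diagnose(board: Board) -> list[str]:
    """Return warnings based on the shape of the board (hypothetical wording)."""
    warnings = []
    greens = sum(len(rule.examples) for rule in board.rules)
    bare = [rule.text for rule in board.rules if not rule.examples]
    if bare:
        warnings.append(f"{len(bare)} rule(s) have no example yet")
    if greens and not board.questions:
        warnings.append("examples but no questions: possible over-confidence")
    if board.questions and not greens:
        warnings.append("questions but no examples: story is not ready")
    return warnings


board = Board(
    story="Apply a discount code at checkout",
    rules=[
        Rule("Expired codes are rejected."),
        Rule("Valid codes apply a discount to the cart.",
             ["Cart total $50, code SAVE10 (10% off) -> cart total $45."]),
    ],
)
print(diagnose(board))
```

An empty list from `diagnose` corresponds to the exit condition described earlier: every blue has at least one green, and the room has recorded what it could not answer.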

Walking Through a Session

Take the discount-code story. Here is what the session actually looks like.

Minute 0-3. The PM places the yellow card. "As a checkout user, I can apply a discount code to reduce my cart total." One sentence, no detail. That is intentional: the green cards are where the detail will live.

Minute 3-15. The team works through the rules. Each rule is a blue card.

Valid codes apply a discount to the cart.
Expired codes are rejected.
Already-used codes are rejected.
Codes with a minimum cart value reject carts below that minimum.
Two codes cannot stack unless explicitly marked as stackable.
Percentage discounts apply to eligible items only.

Six blue cards. Twelve minutes of conversation. None of these were in the original prose ticket.

Minute 15-25. Under each blue, the team adds green example cards. The format is simple: given a specific input, the expected output.

Under "Valid codes apply a discount":

Cart total $50, code SAVE10 (10% off) → cart total $45.
Cart total $100, code FLAT5 ($5 off) → cart total $95.

Under "Codes with a minimum cart value":

Cart total $20, code SAVE10 requires minimum $25 → code rejected with message "Add $5.00 more to use this code."
Cart total $25, same code → code accepted, $2.50 off.

Under "Two codes cannot stack":

Cart with SAVE10 already applied, user applies WELCOME5 (not stackable) → second code rejected with message "Only one discount code can be used per order."
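
Because each green card is a specific input paired with a specific output, the arithmetic on the cards can be checked as a small table. This is a sketch under assumptions: `GreenCard`, `discounted_total`, and the `"percent"`/`"flat"` labels are hypothetical names, and `Decimal` is used here only to keep the money arithmetic exact.

```python
from decimal import Decimal
from typing import NamedTuple


class GreenCard(NamedTuple):
    cart_total: Decimal
    kind: str               # "percent" or "flat" (hypothetical labels)
    amount: Decimal
    expected_total: Decimal


def discounted_total(cart_total: Decimal, kind: str, amount: Decimal) -> Decimal:
    """Apply a percentage or flat discount to a cart total."""
    if kind == "percent":
        return cart_total - cart_total * amount / Decimal(100)
    if kind == "flat":
        return cart_total - amount
    raise ValueError(f"unknown discount kind: {kind}")


# Values copied from the green cards above.
CARDS = [
    GreenCard(Decimal("50"), "percent", Decimal("10"), Decimal("45")),      # SAVE10
    GreenCard(Decimal("100"), "flat", Decimal("5"), Decimal("95")),         # FLAT5
    GreenCard(Decimal("25"), "percent", Decimal("10"), Decimal("22.50")),   # SAVE10 at the minimum
]


def failing_cards(cards: list[GreenCard]) -> list[GreenCard]:
    """Return the cards whose expected total does not match the calculation."""
    return [
        card for card in cards
        if discounted_total(card.cart_total, card.kind, card.amount) != card.expected_total
    ]


print(failing_cards(CARDS))
```

A card that ends up in the failing list is worth a second look in the room: either the card carries an arithmetic slip or the rule is not the one the team thinks it is.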

Minute 25-30. The reds. Things the team realises it cannot answer.

What happens if a code expires while it is sitting in the cart between sessions?
Are discounts applied before or after tax?
For percentage codes on partial-eligible carts, is the discount calculated on eligible items only or on the total?

Three red cards. The PM takes them away. Either she answers them by the end of the day, or the story is not ready for the next sprint.

The whole session fit inside the thirty-minute time-box. The output is a complete board: one story, six rules, fourteen concrete examples (only a few are reproduced above), three open questions. This is the spec the agent will build against.

From Cards to Gherkin

The green cards translate directly into Gherkin scenarios. The format is mechanical — anyone familiar with Given/When/Then can do the conversion in about ten minutes for a typical board.

The green card:

Cart total $20, code SAVE10 requires minimum $25 → code rejected with message "Add $5.00 more to use this code."

Becomes the scenario:

Scenario: Discount code rejected when cart is below minimum
  Given the cart total is $20.00
  And the code "SAVE10" requires a minimum cart value of $25.00
  When the user applies the code "SAVE10"
  Then the code is rejected
  And the message reads "Add $5.00 more to use this code"
  And the cart total remains $20.00

One green card, one scenario. Fourteen greens, fourteen scenarios. The team did not write these from imagination — every line came from the conversation in the room.

The same rule holds for edge cases. The "two codes cannot stack" green:

Cart with SAVE10 already applied, user applies WELCOME5 (not stackable) → second code rejected with message "Only one discount code can be used per order."

Becomes:

Scenario: Second discount code rejected when codes are not stackable
  Given the cart has a SAVE10 code applied
  And the code "WELCOME5" is not marked as stackable
  When the user applies the code "WELCOME5"
  Then the code is rejected
  And the message reads "Only one discount code can be used per order"
  And the SAVE10 discount remains applied
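
The layout step in the two conversions above is mechanical enough to sketch in code. The helper below is a hypothetical illustration (`to_gherkin` is not part of any Gherkin tooling): it assumes a person has already split the green card into its given, when, and then parts, and it only handles the formatting.

```python
def to_gherkin(title: str, given: list[str], when: list[str], then: list[str]) -> str:
    """Render one green card as a Gherkin scenario.

    The first step in each group gets its keyword; later steps get 'And'.
    """
    lines = [f"Scenario: {title}"]
    for keyword, steps in (("Given", given), ("When", when), ("Then", then)):
        for index, step in enumerate(steps):
            lines.append(f"  {keyword if index == 0 else 'And'} {step}")
    return "\n".join(lines)


scenario = to_gherkin(
    title="Discount code rejected when cart is below minimum",
    given=[
        "the cart total is $20.00",
        'the code "SAVE10" requires a minimum cart value of $25.00',
    ],
    when=['the user applies the code "SAVE10"'],
    then=[
        "the code is rejected",
        'the message reads "Add $5.00 more to use this code"',
        "the cart total remains $20.00",
    ],
)
print(scenario)
```

Splitting a card into preconditions, action, and outcome remains the human part. If that split is hard to make, the green card is probably not concrete enough yet.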

The scenarios are the contract. They live in the same repository as the production code. They run on every commit. They are the thing the AI agent must make pass — and the thing that fails loudly if the agent's code drifts from the spec.

How the AI Agent Plugs In

The agent does not write the Example Mapping board. That is human work — PM and developer and tester in the room, surfacing the rules. But once the board exists and the scenarios are written, the agent's role is clear.

The four-step cycle we use in production:

1. Human drafts. The team produces the Example Mapping board and converts the greens into Gherkin scenarios. This is the irreplaceable human step. The agent does not know the business intent and cannot infer it from a prose ticket. The room knows.

2. Agent critiques. The agent reads the scenarios and asks adversarial questions. "Scenario 4 says the discount applies to eligible items only — what is the rule for determining eligibility? Scenario 7 does not specify whether tax is recalculated after the discount." The agent surfaces gaps the team missed. It is good at this because it has no prior context — it sees the scenarios fresh.

3. Human decides. The team reviews the agent's questions, decides which are real gaps and which are noise. Real gaps go back to the board as new greens or new reds. Noise gets dismissed.

4. Agent refines. Once the scenarios are stable, the agent generates the production code and the test infrastructure that runs the scenarios. The Gherkin scenarios become the executable acceptance tests. The agent implements until every scenario passes.
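
To make step 4 concrete, here is a rough sketch of what an implementation that satisfies the two scenarios shown earlier might look like. Everything in it is an assumption for illustration: the `Cart`, `Code`, `Result`, and `apply_code` names are hypothetical, the $50 cart total in the stacking check is invented because the card does not state one, and the scenarios are written as plain assertions rather than bound to a real Gherkin runner.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class Code:
    name: str
    minimum: Decimal = Decimal("0")
    stackable: bool = False


@dataclass
class Cart:
    total: Decimal
    applied: list[Code] = field(default_factory=list)


@dataclass
class Result:
    accepted: bool
    message: str = ""


def apply_code(cart: Cart, code: Code) -> Result:
    # Rule: two codes cannot stack unless explicitly marked as stackable.
    # Assumption: only the incoming code's flag is checked. Whether the code
    # already on the cart must also be stackable is a question for the room.
    if cart.applied and not code.stackable:
        return Result(False, "Only one discount code can be used per order")
    # Rule: codes with a minimum cart value reject carts below that minimum.
    if cart.total < code.minimum:
        shortfall = code.minimum - cart.total
        return Result(False, f"Add ${shortfall:.2f} more to use this code")
    cart.applied.append(code)
    return Result(True)


def scenario_below_minimum() -> None:
    cart = Cart(total=Decimal("20.00"))
    result = apply_code(cart, Code("SAVE10", minimum=Decimal("25.00")))
    assert not result.accepted
    assert result.message == "Add $5.00 more to use this code"
    assert cart.total == Decimal("20.00")


def scenario_not_stackable() -> None:
    save10 = Code("SAVE10")
    cart = Cart(total=Decimal("50.00"), applied=[save10])  # total is assumed
    result = apply_code(cart, Code("WELCOME5", stackable=False))
    assert not result.accepted
    assert result.message == "Only one discount code can be used per order"
    assert cart.applied == [save10]


scenario_below_minimum()
scenario_not_stackable()
```

Writing even this much surfaces a question the scenarios leave open (which code's stackable flag decides), which is the kind of gap step 2 is meant to send back to the board.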

[Image: Four-step cycle showing humans drafting Example Mapping output, agent critiquing for gaps, humans deciding which gaps are real, and agent refining the implementation against the final scenarios.]

Figure 2: The Agent-Assisted Specification cycle. Humans bring the intent. The agent stress-tests it. Humans decide what is real. The agent builds against the result.

This is the move that makes the difference. Without Example Mapping, the agent is guessing at the rules. With it, the agent is implementing against a concrete set of test cases the team built together. The conversation that used to happen mid-build has happened upstream, where it is cheap.

The same scenarios become the regression suite. Six weeks later, when someone changes the discount logic, the suite runs. If the agent's new code breaks scenario 9, the build fails. The contract is permanent. The implementation is not.

What Changes on Monday

You do not need to overhaul your process. Try this on one story this week:

1. Pick a story that is genuinely ambiguous. Not the easiest one, but the one where you suspect there are hidden rules nobody has surfaced yet. Discount logic, permission rules, pricing policies, anything with edge cases.
2. Get three people in a room for thirty minutes. PM, developer, tester. No more, no fewer. If you cannot find thirty minutes, the story is not important enough to ship.
3. Use real cards or a digital equivalent. Miro, FigJam, even a shared text document with four sections works. The colours matter: they are how the team visually parses the board. Yellow for story, blue for rules, green for examples, red for questions.
4. Run the time-box. Twenty-five minutes for the mapping, five minutes for the reds. Stop when the timer ends, even if the board feels incomplete. Incomplete boards reveal that the story is not ready, and that is useful information.
5. Convert the greens to Gherkin before the next standup. Whoever does the conversion shares it back with the room. Anything that does not translate cleanly is a sign the green is not concrete enough yet.
6. Hand the Gherkin to the agent. Not the original prose ticket. The scenarios are the spec now.

Most teams find the first session feels slow — people are not used to working at this level of concreteness. By the third session, the cadence is natural. By the tenth, the team has built a vocabulary for talking about business rules that did not exist before.

The Bottom Line

A PM, a developer, and a tester can spend thirty minutes with sticky notes and produce a more reliable specification than three days of prose tickets. The cards are not the deliverable; the shared understanding is. The Gherkin scenarios are the artifact that survives the session and travels with the code. The AI agent is the executor that builds against the contract instead of guessing at the prose.

The technique is about ten years old. It survived because it works. AI did not make it obsolete; it made it the most important meeting on your sprint board. The team that runs Example Mapping before the agent codes ships features that match the intent. The team that does not ships features that match the agent's best guess.
