You Can't Look at Data You Don't Have: Building Your First Eval Set

When you have no production traces, you bootstrap your first eval set with synthetic data — but not by asking a model for "a hundred test cases." You enumerate the input space as dimensions × scenarios, generate inputs cell by cell, and run error analysis on the synthetic traces. Synthetic data is scaffolding: the day real traces arrive, you backfill them, and real usage corrects the dimensions you guessed wrong.

The first post in this series ended on a line: if you can't write the eval, you don't understand the problem — and the way you come to understand it is by looking at your data. Pull a hundred traces from production, read them, name what went wrong.

Every team about to ship a new AI feature has the same objection. What traces? We haven't launched. There is nothing to look at.

That is the cold-start problem, and it is real. The highest-leverage activity in evals is reading your own data, and a pre-launch feature has no data. So the question this post answers is the one the first post left open: how do you build your first eval set before a single user has touched the thing?

You Can't Look at Data You Don't Have

The default answer is the worst one: ship it, then look.

Ship-then-look feels pragmatic. It is not. It means the first weeks of production are an uncontrolled experiment running on your real users, and you are the only party who hasn't consented to it. You encode each failure mode only after a user has already hit it. The eval suite trails production by exactly the lag between "a user saw the bad output" and "someone noticed and wrote the check."

In the prevention, detection, correction frame, ship-then-look is pure detection. You let the failure happen, catch it downstream, and pay for it twice — once when the user hits it, once when you fix it. Building the eval set before launch is the prevention move: you spend the cost up front, deliberately, to keep the failure out of production in the first place.

There is a deeper problem. Without an eval set, there is no gate. The model does whatever it does, and "it seemed fine in the demo" becomes your release criterion. A demo is three inputs the person building the feature already knew would work. It is the narrowest possible slice of the input space, hand-picked for success. Shipping on the strength of a demo is shipping on the strength of the inputs you happened to think of — which is exactly the bias a real eval set exists to correct.

So the goal before launch is not "no eval set because no data." It is: manufacture the data deliberately, run the loop on it, and replace it with the real thing the moment the real thing exists. Part 1's rule — don't synthesise, pull real traces — is exactly right the day you have traffic. This post is about the narrow window before you do.

Enumerate the Input Space: Dimensions × Scenarios

Here is the wrong way to manufacture it. Open a model, type "give me a hundred test queries for my support assistant," paste the output into a file, call it an eval set.

You will get a hundred near-duplicates. Asked to generate inputs with no constraints, a model collapses to the most probable input — the polite, well-formed, on-topic question — and clusters tightly around it. The set will be large and feel thorough. It will cover a narrow band of what your users will actually send. That is the eval equivalent of a test suite that hits 100% line coverage and verifies nothing: full, green, and blind to the cases that matter.

The fix is to stop asking for cases and start enumerating the space. Two concepts do the work.

A dimension is an axis along which your inputs vary. For a customer-support assistant, dimensions might be: query type (billing, technical, account, out-of-scope), user emotional tone (neutral, frustrated, abusive), input completeness (full context, missing the order number, ambiguous), and language. A scenario is one concrete combination — one value picked from each dimension. "An abusive user asking a billing question with the order number missing" is a scenario. The cross-product of the dimensions is your grid of scenarios.

Figure 1: Dimensions are the axes of variation; scenarios are concrete combinations. The grid is your hypothesis about what users will send, made explicit and testable.

Figure 1: Dimensions are the axes of variation; scenarios are concrete combinations. The grid is your hypothesis about what users will send, made explicit and testable.

The grid is the artifact, and writing it down is the actual work. It forces you to state, in advance, what you think the range of real inputs is. That statement is a hypothesis — and like every hypothesis in this series, it is worth more written down and wrong than unwritten and vaguely held. If your team cannot name the dimensions, that is the finding: you do not yet understand the feature well enough to ship it. The inability to fill the grid is the same signal as the inability to write the eval. Both mean the understanding isn't there yet.

You do not need every cell. Four dimensions with a handful of values each multiply into the hundreds fast; you want coverage of the axes, not the full Cartesian product. Sample deliberately — every value of every dimension appears in several scenarios, and the combinations you believe are highest-risk appear more often. Twenty to fifty well-chosen scenarios beat a few hundred mechanical ones.

Fill the Grid — and Mind the Polish

Now generate. For each scenario, prompt a model to produce one or more concrete inputs that match the cell — steering it explicitly with the dimension values so it varies along the axes you care about instead of collapsing to the mode. "Write a billing question from a frustrated user who never mentions their order number" produces something genuinely different from the default. The grid is what gives the generator its diversity.

Then mind the polish, because this is where synthetic sets quietly fail. Model-generated inputs are too clean. They are grammatical, correctly spelled, single-intent, on-topic, and politely phrased — because that is the mode the generator collapses to even when you are steering it. Real users are none of those things. They send fragments, typos, three questions stacked in one message, screenshots with no text, requests in a language you didn't plan for, and inputs engineered to make your feature misbehave.

A synthetic set made only of clean inputs is decorative in the same way a high-coverage, low-mutation-score test suite is decorative — it runs the system without exercising the cases that break it. So spend deliberate effort on the ugly cells: empty input, hostile input, ambiguous input, the wrong-language input, the input that is technically on-topic but adversarial. These are the cells a model won't generate unless you make it, and they are the cells production will be full of.

Then — and this is the part teams skip — you run the feature against the synthetic set and do the looking anyway. Synthetic data does not let you skip error analysis; it gives you something to perform error analysis on. Open-code each synthetic output, axial-code the notes into named failure modes, write the assertions. The three rungs from the first post apply unchanged: code-based assertions where the rule is expressible in code, an LLM-as-judge where the property is genuinely fuzzy, human review reserved for what you can't yet encode. The only thing synthetic about this is the inputs. The discipline is identical.

This is the same logic that makes tests the spec an AI agent reads: the eval set, even built from synthetic inputs, is the executable record of what "correct" means for this feature. It just happens to be written before the feature met its first user. (When the AI you are evaluating is an agent inside your own delivery loop, the same discipline becomes evals for engineering agents — there the labeled set comes from your history instead of synthetic generation.)

Synthetic Is Scaffolding, Not the Building

The grid you wrote down is a set of guesses. Some of them are wrong. You will overweight a dimension users barely exercise and miss one they live in. The synthetic inputs will be cleaner than reality across the board. None of this is a reason not to build the set — it is the reason the set has a second phase.

Synthetic data is scaffolding. It holds the eval set up before there is anything real to build with. The day production traffic starts, the building begins, and the scaffolding comes down a piece at a time.

Two states side by side: at launch the eval set is entirely synthetic inputs filling the grid; after production traffic arrives, real traces backfill the cells, synthetic inputs are retired, and a new row appears for a dimension users invented that the original grid never had — Figure 2: Synthetic data bootstraps the set before launch. Real traces backfill it after — replacing synthetic inputs cell by cell and revealing the dimensions you never anticipated.

Figure 2: Synthetic data bootstraps the set before launch. Real traces backfill it after — replacing synthetic inputs cell by cell and revealing the dimensions you never anticipated.

Backfilling does three things. It replaces synthetic inputs with real ones, cell by cell, so the set gets less clean and more representative every week. It reweights the grid, because production tells you which scenarios are common and which you imagined — you discover the billing-question cell is 40% of traffic and the cell you obsessed over is empty. And most importantly, it reveals the dimensions you missed: the inputs users invented that your grid never had a column for. Those don't fit any existing cell, which is exactly why they matter. Each one is a new axis of variation and a failure mode you could not have imagined from your desk. They become the most valuable rows in the set.

This is the Analyze–Measure–Improve loop turning on real data, exactly as the first post described it — except you didn't wait for production to start it. You started on synthetic data, and production handed you the corrections.

What to Do This Week

For a feature you have not shipped yet, five steps. Each is a gate, not a suggestion.

Write the dimensions on a whiteboard. One hour, the people who understand the feature in the room. List the axes along which real inputs will vary. If you cannot name them, stop — you are not ready to ship, and you have just learned that cheaply instead of expensively.
Cross-product into twenty to fifty scenarios. Cover every value of every dimension at least once; weight toward the combinations you believe are highest-risk. Don't chase the full Cartesian product — chase coverage of the axes.
Generate inputs cell by cell, and force the ugly cells. Steer the generator with the dimension values so it varies instead of collapsing. Deliberately produce the empty, hostile, ambiguous, and out-of-scope inputs a model won't volunteer.
Run the feature on the set and do the looking. Open-code the outputs, name the failure modes, write the assertions — code-based where you can, judge where you must. This is your pre-launch Analyze step. The synthetic inputs do not exempt you from it.
Commit the set to the repo, wire the assertions into CI, and put "backfill with real traces" on the calendar for week one of production. An eval set that isn't in version control doesn't count. A backfill that isn't scheduled won't happen. Book both before you leave the room.

The Bottom Line

The team that ships without an eval set is running its error analysis on live users and finding out about its failure modes from support tickets — detection, at the moment it is most expensive. The team that builds the synthetic set first has a spec before it has a single user: imperfect, guessed-at, and cleaner than reality, but executable, gating, and ready to be corrected the instant real data arrives.

Synthetic data does not make the looking optional. It gives you something to look at before the looking gets expensive — and turns the first week of production from an uncontrolled experiment into the first turn of a loop you already started.

You Can't Look at Data You Don't Have: Building Your First Eval Set

You Can't Look at Data You Don't Have

Enumerate the Input Space: Dimensions × Scenarios

Fill the Grid — and Mind the Polish

Synthetic Is Scaffolding, Not the Building

What to Do This Week

The Bottom Line

Frequently Asked Questions

How do you build an eval set before you have any production data?

What are dimensions and scenarios in eval dataset design?

Is synthetic data good enough for evals?

When should you replace synthetic eval data with real traces?

Why not just ask an LLM to generate a hundred test cases?

Shipping an AI feature? Find the failure modes before your users do.

Share this article