Engineering Leadership, AI Engineering, Engineering Practices

What Are You Actually Paying For? A Non-Technical Guide to Evaluating Your Engineering Vendor

By Shivani Sutreja · 8 min read

Most founders evaluate engineering vendors on the wrong metrics: headcount, hourly rates, and story points completed. The four questions that actually predict delivery performance — lead time, deployment frequency, change failure rate, and mean time to restore — can be asked by anyone, require no technical background, and reveal more about a vendor's engineering health than any status report.

Your engineering vendor sent the sprint report. Twelve stories completed. Velocity: 84 points. Test coverage: 87%. On track.

Three weeks later, the feature is still not in production.

You are not sure what to ask.

This is the most common founder experience with outsourced engineering — not outright failure, but a persistent fog. Reporting exists. Progress feels real. And yet the gap between "the team is working" and "customers are using new features" keeps widening.

The fog has a cause: you are measuring the wrong things.

What Your Vendor Is Actually Billing You For

Every vendor dashboard highlights the same metrics: story points completed, velocity, hours burned, test coverage percentages. These have one thing in common — they measure inputs, not outcomes.

Story points are team estimates in made-up units. A vendor with 80-point velocity in one sprint tells you nothing about whether features reach customers faster. Velocity is a planning tool, not a delivery metric.

"Test coverage 87%" sounds like quality. It is not. Coverage measures whether code was executed during tests — not whether those tests actually catch bugs. A 2024 analysis by CodeRabbit found that 88% of tests in production codebases are effectively decorative: the tests pass even when the code is broken.

Hours burned tell you the invoice is justified. They do not tell you how much of those hours produced value versus how much was rework, coordination, and waiting.

None of these numbers tell you what you actually need to know: how long does it take for approved work to reach your customers?

The Four Questions That Reveal Everything

The DORA research program — a decade-long study of software delivery across thousands of teams — identified four metrics that predict delivery performance more reliably than any vendor dashboard. You can ask about all four without any technical background.

1. How long from approved to production?

"When we approve a feature, how long does it typically take for a customer to be able to use it?"

Elite teams deliver in under a week. High performers in one to four weeks. If your vendor's answer is measured in months — or if they cannot give you a number — you are paying for a queue system, not a delivery system.

Sixty to eighty percent of elapsed time in outsourced delivery is not development. It is waiting: QA queues, handoff delays between timezone-separated teams, deployment windows that happen on a schedule instead of on demand. Your invoice pays for all of it.

2. How often does code reach customers?

"How many times per week (or month) does code actually deploy to production?"

Elite teams deploy multiple times per day. High performers once per week to once per month. If the answer is "quarterly releases," your vendor is batching risk rather than managing it.

High deployment frequency means small changes, fast feedback, and a low blast radius when something goes wrong. Quarterly releases mean large batches, delayed learning, and high-stakes deployment events that terrify everyone involved — including your vendor.

3. What percentage of releases cause problems?

"When you deploy to production, what percentage of those deployments require a hotfix, rollback, or incident response?"

Elite teams: under 5%. Acceptable: under 15%.

A vendor that does not track this metric is a vendor that does not know how much rework they are generating. This is the number that reveals the hidden cost of cheap development. A vendor billing $80K/month with a 40% change failure rate is generating rework that consumes a significant portion of their own capacity. You are paying for that rework twice: once in the initial development, and once in the fix.
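To make the "paying twice" arithmetic concrete, here is a back-of-the-envelope sketch. The $80K retainer and 40% change failure rate are the illustrative figures from above; the assumption that a failed change takes roughly as much effort to diagnose and fix as it took to build is mine, for illustration only.

```python
# Back-of-the-envelope: what a high change failure rate costs.
# Illustrative numbers only; substitute your own invoice.
monthly_bill = 80_000        # $/month retainer
change_failure_rate = 0.40   # 40% of deployments need a hotfix or rollback

# Assumption: fixing a failed change costs about as much effort
# as building it did. For every 1 unit of forward work, 0.4 units
# of rework follow, so rework's share of total effort is 0.4 / 1.4.
rework_share = change_failure_rate / (1 + change_failure_rate)

rework_cost = monthly_bill * rework_share
productive_cost = monthly_bill - rework_cost

print(f"Spent on rework:       ${rework_cost:,.0f}/month")
print(f"Spent on forward work: ${productive_cost:,.0f}/month")
```

Under those assumptions, roughly $23K of the $80K monthly bill goes to fixing the vendor's own failed deployments before any new feature work is counted.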

4. When something breaks, how long until it is fixed?

"If there is a production incident, how quickly does it typically get resolved?"

Elite teams restore service in under one hour. High performers in under one day.

If your vendor's answer involves filing a ticket, waiting for the right developer to become available, and a 48-72 hour SLA — that is the speed at which your customers experience broken software. It is also the speed at which your vendor finds out their own code has a problem.

Comparison table showing vendor-reported metrics versus DORA delivery metrics that actually predict performance
Figure 1: What vendors report versus what predicts delivery performance. The left column shows the metrics vendors showcase; the right shows what research confirms actually matters.

What Good and Bad Answers Look Like

The vendor does not need to be elite on all four metrics. What matters is whether they can answer at all, and whether the direction is improving.

Signals worth paying for:

  • They know the numbers without hesitation
  • They have a tool that tracks these metrics automatically
  • They describe deployment as routine, not as a risk event
  • They talk about automated tests as the quality gate, not a manual QA team
  • Lead time is measured in days, not weeks

Signals worth investigating:

  • "It depends on the complexity" — every metric should have an average
  • "We focus on quality, so we deploy carefully" — engineering quality enables faster deployment, not slower
  • They conflate "code complete" with "in production"
  • Deployments happen on a schedule, not on demand
  • They respond to the question with a counter-question about what you mean

The most revealing question is often the simplest: "What was your deployment frequency last month?" If they cannot answer, they are not measuring delivery. If they are not measuring delivery, they have no idea whether they are getting better or worse.

Four DORA metric cards showing the question to ask, elite ranges, and warning signs in vendor responses
Figure 2: The four DORA questions for vendor evaluation. Each metric has a specific question to ask, a benchmark range, and a warning sign to watch for in how vendors respond.

What You Are Actually Paying For

Every outsourced engagement pays for some combination of actual development, coordination and meetings, rework, and waiting time. Most founders assume the majority goes to development.

The breakdown in a typical outsourced engagement looks closer to this:

  • 20-30%: actual development time
  • 30-40%: waiting — QA queues, handoff delays, deployment windows
  • 20-30%: rework — fixing misunderstandings, defects, integration failures
  • 10-20%: coordination overhead across timezones and contract boundaries

You are paying for 100%. You are getting value from 20-30%.
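A minimal sketch of that math, taking the midpoint of each range above. The monthly figure is hypothetical, and the split is the illustrative breakdown from the list, not measured data from any specific engagement.

```python
# Where a typical outsourced invoice goes, using midpoints
# of the illustrative ranges above. Hypothetical spend.
monthly_bill = 100_000

breakdown = {
    "actual development": 0.25,  # midpoint of 20-30%
    "waiting":            0.35,  # midpoint of 30-40%
    "rework":             0.25,  # midpoint of 20-30%
    "coordination":       0.15,  # midpoint of 10-20%
}

# Sanity check: the shares should account for the whole bill.
assert abs(sum(breakdown.values()) - 1.0) < 1e-9

for activity, share in breakdown.items():
    print(f"{activity:>20}: ${monthly_bill * share:>9,.0f}")

value_delivered = monthly_bill * breakdown["actual development"]
print(f"\nPaying ${monthly_bill:,} for ${value_delivered:,.0f} of development time.")
```

On these midpoints, a $100K monthly spend buys about $25K of actual development time; the remaining $75K is absorbed by queues, rework, and coordination.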

The invoice does not show this breakdown. The sprint report does not show it. The DORA questions reveal it indirectly — a vendor with a four-week lead time is a vendor where most of that time is not development.

Specification misunderstandings alone cause 30-50% rework rates in typical outsourced engagements. Context switching — developers spread across multiple clients — reduces effective capacity by 20-40%. These are not exceptional cases. They are structural features of the outsourcing model.

How to Have This Conversation

You do not need to be technical to ask these questions. You do need to be willing to make your vendor slightly uncomfortable.

Most vendor relationships operate on a comfortable arrangement: the founder asks about progress, the vendor reports velocity and story points, and both parties avoid the question of actual delivery throughput. Changing what you ask for in status meetings breaks this arrangement.

Start with one question in your next meeting: "How long did the last feature take from approved to deployed?"

If the answer is vague, follow up: "What is the average for features of similar scope?"

If they cannot answer that either, you have your data point. You are working with a vendor that does not measure delivery time.

Over the following month, add the other three questions. Frame them as establishing a shared baseline, not as an audit. Vendors that deliver well welcome this conversation. Vendors that struggle with it reveal the struggle.

Three questions for your next vendor meeting:

  1. What was your deployment frequency last month — how many times did code reach production?
  2. For features shipped this quarter, what was the average time from approved to live?
  3. When the last production incident occurred, how long did it take to resolve?

These three questions take under five minutes. The answers tell you more about your vendor's engineering health than any status report.

The Bottom Line

Most engineering vendors are evaluated on the wrong metrics. Story points and velocity are planning tools — they tell you nothing about whether software reaches customers, how often it breaks, or how fast it gets fixed. The four DORA questions — lead time, deployment frequency, change failure rate, and mean time to restore — are measurable by any founder, require no technical background, and predict delivery performance reliably. Vendors that answer these confidently are vendors worth keeping. Vendors that cannot answer them are hiding a delivery problem, even if they do not know it themselves.

Frequently Asked Questions

What metrics should a founder use to evaluate an engineering vendor?

Skip story points, velocity, and test coverage percentages — these measure inputs, not outcomes. Instead, ask about four DORA metrics: lead time (approved to production), deployment frequency (how often code ships), change failure rate (percentage of releases causing incidents), and mean time to restore (how fast production problems get fixed). These four questions require no technical background and predict delivery performance more reliably than any vendor dashboard.


What are DORA metrics and why do they matter for outsourced development?

DORA metrics come from a decade-long research program that studied software delivery across thousands of teams. The four metrics — lead time, deployment frequency, change failure rate, and mean time to restore — measure outcomes rather than inputs. This matters for outsourced development because vendor dashboards showcase inputs (story points, velocity, hours burned), while the DORA metrics reveal whether software actually reaches customers, how often it breaks, and how fast it recovers.

How do you know if your engineering vendor is delivering good value?

Look past the invoice. In a typical outsourced engagement, only 20-30% of billed time is actual development; the rest is waiting, rework, and coordination overhead. A vendor delivering good value can state their lead time, deployment frequency, change failure rate, and time to restore without hesitation, and those numbers improve over time. A vendor that cannot answer is not measuring delivery at all.

What questions should I ask my software development vendor in a status meeting?

Start with three: What was your deployment frequency last month — how many times did code reach production? For features shipped this quarter, what was the average time from approved to live? When the last production incident occurred, how long did it take to resolve? These take under five minutes and tell you more about delivery health than any sprint report.

Want to know how your engineering vendor actually measures up?

Connect your repo and get an objective engineering health diagnosis — lead time, deployment frequency, change failure rate, and more. Data, not dashboards.

Get Your Free Diagnosis
