
Project Scheduling in the AI Era: Rebuilding Estimation, Velocity, and Sprint Planning From the Rubble


Prologue — Estimation didn't break because we got worse at estimating

A sprint retro, sometime in 2024. A team's velocity suddenly doubled. Nobody worked longer hours. Cursor and Claude Code had simply joined the team. The next sprint, velocity dropped back by half — because only the hard tickets were left.

The team's PM said it plainly in the retro: "Our velocity chart doesn't predict anything anymore."

She was right. But it's easy to misread the cause. We didn't get worse at estimating. The unit of estimation — the quantity of "human effort" — stopped being stable.

Story points were never time. They were a bundle of effort, complexity, and uncertainty, and they worked because that bundle stayed relatively consistent within a team. A "3-pointer" felt roughly the same weight to everyone. When AI agents entered the loop, that consistency shattered. The same 3-pointer now produces a draft in 4 seconds on one day and still takes two days on another. Averaging them is meaningless — the distribution itself split into two peaks.

This post is about rebuilding scheduling on top of that split distribution. The audience is tech leads, EMs, and senior ICs. This is not an abstract "AI changed the future" essay — it's the set of things you can use in your next sprint planning meeting.

What we'll cover:

  1. The exact mechanism by which story points lost their meaning
  2. The new bottleneck is review and integration, not implementation
  3. Estimating AI-assisted work — ranges not points, spike-first
  4. Velocity is now bimodal
  5. Sprint planning when an agent can do 5 tickets overnight
  6. Tracking AI-assisted vs human work
  7. The "90% done" trap with agent output
  8. Capacity planning when review is the constraint
  9. Metrics that still matter — lead time, escape rate, review latency
  10. Epilogue — checklist and anti-patterns

Chapter 1 · The exact mechanism by which story points lost their meaning

Before blaming story points, let's name why they worked in the first place. Story points stood on three assumptions.

| Assumption | Detail | Before AI |
| --- | --- | --- |
| Effort consistency | A "3" carries roughly the same weight across the team | Held |
| Effort as a proxy for time | Summed effort converts to calendar time | Held (via a velocity constant) |
| Complexity approximates implementation cost | Hard problems also take long to code | Held |

All three wobbled at once.

Effort consistency broke. Adding a CRUD endpoint used to be a 2. Now an agent looks at the schema and drafts it in five minutes. But debugging distributed lock contention — a 5 — is still a 5, because AI barely helps there. The same scale now has a five-minute task and a two-day task wearing the same number.

Effort stopped being a proxy for time. Even when implementation time converges to zero, calendar time does not shrink. Code an agent wrote in 4 seconds still takes a human the same amount of time to review, integrate, and QA. The stable conversion factor that turned summed effort into calendar time is gone.

Complexity and implementation cost decoupled. This is the most fundamental one. Hard problems and write-a-lot-of-code problems used to be strongly correlated. Not anymore. Work with a clear spec and a common pattern — complex or simple — gets handled fast by an agent. Conversely, work with an ambiguous spec or deep system entanglement — even with little code — still eats a whole chunk of human time.

The conclusion: you don't need to throw out story points. But you must redefine what you're estimating. Not generation time — that's near zero — but specification cost and verification cost.

Old estimate = f(implementation effort)
New estimate = f(spec clarity) + f(review + integration + QA effort)

The next chapter shows why that second term is the new bottleneck.


Chapter 2 · The new bottleneck is review and integration, not implementation

Theory of Constraints in one line: a system's throughput is set by its slowest stage. Speeding up any other stage does nothing.

Before AI, the stage-by-stage time breakdown of software delivery looked roughly like this.

[Design 15%] → [Implementation 50%] → [Review 10%] → [Integration + QA 20%] → [Deploy 5%]
                       ^ bottleneck

Implementation was the biggest chunk, so we poured every tool into making it faster. IDEs, autocomplete, boilerplate generators. Agents are the peak of that trend — they made implementation nearly free.

So what happens when implementation becomes free? The bottleneck moves.

[Design 25%] → [Implementation 5%] → [Review 35%] → [Integration + QA 30%] → [Deploy 5%]
                                          ^ new bottleneck

Review is the new bottleneck. The reason is simple. Review is fundamentally human cognitive work, and its volume scales with the amount of code to inspect — and agents produce more code, faster. On top of that, agent code carries a different review burden than human code.

| Aspect | Human-written PR | Agent-written PR |
| --- | --- | --- |
| Author's intent | Just ask the author | Even the (human) author doesn't fully know |
| Location of subtle bugs | Author flags "please look here" | No signal about where it's weak |
| Consistency | Consistent with the author's style | Patterns may differ file to file |
| Volume | As much as the task needs | Often more than needed (over-generation) |
| Tests | Written by the author with intent | Plausible, but can paper over gaps |

Reviewers now don't "read code" — they have to "prove the code is correct," because the shortcut of the author's intent is gone. That's slower, more tiring, and more error-prone.

Practical implication one. If you introduce agents to a team and leave review capacity unchanged, throughput stays roughly flat. Implementation got faster, so PRs just pile up in the review queue. The queue grows, and lead time can actually increase. What got faster is "PR creation rate," not "delivery rate."

Practical implication two. Integration and QA swell along with it. More PRs mean more merge conflicts and more combinations breaking in the integration environment. An agent has no idea how its PR will conflict with someone else's unmerged PR.

That leads straight to the next chapter's estimation principle: estimate the stage humans verify, not the stage agents generate.


Chapter 3 · Estimating AI-assisted work — ranges not points, spike-first

The real reason AI-assisted work is hard to estimate is that the variance of outcomes is large. The same ticket might finish in 30 minutes or take two days. The fork between those isn't visible before you start the work.

Estimating a high-variance quantity with a single number is almost always wrong. So, two principles.

Principle 1: Estimate with ranges, not points

Instead of a single point, give an optimistic-pessimistic range. The key insight is that the width of the range is itself information.

| Estimate | Width | Meaning |
| --- | --- | --- |
| 0.5 to 1 day | Narrow | Clear spec, common pattern. Agent will do well |
| 1 to 4 days | Wide | Something is unknown. May need a spike |
| 2 to 10 days | Very wide | Effectively unestimatable. Split it or spike it |

A wide range doesn't mean "bigger work" — it means "we don't understand this work yet." A sprint planner who sees a wide range should not inflate the number; they should take an action that reduces the uncertainty.

Principle 2: Spike-first

A wide-range ticket does not go straight into the sprint. A timeboxed spike goes in first.

Wide estimate (1 to 4 days) discovered
   |
   v
Timeboxed spike (2 to 4 hours)
   |- have the agent prototype it
   |- find the gaps in the spec
   |- check the integration points
   |
   v
Re-estimate -> usually converges to a narrow range

Spikes have become far cheaper in the AI era. Tell an agent "just build this for now" and you get a working prototype in 30 minutes. Even if you throw that prototype away, the process of building it surfaces the spec's ambiguities and the integration risks. That's the real value of a spike — the learning, not the code.

Re-estimate after the spike and the range usually narrows. And if it doesn't? That itself is a strong signal. This work is intrinsically uncertain, so pull it out of the sprint commitment, or break it down smaller.

The estimation workflow, summarized

1. Give the draft estimate as a range (optimistic to pessimistic)
2. If the width is narrow -> it's a sprint candidate as-is
3. If the width is wide -> put a spike into the sprint first
4. Re-estimate after the spike -> if narrowed it's a candidate, if not split it
5. Fill the sprint commitment only with "narrow range" tickets

The core message of this workflow: don't estimate the uncertain as if it were certain — do the work that reduces the uncertainty first.
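As a sanity check that the width rule can be applied mechanically, here is a minimal sketch in Python. The `RangeEstimate` class, the 2x/4x width thresholds, and the action strings are illustrative assumptions, not part of any existing tool; calibrate the thresholds against your own history of estimates versus actuals.

```python
# A minimal sketch of the range-first workflow. The width thresholds (2x, 4x)
# are illustrative, not a recommendation.
from dataclasses import dataclass

@dataclass
class RangeEstimate:
    optimistic_days: float
    pessimistic_days: float

    @property
    def width_ratio(self) -> float:
        # How many times longer the worst case is than the best case.
        return self.pessimistic_days / self.optimistic_days

def next_action(est: RangeEstimate) -> str:
    """Map the width of a range estimate to the workflow step it calls for."""
    if est.width_ratio <= 2.0:
        return "narrow: sprint candidate as-is"
    if est.width_ratio <= 4.0:
        return "wide: timebox a spike, then re-estimate"
    return "very wide: split the ticket or spike it"

print(next_action(RangeEstimate(0.5, 1)))   # narrow: sprint candidate as-is
print(next_action(RangeEstimate(1, 4)))     # wide: timebox a spike, then re-estimate
print(next_action(RangeEstimate(2, 10)))    # very wide: split the ticket or spike it
```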


Chapter 4 · Velocity is now bimodal

The statistical reason the velocity chart became meaningless is clear: the distribution of task durations shifted from unimodal to bimodal.

Before AI, a team's task-duration distribution was roughly normal. Most of the mass clustered around the mean, and the mean and median were close. Averaging was meaningful.

Before AI -- unimodal distribution
freq
 |        _-=#=-_
 |     _-=########=-_
 +------------------------ task duration
         mean ~ median (meaningful)

After AI, the distribution split into two peaks.

After AI -- bimodal distribution
freq
 |  #                      #
 |  ##                    ##
 |  ###                  ###
 +---------------------------- task duration
   peak A                peak B
   (trivial work,        (hard work,
    collapsed by agent)   barely changed)

        ^ the "mean" between them is where nobody lives
  • Peak A — collapsed work. CRUD, boilerplate, applying well-known patterns, clear refactors. Agents drove the time to near zero.
  • Peak B — unchanged work. Ambiguous specs, deep system context, novel design decisions, gnarly debugging. Agents barely help.

The mean points at the valley between the two peaks. There is no actual work there. A sentence like "the average task this sprint was 1.5 days" is as hollow as saying the average family has 2.3 members.

So what do you track instead

Instead of the mean, track the two peaks separately.

| Tracked item | Measure | Why |
| --- | --- | --- |
| Peak A throughput | Trivial tasks / sprint | Shows agent utilization |
| Peak B throughput | Hard tasks / sprint | Shows real team capacity |
| A:B ratio | The mix of the two peaks | Shows the character of the backlog |
| Valley ratio | Work stuck somewhere in the middle | Estimation-failure signal (should be near zero) |

In particular, Peak B throughput is the real signal. Peak A is handled by agents, so it can be scaled almost without limit — as long as review keeps up. But Peak B needs deep human thought, and that is the team's actual ceiling. Quarterly planning should be built on Peak B throughput.
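As a rough illustration of tracking the peaks separately, here is a minimal sketch that assumes each completed ticket has already been labeled trivial (Peak A), hard (Peak B), or middle (the valley) at planning time. The `peak` field and the label names are hypothetical.

```python
# A minimal sketch of reporting the two peaks separately instead of one mean.
# The labeling of each ticket is a planning-time judgment, not something this
# code decides.
from collections import Counter

def sprint_velocity_report(completed_tickets: list[dict]) -> dict:
    counts = Counter(t["peak"] for t in completed_tickets)
    total = max(len(completed_tickets), 1)
    return {
        "peak_a_throughput": counts["trivial"],            # agent utilization
        "peak_b_throughput": counts["hard"],               # real team capacity
        "a_to_b_ratio": counts["trivial"] / max(counts["hard"], 1),
        "valley_ratio": counts["middle"] / total,          # should stay near zero
    }

tickets = [{"peak": "trivial"}] * 9 + [{"peak": "hard"}] * 3 + [{"peak": "middle"}]
print(sprint_velocity_report(tickets))
# {'peak_a_throughput': 9, 'peak_b_throughput': 3, 'a_to_b_ratio': 3.0, 'valley_ratio': 0.0769...}
```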

How bimodal awareness changes the retro conversation

When "velocity dropped" comes up in a retro, ask back:

"Did Peak A drop, or did Peak B drop?"

  • Peak A dropped -> the review queue is clogged, or agent usage went down. A tooling/process problem.
  • Peak B dropped -> you're doing genuinely hard work, or a senior is blocked. A people/focus problem.

The same "velocity drop" calls for two completely different prescriptions. With a single mean, that distinction is impossible.


Chapter 5 · Sprint planning when an agent can do 5 tickets overnight

Now a realistic scenario. Friday evening, a senior IC hands the agent the top 5 tickets of the backlog and goes home. Monday morning, 5 PRs are up.

The premise of sprint planning wobbles. What does "the team does N points in two weeks" even mean when PR creation finishes overnight?

The common misread: the sprint is dead

It isn't. The sprint is alive. What changed is what unit the sprint commits to.

  • Old sprint: "we will write this much code"
  • New sprint: "we will verify, integrate, and make deployable this much code"

The 5 PRs on Monday morning are not "finished work." They are inputs that have entered the review/integration queue. The sprint commitment is a commitment to the team's ability to drain that queue.

Split sprint planning into two tracks

Track 1 -- generation track (agent-driven)
  |- delegate top-of-backlog tickets to the agent
  |- fast, parallel, nearly unlimited
  |- output: PRs awaiting review

Track 2 -- verification track (human-driven) <- the sprint commitment lives here
  |- review, integration, QA, deploy
  |- strictly bound by human capacity
  |- output: deployed value

The questions to weigh in the sprint planning meeting change.

| Old question | New question |
| --- | --- |
| How many points do we put in this sprint? | How much review/integration capacity do we have this sprint? |
| Who implements this ticket? | Who takes responsibility for verifying this PR? |
| How many days does implementation take? | How many days does it occupy on the verification track? |

Deliberately throttle the generation track

It's counterintuitive but important: you have to rein in the generation track.

If an agent makes 5 overnight, then 5 the next day, then another 5, the review queue grows without end. That isn't progress — it's excess inventory. An unmerged PR is not an asset, it's a liability — it raises the risk of merge conflicts, and the longer its assumptions about the codebase age, the harder the rebase.

The rule: only run the generation track as fast as the verification track can absorb. Once the review queue holds more than some number of PRs, don't delegate new work to the agent. It's applying a Kanban WIP limit to the review queue.

Review queue WIP limit = 3
+---------------------------------+
| Awaiting review: [PR][PR][PR] <- full
| -> stop delegating new work to the agent
| -> resume delegating once one review clears
+---------------------------------+
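A minimal sketch of that gate, under the assumption that you can read the current depth of the review queue from your tracker. The limit of 3 mirrors the diagram and is illustrative.

```python
# A minimal sketch of the WIP gate on the generation track. The queue depth
# would come from your own tracker or repository; ticket IDs are made up.
REVIEW_QUEUE_WIP_LIMIT = 3

def can_delegate_to_agent(prs_awaiting_review: int,
                          limit: int = REVIEW_QUEUE_WIP_LIMIT) -> bool:
    """Gate the generation track on the state of the verification track."""
    return prs_awaiting_review < limit

backlog = ["TICKET-101", "TICKET-102", "TICKET-103"]
queue_depth = 1  # PRs currently awaiting review

for ticket in backlog:
    if can_delegate_to_agent(queue_depth):
        print(f"delegate {ticket} to the agent")
        queue_depth += 1   # the agent's PR will join the review queue
    else:
        print(f"hold {ticket}: review queue is full, clear a review first")
```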

The core of Chapter 5: the sprint is alive, but the unit of commitment moved from generation to verification, and generation must be deliberately throttled to match verification capacity.


Chapter 6 · Tracking AI-assisted vs human work

The question "how much does AI contribute on our team" will absolutely come from leadership. How you answer it either builds or breaks team culture.

What not to track

First, the traps. Do not measure the following metrics. The moment you measure them, they get gamed.

| Bad metric | Why it's bad |
| --- | --- |
| Lines of code written by AI | Lines aren't value. Rewards over-generation |
| Share of AI-assisted PRs (as a target) | Makes people check the "used AI" box. Meaningless |
| Per-person AI utilization rate | Feels like surveillance. Destroys trust |
| AI vs human commit ratio | AI is a tool, not a person's competitor |

The shared problem with these metrics: they pit AI against humans, or make tool usage itself the goal. AI is a tool, like a keyboard or a compiler. Just as you don't track "compiler usage rate this quarter," "AI utilization rate" isn't worth tracking on its own.

What to track

Instead, track the character of the work. Work types, not individuals.

Ticket classification (not a personal evaluation, for understanding the backlog)

  Type A: agent-driven, human-verified
    -> Peak A. Fast throughput. Review is the constraint.

  Type B: human-driven, agent-assisted
    -> Peak B. Deep thinking. Human time is the constraint.

  Type C: pure human work (design, debugging, negotiation)
    -> Agent-unsuitable. Senior time is the constraint.

This classification is not an individual's scorecard. It's for understanding the composition of the backlog and planning capacity. The same person does some tickets as Type A and some as Type C. The thing being classified is the work, not the person.

The one ratio worth having

If you genuinely need an "AI impact" number, look at just this one.

Increase in Type A throughput
-----------------------------  = the agent's real leverage
Review cost spent on Type A

If the numerator is comfortably larger than the denominator, the agent is a net gain. If the denominator catches up to the numerator — if review cost eats the generation gain — that's a signal to fix the process, not the tool.
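One way to make the ratio concrete is to express both terms in the same unit, for example person-days per sprint. The numbers below are made up for illustration.

```python
# Back-of-the-envelope version of the leverage ratio, with made-up numbers.
# Both terms are in person-days per sprint, so the ratio is dimensionless:
# comfortably above 1 means the agent is a net gain on Type A work.
impl_days_saved_on_type_a = 14.0    # generation effort the agent absorbed this sprint
review_days_spent_on_type_a = 5.0   # human review/integration cost on the same tickets

leverage = impl_days_saved_on_type_a / review_days_spent_on_type_a
print(f"agent leverage on Type A work: {leverage:.1f}x")   # 2.8x

# If this ratio drifts toward 1, review cost is eating the generation gain:
# fix the process (smaller PRs, stronger gates), not the tool.
```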

How to frame it when reporting

Report to leadership like this:

"With agents, lead time on Type A work dropped by X%. As a result the bottleneck moved to review, and if we invest in review capacity next quarter we can convert that gain into throughput."

This framing is honest (no vanity numbers like "lines went up") and actionable (it points at the next investment). A sentence like "AI writes 40% of our code" is impressive but leaves you unable to make any decision.


Chapter 7 · The "90% done" trap with agent output

An old joke in software estimation: "It's 90% done, and the remaining 10% takes another 90% of the time." Agents make this trap deeper and more convincing.

Why agent output looks 90% done

Open an agent-made PR for the first time and it's impressive. The code compiles. There are tests. They pass. Variable names are clean. The structure is reasonable. It looks finished.

But "looks" is the operative word. Agents are optimized for looking plausible. The gap between surface completeness and actual completeness is wider than with human code.

Human-written code
  surface completeness  --------------#  70%
  actual completeness   -------------#   65%
  (the author knows the weak spots. the gap is small.)

Agent-written code
  surface completeness  ----------------------#  95%
  actual completeness   -----------#             60%
  (the gap is large. and you can't see where the gap is.)

That 35-point gap between surface and actual completeness is what "the remaining 10%" really is. And the gap usually hides in the following places.

| Where it hides | Symptom |
| --- | --- |
| Edge cases | Only the happy path handled. Empty input, concurrency, failure paths missing |
| Integration assumptions | Freely assumes the behavior of other services or modules |
| Non-functional requirements | Works, but slow, memory-hungry, or unobservable |
| Gaps in tests | Tests exist, but verify the implementation rather than the intent |
| Error handling | There's a try/catch, but no recovery strategy |

The second-to-last one is especially dangerous. Tests an agent writes often assert exactly what the code it wrote does. Even if there's a bug, the test enshrines that bug as the "correct answer." Passing tests do not guarantee correctness.

Estimation rules to avoid the trap

Rule 1: "PR is up" is 0% done. Set the endpoint of the estimate at "deployed and observed," not "PR created." The moment an agent posts a PR is the starting point of verification work, not the endpoint.

Rule 2: Verification as a separate ticket with a separate estimate. Bundle generation and verification into one ticket and verification always gets underestimated. Split large agent output into an "implementation" ticket and a "verification + integration" ticket, and estimate the latter honestly.

Rule 3: Do a "reverse review" on agent PRs. Don't read the code top to bottom — hit the five hiding places from the table above as a checklist first. "Edge cases? Integration assumptions? Do the tests verify intent?" To avoid being fooled by surface completeness, you have to deliberately hunt for the gap.

Rule 4: Give the last 10% a track, not a buffer. A buffer is "extra time, just in case." A track is "this work definitely exists and is planned separately." The last 10% of agent output is not "in case" — it's "always." Don't hide it in a buffer; plan it as a first-class citizen of the verification track.


Chapter 8 · Capacity planning when review is the constraint

Chapter 2 said the bottleneck moved to review. Chapter 8 is about reflecting that fact in capacity planning.

Old capacity planning vs new capacity planning

Old approach
  team capacity = sum of (per-person implementation hours)
  planning = fill implementation work to match that capacity

New approach
  team capacity = MIN(generation capacity, review capacity, integration capacity)
  planning = commit to the smallest term (usually review)

This is Theory of Constraints applied directly. Generation capacity is effectively near-infinite thanks to agents. So it is no longer the basis for planning. The smallest term — almost always review — is the team's real throughput.

Actually calculating review capacity

Review capacity isn't abstract. It's measurable.

Review capacity (PRs/week)
  = number of reviewers
  x weekly review hours per person
  / average review time per PR

Example:
  4 reviewers
  x 6 hours/person/week (realistic time available for review)
  / 1.5 hours/PR (agent PRs take longer)
  = 16 PRs/week

That number, 16, is the ceiling of the sprint commitment. Even if an agent can make 50 PRs a week, the team can only commit to 16 worth. Making the other 34 isn't progress — it's queue backlog.
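The same arithmetic as a small reusable calculation; the inputs reproduce the worked example above and should be replaced with your team's own measurements.

```python
# The review-capacity arithmetic above as a reusable calculation.
def review_capacity_prs_per_week(reviewers: int,
                                 review_hours_per_person: float,
                                 avg_hours_per_pr: float) -> float:
    return reviewers * review_hours_per_person / avg_hours_per_pr

ceiling = review_capacity_prs_per_week(reviewers=4,
                                       review_hours_per_person=6.0,   # per week
                                       avg_hours_per_pr=1.5)          # agent PRs take longer
print(f"sprint commitment ceiling: {ceiling:.0f} PRs/week")           # 16
```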

Levers to raise review capacity

If review is the constraint, improvement comes from raising review capacity. Improving anything else is waste.

| Lever | Effect | Cost / risk |
| --- | --- | --- |
| Smaller PRs | Shortens review time per PR. Biggest effect | Requires the discipline to delegate to the agent in small units |
| More reviewers | Increases the reviewer count | Senior time is scarce. Juniors find agent-PR review hard |
| Stronger automated gates | Machines filter before human review | Upfront investment in lint, types, tests, static analysis |
| Agent self-review | The agent does the first pass | A second human review is still mandatory. No blind faith |
| Spec clarification | Less to review in the first place | Upfront time in design and spec |

The biggest effect usually comes from smaller PRs. Review time grows not linearly but super-linearly with PR size — a large PR is slower and less accurate because a human can't hold it all in their head. The discipline of delegating to the agent as "split this feature into 5 small PRs" rather than "this whole feature as one PR" raises capacity the most.

New agenda items for the capacity planning meeting

Sprint capacity planning -- new checklist
  [ ] How many PRs is this sprint's review capacity? (calculate with the formula above)
  [ ] What's the integration/QA capacity? (include merge conflicts and environment time)
  [ ] The smaller of the two is the ceiling of the sprint commitment
  [ ] Set the generation track's WIP limit to match that ceiling
  [ ] Is enough senior review time left for Peak B?

The last item is subtle but important. If seniors spend all their time reviewing Peak A agent PRs, there's no one left to do the genuinely hard work of Peak B. When planning review capacity, carve out seniors' deep-work time first.


Chapter 9 · Metrics that still matter — lead time, escape rate, review latency

Just because story points and velocity wobbled doesn't mean we abandon measurement. Fortunately, there are metrics that don't wobble. Their common trait: they don't measure "effort" — they measure "flow" and "outcomes." So even when AI changes effort, their meaning holds.

Metric 1: Lead time

The calendar time from the moment work starts until it's deployed and reaches the user.

Why it survives: lead time looks at the whole flow, not generation time. Even if an agent drives implementation to zero, if lead time doesn't shrink, the bottleneck is somewhere else. Lead time doesn't hide that bottleneck.

Lead time decomposition (for bottleneck diagnosis)
  [waiting] -> [generation] -> [awaiting review] -> [review] -> [integration] -> [deploy]
                                      ^
                  if this got longer, review is the constraint

Break lead time down by stage and you immediately see which stage swelled. In the AI era, the culprit is almost always the "awaiting review" segment.
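A minimal sketch of that decomposition, assuming each ticket records a timestamp when it enters each stage. The stage names and dates are illustrative.

```python
# A minimal sketch of lead-time decomposition from per-stage timestamps.
from datetime import datetime

stages = ["started", "pr_opened", "review_started", "merged", "deployed"]

ticket = {
    "started":        datetime(2025, 1, 6, 9, 0),
    "pr_opened":      datetime(2025, 1, 6, 10, 0),   # generation: 1 hour
    "review_started": datetime(2025, 1, 8, 14, 0),   # awaiting review: over 2 days
    "merged":         datetime(2025, 1, 8, 17, 0),
    "deployed":       datetime(2025, 1, 9, 11, 0),
}

for earlier, later in zip(stages, stages[1:]):
    print(f"{earlier} -> {later}: {ticket[later] - ticket[earlier]}")
# The segment that swells is the bottleneck; here it's the wait for review.
```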

Metric 2: Escape rate

The proportion of defects that escaped all the way to production. Bugs that passed review and QA but were found by users.

Why it matters more now: the "90% done trap" of agent code (Chapter 7) shows up precisely as escape rate. A PR that looks finished on the surface passes review easily, and the hidden gap blows up in production. A rising escape rate is direct evidence that the verification track isn't keeping up with generation speed.

If escape rate is rising while velocity (generation speed) is also rising, that is not a good thing. You're just shipping unverified code quickly.

Metric 3: Review latency

The time from when a PR is posted to when review begins. Not the time review takes — the time spent waiting for review.

Why it's a key metric: review latency directly shows the length of the review queue. It's the thermometer of the new bottleneck from Chapters 2 and 8.

What review latency tells you
  latency ~ 0       -> review capacity to spare. you can run the generation track harder
  latency trending up -> queue backlog starting. throttle the generation track
  latency spiking   -> bottleneck is serious. tighten the WIP limit immediately

Review latency is a leading indicator. Lead time tells you after the fact that things "were slow," but review latency tells you in real time that things "are getting clogged."
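A minimal sketch of measuring it, assuming you can export two timestamps per PR from your tooling: when the PR was opened and when the first human review began. The field names and dates are illustrative.

```python
# A minimal sketch of measuring review latency from two timestamps per PR.
from datetime import datetime
from statistics import median

prs = [
    {"opened_at": datetime(2025, 1, 6, 9, 0),  "first_review_at": datetime(2025, 1, 6, 15, 0)},
    {"opened_at": datetime(2025, 1, 6, 11, 0), "first_review_at": datetime(2025, 1, 8, 10, 0)},
    {"opened_at": datetime(2025, 1, 7, 14, 0), "first_review_at": datetime(2025, 1, 9, 9, 0)},
]

latencies = [pr["first_review_at"] - pr["opened_at"] for pr in prs]
print(f"median review latency: {median(latencies)}")

# A rising median week over week means the queue is backing up: throttle the
# generation track before lead time reports the damage after the fact.
```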

Metric 4: Peak B throughput (back from Chapter 4)

The number of hard tasks (Type B and C) completed per sprint. The metric emphasized in Chapter 4.

Why it's the basis for quarterly planning: Peak A is handled by agents, so it's elastic. But Peak B is capped by deep human thought, and that is the limit of value the team can actually produce. Roadmap commitments should be grounded in Peak B throughput, not intoxicated by Peak A's fast throughput.

The metrics, together

| Metric | What it measures | Role in the AI era |
| --- | --- | --- |
| Lead time | Total flow time | Diagnose bottleneck location |
| Escape rate | Defects that escaped verification | Health of the verification track |
| Review latency | Length of the review queue | Leading alarm for the bottleneck |
| Peak B throughput | Speed of hard work | Basis for quarterly and roadmap planning |

Let me re-emphasize what these four share. Not one of them measures "how fast you wrote code." They all measure flow, outcomes, and capacity. So even when AI makes implementation free, their meaning holds. What we should have been measuring was never effort but flow — the AI era just made that fact impossible to ignore any longer.


Epilogue — Rebuilding from the rubble

Saying estimation broke is only half right. Effort-based estimation broke. Flow-based planning is fine — it actually got clearer in the AI era. Our job is not to abandon measurement but to move the target of measurement from effort to flow.

The core in one sentence: agents made implementation free, so the bottleneck moved to review and integration, and the center of planning and measurement must move there too.

Practical checklist

Things to check before your next sprint planning meeting.

  1. Estimate with ranges. Drop the single point and estimate with an optimistic-pessimistic range. Read the width of the range as a signal of uncertainty.
  2. Wide estimates get a spike first. Don't put a wide-range ticket straight into the sprint; put a timeboxed spike in first and re-estimate.
  3. Split velocity into two peaks. Drop the mean and track Peak A (trivial) and Peak B (hard) separately. Plan quarters on Peak B.
  4. Make the sprint commitment the verification track. Commit to "we will verify, integrate, and deploy," not "we will write code."
  5. Put a WIP limit on the generation track. Delegate to the agent only as much as the review queue can absorb. Unmerged PRs are liabilities, not assets.
  6. Calculate review capacity as a number. Reviewers x weekly review hours / review time per PR. That value is the ceiling of the sprint commitment.
  7. Delegate small PRs. Review time is super-linear in PR size. Small PRs raise review capacity the most.
  8. Treat agent PRs as "0% done." PR creation is the start of verification. Make verification and integration a separate ticket with a separate estimate.
  9. Hit the hidden gaps with a checklist. Edge cases, integration assumptions, non-functional requirements, gaps in tests, error handling — don't read top to bottom, hunt the gap first.
  10. Watch the metrics that don't wobble. Lead time, escape rate, review latency, Peak B throughput. These four measure flow, not effort.

Anti-patterns

Things to avoid. Each one is a common mistake.

  • Clinging to the velocity mean. The mean of a bimodal distribution is the valley where nobody lives. Before celebrating a rising mean or worrying about a falling one, ask which peak.
  • Making AI utilization a target. A target like "80% AI-assisted PRs" makes tool usage itself the goal. AI is a tool, not a goal.
  • Measuring contribution by lines of code. A report bragging about AI-written line counts rewards over-generation and leaves you unable to make any decision.
  • Running the generation track unlimited. Letting an agent make everything it can make grows the review queue without end. It's excess inventory that looks like progress.
  • Treating "PR is up" as "done." Agent PRs look finished on the surface. Counting PR creation as done means verification always gets underestimated.
  • Burning seniors on agent-PR review. If seniors spend all their time on Peak A reviews, there's no one to do the genuinely hard work of Peak B. Carve out seniors' deep-work time first.
  • Investing in generation tools when verification is the constraint. Improving a non-bottleneck is waste. Invest in the constraint — almost always review.
  • Watching only speed while ignoring escape rate. If velocity is rising while escape rate is also rising, that's not fast — it's shipping unverified code quickly.

Next post teaser

The next post is "Code Review in the AI Era: A Practical Workflow for Verifying Agent PRs." Since this post repeated that "review is the new bottleneck," the next one goes inside that bottleneck. A review checklist dedicated to agent PRs, the concrete procedure of reverse review, the division of labor between automated gates and human review, and how to grow juniors so they can review agent PRs — we'll cover, at the code level, how to actually raise review capacity.

The place where estimation broke is not empty ground. It's where a more honest plan can stand — one that measures flow, follows the bottleneck, and treats verification as a first-class citizen.