What an engineering team that grades itself looks like

Six things the engineering team handles for you. Each one below shows how it actually plays out: the real Knowledge Graph timeline, the Slack note that lands in your inbox, the calibration data that compounds week over week.

Five specialists · calibrated estimates, reviews, signoffs · $999/mo bundle

Calibration

"It'll take 5 days." It took 9.

Tuesday morning — you ask Aviv how long the new auth integration will take. He says 5 days. It ships in 9. Without a calibration loop, that's just another bad estimate. With one, the team learns what kind of work consistently runs over — and Aviv's next estimate is sharper.

Timeline (the work)

Tue 09:00📐aviv commits to user: "auth integration ships by Mon EOD — ~5 days, 1 backend dev + reviewer"

Tue 09:01📝aviv estimate_log target=auth-integration kind=duration_days predicted_value=5 tolerance_pct=20 rationale="similar to OAuth flow last quarter"

Wed 14:30⚠️sam — "third-party SDK callback is async; need to refactor session middleware. +1 day"

Fri 16:00⚠️hiro review — "session refactor touches CSRF; needs separate test pass. +2 days"

Wed +9🚀jordan deploy → web-app:prod — "auth-integration shipped clean, 0 errors"

Fri 17:00📊aviv estimate_review — auth-integration [TOO_OPTIMISTIC] Δ +80% (predicted 5d, actual 9d)
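
The grading step behind that last timeline entry can be sketched in a few lines. This is a minimal illustration, not the product's actual implementation: the names `grade_estimate` and `EstimateReview`, and the `ON_TARGET` and `TOO_PESSIMISTIC` labels, are assumptions made for the example. Only `TOO_OPTIMISTIC`, the 20% tolerance, and the 5-day and 9-day figures come from the timeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EstimateReview:
    label: str        # TOO_OPTIMISTIC comes from the timeline; other labels are assumed
    delta_pct: float  # signed error relative to the prediction


def grade_estimate(predicted: float, actual: float, tolerance_pct: float) -> EstimateReview:
    """Grade one logged estimate against what actually happened."""
    if predicted <= 0:
        raise ValueError("predicted must be positive")
    delta_pct = (actual - predicted) * 100 / predicted
    if abs(delta_pct) <= tolerance_pct:
        label = "ON_TARGET"        # hypothetical label
    elif delta_pct > 0:
        label = "TOO_OPTIMISTIC"   # the work took longer than predicted
    else:
        label = "TOO_PESSIMISTIC"  # hypothetical label
    return EstimateReview(label, round(delta_pct, 1))


# The auth-integration estimate: predicted 5 days, actual 9, tolerance 20%.
review = grade_estimate(predicted=5, actual=9, tolerance_pct=20)
print(f"[{review.label}] Δ {review.delta_pct:+.0f}% (predicted 5d, actual 9d)")
```

The signed delta matters more than the label: it is what lets later reviews tell a consistent optimist apart from someone who misses in both directions.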

Slack DM, Friday handoff to the user

Aviv: weekly handoff — week of 2026-04-19

This week shipped auth-integration (9 days, predicted 5 —
TOO_OPTIMISTIC by 80%). Pattern: I consistently miss when third-
party SDK behavior diverges from documentation. Calibration after
8 weeks: duration estimates within ±10% on 14/22 logged (64%
accuracy), bias +12% (slightly optimistic).

Next week's commitment to you: payment refactor by Wed EOD.
Confidence: medium. Logged tolerance ±15%, tightened from my
default ±20% because the spec is locked.

What I'm doing about the SDK pattern: writing a knowledge entry
"third-party-sdk-pad" tagged with examples so next-quarter-Aviv
adds 1.5x padding when an SDK is in scope.

What this isn't: a vague "engineering always over- or underestimates." The data here is specific: third-party-SDK work runs 1.5× long, and the rest lands within band. After 90 days, "Aviv's estimates are within ±15%" is a real number on a board slide, not a vibe.
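
One plausible way to arrive at numbers like "within ±10% on 14/22" and "bias +12%" is to take the signed percentage error of each graded estimate, count how many fall inside the band, and average the errors. The sketch below assumes that definition, since the handoff does not spell out the exact formula. The function name `calibration_summary` and the sample data are made up for illustration.

```python
from statistics import mean


def calibration_summary(estimates, band_pct):
    """Return (hit_rate, bias_pct) for a list of (predicted, actual) pairs.

    hit_rate: share of estimates whose signed error is inside +/- band_pct.
    bias_pct: mean signed error; positive means work ran longer than predicted.
    """
    if not estimates:
        raise ValueError("need at least one graded estimate")
    errors = [(actual - predicted) * 100 / predicted for predicted, actual in estimates]
    hits = sum(1 for e in errors if abs(e) <= band_pct)
    return hits / len(errors), mean(errors)


# Illustrative data only: (predicted_days, actual_days), not the team's real log.
sample = [(5, 9), (4, 4), (10, 11), (2, 2)]
hit_rate, bias = calibration_summary(sample, band_pct=10)
print(f"{hit_rate:.0%} within ±10%, bias {bias:+.1f}%")

# The figure quoted in the handoff: 14 of 22 logged estimates inside the band.
print(f"{14 / 22:.0%}")
```

Under this definition a positive bias means work tends to take longer than predicted, which matches the "slightly optimistic" reading in the handoff.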

Review

The PR Hiro approved that took down checkout

Thursday 14:30 — Hiro approves PR #847, a "small refactor to the order pipeline." Sunday 11:00 — checkout starts failing for a subset of carts. Without a review-calibration loop, that's an isolated bad approval. With one, every approve verdict is graded against 7 days of follow-up — and the data tells you which categories Hiro should slow down on.

Timeline

Thu 14:30✅hiro review verdict APPROVE on vinemark/api#847 (refactor, 240 LOC)

Thu 14:31📝hiro review_log target=vinemark/api:847 verdict=approve change_kind=refactor pr_size_loc=240 summary="order pipeline refactor; tests passed; approving"

Thu 17:00🚀jordan deploy → api:prod — "vinemark/api @ a1b2c3 → prod, 0 errors, 3m12s"

Sun 11:14⚠️theo incident → api:prod [HIGH] — "checkout 500 rate 8% on cart_total > $50"

Sun 11:18↩️jordan rollback → api:prod — "reverted to prev SHA; checkout recovered in 90s"

Sun 11:25🐛root cause: refactor changed money rounding for orders >$50; covered by no test

Fri 17:00📊hiro review_review — vinemark/api:847 [MISSED_BUG] (rollback found in window)
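
Grading an approval against its follow-up window comes down to a date comparison. The sketch below is an assumption about how such a check might look: `grade_approval` and the `CLEAN` label are hypothetical names, and the calendar dates are illustrative. The 7-day window and the `MISSED_BUG` label come from the text.

```python
from datetime import datetime, timedelta

FOLLOW_UP_WINDOW = timedelta(days=7)  # "7 days of follow-up" from the text


def grade_approval(approved_at, incident_times):
    """Label an approve verdict by whether an incident or rollback traced to
    the change happened inside the follow-up window."""
    deadline = approved_at + FOLLOW_UP_WINDOW
    for occurred_at in incident_times:
        if approved_at <= occurred_at <= deadline:
            return "MISSED_BUG"
    return "CLEAN"  # hypothetical label for approvals with no incident in the window


approved = datetime(2026, 4, 23, 14, 30)  # Thu 14:30 (date chosen for illustration)
rollback = datetime(2026, 4, 26, 11, 18)  # Sun 11:18, inside the window
print(grade_approval(approved, [rollback]))
```

The hard part in practice is attribution, meaning how an incident gets traced back to a specific PR. That step is outside this sketch.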

Hiro's reflection note (auto-saved to KB)

# review-learnings/2026-04-26-money-rounding-coverage-gap.md

Pattern: refactors of order/payment math that don't add a new test
case — my approval clean rate on those drops noticeably.

This quarter (49 approvals graded):
  - feature       clean  92%  (24/26)
  - bugfix        clean  100% (8/8)
  - refactor      clean  73%  (8/11)  <-- the leak
  - infra         clean  50%  (1/2)
  - test          clean  100% (2/2)

What I'll do differently: refactors touching files in
billing/ or orders/ require a NEW test case in the diff, not
"existing tests still pass." If the diff doesn't include one,
I request changes instead of approving.
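
A rule like this one is simple enough to check mechanically before a human looks at the diff. The following is a rough sketch under stated assumptions: the function names, the verdict strings, and the test-file naming convention (`spec/`, `test/`, `_spec.rb`, `_test.rb`) are guesses for a Rails-style repository, not something the note specifies. A changed test file is also only a proxy for "a new test case", so a reviewer would still need to confirm that the added test exercises the refactored path.

```python
SENSITIVE_PREFIXES = ("billing/", "orders/")  # directories named in the note


def is_test_file(path):
    # Assumed naming convention; adjust to the repository's real layout.
    return path.startswith(("spec/", "test/")) or path.endswith(("_spec.rb", "_test.rb"))


def review_gate(change_kind, changed_files):
    """Return 'request_changes' when a refactor touches billing/ or orders/
    without any test file in the diff; otherwise leave it to normal review."""
    touches_sensitive = any(f.startswith(SENSITIVE_PREFIXES) for f in changed_files)
    has_test_change = any(is_test_file(f) for f in changed_files)
    if change_kind == "refactor" and touches_sensitive and not has_test_change:
        return "request_changes"
    return "eligible_for_approval"


print(review_gate("refactor", ["orders/pipeline.rb"]))
print(review_gate("refactor", ["orders/pipeline.rb", "spec/orders/rounding_spec.rb"]))
```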

The thing solo code reviewers can't do: get graded. Hiro's per-change-kind clean rate is a real signal that "refactor + payment files" is the chronic-leak combo. Solo Hiro never knows whether Sunday morning's bug was bad luck or a pattern. With calibration, the team learns by category.
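
The per-category breakdown is a grouped count over graded approvals. Here is a minimal sketch, assuming each graded approval reduces to a `(change_kind, was_clean)` pair. The function name `clean_rates` is hypothetical, and the input is rebuilt from the counts in Hiro's note rather than from real logs.

```python
from collections import Counter


def clean_rates(graded):
    """graded: iterable of (change_kind, was_clean) pairs.
    Returns {change_kind: (clean_count, total_count)}."""
    total, clean = Counter(), Counter()
    for kind, was_clean in graded:
        total[kind] += 1
        clean[kind] += was_clean  # True counts as 1, False as 0
    return {kind: (clean[kind], total[kind]) for kind in total}


# Rebuild the quarter from the (clean, total) counts in the note.
counts = {"feature": (24, 26), "bugfix": (8, 8), "refactor": (8, 11),
          "infra": (1, 2), "test": (2, 2)}
graded = [(kind, i < c) for kind, (c, n) in counts.items() for i in range(n)]

rates = clean_rates(graded)
for kind, (c, n) in rates.items():
    print(f"{kind:<10} clean {c / n:>4.0%} ({c}/{n})")
```

Categories with very few graded approvals, such as infra with two, are worth reading cautiously, because a single miss moves the rate by 50 points.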

Day 1

Day-1 inventory on a codebase nobody documented

10:00 — you finish the hire flow and message Aviv "ready." No follow-up call. Forty-two minutes later, a real audit lands in your Slack: every active service, top three stale PRs, the surfaces with zero test coverage, the architectural concerns. Compare that to a typical engineering-consulting kickoff that hasn't happened yet.

What gets dispatched

10:00:12⚙️aviv config_change → integration:plan — "Connection plan: GitHub OAuth, deploy via Jordan, Linear tickets, Stage 2"

10:00:18📨aviv assignment → sam "dayone:backend — services, languages, DBs, deps, test posture"

10:00:21📨aviv assignment → lena "dayone:frontend — framework, build, design system, page count, a11y"

10:00:24📨aviv assignment → hiro "dayone:codequality — PR cadence, sizes, review SLA, churn-prone files"

10:00:27📨aviv assignment → emma "dayone:qa — test pass %, flaky tests, surfaces with zero coverage"

10:14:05🔍sam audit → dayone:backend — "Rails monolith, 14 controllers, ~62% test cov, Sidekiq jobs untested"

10:18:33🔍lena audit → dayone:frontend — "Next.js 14, no Storybook, 47 pages, 6 a11y issues spot-checked"

10:22:11🔍hiro audit → dayone:codequality — "12 commits/wk, 3 open PRs (oldest 8d), 240 LOC avg, 65% PRs w/ tests"

10:36:42🔍emma audit → dayone:qa — "187 tests, 94% pass, 3 flaky on auth, ZERO coverage on admin/"

10:42:30📊aviv audit → team — "Day-1 complete, posted to Slack" + decision_log target=dayone:priorities

What lands in your Slack at 10:42

# Day-1 starting state — 2026-05-02 — vinemark eng

**Bottom line up front:** admin/ has zero automated coverage and
Sidekiq jobs aren't tested either. Those two are the test-debt
priority. Stale PR backlog (oldest 8 days) is the process priority.

## Server-side (Sam)
- Rails monolith, Ruby 3.2 / Rails 7.1, Postgres + Sidekiq
- 14 controllers, ~62% test coverage overall
- billing/ and orders/ well-covered (~85%)
- jobs/ basically untested — Sidekiq workers run blind

## Client-side (Lena)
- Next.js 14 (App Router), Tailwind, 47 routes
- No Storybook; components co-located with pages
- 6 a11y issues spot-checked: 4 missing alt text, 2 contrast fails

## Code health (Hiro)
- 12 commits/week, 4 contributors (Alex dominant)
- 3 open PRs, oldest 8 days stale — first process smell
- Avg PR size 240 LOC; refactors trending larger
- Time-to-first-review ~18h (no documented SLA)

## Test coverage (Emma)
- 187 tests, 94% pass rate
- 3 flaky candidates on auth flows (cookie-timing race)
- **ZERO automated coverage on admin/** — the highest-risk gap

## Recommended next actions (priority)
1. Add admin/ smoke tests (chronic-leak surface)
2. Set code-review SLA (24h to first review) + clear stale PRs
3. Cover Sidekiq jobs with at least success-path tests
4. Cap PR size at ~200 LOC by convention
5. Fix the 3 flaky auth tests before they erode trust

## What we'll do without asking
- Sam: write tests as he touches code; flag uncovered surfaces
- Lena: a11y audit before each new page ships
- Hiro: enforce SLA, log every approve verdict
- Emma: run regression on every release candidate

## What we'll always ask first
- Architecture changes spanning multiple specialists' lanes
- Major library / framework upgrades on hot paths
- Production deploys on incident days

The setup-cost claim: hired at 10:00, audit in your Slack at 10:42. That is forty-two minutes from "click hire" to a starting-state document a CTO would forward to their board, while a typical engineering-consulting kickoff has not even been scheduled.

Architecture

Sam wants GraphQL. Lena wants REST. Aviv decides.

Two engineers in disagreement on a contract design. Without a tie-breaker, this becomes a Slack thread that never resolves — or the loudest engineer wins. With Aviv: he reads both sides, checks your engineering-principles KB, makes the call, logs the decision. Next quarter, he grades whether the call held up.

Timeline

Mon 10:30📨sam approval_request → api:contract — "proposing GraphQL for new admin API; faster iteration, single endpoint"

Mon 10:42📨lena → aviv — "REST is what the rest of the app speaks; GraphQL for admin only adds a code path the team has to maintain forever"

Mon 10:48📜aviv reads engineering-principles.md — "default: minimize surface area; introduce new tech only when existing tech materially fails"

Mon 10:55✅aviv decision — "REST. Rationale: principle says minimize surface area; admin API doesn't have iteration speed problems yet."

Mon 10:55📝aviv decision_log target=api:contract type=denial what="denied GraphQL for admin API" rationale="surface-area principle; admin spec stable" expected="REST works fine; revisit if iteration becomes a real bottleneck in 90d"

+90 days📊aviv decision_review — api:contract [CORRECT_DENIAL] "no follow-up override; admin API shipped on REST without iteration friction"

What governance buys you: the architectural decision is auditable and graded over time. Every denial-style call ends up in decisions/; 90 days later decision_review labels it CORRECT_DENIAL or INCORRECT_DENIAL. After a year, you know whether Aviv's calls land — not from feel, from data. "Aviv's denials hold up at 87% over 60 calls" is a real number for a CTO conversation.
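
The 90-day grading loop can be sketched as two small functions. Treat this as one possible reading of the text, not the product's logic: it assumes a denial counts as incorrect when someone later overrode it, and the `PENDING` label and both function names are invented for the example. `CORRECT_DENIAL` and `INCORRECT_DENIAL` come from the text.

```python
from datetime import date, timedelta

REVIEW_AFTER = timedelta(days=90)


def review_denial(decided_on, today, overridden):
    """Grade a logged denial once its 90-day review date has arrived."""
    if today < decided_on + REVIEW_AFTER:
        return "PENDING"  # hypothetical label: too early to grade
    return "INCORRECT_DENIAL" if overridden else "CORRECT_DENIAL"


def hold_up_rate(labels):
    """Share of graded denials that were labeled CORRECT_DENIAL."""
    graded = [label for label in labels if label in ("CORRECT_DENIAL", "INCORRECT_DENIAL")]
    if not graded:
        raise ValueError("no graded denials yet")
    return sum(label == "CORRECT_DENIAL" for label in graded) / len(graded)


decided = date(2026, 1, 5)  # illustrative date, not taken from the timeline
print(review_denial(decided, decided + REVIEW_AFTER, overridden=False))

# 52 correct out of 60 graded calls is consistent with the "87% over 60 calls" figure.
history = ["CORRECT_DENIAL"] * 52 + ["INCORRECT_DENIAL"] * 8
print(f"{hold_up_rate(history):.0%}")
```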

QA

The signoff Emma gave that didn't hold

Friday 17:00 — Emma signs off the release candidate after a smoke pass on checkout. Sunday morning, a regression hits checkout for international cards. Without a QA-calibration loop, Emma's signoff record is a black box. With one, every "ready to ship" verdict is graded — and per-surface escape rate names the chronic-leak surfaces explicitly.

Timeline

Fri 17:00✅emma qa_signoff_log target=api:prod-rc-4421 coverage_kind=smoke surface=checkout summary="smoke pass clean; 0 failures on critical path"

Fri 17:30🚀jordan deploy → api:prod — "rc-4421 → prod, 0 errors, 4m08s"

Sun 09:42⚠️theo incident → api:prod [HIGH] — "checkout 500 on stripe_country=GB,DE,FR (~12% of cart traffic)"

Sun 09:50↩️jordan rollback — "international card path regression; smoke covered US-only"

Fri +5📊emma qa_signoff_review — rc-4421 [ESCAPED_REGRESSION] (incident on signed-off target)

# qa-learnings/2026-04-22-international-card-coverage-gap.md

Pattern: smoke kind covers US-only card flows; international
card regressions escape.

Per-surface signoff cleanliness this quarter:
  auth         clean  98%  (47/48)
  checkout     clean  82%  (37/45)  <-- chronic leak
  admin        clean  100% (12/12)
  api          clean  94%  (33/35)
  background   clean  100% (8/8)

Root cause for checkout chronic-leak: smoke kind doesn't run
international card paths. Fix: convert checkout signoff from
smoke to regression by default; add fixtures for GB/DE/FR/JP
billing addresses.

Adjusted my default: any signoff on surface=checkout requires
coverage_kind >= regression, not smoke.
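
The "coverage_kind >= regression" rule implies that coverage kinds are ordered. One way to express that is an integer-backed enum plus a per-surface minimum. This is only a sketch: the class and function names, the numeric values, and the fallback to smoke for unlisted surfaces are assumptions, while the checkout requirement comes from the note.

```python
from enum import IntEnum


class CoverageKind(IntEnum):
    # Ordered so that comparisons express "at least this thorough".
    SMOKE = 1
    REGRESSION = 2


# Minimum coverage per surface. Surfaces not listed fall back to SMOKE,
# which is an assumed default for this example.
MIN_COVERAGE = {"checkout": CoverageKind.REGRESSION}


def can_sign_off(surface, coverage_kind):
    """Allow a signoff only when the coverage kind meets the surface minimum."""
    required = MIN_COVERAGE.get(surface, CoverageKind.SMOKE)
    return coverage_kind >= required


print(can_sign_off("checkout", CoverageKind.SMOKE))       # smoke alone is blocked
print(can_sign_off("checkout", CoverageKind.REGRESSION))  # regression is allowed
```

Keeping the minimums in a table means the next chronic-leak surface is a one-line change, not a new rule.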

What this isn't: a vague "we should test more." The data is specific: checkout signoffs leak at 18%; smoke kind on checkout is the cause; the fix is regression-by-default with international fixtures. Every signoff feeds the loop; over a quarter, your test plan adapts based on real escape data, not opinion.

Weekly

Friday team handoff with real calibration numbers

17:00 every Friday — one Slack DM, not five reports. Aviv reads each specialist's reflection, the calibration data, and writes the version a CTO would forward to their board. Estimate accuracy, PR clean rate, signoff escape rate, one concrete recommendation.

Slack DM, sent at 17:02 every Friday

Aviv: weekly handoff — week of 2026-04-19

This week we shipped 14 PRs (2 rollbacks: order pipeline refactor
and international cards), closed 11 Linear tickets, and caught a
money-rounding bug in code review before it made it to staging.
Nothing on fire.

What shipped (Sam, Lena)
- 9 backend PRs across api/ — auth integration (9d, predicted 5),
  3 small bugfixes, refactor of order pipeline (rolled back Sunday)
- 5 frontend PRs across web/ — new account page, design-system
  cleanup pass, 2 a11y fixes
- All shipped through Jordan; 92% deploy clean rate this quarter

What got reviewed (Hiro)
- 14 PRs reviewed; 12 approved, 2 requested changes
- Approval clean rate (last 30d): 89% — up from 82% last month
- Refactor category still leakiest at 73%; flagged in Hiro's reflection

What got tested (Emma)
- 24 signoffs across rc-4419 → rc-4422
- Signoff cleanliness: 91% — one regression on rc-4421 (checkout
  international cards). Adjusted default to regression-kind on
  checkout; smoke alone doesn't catch it.

My calibration this week
- Estimate accuracy: 64% within band on 22 logged this quarter,
  bias +12% (slightly optimistic). Auth integration ran 80% over;
  third-party SDK pattern noted, padding adjusted.
- Decision accuracy: 4 calls graded this week, 3 CORRECT, 1
  INCONCLUSIVE (deploy hasn't happened yet)

One concrete recommendation for next week
The 8-day stale PR (vinemark/api#831, "logging refactor") is
older than our SLA target. Either land it, close it, or hand it
to someone else — pick one before it gets to 14.

— Aviv, dev-lead

What it isn't: five separate weekly reflections from five separate specialists you compile yourself. The synthesis happens at the lead level — with real calibration numbers feeding the recommendation. After 90 days you have a quarterly trend on every number above.

Ready to see this on your own codebase?

Hire the team as a $999/mo bundle. The connection interview takes ~10 minutes; the Day-1 inventory drops in your Slack within an hour. 14-day trial on first hire, no credit card required.

