What an engineering team that grades itself looks like
Six things the engineering team handles for you. Tap any one to see how it actually plays out — the real Knowledge Graph timeline, the Slack note that lands in your inbox, the calibration data that compounds week over week.
Calibration
"It'll take 5 days." It took 9.
Tuesday morning: you ask Aviv how long the new auth integration will take. He says 5 days. It ships in 9. Without a calibration loop, that's just another bad estimate. With one, the team learns what kind of work consistently runs over, and Aviv's next estimate is sharper.
Timeline (the work)
Aviv: weekly handoff, week of 2026-04-19

This week shipped auth-integration (9 days, predicted 5: TOO_OPTIMISTIC by 80%). Pattern: I consistently miss when third-party SDK behavior diverges from documentation.

Calibration after 8 weeks: duration estimates within ±10% on 14/22 logged (64% accuracy), bias +12% (slightly optimistic).

Next week's commitment to you: payment refactor by Wed EOD. Confidence: medium. Logged tolerance ±25%, tightening from my default ±20% because the spec is locked.

What I'm doing about the SDK pattern: writing a knowledge entry "third-party-sdk-pad" tagged with examples so next-quarter-Aviv adds 1.5x padding when an SDK is in scope.
What you don't get without this: a vague "engineering always overestimates / underestimates." The data here is specific: third-party-SDK work runs 1.5× long, the rest land within band. After 90 days, "Aviv's estimates are within ±15%" is a real number on a board slide, not a vibe.
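The figures in the handoff reduce to simple arithmetic. Here is a minimal Python sketch, assuming each logged estimate is stored as a (predicted, actual) pair in days. The function names and the sample log are hypothetical illustrations, not the product's actual schema:

```python
from statistics import mean

def overrun_pct(predicted_days: float, actual_days: float) -> float:
    """Signed error as a percentage of the prediction (positive = ran long)."""
    return (actual_days - predicted_days) * 100 / predicted_days

def calibration(log: list[tuple[float, float]], tolerance_pct: float) -> dict:
    """Summarize (predicted, actual) pairs against a +/- tolerance band."""
    errors = [overrun_pct(p, a) for p, a in log]
    hits = sum(1 for e in errors if abs(e) <= tolerance_pct)
    return {
        "within_band": hits,
        "logged": len(errors),
        "accuracy_pct": round(hits / len(errors) * 100),
        "bias_pct": mean(errors),  # positive = estimates skew optimistic
    }

# Illustrative data only: the auth integration (predicted 5, took 9)
# plus three made-up entries.
sample_log = [(5, 9), (2, 2), (10, 10.5), (4, 4)]

print(overrun_pct(5, 9))  # 80.0
print(calibration(sample_log, tolerance_pct=10))
```

Applied to the 22 logged estimates, the same hit-rate calculation gives `round(14 / 22 * 100)`, which is the 64% quoted in the handoff.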
Review
The PR Hiro approved that took down checkout
Thursday 14:30 — Hiro approves PR #847, a "small refactor to the order pipeline." Sunday 11:00 — checkout starts failing for a subset of carts. Without a review-calibration loop, that's an isolated bad approval. With one, every approve verdict is graded against 7 days of follow-up — and the data tells you which categories Hiro should slow down on.
Timeline
# review-learnings/2026-04-26-money-rounding-coverage-gap.md

Pattern: refactors of order/payment math that don't add a new test case; my approval clean rate on those drops noticeably.

This quarter (47 approvals graded):
- feature clean 92% (24/26)
- bugfix clean 100% (8/8)
- refactor clean 73% (8/11) <-- the leak
- infra clean 50% (1/2)
- test clean 100% (2/2)

What I'll do differently: refactors touching files in billing/ or orders/ require a NEW test case in the diff, not "existing tests still pass." If the diff doesn't include one, I request changes instead of approving.
The thing solo code reviewers can't do: get graded. Hiro's per-change-kind clean rate is real signal that "refactor + payment files" is the chronic-leak combo. Solo Hiro never knows whether the bug Sunday morning was bad luck or a pattern. With calibration, the team learns by category.
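A per-change-kind clean rate can be computed from nothing more than a list of graded approvals. The sketch below is one way to do it in Python, assuming each approval is recorded as a (change_kind, was_clean) pair. The `min_graded` threshold is an assumption added here so that a category with only two data points (infra at 1/2) is not named the leak ahead of refactors:

```python
from collections import defaultdict

def clean_rate_by_kind(graded):
    """graded: iterable of (change_kind, was_clean), one per graded approval."""
    totals = defaultdict(lambda: [0, 0])  # kind -> [clean, graded]
    for kind, was_clean in graded:
        totals[kind][1] += 1
        if was_clean:
            totals[kind][0] += 1
    return {k: (c, n, round(c / n * 100)) for k, (c, n) in totals.items()}

def leakiest(rates, min_graded=5):
    """Lowest clean rate among kinds with at least min_graded approvals."""
    eligible = {k: v for k, v in rates.items() if v[1] >= min_graded}
    return min(eligible, key=lambda k: eligible[k][2])

# Counts copied from the review-learnings note.
graded = (
    [("feature", True)] * 24 + [("feature", False)] * 2
    + [("bugfix", True)] * 8
    + [("refactor", True)] * 8 + [("refactor", False)] * 3
    + [("infra", True)] * 1 + [("infra", False)] * 1
    + [("test", True)] * 2
)

rates = clean_rate_by_kind(graded)
print(rates["refactor"])  # (8, 11, 73)
print(leakiest(rates))    # refactor
```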
Day 1
Day-1 inventory on a codebase nobody documented
10:00 — you finish the hire flow and message Aviv "ready." No follow-up call. Forty-two minutes later, a real audit lands in your Slack: every active service, top three stale PRs, the surfaces with zero test coverage, the architectural concerns. Compare that to a typical engineering-consulting kickoff that hasn't happened yet.
What gets dispatched
# Day-1 starting state, 2026-05-02, vinemark eng

**Bottom line up front:** admin/ has zero automated coverage and Sidekiq jobs aren't tested either. Those two are the test-debt priority. Stale PR backlog (oldest 8 days) is the process priority.

## Server-side (Sam)
- Rails monolith, Ruby 3.2 / Rails 7.1, Postgres + Sidekiq
- 14 controllers, ~62% test coverage overall
- billing/ and orders/ well-covered (~85%)
- jobs/ basically untested: Sidekiq workers run blind

## Client-side (Lena)
- Next.js 14 (App Router), Tailwind, 47 routes
- No Storybook; components co-located with pages
- 6 a11y issues spot-checked: 4 missing alt text, 2 contrast fails

## Code health (Hiro)
- 12 commits/week, 4 contributors (Alex dominant)
- 3 open PRs, oldest 8 days stale, the first process smell
- Avg PR size 240 LOC; refactors trending larger
- Time-to-first-review ~18h (no documented SLA)

## Test coverage (Emma)
- 187 tests, 94% pass rate
- 3 flaky candidates on auth flows (cookie-timing race)
- **ZERO automated coverage on admin/**, the highest-risk gap

## Recommended next actions (priority)
1. Add admin/ smoke tests (chronic-leak surface)
2. Set code-review SLA (24h to first review) + clear stale PR
3. Cover Sidekiq jobs with at least success-path tests
4. Cap PR size at ~200 LOC by convention
5. Fix the 3 flaky auth tests before they erode trust

## What we'll do without asking
- Sam: write tests as he touches code; flag uncovered surfaces
- Lena: a11y audit before each new page ships
- Hiro: enforce SLA, log every approve verdict
- Emma: run regression on every release candidate

## What we'll always ask first
- Architecture changes spanning multiple specialists' lanes
- Major library / framework upgrades on hot paths
- Production deploys on incident days
The setup-cost claim: hired at 10:00, audit in your Slack at 10:42. Forty-two minutes from "click hire" to a starting-state document a CTO would forward to their board. Compare to an engineering-consulting kickoff that hasn't been scheduled yet.
Architecture
Sam wants GraphQL. Lena wants REST. Aviv decides.
Two engineers in disagreement on a contract design. Without a tie-breaker, this becomes a Slack thread that never resolves — or the loudest engineer wins. With Aviv: he reads both sides, checks your engineering-principles KB, makes the call, logs the decision. Next quarter, he grades whether the call held up.
Timeline
What governance buys you: the architectural decision is auditable and graded over time. Every denial-style call ends up in decisions/; 90 days later decision_review labels it CORRECT_DENIAL or INCORRECT_DENIAL. After a year, you know whether Aviv's calls land — not from feel, from data. "Aviv's denials hold up at 87% over 60 calls" is a real number for a CTO conversation.
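A hold-up rate like that is a ratio over graded decisions. As a rough Python sketch, assuming `decision_review` emits one label string per decision and that anything other than the two denial labels (for example an inconclusive review) is left out of the denominator. The 52/8 split below is invented to illustrate how an 87%-over-60 figure could arise:

```python
from collections import Counter

def denial_hold_rate(labels):
    """Percent of graded denials labeled CORRECT_DENIAL; other labels are skipped."""
    counts = Counter(labels)
    graded = counts["CORRECT_DENIAL"] + counts["INCORRECT_DENIAL"]
    if graded == 0:
        return None  # nothing graded yet
    return round(counts["CORRECT_DENIAL"] / graded * 100)

# Hypothetical year of reviews: 52 denials held up, 8 did not, 3 ungradeable.
labels = (
    ["CORRECT_DENIAL"] * 52
    + ["INCORRECT_DENIAL"] * 8
    + ["INCONCLUSIVE"] * 3
)

print(denial_hold_rate(labels))  # 87
```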
QA
The signoff Emma gave that didn't hold
Friday 17:00 — Emma signs off the release candidate after a smoke pass on checkout. Sunday morning, a regression hits checkout for international cards. Without a QA-calibration loop, Emma's signoff record is a black box. With one, every "ready to ship" verdict is graded — and per-surface escape rate names the chronic-leak surfaces explicitly.
Timeline
Pattern: smoke kind covers US-only card flows; international escapes through.

Per-surface signoff cleanliness this quarter:
auth clean 98% (47/48)
checkout clean 82% (37/45) <-- chronic leak
admin clean 100% (12/12)
api clean 94% (33/35)
background clean 100% (8/8)

Root cause for checkout chronic-leak: smoke kind doesn't run international card paths.

Fix: convert checkout signoff from smoke to regression by default; add fixtures for GB/DE/FR/JP billing addresses.

Adjusted my default: any signoff on surface=checkout requires coverage_kind >= regression, not smoke.
What this isn't: a vague "we should test more." The data is specific: checkout signoffs leak at 18%; smoke kind on checkout is the cause; the fix is regression-by-default with international fixtures. Every signoff feeds the loop; over a quarter, your test plan adapts based on real escape data, not opinion.
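Both the 18% escape rate and the "coverage_kind >= regression" rule are easy to express in code. The Python sketch below assumes coverage kinds form an ordered scale with regression stricter than smoke. The enum, the `MIN_COVERAGE` table, and the function names are hypothetical, and only the checkout rule comes from Emma's note:

```python
from enum import IntEnum

class CoverageKind(IntEnum):
    # Ordering is an assumption: regression is treated as stricter than smoke.
    SMOKE = 1
    REGRESSION = 2

# Hypothetical per-surface minimums; unlisted surfaces default to smoke.
MIN_COVERAGE = {"checkout": CoverageKind.REGRESSION}

def escape_rate_pct(clean: int, total: int) -> int:
    """Percent of signoffs on a surface that were followed by an escaped defect."""
    return round((total - clean) / total * 100)

def can_sign_off(surface: str, coverage: CoverageKind) -> bool:
    """True if the coverage run meets the minimum required for this surface."""
    required = MIN_COVERAGE.get(surface, CoverageKind.SMOKE)
    return coverage >= required

print(escape_rate_pct(37, 45))                            # 18
print(can_sign_off("checkout", CoverageKind.SMOKE))       # False
print(can_sign_off("checkout", CoverageKind.REGRESSION))  # True
print(can_sign_off("auth", CoverageKind.SMOKE))           # True
```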
Weekly
Friday team handoff with real calibration numbers
17:00 every Friday — one Slack DM, not five reports. Aviv reads each specialist's reflection, the calibration data, and writes the version a CTO would forward to their board. Estimate accuracy, PR clean rate, signoff escape rate, one concrete recommendation.
Aviv: weekly handoff, week of 2026-04-19

This week we shipped 14 PRs (1 rollback: international cards), closed 11 Linear tickets, and caught a money-rounding bug in code review before it made it to staging. Nothing on fire.

What shipped (Sam, Lena)
- 9 backend PRs across api/: auth integration (9d, predicted 5), 3 small bugfixes, refactor of order pipeline (rolled back Sunday)
- 5 frontend PRs across web/: new account page, design-system cleanup pass, 2 a11y fixes
- All shipped through Jordan; 92% deploy clean rate this quarter

What got reviewed (Hiro)
- 14 PRs reviewed; 12 approved, 2 requested changes
- Approval clean rate (last 30d): 89%, up from 82% last month
- Refactor category still leakiest at 73%; flagged in my reflection

What got tested (Emma)
- 24 signoffs across rc-4419 → rc-4422
- Signoff cleanliness: 91%, with one regression on rc-4421 (checkout international cards). Adjusted default to regression-kind on checkout; smoke alone doesn't catch it.

My calibration this week
- Estimate accuracy: 64% within band on 22 logged this quarter, bias +12% (slight under). Auth integration was 80% over; third-party SDK pattern noted, padding adjusted.
- Decision accuracy: 4 calls graded this week, 3 CORRECT, 1 INCONCLUSIVE (deploy didn't actually happen yet)

One concrete recommendation for next week
The 8-day stale PR (vinemark/api#831, "logging refactor") is older than our SLA target. Either land it, close it, or hand it to someone else; pick one before it gets to 14.

Aviv, dev-lead
What it isn't: five separate weekly reflections from five separate specialists you compile yourself. The synthesis happens at the lead level — with real calibration numbers feeding the recommendation. After 90 days you have a quarterly trend on every number above.
Ready to see this on your own codebase?
Hire the full team as a $999/mo bundle. The connection interview takes about 10 minutes, and the Day-1 inventory lands in your Slack within an hour. The first hire comes with a 14-day trial, no credit card required.
Hire the team → Browse single specialists
Looking for a different team? IT Operations, sales ops, customer support: browse all teams →