On-call fatigue and the case for rest

I have answered a page at 2 a.m., fixed the symptom, gone back to bed, and discovered in the morning that I changed the wrong cluster. Nothing catastrophic happened. That is not the point. The point is I was on duty in name only — body in bed, mind half in kubectl — and the platform deserved a sharper version of me than I could offer at that hour.

Image: laptop open on a desk at night. Photo by cottonbro studio on Pexels.

On-call is part of DevOps work. I am not arguing it away. I am arguing that we treat fatigue as technical debt we never schedule time to pay down, then act surprised when incidents stretch, postmortems repeat, and good engineers quietly stop wanting the pager.

Aviation has rest rules that feel bureaucratic until you read an accident report where fatigue was a factor. Commercial pilots have duty time limits, minimum rest periods, and scheduling systems that track both. I am not saying SRE teams need the same legal framework as an airline. I am saying the underlying physics is shared: humans degrade under interrupted sleep, and degraded humans degrade systems.

This is what I think about when I help shape rotations, when I am the one holding the pager, and when I notice I am making small mistakes that my rested self would not make.

Fatigue is not laziness

Alert fatigue (too many pages, too little signal) gets a lot of ink. The fatigue of the human holding the pager is adjacent and quieter.

Interrupted sleep accumulates. Three “quick” pages in one night do not equal three separate events psychologically. They equal one broken night. By the fourth night of a primary rotation, reaction time slips. You reach for familiar commands instead of reading the error. You assume the last incident’s fix applies again. You merge a PR you would have asked someone to review on a Tuesday.

I have done all of this. I am not proud. I am also not unique. Studies on shift work and sleep deprivation show the same pattern: judgment suffers before motivation does. You still care. You care badly.

Aviation schedules account for duty period (time on task), flight time, sector count, and rest since last duty. Airline operations do not rely on pilots saying “I feel fine.” Feelings lie. Clocks are blunt but fair.

We often schedule on-call as one name in PagerDuty until next Monday and call it done. No tracking of pages per night. No recovery day after a bad incident. No cap on consecutive weeks. The tool knows who is primary. It does not know if that person slept.
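As a rough illustration of what tracking could look like, the sketch below turns a page log into "broken nights" per person, so that three pages in one night count once. The input shape (responder name plus local timestamp), the function names, and the 22:00 to 06:00 sleep window are all assumptions made for the example, not something any paging tool provides out of the box.

```python
from datetime import datetime, timedelta

# Assumed sleep window in local time; adjust to the team's reality.
NIGHT_START_HOUR = 22
NIGHT_END_HOUR = 6


def night_of(ts):
    """Return the date the night began on, or None for a daytime page."""
    if ts.hour >= NIGHT_START_HOUR:
        return ts.date()
    if ts.hour < NIGHT_END_HOUR:
        # A 02:00 page belongs to the night that started the previous evening.
        return (ts - timedelta(days=1)).date()
    return None


def broken_nights(pages):
    """Map each responder to the set of nights in which they were paged.

    `pages` is an iterable of (responder, datetime) pairs (hypothetical shape).
    """
    nights = {}
    for responder, ts in pages:
        night = night_of(ts)
        if night is not None:
            nights.setdefault(responder, set()).add(night)
    return nights


pages = [
    ("ana", datetime(2024, 3, 5, 23, 10)),
    ("ana", datetime(2024, 3, 6, 1, 40)),
    ("ana", datetime(2024, 3, 6, 4, 5)),
    ("ana", datetime(2024, 3, 6, 14, 0)),  # daytime, not counted
    ("bo", datetime(2024, 3, 7, 3, 0)),
]
for person, nights in broken_nights(pages).items():
    print(person, len(nights))
```

Counting nights rather than pages is a deliberate choice here: it matches how interrupted sleep feels, and it keeps one noisy alert from hiding behind a low page count.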

What a bad rotation feels like from inside

Primary week starts hopeful. Day three, a deploy misfires and you are up until 4 a.m. Day four, marketing sends an email blast, traffic spikes, and Horizontal Pod Autoscaler (HPA) lag wakes you again. Day five, you sit in meetings because that is when meetings happen. Day six, someone asks you to “just quick review” a production change because you “know the cluster.” Day seven, you hand off to secondary and realize you cannot remember what you fixed twice.

Secondary is not rest if secondary still gets escalations. Manager-on-call that never gets trained is not backup. Coverage on paper is not coverage in biology.

The cruelest version is the hero culture: whoever fixes it fastest gets thanked, whoever asks for rest gets side-eyed. Aviation learned the hard way that heroes crashing airplanes is a bad trade. We are not flying passengers, but we are touching systems that pay rent and hospital bills for strangers. Tired fixes become tomorrow’s incidents.

I do not have a perfect rotation story. I have been in broken ones and helped patch a few into something kinder. Kindness here is operational, not sentimental.

Parallels I take seriously (and one I do not)

Duty limits. A pilot’s day has a ceiling. On-call is not active duty every minute, but pages and incident bridges are. I treat long bridges after midnight as consuming the next day’s capacity the way a long duty consumes rest minimums. If we bridged four hours overnight, I am skeptical of that person leading a risky change the same afternoon.

Rest before return. After a heavy incident, the person who drove root cause analysis should not automatically be primary again the next night. Rotation tools rarely encode “post-incident recovery.” Teams can, manually at first.

Crew resource management. CRM says speak up when someone is overloaded, divide tasks, verify critical steps aloud. Incident command helps when it is practiced. One tired person typing alone in a channel at 3 a.m. is the opposite of CRM.

Sterile cockpit rules. Brief quiet periods for high-risk changes reduce distraction. I wrote about this elsewhere in production change terms. The same idea protects on-call: not every alert needs a human in the loop at once. Reduce noise so the human can rest between real events.

What I do not parallel literally. Pilots cannot “roll back” a landing. We can roll back deploys. The stakes rhyme; the mechanics differ. I avoid aviation cosplay in postmortems. The useful part is respect for human limits, not wearing epaulets in Slack.

Designing rotations that admit humans exist

I do not have one template. Context matters — team size, timezone spread, customer SLA, how noisy the platform is. Patterns that helped teams I have been on:

Primary and secondary with real secondary. Secondary answers within minutes if primary does not. Secondary is trained on the same runbooks. Secondary is not “whoever checks email sometimes.”

Follow-the-sun when you can. Handoffs at defined times with written notes — open incidents, suspicious graphs, deploys in flight. Like a crew change with a logbook entry, not “hey you’re up, good luck.”

A cap on primary weeks. I like two primary weeks per quarter per person as a starting conversation, not a law. The number matters less than having a number, so that no one engineer becomes the de facto permanent backup.
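Checking a schedule against a cap like that takes only a few lines. The sketch below assumes the schedule can be exported as (week start date, primary name) pairs; the function names and the default cap of two are illustrative.

```python
from collections import Counter
from datetime import date


def quarter(d):
    """Return (year, quarter number) for a date."""
    return d.year, (d.month - 1) // 3 + 1


def over_cap(schedule, cap=2):
    """Return {(quarter, person): weeks} for anyone above `cap` primary weeks.

    `schedule` is an iterable of (week_start, person) pairs (hypothetical shape).
    """
    counts = Counter((quarter(week_start), person) for week_start, person in schedule)
    return {key: weeks for key, weeks in counts.items() if weeks > cap}


schedule = [
    (date(2024, 1, 1), "ana"),
    (date(2024, 1, 8), "bo"),
    (date(2024, 1, 15), "ana"),
    (date(2024, 1, 22), "ana"),
    (date(2024, 4, 1), "ana"),
]
print(over_cap(schedule))
```

In this made-up schedule, ana holds three primary weeks in the first quarter of 2024 and is flagged; her single week in the second quarter is not.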

Comp time or recovery days after bad nights. Controversial in some companies. Cheaper than repeat outages. One org I worked with gave an automatic day off after any incident bridge longer than three hours ending after midnight. Abuse was rare. Morale improved.
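A rule like that one is simple enough to write down precisely, which helps when people argue about edge cases. The sketch below is one possible reading, not that org's actual implementation: "ending after midnight" is interpreted as ending between 00:00 and 06:00 local time, and the 06:00 cutoff is an assumption added for the example.

```python
from datetime import datetime, timedelta

MIN_BRIDGE = timedelta(hours=3)  # from the rule described above
OVERNIGHT_END_HOUR = 6           # assumed cutoff for "after midnight"


def earns_recovery_day(bridge_start, bridge_end):
    """True if the bridge ran longer than three hours and ended overnight.

    Both arguments are naive datetimes in the responder's local time.
    """
    ran_long = (bridge_end - bridge_start) > MIN_BRIDGE
    ended_overnight = bridge_end.hour < OVERNIGHT_END_HOUR
    return ran_long and ended_overnight


# 22:00 to 01:30 is 3.5 hours and ends overnight.
print(earns_recovery_day(datetime(2024, 3, 5, 22, 0), datetime(2024, 3, 6, 1, 30)))
# A four-hour afternoon bridge is long but does not cost sleep.
print(earns_recovery_day(datetime(2024, 3, 5, 13, 0), datetime(2024, 3, 5, 17, 0)))
```

The value of writing it down is that the day off stops being a favor someone has to ask for.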

Manager escalation path that does not default to waking the primary again. Sometimes the answer is customer comms or a feature flag owned by product, not another kubectl from the same tired brain.

Handoff ritual. Five-minute sync or async doc: what fired, what is flaky, what deploys are planned. Boring. Reduces “why did nobody tell me Redis was mid-migration” pages.
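For teams that prefer the async doc, a fixed structure makes gaps visible. The sketch below is one hypothetical shape for such a note; the class, field names, and rendered layout are invented for illustration. An empty section prints "nothing to report", so silence is explicit rather than ambiguous.

```python
from dataclasses import dataclass, field


@dataclass
class HandoffNote:
    """Hypothetical handoff note covering the three questions above."""
    outgoing: str
    incoming: str
    fired: list = field(default_factory=list)
    flaky: list = field(default_factory=list)
    planned_deploys: list = field(default_factory=list)

    def render(self):
        def section(title, items):
            # An empty join is falsy, so empty sections get an explicit line.
            body = "\n".join(f"  - {item}" for item in items) or "  - nothing to report"
            return f"{title}:\n{body}"

        return "\n".join([
            f"Handoff {self.outgoing} -> {self.incoming}",
            section("What fired", self.fired),
            section("What is flaky", self.flaky),
            section("Planned deploys", self.planned_deploys),
        ])


note = HandoffNote(
    outgoing="ana",
    incoming="bo",
    fired=["checkout 5xx page at 02:10, mitigated by rollback"],
    planned_deploys=["Redis migration, phase 2"],
)
print(note.render())
```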

Tools support this — PagerDuty schedules, Opsgenie overrides, Google Calendar blocks for “recovery.” Tools do not replace policy. Without policy, the calendar block gets meeting’d over.

Reducing pages is rest policy too

The cheapest way to reduce on-call fatigue is fewer false emergencies. I overlap with alert discipline here; the on-call angle is sleep.

If everything pages, nothing is urgent, and every night is broken. Fixing thresholds, ownership, and SLO-based alerts is rest engineering. It is also incident prevention.

Kubernetes clusters generate infinite ways to wake someone — CrashLoopBackOff, node NotReady, cert expiry in thirty days labeled CRITICAL. Each rule should pass a simple test: would I want to be woken for this if I had been woken twice already this week? If no, ticket it.

Weekly digest for trending issues beats nightly noise for slow leaks. Burn-rate alerts beat static CPU thresholds. I still see platforms that page on pod restarts without excluding known cron failures. That is a policy choice wearing a monitoring badge.
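For readers who have not worked with burn-rate alerts, the arithmetic is short. Burn rate is the observed error ratio divided by the error budget (one minus the SLO target); a burn rate of 1 spends the budget exactly over the SLO period. The sketch below pages only when both a long and a short window are burning fast, which is a common multi-window pattern. The function names are illustrative, and the default threshold of 14.4 is a commonly cited starting point for a one-hour window against a 30-day SLO (roughly 2% of the monthly budget spent in an hour), not a universal constant.

```python
def burn_rate(error_ratio, slo):
    """How fast the error budget is being spent, relative to plan."""
    budget = 1.0 - slo
    return error_ratio / budget


def should_page(long_window_ratio, short_window_ratio, slo, threshold=14.4):
    """Page only if both windows exceed the threshold.

    The long window shows the problem is significant; the short window
    shows it is still happening, so recovered incidents stop paging.
    """
    return (
        burn_rate(long_window_ratio, slo) >= threshold
        and burn_rate(short_window_ratio, slo) >= threshold
    )


slo = 0.999  # 99.9% availability target (example value)
# 2% errors over the last hour and 3% over the last five minutes: wake someone.
print(should_page(0.02, 0.03, slo))
# 2% over the hour but already recovered in the last five minutes: let them sleep.
print(should_page(0.02, 0.0005, slo))
```

A static CPU threshold has no equivalent of the second case: it cannot tell a problem that is over from one that is still costing users.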

When you are on call and tired anyway

Real life does not wait for ideal rotations. Kids sick. Insomnia. Neighbor’s alarm. Primary week still happens.

Personal habits that helped me — imperfectly:

Do not hero past handoff time. If secondary exists, escalate at the agreed threshold. Carrying “I got this” past exhaustion is how wrong clusters get edited.

Write down every action during pages. Timestamped notes in the incident doc. Rested you verifies; tired you trusts paper.

Avoid irreversible moves without a second pair of eyes. Delete PVC, drain production node, --force anything — pause, ping secondary or senior, breathe. Aviation uses challenge-response on critical items. We can type “confirm delete prod” in a thread.

Sleep after handoff if the incident is stable. The story can wait until morning standup. All-nighters feel noble and produce fuzzy postmortems.

Say you are tired in the bridge. Not as excuse — as data. “I need someone else to drive verification; I have been up since 1.” Good incident leads treat that like a low fuel callout.

I am bad at the last one when ego flares. I am trying to be better.

What managers and leads can do without a new platform

Ask how many pages last week — per person, not aggregate. Aggregates hide one person drowning.
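If the page log can be exported, the per-person view is a few lines of code. The sketch below assumes a list of (responder, page id) pairs and flags anyone at or above twice the team median; the names, the input shape, and the factor of two are assumptions for the example.

```python
from collections import Counter
from statistics import median


def pages_per_person(pages):
    """Count pages per responder from (responder, page_id) pairs."""
    return Counter(responder for responder, _ in pages)


def overloaded(counts, factor=2.0):
    """Responders whose page count is at least `factor` times the median."""
    if not counts:
        return []
    typical = median(counts.values())
    return sorted(
        person for person, n in counts.items()
        if n > 0 and n >= factor * typical
    )


# Made-up week: 19 pages in total, which sounds tolerable as an aggregate.
pages = (
    [("ana", i) for i in range(14)]
    + [("bo", i) for i in range(2)]
    + [("cy", i) for i in range(3)]
)
counts = pages_per_person(pages)
print(counts)
print(overloaded(counts))
```

One caveat with this sketch: people who received no pages do not appear in the counts at all, so the median is taken only over those who were paged.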

Review consecutive on-call loads in retros. If the same name appears every third week and also owns the migration project, that is scheduling debt.

Protect training time for secondary. Untrained backup is fiction.

Celebrate boring weeks, not only war stories. Reward reduced MTTA from better alerts, not just heroic 4 a.m. saves.

Staff on-call with enough people that vacation does not require guilt. If removing one engineer collapses the rotation, hiring or scope reduction is the real issue — not that they took a week off.

None of this requires buying software. It requires admitting sustainable ops is a staffing and alerting problem, not a virtue problem.

Rest and the humble admission of limits

I wanted to be the person who could always take the page, always stay sharp, always ship the fix. That person is a character, not a coworker. Real me needs sleep. Real me makes more mistakes on the fifth interrupted night. Real me does better work when the rotation is fair and the alerts are honest.

Aviation did not eliminate fatigue by telling pilots to try harder. It measured duty, enforced rest, built CRM, and changed culture when accidents demanded it. We can borrow the humility without copying the regulatory stack.

If you hold the pager this week, I hope the nights are quiet. If they are not, I hope someone can cover you while you recover. If you design rotations, count pages and sleep debt the way you count error budgets — real numbers, real tradeoffs, not vibes.

The cluster will still be there tomorrow. It runs better when the humans tending it are still standing, still thinking clearly, still willing to pick up the pager next month because this month did not break them.

That is the whole argument. Rest is not the opposite of reliability. Rest is part of how reliability survives contact with production.