Why I write runbooks like flight manuals

Flight manuals are boring on a sunny day and priceless when something unexpected happens. Runbooks are the same species of document. I did not invent that comparison, and I am not claiming aviation solved operations. What I took from flying is simpler: when your heart rate goes up, you fall back on structure. A good runbook gives you that structure without pretending the world will always match the script.

[Image: open notebook and pen on a desk. Photo by Leeloo The First on Pexels]

This post is about how I write runbooks today, what I still get wrong, and where Kubernetes fits into the picture. If you are looking for a template library or a certification answer, this is not that. It is one person’s notes from incidents that went badly and a few that went less badly because someone had written something down while they were calm.

What a good runbook actually contains

Not a copy-paste of every kubectl command you have ever typed. A runbook should answer a small set of questions that your future self, at 2 a.m., will not want to reconstruct from memory.

Symptom — what does broken look like for users and for metrics? Be specific. “App slow” is not a symptom. “P95 latency above 800 ms on the checkout service for more than five minutes” is closer. Include what alerts fire, what customers report, and what a healthy baseline looks like so you can tell the difference.

First checks — three to five commands or dashboard panels, in order. Order matters. If you jump to the interesting hypothesis before confirming the basics, you waste time. I usually start with: is the service receiving traffic, are Pods ready, are errors elevated, did anything deploy recently. Boring checks first.

Safe mitigations — scale, rollback, traffic shift, freeze deploys. Each step should say what it fixes, what it might break, and how to undo it. “Restart the deployment” is not a mitigation unless you know why restart helps and what state you might lose.

Escalation — who to call and what info they need. Name a person or a rotation, not a department. List the facts they will ask for anyway: when it started, blast radius, last change, what you already tried.

Stop conditions — when to pause and think instead of poking more. This is the part most runbooks skip. In aviation, checklists have explicit hold points: do not continue until this condition is met. Operations needs the same. If two mitigations fail, stop and widen the investigation. Random deletes at 3 a.m. rarely age well.

If your runbook is fifty pages, nobody will open it during an incident. One page per common failure mode is enough to start. You can always link to deeper docs for the rare cases.
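As a rough illustration of "boring checks first," the ordering could be captured in a small shell function. This is a sketch, not a standard tool: `first_checks` is a hypothetical name, and it assumes the Service, the Deployment, and the `app=` label all share one name, which may not hold in your cluster. Warning events are only a stand-in for "are errors elevated"; the real answer usually lives on a dashboard.

```shell
# first_checks NAMESPACE NAME
# Hypothetical helper: assumes Service, Deployment, and app= label all use NAME.
first_checks() {
  local ns="$1" name="$2"

  echo "== 1. Does the Service have endpoints (can it receive traffic)?"
  kubectl get endpoints "$name" -n "$ns"

  echo "== 2. Are Pods ready?"
  kubectl get pods -n "$ns" -l "app=$name"

  echo "== 3. Recent Warning events (rough proxy for elevated errors)"
  kubectl get events -n "$ns" --field-selector type=Warning \
    --sort-by=.lastTimestamp | tail -n 10

  echo "== 4. Did anything deploy recently?"
  kubectl rollout history "deployment/$name" -n "$ns"
}
```

Called as `first_checks shop checkout`, it prints the four checks in a fixed order so nobody has to decide under pressure which one comes first.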

Aviation parallel, without the poster slogans

Pilots drill flows until the structure is muscle memory, then adapt when reality does not match the script. Emergency procedures in a flight manual are not inspirational quotes. They are sequential steps with decision branches: if this, then that; if not, go here.

I try to write runbooks the same way. Stable steps, explicit decision points, blank space for notes after an incident. The goal is not to remove judgment. The goal is to reserve judgment for the parts the manual cannot know — the weird interaction between a new feature flag and an old cache — instead of spending cognitive bandwidth remembering which dashboard shows ingress errors.

There is a difference between a checklist and a reference manual. Checklists are short and executed under time pressure. Reference chapters explain systems in depth for training and troubleshooting when you have an hour. I keep both, but I label them clearly. During an incident, I want the checklist. The day after, I want the reference.

One habit from the cockpit that transfers better than I expected: read-back. When someone says “scale the deployment to ten,” the incident lead repeats it and confirms. Small thing. Cuts down on “oh I thought you meant the other cluster” moments.
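The same read-back idea can be approximated at the terminal. The sketch below is an assumption-laden illustration: `read_back` is a hypothetical wrapper, not a kubectl feature. It prints the active kubectl context and the exact command, then runs it only if the operator types "yes".

```shell
# read_back CMD [ARGS...]
# Hypothetical wrapper: echo the context and command, run only on "yes".
read_back() {
  local ctx answer
  ctx="$(kubectl config current-context 2>/dev/null || echo unknown)"
  printf 'Context: %s\nCommand: %s\n' "$ctx" "$*"
  printf 'Type "yes" to run: '
  read -r answer
  if [ "$answer" = "yes" ]; then
    "$@"                       # run the command exactly as it was read back
  else
    echo "Aborted; nothing was run."
    return 1
  fi
}
```

For example, `read_back kubectl scale deployment/web -n shop --replicas=10` shows which cluster the command would hit before anything changes. It does not replace a human repeating the instruction on the call; it only catches the wrong-context case.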

Where Kubernetes fits

Cluster incidents repeat more than we admit. Node NotReady, CrashLoopBackOff, pending PVCs, DNS weirdness, ingress misconfiguration, certificate expiry, quota exhaustion, namespaces stuck in Terminating. Each deserves a short path documented while you are calm.

Here is the shape of a runbook section I use for CrashLoopBackOff — not because it is novel, but because writing it down once saved me from improvising badly twice:

# Symptom: Pod restarts repeatedly; kubectl get pods shows CrashLoopBackOff

# 1. Recent change?
kubectl rollout history deployment/<name> -n <ns>
kubectl get events -n <ns> --sort-by='.lastTimestamp' | tail -20

# 2. Why is the container exiting?
kubectl logs deployment/<name> -n <ns> --previous   # picks one Pod; use pod/<pod-name> to target the crashing one
kubectl describe pod -n <ns> -l app=<label> | grep -A5 "Last State"

# 3. Safe mitigations (pick one, document in incident channel)
kubectl rollout undo deployment/<name> -n <ns>   # if correlated with deploy
kubectl scale deployment/<name> -n <ns> --replicas=0  # stop bleeding, if acceptable

# Stop: do not delete PVCs, Secrets, or Namespaces without explicit approval

That block is not clever. It is ordered. The runbook also links to the Grafana dashboard for that service and names the on-call for the database in case the logs show “connection refused.”

For node-level issues, I separate “Pod cannot schedule” from “node is NotReady.” They look similar in Slack (“nothing works”) but the fixes diverge fast. A scheduling runbook mentions taints, resource requests, PVC zone mismatch, and cluster autoscaler status. A NotReady runbook mentions kubelet logs, disk pressure, and cloud provider status pages. Mixing them in one doc means someone applies the wrong fix with confidence.
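One way to decide which of the two runbooks to open is a quick count of NotReady nodes and Pending Pods. The function below is a sketch with a hypothetical name, `triage_cluster`. It treats the Pending phase as a rough proxy for scheduling trouble, although Pending also covers Pods that are scheduled and still pulling images, so confirm with `kubectl describe pod` before acting.

```shell
# triage_cluster: suggest which runbook to open first.
# Prints "node-notready", "pod-scheduling", or "neither" on stdout.
triage_cluster() {
  local not_ready pending
  # STATUS is the second column of `kubectl get nodes`; it can read
  # "NotReady" or "NotReady,SchedulingDisabled".
  not_ready="$(kubectl get nodes --no-headers 2>/dev/null |
    awk '$2 ~ /^NotReady/ {n++} END {print n+0}')"
  pending="$(kubectl get pods --all-namespaces --no-headers \
    --field-selector=status.phase=Pending 2>/dev/null |
    awk 'NF {n++} END {print n+0}')"

  echo "NotReady nodes: $not_ready, Pending Pods: $pending" >&2
  if [ "$not_ready" -gt 0 ]; then
    echo "node-notready"    # check nodes first; Pending Pods may be a side effect
  elif [ "$pending" -gt 0 ]; then
    echo "pod-scheduling"
  else
    echo "neither"
  fi
}
```

Checking nodes first is a design choice rather than a rule: a NotReady node often produces Pending Pods as a consequence, so the node runbook is usually the better starting point when both counts are nonzero.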

OpenShift adds another layer — Routes instead of Ingress, Operators that reconcile custom resources, SCC constraints that silently block Pods. I maintain platform-specific addenda rather than pretending one runbook fits vanilla Kubernetes and OpenShift equally. Duplication is annoying. Wrong steps during an outage are worse.

How I actually write and maintain them

I write runbooks after incidents, not before launch. That sounds backward. In practice, the post-incident window is when the team agrees on what happened and what would have helped. Capture that while it is fresh. A runbook born from a real failure mode gets read. A runbook born from a template gets ignored.

Each page gets a date and an owner. “Last verified: 2024-05-02, @marc” is not bureaucracy. It is permission to distrust the page when the date is two years old and the architecture has changed twice.
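If the "Last verified" line follows a fixed format, a staleness check can be automated. The sketch below assumes each runbook is a plain-text file containing a line like `Last verified: 2024-05-02, @marc`; the function name `runbook_is_stale` and the file layout are assumptions for illustration.

```shell
# runbook_is_stale FILE CUTOFF
# Exit 0 (stale) when FILE has no "Last verified: YYYY-MM-DD" line or when
# that date is earlier than CUTOFF. ISO dates compare correctly as strings.
runbook_is_stale() {
  local file="$1" cutoff="$2" verified
  verified="$(grep -oE 'Last verified: [0-9]{4}-[0-9]{2}-[0-9]{2}' "$file" |
    head -n 1 | awk '{print $3}')"
  if [ -z "$verified" ]; then
    return 0                  # undated pages count as stale
  fi
  # Appending "" forces awk to compare the values as strings.
  awk -v a="$verified" -v b="$cutoff" 'BEGIN { exit !((a "") < (b "")) }'
}
```

A cutoff of one year ago can be produced with `date -d '1 year ago' +%F` on GNU date or `date -v-1y +%F` on BSD and macOS. For example, `runbook_is_stale crashloop.txt 2025-01-01 && echo "re-verify this page"`, where `crashloop.txt` is a placeholder file name.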

I link to live dashboards, not screenshots. Screenshots lie the week after someone renames a panel. If the link rots, the runbook fails visibly during a drill, which is when you want to find out.

Drills matter. Once a quarter, pick a runbook and walk through it in staging or against a recorded incident. You will find steps that reference deleted namespaces and commands that need a newer flag. Fix them then, not during production pain.

For GitOps environments, the runbook should say how to pause reconciliation safely. Argo CD and Flux can fight you if you manually patch something they manage. A line that says “freeze sync on application X before manual scale” prevents a second incident nested inside the first.
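A "freeze sync" line could look something like the sketch below. The function names `pause_gitops` and `resume_gitops` are hypothetical, and the default `flux-system` namespace is an assumption. The underlying commands are the Flux CLI's `suspend`/`resume` for a Kustomization and the Argo CD CLI's `app set --sync-policy`.

```shell
# pause_gitops TOOL NAME [NAMESPACE]   (TOOL is "flux" or "argocd")
pause_gitops() {
  local tool="$1" name="$2" ns="${3:-flux-system}"
  case "$tool" in
    flux)   flux suspend kustomization "$name" -n "$ns" ;;
    argocd) argocd app set "$name" --sync-policy none ;;   # turn off auto-sync
    *)      echo "unknown tool: $tool" >&2; return 2 ;;
  esac
}

# resume_gitops TOOL NAME [NAMESPACE]  (undo step for pause_gitops)
resume_gitops() {
  local tool="$1" name="$2" ns="${3:-flux-system}"
  case "$tool" in
    flux)   flux resume kustomization "$name" -n "$ns" ;;
    argocd) argocd app set "$name" --sync-policy automated ;;
    *)      echo "unknown tool: $tool" >&2; return 2 ;;
  esac
}
```

Two caveats worth writing into the runbook itself. If the Argo CD Application is declared in Git, for instance through an app-of-apps setup, its parent may revert the paused policy. After resuming, check whether prune and self-heal settings need to be re-enabled.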

What I still get wrong

Keeping runbooks updated after architecture changes is my weakest link. A stale runbook is worse than none — it trains mistrust. People stop opening them. Then the one time someone does, they follow a rollback command for a deployment that no longer exists.

I have also over-documented. Early in my runbook enthusiasm, I merged troubleshooting, architecture diagrams, and onboarding into one Confluence tree. Nobody could find the checklist inside the encyclopedia. Splitting “how the system works” from “what to do when it breaks” helped.

Another mistake: writing for the expert I wish I had on call, not for the person who actually is. If your runbook assumes deep knowledge of your custom service mesh, rewrite it or accept that it is reference material, not an incident checklist.

I still improvise plenty during incidents. The runbook stops me from starting with random deletes. It does not replace talking to people who know the application better than I do. Crew resource management (CRM) on the incident bridge, meaning who speaks and who decides, matters as much as the document. The runbook is one crew member, not the whole crew.

A minimal template that works for me

When I start a new runbook page, I use the same headings every time:

Symptoms and alerts
Impact (who cares, SLO breach or not)
First five minutes (commands/panels in order)
If that confirms X, then Y (decision branches)
Safe mitigations (with undo steps)
Escalate when (people + data to collect)
Stop / do not do
After resolution (post-incident note, runbook update ticket)

That structure fits on one printed page for common cases. Rare edge cases get a link and an owner to call.
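To keep the headings identical across pages, the skeleton can be generated rather than retyped. This is a minimal sketch: `new_runbook` is a hypothetical helper, and stamping the page with today's date and an owner handle is an assumption about how you track verification.

```shell
# new_runbook TITLE OWNER
# Prints an empty runbook page with the fixed headings and a verification stamp.
new_runbook() {
  local title="$1" owner="$2"
  cat <<EOF
$title
Last verified: $(date +%Y-%m-%d), $owner

Symptoms and alerts

Impact (who cares, SLO breach or not)

First five minutes (commands/panels in order)

If that confirms X, then Y (decision branches)

Safe mitigations (with undo steps)

Escalate when (people + data to collect)

Stop / do not do

After resolution (post-incident note, runbook update ticket)
EOF
}
```

For example, `new_runbook "CrashLoopBackOff: checkout" "@marc" > crashloop-checkout.txt` writes a blank page to a placeholder file name, ready to fill in after the incident review.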

Closing thought

Flight manuals did not make me a good pilot by themselves. Flying, briefing, debriefing, and getting corrected by instructors made me better. Runbooks are the same. They are worth writing because they compress calm thinking into a form you can use when you are not calm. They are not worth worshipping. Update them, drill them, and forgive yourself when you still have to improvise.

If you maintain one runbook this month, pick the failure mode that woke you up last quarter. One honest page beats a library of aspirational documentation. I am still working on my own library one page at a time.