This is the checklist I pin mentally when something breaks. It is not a guarantee. It is a sequence that stops me from making the outage worse while I figure out what is actually wrong.

Your environment will need different commands and different dashboards. Treat this as a starting point you rewrite for your clusters, your namespaces, and your on-call rotation.

Before anything breaks: make rollback possible

Incidents are easier when you prepared on a calm Tuesday.

  • Know how to undo the last deploy (kubectl rollout undo, Argo CD revert, Helm rollback).
  • Keep previous image tags and chart versions reachable.
  • Have one dashboard per critical service that on-call actually opens.
  • Write three runbooks for your top three failure modes — not fifty.

If you cannot roll back, your incident options shrink to “fix forward under pressure.” That can work. It is not where I want to be at 2 a.m.
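
One way to rehearse the undo path on that calm Tuesday is to wrap it in a small script and run it against a non-critical deployment. This is a minimal sketch, not a definitive procedure: the function names `show_history` and `rollback_deployment` are hypothetical, and it assumes a plain Deployment managed with kubectl rather than a GitOps controller that would revert your change.

```shell
#!/bin/sh
# Sketch: rehearse a rollback before you need it. Function names are hypothetical.

# Print the revision history so you know what "previous" actually means.
show_history() {
  local ns="$1" name="$2"
  kubectl rollout history "deployment/${name}" -n "${ns}"
}

# Roll back to the previous revision, or to a specific one if a third argument is given.
rollback_deployment() {
  local ns="$1" name="$2" revision="${3:-}"
  if [ -n "${revision}" ]; then
    kubectl rollout undo "deployment/${name}" -n "${ns}" --to-revision="${revision}" || return 1
  else
    kubectl rollout undo "deployment/${name}" -n "${ns}" || return 1
  fi
  # Wait for the rollback to settle, but do not hang forever.
  kubectl rollout status "deployment/${name}" -n "${ns}" --timeout=120s
}

# Usage (placeholders, adapt to your cluster):
#   show_history <namespace> <name>
#   rollback_deployment <namespace> <name> 4
#
# Helm releases have their own equivalents:
#   helm history <release> -n <namespace>
#   helm rollback <release> <revision> -n <namespace>
```

If the rehearsal fails on a quiet day, you have found the gap cheaply.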

First few minutes — stop the bleeding in your head

  1. Confirm the symptom — who is affected, since when, is it total or partial?
  2. Pause risky changes — freeze deploys, pause GitOps auto-sync, stop unrelated pipeline runs if you can.
  3. Assign roles — incident lead, scribe, comms. One person can wear two hats in a small team, but name it out loud.
  4. Last change — what shipped in the last hour? The last day? Correlation is not causation, but it is a good first guess.
  5. Control plane pulse — is the API responsive? Are many nodes NotReady? Is etcd healthy (if you operate it)?

Write the start time in the incident channel. Future you will want it.
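
The start time and the control plane pulse from the list above can be sketched as a few helpers. The function names are hypothetical, and the GitOps commands in the comments only apply if you run Argo CD or Flux; treat them as assumptions about your tooling.

```shell
#!/bin/sh
# Sketch: first-minutes helpers. Function names are hypothetical.

# Print the incident start in UTC so the timeline has a fixed origin.
incident_start() {
  date -u +"%Y-%m-%dT%H:%M:%SZ"
}

# Ask the API server whether it considers itself ready; give up after 5 seconds.
api_ready() {
  kubectl --request-timeout=5s get --raw='/readyz'
}

# Count nodes whose STATUS column starts with NotReady.
# Reads the output of `kubectl get nodes --no-headers` on stdin.
count_not_ready() {
  awk '$2 ~ /^NotReady/ { n++ } END { print n + 0 }'
}

# Usage:
#   incident_start
#   api_ready
#   kubectl get nodes --no-headers | count_not_ready
#
# Pausing auto-sync, if these tools manage the app (assumption):
#   argocd app set <app> --sync-policy none
#   flux suspend kustomization <name>
```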

Stabilise — buy time without random deletes

Goal: stop making it worse, restore service if you can do so safely, preserve evidence.

# Cluster / node health
kubectl get nodes -o wide
# componentstatuses has been deprecated since v1.19; ask the API server directly
kubectl get --raw='/readyz?verbose'

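# Pods not in Running phase (!= is valid for field selectors; quote for the shell)
kubectl get pods -A --field-selector='status.phase!=Running'
# Note: includes Pending, Failed, Succeeded, Unknown. If completed Jobs are noisy, exclude them:
kubectl get pods -A --field-selector='status.phase!=Running,status.phase!=Succeeded'
# Caveat: a CrashLoopBackOff pod usually still reports phase Running, so it will not show up here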

# Recent events — often the fastest clue
kubectl get events -A --sort-by='.lastTimestamp' | tail -30

# If one namespace is in scope
kubectl get deploy,sts,ds -n <namespace>
kubectl rollout history deployment/<name> -n <namespace>

Then pick one mitigation path and say it in the channel before you run it:

  • Load — scale horizontally if the app is healthy but saturated; verify HPA limits and node capacity first.
  • Bad revision — shift traffic away or kubectl rollout undo if the timeline matches.
  • Dependency down — circuit break, feature flag, maintenance page — whatever your app supports.
  • Unknown — narrow blast radius (scale to zero on a non-critical service only if stakeholders agree).

Write down what you rolled back — deployment revision, image tag, Git commit, Helm release version. Incidents get fuzzy in memory after thirty minutes.
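
A minimal sketch of capturing that record before you touch anything, assuming a Deployment and kubectl access. The function name `snapshot_deploy` is hypothetical; the revision comes from the `deployment.kubernetes.io/revision` annotation that the Deployment controller maintains.

```shell
#!/bin/sh
# Sketch: print one timestamped line describing a deployment's current state.
# The function name is hypothetical.
snapshot_deploy() {
  local ns="$1" name="$2"
  local rev images
  # Dots inside the annotation key must be escaped in jsonpath.
  rev="$(kubectl get deployment "${name}" -n "${ns}" \
    -o jsonpath='{.metadata.annotations.deployment\.kubernetes\.io/revision}')"
  # All container images in the pod template, space separated.
  images="$(kubectl get deployment "${name}" -n "${ns}" \
    -o jsonpath='{.spec.template.spec.containers[*].image}')"
  printf '%s deployment=%s/%s revision=%s images=%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "${ns}" "${name}" "${rev}" "${images}"
}

# Usage: run it once before and once after the rollback, and paste both lines
# into the incident channel.
#   snapshot_deploy <namespace> <name>
```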

What I try not to do in stabilise

  • Delete namespaces to “fix” things.
  • Restart etcd or control plane components because someone on Slack said to.
  • Apply five unrelated YAML files because one might help.

Diagnose slowly — one hypothesis at a time

Stabilise does not mean you understand the root cause. After mitigations, slow down.

| If you see… | I usually look at… | Commands / notes |
|---|---|---|
| CrashLoopBackOff | Exit reason, probes, limits, image pull, config | kubectl logs --previous, describe pod, recent deploy |
| Pending | Scheduling, quotas, affinity, PVC binding, capacity | describe pod Events section; get pvc; node resources |
| Running but errors | App logs, ingress/route, dependency timeouts | Compare error rate to deploy time; trace one failing request |
| Latency spike | Recent change, DB, DNS, CNI, saturation | Node and Pod CPU/memory; dependency dashboards |
| Intermittent | Single bad pod, node, zone, or client path | Spread of replicas across nodes; correlate with one AZ |

# CrashLoop — why did the last container exit?
kubectl logs <pod> -n <ns> --previous
kubectl describe pod <pod> -n <ns> | sed -n '/Events:/,$p'

# Pending — why not scheduled?
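kubectl describe pod <pod> -n <ns> | sed -n '/Events:/,$p'
# Or read the PodScheduled condition message directly
kubectl get pod <pod> -n <ns> -o jsonpath='{.status.conditions[?(@.type=="PodScheduled")].message}'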

# Service / ingress path (vanilla k8s)
kubectl get svc,ingress -n <ns>
kubectl get endpoints -n <ns>
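
For the intermittent case, checking how replicas are spread across nodes can be sketched as below. The function name `count_by_node` is hypothetical, and the `app=<label>` selector is a placeholder for whatever labels your workloads carry.

```shell
#!/bin/sh
# Sketch: count pods per node. Reads one node name per line on stdin and
# prints "<count> <node>", busiest node first. The function name is hypothetical.
count_by_node() {
  sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Usage: custom-columns avoids parsing the wide output, whose RESTARTS
# column can contain spaces.
#   kubectl get pods -n <ns> -l app=<label> \
#     -o custom-columns=NODE:.spec.nodeName --no-headers | count_by_node
```

If most failing pods sit on one node, that node is the next hypothesis to test.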

Rule I borrowed from aviation: change one variable, observe, then decide. If you restart, scale, and deploy in the same five minutes, you will not know which action helped.
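
One way to enforce the one-variable rule is to log each action with a reason before running it. This is a sketch under assumptions: `INCIDENT_LOG` and `run_logged` are hypothetical names, and the log is a local file rather than your real incident channel.

```shell
#!/bin/sh
# Sketch: record why and what before running a command.
# INCIDENT_LOG and run_logged are hypothetical names.
INCIDENT_LOG="${INCIDENT_LOG:-./incident.log}"

run_logged() {
  local reason="$1"
  shift
  # Append a timestamped line first, so the record exists even if the command hangs.
  printf '%s reason="%s" cmd="%s"\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "${reason}" "$*" >> "${INCIDENT_LOG}"
  # Run the command exactly as given.
  "$@"
}

# Usage: one action, then observe before the next one.
#   run_logged "saturated, HPA at max" kubectl scale deployment/<name> -n <ns> --replicas=6
#   kubectl rollout status deployment/<name> -n <ns> --timeout=120s
```

The resulting file doubles as raw material for the debrief timeline.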

Observability — trust but verify the dashboard

Open the dashboard the team agreed on. Not twelve tabs.

  • Error rate and latency vs a baseline you believe in.
  • Saturation (CPU, memory, connections, thread pools).
  • Recent deploy marker on the same time axis.
  • For canary or progressive delivery: split metrics by version label if you have them.

If metrics look fine but users are angry, believe the users and dig into sampling, auth paths, or a single region.

OpenShift add-ons I forget on vanilla habits

When the symptom is vague on OpenShift:

  • oc get clusteroperators — platform operators degraded?
  • Routes vs Ingress — TLS edge, wrong target port?
  • Image pull secrets and internal registry reachability.
  • SCC denials — sometimes show up late in Events.

Same mindset as Kubernetes. Different objects.
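
A minimal sketch for the first bullet: list only the cluster operators that look unhealthy. The function name `unhealthy_operators` is hypothetical, and the filter assumes the default column order of `oc get clusteroperators` (NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED, SINCE, MESSAGE).

```shell
#!/bin/sh
# Sketch: print operators that are not Available or are Degraded.
# Reads `oc get clusteroperators --no-headers` on stdin.
# Assumes the default column order; the function name is hypothetical.
unhealthy_operators() {
  awk '$3 != "True" || $5 == "True" { print $1 }'
}

# Usage:
#   oc get clusteroperators --no-headers | unhealthy_operators
#   oc describe clusteroperator <name>    # then read the conditions for one of them
```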

Communication — boring updates beat silence

Every twenty to thirty minutes until resolved:

  • Impact (who, what workaround exists).
  • Current theory (one sentence).
  • Next step and next update time.

“We are still investigating” is acceptable once. Repeated without new facts, it erodes trust.
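
The three-part update above can be kept consistent with a tiny template. This is only a sketch; the function name `status_update` and the field labels are hypothetical, and the next update time is passed in by hand so the script makes no promise you did not choose.

```shell
#!/bin/sh
# Sketch: format a status update with the same fields every time.
# The function name and labels are hypothetical.
status_update() {
  local impact="$1" theory="$2" next_step="$3" next_update="$4"
  printf 'Impact: %s\nCurrent theory: %s\nNext step: %s\nNext update: %s\n' \
    "${impact}" "${theory}" "${next_step}" "${next_update}"
}

# Usage:
#   status_update "Checkout fails for about 30% of users; retry works" \
#     "Bad revision of the payments deployment" \
#     "Rolling back to the previous revision" \
#     "14:30 UTC"
```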

After it’s green — close the loop

Short debrief within a few days while memory is fresh:

  • Timeline with commands and decisions (from the scribe notes).
  • Root cause or “unknown with mitigations.”
  • One follow-up with an owner and a date — runbook update, alert fix, deploy guardrail.

Update the runbook while the details are still vivid. A runbook that only exists in someone’s head is a single point of failure.

What this checklist does not solve

It will not fix a broken architecture, missing capacity planning, or alerts that fire for everything. It will not replace a team that has not practiced rollback.

It gives me a place to start when adrenaline is high — the same reason pilots use checklists when the weather is fine and when it is not.

If you adapt this for your org, keep the sections that match how you actually work and delete the rest. A shorter checklist you use beats a long one you skip.