Commercial aircraft fly with items inoperative more often than passengers realize. Not because maintenance is sloppy, but because aviation separates “broken” from “may dispatch with restrictions.” That separation lives in the Minimum Equipment List — the MEL — and in the captain’s judgment against weather, route, and crew experience.

Kubernetes clusters rarely have a formal MEL document. They have READMEs, tribal knowledge, and a Grafana board someone built during an incident. When a node pool is degraded, when the backup region is stale, when the observability stack is half-down but apps still serve — you’re already in degraded mode. The question is whether you’re operating with MEL discipline or pretending you’re at full capability until something proves otherwise.
I’m cautious about military metaphors in ops posts. MEL thinking isn’t heroics. It’s paperwork and honesty: what are we flying without, under what limits, for how long, with what mitigations, and who signed off.
Full green vs dispatchable
In maintenance terms, an airplane can have a fault and still be legal to fly if the MEL allows deferral, often with placards, operational limits, or redundancy that is reduced but not gone. The airline’s ops spec and the release say what “go” means today.
In a cluster, we confuse technically running with fully capable:
- One AZ down but traffic shifted — dispatchable with limits
- Metrics pipeline lagging ten minutes — dispatchable if you don’t rely on it for auto-rollback
- Single etcd member unhealthy in a three-member cluster — not dispatchable; land now
- HPA broken but flat replica count handles current load — dispatchable until marketing sends email
- Backup restore untested for ninety days — dispatchable until it isn’t; different category of risk
The failure mode I see most: teams keep shipping features while carrying known deferrals they never wrote down. That’s flying with a taped-over annunciator and no entry in the logbook.
Building a cluster MEL (without pretending it’s FAA paperwork)
You don’t need a leather binder. You need a living list — wiki, markdown in the repo, whatever people open — structured like an MEL row:
| Item | Degraded state | Allowed ops | Compensating controls | Max deferral | Owner |
|---|---|---|---|---|---|
| Ingress controller | One replica of two | No config changes | Manual failover runbook | 24h | platform |
| Vault | Seal unreachable, cache warm | No secret rotation | Static creds frozen | 4h | security |
| Observability | Metrics delayed | No automated rollbacks | Manual dashboard watch | 72h | SRE |
Not every row needs numbers on day one. Start with what we’d keep running during a partial outage versus what triggers a stop.
Our first draft was ugly (a bullet list in a Google Doc) and it still prevented a bad Friday deploy: someone asked “is tracing still deferred?”, the answer was yes, and so we held back a change that required trace IDs in prod.
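If the list lives in the repo, the same rows can be kept as data so the “is X still deferred?” question gets a mechanical answer. A minimal sketch, assuming a hypothetical `MelRow` type and made-up operation names (`automated-rollback`, `ingress-config-change`); only the row contents come from the table above:

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class MelRow:
    """One open deferral, mirroring the columns of the table above."""
    item: str
    degraded_state: str
    blocked_ops: FrozenSet[str]  # what "Allowed ops" rules out while the row is open
    compensating_control: str
    max_deferral_hours: int
    owner: str


def blockers(open_rows: List[MelRow], operation: str) -> List[str]:
    """Return the items whose open deferral forbids this operation."""
    return [row.item for row in open_rows if operation in row.blocked_ops]


# Rows copied from the table; the operation names are invented for this sketch.
OPEN_ROWS = [
    MelRow("Ingress controller", "One replica of two",
           frozenset({"ingress-config-change"}),
           "Manual failover runbook", 24, "platform"),
    MelRow("Observability", "Metrics delayed",
           frozenset({"automated-rollback"}),
           "Manual dashboard watch", 72, "SRE"),
]

print(blockers(OPEN_ROWS, "automated-rollback"))  # ['Observability']
print(blockers(OPEN_ROWS, "app-deploy"))          # []
```

An empty result means nothing on the list objects; it does not mean the change is safe.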
Kubernetes components through an MEL lens
Different layers degrade differently. Humble notes from things I’ve seen, not an exhaustive catalog.
Control plane and etcd
Losing quorum isn’t a “degraded” conversation; it’s divert immediately. Partial latency or one unhealthy member might be deferrable only with monitoring proving quorum stability and a maintenance window scheduled.
I’ve watched teams run for days on a flaky control plane because kubectl “still worked sometimes.” That’s not MEL thinking; that’s gambling. The MEL mindset says: document the deferral, cap duration, increase audit frequency, freeze risky changes.
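The quorum arithmetic behind that line is short enough to write down. A sketch, assuming only that etcd needs a majority of members; the status labels (`green`, `deferrable-with-limits`, `divert`) are mine, and whether you carry a zero-spare state at all is a policy call, not something this function decides:

```python
from typing import Tuple


def quorum_size(members: int) -> int:
    """Majority needed for an etcd cluster of this size."""
    return members // 2 + 1


def classify_etcd(members: int, healthy: int) -> Tuple[str, int]:
    """Return (status, further member failures survivable before quorum is lost)."""
    if members < 1 or not 0 <= healthy <= members:
        raise ValueError("healthy must be between 0 and members")
    spare = healthy - quorum_size(members)
    if spare < 0:
        return ("divert", 0)  # quorum already lost
    if healthy == members:
        return ("green", spare)
    return ("deferrable-with-limits", spare)


print(classify_etcd(3, 3))  # ('green', 1)
print(classify_etcd(3, 2))  # ('deferrable-with-limits', 0)
print(classify_etcd(3, 1))  # ('divert', 0)
```

The second number is the one to put on the placard: with three members and one unhealthy, the next failure is the divert.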
Nodes and capacity
Losing one node in a pool with headroom — classic deferrable item if PodDisruptionBudgets and surge capacity absorb rescheduling. Losing half the pool while HPA is pegged — operational limit: no deploys, no voluntary disruptions, maybe traffic shed at the edge.
Cordoned nodes without drain plans are placards nobody read. Label them; announce in the ops channel; set a calendar reminder to fix or remove them.
Networking and ingress
Single ingress controller replica in a small cluster — many teams run this until the first OOM during cert renewal. MEL entry: max deferral until next business day, compensating control is manual controller restart runbook tested this quarter.
Service mesh control plane degraded while data plane proxies still run is a weirdly common state. Dispatchable for read-heavy traffic; not dispatchable for mTLS policy changes or new routes until the control plane is healthy.
DNS inside the cluster: if CoreDNS is one replica and it’s CrashLooping, you’re not in degraded mode, you’re on fire. If external DNS is slow but kube-dns is fine, that’s a different row in the table.
Data plane: databases, queues, caches
This is where MEL discipline meets CAP theorem without the lecture. Primary up, async replica lagging: dispatchable with no schema migrations and elevated monitoring on replication lag. Cache cluster missing a node: dispatchable if hit rate and latency are within SLO.
The compensating control is often human: someone watches the lag graph during deploys because automated checks aren’t trusted while deferred.
Observability stack
Metrics or logs partially down is the most dangerous deferral because it feels dispatchable while blinding you to the next failure. Our MEL rows here are strict: if automated rollback depends on a metric, that metric path must be green or rollback reverts to manual.
Tracing deferred — many teams live here for weeks. Fine if nobody debugging prod incidents relies on traces alone. Less fine if your on-call runbook step three is “open Jaeger.”
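The strict rule for metrics (green path or manual rollback) is easy to encode. A rough sketch; the metric names and the `rollback_mode` helper are hypothetical, and a real pipeline would get the deferred set from wherever the MEL lives:

```python
from typing import Iterable, List, Tuple


def rollback_mode(rollback_metrics: Iterable[str],
                  deferred_metric_paths: Iterable[str]) -> Tuple[str, List[str]]:
    """Automated rollback is allowed only if every metric it reads is green."""
    blind = sorted(set(rollback_metrics) & set(deferred_metric_paths))
    if blind:
        return ("manual", blind)
    return ("automatic", [])


# Hypothetical metric names for illustration.
needed = ["http_error_rate", "p99_latency"]
print(rollback_mode(needed, {"p99_latency"}))  # ('manual', ['p99_latency'])
print(rollback_mode(needed, set()))            # ('automatic', [])
```

Returning the blind metrics alongside the mode gives whoever is watching the dashboard by hand a list of what to watch.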
Annunciators: making degradation visible
Airlines placard inoperative equipment in the cockpit — physical reminder that capability is reduced. Kubernetes equivalents:
- Cluster annotations or ConfigMaps: `degraded-mode: observability-lag` consumed by deploy pipelines to block risky jobs
- Status page internal section: “known deferrals” visible to engineers, not just customers
- Admission policy: deny deploys to certain namespaces while platform MEL items are open (heavy-handed; use sparingly)
- Dashboard banner: simple text such as “Backup region drill failed Sunday; failback untested”
The point isn’t shame. It’s shared situational awareness. Crew resource management (CRM) falls apart when half the crew knows the autopilot is degraded and half assumes full automation.
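Here is roughly what a pipeline consuming a `degraded-mode` value could look like. This is a sketch under assumptions: the function takes the ConfigMap’s `data` as a plain dict (fetching it from the cluster is left out), the comma-separated format for multiple modes is my invention, and the `BLOCKED_JOBS` mapping and job names are hypothetical:

```python
from typing import Dict, List

# Hypothetical policy: which job kinds each degraded mode rules out.
BLOCKED_JOBS = {
    "observability-lag": {"canary-auto-rollback", "schema-migration"},
}


def blocked_reasons(configmap_data: Dict[str, str], job_kind: str) -> List[str]:
    """Return the open degraded modes that forbid this kind of job."""
    raw = configmap_data.get("degraded-mode", "")
    modes = [m.strip() for m in raw.split(",") if m.strip()]
    return [m for m in modes if job_kind in BLOCKED_JOBS.get(m, set())]


data = {"degraded-mode": "observability-lag"}
print(blocked_reasons(data, "schema-migration"))  # ['observability-lag']
print(blocked_reasons(data, "docs-deploy"))       # []
print(blocked_reasons({}, "schema-migration"))    # []
```

A mode with no entry in the policy blocks nothing, which is a deliberate weakness of the sketch: an unrecognized placard should probably fail closed in a real gate.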
Who dispatches: authority and timeouts
In aviation, maintenance releases the aircraft; the captain accepts it for the specific flight. In ops, someone must accept degraded state for a time-bound window.
Vague ownership produces eternal deferrals. “We’ll fix etcd when we get to it” means never.
We try — not always succeed — to attach:
- Named owner — team or individual on-call rotation
- Review date — calendar event, not “eventually”
- Exit criteria — what “fixed” means technically
- Escalation — if deferral exceeds max, who decides stop-the-line
For Kubernetes platform work, stop-the-line might mean deploy freeze, incident declared, or failover to secondary region even though secondary is less favored. MEL max deferral exists to force that conversation before luck runs out.
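The four attachments (owner, review date, exit criteria, escalation) fit in one small record, and the timeout logic is a couple of comparisons. A sketch with hypothetical field names, example dates, and a made-up etcd deferral:

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Deferral:
    item: str
    owner: str            # team or on-call rotation
    opened: date
    review_date: date     # a real calendar date, not "eventually"
    max_days: int         # max deferral before stop-the-line is discussed
    exit_criteria: str    # what "fixed" means technically
    escalation: str       # who decides if the deferral runs past max_days


def next_action(d: Deferral, today: date) -> str:
    """Return 'escalate', 'review', or 'carry' for this deferral today."""
    if (today - d.opened).days > d.max_days:
        return "escalate"
    if today >= d.review_date:
        return "review"
    return "carry"


# Example values are illustrative only.
etcd = Deferral("etcd member flapping", "platform on-call", date(2024, 3, 1),
                date(2024, 3, 4), 7, "3/3 members healthy for 48h",
                "platform lead")
print(next_action(etcd, date(2024, 3, 2)))  # carry
print(next_action(etcd, date(2024, 3, 4)))  # review
print(next_action(etcd, date(2024, 3, 9)))  # escalate
```

Escalation is checked first on purpose: an overdue deferral should not be able to hide behind a review that keeps getting renewed.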
Compensating controls that aren’t fantasy
MEL items require compensating controls that are actually performed, not theoretically available.
Bad compensating control: “we can restore from backup” when restore hasn’t been tested since the backup tool changed.
Better: “restore tested Q3; last drill 34 minutes to RPO; runbook section 4.2; on-call trained.”
Bad: “we have two ingress replicas” when both run on the same node pool with no PDB.
Better: PDB verified; anti-affinity confirmed; drain simulation last month.
In Kubernetes, compensating controls I trust more often:
- Manual runbooks exercised in staging, the ops equivalent of simulator time
- Reduced change velocity: fewer concurrent deploys
- Human verification gates: two-person rule for prod applies while automation is blind
- Traffic limits: rate limit at the edge while the backend is fragile
- Feature flags off for non-essential paths
Each is operational cost. MEL discipline admits that cost instead of hiding it in technical debt.
When degraded becomes undispatchable
Some combinations of deferrals compound. One AZ down plus observability lag plus HPA stuck might be fine individually and catastrophic together — you won’t see the traffic spike that finishes the job.
We occasionally run a compound review before big events (launches, sales, holidays): open MEL rows plus “what if we lose one more thing?” Not formal FMEA; thirty minutes with coffee.
Red lines that usually flip us to undispatchable:
- Customer SLO burn with no working rollback path
- Security control deferred that affects secret or cert lifecycle during planned rotation
- Data durability question — replication lag beyond RPO with write path still open
- Control plane instability trending worse, not stable
Calling undispatchable isn’t failure. It’s the MEL saying “this flight doesn’t go until maintenance catches up.”
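The “one more thing” question can be asked mechanically as a warm-up for the coffee meeting. A toy sketch: capabilities are just labels, and the `RED_LINES` combinations are hypothetical stand-ins for whatever your own red lines are:

```python
from typing import FrozenSet, List, Set

# Hypothetical red lines: losing every capability in a set is undispatchable.
RED_LINES: List[FrozenSet[str]] = [
    frozenset({"az-redundancy", "autoscaling", "metrics"}),
    frozenset({"rollback", "metrics"}),
]


def undispatchable(lost: Set[str]) -> bool:
    """True if the currently lost capabilities cover any red line."""
    return any(line <= lost for line in RED_LINES)


def one_more_failure(lost: Set[str], all_capabilities: Set[str]) -> List[str]:
    """Capabilities whose loss, on top of today's deferrals, crosses a red line."""
    return sorted(c for c in all_capabilities - lost
                  if undispatchable(lost | {c}))


ALL = {"az-redundancy", "autoscaling", "metrics", "rollback"}
today = {"az-redundancy", "metrics"}   # one AZ down plus observability lag
print(undispatchable(today))           # False
print(one_more_failure(today, ALL))    # ['autoscaling', 'rollback']
```

In that example, each deferral is fine alone and the pair is still dispatchable, but a stuck HPA or a broken rollback path would finish the job.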
Relation to incident response and runbooks
Runbooks describe flows when something breaks unexpectedly. MEL describes known reduced capability carried intentionally.
They meet when a deferral turns into an incident — deferred ingress replica dies completely — or when incident response opens temporary MEL entries (“we disabled autoscaling manually during firefight; documenting now”).
After incidents, I try to ask: should this have been a deferral we carried openly instead of a surprise? Sometimes the answer is no: a true unknown unknown. Often the answer is that we knew and didn’t log it.
Practical first steps if you have nothing today
Don’t boil the ocean.
- List three things currently degraded in prod/staging that everyone “knows about.”
- For each, write allowed ops, one compensating control, and a review date.
- Pick one automation hook (even a deploy script that greps for an env var) that blocks the riskiest action while the item is open.
- Review in weekly ops standup until closed or renewed with eyes open.
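The crudest version of that hook really is a few lines. A sketch, assuming a hypothetical `MEL_BLOCKED_ACTIONS` environment variable holding a comma-separated list of action names; both the variable and the action names are made up:

```python
import os
from typing import Mapping, Optional


def gate(action: str, environ: Optional[Mapping[str, str]] = None) -> int:
    """Return a shell-style exit code: 1 if an open MEL item blocks the action."""
    env = os.environ if environ is None else environ
    blocked = {a.strip()
               for a in env.get("MEL_BLOCKED_ACTIONS", "").split(",")
               if a.strip()}
    if action in blocked:
        print(f"BLOCKED: '{action}' is restricted by an open MEL item")
        return 1
    return 0


env = {"MEL_BLOCKED_ACTIONS": "schema-migration, node-drain"}
print(gate("schema-migration", env))  # prints the BLOCKED line, then 1
print(gate("app-deploy", env))        # 0
```

A deploy script would call `sys.exit(gate(...))` before doing anything risky. It is easy to bypass, which is fine for a first step: the goal is that bypassing becomes a visible decision.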
Your MEL will be wrong at first. Aviation MELs amend with bulletins. Yours should change when architecture changes.
What I get wrong
I defer things mentally without writing them down. I assume everyone in platform knows the backup region is read-only and for practice only. I let “temporary” mitigations become permanent because the ticket lost priority. I ship anyway when tired and tell myself we’ll fix it Monday.
MEL thinking doesn’t make me more conservative about innovation. It makes me more honest about what we’re trading. Kubernetes wants to abstract failure; operators still have to name it.
You don’t need every annunciator dark green to operate. You need to know which ones are taped over, what that limits, and when the flight must not depart. That’s the whole mindset. The rest is maintenance log entries and showing up for the review date.