Degraded mode and the minimum equipment list mindset in Kubernetes

Commercial aircraft fly with items inoperative more often than passengers realize. Not because maintenance is sloppy, but because aviation separates “broken” from “may dispatch with restrictions.” That separation lives in the Minimum Equipment List — the MEL — and in the captain’s judgment against weather, route, and crew experience.

[Image: warning lights on an aircraft instrument panel. Photo by Pixabay on Pexels.]

Kubernetes clusters rarely have a formal MEL document. They have READMEs, tribal knowledge, and a Grafana board someone built during an incident. When a node pool is degraded, when the backup region is stale, when the observability stack is half-down but apps still serve — you’re already in degraded mode. The question is whether you’re operating with MEL discipline or pretending you’re at full capability until something proves otherwise.

I’m cautious about military metaphors in ops posts. MEL thinking isn’t heroics. It’s paperwork and honesty: what are we flying without, under what limits, for how long, with what mitigations, and who signed off.

Full green vs dispatchable

In maintenance terms, an airplane can have a fault and still be legal to fly if the MEL allows deferral, often with placards, operational limits, or redundancy that is reduced but not gone. The airline’s ops spec and the release say what “go” means today.

In a cluster, we confuse technically running with fully capable:

One AZ down but traffic shifted — dispatchable with limits
Metrics pipeline lagging ten minutes — dispatchable if you don’t rely on it for auto-rollback
Single etcd member unhealthy in a three-member cluster — quorum holds (two of three) but one more failure loses it; dispatchable only with a short cap and a change freeze
HPA broken but flat replica count handles current load — dispatchable until marketing sends email
Backup restore untested for ninety days — dispatchable until it isn’t; different category of risk

The failure mode I see most: teams keep shipping features while carrying known deferrals they never wrote down. That’s flying with a taped-over annunciator and no entry in the logbook.

Building a cluster MEL (without pretending it’s FAA paperwork)

You don’t need a leather binder. You need a living list — wiki, markdown in the repo, whatever people open — structured like an MEL row:

Item	Degraded state	Allowed ops	Compensating controls	Max deferral	Owner
Ingress controller	One replica of two	No config changes	Manual failover runbook	24h	platform
Vault	Seal unreachable, cache warm	No secret rotation	Static creds frozen	4h	security
Observability	Metrics delayed	No automated rollbacks	Manual dashboard watch	72h	SRE

Not every row needs numbers on day one. Start with what we’d keep running during a partial outage versus what triggers a stop.
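If the list lives in a repo rather than a wiki, one way to keep rows honest is to give them a shape. What follows is a minimal sketch, not a prescribed schema: the `MelRow` class and its field names are my own assumptions that mirror the table columns, and the "Tracing" row is invented to show a day-one entry with gaps.

```python
from dataclasses import dataclass, fields
from datetime import timedelta
from typing import List, Optional


@dataclass
class MelRow:
    """One row of a cluster MEL. Field names are hypothetical and mirror the table."""
    item: str
    degraded_state: str
    allowed_ops: Optional[str] = None
    compensating_controls: Optional[str] = None
    max_deferral: Optional[timedelta] = None
    owner: Optional[str] = None

    def missing(self) -> List[str]:
        """Names of the columns nobody has filled in yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


rows = [
    MelRow("Ingress controller", "One replica of two", "No config changes",
           "Manual failover runbook", timedelta(hours=24), "platform"),
    # A day-one row: we know what is degraded, not yet the limits.
    MelRow("Tracing", "Collector down"),
]

for row in rows:
    print(row.item, "->", row.missing() or "complete")
```

Incomplete rows are allowed on purpose. The `missing()` list just makes the gaps visible at review time instead of letting them pass as decisions.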

Our first draft was ugly, a bullet list in a Google Doc, and it still prevented a bad Friday deploy. Someone asked “is tracing still deferred?” The answer was yes, so we held a change that required trace IDs in prod.

Kubernetes components through an MEL lens

Different layers degrade differently. Humble notes from things I’ve seen, not an exhaustive catalog.

Control plane and etcd

Losing quorum isn’t a “degraded” conversation; it’s divert immediately. Partial latency or one unhealthy member might be deferrable only with monitoring proving quorum stability and a maintenance window scheduled.
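The arithmetic behind that distinction is small enough to write down. etcd uses Raft, so a cluster needs a strict majority of its voting members to make progress. The helper names below are mine, and the sketch assumes "healthy" means "able to vote", which your monitoring has to establish separately.

```python
def quorum(members: int) -> int:
    """Smallest strict majority of a cluster with `members` voting members."""
    return members // 2 + 1


def failures_left(members: int, healthy: int) -> int:
    """Further failures tolerable before quorum is lost; negative means already lost."""
    return healthy - quorum(members)


for members, healthy in [(3, 3), (3, 2), (3, 1), (5, 4)]:
    print(f"{healthy}/{members} healthy: {failures_left(members, healthy)} failure(s) left")
```

A three-member cluster with one member down still has quorum, but its margin is zero. That is the number I would want written in the deferral.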

I’ve watched teams run for days on a flaky control plane because kubectl “still worked sometimes.” That’s not MEL thinking; that’s gambling. The MEL mindset says: document the deferral, cap duration, increase audit frequency, freeze risky changes.

Nodes and capacity

Losing one node in a pool with headroom — classic deferrable item if PodDisruptionBudgets and surge capacity absorb rescheduling. Losing half the pool while HPA is pegged — operational limit: no deploys, no voluntary disruptions, maybe traffic shed at the edge.
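A rough way to tell those two situations apart is to compare what the surviving nodes can hold against what is already requested. This is a back-of-envelope sketch under loud assumptions: it looks at CPU requests only, ignores bin-packing, memory, and PodDisruptionBudgets, and the 20% headroom threshold is a made-up number, not a recommendation.

```python
from typing import List


def classify_node_loss(surviving_allocatable_m: List[int],
                       requested_m: int,
                       min_headroom: float = 0.2) -> str:
    """Classify a pool by CPU headroom (millicores) on the nodes that are left."""
    capacity = sum(surviving_allocatable_m)
    if capacity <= requested_m:
        return "undispatchable"  # requests no longer fit at all
    headroom = (capacity - requested_m) / capacity
    return "deferrable" if headroom >= min_headroom else "operational-limit"


# Hypothetical pool: six nodes of 4000m each, 14000m requested in total.
print(classify_node_loss([4000] * 5, 14000))  # one node lost
print(classify_node_loss([4000] * 4, 14000))  # two nodes lost
print(classify_node_loss([4000] * 3, 14000))  # half the pool lost
```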

Cordoned nodes without drain plans are placards nobody read. Label them; announce in the ops channel; set a calendar reminder to fix or remove.

Networking and ingress

Single ingress controller replica in a small cluster — many teams run this until the first OOM during cert renewal. MEL entry: max deferral until next business day, compensating control is manual controller restart runbook tested this quarter.

Service mesh control plane degraded while data plane proxies still run is a weirdly common state. Dispatchable for read-heavy traffic; not dispatchable for mTLS policy changes or new routes until the control plane is healthy.

DNS inside the cluster: if CoreDNS is one replica and it’s CrashLooping, you’re not in degraded mode, you’re on fire. If external DNS is slow but kube-dns is fine, that’s a different row in the table.

Data plane: databases, queues, caches

This is where MEL discipline meets CAP theorem without the lecture. Primary up, async replica lagging — dispatchable with no schema migrations and elevated monitoring on replication lag. Cache cluster missing a node — dispatchable if hit rate and latency within SLO.

The compensating control is often human: someone watches the lag graph during deploys because automated checks aren’t trusted while deferred.

Observability stack

Metrics or logs partially down is the most dangerous deferral because it feels dispatchable while blinding you to the next failure. Our MEL rows here are strict: if automated rollback depends on a metric, that metric path must be green or rollback reverts to manual.
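That rule can be reduced to a freshness check on the metric the rollback decision reads. A minimal sketch, assuming you can get the timestamp of the most recent sample from your metrics backend; the function name and the two-minute staleness budget are illustrative, not something our pipeline actually uses.

```python
from datetime import datetime, timedelta, timezone


def rollback_mode(last_sample_at: datetime,
                  now: datetime,
                  max_staleness: timedelta = timedelta(minutes=2)) -> str:
    """Return 'automatic' only while the rollback metric is fresh, else 'manual'."""
    age = now - last_sample_at
    return "automatic" if age <= max_staleness else "manual"


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # fixed time for the example
print(rollback_mode(now - timedelta(seconds=30), now))  # fresh pipeline
print(rollback_mode(now - timedelta(minutes=10), now))  # pipeline lagging ten minutes
```

The useful property is that the downgrade to manual is decided by data age, not by someone remembering that the pipeline is lagging.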

Tracing deferred — many teams live here for weeks. Fine if nobody debugging prod incidents relies on traces alone. Less fine if your on-call runbook step three is “open Jaeger.”

Annunciators: making degradation visible

Airlines placard inoperative equipment in the cockpit — physical reminder that capability is reduced. Kubernetes equivalents:

Cluster annotations or ConfigMaps — a key such as degraded-mode: observability-lag that deploy pipelines read to block risky jobs
Status page internal section — “known deferrals” visible to engineers, not just customers
Admission policy — deny deploys to certain namespaces while platform MEL items open (heavy-handed; use sparingly)
Dashboard banner — simple text: “Backup region drill failed Sunday; failback untested”
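For the ConfigMap variant, the pipeline side can stay very small. The sketch below assumes the ConfigMap's data has already been fetched into a plain dict, that the `degraded-mode` key holds a comma-separated list of flags, and that the flag and action names in `BLOCKED_BY_FLAG` are agreed by the team. All of those are hypothetical conventions, not Kubernetes features.

```python
from typing import Dict, Set

# Hypothetical policy: which pipeline actions each degraded-mode flag blocks.
BLOCKED_BY_FLAG: Dict[str, Set[str]] = {
    "observability-lag": {"auto-rollback-deploy", "canary-analysis"},
    "backup-region-stale": {"schema-migration"},
}


def parse_flags(configmap_data: Dict[str, str]) -> Set[str]:
    """Split the comma-separated 'degraded-mode' value into a set of flags."""
    raw = configmap_data.get("degraded-mode", "")
    return {flag.strip() for flag in raw.split(",") if flag.strip()}


def blocked_actions(configmap_data: Dict[str, str]) -> Set[str]:
    """Union of the actions blocked by every flag that is currently set."""
    blocked: Set[str] = set()
    for flag in parse_flags(configmap_data):
        blocked |= BLOCKED_BY_FLAG.get(flag, set())
    return blocked


def may_run(action: str, configmap_data: Dict[str, str]) -> bool:
    return action not in blocked_actions(configmap_data)


data = {"degraded-mode": "observability-lag"}
print(may_run("canary-analysis", data), may_run("schema-migration", data))
```

Note one design choice: an unknown flag blocks nothing here. Failing closed on unknown flags would be stricter, at the cost of a typo freezing the pipeline.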

The point isn’t shame. It’s shared situational awareness. Crew resource management (CRM) falls apart when half the crew knows the autopilot is degraded and half assumes full automation.

Who dispatches: authority and timeouts

In aviation, maintenance releases the aircraft; the captain accepts it for the specific flight. In ops, someone must accept degraded state for a time-bound window.

Vague ownership produces eternal deferrals. “We’ll fix etcd when we get to it” means never.

We try — not always succeed — to attach:

Named owner — team or individual on-call rotation
Review date — calendar event, not “eventually”
Exit criteria — what “fixed” means technically
Escalation — if deferral exceeds max, who decides stop-the-line
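Those four attachments can be checked mechanically once they are written down. A sketch under assumptions: the `Deferral` record, its field names, and the example values are all invented for illustration, and "deadline" here means the open date plus the max deferral.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Deferral:
    item: str
    owner: str            # team or on-call rotation
    review_at: datetime   # the calendar event
    deadline: datetime    # opened time plus max deferral
    exit_criteria: str    # what "fixed" means technically
    escalate_to: str      # who decides stop-the-line


def next_step(d: Deferral, now: datetime) -> str:
    """What should happen to this deferral right now."""
    if now > d.deadline:
        return f"escalate to {d.escalate_to}: stop-the-line decision"
    if now >= d.review_at:
        return f"review due: {d.owner} closes or renews"
    return "carry"


utc = timezone.utc
etcd = Deferral(
    item="etcd member unhealthy",
    owner="platform on-call",
    review_at=datetime(2024, 3, 4, 10, 0, tzinfo=utc),
    deadline=datetime(2024, 3, 5, 10, 0, tzinfo=utc),
    exit_criteria="three healthy members for 24h",
    escalate_to="head of platform",
)
print(next_step(etcd, datetime(2024, 3, 4, 12, 0, tzinfo=utc)))
```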

For Kubernetes platform work, stop-the-line might mean deploy freeze, incident declared, or failover to secondary region even though the secondary is the less-favored option. MEL max deferral exists to force that conversation before luck runs out.

Compensating controls that aren’t fantasy

MEL items require compensating controls that are actually performed, not theoretically available.

Bad compensating control: “we can restore from backup” when restore hasn’t been tested since the backup tool changed.

Better: “restore tested Q3; last drill restored in 34 minutes, measured against RTO; runbook section 4.2; on-call trained.”

Bad: “we have two ingress replicas” when both run on the same node pool with no PDB.

Better: PDB verified; anti-affinity confirmed; drain simulation last month.
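"Anti-affinity confirmed" is checkable rather than a matter of belief. The sketch below assumes you have already built a mapping from replica name to failure domain (node, node pool, or zone), for example from the NODE column of `kubectl get pods -o wide`. The function and the pod names are illustrative.

```python
from typing import Dict, List


def colocated_replicas(pod_to_domain: Dict[str, str]) -> Dict[str, List[str]]:
    """Failure domains hosting more than one replica, with the pods on each."""
    by_domain: Dict[str, List[str]] = {}
    for pod, domain in pod_to_domain.items():
        by_domain.setdefault(domain, []).append(pod)
    return {d: sorted(pods) for d, pods in by_domain.items() if len(pods) > 1}


# Hypothetical placement: both ingress replicas ended up on the same node.
placement = {"ingress-a": "node-1", "ingress-b": "node-1"}
print(colocated_replicas(placement))
```

An empty result means the replicas are spread across the domains you mapped. It says nothing about domains you did not map, so pick the one that actually fails together.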

In Kubernetes, compensating controls I trust more often:

Manual runbooks exercised in staging (simulator tie-in)
Reduced change velocity — fewer concurrent deploys
Human verification gates — two-person rule for prod applies while automation is blind
Traffic limits — rate limit at edge while the backend is fragile
Feature flags off for non-essential paths

Each is operational cost. MEL discipline admits that cost instead of hiding it in technical debt.

When degraded becomes undispatchable

Some combinations of deferrals compound. One AZ down plus observability lag plus HPA stuck might be fine individually and catastrophic together — you won’t see the traffic spike that finishes the job.

We occasionally run a compound review before big events (launches, sales, holidays): open MEL rows plus “what if we lose one more thing?” Not formal FMEA; thirty minutes with coffee.
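The "one more thing" question can be run as a quick enumeration if the team has agreed which combinations are unacceptable. This is a sketch, not a risk model: the item names and the `BAD_COMBINATIONS` list are hypothetical, and the real value is in the conversation that produces that list.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

# Hypothetical combinations the team has agreed are unacceptable together.
BAD_COMBINATIONS: List[FrozenSet[str]] = [
    frozenset({"az-down", "observability-lag", "hpa-stuck"}),
    frozenset({"az-down", "ingress-single-replica"}),
]


def violated(state: Set[str]) -> List[FrozenSet[str]]:
    """Bad combinations fully contained in the given set of open items."""
    return [combo for combo in BAD_COMBINATIONS if combo <= state]


def one_more_failure(open_items: Set[str],
                     candidates: Iterable[str]) -> List[Tuple[str, List[FrozenSet[str]]]]:
    """For each extra failure, the bad combinations it would complete."""
    findings = []
    for extra in candidates:
        if extra in open_items:
            continue
        hits = violated(open_items | {extra})
        if hits:
            findings.append((extra, hits))
    return findings


open_items = {"az-down", "observability-lag"}
candidates = ["hpa-stuck", "ingress-single-replica", "tracing-down"]
for extra, hits in one_more_failure(open_items, candidates):
    print(extra, "would complete", [sorted(h) for h in hits])
```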

Red lines that usually flip us to undispatchable:

Customer SLO burn with no working rollback path
Security control deferred that affects secret or cert lifecycle during planned rotation
Data durability question — replication lag beyond RPO with write path still open
Control plane instability trending worse, not stable

Calling undispatchable isn’t failure. It’s the MEL saying “this flight doesn’t go until maintenance catches up.”

Relation to incident response and runbooks

Runbooks describe flows when something breaks unexpectedly. MEL describes known reduced capability carried intentionally.

They meet when a deferral turns into an incident — deferred ingress replica dies completely — or when incident response opens temporary MEL entries (“we disabled autoscaling manually during firefight; documenting now”).

After incidents, I try to ask: should this have been a deferral we carried openly instead of a surprise? Sometimes the answer is no — true unknown unknown. Often the answer is we knew and didn’t log it.

Practical first steps if you have nothing today

Don’t boil the ocean.

List three things currently degraded in prod/staging that everyone “knows about.”
For each, write allowed ops, one compensating control, and a review date.
Pick one automation hook, even a deploy script that greps for an env var, that blocks the riskiest action while the item is open.
Review in weekly ops standup until closed or renewed with eyes open.
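The automation hook in the third step really can be this small. A minimal sketch, assuming a variable named `MEL_BLOCK_DEPLOY` that someone sets (with a reason as its value) while an item is open; the variable name is invented here, and any non-empty value blocks.

```python
import os
from typing import Mapping, Optional


def deploy_allowed(env: Optional[Mapping[str, str]] = None) -> bool:
    """False while MEL_BLOCK_DEPLOY (hypothetical name) holds a non-empty value."""
    env = os.environ if env is None else env
    return not env.get("MEL_BLOCK_DEPLOY", "").strip()


def guard(env: Optional[Mapping[str, str]] = None) -> None:
    """Call at the top of a deploy script; exits non-zero when blocked."""
    env = os.environ if env is None else env
    if not deploy_allowed(env):
        raise SystemExit(f"deploy blocked by open MEL item: {env['MEL_BLOCK_DEPLOY']}")
```

Putting the reason in the value means the person who hits the block learns which deferral is open without having to go looking.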

Your MEL will be wrong at first. Aviation MELs amend with bulletins. Yours should change when architecture changes.

What I get wrong

I defer things mentally without writing them. I assume everyone in platform knows the backup region is read-only and for practice only. I let “temporary” mitigations become permanent because the ticket lost priority. I ship anyway when tired and tell myself we’ll fix Monday.

MEL thinking doesn’t make me more conservative about innovation. It makes me more honest about what we’re trading. Kubernetes wants to abstract failure; operators still have to name it.

You don’t need every annunciator dark green to operate. You need to know which ones are taped over, what that limits, and when the flight must not depart. That’s the whole mindset. The rest is maintenance log entries and showing up for the review date.