Cross-checking config before you apply

In airline operations, nobody flies a critical approach alone because they feel confident. You brief, you configure, and then someone else reads back the important numbers. Not because pilots are incompetent. Because tired, competent people misread things. Because context switches hide errors. Because “I checked it twice” usually means you checked the same wrong assumption twice.

Kubernetes changes do not usually kill anyone. I say that plainly because the aviation comparison gets overplayed. But they can delete data, expose services, drain capacity, or turn a Friday deploy into a weekend on a bridge call. The habit I carried from the cockpit — cross-check before commit — maps better to kubectl apply than I expected when I started this career transition.

This post is about a lightweight review culture for config changes: what to check, who should check it, and how to do it without building a bureaucracy that everyone ignores.

What “cross-check” means here

I am not talking about a formal change advisory board for every typo in a ConfigMap. I am talking about a deliberate second set of eyes on changes where the blast radius is real.

Cross-check means:

Someone other than the author verifies the change against intent. Not “looks fine” — “this matches what we agreed in the ticket.”

Critical fields are read aloud or highlighted. Namespace, cluster context, image tag, replica count, ingress host, resource limits, secrets references, Helm values that override defaults.

The rollback path is stated once before apply. If we cannot undo it in one sentence, we pause.

Environment is confirmed. kubectl config current-context is not ceremonial. I have applied to the wrong cluster. You probably have too.
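
One way to make the context check less ceremonial is to have a small script refuse to continue unless the active context matches the one named in the ticket. What follows is a minimal sketch in Python, assuming kubectl is on the PATH. The function names and the prod-eu context are illustrative, not part of any tool.

```python
import subprocess


def current_context() -> str:
    """Ask kubectl which context is active right now."""
    result = subprocess.run(
        ["kubectl", "config", "current-context"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def require_context(expected: str, actual: str) -> None:
    """Raise unless the active context is exactly the agreed one.

    Exact comparison on purpose: 'prod-eu' must not pass for 'prod-eu-2'.
    """
    if expected != actual:
        raise RuntimeError(
            f"context is {actual!r}, ticket says {expected!r}: stop and re-brief"
        )
```

A typical call would be require_context("prod-eu", current_context()) at the top of whatever script does the apply. It does not replace the spoken read-back; it only catches the case where both people are looking at the wrong terminal.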

In aviation we called parts of this a challenge-and-response flow. In ops teams I have seen it called two-person rule, four-eyes, or simply “get a reviewer.” The label matters less than the behavior: one person executes, one person verifies, roles are explicit.

Why solo apply feels faster and isn’t

When you are the person who wrote the YAML, your brain fills in gaps. You know you meant staging even though the file says production. You know the image tag v2.1.4 is the hotfix, not the broken build with the same prefix. You know the --force flag is there for a reason you explained in a standup three days ago.

The reviewer does not carry that baggage. They read what is on the page.

Solo apply also hides time pressure. “We need this out before the demo” is how namespaces get overwritten. A second person does not eliminate pressure, but they give you one beat to say “wait, that Service is ClusterIP, we needed LoadBalancer.”

I still solo-apply small changes in dev namespaces. I am not preaching purity. I am saying: match the rigor to the risk, and be honest about risk.

A preflight checklist you can actually use

Checklists fail when they are forty items long and nobody reads them. This is the short list I use before production-impacting applies. Adapt it; the point is structure, not my exact bullets.

Intent

What problem does this change solve?
What does success look like in metrics or behavior?
Is this the smallest change that could work?

Scope

Which cluster, namespace, and workloads are touched?
Does anything outside this diff depend on the new value?
Are there concurrent changes from other people or pipelines?

Diff sanity

Read the full diff, not just the hunk you edited.
Watch for accidental whitespace, wrong indentation in YAML, duplicated keys.
For Helm: check rendered output, not just values.yaml. Use helm template, or helm diff upgrade if you have the helm-diff plugin installed.

Safety

Probes, PDBs, HPA min/max — will the cluster stay available during rollout?
Resource requests/limits — any Pod going OOM or starving the node?
Secrets — new references, rotated keys, external secret sync lag?
NetworkPolicy or ingress — unintended exposure or blocked traffic?

Rollback

Previous revision known? (kubectl rollout history, Git revert, Helm rollback)
Data migrations — backward compatible?
Time box — when do we call it failed and undo?

Comms

Who needs to know this is happening?
Is there a sterile window or change freeze?

Print it, pin it, put it in the PR template. Use the same list for GitOps merges if the apply happens via pipeline. The checklist belongs to the change, not to kubectl specifically.
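
Parts of the Safety section can be pre-screened by a script so the reviewer spends attention on judgment calls. The sketch below assumes the manifest has already been parsed into a Python dict (by whatever YAML loader you use) and that it is an apps/v1 Deployment. The three checks and the function name are my own choices for illustration, not a standard.

```python
def safety_findings(deployment: dict) -> list:
    """Return human-readable findings for one parsed Deployment manifest."""
    findings = []
    pod_spec = deployment.get("spec", {}).get("template", {}).get("spec", {})
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        image = container.get("image", "")
        # Only look at the final path segment so a registry port
        # (registry:5000/app) is not mistaken for a tag.
        last_part = image.rsplit("/", 1)[-1]
        tag = last_part.split(":", 1)[1] if ":" in last_part else ""
        if tag in ("", "latest"):
            findings.append(f"{name}: image tag is missing or 'latest'")
        if "limits" not in container.get("resources", {}):
            findings.append(f"{name}: no resource limits")
        if "readinessProbe" not in container:
            findings.append(f"{name}: no readinessProbe")
    return findings
```

An empty list does not mean the change is safe. It means the reviewer can skip the mechanical questions and ask the ones a script cannot, such as whether the limits are the right limits.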

Two-person review without performance theater

The failure mode of “two-person rule” is checkbox culture: reviewer approves in thirty seconds because they trust you, or because they want to go to lunch. That is worse than solo apply because it adds process without safety.

What helps:

Rotate reviewers. Same pair develops blind spots.

Reviewers run commands instead of only reading the diff on GitHub. For GitOps, the reviewer checks the rendered manifest in the PR. For imperative applies, the reviewer holds the second terminal and confirms context before the author hits enter.

Separate “author” and “executor” when possible. In GitOps, the author merges and whoever triggers the sync (a person or the controller) is the executor. In kubectl land, the author prepares the branch and the executor applies after read-back.

Time-box the review. Five focused minutes beats a vague LGTM hours later when nobody remembers the context.

Less experienced reviewers are welcome. A junior engineer asking “why is this privileged?” is gold. I learned more from so-called dumb questions I could not answer than from senior engineers nodding along.

What does not help:

Requiring two directors for a typo fix.

Reviewers who only check formatting.

Rubber-stamp bots with human avatars.

kubectl, Helm, and GitOps each need a slightly different lens

The checklist is shared; the tooling adds specific traps.

kubectl apply

Context and namespace are the classic failure pair. I say them out loud: “Context is prod-eu, namespace is payments.”

Watch server-side apply field ownership if you mix CI and manual edits. The diff you see locally may not be the diff the apiserver merges.

For -f directories, confirm you are not picking up a stray manifest from an old experiment.

--dry-run=server is worth the seconds when the API can validate admission.
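
To keep context and namespace out of ambient state, the read-only commands can be built with both spelled out, then printed for the read-back. This is a sketch, assuming kubectl is installed. The helper names, the prod-eu context, and the k8s/payments path are hypothetical.

```python
import shlex


def preflight_commands(context: str, namespace: str, path: str) -> list:
    """Build the read-only commands to run before a real apply."""
    # Passing --context and --namespace explicitly means the commands do
    # not depend on whatever the current kubeconfig happens to select.
    base = ["kubectl", "--context", context, "--namespace", namespace]
    return [
        base + ["diff", "-f", path],
        base + ["apply", "--dry-run=server", "-f", path],
    ]


def read_back(commands: list) -> list:
    """Render each command as one shell-safe line for the reviewer to read."""
    return [shlex.join(command) for command in commands]
```

For example, read_back(preflight_commands("prod-eu", "payments", "k8s/payments")) gives two lines the reviewer can read aloud before anything runs. Note that kubectl diff exits with status 1 when it finds differences, so a wrapper that runs these should not treat that exit code as a failure.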

Helm

Values files drift from chart defaults in ways that are invisible in a one-line PR. Render and diff.

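Chart version bumps can change labels and selectors. Changed labels can orphan ReplicaSets, and a changed selector on an existing Deployment is rejected outright because that field is immutable. Read release notes.

Upgrades with --wait: does the reviewer know what “wait” considers success? A broken readiness probe will not block forever, but it will hold the upgrade until --timeout expires (five minutes by default) and then mark the release failed.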

Rollback with Helm is not always symmetric if hooks or CRDs changed. Say that in the brief.
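
"Render and diff" can be done without any plugin: render the chart before and after the change with helm template, then compare the two outputs. The sketch below assumes Helm 3 is installed for the rendering step. The function names are hypothetical, and the diff itself is plain text comparison from the Python standard library, not anything Helm provides.

```python
import difflib


def render_command(release: str, chart: str, values_file: str, namespace: str) -> list:
    """Build a helm template invocation that renders manifests locally."""
    return [
        "helm", "template", release, chart,
        "--namespace", namespace,
        "--values", values_file,
    ]


def rendered_diff(before: str, after: str) -> str:
    """Unified diff of two rendered manifest strings; empty if identical."""
    lines = difflib.unified_diff(
        before.splitlines(), after.splitlines(),
        fromfile="rendered-before", tofile="rendered-after", lineterm="",
    )
    return "\n".join(lines)
```

Run the render command once on the main branch and once on the PR branch, then pass both outputs to rendered_diff. A one-line values change that produces a forty-line rendered diff is exactly the thing the reviewer needs to see.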

GitOps (Argo CD and kin)

The apply happens later, which is good for review and bad for complacency. Merge is not deploy; sync is deploy. Review both if auto-sync is on.

Watch for override parameters, --force sync options, and prune settings. Prune deletes things not in Git. That is powerful and easy to approve without understanding.

Drift from manual kubectl is a trust problem. If someone patched production last week, your reviewed Git change may fight the cluster at sync time. Cross-check includes “is the live state what we think?”

Multi-source apps and Helm umbrellas multiply diff noise. Slow down when the PR touches three charts and a Kustomize overlay.
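
The prune question can be stated as a set difference: what is the application tracking in the cluster that is no longer in Git? The sketch below is only a model of that question. Argo CD's actual decision depends on its own resource tracking and sync options, and the ResourceRef type and function name here are hypothetical.

```python
from typing import NamedTuple


class ResourceRef(NamedTuple):
    """Minimal identity of a Kubernetes object for comparison purposes."""
    kind: str
    namespace: str
    name: str


def prune_candidates(live: list, in_git: list) -> list:
    """Resources present in the cluster but absent from Git.

    Sorted so two people reading the list see it in the same order.
    """
    return sorted(set(live) - set(in_git))
```

If this list is not empty and prune is enabled, the reviewer should be able to name every entry and say why deleting it is intended.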

PM and ticket mindset without the PM title

You do not need a project manager to think like one for config changes. You need a brief.

Before the review, the author writes five lines in the ticket or PR description:

User or system outcome expected.
Clusters/namespaces affected.
Rollback steps.
Metrics or logs to watch for fifteen minutes after.
Owner on call if it goes wrong.

The reviewer checks the change against those five lines. If the diff solves a different problem than the ticket, stop.
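
The five-line brief is easy to enforce mechanically, for instance in a CI step that reads the PR description. This is a minimal sketch; the field labels (Outcome, Scope, Rollback, Watch, Owner) are my own shorthand for the five lines above, and the "Label: value" format is an assumption about how your template is laid out.

```python
REQUIRED_FIELDS = ("Outcome", "Scope", "Rollback", "Watch", "Owner")


def missing_brief_fields(description: str) -> list:
    """Return the required labels that are absent or left blank."""
    present = {}
    for line in description.splitlines():
        label, separator, value = line.partition(":")
        if separator:
            # Case-insensitive labels; surrounding whitespace is ignored.
            present[label.strip().lower()] = value.strip()
    return [field for field in REQUIRED_FIELDS if not present.get(field.lower())]
```

A check like this only proves the lines exist. Whether the rollback line is a real plan or a placeholder is still the reviewer's call.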

This is how flight briefings worked: same data, two people, explicit go/no-go. The PM mindset is really “no surprise changes.” Surprises belong in incidents, not in deploys.

When to escalate beyond a peer review

Some changes deserve a wider room:

First apply of a new CRD or operator with cluster-wide permissions.
IAM, cert trust, or identity provider changes.
StorageClass changes, especially to the default StorageClass.
Network paths for regulated data.
Anything that touches shared platform namespaces other teams rely on.

Escalation is not shame. It is scope recognition.

I have seen teams skip this because the YAML looked small. A ten-line ClusterRoleBinding can grant more than a hundred-line Deployment.
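
Because line count is a poor proxy for blast radius, a pre-review script can flag changes by kind and namespace instead. The sketch below assumes manifests are already parsed into dicts. The set of kinds is a starting point drawn from the list above, and the shared namespace names are placeholders you would replace with your own.

```python
# Cluster-scoped kinds whose effect reaches beyond one namespace.
ESCALATE_KINDS = {
    "CustomResourceDefinition",
    "ClusterRole",
    "ClusterRoleBinding",
    "StorageClass",
}
# Hypothetical examples of namespaces other teams rely on.
SHARED_NAMESPACES = {"kube-system", "ingress-nginx"}


def needs_wider_review(manifests: list) -> list:
    """Return one reason per manifest that deserves more than a peer review."""
    reasons = []
    for manifest in manifests:
        kind = manifest.get("kind", "")
        metadata = manifest.get("metadata", {})
        name = metadata.get("name", "<unnamed>")
        namespace = metadata.get("namespace", "")
        if kind in ESCALATE_KINDS:
            reasons.append(f"{kind}/{name}: cluster-scoped kind")
        elif namespace in SHARED_NAMESPACES:
            reasons.append(f"{kind}/{name}: shared namespace {namespace}")
    return reasons
```

Under these rules a ten-line ClusterRoleBinding is flagged and a hundred-line Deployment in a team namespace is not, which matches the point about size being misleading.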

What I still get wrong

I rush when I am the expert. Expertise makes me skip read-back because “I have done this a hundred times.” That is when I apply the wrong overlay.

I treat Git merge as done when it is only halfway. The PR was reviewed; the sync at 2 a.m. was not.

I forget that reviewers need context time. Dropping a PR and pinging “need this in ten minutes” guarantees theater.

I solo-apply when I am embarrassed about the mistake that led to the fix. Embarrassment is not a risk model.

Closing

Cross-checking config is not about distrust of individuals. It is about respecting complexity and human limits. Aviation did not invent this because copilots are decorative. It invented it because the approach is safer when two people agree on what the instruments say.

Try it on the next change that would wake someone up if it failed: a peer, a short checklist, a read-back of context and rollback. No guru required. Just a habit that survives tired Tuesday afternoons.

If you want one concrete starting point, add a PR template field: “Reviewer verified cluster/namespace and rollback.” Check that box only after someone other than the author says it out loud. Small friction, real safety.