Debugging Kubernetes workloads for beginners

Kubernetes debugging feels strange at the beginning because the thing you are trying to fix is rarely the thing you created directly.

You write a Deployment. The Deployment creates a ReplicaSet. The ReplicaSet creates Pods. A Service sends traffic to Pods selected by labels. The scheduler chooses a node. The kubelet starts containers. Probes decide whether traffic should reach them. A controller keeps reconciling the desired state while you are still reading the error message.

That is a lot of moving parts for a beginner. The good news is that Kubernetes leaves breadcrumbs. They are not always friendly, but they are usually there: status fields, Events, logs, rollout history, endpoint lists, and conditions. Debugging is mostly learning which breadcrumb to read next.

This is the path I wish I had used earlier: slow, boring, and much better than deleting Pods until something changes.

Start with the mental model

Kubernetes is not a script runner. It is a control system.

You submit desired state: “I want three replicas of this container image, exposed on this port, with these environment variables.” Controllers constantly compare desired state with actual state and try to close the gap. When something goes wrong, your job is to find where the gap begins.

For a workload problem, the chain often looks like this:

Deployment -> ReplicaSet -> Pod -> Container -> Process
Service -> EndpointSlice -> Pod IPs
Ingress -> Service -> Pods

Do not start inside the container unless you already know the container started correctly. Do not start with Ingress if the Service has no endpoints. Do not start with DNS if the Pod is not Ready. Beginners lose time by jumping to the most interesting layer instead of the next observable layer.
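The first chain can also be walked from the bottom up, because each object records its owner. The sketch below is a small helper around kubectl's jsonpath output; the function name owner_of and the object names in the comments are placeholders, not something from a real cluster.

```shell
# Print the first owner reference of an object as "Kind/name".
# Arguments: resource type, object name, namespace.
owner_of() {
  kubectl get "$1" "$2" -n "$3" \
    -o jsonpath='{.metadata.ownerReferences[0].kind}/{.metadata.ownerReferences[0].name}{"\n"}'
}

# Example calls (placeholder names):
#   owner_of pod web-abc123 my-app      # the ReplicaSet that created the Pod
#   owner_of rs  web-5d4f8c7b9 my-app   # the Deployment that owns that ReplicaSet
```

Walking upward like this is a quick way to confirm which Deployment a misbehaving Pod really belongs to.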

I usually ask three questions first:

What changed?
What is the smallest broken object I can name?
What does Kubernetes already know about it?

The third question matters. Kubernetes often knows more than your first guess.

Confirm the namespace before anything else

The most boring Kubernetes bug is also one of the most common: looking in the wrong namespace.

kubectl get namespaces
kubectl get pods -n my-app
kubectl config view --minify --output 'jsonpath={..namespace}'; echo

If the current context has no namespace set, kubectl usually uses default. Many production workloads do not live there. I prefer being explicit while debugging:

kubectl get deploy,rs,pod,svc -n my-app

That command gives a first map. Are there Deployments? Did they create ReplicaSets? Are Pods present? Is a Service present? Are the desired and available replica counts close?

If the answer is “I do not see the object,” stop. Check the namespace, spelling, Helm release, GitOps sync state, or whether you are connected to the right cluster.
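A minimal sketch of that check, assuming you want one line that states where kubectl is pointed before you run anything else. The function name current_target is made up for this example.

```shell
# Print the current context and the namespace kubectl will use.
# An empty namespace in the kubeconfig is reported as "default".
current_target() {
  ctx=$(kubectl config current-context)
  ns=$(kubectl config view --minify --output 'jsonpath={..namespace}')
  printf 'context=%s namespace=%s\n' "$ctx" "${ns:-default}"
}

# Example call:
#   current_target
```

Running it at the start of a session makes a wrong cluster or wrong namespace obvious before it costs any time.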

Read the rollout before reading logs

For Deployments, the first useful question is whether the rollout completed.

kubectl rollout status deployment/web -n my-app
kubectl get deployment web -n my-app
kubectl describe deployment web -n my-app

kubectl describe is noisy, but it is good beginner noise. Look for conditions near the bottom. You might see messages like ProgressDeadlineExceeded, MinimumReplicasUnavailable, or ReplicaSetUpdated. These are not final diagnoses, but they point you to the next object.

Then inspect the ReplicaSet and Pods:

kubectl get rs -n my-app
kubectl get pods -n my-app -l app=web -o wide

The -o wide output adds node names and Pod IPs. That matters if all failing Pods are on the same node, or if you later need to compare Service endpoints.

Common beginner misconception: “A Deployment is running, so the app is fine.” A Deployment can exist while every Pod is crashing. Look at available replicas and Pod readiness, not just whether the Deployment object exists.
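To make that comparison explicit, here is a small sketch that reads the desired and available replica counts from the Deployment and prints them side by side. The function name replica_summary and the ok/degraded labels are my own choices for illustration.

```shell
# Compare desired and available replicas for one Deployment.
# Arguments: deployment name, namespace.
replica_summary() {
  counts=$(kubectl get deployment "$1" -n "$2" \
    -o jsonpath='{.spec.replicas} {.status.availableReplicas}')
  desired=${counts%% *}
  available=${counts#* }
  # availableReplicas is omitted from status when it is zero.
  available=${available:-0}
  if [ "$desired" = "$available" ]; then state=ok; else state=degraded; fi
  printf '%s desired=%s available=%s\n' "$state" "$desired" "$available"
}

# Example call (placeholder names):
#   replica_summary web my-app
```

A degraded result is not a diagnosis, only a signal to move down to the ReplicaSet and Pods.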

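Pod statuses are headlines, not explanations

kubectl get pods shows a STATUS column with values like Pending, Running, CrashLoopBackOff, ImagePullBackOff, ErrImagePull, CreateContainerConfigError, and Completed. Strictly speaking, only some of these are Pod phases; the others are container-level reasons that kubectl surfaces in the same column. Either way, treat them as headlines. The article is in describe.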

kubectl describe pod web-abc123 -n my-app

Scroll to three areas:

Conditions. These explain scheduling, initialization, readiness, and container status.

Container state. This shows whether the container is waiting, running, or terminated. If it terminated, look for exit code, reason, start time, and finish time.

Events. These are often the fastest path to the real issue. Failed scheduling, image pull errors, failed mounts, probe failures, and secret problems usually appear here.

Some common examples:

Failed to pull image "example/web:v12": not found

The image tag is wrong, the registry is unreachable, or credentials are missing.

0/3 nodes are available: 3 Insufficient cpu.

The Pod is not broken. The cluster cannot currently schedule it with the requested resources.

Error: secret "web-config" not found

The container never started because configuration required by the Pod spec does not exist in that namespace.

This is why deleting the Pod is usually a weak first move. If the spec still references a missing Secret, the next Pod will fail the same way.
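When describe output is too long to scan, the same information can be pulled out more narrowly. This is a sketch with two hypothetical helpers: one lists only the Events attached to a single Pod, the other prints each container's waiting reason from the Pod status.

```shell
# List Events for a single Pod, sorted by last timestamp.
# Arguments: pod name, namespace.
pod_events() {
  kubectl get events -n "$2" \
    --field-selector "involvedObject.kind=Pod,involvedObject.name=$1" \
    --sort-by=.lastTimestamp
}

# Print each container's name and waiting reason (blank if it is not waiting).
# Arguments: pod name, namespace.
container_waiting_reasons() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.state.waiting.reason}{"\n"}{end}'
}

# Example calls (placeholder names):
#   pod_events web-abc123 my-app
#   container_waiting_reasons web-abc123 my-app
```

Events expire after a while, so an empty list for an old Pod does not mean nothing happened.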

Logs answer application questions

Logs are useful after you know the container started. They answer “what did the process say?” They do not answer every Kubernetes question.

kubectl logs deployment/web -n my-app
kubectl logs pod/web-abc123 -n my-app
kubectl logs pod/web-abc123 -n my-app -c app
kubectl logs pod/web-abc123 -n my-app --previous

--previous is important for crash loops. The current container may have just restarted and produced no logs yet. The previous instance may contain the stack trace.

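If a Pod has multiple containers, specify -c. Sidecars can make beginner debugging confusing because kubectl logs pod/name may not show the container you care about. Likewise, kubectl logs deployment/web reads from a single Pod that kubectl picks, not from every replica.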

Use logs for application-level problems: missing environment variable handling, failed database migrations, refused connections, bad credentials, panics, and startup errors. Use Events for Kubernetes-level problems: scheduling, mounting, pulling images, and probes.
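For restarting containers I often want "the previous instance if there is one, otherwise the current one." A minimal sketch of that fallback follows; the function name recent_logs and the 50-line tail are arbitrary choices.

```shell
# Show the last lines of the previous container instance, or of the
# current one if kubectl cannot return previous logs.
# Arguments: pod name, namespace, container name.
recent_logs() {
  kubectl logs "$1" -n "$2" -c "$3" --previous --tail=50 2>/dev/null ||
    kubectl logs "$1" -n "$2" -c "$3" --tail=50
}

# Example call (placeholder names):
#   recent_logs web-abc123 my-app app
```

The error from the first attempt is deliberately hidden, so if both calls fail you only see the message from the second.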

Understand CrashLoopBackOff

CrashLoopBackOff does not mean Kubernetes crashed. It means Kubernetes started your container, the process exited, and Kubernetes is waiting before trying again. The waiting period grows exponentially, up to a cap of five minutes, to avoid tight restart loops.

Check the last termination:

kubectl describe pod web-abc123 -n my-app
kubectl logs web-abc123 -n my-app --previous

Exit code 0 can still be wrong for a long-running service. It means the process ended successfully, but a Deployment expects it to keep running. Maybe the image runs a one-time command. Maybe the command in the manifest overrides the real server start command.

Exit code 1 often points to application failure. Exit code 137 means the process received SIGKILL (128 + 9), often because it exceeded its memory limit; in that case the container's last state in describe shows the reason OOMKilled. Confirm with:

kubectl top pod web-abc123 -n my-app
kubectl describe pod web-abc123 -n my-app

kubectl top requires metrics-server and only shows current usage, so it cannot prove a past memory kill on its own. If it is not installed, use Events, container status, and your platform metrics.
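The exit code reasoning above can be sketched as a small lookup. The function names are hypothetical, and the messages are first hints under the usual convention that codes above 128 mean 128 plus a signal number, not a definitive diagnosis.

```shell
# Translate a container exit code into a first hint.
# Argument: numeric exit code.
explain_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "exit 0: process ended normally; check the container command"
  elif [ "$code" -gt 128 ]; then
    # By convention, codes above 128 usually mean 128 + signal number.
    echo "exit $code: killed by signal $((code - 128))"
  else
    echo "exit $code: application error; read the previous logs"
  fi
}

# Read reason and exit code of the last termination (first container only).
# Arguments: pod name, namespace.
last_termination() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason} {.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'
}

# Example calls (placeholder names):
#   last_termination web-abc123 my-app
#   explain_exit_code 137
```

Signal 9 is SIGKILL and signal 15 is SIGTERM, so 137 and 143 are the two values you will meet most often.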

Probes can create false mysteries

Readiness and liveness probes are helpful, but they are also common self-inflicted wounds.

Readiness answers: should this Pod receive traffic?

Liveness answers: should this container be restarted?

Startup probes, when used, give slow applications time to boot before liveness begins judging them.

Here is a simple pattern:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: app
          image: ghcr.io/example/web:1.0.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10

If readiness fails, the Pod may be Running but not Ready, and the Service will not send it traffic. If liveness fails failureThreshold times in a row (3 by default), the kubelet restarts the container. A too-aggressive liveness probe can keep killing a slow application before it finishes starting.

Check Events:

kubectl describe pod web-abc123 -n my-app

If you see repeated probe failures, do not immediately increase delays blindly. First verify the path, port, scheme, and whether the endpoint depends on something slow like a database connection.
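Before touching delays, it helps to know roughly how much time the current settings give the application. The sketch below does that arithmetic; it is a rough estimate, not exact kubelet timing, and the function name liveness_budget is made up for this example.

```shell
# Rough number of seconds before a never-passing liveness probe causes a
# restart: initial delay plus failureThreshold probe periods.
# Arguments: initialDelaySeconds, periodSeconds, failureThreshold (default 3).
liveness_budget() {
  echo $(( $1 + $2 * ${3:-3} ))
}

# Example call with the values from the manifest above:
#   liveness_budget 30 10
```

With initialDelaySeconds 30, periodSeconds 10, and the default failureThreshold, the application has about a minute to start answering /health. If it genuinely needs longer, a startup probe is usually a better fix than a very large initial delay.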

Services fail quietly when selectors do not match

A Service is stable networking in front of changing Pods. It does not magically know which Pods belong to it. It uses selectors.

apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 8080

The selector must match Pod labels:

kubectl get pods -n my-app --show-labels
kubectl get svc web -n my-app -o yaml
kubectl get endpoints web -n my-app
kubectl get endpointslice -n my-app -l kubernetes.io/service-name=web

If the Service has no endpoints, traffic has nowhere to go. Common causes are label mismatch, Pods not Ready, or the Service selecting a different version than you think.

This is a classic beginner trap because the Service object looks healthy. Kubernetes will happily create a Service with a selector that matches nothing.
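The matching rule itself is simple: every key=value pair in the selector must appear among the Pod's labels. Here is a minimal sketch of that rule in plain shell, working on the comma-separated label strings that --show-labels prints. The function name selector_matches is hypothetical, and it only covers simple equality selectors like the one in the Service above.

```shell
# Succeed if every key=value pair in the selector appears in the labels.
# Arguments: selector ("app=web,tier=frontend"), labels ("app=web,...").
selector_matches() {
  selector=$1
  labels=",$2,"
  # A Service without a selector does not select Pods automatically.
  [ -n "$selector" ] || return 1
  old_ifs=$IFS
  IFS=,
  for pair in $selector; do
    case "$labels" in
      *",$pair,"*) ;;                       # this pair is present
      *) IFS=$old_ifs; return 1 ;;          # one missing pair is enough
    esac
  done
  IFS=$old_ifs
  return 0
}

# Example call (placeholder labels):
#   selector_matches "app=web" "app=web,pod-template-hash=5d4f8c7b9" && echo match
```

Extra labels on the Pod are fine; a single missing or misspelled pair is what leaves the Service without endpoints.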

Use exec carefully

kubectl exec is useful, but I treat it as a focused tool, not a lifestyle.

kubectl exec -n my-app deploy/web -- printenv
kubectl exec -n my-app deploy/web -- wget -qO- http://localhost:8080/ready

Some images are minimal and do not include sh, curl, or wget. That is good for security, annoying for debugging. In clusters running Kubernetes 1.25 or later, where ephemeral containers are a stable feature, a debug container may help:

kubectl debug -n my-app pod/web-abc123 -it --image=busybox:1.36 --target=app

Whether this works depends on cluster policy. In locked-down environments it may be disabled, and that is not a bug.

A beginner-friendly sequence

When I am stuck, I come back to this order:

Confirm cluster and namespace.
Run kubectl get deploy,rs,pod,svc -n <namespace> for a first map.
Check rollout status for Deployments.
Describe the failing Pod.
Read Events before changing anything.
Read logs, including --previous for restarts.
Check Service endpoints if traffic does not arrive.
Check probes, config references, image pull status, and resource limits.
Change one thing, then observe.

That last point is not motivational. It is practical. If you change image tag, probe path, resource limits, and Service selector at the same time, you may fix the problem and learn nothing. Worse, you may introduce the next problem while hiding the first one.
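The read-only part of that sequence can be sketched as one function, assuming a single Deployment is the suspect. The name triage, the 10-second rollout timeout, and the 20-line Event tail are arbitrary choices for illustration; the function only reads state and changes nothing.

```shell
# Run the first read-only checks in order for one Deployment.
# Arguments: namespace, deployment name.
triage() {
  ns=$1
  deploy=$2
  echo "== context =="
  kubectl config current-context
  echo "== objects =="
  kubectl get deploy,rs,pod,svc -n "$ns"
  echo "== rollout =="
  # Do not wait forever on a stuck rollout.
  kubectl rollout status "deployment/$deploy" -n "$ns" --timeout=10s ||
    echo "rollout not complete"
  echo "== recent events =="
  kubectl get events -n "$ns" --sort-by=.lastTimestamp | tail -n 20
}

# Example call (placeholder names):
#   triage my-app web
```

From there, describe and logs on the specific failing Pod are still manual steps, which is intentional: that is where reading matters more than automation.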

Kubernetes debugging gets easier when you stop treating the cluster as a black box. It is more like a system that keeps filing reports. The reports are verbose, sometimes awkward, and not always in the order you want. But for beginner workload issues, they are usually enough to move from panic to a sensible next command.