Liveness and readiness probes for beginners

A Pod can show Running in kubectl get pods while being a terrible place to send traffic. It can also sit in a restart loop because Kubernetes thinks the container is dead when it is merely slow. Probes are how the kubelet answers those two different questions.

This post is for the stage where Deployments and Services already make sense, but traffic still hits broken instances, rollouts stall at 0 of 3 updated, or containers restart every thirty seconds for no obvious application error. Probes are often the explanation.

They are not decoration. They are the contract between your application and the platform about when a Pod is safe to serve and when it needs intervention.

Three questions, three probe types

Kubernetes separates health into three probes because the right response to failure differs:

Startup probe — Has the container finished booting? While startup fails, other probes are disabled. Use this for slow-starting apps so liveness does not kill them mid-boot.
Readiness probe — Should this Pod receive traffic right now? Failure removes the Pod from Service endpoints. The container keeps running.
Liveness probe — Should Kubernetes restart this container? Failure triggers a kill and restart.

That distinction matters in practice:

Probe	Failure means	Typical response
Startup	Still booting	Wait, or fix boot path
Readiness	Not safe for traffic	Stop routing; app may recover
Liveness	Container is stuck or dead	Restart

A common beginner mistake is using one shallow HTTP check for everything. The process responds on /health during startup, so liveness passes, readiness passes, and traffic arrives before the database connection pool is ready. Or startup is slow, liveness fires, and the Pod never becomes stable.

Rule of thumb I use:

Readiness should be honest about whether the instance can do useful work now.
Liveness should be conservative — restart only when recovery without restart is unlikely.
Startup exists when boot takes longer than you are willing to tolerate liveness failures.

How probes connect to Services

A Service does not health-check your application by itself. It routes to Pod IPs listed in EndpointSlices. Which Pods appear there depends on readiness.

When readiness succeeds, the Pod IP is included. When readiness fails, it is removed. Existing connections may persist for a while depending on client behavior and platform settings, but new traffic should not target that Pod.

That is why a STATUS of Running in kubectl get pods is not enough. Check the READY column: 0/1 Running means the container is up but failing readiness.

kubectl get pods -l app=checkout -o wide
kubectl get endpoints checkout -o wide
kubectl get endpointslice -l kubernetes.io/service-name=checkout

If endpoints are empty while Pods run, readiness is failing or selectors do not match. If endpoints include Pods that return 500s under load, readiness is too loose.
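The two causes of empty endpoints are easy to confuse, so here is a toy model of the decision in Python. It is only a sketch of the reasoning, not how Kubernetes implements it; the function name and the dict shape for Pods are made up for illustration.

```python
def diagnose_endpoints(selector, pods):
    """Explain why a Service has (or lacks) endpoints.

    selector: dict of label key/value pairs from the Service spec.
    pods: list of dicts with 'labels' (dict) and 'ready' (bool).
    Toy model only; real data would come from the cluster API.
    """
    # A Pod is a candidate only if every selector label matches exactly.
    matching = [
        p for p in pods
        if all(p["labels"].get(k) == v for k, v in selector.items())
    ]
    if not matching:
        return "selector matches no Pods"
    ready = [p for p in matching if p["ready"]]
    if not ready:
        return "Pods match but none are Ready"
    return f"{len(ready)} of {len(matching)} matching Pods are in endpoints"
```

A typo such as app: chekout in the Service selector lands in the first branch; a failing readiness probe lands in the second.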

During rolling updates, readiness controls when new ReplicaSet Pods count as available. Strict readiness can slow deploys but protects users. Loose readiness marks a rollout successful while instances still warm up — then error rates spike when traffic shifts.

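Graceful shutdown fits here too. Kubernetes removes a terminating Pod from endpoints on its own, but that update takes time to reach every proxy and load balancer. On SIGTERM, the app should fail readiness and keep serving briefly so in-flight and late-arriving requests can finish. If the process exits immediately, clients may hit a closing connection.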

Probe mechanisms: HTTP, TCP, and exec

All three probe types support the same mechanisms: httpGet, tcpSocket, and exec. (A grpc mechanism also exists; it is not covered here.) Choose based on what your application actually exposes.

HTTP probes

Most common for web services. The kubelet sends an HTTP GET to the Pod IP on the configured path and port. Any status from 200 through 399 counts as success.

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
    scheme: HTTP
  initialDelaySeconds: 5
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /live
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
  timeoutSeconds: 3
  failureThreshold: 3
startupProbe:
  httpGet:
    path: /live
    port: 8080
  periodSeconds: 5
  failureThreshold: 30

Implement separate endpoints in application code when possible. /ready can check database connectivity. /live should be cheap — process up, event loop not wedged. Pointing both at / that always returns 200 teaches Kubernetes nothing.
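On the application side, the split can look like the following Python sketch. It uses only the standard library. HealthState, its two flags, and probe_response are hypothetical names for illustration; a real service would flip the flags from its own startup and shutdown code, and a real /ready might run an actual dependency check.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthState:
    """Hypothetical in-process flags the app flips as it boots and shuts down."""

    def __init__(self):
        self.db_connected = False   # set True once the connection pool is usable
        self.shutting_down = False  # set True when SIGTERM arrives

    def is_ready(self):
        return self.db_connected and not self.shutting_down


def probe_response(path, state):
    """Return (status_code, body) for a probe path."""
    if path == "/live":
        # Liveness stays cheap: answering at all shows the process is not wedged.
        return 200, {"status": "alive"}
    if path == "/ready":
        if state.is_ready():
            return 200, {"status": "ready"}
        return 503, {"status": "not ready"}
    return 404, {"error": "unknown path"}


STATE = HealthState()


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = probe_response(self.path, STATE)
        payload = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # keep probe traffic out of the application log


def serve(port=8080):
    # Bind to all interfaces: the kubelet connects to the Pod IP, not 127.0.0.1.
    ThreadingHTTPServer(("0.0.0.0", port), ProbeHandler).serve_forever()
```

Calling serve() starts the listener. Note that /live never consults db_connected, so a database outage fails readiness without triggering restarts.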

HTTP probes support custom headers and HTTPS (with scheme: HTTPS the kubelet skips certificate verification). Match the port to what the container listens on, not the Service port, which often differs.

TCP probes

The kubelet opens a TCP connection to a port. Success means something accepted the connection.

readinessProbe:
  tcpSocket:
    port: 5432
  periodSeconds: 10

TCP probes are useful when there is no HTTP endpoint — databases, message brokers, legacy services. Limitation: an open port does not mean the application is ready. PostgreSQL might accept connections while still recovering. Prefer HTTP or exec when you can express real readiness.
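To see how little a TCP check proves, here is roughly what it amounts to in Python. This is an approximation of the idea, not the kubelet's code, and tcp_probe is a name I made up.

```python
import socket


def tcp_probe(host, port, timeout_seconds=1.0):
    """Succeed if a TCP connection can be opened, and nothing more.

    No bytes are sent or read, so a listener that never answers a
    single request still passes this check.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers connection refused, unreachable hosts, and timeouts.
        return False
```

The operating system completes the handshake as soon as a socket is listening, even if the application has not yet accepted the connection. That is exactly the gap between port open and ready.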

Exec probes

The kubelet runs a command inside the container. Exit code 0 means success.

livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - pg_isready -U app -d checkout
  periodSeconds: 15

Exec probes are flexible but heavier. They require the binary to exist in the image, behave correctly under load, and finish within timeoutSeconds. A slow exec probe under CPU pressure can itself cause failures.
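The exec contract (exit code 0 within the timeout, anything else is a failure) can be sketched in Python like this. It mimics the semantics only; exec_probe is a hypothetical helper, and the command in the usage line is a stand-in for something like pg_isready.

```python
import subprocess
import sys


def exec_probe(command, timeout_seconds=1.0):
    """Return True only if the command exits 0 before the timeout."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return False  # too slow counts as a failure, even if it would have passed
    except OSError:
        return False  # binary missing from the image, not executable, etc.
    return result.returncode == 0


# Stand-in command: the current interpreter exiting with status 0.
print(exec_probe([sys.executable, "-c", "raise SystemExit(0)"], timeout_seconds=10))
```

Both except branches correspond to failure modes named above: a slow command under CPU pressure, and a binary that is not in the image.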

A complete Deployment example

Here is a pattern I reach for on stateless HTTP services with a slow JVM or similar boot:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
        - name: checkout
          image: ghcr.io/example/checkout:2.1.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5
            timeoutSeconds: 2
            failureThreshold: 2
          livenessProbe:
            httpGet:
              path: /live
              port: 8080
            periodSeconds: 15
            timeoutSeconds: 3
            failureThreshold: 3
          startupProbe:
            httpGet:
              path: /live
              port: 8080
            periodSeconds: 5
            failureThreshold: 24

With periodSeconds: 5 and failureThreshold: 24, startup allows up to roughly two minutes before liveness applies. Adjust from measured boot time in staging, not guesses.
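The arithmetic is simple enough to keep in a helper while tuning. A small Python sketch (the function name is mine, and the result is an approximation that ignores per-probe timeouts):

```python
def startup_budget_seconds(period_seconds, failure_threshold, initial_delay_seconds=0):
    """Approximate time a startup probe tolerates before the container is killed."""
    return initial_delay_seconds + period_seconds * failure_threshold


print(startup_budget_seconds(5, 24))  # 120, the Deployment above
print(startup_budget_seconds(5, 30))  # 150, the earlier HTTP probe example
```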

Pair with a Service that selects the same labels. Readiness on the Pod side and selectors on the Service side are both required for healthy routing.

apiVersion: v1
kind: Service
metadata:
  name: checkout
spec:
  selector:
    app: checkout
  ports:
    - port: 80
      targetPort: 8080

Tuning fields that actually matter

Probe specs look small. These fields cause most production surprises:

initialDelaySeconds: Wait before the first probe (default 0). Still used, but startup probes reduce reliance on large values that delay failure detection.
periodSeconds: How often to probe (default 10). Too aggressive adds load; too slow delays endpoint removal.
timeoutSeconds: Max wait per probe (default 1). Keep it below periodSeconds as a rule of thumb; Kubernetes does not enforce this. Under load, short timeouts false-fail.
successThreshold: Consecutive successes needed after a failure. Defaults to 1 and must be 1 for liveness and startup; for readiness, raising it can smooth flapping at the cost of a slower return to rotation.
failureThreshold: Consecutive failures before the probe counts as failed (default 3). Multiplied by periodSeconds, this is roughly how long before Kubernetes acts.

For readiness, failureThreshold: 1 removes a Pod from endpoints quickly — good for user protection, hard on apps that briefly stall. For liveness, low thresholds plus aggressive periods cause restart loops.
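The word "consecutive" does a lot of work in those thresholds. Here is a toy Python model of the counting, useful for reasoning about flapping. It is my own simplification, not kubelet code, and ProbeTracker is a hypothetical name.

```python
class ProbeTracker:
    """Toy model of consecutive-result counting for one probe."""

    def __init__(self, failure_threshold=3, success_threshold=1, healthy=False):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.healthy = healthy
        self._streak = 0  # current run of results that disagree with `healthy`

    def observe(self, ok):
        """Record one probe result and return the resulting state."""
        if ok == self.healthy:
            self._streak = 0  # an agreeing result breaks the opposing streak
            return self.healthy
        self._streak += 1
        needed = self.success_threshold if ok else self.failure_threshold
        if self._streak >= needed:
            self.healthy = ok
            self._streak = 0
        return self.healthy


tracker = ProbeTracker(failure_threshold=3, healthy=True)
history = [tracker.observe(ok) for ok in [False, False, True, False, False, False, True]]
print(history)  # [True, True, True, True, True, False, True]
```

Two failures followed by one success reset the count, so the state flips only after the later run of three failures. With failure_threshold=1 the first blip would have been enough.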

Test under realistic load, not on an idle laptop. GC pauses, connection pool exhaustion, and dependency blips show up only when something resembles production traffic.

Common mistakes

Using liveness for slow startup. Without a startup probe, increasing initialDelaySeconds only goes so far. Boot time changes with data size and cache state. A startup probe is the cleaner fix.

Readiness and liveness on the same expensive check. If /ready hits the database and liveness uses the same path, a DB outage restarts all Pods instead of removing them from traffic. Keep liveness cheap.

Wrong port or path. The kubelet probes the container port on the Pod IP. A Service port of 80 does not matter here. A wrong port shows up as connection refused; a typo in the path usually returns a 404. Either counts as a failed probe.

No readiness probe at all. Kubernetes treats the Pod as ready as soon as its containers start. Traffic arrives before the app listens or migrations finish.

Probes that always pass. Static 200 on / while workers are not consuming, migrations run, or the app returns errors on real routes — readiness lies, deploys look green, users see failures.

Ignoring timeoutSeconds. The default is 1 second. A probe that passes locally but times out under load removes Pods from endpoints during peak traffic, which is exactly when a rollout can least afford it.

Forgetting graceful shutdown. On SIGTERM, fail readiness first, sleep briefly, then drain. Without that, load balancers and EndpointSlices may still send requests to a dying Pod.
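The shutdown order can be sketched in Python as follows. Treat it as one possible wiring under my own assumptions: shutting_down, is_ready, and shutdown_sequence are hypothetical names, the 5-second pause is a placeholder to tune per cluster, and drain stands for whatever your server uses to finish in-flight requests.

```python
import signal
import threading
import time

shutting_down = threading.Event()


def is_ready():
    """What a /ready handler would consult (hypothetical wiring)."""
    return not shutting_down.is_set()


def handle_sigterm(signum, frame):
    # Only flip the flag here; slow work belongs outside the signal handler.
    shutting_down.set()


def shutdown_sequence(drain, grace_seconds=5.0, sleep=time.sleep):
    """Fail readiness, wait for endpoint updates to propagate, then drain."""
    shutting_down.set()   # step 1: /ready starts failing
    sleep(grace_seconds)  # step 2: keep serving while routing catches up
    drain()               # step 3: finish in-flight requests and exit


def install():
    """Register the handler; call this from the main thread at startup."""
    signal.signal(signal.SIGTERM, handle_sigterm)
```

The main loop would call install() at startup, wait on shutting_down, then run shutdown_sequence with the server's own drain function. The total has to fit inside the Pod's terminationGracePeriodSeconds.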

Debugging probe failures

When Pods restart or stay Not Ready, start with events and describe output:

kubectl describe pod checkout-7b9d4c8d6f-q2xk9
kubectl get events --field-selector involvedObject.name=checkout-7b9d4c8d6f-q2xk9
kubectl logs checkout-7b9d4c8d6f-q2xk9 --previous

Look for Liveness probe failed, Readiness probe failed, or Startup probe failed in events. The message often includes HTTP status codes or connection refused — that narrows the fix quickly.

Exec into the Pod and run the probe manually:

kubectl exec -it checkout-7b9d4c8d6f-q2xk9 -- curl -s -o /dev/null -w "%{http_code}" http://127.0.0.1:8080/ready

If manual checks pass but probes fail, suspect timeouts, wrong scheme (HTTP vs HTTPS), or the app listening only on 127.0.0.1 while the kubelet connects to the Pod IP.
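When curl is not in the image but Python is, a small script can stand in and also report latency, which helps spot timeout problems. This is a rough approximation of an HTTP probe, not the kubelet's implementation; http_probe and is_success are names I made up.

```python
import time
import urllib.error
import urllib.request


def is_success(status):
    """HTTP probes treat any status from 200 through 399 as success."""
    return 200 <= status < 400


def http_probe(url, timeout_seconds=1.0):
    """Return (ok, status_or_error_text, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # 4xx and 5xx responses arrive as exceptions
    except OSError as exc:
        # Connection refused, DNS failure, or timeout (URLError is an OSError).
        return False, str(exc), time.monotonic() - start
    return is_success(status), status, time.monotonic() - start
```

Run it against http://127.0.0.1:8080/ready from inside the container, then against the Pod IP from another Pod. If the first passes and the second is refused, the app is bound to loopback only. An elapsed time close to timeoutSeconds points at timeouts rather than logic.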

For restart loops, check whether liveness fires before startup completes. Temporarily disable liveness in a staging cluster, measure boot time, then set startup failureThreshold with margin.
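Turning measured boot times into a threshold is a one-liner worth writing down. A Python sketch, where the function name, the 1.5x margin, and the sample timings are all illustrative assumptions rather than recommendations:

```python
import math


def startup_failure_threshold(measured_boot_seconds, period_seconds=5, margin=1.5):
    """Smallest failureThreshold that covers the slowest observed boot plus margin."""
    worst = max(measured_boot_seconds)
    return math.ceil(worst * margin / period_seconds)


# Made-up staging measurements, in seconds.
print(startup_failure_threshold([48, 62, 75]))  # 23
```

Here 75 seconds times 1.5 is 112.5 seconds, which needs 23 probes at a 5-second period. Re-measure after changes that affect boot, such as larger caches or new migrations.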

When plain defaults are enough

Not every workload needs three custom probes on day one. A simple internal tool with fast boot and tolerant users might start with readiness only:

readinessProbe:
  httpGet:
    path: /
    port: 8080
  initialDelaySeconds: 3
  periodSeconds: 5

Add liveness when you have evidence of stuck processes that stop responding without exiting. Add startup when boot time exceeds roughly thirty seconds or varies enough that liveness kills Pods mid-start.

Platform teams sometimes inject defaults through mutating admission. Verify what your cluster adds. Application-owned endpoints beat generic TCP checks on the main port.

Closing thought

Probes are the bridge between “container started” and “this instance should participate in the Service.” They are easy to copy from tutorials and hard to get honest.

I still treat probe changes like small contract changes: measure boot time, define what ready means for this app, keep liveness cheaper than readiness, and watch endpoints during a staging rollout before trusting green status in production.

When probes tell the truth, deploys feel boring in a good way — EndpointSlices update, traffic shifts, and nobody debugs mysterious partial outages that look like load balancer ghosts. When they lie or fight slow startup, the Pod column looks fine while users notice first. That gap is worth closing early.