Kubernetes Jobs and CronJobs for beginners

Most Kubernetes tutorials start with Deployments. That makes sense. A web API, a frontend, a message consumer that should always be running — these are long-lived services. Kubernetes creates Pods, watches them, and replaces them when they fail.

But not every workload is a service.

Some tasks should run once and finish: migrate a database schema, generate a monthly report, import a batch of records, back up a directory, or reindex a search index overnight. Others should run on a schedule: clean up temp files every night, sync reference data at 03:00, or send a digest email on weekdays.

For those patterns, a Deployment is the wrong tool. Deployments assume the container should keep running. If your container exits, even with code 0, the kubelet restarts it in place, and repeated exits put the Pod into CrashLoopBackOff. That is correct for a server. It is wrong for a one-off script.

Jobs and CronJobs exist for batch and scheduled work. They are not exotic objects. They are just controllers with a different goal: run this work until it succeeds or until retry limits are exhausted, then stop.

Job vs Deployment: different lifecycles

A Deployment says: keep N copies of this Pod template running. If a Pod dies, create a replacement. If you roll out a new image, gradually replace old Pods with new ones. The desired state is continuous availability.

A Job says: create one or more Pods from this template and run them until they complete successfully. When the work is done, the Job is done. Kubernetes does not treat a successful exit as a failure.

That difference sounds small until you deploy a migration script as a Deployment and wonder why Kubernetes keeps restarting it after it finishes.

Compare the intent:

Question	Deployment	Job
Should the container keep running?	Yes	No, it should finish
What happens on exit code 0?	The container is restarted in place	The Job succeeded
Typical use	APIs, workers, web servers	Migrations, batch imports, one-off scripts
Scaling meaning	More always-on replicas	More parallel task Pods

If your process is supposed to listen on a port forever, use a Deployment. If your process is supposed to print “done” and exit, use a Job.
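
For contrast with the Jobs that follow, here is a minimal sketch of the "listen forever" case as a Deployment. The name demo-api, the image, and the port are hypothetical and exist only for illustration.

```yaml
# Hypothetical always-on service, shown only to contrast with a Job.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-api                 # hypothetical name
  namespace: demo
spec:
  replicas: 2                    # keep two copies running at all times
  selector:
    matchLabels:
      app: demo-api
  template:
    metadata:
      labels:
        app: demo-api            # must match spec.selector.matchLabels
    spec:
      # restartPolicy is left at its default, Always,
      # which is the only value a Deployment accepts.
      containers:
        - name: api
          image: my-registry.example/api:1.0.0   # hypothetical image
          ports:
            - containerPort: 8080                # hypothetical port
```

Note what is missing compared with a Job: there is no notion of completion, retries, or a deadline. The only goal is that two Pods stay up.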

A basic Job

Here is a minimal Job that runs a short command:

apiVersion: batch/v1
kind: Job
metadata:
  name: hello-job
  namespace: demo
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: busybox:1.36
          command: ["sh", "-c", "echo starting && sleep 5 && echo done"]

Apply and watch it:

kubectl apply -f job.yaml
kubectl get jobs -n demo
kubectl get pods -n demo -l job-name=hello-job
kubectl logs -n demo -l job-name=hello-job

One field matters immediately.

restartPolicy: Never (or OnFailure) belongs on the Pod template inside the Job. Jobs do not accept the default Always restart policy that Deployments rely on. If you omit the field or set Always, the API server rejects the Job with a validation error.
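
For comparison, here is a sketch of the same Job with restartPolicy: OnFailure. The name hello-job-onfailure is made up for illustration. With OnFailure, the kubelet restarts a failed container inside the same Pod, whereas with Never each failed attempt leaves its own Pod behind and the Job controller creates a new one.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: hello-job-onfailure      # hypothetical name
  namespace: demo
spec:
  template:
    spec:
      restartPolicy: OnFailure   # failed containers restart inside the same Pod
      containers:
        - name: worker
          image: busybox:1.36
          command: ["sh", "-c", "echo starting && sleep 5 && echo done"]
```

Never tends to be easier to debug because every attempt keeps its own Pod and logs. OnFailure produces fewer Pod objects.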

The Job controller creates a Pod, watches it finish, and records success or failure on the Job object.

Check status:

kubectl describe job hello-job -n demo

Look for Succeeded in the status. A completed Job keeps its Pod around by default so you can read logs. That is useful for debugging. It also means finished Jobs accumulate unless you clean them up or set a TTL.
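
A TTL is set with ttlSecondsAfterFinished on the Job spec. The sketch below reuses the hello-job container; the name hello-job-ttl and the one-hour value are arbitrary choices for illustration.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: hello-job-ttl              # hypothetical name
  namespace: demo
spec:
  # Delete the Job and its Pods one hour after it finishes,
  # whether it succeeded or failed.
  ttlSecondsAfterFinished: 3600
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: busybox:1.36
          command: ["sh", "-c", "echo starting && sleep 5 && echo done"]
```

Pick a TTL long enough that someone can still read the logs after a failure. Once the Job is deleted, its Pods and their logs go with it.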

One-off tasks in practice

Real one-off Jobs usually wrap something your team already runs as a script or CLI:

database migrations before a release
data backfills after a schema change
file exports to object storage
test harnesses that must run exactly once in a cluster context

Example migration-style Job:

apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate-20260812
  namespace: demo
spec:
  backoffLimit: 3
  activeDeadlineSeconds: 600
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: my-registry.example/app-migrate:1.4.2
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: app-db
                  key: url
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              memory: 512Mi

backoffLimit controls how many times Kubernetes retries a failed Pod before marking the Job failed. More on that below.

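activeDeadlineSeconds is a safety net. If the Job runs longer than ten minutes, Kubernetes terminates its running Pods and marks the Job failed with reason DeadlineExceeded, regardless of how many retries remain. That prevents a stuck migration from running forever.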

Name Jobs so humans can tell what happened. db-migrate-20260812 is easier to audit than migrate.

Parallelism and completions

Jobs can run more than one Pod. Two fields control that:

completions: how many successful Pod runs the Job needs
parallelism: how many Pods may run at the same time

For a simple single-run script, you can omit both. Defaults are completions: 1 and parallelism: 1.

For work that can be split across shards:

apiVersion: batch/v1
kind: Job
metadata:
  name: shard-import
  namespace: demo
spec:
  completions: 5
  parallelism: 2
  completionMode: Indexed   # required for the completion index annotation used below
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: importer
          image: my-registry.example/importer:2.0.0
          env:
            - name: SHARD_INDEX
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']

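Each Pod gets an index annotation only when the Job sets completionMode: Indexed; the default mode, NonIndexed, does not assign one. With completions: 5, the indexes run from 0 to 4. In Indexed mode Kubernetes also sets a JOB_COMPLETION_INDEX environment variable in each container automatically. That pattern is useful for partitioned batch work.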

If you only need one Pod but want retries on failure, keep completions: 1 and tune backoffLimit.

Backoff and retries

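When a Job Pod fails (crash, non-zero exit, eviction), the Job controller may create another Pod. It does not retry forever. An image pull error is different: the Pod stays Pending in ImagePullBackOff rather than failing, so the Job waits until the image becomes available or a deadline expires.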

backoffLimit defaults to 6. That means up to six retries before the Job is marked Failed. Each retry waits with exponential backoff (10s, 20s, 40s, and so on), capped at six minutes. You will see increasing delays between attempts in events.

Inspect failures:

kubectl describe job db-migrate-20260812 -n demo
kubectl get pods -n demo -l job-name=db-migrate-20260812
kubectl logs -n demo <pod-name>
kubectl logs -n demo <pod-name> --previous

Job Pods have names like db-migrate-20260812-abc12, with a random suffix. Multiple Pods from retries are normal. Check the latest Pod first. --previous only helps when the container restarted inside the same Pod, which happens with restartPolicy: OnFailure; with Never, each attempt is a separate Pod with its own logs.

Set backoffLimit: 0 when you want no retries. That is appropriate when a second attempt could corrupt data, in other words when the script is not explicitly idempotent.
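
A minimal sketch of a run-at-most-once Job, assuming a hypothetical backfill image and Job name:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: ledger-backfill-20260812   # hypothetical name
  namespace: demo
spec:
  backoffLimit: 0                  # first failure marks the Job Failed; no retry
  activeDeadlineSeconds: 1800      # arbitrary 30-minute safety net
  template:
    spec:
      restartPolicy: Never         # do not restart the container in place either
      containers:
        - name: backfill
          image: my-registry.example/backfill:1.0.0   # hypothetical image
```

Pairing backoffLimit: 0 with restartPolicy: Never keeps the behavior easy to reason about: one Pod, one attempt, and a human decides what happens next.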

Common misconception: “The Job failed, so Kubernetes will keep trying tomorrow.” No. A failed Job stays failed until something creates a new Job object. For recurring work, use a CronJob or an external scheduler.

CronJobs: Jobs on a schedule

A CronJob creates Job objects on a cron schedule. The schedule string uses the standard five-field cron format:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-cleanup
  namespace: demo
spec:
  schedule: "0 3 * * *"
  timeZone: "Europe/Berlin"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      backoffLimit: 2
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: cleanup
              image: my-registry.example/cleanup:1.0.0
              command: ["sh", "-c", "find /tmp/work -type f -mtime +7 -delete"]

Important CronJob fields:

schedule uses minute, hour, day of month, month, day of week. "0 3 * * *" means 03:00 every day. Quote schedules in YAML so * and special characters do not confuse the parser.
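
A few more schedule strings, written as a plain YAML list for reference. This is not a Kubernetes manifest, and the key names are invented for illustration; only the quoted values would go into spec.schedule.

```yaml
# Illustrative list only, not a manifest.
# Field order: minute, hour, day of month, month, day of week.
schedule_examples:
  - value: "0 3 * * *"       # 03:00 every day
  - value: "*/15 * * * *"    # every 15 minutes
  - value: "0 8 * * 1-5"     # 08:00 Monday through Friday
  - value: "0 0 1 * *"       # midnight on the first day of each month
  - value: "30 2 * * 0"      # 02:30 every Sunday
```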

timeZone (stable since Kubernetes 1.27) makes the schedule follow a named time zone. Without it, the schedule is interpreted in the local time zone of the kube-controller-manager, which is UTC on many clusters but is not guaranteed to be.

concurrencyPolicy controls overlap:

Allow — multiple Jobs may run at once if a previous run is still going
Forbid — skip a new run if the last one has not finished
Replace — cancel the running Job and start a new one

For cleanup or sync tasks, Forbid is often the safe default. For idempotent metrics collectors, Allow may be fine.

successfulJobsHistoryLimit and failedJobsHistoryLimit keep the namespace tidy. The defaults are 3 successful Jobs and 1 failed Job. Raise the failed limit if you want more history for debugging.

Inspect CronJobs:

kubectl get cronjobs -n demo
kubectl describe cronjob nightly-cleanup -n demo
kubectl get jobs -n demo --sort-by=.metadata.creationTimestamp

To test without waiting for the clock, create a one-off Job from the CronJob template with kubectl create job <job-name> --from=cronjob/nightly-cleanup -n demo, or temporarily set a schedule a minute ahead in a dev namespace.
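
A variant of the second approach is a throwaway copy of the CronJob that fires every minute. The sketch below assumes a hypothetical demo-dev namespace and a made-up name, and it swaps -delete for -print so the test run only lists files.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-cleanup-test       # hypothetical copy used only for testing
  namespace: demo-dev              # hypothetical dev namespace
spec:
  schedule: "* * * * *"            # every minute; for testing only
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      backoffLimit: 0
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: cleanup
              image: my-registry.example/cleanup:1.0.0
              # Assumes /tmp/work exists in the container, as in the original CronJob.
              # Lists matching files instead of deleting them.
              command: ["sh", "-c", "find /tmp/work -type f -mtime +7 -print"]
```

Delete the test CronJob when you are done so it does not keep creating Jobs.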

When not to use Jobs

Jobs are simple, which makes them easy to misuse.

Long-running services. If the process should stay up and accept traffic, use a Deployment or StatefulSet. Do not wrap a web server in a Job and hope.

Work that needs continuous reconciliation. A queue consumer that runs all day is a Deployment, not a Job.

Complex multi-step workflows. CronJobs fire on a schedule. They do not replace Argo Workflows, Tekton, or similar orchestrators when steps branch or depend on each other.

Non-idempotent scripts with retries. If a failed attempt leaves bad state, use backoffLimit: 0 or fix the script before retrying.

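High-frequency schedules. Cron syntax has one-minute granularity, so a CronJob cannot run every few seconds, and even an every-minute schedule creates a steady churn of Job and Pod objects. A long-running Pod with an internal ticker is usually cleaner.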

If you are choosing between CronJob and something else, ask: “Is this a short task that should start at a time and finish?” If yes, CronJob fits. If it should run forever with periodic internal timing, use a Deployment instead.

Debugging failed Jobs

When a Job fails, I work top down.

Step 1: Read the Job object.

kubectl get job <job-name> -n <namespace>
kubectl describe job <job-name> -n <namespace>

Check Status, Conditions, Failed, Succeeded, and events at the bottom. Messages about BackoffLimitExceeded mean retries were exhausted. DeadlineExceeded means the Job hit activeDeadlineSeconds.

Step 2: Find the Pod(s).

kubectl get pods -n <namespace> -l job-name=<job-name>
kubectl describe pod <pod-name> -n <namespace>

Look for Init: errors, image pull failures, mount problems, OOMKilled, and probe issues. Jobs can still use probes, but many batch containers exit too quickly for them to matter.

Step 3: Read logs.

kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous

Application stack traces usually live here, not in the Job status field.

Step 4: Check context — wrong controller?

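If logs show the script finished its work but the Job keeps creating Pods, check the exit code (a non-zero exit counts as a failure even if the work was done) and the completions value. If the Pod never finishes at all, the image's entrypoint may be a long-running server deployed as a Job by mistake. If the script exits immediately with code 0 but you expected a server, you may have used a Job where a Deployment was needed.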

Step 5: For CronJobs, check the last triggered Job.

kubectl get jobs -n <namespace> --sort-by=.metadata.creationTimestamp
kubectl describe cronjob <cronjob-name> -n <namespace>

Look for Last Schedule Time, suspend: true, and concurrency policy skips.

For cleanup, set ttlSecondsAfterFinished on Jobs and history limits on CronJobs so finished work does not clutter the namespace.
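
As a sketch, both cleanup mechanisms can sit on the same CronJob: history limits on the CronJob spec and a TTL inside jobTemplate. The one-day TTL below is an arbitrary value chosen for illustration.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-cleanup
  namespace: demo
spec:
  schedule: "0 3 * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3      # keep at most three successful Jobs
  failedJobsHistoryLimit: 5          # keep at most five failed Jobs
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 86400 # also delete each Job one day after it finishes
      backoffLimit: 2
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: cleanup
              image: my-registry.example/cleanup:1.0.0
              command: ["sh", "-c", "find /tmp/work -type f -mtime +7 -delete"]
```

Whichever rule triggers first removes the Job. With a nightly schedule and a one-day TTL, the TTL will usually win, so choose the values with your debugging window in mind.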

Final thought

Jobs and CronJobs are the batch layer of Kubernetes. Deployments keep services alive. Jobs run work to completion. CronJobs wrap Jobs in time.

The beginner mistake is not misunderstanding YAML. It is choosing the wrong lifecycle. If the container should exit successfully and stay exited, you want a Job. If it should run again next Tuesday at 03:00, you want a CronJob. If it should still be running when you check tomorrow, you want something else.

Learn to read Job status, respect backoff limits, and debug through Pods and logs. Once that habit is in place, batch work in Kubernetes feels predictable instead of surprising.