Docker Compose Resource Limits, Healthchecks, and Restart Policies

12 min read · Matthieu · VPS · Production · Containers · Docker Compose · Docker

Your Compose file works in dev but is not production-ready. Learn how to add memory/CPU limits, healthchecks, restart policies, and startup ordering to protect your VPS from OOM kills and cascading failures.

A Compose file that runs docker compose up without errors is not production-ready. Without resource limits, one container can consume all host memory and trigger the Linux OOM killer, taking down every service on the VPS. Without healthchecks, Docker has no way to detect a hung process. Without restart policies, a crashed container stays dead until you notice.

These three systems work together: resource limits prevent runaway consumption, healthchecks detect failures, and restart policies recover from them. This guide covers all three as an integrated production hardening layer for Docker Compose v2.

Prerequisites: A working Docker Compose setup on a VPS. Familiarity with basic Compose file structure. If you need a refresher, see Docker Compose for Multi-Service VPS Deployments.

How do you set memory and CPU limits in Docker Compose?

Use the deploy.resources.limits key in your service definition. Set memory to a value like 512M or 1G to create a hard ceiling. Set cpus to a decimal string like '0.5' for half a core. The container process gets OOM-killed if it exceeds the memory limit. CPU limits throttle the process instead of killing it.

services:
  api:
    image: node:22-slim
    deploy:
      resources:
        limits:
          cpus: '1.0'
          memory: 512M
        reservations:
          cpus: '0.25'
          memory: 256M

What this does: The api container can use at most 1 CPU core and 512 MB of RAM. The reservations are a soft floor: Docker keeps at least 0.25 cores and 256 MB available to the container when others compete for resources.

Verify the limits are applied:

docker compose up -d
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.CPUPerc}}"

The MEM USAGE / LIMIT column shows both current usage and the configured ceiling. You should see 512MiB as the limit, not the full host RAM. If you see the host's total memory instead, the limits are not active.

What is the difference between limits and reservations?

Limits are hard ceilings. The container cannot exceed them. Reservations are soft guarantees. Docker uses them for scheduling decisions and memory pressure handling.

| Setting | What it does | When it matters |
| --- | --- | --- |
| limits.memory | Hard cap. OOM kill if exceeded. | Always. Prevents runaway containers. |
| limits.cpus | Throttle. Process slows down. | High-CPU workloads (builds, inference). |
| reservations.memory | Guaranteed minimum. | Host under memory pressure. |
| reservations.cpus | Guaranteed CPU share. | Multiple CPU-heavy containers. |

If you skip reservations, Docker allocates resources first-come-first-served. Under pressure, any container might get starved. Set reservations to the minimum your service needs to function.
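As a sketch of how the two interact, give a service you cannot afford to starve a high reservation, and give an expendable service a tight limit with a minimal reservation. The service names and numbers below are illustrative, not prescriptive:

```yaml
services:
  db:                        # must stay responsive under pressure
    deploy:
      resources:
        limits:
          memory: 2G
        reservations:
          memory: 1G         # soft floor honored when the host is contended

  batch-worker:              # expendable; throttle hard
    deploy:
      resources:
        limits:
          cpus: '0.5'
          memory: 256M
        reservations:
          memory: 64M        # just enough to start
```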

What happens when a Docker container exceeds its memory limit?

The Linux kernel OOM killer terminates the container's main process with SIGKILL. Docker records exit code 137 (128 + 9, where 9 is SIGKILL). If the restart policy is always, unless-stopped, or on-failure, Docker restarts the container automatically.

Check if a container was OOM-killed:

docker inspect api-1 --format '{{.State.OOMKilled}}'

Output: true confirms an OOM kill.

For more detail:

docker inspect api-1 --format '{{json .State}}' | python3 -m json.tool

Look for "OOMKilled": true, "ExitCode": 137, and "RestartCount" to see how many times it restarted.

Without limits, the container allocates until the host runs out of memory. The kernel then kills processes system-wide to free RAM. This can take down your database, reverse proxy, or SSH daemon. Setting limits confines the blast radius to the offending container.

How do you plan container resource budgets on a VPS?

On a VPS with fixed resources, you must budget memory across all containers. Leave headroom for the host OS and Docker itself.

Example budget for an 8 GB VPS:

| Component | Memory |
| --- | --- |
| Host OS + Docker engine | 1 GB |
| PostgreSQL | 2 GB |
| Redis | 512 MB |
| Node.js API | 1 GB |
| Nginx | 128 MB |
| Background worker | 1 GB |
| Headroom (unallocated) | 2.375 GB |

Keep 20-30% unallocated as headroom. If your container limits add up to more than total host RAM, you risk the host OOM-killer stepping in, which ignores Docker's container boundaries.

Verify total allocation against host memory:

free -h
docker stats --no-stream --format "{{.Name}}: {{.MemUsage}}"
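The budget can also be sanity-checked with plain shell arithmetic before you touch the Compose file. The numbers below mirror the example table; substitute your own limits:

```shell
# Sum the configured memory limits (MB) and compare against host RAM.
db_mb=2048; redis_mb=512; api_mb=1024; nginx_mb=128; worker_mb=1024
host_mb=8192
os_docker_mb=1024                    # host OS + Docker engine
allocated_mb=$((db_mb + redis_mb + api_mb + nginx_mb + worker_mb))
headroom_mb=$((host_mb - os_docker_mb - allocated_mb))
echo "allocated: ${allocated_mb} MB, headroom: ${headroom_mb} MB"
```

A negative headroom means your limits oversubscribe the host and the host-level OOM killer can step in.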

How do you configure a healthcheck in Docker Compose?

Add a healthcheck block to your service definition. Docker runs the test command at the specified interval and marks the container as unhealthy after the configured number of consecutive failures. Other services can depend on this health status for startup ordering.

services:
  api:
    image: node:22-slim
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s
      start_interval: 2s

Healthcheck parameters explained:

| Parameter | Default | Recommended | What it controls |
| --- | --- | --- | --- |
| interval | 30s | 15-60s | Time between checks after startup |
| timeout | 30s | 5-10s | Max time a check can run |
| retries | 3 | 3-5 | Failures before unhealthy |
| start_period | 0s | 10-60s | Grace period for slow-starting services |
| start_interval | 5s | 2-5s | Interval during startup (Compose v2.20+) |

The start_interval parameter (added in Compose v2.20) lets you check more frequently during startup. The container transitions from starting to healthy as soon as the first check passes during start_period. After that, checks run at the normal interval.

What is the difference between CMD and CMD-SHELL in healthchecks?

CMD runs the command directly without a shell. CMD-SHELL runs it through /bin/sh -c. Use CMD when possible. It avoids shell overhead and eliminates PID 1 issues where the shell, not your check command, receives signals.

# CMD format - no shell, runs the binary directly
healthcheck:
  test: ["CMD", "pg_isready", "-U", "postgres"]

# CMD-SHELL format - runs through /bin/sh -c
healthcheck:
  test: ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"]

# String shorthand - equivalent to CMD-SHELL
healthcheck:
  test: curl -f http://localhost:3000/health || exit 1

Use CMD-SHELL when you need shell features like ||, pipes, or variable expansion. Use CMD for simple binary execution.

Which healthcheck command should you use for PostgreSQL, Redis, and Nginx?

Each service needs a healthcheck that actually tests the service's ability to handle requests, not just whether the process is running.

| Service | Healthcheck command | What it tests |
| --- | --- | --- |
| PostgreSQL | `["CMD", "pg_isready", "-U", "postgres"]` | Accepts connections |
| Redis | `["CMD", "redis-cli", "ping"]` | Responds to commands |
| Nginx | `["CMD-SHELL", "curl -f http://localhost/ \|\| exit 1"]` | Serves HTTP responses |
| Node.js app | `["CMD-SHELL", "curl -f http://localhost:3000/health \|\| exit 1"]` | App-level health endpoint |
| MySQL/MariaDB | `["CMD", "healthcheck.sh", "--connect", "--innodb_initialized"]` | Engine ready, not just socket open |

Important: For curl-based healthchecks, the image must include curl. Many slim images do not. Install it in your Dockerfile, use wget instead, or write a minimal health endpoint that your app framework serves natively.
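If your image ships Node but not curl, one workaround is to run the check through Node's built-in fetch (global in Node 18+). This is a sketch that assumes the app listens on port 3000 and serves /health:

```yaml
services:
  api:
    image: node:22-slim
    healthcheck:
      # Node 18+ ships a global fetch, so no curl or wget needed in the image.
      test: ["CMD", "node", "-e", "fetch('http://localhost:3000/health').then(r => process.exit(r.ok ? 0 : 1)).catch(() => process.exit(1))"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s
```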

Verify healthcheck status:

docker compose ps

Look for (healthy) or (unhealthy) in the STATUS column. For detailed healthcheck history:

docker inspect api-1 --format '{{json .State.Health}}' | python3 -m json.tool

This shows the last few check results, including stdout/stderr output from failed checks. If FailingStreak keeps incrementing, either your check command is wrong or the service is genuinely broken.

How do restart policies interact with healthchecks?

Restart policies control what Docker does when a container exits. Healthchecks control how Docker detects problems in running containers. Together, they create an automated recovery loop: the healthcheck detects failure, the container gets stopped, and the restart policy brings it back.

services:
  api:
    restart: on-failure:5
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 10s

Restart policies compared

| Policy | On crash | On reboot | On docker stop | Best for |
| --- | --- | --- | --- | --- |
| no | Stays dead | Stays dead | Stays dead | One-off tasks, debugging |
| always | Restarts | Restarts | Stays dead until the daemon restarts | Core infrastructure (databases, proxies) |
| unless-stopped | Restarts | Restarts | Stays dead | Most production services |
| on-failure:N | Restarts (up to N times) | Stays dead | Stays dead | Services that should not restart forever |

on-failure:5 means Docker retries up to 5 times. After that, the container stays dead. This prevents restart loops that waste CPU on a fundamentally broken service.

The OOM + restart interaction: When a container hits its memory limit and gets OOM-killed (exit code 137), Docker treats this as a failure. With on-failure:5, it restarts up to 5 times. If the service leaks memory, it will OOM-kill and restart repeatedly until the retry limit is reached. Check the restart count:

docker inspect api-1 --format '{{.RestartCount}}'

For most production services, use unless-stopped. It restarts on crashes and host reboots, but respects manual docker compose stop commands. Use on-failure:N for services where a crash loop should raise an alert, not silently retry forever.
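Applying that advice across a stack, a minimal sketch mapping each policy to a service role (names are illustrative):

```yaml
services:
  db:
    restart: unless-stopped    # recover from crashes and host reboots
  api:
    restart: unless-stopped
  worker:
    restart: on-failure:5      # a crash loop should surface, not spin forever
  migrate:
    restart: "no"              # one-off task; quote "no" so YAML keeps it a string
```

The quoting on `"no"` matters: unquoted `no` is parsed as a boolean by YAML 1.1 parsers.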

How do you make a service wait for another service to be healthy?

Use depends_on with condition: service_healthy. This makes Docker Compose wait for the dependency's healthcheck to pass before starting the dependent service.

services:
  db:
    image: postgres:17
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "postgres"]
      interval: 5s
      timeout: 3s
      retries: 5
      start_period: 10s
    restart: unless-stopped

  api:
    image: myapp:latest
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_healthy
    restart: unless-stopped

  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 3
    restart: unless-stopped

Without condition: service_healthy, depends_on only waits for the container to start, not for the service inside it to be ready. PostgreSQL takes several seconds to initialize. Your app would crash trying to connect to a database that is not accepting connections yet.

The restart: true option inside depends_on (Compose v2.21+) tells Docker to restart the dependent service if the dependency restarts:

depends_on:
  db:
    condition: service_healthy
    restart: true

This is useful when your app caches the database connection and cannot recover from a database restart without a full restart itself.
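Related to startup ordering: recent Compose v2 releases also support the --wait flag on docker compose up, which blocks until every service with a healthcheck reports healthy. This is useful in deploy scripts; the --wait-timeout flag may not exist in older releases:

```shell
# Start the stack and block until all healthchecks pass.
# Exits nonzero if any service fails to become healthy in time.
docker compose up -d --wait --wait-timeout 120
```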

What ulimits should you set for production containers?

Set nofile (open file descriptors) and nproc (max processes) for services that handle many concurrent connections. Each TCP connection, open file, and pipe consumes a file descriptor. The default limit (1024 in many images) is too low for databases and high-traffic services.

services:
  db:
    image: postgres:17
    ulimits:
      nofile:
        soft: 65536
        hard: 65536
      nproc:
        soft: 4096
        hard: 4096

Verify inside the container:

docker compose exec db cat /proc/1/limits

Look for Max open files and Max processes. The values should match what you set.

Fork bomb prevention with pids limit

Set deploy.resources.limits.pids to cap the number of processes a container can create. This prevents fork bombs and runaway process spawning from consuming all PIDs on the host.

services:
  api:
    image: myapp:latest
    deploy:
      resources:
        limits:
          pids: 200

If you are not using deploy.resources for CPU/memory limits, the top-level pids_limit key also works. But do not set both to different values: recent Compose releases treat conflicting pids_limit and deploy.resources.limits.pids values as a validation error.

200 PIDs is generous for a typical web application. A Node.js app uses around 10-30. PostgreSQL uses roughly one process per connection plus background workers. Size it to 2-3x your expected peak.
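To check whether 200 is the right order of magnitude for your workload, you can read the container's own cgroup counters. The paths below assume cgroup v2, the default on current distributions:

```shell
# Current vs. maximum PIDs, as seen from inside the container.
docker compose exec api cat /sys/fs/cgroup/pids.current
docker compose exec api cat /sys/fs/cgroup/pids.max
```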

How do you limit container log size?

Without log limits, container logs grow unbounded. A verbose service can fill your disk in hours. Set max-size and max-file on the json-file logging driver to rotate logs automatically.

services:
  api:
    image: myapp:latest
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"

This keeps at most 3 files of 10 MB each, capping log storage at 30 MB per service. Adjust based on how much you need for debugging. For detailed log management across your full stack, see Docker Log Rotation: Stop Logs from Filling Your VPS Disk.
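The arithmetic generalizes: worst-case log storage is services × max-file × max-size. A quick sketch for a four-service stack with the settings above:

```shell
# Worst-case disk used by container logs: services * files * size.
services=4; max_file=3; max_size_mb=10
cap_mb=$((services * max_file * max_size_mb))
echo "log cap: ${cap_mb} MB"
```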

Use a YAML anchor to apply the same logging config across all services:

x-logging: &default-logging
  driver: "json-file"
  options:
    max-size: "10m"
    max-file: "3"

services:
  api:
    image: myapp:latest
    logging: *default-logging
  worker:
    image: myapp:latest
    logging: *default-logging

Setting stop_grace_period for clean shutdown

When Docker stops a container, it sends SIGTERM and waits for the process to exit gracefully. If the process does not exit within the grace period, Docker sends SIGKILL. The default is 10 seconds.

services:
  db:
    image: postgres:17
    stop_grace_period: 30s

  api:
    image: myapp:latest
    stop_grace_period: 5s

Databases need longer grace periods to flush writes and close connections cleanly. Web servers and API processes typically exit within a few seconds. Set the grace period to match your service's actual shutdown time, with a small buffer.
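To size the grace period from data rather than guesswork, time an actual stop. If the wall time lands right at the configured grace period, the process likely ignored SIGTERM and was SIGKILLed:

```shell
# Time a real shutdown. If this takes ~stop_grace_period,
# the service did not exit cleanly on SIGTERM.
time docker compose stop api
docker compose start api
```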

Full production-ready Compose file

This example combines all settings for a typical web application stack: Nginx reverse proxy, Node.js API, PostgreSQL database, and Redis cache.

x-logging: &default-logging
  driver: "json-file"
  options:
    max-size: "10m"
    max-file: "3"

services:
  nginx:
    image: nginx:1.27-alpine
    ports:
      - "80:80"
      - "443:443"
    deploy:
      resources:
        limits:
          cpus: '0.5'
          memory: 128M
        reservations:
          memory: 64M
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost/ || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
    restart: unless-stopped
    stop_grace_period: 5s
    logging: *default-logging
    depends_on:
      api:
        condition: service_healthy

  api:
    image: myapp:latest
    deploy:
      resources:
        limits:
          cpus: '1.0'
          memory: 512M
          pids: 200
        reservations:
          cpus: '0.25'
          memory: 256M
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 15s
      start_interval: 2s
    restart: unless-stopped
    stop_grace_period: 5s
    logging: *default-logging
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_healthy

  db:
    image: postgres:17
    environment:
      POSTGRES_PASSWORD_FILE: /run/secrets/db_password
    secrets:
      - db_password
    deploy:
      resources:
        limits:
          cpus: '1.0'
          memory: 2G
        reservations:
          cpus: '0.5'
          memory: 1G
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "postgres"]
      interval: 10s
      timeout: 3s
      retries: 5
      start_period: 15s
    restart: unless-stopped
    stop_grace_period: 30s
    ulimits:
      nofile:
        soft: 65536
        hard: 65536
    logging: *default-logging
    volumes:
      - pgdata:/var/lib/postgresql/data

  redis:
    image: redis:7-alpine
    deploy:
      resources:
        limits:
          cpus: '0.5'
          memory: 512M
        reservations:
          memory: 128M
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 3s
      retries: 3
    restart: unless-stopped
    stop_grace_period: 5s
    logging: *default-logging

volumes:
  pgdata:

secrets:
  db_password:
    file: ./secrets/db_password.txt

Notice: The database password uses Docker secrets with POSTGRES_PASSWORD_FILE instead of a plaintext POSTGRES_PASSWORD environment variable. Create the secrets file with restricted permissions:

mkdir -p secrets
openssl rand -base64 32 > secrets/db_password.txt
chmod 600 secrets/db_password.txt

Verification checklist

After applying all settings, run through these checks to confirm everything is active.

1. Check resource limits are applied:

docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.CPUPerc}}"

The MEM USAGE / LIMIT column shows both current usage and the configured ceiling. Every container should show its configured memory limit, not the host's total RAM.

2. Check healthcheck status:

docker compose ps

All services should show (healthy) in the STATUS column. If any show (health: starting), wait for the start_period to complete.

3. Check restart policy:

docker inspect --format '{{.HostConfig.RestartPolicy.Name}}:{{.HostConfig.RestartPolicy.MaximumRetryCount}}' $(docker compose ps -q)

4. Check ulimits inside a container:

docker compose exec db cat /proc/1/limits | grep -E "open files|processes"

5. Check log configuration:

docker inspect --format '{{.HostConfig.LogConfig.Type}} max-size={{index .HostConfig.LogConfig.Config "max-size"}} max-file={{index .HostConfig.LogConfig.Config "max-file"}}' $(docker compose ps -q api)

6. Test the full recovery chain:

Stop a container and watch it recover:

docker compose stop api
docker compose ps  # api should show Exited
docker compose start api
docker compose ps  # api should show (healthy) after start_period
docker inspect $(docker compose ps -q api) --format 'RestartCount: {{.RestartCount}}'

To test automatic restart on a real crash, trigger an OOM kill by lowering the memory limit below the app's baseline. The restart policy kicks in when the process exits on its own. Note that docker kill does not trigger restart policies in recent Docker versions.

Resource sizing quick reference

Starting points for common services on a VPS. Adjust based on your actual load.

| Service | Memory limit | CPU limit | Healthcheck |
| --- | --- | --- | --- |
| PostgreSQL | 1-4 GB | 1.0-2.0 | `pg_isready -U postgres` |
| Redis | 256M-1G | 0.25-0.5 | `redis-cli ping` |
| Node.js API | 256M-1G | 0.5-1.0 | `curl -f http://localhost:PORT/health` |
| Nginx | 64M-256M | 0.25-0.5 | `curl -f http://localhost/` |
| Ollama (LLM) | 4-8 GB | 2.0-4.0 | `curl -f http://localhost:11434/` |
| Background worker | 256M-1G | 0.5-1.0 | App-specific check |

Something went wrong?

Container keeps restarting (restart loop):

docker compose logs api --tail 50
docker inspect api-1 --format '{{.State.ExitCode}} OOM:{{.State.OOMKilled}} Restarts:{{.RestartCount}}'

If OOMKilled: true, increase the memory limit. If the exit code is not 137, check the application logs for the actual error.

Healthcheck always failing:

docker inspect api-1 --format '{{json .State.Health.Log}}' | python3 -m json.tool

This shows the output of each check. Common causes: the health endpoint does not exist, curl is not installed in the image, or the service listens on a different port than the check targets.

depends_on not waiting:

Make sure the dependency has a healthcheck defined. Without it, condition: service_healthy has nothing to wait for and Compose will error on startup.

Limits not showing in docker stats:

Verify you are using Docker Compose v2 (the docker compose plugin, not the old docker-compose binary). Check your version:

docker compose version

The deploy.resources key requires Compose v2. If you are on an older version, see Docker Commands Cheatsheet for installation instructions.

Reading logs when something fails:

journalctl -u docker -f
docker compose logs -f --tail 100

The Docker daemon log shows OOM events and container lifecycle changes. The Compose logs show application-level output.

For the parent guide covering the full Docker production setup, see Docker in Production on a VPS: What Breaks and How to Fix It.