Build a Self-Healing VPS with Prometheus and Ollama

Wire Prometheus alerts to a local LLM that diagnoses failures and executes safe remediation actions on your VPS. Full working code with allowlists, dry-run mode, and human approval controls.

Your server fails at 3 AM. You get paged, SSH in half-asleep, restart the crashed service, and go back to bed. This happens enough times that you start wondering: could the server just fix itself?

It can. This tutorial builds a self-healing feedback loop on a single VPS. Prometheus and node_exporter collect metrics. Alertmanager fires when thresholds break. A Python webhook receiver catches those alerts and feeds context to Ollama. The LLM diagnoses the issue and recommends a remediation action. If the action is on the allowlist, it executes automatically. If it is destructive, a human approves it first via Discord.

The full flow looks like this:

node_exporter -> Prometheus -> Alertmanager -> webhook receiver -> Ollama -> remediation engine -> action
                                                                                    |
                                                                              audit log + Discord

Every action is logged. Every LLM recommendation is treated as untrusted input. This is the part most AIOps content skips, and the part that matters most.

What is a self-healing server and why build one on a VPS?

A self-healing server detects failures using metrics and alert rules, then triggers remediation without human intervention. Combined with a local LLM like Ollama, the system diagnoses root causes from alert context and executes allowlisted actions: restarting a crashed service, clearing disk space, or killing a runaway process.

Enterprise teams use PagerDuty, Rundeck, or StackStorm for this. Those tools assume a team, a fleet of servers, and a budget. On a single VPS, you need something lighter. The "sentinel pattern" described here is a standalone agent that watches your server and fixes common problems automatically, with safety controls that prevent the LLM from doing anything dangerous.

Prerequisites

Everything below assumes:

  1. A Linux VPS with at least 4 vCPU and 8 GB RAM (the model benchmarks later use this spec)
  2. Docker Engine with the Compose plugin
  3. Ollama installed and running as a systemd service
  4. Python 3.10 or newer on the host (the receiver uses modern union type syntax)

How do you set up Prometheus, node_exporter, and Alertmanager with Docker Compose?

Deploy the monitoring stack as three containers: Prometheus v3.10.0 scrapes metrics, node_exporter v1.10.2 exposes host metrics, and Alertmanager v0.31.1 routes alerts to your webhook receiver. The entire stack starts with a single docker compose up -d.

Create the project directory:

mkdir -p ~/sentinel/{prometheus,alertmanager}
cd ~/sentinel

Docker Compose stack

Create docker-compose.yml:

services:
  prometheus:
    image: prom/prometheus:v3.10.0
    container_name: prometheus
    restart: unless-stopped
    user: "65534:65534"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/alert_rules.yml:/etc/prometheus/alert_rules.yml:ro
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=7d'
      - '--web.enable-lifecycle'
    ports:
      - "127.0.0.1:9090:9090"
    networks:
      - sentinel

  node-exporter:
    image: prom/node-exporter:v1.10.2
    container_name: node-exporter
    restart: unless-stopped
    pid: host
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    networks:
      - sentinel

  alertmanager:
    image: prom/alertmanager:v0.31.1
    container_name: alertmanager
    restart: unless-stopped
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    ports:
      - "127.0.0.1:9093:9093"
    extra_hosts:
      - "host.docker.internal:host-gateway"
    networks:
      - sentinel

volumes:
  prometheus_data:

networks:
  sentinel:
    driver: bridge

Prometheus and Alertmanager bind to 127.0.0.1 only. Exposing monitoring dashboards to the internet is a common misconfiguration that leaks internal metrics to anyone who scans your IP.
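To double-check that nothing in the stack is listening on a public interface, you can scan ss output for wildcard binds. A minimal sketch (the public_listeners helper is ours, not part of the stack):

```python
import subprocess

def public_listeners(ss_output: str) -> list[str]:
    """Return listening sockets bound to all interfaces (0.0.0.0, [::], or *)."""
    flagged = []
    for line in ss_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 4:
            local = fields[3]  # the Local Address:Port column
            if local.startswith(("0.0.0.0:", "[::]:", "*:")):
                flagged.append(local)
    return flagged

if __name__ == "__main__":
    try:
        out = subprocess.run(["ss", "-tln"], capture_output=True, text=True).stdout
        for sock in public_listeners(out):
            print(f"WARNING: publicly bound listener: {sock}")
    except OSError as e:
        print(f"could not run ss: {e}")
```

Anything it flags besides port 5001 (handled by the firewall rule later) deserves a second look.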

Prometheus configuration

Create prometheus/prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]

  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

Alertmanager configuration

Create alertmanager/alertmanager.yml:

global:
  resolve_timeout: 5m

route:
  receiver: sentinel-webhook
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: sentinel-webhook
    webhook_configs:
      - url: 'http://host.docker.internal:5001/alert'
        send_resolved: true
        max_alerts: 10

The webhook URL points to the Python receiver running on the host. host.docker.internal resolves to the host's IP from inside Docker containers. The extra_hosts directive in the Compose file above maps it to the host gateway on Linux.

Set file ownership and permissions before starting the stack:

chmod 644 prometheus/prometheus.yml prometheus/alert_rules.yml alertmanager/alertmanager.yml

What alert rules should you configure for common VPS failures?

Four alert rules cover the most common VPS failures: disk filling up, memory exhaustion, a service going down, and sustained high CPU. Each rule fires when a threshold is breached for a defined duration, giving Prometheus time to filter transient spikes.

Create prometheus/alert_rules.yml:

groups:
  - name: vps_health
    rules:
      - alert: DiskSpaceLow
        expr: (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage above 85%"
          description: "Root filesystem is {{ $value | printf \"%.1f\" }}% full."
          remediation_hint: "prune_docker_images"

      - alert: MemoryHigh
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Memory usage above 90%"
          description: "Available memory is {{ $value | printf \"%.1f\" }}% used."
          remediation_hint: "kill_top_memory_process"

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Target {{ $labels.job }} is down"
          description: "Scrape target {{ $labels.instance }} has been unreachable for over 1 minute."
          remediation_hint: "restart_service"

      - alert: HighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 85% for 10 minutes"
          description: "Average CPU usage is {{ $value | printf \"%.1f\" }}%."
          remediation_hint: "identify_cpu_hog"

Alert         PromQL threshold           Duration  Severity  Default remediation
DiskSpaceLow  Root FS > 85% full         5 min     warning   Prune Docker images
MemoryHigh    Available RAM < 10%        5 min     critical  Kill top memory process
ServiceDown   Scrape target unreachable  1 min     critical  Restart service
HighCPU       Avg CPU > 85%              10 min    warning   Identify CPU hog

The remediation_hint annotation is a custom field. It tells the remediation engine which action to suggest if the LLM diagnosis is ambiguous.
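You can list currently firing alerts and their remediation_hint annotations through the Prometheus HTTP API. A stdlib-only sketch, assuming Prometheus on 127.0.0.1:9090 (the extract_hints helper is ours):

```python
import json
import urllib.request

def extract_hints(alerts_payload: dict) -> list[tuple[str, str]]:
    """Map each firing alert to its remediation_hint annotation (or 'none')."""
    pairs = []
    for alert in alerts_payload.get("data", {}).get("alerts", []):
        if alert.get("state") == "firing":
            name = alert.get("labels", {}).get("alertname", "unknown")
            hint = alert.get("annotations", {}).get("remediation_hint", "none")
            pairs.append((name, hint))
    return pairs

if __name__ == "__main__":
    try:
        with urllib.request.urlopen("http://127.0.0.1:9090/api/v1/alerts") as resp:
            payload = json.load(resp)
        for name, hint in extract_hints(payload):
            print(f"{name}: {hint}")
    except OSError as e:
        print(f"Prometheus not reachable: {e}")
```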

Start the stack:

cd ~/sentinel
docker compose up -d
[+] Running 4/4
 ✔ Network sentinel_sentinel  Created
 ✔ Container node-exporter     Started
 ✔ Container prometheus        Started
 ✔ Container alertmanager      Started

Check that Prometheus is scraping targets:

curl -s http://127.0.0.1:9090/api/v1/targets | python3 -m json.tool | grep -A2 '"health"'
"health": "up",
"lastScrape": "2026-03-19T14:30:15.123Z",

Both targets should show "health": "up".

How does the webhook receiver catch Alertmanager alerts?

The webhook receiver is a Flask app that listens for POST requests from Alertmanager, extracts alert context, queries Ollama for a diagnosis, and passes the result to the remediation engine. It runs on the host, not in a container, because it needs access to Docker and systemd to execute remediation actions.

Install dependencies:

sudo apt install -y python3-pip python3-venv jq
mkdir -p ~/sentinel/receiver
cd ~/sentinel/receiver
python3 -m venv venv
source venv/bin/activate
pip install flask requests pydantic

Create ~/sentinel/receiver/sentinel.py:

#!/usr/bin/env python3
"""Sentinel: self-healing webhook receiver for Alertmanager."""

import json
import logging
import os
import subprocess
import time
from datetime import datetime, timezone
from pathlib import Path

import requests
from flask import Flask, request, jsonify
from pydantic import BaseModel, field_validator

app = Flask(__name__)

# --- Configuration ---
OLLAMA_URL = os.environ.get("OLLAMA_URL", "http://127.0.0.1:11434")
OLLAMA_MODEL = os.environ.get("OLLAMA_MODEL", "qwen2.5:7b")
DISCORD_WEBHOOK = os.environ.get("DISCORD_WEBHOOK", "")
DRY_RUN = os.environ.get("SENTINEL_DRY_RUN", "false").lower() == "true"
AUDIT_LOG = Path(os.environ.get("SENTINEL_AUDIT_LOG", "/var/log/sentinel/audit.jsonl"))

AUDIT_LOG.parent.mkdir(parents=True, exist_ok=True)

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s"
)
logger = logging.getLogger("sentinel")


# --- Safety: Action Allowlist ---
ALLOWED_ACTIONS = {
    "restart_service": {
        "command": "docker restart {service}",
        "requires_approval": False,
        "allowed_services": ["nginx", "docker", "prometheus", "node-exporter", "alertmanager"],
    },
    "prune_docker_images": {
        "command": "docker image prune -af --filter 'until=72h'",
        "requires_approval": False,
    },
    "kill_top_memory_process": {
        "command": "ps aux --sort=-%mem | awk 'NR==2{print $2}' | xargs kill -15",
        "requires_approval": True,
    },
    "identify_cpu_hog": {
        "command": "ps aux --sort=-%cpu | head -5",
        "requires_approval": False,
        "read_only": True,
    },
    "clear_journal_logs": {
        "command": "journalctl --vacuum-size=200M",
        "requires_approval": False,
    },
    "add_swap": {
        "command": "fallocate -l 2G /swapfile && chmod 600 /swapfile && mkswap /swapfile && swapon /swapfile",
        "requires_approval": True,
    },
}


class Diagnosis(BaseModel):
    """Structured LLM diagnosis output."""
    severity: str
    root_cause: str
    recommended_action: str
    reasoning: str

    @field_validator("recommended_action")
    @classmethod
    def action_must_be_allowed(cls, v):
        if v not in ALLOWED_ACTIONS:
            raise ValueError(f"Action '{v}' is not in the allowlist")
        return v


# --- Ollama Integration ---
def query_ollama(alert_data: dict) -> Diagnosis | None:
    """Send alert context to Ollama and parse structured diagnosis."""
    prompt = f"""You are a server diagnostics agent. Analyze this alert and respond with a JSON diagnosis.

Alert: {alert_data.get('labels', {}).get('alertname', 'unknown')}
Status: {alert_data.get('status', 'unknown')}
Severity: {alert_data.get('labels', {}).get('severity', 'unknown')}
Summary: {alert_data.get('annotations', {}).get('summary', '')}
Description: {alert_data.get('annotations', {}).get('description', '')}
Remediation hint: {alert_data.get('annotations', {}).get('remediation_hint', 'none')}
Instance: {alert_data.get('labels', {}).get('instance', 'unknown')}
Started at: {alert_data.get('startsAt', 'unknown')}

Available actions: {', '.join(ALLOWED_ACTIONS.keys())}

Respond ONLY with a JSON object. Fields:
- severity: "low", "medium", "high", or "critical"
- root_cause: one-sentence explanation of the likely cause
- recommended_action: exactly one action from the available actions list
- reasoning: why you chose this action"""

    schema = Diagnosis.model_json_schema()

    try:
        start = time.monotonic()
        resp = requests.post(
            f"{OLLAMA_URL}/api/chat",
            json={
                "model": OLLAMA_MODEL,
                "messages": [{"role": "user", "content": prompt}],
                "format": schema,
                "stream": False,
                "options": {"temperature": 0},
            },
            timeout=120,
        )
        resp.raise_for_status()
        elapsed = time.monotonic() - start
        logger.info(f"Ollama responded in {elapsed:.1f}s")

        content = resp.json()["message"]["content"]
        diagnosis = Diagnosis.model_validate_json(content)
        return diagnosis

    except Exception as e:
        logger.error(f"Ollama query failed: {e}")
        return None


# --- Remediation Engine ---
def execute_action(action_name: str, context: dict) -> dict:
    """Execute an allowlisted remediation action."""
    action = ALLOWED_ACTIONS.get(action_name)
    if not action:
        return {"status": "blocked", "reason": f"Action '{action_name}' not in allowlist"}

    command = action["command"]

    # Template the service name if needed
    if "{service}" in command:
        service = context.get("service", "")
        if service not in action.get("allowed_services", []):
            return {"status": "blocked", "reason": f"Service '{service}' not in allowed_services"}
        command = command.format(service=service)

    if DRY_RUN:
        logger.info(f"DRY RUN: would execute: {command}")
        return {"status": "dry_run", "command": command}

    if action.get("requires_approval"):
        approved = request_discord_approval(action_name, command, context)
        if not approved:
            return {"status": "awaiting_approval", "command": command}

    try:
        result = subprocess.run(
            command,
            shell=True,
            capture_output=True,
            text=True,
            timeout=60,
        )
        return {
            "status": "executed",
            "command": command,
            "returncode": result.returncode,
            "stdout": result.stdout[:500],
            "stderr": result.stderr[:500],
        }
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "command": command}


# --- Discord Approval ---
def request_discord_approval(action_name: str, command: str, context: dict) -> bool:
    """Send a Discord message requesting human approval. Returns False (async approval)."""
    if not DISCORD_WEBHOOK:
        logger.warning("No Discord webhook configured. Blocking destructive action.")
        return False

    payload = {
        "embeds": [{
            "title": f"🔒 Approval Required: {action_name}",
            "description": (
                f"**Alert:** {context.get('alertname', 'unknown')}\n"
                f"**Command:** `{command}`\n"
                f"**Diagnosis:** {context.get('reasoning', 'N/A')}\n\n"
                "React with ✅ to approve or ❌ to deny."
            ),
            "color": 15158332,
        }]
    }

    try:
        requests.post(DISCORD_WEBHOOK, json=payload, timeout=10)
        logger.info(f"Discord approval requested for {action_name}")
    except Exception as e:
        logger.error(f"Discord notification failed: {e}")

    return False


# --- Audit Logging ---
def audit_log(entry: dict):
    """Append a structured JSON log entry."""
    entry["timestamp"] = datetime.now(timezone.utc).isoformat()
    with open(AUDIT_LOG, "a") as f:
        f.write(json.dumps(entry) + "\n")
    logger.info(f"Audit: {entry.get('event', 'unknown')} - {entry.get('action', 'none')}")


# --- Webhook Endpoint ---
@app.route("/alert", methods=["POST"])
def receive_alert():
    """Handle incoming Alertmanager webhook."""
    data = request.get_json(silent=True)
    if not data:
        return jsonify({"error": "invalid payload"}), 400

    alerts = data.get("alerts", [])
    results = []

    for alert in alerts:
        if alert.get("status") == "resolved":
            audit_log({"event": "alert_resolved", "alert": alert.get("labels", {})})
            continue

        alertname = alert.get("labels", {}).get("alertname", "unknown")
        logger.info(f"Processing alert: {alertname}")

        # Step 1: Diagnose with Ollama
        diagnosis = query_ollama(alert)

        if diagnosis is None:
            # Fallback to remediation_hint from alert annotations
            hint = alert.get("annotations", {}).get("remediation_hint")
            if hint and hint in ALLOWED_ACTIONS:
                logger.warning(f"Ollama unavailable. Falling back to hint: {hint}")
                diagnosis = Diagnosis(
                    severity="unknown",
                    root_cause="LLM unavailable, using alert hint",
                    recommended_action=hint,
                    reasoning="Fallback to annotation-defined remediation",
                )
            else:
                audit_log({
                    "event": "diagnosis_failed",
                    "alert": alertname,
                    "action": "none",
                })
                results.append({"alert": alertname, "status": "diagnosis_failed"})
                continue

        audit_log({
            "event": "diagnosis",
            "alert": alertname,
            "severity": diagnosis.severity,
            "root_cause": diagnosis.root_cause,
            "action": diagnosis.recommended_action,
            "reasoning": diagnosis.reasoning,
        })

        # Step 2: Execute remediation
        context = {
            "alertname": alertname,
            "service": alert.get("labels", {}).get("job", ""),
            "reasoning": diagnosis.reasoning,
        }
        result = execute_action(diagnosis.recommended_action, context)

        audit_log({
            "event": "remediation",
            "alert": alertname,
            "action": diagnosis.recommended_action,
            **result,
        })

        results.append({
            "alert": alertname,
            "diagnosis": diagnosis.model_dump(),
            "remediation": result,
        })

    return jsonify({"processed": len(results), "results": results})


@app.route("/health", methods=["GET"])
def health():
    """Health check endpoint."""
    return jsonify({"status": "ok", "dry_run": DRY_RUN})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5001)

The receiver binds to 0.0.0.0:5001 so Alertmanager can reach it through Docker's host gateway (host.docker.internal resolves to the Docker bridge IP, not 127.0.0.1). Without a firewall rule, this port is exposed to the internet. Block external access now:

sudo ufw deny in on eth0 to any port 5001

If you use nftables instead of ufw, add an equivalent rule to drop incoming traffic on port 5001 from public interfaces.

Running the receiver with systemd

Create a systemd unit so the receiver starts on boot and restarts on failure.

Create /etc/systemd/system/sentinel.service:

[Unit]
Description=Sentinel self-healing webhook receiver
After=network.target docker.service
Wants=docker.service

[Service]
Type=simple
User=sentinel
Group=sentinel
SupplementaryGroups=docker
WorkingDirectory=/home/sentinel/sentinel/receiver
ExecStart=/home/sentinel/sentinel/receiver/venv/bin/python sentinel.py
EnvironmentFile=/etc/sentinel/env
Restart=on-failure
RestartSec=5
StandardOutput=journal
StandardError=journal

# Hardening
NoNewPrivileges=yes
ProtectSystem=strict
ReadWritePaths=/var/log/sentinel

[Install]
WantedBy=multi-user.target

The SupplementaryGroups=docker line gives the sentinel process access to the Docker socket without running as root. This is required for docker restart and docker image prune actions.

Create a dedicated user, add it to the docker group, set up its home directory, and copy the receiver files:

sudo useradd -r -m -d /home/sentinel -s /bin/false sentinel
sudo usermod -aG docker sentinel
sudo mkdir -p /etc/sentinel /var/log/sentinel
sudo chown sentinel:sentinel /var/log/sentinel
sudo mkdir -p /home/sentinel/sentinel
sudo cp -r ~/sentinel/receiver /home/sentinel/sentinel/receiver
sudo chown -R sentinel:sentinel /home/sentinel/sentinel

The systemd unit runs as the sentinel user, so the receiver code and virtualenv must live under its home directory.

Note on elevated actions: The docker restart and docker image prune actions work through docker group membership. However, journalctl --vacuum-size, kill (for processes owned by other users), and fallocate /swapfile require root privileges. If you need these actions in live mode, add targeted sudoers rules in /etc/sudoers.d/sentinel for the specific commands. Do not give the sentinel user blanket sudo access.
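A sketch of such a sudoers fragment, assuming default binary paths on Debian/Ubuntu (verify each path with which, and always edit with visudo -f /etc/sudoers.d/sentinel so a syntax error cannot lock you out). The kill action cannot be pinned this way because its PID argument varies at runtime; wrap it in a small root-owned script and allow that script instead:

```
# /etc/sudoers.d/sentinel -- exact commands only, no wildcards
sentinel ALL=(root) NOPASSWD: /usr/bin/journalctl --vacuum-size=200M
sentinel ALL=(root) NOPASSWD: /usr/bin/fallocate -l 2G /swapfile
sentinel ALL=(root) NOPASSWD: /usr/bin/chmod 600 /swapfile
sentinel ALL=(root) NOPASSWD: /usr/sbin/mkswap /swapfile
sentinel ALL=(root) NOPASSWD: /usr/sbin/swapon /swapfile
```

The matching entries in ALLOWED_ACTIONS would then need a sudo prefix on each command.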

Create /etc/sentinel/env with your secrets:

OLLAMA_URL=http://127.0.0.1:11434
OLLAMA_MODEL=qwen2.5:7b
DISCORD_WEBHOOK=https://discord.com/api/webhooks/YOUR_WEBHOOK_ID/YOUR_WEBHOOK_TOKEN
SENTINEL_DRY_RUN=true
SENTINEL_AUDIT_LOG=/var/log/sentinel/audit.jsonl

Lock down the file:

sudo chmod 600 /etc/sentinel/env
sudo chown sentinel:sentinel /etc/sentinel/env

Start the service:

sudo systemctl daemon-reload
sudo systemctl enable --now sentinel.service

enable makes it survive reboots. --now starts it immediately.

sudo systemctl status sentinel.service
 sentinel.service - Sentinel self-healing webhook receiver
     Loaded: loaded (/etc/systemd/system/sentinel.service; enabled)
     Active: active (running) since Wed 2026-03-19 14:35:00 UTC
   Main PID: 12345 (python)

Start in dry-run mode (SENTINEL_DRY_RUN=true) until you have tested every alert path. Only flip to false once you trust the behavior.

How does Ollama diagnose server issues from alert context?

Ollama receives a structured prompt containing the alert name, severity, description, current metric values, and the list of allowed actions. It returns a JSON object with the diagnosis: severity assessment, root cause, recommended action, and reasoning. Using Ollama's format parameter with a JSON schema enforces the output structure at the model level.

The prompt template in the code above gives Ollama exactly what it needs:

  1. The alert data (what happened)
  2. The list of valid actions (what it can recommend)
  3. The output schema (how to respond)

Setting temperature: 0 makes output deterministic. The same alert always produces the same diagnosis.
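Stripped to its essentials, the format-constrained call looks like this. A stdlib-only sketch, assuming Ollama on 127.0.0.1:11434 with qwen2.5:7b pulled; the schema is written by hand here, where the receiver above derives the same thing from the Pydantic model:

```python
import json
import urllib.request

# Hand-written equivalent of Diagnosis.model_json_schema()
DIAGNOSIS_SCHEMA = {
    "type": "object",
    "properties": {
        "severity": {"type": "string"},
        "root_cause": {"type": "string"},
        "recommended_action": {"type": "string"},
        "reasoning": {"type": "string"},
    },
    "required": ["severity", "root_cause", "recommended_action", "reasoning"],
}

def build_request(prompt: str, model: str = "qwen2.5:7b") -> dict:
    """Assemble the /api/chat payload with schema-constrained output."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "format": DIAGNOSIS_SCHEMA,      # constrains decoding to this schema
        "stream": False,
        "options": {"temperature": 0},   # deterministic output
    }

if __name__ == "__main__":
    body = json.dumps(build_request("Diagnose: root disk is 87% full.")).encode()
    req = urllib.request.Request(
        "http://127.0.0.1:11434/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=120) as resp:
            print(json.loads(resp.read())["message"]["content"])
    except OSError as e:
        print(f"Ollama not reachable: {e}")
```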

Which models work best?

Not all models handle structured JSON output reliably. Here is what we tested on a 4 vCPU / 8 GB RAM VPS:

Model        Size    Inference time  JSON reliability  Notes
qwen2.5:7b   4.7 GB  3-5s            High              Best speed-accuracy trade-off
llama3.1:8b  4.7 GB  4-6s            Medium            Occasionally ignores schema constraints
mistral:7b   4.1 GB  3-4s            Medium            Fast but sometimes hallucinates actions
phi3:mini    2.3 GB  1-2s            Low               Too small for reliable structured output
qwen2.5:14b  9.0 GB  8-12s           High              Best accuracy, but tight on 8 GB RAM

qwen2.5:7b is the recommended default. It fits comfortably in 8 GB alongside Prometheus and your services, and reliably produces valid JSON matching the Pydantic schema.

Pull it:

ollama pull qwen2.5:7b

A sample diagnosis from a DiskSpaceLow alert:

{
  "severity": "medium",
  "root_cause": "Docker images and build cache accumulating over time, consuming disk space on the root filesystem",
  "recommended_action": "prune_docker_images",
  "reasoning": "Disk is at 87% usage. The most common cause on a Docker-based VPS is unused images. Pruning images older than 72 hours will reclaim space without affecting running containers."
}

The Pydantic validator rejects any response where recommended_action is not in ALLOWED_ACTIONS. If Ollama hallucinates an action like rm -rf /tmp/*, the validator catches it and the action never executes.

What remediation actions can the system execute safely?

The remediation engine only runs commands defined in a strict allowlist. Each action has a fixed command template, an optional service allowlist, and a flag indicating whether human approval is required before execution.

Action                   Command                                                       Risk              Approval required
restart_service          docker restart {service}                                      Low               No (only allowlisted containers)
prune_docker_images      docker image prune -af --filter 'until=72h'                   Low               No
kill_top_memory_process  ps aux --sort=-%mem | awk 'NR==2{print $2}' | xargs kill -15  High              Yes
identify_cpu_hog         ps aux --sort=-%cpu | head -5                                 None (read-only)  No
clear_journal_logs       journalctl --vacuum-size=200M                                 Low               No
add_swap                 fallocate + mkswap + swapon                                   Medium            Yes

The restart_service action accepts a {service} template variable, but only if the container name is in allowed_services. If Ollama recommends restarting sshd, the engine blocks it because sshd is not on the list. The service name comes from the Prometheus job label, so make sure your job_name values in prometheus.yml match the Docker container names. If you run systemd-managed services instead, swap docker restart for systemctl restart.

Commands use SIGTERM (kill -15), not SIGKILL. The process gets a chance to clean up. subprocess.run has a 60-second timeout to prevent hung remediation.
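The same graceful-then-forceful idea, sketched outside the engine: send SIGTERM, give the process a grace period, and only escalate to SIGKILL if it does not exit. This helper is hypothetical, not part of sentinel.py:

```python
import subprocess

def terminate_gracefully(proc: subprocess.Popen, grace: float = 10.0) -> int:
    """SIGTERM first; escalate to SIGKILL only if the process ignores it."""
    proc.terminate()                 # SIGTERM: the process can clean up
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        proc.kill()                  # SIGKILL: last resort, no cleanup chance
        return proc.wait()

if __name__ == "__main__":
    p = subprocess.Popen(["sleep", "60"])
    code = terminate_gracefully(p, grace=5.0)
    print(f"exit status: {code}")    # negative value means killed by that signal
```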

How do you implement the safety layer with allowlists and human approval?

The safety layer treats all LLM output as untrusted. Three mechanisms prevent dangerous actions: the action allowlist rejects any command not predefined in code, the Pydantic validator catches malformed responses before they reach the engine, and the Discord approval gate blocks destructive actions until a human reacts.

Defense in depth

The system has four layers of protection:

  1. Schema enforcement. Ollama's format parameter constrains the model to the JSON schema. It cannot return free-text commands.
  2. Pydantic validation. The Diagnosis model validates that recommended_action exists in ALLOWED_ACTIONS. Invalid actions raise a ValueError and the alert falls through to the hint-based fallback.
  3. Action allowlist. The remediation engine only executes commands defined in ALLOWED_ACTIONS. No dynamic command construction. No string interpolation beyond the service name (which itself is validated against a list).
  4. Human approval. Actions marked requires_approval: True send a Discord notification and do not execute. A separate approval handler (outside this tutorial's scope) listens for Discord reactions and triggers execution.

Dry-run mode

Set SENTINEL_DRY_RUN=true in /etc/sentinel/env. The receiver logs what it would do without executing anything:

{"timestamp": "2026-03-19T15:00:00Z", "event": "remediation", "alert": "DiskSpaceLow", "action": "prune_docker_images", "status": "dry_run", "command": "docker image prune -af --filter 'until=72h'"}

Read the audit log:

tail -f /var/log/sentinel/audit.jsonl | jq .

Run in dry-run mode for at least a week before enabling live remediation. Review every logged action. Make sure the LLM recommends sane actions for your actual alerts.
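A small review helper for that pass, summarizing the audit trail by event and action. It is ours and hypothetical; it assumes the JSONL format shown above:

```python
import json
from collections import Counter
from pathlib import Path

def summarize_audit(lines: list[str]) -> Counter:
    """Count (event, action) pairs across audit log entries."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        counts[(entry.get("event", "?"), entry.get("action", "none"))] += 1
    return counts

if __name__ == "__main__":
    try:
        log = Path("/var/log/sentinel/audit.jsonl")
        summary = summarize_audit(log.read_text().splitlines())
        for (event, action), n in summary.most_common():
            print(f"{n:5d}  {event:20s} {action}")
    except OSError as e:
        print(f"could not read audit log: {e}")
```

If the counts show the same alert triggering the same remediation daily, the remediation is masking a problem rather than fixing it.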

Discord approval flow

When a destructive action triggers, the receiver posts an embed to your Discord channel:

🔒 Approval Required: kill_top_memory_process
Alert: MemoryHigh
Command: ps aux --sort=-%mem | awk 'NR==2{print $2}' | xargs kill -15
Diagnosis: Memory at 93%. The top process by RSS is consuming 4.2 GB.
React with ✅ to approve or ❌ to deny.

Without a Discord webhook configured, destructive actions are blocked entirely. This is the safe default. A missing webhook means no destructive actions run, not that they run without approval.

What is the sentinel pattern and why do you need it?

The sentinel pattern is a standalone cron-based watchdog that monitors the monitoring stack itself. If Prometheus crashes, nobody fires an alert because the alert system is down. The sentinel script checks whether Prometheus, Alertmanager, and the webhook receiver are running, and restarts them if not. It is the "who watches the watchmen" answer.

Create ~/sentinel/watchdog.sh:

#!/bin/bash
# Sentinel watchdog: monitors the monitoring stack
# Runs via cron every 2 minutes

LOG="/var/log/sentinel/watchdog.log"
TIMESTAMP=$(date -u +"%Y-%m-%dT%H:%M:%SZ")

check_and_restart() {
    local name="$1"
    local check_cmd="$2"
    local restart_cmd="$3"

    if ! eval "$check_cmd" > /dev/null 2>&1; then
        echo "${TIMESTAMP} ALERT: ${name} is down. Restarting..." >> "$LOG"
        eval "$restart_cmd"
        sleep 5
        if eval "$check_cmd" > /dev/null 2>&1; then
            echo "${TIMESTAMP} OK: ${name} restarted successfully" >> "$LOG"
        else
            echo "${TIMESTAMP} CRITICAL: ${name} failed to restart" >> "$LOG"
        fi
    fi
}

# Check Docker containers
check_and_restart "prometheus" \
    "docker inspect --format='{{.State.Running}}' prometheus 2>/dev/null | grep -q true" \
    "cd $HOME/sentinel && docker compose up -d prometheus"

check_and_restart "alertmanager" \
    "docker inspect --format='{{.State.Running}}' alertmanager 2>/dev/null | grep -q true" \
    "cd $HOME/sentinel && docker compose up -d alertmanager"

check_and_restart "node-exporter" \
    "docker inspect --format='{{.State.Running}}' node-exporter 2>/dev/null | grep -q true" \
    "cd $HOME/sentinel && docker compose up -d node-exporter"

# Check the webhook receiver systemd service
check_and_restart "sentinel-receiver" \
    "systemctl is-active --quiet sentinel.service" \
    "systemctl restart sentinel.service"

Make it executable:

chmod 700 ~/sentinel/watchdog.sh

Add a cron job that runs every 2 minutes. Replace /root with the directory that actually contains the script. Note that under root's crontab, $HOME inside the script expands to /root, so either keep the stack there or replace $HOME in the script with the absolute path to your compose directory:

(sudo crontab -l 2>/dev/null; echo "*/2 * * * * /root/sentinel/watchdog.sh") | sudo crontab -

The watchdog has no dependencies beyond bash and Docker CLI. It runs even if Python, Ollama, or the entire monitoring stack is down. That is why it exists: it is the one component that does not depend on any other component.

Check the watchdog log:

cat /var/log/sentinel/watchdog.log
2026-03-19T15:10:00Z ALERT: prometheus is down. Restarting...
2026-03-19T15:10:05Z OK: prometheus restarted successfully

How do you test the self-healing loop end to end?

Simulate real failures to confirm the full loop works: alert fires, webhook receives, Ollama diagnoses, remediation executes (or logs in dry-run mode). Start every test with SENTINEL_DRY_RUN=true so nothing actually executes until you verify the diagnosis.
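You can also exercise the pipeline without breaking anything by POSTing a hand-built Alertmanager-style payload straight to the receiver. A sketch; the field names follow the webhook format the receiver parses above:

```python
import json
import urllib.request

def fake_alert(alertname: str, hint: str, job: str = "node-exporter") -> dict:
    """Build a minimal Alertmanager webhook payload with one firing alert."""
    return {
        "status": "firing",
        "alerts": [{
            "status": "firing",
            "labels": {"alertname": alertname, "severity": "warning", "job": job},
            "annotations": {
                "summary": f"Synthetic test for {alertname}",
                "description": "Injected by test script.",
                "remediation_hint": hint,
            },
            "startsAt": "2026-03-19T15:00:00Z",
        }],
    }

if __name__ == "__main__":
    body = json.dumps(fake_alert("DiskSpaceLow", "prune_docker_images")).encode()
    req = urllib.request.Request(
        "http://127.0.0.1:5001/alert",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=180) as resp:
            print(resp.read().decode())
    except OSError as e:
        print(f"receiver not reachable: {e}")
```

With SENTINEL_DRY_RUN=true this produces a full diagnosis and dry_run entry in the audit log, with no Prometheus involvement.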

Test 1: Service down

The quickest trigger is to stop an existing scrape target. You could instead add a dedicated test container as a scrape target in prometheus/prometheus.yml, but stopping node-exporter needs no config changes:

docker stop node-exporter

Wait 1-2 minutes for the ServiceDown alert to fire. Watch the audit log:

tail -f /var/log/sentinel/audit.jsonl | jq .
{
  "timestamp": "2026-03-19T15:12:30Z",
  "event": "diagnosis",
  "alert": "ServiceDown",
  "severity": "critical",
  "root_cause": "node-exporter scrape target is unreachable, container likely stopped or crashed",
  "action": "restart_service",
  "reasoning": "The node-exporter container is down. Restarting it will restore metric collection."
}
{
  "timestamp": "2026-03-19T15:12:31Z",
  "event": "remediation",
  "alert": "ServiceDown",
  "action": "restart_service",
  "status": "dry_run",
  "command": "docker restart node-exporter"
}

Restart node-exporter manually to resolve the alert:

docker start node-exporter

Test 2: Disk pressure

Create a large file to push disk usage above 85%:

fallocate -l 20G /tmp/disk-pressure-test

Wait for the DiskSpaceLow alert (5 minute threshold). The LLM should diagnose it and recommend prune_docker_images. After confirming the dry-run log, clean up:

rm /tmp/disk-pressure-test

Test 3: Memory pressure

Use stress-ng to consume memory:

sudo apt install -y stress-ng
stress-ng --vm 1 --vm-bytes 6G --timeout 600s &

The MemoryHigh alert fires. The LLM should recommend kill_top_memory_process, which requires Discord approval. Confirm the approval message appears in Discord.

Stop the stress test:

killall stress-ng

Switching to live mode

Once every dry-run test produces correct diagnoses and appropriate action recommendations:

  1. Edit /etc/sentinel/env and set SENTINEL_DRY_RUN=false
  2. Restart the service: sudo systemctl restart sentinel.service
  3. Run the service down test again. The node-exporter container should restart automatically via docker restart.

Keep monitoring /var/log/sentinel/audit.jsonl for the first few days. Every action is logged with full context for post-incident review.

Troubleshooting

Ollama not responding. Check that Ollama is running: systemctl status ollama. Check that the model is pulled: ollama list. If the model is not yet loaded, the first request takes 10-20 seconds while it loads into RAM.

Alertmanager cannot reach the webhook. Verify the receiver is listening: curl http://127.0.0.1:5001/health. Check Alertmanager logs: docker logs alertmanager. Make sure host.docker.internal resolves from inside the Alertmanager container: docker exec alertmanager wget -q -O- http://host.docker.internal:5001/health.

Alerts not firing. Check that Prometheus loaded the rules: curl http://127.0.0.1:9090/api/v1/rules | python3 -m json.tool. Verify scrape targets are up. A for: 5m clause means the condition must be true for 5 continuous minutes before the alert fires.

Pydantic validation errors. The model produced output that does not match the schema. Check Ollama's raw response in the journal: journalctl -u sentinel.service -f. Try a different model. qwen2.5:7b is the most reliable for structured output on limited hardware.

Watchdog not running. Verify the cron entry: sudo crontab -l. Check cron daemon: systemctl status cron. Check the watchdog log exists: ls -la /var/log/sentinel/watchdog.log.

Logs for every component:

# Prometheus
docker logs prometheus --tail 50

# Alertmanager
docker logs alertmanager --tail 50

# Sentinel receiver
journalctl -u sentinel.service -f

# Audit trail
tail -f /var/log/sentinel/audit.jsonl | jq .

Next steps

The sentinel pattern handles reactive remediation. For proactive monitoring, feed your logs into an LLM for pattern detection before alerts fire.

If you extend the remediation engine to run more complex scripts, sandbox them. Never let an LLM-driven process run with unrestricted system access.

The monitoring stack here covers a single VPS. For multi-service Docker setups, add cAdvisor as a scrape target to monitor container-level CPU, memory, and network metrics.


Copyright 2026 Virtua.Cloud. All rights reserved. This content is original work by the Virtua.Cloud team. Reproduction, republication, or redistribution without written permission is prohibited.
