Building a Self-Healing VPS with Prometheus and Ollama
Wire Prometheus alerts to a local LLM to diagnose failures automatically and run safe remediation actions on your VPS. Complete working code, including an allowlist, a dry-run mode, and human approval controls.
It's 3 a.m. and a server is down. The alert wakes you, you SSH in half-asleep, restart the crashed service, and go back to bed. After a few rounds of this you start to wonder: can the server fix itself?
It can. This tutorial builds a self-healing feedback loop on a single VPS. Prometheus and node_exporter collect metrics. Alertmanager fires alerts when thresholds are breached. A Python webhook receiver catches those alerts and passes the context to Ollama. The LLM diagnoses the problem and recommends a remediation action. If the action is on the allowlist, it runs automatically. If it is destructive, a human has to approve it through Discord first.
The full pipeline looks like this:
node_exporter -> Prometheus -> Alertmanager -> webhook receiver -> Ollama -> remediation engine -> action
                                                                                    |
                                                                          audit log + Discord
Every action is logged. Every LLM recommendation is treated as untrusted input. That is the part most AIOps content skips, and it is the part that matters most.
What is a self-healing server, and why build one on a VPS?
A self-healing server uses metrics and alert rules to detect failures, then triggers remediation without human intervention. Combined with a local LLM like Ollama, the system diagnoses the root cause from alert context and executes allowlisted actions: restarting a crashed service, reclaiming disk space, or killing a runaway process.
Enterprise teams do this with PagerDuty, Rundeck, or StackStorm. Those tools assume a team, a fleet of servers, and a budget. On a single VPS you want something lighter. The sentinel pattern described in this article is a standalone agent that watches the server and fixes common problems automatically, with safety controls that stop the LLM from executing anything dangerous.
Prerequisites
- A VPS with 4+ vCPUs and 8 GB RAM (the LLM shares memory with your services)
- Debian 12 or Ubuntu 24.04
- Docker and Docker Compose installed
- Ollama installed, with at least one model pulled (qwen2.5:7b recommended; it produces reliable structured JSON)
- A non-root user with sudo privileges
- A Discord webhook URL (for human approval notifications)
How do you configure Prometheus, node_exporter, and Alertmanager with Docker Compose?
Deploy the monitoring stack as three containers: Prometheus v3.10.0 scrapes metrics, node_exporter v1.10.2 exposes host metrics, and Alertmanager v0.31.1 routes alerts to the webhook receiver. The whole stack starts with a single docker compose up -d.
Create the project directory:
mkdir -p ~/sentinel/{prometheus,alertmanager}
cd ~/sentinel
The Docker Compose stack
Create docker-compose.yml:
services:
prometheus:
image: prom/prometheus:v3.10.0
container_name: prometheus
restart: unless-stopped
user: "65534:65534"
volumes:
- ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
- ./prometheus/alert_rules.yml:/etc/prometheus/alert_rules.yml:ro
- prometheus_data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.retention.time=7d'
- '--web.enable-lifecycle'
ports:
- "127.0.0.1:9090:9090"
networks:
- sentinel
node-exporter:
image: prom/node-exporter:v1.10.2
container_name: node-exporter
restart: unless-stopped
pid: host
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
command:
- '--path.procfs=/host/proc'
- '--path.sysfs=/host/sys'
- '--path.rootfs=/rootfs'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
networks:
- sentinel
alertmanager:
image: prom/alertmanager:v0.31.1
container_name: alertmanager
restart: unless-stopped
volumes:
- ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
ports:
- "127.0.0.1:9093:9093"
extra_hosts:
- "host.docker.internal:host-gateway"
networks:
- sentinel
volumes:
prometheus_data:
networks:
sentinel:
driver: bridge
Prometheus and Alertmanager bind only to 127.0.0.1. Exposing monitoring dashboards to the public internet is a common misconfiguration that leaks internal metrics to anyone scanning your IP.
Prometheus configuration
Create prometheus/prometheus.yml:
global:
scrape_interval: 15s
evaluation_interval: 15s
rule_files:
- "alert_rules.yml"
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
scrape_configs:
- job_name: "node-exporter"
static_configs:
- targets: ["node-exporter:9100"]
- job_name: "prometheus"
static_configs:
- targets: ["localhost:9090"]
Alertmanager configuration
Create alertmanager/alertmanager.yml:
global:
resolve_timeout: 5m
route:
receiver: sentinel-webhook
group_by: ['alertname', 'instance']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receivers:
- name: sentinel-webhook
webhook_configs:
- url: 'http://host.docker.internal:5001/alert'
send_resolved: true
max_alerts: 10
The webhook URL points at the Python receiver running on the host. Inside a Docker container, host.docker.internal resolves to the host's IP. The extra_hosts directive in the Compose file above maps it to the host gateway on Linux.
Set file permissions before starting the stack:
chmod 644 prometheus/prometheus.yml prometheus/alert_rules.yml alertmanager/alertmanager.yml
Which alert rules should you configure for common VPS failures?
Four alert rules cover the most common VPS failures: low disk space, memory exhaustion, a service going down, and sustained high CPU. Each rule fires only after its threshold has been breached for a set duration, which gives Prometheus time to filter out transient spikes.
Create prometheus/alert_rules.yml:
groups:
- name: vps_health
rules:
- alert: DiskSpaceLow
expr: (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 85
for: 5m
labels:
severity: warning
annotations:
summary: "Disk usage above 85%"
description: "Root filesystem is {{ $value | printf \"%.1f\" }}% full."
remediation_hint: "prune_docker_images"
- alert: MemoryHigh
expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
for: 5m
labels:
severity: critical
annotations:
summary: "Memory usage above 90%"
description: "Available memory is {{ $value | printf \"%.1f\" }}% used."
remediation_hint: "kill_top_memory_process"
- alert: ServiceDown
expr: up == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Target {{ $labels.job }} is down"
description: "Scrape target {{ $labels.instance }} has been unreachable for over 1 minute."
remediation_hint: "restart_service"
- alert: HighCPU
expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
for: 10m
labels:
severity: warning
annotations:
summary: "CPU usage above 85% for 10 minutes"
description: "Average CPU usage is {{ $value | printf \"%.1f\" }}%."
remediation_hint: "identify_cpu_hog"
| Alert | PromQL threshold | Duration | Severity | Default remediation |
|---|---|---|---|---|
| DiskSpaceLow | Root filesystem usage > 85% | 5m | warning | Prune Docker images |
| MemoryHigh | Available memory < 10% | 5m | critical | Kill top memory process |
| ServiceDown | Scrape target unreachable | 1m | critical | Restart service |
| HighCPU | Average CPU > 85% | 10m | warning | Identify CPU hog |
The remediation_hint annotation is a custom field. It tells the remediation engine which action to suggest when the LLM's diagnosis is inconclusive.
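Before starting the stack, you can sanity-check the rule file syntax with promtool, which ships inside the Prometheus image. A hedged sketch, invoked via docker run since the container is not up yet (the mounted path assumes the project layout above):

```shell
docker run --rm -v "$HOME/sentinel/prometheus:/etc/prometheus" \
  --entrypoint promtool prom/prometheus:v3.10.0 \
  check rules /etc/prometheus/alert_rules.yml
```

A syntax error here fails fast, instead of silently leaving Prometheus running with no rules loaded.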
Start the stack:
cd ~/sentinel
docker compose up -d
[+] Running 4/4
✔ Network sentinel_sentinel Created
✔ Container node-exporter Started
✔ Container prometheus Started
✔ Container alertmanager Started
Check that Prometheus is scraping its targets:
curl -s http://127.0.0.1:9090/api/v1/targets | python3 -m json.tool | grep -A2 '"health"'
"health": "up",
"lastScrape": "2026-03-19T14:30:15.123Z",
Both targets should report "health": "up".
How does the webhook receiver capture Alertmanager alerts?
The webhook receiver is a Flask app that listens for POST requests from Alertmanager, extracts the alert context, queries Ollama for a diagnosis, and hands the result to the remediation engine. It runs on the host rather than in a container because it needs access to Docker and systemd to execute remediations.
Install the dependencies:
sudo apt install -y python3-pip python3-venv jq
mkdir -p ~/sentinel/receiver
cd ~/sentinel/receiver
python3 -m venv venv
source venv/bin/activate
pip install flask requests pydantic
Create ~/sentinel/receiver/sentinel.py:
#!/usr/bin/env python3
"""Sentinel: self-healing webhook receiver for Alertmanager."""
import json
import logging
import os
import subprocess
import time
from datetime import datetime, timezone
from pathlib import Path
import requests
from flask import Flask, request, jsonify
from pydantic import BaseModel, field_validator
app = Flask(__name__)
# --- Configuration ---
OLLAMA_URL = os.environ.get("OLLAMA_URL", "http://127.0.0.1:11434")
OLLAMA_MODEL = os.environ.get("OLLAMA_MODEL", "qwen2.5:7b")
DISCORD_WEBHOOK = os.environ.get("DISCORD_WEBHOOK", "")
DRY_RUN = os.environ.get("SENTINEL_DRY_RUN", "false").lower() == "true"
AUDIT_LOG = Path(os.environ.get("SENTINEL_AUDIT_LOG", "/var/log/sentinel/audit.jsonl"))
AUDIT_LOG.parent.mkdir(parents=True, exist_ok=True)
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s %(levelname)s %(message)s"
)
logger = logging.getLogger("sentinel")
# --- Safety: Action Allowlist ---
ALLOWED_ACTIONS = {
"restart_service": {
"command": "docker restart {service}",
"requires_approval": False,
"allowed_services": ["nginx", "docker", "prometheus", "node-exporter", "alertmanager"],
},
"prune_docker_images": {
"command": "docker image prune -af --filter 'until=72h'",
"requires_approval": False,
},
"kill_top_memory_process": {
"command": "ps aux --sort=-%mem | awk 'NR==2{print $2}' | xargs kill -15",
"requires_approval": True,
},
"identify_cpu_hog": {
"command": "ps aux --sort=-%cpu | head -5",
"requires_approval": False,
"read_only": True,
},
"clear_journal_logs": {
"command": "journalctl --vacuum-size=200M",
"requires_approval": False,
},
"add_swap": {
"command": "fallocate -l 2G /swapfile && chmod 600 /swapfile && mkswap /swapfile && swapon /swapfile",
"requires_approval": True,
},
}
class Diagnosis(BaseModel):
"""Structured LLM diagnosis output."""
severity: str
root_cause: str
recommended_action: str
reasoning: str
@field_validator("recommended_action")
@classmethod
def action_must_be_allowed(cls, v):
if v not in ALLOWED_ACTIONS:
raise ValueError(f"Action '{v}' is not in the allowlist")
return v
# --- Ollama Integration ---
def query_ollama(alert_data: dict) -> Diagnosis | None:
"""Send alert context to Ollama and parse structured diagnosis."""
prompt = f"""You are a server diagnostics agent. Analyze this alert and respond with a JSON diagnosis.
Alert: {alert_data.get('labels', {}).get('alertname', 'unknown')}
Status: {alert_data.get('status', 'unknown')}
Severity: {alert_data.get('labels', {}).get('severity', 'unknown')}
Summary: {alert_data.get('annotations', {}).get('summary', '')}
Description: {alert_data.get('annotations', {}).get('description', '')}
Remediation hint: {alert_data.get('annotations', {}).get('remediation_hint', 'none')}
Instance: {alert_data.get('labels', {}).get('instance', 'unknown')}
Started at: {alert_data.get('startsAt', 'unknown')}
Available actions: {', '.join(ALLOWED_ACTIONS.keys())}
Respond ONLY with a JSON object. Fields:
- severity: "low", "medium", "high", or "critical"
- root_cause: one-sentence explanation of the likely cause
- recommended_action: exactly one action from the available actions list
- reasoning: why you chose this action"""
schema = Diagnosis.model_json_schema()
try:
start = time.monotonic()
resp = requests.post(
f"{OLLAMA_URL}/api/chat",
json={
"model": OLLAMA_MODEL,
"messages": [{"role": "user", "content": prompt}],
"format": schema,
"stream": False,
"options": {"temperature": 0},
},
timeout=120,
)
resp.raise_for_status()
elapsed = time.monotonic() - start
logger.info(f"Ollama responded in {elapsed:.1f}s")
content = resp.json()["message"]["content"]
diagnosis = Diagnosis.model_validate_json(content)
return diagnosis
except Exception as e:
logger.error(f"Ollama query failed: {e}")
return None
# --- Remediation Engine ---
def execute_action(action_name: str, context: dict) -> dict:
"""Execute an allowlisted remediation action."""
action = ALLOWED_ACTIONS.get(action_name)
if not action:
return {"status": "blocked", "reason": f"Action '{action_name}' not in allowlist"}
command = action["command"]
# Template the service name if needed
if "{service}" in command:
service = context.get("service", "")
if service not in action.get("allowed_services", []):
return {"status": "blocked", "reason": f"Service '{service}' not in allowed_services"}
command = command.format(service=service)
if DRY_RUN:
logger.info(f"DRY RUN: would execute: {command}")
return {"status": "dry_run", "command": command}
if action.get("requires_approval"):
approved = request_discord_approval(action_name, command, context)
if not approved:
return {"status": "awaiting_approval", "command": command}
try:
result = subprocess.run(
command,
shell=True,
capture_output=True,
text=True,
timeout=60,
)
return {
"status": "executed",
"command": command,
"returncode": result.returncode,
"stdout": result.stdout[:500],
"stderr": result.stderr[:500],
}
except subprocess.TimeoutExpired:
return {"status": "timeout", "command": command}
# --- Discord Approval ---
def request_discord_approval(action_name: str, command: str, context: dict) -> bool:
"""Send a Discord message requesting human approval. Returns False (async approval)."""
if not DISCORD_WEBHOOK:
logger.warning("No Discord webhook configured. Blocking destructive action.")
return False
payload = {
"embeds": [{
"title": f"🔒 Approval Required: {action_name}",
"description": (
f"**Alert:** {context.get('alertname', 'unknown')}\n"
f"**Command:** `{command}`\n"
f"**Diagnosis:** {context.get('reasoning', 'N/A')}\n\n"
"React with ✅ to approve or ❌ to deny."
),
"color": 15158332,
}]
}
try:
requests.post(DISCORD_WEBHOOK, json=payload, timeout=10)
logger.info(f"Discord approval requested for {action_name}")
except Exception as e:
logger.error(f"Discord notification failed: {e}")
return False
# --- Audit Logging ---
def audit_log(entry: dict):
"""Append a structured JSON log entry."""
entry["timestamp"] = datetime.now(timezone.utc).isoformat()
with open(AUDIT_LOG, "a") as f:
f.write(json.dumps(entry) + "\n")
logger.info(f"Audit: {entry.get('event', 'unknown')} - {entry.get('action', 'none')}")
# --- Webhook Endpoint ---
@app.route("/alert", methods=["POST"])
def receive_alert():
"""Handle incoming Alertmanager webhook."""
data = request.get_json(silent=True)
if not data:
return jsonify({"error": "invalid payload"}), 400
alerts = data.get("alerts", [])
results = []
for alert in alerts:
if alert.get("status") == "resolved":
audit_log({"event": "alert_resolved", "alert": alert.get("labels", {})})
continue
alertname = alert.get("labels", {}).get("alertname", "unknown")
logger.info(f"Processing alert: {alertname}")
# Step 1: Diagnose with Ollama
diagnosis = query_ollama(alert)
if diagnosis is None:
# Fallback to remediation_hint from alert annotations
hint = alert.get("annotations", {}).get("remediation_hint")
if hint and hint in ALLOWED_ACTIONS:
logger.warning(f"Ollama unavailable. Falling back to hint: {hint}")
diagnosis = Diagnosis(
severity="unknown",
root_cause="LLM unavailable, using alert hint",
recommended_action=hint,
reasoning="Fallback to annotation-defined remediation",
)
else:
audit_log({
"event": "diagnosis_failed",
"alert": alertname,
"action": "none",
})
results.append({"alert": alertname, "status": "diagnosis_failed"})
continue
audit_log({
"event": "diagnosis",
"alert": alertname,
"severity": diagnosis.severity,
"root_cause": diagnosis.root_cause,
"action": diagnosis.recommended_action,
"reasoning": diagnosis.reasoning,
})
# Step 2: Execute remediation
context = {
"alertname": alertname,
"service": alert.get("labels", {}).get("job", ""),
"reasoning": diagnosis.reasoning,
}
result = execute_action(diagnosis.recommended_action, context)
audit_log({
"event": "remediation",
"alert": alertname,
"action": diagnosis.recommended_action,
**result,
})
results.append({
"alert": alertname,
"diagnosis": diagnosis.model_dump(),
"remediation": result,
})
return jsonify({"processed": len(results), "results": results})
@app.route("/health", methods=["GET"])
def health():
"""Health check endpoint."""
return jsonify({"status": "ok", "dry_run": DRY_RUN})
if __name__ == "__main__":
app.run(host="0.0.0.0", port=5001)
The receiver binds to 0.0.0.0:5001 so that Alertmanager can reach it through Docker's host gateway (host.docker.internal resolves to the Docker bridge IP, not 127.0.0.1). Without a firewall rule, that port is exposed to the public internet. Block external access immediately:
sudo ufw deny in on eth0 to any port 5001
If you use nftables instead of ufw, add an equivalent rule to drop traffic arriving on the public interface for port 5001.
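A one-line nftables equivalent might look like this. This is an assumption-laden sketch: it presumes eth0 is your public interface and that you use the conventional inet filter table with an input chain; adjust both to match your ruleset, and persist the rule in /etc/nftables.conf so it survives reboots:

```shell
# Drop inbound TCP to the receiver port on the public interface
sudo nft add rule inet filter input iifname "eth0" tcp dport 5001 drop
```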
Run the receiver with systemd
Create a systemd unit so the receiver starts at boot and restarts automatically on failure.
Create /etc/systemd/system/sentinel.service:
[Unit]
Description=Sentinel self-healing webhook receiver
After=network.target docker.service
Wants=docker.service
[Service]
Type=simple
User=sentinel
Group=sentinel
SupplementaryGroups=docker
WorkingDirectory=/home/sentinel/sentinel/receiver
ExecStart=/home/sentinel/sentinel/receiver/venv/bin/python sentinel.py
EnvironmentFile=/etc/sentinel/env
Restart=on-failure
RestartSec=5
StandardOutput=journal
StandardError=journal
# Hardening
NoNewPrivileges=yes
ProtectSystem=strict
ReadWritePaths=/var/log/sentinel
[Install]
WantedBy=multi-user.target
The SupplementaryGroups=docker line gives the sentinel process access to the Docker socket without running as root. The docker restart and docker image prune actions need it.
Create a dedicated user, add it to the docker group, set up the directories, and copy the receiver files across:
sudo useradd -r -m -d /home/sentinel -s /bin/false sentinel
sudo usermod -aG docker sentinel
sudo mkdir -p /etc/sentinel /var/log/sentinel /home/sentinel/sentinel
sudo chown sentinel:sentinel /var/log/sentinel
sudo cp -r ~/sentinel/receiver /home/sentinel/sentinel/receiver
sudo chown -R sentinel:sentinel /home/sentinel/sentinel
The systemd unit runs as the sentinel user, so the receiver code and the virtualenv must live under that user's home directory.
A note on actions that need elevated privileges: docker restart and docker image prune work through docker group membership alone, but journalctl --vacuum-size, kill (against other users' processes), and fallocate /swapfile require root. If you need those actions in production mode, add targeted sudoers rules for the specific commands in /etc/sudoers.d/sentinel. Do not give the sentinel user unrestricted sudo.
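A sketch of such a scoped rule set for /etc/sudoers.d/sentinel. The binary paths are assumptions and vary by distribution; always validate with visudo -cf /etc/sudoers.d/sentinel before relying on it:

```
# Exact commands only, fixed arguments, no wildcards
sentinel ALL=(root) NOPASSWD: /usr/bin/journalctl --vacuum-size=200M
sentinel ALL=(root) NOPASSWD: /usr/bin/fallocate -l 2G /swapfile
sentinel ALL=(root) NOPASSWD: /usr/sbin/mkswap /swapfile
sentinel ALL=(root) NOPASSWD: /usr/sbin/swapon /swapfile
```

The corresponding ALLOWED_ACTIONS command templates would then need a sudo prefix. Fixed-argument rules like these cannot be repurposed by the LLM to run arbitrary variants of the same binary.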
Create /etc/sentinel/env with your secrets:
OLLAMA_URL=http://127.0.0.1:11434
OLLAMA_MODEL=qwen2.5:7b
DISCORD_WEBHOOK=https://discord.com/api/webhooks/YOUR_WEBHOOK_ID/YOUR_WEBHOOK_TOKEN
SENTINEL_DRY_RUN=true
SENTINEL_AUDIT_LOG=/var/log/sentinel/audit.jsonl
Lock down the file permissions:
sudo chmod 600 /etc/sentinel/env
sudo chown sentinel:sentinel /etc/sentinel/env
Start the service:
sudo systemctl daemon-reload
sudo systemctl enable --now sentinel.service
enable keeps the service running across reboots; --now starts it immediately.
sudo systemctl status sentinel.service
● sentinel.service - Sentinel self-healing webhook receiver
Loaded: loaded (/etc/systemd/system/sentinel.service; enabled)
Active: active (running) since Wed 2026-03-19 14:35:00 UTC
Main PID: 12345 (python)
Start in dry-run mode (SENTINEL_DRY_RUN=true) until you have tested every alert path. Switch to false only once you trust its behavior.
How does Ollama diagnose server problems from alert context?
Ollama receives a structured prompt containing the alert name, severity, description, current metric values, and the list of permitted actions. It returns a JSON object as its diagnosis: a severity assessment, a root cause, a recommended action, and its reasoning. Ollama's format parameter, combined with a JSON schema, enforces the output structure at the model level.
The prompt template in the code above gives Ollama everything it needs:
- The alert data (what happened)
- The list of valid actions (what it may recommend)
- The output schema (how to respond)
Setting temperature: 0 makes the output deterministic: the same alert always produces the same diagnosis.
Which models work best?
Not every model handles structured JSON output reliably. Here are our test results on a 4 vCPU / 8 GB RAM VPS:
| Model | Size | Inference time | JSON reliability | Notes |
|---|---|---|---|---|
| qwen2.5:7b | 4.7 GB | 3-5 s | High | Best balance of speed and accuracy |
| llama3.1:8b | 4.7 GB | 4-6 s | Medium | Occasionally ignores schema constraints |
| mistral:7b | 4.1 GB | 3-4 s | Medium | Fast, but sometimes hallucinates nonexistent actions |
| phi3:mini | 2.3 GB | 1-2 s | Low | Too small for reliable structured output |
| qwen2.5:14b | 9.0 GB | 8-12 s | High | Most accurate, but tight in 8 GB RAM |
qwen2.5:7b is the recommended default. It coexists comfortably with Prometheus and your services in 8 GB of RAM, and it reliably produces valid JSON that matches the Pydantic schema.
Pull the model:
ollama pull qwen2.5:7b
An example diagnosis for a DiskSpaceLow alert:
{
"severity": "medium",
"root_cause": "Docker images and build cache accumulating over time, consuming disk space on the root filesystem",
"recommended_action": "prune_docker_images",
"reasoning": "Disk is at 87% usage. The most common cause on a Docker-based VPS is unused images. Pruning images older than 72 hours will reclaim space without affecting running containers."
}
The Pydantic validator rejects any response whose recommended_action is not in ALLOWED_ACTIONS. If Ollama hallucinates something like rm -rf /tmp/*, the validator intercepts it and the action never runs.
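The same gate can be seen in isolation. This stdlib-only sketch reproduces the allowlist check (the receiver does it inside a Pydantic field_validator; the action names here are a subset of the real ALLOWED_ACTIONS):

```python
import json

# Subset of the real allowlist, for illustration only
ALLOWED_ACTIONS = {"restart_service", "prune_docker_images", "identify_cpu_hog"}

def parse_diagnosis(raw: str) -> dict:
    """Parse an LLM diagnosis and reject any action outside the allowlist."""
    diagnosis = json.loads(raw)
    action = diagnosis.get("recommended_action", "")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"Action '{action}' is not in the allowlist")
    return diagnosis

# A hallucinated shell command never reaches the remediation engine
hallucinated = json.dumps({
    "severity": "high",
    "root_cause": "disk full",
    "recommended_action": "rm -rf /tmp/*",
    "reasoning": "free up space",
})
try:
    parse_diagnosis(hallucinated)
except ValueError as e:
    print(f"blocked: {e}")  # blocked: Action 'rm -rf /tmp/*' is not in the allowlist
```

The key property is exact string membership: there is no pattern matching or normalization an adversarial completion could exploit.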
Which remediation actions can the system execute safely?
The remediation engine executes only commands defined in a strict allowlist. Each action has a fixed command template, an optional service allowlist, and a flag for whether it requires human approval.
| Action | Command | Risk | Approval required |
|---|---|---|---|
| restart_service | `docker restart {service}` | Low | No (allowlisted containers only) |
| prune_docker_images | `docker image prune -af --filter 'until=72h'` | Low | No |
| kill_top_memory_process | `ps aux --sort=-%mem \| awk 'NR==2{print $2}' \| xargs kill -15` | High | Yes |
| identify_cpu_hog | `ps aux --sort=-%cpu \| head -5` | None (read-only) | No |
| clear_journal_logs | `journalctl --vacuum-size=200M` | Low | No |
| add_swap | fallocate + mkswap + swapon | Medium | Yes |
The restart_service action accepts a {service} template variable, but only executes when the container name is in allowed_services. If Ollama suggests restarting sshd, the engine blocks it because sshd is not on the list. The service name comes from the Prometheus job label, so make sure your job_name values in prometheus.yml match your Docker container names. If you run systemd-managed services instead, swap docker restart for systemctl restart.
Commands use SIGTERM (kill -15), not SIGKILL, so processes get a chance to clean up. subprocess.run has a 60-second timeout to stop a remediation from hanging.
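A standalone sketch of that timeout guard, using the same subprocess.run parameters as execute_action() (the sleep command stands in for a hung remediation):

```python
import subprocess

def run_with_timeout(command: str, timeout: int = 60) -> dict:
    """Run a shell command, reporting a timeout instead of hanging forever."""
    try:
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=timeout
        )
        return {"status": "executed", "returncode": result.returncode}
    except subprocess.TimeoutExpired:
        # The child is killed; the receiver records the failure and moves on
        return {"status": "timeout", "command": command}

print(run_with_timeout("echo ok", timeout=5))   # {'status': 'executed', 'returncode': 0}
print(run_with_timeout("sleep 10", timeout=1))  # {'status': 'timeout', 'command': 'sleep 10'}
```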
How do you implement the allowlist and human approval safety layers?
The safety layer treats all LLM output as untrusted. Three mechanisms prevent dangerous operations: the action allowlist rejects any command not predefined in code, the Pydantic validator intercepts malformed responses before they reach the engine, and the Discord approval gate blocks destructive actions until a human confirms.
Defense in depth
The system has four layers of protection:
- **Schema enforcement.** Ollama's format parameter constrains the model to a JSON schema. It cannot return free-text commands.
- **Pydantic validation.** The Diagnosis model verifies that recommended_action exists in ALLOWED_ACTIONS. An invalid action raises a ValueError and the alert falls back to the hint-based path.
- **Action allowlist.** The remediation engine executes only commands defined in ALLOWED_ACTIONS. There is no dynamic command construction, and no string interpolation beyond the service name, which is itself validated.
- **Human approval.** Actions flagged requires_approval: True send a Discord notification but do not execute. A separate approval handler (out of scope for this tutorial) listens for Discord reactions and triggers execution.
Dry-run mode
Set SENTINEL_DRY_RUN=true in /etc/sentinel/env. The receiver then only logs what it would do, without executing anything:
{"timestamp": "2026-03-19T15:00:00Z", "event": "remediation", "alert": "DiskSpaceLow", "action": "prune_docker_images", "status": "dry_run", "command": "docker image prune -af --filter 'until=72h'"}
Watch the audit log:
tail -f /var/log/sentinel/audit.jsonl | jq .
Run in dry-run mode for at least a week before enabling real remediation. Review every logged action and confirm the LLM recommends sensible actions for your actual alerts.
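A small helper makes that review easier. This sketch (not part of the receiver) summarizes remediation outcomes from the JSONL audit log, using the same event and field names that audit_log() writes; the sample file path is arbitrary:

```python
import json
from collections import Counter
from pathlib import Path

def summarize_audit(path: str) -> Counter:
    """Count (action, status) pairs across all remediation events in the log."""
    counts: Counter = Counter()
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        if entry.get("event") == "remediation":
            counts[(entry.get("action"), entry.get("status"))] += 1
    return counts

# Demonstration against a synthetic log file:
sample = "\n".join([
    '{"event": "remediation", "action": "prune_docker_images", "status": "dry_run"}',
    '{"event": "diagnosis", "alert": "DiskSpaceLow"}',
    '{"event": "remediation", "action": "prune_docker_images", "status": "dry_run"}',
])
Path("/tmp/audit-sample.jsonl").write_text(sample)
print(summarize_audit("/tmp/audit-sample.jsonl"))
```

Point it at /var/log/sentinel/audit.jsonl to see at a glance which actions fired and whether any were blocked or timed out.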
The Discord approval flow
When a destructive action triggers, the receiver posts an embed in your Discord channel:
🔒 Approval Required: kill_top_memory_process
Alert: MemoryHigh
Command: ps aux --sort=-%mem | awk 'NR==2{print $2}' | xargs kill -15
Diagnosis: Memory at 93%. The top process by RSS is consuming 4.2 GB.
React with ✅ to approve or ❌ to deny.
With no Discord webhook configured, destructive actions are blocked entirely. That is the safe default: a missing webhook means no destructive action runs, not that it runs unapproved.
What is the sentinel pattern, and why do you need it?
The sentinel pattern is a standalone cron-based watchdog that monitors the monitoring stack itself. If Prometheus crashes, nothing fires an alert, because the alerting system is what went down. The sentinel script checks that Prometheus, Alertmanager, and the webhook receiver are running, and restarts them if not. It is the answer to "who watches the watchers?"
Create ~/sentinel/watchdog.sh:
#!/bin/bash
# Sentinel watchdog: monitors the monitoring stack
# Runs via cron every 2 minutes
LOG="/var/log/sentinel/watchdog.log"
TIMESTAMP=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
check_and_restart() {
local name="$1"
local check_cmd="$2"
local restart_cmd="$3"
if ! eval "$check_cmd" > /dev/null 2>&1; then
echo "${TIMESTAMP} ALERT: ${name} is down. Restarting..." >> "$LOG"
eval "$restart_cmd"
sleep 5
if eval "$check_cmd" > /dev/null 2>&1; then
echo "${TIMESTAMP} OK: ${name} restarted successfully" >> "$LOG"
else
echo "${TIMESTAMP} CRITICAL: ${name} failed to restart" >> "$LOG"
fi
fi
}
# Check Docker containers
check_and_restart "prometheus" \
"docker inspect --format='{{.State.Running}}' prometheus 2>/dev/null | grep -q true" \
"cd $HOME/sentinel && docker compose up -d prometheus"
check_and_restart "alertmanager" \
"docker inspect --format='{{.State.Running}}' alertmanager 2>/dev/null | grep -q true" \
"cd $HOME/sentinel && docker compose up -d alertmanager"
check_and_restart "node-exporter" \
"docker inspect --format='{{.State.Running}}' node-exporter 2>/dev/null | grep -q true" \
"cd $HOME/sentinel && docker compose up -d node-exporter"
# Check the webhook receiver systemd service
check_and_restart "sentinel-receiver" \
"systemctl is-active --quiet sentinel.service" \
"systemctl restart sentinel.service"
chmod 700 ~/sentinel/watchdog.sh
Add a cron job that runs every 2 minutes. Replace /root with your actual home directory:
(sudo crontab -l 2>/dev/null; echo "*/2 * * * * /root/sentinel/watchdog.sh") | sudo crontab -
The watchdog depends on nothing but bash and the Docker CLI. It keeps running even if Python, Ollama, or the entire monitoring stack is down. That is the point: it is the one component that depends on nothing else.
Check the watchdog log:
cat /var/log/sentinel/watchdog.log
2026-03-19T15:10:00Z ALERT: prometheus is down. Restarting...
2026-03-19T15:10:05Z OK: prometheus restarted successfully
How do you test the self-healing loop end to end?
Simulate real failures to confirm the full loop works: the alert fires, the webhook receives it, Ollama diagnoses, and the remediation executes (or is logged, in dry-run mode). Start every test with SENTINEL_DRY_RUN=true so nothing actually executes until you have verified the diagnoses.
Test 1: service down
Start a throwaway test container:
docker run -d --name test-service --network sentinel_sentinel alpine sleep 3600
Add a scrape target for it in prometheus/prometheus.yml, or simply stop the node-exporter container, which is already scraped:
docker stop node-exporter
Wait 1-2 minutes for the ServiceDown alert to fire, then check the audit log:
tail -f /var/log/sentinel/audit.jsonl | jq .
{
"timestamp": "2026-03-19T15:12:30Z",
"event": "diagnosis",
"alert": "ServiceDown",
"severity": "critical",
"root_cause": "node-exporter scrape target is unreachable, container likely stopped or crashed",
"action": "restart_service",
"reasoning": "The node-exporter container is down. Restarting it will restore metric collection."
}
{
"timestamp": "2026-03-19T15:12:31Z",
"event": "remediation",
"alert": "ServiceDown",
"action": "restart_service",
"status": "dry_run",
"command": "docker restart node-exporter"
}
Start node-exporter again manually to resolve the alert:
docker start node-exporter
Test 2: disk pressure
Create a large file to push disk usage past 85%:
fallocate -l 20G /tmp/disk-pressure-test
Wait for the DiskSpaceLow alert (5-minute threshold). The LLM should diagnose the problem and recommend prune_docker_images. Once you have confirmed the dry-run log entry, clean up:
rm /tmp/disk-pressure-test
Test 3: memory pressure
Use stress-ng to consume memory:
sudo apt install -y stress-ng
stress-ng --vm 1 --vm-bytes 6G --timeout 600s &
The MemoryHigh alert fires. The LLM should recommend kill_top_memory_process, which requires Discord approval. Confirm the approval message appears in Discord.
Stop the stress test:
killall stress-ng
Switching to production mode
Once every dry-run test produces a correct diagnosis and a sensible action recommendation:
- Edit /etc/sentinel/env and set SENTINEL_DRY_RUN=false
- Restart the service: sudo systemctl restart sentinel.service
- Run the service-down test again. The node-exporter container should restart automatically via docker restart.
Keep watching /var/log/sentinel/audit.jsonl for the first few days. Every action is logged with full context, which makes post-incident review straightforward.
Troubleshooting
**Ollama not responding.** Check that Ollama is running: systemctl status ollama. Check that the model is pulled: ollama list. If the model is not yet loaded into memory, the first request takes an extra 10-20 seconds.
**Alertmanager cannot reach the webhook.** Confirm the receiver is listening: curl http://127.0.0.1:5001/health. Check the Alertmanager logs: docker logs alertmanager. Verify that host.docker.internal resolves inside the Alertmanager container: docker exec alertmanager wget -q -O- http://host.docker.internal:5001/health.
**Alerts not firing.** Check that Prometheus loaded the rules: curl http://127.0.0.1:9090/api/v1/rules | python3 -m json.tool. Confirm the scrape targets are healthy. Remember that a for: 5m clause means the condition must hold continuously for five minutes before the alert fires.
**Pydantic validation errors.** The model produced output that does not match the schema. Inspect Ollama's raw responses in the journal: journalctl -u sentinel.service -f. Try a different model; qwen2.5:7b is the most reliable for structured output on constrained hardware.
**Watchdog not running.** Verify the cron entry: sudo crontab -l. Check the cron daemon: systemctl status cron. Check that the watchdog log exists: ls -la /var/log/sentinel/watchdog.log.
Logs for each component:
# Prometheus
docker logs prometheus --tail 50
# Alertmanager
docker logs alertmanager --tail 50
# Sentinel receiver
journalctl -u sentinel.service -f
# Audit trail
tail -f /var/log/sentinel/audit.jsonl | jq .
Next steps
The sentinel pattern handles reactive remediation. For proactive monitoring, feed logs into the LLM for pattern detection and catch problems before an alert ever fires.
If you extend the remediation engine to run more complex scripts, sandbox them. Never give an LLM-driven process unrestricted system access.
The monitoring stack here covers a single VPS. For multi-service Docker environments, add cAdvisor as a scrape target to monitor container-level CPU, memory, and network metrics.
Copyright 2026 Virtua.Cloud. All rights reserved. This content is original work by the Virtua.Cloud team. Reproduction, republication, or redistribution without written permission is prohibited.