VPS Resource Watchdog Architecture — 2026-04-24

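The first complete real-time protection system for this VPS. Built on 2026-04-24, prompted by the OOM crash at 17:50 HKT the same day.

Incident background

At 17:50 HKT (09:50 UTC) on 2026-04-24 the machine went down. The kernel OOM killer fired, the machine froze for 6 minutes, and it rebooted automatically at 17:56 HKT.

Root cause analysis:

10 or more claude sessions were open at the same time
Each session spawned 5-6 MCP child processes (notebooklm-mcp, actors-mcp-server, n8n-mcp, the github-mcp-server Docker container, etc.), at roughly 200-500 MB each
Two of the node processes had launched Chrome/Puppeteer, at ~2 GB each
The machine had 16 GB of RAM in total and 0 swap
There was no real-time RAM monitoring of any kind (the only check was nightly-health-check, once a day at 04:00 UTC)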
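The figures above are enough to explain the crash. Below is a rough back-of-envelope calculation using only those numbers. It makes three simplifying assumptions: "10+" is taken as exactly 10 sessions, 2 GB as 2048 MB, and 16 GB as 16384 MB. The function name `estimate_mb` is made up for this illustration.

```shell
#!/bin/sh
# Rough estimate of peak memory demand from the incident figures.
# estimate_mb SESSIONS MCP_PER_SESSION MB_PER_MCP CHROME_PROCS MB_PER_CHROME
estimate_mb() {
    echo $(( $1 * $2 * $3 + $4 * $5 ))
}

TOTAL_RAM_MB=16384                       # 16 GB, with no swap at the time
low=$(estimate_mb 10 5 200 2 2048)       # optimistic end of each range
high=$(estimate_mb 10 6 500 2 2048)      # pessimistic end of each range

echo "low estimate:  ${low} MB"
echo "high estimate: ${high} MB"
echo "installed RAM: ${TOTAL_RAM_MB} MB"
```

The optimistic end (14096 MB) is already about 86% of installed RAM before counting the claude sessions themselves, the OS, and n8n. The pessimistic end (34096 MB) is roughly double what the machine had.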

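Found along the way:

claude-api-proxy.service had been dying and restarting every 5 seconds for 12 days straight, writing 434,297 lines of log noise (the app files had been moved into archive, but the systemd service was never disabled)
openclaw-error-monitor.service was in the same state, with 40,875 lines written
Syslog had grown to 199 MB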

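Core principle

Running out of RAM, CPU, or disk is the most common way this machine dies. n8n going down while ads are live is the most expensive failure.

So protection comes in two layers: global resource protection (stop the machine from blowing up) plus a dedicated immunity pass for n8n (whatever blows up must not take n8n with it).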

System architecture

4 guards, 3 patrol frequencies

Guard	Frequency	Script / Unit	Responsibility
MCP Watchdog	Every 1 minute	`systemd mcp-watchdog.timer` → `/home/claude/scripts/mcp-watchdog.sh`	Only checks the Telegram MCP connection; restarts the Claude session if it drops
Resource Watchdog	Every 1 minute	`cron` → `/home/claude/scripts/resource-watchdog.sh`	Main workhorse: six checks (RAM/CPU/Disk/Service/Docker/Domain) plus auto-heal
Kill Duplicate Bun	Every 1 minute	`cron` → `/home/claude/kill-duplicate-bun.sh`	Cleanup: kills duplicate bun processes
Nightly Health Check	Daily at 04:00 UTC	`cron` → `/home/claude/scripts/nightly-health-check.sh`	Full-body checkup, reports a summary

Resource Watchdog: the six checks (`resource-watchdog.sh`)

1. RAM

≥ 80% → 🟡 TG alert + the top 5 RAM-consuming processes
≥ 92% → 🚨 Auto-heal kicks in: kill the largest non-critical node/bun/chrome/python process, sending SIGTERM, waiting 5 seconds, then SIGKILL
Whitelist (auto-heal never kills these): systemd / sshd / tmux / bash / cron / init / containerd / dockerd / docker / cloudflared / tailscaled / n8n / nginx / mongod / redis / postgres / rabbitmq / code-server / mcp-watchdog / resource-watchdog
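The tiers and the whitelist above could be sketched as follows. This is not the real `resource-watchdog.sh`. The function names are hypothetical, and two choices are assumptions: "used" RAM is computed from `MemAvailable`, and whitelist entries are matched against the exact process name.

```shell
#!/bin/sh
# ram_used_pct [FILE]: used-RAM percentage from a meminfo-format file,
# computed as (MemTotal - MemAvailable) / MemTotal (an assumption).
ram_used_pct() {
    awk '/^MemTotal:/     { t = $2 }
         /^MemAvailable:/ { a = $2 }
         END { if (t > 0) printf "%d\n", (t - a) * 100 / t }' "${1:-/proc/meminfo}"
}

# classify_ram PCT: map a percentage to the action tier described above.
classify_ram() {
    if [ "$1" -ge 92 ]; then echo heal
    elif [ "$1" -ge 80 ]; then echo warn
    else echo ok
    fi
}

# Whitelist copied from the list above.
WHITELIST="systemd sshd tmux bash cron init containerd dockerd docker cloudflared tailscaled n8n nginx mongod redis postgres rabbitmq code-server mcp-watchdog resource-watchdog"

# is_whitelisted NAME: succeed if NAME is an exact whitelist entry.
is_whitelisted() {
    for name in $WHITELIST; do
        [ "$name" = "$1" ] && return 0
    done
    return 1
}

# term_then_kill PID: SIGTERM, wait 5 seconds, SIGKILL if still present.
term_then_kill() {
    kill -TERM "$1" 2>/dev/null || return 0
    sleep 5
    if kill -0 "$1" 2>/dev/null; then
        kill -KILL "$1" 2>/dev/null
    fi
    return 0
}
```

A caller would run `classify_ram "$(ram_used_pct)"`, and on `heal` pick the largest candidate process, skip it if `is_whitelisted` succeeds, and otherwise pass its PID to `term_then_kill`.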

2. CPU

1-second sample taken from /proc/stat
≥ 80% → 🟠 TG alert + the top 5 CPU-consuming processes + load average
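A 1-second sample from /proc/stat means reading the aggregate `cpu` line twice and comparing the deltas. The sketch below shows one way to do that. The function names are hypothetical, and counting `idle + iowait` as idle time is an assumption about how the real script defines "busy".

```shell
#!/bin/sh
# cpu_busy_pct LINE1 LINE2: busy percentage between two aggregate "cpu"
# lines from /proc/stat. Fields 2-9 (user..steal) make up total time;
# fields 5 and 6 (idle, iowait) are treated as idle.
cpu_busy_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        { total = 0
          for (i = 2; i <= 9 && i <= NF; i++) total += $i
          idle = $5 + $6
          if (NR == 1) { t0 = total; i0 = idle }
          else         { t1 = total; i1 = idle } }
        END { dt = t1 - t0
              if (dt > 0) printf "%d\n", (dt - (i1 - i0)) * 100 / dt
              else print 0 }'
}

# sample_cpu: take two readings one second apart and print the busy %.
sample_cpu() {
    first=$(head -n 1 /proc/stat)
    sleep 1
    second=$(head -n 1 /proc/stat)
    cpu_busy_pct "$first" "$second"
}
```

Keeping the arithmetic in `cpu_busy_pct` separate from the sampling means it can be checked against fixed input lines without waiting on a live machine.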

3. Disk (root partition)

≥ 80% → 💾 TG alert + the top 5 space-consuming folders (/var/log/* + /home/claude/*)
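A minimal sketch of the disk check, assuming the percentage comes from `df` and the folder ranking from `du`. The source does not say which tools the real script uses, and the function names are hypothetical.

```shell
#!/bin/sh
# disk_used_pct [MOUNT]: used percentage of a filesystem, parsed from the
# Capacity column of POSIX-format df output.
disk_used_pct() {
    df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# top_dirs: the five largest entries (in KB) under the two trees named above.
top_dirs() {
    du -sk /var/log/* /home/claude/* 2>/dev/null | sort -rn | head -n 5
}
```

The alert condition is then `[ "$(disk_used_pct /)" -ge 80 ]`, with the output of `top_dirs` attached to the message.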

4. Systemd service health

Lists failed units with systemctl --failed
Also checks every running service for NRestarts > 100 (zombie services stuck in a silent restart loop)
Any hit → 🔴 TG alert
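The NRestarts check could look roughly like this. `NRestarts` is a real systemd service property readable through `systemctl show`. The function names and the "unit count" intermediate format are assumptions made for this sketch.

```shell
#!/bin/sh
# restart_loopers [THRESHOLD]: read "unit count" pairs on stdin and print
# the units whose restart count exceeds the threshold (default 100).
restart_loopers() {
    awk -v max="${1:-100}" '$2 + 0 > max { print $1 }'
}

# collect_restarts: emit "unit NRestarts" for every running service.
collect_restarts() {
    systemctl list-units --type=service --state=running --no-legend --plain |
        awk '{ print $1 }' |
        while read -r unit; do
            n=$(systemctl show -p NRestarts --value "$unit")
            echo "$unit ${n:-0}"
        done
}
```

Usage would be `collect_restarts | restart_loopers 100`. A service restarting every 5 seconds for 12 days would sit at roughly 200,000 restarts, far above the threshold.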

5. Critical Docker containers

Current list: n8n
If it is not running → 🔥 TG alert + automatic docker start
Every check also confirms that the n8n PID has oom_score_adj = -500 (if it is not set, it is reapplied on the spot)
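The per-minute container check might be sketched as below. The `docker inspect -f` templates and the `/proc/<pid>/oom_score_adj` file are standard. The function names are hypothetical, and writing a negative value needs root.

```shell
#!/bin/sh
# container_status NAME: print the container state, e.g. "running".
container_status() {
    docker inspect -f '{{.State.Status}}' "$1" 2>/dev/null
}

# needs_oom_fix FILE: succeed if FILE does not currently contain -500.
needs_oom_fix() {
    [ "$(cat "$1" 2>/dev/null)" != "-500" ]
}

# enforce_n8n: start the container if needed, then verify the OOM score.
enforce_n8n() {
    if [ "$(container_status n8n)" != "running" ]; then
        docker start n8n              # the TG alert would be sent here too
    fi
    pid=$(docker inspect -f '{{.State.Pid}}' n8n)
    adj="/proc/$pid/oom_score_adj"
    if needs_oom_fix "$adj"; then
        printf '%s\n' -500 > "$adj"   # requires root
    fi
}
```

Reading the value before writing keeps the common case, where nothing has changed, down to a single file read per minute.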

6. Public domain health

Checks 7 live domains: notes / preview / vm / n8n / dashboard / salesbot/docs / lion
Alerts only on: 5xx / timeout / DNS failure
2xx / 3xx / 401 / 403 / 404 count as healthy (design-intended responses)
Remotion is excluded: it is an on-demand workflow and is expected to be down most of the time
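The alert rule above reduces to a small classifier on the HTTP status code. The sketch assumes `curl` is the probe. curl reports `000` through `%{http_code}` when no HTTP response arrives, which covers both timeouts and DNS failures. The function names and the 10-second timeout are assumptions.

```shell
#!/bin/sh
# classify_http CODE: "alert" for 5xx and for 000 (no HTTP response at
# all, e.g. timeout or DNS failure); everything else counts as healthy.
classify_http() {
    case "$1" in
        5??|000) echo alert ;;
        *)       echo ok ;;
    esac
}

# probe URL: print the HTTP status code, or 000 when the request fails.
probe() {
    code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$1") || true
    echo "${code:-000}"
}
```

For each domain the watchdog would call `classify_http "$(probe "$url")"`. Treating 401, 403, and 404 as healthy means login walls and intentionally private paths do not page anyone.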

Cooldown rules

Most alerts: the same alert type is not repeated on TG within 15 minutes
Auto-heal: 2-minute cooldown, so it cannot wipe out the whole machine in one go
State files: /tmp/resource-watchdog/{ram,cpu,disk,service,docker-n8n,domain,heal}.last
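One way to implement the cooldown with the `.last` state files is to store the epoch time of the last event in each file. Whether the real script stores a timestamp or relies on file mtime is not stated, so that detail and the function name `cooldown_ok` are assumptions.

```shell
#!/bin/sh
STATE_DIR="${STATE_DIR:-/tmp/resource-watchdog}"

# cooldown_ok KIND SECONDS [NOW]: succeed, and record NOW, if at least
# SECONDS have passed since the last recorded event of this KIND.
cooldown_ok() {
    kind=$1
    window=$2
    now=${3:-$(date +%s)}
    file="$STATE_DIR/$kind.last"
    mkdir -p "$STATE_DIR"
    last=$(cat "$file" 2>/dev/null || echo 0)
    if [ $(( now - ${last:-0} )) -lt "$window" ]; then
        return 1                      # still cooling down: stay quiet
    fi
    echo "$now" > "$file"
}
```

An alert path would be guarded with `cooldown_ok ram 900 && ...` (15 minutes), and the auto-heal path with `cooldown_ok heal 120 && ...` (2 minutes).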

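Three layers of n8n-specific protection

Layer 1: OOM immunity pass

Systemd unit: /etc/systemd/system/n8n-oom-protect.service (oneshot + RemainAfterExit + After=docker.service)
Sets oom_score_adj to -500 for the n8n container's main PID and its children
The kernel OOM killer then picks other processes ahead of n8n (-500 strongly deprioritises it as a victim; only -1000 would exempt it outright)
The watchdog enforces this every minute, so the setting is reapplied right after a Docker restart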
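Applying the score to "main PID + children" could be done by walking the process tree, as in the sketch below. It assumes `pgrep` (procps) is installed and uses hypothetical function names. The script that the real unit runs is not shown in this note.

```shell
#!/bin/sh
# pid_tree PID: print PID followed by all of its descendants, one per line.
pid_tree() {
    echo "$1"
    for child in $(pgrep -P "$1" 2>/dev/null); do
        pid_tree "$child"
    done
}

# protect_container NAME: set oom_score_adj to -500 on the container's
# main process and every descendant. Requires root.
protect_container() {
    main=$(docker inspect -f '{{.State.Pid}}' "$1") || return 1
    for pid in $(pid_tree "$main"); do
        { printf '%s\n' -500 > "/proc/$pid/oom_score_adj"; } 2>/dev/null || true
    done
}
```

New children inherit `oom_score_adj` from their parent at fork time, so the walk only matters for processes that already existed when the value was first set.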

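Layer 2: container health check + auto-restart

The resource watchdog checks .State.Status from docker inspect n8n
Not running → immediate docker start n8n + TG notification

Layer 3: auto-heal whitelist protection

Even when RAM reaches 92% and triggers auto-heal, n8n is on the whitelist, so it is never killed by mistake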

Cleanup done the same day (April 24)

Item	Action	Reason
`claude-api-proxy.service`	`stop` + `disable`	App files had already been moved into `archive/` (written April 11, abandoned April 19); the service was never disabled, which caused the 12-day restart loop
`openclaw-error-monitor.service`	`stop` + `disable`	OpenClaw was abandoned in its entirety on April 21 (see reference_cloudflare_tunnel_mgmt and feedback_no_lobster)
`lightdm.service`	`mask`	The VPS does not need a desktop GUI; the unit has always been in a failed state, which gave the watchdog false positives
`/swapfile` 8 GB	Added	Previously 0 swap, so running out of RAM meant instant death. Added swap + `vm.swappiness=10`
Syslog 199 MB	`truncate -s 0`	Clears the zombie log spam
Rotated syslog.1 184 MB + auth.log.1 17 MB	`rm`	Same as above
Journal vacuum	`--vacuum-time=3d`	Frees ~570 MB

Total freed: ~950 MB of disk.
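For reference, an 8 GB swapfile with `vm.swappiness=10` is typically set up with the commands below. This is a sketch of the usual sequence, not a record of what was actually run. The function names are hypothetical and everything in `setup_swap` needs root.

```shell
#!/bin/sh
# setup_swap: create and enable an 8 GB swapfile, then lower swappiness.
# Making either setting survive a reboot (fstab, sysctl.d) is not shown.
setup_swap() {
    fallocate -l 8G /swapfile
    chmod 600 /swapfile               # swap must not be world-readable
    mkswap /swapfile
    swapon /swapfile
    sysctl vm.swappiness=10
}

# swap_total_kb [FILE]: print SwapTotal in kB from a meminfo-format file.
swap_total_kb() {
    awk '/^SwapTotal:/ { print $2 }' "${1:-/proc/meminfo}"
}
```

After `setup_swap`, `swap_total_kb` should report a value close to 8388608 kB instead of 0.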

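Remotion on-demand workflow

See the project_remotion_ondemand memory entry. When Steven says "edit video" (「剪片」), I start Remotion Studio on port 3007; when he says "done" or "shut it down" (「搞掂 / 關返」), I run pkill -f "remotion studio". The domain watchdog already excludes it.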

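Edge cases not yet covered

Scenario	Status	Reason
Cloudflare tunnel drops on its own	Not covered	The tunnel has no history of failing, and on 2026-04-24 Steven explicitly said to leave it alone
External network completely down	Not covered	Once the network is down, the watchdog cannot reach you on TG either
Hardware failure	Not covered	Cannot be solved at the software layer
DDoS attack	Not covered	Needs a different approach

All of the above are orders of magnitude less frequent than "RAM exhaustion / n8n crash", so the added complexity is not worth it for now.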

Feedback rule reinforced

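Today confirmed feedback_autonomous_synthesis.md once again: Steven is not in a position to weigh technical options, so asking him to pick between A/B/C adds no value. Synthesize the best bet and ship it.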

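2026-04-24 evening addendum: TG plugin collision discovered

Several PLUGIN_DEAD TG alerts came in; on investigation the plugin had not actually died:

claude --channels plugin:telegram@... was often being typed by hand without TELEGRAM_STATE_DIR set
This polluted the bot.pid belonging to tmux cc, so the watchdog misjudged the plugin as dead
Two-layer fix: a strip at the .bashrc level + self-healing at the watchdog level

See reference_tg_plugin_collision for details.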
