NEXO 7.11.2 — Watchdog reaper + Enforcer respects restart_required

Published 2026-04-27. Patch release over v7.11.1 — two reliability fixes in the same family ("components ignoring signals they should respect"), no API change.

Why

v5.8.1 taught the watchdog to leave running jobs alone. Every cron now opens a row in cron_runs at start (started_at set, ended_at NULL) and the watchdog reads that row to tell currently running from missed/stuck. That fix closed the loop where the watchdog kept kickstart -k'ing deep-sleep mid-flight (2026-04-14 to 2026-04-17), killing the worker that was actually doing the job.

The same restraint became the next failure mode. When a wrapper child genuinely hangs (e.g. headless claude --bare blocked on an MCP that flagged mcp_restart_required after a brain update), the row stays open forever. The next tick sees the same open row and skips with "Another instance running. Skipping." The watchdog only logged WARN. morning-agent, followup-runner and orchestrator-v2 went silent for days (2026-04-24 to 2026-04-27) for exactly this reason: a single zombie wrapper held the slot indefinitely and nothing was authorized to evict it.

What changed

A new sweep run_stuck_reaper() runs at the top of every watchdog tick, before the per-monitor loop. It reads every cron_runs row with ended_at IS NULL and compares its age to the per-cron threshold from stuck_after_seconds in src/crons/manifest.json. Anything past threshold gets reaped:

Live wrapper. The reaper sends SIGTERM to the wrapper PID (pgrep -f "nexo-cron-wrapper.sh CRON_ID "). The wrapper's existing trap catches it, forwards to the child, and runs finalize_row — the row closes cleanly with exit_code=143. After a 10-second grace, any survivor (wrapper plus descendants) gets SIGKILL via pkill -KILL -P.
Orphan zombie row. The wrapper PID is gone but ended_at is still NULL. Without intervention the next tick still skips ("Another instance running"). The reaper closes the row in-band: ended_at=now, exit_code=137, summary='stuck row reaped by watchdog: wrapper PID gone'.
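The two branches above can be sketched in Python as follows. This is an illustration only: the real sweep lives in the shell watchdog, and the CronRun dataclass, the epoch-second timestamps, and the injected threshold_for, wrapper_alive and terminate callables are all hypothetical stand-ins for the cron_runs table, pgrep and the SIGTERM/SIGKILL sequence.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class CronRun:
    """Hypothetical in-memory stand-in for one cron_runs row."""
    cron_id: str
    started_at: float                  # epoch seconds (assumed representation)
    ended_at: Optional[float] = None   # None models ended_at IS NULL
    exit_code: Optional[int] = None
    summary: str = ""


def reap_stuck_rows(
    rows: Iterable[CronRun],
    now: float,
    threshold_for: Callable[[str], float],
    wrapper_alive: Callable[[str], bool],
    terminate: Callable[[str], None],
) -> int:
    """Sweep open rows and return how many were reaped."""
    reaped = 0
    for row in rows:
        if row.ended_at is not None:
            continue  # row already closed, nothing to do
        if now - row.started_at <= threshold_for(row.cron_id):
            continue  # still within its allowed runtime: leave it alone
        if wrapper_alive(row.cron_id):
            # Live wrapper: signal it and let its own trap close the row.
            terminate(row.cron_id)
        else:
            # Orphan row: no process is left to close it, so close it in-band.
            row.ended_at = now
            row.exit_code = 137
            row.summary = "stuck row reaped by watchdog: wrapper PID gone"
        reaped += 1
    return reaped
```

Passing the liveness check and the kill action in as callables keeps the decision logic testable without spawning real processes.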

Why the v5.8.1 bug cannot recur

Two safeguards keep deep-sleep safe:

Generous default. STUCK_DEFAULT_SECONDS=43200 (12h). Any cron not in the manifest gets the global default, well above any legitimate worst case.
Explicit per-cron overrides for known long-runners. deep-sleep: 28800 (8h), sleep: 14400 (4h), evolution: 14400 (4h, weekly heavy run). Short-runners get tighter bounds: morning-agent: 1800, followup-runner: 1800, email-monitor: 600.

In addition, cron_id='watchdog' is hard-coded into STUCK_REAPER_SKIP, so the watchdog can never reap itself mid-tick.
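A minimal sketch of how the threshold lookup and the skip list combine, using the values quoted above. The flat STUCK_AFTER_SECONDS dict is an assumed in-memory shape (the actual layout of src/crons/manifest.json is not shown here), and stuck_threshold and should_reap are hypothetical helper names.

```python
STUCK_DEFAULT_SECONDS = 43200                 # 12h global default
STUCK_REAPER_SKIP = frozenset({"watchdog"})   # never reap the watchdog itself

# Assumed flattened view of the per-cron stuck_after_seconds overrides.
STUCK_AFTER_SECONDS = {
    "deep-sleep": 28800,       # 8h
    "sleep": 14400,            # 4h
    "evolution": 14400,        # 4h
    "morning-agent": 1800,     # 30 min
    "followup-runner": 1800,   # 30 min
    "email-monitor": 600,      # 10 min
}


def stuck_threshold(cron_id: str, overrides: dict = STUCK_AFTER_SECONDS) -> int:
    """Per-cron threshold, falling back to the global default."""
    return overrides.get(cron_id, STUCK_DEFAULT_SECONDS)


def should_reap(cron_id: str, age_seconds: float,
                overrides: dict = STUCK_AFTER_SECONDS) -> bool:
    """True only when the row is older than its threshold and not on the skip list."""
    if cron_id in STUCK_REAPER_SKIP:
        return False
    return age_seconds > stuck_threshold(cron_id, overrides)
```

With these values, a deep-sleep row that has been open for 4 hours is left alone, while a morning-agent row open for 31 minutes is eligible for reaping.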

Observability

New counter TOTAL_REAPED exposed in three places: watchdog-status.json (summary.reaped), the human report header (REAPED:), and the final log line (REAPED=N). When the sweep does nothing, none of these show motion — you only see action when something genuinely needed reaping.

Tests

6 new tests in tests/test_watchdog_stuck_reaper.py: fresh in-flight row left alone (v5.8.1 regression guard), per-cron threshold respected (deep-sleep 8h not reaped at 4h), orphan zombie row cleaned in-band with exit_code=137, real wrapper killed via SIGTERM with the trap closing the row at exit_code=143, cron_id='watchdog' never reaped, default 12h threshold applied to crons not in manifest. The 3 existing watchdog tests stay green: 9 watchdog tests pass total.

Sibling fix: Enforcer respects the restart_required marker

The Guardian/Enforcer (HeadlessEnforcer in src/enforcement_engine.py) periodically injects <system-reminder> blocks asking the agent to call nexo_* tools (heartbeat, session_diary_write, smart_startup, guard_check, etc). When the MCP server has the ~/.nexo/runtime/operations/mcp-restart-required.json marker on disk — written by plugins/update.py after a nexo update that actually changes runtime .py bytes (cf. v7.11.0 fingerprint gating) — every one of those reminders triggered a tool call that immediately failed with mcp_restart_required. The agent burned cycles on guaranteed no-ops until the operator restarted the client.

v7.11.2 adds a gate at the top of HeadlessEnforcer._enqueue(): if the prompt mentions nexo_ and the marker file exists, skip + log SKIP: ... mcp_restart_required marker present. Reminders that don't reference nexo_* (R23 deploy guards, R25 nora/maria read-only, etc.) still fire — they don't depend on the MCP being live. The check is cached per-instance with a 30s TTL so we don't stat the marker on every _enqueue call. Conservative: any path/IO error in the resolver returns False so the gate never blocks legitimate enforcement.
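The gate described above might look roughly like the sketch below. It is not the code in src/enforcement_engine.py: the RestartMarkerGate class, its method names and the injectable clock are hypothetical, chosen only to show the nexo_ check, the 30s per-instance cache and the fail-open behaviour.

```python
import os
import time


class RestartMarkerGate:
    """Hypothetical helper: decides whether a reminder should be skipped."""

    def __init__(self, marker_path, ttl_seconds=30.0, clock=time.monotonic):
        self._marker_path = marker_path
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the TTL can be tested
        self._cached = False
        self._checked_at = None      # None means "never checked"

    def _marker_present(self):
        """Check the marker file at most once per TTL window."""
        now = self._clock()
        if self._checked_at is None or now - self._checked_at >= self._ttl:
            try:
                path = os.path.expanduser(self._marker_path)
                self._cached = os.path.exists(path)
            except (OSError, TypeError, ValueError):
                # Fail open: a broken path must never block enforcement.
                self._cached = False
            self._checked_at = now
        return self._cached

    def should_skip(self, prompt):
        """Skip only reminders that mention nexo_ while the marker exists."""
        return "nexo_" in prompt and self._marker_present()
```

A caller would build one gate per enforcer instance, for example RestartMarkerGate("~/.nexo/runtime/operations/mcp-restart-required.json"), and return early from the enqueue path when should_skip(prompt) is true. One consequence of the cache is that a marker written or removed mid-window is noticed up to 30 seconds late.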

This is the sibling of the watchdog reaper. Both fixes follow the same shape — a NEXO component that ignored a signal already on disk and burned cycles or held slots open as a result. v7.11.0 added the runtime fingerprint precisely to avoid forcing restarts when not needed; v7.11.2 makes the rest of NEXO actually respect the restart marker when one exists.

Full changelog entry · src/scripts/nexo-watchdog.sh · src/enforcement_engine.py