The bug this closes
Between 2026-04-14 and 2026-04-17 the nightly deep-sleep on the reference install stopped producing extractions, synthesis, or applied artifacts. The Phase 2 worker would start, begin processing Session 1 of 20, and die with Automation backend error (exit 143). Thirty minutes later a new worker started. Same failure. Same session. Indefinitely.
Extraction checkpoints caching a transient overloaded_error as a permanent “0 findings” result made the symptom look like a model problem. The real culprit was a feedback loop between three components, each of which worked correctly in isolation.
The feedback loop
• scripts/nexo-cron-wrapper.sh only wrote to cron_runs at the end of the job. Any wrapper killed mid-flight produced zero database records.
• scripts/nexo-watchdog.sh, running every 1800 s, used cron_runs as its source of truth for “has this cron run recently?”. With no row for deep-sleep it decided the cron was stuck and called launchctl kickstart -k "gui/<uid>/com.nexo.deep-sleep". The -k flag kills the running instance first — so the watchdog was actively terminating its own worker.
• scripts/deep-sleep/extract.py cached the first failure (Anthropic’s overloaded_error) in a per-session checkpoint and reused it forever. Even with the kickstart loop broken, the same session would keep reporting 0 findings.
Two-phase cron_runs recording
The cron wrapper now INSERTs a row with ended_at = NULL at start and UPDATEs it at end. The foreground command runs as a background job so the TERM/INT/HUP trap fires immediately; the trap forwards SIGTERM to the child and closes the row with exit_code=143 and an explicit Killed by SIGTERM error string. A wrapper killed mid-flight — by the watchdog, by a shutdown, by a manual launchctl stop — now shows up as a failed run instead of silently disappearing from the audit trail.
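The two-phase idea can be sketched in Python (the real wrapper is a shell script); the cron_runs column names here are illustrative assumptions, not the actual schema:

```python
import signal
import sqlite3
import sys
import time

# Sketch of the two-phase recording, against an assumed cron_runs schema;
# the real wrapper is a shell script driving the same state machine.
SCHEMA = """CREATE TABLE IF NOT EXISTS cron_runs (
    id INTEGER PRIMARY KEY,
    job TEXT,
    started_at REAL,
    ended_at REAL,
    exit_code INTEGER,
    error TEXT
)"""

def open_run(db, job):
    """Phase 1: record the run the moment it starts, with ended_at = NULL."""
    cur = db.execute(
        "INSERT INTO cron_runs (job, started_at, ended_at) VALUES (?, ?, NULL)",
        (job, time.time()),
    )
    db.commit()
    return cur.lastrowid

def close_run(db, run_id, exit_code, error=None):
    """Phase 2: close the row, whether the job finished or was killed."""
    db.execute(
        "UPDATE cron_runs SET ended_at = ?, exit_code = ?, error = ? WHERE id = ?",
        (time.time(), exit_code, error, run_id),
    )
    db.commit()

def install_term_trap(db, run_id):
    """Mirror of the shell trap: a SIGTERM closes the row as exit 143."""
    def on_term(signum, frame):
        close_run(db, run_id, 143, "Killed by SIGTERM")
        sys.exit(143)
    signal.signal(signal.SIGTERM, on_term)
```

The key property is that phase 1 commits before any work starts, so a kill at any later point still leaves a visible (open or closed) row.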
In-flight detection in the watchdog
With the new wrapper, any cron that is currently running produces a row where started_at is set and ended_at is NULL. The watchdog treats that as a first-class state: never kickstart -k an in-flight row. Only if the row is older than 3×max_stale and the worker process is provably dead (the configured proc_grep pattern no longer matches any PID) does the watchdog escalate to re-execution. No process check ⇒ no intervention. The kickstart loop is gone.
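That decision table can be sketched as a pure function; the argument names are illustrative stand-ins for the watchdog's real config, not its actual interface:

```python
import time

def watchdog_action(started_at, ended_at, max_stale, proc_alive, now=None):
    """Decide what to do with one cron_runs row (argument names illustrative).

    proc_alive: whether the configured proc_grep pattern matched any PID.
    """
    now = time.time() if now is None else now
    if ended_at is not None:
        return "completed"    # finished run: normal staleness rules apply
    if now - started_at <= 3 * max_stale:
        return "leave-alone"  # in-flight and plausibly healthy
    if proc_alive:
        return "leave-alone"  # old, but the worker is provably alive
    return "re-execute"       # stale AND dead: escalate, never kickstart -k
```

Note that "re-execute" is only reachable when both conditions hold; an in-flight row with a live process can never be killed.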
Transient vs deterministic failures
The extractor now classifies CLI failures into transient (overloaded_error, rate_limit_error, api_error, timeout, signal) and deterministic (json_parse, unknown). Transient failures do not persist a poisoned checkpoint — the next deep-sleep run gets a clean retry. Deterministic failures increment error_count and are skipped once the counter reaches MAX_POISON_ATTEMPTS (3). A session that trips on a bad transcript no longer stalls every subsequent run behind it.
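The checkpoint policy might look roughly like this; the category names mirror the classes above, but the checkpoint keys (result, error_count, skip, retry) are illustrative assumptions:

```python
MAX_POISON_ATTEMPTS = 3  # cap from the text

# Category names mirror the failure classes named above; the checkpoint
# dict layout is an illustrative assumption, not the extractor's real format.
TRANSIENT = {"overloaded_error", "rate_limit_error", "api_error", "timeout", "signal"}
DETERMINISTIC = {"json_parse", "unknown"}

def on_failure(checkpoint, kind):
    """Update a per-session checkpoint dict after a failed extraction."""
    if kind in TRANSIENT:
        # Never persist a poisoned result: the next run retries from scratch.
        checkpoint.pop("result", None)
        checkpoint["retry"] = True
    else:
        # Deterministic failure: count it, and skip once the cap is reached.
        checkpoint["error_count"] = checkpoint.get("error_count", 0) + 1
        checkpoint["skip"] = checkpoint["error_count"] >= MAX_POISON_ATTEMPTS
    return checkpoint
```

The asymmetry is the point: transient failures leave no trace in the checkpoint, while deterministic ones accumulate until the session is benched.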
The extractor also writes a slim version of shared-context.txt (first 200 lines plus metadata) and points the per-session prompt at that file instead of streaming the full 400+ KB DB dump into the Claude CLI subprocess on every single extraction.
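A minimal sketch of the slim-context write; only the 200-line cutoff comes from the text, and the function name and header format are hypothetical:

```python
from pathlib import Path

SLIM_LINES = 200  # cutoff from the text; everything else here is illustrative

def write_slim_context(full_path, slim_path):
    """Write the first SLIM_LINES lines of the shared context, plus a
    one-line metadata header, to a slim file the per-session prompt can use."""
    lines = Path(full_path).read_text().splitlines()
    kept = lines[:SLIM_LINES]
    header = f"# slim context: {len(kept)} of {len(lines)} lines from {Path(full_path).name}\n"
    Path(slim_path).write_text(header + "\n".join(kept) + "\n")
    return len(kept)
```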
Silent heal for existing installs
A fix at the source only helps future runs. Installs that have already been running the buggy loop have accumulated junk: poisoned checkpoints, stale locks, dangling cron_runs rows with NULL ended_at, bloated .watchdog-fails counters. The new auto_update._heal_deep_sleep_runtime() runs on every post-sync pass and is an idempotent no-op on clean runtimes. It:
• purges checkpoints containing overloaded_error,
• releases lock files older than 6 h,
• closes cron_runs rows older than 6 h with an explicit healed by auto_update (pre-5.8.1 wrapper left row open) marker, and
• resets .watchdog-fails counters older than 24 h.
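One heal step, closing dangling cron_runs rows, can be sketched like this, again against an assumed schema (the exit_code sentinel is illustrative):

```python
import sqlite3
import time

SIX_HOURS = 6 * 3600

def heal_dangling_runs(db, now=None):
    """Close cron_runs rows left open more than 6 h by the pre-5.8.1 wrapper.

    Idempotent: a second pass, or a clean runtime, matches zero rows.
    """
    now = time.time() if now is None else now
    cur = db.execute(
        """UPDATE cron_runs
           SET ended_at = ?, exit_code = -1,
               error = 'healed by auto_update (pre-5.8.1 wrapper left row open)'
           WHERE ended_at IS NULL AND started_at < ?""",
        (now, now - SIX_HOURS),
    )
    db.commit()
    return cur.rowcount  # 0 on a clean runtime: the heal is a no-op
```

Because the WHERE clause only matches open rows past the age cutoff, an in-flight run from the current night is never touched.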
Upgrading
Run nexo update. The heal runs automatically during post-sync, the new wrapper and watchdog take over immediately, and the next 04:30 deep-sleep produces real artifacts again. Anyone whose install was silently stuck in the kickstart loop, even if they never noticed, gets the residue cleaned up transparently.