NEXO 7.9.32 — Email monitor recovery hardening
Published 2026-04-26. Patch release over v7.9.31.
NEXO's email monitor processes inbound mail by spawning a Claude Code worker per email. Most replies are short (yes/no, one paragraph) and finish in seconds, but a meaningful slice is long-running work: drafting a presentation for a client, building a multi-step report, working through code. When that worker dies mid-flight — OOM, timeout, or, in 7.9.32's reproducer, a Brain release that updates the runtime under a live worker — the next retry previously started from scratch. The new attempt re-read the same incoming email, re-thought the same plan, re-drafted the same files, and burned tokens duplicating work the previous attempt had already produced. Sometimes it left half-written files in the working directory with no narrative context.
What 7.9.32 changes
Two complementary changes:
1. Recovery window: 24h → 7 days
The periodic _recover_unreplied_processed sweep re-queues emails that the DB marked processed but which never actually got a reply sent. Before 7.9.32 it looked back 24 hours. That was tight enough to fail under a normal but unfortunate scenario: a single email falls between several Brain releases in a short window (we shipped four releases on 2026-04-26 alone), and by the time the next sweep runs, the email is older than 24h and aged out. Permanent limbo. v7.9.32 widens the lookback to 7 days (168h). 7 days absorbs a normal release cadence without re-triggering very old “stuck” emails indefinitely.
2. Per-email recovery checkpoints
Whenever a worker run does not finish OK (timeout, non-zero exit, AutomationBackendUnavailableError, unexpected exception), the email monitor now persists a small JSON record at ~/.nexo/nexo-email/checkpoints/<sha1(message_id)[:16]>.json with:
- message_id, subject, first_attempt_at, last_attempt_at, attempts
- files_touched — absolute paths in the working directory whose mtime advanced during the failed run, capped at 50 entries
- last_assistant_text — the last narration extracted from Claude Code's JSON output (capped at 4000 chars)
- last_error — e.g. “exit 137 (oom)”, “timeout after 1800s”
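A sketch of what writing and merging such a record could look like. Field names follow the list above; the helper name, caps, and merge behavior (union of files_touched, attempts incremented, first_attempt_at preserved) are assumptions drawn from this description, not NEXO's actual implementation:

```python
import json
import time
from pathlib import Path

MAX_FILES = 50    # cap on files_touched entries
MAX_TEXT = 4000   # cap on last_assistant_text chars

def write_checkpoint(path: Path, message_id: str, subject: str,
                     files_touched: list[str], last_assistant_text: str,
                     last_error: str) -> dict:
    """Persist a per-email recovery checkpoint, merging with any prior attempt."""
    now = time.time()
    record = {
        "message_id": message_id,
        "subject": subject,
        "first_attempt_at": now,
        "last_attempt_at": now,
        "attempts": 1,
        "files_touched": files_touched[:MAX_FILES],
        "last_assistant_text": last_assistant_text[:MAX_TEXT],
        "last_error": last_error,
    }
    if path.exists():  # repeated failure: merge with the previous record
        prev = json.loads(path.read_text())
        record["first_attempt_at"] = prev["first_attempt_at"]
        record["attempts"] = prev["attempts"] + 1
        # order-preserving union of touched files, still capped at 50
        merged = dict.fromkeys(prev["files_touched"] + files_touched)
        record["files_touched"] = list(merged)[:MAX_FILES]
    path.write_text(json.dumps(record, indent=2))
    return record
```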
On the next attempt the email monitor reads the checkpoint and injects a “Previous attempt context” block into the Claude Code prompt. The retry sees: how many attempts have already happened, which files the previous attempt left behind (so it can decide whether to pick them up or start clean), and what the previous attempt was thinking when it died. A successful reply or escalation deletes the checkpoint. Stale files older than 7 days are pruned automatically by _email_checkpoint_cleanup on every monitor tick.
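The injected block could be rendered roughly like this — the exact wording NEXO uses is not shown here, and the function name is illustrative; only the three kinds of content (attempt count, leftover files, last narration) come from the description above:

```python
def render_previous_context(cp: dict) -> str:
    """Render a checkpoint dict into a prompt block for the retry worker."""
    lines = [
        "Previous attempt context:",
        f"- attempts so far: {cp['attempts']}",
        f"- last error: {cp['last_error']}",
    ]
    if cp.get("files_touched"):
        lines.append("- files left behind by the previous attempt:")
        lines += [f"    {f}" for f in cp["files_touched"]]
    if cp.get("last_assistant_text"):
        lines.append("- last narration before failure:")
        lines.append(f"    {cp['last_assistant_text']}")
    return "\n".join(lines)
```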
Why hash the Message-ID
RFC 5322 Message-IDs contain <, >, @ and other characters that mix badly with filesystems and shell tooling on macOS. The checkpoint helper hashes the Message-ID and keeps a 16-character hex prefix of the SHA-1 digest — 64 bits, far more than enough to rule out collisions among the few hundred emails NEXO handles per operator, while keeping filenames short enough to skim during a debug session.
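The path derivation matches the filename scheme given above (sha1(message_id)[:16] + ".json" under ~/.nexo/nexo-email/checkpoints/); the helper name itself is an assumption:

```python
import hashlib
from pathlib import Path

# Directory from the release note; the constant name is illustrative.
CHECKPOINT_DIR = Path.home() / ".nexo" / "nexo-email" / "checkpoints"

def checkpoint_path(message_id: str) -> Path:
    """Map a raw RFC 5322 Message-ID to a filesystem-safe checkpoint path."""
    digest = hashlib.sha1(message_id.encode("utf-8")).hexdigest()[:16]
    return CHECKPOINT_DIR / f"{digest}.json"
```

The same Message-ID always maps to the same file, and the `<`, `>`, `@` characters never reach the filesystem.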
Best-effort everywhere
Reads, writes, deletes, and cleanup all degrade gracefully on IO or parse errors. A misbehaving filesystem cannot block the worker; the worst case is "no recovery context", which simply means the retry behaves like 7.9.31. The helpers log a warning and move on.
Verification
15 new unit tests in tests/test_email_monitor_checkpoints.py cover the helpers from every angle: write+read round trip, repeated attempts merging the files_touched set, the 50-file cap, the human-readable previous-progress block render, empty/None input handling, idempotent delete, the 7-day cleanup window, JSON result extraction, plain-text fallback, the 4000-char truncation, filesystem-safe SHA-1 path, and missing-checkpoint reads returning None. Wider regression sweep including the 7.9.30/7.9.31 override-mode and stop_sequences tests stays green.