NEXO 5.2.0: Bilingual Response Discipline & Closed Cortex-Quality Loop

5.2.0 is a focused minor release. No breaking changes. No bootstrap, startup, Deep Sleep, or client-parity surfaces touched. It closes two concrete gaps in the Cortex layer, both identified during the audit of the response-contract behaviour that landed in 5.1.

The response-contract recap

When a NEXO agent calls nexo_task_open, the runtime does not just accept the task. A pure function called evaluate_response_confidence inspects the goal, task type, area, evidence references, verification step, and stakes, and returns a response contract that tells the agent how to answer: answer (you can respond directly), verify (respond but confirm with evidence first), ask (you are missing information, request it), or defer (do not answer yet, this is high-stakes and under-evidenced).

This is deterministic reasoning on top of the LLM — no hidden model call, no temperature, just a 50-line function with an explicit score and an explicit decision tree. The advantage is auditability: every response-contract decision comes with a reasons array that explains exactly why it landed where it did. The disadvantage, pre-5.2.0, was that the function had two real limits: it was monolingual, and it could only push the score down.
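As a rough picture of what such a contract could look like in code, here is a minimal sketch. The four mode strings and the idea of a score plus a reasons array come from the description above; the class names ResponseMode and ResponseContract and the field names are hypothetical and are not taken from the NEXO source.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ResponseMode(str, Enum):
    """The four response disciplines described above."""
    ANSWER = "answer"   # respond directly
    VERIFY = "verify"   # respond, but confirm with evidence first
    ASK = "ask"         # information is missing, request it
    DEFER = "defer"     # high-stakes and under-evidenced, do not answer yet


@dataclass
class ResponseContract:
    """Hypothetical container: a mode, its score, and the audit trail."""
    mode: ResponseMode
    score: int
    reasons: List[str] = field(default_factory=list)


contract = ResponseContract(ResponseMode.VERIFY, 50, ["no evidence refs"])
```

Because the enum subclasses str, a contract like this serialises to the same plain strings the text uses ("answer", "verify", "ask", "defer").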

Bilingual high-stakes detection

The HIGH_STAKES_KEYWORDS set that triggers the high-stakes penalty on the score was English-only: roughly 25 words such as production, deploy, migration, customer, credential, and delete. A goal written in Spanish like "migrar la base de datos de producción al nuevo servidor" ("migrate the production database to the new server") matched none of them. It had the same risk profile as its English twin, yet the runtime treated it as an innocuous task.

5.2.0 adds HIGH_STAKES_KEYWORDS_ES: about 45 Spanish keywords including both accented and unaccented variants for the ones that carry diacritics (crítico/critico, producción/produccion, facturación/facturacion, migración/migracion, reputación/reputacion, público/publico, médico/medico). User prompts in the wild mix both forms, so accepting only one is effectively accepting neither. _detect_high_stakes now matches against the union of both sets.
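The union matching can be sketched as follows. This is an illustration under assumptions, not the shipped code: both sets below are small subsets of the real ones, detect_high_stakes is a stand-in for _detect_high_stakes, and whole-word matching after NFC normalisation is a choice made for this sketch (the real function may match differently).

```python
import re
import unicodedata

# Illustrative subsets only; the real sets hold roughly 25 and 45 entries.
HIGH_STAKES_KEYWORDS = {"production", "deploy", "migration", "customer",
                        "credential", "delete"}
HIGH_STAKES_KEYWORDS_ES = {
    "crítico", "critico", "producción", "produccion",
    "migración", "migracion", "facturación", "facturacion",
}

_WORD_RE = re.compile(r"\w+")


def detect_high_stakes(goal: str) -> bool:
    """Return True if a keyword from either language appears as a whole word."""
    # NFC normalisation makes "o" + combining accent equal to a precomposed "ó".
    text = unicodedata.normalize("NFC", goal).lower()
    words = set(_WORD_RE.findall(text))
    return bool(words & (HIGH_STAKES_KEYWORDS | HIGH_STAKES_KEYWORDS_ES))
```

Listing both producción and produccion explicitly, rather than stripping accents at match time, mirrors the approach described above and keeps the keyword list readable as data.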

Negation-aware detection

Keyword matching has a subtler, inherent failure mode: boundary statements. A user who writes "refactor del parser interno sin tocar producción" ("refactor the internal parser without touching production") is explicitly disclaiming the sensitive area. The word producción appears in the string, but the user is stating where they will not go, not where they will act. Pre-5.2.0, this tripped the high-stakes flag anyway because the raw keyword was physically present: a false positive.

5.2.0 adds NEGATION_PATTERNS: a small, conservative list of regex patterns that match these boundary statements bilingually. Examples: no tocar prod, sin afectar, nunca borrar, evitar eliminar, without touching, don't modify, avoid deleting. When any of these match, _detect_high_stakes short-circuits and returns False even if a high-stakes keyword is physically present. The effect is that the user can now describe the safe zone of their task without accidentally flagging the whole thing as a production operation.
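A minimal sketch of the short-circuit, assuming regexes modelled on the examples listed above. The concrete patterns, the KEYWORDS subset, and the function name are hypothetical; only the behaviour (any negation match returns False before keywords are consulted) is taken from the text.

```python
import re

# Hypothetical patterns built from the examples in the text; the shipped list may differ.
NEGATION_PATTERNS = [
    re.compile(r"\b(?:no|sin|nunca|evitar)\s+(?:tocar|afectar|borrar|eliminar|modificar)\b"),
    re.compile(r"\bwithout\s+(?:touching|modifying|deleting)\b"),
    re.compile(r"\b(?:don't|do not|never)\s+(?:touch|modify|delete)\b"),
    re.compile(r"\bavoid\s+(?:touching|modifying|deleting)\b"),
]

# Small illustrative keyword subset covering both languages.
KEYWORDS = {"production", "prod", "producción", "produccion", "delete", "borrar"}


def detect_high_stakes(goal: str) -> bool:
    """Keyword match, suppressed when the goal contains a boundary statement."""
    text = goal.lower()
    # Short-circuit: a boundary statement wins even if a keyword is present.
    if any(pattern.search(text) for pattern in NEGATION_PATTERNS):
        return False
    return bool(set(re.findall(r"\w+", text)) & KEYWORDS)
```

One consequence of short-circuiting in this sketch is that a single boundary phrase suppresses the flag for the whole goal, which is presumably why the pattern list is kept small and conservative.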

Positive signals on the confidence score

Before 5.2.0, the confidence score inside evaluate_response_confidence was a pure penalty accumulator. Base 85, minus the penalties for unknowns, missing evidence, missing verification path, and high-stakes context. There was no mechanism to reward a task that had actually loaded the right context or that operated inside a known project area. The score drifted downward even when the agent was well-prepared.

5.2.0 adds two optional kwargs:

pre_action_context_hits: int — adds +min(10, hits*2) when the pre-action context lookup returned relevant learnings, decisions, or past events for this task. The cap keeps the boost bounded.
area_has_atlas_entry: bool — adds +5 when the task's area is a known entry in project-atlas.json. Known projects start from more trusted footing than unknown ones.

Both signals are bounded (at most +10 and +5, so +15 combined) and can never override a real risk penalty. A high-stakes task with missing evidence stays high-stakes and under-evidenced no matter how much context the agent loaded. But a routine task that ran the right searches finally gets credit for it in the score.
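The arithmetic of the two boosts can be sketched in a few lines. The kwarg names and the +min(10, hits*2) and +5 rules are from the text; the helper name apply_positive_signals, the treatment of negative hit counts as zero, and the final clamp to 0..100 are assumptions made for this illustration.

```python
def apply_positive_signals(score: int,
                           pre_action_context_hits: int = 0,
                           area_has_atlas_entry: bool = False) -> int:
    """Add the two bounded boosts to a score that already includes penalties."""
    # Each context hit is worth 2 points, capped at 10 in total.
    boost = min(10, max(0, pre_action_context_hits) * 2)
    # A known project area adds a flat 5 points.
    if area_has_atlas_entry:
        boost += 5
    # Assumed bounds: keep the result inside 0..100.
    return max(0, min(100, score + boost))
```

With these rules, three context hits raise a score of 50 to 56, and twenty hits plus an atlas entry raise it to 65, since the boost saturates at 15. The defaults leave the score unchanged, which is consistent with existing callers seeing no difference.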

Numeric safeguard over the decision tree

The boolean decision tree inside evaluate_response_confidence covers every obvious case. But tasks can accumulate soft penalties without tripping any single boolean rule. Imagine a task with no unknowns, no high-stakes context, no verification step, and no evidence refs: the tree maps it to verify, and the score ends up at 50 exactly.

5.2.0 adds a monotonic numeric safeguard on top of the tree. After the boolean rules pick a mode, two extra checks run:

If mode == "answer" and final_score < 50, downgrade to verify.
If mode == "verify" and high_stakes and final_score < 30, downgrade to defer.

The safeguard is monotonic: it can only make the response discipline stricter, never looser. This means it can never break an existing pass-case; it can only catch edge cases where soft penalties stacked below the threshold of any single rule but still meant the runtime had no business answering directly. The reasons array gets a dedicated "numeric safeguard: ..." entry so the downgrade is explicit.
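The two checks can be sketched as a small post-processing step. The thresholds and the "numeric safeguard:" prefix come from the text; the function name, the exact wording after the prefix, and the decision to run the checks in sequence (so an answer that drops to verify can drop again to defer) are assumptions of this sketch.

```python
from typing import List


def apply_numeric_safeguard(mode: str, final_score: int,
                            high_stakes: bool, reasons: List[str]) -> str:
    """Tighten the mode picked by the boolean tree; never loosen it."""
    if mode == "answer" and final_score < 50:
        reasons.append(f"numeric safeguard: score {final_score} < 50, answer -> verify")
        mode = "verify"
    # Runs after the first check, so a downgraded answer is re-examined here.
    if mode == "verify" and high_stakes and final_score < 30:
        reasons.append(f"numeric safeguard: score {final_score} < 30, verify -> defer")
        mode = "defer"
    return mode


reasons: List[str] = []
mode = apply_numeric_safeguard("answer", 45, False, reasons)
```

In this example mode ends up as "verify" and reasons holds a single safeguard entry. Modes the checks do not mention, such as ask and defer, pass through untouched.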

Cortex-quality cache reader

The second real gap was outside the response-contract layer, over in the Cortex quality cron. Since 5.1.0, src/scripts/nexo-cortex-cycle.py has been running every 6 hours and writing a full quality snapshot to $NEXO_HOME/operations/cortex-quality-latest.json. The cron's own docstring makes an explicit promise:

"Persists the snapshot to ~/claude/operations/cortex-quality-latest.json so dashboards / morning briefings can read fresh metrics without re-running the SQL."

That reader never existed. handle_cortex_quality in src/plugins/cortex.py was still re-computing the summary from SQL on every call. The cache file sat there for weeks being overwritten every 6h with fresh data, and nobody ever read it. Not a bug. Just a half-delivered promise.

5.2.0 closes it. The handler now attempts the cache first when days is 7 or 1 (the two windows the cron actually writes), falls back silently to the live SQL path on any failure, and returns an observable "source": "cache" | "live" field in its JSON response. Failure conditions that trigger fallback:

Cache file does not exist
captured_at is older than 6h 30m (6h cron + 30m slack for a slightly-late run)
schema field does not match the current version
JSON is corrupt or malformed
Requested window is not cached (e.g. days=30 always hits live)

The cache is a performance optimisation, never a correctness dependency. Any corruption routes transparently to live computation.
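The cache-first, fall-back-to-live flow might look like the sketch below. The fallback conditions, the 6h 30m limit, the cached windows, and the "source" field are from the text. Everything about the file layout is assumed for illustration: captured_at as epoch seconds, a "windows" mapping keyed by the day count, and SCHEMA_VERSION = 1. The names read_cache and cortex_quality are hypothetical stand-ins, not the real handle_cortex_quality.

```python
import json
import time
from pathlib import Path
from typing import Callable, Optional

MAX_AGE_SECONDS = 6 * 3600 + 30 * 60   # 6h cron interval + 30m slack
CACHED_WINDOWS = (1, 7)                # the only windows the cron writes
SCHEMA_VERSION = 1                     # hypothetical current schema value


def read_cache(path: Path, days: int, now: Optional[float] = None) -> Optional[dict]:
    """Return the cached summary for `days`, or None on any failure."""
    if days not in CACHED_WINDOWS:
        return None
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
        if data["schema"] != SCHEMA_VERSION:
            return None
        current = time.time() if now is None else now
        if current - float(data["captured_at"]) > MAX_AGE_SECONDS:
            return None
        return data["windows"][str(days)]
    except (OSError, ValueError, KeyError, TypeError):
        # Missing file, corrupt JSON, or unexpected shape: fall back silently.
        return None


def cortex_quality(path: Path, days: int, compute_live: Callable[[int], dict]) -> dict:
    """Serve from cache when possible, otherwise recompute via the live path."""
    summary = read_cache(path, days)
    if summary is not None:
        return {"source": "cache", "days": days, "summary": summary}
    return {"source": "live", "days": days, "summary": compute_live(days)}
```

Catching a broad set of exceptions in read_cache is deliberate here: the reader should never be the reason a request fails, so every unexpected condition collapses into "no cache" and the live path runs as before.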

Tests

9 new tests lock in Spanish keyword detection, accented variants, bilingual negation suppression, positive signal boosting (including the caps), numeric safeguard transitions, and score bounds. 7 new tests cover the cache reader: fresh cache hits for both windows, stale cache, unknown schema, invalid JSON, missing file, and explicitly non-cached windows. Every pre-existing protocol and cortex test continues to pass — the new positive-signal kwargs default to safe values and the numeric safeguard is monotonic.

Lint, security, coverage, release-readiness, and client-parity gates are all green on the release PR.

What did not change

5.2.0 does not replace the 5.1.0 closed-loops story, and it does not replace the 5.0.0 goal/decision/outcome backbone. Every subsystem from prior releases keeps behaving the same way for tasks written in English. The difference is that tasks written in Spanish now trip the same gates, boundary statements are no longer mistaken for action targets, well-prepared tasks get credit for the context they loaded, and the cortex-quality cron's promise to serve fresh metrics without re-running SQL is finally kept.

If you want the exact release record, open the 5.2.0 changelog section. If you are on an older install, nexo update will pick up the new module on the next run automatically. No migration. No schema change. No bootstrap surface touched.