Benchmarks and public proof

This page now shows both public proof layers. LoCoMo covers the memory slice. The operator runtime pack covers continuity, contradiction handling, cross-client state, guarded execution, and the first outcome/prioritization advantages against realistic local baselines.

  • F1 0.588
  • Runtime pack 96.2%
  • CPU only
  • 13 operator scenarios
  • +55% vs GPT-4 (128K)

Memory slice. LoCoMo: long conversational memory under realistic retrieval pressure.
Runtime slice. Operator runtime pack: continuity, contradiction handling, cross-client state, outcome loop, prioritization.
Why it matters. Public proof that NEXO is not just architecture talk or a single benchmark cherry-pick.

System                                F1 Score   Hardware
NEXO Brain (checked-in memory run)    0.588      CPU only
GPT-4 (128K)                          0.379      GPU cloud
Gemini Pro 1.0                        0.313      GPU cloud
LLaMA-3 70B                           0.295      A100 GPU
GPT-3.5 + Contriever                  0.283      GPU

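
The "+55% vs GPT-4 (128K)" headline can be checked directly against the F1 scores listed above. The snippet below is only an arithmetic check; the function and variable names are illustrative and are not part of NEXO.

```python
# F1 scores copied from the comparison above.
f1_scores = {
    "NEXO Brain": 0.588,
    "GPT-4 (128K)": 0.379,
    "Gemini Pro 1.0": 0.313,
    "LLaMA-3 70B": 0.295,
    "GPT-3.5 + Contriever": 0.283,
}


def relative_lift(score: float, baseline: float) -> float:
    """Percentage improvement of `score` over `baseline`."""
    return (score - baseline) / baseline * 100.0


nexo = f1_scores["NEXO Brain"]
lifts = {
    name: relative_lift(nexo, f1)
    for name, f1 in f1_scores.items()
    if name != "NEXO Brain"
}

for name, lift in lifts.items():
    print(f"vs {name}: +{lift:.0f}%")
```

For GPT-4 (128K) this gives (0.588 - 0.379) / 0.379, which rounds to +55%.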
  • 93.3% adversarial rejection
  • 25 s ingestion time
  • 768-dimension CPU embeddings
  • #1: first MCP server benchmarked

The broader proof is now a runtime matrix, not just one memory score

The checked-in operator runtime pack expands public proof into the workflows that matter once memory is embedded into a working system: contradiction handling, temporal reasoning, structured recall, multi-session continuity, cross-client continuity, guarded execution, and the first honest outcome-loop / prioritization checks.

Latest runtime matrix

  • 13 scored operator scenarios.
  • NEXO full stack: 96.2%.
  • Static CLAUDE.md: 42.3%.
  • No memory baseline: 0%.

What it measures

  • Contradiction latest-wins and temporal reasoning.
  • Structured recall, interrupted-task resume, and related-context stitching.
  • Cross-client continuity across Claude Code and Codex.
  • Outcome-loop and prioritization quality, graded conservatively.
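
To make "contradiction latest-wins" concrete: when two remembered facts about the same subject disagree, the more recently recorded one should be the one recalled. The sketch below illustrates that rule only. The `Fact` type, its fields, and `resolve_latest` are hypothetical names chosen for this example and do not describe NEXO's internal implementation.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Fact:
    key: str               # what the fact is about (hypothetical field)
    value: str             # the remembered value
    recorded_at: datetime  # when the fact was stored


def resolve_latest(facts: list[Fact]) -> dict[str, Fact]:
    """For each key, keep only the fact with the newest timestamp."""
    latest: dict[str, Fact] = {}
    for fact in facts:
        current = latest.get(fact.key)
        if current is None or fact.recorded_at > current.recorded_at:
            latest[fact.key] = fact
    return latest


facts = [
    Fact("deploy_target", "staging", datetime(2024, 1, 5)),
    Fact("deploy_target", "production", datetime(2024, 2, 1)),  # contradicts the first
    Fact("owner", "alice", datetime(2024, 1, 10)),
]

resolved = resolve_latest(facts)
print(resolved["deploy_target"].value)  # production
```

A scenario of this kind would pass if the later value is recalled and the superseded one is not.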

Why this is honest

  • LoCoMo stays the memory benchmark; it is not stretched into a runtime score.
  • The runtime pack uses manual-rubric grading with checked-in scenario files and scored run JSON.
  • Outcome-loop and prioritization are still scoped conservatively: the matrix shows advantage, but the docs keep the claim narrower than “solved forever”.
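
As a rough illustration of how manual-rubric grades could roll up into a matrix percentage, the sketch below averages per-scenario scores read from JSON. The schema (`scenarios`, `id`, `score`), the half-credit value, and the scenario names are assumptions made for this example; the checked-in run files may be structured differently, and the page does not publish a per-scenario breakdown.

```python
import json


def matrix_percentage(run: dict) -> float:
    """Mean rubric score across scenarios, as a percentage with one decimal."""
    scores = [scenario["score"] for scenario in run["scenarios"]]
    if not scores:
        raise ValueError("run contains no scored scenarios")
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
    return round(sum(scores) / len(scores) * 100.0, 1)


# Hypothetical run: 13 scenarios, twelve at full credit and one at half credit.
scenarios = [{"id": f"scenario-{i:02d}", "score": 1.0} for i in range(1, 13)]
scenarios.append({"id": "scenario-13", "score": 0.5})

# Round-trip through JSON text to mimic loading a checked-in run file.
run_text = json.dumps({"scenarios": scenarios})
example_run = json.loads(run_text)

print(matrix_percentage(example_run))
```

With 13 scenarios, 12.5 points out of 13 works out to 96.2% and 5.5 out of 13 to 42.3%, so a half-credit rubric would be consistent with the published figures. That is an inference from the arithmetic, not a documented grading rule.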

Give your agent a mind

Open source, AGPL-3.0 licensed, and built for builders who want public proof for both memory and runtime behavior.