Benchmarks

Benchmarks and public proof

This page now shows both layers of public proof. LoCoMo covers the memory slice. The operator runtime pack covers continuity, contradiction handling, cross-client state, guarded execution, and the first outcome and prioritization advantages over realistic local baselines.

F1 0.588
Runtime pack 96.2%
CPU only
13 operator scenarios
+55% vs GPT-4 (128K)

Memory slice: LoCoMo, long conversational memory under realistic retrieval pressure.

Runtime slice: Operator runtime pack covering continuity, contradiction handling, cross-client state, outcome loop, and prioritization.

Why it matters: Public proof that NEXO is not just architecture talk or a single cherry-picked benchmark.

System	F1 Score	Hardware
NEXO Brain (checked-in memory run)	0.588	CPU only
GPT-4 (128K)	0.379	GPU cloud
Gemini Pro 1.0	0.313	GPU cloud
LLaMA-3 70B	0.295	A100 GPU
GPT-3.5 + Contriever	0.283	GPU
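
As a rough illustration of how the numbers in this table relate, the sketch below computes a token-overlap F1 for one answer and the relative gain between two scores. The tokenization (lowercase, whitespace split) and the function names `token_f1` and `relative_gain` are assumptions made for this example; the page does not state which normalization the checked-in run uses.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and a reference answer.

    Assumption: lowercase + whitespace tokenization. The real benchmark
    harness may normalize text differently.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def relative_gain(score: float, baseline: float) -> float:
    """Percentage improvement of `score` over `baseline`."""
    return (score - baseline) / baseline * 100


# Scores taken from the table above: NEXO Brain vs GPT-4 (128K).
print(round(relative_gain(0.588, 0.379)))  # 55
```

The last line reproduces the +55% headline figure from the two F1 scores in the table.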

Adversarial rejection: 93.3%

Ingestion time: 25s

CPU embeddings (dims): 768

First MCP server benchmarked

Operator runtime pack

The broader proof is now a runtime matrix, not just one memory score

The checked-in operator runtime pack expands public proof into the workflows that matter once memory is embedded into a working system: contradiction handling, temporal reasoning, structured recall, multi-session continuity, cross-client continuity, guarded execution, and the first honest outcome-loop / prioritization checks.

Latest runtime matrix

13 scored operator scenarios.
NEXO full stack: 96.2%.
Static CLAUDE.md: 42.3%.
No memory baseline: 0%.
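
One way to read these percentages is as an average of per-scenario rubric scores. The sketch below assumes each scenario is graded 0, 0.5, or 1; that scale, the function name `pack_score`, and the per-scenario breakdowns are hypothetical and are not taken from the checked-in run JSON. Under that assumption, 12.5 and 5.5 points out of 13 would round to the published 96.2% and 42.3%.

```python
def pack_score(scenario_scores: list[float]) -> float:
    """Average per-scenario scores and return a percentage with one decimal."""
    if not scenario_scores:
        raise ValueError("at least one scenario score is required")
    return round(100 * sum(scenario_scores) / len(scenario_scores), 1)


# Hypothetical breakdowns that sum to 12.5, 5.5, and 0 points over 13 scenarios.
# The real distribution across scenarios is not stated on this page.
full_stack = [1.0] * 12 + [0.5]
static_file = [1.0] * 5 + [0.5] + [0.0] * 7
no_memory = [0.0] * 13

print(pack_score(full_stack))   # 96.2
print(pack_score(static_file))  # 42.3
print(pack_score(no_memory))    # 0.0
```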

What it measures

Contradiction latest-wins and temporal reasoning.
Structured recall, interrupted-task resume, and related-context stitching.
Cross-client continuity across Claude Code and Codex.
Outcome-loop and prioritization quality, graded conservatively.
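
To make "contradiction latest-wins" concrete, here is a minimal sketch of the idea: when two stored facts disagree about the same key, the most recently recorded one is returned. The `Fact` type and `resolve_latest_wins` function are hypothetical names for this example and do not describe how NEXO implements it.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Fact:
    key: str               # what the fact is about, e.g. "deploy_target"
    value: str             # the remembered value
    recorded_at: datetime  # when the fact was stored


def resolve_latest_wins(facts: list[Fact]) -> dict[str, str]:
    """For each key, keep the value from the most recently recorded fact."""
    latest: dict[str, Fact] = {}
    for fact in facts:
        current = latest.get(fact.key)
        if current is None or fact.recorded_at > current.recorded_at:
            latest[fact.key] = fact
    return {key: fact.value for key, fact in latest.items()}


facts = [
    Fact("deploy_target", "staging", datetime(2024, 1, 10)),
    Fact("deploy_target", "production", datetime(2024, 3, 2)),
    Fact("owner", "alice", datetime(2024, 2, 1)),
]
print(resolve_latest_wins(facts))
# {'deploy_target': 'production', 'owner': 'alice'}
```

A scenario in this style would store the older fact first, then the contradicting one, and check that recall returns only the newer value.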

Why this is honest

LoCoMo stays the memory benchmark; it is not stretched into a runtime score.
The runtime pack uses manual-rubric grading with checked-in scenario files and scored run JSON.
Outcome-loop and prioritization are still scoped conservatively: the matrix shows an advantage, but the docs keep the claim narrower than “solved forever”.

