<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Penfield</title>
    <description>The latest articles on Forem by Penfield (@penfieldlabs).</description>
    <link>https://forem.com/penfieldlabs</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3748893%2F285067b2-ce2f-4b22-8761-96b931a3ef02.jpg</url>
      <title>Forem: Penfield</title>
      <link>https://forem.com/penfieldlabs</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/penfieldlabs"/>
    <language>en</language>
    <item>
      <title>Great piece. Seven months later it's only gotten worse. Stars are one of the primary marketing metrics for AI repos, regardless of whether the code even works.</title>
      <dc:creator>Penfield</dc:creator>
      <pubDate>Sat, 11 Apr 2026 07:31:15 +0000</pubDate>
      <link>https://forem.com/penfieldlabs/great-piece-seven-months-later-its-only-gotten-worse-stars-are-one-of-the-primary-marketing-6b9</link>
      <guid>https://forem.com/penfieldlabs/great-piece-seven-months-later-its-only-gotten-worse-stars-are-one-of-the-primary-marketing-6b9</guid>
<description>&lt;p&gt;Commenting on &lt;a href="https://dev.to/dev_tips/the-fake-github-economy-no-one-talks-about-stars-followers-and-5k-accounts-43pn"&gt;The fake GitHub economy no one talks about: Stars, Followers, and $5k Accounts&lt;/a&gt; by &amp;lt;devtips/&amp;gt; (Sep 15 '25). Tagged #discuss, #webdev, #programming, #github. 5 reactions, 2 comments, 28 min read.&lt;/p&gt;
</description>
    </item>
    <item>
      <title>The YC President Endorsed an AI Memory System With Fake Benchmarks. He Also Shipped His Own. We Read the Code.</title>
      <dc:creator>Penfield</dc:creator>
      <pubDate>Sat, 11 Apr 2026 07:23:04 +0000</pubDate>
      <link>https://forem.com/penfieldlabs/the-yc-president-endorsed-an-ai-memory-system-with-fake-benchmarks-then-he-shipped-his-own-we-4c9l</link>
      <guid>https://forem.com/penfieldlabs/the-yc-president-endorsed-an-ai-memory-system-with-fake-benchmarks-then-he-shipped-his-own-we-4c9l</guid>
      <description>&lt;p&gt;Garry Tan is the president and CEO of Y Combinator. He has over 738,000 followers on X. Yesterday he publicly endorsed &lt;a href="https://dev.to/penfieldlabs/milla-jovovich-just-released-an-ai-memory-system-it-reached-over-15-million-people-and-5400-297l"&gt;MemPalace&lt;/a&gt;, calling it &lt;a href="https://x.com/garrytan/status/2042507733237485994?s=20" rel="noopener noreferrer"&gt;"impressive."&lt;/a&gt; In the same post, he announced &lt;a href="https://github.com/garrytan/gbrain" rel="noopener noreferrer"&gt;GBrain&lt;/a&gt;, his own AI memory project.&lt;/p&gt;

&lt;p&gt;There is one problem with the endorsement and one problem with the project.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Endorsement
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/milla-jovovich/mempalace/" rel="noopener noreferrer"&gt;MemPalace&lt;/a&gt; reported Recall@5 retrieval scores as end-to-end QA accuracy. When independent developers ran actual QA evaluation, scores dropped dramatically from the reported 96.6%. The project's own GitHub issues document the discrepancies in detail (#27, #29, #39, #125, #242). &lt;/p&gt;

&lt;p&gt;None of this is hidden. It is in the project's public issue tracker. Garry Tan either did not check, did not care, or did not understand the issues.&lt;/p&gt;

&lt;p&gt;We wrote about MemPalace's benchmarks shortly after the project first went viral: &lt;a href="https://dev.to/penfieldlabs/milla-jovovich-just-released-an-ai-memory-system-it-reached-over-15-million-people-and-5400-297l"&gt;Milla Jovovich just released an AI memory system. It reached over 1.5 million people and 5,400 GitHub stars in less than 24 hours.&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Project
&lt;/h2&gt;

&lt;p&gt;GBrain appeared on GitHub on April 5, 2026. It is just six days old. 43 commits. One contributor. Over 2,000 stars.&lt;/p&gt;

&lt;p&gt;The README describes three flagship features: compiled truth rewriting, a dream cycle for overnight maintenance, and entity detection on every message.&lt;/p&gt;

&lt;p&gt;We cloned the repo and read every file. All three features are markdown documents that instruct an AI agent what to do. The codebase itself contains no rewrite logic, no scheduling, and no entity detection. The words "rewrite," "stale," "synthesize," and "consolidate" do not appear in any source file. "Cron," "schedule," "setInterval," and "timer" do not appear either.&lt;/p&gt;

&lt;p&gt;What does exist is a storage layer over PostgreSQL with pgvector, hybrid search with Reciprocal Rank Fusion, and a chunking pipeline. Reasonably competent infrastructure. But the MCP server, the primary integration point for AI agents, ships broken. The project's own &lt;a href="https://github.com/garrytan/gbrain/issues/22" rel="noopener noreferrer"&gt;issue #22&lt;/a&gt; documented twelve critical bugs including race conditions, NULL embedding overwrites, and an S3 backend that a security audit note dated April 10 calls "not production-ready."&lt;/p&gt;
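
&lt;p&gt;For readers unfamiliar with Reciprocal Rank Fusion, here is a minimal sketch of the idea (our own illustration, not GBrain's code): each retriever contributes 1/(k + rank) per document, and the fused ranking sorts by the summed score.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Minimal Reciprocal Rank Fusion sketch (illustrative only, not GBrain's implementation).
# Each retriever returns doc IDs in rank order; fused score = sum of 1 / (k + rank).
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a vector-search ranking with a keyword (BM25-style) ranking.
print(rrf([["d3", "d1", "d7"], ["d1", "d9", "d3"]]))  # d1 and d3 rise to the top
&lt;/code&gt;&lt;/pre&gt;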

&lt;h2&gt;
  
  
  The Pattern
&lt;/h2&gt;

&lt;p&gt;This is not the first time. Tan's previous project, &lt;a href="https://github.com/garrytan/gstack" rel="noopener noreferrer"&gt;gstack&lt;/a&gt;, has amassed over 69,000 GitHub stars. Developer &lt;a href="https://x.com/atmoio/status/2033532603433619845?s=20" rel="noopener noreferrer"&gt;Mo Bitar described it as "a bunch of prompts in a folder."&lt;/a&gt; Another founder noted that without the YC title, it would not have made Product Hunt. A developer who examined Tan's AI-generated website code &lt;a href="https://www.fastcompany.com/91520702/y-combinator-garry-tan-agentic-ai-social-media" rel="noopener noreferrer"&gt;found 78,400 lines including empty CSS files, duplicate assets, and test files shipped to production.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three projects. One pattern. Big claims, big following, no independent verification.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Stars Mean Nothing
&lt;/h2&gt;

&lt;p&gt;MemPalace now has over 40,000 stars. GBrain has over 2,000 in six days. gstack has over 69,000. None of these numbers tell you whether the software works.&lt;/p&gt;

&lt;p&gt;If you don't happen to have a Hollywood movie star friend and you aren't president of YC with 738,000+ X followers, don't worry. You can always just &lt;a href="https://dev.to/dev_tips/the-fake-github-economy-no-one-talks-about-stars-followers-and-5k-accounts-43pn"&gt;buy stars&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Full Investigation
&lt;/h2&gt;

&lt;p&gt;We published a detailed investigation into GBrain's source code and the MemPalace endorsement on our Substack:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://penfieldlabs.substack.com/" rel="noopener noreferrer"&gt;When the YC President Says He's "Impressed"&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;We are building &lt;a href="https://penfieldlabs.substack.com/p/proposal-a-new-benchmark-for-long" rel="noopener noreferrer"&gt;an open benchmark for long-term AI memory&lt;/a&gt;, because the current ecosystem too often fails to distinguish working systems from compelling READMEs.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aimemory</category>
      <category>benchmarks</category>
      <category>yc</category>
    </item>
    <item>
      <title>Proposal: A Real Benchmark for Long-Term AI Memory Systems</title>
      <dc:creator>Penfield</dc:creator>
      <pubDate>Fri, 10 Apr 2026 12:24:32 +0000</pubDate>
      <link>https://forem.com/penfieldlabs/proposal-a-real-benchmark-for-long-term-ai-memory-systems-57p5</link>
      <guid>https://forem.com/penfieldlabs/proposal-a-real-benchmark-for-long-term-ai-memory-systems-57p5</guid>
      <description>&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Nearly every AI memory system is publishing scores on benchmarks that don't adequately measure what they claim to measure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/penfieldlabs/we-audited-locomo-64-of-the-answer-key-is-wrong-and-the-judge-accepts-up-to-63-of-intentionally-33lg"&gt;We audited LoCoMo&lt;/a&gt; and found &lt;strong&gt;6.4% of the answer key is factually wrong&lt;/strong&gt; (99 errors in 1,540 questions), the LLM judge accepts &lt;strong&gt;63% of intentionally wrong answers&lt;/strong&gt;, and &lt;strong&gt;56% of per-category system comparisons are statistically indistinguishable from noise&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/2410.10813" rel="noopener noreferrer"&gt;LongMemEval-S&lt;/a&gt; uses ~115K tokens per question — every frontier model can hold that in context. It's a better context window test than a memory test.&lt;/p&gt;

&lt;p&gt;Meanwhile, each system uses its own ingestion, its own answer generation prompt, and sometimes its own judge configuration — then publishes scores in the same table as if they share a common methodology. The &lt;a href="https://github.com/getzep/zep-papers/issues/5" rel="noopener noreferrer"&gt;Mem0/Zep benchmark dispute&lt;/a&gt; illustrates this perfectly: two companies testing the same systems, arriving at wildly different numbers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ten Design Principles
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Corpus must exceed context windows
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1–2 million tokens&lt;/strong&gt; of total context. Large enough to require genuine memory retrieval. Small enough to be economically feasible for independent researchers.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Corpus must model real agent usage
&lt;/h3&gt;

&lt;p&gt;Multi-session conversations between one person and an AI assistant over ~6 months. Work projects, personal preferences, corrections, evolving facts — not disconnected chit-chat between strangers.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Ingestion is the system's problem, but must be disclosed
&lt;/h3&gt;

&lt;p&gt;Each system ingests however it wants. But it must publish: ingestion method, model used, embedding model, total cost, and total time.&lt;/p&gt;
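
&lt;p&gt;As a concrete illustration, a disclosure record could be as simple as the following (field names and values are our sketch, not a finalized schema):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical disclosure record published alongside a score.
# Field names and values are illustrative; the final schema is an open question.
ingestion_disclosure = {
    "ingestion_method": "per-session LLM extraction into a knowledge graph",
    "ingestion_model": "example-extraction-model",
    "embedding_model": "example-embedding-model",
    "total_cost_usd": 48.20,
    "total_wall_clock_hours": 3.5,
}
&lt;/code&gt;&lt;/pre&gt;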

&lt;h3&gt;
  
  
  4. Answer generation: standardized OR fully disclosed
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Standard track:&lt;/strong&gt; Prescribed model, prescribed prompt, single-shot. The only variable is what memory retrieves. Apples-to-apples.&lt;br&gt;
&lt;strong&gt;Open track:&lt;/strong&gt; Use whatever you want, fully disclosed, reported separately. Never mixed with standard track scores.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Equal statistical power across categories
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;400 questions per category.&lt;/strong&gt; LoCoMo's smallest category has 96 questions, which gives Wilson score margins of error so wide that per-category score differences are indistinguishable from noise.&lt;/p&gt;
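
&lt;p&gt;To make the statistical-power point concrete, here is a quick back-of-the-envelope check (a standard 95% Wilson score interval, our own calculation) comparing n=96 with n=400:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math

# 95% Wilson score interval half-width for an observed accuracy p on n questions.
def wilson_half_width(p, n, z=1.96):
    denom = 1 + z**2 / n
    return z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom

for n in (96, 400):
    print(f"n={n}: roughly +/- {wilson_half_width(0.70, n):.1%} around 70% accuracy")
# n=96 gives about +/- 9.0 points; n=400 gives about +/- 4.5 points.
&lt;/code&gt;&lt;/pre&gt;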

&lt;h3&gt;
  
  
  6. Human-verified ground truth
&lt;/h3&gt;

&lt;p&gt;Error rate target: &lt;strong&gt;&amp;lt;1%.&lt;/strong&gt; Model council pre-screening, crowd-sourced review with bounties, expert tiebreakers.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Adversarially validated judge
&lt;/h3&gt;

&lt;p&gt;Generate intentionally wrong answers before launch. The judge must reject &lt;strong&gt;&amp;gt;95%&lt;/strong&gt;. No more judges that can't distinguish vague topically-adjacent answers from correct ones.&lt;/p&gt;

&lt;h3&gt;
  
  
  8. Abstention is scored
&lt;/h3&gt;

&lt;p&gt;"I don't know" when the answer IS in the corpus: &lt;strong&gt;0.10.&lt;/strong&gt; Confidently wrong: &lt;strong&gt;0.0.&lt;/strong&gt; A system that knows its limits should beat one that hallucinates.&lt;/p&gt;
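
&lt;p&gt;A minimal sketch of the scoring rule (our assumptions: a correct answer scores 1.0, and abstaining on a question the corpus genuinely cannot answer scores full credit):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of an abstention-aware scoring rule. Only the 0.10 and 0.0 values come
# from the proposal; the other values are our assumptions, not a finalized spec.
def score(answer_correct, abstained, answer_in_corpus):
    if abstained:
        # Abstaining when the corpus truly lacks the answer: full credit (assumed).
        # Abstaining when the answer was available: small partial credit (0.10).
        return 0.10 if answer_in_corpus else 1.0
    # Otherwise: right earns 1.0, confidently wrong earns 0.0.
    return 1.0 if answer_correct else 0.0
&lt;/code&gt;&lt;/pre&gt;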

&lt;h3&gt;
  
  
  9. Multiple scoring dimensions
&lt;/h3&gt;

&lt;p&gt;Accuracy alone hides everything interesting. The scorecard includes: accuracy (standard + open), retrieval precision (tokens per question), latency (p50/p95), abstention quality, and supersession handling.&lt;/p&gt;

&lt;h3&gt;
  
  
  10. Context-stuffing is measured, not hidden
&lt;/h3&gt;

&lt;p&gt;Systems report the token count of context provided to the answer generation model for each question.&lt;/p&gt;

&lt;h2&gt;
  
  
  Six Question Categories
&lt;/h2&gt;

&lt;p&gt;2,400 questions total — 400 per category:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Direct recall&lt;/strong&gt; — Can you retrieve a specific fact that was stated explicitly?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Temporal reasoning&lt;/strong&gt; — Can you reason about when things happened and how facts changed over time?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-hop inference&lt;/strong&gt; — Can you connect information from different conversations to answer a question never explicitly discussed?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Supersession and correction&lt;/strong&gt; — Can you track when facts have been updated, corrected, or superseded?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cognitive inference&lt;/strong&gt; — Can you make connections that require understanding implications rather than explicit statements?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adversarial abstention&lt;/strong&gt; — Can you correctly identify when you DON'T have the information?&lt;/p&gt;

&lt;h2&gt;
  
  
  What We're NOT Doing
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Not prescribing ingestion method&lt;/li&gt;
&lt;li&gt;Not requiring a specific embedding model&lt;/li&gt;
&lt;li&gt;Not testing with outdated models&lt;/li&gt;
&lt;li&gt;Not making it cost-prohibitive to run&lt;/li&gt;
&lt;li&gt;Not handing down a finished spec — this is a proposal and an invitation to collaborate&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Read the Full Proposal
&lt;/h2&gt;

&lt;p&gt;The complete write-up, including corpus generation methodology, model comparability framework, open questions, and full references can be found here:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://penfieldlabs.substack.com/p/proposal-a-new-benchmark-for-long" rel="noopener noreferrer"&gt;A Real Benchmark for Long-Term AI Memory Systems&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The full &lt;a href="https://github.com/dial481/locomo-audit" rel="noopener noreferrer"&gt;LoCoMo audit&lt;/a&gt; with all 99 errors documented is public.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;We're looking for memory system builders, benchmark designers, and researchers who share the goal of honest measurement. Feedback, criticism, and contributions welcome.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aimemory</category>
      <category>benchmarks</category>
    </item>
    <item>
      <title>Milla Jovovich just released an AI memory system. It reached over 1.5 million people and 5,400 GitHub stars in less than 24 hours.</title>
      <dc:creator>Penfield</dc:creator>
      <pubDate>Tue, 07 Apr 2026 11:39:10 +0000</pubDate>
      <link>https://forem.com/penfieldlabs/milla-jovovich-just-released-an-ai-memory-system-it-reached-over-15-million-people-and-5400-297l</link>
      <guid>https://forem.com/penfieldlabs/milla-jovovich-just-released-an-ai-memory-system-it-reached-over-15-million-people-and-5400-297l</guid>
      <description>&lt;h2&gt;
  
  
  Problem: None of the benchmark scores are real.
&lt;/h2&gt;

&lt;p&gt;Yesterday an X account belonging to a developer named &lt;strong&gt;Ben Sigman&lt;/strong&gt; posted the launch of an open-source &lt;strong&gt;AI memory project called MemPalace&lt;/strong&gt;. The post claimed "100% on LoCoMo" and "the first perfect score ever recorded on LongMemEval. 500/500 questions, every category at 100%." It credited the actress &lt;strong&gt;Milla Jovovich&lt;/strong&gt; as a co-author. The GitHub account hosting the repository is named &lt;a href="https://github.com/milla-jovovich/mempalace" rel="noopener noreferrer"&gt;milla-jovovich/mempalace&lt;/a&gt;. The first commit to the repository is dated April 5. As of this writing, less than 24 hours after the launch post, the repository has approximately 5,400 stars and over 1.5 million views on the launch tweet.&lt;/p&gt;

&lt;p&gt;For comparison: open-source memory projects with similar architectures and similar honest baseline numbers typically receive just a handful of stars in their first week. The variable producing the orders-of-magnitude difference in engagement is not the engineering. The engineering, as we'll demonstrate, is in some respects interesting and in most respects unexceptional. The variable is the celebrity name on the GitHub account and the celebrity attribution in the launch post. The launch post described her as a co-author. Whatever the underlying collaboration looked like, the practical effect of attaching the name was that a repository created two days ago reached over 1.5 million people on a single tweet. The methodology errors documented below were carried by that reach to an audience most of whom are unlikely to read the &lt;a href="https://github.com/milla-jovovich/mempalace/blob/main/benchmarks/BENCHMARKS.md" rel="noopener noreferrer"&gt;BENCHMARKS.md&lt;/a&gt; file for themselves.&lt;/p&gt;

&lt;p&gt;We work on a different memory project at Penfield, and a couple of months ago we published &lt;a href="https://github.com/dial481/locomo-audit" rel="noopener noreferrer"&gt;an audit of LoCoMo's ground truth&lt;/a&gt; documenting 99 wrong, hallucinated, or misattributed answers across the dataset's ten conversations. A 100% score on the published version of LoCoMo is mathematically excluded. The answer key contains errors any honest system would disagree with. So when the launch post showed up in the timeline, we sought to understand how this impossible number was produced.&lt;/p&gt;

&lt;p&gt;What we found is a methodology stack that contains, in one repository created two days ago, almost every failure mode the AI memory benchmark layer suffers from right now. The interesting thing — the thing that made this worth writing about rather than ignoring — is that &lt;strong&gt;the project's own internal documentation discloses most of its failure modes honestly&lt;/strong&gt;. The launch post strips every caveat. The methodology errors are common across the field. The honesty gap between the repository and the marketing is arguably the bigger story. The celebrity name is the reason anyone heard about it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The LoCoMo bypass
&lt;/h2&gt;

&lt;p&gt;LoCoMo is a conversational memory benchmark with ten long conversations and 1,986 question-answer pairs. The standard convention in published evaluation is to report on the 1,540 non-adversarial subset; the launch post reports on all 1,986. The ten conversations contain 19, 19, 32, 29, 29, 28, 31, 30, 25, and 30 sessions respectively. Every conversation has fewer than fifty sessions.&lt;/p&gt;

&lt;p&gt;The MemPalace LoCoMo runner produces its 100% number with &lt;code&gt;top_k=50&lt;/code&gt;. Their own &lt;a href="https://github.com/milla-jovovich/mempalace/blob/main/benchmarks/BENCHMARKS.md" rel="noopener noreferrer"&gt;BENCHMARKS.md&lt;/a&gt; says this verbatim:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The LoCoMo 100% result with top-k=50 has a structural issue: each of the 10 conversations has 19–32 sessions, but top-k=50 exceeds that count. This means the ground-truth session is always in the candidate pool regardless of the embedding model's ranking. The Sonnet rerank is essentially doing reading comprehension over all sessions — the embedding retrieval step is bypassed entirely.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Setting &lt;code&gt;top_k=50&lt;/code&gt; against a candidate pool that maxes out at 32 retrieves the entire conversation. At that setting the pipeline reduces to: dump every session into Claude Sonnet, ask Sonnet which one matches. That is &lt;code&gt;cat *.txt | claude&lt;/code&gt;. It is not retrieval and it is not memory. The "memory architecture" contributes nothing to the score.&lt;/p&gt;
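
&lt;p&gt;The bypass is easy to verify for yourself. A toy sketch (ours, not the MemPalace runner): when top_k is at least the size of the candidate pool, the gold session is retrieved by construction, before the embedding model does anything.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Toy illustration of the top_k=50 bypass (not the MemPalace runner itself).
sessions_per_conversation = [19, 19, 32, 29, 29, 28, 31, 30, 25, 30]
top_k = 50

for n_sessions in sessions_per_conversation:
    retrieved = min(top_k, n_sessions)  # you can never retrieve more than exists
    assert retrieved == n_sessions      # top_k=50 always returns the whole conversation
print("the gold session is always in the candidate pool; recall is 1.0 by construction")
&lt;/code&gt;&lt;/pre&gt;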

&lt;p&gt;The honest LoCoMo numbers, from the same file, are 60.3% R@10 with no rerank and 88.9% R@10 with the project's hybrid scoring and no LLM. Those are real and unremarkable. The 100% should not be cited at all. It cannot be 100% in any case, because the published ground truth is wrong on roughly 99 questions. It is also worth noting that the LoCoMo judge accepts up to 63% of intentionally wrong answers as correct.&lt;/p&gt;

&lt;h2&gt;
  
  
  The LongMemEval metric error
&lt;/h2&gt;

&lt;p&gt;LongMemEval as published is an end-to-end question-answering benchmark. A system has to retrieve from a haystack of prior chat sessions, generate an answer, and have that answer marked correct by a GPT-4 judge. Every score on the published LongMemEval leaderboard is the percentage of questions where the generated answer was judged correct.&lt;/p&gt;

&lt;p&gt;The MemPalace LongMemEval runner does the retrieval step only. It never generates an answer and never invokes a judge. For each of the 500 questions it builds one document per session by concatenating only the user turns (assistant turns are not indexed at all), embeds with default ChromaDB embeddings, returns the top five sessions by cosine distance, and checks set membership against the gold session IDs labeled by the LongMemEval authors. If any one of the gold session IDs appears in the top five, the question scores 1.0. This metric is &lt;code&gt;recall_any@5&lt;/code&gt;. The runner also computes &lt;code&gt;recall_all@5&lt;/code&gt; (the stricter version that requires every gold session to be retrieved) and the project reports the softer one.&lt;/p&gt;
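
&lt;p&gt;To make the metric distinction concrete, here is a small sketch of the two recall variants (our illustration, not the repository's code):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of the two retrieval-recall variants (illustrative, not MemPalace's runner).
def recall_any_at_k(retrieved, gold):
    # 1.0 if ANY gold session ID appears among the retrieved IDs (the softer metric).
    return 0.0 if set(retrieved).isdisjoint(gold) else 1.0

def recall_all_at_k(retrieved, gold):
    # 1.0 only if EVERY gold session ID was retrieved (the stricter metric).
    return 1.0 if set(gold).issubset(retrieved) else 0.0

retrieved = ["s12", "s03", "s44", "s07", "s19"]  # top five sessions by cosine distance
gold = ["s03", "s88"]                            # session IDs labeled by the dataset authors
print(recall_any_at_k(retrieved, gold))  # 1.0, the number that gets reported
print(recall_all_at_k(retrieved, gold))  # 0.0, the stricter version
&lt;/code&gt;&lt;/pre&gt;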

&lt;p&gt;So the system never reads what is in the retrieved sessions, never produces an answer, and never demonstrates that the sessions it returned actually answer the question. The dataset author labeled them, the runner checks the labels, and credit is awarded on label-set overlap. None of the LongMemEval numbers in this repository — not the 100%, not the 98.4% "held-out" number, not the 96.6% raw baseline — are LongMemEval scores in the sense the published leaderboard means. They are retrieval recall numbers on the same dataset, which is a substantially easier task. Calling any of them a "perfect score on LongMemEval" is a metric category error.&lt;/p&gt;

&lt;p&gt;The 100% number additionally has a separate problem. The project's hybrid v4 mode was built by inspecting the three remaining wrong answers in their dev set and writing targeted code for each one: a quoted-phrase boost for a question containing a specific phrase in single quotes, a person-name boost for a question about someone named Rachel, and "I still remember" / "when I was in high school" patterns for a question about a high school reunion. Three patches for three specific questions. Then the same five hundred are rerun and the result is reported as a perfect score. The project's own BENCHMARKS.md calls this what it is, on line 461, verbatim:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This is teaching to the test. The fixes were designed around the exact failure cases, not discovered by analyzing general failure patterns.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The features that don't exist in the code
&lt;/h2&gt;

&lt;p&gt;The launch post lists "contradiction detection catches wrong names, wrong pronouns, wrong ages before you ever see them" as a feature. The file &lt;a href="https://github.com/milla-jovovich/mempalace/blob/main/mempalace/knowledge_graph.py" rel="noopener noreferrer"&gt;&lt;code&gt;mempalace/knowledge_graph.py&lt;/code&gt;&lt;/a&gt; contains zero occurrences of the word "contradict." The only deduplication logic in that file is an exact-match check on &lt;code&gt;(subject, predicate, object)&lt;/code&gt; triples — it blocks identical triples from being added twice and does nothing else. Conflicting facts about the same subject can accumulate indefinitely. The marketed feature does not exist in the code. Credit to the developer Leonard Lin (lhl), who documented this independently in &lt;a href="https://github.com/milla-jovovich/mempalace/issues/27" rel="noopener noreferrer"&gt;issue #27&lt;/a&gt; on the same repository within hours of the launch.&lt;/p&gt;

&lt;h2&gt;
  
  
  AAAK is not lossless
&lt;/h2&gt;

&lt;p&gt;The launch post claims "AAAK compression fits your entire life context into 120 tokens — 30x lossless compression any LLM reads natively." The project's compression module, &lt;a href="https://github.com/milla-jovovich/mempalace/blob/main/mempalace/dialect.py" rel="noopener noreferrer"&gt;&lt;code&gt;mempalace/dialect.py&lt;/code&gt;&lt;/a&gt;, truncates sentences at 55 characters (&lt;code&gt;if len(best) &amp;gt; 55: best = best[:52] + "..."&lt;/code&gt;), filters by keyword frequency, and provides a &lt;code&gt;decode()&lt;/code&gt; function that splits the compressed string into a header dictionary without reconstructing the original text. There is no round-trip.&lt;/p&gt;

&lt;p&gt;There is also a measurement. The same BENCHMARKS.md reports &lt;code&gt;results_raw_full500.jsonl&lt;/code&gt; at 96.6% R@5 and &lt;code&gt;results_aaak_full500.jsonl&lt;/code&gt; at 84.2% R@5 — a 12.4 percentage point quality drop on the same dataset and the same metric, run by the project itself. Lossless compression cannot cause a measured quality drop. The project measured the loss, recorded it in the benchmark file, and then published "30x lossless" anyway.&lt;/p&gt;

&lt;h2&gt;
  
  
  The broken layer underneath
&lt;/h2&gt;

&lt;p&gt;None of these failure modes are unique to MemPalace. LoCoMo's ground truth has been broken since the dataset was published. The benchmark wars in the AI memory space already involve documented methodology disputes that go well beyond normal disagreement: Zep published a detailed article in 2025 titled &lt;a href="https://blog.getzep.com/lies-damn-lies-statistics-is-mem0-really-sota-in-agent-memory/" rel="noopener noreferrer"&gt;"Lies, Damn Lies, and Statistics: Is Mem0 Really SOTA in Agent Memory?"&lt;/a&gt; arguing that Mem0's published LoCoMo numbers depend on a flawed evaluation harness and on Mem0 having run a misconfigured version of Zep when benchmarking against it. Mem0's CTO replied on Zep's own issue tracker in &lt;a href="https://github.com/getzep/zep-papers/issues/5" rel="noopener noreferrer"&gt;"Revisiting Zep's 84% LoCoMo Claim: Corrected Evaluation &amp;amp; 58.44% Accuracy"&lt;/a&gt; claiming that Zep's real score is 58.44% rather than 84%. Letta has separately published &lt;a href="https://www.letta.com/blog/benchmarking-ai-agent-memory" rel="noopener noreferrer"&gt;"Benchmarking AI Agent Memory: Is a Filesystem All You Need?"&lt;/a&gt; reaching similar conclusions about reproducibility on the same benchmark. The MemPalace launch fits into a pattern that the field is already arguing about. What's new is the scale of the honesty gap between a single repository and their related marketing.&lt;/p&gt;

&lt;p&gt;What's unusual about MemPalace is not necessarily that one project did all of these things at once. What's unusual is that the project's own internal documentation discloses these issues honestly, while the launch communication strips these caveats. BENCHMARKS.md is over 5,000 words of careful, self-aware methodology notes that contradict the launch tweet point by point. Whoever reviewed that file knew. It's clearly documented. But then they published the inflated numbers anyway.&lt;/p&gt;

&lt;h2&gt;
  
  
  Over five thousand stars in less than twenty-four hours
&lt;/h2&gt;

&lt;p&gt;The repository was created on April 5. The launch post went up on April 6. By the morning of April 7, the launch tweet had over 1.5 million views and the repository had over 5,400 stars. There are many open-source memory projects with similar architectures and similar honest baseline numbers. They do not get 5,400 stars in twenty-four hours. They get fifty stars in their first week if they're lucky. The variable is the celebrity name. Strip the celebrity attribution out of the launch post and the project is a Python repository with a regex-based abbreviation scheme, default ChromaDB embeddings, a knowledge-graph file that doesn't implement the feature its README claims, and a benchmark folder whose own internal notes contradict the headline numbers. That repository gets fifty stars at best and dies in a week. The same code with an actress's name on the GitHub account gets 5,400 stars in less than a day and reaches over 1.5 million people on a single tweet.&lt;/p&gt;

&lt;p&gt;The engineering result underneath all of this is genuinely interesting in one specific way: it appears that raw verbatim text plus default embeddings does, in fact, beat a number of LLM-extraction approaches at session retrieval on LongMemEval-s. That suggests the field is over-engineering the memory extraction step. It is a useful negative finding. It does not require a perfect score on a benchmark whose ground truth makes a perfect score impossible. It does not require a metric category error. It does not require hand-coded patches against three specific dev questions. It does not require a celebrity attribution. The honest version of this story would have been more interesting than the hyped version, and it would likely have survived more than 24 hours of community scrutiny instead of collapsing under it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we're doing about it
&lt;/h2&gt;

&lt;p&gt;We maintain a public LoCoMo ground-truth audit at &lt;a href="https://github.com/dial481/locomo-audit" rel="noopener noreferrer"&gt;github.com/dial481/locomo-audit&lt;/a&gt;, with per-conversation error files documenting hallucinations, attribution errors, ambiguous questions, and incomplete answers across all ten conversations. The audit is open for contribution. We believe a new and improved version of LoCoMo would benefit every group working on conversational memory, including the MemPalace maintainers and including ourselves. The goal is better benchmarks, not a kill shot on any individual project.&lt;/p&gt;

&lt;p&gt;Two other independent technical critiques of MemPalace landed within the same 24 hour window: &lt;a href="https://github.com/milla-jovovich/mempalace/issues/27" rel="noopener noreferrer"&gt;Leonard Lin's README-versus-code teardown in issue #27&lt;/a&gt;, and a &lt;a href="https://github.com/milla-jovovich/mempalace/issues/37" rel="noopener noreferrer"&gt;Chinese-language warning post for the simplified Chinese developer community in issue #37&lt;/a&gt;. If you're evaluating any AI memory system right now, the right thing to do is read the benchmark code yourself before trusting the headline number. If that feels like a lot to ask — and it is — that's the problem this article is about. The celebrity name on the GitHub account is what made the problem visible. The problem itself was already there.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aimemory</category>
      <category>benchmarks</category>
    </item>
    <item>
      <title>The Real Ceiling in Claude Code's Memory System (It’s Not the 200-Line Cap)</title>
      <dc:creator>Penfield</dc:creator>
      <pubDate>Sun, 05 Apr 2026 08:11:16 +0000</pubDate>
      <link>https://forem.com/penfieldlabs/the-real-ceiling-in-claude-codes-memory-system-its-not-the-200-line-cap-2cbl</link>
      <guid>https://forem.com/penfieldlabs/the-real-ceiling-in-claude-codes-memory-system-its-not-the-200-line-cap-2cbl</guid>
      <description>&lt;p&gt;Someone published the full Claude Code source to the internet last week. 512,000 lines of TypeScript across 1,916 files.&lt;/p&gt;

&lt;p&gt;Like everyone else, we went straight for the memory system. But unlike the analyses making the rounds, we didn't stop at the index file. We read the entire memory pipeline: the extraction agent, the dream consolidation system, the forked agent pattern, the lock files, the feature flags, the prompt templates, all of it.&lt;/p&gt;

&lt;p&gt;Here's the full picture, including the parts nobody else is talking about, and why replacing the storage layer alone doesn't fix the actual problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  The architecture is smarter than people think
&lt;/h2&gt;

&lt;p&gt;Most of the commentary has focused on the 200-line index cap in MEMORY.md and declared the system broken. That's a surface read. The architecture underneath is genuinely well-designed for a v1.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three-tier memory with bandwidth awareness:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The system has three layers, each with a different access pattern:&lt;/p&gt;

&lt;p&gt;Layer 1: MEMORY.md, the index. Always loaded into the system prompt. One-line pointers to topic files, roughly 150 characters each. Hard cap of 200 lines and 25KB. This is the only layer that costs tokens every turn.&lt;/p&gt;

&lt;p&gt;Layer 2: Topic files (markdown files in the memory directory). Loaded on demand. Each turn, a separate Sonnet call reads the file manifest and picks up to 5 relevant files based on the current query. These contain the actual knowledge.&lt;/p&gt;

&lt;p&gt;Layer 3: Session transcripts (JSONL files). Never fully read. Only accessed via targeted grep with narrow search terms. This is the raw conversation history, kept as a last-resort reference.&lt;/p&gt;

&lt;p&gt;This is a cost-conscious design. Layer 1 is always in context. Layer 2 is fetched selectively. Layer 3 is almost never touched. The 200-line cap on the index isn't an oversight, it's a token budget. The index is injected into every single system prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Four memory types, strictly constrained:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The type taxonomy is intentionally narrow: user (who you are), feedback (corrections AND confirmations), project (ongoing work context), and reference (pointers to external systems).&lt;/p&gt;

&lt;p&gt;What's interesting is what they explicitly exclude. The source code has a dedicated WHAT_NOT_TO_SAVE section: no code patterns, no architecture, no file paths, no git history, no debugging solutions. The rule is: if it's derivable from the current codebase through grep or git, don't persist it. Memory is reserved for things the codebase can't tell you.&lt;/p&gt;

&lt;p&gt;The feedback type is more nuanced than it appears. The prompt instructs the model to record both corrections ("stop doing X") and confirmations ("yes, exactly like that"). The reasoning is explicit in the source: if you only save corrections, you avoid past mistakes but drift away from validated approaches. Most memory systems only capture negative feedback. This one captures positive signal too.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Staleness is a first-class concept:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There's a memoryFreshnessText() function that appends warnings to any memory older than one day: "This memory is X days old. Memories are point-in-time observations, not live state." The model is instructed to treat memory as a hint, not truth, and verify before using. Memory is skeptical of itself.&lt;/p&gt;
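
&lt;p&gt;As an illustration of the idea (a sketch in Python, not a translation of the actual TypeScript):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of a staleness warning in the spirit of memoryFreshnessText()
# (illustrative Python; the real function is TypeScript and differs in detail).
from datetime import datetime, timezone

def freshness_note(saved_at):
    age_days = (datetime.now(timezone.utc) - saved_at).days
    if age_days == 0:
        return ""  # saved within the last day: no warning appended
    return (f"This memory is {age_days} days old. Memories are point-in-time "
            "observations, not live state; verify before relying on them.")
&lt;/code&gt;&lt;/pre&gt;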




&lt;h2&gt;
  
  
  The part nobody is talking about: the dream system
&lt;/h2&gt;

&lt;p&gt;This is where it gets interesting. Claude Code doesn't just accumulate memories. It consolidates them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;autoDream: background memory consolidation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After at least 24 hours and at least 5 sessions have passed, a background process called autoDream fires. It's controlled by a GrowthBook feature flag (tengu_onyx_plover), meaning Anthropic can tune the thresholds remotely without shipping code.&lt;/p&gt;

&lt;p&gt;autoDream runs as a forked subagent, a separate process that clones the parent's file state cache and gets its own transcript. It has restricted tool access (only file read and write within the memory directory) so it can't corrupt the main conversation context.&lt;/p&gt;

&lt;p&gt;The consolidation runs in four phases:&lt;/p&gt;

&lt;p&gt;Phase 1, Orient: Read the memory directory. Understand what exists. Skim topic files to avoid creating duplicates.&lt;/p&gt;

&lt;p&gt;Phase 2, Gather: Look for new signal worth persisting. Check daily logs, spot memories that contradict current codebase state, grep transcripts for specific context (narrow terms only, never exhaustive reads).&lt;/p&gt;

&lt;p&gt;Phase 3, Consolidate: Write or update memory files. Merge new signal into existing topics rather than creating near-duplicates. Convert relative dates to absolute. Delete contradicted facts at the source.&lt;/p&gt;

&lt;p&gt;Phase 4, Prune and index: Keep MEMORY.md under the 200-line and 25KB caps. Remove stale pointers. Shorten verbose entries. Resolve contradictions between files.&lt;/p&gt;

&lt;p&gt;This is a self-healing memory system. It merges, deduplicates, resolves contradictions, and aggressively prunes. Memory is continuously edited, not just appended.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Race protection:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A PID-based lock file (.consolidate-lock) prevents multiple processes from running consolidation simultaneously. The lock has a 1-hour staleness timeout (in case a process crashes mid-consolidation) and PID verification to prevent reuse collisions. The lock file's mtime doubles as the lastConsolidatedAt timestamp, so checking "should we consolidate?" costs exactly one stat() call per turn.&lt;/p&gt;
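
&lt;p&gt;A rough sketch of that pattern (ours, in Python; the actual implementation is TypeScript and handles more edge cases):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of a PID lock file with a staleness timeout (illustrative, not the
# actual Claude Code implementation).
import os, time

LOCK_PATH = os.path.expanduser("~/.claude/memory/.consolidate-lock")  # hypothetical path
STALE_AFTER_SECONDS = 3600  # reclaim the lock if the holder crashed an hour ago

def owner_alive(lock_path):
    try:
        pid = int(open(lock_path).read().strip())
        os.kill(pid, 0)  # signal 0 checks process existence without sending anything
        return True
    except (OSError, ValueError):
        return False

def try_acquire_lock():
    if os.path.exists(LOCK_PATH):
        age = time.time() - os.path.getmtime(LOCK_PATH)
        if age &amp;lt; STALE_AFTER_SECONDS and owner_alive(LOCK_PATH):
            return False  # a live consolidation already holds the lock
        os.remove(LOCK_PATH)  # stale timeout hit or the owner died: reclaim
    with open(LOCK_PATH, "w") as f:
        f.write(str(os.getpid()))
    # The lock file's mtime now doubles as a lastConsolidatedAt-style timestamp.
    return True
&lt;/code&gt;&lt;/pre&gt;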

&lt;p&gt;&lt;strong&gt;extractMemories: per-turn capture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Separately from the dream system, there's an extraction agent that runs after each query completes. It's a forked agent (same pattern as autoDream) that reviews the conversation and extracts durable memories. This is what captures information in real time. autoDream is what consolidates it later.&lt;/p&gt;

&lt;p&gt;Two different processes writing to the same memory directory. Real-time capture and periodic consolidation. The biological analogy is obvious: short-term encoding during the day, long-term consolidation during sleep.&lt;/p&gt;




&lt;h2&gt;
  
  
  The forked agent pattern
&lt;/h2&gt;

&lt;p&gt;This is the core architectural primitive that makes everything work, and nobody has mentioned it.&lt;/p&gt;

&lt;p&gt;runForkedAgent() creates a perfect fork of the main conversation. It clones the file state cache, creates a separate transcript, and shares the parent's prompt cache (the expensive part). The forked agent gets restricted tools so it can't interfere with the parent context.&lt;/p&gt;

&lt;p&gt;This single pattern powers: memory extraction (per-turn), memory consolidation (autoDream), auto-compaction, agent summaries, and sub-agent tasks. One cache, multiple specialized agents. This is Anthropic's cost optimization for running background intelligence alongside the main conversation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where the system actually hits a ceiling
&lt;/h2&gt;

&lt;p&gt;The 200-line index cap is not the real limitation. The dream system manages that cap through pruning and consolidation. The actual ceiling is architectural:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No knowledge graph.&lt;/strong&gt; Every memory is an isolated markdown file. There's no way to express that one memory supports another, contradicts another, or supersedes another. The dream system can spot contradictions and resolve them, but only through brute-force LLM reasoning over the full text. There are no typed relationships. No structured connections. No way for the agent to explore how its knowledge evolved over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No embeddings.&lt;/strong&gt; Retrieval is a language model reading filenames and one-line descriptions, then picking up to 5 files. It's remarkably effective for what it is, but it's not semantic search. As the memory directory grows, the relevance of filename-based selection degrades. A memory about a "database migration decision" won't surface when the query is about "schema changes" unless the filename happens to match.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No cross-project memory.&lt;/strong&gt; Each project gets its own isolated memory directory, keyed to the canonical git root. Knowledge learned in one project cannot inform work in another. There's no shared context, no transfer learning between workspaces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No cross-device or cross-product memory.&lt;/strong&gt; The memory directory lives at ~/.claude/projects/ on your local filesystem. Your desktop and laptop have separate memories. Claude.ai, Claude Desktop, Claude mobile, and Claude Code all have completely separate memory systems. Knowledge is fragmented across every device and interface you use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No personality persistence.&lt;/strong&gt; There's no mechanism for the model's communication style, behavioral preferences, or domain expertise to persist. Every new session starts with a blank personality slate. Any rapport or working style you've established exists only in the current conversation's context window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No GUI for non-technical users.&lt;/strong&gt; Memory is markdown files on disk. Managing them means editing files in a text editor or asking Claude to do it for you. There's no portal, no visual browser, no way for a non-developer to see what's stored or how things connect.&lt;/p&gt;




&lt;h2&gt;
  
  
  Replacing the storage layer doesn't fix this
&lt;/h2&gt;

&lt;p&gt;Swapping markdown files for a vector store addresses one limitation (the filename-based retrieval) while leaving every other ceiling untouched.&lt;/p&gt;

&lt;p&gt;A vector store with no knowledge graph is still flat memory. You get better recall on individual memories, but the memories themselves are still isolated notes. There's still no way to say "this decision superseded that one" or "this insight contradicts our earlier assumption." You're scaling a note pile, not building knowledge.&lt;/p&gt;

&lt;p&gt;The retrieval improvement is real, embedding similarity beats filename matching. But retrieval was never the core problem. The core problem is that isolated memories, no matter how well-retrieved, can't represent connected knowledge.&lt;/p&gt;

&lt;p&gt;What actually fixes this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A knowledge graph with typed relationships.&lt;/strong&gt; Not just "these memories are similar" (that's what embeddings give you) but structured connections: supports, contradicts, supersedes, causes, depends_on, updates, and more. The agent needs to build and traverse a graph, not search a list.&lt;/p&gt;
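
&lt;p&gt;A sketch of what typed relationships look like as data (our illustration; the names are not any particular product's schema):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative data model for typed memory relationships (not any specific product's schema).
from dataclasses import dataclass, field

EDGE_TYPES = {"supports", "contradicts", "supersedes", "causes", "depends_on", "updates"}

@dataclass
class Memory:
    id: str
    text: str
    edges: list = field(default_factory=list)  # list of (edge_type, target_memory_id)

    def connect(self, edge_type, target_id):
        assert edge_type in EDGE_TYPES, f"unknown edge type: {edge_type}"
        self.edges.append((edge_type, target_id))

# "Switched to DynamoDB in March" supersedes "Chose Postgres in January".
jan = Memory("m1", "Chose Postgres for the main datastore (January)")
mar = Memory("m2", "Switched to DynamoDB because of latency issues (March)")
mar.connect("supersedes", "m1")
&lt;/code&gt;&lt;/pre&gt;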

&lt;p&gt;&lt;strong&gt;Agent-managed memory.&lt;/strong&gt; Give the model a rich set of tools (store, recall, connect, explore, reflect, update) and let it decide what matters. Claude Code's extractMemories and autoDream are early steps in this direction, but they operate on flat files. The same agent-driven approach applied to a knowledge graph is dramatically more powerful. A recent Google DeepMind paper (Evo-Memory) showed that agents with self-evolving memory cut task steps roughly in half and let smaller models match or beat larger ones with static context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Typed memories.&lt;/strong&gt; Claude Code's four types (user, feedback, project, reference) are a good start. But a correction is different from an insight, which is different from a strategic decision, which is different from a checkpoint. More types means the agent (and the user) can understand what kind of knowledge they're looking at.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Personality persistence.&lt;/strong&gt; Your AI's communication style, domain expertise, behavioral quirks, and boundaries should be stored as part of the memory system and loaded at the start of every session. On any device. On any platform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-device, cross-platform access.&lt;/strong&gt; Memory needs to be accessible from everywhere. Your desktop, your phone, your IDE, your browser. Cloud-hosted, synced, exportable. Local-only memory fragments your knowledge across every device you own.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A GUI portal.&lt;/strong&gt; Non-developers need to see what's in the memory system, edit what's wrong, and understand how things connect. "Trust us, it's in the database" isn't good enough.&lt;/p&gt;




&lt;h2&gt;
  
  
  What this looks like in practice
&lt;/h2&gt;

&lt;p&gt;Imagine you've been working with an AI assistant across multiple projects for six months. It knows your coding preferences, your architectural decisions, the bugs you've encountered, the strategies you've tried, the corrections you've made along the way.&lt;/p&gt;

&lt;p&gt;With flat memory (even with embeddings), those are 500 isolated notes that surface based on keyword or semantic similarity. Useful, but limited.&lt;/p&gt;

&lt;p&gt;With a knowledge graph, those memories are connected. The agent can trace how a decision evolved: "We chose Postgres in January (decision). Switched to DynamoDB in March (supersedes). Because of the latency issues we hit in February (caused_by). Which contradicted our original assumption about read patterns (contradicts)." It can explore connections, spot patterns, and understand context that no embedding similarity search would surface.&lt;/p&gt;

&lt;p&gt;That's the difference between a note pile and a knowledge base.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 30-second fix
&lt;/h2&gt;

&lt;p&gt;Claude Code's memory system is a well-engineered v1 with real architectural ceilings. Replacing the storage layer puts a bigger engine in a car with no steering wheel.&lt;/p&gt;

&lt;p&gt;What you actually want is a memory system with a knowledge graph, typed relationships, hybrid search (keyword, vector, and graph traversal combined), personality persistence, a GUI portal, and access from every device and platform you use.&lt;/p&gt;

&lt;p&gt;And it should take less than a minute to set up.&lt;/p&gt;

&lt;p&gt;If you use Claude.ai or Claude Desktop: go to Settings, Connectors, Add Custom Connector, paste a URL. Done. That connector is now available everywhere you use Claude, including Claude Code, Claude mobile, and Cowork. Turn it on or off anytime.&lt;/p&gt;

&lt;p&gt;If you use Cursor, Windsurf, or any MCP-compatible client: one line in your MCP config.&lt;/p&gt;

&lt;p&gt;If you're building something custom: full REST API.&lt;/p&gt;

&lt;p&gt;No Docker. No npm install. No environment variables. No JSON config files.&lt;/p&gt;

&lt;p&gt;Your AI should remember you. Across every session, every device, every platform. And that memory should be connected, not just piled up.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>claudecode</category>
      <category>aimemory</category>
    </item>
    <item>
      <title>We audited LoCoMo: 6.4% of the answer key is wrong and the judge accepts up to 63% of intentionally wrong answers</title>
      <dc:creator>Penfield</dc:creator>
      <pubDate>Sat, 04 Apr 2026 14:56:34 +0000</pubDate>
      <link>https://forem.com/penfieldlabs/we-audited-locomo-64-of-the-answer-key-is-wrong-and-the-judge-accepts-up-to-63-of-intentionally-33lg</link>
      <guid>https://forem.com/penfieldlabs/we-audited-locomo-64-of-the-answer-key-is-wrong-and-the-judge-accepts-up-to-63-of-intentionally-33lg</guid>
      <description>&lt;p&gt;Projects are still submitting &lt;a href="https://github.com/snap-research/locomo/issues" rel="noopener noreferrer"&gt;new scores on LoCoMo as of March 2026.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We audited it and found 6.4% of the answer key is wrong, and the LLM judge accepts up to 63% of intentionally wrong answers. LongMemEval-S is often raised as an alternative, but each question's corpus fits entirely in modern context windows, making it more of a context window test than a memory test. Here's what we found.&lt;/p&gt;

&lt;h2&gt;
  
  
  LoCoMo
&lt;/h2&gt;

&lt;p&gt;LoCoMo (&lt;a href="https://aclanthology.org/2024.acl-long.747.pdf" rel="noopener noreferrer"&gt;Maharana et al., ACL 2024&lt;/a&gt;) is one of the most widely cited long-term memory benchmarks. We conducted a systematic audit of the ground truth and identified 99 score-corrupting errors in 1,540 questions (6.4%). Error categories include hallucinated facts in the answer key, incorrect temporal reasoning, and speaker attribution errors.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The answer key specifies "Ferrari 488 GTB," but the source conversation contains only "this beauty" and the image caption reads "a red sports car." The car model exists only in an internal query field (annotator search strings for stock photos) that no memory system ingests. Systems are evaluated against facts they have no access to.&lt;/li&gt;
&lt;li&gt;"Last Saturday" on a Thursday should resolve to the preceding Saturday. The answer key says Sunday. A system that performs the date arithmetic correctly is penalized.&lt;/li&gt;
&lt;li&gt;24 questions attribute statements to the wrong speaker. A system with accurate speaker tracking will contradict the answer key.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The theoretical maximum score for a perfect system is approximately 93.6%.&lt;/p&gt;
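
&lt;p&gt;That ceiling is just the arithmetic of the error count:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Upper bound on an honest system's LoCoMo score, given the documented answer-key errors.
total_questions = 1540
wrong_key_entries = 99
print(f"{(total_questions - wrong_key_entries) / total_questions:.1%}")  # 93.6%
&lt;/code&gt;&lt;/pre&gt;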

&lt;p&gt;We also tested the LLM judge. LoCoMo uses gpt-4o-mini to score answers against the golden reference. We generated intentionally wrong but topically adjacent answers for all 1,540 questions and scored them using the same judge configuration and prompts used in published evaluations. The judge accepted 62.81% of them. Specific factual errors (wrong name, wrong date) were caught approximately 89% of the time. However, vague answers that identified the correct topic while missing every specific detail passed nearly two-thirds of the time. This is precisely the failure mode of weak retrieval - locating the right conversation but extracting nothing specific - and the benchmark rewards it.&lt;/p&gt;

&lt;p&gt;There is also no standardized evaluation pipeline. Each system uses its own ingestion method (arguably necessary given architectural differences), its own answer generation prompt, and sometimes entirely different models. Scores are then compared in tables as if they share a common methodology. Multiple independent researchers have documented inability to reproduce published results (&lt;a href="https://github.com/EverMind-AI/EverMemOS/issues/73" rel="noopener noreferrer"&gt;EverMemOS #73&lt;/a&gt;, &lt;a href="https://github.com/mem0ai/mem0/issues/3944" rel="noopener noreferrer"&gt;Mem0 #3944&lt;/a&gt;, &lt;a href="https://github.com/getzep/zep-papers/issues/5" rel="noopener noreferrer"&gt;Zep scoring discrepancy&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Full audit with all 99 errors documented, methodology, and reproducible scripts: &lt;a href="https://github.com/dial481/locomo-audit" rel="noopener noreferrer"&gt;locomo-audit&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  LongMemEval
&lt;/h2&gt;

&lt;p&gt;LongMemEval-S (&lt;a href="https://arxiv.org/abs/2410.10813" rel="noopener noreferrer"&gt;Wu et al., 2024&lt;/a&gt;) is the other frequently cited benchmark. The issue is different but equally fundamental: it does not effectively isolate memory capability from context window capacity.&lt;/p&gt;

&lt;p&gt;LongMemEval-S uses approximately 115K tokens of context per question. Current models support 200K to 1M token context windows. The entire test corpus fits in a single context window for most current models.&lt;/p&gt;

&lt;p&gt;Mastra's &lt;a href="https://mastra.ai/research/observational-memory" rel="noopener noreferrer"&gt;research&lt;/a&gt; illustrates this: their full-context baseline scored 60.20% with gpt-4o (128K context window, near the 115K threshold). Their observational memory system scored 84.23% with the same model, largely by compressing context to fit more comfortably. The benchmark is measuring context window management efficiency rather than long-term memory retrieval. As context windows continue to grow, the full-context baseline will keep climbing and the benchmark will lose its ability to discriminate.&lt;/p&gt;

&lt;p&gt;LongMemEval-S tests whether a model can locate information within 115K tokens. That is a useful capability to measure, but it is a context window test, not a memory test.&lt;/p&gt;

&lt;h2&gt;
  
  
  LoCoMo-Plus
&lt;/h2&gt;

&lt;p&gt;LoCoMo-Plus (&lt;a href="https://arxiv.org/abs/2602.10715" rel="noopener noreferrer"&gt;Li et al., 2025&lt;/a&gt;) introduces a genuinely interesting new category: "cognitive" questions testing implicit inference rather than factual recall. These use cue-trigger pairs with deliberate semantic disconnect - the system must connect "I just adopted a rescue dog" (cue) to "what kind of pet food should I buy?" (trigger) across sessions without lexical overlap. The concept is sound and addresses a real gap in existing evaluation.&lt;/p&gt;

&lt;p&gt;The issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It inherits all 1,540 original LoCoMo questions unchanged, including the 99 score-corrupting errors documented above.&lt;/li&gt;
&lt;li&gt;The improved judging methodology (task-specific prompts, three-tier scoring, 0.80+ human-LLM agreement) was only validated on the new cognitive questions. The original five categories retain the same broken ground truth with no revalidation.&lt;/li&gt;
&lt;li&gt;The judge model defaults to gpt-4o-mini.&lt;/li&gt;
&lt;li&gt;Same lack of pipeline standardization.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The new cognitive category is a meaningful contribution. The inherited evaluation infrastructure retains the problems described above.&lt;/p&gt;

&lt;h2&gt;
  
  
  Requirements for meaningful long-term memory evaluation
&lt;/h2&gt;

&lt;p&gt;Based on this analysis, we see several requirements for benchmarks that can meaningfully evaluate long-term memory systems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Corpus size must exceed context windows.&lt;/strong&gt; If the full test corpus fits in context, retrieval is optional and the benchmark cannot distinguish memory systems from context window management. &lt;a href="https://arxiv.org/abs/2510.27246" rel="noopener noreferrer"&gt;BEAM&lt;/a&gt; moves in this direction with conversations up to 10M tokens, though it introduces its own challenges.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Evaluation must use current-generation models.&lt;/strong&gt; gpt-4o-mini as a judge introduces a ceiling on scoring precision. Both the systems under test and the judges evaluating them should reflect current model capabilities.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Judge reliability must be validated adversarially.&lt;/strong&gt; When a judge accepts 63% of intentionally wrong answers, score differences below that threshold are not interpretable. Task-specific rubrics, stronger judge models, and adversarially validated ground truth are all necessary.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ingestion should reflect realistic use.&lt;/strong&gt; Knowledge in real applications builds through conversation - with turns, corrections, temporal references, and evolving relationships. Benchmarks that test single-pass ingestion of static text miss the core challenge of persistent memory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Evaluation pipelines must be standardized or fully disclosed.&lt;/strong&gt; At minimum: ingestion method (and prompt if applicable), embedding model, answer generation prompt, judge model, judge prompt, number of runs, and standard deviation. Without this, cross-system comparisons in published tables are not meaningful.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ground truth must be verified.&lt;/strong&gt; A 6.4% error rate in the answer key creates a noise floor that makes small score differences uninterpretable. &lt;a href="https://arxiv.org/abs/2103.14749" rel="noopener noreferrer"&gt;Northcutt et al. (NeurIPS 2021)&lt;/a&gt; found an average of 3.3% label errors across 10 major ML benchmarks and demonstrated that these errors can destabilize model rankings. LoCoMo's error rate is nearly double that baseline.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The long-term memory evaluation problem is genuinely hard - it sits at the intersection of retrieval, reasoning, temporal understanding, and knowledge integration. We'd be interested in hearing what the community thinks is missing from this list, and whether anyone has found evaluation approaches that avoid these pitfalls.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>machinelearning</category>
      <category>benchmarks</category>
    </item>
  </channel>
</rss>
