<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Penfield</title>
    <description>The latest articles on Forem by Penfield (@penfieldlabs).</description>
    <link>https://forem.com/penfieldlabs</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3748893%2F285067b2-ce2f-4b22-8761-96b931a3ef02.jpg</url>
      <title>Forem: Penfield</title>
      <link>https://forem.com/penfieldlabs</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/penfieldlabs"/>
    <language>en</language>
    <item>
      <title>Great piece. Seven months later it's only gotten worse. Stars are one of the primary marketing metrics for AI repos, regardless of whether the code even works.</title>
      <dc:creator>Penfield</dc:creator>
      <pubDate>Sat, 11 Apr 2026 07:31:15 +0000</pubDate>
      <link>https://forem.com/penfieldlabs/great-piece-seven-months-later-its-only-gotten-worse-stars-are-one-of-the-primary-marketing-6b9</link>
      <guid>https://forem.com/penfieldlabs/great-piece-seven-months-later-its-only-gotten-worse-stars-are-one-of-the-primary-marketing-6b9</guid>
<description>&lt;p&gt;Commenting on &lt;a href="https://dev.to/dev_tips/the-fake-github-economy-no-one-talks-about-stars-followers-and-5k-accounts-43pn"&gt;The fake GitHub economy no one talks about: Stars, Followers, and $5k Accounts&lt;/a&gt; by &amp;lt;devtips/&amp;gt; (Sep 15 '25). Tagged #discuss, #webdev, #programming, #github. 5 reactions, 2 comments, 28 min read.&lt;/p&gt;
</description>
    </item>
    <item>
      <title>The YC President Endorsed an AI Memory System With Fake Benchmarks. He Also Shipped His Own. We Read the Code.</title>
      <dc:creator>Penfield</dc:creator>
      <pubDate>Sat, 11 Apr 2026 07:23:04 +0000</pubDate>
      <link>https://forem.com/penfieldlabs/the-yc-president-endorsed-an-ai-memory-system-with-fake-benchmarks-then-he-shipped-his-own-we-4c9l</link>
      <guid>https://forem.com/penfieldlabs/the-yc-president-endorsed-an-ai-memory-system-with-fake-benchmarks-then-he-shipped-his-own-we-4c9l</guid>
      <description>&lt;p&gt;Garry Tan is the president and CEO of Y Combinator. He has over 738,000 followers on X. Yesterday he publicly endorsed &lt;a href="https://dev.to/penfieldlabs/milla-jovovich-just-released-an-ai-memory-system-it-reached-over-15-million-people-and-5400-297l"&gt;MemPalace&lt;/a&gt;, calling it &lt;a href="https://x.com/garrytan/status/2042507733237485994?s=20" rel="noopener noreferrer"&gt;"impressive."&lt;/a&gt; In the same post, he announced &lt;a href="https://github.com/garrytan/gbrain" rel="noopener noreferrer"&gt;GBrain&lt;/a&gt;, his own AI memory project.&lt;/p&gt;

&lt;p&gt;There is one problem with the endorsement and one problem with the project.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Endorsement
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/milla-jovovich/mempalace/" rel="noopener noreferrer"&gt;MemPalace&lt;/a&gt; reported Recall@5 retrieval scores as end-to-end QA accuracy. When independent developers ran actual QA evaluation, scores dropped dramatically from the reported 96.6%. The project's own GitHub issues document the discrepancies in detail (#27, #29, #39, #125, #242). &lt;/p&gt;

&lt;p&gt;None of this is hidden. It is in the project's public issue tracker. Garry Tan either did not check, did not care, or did not understand the issues.&lt;/p&gt;

&lt;p&gt;We wrote about MemPalace's benchmarks shortly after the project first went viral: &lt;a href="https://dev.to/penfieldlabs/milla-jovovich-just-released-an-ai-memory-system-it-reached-over-15-million-people-and-5400-297l"&gt;Milla Jovovich just released an AI memory system. It reached over 1.5 million people and 5,400 GitHub stars in less than 24 hours.&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Project
&lt;/h2&gt;

&lt;p&gt;GBrain appeared on GitHub on April 5, 2026. It is just six days old. 43 commits. One contributor. Over 2,000 stars.&lt;/p&gt;

&lt;p&gt;The README describes three flagship features: compiled truth rewriting, a dream cycle for overnight maintenance, and entity detection on every message.&lt;/p&gt;

&lt;p&gt;We cloned the repo and read every file. All three features are markdown documents that instruct an AI agent what to do. The codebase itself contains no rewrite logic, no scheduling, and no entity detection. The words "rewrite," "stale," "synthesize," and "consolidate" do not appear in any source file. "Cron," "schedule," "setInterval," and "timer" do not appear either.&lt;/p&gt;

&lt;p&gt;What does exist is a storage layer over PostgreSQL with pgvector, hybrid search with Reciprocal Rank Fusion, and a chunking pipeline. Reasonably competent infrastructure. But the MCP server, the primary integration point for AI agents, ships broken. The project's own &lt;a href="https://github.com/garrytan/gbrain/issues/22" rel="noopener noreferrer"&gt;issue #22&lt;/a&gt; documented twelve critical bugs including race conditions, NULL embedding overwrites, and an S3 backend that a security audit note dated April 10 calls "not production-ready."&lt;/p&gt;
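
&lt;p&gt;For readers unfamiliar with Reciprocal Rank Fusion, here is a minimal sketch of the idea (our own illustration, not GBrain's code): each retriever contributes 1/(k + rank) per document, and the fused ranking sorts by the summed score.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Minimal Reciprocal Rank Fusion sketch (illustrative only, not GBrain's implementation).
# Each retriever returns doc IDs in rank order; fused score = sum of 1 / (k + rank).
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a vector-search ranking with a keyword (BM25-style) ranking.
print(rrf([["d3", "d1", "d7"], ["d1", "d9", "d3"]]))  # d1 and d3 rise to the top
&lt;/code&gt;&lt;/pre&gt;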

&lt;h2&gt;
  
  
  The Pattern
&lt;/h2&gt;

&lt;p&gt;This is not the first time. Tan's previous project, &lt;a href="https://github.com/garrytan/gstack" rel="noopener noreferrer"&gt;gstack&lt;/a&gt;, has amassed over 69,000 GitHub stars. Developer &lt;a href="https://x.com/atmoio/status/2033532603433619845?s=20" rel="noopener noreferrer"&gt;Mo Bitar described it as "a bunch of prompts in a folder."&lt;/a&gt; Another founder noted that without the YC title, it would not have made Product Hunt. A developer who examined Tan's AI-generated website code &lt;a href="https://www.fastcompany.com/91520702/y-combinator-garry-tan-agentic-ai-social-media" rel="noopener noreferrer"&gt;found 78,400 lines including empty CSS files, duplicate assets, and test files shipped to production.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three projects. One pattern. Big claims, big following, no independent verification.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Stars Mean Nothing
&lt;/h2&gt;

&lt;p&gt;MemPalace now has over 40,000 stars. GBrain has over 2,000 in six days. gstack has over 69,000. None of these numbers tell you whether the software works.&lt;/p&gt;

&lt;p&gt;If you don't happen to have a Hollywood movie star friend and you aren't president of YC with 738,000+ X followers, don't worry. You can always just &lt;a href="https://dev.to/dev_tips/the-fake-github-economy-no-one-talks-about-stars-followers-and-5k-accounts-43pn"&gt;buy stars&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Full Investigation
&lt;/h2&gt;

&lt;p&gt;We published a detailed investigation into GBrain's source code and the MemPalace endorsement on our Substack:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://penfieldlabs.substack.com/" rel="noopener noreferrer"&gt;When the YC President Says He's "Impressed"&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;We are building &lt;a href="https://penfieldlabs.substack.com/p/proposal-a-new-benchmark-for-long" rel="noopener noreferrer"&gt;an open benchmark for long-term AI memory&lt;/a&gt;, because the current ecosystem too often fails to distinguish working systems from compelling READMEs.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aimemory</category>
      <category>benchmarks</category>
      <category>yc</category>
    </item>
    <item>
      <title>Proposal: A Real Benchmark for Long-Term AI Memory Systems</title>
      <dc:creator>Penfield</dc:creator>
      <pubDate>Fri, 10 Apr 2026 12:24:32 +0000</pubDate>
      <link>https://forem.com/penfieldlabs/proposal-a-real-benchmark-for-long-term-ai-memory-systems-57p5</link>
      <guid>https://forem.com/penfieldlabs/proposal-a-real-benchmark-for-long-term-ai-memory-systems-57p5</guid>
      <description>&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Nearly every AI memory system is publishing scores on benchmarks that don't adequately measure what they claim to measure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/penfieldlabs/we-audited-locomo-64-of-the-answer-key-is-wrong-and-the-judge-accepts-up-to-63-of-intentionally-33lg"&gt;We audited LoCoMo&lt;/a&gt; and found &lt;strong&gt;6.4% of the answer key is factually wrong&lt;/strong&gt; (99 errors in 1,540 questions), the LLM judge accepts &lt;strong&gt;63% of intentionally wrong answers&lt;/strong&gt;, and &lt;strong&gt;56% of per-category system comparisons are statistically indistinguishable from noise&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/2410.10813" rel="noopener noreferrer"&gt;LongMemEval-S&lt;/a&gt; uses ~115K tokens per question — every frontier model can hold that in context. It's a better context window test than a memory test.&lt;/p&gt;

&lt;p&gt;Meanwhile, each system uses its own ingestion, its own answer generation prompt, and sometimes its own judge configuration — then publishes scores in the same table as if they share a common methodology. The &lt;a href="https://github.com/getzep/zep-papers/issues/5" rel="noopener noreferrer"&gt;Mem0/Zep benchmark dispute&lt;/a&gt; illustrates this perfectly: two companies testing the same systems, arriving at wildly different numbers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ten Design Principles
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Corpus must exceed context windows
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1–2 million tokens&lt;/strong&gt; of total context. Large enough to require genuine memory retrieval. Small enough to be economically feasible for independent researchers.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Corpus must model real agent usage
&lt;/h3&gt;

&lt;p&gt;Multi-session conversations between one person and an AI assistant over ~6 months. Work projects, personal preferences, corrections, evolving facts — not disconnected chit-chat between strangers.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Ingestion is the system's problem, but must be disclosed
&lt;/h3&gt;

&lt;p&gt;Each system ingests however it wants. But it must publish: ingestion method, model used, embedding model, total cost, and total time.&lt;/p&gt;
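
&lt;p&gt;As a concrete illustration, a disclosure record could be as simple as the following (field names and values are our sketch, not a finalized schema):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical disclosure record published alongside a score.
# Field names and values are illustrative; the final schema is an open question.
ingestion_disclosure = {
    "ingestion_method": "per-session LLM extraction into a knowledge graph",
    "ingestion_model": "example-extraction-model",
    "embedding_model": "example-embedding-model",
    "total_cost_usd": 48.20,
    "total_wall_clock_hours": 3.5,
}
&lt;/code&gt;&lt;/pre&gt;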

&lt;h3&gt;
  
  
  4. Answer generation: standardized OR fully disclosed
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Standard track:&lt;/strong&gt; Prescribed model, prescribed prompt, single-shot. The only variable is what memory retrieves. Apples-to-apples.&lt;br&gt;
&lt;strong&gt;Open track:&lt;/strong&gt; Use whatever you want, fully disclosed, reported separately. Never mixed with standard track scores.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Equal statistical power across categories
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;400 questions per category.&lt;/strong&gt; LoCoMo's smallest category has 96 questions, which gives Wilson score margins of error so wide that per-category score differences are indistinguishable from noise.&lt;/p&gt;
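
&lt;p&gt;To make the statistical-power point concrete, here is a quick back-of-the-envelope check (a standard 95% Wilson score interval, our own calculation) comparing n=96 with n=400:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math

# 95% Wilson score interval half-width for an observed accuracy p on n questions.
def wilson_half_width(p, n, z=1.96):
    denom = 1 + z**2 / n
    return z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom

for n in (96, 400):
    print(f"n={n}: roughly +/- {wilson_half_width(0.70, n):.1%} around 70% accuracy")
# n=96 gives about +/- 9.0 points; n=400 gives about +/- 4.5 points.
&lt;/code&gt;&lt;/pre&gt;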

&lt;h3&gt;
  
  
  6. Human-verified ground truth
&lt;/h3&gt;

&lt;p&gt;Error rate target: &lt;strong&gt;&amp;lt;1%.&lt;/strong&gt; Model council pre-screening, crowd-sourced review with bounties, expert tiebreakers.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Adversarially validated judge
&lt;/h3&gt;

&lt;p&gt;Generate intentionally wrong answers before launch. The judge must reject &lt;strong&gt;&amp;gt;95%&lt;/strong&gt;. No more judges that can't distinguish vague topically-adjacent answers from correct ones.&lt;/p&gt;

&lt;h3&gt;
  
  
  8. Abstention is scored
&lt;/h3&gt;

&lt;p&gt;"I don't know" when the answer IS in the corpus: &lt;strong&gt;0.10.&lt;/strong&gt; Confidently wrong: &lt;strong&gt;0.0.&lt;/strong&gt; A system that knows its limits should beat one that hallucinates.&lt;/p&gt;
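
&lt;p&gt;A minimal sketch of the scoring rule (our assumptions: a correct answer scores 1.0, and abstaining on a question the corpus genuinely cannot answer scores full credit):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of an abstention-aware scoring rule. Only the 0.10 and 0.0 values come
# from the proposal; the other values are our assumptions, not a finalized spec.
def score(answer_correct, abstained, answer_in_corpus):
    if abstained:
        # Abstaining when the corpus truly lacks the answer: full credit (assumed).
        # Abstaining when the answer was available: small partial credit (0.10).
        return 0.10 if answer_in_corpus else 1.0
    # Otherwise: right earns 1.0, confidently wrong earns 0.0.
    return 1.0 if answer_correct else 0.0
&lt;/code&gt;&lt;/pre&gt;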

&lt;h3&gt;
  
  
  9. Multiple scoring dimensions
&lt;/h3&gt;

&lt;p&gt;Accuracy alone hides everything interesting. The scorecard includes: accuracy (standard + open), retrieval precision (tokens per question), latency (p50/p95), abstention quality, and supersession handling.&lt;/p&gt;

&lt;h3&gt;
  
  
  10. Context-stuffing is measured, not hidden
&lt;/h3&gt;

&lt;p&gt;Systems report the token count of context provided to the answer generation model for each question.&lt;/p&gt;

&lt;h2&gt;
  
  
  Six Question Categories
&lt;/h2&gt;

&lt;p&gt;2,400 questions total — 400 per category:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Direct recall&lt;/strong&gt; — Can you retrieve a specific fact that was stated explicitly?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Temporal reasoning&lt;/strong&gt; — Can you reason about when things happened and how facts changed over time?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-hop inference&lt;/strong&gt; — Can you connect information from different conversations to answer a question never explicitly discussed?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Supersession and correction&lt;/strong&gt; — Can you track when facts have been updated, corrected, or superseded?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cognitive inference&lt;/strong&gt; — Can you make connections that require understanding implications rather than explicit statements?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adversarial abstention&lt;/strong&gt; — Can you correctly identify when you DON'T have the information?&lt;/p&gt;

&lt;h2&gt;
  
  
  What We're NOT Doing
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Not prescribing ingestion method&lt;/li&gt;
&lt;li&gt;Not requiring a specific embedding model&lt;/li&gt;
&lt;li&gt;Not testing with outdated models&lt;/li&gt;
&lt;li&gt;Not making it cost-prohibitive to run&lt;/li&gt;
&lt;li&gt;Not handing down a finished spec — this is a proposal and an invitation to collaborate&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Read the Full Proposal
&lt;/h2&gt;

&lt;p&gt;The complete write-up, including corpus generation methodology, model comparability framework, open questions, and full references can be found here:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://penfieldlabs.substack.com/p/proposal-a-new-benchmark-for-long" rel="noopener noreferrer"&gt;A Real Benchmark for Long-Term AI Memory Systems&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The full &lt;a href="https://github.com/dial481/locomo-audit" rel="noopener noreferrer"&gt;LoCoMo audit&lt;/a&gt; with all 99 errors documented is public.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;We're looking for memory system builders, benchmark designers, and researchers who share the goal of honest measurement. Feedback, criticism, and contributions welcome.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aimemory</category>
      <category>benchmarks</category>
    </item>
    <item>
      <title>Milla Jovovich just released an AI memory system. It reached over 1.5 million people and 5,400 GitHub stars in less than 24 hours.</title>
      <dc:creator>Penfield</dc:creator>
      <pubDate>Tue, 07 Apr 2026 11:39:10 +0000</pubDate>
      <link>https://forem.com/penfieldlabs/milla-jovovich-just-released-an-ai-memory-system-it-reached-over-15-million-people-and-5400-297l</link>
      <guid>https://forem.com/penfieldlabs/milla-jovovich-just-released-an-ai-memory-system-it-reached-over-15-million-people-and-5400-297l</guid>
      <description>&lt;h2&gt;
  
  
  Problem: None of the benchmark scores are real.
&lt;/h2&gt;

&lt;p&gt;Yesterday an X account belonging to a developer named &lt;strong&gt;Ben Sigman&lt;/strong&gt; posted the launch of an open-source &lt;strong&gt;AI memory project called MemPalace&lt;/strong&gt;. The post claimed "100% on LoCoMo" and "the first perfect score ever recorded on LongMemEval. 500/500 questions, every category at 100%." It credited the actress &lt;strong&gt;Milla Jovovich&lt;/strong&gt; as a co-author. The GitHub account hosting the repository is named &lt;a href="https://github.com/milla-jovovich/mempalace" rel="noopener noreferrer"&gt;milla-jovovich/mempalace&lt;/a&gt;. The first commit to the repository is dated April 5. As of this writing, less than 24 hours after the launch post, the repository has approximately 5,400 stars and over 1.5 million views on the launch tweet.&lt;/p&gt;

&lt;p&gt;For comparison: open-source memory projects with similar architectures and similar honest baseline numbers typically receive just a handful of stars in their first week. The variable producing the orders-of-magnitude difference in engagement is not the engineering. The engineering, as we'll demonstrate, is in some respects interesting and in most respects unexceptional. The variable is the celebrity name on the GitHub account and the celebrity attribution in the launch post. The launch post described her as a co-author. Whatever the underlying collaboration looked like, the practical effect of attaching the name was that a repository created two days ago reached over 1.5 million people on a single tweet. The methodology errors documented below were carried by that reach to an audience most of whom are unlikely to read the &lt;a href="https://github.com/milla-jovovich/mempalace/blob/main/benchmarks/BENCHMARKS.md" rel="noopener noreferrer"&gt;BENCHMARKS.md&lt;/a&gt; file for themselves.&lt;/p&gt;

&lt;p&gt;We work on a different memory project at Penfield, and a couple of months ago we published &lt;a href="https://github.com/dial481/locomo-audit" rel="noopener noreferrer"&gt;an audit of LoCoMo's ground truth&lt;/a&gt; documenting 99 wrong, hallucinated, or misattributed answers across the dataset's ten conversations. A 100% score on the published version of LoCoMo is mathematically excluded. The answer key contains errors any honest system would disagree with. So when the launch post showed up in the timeline, we sought to understand how this impossible number was produced.&lt;/p&gt;

&lt;p&gt;What we found is a methodology stack that contains, in one repository created two days ago, almost every failure mode the AI memory benchmark layer suffers from right now. The interesting thing — the thing that made this worth writing about rather than ignoring — is that &lt;strong&gt;the project's own internal documentation discloses most of its failure modes honestly&lt;/strong&gt;. The launch post strips every caveat. The methodology errors are common across the field. The honesty gap between the repository and the marketing is arguably the bigger story. The celebrity name is the reason anyone heard about it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The LoCoMo bypass
&lt;/h2&gt;

&lt;p&gt;LoCoMo is a conversational memory benchmark with ten long conversations and 1,986 question-answer pairs. The standard convention in published evaluation is to report on the 1,540 non-adversarial subset; the launch post reports on all 1,986. The ten conversations contain 19, 19, 32, 29, 29, 28, 31, 30, 25, and 30 sessions respectively. Every conversation has fewer than fifty sessions.&lt;/p&gt;

&lt;p&gt;The MemPalace LoCoMo runner produces its 100% number with &lt;code&gt;top_k=50&lt;/code&gt;. Their own &lt;a href="https://github.com/milla-jovovich/mempalace/blob/main/benchmarks/BENCHMARKS.md" rel="noopener noreferrer"&gt;BENCHMARKS.md&lt;/a&gt; says this verbatim:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The LoCoMo 100% result with top-k=50 has a structural issue: each of the 10 conversations has 19–32 sessions, but top-k=50 exceeds that count. This means the ground-truth session is always in the candidate pool regardless of the embedding model's ranking. The Sonnet rerank is essentially doing reading comprehension over all sessions — the embedding retrieval step is bypassed entirely.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Setting &lt;code&gt;top_k=50&lt;/code&gt; against a candidate pool that maxes out at 32 retrieves the entire conversation. At that setting the pipeline reduces to: dump every session into Claude Sonnet, ask Sonnet which one matches. That is &lt;code&gt;cat *.txt | claude&lt;/code&gt;. It is not retrieval and it is not memory. The "memory architecture" contributes nothing to the score.&lt;/p&gt;
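
&lt;p&gt;The bypass is easy to verify for yourself. A toy sketch (ours, not the MemPalace runner): when top_k is at least the size of the candidate pool, the gold session is retrieved by construction, before the embedding model does anything.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Toy illustration of the top_k=50 bypass (not the MemPalace runner itself).
sessions_per_conversation = [19, 19, 32, 29, 29, 28, 31, 30, 25, 30]
top_k = 50

for n_sessions in sessions_per_conversation:
    retrieved = min(top_k, n_sessions)  # you can never retrieve more than exists
    assert retrieved == n_sessions      # top_k=50 always returns the whole conversation
print("the gold session is always in the candidate pool; recall is 1.0 by construction")
&lt;/code&gt;&lt;/pre&gt;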

&lt;p&gt;The honest LoCoMo numbers, from the same file, are 60.3% R@10 with no rerank and 88.9% R@10 with the project's hybrid scoring and no LLM. Those are real and unremarkable. The 100% should not be cited at all. It cannot be 100% in any case, because the published ground truth is wrong on roughly 99 questions. It is also worth noting that the LoCoMo judge accepts up to 63% of intentionally wrong answers as correct.&lt;/p&gt;

&lt;h2&gt;
  
  
  The LongMemEval metric error
&lt;/h2&gt;

&lt;p&gt;LongMemEval as published is an end-to-end question-answering benchmark. A system has to retrieve from a haystack of prior chat sessions, generate an answer, and have that answer marked correct by a GPT-4 judge. Every score on the published LongMemEval leaderboard is the percentage of questions where the generated answer was judged correct.&lt;/p&gt;

&lt;p&gt;The MemPalace LongMemEval runner does the retrieval step only. It never generates an answer and never invokes a judge. For each of the 500 questions it builds one document per session by concatenating only the user turns (assistant turns are not indexed at all), embeds with default ChromaDB embeddings, returns the top five sessions by cosine distance, and checks set membership against the gold session IDs labeled by the LongMemEval authors. If any one of the gold session IDs appears in the top five, the question scores 1.0. This metric is &lt;code&gt;recall_any@5&lt;/code&gt;. The runner also computes &lt;code&gt;recall_all@5&lt;/code&gt; (the stricter version that requires every gold session to be retrieved) and the project reports the softer one.&lt;/p&gt;
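
&lt;p&gt;To make the metric distinction concrete, here is a small sketch of the two recall variants (our illustration, not the repository's code):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of the two retrieval-recall variants (illustrative, not MemPalace's runner).
def recall_any_at_k(retrieved, gold):
    # 1.0 if ANY gold session ID appears among the retrieved IDs (the softer metric).
    return 0.0 if set(retrieved).isdisjoint(gold) else 1.0

def recall_all_at_k(retrieved, gold):
    # 1.0 only if EVERY gold session ID was retrieved (the stricter metric).
    return 1.0 if set(gold).issubset(retrieved) else 0.0

retrieved = ["s12", "s03", "s44", "s07", "s19"]  # top five sessions by cosine distance
gold = ["s03", "s88"]                            # session IDs labeled by the dataset authors
print(recall_any_at_k(retrieved, gold))  # 1.0, the number that gets reported
print(recall_all_at_k(retrieved, gold))  # 0.0, the stricter version
&lt;/code&gt;&lt;/pre&gt;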

&lt;p&gt;So the system never reads what is in the retrieved sessions, never produces an answer, and never demonstrates that the sessions it returned actually answer the question. The dataset author labeled them, the runner checks the labels, and credit is awarded on label-set overlap. None of the LongMemEval numbers in this repository — not the 100%, not the 98.4% "held-out" number, not the 96.6% raw baseline — are LongMemEval scores in the sense the published leaderboard means. They are retrieval recall numbers on the same dataset, which is a substantially easier task. Calling any of them a "perfect score on LongMemEval" is a metric category error.&lt;/p&gt;

&lt;p&gt;The 100% number additionally has a separate problem. The project's hybrid v4 mode was built by inspecting the three remaining wrong answers in their dev set and writing targeted code for each one: a quoted-phrase boost for a question containing a specific phrase in single quotes, a person-name boost for a question about someone named Rachel, and "I still remember" / "when I was in high school" patterns for a question about a high school reunion. Three patches for three specific questions. Then the same five hundred are rerun and the result is reported as a perfect score. The project's own BENCHMARKS.md calls this what it is, on line 461, verbatim:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This is teaching to the test. The fixes were designed around the exact failure cases, not discovered by analyzing general failure patterns.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The features that don't exist in the code
&lt;/h2&gt;

&lt;p&gt;The launch post lists "contradiction detection catches wrong names, wrong pronouns, wrong ages before you ever see them" as a feature. The file &lt;a href="https://github.com/milla-jovovich/mempalace/blob/main/mempalace/knowledge_graph.py" rel="noopener noreferrer"&gt;&lt;code&gt;mempalace/knowledge_graph.py&lt;/code&gt;&lt;/a&gt; contains zero occurrences of the word "contradict." The only deduplication logic in that file is an exact-match check on &lt;code&gt;(subject, predicate, object)&lt;/code&gt; triples — it blocks identical triples from being added twice and does nothing else. Conflicting facts about the same subject can accumulate indefinitely. The marketed feature does not exist in the code. Credit to the developer Leonard Lin (lhl), who documented this independently in &lt;a href="https://github.com/milla-jovovich/mempalace/issues/27" rel="noopener noreferrer"&gt;issue #27&lt;/a&gt; on the same repository within hours of the launch.&lt;/p&gt;

&lt;h2&gt;
  
  
  AAAK is not lossless
&lt;/h2&gt;

&lt;p&gt;The launch post claims "AAAK compression fits your entire life context into 120 tokens — 30x lossless compression any LLM reads natively." The project's compression module, &lt;a href="https://github.com/milla-jovovich/mempalace/blob/main/mempalace/dialect.py" rel="noopener noreferrer"&gt;&lt;code&gt;mempalace/dialect.py&lt;/code&gt;&lt;/a&gt;, truncates sentences at 55 characters (&lt;code&gt;if len(best) &amp;gt; 55: best = best[:52] + "..."&lt;/code&gt;), filters by keyword frequency, and provides a &lt;code&gt;decode()&lt;/code&gt; function that splits the compressed string into a header dictionary without reconstructing the original text. There is no round-trip.&lt;/p&gt;

&lt;p&gt;There is also a measurement. The same BENCHMARKS.md reports &lt;code&gt;results_raw_full500.jsonl&lt;/code&gt; at 96.6% R@5 and &lt;code&gt;results_aaak_full500.jsonl&lt;/code&gt; at 84.2% R@5 — a 12.4 percentage point quality drop on the same dataset and the same metric, run by the project itself. Lossless compression cannot cause a measured quality drop. The project measured the loss, recorded it in the benchmark file, and then published "30x lossless" anyway.&lt;/p&gt;

&lt;h2&gt;
  
  
  The broken layer underneath
&lt;/h2&gt;

&lt;p&gt;None of these failure modes are unique to MemPalace. LoCoMo's ground truth has been broken since the dataset was published. The benchmark wars in the AI memory space already involve documented methodology disputes that go well beyond normal disagreement: Zep published a detailed article in 2025 titled &lt;a href="https://blog.getzep.com/lies-damn-lies-statistics-is-mem0-really-sota-in-agent-memory/" rel="noopener noreferrer"&gt;"Lies, Damn Lies, and Statistics: Is Mem0 Really SOTA in Agent Memory?"&lt;/a&gt; arguing that Mem0's published LoCoMo numbers depend on a flawed evaluation harness and on Mem0 having run a misconfigured version of Zep when benchmarking against it. Mem0's CTO replied on Zep's own issue tracker in &lt;a href="https://github.com/getzep/zep-papers/issues/5" rel="noopener noreferrer"&gt;"Revisiting Zep's 84% LoCoMo Claim: Corrected Evaluation &amp;amp; 58.44% Accuracy"&lt;/a&gt; claiming that Zep's real score is 58.44% rather than 84%. Letta has separately published &lt;a href="https://www.letta.com/blog/benchmarking-ai-agent-memory" rel="noopener noreferrer"&gt;"Benchmarking AI Agent Memory: Is a Filesystem All You Need?"&lt;/a&gt; reaching similar conclusions about reproducibility on the same benchmark. The MemPalace launch fits into a pattern that the field is already arguing about. What's new is the scale of the honesty gap between a single repository and their related marketing.&lt;/p&gt;

&lt;p&gt;What's unusual about MemPalace is not necessarily that one project did all of these things at once. What's unusual is that the project's own internal documentation discloses these issues honestly, while the launch communication strips these caveats. BENCHMARKS.md is over 5,000 words of careful, self-aware methodology notes that contradict the launch tweet point by point. Whoever reviewed that file knew. It's clearly documented. But then they published the inflated numbers anyway.&lt;/p&gt;

&lt;h2&gt;
  
  
  Over five thousand stars in less than twenty-four hours
&lt;/h2&gt;

&lt;p&gt;The repository was created on April 5. The launch post went up on April 6. By the morning of April 7, the launch tweet had over 1.5 million views and the repository had over 5,400 stars. There are many open-source memory projects with similar architectures and similar honest baseline numbers. They do not get 5,400 stars in twenty-four hours. They get fifty stars in their first week if they're lucky. The variable is the celebrity name. Strip the celebrity attribution out of the launch post and the project is a Python repository with a regex-based abbreviation scheme, default ChromaDB embeddings, a knowledge-graph file that doesn't implement the feature its README claims, and a benchmark folder whose own internal notes contradict the headline numbers. That repository gets fifty stars at best and dies in a week. The same code with an actress's name on the GitHub account gets 5,400 stars in less than a day and reaches over 1.5 million people on a single tweet.&lt;/p&gt;

&lt;p&gt;The engineering result underneath all of this is genuinely interesting in one specific way: it appears that raw verbatim text plus default embeddings does, in fact, beat a number of LLM-extraction approaches at session retrieval on LongMemEval-s. That suggests the field is over-engineering the memory extraction step. It is a useful negative finding. It does not require a perfect score on a benchmark whose ground truth makes a perfect score impossible. It does not require a metric category error. It does not require hand-coded patches against three specific dev questions. It does not require a celebrity attribution. The honest version of this story would have been more interesting than the hyped version, and it would likely have survived more than 24 hours of community scrutiny instead of collapsing under it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we're doing about it
&lt;/h2&gt;

&lt;p&gt;We maintain a public LoCoMo ground-truth audit at &lt;a href="https://github.com/dial481/locomo-audit" rel="noopener noreferrer"&gt;github.com/dial481/locomo-audit&lt;/a&gt;, with per-conversation error files documenting hallucinations, attribution errors, ambiguous questions, and incomplete answers across all ten conversations. The audit is open for contribution. We believe a new and improved version of LoCoMo would benefit every group working on conversational memory, including the MemPalace maintainers and including ourselves. The goal is better benchmarks, not a kill shot on any individual project.&lt;/p&gt;

&lt;p&gt;Two other independent technical critiques of MemPalace landed within the same 24 hour window: &lt;a href="https://github.com/milla-jovovich/mempalace/issues/27" rel="noopener noreferrer"&gt;Leonard Lin's README-versus-code teardown in issue #27&lt;/a&gt;, and a &lt;a href="https://github.com/milla-jovovich/mempalace/issues/37" rel="noopener noreferrer"&gt;Chinese-language warning post for the simplified Chinese developer community in issue #37&lt;/a&gt;. If you're evaluating any AI memory system right now, the right thing to do is read the benchmark code yourself before trusting the headline number. If that feels like a lot to ask — and it is — that's the problem this article is about. The celebrity name on the GitHub account is what made the problem visible. The problem itself was already there.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aimemory</category>
      <category>benchmarks</category>
    </item>
    <item>
      <title>The Real Ceiling in Claude Code's Memory System (It’s Not the 200-Line Cap)</title>
      <dc:creator>Penfield</dc:creator>
      <pubDate>Sun, 05 Apr 2026 08:11:16 +0000</pubDate>
      <link>https://forem.com/penfieldlabs/the-real-ceiling-in-claude-codes-memory-system-its-not-the-200-line-cap-2cbl</link>
      <guid>https://forem.com/penfieldlabs/the-real-ceiling-in-claude-codes-memory-system-its-not-the-200-line-cap-2cbl</guid>
      <description>&lt;p&gt;Someone published the full Claude Code source to the internet last week. 512,000 lines of TypeScript across 1,916 files.&lt;/p&gt;

&lt;p&gt;Like everyone else, we went straight for the memory system. But unlike the analyses making the rounds, we didn't stop at the index file. We read the entire memory pipeline: the extraction agent, the dream consolidation system, the forked agent pattern, the lock files, the feature flags, the prompt templates, all of it.&lt;/p&gt;

&lt;p&gt;Here's the full picture, including the parts nobody else is talking about, and why replacing the storage layer alone doesn't fix the actual problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  The architecture is smarter than people think
&lt;/h2&gt;

&lt;p&gt;Most of the commentary has focused on the 200-line index cap in MEMORY.md and declared the system broken. That's a surface read. The architecture underneath is genuinely well-designed for a v1.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three-tier memory with bandwidth awareness:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The system has three layers, each with a different access pattern:&lt;/p&gt;

&lt;p&gt;Layer 1: MEMORY.md, the index. Always loaded into the system prompt. One-line pointers to topic files, roughly 150 characters each. Hard cap of 200 lines and 25KB. This is the only layer that costs tokens every turn.&lt;/p&gt;

&lt;p&gt;Layer 2: Topic files (markdown files in the memory directory). Loaded on demand. Each turn, a separate Sonnet call reads the file manifest and picks up to 5 relevant files based on the current query. These contain the actual knowledge.&lt;/p&gt;

&lt;p&gt;Layer 3: Session transcripts (JSONL files). Never fully read. Only accessed via targeted grep with narrow search terms. This is the raw conversation history, kept as a last-resort reference.&lt;/p&gt;

&lt;p&gt;This is a cost-conscious design. Layer 1 is always in context. Layer 2 is fetched selectively. Layer 3 is almost never touched. The 200-line cap on the index isn't an oversight, it's a token budget. The index is injected into every single system prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Four memory types, strictly constrained:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The type taxonomy is intentionally narrow: user (who you are), feedback (corrections AND confirmations), project (ongoing work context), and reference (pointers to external systems).&lt;/p&gt;

&lt;p&gt;What's interesting is what they explicitly exclude. The source code has a dedicated WHAT_NOT_TO_SAVE section: no code patterns, no architecture, no file paths, no git history, no debugging solutions. The rule is: if it's derivable from the current codebase through grep or git, don't persist it. Memory is reserved for things the codebase can't tell you.&lt;/p&gt;

&lt;p&gt;The feedback type is more nuanced than it appears. The prompt instructs the model to record both corrections ("stop doing X") and confirmations ("yes, exactly like that"). The reasoning is explicit in the source: if you only save corrections, you avoid past mistakes but drift away from validated approaches. Most memory systems only capture negative feedback. This one captures positive signal too.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Staleness is a first-class concept:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There's a memoryFreshnessText() function that appends warnings to any memory older than one day: "This memory is X days old. Memories are point-in-time observations, not live state." The model is instructed to treat memory as a hint, not truth, and verify before using. Memory is skeptical of itself.&lt;/p&gt;
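
&lt;p&gt;As an illustration of the idea (a sketch in Python, not a translation of the actual TypeScript):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of a staleness warning in the spirit of memoryFreshnessText()
# (illustrative Python; the real function is TypeScript and differs in detail).
from datetime import datetime, timezone

def freshness_note(saved_at):
    age_days = (datetime.now(timezone.utc) - saved_at).days
    if age_days == 0:
        return ""  # saved within the last day: no warning appended
    return (f"This memory is {age_days} days old. Memories are point-in-time "
            "observations, not live state; verify before relying on them.")
&lt;/code&gt;&lt;/pre&gt;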




&lt;h2&gt;
  
  
  The part nobody is talking about: the dream system
&lt;/h2&gt;

&lt;p&gt;This is where it gets interesting. Claude Code doesn't just accumulate memories. It consolidates them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;autoDream: background memory consolidation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After at least 24 hours and at least 5 sessions have passed, a background process called autoDream fires. It's controlled by a GrowthBook feature flag (tengu_onyx_plover), meaning Anthropic can tune the thresholds remotely without shipping code.&lt;/p&gt;

&lt;p&gt;autoDream runs as a forked subagent, a separate process that clones the parent's file state cache and gets its own transcript. It has restricted tool access (only file read and write within the memory directory) so it can't corrupt the main conversation context.&lt;/p&gt;

&lt;p&gt;The consolidation runs in four phases:&lt;/p&gt;

&lt;p&gt;Phase 1, Orient: Read the memory directory. Understand what exists. Skim topic files to avoid creating duplicates.&lt;/p&gt;

&lt;p&gt;Phase 2, Gather: Look for new signal worth persisting. Check daily logs, spot memories that contradict current codebase state, grep transcripts for specific context (narrow terms only, never exhaustive reads).&lt;/p&gt;

&lt;p&gt;Phase 3, Consolidate: Write or update memory files. Merge new signal into existing topics rather than creating near-duplicates. Convert relative dates to absolute. Delete contradicted facts at the source.&lt;/p&gt;

&lt;p&gt;Phase 4, Prune and index: Keep MEMORY.md under the 200-line and 25KB caps. Remove stale pointers. Shorten verbose entries. Resolve contradictions between files.&lt;/p&gt;

&lt;p&gt;This is a self-healing memory system. It merges, deduplicates, resolves contradictions, and aggressively prunes. Memory is continuously edited, not just appended.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Race protection:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A PID-based lock file (.consolidate-lock) prevents multiple processes from running consolidation simultaneously. The lock has a 1-hour staleness timeout (in case a process crashes mid-consolidation) and PID verification to prevent reuse collisions. The lock file's mtime doubles as the lastConsolidatedAt timestamp, so checking "should we consolidate?" costs exactly one stat() call per turn.&lt;/p&gt;
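
&lt;p&gt;A rough sketch of that pattern (ours, in Python; the actual implementation is TypeScript and handles more edge cases):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of a PID lock file with a staleness timeout (illustrative, not the
# actual Claude Code implementation).
import os, time

LOCK_PATH = os.path.expanduser("~/.claude/memory/.consolidate-lock")  # hypothetical path
STALE_AFTER_SECONDS = 3600  # reclaim the lock if the holder crashed an hour ago

def owner_alive(lock_path):
    try:
        pid = int(open(lock_path).read().strip())
        os.kill(pid, 0)  # signal 0 checks process existence without sending anything
        return True
    except (OSError, ValueError):
        return False

def try_acquire_lock():
    if os.path.exists(LOCK_PATH):
        age = time.time() - os.path.getmtime(LOCK_PATH)
        if age &amp;lt; STALE_AFTER_SECONDS and owner_alive(LOCK_PATH):
            return False  # a live consolidation already holds the lock
        os.remove(LOCK_PATH)  # stale timeout hit or the owner died: reclaim
    with open(LOCK_PATH, "w") as f:
        f.write(str(os.getpid()))
    # The lock file's mtime now doubles as a lastConsolidatedAt-style timestamp.
    return True
&lt;/code&gt;&lt;/pre&gt;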

&lt;p&gt;&lt;strong&gt;extractMemories: per-turn capture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Separately from the dream system, there's an extraction agent that runs after each query completes. It's a forked agent (same pattern as autoDream) that reviews the conversation and extracts durable memories. This is what captures information in real time. autoDream is what consolidates it later.&lt;/p&gt;

&lt;p&gt;Two different processes writing to the same memory directory. Real-time capture and periodic consolidation. The biological analogy is obvious: short-term encoding during the day, long-term consolidation during sleep.&lt;/p&gt;




&lt;h2&gt;
  
  
  The forked agent pattern
&lt;/h2&gt;

&lt;p&gt;This is the core architectural primitive that makes everything work, and nobody has mentioned it.&lt;/p&gt;

&lt;p&gt;runForkedAgent() creates a perfect fork of the main conversation. It clones the file state cache, creates a separate transcript, and shares the parent's prompt cache (the expensive part). The forked agent gets restricted tools so it can't interfere with the parent context.&lt;/p&gt;

&lt;p&gt;This single pattern powers: memory extraction (per-turn), memory consolidation (autoDream), auto-compaction, agent summaries, and sub-agent tasks. One cache, multiple specialized agents. This is Anthropic's cost optimization for running background intelligence alongside the main conversation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where the system actually hits a ceiling
&lt;/h2&gt;

&lt;p&gt;The 200-line index cap is not the real limitation. The dream system manages that cap through pruning and consolidation. The actual ceiling is architectural:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No knowledge graph.&lt;/strong&gt; Every memory is an isolated markdown file. There's no way to express that one memory supports another, contradicts another, or supersedes another. The dream system can spot contradictions and resolve them, but only through brute-force LLM reasoning over the full text. There are no typed relationships. No structured connections. No way for the agent to explore how its knowledge evolved over time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No embeddings.&lt;/strong&gt; Retrieval is a language model reading filenames and one-line descriptions, then picking up to 5 files. It's remarkably effective for what it is, but it's not semantic search. As the memory directory grows, the relevance of filename-based selection degrades. A memory about a "database migration decision" won't surface when the query is about "schema changes" unless the filename happens to match.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No cross-project memory.&lt;/strong&gt; Each project gets its own isolated memory directory, keyed to the canonical git root. Knowledge learned in one project cannot inform work in another. There's no shared context, no transfer learning between workspaces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No cross-device or cross-product memory.&lt;/strong&gt; The memory directory lives at ~/.claude/projects/ on your local filesystem. Your desktop and laptop have separate memories. Claude.ai, Claude Desktop, Claude mobile, and Claude Code all have completely separate memory systems. Knowledge is fragmented across every device and interface you use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No personality persistence.&lt;/strong&gt; There's no mechanism for the model's communication style, behavioral preferences, or domain expertise to persist. Every new session starts with a blank personality slate. Any rapport or working style you've established exists only in the current conversation's context window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No GUI for non-technical users.&lt;/strong&gt; Memory is markdown files on disk. Managing them means editing files in a text editor or asking Claude to do it for you. There's no portal, no visual browser, no way for a non-developer to see what's stored or how things connect.&lt;/p&gt;




&lt;h2&gt;
  
  
  Replacing the storage layer doesn't fix this
&lt;/h2&gt;

&lt;p&gt;Swapping markdown files for a vector store addresses one limitation (the filename-based retrieval) while leaving every other ceiling untouched.&lt;/p&gt;

&lt;p&gt;A vector store with no knowledge graph is still flat memory. You get better recall on individual memories, but the memories themselves are still isolated notes. There's still no way to say "this decision superseded that one" or "this insight contradicts our earlier assumption." You're scaling a note pile, not building knowledge.&lt;/p&gt;

&lt;p&gt;The retrieval improvement is real, embedding similarity beats filename matching. But retrieval was never the core problem. The core problem is that isolated memories, no matter how well-retrieved, can't represent connected knowledge.&lt;/p&gt;

&lt;p&gt;What actually fixes this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A knowledge graph with typed relationships.&lt;/strong&gt; Not just "these memories are similar" (that's what embeddings give you) but structured connections: supports, contradicts, supersedes, causes, depends_on, updates, and more. The agent needs to build and traverse a graph, not search a list.&lt;/p&gt;
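
&lt;p&gt;A sketch of what typed relationships look like as data (our illustration; the names are not any particular product's schema):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative data model for typed memory relationships (not any specific product's schema).
from dataclasses import dataclass, field

EDGE_TYPES = {"supports", "contradicts", "supersedes", "causes", "depends_on", "updates"}

@dataclass
class Memory:
    id: str
    text: str
    edges: list = field(default_factory=list)  # list of (edge_type, target_memory_id)

    def connect(self, edge_type, target_id):
        assert edge_type in EDGE_TYPES, f"unknown edge type: {edge_type}"
        self.edges.append((edge_type, target_id))

# "Switched to DynamoDB in March" supersedes "Chose Postgres in January".
jan = Memory("m1", "Chose Postgres for the main datastore (January)")
mar = Memory("m2", "Switched to DynamoDB because of latency issues (March)")
mar.connect("supersedes", "m1")
&lt;/code&gt;&lt;/pre&gt;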

&lt;p&gt;&lt;strong&gt;Agent-managed memory.&lt;/strong&gt; Give the model a rich set of tools (store, recall, connect, explore, reflect, update) and let it decide what matters. Claude Code's extractMemories and autoDream are early steps in this direction, but they operate on flat files. The same agent-driven approach applied to a knowledge graph is dramatically more powerful. A recent Google DeepMind paper (Evo-Memory) showed that agents with self-evolving memory cut task steps roughly in half and let smaller models match or beat larger ones with static context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Typed memories.&lt;/strong&gt; Claude Code's four types (user, feedback, project, reference) are a good start. But a correction is different from an insight, which is different from a strategic decision, which is different from a checkpoint. More types means the agent (and the user) can understand what kind of knowledge they're looking at.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Personality persistence.&lt;/strong&gt; Your AI's communication style, domain expertise, behavioral quirks, and boundaries should be stored as part of the memory system and loaded at the start of every session. On any device. On any platform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-device, cross-platform access.&lt;/strong&gt; Memory needs to be accessible from everywhere. Your desktop, your phone, your IDE, your browser. Cloud-hosted, synced, exportable. Local-only memory fragments your knowledge across every device you own.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A GUI portal.&lt;/strong&gt; Non-developers need to see what's in the memory system, edit what's wrong, and understand how things connect. "Trust us, it's in the database" isn't good enough.&lt;/p&gt;




&lt;h2&gt;
  
  
  What this looks like in practice
&lt;/h2&gt;

&lt;p&gt;Imagine you've been working with an AI assistant across multiple projects for six months. It knows your coding preferences, your architectural decisions, the bugs you've encountered, the strategies you've tried, the corrections you've made along the way.&lt;/p&gt;

&lt;p&gt;With flat memory (even with embeddings), those are 500 isolated notes that surface based on keyword or semantic similarity. Useful, but limited.&lt;/p&gt;

&lt;p&gt;With a knowledge graph, those memories are connected. The agent can trace how a decision evolved: "We chose Postgres in January (decision). Switched to DynamoDB in March (supersedes). Because of the latency issues we hit in February (caused_by). Which contradicted our original assumption about read patterns (contradicts)." It can explore connections, spot patterns, and understand context that no embedding similarity search would surface.&lt;/p&gt;

&lt;p&gt;That's the difference between a note pile and a knowledge base.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 30-second fix
&lt;/h2&gt;

&lt;p&gt;Claude Code's memory system is a well-engineered v1 with real architectural ceilings. Replacing the storage layer puts a bigger engine in a car with no steering wheel.&lt;/p&gt;

&lt;p&gt;What you actually want is a memory system with a knowledge graph, typed relationships, hybrid search (keyword, vector, and graph traversal combined), personality persistence, a GUI portal, and access from every device and platform you use.&lt;/p&gt;

&lt;p&gt;And it should take less than a minute to set up.&lt;/p&gt;

&lt;p&gt;If you use Claude.ai or Claude Desktop: go to Settings, Connectors, Add Custom Connector, paste a URL. Done. That connector is now available everywhere you use Claude, including Claude Code, Claude mobile, and Cowork. Turn it on or off anytime.&lt;/p&gt;

&lt;p&gt;If you use Cursor, Windsurf, or any MCP-compatible client: one line in your MCP config.&lt;/p&gt;

&lt;p&gt;If you're building something custom: full REST API.&lt;/p&gt;

&lt;p&gt;No Docker. No npm install. No environment variables. No JSON config files.&lt;/p&gt;

&lt;p&gt;Your AI should remember you. Across every session, every device, every platform. And that memory should be connected, not just piled up.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>claudecode</category>
      <category>aimemory</category>
    </item>
    <item>
      <title>We audited LoCoMo: 6.4% of the answer key is wrong and the judge accepts up to 63% of intentionally wrong answers</title>
      <dc:creator>Penfield</dc:creator>
      <pubDate>Sat, 04 Apr 2026 14:56:34 +0000</pubDate>
      <link>https://forem.com/penfieldlabs/we-audited-locomo-64-of-the-answer-key-is-wrong-and-the-judge-accepts-up-to-63-of-intentionally-33lg</link>
      <guid>https://forem.com/penfieldlabs/we-audited-locomo-64-of-the-answer-key-is-wrong-and-the-judge-accepts-up-to-63-of-intentionally-33lg</guid>
      <description>&lt;p&gt;Projects are still submitting &lt;a href="https://github.com/snap-research/locomo/issues" rel="noopener noreferrer"&gt;new scores on LoCoMo as of March 2026.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We audited it and found 6.4% of the answer key is wrong, and the LLM judge accepts up to 63% of intentionally wrong answers. LongMemEval-S is often raised as an alternative, but each question's corpus fits entirely in modern context windows, making it more of a context window test than a memory test. Here's what we found.&lt;/p&gt;

&lt;h2&gt;
  
  
  LoCoMo
&lt;/h2&gt;

&lt;p&gt;LoCoMo (&lt;a href="https://aclanthology.org/2024.acl-long.747.pdf" rel="noopener noreferrer"&gt;Maharana et al., ACL 2024&lt;/a&gt;) is one of the most widely cited long-term memory benchmarks. We conducted a systematic audit of the ground truth and identified 99 score-corrupting errors in 1,540 questions (6.4%). Error categories include hallucinated facts in the answer key, incorrect temporal reasoning, and speaker attribution errors.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The answer key specifies "Ferrari 488 GTB," but the source conversation contains only "this beauty" and the image caption reads "a red sports car." The car model exists only in an internal query field (annotator search strings for stock photos) that no memory system ingests. Systems are evaluated against facts they have no access to.&lt;/li&gt;
&lt;li&gt;"Last Saturday" on a Thursday should resolve to the preceding Saturday. The answer key says Sunday. A system that performs the date arithmetic correctly is penalized.&lt;/li&gt;
&lt;li&gt;24 questions attribute statements to the wrong speaker. A system with accurate speaker tracking will contradict the answer key.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The theoretical maximum score for a perfect system is approximately 93.6%.&lt;/p&gt;
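
&lt;p&gt;That ceiling is just the arithmetic of the error count:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Upper bound on an honest system's LoCoMo score, given the documented answer-key errors.
total_questions = 1540
wrong_key_entries = 99
print(f"{(total_questions - wrong_key_entries) / total_questions:.1%}")  # 93.6%
&lt;/code&gt;&lt;/pre&gt;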

&lt;p&gt;We also tested the LLM judge. LoCoMo uses gpt-4o-mini to score answers against the golden reference. We generated intentionally wrong but topically adjacent answers for all 1,540 questions and scored them using the same judge configuration and prompts used in published evaluations. The judge accepted 62.81% of them. Specific factual errors (wrong name, wrong date) were caught approximately 89% of the time. However, vague answers that identified the correct topic while missing every specific detail passed nearly two-thirds of the time. This is precisely the failure mode of weak retrieval - locating the right conversation but extracting nothing specific - and the benchmark rewards it.&lt;/p&gt;

&lt;p&gt;There is also no standardized evaluation pipeline. Each system uses its own ingestion method (arguably necessary given architectural differences), its own answer generation prompt, and sometimes entirely different models. Scores are then compared in tables as if they share a common methodology. Multiple independent researchers have documented inability to reproduce published results (&lt;a href="https://github.com/EverMind-AI/EverMemOS/issues/73" rel="noopener noreferrer"&gt;EverMemOS #73&lt;/a&gt;, &lt;a href="https://github.com/mem0ai/mem0/issues/3944" rel="noopener noreferrer"&gt;Mem0 #3944&lt;/a&gt;, &lt;a href="https://github.com/getzep/zep-papers/issues/5" rel="noopener noreferrer"&gt;Zep scoring discrepancy&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Full audit with all 99 errors documented, methodology, and reproducible scripts: &lt;a href="https://github.com/dial481/locomo-audit" rel="noopener noreferrer"&gt;locomo-audit&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  LongMemEval
&lt;/h2&gt;

&lt;p&gt;LongMemEval-S (&lt;a href="https://arxiv.org/abs/2410.10813" rel="noopener noreferrer"&gt;Wu et al., 2024&lt;/a&gt;) is the other frequently cited benchmark. The issue is different but equally fundamental: it does not effectively isolate memory capability from context window capacity.&lt;/p&gt;

&lt;p&gt;LongMemEval-S uses approximately 115K tokens of context per question. Current models support 200K to 1M token context windows. The entire test corpus fits in a single context window for most current models.&lt;/p&gt;

&lt;p&gt;Mastra's &lt;a href="https://mastra.ai/research/observational-memory" rel="noopener noreferrer"&gt;research&lt;/a&gt; illustrates this: their full-context baseline scored 60.20% with gpt-4o (128K context window, near the 115K threshold). Their observational memory system scored 84.23% with the same model, largely by compressing context to fit more comfortably. The benchmark is measuring context window management efficiency rather than long-term memory retrieval. As context windows continue to grow, the full-context baseline will keep climbing and the benchmark will lose its ability to discriminate.&lt;/p&gt;

&lt;p&gt;LongMemEval-S tests whether a model can locate information within 115K tokens. That is a useful capability to measure, but it is a context window test, not a memory test.&lt;/p&gt;

&lt;h2&gt;
  
  
  LoCoMo-Plus
&lt;/h2&gt;

&lt;p&gt;LoCoMo-Plus (&lt;a href="https://arxiv.org/abs/2602.10715" rel="noopener noreferrer"&gt;Li et al., 2025&lt;/a&gt;) introduces a genuinely interesting new category: "cognitive" questions testing implicit inference rather than factual recall. These use cue-trigger pairs with deliberate semantic disconnect - the system must connect "I just adopted a rescue dog" (cue) to "what kind of pet food should I buy?" (trigger) across sessions without lexical overlap. The concept is sound and addresses a real gap in existing evaluation.&lt;/p&gt;

&lt;p&gt;The issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It inherits all 1,540 original LoCoMo questions unchanged, including the 99 score-corrupting errors documented above.&lt;/li&gt;
&lt;li&gt;The improved judging methodology (task-specific prompts, three-tier scoring, 0.80+ human-LLM agreement) was only validated on the new cognitive questions. The original five categories retain the same broken ground truth with no revalidation.&lt;/li&gt;
&lt;li&gt;The judge model defaults to gpt-4o-mini.&lt;/li&gt;
&lt;li&gt;Same lack of pipeline standardization.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The new cognitive category is a meaningful contribution. The inherited evaluation infrastructure retains the problems described above.&lt;/p&gt;

&lt;h2&gt;
  
  
  Requirements for meaningful long-term memory evaluation
&lt;/h2&gt;

&lt;p&gt;Based on this analysis, we see several requirements for benchmarks that can meaningfully evaluate long-term memory systems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Corpus size must exceed context windows.&lt;/strong&gt; If the full test corpus fits in context, retrieval is optional and the benchmark cannot distinguish memory systems from context window management. &lt;a href="https://arxiv.org/abs/2510.27246" rel="noopener noreferrer"&gt;BEAM&lt;/a&gt; moves in this direction with conversations up to 10M tokens, though it introduces its own challenges.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Evaluation must use current-generation models.&lt;/strong&gt; gpt-4o-mini as a judge introduces a ceiling on scoring precision. Both the systems under test and the judges evaluating them should reflect current model capabilities.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Judge reliability must be validated adversarially.&lt;/strong&gt; When a judge accepts 63% of intentionally wrong answers, score differences below that threshold are not interpretable. Task-specific rubrics, stronger judge models, and adversarially validated ground truth are all necessary.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ingestion should reflect realistic use.&lt;/strong&gt; Knowledge in real applications builds through conversation - with turns, corrections, temporal references, and evolving relationships. Benchmarks that test single-pass ingestion of static text miss the core challenge of persistent memory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Evaluation pipelines must be standardized or fully disclosed.&lt;/strong&gt; At minimum: ingestion method (and prompt if applicable), embedding model, answer generation prompt, judge model, judge prompt, number of runs, and standard deviation. Without this, cross-system comparisons in published tables are not meaningful.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ground truth must be verified.&lt;/strong&gt; A 6.4% error rate in the answer key creates a noise floor that makes small score differences uninterpretable. &lt;a href="https://arxiv.org/abs/2103.14749" rel="noopener noreferrer"&gt;Northcutt et al. (NeurIPS 2021)&lt;/a&gt; found an average of 3.3% label errors across 10 major ML benchmarks and demonstrated that these errors can destabilize model rankings. LoCoMo's error rate is nearly double that baseline.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The long-term memory evaluation problem is genuinely hard - it sits at the intersection of retrieval, reasoning, temporal understanding, and knowledge integration. We'd be interested in hearing what the community thinks is missing from this list, and whether anyone has found evaluation approaches that avoid these pitfalls.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>machinelearning</category>
      <category>benchmarks</category>
    </item>
  </channel>
</rss>
