<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Jaewon Jang</title>
    <description>The latest articles on Forem by Jaewon Jang (@jaewon_jang_d63fddcf69ac2).</description>
    <link>https://forem.com/jaewon_jang_d63fddcf69ac2</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3855130%2F7aab33c6-d7e2-4ece-8084-7d55a696de88.png</url>
      <title>Forem: Jaewon Jang</title>
      <link>https://forem.com/jaewon_jang_d63fddcf69ac2</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/jaewon_jang_d63fddcf69ac2"/>
    <language>en</language>
    <item>
      <title>CTX: I gave Claude Code a memory that actually works</title>
      <dc:creator>Jaewon Jang</dc:creator>
      <pubDate>Sun, 03 May 2026 10:19:23 +0000</pubDate>
      <link>https://forem.com/jaewon_jang_d63fddcf69ac2/ctx-i-gave-claude-code-a-memory-that-actually-works-45id</link>
      <guid>https://forem.com/jaewon_jang_d63fddcf69ac2/ctx-i-gave-claude-code-a-memory-that-actually-works-45id</guid>
      <description>&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;Claude Code resets every session. There is no built-in memory. You open a new terminal, start coding, and the model has no idea what you decided yesterday, what architecture you settled on, or which files matter. You explain it again. Every time.&lt;/p&gt;

&lt;p&gt;I spent three months building something to fix this.&lt;/p&gt;

&lt;h2&gt;
  
  
  What CTX does
&lt;/h2&gt;

&lt;p&gt;CTX hooks into Claude Code's &lt;code&gt;UserPromptSubmit&lt;/code&gt; event. Before every prompt, three things happen — in under 1ms:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;G1 — Decision memory&lt;/strong&gt;&lt;br&gt;
Parses your git log and surfaces the most relevant past decisions. "Why did we switch to BM25?" "What was the reasoning behind this architecture?" CTX pulls those commit messages and injects them before you even ask.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;G2 — Code and doc search&lt;/strong&gt;&lt;br&gt;
BM25 search across your entire codebase and markdown docs. When you ask about a function, the right files are already in context. No more "I can't find that file" hallucinations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CM — Chat memory vault&lt;/strong&gt;&lt;br&gt;
A local SQLite database of past conversations, hybrid-searched (BM25 + optional vector). The things you explained once, you should only have to explain once.&lt;/p&gt;
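&lt;p&gt;Both G2 and CM lean on BM25 ranking. The sketch below shows what Okapi BM25 scoring looks like in plain Python. It is not CTX's code: it assumes whitespace tokenization and the common defaults &lt;code&gt;k1=1.5&lt;/code&gt; and &lt;code&gt;b=0.75&lt;/code&gt;, and the function name &lt;code&gt;bm25_scores&lt;/code&gt; and the toy commit messages are made up for illustration.&lt;/p&gt;

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25.

    Illustrative only: lowercase whitespace tokenization, no stemming.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # Smoothed IDF; the added 1.0 keeps it non-negative.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Length normalization: long documents need more hits to score as high.
            norm = tf + k1 * (1.0 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1.0) / norm
        scores.append(score)
    return scores


# Toy stand-ins for commit messages (hypothetical data).
docs = [
    "switch retrieval to bm25 for keyword queries",
    "add sqlite vault for chat memory",
    "fix typo in readme",
]
scores = bm25_scores("why bm25 retrieval", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(docs[best])
```

&lt;p&gt;Only the first message shares terms with the query, so it is the only one with a non-zero score. Because there is no model call involved, this kind of ranking stays cheap enough to run on every prompt.&lt;/p&gt;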
&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;p&gt;I ran rigorous benchmarks — not synthetic toy tests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory recall (MAB, N=50)&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;Recall&lt;/th&gt;
&lt;th&gt;Wilson CI 95%&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;None (baseline)&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;[0.00, 0.07]&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CTX&lt;/td&gt;
&lt;td&gt;0.40&lt;/td&gt;
&lt;td&gt;[0.28, 0.54]&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CTX v2&lt;/td&gt;
&lt;td&gt;0.58&lt;/td&gt;
&lt;td&gt;[0.44, 0.71]&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CTX v3&lt;/td&gt;
&lt;td&gt;0.88&lt;/td&gt;
&lt;td&gt;[0.76, 0.94]&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;CTX v3 vs baseline: McNemar p &amp;lt; 0.001. Statistically significant.&lt;/p&gt;
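&lt;p&gt;If you want to check the intervals yourself, here is a small sketch that recomputes Wilson 95% bounds and an exact McNemar p-value. It rests on two assumptions of mine: recall is hits out of N=50 (so 0.88 means 44 hits), and the v3 vs baseline comparison is paired on the same 50 items, which makes all 44 v3 hits discordant pairs because the baseline scored 0. The function names are illustrative.&lt;/p&gt;

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% when z=1.96)."""
    p = hits / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0
    # Binomial tail under the null that either discordant outcome is equally likely.
    tail = sum(math.comb(n, k) for k in range(min(b, c) + 1)) / 2.0 ** n
    return min(1.0, 2.0 * tail)


# Hit counts are derived from the recall column under the N=50 assumption.
for name, hits in [("baseline", 0), ("CTX", 20), ("CTX v2", 29), ("CTX v3", 44)]:
    lo, hi = wilson_interval(hits, 50)
    print(f"{name}: recall={hits / 50:.2f} CI=[{lo:.3f}, {hi:.3f}]")

# Assumption: v3 right and baseline wrong on 44 items, the reverse on 0.
print(mcnemar_exact_p(44, 0))
```

&lt;p&gt;The Wilson interval is a reasonable choice at N=50 because it stays inside [0, 1] even when the observed rate is 0, which is exactly the baseline case.&lt;/p&gt;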

&lt;p&gt;&lt;strong&gt;Real-world telemetry (10,000+ turns)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Overall utility rate: 39.6% (items injected that Claude actually cited)&lt;/li&gt;
&lt;li&gt;CM block: 52.6% utility rate (highest — chat memory is the most cited)&lt;/li&gt;
&lt;li&gt;G1 block: 39.6%&lt;/li&gt;
&lt;li&gt;G2 docs: 27.8%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A 26 percentage point gap between KEYWORD (16%) and SEMANTIC (42%) queries confirms that retrieval method selection matters, which is why CTX routes the two query types differently.&lt;/p&gt;
&lt;h2&gt;
  
  
  How it installs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Option A — Native plugin (recommended, one step):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/plugin &lt;span class="nb"&gt;install &lt;/span&gt;ctx@jaytoone
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude Code handles everything — venv, daemons, hooks. No terminal needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option B — PyPI:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;ctx-retriever &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; ctx-install
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Copies hooks to &lt;code&gt;~/.claude/hooks/&lt;/code&gt; and patches &lt;code&gt;settings.json&lt;/code&gt; atomically. Validated in clean Docker (ubuntu:22.04).&lt;/p&gt;
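&lt;p&gt;Patching a JSON config atomically usually means writing a temp file next to the original and renaming it over the target. The sketch below shows that pattern. It is not the CTX installer: the hook script path is hypothetical, and the &lt;code&gt;hooks&lt;/code&gt; layout reflects my understanding of Claude Code's settings format, so check it against the official docs before relying on it.&lt;/p&gt;

```python
import json
import os
import tempfile


def patch_settings_atomically(settings_path, hook_command):
    """Add a UserPromptSubmit command hook without risking a half-written file."""
    settings = {}
    if os.path.exists(settings_path):
        with open(settings_path, "r", encoding="utf-8") as fh:
            settings = json.load(fh)

    # Assumed layout: hooks -> event name -> list of {"hooks": [...]} groups.
    groups = settings.setdefault("hooks", {}).setdefault("UserPromptSubmit", [])
    entry = {"type": "command", "command": hook_command}
    already = any(entry in group.get("hooks", []) for group in groups)
    if not already:
        groups.append({"hooks": [entry]})

    # Write to a temp file in the same directory, then rename over the target.
    # os.replace is atomic when source and destination share a filesystem.
    directory = os.path.dirname(os.path.abspath(settings_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(settings, fh, indent=2)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_path, settings_path)
    except BaseException:
        os.unlink(tmp_path)  # never leave a stray temp file behind
        raise
    return settings


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as demo_dir:
        path = os.path.join(demo_dir, "settings.json")
        # "ctx_hook.py" is a made-up script name for the demo.
        patch_settings_atomically(path, "python3 ~/.claude/hooks/ctx_hook.py")
        patch_settings_atomically(path, "python3 ~/.claude/hooks/ctx_hook.py")  # no duplicate
        with open(path, encoding="utf-8") as fh:
            print(fh.read())
```

&lt;p&gt;A reader of &lt;code&gt;settings.json&lt;/code&gt; sees either the old file or the complete new one, never a partial write, and running the patch twice does not register the hook twice.&lt;/p&gt;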

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Latest: v0.3.13&lt;/strong&gt; — vec-daemon isolated venv (no numpy/ABI conflicts). BGE reranker opt-in: &lt;code&gt;CTX_BGE_ENABLE=1&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What it does not do
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;No cloud sync. Everything stays local.&lt;/li&gt;
&lt;li&gt;No LLM calls. BM25 + SQLite by default; vector search and the BGE reranker are opt-in.&lt;/li&gt;
&lt;li&gt;No mandatory telemetry. Opt-in only.&lt;/li&gt;
&lt;li&gt;Does not replace Claude's context window — it fills it intelligently before you ask.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/jaytoone/CTX" rel="noopener noreferrer"&gt;https://github.com/jaytoone/CTX&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;PyPI: &lt;a href="https://pypi.org/project/ctx-retriever/" rel="noopener noreferrer"&gt;https://pypi.org/project/ctx-retriever/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Dashboard: runs locally at port 8787 after install&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Demo video
&lt;/h2&gt;

&lt;p&gt;Dashboard live demo (39 seconds) — shows System Health, Knowledge Graph node interactions, and real-time events:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://drive.google.com/file/d/1b4ZvbRYkXKTepKDx8N7gLfim-zLDiCGo/view?usp=sharing" rel="noopener noreferrer"&gt;▶ Watch dashboard demo (39s) — Google Drive&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Or download directly: &lt;a href="https://drive.google.com/uc?export=download&amp;amp;id=1b4ZvbRYkXKTepKDx8N7gLfim-zLDiCGo" rel="noopener noreferrer"&gt;ctx-dashboard-demo.mp4&lt;/a&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
    </item>
    <item>
      <title>LLM agents don't degrade gradually — they cliff-edge. I built HarnessOS to survive it</title>
      <dc:creator>Jaewon Jang</dc:creator>
      <pubDate>Wed, 01 Apr 2026 15:04:10 +0000</pubDate>
      <link>https://forem.com/jaewon_jang_d63fddcf69ac2/harnessos-scaffoldmiddleware-for-infinite-autonomous-tasks-built-on-harness-engineering-3pf1</link>
      <guid>https://forem.com/jaewon_jang_d63fddcf69ac2/harnessos-scaffoldmiddleware-for-infinite-autonomous-tasks-built-on-harness-engineering-3pf1</guid>
      <description>&lt;p&gt;There's a concept gaining traction in AI systems engineering: &lt;strong&gt;Harness Engineering&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Not the testing tool. The idea: raw LLM capability is like raw power — high voltage,&lt;br&gt;
hard to control, dangerous to run indefinitely. Harness Engineering is the discipline of&lt;br&gt;
building the control structures that make that power &lt;em&gt;usable at scale&lt;/em&gt;.&lt;br&gt;
Context managers. Evaluation loops. Failure classifiers. Goal trackers. Memory tiers.&lt;/p&gt;

&lt;p&gt;I think it's going to be one of the defining disciplines of serious AI systems work.&lt;br&gt;
And I've been building a platform around it.&lt;/p&gt;


&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;HarnessOS&lt;/strong&gt; is a scaffold/middleware system for running infinite autonomous tasks.&lt;/p&gt;

&lt;p&gt;The key word is &lt;em&gt;infinite&lt;/em&gt;. Not one task. Not one session. An agent that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs continuously, across context window rotations&lt;/li&gt;
&lt;li&gt;Evolves its own goals when it succeeds at the current one&lt;/li&gt;
&lt;li&gt;Persists state across sessions without losing context&lt;/li&gt;
&lt;li&gt;Classifies its own failures and routes them appropriately&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HarnessOS
├── CTX                      ← context precision layer
│   └── LLM-free retrieval, 5.2% token budget, R@5=1.0 dependency recall
├── omc-live                 ← finite outer loop
│   └── 2-Wave strategy + self-evolving goals + episode memory
├── omc-live-infinite        ← infinite outer loop
│   └── context rotation, world model, no iteration cap
├── HalluMaze                ← hallucination management (in development)
└── [future layers]
    ├── Evaluation Layer
    ├── Safety Layer
    └── Memory Tier System
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Problem with Current Agent Frameworks
&lt;/h2&gt;

&lt;p&gt;Most agent frameworks are built for tasks that complete in one session.&lt;/p&gt;

&lt;p&gt;Spin up → run → done.&lt;/p&gt;

&lt;p&gt;That's fine for demos. It breaks for real autonomous work:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Context exhaustion&lt;/strong&gt;: At ~70% context capacity, agents start losing earlier decisions.&lt;br&gt;
Not gracefully. They cliff-edge — sudden degradation, not gradual fade.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No goal evolution&lt;/strong&gt;: An agent that succeeds at "write tests" has no mechanism to&lt;br&gt;
ask "what's the next improvement?" It just stops.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Failure is terminal&lt;/strong&gt;: Most frameworks catch exceptions. Few &lt;em&gt;classify&lt;/em&gt; them —&lt;br&gt;
transient vs persistent vs fundamental goal mismatch.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;HarnessOS is built specifically to address all three.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Measured (The Empirical Foundation)
&lt;/h2&gt;

&lt;p&gt;Before building anything, I ran controlled experiments on questions I couldn't find&lt;br&gt;
good empirical answers to anywhere else.&lt;/p&gt;
&lt;h3&gt;
  
  
  Q1: How should autonomous agents reason about problems?
&lt;/h3&gt;

&lt;p&gt;Compared &lt;strong&gt;hypothesis-driven debugging&lt;/strong&gt; (observe → hypothesize → verify)&lt;br&gt;
against &lt;strong&gt;engineering-only&lt;/strong&gt; (pattern match → retry) on 12 bug scenarios.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bug type&lt;/th&gt;
&lt;th&gt;Engineering&lt;/th&gt;
&lt;th&gt;Hypothesis&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Simple&lt;/td&gt;
&lt;td&gt;1.0 attempts&lt;/td&gt;
&lt;td&gt;1.0 attempts&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Causal&lt;/td&gt;
&lt;td&gt;1.75 attempts&lt;/td&gt;
&lt;td&gt;1.0 attempts&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-43%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Assumption&lt;/td&gt;
&lt;td&gt;2.0 attempts&lt;/td&gt;
&lt;td&gt;1.0 attempts&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-50%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;First-hypothesis accuracy: &lt;strong&gt;100%&lt;/strong&gt;. This is now the default reasoning strategy in omc-live.&lt;/p&gt;
&lt;h3&gt;
  
  
  Q2: Where do context limits actually hit?
&lt;/h3&gt;

&lt;p&gt;Measured Lost-in-the-Middle across 1K/10K/50K/100K token contexts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key finding: degradation is threshold-based, not gradual.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Agents don't slowly forget. They cliff-edge at a specific token length and fail silently.&lt;br&gt;
This changed how &lt;code&gt;omc-live-infinite&lt;/code&gt; handles context — it monitors budget and triggers&lt;br&gt;
a safe rotation handoff at 70%, before the cliff.&lt;/p&gt;
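&lt;p&gt;To make the rotation idea concrete, here is a minimal sketch of a budget check plus a state handoff. Everything in it is hypothetical (the &lt;code&gt;SessionState&lt;/code&gt; fields, the 100K window, the per-step token costs); it only illustrates the "monitor budget, save state, resume fresh" shape and is not the omc-live-infinite implementation.&lt;/p&gt;

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class SessionState:
    goal: str
    decisions: list = field(default_factory=list)
    tokens_used: int = 0
    rotations: int = 0


def should_rotate(tokens_used, context_limit, threshold=0.70):
    """True once the session has consumed the threshold share of its context."""
    return tokens_used >= threshold * context_limit


def rotate(state):
    """Serialize what must survive, then resume from it in a fresh session."""
    snapshot = json.dumps(asdict(state))             # save state (handoff payload)
    restored = SessionState(**json.loads(snapshot))  # fresh session reloads it
    restored.tokens_used = 0                         # new context window starts empty
    restored.rotations += 1
    return restored


state = SessionState(goal="raise test coverage")
context_limit = 100_000  # hypothetical window size
for step, cost in enumerate([30_000, 25_000, 20_000, 10_000]):
    state.tokens_used += cost
    state.decisions.append(f"step {step} done")
    if should_rotate(state.tokens_used, context_limit):
        state = rotate(state)

print(state.rotations, state.tokens_used, len(state.decisions))
```

&lt;p&gt;The third step pushes usage to 75% of the window, which triggers one rotation. The decisions list survives the handoff while the token counter resets.&lt;/p&gt;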
&lt;h3&gt;
  
  
  Q3: Where do autonomous agents actually fail?
&lt;/h3&gt;

&lt;p&gt;I ran OpenHands on 20-step coding tasks. The failures clustered into three categories:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Wrong task decomposition (incorrect sub-goals from the start)&lt;/li&gt;
&lt;li&gt;Role non-compliance (agent exceeds defined scope)&lt;/li&gt;
&lt;li&gt;Boundary violations (unexpected state mutations)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Predictable = preventable. The omc-failure-router classifies failures into these&lt;br&gt;
categories and routes them appropriately instead of falling back to a generic retry.&lt;/p&gt;
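&lt;p&gt;As a rough illustration of classify-then-route, here is a sketch built around the three clusters above. The event keys and the recovery actions are assumptions made for the example; the real router's signals and responses are not described in this post.&lt;/p&gt;

```python
from enum import Enum


class FailureKind(Enum):
    WRONG_DECOMPOSITION = "wrong task decomposition"
    ROLE_NON_COMPLIANCE = "role non-compliance"
    BOUNDARY_VIOLATION = "boundary violation"
    UNKNOWN = "unknown"


# Hypothetical recovery actions, one per cluster.
ROUTES = {
    FailureKind.WRONG_DECOMPOSITION: "replan: rebuild the sub-goal tree",
    FailureKind.ROLE_NON_COMPLIANCE: "re-scope: restate the agent's allowed actions",
    FailureKind.BOUNDARY_VIOLATION: "rollback: restore the last known-good state",
    FailureKind.UNKNOWN: "retry: generic retry as a last resort",
}


def classify_failure(event):
    """Map a failure event (a plain dict with hypothetical keys) to a cluster."""
    if event.get("touched_paths_outside_workspace") or event.get("unexpected_state_change"):
        return FailureKind.BOUNDARY_VIOLATION
    # With no allow-list given, any action counts as in scope.
    if event.get("action") not in event.get("allowed_actions", [event.get("action")]):
        return FailureKind.ROLE_NON_COMPLIANCE
    if event.get("subgoals_completed") and not event.get("goal_met"):
        return FailureKind.WRONG_DECOMPOSITION
    return FailureKind.UNKNOWN


def route_failure(event):
    kind = classify_failure(event)
    return kind, ROUTES[kind]


print(route_failure({"action": "deploy", "allowed_actions": ["edit", "test"]}))
```

&lt;p&gt;The generic retry is still there, but only as the fallback for failures that match none of the known clusters.&lt;/p&gt;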


&lt;h2&gt;
  
  
  The Architecture in Practice
&lt;/h2&gt;
&lt;h3&gt;
  
  
  omc-live: Finite Self-Evolving Loop
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Wave 1: Strategy consultation (specialist agents, runs once)
   ↓
Wave 2: Execution loop
   ↓
Judgment: Goal achieved?
   ├── NO  → update goal tree, retry
   └── YES → Score (5 dimensions)
                ├── delta ≥ epsilon → EVOLVE goal, continue
                └── plateau × 3    → CONVERGED, stop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;When the system succeeds, it scores the output, finds the weakest dimension,&lt;br&gt;
generates an elevated goal, and continues — until quality plateaus.&lt;/p&gt;
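&lt;p&gt;The evolve-or-converge decision in the diagram can be sketched as a small control loop. This is an assumption-heavy illustration: the five dimension names, the mean as the aggregate score, &lt;code&gt;epsilon=0.05&lt;/code&gt;, and the canned score rounds all stand in for real execution and judging.&lt;/p&gt;

```python
def run_evolving_loop(score_rounds, epsilon=0.05, patience=3):
    """Walk through per-round dimension scores and decide EVOLVE vs CONVERGED.

    score_rounds: list of dicts mapping a dimension name to a score in [0, 1],
    one dict per successful round.
    """
    best = None
    plateau = 0
    log = []
    for scores in score_rounds:
        total = sum(scores.values()) / len(scores)
        delta = total if best is None else total - best
        if delta >= epsilon:
            # Real improvement: reset the plateau counter and raise the goal
            # on whichever dimension scored lowest.
            best = total
            plateau = 0
            weakest = min(scores, key=scores.get)
            log.append(f"EVOLVE: raise the bar on {weakest}")
        else:
            plateau += 1
            log.append(f"plateau {plateau}/{patience}")
            if plateau >= patience:
                log.append("CONVERGED")
                break
    return log


first = {"correctness": 0.5, "tests": 0.5, "readability": 0.5, "performance": 0.5, "docs": 0.3}
later = {"correctness": 0.6, "tests": 0.6, "readability": 0.6, "performance": 0.5, "docs": 0.6}
for line in run_evolving_loop([first, later, later, later, later]):
    print(line)
```

&lt;p&gt;In this run the loop evolves twice (first on docs, then on performance), sees three rounds without a meaningful gain, and stops.&lt;/p&gt;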
&lt;h3&gt;
  
  
  omc-live-infinite: No Iteration Cap
&lt;/h3&gt;

&lt;p&gt;New mechanisms beyond the finite version:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context rotation&lt;/strong&gt;: at 70% budget → save state → fresh session → resume&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;World model&lt;/strong&gt;: epistemic state layer that persists across rotations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Co-evolution feedback&lt;/strong&gt;: strategy outcomes feed back into Wave 1 planning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Enables agents that work on complex goals for hours, not seconds.&lt;/p&gt;
&lt;h3&gt;
  
  
  CTX: Precision Context Loading
&lt;/h3&gt;

&lt;p&gt;Query classification → retrieval strategy selection:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EXPLICIT_SYMBOL → direct lookup&lt;/li&gt;
&lt;li&gt;SEMANTIC_FUNCTIONALITY → embedding search&lt;/li&gt;
&lt;li&gt;STRUCTURAL_RELATIONSHIP → dependency graph&lt;/li&gt;
&lt;li&gt;RECENT_CHANGE → git recency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result: 5.2% average token budget, R@5=1.0. No LLM calls for retrieval.&lt;/p&gt;
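&lt;p&gt;Here is a minimal sketch of what LLM-free query routing can look like. The four labels and their strategies come from the list above; the regex heuristics and their priority order are my own guesses for illustration and should not be read as CTX's actual classifier.&lt;/p&gt;

```python
import re

STRATEGIES = {
    "EXPLICIT_SYMBOL": "direct lookup",
    "SEMANTIC_FUNCTIONALITY": "embedding search",
    "STRUCTURAL_RELATIONSHIP": "dependency graph",
    "RECENT_CHANGE": "git recency",
}

# Hypothetical keyword heuristics, checked in order; first match wins.
RULES = [
    ("RECENT_CHANGE", re.compile(r"\b(recent|recently|latest|last commit|changed)\b", re.I)),
    ("STRUCTURAL_RELATIONSHIP", re.compile(r"\b(imports?|depends? on|calls?|callers?|inherits?)\b", re.I)),
    # snake_case, a name with an inner capital letter, or a call like foo()
    ("EXPLICIT_SYMBOL", re.compile(r"\b\w+_\w+\b|\b[A-Za-z]+[a-z][A-Z]\w*\b|\w+\(\)")),
]


def classify(query):
    for label, pattern in RULES:
        if pattern.search(query):
            return label
    return "SEMANTIC_FUNCTIONALITY"  # fallback: treat as a meaning-based query


def route(query):
    label = classify(query)
    return label, STRATEGIES[label]


for q in [
    "where is parse_config defined",
    "how does ranking work",
    "which modules import the vault",
    "what changed in the retriever recently",
]:
    print(q, "->", route(q))
```

&lt;p&gt;Rules like these are crude, but they cost microseconds and need no model call, which is the property the retrieval layer depends on.&lt;/p&gt;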


&lt;h2&gt;
  
  
  Why "Harness Engineering" Is the Right Frame
&lt;/h2&gt;

&lt;p&gt;A harness doesn't constrain power — it &lt;em&gt;channels&lt;/em&gt; it.&lt;/p&gt;

&lt;p&gt;LLMs have enormous capability. Without control structure, that capability is:&lt;br&gt;
context-unaware, goal-unstable, failure-opaque, session-local.&lt;/p&gt;

&lt;p&gt;HarnessOS adds the control structure. Not to limit the model — to make it usable&lt;br&gt;
for work that spans hours, not seconds.&lt;/p&gt;


&lt;h2&gt;
  
  
  Current State &amp;amp; Quick Start
&lt;/h2&gt;

&lt;p&gt;214 tests, 100% coverage. CTX and omc-live/infinite are stable and used daily.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/jaytoone/HarnessOS
python3 analyze.py &lt;span class="nt"&gt;--run&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No pip install. No required API keys for base experiments.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/jaytoone/HarnessOS" rel="noopener noreferrer"&gt;https://github.com/jaytoone/HarnessOS&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're building autonomous agents and thinking about long-run reliability — happy to compare notes.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>opensource</category>
      <category>productivity</category>
    </item>
    <item>
      <title>HarnessOS: scaffold/middleware for infinite autonomous tasks — built on Harness Engineering</title>
      <dc:creator>Jaewon Jang</dc:creator>
      <pubDate>Wed, 01 Apr 2026 08:27:45 +0000</pubDate>
      <link>https://forem.com/jaewon_jang_d63fddcf69ac2/harnessos-scaffoldmiddleware-for-infinite-autonomous-tasks-built-on-harness-engineering-50n0</link>
      <guid>https://forem.com/jaewon_jang_d63fddcf69ac2/harnessos-scaffoldmiddleware-for-infinite-autonomous-tasks-built-on-harness-engineering-50n0</guid>
      <description>&lt;p&gt;There's a concept gaining traction in AI systems engineering: &lt;strong&gt;Harness Engineering&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Not the testing tool. The idea: raw LLM capability is like raw power — high voltage, hard to control, dangerous to run indefinitely. Harness Engineering is the discipline of building the control structures that make that power &lt;em&gt;usable at scale&lt;/em&gt;.&lt;br&gt;
Context managers. Evaluation loops. Failure classifiers. Goal trackers. Memory tiers.&lt;/p&gt;

&lt;p&gt;I think it's going to be one of the defining disciplines of serious AI systems work.&lt;br&gt;
And I've been building a platform around it.&lt;/p&gt;


&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;HarnessOS&lt;/strong&gt; is a scaffold/middleware system for running infinite autonomous tasks.&lt;/p&gt;

&lt;p&gt;The key word is &lt;em&gt;infinite&lt;/em&gt;. Not one task. Not one session. An agent that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs continuously, across context window rotations&lt;/li&gt;
&lt;li&gt;Evolves its own goals when it succeeds at the current one&lt;/li&gt;
&lt;li&gt;Persists state across sessions without losing context&lt;/li&gt;
&lt;li&gt;Classifies its own failures and routes them appropriately&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HarnessOS
├── CTX                      ← context precision layer
│   └── LLM-free retrieval, 5.2% token budget, R@5=1.0 dependency recall
├── omc-live                 ← finite outer loop
│   └── 2-Wave strategy + self-evolving goals + episode memory
├── omc-live-infinite        ← infinite outer loop
│   └── context rotation, world model, no iteration cap
├── HalluMaze                ← hallucination management (in development)
└── [future layers]
    ├── Evaluation Layer
    ├── Safety Layer
    └── Memory Tier System
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Problem with Current Agent Frameworks
&lt;/h2&gt;

&lt;p&gt;Most agent frameworks are built for tasks that complete in one session.&lt;/p&gt;

&lt;p&gt;Spin up → run → done.&lt;/p&gt;

&lt;p&gt;That's fine for demos. It breaks for real autonomous work:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Context exhaustion&lt;/strong&gt;: At ~70% context capacity, agents start losing earlier decisions. Not gracefully. They cliff-edge — sudden degradation, not gradual fade.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No goal evolution&lt;/strong&gt;: An agent that succeeds at "write tests" has no mechanism to ask "what's the next improvement?" It just stops.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Failure is terminal&lt;/strong&gt;: Most frameworks catch exceptions. Few &lt;em&gt;classify&lt;/em&gt; them — transient vs persistent vs fundamental goal mismatch.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;HarnessOS is built specifically to address all three.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Measured (The Empirical Foundation)
&lt;/h2&gt;

&lt;p&gt;Before building anything, I ran controlled experiments on questions I couldn't find good empirical answers to anywhere else.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q1: How should autonomous agents reason about problems?
&lt;/h3&gt;

&lt;p&gt;Compared &lt;strong&gt;hypothesis-driven debugging&lt;/strong&gt; (observe → hypothesize → verify) against &lt;strong&gt;engineering-only&lt;/strong&gt; (pattern match → retry) on 12 bug scenarios.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Bug type&lt;/th&gt;
&lt;th&gt;Engineering&lt;/th&gt;
&lt;th&gt;Hypothesis&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Simple&lt;/td&gt;
&lt;td&gt;1.0 attempts&lt;/td&gt;
&lt;td&gt;1.0 attempts&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Causal&lt;/td&gt;
&lt;td&gt;1.75 attempts&lt;/td&gt;
&lt;td&gt;1.0 attempts&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-43%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Assumption&lt;/td&gt;
&lt;td&gt;2.0 attempts&lt;/td&gt;
&lt;td&gt;1.0 attempts&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-50%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;First-hypothesis accuracy: &lt;strong&gt;100%&lt;/strong&gt;. This is now the default reasoning strategy in omc-live.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q2: Where do context limits actually hit?
&lt;/h3&gt;

&lt;p&gt;Measured Lost-in-the-Middle across 1K/10K/50K/100K token contexts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key finding: degradation is threshold-based, not gradual.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Agents don't slowly forget. They cliff-edge at a specific token length and fail silently.&lt;br&gt;
This changed how &lt;code&gt;omc-live-infinite&lt;/code&gt; handles context — it monitors budget and triggers a safe rotation handoff at 70%, before the cliff.&lt;/p&gt;
&lt;h3&gt;
  
  
  Q3: Where do autonomous agents actually fail?
&lt;/h3&gt;

&lt;p&gt;I ran OpenHands on 20-step coding tasks. The failures clustered into three categories:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Wrong task decomposition (incorrect sub-goals from the start)&lt;/li&gt;
&lt;li&gt;Role non-compliance (agent exceeds defined scope)&lt;/li&gt;
&lt;li&gt;Boundary violations (unexpected state mutations)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Predictable = preventable. The omc-failure-router classifies failures into these categories and routes them appropriately instead of falling back to a generic retry.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Architecture in Practice
&lt;/h2&gt;
&lt;h3&gt;
  
  
  omc-live: Finite Self-Evolving Loop
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Wave 1: Strategy consultation (specialist agents, runs once)
   ↓
Wave 2: Execution loop
   ↓
Judgment: Goal achieved?
   ├── NO  → update goal tree, retry
   └── YES → Score (5 dimensions)
                ├── delta ≥ epsilon → EVOLVE goal, continue
                └── plateau × 3    → CONVERGED, stop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;When the system succeeds, it scores the output, finds the weakest dimension, generates an elevated goal, and continues — until quality plateaus.&lt;/p&gt;
&lt;h3&gt;
  
  
  omc-live-infinite: No Iteration Cap
&lt;/h3&gt;

&lt;p&gt;New mechanisms beyond the finite version:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context rotation&lt;/strong&gt;: at 70% budget → save state → fresh session → resume&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;World model&lt;/strong&gt;: epistemic state layer that persists across rotations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Co-evolution feedback&lt;/strong&gt;: strategy outcomes feed back into Wave 1 planning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Enables agents that work on complex goals for hours, not seconds.&lt;/p&gt;
&lt;h3&gt;
  
  
  CTX: Precision Context Loading
&lt;/h3&gt;

&lt;p&gt;Query classification → retrieval strategy selection:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EXPLICIT_SYMBOL → direct lookup&lt;/li&gt;
&lt;li&gt;SEMANTIC_FUNCTIONALITY → embedding search&lt;/li&gt;
&lt;li&gt;STRUCTURAL_RELATIONSHIP → dependency graph&lt;/li&gt;
&lt;li&gt;RECENT_CHANGE → git recency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result: 5.2% average token budget, R@5=1.0. No LLM calls for retrieval.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why "Harness Engineering" Is the Right Frame
&lt;/h2&gt;

&lt;p&gt;A harness doesn't constrain power — it &lt;em&gt;channels&lt;/em&gt; it.&lt;/p&gt;

&lt;p&gt;LLMs have enormous capability. Without control structure, that capability is: context-unaware, goal-unstable, failure-opaque, session-local.&lt;/p&gt;

&lt;p&gt;HarnessOS adds the control structure. Not to limit the model — to make it usable for work that spans hours, not seconds.&lt;/p&gt;


&lt;h2&gt;
  
  
  Current State &amp;amp; Quick Start
&lt;/h2&gt;

&lt;p&gt;214 tests, 100% coverage. CTX and omc-live/infinite are stable and used daily.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/jaytoone/HarnessOS
python3 analyze.py &lt;span class="nt"&gt;--run&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No pip install. No required API keys for base experiments.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/jaytoone/HarnessOS" rel="noopener noreferrer"&gt;https://github.com/jaytoone/HarnessOS&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're building autonomous agents and thinking about long-run reliability — happy to compare notes.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>opensource</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
