<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ahmed Fayyaz</title>
    <description>The latest articles on Forem by Ahmed Fayyaz (@ahmedbutt2015).</description>
    <link>https://forem.com/ahmedbutt2015</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3898184%2Fc9cfaae3-f959-446e-90b5-af3281e8cb1c.jpeg</url>
      <title>Forem: Ahmed Fayyaz</title>
      <link>https://forem.com/ahmedbutt2015</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ahmedbutt2015"/>
    <language>en</language>
    <item>
      <title>Spine v1: Stop Making Claude Rediscover Your Codebase Every Time You Open a Repo</title>
      <dc:creator>Ahmed Fayyaz</dc:creator>
      <pubDate>Sat, 02 May 2026 23:53:36 +0000</pubDate>
      <link>https://forem.com/ahmedbutt2015/spine-v1-stop-making-claude-rediscover-your-codebase-every-time-you-open-a-repo-3kgh</link>
      <guid>https://forem.com/ahmedbutt2015/spine-v1-stop-making-claude-rediscover-your-codebase-every-time-you-open-a-repo-3kgh</guid>
      <description>&lt;p&gt;You know this feeling:&lt;/p&gt;

&lt;p&gt;You open a new repository, read the README, click five files, open the wrong folder, find a second entry point, and ten minutes later you still cannot answer a basic question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What should I read first if I want the real shape of this codebase?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That is the moment &lt;code&gt;spine&lt;/code&gt; is built for.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;spine&lt;/code&gt; is a small onboarding tool that scans a repository, finds a verified architecture spine, and turns it into something a developer can actually use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a compact architecture map&lt;/li&gt;
&lt;li&gt;a prioritized reading order&lt;/li&gt;
&lt;li&gt;a short mental model&lt;/li&gt;
&lt;li&gt;subsystem summaries&lt;/li&gt;
&lt;li&gt;a few gotchas&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The important part is not that it looks smart. The important part is that it stays grounded.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;spine&lt;/code&gt; only draws architecture edges it can verify from source.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/ahmedbutt2015/spine" rel="noopener noreferrer"&gt;https://github.com/ahmedbutt2015/spine&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Why Claude Code users should care
&lt;/h2&gt;

&lt;p&gt;If you are already living inside Claude Code, &lt;code&gt;spine&lt;/code&gt; is not just a documentation tool. It is a context tool.&lt;/p&gt;

&lt;p&gt;The first run does two useful things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;it generates &lt;code&gt;ONBOARDING.md&lt;/code&gt; for a human-readable tour&lt;/li&gt;
&lt;li&gt;it writes &lt;code&gt;.claude/REPO_CONTEXT.md&lt;/code&gt;, a compact repo snapshot for future Claude sessions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That second file matters a lot.&lt;/p&gt;

&lt;p&gt;Instead of re-explaining the repo every time, you get a persisted summary of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;project shape&lt;/li&gt;
&lt;li&gt;primary language&lt;/li&gt;
&lt;li&gt;verified architecture spine&lt;/li&gt;
&lt;li&gt;subsystem boundaries&lt;/li&gt;
&lt;li&gt;key entry points&lt;/li&gt;
&lt;/ul&gt;
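
&lt;p&gt;To make that concrete, a snapshot along these lines is what you can expect (a hypothetical excerpt for illustration, not &lt;code&gt;spine&lt;/code&gt;'s exact output format):&lt;/p&gt;

```markdown
# Repo context (generated by spine)

- Shape: single-package Node.js library
- Primary language: JavaScript
- Entry points: index.js, lib/axios.js

## Verified spine

index.js -> lib/axios.js -> lib/core/Axios.js

## Subsystems

- lib/core: request lifecycle
- lib/adapters: transport selection
```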

&lt;p&gt;So the value is not only "help me understand the codebase once."&lt;/p&gt;

&lt;p&gt;It is also:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;help future Claude sessions start from a grounded snapshot instead of burning tokens rediscovering the same repo shape.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That does not mean Claude magically knows every line of the application forever. It means &lt;code&gt;spine&lt;/code&gt; leaves behind a small verified context file that is useful to refresh and reuse as the repo evolves.&lt;/p&gt;
&lt;h2&gt;
  
  
  It is designed to feel instant inside Claude Code
&lt;/h2&gt;

&lt;p&gt;The repo already ships with Claude Code skill definitions for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;/map&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/onboard&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That means the product story is simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run &lt;code&gt;/map&lt;/code&gt; when you want a fast, token-light architecture preview.&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;/onboard&lt;/code&gt; when you want the full guided tour.&lt;/li&gt;
&lt;li&gt;Let &lt;code&gt;spine&lt;/code&gt; refresh &lt;code&gt;.claude/REPO_CONTEXT.md&lt;/code&gt; so later Claude sessions start from a much better baseline.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For teams using Claude Code heavily, this is one of the most interesting parts of the product.&lt;/p&gt;

&lt;p&gt;You are not just generating a document. You are creating a reusable repo memory.&lt;/p&gt;
&lt;h2&gt;
  
  
  The problem with most onboarding docs
&lt;/h2&gt;

&lt;p&gt;Most repo onboarding today breaks in one of three ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;it is too broad and turns into a wall of context&lt;/li&gt;
&lt;li&gt;it is stale and no longer matches the code&lt;/li&gt;
&lt;li&gt;it sounds confident while quietly guessing the architecture&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last one is the worst.&lt;/p&gt;

&lt;p&gt;If a diagram is wrong, it does not just waste time. It gives you the wrong mental model, and you end up reading the code in the wrong order.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;spine&lt;/code&gt; takes a narrower bet:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;smaller, verified, and useful beats bigger, prettier, and guessed.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  What spine actually gives you
&lt;/h2&gt;

&lt;p&gt;There are two modes.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;code&gt;/map&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Use this when you want the fastest answer possible.&lt;/p&gt;

&lt;p&gt;It gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a validated Mermaid diagram&lt;/li&gt;
&lt;li&gt;a &lt;code&gt;mermaid.live&lt;/code&gt; link&lt;/li&gt;
&lt;li&gt;no synthesis step&lt;/li&gt;
&lt;li&gt;no &lt;code&gt;ONBOARDING.md&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the "show me the real backbone first" mode.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;code&gt;/onboard&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Use this when you want the full tour.&lt;/p&gt;

&lt;p&gt;It writes an onboarding document with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;TL;DR&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Architecture map&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Mental model&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Reading order&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Entry points found&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Subsystems&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Gotchas&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the "I just joined this codebase, help me become dangerous fast" mode.&lt;/p&gt;
&lt;h2&gt;
  
  
  Try it in 3 minutes
&lt;/h2&gt;

&lt;p&gt;If you want the best first impression, use &lt;code&gt;axios&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It makes a strong first demo because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;most developers already know what it is&lt;/li&gt;
&lt;li&gt;the repo is real but not huge&lt;/li&gt;
&lt;li&gt;the request flow is meaningful&lt;/li&gt;
&lt;li&gt;the output is easy to judge with your own eyes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @spine-io/onboard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Clone a benchmark:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/axios/axios.git
&lt;span class="nb"&gt;cd &lt;/span&gt;axios
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start with the map:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;onboard &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--map-only&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then generate the full guide:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;onboard &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the core experience.&lt;/p&gt;

&lt;p&gt;If you are already in Claude Code, the equivalent mental model is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/map      -&amp;gt; show me the verified backbone
/onboard  -&amp;gt; write the full guide and refresh repo context
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What makes the first run feel good
&lt;/h2&gt;

&lt;p&gt;The first useful moment is not a giant report. It is a reduction in ambiguity.&lt;/p&gt;

&lt;p&gt;When &lt;code&gt;spine&lt;/code&gt; works well, you feel three things quickly:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You can see the real entry path instead of guessing it.&lt;/li&gt;
&lt;li&gt;You know which files matter first.&lt;/li&gt;
&lt;li&gt;You can ignore the rest for now without feeling lost.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That is a very different experience from skimming a file tree and hoping your intuition is right.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why axios is such a good demo
&lt;/h2&gt;

&lt;p&gt;Before &lt;code&gt;spine&lt;/code&gt;, a developer usually bounces around files like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;README.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;package.json&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;index.js&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;lib/axios.js&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;lib/core/Axios.js&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not terrible, but it is not guided either. You are still doing the work of building the map in your head.&lt;/p&gt;

&lt;p&gt;After &lt;code&gt;spine&lt;/code&gt;, the repo gets compressed into a much more actionable shape:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a small verified graph&lt;/li&gt;
&lt;li&gt;a short list of files to read first&lt;/li&gt;
&lt;li&gt;a sentence or two that gives you the right mental frame&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the whole product idea in one repo.&lt;/p&gt;

&lt;p&gt;If someone asks what &lt;code&gt;spine&lt;/code&gt; does, the shortest honest answer is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It shows you where to start in a codebase, and it only draws the edges it can prove.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The part I care about most
&lt;/h2&gt;

&lt;p&gt;I did not want to build another tool that generates a beautiful but suspicious diagram.&lt;/p&gt;

&lt;p&gt;So the rule is simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;if an edge can be verified, keep it&lt;/li&gt;
&lt;li&gt;if it cannot be verified, drop it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That means the diagram is sometimes smaller than what a human might infer.&lt;/p&gt;

&lt;p&gt;Good.&lt;/p&gt;

&lt;p&gt;Smaller and true is better than bigger and imagined.&lt;/p&gt;
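
&lt;p&gt;The rule itself is easy to illustrate (a minimal sketch of the principle over a file-level import graph, assuming a hypothetical &lt;code&gt;verifiedEdges&lt;/code&gt; helper; this is not &lt;code&gt;spine&lt;/code&gt;'s actual implementation):&lt;/p&gt;

```typescript
import { existsSync } from "node:fs";
import { dirname, resolve } from "node:path";

// A candidate edge: one file appears to import another.
interface Edge { from: string; to: string; }

// Keep an edge only when the import target resolves to a real file
// on disk. Anything that cannot be verified is dropped, which is why
// the resulting graph is sometimes smaller than a human would infer.
function verifiedEdges(candidates: Edge[]): Edge[] {
  return candidates.filter((edge) => {
    const target = resolve(dirname(edge.from), edge.to);
    // Try the literal path plus common source extensions.
    const tries = [target, target + ".js", target + ".ts"];
    return tries.some((p) => existsSync(p));
  });
}
```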

&lt;h2&gt;
  
  
  Why this can save tokens over time
&lt;/h2&gt;

&lt;p&gt;There are really two savings stories here.&lt;/p&gt;

&lt;p&gt;The first is human:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;less random file clicking&lt;/li&gt;
&lt;li&gt;less repeated explanation across teammates&lt;/li&gt;
&lt;li&gt;less time building the wrong mental model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The second is model-side:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;spine&lt;/code&gt; writes &lt;code&gt;.claude/REPO_CONTEXT.md&lt;/code&gt; by default&lt;/li&gt;
&lt;li&gt;the Anthropic synthesis path supports prompt caching on the structured context block&lt;/li&gt;
&lt;li&gt;the CLI surfaces estimated cost and actual cost when usage metadata is available&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So if you keep re-running onboarding in the same repo, or keep revisiting the same codebase with Claude, the product is trying to make those later passes cheaper and more grounded than the first one.&lt;/p&gt;
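
&lt;p&gt;If you drive the synthesis yourself, reusing the snapshot as a cacheable system block looks roughly like this (a sketch assuming the Anthropic Messages API's &lt;code&gt;cache_control&lt;/code&gt; blocks; the model name and the &lt;code&gt;buildParams&lt;/code&gt; helper are illustrative, check the SDK docs for current parameters):&lt;/p&gt;

```typescript
// Build a Messages API request that sends the persisted repo snapshot
// as a cacheable system block, so repeat sessions over the same repo
// do not pay full input price for the same context.
function buildParams(repoContext: string, question: string) {
  return {
    model: "claude-sonnet-4-5", // illustrative model name
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: repoContext,
        // Marks the snapshot for prompt caching.
        cache_control: { type: "ephemeral" },
      },
    ],
    messages: [{ role: "user", content: question }],
  };
}
```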

&lt;h2&gt;
  
  
  Claude Code is a natural fit
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;spine&lt;/code&gt; is also shaped to work well with Claude Code.&lt;/p&gt;

&lt;p&gt;The flow is simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;run &lt;code&gt;/map&lt;/code&gt; when you want a fast architecture preview&lt;/li&gt;
&lt;li&gt;run &lt;code&gt;/onboard&lt;/code&gt; when you want the full guide&lt;/li&gt;
&lt;li&gt;let &lt;code&gt;.claude/REPO_CONTEXT.md&lt;/code&gt; carry verified context into later conversations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So instead of starting every coding session from scratch, you can start from a verified repo summary.&lt;/p&gt;

&lt;p&gt;That is probably the most compelling long-run use case:&lt;/p&gt;

&lt;p&gt;not "generate one nice markdown file,"&lt;/p&gt;

&lt;p&gt;but "leave behind a compact, refreshable repo context file that helps Claude spend less time relearning the same app."&lt;/p&gt;

&lt;h2&gt;
  
  
  A fun way to try it
&lt;/h2&gt;

&lt;p&gt;If you want to make this feel more interactive, try this little challenge:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open a repo you know only vaguely.&lt;/li&gt;
&lt;li&gt;Before running &lt;code&gt;spine&lt;/code&gt;, write down which file you think is the true entry point.&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;onboard . --map-only&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;See whether the verified spine agrees with your guess.&lt;/li&gt;
&lt;li&gt;Then run &lt;code&gt;onboard .&lt;/code&gt; and compare your reading order with the one it generates.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That turns the tool into a quick test of your own codebase intuition, which is honestly a fun way to experience it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to use it next
&lt;/h2&gt;

&lt;p&gt;After &lt;code&gt;axios&lt;/code&gt;, I would try these:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;glow&lt;/code&gt; for a clean Go CLI example&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;poetry&lt;/code&gt; for a larger Python repo&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;log&lt;/code&gt; for a compact Rust library&lt;/li&gt;
&lt;li&gt;your own codebase right before onboarding a teammate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most interesting use case is not a benchmark repo.&lt;/p&gt;

&lt;p&gt;It is the repository where your team keeps saying, "someone should really document how this thing is wired."&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;The cost of a strange codebase is not just time. It is hesitation.&lt;/p&gt;

&lt;p&gt;You do not know what to trust yet.&lt;br&gt;
You do not know what is core versus incidental.&lt;br&gt;
You do not know whether the architecture in your head is real.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;spine&lt;/code&gt; is an attempt to reduce that hesitation.&lt;/p&gt;

&lt;p&gt;Not by telling you everything.&lt;/p&gt;

&lt;p&gt;By helping you start in the right place.&lt;/p&gt;

&lt;p&gt;That is v1.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>claude</category>
      <category>programming</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>We Wrapped an Open-Source Agent in GraphOS and Turned the Debugging Session Into a Story</title>
      <dc:creator>Ahmed Fayyaz</dc:creator>
      <pubDate>Sun, 26 Apr 2026 02:06:44 +0000</pubDate>
      <link>https://forem.com/ahmedbutt2015/we-wrapped-an-open-source-agent-in-graphos-and-turned-the-debugging-session-into-a-story-4de4</link>
      <guid>https://forem.com/ahmedbutt2015/we-wrapped-an-open-source-agent-in-graphos-and-turned-the-debugging-session-into-a-story-4de4</guid>
      <description>&lt;p&gt;
  &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fahmedbutt2015%2Fgraphos%2Fmain%2Fassets%2Flogo-v2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fahmedbutt2015%2Fgraphos%2Fmain%2Fassets%2Flogo-v2.png" alt="GraphOS" width="800" height="189"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  A story-driven, hands-on report about taking an existing open-source LangGraph.js agent, wrapping it in GraphOS, and learning what observability actually feels like when an agent goes sideways.
&lt;/h2&gt;

&lt;p&gt;There is a moment every agent project eventually reaches.&lt;/p&gt;

&lt;p&gt;The demo works. The graph looks clean. The tools are wired up. The prompt feels smart.&lt;/p&gt;

&lt;p&gt;And then one run goes sideways.&lt;/p&gt;

&lt;p&gt;Not in a dramatic, movie-scene way. In the real way.&lt;/p&gt;

&lt;p&gt;The assistant calls the same tool again. Then again. The state grows. The trace gets noisier. The budget keeps moving in one direction. And the hardest part is not even the cost. It is the feeling that you can no longer see the system clearly.&lt;/p&gt;

&lt;p&gt;That is the moment GraphOS was built for.&lt;/p&gt;

&lt;p&gt;This post is not just a feature announcement. It is a field report. We took a real open-source project, wrapped it with GraphOS, ran the integration end to end, and used that exercise to answer one question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can an existing LangGraph.js agent — one we did not write — be made easier to observe, safer to run, and easier to explain to other developers?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The short answer is yes.&lt;/p&gt;

&lt;p&gt;The better answer is the story below.&lt;/p&gt;

&lt;p&gt;
  
    &lt;a href="https://raw.githubusercontent.com/ahmedbutt2015/graphos/main/assets/hero.mp4" rel="noopener noreferrer"&gt;▶ Watch GraphOS catch a runaway agent loop (12s)&lt;/a&gt;
  
&lt;/p&gt;

&lt;h2&gt;
  
  
  Before and after, in one breath
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Before&lt;/strong&gt; wrapping the agent in GraphOS:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every run was a black box. The first signal that something was wrong was the OpenAI bill or a stuck UI.&lt;/li&gt;
&lt;li&gt;Loops only surfaced as "this got slow" or "this never finished."&lt;/li&gt;
&lt;li&gt;Debugging meant reading log files after the fact and reconstructing a sequence the system had not preserved.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;After&lt;/strong&gt; wrapping the agent in GraphOS:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every step of the graph was observable in real time.&lt;/li&gt;
&lt;li&gt;A loop was caught at visit 7, not visit 700.&lt;/li&gt;
&lt;li&gt;A budget ceiling halted a misbehaving run before the credit card did.&lt;/li&gt;
&lt;li&gt;Every past session became time-travelable in a local SQLite-backed dashboard, no SaaS in the loop.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Same agent. Same model. Different blast radius.&lt;/p&gt;

&lt;h2&gt;
  
  
  The benchmark we chose
&lt;/h2&gt;

&lt;p&gt;Instead of inventing a toy example, we used an existing open-source benchmark:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;agents-from-scratch-ts&lt;/code&gt;: &lt;a href="https://github.com/langchain-ai/agents-from-scratch-ts" rel="noopener noreferrer"&gt;https://github.com/langchain-ai/agents-from-scratch-ts&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Inside this repository, that benchmark lives at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;benchmarks/agents-from-scratch-ts&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is a strong test case because it is not a toy. It already contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A working email assistant&lt;/li&gt;
&lt;li&gt;A Human-in-the-Loop flow&lt;/li&gt;
&lt;li&gt;A memory-enabled variant&lt;/li&gt;
&lt;li&gt;Jest-based test suites&lt;/li&gt;
&lt;li&gt;LangGraph wiring that looks like real application code, not a demo built for a tool launch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That makes it the right kind of pressure test for GraphOS. We did not want to ship a wrapper that only works on our own handcrafted demo. We wanted something that survives contact with someone else's architecture, state shape, tool conventions, and tests.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;If you are reading this as a builder, imagine two options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Build a brand-new sample agent designed to make the tool look good.&lt;/li&gt;
&lt;li&gt;Take someone else's open-source agent and prove the tool against that.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We picked option 2.&lt;/p&gt;

&lt;p&gt;That choice changes the tone of the work. Now the question is not "can GraphOS run a demo we wrote?" It becomes "can GraphOS survive contact with someone else's code?"&lt;/p&gt;

&lt;p&gt;That is a much better story to tell — and a much better thing to ship.&lt;/p&gt;

&lt;h2&gt;
  
  
  What GraphOS adds to a graph
&lt;/h2&gt;

&lt;p&gt;GraphOS is an observability and policy layer for LangGraph.js agents.&lt;/p&gt;

&lt;p&gt;At a high level, it gives you three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A wrapper around any compiled graph&lt;/li&gt;
&lt;li&gt;Composable policies (&lt;code&gt;LoopGuard&lt;/code&gt;, &lt;code&gt;BudgetGuard&lt;/code&gt;, more)&lt;/li&gt;
&lt;li&gt;A local dashboard that shows what the agent did, step by step, and lets you scrub through past runs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
  &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fahmedbutt2015%2Fgraphos%2Fmain%2Fassets%2Farchitecture-v2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fahmedbutt2015%2Fgraphos%2Fmain%2Fassets%2Farchitecture-v2.png" alt="GraphOS architecture: your code → @graphos-io/sdk → @graphos-io/dashboard, with SQLite persistence" width="800" height="400"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;The integration stays intentionally small:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;GraphOS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;LoopGuard&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;BudgetGuard&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;tokenCost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;createWebSocketTransport&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@graphos-io/sdk&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;managed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;GraphOS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wrap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;myCompiledGraph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;my-agent&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;LoopGuard&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;node&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;maxRepeats&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;BudgetGuard&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;usdLimit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;tokenCost&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;onTrace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;createWebSocketTransport&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the promise. But promises are cheap.&lt;/p&gt;

&lt;p&gt;So we tested it against the benchmark.&lt;/p&gt;

&lt;h2&gt;
  
  
  How we brought GraphOS into the benchmark
&lt;/h2&gt;

&lt;p&gt;There are two installation stories worth separating, because people often confuse "how we developed it inside the monorepo" with "how you should use it in your own codebase."&lt;/p&gt;

&lt;h3&gt;
  
  
  Story 1: how we integrated it inside this monorepo
&lt;/h3&gt;

&lt;p&gt;Because the benchmark is checked into this repository, the local integration uses the built SDK directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;GraphOS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;LoopGuard&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;BudgetGuard&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;tokenCost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;createWebSocketTransport&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;PolicyViolationError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;../../packages/sdk/dist/index.js&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That exact integration lives in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;benchmarks/agents-from-scratch-ts/graphos-wrap.ts&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is useful for development because it lets us iterate on GraphOS and immediately retest it against the benchmark without publishing a new package every time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Story 2: how you would install it in any outside project
&lt;/h3&gt;

&lt;p&gt;If you are doing this in your own LangGraph.js project, the install is the simple part:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; @graphos-io/sdk
&lt;span class="c"&gt;# or&lt;/span&gt;
pnpm add @graphos-io/sdk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then replace the local import with the published package import:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;GraphOS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;LoopGuard&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;BudgetGuard&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;tokenCost&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;createWebSocketTransport&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;PolicyViolationError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@graphos-io/sdk&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same code, same wrapper. The only thing that changes is where the SDK comes from.&lt;/p&gt;

&lt;h2&gt;
  
  
  The wrapper we added
&lt;/h2&gt;

&lt;p&gt;Here is the part of the benchmark integration that mattered:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;managed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;GraphOS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wrap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;agents-from-scratch&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;LoopGuard&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;node&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;maxRepeats&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;BudgetGuard&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;usdLimit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;tokenCost&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;fallback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;onTrace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;createWebSocketTransport&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three details deserve a beat each.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. We used &lt;code&gt;LoopGuard&lt;/code&gt; in &lt;code&gt;node&lt;/code&gt; mode
&lt;/h3&gt;

&lt;p&gt;This benchmark is exactly why &lt;code&gt;node&lt;/code&gt; mode exists.&lt;/p&gt;

&lt;p&gt;In many real agents, the state changes every iteration because the &lt;code&gt;messages&lt;/code&gt; array keeps growing. That means pure state-equality is not enough to detect a loop. The graph may be functionally stuck even though the raw state object is technically different each turn.&lt;/p&gt;

&lt;p&gt;So instead of asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Did we revisit the exact same state?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;we ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Did we keep revisiting the same node too many times?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the more practical safety rule for agents that keep appending messages as they reason.&lt;/p&gt;
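<p>To make that concrete, here is a minimal sketch of the node-counting idea — illustrative only, not the SDK's actual internals (the <code>NodeLoopCounter</code> name is hypothetical):</p>

```typescript
// Illustrative sketch of node-mode loop detection: count visits per node
// and halt once any node exceeds maxRepeats. Not the real LoopGuard code.
type Verdict = { halt: boolean; reason?: string };

class NodeLoopCounter {
  private visits = new Map<string, number>();
  constructor(private maxRepeats: number) {}

  // Called once per executed node; returns a halt verdict.
  record(node: string): Verdict {
    const n = (this.visits.get(node) ?? 0) + 1;
    this.visits.set(node, n);
    if (n > this.maxRepeats) {
      return { halt: true, reason: `node "${node}" visited ${n} times (max ${this.maxRepeats})` };
    }
    return { halt: false };
  }
}

// A graph stuck cycling llm_call <-> environment trips on the 7th visit
// when maxRepeats is 6 — matching the benchmark behavior described below.
const guard = new NodeLoopCounter(6);
let verdict: Verdict = { halt: false };
let llmCallVisits = 0;
while (!verdict.halt) {
  verdict = guard.record("llm_call");
  llmCallVisits++;
  if (!verdict.halt) verdict = guard.record("environment");
}
console.log(llmCallVisits); // 7
```

Note that the counter never inspects the state object at all — which is exactly why it keeps working while <code>messages</code> grows.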

&lt;h3&gt;
  
  
  2. We set a budget ceiling
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;BudgetGuard&lt;/code&gt; lets us cap cumulative spend per session.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;BudgetGuard&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;usdLimit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;tokenCost&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;fallback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;tokenCost()&lt;/code&gt; is a drop-in cost extractor that walks the state for LangChain messages, pulls usage from &lt;code&gt;usage_metadata&lt;/code&gt; / &lt;code&gt;response_metadata.usage&lt;/code&gt; / &lt;code&gt;tokenUsage&lt;/code&gt;, and applies a built-in OpenAI + Anthropic price table. For unknown models you can pass a &lt;code&gt;fallback&lt;/code&gt; (flat USD per step or a custom price entry).&lt;/p&gt;
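<p>If the built-in price table does not cover your model, the underlying idea is easy to hand-roll. Here is a hypothetical flat-rate extractor over LangChain-style messages — the names and shape are illustrative, not the <code>tokenCost()</code> implementation:</p>

```typescript
// Illustrative flat-rate cost extractor; mirrors the idea behind tokenCost()
// (sum token usage from messages, apply a price), not its actual code.
type UsageMsg = { usage_metadata?: { input_tokens: number; output_tokens: number } };

function flatRateCost(messages: UsageMsg[], usdPerMillionTokens = 0.5): number {
  let tokens = 0;
  for (const m of messages) {
    if (m.usage_metadata) {
      tokens += m.usage_metadata.input_tokens + m.usage_metadata.output_tokens;
    }
  }
  return (tokens / 1_000_000) * usdPerMillionTokens;
}

const spend = flatRateCost([
  { usage_metadata: { input_tokens: 1200, output_tokens: 800 } },
  { usage_metadata: { input_tokens: 3000, output_tokens: 1000 } },
]);
console.log(spend); // ≈ 0.003
```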

&lt;p&gt;This is not just observability anymore. The run has a real boundary.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. We streamed telemetry to the local dashboard
&lt;/h3&gt;

&lt;p&gt;This line is small:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;onTrace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;createWebSocketTransport&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But it changes the experience completely. Instead of waiting for the final output and guessing what happened, you watch the run unfold — node by node — in the GraphOS dashboard.&lt;/p&gt;

&lt;h2&gt;
  
  
  The small but clever trick: a mock key path
&lt;/h2&gt;

&lt;p&gt;One of the nicest touches in the integration is that &lt;code&gt;graphos-wrap.ts&lt;/code&gt; checks whether &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; starts with &lt;code&gt;sk-mock&lt;/code&gt;. If it does, it installs a &lt;code&gt;fetch&lt;/code&gt; interceptor and simulates the OpenAI responses.&lt;/p&gt;

&lt;p&gt;Why is that useful?&lt;/p&gt;

&lt;p&gt;Because it gives us a reproducible benchmark run that is intentionally shaped to trigger the loop path. In the mock flow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The triage step routes the email into the response subgraph&lt;/li&gt;
&lt;li&gt;The agent keeps requesting &lt;code&gt;schedule_meeting&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The graph cycles through &lt;code&gt;llm_call ↔ environment&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;LoopGuard&lt;/code&gt; halts at visit 7 with a clean policy reason&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the kind of test harness you want when you are building safety infrastructure. You do not want to rely on "hopefully the model misbehaves today." You want a deterministic failure mode you can use on purpose.&lt;/p&gt;
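<p>The pattern itself is small enough to sketch. This is a hypothetical reconstruction of the idea — not the actual <code>graphos-wrap.ts</code> code: check the key prefix, and if it looks like a mock, replace <code>globalThis.fetch</code> with a stub that returns canned responses:</p>

```typescript
// Hypothetical sketch of the sk-mock interception idea, not the real wrapper.
// Requires Node 18+, where fetch and Response are globals.
function installMockFetch(apiKey: string): boolean {
  if (!apiKey.startsWith("sk-mock")) return false; // real key: leave fetch alone

  globalThis.fetch = (async () => {
    // Canned chat-completion-shaped body; the real harness shapes these
    // responses to deliberately drive the llm_call <-> environment loop.
    const body = { choices: [{ message: { role: "assistant", content: "mocked" } }] };
    return new Response(JSON.stringify(body), {
      status: 200,
      headers: { "content-type": "application/json" },
    });
  }) as typeof fetch;
  return true;
}

const mocked = installMockFetch("sk-mock-benchmark");
console.log(mocked); // true — every outbound call now hits the stub
```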

&lt;h2&gt;
  
  
  Reproduce the setup
&lt;/h2&gt;

&lt;p&gt;If you want to walk through this yourself, the full path is short.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Install the workspace
&lt;/h3&gt;

&lt;p&gt;From the repository root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pnpm &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Build the SDK
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pnpm &lt;span class="nt"&gt;--filter&lt;/span&gt; @graphos-io/sdk build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Move into the benchmark
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;benchmarks/agents-from-scratch-ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Use the benchmark normally
&lt;/h3&gt;

&lt;p&gt;The upstream benchmark documents its own workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pnpm agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It expects a &lt;code&gt;.env&lt;/code&gt; file with your API key if you want real model calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OPENAI_API_KEY=your_api_key_here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. Run GraphOS alongside it
&lt;/h3&gt;

&lt;p&gt;In another terminal, start the dashboard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @graphos-io/dashboard graphos dashboard
&lt;span class="c"&gt;# open http://localhost:4000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run the wrapped benchmark entrypoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-mock pnpm &lt;span class="nb"&gt;exec &lt;/span&gt;tsx graphos-wrap.ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To run against a real provider instead of the mock path, swap in a real key and keep the same wrapper.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we actually verified
&lt;/h2&gt;

&lt;p&gt;This is where the story becomes more than marketing. We did not just wrap the benchmark and eyeball the result.&lt;/p&gt;

&lt;h3&gt;
  
  
  GraphOS SDK verification
&lt;/h3&gt;

&lt;p&gt;From the repo root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pnpm &lt;span class="nt"&gt;--filter&lt;/span&gt; @graphos-io/sdk &lt;span class="nb"&gt;test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All SDK tests pass. Coverage spans:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;LoopGuard&lt;/code&gt; — both &lt;code&gt;state&lt;/code&gt; and &lt;code&gt;node&lt;/code&gt; modes&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;BudgetGuard&lt;/code&gt; — cumulative cost ceiling&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tokenCost()&lt;/code&gt; — multiple LangChain message shapes, multiple price-table lookups&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GraphOS.wrap()&lt;/code&gt; — session lifecycle, error handling, sessionId continuity, listener-throw resilience&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Benchmark verification
&lt;/h3&gt;

&lt;p&gt;From &lt;code&gt;benchmarks/agents-from-scratch-ts&lt;/code&gt;, the benchmark's own Jest suites still pass:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pnpm &lt;span class="nb"&gt;test&lt;/span&gt;:base
pnpm &lt;span class="nb"&gt;test&lt;/span&gt;:hitl
pnpm &lt;span class="nb"&gt;test&lt;/span&gt;:memory
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That matters because it tells us something subtle but important:&lt;/p&gt;

&lt;p&gt;GraphOS was developed alongside the benchmark without breaking the benchmark's behavior. We are not telling a story about observability by quietly degrading the agent underneath it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the benchmark actually exercises
&lt;/h2&gt;

&lt;p&gt;This part is worth slowing down for. The benchmark is not one narrow happy path. It exercises:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Response quality&lt;/li&gt;
&lt;li&gt;Expected tool calls&lt;/li&gt;
&lt;li&gt;Human acceptance flow&lt;/li&gt;
&lt;li&gt;Human edit flow&lt;/li&gt;
&lt;li&gt;Human rejection with feedback&lt;/li&gt;
&lt;li&gt;Memory persistence across later runs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So when we say we used &lt;code&gt;agents-from-scratch-ts&lt;/code&gt;, we mean we used a compact but meaningful open-source application with real behavioral coverage — not "we ran one prompt once."&lt;/p&gt;

&lt;h2&gt;
  
  
  The human lesson
&lt;/h2&gt;

&lt;p&gt;The benchmark is an email assistant, but the lesson is bigger than email.&lt;/p&gt;

&lt;p&gt;Every agent team eventually needs answers to these questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What node did we get stuck in?&lt;/li&gt;
&lt;li&gt;How many times did we visit it?&lt;/li&gt;
&lt;li&gt;What tool calls were made before failure?&lt;/li&gt;
&lt;li&gt;What did the state look like at that moment?&lt;/li&gt;
&lt;li&gt;Was the run expensive because it was useful, or expensive because it was looping?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without observability, those questions become archaeology.&lt;/p&gt;

&lt;p&gt;With GraphOS, they become part of the normal debugging workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  A quick interactive moment
&lt;/h2&gt;

&lt;p&gt;Imagine you are looking at a run and you see the same node lighting up over and over. Which of these do you want next?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A bigger console log&lt;/li&gt;
&lt;li&gt;The final model output only&lt;/li&gt;
&lt;li&gt;A live graph, a session timeline, and a policy halt reason that says exactly which guard fired and why&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That is the difference this project is trying to create. Not more noise. Better visibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this story is stronger than a basic product post
&lt;/h2&gt;

&lt;p&gt;The first version of a launch blog usually says:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Here is what we built&lt;/li&gt;
&lt;li&gt;Here is the API&lt;/li&gt;
&lt;li&gt;Here is why it is useful&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is fine, but it is mostly a claim.&lt;/p&gt;

&lt;p&gt;This story is better because it shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The open-source project we used&lt;/li&gt;
&lt;li&gt;The exact link to it&lt;/li&gt;
&lt;li&gt;Where it lives in our repo&lt;/li&gt;
&lt;li&gt;How we installed GraphOS into the workflow&lt;/li&gt;
&lt;li&gt;How we wrapped the graph&lt;/li&gt;
&lt;li&gt;How we tested both the SDK and the benchmark&lt;/li&gt;
&lt;li&gt;What safety behavior we specifically cared about&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, this is not just &lt;em&gt;what GraphOS is&lt;/em&gt;. It is &lt;em&gt;how GraphOS behaves when it meets a real agent&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The files behind this story
&lt;/h2&gt;

&lt;p&gt;If you want to inspect the exact pieces mentioned above, start here (paths relative to the GraphOS repository root):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GraphOS root: &lt;code&gt;README.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Companion launch reference: &lt;code&gt;docs/blog/v1-launch.md&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Benchmark wrapper: &lt;code&gt;benchmarks/agents-from-scratch-ts/graphos-wrap.ts&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Benchmark package setup: &lt;code&gt;benchmarks/agents-from-scratch-ts/package.json&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Benchmark tests: &lt;code&gt;benchmarks/agents-from-scratch-ts/tests&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Final takeaway
&lt;/h2&gt;

&lt;p&gt;GraphOS becomes much easier to understand when you stop describing it as a package and start describing it as a moment in a developer's day.&lt;/p&gt;

&lt;p&gt;An agent starts drifting.&lt;/p&gt;

&lt;p&gt;A team needs answers.&lt;/p&gt;

&lt;p&gt;A wrapper adds policies.&lt;/p&gt;

&lt;p&gt;A dashboard turns hidden execution into something visible.&lt;/p&gt;

&lt;p&gt;A benchmark proves the idea against real code.&lt;/p&gt;

&lt;p&gt;That is the story. And that is why we used &lt;code&gt;agents-from-scratch-ts&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you want to try the same path yourself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Benchmark: &lt;a href="https://github.com/langchain-ai/agents-from-scratch-ts" rel="noopener noreferrer"&gt;https://github.com/langchain-ai/agents-from-scratch-ts&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GraphOS: &lt;a href="https://github.com/ahmedbutt2015/graphos" rel="noopener noreferrer"&gt;https://github.com/ahmedbutt2015/graphos&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;SDK on npm: &lt;a href="https://www.npmjs.com/package/@graphos-io/sdk" rel="noopener noreferrer"&gt;https://www.npmjs.com/package/@graphos-io/sdk&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Dashboard on npm: &lt;a href="https://www.npmjs.com/package/@graphos-io/dashboard" rel="noopener noreferrer"&gt;https://www.npmjs.com/package/@graphos-io/dashboard&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; @graphos-io/sdk
npx @graphos-io/dashboard graphos dashboard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next step is simple: wrap your graph, run your tests, and see what your agent was actually doing when nobody was watching.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>langgraph</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
