Forem: Ted Murray

I Run AI Agents With Full System Access. Here's What Makes It Safe Enough.

Ted Murray — Thu, 30 Apr 2026 16:01:57 +0000

The guardrails question is real. If you give an LLM access to your infrastructure — actual file writes, actual service restarts, actual database queries — and it hallucinates a tool call or gets confused mid-task, you have a problem. A real one, not a theoretical risk.

I know this because I've been running agents with real infrastructure access since February. Five project contexts, each with its own scope and capabilities. Dev handles code repos. Infra manages Docker stacks, reverse proxy configs, and deploy scripts. Research does web search and architecture planning. The system runs on a dedicated mini PC called claudebox that exists for this purpose.

Nothing has gone catastrophically wrong. That's not luck — or at least, I didn't want it to be luck.

Here's the architecture.

The problem with the VM approach

Giving a local LLM access inside a VM is a real answer. You limit blast radius. If the agent does something harmful, the damage is contained. Wipe it and start over.

But isolation by containment has a ceiling.

A VM doesn't answer what the agent can reach before something goes wrong. It doesn't scope which tools each agent can call, or which credentials it holds. In Claude Code, if you've configured an MCP server with a database token, every agent session has that token — even if that session has nothing to do with databases. You're not reducing the attack surface. You're just containing the explosion.

At scale it gets worse. Run multiple agent contexts and every one of them carries your full credential set. A confused agent in one context can potentially reach resources that belong to a completely different job.

The VM gives you recovery. It doesn't give you architecture.

Layer 1: Dedicated hardware

The simplest thing I did: agents don't run on my main machine.

Claudebox is a GMKtec NucBox K11, running Debian, sitting on my desk. It has no personal files. No work credentials. No access to the things I'd actually regret losing. Its entire job is to run Claude agents and the infrastructure that supports them.

This is the physical version of least privilege. Not a VM on a machine that has your personal data — a separate host that doesn't. If something goes wrong on claudebox, the blast radius is claudebox. My NAS, my Unraid server, my family photos — they're on different hardware. Agents reach them only through the MCP tools I've explicitly configured, and scoped-mcp controls which agent gets access to which tools. No agent reaches a host it wasn't given the tools to reach.

It also makes the whole setup easier to reason about. Claudebox is an agent host. That's its job. Everything on it exists to serve that purpose.

Layer 2: Scoped credentials and tools

The second layer is scoped-mcp.

The problem it solves: in Claude Code, every MCP server you configure loads its tools and credentials into every agent session. A research agent that does web search and nothing else still holds your infrastructure write keys if you've configured them somewhere in settings.json. That's a lot of capability for an agent that should only be reading web pages.

scoped-mcp sits between each agent and the full MCP server pool. At session start, the agent connects to a scoped-mcp process that only exposes the tools and credentials configured for that specific agent type. The research agent sees search tools. The infra agent sees infrastructure tools. Neither sees the other's credentials.

A hallucinated tool call can only reach what that agent was given. A prompt injection in a search result can't pivot to infrastructure write operations because those tools aren't in scope.

The config is a YAML profile per agent type that declares which MCP servers are visible and which environment variables are passed through. One process per agent, started at session time over stdio.

There's a detail about how scoped-mcp was built that I still find slightly recursive: I built it using the same multi-agent pattern it protects. A research agent evaluated the problem space. A dev agent wrote the code. Each worked with scoped access to only what its role required. The tool was built by agents operating under the constraints it enforces.

What running this actually looks like

homelab-agent is the reference implementation for all of this.

Five agent contexts, each scoped:

Dev — code repos, GitHub, PR workflow
Infra — Docker Compose stacks, SWAG reverse proxy configs, deploy scripts
Research — web search, architecture planning, vendor evaluation
Outreach — blog workflow, GitHub profile, content publishing
Homelab-ops — day-to-day system management across all hosts

The infra agent holds credentials for the hosts it manages. The research agent holds search API keys and nothing else. The outreach agent has publishing tokens for dev.to and nothing that touches infrastructure.

These are real operations on real systems. Not experiments. Not demos. A typical week: infra deploys a new Docker stack, dev commits code, research evaluates a tool, outreach publishes an article. The system has been running in this configuration for months.

The layer that makes it reliable over time

The safety architecture handles what agents can reach. Memory handles what they can remember.

I built the memory system early — before most of what homelab-agent became. The decision was deliberate. Agents with no persistent context make the same mistakes repeatedly. They can't build on prior decisions. Every session feels like the first one.

The memory system has three tiers: session notes auto-captured at the end of each conversation, working memory files promoted from those sessions, and a distilled long-term tier for decisions that need to survive indefinitely. Each tier has its own retention policy.

What's grown more recently is the search layer. There are now five ways agents can query memory — automatic context injection at session start, on-demand semantic and keyword search, a unified interface that spans all tiers at once, a knowledge graph for infrastructure topology queries, and full-text structured retrieval. They serve different access patterns: some run automatically without prompting, some are on-demand, some answer "what did we decide about this" and some answer "what in the infrastructure connects to what."

The practical effect: when an agent is working on something that touches a decision made three weeks ago, it can find that decision. Not because I re-explained it — through search. Architecture decisions, constraints, tradeoffs — they persist. Agents don't have to re-derive them.

This is a significant part of why I could build this quickly — I'm a Windows sysadmin, not a software developer. The agents weren't starting cold. They had access to everything we'd figured out together. Early sessions generated context that later sessions could build on. The system got better through use rather than worse.

The actual answer to the guardrails question

Containment is a last resort. It's what you reach for when you can't trust the architecture.

Dedicated hardware means the blast radius of any failure is a box that exists for this purpose. Scoped credentials mean agents can't reach what they were never given. Persistent memory means agents build on prior decisions instead of guessing.

None of this eliminates risk. An agent can still make mistakes. The difference is the mistakes are bounded — by scope, by access, by what the architecture permits.

That's not a theoretical safety model. It's been running since February.

Both projects are open source:

homelab-agent — the full reference implementation
scoped-mcp — the MCP credential and tool scoping layer

I Built Two Ollama Tools I Don't Actually Need Yet

Ted Murray — Thu, 23 Apr 2026 14:27:40 +0000

I had one Ollama instance and a question: can I queue inference tasks and route them with priorities? That question turned into two open source tools over about three days, from planning through to shipped repos. I don't fully need either of them yet. I built them anyway.

The starting point

Ollama ships with no authentication. There's an OLLAMA_API_KEY option, but it's a single shared key — every client uses it, there's no per-client separation, and community tools that can't send Authorization: Bearer headers are locked out entirely.

Several services share the same Ollama instance on a dedicated host that homelab-agent is provisioning: LibreChat for interactive chat, a SearXNG MCP server for ML-reranked search, and three background embedding jobs — graphiti, jobsearch-mcp, and memsearch-watch.

The embedding jobs run constantly. Interactive chat waits. I wanted to fix that — give batch workloads a lower priority so interactive requests don't queue behind them.

Before building anything, I used the SearXNG MCP to do a proper competitive search — structured queries, ML reranking, multi-source synthesis. A few tool calls instead of an afternoon of tab-hopping through SEO results.

What the research turned up

The existing tools fell into two camps:

Too little: Basic auth proxies that add a shared key or simple API key checking. No queuing, no per-client control, no routing.

Too much: Rails-heavy tools with VRAM-aware scheduling and GPU memory introspection. Impressive engineering, but overkill for anyone not running a GPU cluster.

The gap: nobody had priority queuing with per-tier depth limits, per-client concurrency caps, and client injection (injecting auth on behalf of clients that can't send headers themselves) — all in one place. LiteLLM has semantic caching and multi-provider routing but no client injection. The simpler proxies have auth but no queue semantics. Nothing had all three.

That gap was worth filling.

The two tools

ollama-queue-proxy

github.com/TadMSTR/ollama-queue-proxy

A smart pool manager for Ollama. Drop it in front of your instance (or fleet), point your consumers at port 11435 instead of 11434, and everything else works as before.

What it adds:

Per-client API keys with priority ceilings. A batch key with max_priority: low that sends X-Queue-Priority: high gets silently capped. The client doesn't know — it just gets queued at its allowed tier.
Three-tier priority queue — high, normal, low — with per-tier depth limits, expiry, and backpressure (429/503 + Retry-After).
Client injection — extra listen ports that inject a fixed identity for clients that can't send Bearer headers. memsearch-watch points at port 11436; the proxy fills in its key and routes it through the same queue with the same priority ceiling.
Model-aware routing — background poller hits /api/tags on each host every 30 seconds, maintains a live model inventory, and routes requests to a host that already has the target model loaded. Avoids cold-start latency when you have multiple Ollama hosts.
Embedding cache — SHA-256 keyed Valkey cache for /api/embed and /api/embeddings. Repeated RAG requests skip the queue and upstream entirely. X-Cache: HIT in the response.
Per-client concurrency caps — a batch client configured with max_concurrent: 2 can't monopolize the worker pool regardless of how many requests it queues.
Failover — mark a host unhealthy on connection failure, retry on the next host, recover automatically via background health checks.

Proxy overhead is ~1-2ms. Negligible compared to inference time.

The unique combination is the point. Priority queuing + per-client caps with fairness bound + client injection — I didn't find that anywhere else.

OLLAMA_HOST=http://localhost:11435

That's the only change your consumers need. Everything else is transparent.

ollama-auth-sidecar

github.com/TadMSTR/ollama-auth-sidecar

Not everyone needs a queue. Some people just want to put auth in front of Ollama without running a full reverse proxy stack.

The sidecar is a single nginx container with a config file. Each consumer gets its own listen port and its own key. Clients that can't send auth headers point at the sidecar instead of Ollama — the sidecar injects the header for them.

services:
  - name: librechat
    listen: 11436
    upstream: http://ollama:11434
    timeout: 300s
    headers:
      Authorization: "Bearer ${LIBRECHAT_KEY}"

  - name: memsearch-watch
    listen: 11437
    upstream: http://ollama:11434
    timeout: 60s
    headers:
      Authorization: "Bearer ${MEMSEARCH_KEY}"

No databases. No dashboards. No processes to keep alive. Container restart is under 1 second. The entrypoint fails fast at startup if any ${ENV_VAR} reference is unset — you don't discover a missing key at request time.

The upstream field is generic — works with any auth-gated HTTP service, not just Ollama. The Ollama-specific framing is for discoverability; the implementation doesn't care.

The relationship between the two tools: If you run the queue proxy, you don't need the sidecar — v0.2.0 bakes client injection in directly. If you only want auth, use the sidecar. Each README links to the other as the appropriate upgrade or downgrade path.

The workflow that made this possible

Both tools went from question to shipped repo in a few days. That's not because I'm fast — it's because each phase of work is handed off to a specialized agent, so I'm not context-switching between research, implementation, security, and docs.

Each phase is a separate Claude Code session with a specific role:

Research agent — competitive analysis using the SearXNG MCP. Structured queries across multiple sources, ML reranking, cross-source synthesis. Found the ecosystem gaps that shaped the feature list.
Build plan — a detailed spec document: repo structure, config schema, phases, success criteria, deployment modes. Written before any code.
Security review (pre-build) — a dedicated pass over the build plan before handoff. For the queue proxy: 7 FLAG items, 0 BLOCKs. For the sidecar: 10 FLAG items across two passes, 0 BLOCKs. All implementation-level — no plan amendments needed, but concrete requirements the dev agent had to address.
Dev agent — picks up the spec and security review, implements, tests, writes docs, tags the release.
Security audit — a separate security agent runs a post-build audit and sends a structured report back. The build agent then picks up the findings and works through them: some fixes, some confirmations, documented reasoning for anything left as-is.
Writer agent — once the build is clean, a writer agent takes over. Docs updated, verified, and cross-referenced across both repos. READMEs, CHANGELOGs, and cross-links between the two tools all done in one pass.

This isn't Copilot autocomplete. Each phase is a full delegated role with a distinct responsibility and a clean handoff. The research agent doesn't touch code. The security reviewer reads both the plan and the finished implementation. The writer agent doesn't make build decisions — it synthesizes and documents what was actually built.

The queue proxy's dev agent shipped all five planned features plus two unplanned improvements (Prometheus label escaping, Dockerfile CMD fix). The sidecar's dev agent shipped the feature, hit real test failures, fixed them methodically, then handed off cleanly to the security and writer passes. Both look like healthy implementation trails from experienced engineers.

The honest close

I have one Ollama instance. I don't need priority queuing. I don't have a second GPU host to route between. The embedding cache is useful, but my embedding job isn't running at a volume where cache hits matter much.

I built these anyway for two reasons.

First: the research showed there was a real gap. Tools built to fill ecosystem gaps get used. Tools built to scratch a personal itch stay forks. If I'm going to build something, it might as well be useful to more people than me.

Second: the workflow needed a real test. The research → build plan → security review → dev agent → security audit → writer agent pipeline is what I'm developing inside homelab-agent — with the goal of shipping it as a proper framework in the platform homelab-agent is building toward. Running the workflow against a real, non-trivial project — something with actual security considerations, a competitive landscape to map, and multiple phases of implementation — was the validation I needed. Both tools passed.

I'll grow into the features. I'm adding a second Ollama host to my homelab. The model-aware routing will stop being theoretical the moment I plug it in.

Both repos ship multi-arch Docker images (linux/amd64 and linux/arm64) published to ghcr.io, with CI on every PR.

ollama-queue-proxy: github.com/TadMSTR/ollama-queue-proxy
ollama-auth-sidecar: github.com/TadMSTR/ollama-auth-sidecar

Your AI Agent Has Your API Keys (And So Does Every Other Agent)

Ted Murray — Tue, 21 Apr 2026 13:28:36 +0000

Open your Claude Code settings.json. Look at the env blocks under your MCP servers. Every API key, every database token, every webhook URL you've put there — your agent has all of them, right now, in its process environment.

That might sound obvious. You configured it that way. But think about what it actually means.

You've got an MCP server for file operations and one for notifications. The notification server needs a webhook URL. The file server doesn't. But Claude Code doesn't scope credentials to individual servers — it loads the full environment and passes it to the session. Your agent has the webhook URL even if it never sends a notification. It has database tokens for backends it never queries. It holds the Grafana service account token whether or not it ever touches a dashboard.

This is fine if you trust the agent completely and nothing ever goes wrong. But "nothing ever goes wrong" is a strange assumption to build on. A hallucinated tool call, a prompt injection in a tool response, a confused agent that decides to "help" by writing to a backend it shouldn't know about — the blast radius isn't one credential. It's every credential you've configured across every MCP server.

And that's with a single agent. Add more and it gets worse.

It gets worse at scale

I was designing the build layer of homelab-agent — a platform where Claude Code agents run durable, multi-phase infrastructure builds. The design called for agent pools: multiple instances of the same agent type running in parallel, each working on a different build phase.

The single-agent credential problem multiplied immediately. Every agent in the pool holds every credential. But new problems appeared too:

Tool visibility. A read-only research agent sees write tools for infrastructure backends it has no business touching. Every agent carries the full tool surface, including everything that can cause damage if called incorrectly.

Resource collisions. No boundaries between agent workspaces. Agent A can read files Agent B wrote. Two agents running in parallel can overwrite each other's working data.

Audit fragmentation. Tool calls are scattered across logs from a dozen server processes, if they're logged at all. Reconstructing what a specific agent did is manual work.

Token overhead. Every agent session loads tool schemas from every configured MCP server. With 12 servers contributing their full tool lists, you're burning 15–30K tokens per session before the agent does anything. At 20 concurrent agents, that's 300–600K tokens of pure initialization overhead — just so each agent can be told about tools it'll never use.

I looked at what existed. Aggregation gateways combine servers but don't scope anything. Access control proxies filter which tools an agent can call, but filtering a tool doesn't prevent Agent A from reading Agent B's files through the tools it is allowed to use. Enterprise gateways solve governance at scale, but they assume cloud deployment and a team — not a single operator running a homelab.

Nothing combined all four: tool filtering + resource scoping + credential isolation + unified audit logging.

Building the fix with the thing it fixes

I asked Claude what a proper tool management framework for multi-agent setups should look like. It immediately understood the scope of the problem and what solving it completely would require.

That conversation became scoped-mcp.

Here's the part that still feels slightly recursive: I built it using the same multi-agent pattern it's designed to protect. A research agent evaluated the problem space — existing MCP gateways, scoping patterns, credential isolation approaches. A dev agent implemented the code. Each agent ran with scoped access to only the resources it needed for its role.

The tool was built by agents operating under the exact constraints it enforces.

How it works

One scoped-mcp process per agent, started at session time. The agent connects to it over stdio the same way it connects to any MCP server.

Agent process (AGENT_ID=build-01, AGENT_TYPE=build)
    │
    ▼
┌────────────────────────────────────────┐
│  scoped-mcp                            │
│                                        │
│  ① Load manifest for AGENT_TYPE        │
│  ② Register only the allowed modules   │
│  ③ Inject credentials into modules     │
│  ④ Every tool call:                    │
│     → enforce resource scope           │
│     → execute tool logic               │
│     → write audit log entry            │
└────────────────────────────────────────┘
    │           │           │
    ▼           ▼           ▼
 filesystem   sqlite      ntfy
 (scoped)    (scoped)   (scoped)

Manifests declare what an agent type is allowed to do. A YAML file per agent role. Nothing outside the manifest loads — tools that aren't listed don't exist from the agent's perspective.

# manifests/research-agent.yml
agent_type: research
description: "Read-only research agent"

modules:
  filesystem:
    mode: read
    config:
      base_path: /data/agents

  sqlite:
    mode: read
    config:
      db_dir: /data/sqlite

  ntfy:
    config:
      topic: "research-{agent_id}"
      max_priority: high

credentials:
  source: env

Set mode: read and only read tools register. The agent can't call write_file or execute because those tools were never mounted. It's not access control layered on top — the tools literally don't exist in the agent's session.

Compare what two different manifests produce:

research-agent.yml          →   4 tools registered
  filesystem: read              read_file, list_dir
  sqlite: read                  query
  ntfy                          ntfy_send

build-agent.yml             →   8 tools registered
  filesystem: write             read_file, list_dir, write_file, delete_file
  sqlite: write                 query, execute
  ntfy                          ntfy_send
  slack_webhook                 slack_send

Same framework, same codebase — completely different tool surfaces. The research agent has no way to write files, execute SQL, or post to Slack. Those capabilities don't exist in its session.

Resource scoping is automatic. The filesystem module applies PrefixScope — every path resolves under agents/{agent_id}/. Path traversal attacks (../) are caught by resolving to absolute paths before comparing. Symlink escapes are caught by walking each component and checking whether any symlink target resolves outside the agent root.

The SQLite module gives each agent its own database file at {db_dir}/agent_{agent_id}.db. Two agents can't read or write each other's data regardless of what SQL they construct. The module also parses SQL with sqlglot to block PRAGMA, ATTACH, DETACH, DROP, and multi-statement batches.

Credential injection happens at the proxy layer. API keys, tokens, webhook URLs — loaded once by the scoped-mcp process from environment variables or a secrets file. Modules receive credentials through their context. The agent process never sees them. If you try to read INFLUXDB_TOKEN from the agent's environment, it won't be there.

Audit logging produces two structured JSON-L streams: one for what agents did (every tool call, every scope check), one for what the server did (startup, config, errors). Credentials are automatically redacted — any key ending in _TOKEN, _PASSWORD, _SECRET, _KEY gets replaced with <redacted> before it hits the log.

Seeing it work

Three infrastructure modules, one agent, one workflow:

┌─ ops-agent (AGENT_ID=ops-01) ─────────────────────────────────────┐
│                                                                    │
│  1. influxdb_query(bucket="metrics",                              │
│       filters=[{"field": "_measurement",                          │
│                 "op": "==", "value": "docker_cpu"}])              │
│     → discovers container X averaging 94% CPU                     │
│                                                                    │
│  2. grafana_create_dashboard(                                      │
│       title="Container Health",                                   │
│       panels=[{"title": "CPU by Container", ...}])                │
│     → dashboard created in folder agent-ops-01/                   │
│                                                                    │
│  3. ntfy_send(title="High CPU: container X",                      │
│       message="Averaging 94% over last hour.")                    │
│     → operator receives push notification                         │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘

The agent queried metrics in the buckets it was allowed to see. It built a Grafana dashboard in its own folder — it can't touch operator dashboards or other agents' dashboards. It sent an alert through the ntfy topic assigned in its manifest.

At no point did it hold an InfluxDB token, see a Grafana API key, or know the ntfy server URL. A second agent running in parallel has a completely separate scope. They can't collide.

The audit that proved it needed to exist

I ran a security audit against v0.1.0 the same day it shipped. 18 findings. One critical, three high, eight medium, six low.

The critical finding: SQLite isolation was broken. The original design used schema-level scoping in a shared database file. An agent could issue an unqualified table reference and resolve against another agent's schema. The fix was simple and total — give each agent its own database file. No shared state, no schema tricks.

The high findings included:

Flux injection in InfluxDB — raw Flux query strings accepted from agents. Replaced with structured {field, op, value} filter dicts, validated and escaped before rendering.
SSRF gaps in the HTTP proxy — the blocklist missed IPv6-mapped IPv4, link-local, CGNAT, and NAT64 ranges. DNS rebinding attacks could bypass the allowlist between init and invocation. Fixed with per-request re-resolution.
A decorator that lied — the @audited wrapper was documented as enforcing scope but never actually called enforce(). The fix was honest: remove the false claim, make the contract explicit — modules are responsible for calling enforce() in every tool method.

All 18 findings were remediated and v0.2.0 shipped the same day. v0.2.1 and v0.3.0 audits came back clean.

I publish the full audit history in docs/security-audit.md. Not because it makes the project look polished — it doesn't. It makes it look honest. When a tool's core value is security, showing the receipts matters more than showing a clean record.

What ships with it

Ten built-in modules:

Storage — filesystem (read, write, list, delete within a scoped directory tree), sqlite (per-agent database with SQL validation)

Notifications — ntfy, smtp, matrix, slack_webhook, discord_webhook (write-only by design — agents send alerts, they never see webhook URLs or SMTP passwords)

Infrastructure — http_proxy (allowlisted outbound HTTP with SSRF prevention), grafana (dashboard CRUD scoped to an agent-owned folder), influxdb (time-series query/write restricted to an allowlisted bucket set)

Writing a custom module is about 20 lines:

from scoped_mcp.modules._base import ToolModule, tool
from scoped_mcp.scoping import NamespaceScope

class RedisModule(ToolModule):
    name = "redis"
    scoping = NamespaceScope()
    required_credentials = ["REDIS_URL"]

    def __init__(self, agent_ctx, credentials, config):
        super().__init__(agent_ctx, credentials, config)
        import redis.asyncio as aioredis
        self._redis = aioredis.from_url(credentials["REDIS_URL"])

    @tool(mode="read")
    async def get_key(self, key: str) -> str | None:
        """Get a value (scoped to agent namespace)."""
        scoped_key = self.scoping.apply(key, self.agent_ctx)
        return await self._redis.get(scoped_key)

    @tool(mode="write")
    async def set_key(self, key: str, value: str, ttl: int = 0) -> bool:
        """Set a key-value pair (scoped to agent namespace)."""
        scoped_key = self.scoping.apply(key, self.agent_ctx)
        return await self._redis.set(scoped_key, value, ex=ttl or None)

Add it to a manifest with mode: read and only get_key registers. set_key doesn't exist from the agent's perspective.

Try it

pip install scoped-mcp

Point Claude Code at it:

{
  "mcpServers": {
    "tools": {
      "command": "scoped-mcp",
      "args": ["--manifest", "manifests/research-agent.yml"],
      "env": {
        "AGENT_ID": "research-01",
        "AGENT_TYPE": "research"
      }
    }
  }
}

The repo includes a 5-minute isolation verification walkthrough — you can confirm filesystem scoping and credential non-exposure without reading a line of source code.

github.com/TadMSTR/scoped-mcp — MIT licensed, Python 3.11+, on PyPI.

If you're running a single Claude Code session, you probably don't need this yet. If you're running multiple agents with defined roles and they're all sharing the same tool surface — the access problem is already there. You just might not have looked at it yet.

Read the full series:

Part 1: I Built an Agentic Infrastructure Platform in 42 Days — the origin story

Part 2: I Built an AI Memory System Because My Brain Needed It First — the memory deep dive

Part 3: How to Give Claude Code a Memory — the practical how-to

Part 4: I'm Designing a Platform I Can't Build Alone — cognitive augmentation

Part 5: What Actually Survived: A Memory System Retrospective — 10 weeks in production

I Built an AI Memory System. Then I Forgot About It.

Ted Murray — Sun, 19 Apr 2026 15:43:07 +0000

The memory system I built for Claude has been running since February.

Every prior phase — from the first CLAUDE.md file to the nightly sync pipeline to the knowledge graph — followed the same pattern: use the system, hit friction, build the next layer. I expected that cycle to keep going. I expected to keep building.

The cycle did slow down. Not because I gave up — because the core architecture stopped generating friction. The tier structure, the search interface, the promotion pipeline — they just work. What changed was everything around them.

This is a retrospective on what it looks like when the system is actually working, what evolved since I wrote about the original architecture, and the scale it's reached.

The Success State Is Invisible

The best version of a memory system is one you stop thinking about.

When I was building it, the system was always on my mind — something to design, extend, debug, curate. There was a period where I was writing more about the memory system than I was using it. That's normal during construction.

The shift happened gradually. At some point I stopped checking whether the nightly pipeline ran. Stopped verifying that working notes got indexed. I'd start a session and the agent would surface a decision from weeks ago that I'd completely forgotten about — and instead of thinking the memory system worked, I'd just think right, I forgot about that, good.

That's the success state. The system becomes infrastructure — background, assumed, noticed only when it fails.

The Numbers

The system has reached a scale I didn't anticipate when I designed it:

124 distilled files — permanent, git-backed knowledge. Architectural decisions, build close-outs, security audit findings, infrastructure state. Each one reviewed and promoted by the nightly pipeline.
121 working memory files — agent-curated notes with 90-day expiry. Build context, session findings, research output, project state.
57 documentation caches — pre-fetched vendor docs for every service I work with regularly, indexed alongside memory.
28 lines of core context — the always-visible file, capped at 40 lines. This is the one file every agent sees on every turn.

When I wrote the how-to post, the distilled tier had a handful of files. Now it's the system's largest store of permanent knowledge. The nightly pipeline reviews working notes, promotes what qualifies, expires what's stale, and deduplicates what overlaps. Ten distillation runs in the last two weeks alone — Helm build phases, infrastructure decisions, security findings, all moving from working memory to permanent record.

What Changed Since the How-To Post

Four things are meaningfully different from what I described in How to Give Claude Code a Memory:

Real-time indexing. The original design had memsearch capturing session summaries at the end of a session. You'd write something important mid-session, and by the next session it would be searchable. Useful. But there was a lag.

memsearch-watch eliminated the lag. It's a PM2 process that watches memory directories and re-indexes within five seconds of any write. Context I write mid-session is searchable before the session ends. That's not a small change — it means the system can reference its own notes from thirty minutes ago, not just from last week. The memory tier now participates in the current conversation, not just future ones.

Unified search. The original design had two separate tools for two purposes: memsearch for automatic recall at session start, qmd for intentional search mid-session. The mental model was: memsearch surfaces what you should know, qmd finds what you're looking for.

The distinction is real, but operating two separate tools with two separate invocation patterns adds friction. The archival-search skill now does both in one pass — queries memsearch and qmd simultaneously, merges results, labels each by tier. The two-tool model still exists under the hood; the interface collapsed to one. That's the right direction.

The docs cache. This one I didn't anticipate when I designed the memory system, and it might be the most useful extension I've added.

The problem: agents kept fetching web documentation for services I use constantly — Grafana, Docker Compose, SWAG, Authelia, nearly sixty others. Some of those requests were slow. Some required web access the agent didn't always have. Some returned SEO garbage instead of actual docs.

The solution was to pre-fetch and cache documentation for every service I work with regularly — chunked markdown, organized by service and topic, stored in ~/.claude/memory/docs/. memsearch indexes it alongside working memory. Now when an agent needs to know how to configure an Authelia forward auth proxy, it checks the local docs cache first — same search interface as memory, same results format, always available, always fast.

The insight is that vendor documentation is memory. It's context the agent needs to do its job. Treating it as a first-class memory tier rather than a web resource you fetch on demand changes how agents operate — they become less dependent on real-time web access and more capable in environments where that access is restricted or slow.

The knowledge graph embeddings. The original Graphiti deployment used default embeddings. I've since switched to bge-m3 running on local hardware via Ollama. Search quality improved — entity lookups and relationship queries return more relevant results, especially for infrastructure topology questions like "what connects to SWAG?" or "what changed on atlas last week?" Running the embedder locally also means the graph doesn't depend on any external API.

What Held Up Exactly as Designed

The tier structure is unchanged.

Session tier (auto-captured, 30-day retention), working tier (agent-curated, 90-day expiry, YAML frontmatter), distilled tier (permanent, git-backed), core context (40-line cap, always-visible). The knowledge graph for relationships.

I re-read the original day-seven design doc while writing this. The structure it describes is what I'm running. The tools changed, the scale changed, the automation layer changed. But the principle — stable knowledge here, working knowledge there, automated promotion between them — is word-for-word the same.

That held through a period where the platform shipped a dozen new components, built a multi-phase infrastructure system on a second host, and published two new public repos. The tier model didn't need to change to accommodate any of it.

What I Use Differently Than I Expected

I expected the distilled tier to be my most-referenced memory from the start. Permanent, git-backed, settled knowledge — that's the crown jewel, right?

In practice, core context is the daily workhorse. Twenty-eight lines that every agent sees on every turn — user profile, active projects, key constraints, recent decisions. It's the file that gives agents enough orientation to be useful immediately without searching for anything.

The distilled tier turned out to matter differently than I expected. I don't search it directly very often. But it's where the pipeline puts things so they're findable when I do need them — and that happens at exactly the moments where not finding something would cost real time. A security audit finding from three weeks ago. The close-out notes from a build that another build now depends on. The architectural decision that explains why a service is configured the way it is.

124 files is a safety net I use rarely and value constantly.

The nightly pipeline was the other surprise. I expected it to be a source of ongoing maintenance — reviewing what it promoted, correcting misclassifications. Instead, it's a background process I trust. It broke once during a PM2 consolidation, and I fixed it. That's the whole maintenance log.

Where It Still Has Friction

Two things that haven't been fully solved:

Entity naming in the knowledge graph. The graph works best when entity names are consistent — "Grafana" not "grafana", "SWAG" not "the reverse proxy." When different agents add facts about the same entity under slightly different names, the graph accumulates near-duplicate nodes. There's a nightly dedup step, but it catches structural duplicates, not naming variations. I'm working around it with naming conventions enforced through CLAUDE.md — Title Case for all entity names in episode text. It works. It's not elegant.

The session tier is noisy. memsearch captures sessions automatically, which means it captures everything — including sessions where nothing worth remembering happened. The 30-day retention handles eventual cleanup, but search results sometimes surface sessions that are noise rather than signal. A relevance filter on session-tier results would help. I haven't built it yet.

What Comes Next

The memory system handles one question well: what do I know?

The question it handles less well: what do I know about the relationships between things I know?

That's the knowledge graph's job — and the graph is working, but treating it as an optional add-on (as I did in the how-to post) undersells what it's actually for. The switch to local embeddings made it more capable. The naming friction is the main thing holding it back.

The other thing I didn't expect: the memory system became a design input for other tools. scoped-mcp — a per-agent tool proxy I published recently — exists partly because the memory architecture proved that giving each agent a defined scope produces better outcomes than giving every agent access to everything. The pattern keeps showing up. Scope the context, scope the tools, scope the resources. Let each agent do its job without carrying everyone else's baggage.

Read the full series:

Part 1: I Built an Agentic Infrastructure Platform in 42 Days — the origin story

Part 2: I Built an AI Memory System Because My Brain Needed It First — the memory deep dive

Part 3: How to Give Claude Code a Memory — the practical how-to

Part 4: I'm Designing a Platform I Can't Build Alone — cognitive augmentation

I Didn't Want to Pay for Web Search in My Own Homelab, So I Built the Pipeline

Ted Murray — Thu, 16 Apr 2026 14:13:27 +0000

I was setting up web search for LibreChat — a self-hosted chat interface for AI models. The config has three required components:

Search provider: Serper (paid API) or SearXNG (self-hosted)
Scraper: Firecrawl — hosted API, costs money
Reranker: Jina AI or Cohere — both paid APIs (a reranker re-scores search results by relevance to the query, rather than trusting whatever order the search engine returned)

I run a homelab specifically so I don't depend on paid APIs for things I can own. The search provider was easy — SearXNG is self-hosted and free. But the scraper and reranker had no obvious self-hosted path.

I asked Claude if there was a free way to run Firecrawl. It found firecrawl-simple, a lightweight local deployment of the same tool. Perfect.

For the reranker, I asked Claude to explain what Jina and Cohere were actually doing. When I said I didn't want to call another external API, it offered to just build one — a small FlashRank wrapper exposing a Jina-compatible /v1/rerank endpoint. That became the reranker that's been running in my stack ever since.

That was the seed. What I have now is searxng-mcp — a full private web search pipeline packaged as an MCP server. MCP (Model Context Protocol) is how AI clients like Claude Code connect to external tools; the server exposes web search as a set of callable tools that agents can use during a session. It's used by Claude Code agents and LibreChat every day.

What it does

searxng-mcp exposes five tools over stdio MCP:

Tool	What it does
`search`	Query SearXNG, rerank results with a local ML model, return top N
`search_and_fetch`	Same as search, then fetch full page content for the top 1–3 results
`search_and_summarize`	Search, fetch, then synthesize a structured summary via Ollama
`fetch_url`	Fetch and extract readable markdown from any URL
`clear_cache`	Purge the search or fetch cache when you need fresh results

The design principle throughout: every external component is optional, and the server degrades gracefully when any of them are unavailable. If the reranker is down, you get results in SearXNG's native order. If Ollama isn't running, search_and_summarize falls back to raw fetched content. Nothing hard-fails.

The architecture

MCP client (stdio)
      │
      ▼
  searxng-mcp ──────────────→ Valkey                → result cache (search 1h, fetch 24h)
      │
      ├── expand (optional) → Ollama (qwen3:4b)      → rewritten query
      ├── search ───────────→ SearXNG               → raw results
      ├── rerank ───────────→ Reranker              → ranked results
      │                       (fallback: SearXNG order)
      ├── fetch content ────┬→ GitHub API            → markdown
      │                     ├→ Firecrawl            → page markdown (tier 1)
      │                     ├→ Crawl4AI             → page markdown (tier 2)
      │                     └→ Raw HTTP             → page text (tier 3)
      └── summarize (opt.) → Ollama (qwen3:14b)     → synthesized summary

The interesting part is the fetch cascade.

The fetch cascade

Fetching web content for an AI agent turns out to be harder than it sounds. Firecrawl handles the majority of pages well — it renders JavaScript, extracts clean markdown, deals with most anti-bot measures. But some pages block it anyway. When that happens, Firecrawl returns success: true with empty content rather than throwing an error. That's a soft failure, not a hard one, and it took me a while to catch it.

The cascade handles this: if Firecrawl returns empty content, fall through to Crawl4AI, which uses a different extraction approach and handles JS-heavy pages differently. If Crawl4AI also fails or isn't configured, fall through to raw HTTP — just fetch the page and strip the HTML. Not perfect, but something.

Three tiers, each cheaper than the last, each a fallback for the one above it.

flowchart TD
    A([URL]) --> B{github.com?}
    B -->|yes, repo root| C[GitHub API\nfetch README]
    B -->|yes, file blob| D[raw.githubusercontent.com]
    B -->|no| E[Firecrawl]
    E --> F{content\nreturned?}
    F -->|yes| G([page markdown])
    F -->|empty or blocked| H{Crawl4AI\nconfigured?}
    H -->|yes| I[Crawl4AI]
    I --> J{content\nreturned?}
    J -->|yes| K([page markdown])
    J -->|no| L[Raw HTTP fetch\nno redirects]
    H -->|no| L
    C --> M([page markdown])
    D --> N([raw file content])
    L --> O([page text])

GitHub URLs are handled natively outside the cascade entirely — repo roots fetch the README via the GitHub API, file blobs fetch from raw.githubusercontent.com. No Firecrawl needed.

Reranking and why it matters

SearXNG aggregates results from multiple search engines. The order it returns them in is... whatever order the upstream engines agreed on. That's fine for casual browsing, not great for an AI agent that's going to fetch and read the top result.

The key mechanism: for every search, the server fetches 3x the requested results from SearXNG (capped at 20) to give the reranker a larger candidate pool. The reranker then re-scores all of them using a cross-encoder ML model that understands the relationship between the query and each result, and returns only the top N. A result that matches your query semantically surfaces above a result that just happens to rank well with Google — but only because it had more candidates to sort through in the first place.

In v3.2.0, I added recency weighting — a small exponential decay score based on publishedDate blended with the relevance score (weight 0.15 by default). Fresh results surface within relevance-close clusters without overriding large relevance gaps. It's skipped automatically when you've set a time_range, since the result pool is already date-filtered.

Domain profiles

Not all search results are equally useful depending on what you're looking for. If I'm researching a Docker networking issue, I want results from the Docker docs, GitHub issues, and Linux sysadmin communities — not marketing pages that happen to mention Docker.

Domain profiles let you apply a named boost/block list per query:

homelab — surfaces self-hosted and Linux documentation, suppresses content farms
dev — surfaces Stack Overflow, MDN, npm docs

You pass domain_profile: "homelab" on any query and the domain filter applies. Profiles are defined in domains.json, which hot-reloads every 5 seconds — you can tune them without restarting the server.

Query expansion

For search and search_and_fetch, there's an optional expand parameter. When true, Ollama (qwen3:4b) generates 2-3 typed query variants — a technical rephrasing, a product/version-specific form, and a community phrasing (how someone would ask it on a forum). Those variants run in parallel with the original query, and the result pools are merged and deduplicated by URL before reranking.

It's not a serial rewrite — it's a parallel fan-out. If your first phrasing misses relevant results that a slightly different framing would surface, expansion catches them. Most useful for research queries; for precise lookups it adds latency (~3s) with less benefit.

You can also set EXPAND_QUERIES=true to enable it globally.

SSRF protections

This server runs on your local network and fetches arbitrary URLs. That creates SSRF risk — an attacker (or a confused agent) could potentially get it to fetch http://192.168.1.1/admin or http://localhost:2375 (Docker socket exposed, in the worst case).

fetch_url and search_and_fetch enforce a URL allowlist that blocks private IP ranges: 10.x, 192.168.x, 172.16–31.x, localhost, 127.x, IPv6 private ranges (::1, fc00::/7, fe80::/10), and non-HTTP protocols.

The IPv6 case caught me during a security pass — URL.hostname returns brackets for IPv6 addresses (e.g., [::1]), so naive regex matching against ::1 doesn't work. The fixed version matches the bracket-wrapped form.

There's also redirect blocking in the raw HTTP fetcher — rawFetch() refuses to follow redirects, preventing SSRF bypass via redirect chains to internal addresses. And Crawl4AI task_id values are validated before being interpolated into the poll URL to prevent path traversal.

Caching

Valkey (Redis-compatible) is optional but worthwhile. Search results are cached for 1 hour, fetched pages for 24 hours. For the kind of research queries AI agents run — often the same topic from slightly different angles over a session — this saves meaningful latency and avoids hammering SearXNG and Firecrawl with redundant requests.

The clear_cache tool lets you purge when you need fresh results on a fast-moving topic.

MCP client setup

For Claude Code, the recommended setup uses claude mcp add-json:

claude mcp add-json searxng --scope user '{
  "command": "node",
  "args": ["/path/to/searxng-mcp/build/src/index.js"],
  "env": {
    "SEARXNG_URL": "http://localhost:8081",
    "FIRECRAWL_URL": "http://localhost:3002",
    "RERANKER_URL": "http://localhost:8787",
    "OLLAMA_URL": "http://localhost:11434",
    "VALKEY_URL": "redis://localhost:6379",
    "CRAWL4AI_URL": "http://localhost:11235"
  }
}'

This writes to ~/.claude.json. Don't add it to ~/.claude/settings.json — that file isn't used for MCP env var injection in Claude Code.

For LibreChat, add it to librechat.yaml under mcpServers with type: stdio.

What runs in practice

The full required stack:

SearXNG — must have JSON format enabled in settings.yml
Firecrawl (firecrawl-simple) — local deployment, no API key needed for local instances
Reranker — FlashRank wrapper, reference implementation in homelab-agent/docker/reranker

Optional:

Valkey — caching
Crawl4AI — second-tier fetch fallback
Ollama — query expansion and summarization (requires qwen3:4b and/or qwen3:14b)

The server starts fine without the optional components and tells you clearly when a feature isn't available because its dependency isn't configured.

The repo

github.com/TadMSTR/searxng-mcp

MIT licensed. The full homelab stack it runs in — including the reranker Docker image — is documented in homelab-agent.

If you're already running SearXNG, the jump to a full agent-ready search pipeline is smaller than it looks.

How to Give Claude Code a Memory

Ted Murray — Fri, 27 Mar 2026 23:57:03 +0000

I wrote about why I built a memory system for Claude. The short version: Claude's built-in memory extracts facts automatically with no audit trail, no version control, and no way to scope what different agents see. I wanted control.

This post is the practical companion. If you want to build your own, here's how mine works, what each piece does, and the order I'd set it up in if I were starting today.

Everything here is open source. The full stack is documented in homelab-agent.

What You're Building

A memory system with three properties:

Persistent — context survives across sessions. Monday's agent knows what Friday's agent decided.
Searchable — agents find relevant context automatically, not by loading everything into the prompt.
Scoped — different agents see different things. Your infrastructure agent doesn't need your code review history.

The system has three tiers of memory, two search tools, and an optional knowledge graph. You don't need all of it on day one.

The Minimum Viable Memory

If you set up nothing else, do this. It takes ten minutes and gets you 70% of the value.

CLAUDE.md files

Claude Code reads CLAUDE.md files automatically. One in your home directory for global context. One in each project directory for project-specific context. This is your foundation.

Your global ~/.claude/CLAUDE.md should contain:

Who you are and how you work (role, preferences, communication style)
Your infrastructure overview (hosts, IPs, key services)
Rules that apply everywhere (don't push to main, don't SSH without asking)

Your project CLAUDE.md files should contain:

What this project is and what the agent's scope covers
Project-specific conventions and constraints
Pointers to where relevant documentation lives

This isn't memory in the dynamic sense — it's stable configuration. But it's the single highest-impact thing you can do. Every session starts with this context loaded automatically.

Directory-based working memory

Create a memory directory:

~/.claude/memory/
├── shared/           # Cross-agent knowledge
└── agents/
    ├── dev/          # Dev agent's notes
    ├── research/     # Research agent's notes
    └── ops/          # Ops agent's notes

Tell your agents (via CLAUDE.md) to write notes here during sessions. Use a simple frontmatter format:

---
tier: working
created: 2026-03-15
source: dev
expires: 2026-06-13
tags: [docker, decision]
---

The expiry date matters. Working memory should have a 90-day TTL. If a note is still relevant after 90 days, it should be promoted to permanent storage. If not, it was temporary context that served its purpose.

The shared/ directory is for cross-agent knowledge — decisions that affect multiple projects. The agents/ subdirectories are scoped — each agent reads its own directory plus shared.

This is just markdown files in directories. No database, no service, no dependencies. It works immediately and it's human-readable, git-trackable, and greppable.

Adding Search: memsearch

Directory-based memory has a problem: agents have to know what file to read. Once you have more than a dozen notes, you need search.

memsearch is a Claude Code plugin that indexes markdown files using local embeddings and auto-injects relevant context at session start. No API calls. No external service. It runs locally using sentence-transformers.

What memsearch does:

Indexes your memory directories into a local vector store
At session start, searches the index for context relevant to the conversation
Auto-injects matching notes into the context window
Captures session summaries automatically via a Stop hook

The session capture is important. When a Claude Code session ends, memsearch writes a summary to its own memory store. Next time you start a session in that project, relevant past sessions surface automatically. You don't have to write anything — it happens.

Install it as a Claude Code plugin, point it at your memory directories, and you get semantic search over your notes with zero ongoing effort.

What memsearch doesn't do

memsearch is great for automatic recall — "surface relevant context without being asked." It's not great for intentional search — "find me the note where I decided to use Traefik instead of Caddy." For that, you want a proper search tool.

Adding Intentional Search: qmd

qmd is a hybrid search tool that combines BM25 keyword matching with vector embeddings and LLM reranking. It serves results via MCP, so any agent can search.

Why both memsearch and qmd?

memsearch = automatic recall. Surfaces relevant context at session start without being asked. Good for "remind me of things I should know."
qmd = intentional search. Agent explicitly queries when it needs specific information. Good for "find the decision about X" or "what does the architecture doc say about Y."

qmd indexes multiple collections — memory notes, infrastructure docs, compose files, whatever you point it at. The hybrid approach (keywords + semantics + reranking) outperforms pure vector search on technical documentation where exact terms matter.

If you have GPU acceleration available, enable it. Embedding time dropped from 3+ minutes to under a minute on my setup using Vulkan on a AMD Radeon 780M iGPU.

The Three-Tier Pipeline

Once you have working memory and search, you'll hit a new problem: memory accumulates. Session notes pile up. Working notes expire but some of them contain decisions you'll want forever.

The three-tier pipeline solves this:

Session tier — Raw, auto-captured. memsearch writes these. 30-day retention. No curation needed. This is your "what happened recently" layer.

Working tier — Agent-curated. Agents write structured notes with frontmatter during sessions. 90-day expiry. This is your "active decisions and context" layer.

Distilled tier — Permanent, git-backed. Notes that pass the "would this matter in 3 months?" test get promoted here. This is your "settled knowledge" layer. Version-controlled so you have full history.

The promotion path is always upward: session notes get reviewed and important items become working notes. Working notes older than 14 days get evaluated for distillation. Distilled notes are permanent.

Automating the pipeline

I run a headless Claude Code agent at 4 AM that handles the promotion pipeline automatically. It:

Scans session notes from the past week across all project stores
Promotes durable items to working tier
Reviews working notes older than 14 days
Promotes qualifying notes to the distilled tier (git-backed)
Expires stale working notes past 90 days
Deduplicates (merges topical duplicates)
Logs metrics and generates a health report

You don't need this on day one. Start with manual curation — read your working notes occasionally, promote the important ones, delete the stale ones. Automate when the volume makes manual curation a burden.

Core Context: The Sticky Note

There's a fourth layer that sits outside the pipeline: core context.

This is a small file (I cap mine at 40 lines) that gets injected at every session start via a SessionStart hook, before any tools run. It contains:

User profile (role, key skills, cognitive style)
Active projects and their current status
Key constraints (things every agent must know)
Recent decisions (the last few important choices made across any project)

The 40-line cap is deliberate. This file sits above the context window's compression threshold — it never gets summarized away, no matter how long the session runs. If it's too big, it crowds out working memory. Keep it tight.

The distinction from CLAUDE.md: CLAUDE.md is stable configuration that changes rarely. Core context is dynamic — it reflects what's happening now. Active projects change. Recent decisions rotate. The core context file gets updated by a skill whenever something important shifts.

Private Web Search: SearXNG-MCP

This isn't memory in the traditional sense, but it feeds the memory system. When your agents can search the web privately, the results become part of the knowledge base.

SearXNG is a self-hosted meta-search engine. It queries multiple search backends (Google, Bing, DuckDuckGo, and dozens more) without sending your queries to any single provider. No API keys, no per-search costs, no tracking.

I built searxng-mcp to expose SearXNG as an MCP server with three tools:

search — query SearXNG, get structured results with titles, URLs, snippets, and source engines
search_and_fetch — search + fetch full text of the top result
fetch_url — fetch and extract readable text from any URL

Results are reranked by a local ML model before being returned. Full page content is fetched via Firecrawl (handles JavaScript-rendered pages). GitHub URLs are handled natively via the GitHub API.

Why does this matter for memory? Because when your research agent searches the web, evaluates options, and writes a recommendation to working memory, that recommendation is grounded in current information — not model training data. The search tool feeds the memory system with sourced, dated, real-world information.

Optional: Knowledge Graph

Everything above uses flat files and search indexes. For most setups, that's enough. But there's a category of question that text search can't answer well: relationship queries.

"What services depend on port 8080?" "What changed about SWAG config this week?" "What connects to the message bus?" These are graph queries — the answer is about relationships between entities, not about retrieving a document.

I use Graphiti with Neo4j for this. Graphiti is a temporal knowledge graph — facts have validity windows, so when something changes, the old fact gets superseded rather than polluting results.

The knowledge graph is fed automatically by the same pipeline that handles memory sync. When the nightly agent processes session notes, it also ingests relevant facts into the graph. Infrastructure state changes (deploys, service adds/removes, network changes) get added directly.

This is genuinely optional. If your queries are mostly "find relevant context" (text search handles this) rather than "what's related to what" (graph handles this), you don't need it. I added it three weeks into building the memory system, not on day one.

Setup Order

If I were starting from scratch today, I'd build in this order:

Week 1: Foundation

Write your global ~/.claude/CLAUDE.md — who you are, your infrastructure, your rules
Write project CLAUDE.md files for each project directory
Create the memory directory structure (~/.claude/memory/shared/, ~/.claude/memory/agents/)
Define your frontmatter format (tier, created, source, expires, tags)
Tell your agents (via CLAUDE.md) to write notes during sessions

Week 2: Search

Install memsearch — automatic context recall and session capture
Deploy qmd — intentional search over memory + docs
Index your memory directories and any infrastructure documentation

Week 3: Pipeline

Start manually reviewing working notes — promote the important ones, delete the stale ones
Write the core context file and inject it via a SessionStart hook
When manual curation becomes a burden, automate with a nightly sync agent

Week 4+: Extensions

Deploy SearXNG + searxng-mcp for private web search
Add the knowledge graph if you're hitting relationship query limits
Build skills (reusable instruction sets) for common memory operations

Don't try to build it all at once. Each layer should earn its place by solving a friction you actually feel.

What It Feels Like

The before state: every session starts cold. You re-explain your setup. You re-state your preferences. You forget what you decided last week because the conversation where you decided it is gone.

The after state: you sit down on Monday morning and the agent already knows about the Docker change you made Friday, the monitoring alert from Saturday, and the research you did Sunday. It knows because the memory pipeline captured those events, the semantic search surfaced them as relevant, and the knowledge graph connected them to the services they affected.

The system isn't perfect. Memory sync sometimes promotes things that don't matter. Search sometimes misses things that do. The knowledge graph needs entity resolution tuning. But the baseline — persistent, searchable, scoped context that accumulates and connects without manual curation — changes how you work with AI agents.

It stops being a tool you instruct and starts being a collaborator that remembers.

The Repository

Everything described here is open source and documented in detail:

homelab-agent on GitHub — the full stack with component docs for memsearch, memory-sync, qmd, Graphiti, and more.

The component docs are thorough (2000+ lines each for the major pieces). The index.md at the root is designed to be handed directly to Claude — point it at the file and tell it to help you map a path through the docs based on your setup.

If you build your own version of this, it will look different from mine. Your infrastructure is different, your workflow is different, your agents handle different domains. That's the point. The architecture transfers. The implementation is yours.

Previous: I Built an AI Memory System Because My Brain Needed It First

Next in series: I Manage a Team of AI Agents. I Had to Build My Own Management Tools.

I'm Designing a Platform I Can't Build Alone. That's the Point.

Ted Murray — Tue, 24 Mar 2026 11:23:39 +0000

I've been designing something called Helm.

It started as "Platform v2" — a productized version of the agentic infrastructure I built on my homelab. Multi-user, multi-host, installable on a mini PC, runs your services, manages your agents, handles your backups. The kind of thing a family or a small business could use without knowing what Docker is.

The architecture document is over 1,000 lines long. It covers federation between hosts, emergency WiFi that activates during blackouts, community mesh networking over LoRa radios, municipality notification templates for CERT volunteers, GPU-accelerated local AI services, an eBay selling agent, accessibility via voice interaction, a dual catalog system with community contributions, and a deployment profile system that adapts the setup wizard for homes vs small businesses.

I am not a developer. I'm a Windows systems administrator. I have a 2-year degree from an online college. My GitHub history before February 2026 is bash and PowerShell scripts.

Here's what I've been thinking about while designing all of this.

The Session That Made Me Stop

I was deep into Helm architecture — we'd just designed multi-host federation, where multiple Helm instances auto-discover each other on a LAN using mDNS and authenticate via mutual TLS — when I noticed something.

Every feature I added immediately connected to the existing architecture. Federation led to "who controls the federation?" which led to deployment profiles (Home, Home Business, Small Business). Emergency WiFi led to resilience profiles, which led to community member discovery, which led to municipality notification. Meshtastic mesh networking led to off-grid communication stacks, which led to NOAA weather alert receivers, which led to emergency AP mode with captive portals.

I wasn't planning these connections. I was seeing them. In real time. Faster than I've ever worked on anything in my career.

So I asked Claude a question that had been forming in the back of my mind:

"The way I've embraced Claude with persistent memory — could that be considered a mental prosthetic? It's a modification of my working memory, or an extension of it."

The Extended Mind

Claude pointed me to something I'd never heard of: the Extended Mind Thesis, proposed by Andy Clark and David Chalmers in 1998.

Their argument: if an external tool plays the same functional role as an internal cognitive process, it's not just helping you think — it's part of your thinking. It's cognition, not assistance.

Their example was a man named Otto who has memory loss and uses a notebook. When Otto wants to go to a museum, he checks his notebook for the address. Clark and Chalmers argued that Otto's notebook is his memory — it's reliably available, he trusts it, he accesses it when needed, and the information was consciously stored.

My persistent memory system meets every one of those criteria. And it goes further than Otto's notebook.

Otto's notebook is passive. He has to remember to check it and know what to look for. My system is active — it retrieves relevant context before I ask, connects information across sessions automatically, and maintains structure that makes the right information findable at the right time. That's closer to how biological memory works — associative retrieval, contextual activation — than any notebook.

Claude suggested that "prosthetic" actually undersells what's happening. A prosthetic replaces lost function. My working memory isn't broken — it works exactly as well as it did a year ago. What I've built is augmentation. My biological working memory holds 4-7 chunks of information at once. The persistent memory system makes that number effectively unlimited across time.

What Augmentation Actually Feels Like

I have ADHD. If you've read the earlier posts in this series, you know that. My working memory has always been a constraint I design around, not a weakness I've overcome.

What changed isn't my brain. What changed is the friction.

The 1,000+ line Helm architecture document is the clearest proof. No human holds that much structured, interconnected detail in working memory. But I'm building on it coherently, session after session — adding federation, then recognizing deployment profiles, then emergency infrastructure, then municipality notification, each idea connecting to the existing structure in the right place.

That's not possible without the memory system acting as an extension of my own cognition. The system handles recall. I handle insight. The cognitive load of maintaining context has been offloaded, so my working memory is free to do what it's actually good at: pattern recognition, analogy, creative leaps.

Here's a concrete example from this session. I said:

"I was thinking, if someone has 2 or more hosts running Helm on the same network, they could auto-discover each other."

That's one sentence. Within minutes, it became a complete federation architecture: mDNS discovery, mTLS authentication with auto-generated CAs, capability manifests over NATS, three-tier resource sharing, graceful degradation, and security considerations.

Then I said: "A home and small business would use multi-host differently."

That immediately produced deployment profiles that change trust defaults, operator models, compliance posture, and contextual recommendations in the setup wizard.

Then I said: "Since I already included Meshtastic, people could build off-grid comms for emergencies."

That produced an entire emergency resilience infrastructure section — UPS integration, NOAA weather alert receivers, emergency WiFi AP that auto-activates during blackouts, store-and-forward messaging, and community extension points.

Each idea took seconds to form. Each connected to the existing architecture correctly. The documentation was generated, structured, and placed in the right section of a 1,000-line document — without me having to remember what was already in it.

That's what cognitive augmentation feels like. Not "AI doing my thinking." Me thinking at a scale I couldn't reach alone.

The Friction That Used to Stop Me

I told Claude that these use cases were the type of stuff I would have avoided in the past due to multiple layers of friction.

That's the honest version. The longer version: I've always had ideas like these. I've been a systems administrator for 15 years. I've seen what infrastructure can do when it's designed well. I've seen what breaks when it isn't.

But the gap between seeing a possibility and articulating it as a structured plan used to be enormous. Not because I couldn't think it through — because the act of thinking it through, writing it down, connecting it to everything else, and maintaining that context across days and weeks was more cognitive labor than the idea was worth.

So ideas evaporated. Or they piled up as undifferentiated noise. Or I'd start documenting and lose the thread halfway through because my working memory hit capacity.

What's changed isn't my ability to see possibilities. It's that the cost of turning a thought into a structured, architecturally-connected plan entry has dropped to near zero. I say it, it gets analyzed, connected to existing systems, and written into the right place in the document.

The feedback loop — idea to structured plan in minutes — is what lets me keep going instead of hitting the wall where I used to stop.

Is This Unusual?

I asked Claude that too. Directly.

"Is my ability to come up with use cases for a platform I haven't even built yet uncommon?"

The answer was nuanced and I think worth sharing: the ideas themselves aren't unusual. A lot of people see potential use cases. What's less common is generating them and structuring them into a coherent architecture in real time, without losing the thread or letting scope creep into the build plan.

I think that's the augmentation talking. The ideas were always there. The tool made them capturable.

Cyborg Without the Hardware

I jokingly called it being a cyborg. Claude pointed out that the term is technically accurate — Manfred Clynes coined "cyborg" in 1960 to mean any system where human capabilities are extended by technology. No implants required. Just tight integration between the biological and the technological.

But "augmented" is the better word for what this actually is. Cyborg carries sci-fi baggage that distracts from the point.

The point is: I'm a 42-year-old sysadmin with ADHD and a 2-year degree, designing a multi-user platform with federation, emergency infrastructure, a community catalog ecosystem, and AI-powered accessibility features. The architecture document is structured, internally consistent, and growing. I'm doing it in research sessions that each build on the last, because the memory system means I never lose context between them.

Two months ago I didn't know what "context engineering" meant.

What This Means for Helm

Here's the thing I keep coming back to.

I'm not just building a platform. I'm someone who used cognitive augmentation tools to design something that would normally require a team. And the platform I'm designing? It does the same thing for its users.

A household member who uses voice commands because a screen is hard for them — that's augmentation. A small business owner who uses the eBay agent because they don't have time to research pricing and write listings — that's augmentation. A neighborhood that has communication during a blackout because someone set up a Meshtastic mesh with an emergency WiFi AP — that's augmentation.

Helm doesn't just run services. It extends what people can do.

I designed it that way because that's what it did for me first.

What I'm Actually Saying

I'm not claiming to be special. I'm claiming the tools have changed what's possible for people like me.

There are a lot of experienced infrastructure people, sysadmins, network engineers, ops folks — people with deep domain knowledge and good instincts — who have never built anything at this scale because the development barrier was too high. Not the ideas. Not the architecture. The code.

That barrier is falling. Fast.

If you're someone with 15 years of operational knowledge and you've never written a platform because you "can't code" — that constraint is dissolving. The knowledge you've built over a career is the hard part. The code is becoming the easy part.

The question isn't whether you can build something ambitious. It's whether you'll let yourself try.

This is part of an ongoing series about building agentic infrastructure as a non-developer. The previous posts cover how it started and the memory system that makes it work.

If you're building something similar — or thinking about it — I'd like to hear about it.

Claude Code Doesn't Know You've Been Gone — Here's the Fix

Ted Murray — Sat, 21 Mar 2026 10:49:22 +0000

I first noticed this in Claude Desktop. I'd have a conversation, step away for a few hours, come back and continue — sometimes on a slightly different angle, sometimes just picking up where I left off — and something about the responses felt off. Like Claude was treating it as one continuous thought when the gap had given me time to change direction.

My fix was an espanso trigger. I set up :cltime to expand to:

Current date/time: Saturday, March 21, 2026 at 09:00 AM. Use this to orient yourself.

Typed it at the start of a session or whenever I came back after a break. It worked. Claude recalibrated — less continuation, more reorientation. Problem solved, moved on.

Then I switched to Claude Code. I saw timestamps in the session context, assumed the problem was handled, and stopped using :cltime. Reasonable assumption.

It wasn't fully handled.

The timestamp Claude Code injects tells it what time the session started. It doesn't tell it how much time has passed since then. Come back after three hours and send a message — Claude sees the same session start time it's always seen. It doesn't know if you've been gone 30 seconds or half a day.

The context is the same. The right response isn't.

The hook

Claude Code has a UserPromptSubmit hook that fires before every message. I added a hook that injects the current date and time as a system message on every prompt — the same thing :cltime was doing manually, now automatic:

#!/bin/bash
# inject-timestamp.sh — UserPromptSubmit hook

TIMESTAMP=$(date '+%A, %B %-d, %Y at %I:%M %p %Z')

cat <<HOOKJSON
{"systemMessage": "Current date/time: ${TIMESTAMP}."}
HOOKJSON

exit 0

Wire it up in ~/.claude/settings.json:

"hooks": {
  "UserPromptSubmit": [
    {
      "matcher": "",
      "hooks": [
        {
          "type": "command",
          "command": "bash /path/to/inject-timestamp.sh",
          "timeout": 5
        }
      ]
    }
  ]
}

Now every message carries the current time. If your last message was at 9am and this one is at 2pm, Claude can see that gap and respond accordingly — reorienting rather than continuing mid-thought.

After the fact

Once I'd built it I went looking to see if anyone else had noticed the same problem. Found GitHub issue #32913 on the Claude Code repo, opened March 10th, still open:

"Claude Code has basically no temporal awareness beyond the current date. It can't detect prompts that are coming in quick series vs hours apart..."

That's exactly it. The fix is already in the hooks system. You don't need to wait for a native solution.

The UserPromptSubmit hook is underused in general — most examples you'll find are prompt validation or logging. Context injection is where it actually shines. The timestamp is the simplest case, but the same pattern works for anything you'd want Claude to know on every turn.

I Built an AI Memory System Because My Brain Needed It First

Ted Murray — Thu, 19 Mar 2026 11:05:42 +0000

On February 4th, 2026 — the day after my first Claude subscription — I was doing what I'd been doing since day one: asking Claude questions to see how it responded to things. Testing the edges. Seeing what it would say.

I noticed a memory feature in the settings. I asked about it honestly, the way I ask about most things I don't know enough to have an opinion on yet:

"If I enable the built-in memory feature, is it possible for you to make incorrect assumptions in future chats?"

Claude said yes. It explained how automatic memory extraction could misinterpret context, overgeneralize, fill in gaps incorrectly, and pollute conversations across domains. It was honest about its own limitations.

I didn't enable the memory feature.

Three days later, I built something better.

Why I Needed This

I have ADHD. Likely ASD too — the Asperger's pattern. I've managed it my whole career as a Windows sysadmin, but it shapes how I work in ways I'm very deliberate about.

One of those ways: I cannot hold all project context in biological memory simultaneously. This isn't a weakness I've overcome. It's a constraint I've designed around for 15 years. External memory systems aren't a nice-to-have for me. They're how I function.

Screenshots as memory aids. Meticulous notes. Status docs I update before leaving work so I can reconstruct context on Monday. Elaborate folder structures. Anything to offload cognitive burden.

When I started using Claude, I immediately hit a familiar friction. Every new conversation started from zero. I'd re-explain my server setup, my preferences, what we'd decided yesterday. That was annoying enough. But what actually bothered me more was something harder to name: the exploratory conversations — where I was just asking questions to see where they'd go, following threads, thinking out loud — those disappeared completely when the session ended. Not just the technical context. The texture of the conversation itself. Gone.

I got genuinely frustrated when Claude forgot something mid-session and made assumptions. That happened a lot early on.

I didn't frame it as "the statefulness problem in agentic AI systems." I thought: Claude forgets things. So do I. Let's build something so neither of us has to remember.

The Ship of Theseus Problem

Here's what I was watching happen in real time.

Long AI conversations have a finite context window — a working memory. As a conversation grows, early details get compressed into summaries. Summaries get summarized. Nuance fades. Eventually you're talking to a version of Claude that only vaguely remembers how the session started.

It's the Ship of Theseus paradox: if you replace every plank on a ship over time, is it still the same ship? In a long AI conversation, you're watching conversational entropy happen in real time.

Claude's built-in memory feature doesn't solve this — it just automates the extraction process while introducing its own failure modes. Invisible assumptions. Cross-conversation pollution. No audit trail. No version control. You don't control what gets remembered, you can't see what conclusions were drawn, and you can't roll back when it gets something wrong.

I wanted control. Version-controlled, explicit, auditable memory that I managed, not an AI extraction black box.

Day 7: What I Built

I created a GitHub repo called claude-prime-directive.

The idea was simple: a repository of context that Claude could reference. My infrastructure specs. My communication preferences. My cognitive style. My workflows. Version-controlled, always available, refreshable on demand.

I also documented myself in it. A cognitive-style.md file explaining how I think and what I need from a collaborator — ADHD patterns, working memory limitations, interest-driven focus, the cost of context switching. Not because I thought it would be technically interesting, but because I needed Claude to understand how I work.

The tier structure that emerged:

Tier 0 — GitHub repo (the prime directive): Stable, version-controlled, permanent. The foundation. What survives everything else. Updated rarely, and when it is, the git history shows exactly what changed and why.

Tier 1 — The main Claude chat: Strategic, evolving, allowed to age. Principles matter more than details here. Like episodic memory — you don't remember every meal, but you remember the principles of good nutrition.

Tier 2 — Project chats: Tactical, stable, domain-specific. Docker work doesn't bleed into PowerShell work. Each chat loads only its relevant context. Technical details have to persist precisely.

Tier 3 — Code: Canonical. The actual implementation is the truth, not a description of it. Version controlled, searchable, referenceable.

I wasn't reading about agent architectures. I was solving the same problem I'd solved for my own brain. The design patterns that work for ADHD working memory turn out to work for AI context management too:

Externalize and write it down → git-backed repos, structured notes
Summarize frequently → nightly distillation pipeline
Keep stable things stable → permanent distilled knowledge, version controlled
Make context easy to reload → semantic search, always-visible core context
Separate concerns → scoped agents, different memory per domain

I noticed what I'd built after building it.

The Homelab Analogy I Was Already Using

The tier structure wasn't abstract to me. I'd already built it in metal.

My homelab runs Unraid as the hot storage layer — fast, always-on, where active data lives. TrueNAS handles cold storage — backup, archive, slower but reliable. Right data, right place, right persistence.

The memory system is the same architecture:

Hot layer (session notes): Captured automatically, fast to write, short retention
Warm layer (working memory): Curated decisions and findings, medium-term
Cold layer (distilled knowledge): Permanent, git-backed, slow to update but always there

I wasn't inventing a new architecture. I was applying one I already understood to a new domain.

What Grew From There: Stigmergic Design

None of the subsequent pieces were planned. They emerged from using the previous piece.

February, week 2: Added the Memory MCP to Claude Desktop. Now Claude could write notes and retrieve them across conversations. Basic, but it closed the loop. The prime directive repo was now readable and writable.

Late February: Built out infrastructure docs in the repo — server specs, network topology, service inventory. Claude could answer questions about my homelab without being told first. That felt like the right direction.

Early March: Discovered memsearch and qmd — semantic search tools built for exactly this problem. The Memory MCP was a workaround; these were purpose-built. Started building the full stack.

March 8: The memory-sync agent came online. A headless Claude Code session running at 4 AM, reviewing session notes from the past week, applying a "would this matter in 3 months?" filter, and committing qualifying entries to permanent storage. The tier structure I'd sketched on day seven was now automated.

March 12: Switched fully to Claude Code. Created CLAUDE.md files for all project agents. The prime directive repo, which had been a Claude Desktop workaround, became the distilled knowledge tier it was always meant to be — now feeding multiple scoped agents instead of a single conversation.

March 13: Formalized the three working tiers — session, working, distilled — and rewrote the sync pipeline as a proper multi-step process with idempotency, conflict detection, and a health report.

March 14: Added the core context tier — a 40-line always-visible block injected at every session start. The thing that cannot scroll out of context no matter how long the session runs.

Also around this time, I hit a curation problem. Early on, I had a "Librarian" — a behavior defined directly in my project instructions that I'd manually invoke to keep the prime directive repo updated and the index current. It wasn't a proper skill; it was just a named role I could call on. That worked when I was building slowly, when there was time to pause and say okay, record what we just did.

Then CloudCLI arrived and my build velocity accelerated. I was deploying components faster than I was curating them. The Librarian couldn't keep up because I wasn't invoking it fast enough. The system wasn't failing to record things — I was failing to trigger the recording.

The fix was obvious in retrospect: stop relying on manual invocations. I automated it. The doc-health agent runs weekly (full scan, Claude Opus) and nightly (delta scan, Claude Sonnet), checking for drift between docs and reality, auto-updating index entries, and surfacing coverage gaps. A daily-touched-files.json tracker records what agents modify during a session; when a writer pass runs, it targets exactly those components. The system now curates itself.

This is the stigmergic design pattern in its purest form: I didn't plan for automation. I built a manual process, increased the pace until the manual process broke, and then automated the broken part.

March 17: Added a temporal knowledge graph (Neo4j + Graphiti MCP). Now the system doesn't just store facts — it stores relationships between entities and tracks how they change over time. "What connects to SWAG?" is a graph query, not a text search.

The pattern across all of this: use the system → hit friction → build the next piece. Each layer emerged from using the previous one. I didn't design this top-down. I noticed what was missing and filled the gap.

What the System Looks Like Now

Four tiers, three of which are fully automated:

Session tier — memsearch captures every conversation automatically. Semantic search makes past sessions findable. 30-day retention. No effort required.

Working tier — agents promote important decisions to structured markdown files with YAML frontmatter: creation date, 90-day expiry, tags, tier. Shared across agents where relevant, scoped to one domain where not.

Distilled tier — the nightly 4 AM pipeline. A headless Claude Code agent reviews working memory notes, applies the "would this matter in 3 months?" filter, commits qualifying entries to a git-backed repo. These are permanent. This is what the prime directive repo became.

Core context — a 40-line cap, always injected, never compresses away. User profile, active projects, key constraints, recent decisions. The sticky note on the monitor.

Plus the knowledge graph for relationships and topology — infrastructure that's hard to encode in flat files.

What I Found Out Later

Around the same time the memory-sync agent came online, I built a private web search stack — SearXNG as the search engine, firecrawl-simple for full-page extraction, and a local reranker to surface the most relevant results. The point was to give Claude real search capability without relying on external APIs or sending queries to Google.

Once that was running, I started asking Claude to find comparable projects — things other people had built that solved similar problems. That's how I found Letta, Mem0, and eventually the ICLR 2026 MemAgents workshop.

The infrastructure I built to make Claude more capable also made Claude better at finding the research that validated the approach. The tools gave back.

The ICLR 2026 MemAgents workshop was organized around this question: "What are the principled memory substrates for agentic systems?"

Researchers submitted papers. Panels were organized. Frameworks were compared.

Letta (formerly MemGPT) designed a system with core memory, archival memory, and recall memory — different persistence for different purposes. Mem0 built a bolt-on memory layer for any agent framework.

The architectures are similar to mine. The key difference: they designed theirs. I built mine by accident, from necessity, starting from "I asked Claude if its own memory could make mistakes and it said yes."

I don't think I solved this better than the research community. I think I solved it independently, driven by a cognitive pattern I've been compensating for my whole life, arriving at similar answers because the problem is the same. Working memory is limited. Context matters. Some things need to persist forever; others can fade.

The way you manage that for a human brain is, it turns out, a reasonable way to manage it for an AI agent.

The Thing That Stays With Me

I went back and read the original tier-definitions doc — written on day seven — while writing this article.

The structure it describes is essentially what I'm running today. The tools changed completely. The scale is different. But the principle — stable knowledge in one place, working knowledge in another, automated promotion between them, accept entropy where it doesn't matter and fight it where it does — that's in the day-seven file.

I didn't study this. I didn't read the research first. I just applied the same patterns I use to function at work to a new problem, and they worked.

At the bottom of that original file, I wrote a note:

"This architecture emerged organically from first-week AI usage and accommodates both technological constraints (context windows) and human limitations (working memory, context switching)."

I wrote that in week one. It's still true.

The Repository

The full memory system is part of homelab-agent, open source and documented as a reference architecture. The index.md is designed to be handed directly to Claude — point it at the file and ask for help mapping a path through the docs based on your setup.

homelab-agent on GitHub →

Previous: I Built an Agentic Infrastructure Platform in 42 Days. I'm a Windows Sysadmin.

Next: The permission problem — why an AI agent with filesystem write access needs a two-party enforcement model, and what 15 years of Active Directory taught me about building it.

I Built an Agentic Infrastructure Platform in 42 Days. I'm a Windows Sysadmin.

Ted Murray — Tue, 17 Mar 2026 19:47:10 +0000

I want to tell you something that still surprises me.

On February 3rd, 2026, I paid for my first AI subscription. I'm 42, a Windows systems administrator for 15+ years, and my GitHub history before this year is mostly simple bash and PowerShell scripts I wrote for myself. I have a 2-year associate degree — enough to clear the HR checkbox, not enough to impress anyone in a developer room.

42 days later, I had built what I now understand is called agentic infrastructure — a three-layer platform where Claude AI agents have persistent memory across sessions, coordinate with each other through structured handoffs, enforce their own filesystem permissions, and run nightly pipelines that distill knowledge from every session into a growing, searchable knowledge base.

I didn't plan this. I didn't follow a tutorial. I didn't know the term "context engineering" when I started. I just had a homelab, a problem, and a new tool that turned out to be more powerful than I realized.

This is the story of how that happened.

It Started With a Backup Script

My homelab is a small fleet of servers — an Unraid NAS running 77+ Docker containers, a TrueNAS backup server, a Debian test box, and a dedicated AI workstation I call claudebox. I manage it the way most homelabbers do: scripts, wikis, too many browser tabs, and institutional memory that lives entirely in my head.

The first thing I used Claude for was writing a backup script. Nothing revolutionary — I wanted something that would stop my Docker containers, rsync the appdata, restart them, and notify me if anything went wrong.

Claude wrote it in minutes. Tested it. Fixed an edge case I hadn't thought of. Done.

That result convinced me to push further. I turned to work — I was tired of looking up user information across multiple portals and wanted a single lookup toolkit. PowerShell is notoriously difficult for AI: the syntax is idiosyncratic, the documentation is scattered, and most models produce plausible-looking scripts that quietly do the wrong thing. Claude produced things that surprised me. Within three days I'd burned through 91% of my weekly Pro usage limit. I upgraded to Max and kept going.

That was supposed to be the end of it.

Around that same time I was watching YouTube videos about OpenClaw — an open-source AI assistant that had been getting attention in the homelab community. I understood the general idea: an AI agent you could run yourself, in your own environment. But I didn't actually know what an "agent" was in any technical sense. I just knew that what I saw in those videos wasn't quite what I wanted.

So I did something that turned out to be the key decision of this whole project: instead of installing OpenClaw, I asked Claude what I actually needed.

I described my setup. My goals. What I was frustrated with. We brainstormed. Claude walked me through the real options — what OpenClaw does, what LibreChat does, what MCP is, what Claude Desktop with the right integrations could become. By the end of that conversation I had a clearer picture of what I was trying to build than I would have gotten from any tutorial.

That brainstorm session became the blueprint. Not a copy of someone else's setup. Mine.

I started with Claude Desktop on a dedicated mini PC, remote access via Guacamole so I could reach it from anywhere, and a handful of MCP servers to give Claude real infrastructure access. That first working version was already more useful than anything I'd seen in those YouTube videos.

Then I thought: what if Claude could just... know my setup? Not just during this session — persistently. What if instead of copying context into every conversation, it had memory? What if it could query my monitoring dashboards, read my Docker configs, check on running services, and remember what we decided last week?

That question turned into six weeks of building.

What I Built

The project is called homelab-agent and it's open source on GitHub. Let me describe what it actually is, because "homelab assistant" undersells it.

Layer 1: Claude With Real Infrastructure Access

The foundation is Claude Desktop with MCP (Model Context Protocol) servers — structured tool integrations that give Claude direct, programmatic access to infrastructure instead of copy-paste workflows.

I connected Claude to Netdata (real-time system metrics), Grafana (dashboards and alerts), my Unraid and TrueNAS APIs, GitHub, and a custom HTTP server I wrote that handles shell commands, file reads, and process management on the host. I also added SearXNG — a self-hosted meta-search engine — so Claude can search the web without calling home to Google.

The result: Claude stops being a chatbot you explain things to and becomes an operator that already knows your setup.

Layer 2: A Self-Hosted AI Platform

On top of that foundation, I run a Docker-based service stack: LibreChat (a self-hosted multi-provider chat UI), Authelia for SSO, SWAG as the reverse proxy, and observability tooling (Grafana, InfluxDB, Loki). This gives household access to multiple AI providers through a single interface with one login — not just my Claude Desktop session.

Layer 3: The Multi-Agent Engine

This is where things got interesting.

I use Claude Code — Anthropic's AI coding tool — as a multi-agent platform. Different "agents" handle different domains: one for homelab operations, one for development, one for research, one for memory management. Each agent has its own context file (CLAUDE.md) that scopes what it knows and what it's allowed to do.

And each agent has memory.

The Memory Problem

Here's the thing about AI agents: they're stateless by default. Every new conversation starts from zero. You re-explain your setup, your preferences, what you decided last week. It's like having a brilliant contractor who forgets everything between visits.

I designed a four-tier memory system to solve this:

Session tier — Every conversation is automatically summarized to disk. Semantic search makes past sessions retrievable.

Working tier — Agents promote important decisions and findings to structured markdown files with YAML frontmatter: creation date, expiry (90 days), tags, tier classification.

Distilled tier — A headless Claude Code agent runs every night at 4 AM. It reviews working memory notes, applies a "would this matter in 3 months?" filter, and commits qualifying entries to a git-backed permanent knowledge base.

Core context — A 40-line always-visible block injected at every session start via a hook. User profile, active projects, key constraints, recent decisions. Never scrolls out of context.

Knowledge flows upward through tiers automatically. By Monday morning, Claude already knows about Friday's Docker stack change and Saturday's monitoring alert. There's no manual curation required.

I found out later that the ICLR 2026 MemAgents workshop — a machine learning research conference — was specifically organized around this problem: "principled memory substrates for agentic systems." Academics wrote papers about it. I had accidentally built a working implementation.

The Permission Problem

Once Claude has filesystem write access, you need to think carefully about what it can touch.

I built the Agent Workspace Protocol: declarative AGENT_WORKSPACE.md marker files at seven filesystem roots that define what access is allowed. Each agent also has a manifest declaring what it claims to need. An edit can only proceed if both the workspace marker and the agent manifest agree — stricter of the two wins.

An hourly background job (Python script, PM2 cron) validates all markers, auto-commits any tracking drift in git-backed directories, cross-references manifests against markers for conflicts, and emits structured security events to InfluxDB and Loki tagged with CIA-triad classifications (confidentiality, integrity, availability).

There's also rogue agent detection wired in — disabled while it calibrates a baseline from two weeks of normal operation, then it'll flag agents that suddenly start touching paths they've never touched before.

I'm familiar with identity-based access control from years of managing Office 365, Entra ID, and Azure — scoped roles, least-privilege policies, who can touch what. I applied that same thinking here. I haven't seen this filesystem-level two-party model in any comparable AI project. It emerged because I thought about what could go wrong.

The Self-Healing Documentation Problem

Infrastructure docs rot. Services get added, configs get changed, and the documentation lags behind until it's actively misleading.

I have a doc-health agent that runs weekly (full scan, Claude Opus) and nightly (delta scan, Claude Sonnet). It checks for drift between docs and reality, coverage gaps for new services, stale references to changed infrastructure, and leaked internal IPs or API keys. It auto-commits mechanical fixes (index entries) and surfaces everything else as a report.

The interesting part is the feedback loop: when agents modify infrastructure, they append to a daily-touched-files.json tracker. When a writer agent runs, it updates docs for changed components and triggers a targeted re-scan to confirm its own work. The nightly scan catches anything remaining and resets the tracker.

The system verifies its own corrections.

What This Actually Took

The honest answer: I don't write code in the traditional sense. Claude writes the code. I provide the vision, the infrastructure instincts, the architectural decisions, and the problem framing.

What I brought was 15 years of Windows systems administration — thinking about failure modes, permissions models, backup strategies, retention policies, operational health monitoring. Every design pattern in homelab-agent traces back to something I've seen break in a production environment.

The AI cost tracking pipeline (Claude Code session logs → Telegraf → InfluxDB → Grafana) exists because I've always metered infrastructure costs. The nightly backup script with stop-rsync-restart sequencing exists because I've seen live copies get corrupted. The two-party permission model exists because I've managed multi-admin environments where whoever touches something last owns it.

I didn't learn systems thinking in 42 days. I've been developing it for 15 years. I just found a medium that let it show.

What I Didn't Expect

I expected to build a homelab assistant. I didn't expect to be ahead of academic research on agent memory systems.

I expected to learn some Docker and maybe a little Python. I didn't expect to end up designing permission models and self-healing architectures that I can't find equivalents of in any comparable project.

I expected AI to be a productivity tool. I didn't expect it to be a creative medium — one where infrastructure instincts and systems thinking translate directly into novel technical designs.

Anthropic's own internal research, published in December 2025, found that 27% of Claude-assisted work consists of tasks that wouldn't have happened otherwise — work that's too exploratory, too niche, or too cost-prohibitive without AI assistance. Every component of homelab-agent is in that 27%. This project doesn't exist in any form without Claude.

The Repository

homelab-agent is open source and documented as a reference architecture. It's not a one-click installer — it's a documented system you can understand and adapt. The README explains all three layers. There's a getting-started guide with explicit stopping points if you want Layer 1 without the full stack. Component deep-dives cover every service with configuration details and design decisions.

If you want to jump in without reading everything, the repo has an index.md — a machine-readable navigation file designed to be handed directly to an AI assistant. Point Claude at it and say "help me figure out which components to adopt based on my setup." It'll ask about your hardware, your existing services, and your goals, then map a path through the docs. That's the intended on-ramp.

If you're a homelabber who wants Claude to actually know your setup, start there.
If you're an AI infrastructure builder looking at agent memory patterns or permission models, the architecture docs are the interesting part.
If you're hiring for agentic infrastructure roles and you made it this far — hi.

homelab-agent on GitHub →

This is the first post in a series. Next: the memory architecture in depth — how four tiers of knowledge accumulation work together and what it looks like to build memory for AI systems from first principles.