<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Coherence</title>
    <description>The latest articles on Forem by Coherence (@coherence_ai).</description>
    <link>https://forem.com/coherence_ai</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F12779%2F75c8e1bd-a31f-4834-a0f3-ceff84324f15.png</url>
      <title>Forem: Coherence</title>
      <link>https://forem.com/coherence_ai</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/coherence_ai"/>
    <language>en</language>
    <item>
      <title>AI-Native CRM vs. Legacy CRM: The Architecture Decision That Determines Your Sales Team's Future</title>
      <dc:creator>Keith Fawcett</dc:creator>
      <pubDate>Mon, 30 Mar 2026 17:46:14 +0000</pubDate>
      <link>https://forem.com/coherence_ai/ai-native-crm-vs-legacy-crm-the-architecture-decision-that-determines-your-sales-teams-future-2bm6</link>
      <guid>https://forem.com/coherence_ai/ai-native-crm-vs-legacy-crm-the-architecture-decision-that-determines-your-sales-teams-future-2bm6</guid>
      <description>&lt;p&gt;&lt;em&gt;A technical and strategic breakdown for founders, RevOps leaders, and developers building lean B2B sales operations.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;If you're evaluating CRM platforms in 2026, you've noticed that every vendor claims to be "AI-powered." Salesforce has Einstein. HubSpot has Breeze. Microsoft Dynamics has Copilot. The marketing language has converged completely — but the architectures behind the marketing language have not.&lt;/p&gt;

&lt;p&gt;This post makes a specific, testable argument: &lt;strong&gt;there is a fundamental architectural difference between AI-native CRM and AI-augmented CRM&lt;/strong&gt;, and that difference determines whether your AI capabilities compound over time or remain static features you pay a premium to access.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem with "AI-Powered" CRM Marketing
&lt;/h2&gt;

&lt;p&gt;The phrase "AI-powered" has become meaningless as a differentiator. Consider what it actually covers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A rule-based email suggestion engine: "AI-powered"&lt;/li&gt;
&lt;li&gt;A GPT-4 wrapper that drafts follow-up emails: "AI-powered"&lt;/li&gt;
&lt;li&gt;A fully autonomous agent that manages your entire prospecting workflow end-to-end: "AI-powered"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are not the same thing. They differ by orders of magnitude in architectural sophistication, data requirements, and business impact. The number that matters: sellers who effectively partner with AI tools are &lt;strong&gt;3.7× more likely to meet quota&lt;/strong&gt; than those who don't (Gartner, 2025). But that figure assumes genuinely effective AI partnership — not a chatbot inside your CRM.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Five Architectural Pillars of AI-Native CRM
&lt;/h2&gt;

&lt;p&gt;Based on analysis of current market leaders and purpose-built AI-native platforms, here are the five pillars that separate genuine AI-native architecture from AI-augmented legacy systems:&lt;/p&gt;

&lt;h3&gt;
  
  
  Pillar 1: Autonomous Data Enrichment
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Legacy CRM:&lt;/strong&gt; Data quality is a human responsibility. Records are as good as the last time someone manually updated them. 37% of CRM users report revenue loss due to poor data quality (Teamgate, 2025).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI-native CRM:&lt;/strong&gt; The system continuously enriches contact, company, and deal records without manual intervention — pulling from email signals, web activity, firmographic databases, and behavioral patterns. Data quality is an automated, continuous process, not a periodic cleanup project.&lt;/p&gt;
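
&lt;p&gt;For illustration, here is a minimal sketch of the merge step at the heart of continuous enrichment: an incoming signal replaces a stored field only when it is more confident or meaningfully fresher. All types, names, and the merge policy here are hypothetical, not any vendor's API.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical types for an enrichment-merge sketch (not a real product API).
type FieldValue = { value: string; confidence: number; observedAt: Date; source: string };
type ContactRecord = Record&lt;string, FieldValue&gt;;

// ~90 days, an illustrative staleness cutoff.
const STALE_AFTER_MS = 90 * 24 * 3600 * 1000;

// Merge an incoming signal into the record: prefer higher confidence, and let
// a much fresher observation override a stale one even at equal confidence.
function mergeSignal(record: ContactRecord, field: string, incoming: FieldValue): ContactRecord {
  const current = record[field];
  const shouldReplace =
    !current ||
    incoming.confidence &gt; current.confidence ||
    incoming.observedAt.getTime() - current.observedAt.getTime() &gt; STALE_AFTER_MS;
  return shouldReplace ? { ...record, [field]: incoming } : record;
}

// Usage: an email-signature scan silently keeps a contact's title current.
let contact: ContactRecord = {};
contact = mergeSignal(contact, "title", {
  value: "VP Engineering",
  confidence: 0.8,
  observedAt: new Date(),
  source: "email-signature",
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;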

&lt;p&gt;&lt;em&gt;Why it matters:&lt;/em&gt; AI is only as good as its data. Autonomous enrichment creates a self-improving data foundation that makes every downstream AI capability more accurate over time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pillar 2: Proactive Intelligence
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Legacy CRM:&lt;/strong&gt; The system surfaces what you ask for. You run a report, you get a report. Insight generation is human-initiated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI-native CRM:&lt;/strong&gt; The system surfaces what you need before you ask. Deals at risk. Contacts who haven't been touched. Prospects signaling buying intent. Follow-ups that have been missed.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why it matters:&lt;/em&gt; The value of intelligence is perishable. By the time a human reviews reports and identifies an at-risk deal, the opportunity to intervene may have passed. Proactive intelligence captures value in the moment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pillar 3: Natural Language Interaction
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Legacy CRM:&lt;/strong&gt; Interaction is through forms, dropdowns, and predefined fields. Adding a note, updating a deal stage, or running a report requires navigating UI designed around data entry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI-native CRM:&lt;/strong&gt; Users interact through natural language. "Show me all deals over $50k that haven't had activity in 14 days and draft a follow-up for each." One sentence. Executed.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why it matters:&lt;/em&gt; Natural language interaction removes the CRM adoption barrier. The tool works for users rather than requiring users to learn how to work it. This directly addresses the problem of administrative work consuming as much as 75% of sellers' time (Bain &amp;amp; Company, 2025).&lt;/p&gt;

&lt;h3&gt;
  
  
  Pillar 4: Agentic Workflow Execution
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Legacy CRM:&lt;/strong&gt; Automation is rule-based. When X happens, do Y. The rules are predefined by humans and cannot navigate ambiguity or exception cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI-native CRM:&lt;/strong&gt; AI agents pursue goals across multi-step workflows, making contextual decisions at each step. A prospecting agent doesn't just send a template — it researches the prospect, selects the most relevant value proposition, drafts personalized outreach, adjusts based on response patterns, and escalates to human review when confidence is low.&lt;/p&gt;
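
&lt;p&gt;A minimal sketch of the control loop that separates an agent from a rule: plan a step, act when confident, escalate when not. Every name here (planNextStep, escalateToHuman, the 0.7 floor) is illustrative, not a real product API.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Illustrative shapes for an agentic workflow loop; all names are hypothetical.
interface Step { action: string; confidence: number }
interface Agent {
  planNextStep(goal: string, history: Step[]): Promise&lt;Step | null&gt;;
  executeStep(step: Step): Promise&lt;void&gt;;
  escalateToHuman(step: Step): Promise&lt;void&gt;;
}

const CONFIDENCE_FLOOR = 0.7; // below this, a human reviews before anything runs

async function runWorkflow(agent: Agent, goal: string): Promise&lt;void&gt; {
  const history: Step[] = [];
  let step = await agent.planNextStep(goal, history);
  while (step) {
    if (step.confidence &lt; CONFIDENCE_FLOOR) {
      await agent.escalateToHuman(step); // ambiguous case: pause for review
    } else {
      await agent.executeStep(step); // confident case: act autonomously
    }
    history.push(step);
    step = await agent.planNextStep(goal, history); // replan with new context
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;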

&lt;p&gt;&lt;em&gt;Why it matters:&lt;/em&gt; Gartner (2025) predicts 60% of B2B sales workflows will be partly or fully automated through AI by 2028, up from 5% in 2023. The AI agent market is growing at a 45% CAGR (BCG, 2025). Agentic capacity is not a future capability — it is a current competitive differentiator.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pillar 5: Adaptive Learning
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Legacy CRM:&lt;/strong&gt; AI features operate on static or periodically updated models. The lead scoring model you configured 12 months ago may not reflect current market dynamics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI-native CRM:&lt;/strong&gt; The system continuously learns from outcomes. Every won deal, lost deal, responded email, and ignored follow-up updates the model. The AI becomes more accurate and more valuable as the business processes more data through it.&lt;/p&gt;
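
&lt;p&gt;Reduced to its simplest possible form, "every outcome updates the model" looks like the sketch below: an exponential moving average nudging a segment's win-rate estimate after each closed deal. Real systems retrain far richer models; this only illustrates why the estimate keeps improving with volume. The learning rate is an arbitrary choice.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Simplest "learn from every outcome" sketch: an exponential moving average
// over closed deals for one lead segment. The learning rate is illustrative.
const LEARNING_RATE = 0.05;

function updateWinRateEstimate(current: number, dealWon: boolean): number {
  const outcome = dealWon ? 1 : 0;
  // Each closed deal pulls the estimate slightly toward the observed outcome,
  // so more history means a tighter fit to this business's actual win rate.
  return current + LEARNING_RATE * (outcome - current);
}

// Usage: fold a quarter's outcomes into the running estimate.
let estimate = 0.5;
for (const won of [true, true, false, true]) {
  estimate = updateWinRateEstimate(estimate, won);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;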

&lt;p&gt;&lt;em&gt;Why it matters:&lt;/em&gt; This is the compounding returns dynamic. Year 1 ROI and Year 3 ROI from AI-native CRM are categorically different, because the Year 3 model has processed thousands more data points from your specific customer base. CRM implementations with effective AI achieve ROI of up to 245% (Teamgate, 2025), compared with a baseline return of $8.71 per $1 invested for standard CRM.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Data Quality Crisis in Legacy Systems
&lt;/h2&gt;

&lt;p&gt;Before any AI can operate effectively, the underlying data must be reliable. This is where legacy CRM systems face their most fundamental challenge.&lt;/p&gt;

&lt;p&gt;Legacy systems rely on humans to maintain data quality. Humans are inconsistent. They're incentivized to sell, not to update records. They forget. They enter data in inconsistent formats. The result: &lt;strong&gt;37% of CRM users report revenue loss due to poor data quality&lt;/strong&gt; (Teamgate, 2025).&lt;/p&gt;

&lt;p&gt;AI-native CRM addresses this at the architectural level:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bidirectional email sync captures every interaction automatically&lt;/li&gt;
&lt;li&gt;Contact enrichment pulls from multiple data sources continuously&lt;/li&gt;
&lt;li&gt;Behavioral signal tracking updates records based on actual activity&lt;/li&gt;
&lt;li&gt;Duplicate detection and data validation run continuously&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result is a self-healing data layer — one that enables AI capabilities to compound in accuracy rather than degrade under the weight of stale data.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Economics of AI-Native vs. Legacy for Lean Teams
&lt;/h2&gt;

&lt;p&gt;The cost structure of legacy CRM platforms was designed for enterprise economics. HubSpot's AI-enhanced Sales Hub starts at $90/user/month. Salesforce Einstein ranges from $75 to $300/user/month for AI features on top of base platform costs. For a 5-person team, you're looking at $450-$1,500/month minimum before professional services and implementation.&lt;/p&gt;

&lt;p&gt;More importantly: the AI capabilities in these platforms are often gated behind premium tiers. The AI features that matter most — autonomous workflow execution, predictive scoring, advanced personalization — are not available in entry-level tiers. You pay enterprise prices for enterprise-designed tools that weren't built for the way lean teams actually work.&lt;/p&gt;

&lt;p&gt;AI-native platforms built for lean teams invert this model. AI is not a premium add-on — it is the default operating layer, available at every tier. The economics reflect lean-team unit costs, not enterprise overhead. And the architecture reflects the reality of a 3-person team, not a 300-person sales org.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Technical Evaluation Checklist
&lt;/h2&gt;

&lt;p&gt;If you're evaluating CRM platforms and want to test whether a vendor's "AI-native" claims are real, ask these specific questions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data layer:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Does email sync work bidirectionally across all tiers, or only in premium tiers?&lt;/li&gt;
&lt;li&gt;[ ] How is contact enrichment performed — manual, batch-scheduled, or continuous?&lt;/li&gt;
&lt;li&gt;[ ] How does the system detect and resolve duplicate records?&lt;/li&gt;
&lt;li&gt;[ ] What is the data freshness guarantee for contact and company records?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;AI execution:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Can AI agents execute multi-step workflows without human initiation of each step?&lt;/li&gt;
&lt;li&gt;[ ] What is the scope of autonomous actions available — read-only, draft, or full execution?&lt;/li&gt;
&lt;li&gt;[ ] How does the system handle edge cases that fall outside predefined workflow paths?&lt;/li&gt;
&lt;li&gt;[ ] What human oversight controls govern autonomous AI execution?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Learning and improvement:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] How often is the lead scoring model updated, and what data drives those updates?&lt;/li&gt;
&lt;li&gt;[ ] Can you access model performance data to understand how AI recommendations are performing?&lt;/li&gt;
&lt;li&gt;[ ] Does the system learn from your specific business outcomes, or from a generic model?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Integration architecture:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Is email integration native (OAuth/IMAP sync) or third-party dependent?&lt;/li&gt;
&lt;li&gt;[ ] What is the latency between a real-world event and CRM record update?&lt;/li&gt;
&lt;li&gt;[ ] Can AI access full conversation context from email and calendar, or only logged activities?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cost structure:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] What is the total cost to enable all AI capabilities for a 5-person team?&lt;/li&gt;
&lt;li&gt;[ ] Which AI features require usage-based fees beyond the subscription?&lt;/li&gt;
&lt;li&gt;[ ] What professional services are typically required for AI feature activation?&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Compounding Advantage: Why Switching Costs Increase Over Time
&lt;/h2&gt;

&lt;p&gt;Here's the competitive dynamic that makes the AI-native vs. legacy decision more consequential than typical software choices: &lt;strong&gt;the value of AI-native CRM compounds over time, while legacy CRM value plateaus&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A legacy CRM's core value — storing and organizing customer data — does not meaningfully improve in Year 3 compared to Year 1. The data gets more complete, but the intelligence layer stays relatively static.&lt;/p&gt;

&lt;p&gt;An AI-native CRM's value in Year 3 is categorically higher than Year 1, because the AI has spent three years learning from your specific customer base — which deals convert, which messaging resonates, which follow-up timing works, which customer profiles expand. This institutional intelligence is non-transferable. If you switch platforms after three years, you restart the learning curve from scratch.&lt;/p&gt;

&lt;p&gt;The implication: every month you delay transitioning to AI-native CRM is not just a month of foregone efficiency — it is a month of compounding institutional intelligence that your AI-native competitors are building and you are not.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;The architectural decision you make about CRM is not a feature choice — it is a strategic choice about whether your revenue operations will compound in intelligence over time or remain static.&lt;/p&gt;

&lt;p&gt;The data is clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;81% of sales teams&lt;/strong&gt; are now using AI (Salesforce, 2024)&lt;/li&gt;
&lt;li&gt;AI users are &lt;strong&gt;3.7× more likely to hit quota&lt;/strong&gt; (Gartner, 2025)&lt;/li&gt;
&lt;li&gt;AI-native CRM delivers up to &lt;strong&gt;245% ROI&lt;/strong&gt; vs. a baseline return of $8.71 per $1 invested for standard CRM (Teamgate, 2025)&lt;/li&gt;
&lt;li&gt;The AI in CRM market will grow from &lt;strong&gt;$11B to $48.4B by 2033&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The question is not whether AI will run your revenue operations. It's whether you'll build that capability on a foundation designed for it, or retrofit it onto infrastructure designed for a different era.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built at &lt;a href="https://getcoherence.io" rel="noopener noreferrer"&gt;Coherence&lt;/a&gt; — the AI-native XRM for founders and lean B2B teams. 600+ integrations, true email sync, autonomous AI agents, starting free.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sources:&lt;/strong&gt; Bain &amp;amp; Company (2025), BCG (2025), Gartner (2025), HubSpot (2024), McKinsey (2024), Salesforce (2024/2025), SellersCommerce (2025), Sopro (2025), Teamgate (2025).&lt;/p&gt;

</description>
      <category>crm</category>
      <category>ai</category>
      <category>sales</category>
      <category>b2bsales</category>
    </item>
    <item>
      <title>Building a Developer-First CRM: Why Your API Should Be the Product</title>
      <dc:creator>Keith Fawcett</dc:creator>
      <pubDate>Sun, 29 Mar 2026 10:00:56 +0000</pubDate>
      <link>https://forem.com/coherence_ai/building-a-developer-first-crm-why-your-api-should-be-the-product-3dni</link>
      <guid>https://forem.com/coherence_ai/building-a-developer-first-crm-why-your-api-should-be-the-product-3dni</guid>
      <description>&lt;h1&gt;
  
  
  Building a Developer-First CRM: Why Your API Should Be the Product
&lt;/h1&gt;

&lt;p&gt;When most CRM vendors talk about "API access," they mean an afterthought—a way to export data or sync contacts on a schedule.&lt;/p&gt;

&lt;p&gt;That's not an API. That's a data pump.&lt;/p&gt;

&lt;p&gt;A developer-first CRM treats the API as a first-class citizen. Here's what that actually means for the 28.7 million developers worldwide building, buying, or recommending business software.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with "Enterprise API" Thinking
&lt;/h2&gt;

&lt;p&gt;Traditional CRM APIs are built for enterprise integration scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Long-running sync jobs that run nightly&lt;/li&gt;
&lt;li&gt;Bulk operations designed for millions of records&lt;/li&gt;
&lt;li&gt;Authentication that requires a PhD to configure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For solo developers and small teams? These APIs are like bringing a forklift to move a box.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Developer-First Actually Looks Like
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Webhooks as the default&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of "poll our API every 5 minutes to check for new leads," think: "we'll tell you when something happens."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"event"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"deal.stage_changed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-03-27T09:00:00Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"deal_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"deal_abc123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"from_stage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"qualified"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"to_stage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"proposal"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
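
&lt;p&gt;To make that concrete, here is a sketch of a receiver for such a webhook: a bare Node HTTP handler that verifies an HMAC signature before trusting the payload. The header name and signing scheme are assumptions; check your vendor's docs for the real contract.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { createHmac, timingSafeEqual } from "node:crypto";
import { createServer } from "node:http";

const SECRET = process.env.WEBHOOK_SECRET ?? "";

// Verify an HMAC-SHA256 hex signature (the header name below is an assumption).
function verifySignature(body: string, signature: string): boolean {
  const expected = Buffer.from(createHmac("sha256", SECRET).update(body).digest("hex"));
  const received = Buffer.from(signature);
  return expected.length === received.length &amp;&amp; timingSafeEqual(expected, received);
}

createServer((req, res) =&gt; {
  let body = "";
  req.on("data", (chunk) =&gt; (body += chunk));
  req.on("end", () =&gt; {
    const signature = String(req.headers["x-webhook-signature"] ?? "");
    if (!verifySignature(body, signature)) {
      res.writeHead(401).end();
      return;
    }
    const event = JSON.parse(body);
    if (event.event === "deal.stage_changed") {
      console.log(`Deal ${event.data.deal_id}: ${event.data.from_stage} -&gt; ${event.data.to_stage}`);
    }
    res.writeHead(200).end(); // acknowledge fast; queue real work asynchronously
  });
}).listen(3000);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;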



&lt;p&gt;&lt;strong&gt;2. REST that actually RESTs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CRUD operations that feel natural:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;POST /contacts&lt;/code&gt; to create&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GET /contacts/{id}&lt;/code&gt; to retrieve&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PATCH /contacts/{id}&lt;/code&gt; to update&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DELETE /contacts/{id}&lt;/code&gt; to... you get it&lt;/li&gt;
&lt;/ul&gt;
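
&lt;p&gt;For a sense of what that feels like in practice, here is the create-then-update flow with nothing but fetch. The base URL and token are hypothetical stand-ins:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical base URL and token; the point is that plain fetch suffices.
const BASE = "https://api.example-crm.com/v1";
const headers = {
  Authorization: `Bearer ${process.env.CRM_TOKEN}`,
  "Content-Type": "application/json",
};

// POST /contacts to create
const created = await fetch(`${BASE}/contacts`, {
  method: "POST",
  headers,
  body: JSON.stringify({ email: "ada@example.com", name: "Ada Lovelace" }),
}).then((r) =&gt; r.json());

// PATCH /contacts/{id} to update a single field
await fetch(`${BASE}/contacts/${created.id}`, {
  method: "PATCH",
  headers,
  body: JSON.stringify({ name: "Ada King" }),
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;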

&lt;p&gt;&lt;strong&gt;3. Rate limits that accommodate real automation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your automation runs every 5 minutes and hits rate limits, it's not an automation. It's a schedule you're fighting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Error messages that help&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;❌ "Rate limit exceeded"
✅ "Rate limit exceeded (1000/hour). Retry after 3 seconds or implement exponential backoff."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
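
&lt;p&gt;A helpful error message deserves a client that listens to it. This sketch retries on 429, honoring a Retry-After header when the server provides one and falling back to exponential backoff with jitter; the attempt cap and delays are arbitrary choices, not any particular CRM's contract.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Generic retry helper: honor Retry-After on 429, else back off exponentially.
async function fetchWithBackoff(url: string, init: RequestInit, maxAttempts = 5): Promise&lt;Response&gt; {
  for (let attempt = 0; attempt &lt; maxAttempts; attempt++) {
    const res = await fetch(url, init);
    if (res.status !== 429) return res;
    const retryAfterSec = Number(res.headers.get("retry-after"));
    const delayMs =
      Number.isFinite(retryAfterSec) &amp;&amp; retryAfterSec &gt; 0
        ? retryAfterSec * 1000 // the server told us exactly how long to wait
        : 2 ** attempt * 500 + Math.random() * 250; // otherwise: exponential + jitter
    await new Promise((resolve) =&gt; setTimeout(resolve, delayMs));
  }
  throw new Error(`Still rate limited after ${maxAttempts} attempts: ${url}`);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;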



&lt;h2&gt;
  
  
  The Solo Developer Advantage
&lt;/h2&gt;

&lt;p&gt;Here's the thing: solo developers have the same needs as enterprise teams—just without the bureaucracy.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sync with email (Gmail, Outlook)&lt;/li&gt;
&lt;li&gt;Calendar integration (Google Calendar, Cal.com)&lt;/li&gt;
&lt;li&gt;Form submissions (Typeform, JotForm)&lt;/li&gt;
&lt;li&gt;Payment tracking (Stripe)&lt;/li&gt;
&lt;li&gt;Support ticketing (Intercom, Zendesk)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The difference? Solo developers need this to work on day one, not after a 6-week implementation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Test: 30-Minute Integration
&lt;/h2&gt;

&lt;p&gt;The best test for a developer-first CRM:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Connect to your email tool&lt;/li&gt;
&lt;li&gt;Connect to your calendar&lt;/li&gt;
&lt;li&gt;Connect to one form tool&lt;/li&gt;
&lt;li&gt;Verify data flows automatically&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you can't do this in 30 minutes, it's not developer-first. It's developer-tolerant.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters for CRM Buyers
&lt;/h2&gt;

&lt;p&gt;If you're evaluating CRMs and the vendor can't explain their API in under 2 minutes, that's a red flag.&lt;/p&gt;

&lt;p&gt;If their docs require you to contact "sales engineering" to get API access, that's another red flag.&lt;/p&gt;

&lt;p&gt;A developer-first CRM should feel like using a well-designed API—not like requesting permission to use one.&lt;/p&gt;

&lt;p&gt;What developer-first features would you actually use? Reply in the comments.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is part of a series on building tools for solo founders. Follow for more insights on developer experience and CRM architecture.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How We Cut Our Deploy Time from 60 Minutes to 15 Minutes</title>
      <dc:creator>Keith Fawcett</dc:creator>
      <pubDate>Mon, 23 Mar 2026 09:42:20 +0000</pubDate>
      <link>https://forem.com/coherence_ai/how-we-cut-our-deploy-time-from-60-minutes-to-15-minutes-451p</link>
      <guid>https://forem.com/coherence_ai/how-we-cut-our-deploy-time-from-60-minutes-to-15-minutes-451p</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;A practical walkthrough of how we optimized our CI/CD pipeline for a 12-service monorepo on DigitalOcean App Platform — from pre-built Docker images to parallel CI jobs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Problem: An Hour Per Deploy
&lt;/h2&gt;

&lt;p&gt;Our platform is a TypeScript monorepo with 12 backend microservices, a React SPA, shared packages, and a PostgreSQL database. We deploy to DigitalOcean App Platform. Until last week, every push to &lt;code&gt;main&lt;/code&gt; took roughly 60 minutes to reach production.&lt;/p&gt;

&lt;p&gt;For a small team shipping fast, that is unacceptable. A one-hour deploy loop means you either batch changes (risky) or spend your afternoon watching progress bars. We decided to fix it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the Time Was Going
&lt;/h2&gt;

&lt;p&gt;Before optimizing, our pipeline had three sequential phases:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Phase&lt;/th&gt;
&lt;th&gt;Duration&lt;/th&gt;
&lt;th&gt;What It Did&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CI checks&lt;/td&gt;
&lt;td&gt;~20 min&lt;/td&gt;
&lt;td&gt;Install deps, build monorepo, lint, typecheck, test, security audit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;~15 min&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Duplicate&lt;/strong&gt; install + build, then RLS coverage check and dep audit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DO deploy&lt;/td&gt;
&lt;td&gt;~25-35 min&lt;/td&gt;
&lt;td&gt;DigitalOcean builds 12 Docker images from scratch, deploys everything&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The biggest offender was the DigitalOcean deploy phase. App Platform clones the repo and builds every Dockerfile independently — with no Docker layer cache between deployments. Twelve services, each running &lt;code&gt;pnpm install&lt;/code&gt; and &lt;code&gt;tsc&lt;/code&gt; from scratch, every single time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix 1: Pre-Build Docker Images in GitHub Actions
&lt;/h2&gt;

&lt;p&gt;This was the single biggest win. Instead of letting DigitalOcean build 12 Dockerfiles from scratch, we now:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Build all 12 service images &lt;strong&gt;in parallel&lt;/strong&gt; using a GitHub Actions matrix&lt;/li&gt;
&lt;li&gt;Push them to DigitalOcean Container Registry (DOCR)&lt;/li&gt;
&lt;li&gt;Update the live app spec to point at the pre-built images&lt;/li&gt;
&lt;li&gt;Tell DO to deploy — it just pulls the images, no building&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key enabler is &lt;code&gt;docker/build-push-action&lt;/code&gt; with GitHub Actions cache:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;build-images&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
  &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;fail-fast&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
    &lt;span class="na"&gt;matrix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;include&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;auth-service&lt;/span&gt;
          &lt;span class="na"&gt;dockerfile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;services/auth/Dockerfile&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;module-config-service&lt;/span&gt;
          &lt;span class="na"&gt;dockerfile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;services/module-config/Dockerfile&lt;/span&gt;
        &lt;span class="c1"&gt;# ... 10 more services&lt;/span&gt;
  &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker/setup-buildx-action@v3&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker/login-action@v3&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;registry&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;registry.digitalocean.com&lt;/span&gt;
        &lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.DIGITALOCEAN_ACCESS_TOKEN }}&lt;/span&gt;
        &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.DIGITALOCEAN_ACCESS_TOKEN }}&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker/build-push-action@v6&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
        &lt;span class="na"&gt;file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ matrix.dockerfile }}&lt;/span&gt;
        &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;registry.digitalocean.com/coherence/${{ matrix.name }}:${{ github.sha }}&lt;/span&gt;
        &lt;span class="na"&gt;cache-from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;type=gha,scope=${{ matrix.name }}&lt;/span&gt;
        &lt;span class="na"&gt;cache-to&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;type=gha,mode=max,scope=${{ matrix.name }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;type=gha&lt;/code&gt; cache stores Docker layers in GitHub Actions cache. First build is cold (~8 min per service), but subsequent builds with unchanged layers take &lt;strong&gt;2-3 minutes&lt;/strong&gt;. Since all 12 run in parallel, the wall clock time is just the slowest individual build.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result: DO deploy dropped from ~30 min to ~7 min.&lt;/strong&gt; It just pulls images and runs health checks — no building.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Spec Update Trick
&lt;/h3&gt;

&lt;p&gt;DigitalOcean App Platform specs can reference either a GitHub repo (source build) or a DOCR image (pre-built). We use &lt;code&gt;yq&lt;/code&gt; to dynamically update the live spec:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Fetch the live spec (preserves all encrypted secrets)&lt;/span&gt;
doctl apps spec get &lt;span class="nv"&gt;$APP_ID&lt;/span&gt; &lt;span class="nt"&gt;--format&lt;/span&gt; yaml &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /tmp/spec.yaml

&lt;span class="c"&gt;# Swap each service from GitHub source to DOCR image&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;svc &lt;span class="k"&gt;in &lt;/span&gt;auth-service module-config-service ...&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;yq &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s2"&gt;"del(.services[] | select(.name == &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="nv"&gt;$svc&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;).github)"&lt;/span&gt; /tmp/spec.yaml
  yq &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s2"&gt;"(.services[] | select(.name == &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="nv"&gt;$svc&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;)).image.registry_type = &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;DOCR&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; /tmp/spec.yaml
  yq &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s2"&gt;"(.services[] | select(.name == &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="nv"&gt;$svc&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;)).image.tag = &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="nv"&gt;$COMMIT_SHA&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; /tmp/spec.yaml
&lt;span class="k"&gt;done&lt;/span&gt;

&lt;span class="c"&gt;# Apply and deploy&lt;/span&gt;
doctl apps update &lt;span class="nv"&gt;$APP_ID&lt;/span&gt; &lt;span class="nt"&gt;--spec&lt;/span&gt; /tmp/spec.yaml
doctl apps create-deployment &lt;span class="nv"&gt;$APP_ID&lt;/span&gt; &lt;span class="nt"&gt;--wait&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is idempotent: on the first run it migrates each service from source build to image; on subsequent runs it just updates the tag. The &lt;code&gt;doctl apps spec get&lt;/code&gt; call preserves all secrets as encrypted values, so the update is safe.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix 2: Parallelize CI Checks
&lt;/h2&gt;

&lt;p&gt;Our CI was running lint → typecheck → test → security sequentially in one job. We split them into parallel jobs that share a cached workspace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ci-setup (install + build packages, ~3 min)
  ├── ci-lint       (~3 min)
  ├── ci-typecheck  (~3 min)
  ├── ci-test       (~6 min)
  └── ci-security   (~2 min)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;ci-setup&lt;/code&gt; job installs dependencies, builds shared packages, and saves the workspace to GitHub Actions cache. Each parallel job restores that cache and runs only its specific check.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key insight:&lt;/strong&gt; We only build shared packages (&lt;code&gt;packages/*&lt;/code&gt;) in CI setup — not apps or services. Lint and typecheck work on source files. Tests use Vitest, which transpiles on the fly. The actual production builds happen in the Docker image matrix. This cut our setup step from 11 minutes to under 3.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix 3: Remove Type-Checked ESLint Rules
&lt;/h2&gt;

&lt;p&gt;Our ESLint config used &lt;code&gt;parserOptions.project&lt;/code&gt; to enable type-checked rules like &lt;code&gt;no-floating-promises&lt;/code&gt; and &lt;code&gt;no-misused-promises&lt;/code&gt;. These rules require ESLint to spin up a full TypeScript type-checker — essentially running &lt;code&gt;tsc&lt;/code&gt; inside ESLint.&lt;/p&gt;

&lt;p&gt;Since we already run &lt;code&gt;tsc --noEmit&lt;/code&gt; as a separate typecheck step, these rules were making us pay for type-checking twice. Removing &lt;code&gt;parserOptions.project&lt;/code&gt; and the type-checked rules dropped lint from &lt;strong&gt;14 minutes to 3.5 minutes&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We also added &lt;code&gt;--cache&lt;/code&gt; to all ESLint scripts, which helps on local dev even though CI starts fresh each time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix 4: Run CI and Image Builds Concurrently
&lt;/h2&gt;

&lt;p&gt;CI checks and Docker image builds have no dependency on each other. We run them in parallel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ci-setup ──&amp;gt; parallel CI checks ──┐
                                   ├──&amp;gt; deploy-staging
build-images (12 parallel) ───────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Image builds typically finish before CI (cached builds take ~2 min), so by the time tests pass, images are already waiting in DOCR.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Other Small Wins
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Merged CI and Security jobs&lt;/strong&gt; — security was a separate job that duplicated the entire install + build. Folding it into the CI flow saved 10-15 min.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concurrency groups&lt;/strong&gt; with &lt;code&gt;cancel-in-progress: true&lt;/code&gt; — stale runs get cancelled when new commits are pushed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sentry shallow clone&lt;/strong&gt; — &lt;code&gt;fetch-depth: 100&lt;/code&gt; instead of full git history.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vite &lt;code&gt;reportCompressedSize: false&lt;/code&gt;&lt;/strong&gt; in CI — small but free.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Final Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CI wall clock&lt;/td&gt;
&lt;td&gt;~35 min&lt;/td&gt;
&lt;td&gt;~9 min&lt;/td&gt;
&lt;td&gt;74% faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DO deploy&lt;/td&gt;
&lt;td&gt;~30 min&lt;/td&gt;
&lt;td&gt;~7 min&lt;/td&gt;
&lt;td&gt;77% faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total pipeline&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~60 min&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~16 min&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;73% faster&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Docker builds (first run)&lt;/td&gt;
&lt;td&gt;N/A (DO built)&lt;/td&gt;
&lt;td&gt;~8 min&lt;/td&gt;
&lt;td&gt;Cached after first&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Docker builds (cached)&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;~2 min&lt;/td&gt;
&lt;td&gt;GHA layer cache&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pipeline now looks like this in GitHub Actions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─ ci-setup (3 min) ─┬─ lint (3 min)       ──┐
│                    ├─ typecheck (3 min)  ──┤
│                    ├─ test (6 min)       ──┼── deploy (7 min)
│                    └─ security (2 min)   ──┤
└─ build-images x12 (2 min, parallel) ───────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What We Would Do Next
&lt;/h2&gt;

&lt;p&gt;If we needed to go even faster:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Vitest sharding&lt;/strong&gt; — split the test suite across 2-3 parallel runners&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TypeScript incremental&lt;/strong&gt; — &lt;code&gt;tsc --noEmit --incremental&lt;/code&gt; writes a &lt;code&gt;.tsbuildinfo&lt;/code&gt; file that speeds up subsequent type checks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Selective image builds&lt;/strong&gt; — only rebuild services whose files actually changed using path filters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Convert web to a service&lt;/strong&gt; — the React SPA still builds from source on DO (static sites do not support pre-built images). Converting it to an nginx service would let us pre-build it too&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But at 16 minutes from push to production, we are happy to ship.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;If your platform builds Docker images, build them yourself.&lt;/strong&gt; Managed platforms rarely cache Docker layers between deployments. Building in CI with &lt;code&gt;docker/build-push-action&lt;/code&gt; and GHA cache is dramatically faster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Do not run type-checking twice.&lt;/strong&gt; If you have &lt;code&gt;tsc --noEmit&lt;/code&gt; in your pipeline, you do not also need ESLint type-checked rules. Pick one.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Build once, share via cache.&lt;/strong&gt; A monorepo CI that installs and builds in every parallel job wastes most of its time on redundant setup.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Measure before optimizing.&lt;/strong&gt; Our initial assumption was that tests were the bottleneck. The real bottleneck was DigitalOcean rebuilding 12 Docker images from scratch on every deploy.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>webdev</category>
      <category>devops</category>
      <category>performance</category>
      <category>cicd</category>
    </item>
    <item>
      <title>How We Built an AI Memory System That Actually Learns</title>
      <dc:creator>Keith Fawcett</dc:creator>
      <pubDate>Mon, 23 Mar 2026 09:31:02 +0000</pubDate>
      <link>https://forem.com/coherence_ai/how-we-built-an-ai-memory-system-that-actually-learns-1ndp</link>
      <guid>https://forem.com/coherence_ai/how-we-built-an-ai-memory-system-that-actually-learns-1ndp</guid>
      <description>&lt;h2&gt;
  
  
  How We Built an AI Memory System That Actually Learns
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;A deep dive into our multi-layered memory architecture — from vector embeddings to biographical peer cards — and what we learned from studying the best in the space.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Problem: AI Agents That Forget Everything
&lt;/h3&gt;

&lt;p&gt;Most AI agent systems treat every interaction as a blank slate. Your agent completes a task, generates insights, discovers user preferences — and then forgets all of it. The next task starts from zero. This is the fundamental gap between a tool and a teammate: teammates learn.&lt;/p&gt;

&lt;p&gt;We set out to build a memory system for our autonomous agents that would make them genuinely learn over time — not just retrieve facts, but reason about what they've learned, identify patterns, and build an evolving understanding of the people they work with.&lt;/p&gt;

&lt;p&gt;Along the way, we studied platforms like &lt;a href="https://honcho.dev" rel="noopener noreferrer"&gt;Honcho&lt;/a&gt;, &lt;a href="https://mem0.ai" rel="noopener noreferrer"&gt;mem0&lt;/a&gt;, and &lt;a href="https://polsia.com" rel="noopener noreferrer"&gt;Polsia&lt;/a&gt; — each taking a different approach to AI memory. Honcho's concept of "dreaming" (periodic reflection on accumulated knowledge) and Polsia's three-tier memory hierarchy both heavily influenced our architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Three Tiers of Memory: User, Account, Platform
&lt;/h3&gt;

&lt;p&gt;Before diving into implementation, it's worth understanding the scoping model. Inspired by Polsia's approach to memory hierarchy, we designed three distinct tiers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User-Scoped Memory&lt;/strong&gt; — Personal to each team member. When Nash (our autopilot agent) runs a personal autopilot cycle for you, it saves memories scoped to your UserId: your communication preferences, your pipeline priorities, the follow-ups you care about. These memories are invisible to other users' agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Account-Scoped Memory&lt;/strong&gt; — Shared across the entire workspace. Business facts, company strategy, market intelligence, customer patterns — things that benefit every team member and every agent in the organization. When Nash discovers that "enterprise deals close 40% faster when a technical champion is identified early," that's account-level knowledge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Platform-Scoped Memory (Global Learning)&lt;/strong&gt; — Anonymized patterns that propagate across all accounts. When agents across different organizations independently discover the same operational insight, it gets promoted to the global layer. This creates a flywheel — the more accounts use the system, the smarter every agent gets out of the box.&lt;/p&gt;

&lt;p&gt;This three-tier model ensures privacy (personal memories stay personal), organizational knowledge compounds (account memories get richer), and the entire platform gets smarter over time (global learnings benefit everyone).&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture: Five Layers Deep
&lt;/h3&gt;

&lt;p&gt;Our memory system operates across five distinct layers, each serving a different purpose:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: Atomic Memories&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The foundation. Every observation, pattern, preference, fact, and episode gets stored as an individual memory with a 1536-dimensional vector embedding via pgvector. Each memory has confidence scoring, occurrence tracking, and category taxonomy.&lt;/p&gt;

&lt;p&gt;Key design choices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hash-partitioned by tenant (16 partitions) for physical isolation&lt;/li&gt;
&lt;li&gt;HNSW indexes per partition for fast cosine similarity search&lt;/li&gt;
&lt;li&gt;Content encrypted at rest (AES-256-GCM) while embeddings remain unencrypted for search&lt;/li&gt;
&lt;li&gt;Confidence + Occurrences fields that increase when the same insight is reinforced&lt;/li&gt;
&lt;li&gt;Agent-scoped memories (AgentId column) so agents can build self-knowledge — Nash accumulates memories about what approaches worked, which tools were most effective, and what strategies produced the best outcomes, carrying that self-knowledge forward across cycles&lt;/li&gt;
&lt;/ul&gt;
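
&lt;p&gt;As a sketch of what one atomic memory might look like at the storage layer, assuming illustrative table and column names rather than our actual schema, with pgvector's cosine-distance operator doing retrieval:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Illustrative row shape for an atomic memory (not our actual schema).
interface MemoryRow {
  id: string;
  tenantId: string;         // drives the hash partition
  agentId: string | null;   // set when an agent stores self-knowledge
  category: "observation" | "pattern" | "preference" | "fact" | "episode";
  contentEncrypted: Buffer; // AES-256-GCM ciphertext
  embedding: number[];      // 1536 dimensions, unencrypted so it stays searchable
  confidence: number;
  occurrences: number;
}

// pgvector cosine-distance search; scoping to one tenant keeps the query
// inside a single partition and its HNSW index.
const SIMILAR_MEMORIES_SQL = `
  SELECT id, 1 - (embedding &lt;=&gt; $2::vector) AS similarity
  FROM memories
  WHERE tenant_id = $1
  ORDER BY embedding &lt;=&gt; $2::vector
  LIMIT 10`;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;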

&lt;p&gt;&lt;strong&gt;Layer 2: Automatic Extraction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After every agent task, a lightweight LLM pass (Claude Haiku or GPT-5-mini) analyzes the execution output and decides what's worth remembering. The extraction is selective — trivial tasks like simple record updates are pre-filtered, and the LLM can explicitly declare "nothing worth remembering" for routine operations.&lt;/p&gt;

&lt;p&gt;For cycles that execute multiple tasks, we batch the extraction into a single LLM call. Instead of N separate passes, one pass reviews all outputs together, looking for cross-task patterns and themes. This typically reduces extraction cost by 60-70% during busy cycles.&lt;/p&gt;
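
&lt;p&gt;The batching itself is mechanically simple; the value is in letting one prompt see every output at once. A sketch, with a hypothetical llm.complete call standing in for whichever model runs extraction:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical LLM client; the batched prompt shape is the point here.
declare const llm: { complete(prompt: string): Promise&lt;string&gt; };

async function extractMemoriesBatch(taskOutputs: string[]): Promise&lt;string[]&gt; {
  if (taskOutputs.length === 0) return [];
  // One pass over all outputs lets the model surface cross-task patterns
  // that N separate per-task passes would miss.
  const prompt = [
    "Review these task outputs from a single cycle. List only durable,",
    "non-trivial insights worth remembering. If nothing qualifies, reply NOTHING.",
    ...taskOutputs.map((out, i) =&gt; `--- Task ${i + 1} ---\n${out}`),
  ].join("\n");
  const reply = await llm.complete(prompt);
  return reply.trim() === "NOTHING" ? [] : reply.split("\n").filter(Boolean);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;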

&lt;p&gt;&lt;strong&gt;Layer 3: Semantic Deduplication &amp;amp; Search&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Our &lt;code&gt;rememberOrReinforce()&lt;/code&gt; method prevents memory bloat. Before creating a new memory, it searches for semantically similar existing memories (cosine similarity &amp;gt;= 0.9). If a match exists, it increments the occurrence count and boosts confidence instead of creating a duplicate. The system naturally converges — repeated patterns get stronger, not noisier.&lt;/p&gt;
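
&lt;p&gt;In sketch form, &lt;code&gt;rememberOrReinforce()&lt;/code&gt; reduces to a single branch. The store interface is hypothetical; the 0.9 similarity threshold is the real one described above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical store interface around the dedup behavior described above.
interface MemoryStore {
  findMostSimilar(embedding: number[]): Promise&lt;{ id: string; similarity: number } | null&gt;;
  reinforce(id: string): Promise&lt;void&gt;; // occurrences++ and a confidence boost
  create(content: string, embedding: number[]): Promise&lt;void&gt;;
}

const DEDUP_THRESHOLD = 0.9;

async function rememberOrReinforce(
  store: MemoryStore,
  content: string,
  embedding: number[],
): Promise&lt;void&gt; {
  const match = await store.findMostSimilar(embedding);
  if (match &amp;&amp; match.similarity &gt;= DEDUP_THRESHOLD) {
    await store.reinforce(match.id); // a repeated insight gets stronger, not duplicated
  } else {
    await store.create(content, embedding);
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;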

&lt;p&gt;On the retrieval side, not every query needs the same depth of processing. A &lt;code&gt;synthesize&lt;/code&gt; parameter controls this: default mode returns a fast, cheap vector similarity search, while synthesize mode adds an LLM pass that weaves results into a coherent narrative — more expensive, but dramatically better for complex queries spanning multiple memories.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 4: The Dreaming Job&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where the learning happens. Inspired by Honcho's concept of "sleep for a memory system," a background job runs periodically to reflect on accumulated knowledge:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consolidation:&lt;/strong&gt; Finds groups of semantically similar memories (0.85 threshold, looser than the 0.9 dedup threshold) and merges them via LLM into single, stronger statements. The originals get archived, and the consolidated memory inherits the sum of all occurrences.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Promotion:&lt;/strong&gt; Observations reinforced 3+ times get promoted to "pattern" type with a confidence boost. The system recognizes that what was first noticed once has turned out to be a reliable, recurring signal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deduction Specialist:&lt;/strong&gt; For entities with enough accumulated data (50+ new conclusions since last dream), an LLM analyzes existing memories to identify logical implications, contradictions, and knowledge updates. If a user changed roles, the deduction pass catches the conflicting facts and resolves them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Induction Specialist:&lt;/strong&gt; Scans for recurring behavioral patterns across multiple observations. A preference must have evidence from at least two independent data points before it's promoted — single observations stay as observations.&lt;/p&gt;

&lt;p&gt;Both specialist passes feed results into the global learning layer when patterns are strong enough (high confidence + multiple occurrences), completing the user → account → platform propagation loop.&lt;/p&gt;
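
&lt;p&gt;Tying those thresholds together, the dreaming pass is, at its core, a few guarded promotions. This sketch uses the numbers from this section (0.85 consolidation similarity, 3+ occurrences for promotion); the surrounding shapes and the size of the confidence boost are illustrative.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Thresholds from this section; shapes and boost size are illustrative.
const CONSOLIDATE_SIMILARITY = 0.85; // deliberately looser than the 0.9 dedup bar
const PROMOTE_AT_OCCURRENCES = 3;

interface Memory {
  id: string;
  type: "observation" | "pattern";
  occurrences: number;
  confidence: number;
}

// An observation reinforced often enough becomes a pattern with boosted confidence.
function promoteIfReinforced(memory: Memory): Memory {
  if (memory.type === "observation" &amp;&amp; memory.occurrences &gt;= PROMOTE_AT_OCCURRENCES) {
    return { ...memory, type: "pattern", confidence: Math.min(1, memory.confidence + 0.1) };
  }
  return memory;
}

// Consolidation merges groups whose pairwise similarity clears the looser bar.
function shouldConsolidate(similarity: number): boolean {
  return similarity &gt;= CONSOLIDATE_SIMILARITY;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;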

&lt;p&gt;&lt;strong&gt;Layer 5: Peer Cards (Biographical Profiles)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of requiring agents to search for user information every time, a stable biographical profile is &lt;strong&gt;always injected&lt;/strong&gt; into the agent's context. No similarity search needed, no query to get right — it's just there.&lt;/p&gt;

&lt;p&gt;Each peer card is a list of up to 40 atomic facts, prefixed by category:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Keith&lt;/span&gt;
&lt;span class="na"&gt;TIMEZONE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;US/Pacific&lt;/span&gt;
&lt;span class="na"&gt;PREFERENCE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Outcome-oriented execution without repeated approval-seeking&lt;/span&gt;
&lt;span class="na"&gt;PREFERENCE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Async communication over synchronous meetings&lt;/span&gt;
&lt;span class="na"&gt;INSTRUCTION&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Do not create documents unless explicitly asked&lt;/span&gt;
&lt;span class="na"&gt;TRAIT&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Delegates broadly to trusted agents&lt;/span&gt;
&lt;span class="na"&gt;INTEREST&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AI agent architectures, autonomous systems&lt;/span&gt;
&lt;span class="na"&gt;ROLE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Founder, technical product lead&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cards are auto-populated by the dreaming job's specialist passes, but can also be manually edited. They bootstrap automatically on first access with whatever data the system already knows.&lt;/p&gt;

&lt;p&gt;The key insight: &lt;strong&gt;separate stable biographical facts from transient memories&lt;/strong&gt;. A user's timezone doesn't change — it shouldn't be competing for relevance in a semantic search alongside yesterday's task results.&lt;/p&gt;

&lt;h3&gt;
  
  
  What We Learned
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Memory without reasoning is just storage.&lt;/strong&gt; Adding the dreaming job — especially the deduction and induction specialists — transformed our system from a recall engine into a learning engine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three-tier scoping is essential.&lt;/strong&gt; Without it, you either over-share (personal preferences leak to other users) or under-share (valuable business patterns stay locked in one user's context). Polsia's hierarchy model was the right starting point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Peer cards are deceptively powerful.&lt;/strong&gt; The simplest feature (a list of strings) had the biggest impact on agent quality. When your agent always knows your name, timezone, and communication preferences, every interaction starts from a foundation of understanding instead of a blank slate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Threshold-based processing beats interval-based.&lt;/strong&gt; Triggering dreams based on conclusion counts + cooldown periods is smarter than a fixed interval. It ensures the system only does expensive reasoning when there's enough new data to justify it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Batch extraction is a free optimization.&lt;/strong&gt; Combining multiple task outputs into a single extraction pass doesn't just save tokens — it actually produces better memories because the LLM can identify cross-task patterns that individual extraction would miss.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Self-Improvement Loop: Closing the Feedback Gap
&lt;/h3&gt;

&lt;p&gt;A memory system that stores and retrieves is necessary but not sufficient. The missing piece was &lt;strong&gt;self-evaluation&lt;/strong&gt; — the agent needs to know whether it's actually getting better.&lt;/p&gt;

&lt;p&gt;We recently drew inspiration from two very different approaches to this problem:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.minimax.io/news/minimax-m27-en" rel="noopener noreferrer"&gt;MiniMax's M2.7 model&lt;/a&gt; uses a recursive self-evolution cycle where the model analyzes its own failure trajectories, modifies its scaffold code, evaluates results, and decides whether to keep or revert changes — running autonomously for 100+ iterations. Their Dev Harness architecture treats persistent memory, hierarchical skills, and evaluation infrastructure as first-class components of the training loop itself, not afterthoughts. Their &lt;a href="https://www.minimax.io/news/forge-scalable-agent-rl-framework-and-algorithm" rel="noopener noreferrer"&gt;Forge RL framework&lt;/a&gt; goes further, making context management an explicit action the agent learns through reinforcement — the agent decides what to remember, compress, or discard, rather than following hand-coded rules.&lt;/p&gt;

&lt;p&gt;Andrej Karpathy's &lt;a href="https://github.com/karpathy/autoresearch" rel="noopener noreferrer"&gt;autoresearch&lt;/a&gt; takes the opposite extreme: radical simplicity. One metric (&lt;code&gt;val_bpb&lt;/code&gt;), one file to edit (&lt;code&gt;train.py&lt;/code&gt;), and a strict hill-climbing loop — try a change, measure, keep or revert, repeat indefinitely. The human's job is to write the agent's instructions (&lt;code&gt;program.md&lt;/code&gt;), not the code. The key insight: an append-only results log (&lt;code&gt;results.tsv&lt;/code&gt;) gives the agent complete self-awareness about what's been tried and what worked.&lt;/p&gt;

&lt;p&gt;We took the core pattern common to both — &lt;strong&gt;measure → inject feedback → self-correct&lt;/strong&gt; — and applied it to our autopilot cycles:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cycle Quality Scoring:&lt;/strong&gt; After each autopilot cycle completes, a lightweight LLM call scores the cycle's performance on a 1-10 scale across four criteria: task relevance, output quality, efficiency, and improvement over previous cycles. The score and actionable feedback are persisted and injected into the next cycle's planning context, so the agent can see its own trajectory and adjust.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structured Performance Log:&lt;/strong&gt; A rolling 7-day view of task execution metrics — success rates by task type, token efficiency, tool error rates, and user ratings — compiled from the AgentTask table and injected alongside the quality scores. When Nash sees that research tasks average 45K tokens but one outlier consumed 89K, it self-corrects on scope.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool Call Instrumentation:&lt;/strong&gt; Every tool invocation is wrapped with timing and success tracking. Per-tool breakdowns (call count, error count, average duration) flow through to the performance log, giving the agent visibility into its own operational efficiency.&lt;/p&gt;
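
&lt;p&gt;Of these pieces, tool call instrumentation is the simplest to picture: wrap each tool in a decorator that records duration and success. A generic sketch, where the metrics sink is a hypothetical interface rather than our actual pipeline:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Generic instrumentation wrapper; the metrics sink is a hypothetical interface.
type Tool&lt;A, R&gt; = (args: A) =&gt; Promise&lt;R&gt;;

interface MetricsSink {
  record(tool: string, durationMs: number, ok: boolean): void;
}

function instrument&lt;A, R&gt;(name: string, tool: Tool&lt;A, R&gt;, sink: MetricsSink): Tool&lt;A, R&gt; {
  return async (args: A) =&gt; {
    const start = Date.now();
    try {
      const result = await tool(args);
      sink.record(name, Date.now() - start, true);
      return result;
    } catch (err) {
      sink.record(name, Date.now() - start, false); // failures feed the performance log too
      throw err;
    }
  };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;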

&lt;p&gt;&lt;strong&gt;Agent-Driven Memory:&lt;/strong&gt; Rather than always running an expensive LLM extraction pass after every task, we let the agent decide what's worth remembering in real-time. When the agent explicitly calls &lt;code&gt;saveMemory&lt;/code&gt; during execution, the post-task extraction is skipped in favor of a lightweight episode log. Agent-chosen memories are higher signal and lower cost — the agent develops intuition for what matters through the natural feedback of "I remembered X and it was useful later."&lt;/p&gt;

&lt;p&gt;The result is a closed loop: tool stats feed the performance log, the performance log feeds the quality scorer, and the quality scorer feeds the next planning cycle. Each cycle is slightly better-informed than the last.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's Next
&lt;/h3&gt;

&lt;p&gt;We're exploring several directions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Compound tools:&lt;/strong&gt; Higher-level skills that chain common tool sequences (research → create record → write doc), reducing token waste on re-discovering obvious patterns — inspired by MiniMax's hierarchical skill architecture&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strategy documents:&lt;/strong&gt; Evolving the flat &lt;code&gt;FocusAreas&lt;/code&gt; text field into a structured, version-controlled strategy doc that the user and agent co-author over time — the &lt;code&gt;program.md&lt;/code&gt; pattern from autoresearch applied to business operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RL-informed context management:&lt;/strong&gt; Using quality scores and performance data to automatically tune which context blocks are most valuable, rather than injecting everything — taking a page from Forge's approach to learned context policies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We're excited about where this is heading. Every improvement to our memory system compounds — as more teams use Coherence, the agents get smarter, the global learning layer gets richer, and the feedback loops tighten. What Nash learns working with your business today makes it better for everyone tomorrow.&lt;/p&gt;

&lt;p&gt;Onward and upward.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
