<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Victor García</title>
    <description>The latest articles on Forem by Victor García (@micelclaw).</description>
    <link>https://forem.com/micelclaw</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3820897%2Fa121360b-8d01-406f-a889-9304625f2e47.png</url>
      <title>Forem: Victor García</title>
      <link>https://forem.com/micelclaw</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/micelclaw"/>
    <language>en</language>
    <item>
      <title>Designing an AI approval system: when should your agent ask for permission?</title>
      <dc:creator>Victor García</dc:creator>
      <pubDate>Tue, 07 Apr 2026 10:38:11 +0000</pubDate>
      <link>https://forem.com/micelclaw/designing-an-ai-approval-system-when-should-your-agent-ask-for-permission-k94</link>
      <guid>https://forem.com/micelclaw/designing-an-ai-approval-system-when-should-your-agent-ask-for-permission-k94</guid>
      <description>&lt;p&gt;An AI agent that can only read data is safe but useless. An AI agent that can send emails, delete files, format disks, and configure VPNs is useful but terrifying. The entire value of a personal AI operating system comes from the agent acting on your behalf — and the entire risk comes from the same thing.&lt;/p&gt;

&lt;p&gt;We needed a system that says "yes" fast to everyday operations and "are you sure?" to dangerous ones. Not a blanket confirmation on everything (that just trains the user to click "approve" without reading). Not unrestricted access either (one prompt injection away from &lt;code&gt;rm -rf /&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;This post is about the 4-level approval system we built, the dual-layer architecture (shell + API), and the surprisingly difficult design decision of where to draw the line between "just do it" and "ask me first."&lt;/p&gt;

&lt;h2&gt;The two attack surfaces&lt;/h2&gt;

&lt;p&gt;An AI agent in our system can cause damage in two completely different ways:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shell execution.&lt;/strong&gt; The agent uses the runtime's &lt;code&gt;exec&lt;/code&gt; tool to run commands on the host machine. This is raw power — &lt;code&gt;curl&lt;/code&gt;, &lt;code&gt;ls&lt;/code&gt;, &lt;code&gt;grep&lt;/code&gt;, but also &lt;code&gt;rm&lt;/code&gt;, &lt;code&gt;dd&lt;/code&gt;, &lt;code&gt;python3 -c 'import os; os.system("...")'&lt;/code&gt;. The attack surface is the entire operating system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API operations.&lt;/strong&gt; The agent calls our REST API via &lt;code&gt;curl&lt;/code&gt;. The &lt;code&gt;curl&lt;/code&gt; command itself is harmless — it's the endpoint that's dangerous. &lt;code&gt;POST /storage/pools&lt;/code&gt; creates a RAID array. &lt;code&gt;DELETE /files/:id&lt;/code&gt; removes a file. &lt;code&gt;POST /emails/send&lt;/code&gt; sends an email you can't unsend. The business logic is the attack surface.&lt;/p&gt;

&lt;p&gt;These need separate control mechanisms because they have different risk profiles and different mitigation strategies.&lt;/p&gt;

&lt;h2&gt;Layer 1: Shell control&lt;/h2&gt;

&lt;p&gt;By default, the agent can only execute a small set of safe binaries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Safe bins: curl, jq, cat, echo, date, wc, head, tail, grep
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. The agent talks to the outside world through &lt;code&gt;curl&lt;/code&gt; to our REST API. Everything else — file manipulation, package installation, network commands, scripting — is blocked at the runtime level.&lt;/p&gt;

&lt;p&gt;There's an "Unrestricted Shell Mode" toggle in Settings → Security. It's deliberately scary: the toggle is marked in red, requires the user's password (not just a click), and shows a warning explaining that this allows the agent to execute any command on the system.&lt;/p&gt;

&lt;p&gt;Even in unrestricted mode, destructive commands (&lt;code&gt;rm&lt;/code&gt;, &lt;code&gt;dd&lt;/code&gt;, &lt;code&gt;mkfs&lt;/code&gt;, &lt;code&gt;fdisk&lt;/code&gt;) always require per-operation confirmation from the user. Full freedom doesn't mean no guardrails — it means the agent can attempt anything, but the user decides on dangerous operations.&lt;/p&gt;

&lt;p&gt;The key design principle: most users never enable unrestricted mode. The agent does everything it needs through the API. Shell access is a power-user feature for people who know what they're doing.&lt;/p&gt;
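&lt;p&gt;As a rough sketch, the two shell gates (the allowlist, plus always-confirm destructive binaries) could look like this — the bin lists mirror the ones above, but &lt;code&gt;checkCommand&lt;/code&gt; is an illustrative name, not the actual runtime's API:&lt;/p&gt;

```typescript
// Illustrative sketch of the two shell gates. SAFE_BINS and
// DESTRUCTIVE_BINS mirror the lists in the post; checkCommand is a
// hypothetical name, not the actual runtime's API.
const SAFE_BINS = new Set([
  "curl", "jq", "cat", "echo", "date", "wc", "head", "tail", "grep",
]);

const DESTRUCTIVE_BINS = new Set(["rm", "dd", "mkfs", "fdisk"]);

type ShellVerdict = "allow" | "block" | "needs-confirmation";

function checkCommand(command: string, unrestricted: boolean): ShellVerdict {
  // First token is the binary; strip any path prefix like /bin/
  const first = command.trim().split(/\s+/)[0] ?? "";
  const bin = first.split("/").pop() ?? "";

  // Destructive binaries need per-operation confirmation
  // even in unrestricted mode
  if (DESTRUCTIVE_BINS.has(bin)) return "needs-confirmation";

  // Unrestricted mode: anything else may run
  if (unrestricted) return "allow";

  // Default mode: only the safe allowlist runs
  return SAFE_BINS.has(bin) ? "allow" : "block";
}
```

&lt;p&gt;Note that &lt;code&gt;rm&lt;/code&gt; comes back as &lt;code&gt;needs-confirmation&lt;/code&gt; even with the unrestricted flag set — full freedom doesn't mean no guardrails.&lt;/p&gt;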

&lt;h2&gt;Layer 2: Operation approvals&lt;/h2&gt;

&lt;p&gt;This is where it gets interesting. Every API operation has an approval level:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Name&lt;/th&gt;
&lt;th&gt;What happens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Auto&lt;/td&gt;
&lt;td&gt;Execute immediately, no record. Reads, searches, listings.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Logged&lt;/td&gt;
&lt;td&gt;Execute immediately, log to audit trail. Creates, updates, non-destructive writes.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Confirm&lt;/td&gt;
&lt;td&gt;Pause and ask the user "Are you sure?" before executing. Sends, deletes, infrastructure changes.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Secure&lt;/td&gt;
&lt;td&gt;Pause, ask for confirmation AND a numeric PIN. Format disk, delete volume, system reset.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The approval level is checked by a Fastify middleware (preHandler) that runs before the route handler. If the request comes from an agent token and the operation requires Level 2+, the middleware returns a &lt;code&gt;202 Accepted&lt;/code&gt; with an &lt;code&gt;approval_id&lt;/code&gt; instead of executing the operation. The agent then asks the user for confirmation through whatever channel they're chatting on — Telegram, WhatsApp, or the web dashboard.&lt;/p&gt;
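&lt;p&gt;A minimal sketch of the operation-to-level lookup the middleware relies on — the routes and defaults mirror the tables in this post, but the internals of &lt;code&gt;getOperationLevel&lt;/code&gt; here are hypothetical (the real version also consults per-user overrides from Settings):&lt;/p&gt;

```typescript
// Hypothetical sketch of the operation-to-level lookup used by the
// middleware. Routes and defaults mirror the tables in the post; the
// real implementation also consults per-user overrides from Settings.
type Level = 0 | 1 | 2 | 3;

interface Rule {
  method: string;
  pattern: RegExp;
  level: Level;
}

const DEFAULT_RULES: Rule[] = [
  { method: "POST", pattern: /^\/notes$/, level: 1 },
  { method: "POST", pattern: /^\/emails\/send$/, level: 2 },
  { method: "DELETE", pattern: /^\/files\/[^/]+$/, level: 2 },
  { method: "POST", pattern: /^\/storage\/volumes\/[^/]+\/delete$/, level: 3 },
];

function getOperationLevel(method: string, url: string): Level {
  const path = url.split("?")[0] ?? url; // ignore the query string
  const rule = DEFAULT_RULES.find(
    (r) => r.method === method && r.pattern.test(path),
  );
  // Reads and unlisted operations fall through to Level 0 (auto)
  return rule ? rule.level : 0;
}
```

&lt;p&gt;So &lt;code&gt;GET /notes?search=project&lt;/code&gt; resolves to Level 0 and runs immediately, while &lt;code&gt;POST /emails/send&lt;/code&gt; resolves to Level 2 and pauses for approval.&lt;/p&gt;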

&lt;h3&gt;How it looks in practice&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Level 0 — Auto (reading notes):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: "What notes do I have about the project?"
Agent: GET /notes?search=project&amp;amp;format=compact
→ Executes immediately. User sees results.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No friction. Reads are always auto.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 1 — Logged (creating a note):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: "Save a note about today's meeting decisions"
Agent: POST /notes {title: "Meeting decisions", content: "..."}
→ Executes immediately. Logged to audit trail.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Still no friction — the user asked for it. But the audit trail records that agent &lt;code&gt;francis&lt;/code&gt; created note &lt;code&gt;abc123&lt;/code&gt; at 14:32. If something goes wrong, there's accountability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 2 — Confirm (sending an email):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: "Send Ana the budget update"
Agent: POST /emails/send {to: "ana@techcorp.com", subject: "Budget Q3", ...}
→ 202 Accepted {approval_id: "req_xyz"}

Agent: "I'm about to send this email to Ana García:
        Subject: Budget Q3
        [Preview of the body]
        Should I send it?"

User: "Yes"
Agent: POST /approvals/req_xyz/approve
→ Email sent.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One confirmation step. The agent shows what it's about to do. The user says yes or no. This catches the common case where the agent misunderstood the intent — "I said Ana, not María" — without making every email a five-step process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 3 — Secure (deleting a volume):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: "Delete the old backup volume"
Agent: POST /storage/volumes/vol_old/delete
→ 202 Accepted {approval_id: "req_abc", level: 3}

Agent: "⚠️ This will permanently delete volume vol_old (2.3TB).
        This cannot be undone.
        Please confirm with your security PIN."

User: "4829"
Agent: POST /approvals/req_abc/approve {pin: "4829"}
→ Volume deleted.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two-factor: the user confirms AND enters their PIN. The PIN is a 4-6 digit numeric code set during initial setup, stored hashed with bcrypt. In messaging channels, the PIN message is deleted from chat history after verification (when the channel supports it).&lt;/p&gt;
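&lt;p&gt;To sketch the PIN check: the real system stores the PIN hashed with bcrypt; the self-contained example below substitutes Node's built-in &lt;code&gt;scrypt&lt;/code&gt; so it runs with no third-party dependencies — the shape of the verification is the same:&lt;/p&gt;

```typescript
import { randomBytes, scryptSync, timingSafeEqual } from "node:crypto";

// Sketch of hashed-PIN storage and verification. The real system uses
// bcrypt; this example substitutes Node's built-in scrypt so it runs
// with no third-party dependencies.
function hashPin(pin: string): string {
  const salt = randomBytes(16);
  const hash = scryptSync(pin, salt, 32);
  return salt.toString("hex") + ":" + hash.toString("hex");
}

function verifyPin(pin: string, stored: string): boolean {
  const [saltHex, hashHex] = stored.split(":");
  if (!saltHex || !hashHex) return false;
  const candidate = scryptSync(pin, Buffer.from(saltHex, "hex"), 32);
  // Constant-time comparison avoids leaking how many digits matched
  return timingSafeEqual(candidate, Buffer.from(hashHex, "hex"));
}
```

&lt;p&gt;The per-request salt means two users with the same PIN produce different hashes, and the constant-time compare keeps response timing from leaking partial matches.&lt;/p&gt;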

&lt;h2&gt;The approval lifecycle&lt;/h2&gt;

&lt;p&gt;Every Level 2+ operation creates an &lt;code&gt;approval_request&lt;/code&gt; record:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;approval_requests&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;              &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;         &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;requested_by&lt;/span&gt;    &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;-- 'agent:francis'&lt;/span&gt;
    &lt;span class="k"&gt;operation&lt;/span&gt;       &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;-- 'POST /emails/send'&lt;/span&gt;
    &lt;span class="k"&gt;level&lt;/span&gt;           &lt;span class="nb"&gt;SMALLINT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;-- 2 or 3&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt;         &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="c1"&gt;-- "Send email to Ana: Budget Q3"&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt;          &lt;span class="n"&gt;JSONB&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                    &lt;span class="c1"&gt;-- Request body snapshot&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt;          &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="s1"&gt;'pending'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;resolved_at&lt;/span&gt;     &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pin_verified&lt;/span&gt;    &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;expires_at&lt;/span&gt;      &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt;      &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The lifecycle is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6qbx1xet8glw8nzh3kwz.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6qbx1xet8glw8nzh3kwz.webp" alt="Timeouts, reminder, escalation, expiry"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent triggers operation
    ↓
Middleware: level &amp;gt;= 2? → Create approval_request (status: pending)
    ↓
Notify user via WebSocket (Dash) + Gateway RPC (Telegram/WhatsApp)
    ↓
User approves, rejects, or ignores
    ↓
├── Approved → Execute operation, status: approved
├── Rejected → Return error to agent, status: rejected
└── Timeout (30min) → Auto-expire, status: expired
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three timeout stages prevent approvals from hanging forever:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;What happens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Reminder&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;Send a reminder to the user&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Escalation&lt;/td&gt;
&lt;td&gt;15 minutes&lt;/td&gt;
&lt;td&gt;Notify the system owner (if different from the user)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Expiry&lt;/td&gt;
&lt;td&gt;30 minutes&lt;/td&gt;
&lt;td&gt;Auto-reject the request&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All three are configurable in Settings → Security → Approval Timeouts.&lt;/p&gt;
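&lt;p&gt;A sketch of how a periodic sweep might apply the three stages to a pending approval — the intervals mirror the defaults above, while the record shape and function names are illustrative:&lt;/p&gt;

```typescript
// Sketch of the three-stage timeout sweep over pending approvals.
// Intervals mirror the defaults in the table; the record shape and
// function names are illustrative.
interface PendingApproval {
  id: string;
  createdAt: number; // epoch milliseconds
  reminderSent: boolean;
  escalated: boolean;
  status: "pending" | "expired";
}

const MINUTE = 60_000;
const TIMEOUTS = {
  reminder: 5 * MINUTE,    // nudge the user
  escalation: 15 * MINUTE, // notify the system owner
  expiry: 30 * MINUTE,     // auto-reject
};

// Returns the actions a periodic sweep should take for one approval.
function sweep(approval: PendingApproval, now: number): string[] {
  const actions: string[] = [];
  if (approval.status !== "pending") return actions;
  const age = now - approval.createdAt;

  if (age >= TIMEOUTS.expiry) {
    approval.status = "expired";
    actions.push("expire");
  } else if (age >= TIMEOUTS.escalation && !approval.escalated) {
    approval.escalated = true;
    actions.push("escalate");
  } else if (age >= TIMEOUTS.reminder && !approval.reminderSent) {
    approval.reminderSent = true;
    actions.push("remind");
  }
  return actions;
}
```

&lt;p&gt;Each stage fires at most once, and an expired request falls out of the sweep entirely on subsequent passes.&lt;/p&gt;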

&lt;h2&gt;What's configurable and what isn't&lt;/h2&gt;

&lt;p&gt;The default levels are sensible but not everyone agrees on what's "dangerous." A power user who sends 50 emails a day wants email sending at Level 1 (logged, no confirmation). A cautious user wants it at Level 2 (confirm every one).&lt;/p&gt;

&lt;p&gt;Settings → Security shows a table of operations with dropdown selectors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Operation                    Default    Your Level
─────────────────────────────────────────────────
Create note                  Logged     [1 - Logged ▼]
Send email                   Confirm    [2 - Confirm ▼]
Delete files (permanent)     Confirm    [2 - Confirm ▼]
Enable VPN                   Confirm    [2 - Confirm ▼]
Add VPN peer                 Confirm    [2 - Confirm ▼]
Delete volume                Secure     [3 - Secure  ▼]
Format disk                  Secure     [3 - Secure  ▼]
System reset                 Secure     [3 - Secure  ▼]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two constraints prevent dangerous configurations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Level 3 operations can't go below Level 2.&lt;/strong&gt; You can downgrade "Delete volume" from Secure (3) to Confirm (2), but not to Logged (1) or Auto (0). Destructive, irreversible operations always require at least one confirmation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Level 0 operations can't be upgraded.&lt;/strong&gt; Read operations are always auto. Making &lt;code&gt;GET /notes&lt;/code&gt; require confirmation would break the system — the agent would need approval to answer "what notes do I have?"&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Changing approval levels is itself a Level 2 operation — the system asks for confirmation before letting you change the security settings.&lt;/p&gt;
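&lt;p&gt;The two constraints above can be sketched as a validation step on the settings endpoint (the operation keys and function name here are illustrative, not the real configuration schema):&lt;/p&gt;

```typescript
// Sketch of the two guardrails on level changes. DEFAULT_LEVELS stands
// in for the per-operation defaults shown in the settings table; the
// keys are illustrative, not the real configuration schema.
type ApprovalLevel = 0 | 1 | 2 | 3;

const DEFAULT_LEVELS: { [operation: string]: ApprovalLevel } = {
  "read-notes": 0,
  "create-note": 1,
  "send-email": 2,
  "delete-volume": 3,
};

function validateLevelChange(
  operation: string,
  requested: ApprovalLevel,
): ApprovalLevel {
  const def = DEFAULT_LEVELS[operation];
  if (def === undefined) throw new Error("unknown operation: " + operation);

  // Constraint 1: Level 3 defaults can be relaxed to 2, never below
  if (def === 3 && requested < 2) {
    throw new Error(operation + " always requires at least one confirmation");
  }
  // Constraint 2: Level 0 reads stay at auto — upgrading them would
  // make the agent ask permission to answer questions
  if (def === 0 && requested !== 0) {
    throw new Error(operation + " is a read and always runs automatically");
  }
  return requested;
}
```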

&lt;h2&gt;Why agents can't approve their own requests&lt;/h2&gt;

&lt;p&gt;This sounds obvious but it's the most important security decision in the system: &lt;strong&gt;an agent API key cannot approve an approval request.&lt;/strong&gt; Only JWT tokens (human login via Dash) or system tokens can call &lt;code&gt;POST /approvals/:id/approve&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If the agent could approve its own requests, a prompt injection attack could chain: trigger the operation → intercept the approval request → approve it. The human-in-the-loop is only meaningful if the human is the one doing the approving.&lt;/p&gt;

&lt;p&gt;In messaging channels (Telegram, WhatsApp), the approval flows through the agent — the user says "yes" in the chat, and the agent calls the approve endpoint. But the approve endpoint verifies that the approval was triggered by a user message, not by the agent itself. The &lt;code&gt;requested_by&lt;/code&gt; field records which agent requested the operation, and the same agent cannot resolve it.&lt;/p&gt;
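&lt;p&gt;A sketch of that guard on the approve endpoint — the field names are illustrative, but the rules follow the text: agent API keys can't resolve approvals, and nothing identified as the requesting agent may resolve its own request:&lt;/p&gt;

```typescript
// Sketch of the guard on POST /approvals/:id/approve. Field names are
// illustrative; the rules follow the post: agent API keys never
// approve, and the requesting agent can never resolve its own request.
interface ApproveContext {
  authType: "jwt" | "agent" | "system";
  actor: string; // e.g. "user:victor" or "agent:francis"
}

interface ApprovalRecord {
  id: string;
  requestedBy: string; // e.g. "agent:francis"
  status: "pending" | "approved" | "rejected" | "expired";
}

function canApprove(ctx: ApproveContext, approval: ApprovalRecord): boolean {
  // Expired or already-resolved requests can't be approved
  if (approval.status !== "pending") return false;
  // Agent API keys never resolve approvals
  if (ctx.authType === "agent") return false;
  // Belt and braces: even a privileged token acting as the requesting
  // agent can't resolve that agent's own request
  if (ctx.actor === approval.requestedBy) return false;
  return true;
}
```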

&lt;h2&gt;The middleware: 15 lines that matter&lt;/h2&gt;

&lt;p&gt;The approval check is a Fastify preHandler that runs on every route with an assigned level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Simplified — the real version handles edge cases&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;approvalMiddleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;reply&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Skip for human (JWT) requests — the Dash IS the confirmation&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;authType&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;jwt&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;level&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getOperationLevel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;method&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;level&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// Auto or Logged — proceed&lt;/span&gt;

  &lt;span class="c1"&gt;// Level 2 or 3: create approval request and pause&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;approval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;createApprovalRequest&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;requestedBy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`agent:&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;apiKeyName&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;method&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;level&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;buildSummary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// Notify user&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;notifyApproval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;approval&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Return 202 — the agent knows to ask the user&lt;/span&gt;
  &lt;span class="nx"&gt;reply&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;202&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;approval_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;approval&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;level&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;approval&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;summary&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;hint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;level&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; 
      &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Ask user to confirm with PIN&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; 
      &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Ask user to confirm&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;hint&lt;/code&gt; field in the 202 response tells the agent's skill what kind of confirmation to request. Level 2: ask for a yes/no. Level 3: ask for the PIN.&lt;/p&gt;

&lt;p&gt;Human requests from the Dash skip the middleware entirely. When you click "Send" in the email composer, you ARE the confirmation. The approval system only gates agent-initiated operations.&lt;/p&gt;

&lt;h2&gt;The skill: teaching the agent to ask&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;claw-approvals&lt;/code&gt; skill teaches the agent how to handle the approval flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Approval Protocol&lt;/span&gt;

When you receive a 202 response with an approval_id:
&lt;span class="p"&gt;1.&lt;/span&gt; Show the user what you're about to do (use the summary field)
&lt;span class="p"&gt;2.&lt;/span&gt; For Level 2: ask "Should I proceed?"
&lt;span class="p"&gt;3.&lt;/span&gt; For Level 3: ask "Please confirm with your security PIN"
&lt;span class="p"&gt;4.&lt;/span&gt; On "yes" or PIN: POST /approvals/{id}/approve (with pin if Level 3)
&lt;span class="p"&gt;5.&lt;/span&gt; On "no" or "cancel": POST /approvals/{id}/reject
&lt;span class="p"&gt;6.&lt;/span&gt; Never approve your own requests — always wait for user input
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The skill also handles &lt;code&gt;/pending&lt;/code&gt; (show pending approvals), &lt;code&gt;/history&lt;/code&gt; (show past approvals), and edge cases like expired requests.&lt;/p&gt;

&lt;h2&gt;What I learned about where to draw the line&lt;/h2&gt;

&lt;p&gt;The hardest part wasn't building the system — it was deciding which operations go at which level. Some were obvious:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Read anything → Level 0.&lt;/strong&gt; No debate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Format disk → Level 3.&lt;/strong&gt; No debate.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The middle ground is where every conversation happened:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Email sending: Level 1 or Level 2?&lt;/strong&gt; We went with Level 2 (Confirm) as the default. Email is irreversible — you can't unsend it. A misunderstood intent ("send Ana the budget" when you meant "draft Ana the budget") has real consequences. But we made it configurable because power users find the confirmation annoying.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creating a note: Level 0 or Level 1?&lt;/strong&gt; We went with Level 1 (Logged). Creating a note is harmless — but logging it means the audit trail shows everything the agent did. If the agent starts creating garbage notes due to a bug, the log tells you when it started. Zero friction, full accountability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VPN operations: Level 2.&lt;/strong&gt; Enabling or disabling VPN changes network topology. Adding a peer grants network access to a device. These aren't destructive (you can undo them) but they affect security posture. Confirmation is appropriate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deleting files: Level 2, not Level 3.&lt;/strong&gt; Files have soft delete with restore capability. "Delete" doesn't mean "gone forever" — it means "moved to trash." If deletion were permanent (bypassing soft delete), it would be Level 3.&lt;/p&gt;

&lt;p&gt;The general heuristic: &lt;strong&gt;reversible actions need less approval than irreversible ones.&lt;/strong&gt; Creating is Level 1 (you can delete it). Sending is Level 2 (you can't unsend it). Destroying is Level 3 (it's gone forever).&lt;/p&gt;
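
<p>That heuristic fits in a lookup table. The operation names below are the examples from this post, not the full policy:</p>

```python
# The reversibility heuristic as a lookup. Operation names are the
# examples discussed in this post, not the complete policy table.

APPROVAL_LEVELS = {
    "note.create": 1,   # reversible: you can delete it
    "file.delete": 2,   # soft delete: restorable from trash
    "email.send":  2,   # irreversible: you can't unsend it
    "vpn.toggle":  2,   # undoable, but changes security posture
    "disk.format": 3,   # destructive: gone forever
}

def required_level(operation):
    # Unknown operations default to Confirm: err on the side of asking
    return APPROVAL_LEVELS.get(operation, 2)
```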

&lt;h2&gt;The dual-layer interaction&lt;/h2&gt;

&lt;p&gt;The two layers (shell + API) work together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;User:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Check my disk space"&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;Agent:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;exec&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"df -h"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Layer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"df"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;is&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;safe&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;bins&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Execute&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;immediately&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;Agent:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;GET&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/storage/overview?format=compact&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Layer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Level&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(read)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Execute&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;immediately&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;Agent:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"You have 450GB free on /data. The backup volume is 78% full."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No approvals needed — both layers allowed the operations silently.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;User:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Set up a RAID1 mirror with the two new drives"&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;Agent:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;POST&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/storage/pools&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;name:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mirror1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;level:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"raid1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;devices:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"sda"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sdb"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Layer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Level&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;(Confirm)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;202&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Accepted&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;Agent:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"I'm about to create a RAID1 pool using sda and sdb. 
        This will erase all data on both drives. Confirm?"&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;User:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Yes"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Pool&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;created.&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And in unrestricted shell mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;User:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Install htop on the server"&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;Agent:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;exec&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"apt-get install htop"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Layer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Not&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;safe&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;bins&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;unrestricted&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;mode&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Ask&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;confirmation&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;Agent:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"I want to run: apt-get install htop
        This will install a package on your system. OK?"&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;User:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Go ahead"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Layer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Approved&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Execute&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both layers enforce independently. An operation that passes Layer 1 (shell) can still be blocked by Layer 2 (API). An operation that bypasses Layer 2 (because it's a direct shell command) is still caught by Layer 1.&lt;/p&gt;
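
<p>A minimal sketch of that independent enforcement (the safe-bins list and function names are illustrative, not the real configuration):</p>

```python
# Both layers enforce independently; an operation must satisfy every
# layer it touches. The safe-bins list and names are illustrative.

SAFE_BINS = {"df", "du", "ls", "cat", "free", "uptime"}

def shell_layer(command, unrestricted):
    binary = command.split()[0]
    if binary in SAFE_BINS:
        return "execute"
    # Outside safe bins: unrestricted mode asks, restricted mode blocks
    return "confirm" if unrestricted else "block"

def api_layer(level):
    # Levels 0 and 1 run silently; levels 2 and 3 pause for approval
    return "execute" if level in (0, 1) else "confirm"
```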

&lt;h2&gt;What I'd do differently&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;I'd add batch approvals from the start.&lt;/strong&gt; When the agent needs to send 15 emails from a mail merge, asking for 15 individual confirmations is absurd. A "batch approve" mechanism ("Send these 15 emails? Here's the list.") should have been in v1. It's now on the backlog.&lt;/p&gt;
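
<p>One possible shape for that batch request, sketched in Python (this is a design note for the backlog item, not shipped code):</p>

```python
# Not shipped: one possible shape for a batch approval, where a single
# confirmation covers a named list of operations.

batch = {
    "summary": "Send these 15 emails (mail merge)",
    "level": 2,
    "operations": [
        {"op": "email.send", "to": f"recipient{i}@example.com"}
        for i in range(15)
    ],
}

def resolve_batch(batch, user_confirmed):
    # All-or-nothing: one "yes" approves the whole list, one "no" rejects it
    return batch["operations"] if user_confirmed else []
```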

&lt;p&gt;&lt;strong&gt;I'd make the approval history more visible.&lt;/strong&gt; The audit trail exists (&lt;code&gt;GET /approvals/history&lt;/code&gt;), and there's an "Approvals History" section in the sidebar, but it should be more prominent. A weekly summary — "Your agent executed 340 operations this week: 280 auto, 55 logged, 5 confirmed" — would build trust and help users understand their agent's activity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I'd reconsider the messaging channel PIN flow.&lt;/strong&gt; Sending a PIN via Telegram is not ideal — it's visible in chat history even if the agent tries to delete the message. For Level 3 operations, maybe the system should redirect to the Dash where a secure input modal exists, rather than accepting PINs in plain text through messaging.&lt;/p&gt;

&lt;h2&gt;What's next: the sandbox&lt;/h2&gt;

&lt;p&gt;There's a missing piece between "restricted mode" (curl only) and "unrestricted mode" (everything, with confirmations). What if the agent could have a place to go completely wild — install packages, run services, break things — without any risk to your real system?&lt;/p&gt;

&lt;p&gt;We're planning &lt;strong&gt;sandbox environments&lt;/strong&gt;: Docker containers that the agent can create from Settings → Security. Not one — as many as you need. Each sandbox is an isolated machine with full root access. &lt;code&gt;apt install&lt;/code&gt;, &lt;code&gt;pip install&lt;/code&gt;, &lt;code&gt;systemctl&lt;/code&gt;, custom scripts, databases, web servers — anything goes. Zero approval, zero restrictions, zero risk to the host.&lt;/p&gt;

&lt;p&gt;The workflow we're designing around:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;Sandbox "deploy-test"     → experimenting with nginx + certbot config&lt;/span&gt;
&lt;span class="s"&gt;Sandbox "ml-pipeline"     → building a data processing pipeline with pandas&lt;/span&gt;
&lt;span class="s"&gt;Sandbox "new-skill"       → developing and testing a new agent skill&lt;/span&gt;

&lt;span class="na"&gt;Each one&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;independent, disposable, unrestricted.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent knows which sandbox it's working in. Commands routed to a sandbox go to that container. Commands on the real system go through normal approval layers. The two worlds don't touch — no shared volumes, no network bridge to internal services, no mount to &lt;code&gt;/data&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The key is the &lt;strong&gt;promote-to-production&lt;/strong&gt; flow. Once you've got something working in the sandbox — a configuration, a script, a service setup — you tell the agent "promote this to production." At that point, and only at that point, normal approval rules kick in. The agent needs Level 2 confirmation to copy files to the host, Level 2 to install a package on the real system, Level 3 to modify infrastructure. The sandbox is the drafting table; production is the real thing.&lt;/p&gt;

&lt;p&gt;If a sandbox gets trashed, nuke it and spawn a fresh one in seconds. The sandboxes are cheap — a base Debian image with internet access and a persistent volume for the workspace. Multiple sandboxes can run simultaneously for different experiments without interfering with each other or with the host.&lt;/p&gt;
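
<p>Since this isn't implemented yet, here is only a sketch of how spawning one might look, assuming plain Docker under the hood. The flags and the resource cap are assumptions, not the final design:</p>

```python
# Hypothetical: building the docker run invocation for a disposable
# sandbox. The flags are plausible Docker options, not the real design.

def sandbox_command(name, image="debian:bookworm"):
    return [
        "docker", "run", "-d",
        "--name", f"sandbox-{name}",
        "--network", "bridge",       # internet access, no link to host services
        "--memory", "4g",            # assumed per-sandbox resource cap
        "-v", f"sandbox-{name}-ws:/workspace",  # persistent workspace volume
        image, "sleep", "infinity",
    ]
```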

&lt;p&gt;This isn't implemented yet, and there are open design questions: should sandboxes have read-only access to the real API (for testing skills against real data)? Should there be resource limits per sandbox (CPU, RAM, disk)? What's the UX for promoting — file-by-file or snapshot the whole container? We'd love input from anyone who's built agent sandboxing — this is genuinely uncharted territory for personal AI systems.&lt;/p&gt;

&lt;h2&gt;The takeaway&lt;/h2&gt;

&lt;p&gt;An AI approval system needs two properties: it must be &lt;strong&gt;fast for safe operations&lt;/strong&gt; (no friction on reads, minimal friction on writes) and &lt;strong&gt;deliberate for dangerous ones&lt;/strong&gt; (explicit confirmation, PIN for irreversible actions, timeout for stale requests).&lt;/p&gt;

&lt;p&gt;Four levels handle this: auto (reads), logged (writes), confirm (irreversible), secure (destructive + PIN). Two layers: shell control for host-level commands, API control for business operations. One principle: agents cannot approve their own requests.&lt;/p&gt;

&lt;p&gt;The system processes ~95% of operations at Level 0 or 1 — invisible to the user. The 5% that require confirmation are the operations where a mistake actually matters: sending an email to the wrong person, deleting a volume, configuring network access. Those 5% are where trust is built or broken.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next up: building a process manager — how we manage Docker containers, systemd services, and Ollama models from a single dashboard with auto-start, health monitoring, and graceful shutdown.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>architecture</category>
      <category>selfhosted</category>
    </item>
    <item>
      <title>PII-aware routing: how to use cloud AI and keep your sensitive data local</title>
      <dc:creator>Victor García</dc:creator>
      <pubDate>Fri, 27 Mar 2026 14:23:21 +0000</pubDate>
      <link>https://forem.com/micelclaw/pii-aware-routing-how-to-use-cloud-ai-and-keep-your-sensitive-data-local-1m40</link>
      <guid>https://forem.com/micelclaw/pii-aware-routing-how-to-use-cloud-ai-and-keep-your-sensitive-data-local-1m40</guid>
      <description>&lt;p&gt;Here's the tension at the heart of every personal AI system: cloud models are better at reasoning, but your data is private. A self-hosted system can run everything locally — but a 2B parameter model on a mini-PC isn't going to draft a nuanced email response or analyze a complex financial situation the way a frontier model can.&lt;/p&gt;

&lt;p&gt;The naive solutions are both bad. "Send everything to the cloud" means your diary entries, medical notes, and financial records pass through someone else's servers. "Run everything locally" means accepting worse reasoning on tasks where model quality actually matters.&lt;/p&gt;

&lt;p&gt;We built a third option: a PII-aware routing layer that classifies every piece of data by sensitivity, routes it to the right model, and pseudonymizes anything sensitive that needs cloud reasoning power.&lt;/p&gt;

&lt;h2&gt;The classification: four levels, zero LLM calls&lt;/h2&gt;

&lt;p&gt;Every record in the system gets a sensitivity level. The classification is entirely deterministic — regex patterns and domain rules. No LLM in the classification loop, because sending data to an LLM to decide if the data is too sensitive to send to an LLM is a circular problem.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;What it means&lt;/th&gt;
&lt;th&gt;Example domains&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;low&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Public or low-risk data&lt;/td&gt;
&lt;td&gt;Events, bookmarks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;normal&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Common personal data&lt;/td&gt;
&lt;td&gt;Notes, contacts, files, diary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;high&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Sensitive personal data&lt;/td&gt;
&lt;td&gt;Emails, financial transactions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;critical&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Never leaves the device&lt;/td&gt;
&lt;td&gt;Medical/health data&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each domain has a default sensitivity level. Events are &lt;code&gt;low&lt;/code&gt; — knowing you have a meeting at 3pm isn't particularly sensitive. Emails are &lt;code&gt;high&lt;/code&gt; — they contain names, addresses, business context, and sometimes confidential information. Health entries are &lt;code&gt;critical&lt;/code&gt; — always local, no exceptions.&lt;/p&gt;

&lt;p&gt;But domains are just the baseline. The classifier also scans content for PII patterns that override the default:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Email addresses     → elevate to high minimum
Phone numbers       → elevate to high minimum
Credit card numbers → elevate to high minimum
IBAN codes          → elevate to high minimum
SSN / DNI / NIE     → elevate to high minimum
Medical terminology → elevate to critical
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A note titled "Grocery list" stays at &lt;code&gt;normal&lt;/code&gt;. A note containing "Dr. García prescribed 20mg omeprazole" gets elevated to &lt;code&gt;critical&lt;/code&gt; because the regex matched medical terminology. The content drives the classification, not just the domain.&lt;/p&gt;

&lt;p&gt;This is deliberately conservative. The regex patterns over-match — "Dr." triggers medical detection even if it's "Dr. Pepper." False positives mean data gets routed locally when it could have gone to the cloud. False negatives mean sensitive data leaks. Over-matching is the correct failure mode.&lt;/p&gt;
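
<p>A minimal sketch of that classifier, with heavily simplified stand-in patterns (the real rule set covers phones, card numbers, IBAN, and national IDs as well):</p>

```python
import re

# Simplified sketch of the deterministic classifier: domain defaults
# plus regex elevation. These two patterns stand in for a larger set.

LEVELS = ["low", "normal", "high", "critical"]
DOMAIN_DEFAULTS = {"events": "low", "notes": "normal",
                   "emails": "high", "health": "critical"}

ELEVATION_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.\w+"), "high"),                    # email address
    (re.compile(r"\bDr\.|\bprescribed\b|\bblood pressure\b", re.I),
     "critical"),                                                     # medical terms
]

def classify(domain, text):
    level = DOMAIN_DEFAULTS.get(domain, "normal")
    for pattern, floor in ELEVATION_PATTERNS:
        if pattern.search(text):
            # Elevate to at least the pattern's floor, never downgrade
            level = max(level, floor, key=LEVELS.index)
    return level
```

Note the deliberate over-match: "Dr. Pepper" trips the medical pattern just like "Dr. López" does, which is the intended failure mode.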

&lt;h2&gt;The routing decision&lt;/h2&gt;

&lt;p&gt;Once classified, the router decides where each piece of data goes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;low / normal  → Cloud LLM — best reasoning
high          → Cloud LLM WITH pseudonymization — good reasoning, protected data
critical      → Local model only (Ollama) — or skip if Ollama unavailable
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The decision isn't binary "local vs cloud." There's a middle path: pseudonymize the sensitive parts, send to the cloud for reasoning, and de-pseudonymize the response before the user sees it.&lt;/p&gt;

&lt;p&gt;This matters because most tasks involving sensitive data don't need the sensitive parts for reasoning. "Summarize this email thread" needs the content structure and topic — not the actual names and email addresses. "What's the sentiment of this diary entry?" needs the emotional content — not the specific people mentioned.&lt;/p&gt;
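
<p>As a function, the routing table is tiny. Destination names follow this post; the availability flag stands in for a real Ollama health check:</p>

```python
# The routing table as a function. Destinations follow the post; the
# availability flag is a stand-in for a real Ollama health check.

def route(sensitivity, ollama_available):
    if sensitivity in ("low", "normal"):
        return "cloud"
    if sensitivity == "high":
        return "cloud_pseudonymized"
    # critical: local only; skip rather than fall back to the cloud
    return "local" if ollama_available else "skip"
```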

&lt;h2&gt;The pseudonymizer&lt;/h2&gt;

&lt;p&gt;When a &lt;code&gt;high&lt;/code&gt; sensitivity record needs cloud processing, the pseudonymizer replaces PII with consistent tokens:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Entity type&lt;/th&gt;
&lt;th&gt;Pseudonym format&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Person&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Person_XXXX&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"Ana García" → &lt;code&gt;Person_A3F2&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Email&lt;/td&gt;
&lt;td&gt;&lt;code&gt;email_XXXX@example.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"&lt;a href="mailto:ana@techcorp.com"&gt;ana@techcorp.com&lt;/a&gt;" → &lt;code&gt;email_7B1C@example.com&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Phone&lt;/td&gt;
&lt;td&gt;&lt;code&gt;+00-XXXX-0000&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"+34 612 345 678" → &lt;code&gt;+00-E5D9-0000&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Organization&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Org_XXXX&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"TechCorp" → &lt;code&gt;Org_4C8A&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Location&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Location_XXXX&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"Calle Sagasta 15" → &lt;code&gt;Location_B2E1&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three properties make this work:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consistency.&lt;/strong&gt; The same value always produces the same pseudonym (SHA-256 of the original value, truncated). "Ana García" is always &lt;code&gt;Person_A3F2&lt;/code&gt;, in every record, in every session. This means the cloud model can reason about relationships: "Person_A3F2 sent 3 emails to Person_B7D1 about Org_4C8A" preserves the structure even though the names are hidden.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reversibility.&lt;/strong&gt; The &lt;code&gt;pseudonym_map&lt;/code&gt; table stores every mapping. When the cloud model's response comes back, the system replaces all pseudonyms with real values before storing or displaying the result. The user never sees &lt;code&gt;Person_A3F2&lt;/code&gt; — they see "Ana García."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Persistence.&lt;/strong&gt; Mappings survive across sessions. If "Ana García" was pseudonymized yesterday and appears again today, she gets the same pseudonym. This means the cloud model can build consistent context across multiple interactions without ever learning the real name.&lt;/p&gt;

&lt;p&gt;The detection itself uses regex — no LLM call. It's the same NER-lite approach as the sensitivity classifier: pattern matching for emails, phones, card numbers, and named entity patterns for persons and organizations. Not perfect, but fast and deterministic.&lt;/p&gt;
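
<p>A sketch of the three properties together, using an in-memory dict where the real system uses the <code>pseudonym_map</code> table (the exact truncation length and casing are assumptions):</p>

```python
import hashlib

# Sketch of the pseudonymizer: consistent tokens via truncated SHA-256
# and a reversible map (a dict here, the pseudonym_map table in the system).

pseudonym_map = {}  # token mapped back to the original value

def pseudonymize(value, entity_type="Person"):
    # Same value, same digest, same token, in every record and session
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:4].upper()
    token = f"{entity_type}_{digest}"
    pseudonym_map[token] = value
    return token

def depseudonymize(text):
    # Restore every known token before storing or displaying a response
    for token, original in pseudonym_map.items():
        text = text.replace(token, original)
    return text
```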

&lt;h2&gt;What this looks like in practice&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Scenario 1: Calendar event (low sensitivity)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;User asks: "What's on my calendar tomorrow?"&lt;/p&gt;

&lt;p&gt;The system fetches tomorrow's events. Events are &lt;code&gt;low&lt;/code&gt; sensitivity. The full data — titles, locations, attendees — goes straight to the cloud model. No pseudonymization needed. The model reasons about the schedule and responds with a natural summary.&lt;/p&gt;

&lt;p&gt;Cost: one cloud API call. Privacy: no sensitive data exposed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 2: Email analysis (high sensitivity)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;User asks: "Summarize the email thread about the partnership."&lt;/p&gt;

&lt;p&gt;The email thread is &lt;code&gt;high&lt;/code&gt; sensitivity (default for emails). Before sending to the cloud model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Original: "Ana García &amp;lt;ana@techcorp.com&amp;gt; wrote: Hi Paco, 
regarding the TechCorp partnership with NexaTech..."

Pseudonymized: "Person_A3F2 &amp;lt;email_7B1C@example.com&amp;gt; wrote: 
Hi Person_0D4E, regarding the Org_4C8A partnership with Org_9F3B..."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cloud model receives the pseudonymized version. It can still analyze the thread structure, identify that Person_A3F2 is negotiating with Person_0D4E, and summarize the key points. The reasoning quality is nearly identical — the model doesn't need to know the real names to understand the negotiation dynamics.&lt;/p&gt;

&lt;p&gt;The response comes back with pseudonyms:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Person_A3F2 proposed a revenue-sharing model with Org_9F3B. 
Person_0D4E agreed in principle but requested..."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The system de-pseudonymizes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Ana García proposed a revenue-sharing model with NexaTech. 
Paco agreed in principle but requested..."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cost: one cloud API call + ~2ms pseudonymization. Privacy: no real names or emails left the device.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 3: Health data (critical sensitivity)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;User asks: "What medications am I taking?"&lt;/p&gt;

&lt;p&gt;Health entries are &lt;code&gt;critical&lt;/code&gt;. They never leave the device, period. The system routes to the local Ollama model. If Ollama is unavailable, the query fails gracefully — it does NOT fall back to the cloud.&lt;/p&gt;

&lt;p&gt;The local model's response might be less polished, but for medical data retrieval, the task is usually simple: find the records and list them. A 2B model handles that fine.&lt;/p&gt;

&lt;p&gt;Cost: one local model call. Privacy: absolute — zero data exposure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 4: Note with accidental PII (elevated sensitivity)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;User creates a note: "Meeting with Dr. López about the lab results. Blood pressure 140/90."&lt;/p&gt;

&lt;p&gt;The note's domain is &lt;code&gt;normal&lt;/code&gt;, but the content contains medical terminology ("Dr.", "lab results", "blood pressure"). The classifier elevates it to &lt;code&gt;critical&lt;/code&gt;. From this point on, this note is treated like health data — local only.&lt;/p&gt;

&lt;p&gt;The user didn't tag it as medical. They didn't configure anything. The system caught it automatically. Conservative false positives are the design choice: if a note mentions "Dr. Pepper," it gets elevated too. That's a minor inconvenience (one note processed locally instead of in the cloud) with zero privacy risk.&lt;/p&gt;

&lt;h2&gt;The audit trail&lt;/h2&gt;

&lt;p&gt;Every routing decision is logged:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;What it records&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;domain&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Which data domain (notes, emails, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;record_id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Which specific record&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sensitivity&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Classified sensitivity level&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;action&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;What happened: &lt;code&gt;sent_pseudonymized&lt;/code&gt;, &lt;code&gt;sent_plain&lt;/code&gt;, &lt;code&gt;blocked&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;destination&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Where it went: &lt;code&gt;embeddings&lt;/code&gt;, &lt;code&gt;contextual_retrieval&lt;/code&gt;, &lt;code&gt;sleep_time&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;pii_routing_log&lt;/code&gt; table creates a complete audit of what data was exposed to which processing pipeline. If you ever need to answer "did my medical data ever touch a cloud service?", the answer is in the log.&lt;/p&gt;

&lt;p&gt;This is also how we verify the system works correctly. The log shows every routing decision. If a &lt;code&gt;critical&lt;/code&gt; record ever appears with action &lt;code&gt;sent_plain&lt;/code&gt; and a cloud destination, that's a bug — and the log caught it.&lt;/p&gt;
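
<p>That verification can be expressed as a query over the log rows. Field names follow the table above; which destinations count as cloud-bound is an assumption here:</p>

```python
# Invariant check over pii_routing_log rows. Field names follow the
# audit table; the set of cloud-bound destinations is assumed.

CLOUD_BOUND = {"embeddings", "contextual_retrieval"}

def find_violations(log_rows):
    # A critical record sent in plain text to a cloud pipeline is a bug
    return [
        row for row in log_rows
        if row["sensitivity"] == "critical"
        and row["action"] == "sent_plain"
        and row["destination"] in CLOUD_BOUND
    ]
```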

&lt;h2&gt;Where routing applies&lt;/h2&gt;

&lt;p&gt;PII-aware routing isn't just for chat interactions. It applies everywhere the system sends data to an LLM:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Embeddings.&lt;/strong&gt; When generating semantic embeddings, the text is classified before being sent to the embedding model. If you're using a cloud embedding API (future option), &lt;code&gt;high&lt;/code&gt; and &lt;code&gt;critical&lt;/code&gt; records get embedded locally via Ollama instead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Contextual retrieval.&lt;/strong&gt; The HyDE pipeline (generating hypothetical answers for better search) uses LLM calls. If the search touches sensitive domains, those calls route through the pseudonymizer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sleep-time compute.&lt;/strong&gt; The background intelligence jobs process records during idle periods. The enrichment job (re-extracting entities from hot records) respects the same routing rules — a &lt;code&gt;critical&lt;/code&gt; record only gets re-extracted if Ollama is available.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Entity extraction.&lt;/strong&gt; When the CRUD hooks pipeline sends text to the LLM for entity extraction, the same classification applies. A health-related note gets extracted locally.&lt;/p&gt;

&lt;p&gt;The routing layer sits between every LLM consumer in the system and the actual model call. It's middleware — invisible to the features that use it, enforced consistently everywhere.&lt;/p&gt;

&lt;h2&gt;The multi-agent dimension&lt;/h2&gt;

&lt;p&gt;With a multi-agent topology (7 agents in our system), PII routing gets another layer: agent scoping.&lt;/p&gt;

&lt;p&gt;Each agent has a scoped token that defines what domains it can access:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Francis (main)&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;     &lt;span class="s"&gt;notes:*, events:*, emails:*, contacts:*, diary:*&lt;/span&gt;
&lt;span class="na"&gt;Sentinel (infra)&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;   &lt;span class="s"&gt;storage:*, hal:*, network:*&lt;/span&gt;
&lt;span class="na"&gt;Dalí (creative)&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;    &lt;span class="s"&gt;photos:read, files:read&lt;/span&gt;
&lt;span class="na"&gt;Ledger (finance)&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;   &lt;span class="s"&gt;finance:*, crm:*&lt;/span&gt;
&lt;span class="na"&gt;Darwin (analytics)&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;graph:*, insights:*&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sentinel can't access emails. Dalí can't read the diary. This is enforced at the API level — even if a prompt injection tricks Dalí into requesting diary entries, the scoped token blocks it.&lt;/p&gt;

&lt;p&gt;Combined with PII routing, this creates defense in depth:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Agent scoping&lt;/strong&gt; prevents access to domains the agent shouldn't touch&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sensitivity classification&lt;/strong&gt; catches PII regardless of domain&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pseudonymization&lt;/strong&gt; protects data that needs cloud processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit logging&lt;/strong&gt; records everything for verification&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A prompt injection attack would need to bypass all four layers to exfiltrate sensitive data. The scoping blocks the API call. The classification catches the content. The pseudonymizer strips the PII. The audit log records the attempt.&lt;/p&gt;
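&lt;p&gt;Layer 1 is the cheapest to sketch. Assuming a scope grammar of &lt;code&gt;domain:*&lt;/code&gt; or &lt;code&gt;domain:action&lt;/code&gt; (as in the table above — these scope sets are hypothetical), the API-side check is a set lookup:&lt;/p&gt;

```python
# Hypothetical scope sets, mirroring two of the agents above.
SCOPES = {
    "sentinel": {"storage:*", "hal:*", "network:*"},
    "dali": {"photos:read", "files:read"},
}

def allowed(agent, domain, action):
    """A scoped token grants either a whole domain (domain:*)
    or a single action (e.g. domain:read)."""
    scopes = SCOPES.get(agent, set())
    return f"{domain}:*" in scopes or f"{domain}:{action}" in scopes
```

&lt;p&gt;Even if Dalí is tricked into asking, a request for &lt;code&gt;diary:read&lt;/code&gt; fails before any query runs.&lt;/p&gt;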

&lt;h2&gt;
  
  
  What we explicitly don't do
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;We don't use ML for classification.&lt;/strong&gt; A fine-tuned classifier could be more accurate than regex patterns. But it would need to see the data to classify it — which means sending potentially sensitive data to a model before deciding if it's safe to send to a model. Regex is dumber but has zero data exposure during classification.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We don't redact — we pseudonymize.&lt;/strong&gt; Redaction (&lt;code&gt;[REDACTED]&lt;/code&gt;) destroys information the cloud model needs for reasoning. Pseudonymization preserves structure ("Person_A sent an email to Person_B") while hiding identity. The cloud model can still reason about relationships, quantities, and patterns.&lt;/p&gt;
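&lt;p&gt;A trimmed sketch of the idea — the real pseudonymizer also handles entity types, persistence, and collisions, and the salt handling shown here is an assumption:&lt;/p&gt;

```python
import hashlib

class Pseudonymizer:
    """SHA-256-based pseudonyms: the same value always maps to the
    same alias, so cross-record structure survives in cloud prompts."""
    def __init__(self, salt="per-user-secret"):
        self.salt = salt
        self.reverse = {}  # alias -> original, kept local for de-aliasing

    def alias(self, value, kind="Person"):
        digest = hashlib.sha256((self.salt + value).encode()).hexdigest()[:8]
        name = f"{kind}_{digest}"
        self.reverse[name] = value
        return name
```

&lt;p&gt;"Ana García" becomes something like &lt;code&gt;Person_3f2a…&lt;/code&gt; — stable across calls, so "Person_A emailed Person_B" stays coherent to the cloud model.&lt;/p&gt;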

&lt;p&gt;&lt;strong&gt;We don't let the user override &lt;code&gt;critical&lt;/code&gt;.&lt;/strong&gt; You can change a record's sensitivity from &lt;code&gt;normal&lt;/code&gt; to &lt;code&gt;high&lt;/code&gt; manually. You cannot downgrade &lt;code&gt;critical&lt;/code&gt; to anything else. Health data stays local regardless of user preferences. This is a deliberate paternalistic choice — the privacy risk of accidentally exposing medical data outweighs the convenience of sending it to a better model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We don't route based on the LLM provider's privacy policy.&lt;/strong&gt; Whether provider A's privacy policy is better than provider B's is irrelevant. The system treats all cloud LLMs identically: external services that should never see &lt;code&gt;critical&lt;/code&gt; data and should only see &lt;code&gt;high&lt;/code&gt; data in pseudonymized form. Trust the math, not the terms of service.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;I'd add per-field sensitivity, not just per-record.&lt;/strong&gt; Currently, a contact record is &lt;code&gt;normal&lt;/code&gt; even though the &lt;code&gt;phones&lt;/code&gt; field is arguably more sensitive than the &lt;code&gt;company&lt;/code&gt; field. Per-field classification would let us pseudonymize just the phone number while sending the company name to the cloud. More precise, but also more complex — the pseudonymizer would need to understand JSON field structure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I'd build a sensitivity dashboard earlier.&lt;/strong&gt; The &lt;code&gt;pii_routing_log&lt;/code&gt; has all the data, but there's no UI for it yet. A dashboard showing "this week: 450 records processed, 380 sent plain, 65 pseudonymized, 5 blocked" would build user trust and make the privacy system tangible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I'd make the regex patterns configurable.&lt;/strong&gt; Different users have different sensitivity needs. A doctor might want "aspirin" to be flagged as medical. A pharmacist might want it treated as normal. The current patterns are one-size-fits-all, which means they're too aggressive for some users and not aggressive enough for others.&lt;/p&gt;

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;The privacy problem in personal AI isn't "local vs cloud." It's "which data goes where." Most of your data is fine to send to a cloud model — your calendar events and bookmark titles aren't secrets. Some data needs protection but can still benefit from cloud reasoning — pseudonymize it and send the structure without the identity. A small fraction of data should never leave your device — route it locally and accept the quality trade-off.&lt;/p&gt;

&lt;p&gt;Three components: a regex classifier (zero LLM calls, deterministic), a SHA-256 pseudonymizer (consistent, reversible, persistent), and a routing table (domain defaults + content elevation). No ML, no fine-tuning, no privacy policy trust assumptions.&lt;/p&gt;

&lt;p&gt;The system processes your medical notes with a 2B local model and your calendar queries with a cloud model. It knows the difference because a regex matched "blood pressure" — not because it asked an AI what's sensitive.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next up: designing an AI approval system — when should your agent ask for permission, and how do you build a confirmation workflow that doesn't slow everything down?&lt;/em&gt;&lt;/p&gt;

</description>
      <category>privacy</category>
      <category>ai</category>
      <category>architecture</category>
      <category>selfhosted</category>
    </item>
    <item>
      <title>Sleep-time compute for personal data: what your AI should do while you sleep</title>
      <dc:creator>Victor García</dc:creator>
      <pubDate>Thu, 26 Mar 2026 12:01:52 +0000</pubDate>
      <link>https://forem.com/micelclaw/sleep-time-compute-for-personal-data-what-your-ai-should-do-while-you-sleep-13fj</link>
      <guid>https://forem.com/micelclaw/sleep-time-compute-for-personal-data-what-your-ai-should-do-while-you-sleep-13fj</guid>
      <description>&lt;p&gt;Your personal AI assistant sits idle most of the day. You send it a message, it responds, then it waits. For hours. Maybe all night. The compute is there — the model is loaded, the database is running, the server is warm. But nothing happens until you type the next message.&lt;/p&gt;

&lt;p&gt;That's test-time compute: work done when the user asks for it. Letta's research (arXiv:2504.13171) showed that shifting processing to idle periods — sleep-time compute — achieves 5× fewer tokens at test time and 15% more correct answers. But their implementation only processes conversation memory. Nobody had applied it to structured personal data.&lt;/p&gt;

&lt;p&gt;We did.&lt;/p&gt;

&lt;h2&gt;
  
  
  The idea
&lt;/h2&gt;

&lt;p&gt;Instead of the agent doing all its thinking when you ask a question, it does most of the thinking in the background — during idle periods when you're not using the system. When you finally ask "what's going on with Project Tempest?", the answer is already half-assembled.&lt;/p&gt;

&lt;p&gt;The system maintains four background jobs that run every 30 minutes, but only when you're idle:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Priority&lt;/th&gt;
&lt;th&gt;Job&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Enrich connections&lt;/td&gt;
&lt;td&gt;Finds hot records with few graph links, re-runs entity extraction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Generate summary&lt;/td&gt;
&lt;td&gt;Compiles a weekly overview from the change log&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Detect patterns&lt;/td&gt;
&lt;td&gt;Discovers entities that co-occur but aren't linked&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Update preferences&lt;/td&gt;
&lt;td&gt;Learns behavioral patterns from your data&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each job consumes tokens from a configurable budget (default: 5,000 tokens per execution). When the budget runs out, lower-priority jobs get skipped. This means enriching connections (the most impactful job) always runs, while preference learning (the least time-sensitive) gets skipped first if resources are tight.&lt;/p&gt;

&lt;h2&gt;
  
  
  The trigger: idle detection
&lt;/h2&gt;

&lt;p&gt;The engine only runs when you're not using the system. If you're actively writing notes or reading emails, the background jobs wait. This matters for two reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Resource contention.&lt;/strong&gt; The LLM (whether local via Ollama or remote via API) is a shared resource. Background jobs competing with user requests for model access would add latency to the interactions you actually care about.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Relevance.&lt;/strong&gt; Sleep-time processing works on data that has settled. Running entity extraction on a note you're still editing wastes tokens — the note will change again in 30 seconds. Waiting until you're idle means processing stable data.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The idle detector is simple: if no user activity (API requests from the dashboard, agent messages, WebSocket heartbeats) has occurred in the last N minutes, the user is idle. The scheduler checks this condition before executing each run.&lt;/p&gt;
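&lt;p&gt;A sketch of that detector — the 15-minute threshold is an assumption, and the real activity sources feed it via whatever hook the server exposes:&lt;/p&gt;

```python
import time

class IdleDetector:
    """Idle = no activity (API request, agent message, WebSocket
    heartbeat) for `threshold` seconds."""
    def __init__(self, threshold=900.0):
        self.threshold = threshold
        self.last_activity = time.monotonic()

    def touch(self):
        """Called by every activity source on each event."""
        self.last_activity = time.monotonic()

    def is_idle(self, now=None):
        now = time.monotonic() if now is None else now
        return (now - self.last_activity) >= self.threshold
```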

&lt;h2&gt;
  
  
  Job 1: Enrich connections
&lt;/h2&gt;

&lt;p&gt;The highest-priority job. It finds records that are "hot" (recently accessed, heat score &amp;gt; 0.3) but poorly connected in the knowledge graph (fewer than 3 entity links). These are records you care about but that the system doesn't fully understand yet.&lt;/p&gt;
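&lt;p&gt;In SQLite terms (table and column names simplified from the real schema), the candidate selection looks like this:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE record_heat  (record_id TEXT, heat REAL);
CREATE TABLE entity_links (record_id TEXT, entity_id TEXT);
-- note-1: hot but thin; note-2: cold; note-3: hot and well-connected
INSERT INTO record_heat VALUES ('note-1', 0.5), ('note-2', 0.1), ('note-3', 0.6);
INSERT INTO entity_links VALUES
  ('note-1','e1'), ('note-1','e2'),
  ('note-3','e1'), ('note-3','e2'), ('note-3','e3'), ('note-3','e4');
""")

# Hot (heat > 0.3) but under-connected (fewer than 3 entity links).
candidates = conn.execute("""
    SELECT h.record_id
    FROM record_heat h
    LEFT JOIN entity_links l ON l.record_id = h.record_id
    WHERE h.heat > 0.3
    GROUP BY h.record_id
    HAVING COUNT(l.entity_id) < 3
""").fetchall()
```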

&lt;p&gt;Here's what the query returned on a real run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Optimal Fuse Burn Rate Calculations v3    heat: 0.5  links: 2
Recipe: Rodney's Smoked Eyebrows Marinade heat: 0.5  links: 2
Banned Substances List (and Why Rodney…)  heat: 0.5  links: 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three notes flagged. Each has a warm heat score (recently accessed) but only 2 entity links, while the average for this user is 5+. The initial extraction caught the obvious entities, but a second pass might find connections to people, projects, or locations that were mentioned implicitly.&lt;/p&gt;


&lt;p&gt;The engine re-enqueues them to the async extraction pipeline at &lt;code&gt;priority: low&lt;/code&gt; — they won't compete with real-time user actions. When the extraction worker picks them up, it sends the full note content to the LLM for a more thorough entity pass than the initial CRUD hook provides.&lt;/p&gt;

&lt;p&gt;Why prioritize this job? Because the knowledge graph is the foundation of search ranking, the digest engine, and the agent's contextual awareness. A poorly connected hot record means the system is blind to something you're actively working on. Enriching it improves everything downstream.&lt;/p&gt;

&lt;p&gt;Cost on this run: &lt;strong&gt;600 tokens&lt;/strong&gt; (3 records × ~200 tokens each). Execution time: &lt;strong&gt;16ms&lt;/strong&gt; (just the enqueue — the actual extraction happens later).&lt;/p&gt;

&lt;h2&gt;
  
  
  Job 2: Generate summary
&lt;/h2&gt;

&lt;p&gt;Aggregates a week of change log activity and the most active entities from the knowledge graph into a single pre-computed insight.&lt;/p&gt;

&lt;p&gt;On this run, the change log query returned:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;files&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;        &lt;span class="s"&gt;201 inserts, 5 updates, 152 deletes&lt;/span&gt;
&lt;span class="na"&gt;rss&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;          &lt;span class="s"&gt;141 inserts&lt;/span&gt;
&lt;span class="na"&gt;emails&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;       &lt;span class="s"&gt;61 inserts, 32 updates, 7 deletes&lt;/span&gt;
&lt;span class="na"&gt;notes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;        &lt;span class="s"&gt;2 updates&lt;/span&gt;
&lt;span class="na"&gt;kanban_cards&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1 update&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the top graph entities by recent activity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Madrid              (location)      28 mentions
micelclaw           (organization)  12 mentions
Meta Platforms, Inc.(organization)   7 mentions
Instagram           (location)       6 mentions
Victoria            (person)         1 mention
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things jump out. First: 201 file inserts and 152 file deletes in one week — that's a bulk operation or a sync cycle, not manual activity. The summary captures this so the agent can mention it if asked "what happened this week?" without scanning 600+ change log rows at query time.&lt;/p&gt;

&lt;p&gt;Second: "Instagram" classified as a location is an entity extraction error — the kind of noise the enrichment job (Job 1) and the merge candidates system are designed to catch over time.&lt;/p&gt;

&lt;p&gt;The summary gets stored as a &lt;code&gt;weekly_summary&lt;/code&gt; insight with a 7-day TTL. No LLM call needed — it's pure SQL aggregation.&lt;/p&gt;

&lt;p&gt;Cost on this run: &lt;strong&gt;50 tokens&lt;/strong&gt; (fixed). Execution time: &lt;strong&gt;14ms&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Job 3: Detect patterns
&lt;/h2&gt;

&lt;p&gt;The most interesting job. It self-joins &lt;code&gt;entity_links&lt;/code&gt; to find pairs of entities that co-occur in 3 or more records but have no direct link between them — latent connections nobody made explicit.&lt;/p&gt;
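&lt;p&gt;The shape of that query, sketched against SQLite (the real schema and edge table differ):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity_links (record_id TEXT, entity_id TEXT);
CREATE TABLE entity_edges (a TEXT, b TEXT);  -- direct graph edges
-- rodney and dolores share 3 records; rodney and benny only 1
INSERT INTO entity_links VALUES
  ('r1','rodney'), ('r1','dolores'),
  ('r2','rodney'), ('r2','dolores'),
  ('r3','rodney'), ('r3','dolores'),
  ('r4','rodney'), ('r4','benny');
""")

# Pairs co-occurring in >= 3 records with no direct edge between them.
pairs = conn.execute("""
    SELECT a.entity_id, b.entity_id, COUNT(*) AS shared
    FROM entity_links a
    JOIN entity_links b
      ON a.record_id = b.record_id AND a.entity_id < b.entity_id
    WHERE NOT EXISTS (
        SELECT 1 FROM entity_edges e
        WHERE (e.a = a.entity_id AND e.b = b.entity_id)
           OR (e.a = b.entity_id AND e.b = a.entity_id)
    )
    GROUP BY a.entity_id, b.entity_id
    HAVING COUNT(*) >= 3
""").fetchall()
```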

&lt;p&gt;On this run, five patterns emerged:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Rodney ↔ Dolores     co-occur in 12 records, no direct link
Rodney ↔ Linda       co-occur in 11 records, no direct link
Rodney ↔ BoomClaw    co-occur in 10 records, no direct link
Warehouse B ↔ Rodney co-occur in 10 records, no direct link
Benny ↔ Rodney       co-occur in 10 records, no direct link
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every pattern radiates from "Rodney" — he's a hub entity that appears alongside four other entities across 10-12 records without any direct graph edge connecting them. The extraction pipeline created links from each note/email to "Rodney" and to "Dolores" independently, but never linked Rodney to Dolores directly. The co-occurrence pattern reveals the relationship that was hiding in plain sight.&lt;/p&gt;

&lt;p&gt;Each pair becomes a &lt;code&gt;connection_discovered&lt;/code&gt; insight with a 14-day TTL. The next time you search for "Rodney," the graph traversal finds Dolores, Linda, BoomClaw, Warehouse B, and Benny — even though no single record ever says "Rodney works with Dolores."&lt;/p&gt;

&lt;p&gt;This is the job that produces the "how did it know that?" moments. The answer is always the same: it counted co-occurrences while you weren't looking.&lt;/p&gt;

&lt;p&gt;The query itself — a self-join on &lt;code&gt;entity_links&lt;/code&gt; filtered by &lt;code&gt;NOT EXISTS&lt;/code&gt; — took &lt;strong&gt;138ms&lt;/strong&gt;. That's the heaviest operation in the pipeline, and it runs during idle time, when nobody notices. At query time, the connections are already in the graph.&lt;/p&gt;

&lt;p&gt;Cost on this run: &lt;strong&gt;30 tokens&lt;/strong&gt; (fixed — pure SQL, no LLM).&lt;/p&gt;

&lt;h2&gt;
  
  
  Job 4: Update preferences
&lt;/h2&gt;

&lt;p&gt;The system learns behavioral patterns by analyzing your data over time. On this run, two patterns were detected:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Writing time distribution (last 30 days):&lt;/strong&gt; All 50 notes were created at hour 12 UTC. That's not a preference — that's a signal so strong it maxed out confidence immediately. The system now knows that if you're going to write a note, it's probably at noon.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tag frequency:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;safety&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;11, personal&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;9, r-and-d&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;6, strategy&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5, humor&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both get persisted via UPSERT with incremental confidence — each observation nudges the score up by 0.05, capped at 0.95. After multiple runs, the preferences look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;scheduling / preferred_writing_hour = "12"&lt;/span&gt;
  &lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.95 (max), evidence&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;20,650 observations&lt;/span&gt;

&lt;span class="s"&gt;organization / preferred_tags = ["safety","personal","r-and-d","strategy","humor"]&lt;/span&gt;
  &lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.95 (max), evidence&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;417 observations&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent uses these when it needs to make decisions. Scheduling a reminder? It knows noon is when you're active. Suggesting tags for a new note? It offers your most-used tags first. Creating a diary entry template? It matches your writing style.&lt;/p&gt;

&lt;p&gt;If a preference is wrong, you delete it via the API. The system may re-learn it later if the pattern persists, but with reduced confidence — the deletion counts as negative feedback.&lt;/p&gt;
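&lt;p&gt;The update rule itself is tiny. A sketch of the UPSERT semantics, in memory rather than SQL (the reset-on-changed-value behavior is an assumption):&lt;/p&gt;

```python
def observe(prefs, key, value, step=0.05, cap=0.95):
    """Each matching observation nudges confidence up by `step`,
    capped at `cap`; a changed value starts the preference over."""
    cur = prefs.get(key)
    if cur and cur["value"] == value:
        cur["confidence"] = min(cap, cur["confidence"] + step)
        cur["evidence"] += 1
    else:
        prefs[key] = {"value": value, "confidence": step, "evidence": 1}
    return prefs[key]
```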

&lt;p&gt;Cost on this run: &lt;strong&gt;20 tokens&lt;/strong&gt; (fixed — pure SQL, no LLM).&lt;/p&gt;

&lt;h2&gt;
  
  
  The real numbers: 700 tokens, 184 milliseconds
&lt;/h2&gt;

&lt;p&gt;Here's the actual pipeline summary from the run above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌────────────────────┬────────┬────────┬───────────────────────────────┐
│ Job                │ Tokens │ ms     │ Output                        │
├────────────────────┼────────┼────────┼───────────────────────────────┤
│ enrich_connections │ 600    │ 16     │ 3 notes re-enqueued           │
│ generate_summary   │ 50     │ 14     │ 1 weekly_summary insight      │
│ detect_patterns    │ 30     │ 138    │ 5 connection_discovered       │
│ update_preferences │ 20     │ 16     │ 2 preferences updated         │
├────────────────────┼────────┼────────┼───────────────────────────────┤
│ Total              │ 700    │ 184    │                               │
└────────────────────┴────────┴────────┴───────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;700 out of the 5,000 token budget — 14%. The full pipeline completed in under 200 milliseconds. Three of the four jobs are pure SQL with zero LLM calls. Only &lt;code&gt;enrich_connections&lt;/code&gt; queues work for the model, and even that just enqueues — the actual extraction runs later at low priority.&lt;/p&gt;

&lt;p&gt;Every execution gets logged to &lt;code&gt;sleep_time_jobs&lt;/code&gt; for auditability. If a job fails, the error is recorded and the next job still runs — the pipeline is fault-tolerant by design.&lt;/p&gt;

&lt;h2&gt;
  
  
  The token budget: sleep-time vs test-time
&lt;/h2&gt;

&lt;p&gt;Every sleep-time execution has a capped budget — 5,000 tokens by default, configurable per user. This prevents runaway costs from background processing. The jobs run in priority order and stop when the budget is exhausted.&lt;/p&gt;
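&lt;p&gt;A sketch of that loop — the job signature and skip-rather-than-abort policy are simplifications of the real scheduler:&lt;/p&gt;

```python
def run_pipeline(jobs, budget=5000):
    """Run (priority, name, cost, fn) jobs in priority order; skip a
    job when its estimated cost no longer fits the remaining budget.
    A failing job is recorded, not fatal: the next job still runs."""
    spent, ran, skipped = 0, [], []
    for priority, name, cost, fn in sorted(jobs, key=lambda j: j[0]):
        if spent + cost > budget:
            skipped.append(name)
            continue
        try:
            fn()
        except Exception:
            pass  # the real system logs the error to sleep_time_jobs
        spent += cost
        ran.append(name)
    return spent, ran, skipped
```

&lt;p&gt;With the run above (600 + 50 + 30 + 20 = 700 tokens), nothing is skipped; with a tight budget, preference learning is the first to go.&lt;/p&gt;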

&lt;p&gt;The insight from Letta's research holds: spending tokens during idle time dramatically reduces what you need to spend during active conversations. When the agent already knows that Rodney is connected to Dolores across 12 records (because Job 3 discovered it overnight), answering "who works with Rodney?" costs a graph traversal query (~5ms) instead of a full cross-domain LLM analysis (~3,000 tokens).&lt;/p&gt;

&lt;p&gt;We track sleep-time and test-time token usage separately in the token metrics dashboard, so you can see the trade-off directly: more sleep-time tokens → fewer test-time tokens → faster, cheaper responses.&lt;/p&gt;

&lt;h2&gt;
  
  
  The three-stage digest
&lt;/h2&gt;

&lt;p&gt;The sleep-time engine powers an evolved version of the Digest Engine — the system that tells the agent "here's what changed since you last checked."&lt;/p&gt;

&lt;p&gt;The original digest was simple: scan the change log, format a markdown file, write it to the agent's workspace. The agent reads it on the next heartbeat.&lt;/p&gt;

&lt;p&gt;The v2 digest is a three-stage pipeline:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 1 — Selection.&lt;/strong&gt; Filter changes by relevance using configurable rules stored in a &lt;code&gt;digest_rules&lt;/code&gt; table. VIP emails (from your boss, from specific contacts) trigger immediate notification via PostgreSQL LISTEN/NOTIFY. Routine changes (a synced contact updated its phone number) get buffered for the periodic digest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 2 — Correlation.&lt;/strong&gt; Use the LLM to discover cross-domain connections between changes. "You received an email from Ana García. Ana is attending tomorrow's meeting. You have 2 unfinished notes about the project she's working on." This stage is why sleep-time matters — the correlation discovery happens in the background, not when the agent is trying to respond to you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 3 — Scoring.&lt;/strong&gt; Rate each item by urgency, cross-domain relevance, and historical feedback (did the user act on similar insights before?). The output shifts from "what changed" to "what matters and why."&lt;/p&gt;

&lt;p&gt;The scored digest gets written to DIGEST.md in the agent's workspace. The agent reads it and decides what to surface. Urgent items might trigger an immediate notification. Low-score items accumulate for a daily summary.&lt;/p&gt;
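&lt;p&gt;Stage 3 can be sketched as a weighted score plus a threshold split — the weights and the urgency threshold here are assumptions, not the shipped values:&lt;/p&gt;

```python
URGENT = 0.7  # assumed threshold, not the real system's value

def score_item(item, weights=(0.5, 0.3, 0.2)):
    """Combine urgency, cross-domain relevance, and historical
    feedback (each already normalized to [0, 1])."""
    w_u, w_r, w_f = weights
    return (w_u * item["urgency"]
            + w_r * item["cross_domain"]
            + w_f * item["feedback"])

def triage(items):
    """High scores notify immediately; the rest accumulate
    for the daily summary."""
    urgent, daily = [], []
    for it in items:
        (urgent if score_item(it) >= URGENT else daily).append(it["id"])
    return urgent, daily
```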

&lt;h2&gt;
  
  
  The world model
&lt;/h2&gt;

&lt;p&gt;One output of the sleep-time engine is a materialized "world model" — a living document that summarizes your current state:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Active projects and their status&lt;/li&gt;
&lt;li&gt;Key people and recent interactions&lt;/li&gt;
&lt;li&gt;Upcoming deadlines and events&lt;/li&gt;
&lt;li&gt;Behavioral patterns and preferences&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The world model is updated incrementally. Each sleep-time run only modifies the sections affected by recent changes. The agent references it as persistent context — a pre-computed summary of "what's going on in your life right now" that doesn't need to be recomputed every conversation.&lt;/p&gt;

&lt;p&gt;This is inspired by Daniel Miessler's PAI framework (MISSION.md, GOALS.md, PROJECTS.md pattern), adapted to structured data. Instead of the user maintaining these documents manually, the system generates them from real data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Zero cost when idle
&lt;/h2&gt;

&lt;p&gt;The most important design decision: when nothing has changed, the engine does nothing. Zero tokens. Zero queries. The scheduler checks for pending changes in the change log before executing any job. No changes → skip the entire run.&lt;/p&gt;

&lt;p&gt;This means the system's cost is proportional to your activity, not to time. A weekend where you don't use the system costs nothing. A busy Monday with 50 emails and 10 notes triggers multiple enrichment passes. The cost follows the value.&lt;/p&gt;

&lt;p&gt;Similarly, the digest delivery to the agent is conditional. No changes → no DIGEST.md written → no system event → the agent doesn't wake up → zero tokens consumed. This was a deliberate choice over a heartbeat model where the agent would check for updates periodically regardless.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;I'd add entity type validation in the summary job.&lt;/strong&gt; The real run showed "Instagram" classified as a &lt;code&gt;location&lt;/code&gt; — a clear extraction error that propagated into the weekly summary. A simple validation step (is this entity type plausible for this name?) would catch obvious misclassifications before they pollute insights. The data exists to fix this; we just haven't built the filter yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I'd build the pattern detection job first, not the enrichment job.&lt;/strong&gt; Enrichment (Job 1) improves the knowledge graph incrementally. Pattern detection (Job 3) produces visible, surprising insights that users actually react to. "Ana García is connected to Project Tempest" is a moment of delight. "We added 2 more entity links to your note about Tempest" is invisible maintenance. Leading with delight would have made the sleep-time engine feel valuable sooner.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I'd make the token budget adaptive.&lt;/strong&gt; Right now it's a flat 5,000 tokens per run. A smarter approach: scale the budget with the amount of pending work. 3 new records → 1,000 tokens. 50 new records after a sync → 10,000 tokens. The budget should match the opportunity, not be a fixed ceiling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I'd add a "sleep-time log" visible in the dashboard.&lt;/strong&gt; Currently, the only way to see what the engine did is through the insights API and the sleep_time_jobs table. A visible log ("Last night I discovered 3 new connections, updated 2 preferences, and generated your weekly summary") would build trust and make the background processing feel tangible.&lt;/p&gt;

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;A personal AI system that only works when you talk to it is wasting 95% of its available compute. The data is sitting in PostgreSQL. The model is loaded in Ollama. The knowledge graph has gaps that a 30-token LLM call could fill. Why wait for the user to ask?&lt;/p&gt;

&lt;p&gt;Sleep-time compute shifts the work from "the user asked a question and now we scramble" to "we already know the answer because we connected the dots overnight." Four jobs, a token budget, an idle detector, and a three-stage digest pipeline. The system gets smarter while you sleep.&lt;/p&gt;

&lt;p&gt;The insight that makes it all work: spending tokens when nobody is waiting for a response is categorically cheaper — in latency, in user experience, and in total cost — than spending them when someone is staring at a loading spinner.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next up: PII-aware routing — how we send sensitive data to local models and everything else to the cloud, without the user having to think about it.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>postgres</category>
      <category>selfhosted</category>
    </item>
    <item>
<title>Hybrid search with RRF: combining pgvector, tsvector, and a knowledge graph in one query</title>
      <dc:creator>Victor García</dc:creator>
      <pubDate>Tue, 24 Mar 2026 12:04:03 +0000</pubDate>
      <link>https://forem.com/micelclaw/hybrid-search-with-rrf-combining-pgvector-tsvector-and-a-knowledge-graph-in-one-query-1d80</link>
      <guid>https://forem.com/micelclaw/hybrid-search-with-rrf-combining-pgvector-tsvector-and-a-knowledge-graph-in-one-query-1d80</guid>
      <description>&lt;p&gt;Here's a search query: "beach trip."&lt;/p&gt;

&lt;p&gt;Full-text search finds nothing — no record contains the word "beach." But there's a note that says "Qué calor en Valencia, el agua estaba perfecta." Semantic search finds it because the embedding for "beach trip" is close to the embedding for a hot day at the beach in Valencia.&lt;/p&gt;

&lt;p&gt;Now a different query: "Ana García."&lt;/p&gt;

&lt;p&gt;Semantic search returns a dozen vaguely related records. Full-text search returns the 3 records that literally contain "Ana García." But neither shows you that Ana attended last week's meeting, is CC'd on 5 email threads, and appears in tomorrow's calendar — connections that only the knowledge graph knows about.&lt;/p&gt;

&lt;p&gt;No single search method is enough. We needed all three, plus a way to combine them that doesn't require manual tuning.&lt;/p&gt;

&lt;h2&gt;
  
  
  The four signals
&lt;/h2&gt;

&lt;p&gt;Our search pipeline produces four independent scores for every candidate result:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;What it catches&lt;/th&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Semantic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;pgvector cosine similarity&lt;/td&gt;
&lt;td&gt;Meaning-based matches ("beach" → "calor en Valencia")&lt;/td&gt;
&lt;td&gt;Pro&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Full-text&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;tsvector + GIN + ts_rank&lt;/td&gt;
&lt;td&gt;Exact keyword matches, fast and precise&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Graph&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;entity_links overlap&lt;/td&gt;
&lt;td&gt;Relational connections ("Ana García" → meetings she attended)&lt;/td&gt;
&lt;td&gt;Pro&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Heat&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;record_heat table&lt;/td&gt;
&lt;td&gt;Temporal relevance (recently accessed records)&lt;/td&gt;
&lt;td&gt;Free (display), Pro (in ranking)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Free-tier users get full-text search only — which is still fast and well-ranked thanks to tsvector with weighted columns (title gets weight A, content gets weight B, tags get weight C). Pro users get all four signals fused together.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pipeline
&lt;/h2&gt;

&lt;p&gt;The search happens in seven steps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Query: "Ana García project update"
    │
    ├── 1. Vector search ──→ top-50 by cosine similarity
    ├── 2. Full-text search ──→ top-N by ts_rank (UNION ALL across domains)
    └── 3. Graph discovery ──→ N candidates via entity_links
                │
                ▼
         4. Deduplicate by (domain, record_id)
                │
                ▼
         5. Rank-normalize each signal to [0, 1]
                │
                ▼
         6. Detect degenerate signals
                │
                ▼
         7. Weighted fusion + multi-signal bonus
                │
                ▼
            Final ranked results with provenance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let me walk through each step.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Vector search
&lt;/h3&gt;

&lt;p&gt;The query text is embedded on-the-fly using the same model that embeds records (&lt;code&gt;qwen3-embedding:0.6b&lt;/code&gt;, 1024 dimensions). Then a cosine similarity query runs against the embeddings table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;record_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;similarity&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;query_embedding&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 0.3 minimum threshold filters garbage. The top 50 candidates move to the next step. If Ollama is down and we can't embed the query, this signal is simply skipped — the other signals still work.&lt;/p&gt;
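&lt;p&gt;That graceful degradation is easy to sketch. A minimal Python sketch with stand-in functions (&lt;code&gt;embed_query&lt;/code&gt;, &lt;code&gt;vector_candidates&lt;/code&gt;, and &lt;code&gt;fulltext_candidates&lt;/code&gt; are hypothetical placeholders, not our actual API):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def embed_query(query):
    # Stand-in for the real embedding call; here it simulates Ollama being down.
    raise ConnectionError("embedding service unreachable")

def vector_candidates(embedding):
    return []  # stand-in for the pgvector query

def fulltext_candidates(query):
    return [("note", 1), ("event", 7)]  # stand-in for the UNION ALL query

def gather_signals(query):
    """Collect candidates per signal; a failed signal is skipped, not fatal."""
    signals = {}
    try:
        signals["semantic"] = vector_candidates(embed_query(query))
    except ConnectionError:
        pass  # embedding model unavailable: drop only the semantic signal
    signals["fulltext"] = fulltext_candidates(query)
    return signals

print(gather_signals("beach trip"))  # {'fulltext': [('note', 1), ('event', 7)]}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;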

&lt;h3&gt;
  
  
  Step 2: Full-text search
&lt;/h3&gt;

&lt;p&gt;A UNION ALL query across all domain tables, using PostgreSQL's native full-text search:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="s1"&gt;'note'&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;record_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ts_rank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;notes&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;search_vector&lt;/span&gt; &lt;span class="o"&gt;@@&lt;/span&gt; &lt;span class="n"&gt;plainto_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'simple'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;deleted_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="k"&gt;UNION&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="s1"&gt;'event'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ts_rank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;search_vector&lt;/span&gt; &lt;span class="o"&gt;@@&lt;/span&gt; &lt;span class="n"&gt;plainto_tsquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'simple'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;deleted_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="k"&gt;UNION&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt;
&lt;span class="c1"&gt;-- ... contacts, emails, files, diary, bookmarks, kanban_cards&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We use &lt;code&gt;plainto_tsquery('simple', ...)&lt;/code&gt; instead of language-specific configurations. The &lt;code&gt;simple&lt;/code&gt; configuration doesn't stem words, which matters for multilingual data — Spanish and English records coexist, and stemming rules for one language would butcher the other.&lt;/p&gt;

&lt;p&gt;Each domain table has a &lt;code&gt;search_vector tsvector&lt;/code&gt; column maintained by a trigger (or &lt;code&gt;GENERATED ALWAYS AS ... STORED&lt;/code&gt; for newer tables). The vectors are weighted: title gets &lt;code&gt;'A'&lt;/code&gt;, description/content gets &lt;code&gt;'B'&lt;/code&gt;, tags get &lt;code&gt;'C'&lt;/code&gt;. A match in the title ranks higher than a match in the body.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Graph discovery
&lt;/h3&gt;

&lt;p&gt;This signal is different — it doesn't match text, it matches relationships.&lt;/p&gt;

&lt;p&gt;The query is matched against &lt;code&gt;graph_entities.normalized_name&lt;/code&gt;. If "Ana García" matches a Person entity, we find all records linked to that entity via &lt;code&gt;entity_links&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Find entities mentioned in the query&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;graph_entities&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;normalized_name&lt;/span&gt; &lt;span class="k"&gt;ILIKE&lt;/span&gt; &lt;span class="s1"&gt;'%ana garcia%'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;deleted_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Find all records linked to those entities&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;source_type&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source_id&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;record_id&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;entity_links&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;target_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'graph_entity'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;target_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;ANY&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;entity_ids&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;graph_score&lt;/code&gt; for each result is the overlap ratio: how many of the query's entities appear in the result's connections, divided by the total entities found in the query.&lt;/p&gt;
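&lt;p&gt;The overlap ratio itself is a few lines of Python (a sketch, not our production code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def graph_score(query_entity_ids, record_entity_ids):
    """Overlap ratio: query entities linked to this record, over entities in the query."""
    if not query_entity_ids:
        return 0.0
    shared = set(query_entity_ids).intersection(record_entity_ids)
    return len(shared) / len(set(query_entity_ids))

# Two entities in the query, one of them linked to this record:
print(graph_score([101, 102], [102, 999]))  # 0.5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;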

&lt;h3&gt;
  
  
  Step 4: Deduplication
&lt;/h3&gt;

&lt;p&gt;The three signals produce candidate sets that overlap. A note containing "Ana García" might appear in vector search (semantically similar), full-text search (exact keyword match), and graph search (linked to the Ana García entity). We deduplicate by &lt;code&gt;(domain, record_id)&lt;/code&gt; and track which signals produced each candidate.&lt;/p&gt;
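&lt;p&gt;A minimal sketch of that merge step, keyed by &lt;code&gt;(domain, record_id)&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def deduplicate(per_signal):
    """Merge per-signal candidate lists, tracking which signals found each record."""
    merged = {}
    for signal, candidates in per_signal.items():
        for domain, record_id in candidates:
            merged.setdefault((domain, record_id), set()).add(signal)
    return merged

found = deduplicate({
    "semantic": [("note", 42)],
    "fulltext": [("note", 42), ("event", 7)],
})
print(found[("note", 42)])  # the note was found by both signals
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;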

&lt;h3&gt;
  
  
  Step 5: Rank-based normalization
&lt;/h3&gt;

&lt;p&gt;Here's where it gets interesting. We do NOT normalize by raw scores. We normalize by rank position:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;normalized_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;position&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The top result in a signal gets 1.0. The bottom gets nearly 0. A candidate absent from a signal gets 0.&lt;/p&gt;

&lt;p&gt;Why rank-based instead of min-max normalization? Because cosine similarity scores cluster. In a typical query, the top-50 vector search results might have similarities between 0.54 and 0.64 — a spread of just 0.10. Min-max normalization would stretch this to 0.0–1.0, making the difference between rank 1 and rank 50 look huge when it's actually tiny.&lt;/p&gt;

&lt;p&gt;Rank-based normalization treats position as the signal, not magnitude. First place is first place, whether it scored 0.99 or 0.55.&lt;/p&gt;
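&lt;p&gt;In code, the normalization is one dict comprehension over an already-ranked list of keys (a sketch):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def rank_normalize(ranked_keys):
    """Position-based normalization: first place is first place, whatever it scored."""
    n = len(ranked_keys)
    return {key: (n - position) / n for position, key in enumerate(ranked_keys)}

print(rank_normalize(["a", "b", "c", "d"]))
# {'a': 1.0, 'b': 0.75, 'c': 0.5, 'd': 0.25}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;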

&lt;h3&gt;
  
  
  Step 6: Degenerate signal detection
&lt;/h3&gt;

&lt;p&gt;Sometimes a signal doesn't discriminate. If all 50 vector search results have cosine similarities within 5% of each other, the signal is noise — everything "looks the same" to the embedding model.&lt;/p&gt;

&lt;p&gt;When this happens, the weight assigned to the degenerate signal gets redistributed proportionally to the other signals. The search doesn't fail — it just relies more on the signals that are actually informative.&lt;/p&gt;
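&lt;p&gt;A rough sketch of both halves, assuming "within 5% of each other" means the relative spread of the raw scores (the exact definition may differ in the real implementation):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import operator

def is_degenerate(raw_scores, threshold=0.05):
    """A signal is noise when its relative score spread is under the threshold."""
    top, bottom = max(raw_scores), min(raw_scores)
    if top == 0:
        return True
    # true when the spread ratio is strictly below the 5% threshold
    return operator.lt((top - bottom) / top, threshold)

def redistribute(weights, dead_signals):
    """Hand a suppressed signal's weight proportionally to the live signals."""
    live = {s: w for s, w in weights.items() if s not in dead_signals}
    total = sum(live.values())
    return {s: w / total for s, w in live.items()}

print(is_degenerate([0.61, 0.60, 0.60, 0.59]))  # True: everything looks the same
weights = {"semantic": 0.25, "fulltext": 0.25, "graph": 0.25, "heat": 0.25}
print(redistribute(weights, {"semantic"}))  # each live signal ends up at one third
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;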

&lt;h3&gt;
  
  
  Step 7: Weighted fusion + multi-signal bonus
&lt;/h3&gt;

&lt;p&gt;The final score combines all four signals:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;α&lt;/span&gt; &lt;span class="err"&gt;×&lt;/span&gt; &lt;span class="n"&gt;heat&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;β&lt;/span&gt; &lt;span class="err"&gt;×&lt;/span&gt; &lt;span class="n"&gt;semantic&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;γ&lt;/span&gt; &lt;span class="err"&gt;×&lt;/span&gt; &lt;span class="n"&gt;fulltext&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;δ&lt;/span&gt; &lt;span class="err"&gt;×&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;α&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;β&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;γ&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;δ&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Default weights: all equal at 0.25 each. But the user can adjust them via the dashboard's ranking sliders — crank heat to prioritize recent activity, drop graph to ignore relationship signals, etc.&lt;/p&gt;

&lt;p&gt;Then comes the multi-signal bonus:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Signals that found this result&lt;/th&gt;
&lt;th&gt;Multiplier&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 signal only&lt;/td&gt;
&lt;td&gt;×1.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2 signals&lt;/td&gt;
&lt;td&gt;×1.25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3 signals&lt;/td&gt;
&lt;td&gt;×1.50&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A result that appears in vector search, full-text search, AND graph search gets a 50% bonus. This rewards results that are independently confirmed by multiple methods — they're almost certainly relevant.&lt;/p&gt;

&lt;p&gt;Heat doesn't count for the bonus calculation. It's a temporal signal, not a relevance signal — a record being hot doesn't mean it matches the query.&lt;/p&gt;
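&lt;p&gt;Putting the fusion formula and the bonus together as a sketch (signal names mirror the formula above; heat is excluded from the hit count as described):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;RELEVANCE_SIGNALS = ("semantic", "fulltext", "graph")

def fuse(normalized, weights):
    """Weighted fusion over normalized signals, plus the multi-signal bonus."""
    total_weight = sum(weights.values())
    base = sum(weights[s] * normalized.get(s, 0.0) for s in weights) / total_weight
    # Heat never counts toward the bonus: it is temporal, not relevance.
    hits = sum(1 for s in RELEVANCE_SIGNALS if normalized.get(s))
    bonus = 1 + 0.25 * max(0, hits - 1)
    return base * bonus

weights = {"heat": 0.25, "semantic": 0.25, "fulltext": 0.25, "graph": 0.25}
score = fuse({"semantic": 1.0, "fulltext": 0.8}, weights)
print(round(score, 4))  # 0.5625: two signals agree, so the base gets a 1.25x bump
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;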

&lt;h2&gt;
  
  
  Two search modes
&lt;/h2&gt;

&lt;p&gt;The system exposes two search endpoints that use this pipeline differently:&lt;/p&gt;

&lt;h3&gt;
  
  
  Standard search: &lt;code&gt;GET /search&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The default. Uses Reciprocal Rank Fusion with fixed weights:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;rrf_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Σ&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;rank_i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="n"&gt;where&lt;/span&gt; &lt;span class="n"&gt;K&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
&lt;span class="n"&gt;final&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rrf_score&lt;/span&gt; &lt;span class="err"&gt;×&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt; &lt;span class="err"&gt;×&lt;/span&gt; &lt;span class="n"&gt;heat_score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;RRF is elegant: it ignores raw scores entirely, only caring about rank position. A result that's #1 in vector search and #3 in full-text search gets &lt;code&gt;1/(60+1) + 1/(60+3) = 0.0164 + 0.0159 = 0.0323&lt;/code&gt;. The K=60 constant smooths the curve — the difference between rank 1 and rank 10 is meaningful but not extreme.&lt;/p&gt;

&lt;p&gt;Heat acts as a post-fusion tiebreaker with a maximum 10% boost. It never dominates relevance. A cold but highly relevant result always beats a hot but marginally relevant one.&lt;/p&gt;
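&lt;p&gt;The standard-mode scoring, sketched in Python (the example reproduces the #1-and-#3 arithmetic above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def rrf_score(ranks, k=60):
    """Reciprocal Rank Fusion: only rank positions matter, never raw scores."""
    return sum(1 / (k + rank) for rank in ranks)

def final_score(ranks, heat, k=60):
    # Heat is a post-fusion tiebreaker capped at a 10% boost.
    return rrf_score(ranks, k) * (1 + 0.1 * heat)

print(round(rrf_score([1, 3]), 4))         # 0.0323 (rank 1 in one signal, rank 3 in another)
print(round(final_score([1, 3], 1.0), 4))  # 0.0355 at maximum heat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;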

&lt;h3&gt;
  
  
  Advanced search: &lt;code&gt;GET /search/advanced&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0a59tag4s2lr5xnghxhb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0a59tag4s2lr5xnghxhb.png" alt=" " width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The full power mode. Four independent sliders, degenerate signal detection, multi-signal bonus. Used by the Search module in the dashboard where users can see and control exactly how results are ranked.&lt;/p&gt;

&lt;p&gt;Every result includes full provenance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"provenance"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"heat_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.52&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"vector_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.68&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"fulltext_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.52&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"graph_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.11&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Complete transparency. The user can see that a result ranked high because of semantic similarity (0.68) despite low graph connectivity (0.11), and adjust weights accordingly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Free tier: surprisingly good without vectors
&lt;/h2&gt;

&lt;p&gt;Free users don't get semantic search, graph search, or heat-weighted ranking. They get tsvector + GIN full-text search across all domains, with &lt;code&gt;ts_rank()&lt;/code&gt; scoring and weighted columns.&lt;/p&gt;

&lt;p&gt;This sounds like a big downgrade. In practice, it's surprisingly solid:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Searches for names, titles, and specific terms work perfectly — full-text search is exact.&lt;/li&gt;
&lt;li&gt;Weighted tsvector means a match in the title ranks above a match in the body.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;plainto_tsquery('simple', ...)&lt;/code&gt; handles both Spanish and English without configuration.&lt;/li&gt;
&lt;li&gt;The UNION ALL across domains means one search bar finds notes, emails, events, contacts, and files.&lt;/li&gt;
&lt;li&gt;GIN indexes keep lookups sublinear — fast even with tens of thousands of records.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The meaningful Pro differentiator is semantic search: finding "calor en Valencia" when you search for "beach trip." That's genuinely impossible with keyword matching. But for the 80% of searches where people type exactly what they're looking for, free tier search works fine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Temporal expansion: the "wow" moment
&lt;/h2&gt;

&lt;p&gt;Here's a feature that surprised us with how useful it turned out to be.&lt;/p&gt;

&lt;p&gt;If the query mentions an entity that has upcoming events (within ±7 days), temporally close results get a boost:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;temporal_boost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.15&lt;/span&gt; &lt;span class="err"&gt;×&lt;/span&gt; &lt;span class="n"&gt;proximity&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In practice: you search for "Ana García." The system finds that Ana has a meeting with you tomorrow. Notes, emails, and contacts related to Ana that were created or accessed in the last week get boosted. The search results naturally organize around "here's everything relevant to Ana before your meeting tomorrow."&lt;/p&gt;
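&lt;p&gt;The formula leaves &lt;code&gt;proximity&lt;/code&gt; undefined here; one plausible reading, linear decay across the ±7-day window, looks like this (an assumption for illustration, not necessarily the shipped curve):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def temporal_boost(days_to_event, window=7):
    """Assumed linear decay: full boost for an event today, none past the window."""
    proximity = max(0.0, 1 - abs(days_to_event) / window)
    return 1 + 0.15 * proximity

print(temporal_boost(0))   # maximum boost of about 1.15
print(temporal_boost(10))  # 1.0 (outside the window, no boost)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;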

&lt;p&gt;We didn't plan this as a feature — it fell out of having the knowledge graph and the calendar in the same database. But it consistently produces the kind of results that make people say "how did it know I needed that?"&lt;/p&gt;

&lt;h2&gt;
  
  
  The tuning knobs
&lt;/h2&gt;

&lt;p&gt;Five parameters control the pipeline behavior:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;MIN_SIMILARITY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.3&lt;/td&gt;
&lt;td&gt;Cosine sim threshold. Higher = less vector noise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;RRF_FETCH_SIZE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;Candidates per signal. Lower = faster, fewer candidates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;RRF_K&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;td&gt;RRF smoothing constant. Higher = smoother rank differences&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-signal factor&lt;/td&gt;
&lt;td&gt;0.25&lt;/td&gt;
&lt;td&gt;Bonus per additional signal (1 + 0.25 × (count-1))&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Degenerate threshold&lt;/td&gt;
&lt;td&gt;5%&lt;/td&gt;
&lt;td&gt;Signal suppression when score range is too narrow&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We've left these at their defaults since implementation. The multi-signal bonus and degenerate detection handle most edge cases automatically. If we ever need to tune, the provenance metadata on every result tells us exactly which signal is helping or hurting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;p&gt;All of this runs inside a single PostgreSQL instance, on the same machine that serves the API:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Vector search (50 candidates)&lt;/td&gt;
&lt;td&gt;~15-30ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full-text search (UNION ALL, 7 tables)&lt;/td&gt;
&lt;td&gt;~5-15ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Graph discovery&lt;/td&gt;
&lt;td&gt;~5-10ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fusion + scoring&lt;/td&gt;
&lt;td&gt;~2-5ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total (standard search)&lt;/td&gt;
&lt;td&gt;~30-60ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total (advanced search, 4 signals)&lt;/td&gt;
&lt;td&gt;~50-80ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;No Elasticsearch. No Solr. No separate search service. pgvector and tsvector are PostgreSQL extensions that run in the same process as the rest of the database. One backup strategy, one connection pool, one operational concern.&lt;/p&gt;

&lt;p&gt;For a personal system with a few thousand records per domain, this is more than fast enough. If we ever hit scale problems (unlikely for single-user), the first optimization would be reducing &lt;code&gt;RRF_FETCH_SIZE&lt;/code&gt; from 50 to 20 — cutting candidate generation in half with minimal quality loss.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;I'd implement full-text search from the very beginning, not as a free-tier afterthought.&lt;/strong&gt; We built vector search first (Phase 3) and added tsvector later as a "fallback for free users." Turns out full-text search is essential even for Pro users — it catches exact matches that embeddings miss. "Show me the email from &lt;a href="mailto:patricia@work.com"&gt;patricia@work.com&lt;/a&gt;" is a full-text query, not a semantic one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I'd add provenance to the standard search too, not just advanced.&lt;/strong&gt; We initially only exposed the score breakdown in &lt;code&gt;/search/advanced&lt;/code&gt;. When we added it to the standard &lt;code&gt;/search&lt;/code&gt; endpoint (in the provenance field), debugging search quality became ten times easier. Every bug report went from "the search is bad" to "this result has vector_score 0.8 but graph_score 0, why?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I'd explore reranking with a cross-encoder.&lt;/strong&gt; Our pipeline does retrieval + fusion but no reranking. A small cross-encoder model (like ms-marco-MiniLM) could re-score the top 20 results for higher precision. We deferred this because the current quality is good enough and adding another model to the Ollama queue would increase latency. But for a future post-MVP iteration, it's the obvious next step.&lt;/p&gt;

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;The trick to hybrid search isn't the individual signals — pgvector and tsvector are well-documented, and knowledge graph traversal is just recursive CTEs. The trick is the fusion: how you combine signals with different scales, different failure modes, and different strengths.&lt;/p&gt;

&lt;p&gt;Reciprocal Rank Fusion solves the scale problem — ranks instead of raw scores. Degenerate signal detection solves the failure mode problem — a noisy signal gets suppressed instead of poisoning results. Multi-signal bonus solves the confidence problem — results confirmed by multiple methods are almost certainly good.&lt;/p&gt;

&lt;p&gt;Four signals, one UNION ALL, one PostgreSQL instance, under 100ms. The search that finds "calor en Valencia" when you type "beach trip" — and also shows you that Ana García has a meeting with you tomorrow.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next up: sleep-time compute for personal data — what your AI should be doing while you sleep, and why idle cycles are the most valuable compute you have.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>search</category>
      <category>pgvector</category>
      <category>ai</category>
    </item>
    <item>
      <title>Your AI agent is wasting 90% of its tokens on field names</title>
      <dc:creator>Victor García</dc:creator>
      <pubDate>Fri, 20 Mar 2026 13:56:54 +0000</pubDate>
      <link>https://forem.com/micelclaw/7o-your-ai-agent-is-wasting-90-of-its-tokens-on-field-names-2o2a</link>
      <guid>https://forem.com/micelclaw/7o-your-ai-agent-is-wasting-90-of-its-tokens-on-field-names-2o2a</guid>
      <description>&lt;p&gt;We built the compact API format (previous post) and felt good about ourselves. API responses were 78% smaller. Tokens saved. Problem solved.&lt;/p&gt;

&lt;p&gt;Then we actually measured where our agent's tokens were going.&lt;/p&gt;

&lt;p&gt;The API responses weren't the problem. The skills were.&lt;/p&gt;

&lt;h2&gt;
  
  
  The audit
&lt;/h2&gt;

&lt;p&gt;We ran a token audit across all 31 skills in the system. Here's what we found:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Main agent (Francis) — 11 skills loaded:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;always:true&lt;/code&gt; skills (injected on every single message): &lt;strong&gt;~17,500 tokens&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Total including on-demand skills: &lt;strong&gt;~20,600 tokens&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Global (all 31 skills across all agents):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;always:true&lt;/code&gt; skills (12 total): &lt;strong&gt;~27,500 tokens&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;All skills combined: &lt;strong&gt;~50,000 tokens&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's 25% of Sonnet's context window consumed by skills alone. Before the user says a word. Before the agent reads a single note or email. A quarter of the available context is just instructions on how to call APIs.&lt;/p&gt;

&lt;p&gt;Add the workspace identity files — SOUL.md, IDENTITY.md, USER.md, TOOLS.md, AGENTS.md, BOOTSTRAP.md — and you're looking at another 3-5K tokens. So the agent starts every conversation with roughly &lt;strong&gt;20-22K tokens of system prompt&lt;/strong&gt;. That's over 10% of the context window, gone.&lt;/p&gt;

&lt;p&gt;The user's actual message? Usually 20-50 tokens. A rounding error.&lt;/p&gt;

&lt;h2&gt;
  
  
  The top 5 offenders
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Skill&lt;/th&gt;
&lt;th&gt;always&lt;/th&gt;
&lt;th&gt;Tokens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;claw-search&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;~5,100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;claw-hal&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;~4,200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;claw-approvals&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;~3,050&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;claw-files&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;~2,850&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;claw-mail&lt;/td&gt;
&lt;td&gt;true&lt;/td&gt;
&lt;td&gt;~2,780&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;claw-search&lt;/code&gt; alone burns 5,100 tokens every message. It's the biggest skill because it handles cross-domain search routing — deciding whether to query the user's data (notes, emails, contacts) or the agent's own workspace memory. That routing logic is complex and takes words to explain.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;claw-hal&lt;/code&gt; (hardware abstraction — storage, docker, network) is second because it covers multiple subsystems. When someone asks "how's my disk?", HAL needs to know about volumes, SMART data, mount points, and Docker containers. That's a lot of endpoints.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters more than API response size
&lt;/h2&gt;

&lt;p&gt;Think about the token flow of a typical interaction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: "What meetings do I have today?"

System prompt:  ~20,000 tokens (skills + identity)
User message:        ~8 tokens
API call:          ~300 tokens (compact format response)
Agent response:     ~50 tokens
─────────────────────────────────
Total:          ~20,358 tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The API response is 1.5% of the total. Even if we made it zero tokens, we'd save almost nothing. The system prompt is 98% of the cost.&lt;/p&gt;
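&lt;p&gt;The same arithmetic, spelled out:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;budget = {"system_prompt": 20000, "user_message": 8, "api_response": 300, "agent_response": 50}
total = sum(budget.values())
shares = {part: round(100 * tokens / total, 1) for part, tokens in budget.items()}
print(total)                   # 20358
print(shares["api_response"])  # 1.5
print(shares["system_prompt"]) # 98.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;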

&lt;p&gt;This is why the title of the previous post was slightly misleading. Yes, compact format saves 78% on API responses. But API responses are the small slice. The real token budget is dominated by the system prompt — and within that, by the skills.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we did about it
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The &lt;code&gt;always:true&lt;/code&gt; / &lt;code&gt;always:false&lt;/code&gt; split
&lt;/h3&gt;

&lt;p&gt;The most impactful decision: most skills don't need to be loaded on every message.&lt;/p&gt;

&lt;p&gt;If you say "save a note about the meeting," the agent needs &lt;code&gt;claw-notes&lt;/code&gt;. It does not need &lt;code&gt;claw-photos&lt;/code&gt;, &lt;code&gt;claw-diary&lt;/code&gt;, &lt;code&gt;claw-bookmarks&lt;/code&gt;, &lt;code&gt;claw-storage&lt;/code&gt;, or &lt;code&gt;home-assistant&lt;/code&gt;. Loading all of them wastes context on instructions the agent will never use for this interaction.&lt;/p&gt;

&lt;p&gt;OpenClaw's skill system supports an &lt;code&gt;always&lt;/code&gt; flag in the skill metadata:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openclaw"&lt;/span&gt;&lt;span class="pi"&gt;:{&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;always"&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;true&lt;/span&gt;&lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Skills marked &lt;code&gt;always:true&lt;/code&gt; are injected into every prompt. Skills marked &lt;code&gt;always:false&lt;/code&gt; are only activated when the conversation context matches their description. The routing model (a fast, cheap classifier) reads the user's message and decides which skills to load.&lt;/p&gt;

&lt;p&gt;Our split:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;always:true (every message)&lt;/th&gt;
&lt;th&gt;always:false (on demand)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;claw-notes, claw-calendar, claw-mail, claw-contacts, claw-drive, claw-search&lt;/td&gt;
&lt;td&gt;claw-diary, claw-photos, claw-bookmarks, claw-storage, claw-hal, claw-graph, home-assistant&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The first group are things people expect to always work: "save a note," "what's on my calendar," "check my email." If these weren't always loaded, the agent would sometimes miss obvious requests.&lt;/p&gt;

&lt;p&gt;The second group are contextual: "how's my disk?" activates storage. "Show me photos from last week" activates photos. The routing model triggers them based on keywords.&lt;/p&gt;
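
&lt;p&gt;A minimal sketch of that routing step, assuming simple keyword matching (in practice a small classifier model decides; the keyword lists here are illustrative):&lt;/p&gt;

```python
# Sketch of on-demand skill routing: always:true skills load on every
# message; always:false skills load only when the message matches their
# trigger keywords. Keyword lists are illustrative, not the real router.
ALWAYS_ON = ["claw-notes", "claw-calendar", "claw-mail",
             "claw-contacts", "claw-drive", "claw-search"]

ON_DEMAND = {
    "claw-storage": ["disk", "volume", "storage", "space"],
    "claw-photos": ["photo", "picture", "image"],
    "claw-diary": ["diary", "journal"],
}

def skills_for(message):
    text = message.lower()
    triggered = [skill for skill, keywords in ON_DEMAND.items()
                 if any(k in text for k in keywords)]
    return ALWAYS_ON + triggered

skills_for("How's my disk?")  # → the 6 always-on skills + claw-storage
```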

&lt;p&gt;Result: the main agent's always-on cost dropped from ~20,600 to ~17,500 tokens. Still a lot, but 3,100 tokens saved on every single message adds up fast across hundreds of daily interactions.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Writing skills for tokens, not for humans
&lt;/h3&gt;

&lt;p&gt;The SKILL.md file is not documentation. It's a prompt. Every word costs money.&lt;/p&gt;

&lt;p&gt;Our early skills looked like documentation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Creating a note&lt;/span&gt;

To create a new note, send a POST request to the notes endpoint.
The request body should contain the title and content fields.
The title is optional — if not provided, the system will use the
first line of the content as the title.

&lt;span class="gu"&gt;### Example&lt;/span&gt;

POST /api/v1/notes
Content-Type: application/json

{
  "title": "Meeting notes from Q1 review",
  "content": "Discussed budget allocation...",
  "tags": ["work", "q1"]
}

&lt;span class="gu"&gt;### Response&lt;/span&gt;

201 Created
{
  "data": {
    "id": "550e8400-...",
    "title": "Meeting notes from Q1 review",
    ...
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's ~150 tokens to say "POST /notes with title, content, and tags." After optimization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;### Create note&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; &lt;span class="sb"&gt;`POST /notes`&lt;/span&gt; body: &lt;span class="sb"&gt;`{title?, content, tags?[]}`&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Response: &lt;span class="sb"&gt;`201`&lt;/span&gt; with created note
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;~30 tokens. Same information. The agent doesn't need prose explaining what a POST request is. It doesn't need example JSON responses — it knows what a 201 looks like. It needs the method, the path, the body fields, and which ones are optional.&lt;/p&gt;

&lt;p&gt;The guidelines we adopted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No full JSON response examples.&lt;/strong&gt; The agent doesn't need them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Include body fields for POST/PATCH&lt;/strong&gt; — the agent does need those.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use &lt;code&gt;?&lt;/code&gt; suffix for optional fields&lt;/strong&gt; — &lt;code&gt;title?&lt;/code&gt; instead of "title (optional)."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One line per operation&lt;/strong&gt; when possible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No prose connectors&lt;/strong&gt; — "To create a note, you should..." becomes "Create: &lt;code&gt;POST /notes&lt;/code&gt;"&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. The compact instruction in every skill
&lt;/h3&gt;

&lt;p&gt;Every skill now instructs the agent to use &lt;code&gt;?format=compact&lt;/code&gt; for listings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## API optimization&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; List: always use &lt;span class="sb"&gt;`?format=compact`&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Detail: &lt;span class="sb"&gt;`GET /:id`&lt;/span&gt; (full JSON) only when needed
&lt;span class="p"&gt;-&lt;/span&gt; Do NOT use &lt;span class="sb"&gt;`format=compact`&lt;/span&gt; on POST/PATCH/DELETE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures the savings from the compact API format (post 6) are actually realized. Without this instruction, the agent defaults to full JSON responses — it doesn't know compact exists unless the skill tells it.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Multi-agent delegation
&lt;/h3&gt;

&lt;p&gt;The nuclear option for token optimization: the main agent never loads certain skills at all, because a different agent owns them.&lt;/p&gt;

&lt;p&gt;Our multi-agent topology has 7 agents. The main agent (Francis) is a router — it handles common requests and delegates specialized ones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Atlas&lt;/strong&gt; handles search, knowledge graph, and research&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sentinel&lt;/strong&gt; handles infrastructure, HAL, Docker, network&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dalí&lt;/strong&gt; handles photos, media, creative tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ledger&lt;/strong&gt; handles finance, invoicing, crypto&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Darwin&lt;/strong&gt; handles analytics, insights, sleep-time intelligence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Francis keeps 11 skills. The heavy ones like &lt;code&gt;claw-hal&lt;/code&gt; (4,200 tokens) move to Sentinel, who only loads when infrastructure questions come up. &lt;code&gt;claw-photos&lt;/code&gt; and visual intelligence move to Dalí. The search skill stays with Francis because search is needed in almost every interaction.&lt;/p&gt;

&lt;p&gt;The main agent's prompt drops from ~20K to ~17.5K tokens. Still significant, but saving ~2.5K tokens on every single message is meaningful, especially when using cloud models billed per token.&lt;/p&gt;

&lt;p&gt;Full disclosure: the multi-agent topology is still early. We've defined the roles and the skill distribution, but we haven't battle-tested delegation patterns, error propagation between agents, or the overhead of agent-to-agent communication. There are almost certainly optimizations we're missing — whether it's smarter skill chunking, dynamic skill loading based on conversation history, or something we haven't thought of at all. If you've built multi-agent systems and see room for improvement, we'd genuinely love to hear about it in the comments.&lt;/p&gt;
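
&lt;p&gt;Delegation itself can be as simple as a lookup from a classified request domain to the specialist agent. A sketch, using the agent names above (the domain labels are illustrative):&lt;/p&gt;

```python
# Sketch of main-agent delegation: Francis maps a request domain to the
# specialist that owns the heavy skills for it. Domain labels are
# illustrative; anything unlisted stays with Francis.
DELEGATES = {
    "infrastructure": "Sentinel",   # claw-hal, Docker, network
    "media": "Dalí",                # claw-photos, creative tasks
    "finance": "Ledger",
    "research": "Atlas",
    "analytics": "Darwin",
}

def assign_agent(domain):
    return DELEGATES.get(domain, "Francis")

assign_agent("infrastructure")  # → "Sentinel"
assign_agent("smalltalk")       # → "Francis"
```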

&lt;h2&gt;
  
  
  The counterintuitive insight: bigger models handle it better
&lt;/h2&gt;

&lt;p&gt;Here's something we didn't expect: Sonnet (the larger model) processes large skill contexts more efficiently than Haiku (the smaller, supposedly faster model).&lt;/p&gt;

&lt;p&gt;When the system prompt is ~20K tokens across 12 skills, the root cause of latency isn't API performance or network — it's the model processing the skill context. Haiku, despite being "faster" per token, takes longer to reason through a large, complex system prompt. Sonnet processes the same context and produces a better-routed response in less wall-clock time.&lt;/p&gt;

&lt;p&gt;This means the intuition of "use the small model for simple routing" breaks down when the routing itself requires understanding a large skill corpus. The small model saves on per-token cost but loses on latency and accuracy. For our use case — personal OS with 12+ skills — Sonnet as the primary agent model is strictly better than Haiku, despite the higher per-token price.&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers after optimization
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Change&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Main agent always-on skills&lt;/td&gt;
&lt;td&gt;~20,600 tokens&lt;/td&gt;
&lt;td&gt;~17,500 tokens&lt;/td&gt;
&lt;td&gt;-15%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tokens per skill (avg)&lt;/td&gt;
&lt;td&gt;~1,700&lt;/td&gt;
&lt;td&gt;~1,400&lt;/td&gt;
&lt;td&gt;-18%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skills always:true&lt;/td&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;6 (main) / 12 (global)&lt;/td&gt;
&lt;td&gt;-60%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API response (10 events)&lt;/td&gt;
&lt;td&gt;~4,200 tokens&lt;/td&gt;
&lt;td&gt;~800 tokens&lt;/td&gt;
&lt;td&gt;-81%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;System prompt total&lt;/td&gt;
&lt;td&gt;~25,000 tokens&lt;/td&gt;
&lt;td&gt;~20,000 tokens&lt;/td&gt;
&lt;td&gt;-20%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 20% reduction in system prompt is nice, but the real win is architectural: understanding that skills are the dominant cost and designing the multi-agent topology, the always/on-demand split, and the skill writing guidelines around that reality.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;I'd measure token consumption from day one.&lt;/strong&gt; We built 12 skills before ever counting how many tokens they consumed together. If we'd measured after the third skill, we'd have adopted the concise writing style immediately instead of rewriting everything later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I'd design the multi-agent topology earlier.&lt;/strong&gt; The decision to split agents was driven by token costs, but it should have been driven by separation of concerns. Sentinel handling all infrastructure makes sense regardless of tokens — it's a different expertise domain. We arrived at the right architecture for the wrong reason.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I'd add a token budget per skill in the manifest.&lt;/strong&gt; Right now there's no mechanism to warn when a skill exceeds a reasonable size. A &lt;code&gt;max_tokens: 3000&lt;/code&gt; field in the manifest would force skill authors (including us) to stay concise. If your skill is over budget, you need to split it or trim it.&lt;/p&gt;
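
&lt;p&gt;A rough sketch of what that check could look like, using the common ~4-characters-per-token heuristic (a real check would run the model's tokenizer; the &lt;code&gt;max_tokens&lt;/code&gt; field is the proposal above, not an existing mechanism):&lt;/p&gt;

```python
# Rough per-skill token budget check. Uses a ~4 chars/token estimate;
# a real implementation would use the model's tokenizer. The max_tokens
# manifest field is hypothetical.
def estimate_tokens(text):
    return len(text) // 4

def check_budget(skill_md, max_tokens=3000):
    used = estimate_tokens(skill_md)
    if used > max_tokens:
        raise ValueError(f"skill is ~{used} tokens, budget is {max_tokens}")
    return used

check_budget("### Create note\n- `POST /notes` body: `{title?, content, tags?[]}`\n")
```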

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;When you're building an AI agent system, the optimization hierarchy is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;System prompt size&lt;/strong&gt; (~20K tokens, 98% of most interactions) — reduce always-on skills, write concisely, use multi-agent delegation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skill activation routing&lt;/strong&gt; — load only what's needed for this specific message&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API response format&lt;/strong&gt; — compact, diff-aware, progressive disclosure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model selection&lt;/strong&gt; — sometimes the bigger model is faster because it handles context better&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Most optimization guides start at #3. The actual money is at #1.&lt;/p&gt;

&lt;p&gt;Your agent isn't wasting tokens on API responses. It's wasting them on instructions it doesn't need for this particular message. Fix the prompt, then fix the API.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next up: hybrid search with Reciprocal Rank Fusion — how we combined pgvector, tsvector, the knowledge graph, and heat scoring into a single search pipeline.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
      <category>optimization</category>
    </item>
    <item>
      <title>From JSON to compact: reducing API payloads 60% for LLM consumption</title>
      <dc:creator>Victor García</dc:creator>
      <pubDate>Thu, 19 Mar 2026 12:17:24 +0000</pubDate>
      <link>https://forem.com/micelclaw/6o-from-json-to-compact-reducing-api-payloads-60-for-llm-consumption-cgk</link>
      <guid>https://forem.com/micelclaw/6o-from-json-to-compact-reducing-api-payloads-60-for-llm-consumption-cgk</guid>
      <description>&lt;p&gt;Every time your AI agent calls your API, it pays for the response in tokens. Not just the useful data — every &lt;code&gt;"id":&lt;/code&gt;, every &lt;code&gt;"created_at":&lt;/code&gt;, every &lt;code&gt;null&lt;/code&gt; field it didn't ask for. JSON is designed for humans reading documentation, not for LLMs processing structured data.&lt;/p&gt;

&lt;p&gt;We measured our API responses before optimization. A typical "show me my events today" call returned 3 events in ~1,200 tokens. After compact format: ~280 tokens. Same information, 77% fewer tokens.&lt;/p&gt;

&lt;p&gt;This post is about how we built that format, what we tried and killed, and the surprising insight that changed how we think about API design for agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem: JSON is expensive
&lt;/h2&gt;

&lt;p&gt;Here's what a standard events response looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"550e8400-e29b-41d4-a716-446655440000"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Standup"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"location"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Discord"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"start_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-02-18T09:00:00Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"end_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-02-18T09:30:00Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"all_day"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"recurrence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"freq"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"weekly"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"days"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"mon"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"wed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"fri"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"confirmed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"google"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"source_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"abc123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"calendar_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Work"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"reminders"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"attendees"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"custom_fields"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"created_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-01-15T10:00:00Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"updated_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-02-17T10:00:00Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"synced_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-02-18T06:00:00Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"deleted_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"meta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"total"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"limit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"offset"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tier"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"free"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One event. 18 fields. Of those, the agent needs maybe 5 to answer "what's on my calendar?": title, start time, end time, location, and whether it recurs. The other 13 fields — nulls, internal IDs, sync timestamps — are noise.&lt;/p&gt;

&lt;p&gt;Multiply this by 20 events and every API call burns tokens on data the agent will never use.&lt;/p&gt;

&lt;h2&gt;
  
  
  The solution: &lt;code&gt;?format=compact&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Add one query parameter and the response transforms entirely:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET /api/v1/events?date=today&amp;amp;format=compact
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"format"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"compact"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"3 events (today)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"lines"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"09:00-09:30 Standup [Discord] 🔁 id:550e8400"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"14:00-15:00 Comida con Ana [La Mar] id:660e8400"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"📅 all-day: Cumple Mamá id:770e8400"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ids"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"550e8400-e29b-41d4-a716-446655440000"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"660e8400-e29b-41d4-a716-446655440001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"770e8400-e29b-41d4-a716-446655440002"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"meta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"total"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"limit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"offset"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"has_more"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three events in under 300 tokens. The agent can read the &lt;code&gt;summary&lt;/code&gt; line and respond "You have 3 events today" without parsing anything. If it needs to modify an event, the &lt;code&gt;ids&lt;/code&gt; array has the UUIDs in the same order as the lines.&lt;/p&gt;

&lt;p&gt;Each domain has its own line format:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Domain&lt;/th&gt;
&lt;th&gt;Line format&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Events&lt;/td&gt;
&lt;td&gt;&lt;code&gt;HH:MM-HH:MM title [location] 🔁&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;09:00-09:30 Standup [Discord] 🔁&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Notes&lt;/td&gt;
&lt;td&gt;&lt;code&gt;📌 "title" (tags) [relative_date]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;📌 "Meeting notes Q1" (trabajo) [2w ago]&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Emails&lt;/td&gt;
&lt;td&gt;&lt;code&gt;from: "subject" [FOLDER] status 📎&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Juan: "Budget Q3" [INBOX] unread 📎&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Contacts&lt;/td&gt;
&lt;td&gt;&lt;code&gt;name — company — email&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Ana García — TechCorp — ana@tech.com&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Files&lt;/td&gt;
&lt;td&gt;&lt;code&gt;filename (size) [type]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;report.pdf (2.3MB) [pdf]&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Diary&lt;/td&gt;
&lt;td&gt;&lt;code&gt;date mood "preview"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2026-02-18 😊 "Great day at the office..."&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The line templates are fixed per domain. They include exactly the fields an agent needs for a conversational summary, and nothing else.&lt;/p&gt;
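
&lt;p&gt;As a sketch, the events template can be rendered like this (field names follow the full-JSON example above; the real server-side serializer may differ):&lt;/p&gt;

```python
# Sketch of the events line template:
# "HH:MM-HH:MM title [location] 🔁 id:short" or "📅 all-day: title id:short".
# Field names follow the full-JSON example; the real serializer may differ.
def event_line(event):
    if event.get("all_day"):
        line = f"📅 all-day: {event['title']}"
    else:
        start = event["start_at"][11:16]   # "2026-02-18T09:00:00Z" → "09:00"
        end = event["end_at"][11:16]
        line = f"{start}-{end} {event['title']}"
    if event.get("location"):
        line += f" [{event['location']}]"
    if event.get("recurrence"):
        line += " 🔁"
    return line + f" id:{event['id'][:8]}"

event_line({
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "title": "Standup", "location": "Discord",
    "start_at": "2026-02-18T09:00:00Z", "end_at": "2026-02-18T09:30:00Z",
    "all_day": False, "recurrence": {"freq": "weekly"},
})  # → "09:00-09:30 Standup [Discord] 🔁 id:550e8400"
```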

&lt;h2&gt;
  
  
  What the agent does with it
&lt;/h2&gt;

&lt;p&gt;The key insight is progressive disclosure. Compact format is the first step; full JSON is the second.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: "What's on my calendar today?"
Agent: GET /events?date=today&amp;amp;format=compact
       → reads summary: "3 events (today)"
       → responds: "You have 3 events today: Standup at 9, Comida con Ana at 2, and it's Mamá's birthday."

User: "Move the lunch to 3pm"
Agent: PATCH /events/660e8400-e29b-41d4-a716-446655440001
       → body: {"start_at": "2026-02-18T15:00:00Z", "end_at": "2026-02-18T16:00:00Z"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent used compact for the listing (cheap) and a direct PATCH for the mutation (needs the UUID, which compact provides). It never needed to fetch the full JSON for any event.&lt;/p&gt;

&lt;p&gt;Every skill instructs the agent to use &lt;code&gt;?format=compact&lt;/code&gt; for listings and &lt;code&gt;GET /:id&lt;/code&gt; (full JSON) only when it needs details about a specific record. This pattern — compact for browse, full for drill-down — is progressive disclosure implemented through the API, not the UI.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we tried and killed
&lt;/h2&gt;

&lt;p&gt;Before arriving at compact format, we explored three ideas. All of them died.&lt;/p&gt;

&lt;h3&gt;
  
  
  TOON (Token-Oriented Object Notation)
&lt;/h3&gt;

&lt;p&gt;TOON is a serialization format designed for LLMs. It defines headers once and streams rows — like a TSV with a schema preamble. Research claims 30-60% fewer tokens than JSON.&lt;/p&gt;

&lt;p&gt;We evaluated it and dropped it. The savings over our compact format were marginal (~3-5 tokens per line), but TOON introduced real problems: special characters need escaping, the parser has to handle edge cases with newlines in content, and the agent needs to understand a custom format instead of reading natural-language lines.&lt;/p&gt;

&lt;p&gt;Compact lines are readable in English (or Spanish). TOON rows are not. For an LLM, readability matters more than raw compression.&lt;/p&gt;

&lt;h3&gt;
  
  
  Token Budget (&lt;code&gt;?token_budget=N&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;The idea: tell the server "I have 500 tokens left, fit the response in that." The server would progressively reduce fields, then limit results, then summarize content to fit within the budget.&lt;/p&gt;

&lt;p&gt;We killed it because it violates separation of responsibilities. The server doesn't know which fields matter to the agent in a given context. Sometimes the agent needs attendees and not descriptions. Sometimes the reverse. Making the server decide which fields to cut is a design smell.&lt;/p&gt;

&lt;p&gt;The agent already has &lt;code&gt;?fields=&lt;/code&gt;, &lt;code&gt;?limit=&lt;/code&gt;, and &lt;code&gt;?format=compact&lt;/code&gt; to control granularity. It doesn't need the server to guess on its behalf.&lt;/p&gt;

&lt;h3&gt;
  
  
  Semantic cache (pgvector similarity)
&lt;/h3&gt;

&lt;p&gt;The idea: cache API responses by query embedding. If a new query is semantically similar to a cached one, return the cached response. Research shows 31% of queries are semantically similar.&lt;/p&gt;

&lt;p&gt;For a single-user personal system, the hit rate would be 5-15% at best. And the overhead of embedding each query (50-200ms via Ollama) is slower than just running the database query directly (5-20ms). We'd be adding latency to save latency.&lt;/p&gt;

&lt;p&gt;We kept the SHA-256 exact-match cache as a possibility for the future (when the sleep-time engine starts making repetitive queries), but the semantic layer was pure overhead.&lt;/p&gt;
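
&lt;p&gt;That exact-match layer is cheap to sketch: hash the normalized request, look it up, fetch on miss. The normalization and cache policy here are illustrative, not our actual implementation:&lt;/p&gt;

```python
# Sketch of a SHA-256 exact-match response cache: key is the hash of
# the normalized request, value is the serialized response.
# Normalization and eviction policy are illustrative.
import hashlib
import json

_cache = {}

def cache_key(method, path, params):
    normalized = json.dumps([method, path, sorted(params.items())])
    return hashlib.sha256(normalized.encode()).hexdigest()

def cached(method, path, params, fetch):
    key = cache_key(method, path, params)
    if key not in _cache:
        _cache[key] = fetch()
    return _cache[key]

# Identical queries hit the cache; any parameter change misses it.
cached("GET", "/events", {"date": "today", "format": "compact"}, lambda: "...")
```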

&lt;h2&gt;
  
  
  Diff-aware responses: &lt;code&gt;?since=&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Compact format reduces the size of each response. But there's another dimension: reducing how often the agent needs to ask at all.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;?since=&lt;/code&gt; parameter lets the agent say "I last checked at 14:00 — what changed?"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET /api/v1/events?since=2026-02-18T14:00:00Z
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"created"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"updated"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"660e8400-..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Comida con Ana"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"start_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-02-18T15:00:00Z"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"deleted"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"meta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"response_timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-02-18T14:35:00Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"total_changes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent stores the &lt;code&gt;response_timestamp&lt;/code&gt; from each call and passes it as &lt;code&gt;?since=&lt;/code&gt; next time. If nothing changed, the response is essentially empty. The response_timestamp is captured before the query executes — conservative by design, so changes are never lost (at worst, the agent sees a duplicate).&lt;/p&gt;

&lt;p&gt;Combined with compact format, the pattern becomes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;First call: &lt;code&gt;GET /events?format=compact&lt;/code&gt; → full compact listing&lt;/li&gt;
&lt;li&gt;Subsequent calls: &lt;code&gt;GET /events?since=&amp;lt;last_timestamp&amp;gt;&lt;/code&gt; → only changes&lt;/li&gt;
&lt;li&gt;If changes exist and agent needs context: &lt;code&gt;GET /events/:id&lt;/code&gt; → full detail on specific records&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For a system where the agent polls every 30 minutes and maybe 2 records changed, this turns a 1,200-token response into a 150-token one.&lt;/p&gt;
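&lt;p&gt;The loop can be sketched in a few lines of TypeScript. The &lt;code&gt;nextPollUrl&lt;/code&gt; helper is hypothetical; the endpoint and parameters follow the examples above:&lt;/p&gt;

```typescript
// Incremental polling: full compact listing on the first call,
// diffs only on every subsequent one. `nextPollUrl` is a sketch,
// not the project's actual client code.
function nextPollUrl(base: string, lastTimestamp: string | null): string {
  return lastTimestamp === null
    ? `${base}/events?format=compact`
    : `${base}/events?since=${encodeURIComponent(lastTimestamp)}`;
}

// The agent stores meta.response_timestamp from each response and
// feeds it back in as ?since= on the next poll.
```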

&lt;h2&gt;
  
  
  Token counting: the &lt;code&gt;X-Token-Count&lt;/code&gt; header
&lt;/h2&gt;

&lt;p&gt;Every response includes an &lt;code&gt;X-Token-Count&lt;/code&gt; header with the approximate token count of the response body. It's a heuristic (±15% accuracy), not an exact count — we use character-based estimation rather than running a tokenizer on every response.&lt;/p&gt;

&lt;p&gt;The agent doesn't make binary decisions based on this number. It's informational — "this response cost me approximately 450 tokens" — so the agent can track its consumption over time and adjust strategy. If it notices that contacts responses are consistently heavy, it might switch to compact or reduce limits.&lt;/p&gt;

&lt;p&gt;We briefly considered using tiktoken for exact counting and decided against it. The precision wasn't worth the overhead on every response, and the agent doesn't need exact numbers — it needs a signal.&lt;/p&gt;
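&lt;p&gt;A character-based estimator of this kind is a one-liner. The divisor of 4 (roughly four characters per token for English text and JSON) is an assumption for illustration, not necessarily the system's calibrated value:&lt;/p&gt;

```typescript
// Rough token estimate: ~4 characters per token. Cheap enough to
// run on every response, accurate enough to serve as a signal.
function estimateTokens(body: string): number {
  return Math.ceil(body.length / 4);
}

// e.g. res.setHeader("X-Token-Count", String(estimateTokens(json)));
```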

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;p&gt;Measured across 7 domain endpoints with realistic data volumes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Domain&lt;/th&gt;
&lt;th&gt;JSON (tokens)&lt;/th&gt;
&lt;th&gt;Compact (tokens)&lt;/th&gt;
&lt;th&gt;Reduction&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Events (10)&lt;/td&gt;
&lt;td&gt;~4,200&lt;/td&gt;
&lt;td&gt;~800&lt;/td&gt;
&lt;td&gt;81%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Notes (10)&lt;/td&gt;
&lt;td&gt;~5,800&lt;/td&gt;
&lt;td&gt;~1,200&lt;/td&gt;
&lt;td&gt;79%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Emails (10)&lt;/td&gt;
&lt;td&gt;~8,500&lt;/td&gt;
&lt;td&gt;~1,500&lt;/td&gt;
&lt;td&gt;82%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Contacts (10)&lt;/td&gt;
&lt;td&gt;~3,200&lt;/td&gt;
&lt;td&gt;~900&lt;/td&gt;
&lt;td&gt;72%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Files (10)&lt;/td&gt;
&lt;td&gt;~2,800&lt;/td&gt;
&lt;td&gt;~700&lt;/td&gt;
&lt;td&gt;75%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Diary (10)&lt;/td&gt;
&lt;td&gt;~4,000&lt;/td&gt;
&lt;td&gt;~1,100&lt;/td&gt;
&lt;td&gt;73%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Search (10)&lt;/td&gt;
&lt;td&gt;~6,000&lt;/td&gt;
&lt;td&gt;~1,000&lt;/td&gt;
&lt;td&gt;83%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Average reduction: ~78%. The title says 60% because that's the conservative number for small result sets (3-5 records). As result sets grow, the savings compound.&lt;/p&gt;

&lt;p&gt;Emails save the most because they have the most fields (30+ columns, most of which are null for any given record). Contacts save the least because they're already relatively compact in JSON.&lt;/p&gt;

&lt;h2&gt;
  
  
  Design decisions worth noting
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Compact and &lt;code&gt;?fields=&lt;/code&gt; are mutually exclusive.&lt;/strong&gt; Compact templates are fixed per domain. Allowing custom field selection within compact would mean building a dynamic template engine — complexity for a feature nobody asked for. If you need specific fields, use &lt;code&gt;?fields=&lt;/code&gt;. If you need minimal tokens, use &lt;code&gt;?format=compact&lt;/code&gt;. Not both.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compact is ignored silently on non-list endpoints.&lt;/strong&gt; &lt;code&gt;GET /notes/:id?format=compact&lt;/code&gt; returns normal JSON. No error, no warning. The agent shouldn't need to remember which endpoints support which formats — it just adds &lt;code&gt;format=compact&lt;/code&gt; to everything and the server does the right thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Errors are always full JSON.&lt;/strong&gt; Even with &lt;code&gt;?format=compact&lt;/code&gt;, errors return the standard error envelope with code, message, details, and hint. The agent needs structured error information more than it needs minimal tokens in failure cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The &lt;code&gt;ids&lt;/code&gt; array is the bridge.&lt;/strong&gt; The compact lines are for the LLM to read and summarize. The &lt;code&gt;ids&lt;/code&gt; array is for the LLM to act — it maps 1:1 to the lines, so "the second event" maps to &lt;code&gt;ids[1]&lt;/code&gt;. This dual-track design (human-readable lines + machine-actionable IDs) is what makes compact format actually useful rather than just cheap.&lt;/p&gt;
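&lt;p&gt;Concretely, the dual-track shape can be sketched like this. Only the &lt;code&gt;ids&lt;/code&gt; array is described above; the &lt;code&gt;lines&lt;/code&gt; field name is illustrative:&lt;/p&gt;

```typescript
// Compact list response: human-readable lines plus a parallel ids
// array. Index i of `lines` corresponds to ids[i].
type CompactList = { lines: string[]; ids: string[] };

// "the second event" → ordinal 2 → ids[1]
function resolveOrdinal(res: CompactList, ordinal: number): string {
  return res.ids[ordinal - 1];
}
```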

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;I'd build compact format before anything else.&lt;/strong&gt; We built it during the API Intelligence phase (Cluster A), but it should have been in Phase 1. Every skill we wrote before compact format included verbose JSON examples that the agent had to parse. The day we shipped compact, we updated every skill to use it and immediately saw lower latency and better agent responses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I'd kill more ideas faster.&lt;/strong&gt; TOON, token budget, and semantic cache each took a design session to spec out and a decision to kill. The design sessions weren't wasted — they clarified what we actually needed — but we could have killed them in 30 minutes of back-of-envelope calculation instead of writing full specs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;If you're building an API that LLM agents will consume, you're probably shipping too much data. JSON is great for browsers and SDKs. It's terrible for language models that pay per token.&lt;/p&gt;

&lt;p&gt;The fix is embarrassingly simple: a query parameter that switches the response format from structured objects to human-readable lines with an ID array. Fixed templates per domain. No custom serializer, no protocol change, no content negotiation. Just a different &lt;code&gt;if&lt;/code&gt; branch in your response handler.&lt;/p&gt;
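&lt;p&gt;A minimal sketch of that branch, with an illustrative type and template rather than the actual handler:&lt;/p&gt;

```typescript
type Note = { id: string; title: string; updated_at: string };

// The whole feature: one branch in the serializer. Compact returns
// fixed-template lines plus a parallel ids array; anything else gets
// the normal JSON envelope.
function serializeNotes(notes: Note[], format: string | undefined) {
  if (format === "compact") {
    return {
      lines: notes.map((n) => `${n.title} (${n.updated_at})`),
      ids: notes.map((n) => n.id),
    };
  }
  return { data: notes };
}
```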

&lt;p&gt;The deeper insight: the agent doesn't need your data model. It needs a summary it can repeat to the user and IDs it can use to take action. Everything else is overhead.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next up: your AI agent is wasting 90% of its tokens — and it's not the API's fault. It's the skills.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>api</category>
      <category>ai</category>
      <category>typescript</category>
      <category>architecture</category>
    </item>
    <item>
<title>Heat scoring: teaching your data to forget (gracefully)</title>
      <dc:creator>Victor García</dc:creator>
      <pubDate>Wed, 18 Mar 2026 17:26:00 +0000</pubDate>
      <link>https://forem.com/micelclaw/heat-scoring-teaching-your-data-to-forget-gracefully-1093</link>
      <guid>https://forem.com/micelclaw/heat-scoring-teaching-your-data-to-forget-gracefully-1093</guid>
      <description>&lt;p&gt;Here's something that bothered us early on: a search for "project meeting" returned every meeting note we'd ever written, sorted by relevance. The note from this morning and the one from eight months ago scored identically — because semantically, they're equally "about project meetings."&lt;/p&gt;

&lt;p&gt;But they're not equally useful. The one from this morning matters right now. The one from eight months ago is a fossil. We needed a way to express that difference without breaking search when someone genuinely wants to find old records.&lt;/p&gt;

&lt;p&gt;The solution was to give every record a temperature.&lt;/p&gt;

&lt;h2&gt;
  
  
  The concept
&lt;/h2&gt;

&lt;p&gt;Every record in every domain table — notes, events, contacts, emails, files, diary entries — has a heat score. It's a number between 0 and 1 that answers one question: &lt;strong&gt;how relevant is this record right now?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Heat rises when you interact with a record (open it, edit it, create it). Heat decays exponentially over time when you don't. A note you wrote this morning is hot. A note you haven't touched in six months is cold. A contact you email every week stays permanently warm.&lt;/p&gt;

&lt;p&gt;Three tiers classify records:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Heat score&lt;/th&gt;
&lt;th&gt;What it means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Hot&lt;/td&gt;
&lt;td&gt;&amp;gt; 0.7&lt;/td&gt;
&lt;td&gt;Actively used right now&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Warm&lt;/td&gt;
&lt;td&gt;&amp;gt; 0.2&lt;/td&gt;
&lt;td&gt;Used recently-ish&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cold&lt;/td&gt;
&lt;td&gt;≤ 0.2&lt;/td&gt;
&lt;td&gt;Hasn't been touched in a while&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The idea is inspired by MemoryOS (EMNLP 2025), which applies memory tiers to conversational AI. We adapted it to structured personal data — records instead of chat messages, database rows instead of context windows.&lt;/p&gt;

&lt;h2&gt;
  
  
  The formula
&lt;/h2&gt;

&lt;p&gt;The heat score combines two signals: how often you've accessed a record, and how recently.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;effective_access = access_count × max(0, 1 - (hours_since_last_increment / 4383))
raw_heat = effective_access × e^(-λ × hours_since_last_access)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where &lt;code&gt;λ = ln(2) / 168&lt;/code&gt; — a half-life of one week (168 hours).&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;effective_access&lt;/code&gt; part handles frequency. A contact you've accessed 100 times counts for a lot — but that count itself decays linearly over a year. Six months of neglect and those 100 accesses are worth 50. A year of neglect and they're worth zero. This prevents records from staying hot forever just because they were used intensively once.&lt;/p&gt;

&lt;p&gt;The exponential decay part handles recency. Something accessed right now has full heat. Something accessed a week ago has half. Two weeks ago, a quarter. The exponential curve means heat drops fast in the first few days, then slowly flattens out.&lt;/p&gt;

&lt;p&gt;The combined effect is interesting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Contact accessed 100 times, last access 6 months ago:&lt;/strong&gt; effective_access ≈ 50, but the exponential decay from 6 months of recency drives heat_score close to 0. It's cold.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Note accessed 3 times, last access today:&lt;/strong&gt; effective_access = 3, multiplied by nearly full recency. Heat_score is moderate. It's warm, trending hot.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Email accessed once, created this morning:&lt;/strong&gt; effective_access = 1, recency nearly full. It sits at the low end of warm; one more access tips it into solid warm.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Normalization: the 0-1 range
&lt;/h3&gt;

&lt;p&gt;The raw formula produces unbounded values — a record accessed 50 times today gives a raw heat of ~50. That's useless for ranking. So we normalize:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;normalized = raw_heat / SCALE_FACTOR    // SCALE_FACTOR = 10.0 (calibrated)
heat_score = normalized ≤ 0.8 ? normalized : 0.8 + (normalized - 0.8) × 0.5
heat_score = min(heat_score, 1.0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The soft cap at 0.8 is deliberate. Getting from 0.0 to 0.8 takes normal usage. Getting from 0.8 to 1.0 takes double the activity. This prevents "thermal saturation" — a record you obsessively check doesn't permanently peg the meter at 1.0, leaving room for other records to be relatively hotter.&lt;/p&gt;

&lt;p&gt;The hard cap at 1.0 is enforced in the database: &lt;code&gt;CHECK (heat_score &amp;gt;= 0 AND heat_score &amp;lt;= 1)&lt;/code&gt;.&lt;/p&gt;
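&lt;p&gt;Put together, the whole computation fits in one function. This is a direct transcription of the formulas above, not the production code:&lt;/p&gt;

```typescript
const LAMBDA = Math.log(2) / 168; // recency half-life: one week (168 h)
const COUNT_DECAY_HOURS = 8766;   // access_count decays linearly over ~a year
const SCALE_FACTOR = 10.0;        // calibrated normalization divisor

function heatScore(
  accessCount: number,
  hoursSinceLastIncrement: number,
  hoursSinceLastAccess: number,
): number {
  const effectiveAccess =
    accessCount * Math.max(0, 1 - hoursSinceLastIncrement / COUNT_DECAY_HOURS);
  const rawHeat = effectiveAccess * Math.exp(-LAMBDA * hoursSinceLastAccess);
  const normalized = rawHeat / SCALE_FACTOR;
  // Soft cap: progress beyond 0.8 is half as fast; hard cap at 1.0.
  const soft = normalized <= 0.8 ? normalized : 0.8 + (normalized - 0.8) * 0.5;
  return Math.min(soft, 1.0);
}
```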

&lt;h2&gt;
  
  
  The table
&lt;/h2&gt;

&lt;p&gt;Heat data lives in its own table, not in the domain tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;record_heat&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;              &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;gen_random_uuid&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="k"&gt;domain&lt;/span&gt;          &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;record_id&lt;/span&gt;       &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;access_count&lt;/span&gt;    &lt;span class="nb"&gt;INTEGER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;last_accessed&lt;/span&gt;   &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;last_increment&lt;/span&gt;  &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;heat_score&lt;/span&gt;      &lt;span class="nb"&gt;REAL&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;memory_tier&lt;/span&gt;     &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="s1"&gt;'cold'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt;      &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;updated_at&lt;/span&gt;      &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="k"&gt;UNIQUE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;record_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same pattern as the embeddings table: &lt;code&gt;(domain, record_id)&lt;/code&gt; instead of foreign keys. No cascading deletes to manage — when a note is deleted, the heat row becomes an orphan that the cleanup cron handles.&lt;/p&gt;

&lt;p&gt;Why a separate table instead of adding columns to each domain table? Because heat is cross-cutting. Every domain behaves the same way. A single cron job, a single update function, a single query pattern. If heat lived in 7+ domain tables, every change to the formula would be 7+ ALTER statements.&lt;/p&gt;

&lt;h2&gt;
  
  
  When heat changes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  On access (instant)
&lt;/h3&gt;

&lt;p&gt;Every time you open a record — &lt;code&gt;GET /notes/:id&lt;/code&gt;, &lt;code&gt;GET /contacts/:id&lt;/code&gt;, whatever — a post-handler fires:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Increment &lt;code&gt;access_count&lt;/code&gt; by 1&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;last_accessed&lt;/code&gt; to now&lt;/li&gt;
&lt;li&gt;Recalculate &lt;code&gt;heat_score&lt;/code&gt; and &lt;code&gt;memory_tier&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Upsert into &lt;code&gt;record_heat&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's one SQL query. Under 1 millisecond. The user never notices.&lt;/p&gt;

&lt;p&gt;Creating or editing a record also bumps heat. A record you just wrote or just modified reflects active intent — it should be hot.&lt;/p&gt;

&lt;p&gt;Deleting doesn't bump heat. The decay cron will cool it naturally.&lt;/p&gt;
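&lt;p&gt;The on-access path (steps 1–4 above) can be sketched in memory. In production it is a single &lt;code&gt;INSERT ... ON CONFLICT&lt;/code&gt; against &lt;code&gt;record_heat&lt;/code&gt;; here a &lt;code&gt;Map&lt;/code&gt; stands in for the table, and the score calculation is left abstract:&lt;/p&gt;

```typescript
type Tier = "hot" | "warm" | "cold";
type HeatRow = { accessCount: number; lastAccessed: Date; heatScore: number; memoryTier: Tier };

// Tier thresholds from the table earlier in the post.
function tierFor(score: number): Tier {
  return score > 0.7 ? "hot" : score > 0.2 ? "warm" : "cold";
}

// Stand-in for the record_heat table, keyed by (domain, record_id).
const heat = new Map<string, HeatRow>();

// Fired by the post-handler after GET /notes/:id, GET /contacts/:id, etc.
// `score` is the freshly recalculated heat_score.
function touch(domain: string, recordId: string, score: number): HeatRow {
  const key = `${domain}:${recordId}`;
  const prev = heat.get(key);
  const row: HeatRow = {
    accessCount: (prev?.accessCount ?? 0) + 1, // step 1
    lastAccessed: new Date(),                  // step 2
    heatScore: score,                          // step 3
    memoryTier: tierFor(score),
  };
  heat.set(key, row);                          // step 4: upsert
  return row;
}
```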

&lt;h3&gt;
  
  
  On cron (every 6 hours)
&lt;/h3&gt;

&lt;p&gt;A scheduled job recalculates heat for every row in &lt;code&gt;record_heat&lt;/code&gt;. This is necessary because the exponential decay is time-based — without periodic recalculation, a record that nobody accesses would stay at whatever heat it had when last touched. The cron ensures everything drifts toward cold over time.&lt;/p&gt;

&lt;p&gt;The cron also applies the linear decay to &lt;code&gt;access_count&lt;/code&gt; — proportionally reducing it based on time elapsed since the last recalculation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Source tracking
&lt;/h3&gt;

&lt;p&gt;Not all accesses are equal. We track the source of each heat event:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;HeatSource&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;user_dash&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;agent_primary&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;agent_creative&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; 
                &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sleep_time&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sync&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;unknown&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A user opening a contact in the dashboard is a strong signal of interest. The sleep-time engine scanning the same contact for background analysis is not. The heat system accepts the source parameter so that downstream consumers (like the proactive preference learner) can distinguish genuine human interest from automated access.&lt;/p&gt;

&lt;p&gt;Sync imports get lower initial heat than user-created records, because the user didn't actively choose to create them.&lt;/p&gt;

&lt;h2&gt;
  
  
  How heat changes search
&lt;/h2&gt;

&lt;p&gt;Heat never filters search results. If you search for "apartment lease" and the only match is a cold note from a year ago, it absolutely must appear. Hiding results because they're cold would be insane.&lt;/p&gt;

&lt;p&gt;Instead, heat acts as a &lt;strong&gt;post-fusion tiebreaker&lt;/strong&gt; in the search ranking algorithm:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;final_score = rrf_score × (1 + 0.1 × heat_score)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is step 4 in the Reciprocal Rank Fusion pipeline. The &lt;code&gt;0.1&lt;/code&gt; multiplier means heat can boost a result by at most 10%. It breaks ties between equally relevant results — the hotter one wins — but it can never override genuine relevance. A cold but highly relevant result still beats a hot but marginally relevant one.&lt;/p&gt;
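&lt;p&gt;As code, the tiebreaker is a single expression:&lt;/p&gt;

```typescript
// Post-fusion heat boost: at most +10%, so heat breaks ties between
// comparably relevant results but never overrides relevance.
function finalScore(rrfScore: number, heatScore: number): number {
  return rrfScore * (1 + 0.1 * heatScore);
}
```

&lt;p&gt;Two results with the same RRF score are separated by heat, but a cold result with a clearly higher RRF score still wins.&lt;/p&gt;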

&lt;p&gt;In the advanced search mode, heat becomes one of four independently weighted signals:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;score = (α × heat + β × semantic + γ × fulltext + δ × graph) / (α + β + γ + δ)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The user can drag sliders to adjust the weight of each signal. Want to find recent-activity-first? Crank heat to 1.0. Want to find the most semantically relevant regardless of age? Drop heat to 0. The weights are transparent — every search result shows its raw scores for all four signals.&lt;/p&gt;
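&lt;p&gt;The advanced-mode blend is the same idea with user-controlled weights (field names here are illustrative):&lt;/p&gt;

```typescript
type Signals = { heat: number; semantic: number; fulltext: number; graph: number };

// Weighted average of the four signals; weights come from the
// dashboard sliders. Dividing by the weight sum keeps the result in
// the same 0-1 range as the inputs.
function advancedScore(w: Signals, s: Signals): number {
  const total = w.heat + w.semantic + w.fulltext + w.graph;
  return (
    (w.heat * s.heat +
      w.semantic * s.semantic +
      w.fulltext * s.fulltext +
      w.graph * s.graph) / total
  );
}
```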

&lt;h2&gt;
  
  
  How heat changes the digest
&lt;/h2&gt;

&lt;p&gt;The Digest Engine — the background job that compiles "here's what changed since you last checked" — uses heat in two ways:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prioritization.&lt;/strong&gt; When 50 things changed overnight, the digest needs to decide what to mention first. Heat-weighted changes surface near the top. "Ana García, who you've been emailing all week, sent a new message" ranks above "A contact you haven't opened in 4 months was synced."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cold filtering.&lt;/strong&gt; By default, the search endpoint excludes records in the "accessed but cold" zone (heat &amp;gt; 0 but &amp;lt; 0.2). These are records you interacted with once and then forgot. They still exist and are findable if you search explicitly — but they don't clutter routine results. A &lt;code&gt;?include_cold=true&lt;/code&gt; parameter brings them back for exhaustive searches.&lt;/p&gt;

&lt;h2&gt;
  
  
  How heat changes the API
&lt;/h2&gt;

&lt;p&gt;Every list endpoint gains a &lt;code&gt;?tier=&lt;/code&gt; filter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET /notes?tier=hot          → only hot notes
GET /contacts?tier=warm      → warm + hot contacts
GET /events?tier=cold        → only cold events (rarely useful, but available)
GET /notes                   → all notes (no change from before)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AI agent uses this heavily. When the user says "what's on my plate?", the agent queries &lt;code&gt;?tier=hot&lt;/code&gt; across all domains to get a quick snapshot of active items without wading through the entire database. When the user says "find that contract from last year", the agent searches without tier filtering because cold records are exactly what's needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The edge case: heat on edges
&lt;/h2&gt;

&lt;p&gt;Records aren't the only thing with heat. Graph edges — the connections between entities in the knowledge graph — also have a heat score.&lt;/p&gt;

&lt;p&gt;When you open a note and then immediately open a related email (within 60 seconds), the edge between those two records gets warmer. The dashboard tracks these co-navigation patterns automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;lastViewedRecord&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;lastViewedRecord&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/links/heat-edge&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;from_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;lastViewedRecord&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;from_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;lastViewedRecord&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;to_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;currentRecord&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;to_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;currentRecord&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hot edges become visible in the graph visualization — glowing connections between frequently co-accessed records. Cold edges fade until they're almost invisible. It turns the knowledge graph into a heat map of your actual workflow, not just a static web of extracted entities.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;I'd start with simpler thresholds and tune later.&lt;/strong&gt; We went through three iterations of tier boundaries (the original design had unbounded scores with thresholds at 5.0 and 0.5, then we normalized to 0-1 with a soft cap, then adjusted thresholds to 0.7/0.2). Starting with "anything accessed in the last week is hot, anything in the last month is warm, everything else is cold" would have been fine for v1.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I'd add heat to the search provenance from day one.&lt;/strong&gt; Heat was initially invisible in search results — you could feel its effect on ranking but couldn't see the raw score. Adding provenance (the breakdown of all four search signals) happened later, and it was immediately obvious it should have been there from the start. Transparency in ranking builds trust.&lt;/p&gt;

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;Heat scoring is maybe 150 lines of code: one table, one upsert function, one cron job, one post-fusion multiplier. But it fundamentally changes the system's relationship with time.&lt;/p&gt;

&lt;p&gt;Without heat, every record is equally present. Your database is a flat archive — 2,000 notes, all with the same weight. With heat, the database has a sense of "now." Recent activity floats to the surface. Old activity sinks naturally. The AI agent, the search engine, and the digest all benefit from the same signal: what matters to you right now.&lt;/p&gt;

&lt;p&gt;The formula doesn't need to be clever. Exponential decay with a weekly half-life and a soft cap. That's it. The cleverness is in where you apply it — as a tiebreaker in search, as a filter in the agent's queries, as a priority signal in the digest, as a visual indicator in the dashboard.&lt;/p&gt;

&lt;p&gt;Data that knows how to forget is data that stays useful.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next up: from JSON to compact — how we reduced API payloads by 60% for LLM consumption, and why your AI agent is wasting most of its tokens on field names.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>architecture</category>
      <category>ai</category>
      <category>search</category>
    </item>
    <item>
<title>Building a personal knowledge graph with just PostgreSQL (no Neo4j needed)</title>
      <dc:creator>Victor García</dc:creator>
      <pubDate>Wed, 18 Mar 2026 13:04:24 +0000</pubDate>
      <link>https://forem.com/micelclaw/4o-building-a-personal-knowledge-graph-with-just-postgresql-no-neo4j-needed-22b2</link>
      <guid>https://forem.com/micelclaw/4o-building-a-personal-knowledge-graph-with-just-postgresql-no-neo4j-needed-22b2</guid>
      <description>&lt;p&gt;At some point during development, we needed to answer questions that search couldn't: "Who's connected to this project?" "What links this email to tomorrow's meeting?" "Which people keep appearing together across my notes?"&lt;/p&gt;

&lt;p&gt;Semantic search finds records by meaning. Full-text search finds them by keywords. But neither can traverse relationships. For that, you need a graph.&lt;/p&gt;

&lt;p&gt;The obvious choice was Neo4j, or at least Apache AGE (which adds Cypher queries to PostgreSQL). We evaluated both and went with... two tables and &lt;code&gt;WITH RECURSIVE&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This post explains why, and shows the actual schema, queries, and API that power the knowledge graph.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why not a graph database?
&lt;/h2&gt;

&lt;p&gt;We seriously considered Apache AGE. It adds OpenCypher support directly inside PostgreSQL — no separate service, same database. The competitive analysis even had it as "Innovation #2: Bi-temporal personal knowledge graph via Apache AGE."&lt;/p&gt;

&lt;p&gt;We rejected it for three reasons:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scale.&lt;/strong&gt; This is a personal system. One user, maybe a family. We're talking about fewer than 10,000 entity nodes and 50,000 edges. At that scale, a recursive CTE with depth 2-3 takes microseconds. The performance argument for a graph engine simply doesn't exist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dependencies.&lt;/strong&gt; Apache AGE is a PostgreSQL extension that needs to be compiled and installed. On a bare-metal mini-PC with different architectures (x86, ARM), that's a maintenance headache. Our system already depends on two extensions (pgcrypto and pgvector). Adding a third for a feature that SQL handles natively felt wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Type safety.&lt;/strong&gt; We use Drizzle ORM for everything. Apache AGE queries return untyped results from Cypher strings embedded in SQL. Our recursive CTEs return typed rows that Drizzle understands. No impedance mismatch, no parsing layer, no surprise nulls.&lt;/p&gt;

&lt;p&gt;The decision was documented as "ADR B-4: Apache AGE rejected in favor of SQL-native approach with recursive CTEs." The roadmap originally referenced AGE — we had to go back and update it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The schema: two tables
&lt;/h2&gt;

&lt;p&gt;The entire knowledge graph lives in two tables. One for nodes, one for edges.&lt;/p&gt;

&lt;h3&gt;
  
  
  Nodes: &lt;code&gt;graph_entities&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;graph_entities&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;                &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;gen_random_uuid&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;entity_type&lt;/span&gt;       &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;-- person, project, location, topic&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;              &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="c1"&gt;-- "Ana García", "Micelclaw OS", "Zaragoza"&lt;/span&gt;
    &lt;span class="n"&gt;normalized_name&lt;/span&gt;   &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="c1"&gt;-- "ana garcia", "micelclaw os", "zaragoza"&lt;/span&gt;
    &lt;span class="n"&gt;properties&lt;/span&gt;        &lt;span class="n"&gt;JSONB&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="s1"&gt;'{}'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;-- {email, company, role, coordinates...}&lt;/span&gt;
    &lt;span class="n"&gt;merge_history&lt;/span&gt;     &lt;span class="n"&gt;JSONB&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="s1"&gt;'[]'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;-- [{merged_from, date, reason}]&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt;        &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;updated_at&lt;/span&gt;        &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;deleted_at&lt;/span&gt;        &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;UNIQUE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;entity_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;normalized_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four entity types: &lt;code&gt;person&lt;/code&gt;, &lt;code&gt;project&lt;/code&gt;, &lt;code&gt;location&lt;/code&gt;, &lt;code&gt;topic&lt;/code&gt;. That's it. We considered adding more (organization, event, document) and decided against it. Four types cover 95% of real-world connections in personal data. The &lt;code&gt;properties&lt;/code&gt; JSONB handles the rest — a person can have an email and company, a location can have coordinates.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;normalized_name&lt;/code&gt; column is the key to entity resolution: lowercase, no accents, trimmed. When the extraction pipeline finds "Ana García" in a note and "ana garcia" in an email, the UNIQUE constraint ensures they map to the same node.&lt;/p&gt;
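
&lt;p&gt;The normalization itself fits in a few lines of TypeScript (the function name is illustrative; the real pipeline may differ in details):&lt;/p&gt;

```typescript
// Lowercase, strip accents via Unicode decomposition, trim, collapse spaces.
function normalizeName(name: string): string {
  return name
    .normalize("NFD")                 // "García" becomes "Garci" + combining accent + "a"
    .replace(/[\u0300-\u036f]/g, "")  // drop the combining diacritical marks
    .toLowerCase()
    .trim()
    .replace(/\s+/g, " ");            // collapse internal runs of whitespace
}

normalizeName("  Ana García "); // → "ana garcia"
```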

&lt;h3&gt;
  
  
  Edges: &lt;code&gt;entity_links&lt;/code&gt; (extended)
&lt;/h3&gt;

&lt;p&gt;We already had an &lt;code&gt;entity_links&lt;/code&gt; table from the initial schema — it was one of the original 13 tables. It stored simple connections like "this note mentions this contact." For the knowledge graph, we extended it with three columns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;entity_links&lt;/span&gt; &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;link_type&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="s1"&gt;'manual'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- 'manual' | 'extracted' | 'inferred' | 'structural'&lt;/span&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;entity_links&lt;/span&gt; &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="nb"&gt;REAL&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- 0.0 to 1.0&lt;/span&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;entity_links&lt;/span&gt; &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;created_by&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="s1"&gt;'system'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- 'user' | 'llm' | 'system'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three ALTER statements. That's the entire migration for turning a flat links table into a knowledge graph edge store.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;source_type&lt;/code&gt; and &lt;code&gt;target_type&lt;/code&gt; columns (VARCHAR, no CHECK constraint) now accept &lt;code&gt;graph_entity&lt;/code&gt; as a valid type. This means an edge can connect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A note → a graph entity (Person mentioned in text)&lt;/li&gt;
&lt;li&gt;A graph entity → a graph entity (Person works at Project)&lt;/li&gt;
&lt;li&gt;An event → a graph entity (Event located in City)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The relationship taxonomy follows a subject → verb → object convention:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Relationship&lt;/th&gt;
&lt;th&gt;Direction&lt;/th&gt;
&lt;th&gt;Created by&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;mentions&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;record → entity&lt;/td&gt;
&lt;td&gt;LLM extraction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;attended_by&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;event → person&lt;/td&gt;
&lt;td&gt;Sync (calendar attendees)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;located_in&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;record → location&lt;/td&gt;
&lt;td&gt;LLM extraction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;works_at&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;person → project/org&lt;/td&gt;
&lt;td&gt;LLM extraction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;collaborates_with&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;person → person&lt;/td&gt;
&lt;td&gt;Inferred from co-occurrence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;relates_to&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;entity → entity&lt;/td&gt;
&lt;td&gt;Manual or sleep-time engine&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We also reserved three relationships for future use: &lt;code&gt;contradicts&lt;/code&gt;, &lt;code&gt;follows_up&lt;/code&gt;, &lt;code&gt;supersedes&lt;/code&gt; — for a Zettelkasten-style auto-linking feature that isn't built yet but whose namespace is already protected.&lt;/p&gt;

&lt;h2&gt;
  
  
  The queries: recursive CTEs
&lt;/h2&gt;

&lt;p&gt;Three query patterns cover every graph operation we need.&lt;/p&gt;

&lt;h3&gt;
  
  
  Expansion: "Who is connected to Ana García?"
&lt;/h3&gt;

&lt;p&gt;Given a node, find all neighbors up to depth N:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="k"&gt;RECURSIVE&lt;/span&gt; &lt;span class="n"&gt;graph_walk&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="c1"&gt;-- Base: direct connections from the starting entity&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt;
        &lt;span class="n"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relationship&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;entity_links&lt;/span&gt; &lt;span class="n"&gt;el&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;source_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'graph_entity'&lt;/span&gt;
      &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;source_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;entity_id&lt;/span&gt;
      &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;

    &lt;span class="k"&gt;UNION&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt;

    &lt;span class="c1"&gt;-- Recursive: follow edges from discovered nodes&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt;
        &lt;span class="n"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relationship&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;gw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;graph_walk&lt;/span&gt; &lt;span class="n"&gt;gw&lt;/span&gt;
    &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;entity_links&lt;/span&gt; &lt;span class="n"&gt;el&lt;/span&gt;
        &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;source_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target_type&lt;/span&gt;
       &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;source_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target_id&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;gw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;max_depth&lt;/span&gt;    &lt;span class="c1"&gt;-- default: 2&lt;/span&gt;
      &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;graph_walk&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;target_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At depth 2 with 10K nodes, this query returns in under 10ms. We set the default to 2 and the maximum to 3. Depth 3 is rarely useful for personal data — it usually returns the entire graph.&lt;/p&gt;

&lt;h3&gt;
  
  
  Path finding: "What connects this email to that meeting?"
&lt;/h3&gt;

&lt;p&gt;BFS to find the shortest path between two nodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="k"&gt;RECURSIVE&lt;/span&gt; &lt;span class="n"&gt;path_search&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt;
        &lt;span class="n"&gt;ARRAY&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;source_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;target_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;target_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;entity_links&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;source_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;from_id&lt;/span&gt;

    &lt;span class="k"&gt;UNION&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt;

    &lt;span class="k"&gt;SELECT&lt;/span&gt;
        &lt;span class="n"&gt;ps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;source_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;ps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;path_search&lt;/span&gt; &lt;span class="n"&gt;ps&lt;/span&gt;
    &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;entity_links&lt;/span&gt; &lt;span class="n"&gt;el&lt;/span&gt;
        &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;source_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target_id&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;max_depth&lt;/span&gt;    &lt;span class="c1"&gt;-- default: 4&lt;/span&gt;
      &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;source_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;ANY&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;-- cycle prevention&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;path_search&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;target_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;to_id&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;NOT (el.source_id = ANY(ps.path))&lt;/code&gt; clause prevents infinite loops. The &lt;code&gt;LIMIT 1&lt;/code&gt; with &lt;code&gt;ORDER BY depth&lt;/code&gt; gives us the shortest path.&lt;/p&gt;

&lt;h3&gt;
  
  
  Subgraph: "Show me everything around this project"
&lt;/h3&gt;

&lt;p&gt;For the visualization view in the dashboard (a force-directed graph using &lt;code&gt;react-force-graph-2d&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Get the top N entities by mention count, centered on an optional entity&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;ge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;entity_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;mention_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;heat_score&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;graph_entities&lt;/span&gt; &lt;span class="n"&gt;ge&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;entity_links&lt;/span&gt; &lt;span class="n"&gt;el&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'graph_entity'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;source_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'graph_entity'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;source_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;record_heat&lt;/span&gt; &lt;span class="n"&gt;rh&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;rh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;domain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'graph_entity'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;rh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;record_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;deleted_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;ge&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;heat_score&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;mention_count&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="k"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then a second query fetches all edges between the returned nodes. The dashboard renders nodes sized by mention count and colored by heat score — hot nodes glow, cold nodes fade.&lt;/p&gt;

&lt;h2&gt;
  
  
  How nodes get created
&lt;/h2&gt;

&lt;p&gt;Nodes enter the graph through three pipelines, only one of which involves an LLM:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Contacts → Person nodes (no LLM)
&lt;/h3&gt;

&lt;p&gt;When a contact is created or synced from Google, the CRUD hook creates a Person node directly from structured data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Contact {display_name: "Ana García", company: "TechCorp", emails: [{address: "ana@techcorp.com"}]}
    ↓
graph_entities {entity_type: "person", name: "Ana García", normalized_name: "ana garcia",
                properties: {email: "ana@techcorp.com", company: "TechCorp"}}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No model needed. The data is already structured.&lt;/p&gt;
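
&lt;p&gt;The hook reduces to a pure mapping, sketched here in TypeScript. The interfaces and function name are illustrative, not the actual codebase:&lt;/p&gt;

```typescript
// Hedged sketch of the CRUD hook's mapping: a synced contact becomes a
// Person node directly from structured fields, with no LLM involved.
interface Contact {
  display_name: string;
  company?: string;
  emails: { address: string }[];
}

interface GraphEntity {
  entity_type: string;
  name: string;
  normalized_name: string;
  properties: Record<string, string>;
}

function contactToPersonNode(c: Contact): GraphEntity {
  return {
    entity_type: "person",
    name: c.display_name,
    // Same normalization as the entity-resolution key: lowercase, no accents, trimmed.
    normalized_name: c.display_name
      .normalize("NFD")
      .replace(/[\u0300-\u036f]/g, "")
      .toLowerCase()
      .trim(),
    properties: {
      ...(c.emails[0] ? { email: c.emails[0].address } : {}),
      ...(c.company ? { company: c.company } : {}),
    },
  };
}
```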

&lt;h3&gt;
  
  
  2. Event attendees → Person nodes (no LLM)
&lt;/h3&gt;

&lt;p&gt;Calendar events have attendees as structured JSONB. Each attendee is resolved against existing Person nodes by email, or created as a new node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Event attendee {email: "ana@techcorp.com", name: "Ana"}
    ↓
Match against graph_entities WHERE properties-&amp;gt;&amp;gt;'email' = 'ana@techcorp.com'
    ↓
Found → reuse existing node (even if name differs slightly)
Not found → create new Person node
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Text extraction → All entity types (LLM)
&lt;/h3&gt;

&lt;p&gt;This is the async pipeline from the previous blog post. The 2B model extracts persons, projects, locations, and topics from notes, emails, diary entries, and file contents. Each extracted entity is upserted by &lt;code&gt;(entity_type, normalized_name)&lt;/code&gt; — if it already exists, the properties are merged.&lt;/p&gt;
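
&lt;p&gt;We assume here that the property merge behaves like PostgreSQL's JSONB &lt;code&gt;||&lt;/code&gt; operator: a shallow merge where the newly extracted properties win on key conflicts. A minimal sketch of those semantics in TypeScript:&lt;/p&gt;

```typescript
type Properties = Record<string, unknown>;

// Shallow merge with JSONB-||-style semantics (an assumption, not confirmed
// behavior): keys from `incoming` overwrite keys in `existing`; nested
// objects are replaced wholesale, not deep-merged.
function mergeProperties(existing: Properties, incoming: Properties): Properties {
  return { ...existing, ...incoming };
}

mergeProperties(
  { email: "ana@techcorp.com", company: "TechCorp" },
  { company: "TechCorp S.L.", role: "CTO" }
);
// → { email: "ana@techcorp.com", company: "TechCorp S.L.", role: "CTO" }
```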

&lt;p&gt;The key insight: pipelines 1 and 2 mean the graph has a solid foundation of real, structured Person nodes before the LLM ever runs. When the LLM extracts "Ana García" from a note, it matches against a node that already exists from the contacts sync. No orphan entities, no duplicates — just a new edge connecting the note to an existing person.&lt;/p&gt;

&lt;h2&gt;
  
  
  Entity resolution: the hard problem
&lt;/h2&gt;

&lt;p&gt;Extraction creates nodes. Entity resolution decides whether two nodes are the same thing. We handle this in three levels:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 1 — Deterministic (automatic).&lt;/strong&gt; The &lt;code&gt;UNIQUE(entity_type, normalized_name)&lt;/code&gt; constraint handles exact matches. "Ana García" and "ana garcía" always map to the same node. For persons, email matching overrides name matching — if two nodes have the same email, they're the same person regardless of name differences.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 2 — Suggested merges (semi-automatic).&lt;/strong&gt; An endpoint returns pairs of same-type entities with high name similarity (using &lt;code&gt;pg_trgm&lt;/code&gt;'s &lt;code&gt;similarity()&lt;/code&gt; function, threshold &amp;gt; 0.4):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;GET&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;/graph/merge-candidates&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;entity_a:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;name:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Ana García"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;mention_count:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;entity_b:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;name:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Ana G."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;mention_count:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;similarity:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.87&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
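
&lt;p&gt;For intuition, here is an approximation of &lt;code&gt;pg_trgm&lt;/code&gt;'s scoring in TypeScript: pad each word, extract trigrams, and take the Jaccard ratio of the two trigram sets. The real extension differs in edge cases, so treat this as a sketch:&lt;/p&gt;

```typescript
// Approximate pg_trgm trigram extraction: lowercase, split into words, pad
// each word with two leading spaces and one trailing space, take 3-grams.
function trigrams(text: string): Set<string> {
  const grams = new Set<string>();
  for (const word of text.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean)) {
    const padded = `  ${word} `;
    for (let i = 0; i + 3 <= padded.length; i++) {
      grams.add(padded.slice(i, i + 3));
    }
  }
  return grams;
}

// similarity() ≈ |shared trigrams| / |union of trigrams|.
function similarity(a: string, b: string): number {
  const ta = trigrams(a);
  const tb = trigrams(b);
  let shared = 0;
  for (const g of ta) if (tb.has(g)) shared++;
  const union = ta.size + tb.size - shared;
  return union === 0 ? 0 : shared / union;
}

similarity("Ana García", "Ana G."); // ≈ 0.42, above the 0.4 threshold
```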



&lt;p&gt;A merge redirects all edges from the absorbed node to the surviving node, records the event in &lt;code&gt;merge_history&lt;/code&gt;, and hard-deletes the duplicate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 3 — Sleep-time resolution (automatic, low-priority).&lt;/strong&gt; A background job periodically reviews Person nodes, calculates cross-similarities, and proposes merge candidates. The AI agent can also trigger merges when context makes it obvious ("Juan" and "Juan Pérez" in the same conversation).&lt;/p&gt;

&lt;p&gt;What we explicitly don't try to resolve: nicknames ("Pepe" = "José"), role-based references ("the accountant"), and ambiguous entities ("Santiago" — person or city?). Those are either handled by the agent in conversation or corrected manually. The graph needs to be conservative — an incorrect merge destroys data, while a duplicate is just noise that can be cleaned up later.&lt;/p&gt;

&lt;h2&gt;
  
  
  The API
&lt;/h2&gt;

&lt;p&gt;Nine endpoints expose the graph:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET  /graph/entities              — Search by name, filter by type
GET  /graph/entities/:id          — Entity detail with direct connections
GET  /graph/connections           — Expansion traversal (depth 1-3)
GET  /graph/path                  — Shortest path between two entities
GET  /graph/subgraph              — Nodes + edges for visualization
GET  /graph/stats                 — Counts, orphans, pending queue
GET  /graph/merge-candidates      — Similar entity pairs
POST /graph/merge                 — Fuse two entities
POST /graph/cleanup               — Delete orphan nodes (0 connections)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The graph is a Pro feature. Free tier users still get &lt;code&gt;entity_links&lt;/code&gt; (manual connections between records), but the automatic entity extraction, the graph visualization, and the traversal queries require Pro.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the graph enables
&lt;/h2&gt;

&lt;p&gt;Once you have a knowledge graph, things that were impossible become trivial:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Who's involved in this project?"&lt;/strong&gt; — expansion query from a Project entity, depth 1, filter by Person type. Returns everyone who's been mentioned in connection with the project across all domains.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"What connects this email to that meeting?"&lt;/strong&gt; — path query. Returns: email → mentions → Person:Ana García → attended_by → Event:Sprint Review. Two hops, one shared person.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Show me everything about Ana García"&lt;/strong&gt; — entity detail. Returns: 15 mentions across notes and emails, 3 events attended, works at TechCorp, collaborates with Javier Losada. All discovered automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Search ranking.&lt;/strong&gt; The hybrid search uses graph connectivity as one of four signals (along with semantic similarity, full-text relevance, and heat score) in a Reciprocal Rank Fusion algorithm. A note that mentions entities connected to your recent activity ranks higher.&lt;/p&gt;
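
&lt;p&gt;RRF itself is only a few lines: each signal contributes 1/(k + rank) for every document it ranks, and documents are ordered by the summed score. A generic sketch (k = 60 is the constant from the original RRF paper; the rankings here are invented):&lt;/p&gt;

```typescript
// Generic Reciprocal Rank Fusion: fuse several rankings into one by
// summing 1 / (k + rank) per signal for each document id.
function rrf(rankings: string[][], k = 60): string[] {
  const scores: { [id: string]: number } = {};
  for (const ranking of rankings) {
    ranking.forEach((id, i) => {
      // rank is 1-based, so the top result contributes 1 / (k + 1)
      scores[id] = (scores[id] || 0) + 1 / (k + i + 1);
    });
  }
  return Object.keys(scores).sort((a, b) => scores[b] - scores[a]);
}
```

A document that appears in several rankings beats one that tops a single ranking, which is exactly why the graph-connectivity signal can lift a note without dominating the result.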

&lt;p&gt;The graph also feeds the sleep-time engine (background jobs that discover cross-domain correlations), the proactive digest ("Ana García, who you're meeting tomorrow, sent you an email yesterday"), and the AI agent's contextual awareness.&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;p&gt;At the current scale of development:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Graph entities&lt;/td&gt;
&lt;td&gt;~150-300 per active user&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Entity links&lt;/td&gt;
&lt;td&gt;~500-1000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Expansion query (depth 2)&lt;/td&gt;
&lt;td&gt;&amp;lt; 10ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Path query (depth 4)&lt;/td&gt;
&lt;td&gt;&amp;lt; 15ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Subgraph for visualization (100 nodes)&lt;/td&gt;
&lt;td&gt;&amp;lt; 20ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory overhead&lt;/td&gt;
&lt;td&gt;0 (it's just PostgreSQL)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;No new service to deploy. No new port to manage. No new backup strategy. It's tables in the same database that holds everything else, queried with the same ORM, backed up with the same &lt;code&gt;pg_dump&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;I'd add &lt;code&gt;pg_trgm&lt;/code&gt; from day one.&lt;/strong&gt; We needed it for merge candidates (fuzzy name matching) and ended up adding it as a later migration. It's a small extension with zero downsides — should have been in the initial schema alongside pgcrypto and pgvector.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I'd index &lt;code&gt;entity_links&lt;/code&gt; more aggressively.&lt;/strong&gt; The default indexes cover the UNIQUE constraint, but expansion queries benefit from separate indexes on &lt;code&gt;(source_type, source_id)&lt;/code&gt; and &lt;code&gt;(target_type, target_id)&lt;/code&gt;. We added them later when profiling showed sequential scans on the edges table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I'd build the visualization earlier.&lt;/strong&gt; The force-directed graph view in the dashboard was one of the last features built, but it should have been one of the first. Seeing the graph visually — nodes clustering around projects, people forming communities — was the moment the knowledge graph went from "interesting data structure" to "this actually understands my data." It would have motivated better extraction quality earlier.&lt;/p&gt;

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;A knowledge graph doesn't need a graph database. At personal scale (&amp;lt; 10K nodes, &amp;lt; 50K edges), PostgreSQL with recursive CTEs is fast, simple, and — critically — already there. No new infrastructure, no new dependency, no new operational burden.&lt;/p&gt;

&lt;p&gt;Two tables (&lt;code&gt;graph_entities&lt;/code&gt; + extended &lt;code&gt;entity_links&lt;/code&gt;), three traversal patterns (expansion, path, subgraph), three entity creation pipelines (structured sync, calendar attendees, LLM extraction), and three levels of entity resolution (deterministic, suggested, sleep-time).&lt;/p&gt;

&lt;p&gt;The graph is the connective tissue between everything else in the system — search, heat scoring, the AI agent, the digest engine. And it's all just rows in PostgreSQL.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next up: heat scoring — how we taught records to fade like memories, and why a simple exponential decay formula changes how search works.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>knowledgegraph</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Entity extraction with a 2B model: benchmarks from a personal knowledge graph</title>
      <dc:creator>Victor García</dc:creator>
      <pubDate>Sat, 14 Mar 2026 21:48:43 +0000</pubDate>
      <link>https://forem.com/micelclaw/entity-extraction-with-a-2b-model-benchmarks-from-a-personal-knowledge-graph-2f27</link>
      <guid>https://forem.com/micelclaw/entity-extraction-with-a-2b-model-benchmarks-from-a-personal-knowledge-graph-2f27</guid>
      <description>&lt;p&gt;When you're building a personal knowledge graph — the kind that automatically discovers that "Ana García" appears in your emails, your calendar, and tomorrow's meeting notes — you need entity extraction. The industry answer is to throw GPT-4 at it and move on. But when your system runs on a mini-PC in someone's living room, you need something that fits in 2GB of RAM.&lt;/p&gt;

&lt;p&gt;We benchmarked &lt;code&gt;qwen3-vl:2b-instruct-q4_K_M&lt;/code&gt; — a 2-billion parameter multimodal model, quantized to 4-bit — running locally through Ollama. The same model that describes our photos also extracts entities from text. One model, two jobs, less RAM.&lt;/p&gt;

&lt;p&gt;Here's what we found.&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;We built a benchmark suite with two tasks:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Text extraction&lt;/strong&gt; — 15 cases across notes, emails, and diary entries. Mix of Spanish and English. Each case has human-annotated ground truth: which persons, projects, locations, and topics should be extracted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vision extraction&lt;/strong&gt; — 10 photos ranging from restaurant dinners to construction sites to landscape shots. Each photo goes through two stages: the model describes the image, then a second pass extracts entities from that description.&lt;/p&gt;

&lt;p&gt;The extraction prompt is deliberately simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Extract named entities from the following text. Return ONLY a JSON object:
- persons: array of person names mentioned
- projects: array of project/product names mentioned
- locations: array of place names mentioned
- topics: array of key topics/themes (max 3)

Rules:
- Only extract what is EXPLICITLY mentioned
- Do not invent or infer entities not present
- Normalize names (capitalize properly)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Matching uses embedding similarity (qwen3-embedding, 1024d) with a 0.75 threshold instead of exact string matching. "Parte Vieja" matches "Parte Vieja" obviously, but "edge caching" also matches "edge caching approach" because the embeddings are close enough.&lt;/p&gt;
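
&lt;p&gt;The scoring loop behind the numbers below can be sketched as a greedy match against ground truth with a pluggable similarity function. The real matcher uses embedding cosine similarity; &lt;code&gt;sim&lt;/code&gt; is injected here so the sketch stays self-contained:&lt;/p&gt;

```typescript
// Greedy one-to-one matching of predicted entities to ground truth, then
// F1 from the match count. `sim` stands in for the embedding comparison.
function f1Score(
  predicted: string[],
  truth: string[],
  sim: (a: string, b: string) => number,
  threshold = 0.75
): number {
  const used: { [t: string]: boolean } = {};
  let matches = 0;
  for (const p of predicted) {
    for (const t of truth) {
      if (!used[t]) {
        if (sim(p, t) >= threshold) {
          used[t] = true; // each ground-truth entity matches at most once
          matches++;
          break;
        }
      }
    }
  }
  const precision = predicted.length === 0 ? 0 : matches / predicted.length;
  const recall = truth.length === 0 ? 0 : matches / truth.length;
  const denom = precision + recall;
  return denom === 0 ? 0 : (2 * precision * recall) / denom;
}
```

Note how an "extra" prediction only lowers precision; this is the mechanism behind the thorough-model-versus-terse-annotator pattern discussed later.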

&lt;h2&gt;
  
  
  Text extraction: the numbers
&lt;/h2&gt;

&lt;p&gt;Overall F1: &lt;strong&gt;0.645&lt;/strong&gt;. Zero parse errors across all 15 cases — the model always returned valid JSON. Average latency: 2-4 seconds per case on CPU.&lt;/p&gt;

&lt;p&gt;But the overall F1 hides a story. Let's break it down by entity type:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Entity type&lt;/th&gt;
&lt;th&gt;Avg F1&lt;/th&gt;
&lt;th&gt;What happened&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Persons&lt;/td&gt;
&lt;td&gt;~0.87&lt;/td&gt;
&lt;td&gt;Near-perfect. The model's strongest category by far&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Locations&lt;/td&gt;
&lt;td&gt;~0.72&lt;/td&gt;
&lt;td&gt;Handles Spanish geography beautifully&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Projects&lt;/td&gt;
&lt;td&gt;~0.65&lt;/td&gt;
&lt;td&gt;Good when names are explicit, invents sometimes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Topics&lt;/td&gt;
&lt;td&gt;~0.30&lt;/td&gt;
&lt;td&gt;Weakest — but also the most subjective category&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Persons: the killer feature
&lt;/h3&gt;

&lt;p&gt;The model nails names. Full names, first names, Spanish names with accents — it gets them right:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Marta Ibáñez", "Javier Losada", "Rubén" — all extracted from a construction note. ✓&lt;/li&gt;
&lt;li&gt;"Carmen Pueyo", "Víctor García", "Diego Martínez" — from an email thread. ✓&lt;/li&gt;
&lt;li&gt;"Tom Preston-Werner" — from an English conference note. ✓&lt;/li&gt;
&lt;li&gt;"José Miguel Aguirre" — from a text full of nicknames. ✓&lt;/li&gt;
&lt;li&gt;"Roberto Casas", "Víctor", "Lucía", "Sandra" — four people from a sprint review email. All four. ✓&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Where it stumbles: a diary entry mentioning "Papá" and "Mamá" — the model extracted them as persons. Technically correct (they are persons), but the human ground truth didn't include them because they're not named individuals. This is a recurring pattern: &lt;strong&gt;the model extracts more than the human annotated&lt;/strong&gt;, which hurts precision without being wrong.&lt;/p&gt;

&lt;p&gt;The other pattern: the model extracted "Javier" as a separate person from "Javier Losada". Both in the same note. That's an entity resolution problem, not an extraction problem — and the knowledge graph handles it downstream with merge candidates.&lt;/p&gt;

&lt;h3&gt;
  
  
  Locations: surprisingly good at Spanish geography
&lt;/h3&gt;

&lt;p&gt;"Valdespartera", "Villanueva de Gállego", "La Ternasca", "Parte Vieja", "Urgull", "Benasque", "Añisclo" — these aren't exactly world-famous cities. They're neighborhoods, hiking valleys, and small towns in Aragon and the Basque Country. The model got them all.&lt;/p&gt;

&lt;p&gt;It also correctly classified "San Sebastián" as a location (not a person), "Ordesa" as a location (not a project), and "calle San Miguel" as a variant of "San Miguel." The embedding similarity matching helped here — "calle San Miguel" and "San Miguel" have near-perfect similarity.&lt;/p&gt;

&lt;p&gt;One amusing misclassification: "eu-west-1" (an AWS region) was extracted as a location. I mean... it is a location. Just not the kind we meant.&lt;/p&gt;

&lt;h3&gt;
  
  
  Projects: good when explicit, creative when not
&lt;/h3&gt;

&lt;p&gt;When the text says "Micelclaw OS" or "MACP Protocol" or "OpenClaw Gateway", the model finds them with 100% accuracy. Named projects are easy.&lt;/p&gt;

&lt;p&gt;The problem is when the model decides something is a project that isn't. "Pilotaje" (a construction technique) got classified as a project. "Txuletón" (a steak cut) became a project. "Barna" (slang for Barcelona) appeared as a project. The model is trying to be helpful — if it can't figure out which category something fits, it hedges by putting it in projects.&lt;/p&gt;

&lt;h3&gt;
  
  
  Topics: where F1 lies
&lt;/h3&gt;

&lt;p&gt;Topics scored ~0.30 F1. That sounds terrible. But look at what actually happened:&lt;/p&gt;

&lt;p&gt;A diary entry about a trip to San Sebastián. Human ground truth: &lt;code&gt;["viaje", "desconexión"]&lt;/code&gt; (trip, disconnecting). Model output: &lt;code&gt;["pintxos", "txuletón", "playa"]&lt;/code&gt; (pintxos, steak, beach).&lt;/p&gt;

&lt;p&gt;Both are correct summaries of the same diary entry. The human abstracted ("it was a trip about disconnecting"), the model got specific ("there were pintxos and beach"). The embedding similarity between "viaje" and "txuletón" is 0.67 — below the 0.75 threshold — so it counts as a miss.&lt;/p&gt;

&lt;p&gt;This pattern repeats across almost every case. The human writes abstract topics; the model extracts concrete ones. For a knowledge graph, the model's approach is arguably better — "pintxos" is more searchable than "desconexión."&lt;/p&gt;

&lt;h3&gt;
  
  
  Bilingual without trying
&lt;/h3&gt;

&lt;p&gt;We mixed Spanish and English cases without telling the model which language to expect. It handled both without issues. "Tom Preston-Werner" from an English note and "José Miguel Aguirre" from a Spanish one were extracted with the same accuracy. The extraction prompt is in English; the input text is in whatever language the user writes in. The model doesn't care.&lt;/p&gt;

&lt;h3&gt;
  
  
  The nickname challenge
&lt;/h3&gt;

&lt;p&gt;The hardest test case was a Spanish note full of nicknames: "Pepe", "Tere", "Boli", "Txe", plus the full name "José Miguel Aguirre."&lt;/p&gt;

&lt;p&gt;The model extracted "Pepe" and "José Miguel Aguirre" as separate persons — it didn't connect the nickname to the full name. It found "Tere" and "Txe" but missed "Boli." Three out of four nicknames is honestly better than expected for a 2B model.&lt;/p&gt;

&lt;p&gt;Resolving "Pepe" = "José Miguel Aguirre" is entity resolution, not extraction. That's handled by the knowledge graph's merge candidate system — when two person nodes co-occur frequently, the system flags them for manual or automated merging.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vision extraction: description first, entities second
&lt;/h2&gt;

&lt;p&gt;The photo pipeline works in two stages: the model describes the image, then the same extraction prompt runs on that description. This means the quality of entity extraction depends entirely on the quality of the description.&lt;/p&gt;
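
&lt;p&gt;The two-stage flow is simple to express once the model calls are injected. &lt;code&gt;describeImage&lt;/code&gt; and &lt;code&gt;extractEntities&lt;/code&gt; below are stand-ins for the real Ollama calls, not an actual client API:&lt;/p&gt;

```typescript
// Two-stage vision pipeline: description first, entity extraction second.
// The model calls are injected so the sketch stays self-contained.
function photoToEntities(
  imagePath: string,
  describeImage: (img: string) => string,
  extractEntities: (text: string) => object
) {
  // Stage 1: the vision model turns the image into a free-text description.
  const description = describeImage(imagePath);
  // Stage 2: the same extraction prompt runs on that description, which is
  // why extraction quality is capped by description quality.
  return { description: description, entities: extractEntities(description) };
}
```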

&lt;p&gt;Overall vision F1: &lt;strong&gt;0.532&lt;/strong&gt;. But the descriptions themselves are far better than the F1 suggests.&lt;/p&gt;

&lt;h3&gt;
  
  
  The descriptions are impressive
&lt;/h3&gt;

&lt;p&gt;A photo of an olive grove landscape:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"This image captures a vast, sunlit landscape of rolling hills and valleys, likely in a rural or agricultural region. The scene is dominated by rows of olive trees planted in a dense, geometric pattern across the slopes."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A construction site photo:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"A red and silver laser level is set up on a tripod, indicating precise work is being done. The site is surrounded by dirt, sand, and a few trees."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The model identified olive trees, a laser level on a tripod, and even recognized a 3D structural engineering model from a screenshot. For 2B parameters quantized to 4-bit, running on CPU, this is remarkable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where vision extraction breaks down
&lt;/h3&gt;

&lt;p&gt;The main issue: when photos contain people, the model says "four people sitting at a table" or "three people walking on a boardwalk." It counts them, describes what they're doing, but can't identify them. This is expected — face recognition requires a separate pipeline (we use InsightFace for that).&lt;/p&gt;

&lt;p&gt;The problem for the benchmark is that "four people" gets extracted as a person entity, which counts as a false positive against a ground truth that says "no specific persons." This systematically tanks the persons F1 for vision.&lt;/p&gt;

&lt;h3&gt;
  
  
  The ground truth problem
&lt;/h3&gt;

&lt;p&gt;Here's what we learned about benchmarking entity extraction: &lt;strong&gt;the human is the bottleneck, not the model.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For photo 7 (construction site), the human annotated objects as: &lt;code&gt;["tripod", "briefcase", "net", "brick", "concrete"]&lt;/code&gt;. The model found: &lt;code&gt;["red brick wall", "large concrete block", "red and silver laser level", "tripod", "dirt", "sand", "trees", "house", "sunlight", "clear sky"]&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The model extracted 10 objects where the human listed 5. The model's list is more complete and more accurate — "red and silver laser level" is a better description than what the human wrote. But the F1 score penalizes the model for being thorough, because every "extra" extraction hurts precision.&lt;/p&gt;

&lt;p&gt;This is a fundamental issue with evaluating extraction against human annotations. The human annotates what they think is important. The model extracts what is present. For a knowledge graph that needs to be comprehensive, the model's approach is correct — you want to capture everything and let the search ranking decide what's relevant.&lt;/p&gt;

&lt;h2&gt;
  
  
  Latency: the real constraint
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Average&lt;/th&gt;
&lt;th&gt;Range&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Text extraction&lt;/td&gt;
&lt;td&gt;~3s&lt;/td&gt;
&lt;td&gt;1.5–4.2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vision description&lt;/td&gt;
&lt;td&gt;~2.4s&lt;/td&gt;
&lt;td&gt;1.4–6.2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vision extraction&lt;/td&gt;
&lt;td&gt;~1.8s&lt;/td&gt;
&lt;td&gt;1.1–4.4s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All on CPU, all sequential through a single Ollama instance with a priority semaphore. These numbers are for the async pipeline — the user never waits for them. A note gets created in ~50ms; the entity extraction happens 2-4 seconds later in the background.&lt;/p&gt;

&lt;p&gt;The first request after a cold start took 66 seconds (model loading into RAM). After that, Ollama keeps the model loaded and subsequent requests are fast. This is why we keep a single model in memory — loading and unloading models per task would destroy latency.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we'd change
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Lower the similarity threshold for topics.&lt;/strong&gt; The 0.75 threshold is too strict for abstract concepts. "Viaje" and "pintxos" are obviously related in context, but their embeddings are only 0.67 similar. For persons and locations, 0.75 is fine. For topics, 0.60 might be more appropriate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Post-process the "N people" pattern.&lt;/strong&gt; When the vision model says "four women" or "three people," the extraction prompt shouldn't classify that as a person entity. A simple regex filter on the extraction output would fix the most common false positive.&lt;/p&gt;
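
&lt;p&gt;A sketch of that post-filter (the pattern list is illustrative and would grow with real data):&lt;/p&gt;

```typescript
// Drop "person" entities that are really count phrases like "four people".
// The regex is a hypothetical starting point, not the production filter.
const COUNT_PHRASE = /^(a few |several |\d+ |one |two |three |four |five |six |seven |eight |nine |ten )?(people|person|persons|men|women|man|woman|children|kids)$/i;

function filterPersons(persons: string[]): string[] {
  return persons.filter((p) => !COUNT_PHRASE.test(p.trim()));
}
```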

&lt;p&gt;&lt;strong&gt;Embrace the verbosity.&lt;/strong&gt; The model extracts more than a human would annotate. Instead of fighting this, design the knowledge graph to handle it — use confidence scores and the heat system to surface what matters and let the rest decay naturally.&lt;/p&gt;

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;A 2B parameter model, quantized to 4-bit, running on CPU:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Persons:&lt;/strong&gt; F1 0.87 — production-ready for a personal knowledge graph&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Locations:&lt;/strong&gt; F1 0.72 — solid, handles non-English geography&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Projects:&lt;/strong&gt; F1 0.65 — good enough with downstream deduplication&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Topics:&lt;/strong&gt; F1 0.30 — misleading number, the model is actually more thorough than the human&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parse reliability:&lt;/strong&gt; 0 errors in 25 cases — always returns valid JSON&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency:&lt;/strong&gt; 2-4 seconds async, invisible to the user&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Is it as good as GPT-4? No. Is it good enough to build a personal knowledge graph that automatically discovers connections between your notes, emails, and calendar? Yes. And it runs on your hardware, processes your data locally, and costs zero per extraction.&lt;/p&gt;

&lt;p&gt;For a personal system processing maybe 50-100 new records per day, a 2B model with 3-second extraction time and ~0.87 F1 on the entities that matter most (people and places) is more than enough. The knowledge graph doesn't need to be perfect — it needs to be useful. And "Ana García appears in 3 emails and tomorrow's meeting" is useful even if the system also extracted "txuletón" as a project.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next up: how we built a knowledge graph using just PostgreSQL — no Neo4j, no Apache AGE, just recursive CTEs and an entity_links table.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>nlp</category>
      <category>ollama</category>
      <category>selfhosted</category>
    </item>
    <item>
      <title>The 4-slot hook pipeline: how every CRUD operation feeds four systems at once</title>
      <dc:creator>Victor García</dc:creator>
      <pubDate>Sat, 14 Mar 2026 13:06:25 +0000</pubDate>
      <link>https://forem.com/micelclaw/2o-the-4-slot-hook-pipeline-how-every-crud-operation-feeds-four-systems-at-once-31n8</link>
      <guid>https://forem.com/micelclaw/2o-the-4-slot-hook-pipeline-how-every-crud-operation-feeds-four-systems-at-once-31n8</guid>
      <description>&lt;p&gt;Here's a problem that sneaks up on you when you're building a data-heavy application: every time you create or update a record, a bunch of other things need to happen. The record needs to be embedded for semantic search. Its heat score needs to be updated. Entities (people, places, projects) need to be extracted from the text. And the change needs to be logged so the digest engine knows what happened.&lt;/p&gt;

&lt;p&gt;The naive approach is to do all of that inline — right there in the route handler, after the INSERT. We tried that. It was a mistake.&lt;/p&gt;

&lt;p&gt;This post is about the pipeline we built to replace it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with inline processing
&lt;/h2&gt;

&lt;p&gt;Our first version of the notes endpoint looked something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// The "just do it all here" approach&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/notes&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;reply&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;note&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;notes&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;returning&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="c1"&gt;// Now embed it...&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;note&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;title&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt; &lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;note&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// 50-200ms&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;notes&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;recordId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;note&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;embedding&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// Log the change...&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;changeLog&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;values&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;notes&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;recordId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;note&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;create&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="nx"&gt;reply&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;note&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two problems. First, if Ollama is down or slow, the user waits. Or worse, the request fails — even though the note was already saved. Second, this same embedding + changelog logic needs to exist in every single route handler. Notes, events, contacts, emails, files, diary entries. That's a lot of duplicated code that's going to drift.&lt;/p&gt;

&lt;p&gt;And we hadn't even added heat tracking or entity extraction yet. Those would be two more blocks of code copy-pasted across every route.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pipeline
&lt;/h2&gt;

&lt;p&gt;We replaced all of that with a single function call: &lt;code&gt;runPostHooks(ctx)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Every domain route — notes, events, contacts, emails, files, diary entries — calls it after the database operation succeeds. It looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;handler    → INSERT/UPDATE/DELETE in domain table
              ↓
postHandler → runPostHooks(ctx)
              ↓
           ┌─────────────────────────────────────────┐
           │  Slot 1: Embedding (enqueue async job)  │
           │  Slot 2: Heat tracking (upsert, &amp;lt;1ms)   │
           │  Slot 3: Entity extraction (enqueue)    │
           │  Slot 4: Changelog (fire-and-forget)    │
           └─────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each slot runs inside its own try/catch. If slot 1 fails (Ollama is down), slot 2 still runs. If slot 3 throws (extraction model crashed), slot 4 still logs the change. The user's response is never affected — the INSERT already happened, the reply already went out.&lt;/p&gt;

&lt;p&gt;Here's the context object that every slot receives:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;CrudHookContext&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;      &lt;span class="c1"&gt;// 'notes', 'events', 'contacts', ...&lt;/span&gt;
  &lt;span class="nl"&gt;action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;insert&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;update&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;delete&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;recordId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;HeatSource&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// 'user_dash', 'sync', 'agent_primary', ...&lt;/span&gt;
  &lt;span class="nl"&gt;record&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="nb"&gt;Record&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kr"&gt;any&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Every slot knows what happened (action), to what (domain + recordId), by whom (userId), and through which channel (source). The &lt;code&gt;record&lt;/code&gt; field carries the full row for slots that need to extract text from it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What each slot does
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Slot 1: Embedding
&lt;/h3&gt;

&lt;p&gt;Extracts text from the record and enqueues an async job. The enqueue itself takes ~0.1ms — the actual embedding generation happens later in the background via the AsyncQueue.&lt;/p&gt;

&lt;p&gt;The text extraction is domain-specific:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Domain&lt;/th&gt;
&lt;th&gt;What gets embedded&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;notes&lt;/td&gt;
&lt;td&gt;title + content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;events&lt;/td&gt;
&lt;td&gt;title + description + location&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;contacts&lt;/td&gt;
&lt;td&gt;display_name + company + job_title + notes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;emails&lt;/td&gt;
&lt;td&gt;subject + body_plain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;files&lt;/td&gt;
&lt;td&gt;filename + extracted content (if available)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;diary&lt;/td&gt;
&lt;td&gt;date + content&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
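&lt;p&gt;The table above can be sketched as one domain switch (a hypothetical helper; the function name and field access are assumptions, not the real code):&lt;/p&gt;

```typescript
// Illustrative per-domain text extraction for embedding, following the table.
function extractEmbeddingText(domain: string, record: Record<string, any>): string {
  const parts: Array<string | undefined> = (() => {
    switch (domain) {
      case "notes":    return [record.title, record.content];
      case "events":   return [record.title, record.description, record.location];
      case "contacts": return [record.display_name, record.company, record.job_title, record.notes];
      case "emails":   return [record.subject, record.body_plain];
      case "files":    return [record.filename, record.extracted_content]; // if available
      case "diary":    return [record.entry_date, record.content];
      default:         return [];
    }
  })();
  // Drop missing fields, join the rest into one embeddable string.
  return parts.filter(Boolean).join("\n").trim();
}
```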

&lt;p&gt;The AsyncQueue processes jobs sequentially through an Ollama client with a semaphore (concurrency=1). Embeddings get high priority. If Ollama is unreachable, the job is retried 3 times with exponential backoff, then dropped with a log entry. The record is still fully usable — it just won't appear in semantic search until the next successful embed.&lt;/p&gt;
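&lt;p&gt;The retry policy is the classic three-attempts-with-exponential-backoff shape. A sketch, with assumed delays (the post doesn't state the actual intervals):&lt;/p&gt;

```typescript
// Sketch: retry up to 3 times with exponential backoff, then drop with a log.
const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function withRetry<T>(
  job: () => Promise<T>,
  attempts = 3,
  baseMs = 500, // assumed base delay
): Promise<T | undefined> {
  for (let i = 0; i < attempts; i++) {
    try {
      return await job();
    } catch (err) {
      if (i === attempts - 1) {
        // Dropped, not fatal: the record stays usable, it just misses
        // semantic search until the next successful embed.
        console.warn("embedding job dropped after retries", err);
        return undefined;
      }
      await sleep(baseMs * 2 ** i); // 500ms, 1s, 2s, ...
    }
  }
}
```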

&lt;h3&gt;
  
  
  Slot 2: Heat tracking
&lt;/h3&gt;

&lt;p&gt;A single upsert to the &lt;code&gt;record_heat&lt;/code&gt; table: increment &lt;code&gt;access_count&lt;/code&gt;, update &lt;code&gt;last_accessed&lt;/code&gt;, recalculate the heat score. Under 1ms. This is the cheapest slot by far, but it powers the entire memory tier system — hot, warm, and cold records that influence search ranking and the digest engine.&lt;/p&gt;
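&lt;p&gt;The upsert itself can be a single statement, which is why it stays under a millisecond. A sketch (table and column names follow the post; the exact SQL and the score recalculation are assumptions):&lt;/p&gt;

```typescript
// Hypothetical shape of the heat upsert: one round trip, no read-modify-write.
const HEAT_UPSERT = `
  INSERT INTO record_heat (domain, record_id, access_count, last_accessed)
  VALUES ($1, $2, 1, now())
  ON CONFLICT (domain, record_id) DO UPDATE
     SET access_count  = record_heat.access_count + 1,
         last_accessed = now()
`;

// Inside the slot this is one fire-and-forget query, e.g.:
// await db.query(HEAT_UPSERT, [ctx.domain, ctx.recordId]);
```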

&lt;p&gt;One detail: the &lt;code&gt;source&lt;/code&gt; matters. User-driven creates, updates, and individual reads bump heat at full strength, while sync imports use &lt;code&gt;source: 'sync'&lt;/code&gt;, which the heat system treats differently (lower initial heat, since the user didn't actively create the record).&lt;/p&gt;

&lt;h3&gt;
  
  
  Slot 3: Entity extraction
&lt;/h3&gt;

&lt;p&gt;Enqueues an async job (same AsyncQueue as embeddings, but lower priority). The extraction worker sends the record's text to an LLM and gets back a structured list of entities — people, projects, locations, topics — that become nodes in the knowledge graph.&lt;/p&gt;

&lt;p&gt;Each domain has different extraction behavior:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Notes, emails, diary entries&lt;/strong&gt;: Full LLM extraction from text content&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contacts&lt;/strong&gt;: No LLM needed — the contact's structured data (name, email, company) directly becomes a Person node in the graph&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Events&lt;/strong&gt;: Attendees are resolved directly against existing graph entities by email/name; the rest goes through LLM&lt;/li&gt;
&lt;/ul&gt;
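&lt;p&gt;That dispatch might look like this (a sketch: the &lt;code&gt;GraphEntity&lt;/code&gt; shape, field names, and the attendee handling are simplified assumptions):&lt;/p&gt;

```typescript
// Illustrative per-domain entity extraction dispatch.
interface GraphEntity {
  type: "person" | "project" | "location" | "topic";
  name: string;
  confidence: number;
}

async function extractEntities(
  domain: string,
  record: Record<string, any>,
  llmExtract: (text: string) => Promise<GraphEntity[]>,
): Promise<GraphEntity[]> {
  if (domain === "contacts") {
    // No LLM: the structured contact row becomes a Person node directly.
    return [{ type: "person", name: record.display_name, confidence: 1 }];
  }
  if (domain === "events") {
    // Attendees resolve directly by email/name; free text still goes to the LLM.
    const attendees: GraphEntity[] = (record.attendees ?? []).map(
      (name: string): GraphEntity => ({ type: "person", name, confidence: 1 }),
    );
    const text = [record.title, record.description].filter(Boolean).join("\n");
    return [...attendees, ...(await llmExtract(text))];
  }
  // Notes, emails, diary: full LLM extraction from text content.
  const text = [record.title, record.subject, record.content, record.body_plain]
    .filter(Boolean).join("\n");
  return llmExtract(text);
}
```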

&lt;p&gt;Entity extraction runs with lower priority than embeddings because it's more expensive (1-10 seconds per record vs ~50ms for an embedding) and less time-sensitive. The graph can be a few seconds behind without anyone noticing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Slot 4: Changelog
&lt;/h3&gt;

&lt;p&gt;A simple INSERT into the &lt;code&gt;change_log&lt;/code&gt; table. Domain, record ID, action, user ID, timestamp, and a human-readable summary generated by &lt;code&gt;extractSummary()&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Domain&lt;/th&gt;
&lt;th&gt;Summary format&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;notes&lt;/td&gt;
&lt;td&gt;Note title&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;events&lt;/td&gt;
&lt;td&gt;Event title + formatted start date&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;contacts&lt;/td&gt;
&lt;td&gt;Display name&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;emails&lt;/td&gt;
&lt;td&gt;Subject line&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;files&lt;/td&gt;
&lt;td&gt;Filename&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;diary&lt;/td&gt;
&lt;td&gt;Date + content preview (first 100 chars)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
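&lt;p&gt;A plausible shape for &lt;code&gt;extractSummary()&lt;/code&gt;, following the table (field names like &lt;code&gt;start_time&lt;/code&gt; and &lt;code&gt;entry_date&lt;/code&gt; are assumptions):&lt;/p&gt;

```typescript
// Sketch of the human-readable summary per domain.
function extractSummary(domain: string, record: Record<string, any>): string {
  switch (domain) {
    case "notes":
      return record.title ?? "(untitled note)";
    case "events":
      return `${record.title} (${new Date(record.start_time).toDateString()})`;
    case "contacts":
      return record.display_name;
    case "emails":
      return record.subject;
    case "files":
      return record.filename;
    case "diary":
      // Date plus a preview of the first 100 characters.
      return `${record.entry_date}: ${String(record.content).slice(0, 100)}`;
    default:
      return String(record.id ?? "");
  }
}
```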

&lt;p&gt;The Digest Engine reads this table periodically and compiles a summary of what changed: "3 new emails, 1 event moved, 2 notes created." It's the backbone of the proactive notification system.&lt;/p&gt;

&lt;h2&gt;
  
  
  The design that anticipated growth
&lt;/h2&gt;

&lt;p&gt;Here's the thing we got right, almost by accident: the slots were designed to be filled in later.&lt;/p&gt;

&lt;p&gt;When we first built the pipeline in Phase 4, only two slots were active:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Slot&lt;/th&gt;
&lt;th&gt;Phase 4&lt;/th&gt;
&lt;th&gt;After Cluster B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;✅ Embedding&lt;/td&gt;
&lt;td&gt;✅ Embedding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;🔒 Reserved (no-op)&lt;/td&gt;
&lt;td&gt;✅ Heat tracking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;🔒 Reserved (no-op)&lt;/td&gt;
&lt;td&gt;✅ Entity extraction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;✅ Changelog&lt;/td&gt;
&lt;td&gt;✅ Changelog&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Slots 2 and 3 were literally empty functions — registered in the pipeline to document the execution order and reserve their position. When Cluster B (the data foundation for the knowledge graph and heat scoring) was implemented weeks later, those slots got filled in without touching a single line of the existing pipeline code or any route handler. No refactoring. No merge conflicts.&lt;/p&gt;

&lt;p&gt;This worked because the pipeline was designed around the &lt;code&gt;CrudHookContext&lt;/code&gt; interface. Every slot receives the same context. Adding a new slot means writing a function that takes &lt;code&gt;CrudHookContext&lt;/code&gt; and does something with it. That's the entire contract.&lt;/p&gt;
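&lt;p&gt;Reserving a slot can be as small as registering a no-op (all names below are illustrative, not the real registration code):&lt;/p&gt;

```typescript
// Sketch: reserved slots as no-ops that document execution order.
interface Ctx { domain: string; action: string; recordId: string }
type Hook = (ctx: Ctx) => void | Promise<void>;

const noop: Hook = () => {}; // reserves a position, does nothing

const enqueueEmbedding: Hook = (_ctx) => { /* enqueue async embedding job */ };
const writeChangelog: Hook = (_ctx) => { /* INSERT into change_log */ };

// Phase 4: slots 2 and 3 registered but empty.
const postHooks: Array<{ name: string; hook: Hook }> = [
  { name: "embedding",         hook: enqueueEmbedding },
  { name: "heat-tracking",     hook: noop }, // filled in later by Cluster B
  { name: "entity-extraction", hook: noop }, // filled in later by Cluster B
  { name: "changelog",         hook: writeChangelog },
];

// Filling a slot later is one assignment; no route handler changes, e.g.:
// postHooks[1].hook = trackHeat;
```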

&lt;h2&gt;
  
  
  What happens per HTTP verb
&lt;/h2&gt;

&lt;p&gt;Not every verb triggers every slot. Deleting a record doesn't need a new embedding. Listing records doesn't need heat tracking (only individual GETs do).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST (create):  → Slot 1 (embed) → Slot 2 (heat: count=1) → Slot 3 (extract) → Slot 4 (log)
GET /:id (read): → Slot 2 (heat: count++) only
PATCH (update): → Slot 1 (re-embed) → Slot 2 (heat: count++) → Slot 3 (re-extract) → Slot 4 (log)
DELETE (soft):  → Slot 4 (log) only
GET / (list):   → nothing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deletes don't re-embed or re-extract — the record is conceptually gone (soft-deleted). The heat cron will naturally cool it down. Listing doesn't trigger hooks at all — only individual record access bumps heat.&lt;/p&gt;
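&lt;p&gt;The same mapping as a lookup table (slot numbers as in the diagram above; a sketch, not the real dispatch code):&lt;/p&gt;

```typescript
// Which pipeline slots run for each kind of operation.
type Action = "create" | "read" | "update" | "delete" | "list";

const slotsFor: Record<Action, number[]> = {
  create: [1, 2, 3, 4], // embed, heat (count=1), extract, log
  read:   [2],          // individual GET bumps heat only
  update: [1, 2, 3, 4], // re-embed, heat++, re-extract, log
  delete: [4],          // soft delete: just log it
  list:   [],           // listing triggers nothing
};
```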

&lt;h2&gt;
  
  
  The Ollama bottleneck
&lt;/h2&gt;

&lt;p&gt;All AI-powered slots (embedding and extraction) funnel through a single Ollama instance with a priority queue and a semaphore (concurrency=1). This sounds like a bottleneck, and it is — by design.&lt;/p&gt;

&lt;p&gt;Ollama running a 0.6B embedding model (&lt;code&gt;qwen3-embedding:0.6b&lt;/code&gt;, 1024 dimensions) on a mini-PC can handle one job at a time reliably. Trying to parallelize would thrash the CPU and make everything slower. Sequential processing with priority ordering (embeddings first, extraction second) gives predictable latency:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Embedding: ~50ms per record&lt;/li&gt;
&lt;li&gt;Entity extraction: 1-10 seconds per record (done by the same vision model that describes photos — &lt;code&gt;qwen3-vl:2b&lt;/code&gt; — because running a single multimodal model saves RAM versus having separate text-only and vision models)&lt;/li&gt;
&lt;/ul&gt;
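&lt;p&gt;A single-flight priority queue of this shape fits in a few lines. A sketch (lower number means higher priority; retry and backoff elided):&lt;/p&gt;

```typescript
// Illustrative single-flight priority queue: concurrency = 1 by construction.
interface Job { priority: number; run: () => Promise<void> }

class SerialPriorityQueue {
  private jobs: Job[] = [];
  private running = false;

  enqueue(job: Job): void {
    this.jobs.push(job);
    this.jobs.sort((a, b) => a.priority - b.priority); // embeddings before extraction
    void this.drain();
  }

  private async drain(): Promise<void> {
    if (this.running) return; // the "semaphore": only one job in flight
    this.running = true;
    while (this.jobs.length) {
      const job = this.jobs.shift()!;
      try { await job.run(); } catch { /* retry/backoff elided */ }
    }
    this.running = false;
  }
}
```

&lt;p&gt;A running job is never preempted, but everything waiting behind it is reordered by priority, which is exactly the "embeddings first, extraction second" behavior described above.&lt;/p&gt;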

&lt;p&gt;When the user creates a note, the embedding is ready in under a second. The entity extraction might take a few more seconds, but the knowledge graph being slightly behind is invisible to the user.&lt;/p&gt;

&lt;p&gt;If Ollama is completely down, everything still works. The CRUD succeeds. The changelog is written. The heat is tracked. Only semantic search and the knowledge graph are degraded, and they'll catch up when Ollama comes back online thanks to the reindex endpoint.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why not database triggers?
&lt;/h2&gt;

&lt;p&gt;PostgreSQL triggers could do some of this — especially the changelog. We considered it and decided against it for three reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Triggers can't call external services.&lt;/strong&gt; Embedding requires Ollama. Entity extraction requires an LLM. Triggers are stuck inside the database.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Error handling is all-or-nothing.&lt;/strong&gt; A failing trigger rolls back the entire transaction. Our pipeline explicitly allows individual slot failures without affecting the core operation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Visibility.&lt;/strong&gt; Application-level hooks are easy to debug, log, and monitor. Trigger debugging is... less fun.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The changelog could legitimately be a trigger. But keeping all four slots in the same application-level pipeline means they're all visible in one file, they share the same error handling pattern, and they can be toggled or reordered without touching the database.&lt;/p&gt;

&lt;h2&gt;
  
  
  The payoff
&lt;/h2&gt;

&lt;p&gt;The 4-slot pipeline is maybe 200 lines of code. It replaced thousands of lines of duplicated inline processing across every route handler. Every new domain we add — kanban cards, RSS feed articles, bookmarks — gets embeddings, heat tracking, entity extraction, and changelog for free by adding one &lt;code&gt;runPostHooks(ctx)&lt;/code&gt; call.&lt;/p&gt;

&lt;p&gt;More importantly, it created clean extension points. When we needed PII-aware routing (Cluster E), it was a middleware in the preHandler — not a new slot. When we needed sleep-time intelligence (Cluster D), it consumed the changelog and heat data that the pipeline was already producing. The pipeline doesn't just feed four systems — it feeds the systems that feed the systems.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next up: how a 0.6B parameter model turned out to be better than we expected for entity extraction — and why bigger isn't always faster when skill context is the bottleneck.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>postgres</category>
      <category>typescript</category>
      <category>ai</category>
    </item>
    <item>
      <title>137 migrations and counting: building a personal OS schema from scratch</title>
      <dc:creator>Victor García</dc:creator>
      <pubDate>Fri, 13 Mar 2026 08:14:09 +0000</pubDate>
      <link>https://forem.com/micelclaw/137-migrations-and-counting-building-a-personal-os-schema-from-scratch-31lf</link>
      <guid>https://forem.com/micelclaw/137-migrations-and-counting-building-a-personal-os-schema-from-scratch-31lf</guid>
      <description>&lt;p&gt;I started this project on February 18th, 2026. The idea was embarrassingly simple: I wanted a personal assistant that could look up my contacts fast and without burning through tokens. Notes, emails, a diary — the basics. A small Fastify server, a PostgreSQL database, maybe ten tables.&lt;/p&gt;

&lt;p&gt;Somewhere along the way, I ended up with 137 SQL migrations.&lt;/p&gt;

&lt;p&gt;This is the story of how a "quick personal cloud" turned into what we now call a sovereign digital operating system — and what the schema evolution looks like when you're building the plane while flying it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Day 1: the schema that fit on a napkin
&lt;/h2&gt;

&lt;p&gt;Here's the first migration, &lt;code&gt;0000_initial_schema.sql&lt;/code&gt;. Thirteen tables. I remember thinking "this covers everything I'll ever need":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="nv"&gt;"pgcrypto"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="nv"&gt;"vector"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;notes&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;              &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;gen_random_uuid&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;title&lt;/span&gt;           &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt;         &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;content_format&lt;/span&gt;  &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="s1"&gt;'markdown'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;source&lt;/span&gt;          &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="s1"&gt;'local'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tags&lt;/span&gt;            &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;
    &lt;span class="n"&gt;pinned&lt;/span&gt;          &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;archived&lt;/span&gt;        &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt;      &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;updated_at&lt;/span&gt;      &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;deleted_at&lt;/span&gt;      &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notes, events, contacts, emails, files, diary entries, a CRM table (ambition!), Home Assistant events, agent conversations, entity links, OAuth tokens, a license cache, and an embeddings table with pgvector.&lt;/p&gt;

&lt;p&gt;Look at that embeddings table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;              &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;gen_random_uuid&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="k"&gt;domain&lt;/span&gt;          &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;record_id&lt;/span&gt;       &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;content_text&lt;/span&gt;    &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt;       &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;768&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model_used&lt;/span&gt;      &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt;      &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="k"&gt;UNIQUE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;record_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One table for all embeddings. A &lt;code&gt;domain&lt;/code&gt; column to tell notes apart from emails. Simple. Clean.&lt;/p&gt;

&lt;p&gt;That simplicity lasted about four days.&lt;/p&gt;

&lt;h2&gt;
  
  
  The decisions that shaped everything
&lt;/h2&gt;

&lt;p&gt;Three early decisions ended up defining the entire architecture. I didn't fully appreciate their impact at the time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Everything in &lt;code&gt;public&lt;/code&gt; schema
&lt;/h3&gt;

&lt;p&gt;We put all tables in PostgreSQL's default &lt;code&gt;public&lt;/code&gt; schema instead of creating separate schemas per domain (a &lt;code&gt;notes&lt;/code&gt; schema, an &lt;code&gt;emails&lt;/code&gt; schema, etc.). The reasoning was practical: every record in the system — notes, events, contacts, files — needs to go through the same intelligence pipeline. Embeddings. Entity extraction. Heat tracking. Knowledge graph edges. Change log entries.&lt;/p&gt;

&lt;p&gt;If each domain lived in its own schema, every cross-domain feature would need cross-schema joins and duplicated hooks. With everything in &lt;code&gt;public&lt;/code&gt;, a single set of CRUD hooks can process any record type through the same pipeline. A note and an email get embedded, heat-tracked, entity-extracted, and changelog'd by the exact same code path.&lt;/p&gt;

&lt;p&gt;This decision paid off enormously when we built the knowledge graph. An entity link between a contact and a calendar event is just a row in &lt;code&gt;entity_links&lt;/code&gt; — no cross-schema ceremony.&lt;/p&gt;

&lt;h3&gt;
  
  
  Soft delete everywhere
&lt;/h3&gt;

&lt;p&gt;Every single domain table has &lt;code&gt;deleted_at TIMESTAMPTZ&lt;/code&gt;. No exceptions. From day one.&lt;/p&gt;

&lt;p&gt;This wasn't wisdom — it was paranoia. But it turned out to be critical for the sync engine. When you sync with Gmail or Google Calendar, you need to know what was deleted locally so you can propagate that deletion. A hard delete means lost information. &lt;code&gt;deleted_at&lt;/code&gt; means you can diff between "this record was deleted" and "this record never existed."&lt;/p&gt;

&lt;p&gt;It also enabled restore functionality for free. &lt;code&gt;PATCH /notes/:id/restore&lt;/code&gt; is just &lt;code&gt;SET deleted_at = NULL&lt;/code&gt;.&lt;/p&gt;
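&lt;p&gt;The whole contract fits in three statements (sketched against the &lt;code&gt;notes&lt;/code&gt; table; only the &lt;code&gt;deleted_at&lt;/code&gt; column and the restore semantics come from the post, the SQL text is an assumption):&lt;/p&gt;

```typescript
// Illustrative soft-delete queries.
const SOFT_DELETE = `UPDATE notes SET deleted_at = now() WHERE id = $1 AND deleted_at IS NULL`;

// PATCH /notes/:id/restore is literally "un-set the tombstone":
const RESTORE = `UPDATE notes SET deleted_at = NULL WHERE id = $1`;

// Sync diff: records deleted since the last sync cursor, as opposed to
// records that never existed (which simply return no row at all).
const DELETED_SINCE = `SELECT id FROM notes WHERE deleted_at IS NOT NULL AND deleted_at > $1`;
```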

&lt;h3&gt;
  
  
  UUID primary keys on everything
&lt;/h3&gt;

&lt;p&gt;Not &lt;code&gt;SERIAL&lt;/code&gt;, not &lt;code&gt;BIGINT&lt;/code&gt;. UUIDs. Every table.&lt;/p&gt;

&lt;p&gt;The immediate benefit: entity_links can reference any table with just &lt;code&gt;(source_type, source_id)&lt;/code&gt;. No composite foreign keys, no lookup tables. The knowledge graph doesn't care if the source is a note or an email — it's always a UUID.&lt;/p&gt;

&lt;p&gt;The less obvious benefit: when we added multi-user support later, there were zero ID collision issues. Users get created on different instances, sync happens, and UUIDs just work.&lt;/p&gt;

&lt;h2&gt;
  
  
  The migration timeline: from 0 to 137
&lt;/h2&gt;

&lt;p&gt;Here's roughly how the schema evolved, grouped by what was happening:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Migrations 0000–0010: The foundation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The initial thirteen tables. Then tsvector search columns for free-tier full-text search. Then the sync engine infrastructure — &lt;code&gt;change_log&lt;/code&gt;, &lt;code&gt;sync_connectors&lt;/code&gt;, &lt;code&gt;sync_history&lt;/code&gt;. Custom fields support. Email accounts. File snapshots and shared links. Photo albums and face clusters.&lt;/p&gt;

&lt;p&gt;At this point we had maybe 20 tables and the system could do CRUD, search, and sync. A functional personal cloud. I could have stopped here.&lt;/p&gt;

&lt;p&gt;I did not stop here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Migrations 0050–0065: The intelligence layer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where things got wild. We added:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;change_log_affected_columns&lt;/code&gt; — tracking which specific fields changed, not just "this record was updated"&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;record_heat&lt;/code&gt; — heat scores that decay over time, turning every record into a memory that fades unless you interact with it&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;graph_entities&lt;/code&gt; — knowledge graph nodes (Person, Project, Location, Topic) extracted automatically by an LLM&lt;/li&gt;
&lt;li&gt;Extended &lt;code&gt;entity_links&lt;/code&gt; with link_type, confidence, strength, and &lt;code&gt;created_by&lt;/code&gt; columns&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;semantic_cache&lt;/code&gt; — caching API responses by query embedding similarity&lt;/li&gt;
&lt;li&gt;Reserved schemas for future domains (bookmarks, financial accounts, health data)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;record_heat&lt;/code&gt; table deserves its own blog post (coming soon), but here's the gist: every record has a heat score between 0 and 1. It decays exponentially over time. Every time you access a record, the heat bumps up. The search algorithm uses heat as a post-fusion multiplier — hot records rank higher than cold ones, all else being equal.&lt;/p&gt;
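&lt;p&gt;The decay itself is one line of math. A sketch with an assumed half-life (the post doesn't give the real constants):&lt;/p&gt;

```typescript
// Exponential heat decay, clamped to [0, 1]. The 72-hour half-life is an
// assumption for illustration, not the system's actual constant.
function heatScore(lastHeat: number, hoursSinceAccess: number, halfLifeHours = 72): number {
  const decayed = lastHeat * Math.pow(0.5, hoursSinceAccess / halfLifeHours);
  return Math.min(1, Math.max(0, decayed));
}
```

&lt;p&gt;Accessing a record bumps the stored heat back up; this function only describes how it fades in between.&lt;/p&gt;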

&lt;p&gt;The knowledge graph was another inflection point. Suddenly &lt;code&gt;entity_links&lt;/code&gt; wasn't just "this note mentions this contact" — it was "the system automatically discovered that Person:Ana García appeared in 3 emails, 2 notes, and tomorrow's calendar event, with confidence 0.85."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Migrations 0080–0103: Proactive intelligence&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sleep-time jobs. Derived insights. User preferences. Proactive feedback. Agent tokens. Audit logs. Data sensitivity labels. PII routing logs. Pseudonym maps.&lt;/p&gt;

&lt;p&gt;This batch transformed the system from "a database you query" to "a database that thinks while you sleep." The sleep-time engine runs background LLM jobs during idle periods — discovering cross-domain correlations, extracting behavioral preferences, generating insights. All stored in tables that started as sketches on a whiteboard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Migrations 0110–0137: The OS layer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;App system tables. Email moderation. Clipboard items. Visual intelligence embeddings. Face detection infrastructure. Kanban boards with labels, comments, dependencies, and checklists. RSS feeds with categories and AI summarization.&lt;/p&gt;

&lt;p&gt;By migration 0137, the schema had grown from "a personal notes database" into something that manages your entire digital life — notes, emails, calendar, contacts, files, photos, diary, projects, feeds, and an AI that understands the connections between all of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;I'd plan the multi-user migration from day one.&lt;/strong&gt; Migration 0009 (&lt;code&gt;add_user_id_to_domain_tables&lt;/code&gt;) added &lt;code&gt;user_id&lt;/code&gt; to every single domain table. Every. Single. One. That's a lot of &lt;code&gt;ALTER TABLE&lt;/code&gt; statements, a lot of index recreations, and a lot of unique constraint changes (diary entries went from &lt;code&gt;UNIQUE(entry_date)&lt;/code&gt; to &lt;code&gt;UNIQUE(user_id, entry_date)&lt;/code&gt;). If I'd included &lt;code&gt;user_id&lt;/code&gt; in the initial schema, that migration wouldn't exist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I'd version-track the embedding model from the start.&lt;/strong&gt; We added &lt;code&gt;model_version&lt;/code&gt; to the embeddings table later. But by then we'd already generated thousands of embeddings with one model. When we switched from &lt;code&gt;nomic-embed-text&lt;/code&gt; to &lt;code&gt;qwen3-embedding:0.6b&lt;/code&gt;, we had to re-embed everything. If the model version had been there from day one, the re-embedding could have been incremental.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I'd actually apply my migrations.&lt;/strong&gt; The single most common bug in development wasn't a code error — it was me forgetting to run a migration I'd already written. The schema in the SQL file said one thing; the database said another. We ended up with a simple &lt;code&gt;for f in core/drizzle/0*.sql; do psql $DATABASE_URL -f "$f"; done&lt;/code&gt; command, but I should have automated it from week one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;p&gt;As of today:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SQL migration files&lt;/td&gt;
&lt;td&gt;137&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tables in the schema&lt;/td&gt;
&lt;td&gt;~50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Drizzle ORM schemas&lt;/td&gt;
&lt;td&gt;35+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Domains (notes, events, etc.)&lt;/td&gt;
&lt;td&gt;7 core + 5 extended&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Index definitions&lt;/td&gt;
&lt;td&gt;80+&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Some migrations took minutes to write. Others took days of back-and-forth to get right.&lt;/p&gt;

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;If I could go back and tell February-18th me one thing, it would be: &lt;strong&gt;the schema is the product.&lt;/strong&gt; Not the API. Not the frontend. Not the AI. The schema.&lt;/p&gt;

&lt;p&gt;Every feature we built — heat scoring, knowledge graph, sleep-time intelligence, PII routing, multi-user isolation — started as a migration file. The tables defined the boundaries of what was possible. Getting the schema right (or at least right enough to iterate on) was the single highest-leverage activity in the entire project.&lt;/p&gt;

&lt;p&gt;137 migrations sounds like a lot. And it is. But each one was a small, deliberate step from "personal notes database" to something much bigger. And we're not done — migration 0138 is already drafted.&lt;/p&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpyuiztuuegf7l10vgg24.webp" alt="notes module" width="800" height="450"&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is the first in a series of technical posts about building a self-hosted AI productivity OS. Next up: how a 4-slot hook pipeline makes every CRUD operation feed embeddings, heat tracking, entity extraction, and the changelog — without any of them blocking each other.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>architecture</category>
      <category>selfhosted</category>
    </item>
  </channel>
</rss>
