Forem: Lars Winstand

I thought multi-agent meant more prompts until I saw 3 ways OpenClaw users are actually splitting the work

Lars Winstand — Sun, 03 May 2026 12:39:41 +0000

I went into a bunch of OpenClaw discussions expecting the usual advice about subagents: better prompts, cleaner folders, maybe some heroic config.

What I found was more interesting.

The OpenClaw setups that actually seem to hold up are not just "one agent with more prompts." They are separate services with separate trust zones.

The pattern that keeps showing up looks like this:

a librarian agent
an executor agent
a company-facing agent

Usually connected over A2A.

That sounds like a small implementation detail. It is not.

A separate prompt inside one workspace is still one workspace:

one context blob
one tool surface
one security boundary
one place for bloat to accumulate

A separate OpenClaw instance is different. Now you have real boundaries:

different runtimes
different API keys
different networks
different memory policies
explicit handoffs

That is where multi-agent starts being architecture instead of roleplay.

The Reddit pattern is ahead of most blog posts

One of the clearest examples was an r/openclaw thread about an A2A plugin:

https://reddit.com/r/openclaw/comments/1t1yf86/i_made_an_openclaw_a2a_plugin_connect_your/

The post itself was small, but the use cases were sharp:

a sandboxed local OpenClaw talking to a full-access cloud OpenClaw
a personal OpenClaw talking to a company-wide OpenClaw for internal services
teammate agents syncing plans over the internet to avoid stepping on each other

That is not prompt organization. That is system design.

And it answers the question I keep seeing from people trying to force multi-agent into one workspace:

Why not just keep everything in one OpenClaw workspace?

Because the boundary is the point.

If your librarian, executor, and company-facing assistant all live in the same workspace, a lot of the specialization is fake.

The librarian can still see too much.

The executor still inherits too much context.

The company-facing assistant is still one bad tool call away from touching something it should not.

Here is the tradeoff in plain terms:

Approach	What actually happens
Separate A2A services	Clear trust boundary, can run on different machines or networks, but setup and security overhead are real
Subagents inside one OpenClaw workspace	Fast and simple, lower latency, but weaker isolation of tools and context and easier to bloat
n8n for orchestration plus agents for reasoning	Great for deterministic triggers and data movement, but glue code gets messy fast

My opinionated take: multi-agent is only worth the complexity when the boundary is real.

If the split is just:

this prompt is the researcher
this prompt is the coder

then you probably do not have multiple agents. You have one agent wearing name tags.

The librarian pattern is better than it sounds

A commenter in that A2A thread described a pattern I think more teams should steal:

I need an agent that acts as a librarian and gatekeeper for a RAG implementation.

That is a strong design choice because it forces a question most agent stacks avoid:

Who is allowed to touch memory, and why?

A librarian agent can own retrieval and document selection.

It can decide:

which sources are valid
how much context to return
whether a query deserves a deep search
what gets filtered before it reaches the executor

Then your executor agent can stay focused on doing work instead of dragging your entire RAG stack into every session.

When a separate librarian makes sense

Use a dedicated librarian when:

retrieval needs its own rules
memory access should be restricted
different agents need different knowledge slices
you want to keep executor context small

When direct memory access is better

Keep it simple when:

everything is local
latency matters more than isolation
the same agent already owns the knowledge domain
you are adding A2A mostly because it sounds advanced

That tradeoff matters more than the label.

Not every boundary should become a network boundary.

But the useful ones usually should.

A practical split: one agent per trust boundary

The cleanest rule I found is this:

one agent per trust boundary
one agent per memory policy
one agent per tool class

That usually gives you something like this:

1. Librarian

Owns:

retrieval
indexing rules
memory access
document selection

2. Executor

Owns:

actions
code changes
task completion
narrow operational tools

3. Company-facing interface

Owns:

internal service access
approvals
policy enforcement
boring but critical guardrails

If two of those share the same tools, same memory, same runtime, and same risk profile, they probably should not be separate yet.

If they differ on any of those, split them.
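To make that rule concrete, here is a minimal sketch that compares two agents on the four properties named above. The profile shape and the names `boundaryDifferences` and `shouldSplit` are my own illustration, not anything OpenClaw provides.

```javascript
// Hypothetical profile shape: the four properties from the rule above.
const BOUNDARY_KEYS = ["tools", "memory", "runtime", "risk"];

// Returns the properties on which two agent profiles differ.
function boundaryDifferences(a, b) {
  return BOUNDARY_KEYS.filter((key) => {
    // Normalize strings and arrays to a sorted, comparable string.
    const left = [].concat(a[key]).sort().join(",");
    const right = [].concat(b[key]).sort().join(",");
    return left !== right;
  });
}

// Split only when at least one real boundary differs.
function shouldSplit(a, b) {
  return boundaryDifferences(a, b).length > 0;
}

const librarian = { tools: ["vector-search", "doc-fetch"], memory: "read-only-docs", runtime: "librarian", risk: "low" };
const executor = { tools: ["git", "shell"], memory: "none", runtime: "executor", risk: "high" };
const coderPrompt = { tools: ["shell", "git"], memory: "none", runtime: "executor", risk: "high" };

console.log(shouldSplit(librarian, executor));   // true: every property differs
console.log(shouldSplit(executor, coderPrompt)); // false: same agent, different name tag
```

If the difference list comes back empty, the second "agent" is probably just another prompt.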

What this looks like in practice

Here is a simple mental model:

[user/app]
   |
   v
[company-facing OpenClaw]
   |
   +--> [librarian OpenClaw] --> [docs/vector store]
   |
   +--> [executor OpenClaw] --> [repo/tools/shell]

And here is the kind of split I would actually implement.

Company-facing agent

This is the only agent that talks to the outside world.

Responsibilities:

receive requests
check policy
decide whether work needs retrieval, execution, or both
redact or reshape requests before forwarding

Librarian agent

This agent gets read-only access to your knowledge systems.

Responsibilities:

search docs
fetch relevant chunks
summarize long context
return only what downstream agents need

Executor agent

This one gets the dangerous tools.

Responsibilities:

write code
run commands
modify files
execute workflows

That split avoids the worst anti-pattern: giving the same agent broad memory access and broad tool access and then hoping the prompt keeps it safe.
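A rough sketch of the company-facing agent's forwarding step, assuming the retrieval-or-execution decision has already been made and arrives as two flags. The request shape, the `redact` pattern, and `planHandoffs` are hypothetical; real redaction would need far more than one regex.

```javascript
// Illustrative only: masks values that look like key=..., token=..., password=...
function redact(text) {
  return text.replace(/\b(key|token|password)=\S+/gi, "$1=[redacted]");
}

// Hypothetical request shape: { text, needsDocs, needsAction }.
// Returns one handoff per downstream agent that is actually needed.
function planHandoffs(request) {
  const task = redact(request.text);
  const handoffs = [];
  if (request.needsDocs) handoffs.push({ to: "librarian", task });
  if (request.needsAction) handoffs.push({ to: "executor", task });
  return handoffs;
}

const plan = planHandoffs({
  text: "fix login, token=abc123",
  needsDocs: true,
  needsAction: false,
});
console.log(plan); // one handoff, to the librarian, with the token masked
```

The point is that downstream agents only ever see the reshaped task, never the raw request.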

Security is where the fantasy ends

This is the first serious objection in every good A2A discussion, and it should be.

In that same A2A thread, someone pointed out the obvious risk: inbound calls can trigger OpenClaw tools.

That is not paranoia. That is basic engineering.

The plugin author responded with a few practical details:

secure-by-default posture
per-agent API keys
sender IDs
new conversation threads for each inbound message
Tailscale for receiving messages

They also suggested using a separate profile for experiments:

openclaw --profile gateway

That is the right mindset.
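Here is how I picture three of those details (per-agent keys, sender IDs, a fresh thread per inbound message) fitting together. This is a sketch under my own assumptions, not the plugin's actual code: the registry, the message shape, and `admitInbound` are hypothetical.

```javascript
// Hypothetical registry: sender ID -> API key issued to that agent.
const senderKeys = new Map([
  ["librarian", "key-librarian-example"],
  ["executor", "key-executor-example"],
]);

let threadCounter = 0;

// Accepts a message only from a known sender presenting that sender's own key.
// Every accepted message gets a new thread ID, so no earlier context rides along.
// Note: production code should compare keys in constant time and avoid storing them in plain text.
function admitInbound(message) {
  const expected = senderKeys.get(message.senderId);
  if (expected === undefined || expected !== message.apiKey) {
    return { accepted: false };
  }
  threadCounter += 1;
  return { accepted: true, senderId: message.senderId, threadId: `thread-${threadCounter}` };
}

const first = admitInbound({ senderId: "librarian", apiKey: "key-librarian-example" });
const second = admitInbound({ senderId: "librarian", apiKey: "key-librarian-example" });
const stolen = admitInbound({ senderId: "executor", apiKey: "key-librarian-example" });
console.log(first.threadId !== second.threadId, stolen.accepted); // true false
```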

A2A is not magic. It is distributed systems with LLMs attached.

Which means you inherit the normal taxes:

security tax
ops tax
debugging tax
latency tax

If you are not getting a real boundary in return, do not pay those taxes.

Add n8n carefully or you will build glue-code soup

Another useful OpenClaw thread described a setup with:

a shared VPS
multiple OpenClaw agents
n8n
local users connecting through Antigravity

Source:

https://reddit.com/r/openclaw/comments/1t0nnkz/am_i_overengineering_this_openclaw_n8n/

That architecture is not crazy.

But it gets messy fast if every system co-owns the workflow.

My rule of thumb:

let n8n handle deterministic flows, triggers, schedules, and integrations
let OpenClaw handle reasoning, exception handling, and ambiguous tasks
keep the number of cross-service handoffs lower than your first instinct suggests

A simple split looks like this:

n8n:
  owns:
    - cron jobs
    - webhooks
    - API integrations
    - retries

openclaw:
  owns:
    - planning
    - reasoning
    - ambiguous decisions
    - code generation

If you make n8n, OpenClaw, and your local client all coordinate state, debugging gets ugly.

You end up tracing things like:

OpenClaw A calls OpenClaw B
OpenClaw B triggers n8n
n8n writes state
OpenClaw A no longer trusts the state it originally requested

That is not a model problem. That is orchestration debt.

The expensive part is often not the model

One of the most useful OpenClaw cost posts I found came from a user who spent about $850 in a month, including around $350 in one day:

https://reddit.com/r/openclaw/comments/1t2fd8o/spent_850_on_openclaw_in_a_month_350_in_one_day/

The key line was this:

At first I thought it was model cost. It wasn’t. It was bad system design.

That should be printed on a sticker and attached to every agent dashboard.

The fixes were not exotic:

strict context pruning
short sessions
n8n for repeat tasks
workspace cleanup

They reported 70 to 90 percent savings after redesigning the stack.

That matches what a lot of teams eventually learn:

The bill is not just about which model you picked.

It is about:

how much useless context you drag around
how often the wrong agent gets invoked
how many handoffs you created
how much deterministic work you let an LLM do

This is exactly why real boundaries matter.

A librarian agent can stay small.

An executor can stay sharp.

A company-facing agent can stay boring.

That is not architecture purity. That is cost control.

A minimal implementation sketch

If I were building this today, I would start with something like this.

1. Create isolated runtimes

openclaw --profile company
openclaw --profile librarian
openclaw --profile executor

2. Give each runtime only the tools it needs

{
  "company": ["policy-check", "request-router"],
  "librarian": ["vector-search", "doc-fetch", "rerank"],
  "executor": ["git", "shell", "test-runner"]
}

3. Keep the message contract small

{
  "task": "summarize auth flow docs relevant to OAuth token refresh bugs",
  "constraints": ["read-only", "max 10 chunks"],
  "request_id": "req_123"
}

4. Return only what the next agent needs

{
  "request_id": "req_123",
  "summary": "Token refresh logic lives in auth-service and mobile-sdk",
  "sources": [
    "docs/auth/refresh-flow.md",
    "docs/mobile/oauth.md"
  ]
}

That one habit alone prevents a lot of context bloat.
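One way to enforce that habit is to check messages against the contract at each hop. The sketch below uses the field names from the request and response examples above; the helper names `checkRequest` and `trimResponse` are my own.

```javascript
const REQUEST_FIELDS = ["task", "constraints", "request_id"];
const RESPONSE_FIELDS = ["request_id", "summary", "sources"];

// Returns a list of problems; an empty list means the request fits the contract.
function checkRequest(message) {
  const problems = [];
  for (const key of Object.keys(message)) {
    if (!REQUEST_FIELDS.includes(key)) problems.push(`unexpected field: ${key}`);
  }
  if (typeof message.task !== "string" || message.task.length === 0) {
    problems.push("task must be a non-empty string");
  }
  if (!Array.isArray(message.constraints)) problems.push("constraints must be an array");
  if (typeof message.request_id !== "string") problems.push("request_id must be a string");
  return problems;
}

// Copies only the contract fields and drops anything else the agent produced.
function trimResponse(raw) {
  const trimmed = {};
  for (const key of RESPONSE_FIELDS) {
    if (key in raw) trimmed[key] = raw[key];
  }
  return trimmed;
}

const reply = trimResponse({
  request_id: "req_123",
  summary: "Token refresh logic lives in auth-service and mobile-sdk",
  sources: ["docs/auth/refresh-flow.md"],
  scratchpad: "thousands of tokens of intermediate reasoning",
});
console.log(Object.keys(reply)); // [ 'request_id', 'summary', 'sources' ]
```

Rejecting unexpected fields on the way in and trimming on the way out keeps a chatty agent from leaking its working context to the next one.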

How I would decide whether to split an agent

Before creating a new agent, ask:

should this component have different tool permissions?
should this component have different memory access?
should this component run in a different network or trust zone?
would this split reduce context size in a meaningful way?
can I explain the boundary without using the phrase "it feels cleaner"?

If the answer to most of those is no, keep one agent.

If the answer to most of them is yes, split it.

Where Standard Compute fits

There is one more practical issue here: once you start doing multi-agent properly, request volume goes up fast.

Not because you are being wasteful. Because clean architecture creates more small calls:

routing calls
retrieval calls
execution calls
retries
background automations

That is exactly where per-token pricing becomes annoying.

You stop optimizing for quality and start optimizing for what will not surprise you on the invoice.

For OpenClaw users running always-on agents, that is backwards.

Standard Compute is built for this exact situation:

unlimited AI compute for OpenClaw at a flat monthly price
no per-token billing
works with existing OpenClaw setups using a custom prompt
dynamic routing across GPT-5.4, Claude Opus 4.6, and Grok 4.20
plans from $9 to $399 per month

If your stack is moving from "one giant workspace" to actual multi-agent services, predictable cost matters a lot more than people admit.

Because the fastest way to ruin a good architecture is making developers afraid to let agents run.

The boring takeaway that will save you later

If you are building with OpenClaw, do not start with:

how many agents should I have?

Start with:

which agent should know this?
which agent should be allowed to do this?
which agent should pay the context cost for this?

If all three answers point to the same place, keep it in one workspace.

If they do not, stop stuffing more prompts into one bot and calling it architecture.

That is the shift I keep seeing in OpenClaw discussions.

Not more agents for the sake of it.

Better boundaries.

Less context bloat.

Fewer surprise bills.

And systems that still make sense when they are running under pressure.

I found the dumbest way to burn 500 LLM calls a day: polling an inbox every 5 minutes

Lars Winstand — Sat, 02 May 2026 13:44:12 +0000

If your OpenClaw agent checks an email inbox every 5 minutes, you’re probably paying for idle paranoia.

That’s not a theoretical complaint. In an r/openclaw thread about triggering jobs from email, one user described an MS365 setup like this:

"At the moment, I have Openclaw job where agent checks its ms365 mailbox every 5 minutes... Wasted calls to LLM (nearly 500 calls to LLM per day)"

That is such a painfully real failure mode.

The demo works. The cron job looks harmless. Then a month later your agent is re-checking old mail, occasionally double-processing messages, and quietly spending model calls on nothing.

If you’re building always-on agents, this is exactly the kind of bug that turns “cool automation” into “why is this thing flaky and expensive?”
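The arithmetic is worth doing once. A 5-minute interval means 288 checks per day, so the "nearly 500" figure implies more than one LLM call per check on average. The thread does not break that number down, so the per-check ratio here is my inference, not something the poster reported.

```javascript
const MINUTES_PER_DAY = 24 * 60;

// Number of scheduled checks in a day for a given polling interval.
function checksPerDay(intervalMinutes) {
  return Math.floor(MINUTES_PER_DAY / intervalMinutes);
}

const checks = checksPerDay(5);      // 288
const reportedCalls = 500;           // "nearly 500" from the quoted post
const callsPerCheck = reportedCalls / checks;

console.log(checks, callsPerCheck.toFixed(2)); // prints: 288 1.74
```

Either way, every one of those calls happens whether or not any mail arrived.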

The pattern everyone starts with

Usually it looks like this:

Connect OpenClaw to a mailbox
Poll every 5 minutes with IMAP or Microsoft Graph
If there’s a new message, send it to GPT-5.4, Claude Opus 4.6, or whatever model you’re using
Try not to process the same email twice

For a proof of concept, that’s fine.

If it’s one internal mailbox, low volume, and you have a tiny dedupe store in SQLite, polling can be good enough.

But once the workflow matters, polling starts failing in boring and expensive ways:

you keep checking when nothing changed
you burn LLM calls on already-seen messages
you introduce delays by design
you get duplicate processing when scans overlap
you miss messages when state gets out of sync

Another user in that same r/openclaw discussion put it even more bluntly:

"I abandoned the interval based scanning... if the scan got out of sync I had repeated responses (more wasted calls) or ignored mails. I failed to get it to be reliable."

That’s the actual problem.

Polling doesn’t just waste money. It makes the agent feel unreliable.

And unreliable is worse than expensive.

Microsoft and Google are both telling you to stop polling

This part is worth emphasizing: the anti-polling advice is not just random architecture purism.

Microsoft Graph supports change notifications so apps can react to mailbox changes instead of hammering the API on a timer.

Gmail push notifications exist for the same reason. Google says push eliminates the extra network and compute cost of polling resources to see if they changed.

If both mailbox providers are nudging you toward push, that’s a clue.

What production intake should look like

There are a few sane ways to do inbound email for agents:

Gmail API watch + Google Cloud Pub/Sub
Microsoft Graph change notifications
Twilio SendGrid Inbound Parse Webhook
an email-native service like AgentMail

The common idea is simple:

The provider tells your system that mail arrived.

Your system does not keep asking if anything changed.

Gmail: watch the inbox instead of polling it

For Gmail, the production path is Gmail API watch on the inbox, then Pub/Sub delivers notifications to your webhook.

Example request:

POST https://gmail.googleapis.com/gmail/v1/users/me/watch
Content-Type: application/json
Authorization: Bearer <access_token>

{
  "topicName": "projects/myproject/topics/mytopic",
  "labelIds": ["INBOX"],
  "labelFilterBehavior": "INCLUDE"
}

Google returns a history ID and an expiration time.

That means two things:

you need to process changes based on history
you need to renew the watch before it expires

This is cleaner than polling, but it is not zero-maintenance.

You still need:

a Pub/Sub topic
a subscription
IAM configured correctly
watch renewal logic

If you skip the lifecycle work, your “event-driven” setup becomes a very fancy outage.
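A minimal sketch of that lifecycle bookkeeping, assuming you store the two values from the watch response. Gmail returns `expiration` as epoch milliseconds in a string; the one-day renewal margin and the helper names are my own choices.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Keeps the two fields from the watch response that later steps depend on.
function watchState(watchResponse) {
  return {
    lastHistoryId: watchResponse.historyId,          // starting point for history-based sync
    expiresAtMs: Number(watchResponse.expiration),   // epoch milliseconds, sent as a string
  };
}

// True when less than marginMs remains before the watch expires.
function needsRenewal(state, nowMs, marginMs = DAY_MS) {
  return state.expiresAtMs - nowMs < marginMs;
}

const now = 1700000000000;
const state = watchState({ historyId: "12345", expiration: String(now + 2 * DAY_MS) });
console.log(needsRenewal(state, now));                // false: two days left
console.log(needsRenewal(state, now + 1.5 * DAY_MS)); // true: half a day left
```

Run the renewal check on a schedule and re-issue the watch call whenever it returns true.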

Microsoft 365: use Graph change notifications

For Microsoft 365, use Microsoft Graph subscriptions for Outlook messages.

Example subscription:

POST https://graph.microsoft.com/v1.0/subscriptions
Content-Type: application/json
Authorization: Bearer <access_token>

{
  "changeType": "created",
  "notificationUrl": "https://your-app.example.com/webhooks/graph",
  "resource": "/me/mailFolders('Inbox')/messages",
  "expirationDateTime": "2026-05-03T00:00:00Z",
  "clientState": "openclaw-mailbox-prod"
}

You need to handle:

webhook validation
subscription renewal
clientState verification
dedupe after notification delivery

Again: more setup than polling, much better behavior in production.
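The first and third items can be sketched as one framework-free function. Graph validates a new endpoint by sending a `validationToken` query parameter and expecting it echoed back as plain text; real notifications arrive as a `value` array whose entries carry the `clientState` you set. The request and response shapes below are simplified stand-ins of my own, and silently dropping mismatched notifications is one possible policy, not the only one.

```javascript
// request: { query, body } taken from whatever HTTP framework you use.
function handleGraphWebhook(request, expectedClientState) {
  // Validation handshake: echo the token back as text/plain with a 200.
  if (request.query && request.query.validationToken) {
    return { status: 200, contentType: "text/plain", body: request.query.validationToken, accepted: [] };
  }

  // Normal delivery: keep only notifications carrying our clientState.
  const notifications = (request.body && request.body.value) || [];
  const accepted = notifications.filter((n) => n.clientState === expectedClientState);

  // Acknowledge quickly; the accepted items go to dedupe and the queue, not to a model.
  return { status: 202, contentType: "text/plain", body: "", accepted };
}

const handshake = handleGraphWebhook({ query: { validationToken: "abc" }, body: null }, "openclaw-mailbox-prod");
const delivery = handleGraphWebhook(
  { query: {}, body: { value: [{ clientState: "openclaw-mailbox-prod" }, { clientState: "forged" }] } },
  "openclaw-mailbox-prod"
);
console.log(handshake.body, delivery.accepted.length); // prints: abc 1
```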

SendGrid is the cleanest mental model

If you want the simplest model for inbound email to HTTP, SendGrid Inbound Parse is hard to beat.

Email arrives.

SendGrid parses it.

SendGrid POSTs the content to your endpoint.

Minimal example in Node:

import express from "express";
import multer from "multer";

const app = express();

// SendGrid Inbound Parse posts multipart/form-data, which express.urlencoded()
// and express.json() do not parse. multer (an extra dependency) handles it.
const upload = multer();

app.post("/inbound-email", upload.any(), async (req, res) => {
  // "headers" is the raw header block as a single string.
  const messageId = req.body.headers?.match(/Message-ID: (.+)/i)?.[1]?.trim() || req.body.message_id;
  const from = req.body.from;
  const subject = req.body.subject;
  const text = req.body.text;

  // 1. dedupe check
  // 2. persist event
  // 3. enqueue background processing

  console.log({ messageId, from, subject, text });

  // Reply fast; a 2XX tells SendGrid to stop retrying.
  res.status(200).send("ok");
});

app.listen(3000, () => {
  console.log("Listening on :3000");
});

The nice part is the delivery contract.

If your endpoint returns 5XX, SendGrid retries.
If your endpoint returns 2XX, retries stop.

That is a much sharper failure model than “cron ran, maybe.”

There are constraints:

a total message size limit (30 MB including attachments, per SendGrid's documentation)
dedicated receiving subdomain setup
MX record configuration

Still better than burning cycles forever because polling was easier on day one.

n8n helps, but it does not magically fix polling

This comes up a lot: “Can’t I just use n8n?”

You can absolutely use n8n to improve the workflow.

But if you use the n8n Email Trigger over IMAP, you are still doing mailbox-checking infrastructure. It’s just nicer mailbox-checking infrastructure.

That matters.

n8n gives you useful features like:

mailbox selection
mark as read
attachment handling
custom search rules
reconnect controls

That is a lot better than a hand-rolled cron script.

But it does not change the trigger model.

If the source of truth is still “go ask the mailbox if anything happened,” you still have polling-shaped failure modes.

Polling vs push

Here’s the tradeoff in plain English:

Approach	What you’re really signing up for
Poll mailbox with IMAP or cron	Easy setup, delayed reactions, duplicate checks, wasted model calls, awkward dedupe logic
n8n Email Trigger (IMAP)	Better operational ergonomics, but still polling underneath
Gmail watch / Graph notifications / SendGrid webhook	More setup, much lower idle waste, faster reactions, better delivery semantics

This is not really “simple vs advanced.”

It’s demo-friendly vs production-friendly.

What your OpenClaw email pipeline should actually do

If I were building this today, I’d split it into two layers.

Layer 1: intake

Pick one:

SendGrid Inbound Parse if you want email -> HTTP
Gmail watch + Pub/Sub if you’re on Google Workspace
Microsoft Graph notifications if you’re on Microsoft 365
n8n IMAP only for a fast proof of concept

Layer 2: idempotent processing

No matter how the event arrives, your OpenClaw job should:

extract a stable message ID
check a dedupe store before calling any model
persist processing state
acknowledge receipt quickly
do the expensive work asynchronously

That last point is where people get into trouble.

Do not do all processing inside the webhook request.

Accept the event.
Store it.
Deduplicate it.
Then hand it off.

That’s how you survive retries without duplicate replies.
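Those four steps can be sketched with in-memory stand-ins. The `Map` and the array below are placeholders for a real database and a durable queue, and `acceptEvent` is a name I made up for illustration.

```javascript
// In-memory stand-ins for the dedupe store and the job queue.
const seen = new Map();
const queue = [];

// Returns quickly in every case; no model is ever called from here.
function acceptEvent(provider, messageId, payload) {
  const key = `${provider}:${messageId}`;
  if (seen.has(key)) {
    // A retry or duplicate delivery: acknowledge it, do nothing else.
    return { status: 200, duplicate: true };
  }
  seen.set(key, { payload, status: "pending" }); // store
  queue.push(key);                               // hand off
  return { status: 200, duplicate: false };
}

const firstDelivery = acceptEvent("sendgrid", "<abc@example.com>", { subject: "hello" });
const retryDelivery = acceptEvent("sendgrid", "<abc@example.com>", { subject: "hello" });
console.log(firstDelivery.duplicate, retryDelivery.duplicate, queue.length); // prints: false true 1
```

Both deliveries get a 200, but only one job reaches the queue.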

A minimal queue-based pattern

Here’s a practical shape for the service:

email-webhook -> postgres(inbox_events) -> job queue -> OpenClaw worker -> reply/send action

Pseudo-schema:

create table inbox_events (
  id bigserial primary key,
  provider text not null,
  external_message_id text not null,
  received_at timestamptz not null default now(),
  payload jsonb not null,
  processing_status text not null default 'pending',
  unique(provider, external_message_id)
);

Worker logic:

async function processInboxEvent(event) {
  const existing = await db.findByProviderAndMessageId(
    event.provider,
    event.external_message_id
  );

  if (!existing) {
    throw new Error("missing event");
  }

  if (existing.processing_status === "done") {
    return;
  }

  await db.markProcessing(existing.id);

  const result = await runOpenClawAgent({
    email: existing.payload
  });

  await db.saveResult(existing.id, result);
  await db.markDone(existing.id);
}
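One caveat with the worker above: it reads the status and then marks it in two separate steps, so two workers picking up the same event at the same moment could both pass the check. A common fix is to claim the event in a single conditional update. The sketch below shows the idea with an in-memory map; `claimEvent` is hypothetical, and the SQL in the comment is how I would expect it to look against the schema above.

```javascript
// In-memory stand-in for the inbox_events table, keyed by id.
const events = new Map([[1, { processing_status: "pending" }]]);

// Moves an event from "pending" to "processing" and reports whether this
// caller won the claim. In Postgres the same idea is one statement:
//   update inbox_events set processing_status = 'processing'
//   where id = $1 and processing_status = 'pending'
//   returning id;
// Zero rows returned means another worker already claimed it.
function claimEvent(id) {
  const event = events.get(id);
  if (!event || event.processing_status !== "pending") return false;
  event.processing_status = "processing";
  return true;
}

console.log(claimEvent(1)); // true: this worker owns the event
console.log(claimEvent(1)); // false: already claimed
```

A worker that loses the claim simply returns without calling the model.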

That is much less exciting than prompt tricks.

It is also the difference between a system that feels solid and one that occasionally replies twice at 3 AM.

The cost side gets ugly fast

If your agent is always on, wasted checks become real money or real usage pressure.

This is where pricing model matters.

Per-token billing makes polling bugs feel worse because every pointless re-check and duplicate pass looks like another tiny leak. You start optimizing prompts and reducing context not because it improves quality, but because you’re trying to contain operational sloppiness.

That’s backwards.

If you’re running OpenClaw agents continuously, predictable flat-rate compute is a much better fit than watching token spend all day. Standard Compute is built for exactly that: OpenAI-compatible API access for OpenClaw agents, flat monthly pricing, and dynamic routing across models like GPT-5.4, Claude Opus 4.6, and Grok 4.20.

So yes, fix the architecture first.

But also: if your agents run 24/7, stop pairing always-on automation with pricing that punishes every extra call.

When polling is still okay

Polling is not always wrong.

Use it when:

you have one internal mailbox
volume is low
a few minutes of delay is fine
you have dedupe in SQLite or Postgres
nobody will care if you rebuild it later

That is a proof of concept.

Just be honest that it is a proof of concept.

The mistake is pretending that a polling loop is production architecture for a customer-facing or always-on agent.

It isn’t.

The actual line between toy and production

The interesting distinction is not whether OpenClaw can read email.

Of course it can.

The distinction is:

how the email arrives
whether processing is idempotent after it arrives

A toy automation asks the mailbox every few minutes if anything happened.

A production agent gets an event, validates it, records it once, and processes it once.

That sounds boring.

It’s also the difference between “works in a demo” and “still works three months later.”

If your OpenClaw workflow still polls an inbox every 5 minutes, I wouldn’t call it broken.

I’d call it unfinished.

And once you’ve seen nearly 500 LLM calls per day wasted on mailbox checks, it’s hard to unsee.