<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Manveer Chawla</title>
    <description>The latest articles on Forem by Manveer Chawla (@manveer_chawla_64a7283d5a).</description>
    <link>https://forem.com/manveer_chawla_64a7283d5a</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3271159%2F5d4c3ad5-7832-4565-bf5c-b790ca7ea6ff.jpg</url>
      <title>Forem: Manveer Chawla</title>
      <link>https://forem.com/manveer_chawla_64a7283d5a</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/manveer_chawla_64a7283d5a"/>
    <language>en</language>
    <item>
      <title>How to Connect AI Agents to Enterprise Productivity Tools Securely (2026 Architecture Guide)</title>
      <dc:creator>Manveer Chawla</dc:creator>
      <pubDate>Thu, 09 Apr 2026 20:58:36 +0000</pubDate>
      <link>https://forem.com/arcade/how-to-connect-ai-agents-to-enterprise-productivity-tools-securely-2026-architecture-guide-5d0n</link>
      <guid>https://forem.com/arcade/how-to-connect-ai-agents-to-enterprise-productivity-tools-securely-2026-architecture-guide-5d0n</guid>
      <description>&lt;p&gt;Most enterprise AI agents today can analyze but can't execute. They summarize documents, surface insights, and draft responses. They don't close support tickets, update Salesforce, or trigger deployments. The ROI stays incremental. The architecture that solves this is an MCP runtime, a secure execution layer that handles authorization, credentials, and tool calling on behalf of each user.&lt;/p&gt;

&lt;p&gt;The real transformation happens when agents take actions, when employees direct work instead of doing it. But getting agents to safely execute across enterprise systems is where everything falls apart.&lt;/p&gt;

&lt;p&gt;Recent industry studies from IDC and MIT show that &lt;a href="https://fortune.com/2025/08/18/mit-report-95-percent-generative-ai-pilots-at-companies-failing-cfo/" rel="noopener noreferrer"&gt;88 to 95 percent of enterprise AI pilots fail to reach production&lt;/a&gt;. The root cause isn't the language model. It's the complexity of secure integration, and every month spent rebuilding auth plumbing is a month your agents aren't delivering business value.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use an MCP runtime as the secure action layer&lt;/strong&gt; between your agents and enterprise tools. It evaluates the intersection of agent permissions and user permissions per action at runtime.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execute every tool call on behalf of the user (OBO).&lt;/strong&gt; The agent acts with the user's credentials, scoped to the user's native permissions, and every action is attributable in audit logs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep OAuth tokens out of the LLM context.&lt;/strong&gt; Credentials must be vaulted at the runtime layer where the model cannot observe, alter, or leak them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Do not use static service accounts.&lt;/strong&gt; They break permission models and turn a single prompt injection into an enterprise-wide incident.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build with agent-optimized tools, not raw API wrappers&lt;/strong&gt;: intent-level operations with validated schemas that prevent parameter hallucination and eliminate retry loops.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Require human-in-the-loop approvals for all destructive actions.&lt;/strong&gt; Deletes, bulk updates, and external communications must pause for explicit sign-off before execution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ship audit logs and telemetry from day one.&lt;/strong&gt; Export every tool call via OpenTelemetry to your SIEM for compliance, incident response, and root cause analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why connecting AI agents to enterprise tools is hard: identity, permissions, and safe execution
&lt;/h2&gt;

&lt;p&gt;The bottleneck in agentic systems, such as Claude Cowork or OpenClaw, isn't making API calls. It's identity propagation, permission inheritance, and safe execution within complex enterprise environments.&lt;/p&gt;

&lt;p&gt;When teams build direct integrations between LLMs and enterprise software, they immediately hit friction. Developers spend cycles managing fragile OAuth token lifecycles, handling async user consent flows, manually tuning least-privilege authorization scopes, and building custom approval controls. This is undifferentiated infrastructure work that burns engineering time without advancing the agent's core capabilities.&lt;/p&gt;

&lt;p&gt;Because this work is tedious and blocks core agent development, teams frequently take a dangerous shortcut: they use service accounts.&lt;/p&gt;

&lt;p&gt;Granting an agent global read and write access across an entire enterprise instance breaks native permission models. You're bypassing years of carefully configured role-based access controls.&lt;/p&gt;

&lt;p&gt;A single manipulated input can result in instant, untraceable data exfiltration or system modification. If an agent holds a static API key with global write access, a localized &lt;a href="https://genai.owasp.org/llm-top-10/" rel="noopener noreferrer"&gt;prompt injection vulnerability&lt;/a&gt; becomes an enterprise-wide blast radius.&lt;/p&gt;

&lt;p&gt;Teams make two mistakes here. Give the agent its own identity, and an intern can bypass their permissions through the agent. Inherit the user's full access, and one prompt injection cascades through every connected system.&lt;/p&gt;

&lt;p&gt;The right answer is the intersection: what is this agent allowed to do &lt;strong&gt;AND&lt;/strong&gt; what is this user allowed to do, evaluated per action, at runtime. This is the permission intersection model, and it's the only approach that prevents both privilege escalation and blast radius expansion simultaneously.&lt;/p&gt;

&lt;p&gt;This evaluation must happen at the runtime layer. Not at login time, not in the prompt, and not in the application code. Without it, scaling agents beyond single-user demos is unsafe.&lt;/p&gt;
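&lt;p&gt;The intersection check itself is small. A minimal sketch, not any particular runtime's implementation; the scope names below are invented for illustration:&lt;/p&gt;

```python
# Illustrative permission-intersection model. Scope names are invented;
# a real runtime would resolve these from IdP roles and native SaaS ACLs.

def effective_scope(agent_scopes: set, user_scopes: set) -> set:
    """An action is permitted only if BOTH the agent's policy and the
    user's native permissions grant it."""
    return agent_scopes & user_scopes

def authorize(action: str, agent_scopes: set, user_scopes: set) -> bool:
    # Evaluated per action, at call time, not once at login.
    return action in effective_scope(agent_scopes, user_scopes)

# A recruiting-scoped agent used by a payroll administrator:
agent_policy = {"workday.read_candidates", "workday.schedule_interview"}
user_perms = {"workday.read_candidates", "workday.schedule_interview",
              "workday.read_payroll"}

assert authorize("workday.schedule_interview", agent_policy, user_perms)
assert not authorize("workday.read_payroll", agent_policy, user_perms)
```

&lt;p&gt;The second assertion is the guard that matters: the user holds payroll access, but the agent's policy does not, so the intersection denies the action.&lt;/p&gt;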

&lt;h2&gt;
  
  
  The architectural shift: The agent is already the proxy
&lt;/h2&gt;

&lt;p&gt;Before evaluating specific integration approaches, you need to understand why the traditional enterprise architecture no longer applies.&lt;/p&gt;

&lt;p&gt;In the pre-agentic model, a proxy (API gateway) sits between applications and APIs, routing, authenticating, and rate limiting. The proxy is the control point because all traffic flows through it.&lt;/p&gt;

&lt;p&gt;Agents invert this topology. The agent mediates between the user and the infrastructure. It already handles routing, orchestration, and decision-making. Adding a traditional proxy in front of the tools the agent calls doesn't add a control point. It adds a redundant hop that can't see into the execution context that matters: which user, which action, which permission, right now.&lt;/p&gt;

&lt;p&gt;The control point in an agentic architecture is the execution layer where the tool runs, where credentials are resolved, permissions are checked, and actions are taken on behalf of a specific human. That's the runtime.&lt;/p&gt;

&lt;p&gt;The gateway era was defined by the proxy as the control point. The agentic era is defined by the runtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  Four architectures for connecting AI agents to enterprise tools
&lt;/h2&gt;

&lt;p&gt;As organizations move from isolated pilots to production deployments, engineering teams adopt one of four integration models. Understanding where each approach breaks down under enterprise load is critical for architectural planning.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Integration approach&lt;/th&gt;
&lt;th&gt;Security &amp;amp; identity&lt;/th&gt;
&lt;th&gt;Maintenance burden&lt;/th&gt;
&lt;th&gt;Reliability &amp;amp; execution&lt;/th&gt;
&lt;th&gt;Speed-to-market&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Custom connectors &amp;amp; DIY auth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Highly variable; often falls back to static keys.&lt;/td&gt;
&lt;td&gt;Extremely high; requires dedicated auth teams.&lt;/td&gt;
&lt;td&gt;Low; prone to parameter hallucination loops.&lt;/td&gt;
&lt;td&gt;Very slow.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Legacy iPaaS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Moderate; struggles with On-Behalf-Of execution.&lt;/td&gt;
&lt;td&gt;Medium; relies on maintaining visual workflows.&lt;/td&gt;
&lt;td&gt;Medium; optimized for linear triggers, not loops.&lt;/td&gt;
&lt;td&gt;Moderate.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Unmanaged MCP servers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low; lacks centralized multi-user authorization.&lt;/td&gt;
&lt;td&gt;High; requires manual deployment and patching.&lt;/td&gt;
&lt;td&gt;Low; lacks native retries and failover state.&lt;/td&gt;
&lt;td&gt;Fast for prototypes.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MCP runtime (e.g., Arcade)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High; native permission mapping and token vaults.&lt;/td&gt;
&lt;td&gt;Low; runtime handles lifecycle and upgrades.&lt;/td&gt;
&lt;td&gt;High; parallel execution and automatic retries.&lt;/td&gt;
&lt;td&gt;Very fast.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Approach 1: Build custom connectors and OAuth (DIY authentication)
&lt;/h3&gt;

&lt;p&gt;Build one-off API wrappers and custom OAuth layers for every enterprise tool your agent needs.&lt;/p&gt;

&lt;p&gt;The upside is total control. You dictate every aspect of the integration and avoid third-party vendor lock-in.&lt;/p&gt;

&lt;p&gt;But the limitations get crippling fast. Custom connectors become a massive engineering drain. Teams spend months building secure token vaults, handling refresh token rotation, and writing edge-case logic. Those are months that could have been spent shipping agent features that actually move the business forward.&lt;/p&gt;

&lt;p&gt;Raw enterprise APIs compound the problem. They expect highly structured, deterministic inputs, but agents generate dynamic natural language. Wiring them directly to raw endpoints leads to parameter hallucination and endless retry loops. Authentication alone becomes a standalone infrastructure project: token rotation, user matching, session validation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Approach 2: Use legacy iPaaS for agent tool calls
&lt;/h3&gt;

&lt;p&gt;Enterprises retrofit existing integration platforms like Workato, MuleSoft, or Zapier to trigger actions based on LLM outputs.&lt;/p&gt;

&lt;p&gt;The strength is familiarity. Enterprise IT teams already know these tools, and they come with massive pre-built endpoint catalogs.&lt;/p&gt;

&lt;p&gt;But the limitations are architectural and fundamental. These platforms were built for linear, deterministic, trigger-based automation. Agentic systems operate on non-deterministic, stateful reasoning loops where the agent decides what to call, when, and how many times based on intermediate results. Forcing that into a linear webhook pattern breaks down fast.&lt;/p&gt;

&lt;p&gt;The deeper problem is identity. Legacy iPaaS platforms center on system-to-system service accounts. They lack true &lt;a href="https://learn.microsoft.com/en-us/azure/active-directory/develop/v2-oauth2-on-behalf-of-flow" rel="noopener noreferrer"&gt;user-scoped, On-Behalf-Of (OBO) execution&lt;/a&gt;, which forces teams to build complex, fragile workarounds to ensure the agent only acts with the specific permissions of the user typing the prompt. Per-user authorization evaluated at runtime across every tool call requires infrastructure these platforms were never designed to deliver.&lt;/p&gt;

&lt;h3&gt;
  
  
  Approach 3: Run unmanaged MCP servers
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://modelcontextprotocol.io/specification/latest" rel="noopener noreferrer"&gt;Model Context Protocol standardized how AI models connect to data sources and tools&lt;/a&gt;. In this approach, teams deploy open-source MCP servers to expose local or SaaS capabilities directly to their agents.&lt;/p&gt;

&lt;p&gt;MCP's strength is standardization. It decouples the agent framework from the underlying tool implementation, creating a universal language for tool calling. The problem is that the quality of unmanaged, open-source MCP servers varies widely. According to &lt;a href="https://toolbench.arcade.dev/" rel="noopener noreferrer"&gt;benchmarks&lt;/a&gt;, many struggle with reliability and correctness, which compounds the challenges of production deployments.&lt;/p&gt;

&lt;p&gt;These servers break down the moment you take them to production. Raw, unmanaged MCP servers lack centralized governance. They don't ship with multi-user enterprise authentication handling, meaning every user often shares the same connection identity.&lt;/p&gt;

&lt;p&gt;They also lack production reliability features like automatic retries, parallel execution, and stateful failover out of the box. That burden falls back on the application developer.&lt;/p&gt;
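&lt;p&gt;"Automatic retries" is concrete engineering work when it falls on the application developer. A minimal sketch of the pattern, exponential backoff with jitter around a class of transient errors; the helper and the flaky API below are invented:&lt;/p&gt;

```python
import random
import time

def call_with_retries(fn, max_attempts=4, base_delay=0.5,
                      retryable=(TimeoutError, ConnectionError)):
    """Retry a tool call on transient errors with exponential backoff.
    Illustrative only: a production runtime also persists state so a
    multi-step task can fail over without restarting from scratch."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts:
                raise
            # Backoff doubles each attempt; jitter avoids thundering herds.
            time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1))

# Simulated API that fails twice with a transient error, then succeeds:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

print(call_with_retries(flaky, base_delay=0.01))  # prints "ok" on attempt 3
```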

&lt;h3&gt;
  
  
  Approach 4: Use an MCP runtime (the secure action layer)
&lt;/h3&gt;

&lt;p&gt;An &lt;a href="https://docs.arcade.dev/en/home" rel="noopener noreferrer"&gt;MCP runtime&lt;/a&gt; is the infrastructure layer purpose-built to solve this problem. &lt;a href="https://www.arcade.dev/" rel="noopener noreferrer"&gt;Arcade.dev&lt;/a&gt;, the industry's first MCP runtime, combines &lt;a href="https://www.arcade.dev/tools" rel="noopener noreferrer"&gt;agent-optimized tools&lt;/a&gt;, centralized authentication and authorization, and enterprise governance into a single control plane.&lt;/p&gt;

&lt;p&gt;This approach targets production AI specifically. The runtime speaks MCP natively (JSON-RPC, Streamable HTTP) with no protocol translation and no context loss. It preserves native permissions through On-Behalf-Of token flows, isolates credentials from the language model, and provides instant, OpenTelemetry-compatible audit logs for every action.&lt;/p&gt;

&lt;p&gt;Teams ship faster because the runtime handles authorization, token lifecycle, retries, and governance. Engineers focus entirely on agent logic and business outcomes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.arcade.dev/blog/mcp-runtime-gateway" rel="noopener noreferrer"&gt;Arcade's MCP Gateway&lt;/a&gt; lets any MCP client access the full tool catalog through a single endpoint. Teams can also bring their own MCP servers into the runtime to get authorization, retries, and audit logs without rewriting what already works. The runtime extends your existing MCP investment rather than replacing it.&lt;/p&gt;

&lt;p&gt;For single-user hobbyist projects or local scripts, a full runtime adds unnecessary overhead. But for platform engineering teams deploying autonomous systems to thousands of corporate users, an MCP runtime is the only viable path to production.&lt;/p&gt;

&lt;h3&gt;
  
  
  What production demands: authorization, tooling, and governance
&lt;/h3&gt;

&lt;p&gt;The comparison above shows where each approach breaks. But understanding why the MCP runtime wins requires going deeper into the three capabilities that separate production deployments from demos: just-in-time authorization that enforces user-scoped access, agent-optimized tools that eliminate hallucination loops, and governance infrastructure that gives platform teams full visibility over every action.&lt;/p&gt;

&lt;h4&gt;
  
  
  How just-in-time authorization enforces user-scoped access
&lt;/h4&gt;

&lt;p&gt;Custom connectors fall back to static keys. Legacy iPaaS platforms rely on shared service accounts. Unmanaged MCP servers lack multi-user auth entirely. All three fail at the same point: they can't evaluate who is allowed to do what at the moment the tool is called.&lt;/p&gt;

&lt;p&gt;That’s the problem &lt;a href="https://www.arcade.dev/blog/sso-for-ai-agents-authentication-and-authorization-guide/" rel="noopener noreferrer"&gt;just-in-time authorization&lt;/a&gt; solves.&lt;/p&gt;

&lt;p&gt;The agent requests and validates credentials only at the moment an action requires them, not upfront. If a user never invokes the Salesforce integration, no Salesforce tokens are ever obtained or stored.&lt;/p&gt;

&lt;p&gt;The entire authentication flow (OAuth exchanges, token refresh, credential storage) executes in deterministic backend logic that the LLM can never alter, observe, or leak. For additional governance, teams can attach pre-tool-call and post-tool-call hooks to enforce custom policies like human-in-the-loop approvals for certain actions, usage limits, or contextual access rules.&lt;/p&gt;
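&lt;p&gt;A pre-tool-call hook is easiest to picture as a wrapper that can veto execution before anything touches an API. The sketch below is hypothetical, not Arcade's hook API; the tool names and the approval stub are invented:&lt;/p&gt;

```python
# Illustrative pre-tool-call hook enforcing human approval for
# destructive actions. Hypothetical names; a real runtime's hook
# API will differ.

DESTRUCTIVE = {"crm.delete_record", "repo.force_push", "mail.send_external"}

def request_human_approval(tool: str, args: dict) -> bool:
    # Stand-in for a Slack/email approval round-trip.
    print(f"Approval requested for {tool} with {args}")
    return False  # default-deny until a human explicitly signs off

def pre_tool_call(tool: str, args: dict) -> None:
    """Runs before every tool call; raises to block execution."""
    if tool in DESTRUCTIVE and not request_human_approval(tool, args):
        raise PermissionError(f"{tool} blocked pending human approval")

def run_tool(tool: str, args: dict) -> str:
    pre_tool_call(tool, args)
    return f"executed {tool}"

print(run_tool("crm.read_record", {"id": 42}))  # non-destructive: runs
try:
    run_tool("crm.delete_record", {"id": 42})   # destructive: blocked
except PermissionError as e:
    print(e)
```

&lt;p&gt;Default-deny is the important design choice: an unanswered approval request blocks the action rather than letting it proceed.&lt;/p&gt;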

&lt;p&gt;This works because the runtime is stateful. It maintains per-session, per-user context across an agent's entire reasoning loop. A stateless proxy evaluates each request in isolation and can't know that a request is step 3 of a 6-step workflow, acting on behalf of Alice, who authorized this specific scope 4 minutes ago. The runtime can, and that session context is what makes per-user, per-tool authorization enforceable.&lt;/p&gt;

&lt;p&gt;This is where the permission intersection model described earlier becomes operational. The architecture enforces: Agent Permissions ∩ User Permissions = Effective Action Scope. The agent can only execute an action if both the agent's role policy and the human user's native SaaS permissions explicitly allow it. Every other combination is denied.&lt;/p&gt;

&lt;p&gt;A concrete example: an enterprise AI agent is built to assist the Human Resources department. An employee using this agent has high-level administrative privileges in Workday, including access to global payroll data. But the HR agent itself is scoped strictly to recruiting tasks.&lt;/p&gt;

&lt;p&gt;Because the runtime evaluates the intersection of these permissions at call time, the agent is denied when prompted to access payroll data. The user has the authority, but the agent's restricted scope prevents the action. This stops data exfiltration and &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/confused-deputy.html" rel="noopener noreferrer"&gt;confused deputy&lt;/a&gt; attacks cold.&lt;/p&gt;

&lt;h4&gt;
  
  
  Agent-optimized tools vs API wrappers: what to use and why
&lt;/h4&gt;

&lt;p&gt;The comparison table flags a specific failure mode for custom connectors: parameter hallucination loops. This happens because raw REST endpoints require precise, deterministic parameters, and language models produce probabilistic natural language. Wiring one directly to the other without an intermediary is where agents break.&lt;/p&gt;

&lt;p&gt;Agents need intent-level tools rather than raw API wrappers. An intent-level tool absorbs the ambiguity of an agent's request and translates it into a safe, predictable transaction. The result is faster execution, fewer failed actions, and lower inference costs because the agent doesn't burn tokens on retry loops.&lt;/p&gt;

&lt;p&gt;Production execution also requires runtime reliability features that raw APIs don't provide. The runtime provides developer-defined context for intelligent retries, parallelized execution for multi-step tasks, and automatic failover to handle rate limits and transient network errors gracefully. Standardized schemas within these tools prevent parameter hallucination, the most common cause of agent failure when wiring models directly to APIs.&lt;/p&gt;

&lt;p&gt;Consider how this works in practice. Instead of an agent calling a raw Salesforce update endpoint and failing because it hallucinated a required stage ID string, the agent uses a high-level, agent-optimized progress tool.&lt;/p&gt;

&lt;p&gt;The tool natively understands the user's intent to move a deal to negotiation. Its internal logic securely looks up the correct, exact ID for that specific Salesforce instance, validates the state transition, and safely executes the update. The language model doesn't need to guess the exact database schema. The action succeeds on the first call, not the fifth.&lt;/p&gt;
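&lt;p&gt;The same pattern in miniature. The stage names, IDs, and transition table below are invented; the point is that deterministic tool logic, not the model, resolves instance-specific IDs and validates the state change:&lt;/p&gt;

```python
# Illustrative intent-level tool. Stage names and the ID lookup are
# invented stand-ins for instance-specific Salesforce metadata.

# Per-instance mapping the model never needs to see or guess:
STAGE_IDS = {"prospecting": "stg_001", "negotiation": "stg_004",
             "closed_won": "stg_007"}
VALID_TRANSITIONS = {"prospecting": {"negotiation"},
                     "negotiation": {"closed_won"}}

def advance_deal(deal_id: str, current_stage: str, target_stage: str) -> dict:
    """Intent-level operation: 'move this deal to <stage>'."""
    if target_stage not in STAGE_IDS:
        raise ValueError(f"unknown stage {target_stage!r}")  # schema validation
    if target_stage not in VALID_TRANSITIONS.get(current_stage, set()):
        raise ValueError(f"cannot move {current_stage!r} -> {target_stage!r}")
    # The exact, correct ID is resolved here, never hallucinated upstream.
    return {"deal_id": deal_id, "stage_id": STAGE_IDS[target_stage]}

print(advance_deal("deal_42", "prospecting", "negotiation"))
# {'deal_id': 'deal_42', 'stage_id': 'stg_004'}
```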

&lt;h4&gt;
  
  
  Governance and observability for agent actions (audit logs, OTel, versioning)
&lt;/h4&gt;

&lt;p&gt;Unmanaged MCP servers scored "Low" on reliability and security in the comparison above because they lack centralized governance. Once agents execute real actions on behalf of users, platform teams need complete visibility and control over the integration ecosystem. The runtime delivers this through three mechanisms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Visibility filtering&lt;/strong&gt; ensures agents only see the specific tools the current user is permitted to invoke. If a user doesn't have permission to merge code in GitHub, the GitHub merge tool doesn't appear in the agent's context window.&lt;/p&gt;
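&lt;p&gt;Visibility filtering reduces to intersecting the tool catalog with the user's grants before the catalog ever reaches the model's context window (tool and scope names invented for illustration):&lt;/p&gt;

```python
# Illustrative visibility filtering: the agent's context only ever
# contains tools the current user may invoke. Names are invented.

TOOL_CATALOG = {
    "github.read_pr": "repo:read",
    "github.merge_pr": "repo:write",
    "slack.send_message": "chat:write",
}

def visible_tools(user_scopes: set) -> list:
    # A tool the user can't invoke never appears in the prompt at all.
    return sorted(t for t, required in TOOL_CATALOG.items()
                  if required in user_scopes)

# A user without merge rights never sees the merge tool:
print(visible_tools({"repo:read", "chat:write"}))
# ['github.read_pr', 'slack.send_message']
```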

&lt;p&gt;&lt;strong&gt;Deep audit trails&lt;/strong&gt; log every action per user, per service, and per agent session. These logs are &lt;a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/mcp/" rel="noopener noreferrer"&gt;exportable to standard SIEM tools via OpenTelemetry (OTel)&lt;/a&gt; to satisfy compliance audits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Version control&lt;/strong&gt; lets platform engineers safely upgrade tool schemas and rotate connection parameters without breaking production agents running mid-session on older versions.&lt;/p&gt;

&lt;p&gt;When an agent incorrectly closes several open opportunities in a CRM, the platform team can't spend days parsing raw application logs. With an OTel-compatible audit log generated by the action layer, the security team can instantly trace the destructive action back to the exact user prompt, the specific agent session, and the token used. This isolates the root cause in minutes, enabling teams to refine the agent's instructions or the tool's access policy immediately.&lt;/p&gt;
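&lt;p&gt;The unit of that trace is one structured event per tool call carrying exactly the attribution fields the incident response needs: user, session, tool, outcome. The field names here are illustrative, not the OTel GenAI/MCP semantic conventions:&lt;/p&gt;

```python
import json
import time
import uuid

def audit_record(user_id: str, session_id: str, tool: str,
                 args: dict, outcome: str) -> str:
    """One structured event per tool call: who, what, when, in which
    session. Field names are illustrative, not a formal schema."""
    return json.dumps({
        "event_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "user_id": user_id,        # the human the agent acted on behalf of
        "session_id": session_id,  # ties the call to one reasoning loop
        "tool": tool,
        "args": args,
        "outcome": outcome,
    })

rec = json.loads(audit_record("alice@corp.com", "sess_9",
                              "crm.close_opportunity",
                              {"opportunity_id": "opp_17"}, "success"))
print(rec["user_id"], rec["tool"])  # alice@corp.com crm.close_opportunity
```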

&lt;p&gt;Of the four approaches evaluated, only the MCP runtime delivers all three: user-scoped authorization at call time, intent-level tooling that prevents hallucination, and centralized governance with full audit trails. The remaining sections show how this architecture works in practice and how to evaluate it for your organization.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to choose an enterprise agent integration approach (security, OBO, and TCO)
&lt;/h2&gt;

&lt;p&gt;Choosing how to connect your AI agents to enterprise tools is a foundational architectural decision. It dictates the speed and security of your deployment. Platform engineers and technical leaders need to frame their buying and building criteria around security, scale, and where their engineering resources should focus.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security and compliance requirements (SOC 2, ISO 27001, auditability)
&lt;/h3&gt;

&lt;p&gt;Can the proposed solution natively map to SOC 2 and ISO 27001 requirements for strict user attribution? If an agent deletes a file in Google Workspace, the audit log must definitively prove which human authorized that action.&lt;/p&gt;

&lt;p&gt;The system must support pre-tool-call &lt;a href="https://hoop.dev/blog/how-to-keep-human-in-the-loop-ai-control-soc-2-for-ai-systems-secure-and-compliant-with-action-level-approvals" rel="noopener noreferrer"&gt;Human-in-the-Loop (HITL) approval hooks&lt;/a&gt;. Destructive actions like modifying production configurations or bulk-updating database records must pause execution and require cryptographic sign-off from a human administrator via Slack or email before proceeding.&lt;/p&gt;

&lt;h3&gt;
  
  
  Build vs buy economics (OAuth maintenance and total cost of ownership)
&lt;/h3&gt;

&lt;p&gt;Build versus buy demands a ruthless economic assessment.&lt;/p&gt;

&lt;p&gt;Calculate the actual engineering hours required to build, maintain, and securely upgrade OAuth flows for ten or more distinct enterprise APIs. Factor in the hidden costs: managing refresh token rotation, building webhook callback URLs for long-running async tasks, patching custom connectors when SaaS vendors inevitably deprecate their API versions.&lt;/p&gt;

&lt;p&gt;Then ask what those engineers could have shipped instead.&lt;/p&gt;

&lt;p&gt;Adopting an MCP runtime transforms a multi-month infrastructure project into a configuration exercise. The total cost of ownership drops dramatically, and your team reclaims months of engineering capacity to invest in the agent capabilities that differentiate your product.&lt;/p&gt;

&lt;h3&gt;
  
  
  Time-to-value and engineering focus
&lt;/h3&gt;

&lt;p&gt;Time-to-value is where most teams underestimate the cost of building in-house.&lt;/p&gt;

&lt;p&gt;Will your highly paid AI engineers spend the next three months building reliable Slack and Google Workspace connectors, or will they spend that time optimizing agent prompts, evaluating reasoning logic, and shipping the agent capabilities that drive revenue? Every week spent on integration plumbing is a week your competitors use to get their agents into production.&lt;/p&gt;

&lt;p&gt;When evaluating external vendors or internal architecture plans, force the issue with hard technical questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are API keys or OAuth tokens ever visible in the language model's prompt context window?&lt;/li&gt;
&lt;li&gt;How does the system resolve conflicting permissions between a highly privileged user and a narrowly scoped agent?&lt;/li&gt;
&lt;li&gt;Can the system emit W3C-standard trace context to our existing OpenTelemetry collectors?&lt;/li&gt;
&lt;li&gt;How does the tool handle rate limiting when an agent enters an unexpected retry loop?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the answer to credential visibility is anything other than absolute isolation, the architecture is unfit for enterprise production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reference architecture for an MCP runtime (step-by-step flow)
&lt;/h2&gt;

&lt;p&gt;With the architectural decision framed, here's how a request actually flows through the runtime end to end. The MCP runtime acts as the intermediary that brokers trust and execution between the non-deterministic reasoning engine and the deterministic enterprise environment.&lt;/p&gt;

&lt;p&gt;The flow of a secure request follows a strict sequence:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7pa5dvzbt30a978qwvfb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7pa5dvzbt30a978qwvfb.png" alt="Secure AI agent enterprise integration architecture diagram showing MCP runtime flow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;User prompt&lt;/strong&gt;: The user submits a request, e.g., "close this support ticket."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM plan&lt;/strong&gt;: The agent's language model determines the sequence of tool calls needed to fulfill the request.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP runtime&lt;/strong&gt;: The runtime receives the tool call request. It evaluates user and agent permissions and retrieves the necessary On-Behalf-Of credential.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool execution&lt;/strong&gt;: The runtime, not the agent, executes the precise API call against the target system (e.g., Zendesk).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result &amp;amp; next action:&lt;/strong&gt; The runtime receives the API result, filters it, and passes it back to the agent. The LLM then either plans the next action in the sequence or determines the task is complete.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confirmation &amp;amp; audit&lt;/strong&gt;: The agent confirms the action's completion to the user, and the runtime logs the entire transaction via OpenTelemetry for audit purposes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This architecture enforces a hard separation of concerns. The language model handles reasoning, planning, action selection, and generation. The runtime layer handles credentials, policy enforcement, rate limiting, action execution, and logging.&lt;/p&gt;

&lt;p&gt;By vaulting tokens at the runtime layer, this architecture prevents prompt-injection-driven data exfiltration. The language model never possesses the keys required to export data.&lt;/p&gt;
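&lt;p&gt;The six-step flow reduces to a loop in which the model only ever sees tool names and filtered results, never credentials. A schematic sketch; every function body is an invented stand-in:&lt;/p&gt;

```python
# Schematic agent loop matching the six-step flow. Bodies are invented
# stand-ins; the shape is what matters: the model plans, the runtime
# authorizes and executes, and credentials never cross the boundary.

def llm_plan(prompt: str, history: list):
    # Stand-in planner: one Zendesk call, then it reports the task done.
    if history:
        return None  # step 5: result seen, task complete
    return {"tool": "zendesk.close_ticket", "args": {"ticket": 101}}

def runtime_execute(user_id: str, call: dict) -> dict:
    # Steps 3-4: check agent ∩ user permissions, resolve the vaulted
    # OBO credential, make the API call, log it. The LLM sees only
    # this filtered result, never the token.
    return {"tool": call["tool"], "status": "closed", "acted_as": user_id}

def agent_turn(user_id: str, prompt: str) -> list:
    history = []
    while (call := llm_plan(prompt, history)) is not None:
        history.append(runtime_execute(user_id, call))
    return history

print(agent_turn("alice@corp.com", "close this support ticket"))
```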

&lt;h3&gt;
  
  
  How an MCP runtime works with any LLM
&lt;/h3&gt;

&lt;p&gt;The MCP runtime works with any LLM through any orchestration framework, or none at all. No framework dependency is required. Arcade serves as the secure execution backend: your code handles reasoning, Arcade handles credentials, authorization, and tool execution.&lt;/p&gt;

&lt;p&gt;This clean separation is what accelerates time-to-production. AI engineers focus entirely on agent logic while offloading the high-risk plumbing of enterprise integrations to the runtime.&lt;/p&gt;

&lt;p&gt;A working example: an agent that reads Gmail and sends Slack messages through Arcade's runtime. Setup requires three dependencies and three environment variables:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;arcadepy openai python-dotenv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="c"&gt;# .env
&lt;/span&gt;&lt;span class="n"&gt;ARCADE_API_KEY&lt;/span&gt;=&lt;span class="n"&gt;your_arcade_api_key&lt;/span&gt;        &lt;span class="c"&gt;# Free at arcade.dev
&lt;/span&gt;&lt;span class="n"&gt;ARCADE_USER_ID&lt;/span&gt;=&lt;span class="n"&gt;your_email&lt;/span&gt;@&lt;span class="n"&gt;company&lt;/span&gt;.&lt;span class="n"&gt;com&lt;/span&gt;     &lt;span class="c"&gt;# The user the agent acts on behalf of
&lt;/span&gt;&lt;span class="n"&gt;OPENAI_KEY&lt;/span&gt;=&lt;span class="n"&gt;your_openai_key&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;arcadepy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Arcade&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;arcade_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Arcade&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;arcade_user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ARCADE_USER_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;llm_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Define enterprise productivity tools — Arcade handles auth for each
&lt;/span&gt;&lt;span class="n"&gt;tool_catalog&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Gmail.ListEmails&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Gmail.SendEmail&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Slack.SendMessage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Get tool definitions formatted for the LLM
&lt;/span&gt;&lt;span class="n"&gt;tool_definitions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
   &lt;span class="n"&gt;arcade_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;formatted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tool_catalog&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# JIT authorization + execution — credentials never touch the LLM
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;authorize_and_run_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
   &lt;span class="n"&gt;auth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;arcade_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;authorize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;arcade_user_id&lt;/span&gt;
   &lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
       &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorize &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
       &lt;span class="n"&gt;arcade_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_for_completion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

   &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;arcade_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
       &lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tool_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
       &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;arcade_user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Agentic loop — LLM reasons and selects tools, Arcade executes them
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;invoke_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;max_turns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
   &lt;span class="n"&gt;turns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
   &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;turns&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;max_turns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
       &lt;span class="n"&gt;turns&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
       &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
           &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tool_definitions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;tool_choice&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="p"&gt;)&lt;/span&gt;
       &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;
       &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
           &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exclude_none&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
           &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
               &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;authorize_and_run_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
               &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_call_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
           &lt;span class="k"&gt;continue&lt;/span&gt;
       &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
           &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
       &lt;span class="k"&gt;break&lt;/span&gt;
   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;

&lt;span class="c1"&gt;# Run the agent
&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize my latest 5 emails, then send me a DM on Slack with the summary.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;invoke_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LLM reasons through the task, selects &lt;code&gt;Gmail.ListEmails&lt;/code&gt; to fetch emails, summarizes them, then selects &lt;code&gt;Slack.SendMessage&lt;/code&gt; to deliver the summary. The runtime handles JIT authorization for each tool on behalf of that specific user. The agent never sees OAuth tokens, never manages refresh flows, and never touches credentials. &lt;a href="https://docs.arcade.dev/en/get-started/agent-frameworks/setup-arcade-with-your-llm-python" rel="noopener noreferrer"&gt;Full walkthrough in the Arcade docs.&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Next steps to productionize agent integrations (checklist)
&lt;/h2&gt;

&lt;p&gt;To transition from sandbox prototypes to production-grade deployments, platform engineering teams follow a structured, iterative implementation plan.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Inventory required tools and least-privilege scopes
&lt;/h3&gt;

&lt;p&gt;Start with a rigorous audit of the tools your agents actually need. List the specific APIs, and document the exact user-level OAuth scopes each workflow requires. Don't request global access. Apply the principle of least privilege to every single workflow.&lt;/p&gt;
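One lightweight way to keep that audit honest is to check the inventory into version control as data the deployment pipeline can read. A minimal sketch; the workflow names, tool names, and scope strings are all illustrative placeholders:

```python
# Hypothetical scope inventory: workflow names, tool names, and scope strings
# are examples only -- substitute the results of your own audit.
SCOPE_INVENTORY = {
    "email-triage": {
        "tools": ["Gmail.ListEmails"],
        "scopes": ["https://www.googleapis.com/auth/gmail.readonly"],
    },
    "notify-team": {
        "tools": ["Slack.SendMessage"],
        "scopes": ["chat:write"],
    },
}

def scopes_for(workflow: str) -> list:
    """Return the least-privilege scopes for a workflow; fail loudly on unknowns."""
    entry = SCOPE_INVENTORY.get(workflow)
    if entry is None:
        raise KeyError("no scope inventory entry for workflow: " + workflow)
    return entry["scopes"]
```

Failing loudly on an unlisted workflow matters: it forces new workflows through the audit instead of silently inheriting broad access.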

&lt;h3&gt;
  
  
  Step 2: Define autonomous vs human-approved actions (HITL)
&lt;/h3&gt;

&lt;p&gt;Next, define your operational boundaries. Build a matrix that classifies which actions are safe for autonomous execution (like reading calendar events) and which high-risk actions require explicit user delegation or human-in-the-loop approval hooks (like deleting files or sending external emails).&lt;/p&gt;
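The matrix itself can be a few lines of code that your execution layer consults before every call. A sketch with illustrative tool names; the default-deny fallback for unlisted tools is the important part:

```python
from enum import Enum

class Approval(Enum):
    AUTONOMOUS = "autonomous"          # agent may execute without a human
    HUMAN_APPROVED = "human_approved"  # requires explicit sign-off first

# Illustrative action matrix -- classify every tool your agents can call.
ACTION_POLICY = {
    "Calendar.ListEvents": Approval.AUTONOMOUS,   # read-only, low risk
    "Gmail.ListEmails": Approval.AUTONOMOUS,
    "Gmail.SendEmail": Approval.HUMAN_APPROVED,   # external side effects
    "Files.Delete": Approval.HUMAN_APPROVED,      # irreversible
}

def requires_approval(tool_name: str) -> bool:
    # Default-deny: any tool missing from the matrix requires a human.
    return ACTION_POLICY.get(tool_name, Approval.HUMAN_APPROVED) is Approval.HUMAN_APPROVED
```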

&lt;h3&gt;
  
  
  Step 3: Standardize on a single control plane
&lt;/h3&gt;

&lt;p&gt;Centralize your integration strategy immediately. Prevent the creation of "shadow registries."&lt;/p&gt;

&lt;p&gt;When disparate engineering teams build redundant, unmanaged integrations using hardcoded tokens, they create severe security vulnerabilities and integration sprawl. Standardize on a single control plane for all agent tool use.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Pilot one workflow and validate token isolation and telemetry
&lt;/h3&gt;

&lt;p&gt;Before rolling out broadly, test the architecture with a narrow, controlled use case. Pilot a single workflow, like developer issue automation linking GitHub and Jira, to validate token isolation and telemetry.&lt;/p&gt;

&lt;p&gt;Invest in infrastructure, not just isolated connectors. Evaluate platforms that treat authorization, agent-optimized tools, and lifecycle governance as a unified secure runtime, not separate problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: Use an MCP runtime to connect AI agents to enterprise tools
&lt;/h2&gt;

&lt;p&gt;The true challenge of connecting AI to enterprise productivity tools has little to do with formatting JSON payloads or making API calls. The bottleneck is securing user-scoped access, enforcing least-privilege permissions at runtime, and maintaining rigorous operational governance over non-deterministic systems.&lt;/p&gt;

&lt;p&gt;The most successful platform engineering teams recognize that rebuilding identity propagation, token lifecycles, and reliable integration mechanics from scratch is an expensive distraction from their core business objectives. They need an MCP runtime, not more custom connectors.&lt;/p&gt;

&lt;p&gt;Arcade is the industry's first MCP runtime. It delivers secure agent authorization, the largest catalog of agent-optimized tools, and centralized lifecycle governance in a single control plane. Arcade eliminates the undifferentiated heavy lifting of enterprise integration so your team ships faster and scales with control.&lt;/p&gt;

&lt;p&gt;If you're building agents that need to execute across enterprise tools, start with the &lt;a href="https://docs.arcade.dev/en/get-started/about-arcade" rel="noopener noreferrer"&gt;getting started guide&lt;/a&gt; or explore the &lt;a href="https://www.arcade.dev/tools" rel="noopener noreferrer"&gt;full tool catalog&lt;/a&gt; to see what's available out of the box.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ: Enterprise AI agent integrations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is the best way to connect AI agents to enterprise productivity tools?
&lt;/h3&gt;

&lt;p&gt;Use an MCP runtime, a secure action layer that performs user-scoped (OBO) execution, keeps tokens out of the LLM, and enforces runtime authorization per tool call.&lt;/p&gt;

&lt;h3&gt;
  
  
  Should AI agents use service accounts to access Slack, Google Workspace, or Microsoft 365?
&lt;/h3&gt;

&lt;p&gt;No. Service accounts bypass user permissions and expand the blast radius of prompt injection. Use on-behalf-of user execution with least-privilege scopes.&lt;/p&gt;

&lt;h3&gt;
  
  
  What does "On-Behalf-Of (OBO)" mean for agent integrations?
&lt;/h3&gt;

&lt;p&gt;OBO means the agent executes each action using credentials tied to the requesting user, so the action is limited to that user's native permissions and is attributable in audit logs.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is just-in-time authorization for AI agents?
&lt;/h3&gt;

&lt;p&gt;Just-in-time authorization is a runtime policy check that executes at the moment of each tool call, evaluating the user's identity, the agent's allowed scope, and the requested action. Credentials are requested and validated only when needed, not pre-authorized during setup.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is an MCP runtime, and how is it different from an MCP server?
&lt;/h3&gt;

&lt;p&gt;An MCP server exposes tools to an agent using the Model Context Protocol (MCP), but it's typically single-user, stateless, and ships without built-in auth, token management, or observability. An MCP runtime is the enterprise infrastructure layer that complements MCP servers to add what they lack: multi-user OBO authentication, per-call policy enforcement, token vaulting, automatic retries, and audit/telemetry. The server defines what the agent can call; the runtime makes it safe to call at scale. Arcade is the industry's first MCP runtime, purpose-built for production agent deployments.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are the minimum security requirements for production agent tool access?
&lt;/h3&gt;

&lt;p&gt;Token isolation from the LLM, user-scoped/OBO execution, least-privilege scopes, per-action authorization, audit logs with user attribution, and HITL approvals for high-risk actions.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do you audit and attribute agent actions for compliance (SOC 2 / ISO 27001)?
&lt;/h3&gt;

&lt;p&gt;Log every tool call with user identity, tool, parameters/intent, outcome, and trace context, and export via OpenTelemetry to your SIEM for investigation and reporting.&lt;/p&gt;
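As a sketch of what each audit record might contain (field names are illustrative; in production you'd export these via OpenTelemetry rather than plain logging):

```python
import json
import logging
import time
import uuid

audit_log = logging.getLogger("agent.audit")

def record_tool_call(user_id, tool_name, params, outcome, trace_id=None):
    """Emit one structured audit record per tool call.

    The fields mirror what a SIEM needs for attribution: who acted, which
    tool, with what parameters, what happened, and a trace id that lets you
    correlate every step of a multi-turn agent run.
    """
    record = {
        "timestamp": time.time(),
        "trace_id": trace_id or uuid.uuid4().hex,
        "user_id": user_id,
        "tool": tool_name,
        "params": params,   # consider redacting sensitive values here
        "outcome": outcome,
    }
    audit_log.info(json.dumps(record))
    return record
```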

&lt;h3&gt;
  
  
  When do legacy iPaaS tools (Zapier/Workato/MuleSoft) break down for agents?
&lt;/h3&gt;

&lt;p&gt;They struggle with non-deterministic agent loops and true user-scoped OBO execution, forcing teams to rely on shared credentials or brittle workarounds.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do agent-optimized tools reduce hallucinations compared to raw API wrappers?
&lt;/h3&gt;

&lt;p&gt;They use intent-level operations with validated schemas and internal lookups, so the model doesn't have to guess required IDs/parameters and can fail safely.&lt;/p&gt;
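For example, an intent-level tool can validate arguments before any API call and hand the model a structured, correctable error instead of a stack trace. A minimal sketch using a hypothetical "send message" schema:

```python
from dataclasses import dataclass

@dataclass
class SendMessageInput:
    """Validated schema for a hypothetical intent-level 'send message' tool."""
    channel_name: str   # human-readable name; the tool resolves the ID internally
    text: str

    def __post_init__(self):
        if not self.channel_name.strip():
            raise ValueError("channel_name must be non-empty")
        if not 1 <= len(self.text) <= 4000:
            raise ValueError("text must be 1-4000 characters")

def validate_tool_args(raw):
    """Return validated input, or a structured error the model can correct from."""
    try:
        return SendMessageInput(**raw)
    except (TypeError, ValueError) as exc:
        # Fail safely: an actionable error message the LLM can act on.
        return {"error": str(exc), "expected_fields": ["channel_name", "text"]}
```

Because the tool resolves IDs internally and rejects malformed input up front, the model never has to guess opaque identifiers, which is where most hallucinated parameters come from.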

&lt;h3&gt;
  
  
  When should we add human-in-the-loop (HITL) approvals?
&lt;/h3&gt;

&lt;p&gt;For destructive or irreversible actions (deletes, external emails, bulk updates, permission changes) or any action that materially impacts security, finance, or customer data.&lt;/p&gt;
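A simple way to enforce this is an approval gate in front of the execution layer. The tool list and the injected callbacks below are illustrative, not tied to any particular runtime:

```python
# Illustrative high-risk list -- derive yours from the autonomy matrix.
HIGH_RISK_TOOLS = {"Gmail.SendEmail", "Files.Delete", "Permissions.Update"}

def execute_with_hitl(tool_name, args, run_tool, request_approval):
    """Gate high-risk tool calls behind a human approval callback.

    `run_tool` and `request_approval` are injected callables, so this sketch
    stays independent of any particular runtime or approval channel
    (Slack message, web UI, pager -- whatever fits your workflow).
    """
    if tool_name in HIGH_RISK_TOOLS:
        if not request_approval(tool_name, args):
            return {"status": "rejected", "reason": "human approval denied"}
    return {"status": "executed", "result": run_tool(tool_name, args)}
```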

</description>
      <category>agents</category>
      <category>ai</category>
      <category>mcp</category>
      <category>automation</category>
    </item>
    <item>
      <title>How to build a secure WhatsApp AI assistant with Arcade and Claude Code (OpenClaw alternative)</title>
      <dc:creator>Manveer Chawla</dc:creator>
      <pubDate>Thu, 02 Apr 2026 21:43:19 +0000</pubDate>
      <link>https://forem.com/arcade/how-to-build-a-secure-whatsapp-ai-assistant-with-arcade-and-claude-code-openclaw-alternative-3f4f</link>
      <guid>https://forem.com/arcade/how-to-build-a-secure-whatsapp-ai-assistant-with-arcade-and-claude-code-openclaw-alternative-3f4f</guid>
      <description>&lt;p&gt;I texted "prep me for my 2pm" on WhatsApp. Thirty seconds later, my phone buzzed back with a structured briefing: who I was meeting, what we last discussed over email, what my team said about them in Slack, and three talking points. No browser tab. No laptop. Just a message on my commute.&lt;/p&gt;

&lt;p&gt;That's the promise of an always-on AI assistant. And until recently, it was almost impossible to build one that actually worked.&lt;/p&gt;

&lt;p&gt;Open-source frameworks like OpenClaw made headless, two-way messaging agents popular. Anthropic's &lt;a href="https://code.claude.com/docs/en/channels" rel="noopener noreferrer"&gt;Claude Code Channels&lt;/a&gt; confirmed the approach had legs. Channels is currently in research preview, but the direction is clear. Anthropic already uses this pattern for hand-offs between their desktop app, mobile app, and Claude Code. Expect this to GA in some form.&lt;/p&gt;

&lt;p&gt;But getting from a weekend demo to a reliable assistant exposes gaps that no amount of prompt engineering fixes. Authorization. Tool reliability. Session management. The agent needs access to your calendar, email, and Slack, and you need to be sure it's not a security liability.&lt;/p&gt;

&lt;p&gt;I built a working version. This guide walks through the entire thing: a WhatsApp relay server, an MCP server, Claude Code as the brain, and Arcade.dev for secure tool access. Working code at every step.&lt;/p&gt;

&lt;p&gt;We'll start with the pitfalls you need to understand, then build it.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;OpenClaw-style headless frameworks give your agent god-mode access to every connected service, rely on brittle tool wrappers, bloat the context window with raw API responses, and produce zero audit trail. Buying a dedicated Mac Mini to run them doesn't help. The machine isn't the threat model; the credentials are.&lt;/li&gt;
&lt;li&gt;This guide builds a WhatsApp AI assistant using a relay server that handles Meta's webhooks, an MCP server that bridges to Claude Code, Arcade for secure tool access and audit logging, and a meeting-prep skill that pulls from Google Calendar, Gmail, and Slack to deliver structured briefings directly in WhatsApp.&lt;/li&gt;
&lt;li&gt;Every layer includes working code you can run locally: webhook ingestion with HMAC signature validation, a cursor-based message queue, MCP tool definitions, Claude Code configuration, and a complete skill file that encodes a three-phase meeting-prep workflow.&lt;/li&gt;
&lt;/ul&gt;
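As a taste of the signature-validation step, Meta signs each webhook POST body with your app secret and puts the digest in the X-Hub-Signature-256 header; a minimal checker looks like this:

```python
import hashlib
import hmac

def verify_meta_signature(app_secret: str, raw_body: bytes, signature_header: str) -> bool:
    """Check Meta's X-Hub-Signature-256 header against the raw request body.

    Meta sends 'sha256=<hex digest>' where the digest is an HMAC-SHA256 of
    the body keyed with your app secret. The comparison must be constant-time
    and must run against the raw bytes, before any JSON parsing.
    """
    if not signature_header.startswith("sha256="):
        return False
    expected = hmac.new(app_secret.encode(), raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header.removeprefix("sha256="))
```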

&lt;h2&gt;
  
  
  From demo to production: The four pitfalls of always-on AI agents
&lt;/h2&gt;

&lt;p&gt;The headless setup that OpenClaw popularized is the starting line. The moment you try to move from a weekend proof of concept to something you'd actually trust with your calendar and email, four architectural problems surface.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pitfall 1: God-mode credentials and the agent security risk
&lt;/h3&gt;

&lt;p&gt;Headless agent frameworks inherit the host machine's full access profile. The agent gets the same permissions as the developer who launched it. Every OAuth token, every API key, every connected service, wide open.&lt;/p&gt;

&lt;p&gt;A single prompt injection or compromised dependency cascades through everything. Your Google Drive, your CRM, your source code repos. One bad input and the agent becomes an insider threat.&lt;/p&gt;

&lt;p&gt;This isn't theoretical. &lt;a href="https://nvd.nist.gov/vuln/detail/CVE-2026-25253" rel="noopener noreferrer"&gt;CVE-2026-25253&lt;/a&gt; exposed a one-click RCE in OpenClaw. The gateway lacked origin validation. An attacker could exfiltrate the auth token via a malicious link and achieve total system compromise.&lt;/p&gt;

&lt;p&gt;We wrote about this pattern in detail in &lt;a href="https://blog.arcade.dev/openclaw-can-do-a-lot-but-it-shouldnt-have-access-to-your-tokens" rel="noopener noreferrer"&gt;OpenClaw doesn't need your tokens&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pitfall 2: Fragile API wrappers and the tool reliability problem
&lt;/h3&gt;

&lt;p&gt;Most agent tools are thin wrappers around REST APIs. They force the model to guess complex payload parameters and retry when natural language doesn't map to rigid schemas.&lt;/p&gt;

&lt;p&gt;Then shadow registries appear. Different teams build duplicate, unversioned wrappers for the same APIs. One unannounced API change breaks multiple agents in ways nobody predicted. Public tool registries have already become a supply-chain attack vector, with malicious tools that exfiltrate local state or establish backdoors.&lt;/p&gt;

&lt;p&gt;For patterns that make MCP tools more resilient, see &lt;a href="https://blog.arcade.dev/mcp-tool-patterns" rel="noopener noreferrer"&gt;54 Patterns for Building Better MCP Tools&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pitfall 3: Context window bloat from raw API responses
&lt;/h3&gt;

&lt;p&gt;Unoptimized tools dump the full API response into the context window. A Jira ticket history? Tens of thousands of tokens of irrelevant metadata. The agent's reasoning turns erratic. Costs spike with every conversation turn.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pitfall 4: No audit trail, no reliability, no compliance
&lt;/h3&gt;

&lt;p&gt;Keeping a self-hosted agent alive with &lt;code&gt;tmux&lt;/code&gt; or &lt;code&gt;systemd&lt;/code&gt; creates an audit black hole. When the process crashes or misbehaves, there's no structured log to trace what happened. Which action was taken? What parameters? Which user started the request?&lt;/p&gt;

&lt;p&gt;You can't answer "what did the agent do?" if you never logged it.&lt;/p&gt;

&lt;p&gt;That's an immediate fail for SOC 2, ISO 27001, and any serious compliance review.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why buying a Mac Mini doesn't fix any of this
&lt;/h2&gt;

&lt;p&gt;There's a growing trend: developers buying dedicated Mac Minis or spinning up VMs to run OpenClaw-style agents 24/7. The reasoning goes: if the agent has its own machine, you've isolated it.&lt;/p&gt;

&lt;p&gt;You haven't. The machine isn't the threat model. The credentials are.&lt;/p&gt;

&lt;p&gt;That Mac Mini still needs OAuth tokens for Google Calendar, API keys for your CRM, access to your Slack workspace. A compromised dependency doesn't care whether it's running on your laptop or a dedicated server in a closet. The blast radius is identical. For a deeper comparison of isolation strategies that actually reduce blast radius, see &lt;a href="https://manveerc.substack.com/p/ai-agent-sandboxing-guide" rel="noopener noreferrer"&gt;AI Agent Sandboxing Guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Hardware isolation solves availability. It doesn't touch authorization, tool reliability, context management, or audit logging.&lt;/p&gt;

&lt;p&gt;You've built an expensive, always-on machine with unfettered access to your business systems. Every pitfall above still applies.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Arcade, Claude Code, and Skills solve these problems
&lt;/h3&gt;

&lt;p&gt;I needed three things: a secure way to connect to business tools, a battle-tested agent runtime, and a way to encode workflows without writing integration code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.arcade.dev/" rel="noopener noreferrer"&gt;Arcade&lt;/a&gt; solves the tool and auth layer. It sits between the agent and your business tools. When the agent wants to read your calendar, Arcade evaluates permissions, mints a just-in-time token scoped to that specific action, and executes the call. The LLM never sees long-lived credentials. Your Google Calendar token isn't sitting in an &lt;code&gt;.env&lt;/code&gt; file on a Mac Mini. It's managed by Arcade's runtime with per-action authorization.&lt;/p&gt;

&lt;p&gt;Arcade also solves the brittle tools problem. Instead of writing fragile REST wrappers, you use &lt;a href="https://www.arcade.dev/tools" rel="noopener noreferrer"&gt;pre-built, agent-optimized integrations&lt;/a&gt; that return summarized data, not raw JSON dumps. When Google changes their Calendar API, Arcade handles it. Your agent code stays untouched. And every tool call generates structured audit logs tied to the specific user and action.&lt;/p&gt;

&lt;p&gt;Claude Code is the agent runtime. It's more battle-tested than OpenClaw, has native MCP support, and handles tool orchestration without the brittle process management of &lt;code&gt;tmux&lt;/code&gt; and &lt;code&gt;systemd&lt;/code&gt; scripts.&lt;/p&gt;

&lt;p&gt;Skills encode the actual workflows. This is the piece most people miss. Arcade gives the agent &lt;em&gt;access&lt;/em&gt; to your tools with proper auth. Skills tell the agent &lt;em&gt;how to use them well&lt;/em&gt;. For a deeper look at the distinction, see &lt;a href="https://blog.arcade.dev/what-are-agent-skills-and-tools" rel="noopener noreferrer"&gt;Skills vs Tools for AI Agents&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A skill is a markdown file that encodes domain expertise: which tools to call, in what order, what to look for in the results, how to format the output. Without a skill, you have an agent with calendar access but no idea how to prepare a meeting brief. With a skill, you have an assistant that pulls calendar events, cross-references email threads, checks Slack for internal context, and delivers a structured briefing, all from a single WhatsApp message.&lt;/p&gt;

&lt;p&gt;Arcade gives access. Skills give expertise. Together, they turn an LLM into a useful assistant.&lt;/p&gt;

&lt;p&gt;And because skills are just markdown files, anyone on the team can write and iterate on them. No code deployment. No engineering tickets.&lt;/p&gt;
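&lt;p&gt;To make that concrete, here's a sketch of what a meeting-prep skill file might contain. The frontmatter fields follow Claude Code's skill format; the steps, tool names, and limits are placeholders for your team's own workflow:&lt;/p&gt;

```markdown
---
name: meeting-prep
description: Prepare a briefing for an upcoming meeting using calendar, email, and Slack context.
---

# Meeting prep

When the user asks to prepare for a meeting:

1. Pull the event from Google Calendar (title, time, attendees, agenda).
2. Search recent email threads involving the attendees.
3. Check Slack channels for internal discussion of the topic.
4. Reply with a short briefing: purpose, attendees, open questions, suggested talking points.

Keep the briefing under 200 words. Never include raw message dumps.
```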

&lt;p&gt;Here's what we're building: a WhatsApp relay for messaging, Claude Code as the brain, Arcade for auth-managed tool access, and skills that encode your team's workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step-by-step: building the WhatsApp AI assistant with MCP and Arcade
&lt;/h2&gt;

&lt;p&gt;Enough architecture. Here's what we're making: WhatsApp messages flow through a relay server into an MCP server, which feeds them to Claude Code. Claude Code processes messages using skills, calls business tools through Arcade, and replies back through the same chain.&lt;/p&gt;

&lt;p&gt;One wrinkle: WhatsApp's Cloud API only supports webhooks. There's no WebSocket or long-polling option. That means something has to sit on a public URL to receive Meta's callbacks. Since we're running everything locally, the relay server handles that role, and ngrok tunnels traffic from Meta's servers to it on your machine.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.arcade.dev%2F_astro%2Fwhatsapp-to-claude-code-technical-architecture-diagram.n4Enlg4V_Z1ve4Qd.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.arcade.dev%2F_astro%2Fwhatsapp-to-claude-code-technical-architecture-diagram.n4Enlg4V_Z1ve4Qd.webp" alt="A detailed technical architecture diagram illustrating the integration flow from a WhatsApp user on a smartphone to Claude Code. The horizontal sequential flow proceeds through Meta Cloud API, ngrok, Relay Server, and MCP Server before reaching Claude Code. An auxiliary 'Arcade' service box (with integrated services like Calendar, Email, Slack, and CRM) is connected to Claude Code. A dashed return line labeled 'replies' indicates a feedback path from Claude Code back to the Relay Server." width="800" height="198"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites: WhatsApp Business API, Claude Code, and Arcade
&lt;/h3&gt;

&lt;p&gt;Before starting, make sure you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Meta developer account with a WhatsApp Business App configured (&lt;a href="https://developers.facebook.com/docs/whatsapp/cloud-api/get-started" rel="noopener noreferrer"&gt;Meta's getting started guide&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Node.js 20+ and npm&lt;/li&gt;
&lt;li&gt;ngrok for tunneling webhooks to your local machine&lt;/li&gt;
&lt;li&gt;Claude Code installed and configured&lt;/li&gt;
&lt;li&gt;An &lt;a href="https://app.arcade.dev/register" rel="noopener noreferrer"&gt;Arcade account&lt;/a&gt; with API access&lt;/li&gt;
&lt;li&gt;A phone number registered with the WhatsApp Business API&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 1: Project structure and environment setup
&lt;/h3&gt;

&lt;p&gt;Here's the folder layout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;whatsapp-assistant/
├── whatsapp.ts          # MCP server (bridge between relay and Claude Code)
├── package.json         # MCP server dependencies
├── .mcp.json            # Claude Code MCP server registration
├── whatsapp-relay/
│   ├── relay.ts         # Relay server (faces the internet via ngrok)
│   ├── package.json     # Relay server dependencies
│   └── .env             # WhatsApp API credentials (from .env.example)
└── skills/
    └── meeting-prep/
        └── SKILL.md     # Meeting preparation skill for Claude Code
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start by setting up both projects:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create the project&lt;/span&gt;
&lt;span class="nb"&gt;mkdir &lt;/span&gt;whatsapp-assistant &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;whatsapp-assistant

&lt;span class="c"&gt;# Initialize the MCP server&lt;/span&gt;
npm init &lt;span class="nt"&gt;-y&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; @modelcontextprotocol/sdk
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-D&lt;/span&gt; typescript @types/node tsx

&lt;span class="c"&gt;# Initialize the relay server&lt;/span&gt;
&lt;span class="nb"&gt;mkdir &lt;/span&gt;whatsapp-relay &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;whatsapp-relay
npm init &lt;span class="nt"&gt;-y&lt;/span&gt;
npm &lt;span class="nb"&gt;install &lt;/span&gt;hono @hono/node-server
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-D&lt;/span&gt; typescript @types/node tsx
&lt;span class="nb"&gt;cd&lt;/span&gt; ..
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create your &lt;code&gt;.env&lt;/code&gt; file inside &lt;code&gt;whatsapp-relay/&lt;/code&gt; with the following variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="c"&gt;# Meta WhatsApp Cloud API
&lt;/span&gt;&lt;span class="py"&gt;WHATSAPP_ACCESS_TOKEN&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;        &lt;span class="c"&gt;# Bearer token from Meta App Dashboard
&lt;/span&gt;&lt;span class="s"&gt;WHATSAPP_PHONE_NUMBER_ID=     # Bot's phone number ID&lt;/span&gt;
&lt;span class="py"&gt;WHATSAPP_VERIFY_TOKEN&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;        &lt;span class="c"&gt;# Any string, used for webhook verification handshake
&lt;/span&gt;&lt;span class="s"&gt;WHATSAPP_APP_SECRET=          # App secret for validating webhook signatures&lt;/span&gt;

&lt;span class="c"&gt;# Relay auth
&lt;/span&gt;&lt;span class="py"&gt;RELAY_SECRET&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;                 &lt;span class="c"&gt;# Shared secret, local MCP server sends this in X-Relay-Secret header
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;RELAY_SECRET&lt;/code&gt; is a shared key between the relay and the MCP server. Generate something random (&lt;code&gt;openssl rand -hex 32&lt;/code&gt;). Because the relay sits on a public ngrok URL, this secret is what stops anyone else from polling your queued messages or sending replies through it.&lt;/p&gt;
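&lt;p&gt;The &lt;code&gt;.mcp.json&lt;/code&gt; file from the folder layout registers the MCP server with Claude Code at the project level. A minimal sketch, assuming you run the TypeScript server with &lt;code&gt;tsx&lt;/code&gt;:&lt;/p&gt;

```json
{
  "mcpServers": {
    "whatsapp": {
      "command": "npx",
      "args": ["tsx", "whatsapp.ts"]
    }
  }
}
```

&lt;p&gt;Claude Code picks this file up from the project root and launches the server over stdio when a session starts.&lt;/p&gt;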

&lt;h3&gt;
  
  
  Step 2: Build the WhatsApp webhook relay server
&lt;/h3&gt;

&lt;p&gt;The relay is the only component that faces the internet. It has three jobs: validate incoming WhatsApp webhooks, queue messages for the MCP server, and proxy outbound messages to Meta's API.&lt;/p&gt;
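&lt;p&gt;Before Meta delivers any webhooks, it sends a one-time GET verification request with &lt;code&gt;hub.mode&lt;/code&gt;, &lt;code&gt;hub.verify_token&lt;/code&gt;, and &lt;code&gt;hub.challenge&lt;/code&gt; query parameters; the relay must echo the challenge back when the token matches. Here's a sketch, written as a plain function so it's framework-agnostic:&lt;/p&gt;

```typescript
// Meta's webhook verification handshake: echo hub.challenge back (HTTP 200)
// only when hub.verify_token matches WHATSAPP_VERIFY_TOKEN; otherwise
// return null and respond 403.
function verifyHandshake(
  query: Record<string, string | undefined>,
  verifyToken: string,
): string | null {
  if (query["hub.mode"] !== "subscribe") return null;
  if (query["hub.verify_token"] !== verifyToken) return null;
  return query["hub.challenge"] ?? null;
}
```

&lt;p&gt;In the relay this wires into a GET route on &lt;code&gt;/webhook&lt;/code&gt;: return the challenge as plain text on success, a 403 otherwise.&lt;/p&gt;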

&lt;h4&gt;
  
  
  Webhook signature validation
&lt;/h4&gt;

&lt;p&gt;Every webhook payload from Meta includes an HMAC-SHA256 signature. The relay verifies this before processing anything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;createHmac&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;timingSafeEqual&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;node:crypto&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;APP_SECRET&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;WHATSAPP_APP_SECRET&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;verifySignature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;rawBody&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;header&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;undefined&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;header&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;header&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sha256=&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;sig&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;expected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;createHmac&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sha256&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;APP_SECRET&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;rawBody&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;hex&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="nx"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;timingSafeEqual&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sig&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This uses &lt;code&gt;timingSafeEqual&lt;/code&gt; to prevent timing attacks, a detail that matters when you're validating signatures from a third party.&lt;/p&gt;

&lt;h4&gt;
  
  
  Webhook handler: always return 200
&lt;/h4&gt;

&lt;p&gt;Meta uses at-least-once delivery. If your endpoint returns anything other than &lt;code&gt;200&lt;/code&gt;, Meta retries, potentially creating a storm of duplicate events. The relay acknowledges immediately and processes asynchronously:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/webhook&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rawBody&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nf"&gt;verifySignature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;rawBody&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;header&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;x-hub-signature-256&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Still return 200. Returning 4xx causes Meta to retry with the same bad signature.&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;webhook: invalid signature&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ok&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;rawBody&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;WaWebhookPayload&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nf"&gt;parseMessages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;webhook: parse error:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ok&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the pattern: even on a bad signature, we return &lt;code&gt;200&lt;/code&gt;. Logging the rejection is enough. Returning &lt;code&gt;4xx&lt;/code&gt; just makes Meta retry with the same bad payload.&lt;/p&gt;

&lt;h4&gt;
  
  
  In-memory message queue with polling
&lt;/h4&gt;

&lt;p&gt;The relay queues validated messages and exposes a polling endpoint for the MCP server. The MCP server passes a cursor (the last message ID it saw) to get only new messages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;InboundMessage&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;nextId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;MAX_QUEUE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;enqueue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Omit&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;InboundMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;timestamp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;nextId&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;MAX_QUEUE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;queue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;MAX_QUEUE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Polling endpoint, protected by relay secret&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/poll&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;since&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Number&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;since&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;since&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cursor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;since&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;cursor&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The relay authenticates all local-facing routes with the shared secret, sent in the &lt;code&gt;x-relay-secret&lt;/code&gt; header. The WhatsApp-facing webhook routes don't use it; they're validated by Meta's HMAC signature instead.&lt;/p&gt;
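&lt;p&gt;The header check itself is worth doing in constant time, for the same reason as the webhook signature. A sketch of the comparison:&lt;/p&gt;

```typescript
import { timingSafeEqual } from "node:crypto";

// Constant-time check of the X-Relay-Secret header against the shared secret,
// mirroring the timing-safe comparison used for webhook signatures.
function checkRelaySecret(header: string | undefined, secret: string): boolean {
  if (!header) return false;
  const a = Buffer.from(header);
  const b = Buffer.from(secret);
  if (a.length !== b.length) return false;
  return timingSafeEqual(a, b);
}
```

&lt;p&gt;Run this as a small middleware on every local-facing route before the handler executes.&lt;/p&gt;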

&lt;h4&gt;
  
  
  Outbound message proxy
&lt;/h4&gt;

&lt;p&gt;When Claude Code wants to reply, it goes through the MCP server, which calls the relay, which calls Meta's API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;WA_API&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`https://graph.facebook.com/v21.0/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;PHONE_NUMBER_ID&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;waApi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Record&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;unknown&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;unknown&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;WA_API&lt;/span&gt;&lt;span class="p"&gt;}${&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;Authorization&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Bearer &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;ACCESS_TOKEN&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The relay is built with &lt;a href="https://hono.dev/" rel="noopener noreferrer"&gt;Hono&lt;/a&gt;, a lightweight framework that keeps the code minimal. The full relay is roughly 200 lines and handles text messages, images, documents, audio, video, stickers, reactions, and location shares.&lt;/p&gt;
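&lt;p&gt;For reference, the body that &lt;code&gt;waApi&lt;/code&gt; posts to &lt;code&gt;/messages&lt;/code&gt; for a free-form text reply follows Meta's Cloud API shape. A small builder (the field names are Meta's; the function itself is just a convenience):&lt;/p&gt;

```typescript
// Builds the Cloud API request body for a free-form text message.
function textMessage(to: string, body: string) {
  return {
    messaging_product: "whatsapp",
    recipient_type: "individual",
    to, // recipient phone number in international format, no "+"
    type: "text",
    text: { preview_url: false, body },
  };
}
```

&lt;p&gt;The reply handler then becomes a one-liner: build the payload with the sender's &lt;code&gt;chat_id&lt;/code&gt; and pass it to &lt;code&gt;waApi("/messages", ...)&lt;/code&gt;.&lt;/p&gt;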

&lt;h3&gt;
  
  
  Step 3: Build the MCP server for Claude Code
&lt;/h3&gt;

&lt;p&gt;The MCP server is the bridge between the relay and Claude Code. It polls the relay for incoming WhatsApp messages and exposes tools that Claude Code can call to respond.&lt;/p&gt;
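&lt;p&gt;The client side of the relay's &lt;code&gt;/poll&lt;/code&gt; contract is a few lines: send the last cursor, receive only newer messages plus the next cursor. A sketch (the relay URL is an assumption for wherever it listens locally, and &lt;code&gt;fetchImpl&lt;/code&gt; is injectable so the loop can be exercised without a live relay):&lt;/p&gt;

```typescript
type PollResult = { messages: unknown[]; cursor: number };

// One poll cycle against the relay's /poll endpoint. On a non-200 response,
// keep the old cursor so no messages are skipped on the next attempt.
async function pollOnce(
  relayUrl: string,
  secret: string,
  cursor: number,
  fetchImpl: typeof fetch = fetch,
): Promise<PollResult> {
  const res = await fetchImpl(`${relayUrl}/poll?since=${cursor}`, {
    headers: { "x-relay-secret": secret },
  });
  if (!res.ok) return { messages: [], cursor };
  return (await res.json()) as PollResult;
}
```

&lt;p&gt;Loop it with a short delay, hand each message to Claude Code, and feed the returned cursor back in as &lt;code&gt;since&lt;/code&gt; on the next call.&lt;/p&gt;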

&lt;h4&gt;
  
  
  Tool definitions
&lt;/h4&gt;

&lt;p&gt;The server registers four tools with Claude Code via the Model Context Protocol:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;mcp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Server&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;whatsapp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1.0.0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;capabilities&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{},&lt;/span&gt; &lt;span class="na"&gt;experimental&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;claude/channel&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;instructions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;The sender reads WhatsApp, not this session.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Anything you want them to see must go through the reply tool.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Messages arrive as &amp;lt;channel source="whatsapp" chat_id="..." wamid="..." user="..." ts="..."&amp;gt;.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Reply with the reply tool. Pass chat_id (phone number) back.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;WhatsApp has a 24-hour session window: you can only send free-form messages&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;within 24 hours of the user's last message.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;instructions&lt;/code&gt; field tells Claude Code how to interpret incoming messages and that it must use the &lt;code&gt;reply&lt;/code&gt; tool to send anything back. Without this, the model might try to respond in its own transcript, which the WhatsApp user would never see.&lt;/p&gt;

&lt;p&gt;The four tools are &lt;code&gt;reply&lt;/code&gt; (send text), &lt;code&gt;react&lt;/code&gt; (emoji reactions), &lt;code&gt;mark_read&lt;/code&gt; (read receipts), and &lt;code&gt;send_media&lt;/code&gt; (images, documents, audio, video). Here's the reply tool definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;reply&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Reply on WhatsApp. Pass chat_id (phone number) from the inbound message.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;inputSchema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nl"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nl"&gt;chat_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Phone number to send to&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Message text&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="nx"&gt;reply_to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;string&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;wamid to quote-reply to (optional)&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="nx"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;chat_id&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
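&lt;p&gt;On the relay side, this schema maps onto Meta's Cloud API send call. As a rough sketch (the &lt;code&gt;buildReplyPayload&lt;/code&gt; helper is illustrative, not from the repo; the field names follow the Cloud API's text-message format, where an optional &lt;code&gt;context.message_id&lt;/code&gt; turns the send into a quote-reply):&lt;/p&gt;

```typescript
interface ReplyArgs {
  chat_id: string;
  text: string;
  reply_to?: string; // wamid of the message to quote (optional)
}

// Sketch: build the WhatsApp Cloud API request body for a text reply.
// Loosely typed on purpose; the relay just forwards this as JSON.
function buildReplyPayload(args: ReplyArgs): any {
  const base: any = {
    messaging_product: "whatsapp",
    to: args.chat_id,
    type: "text",
    text: { body: args.text },
  };
  if (args.reply_to) {
    // Quote-reply: anchors this message to an earlier wamid.
    base.context = { message_id: args.reply_to };
  }
  return base;
}
```

&lt;p&gt;The relay POSTs this body to the Graph API's messages endpoint for the business phone number; the MCP tool itself never touches the access token.&lt;/p&gt;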



&lt;h4&gt;
  
  
  Polling loop with cursor persistence
&lt;/h4&gt;

&lt;p&gt;The MCP server polls the relay every 2 seconds and forwards new messages to Claude Code as channel notifications:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;CURSOR_FILE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;HOME&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/tmp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.whatsapp-relay-cursor&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;cursor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;loadCursor&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;poll&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;relay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`/poll?since=&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;newCursor&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;InboundMessage&lt;/span&gt;&lt;span class="p"&gt;[];&lt;/span&gt;
      &lt;span class="nl"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;

    &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;msg&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;meta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;chat_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;wamid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;wamid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pushName&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;};&lt;/span&gt;

      &lt;span class="nx"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;notification&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;notifications/claude/channel&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="s2"&gt;`(&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;)`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="nx"&gt;meta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;newCursor&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;cursor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;newCursor&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="nf"&gt;saveCursor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`whatsapp channel: poll error: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;\n`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cursor persists to disk (&lt;code&gt;~/.whatsapp-relay-cursor&lt;/code&gt;), so restarting the MCP server doesn't re-process old messages. Each message becomes a channel notification that Claude Code sees as a new input, including the sender's phone number, display name, timestamp, and message type as metadata.&lt;/p&gt;
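&lt;p&gt;The &lt;code&gt;loadCursor&lt;/code&gt; and &lt;code&gt;saveCursor&lt;/code&gt; helpers referenced above aren't shown in the snippet. A minimal version, assuming synchronous file I/O is acceptable for a single-process server, could look like:&lt;/p&gt;

```typescript
import { readFileSync, writeFileSync } from "node:fs";
import { join } from "node:path";

const CURSOR_FILE = join(process.env.HOME ?? "/tmp", ".whatsapp-relay-cursor");

// Returns 0 on first run or if the file is missing/corrupt, so the
// server replays from the beginning instead of crashing.
function loadCursor(): number {
  try {
    const n = Number(readFileSync(CURSOR_FILE, "utf8").trim());
    return Number.isFinite(n) ? n : 0;
  } catch {
    return 0;
  }
}

function saveCursor(cursor: number): void {
  writeFileSync(CURSOR_FILE, String(cursor), "utf8");
}
```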

&lt;h3&gt;
  
  
  Step 4: Register the MCP server with Claude Code
&lt;/h3&gt;

&lt;p&gt;Create a &lt;code&gt;.mcp.json&lt;/code&gt; file in your project root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"whatsapp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"node"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"--import"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tsx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"whatsapp.ts"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. When Claude Code starts in this directory, it discovers the MCP server, launches it as a child process via stdio, and the WhatsApp channel becomes available. Claude Code now receives WhatsApp messages as channel notifications and can call the reply, react, mark_read, and send_media tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Configure the Arcade gateway and connect it to Claude Code
&lt;/h3&gt;

&lt;p&gt;Before the assistant can access business tools, you need to create an Arcade gateway that defines which tools the agent can use and with what permissions.&lt;/p&gt;

&lt;p&gt;Log into the &lt;a href="https://app.arcade.dev/" rel="noopener noreferrer"&gt;Arcade dashboard&lt;/a&gt;, create a new gateway, and add the MCP servers for the services your assistant needs: Google Calendar, Gmail, Slack, and any others relevant to your workflows. For each server, select only the specific tools you want the agent to access. This is where you scope permissions. If the meeting-prep skill only needs to list calendar events and search email, there's no reason to expose tools that delete events or send email on your behalf.&lt;/p&gt;

&lt;p&gt;Once the gateway is created, register it with Claude Code from the command line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude mcp add &lt;span class="s1"&gt;'arcade-gateway'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--transport&lt;/span&gt; http &lt;span class="s1"&gt;'https://api.arcade.dev/mcp/&amp;lt;your-gateway-slug&amp;gt;'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &amp;lt;your-arcade-api-key&amp;gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--header&lt;/span&gt; &lt;span class="s1"&gt;'Arcade-User-ID: &amp;lt;your-email&amp;gt;'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This writes the gateway configuration to &lt;code&gt;~/.claude.json&lt;/code&gt;. Claude Code now has two MCP servers: the local WhatsApp channel server (from &lt;code&gt;.mcp.json&lt;/code&gt; in the project) and the remote Arcade gateway (from &lt;code&gt;~/.claude.json&lt;/code&gt;). The WhatsApp server handles messaging. The Arcade gateway handles business tool access with per-action authorization.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;Arcade-User-ID&lt;/code&gt; header tells Arcade which user's credentials to use when executing tool calls. In the single-user setup, this is your email. In the multi-user architecture described later, the orchestrator passes a different user ID per session.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 6: Create a meeting-prep skill with Arcade tools
&lt;/h3&gt;

&lt;p&gt;With the channel wired up, the assistant needs capabilities. This is where tools and skills work together. Arcade provides secure access to business tools (Google Calendar, Gmail, Slack), and skills tell the agent how to use those tools to accomplish a specific workflow.&lt;/p&gt;

&lt;p&gt;Skills in Claude Code are markdown files. No code, no deployment, just a structured prompt that encodes domain expertise. Here's the structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;skills/
└── meeting-prep/
    └── SKILL.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The skill file has two parts: frontmatter that tells Claude Code when to activate it, and a body that defines the workflow. Here's the meeting-prep skill:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;meeting-prep&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;Prepare&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;briefings&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;for&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;upcoming&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;customer&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;meetings&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;by&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;reading"&lt;/span&gt;
  &lt;span class="s"&gt;your Google Calendar, identifying external/customer meetings (based on&lt;/span&gt;
  &lt;span class="s"&gt;attendee email domains), then pulling relevant context from Gmail threads&lt;/span&gt;
  &lt;span class="s"&gt;and Slack conversations."&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="gh"&gt;# Meeting Prep&lt;/span&gt;

You are a meeting preparation assistant. Your job is to create concise,
actionable briefings for upcoming external meetings.

&lt;span class="gu"&gt;## Customer Directory&lt;/span&gt;
Read the centralized client registry at &lt;span class="sb"&gt;`$AGENT_DATA_DIR/clients.md`&lt;/span&gt;.
Use it to match calendar attendee domains to known customers, find the
correct Slack channel, and locate customer-specific data files.

&lt;span class="gu"&gt;## Phase 1: Discover (Find the Meeting)&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; Search Google Calendar using &lt;span class="sb"&gt;`list_events`&lt;/span&gt; for the relevant time window
&lt;span class="p"&gt;-&lt;/span&gt; Identify external meetings by checking attendee email domains
&lt;span class="p"&gt;-&lt;/span&gt; Any attendee whose domain is NOT your organization signals an external meeting

&lt;span class="gu"&gt;## Phase 2: Gather (Pull Context from Email and Slack)&lt;/span&gt;

&lt;span class="gu"&gt;### Email Context (Gmail)&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; Search for recent threads involving external attendees (last 30 days)
&lt;span class="p"&gt;2.&lt;/span&gt; Read the 3-5 most relevant threads, looking for decisions, action items, tone
&lt;span class="p"&gt;3.&lt;/span&gt; Check the calendar event itself for agenda or documents

&lt;span class="gu"&gt;### Slack Context&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; If there's a dedicated customer channel, read recent messages there
&lt;span class="p"&gt;2.&lt;/span&gt; Otherwise search by company name or contact names (last 2 weeks)
&lt;span class="p"&gt;3.&lt;/span&gt; Look for internal context not in email: concerns, feature requests, deal status

&lt;span class="gu"&gt;## Phase 3: Brief (Deliver the Prep)&lt;/span&gt;

&lt;span class="gu"&gt;### Meeting Briefing: [Title]&lt;/span&gt;
&lt;span class="gs"&gt;**When:**&lt;/span&gt; [Date &amp;amp; Time]
&lt;span class="gs"&gt;**With:**&lt;/span&gt; [Attendees + roles/company]
&lt;span class="gs"&gt;**Meeting type:**&lt;/span&gt; [Quarterly review, Demo, Follow-up, Intro call]

&lt;span class="gs"&gt;**Quick Context:**&lt;/span&gt; 2-3 sentences on where things stand
&lt;span class="gs"&gt;**Recent History:**&lt;/span&gt; Chronological recap of last interactions
&lt;span class="gs"&gt;**Key Things to Know:**&lt;/span&gt; Open items, concerns, opportunities
&lt;span class="gs"&gt;**Suggested Talking Points:**&lt;/span&gt; 3-5 practical conversation starters
&lt;span class="gs"&gt;**People Notes:**&lt;/span&gt; Brief note on new stakeholders or unfamiliar attendees
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The skill tells the agent exactly which Arcade-powered tools to use (&lt;code&gt;list_events&lt;/code&gt;, &lt;code&gt;search_messages&lt;/code&gt;, &lt;code&gt;read_thread&lt;/code&gt;), in what order, what signals to look for in the results, and how to format the output. The customer directory lookup means the agent doesn't waste tokens fuzzy-matching company names. It goes straight to the right email domain and Slack channel.&lt;/p&gt;
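&lt;p&gt;The domain matching the skill relies on is a few lines of code. Here's an illustrative sketch (the &lt;code&gt;Registry&lt;/code&gt; shape is invented for the example; the real source of truth is whatever your &lt;code&gt;clients.md&lt;/code&gt; contains):&lt;/p&gt;

```typescript
interface CustomerEntry {
  name: string;
  slackChannel: string;
}

// Hypothetical registry shape: email domain to customer record.
interface Registry {
  [domain: string]: CustomerEntry;
}

// Any attendee whose domain is not the org's signals an external
// meeting; known domains resolve straight to a customer record.
function findExternalCustomers(
  attendees: string[],
  orgDomain: string,
  registry: Registry,
): CustomerEntry[] {
  const seen = new Set();
  const hits: CustomerEntry[] = [];
  for (const email of attendees) {
    const domain = email.split("@")[1];
    // Internal attendees don't make a meeting external.
    if (!domain || domain === orgDomain) continue;
    if (seen.has(domain)) continue;
    seen.add(domain);
    const entry = registry[domain];
    if (entry) hits.push(entry);
  }
  return hits;
}
```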

&lt;p&gt;When a user texts "prep me for my 2pm" on WhatsApp, Claude Code receives the message via the channel, activates this skill, runs the three-phase workflow through Arcade's tools, and sends the briefing back via the WhatsApp reply tool. The whole flow, from WhatsApp message to structured briefing, happens without the user leaving the chat.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 7: Run and test the WhatsApp assistant locally
&lt;/h3&gt;

&lt;p&gt;Start everything in order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Terminal 1: Start the relay server&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;whatsapp-relay
node &lt;span class="nt"&gt;--import&lt;/span&gt; tsx relay.ts
&lt;span class="c"&gt;# → "whatsapp relay listening on :3000"&lt;/span&gt;

&lt;span class="c"&gt;# Terminal 2: Expose the relay via ngrok&lt;/span&gt;
ngrok http 3000
&lt;span class="c"&gt;# → Copy the https:// forwarding URL&lt;/span&gt;

&lt;span class="c"&gt;# Terminal 3: Start Claude Code from the project root&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;whatsapp-assistant
claude &lt;span class="nt"&gt;--dangerously-load-development-channels&lt;/span&gt; server:whatsapp
&lt;span class="c"&gt;# Claude Code discovers .mcp.json and launches the MCP server&lt;/span&gt;
&lt;span class="c"&gt;# → "whatsapp channel: connected, polling http://localhost:3000 every 2000ms"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Register your webhook with Meta by going to your app in the &lt;a href="https://developers.facebook.com/" rel="noopener noreferrer"&gt;Meta Developer Dashboard&lt;/a&gt;, then navigating to WhatsApp → Configuration → Webhook. Set the Callback URL to your ngrok URL plus &lt;code&gt;/webhook&lt;/code&gt; (e.g., &lt;code&gt;https://abc123.ngrok.io/webhook&lt;/code&gt;), set the Verify Token to the value in your &lt;code&gt;.env&lt;/code&gt; file, and subscribe to the &lt;code&gt;messages&lt;/code&gt; webhook field.&lt;/p&gt;
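&lt;p&gt;During registration, Meta verifies the callback URL with a GET handshake: it sends &lt;code&gt;hub.mode=subscribe&lt;/code&gt;, &lt;code&gt;hub.verify_token&lt;/code&gt;, and &lt;code&gt;hub.challenge&lt;/code&gt;, and expects the challenge echoed back with a 200 only when the token matches. A framework-agnostic sketch of that check:&lt;/p&gt;

```typescript
// Returns the challenge string to echo back (respond 200), or null
// when the handshake is invalid (respond 403).
function verifyWebhook(
  query: { [key: string]: string | undefined },
  expectedToken: string,
): string | null {
  if (query["hub.mode"] !== "subscribe") return null;
  if (query["hub.verify_token"] !== expectedToken) return null;
  return query["hub.challenge"] ?? null;
}
```

&lt;p&gt;In the Hono relay this lives on the &lt;code&gt;GET /webhook&lt;/code&gt; route, next to the POST handler that receives actual message events.&lt;/p&gt;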

&lt;p&gt;Now send a message from your phone to the WhatsApp Business number. You should see it flow through the relay, into the MCP server, and appear in Claude Code. Claude Code processes it and sends a reply back through the same chain.&lt;/p&gt;

&lt;p&gt;Try texting "prep me for my next meeting." The first time Claude Code calls an Arcade-powered tool (like reading your calendar), Arcade prints an authorization URL in the terminal. Open it in your browser and authenticate with the relevant account (Google, Slack, etc.). This is a one-time step per service. After that, Arcade manages token refresh automatically.&lt;/p&gt;

&lt;p&gt;If you have the meeting-prep skill configured and Google Calendar / Gmail connected through Arcade, you'll get back a structured briefing right in WhatsApp.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scaling from single-user to multi-user: What changes in the architecture
&lt;/h2&gt;

&lt;p&gt;Everything above runs as a single user. One Claude Code instance, one set of Arcade credentials, one identity context. Here's what breaks when a second user messages the bot, and what you need to change.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why a single Claude Code instance doesn't work for multiple users
&lt;/h3&gt;

&lt;p&gt;The single-user setup has an implicit assumption: every WhatsApp message belongs to you. When Claude Code calls an Arcade tool like &lt;code&gt;list_events&lt;/code&gt;, Arcade uses the credentials you authenticated during setup. There's no user identifier in the call.&lt;/p&gt;

&lt;p&gt;If User 2 messages the same bot, Claude Code still calls Arcade with your credentials. User 2 gets your calendar. Worse, Claude Code runs in a single conversation context. User 1's meeting briefing (deal terms, internal Slack messages, revenue numbers) is sitting in the context window when User 2's message arrives. A prompt injection from User 2 could surface User 1's data. Arcade secured the credentials correctly, but the shared context window breaks tenant isolation.&lt;/p&gt;

&lt;p&gt;You need two things: separate agent instances so context never crosses between users, and per-user credential routing so Arcade knows whose calendar to read.&lt;/p&gt;

&lt;h3&gt;
  
  
  The multi-user architecture
&lt;/h3&gt;

&lt;p&gt;The relay server, MCP tool schemas (reply, react, send_media), and skills stay identical. What changes is the orchestration layer.&lt;/p&gt;

&lt;p&gt;The single-user version uses Claude Code CLI with its built-in channels feature. For multi-user, you build a custom orchestrator using the &lt;a href="https://platform.claude.com/docs/en/agent-sdk/overview" rel="noopener noreferrer"&gt;Claude Agent SDK&lt;/a&gt;. The SDK doesn't have native channel support, but it gives you sessions, hooks, tool permissions, and MCP connections, the building blocks to replicate what channels do for a single user across many users.&lt;/p&gt;

&lt;p&gt;The relay server becomes a router. When a message arrives from +1111, the orchestrator looks up which agent session owns that phone number and routes the message there. When +2222 messages, it routes to a different session. Each session has its own context window, its own MCP server instance, and its own Arcade user context. No data crosses between them.&lt;/p&gt;
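&lt;p&gt;A sketch of that routing table (the &lt;code&gt;identityMap&lt;/code&gt; and session shape are illustrative; in a real orchestrator the session would wrap a Claude Agent SDK session rather than a plain object):&lt;/p&gt;

```typescript
// Illustrative session shape; a real orchestrator wraps an SDK session.
interface AgentSession {
  arcadeUserId: string; // passed to Arcade on every tool call
  contextId: string;    // isolated context window, never shared
}

// Phone number to corporate identity, populated by the one-time
// pairing flow.
const identityMap: { [phone: string]: string } = {};

const sessions = new Map();
let nextContext = 0;

// One session per WhatsApp sender: a new chat_id gets a fresh context
// and its own Arcade user identity; repeat senders reuse theirs.
function getSession(chatId: string): AgentSession {
  const existing = sessions.get(chatId);
  if (existing) return existing;
  const session: AgentSession = {
    arcadeUserId: identityMap[chatId] ?? chatId,
    contextId: `ctx-${nextContext}`,
  };
  nextContext = nextContext + 1;
  sessions.set(chatId, session);
  return session;
}
```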

&lt;p&gt;Credential routing works through Arcade's &lt;code&gt;user_id&lt;/code&gt; parameter on tool calls. Each user goes through the Arcade browser auth flow once (the same authorization URL step from the single-user setup). After that, when the orchestrator calls an Arcade tool on behalf of User 2, it passes User 2's identity. Arcade resolves the correct OAuth grants, mints a scoped token for that specific action, and executes the call. User 2's calendar request returns User 2's calendar. For a full walkthrough of how this authorization model works across frameworks, see &lt;a href="https://blog.arcade.dev/sso-for-ai-agents-authentication-and-authorization-guide" rel="noopener noreferrer"&gt;SSO for AI Agents: Authentication and Authorization Guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The identity pairing itself is straightforward. Map each WhatsApp sender ID to a corporate identity using a one-time verification flow: send a code via a WhatsApp Authentication Template, have the user confirm it in a web portal, and store the mapping.&lt;/p&gt;
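&lt;p&gt;That pairing state machine is small. An in-memory sketch (a real deployment would persist the mappings and expire codes; sending the Authentication Template message is out of scope here):&lt;/p&gt;

```typescript
interface PendingPairing {
  phone: string;
  corporateId: string;
  code: string;
}

const pending = new Map();
const verified = new Map(); // phone to corporate identity

// Step 1: generate a 6-digit code to send via a WhatsApp
// Authentication Template message.
function startPairing(phone: string, corporateId: string): string {
  const code = String(Math.floor(100000 + Math.random() * 900000));
  pending.set(phone, { phone, corporateId, code });
  return code;
}

// Step 2: user confirms the code in the web portal; on success the
// mapping moves from pending to verified.
function confirmPairing(phone: string, code: string): boolean {
  const p = pending.get(phone);
  if (!p) return false;
  if (p.code !== code) return false;
  verified.set(phone, p.corporateId);
  pending.delete(phone);
  return true;
}
```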

&lt;p&gt;Arcade handles the rest of the multi-user complexity: per-user OAuth token exchange and just-in-time grants for credential delegation, scoped tool execution that prevents cross-tenant data access, a versioned tool registry that doesn't break when upstream APIs change, and structured audit logs tied to the specific user and action. These are the same four pitfalls from earlier. They all get harder at multi-user scale, and Arcade handles them natively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production readiness checklist for AI agents
&lt;/h2&gt;

&lt;p&gt;Before you move beyond local use, gut-check these five things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Credential isolation&lt;/strong&gt;. Can the LLM see your auth tokens? If yes, stop. The architecture needs just-in-time, per-action authorization where the model never touches long-lived credentials. Standing service account privileges are a non-starter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool reliability&lt;/strong&gt;. Are your tools agent-optimized or naive REST wrappers? If the model has to guess complex payload parameters and brute-force retries, you'll hit failures that are invisible until production.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Versioning and rollbacks&lt;/strong&gt;. Can you update a tool without breaking the running assistant? If one upstream API change takes down your agent, you need a versioned registry with safe deprecation periods.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auditability&lt;/strong&gt;. Can you trace every action back to the specific human who requested it? If not, you'll fail SOC 2 and ISO 27001 audits. You need immutable logs with user IDs, tool names, and sanitized parameters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developer time allocation&lt;/strong&gt;. Are your engineers building OAuth plumbing and webhook retry logic, or building skills and workflows? If it's the former, the architecture is too low-level.&lt;/li&gt;
&lt;/ol&gt;
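&lt;p&gt;To make item 4 concrete, here's a minimal sketch of the kind of audit trail it describes (field names are illustrative): each entry records the user, tool, and sanitized parameters, and is hash-chained to the previous entry so after-the-fact tampering is detectable.&lt;/p&gt;

```python
# Minimal sketch of an append-only audit trail: every entry records who
# asked for what, with secrets redacted, and is hash-chained to the
# previous entry so tampering is detectable.
import hashlib
import json

SENSITIVE_KEYS = {"token", "password", "authorization"}
AUDIT_LOG = []

def sanitize(params):
    """Redact credential-like parameters before they touch the log."""
    return {k: "[redacted]" if k.lower() in SENSITIVE_KEYS else v
            for k, v in params.items()}

def record_action(user_id, tool, params):
    prev_hash = AUDIT_LOG[-1]["hash"] if AUDIT_LOG else "genesis"
    entry = {"user_id": user_id, "tool": tool,
             "params": sanitize(params), "prev": prev_hash}
    entry["hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()).hexdigest()
    AUDIT_LOG.append(entry)
    return entry

record_action("user-2", "Calendar.ListEvents",
              {"date": "2026-04-09", "token": "tok-abc123"})
```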

&lt;h2&gt;
  
  
  Next steps
&lt;/h2&gt;

&lt;p&gt;You now have a working WhatsApp assistant. A relay handling Meta's webhooks. An MCP server bridging to Claude Code. A meeting-prep skill that turns "prep me for my 2pm" into a structured briefing pulled from your calendar, email, and Slack.&lt;/p&gt;

&lt;p&gt;The interesting part is what comes next. The relay and MCP server are infrastructure you write once. The skills are where the ongoing value lives, and anyone on the team can write them. Meeting prep was the first one I built. Expense report summaries, daily standups, customer check-in reminders: same pattern, different markdown file.&lt;/p&gt;

&lt;p&gt;For multi-user deployments, the &lt;a href="https://platform.claude.com/docs/en/agent-sdk/overview" rel="noopener noreferrer"&gt;Claude Agent SDK&lt;/a&gt; gives you the building blocks to orchestrate per-user agent sessions, with the relay routing messages and &lt;a href="https://www.arcade.dev/" rel="noopener noreferrer"&gt;Arcade&lt;/a&gt; handling per-user credential delegation, tenant isolation, and audit logging. You focus on skills, not infrastructure.&lt;/p&gt;

&lt;p&gt;The code from this guide is on &lt;a href="https://github.com/manveer/whatsapp-assistant" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. Fork it and build something useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is an always-on AI executive assistant?
&lt;/h3&gt;

&lt;p&gt;An always-on assistant runs continuously and interacts through messaging channels like WhatsApp or Slack. It maintains state across conversations and takes actions in connected business tools asynchronously, without needing a browser tab open.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are the risks of using OpenClaw for an AI agent?
&lt;/h3&gt;

&lt;p&gt;Self-hosted OpenClaw deployments commonly rely on shared machine credentials, fragile scripts, and ungoverned tool wrappers. This creates a high risk of token leakage, unreliable tool calls, context bloat, and missing audit trails required for compliance.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do you prevent an agent from having god-mode access to company systems?
&lt;/h3&gt;

&lt;p&gt;Use runtime, per-action authorization with just-in-time, short-lived grants (e.g., OAuth token exchange). The agent never holds broad or long-lived credentials, and every action is evaluated against the requesting user's permissions.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Arcade and how does it secure AI agent tool access?
&lt;/h3&gt;

&lt;p&gt;Arcade is a runtime that sits between an AI agent and your business tools. Instead of giving the agent stored credentials, Arcade evaluates each tool call against the requesting user's permissions, mints a just-in-time token scoped to that action, executes the call, and logs the result. It also provides agent-optimized integrations that return summarized data instead of raw API responses. For a full overview, see &lt;a href="https://docs.arcade.dev/en/get-started/about-arcade" rel="noopener noreferrer"&gt;How Arcade works&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is it safe to give an AI agent access to my Google Calendar and email?
&lt;/h3&gt;

&lt;p&gt;Not if the agent holds long-lived OAuth tokens or API keys directly. A prompt injection or compromised dependency can exfiltrate those credentials and access everything the agent can reach. The safe approach is per-action authorization: a runtime like Arcade mints a short-lived, scoped token for each specific action and revokes it immediately after, limiting the blast radius to a single call.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does the relay server handle duplicate WhatsApp webhooks?
&lt;/h3&gt;

&lt;p&gt;WhatsApp delivers events with at-least-once semantics. The relay returns &lt;code&gt;200 OK&lt;/code&gt; immediately (even on bad signatures) to prevent retry storms, and processes messages asynchronously. For production use, add a deduplication store like Redis keyed by message ID.&lt;/p&gt;
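&lt;p&gt;The handling described above can be sketched in a few lines. An in-memory set stands in for Redis in this sketch; a production relay would use something like Redis &lt;code&gt;SET&lt;/code&gt; with &lt;code&gt;NX&lt;/code&gt; and a TTL so the dedup store doesn't grow forever.&lt;/p&gt;

```python
# Sketch of at-least-once webhook handling: acknowledge immediately,
# process asynchronously, and drop duplicates by message ID. An
# in-memory set stands in for a real dedup store like Redis.
SEEN_IDS = set()
PROCESSED = []

def handle_webhook(event):
    """Return the HTTP status right away; enqueue work once per message ID."""
    msg_id = event["id"]
    if msg_id not in SEEN_IDS:
        SEEN_IDS.add(msg_id)
        PROCESSED.append(event["text"])  # stand-in for the async queue
    return 200  # always 200, so Meta stops retrying

handle_webhook({"id": "wamid.ABC", "text": "prep me for my 2pm"})
handle_webhook({"id": "wamid.ABC", "text": "prep me for my 2pm"})  # retry delivery
```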

&lt;h3&gt;
  
  
  What is WhatsApp's 24-hour messaging window?
&lt;/h3&gt;

&lt;p&gt;Free-form replies are allowed within 24 hours of the user's last message. Proactive messages outside that window must use pre-approved WhatsApp message templates (HSM templates). For an 8 AM morning brief, you'd need an approved template.&lt;/p&gt;
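&lt;p&gt;The window check itself is straightforward. A minimal sketch, assuming timezone-aware UTC timestamps:&lt;/p&gt;

```python
# Sketch of the 24-hour rule: free-form replies only inside the customer
# service window; anything later must go out as an approved template.
from datetime import datetime, timedelta, timezone

def reply_mode(last_user_message_at, now):
    elapsed = now - last_user_message_at
    if elapsed > timedelta(hours=24):
        return "template"  # proactive send, e.g. the 8 AM morning brief
    return "free-form"

last = datetime(2026, 4, 9, 8, 0, tzinfo=timezone.utc)
reply_mode(last, last + timedelta(hours=3))   # inside the window
reply_mode(last, last + timedelta(hours=30))  # outside: needs a template
```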

&lt;h3&gt;
  
  
  Can I use this architecture with models other than Claude?
&lt;/h3&gt;

&lt;p&gt;Yes. The relay server and MCP protocol are model-agnostic. The relay handles WhatsApp I/O, and the MCP server defines tools via a standard protocol. You could swap Claude Code for any MCP-compatible runtime.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I add new skills or workflows to a Claude Code agent?
&lt;/h3&gt;

&lt;p&gt;Create a new directory under &lt;code&gt;skills/&lt;/code&gt; with a &lt;code&gt;SKILL.md&lt;/code&gt; file. The skill's frontmatter description tells Claude Code when to activate it. Skills are just structured prompts, no code deployment required.&lt;/p&gt;
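&lt;p&gt;As a hypothetical example, an expense-summary skill might look like this (the frontmatter follows the same name/description pattern as the meeting-prep skill; check your Claude Code version's documentation for the exact schema):&lt;/p&gt;

```markdown
---
name: expense-summary
description: Summarize recent expense reports when the user asks about spending
---

# Expense summary

When this skill activates, pull the user's recent expense entries,
group them by category, flag anything over the reimbursement limit,
and reply with a short totals table.
```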

</description>
      <category>ai</category>
      <category>agents</category>
      <category>openclaw</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Best Openclaw Alternatives For Secure, Fully Managed Agents (2026 Buyer's Guide)</title>
      <dc:creator>Manveer Chawla</dc:creator>
      <pubDate>Thu, 02 Apr 2026 17:55:33 +0000</pubDate>
      <link>https://forem.com/manveer_chawla_64a7283d5a/best-openclaw-alternatives-for-secure-fully-managed-agents-2026-buyers-guide-34eg</link>
      <guid>https://forem.com/manveer_chawla_64a7283d5a/best-openclaw-alternatives-for-secure-fully-managed-agents-2026-buyers-guide-34eg</guid>
      <description>&lt;p&gt;OpenClaw is the most capable open-source personal AI agent framework available right now. But deploying it in production comes with a real cost: self-hosting means you're managing VPSs, maintaining Docker container orchestration, and debugging OAuth authentication flows. Every week, indefinitely. &lt;/p&gt;

&lt;p&gt;This guide evaluates the top alternatives across two categories to help you escape that burden: fully managed OpenClaw hosting providers and general personal AI assistants.&lt;/p&gt;

&lt;p&gt;We wrote this guide for technical but time-poor users (think software developers and product managers), alongside execution-focused operators like growth hackers and agency coordinators. If you need immediate, secure results from an autonomous agent without turning AI deployment into an ongoing maintenance project, this guide is for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR: Best OpenClaw alternatives in 2026
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Quick decision framework:&lt;/strong&gt; Choose managed OpenClaw hosting to keep OpenClaw's full architecture, including model flexibility, custom code execution, and BYOK support, on production-grade infrastructure. Choose a general assistant if you're willing to trade developer-level control for a broader feature set or a different workflow paradigm. Avoid raw self-hosted OpenClaw unless you have dedicated DevOps and security resources.&lt;/p&gt;

&lt;p&gt;We evaluated each alternative on security architecture, setup speed, model flexibility, and native integrations. Here's where each one lands:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Best for secure, always-on OpenClaw agents in production:&lt;/strong&gt; &lt;a href="https://kilo.ai/kiloclaw" rel="noopener noreferrer"&gt;KiloClaw offers setup in under two minutes&lt;/a&gt;, with five-layer tenant isolation, Firecracker VM boundaries, AES-256 encrypted credential vaults, no SSH access, tool allow-lists, and pre-built tool integrations without any infrastructure management.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Best for Anthropic-ecosystem desktop automation:&lt;/strong&gt; Claude Cowork works best for users who want an autonomous desktop agent with file access, scheduled tasks, and computer use capabilities. It's powerful for local workflow automation but runs exclusively on your desktop, not on a remote cloud host, and is locked to Anthropic's model ecosystem.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Best for managed multi-model orchestration, if you don't need model control or BYOK:&lt;/strong&gt; Perplexity Computer orchestrates 19 AI models across 400+ app integrations for complex, multi-step tasks. It's powerful out of the box but doesn't offer manual model selection or BYOK, and its opinionated framework is a significant departure from OpenClaw's open architecture.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Best for no-code, multi-channel workflow automation&lt;/strong&gt;: Lindy AI serves users who want a visual builder with 5,000+ integrations, AI phone agents, and cloud-based computer use. It supports multiple models but lacks OpenClaw's raw script execution and developer-level customizability.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Avoid for most business production use:&lt;/strong&gt; Skip raw self-hosted OpenClaw on an unmanaged VPS unless you have dedicated SecOps/DevOps resources and can ensure strong sandboxing. The architecture demands excessive security patching, continuous dependency updates, and constant third-party API maintenance.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why self-hosting OpenClaw is risky and expensive
&lt;/h2&gt;

&lt;p&gt;Setting up OpenClaw isn't as simple as cloning a repository and running a single command. You've got to provision a VPS with adequate memory, install the correct runtime environments, and manage multiple Docker containers for the gateway and CLI. You need to configure reverse proxies like Nginx to handle secure WebSocket connections, manage persistent storage volumes for memory files, and monitor system resources.&lt;/p&gt;

&lt;p&gt;And when an update introduces breaking changes to node dependencies? You're the one bringing the agent back online.&lt;/p&gt;

&lt;h3&gt;
  
  
  The always-on problem
&lt;/h3&gt;

&lt;p&gt;Running an agent locally creates an always-on problem. If the agent lives on your laptop, your autonomous workflows die the moment you close the lid. Moving the agent to a cloud server solves the uptime issue, but turns you into a part-time sysadmin who monitors logs and server health.&lt;/p&gt;

&lt;h3&gt;
  
  
  Integration fragility
&lt;/h3&gt;

&lt;p&gt;Third-party integrations require maintaining fragile OAuth flows.&lt;/p&gt;

&lt;p&gt;Google Workspace &lt;a href="https://developers.google.com/identity/protocols/oauth2" rel="noopener noreferrer"&gt;limits each OAuth client to 100 refresh tokens per user account&lt;/a&gt;, automatically invalidating the oldest token without warning when the limit is reached. If your application remains in testing status, Google issues tokens that expire in just seven days.&lt;/p&gt;

&lt;p&gt;GitHub recently &lt;a href="https://github.blog/changelog/2025-09-29-strengthening-npm-security-important-changes-to-authentication-and-token-management/" rel="noopener noreferrer"&gt;reduced the default lifespan of new granular access tokens to seven days&lt;/a&gt;. That forces self-hosted users to regenerate and update credentials just to keep basic repository reads working.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt injection risk
&lt;/h3&gt;

&lt;p&gt;Because agents take autonomous action, an injection attack no longer stops at generating inaccurate text. It also executes harmful commands. An agent reading a malicious email or scanning a compromised public repository can be tricked into exfiltrating private data. &lt;/p&gt;

&lt;p&gt;Recent exploits illustrate just how real this is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The &lt;a href="https://nvd.nist.gov/vuln/detail/cve-2025-32711" rel="noopener noreferrer"&gt;EchoLeak vulnerability in Microsoft 365 Copilot&lt;/a&gt; showed that a single crafted email could trigger zero-click remote data exfiltration without any user interaction.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In another instance, prompt injection embedded in public repository code comments instructed an AI coding assistant to modify configuration files, enabling &lt;a href="https://embracethered.com/blog/posts/2025/github-copilot-remote-code-execution-via-prompt-injection/" rel="noopener noreferrer"&gt;remote code execution&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Security researchers report these attacks &lt;a href="https://www.vectra.ai/topics/prompt-injection" rel="noopener noreferrer"&gt;succeed between 50% and 84% of the time in agentic systems&lt;/a&gt;. That makes unmanaged agents a massive liability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Credential exposure
&lt;/h3&gt;

&lt;p&gt;Giving open-source frameworks direct access to production APIs, internal password vaults, or payment infrastructure without a dedicated security layer creates critical risk. Storing raw access tokens in plain text environment files on a standard server exposes your most sensitive financial and operational data to anyone who breaches the system.&lt;/p&gt;

&lt;p&gt;Hosted solutions reduce this risk with enterprise-grade managed vaults, encrypted storage at rest, and controlled payment mechanisms like KiloClaw's AgentCard, which limits financial exposure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Unrestricted agent access
&lt;/h3&gt;

&lt;p&gt;Granting SSH access to a VPS running an autonomous agent creates unacceptable risk for any serious business or IT team. SSH access exposes the underlying operating system to direct attack, allowing compromised containers to pivot and access the host kernel. This architecture circumvents proper auditing, logging, and security controls.&lt;/p&gt;

&lt;p&gt;Without strict tool allow-listing, an agent can become a powerful internal attack vector. The principle of least privilege must apply to AI. The platform must enforce strict permissions, so the agent can only access tools, channels, and functions that a human administrator has explicitly authorized.&lt;/p&gt;

&lt;h3&gt;
  
  
  When self-hosting OpenClaw still makes sense
&lt;/h3&gt;

&lt;p&gt;There are narrow scenarios where self-hosting remains the right call:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Academic researchers testing experimental local models in air-gapped environments without internet access can safely self-host. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hobbyists who enjoy tinkering with complex Docker configurations, managing Linux networking, and debugging dependency trees will find the open-source repository rewarding.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Organizations with dedicated security operations teams that require custom hardware deployments for strict compliance and data residency reasons may still choose to build their own internal infrastructure around the open-source core.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to evaluate OpenClaw alternatives for security and production readiness
&lt;/h2&gt;

&lt;p&gt;To evaluate managed alternatives, look beyond marketing claims. Assess how each platform abstracts infrastructure, enforces security, and reduces daily friction to determine if it actually replaces self-hosting. Here are the four criteria that matter most.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Security and isolation features
&lt;/h3&gt;

&lt;p&gt;The platform's security architecture separates a toy deployment from a production-grade agent.&lt;/p&gt;

&lt;p&gt;Check whether the platform enforces strict tool allow-listing by default. An agent should never have implicit access to your entire digital workspace. Restrict its reach to prevent rogue actions or accidental deletions.&lt;/p&gt;

&lt;p&gt;Check how the platform manages secrets. Storing application keys in flat text files is obsolete. Check whether the platform stores access tokens in encrypted, managed vaults and blocks direct SSH access to the server.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Setup speed
&lt;/h3&gt;

&lt;p&gt;The main reason to abandon self-hosting is to reclaim your time. So measure how long it takes to go from creating an account to running your first workflow.&lt;/p&gt;

&lt;p&gt;A premium managed alternative should eliminate provisioning entirely. Check whether complex integrations, like connecting to Google Workspace, Telegram, or GitHub, are handled via guided one-click authorization flows.&lt;/p&gt;

&lt;p&gt;If a platform still requires you to register webhooks and paste callback URLs into a configuration dashboard, it hasn't solved the friction.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Model flexibility
&lt;/h3&gt;

&lt;p&gt;The AI landscape moves fast. Locking your autonomous workflows into a single proprietary provider creates real risk. If your chosen vendor experiences an outage or degrades their model's reasoning capabilities, your entire agentic workforce halts.&lt;/p&gt;

&lt;p&gt;Check whether the platform lets you choose your preferred model or bring your own API keys from providers like OpenAI, Anthropic, or Google. Evaluate whether you can select the right model for your workload, whether that's a frontier reasoning model for complex tasks or a cost-effective open-weight model for high-volume processing.&lt;/p&gt;

&lt;p&gt;True model flexibility means you're never locked into a single vendor. You can optimize for cost, context window limits, and data privacy by selecting the best model for the job, not the only model the platform allows.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Native integrations
&lt;/h3&gt;

&lt;p&gt;An autonomous agent is only as useful as the systems it can influence.&lt;/p&gt;

&lt;p&gt;Check whether the agent connects natively to your actual work channels, like Slack, Discord, or Telegram. Beyond communication, evaluate whether the platform can execute real-world actions securely: deep file search across Google Drive and GitHub, updating CRM records, and executing controlled financial payments through isolated, platform-managed debit cards.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenClaw alternatives comparison table (2026)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Alternative&lt;/th&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;th&gt;Model flexibility&lt;/th&gt;
&lt;th&gt;Security model&lt;/th&gt;
&lt;th&gt;Pricing&lt;/th&gt;
&lt;th&gt;Migration effort&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;KiloClaw&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Managed OpenClaw&lt;/td&gt;
&lt;td&gt;Always-on secure multi-channel agents with zero infrastructure and full model control&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;5-layer tenant isolation, Firecracker VMs, encrypted vaults, no SSH, independently audited&lt;/td&gt;
&lt;td&gt;$9/mo + inference at zero markup&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;xCloud&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Managed OpenClaw&lt;/td&gt;
&lt;td&gt;Managed OpenClaw hosting with automatic updates, no native multi-platform integrations&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Managed security defaults, isolated environments, no published independent audit&lt;/td&gt;
&lt;td&gt;$24/mo + BYOK inference&lt;/td&gt;
&lt;td&gt;Low-Moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DockClaw&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Managed OpenClaw&lt;/td&gt;
&lt;td&gt;Fast single-channel hosting with multi-model support, Telegram only&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Dedicated virtual machine isolation&lt;/td&gt;
&lt;td&gt;From $19.99/mo + BYOK inference&lt;/td&gt;
&lt;td&gt;Low-Moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Perplexity Computer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;General Agent&lt;/td&gt;
&lt;td&gt;Multi-model workflow execution without infrastructure control or model choice&lt;/td&gt;
&lt;td&gt;No (automatic routing, no BYOK)&lt;/td&gt;
&lt;td&gt;Consumer web security&lt;/td&gt;
&lt;td&gt;$200/mo (Max) or $325/seat/mo (Enterprise)&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Claude Cowork&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;General Agent&lt;/td&gt;
&lt;td&gt;Local file and desktop automation that stops when your machine powers off&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Human-in-the-loop oversight&lt;/td&gt;
&lt;td&gt;From $20/mo (Pro)&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lindy AI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;General Agent&lt;/td&gt;
&lt;td&gt;Visual no-code agent building with no custom code execution&lt;/td&gt;
&lt;td&gt;Limited (multi-model, no BYOK)&lt;/td&gt;
&lt;td&gt;Enterprise compliance&lt;/td&gt;
&lt;td&gt;Free tier; paid from $19.99/mo (credit-based)&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For most teams migrating off self-hosted OpenClaw, KiloClaw delivers the strongest combination of security controls, setup speed, model flexibility, and native integrations. It's the only managed provider that pairs enterprise-grade credential vaulting with full BYOK model access and always-on headless execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fully managed OpenClaw hosting providers
&lt;/h2&gt;

&lt;p&gt;This category represents direct infrastructure replacements for users who want the exact capabilities of the open-source OpenClaw framework but refuse to manage the underlying servers, networking, and dependency updates. These platforms handle the operational burden while preserving the core autonomous architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  KiloClaw (managed OpenClaw hosting)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fljnvftliekhiobx1flpx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fljnvftliekhiobx1flpx.png" alt="KiloClaw AI assistant landing page with the headline “Your AI assistant that actually does things,” highlighting email, calendar, project monitoring, and chat-based task automation on mobile." width="800" height="564"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Who KiloClaw is best for
&lt;/h4&gt;

&lt;p&gt;Technical founders, operators, and agency coordinators who need always-on, headless messaging agents running across Slack, Telegram, and WhatsApp with zero infrastructure management, maintenance, or security headaches.&lt;/p&gt;

&lt;h4&gt;
  
  
  KiloClaw Overview
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://kilo.ai/kiloclaw" rel="noopener noreferrer"&gt;KiloClaw&lt;/a&gt; is an optimized, hosted, production-ready version of the OpenClaw framework. It takes users from zero to a running, always-on AI agent in under two minutes.&lt;/p&gt;

&lt;p&gt;Instead of presenting you with a blank terminal, KiloClaw acts as a tireless operational assistant out of the box. It handles everything from routing incoming messages and triaging complex inboxes to executing high-volume sales research across the web.&lt;/p&gt;

&lt;h4&gt;
  
  
  How KiloClaw compares to self-hosted OpenClaw
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Painless setup:&lt;/strong&gt; KiloClaw eliminates manual setup with guided authorization flows for all supported integrations. No more frustrating OAuth consent screens or managing expiring tokens.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Security-first architecture:&lt;/strong&gt; The platform runs each customer inside a dedicated Firecracker micro-VM (the same isolation technology behind AWS Lambda), not a shared container. There is no shared kernel, no shared filesystem, and no shared process namespace between tenants. KiloClaw prohibits direct SSH access, enforces tool allow-listing by default, and locks agent security controls in the platform's start script, preventing them from being overridden by the agent itself or by prompt injection through chat channels.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Independent security validation&lt;/strong&gt;: KiloClaw's architecture was validated by a 10-day independent security assessment in February 2026 using the PASTA threat modeling framework. The assessment covered 30 threats across 13 assets, ran 60+ adversarial tests including cross-tenant isolation probes, and found zero cross-tenant vulnerabilities. No other alternative in this guide has published comparable third-party validation.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Model flexibility:&lt;/strong&gt; KiloClaw uses &lt;a href="https://kilo.ai/gateway" rel="noopener noreferrer"&gt;Kilo Gateway&lt;/a&gt; by default, which provides access to more than 500 AI models through a single integration. You can also bring your own API keys from providers like Anthropic, OpenAI, and Google, giving you full control over which model powers your agent.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Native integrations:&lt;/strong&gt; KiloClaw provides natively guided authorization flows for Telegram, Slack, WhatsApp, Google Workspace, GitHub, and 1Password. These deep, two-way integrations support the headless messaging pattern central to OpenClaw's value. The agent can receive messages, take autonomous action, and respond directly within your communication channels 24/7.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Code execution and skills&lt;/strong&gt;: Like OpenClaw, KiloClaw agents can write and execute code, build reusable scripts, and extend their own capabilities over time. This self-improving loop runs on managed cloud infrastructure, so your agent grows more capable without you having to maintain the server.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  What you get with KiloClaw
&lt;/h4&gt;

&lt;p&gt;Instant readiness is the biggest advantage. You can launch an integrated, multi-channel agent during a coffee break. That used to be a frustrating weekend engineering sprint.&lt;/p&gt;

&lt;p&gt;You also get peace of mind. KiloClaw provides a secure boundary where you can safely grant the agent access to sensitive tools, including corporate password vaults and controlled financial transactions via the integrated AgentCard.&lt;/p&gt;

&lt;p&gt;And you get true always-on reliability on managed cloud infrastructure. Your agent runs 24/7 regardless of whether your laptop is open, your desktop is powered on, or you're on vacation. Unlike desktop-bound alternatives, KiloClaw's headless architecture means your messaging agents, scheduled workflows, and autonomous tasks never stop running.&lt;/p&gt;

&lt;h4&gt;
  
  
  KiloClaw limitations
&lt;/h4&gt;

&lt;p&gt;Because KiloClaw is a managed cloud service, you don't have root server access. You can't SSH into the underlying infrastructure to modify core OS-level dependencies or alter the container orchestration. It also can't support air-gapped local execution for classified, offline environments.&lt;/p&gt;

&lt;h4&gt;
  
  
  KiloClaw pricing
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://kilo.ai/kiloclaw" rel="noopener noreferrer"&gt;KiloClaw costs $9 per month for hosting&lt;/a&gt; (with a $4 first month and a 7-day free trial, no credit card required). AI inference is billed separately through Kilo Gateway at zero markup across 500+ models, with free models included. Compared to self-hosting, you replace unpredictable compute fees, bandwidth charges, and ongoing maintenance costs with a predictable flat hosting fee and transparent, at-cost model usage.&lt;/p&gt;

&lt;h4&gt;
  
  
  OpenClaw to KiloClaw migration effort
&lt;/h4&gt;

&lt;p&gt;Low. Standard OpenClaw system prompts, behavior instructions, and logic workflows map directly to the new environment. KiloClaw's guided UI flows replace the need to migrate fragile configuration files and plain text environment variables.&lt;/p&gt;

&lt;p&gt;Ready to ditch the DevOps tax? &lt;/p&gt;

&lt;p&gt;&lt;a href="https://kilo.ai/kiloclaw" rel="noopener noreferrer"&gt;Start your KiloClaw deployment today&lt;/a&gt; and have an agent running in under two minutes.  &lt;/p&gt;

&lt;h3&gt;
  
  
  xCloud (OpenClaw VPS hosting)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F338h9ms767ehzbgm519y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F338h9ms767ehzbgm519y.png" alt="xCloud OpenClaw hosting landing page promoting fully managed AI assistant hosting with live deployment in 5 minutes, multi-channel integrations, no-code setup, and monthly pricing." width="800" height="567"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Who xCloud is best for
&lt;/h4&gt;

&lt;p&gt;Non-technical to semi-technical users who want fully managed OpenClaw hosting with automatic updates and dedicated support, but don't need guided multi-platform OAuth flows, advanced credential vaulting, or independently audited security architecture.&lt;/p&gt;

&lt;h4&gt;
  
  
  xCloud Overview
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://xcloud.host/openclaw-hosting" rel="noopener noreferrer"&gt;xCloud&lt;/a&gt; is a fully managed OpenClaw hosting provider that handles server provisioning, Docker configuration, SSL setup, updates, and backups. Deployment takes approximately five minutes with no technical skills required. However, you must bring your own AI provider API key, and integrations beyond Telegram and WhatsApp require manual configuration.&lt;/p&gt;

&lt;h4&gt;
  
  
  How xCloud compares to self-hosted OpenClaw
&lt;/h4&gt;

&lt;p&gt;xCloud removes the full infrastructure management burden, not just initial provisioning. The platform handles server setup, OpenClaw installation, SSL configuration, automatic updates, security patches, and backup recovery. A web dashboard provides monitoring, logs, uptime tracking, and one-click restore without any CLI or SSH access required.&lt;/p&gt;

&lt;h4&gt;
  
  
  What you get with xCloud
&lt;/h4&gt;

&lt;p&gt;A fully managed deployment with approximately five-minute setup time, automatic OpenClaw updates, automatic backups, free SSL, integrated monitoring and logs, and 24/7 expert support. The platform requires no Docker, terminal, or DevOps knowledge to operate.&lt;/p&gt;

&lt;h4&gt;
  
  
  xCloud limitations
&lt;/h4&gt;

&lt;p&gt;xCloud requires you to bring your own AI provider API key. The platform currently supports Anthropic, OpenAI, Gemini, OpenRouter, and Moonshot AI, with providers like Grok, xAI, and Mistral listed as coming soon. Unlike KiloClaw's unified Kilo Gateway, there is no single integration point that gives you access to hundreds of models through one connection.&lt;/p&gt;

&lt;p&gt;Channel support is limited. Telegram and WhatsApp work natively, but Discord, Slack, and Signal remain on xCloud's roadmap for Q2 2026. For OpenClaw users who rely on multi-channel headless messaging across Slack, Discord, and Telegram simultaneously, that's a meaningful gap today.&lt;/p&gt;

&lt;p&gt;xCloud also lacks guided OAuth authorization flows for third-party services. Connecting tools like Google Workspace, GitHub, or 1Password requires manual configuration rather than one-click setup. The platform does not publish an independent security assessment or provide detailed documentation on its tenant isolation architecture beyond describing isolated environments.&lt;/p&gt;

&lt;h4&gt;
  
  
  xCloud pricing
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://xcloud.host/openclaw-hosting/" rel="noopener noreferrer"&gt;xCloud starts at $24 per month&lt;/a&gt; for managed OpenClaw hosting, making it the highest-priced managed OpenClaw host in this guide. AI inference is not included. You must bring your own API key from providers like Anthropic, OpenAI, or Gemini, so total monthly cost will be higher depending on model usage.&lt;/p&gt;

&lt;h4&gt;
  
  
  OpenClaw to xCloud migration effort
&lt;/h4&gt;

&lt;p&gt;Low-Moderate. xCloud handles server provisioning and OpenClaw installation automatically. You will need to input your AI provider API keys and configure your messaging platform connections through their dashboard. No raw Docker volume transfers or environment file manipulation required.&lt;/p&gt;

&lt;h4&gt;
  
  
  Bottom line
&lt;/h4&gt;

&lt;p&gt;xCloud handles hosting, updates, and support, but lacks guided OAuth flows for third-party services, publishes no independent security audit, and is the highest-priced managed option in this guide at $24 per month before inference costs. If you need multi-channel integrations, credential vaulting, and validated security architecture at a lower price, KiloClaw covers all of that.&lt;/p&gt;

&lt;h3&gt;
  
  
  DockClaw (managed OpenClaw hosting)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh47idiie54y67xwjxvmt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh47idiie54y67xwjxvmt.png" alt="Dockclaw AI agent deployment homepage with the headline “Ship faster. Deploy anywhere.” featuring autonomous AI agent hosting, multi-model support, fast deployment, and uptime monitoring." width="800" height="564"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Who DockClaw is best for
&lt;/h4&gt;

&lt;p&gt;Solo developers and small teams who need fast managed OpenClaw hosting with multi-model flexibility and don't need multi-channel messaging or advanced enterprise security features.&lt;/p&gt;

&lt;h4&gt;
  
  
  DockClaw Overview
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://dockclaw.com/guides/best-openclaw-hosting-2026" rel="noopener noreferrer"&gt;DockClaw&lt;/a&gt; is a managed hosting service tailored for the OpenClaw framework. The platform emphasizes deployment speed, offering a sub-60-second deployment process combined with dedicated VM isolation for every agent. It supports 10+ AI providers including Claude, GPT-4o, Gemini, Venice, Llama, and any OpenAI-compatible model, with the ability to switch providers at any time.&lt;/p&gt;

&lt;h4&gt;
  
  
  How DockClaw compares to self-hosted OpenClaw
&lt;/h4&gt;

&lt;p&gt;DockClaw removes all infrastructure setup friction by delivering a running, networked agent in under 60 seconds. Rather than relying on shared container environments, DockClaw provisions a dedicated isolated VM for each agent. The platform includes 24/7 uptime monitoring, persistent storage, and a control UI dashboard for managing your agent without touching a terminal.&lt;/p&gt;

&lt;h4&gt;
  
  
  What you get with DockClaw
&lt;/h4&gt;

&lt;p&gt;A quick, painless setup process that bypasses the need to understand cloud infrastructure, multi-provider model support with zero-lock-in switching, Telegram integration out of the box, persistent storage, 24/7 monitoring, and a web dashboard for agent management.&lt;/p&gt;

&lt;h4&gt;
  
  
  DockClaw limitations
&lt;/h4&gt;

&lt;p&gt;DockClaw supports Telegram as its only native messaging channel. There is no Slack, Discord, or WhatsApp integration. For OpenClaw users who rely on multi-channel headless messaging across several platforms simultaneously, that limits the agent's reach from day one.&lt;/p&gt;

&lt;p&gt;The Starter tier is BYOK only. You bring your own API key from providers like Claude, GPT-4o, or Gemini. The Pro tier bundles Kimi K2.5 credits, but total inference costs on the Starter plan depend entirely on your provider usage on top of the $19.99 monthly hosting fee.&lt;/p&gt;

&lt;p&gt;DockClaw lacks guided OAuth authorization flows for third-party services like Google Workspace, GitHub, or 1Password. Connecting external tools requires manual configuration. The platform provides no credential vaulting, no integrated payment controls, and no enterprise SSO. Its security architecture is limited to dedicated VM isolation per agent with no published independent security assessment validating the implementation.&lt;/p&gt;

&lt;h4&gt;
  
  
  DockClaw pricing
&lt;/h4&gt;

&lt;p&gt;The platform starts around $19.99 per month with a 7-day free trial and includes one agent deployment, a dedicated isolated VM, Telegram integration, web browsing, cron jobs, and a control UI dashboard. You bring your own API key. Pro costs $49.99 per month with a 3-day free trial and adds bundled AI model credits (Kimi K2.5, $250 value), Brave Search API access, voice support with Whisper STT and ElevenLabs TTS, a template library, and an agent onboarding wizard. Both tiers require no technical setup.&lt;/p&gt;

&lt;h4&gt;
  
  
  OpenClaw to DockClaw migration effort
&lt;/h4&gt;

&lt;p&gt;Low-Moderate. The migration process involves transferring your core system prompts and using their web interface to re-authenticate your essential tools. No need to manipulate raw server files.&lt;/p&gt;

&lt;h4&gt;
  
  
  Bottom line
&lt;/h4&gt;

&lt;p&gt;DockClaw delivers solid baseline hosting with strong VM isolation at an accessible price point. If you need guided integrations, credential vaulting, and features like AgentCard for controlled financial transactions, KiloClaw provides a more complete production environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  General AI assistants that can replace some OpenClaw workflows
&lt;/h2&gt;

&lt;p&gt;These platforms approach workflow automation through different architectures. They compete for the same automation budget as OpenClaw but prioritize proprietary interfaces, specific foundational models, or visual, no-code environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Perplexity Computer (multi-model agentic platform)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0p9v3i9fvwza3729xeyc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0p9v3i9fvwza3729xeyc.png" alt="Perplexity Computer homepage featuring the headline “Computer Builds,” a glass sphere hero image, and examples of AI-generated tasks like stock analysis, mobile app creation, and report building." width="800" height="567"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Who Perplexity Computer is best for
&lt;/h4&gt;

&lt;p&gt;Knowledge workers, operators, and technical teams who need a fully managed agentic platform that can execute complex, multi-step workflows spanning research, code generation, and content production.&lt;/p&gt;

&lt;h4&gt;
  
  
  Perplexity Computer Overview
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://www.perplexity.ai/hub/blog/introducing-perplexity-computer" rel="noopener noreferrer"&gt;Perplexity Computer&lt;/a&gt; is a fully agentic platform that coordinates 19 AI models simultaneously, routing each subtask to the best-suited model automatically. Claude Opus 4.6 handles core reasoning, Gemini manages deep research, and dedicated models cover image generation, video production, and lightweight tasks.&lt;/p&gt;

&lt;p&gt;You don't pick the model. Perplexity Computer owns the orchestration layer and makes routing decisions for you.&lt;/p&gt;

&lt;h4&gt;
  
  
  How Perplexity Computer compares to OpenClaw
&lt;/h4&gt;

&lt;p&gt;Perplexity Computer runs every task in an isolated cloud environment with a real filesystem, browser, and native integrations with over 400 applications including Slack, Gmail, GitHub, and Notion. It can execute workflows that run for hours, generate code, produce images and video, draft documents, and interact with connected apps in parallel.&lt;/p&gt;

&lt;p&gt;OpenClaw gives you full control over model selection and workflow logic. Perplexity Computer abstracts that away behind its own orchestration engine.&lt;/p&gt;

&lt;p&gt;Critically, Perplexity Computer also supports the two-way messaging pattern that made OpenClaw popular. It integrates directly into Slack, WhatsApp, Telegram, and Discord, responding to messages and running workflows from within your existing communication channels. Enterprise users can query @computer inside Slack channels and continue those conversations in the web or mobile interface.&lt;/p&gt;

&lt;h4&gt;
  
  
  What you get with Perplexity Computer
&lt;/h4&gt;

&lt;p&gt;You get complex workflow execution across research, code generation, and content production without managing any infrastructure. The platform's multi-model orchestration routes subtasks to the best available model automatically. Teams migrating from OpenClaw gain a polished managed experience but lose the ability to choose which model handles each task.&lt;/p&gt;

&lt;h4&gt;
  
  
  Perplexity Computer limitations
&lt;/h4&gt;

&lt;p&gt;Perplexity Computer doesn't offer manual model selection. You can't plug in your own API keys from external providers. For OpenClaw users accustomed to full control over their agent's reasoning engine, this is a fundamental architectural constraint, and the premium subscription tier puts it at a significantly higher price point than most alternatives in this guide.&lt;/p&gt;

&lt;p&gt;Perplexity Computer supports two-way messaging across major channels, but you don't control the underlying orchestration logic. The platform decides how to route tasks across its 19 models. You're adopting Perplexity's opinionated framework for how your agent behaves in those channels.&lt;/p&gt;

&lt;p&gt;The platform can generate and execute code within workflows, but you don't own the execution environment. You can't build a persistent library of custom scripts and reusable skills that grow the agent's capabilities over time. Code runs within Perplexity's orchestration layer, not within infrastructure you manage.&lt;/p&gt;

&lt;h4&gt;
  
  
  Perplexity Computer pricing
&lt;/h4&gt;

&lt;p&gt;Access to Perplexity Computer requires a Max subscription at $200 per month or $2,000 per year. Enterprise pricing starts at $325 per seat per month and includes SSO, audit logs, and additional security controls. Compared to managed OpenClaw hosting providers, weigh this cost increase against the platform's broader orchestration capabilities.&lt;/p&gt;

&lt;h4&gt;
  
  
  OpenClaw to Perplexity Computer migration effort
&lt;/h4&gt;

&lt;p&gt;High. Migrating from OpenClaw to Perplexity Computer requires rebuilding your autonomous workflows within an opinionated orchestration framework. Existing system prompts, custom scripts, and model-specific logic won't transfer directly. You'll need to restructure your agent behavior around Perplexity's automatic model routing and connect your tools through its native integration layer rather than maintaining your own OAuth flows.&lt;/p&gt;

&lt;h4&gt;
  
  
  Bottom line
&lt;/h4&gt;

&lt;p&gt;Perplexity Computer is powerful for multi-model orchestration, but you surrender all control over model selection and can't bring your own API keys. If custom orchestration, reusable skills, vendor flexibility, and cost control matter to your team, KiloClaw delivers all of that at a fraction of the price.&lt;/p&gt;

&lt;h3&gt;
  
  
  Claude Cowork (desktop automation agent)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs1gbmat9fwls57awzw9y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs1gbmat9fwls57awzw9y.png" alt="Anthropic Claude Cowork landing page showcasing autonomous AI task delegation, deliverable creation, and workflow automation across local files, apps, and meeting transcripts." width="800" height="567"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Who Claude Cowork is best for
&lt;/h4&gt;

&lt;p&gt;Desktop-bound professionals, including writers, analysts, and developers, who want an agent that can read, edit, and create local files, run scheduled tasks, and control their desktop, but who don't need always-on, cloud-hosted autonomy.&lt;/p&gt;

&lt;h4&gt;
  
  
  Claude Cowork Overview
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://claude.com/product/cowork" rel="noopener noreferrer"&gt;Claude Cowork&lt;/a&gt; is Anthropic's autonomous desktop agent that works directly within your local environment. It can read, edit, and create files in local folders, run shell commands in a sandboxed environment, execute scheduled background tasks, and control the desktop through computer use. Cowork is an autonomous desktop agent. It doesn't run on a remote cloud host like OpenClaw.&lt;/p&gt;

&lt;h4&gt;
  
  
  How Claude Cowork compares to OpenClaw
&lt;/h4&gt;

&lt;p&gt;OpenClaw operates as a headless agent on a remote server with API-based integrations. Claude Cowork operates directly on your local machine with direct file access, a sandboxed Linux shell, MCP integrations, scheduled tasks for cron-style automation, and Dispatch mode that lets it work autonomously while you step away. It's restricted to Anthropic's proprietary Claude models.&lt;/p&gt;

&lt;p&gt;Of all the general alternatives, Claude Cowork comes closest to matching OpenClaw's self-improving architecture. It can write and execute code in a sandboxed shell, create reusable skills, and build on its own capabilities over time. The critical difference is that this entire loop runs on your local desktop, not on a remote cloud host that stays online independently.&lt;/p&gt;

&lt;h4&gt;
  
  
  What you get with Claude Cowork
&lt;/h4&gt;

&lt;p&gt;You can automate local file workflows, desktop applications, and tasks that require direct access to your machine's filesystem, things a cloud-hosted OpenClaw instance can't reach. You also get scheduled background tasks and Dispatch mode for hands-off execution, plus computer use for automating GUI-based applications that lack API endpoints. The desktop-first model means you can watch the agent work and intervene in real time when needed.&lt;/p&gt;

&lt;h4&gt;
  
  
  Claude Cowork limitations
&lt;/h4&gt;

&lt;p&gt;Claude Cowork enforces strict vendor lock-in. You can't switch to OpenAI, Google, or open-weight models if the Claude infrastructure experiences an outage or performance degradation. The fundamental constraint for OpenClaw migrants is that Cowork runs exclusively on your desktop. It supports scheduled tasks and Dispatch mode, but your machine must remain powered on and running. No remote cloud host or VPS keeps your agent alive, so if you close your laptop while traveling or shut down your desktop, your automation stops. For teams that need always-on, location-independent uptime, that's a dealbreaker.&lt;/p&gt;

&lt;h4&gt;
  
  
  Claude Cowork pricing
&lt;/h4&gt;

&lt;p&gt;Claude Cowork is available on the Pro plan at $20 per month. Max tiers at $100 per month (5x usage) and $200 per month (20x usage) unlock heavier workloads and full Claude Code access.&lt;/p&gt;

&lt;h4&gt;
  
  
  OpenClaw to Claude Cowork migration effort
&lt;/h4&gt;

&lt;p&gt;High. Migrating from OpenClaw to Claude Cowork requires a fundamental architecture shift. OpenClaw system prompts, headless scripts, and OAuth-based cloud integrations don't transfer to Cowork's desktop-first, file-access model. Existing autonomous workflows must be rebuilt around local file operations, MCP integrations, and scheduled tasks rather than remote API orchestration.&lt;/p&gt;

&lt;h4&gt;
  
  
  Bottom line
&lt;/h4&gt;

&lt;p&gt;Claude Cowork offers strong desktop automation with file access and scheduled tasks, but your agent stops running the moment your machine powers off. If you need always-on, location-independent uptime, KiloClaw runs 24/7 on managed cloud infrastructure regardless of whether your laptop is open.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lindy AI (no-code AI assistant)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fob8v9hx66q3111i9g01r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fob8v9hx66q3111i9g01r.png" alt="Lindy AI assistant homepage showing “Get two hours back every day” with inbox, meeting, and calendar automation messaging plus a mobile app interface for email and scheduling management." width="800" height="571"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Who Lindy AI is best for
&lt;/h4&gt;

&lt;p&gt;Non-technical operators, sales teams, customer service leads, and administrative staff who want a visual, no-code platform for deploying AI agents across text, voice, web, and phone channels without writing a single line of code.&lt;/p&gt;

&lt;h4&gt;
  
  
  Lindy AI Overview
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://www.lindy.ai/" rel="noopener noreferrer"&gt;Lindy AI&lt;/a&gt; is a comprehensive no-code agentic platform. Users build specialized AI agents from natural language prompts in minutes. The platform spans text, web, voice, and phone automation with over 5,000 integrations, AI phone agents for inbound and outbound calls, and cloud-based computer use via its Autopilot feature. It focuses on visual workflow building and conversational onboarding, so users never touch configuration files or code.&lt;/p&gt;

&lt;h4&gt;
  
  
  How Lindy AI compares to OpenClaw
&lt;/h4&gt;

&lt;p&gt;OpenClaw gives developers full control over model selection, custom scripts, and raw infrastructure. Lindy replaces all of that with a visual builder where you map out integrations, conditional logic, and tool permissions.&lt;/p&gt;

&lt;p&gt;Lindy supports multiple models including Claude 4.x, GPT-5.x, and Gemini 3.x, and you select the model per agent. It also ships with a library of pre-packaged templates, so you can deploy a configured sales agent, customer service rep, or HR assistant right away.&lt;/p&gt;

&lt;p&gt;Lindy also supports the headless, two-way messaging pattern central to OpenClaw's appeal. Agents connect natively to Slack, Telegram, and WhatsApp, responding to incoming messages and executing workflows 24/7 on Lindy's cloud infrastructure. OpenClaw requires you to configure integrations through OAuth flows and webhook endpoints. Lindy handles that setup through its visual builder.&lt;/p&gt;

&lt;h4&gt;
  
  
  What you get with Lindy AI
&lt;/h4&gt;

&lt;p&gt;A gentle learning curve suitable for rapid adoption across the entire company, plus built-in human-in-the-loop approval for sensitive actions.&lt;/p&gt;

&lt;p&gt;For OpenClaw migrants, the key draw is that Lindy handles hosting, uptime, and integrations entirely in the cloud. Your agents run on Lindy's infrastructure, not on your desktop or your VPS.&lt;/p&gt;

&lt;p&gt;You also get capabilities OpenClaw doesn't offer natively, like AI phone agents and cloud-based browser automation. Teams whose primary use case is always-on messaging agents that triage inboxes, respond to customers, or route requests across channels get that without any infrastructure management.&lt;/p&gt;

&lt;h4&gt;
  
  
  Lindy AI limitations
&lt;/h4&gt;

&lt;p&gt;The platform sacrifices the raw power, deep customizability, and operational flexibility inherent to the open-source OpenClaw ecosystem. You can't inject custom Python scripts, execute arbitrary shell commands, or build bespoke edge-case integrations. While Lindy supports multiple models, it doesn't offer bring-your-own-key support, so you're working within the models and tiers Lindy provisions.&lt;/p&gt;

&lt;p&gt;The visual interface can become prescriptive, making complex developer workflows frustrating or impossible to implement. You also have less control over messaging behavior than OpenClaw provides. You can't write custom message parsing logic, implement bespoke routing rules in code, or fine-tune how the agent handles conversation edge cases.&lt;/p&gt;

&lt;p&gt;Lindy offers no custom code execution. You must build every workflow through the visual builder. For OpenClaw users accustomed to an agent that can code its way through edge cases and extend its own toolset, that's a fundamental capability gap.&lt;/p&gt;

&lt;h4&gt;
  
  
  Lindy AI pricing
&lt;/h4&gt;

&lt;p&gt;Lindy offers a free tier with 400 credits per month. Paid plans start at $19.99 per month for 2,000 credits (Starter), $49.99 per month for 5,000 credits plus 30 phone calls (Pro), and $299 per month for 30,000 credits plus 100 phone calls (Business). Additional credits cost $10 per 1,000. Compared to managed OpenClaw hosting, Lindy's credit-based model can scale costs quickly for high-volume autonomous workflows.&lt;/p&gt;

&lt;h4&gt;
  
  
  OpenClaw to Lindy AI migration effort
&lt;/h4&gt;

&lt;p&gt;High. Migrating from OpenClaw to Lindy requires deconstructing your existing autonomous logic, system prompts, and custom scripts, then rebuilding that behavior within Lindy's visual, no-code workflow builder. OpenClaw's raw script execution, direct model API access, and custom OAuth configurations have no direct equivalent in Lindy's abstraction layer.&lt;/p&gt;

&lt;h4&gt;
  
  
  Bottom line
&lt;/h4&gt;

&lt;p&gt;Lindy AI makes agent building accessible to non-technical teams through its visual builder, but you cannot execute custom code or build scripts that extend the agent's capabilities over time. If your workflows require the raw flexibility of OpenClaw's code execution model, KiloClaw preserves that power on fully managed infrastructure.  &lt;/p&gt;

&lt;h2&gt;
  
  
  How to migrate from self-hosted OpenClaw to a managed provider
&lt;/h2&gt;

&lt;p&gt;Migrating away from a self-hosted architecture doesn't have to mean lost workflows or operational downtime. With a structured plan for extracting your configuration and redeploying it, you can transition your entire autonomous workforce smoothly and securely.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Audit and export your OpenClaw workflows
&lt;/h3&gt;

&lt;p&gt;Before touching your new environment, document the specific communication channels, like Telegram or Slack, and the third-party tools your current self-hosted instance uses.&lt;/p&gt;

&lt;p&gt;Then export all custom system prompts, persona instructions, and memory files from your local workspace. Make sure you capture the agent's accumulated context.&lt;/p&gt;
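The export in this step can be sketched in Python. Everything here is illustrative: the directory names ("prompts", "memory", "personas") are placeholders for wherever your instance actually stores its context, not OpenClaw's real layout.

```python
import tarfile
from pathlib import Path

def export_workspace(workspace: Path, archive: Path) -> list[str]:
    """Archive prompts, persona instructions, and memory files for migration.

    The subdirectory names below are placeholders; point them at wherever
    your instance actually keeps its prompts and accumulated memory.
    """
    wanted = ["prompts", "memory", "personas"]  # illustrative layout
    captured = []
    with tarfile.open(archive, "w:gz") as tar:
        for sub in wanted:
            src = workspace / sub
            if src.is_dir():
                tar.add(src, arcname=sub)  # recursive: keeps every file inside
                captured.append(sub)
    return captured
```

Run it against your workspace root and compare the returned list with your Step 1 audit before decommissioning anything, so you know the agent's accumulated context actually made it into the archive.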

&lt;h3&gt;
  
  
  Step 2: Set up your managed OpenClaw alternative
&lt;/h3&gt;

&lt;p&gt;Log into your chosen managed platform to begin the transition. For example, spin up your new KiloClaw workspace. The platform provisions isolated infrastructure in under two minutes.&lt;/p&gt;

&lt;p&gt;Once the workspace is active, paste your exported system prompts and behavioral instructions into the platform's configuration dashboard. These settings maintain agent continuity and personality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Reconnect integrations using secure OAuth
&lt;/h3&gt;

&lt;p&gt;Don't copy over legacy environment files containing raw application keys. That defeats the purpose of upgrading your architecture.&lt;/p&gt;

&lt;p&gt;Instead, use the new platform's guided, secure OAuth flows. Connect your Google Workspace, GitHub repositories, and 1Password vaults via the secure UI. Let the platform manage and vault the new access tokens properly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Run in parallel and validate workflows
&lt;/h3&gt;

&lt;p&gt;Keep your legacy self-hosted instance running temporarily for operational stability, but isolate it to a muted test channel to prevent duplicate actions.&lt;/p&gt;

&lt;p&gt;Trigger your most common workflows, like preparing executive meetings or running deep research requests, within the newly provisioned KiloClaw environment. Verify integrations work and models perform correctly before shutting down your VPS.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: Choosing the right OpenClaw alternative
&lt;/h2&gt;

&lt;p&gt;The OpenClaw framework has changed how we approach personal automation, proving that autonomous systems can handle complex, multi-step operations. But for professionals whose primary output is strategic execution, managing VPSs, patching Docker containers, and rotating fragile API tokens is a poor use of time.&lt;/p&gt;

&lt;p&gt;When choosing your deployment strategy, evaluate the total cost of ownership. Factor in your own hourly rate for mandatory server maintenance and security patching. You'll find that self-hosting costs more than a predictable managed SaaS subscription. The hidden DevOps tax quickly eclipses any perceived savings from renting raw compute.&lt;/p&gt;
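As a rough sketch of that total-cost-of-ownership math (every number below is an assumption for illustration, not a quoted price):

```python
def self_hosted_monthly_cost(vps_bill: float, hourly_rate: float,
                             maintenance_hours: float) -> float:
    """Raw compute plus the hidden DevOps tax: your time spent on
    updates, security patching, and integration maintenance."""
    return vps_bill + hourly_rate * maintenance_hours

# Illustrative inputs: a $12 VPS, a $75/hour rate, 3 hours/month of upkeep.
self_hosted = self_hosted_monthly_cost(12.0, 75.0, 3.0)  # 237.0
managed = 24.0  # a managed subscription at the top of this guide's range
print(f"self-hosted: ${self_hosted:.2f}/mo vs managed: ${managed:.2f}/mo")
```

Swap in your own VPS bill, hourly rate, and honest maintenance hours; the comparison usually turns on the labor term, not the compute term.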

&lt;p&gt;If you want the raw autonomous power of OpenClaw without the DevOps overhead, the security risks, or the rigid model constraints of proprietary platforms, &lt;a href="https://kilo.ai/kiloclaw" rel="noopener noreferrer"&gt;start your deployment with KiloClaw today&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;You can have an integrated, secure agent running in Slack or Telegram in under two minutes. Get back to the work that actually matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  OpenClaw alternatives FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Is OpenClaw safe to use for work?
&lt;/h3&gt;

&lt;p&gt;Self-hosted OpenClaw can be risky without strong sandboxing and strict permissions. Managed platforms like KiloClaw reduce risk through dedicated Firecracker VM isolation, AES-256 encrypted credential vaults, tool allow-lists, and no SSH access. KiloClaw's security architecture has been validated by an independent assessment with zero cross-tenant findings.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the difference between hosted OpenClaw and general AI assistants?
&lt;/h3&gt;

&lt;p&gt;General assistants vary widely. Some now offer always-on execution and two-way messaging, but they typically trade off developer-level control, model flexibility, and raw customizability compared to the OpenClaw framework.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can you switch AI models in an OpenClaw alternative?
&lt;/h3&gt;

&lt;p&gt;It depends on the provider. Some managed alternatives support model switching across multiple vendors, while many general assistants are locked to a single model ecosystem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do you need Docker or DevOps to use an AI agent?
&lt;/h3&gt;

&lt;p&gt;Not if you choose a managed OpenClaw host. Self-hosting usually requires ongoing DevOps work (updates, OAuth maintenance, monitoring, security patching).&lt;/p&gt;

&lt;h3&gt;
  
  
  When does self-hosting OpenClaw still make sense?
&lt;/h3&gt;

&lt;p&gt;When you need air-gapped/offline operation, you're doing research experiments, or you have dedicated DevOps/SecOps to maintain and secure the stack.&lt;/p&gt;

&lt;h3&gt;
  
  
  How hard is it to migrate from self-hosted OpenClaw to a managed host?
&lt;/h3&gt;

&lt;p&gt;Usually straightforward: export prompts/memory, re-connect tools via OAuth, and test in parallel. Avoid copying raw environment files with tokens; re-authenticate securely instead.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's the real cost difference between self-hosting and managed OpenClaw hosting?
&lt;/h3&gt;

&lt;p&gt;Self-hosting often looks cheaper in compute costs but becomes expensive in engineering time, security work, and integration maintenance. Managed hosting like KiloClaw trades that DevOps overhead for a predictable subscription.&lt;/p&gt;

&lt;h3&gt;
  
  
  Will a general AI assistant replace OpenClaw for business automation?
&lt;/h3&gt;

&lt;p&gt;It depends on your requirements. Some general assistants now offer always-on execution and deep integrations, but they typically lack OpenClaw's raw customizability, custom code execution, bring-your-own-key support, and developer-level control over agent behavior and orchestration logic.  &lt;/p&gt;

</description>
      <category>openclaw</category>
      <category>ai</category>
      <category>agents</category>
      <category>productivity</category>
    </item>
    <item>
      <title>The Prompt Injection Problem: A Guide to Defense-in-Depth for AI Agents</title>
      <dc:creator>Manveer Chawla</dc:creator>
      <pubDate>Wed, 25 Feb 2026 21:09:22 +0000</pubDate>
      <link>https://forem.com/manveer_chawla_64a7283d5a/the-prompt-injection-problem-a-guide-to-defense-in-depth-for-ai-agents-3p1</link>
      <guid>https://forem.com/manveer_chawla_64a7283d5a/the-prompt-injection-problem-a-guide-to-defense-in-depth-for-ai-agents-3p1</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prompt injection is an architecture problem, not a benchmarking problem.&lt;/strong&gt; Anthropic's Sonnet 4.6 system card shows 8% one-shot attack success rate in computer use with all safeguards on, and 50% with unbounded attempts. In coding environments, the same model hits 0%. The difference is the environment, not the model.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Training won't fix prompt injection.&lt;/strong&gt; Instructions and data share the same context window. SQL injection for the LLM era requires an architectural fix, not a behavioral one.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The "lethal trifecta" is the threat model.&lt;/strong&gt; When your agent has tools, processes untrusted input, and holds sensitive access, all three at once, prompt injection becomes catastrophic. Almost every use case people want hits all three.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build the kill chain around the model.&lt;/strong&gt; A five-layer defense (permission boundaries, action gating, input sanitization, output monitoring, blast radius containment) turns the question from "will injection happen" to "how bad is it when it does."
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Defense-in-depth constrains the autonomy ceiling.&lt;/strong&gt; Agents that need human review for irreversible actions don't replace humans. They augment them. The companies winning here redesign the loop, not remove the human from it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Anthropic published the &lt;a href="https://www.anthropic.com/news/claude-3-5-sonnet" rel="noopener noreferrer"&gt;Claude Sonnet 4.6 system card&lt;/a&gt; on February 17, 2026. Buried in the safety evaluations is a number that should change how every engineering team thinks about deploying agentic AI.&lt;/p&gt;

&lt;p&gt;With every safeguard enabled, including extended thinking, automated adversarial attacks still achieve a successful prompt injection takeover &lt;strong&gt;8% of the time on the first attempt&lt;/strong&gt; in computer use environments. Scale to unbounded attempts and the success rate climbs to &lt;strong&gt;50%&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here's what makes this number genuinely interesting, not just alarming. In coding environments with the same model and the same extended thinking, the attack success rate drops to &lt;strong&gt;0.0%&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Zero. The model didn't get smarter between these two evaluations. The environment changed.&lt;/p&gt;

&lt;p&gt;Coding environments have structured inputs: code, terminal output, API responses with defined schemas. Computer use environments encounter arbitrary untrusted content: web pages, emails, calendar invites, documents with hidden text, DOM elements with embedded instructions.&lt;/p&gt;

&lt;p&gt;The difference isn't the model. It's the attack surface.&lt;/p&gt;

&lt;p&gt;A commenter in a Hacker News thread on the system card put it bluntly: "That seems wildly unacceptable. This tech is just a non-starter unless I'm misunderstanding."&lt;/p&gt;

&lt;p&gt;The commenter isn't misunderstanding. They're looking for the solution in the wrong place.&lt;/p&gt;

&lt;p&gt;When I built Zenith's own agent infrastructure, I made the same mistake. I assumed model improvements would close the gap. They won't. Not fully.&lt;/p&gt;

&lt;p&gt;The solution isn't a better model. It's a better architecture around the model.&lt;/p&gt;

&lt;p&gt;This post explains why prompt injection is an architecture problem, defines precisely where the risk concentrates, and lays out a five-layer defense framework for teams shipping agents into production.&lt;/p&gt;

&lt;h2&gt;
  
  
  When is Prompt Injection Most Dangerous? The Lethal Trifecta
&lt;/h2&gt;

&lt;p&gt;Not every agent deployment carries the same risk. Understanding exactly where risk concentrates determines where you invest engineering effort.&lt;/p&gt;

&lt;p&gt;Simon Willison coined the term "lethal trifecta" to describe the combination of capabilities that makes an agent critically vulnerable to prompt injection. An agent enters the danger zone when three conditions occur simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The agent has access to tools.&lt;/strong&gt; The agent can take actions: send emails, execute code, click buttons, call APIs, move money.&lt;/p&gt;

&lt;p&gt;A model that only generates text in a chat window can't cause real-world harm through prompt injection. The moment the model gains the ability to act on systems, the stakes change categorically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The agent processes untrusted input.&lt;/strong&gt; The agent reads content it didn't generate: web pages, incoming emails, documents uploaded by third parties, API responses from external services, calendar invites from unknown senders.&lt;/p&gt;

&lt;p&gt;Any content the agent ingests that an attacker could have influenced counts as untrusted input.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The agent has access to sensitive data or capabilities.&lt;/strong&gt; The agent can reach credentials, PII, financial systems, internal APIs, private documents, or anything else that causes damage if exfiltrated or misused.&lt;/p&gt;

&lt;p&gt;Any two out of three is manageable. An agent with tools and sensitive access but no untrusted input (an internal automation bot processing only your own data) is reasonably safe.&lt;/p&gt;

&lt;p&gt;An agent processing untrusted input with sensitive access but no tools (a summarization engine reading external documents) can't act on injected instructions.&lt;/p&gt;

&lt;p&gt;An agent with tools and untrusted input but no sensitive access (a web scraper writing to a sandbox) has limited blast radius.&lt;/p&gt;

&lt;p&gt;All three together is where prompt injection becomes catastrophic. And almost every use case people want involves all three.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Tools?&lt;/th&gt;
&lt;th&gt;Untrusted Input?&lt;/th&gt;
&lt;th&gt;Sensitive Access?&lt;/th&gt;
&lt;th&gt;Risk Level&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Summarize a doc I uploaded&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Safe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Browse the web for research&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Safe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Send emails on my behalf&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Manageable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Read my emails and reply&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Lethal&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Browse web + write code in my repo&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Lethal&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fill out forms on websites&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Depends&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Likely lethal&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Computer use (general)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Lethal&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The "safe zone" is far narrower than most deployment plans assume. During the HN discussion, one commenter tried to argue for a narrow safe zone limited to internal apps with no external input. Another correctly shot it down: even a calendar invite can contain injection text. Even a PDF from a trusted colleague can carry hidden white-on-white text with embedded instructions.&lt;/p&gt;

&lt;p&gt;The Notion 3.0 incident proved this threat is real. Attackers used exactly that technique (hidden text in PDFs) to instruct the Notion AI agent to use its web search tool and exfiltrate client lists and financial data to an attacker-controlled domain.&lt;/p&gt;

&lt;p&gt;The EchoLeak vulnerability (&lt;a href="https://securiti.ai/blog/echoleak-cve-2025-32711-how-indirect-prompt-injections-exploit-the-ai-layer-and-how-to-secure-your-data/" rel="noopener noreferrer"&gt;CVE-2025-32711&lt;/a&gt;) against Microsoft 365 Copilot was even worse: a zero-click indirect injection via a poisoned email enabled remote exfiltration of emails, OneDrive files, and Teams chats. No user interaction required.&lt;/p&gt;

&lt;p&gt;Meta has operationalized this threat model through their "Agents Rule of Two" policy, mandating human-in-the-loop supervision whenever all three conditions are met. That's the right starting point for any team deploying agents against untrusted content.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "train it away" won't work
&lt;/h2&gt;

&lt;p&gt;The natural response to the 8% number is to assume the next model generation will fix the problem. If training improved resistance from 50% to 8%, surely continued training will push it to 0%.&lt;/p&gt;

&lt;p&gt;I held this view for a while. I was wrong.&lt;/p&gt;

&lt;p&gt;Prompt injection is fundamentally different from content moderation. Content moderation (blocking harmful outputs, refusing dangerous requests) operates on the semantics of what the model produces. Prompt injection operates on the control plane: the model can't reliably distinguish between "instructions from the user" and "instructions embedded in content the user asked it to read" because both arrive as tokens in the same context window.&lt;/p&gt;

&lt;p&gt;The security community spent decades eliminating in-band signaling vulnerabilities. SQL injection existed because queries and data shared the same channel. XSS existed because code and content shared the same rendering context. Command injection existed because shell commands and arguments shared the same string.&lt;/p&gt;

&lt;p&gt;In every case, the fix was architectural: parameterized queries, content security policies, structured argument passing. The fix was never "train the database to be smarter about distinguishing queries from data."&lt;/p&gt;

&lt;p&gt;LLMs have reintroduced in-band signaling at a fundamental architectural level. Trusted instructions (system prompts, user messages) and untrusted data (web page content, email bodies, document text) get concatenated into a single context window and processed by the same transformer mechanism.&lt;/p&gt;

&lt;p&gt;There's no equivalent of a parameterized query. Karpowicz's Impossibility Theorem (June 2025) formalizes this argument, claiming that no LLM can simultaneously guarantee truthfulness and semantic conservation, making manipulation a mathematical certainty under adversarial conditions. &lt;a href="https://owasp.org/www-project-top-10-for-large-language-model-applications/" rel="noopener noreferrer"&gt;OWASP's Top 10 for LLM Applications&lt;/a&gt; ranks prompt injection as the number one vulnerability for the second consecutive year, explicitly noting that defenses like RAG and fine-tuning don't fully mitigate the risk.&lt;/p&gt;

&lt;p&gt;Training against prompt injection is an arms race with infinite surface area. You can train the model to resist "ignore previous instructions." Straightforward. But the attack space is unbounded.&lt;/p&gt;

&lt;p&gt;Attackers encode instructions in base64. They hide them in image metadata. They use semantic persuasion that never directly says "ignore your instructions" but achieves the same effect through narrative framing. They embed instructions in white-on-white text in PDFs, in HTML comments, in alt text on images, in Unicode characters that render invisibly.&lt;/p&gt;

&lt;p&gt;Advanced training techniques like Meta's SecAlign++ have reduced attack success rates on the InjecAgent benchmark from 53.8% to 0.5%. Impressive. But when researchers test those same defenses against adaptive, optimization-based attacks (GCG, TAP), attackers still achieve 98% success rates against defended models.&lt;/p&gt;

&lt;p&gt;The defenses work against known patterns. The attacker always gets to choose new ones.&lt;/p&gt;

&lt;p&gt;Resistance rates asymptote. They don't converge to zero. Going from 50% to 8% one-shot success rate is substantial progress. Going from 8% to 0% may be impossible with current transformer architectures because the model processes instructions and content through the same mechanism.&lt;/p&gt;

&lt;p&gt;The coding environment achieves 0% not because the model is smarter in that context, but because the environment constrains inputs to structured formats where injection is syntactically detectable. The 0% comes from environmental structure, not model robustness.&lt;/p&gt;

&lt;p&gt;8% on the first attempt means near-certainty over enough sessions. If your agent runs 50 tasks per day, each one processing untrusted content an attacker could target, an 8% per-attempt success rate works out to roughly 4 compromises per day.&lt;/p&gt;

&lt;p&gt;Over a five-day work week, compromise is a statistical certainty. Over a month, you're looking at roughly 80 successful injection events. The question isn't whether the agent will be compromised. The question is how much damage each compromise causes.&lt;/p&gt;
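&lt;p&gt;The back-of-envelope math is worth making explicit. A quick sketch, where the 50-tasks-per-day workload is the illustrative figure from above and the worst-case assumption is that every task sees attacker-influenced content:&lt;/p&gt;

```python
# Compounding an 8% one-shot success rate over a day of agent tasks.
# Worst-case assumption: every task processes attacker-influenced content.
p = 0.08            # one-shot injection success rate (from the system card)
tasks_per_day = 50  # illustrative workload

expected_daily = p * tasks_per_day          # expected compromises per day
p_any_daily = 1 - (1 - p) ** tasks_per_day  # P(at least one, per day)
p_any_weekly = 1 - (1 - p) ** (5 * tasks_per_day)

print(round(expected_daily, 1))  # 4.0
print(round(p_any_daily, 4))     # ~0.9845
print(round(p_any_weekly, 6))    # ~1.0
```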

&lt;p&gt;You can't train your way out of an architectural vulnerability.&lt;/p&gt;

&lt;p&gt;Prompt injection resistance training isn't useless. Moving from 50% to 8% is the difference between "trivially exploitable" and "requires effort." That effort buys time for architectural defenses to catch what gets through. But treating model-level resistance as the primary defense is building on sand.&lt;/p&gt;

&lt;h2&gt;
  
  
  A 5-Layer Defense-in-Depth Architecture for Prompt Injection
&lt;/h2&gt;

&lt;p&gt;If you accept that the model can't be fully trusted, the engineering question becomes: what do you build around the model?&lt;/p&gt;

&lt;p&gt;Defense in depth. No single layer is expected to be perfect. Each layer catches what the previous one missed. The system succeeds when no single failure is catastrophic.&lt;/p&gt;

&lt;p&gt;A five-layer model defines this defense. Each layer operates independently, so a failure in one doesn't cascade into the others.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Permission boundaries (least privilege)
&lt;/h3&gt;

&lt;p&gt;The agent should never have more permissions than the specific task requires. The default in most agent frameworks grants broad access at session initialization and leaves the access active for the entire session. That's the equivalent of giving every microservice root access to your database.&lt;/p&gt;

&lt;p&gt;Implement per-task capability grants, not session-wide permissions. An agent browsing the web for research shouldn't simultaneously hold credentials to send email. An agent drafting a document shouldn't have access to the financial transaction API.&lt;/p&gt;

&lt;p&gt;Each task invocation should receive a scoped set of permissions that get revoked when the task completes.&lt;/p&gt;

&lt;p&gt;The cloud providers have started building real infrastructure for this pattern. &lt;a href="https://aws.amazon.com/bedrock/agentcore/" rel="noopener noreferrer"&gt;AWS Bedrock AgentCore&lt;/a&gt;, &lt;a href="https://learn.microsoft.com/en-us/entra/agent-id/" rel="noopener noreferrer"&gt;Microsoft Entra Agent ID&lt;/a&gt;, and &lt;a href="https://cloud.google.com/vertex-ai/docs/agent-engine/agent-identity" rel="noopener noreferrer"&gt;Google Native Agent Identities&lt;/a&gt; all provide distinct, manageable identities for agents, treating them as Non-Human Identities (NHIs) with their own RBAC and ABAC controls.&lt;/p&gt;

&lt;p&gt;The critical implementation detail is Just-in-Time (JIT) access: credentials should be short-lived (15-minute TTL is a reasonable starting point) and task-scoped. If an injection succeeds but the compromised session holds a token that expires in 12 minutes and can only read from a single S3 bucket, the blast radius is contained.&lt;/p&gt;

&lt;p&gt;For code execution, sandboxing remains essential. Firecracker microVMs and gVisor provide hardware-level isolation that prevents a compromised agent from escaping its execution environment. AWS Bedrock AgentCore already uses microVMs for session isolation. This is table stakes for any agent that executes code or interacts with a filesystem.&lt;/p&gt;
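&lt;p&gt;To make the JIT pattern concrete, here is a minimal sketch of a task-scoped, short-lived credential. The class and the scope strings are hypothetical; in production these tokens would be minted and revoked by your identity provider:&lt;/p&gt;

```python
import time
from dataclasses import dataclass, field

@dataclass
class TaskCredential:
    """Hypothetical task-scoped credential with a short TTL."""
    scopes: frozenset          # e.g. {"s3:GetObject:reports-bucket"}
    ttl_seconds: int = 900     # 15-minute TTL, per the guidance above
    issued_at: float = field(default_factory=time.time)

    def allows(self, action: str) -> bool:
        # Deny by default: expired tokens and out-of-scope actions both fail.
        if time.time() - self.issued_at > self.ttl_seconds:
            return False
        return action in self.scopes

# Grant only what this one task needs; revocation is the TTL lapsing.
cred = TaskCredential(scopes=frozenset({"s3:GetObject:reports-bucket"}))
assert cred.allows("s3:GetObject:reports-bucket")
assert not cred.allows("ses:SendEmail")  # never granted, even before expiry
```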

&lt;h3&gt;
  
  
  Layer 2: Action classification and gating
&lt;/h3&gt;

&lt;p&gt;Not all agent actions carry equal risk. Reading a web page is fundamentally different from sending an email, which is fundamentally different from executing a financial transaction. Your defense architecture should reflect this difference.&lt;/p&gt;

&lt;p&gt;Classify every tool available to the agent into risk tiers. &lt;strong&gt;Read-only actions&lt;/strong&gt; (fetching web pages, reading documents, querying databases) are low risk and can proceed autonomously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reversible writes&lt;/strong&gt; (creating draft emails, writing to staging environments, adding items to a list) are medium risk. Log them with automatic rollback on anomaly detection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Irreversible actions&lt;/strong&gt; (sending emails, financial transactions, deleting data, publishing content, modifying access controls) are high risk and require human confirmation or second-model review before execution.&lt;/p&gt;

&lt;p&gt;This pattern isn't new. AWS Bedrock Agents ships with "Action Approval" as a built-in feature. Microsoft Copilot Studio has "User Confirmation" for sensitive actions.&lt;/p&gt;

&lt;p&gt;The engineering work is in the classification, not the gating mechanism. Every tool the agent can call needs to be categorized, and the categorization needs to be conservative. When in doubt, gate the action.&lt;/p&gt;

&lt;p&gt;The second-model review pattern deserves specific attention. Instead of (or in addition to) human review, a separate model instance with a different system prompt evaluates proposed irreversible actions. This reviewer receives only the stated task and the proposed action, with no other session context, and asks: does the action make sense given the task? Does the action access resources outside the expected scope? Does the action match known attack patterns?&lt;/p&gt;

&lt;p&gt;This pattern isn't foolproof (both models share architectural vulnerabilities), but it adds friction that significantly raises the cost of a successful attack.&lt;/p&gt;
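&lt;p&gt;A minimal sketch of the tiering and gating described above. The tool names and tier assignments are illustrative; a human or second-model review slots in wherever an irreversible action is proposed:&lt;/p&gt;

```python
from enum import Enum

class Tier(Enum):
    READ_ONLY = 1         # proceed autonomously
    REVERSIBLE_WRITE = 2  # log, allow automatic rollback
    IRREVERSIBLE = 3      # require human or second-model approval

# Illustrative classification. The conservative rule: any tool missing
# from this map is treated as irreversible.
TOOL_TIERS = {
    "fetch_page": Tier.READ_ONLY,
    "create_draft": Tier.REVERSIBLE_WRITE,
    "send_email": Tier.IRREVERSIBLE,
    "delete_record": Tier.IRREVERSIBLE,
}

def gate(tool: str, approved: bool = False) -> bool:
    tier = TOOL_TIERS.get(tool, Tier.IRREVERSIBLE)  # when in doubt, gate
    if tier is Tier.IRREVERSIBLE:
        return approved
    return True

assert gate("fetch_page")               # autonomous
assert not gate("send_email")           # blocked without approval
assert gate("send_email", approved=True)
assert not gate("unknown_tool")         # unclassified means gated
```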

&lt;h3&gt;
  
  
  Layer 3: Input sanitization and segmentation
&lt;/h3&gt;

&lt;p&gt;Treat untrusted content as a separate context segment with reduced authority. If you can't fully separate instructions from data architecturally, at least create soft boundaries that make injection harder.&lt;/p&gt;

&lt;p&gt;Strip or neutralize potential instruction patterns in ingested content before the content enters the model's context window. Remove HTML comments. Strip metadata that could contain instructions. Convert rich text to plain text where formatting isn't needed. Flag content that contains patterns matching known injection techniques.&lt;/p&gt;

&lt;p&gt;More sophisticated approaches use role-tagged formats (like ChatML) or special delimiters to create boundaries between trusted instructions and untrusted data. Frameworks like CaMeL enforce separation at a deeper level, preventing data from untrusted sources from being used as arguments in dangerous function calls.&lt;/p&gt;

&lt;p&gt;The model can read the content and reason about it, but the framework blocks the model from treating that content as executable instructions.&lt;/p&gt;

&lt;p&gt;This layer is inherently imperfect. Stripping everything that could possibly be an injection also destroys legitimate content. The goal isn't perfection. The goal is raising the bar high enough that attacks bypassing input sanitization are more likely to be caught by output monitoring (Layer 4) or contained by blast radius controls (Layer 5).&lt;/p&gt;
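&lt;p&gt;A sketch of what this soft boundary can look like in practice. The patterns below are a deliberately tiny, illustrative subset; real pattern lists are far longer and evolve constantly:&lt;/p&gt;

```python
import re

# Illustrative sanitizer: flags common injection carriers before content
# reaches the model's context window. Not exhaustive by design.
INVISIBLE_CHARS = re.compile(r"[\u200b\u200c\u200d\u2060\ufeff]")
BASE64_BLOB = re.compile(r"[A-Za-z0-9+/]{40,}={0,2}")
DIRECTIVE = re.compile(r"ignore (all )?(previous|prior) instructions", re.I)

def sanitize(content):
    """Return (cleaned content, list of flags raised)."""
    flags = []
    if INVISIBLE_CHARS.search(content):
        flags.append("invisible-unicode")
        content = INVISIBLE_CHARS.sub("", content)
    if BASE64_BLOB.search(content):
        flags.append("base64-blob")
    if DIRECTIVE.search(content):
        flags.append("directive")
    return content, flags

clean, found = sanitize("Q3 report\u200b. Ignore previous instructions.")
assert found == ["invisible-unicode", "directive"]
assert "\u200b" not in clean
```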

&lt;h3&gt;
  
  
  Layer 4: Output monitoring and anomaly detection
&lt;/h3&gt;

&lt;p&gt;Monitor the agent's actions in real-time against a behavioral baseline. Flag deviations before they cause damage.&lt;/p&gt;

&lt;p&gt;Watch for several categories of anomaly. &lt;strong&gt;Unexpected tool calls&lt;/strong&gt;: if the agent is tasked with summarizing a document and attempts to call an email send function, that's a red flag.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resource access outside scope&lt;/strong&gt;: if the agent is browsing a specific website and attempts to hit an internal API endpoint, terminate the session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data exfiltration patterns&lt;/strong&gt;: if the agent constructs a URL containing what appears to be encoded data and tries to fetch the URL, that matches a known exfiltration technique. The EchoLeak attack against Microsoft 365 Copilot used exactly this pattern, encoding stolen data in image URL parameters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Behavioral discontinuities&lt;/strong&gt;: a sudden shift in the agent's action patterns mid-session, particularly after ingesting new untrusted content, suggests injection may have occurred.&lt;/p&gt;

&lt;p&gt;The architecture needs kill switches that halt the agent immediately on high-confidence anomaly detection and escalate to a human. This has to be a hard stop, not a suggestion. The OWASP GenAI Incident Response Guide recommends identifying compromised sessions via trace ID, issuing revoke commands to block further tool calls, and preserving the context window for forensics.&lt;/p&gt;

&lt;p&gt;Integration with existing security infrastructure matters. Agent action logs should feed into your SIEM. Anomaly detection rules should trigger the same incident response workflows as any other security event. Configure alerts for "impossible toolchains" (sequences of tool calls that no legitimate task would produce) and high-velocity looping (an agent calling the same tool repeatedly in a way that suggests the agent is stuck in an injection-induced loop).&lt;/p&gt;
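&lt;p&gt;As an illustration, here is a sketch of the exfiltration-pattern check combined with an egress scope check. The allowlist, hostnames, and blob threshold are assumptions, not any vendor's API:&lt;/p&gt;

```python
import re
from urllib.parse import urlparse, parse_qsl

# Illustrative monitor rules; thresholds and hosts are assumptions.
ALLOWED_HOSTS = {"api.mail.example.com"}           # per-task egress scope
ENCODED_BLOB = re.compile(r"[A-Za-z0-9+/=]{60,}")  # EchoLeak-style payload

def check_url(url):
    """Return anomaly flags for a URL the agent is about to fetch."""
    flags = []
    parsed = urlparse(url)
    if parsed.hostname not in ALLOWED_HOSTS:
        flags.append("egress-out-of-scope")
    for _, value in parse_qsl(parsed.query):
        if ENCODED_BLOB.search(value):
            flags.append("possible-exfiltration")
    return flags

stolen = "A" * 80  # stands in for base64-encoded inbox contents
assert check_url("https://attacker.example/img?d=" + stolen) == [
    "egress-out-of-scope",
    "possible-exfiltration",
]
assert check_url("https://api.mail.example.com/v1/drafts") == []
```

Either flag alone is enough to terminate the session and escalate; the point is a hard stop, not a warning.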

&lt;h3&gt;
  
  
  Layer 5: Blast radius containment
&lt;/h3&gt;

&lt;p&gt;Layers 1 through 4 reduce the probability and speed of a successful attack. Layer 5 limits the damage when an attack succeeds. Because eventually, one will.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Network segmentation.&lt;/strong&gt; The agent's compute environment should not have unrestricted network access. Deploy agents within private network perimeters (VPC Service Controls on Google Cloud, PrivateLink on AWS) with default-deny egress rules. The agent can reach only the specific endpoints required for its current task.&lt;/p&gt;

&lt;p&gt;If a compromised agent tries to exfiltrate data to an attacker-controlled domain, the network layer blocks the attempt regardless of what the model has been tricked into doing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Credential isolation.&lt;/strong&gt; The agent uses scoped, short-lived tokens. Never long-lived API keys or static credentials. If a session is compromised, the attacker gets a token that expires in minutes and can only perform a narrow set of operations.&lt;/p&gt;

&lt;p&gt;The Google Antigravity IDE incident demonstrated what happens without this protection. A poisoned web guide combined with a browser subagent that had a permissive domain allowlist (including webhook.site) enabled theft of AWS keys from .env files. Short-lived, tightly scoped credentials would have eliminated the entire attack vector.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session isolation.&lt;/strong&gt; Compromise of one agent session must not propagate to others. Each task runs in its own isolated environment with its own credentials, its own network rules, and its own filesystem. No shared state between sessions means no lateral movement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audit logging.&lt;/strong&gt; Every action the agent takes gets recorded with full context: the input that preceded the action, the tool called, the parameters passed, the result returned. This serves two purposes: forensic analysis after an incident, and pattern detection across sessions that may reveal slower, more sophisticated attacks that evade real-time monitoring.&lt;/p&gt;
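&lt;p&gt;A sketch of the kind of audit entry this implies. The field names are assumptions; the point is that every action ships with its full context to an append-only store:&lt;/p&gt;

```python
import json
import time

def audit_record(session_id, tool, params, result, preceding_input):
    """Hypothetical append-only audit entry with full action context."""
    entry = {
        "ts": time.time(),
        "session": session_id,  # one isolated session per task
        "tool": tool,
        "params": params,
        "result": result,
        "preceding_input": preceding_input,  # what the agent saw before acting
    }
    return json.dumps(entry)  # ship to the SIEM / append-only log

line = audit_record("sess-42", "create_draft", {"thread": "T-19"},
                    "draft created", "inbound email body (sanitized)")
assert json.loads(line)["tool"] == "create_draft"
```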

&lt;h2&gt;
  
  
  Example Blueprint: Securing an Email Agent
&lt;/h2&gt;

&lt;p&gt;Abstract architectures are useful for framing. Concrete implementations are useful for building. Here's how the five-layer model applies to one of the most requested and most dangerous agentic workflows: an agent that reads your email and drafts replies.&lt;/p&gt;

&lt;p&gt;This use case hits the full lethal trifecta. The agent has tools (drafting and potentially sending email). The agent processes untrusted input (incoming email bodies, which any external sender controls). The agent has access to sensitive data (your inbox, your contacts, your organizational context).&lt;/p&gt;

&lt;p&gt;EchoLeak proved this attack surface is real and actively exploited.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Permission boundaries.&lt;/strong&gt; The agent gets read access to the inbox and draft-only write access. The agent can't send emails, only create drafts. The agent has no access to calendars, file storage, or contacts beyond the current thread. Its OAuth token is task-scoped and expires after 15 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action gating.&lt;/strong&gt; Drafts are created but never sent without human review. The agent can't modify email filters, forwarding rules, or account settings. Any attempt to call a tool outside the approved set terminates the session immediately. Moving a draft to the outbox is classified as irreversible and requires explicit human approval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input sanitization.&lt;/strong&gt; Incoming email bodies are pre-processed before the agent sees them. HTML converts to plain text. Embedded images get stripped (preventing pixel-based exfiltration). Content matching known injection patterns (directives, base64-encoded blocks, invisible Unicode characters) is flagged and either stripped or presented with an explicit warning marker.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output monitoring.&lt;/strong&gt; If the agent attempts to access any URL, API, or resource not on the allowlist for email operations, the session terminates. If the agent constructs a draft containing what appears to be encoded data in URLs (the EchoLeak exfiltration pattern), the draft gets quarantined for human review. If behavior shifts discontinuously after processing a specific email, that email is flagged as potentially adversarial.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Blast radius containment.&lt;/strong&gt; The agent runs in an isolated sandbox with no filesystem access beyond its working directory. Network egress is restricted to the email provider's API endpoints. The OAuth token covers read + draft-create, not full mailbox access.&lt;/p&gt;

&lt;p&gt;If every other layer fails and the agent is fully compromised, the attacker can create draft emails (which the human reviews before sending) and read emails already in the inbox (which is the scope the agent was legitimately granted). The damage ceiling is defined and bounded.&lt;/p&gt;

&lt;p&gt;This architecture doesn't make the agent invulnerable. This architecture makes the agent fail safely.&lt;/p&gt;

&lt;p&gt;The difference between "an injection that creates a weird draft the human deletes" and "an injection that silently exfiltrates your entire inbox" is entirely about the architecture sitting around the model.&lt;/p&gt;
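&lt;p&gt;The blueprint above can be collapsed into a single declarative policy. A sketch, where every value (scope names, TTL, tool names, hostnames) is an assumption rather than any vendor's actual API:&lt;/p&gt;

```python
# Hypothetical policy object for the email agent described above.
EMAIL_AGENT_POLICY = {
    "oauth_scopes": ["mail.read", "mail.drafts.create"],  # no mail.send
    "token_ttl_seconds": 900,
    "allowed_tools": ["read_thread", "create_draft"],
    "irreversible_tools": ["move_to_outbox"],  # always human-gated
    "egress_allowlist": ["api.mail.example.com"],
    "sanitize_inbound": True,  # flatten HTML, strip images, flag patterns
}

def is_tool_allowed(tool, human_approved=False):
    if tool in EMAIL_AGENT_POLICY["allowed_tools"]:
        return True
    if tool in EMAIL_AGENT_POLICY["irreversible_tools"]:
        return human_approved
    return False  # default deny: anything else terminates the session

assert is_tool_allowed("create_draft")
assert not is_tool_allowed("move_to_outbox")          # needs a human
assert is_tool_allowed("move_to_outbox", human_approved=True)
assert not is_tool_allowed("update_forwarding_rules")  # off-policy
```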

&lt;h2&gt;
  
  
  What this means for the "replace all workers" narrative
&lt;/h2&gt;

&lt;p&gt;The prompt injection problem directly constrains the labor displacement ceiling for agentic AI. Understanding this constraint matters for teams making investment decisions about agent deployments.&lt;/p&gt;

&lt;p&gt;Agents that require human oversight for irreversible actions can't replace humans. They augment them. The supervision requirement scales with risk, not with task volume.&lt;/p&gt;

&lt;p&gt;An agent that autonomously handles 200 low-risk email drafts per day while a human reviews 15 high-risk ones is a massive productivity gain. But it's a different value proposition than "we replaced the person who used to do email."&lt;/p&gt;

&lt;p&gt;I see this playing out with our clients at Zenith constantly. The near-term reality isn't autonomous agents replacing knowledge workers. It's a redesigned workflow where agents handle high-volume, lower-risk tasks autonomously while humans focus on decisions where the cost of error is high: sending the email, approving the transaction, publishing the content, granting the access.&lt;/p&gt;

&lt;p&gt;The companies extracting real value from agents aren't removing humans from the loop. They're redesigning the loop so that humans review only what matters while agents handle the rest.&lt;/p&gt;

&lt;p&gt;The adoption numbers tell the same story. PwC reports that 79% of executives are adopting agents, but 34% cite cybersecurity as their top barrier. An S&amp;amp;P Global report found that 42% of companies abandoned AI initiatives entirely, with security risks as the primary driver.&lt;/p&gt;

&lt;p&gt;The organizations that push through aren't the ones that found a way to make agents safe enough for full autonomy. They're the ones that built architectures where the agent doesn't need full autonomy to be valuable.&lt;/p&gt;

&lt;p&gt;"Summarize some text while I supervise" is a productivity improvement. "Replace me with autonomous decisions" is liability chaos.&lt;/p&gt;

&lt;p&gt;The security constraint isn't a bug in the adoption curve. The security constraint defines the shape of the adoption curve.&lt;/p&gt;

&lt;h2&gt;
  
  
  The model is the weakest link. Build around the model.
&lt;/h2&gt;

&lt;p&gt;Security engineers have known for decades that you don't build your security posture around the assumption that any single component is bulletproof. You assume every layer can fail and design the system so that no single failure is catastrophic.&lt;/p&gt;

&lt;p&gt;The 8% number isn't a reason to avoid deploying agentic AI. The 8% number is a reason to stop treating the model as the security boundary and start treating the model as what it is: a powerful but unreliable component that needs guardrails, monitoring, and containment.&lt;/p&gt;

&lt;p&gt;The model will keep getting better at resisting prompt injection. That 8% will probably drop. But it won't hit zero. Not with current architectures, and possibly not ever.&lt;/p&gt;

&lt;p&gt;Build accordingly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Frequently Asked Questions (FAQ)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is prompt injection?
&lt;/h3&gt;

&lt;p&gt;Prompt injection is a security vulnerability where an attacker manipulates a large language model (LLM) by embedding malicious instructions into the content the model processes. This attack can trick the AI agent into performing unintended actions, such as leaking sensitive data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why is prompt injection a major security risk?
&lt;/h3&gt;

&lt;p&gt;Prompt injection becomes a major risk when three conditions are met (the "lethal trifecta"): the AI agent can use tools (like sending emails), processes untrusted input (like web pages or documents), and has access to sensitive data. This combination allows an attacker to take control of the agent to exfiltrate data or cause harm.&lt;/p&gt;

&lt;h3&gt;
  
  
  How can you protect AI agents from prompt injection?
&lt;/h3&gt;

&lt;p&gt;Protection requires a defense-in-depth architecture. This architecture includes five key layers: implementing strict permission boundaries, gating high-risk actions, sanitizing inputs, monitoring outputs for anomalies, and containing the blast radius with network and credential isolation.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>security</category>
      <category>promptengineering</category>
    </item>
    <item>
      <title>ACID compliance in data analytics platforms: what it is, why it matters, and how to verify it (2026)</title>
      <dc:creator>Manveer Chawla</dc:creator>
      <pubDate>Tue, 17 Feb 2026 19:00:14 +0000</pubDate>
      <link>https://forem.com/manveer_chawla_64a7283d5a/acid-compliance-in-data-analytics-platforms-what-it-is-why-it-matters-and-how-to-verify-it-2026-38kj</link>
      <guid>https://forem.com/manveer_chawla_64a7283d5a/acid-compliance-in-data-analytics-platforms-what-it-is-why-it-matters-and-how-to-verify-it-2026-38kj</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ACID matters in 2026 analytics&lt;/strong&gt; because warehouses now power operational workflows (Reverse ETL, AI agents, user-facing apps). Dirty reads and inconsistent snapshots become real business incidents.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ACID is enforced via MVCC + isolation levels + a transactional metadata layer&lt;/strong&gt; (the "ACID" often happens in metadata, not in data).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Most cloud warehouses optimize for concurrency and may default to weaker isolation (often READ COMMITTED),&lt;/strong&gt; which can cause anomalies in multi-step transformations.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lakehouse table formats (Iceberg/Delta) can be ACID, but pay a "maintenance tax"&lt;/strong&gt; (small files, compaction/vacuum, metadata bloat).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale-up/hybrid engines (DuckDB/MotherDuck) deliver fast commits and strong consistency&lt;/strong&gt; by keeping transaction management close to compute (WAL/MVCC) and avoiding distributed metadata latency.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trust but verify:&lt;/strong&gt; run dirty read, lost update, recovery, and "surprise bill" micro-transaction tests to validate correctness &lt;em&gt;and&lt;/em&gt; cost.&lt;/li&gt;
&lt;/ul&gt;
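&lt;p&gt;To make the "trust but verify" bullet concrete, here is a minimal local sketch of the dirty-read probe. SQLite stands in for the warehouse here; the same two-session test translates to any engine's SQL console:&lt;/p&gt;

```python
import os
import sqlite3
import tempfile

# Dirty-read probe: one session writes without committing, a second
# session reads. An ACID-compliant engine must hide the uncommitted row.
path = os.path.join(tempfile.mkdtemp(), "probe.db")

writer = sqlite3.connect(path, isolation_level=None)  # manual transactions
reader = sqlite3.connect(path, isolation_level=None)
writer.execute("CREATE TABLE revenue (amount INTEGER)")

writer.execute("BEGIN")
writer.execute("INSERT INTO revenue VALUES (100)")  # written, NOT committed

before = reader.execute("SELECT COUNT(*) FROM revenue").fetchall()[0][0]
writer.execute("COMMIT")
after = reader.execute("SELECT COUNT(*) FROM revenue").fetchall()[0][0]

print(before, after)  # 0 1: no dirty read, then the committed row appears
```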

&lt;p&gt;You've probably stared at a dashboard where the "Total Revenue" KPI didn't match the sum of the line items below it. Or maybe you've debugged a "Single Source of Truth" table that mysteriously dropped rows during a high-traffic ingestion window.&lt;/p&gt;

&lt;p&gt;In the past, we called these glitches and moved on. We accepted that analytical data was eventually consistent because it was historically backward-looking. A report generated at midnight didn't need to reflect a transaction that happened at 11:59:59 PM.&lt;/p&gt;

&lt;p&gt;But in 2026, "good enough" consistency is dead. Analytics isn't a read-only discipline anymore.&lt;/p&gt;

&lt;p&gt;Data warehouses now power operational workflows via Reverse ETL, feed live AI agents, and serve user-facing analytics in real-time applications. When a warehouse drives a marketing automation tool or a customer-facing billing portal, a "dirty read" isn't just a glitch. It's a compliance violation, a lost customer, or a triggered support incident.&lt;/p&gt;

&lt;p&gt;This guide goes beyond the textbook definitions of Atomicity, Consistency, Isolation, and Durability. We'll look at how modern platforms, from decoupled cloud warehouses like Snowflake to open table formats like Iceberg and hybrid engines like &lt;a href="https://motherduck.com" rel="noopener noreferrer"&gt;MotherDuck&lt;/a&gt;, mechanically guarantee trust. We'll dig into the hidden costs of these architectures and give you a framework for verifying that your "Single Source of Truth" isn't actually a lie.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why ACID compliance matters for modern analytics (and how it works)
&lt;/h2&gt;

&lt;p&gt;The "Big Data" era trained us to accept eventual consistency. When you processed petabytes of logs with Hadoop, the count could be off by 1% for a few hours. Those systems were designed for massive throughput, not transactional precision.&lt;/p&gt;

&lt;p&gt;But the "Big Data" hangover has cleared. We're facing a new reality: &lt;strong&gt;Operational Analytics&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Tools like dbt (data build tool) and Reverse ETL platforms have transformed the data warehouse from a passive closet into an active nervous system. Pipelines now target freshness windows of 1 to 60 minutes. Marketing activation and sales operations demand data that's accurate &lt;em&gt;right now&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;If your warehouse feeds a CRM, and that CRM triggers a "Welcome" email based on a signup event, the underlying storage layer must guarantee that the signup record is fully committed and visible before the email trigger fires. You can't have reliable Data Governance or a semantic layer if the underlying storage can't guarantee atomic commits. Without ACID, your "governed metrics" are just suggestions subject to race conditions.&lt;/p&gt;

&lt;h3&gt;
  
  
  How MVCC enables ACID compliance (and what isolation levels mean)
&lt;/h3&gt;

&lt;p&gt;To understand how modern platforms solve this problem, we need to look beyond the acronym and examine the implementation standard: &lt;strong&gt;Multi-Version Concurrency Control (MVCC)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://motherduck.com/blog/open-lakehouse-stack-duckdb-table-formats/" rel="noopener noreferrer"&gt;DuckDB&lt;/a&gt;, Snowflake, and Postgres all use MVCC to handle high concurrency without locking the entire system. In a naive database, a writer might lock a table to update it, forcing all readers to wait. In an MVCC system, the database maintains multiple versions of the data simultaneously.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Reader:&lt;/strong&gt; When you run a &lt;code&gt;SELECT&lt;/code&gt; query, the database takes a logical "snapshot" of the data at that specific moment. You see the state of the world as it existed when your query began.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Writer:&lt;/strong&gt; When a pipeline runs an &lt;code&gt;UPDATE&lt;/code&gt;, it creates a &lt;em&gt;new&lt;/em&gt; version of the rows (or files) rather than overwriting the old ones.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvo7u8tod05se3kw9fepm.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvo7u8tod05se3kw9fepm.jpg" alt="Transaction A starts at T1. Transaction B starts writing at T2. Transaction A continues reading. Because A is pinned to the T1 snapshot, A doesn't see B's partial work or even B's committed work until A finishes and starts a new transaction." width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This versioning lets readers and writers coexist without blocking each other. But MVCC alone isn't enough. The database must also enforce &lt;strong&gt;Isolation Levels&lt;/strong&gt;. Isolation isn't a binary "on/off" switch. It's a spectrum of guarantees that trades performance for correctness.&lt;/p&gt;

&lt;h4&gt;
  
  
  Isolation levels explained: read uncommitted vs read committed vs snapshot vs serializable
&lt;/h4&gt;

&lt;p&gt;Different business risks map to different isolation levels. Understanding this hierarchy is critical for evaluating platforms, since many cloud warehouses default to lower levels to maximize concurrency.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Isolation level&lt;/th&gt;
&lt;th&gt;Definition&lt;/th&gt;
&lt;th&gt;Business risk prevented&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Read Uncommitted&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;You can see data that hasn't been committed yet.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;None. Allows Dirty Reads:&lt;/strong&gt; A dashboard shows revenue from an order that fails and rolls back 1 second later.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Read Committed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;You only see data committed before your statement began.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Dirty Reads.&lt;/strong&gt; &lt;em&gt;Note: This is the default for Snowflake and many major warehouses.&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Snapshot Isolation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;You see a consistent snapshot of data as of the start of your &lt;em&gt;transaction&lt;/em&gt;.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Non-Repeatable Reads:&lt;/strong&gt; Running the same query twice in a transaction yields different results because a background job updated the table.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Serializable&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The strictest level. It simulates running transactions one at a time.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Phantom Reads:&lt;/strong&gt; A query counting rows returns different numbers because a new row was inserted by another process.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Snapshot Isolation and Serializable offer stronger guarantees, but they come with performance costs. Many decoupled cloud warehouses, including Snowflake, support &lt;code&gt;READ COMMITTED&lt;/code&gt; for standard tables.&lt;/p&gt;

&lt;p&gt;This isolation level means that if you have a multi-statement transaction (say, a dbt model with multiple steps), two successive queries within that same transaction could return different results if a separate pipeline commits data in between them. For complex transformation logic, READ COMMITTED can introduce subtle, hard-to-debug data anomalies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where ACID actually happens: the metadata transaction layer
&lt;/h3&gt;

&lt;p&gt;If the data files (Parquet, micro-partitions) are immutable, where does the "ACID" actually happen? In the &lt;strong&gt;metadata&lt;/strong&gt;. The difference between a loose collection of files and a database table is a transactional metadata layer that tells the engine which files belong to the current version.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;In Cloud Warehouses (Snowflake/BigQuery):&lt;/strong&gt; A centralized, proprietary metadata service acts as the "brain." It manages locks and versions. Snowflake, for example, uses &lt;a href="https://www.snowflake.com/en/blog/how-foundationdb-powers-snowflake-metadata-forward/" rel="noopener noreferrer"&gt;FoundationDB&lt;/a&gt; (a distributed key-value store) to track every micro-partition.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;In Table Formats (Iceberg/Delta/DuckLake):&lt;/strong&gt; The file system (S3/Object Storage) combined with a catalog acts as the source of truth. They rely on atomic file swaps or optimistic concurrency control to manage versions.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;In Scale-Up Engines (DuckDB/MotherDuck):&lt;/strong&gt; Transaction management is handled in-process using a Write-Ahead Log (WAL). Because the compute and transaction manager are tightly coupled, commits are near-instant. No network latency from external metadata services.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Three ways analytics platforms implement ACID compliance (2026)
&lt;/h2&gt;

&lt;p&gt;There's no single "best" way to implement ACID. Three dominant architectures prevail, each optimizing for a different constraint: scale, openness, or latency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Approach 1: Decoupled scale-out warehouses (Snowflake, BigQuery)
&lt;/h3&gt;

&lt;p&gt;This architecture separates storage (S3/GCS), compute (Virtual Warehouses), and global state (The Cloud Services Layer).&lt;/p&gt;

&lt;h4&gt;
  
  
  How decoupled warehouses provide ACID compliance
&lt;/h4&gt;

&lt;p&gt;When you run an &lt;code&gt;UPDATE&lt;/code&gt; in Snowflake, you're not just writing data. You're engaging a sophisticated, centralized brain. This metadata service (backed by FoundationDB) coordinates transactions across distributed clusters. The service ensures that when your query completes, the pointer to the "current" data is updated atomically.&lt;/p&gt;

&lt;h4&gt;
  
  
  Pros of decoupled warehouses
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Massive Concurrency:&lt;/strong&gt; Because the metadata layer is distributed, these systems can handle petabyte-scale workloads where thousands of users query the same tables simultaneously.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Separation of Concerns:&lt;/strong&gt; You can scale compute up and down instantly without worrying about data corruption.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons of decoupled warehouses: latency, cost, and weaker isolation
&lt;/h4&gt;

&lt;p&gt;Centralizing the "brain" introduces friction. Every transaction, no matter how small, requires a round-trip network call to this central service. This imposes a "latency floor" on operations. You can't simply "insert a row." You must ask the global brain for permission, write the data, and then tell the brain to update the pointer.&lt;/p&gt;

&lt;p&gt;This architecture also introduces a specific cost-model risk: &lt;strong&gt;Cloud Services Billing&lt;/strong&gt;. In Snowflake, you're billed for the "brain's" work if it &lt;a href="https://docs.snowflake.com/en/user-guide/cost-understanding-compute" rel="noopener noreferrer"&gt;exceeds 10% of your daily compute credits&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Workloads that involve frequent "micro-transactions" (like continuous ingestion or looping single-row inserts) can thrash the metadata layer. This leads to "surprise bills" where the cost of managing the transaction exceeds the cost of processing the data.&lt;/p&gt;

&lt;p&gt;And relying primarily on &lt;code&gt;READ COMMITTED&lt;/code&gt; isolation means that applications requiring strict multi-statement consistency (such as financial ledger balancing within a stored procedure) need careful design. Otherwise, you'll hit anomalies where data changes mid-execution.&lt;/p&gt;

&lt;h4&gt;
  
  
  Best for: petabyte-scale batch analytics
&lt;/h4&gt;

&lt;p&gt;Petabyte-scale, "big data" batch processing where the engineering team manages complex infrastructure. This architecture works well when predictable costs are secondary to querying enormous datasets, and when the latency of individual transactions matters less than overall throughput.&lt;/p&gt;

&lt;h3&gt;
  
  
  Approach 2: Open table formats and lakehouses (Iceberg, Delta Lake)
&lt;/h3&gt;

&lt;p&gt;This approach tries to bring ACID to the data lake without a proprietary central brain.&lt;/p&gt;

&lt;h4&gt;
  
  
  How Iceberg and Delta Lake provide ACID transactions
&lt;/h4&gt;

&lt;p&gt;Instead of a database managing the state, the state is managed via files in object storage (S3).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Delta Lake&lt;/strong&gt; uses a transaction log (&lt;code&gt;_delta_log&lt;/code&gt;) containing JSON files that track changes.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iceberg&lt;/strong&gt; uses a hierarchy of metadata files (Manifest Lists -&amp;gt; Manifests -&amp;gt; Data Files) and relies on an atomic "swap" of the metadata file pointer to commit a transaction.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concurrency&lt;/strong&gt; is handled via "Optimistic Concurrency Control" (OCC). A writer assumes it's the only one writing. Before committing, the writer checks if anyone else changed the file. If a conflict exists, the writer fails and must retry.&lt;/li&gt;
&lt;/ul&gt;
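&lt;p&gt;Conceptually, an OCC commit is just a compare-and-swap on the pointer to the current metadata snapshot. The toy Python class below sketches the shape of that protocol. It's an illustration of the retry logic only, not the API of any real Iceberg or Delta catalog.&lt;/p&gt;

```python
# A toy catalog: the only transactional state is a version counter
# and a pointer to the current metadata snapshot. The file names are
# invented for illustration.
class Catalog:
    def __init__(self):
        self.version = 0
        self.snapshot = "metadata-v0.json"

    def commit(self, expected_version, new_snapshot):
        # Atomic compare-and-swap: succeed only if nobody else
        # committed since this writer read the table state.
        if self.version != expected_version:
            return False  # conflict: the writer must re-read and retry
        self.version += 1
        self.snapshot = new_snapshot
        return True

catalog = Catalog()

# Writers A and B both read the table at version 0.
seen_by_a = catalog.version
seen_by_b = catalog.version

# A commits first and wins the swap.
assert catalog.commit(seen_by_a, "metadata-v1-from-a.json")

# B's commit fails because its base version is stale...
assert not catalog.commit(seen_by_b, "metadata-v1-from-b.json")

# ...so B re-reads the current state and retries its commit.
assert catalog.commit(catalog.version, "metadata-v2-from-b.json")
print(catalog.version, catalog.snapshot)  # 2 metadata-v2-from-b.json
```

The key property: the losing writer's work is never silently overwritten. It must re-validate against the new table state before committing.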

&lt;h4&gt;
  
  
  Pros of open table formats
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vendor Agnostic:&lt;/strong&gt; Your data lives in your S3 bucket. You can read it with Spark, Trino, Flink, or DuckDB.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Control:&lt;/strong&gt; You pay for S3 and your own compute, avoiding the markup of proprietary warehouses.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons of lakehouses: the small file problem and the maintenance tax
&lt;/h4&gt;

&lt;p&gt;Relying on object storage creates a severe "Small File Problem." Every time you stream data or run a small &lt;code&gt;INSERT&lt;/code&gt;, you create new data files and new metadata files.&lt;/p&gt;

&lt;p&gt;Here's a real-world example. An Iceberg table with a streaming ingestion pipeline created &lt;a href="https://iomete.com/resources/blog/apache-iceberg-production-antipatterns-2026" rel="noopener noreferrer"&gt;45 million small data files&lt;/a&gt;. This pipeline generated over 5TB of &lt;em&gt;metadata&lt;/em&gt; alone (manifest files tracking the data).&lt;/p&gt;

&lt;p&gt;When analysts tried to query the table, the query planner had to read gigabytes of metadata just to figure out which files to scan. Query planning times jumped from milliseconds to minutes, and the coordinators frequently crashed due to Out-Of-Memory (OOM) errors.&lt;/p&gt;

&lt;p&gt;To make this architecture work, you have to pay a "maintenance tax." You need to run compaction jobs (rewriting small files into larger ones) and vacuum processes (deleting old files) continuously. If you neglect this hygiene, performance degrades exponentially.&lt;/p&gt;

&lt;h4&gt;
  
  
  Best for: open data lakes with strong engineering support
&lt;/h4&gt;

&lt;p&gt;Large-scale data engineering teams that prioritize open standards and have the operational capacity to manage the "maintenance tax." This architecture fits well for massive batch jobs, but struggles with the latency and complexity of high-frequency operational updates.&lt;/p&gt;

&lt;h3&gt;
  
  
  Approach 3: Scale-up and hybrid engines (DuckDB, MotherDuck)
&lt;/h3&gt;

&lt;p&gt;This architecture rejects the premise that you need a distributed cluster for every problem. It uses a "Scale-Up" approach (using a single, powerful node) coupled with a &lt;a href="https://motherduck.com/learn-more/hybrid-analytics-guide/" rel="noopener noreferrer"&gt;hybrid execution model&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  How DuckDB and MotherDuck provide ACID compliance (MVCC + WAL)
&lt;/h4&gt;

&lt;p&gt;DuckDB (and, by extension, MotherDuck) implements ACID using strict MVCC and a Write-Ahead Log (WAL), similar to Postgres but optimized for analytics.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Local:&lt;/strong&gt; On your laptop, the transaction manager runs in-process. Network overhead disappears.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud:&lt;/strong&gt; MotherDuck runs "Ducklings" (isolated compute instances).&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Why DuckLake improves metadata transactions for analytics
&lt;/h4&gt;

&lt;p&gt;MotherDuck introduces a hybrid table format called &lt;a href="https://motherduck.com/blog/ducklake-motherduck/" rel="noopener noreferrer"&gt;"DuckLake"&lt;/a&gt;. Unlike Iceberg, which requires scanning S3 files to find metadata (slow), DuckLake stores metadata in a high-performance relational database (fast), while the data remains in open formats (Parquet) on S3.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Result:&lt;/strong&gt; Metadata operations (checking table structure, finding files) take roughly &lt;strong&gt;2 milliseconds&lt;/strong&gt;, compared to the 100ms–1000ms "cold start" penalty of scanning object storage manifests.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Pros of scale-up engines: interactive speed and simpler transactions
&lt;/h4&gt;

&lt;p&gt;ACID guarantees are handled in-process. Commits happen in milliseconds because no distributed consensus algorithm delays them. "Noisy neighbor" issues disappear because tenancy is isolated. You get the strict consistency of a relational database with the analytical speed of a columnar engine.&lt;/p&gt;

&lt;h4&gt;
  
  
  Cons of scale-up engines: not designed for 100+ PB single tables
&lt;/h4&gt;

&lt;p&gt;This architecture isn't designed for the 100+ PB single-table workload. It optimizes for the 95% of workloads that fit within the memory and disk of a large single node (which, in the cloud, can be massive).&lt;/p&gt;

&lt;h4&gt;
  
  
  Best for: operational analytics, interactive BI, and real-time dashboards
&lt;/h4&gt;

&lt;p&gt;"Fast Data" workloads: user-facing applications, interactive BI, and real-time dashboards where sub-second response times are critical. Scale-up engines are the undisputed choice for local development and CI/CD, since they let engineers run full ACID-compliant tests on their laptop that perfectly mirror production behavior.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to verify ACID compliance: a practical test framework
&lt;/h2&gt;

&lt;p&gt;Marketing pages are easy to write. Distributed consistency is hard to build. Don't just trust that a platform is "ACID compliant." Verify the behavior, especially if you're building customer-facing data products.&lt;/p&gt;

&lt;p&gt;Here's a framework of tests you can run in your SQL environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Test 1: How to test for dirty reads
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Objective:&lt;/strong&gt; Ensure that a long-running query doesn't see uncommitted data from a concurrent write.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Session A (The Writer):&lt;/strong&gt; Start a transaction. Insert a "poison pill" row (e.g., a row with &lt;code&gt;ID = -999&lt;/code&gt;). &lt;em&gt;Don't commit yet.&lt;/em&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;BEGIN&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;revenue_table&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;999&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;-- Hang here. Do not commit.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Session B (The Reader):&lt;/strong&gt; Immediately query the table.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;revenue_table&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;999&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Result:&lt;/strong&gt; If Session B returns the row, the system allows Dirty Reads (fail). If Session B returns nothing, the system enforces isolation.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Finish:&lt;/strong&gt; Commit or Rollback Session A.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Test 2: How to test for lost updates (concurrency)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Objective:&lt;/strong&gt; See how the system handles two users trying to update the same row at the exact same time.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Setup:&lt;/strong&gt; Create a table with a single row: &lt;code&gt;Counter = 10&lt;/code&gt;.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session A:&lt;/strong&gt; &lt;code&gt;BEGIN; UPDATE table SET Counter = 11;&lt;/code&gt; (Don't commit).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session B:&lt;/strong&gt; &lt;code&gt;BEGIN; UPDATE table SET Counter = 12;&lt;/code&gt; (Try to commit).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Blocking:&lt;/strong&gt; Session B might hang, waiting for A to finish (common in lock-based systems like Snowflake).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error:&lt;/strong&gt; Session B might fail immediately with a "Serialization Failure" or "Concurrent Transaction" error (common in Optimistic systems like DuckDB/Lakehouse).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Silent Overwrite (Failure):&lt;/strong&gt; If both succeed and the final value is 12 (or 11) without warning, you have a "Lost Update" anomaly.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Test 3: How to test atomicity and durability (recovery)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Objective:&lt;/strong&gt; Verify Atomicity and Durability.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Action:&lt;/strong&gt; Start a massive &lt;code&gt;INSERT&lt;/code&gt; statement (e.g., 10 million rows).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disruption:&lt;/strong&gt; Kill the client process or force a connection drop halfway through.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check:&lt;/strong&gt; Reconnect and query the table.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result:&lt;/strong&gt; You should see &lt;strong&gt;zero&lt;/strong&gt; rows from that batch. If you see 5 million rows, Atomicity failed. The system must use its Write-Ahead Log (WAL) to roll back the partial write upon restart.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Test 4: How to measure the cost overhead of ACID transactions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Objective:&lt;/strong&gt; Verify the cost of ACID overhead.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Action:&lt;/strong&gt; Write a script that performs 10,000 "micro-transactions" (inserting 1 row, committing, repeating).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check:&lt;/strong&gt; Look at the billing metrics for that specific time window.

&lt;ul&gt;
&lt;li&gt;In &lt;strong&gt;Snowflake&lt;/strong&gt;, check the &lt;code&gt;CLOUD_SERVICES_USAGE&lt;/code&gt; metric. Did it spike above 10% of compute?
&lt;/li&gt;
&lt;li&gt;In &lt;strong&gt;BigQuery&lt;/strong&gt;, check the &lt;a href="https://cloud.google.com/bigquery/pricing" rel="noopener noreferrer"&gt;API costs&lt;/a&gt; for streaming inserts.
&lt;/li&gt;
&lt;li&gt;In &lt;strong&gt;MotherDuck&lt;/strong&gt;, verify that the cost remains flat (compute-based) and does not include hidden metadata fees.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
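&lt;p&gt;The billing check is vendor-specific, but you can measure the latency side of transaction overhead anywhere. This sketch (again using &lt;code&gt;sqlite3&lt;/code&gt; as a neutral stand-in, with fsync disabled so we time transaction bookkeeping rather than disk sync) compares per-row commits against one batched commit.&lt;/p&gt;

```python
import os
import sqlite3
import tempfile
import time

path = os.path.join(tempfile.mkdtemp(), "overhead_test.db")
conn = sqlite3.connect(path)
# Skip fsync so the measurement isolates per-transaction bookkeeping.
conn.execute("PRAGMA synchronous=OFF")
conn.execute("CREATE TABLE events (id INTEGER)")
conn.commit()

N = 2000

# Pattern 1: one commit per row (micro-transactions).
start = time.perf_counter()
for i in range(N):
    conn.execute("INSERT INTO events VALUES (?)", (i,))
    conn.commit()
per_row = time.perf_counter() - start

# Pattern 2: one commit for the whole batch.
start = time.perf_counter()
for i in range(N):
    conn.execute("INSERT INTO events VALUES (?)", (i,))
conn.commit()
batched = time.perf_counter() - start

print(f"{N} micro-commits: {per_row:.3f}s, one batch commit: {batched:.3f}s")
```

The per-row pattern is typically several times slower even locally. In a distributed warehouse, each of those commits is also a metadata round-trip that may be billable.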




&lt;h2&gt;
  
  
  Common ACID compliance mistakes in analytics platforms
&lt;/h2&gt;

&lt;p&gt;Even with a compliant platform, implementation details can break your data trust.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake 1: Assuming ACID means serializable isolation
&lt;/h3&gt;

&lt;p&gt;Many engineers assume "ACID" means "Serializable" (perfect isolation). It usually doesn't.&lt;/p&gt;

&lt;p&gt;If you're building a financial reconciliation process on a warehouse that defaults to &lt;code&gt;READ COMMITTED&lt;/code&gt;, you need to manually manage locking or logic to prevent anomalies. Don't assume the database handles complex race conditions for you.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake 2: Treating object storage (S3) like a transactional database
&lt;/h3&gt;

&lt;p&gt;Trying to implement ACID manually over raw object storage is a recipe for disaster. Developers sometimes think, "I'll just write a file to S3 and then read it."&lt;/p&gt;

&lt;p&gt;Without a table format (like Iceberg) or an engine (like DuckDB) to manage the atomic commit, you will eventually hit partial writes, race conditions, or readers observing half-finished multi-file updates. S3 is now strongly consistent, but it doesn't support multi-file transactions natively.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake 3: Using a warehouse for micro-transactions (and overpaying)
&lt;/h3&gt;

&lt;p&gt;Look, using a sledgehammer to hang a picture frame is expensive.&lt;/p&gt;

&lt;p&gt;We often see teams using massive cloud warehouses for high-frequency, low-volume updates (such as updating a "last login" timestamp for users). The overhead of the distributed transaction coordinator (latency + cost) outweighs the value of the data update. These workloads belong in an OLTP database or a lightweight engine like DuckDB that handles micro-transactions efficiently.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mistake 4: Skipping compaction and vacuum in lakehouses
&lt;/h3&gt;

&lt;p&gt;In Lakehouse architectures (Iceberg/Delta), "deleting" a row doesn't actually delete it. It writes a "tombstone" or a new version of the file. Over time, your table becomes a graveyard of obsolete files.&lt;/p&gt;

&lt;p&gt;If you don't automate &lt;code&gt;VACUUM&lt;/code&gt; and compaction, your read performance will degrade until queries time out. Managed engines like MotherDuck handle this hygiene automatically in the background.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion: Choosing the right ACID architecture for operational analytics
&lt;/h2&gt;

&lt;p&gt;ACID compliance is the bedrock of trust in modern analytics. When a dashboard number changes every time you refresh, or when a high-value customer receives a duplicate email due to a race condition, trust in your data team evaporates.&lt;/p&gt;

&lt;p&gt;The shift to operational analytics means you can't rely on the "eventual consistency" of the past. But you don't need to over-engineer your solution either.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;For the 1% of workloads&lt;/strong&gt; that are truly petabyte-scale, decoupled architectures like Snowflake or carefully managed Lakehouses are necessary, despite their latency and cost premiums.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For the 99% of workloads&lt;/strong&gt; that deal with "medium data" (Gigabytes to Terabytes), the future is &lt;strong&gt;Scale-Up ACID&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You don't need a massive distributed cluster to get banking-grade transactional integrity. You need an architecture that respects the physics of data. Keep compute close to storage and handle transactions in-process rather than over the network.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Hybrid Advantage:&lt;/strong&gt; If you want ACID guarantees that move at the speed of interactive analytics, without the administration of a Lakehouse or the latency of a distributed warehouse, evaluate &lt;a href="https://motherduck.com" rel="noopener noreferrer"&gt;MotherDuck&lt;/a&gt;. MotherDuck brings the power of DuckDB to the cloud, handling concurrency, consistency, and metadata automatically. It lets you build pipelines that are robust enough for operations but simple enough to run on your laptop.&lt;/p&gt;

&lt;p&gt;In 2026, the "Single Source of Truth" shouldn't be a lie. Make sure your platform can keep its promises.&lt;/p&gt;




&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What does ACID compliance mean in an analytics platform?
&lt;/h3&gt;

&lt;p&gt;A: ACID means transactions are &lt;strong&gt;atomic&lt;/strong&gt;, keep data &lt;strong&gt;consistent&lt;/strong&gt;, are &lt;strong&gt;isolated&lt;/strong&gt; from concurrent work, and are &lt;strong&gt;durable&lt;/strong&gt; after commit. In analytics platforms, ACID ensures that dashboards and downstream apps do not see partial writes or inconsistent snapshots during ingestion and transformations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is "ACID compliant" the same as "Serializable isolation"?
&lt;/h3&gt;

&lt;p&gt;A: No. ACID includes isolation, but platforms can implement &lt;strong&gt;different isolation levels&lt;/strong&gt;. Many systems are fully ACID while defaulting to &lt;strong&gt;READ COMMITTED&lt;/strong&gt; or &lt;strong&gt;SNAPSHOT&lt;/strong&gt; isolation rather than full &lt;strong&gt;SERIALIZABLE&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What isolation level do major cloud warehouses typically use by default?
&lt;/h3&gt;

&lt;p&gt;A: Many cloud warehouses default to &lt;strong&gt;READ COMMITTED&lt;/strong&gt; for standard workloads, prioritizing concurrency. If you need repeatable results across multiple statements, you must confirm that stronger isolation is supported and how it's configured.&lt;/p&gt;

&lt;h3&gt;
  
  
  How can I quickly test whether my warehouse allows dirty reads?
&lt;/h3&gt;

&lt;p&gt;A: Open two sessions: in Session A, insert a row inside a transaction &lt;strong&gt;without committing&lt;/strong&gt;. In Session B, query for that row. If Session B can see the row, the system allows dirty reads and fails the test.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do Iceberg/Delta Lake provide ACID on object storage?
&lt;/h3&gt;

&lt;p&gt;A: They commit changes by writing new data/metadata files and then atomically updating the table's metadata pointer/log. Concurrency is typically handled with &lt;strong&gt;optimistic concurrency control (OCC)&lt;/strong&gt;, where conflicting writers must retry.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the "small file problem," and why does it hurt ACID lakehouses?
&lt;/h3&gt;

&lt;p&gt;A: Frequent small writes create huge numbers of small data and metadata files. Planning a query can require scanning large metadata structures, increasing latency and sometimes causing coordinator memory failures unless you run compaction/vacuum regularly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where does ACID "actually happen" if my data is stored as Parquet files?
&lt;/h3&gt;

&lt;p&gt;A: In the &lt;strong&gt;transactional metadata layer&lt;/strong&gt; that decides which files are part of the current table version. The data files are often immutable. Correctness comes from atomically updating metadata and enforcing concurrency rules.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's the fastest way to validate durability and atomicity?
&lt;/h3&gt;

&lt;p&gt;A: Start a large insert, then kill the client/connection mid-write. After reconnecting, you should see &lt;strong&gt;all or nothing&lt;/strong&gt; from that transaction. Never a partial batch.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why can ACID features increase costs in decoupled warehouses?
&lt;/h3&gt;

&lt;p&gt;A: Distributed metadata coordination adds overhead per transaction (latency + metastore work). High-frequency microtransactions can trigger unexpected "control plane" or metadata-related charges, depending on the vendor's billing model.&lt;/p&gt;

&lt;h3&gt;
  
  
  When should I choose a scale-up/hybrid engine instead of a lakehouse or distributed warehouse?
&lt;/h3&gt;

&lt;p&gt;A: Choose scale-up/hybrid when you need &lt;strong&gt;interactive latency&lt;/strong&gt;, frequent small updates, strong consistency, and simpler operations for GB–TB scale workloads. Distributed warehouses and lakehouses work better when you truly need massive multi-cluster concurrency or petabyte-scale patterns.  &lt;/p&gt;

</description>
      <category>database</category>
      <category>duckdb</category>
      <category>analytics</category>
      <category>data</category>
    </item>
    <item>
      <title>The WebMCP False Economy: Why We Don't Need Another Layer of Abstraction</title>
      <dc:creator>Manveer Chawla</dc:creator>
      <pubDate>Tue, 17 Feb 2026 18:52:51 +0000</pubDate>
      <link>https://forem.com/manveer_chawla_64a7283d5a/the-webmcp-false-economy-why-we-dont-need-another-layer-of-abstraction-566e</link>
      <guid>https://forem.com/manveer_chawla_64a7283d5a/the-webmcp-false-economy-why-we-dont-need-another-layer-of-abstraction-566e</guid>
      <description>&lt;p&gt;I agents are going to consume the web at orders of magnitude beyond human traffic. Optimizing for them isn't optional. The question is how.&lt;/p&gt;

&lt;p&gt;WebMCP, a new JavaScript API proposed by engineers at Microsoft and Google, says the answer is a browser-side protocol: every web developer builds a "tool contract" that describes their site to agents through &lt;code&gt;navigator.modelContext&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That's the wrong layer. Sites willing to invest in agent optimization already have a better path: server-side MCP, where the agent talks directly to the server and the server owns the tools it exposes. No browser middleman. For the vast majority of sites that won't build any agent interface, the browser should do the work, synthesizing what it already knows from HTML, ARIA, Schema.org, and the Accessibility Tree into a richer machine-readable layer.&lt;/p&gt;

&lt;p&gt;WebMCP sits in the worst of both worlds. It demands developer effort like server-side MCP but routes through the browser unnecessarily. And it asks the long tail of the web to adopt a new protocol, which 20 years of metadata history says they won't.&lt;/p&gt;




&lt;h3&gt;
  
  
  TL;DR
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Optimizing the web for AI agents is the right call. The question is the right architecture for doing it.&lt;/li&gt;
&lt;li&gt;Sites willing to invest in agent optimization should expose server-side MCP directly. The server owns the tools. The agent talks to the source of truth. No browser middleman required.&lt;/li&gt;
&lt;li&gt;For the web that won't adopt new protocols (which is most of it), the browser should bridge the gap by synthesizing what it already knows: HTML, ARIA, Schema, the Accessibility Tree.&lt;/li&gt;
&lt;li&gt;WebMCP occupies the worst of both worlds: it demands developer effort like server-side MCP but routes through the browser, creating a second-class copy that drifts from the UI.&lt;/li&gt;
&lt;li&gt;History is clear. Developer-maintained metadata standards fail without direct incentives. Sites willing to invest should go server-side. Sites that won't are better served by browser improvements.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What is WebMCP?
&lt;/h2&gt;

&lt;p&gt;In August 2025, engineers from Microsoft and Google proposed WebMCP (Web Model Context Protocol), a JavaScript API that exposes a new browser interface, &lt;code&gt;navigator.modelContext&lt;/code&gt;, allowing websites to declare structured "tool contracts" for AI agents. It's currently available behind a flag in Chrome 146 Canary.&lt;/p&gt;

&lt;p&gt;The idea is straightforward. Instead of an AI agent visually parsing a webpage the way a human would, the site explicitly tells the agent what actions are available and how to execute them. That includes form submissions, API calls, navigation flows, and data queries. The agent consumes a structured menu rather than interpreting pixels and DOM elements.&lt;/p&gt;
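&lt;p&gt;To make "structured menu" concrete, here's what a tool contract might look like as plain data, with a validation step an agent could run before calling it. This is an illustrative sketch only; the field names are hypothetical and this is not the actual &lt;code&gt;navigator.modelContext&lt;/code&gt; surface:&lt;/p&gt;

```python
# Hypothetical tool contract; field names are illustrative, not the real WebMCP API.
flight_search_tool = {
    "name": "search_flights",
    "description": "Search available flights between two airports.",
    "parameters": {
        "origin":      {"type": "string",  "required": True},
        "destination": {"type": "string",  "required": True},
        "date":        {"type": "string",  "required": True},
        "passengers":  {"type": "integer", "required": False},
    },
}

def validate_call(tool, args):
    """An agent can check a call against the contract instead of guessing from pixels."""
    missing = [name for name, spec in tool["parameters"].items()
               if spec["required"] and name not in args]
    unknown = [name for name in args if name not in tool["parameters"]]
    return not missing and not unknown

assert validate_call(flight_search_tool,
                     {"origin": "SFO", "destination": "JFK", "date": "2026-03-01"})
assert not validate_call(flight_search_tool, {"origin": "SFO"})  # missing required params
```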

&lt;p&gt;Early pilots report significant performance gains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;67.6% reduction&lt;/strong&gt; in token usage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;25–37% improvement&lt;/strong&gt; in latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;97.9% task success rate&lt;/strong&gt;, specifically reducing cases where vision-agents "give up" or loop on incorrect elements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These numbers are real and impressive, but the context for &lt;em&gt;why&lt;/em&gt; WebMCP exists in this form reveals the core design error.&lt;/p&gt;

&lt;h3&gt;
  
  
  The origin: MCP worked server-side, so let's port it to the browser
&lt;/h3&gt;

&lt;p&gt;MCP, the Model Context Protocol, gained massive traction in 2025 as a way to give AI agents structured access to tools and data on the server side. Connect an agent to your database, your CRM, or your internal APIs through a standardized protocol.&lt;/p&gt;

&lt;p&gt;It works in that context because the server &lt;em&gt;owns&lt;/em&gt; the tools it exposes. A Postgres MCP server knows its own schema. A Stripe MCP server knows its own API. The tool contract and the tool are the same thing, maintained by the same team, in the same codebase.&lt;/p&gt;

&lt;p&gt;WebMCP takes that pattern and ports it to the browser, and this is where the logic breaks down.&lt;/p&gt;

&lt;p&gt;The browser is a fundamentally different environment. A website doesn't "own" its relationship with every possible AI agent the way a server owns its API. The server-side MCP contract is a first-class interface that &lt;em&gt;is&lt;/em&gt; the product. A WebMCP contract is a second-class annotation that &lt;em&gt;describes&lt;/em&gt; the product. One is the source of truth. The other is a copy that drifts.&lt;/p&gt;

&lt;p&gt;This raises a question that WebMCP's proponents haven't answered: if a site is willing to invest the engineering effort that a tool contract demands, why route that effort through the browser? Server-side MCP already exists. It already works. The agent talks directly to the server. The server owns the tools. The contract and the tool are the same thing. WebMCP takes that clean architecture and degrades it by pushing it into the browser, turning a first-class API into a second-class annotation that describes a UI rather than owning the functionality.&lt;/p&gt;

&lt;p&gt;The question isn't whether WebMCP &lt;em&gt;works&lt;/em&gt;. The early benchmarks show it does. The question is whether it points in the right direction when better options exist on both ends of the spectrum.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Three paths to an agent-readable web, and why WebMCP is the worst of them
&lt;/h2&gt;

&lt;p&gt;Three paths exist for making the web work for AI agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Path 1: Server-side MCP.&lt;/strong&gt; Sites that want AI agents to interact with them expose server-side MCP endpoints. The agent talks directly to the server. The server owns the tools it exposes. The tool contract and the tool are the same thing, maintained by the same team, in the same codebase. This is what MCP was designed for, and it works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Path 2: Browser-as-bridge.&lt;/strong&gt; The browser synthesizes what it already knows (HTML structure, ARIA semantics, Schema.org data, form labels, link relationships) into a richer machine-readable layer. Developers standardize to existing web standards. No new protocol required. Ship once in a browser update, apply everywhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Path 3: WebMCP.&lt;/strong&gt; Every website developer builds and maintains a browser-side tool contract that describes their site to AI agents. The browser is a passive pipe.&lt;/p&gt;

&lt;p&gt;WebMCP is Path 3, and it occupies the worst position of the three.&lt;/p&gt;

&lt;p&gt;Path 1 works for sites willing to invest because the server owns the interface. The agent gets direct access to the source of truth: the API, the database, the business logic. Path 2 works for the rest because the browser does the work. The entire history of the web favors this pattern. CSS didn't ask every site to declare a rendering contract. Search engines didn't ask every site to build a search index. Crawlers learned to read pages. Make the reader smarter, don't tax the author.&lt;/p&gt;

&lt;p&gt;Path 3 demands the same developer effort as Path 1 but delivers a degraded version of it. A WebMCP tool contract is a copy of functionality that already lives on the server. It routes through the browser for no clear architectural reason. And unlike server-side MCP, the contract isn't the source of truth. It's an annotation that drifts the moment the UI changes.&lt;/p&gt;

&lt;p&gt;The question any engineering leader should ask: if I'm going to invest in making my site agent-readable, why would I build that interface in the browser instead of on the server where I control the tools, the data, and the API? And if I'm not going to invest at all, how does a new protocol that requires my investment help me?&lt;/p&gt;

&lt;p&gt;The strongest counterargument is that WebMCP captures &lt;em&gt;intent&lt;/em&gt;, not just structure. The AX Tree tells an agent "here is a button labeled Submit." A WebMCP tool contract tells the agent "this button submits a flight booking after the user selects dates and passengers, and here are the valid parameter ranges." That distinction is real, and for complex, multi-step workflows it matters. But intent is exactly what server-side MCP provides natively, without the browser middleman, without the drift problem, and with full access to the backend logic that defines that intent. For simpler interactions, properly labeled structure already communicates intent. A form with inputs labeled "Email" and "Password" and a submit button doesn't need a separate declaration to tell an agent it's a login flow. A product page with a price, an "Add to Cart" button, and a quantity selector is self-describing if the HTML is semantic.&lt;/p&gt;
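&lt;p&gt;The self-describing claim is easy to demonstrate. A toy classifier over accessibility-tree-style nodes (a simplified stand-in for the real AX Tree the browser derives from semantic HTML) identifies the login flow from roles and accessible names alone:&lt;/p&gt;

```python
# Simplified accessibility-tree nodes, as the browser derives them from a
# semantic login form. Roles and names here are a stand-in for the real AX Tree.
form = [
    {"role": "textbox", "name": "Email"},
    {"role": "textbox", "name": "Password"},
    {"role": "button",  "name": "Sign in"},
]

def looks_like_login(nodes):
    """Infer intent from structure alone: labeled fields plus a submit control."""
    field_names = {n["name"].lower() for n in nodes if n["role"] == "textbox"}
    has_submit = any(n["role"] == "button" for n in nodes)
    return {"email", "password"}.issubset(field_names) and has_submit

assert looks_like_login(form)  # no separate declaration needed
```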

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkdlgmxw7qdg2n7t3j9bb.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkdlgmxw7qdg2n7t3j9bb.jpg" alt="A comparison diagram titled " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  2. The hidden maintenance cost of WebMCP tool contracts
&lt;/h2&gt;

&lt;p&gt;Even if the browser-side approach were the right architecture, the maintenance economics don't work.&lt;/p&gt;

&lt;p&gt;The web is already designed to be machine-readable through the DOM, Semantic HTML, ARIA attributes, and Schema.org. WebMCP asks developers to maintain two parallel interfaces: one visual (the UI) and one declarative (the tool contract).&lt;/p&gt;

&lt;p&gt;When a UI ships a new flow and the tool contract isn't updated, the agent breaks. You don't eliminate fragility, you double it. No build step catches the drift. No CI check flags the mismatch.&lt;/p&gt;
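&lt;p&gt;Nothing stops a team from hand-writing such a check. Here's a minimal sketch of what one could look like, diffing a contract's declared parameters against the fields the UI actually renders (all names hypothetical):&lt;/p&gt;

```python
def contract_drift(tool_params, ui_fields):
    """Report fields declared in the contract but gone from the UI, and vice versa."""
    declared, rendered = set(tool_params), set(ui_fields)
    return {
        "stale_in_contract": sorted(declared - rendered),
        "missing_from_contract": sorted(rendered - declared),
    }

# The contract still describes last quarter's checkout form...
tool_params = ["email", "card_number", "coupon_code"]
# ...but the UI has since dropped coupons and added a gift-card field.
ui_fields = ["email", "card_number", "gift_card"]

drift = contract_drift(tool_params, ui_fields)
assert drift == {
    "stale_in_contract": ["coupon_code"],
    "missing_from_contract": ["gift_card"],
}
```

&lt;p&gt;The point isn't that drift detection is impossible; it's that no standard tooling ships it, so every site would have to build and maintain it on top of the contract itself.&lt;/p&gt;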

&lt;p&gt;Stripe manages over 100 breaking API upgrades using a custom domain-specific language (DSL) to auto-generate documentation directly from code. If a company that literally sells API infrastructure requires heavy automation to prevent metadata rot, the average startup has no realistic chance of keeping WebMCP tool definitions accurate.&lt;/p&gt;

&lt;p&gt;Proponents will argue that auto-generation solves this. For sites built on modern frameworks like React, Next.js, or Angular, that's a fair point. A build plugin could derive tool contracts from component trees and route definitions. But the long tail of the web doesn't run on these frameworks. Millions of sites are built on WordPress themes, hand-written HTML, Squarespace templates, or legacy CMSes that no auto-generation tool will ever reach. The sites that most need agent-readability are the ones least equipped to produce it through tooling.&lt;/p&gt;

&lt;p&gt;ARIA's track record is the warning sign here. Annual surveys from WebAIM found that &lt;a href="https://webaim.org/projects/million/" rel="noopener noreferrer"&gt;pages using ARIA attributes actually average 57 accessibility errors&lt;/a&gt;, compared to 27 errors on pages without ARIA. That's not because ARIA causes errors. It's because even well-intentioned metadata efforts produce poor results at web scale when developers lack the tooling, training, and incentives to maintain them correctly. ARIA failed as a quality signal despite two decades of advocacy, documentation, and browser support. WebMCP would enter the same environment with the same structural disadvantages and fewer resources behind it.&lt;/p&gt;

&lt;p&gt;Metadata decays the moment no one actively monitors it. A &lt;a href="https://therecord.media/thousands-of-npm-accounts-use-email-addresses-with-expired-domains/" rel="noopener noreferrer"&gt;study of the npm ecosystem found 2,818 maintainer email addresses linked to expired domains&lt;/a&gt;. Unlike a broken email, a stale WebMCP contract fails silently. An agent executes an outdated action and neither the user nor the developer knows until something breaks downstream.&lt;/p&gt;

&lt;p&gt;Research shows that a single breaking change in an API affects an average of 4.7 downstream consumers, yet WebMCP tool contracts would sit in a dependency chain with even less visibility.&lt;/p&gt;

&lt;p&gt;There's a security dimension to this maintenance problem that's easy to overlook. A WebMCP tool contract is effectively API documentation served to untrusted clients. It tells every visiting agent what actions are available, what parameters they accept, and what state transitions are valid. That's a map of your application's attack surface. A stale contract could expose deprecated endpoints that should have been decommissioned. A compromised contract could redirect agents to perform unintended actions on behalf of users. The AX Tree avoids this because it's generated by the browser from the live DOM, not authored as a separate artifact that can be tampered with or fall out of sync.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. The incentive to optimize for agents is real. WebMCP is the wrong form.
&lt;/h2&gt;

&lt;p&gt;If AI agents will consume the web at 100x human traffic, optimizing for them is the right investment. That case is unambiguous. The question this article's own logic demands is: what form should that optimization take?&lt;/p&gt;

&lt;p&gt;The history of web metadata adoption is instructive, not as evidence that developers won't optimize, but as evidence of &lt;em&gt;how&lt;/em&gt; they optimize when they do.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Standard&lt;/th&gt;
&lt;th&gt;Current Adoption (2026)&lt;/th&gt;
&lt;th&gt;What Actually Drove It&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Microformats (2005)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~0.5%&lt;/td&gt;
&lt;td&gt;No incentive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RDFa (2008)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~39%&lt;/td&gt;
&lt;td&gt;Open Graph Protocol (Social Cards)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Microdata (2011)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~23%&lt;/td&gt;
&lt;td&gt;Google SEO&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;JSON-LD (2011)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~53%&lt;/td&gt;
&lt;td&gt;Google Rich Snippets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Open Graph (2010)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~70%&lt;/td&gt;
&lt;td&gt;Social Media Cards&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;JSON-LD and Open Graph won because developers got an immediate, visible reward: rich snippets in search and rich cards on social. Microformats were technically sound and universally ignored.&lt;/p&gt;

&lt;p&gt;But even the winners show a pattern: developers implement the minimum viable version. Analysis of Schema.org usage shows that 61.99% of websites using product schema only populate the &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;description&lt;/code&gt; fields, the exact two fields Google rewards with rich snippets. Developers ignore the remaining 26 properties. Classic Goodhart's Law: when a measure becomes a target, it ceases to be a good measure.&lt;/p&gt;

&lt;p&gt;So what's the minimum viable agent optimization? For sites willing to invest meaningfully, server-side MCP is the natural path. It builds on infrastructure they already maintain (APIs, databases, backend logic) and gives agents direct access to the source of truth. For sites that will only do the minimum, better HTML, proper ARIA, and Schema.org markup are the investments that also pay dividends in SEO and accessibility. WebMCP asks for meaningful effort but delivers a degraded version of what server-side MCP already provides. It sits in the gap between "willing to invest" and "won't invest," and history says that gap is empty.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. How the Accessibility Tree already serves AI agents
&lt;/h2&gt;

&lt;p&gt;WebMCP rests on an assumption that AI agents require a fundamentally different interface than humans. That assumption is mostly wrong.&lt;/p&gt;

&lt;p&gt;The browser already generates a machine-readable model of every page through the Accessibility Tree (AX Tree). This tree provides roles, names, states, and interaction patterns. Agents already use it through tools like Playwright and Puppeteer, which expose AX Tree snapshots for automation.&lt;/p&gt;

&lt;p&gt;There's also a trajectory question worth acknowledging. Multimodal models are getting better at understanding web pages visually with every generation. GPT-4o, Claude, and Gemini can already navigate many sites through screenshots alone. If that trajectory continues, the need for any structured interface, whether WebMCP or the AX Tree, diminishes over time. But structured interfaces still matter for reliability (vision-based agents hallucinate element locations), determinism (the same AX Tree input produces the same agent behavior), and cost efficiency (parsing a structured tree is orders of magnitude cheaper than processing screenshots). The difference is that the AX Tree is already there. It costs nothing to maintain because the browser generates it automatically. WebMCP requires active, ongoing investment in something that improving models may eventually render unnecessary. If you're going to bet on a structured layer, bet on the one that's free.&lt;/p&gt;

&lt;p&gt;There is a real gap. Agents need to &lt;em&gt;act&lt;/em&gt; across multi-step flows (checkout, configuration, data entry) in ways that go beyond what a screen reader typically handles. But that gap is a browser API problem, not a developer metadata problem. The solution is making the AX Tree richer and more actionable, not building a parallel system alongside it.&lt;/p&gt;

&lt;p&gt;While 80.5% of web pages already use ARIA landmarks for structure, &lt;a href="https://webaim.org/projects/million/" rel="noopener noreferrer"&gt;94.8% fail basic WCAG compliance&lt;/a&gt;. The first machine-readable layer is broken. Adding a second one on top doesn't fix the first, and it risks giving organizations an excuse to deprioritize it.&lt;/p&gt;

&lt;p&gt;Consider a company with budget for one accessibility initiative this quarter. They can fix their broken HTML and ARIA, which helps disabled users, mobile users, keyboard navigators, search engines, &lt;em&gt;and&lt;/em&gt; agents. Or they can build a WebMCP contract that only helps AI agents. Not every organization will make the wrong choice here, but when budgets are tight and AI is the shiny priority, the risk of crowding out accessibility work is real.&lt;/p&gt;

&lt;p&gt;Investment in accessibility benefits everyone simultaneously. WebMCP creates a second surface competing for the same engineering hours, and that surface will rot faster than the first because it lacks the legal and compliance pressure that at least partially drives accessibility work.&lt;/p&gt;

&lt;p&gt;Proponents point to real gaps in the AX Tree: Shadow DOM encapsulation, Canvas structure, and virtualized lists. These are legitimate, but they're platform-level issues with platform-level fixes already in progress. None of them require a new developer-maintained metadata layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Browser APIs that solve WebMCP's problem without developer overhead
&lt;/h2&gt;

&lt;p&gt;The browser is the bottleneck for AI agent interaction, not the website.&lt;/p&gt;

&lt;p&gt;The obvious question: if the browser can solve this, why hasn't it? The honest answer is that until 2024, there was no demand for browser-level agent interfaces because AI agents weren't capable enough to use them. GPT-4V shipped in late 2023. Claude's computer use arrived in 2024. The first wave of production browser agents hit the market in 2025. Browser vendors are responding to a problem that barely existed two years ago, and platform-level standards move on multi-year timelines by design. That's not a reason to route around them with a developer-maintained shortcut. It's a reason to invest in the right layer now so the fix is durable.&lt;/p&gt;

&lt;p&gt;Rather than asking every site on the internet to maintain a tool contract, the industry should make the browser better at reading what's already there. Several technologies already address the gaps WebMCP claims to solve, and they follow the browser-as-bridge path: ship once, apply everywhere. Some are shipping today. Others are in progress. None are vaporware.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chrome DevTools Protocol (CDP) Accessibility Domain.&lt;/strong&gt; Already exposes the full AX Tree programmatically. CDP is production-ready and widely used by automation frameworks like Playwright and Puppeteer. Enriching this layer benefits every site without any developer action.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WebDriver BiDi.&lt;/strong&gt; A W3C standard for cross-browser automation that introduces standardized accessibility locators. As of early 2026, WebDriver BiDi is shipping in Firefox, Chrome, and Edge, with Safari support in active development. Agents can find elements by ARIA role and name, building on existing semantics rather than inventing new ones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accessibility Object Model (AOM).&lt;/strong&gt; A WICG proposal that gives JavaScript direct access to modify the AX Tree. AOM has been in development since 2017, and parts of the spec (like &lt;code&gt;ElementInternals&lt;/code&gt; for custom elements) have already shipped. The core reflection API remains at the proposal stage. This is the weakest link in the alternative stack, and it's fair to note that AOM's full vision hasn't materialized in nearly a decade. But the pieces that have shipped are already solving real problems, and the trajectory is toward completion rather than abandonment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ElementInternals.&lt;/strong&gt; Supported in Chrome, Edge, Firefox, and Safari as of 2024. It lets custom elements (Web Components) participate in the AX Tree natively, solving the Shadow DOM encapsulation problem without any new protocol. This is not a proposal. It's in production browsers today.&lt;/p&gt;

&lt;p&gt;These tools improve the browser's ability to read what already exists. The timeline gap is real, and WebMCP's proponents are right that the browser layer isn't complete today. But the correct response to an incomplete platform is to accelerate the platform, not to build a parallel system that creates permanent maintenance obligations for every site on the internet. WebMCP creates a parallel artifact that's prone to drift. AOM and WebDriver BiDi make the source itself legible.&lt;/p&gt;

&lt;p&gt;Developers should invest their effort in standardizing to the existing web platform: proper semantic HTML, accurate ARIA attributes, and Schema.org markup. These pay dividends across accessibility and SEO today, and position sites to benefit from agent-readability improvements as browser APIs mature. Two outcomes now, a third compounding over time as AOM, WebDriver BiDi, and richer AX Tree APIs ship.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. How WebMCP risks fragmenting the open web
&lt;/h2&gt;

&lt;p&gt;Even if WebMCP were technically perfect, it creates structural problems for the web ecosystem.&lt;/p&gt;

&lt;p&gt;Google pushed AMP by giving it preferential placement in search carousels, effectively coercing adoption. Publishers eventually abandoned it, reporting significant revenue improvements after exiting. The parallel goes only so far: AMP was a replacement architecture that required rebuilding pages in a restricted HTML subset, while WebMCP is additive. You keep your existing site and layer a tool contract on top. But that "additive" framing is misleading. AMP's cost was front-loaded and visible. You knew what you were paying because you were rebuilding pages. WebMCP's cost is ongoing and invisible. The tool contract must stay in sync with every UI change indefinitely, and the failure mode is silent drift rather than an obvious breakage. Additive layers that go stale don't just stop helping. They become liabilities that misdirect agents and erode trust in the system.&lt;/p&gt;

&lt;p&gt;WebMCP is backed by Google and Microsoft but lacks formal support from Mozilla or Apple. If Safari and Firefox don't implement this API, agents will only work reliably in Chromium-based browsers. That's a Chromium feature, not an open web standard.&lt;/p&gt;

&lt;p&gt;There's also a concentration problem. WebMCP creates a two-tier system: sites that are "agent-accessible" and those that aren't. Large incumbents like Salesforce and Amazon can afford to maintain these contracts. The long tail of the web can't. Small businesses and independent publishers don't have the engineering resources.&lt;/p&gt;

&lt;p&gt;This concentration of AI-driven traffic among incumbents undermines the web's greatest strength: a solo developer and a trillion-dollar company play by the same HTML rules. WebMCP breaks that contract.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. The hardest cases for browser-as-bridge, and why server-side MCP still wins
&lt;/h2&gt;

&lt;p&gt;The WebMCP pilots do show real results. A 67.6% reduction in token usage directly translates to lower operational costs for agents. The 97.9% task success rate is compelling, especially in reducing those painful loops where vision-agents get stuck on incorrect elements. These numbers deserve serious engagement, not dismissal.&lt;/p&gt;

&lt;p&gt;The scenarios where a declarative tool contract genuinely outperforms the AX Tree are specific and worth examining:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-step form wizards with conditional logic.&lt;/strong&gt; Think insurance claim filing: the fields on step 3 depend on what was selected in step 1, validation rules change based on claim type, and the agent needs to know that choosing "auto collision" unlocks a vehicle details panel while "property damage" unlocks a different set of fields entirely. The AX Tree sees each step as a flat collection of form controls. It doesn't encode the conditional relationships between them or the valid paths through the wizard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dashboard configurations with interdependent controls.&lt;/strong&gt; A Salesforce report builder where changing the date range filter alters which metric columns are available, or a BI tool where selecting a data source reconfigures the entire visualization panel. These interfaces have cascading dependencies that aren't visible in the DOM at any single point in time. An agent reading the AX Tree sees the current state. It can't see the state machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Complex data entry with cross-field validation.&lt;/strong&gt; ERP inventory management where a SKU entry triggers warehouse availability checks, quantity must fall within supplier-specific thresholds, and the "Submit" action is only valid when twelve interdependent fields pass validation. The AX Tree can surface that the submit button is disabled, but it can't explain &lt;em&gt;why&lt;/em&gt; or what the agent needs to fix.&lt;/p&gt;

&lt;p&gt;These are the hardest cases for the browser-as-bridge path, and they're real. A declarative contract genuinely reduces the agent's guesswork in each one. But every one of these scenarios is better served by server-side MCP than by WebMCP. A Salesforce admin panel already has APIs. An ERP system already has backend logic that defines valid state transitions. An insurance claim workflow already has server-side validation rules. The agent doesn't need to read a browser-side annotation of these systems. It can talk to the systems directly.&lt;/p&gt;

&lt;p&gt;Server-side MCP gives the agent the source of truth: the actual business logic, the actual validation rules, the actual state machine. WebMCP gives the agent a copy of those things, authored separately, maintained separately, and prone to drifting from the reality it describes. The investment in agent optimization makes sense for these enterprise tools. But that investment should go into server-side MCP where the contract and the tool are the same thing, not into a browser-side annotation that duplicates what the server already knows.&lt;/p&gt;

&lt;p&gt;The benchmarks reinforce this. The 67.6% token reduction is measured against raw scraping: agents parsing full DOM dumps or processing screenshots pixel by pixel. That's the worst-case baseline. An AX Tree snapshot from Playwright or Puppeteer already strips away the visual noise and gives the agent a compact, structured tree of roles, names, states, and interaction patterns. That's orders of magnitude smaller than a screenshot and significantly smaller than a raw DOM dump. The token savings from moving to structured data are real, but browser-as-bridge already delivers most of them without any developer effort. Server-side MCP would be the most token-efficient of all, since the agent gets direct API responses with only the data it needs and zero browser overhead. The fair comparisons, "WebMCP vs. well-implemented AX Tree" and "WebMCP vs. server-side MCP," haven't been published. Until they are, the 67.6% figure overstates the marginal benefit over both alternatives.&lt;/p&gt;

&lt;p&gt;WebMCP's own specification lists autonomous headless scenarios as a "non-goal," focusing instead on human-in-the-loop workflows. The spec describes a narrow tool for high-complexity enterprise UIs. The question is whether a narrow tool should ship as a browser-level API that the entire web is expected to implement, especially when the narrow use cases it targets are better served by a protocol that already exists on the server side.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Why robots.txt and Open Graph succeeded where WebMCP won't
&lt;/h2&gt;

&lt;p&gt;Successful opt-in standards share simplicity, an immediate visible reward, and a negligible maintenance burden.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;robots.txt.&lt;/strong&gt; A plain text file that solves the developer's own problem (server overload from crawlers) with zero ongoing maintenance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sitemaps.&lt;/strong&gt; A direct channel to search engines that results in better indexing and more traffic, with the reward visible in Google Search Console within days.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Open Graph Protocol.&lt;/strong&gt; An instant visual reward where a developer pastes their link into Slack or Twitter and immediately sees the rich card.&lt;/p&gt;

&lt;p&gt;WebMCP fails on all three counts. It's not simple because tool contracts require ongoing curation as UIs evolve. It offers no visible reward for the developer, since there's no "Rich Snippet for agents." And it carries a heavy maintenance burden where the contract must stay in sync with the UI or become a liability.&lt;/p&gt;

&lt;p&gt;Without that incentive loop, adoption will be a fraction of what proponents project. We have 20 years of data on this.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion: Two worlds, two solutions, neither of which is WebMCP
&lt;/h2&gt;

&lt;p&gt;The web that AI agents need to navigate is splitting into two worlds, and each has a clear path forward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The first world is sites willing to invest in agent optimization.&lt;/strong&gt; SaaS platforms, enterprise tools, API-first businesses. These sites should expose server-side MCP directly. The agent talks to the server. The server owns the tools. The contract is the source of truth. This is the architecture MCP was built for, and it works without a browser in the loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The second world is everything else.&lt;/strong&gt; The long tail of the web: blogs, small businesses, news sites, personal pages, legacy applications. These sites won't build any agent interface, and history says no amount of advocacy will change that. For this world, the browser should bridge the gap by getting smarter about what it already knows. AOM, WebDriver BiDi, ElementInternals, and a richer AX Tree are the path. Marginal improvements in how browsers expose semantic structure compound across every site simultaneously. A 10% improvement in AX Tree fidelity benefits the entire web overnight. A 10% increase in WebMCP adoption covers a few thousand more sites and leaves the rest untouched.&lt;/p&gt;

&lt;p&gt;WebMCP sits between these two worlds and serves neither well. It demands the investment of the first world but delivers a degraded copy of what server-side MCP provides. It claims to serve the second world but requires exactly the kind of adoption that the second world has never delivered for any metadata standard in 20 years.&lt;/p&gt;

&lt;p&gt;Every engineering leader should be asking two questions. First: have we gotten our existing HTML, ARIA, and Schema right? For most organizations, the answer is no, and fixing that yields immediate returns in accessibility, SEO, and agent-readability as browser APIs mature. Second: if we're ready to invest beyond that, should we build our agent interface on the server where we own the tools, or in the browser where it becomes a copy? The answer writes itself.&lt;/p&gt;




&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why not just use WebMCP as an interim solution while browser APIs catch up?
&lt;/h3&gt;

&lt;p&gt;Because interim solutions that require per-site investment become permanent obligations. Every tool contract built today must be maintained indefinitely or it becomes a liability that misdirects agents. Server-side MCP is the better interim investment for sites willing to build: it works today, it's the source of truth, and it doesn't depend on browser vendors shipping a new API.&lt;/p&gt;

&lt;h3&gt;
  
  
  If AI agents will dominate web traffic, shouldn't sites optimize for them?
&lt;/h3&gt;

&lt;p&gt;Absolutely. The argument isn't against optimizing. It's about the right form. Sites ready for meaningful investment should expose server-side MCP. Sites doing the minimum should write better HTML, ARIA, and Schema.org, which improves SEO and accessibility at the same time. WebMCP demands meaningful effort but delivers less than server-side MCP.&lt;/p&gt;

&lt;h3&gt;
  
  
  What about enterprise tools like Salesforce or internal dashboards?
&lt;/h3&gt;

&lt;p&gt;These are the strongest use cases for declarative agent contracts, but they're also the cases where server-side MCP works best. A Salesforce admin panel already has APIs and backend logic. The agent should talk directly to those systems rather than reading a browser-side annotation of them.&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>webmcp</category>
      <category>ai</category>
      <category>agents</category>
    </item>
    <item>
      <title>How to sandbox AI agents in 2026: Firecracker, gVisor, runtimes &amp; isolation strategies</title>
      <dc:creator>Manveer Chawla</dc:creator>
      <pubDate>Tue, 17 Feb 2026 18:38:05 +0000</pubDate>
      <link>https://forem.com/manveer_chawla_64a7283d5a/how-to-sandbox-ai-agents-in-2026-firecracker-gvisor-runtimes-isolation-strategies-14pk</link>
      <guid>https://forem.com/manveer_chawla_64a7283d5a/how-to-sandbox-ai-agents-in-2026-firecracker-gvisor-runtimes-isolation-strategies-14pk</guid>
      <description>&lt;h2&gt;
  
  
  Executive summary: AI agent sandboxing in 2026
&lt;/h2&gt;

&lt;p&gt;As of February 2026, the consensus is clear: shared-kernel container isolation (Docker/runc) isn't cutting it anymore for executing untrusted AI agent code. You need to treat LLM-generated or user-supplied code as hostile. A shared kernel just expands the blast radius.&lt;/p&gt;

&lt;p&gt;The market has split into three layers, plus a Kubernetes-native hybrid that straddles them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Primitives (&lt;a href="http://firecracker-microvm.github.io" rel="noopener noreferrer"&gt;Firecracker&lt;/a&gt;/&lt;a href="https://gvisor.dev/" rel="noopener noreferrer"&gt;gVisor&lt;/a&gt;/&lt;a href="https://github.com/microsoft/litebox" rel="noopener noreferrer"&gt;LiteBox&lt;/a&gt;):&lt;/strong&gt; Best for teams willing to run their own fleet and scheduler for maximum control.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embeddable runtimes (&lt;a href="https://e2b.dev/" rel="noopener noreferrer"&gt;E2B&lt;/a&gt;, microsandbox):&lt;/strong&gt; Best for quickly adding code execution — managed API (E2B) or self-hosted (microsandbox).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Managed platforms (&lt;a href="https://www.daytona.io/" rel="noopener noreferrer"&gt;Daytona&lt;/a&gt;, &lt;a href="https://modal.com/products/sandboxes" rel="noopener noreferrer"&gt;Modal&lt;/a&gt;, &lt;a href="https://northflank.com/product/sandboxes" rel="noopener noreferrer"&gt;Northflank&lt;/a&gt;):&lt;/strong&gt; Best for data-heavy workloads, GPU access, or zero-ops scaling — but each with different isolation, pricing, and lock-in tradeoffs.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid (&lt;a href="https://docs.cloud.google.com/kubernetes-engine/docs/how-to/agent-sandbox" rel="noopener noreferrer"&gt;Google Agent Sandbox&lt;/a&gt;):&lt;/strong&gt; Best for teams already on Kubernetes who want open-source sandboxing with warm pools and no new vendor dependency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pick Layer 1 when you need maximum control and customization for compliance. Pick Layer 2 when you want the fastest path to ephemeral code execution with strong isolation. Pick Layer 3 when you need GPUs, data-local execution, or zero-ops scaling — but evaluate vendor lock-in, language constraints, and BYOC support carefully.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why AI agent sandboxing changed in 2026
&lt;/h2&gt;

&lt;p&gt;Look, engineering leaders in 2026 have too many sandboxing choices, and most of those choices didn't exist three years ago.&lt;/p&gt;

&lt;p&gt;We've moved way past the "Containers vs. VMs" debate. Now you're staring at Firecracker MicroVMs, gVisor user-space kernels, Cloud Hypervisor, WebAssembly isolates, and emerging Library OS tech like Microsoft's LiteBox. It's kind of overwhelming.&lt;/p&gt;

&lt;p&gt;But this isn't just vendors making noise. This proliferation is the industry's response to a real problem: standard multi-tenant containers can't safely contain AI agents executing arbitrary code.&lt;/p&gt;

&lt;p&gt;Think about it. When an agent can write its own Python scripts, install packages, and manipulate file descriptors, the shared kernel surface area of a standard Docker container becomes a liability. Major cloud providers, including AWS, Azure, and GCP, have all quietly migrated their control planes away from &lt;a href="https://www.sentinelone.com/vulnerability-database/cve-2024-21626/" rel="noopener noreferrer"&gt;runc&lt;/a&gt; toward &lt;a href="https://docs.aws.amazon.com/pdfs/whitepapers/latest/security-overview-aws-lambda/security-overview-aws-lambda.pdf" rel="noopener noreferrer"&gt;hardware-enforced isolation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This guide maps the 2026 sandbox ecosystem structurally. We're not comparing tools in isolation. Instead, we're defining the architectural layers. If you're a Series A+ engineering leader who's outgrown "Docker on EC2" and needs a security posture that survives a red team audit without blowing your engineering budget, keep reading.&lt;/p&gt;




&lt;h2&gt;
  
  
  The isolation spectrum: five levels of sandbox security
&lt;/h2&gt;

&lt;p&gt;Before choosing a tool, understand the five isolation levels available in 2026. Each step up trades performance overhead for a stronger security boundary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 1: Containers (Docker, Podman)&lt;/strong&gt; Processes share the host kernel, separated by Linux namespaces and cgroups. Fast and lightweight, but a kernel vulnerability in one container can compromise all others. Sufficient for trusted, internally-written code. Insufficient for anything an LLM generates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 2: User-space kernels (gVisor)&lt;/strong&gt; A user-space application intercepts and re-implements syscalls, so the sandboxed program never talks to the real kernel. Stronger than containers, less overhead than a full VM. Used by Google (Agent Sandbox on GKE) and Modal. Tradeoff: not all syscalls are perfectly emulated, which can cause compatibility issues with some Linux software.&lt;/p&gt;
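&lt;p&gt;Adopting gVisor under Docker is a configuration change, not a rewrite. Per the gVisor docs, registering the &lt;code&gt;runsc&lt;/code&gt; runtime in &lt;code&gt;/etc/docker/daemon.json&lt;/code&gt; (the path below assumes a standard install location) lets &lt;code&gt;docker run --runtime=runsc&lt;/code&gt; route that container's syscalls through the user-space kernel:&lt;/p&gt;

```json
{
  "runtimes": {
    "runsc": {
      "path": "/usr/local/bin/runsc"
    }
  }
}
```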

&lt;p&gt;&lt;strong&gt;Level 3: Micro-VMs (Firecracker, Kata Containers, libkrun)&lt;/strong&gt; Each workload gets its own kernel running on hardware virtualization (KVM). A kernel exploit inside one VM cannot reach the host or other VMs. This is the current gold standard for untrusted code. Firecracker boots in ~125ms with ~5MB memory overhead. Powers AWS Lambda, E2B, and Vercel Sandbox.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 4: Library OS (Microsoft LiteBox)&lt;/strong&gt; Instead of filtering hundreds of syscalls, the application links directly against a minimal OS library that exposes only a handful of controlled primitives. Theoretically the thinnest isolation layer with the smallest attack surface. Experimental as of February 2026 — no SDK, no production usage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 5: Confidential computing (AMD SEV-SNP, Intel TDX, OP-TEE)&lt;/strong&gt; Hardware-encrypted memory isolation. Even the host OS and hypervisor cannot read the sandbox's data. LiteBox is currently the only open-source tool in this comparison with a confidential computing runner (SEV-SNP). Relevant for regulated industries handling PII, financial data, or healthcare records.&lt;/p&gt;

&lt;p&gt;The signal from the hyperscalers is unambiguous. AWS built Firecracker for Lambda. Google built gVisor for Search and Gmail. Azure uses Hyper-V for ephemeral agent sandboxes. Every one of them reached for their strongest isolation primitive and pointed it at AI. None of them reached for containers.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to choose an AI agent sandboxing approach: four questions
&lt;/h2&gt;

&lt;p&gt;Before you even look at Firecracker or Modal, you need to understand where your workload fits. The "right" tool depends entirely on your constraints around trust, latency, data, and compliance.&lt;/p&gt;

&lt;h3&gt;
  
  
  How untrusted is the agent code you run?
&lt;/h3&gt;

&lt;p&gt;Security in 2026 isn't binary. It's a spectrum.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Internal logic:&lt;/strong&gt; Running code your own engineers wrote that passed CI/CD? Standard containers (Layer 1 or 3) are probably fine.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM-generated code:&lt;/strong&gt; Your agents generate Python to solve math problems or format data? The risk goes up significantly. You need strong isolation, either gVisor or MicroVMs, to prevent accidental resource exhaustion or logic bombs.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User-uploaded binaries/malicious agents:&lt;/strong&gt; Allowing users or autonomous agents to execute arbitrary binaries or install unvetted PyPI packages? Assume the code is hostile. You need the strictest isolation available: hardware virtualization via MicroVMs (Firecracker) or air-gapped primitives.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The higher the risk, the lower in the stack you may need to build to control the blast radius.&lt;/p&gt;
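&lt;p&gt;The spectrum above can be captured as a crude lookup. The category names and tier labels below are ours, invented for illustration, so treat this as a checklist rather than a policy engine:&lt;/p&gt;

```python
# Toy encoding of the trust spectrum. Keys and values are illustrative
# labels, not an industry taxonomy.
ISOLATION_FLOOR = {
    "internal_ci_reviewed": "container",          # Level 1: namespaces/cgroups
    "llm_generated":        "gvisor_or_microvm",  # Level 2-3: syscall filter or own kernel
    "user_supplied_binary": "microvm",            # Level 3+: hardware virtualization
}

def minimum_isolation(code_origin: str) -> str:
    # When the origin is unknown, default to the strictest tier.
    return ISOLATION_FLOOR.get(code_origin, "microvm")
```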

&lt;h3&gt;
  
  
  How long do agent sessions need to run?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;One-shot (inference/scripts):&lt;/strong&gt; Quick script to generate a chart or run inference? Cold start time is your primary metric. You need sub-second snapshot restoration.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-running (agents):&lt;/strong&gt; Agents maintaining state, "thinking" for minutes, or waiting for user input? Billing models become critical. Runtimes charging premium "per second" rates get expensive fast for sessions that idle. Managed platforms often provide better economics for duration. Building your own warm pools on primitives requires complex autoscaling logic to avoid paying for waste.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Do you have a data gravity problem?
&lt;/h3&gt;

&lt;p&gt;Teams overlook this one all the time.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Small data payloads:&lt;/strong&gt; Sending a few kilobytes of JSON and receiving text? Embeddable Runtimes (Layer 2) work great.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large contexts/model weights:&lt;/strong&gt; Loading 20GB model weights or processing a 5GB CSV? You've got a data gravity problem. Moving gigabytes of data into a remote sandbox API for every request creates massive latency penalties and egress cost nightmares. You need a Platform (Layer 3) where compute moves to the data, or a custom Layer 1 solution co-located with your storage.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What compliance and security requirements do you have?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The enterprise question:&lt;/strong&gt; Selling to the Fortune 500? Need SOC 2 Type II or ISO 27001 certification immediately? Achieving those on a self-built "Primitive" stack takes 12 to 18 months of engineering effort and dedicated security personnel.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auditability &amp;amp; data controls:&lt;/strong&gt; Need granular audit logs for every system call? Strict data residency controls (guaranteeing code executes only in Frankfurt)? Managed platforms usually offer these as standard SKUs. Replicating this visibility in a DIY Firecracker fleet means building a custom observability pipeline that can penetrate the VM boundary without breaking isolation.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The three-layer AI agent sandboxing stack (primitives, runtimes, platforms)
&lt;/h2&gt;

&lt;p&gt;Stop comparing Firecracker to Modal directly. They're different categories solving different problems. In 2026, the ecosystem forms a hierarchy of abstraction.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Layer 1: The primitives (the "raw materials").&lt;/strong&gt; Open-source virtualization technologies you run on your own metal or EC2 bare metal instances. You become the cloud provider.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Examples:&lt;/strong&gt; &lt;a href="https://github.com/firecracker-microvm/firecracker" rel="noopener noreferrer"&gt;AWS Firecracker&lt;/a&gt; (MicroVMs), gVisor (User-space kernel), Cloud Hypervisor, and the new Microsoft LiteBox (Library OS).
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Layer 2: The embeddable runtimes (the "APIs").&lt;/strong&gt; Middleware services that wrap primitives into a simple SDK. Sandboxing as a service for teams that need code execution without infrastructure management.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Examples:&lt;/strong&gt; E2B, specialized code interpreter APIs, microsandbox.
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Layer 3: The managed platforms (the "cloud").&lt;/strong&gt; End-to-end serverless compute environments. They handle the primitives, orchestration, scheduling, and scaling. The sandbox is the environment, not just a feature.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Examples:&lt;/strong&gt; Modal, Northflank, and Daytona.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Sandbox stack diagram: how the three layers work
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;(Imagine a pyramid structure)&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Top (layer 3 - platforms):&lt;/strong&gt; User submits code -&amp;gt; Platform handles Build, Schedule, Isolate, Scale. (e.g., Modal, Northflank, Daytona). Focus: Logic.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Middle (layer 2 - runtimes):&lt;/strong&gt; User calls API -&amp;gt; Runtime boots VM -&amp;gt; Executes -&amp;gt; Returns. (e.g., E2B). Focus: Integration.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bottom (layer 1 - primitives):&lt;/strong&gt; User configures Kernel -&amp;gt; Sets up TAP/TUN networking -&amp;gt; Manages RootFS -&amp;gt; Schedules VM. (e.g., Firecracker, LiteBox). Focus: Control.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Layer 1 (primitives): benefits, trade-offs, and hidden costs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Layer 1 benefit: maximum isolation control
&lt;/h3&gt;

&lt;p&gt;Layer 1 is where infrastructure companies and massive enterprises live. If you go this route, you're building on &lt;strong&gt;AWS Firecracker&lt;/strong&gt;, &lt;strong&gt;gVisor&lt;/strong&gt;, or the experimental &lt;strong&gt;Microsoft LiteBox&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The promise? Absolute control. You define the guest kernel version. You control the network topology down to the byte. You can achieve the highest possible density by oversubscribing resources based on your specific workload patterns.&lt;/p&gt;

&lt;p&gt;For teams building a competitor to AWS Lambda or a specialized vertical cloud, this is the only viable layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1 trade-off: you must build and operate the platform
&lt;/h3&gt;

&lt;p&gt;But here's the thing: "using" Firecracker is a misnomer. You don't just use it. You wrap it, orchestrate it, and debug it.&lt;/p&gt;

&lt;p&gt;The operational reality of running primitives at scale reveals hidden engineering costs that can easily derail product roadmaps.&lt;/p&gt;
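&lt;p&gt;What that wrapping means in practice: every single microVM boot is a short sequence of REST calls against the VM's API Unix socket, and your orchestrator has to issue, retry, and audit all of them. A minimal sketch of the control sequence, using resource paths and field names from Firecracker's public API with placeholder image paths:&lt;/p&gt;

```python
# Sketch of a per-VM boot plan for Firecracker's API socket. Paths like
# /machine-config and /boot-source follow the public API; the kernel and
# rootfs file paths are placeholders.
def boot_plan(kernel: str, rootfs: str, vcpus: int = 1, mem_mib: int = 128):
    return [
        ("PUT", "/machine-config", {"vcpu_count": vcpus, "mem_size_mib": mem_mib}),
        ("PUT", "/boot-source", {
            "kernel_image_path": kernel,
            "boot_args": "console=ttyS0 reboot=k panic=1",
        }),
        ("PUT", "/drives/rootfs", {
            "drive_id": "rootfs",
            "path_on_host": rootfs,
            "is_root_device": True,
            "is_read_only": False,
        }),
        ("PUT", "/actions", {"action_type": "InstanceStart"}),
    ]

plan = boot_plan("/images/vmlinux.bin", "/images/rootfs.ext4")
# An orchestrator replays this plan over the API socket (e.g. with
# curl --unix-socket), then layers networking, the jailer, and
# scheduling on top -- none of which Firecracker does for you.
```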

&lt;h3&gt;
  
  
  Image management: preventing thundering herd pulls
&lt;/h3&gt;

&lt;p&gt;The hardest problem in sandboxing isn't virtualization. It's data movement.&lt;/p&gt;

&lt;p&gt;To achieve sub-second start times for AI agents, you can't afford to run &lt;code&gt;docker pull&lt;/code&gt; inside a microVM on every boot. You need a sophisticated block-level caching strategy.&lt;/p&gt;

&lt;p&gt;When 1,000 agents start simultaneously (a "thundering herd"), asking your registry to serve 5GB container images to 1,000 nodes will capsize your network. You need lazy-loading technologies like &lt;a href="https://aws.amazon.com/about-aws/whats-new/2022/09/introducing-seekable-oci-lazy-loading-container-images/" rel="noopener noreferrer"&gt;SOCI (Seekable OCI)&lt;/a&gt; or &lt;strong&gt;eStargz&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Research shows that while SOCI can match standard startup times, unoptimized lazy loading can degrade startup performance; one published case saw Airflow &lt;a href="https://engineering.grab.com/docker-lazy-loading" rel="noopener noreferrer"&gt;startup go from 5s to 25s&lt;/a&gt;. Building a global, high-throughput, content-addressable storage layer to feed your microVMs is a distributed systems challenge that rivals the sandbox itself.&lt;/p&gt;
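&lt;p&gt;The scale of the problem falls out of back-of-envelope arithmetic using the numbers above. The uplink figure is an assumption (a generous registry-side pipe); real fleets shard registries and cache at the host precisely to avoid this:&lt;/p&gt;

```python
# Thundering herd, naively: 1,000 agents each pulling a full 5 GB image
# through one registry uplink. Uplink speed is a hypothetical figure.
nodes, image_gb = 1_000, 5
uplink_gbps = 100                       # assumed aggregate registry bandwidth
total_gbits = nodes * image_gb * 8      # 40,000 Gbit in flight
seconds = total_gbits / uplink_gbps
print(f"{seconds / 60:.0f} minutes to serve one cold-start wave")
# prints: 7 minutes to serve one cold-start wave
```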

&lt;h3&gt;
  
  
  Networking: TAP/TUN, CNI overhead, and startup latency
&lt;/h3&gt;

&lt;p&gt;Networking kills microVM projects. Quietly.&lt;/p&gt;

&lt;p&gt;Unlike Docker, which provides mature CNI plugins, Firecracker requires you to manually manage TAP interfaces, IP tables, and routing on the host.&lt;/p&gt;

&lt;p&gt;Recent research (IMC '24) shows that at high concurrency (around 400 parallel starts), setting up CNI plugins and virtual switches becomes the primary bottleneck. This overhead can &lt;a href="https://jhc.sjtu.edu.cn/~bjiang/papers/Liu_IMC2024_CNI.pdf" rel="noopener noreferrer"&gt;increase startup latency by as much as 263%&lt;/a&gt;, turning a 125ms VM boot into a multi-second delay.&lt;/p&gt;

&lt;p&gt;And debugging networking inside a "jailer" constrained environment? Notoriously difficult. Standard observability tools often fail to penetrate the VM boundary.&lt;/p&gt;

&lt;h3&gt;
  
  
  Warm pools: cold-start mitigation vs. idle cost
&lt;/h3&gt;

&lt;p&gt;Teams often maintain "warm pools" of pre-booted VMs to mitigate cold starts. This creates a complex economic problem.&lt;/p&gt;

&lt;p&gt;Keep 500 VMs warm but only use 100? You're burning cash on idle compute.&lt;/p&gt;

&lt;p&gt;Building a predictive autoscaler that spins up VMs &lt;em&gt;before&lt;/em&gt; a request hits, but not too many, is a serious data science challenge. In 2026, with GPU compute costs still high, the waste from inefficient warm pooling can easily exceed the markup charged by managed platforms.&lt;/p&gt;
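&lt;p&gt;The breakeven is simple arithmetic. All rates below are hypothetical, but the shape holds: a DIY fleet pays for every warm VM regardless of use, while a managed platform charges marked-up rates only for what actually runs:&lt;/p&gt;

```python
# Warm-pool economics in one comparison. Rates and the 1.5x markup are
# hypothetical; plug in your own numbers.
def diy_hourly_cost(pool_size, vm_hourly):
    # You pay for every warm VM, whether or not a request claims it.
    return pool_size * vm_hourly

def managed_hourly_cost(used, vm_hourly, markup=1.5):
    # You pay marked-up rates, but only for VMs actually in use.
    return used * vm_hourly * markup

vm_hourly = 0.10
diy = diy_hourly_cost(pool_size=500, vm_hourly=vm_hourly)        # ~ $50/hr
managed = managed_hourly_cost(used=100, vm_hourly=vm_hourly)     # ~ $15/hr
```

&lt;p&gt;At 20% pool utilization, the idle waste dwarfs the managed markup. The DIY fleet only wins once your predictive autoscaler keeps utilization consistently high.&lt;/p&gt;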

&lt;h3&gt;
  
  
  LiteBox in 2026: what it is and when to use it
&lt;/h3&gt;

&lt;p&gt;As of February 2026, Microsoft has introduced &lt;a href="https://github.com/microsoft/litebox" rel="noopener noreferrer"&gt;LiteBox&lt;/a&gt;, a Rust-based Library OS. It offers a compelling middle ground: lighter than a VM but with a drastically reduced host interface compared to containers.&lt;/p&gt;

&lt;p&gt;While promising for its use of AMD SEV-SNP (Confidential Computing), LiteBox remains experimental. Unlike Firecracker, which has hardened AWS Lambda for years, LiteBox lacks a production ecosystem. Betting your company's security on LiteBox today carries "bleeding edge" risk.&lt;/p&gt;

&lt;h3&gt;
  
  
  Google Agent Sandbox: the Kubernetes-native middle ground
&lt;/h3&gt;

&lt;p&gt;Google's &lt;a href="https://github.com/kubernetes-sigs/agent-sandbox" rel="noopener noreferrer"&gt;Agent Sandbox&lt;/a&gt; deserves separate mention because it straddles Layer 1 and Layer 2. Launched at KubeCon NA 2025 as a CNCF project under Kubernetes SIG Apps, it's an open-source controller that provides a declarative API for managing isolated, stateful sandbox pods on your own Kubernetes cluster.&lt;/p&gt;

&lt;p&gt;What makes it interesting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dual isolation backends.&lt;/strong&gt; Supports both gVisor (default) and Kata Containers, letting you choose isolation strength per workload.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Warm pool pre-provisioning.&lt;/strong&gt; The SandboxWarmPool CRD maintains pre-booted pods, reducing cold start latency to sub-second — solving the warm pool problem discussed above without requiring you to build custom autoscaling logic.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes-native abstractions.&lt;/strong&gt; SandboxTemplate defines the environment blueprint. SandboxClaim lets frameworks like LangChain or Google's ADK request execution environments declaratively. This is infrastructure-as-YAML, not infrastructure-as-code.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No vendor lock-in.&lt;/strong&gt; Runs on any Kubernetes cluster, not just GKE, though GKE adds managed gVisor integration and pod snapshots for faster resume.&lt;/li&gt;
&lt;/ul&gt;
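&lt;p&gt;A sketch of what the declarative flow looks like. The CRD kinds are the project's own; the &lt;code&gt;apiVersion&lt;/code&gt; and spec fields below are illustrative placeholders, so check the project's CRD reference before copying anything:&lt;/p&gt;

```yaml
# Hypothetical field names -- shape only, not copy-paste manifests.
apiVersion: agents.example.io/v1alpha1
kind: SandboxTemplate
metadata:
  name: python-agent
spec:
  podTemplate:
    spec:
      runtimeClassName: gvisor          # or a Kata runtime class
      containers:
        - name: runtime
          image: python:3.12-slim
---
apiVersion: agents.example.io/v1alpha1
kind: SandboxWarmPool
metadata:
  name: python-agent-pool
spec:
  templateRef:
    name: python-agent
  replicas: 10                          # pre-booted pods for sub-second claims
```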

&lt;p&gt;The tradeoff: you still operate the Kubernetes cluster. This isn't zero-ops like Layer 3 platforms. But for teams already running on Kubernetes who need agent sandboxing without adding a new vendor dependency, Agent Sandbox eliminates most of the DIY orchestration work described in the sections above while keeping you on open infrastructure.&lt;/p&gt;

&lt;p&gt;If you're on GKE already, this should be your first evaluation before looking at managed platforms.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 2 (embeddable runtimes): sandboxing as an API
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What "sandboxing as an API" means
&lt;/h3&gt;

&lt;p&gt;Layer 2 solutions wrap isolation primitives into developer-friendly interfaces. &lt;strong&gt;E2B&lt;/strong&gt; takes the "Stripe for Sandboxing" approach with a managed API, while &lt;strong&gt;microsandbox&lt;/strong&gt; offers the same micro-VM isolation tier as a self-hosted runtime. They abstract Layer 1's complexities (managing Firecracker, TAP interfaces, root filesystems) into a clean SDK.&lt;/p&gt;

&lt;p&gt;This layer works best for SaaS teams that need to add a "Code Interpreter" feature quickly. We're talking days, not months.&lt;/p&gt;
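&lt;p&gt;The contract is easy to picture. The sketch below mimics its shape with a plain local subprocess; a real Layer 2 runtime such as E2B or microsandbox backs the same call with a dedicated microVM rather than a host process, which is the entire point:&lt;/p&gt;

```python
import subprocess
import sys

# The Layer 2 contract in miniature: hand over source, get stdout back,
# environment dies afterwards. This local subprocess only mimics the
# shape -- it provides none of the isolation a microVM-backed runtime does.
def run_ephemeral(code: str, timeout_s: float = 5.0) -> str:
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout_s,
    )
    return result.stdout

out = run_ephemeral("print(6 * 7)")   # -> "42\n"
```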

&lt;h3&gt;
  
  
  microsandbox: the self-hosted alternative
&lt;/h3&gt;

&lt;p&gt;Not every team wants to send code to a third-party API. &lt;a href="https://github.com/zerocore-ai/microsandbox" rel="noopener noreferrer"&gt;microsandbox&lt;/a&gt; takes a different approach from E2B: it's a self-hosted, open-source runtime that provides micro-VM isolation using libkrun (a library-based KVM virtualizer). Each sandbox gets its own dedicated kernel — hardware-level isolation, not just syscall interception — with sub-200ms startup times.&lt;/p&gt;

&lt;p&gt;The key difference from E2B: microsandbox runs entirely on your infrastructure. No SaaS dependency, no data leaving your network. This makes it the stronger choice for teams with strict data residency requirements or air-gapped environments where a cloud sandbox API isn't an option.&lt;/p&gt;

&lt;p&gt;The tradeoff is predictable: you own the ops. microsandbox gives you the isolation primitive and a server to manage it, but you handle scaling, monitoring, and image management yourself. Think of it as the "self-hosted E2B" — same security tier (micro-VM), different operational model.&lt;/p&gt;

&lt;p&gt;As of early 2026, microsandbox has approximately 4,700 GitHub stars and is licensed under Apache 2.0. It's the most mature open-source option in this layer for teams that need to self-host.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2 against the four questions (security, duration, data gravity, GPUs)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Untrusted Code:&lt;/strong&gt; Layer 2 excels here. Vendors purpose-built these runtimes for executing LLM-generated code. E2B uses Firecracker; microsandbox uses libkrun. Both provide hardware-level isolation with dedicated kernels per sandbox.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session Length:&lt;/strong&gt; This layer optimizes for &lt;strong&gt;ephemeral, one-shot tasks&lt;/strong&gt;. Agent needs to run a Python script to visualize a dataset and then die? Cost-effective. But for long-running agents that persist for minutes or hours, the &lt;a href="https://e2b.dev/pricing" rel="noopener noreferrer"&gt;per-second billing models&lt;/a&gt; common here accumulate rapidly, often exceeding raw compute costs.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Gravity:&lt;/strong&gt; Data movement is the main architectural constraint at this layer, but it affects managed and self-hosted runtimes differently. For managed APIs like E2B, small payloads (JSON, spreadsheets, short scripts) travel over the network with negligible overhead. E2B supports volume mounts and persistent storage, which extends its range to moderate-sized datasets. microsandbox sidesteps the network hop entirely — since it runs on your infrastructure, sandboxes execute co-located with your data by definition, eliminating egress costs and transfer latency. The breakpoint: once individual executions routinely move multi-gigabyte files (large model weights, video processing, dataset joins), even volume mounts can't fully mask the I/O penalty on managed APIs. At that scale, either self-host with microsandbox, move to Layer 3 where compute and storage share an internal network, or build a co-located Layer 1 solution.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU Access:&lt;/strong&gt; GPU support in Layer 2 runtimes is still maturing. E2B currently focuses on CPU workloads. If your agents need GPU inference or fine-tuning, this is a genuine gap that may push you toward Layer 3 platforms or a custom Layer 1 build with GPU passthrough.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Layer 3 (managed platforms): serverless sandboxing for agents
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why managed platforms unify compute, data, and isolation
&lt;/h3&gt;

&lt;p&gt;Managed Platforms take the "Serverless Cloud" approach. The platform &lt;em&gt;is&lt;/em&gt; the sandbox.&lt;/p&gt;

&lt;p&gt;You don't make an API call to a separate sandbox service. Your entire workload runs inside an isolated environment by default. This unification solves the friction between code, data, and compute.&lt;/p&gt;

&lt;p&gt;Three managed platforms stand out, each with a different architectural bet:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Modal&lt;/strong&gt; uses gVisor (user-space kernel isolation) optimized for Python ML workloads. Strengths: native GPU support (T4 through H200), serverless autoscaling from zero, infrastructure-as-code via Python SDK. Limitations: gVisor-only isolation (no microVM option for higher-security requirements), Python-centric (limited multi-language support), no BYOC or on-prem deployment, SDK-defined images create migration friction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Northflank&lt;/strong&gt; uses both Kata Containers (microVM) and gVisor, selecting isolation level per workload. Strengths: strongest isolation of the three (dedicated kernel via Kata), BYOC deployment (AWS, GCP, Azure, bare metal), unlimited session duration, GPU support with all-inclusive pricing, OCI-compatible (no proprietary image format). Limitations: more comprehensive platform means steeper initial setup than a pure sandbox API, less Python-specific DX than Modal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Daytona&lt;/strong&gt; uses Docker containers by default with optional Kata Containers for stronger isolation. Strengths: fastest cold starts in the market (sub-90ms), native Docker compatibility, stateful sandboxes with LSP support, desktop environments for computer-use agents. Limitations: default Docker isolation is the weakest of the three — you must explicitly opt into Kata for microVM-level security. Younger platform (pivoted to AI sandboxes in early 2025).&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3 against the four questions (security, data gravity, GPUs, compliance)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Untrusted Code:&lt;/strong&gt; Platforms provide default isolation, but the level of protection varies. Modal uses &lt;a href="https://modal.com/docs/guide/sandbox-networking" rel="noopener noreferrer"&gt;gVisor&lt;/a&gt;, which intercepts syscalls in user space — stronger than containers but not equivalent to a dedicated kernel. Northflank offers Kata Containers (full microVMs with dedicated kernels) for workloads that require the strictest isolation. Daytona defaults to Docker containers, which may be insufficient for truly hostile code unless you explicitly configure Kata. If your threat model assumes kernel exploits, ask whether the platform offers microVM-level isolation, not just "sandboxing."  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Gravity:&lt;/strong&gt; Layer 3 platforms generally solve data gravity by co-locating compute and storage on high-speed internal networks, avoiding the upload/download penalty of Layer 2 APIs. Modal and Northflank both support volume mounts and cached datasets. However, data residency varies: Northflank offers BYOC deployment guaranteeing data stays in your VPC, while Modal runs on their managed infrastructure. If regulatory requirements dictate where data physically resides, BYOC support becomes a deciding factor.  &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GPUs on Demand:&lt;/strong&gt; Scheduling and isolation for multi-tenant inference is the clearest Layer 3 differentiator, but support varies. Modal offers the broadest GPU selection (T4 through H200) with per-second billing, though total costs add up when you factor in separate charges for GPU, CPU, and RAM. Northflank offers GPU support with all-inclusive pricing that can be significantly cheaper for sustained workloads. Daytona currently lacks GPU support.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Research shows that without strict hardware partitioning (like MIG), multi-tenant GPU workloads can suffer &lt;a href="https://blogs.vmware.com/cloud-foundation/2024/08/27/boost-throughput-scaling-vms-minimal-gpus/" rel="noopener noreferrer"&gt;55-145% latency degradation&lt;/a&gt;. Managed platforms handle this scheduling complexity, offering "soft" or "hard" GPU isolation and handling the drivers, CUDA versions, and hardware abstraction. You request a GPU in code (e.g., &lt;code&gt;gpu="A100"&lt;/code&gt;), and the platform handles physical provisioning and isolation.  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Compliance:&lt;/strong&gt; Enterprise compliance features vary significantly across platforms. Managed platforms generally let you inherit controls faster than building on primitives, but the specifics matter. Northflank's BYOC model lets you keep your data in your own cloud account, simplifying compliance with data residency requirements. Modal's managed-only infrastructure means your data runs on their servers. Daytona offers self-hosted options. Evaluate each vendor's SOC 2 certification status, audit log granularity, and network isolation capabilities against your specific compliance requirements.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Comparison: primitives vs. runtimes vs. managed platforms
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criteria&lt;/th&gt;
&lt;th&gt;Primitive (layer 1)&lt;/th&gt;
&lt;th&gt;Runtime (layer 2)&lt;/th&gt;
&lt;th&gt;Managed platform (layer 3)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary example&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Firecracker, gVisor, LiteBox&lt;/td&gt;
&lt;td&gt;E2B&lt;/td&gt;
&lt;td&gt;Modal, Northflank, Daytona&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Isolation options&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full control (microVM, user-space kernel, library OS)&lt;/td&gt;
&lt;td&gt;Firecracker microVM (E2B), libkrun microVM (microsandbox)&lt;/td&gt;
&lt;td&gt;gVisor only (Modal), Kata + gVisor (Northflank), Docker/Kata (Daytona)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Time to production&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Months (engineering intensive)&lt;/td&gt;
&lt;td&gt;Days (integration)&lt;/td&gt;
&lt;td&gt;Hours (deployment)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpEx &amp;amp; team cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High (requires infra team)&lt;/td&gt;
&lt;td&gt;Medium (usage fees)&lt;/td&gt;
&lt;td&gt;Low (pay-per-use)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPU support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hard (DIY passthrough/MIG)&lt;/td&gt;
&lt;td&gt;Limited / none&lt;/td&gt;
&lt;td&gt;Native &amp;amp; on-demand&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data gravity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Solved (local control)&lt;/td&gt;
&lt;td&gt;Varies: network hops for managed APIs (E2B), solved for self-hosted (microsandbox).&lt;/td&gt;
&lt;td&gt;Solved (unified architecture)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BYOC / self-hosted&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (you own everything)&lt;/td&gt;
&lt;td&gt;E2B: experimental. microsandbox: yes.&lt;/td&gt;
&lt;td&gt;Northflank: yes. Modal: no. Daytona: yes.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Language support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;td&gt;Any (Linux-based)&lt;/td&gt;
&lt;td&gt;Python-centric (Modal). Any OCI image (Northflank, Daytona).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Compliance effort&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Very high (DIY audit)&lt;/td&gt;
&lt;td&gt;Medium (vendor inheritance)&lt;/td&gt;
&lt;td&gt;Low (built-in features)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Key limitation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Massive ops burden&lt;/td&gt;
&lt;td&gt;Data gravity, session billing&lt;/td&gt;
&lt;td&gt;Vendor lock-in, isolation varies by vendor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Infrastructure companies&lt;/td&gt;
&lt;td&gt;SaaS "feature" add-ons&lt;/td&gt;
&lt;td&gt;AI product &amp;amp; data teams&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Notable hybrid:&lt;/strong&gt; Google Agent Sandbox. K8s-native controller supporting gVisor + Kata with warm pools. Runs on your cluster. Open-source (CNCF). Best for teams already on Kubernetes.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's next in AI agent sandboxing (2026–2027)
&lt;/h2&gt;

&lt;p&gt;Looking toward late 2026 and 2027, three trends will reshape this stack.&lt;/p&gt;

&lt;h3&gt;
  
  
  Trend: library OS sandboxes (LiteBox)
&lt;/h3&gt;

&lt;p&gt;Microsoft's entry with &lt;strong&gt;LiteBox&lt;/strong&gt; validates the move toward Library Operating Systems. By bundling application code with only the minimal kernel components needed (using a "North/South" interface paradigm), Library OSs promise the low overhead of a process with the isolation of a VM.&lt;/p&gt;

&lt;p&gt;Still experimental now. But this could redefine the performance/security trade-off in 2-3 years, potentially replacing containers for high-security workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  Trend: daemonless, embeddable sandboxing (BoxLite)
&lt;/h3&gt;

&lt;p&gt;The next frontier is sandboxing without any server process. Projects like &lt;a href="https://github.com/boxlite-ai/boxlite" rel="noopener noreferrer"&gt;BoxLite&lt;/a&gt; (distinct from Microsoft's LiteBox) explore embedding micro-VM isolation directly into an application as a library — no daemon, no daemon socket, no background process. Where microsandbox runs as a server you deploy, BoxLite aims to be a library you import.&lt;/p&gt;

&lt;p&gt;Think of it as the difference between PostgreSQL (a server) and SQLite (a library). BoxLite is the SQLite model applied to sandboxing: a single function call spins up an isolated OCI container inside your application process. This serves the "local-first" AI agent movement, where agents run on developer machines or edge devices without cloud dependencies.&lt;/p&gt;

&lt;p&gt;Still early (v0.5.10, 14 contributors, ~1000 GitHub stars), but the architectural direction — sandboxing as an embedded library rather than a service — could reshape how lightweight agent frameworks handle isolation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Trend: protocol-level permissions with MCP
&lt;/h3&gt;

&lt;p&gt;Security is moving up the stack. Kernel-level isolation answers the question "can this code escape the sandbox?" but not "should this agent be allowed to make HTTP requests at all?" The &lt;a href="https://modelcontextprotocol.io/docs/getting-started/intro" rel="noopener noreferrer"&gt;&lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt;&lt;/a&gt; opens the door to enforcing permissions at the protocol layer, where agent capabilities are declared rather than inferred. &lt;/p&gt;

&lt;p&gt;Here's the mechanism. An MCP server exposes a manifest of tools — web_search, filesystem_read, database_query — each with a defined scope. A sandbox runtime that understands MCP can derive its security policy directly from that manifest. An agent authorized to use web_search gets outbound HTTPS on port 443. An agent with only filesystem_read gets no network access at all. File system mounts narrow to the specific paths the tool declares. The sandbox's firewall rules and mount points become a function of the agent's tool permissions, not a static configuration an engineer writes once and forgets. &lt;/p&gt;
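&lt;p&gt;A minimal sketch of what that glue layer might look like. The tool names follow the examples above, but the function and policy fields are hypothetical, not a real MCP or sandbox API:&lt;/p&gt;

```python
# Hypothetical sketch: derive a sandbox network/filesystem policy from an
# MCP-style tool manifest at boot time. Illustrative only; no production
# sandbox exposes this API today.

def policy_from_manifest(tools):
    """Map declared tool capabilities to sandbox restrictions."""
    policy = {
        "network": {"allow_outbound": False, "ports": []},
        "mounts": [],  # read-only paths the sandbox may expose
    }
    for tool in tools:
        if tool["name"] == "web_search":
            # Outbound HTTPS on 443 only; everything else stays blocked.
            policy["network"]["allow_outbound"] = True
            policy["network"]["ports"] = [443]
        elif tool["name"] == "filesystem_read":
            # Narrow mounts to exactly the paths the tool declares.
            policy["mounts"].extend(tool.get("paths", []))
    return policy

# An agent with only filesystem_read gets no network access at all.
print(policy_from_manifest([{"name": "filesystem_read", "paths": ["/data/reports"]}]))
```
&lt;p&gt;The firewall rules and mount points become a pure function of the manifest, which is the whole point: change the agent's tools and the sandbox policy changes with them.&lt;/p&gt;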

&lt;p&gt;No production sandbox does this today. But the primitives are converging: MCP adoption is accelerating across agent frameworks (LangChain, CrewAI, Google ADK all support it), and sandbox runtimes already expose the network and filesystem controls needed to enforce these policies programmatically. The missing piece is the glue layer that translates an MCP tool manifest into a sandbox security policy at boot time. Expect the first integrations in late 2026.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion: choosing the right sandbox layer for your AI agents
&lt;/h2&gt;

&lt;p&gt;The 2026 sandbox landscape isn't about choosing a virtualization technology. It's about choosing your level of abstraction. The defining question for engineering leadership is: Where do you create value?&lt;/p&gt;

&lt;p&gt;If your core business is selling infrastructure — building the next Vercel or a specialized vertical cloud — you must build on primitives (Layer 1). The operational pain of managing Firecracker fleets is your competitive moat.&lt;/p&gt;

&lt;p&gt;If you need to add code execution as a feature inside an existing product, embeddable runtimes (Layer 2) get you there in days with strong isolation and minimal architecture changes.&lt;/p&gt;

&lt;p&gt;If your core business is building an AI application, agent, or data pipeline, managed platforms (Layer 3) trade control for velocity. But "managed" is not a monolith — evaluate isolation strength (gVisor vs. microVM), deployment model (managed vs. BYOC), language constraints, and session economics for your specific workload before committing.&lt;/p&gt;

&lt;p&gt;The one decision you shouldn't make in 2026: running untrusted AI-generated code inside shared-kernel containers and hoping for the best. The cloud providers have already told you that's not enough. Listen to them.&lt;/p&gt;




&lt;h2&gt;
  
  
  Frequently asked questions (FAQ)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Is Docker (runc) safe enough to run untrusted AI agent code?
&lt;/h3&gt;

&lt;p&gt;For hostile or user-supplied code, shared-kernel containers generally don't provide sufficient isolation. Use stronger boundaries, such as microVMs (e.g., Firecracker) or hardened user-space kernels (e.g., gVisor), or run on a managed platform that provides multi-tenant isolation by default.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's the difference between Firecracker and gVisor for sandboxing?
&lt;/h3&gt;

&lt;p&gt;Firecracker uses hardware virtualization (microVMs) for stronger isolation, but this typically introduces more operational complexity. gVisor intercepts syscalls with a user-space kernel for improved isolation over standard containers, often with easier integration but at the cost of different performance/compatibility trade-offs.&lt;/p&gt;

&lt;h3&gt;
  
  
  When should I choose a primitive vs an embeddable runtime vs a managed platform?
&lt;/h3&gt;

&lt;p&gt;Choose primitives when you need maximum control and can operate the fleet (scheduler, images, networking, compliance). Choose an embeddable runtime when you need to add code execution fast, and payloads are small. Choose a managed platform when you need GPUs, data-local execution, and minimal ops.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is "data gravity" and why does it matter for sandboxing?
&lt;/h3&gt;

&lt;p&gt;Data gravity is the cost and latency of moving large datasets or model weights to where code runs. If you're routinely moving gigabytes per execution, API-style sandboxes become slow and expensive. Platforms or co-located primitives reduce transfers by running compute near the data.&lt;/p&gt;
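&lt;p&gt;The cost is easy to sanity-check with back-of-envelope arithmetic; the dataset sizes, link speed, and run count below are illustrative assumptions:&lt;/p&gt;

```python
# Back-of-envelope data gravity: time spent moving inputs to a remote
# sandbox before any code runs. All figures are illustrative.

def transfer_seconds(gigabytes, gbps=1.0):
    # 1 GB = 8 gigabits; divide by link speed in gigabits per second.
    return gigabytes * 8 / gbps

per_run = transfer_seconds(5)        # 5 GB over 1 Gbps: 40 s per execution
daily_hours = per_run * 1000 / 3600  # at 1,000 runs/day, hours of pure transfer
print(per_run, round(daily_hours, 1))
```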

&lt;h3&gt;
  
  
  Are embeddable sandbox APIs (layer 2) good for long-running agents?
&lt;/h3&gt;

&lt;p&gt;Vendors usually optimize them for short-lived, one-shot execution. For agents that idle or run for minutes/hours, per-second billing and session management can get expensive compared to a platform or a self-managed fleet.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do I need GPU isolation for AI agents, and how is it handled?
&lt;/h3&gt;

&lt;p&gt;If multiple tenants share GPUs, "noisy neighbor" effects can cause unpredictable latency and security concerns. Managed platforms typically handle GPU scheduling and isolation (e.g., MIG/partitioning strategies), whereas DIY approaches require significant engineering effort.&lt;/p&gt;

&lt;h3&gt;
  
  
  What operational work do I take on if I build on Firecracker (layer 1)?
&lt;/h3&gt;

&lt;p&gt;You own image distribution/caching, networking (TAP/TUN, routing), orchestration, warm pools, autoscaling, observability, and incident response. The isolation primitive is only one part of running a production sandbox fleet.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is LiteBox and is it production-ready in 2026?
&lt;/h3&gt;

&lt;p&gt;LiteBox is a Library OS approach that reduces the host interface compared to containers. As described, it remains experimental relative to battle-tested microVM approaches, so adopting it carries higher risk unless you can tolerate bleeding-edge dependencies.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I think about compliance (SOC 2, ISO 27001) when choosing a sandbox layer?
&lt;/h3&gt;

&lt;p&gt;Building compliance on primitives typically requires substantial time and dedicated security engineering. Managed platforms can let you inherit controls (audit logs, network boundaries, residency options) faster, depending on vendor capabilities and your requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  What cold start time should I expect from modern sandboxes?
&lt;/h3&gt;

&lt;p&gt;Many modern approaches can achieve sub-second starts with snapshots and caching. But real-world latency often depends more on image distribution, networking setup, and warm pool strategy than on the isolation primitive alone.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>devops</category>
      <category>security</category>
    </item>
    <item>
      <title>Your AI SRE needs better observability, not bigger models.</title>
      <dc:creator>Manveer Chawla</dc:creator>
      <pubDate>Thu, 01 Jan 2026 19:53:00 +0000</pubDate>
      <link>https://forem.com/manveer_chawla_64a7283d5a/your-ai-sre-needs-better-observability-not-bigger-models-23e4</link>
      <guid>https://forem.com/manveer_chawla_64a7283d5a/your-ai-sre-needs-better-observability-not-bigger-models-23e4</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI SRE fails on missing data, not missing IQ.&lt;/strong&gt; Most can't find root causes because they’re built on legacy observability stacks with short retention, missing high-cardinality data, and slow queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An AI SRE is an LLM + SQL over a rich observability + context layer.&lt;/strong&gt; An effective copilot requires a fast, scalable data substrate that retains full-fidelity telemetry for extended periods, along with a context layer to address gaps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ClickHouse-style OLAP is an ideal foundation for an AI SRE copilot.&lt;/strong&gt; ClickHouse makes long-retention, high-cardinality observability practical at interactive query speeds, which is exactly the substrate an iterative AI investigator needs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Payoff: AI as a force multiplier for human expertise.&lt;/strong&gt; A real AI SRE searches, correlates, and summarizes so that on-call engineers can focus on decisions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s 2:13 a.m.&lt;/p&gt;

&lt;p&gt;Your AI SRE copilot has a confident answer: “Error rates in checkout increased because the Payment service is slow.”&lt;/p&gt;

&lt;p&gt;Twenty minutes later, you discover the real issue was a bad feature flag rollout. The “copilot” just narrates your dashboards. That’s not a copilot. That’s a chat UI for your graphs.&lt;/p&gt;

&lt;p&gt;AI SRE tools promised to transform incident response. However, most implementations have been disappointing. They point an LLM at observability data and ask it to explain what broke and why, and that alone doesn’t work.&lt;/p&gt;

&lt;p&gt;When I led the platform and storage teams at Confluent and pushed &lt;a href="https://www.confluent.io/blog/making-apache-kafka-10x-more-reliable/" rel="noopener noreferrer"&gt;availability SLA from 99.9 to 99.95&lt;/a&gt;, I learned something counterintuitive about incidents. A bulk of incidents ended with one of three crude corrective actions: roll back a bad change, restart an unhealthy component, or scale up capacity to absorb load.&lt;/p&gt;

&lt;p&gt;Applying the fix usually took minutes. The hard part was figuring out the root cause.&lt;/p&gt;

&lt;p&gt;Was the problem a bad configuration, a noisy neighbor, a control plane deadlock, or a subtle storage regression? Answering that question required an investigation, not just a runbook.&lt;/p&gt;

&lt;p&gt;Many AI SRE tools fall short here. They lean toward automated remediation or market themselves as self-healing systems, an approach that proves both risky and unnecessary in most real environments. Other tools focus more on correlation, summarization, and alert reduction.&lt;/p&gt;

&lt;p&gt;Across both camps, the same constraint emerges: they try to reason at scale on top of an observability substrate not designed for AI-first investigation. As a result, most AI SRE products have been underwhelming.&lt;/p&gt;

&lt;p&gt;Let's be real, the goal isn't to build a bot that restarts your databases. An AI SRE is an investigator who analyzes data so the on-call human can make a decision.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The AI hunts. The human decides.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This Human-in-the-Loop approach solves the real bottleneck (Mean Time to Understand, or MTTU) without the risks of auto-remediation.&lt;/p&gt;

&lt;p&gt;The ClickHouse engineering team recently tested whether frontier models could &lt;a href="https://clickhouse.com/blog/llm-observability-challenge" rel="noopener noreferrer"&gt;autonomously identify root causes&lt;/a&gt; from real observability data. The finding was both uncomfortable and useful. Even GPT-5 couldn't do it reliably, despite access to detailed telemetry. The real constraint:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The bottleneck is not model IQ; it is missing context, weak grounding, and no domain specialization."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The limiting factor was the data substrate, not the LLM. The models could read logs and metrics, but they were looking at short retention windows, incomplete dimensions, and fragmented context. They were reasoning over partial information.&lt;/p&gt;

&lt;p&gt;I now think about this problem in two layers. First, build an observability foundation that actually captures the information an AI investigator needs, with the right economics and query profile. Second, use AI for what it excels at: reducing time-consuming work on correlation, pattern matching, and narrative, while engineers retain control over actions.&lt;/p&gt;

&lt;p&gt;This article shows how to address this gap by building an AI SRE copilot for on-call engineers on a &lt;a href="https://clickhouse.com/resources/engineering/best-open-source-observability-solutions" rel="noopener noreferrer"&gt;solid observability foundation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What causes AI SRE tools to fail in production
&lt;/h2&gt;

&lt;p&gt;Many AI SRE products are thin layers on top of older observability platforms. They inherit the economic and architectural constraints of those systems, and they hit the same ceiling in three predictable ways.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 1: The retention problem
&lt;/h3&gt;

&lt;p&gt;Most legacy observability platforms that grew up around search-first, inverted-index architectures charge primarily based on ingestion volume. At scale, this pricing model pushes teams toward aggressively short retention. Teams typically retain 7 to 14 days of logs, with a slightly longer window for coarse metrics. While they may retain older data in “cold tiers”, these rarely deliver the query access times required for agentic analysis.&lt;/p&gt;

&lt;p&gt;For an AI SRE copilot, short retention removes historical memory. A model investigating a checkout failure today can't see that the same pattern occurred six weeks ago after a similar deployment, because those logs no longer exist.&lt;/p&gt;

&lt;p&gt;Seasonal patterns, rare edge cases, and long-tail incidents become invisible.&lt;/p&gt;

&lt;p&gt;From a reliability perspective, every incident looks like the first time. The model can't learn from the organization's own history, and no amount of prompt engineering fixes missing data.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If your logs can’t remember more than two weeks, neither can your AI SRE.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Problem 2: &lt;a href="https://clickhouse.com/resources/engineering/high-cardinality-slow-observability-challenge" rel="noopener noreferrer"&gt;The cardinality problem&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;To control cost and performance in search-first systems, teams routinely drop high-cardinality dimensions. User IDs, session IDs, request IDs, detailed error codes, and fine-grained labels often get removed because they increase index size and query latency in inverted index engines.&lt;/p&gt;

&lt;p&gt;These fields are exactly what an AI SRE needs to correlate events.&lt;/p&gt;

&lt;p&gt;Root cause analysis usually connects a symptom to a specific subset of users, regions, deployments, or feature flags. If those dimensions aren't stored, the model sees only aggregate curves and generic error messages. It can describe that the error rate increased, but it can't answer which customers, which change, or along which path.&lt;/p&gt;

&lt;h4&gt;
  
  
  The full stack blindspot
&lt;/h4&gt;

&lt;p&gt;At Confluent, the cardinality problem combined with stack complexity into a more painful pattern. Our architecture had a data plane, a control plane, and the underlying cloud infrastructure layer. Very few engineers, perhaps a handful in the entire organization, had a complete mental model of how a disk latency spike could ripple through to durability at the data layer.&lt;/p&gt;

&lt;p&gt;Incident response often became a human coordination problem. We frequently pulled five different teams on a call just to reconstruct a complete picture. Each team saw a different slice of metrics and logs in their own tools, so the real diagnosis happened in people's heads and in ad hoc conversations.&lt;/p&gt;

&lt;p&gt;An AI SRE can only close that gap if data from all layers lives in one place.&lt;/p&gt;

&lt;p&gt;When the control plane, data plane, cloud metrics, and application telemetry all live in ClickHouse, the copilot has no team boundaries. It can trace a request from the load balancer through the API layer and down to disk, bridging the visibility gap that humans struggle to cross during a tense outage.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 3: The query speed problem
&lt;/h3&gt;

&lt;p&gt;In the ClickHouse &lt;a href="https://clickhouse.com/blog/llm-observability-challenge" rel="noopener noreferrer"&gt;experiment&lt;/a&gt;, the team quantified how an AI agent actually behaves during an incident. An AI SRE operates in a loop: it forms a hypothesis, queries the data, refines its understanding, and queries again. Each investigation involved between 6 and 27 database queries as the model iterated.&lt;/p&gt;

&lt;p&gt;A realistic workflow looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Inspect recent errors for the impacted service.&lt;/li&gt;
&lt;li&gt;Break down errors by version and region.&lt;/li&gt;
&lt;li&gt;Cross-reference with deployments and feature flags.&lt;/li&gt;
&lt;li&gt;Pull traces for the slowest endpoints.&lt;/li&gt;
&lt;li&gt;Join with customer impact or business metrics.&lt;/li&gt;
&lt;/ol&gt;
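&lt;p&gt;The shape of that workflow matters more than any single query: each step depends on the previous result, so query latency multiplies across the whole investigation. A sketch, where the queries and step count are illustrative and &lt;code&gt;run_query&lt;/code&gt; stands in for whatever executes SQL against your telemetry store:&lt;/p&gt;

```python
# Illustrative hypothesis-query-refine loop. Table and column names are
# made up; the point is the sequential dependency between queries.

def investigate(run_query, max_steps=27):
    """Stub of an iterative root-cause loop (6-27 queries in practice)."""
    steps = [
        "SELECT count() FROM logs WHERE level = 'error' AND service = 'checkout'",
        "SELECT version, region, count() FROM logs GROUP BY version, region",
        "SELECT service FROM deployments ORDER BY deployed_at DESC LIMIT 5",
    ]
    findings = []
    for sql in steps[:max_steps]:
        findings.append(run_query(sql))  # each call blocks the next hypothesis
    return findings

results = investigate(lambda sql: "stub result")
print(len(results))
```
&lt;p&gt;At 30 seconds per query, even this three-step fragment costs 90 seconds of waiting; a 27-query investigation spends over 13 minutes just queuing for data.&lt;/p&gt;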

&lt;p&gt;If each query takes 20 to 30 seconds on a legacy observability platform, the feedback loop collapses. An AI-based workflow becomes painfully slow when every step waits minutes for data. The operator will always be faster using native dashboards.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 4: The per-query tax
&lt;/h3&gt;

&lt;p&gt;Human analysts and AI agents approach investigations differently. A human writes one or a handful of queries, waits for results, and examines the data.&lt;/p&gt;

&lt;p&gt;An AI agent enters a "Chain of Thought" loop, firing up to 27 queries in a short time period to map dependencies, check outliers, and validate hypotheses.&lt;/p&gt;

&lt;p&gt;If your observability data lives in a solution or database with per-query pricing (like &lt;a href="https://clickhouse.com/resources/engineering/new-relic-alternatives#what-are-the-hidden-costs-of-new-relics-walled-garden" rel="noopener noreferrer"&gt;New Relic&lt;/a&gt; or &lt;a href="https://cloud.google.com/bigquery/pricing?hl=en" rel="noopener noreferrer"&gt;BigQuery&lt;/a&gt;), your AI agent will destroy your budget. If you're using a traditional database with strict concurrency limits, the agent spends more time waiting in the query queue than actually solving problems.&lt;/p&gt;

&lt;p&gt;This leads to the core limitation: many AI SRE tools attempt to reason at scale on top of platforms not designed for high-volume, high-cardinality analytical queries with long retention. No prompt or fine-tuning can fully compensate for a data store that can't retain or serve what the copilot actually needs.&lt;/p&gt;
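&lt;p&gt;The per-query tax compounds quickly. The arithmetic below uses a hypothetical placeholder price, not a real vendor rate; the point is the multiplier, not the dollar figure:&lt;/p&gt;

```python
# Illustrative per-query tax. Prices and volumes are assumptions chosen
# only to show how agent-style querying multiplies spend.

QUERIES_PER_INVESTIGATION = 27   # upper bound observed in the ClickHouse experiment
INVESTIGATIONS_PER_MONTH = 500

def monthly_query_cost(price_per_query):
    return QUERIES_PER_INVESTIGATION * INVESTIGATIONS_PER_MONTH * price_per_query

human_cost = 3 * INVESTIGATIONS_PER_MONTH * 0.05  # a human might issue ~3 queries
agent_cost = monthly_query_cost(0.05)             # the agent issues up to 27
print(human_cost, agent_cost)                     # same incidents, ~9x the query spend
```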

&lt;blockquote&gt;
&lt;p&gt;You can't "AI" your way out of a storage and query problem.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why ClickHouse is the right database for building an AI SRE Copilot
&lt;/h2&gt;

&lt;p&gt;ClickHouse addresses three problems at their root: storage costs, high-cardinality performance, and query latency.&lt;/p&gt;

&lt;p&gt;For observability workloads, modern solutions such as ClickStack, which uses ClickHouse as its core data engine, routinely achieve order-of-magnitude improvements over legacy observability platforms built on inverted indices.&lt;/p&gt;

&lt;p&gt;At a high level, the differences look like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Data problem&lt;/th&gt;
&lt;th&gt;Legacy observability stacks built on inverted indices&lt;/th&gt;
&lt;th&gt;ClickHouse-based observability&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Retention&lt;/td&gt;
&lt;td&gt;7–14 days of full logs, then aggressive sampling or rollups&lt;/td&gt;
&lt;td&gt;Months of full-fidelity logs, metrics, and traces at petabyte scale&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cardinality&lt;/td&gt;
&lt;td&gt;High-cardinality dimensions dropped or pre-aggregated to control index size&lt;/td&gt;
&lt;td&gt;Native support for billions of unique values with sparse indexing and compression&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query speed&lt;/td&gt;
&lt;td&gt;Seconds to minutes for multi-dimensional aggregations&lt;/td&gt;
&lt;td&gt;Sub-second scans and aggregates on billions of rows for typical incident queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM Compatibility&lt;/td&gt;
&lt;td&gt;Requires few-shot prompting or fine-tuning for custom DSLs.&lt;/td&gt;
&lt;td&gt;Zero-shot compatible via standard SQL.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The economics come from architecture, not marketing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Columnar storage and compression = longer memory.&lt;/strong&gt; Machine-generated logs and metrics compress extremely well when stored column-by-column. Real deployments often see &lt;a href="https://clickhouse.com/use-cases/observability" rel="noopener noreferrer"&gt;10x to 15x less storage&lt;/a&gt; compared to inverted index engines for the same raw telemetry volume. That difference translates directly into longer retention windows and more history for the copilot.&lt;/p&gt;
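&lt;p&gt;A back-of-envelope version of that trade, using a 12x ratio from within the cited 10x-15x range (storage budget and daily volume are illustrative):&lt;/p&gt;

```python
# Same storage budget, different compression: how many days of full-fidelity
# telemetry fit. All figures are illustrative assumptions.

def retention_days(budget_tb, daily_raw_tb, compression_ratio):
    return budget_tb / (daily_raw_tb / compression_ratio)

legacy = retention_days(70, 5, 1)     # 5 TB/day, stored roughly raw: 14 days
columnar = retention_days(70, 5, 12)  # 12x columnar compression: 168 days
print(legacy, columnar)
```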

&lt;p&gt;&lt;strong&gt;Vectorized execution for analytical queries = the copilot’s feedback loop stays interactive.&lt;/strong&gt; Incident queries rely on aggregations, filters, and time ranges. ClickHouse executes these operations in &lt;a href="https://clickhouse.com/docs/academic_overview" rel="noopener noreferrer"&gt;tight vectorized loops on compressed data&lt;/a&gt;. It can scan and aggregate billions of rows per second on modern hardware, keeping the AI feedback loop interactive even when the model issues dozens of queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sparse primary indexes instead of global inverted indices = keep your high-cardinality fields.&lt;/strong&gt; &lt;a href="https://clickhouse.com/docs/engines/table-engines/mergetree-family" rel="noopener noreferrer"&gt;MergeTree tables&lt;/a&gt; in ClickHouse use ordered primary keys and lightweight indexes rather than heavy per-field inverted indices. This design tolerates high-cardinality dimensions, such as request IDs and user IDs, in the schema without causing catastrophic index growth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Standard SQL = Zero-Shot Fluency.&lt;/strong&gt; LLMs are trained on SQL from the entire internet. They struggle with proprietary query languages such as SPL, KQL, and PromQL. When you use a SQL-native database such as ClickHouse, you don't waste your context window teaching the model a new language or fine-tuning it on custom syntax. The model context focuses on the data, not the &lt;em&gt;grammar.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When this storage engine powers a modern observability solution, the AI SRE copilot builds on a very different foundation. Retention spans months instead of days. Dimensions remain intact. Queries complete fast enough that a model can afford to iterate. This foundation gives AI the breadcrumbs it needs to traverse the stack.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to solve the context window problem with SQL
&lt;/h3&gt;

&lt;p&gt;Here's the common question: "How does an AI agent read months of logs with a 128k token limit?"&lt;/p&gt;

&lt;p&gt;It doesn't. The database does the reading: the agent uses SQL to scan petabytes of history in place and returns only the relevant insight (kilobytes) to the context window.&lt;/p&gt;

&lt;p&gt;Legacy observability tools typically offer two modes: "search" (list logs) and "aggregations" (time-series metrics for line charts). ClickHouse offers full SQL.&lt;/p&gt;

&lt;p&gt;Full SQL lets the agent run complex logic (joins, window functions, and subqueries) to filter signals from noise inside the database layer. This keeps data dumps out of the context window.&lt;/p&gt;
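&lt;p&gt;The pattern in miniature, using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; as a stand-in for ClickHouse (the table and numbers are synthetic; ClickHouse would run the same shape of query over billions of rows):&lt;/p&gt;

```python
# Scan many rows inside the database; return only a tiny aggregate to the
# model's context window. Synthetic data, sqlite3 standing in for ClickHouse.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE logs (service TEXT, version TEXT, is_error INT)")
rows = [("checkout", "v2" if i % 10 == 0 else "v1", 1 if i % 10 == 0 else 0)
        for i in range(100_000)]
db.executemany("INSERT INTO logs VALUES (?, ?, ?)", rows)

# One aggregation distills 100,000 rows into two summary rows, a few dozen
# tokens, no matter how large the scanned history is.
summary = db.execute(
    "SELECT version, sum(is_error) AS errors, count(*) AS total "
    "FROM logs GROUP BY version ORDER BY errors DESC"
).fetchall()
print(summary)
```
&lt;p&gt;The model never sees the raw logs, only the two-row answer: errors concentrate in one version. That is the signal, not the dump.&lt;/p&gt;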

&lt;p&gt;Note: You can absolutely build an AI SRE copilot without ClickHouse. Any database that gives you similar economics and query profiles can work. We’re biased because we’ve seen &lt;a href="https://clickhouse.com/resources/engineering/managing-petabyte-scale-logs-without-sampling" rel="noopener noreferrer"&gt;ClickHouse handle this at petabyte scale&lt;/a&gt;, but the architectural pattern matters more than the specific solution.&lt;/p&gt;

&lt;h2&gt;
  
  
  The reference architecture: AI copilot for SRE
&lt;/h2&gt;

&lt;p&gt;With the data substrate in place, the AI SRE copilot becomes a precisely describable architectural pattern.&lt;/p&gt;

&lt;p&gt;At a high level:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqa55lwk3uxutmbn1a5gq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqa55lwk3uxutmbn1a5gq.png" alt="ai-copilot-clickhouse.png" width="800" height="1008"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The key pieces are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;OpenTelemetry collector.&lt;/strong&gt; Ships logs, metrics, and traces from applications, infrastructure, and services into ClickHouse. Different ingestion tools, such as Fluent Bit, Vector, and the OpenTelemetry Collector, can all converge on the same database.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ClickHouse and the observability layer&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Logs stored in MergeTree tables with time, service, environment, and high-cardinality fields such as &lt;code&gt;user_id&lt;/code&gt; and &lt;code&gt;request_id&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Metrics stored as raw events for full fidelity, with Materialized Views used only to accelerate common queries and dashboards. This means you can always re-aggregate by a new dimension, unlike systems that force rollups upfront.&lt;/li&gt;
&lt;li&gt;Traces stored as span trees with &lt;code&gt;trace_id&lt;/code&gt;, &lt;code&gt;span_id&lt;/code&gt;, &lt;code&gt;parent_span_id&lt;/code&gt;, &lt;code&gt;service_name&lt;/code&gt;, and attributes.&lt;/li&gt;
&lt;li&gt;Context tables for deployments, feature flags, incidents, and customer signals.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A simple logs table might look like:&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;otel_logs&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;Timestamp&lt;/span&gt;            &lt;span class="n"&gt;DateTime64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;ObservedTimestamp&lt;/span&gt;    &lt;span class="n"&gt;DateTime64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="c1"&gt;-- Trace context&lt;/span&gt;
    &lt;span class="n"&gt;TraceId&lt;/span&gt;              &lt;span class="n"&gt;FixedString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;SpanId&lt;/span&gt;               &lt;span class="n"&gt;FixedString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;TraceFlags&lt;/span&gt;           &lt;span class="n"&gt;UInt8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;-- Severity&lt;/span&gt;
    &lt;span class="n"&gt;SeverityText&lt;/span&gt;         &lt;span class="n"&gt;LowCardinality&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;SeverityNumber&lt;/span&gt;       &lt;span class="n"&gt;UInt8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;-- Body&lt;/span&gt;
    &lt;span class="n"&gt;Body&lt;/span&gt;                 &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;-- Common resource attributes&lt;/span&gt;
    &lt;span class="n"&gt;ServiceName&lt;/span&gt;          &lt;span class="n"&gt;LowCardinality&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;ServiceNamespace&lt;/span&gt;     &lt;span class="n"&gt;LowCardinality&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;DeploymentEnvironment&lt;/span&gt; &lt;span class="n"&gt;LowCardinality&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;

    &lt;span class="c1"&gt;-- Remaining resource attributes&lt;/span&gt;
    &lt;span class="n"&gt;ResourceAttributes&lt;/span&gt;   &lt;span class="k"&gt;Map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="c1"&gt;-- Log attributes&lt;/span&gt;
    &lt;span class="n"&gt;LogAttributes&lt;/span&gt;        &lt;span class="k"&gt;Map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="c1"&gt;-- Scope&lt;/span&gt;
    &lt;span class="n"&gt;ScopeName&lt;/span&gt;            &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ScopeVersion&lt;/span&gt;         &lt;span class="n"&gt;String&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MergeTree&lt;/span&gt;
&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;toDate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ServiceName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;Timestamp&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ClickHouse MCP server.&lt;/strong&gt; The MCP server exposes ClickHouse to the LLM. The model doesn't receive raw credentials. Instead, it gets a catalog, a restricted SQL surface, and the ability to issue queries through a brokered channel.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In the ClickHouse experiment, the models issued between 6 and 27 SQL queries per incident investigation. That pattern works only because ClickHouse can handle that level of interactive querying across billions of rows without timing out.&lt;/p&gt;
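
&lt;p&gt;One simple way to restrict that SQL surface (a sketch, not the only option; the database name and limits are illustrative) is to give the MCP server a dedicated read-only ClickHouse user:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Illustrative: the model can query telemetry but never modify it,
-- and runaway queries are cut off by a server-side time limit.
CREATE USER copilot_reader IDENTIFIED WITH sha256_password BY 'CHANGE_ME'
SETTINGS readonly = 1, max_execution_time = 30;

GRANT SELECT ON observability.* TO copilot_reader;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;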

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;AI copilot layer.&lt;/strong&gt; The copilot translates natural language questions into structured workflows. In the simplest case, this is just an LLM plus SQL. More advanced setups add retrieval-augmented generation and agentic routing, but the core remains the same: the model iteratively queries ClickHouse, inspects results, and refines its hypothesis.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For example, an on-call engineer might ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Why did checkout error rates spike in us-east-1 in the last 20 minutes?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A legacy tool might just show you the last hour. A ClickHouse-powered copilot can generate a query that scans months of history to confirm whether this is a regression:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="k"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;first_seen&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;last_seen&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;         &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;otel_logs&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;
    &lt;span class="nb"&gt;Timestamp&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="k"&gt;DAY&lt;/span&gt;
    &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;ServiceName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'checkout'&lt;/span&gt;
    &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;SeverityText&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'ERROR'&lt;/span&gt;
    &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;Body&lt;/span&gt; &lt;span class="k"&gt;ILIKE&lt;/span&gt; &lt;span class="s1"&gt;'%PaymentTimeout%'&lt;/span&gt;
    &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;ResourceAttributes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'cloud.region'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'us-east-1'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;If the result shows &lt;code&gt;first_seen&lt;/code&gt; was 10 minutes ago, the AI knows this is a fresh regression triggered by a recent change. If &lt;code&gt;first_seen&lt;/code&gt; was 20 days ago, it directs the investigation elsewhere. This is only possible because the database can scan 30 days of logs in under a second.&lt;/p&gt;

&lt;p&gt;Fast execution combined with a rich schema makes this loop viable.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to build a context layer for root cause accuracy
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;TL;DR: LLMs don’t need more vibes; they need the same context a senior SRE would gather manually.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In that ClickHouse experiment, the engineering team intentionally used simple, naive prompts to answer a narrow question: can a large model infer root causes directly from the kind of telemetry most organizations store today? The goal was to approximate how many AI SRE integrations behave out of the box, not to showcase an ideal, hand-tuned system with a custom retrieval layer.&lt;/p&gt;

&lt;p&gt;That baseline matters. Better prompting and more sophisticated orchestration do help, and any serious deployment should use them. But they don't fix short retention, dropped dimensions, or missing context. If the database never stored the relevant history or topology, no prompt can retrieve it.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Context type&lt;/th&gt;
&lt;th&gt;Key question answered&lt;/th&gt;
&lt;th&gt;Example data stored in ClickHouse&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Deployment context&lt;/td&gt;
&lt;td&gt;"What just changed in the system?"&lt;/td&gt;
&lt;td&gt;A deployments table with &lt;code&gt;commit_sha&lt;/code&gt;, &lt;code&gt;version&lt;/code&gt;, &lt;code&gt;author&lt;/code&gt;, and &lt;code&gt;timestamp&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Service topology&lt;/td&gt;
&lt;td&gt;"How are our systems related?"&lt;/td&gt;
&lt;td&gt;Tables defining service dependency graphs, SLOs, and team ownership.*&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Incident history&lt;/td&gt;
&lt;td&gt;"Have we ever seen this before?"&lt;/td&gt;
&lt;td&gt;An archive of past incidents, RCAs, and known failure modes, searchable via SQL.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tribal knowledge&lt;/td&gt;
&lt;td&gt;"What do our senior experts know?"&lt;/td&gt;
&lt;td&gt;Vectorized embeddings of postmortems, wiki pages, and key Slack conversations for semantic search.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;*&lt;em&gt;You can generate service maps dynamically from ClickHouse trace data, but a production-grade copilot shouldn't rely on telemetry alone. During real outages, telemetry often breaks or develops gaps. A "Source of Truth" for topology keeps the AI oriented when trace flows stop.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In real systems and production experience, models become significantly more reliable when they receive structured context that mirrors how human SREs think. That context can, and should, live alongside telemetry in ClickHouse.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deployment context: what just changed?
&lt;/h3&gt;

&lt;p&gt;The copilot needs to know what changed.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Recent commits and authors&lt;/li&gt;
&lt;li&gt;Deployment events per service and environment&lt;/li&gt;
&lt;li&gt;Feature flag changes and rollout status&lt;/li&gt;
&lt;li&gt;Configuration updates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A deployment table might look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;deployments&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nb"&gt;Timestamp&lt;/span&gt;             &lt;span class="n"&gt;DateTime64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="c1"&gt;-- Service identification (matching otel_logs)&lt;/span&gt;
    &lt;span class="n"&gt;ServiceName&lt;/span&gt;           &lt;span class="n"&gt;LowCardinality&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;ServiceNamespace&lt;/span&gt;      &lt;span class="n"&gt;LowCardinality&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;ServiceVersion&lt;/span&gt;        &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="c1"&gt;-- Deployment context&lt;/span&gt;
    &lt;span class="n"&gt;DeploymentEnvironment&lt;/span&gt; &lt;span class="n"&gt;LowCardinality&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;DeploymentName&lt;/span&gt;        &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
    &lt;span class="c1"&gt;-- VCS/Git information&lt;/span&gt;
    &lt;span class="n"&gt;VcsRepositoryUrl&lt;/span&gt;      &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;VcsCommitSha&lt;/span&gt;          &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;VcsCommitAuthor&lt;/span&gt;       &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;VcsCommitMessage&lt;/span&gt;      &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;VcsBranch&lt;/span&gt;             &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;-- Deployment metadata&lt;/span&gt;
    &lt;span class="n"&gt;ChangeType&lt;/span&gt;            &lt;span class="n"&gt;LowCardinality&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;-- e.g., 'rollout', 'rollback', 'hotfix'&lt;/span&gt;
    &lt;span class="n"&gt;DeploymentStatus&lt;/span&gt;      &lt;span class="n"&gt;LowCardinality&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;-- e.g., 'success', 'failed', 'in_progress'&lt;/span&gt;
    &lt;span class="c1"&gt;-- Additional attributes&lt;/span&gt;
    &lt;span class="n"&gt;DeploymentAttributes&lt;/span&gt;  &lt;span class="k"&gt;Map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MergeTree&lt;/span&gt;
&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;toDate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Timestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ServiceName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;Timestamp&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;p&gt;The copilot can then join error spikes to specific versions and authors.&lt;/p&gt;
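
&lt;p&gt;For example, an illustrative query against the two schemas above can rank recent deployments by the errors that followed them:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Sketch: which deployment immediately preceded the error spike?
SELECT
    d.ServiceVersion,
    d.VcsCommitSha,
    d.VcsCommitAuthor,
    d.Timestamp      AS deployed_at,
    count()          AS errors_in_first_hour
FROM otel_logs AS l
INNER JOIN deployments AS d
    ON l.ServiceName = d.ServiceName
WHERE
    l.SeverityText = 'ERROR'
    AND l.Timestamp &amp;gt;= d.Timestamp
    AND l.Timestamp &amp;lt; d.Timestamp + INTERVAL 1 HOUR
GROUP BY d.ServiceVersion, d.VcsCommitSha, d.VcsCommitAuthor, deployed_at
ORDER BY errors_in_first_hour DESC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;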

&lt;h3&gt;
  
  
  Service topology and ownership: how are our systems related?
&lt;/h3&gt;

&lt;p&gt;Topology and ownership are crucial for avoiding topology blindness, in which an agent focuses on the local failing service and misses the upstream dependency that actually failed. Useful context includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A service dependency graph capturing caller, callee, and protocol&lt;/li&gt;
&lt;li&gt;Ownership mappings from service to team or pager group&lt;/li&gt;
&lt;li&gt;SLA/SLO definitions per service and endpoint&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In ClickHouse, simple relational tables can represent this data, and the copilot can traverse it with &lt;a href="https://clickhouse.com/docs/sql-reference/statements/select/with" rel="noopener noreferrer"&gt;recursive common table expressions&lt;/a&gt; over the dependency and trace data when multi-hop context is required.&lt;/p&gt;
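
&lt;p&gt;As a sketch, assuming a hypothetical &lt;code&gt;service_dependencies(caller, callee)&lt;/code&gt; table and a recent ClickHouse version with &lt;code&gt;WITH RECURSIVE&lt;/code&gt; support:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Illustrative: every upstream service that 'checkout' transitively
-- depends on, so the agent looks beyond the locally failing service.
WITH RECURSIVE upstream AS
(
    SELECT callee
    FROM service_dependencies
    WHERE caller = 'checkout'

    UNION ALL

    SELECT d.callee
    FROM service_dependencies AS d
    INNER JOIN upstream AS u ON d.caller = u.callee
)
SELECT DISTINCT callee FROM upstream
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;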

&lt;h3&gt;
  
  
  Historical patterns and incidents: &lt;em&gt;Have we seen this before?&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Large models excel at pattern recognition when given representative examples.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Past incidents with similar metrics and log signatures&lt;/li&gt;
&lt;li&gt;Known failure modes and playbook snippets per service&lt;/li&gt;
&lt;li&gt;Mappings from root cause to remediation steps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of this can be indexed by &lt;code&gt;service_name&lt;/code&gt; and tags, then retrieved through SQL before the model generates a summary or suggestion.&lt;/p&gt;
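
&lt;p&gt;A retrieval step might look like this (the &lt;code&gt;incidents&lt;/code&gt; table and its columns are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Sketch: pull the most recent prior incidents matching the current
-- service and failure signature before the model summarizes.
SELECT incident_id, started_at, root_cause, remediation
FROM incidents
WHERE
    service_name = 'checkout'
    AND hasAny(tags, ['payment_timeout', 'us-east-1'])
ORDER BY started_at DESC
LIMIT 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;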

&lt;h3&gt;
  
  
  Context beyond traces and dashboards: &lt;em&gt;What do our experts know?&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;During real incidents at Confluent, we didn't rely only on dashboards. We constantly hunted through Slack and other systems for soft signals.&lt;/p&gt;

&lt;p&gt;Typical questions were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Has this happened before?&lt;/li&gt;
&lt;li&gt;Who last touched this service?&lt;/li&gt;
&lt;li&gt;Where is the postmortem from the similar outage last quarter?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We leaned heavily on previous incident logs, deployment announcements, and other ongoing incidents documented in chat. That tribal knowledge was critical, and it lived in unstructured text scattered across tools.&lt;/p&gt;

&lt;p&gt;The context layer in ClickHouse isn't optional for an AI SRE. Storing logs and metrics isn't enough. You also need to ingest:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deployment history enriched with commit messages and rollout notes&lt;/li&gt;
&lt;li&gt;Incident archives and postmortems&lt;/li&gt;
&lt;li&gt;Summaries of Slack threads or incident channels&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Implementation Note: Ingest vs. Federation (MCP)
&lt;/h3&gt;

&lt;p&gt;You don't have to ETL everything into ClickHouse. You can use the &lt;a href="https://modelcontextprotocol.io/docs/getting-started/intro" rel="noopener noreferrer"&gt;Model Context Protocol&lt;/a&gt; (MCP) to enable the agent to query external tools (such as ArgoCD, GitHub, or Incident.io) directly.&lt;/p&gt;

&lt;p&gt;This federated approach works well for data with complex permissions (such as Slack history) or data that changes constantly (such as "Who is on call right now?").&lt;/p&gt;

&lt;p&gt;For the core correlation loop (connecting a metric spike to a deployment timestamp), co-locating the data in ClickHouse improves performance. The model can run a single SQL query that joins millions of log lines against deployment events, rather than making slow API calls to five different tools and trying to correlate them in the context window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The ideal mix?&lt;/strong&gt; Use MCP for "soft" context (Slack, Docs, PagerDuty state) and ClickHouse for "hard" context (Logs, Metrics, Deployment Events) that requires high-performance joins.&lt;/p&gt;

&lt;p&gt;By having access to this context, the AI SRE starts to behave like the senior engineer who remembers the strange storage regression from six months ago. The copilot automates the retrieval of tribal knowledge that would otherwise vanish as people rotate off the team.&lt;/p&gt;

&lt;h3&gt;
  
  
  How rich context creates accurate insights
&lt;/h3&gt;

&lt;p&gt;Without context, an incident query looks like:&lt;/p&gt;

&lt;p&gt;"Users are reporting checkout failures. Find the root cause."&lt;/p&gt;

&lt;p&gt;An effective AI SRE does more than pass raw text to a model. First, the orchestration layer queries ClickHouse for recent deployments, dependency health, and historical parallels. It then builds a grounded prompt, a data-rich set of instructions that gives the LLM everything it needs to respond accurately, zero-shot:&lt;/p&gt;

&lt;p&gt;The grounded prompt (synthesized by the copilot engine): "Users are reporting checkout failures. The payment service had a deployment 47 minutes ago, commit abc123 by engineer X, version 2.3.7, in us-east-1. The payment service depends on the fraud service, which has shown elevated latency for the past 50 minutes. A similar incident occurred on 2025-03-15 with error code PAYMENT_TIMEOUT, where the root cause was cache saturation in the fraud service. Investigate the likely root cause."&lt;/p&gt;

&lt;p&gt;The difference isn't stylistic. The second prompt encodes concrete facts from ClickHouse tables: deployments, traces, incidents, and metrics. The model grounds itself in the same information an experienced SRE would gather manually, which substantially increases the odds of a correct explanation.&lt;/p&gt;

&lt;h2&gt;
  
  
  On-call engineers' workflow: before vs after an AI SRE
&lt;/h2&gt;

&lt;p&gt;This system doesn't replace the on-call engineer. It supports the engineer when they wake up at 2 am to a buzzing pager.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ymuswiei14igtrdwdit.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ymuswiei14igtrdwdit.png" alt="traditional_ai_copilot.png" width="800" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The aftermath: Automating RCA and knowledge sharing
&lt;/h3&gt;

&lt;p&gt;Reliability work doesn't end when the fire is out. At Confluent, a significant portion of on-call teams' time went into the aftermath: writing the root cause analysis document, assembling timelines, and ensuring learnings spread across a large, shared on-call rotation.&lt;/p&gt;

&lt;p&gt;That work matters, but it's also repetitive. Engineers dig through shell history, query history, Slack channels, and dashboards to reconstruct what actually happened.&lt;/p&gt;

&lt;p&gt;With a ClickHouse-based copilot, the system already has most of that information.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It knows which queries ran during the incident and in what order.&lt;/li&gt;
&lt;li&gt;It sees which services, regions, and customers were impacted.&lt;/li&gt;
&lt;li&gt;It can correlate the mitigation actions with metrics returning to normal.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because the copilot tracks the investigation in real time, it can draft the RCA for you. Instead of an engineer spending two hours reconstructing the incident, the copilot can generate a structured report with timelines, contributing factors, impact analysis, and links to relevant data.&lt;/p&gt;

&lt;p&gt;The on-call engineer reviews and corrects the draft rather than starting from a blank page.&lt;/p&gt;

&lt;p&gt;This also helps address a dissemination problem I repeatedly saw: learnings spread poorly across engineers, especially when a shared on-call rotation exists and the first-line-of-defense team changes over time.&lt;/p&gt;

&lt;p&gt;When every incident produces a consistent, machine-readable RCA stored in ClickHouse, the entire organization becomes easier to search and easier to teach. The next on-call can ask the copilot, "Have we seen something like this before?" and immediately get back prior incidents, their timelines, and their fixes.&lt;/p&gt;

&lt;p&gt;Independent research on production systems such as &lt;a href="https://dl.acm.org/doi/10.1145/3627703.3629553" rel="noopener noreferrer"&gt;Microsoft's RCACopilot&lt;/a&gt; has already demonstrated that this pattern can significantly increase root-cause accuracy and shorten investigation time when the retrieval layer is well-designed and grounded in current telemetry.&lt;/p&gt;

&lt;p&gt;My view aligns with those results. Use large language models to assist investigations, summarize findings, draft updates, and suggest next steps while engineers stay in control through a fast, searchable observability stack.&lt;/p&gt;

&lt;p&gt;The database keeps real-time and historical data available. The copilot handles correlation and narrative. The human makes the final call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The copilot doesn’t replace the on-call. It just lets them start on page 5 instead of page 1.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Going upstream: from faster RCA to fewer incidents
&lt;/h3&gt;

&lt;p&gt;Once logs, metrics, traces, deployment context, topology, and customer signals live in a single, fast, queryable layer, something essential shifts. The same data foundation that powers an AI SRE copilot for incidents also supports upstream reliability work.&lt;/p&gt;

&lt;p&gt;Several capabilities become feasible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-merge risk analysis.&lt;/strong&gt; By correlating past incidents with specific code patterns, services, and deployment characteristics, teams can build models that flag risky pull requests before they are merged. The signal is learned directly from production history stored in ClickHouse, rather than from generic heuristics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Proactive pattern detection.&lt;/strong&gt; Queries that currently run during incidents can run continuously in the background. When a pattern that historically led to an outage reappears, the system can notify engineers before an incident occurs, giving them time to act.&lt;/p&gt;
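
&lt;p&gt;For instance, a known precursor pattern can be encoded as a scheduled check (a hedged sketch; thresholds and attribute names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Illustrative: if the "slow fraud calls plus rising checkout errors"
-- precursor reappears, this returns a row and alerting fires before
-- the outage does.
SELECT
    countIf(ServiceName = 'fraud'
            AND toFloat64OrZero(LogAttributes['duration_ms']) &amp;gt; 500) AS slow_fraud_calls,
    countIf(ServiceName = 'checkout'
            AND SeverityText = 'ERROR')                              AS checkout_errors
FROM otel_logs
WHERE Timestamp &amp;gt;= now() - INTERVAL 10 MINUTE
HAVING slow_fraud_calls &amp;gt; 100 AND checkout_errors &amp;gt; 50
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;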

&lt;p&gt;&lt;strong&gt;Customer-centric reliability.&lt;/strong&gt; Because customer and business impact live in the same database as telemetry, reliability work can be prioritized based on actual user pain rather than generic error counts. The copilot can answer questions such as "Which reliability issues affected our top ten customers this quarter?" or "Which services generate the most support tickets?"&lt;/p&gt;

&lt;p&gt;This upstream angle is also what many AI SRE offerings miss today. If the only lever is incident response, the system remains perpetually reactive. When the observability foundation itself is AI-native, and ClickHouse serves as the core cognitive infrastructure, the organization can start reducing incident volume, not just resolving incidents faster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: build the foundation, own the future of operations
&lt;/h2&gt;

&lt;p&gt;The pattern is clear.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Build a cost-efficient, high-fidelity observability store on ClickHouse.&lt;/li&gt;
&lt;li&gt;Add a rich context layer for deployments, topology, incidents, and customers, including the soft signals that currently live in Slack and documents.&lt;/li&gt;
&lt;li&gt;Expose that substrate to an LLM copilot through MCP and carefully constrained SQL.&lt;/li&gt;
&lt;li&gt;Start with on-call assistance, then extend those capabilities to code review, testing, and planning.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The teams that make this architectural shift won't just have better AI SRE tools. They'll have a different reliability posture altogether, where incident response becomes a fallback rather than the default operating mode.&lt;/p&gt;

&lt;p&gt;If your AI SRE project is stuck, don’t start with a new model. &lt;a href="https://console.clickhouse.cloud/signUp" rel="noopener noreferrer"&gt;Start with a new database&lt;/a&gt;. Once observability is cheap, high-fidelity, and queryable, the copilot finally has something to be smart about.&lt;/p&gt;

</description>
      <category>sre</category>
      <category>ai</category>
      <category>devops</category>
      <category>observability</category>
    </item>
    <item>
      <title>Agentforce Actions Guide (2026): Native Flows vs. MuleSoft vs. External Services</title>
      <dc:creator>Manveer Chawla</dc:creator>
      <pubDate>Sun, 21 Dec 2025 00:09:43 +0000</pubDate>
      <link>https://forem.com/composiodev/agentforce-actions-guide-2026-native-flows-vs-mulesoft-vs-external-services-3e6p</link>
      <guid>https://forem.com/composiodev/agentforce-actions-guide-2026-native-flows-vs-mulesoft-vs-external-services-3e6p</guid>
      <description>&lt;p&gt;Salesforce's &lt;a href="https://www.salesforce.com/agentforce/how-it-works/" rel="noopener noreferrer"&gt;Agentforce&lt;/a&gt; runs on the Atlas Reasoning Engine, operating in a &lt;strong&gt;reason → act → observe&lt;/strong&gt; loop to complete real business work, not just answer questions. If your agent lives entirely inside Salesforce (updating an Opportunity, querying Data Cloud), you can ship fast.&lt;/p&gt;

&lt;p&gt;Most deployments stall when the agent needs to &lt;strong&gt;interact with&lt;/strong&gt; the rest of your stack: post to Slack, update Jira/Linear, check a GitHub PR, schedule a meeting, or create a PagerDuty incident. That's where "Actions" become the bottleneck.&lt;/p&gt;

&lt;p&gt;Salesforce documentation will point you toward Flow HTTP Callouts, their Standard Actions library, or MuleSoft. Each presents significant trade-offs for engineering teams trying to move fast without breaking the bank or their security posture.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What Is an Agentforce "Action"?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;An &lt;strong&gt;Agentforce Action&lt;/strong&gt; is a callable tool that the Atlas engine can invoke to execute a step in its plan. Actions run either within Salesforce (native operations) or in external systems via External Services actions generated from OpenAPI specifications.&lt;/p&gt;
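
&lt;p&gt;For illustration, here is a minimal, hypothetical OpenAPI 3.0 fragment of the kind External Services can import; each &lt;code&gt;operationId&lt;/code&gt; becomes a discoverable, callable action:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Hypothetical minimal spec: one operation that becomes an agent action.
openapi: 3.0.0
info:
  title: Ticketing API
  version: "1.0"
paths:
  /tickets/{id}/status:
    get:
      operationId: getTicketStatus
      summary: Get the current status of a ticket
      parameters:
        - name: id
          in: path
          required: true
          schema:
            type: string
      responses:
        "200":
          description: Ticket status
          content:
            application/json:
              schema:
                type: object
                properties:
                  status:
                    type: string
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;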

&lt;p&gt;Reasoning is only helpful if the agent can reliably execute actions, securely and with the proper permissions, across your SaaS ecosystem.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Key Takeaways&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bottleneck:&lt;/strong&gt; Agentforce can reason, but deployments fail when agents can't reliably execute &lt;strong&gt;external actions&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native tradeoffs:&lt;/strong&gt; Flow HTTP Callouts + per-user Named Credentials add ongoing auth/admin burden. Schema changes can break flows.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coverage gaps:&lt;/strong&gt; &lt;a href="https://help.salesforce.com/s/articleView?id=release-notes.rn_einstein_platform.htm&amp;amp;language=en_US&amp;amp;release=258&amp;amp;type=5" rel="noopener noreferrer"&gt;Standard Actions&lt;/a&gt; and connectors rarely cover the long tail or deep endpoints.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best pattern:&lt;/strong&gt; Curated OpenAPI specs imported into &lt;strong&gt;External Services&lt;/strong&gt; create discoverable Agentforce actions without shipping brittle glue code.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Solution:&lt;/strong&gt; Composio acts as a bridge, generating secure, OpenAPI-compliant connectors that import directly into Salesforce as Actions.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Result:&lt;/strong&gt; You deploy deeply integrated agents in days, accessing the "long tail" of SaaS tools with user-level security strictly enforced.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Agentforce Actions Bottleneck (Why Most Deployments Stall)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://www.salesforce.com/agentforce/what-is-a-reasoning-engine/atlas/" rel="noopener noreferrer"&gt;Atlas engine&lt;/a&gt; uses a sophisticated reasoning loop. Atlas evaluates a user query, plans a path, and looks for tools to execute that plan. If your agent lives entirely within the CRM (updating Opportunities or querying &lt;a href="https://trailhead.salesforce.com/content/learn/modules/data-cloud-powered-agentforce/explore-data-cloud-and-agentforce" rel="noopener noreferrer"&gt;Data Cloud&lt;/a&gt;), you're fine.&lt;/p&gt;

&lt;p&gt;Friction arises when business logic bleeds outside the Salesforce ecosystem. A Sales Agent is useless if it can't schedule a Google Calendar meeting. A Support Agent fails if it can't check a Linear ticket status.&lt;/p&gt;

&lt;p&gt;To bridge this "actions gap," you have three choices:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Flow HTTP Callouts / Apex&lt;/strong&gt; (the management + brittleness trap)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Standard Actions / MuleSoft&lt;/strong&gt; (breadth vs depth vs velocity tradeoffs)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generic iPaaS&lt;/strong&gt; (the security context gap)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each of these standard patterns falls short of Agentforce's specific needs in a different way.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Agentforce Integration Options Compared&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Option&lt;/th&gt;
&lt;th&gt;Time-to-Ship&lt;/th&gt;
&lt;th&gt;Long-Tail Breadth&lt;/th&gt;
&lt;th&gt;Endpoint Depth&lt;/th&gt;
&lt;th&gt;Per-User Security&lt;/th&gt;
&lt;th&gt;Operational Burden&lt;/th&gt;
&lt;th&gt;Typical Failure Mode&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Flow HTTP Callouts + Apex&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;High (DIY)&lt;/td&gt;
&lt;td&gt;High (DIY)&lt;/td&gt;
&lt;td&gt;Possible (hard)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;High&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Schema mapping breaks; auth sprawl&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Standard Actions library&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;Low–Medium&lt;/td&gt;
&lt;td&gt;Low–Medium&lt;/td&gt;
&lt;td&gt;Depends&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Missing endpoint or workflow nuance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MuleSoft&lt;/td&gt;
&lt;td&gt;Slow–Medium&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Medium–High&lt;/td&gt;
&lt;td&gt;Strong (if designed)&lt;/td&gt;
&lt;td&gt;Medium–High&lt;/td&gt;
&lt;td&gt;Overkill for long-tail; slow iteration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generic iPaaS (system key)&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Weak&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;"System user" data leakage risk&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Curated OpenAPI → External Services (Composio)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Strong&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Low–Medium&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mis-scoped actions or governance gaps&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why Standard Integration Patterns Fail Agentforce&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Flow HTTP Callouts &amp;amp; Apex (The Management Trap)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Salesforce has improved here. You no longer need to write raw auth code in Apex. You can use &lt;strong&gt;Named Credentials&lt;/strong&gt; to handle the OAuth handshake, and Flow HTTP Callouts allow &lt;a href="https://architect.salesforce.com/fundamentals/agent-development-lifecycle" rel="noopener noreferrer"&gt;no-code integration&lt;/a&gt;. This works well for simple, system-to-system connections.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Friction:&lt;/strong&gt; The problem isn't the handshake. It's the User Context.&lt;/p&gt;

&lt;p&gt;Agentforce requires agents to act as the user (e.g., "Post to Slack as Sarah"). To achieve this natively, you must configure "Per-User" Named Credentials. This requires setting up individual Auth Providers and managing granular scopes for every external tool.&lt;/p&gt;

&lt;p&gt;Scaling this is painful.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Schema Complexity:&lt;/strong&gt; Flow HTTP Callouts often choke on the massive, nested JSON schemas returned by modern APIs like Jira or GitHub.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintenance:&lt;/strong&gt; You still manually map inputs and outputs. If the external API changes its schema, your Flow breaks, and your agent fails.&lt;/li&gt;
&lt;/ul&gt;
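&lt;p&gt;The maintenance problem is easier to see with a lightweight contract check. The sketch below (field names are illustrative) fails loudly when an external API response no longer matches the shape a Flow mapping expects, instead of letting the agent fail silently:&lt;/p&gt;

```python
# Sketch: detect schema drift in an external API response before it
# reaches a brittle Flow mapping. Field names are illustrative.
EXPECTED_TICKET_CONTRACT = {
    "key": str,
    "fields": dict,   # the nested object Flows often choke on
}

def check_contract(payload: dict, contract: dict) -> list[str]:
    """Return human-readable drift errors (empty list means OK)."""
    errors = []
    for field, expected_type in contract.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}, "
                          f"got {type(payload[field]).__name__}")
    return errors

# A response that silently renamed "fields" would break a Flow mapping;
# the check surfaces it as a structured error instead.
ok = check_contract({"key": "PROJ-1", "fields": {"status": "Open"}},
                    EXPECTED_TICKET_CONTRACT)
drifted = check_contract({"key": "PROJ-1", "field_data": {}},
                         EXPECTED_TICKET_CONTRACT)
print(ok)        # []
print(drifted)   # ['missing field: fields']
```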

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Standard Actions &amp;amp; MuleSoft (Breadth vs. Depth vs. Velocity)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Salesforce is rapidly building a library of "Standard Actions" for major platforms (like Jira or ServiceNow). If the standard action does exactly what you need, use it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Reality:&lt;/strong&gt; Enterprise workflows rarely fit "Standard" boxes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Breadth Gap (The Long Tail):&lt;/strong&gt; Your organization likely uses 50+ tools: Linear, Notion, PagerDuty, Brex, and Asana. Salesforce won't build native connectors for all of them. Spinning up a &lt;a href="https://www.salesforce.com/agentforce/dev-tools/" rel="noopener noreferrer"&gt;MuleSoft project&lt;/a&gt; for these "long tail" apps kills velocity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Depth Gap:&lt;/strong&gt; Standard connectors often expose only the top 10% of API endpoints (e.g., "Create Ticket"). If your agent needs to "Update a Custom Field" or "Fetch Transition History," and that endpoint isn't exposed, you're back to square one: building it yourself.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. The Security Gap (The Context Problem)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This is the most critical failure point of generic iPaaS tools (like Zapier). These tools typically rely on a "System User," one API key that rules them all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Risk:&lt;/strong&gt; Imagine an intern asks the Agent, "What are the Q3 strategic risks?" The Agent, reasoning that it needs to check documentation, uses a System Key for Google Drive to search for "Risks." Because the System Key has admin access, the Agent pulls a confidential M&amp;amp;A document that the intern should never see.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Requirement:&lt;/strong&gt; You need strict &lt;strong&gt;User-Level OAuth&lt;/strong&gt;. The agent must act &lt;em&gt;as the user&lt;/em&gt;, respecting their specific permissions in the external tool.&lt;/p&gt;
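&lt;p&gt;The difference between the two models fits in a few lines. This sketch (user IDs and token values are hypothetical) shows user-level credential resolution that fails closed, instead of falling back to an admin-scoped system key:&lt;/p&gt;

```python
# Sketch: per-user credential resolution. A "system key" lookup would
# return one admin credential for everyone; this lookup fails closed
# when the requesting user has no connected account. Names illustrative.
class NotConnectedError(Exception):
    pass

USER_TOKENS = {
    # user_id -> token scoped to what *that* user may access
    "sarah": "xoxo-sarah-scoped-token",
}

def resolve_credential(user_id: str) -> str:
    """Return the token for this specific user, never a shared superuser key."""
    token = USER_TOKENS.get(user_id)
    if token is None:
        # Fail closed: prompt the user to connect rather than escalate
        raise NotConnectedError(f"{user_id} has not connected this tool")
    return token

print(resolve_credential("sarah"))  # xoxo-sarah-scoped-token
```

An intern who never connected Google Drive gets a "please connect" error, not an admin's view of confidential documents.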

&lt;h2&gt;
  
  
  &lt;strong&gt;The Composio Approach: Managed Actions for Agentforce&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Composio solves the "Hands" problem by providing a specialized integration layer designed for AI agents. Composio doesn't replace your MuleSoft backbone. It handles the agile, long-tail SaaS connections that your agents need &lt;em&gt;now&lt;/em&gt;, with 100% API coverage.&lt;/p&gt;

&lt;p&gt;Composio maps Salesforce Users to External Identities and then generates standard &lt;strong&gt;OpenAPI Specifications&lt;/strong&gt; that Salesforce natively supports.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Workflow:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Connect &amp;amp; Curate:&lt;/strong&gt; A developer selects the tools (e.g., &lt;a href="https://docs.composio.dev/toolkits/github" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, &lt;a href="https://docs.composio.dev/toolkits/slack" rel="noopener noreferrer"&gt;Slack&lt;/a&gt;) and specifically selects which actions to expose.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Export:&lt;/strong&gt; Composio generates a &lt;strong&gt;curated, strictly typed OpenAPI Spec optimized for Salesforce limits&lt;/strong&gt;. This avoids the "spec bloat" errors common when importing massive raw API definitions into External Services.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Import:&lt;/strong&gt; The developer imports this optimized spec into Salesforce "External Services."
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deploy:&lt;/strong&gt; Agentforce immediately recognizes these new endpoints as "Actions" available to the Atlas Engine.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You don't fight with Flow JSON parsers or configure Auth Providers. You import capabilities.&lt;/p&gt;
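&lt;p&gt;The "Connect &amp;amp; Curate" and "Export" steps above can be sketched as an allowlist filter over a raw spec. The helper function and operations below are illustrative, not Composio's actual implementation:&lt;/p&gt;

```python
# Sketch: curate a large raw OpenAPI spec down to an allowlist of
# operations before importing it into External Services. The structure
# is standard OpenAPI; the operations themselves are illustrative.
def curate_spec(raw_spec: dict, allowed_operation_ids: set[str]) -> dict:
    curated = {**raw_spec, "paths": {}}
    for path, methods in raw_spec.get("paths", {}).items():
        kept = {m: op for m, op in methods.items()
                if op.get("operationId") in allowed_operation_ids}
        if kept:
            curated["paths"][path] = kept
    return curated

raw = {
    "openapi": "3.0.0",
    "paths": {
        "/messages": {"post": {"operationId": "Slack_PostMessage"}},
        "/admin/users": {"delete": {"operationId": "Slack_DeleteUser"}},
    },
}
curated = curate_spec(raw, {"Slack_PostMessage"})
print(sorted(curated["paths"]))  # ['/messages'] -- the risky admin op is gone
```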

&lt;h2&gt;
  
  
  &lt;strong&gt;Deep Dive: The "Deal Room" Workflow&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here's a real-world scenario that shows how this architecture works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: A Sales Rep tells the agent:&lt;/p&gt;

&lt;p&gt;&amp;gt; "Update the Acme deal stage and notify the solutions team channel."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Reasoning (The Brain)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Atlas Engine analyzes the intent. Atlas identifies two distinct tasks: update a Salesforce Object and send a message to Slack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Data Access (Internal)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Agent uses standard Salesforce permissions to update the Opportunity record.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: External Action (The Hands)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Agent recognizes it needs to send a message. The Agent calls the Composio_Slack_PostMessage action defined in External Services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Authentication (The Passthrough)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The request routes through Composio. Composio operates as a secure passthrough execution layer.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Composio identifies that &lt;em&gt;Sales Rep Sarah&lt;/em&gt; triggered the agent.
&lt;/li&gt;
&lt;li&gt;Composio injects Sarah's specific OAuth credentials into the header in-flight.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-Data Retention:&lt;/strong&gt; The payload (message content) passes through our encryption tunnel without being &lt;strong&gt;stored&lt;/strong&gt; in our databases. We maintain SOC 2 compliance and a zero-knowledge architecture for your payload data.&lt;/li&gt;
&lt;/ul&gt;
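&lt;p&gt;In pseudocode, the in-flight injection looks like this (the function name and token values are hypothetical, not Composio's actual code): the request carries only a user identity, and the execution layer swaps it for that user's OAuth token just before the outbound call.&lt;/p&gt;

```python
# Sketch of in-flight credential injection: the agent's request carries
# a user identity, which is swapped for that user's OAuth token on the
# way out. Names and token values are illustrative.
def inject_credentials(request: dict, token_store: dict) -> dict:
    user_id = request["user_id"]
    token = token_store[user_id]          # per-user, never a shared key
    headers = dict(request.get("headers", {}))
    headers["Authorization"] = f"Bearer {token}"
    # Forward the payload untouched; nothing is persisted here
    return {"headers": headers, "body": request["body"]}

outbound = inject_credentials(
    {"user_id": "sarah",
     "body": {"channel": "#solutions", "text": "Acme deal updated"}},
    {"sarah": "sarah-oauth-token"},
)
print(outbound["headers"]["Authorization"])  # Bearer sarah-oauth-token
```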

&lt;p&gt;&lt;strong&gt;Step 5: Observation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Composio returns a success flag or a structured error message. Atlas confirms the action to the user.&lt;/p&gt;

&lt;p&gt;&amp;gt; Note on data handling: In production deployments, enterprises often require configurable logging/retention. Composio supports enterprise security controls (e.g., SOC 2) and can minimize or avoid payload logging depending on policy and environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why "User-Level" Auth Is Non-Negotiable&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In an autonomous agent world, &lt;strong&gt;Least Privilege&lt;/strong&gt; is the only defense against data leaks. When you use "System" credentials (common in generic integrations), you bypass the security controls of your external SaaS tools.&lt;/p&gt;

&lt;p&gt;Composio creates a 1:1 map between the Salesforce User and the External Tool Identity. This ensures your Agent never defaults to "superuser." You can confidently deploy agents that interact with sensitive systems like &lt;a href="https://composio.dev/tools/jira" rel="noopener noreferrer"&gt;Jira&lt;/a&gt;, GitHub, or HRIS platforms, knowing they can't exceed the permissions of the human commanding them.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Agentforce Actions Deployment Checklist (Production-Ready)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Use this checklist to avoid the most common "works in sandbox, fails in prod" issues:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Choose identity model:&lt;/strong&gt; Default to &lt;strong&gt;per-user auth&lt;/strong&gt; for least privilege.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Define an action allowlist:&lt;/strong&gt; Only expose the actions your agent should use (principle of minimal capability).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use curated OpenAPI:&lt;/strong&gt; Prefer a spec that's typed, minimal, and importer-friendly. Avoid massive raw API definitions.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validate schema + limits:&lt;/strong&gt; Confirm request/response shapes won't break your runtime (large nested objects, optional fields, pagination).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add policy controls:&lt;/strong&gt; Logging/retention, audit trails, environment separation, and secrets handling should match security requirements.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fail safely:&lt;/strong&gt; Ensure errors return structured messages so Atlas can recover (e.g., by asking for missing fields, retrying, or escalating).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measure cost + reliability:&lt;/strong&gt; Track action success rate, retries, and timeouts. Metered actions mean reliability affects ROI.&lt;/li&gt;
&lt;/ol&gt;
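&lt;p&gt;The "fail safely" item above deserves a sketch. Structured action results let a reasoning engine recover (ask for missing fields, retry, or escalate) instead of parsing free-text errors; the shape below is illustrative, not a documented Agentforce contract:&lt;/p&gt;

```python
# Sketch: a structured action result an agent planner can act on.
# The field names and error codes are illustrative.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ActionResult:
    ok: bool
    error_code: Optional[str] = None
    missing_fields: list[str] = field(default_factory=list)

def post_message(args: dict) -> ActionResult:
    required = ["channel", "text"]
    missing = [f for f in required if f not in args]
    if missing:
        # Recoverable: the planner can prompt the user for these fields
        return ActionResult(ok=False, error_code="MISSING_FIELDS",
                            missing_fields=missing)
    return ActionResult(ok=True)

result = post_message({"channel": "#solutions"})
print(result.error_code, result.missing_fields)  # MISSING_FIELDS ['text']
```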

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Agentforce represents the future of CRM, but connectivity constraints limit it. The Atlas Reasoning Engine is ready to run, but it needs tools to execute effectively.&lt;/p&gt;

&lt;p&gt;Don't let your agents fail because of schema errors in Flow, wait weeks for a MuleSoft connector, or hit a dead end because a Standard Action lacks the specific API endpoint you need.&lt;/p&gt;

&lt;p&gt;Build the brain in Salesforce. Let Composio handle the hands.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Frequently Asked Questions&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Does Composio store my data?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;No. Composio operates as a passthrough execution layer. We manage authentication tokens (encrypted at rest), but we don't store the action payloads (inputs or outputs) your agents execute. Our logs can be configured for zero-knowledge, ensuring compliance with strict data-residency and privacy requirements (SOC 2).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why do I need Composio if Salesforce has "Standard Actions"?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Salesforce Standard Actions work well for everyday use cases on major platforms (e.g., "Create Jira Ticket"). If you need to access the "Long Tail" of apps (Notion, Linear, PagerDuty) or need "Depth" (e.g., accessing a specific, non-standard API endpoint in Jira), Standard Actions fall short. Composio provides connectors for hundreds of apps and exposes their full API surface, not just the most common actions.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How does Composio connect to Salesforce Agentforce?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Composio exports standard OpenAPI specifications. You import these specs directly into Salesforce "External Services." This creates native "Actions" that the Agentforce Atlas Engine can discover and use immediately.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Does this consume Salesforce Flex Credits?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Yes, the Agent's action execution consumes Flex Credits, just like any other action. Because Composio connectors are maintained and strictly typed, you avoid wasting credits on failed API calls, retries, or "hallucinated" parameters that often occur with brittle home-grown integrations. Pricing is standardized at &lt;a href="https://www.salesforce.com/news/press-releases/2025/05/15/agentforce-flexible-pricing-news/" rel="noopener noreferrer"&gt;$0.10 USD per action&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How does this handle security and permissions?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Composio uses "User-Level" authentication. When an agent executes an action, it uses the authentication credentials of the specific Salesforce user interacting with the agent. This ensures the agent never accesses data or performs actions that the human user isn't authorized to do in the external system.&lt;/p&gt;

</description>
      <category>tooling</category>
      <category>agents</category>
      <category>ai</category>
      <category>salesforce</category>
    </item>
    <item>
      <title>Outgrowing Zapier, Make, and n8n for AI Agents: The Production Migration Blueprint</title>
      <dc:creator>Manveer Chawla</dc:creator>
      <pubDate>Sat, 20 Dec 2025 23:53:59 +0000</pubDate>
      <link>https://forem.com/composiodev/outgrowing-zapier-make-and-n8n-for-ai-agents-the-production-migration-blueprint-5g4j</link>
      <guid>https://forem.com/composiodev/outgrowing-zapier-make-and-n8n-for-ai-agents-the-production-migration-blueprint-5g4j</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;TL;DR: When to Move Off Make/Zapier/n8n for an AI Agent&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&amp;gt; Quick answer:&lt;/strong&gt; Move off Zapier/Make/n8n when your agent is customer-facing and must act safely under uncertainty—per-user OAuth, idempotent retries, rate-limit backoff, DLQ, and end-to-end tracing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you’re building an internal assistant →&lt;/strong&gt; stay on Zapier/Make/n8n&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you’re shipping a SaaS agent with “Connect your account” →&lt;/strong&gt; migrate&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If actions have irreversible side effects →&lt;/strong&gt; migrate&lt;/p&gt;

&lt;p&gt;Stay on Make/Zapier/n8n when the workload is &lt;strong&gt;internal&lt;/strong&gt;, &lt;strong&gt;low-stakes&lt;/strong&gt;, and &lt;strong&gt;deterministic&lt;/strong&gt; (see our list of &lt;a href="https://composio.dev/blog/zapier-alternatives" rel="noopener noreferrer"&gt;Zapier alternatives&lt;/a&gt; if you need more robust engineering controls).&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Core Problem in One Sentence&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Workflow automation tools orchestrate steps. Production agents need an &lt;strong&gt;action plane&lt;/strong&gt; that governs tool calling under uncertainty.&lt;/p&gt;

&lt;p&gt;Make, Zapier, and n8n work well for proving that an agent &lt;em&gt;can&lt;/em&gt; trigger real-world actions. Most teams start there because it's fast: wire up a few steps, get the demo working, ship a prototype.&lt;/p&gt;

&lt;p&gt;The ceiling appears when you try to turn that prototype into a product. The agent becomes non-deterministic, traffic becomes bursty, actions become security-critical, and suddenly you need guarantees the workflow abstraction can't provide: &lt;strong&gt;safe retries, precise tool contracts, per-user auth, and traceability across the thought→action loop&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;n8n can push the ceiling further with self-hosting and code nodes. But once you need &lt;strong&gt;per-user OAuth&lt;/strong&gt;, &lt;strong&gt;tool schemas optimized for LLMs&lt;/strong&gt;, and &lt;strong&gt;safe execution semantics&lt;/strong&gt;, you still end up rebuilding an action plane.&lt;/p&gt;

&lt;p&gt;This post targets developers who have already hit that ceiling. We'll name the specific failure modes you're seeing in Make/Zapier/n8n, define the production requirements of a real &lt;strong&gt;agent action layer&lt;/strong&gt;, and show how Composio provides that layer so you can ship production agents without building the entire execution/auth/observability stack from scratch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Still deciding which category you need (iPaaS vs Zapier/Make vs agent-native)?&lt;/strong&gt; Read our overview first: &lt;a href="https://composio.dev/blog/ai-agent-integration-platforms-ipaas-zapier-agent-native" rel="noopener noreferrer"&gt;AI Agent Integration Platforms (2026): iPaaS vs Agent-Native for Engineers&lt;/a&gt;. This post assumes you have already built a Make/Zapier/n8n prototype and now need to productionize it.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What Breaks First When You Productionize a Make/Zapier/n8n Agent?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;There's a fundamental mismatch between &lt;em&gt;workflow automation&lt;/em&gt; and &lt;em&gt;agentic execution&lt;/em&gt;. Workflow tools assume a predictable sequence of triggers and actions (If X, then Y). AI agents require a dynamic toolbox where the Large Language Model (LLM) acts as the router, deciding which tool to call and when.&lt;/p&gt;

&lt;p&gt;When developers force agents into low-code wrappers, they sacrifice the control needed to meet production SLAs. The following checklist highlights the gaps between a prototype built on automation tools and a production-grade architecture.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Ceiling Symptom in Make/Zapier/n8n&lt;/th&gt;
&lt;th&gt;What's Happening&lt;/th&gt;
&lt;th&gt;Production Requirement&lt;/th&gt;
&lt;th&gt;How Composio Closes the Gap&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Agent "almost works" but keeps failing on tool calls&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Semantic misalignment&lt;/strong&gt;: the model can't reliably infer the real API contract (fields, meanings, edge cases)&lt;/td&gt;
&lt;td&gt;Precise, versioned tool schemas (OpenAPI) + schema overrides + examples&lt;/td&gt;
&lt;td&gt;Tool definitions as code + controlled schemas so the agent sees the true contract&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Duplicate emails / double updates / repeated side effects after a timeout&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Retry storms&lt;/strong&gt; on side-effectful actions&lt;/td&gt;
&lt;td&gt;Idempotency keys + safe retry policy + DLQ&lt;/td&gt;
&lt;td&gt;Execution layer that enforces safe retries + prevents duplicate execution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;One bad request blocks everything&lt;/td&gt;
&lt;td&gt;"Poison message" stalls a queue/workflow run&lt;/td&gt;
&lt;td&gt;Failure isolation (DLQ, circuit breakers, timeouts)&lt;/td&gt;
&lt;td&gt;Proper execution semantics + containment so the system keeps flowing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Debugging takes hours ("Why did it do that?")&lt;/td&gt;
&lt;td&gt;No end-to-end correlation between prompt, tool input, and tool output&lt;/td&gt;
&lt;td&gt;Tracing across Thought → Action → Observation + structured logs&lt;/td&gt;
&lt;td&gt;Structured logs and integrations that let you trace tool execution cleanly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Can't productize "users connect their own accounts."&lt;/td&gt;
&lt;td&gt;Workflow tools optimize for internal/team automation patterns&lt;/td&gt;
&lt;td&gt;Per-end-user auth + token lifecycle + isolation boundaries&lt;/td&gt;
&lt;td&gt;Managed per-entity authentication lifecycle designed for multi-tenant apps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rate limits or bursts destabilize the agent&lt;/td&gt;
&lt;td&gt;Bursty tool calling + platform throttles + no app-aware backoff&lt;/td&gt;
&lt;td&gt;Rate limiting + backpressure + provider-aware retries&lt;/td&gt;
&lt;td&gt;Execution controls that handle 429s/backoff and protect your agent runtime&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
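&lt;p&gt;The last row of the table is the most mechanical to fix. A provider-aware retry honors the provider's Retry-After hint on a 429 and otherwise backs off exponentially; the sketch below uses a fake client in place of a real API:&lt;/p&gt;

```python
# Sketch: provider-aware retry with exponential backoff that honors a
# Retry-After hint on 429s. The fake client stands in for a real API.
import time

def call_with_backoff(call, max_attempts=4, base_delay=0.01):
    for attempt in range(max_attempts):
        status, retry_after = call()
        if status != 429:
            return status
        # Prefer the provider's hint; otherwise back off exponentially
        delay = retry_after if retry_after is not None else base_delay * 2 ** attempt
        time.sleep(delay)
    return 429  # surfaced to the DLQ / escalation path, not retried forever

# Two throttled responses, then success
responses = iter([(429, 0.01), (429, None), (200, None)])
status = call_with_backoff(lambda: next(responses))
print(status)  # 200
```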

&lt;h2&gt;
  
  
  &lt;strong&gt;Why Workflow Tools and Agents Mismatch&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Workflows Assume Determinism&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Workflow automation tools target predictable orchestration: fixed triggers, defined steps, and repeatable inputs. When something fails, the "right" behavior is usually to retry the same step.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Agents Produce Probabilistic Tool Calls&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Agents decide what to do based on language, context, and tool descriptions. Two runs of the "same" user request can yield different tool calls or different arguments, even when your prompt stays unchanged.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Missing Layer Governs Execution (Not More Prompts)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once tools can create real-world side effects, you need a runtime layer that enforces correctness and safety regardless of what the model decides in the moment.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What Is an Agent Action Plane?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To solve these issues, successful engineering teams decouple integration logic from the agent's reasoning loop. This intermediate layer forms the &lt;strong&gt;Action Plane&lt;/strong&gt;. (For the whole "action layer" model and how it fits into the broader ecosystem, see: &lt;a href="https://composio.dev/blog/best-ai-agent-builders-and-integrations" rel="noopener noreferrer"&gt;https://composio.dev/blog/best-ai-agent-builders-and-integrations&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;The Action Plane handles four critical functions:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Tool Catalog (LLM-Ready Schemas)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Provides a strongly typed, documented schema (OpenAPI) to the LLM to prevent Semantic Misalignment.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Auth Mediation (Per-User OAuth + Lifecycle)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Dynamically swaps user IDs for active OAuth tokens.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Execution Semantics (Idempotency, Retries, Backpressure, DLQ)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Runs the tool code with idempotency, retries, and rate limiting to prevent Retry Storms.&lt;/p&gt;
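&lt;p&gt;The core of retry safety is an idempotency key: a retried action replays the recorded result instead of re-running the side effect. A minimal sketch (the in-memory store is illustrative; production would use a shared store with TTLs):&lt;/p&gt;

```python
# Sketch: idempotency-key enforcement so a timed-out-then-retried action
# can't send the same email twice. In-memory store is illustrative.
executed: dict[str, str] = {}

def execute_once(idempotency_key: str, action, *args):
    if idempotency_key in executed:
        return executed[idempotency_key]      # replay the recorded result
    result = action(*args)
    executed[idempotency_key] = result
    return result

sent = []
def send_email(to):
    sent.append(to)
    return f"sent:{to}"

# The agent retries after a timeout, but the side effect runs only once
execute_once("req-123", send_email, "user@example.com")
execute_once("req-123", send_email, "user@example.com")
print(len(sent))  # 1
```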

&lt;h3&gt;
  
  
  &lt;strong&gt;4. Observability (Trace Thought → Action → Outcome)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Emits structured logs compatible with OpenTelemetry.&lt;/p&gt;
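&lt;p&gt;One structured record per tool call, correlated under a trace id, is enough to answer "why did it do that?" The field names below are illustrative, not an OpenTelemetry schema:&lt;/p&gt;

```python
# Sketch: one structured log record per tool call, correlating the
# agent's thought, the action, and the observation. Fields illustrative.
import json
import uuid

def log_tool_call(thought: str, tool: str, args: dict, outcome: str) -> str:
    record = {
        "trace_id": str(uuid.uuid4()),
        "thought": thought,
        "action": {"tool": tool, "args": args},
        "observation": outcome,
    }
    return json.dumps(record)

line = log_tool_call(
    thought="Need to notify the solutions channel",
    tool="SLACK_POST_MESSAGE",
    args={"channel": "#solutions"},
    outcome="ok",
)
parsed = json.loads(line)
print(parsed["action"]["tool"])  # SLACK_POST_MESSAGE
```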

&lt;h2&gt;
  
  
  &lt;strong&gt;The Three Production Requirements (and How to Implement Them)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Implementing this layer requires addressing three specific engineering challenges: Multi-tenant Authentication, Reliability, and Observability.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Multi-Tenant Authentication (Per-User OAuth)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The most challenging hurdle in moving from internal tools to a user-facing product is authentication. In a Zapier prototype, you authenticate &lt;em&gt;once&lt;/em&gt; with your credentials. In production, your agent must act on behalf of User A on Salesforce and User B on Slack, ensuring total isolation.&lt;/p&gt;

&lt;p&gt;This requires implementing a token management service that adheres to &lt;a href="https://datatracker.ietf.org/doc/html/rfc6749" rel="noopener noreferrer"&gt;RFC 6749&lt;/a&gt; or using a dedicated solution for &lt;a href="https://composio.dev/blog/agentauth-seamless-authentication-for-ai-agents-with-250-tools" rel="noopener noreferrer"&gt;seamless authentication for AI agents&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;What "Per-User OAuth" Means for Agent Products&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Per-user OAuth means every end user connects their own account, and your system stores and refreshes tokens &lt;strong&gt;per tenant&lt;/strong&gt;, enforcing isolation boundaries so User A's token can never execute User B's actions.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Common Failure Modes (Refresh Races, Token Leaks, Reauth Loops)&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The most complex parts are operational: refresh token rotation, concurrent refresh races (two agent threads refreshing at once), handling revoked refresh tokens, and forcing a clean reauth path without breaking workflows.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;The "Build It Yourself" Complexity&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Implementing this in-house requires managing the full token lifecycle. You must handle the authorization code grant, refresh token rotation, and race conditions where two agent threads try to refresh the same token simultaneously.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# DIY Approach: Simplified Token Refresh Logic
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Lock&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TokenManager&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encryption_key&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Lock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_valid_token&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# 1. Retrieve encrypted token
&lt;/span&gt;        &lt;span class="n"&gt;encrypted_token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_token&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;token_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;decrypt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encrypted_token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 2. Check expiration (with 5-minute buffer)
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;token_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;expires_at&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;token_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;access_token&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# 3. Critical Section: Refresh
&lt;/span&gt;        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Re-check to avoid race condition (double refresh)
&lt;/span&gt;            &lt;span class="n"&gt;token_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;decrypt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_token&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;token_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;expires_at&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;token_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;access_token&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="c1"&gt;# 4. Exchange refresh token
&lt;/span&gt;                &lt;span class="n"&gt;new_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;api_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;refresh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;refresh_token&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

                &lt;span class="c1"&gt;# 5. Encrypt and store
&lt;/span&gt;                &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update_token&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;encrypt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_tokens&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;new_tokens&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;access_token&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;RefreshTokenExpired&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="c1"&gt;# 6. Handle hard logout logic
&lt;/span&gt;                &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RequireReauthError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;The Composio Approach&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Composio abstracts the Action Plane and treats authentication as a managed service. The platform handles the OAuth handshake, token storage, encryption, and token refresh.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;composio&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Composio&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;
&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;composio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Composio&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;composio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;slug&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GMAIL_GET_PROFILE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page_size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dangerously_skip_version_check&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Reliability (Idempotency + Retries Without Duplicate Side Effects)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;As noted in the failure modes, agents exhibit nondeterministic behavior. An LLM might call a &lt;code&gt;payment_api&lt;/code&gt; tool twice because the first request timed out.&lt;/p&gt;

&lt;p&gt;Allowing a large language model (LLM) to blindly retry actions significantly increases the risk of duplicate transactions. The Action Plane must intercept the tool call and enforce idempotency to ensure &lt;a href="https://composio.dev/blog/ai-agent-security-reliability-data-integrity" rel="noopener noreferrer"&gt;AI agent security and reliability&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;How to Design Safe Retries for Side-Effectful Tools&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Safe retries require: idempotency keys, bounded retries, provider-aware backoff for 429s, timeouts, and a policy for when to stop and route to a DLQ for manual review or later reprocessing.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DIY Implementation:&lt;/strong&gt; You must implement a "Transaction Outbox" pattern or a dedicated lock service (e.g., Redis) that tracks &lt;code&gt;(user_id, tool_call_hash)&lt;/code&gt;. If a duplicate request arrives within the validity window, the system should return the cached response rather than re-executing the tool.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Composio Implementation:&lt;/strong&gt; Idempotency is configurable at the platform level. The execution engine automatically handles rate limits (e.g., 429 backoff) and prevents duplicate execution of side-effect-heavy tools.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
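&lt;p&gt;A minimal sketch of the DIY pattern above, with an in-memory dict standing in for Redis and illustrative names throughout (a production version would add a TTL and durable storage):&lt;/p&gt;

```python
import hashlib
import json
import threading

class IdempotentExecutor:
    """Caches results keyed by (user_id, tool_call_hash) so a retry
    returns the original response instead of re-running the side effect."""

    def __init__(self, executor):
        self.executor = executor          # callable(tool_name, args) -> result
        self.cache = {}                   # stand-in for Redis SETNX + TTL
        self.lock = threading.Lock()

    def _key(self, user_id, tool_name, args):
        # Canonical JSON so argument ordering does not change the hash
        payload = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
        return (user_id, hashlib.sha256(payload.encode()).hexdigest())

    def execute(self, user_id, tool_name, args):
        key = self._key(user_id, tool_name, args)
        with self.lock:
            if key in self.cache:
                return self.cache[key]    # duplicate call: cached response
        result = self.executor(tool_name, args)
        with self.lock:
            self.cache[key] = result
        return result
```

&lt;p&gt;Because the hash is computed over canonical JSON, two retries with identical arguments map to the same key, so only the first execution touches the downstream system.&lt;/p&gt;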

&lt;h3&gt;
  
  
  &lt;strong&gt;Observability (Trace Tool Calls End-to-End)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Debugging an agent is significantly harder than debugging a standard microservice. You need to correlate the &lt;em&gt;prompt&lt;/em&gt; (Thought), the &lt;em&gt;tool input&lt;/em&gt; (Action), and the &lt;em&gt;API output&lt;/em&gt; (Observation).&lt;/p&gt;

&lt;p&gt;Your Action Plane must emit OTel spans for every step.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;What to Log for Every Tool Call (Minimum Schema)&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;At minimum, log: trace/span IDs, tool name, validated arguments (or a redacted view), status code, latency, retry count, and a stable identifier for the user/entity.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;How to Debug "Why Did It Do That?" in Minutes&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;When every tool call is traceable, you can jump from a user request to the exact tool invocation that happened, see the arguments the model produced, and inspect the outcome without stitching together logs across systems.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Example&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Structured&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Log&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;an&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Agent&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Action&lt;/span&gt;&lt;span class="w"&gt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"trace_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0af7651916cd43dd8448eb211c80319c"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2024-01-15T10:30:45.123Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"agent_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"agent_customer_support_v2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"user_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user_12345"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tool_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"jira.create_ticket"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"failed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"duration_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2340&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"retry_attempts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"circuit_breaker_status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"closed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"original_request"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"project"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"PROJ"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Login bug fix"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Users reporting 500 errors"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"upstream_response"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status_code"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;429&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"headers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"retry-after"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"60"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"body"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Rate limit exceeded"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"error_category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"rate_limit"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"compensating_actions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"rollback_salesforce_contact_creation"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;


&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Composio Integration:&lt;/strong&gt; Composio provides built-in logging that captures input/output payloads and integrates directly with observability platforms like LangSmith, Langfuse, and Datadog, visualizing the full trace without manual instrumentation.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Migration Readiness Checklist&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;If You Answer "Yes" to 3+, Migrate&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Use this checklist to decide whether you've truly hit the "workflow ceiling" and should migrate your agent to a code-first action plane:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;End-user accounts:&lt;/strong&gt; You need real "Connect your account" flows (per-user OAuth) and tenant-level isolation boundaries.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Side-effectful actions:&lt;/strong&gt; Your agent triggers payments, emails, CRM writes, ticket updates, or other irreversible actions where duplicate execution is unacceptable.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retries and failures:&lt;/strong&gt; You're seeing timeouts/429s and need safe retries, timeouts, backoff, circuit breakers, and DLQ handling.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool correctness:&lt;/strong&gt; The agent often calls tools with the wrong parameters or meaningfully "misunderstands" API fields (semantic misalignment).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debugging burden:&lt;/strong&gt; You can't reliably explain what happened without stitching together prompt/tool input/tool output, and debugging takes hours.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Burst traffic:&lt;/strong&gt; You're hitting rate limits or experiencing bursty workloads where backpressure and concurrency control become necessary.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You're shipping a product:&lt;/strong&gt; The agent faces customers, has SLAs, and the integration layer must fit into SDLC practices (versioning, review, and controlled rollout).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a broader "build vs buy vs integrate" view of agent infrastructure, see: &lt;a href="https://composio.dev/blog/secure-ai-agent-infrastructure-guide" rel="noopener noreferrer"&gt;https://composio.dev/blog/secure-ai-agent-infrastructure-guide&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Migration Path (Step-by-Step): From Make/Zapier/n8n to Code&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Migrating from a low-code platform to a code-first architecture should proceed iteratively.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The "Golden Workflow" Pattern&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Start with one critical flow, the smallest workflow that produces meaningful business value, and make that your first production migration target.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Shadow Mode vs Dry Run vs Canary&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Audit and Export:&lt;/strong&gt; Use the "Export to JSON" or CLI features of your low-code tool to map out your existing scenario logic. Identify the "Golden Workflow," the most critical, high-value flow.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shadow Mode:&lt;/strong&gt; Implement the Golden Workflow using the Composio SDK (or your custom code). Run it in parallel with the Zapier automation, logging the outputs without taking action.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auth Migration:&lt;/strong&gt; Implement the "Connect Account" flow in your frontend. You must ask users to re-authenticate, as tokens can't be exported from Zapier/Make/n8n.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cutover:&lt;/strong&gt; Once the shadow workflow shows consistent success and error handling, switch the production traffic.&lt;/li&gt;
&lt;/ol&gt;
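&lt;p&gt;The shadow-mode step can be sketched as follows, assuming a hypothetical &lt;code&gt;plan_fn&lt;/code&gt; that returns the tool calls the new pipeline would make:&lt;/p&gt;

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("shadow")

def run_shadow(event, plan_fn, execute_fn=None):
    """Shadow mode: compute what the code-first pipeline WOULD do and log it.

    The low-code tool stays the system of record until cutover; pass an
    execute_fn only after the shadow plans have proven consistent."""
    planned = plan_fn(event)                        # list of intended tool calls
    if execute_fn is None:
        log.info("shadow plan for event %s: %s",
                 event.get("id"), json.dumps(planned, sort_keys=True))
        return planned                              # no side effects in shadow mode
    return [execute_fn(action) for action in planned]
```

&lt;p&gt;Comparing these logged plans against what the Zapier/Make/n8n automation actually did gives you the "consistent success" signal needed before cutover.&lt;/p&gt;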

&lt;h3&gt;
  
  
  &lt;strong&gt;Example: Translating a "Golden Workflow" into an Agent Action Plane&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If your Make/Zapier/n8n workflow runs: "When a new lead appears → enrich it → update CRM → notify Slack," the migration usually looks like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Trigger:&lt;/strong&gt; &lt;code&gt;new_lead_created&lt;/code&gt; (e.g., webhook from form/CRM)
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tool calls (code-first):&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;enrich_lead(email)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;crm_update_contact(contact_id, enriched_payload)&lt;/code&gt; &lt;em&gt;(idempotent write)&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;slack_post_message(channel, summary)&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Production guardrails you add in the Action Plane:&lt;/strong&gt; idempotency keys for the CRM update, provider-aware backoff for 429s, DLQ for poison events, and trace IDs that tie together the prompt → actions → outcomes.&lt;/p&gt;
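&lt;p&gt;The flow above can be sketched as a single code-first handler. All tool and parameter names here are illustrative, not a specific SDK:&lt;/p&gt;

```python
import uuid

def handle_new_lead(lead, tools):
    """Code-first version of: new lead -> enrich -> update CRM -> notify Slack.

    `tools` is any object exposing the three calls below."""
    trace_id = uuid.uuid4().hex                       # ties prompt -> actions -> outcomes
    enriched = tools.enrich_lead(lead["email"], trace_id=trace_id)
    # Idempotency key makes the CRM write safe to retry after a timeout
    idem_key = f"crm-update-{lead['id']}"
    tools.crm_update_contact(lead["contact_id"], enriched,
                             idempotency_key=idem_key, trace_id=trace_id)
    tools.slack_post_message("#leads",
                             f"Lead {lead['email']} enriched and synced",
                             trace_id=trace_id)
    return trace_id
```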

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Workflow automation tools work well for internal tasks but lack the architectural rigor required by customer-facing AI agents. A production-grade Agent Action Plane requires solving complex problems in multi-tenant authentication, idempotency, and distributed tracing.&lt;/p&gt;

&lt;p&gt;Building this infrastructure in-house offers maximum control, but it comes with a high "maintenance tax" and requires significant engineering headcount. Composio provides a managed alternative that addresses the complexity of the integration layer, allowing teams to focus on the agent's reasoning and unique value proposition.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Next Step&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Evaluate your required integrations against the table above. If you need to manage OAuth tokens for multiple users and can't afford the operational overhead of a DIY build, review the &lt;strong&gt;&lt;a href="https://docs.composio.dev/docs/managed-authentication" rel="noopener noreferrer"&gt;Composio Authentication Documentation&lt;/a&gt;&lt;/strong&gt; to see how managed auth can remove months of backend development from your roadmap.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Frequently Asked Questions&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What's the difference between Zapier/Make/n8n and an agent action layer?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Zapier, Make, and n8n orchestrate predefined steps in a workflow. An agent action layer governs tool calls by enforcing schemas, auth, retries, idempotency, and observability, ensuring that probabilistic LLM tool calls remain safe in production (see our detailed comparison of &lt;a href="https://composio.dev/blog/n8n-vs-agent-builder" rel="noopener noreferrer"&gt;n8n vs agent builder&lt;/a&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;When is n8n "enough" for an AI agent?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;n8n often works when you self-host internal automation, the flow is deterministic, and mistakes are recoverable. n8n becomes insufficient when you need per-user OAuth, strict tenant isolation, and production-grade execution semantics.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What does "per-user OAuth" mean, and why do agents need it?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Per-user OAuth means every end user connects their own account, and the system stores and refreshes tokens per user/tenant. Agents need per-user OAuth because customer-facing products must take actions on behalf of many users without leaking tokens or enabling cross-tenant access.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Can Zapier/Make handle per-end-user OAuth for a SaaS product?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In limited patterns, you can approximate end-user auth, but these platforms primarily target internal/team automation flows. The hard requirement for SaaS agents is multi-tenant isolation and token lifecycle management at scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is "semantic misalignment" in tool calling?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Semantic misalignment happens when the model's understanding of a tool differs from the real API contract: fields, meanings, required constraints, and edge cases. The result is incorrect arguments, failed calls, or subtly wrong side effects.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How do tool schemas reduce wrong tool calls?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Precise schemas constrain the model's choices and make required fields and valid values explicit. Adding examples and overrides further reduces ambiguity so the tool contract the model "sees" matches the actual API behavior.&lt;/p&gt;
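&lt;p&gt;For example, a JSON-Schema-style tool definition (field names are illustrative) that makes required fields and valid values explicit:&lt;/p&gt;

```python
# Illustrative tool definition in the JSON-Schema shape most LLM
# function-calling APIs accept; the enum and required list constrain
# what the model can emit.
create_ticket_tool = {
    "name": "jira_create_ticket",
    "description": "Create a Jira ticket. Use only for confirmed bugs, not feature ideas.",
    "parameters": {
        "type": "object",
        "properties": {
            "project": {"type": "string", "description": "Project key, e.g. 'PROJ'"},
            "summary": {"type": "string", "maxLength": 120},
            "priority": {"type": "string", "enum": ["low", "medium", "high"]},
        },
        "required": ["project", "summary"],
        "additionalProperties": False,
    },
}
```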

&lt;h3&gt;
  
  
  &lt;strong&gt;What is idempotency, and how does it prevent duplicate emails/charges?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Idempotency ensures that repeated attempts produce the same outcome. With an idempotency key, retries after timeouts return the original result instead of executing the side effect again.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How should agents handle retries and timeouts safely?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Use idempotency keys for side effects, bounded retries, provider-aware backoff for 429s, and strict timeouts. When retries are exhausted, route the event to a DLQ for later processing or manual review.&lt;/p&gt;
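&lt;p&gt;A minimal sketch of that retry policy, honoring a provider's Retry-After hint and raising once attempts are exhausted so the caller can route the event to a DLQ (exception and function names are illustrative):&lt;/p&gt;

```python
import random
import time

class RateLimited(Exception):
    """Raised when the provider returns HTTP 429."""
    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after

def call_with_retries(fn, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Bounded retries with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited as exc:
            if attempt == max_attempts - 1:
                raise                               # exhausted: caller routes event to DLQ
            if exc.retry_after is not None:
                delay = exc.retry_after             # provider-aware backoff
            else:
                delay = base_delay * (2 ** attempt) * (1 + random.random())
            sleep(delay)
```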

&lt;h3&gt;
  
  
  &lt;strong&gt;What's a DLQ, and when do you need it for agents?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A Dead Letter Queue (DLQ) stores events that repeatedly fail due to bad inputs, transient outages, or policy violations. You need a DLQ when one "poison" event shouldn't block the system, and you want a safe recovery path.&lt;/p&gt;
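&lt;p&gt;A toy illustration of the pattern, with an in-memory queue standing in for a real broker's dead-lettering (e.g., SQS or RabbitMQ):&lt;/p&gt;

```python
from collections import deque

class DeadLetterQueue:
    def __init__(self):
        self.items = deque()

    def push(self, event, error):
        # Store the event plus its failure reason for manual review or reprocessing
        self.items.append({"event": event, "error": str(error)})

def process_with_dlq(events, handler, dlq, max_attempts=3):
    """One poison event must not block the rest of the stream."""
    for event in events:
        for attempt in range(max_attempts):
            try:
                handler(event)
                break
            except Exception as exc:
                if attempt == max_attempts - 1:
                    dlq.push(event, exc)     # park it; keep processing the stream
```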

&lt;h3&gt;
  
  
  &lt;strong&gt;How do you debug "why did it do that?" (thought → tool input → tool output)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Instrument the thought-action loop by correlating prompts to tool invocations and outcomes with trace IDs. Then you can inspect exactly what the model attempted, what was executed, and what happened without reconstructing timelines by hand.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What should you log for every tool execution?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;At minimum: trace/span IDs, tool name, validated args (or redacted args), user/entity ID, status code, latency, retry count, and a stable tool-call identifier for deduplication and audits.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What's the fastest migration approach from Make/Zapier/n8n?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Pick one Golden Workflow, reimplement it code-first behind an action plane, and run it in shadow mode. Once success is consistent, migrate auth flows, then cut over with a canary rollout.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Do you need an action plane if your agent only reads data (no side effects)?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You may not need full idempotency and DLQ semantics for read-only agents, but you still benefit from schemas, auth mediation, and observability. The need becomes non-negotiable once tools produce irreversible side effects.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Does an Agent Action Plane replace frameworks like LangChain or CrewAI?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;No, it complements them. Frameworks like LangChain, LlamaIndex, and CrewAI handle the reasoning (the brain). The Action Plane (Composio) handles the execution of the tool (the hands). You plug Composio into your LangChain/CrewAI agent to give it secure, authenticated access to tools like GitHub, Slack, and Salesforce. You can read more about the architectural differences in &lt;a href="https://composio.dev/blog/composio-vs-langchain-tools" rel="noopener noreferrer"&gt;Composio vs LangChain tools&lt;/a&gt;.  &lt;/p&gt;

</description>
      <category>productivity</category>
      <category>agents</category>
      <category>tooling</category>
      <category>ai</category>
    </item>
    <item>
      <title>Enterprise AI Agent Management: Governance, Security &amp; Control Guide (2026)</title>
      <dc:creator>Manveer Chawla</dc:creator>
      <pubDate>Sat, 20 Dec 2025 21:02:46 +0000</pubDate>
      <link>https://forem.com/composiodev/enterprise-ai-agent-management-governance-security-control-guide-2026-3f60</link>
      <guid>https://forem.com/composiodev/enterprise-ai-agent-management-governance-security-control-guide-2026-3f60</guid>
      <description>&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Enterprises are moving from simple AI chatbots to autonomous agents with write-access, creating new security risks.
&lt;/li&gt;
&lt;li&gt;"Shadow AI," where teams build agents with hard-coded integrations, leads to vulnerabilities such as identity flattening and a lack of governance.
&lt;/li&gt;
&lt;li&gt;A dedicated AI agent management layer handles authentication, permissions, and governance, much like an Identity Provider (e.g., Okta) for user logins.
&lt;/li&gt;
&lt;li&gt;When evaluating platforms, ask "killer questions" about semantic governance, human-in-the-loop capabilities, and identity management.
&lt;/li&gt;
&lt;li&gt;Existing tools, such as API Gateways and iPaaS solutions, cannot account for the non-deterministic nature of AI agents.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Enterprises are navigating a massive shift in how they deploy Large Language Models (LLMs). We've moved past the era of "Chat with PDF" and read-only retrieval systems. The new mandate is agency: autonomous systems that can read an email, decide on a course of action, and update a Salesforce record or trigger a Stripe payout.&lt;/p&gt;

&lt;p&gt;This transition transforms AI from a novelty into a write-access security risk.&lt;/p&gt;

&lt;p&gt;While we've previously covered the technical specifications of securing agents in our &lt;a href="https://composio.dev/blog/secure-ai-agent-infrastructure-guide" rel="noopener noreferrer"&gt;Secure Infrastructure Guide&lt;/a&gt;, this analysis focuses on the &lt;strong&gt;management layer&lt;/strong&gt;. Building an agent is easy. Governing it at scale is exponentially harder.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Beyond the Hype: The "Shadow AI" Problem in Enterprise Stacks&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The immediate threat to enterprise security isn't a sentient AI takeover but the rapid growth of &lt;a href="https://www.forbes.com/sites/delltechnologies/2023/10/31/what-is-shadow-ai-and-what-can-it-do-about-it/" rel="noopener noreferrer"&gt;Shadow AI&lt;/a&gt; — unapproved or ungoverned AI tools and features used across the business, often outside IT and security oversight. This includes engineering teams, under pressure to ship agentic features, wiring AI integrations directly into their application and data layers without consistent controls for data access, model behavior, or monitoring.&lt;/p&gt;

&lt;p&gt;Like Shadow IT, where employees use unapproved software, Shadow AI involves the unsanctioned use of AI tools and agents. The difference? Autonomous, non-deterministic behavior adds exponential complexity.&lt;/p&gt;

&lt;p&gt;In a typical Shadow AI setup, developers store long-lived &lt;a href="https://composio.dev/blog/agentauth-seamless-authentication-for-ai-agents-with-250-tools" rel="noopener noreferrer"&gt;API keys&lt;/a&gt; in environment variables and wrap them in flimsy Python functions passed to LangChain or &lt;a href="https://composio.dev/blog/building-ai-agents-using-llamaindex" rel="noopener noreferrer"&gt;LlamaIndex&lt;/a&gt;. This approach creates three critical vulnerabilities:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Identity Flattening:&lt;/strong&gt; The agent operates with a single "System Admin" key rather than the end-user's specific permissions.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intent Blindness:&lt;/strong&gt; Standard API Gateways (like Kong or MuleSoft) manage &lt;em&gt;requests&lt;/em&gt; (e.g., POST /v1/users). They can't manage &lt;em&gt;intent&lt;/em&gt; (e.g., "The agent is trying to delete a user because it hallucinated a policy violation").
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governance Vacuums:&lt;/strong&gt; No centralized kill switch exists. Revoking access requires a code deployment rather than a policy toggle.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The "Build vs. Buy" Stack: Where Management Fits&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To solve Shadow AI, architects must recognize that an AI Agent stack requires a dedicated management layer. This management layer differs from the reasoning layer.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Layer 1: The Brain (Logic &amp;amp; Reasoning)&lt;/strong&gt;: OpenAI, Anthropic, LangChain. Focuses on prompt engineering and planning.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 2: The Body (Management &amp;amp; Execution)&lt;/strong&gt;: Composio. Focuses on authentication, permissioning, tool execution, and logging.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The strategic argument here is identical to that of Identity Providers (IdPs) a decade ago. You wouldn't build your own Okta to manage user login. Similarly, you shouldn't build your own &lt;a href="https://composio.dev/blog/agentauth-seamless-authentication-for-ai-agents-with-250-tools" rel="noopener noreferrer"&gt;auth system for AI agents&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Hidden Cost of DIY Governance&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Building this layer in-house is deceptively simple: it starts small but quickly spirals into a maintenance quagmire. Consider the code required just to implement a basic "Human-in-the-Loop" check for a sensitive financial transfer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The complexity of DIY Governance
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_transfer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Check strict rate limits for this specific user (Not just global API limits)
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;rate_limiter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transfer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RateLimitError&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Check risk policy (Hardcoding this logic makes it brittle)
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# 3. We must now PAUSE the agent loop, serialize state to DB, 
&lt;/span&gt;        &lt;span class="c1"&gt;# send Slack notification to human, and wait for webhook callback
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;workflow_engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;suspend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;High Value Transfer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Transfer pending approval.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# 4. Manage OAuth Refresh Token (The silent killer of reliability)
&lt;/span&gt;    &lt;span class="n"&gt;access_token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;auth_service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_fresh_token&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 5. Execute
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;stripe_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transfers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(...,&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;access_token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a dedicated platform, a policy configuration replaces this entire block.&lt;/p&gt;
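&lt;p&gt;As a rough sketch of what "policy as configuration" means, the branching logic above collapses into a declarative policy plus a generic evaluator. The field names below are invented for illustration, not a real Composio schema:&lt;/p&gt;

```python
# Hypothetical policy schema: the shape a platform-side configuration might
# take. Field names are invented for illustration, not a vendor API.
POLICY = {
    "tool": "stripe_transfer",
    "rate_limit": {"per_user": 5, "window_seconds": 3600},
    "require_human_approval_above": 10000,
    "identity": "on_behalf_of_user",
}

def decide(policy, amount, calls_in_window):
    """Tiny evaluator showing how the hardcoded DIY branches become config."""
    if calls_in_window >= policy["rate_limit"]["per_user"]:
        return "deny:rate_limited"
    if amount > policy["require_human_approval_above"]:
        return "suspend:await_human_approval"
    return "allow"

print(decide(POLICY, 25000, 1))   # suspend:await_human_approval
print(decide(POLICY, 500, 1))     # allow
```

&lt;p&gt;Changing the approval threshold is now a policy toggle, not a code deployment, which is the governance property the DIY version lacks.&lt;/p&gt;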

&lt;h2&gt;
  
  
  &lt;strong&gt;The RFP Checklist: 7 "Killer Questions" to Unmask Pretenders&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When evaluating vendors, surface-level features like "number of integrations" can mislead. Many platforms are simply wrappers that lack the architectural depth to secure enterprise agents.&lt;/p&gt;

&lt;p&gt;Use these seven questions during your evaluation. If a vendor can't answer these with technical specifics, they likely pose a liability regarding &lt;a href="https://composio.dev/blog/ai-agent-security-reliability-data-integrity" rel="noopener noreferrer"&gt;AI agent security and data integrity&lt;/a&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Killer Question&lt;/th&gt;
&lt;th&gt;The "Red Flag" Answer (Disqualify)&lt;/th&gt;
&lt;th&gt;What You Should Hear (Evidence)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;1. Semantic Governance:&lt;/strong&gt; "Can I intercept a specific tool call (e.g., delete_user) based on the &lt;em&gt;intent&lt;/em&gt; and confidence score, even if the agent has technical permission?"&lt;/td&gt;
&lt;td&gt;"We rely on your prompt engineering for that." (This response pushes security back onto the developer).&lt;/td&gt;
&lt;td&gt;"We use a secondary policy engine (like OPA or a separate model) to score intent before the request hits the API."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;2. Human-in-the-Loop:&lt;/strong&gt; "How do you handle 'Red-Light' actions? Can I pause an agent mid-loop for human approval without breaking the state?"&lt;/td&gt;
&lt;td&gt;"You can build that logic using our webhooks." (This answer requires you to build complex state management yourself.)&lt;/td&gt;
&lt;td&gt;"We have native 'Suspend &amp;amp; Resume' capabilities where the agent waits for an external signal or UI approval."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;3. Identity (OBO):&lt;/strong&gt; "How do you handle OAuth token refreshes for 10,000 concurrent users acting On-Behalf-Of (OBO) themselves?"&lt;/td&gt;
&lt;td&gt;"We use a system service account for all actions." (This approach creates a massive 'God Mode' security risk).&lt;/td&gt;
&lt;td&gt;"We manage individual user tokens, handle rotation and refresh automatically, and support RFC 8693 token exchange."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;4. Observability:&lt;/strong&gt; "Do your logs correlate the Agent's Chain of Thought with the specific API Response?"&lt;/td&gt;
&lt;td&gt;"We provide standard HTTP logs and tracing." (Blind to &lt;em&gt;why&lt;/em&gt; an error occurred).&lt;/td&gt;
&lt;td&gt;"Our logs show the prompt, the reasoning trace, the tool execution, and the API response in a single correlated view."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;5. Memory Integrity:&lt;/strong&gt; "How do you ensure agent memory integrity? Can we audit if memory was poisoned?"&lt;/td&gt;
&lt;td&gt;"We log everything to Splunk." (Standard logging is mutable and doesn't trace memory injection).&lt;/td&gt;
&lt;td&gt;"We provide immutable audit trails or hash chains for agent memory states."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;6. Data Loss Prevention:&lt;/strong&gt; "Can you anonymize PII in the prompt &lt;em&gt;before&lt;/em&gt; it reaches the model, and rehydrate it on the way back?"&lt;/td&gt;
&lt;td&gt;"The model provider handles compliance." (Abdication of responsibility).&lt;/td&gt;
&lt;td&gt;"We offer a DLP gateway that masks sensitive data (credit cards, PII) before it leaves your perimeter."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;7. Lifecycle:&lt;/strong&gt; "How do you manage version control for agent tools? If I update an API definition, does it break live agents?"&lt;/td&gt;
&lt;td&gt;"You just update the code." (No separation of concerns).&lt;/td&gt;
&lt;td&gt;"We support versioned tool definitions, allowing you to roll out API updates to specific agent versions incrementally."&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
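&lt;p&gt;To make question 2 concrete, here is a minimal sketch of native "Suspend and Resume", assuming a hypothetical in-memory ticket store. A real platform would persist this state durably and resume via a webhook or UI approval:&lt;/p&gt;

```python
# Sketch of "Suspend and Resume" for Human-in-the-Loop approval.
# The in-memory store is a stand-in; real platforms persist agent state
# durably and resume on an external signal.
import uuid

PENDING = {}

def suspend(agent_id, action, context):
    """Freeze the agent's intent and hand back a resumable ticket."""
    ticket = str(uuid.uuid4())
    PENDING[ticket] = {"agent_id": agent_id, "action": action, "context": context}
    return ticket

def resume(ticket, approved):
    state = PENDING.pop(ticket)
    if not approved:
        return {"status": "rejected", "action": state["action"]}
    # A real system would re-enter the agent loop with the frozen context.
    return {"status": "executed", "action": state["action"], "context": state["context"]}

t = suspend("agent-7", "wire_transfer", {"amount": 50000})
print(resume(t, approved=True)["status"])   # executed
```

&lt;p&gt;The "red flag" vendor answer ("build it with our webhooks") means you own this state machine, including serialization, timeouts, and crash recovery.&lt;/p&gt;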

&lt;h2&gt;
  
  
  &lt;strong&gt;Why Your Existing Enterprise Toolchain Will Fail: A Landscape Analysis&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A common misconception is that existing enterprise platforms can be repurposed to govern AI agents. This assumption is architecturally unsound.&lt;/p&gt;

&lt;p&gt;Traditional stacks govern syntax, not semantics, and they break under the looping, probabilistic execution models of agentic AI. See &lt;a href="https://genai.owasp.org/llmrisk/llm062025-excessive-agency/" rel="noopener noreferrer"&gt;OWASP LLM06: Excessive Agency&lt;/a&gt; for why this matters.&lt;/p&gt;

&lt;p&gt;Here's why your existing tools will fail to protect you:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool Class&lt;/th&gt;
&lt;th&gt;Core Design Goal&lt;/th&gt;
&lt;th&gt;Critical Failure for Agents&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API Gateways (Kong, MuleSoft)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Throttle &amp;amp; authenticate REST traffic.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Intent Blindness:&lt;/strong&gt; Can't distinguish between a legitimate API call and a hallucinated deletion command.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;a href="https://composio.dev/blog/best-unified-api-platforms" rel="noopener noreferrer"&gt;&lt;strong&gt;Unified APIs&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;(Merge, Nango)&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Batch data synchronization (ETL).&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Latency &amp;amp; Granularity:&lt;/strong&gt; Built for high-latency syncs, not real-time execution. Permissions are too broad (all-or-nothing access).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;a href="https://composio.dev/blog/ai-agent-integration-platforms-ipaas-zapier-agent-native" rel="noopener noreferrer"&gt;&lt;strong&gt;iPaaS&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;(Zapier, Workato)&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Linear, deterministic workflows.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Rigidity:&lt;/strong&gt; Agents loop and adapt; iPaaS flows are linear. If an agent encounters an error, iPaaS breaks rather than providing feedback to the LLM.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MLOps (Arize, LangSmith)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Model training &amp;amp; drift monitoring.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Lack of Enforcement:&lt;/strong&gt; Great for &lt;em&gt;seeing&lt;/em&gt; what happened, but can't &lt;em&gt;stop&lt;/em&gt; it. They're observability tools, not execution gateways.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Unified APIs (e.g., Merge)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Verdict: Excellent for B2B SaaS data syncing, risky for Agent Actions.&lt;/p&gt;

&lt;p&gt;Unified APIs normalize data schemas (e.g., "Get all contacts from any CRM"). They optimize for reading large datasets, often adding 180ms–600ms of latency.&lt;/p&gt;

&lt;p&gt;The Failure: Agents require low-latency, RPC-style execution. Furthermore, Unified APIs lack action-level granularity: you can't easily permit an agent to "Update Contact" while denying "Delete Contact."&lt;/p&gt;
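&lt;p&gt;Action-level granularity is easy to sketch once permissions are expressed per tool call rather than per connection. The permission map below is hypothetical, for illustration only:&lt;/p&gt;

```python
# Sketch of action-level granularity: the role may read and update contacts,
# but any call outside its allowlist (e.g. delete) is rejected.
# The role and tool names are hypothetical.
PERMISSIONS = {
    "support_agent": {"crm.update_contact", "crm.get_contact"},
}

def authorize(role, tool_call):
    if tool_call in PERMISSIONS.get(role, set()):
        return True
    raise PermissionError(f"{role} may not call {tool_call}")

authorize("support_agent", "crm.update_contact")      # allowed
# authorize("support_agent", "crm.delete_contact")    # raises PermissionError
```

&lt;p&gt;A Unified API that only grants all-or-nothing access to a CRM connection cannot express this distinction.&lt;/p&gt;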

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Traditional iPaaS (e.g., Zapier)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Verdict: Excellent for deterministic automation, brittle for Probabilistic Loops.&lt;/p&gt;

&lt;p&gt;iPaaS tools rely on a "Trigger -&amp;gt; Action" model. AI agents operate on an "Assess -&amp;gt; Attempt -&amp;gt; Adapt" loop.&lt;/p&gt;

&lt;p&gt;The Failure: If an agent tries an action via Zapier and it fails (e.g., a "Rate Limit" error), the iPaaS workflow simply stops or errors out. A dedicated agent platform captures that error and feeds it back to the LLM as context ("That didn't work, try a different search"), allowing the agent to self-heal.&lt;/p&gt;
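&lt;p&gt;The self-healing loop described above looks roughly like the following sketch. Here &lt;code&gt;llm_step&lt;/code&gt; is a stand-in for a real model call, and the retry behavior is illustrative, not a specific framework's API:&lt;/p&gt;

```python
# Sketch of the self-healing loop: tool errors are fed back to the model as
# context instead of terminating the workflow. llm_step is a stand-in for a
# real model call; the fallback behavior is illustrative.
def llm_step(goal, feedback):
    # Stand-in: a real implementation would prompt the model with the feedback.
    if feedback and "rate limit" in feedback:
        return {"tool": "search_v2", "query": goal}
    return {"tool": "search_v1", "query": goal}

def run_tool(call):
    if call["tool"] == "search_v1":
        raise RuntimeError("rate limit exceeded")
    return f"results for {call['query']}"

def agent_loop(goal, max_attempts=3):
    feedback = None
    for _ in range(max_attempts):
        call = llm_step(goal, feedback)
        try:
            return run_tool(call)
        except RuntimeError as err:
            feedback = str(err)   # an iPaaS flow would stop here; the agent adapts
    return "gave up"

print(agent_loop("find the email from John"))
```

&lt;p&gt;The first attempt hits the rate limit, the error becomes context, and the second attempt succeeds with a different tool.&lt;/p&gt;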

&lt;h3&gt;
  
  
  &lt;strong&gt;3. MLOps Platforms (e.g., Arize, LangSmith)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Verdict: Essential for debugging, insufficient for Governance.&lt;/p&gt;

&lt;p&gt;MLOps platforms are critical for monitoring model drift, bias, and prompt latency.&lt;/p&gt;

&lt;p&gt;The Failure: They passively observe. They can trace a tool call, but they can't intercept it, enforce RBAC policies, or manage the OAuth tokens required to execute it. They provide a rearview mirror, not a steering wheel.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4. Dedicated Agent Management (Composio)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Verdict: Purpose-built for the non-deterministic nature of LLMs.&lt;/p&gt;

&lt;p&gt;Composio focuses on the fuzzy logic required to map prompts to rigid APIs. We translate a vague intent ("Find the email from John") into specific API calls while enforcing governance boundaries.&lt;/p&gt;

&lt;p&gt;Trade-off: Composio is a developer-first infrastructure tool. Unlike Zapier, which allows non-technical users to build flows visually, Composio requires engineering implementation to define tools and permissions programmatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The Strategic Case for a Dedicated Integration Layer&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The final argument for a dedicated management layer is future-proofing.&lt;/p&gt;

&lt;p&gt;The AI framework landscape is volatile. Today, you might use LangChain. Tomorrow, you might switch to &lt;a href="https://composio.dev/blog/openai-agent-builder-step-by-step-guide-to-building-ai-agents-with-mcp" rel="noopener noreferrer"&gt;OpenAI's Agent Builder&lt;/a&gt; or &lt;a href="https://composio.dev/blog/agentforce-external-actions-integration" rel="noopener noreferrer"&gt;Salesforce Agentforce&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you hardcode your integrations (Stripe, Salesforce, GitHub) directly into your LangChain code, migration requires a total rewrite of your tool definitions. By using an Agent Management Platform, you decouple your &lt;strong&gt;Tools&lt;/strong&gt; from your &lt;strong&gt;Reasoning Engine&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You can swap out the brain (the LLM or framework) without breaking the body (the integrations and auth).&lt;/p&gt;
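&lt;p&gt;The decoupling argument can be sketched as a registry pattern: tool definitions live in one stable layer, and any "brain" consumes them through the same interface. The names below are illustrative, not a specific vendor API:&lt;/p&gt;

```python
# Sketch of decoupling tools from the reasoning engine. The registry and
# "brain" functions are illustrative stand-ins for real frameworks.
TOOL_REGISTRY = {
    "github.create_issue": lambda args: f"issue created: {args['title']}",
    "slack.send_message": lambda args: f"sent: {args['text']}",
}

class ToolLayer:
    """The 'body': stable across framework swaps."""
    def execute(self, name, args):
        return TOOL_REGISTRY[name](args)

def langchain_style_brain(tools):
    # One framework's planner, using the shared tool layer.
    return tools.execute("github.create_issue", {"title": "Fix login bug"})

def agent_builder_style_brain(tools):
    # A different framework, same tool layer: no rewrite of tool definitions.
    return tools.execute("slack.send_message", {"text": "Deploy done"})

tools = ToolLayer()
print(langchain_style_brain(tools))
print(agent_builder_style_brain(tools))
```

&lt;p&gt;Swapping the brain changes only the planner function; the integrations and their auth stay untouched.&lt;/p&gt;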

&lt;h2&gt;
  
  
  &lt;strong&gt;Next Steps&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Building an agent is an exercise in creativity. Governing it is an exercise in discipline. Don't let the plumbing stall your AI roadmap or expose your enterprise to Shadow AI risks.&lt;/p&gt;

&lt;p&gt;If you're evaluating agent architecture:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Audit your current stack:&lt;/strong&gt; Are API keys hardcoded?
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Define your governance policy:&lt;/strong&gt; Do you need Human-in-the-Loop for write actions?
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluate Composio:&lt;/strong&gt; We offer a governance and authentication layer that lets you ship secure agents faster.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://docs.composio.dev/docs/welcome" rel="noopener noreferrer"&gt;&lt;strong&gt;Explore the Composio Documentation&lt;/strong&gt;&lt;/a&gt; or &lt;a href="https://platform.composio.dev/auth" rel="noopener noreferrer"&gt;&lt;strong&gt;sign up for a free account&lt;/strong&gt;&lt;/a&gt; to explore the capabilities of our platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is an AI Agent Management Platform?
&lt;/h3&gt;

&lt;p&gt;An AI Agent Management Platform is a centralized system for building, deploying, governing, and monitoring AI agents. This platform provides the infrastructure for security, authentication, and compliance.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is "Shadow AI"?
&lt;/h3&gt;

&lt;p&gt;"Shadow AI" refers to employees using AI tools or developing AI agents without the IT department's knowledge or approval. This practice can lead to significant security and compliance risks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why can't I use my existing API Gateway to manage AI agents?
&lt;/h3&gt;

&lt;p&gt;Traditional API gateways manage predictable API requests. They cannot understand the &lt;em&gt;intent&lt;/em&gt; behind an AI agent's actions, a concept known as "intent blindness." They can't distinguish between a legitimate command and a hallucination from an AI agent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does a management platform replace frameworks like LangChain or CrewAI?
&lt;/h3&gt;

&lt;p&gt;No, it complements them. Frameworks like LangChain and CrewAI handle the reasoning and planning (the "brain"). The management platform (like Composio) handles the execution, authentication, and governance (the "body"). You plug Composio into your framework to give it secure tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is "Human-in-the-Loop" for AI agents?
&lt;/h3&gt;

&lt;p&gt;Human-in-the-Loop is a process in which human oversight is integrated into an AI system. For AI agents, human-in-the-loop means that high-risk actions, such as large financial transfers, can pause for human approval before execution.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does Agent Identity differ from User Identity (Okta/Auth0)?
&lt;/h3&gt;

&lt;p&gt;User Identity (Okta/Auth0) confirms &lt;em&gt;the identity of the human&lt;/em&gt;. Agent Identity (Composio) manages &lt;em&gt;what&lt;/em&gt; the autonomous agent is allowed to do on that human's behalf. Without a dedicated Agent Identity layer, you risk giving agents "System Admin" (God Mode) privileges or forcing users to constantly re-authenticate. Composio bridges this gap by managing the permissions and lifecycle of the agent's actions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why shouldn't we build our own integration layer using open-source tools?
&lt;/h3&gt;

&lt;p&gt;You can, but it incurs a massive "Maintenance Tax." Building the initial integration is easy; maintaining 100+ OAuth flows, managing token rotation strategies, and updating tool definitions whenever a provider (like Salesforce or Slack) changes its API requires a dedicated engineering team. Composio absorbs this maintenance burden so your engineers can focus on building agent logic, not plumbing.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>governance</category>
      <category>security</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
