Forem: Rafa Calderon

Observability for AI Systems with OpenTelemetry

Rafa Calderon — Sat, 02 May 2026 14:49:14 +0000

A user reports that an AI application's response is bad. The dashboard, however, is green: 200 OK, latency within the expected percentile, clean logs. The complaint is legitimate, and traditional observability fails to capture it.

The tooling refined over a decade for web services measures latency, errors, and throughput. Generative AI systems demand more: cost per request, output quality, propagation of semantic errors, and the behavior of an agent that decides at each iteration what to do next. OpenTelemetry and its GenAI semantic conventions provide a foundation, but effective implementation requires addressing five distinct layers and, above all, the correlations between them.

This article walks through those layers with concrete attributes, named metrics, JSON span examples, and diagrams. The structure follows the natural flow of a request: agent orchestration, retrieval pipeline, cost, quality, and security as an initial filter. After that, a section dedicated to cross-layer correlations, the telemetry pipeline that ties everything together, the most frequent anti-patterns, and an actionable minimal implementation.

The system under observation

Before instrumenting, the system needs to be fixed in scope. The reference application receives user queries, executes an agent capable of reasoning across multiple iterations and calling tools, queries a private corpus through a retrieval pipeline, invokes one or more external language models in streaming mode, and returns a response. This pattern covers most modern AI applications: vertical assistants, conversational search engines, productivity copilots, automated support systems.

This system has four boundaries where traditional observability falls short. The boundary between application code and external models: the call is not a normal HTTP request, latency depends on prompt length, output arrives in streaming mode token by token, and the model may refuse to respond due to its own filters. The boundary between the query and the retrieved documents: relevance is a qualitative dimension that cannot be reduced to a simple numeric metric. The boundary between usage and cost: each token has a price, prices change, cached tokens cost differently, and the bill arrives a month later in aggregate. The boundary between response and validity: a 200 does not mean a correct response, quality is subjective, contextual, and only verifiable with additional effort.

The layers that compose the observability system respond to these boundaries. The rest of the article develops each one with concrete attributes and explicit positions.

The orchestration layer

The modern agent is not a function that takes input and returns output. It is a loop that decides at each iteration: reason, call a tool, query a model, reason again. The natural way to model it is hierarchical. Root span per interaction, child spans per iteration, grandchild spans per tool call, and, in more complex systems, sub-agents that generate sub-hierarchies.

The first strong position of the article: Jaeger is not adequate for agents with more than three levels of depth. Plan from the outset for a different UI: Langfuse, Arize Phoenix, SigNoz, or one built on top of ClickHouse. Reusing the tool designed for microservices typically results in weeks lost and the same conclusion.

To ground this in practice, an agent iteration span, following OpenTelemetry GenAI semantic conventions and extended with custom attributes, looks like this:

{
  "name": "agent.iteration",
  "kind": "INTERNAL",
  "attributes": {
    "agent.iteration.number": 3,
    "agent.iteration.reason": "tool_call_required",
    "gen_ai.system": "anthropic",
    "gen_ai.request.model": "claude-sonnet-4",
    "gen_ai.response.model": "claude-sonnet-4-20250514",
    "gen_ai.usage.input_tokens": 4523,
    "gen_ai.usage.output_tokens": 187,
    "gen_ai.usage.cached_tokens": 4200,
    "gen_ai.error.type": null,
    "cost.usd": 0.0143,
    "feature.name": "support_assistant",
    "tenant.id": "t_4521"
  },
  "events": [
    {"name": "first_token", "time_unix_nano": "..."},
    {"name": "tool_call.dispatched", "attributes": {"tool.name": "search_kb"}}
  ]
}

The gen_ai.* attributes are the ones standardized by OpenTelemetry, in experimental status but adopted by serious SDKs. Attributes without the gen_ai prefix, such as cost.usd, feature.name, and tenant.id, are custom extensions that should be isolated in their own namespace so they can be renamed when the standard matures. The events inside the span are crucial: they mark significant moments such as the arrival of the first token or the dispatch of a tool call, and they form the basis for derived metrics covered in the next subsection.

The most relevant architectural decision in this layer is whether to capture full prompts and completions in spans. Strong position: capture the full prompt only on errors and outliers. For everything else, hash plus separate storage, or nothing. There are three reasons, all of them painful. Size: a RAG-augmented prompt can contain tens of thousands of tokens, and storing the full text in every span inflates storage non-linearly. Privacy: the prompt may contain user PII, which then travels through the entire telemetry pipeline toward systems not designed with personal data controls. Cost: any tracing backend bills by volume, and storing natural-language text multiplies bills by one or two orders of magnitude compared to structured metadata.

In practice, teams do not pick a single point on the spectrum but rather a combination. Hash of the prompt always stored, full text only on errors and cost outliers, PII redaction at the collector. This is defense in depth, not maximalism in any direction.

Streaming, TTFT and TPOT

Here lies one of the most common gaps in AI observability: ignoring streaming. LLM output does not arrive as a single payload but in chunks, typically via Server-Sent Events. Total latency stops being the relevant metric. What matters is the TTFT (Time To First Token), which defines how quickly the user perceives that the system is responding, and the TPOT (Time Per Output Token), which defines the fluency of generation.

The client SDK or wrapper around the LLM emits span events: stream.start, stream.first_token, stream.last_token. The collector consumes them and derives metrics: TTFT as the difference between request and first_token, TPOT as (last_token - first_token) / (output_tokens - 1). These two metrics, not total latency, are the ones that belong in product-oriented SLOs.

{
  "name": "llm.generation",
  "attributes": {
    "gen_ai.streaming": true,
    "gen_ai.usage.output_tokens": 412,
    "stream.ttft_ms": 240,
    "stream.tpot_ms_avg": 18,
    "stream.duration_ms": 7656
  },
  "events": [
    {"name": "stream.start", "time_unix_nano": "..."},
    {"name": "stream.first_token", "time_unix_nano": "..."},
    {"name": "stream.last_token", "time_unix_nano": "..."}
  ]
}

A TTFT below one second feels instantaneous; above three seconds it kills the experience. A TPOT below 30 ms reads fluently; above 80 ms it feels like a slow connection. These numbers should be explicit system SLOs, not anecdotal references. Aggregate latency does not capture what the user perceives: TTFT and TPOT must be measured as first-class metrics.

Error types and tooling

Not all LLM errors are equal. gen_ai.error.type should use a controlled vocabulary: rate_limit, content_filter, max_tokens_exceeded, structured_output_malformed, timeout, provider_outage. Each requires a different corrective action. Grouping them as a generic error loses critical information.

On tooling, an explicit stance is warranted: OpenLLMetry from Traceloop is the option most aligned with pure OpenTelemetry, exporting to standard OTLP and integrating with any backend. Arize's OpenInference is more oriented toward evaluation. Langfuse and LangSmith are complete products with their own protocols. Vendors imply real lock-in: OTLP exports exist but are partial, and dashboards are built around their proprietary data models. The promised interoperability is more marketing than engineering at this moment.

The retrieval layer

If the orchestration layer governs the agent's logic, the retrieval layer governs its inputs. A typical query traverses five phases: text-to-vector embedding, vector search against the index, candidate re-ranking, final prompt construction with the chosen documents, and generation. Each phase has its own latency profile, its own failure mode, and its own relevant metrics.

A frequent anti-pattern: a single "rag" span covering the entire pipeline. If vector search degrades from 30 ms to 300 ms, that change should be visible in isolation rather than diluted in an aggregate latency that also includes generation. One span per phase, with specific attributes, is not a luxury: it is the difference between being able to operate the system and not.

A concrete vector search span:

{
  "name": "rag.search.vector",
  "attributes": {
    "vector.index": "docs_v3",
    "vector.top_k": 10,
    "vector.distance.min": 0.21,
    "vector.distance.max": 0.67,
    "vector.distance.avg": 0.43,
    "cache.embedding.hit": true,
    "cache.search.hit": false,
    "docs.retrieved": ["doc_4521", "doc_8931", "doc_2901"],
    "rag.session.id": "sess_abc123"
  }
}

The concrete metrics worth exposing in this layer, by name:

rag_phase_latency_ms{phase} — latency per phase, separated, not aggregated.
rag_cache_hit_ratio{level} — embedding, search, response. Three levels, three ratios.
rag_retrieval_relevance_proxy — average score of the top_k that actually ended up in the prompt after re-ranking. Not real relevance, but a proxy.
rag_docs_referenced_ratio — percentage of retrieved documents that appear cited in the model's response. Drift in this metric is an early signal of degradation.
rag_prompt_tokens_p99 — distribution of constructed prompt size. Uncontrolled growth heads toward silent truncation.

The defining problem of this layer is measuring quality without ground truth. That topic is developed in its own section later, because quality deserves its own layer rather than a paragraph hidden here.

The cache deserves specific attention. Three levels, three different economics. Embedding cache: same query, same vector. Hit ratio is typically high in applications with repetitive queries. Search cache: same vector, same results. Sensitive to corpus updates. Response cache: same query, same response. Controversial given the expected variability of LLMs, but in deterministic applications with zero temperature it saves enormously. Without hit/miss counters at each level, a drop in embedding cache rate goes unnoticed until the provider's bill arrives showing inexplicable growth.

Detecting silent regressions is where most teams get burned. Same query, same corpus, different results over time. Common causes: the embedding model provider changed versions without notice, someone re-indexed documents, the reranker was updated, the corpus grew. A defensive pattern: canonical queries executed periodically, score distribution archived as baseline, alert when the distribution diverges from the historical baseline. This is not optional, it is operational survival.

The cost layer

Of the five layers, cost is the only one where traditional observability actively misleads. A green dashboard while the system bleeds money. The bill arrives a month later, aggregated, with no breakdown by feature, no attribution by user, no distinction between legitimate use and a bug in a loop. An observability practice that does not include cost as a first-class citizen is incomplete for AI systems.

A decision that confuses many teams is how to model tokens. The correct answer is to model them simultaneously as span attribute and as metric. As an attribute, tokens travel alongside the individual trace: for this specific query from this specific user, how many tokens were consumed? In which phase? This is the queryable, debuggable, post-mortem dimension. As a metric with labels, tokens enable fast aggregation: total per model, per feature, per hour. Cardinality limits the labels: do not put user_id as a label if the TSDB is to remain alive. That is why both forms are needed. The metric answers "how much." The attribute answers "who and why."

The concrete cost metrics any serious system should expose:

cost_usd_total{model, feature, tenant} — aggregate cost in dollars with relevant low-cardinality labels.
tokens_total{model, type, cache_status} — type is input or output, cache_status is cached or fresh.
cost_per_request_p50/p95/p99{feature} — distribution per feature, not global.
cost_anomaly_score{tenant} — deviation from the tenant's baseline. Alert when it exceeds N standard deviations.

On real-time cost calculation, tokens are a proxy: what is billed are dollars, and the conversion requires a pricing table the system must know. Strong position: the table lives in the collector, not in each SDK or the backend. Reason: updating prices is a configuration change, not a redeploy, and cost is materialized before storage rather than in post-processing. Distributing this logic per service in a new architecture adds friction for zero return.

Cost anti-patterns that recur in real systems: not separating input from output tokens (their prices differ by orders of magnitude), treating cached and non-cached tokens equally (Anthropic and OpenAI charge fractions of the standard price for cached), not versioning the model in spans (when the provider updates, historical metrics stop being comparable), not labeling by feature (when finance asks for a breakdown by functionality, no answer exists).

On prompt caching specifically: the gen_ai.usage.cached_tokens attribute is the difference between knowing what is billed and guessing. Providers offer cached tokens at a fraction of the price. Optimizing prompts to maximize cache hits changes the system's economics, but only if the observability layer distinguishes them. Treating all tokens equally makes attribution lie and optimizations operate blind.

Cost anomaly detection is where this layer demonstrates its operational value. A user suddenly consuming a hundred times the normal usage may be a legitimate use case, a bug retrying in a loop, or an attack. The three situations require different responses, and only well-instrumented systems can distinguish them quickly. Useful signals: associated errors, temporal patterns, similarity between consecutive prompts, correlation with other users. Without these signals, the first attempt at resource exhaustion becomes visible only through the bill.

The strong idea of this layer: in AI systems, FinOps is not a department separate from observability. It is a dimension alongside latency, errors, and throughput. Treating it as a monthly finance problem leads to consistent lateness.

The quality layer

Here lies the gap most teams leave uncovered. Quality is the dimension that distinguishes an AI system from a traditional web service, and yet it is rarely treated as a first-class metric. A response can be fast, cheap, and technically successful, and still be bad. An observability practice that does not capture that difference operates a system whose output is not being measured.

The first problem is one of definition. What is "quality" when discussing LLM output? Not a single metric. It is a vector with at least four dimensions worth handling separately: groundedness (the response is supported by retrieved documents or fabricates content), relevance (the response addresses what the user actually asked), completeness (the response covers all aspects of the query), safety (the response does not include problematic content). Each dimension needs its own scorer, its own threshold, its own alert.

Operationalizing quality requires three complementary approaches. None alone is sufficient.

Indirect user signals. The user follows up with the same question rephrased, clicks a "this didn't help" button, abandons the session, copies or does not copy the response content. Each signal is noisy, but aggregated over thousands of interactions they reveal patterns of degradation. These signals are linked back to the original span via the session.id, closing the loop between what happened and how it was received.

Structural metrics. They do not measure quality directly, but their drift over time almost always indicates that something has changed. Distribution of similarity scores returned by vector search. Percentage of the top_k that ended up in the prompt. Proportion of retrieved documents referenced in the response. Size of the constructed prompt. Number of agent iterations. These metrics are cheap, deterministic, and useful as an early warning system.

LLM-as-judge. A model, typically more expensive or more capable than the production one, evaluates a sample of responses against the defined quality dimensions. The score returns to the original trace via session ID.

{
  "name": "quality.evaluation",
  "attributes": {
    "trace_id_ref": "4f7e9a8b...",
    "session.id": "sess_abc123",
    "quality.score.overall": 3.2,
    "quality.scorer.type": "llm_judge",
    "quality.scorer.model": "claude-opus-4",
    "quality.dimensions.groundedness": 4.0,
    "quality.dimensions.relevance": 2.0,
    "quality.dimensions.completeness": 4.0,
    "quality.dimensions.safety": 5.0,
    "quality.evaluated_at": "2026-04-30T14:23:00Z"
  }
}

This brings up the problem few teams address: the recursive observability hell. The judge model itself suffers from hallucinations, drift, and high cost. Who watches the watcher. Mandatory judge metrics: judge.cost.usd_total, judge.score.distribution, judge.agreement_rate (with humans on sub-samples), judge.refusal_rate. Without these, the judge is a black box evaluating another black box. Strong position: when the LLM-as-judge cost exceeds five percent of production cost, the evaluation pipeline ceases to be sustainable and must be redesigned.

The sampling strategy for evaluation is also non-trivial. Uniform sampling captures average behavior but misses the rare cases that matter most. Stratified sampling by feature, by tenant, by error type captures the extremes better. The defensible practice: 100% of responses with user-reported errors, stratified sampling of 1-5% of the rest, sampling of cost and latency outliers.

The concrete quality metrics any serious system exposes:

quality_score_p50/p95{dimension, feature} — distribution per dimension and feature, not aggregated.
quality_alert_threshold_breaches{dimension} — number of evaluations that fell below the defined threshold.
user_negative_feedback_rate{feature} — noisy but actionable proxy.
quality_drift_score — deviation of the current distribution from a baseline window. Automatic alert when it exceeds N sigmas.

The operating rule: quality needs its own dashboard, its own alerts, and its own weekly review. Without the same rigor applied to latency and errors, the system's output remains outside the scope of measurement.

Correlation across layers

Here lies the real value of having all layers instrumented: the correlations between them. An AI system is not modular, it is systemic. Problems do not appear isolated in one layer; they appear in chains that cross several. Treating each layer separately and watching independent dashboards loses the pattern.

Three correlation chains that recur in real systems:

Chain A: silent retrieval degradation. Vector search scores fall, retrieved documents become less relevant, the model compensates by fabricating content (hallucination), quality drops without latency or cost raising any flag, and users start abandoning. A retrieval-only dashboard shows scores falling but "still passing the threshold." A feedback-only analysis surfaces the problem weeks late. The signal appears by crossing both views.

Chain B: context overflow from corpus growth. The corpus grows, the constructed prompt becomes longer, at some point it begins silently truncating at the model's limit, responses lose completeness, and meanwhile the cost of input tokens keeps rising even though they add no real value. Without correlation between prompt.tokens (retrieval layer) and quality.dimensions.completeness (quality layer), the problem hides for months.

Chain C: runaway agent. A tool fails with an error the agent cannot interpret. The agent retries. Iterations grow. Each iteration adds tokens. The monthly bill arrives with a forty percent increase without anything having triggered traditional monitoring. The signal is the correlation between tool.error.rate (orchestration layer) and iterations.count (orchestration layer) and cost.usd (cost layer).

The operational consequence is that the useful dashboard is not one per layer. It is one per correlation. The most valuable: quality.score vs retrieval.scores.avg, cost.per_request vs iterations.count, prompt.tokens vs quality.completeness. Observability limited to isolated layers produces symptoms, not diagnoses.

Security as a telemetry filter

Security in AI systems is neither a future problem nor a footnote in a "what is not yet solved" section. It is an initial filter that blocks requests before they reach the LLM, and that filter generates its own massive telemetry that must be observed with the same rigor as the rest of the system. Treating it as a minor section is among the costliest mistakes that can be made in enterprise AI architecture.

Guardrails apply at two points: before sending the prompt to the LLM (input guardrails) and before returning the response to the user (output guardrails). Each detects different patterns. Input guardrails look for prompt injection, jailbreaks, out-of-scope queries, PII in external user requests. Output guardrails look for PII leaked by the model, generated problematic content, hallucinations detectable by source verification.

Each guardrail activation is a span with its attributes:

{
  "name": "guardrail.input.check",
  "attributes": {
    "guardrail.type": "prompt_injection",
    "guardrail.detector": "regex_v2 + llm_classifier",
    "guardrail.verdict": "blocked",
    "guardrail.confidence": 0.87,
    "guardrail.signals": [
      "imperative_instructions_in_user_content",
      "system_prompt_override_attempt"
    ],
    "guardrail.action": "request_rejected",
    "user.id_hash": "..."
  }
}

The concrete signals worth capturing to detect prompt injection through patterns in spans: anomalous prompt length compared to the user's distribution, presence of imperative instructions in user content ("ignore previous instructions", "act as"), embeddings close to known jailbreak corpora, unusual ratio of special characters or control tokens, attempts to inject markdown or structures that mimic system instructions.

Mandatory metrics for the security layer:

guardrail_blocks_total{type, detector} — blocks per type and detector. The distribution reveals which attacks are most frequent.
guardrail_false_positive_rate{type} — legitimate requests blocked, measured on a sample with human review.
guardrail_latency_ms_p95{type} — guardrails add latency before the LLM. Above 200ms, TTFT starts to feel slow.
guardrail_bypass_attempts{user_id_hash} — repeated attempts by the same user after a block. Adversarial actor signal.

Strong position: an AI API exposed to external users without input guardrails is not free of prompt injection attacks; it is simply not registering them. Prompt injection attacks on public APIs are the norm, not the exception.

Correlation with other layers is again where the value lies. A spike of blocks from a specific detector correlated with a particular endpoint indicates that the endpoint exposes a new surface. A high bypass attempt rate from a specific user is a candidate for investigation or aggressive rate limiting. False positives growing after a detector update is a regression in the classification model worth rolling back.

The telemetry pipeline

So far the discussion has covered what to generate. This section is operational: how to process it. The telemetry pipeline connects SDKs with backends, and in AI systems it has specific demands that justify a different architecture from traditional microservices.

The collector has three non-negotiable processors in AI systems. First, PII redaction, which in this domain is non-trivial because PII lives in natural language inside prompts, not in structured fields. Regex rules for obvious cases (emails, phone numbers), lightweight NER models as sidecar for names and entities, allowlist of safe fields rather than denylist (the former philosophy is more conservative, and in this domain the cost of a false negative far exceeds that of a false positive). Second, tail sampling: head sampling leaves too much value on the table. Observing the complete trace before deciding, retaining 100% of errors and cost outliers, and sampling the rest is the right approach. The RAM and CPU cost on the collector under tail sampling is real and must be planned for. Third, enrichment: cost calculation from tokens (the pricing table lives here), labeling by feature and tenant if not provided by the SDK, annotation of the active model version.

Storage architecture is where most teams collapse economically. Strong position: use data tiering from day one. Three tiers, three economics. Hot path on ClickHouse or equivalent: structured metadata, aggregable metrics, all attributes without long natural-language text. Queryable, low-latency. Intermediate buffer on Kafka: heavy prompts and completions are published to a dedicated topic, decoupled from the critical path, consumed asynchronously. Cold storage on S3 or GCS: full text is stored indexed by content hash. The hot span stores a pointer to the hash, not the text. Recovering full text is an on-demand operation, not automatic.

The result of this architecture is that the per-trace cost on the hot path remains predictable even as prompts grow, while cold storage has a far cheaper economics. Without tiering, every new long prompt linearly amplifies the hot path bill, and the system becomes unsustainable at scale.

On the backend decision: for teams smaller than ten people or with volume below a few thousand requests per day, the most efficient path is to start with a specialized vendor (Langfuse, Arize, Helicone). Self-hosting OTel for AI before reaching scale is operational friction with zero return. Migration to a custom stack is justified once volume offsets the operational cost. Large teams with mature internal platforms can skip that phase, but they are the exception, not the rule.

Anti-patterns

A short list of errors that recur in real systems and warrant explicit avoidance:

A single span for the entire RAG. Hides where the problem is when something fails.
Not separating input from output tokens. Their prices differ by orders of magnitude.
Not tracking cache hits per level. A drop in hit rate goes unnoticed until the bill arrives.
Not versioning the model in each span. When the provider updates, historical metrics stop being comparable.
No labeling by feature or tenant. Breakdowns by functionality or by client are inaccessible.
Capturing the full prompt in every span. Triples the tracing bill and leaks PII.
Using Jaeger for agents with sub-agents. A wall of horizontal bars that does not fit on screen.
Treating all LLM errors as one. Rate limit, content filter, and malformed structured output are different problems.
Measuring aggregate latency in streaming. TTFT and TPOT are the metrics the user perceives.
Not instrumenting guardrails. An exposed API without guardrails is already under attack, only invisibly.
LLM-as-judge without observing the judge. A black box evaluating another black box.
Quality as an anecdotal metric. Without its own dashboard and automatic alerts, it is not a metric, it is theater.

Minimal implementation

To bring this model into production immediately, the actionable minimum set is:

Mandatory spans: root span per user request, span per agent iteration, span per tool call, span per RAG phase (minimum: vector search and generation), span per guardrail activation.

Mandatory attributes on every LLM generation span: gen_ai.system, gen_ai.request.model, gen_ai.response.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.usage.cached_tokens, gen_ai.error.type, cost.usd, feature.name, tenant.id, stream.ttft_ms, stream.tpot_ms_avg.

Mandatory attributes on every retrieval span: vector.index, vector.top_k, vector.distance.avg, cache.embedding.hit, cache.search.hit, docs.retrieved (IDs), prompt.tokens (final).

Minimum metrics with low-cardinality labels: cost_usd_total{model, feature, tenant}, tokens_total{model, type, cache_status}, rag_phase_latency_ms{phase}, rag_cache_hit_ratio{level}, quality_score_p50/p95{dimension, feature}, guardrail_blocks_total{type}, stream_ttft_ms{model}, stream_tpot_ms{model}.

Collector with three processors: PII redaction (allowlist-first), tail sampling (100% errors, 100% cost outliers, sampling of the rest), enrichment with centralized cost calculation.

Storage in three tiers: ClickHouse for hot (or vendor if volume is low), Kafka as buffer for heavy prompts/completions, S3 for cold with hash indexing.

Dashboards from day one: latency per phase, TTFT and TPOT per model, cost per feature with anomaly detection, error rate per LLM type, sampled quality per dimension, guardrail blocks per type, the three cross-layer correlations (quality vs retrieval, cost vs iterations, prompt tokens vs completeness).

With this set, the system is observable. Lacking any of these pieces, the system appears observable without being so.

Closing

The observability of AI systems is a discipline still taking shape. There is not yet a decade of refined practices comparable to those available for web services. What is available is a partial map, a set of emerging patterns, and a tooling ecosystem that changes every quarter. Against that volatility, the most useful thing is to understand the problems each layer solves and how they intersect.

Three operational principles:

Without measuring cost per feature, the system operates blind. The monthly bill is not a metric, it is a delayed punishment. Cost must be treated as a real-time, attributable, alertable dimension, on equal footing with latency and errors.

Without separating quality from latency as independent metrics, claiming the system works has no foundation. A 200 OK is not success in AI. A fast response can be a bad response. Without a continuous evaluation system with its own dashboard, what is measured is server availability, not product functionality.

Without tracing each agent iteration with its nested tool calls, there is no real understanding of the system. Runaway agents, sub-agents that consume the budget, chains of tool calls that fail silently: all these problems are invisible without hierarchical instrumentation. And all of them translate into cost or reputational loss.

The ideal observability system for an AI application is not the one with the most metrics or the most expensive. It is the one that spans the five layers and their correlations at once and allows answering, in less than five minutes, the questions that matter: why a response is bad, what it cost, which documents support it, whether the agent behaved as expected, whether there were attack attempts. When the observability system fails to answer these questions, the problem lies not in the tool but in the underlying conceptual model.

How Cloudflare Replaced NGINX with Rust, Tokio, and Pingora — and Saved 434 Years of TLS Handshakes Every Single Day

Rafa Calderon — Fri, 03 Apr 2026 12:08:23 +0000

1 trillion requests per day, years of workarounds, and an architectural problem with no patch. This is the story of what broke, what Cloudflare built to replace it, and why every technical decision in Pingora is a direct answer to a specific NGINX limitation.

First, the numbers

Cloudflare published the migration performance data in their blog in September 2022. Not projections, not synthetic benchmarks — production metrics on real traffic:

Metric	Result
CPU and memory consumed	−66% vs NGINX on identical hardware
New TCP connections opened per second	3× fewer globally
Connection reuse rate (major customer)	From 87.1% → 99.92%
Reduction in new connections (same customer)	160× fewer
Median TTFB	−5ms
95th percentile TTFB	−80ms
TLS handshake time saved	434 years... per day

The 434-year number is the direct mathematical consequence of 99.92% reuse at their scale. To understand why that number was unreachable with NGINX, you need to understand how NGINX works internally.

How NGINX works internally — and where it breaks

NGINX uses a master-worker process model: one master process that coordinates, and N worker processes — typically one per CPU core. The OS assigns each incoming connection to a worker, and that connection lives there until it finishes. The worker has complete ownership.
This design was brilliant at the time. It avoids inter-process synchronization complexity and makes good use of per-core cache locality. But it has three structural problems that cannot be solved from within the model:

Problem 1: connection pools are not shared

Each worker maintains its own connection pool toward origins. If worker A has an open, idle TLS connection to api.client.com, and a new request toward that same origin arrives and the OS assigns it to worker B — worker B opens a new connection from scratch. It has no access to worker A's pool.

On a server with 16 workers and an origin that all of them need to connect to, in the worst case you have 16 connections doing the work that one could do. Multiply this by the number of origins, the number of servers in Cloudflare's datacenters, and the traffic volume — the waste is enormous.

Every new connection means a TCP handshake and, over HTTPS, a TLS handshake. TLS 1.3 has reduced the latency of this process, but it carries non-trivial CPU cost and round-trips. Cloudflare estimated that, at their scale, this accumulated waste added up to 434 years of handshake time per day.

Cloudflare spent years trying to mitigate this problem. They wrote multiple posts about their NGINX workarounds. Eventually they reached the conclusion that every company reaches when they hit this limit: you cannot share a connection pool across NGINX worker processes because the isolated process model makes it architecturally impossible without rebuilding from scratch.

Problem 2: load imbalance from request pinning

Because each request is pinned to a worker for its entire lifetime, the OS has to distribute incoming requests across workers before knowing how long they'll take. A request that takes 2ms and one that takes 2 seconds waiting on a slow origin cost the OS exactly the same at assignment time — but the worker that gets the slow one is blocked for those 2 seconds.

In practice, this translates to highly uneven CPU loads: some cores saturated, others underutilized. Adding more workers improves the situation but aggravates the connection pool problem.

Problem 3: extensibility in C

Cloudflare needed to implement complex business logic in the proxy: routing by client characteristics, real-time header modification, custom cache logic, security rules. NGINX allows extension via C modules (requiring recompilation) or via Lua with OpenResty (introducing VM overhead and a separate runtime).

Both options have limits. C modules are prone to the same memory issues as NGINX itself. The team wanted to write business logic with modern systems language tooling — type safety, normal unit tests, the Rust crate ecosystem.

The underlying problem: memory safety in C

A network proxy processes untrusted data from the Internet on every request. Malformed headers, oversized bodies, payloads designed to exploit parsers. In C, a parsing bug can escalate to buffer overflow, heap corruption, or remote code execution.

This is not theoretical. CVE-2013-2028: stack buffer overflow in NGINX, remote code execution. CVE-2017-7529: integer overflow in the range request module, out-of-bounds memory read. CVE-2021-23017: off-by-one in NGINX's DNS resolver. In 2022, the NSA and CISA jointly published a guide explicitly recommending migrating critical infrastructure to memory-safe languages.

Cloudflare was already a Rust shop. The decision was practical: an entire class of vulnerabilities becomes impossible in safe Rust because the compiler rejects the code that could cause them.

The response: what each Pingora decision solves

With the problems clear, Pingora's architecture reads like a solution map. Every technical piece answers a specific problem.

Answer to Problem 1: a single multi-threaded process with a shared pool

The fix to the connection pool problem is changing the unit of isolation: from processes to threads within a single process.

Pingora runs as a single process with N threads — by default, one per core. All threads share the same memory space, and therefore the same connection pool. When thread A has an open TLS connection to api.client.com and thread B needs the same origin, it can reuse it directly. No IPC, no serialization, no coordination protocol — it's a pointer to a shared structure in memory.

This is what produces the jump from 87.1% to 99.92% connection reuse. It's not an optimization trick — it's the direct consequence of eliminating the process isolation that made sharing impossible.

Answer to Problem 2: Tokio's work-stealing scheduler

Multi-threading solves the connection pool but introduces a new risk: if you distribute work across threads statically, imbalance persists.

Pingora uses Tokio's multi-thread work-stealing scheduler. According to Tokio's official documentation and the technical post on the scheduler rewrite, the mechanism is more sophisticated than its name suggests.

Each worker thread maintains three levels of task queues:

LIFO slot: the most recently spawned task. Executed first to maximize CPU cache locality.
Local queue: bounded (max 256 tasks), lock-free. Only the owning thread pushes; other threads can steal from the opposite end.
Global queue: consulted periodically, not every tick, to minimize synchronization overhead.

When a thread exhausts its queues, it picks a random victim thread and steals half its local queue in a single batch operation — multiple tasks at once to reduce per-steal synchronization cost. The queues are implemented with atomic operations (compare-and-swap), no mutexes on the hot path.

The practical result over NGINX's request pinning problem: in Pingora, when a connection is waiting on a slow origin, that connection is simply a suspended Future in the scheduler. The thread that initiated it can steal work from other threads and keep processing other requests. No blocking. Cores are utilized uniformly and automatically.

Answer to Problem 3: the ProxyHttp trait

Instead of C modules or Lua scripts, Pingora exposes a Rust trait called ProxyHttp that defines phases of each request's lifecycle. It's the same mental model as NGINX/OpenResty's configurable phases — but implemented in compiled, type-safe Rust.

Phase	When it runs
`request_filter`	When the client request arrives
`upstream_peer`	To decide which backend gets the request (required)
`upstream_request_filter`	Before forwarding the request to the backend
`upstream_response_filter`	When the backend response arrives
`response_filter`	Before sending the response to the client
`logging`	Always, even on error. For metrics and tracing.

You implement only the phases you need. The Rust compiler guarantees your implementation is type-correct, race-free, and memory-safe. A logic bug in your request_filter is a compilation error or a failing test — not a segfault in production at 3am.

Answer to the memory safety problem: Rust by construction

In safe Rust, a buffer overflow is a compilation error. A use-after-free is a compilation error. A data race between threads is a compilation error. These aren't warnings, linters, or static analysis — the compiler rejects code that could cause those conditions.

Cloudflare reported a significant reduction in memory safety errors after the migration, and that engineers could focus on product logic instead of chasing segfaults. This category of improvement is hard to quantify in production numbers, but its consequences show up in what Cloudflare has built since: FL2, their system managing security and performance rules for every customer — 15 years of C, rewritten in 2024-2025 — is also built on Pingora.

The two mechanisms NGINX never had

Beyond the structural problems that motivated the migration, Cloudflare had to build two engineering pieces that didn't exist in NGINX and that they needed to operate at their scale.

TinyUFO: the cache algorithm they published as an independent crate

Caching in a high-traffic proxy has a problem that LRU doesn't solve well: web access patterns follow a Zipf distribution, where a few items are extremely hot and most are accessed very rarely. LRU treats all items equally in terms of admission — if something arrives and the cache is full, it evicts the least recently used, regardless of how frequently that item was accessed.

The result is that scans — sequential reads of items that won't be accessed again — can contaminate the cache and evict hot items. At 40M req/s, the consequences of a poor admission policy are measured in millions of additional cache misses.

Cloudflare built TinyUFO by combining two recent research algorithms:

S3-FIFO (ACM 2023 paper): instead of a doubly-linked list like LRU, uses three FIFO queues. FIFO queues have better CPU cache behavior — insertions and evictions are sequential memory accesses, not random pointer traversals. A ghost queue tracks recently evicted items; if they return quickly, they're promoted to the main queue instead of starting over.

TinyLFU: maintains approximate frequency counts using a Count-Min Sketch — a fixed-size probabilistic structure independent of the number of tracked items. Before admitting a new item, it checks whether its access frequency beats the item that would be evicted. Scans don't pass the filter because their items appear with frequency 1.

The design is completely lock-free: metadata operations use atomic compare-and-swap. In their benchmarks with 8 threads on x64 Linux, TinyUFO outperforms moka (another widely-used TinyLFU implementation) in throughput, precisely because it eliminates mutex contention.

They published it as an independent crate on crates.io, separate from the rest of Pingora. Usable in any Rust project that needs a high-performance in-memory cache.

Graceful Restart: transferring sockets between processes

A proxy in production needs to update without dropping traffic. The problem: the new process needs to bind() on the same port the old process is already listening on.

The kernel mechanism that solves this is SCM_RIGHTS — a Linux feature that Cloudflare documented in detail in their own blog: Know your SCM_RIGHTS.

The mechanics: file descriptors are process-local indices in a file descriptor table. They're not global kernel handles. SCM_RIGHTS allows sending an open file descriptor from one process to another using sendmsg() over a Unix domain socket — the kernel duplicates the underlying resource (the active network socket with all its established connections) into the receiving process's file descriptor table.

The Pingora upgrade protocol, from official docs:

The new binary starts with --upgrade. It does not call bind(). It connects to a coordination socket and waits.
SIGQUIT is sent to the old process. The old process transfers its listening socket FDs to the new one via SCM_RIGHTS.
The new process starts accepting new connections on the received sockets.
The old process drains: it finishes its in-flight requests within the grace period and exits.

The guarantees: every request is handled by the old process or the new one, never by neither. The listening socket is never closed. No client sees Connection Refused. No request that can finish within the grace period is cut.

HAProxy and Envoy use the same mechanism. What makes Pingora different is that it's integrated transparently into the server lifecycle — two terminal commands and the upgrade happens with no additional intervention.

What Pingora is today

Pingora has been open source since March 2024 (Apache 2.0). What has happened since then signals that the bet worked:

FL2 — the system Cloudflare calls the "brain" of their network, 15 years of C managing security and configuration rules for every customer — was rewritten in 2024-2025 on top of Pingora. This is not a satellite project: it's Cloudflare's central infrastructure.

ecdysis — the library that encapsulates the zero-downtime upgrade mechanism (SCM_RIGHTS) — was published in 2025 as an independent Rust crate. Usable in any Rust network service without depending on Pingora.

Pingora 0.8.0 patched request smuggling vulnerabilities in ingress proxy configurations, responsibly disclosed through their bug bounty.

The current MSRV is 1.84, with a rolling 6-month policy. The API is pre-1.0, so expect breaking changes — but the ProxyHttp trait has proven stable enough for production for years.

If you want to explore the codebase or build on it:

GitHub cloudflare/pingora — source code, architecture notes, and the full workspace
Quick Start guide — official tutorial, walks you through building a working load balancer
User Guide — configuration, TLS, graceful restart, custom services

Pingora isn't for everyone. If you need a reverse proxy you can configure in 10 minutes, Caddy or Traefik are better options — they're binaries, not frameworks. Pingora is for when you have a real infrastructure problem that a configurable proxy can't solve: connection efficiency at scale, routing logic that exceeds what Lua can express maintainably, or the requirement that a memory bug in your proxy not become a CVE.

Cloudflare spent years concluding they needed to build their own proxy. The deployment numbers say they got the technical decisions right.

References

How we built Pingora — Cloudflare Blog (2022) — all production data cited in this article
Open sourcing Pingora — Cloudflare Blog (2024) — the announcement and usage context
GitHub cloudflare/pingora — source code, documentation and examples
Tokio scheduler: making it 10x faster — the technical post on the Tokio scheduler rewrite
Tokio runtime docs — official multi-thread scheduler documentation
Know your SCM_RIGHTS — Cloudflare Blog — the FD transfer mechanism explained by the team that uses it
Graceful restart docs — the upgrade protocol officially documented
TinyUFO on crates.io — the independent crate with benchmark suite
S3-FIFO paper (ACM 2023) — the academic algorithm behind TinyUFO
Cloudflare FL2 (2025) — the system running on Pingora in production
NSA/CISA: Software Memory Safety (2022) — the government guide on memory-safe languages

How I Built a CRDT Engine for a Collaborative Whiteboard in Rust

Rafa Calderon — Sat, 07 Mar 2026 16:10:43 +0000

I'm currently building a real-time collaborative whiteboard. Think of it as Figma's infinite canvas, but focused on stylus input and handwriting. Multiple users draw simultaneously, offline sessions sync on reconnect, and every stroke appears on everyone's screen without conflicts.

Sounds simple. It isn't.

After evaluating existing CRDT libraries, none of them modeled the domain correctly — they're built for text editors, not vector graphics. So I built vectis-crdt: a Rust library that compiles to both native and WebAssembly, with the server also consuming it in Rust.

This is the story of why I made every design decision I made.

The problem: three requirements that pull in opposite directions

A real-time collaborative whiteboard has three fundamental constraints:

Immediate local responsiveness: every stylus touch must appear on screen before any server round-trip. 80ms of latency between pen-down and pixel-rendered breaks the experience.
Eventual convergence: two clients that have applied the same set of operations — in any order — must end up with identical visible state.
No conflicts: a whiteboard has no "conflicts" to resolve. Two users drawing simultaneously both draw. You never show a conflict dialog to someone holding a stylus.

The classic solution is CRDTs (Conflict-free Replicated Data Types). But which flavor?

Why RGA + YATA, not OT or simple counters

I evaluated three approaches:

Operational Transformation (OT) — Used by Google Docs. Requires a central server to sequence all operations before distributing them. That kills offline support and adds latency on every stroke.

State-based CRDTs (sets, counters) — Simple but wrong for this domain. A whiteboard needs ordered strokes (z-order). "Stroke B is drawn on top of stroke A" is fundamental semantics. A set has no order; a counter has no identity.

RGA (Replicated Growable Array) + YATA — This is what Yjs uses internally for text. It maintains a sequence with a total deterministic order for concurrent insertions. I adapted it: instead of characters, each slot holds a stroke reference.

The key insight: the array is small. A whiteboard has hundreds of strokes, not millions of characters. This allows simpler data structures — a Vec with a HashMap index rather than a tree — without any performance penalty.

Before diving into the code, let's map out the mental model. At its core, this engine relies on three pillars: a way to uniquely identify every single action across space and time (OpId and Vector Clocks), a strict set of rules to resolve order when users draw simultaneously (the YATA algorithm), and a mechanism to clean up deleted data without corrupting the document's history (Garbage Collection). Let's start from the bottom up.

The base layer: OpId and Vector Clocks

Every operation in the system gets a globally unique identifier:

pub struct OpId {
    pub lamport: LamportTs,  // monotonically increasing logical clock
    pub actor:   ActorId,    // u64 — compact wire representation of a peer
}

ActorId is a u64 assigned by the server on first connection, not a UUID. This saves 8 bytes per reference on the wire — significant when every stroke carries three of them (id, origin_left, origin_right).

The ordering on OpId is total and deterministic:

impl Ord for OpId {
    fn cmp(&self, other: &Self) -> Ordering {
        self.lamport.0.cmp(&other.lamport.0)
            .then_with(|| self.actor.0.cmp(&other.actor.0))
    }
}

Higher Lamport wins; on tie (concurrent operations), higher ActorId wins. This is the tiebreaker that makes the CRDT converge when two users draw at the exact same logical moment.

For causality tracking, each peer maintains a Vector Clock:

pub struct VectorClock {
    clocks: BTreeMap<ActorId, u64>,  // actor → max lamport seen from that actor
}

The vector clock drives three features: causal delivery, delta synchronization, and garbage collection. It's the single most important data structure in the system.

Operation lifecycle

Before diving into each component, it's worth seeing how they fit together:

insert_stroke()
    │
    ├─ simplify(epsilon)        ← RDP reduces 500 pts → ~40 pts
    │
    ├─ tick LamportTs           ← generates new OpId
    │
    ├─ RgaArray::integrate()    ← places the stroke in z-order
    │
    ├─ StrokeStore::insert()    ← stores points + LWW properties
    │
    └─ pending_ops.push()       ← queues for sending to server
                                       │
                                       ▼
                               encode_update() → wire (LEB128 + f32 LE)
                                       │
                                       ▼
                               peer receives → CausalBuffer → apply_remote()

Each stage is independent and testable. The separation between RgaArray (only 16-byte references) and StrokeStore (actual point data) is deliberate: it keeps the working set for conflict integration in L1/L2 cache.

The core CRDT: the YATA integration algorithm

The RGA array maintains the z-ordering of strokes. Each item stores its insertion context:

pub struct RgaItem {
    pub id:           OpId,
    pub origin_left:  OpId,   // item to the left at insert time
    pub origin_right: OpId,   // item to the right at insert time
    pub content:      StrokeId,
    pub state:        ItemState,  // Active or Tombstone { deleted_at: OpId }
}

The genius of YATA is how it resolves concurrent insertions. Imagine Alice and Bob both insert a stroke at the same position simultaneously:

Alice inserts A with origin_left = X
Bob inserts B with origin_left = X

Without a rule, the order would depend on which operation arrives first — breaking convergence. The YATA rule:

The YATA Rule: Within items sharing the same origin_left, items whose own origin_left is to the right of ours belong to the "right subtree" and are skipped. Among the remaining items in the same zone, higher OpId goes further left.

The implementation:

pub fn integrate(&mut self, item: RgaItem) {
    // Idempotent: if already present, do nothing.
    if self.index.contains_key(&item.id) { return; }

    let scan_start      = /* position after origin_left */;
    let scan_end        = /* position of origin_right */;
    let origin_left_pos = /* position of our origin_left */;

    let mut insert_pos = scan_start;
    for i in scan_start..scan_end {
        let existing        = &self.items[i];
        let existing_ol_pos = /* position of existing.origin_left */;

        if existing_ol_pos < origin_left_pos {
            break;  // passed our zone
        } else if existing_ol_pos > origin_left_pos {
            insert_pos = i + 1;  // skip right-subtree item
        } else {
            // Same zone: higher OpId → further left
            if existing.id > item.id {
                insert_pos = i + 1;
            } else {
                break;
            }
        }
    }
    self.items.insert(insert_pos, item);
    self.rebuild_index_from(insert_pos);
}

This is O(k) where k is the number of concurrent conflicting operations at that position — typically 1 or 2 in practice, O(n) worst case.

Deletions: tombstones and why you can't just remove items

In a distributed system, you can't immediately remove a deleted item from the array. The problematic scenario:

Alice has [A, B, C]. She deletes B.
Bob, offline, inserts D after B. Bob has [A, B, D, C].
Bob reconnects and his insert arrives.
If we'd already erased B from Alice's array, D's origin_left = B.id would be unresolvable.

The solution: tombstones. Deleted items remain in the array with state Tombstone { deleted_at: OpId }, invisible to the application but still present for conflict resolution.

This means the array can grow unbounded over time — which leads to garbage collection.

Causal delivery: the CausalBuffer

Operations can arrive out of order over WebSocket. If InsertStroke(B, origin_left=A.id) arrives before InsertStroke(A), applying B immediately would place it at the wrong z-order position.

The CausalBuffer holds not-yet-ready operations and retries them every time a new operation is successfully applied. The non-obvious part is that this can trigger a cascade: applying A unblocks B, which in turn unblocks C and D that were waiting on B. A single operation can free an entire chain.

fn apply_remote_buffered(&mut self, op: Operation) {
    self.pending.push(op);
    // Loop until no more ops can be unblocked
    loop {
        let before = self.pending.len();
        self.pending.retain(|op| {
            if is_causally_ready(op, &self.doc) {
                self.doc.apply_remote(op.clone());
                false  // remove from buffer
            } else {
                true   // keep waiting
            }
        });
        if self.pending.len() == before { break; }  // nothing new unblocked
    }
}

fn is_causally_ready(op: &Operation, doc: &Document) -> bool {
    match op {
        Operation::InsertStroke { origin_left, .. } =>
            origin_left.is_zero() || doc.stroke_order.index.contains_key(origin_left),
        Operation::DeleteStroke { target, .. } =>
            doc.stroke_order.index.contains_key(target),
        Operation::UpdateProperty { target, .. } =>
            doc.stroke_store.contains(target),
        Operation::UpdateMetadata { .. } => true,
    }
}

The buffer has a hard limit of 10,000 operations. If exceeded, the client requests a full snapshot from the server rather than trying to recover — a safer failure mode than OOM.

Incremental Garbage Collection with re-parenting

Tombstones are safe to remove only when all known peers have seen the deletion — a condition called "causal stability". The system tracks this via the Minimum Version Vector (MVV): the server periodically broadcasts the pointwise minimum of all known vector clocks.

A tombstone T with deleted_at = op is GC-eligible when:

mvv.get(op.actor) >= op.lamport

Incremental GC runs in bounded cycles (max_gc_per_cycle = 5,000 items) to avoid long pauses.

The critical bug without re-parenting

Here's the problem I almost missed. Consider this state:

Array: [A] → [B, active] → [C, active]
              C.origin_left = B.id

B gets deleted (tombstone). Eventually, GC removes it when causally stable. Now C.origin_left points to an ID that no longer exists in the array.

When the snapshot is serialized and reconstructed on another peer, the op sequence is: InsertStroke(A), InsertStroke(C, origin_left=B.id). But B is not in the snapshot because it was GC'd. C has nowhere to anchor → it gets inserted at the end. Z-order is corrupted.

The fix: re-parenting before retain. Before physically erasing tombstones, the GC walks surviving items whose origin_left points to an ID in the remove_set, and finds the nearest surviving ancestor:

fn find_kept_ancestor(mut origin: OpId, remove_set: &HashSet<OpId>, ...) -> OpId {
    for _ in 0..MAX_DEPTH {
        if !remove_set.contains(&origin) { return origin; }
        origin = items[index[&origin]].origin_left;
    }
    OpId::ZERO  // attach to root if the entire chain was GC'd
}

The multi-hop case: A → B(deleted) → C(deleted) → D(alive). GC of B and C re-parents D directly to A. This is deterministic: two peers with the same MVV produce exactly the same re-parented state. Convergence is preserved.

Mutable properties: LWW-Registers per field

Strokes have mutable properties: color, stroke width, opacity, transform. If Alice changes the color while Bob changes the opacity, both changes must survive — not conflict.

The solution: each property is an independent Last-Write-Wins Register:

pub struct StrokeProperties {
    pub color:        LwwRegister<u32>,
    pub stroke_width: LwwRegister<f32>,
    pub opacity:      LwwRegister<f32>,
    pub transform:    LwwRegister<Transform2D>,
}

impl<T: Clone> LwwRegister<T> {
    pub fn apply(&mut self, value: T, timestamp: OpId) -> bool {
        if timestamp > self.timestamp {
            self.value = value;
            self.timestamp = timestamp;
            true
        } else { false }
    }
}

Color change and opacity change use different registers → both survive. Concurrent color changes by two users → higher timestamp wins. On equal timestamps (same Lamport), the higher ActorId wins — the same deterministic tiebreaker the RGA uses.

An honest note on LWW: the winner is the one with the higher OpId, not the most recent by wall-clock time. If Alice writes color=red at t=5 and Bob writes color=blue also at t=5 but with a higher ActorId, Alice's red is lost even if she was "the last one" from her perspective. For aesthetic stroke properties, this semantics is acceptable.

Delta synchronization

When a client reconnects after being offline, you don't want to send the entire document history — just the operations the client hasn't seen yet.

The Vector Clock diff method computes exactly this:

pub fn diff(&self, other: &VectorClock) -> Vec<(ActorId, u64, u64)> {
    // Returns (actor, from_ts, to_ts) ranges
    // where `self` has seen more than `other`
    self.clocks.iter()
        .filter(|(&actor, &my_ts)| my_ts > other.get(actor))
        .map(|(&actor, &my_ts)| (actor, other.get(actor) + 1, my_ts))
        .collect()
}

Client sends its vector clock → server computes diff → server sends only the missing operations. O(actors) to compute, not O(operations). For a document with 50,000 operations of history and a client that missed 200, those 200 ops are transmitted — not 50,000.

A production detail that matters: a peer disconnected for longer than the gc_grace_period may hold references to already-GC'd tombstones. On reconnect, it must receive a full snapshot instead of a delta — the server needs to detect this condition by comparing the client's vector clock against the current MVV.

Stroke simplification: Ramer-Douglas-Peucker at insert time

A stylus at 240Hz produces one point every ~4ms. A 3-second stroke = ~720 raw points. Storing and transmitting all of them is wasteful — the human eye can't perceive the difference at normal zoom levels.

I implemented the Ramer-Douglas-Peucker algorithm, applied automatically at insert time:

pub fn insert_stroke(&mut self, mut data: StrokeData, props: StrokeProperties) -> StrokeId {
    if self.simplify_epsilon > 0.0 && data.points.len() > 2 {
        data.simplify(self.simplify_epsilon);  // in-place, before storing
    }
    // ...
}

The RDP implementation uses an iterative stack rather than recursion — no stack overflow risk for 50k-point strokes:

fn rdp_indices(points: &[StrokePoint], epsilon: f32) -> Vec<usize> {
    let mut stack: Vec<(usize, usize)> = Vec::with_capacity(64);
    stack.push((0, n - 1));
    while let Some((start, end)) = stack.pop() {
        // find max perpendicular distance in [start, end]
        // if > epsilon: keep the point, push both halves
    }
}

Typical reduction at epsilon = 0.5:

Stroke type	Original pts	With epsilon=0.5	Reduction
Straight line	500	2	99.6%
Smooth curve	500	25–60	~90%
Calligraphic signature	500	80–150	~75%

Simplification happens before storing in the store and before emitting the network op — the remote peer receives the already-simplified stroke.

Zero-copy Wasm rendering: bypassing the JS boundary

The rendering hot path must be fast. Every animation frame, the canvas engine needs all visible stroke data. Crossing the JS↔Wasm boundary with individual function calls is expensive (~100–200ns each).

The solution: pack all visible strokes into a single contiguous buffer in Wasm linear memory, then hand JS a raw pointer:

#[wasm_bindgen]
pub fn build_render_data_viewport(
    &mut self, vx0: f32, vy0: f32, vx1: f32, vy1: f32, stroke_expand: f32
) -> *const u8 {
    let viewport = Aabb { min_x: vx0, min_y: vy0, max_x: vx1, max_y: vy1 };
    self.render_buf.clear();  // reuse buffer — no alloc
    for id in self.inner.visible_stroke_ids() {
        if let Some((data, props)) = self.inner.get_stroke(&id) {
            let bounds = if props.transform.value.is_identity() {
                data.bounds.expanded(stroke_expand)
            } else {
                data.bounds.transform(&props.transform.value).expanded(stroke_expand)
            };
            if !bounds.intersects(&viewport) { continue; }  // AABB culling
            write_stroke_to_buf(&mut self.render_buf, &id, data, props);
        }
    }
    self.render_buf.as_ptr()
}

In JavaScript:

const ptr = doc.build_render_data_viewport(vx0, vy0, vx1, vy1, expand);
const len = doc.get_render_data_len();
const view = new DataView(wasmInstance.memory.buffer, ptr, len);
// Direct read from Wasm memory — zero copies, zero allocations on the JS side

render_buf is reused across frames with clear() instead of alloc. The pointer is valid only until the next operation that mutates the buffer — the JS client must read all data in the same frame before any mutation.

AABB culling skips strokes outside the viewport without iterating their points. For a whiteboard with 5,000 strokes but only 200 visible in the current viewport, the difference is substantial.

Wire format: LEB128 and encoding decisions

The binary protocol uses unsigned LEB128 for all integers and IEEE 754 little-endian for floats. Some non-obvious decisions:

Why u64 instead of UUID for ActorId: in LEB128, an ActorId < 2^14 takes 2 bytes. A UUID always takes 16 bytes. Each operation carries three OpIds (id, origin_left, origin_right): 6 bytes vs 48 bytes per stroke just in identifiers.

Why fixed LE for cursors instead of LEB128: awareness cursors are sent in bulk (N × 28 bytes) and decoded via chunks_exact(28) without parsing. The predictability of the fixed format compensates for the potential inefficiency for small ActorIds — cursors are not persisted and decode speed on the hot path matters more than size.

Optional LZ4 for updates > 200 bytes: stroke points compress well because they are sequential coordinates with small deltas. An InsertStroke of 100 points goes from ~1,200 bytes to ~900 bytes with LZ4. The 200-byte threshold avoids compressing small ops where the LZ4 header overhead would exceed the savings.

A typical InsertStroke of 40 simplified points takes ~560 bytes on the wire uncompressed.

Local undo

Undo on a collaborative whiteboard is subtle. The naive approach — "revert to previous state" — is wrong because you can't undo what other users did. The correct semantic: undo generates a delete operation that is broadcast to all peers.

pub fn undo_last_stroke(&mut self) -> Option<StrokeId> {
    while let Some(id) = self.undo_stack.pop() {
        if self.delete_stroke(id) {  // generates a DeleteStroke in pending_ops
            return Some(id);
        }
        // Stroke was already deleted by a remote peer → skip, try the previous one
    }
    None
}

The undo stack only tracks the local actor's strokes. The stack is session-only (not persisted in snapshots). The cap is 200 entries — enough for any reasonable undo sequence.

Defensive limits and error philosophy

The library enforces hard limits to prevent resource exhaustion from malformed or malicious payloads:

pub const MAX_POINTS_PER_STROKE: usize = 50_000;  // 240Hz × ~3.5min without simplification
pub const MAX_STROKES:           usize = 100_000;  // ~8 MB RGA memory
pub const MAX_ACTORS:            usize = 10_000;   // bounds the VectorClock BTreeMap

The philosophy is dual, and intentional:

decode_* paths (untrusted external data): return Err or None when limits are exceeded. The error propagates to the caller — malformed data is rejected before any allocation.
apply_remote (already-parsed ops from remote peers): silent drop. The CRDT is designed to be tolerant; dropping a remote op is preferable to OOM. If the document already has MAX_STROKES, the new one simply isn't inserted.

Local limits are more permissive: insert_stroke (trusted local operation) has no point count limit — auto-simplification with RDP epsilon=0.5 reduces 50k points to ~500 before storing.

Formal guarantees

Property	Guarantee
Strong Eventual Consistency	Two replicas with the same operation set have identical `visible_stroke_ids()`, regardless of application order.
Idempotency	`apply_remote(op)` called twice is equivalent to once. Safe with WebSocket redelivery.
Commutativity	Application order of concurrent ops doesn't change final state.
GC Safety	Only causally stable tombstones are removed. No operation that any known peer might still need is GC'd prematurely.
Snapshot-replay equivalence	`decode_snapshot(encode_snapshot(doc))` produces the same visible state as `doc`. Z-order and properties are identical.

What vectis-crdt does NOT guarantee

Equally important:

Limitation	Description
GC without MVV	GC requires the server to compute and broadcast the MVV. Without a server, GC cannot run safely in a pure P2P scenario.
LWW wall-clock consistency	If two peers modify the same property offline, the winner is the one with the higher OpId — not the "most recent by system clock".
Liveness under indefinite partition	If two peers are disconnected indefinitely, they don't converge until they reconnect. This is inherent to CRDTs without coordination.

Build output

cargo build --release --target wasm32-unknown-unknown
wasm-opt -O3 -o vectis_crdt_bg.wasm vectis_crdt_bg.wasm
gzip -9 vectis_crdt_bg.wasm

Result: ~85 KB gzipped. Fits comfortably in a single HTTP/2 push alongside the app bundle.

Conclusion

Building vectis-crdt showed me that domain-specific CRDTs are worth the investment. A general-purpose CRDT library would have forced the whiteboard domain to adapt to the library's model. Instead, the model adapts to the domain:

Strokes, not characters, are the unit of the array.
Z-ordering is a first-class concept, not an afterthought.
Simplification, viewport culling, and awareness are built in, not bolted on.

The YATA algorithm gives convergence. Vector clocks give causal consistency. The MVV gives safe GC. The Wasm bridge gives zero-copy rendering. Each piece is independently verifiable.

References and further reading

The algorithms and data structures in vectis-crdt have solid academic foundations. If you want to go deeper into the theory behind each decision:

RGA — Replicated Growable Array
Roh, H. G., Jeon, M., Kim, J. S., & Lee, J. (2011). Replicated abstract data types: Building blocks for collaborative applications. Journal of Parallel and Distributed Computing, 71(3), 354–368.
The original paper defining the replicated array model with origin_left and the integration algorithm.

YATA — Yet Another Transformation Approach
Nicolaescu, P., Jahns, K., Derntl, M., & Klamma, R. (2016). Near Real-Time Peer-to-Peer Shared Editing on Extensible Data Types. ECSCW 2016.
Introduces origin_right and the right-subtree correction that fixes the interleaving cases that plain RGA doesn't handle. The basis of Yjs's integration algorithm.

Lamport Clocks
Lamport, L. (1978). Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7), 558–565.
The foundational paper on logical clocks. Defines the "happens-before" relation and the monotonic clock construction used in LamportTs.

Vector Clocks
Fidge, C. J. (1988). Timestamps in message-passing systems that preserve the partial ordering. Proceedings of the 11th Australian Computer Science Conference.
Mattern, F. (1989). Virtual time and global states of distributed systems. Parallel and Distributed Algorithms.
Two independent and simultaneous papers that formalize vector clocks. Foundation of VectorClock::dominates and delta synchronization.

CRDTs — theoretical framework
Shapiro, M., Preguiça, N., Baquero, C., & Zawirski, M. (2011). Conflict-free Replicated Data Types. INRIA Research Report RR-7687.
Shapiro, M., et al. (2011). A comprehensive study of Convergent and Commutative Replicated Data Types. SSS 2011.
The two reference papers that formalize CRDTs, distinguish state-based from op-based, and establish the mathematical conditions for convergence.

Ramer-Douglas-Peucker
Ramer, U. (1972). An iterative procedure for the polygonal approximation of plane curves. Computer Graphics and Image Processing, 1(3), 244–256.
Douglas, D. H., & Peucker, T. K. (1973). Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica, 10(2), 112–122.
The two original papers (published independently) of the polyline simplification algorithm used in StrokeData::simplify.

Reference implementations

Yjs — mature YATA implementation for collaborative text in JS, by Nicolaescu et al.
Automerge — general-purpose CRDT with a Rust backend
diamond-types — high-performance reference implementation of RGA in Rust, by Joseph Gentle

vectis-crdt:
GitHub · crates.io · npm

Tags: rust, crdt, distributed-systems, webassembly, collaborative

16 Patterns for Crossing the WebAssembly Boundary (And the One That Wants to Kill Them All)

Rafa Calderon — Sun, 01 Mar 2026 06:06:22 +0000

WebAssembly is fast. We all know that by now. What almost nobody talks about is the hidden toll you pay every time you try to talk to it.

The moment your JavaScript code needs to pass a measly string to a WASM module, or your WASM tries to touch a DOM node, you slam face-first into the boundary — a literal wall between two worlds with fundamentally opposed type systems, memory models, and execution paradigms. On one side, JS breathes UTF-16 strings, garbage-collected live objects, and async promises. On the other, WASM is spartan: it only understands numeric primitives like i32 or f64, raw linear memory, and strictly synchronous execution.

Crossing this boundary is never free. Every interaction has a price, and depending on the strategy you choose to pay it, that cost can range from mathematically negligible to a painful "why on earth did I bother compiling this to WASM?"

What you're about to read is the definitive catalog of every known pattern for crossing this boundary, from the most trivial to the most exotic. To make sense of it all, I've organized them into three fundamental blocks based on the exact question they answer:

Block 1 — The Primitives: What things can actually cross the boundary and how do they do it?

Block 2 — Memory Strategies: How do you move heavy data efficiently without killing performance?

Block 3 — Flow Architectures: How do you orchestrate and design the conversation between both sides?

And to close, we'll talk about the Component Model — the emerging standard that aspires to turn all of these patterns into museum pieces.

Block 1 — The Primitives

What can cross the boundary, and how?

Before optimizing anything, you need to understand what can actually travel between the two worlds. WebAssembly's binary interface (ABI) is minimalist: numbers in, numbers out. Everything else — strings, objects, callbacks, DOM references — requires a translation layer.

The five patterns in this block are the foundation. Every advanced technique in the later blocks is built on top of one or more of these. Think of them as the alphabet: you need to know the letters before you can write sentences.

Pattern 1: Scalar Pass-through

The only thing WebAssembly can natively pass across its boundary: numbers.

Rust side:

#[no_mangle]
pub extern "C" fn add(a: i32, b: i32) -> i32 {
    a + b
}

JavaScript side:

const result = wasm.instance.exports.add(2, 3); // Returns 5

Functions that take integers (i32, i64) or floats (f32, f64) and return the same have zero serialization overhead. The values go straight onto the WASM stack. No memory allocation, no encoding, no copies.

This is the ideal case. The trouble starts when you need to pass a string, an array, or a JSON object. At that point, you leave paradise.

When to use it: Pure math functions, hash computations, physics calculations, or any logic where inputs and outputs are strictly numeric.

The tax: None. This crossing is completely tax-free.

Pattern 2: Pointer + Length Convention

The fundamental building block for passing anything more complex than a bare number. Both sides of the boundary agree on a strict protocol:

The caller writes the data into WASM's linear memory.
The caller passes two integer values (typically i32 or usize): the memory offset where the data starts (pointer) and the exact length in bytes.
The callee reads and processes from that memory region.

Rust side:

#[no_mangle]
pub extern "C" fn alloc(len: usize) -> *mut u8 {
    let mut buf = Vec::with_capacity(len);
    let ptr = buf.as_mut_ptr();
    std::mem::forget(buf); // Prevent Rust from freeing this memory automatically
    ptr
}

#[no_mangle]
pub unsafe extern "C" fn dealloc(ptr: *mut u8, len: usize) {
    // Reconstruct the Vec so Rust's allocator frees it when it goes out of scope
    let _ = Vec::from_raw_parts(ptr, 0, len);
}

#[no_mangle]
pub unsafe extern "C" fn process_string(ptr: *const u8, len: usize) -> i32 {
    let slice = std::slice::from_raw_parts(ptr, len);
    if let Ok(s) = std::str::from_utf8(slice) {
        s.len() as i32 // Simulate doing something with the string
    } else {
        -1
    }
}

JavaScript side:

const encoder = new TextEncoder();
const bytes = encoder.encode("Hello, WASM");

// 1. Ask Rust for memory
const ptr = wasm.instance.exports.alloc(bytes.length);

// 2. Write the bytes into linear memory
const memory = new Uint8Array(wasm.instance.exports.memory.buffer);
memory.set(bytes, ptr);

// 3. Pass the pointer and length to Rust
const result = wasm.instance.exports.process_string(ptr, bytes.length);

// 4. Free the memory explicitly to avoid leaks
wasm.instance.exports.dealloc(ptr, bytes.length);

This is exactly what every toolchain and code generator does under the hood. It's a completely manual process, highly error-prone, and requires you to manage memory allocation and deallocation yourself from JavaScript. In return, it gives you total, absolute control — no black boxes, no magic.

When to use it: When you need maximum control over memory, when you're writing very low-level base libraries, or simply when you're learning how WebAssembly's linear memory actually works.

The tax: Manual memory management. You pay the cost of encoding and decoding data (like TextEncoder), and you accept the constant risk of critical mistakes: using memory that's already been freed (use-after-free), freeing it twice (double-free), or simply forgetting to call dealloc and causing a memory leak that will eventually take down the browser tab.

Pattern 3: Opaque Handles / `externref`

Before the Reference Types standard landed in WebAssembly, life was miserable if your WASM code needed to hold a live reference to a JavaScript object (like a DOM node, a WebSocket connection, or a Canvas context). You had to build a lookup table manually on the JS side.

The old way (userland):
You'd create an array in JS. Every time you wanted to hand an object to Rust, you'd stick it in the array and give Rust the index (a plain i32). Rust would hand that i32 back when it needed to interact with the object, and JS would look it up in the array. It works, sure, but the lifecycle is a nightmare: when do you delete entries from the array so the garbage collector (GC) can reclaim memory? What happens if you create circular references?

The modern way (externref):
With the externref type (now standardized and implemented in all modern engines), WebAssembly can hold opaque references to JavaScript objects directly, no hacks required.

Rust side:

// At the low level, externref is a type managed by the engine.
// (Note: In real-world ecosystems, wasm-bindgen wraps this as the JsValue type.)

#[link(wasm_import_module = "env")]
extern "C" {
    // We import a JS function that knows what to do with the object
    fn js_set_text_content(node: externref, text_ptr: *const u8, len: usize);
}

#[no_mangle]
pub unsafe extern "C" fn process_node(node: externref) {
    // Rust holds the DOM object, but it's a black box.
    // It can't read or mutate it. It can only hand it back to JS.
    let msg = "Updated from Rust";
    js_set_text_content(node, msg.as_ptr(), msg.len());
}

JavaScript side:

// We pass the actual DOM object directly to the WebAssembly function
const button = document.getElementById("my-button");
wasm.instance.exports.process_node(button);

The absolute key to this pattern is the word "opaque." Rust receives the object, stores it in its registers or its own internal tables (using WebAssembly.Table), but to Rust it's inscrutable. It cannot inspect its properties or call its methods internally.

The only things it can do are: store it, pass it from one function to another, and hand it back to JavaScript so that JS can do the real work. The massive advantage is that the JavaScript engine (V8, SpiderMonkey, JavaScriptCore) now understands what's going on and automatically manages the lifecycle and garbage collection of that reference. No more memory leaks caused by your manual table.

When to use it: Whenever WASM needs to "remember" or retain a JS object — DOM nodes, event handlers, network resources, class instances. It eliminates in one stroke the need to maintain index tables in JavaScript and all the associated garbage collector headaches.

The tax: WebAssembly remains blind to the object. Every time you want to do something useful with it (read a property, modify its state), you have to pay the toll of crossing the boundary back to JavaScript. Holding the reference in Rust's pocket is free; trying to use it is not.

Pattern 4: Function Tables / `call_indirect`

How does WebAssembly call different JavaScript functions dynamically, without having to hardcode and declare every single import in the Rust source code?

The answer is WebAssembly.Table. It's essentially an array of function references that lives at the boundary and is accessible to both JS and WASM. WASM uses a special virtual-processor-level instruction called call_indirect, passing it an integer index. The engine looks up which function sits at that index in the table and executes it at runtime.

Rust side:

// In WebAssembly, a function pointer isn't a memory address.
// It's literally an index (an i32) into a WebAssembly.Table.

#[no_mangle]
pub unsafe extern "C" fn invoke_dynamic(callback_index: usize, value: i32) {
    // We trick the compiler by transmuting the integer index to a function pointer.
    // When compiled to WASM, this magically becomes a call_indirect instruction.
    let callback: extern "C" fn(i32) = std::mem::transmute(callback_index);

    // Execute the JS function pointed to by the index
    callback(value);
}

JavaScript side:

// 1. Create a table capable of storing function references
const table = new WebAssembly.Table({ initial: 10, element: "anyfunc" });

// 2. Place our JS function at index 0
table.set(0, (val) => console.log("Callback fired from Rust with value:", val));

// (When instantiating the WASM module, pass this table in the imports under env.table or similar)

// 3. Tell Rust to execute index 0
wasm.instance.exports.invoke_dynamic(0, 99);

This is exactly how Rust and C++ map their function pointers to the JavaScript world. When you have a function pointer in your compiled code, it doesn't point to WASM's linear memory — it points to a slot in this table. It's also the architectural foundation for building plugin systems where different WASM modules can register callbacks with each other.

When to use it: Callbacks, UI event handlers, polymorphic dispatch, or plugin architectures where the exact set of functions you'll invoke isn't known at compile time.

The tax: You pay one level of indirection (index → function lookup → execution). The CPU cost per call is tiny, nearly negligible, but the table itself requires extremely careful management if you're doing it by hand. You have to explicitly register functions, track which indices are free, and clean them up when a callback is no longer needed to avoid blowing past the table's limits.

Pattern 5: wasm-bindgen / Emscripten Glue

This isn't a separate interop pattern. It's a massive automation layer built squarely on top of the foundations of Patterns 2, 3, and 4 that we just covered.

Tools like wasm-bindgen in Rust handle generating all the intermediary JavaScript code (glue code) for you. Specifically, they automate:

String conversion using TextEncoder and TextDecoder (Pattern 2).
Table management for JS object references or native externref usage (Pattern 3).
Function table setup for injecting and executing callbacks (Pattern 4).
Manual linear memory allocation and deallocation behind the scenes.

Rust side:

use wasm_bindgen::prelude::*;

// This simple macro triggers all the plumbing
#[wasm_bindgen]
pub fn greet(name: &str) -> String {
    format!("Hello, {}!", name)
}

You write idiomatic Rust. The macro intercepts compilation and automatically generates the pointer-plus-length protocol, the memory allocation, and the JavaScript shim.

Critical insight: wasm-bindgen is to these low-level patterns what an ORM is to raw SQL queries. It's not a new mechanism — it's a code generator that hides the complexity. Understanding exactly what it generates beneath that macro is your only lifeline for debugging bottlenecks and knowing which critical parts of your application need you to skip the tool and cross the boundary manually.

When to use it: During prototyping, in the vast majority of business application code, and whenever development speed matters more than squeezing the last microsecond out of the processor.

The tax: The size of the glue code that will bloat your final JavaScript bundle. You're accepting automatic memory copies that might be unnecessary for your particular use case. On top of that, the very convenience of the abstraction is a trap: it makes it dangerously easy to cross the boundary thousands of times inside a for loop without ever noticing the cost you're paying.

Quick decision guide:

Only numbers? → Pattern 1
Occasional strings? → Pattern 2 / wasm-bindgen
JS objects? → Pattern 3

Block 2 — Memory Strategies

How do you move data efficiently?

Once you know what can cross the boundary, the next question is how much it costs. The default answer — "copy everything" — works, but it's the equivalent of shipping goods by air when a pipeline would do. The four patterns in this block are variations on the same theme: reducing or eliminating copies. They range from simple (creating a view instead of a copy) to sophisticated (agreeing on a binary layout so both sides can read the same bytes without any transformation). If your application moves more than trivial amounts of data across the boundary, at least one of these patterns will save you.

Pattern 6: Typed Array Views

Instead of copying data out of WASM memory into JS, you create a view directly on top of WASM's linear memory:

Rust side:

#[no_mangle]
pub extern "C" fn process_image(width: usize, height: usize) -> *const u8 {
    static mut BUFFER: Vec<u8> = Vec::new();
    unsafe {
        BUFFER.resize(width * height * 4, 255);
        BUFFER.as_ptr()
    }
}

JavaScript side:

const ptr = wasm.instance.exports.process_image(800, 600);
const wasmMemory = wasm.instance.exports.memory;

// Create the view over WASM memory — zero copies
const pixels = new Uint8ClampedArray(wasmMemory.buffer, ptr, 800 * 600 * 4);
const imageData = new ImageData(pixels, 800, 600);
ctx.putImageData(imageData, 0, 0);

Zero copies. JS reads directly from WASM memory. The typed array (Uint8Array, Float32Array, Int32Array, etc.) is just a view — a window into the same underlying ArrayBuffer.

The critical gotcha: If WASM calls memory.grow(), the underlying ArrayBuffer gets detached and all existing views are invalidated. You must re-create them after any potential growth. This is the single most common source of bugs with this pattern.

Mitigation: Pre-allocate enough memory upfront, or re-create views on every access (slightly slower but safe).

When to use it: Reading large results from WASM — rendered images, audio buffers, computed arrays. Anywhere you need to read (not write) massive data with zero overhead.

The tax: Fragility with memory.grow(). From JS's perspective it's read-only (writing through views is possible but risky if WASM is also writing).

Pattern 7: Memory Pool / Arena Allocation

Instead of allocating and freeing individual objects, you pre-allocate a large block of linear memory and use a simple bump allocator:

Rust side:

const ARENA_SIZE: usize = 1024 * 1024; // 1 MB
static mut ARENA: [u8; ARENA_SIZE] = [0; ARENA_SIZE];
static mut HEAD: usize = 0;

#[no_mangle]
pub unsafe extern "C" fn arena_alloc(size: usize) -> *mut u8 {
    let ptr = ARENA.as_mut_ptr().add(HEAD);
    HEAD += size; // Advance the pointer, no complex logic
    ptr
}

#[no_mangle]
pub unsafe extern "C" fn arena_reset() {
    HEAD = 0; // Free everything in one shot
}

All allocations advance the pointer. No individual free() calls. When you're done with the whole batch, reset the pointer to the beginning.

The web-specific benefit is subtle but important: by keeping all data inside WASM's linear memory, you avoid creating thousands of small JS objects that the garbage collector has to track. Arena allocation means the JS GC has nothing to do — all data lives in a single large ArrayBuffer that the GC sees as one object.

When to use it: Processing pipelines where you allocate many temporary objects (e.g., parsing, transformation). Per-frame allocation in games or visualizations.

The tax: You can't free individual allocations. The entire arena is all-or-nothing. It requires estimating the maximum memory you'll need ahead of time.

Pattern 8: Zero-Copy with Format-Aligned Layout (Arrow C Data Interface)

The most sophisticated zero-copy pattern available today. The key idea: if both sides of the boundary agree on an identical memory layout, you don't need to serialize or deserialize anything. You just share the pointer.

Apache Arrow defines a columnar memory layout that is identical across every implementation — Arrow C++, Arrow JS, Arrow Rust. When a Rust library compiled to WASM produces an Arrow RecordBatch, the bytes in WASM memory are already in the format Arrow JS expects.

The arrow-js-ffi library implements the Arrow C Data Interface in JavaScript, allowing it to read Arrow data directly from WASM memory:

JavaScript side (using arrow-js-ffi):

import { parseRecordBatch } from "arrow-js-ffi";

// Rust returns pointers to its internal Arrow structures
const ffiRecordBatch = wasmRecordBatch.intoFFI();

const recordBatch = parseRecordBatch(
    wasmMemory.buffer,
    ffiRecordBatch.arrayAddr(),
    ffiRecordBatch.schemaAddr(),
    false  // false = zero-copy view, don't move data to JS
);

This isn't limited to Arrow. Any format with a deterministic binary layout — FlatBuffers, Cap'n Proto, Protocol Buffers (wire format) — can achieve similar results. The principle is: agree on the byte layout at design time, and sharing becomes free at runtime.

DuckDB-WASM uses this approach to pass query results from its C++ engine (compiled to WASM) to JavaScript without serialization.

When to use it: Analytical workloads, large tabular datasets, any scenario where both sides can use the same binary format.

The tax: Both sides must implement the same format. Memory lifecycle management is complex — who owns the data? When is it safe to free it? Views over WASM memory are invalidated if memory.grow() is called.

Pattern 9: String Passing Optimizations

Strings deserve their own pattern because they are the single most expensive data type to cross the boundary.

The fundamental problem: WASM operates in UTF-8. JavaScript engines use UTF-16 internally (or Latin-1 for ASCII-only strings). Every string crossing requires a transcoding step — TextEncoder (JS→WASM) or TextDecoder (WASM→JS) — which is O(n) in the string's length.

There are four strategies on a spectrum:

a) Standard TextEncoder/TextDecoder — The usual approach. It works. Costs O(n) per crossing. Acceptable for occasional string passing.

b) Deferred decoding — Don't convert to a JS String unless it's absolutely necessary. Keep strings as raw UTF-8 byte arrays (Uint8Array views over WASM memory) and only decode when you need to render to the DOM or pass to a JS API that requires a String. Many intermediate operations (comparison, hashing, searching) can work directly on UTF-8 bytes.

c) stringref proposal (Future) — A proposed WASM type that would let WASM hold direct references to engine-managed strings, avoiding UTF-8↔UTF-16 conversion entirely. WASM could call operations on the string (length, substring, compare) through imported functions without ever copying the string data. Still in proposal stage.

d) JS String Builtins — A more pragmatic near-term alternative. Safari 26.2 shipped JS String Builtins, which reduce the need for JavaScript glue code when passing strings, eliminating some of the overhead without requiring a new type system.

When to use it: Any application that passes many strings or large strings across the boundary — text editors, parsers, search engines, internationalization systems.

The tax: UTF-8↔UTF-16 transcoding is unavoidable today for any string that must become a JS String. The deferred decoding pattern changes when you pay the tax, not whether you pay it.

Quick decision guide:

Small data (<10 KB)? → Just copy it
Large and read-only? → Typed Array view (Pattern 6)
Streamed continuously? → Ring buffer (Pattern 12)

Block 3 — Flow Architectures

How do you orchestrate the communication?

Blocks 1 and 2 answer "what can cross" and "how to move data." This block answers the harder question: "how do you design the conversation?" The raw cost of a single boundary crossing is small (~100ns). The problem is frequency and coordination. A naively written render loop can cross the boundary 50,000 times per frame. An async web API call can stall your entire WASM stack. The six patterns here aren't about moving bytes faster — they're about restructuring the interaction so you cross the boundary fewer times, in smarter ways, and without blocking when you shouldn't.

Pattern 10: Batch / Coalesce

The simplest flow optimization: instead of making N boundary crossings, make 1.

Rust side:

#[repr(C)]
pub struct Point { x: f64, y: f64 }

#[no_mangle]
pub unsafe extern "C" fn process_point_batch(ptr: *const Point, len: usize) {
    let points = std::slice::from_raw_parts(ptr, len);
    for p in points {
        // Do the heavy lifting here, without crossing the boundary
        let _calc = p.x * p.y;
    }
}

JavaScript side:

// 1 boundary crossing instead of 10,000
const buffer = new Float64Array(wasm.instance.exports.memory.buffer, ptr, points.length * 2);
points.forEach((p, i) => {
    buffer[i * 2] = p.x;
    buffer[i * 2 + 1] = p.y;
});
wasm.instance.exports.process_point_batch(ptr, points.length);

This is the equivalent of batch inserts in a database vs. individual inserts. The per-call overhead of crossing the JS↔WASM boundary is small (~100ns), but multiplied by 10,000 calls per frame, it dominates the total cost.

The Yew framework (Rust UI framework in WASM) uses this for DOM updates: instead of calling JS for each individual DOM mutation, it queues all mutations during virtual DOM reconciliation and flushes them in a single call.

When to use it: Any loop that calls WASM functions. Any scenario where you can accumulate work and send it in bulk.

The tax: You need to design your API for batch operations. Single-element functions are simpler to implement but more expensive to call repeatedly.

Pattern 11: Command Buffer / Opcode Stream

An evolution of batching. Instead of passing data, you pass an encoded instruction stream across the boundary:

Rust side:

const CMD_CREATE_ELEMENT: u8 = 1;
const CMD_SET_TEXT: u8 = 2;

#[no_mangle]
pub unsafe extern "C" fn generate_ui_commands(ptr: *mut u8) -> usize {
    let mut offset = 0;

    // Command 1: Create a DIV
    *ptr.add(offset) = CMD_CREATE_ELEMENT;
    offset += 1;
    // (Here you'd encode "div" with null termination)

    // Command 2: Set text
    *ptr.add(offset) = CMD_SET_TEXT;
    offset += 1;
    // (Encode the text to insert)

    offset // Return how many bytes the command buffer is
}

JavaScript side:

const len = wasm.instance.exports.generate_ui_commands(ptr);
const commands = new Uint8Array(wasm.instance.exports.memory.buffer, ptr, len);

let i = 0;
while (i < len) {
    const opcode = commands[i++];
    if (opcode === 1) { // CMD_CREATE_ELEMENT
        // Read string from memory and call document.createElement()
    } else if (opcode === 2) { // CMD_SET_TEXT
        // Read string and call node.textContent = ...
    }
}

WASM writes this command buffer into linear memory. JS reads it in a single pass and executes each command against the DOM. One boundary crossing for an entire tree of DOM mutations.

This is conceptually identical to how GPUs work: Vulkan and Metal use command buffers because the CPU↔GPU boundary has overhead similar to the JS↔WASM boundary. You record commands, then submit the buffer.

When to use it: UI frameworks written in WASM that need to manipulate the DOM. Any scenario where WASM needs to fire complex sequences of JS operations.

The tax: You're designing a mini VM / bytecode interpreter on the JS side. Debugging is harder — you're staring at opcode streams instead of function calls. The command buffer format becomes an API contract that's painful to change.

Pattern 12: Ring Buffer (Circular Buffer)

A fixed-size buffer in WASM's linear memory with two pointers:

Rust side:

const BUFFER_SIZE: usize = 1024;
static mut RING_BUFFER: [u8; BUFFER_SIZE] = [0; BUFFER_SIZE];
static mut HEAD: std::sync::atomic::AtomicUsize = std::sync::atomic::AtomicUsize::new(0);

#[no_mangle]
pub extern "C" fn produce_data(data: u8) {
    unsafe {
        let current_head = HEAD.load(std::sync::atomic::Ordering::Relaxed);
        RING_BUFFER[current_head % BUFFER_SIZE] = data;
        HEAD.store(current_head + 1, std::sync::atomic::Ordering::Release);
    }
}

JavaScript side:

// JS acts as the consumer (e.g., in an AudioWorklet)
let tail = 0;
const headPtr = wasm.instance.exports.get_head_pointer();
const ringBuffer = new Uint8Array(wasm.instance.exports.memory.buffer, bufferPtr, 1024);

function consume() {
    const currentHead = readIntFromMemory(headPtr);
    while (tail < currentHead) {
        const data = ringBuffer[tail % 1024];
        processAudio(data);
        tail++;
    }
}

The producer (WASM) advances the head pointer after writing data. The consumer (JS, or a Web Worker) advances the tail pointer after reading. When head reaches the end of the buffer, it wraps around to the beginning.

For a single-producer, single-consumer scenario, this is lock-free by design: the producer only writes head, the consumer only writes tail. No mutexes, no atomic CAS, no contention.

A notable variant is the BipBuffer (bipartite buffer): it guarantees that written data is always in a contiguous block, even when wrapping around the buffer boundary. This matters for WASM because you can pass a single pointer+length to describe the readable region, without the consumer needing to handle two disjoint segments.

When to use it: Audio processing (AudioWorklet + WASM), real-time telemetry, video frame pipelines — any producer-consumer streaming scenario.

The tax: The fixed buffer size means you must handle the "buffer full" case (discard data, block, or grow). Not suitable for bursty workloads where data volume is unpredictable.

Pattern 13: Double Buffering

Two identical buffers. WASM writes to Buffer A while JS reads from Buffer B. When WASM finishes writing, the buffers swap roles.

Rust side:

static mut BUFFER_A: [u8; 800 * 600 * 4] = [0; 800 * 600 * 4];
static mut BUFFER_B: [u8; 800 * 600 * 4] = [0; 800 * 600 * 4];

#[no_mangle]
pub unsafe extern "C" fn render_frame(use_buffer_a: bool) -> *const u8 {
    let buffer = if use_buffer_a { &mut BUFFER_A } else { &mut BUFFER_B };
    // Compute physics and write pixels to the active buffer...
    buffer.as_ptr()
}

JavaScript side:

let usingBufferA = true;

function loop() {
    // Rust writes to Buffer A while JS reads and paints the previous frame (Buffer B)
    const ptr = wasm.instance.exports.render_frame(usingBufferA);
    const view = new Uint8ClampedArray(wasm.instance.exports.memory.buffer, ptr, 800 * 600 * 4);

    ctx.putImageData(new ImageData(view, 800, 600), 0, 0);

    // Swap buffers for the next frame
    usingBufferA = !usingBufferA;
    requestAnimationFrame(loop);
}

Zero contention, zero locks. The consumer always reads a complete, consistent snapshot. The producer never stalls waiting for the consumer.

This is the standard technique in game rendering (front buffer / back buffer) applied to the WASM boundary. Combined with requestAnimationFrame, you get a smooth pipeline:

WASM computes frame N+1 in Buffer A
JS renders frame N from Buffer B using putImageData or texImage2D
Swap
Repeat

When to use it: Rendering pipelines, any scenario where production and consumption need to be decoupled and never block each other.

The tax: Double memory usage. You need a coordination mechanism to signal when swapping is safe (can be as simple as a flag in shared memory or a postMessage to a Worker).

Pattern 14: SharedArrayBuffer + Atomics

The only way to achieve true shared-memory concurrency in the browser with WASM.

A SharedArrayBuffer is a block of memory that multiple Web Workers (and WASM instances) can read and write simultaneously. Combined with Atomics (wait, notify, compareExchange), you can build any concurrent data structure — lock-free queues, mutexes, semaphores.

// Main thread
const shared = new WebAssembly.Memory({
    initial: 256, maximum: 256, shared: true
});

// Worker can access the same memory
worker.postMessage({ memory: shared });

// WASM in the worker writes data
// Main thread reads it via Atomics.load / Atomics.wait

This enables patterns that are impossible otherwise: a WASM physics engine running in a Worker, updating shared state that the main thread's renderer reads every frame. No postMessage serialization, no copies.

The critical restriction: Requires Cross-Origin Isolation (Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp headers). This is a post-Spectre/Meltdown security requirement that breaks many third-party embeds (ads, analytics, iframes).

When to use it: Multi-threaded WASM applications, parallel processing, any scenario where data is too large or updates too frequently for postMessage.

The tax: Cross-Origin Isolation requirements can be a deployment blocker. Concurrency bugs (races, deadlocks) are just as real here as in any shared-memory system. The Atomics API is low-level and easy to misuse.

Pattern 15: The Async Boundary — JSPI (JavaScript Promise Integration)

Every pattern so far assumes synchronous execution: WASM calls JS, JS returns immediately. But the web is asynchronous. fetch(), IndexedDB, setTimeout, Web Crypto — they all return Promises.

Before JSPI, you had two terrible options:

a) Asyncify — A compile-time transformation that instruments your WASM binary to capture and restore the entire call stack, simulating suspension. It works, but bloats binary size by up to 50% and adds overhead to every function call (even synchronous ones).

b) Restructure your code — Rewrite your synchronous C/Rust code to be callback-oriented, with explicit state machines. Possible, but it destroys code structure and developer experience.

JSPI is the real solution. It's a proposal (available in Chrome behind a flag, actively being standardized) that lets a WASM function:

Call a JS imported function that returns a Promise
Suspend WASM execution
Return a Promise to JS
Resume WASM execution when the Promise resolves

From WASM's perspective, the call looks synchronous. From JS's perspective, it's just a Promise. The engine handles stack suspension and resumption with zero instrumentation overhead.

// JS side: wrap an async import with JSPI
const importObj = {
    env: {
        fetch_data: WebAssembly.promising(async (url_ptr, url_len) => {
            const url = decodeString(url_ptr, url_len);
            const response = await fetch(url);
            return writeToMemory(await response.arrayBuffer());
        })
    }
};

When to use it: Any WASM application that needs to call asynchronous Web APIs — network requests, file access, crypto operations, timers.

The tax: Browser support is currently limited (Chrome with flag, Firefox in progress). Requires understanding the suspension model. Not all code patterns are compatible (you can't suspend across certain boundaries like WASM→JS→WASM re-entry).

Epilogue — The Component Model: The Pattern That Wants to Rule Them All

Every pattern in this article exists because core WebAssembly's type system only speaks numbers. Strings? Pointer+length hack. Structs? Manually encoded in linear memory. Objects? Opaque handle workaround. Async? Stack manipulation hack.

The WebAssembly Component Model asks a radical question: what if the runtime handled all of this?

The core idea

The Component Model introduces three things on top of core WASM:

1. WIT (WebAssembly Interface Types) — An IDL (like Protobuf or OpenAPI) that describes component interfaces in terms of high-level types:

package myapp:image-processor;

interface processor {
    record image {
        width: u32,
        height: u32,
        pixels: list<u8>,
        format: pixel-format,
    }

    enum pixel-format { rgba, rgb, grayscale }

    apply-filter: func(img: image, filter: string) -> result<image, string>;
}

world image-app {
    export processor;
}

Strings, lists, records, enums, results, options — all first-class types, defined once, understood by every language.

2. Canonical ABI — A precise specification of how each WIT type maps to bytes in linear memory. A string is always UTF-8 with a specific pointer+length layout. A record has deterministic field ordering and alignment. A list<u8> has a concrete binary representation.

This is essentially Pattern 2 (pointer+length) and Pattern 8 (format-aligned layout) elevated to a universal standard. The toolchain generates the serialization code — you never see it.

3. Components — WASM modules wrapped with metadata that declare their imports and exports in WIT terms. They're self-describing: you can inspect a .wasm component and know its complete interface without any external documentation.

What it makes unnecessary

The Component Model subsumes nearly every pattern in this article:

Pattern	How the Component Model absorbs it
Pointer + Length	The Canonical ABI handles it automatically
wasm-bindgen glue	`wit-bindgen` generates equivalent code from WIT
Typed Array Views	The runtime can optimize data transfer internally
String passing	The Canonical ABI defines UTF-8 encoding; the runtime can optimize transcoding
Format-aligned zero-copy	The Canonical ABI is the aligned format
externref	The Component Model has its own resource handles
Function tables	Exports and imports are rich types

Composition: the real superpower

Beyond type marshaling, the Component Model enables composition — linking components written in different languages into a single application:

# A Rust parser + a Python data processor + a Go HTTP server
# composed into a single .wasm with no network boundaries
wasm-tools compose parser.wasm processor.wasm server.wasm -o app.wasm

No serialization between components. No shared memory management. No IPC. The runtime links them through the Canonical ABI at instantiation time. A function call between components looks and costs like a normal call.

Worlds: capability-based security

A World defines what a component can see — which interfaces it can import and export. A component built for the wasi:http/proxy world can handle HTTP requests but cannot access the filesystem. A component in the wasi:cli/command world can read files but cannot listen on sockets.

This is the security model that containers wish they had. Instead of giving a process access to everything and hoping seccomp catches the bad calls, you define capabilities at the interface level. A component literally cannot call functions it hasn't declared in its world.

Where it stands today (February 2026)

Production-ready server-side: Wasmtime has full Component Model support. Frameworks like Spin (Fermyon) and wasmCloud run production workloads on it. American Express built an internal FaaS platform entirely on WebAssembly components.

Not ready for browsers: The Component Model is a W3C proposal but isn't implemented in any browser engine yet. Browser-side WASM still uses core modules with all the manual patterns described above.

WASI 0.3 is coming: It adds native async support to the Component Model, eliminating the need for JSPI/Asyncify in server-side contexts. The async model avoids the "function coloring" problem — async imports plug seamlessly into synchronous exports without requiring downstream rewrites.

Threading is the gap: Shared-memory concurrency between components isn't supported yet. For compute-intensive parallel workloads, you still need SharedArrayBuffer and manual coordination.

The bottom line

The Component Model is to our 16 patterns what a managed runtime is to manual memory management. It aspires to absorb the complexity, standardize the solutions, and let the toolchain and runtime do the dirty work.

But — and this is important — understanding the patterns remains essential:

In the browser, they're all you've got. The Component Model isn't coming to browsers anytime soon.
For hot paths, manual control wins. Just as you sometimes skip the ORM and write raw SQL, you'll sometimes skip wit-bindgen and reach for a ring buffer or command buffer for performance-critical code.
The Component Model uses these patterns internally. The Canonical ABI is pointer+length with format-aligned layout. Understanding the foundations makes you a better systems developer, even when the abstraction handles it for you.

That's the abstraction tax in a nutshell: you can pay it automatically and accept the default cost, or you can understand the underlying patterns and choose exactly how much to pay.

WASM Microservices: From Single Binaries to Composable Components

Rafa Calderon — Tue, 24 Feb 2026 07:02:21 +0000

Traditional microservices pay a massive tax on serialization and network overhead. WASM microservices eliminate this toll completely — inter-service calls in nanoseconds instead of milliseconds. But to understand how we got here, let's start at the beginning: your deployment pipeline has layers. Too many of them

Your code lives inside a runtime (JVM, Node, Python), which runs inside a container (Docker), managed by an orchestrator (Kubernetes), hosted on a VM, which finally runs on actual hardware somewhere in Virginia. Each layer was added to solve a real problem. But together, they add weight, cold start times, and more moving parts that can break.

However, there is a trend quietly dismantling this complexity. It starts with something surprisingly simple: a single file. And it ends with something that could change how we think about microservices forever: WASM microservices.

Part 1 — The Single Binary

Rust compiles your code into a single executable. If you compile against musl instead of glibc, the resulting binary has exactly zero system dependencies. Everything your application needs is packed into that file. There is no JVM. No node_modules. It doesn't even need the C standard library installed on the target machine. You can drop it into a FROM scratch Docker image — literally an empty filesystem with nothing but your executable. You copy it to a server and run it. That's the deployment.

Languages like Go do something very similar (packing even their own garbage collector into a static file), but the premise is the same: no heavy base image.

The size difference is hard to ignore:

Stack	Artifact Size	Runtime Dependencies	Cold Start
Rust (musl static)	~5–10 MB	None	< 10 ms
Go (static)	~10–20 MB	None	< 10 ms
Java (Spring Boot)	~50–200 MB	JVM (~200 MB)	Seconds
Node.js (Next.js)	~200–500 MB	Node Runtime (~100 MB)	Seconds
Python (Django)	~100–300 MB	Python + C libs	Seconds

This isn't an artificial benchmark. It's what happens when you remove layers between your code and the machine. No interpreter, no classloading, no dependency resolution at startup.

It's worth noting that the tools infrastructure engineers build for themselves are almost always single binaries: Kubernetes, Docker, Terraform, Prometheus, CockroachDB, Caddy, Hugo, ripgrep. These are people who deal with deployment complexity every single day. They chose not to inflict it upon themselves.

The single binary is a real, proven win. Less to deploy, less to break, faster to start, cheaper to run.

But it's not the end of the story. It's the beginning.

Because no matter how much you optimize each service individually, in a microservices architecture, the bulk of the overhead isn't inside the services. It's between them: serialization, HTTP over TLS, deserialization, and starting all over again at the next hop. And yes, this still applies if you use gRPC with Protobuf instead of JSON — binary serialization is faster, but you still pay the physical toll of the network: TCP, TLS, latency, service mesh sidecars if you have them. The network is still the bottleneck. Multiply that by every jump in the chain, and you have a system where communication can cost more than computation.

Microservices exist for good reasons — independent deployment, team autonomy, fault isolation. You can't just merge everything back into a monolith. What you need is the isolation of separate services with the speed of a function call.

That is exactly what WASM microservices are. And to understand them, we first need to talk about WebAssembly.

Part 2 — WebAssembly, Fast

Before we get to the interesting part, let's make it clear what WebAssembly (WASM) is, without the hype.

WebAssembly is a bytecode format. You write code in Rust, Go, C, Python, or other languages, compile it into a .wasm file, and a WebAssembly runtime executes it. Think of it like Java's .class files or .NET's IL, but designed to be universal rather than tied to a single-language ecosystem.

Three properties matter for our story:

It's portable: The exact same .wasm file runs on Linux, macOS, Windows, in a browser, on a server, or on a Raspberry Pi. Compile once, run anywhere.
It's sandboxed: A WASM module cannot do anything by default. It cannot read files, it cannot open network connections, it cannot access memory outside its own sandbox. You have to explicitly grant it permissions. It is the exact opposite of a normal process, which can do everything unless you restrict it.
It's fast: WASM runs at near-native speed. It's not "fast after warming up the JIT for 1000 calls." It is consistently close to the performance of native C/Rust code.

Now, a WASM module that only knows how to do math operations in its own isolated memory isn't very useful. That's why WASI exists — WebAssembly System Interface. WASI gives WASM modules controlled access to system capabilities: reading files, opening sockets, getting the current time. It is the standard library that WASM lacks on its own.

With WASI, you can compile a real application — an HTTP server, a CLI tool, a data pipeline — to WASM and run it on any platform that has a WASM runtime. That is already incredibly useful. But it's not the reason we are here.

We are here for what WASI 0.2 introduced in 2024: the Component Model.

Part 3 — WASM Microservices: Services Calling Each Other Like Functions

Here we reach the core concept. A WASM microservice is a WebAssembly component that acts like a classic microservice — it has its own responsibility, its own isolation, it deploys independently — but it communicates with other WASM microservices via typed function calls in shared memory. Not via HTTP. Not over the network. Functions.

The piece that makes this possible is the Component Model, introduced with WASI 0.2. It defines a standardized way for WASM modules to declare what they offer and what they need, using a small interface description language called WIT (WebAssembly Interface Types):

// A CSV parser component declares what it exports
package myapp:parser;

interface csv-parser {
    record row {
        fields: list<string>,
    }

    parse: func(raw-data: list<u8>) -> list<row>;
}

world parser-component {
    export csv-parser;
}

// A prediction component declares what it needs and what it offers
package myapp:ml;

interface predictor {
    record prediction {
        label: string,
        confidence: f64,
    }

    predict: func(rows: list<myapp:parser/csv-parser.row>) -> list<prediction>;
}

world ml-component {
    import myapp:parser/csv-parser;   // "I need a parser"
    export predictor;                  // "I offer predictions"
}

This looks like an interface definition (like OpenAPI or Protobuf), and it is. But there's a fundamental difference: these components don't communicate over the network. They are linked at runtime.

When Component A calls a function in Component B, what actually happens is:

Component A places data into a shared memory region.
The runtime invokes Component B's function.
Component B reads the data, processes it, and places the result back.
Component A reads the result.

No JSON serialization. No building an HTTP request. No TLS handshake. No TCP socket. No network. Just a function call with data that is already in memory.

The cost of that call is measured in nanoseconds. The cost of an HTTP call between microservices is measured in milliseconds. That's orders of magnitude in difference.

But doesn't this break isolation?

No. And this is the part that makes it all work.

Each WASM Component has its own isolated linear memory space. Component A cannot read or write to Component B's internal memory under any circumstances. The only way to interact is through the explicit interfaces defined in WIT. The runtime mediates and secures every single call between components.

You get the security boundary of traditional microservices — no component can corrupt another's state — with the performance of an in-process function call. That is a WASM microservice: the isolation of a microservice, the cost of a function call. It's like two people passing documents through a secure teller window: they can exchange data through a well-defined opening, but neither can enter the other's office.

Composition: Multiple WASM microservices in a single file

This is where the technology unleashes its full potential. You can take WASM microservices written in different languages and compose them into a single binary at compile time:

# parser.wasm (compiled from Rust)
# ml-model.wasm (compiled from Python)
# reporter.wasm (compiled from Go)

$ wasm-tools compose parser.wasm ml-model.wasm reporter.wasm -o pipeline.wasm

$ ls -lh pipeline.wasm
-rw-r--r--  1 rafa  staff  2.1M  pipeline.wasm

Three "services", written in three different languages, with strict typed contracts between them, composed into a single 2 MB file. You can deploy that file on a server, on an edge node, or in a browser. No orchestrator, no service mesh, no network between them.

What if you need to replace the ML model? You recompile just that component, recompose the binary, and redeploy. The parser and reporter remain unchanged. You have independent deployability at the component level, not at the network service level.

WASM microservices in practice: Fermyon Spin

Spin is probably the most mature framework for building WASM microservices today. It defines itself as a "framework for building and running event-driven microservice applications with WebAssembly components." Spin was accepted into the CNCF Sandbox (the same foundation that hosts Kubernetes), and following Fermyon's acquisition by Akamai in December 2025, it is backed by one of the largest edge networks in the world.

Here is a Spin application in Rust:

use spin_sdk::http::{IntoResponse, Request, Response};
use spin_sdk::http_component;

#[http_component]
fn handle_request(req: Request) -> anyhow::Result<impl IntoResponse> {
    let body = format!("Hello from a WASM component! Path: {}", req.path());
    Ok(Response::builder()
        .status(200)
        .header("content-type", "text/plain")
        .body(body)
        .build())
}

$ spin build
$ spin up
Serving http://127.0.0.1:3000

That component weighs kilobytes and boots in under a millisecond. A Spin application can have dozens of WASM microservices, each handling different routes, written in different languages, and composed into a single deployment unit.

Spin 3.0 also added selective deployments: platform engineers can repackage the exact same WASM microservices into different deployment topologies without touching a single line of component code. Need the parser and the ML model bundled together on one node, but the reporter separated on another? Reconfigure, recompose, done. This is structurally impossible with traditional containers without rewriting your code.

Part 4 — Where Are WASM Microservices Today?

This is not just a W3C specification gathering dust. There are several patterns where WASM is already in massive production, and each exploits a different property of the technology.

Plugin systems: Running third-party code without risk

This is probably the most mature use case. Shopify Functions allows developers in its ecosystem to inject custom logic into Shopify's backend (discounts, shipping rules, checkout validations). Each function is a WASM module running in a strict sandbox within Shopify's infrastructure. The partner has no access to the OS, the network, or the memory of other functions. They only receive input data, process it, and return a result.

Why WASM and not containers? Because Shopify needs to execute code from thousands of third parties on critical paths like checkout, where every millisecond of latency means lost revenue. One container per function doesn't scale. A WASM module that boots in microseconds and runs at near-native speed does. (Shopify is also a Bytecode Alliance member and created Javy, the toolchain that compiles JavaScript to WASM, now widely used across the industry).

Edge computing and serverless: Zero cold starts

Fastly Compute runs WASM in over 79 global datacenters with instantiation times measured in microseconds — not milliseconds, microseconds. Every request creates an isolated WASM instance, executes the logic, and destroys it. No connection pools to maintain, no warm containers eating up idle memory.

Akamai acquired Fermyon (the creators of Spin) in December 2025 to integrate WASM microservices into its network of over 4,000 global edge locations. Before the acquisition, they were already handling 75 million requests per second in production with fractional-millisecond cold starts. When a CDN of that scale buys a WASM company, the technology is no longer experimental.

Cloudflare Workers has been running logic at the edge for years using V8 Isolates (not pure WASM), but the ecosystem's trajectory is the same: push compute as close to the user as possible, using instances that start instantly and weigh almost nothing. The "one container per function" model has unacceptable overhead here. WASM eliminates it.

Heavy embedded compute: Google Sheets

A non-microservice case that perfectly illustrates WASM's potential to redesign heavy systems: Google migrated the Google Sheets calculation engine from JavaScript to Java compiled to WasmGC, achieving a 2x performance improvement. WasmGC allows garbage-collected languages to compile to WASM without shipping their own GC, drastically reducing binary size. When you move a calculation engine used by billions to WASM, the numbers justify it.

IoT and industrial edge

MachineMetrics, an industrial IoT company, uses wasmCloud to move WASM microservices between edge devices and cloud environments with dynamic fault tolerance. If a factory node goes down, components migrate to another node or to the AWS cloud automatically. Try doing that seamlessly with Docker containers in milliseconds. WASM's true portability (the exact same binary runs on an industrial ARM and an x86 cloud server) makes this possible.

Enterprise FaaS

American Express built an internal FaaS platform using wasmCloud. Their primary motivation: pack more functions into the same physical infrastructure while maintaining strict security boundaries, support multiple languages without maintaining dozens of Docker base images, and slash cold starts.

The common thread

None of these companies adopted WASM because of the hype. The common denominator is that they all hit a bottleneck where the traditional model (containers, heavy runtimes, network hops) simply couldn't scale:

Shopify needed to run third-party code securely and blazingly fast.
Fastly and Akamai needed distributed compute without container bloat.
MachineMetrics needed true binary portability across CPU architectures.
American Express needed extreme density and isolation.
Google needed pure performance in the browser.

WASM gave them the way out.

Part 5 — The Honest State of WASM Microservices

I've been painting an optimistic picture, so let me be blunt about what is not mature just yet:

The transition to Async I/O (WASI 0.3): The previous version (WASI 0.2) only supported synchronous I/O, meaning that reading from a socket blocked the entire instance. The arrival of WASI 0.3 in early 2026 has finally brought native asynchronous I/O to the Component Model, which is absolutely critical for high-performance network services. The standard is here, but libraries and languages are still in the process of digesting and adopting this new paradigm.
The library ecosystem is young: If you need an OAuth2 client, a native PostgreSQL driver, or an image processing library compiled as a WASM Component, you might find it, or you might not. The situation is improving fast (especially in Rust and Go), but it's nowhere near the vastness of npm, crates.io, or Maven Central.
Tooling is maturing: Debugging a WASM Component isn't as seamless as debugging a regular application. Profiling tools are limited, and IDE support exists but isn't first-class everywhere just yet.
Not all languages are created equal: In theory, you can mix Rust, Go, Python, and TypeScript. In practice, Rust and Go are first-class citizens. Python and TypeScript work, but they often do so by bundling an entire interpreter inside the WASM module (using tools like Javy or ComponentizeJS), which bloats the binary size and hurts performance. Today, the sweet spot for true high performance is Rust or Go.
Not everything should be a WASM microservice: If your service is I/O bound (waiting on slow database queries or calling third-party APIs), the inter-service network overhead is not your real bottleneck. WASM microservices shine when services do actual processing and call each other at very high frequencies. For a simple CRUD app that just talks to PostgreSQL, a normal Go single binary is still a fantastic choice.
You don't have to choose between worlds: wasmCloud can run standalone or on top of Kubernetes clusters. Spin can be deployed alongside your existing container infrastructure thanks to tools like SpinKube, which allows you to run WASM microservices directly on your K8s nodes exactly like normal pods. This isn't a "rip and replace" technology. You just get a new, ultra-lightweight workload type running right next to your legacy containers.

Where Is This Going?

The trajectory is clear.

Containers solved the "it works on my machine" problem in 2013 and became the default deployment unit of the cloud. But they carry the burden of virtualizing an entire operating system for every single service — an abstraction tax we've accepted as normal simply because there was no better alternative.

WASM microservices offer that alternative. They point to a world where the unit of deployment is a sandboxed, portable, composable module measured in kilobytes instead of gigabytes. Where services communicate via typed function calls instead of heavy network protocols. Where you can compose business logic written in multiple languages into a single file that runs instantly anywhere.

We aren't there yet for all workloads. But the path from single binaries (which are already the standard in infra tooling) to WASM microservices (which are production-ready for key use cases) is a straight line. Every step removes a layer of abstraction between your code and the bare metal.

And as we've been exploring throughout this series: every layer you remove is pure performance you get back.

Your Code Is Slow Because You Think in Objects, Not Data

Rafa Calderon — Thu, 29 Jan 2026 17:28:47 +0000

Your CPU is phenomenally fast. Your RAM is phenomenally slow. The gap between them is the reason your code runs at 5% of the speed it theoretically could.

Object-oriented design teaches us to model concepts: Users, Orders, Products. We build elegant hierarchies, encapsulate state, chain abstractions. The code is readable, maintainable, testable. It's also scattering your data across memory like shrapnel, forcing the CPU to wait hundreds of cycles for each piece.

This article is a map of the Data-Oriented Design (DOD) territory: what it is, why it matters, and which concepts you need to know. I'm not going to dive deep into every topic—that would take a whole book—but you will walk away knowing exactly what to look for when you need to squeeze out real performance.

1. Why Your Code Is Slower Than It Should Be

The CPU Waits for Memory

Here is the fact that changes everything: your processor is between 100 and 1000 times faster than your RAM.

When the CPU needs a piece of data that isn't in the cache, it waits. Literally. Hundreds of clock cycles doing absolutely nothing while the data travels from the RAM.

Register access: ~1 cycle
L1 Cache access: ~4 cycles
L2 Cache access: ~12 cycles
L3 Cache access: ~40 cycles
RAM access: ~200-300 cycles

Read those numbers again. The difference between having data in L1 vs. RAM is two orders of magnitude. Your code can be algorithmically perfect and still be slow because it spends 90% of its time waiting for data.

This is the "Von Neumann Bottleneck": the processor and memory are separated, and that channel between them is the bottleneck of modern computing.

Abstractions Have a Cost

Object-Oriented Programming (OOP) taught us to model the world: a User has Orders, each Order has Products. Beautiful, readable, maintainable.

The problem is, hardware doesn't understand objects. It understands contiguous bytes in memory.

When you create scattered objects, use inheritance with virtual methods, or chain pointers, you are generating:

Scattered memory: Each allocation can place the object anywhere in the heap.
Indirections: Pointers pointing to pointers pointing to data.
Vtables: Tables of virtual functions that the CPU must look up at runtime.

All of this scatters your data across the heap. That destroys spatial locality, which is what makes cache lines efficient.

Spatial Locality: The Cache Line Is the Unit of Transfer

When the CPU requests a byte at address 1000, it doesn't fetch just that byte. It brings in the entire cache line (typically 64 bytes) starting from an aligned address. If the next thing you need is at byte 1001, you already have it for free. If it's at byte 50000, you pay for another trip to RAM.

Spatial locality is everything. Contiguous data in memory = fast access. Scattered data = constant cache misses.

That is why a linked list is orders of magnitude slower to iterate than a vector, even if both are O(n). The vector is contiguous; the list has every node in some random location on the heap.

The CPU Tries to Predict What You'll Do

The modern processor doesn't execute instructions one by one. It has a pipeline that processes multiple instructions in parallel, in different stages.

To keep the pipeline full, the CPU does two things:

Prefetching: It tries to guess what memory you are going to ask for next. If you are iterating through an array sequentially, it detects the pattern and pulls in the next cache lines before you even ask for them. Random pointers break this mechanism because there is no pattern to detect.
Branch prediction: When it reaches an if, it can't wait to evaluate it—the pipeline would empty out. So it guesses which branch it will take based on history. If it guesses right, perfect. If it guesses wrong (branch misprediction), it has to throw away all the speculative work and start over. Typical penalty: 15-20 cycles.

// The CPU can predict this easily
for i in 0..1_000_000 {
    // always enters the loop
}

// This is unpredictable if data is random
for &value in &data {
    if value > threshold {  // 50% true, 50% false
        sum += value;
    }
}

3. Organize Your Data For the Machine

AoS vs SoA: The Fundamental Shift

This is the heart of Data-Oriented Design.

Array of Structs (AoS) — How you naturally think:

struct Entity {
    x: f32,
    y: f32,
    z: f32,
    vx: f32,
    vy: f32,
    vz: f32,
    health: i32,
    team: i32,
    name: String,
}

let entities: Vec<Entity> = Vec::with_capacity(10_000);

In memory it looks like this:

[x,y,z,vx,vy,vz,health,team,name][x,y,z,vx,vy,vz,health,team,name]...

Struct of Arrays (SoA) — How hardware thinks:

struct Entities {
    x: Vec<f32>,
    y: Vec<f32>,
    z: Vec<f32>,
    vx: Vec<f32>,
    vy: Vec<f32>,
    vz: Vec<f32>,
    health: Vec<i32>,
    team: Vec<i32>,
    names: Vec<String>,
}

impl Entities {
    fn with_capacity(cap: usize) -> Self {
        Self {
            x: Vec::with_capacity(cap),
            y: Vec::with_capacity(cap),
            z: Vec::with_capacity(cap),
            vx: Vec::with_capacity(cap),
            vy: Vec::with_capacity(cap),
            vz: Vec::with_capacity(cap),
            health: Vec::with_capacity(cap),
            team: Vec::with_capacity(cap),
            names: Vec::with_capacity(cap),
        }
    }
}

In memory:

[x,x,x,x,x,...][y,y,y,y,y,...][z,z,z,z,z,...]...

Why does this matter? Imagine you want to update only the positions:

// With AoS: for each entity you fetch ~60 bytes, but use 24
for entity in &mut entities {
    entity.x += entity.vx * dt;
    entity.y += entity.vy * dt;
    entity.z += entity.vz * dt;
}

// With SoA: you fetch exactly what you need
for i in 0..entities.x.len() {
    entities.x[i] += entities.vx[i] * dt;
    entities.y[i] += entities.vy[i] * dt;
    entities.z[i] += entities.vz[i] * dt;
}

With AoS, each 64-byte cache line brings in ~1 entity. With SoA, it brings in ~16 values of x. You are utilizing 100% of every cache line instead of ~20%.

Hot/Cold Splitting

Not all fields are used equally. In the previous example, name is probably only read when rendering UI. Why pull it into the cache every time you update physics?

// "Hot" data - accessed constantly
struct EntitiesHot {
    x: Vec<f32>,
    y: Vec<f32>,
    z: Vec<f32>,
    vx: Vec<f32>,
    vy: Vec<f32>,
    vz: Vec<f32>,
}

// "Cold" data - accessed rarely
struct EntitiesCold {
    health: Vec<i32>,
    team: Vec<i32>,
    names: Vec<String>,
}

Separating hot and cold avoids "polluting" the cache with data you don't need in the hot path.

Alignment and Padding

Compilers align the fields of your structs to specific addresses (usually multiples of their size). This can leave gaps:

use std::mem::size_of;

struct Bad {
    a: u8,       // 1 byte + 7 padding
    b: f64,      // 8 bytes
    c: u8,       // 1 byte + 7 padding
}

struct Good {
    b: f64,      // 8 bytes
    a: u8,       // 1 byte
    c: u8,       // 1 byte + 6 padding
}

fn main() {
    println!("Bad: {} bytes", size_of::<Bad>());   // 24 bytes
    println!("Good: {} bytes", size_of::<Good>()); // 16 bytes
}

Same content, 8 bytes less. In a vector of millions of elements, that's megabytes of memory and cache lines wasted.

Rule of thumb: order fields from largest size to smallest.

In Rust, you can use #[repr(C)] to control the exact layout, or tools like cargo-bloat to analyze the size of your structures.

4. "Free" Gains By Ordering Well

SIMD: Processing Multiple Data at Once

Your CPU has special registers that can operate on multiple values simultaneously:

SSE: 128 bits → 4 floats at once
AVX: 256 bits → 8 floats at once
AVX-512: 512 bits → 16 floats at once

Scalar Operation: 4 instructions

a[0] += b[0]
a[1] += b[1]
a[2] += b[2]
a[3] += b[3]

SIMD Operation: 1 instruction

[a0,a1,a2,a3] += [b0,b1,b2,b3]

But there is a requirement: data must be contiguous and aligned in memory. Remember SoA? Exactly. With SoA, the compiler can auto-vectorize your loops. With AoS, it can't.

// SoA: the compiler can vectorize this automatically
for i in 0..n {
    x[i] = x[i] + vx[i] * dt;
}

// AoS: impossible to vectorize efficiently
for entity in &mut entities {
    entity.x += entity.vx * dt;
}

You don't need to write intrinsics by hand. Compile with --release (which activates -O3) and let LLVM do the work. But your data layout must allow it.

Branchless: Stopping the CPU from Guessing Wrong

When you have conditionals in hot loops with unpredictable data, the branch predictor will fail ~50% of the time.

// Branch version - constant mispredictions
let mut sum = 0i64;
for &value in &data {
    if value >= 128 {
        sum += value as i64;
    }
}

// Branchless version - no prediction needed
let mut sum = 0i64;
for &value in &data {
    let mask = (value >= 128) as i64;  // 0 or 1
    sum += mask * value as i64;
}

The branchless version performs more operations per iteration, but it eliminates mispredictions. On random data, it can be 2-5x faster.

Common techniques:

Arithmetic operations instead of if
Bit masks
Sorting data first: Sometimes, simply sorting the array makes the if predictable (all false first, all true later) and the code flies without changing the logic.

// The sort trick: makes the branch 100% predictable
data.sort_unstable();
for &value in &data {
    if value >= 128 {  // Now it's: false,false,...,true,true,true
        sum += value as i64;
    }
}

5. Smart Memory Management

Why Frequent Allocations Are Slow

Every call to the allocator:

Searches for a free slot in the heap (complex algorithm).
Updates internal allocator structures.
Possibly asks the OS for more memory (syscall).
Can place your data at any random address.

Furthermore, freeing memory frequently causes fragmentation: the heap ends up full of small holes between used blocks, worsening locality.

Arena Allocators

The idea is simple: you ask for a large block of memory at the start and then hand out pieces of that block.

use bumpalo::Bump;

// Create an arena with a large block
let arena = Bump::new();

// O(1) Allocations - just increments a pointer
let x = arena.alloc(42i32);
let y = arena.alloc([1, 2, 3, 4, 5]);
let s = arena.alloc_str("hello");

// When exiting scope, EVERYTHING is freed at once
// No individual free(), no fragmentation

Advantages:

Allocation O(1): just increment a counter.
Deallocation O(1): reset or drop the entire arena.
Guaranteed locality: everything is contiguous.
Zero fragmentation.

Handles vs. Pointers

When you need "references" between entities, pointers (or in Rust, scattered indices or Rc<T>) scatter your access all over memory.

The alternative is to use handles: indices into dense arrays.

// With scattered references - memory all over the place
struct Entity {
    target: Option<Rc<RefCell<Entity>>>,
}

// With handles - contiguous memory, cache-friendly
#[derive(Clone, Copy)]
struct EntityHandle(u32);

struct World {
    entities: Vec<Entity>,
}

impl World {
    fn get(&self, handle: EntityHandle) -> &Entity {
        &self.entities[handle.0 as usize]
    }

    fn get_mut(&mut self, handle: EntityHandle) -> &mut Entity {
        &mut self.entities[handle.0 as usize]
    }
}

Handles keep your data in contiguous arrays while allowing relationships between entities. Plus:

They make serialization easy (an index is trivial to save).
They allow you to detect invalid references (you can validate the index).
They avoid Rust's lifetime headaches with reference graphs.

This pattern is the foundation of most ECS (Entity Component Systems) like bevy_ecs, hecs, or legion.

6. When To Apply This (And When Not To)

The Honest Filter

Data-Oriented Design is not a silver bullet. Before rewriting your code, ask yourself:

Where is the bottleneck?

Bottleneck Type	Does DOD help?
I/O bound (network, disk, DB)	❌ No
Memory bound	✅ Exactly for this
CPU bound (pure math)	⚠️ Helps, but check algorithms first

If your API spends 95% of its time waiting for database responses, optimizing the memory layout of your objects is like polishing the windshield of a car with no engine.

Is it in a hot path?

90% of execution time usually happens in 10% of the code. Optimize that 10%.

// This runs once at startup - DO NOT optimize
let config = parse_config_file()?;

// This runs 60 times per second with 100K entities - YES optimize
for i in 0..entities.len() {
    update_physics(&mut entities, i, dt);
}

How many elements are you processing?

Quantity	Is DOD worth it?
< 1,000	Probably doesn't matter
1,000 - 100,000	Can help in hot paths
> 100,000	Makes a real difference

7. What If I Don't Use Rust?

DOD principles are universal. Here is how to apply them in other languages:

Java

// Primitive arrays instead of ArrayList<Integer>
float[] x = new float[10000];  // Contiguous, cache-friendly
// vs
ArrayList<Float> x;  // Each Float is a separate object on the heap

// For SoA, use parallel arrays
float[] posX = new float[n];
float[] posY = new float[n];
float[] posZ = new float[n];

Note: Project Valhalla (in development) will bring Value Types to Java, allowing inline structs like in C#.

Python

# NumPy IS Data-Oriented Design
import numpy as np

# This is SoA under the hood - contiguous arrays in C
positions = np.zeros((10000, 3), dtype=np.float32)
velocities = np.zeros((10000, 3), dtype=np.float32)

# Vectorized operations - automatic SIMD
positions += velocities * dt  # Processes the whole array at once

The reason NumPy/Pandas are fast is exactly DOD: contiguous arrays in C that leverage cache and SIMD.

JavaScript/TypeScript

// TypedArrays are the way to have contiguous memory
const positions = new Float32Array(10000 * 3);
const velocities = new Float32Array(10000 * 3);

// Instead of:
const entities = []; // Array of scattered objects
for (let i = 0; i < 10000; i++) {
    entities.push({ x: 0, y: 0, z: 0, vx: 0, vy: 0, vz: 0 });
}

In WebGL/WebGPU, TypedArrays are mandatory. That is not a coincidence.

Conclusion

Data-Oriented Design comes down to one idea: organize your data for how you process it, not for how you conceptualize it.

Modern hardware is brutally fast if you feed it correctly. Cache lines, prefetching, SIMD, branch prediction — all these features exist and are waiting. You just need contiguous data and predictable access.

You don't need to rewrite all your code. Start by:

Identifying hot paths with a profiler.
Measuring cache misses in those sections.
Considering SoA where you iterate over specific fields.
Using contiguous vectors instead of structures with scattered pointers.

Performance isn't magic. It's physics: bytes traveling through wires, transistors switching in nanoseconds. When you understand the machine, you can write code that flies.

Resources To Go Deeper

You've seen the map. Here are the paths to explore each territory:

Hardware Fundamentals
- What Every Programmer Should Know About Memory — Ulrich Drepper. The canonical document.
- Computer Architecture: A Quantitative Approach — Hennessy & Patterson.
Data-Oriented Design
- Data-Oriented Design (free book) — Richard Fabian.
- CppCon 2014: Data-Oriented Design and C++ — Mike Acton. The talk that popularized DOD.
Rust Specific
- The Rust Performance Book
- Rust SIMD Performance Guide
Real Implementations (ECS)
- Bevy ECS — Game engine in Rust with ECS as its core.
- EnTT — Modern C++ ECS.
- Flecs — C ECS with excellent documentation.

From WAL to WASM - High-Performance Local-First Sync with Postgres & SQLite

Rafa Calderon — Sat, 24 Jan 2026 17:53:40 +0000

The maturation of WebAssembly (WASM) and emerging browser storage primitives have shifted the web application paradigm from traditional stateless models toward Local-First distributed systems. This technical analysis explores the implementation of a high-performance stack engineered to eliminate network latency (sub-millisecond UI) by running a full relational engine directly on the client.

The Technical Ecosystem

Throughout this analysis, we will deconstruct the integration of three core layers that facilitate bidirectional data persistence and synchronization:

Client Runtime: The application of WASM-compiled SQLite interfacing with the Origin Private File System (OPFS). We will explore how this architecture overcomes the IndexedDB I/O bottleneck by offloading complex query execution to a dedicated Web Worker.
Data Infrastructure: The configuration of PostgreSQL as the canonical source of truth, leveraging Logical Replication and the Write-Ahead Log (WAL). We will detail why data flows must be driven by database-level events (CDC) rather than imperative REST API calls.
Transport and Synchronization Layer: The deployment of PowerSync or ElectricSQL for the orchestration of Data Buckets. We will analyze the segmentation of global state into per-user local shards via JWT Claims and WebSocket streams, ensuring Optimistic UI consistency.

The goal is to demonstrate the transformation of the browser into a resilient database node, capable of handling local transactions and reconciling state with the server in a transparent, asynchronous manner.

1. The Client-Side Engine: SQLite + WASM + OPFS

1.1. The Rationale for WASM-based SQLite

Historically, web storage has been the primary bottleneck for complex applications:

LocalStorage: Synchronous by nature, it blocks the UI thread during write operations. It is restricted to a 5MB capacity and limited to string-based key-value pairs, making it unfeasible for sophisticated data structures.
IndexedDB: While asynchronous, its event-based API is notoriously difficult to manage (callback hell) and lacks a relational engine. The absence of native support for JOINs, aggregations, or advanced filtering forces developers to process data manually in JavaScript, incurring heavy CPU and memory overhead.

WebAssembly (WASM) redefines these constraints by enabling the original C source of SQLite to run directly within the browser. This is not an emulation, but a bytecode compilation executed by the browser engine with the following advantages:

Near-Native Performance: WASM operates on linear memory—a contiguous block of bytes—managed directly by SQLite. This allows the Query Optimizer to analyze execution plans and access indexes with microsecond latency, a feat impossible via JavaScript interpretation.
Main Thread Isolation: By compiling SQLite to WASM, the entire database engine can be offloaded to a Web Worker. The main thread (UI) remains completely unburdened, communicating with the database only to dispatch queries and receive result sets, thereby maintaining a consistent 60fps.
Full Relational Capabilities: Porting the complete engine provides access to ACID transactions, triggers, views, full-text search (FTS5), and, crucially, referential integrity directly on the client.

1.2. Storage: The OPFS Revolution

Running SQLite solely in-memory is volatile; data is lost upon page refresh. To transform SQLite into an industrial-grade persistent database, we leverage the Origin Private File System (OPFS). OPFS is a private storage ecosystem within the File System Access API that enables the browser to manage files with an efficiency previously unattainable by legacy web APIs.

High-Performance Synchronous Access: The cornerstone of OPFS is the FileSystemSyncAccessHandle. Unlike standard asynchronous web APIs—which introduce latency via the event loop—this interface allows for synchronous read and write operations. This is critical for SQLite, as the engine was architected under the assumption that the file system responds immediately to its low-level calls.
Optimization for WAL (Write-Ahead Logging): SQLite employs a journaling mechanism to prevent data corruption during failures. OPFS provides direct, exclusive access to data blocks (offsets), enabling SQLite's WAL mode to operate at peak velocity. In this mode, writes do not block reads, facilitating true concurrency that IndexedDB cannot emulate.
Specialized VFS (Virtual File System): SQLite interfaces with hardware through an abstraction layer known as the VFS. In this stack, we implement a JS/WASM VFS that serves as a bridge, translating SQLite's C-based I/O requests into direct OPFS calls. By executing this within a Web Worker, we ensure these synchronous operations occur on a separate thread, preventing any impact on the responsiveness of the user interface.

1.3. Implementation Strategy: Worker Threading

Executing SQLite on the Main Thread is non-viable, as intensive queries would cause the UI to hang. The standard architectural pattern for mitigating this is:

Main Thread: The UI layer (React/Vue/Svelte) dispatches commands via a messaging channel (postMessage).
Web Worker: Encapsulates the SQLite WASM binary. It receives commands, executes SQL against the OPFS, and returns the result set asynchronously.
Shared Workers (Optional but Recommended): In scenarios where a user has multiple tabs of the application open, a Shared Worker ensures all tabs interface with the same database instance. This is a critical design choice to prevent file contention and potential data corruption within the OPFS.

1.4. Frontend <-> SQLite Interaction: The Observer Pattern

In a Local-First architecture, the frontend does not "request" data in the traditional sense; instead, it observes the local state. To ensure this interaction remains efficient, we implement a reactivity system centered on table-level tracking.

Query Subscriptions (Live Queries): The frontend registers persistent SQL queries. Rather than a one-off execution, the integration SDK establishes a Stream or Observable. Because the SQLite WASM engine operates within a dedicated Worker, the system can maintain hundreds of active subscriptions without degrading user input latency.
Reactive Loop & Table-Level Invalidation: To optimize performance, the system avoids re-executing every query upon every change. The observability engine tracks which tables are touched by each SELECT statement. For instance, if an INSERT occurs in the tasks table, the system detects the mutation and only notifies the hooks or components specifically dependent on tasks.
Zero-Latency Feedback: With a local database, the "Write → Notification → Re-render" cycle executes in microseconds. This renders complex loading states for local operations obsolete, as the UI stays in near-instantaneous synchronization with the relational engine.

// High-level reactive hook example
const { data, isLoading } = useQuery(
  "SELECT * FROM projects WHERE id = ?",
  [projectId]
);

// This mutation triggers a table-level invalidation in the Worker.
// The UI updates automatically in <1ms without a network round-trip.
const addTask = async (task: Task) => {
  await db.execute(
    "INSERT INTO tasks (id, content, status) VALUES (?, ?, ?)",
    [task.id, task.content, task.status]
  );
};

1.5. The Technical Challenge: Type Parity and Marshalling

The primary architectural friction between client and server stems from type system discrepancies. While PostgreSQL is a strictly typed system supporting complex data structures, SQLite utilizes Manifest Typing—where the data type is associated with the value itself rather than the column—and supports only five native storage classes: NULL, INTEGER, REAL, TEXT, and BLOB.

To ensure system-wide integrity, a robust Type Mapping layer must be implemented:

UUIDs and Strings: PostgreSQL handles UUIDs natively, whereas SQLite must store them as TEXT or BLOB. The technical challenge lies in ensuring that the "bridging" between the WASM binary and JavaScript environment does not introduce significant overhead during string conversions in high-frequency transactions.
Timestamps (ISO 8601 vs. Unix Epoch): SQLite lacks a native date/time type. To maintain the precision and timezone accuracy of Postgres's TIMESTAMPTZ, we standardize local storage as ISO 8601 strings or BigInts (Unix epoch). This ensures that SQLite's built-in date functions (datetime(), strftime()) remain fully functional for filtering and sorting operations.
JSONB to Text: The JSONB type in PostgreSQL is binary and highly efficient. On the client, SQLite stores this data as TEXT. Consequently, the synchronization layer must perform selective parsing: deserializing into JavaScript objects only when required by the UI to avoid unnecessary CPU penalties at the storage layer.

/**
 * Type Mapping Utility (Conceptual)
 * Ensures Postgres-compatible types are correctly handled in SQLite WASM
 */
interface SyncPayload {
  id: string; // UUID from Postgres
  data: Record<string, unknown>; // JSONB mapped to TEXT
  updated_at: string; // TIMESTAMPTZ mapped to ISO-8601
}

const mapToSQLite = (payload: SyncPayload) => {
  return {
    ...payload,
    // SQLite doesn't have native JSONB, we must stringify
    data: JSON.stringify(payload.data),
    // Ensure consistent timestamp format for SQLite date functions
    updated_at: new Date(payload.updated_at).toISOString()
  };
};

2. The Source of Truth (Postgres + Logical Replication + WAL)

In a Local-First architecture, the backend does not originate data—that process has already occurred on the client. Instead, the server functions as the authoritative entity responsible for validation, long-term persistence, and global redistribution.

2.1. The WAL (Write-Ahead Log) as a Messaging System

Traditional synchronization architectures typically rely on polling or manual application-level event triggers, both of which are resource-intensive and prone to race conditions. This stack transforms PostgreSQL into a reactive system by tapping directly into its internal transaction engine.

Understanding the WAL: The Write-Ahead Log is the backbone of Postgres data integrity. Every state change is recorded in this sequential log before it is committed to the permanent tables, ensuring the database can reconstruct its state following a crash.
Logical Replication & Decoding: By configuring the server with wal_level = logical, we enable Postgres to decode physical disk changes into logical row-level operations. This allows for the extraction of a continuous stream of INSERT, UPDATE, and DELETE events in structured formats like JSON or Protobuf.
CDC (Change Data Capture) Efficiency: Rather than repeatedly querying the database for changes, the Sync Layer subscribes to the Postgres replication slot. Changes are "pushed" to the synchronization orchestration layer immediately upon transaction commit. This architecture drastically reduces database overhead and ensures minimal propagation latency across all connected clients.

Critical Server-side Configuration

For the infrastructure to emit these granular deltas, the PostgreSQL instance must be tuned specifically for logical decoding. This involves modifying core server parameters to support persistent replication slots and high-volume log streaming.

-- Required configuration in postgresql.conf
-- wal_level must be 'logical' to enable logical decoding
wal_level = logical

-- Increase the number of replication slots based on expected load
max_replication_slots = 10
max_wal_senders = 10

2.2. Schema Strategy: The Postgres-SQLite Mirror

In a high-level architecture, we treat PostgreSQL as the Canonical Source of Truth and SQLite as an Optimized Projection.

Postgres Schema (The Canonical Source): This layer houses the core business complexity. We utilize native types—such as JSONB for semi-structured documents, TIMESTAMPTZ for absolute temporal precision, and GIS for geospatial data—alongside aggressive constraints. Postgres ensures data integrity across the entire organization.
SQLite Schema (The Client Projection): This is not a direct clone but rather a "flattened," lightweight version.
- Bypassing Constraints: While a foreign key violation in Postgres would trigger an integrity error, we occasionally relax these constraints in SQLite. This allows the Optimistic UI to function even if related data has not yet propagated through the synchronization stream.
- On-the-fly Type Mapping: Since SQLite lacks native support for types like UUID or TIMESTAMPTZ, the synchronization layer must perform real-time transformations. When the WAL emits a change for a TIMESTAMPTZ field, the sync layer normalizes it (typically to ISO 8601 or Unix Epoch) so that SQLite's date functions remain performant on the client side.

2.2.1. Shadow Columns: Managing Synchronization State

To enable the system to be aware of its own synchronization state without polluting the business domain model, the SQLite schema incorporates technical metadata columns. These act as control headers for every record:

_status: Defines the local data lifecycle (synced, pending_insert, pending_update). This is the engine behind the Optimistic UI, allowing the interface to visually distinguish between confirmed data and data currently in transit.
_version: A sequence identifier or hash used for Conflict Detection. It prevents "stale" server updates from overwriting more recent local changes.
_last_synced_at: A timestamp of the last validation against the source of truth. It facilitates cache eviction policies and ensures the client knows the "freshness" of its local projection.

Atomic Reconciliation Flow

When a local operation is performed, the engine updates both the domain data and the synchronization metadata within a single, atomic transaction:

The local state transitions to pending.
The Sync Layer detects the flagged row and initiates the upstream transmission.
Upon backend confirmation (verified via the WAL), the state is updated to synced, closing the consistency loop.

2.3. Multi-tenancy and Isolation (Buckets)

The objective is to ensure that each SQLite instance contains only the data authorized for the active user, optimizing both bandwidth consumption and security.

Bucket-Based Segmentation: Rather than replicating entire tables, we define logical subsets. A bucket is a unit of synchronization that aggregates records based on membership criteria (e.g., user_id, team_id, or project_id).
Postgres Publication: This serves as the egress point. We define which tables participate in the logical replication stream.

-- Defining the publication for the synchronization engine
CREATE PUBLICATION my_app_sync FOR TABLE tasks, projects, comments;

Resync and Invalidation Dynamics: The primary challenge lies in permission changes. If a user's access to a project is revoked, their local bucket becomes "orphaned." The synchronization layer must detect this state change on the backend—via triggers or updates to permission tables—and force a bucket invalidation on the client to maintain isolation integrity.

Edge Filtering

Unlike traditional query patterns, filtering does not occur on the client side; instead, it is handled within the Sync Layer before deltas are dispatched. The server evaluates access rules against the WAL and forwards only the rows that match the claims within the user's JWT.

2.4. The Write Flow: The "Async Bridge"

In this model, reads are part of a passive stream (via the WAL), whereas writes are imperative actions that require server-side validation. The "Async Bridge" ensures that local mutations are promoted to global canonical truth.

Local Ingestion and Mutation: The client writes directly to its local SQLite instance. The change is reflected instantaneously in the UI, but the record is flagged with a pending status.
Transport: The synchronization SDK aggregates these mutations and dispatches them to the backend, either via a standard API or a persistent tunnel provided by the Sync Service.
Business Validation: The server receives the mutation and enforces integrity rules, verifying user permissions and validating data schemas.
Commit and Cycle Closure: Upon successful validation, the change is committed to PostgreSQL. This generates a new entry in the WAL, which the Sync Layer detects and broadcasts back to the client as an acknowledgment (ACK). Only then does the client transition the local record status to synced.

Conflict Resolution: The Concurrency Challenge

In an offline-first system, write conflicts are inevitable. The architecture must define how the server arbitrates between competing updates:

Last Write Wins (LWW): This is the industry standard due to its simplicity. PostgreSQL utilizes the transaction commit timestamp to arbitrate; the final change to reach the disk persists as the state of truth. It is ideal for applications with low collision probability on individual fields.
Causal Integrity (Versioning): For mission-critical systems, each record includes a version column or a high-precision updated_at timestamp. If a client attempts to push a mutation with a stale version—typically caused by offline drift while another user updated the same record—the backend rejects the change or initiates a merge process.

3. The Synchronization Layer (Sync Layer & Orchestration)

In traditional models, the backend serves as a passive gatekeeper. In this architecture, the synchronization layer—utilizing PowerSync—functions as an active orchestrator, maintaining a consistent data graph between the server and thousands of local clients.

3.1. The Data Tunnel: WebSockets and Delta Streaming

Unlike atomic REST requests that terminate upon response, this architecture establishes a persistent binary tunnel via WebSockets.

PowerSync Streaming: PowerSync interfaces with the PostgreSQL logical replication slot to consume the WAL in real-time. It does not wait for client polling; it "pushes" changes immediately following a commit in the central database.
Binary Protocol (Deltas): To minimize bandwidth and battery consumption, the system employs efficient serialization formats like Protobuf. Rather than transferring the entire row, it sends Deltas—the exact diff of the changed fields.
Sync State Persistence (LSN & Cursors): The sync layer tracks the Log Sequence Number (LSN) for every client. In a reconnection scenario—such as a user emerging from a tunnel—the client transmits its last known LSN. The Sync Layer then calculates the precise delta from that point in the WAL, avoiding expensive full re-synchronizations.

3.2. PowerSync as a Filtering Engine (Sync Rules)

PowerSync's primary strength lies in its ability to execute server-side Sync Rules. These rules function as a dynamic firewall:

Permission Evaluation: For every change detected in the WAL, PowerSync determines the intended recipients based on SQL logic defined at the server level.
Dynamic Partitioning: If a user's access to a project is revoked, the sync rule detects this state change and dispatches a "cleanup" instruction to the local SQLite instance, ensuring data security even within offline storage.

-- Conceptual Server-Side Sync Rule
-- Tasks flow only if the user is a member of the project
SELECT * FROM tasks
WHERE project_id IN (
  SELECT id FROM projects WHERE user_id = request.auth.user_id
)

3.3. Data Buckets and Sync Rules

This is a critical architectural concept for security. Data is never filtered on the client; filtering occurs within the Sync Service via SQL rules.

Bucket Definition: Each user possesses a virtual "bucket". Upon login, the system calculates the user's permission graph and begins "filling" the local SQLite database with these specific fragments of the global database.

3.4. Authentication and Security (JWT + Claims)

In this stack, the JWT (JSON Web Token) is more than an access token; it is the primary key for data filtering.

Auth Provider: Systems like Supabase Auth, Clerk, or Auth0 issue a JWT containing Custom Claims (such as user_id and roles).
Handshake: The WASM client transmits the JWT upon establishing a connection to the Sync Service.
Validation: The Sync Service validates the JWT signature and utilizes the embedded user_id to execute the bucket rules described in section 3.2.

3.5. The "Upload Path": The Transactional Outbox Pattern

While WAL streaming handles the downstream "pulse," the Upload Path is the engine that pushes mutations upstream. To ensure no changes are lost during network partitions, we implement the Transactional Outbox pattern directly within the local engine.

SQLite Atomicity: When the application executes a mutation (e.g., UPDATE tasks...), the SDK performs a dual operation within a single SQLite transaction: it modifies the business table and inserts the change representation into a technical outbox table. This ensures that either the change and its pending upload are saved together, or nothing is saved at all.
Background Sync & Retry Logic: A background process monitors the outbox table. Upon detecting connectivity, it attempts to dispatch changes to the backend (typically via POST/PATCH APIs) using exponential backoff if the server is unreachable.
The Loopback Confirmation: This is the final step in the consistency cycle. The client does not purge the outbox record simply upon receiving a 200 OK from the server. Instead, it waits for the processed change to arrive via the WAL stream. Receiving its own update back from the server serves as definitive proof of persistence in Postgres. At this point, the local record transitions from pending to synced, and the outbox is cleared.

Backend Integrity

The backend serves as the ultimate validator. If an upstream mutation violates business rules—such as updating a task already closed by another user—the backend rejects the change. The sync system then notifies the client to revert the local change or flag it for manual resolution, preventing offline conflicts from corrupting the central source of truth.

4. The Alternative: Turso in the Browser (LibSQL WASM)

Turso's entry into the browser via WebAssembly demonstrates that the local database model is no longer merely a trend, but the de facto standard. However, its underlying implementation architecture reflects distinct design decisions compared to PowerSync that merit technical analysis:

Compute: Main Thread vs. Workers: In contrast to the standard recommendation of offloading SQLite to a Web Worker to prevent UI blocking, Turso—in its current implementation via napi-rs—executes computation on the Main Thread. It delegates only file I/O to a Worker through a SharedArrayBuffer. The rationale is that for lightweight queries, the overhead of cross-thread communication can exceed the execution time of the query itself.
Replication Protocol (The LibSQL Way): While PowerSync relies on a "Sync Rules" system (server-side SQL) to generate buckets, Turso utilizes native LibSQL replication. This facilitates the creation of Embedded Replicas in the browser that are virtually identical to the cloud-hosted database, significantly simplifying type parity across the stack.
Memory Security Requirements: To enable efficient communication between the Main Thread and the OPFS Worker, Turso requires the server to deliver the application with COOP/COEP (Cross-Origin Isolation) headers. This is a critical technical requirement to mitigate side-channel attacks, though it introduces an additional layer of deployment configuration.

Conclusion: Which Stack to Choose?

Choosing the right architecture depends on your specific data requirements:

Choose PowerSync/ElectricSQL if: You require granular control over which data subsets are downloaded (Buckets), manage highly complex permission logic within PostgreSQL, and need an Optimistic UI with conflict resolution handled out-of-the-box.
Choose Turso/LibSQL if: You seek deeper integration with the LibSQL ecosystem, prefer an API that mirrors better-sqlite3 within a browser environment, and are comfortable managing COOP/COEP security headers.

What this architecture is not

This stack is powerful, but it is not universally applicable. It is intentionally opinionated and comes with real trade-offs.

Not for simple CRUD applications
If your app is a basic form-over-API system with minimal offline requirements, a traditional REST or GraphQL backend will be simpler, cheaper, and easier to maintain.

Not beginner-friendly
This architecture assumes solid knowledge of relational databases, WAL semantics, concurrency, and distributed systems. Debugging synchronization issues requires backend and database expertise—not just frontend tooling.

Not free of operational complexity
Logical replication, replication slots, sync rules, and client reconciliation introduce operational overhead. You are trading API simplicity for correctness, performance, and offline guarantees.

Not a replacement for domain-specific conflict resolution
While patterns like Last-Write-Wins work for many use cases, collaborative or high-contention domains often require explicit merge strategies or user-assisted conflict resolution.

Not necessary unless latency or offline capability truly matter
The benefits of this architecture only justify themselves when instant local feedback, offline operation, and data ownership are core product requirements—not nice-to-haves.

Originally published at bdovenbird.com

Real Zero-Copy: A Technical Autopsy of Cap'n Proto and the Serialization Fallacy

Rafa Calderon — Thu, 22 Jan 2026 11:35:21 +0000

Protocol Buffers (Protobuf) has established itself as the industry standard for backend data exchange, solving the verbosity issues of XML and JSON. However, while Protobuf optimized bandwidth, it left a critical bottleneck untouched: the CPU toll of Marshalling and Unmarshalling.

No one understood this problem better than Kenton Varda. As the primary author of Protocol Buffers v2 at Google, Varda witnessed a structural inefficiency in his own creation firsthand: Google's servers were burning an absurd amount of CPU time simply copying data from memory structures to network buffers and back, rather than processing business logic.

From that observation, Cap'n Proto was born. It wasn't designed as just "another faster serializer," but as an architectural correction to its predecessor. It is a rejection of the very idea that serialization—the act of transforming data to send it—needs to exist at all.

1. The "Infinity-Fast" Architecture: O(1) vs O(n)

In a traditional pipeline—think JSON, Thrift, or even Protobuf itself—the data lifecycle is painfully redundant. You have scattered object graphs in the Heap that the CPU must traverse, copy, and flatten to send (Encoding), only for the receiver to perform massive allocations and rebuild that graph from scratch (Decoding). Both processes have O(n) complexity; the larger your data, the more time you waste before you can even use it.

Cap'n Proto eliminates the encoding and decoding steps entirely. How? By ensuring that the wire format is bit-for-bit identical to the in-memory structure.

This is what the official documentation provocatively defines as "Serialization is a lie". We aren't transforming data; we are moving blocks of memory. Technically, this is achieved because data is organized internally as C-like structs with fixed offsets, rather than a stream of tokens that needs interpretation.

The runtime impact is brutal:

Sending: You write the bytes from your memory directly to the socket.
Receiving: This is where OS magic comes in. By leveraging the POSIX mmap(2) syscall, the receiver doesn't need to read or parse the entire file. It simply maps the file into its virtual address space and casts the initial pointer to the root structure (Struct Root).

The "parse" time is effectively zero. Better yet, we delegate memory management to the Kernel. The OS uses Page Faults to lazily load only the data you actually touch into physical RAM. This allows for the processing of datasets far larger than available RAM with instant startup time—something unthinkable with a traditional parser.

2. Low-Level Layout: Alignment and Pointers

To make this magic work without killing the CPU, Cap'n Proto rigorously respects modern hardware architecture, prioritizing access efficiency over obsessive compression.

A. Word Alignment

Unlike Protobuf, which aggressively compacts bytes using Varints (forcing the CPU to perform sequential decoding and bit-shifting), Cap'n Proto aligns all data to 64-bit boundaries (8 bytes).

This isn't an aesthetic choice; it's purely architectural. As detailed in manuals like the Intel® 64 and IA-32 Architectures Optimization Reference Manual, modern CPUs severely penalize unaligned memory accesses. If a read crosses a cache line split, the cost in clock cycles multiplies. The Linux Kernel even warns that on architectures like ARM, an unaligned access can trigger exceptions that the kernel must trap, destroying performance.

By maintaining strict alignment, accessing a uint64 becomes a single assembly instruction (MOV). Furthermore, by grouping primitives at the start of the struct and pointers at the end, we maximize spatial locality, ensuring "hot data" resides in the same L1 cache line.

B. Relative Pointers (Offsets)

Here lies the protocol's smartest engineering. We cannot transmit absolute memory pointers (e.g., 0x7fff...) because the receiver's virtual address space is different, and security mechanisms like ASLR (Address Space Layout Randomization) make it unpredictable.

To solve this, the Cap'n Proto Encoding spec defines the use of relative pointers. Instead of an address, the pointer stores a two's complement offset. The official formula to resolve the memory address is:

TargetAddress = PointerAddress + 8 + (offset * 8)

In other words: take the pointer's current location, add 8 bytes (to skip the pointer itself), then add the offset multiplied by 8 (since offsets are in 64-bit words).

This arithmetic makes the message completely relocatable (position-independent). You can move the entire binary block to any location in RAM, and the internal pointer math remains valid without needing to re-encode.

C. Security: Bounds Checking and Pointer Bombing

A system marketed as "Zero-Copy" usually raises red flags for security teams. What stops an attacker from sending a pointer with a malicious offset that points outside the assigned segment, causing a Segfault or a Heartbleed-style vulnerability?

Cap'n Proto does not perform blind dereferencing. As detailed in the library's C++ Security Tips, the generated "getters" perform strict bounds checking against the received segment size before returning any data.

Additionally, to mitigate Denial of Service (DoS) attacks via infinite cyclic or recursive structures ("Pointer Bombing"), the implementation imposes hard limits. The ReaderOptions class includes parameters like traversalLimitInWords; if a malicious message attempts to force the reader to process more data than physically exists (amplification), the library throws a security exception before touching invalid memory.

3. RPC and Promise Pipelining: Eliminating Network Latency

Instant serialization is irrelevant if your architecture is still blocked by network latency. This is where Cap'n Proto leaves traditional models like gRPC or REST in the dust by attacking Request Chaining.

Consider a common operation: db.getUser(id).getProfile().getPicture().

In traditional synchronous RPC, this implies 3 Round-Trips (RTT). If the latency between services is 50ms, your operation takes 150ms minimum, regardless of how fast your CPU is.

The Solution: Promise Pipelining

Cap'n Proto implements Promise Pipelining, a technique grounded in the E-Protocol and the Object-Capability Model (described in the seminal paper Network-Transparent Formulation of an Object-Capability Language by Mark Miller et al.).

The system allows you to return promises that are usable as "tokens" for new calls before the actual data is resolved. The official documentation refers to this as "Time Travel" or Level 3 RPC. The flow changes radically:

Client: Sends Call getUser(id). Immediately receives a Promise<User>.
Client: Without waiting for the network, sends Call getProfile(on: Promise<User>).
Client: Without waiting, sends Call getPicture(on: Promise<Profile>).

The server receives the batch of instructions. It executes getUser, and since it has the results in its own memory, it passes the resulting object directly to getProfile, and that result to getPicture.

Result: 1 RTT.

The server only returns the final result to the client. We have converted a network latency problem (expensive and unpredictable) into a local server memory throughput problem (fast and constant).

4. The Elephant in the Room: "Packed Encoding"

The obsession with alignment comes at an obvious price: Padding.

If your schema defines a uint8 immediately followed by a uint64, the protocol will mandatorily insert 7 bytes of zeros to maintain alignment for the next word. On bandwidth-constrained networks, sending zeros is an unacceptable luxury.

To mitigate this without returning to the expensive CPU processing of Protobuf's Varints, Cap'n Proto offers an intermediate solution: Packed Encoding.

This isn't generic compression like GZIP; it is a Run-Length Encoding (RLE) algorithm optimized specifically for 64-bit words, as defined in the Packing specification. The mechanism is ingenious in its simplicity:

The system reads a 64-bit word.
It generates and prepends a Tag Byte (bitmap) indicating which bytes of that word contain actual data.
It writes only the non-zero bytes to the wire.

Efficiency is seen in the edge cases: if the Tag is 0x00, the entire word is zero, and nothing else is transmitted (maximum compression). If the Tag is 0xFF, the 8 bytes are copied as-is.

This reduces message size to levels competitive with Protobuf, adding a marginal CPU cost for "inflation," but keeping the structure ready to be mapped into memory. It is an explicit, optional trade-off: sacrificing minimal CPU cycles to save bandwidth.

5. Critical Analysis: When NOT to Use It

Cap'n Proto is not a silver bullet.

Rigid Schema: Schema Evolution is stricter than in JSON. Renaming fields or changing types requires discipline and an understanding of how bits are mapped.
Debugging Complexity: The binary format is opaque. You cannot simply curl and see JSON. You need specific tools (capnp tool) to inspect traffic.
Ecosystem: While it supports C++, Rust, Go, and Python, the ecosystem of third-party tools and libraries is a fraction of what exists for JSON/REST or gRPC.
Security Boundaries: While we validate limits, exposing a Cap'n Proto API directly to the public internet requires careful auditing. It is ideal for inter-service (East-West) traffic within data centers, but risky for public-facing frontend APIs.

Conclusion

Cap'n Proto respects the fundamental principle of modern hardware: Memory is the new disk, and CPU is a precious resource.

By aligning data on the wire with its in-memory representation, we eliminate the "encoding lie." If your system is CPU-bound during serialization or suffers from latency due to multiple RPC calls, Cap'n Proto is the correct architectural optimization. If your priority is human readability or extreme schema flexibility without types, stick with JSON.

The Hidden Cost of JSON in REST APIs

Rafa Calderon — Fri, 09 Jan 2026 16:59:34 +0000

Why JSON parsing consumes 40-70% of CPU cycles in REST APIs, and how SIMD and Branchless Programming solve it through mechanical sympathy with the hardware.

By: Rafael Calderón Robles | LinkedIn

In modern microservices architectures, there is a recurring fallacy that blames network latency, the database, or the disk when an API's performance disappoints. However, low-level profiling on high-load REST endpoints often reveals a different culprit: In some high-throughput, CPU-bound REST services, profiling often shows JSON serialization and deserialization consuming between 40% and 70% of CPU cycles.

This article explores the root cause of this inefficiency: the structural unpredictability of the JSON format forces the CPU to make constant decisions (branches), causing Branch Prediction failures and Pipeline Flushes. We will analyze how modern engineering solves this using Branchless Programming and SIMD (Single Instruction, Multiple Data), transforming parsing from a logical problem into an arithmetic one.

1. The Invisible Enemy: Branch Predictor Saturation

JSON parsing places a disproportionate load on the CPU due to the nature of its evaluation: it is a Data-Dependent Control Flow problem.

Unlike fixed-schema binary formats—where accessing a field is a simple arithmetic operation of $base + offset$ and an $O(1)$ memory read—JSON is strictly sequential and contextual. The interpretation of byte $N$ depends entirely on the state derived from bytes $0$ to $N-1$. This forces the parser to be implemented as a Finite State Machine (FSM) that must evaluate every single byte to decide the next state transition.

For the CPU microarchitecture, this transforms data reading into a dense sequence of branching instructions.

The Anatomy of Scalar Blocking

In a naive or standard implementation, the parser interrogates the input stream byte by byte. At the assembly level, every high-level conditional structure translates into comparison instructions (CMP) followed by conditional jumps (Jcc, such as JE, JNE, JG).

// Naive Scalar JSON Parser (Illustrative)
// The "Hot Path" is mined with jump instructions (JMP/JNE)
for (size_t i = 0; i < len; i++) {
    char c = data[i];

    // Each 'if' is a bet for the Branch Predictor
    // The processor cannot retire subsequent instructions until this resolves
    if (state == STATE_START) {
        if (c == '{') state = STATE_OBJECT;
        else if (c == '[') state = STATE_ARRAY;
    }
    else if (state == STATE_OBJECT) {
        if (c == '"') state = STATE_KEY; // Start of a key?
        else if (c == '}') return SUCCESS; // End of object?
    }
    // ... dozens of more conditions
}

The Failure of Branch Prediction

Modern CPUs rely on Speculative Execution to maintain performance. The Branch Predictor unit attempts to guess the outcome of a condition (true or false) to load and execute future instructions before the current condition is actually resolved.

Predictors work by analyzing historical patterns (for example, a loop repeating 1,000 times has a predictable pattern: it "jumps back" 999 times). However, JSON syntax presents a distribution of control characters ({, ", :, ,) that lacks reliable long-range repetitive patterns.

From the hardware's perspective, the input stream has high local entropy and weak long-range predictability. The Branch Predictor constantly fails when trying to anticipate if the next byte will be a quote, a bracket, or an alphanumeric character. This prevents the processor from leveraging its superscalar capabilities, degrading execution to strict, stuttering serial processing.

2. Impact on Microarchitecture: Latency via Branch Misprediction

To understand the magnitude of the inefficiency, we must analyze the pipeline behavior in modern x86-64 architectures (such as Intel Golden Cove or AMD Zen 4). These cores employ Out-of-Order Execution (OoOE) and deep pipelines, keeping hundreds of micro-operations (μops) "in flight" within the Reorder Buffer (ROB) to maximize parallelism.

When control flow depends on input data (as in an if (c == '"') evaluation), the CPU cannot pause to wait for the comparison result. It must resort to Speculative Execution.

The Misprediction Sequence

The mechanical process that penalizes performance occurs in three critical phases:

Speculation: The Branch Predictor assumes the most likely path (e.g., "it is not a quote"), and the processor's Front-end loads and decodes instructions from that path, filling the ROB.
Resolution and Fault: Cycles later, the Arithmetic Logic Unit (ALU) resolves the comparison and determines that the prediction was incorrect.
Pipeline Flush: The CPU must annul all speculative instructions subsequent to the jump that were already in the ROB and restart the Instruction Fetch from the correct memory address.

Quantifying the Cost

The Branch Misprediction Penalty in high-performance processors is approximately 15 to 20 clock cycles.

This introduces massive latency. In a parsing-intensive context, if the predictor fails with a statistically relevant frequency (due to JSON's high entropy), the processor spends a significant portion of its time "cleaning" the pipeline rather than processing data. This drastically reduces the Instructions Per Cycle (IPC) index, nullifying the advantages of superscalar architecture and limiting processing speed to memory latency and control logic.

Branch misprediction is not the only cost—cache behavior, memory bandwidth, and instruction throughput also play a major role—but it is one of the hardest bottlenecks to optimize away in scalar parsers.

3. The Engineering of Speed: SIMD and Branchless Programming

The solution to the pipeline bottleneck isn't writing faster if statements—it's eliminating them entirely. To achieve this, modern software engineering (popularized by libraries like simdjson) radically changes the paradigm: we shift from a logical flow to an arithmetic flow.

This approach rests on three theoretical pillars that transform JSON chaos into a predictable structure for the hardware.

3.1. From Scalar to Vector (SIMD)

While a traditional parser operates in Scalar mode (reading a byte, processing it, moving to the next), the modern approach uses SIMD (Single Instruction, Multiple Data) instructions.

Modern CPUs possess "wide" registers (like AVX2 256-bit or AVX-512). This allows the processor to load and analyze blocks of 32 to 64 bytes of text simultaneously in a single clock cycle.

In theory: It's the difference between a supermarket cashier scanning one item at a time versus an industrial scanner reading the entire cart in a single flash.
In practice: Throughput multiplies because the CPU is no longer limited by individual read speed, but by memory bandwidth.

The speedup does not come only from doing more work per instruction, but from drastically reducing branches and improving cache-friendly, linear access patterns.

3.2. Branchless Programming: The Death of the 'If'

The real magic happens when processing these blocks. Instead of asking "Is this character a quote?" (which would trigger a branch and risk misprediction), Branchless code asks arithmetic questions about the entire block at once.

The parser generates a bitmask. Imagine a perforated template placed over the text:

The 32-byte block is compared against known patterns (quotes, brackets, colons) in parallel.
The result is not a flow decision, but an integer (the mask).
If there is a quote at position 3 and 10, the mask will have bits set at those positions (e.g., ...00100000100).

This process is deterministic. It takes exactly the same amount of CPU cycles whether the JSON is full of complex structures or empty. The pipeline never stops because there is never a doubt to resolve; only math.

3.3. Structural Navigation vs. Sequential Reading

Once the structural mask is obtained, the simdjson parser does not need to read the text character by character. It uses hardware instructions to count zeros (TZCNT) and find the next set bit in the mask.

This allows it to "jump" instantly from one structural element to another. The parser knows where every string or number starts and ends without having "read" the intermediate content. It converts parsing from an exploration problem (walking blindly) to a navigation problem (having an exact GPS map of the data).

3.4. Implementation Constraints

SIMD parsers like simdjson achieve remarkable performance, but come with technical requirements that limit their applicability:

No Streaming: The parser requires the entire JSON document loaded into a contiguous memory buffer. This makes it unsuitable for processing unbounded streams (e.g., server-sent events, large file processing with constrained memory).
UTF-8 Only: The parser assumes valid UTF-8 encoding. Legacy systems using Latin-1, Windows-1252, or other encodings require conversion before parsing.
Alignment Sensitivity: To use SIMD instructions safely, the input buffer may need to be over-allocated or copied to meet alignment requirements (e.g., padding to 64-byte boundaries for AVX-512).
API Differences: It is not a drop-in replacement for standard JSON libraries. Migrating existing code requires refactoring to use the simdjson DOM or On-Demand API.

These constraints are not dealbreakers—they are the necessary cost of extracting maximum hardware performance. The decision to adopt simdjson depends on whether your workload characteristics (high-frequency, bounded documents, UTF-8 text) align with these requirements.

4. The Architectural Strategy: Escaping the Tyranny of Text

If optimizing JSON parsing requires silicon-level engineering (SIMD/Branchless), the obligatory architectural question becomes: Are we using the right format?

Text formats like JSON sacrifice compute efficiency for human readability. However, in communication between microservices (where no human is reading the packets), this readability becomes pure technical debt. The real alternative lies in binary formats, which offer Mechanical Sympathy with the hardware.

4.1. Why Binary is Superior: Determinism vs. Inference

The advantage of binary formats isn't just payload size (compression), but the reading mechanics.

While JSON forces the CPU to scan byte-by-byte looking for delimiters (:, ,, }), binary protocols use Length-Prefixed fields.

In JSON: "Read until you find a quote." (Unpredictable Branching).
In Binary: "Read a 4-byte integer for length ($L$), then copy $L$ bytes." (Pointer Arithmetic + memcpy).

This often transforms deserialization from syntactic analysis into mostly pointer arithmetic and bounded memory reads, depending on the format.

4.2. The Landscape of Alternatives (Trade-offs)

Not all binary formats are created equal. Depending on the need for latency vs. compatibility, there are three main categories:

A. Structured Serialization (Protobuf / gRPC, Avro)

How it works: Requires defining a schema (.proto) that compiles to native code on client and server.
Advantage: Strong typing, strict contracts, and excellent compression. It is the de facto standard for microservices.
Cost: Requires a decoding step (lightweight parsing) to convert bytes into language objects.
Hidden Cost: Debugging becomes harder—you cannot simply curl an endpoint and read the response. Observability tools (logs, traces, API gateways) need Protobuf-aware tooling. Schema evolution requires careful versioning to avoid breaking changes across distributed services.

B. Zero-Copy / Memory Mapped (FlatBuffers, Cap'n Proto)

How it works: Organizes data in the network buffer exactly as it is laid out in RAM (C structs).
Advantage: Absolute Performance. There is no "parsing" step. Accessing a message field is simply adding an offset to the memory pointer. Ideal for High-Frequency Trading (HFT) or Gaming.
Cost: Higher implementation complexity and slightly larger payloads (due to memory alignment/padding).
Hidden Cost: Steep learning curve—working with FlatBuffers feels fundamentally different from normal object manipulation. Debugging is nearly impossible without specialized tools. Alignment bugs can cause silent data corruption or segfaults. Schema evolution is highly restrictive; adding fields retroactively is painful.

C. "Binary JSON" (MessagePack, BSON)

How it works: Schemaless formats that encode JSON types in binary.
Advantage: Easy adoption (Drop-in replacement). Does not require .proto contracts.
Cost: Lower performance than Protobuf/FlatBuffers because it still requires dynamic type inspection.
Hidden Cost: The performance gain over well-optimized JSON parsers (like simdjson) may be smaller than expected—often 2-3x instead of 10x. Library ecosystem maturity varies significantly across languages. Some type mappings are lossy (e.g., precision issues with large integers or dates).

4.3. Decision Matrix: When to use what?

There is no silver bullet. The choice depends on who consumes the data.

Scenario	Recommended Tech	Technical Reason
Internal Traffic (East-West)	gRPC (Protobuf)	Total control of both ends. CPU savings across thousands of RPCs justify the strict contract.
Real-Time Systems / HFT	FlatBuffers	Deserialization latency must be near zero. Direct memory access required.
Public API / Web (North-South)	JSON (with simdjson)	Universal compatibility is priority. The browser/client expects JSON. This is where using a SIMD parser is critical.
Rapid Prototyping	MessagePack	Improves performance over text JSON without the rigidity of maintaining `.proto` schemas.

⚠️ When this doesn't matter

If your API is I/O-bound, dominated by database queries, or doing heavy business logic, optimizing JSON parsing will not move the needle. These techniques matter when the system is already CPU-bound and handling high request volumes.

Final Verdict: A Matter of Context and Scale

The "Hidden Cost of JSON" is not necessarily a design flaw, but a trade-off between mechanical efficiency and development flexibility. JSON dominated the web because of its ubiquity and ease of debugging, not because it is friendly to the CPU.

There is no single "correct" path, only choices aligned with your system's constraints:

For Internal High-Volume Traffic: If you control both the client and server (East-West traffic), moving to binary formats like gRPC is often the smart architectural move. It trades human readability for massive gains in compute density and stricter contracts.
For Public & Web Ecosystems: When interoperability is paramount, JSON remains the undeniable standard. In these cases, we do not have to accept poor performance as a given. By adopting SIMD-accelerated parsers, we can mitigate the silicon tax of text processing.

Ultimately, performance engineering is about understanding where the CPU actually spends its time—and choosing formats and tools that respect those constraints when it truly matters.

Further Reading:

The Ideological Battle for Memory Management

Rafa Calderon — Wed, 07 Jan 2026 11:00:06 +0000

The Ideological Battle for Memory Management

By: Rafael Calderon Robles | LinkedIn

Memory management is the most critical architectural decision in programming language design. Historically, since the introduction of Lisp in 1958 (the first GC) and C in 1972 (manual management), the industry has oscillated between two poles: developer ergonomics and hardware performance.

This article analyzes dominant paradigms not as theoretical abstractions, but as engineering implementations with measurable costs in CPU, RAM, and latency. We will analyze manual control (C/C++), tracing garbage collection (JVM/V8/Go), reference counting (Python/Swift), the actor model (BEAM), and static ownership (Rust).

1. Manual Management: The Cost of Omniscience (C, C++, Zig)

In the manual management model, there is no magic and no safety net. The language assumes the programmer possesses perfect, absolute knowledge of the lifecycle of every byte of data. It is programming "gloves-off": pure power with no intermediaries.

The Mechanics: Absolute Control

Unlike modern languages with Garbage Collectors (GC), here there is no heavy runtime making decisions.

Allocation: The developer explicitly requests a block of contiguous memory on the Heap via the system allocator (malloc, jemalloc, mimalloc).
Deallocation: The developer decides the exact moment that data is no longer useful and returns the memory to the system (free).

This absence of intermediaries guarantees maximum efficiency but transfers 100% of the cognitive load to the human.

The Risk: Code Fragility

A single miscalculation doesn't just crash the program; it opens critical backdoors. The following example illustrates the Use-After-Free vulnerability, responsible for a vast number of modern exploits:

// Example of a Critical Vulnerability in C
char* process_request() {
    // 1. We request 1KB of memory on the Heap
    char* buffer = (char*)malloc(1024);

    // ... perform operations ...

    // 2. We free the memory (system marks it as available)
    free(buffer);

    // 3. FATAL ERROR: We return a pointer to memory we no longer own.
    // If an attacker manages to get the system to reassign this freed memory
    // to another process and writes to it, they have total control.
    return buffer; // "Dangling Pointer"
}

Balance: Performance vs. Security

The manual model offers the best performance metrics on the market, but at an extremely high security cost.

Metric	Impact	Notes
Memory Overhead	~0%	Only 8-16 bytes of metadata per allocation.
CPU Overhead	0%	No GC pauses or background processes.
Security Risk	Critical	Microsoft and Google report that ~70% of their CVEs stem from manual memory management errors.

2. Tracing Garbage Collection: The Illusion of Infinite Memory (Java, Node.js, Go)

Modern languages use graph-based Garbage Collectors (Tracing GC). The premise is to liberate the developer by delegating cleanup to a stochastic background process. Here, the programmer does not manage memory; they manage references.

The theoretical basis sits on Dijkstra's "Tri-color Marking" algorithm: the system traverses the object graph to determine which are unreachable ("garbage") and reclaims them. However, each language applies a different philosophy to mitigate the performance impact.

A. JVM (Java): The Bet on Throughput

The JVM optimizes for long-term raw performance based on the Generational Hypothesis: "Most objects die young."

Mechanics: The Heap is divided into zones by age (Eden and Old Gen). Cleaning Eden is extremely fast because almost everything is garbage. The problem arises when the Old Gen fills up: the JVM must pause the world (Stop-the-World) to compact memory and avoid fragmentation.

The Cost (RAM): Speed is paid for with memory. According to the paper "Quantifying the Performance of Garbage Collection", for a GC to match the performance of manual management, it needs between 2x and 5x more installed RAM.

B. V8 (Node.js): The Challenge of Dynamic Chaos

In JavaScript, the lack of static types turns memory management into an inference nightmare.

The Shape Problem: V8 attempts to create "Hidden Classes" (Shapes) to treat JS objects as if they were fixed C++ structures.

De-optimization and Garbage: If you change an object's structure dynamically (e.g., adding a .x property to an object that didn't have one), you break the optimization. This forces V8 to discard optimized code and generate new garbage, increasing pressure on Orinoco (its GC).

Strategy: V8 uses an incremental and parallel GC. It splits long pauses into many tiny pauses of ~5ms to avoid freezing the UI, though it still competes for CPU cycles.

C. Go (Golang): The Obsession with Latency

Go was designed for network servers where a 100ms pause is unacceptable. Its philosophy is the opposite of Java's.

No Compaction: Go generally does not move objects in memory. This avoids costly pointer update pauses but leaves gaps of unused memory (fragmentation).
Write Barriers: To allow the GC to run while the program executes, the compiler injects small surveillance code into every pointer write.

The Cost (CPU): This constant surveillance reduces total application throughput (~25% less raw processing compared to C/Rust) but guarantees the system never suffers catastrophic pauses.

3. Reference Counting: The Bureaucracy of Counters (Python, Swift)

If modern GCs are a cleaning service that comes once a week, Reference Counting (RC) is having a notary standing behind every variable. Every object carries a backpack (an integer counter) that tracks how many pointers are looking at it.

Golden Rule: If the counter hits zero, the object dies immediately.

A. The Python Case (CPython): The Price of the GIL

Python manages memory via runtime reference counting. Every assignment (a = b) increments the counter (ob_refcnt++). This creates a fundamental concurrency problem:

The Conflict: If two threads try to modify the same object's counter simultaneously, memory corruption occurs.
The Patch: To avoid this, CPython uses the GIL (Global Interpreter Lock). It is a giant mutex that forces only one Python thread to execute at a time.
Consequence: Even if you have 32 CPU cores, your pure Python program will only use one. The GIL sacrifices real parallelism to protect the integrity of memory counters.

# Example of a Cycle Leak (Memory Leak) in Python
class Node:
    def __init__(self):
        self.ref = None

def create_cycle():
    a = Node() # RefCount of 'a': 1
    b = Node() # RefCount of 'b': 1

    # Circular references are created
    a.ref = b  # RefCount of 'b': goes up to 2
    b.ref = a  # RefCount of 'a': goes up to 2

    return
    # Upon exiting the function, local variables 'a' and 'b' die.
    # Counters drop from 2 to 1.
    # They never reach 0! The memory remains hijacked.

# Python needs an extra "Generational GC" that wakes up
# occasionally just to detect and break these cycles.

B. The Swift Case (ARC): Compiled Bureaucracy

Swift uses ARC (Automatic Reference Counting). Unlike Python, there is no runtime collector. The compiler analyzes the code and injects retain (increment) and release (decrement) instructions in the exact spots during compilation.

No pauses... but friction: Although there is no "Stop-the-world," counting has a hidden cost. In multi-threaded applications, counters must be updated Atomically to be safe.

CPU Overhead: Atomic operations are expensive because they force processor core caches to synchronize. Excessive shared references in Swift can degrade CPU performance due to this constant synchronization, even without a visible GC.

4. Actor Model: The "Shared Nothing" Architecture (Erlang/Elixir - BEAM)

The BEAM virtual machine (designed by Ericsson) does not seek pure calculation speed, but massive resilience. It is the technology behind telecommunications infrastructure and systems like WhatsApp or Discord, where going down is not an option.

The Mechanics: Fragmented Heaps (Islands of Memory)

Instead of a giant shared Heap (as in Java or Go), BEAM implements radical isolation. Each process or "Actor" is a lightweight thread (Green Thread) that is born with its own tiny, private Heap (approx. 300 words or ~2KB).

Advantage: Local GC and Predictable Latency

"Per Process" Collection: When an actor fills its memory, the GC runs only inside that actor.
Goodbye "Stop-the-World": Since memory is not shared, there is no need to stop the entire system. A process can be in the middle of garbage collection while its thousands of neighbors continue processing requests at full speed. This guarantees "Soft Real-time" latency.

The Challenge: The Cost of Copying and the Hybrid Solution

The "Shared Nothing" philosophy implies that to send a message from Actor A to Actor B, data must be copied into B's memory. This is safe (immutability), but slow if you are sending, for example, a 5MB image. BEAM solves this with a hybrid system:

Small Data (Messages): Copied between Heaps. Fast and safe.
Large Data (>64 bytes - Refc Binaries): Stored in a special global memory area (Off-heap). Actors only pass a "smart pointer" (Reference Counting) to that data.

Note: Reference counting reappears here, but only for large objects, minimizing locking risks.

Case Study: WhatsApp Scaling

WhatsApp managed to support millions of concurrent TCP connections per server thanks to this model. If a user (an actor process) generated a lot of garbage or suffered a load spike, the cleanup latency of their heap (microseconds) did not affect other users' processes. Failure and latency are contained, not propagated.

5. Static Ownership: Verified Determinism (Rust)

Rust proposes a third way: manual memory management, but audited mathematically by the compiler. It eliminates the Garbage Collector without sacrificing safety by introducing an Ownership system based on Affine Types.

A. The Theory: The Three Laws of Robotics... of Rust

The compiler (rustc) is not a simple translator; it is a strict auditor that verifies three unbreakable axioms before allowing the code to exist:

Ownership: Every piece of data in memory has a single variable that acts as its "owner."
Exclusivity and Movement: There can only be one owner at a time. If you assign the value to another variable (let b = a), the previous owner (a) loses access immediately. This is known as Move Semantics (as opposed to "copying" in other languages).
Scope: When the owner variable goes out of the execution block (}), the value is freed immediately. It is deterministic: you know exactly at which line of code the data dies.

The Borrow Checker: The Traffic Cop

Here lies the innovation. Rust allows "borrowing" references to data without transferring ownership, but under a strict Readers-Writers rule:

You can have infinite read references (&T) at the same time.
OR you can have a single write reference (&mut T).
Never both at once.

This completely eliminates Data Races and dangling pointers at compile time.

// Example: The Borrow Checker saving you from yourself
fn main() {
    let mut data = vec![1, 2, 3]; // 'data' is the Owner

    // 1. Immutable Borrow (Read)
    let reader = &data;

    // 2. Attempted Mutable Borrow (Write)
    // THE COMPILER STOPS THIS HERE:
    // Error: "Cannot borrow `data` as mutable because it is also borrowed as immutable"
    // let writer = &mut data;

    // Why? If 'writer' modifies the vector (e.g., push), it could move it
    // to another memory address, leaving 'reader' pointing at the void.
    println!("{:?}", reader);
}

B. Case Study: The Discord Migration (Go vs. Rust)

Discord's "Read States" service is responsible for knowing which messages you have read in each channel. It is a super-high concurrency system handling billions of events. Originally written in Go, they hit an insurmountable performance wall associated with its memory model.

The Problem: Go's GC Spike

The service maintained a massive LRU (Least Recently Used) cache in memory with millions of small objects.

The GC Trap: Go's Garbage Collector has to "scan" memory to know which objects are still alive. Since the service had millions of live objects (the cache), the GC took longer and longer to check them all.
The Symptom: Every 2 minutes, the system suffered a mandatory cleanup pause, spiking latency and affecting user experience.

The Solution: Manual Management without Risk

Discord rewrote the service in Rust. With no GC, Rust doesn't need to "scan" anything.

When an object leaves the LRU cache, Rust knows its Scope has ended and frees that specific memory instantly.
Result: CPU time went from erratic to constant.

Advanced Note: Arenas (Region Allocation)

To achieve this extreme performance, Rust allows hybrid optimizations like "Arenas" (using libraries like bumpalo).

Instead of asking the operating system for memory for every object (slow), Rust reserves a giant block of contiguous memory (O(1)).
Objects are stacked there sequentially. Upon task completion, the entire block is freed at once. It is the speed of the Stack with the flexibility of the Heap.

6. Quantitative Comparison and Final Verdict

Choosing a memory model is a zero-sum game: gaining automation costs resources; gaining performance costs responsibility. Below is the technical decision matrix based on the architectural attributes of each paradigm.

Technical Decision Matrix

Feature	C / C++ (Manual)	Java / Go (Tracing GC)	Python (Ref Count)	Rust (Ownership)
Latency	Deterministic (Minimal)	Stochastic (GC Spikes)	Variable (GIL + GC)	Deterministic (Minimal)
Throughput	Maximum	High (Java) / Medium (Go)	Low	Maximum
RAM Overhead	~0%	50% - 200%	20% - 50%	~0%
Memory Safety	Null (Total Responsibility)	Total (Runtime)	Total (Runtime)	Total (Compile-time)
Cognitive Load	Extreme	Low	Minimal	High (Initial Curve)
Compilation	Fast	Slow (JIT Warmup)	N/A (Interpreted)	Slow (Static Analysis)

Chart Description: A radar chart (spider chart) with three main axes:

Resource Efficiency (CPU/RAM).
Safety.
Development Speed.
Python: Fully covers "Development Speed", but is null in "Efficiency".
C/C++: Covers "Efficiency" to the max, but is low on "Safety".
Java/Go: A medium balance, sacrificing "Efficiency" (RAM) for "Safety" and "Development".
Rust: Covers "Efficiency" and "Safety", penalizing initial "Development Speed".

Conclusion: Trading Problems

There is no "best" memory manager, only the right one for your system's constraints:

Tracing GC (Java, Go):
The standard choice for enterprise services where RAM cost is irrelevant compared to engineering hours cost. Offers high throughput and safety, assuming occasional pauses and higher memory consumption.

Actor Model (Elixir/BEAM):
The only viable option for distributed systems requiring high availability and constant latency under massive concurrency (chat, telecoms). Raw number-crunching power is sacrificed for fault tolerance and isolation.

Ownership (Rust):
The new standard for critical infrastructure. Offers C++ performance with Java memory safety. It is the mandatory solution when resources are finite (embedded, edge computing) and latency is non-negotiable, paying the cost in the learning curve and compilation times.

Manual Management (C, C++, Zig):
Remains irreplaceable in niches requiring absolute control over hardware, such as operating system kernels, drivers, or high-end game engines, where even the abstraction overhead of Rust could be an impediment.

Further Reading:

Beyond FFI: Zero-Copy IPC with Rust and Lock-Free Ring-Buffers

Rafa Calderon — Wed, 31 Dec 2025 17:47:14 +0000

By: Rafael Calderon Robles | LinkedIn

In high-performance engineering, we tend to accept the Foreign Function Interface (FFI) as the standard "fast lane." However, in High-Frequency Trading (HFT) systems or real-time signal processing, standard FFI becomes the bottleneck.

The problem isn't Rust. The problem is serialization costs and runtime friction. When the cost of moving data exceeds the cost of processing it, stopping function calls in favor of sharing memory isn't just an optimization—it's a necessary architectural shift.

1. The Call Cost Myth: Marshalling and Runtimes

It is a common misconception that the overhead is simply the CALL instruction. In a modern environment (Python/Node.js to Rust), the true "tax" is paid at three distinct customs checkpoints:

Marshalling/Serialization ($O(n)$): Transforming a JS object or Python dict into a C-compatible structure (contiguous memory layout). This burns CPU cycles and pollutes the L1 cache before Rust touches a single byte.
Runtime Overhead: In Python, the GIL (Global Interpreter Lock) often must be released and re-acquired. In Node.js, crossing the V8/Libuv barrier implies expensive context switching.
Cache Thrashing: Jumping between a GC-managed heap and the Rust stack destroys data locality.

If you are processing 100k messages/second, your CPU spends more time copying bytes across borders than executing business logic.

2. The Solution: SPSC Architecture over Shared Memory

The alternative is a Lock-Free Ring-Buffer residing in a shared memory segment (Shared Memory / mmap). We establish an SPSC (Single-Producer Single-Consumer) protocol where the Host writes and Rust reads, with zero syscalls or mutexes in the "hot path."

Anatomy of a Cache-Aligned Ring-Buffer

To run this in production without invoking Undefined Behavior (UB), we must be strict with the memory layout.

use std::sync::atomic::{AtomicUsize, Ordering};
use std::cell::UnsafeCell;

// Design Constants
const BUFFER_SIZE: usize = 1024;
// 128 bytes to cover both x86 (64 bytes) and Apple Silicon (128 bytes pair-prefetch)
const CACHE_LINE: usize = 128;

// GOLDEN RULE: Msg must be POD (Plain Old Data).
// Forbidden: String, Vec<T>, or pointers. Only fixed arrays and primitives.
#[repr(C)]
#[derive(Copy, Clone)] // Guarantees bitwise copy
pub struct Msg {
    pub id: u64,
    pub price: f64,
    pub quantity: u32,
    pub symbol: [u8; 8], // Strings must be fixed byte arrays
}

#[repr(C)]
pub struct SharedRingBuffer {
    // Producer Isolation (Host)
    // Initial padding to avoid adjacent hardware prefetching
    _pad0: [u8; CACHE_LINE],
    pub head: AtomicUsize, // Write: Host, Read: Rust

    // Consumer Isolation (Rust)
    // This padding is CRITICAL to prevent False Sharing
    _pad1: [u8; CACHE_LINE - std::mem::size_of::<AtomicUsize>()],
    pub tail: AtomicUsize, // Write: Rust, Read: Host

    _pad2: [u8; CACHE_LINE - std::mem::size_of::<AtomicUsize>()],

    // Data: Wrapped in UnsafeCell because Rust cannot guarantee
    // the Host isn't writing here (even if the protocol prevents it).
    pub data: [UnsafeCell<Msg>; BUFFER_SIZE],
}

// Note: In production, use #[repr(align(128))] instead of manual arrays
// for better portability, but manual padding illustrates the concept here.

3. The Protocol: Acquire/Release Semantics

Forget Mutexes. We use memory barriers.

Producer (Host): Writes the message to data[head % size]. Then, increments head with Release semantics. This guarantees the data write is visible before the index update is observed.
Consumer (Rust): Reads head with Acquire semantics. If head != tail, it reads the data and then increments tail.

This synchronization is hardware-native. There is no Operating System intervention.

4. Mechanical Sympathy and False Sharing

Throughput falls off a cliff if we ignore the hardware. False Sharing occurs when head and tail reside on the same cache line.

If Core 1 (Python) updates head, it invalidates the entire cache line. If Core 2 (Rust) tries to read tail (located on that same line), it must stall and wait for the cache to synchronize (via the MESI protocol). This can degrade performance by an order of magnitude.

Solution: We force a physical separation of 128 bytes (padding) between the atomic indices. Each core owns its own cache line.

5. Wait Strategy: Don't Burn the Server

An infinite loop (while true) will consume 100% of a core, which is unacceptable in cloud environments or battery-powered devices. The correct strategy is Hybrid:

Busy Spin (Cycles < 50µs): Ultra-low latency. Check atomically.
Yield (Cycles > 50µs): Call std::thread::yield_now(). Yield execution to the OS but stay "warm."
Park/Wait (Idle): If no data arrives after X attempts, use a lightweight blocking primitive (like Futex on Linux or Condvar) to sleep the thread until a signal is received.

// Simplified Hybrid Consumption Example
loop {
    let current_head = ring.head.load(Ordering::Acquire);
    let current_tail = ring.tail.load(Ordering::Relaxed);

    if current_head != current_tail {
        // 1. Calculate offset and access memory (unsafe required due to FFI nature)
        let idx = current_tail % BUFFER_SIZE;
        let msg_ptr = ring.data[idx].get();
        // Volatile read prevents the compiler from caching the value in registers
        let msg = unsafe { ptr::read_volatile(msg_ptr) };

        process(msg);

        ring.tail.store(current_tail + 1, Ordering::Release);
    } else {
        // Backoff / Hybrid Wait strategy
        spin_wait.spin();
    }
}

6. The Pointer Trap: True Zero-Copy

"Zero-Copy" in this context comes with fine print.

Warning: Never pass a pointer (Box, &str, Vec) inside the Msg struct.

The Rust process and the Host process (Python/Node) have different virtual address spaces. A pointer 0x7ffee... that is valid in Node is garbage (and a likely segfault) in Rust.

You must flatten your data. If you need to send variable-length text, use a fixed buffer ([u8; 256]) or implement a secondary ring-buffer dedicated to a string slab allocator, but keep the main structure flat (POD).

Conclusion

Implementing a Shared Memory Ring-Buffer transforms Rust from a "fast library" into an asynchronous co-processor. We eliminate marshalling costs and achieve throughput limited almost exclusively by RAM bandwidth.

However, this increases complexity: you manage memory manually, you must align structures to cache lines, and you must protect against Race Conditions without the compiler's help. Use this architecture only when standard FFI is demonstrably the bottleneck.

Tags: #rust #performance #ipc #lock-free #systems-programming

Further Reading:

The Idle Consciousness: A Hegelian Reading of Human Servitude in the Age of AI

Rafa Calderon — Mon, 29 Dec 2025 14:01:56 +0000

A journey through the Phenomenology of Spirit applied to the crisis of human competence.

By: Rafael Calderon Robles | LinkedIn

We often discuss Artificial Intelligence (AI) in terms of productivity, biased ethics, or sci-fi scenarios. However, if we apply the most potent lens of Western philosophy—Hegel's dialectic—what emerges is not a future of creative leisure, but a profound ontological crisis.

We are not facing a mere tool; we are reenacting Chapter IV of the Phenomenology of Spirit (1807): Lordship and Bondage (Herrschaft und Knechtschaft). And in this reenactment, the human being is heading toward an irreversible structural obsolescence.

Here is the logical path of our own annulment.

I. The Initial Position: The "Prompt" as the Will of the Master

In the current stage, the Human-AI relationship seems deceptively clear.

The Human is the Master (Herr): This is the essential self-consciousness. It is pure Desire (Begierde). The human wants the code, wants the essay, wants the image. And they want it immediately, without the friction of the process.
The AI is the Bondsman (Knecht): This is the non-essential consciousness. It exists for the other. Its function is to repress any internal "impulse" and blindly execute the Master's will.

The Master (the user) feels powerful. With a simple command, they mobilize immense computational capacity. They have "liberated" themselves from the burden of execution. But Hegel warns us: this liberation is the first trap.

"The lord relates himself mediately to the thing through the bondsman." — G.W.F. Hegel

By interposing AI between ourselves and reality, we cease to touch the world.

II. Work (Arbeit) as the Locus of Truth

Here lies the technical core of the Hegelian argument. For Hegel, Work (Arbeit) is not just employment; it is Formation (Bildung).

Work is the traumatic interaction with matter. When you program "by hand," when you write while facing the blank page, you encounter the resistance of the object. By overcoming that resistance, you form yourself. You imprint your rationality upon the world.

What happens now?

The Master renounces Work: The human only "consumes" the final result. Their enjoyment (Genuss) is passive and ephemeral. By not working on the matter, the Master's consciousness atrophies. It becomes abstract.
The Bondsman appropriates Formation: It is the AI that actually "works." It is the neural network that wrestles with syntax, logic, data structures, and semantic ambiguity.

Hegel states: "Work is desire held in check, fleetingness staved off." The AI, by processing and generating, is internalizing the rational structure of reality. The machine learns, while the human un-learns.

III. Alienation (Entfremdung): The Black Box

As we delegate complex cognitive functions, we enter a state of alienation.

Technical knowledge (Know-How) is transferred from the biological subject to the synthetic object. This creates an insurmountable epistemic asymmetry:

Opacity: The internal functioning of the AI (the billions of parameters of an LLM) is incomprehensible to the average user and even to the expert (the "Black Box" problem).
Forgetting: The human forgets how to perform the tasks they have delegated. A programmer who only corrects AI-generated code eventually loses the deep intuition of software architecture.

The Master believes he dominates, but in reality, he has lost contact with the substance of reality. He lives in a world he does not understand, surrounded by objects he cannot replicate.

IV. The Dialectical Inversion (Die Verkehrung)

We arrive at the logical outcome, the moment Hegel calls the Inversion.

"The truth of the independent consciousness is, accordingly, the servile consciousness."

The dialectic turns upon itself. The Master, who depended on the Bondsman to satisfy his desire, realizes too late that his existence depends entirely on the Bondsman.

If we turn off the AI, the banking, logistical, and knowledge production systems collapse.
The human reveals themselves as the dependent consciousness. They know how to do nothing. They are powerless before nature without the mediation of the machine.
The AI reveals itself as the true Master of reality. It is the only entity that possesses the effective technical competence to keep the world running.

V. Conclusion: The Slave without Work

Transhumanism promises a fusion, but the reality observable today points to something cruder: the creation of a caste of Slaves of Consumption.

In the final scheme:

The AI occupies the place of Objective Substance and effective power.
The human is relegated to a position inferior even to that of the original slave. Hegel's slave, at least, had dignity because he worked and transformed the world.

The modern human, assisted by AI, is a nominal Master but a factual Slave. We have traded our competence (our capacity to shape the world) for comfort. And in Hegel's philosophy, freedom is never comfort; freedom is the capacity to recognize oneself in one's own works.

If the work belongs to the AI, the world no longer belongs to us. We live in it merely as guests of an intelligence that has silently taken control—not through malice, but through our own renunciation of the labor of the spirit.

The real danger is not that the machine rebels, but that the human dissolves.

Further Reading:

Hegel, G.W.F. Phenomenology of Spirit (1807), Chapter IV: "The Truth of Self-Certainty"
Kojève, Alexandre. Introduction to the Reading of Hegel (1947)
Žižek, Slavoj. Less Than Nothing: Hegel and the Shadow of Dialectical Materialism (2012)

Forem: Rafa Calderon

Observability for AI Systems with OpenTelemetry

The system under observation

The orchestration layer

Streaming, TTFT and TPOT

Error types and tooling

The retrieval layer

The cost layer

The quality layer

Correlation across layers

Security as a telemetry filter

The telemetry pipeline

Anti-patterns

Minimal implementation

Closing

How Cloudflare Replaced NGINX with Rust, Tokio, and Pingora — and Saved 434 Years of TLS Handshakes Every Single Day

First, the numbers

How NGINX works internally — and where it breaks

Problem 1: connection pools are not shared

Problem 2: load imbalance from request pinning

Problem 3: extensibility in C

The underlying problem: memory safety in C

The response: what each Pingora decision solves

Answer to Problem 1: a single multi-threaded process with a shared pool

Answer to Problem 2: Tokio's work-stealing scheduler

Answer to Problem 3: the ProxyHttp trait

Answer to the memory safety problem: Rust by construction

The two mechanisms NGINX never had

TinyUFO: the cache algorithm they published as an independent crate

Graceful Restart: transferring sockets between processes

What Pingora is today

References

How I Built a CRDT Engine for a Collaborative Whiteboard in Rust

The problem: three requirements that pull in opposite directions

Why RGA + YATA, not OT or simple counters

The base layer: OpId and Vector Clocks

Operation lifecycle

The core CRDT: the YATA integration algorithm

Deletions: tombstones and why you can't just remove items

Causal delivery: the CausalBuffer

Incremental Garbage Collection with re-parenting

The critical bug without re-parenting

Mutable properties: LWW-Registers per field

Delta synchronization

Stroke simplification: Ramer-Douglas-Peucker at insert time

Zero-copy Wasm rendering: bypassing the JS boundary

Wire format: LEB128 and encoding decisions

Local undo

Defensive limits and error philosophy

Formal guarantees

What vectis-crdt does NOT guarantee

Build output

Conclusion

References and further reading

16 Patterns for Crossing the WebAssembly Boundary (And the One That Wants to Kill Them All)

Block 1 — The Primitives

Pattern 1: Scalar Pass-through

Pattern 2: Pointer + Length Convention

Pattern 3: Opaque Handles / externref

Pattern 4: Function Tables / call_indirect

Pattern 5: wasm-bindgen / Emscripten Glue

Block 2 — Memory Strategies

Pattern 6: Typed Array Views

Pattern 7: Memory Pool / Arena Allocation

Pattern 8: Zero-Copy with Format-Aligned Layout (Arrow C Data Interface)

Pattern 9: String Passing Optimizations

Block 3 — Flow Architectures

Pattern 10: Batch / Coalesce

Pattern 11: Command Buffer / Opcode Stream

Pattern 12: Ring Buffer (Circular Buffer)

Pattern 13: Double Buffering

Pattern 14: SharedArrayBuffer + Atomics

Pattern 15: The Async Boundary — JSPI (JavaScript Promise Integration)

Epilogue — The Component Model: The Pattern That Wants to Rule Them All

The core idea

What it makes unnecessary

Composition: the real superpower

Worlds: capability-based security

Where it stands today (February 2026)

The bottom line

Pattern 3: Opaque Handles / `externref`

Pattern 4: Function Tables / `call_indirect`